Performance Decline on "Madison" Server (Reported)
  • Priority - High
  • Affecting Server - Madison
  • As you may have noticed, the server "Madison" is performing better than it did over the past 2 days, but still not as well as before the recently applied system updates.

    The performance decline is caused by multiple factors:

    • The recent updates include patches for the Spectre/Meltdown vulnerabilities. These patches are reported to cause a decline in performance ("New Spectre variant 4: Our patches cause up to 8% performance hit, warns Intel"). Intel cites a performance hit of up to 8%, but we've seen various reports of declines of up to 15%.
    • Over the past 2 years, we've been quite permissive with accounts that overuse their assigned resources. We even silently increased resources - at some point up to 200% more than advertised on our website - because the server could handle the combined traffic and usage of all websites just fine. Unfortunately, this tolerance is now proving problematic: with the recent CPU performance hit, the traffic- and resource-intensive accounts have a much more significant impact.
    • The MySQL/MariaDB server is the largest consumer of CPU and memory. While the server has plenty of memory, the reduced CPU performance makes MySQL/MariaDB slower, which in turn makes the websites load slower. We have applied a few optimizations, but they don't work wonders, and the SQL queries from websites still put a high load on the server.
    • The server reboot requires a full block scan on the next R1Soft Server Backup replication. This process is very resource-intensive and took over 9 hours to complete, from Sunday night until noon. The CDP backup agent is now getting back in sync and should use fewer resources once it's done. We expect this to be finished by Monday afternoon.

    What we're doing to resolve this issue permanently:

    We've already optimized everything as much as possible on the Madison server, so technically there's nothing more that can be done on it at the moment. Therefore, we've decided to deploy a new shared hosting server with dual Intel Xeon Gold 5118 CPUs that should handle the load better.

    Once the new server is ready (within 1 or 2 days), we will transfer the 20 most resource-intensive accounts to it, to ease the load on the Madison server and balance it across both machines. This step alone should largely resolve the issue.

    Furthermore, we will deploy MySQL Governor on both servers, which will prevent abusive accounts from overloading the database server. The MySQL processes will run inside each account's LVE container, so if a website overuses MySQL, that account will be the only one affected by its heavy SQL usage.
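    For illustration, MySQL Governor limits are managed with CloudLinux's `dbctl` tool. The commands below are only a sketch: the account name and limit value are made up, and exact flags may differ between Governor versions.

```
# List current per-account MySQL limits (requires MySQL Governor installed)
dbctl list

# Example only: cap the CPU share MySQL may use for a hypothetical account
dbctl set exampleuser --cpu=50
```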

    Another step will be the switch from PHP 5.6 to 7.1 as the system PHP version. As PHP 5.6 reaches its end of life in December 2018, the switch to PHP 7.1 is inevitable anyway. PHP 7 is known to be more resource-friendly, so it should improve performance a bit more. We will announce the switch to PHP 7.1 soon; it will probably take place on the 1st of November, 2018.

    Once we complete these steps, we believe the websites will perform at least as well as, and possibly better than, before. The 20 most resource-intensive accounts will be moved from the Madison server within 1 to 3 days, and the remaining measures will be applied within the next 2 or 3 weeks. We'll post updates here regarding our progress.

    If your account is among the 20 most resource-intensive accounts, we will inform you about the transfer to the new server at least 1 hour in advance. No action should be required from your side, other than updating your domain's nameservers or IPs. We will shorten DNS propagation by reducing the TTL (Time to Live) to 5 minutes and by updating the DNS records on the Madison server as well.
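    As a sketch of what the TTL reduction looks like: in a BIND-style zone file, the TTL is the second field of a record, in seconds. The name and IP below are placeholders, not our actual records.

```
; Lowered TTL (300 s = 5 minutes): resolvers re-query within minutes,
; so the post-transfer IP change propagates quickly.
example.com.    300    IN    A    203.0.113.10
```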

    We're sincerely sorry for the inconvenience caused and kindly ask for your patience while we complete these steps to resolve the performance issues permanently. Thank you!

    Update 14.10.18 23:22 CEST: All MySQL/MariaDB databases of all accounts are being optimized at the moment. This is a time- and resource-intensive process, which should be completed within 8 hours. Databases that are fragmented should perform a bit better after the optimization.
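    For reference, this kind of optimization is typically performed with the stock `mysqlcheck` client, roughly as below. This is a sketch only: it requires database credentials and a running server, and it locks each table while optimizing it.

```
# Defragment/optimize every table in every database
mysqlcheck --optimize --all-databases -u root -p
```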

    Update 15.10.18 00:39 CEST: The database optimizations have been successfully completed. We see a very positive impact on performance at the moment, but it's too early to say whether this is a temporary effect or permanent. We'll keep monitoring the Madison server and still work on setting up the new server. Updates will follow.

    Update 15.10.18 00:49 CEST: The R1Soft backup replication will start in 10 minutes and might cause temporary high load. This time it's an incremental replication and should only take about 4 hours.

    Update 15.10.18 00:51 CEST: Recommendation: WordPress websites can be optimized further by installing and enabling the LiteSpeed Cache for WordPress plugin. We've granted free access to LSCache to all accounts recently and it can boost the performance of your website significantly. More information about it here: https://www.litespeedtech.com/products/cache-plugins/wordpress-acceleration

    Update 15.10.18 10:23 CEST: Unfortunately, the performance issues are continuing now that traffic has started to increase for many websites. We are still working on the solutions above.

    Update 15.10.18 12:05 CEST: The issue might be related to CloudLinux and the network drivers/configuration, as uploads to the server are very fast while downloads are very slow. The same happens on another CloudLinux server, but not on servers running CentOS, so it might be the CloudLinux software or a related component causing the issue. We're looking into this now and will contact CloudLinux for assistance.

    Update 15.10.18 12:50 CEST: The CloudLinux system admins are on the server now and have started their investigation.

    Update 15.10.18 15:18 CEST: CloudLinux insist that the issue isn't caused by their software. Our system admins are continuing to investigate.

    Update 15.10.18 16:52 CEST: To rule out the web server (LiteSpeed Web Server) as a potential cause, we are planning to re-install it using the default settings. Short website outages of a few minutes might occur during this process.

    Update 15.10.18 17:24 CEST: Before going ahead with the LSWS reinstallation, we have been monitoring the server statistics and still see that System (kernel) CPU usage is high compared to the other categories, including user CPU usage. We have updated CloudLinux support once again with this result and are waiting for their response.
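    The user-versus-kernel CPU split we keep referring to can be checked on any Linux machine. A minimal sketch, reading the aggregate counters from /proc/stat (values are jiffies accumulated since boot; a system share rivaling the user share usually points at kernel-level work such as I/O, networking or patching):

```shell
# Read the aggregate "cpu" line: user, nice, system, idle are the first fields
read -r _ user nice system idle _ < /proc/stat
busy=$((user + nice + system))
echo "user=${user} nice=${nice} system=${system} idle=${idle}"
echo "kernel share of busy time: $((100 * system / busy))%"
```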

    Update 15.10.18 18:20 CEST: The CloudLinux ticket has been escalated to Level 3 support (the highest, most experienced level). The average response time of this department is 1 or 2 business days. Since we can't leave things this way for so long, we will proceed with the plan above: re-installing LSWS, setting up a new server and moving some accounts to it. This should be completed tonight.

    We kindly ask for more patience. The issue must and will be resolved.

    Update 15.10.18 22:17 CEST: The setup of the new server has been completed. We've moved 3 resource-intensive accounts there already and the load has improved a bit on the Madison server.

    LSWS has been re-installed, but this had no impact on performance. We've opened a ticket at LiteSpeed Tech so they can have a look and make sure that nothing is wrong with the web server.

    We will continue to arrange account transfers to the new server to ease off the load on the Madison server, but this might take 1 or 2 days. In the meantime, we hope that the CloudLinux and LSWS technicians will be able to find something.

    Update 16.10.18 01:00 CEST: The main cause of this issue seems to have been found. We are not entirely sure yet, as we'll need to monitor the server tomorrow during traffic peak hours, but so far the server load has dropped and many websites open up smoothly.

    We will update this page again tomorrow and post more information. Thank you for your patience.

    Update 16.10.18 09:14 CEST: Starting this morning, the performance has dropped again. It appears that during the day, there must be one or more websites that are causing high load. We are trying to identify them and will suspend them once found.

    Update 16.10.18 09:52 CEST: We have found 3 websites that were targeted by WordPress-specific Distributed Denial of Service (DDoS) attacks. These attacks are often only successful if the respective websites are vulnerable and out of date. This is one of the most important reasons why we strongly advise always keeping your websites up to date, without exceptions. Unfortunately, many clients seem to neglect this, leaving their websites vulnerable to such attacks.

    We will keep monitoring the server and mitigate all attacks as much as possible.

    Update 16.10.18 12:25 CEST: Investigations and optimization works are still ongoing. Performance is still affected. We're trying our best to mitigate the issue.

    Update 16.10.18 12:42 CEST: We have suspended an account that was under attack. The load is back to normal. We'll keep monitoring the server to ensure that performance remains unaffected.

    Update 16.10.18 16:02 CEST: The latest fixes from the LiteSpeed technicians have been implemented, with very positive results. The server has been fast and stable again for almost 2 hours. We'll keep monitoring it closely, but we believe the issue is now permanently resolved.

    Update 16.10.18 20:13 CEST: CloudLinux have finally analyzed the issue and suggested installing a beta kernel that addresses this specific issue. The kernel has been installed and the server has already been rebooted. We're going to monitor the server and see how it performs with this kernel. We have been advised that the kernel may still have a few minor bugs, but we assume it can't get any worse than this.

    We will test the kernel for 1 or 2 days and report our results to CloudLinux. If the kernel fixes this issue and runs stable, we will continue to run it. Otherwise, we will wait for the stable kernel release from CloudLinux.

    Update 16.10.18 20:25 CEST: The kernel is causing too many issues already. We are reverting and will reboot the server again.

    Update 16.10.18 20:47 CEST: We have re-installed and booted into the kernel we used between March and October 2018. The server seems to run stable so far and the issues we saw with the testing kernel no longer occur.

    We will stay with this kernel if it runs stable and performance is good. Although it has some security vulnerabilities, the most critical ones are patched by KernelCare and the others are complicated to exploit, so we should be safe.

    We'll keep monitoring the server closely and if nothing unusual happens, updates will follow tomorrow.

    Update 17.10.18 10:35 CEST: The server load was perfectly fine from yesterday evening until this morning, but has increased again starting at 9 AM. We will try to arrange further transfers of high-resource-usage accounts to our new server.

    Update 18.10.18 12:12 CEST: We are constantly trying to mitigate the performance issue while waiting for CloudLinux to provide a permanent solution. We're sorry that this issue takes so long to resolve, but it is dependent on CloudLinux, who are very slow, unfortunately.

    Update 18.10.18 19:57 CEST: The server is being rebooted to enable kernel dumping, in order to provide CloudLinux more detailed information about the system. Services will be unavailable for 5 minutes.

    Update 18.10.18 20:11 CEST: We've rebooted the server again and generated a core dump, which CloudLinux will investigate. They should now have all necessary information to work on a solution. We hope to have it by the end of the week, but this depends entirely on how fast the CloudLinux team is.

    Update 19.10.18 08:57 CEST: The server has been running surprisingly fast and stable since Thursday evening, after CloudLinux entered the server to investigate and asked for a reboot to generate the kernel dump. As far as we're aware, they have not made any other changes. We'll keep monitoring the server closely and try to mitigate any possible overloads until we receive a decisive answer/solution from CloudLinux.

    Update 22.10.18 21:27 CEST: Today we have transferred the most resource-intensive account from the server "Madison" to the new server "Antero". Except for four short periods of server overload, performance was smooth the entire day. We will keep monitoring the server closely and, if necessary, transfer two or three more accounts whose resource usage is slightly higher than the rest.

    The root of the issue is still pending investigation. We will try to postpone the server reboot and kernel dumps until Wednesday next week, as there were too many outages recently.

    Unless something unusual happens in the meantime, we will update this status item again next week. Any possible performance issues will be mitigated promptly until then.

    Update 30.10.18 12:52 CET: We are going to update to a new kernel that was released this week, as an attempt to address the performance/stability issues. The server will go down for reboot within 5-10 minutes.

    Update 30.10.18 13:11 CET: The kernel has been installed and the server is now being rebooted.

    Update 30.10.18 13:47 CET: The server is up, running the newest kernel. Most websites open up smoothly and the average load has dropped at this moment, but the load is still a bit higher than usual. We cannot determine yet if it will continue to remain stable or increase again.

    We will contact CloudLinux again to resume their investigations and arrange further transfers of accounts that have the highest CPU usage. This should ease off the server even more. Updates will follow once we hear back from CloudLinux.

    Update 01.11.18 12:33 CET: CloudLinux have suggested installing a recently released "beta" kernel, which fixes various issues and might fix our performance issue as well. However, we prefer not to install "beta" kernels on our production systems, because they are usually unstable.

    The server is running smoothly since our last update, except for a few short overloads that we were able to mitigate promptly. Unless the server becomes heavily overloaded again for longer periods, we will wait for the upcoming "stable" kernel to be released by CloudLinux and update to it shortly after release. It often takes 1 or 2 weeks for "beta" kernels to be released to the "stable" channel.

    We also plan to switch to PHP 7.1 (as the default version) and enable LiteSpeed Web Cache for all WordPress websites that have no caching plugin installed. This should stabilize the server even more and speed up most websites. An announcement with the exact schedule will be emailed to all clients soon. If the performance bug gets fixed as well, the server should become even faster than before.

    Update 05.11.18 22:15 CET: A new kernel for CloudLinux 7 has been released, which will hopefully improve performance. We are currently installing it and will reboot the server right after.

    Update 05.11.18 22:28 CET: The server is going down for reboot now.

    Update 05.11.18 22:33 CET: The server has been successfully rebooted with the new kernel. We will monitor the server closely for at least the next 3 days and see if the new kernel brings positive results to the overall performance. Should the performance issues still persist, we will contact CloudLinux again and continue to mitigate the performance issues in the meantime.

    Imunify360 version 3.7 is also expected to be released on Thursday this week. We've been told that this version should use a bit less resources, which would benefit the server's performance.

    Update 06.11.18 07:50 CET: We have detected that the system does not log anything to the "messages" log, which can be quite a serious problem. We will reboot the server again in about 2 hours as an attempt to fix this. If the system log still doesn't work, we'll have to return to the old kernel.

    Update 06.11.18 09:24 CET: The server will now be rebooted.

    Update 06.11.18 09:32 CET: The server has been successfully rebooted and the system log works properly now. It seems to have been just a glitch. We'll continue to monitor the server, as mentioned in the update from 05.11.18 22:33 CET.

    Update 06.11.18 11:47 CET: The server is currently extremely overloaded. We are trying to mitigate this issue now.

    Update 06.11.18 11:51 CET: The server is now being rebooted.

    Update 06.11.18 11:58 CET: Another reboot is required, as the IP addresses haven't loaded.

    Update 06.11.18 12:07 CET: All IPs have loaded and all services/websites are back online. We're investigating the reason for the earlier overload.

    Update 06.11.18 12:40 CET: The server is running stable since the last reboot. We could not determine the exact cause, as the system was almost frozen earlier, but suspect that Imunify360 might have caused or contributed to the extreme overload.

    Currently, there are multiple issues that cause sporadic overloads:

    • Imunify360 often has medium to high CPU usage. On Thursday we'll update to the upcoming stable version, which is expected to address this issue. If the server becomes extremely overloaded again, before or after the update, we'll temporarily uninstall Imunify360.
    • The Dovecot (POP3/IMAP) service often restarts, causing high CPU usage for short periods of 1-2 minutes. This issue has been reported to cPanel and we're awaiting a solution.
    • Many websites/accounts use misconfigured and/or outdated scripts (e.g. WordPress) that are often loaded with tens of plugins that cause high resource usage. A few simultaneous visitors to such websites are often enough to cause the entire server to slow down; especially when multiple websites such as these get traffic at the same time.
    • WordPress ships a script called "admin-ajax.php" that is widely known to consume excessive CPU in the background, typically when too many plugins are installed. A few users logged in to the WP admin area of such websites simultaneously can be enough to cause a temporary overload.

    If you use WordPress, you can help us prevent such issues and reduce the load by updating WordPress and all plugins/themes, removing unnecessary plugins and installing the LSCache plugin. This also applies to other scripts such as Joomla, Drupal, etc. We highly recommend keeping them up to date and using caching.
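    If you have shell access, the updates above can be done in one go with WP-CLI. This is a sketch: run it from your WordPress directory, and note that `litespeed-cache` is the plugin's slug in the wordpress.org repository.

```
# Update WordPress core, plugins and themes, then add LiteSpeed Cache
wp core update
wp plugin update --all
wp theme update --all
wp plugin install litespeed-cache --activate
```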

    Soon we will upgrade to PHP 7.1 and install LSCache for all WordPress sites that have no caching enabled. This should help some websites and the server overall with the performance.

    We're still monitoring the server closely and will try to prevent/mitigate possible overloads.

    Update 06.11.18 13:05 CET: Imunify360 has caused another slight overload and short outage. We're currently uninstalling it. Some services might fail for a few minutes during the uninstall process.

    Update 06.11.18 13:27 CET: Imunify360 is still uninstalling. The web server fails to start until the uninstall completes, unfortunately.

    Update 06.11.18 13:39 CET: The server was unresponsive and is now being rebooted.

    Update 06.11.18 13:54 CET: We have booted into the previous kernel and are now attempting to uninstall Imunify360 again.

    Update 06.11.18 13:59 CET: Imunify360 has been uninstalled. All services are running. We expect no further outages in the short term, as the server is running with the previous kernel now.

    These issues will be reported to CloudLinux and are pending further investigation.

    Update 06.11.18 14:48 CET: The server is running smoothly since the last reboot. Considering that the previous kernel is running now and Imunify360 was uninstalled, the server should continue to run stable from here on. We'll keep monitoring it, of course.

    Update 06.11.18 16:10 CET: The server has become extremely overloaded again and is now rebooting.

    Update 06.11.18 17:22 CET: The server is running stable now, with the load lower than before. The reason for the overload and crash this time was that KernelCare was still applying live patches to the old kernel - so effectively, we were running the old kernel patched with the code of the new kernel.

    KernelCare has been unloaded and uninstalled now. No further kernel patches are applied and we will refrain from updating the kernel until CloudLinux have a reliable solution.

    Another issue was that the Majestic and Semrush bots were crawling multiple websites at very high rates, adding further load to the server. We've added ModSecurity rules to block these bots, as they're known to crawl websites too aggressively.
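    A blocking rule of this kind looks roughly like the following ModSecurity (v2) fragment. The rule id is arbitrary, and the User-Agent patterns are our assumption of how these bots identify themselves:

```
# Deny requests whose User-Agent matches the aggressive crawlers
SecRule REQUEST_HEADERS:User-Agent "@rx (?i)(mj12bot|semrushbot)" \
    "id:1000101,phase:1,t:none,deny,status:403,msg:'Aggressive crawler blocked'"
```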

    Update 07.11.18 08:55 CET: The server continues to run smoothly. It's faster and more stable than it has been for at least the past 4 weeks. The average server load has dropped to its monthly low. It might rise again, as we usually expect more traffic during the day.

    Update 07.11.18 21:30 CET: Performance and stability have been very good today. No outages have occurred. We're still expecting CloudLinux to find the root of the issue and provide a permanent fix. Our efforts so far have only mitigated the issue, not resolved it.

    Today there was only a temporary issue with sites behind CloudFlare: a technician accidentally removed the CloudFlare IPs from the firewall whitelist and some of them were temporarily blocked. This has been corrected. We're sorry if your website was affected by this in any way.

    Another issue we're currently dealing with is a number of email accounts that are targeted by massive distributed brute-force attacks. Our firewall keeps blocking and mitigating these attacks, but unfortunately, this also puts additional load on the CPU. We cannot do anything but wait for these attacks to stop.

    We'll continue to monitor the server closely until CloudLinux finally provide a feasible solution. Thank you for your patience and understanding so far! It's greatly appreciated.

    Update 08.11.18 10:15 CET: The server keeps holding up well. The load is fine and the websites are performing smoothly.

    CloudLinux are still suggesting that we install a "beta" kernel, but we refuse to do so as long as they cannot guarantee that it addresses this particular issue. We've tried 5 kernels so far and all had even more issues than we have now; none of them fixed the core of this issue. Our system admins are still in discussion with them, trying to get a feasible solution from their team.

    Update 12.11.18 18:00 CET: The server continues to run quite well, with a few minor exceptions when some websites receive sudden spikes in traffic, which we've been able to mitigate promptly. Overall, the server has stopped crashing and websites open fast enough.

    As announced today, we will upgrade to PHP 7.1 as the system default version and install LiteSpeed Cache on all WordPress websites during the course of this week. This should improve performance and stability even more.

    CloudLinux still have not responded. Today we talked to a senior system administrator who works for multiple hosting companies, and he confirmed that we're not the only ones experiencing performance and stability issues with CloudLinux servers. These issues actually started to appear in the past 5 months, but our servers are only affected now because we hadn't updated the CloudLinux kernel since May. Only the new kernels are affected.

    We will continue to remind CloudLinux about this issue and keep pressing them until they finally acknowledge it and work on a permanent solution.

    Update 16.11.18 03:45 CET: No major issues have occurred since the last update. Performance continues to hold up, and the server and all its services/websites have remained online since last week. Tomorrow we'll upgrade to PHP 7.1 as the system-default PHP version, which will hopefully improve performance even more.

    We've received two reports of the mail server being slow. In one case, this was actually caused by a wrong password saved in the email client, which caused the client's IP address to get blocked. If you're also experiencing issues with the mail services, kindly contact our technical support department.

    A few minutes ago, we re-installed Imunify360, as the long-awaited "stable" version 3.7 was released two days ago. We hope that this version will require less CPU and not affect server performance. It should also help mitigate the distributed brute-force attacks against some email accounts.

    Update 16.11.18 06:10 CET: The server became unresponsive earlier and a reboot was necessary. We've booted into an old kernel this time, which won't be patched by KernelCare. The booted kernel was in use for several months, prior to these issues, and we hope that the system will be more stable with it until CloudLinux finally have a solution.

    We've also upgraded the RAM from 32 GB to 40 GB, as the swap partition was being used, which is bad for performance.
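    For reference, swap usage can be checked on any Linux system from /proc/meminfo (values in kB). Sustained non-zero swap usage on a busy shared host is a sign that more RAM is needed, which is why the upgrade helps:

```shell
# Compute current swap usage from the kernel's memory counters
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
echo "swap used: $((swap_total - swap_free)) kB of ${swap_total} kB"
```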

    Update 16.11.18 11:07 CET: The server continues to run smoothly since the last reboot. We noticed CloudLinux logging into the server earlier, most probably to investigate further, but we have not yet received a response from them. We'll keep monitoring the server closely and await a response/solution from the CloudLinux team.

    Update 17.11.18 12:00 CET: We finally seem to be making some progress. CloudLinux have investigated this issue again, this time more rigorously, and have made some tweaks to the kernel settings (related to memory management and network connections). They've advised us to switch back to the newest kernel with these settings, which we'll do, but we're trying to arrange a schedule with them so they can perform the update themselves and monitor the server for 1-2 hours.

    Since Saturday is the day with the lowest traffic according to our statistics, we'd like to schedule this task 2 weeks from now, as next week is Black Friday and some shops might be running ad campaigns for the event. We'll try not to touch the server next week if there are no issues.

    The server continues to run very well so far, but it's running an old kernel and we must switch to the latest kernel sooner rather than later, as newer kernels patch some security vulnerabilities, such as the Spectre/Meltdown vulnerabilities of Intel processors. We hope that the next kernel update, planned in 2 weeks from now, will be successful.

  • Date - 14.10.2018 00:00
  • Last Updated - 17.11.2018 12:03