DMI

ON November 28, 2017

"Error establishing a database connection" - WordPress XML-RPC Denial of Service Attack Error

POSTED IN:

Often times after taking over management of a client's website, we'll put the client's website on a sever managed by us, Dedicated Managers, Inc. Last week a client's server started having database issues. When the website was accessed from a public computer, the message "Error establishing a database connection" was returned by the server. It turned out the server was experiencing a DoS (Denial of Service) attack.

Database Server Down

Upon the the first notification of the database failing, which in turn caused the web server to return the failing "Error establishing a database connection" response, a simple restart of the database was successful. In this case the database was MariaDB on a CentOS 7 Server so the service command was used.

service mysql start

Discovery of the Cause

System Logs
The system logs were checked via the journalctl command to investigate the severity of the issue.

journalctl -u mariadb

The output showed the server failed and an attempt to restart also failed (bold added for understanding).

Nov 25 08:41:04 cs6.dedicatedmanagers.com systemd[1]: mariadb.service: main process exited, code=killed, status=9/KILL
Nov 25 08:41:07 cs6.dedicatedmanagers.com systemd[1]: Unit mariadb.service entered failed state.
Nov 25 08:41:08 cs6.dedicatedmanagers.com systemd[1]: mariadb.service failed.
Nov 25 08:42:59 cs6.dedicatedmanagers.com systemd[1]: mariadb.service holdoff time over, scheduling restart.
Nov 25 08:43:58 cs6.dedicatedmanagers.com systemd[1]: Starting MariaDB database server...
Nov 25 08:45:43 cs6.dedicatedmanagers.com systemd[1]: mariadb.service start-pre operation timed out. Terminating.
Nov 25 08:47:05 cs6.dedicatedmanagers.com systemd[1]: Failed to start MariaDB database server.
Nov 25 08:47:07 cs6.dedicatedmanagers.com systemd[1]: Unit mariadb.service entered failed state.
Nov 25 08:47:11 cs6.dedicatedmanagers.com systemd[1]: mariadb.service failed.

The log lines above show that the process was killed with a KILL -9 command, which is not a nice way for a process to be terminated. The KILL -9 command completely terminated the MariaDB service without giving it a chance to clean itself up. For a database, that's not good, and means something major happened.

The lines in the log before the lines shown above didn't offer much but seemed to possibly indicate that the InnoDB page_cleaner may have caused the issue as it happened just before the process was killed (red time below vs red time above).

Nov 21 06:53:40 cs6.dedicatedmanagers.com mysqld[613]: 2017-11-21 6:53:40 140264103036672 [Warning] IP address '85.93.20.74' could not be resolved: Name or service not known
Nov 25 08:40:57 cs6.dedicatedmanagers.com mysqld[613]: 2017-11-25 8:40:55 140263595730688 [Note] InnoDB: page_cleaner: 1000ms intended loop took 9169ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)

Possible Cause
Possibly, the InnoDB page_cleaner may have used up too much memory and caused the kernel to start killing processes. However, further investigation proved there was a greater issue elsewhere and this may have just been the "straw that broke the camel's back" by putting the server load "over the top" to the point where the extra load caused the kernel to pick the most likely offensive process and kill it.

Candidates for Kernel to Kill
A quick check using the dstat command showed that the most likely candidate to be killed by the kernel is the database:

dstat --top-oom

Database Corruption
The logs also showed, that due to the abrupt stoppage of the database server, the tables were marked as crashed and needed repair. So the databases were checked and repaired with the mysqlcheck command:

mysqlcheck -uroot -p --repair --all-databases

Server Load
After the database was up and running again, public access to the website was restored and immediately the site started getting bogged down again. The server load averages (as checked through the "top" or "uptime" commands) were incredibly high and rising:

load average: 135.00, 95.01, 55.05

Typically the load averages (especially for the 2nd & 3rd numbers which represent the 5 minute & 15 minute averages) should be much less than 1, for a single processor system as was the case in this server's (virtual) hardware.

The server also became increasingly slower and unresponsive.

Something was happening. A hard reboot was needed just to regain responsiveness to an ssh shell session.

Upon reboot the system was responsive and the system was monitored with the "top" and "ps -ef" commands. It didn't take long before the load averages started spiking back up. What could be seen was that the website was receiving multiple hits every second. The server was running suPHP which starts a new cgi process for each hit to the server which showed up in the output of the commands.

wwwuser1 20292 18101 0 12:12 ? 00:00:00 /usr/bin/php-cgi

Above is one line of the "ps -ef" output showing a hit to the website. Typically those processes start up and die quickly. It can be so fast that on a low traffic server you may not even catch the process in the output. Even if you load a website page in a browser and immediately after hitting enter to load the page you run the "ps -ef" command in the terminal, you may not catch it.

But these processes were sticking around and increasing in quantity. Every second another was added. Every "ps -ef" and top refresh showed more and more of the lines above. Some of the older ones died off, but the total number was continually increasing.

Reducing the Server Load
The web server was shut down and the lingering processes were killed:

apachectl stop

kill $(ps aux | grep '[p]hp' | awk '{print $2}')

The server load decreased and further investigation could be made.

Checking IP Traffic
Immediately a listener was put on port 80 to view the http traffic:

tcpdump -n -i eth1 port 80

This showed there were hits coming every second coming from a single IP:

08:51:15.178161 IP 185.188.204.6.40984 > 169.48.115.29.http: Flags [S], seq 2828388787, win 29200, options [mss 1460,sackOK,TS val 656190545 ecr 0,nop,wscale 7], length 0
08:51:16.485425 IP 185.188.204.6.46144 > 169.48.115.29.http: Flags [S], seq 1865138997, win 29200, options [mss 1460,sackOK,TS val 656190871 ecr 0,nop,wscale 7], length 0
08:51:17.483773 IP 185.188.204.6.46144 > 169.48.115.29.http: Flags [S], seq 1865138997, win 29200, options [mss 1460,sackOK,TS val 656191121 ecr 0,nop,wscale 7], length 0
08:51:18.590143 IP 185.188.204.6.33517 > 169.48.115.29.http: Flags [S], seq 3204880701, win 29200, options [mss 1460,sackOK,TS val 656192647 ecr 0,nop,wscale 7], length 0
08:51:19.587510 IP 185.188.204.6.33517 > 169.48.115.29.http: Flags [S], seq 3204880701, win 29200, options [mss 1460,sackOK,TS val 656192897 ecr 0,nop,wscale 7], length 0

Note: The output above is actually taken from after the firewall rule was imposed below. Before the rule, there would have been responses from the sever in between each incoming packet from the attacking machine.

Short Term Resolution

Blocking the Offending Traffic with a Firewall Rule
A quick check of the IP on google showed the IP was from a foreign country and has been reported to be causing issues for others. This was not legitimate traffic. So the IP was blocked using an iptables firewall rule:

iptables -A INPUT -s 185.188.204.6 -j DROP

The web server was brought back up:

apachectl start

Monitoring the system again with the "top" and "ps -ef" commands showed the system to continue to be stable and the offending process load did not return.

Note that the attacking traffic will still show up in a tcpdump as tcpdump has access to the Network Interface Controller hardware before the packets are dumped by the firewall. Not knowing this can cause significant confusion as it would appear that the firewall is not blocking the attacking traffic.

Deeper Discovery

So what was causing the issue?

Web Server Logs
A look into the server's apache httpd logs showed the attack was on WordPress' xmlrpc interface. There were thousands of log entries like the following:

185.188.204.6 - - [25/Nov/2017:08:40:11 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:12-0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:13 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:14 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:15 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:16 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:17 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:18 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.188.204.6 - - [25/Nov/2017:08:40:19 -0800] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

WordPress uses XML-RPC for communication between the WordPress blog and other applications (ie. WordPress iPhone App to post articles from an iPhone).

Long Term Resolution(s)

Blocking XML-RPC access via .htaccess file

Since this client doesn't use any of the XML-RPC communication features, the interface was blocked via the .htaccess file with the following additional configuration:

#added to .htaccess file
<Files "xmlrpc.php">
Order Allow,Deny
Deny from all
</Files>

Blocking XML-RPC access via WordPress Plugin

There are plugins that can be installed to disable the XML-RPC protocol from Worpress, such as the Disable XML-RPC plugin, but the plugin works at the php level, meaning the extra php cgi process is still loaded which uses more server memory and bogs down the server. Hence the .htaccess extra configuration is preferred as it blocks the request before the expensive php process is called which puts load on the server that can still allow it to fairly easily be taken down.

Blocking XML-RPC access - other methods

Other methods to reduce exposure to this kind of attack can use iptables firewall rules to temporarily block requests that come in too big of bursts. However today's web browsers, in trying to make user load time experience better, can make 10 (or more) connections to a server, so it can be difficult to make a distinction between legitimate traffic and illigimitate traffic as they can look fairly similar at the lower level of the iptables firewall.

Some ISP's can add smarter firewalls (usually at a cost) that help alleviate these types of problematic attacks.

One could also delete the xmlrpc.php file, but WordPress updates may overwrite/reinstall the file. If you use a security mechanism like Wordfence (we at DMI highly recommend you do so), you'll also likely get warnings that a core file was changed, which is annoying.

Wordfence protects against brute force attacks, and if this had been a security attack, it would have blocked the attacker due to too many failed login attempts. But as it turns out, per below, this was simply a Denial of Service attack, there was no attempt to hack a login.

Checking the Severity of the Attack

To see if the attack on this server was an attempt to brute force access through password vulnerabilities in XML-RPC on WordPress, a simple script was put in place of the xmlrpc.php file and the parameters were dumped to a file fore each request. The firewall rule was suspended to allow a few requests to come through and then the rule was put back in place.

<?php
$file = print_r($_REQUEST, true);
file_put_contents("xmlrpc.data." . time(), $file);
exit;
?>

The output saved to each file was an empty array, showing that this was just a DoS attack and not an attempt to hack passwords. If it was a password attack there would have been some payload conforming to XML-RPC standards in the request variables.

Conclusion

This article was about a Denial of Service (Dos) attack that happened to a website running WordPress. The initial indicator of the issue was the database failure and public attempts to reach the website via a web browser received the message "Error establishing a database connection".

The database failing was because the server was overloaded with too many HTTP requests. The HTTP requests were causing memory intensive cgi processes to load without closing them out. Due to the amount of processes open, the computer's memory became overloaded and the kernel had to choose which process to kill in order to save the overall stability of the server. The database server was chosen to be killed by the kernel's algorithm.

The process to discover the cause of the issue was shown along with short term and long term resolutions to help prevent similar attacks in the future.