A techy post - we haven't had one of those here for a while.
The Problem: Apache 2.4 timing out
I noticed occasional timeout messages coming back from the monitors keeping an eye on Apache 2.4 on a server (CentOS 7.2.1511 and Apache 2.4.6, if it makes any difference).
They were occurring often enough to make me want to shut up the warning emails at least. They also gave me that nagging doubt - was Apache serving up webpages slowly and/or timing out on actual pageloads for real visitors? But I also couldn't find anything to pin it down in the logs. The Apache error log showed nothing of any use, and neither did any other logs I looked at.
A Scarier Problem: Apache 2.4 not restarting
A few times I tried restarting Apache: systemctl restart httpd. As you do. Didn't seem to make much difference. And then one time, I restarted it, and Apache never came back up. The command to restart showed as failed. After a few minutes probing some logs, I tried starting it (systemctl start httpd) and it started fine. Weird.
Still nothing in the Apache logs, but the system log then gave me something:
systemd: httpd.service stop-final-sigterm timed out. Killing.
systemd: httpd.service: main process exited, code=killed, status=9/KILL
systemd: Failed to start The Apache HTTP Server.
systemd: Unit httpd.service entered failed state.
systemd: httpd.service failed.
Ah, so that's what happened. The command to stop httpd itself timed out. Eventually, systemd gave up waiting and killed it with SIGKILL (a kill -9).
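If you want to pull messages like these back out of the system log later, a small filter does the job. This is only a sketch: it assumes the system log lives in /var/log/messages (the usual place on CentOS 7 with rsyslog), and the function name httpd_unit_events is made up for illustration.

```shell
# Hypothetical helper: print systemd messages about the httpd unit from a log file.
# Assumption: the system log is /var/log/messages unless another path is given.
httpd_unit_events() {
    local logfile="${1:-/var/log/messages}"
    # Match systemd lines (with or without a [pid]) that mention the unit
    # name or its description, "The Apache HTTP Server".
    grep -E 'systemd(\[[0-9]+\])?: .*(httpd\.service|Apache HTTP Server)' "$logfile"
}

# Usage (reading the system log usually needs root):
#   httpd_unit_events /var/log/messages
```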
That gave me a few messages I could poke around in Google with, and it led me to some hints that I tried.
A Possible Culprit: Remote Rules in mod_security
Someone, with a completely different stack from mine, reported a similar message to mine, and they found that having mod_security rules that are picked up from an off-server URL was causing their problem.
So it's time to have a look. I use rules from AtomiCorp, which are updated regularly, and a quick grep turned up two relevant lines in my loaded rules. One used the SecRemoteRulesFailAction directive, which tells mod_security what to do if the remote URL fails to load. The other used the SecRemoteRules directive, which takes a key and a URL to load rules from.
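For anyone wanting to run the same check, the grep can be sketched roughly like this. The function name find_remote_rules and the default directory "modsec" are assumptions for illustration; point it at wherever your own rules are loaded from.

```shell
# Hypothetical helper: list every active SecRemoteRules* directive under a rules directory.
# Assumption: the rules are plain-text .conf files; "modsec" is only a placeholder default.
find_remote_rules() {
    local rules_dir="${1:-modsec}"
    # -r: recurse, -n: show line numbers, -E: extended regex.
    # Leading whitespace is allowed; lines already commented out are skipped.
    grep -rnE --include='*.conf' '^[[:space:]]*SecRemoteRules' "$rules_dir"
}

# Usage:
#   find_remote_rules /path/to/your/rules
```

Because SecRemoteRulesFailAction shares the SecRemoteRules prefix, one pattern picks up both directives.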
I tried running that URL through curl. It ran for over a minute without returning anything, and I eventually gave up and hit Ctrl+C. It seems their webserver was having a bad day and not returning the data from that URL in a timely fashion. No, it wasn't failing - that would have just triggered the "warn" action specified by SecRemoteRulesFailAction. It was simply very slow. I loaded the URL from my PC in a browser, and it loaded fine, so it wasn't slow all the time or from everywhere - but I was guessing I'd found the culprit. If the URL containing the remote rules is slow to return a response, maybe that made Apache time out too - including at shutdown.
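Rather than waiting on curl and hitting Ctrl+C, you can give it a hard time limit and have it report how long the request took. A minimal sketch, assuming curl is installed; the function name probe_url and the 10-second default are my own choices, and the URL in the usage line is a placeholder, not the real rules URL.

```shell
# Hypothetical helper: fetch a URL with a hard time limit and report the timing.
# Assumption: curl is installed; the 10-second default limit is arbitrary.
probe_url() {
    local url="$1"
    local limit="${2:-10}"
    # -s: no progress meter, -S: still show errors, -o /dev/null: discard the body.
    # --max-time aborts the whole transfer after $limit seconds (curl exit status 28).
    # -w prints the HTTP status and the total time once the transfer ends.
    curl -sS -o /dev/null --max-time "$limit" \
         -w 'http_code=%{http_code} time_total=%{time_total}s\n' "$url"
}

# Usage (substitute the URL from your own SecRemoteRules line):
#   probe_url "https://example.invalid/rules" 10 || echo "failed or timed out (exit $?)"
```

Run it from the server itself, not from your PC, since the slowness may only show up from there.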
Time to test that theory. I commented out those two lines in that mod_security configuration file.
I didn't receive another timeout. Until the rules were updated, that is, and my comment markers were overwritten.
So I modified my update script. After un-tarring the new rules, I simply run this:
sed -i "s/^SecRemoteRules/# SecRemoteRules/" modsec/10_asl_rules.conf
Now, the two remote-rules declarations are always commented out. And I've not had a single timeout message since.
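If you'd like the update script to confirm the edit took effect, the sed command above can be wrapped with a check. This is a sketch only: the function name disable_remote_rules is hypothetical, and it assumes GNU sed (for -i without a backup suffix), which is what CentOS ships.

```shell
# Hypothetical wrapper around the sed command above, with a check afterwards.
# Assumption: GNU sed (for -i without a backup suffix).
disable_remote_rules() {
    local conf="$1"
    # Comment out every line that starts with SecRemoteRules; this also
    # catches SecRemoteRulesFailAction because it shares the prefix.
    sed -i 's/^SecRemoteRules/# SecRemoteRules/' "$conf"
    # The sed pattern misses indented directives, so fail loudly if any
    # active remote-rules line (indented or not) is still present.
    if grep -qE '^[[:space:]]*SecRemoteRules' "$conf"; then
        echo "remote rules still active in $conf" >&2
        return 1
    fi
}

# Usage, after un-tarring the new rules:
#   disable_remote_rules modsec/10_asl_rules.conf
```

Running it twice is harmless: once a line starts with "# ", the pattern no longer matches it.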
Have I found the culprit? I'm not sure. I didn't analyse the Apache mod_security source code to find out what happens if a SecRemoteRules line takes too long to process. How long do you need to confirm, for sure, that you've got the culprit? Is a week long enough? It seems to be working, and the problem I've diagnosed would fit the symptoms I was seeing.
So after a week, I'm declaring my problem solved, and I'll post this here in the hope that it helps others who may have similar problems with timeouts and remote rules.
If anything crops up that indicates the problem lies elsewhere, I'll add a comment here to say so.
Have you had an experience of something similar? Leave a comment below.