Originally posted by MischeviousElf
Okay guys, I know Skeptic has not got back on this latest one,
Last evening's downtime was caused by what appears to be a "perfect storm" of unpredictable (and predictable) events.
At some point, the Yahoo "Slurp" indexing spider latched onto our old tags pages... even though I have a specific "Disallow" statement in our
"robots.txt" file to make sure search spiders don't parse those pages (since we disabled the tag system). And for some reason, Slurp decided to hit
about six to nine different tag search results every second... a heavy load, since each tag search is already an expensive query.
(I have no idea why Slurp hit those pages,
ignoring our robots.txt declaration, and at a previously unheard-of rate of access. We keep logs of search spider activity, and previous logs show a
hit rate of no more than one tags page every five seconds from Slurp... and yes, it was from a valid Yahoo IP.)
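For reference, the sort of robots.txt rule Slurp blew past looks like this (the "/tags/" path here is illustrative, not our exact file; a well-behaved spider should skip anything it matches):

```
User-agent: *
Disallow: /tags/
```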
And... at that same time, we were also servicing a peak in a heavy cycle of indexing from an AOL spider, as well as an MSN spider
(all these
non-Google search spiders tend to increase their activity near the end of the month... I've just never seen them all increase their rate-of-index at
the same time.)
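For those wondering where hit-rate figures like the ones above come from: we count matching lines in the web server access logs, grouped by second. A simplified, hypothetical sketch (the sample log lines, IP, and field parsing are illustrative, not our real logs):

```python
# Hypothetical sketch of pulling a per-second spider hit rate from an
# access log in the common Apache "combined" format. The sample lines
# below are made up for illustration.
from collections import Counter

sample_log = """\
72.30.0.1 - - [30/Jun/2008:21:04:01 +0000] "GET /tags/ufo HTTP/1.1" 200 512 "-" "Yahoo! Slurp"
72.30.0.1 - - [30/Jun/2008:21:04:01 +0000] "GET /tags/nasa HTTP/1.1" 200 512 "-" "Yahoo! Slurp"
72.30.0.1 - - [30/Jun/2008:21:04:02 +0000] "GET /tags/mars HTTP/1.1" 200 512 "-" "Yahoo! Slurp"
"""

hits_per_second = Counter()
for line in sample_log.splitlines():
    # Only count Slurp requests against the disabled tags pages.
    if "Slurp" in line and "/tags/" in line:
        # The timestamp sits between the first "[" and "]".
        timestamp = line.split("[", 1)[1].split("]", 1)[0]
        hits_per_second[timestamp] += 1

# The busiest single second tells you the peak hit rate.
print(max(hits_per_second.values()))
```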
And... at that same time, our service provider's load balancer (that balances the activity among our three web servers) had a brief failure that
caused "service unavailable errors" for a time.
And... as is typical of life at ATS, when the site runs slow or has downtime, lots-and-lots of people hit refresh in the hope that it will
come back. But most people don't realize that if you refresh a page that's responding slowly (at the server), your refresh doesn't stop the server
from delivering the original page, it just initiates another process for the same page... and continuing to refresh introduces more and more processes to a
server cluster already struggling to recover.
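A toy sketch of that effect (simplified, hypothetical code, not our actual server): abandoning a slow request doesn't cancel the work already in flight, so each refresh just spawns another copy of the same work on top of it.

```python
# Each "refresh" starts a new server-side handler; the handlers the
# impatient user abandoned keep running anyway, multiplying the load.
import threading
import time

queries_started = 0          # server-side processes spawned
lock = threading.Lock()

def handle_request():
    """Simulates a slow page render; it keeps running even after the
    client has given up on it and clicked refresh."""
    global queries_started
    with lock:
        queries_started += 1
    time.sleep(0.05)         # stand-in for an expensive database query

threads = []
for refresh in range(5):     # one user hammering refresh five times
    t = threading.Thread(target=handle_request)
    t.start()                # earlier copies are still running
    threads.append(t)

for t in threads:
    t.join()

# All five copies of the page were rendered; none were cancelled.
print(queries_started)       # 5
```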
As a result... the database server was attempting to respond to roughly 500 queries a second. We've survived intense traffic peaks in the
range of 180 queries a second, but 500 is an unexpectedly high rate for any database installation.
And... as it was necessary to take all services offline and reboot the database to allow everything to recover, minor corruption crept into a few
database tables, causing another slow-down, and some extended downtime to repair.
It should be noted that
every site with web servers on the Internet sees multiple attack attempts on a daily basis. Experts estimate that
about 8% of all Internet traffic is the result of automated bots that constantly look for servers with weaknesses, and attempt to exploit
anything that is discovered. When and if an attack/exploit occurs, it's rarely the result of a specific site being targeted. While we have indeed
been the target of malicious attacks in the past, including one recently, the reality is that only a handful of attacks over the past five years have
obviously targeted ATS.
Now... whether or not this particular odd confluence of unexpected and errant search spider activity is the result of TPTB, I'll leave that
speculation up to our members.