My Apache error logs don't tell me enough about 404s. They don't tell me:
User-agent:
If the user-agent is a web crawler, I don't care about the 404: the bad link on our site has probably been fixed already, and the crawler still has the old URL in its index.
Referrer:
If the 404 is from a bad link on our site, I want to know the originating page.
Query:
Most of our 404s come from cross-site scripting (XSS) attacks. Without the query part of the URL, it's impossible to tell these apart from genuine broken links.
For now I use tail and grep on the access log to check the last few days' worth of 404s:
$ tail -20000 access_log_www | grep -F '" 404'
As it happens, we get about 10,000 requests/day, so to see n days' worth of log entries I just look at the last n × 10,000 lines of the access log (tail -20000 for two days).
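The line-count arithmetic can be wrapped in a small shell function. This is only a sketch: the name recent_404s is made up for illustration, and the 10,000 requests/day figure is the rough estimate from above, not something the function measures. It uses the tail -n spelling, which should behave the same as tail -20000.

```shell
# Hypothetical helper: print the 404 entries from roughly the last N days.
# Assumes about 10,000 requests per day, so N days is about N * 10000 lines.
# Usage: recent_404s 2 access_log_www
recent_404s() {
    days=$1
    logfile=$2
    tail -n "$((days * 10000))" "$logfile" | grep -F '" 404'
}
```

If traffic grows, the multiplier is the only thing that needs to change.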
Only 0.2% of requests are 404s, so grep has to winnow through a lot of chaff to find them. To speed it up, I use grep -F, which treats the pattern as a fixed string instead of a regular expression.
The string 404 can appear in other places (the file size field, the user-agent field), so I include the double quote that closes the request field, which comes right before the status code. Because the pattern contains a double quote and a space, I have to single-quote it for the shell.
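A stricter variant checks the status field itself and prints only the request (including any query string), the referrer, and the user-agent. This is a sketch, not what I actually run: it assumes the access log is in Apache's combined format, the function name show_404_details is invented, and splitting on double quotes will misparse a line whose request or user-agent contains an escaped quote.

```shell
# Hypothetical filter: read combined-format log lines on stdin and print
# request, referrer and user-agent (tab-separated) for each 404.
# Usage: tail -n 20000 access_log_www | show_404_details
show_404_details() {
    awk -F'"' '
        {
            # With the double quote as separator, $2 is the request line,
            # $3 looks like " 404 1234 ", $4 is the referrer and $6 is
            # the user-agent.
            split($3, parts, " ")
            if (parts[1] == "404")
                printf "%s\t%s\t%s\n", $2, $4, $6
        }
    '
}
```

Comparing the status field exactly means a 404 in the size or user-agent field can never match.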
Statistical aside:
How much time does grep -F save?
Here are the results of 10 greps on 20,000 lines, 5 without and 5 with the -F option (times in seconds):
command   real    user    sys
grep      0.486   0.460   0.070
grep      0.493   0.460   0.020
grep      0.507   0.430   0.080
grep      0.510   0.490   0.010
grep      0.541   0.500   0.130
grep -F   0.455   0.480   0.040
grep -F   0.458   0.420   0.050
grep -F   0.458   0.430   0.050
grep -F   0.458   0.470   0.040
grep -F   0.523   0.600   0.050
Comparing the totals of the five runs in each group, plain grep (no -F) costs this much more than grep -F:
7.87% in real time
-2.50% in user time (saves time?)
34.78% in system time (big percentage of a small number)
Put another way, grep -F saves this much versus plain grep:
7.29% in real time
-2.56% in user time (costs time?)
25.81% in system time (big percentage of a small number)
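For anyone who wants to check the arithmetic: the five real times add up to 2.537 s for plain grep and 2.352 s for grep -F. The two helpers below are just illustrative names (pct_cost and pct_saving) for the two ways of expressing the same difference, one relative to the faster total and one relative to the slower.

```shell
# pct_cost SLOW FAST: extra cost of SLOW, as a percentage of FAST.
pct_cost() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", (slow - fast) / fast * 100 }'
}

# pct_saving SLOW FAST: saving of FAST, as a percentage of SLOW.
pct_saving() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", (slow - fast) / slow * 100 }'
}

pct_cost 2.537 2.352     # real time: prints 7.87
pct_saving 2.537 2.352   # real time: prints 7.29
```

The user and system figures come out the same way from their own totals (2.340 vs 2.400 and 0.310 vs 0.230).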