sed and grep: "Stories" of web visitors
I've been thinking for a long time about crunching the web logs to produce "stories" showing how visitors navigate our site. We don't get so many visitors that a day or two of traffic would be overwhelming to look at.
So I've started writing shell scripts to do the crunching.
They are located at /home/steve/webstats/trunk
The plan is, starting from a raw log file:
1. find just requests for pages (.php, .html, .pdf) with result code 200.
(Hmmm... right now I am using grep -v to get rid of everything that isn't covered by (1.) above. Is it simpler to use something like:
grep -E '] "GET [^ ]*\.(php|pdf|html)[^ ]* HTTP/1.." 200 '
or
sed -E '/] "GET [^ ]*\.(php|pdf|html)[^ ]* HTTP\/1.." 200 /!d'
to get "successful requests for pages"?)
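One thing to watch in patterns like these: square brackets make a character class, so [php|pdf|html] matches any single one of those letters (or a literal |), not one of the three words. A quick sketch on made-up request lines (the IPs and paths are invented for illustration) compares that with a parenthesized alternation under grep -E:

```shell
# Made-up log fragments; only the request and status parts matter here.
sample=$(printf '%s\n' \
  '10.0.0.1 - - [20/May/2008:20:44:19 -0700] "GET /index.php HTTP/1.1" 200 512' \
  '10.0.0.1 - - [20/May/2008:20:44:20 -0700] "GET /logo.gif HTTP/1.1" 200 900' \
  '10.0.0.2 - - [20/May/2008:20:44:21 -0700] "GET /docs/guide.pdf HTTP/1.1" 200 4096' \
  '10.0.0.2 - - [20/May/2008:20:44:22 -0700] "GET /missing.html HTTP/1.1" 404 210')

# Character class: any ONE of the letters p,h,d,f,t,m,l or "|" is enough,
# so /logo.gif slips through (its "l" is in the class).
class_hits=$(printf '%s\n' "$sample" |
  grep -c '] "GET [^ ]*[php|pdf|html][^ ]* HTTP/1.." 200 ')

# Alternation: requires a literal ".php", ".pdf" or ".html" in the URL.
alt_hits=$(printf '%s\n' "$sample" |
  grep -Ec '] "GET [^ ]*\.(php|pdf|html)[^ ]* HTTP/1.." 200 ')

echo "character class: $class_hits, alternation: $alt_hits"
```

On this sample the character-class version counts the image request as a page, while the alternation counts only the .php and .pdf requests.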
sed doesn't do aaa OR bbb OR ccc very well; it looks like
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d # most seds
gsed '/AAA\|BBB\|CCC/!d' # GNU sed only
sed is very fast, though.
This works at about 10,000 raw log lines/sec:
sed '/" 200 /!d' "$logfilename" | \
sed -e '/"GET [^ ]*\.php/b' \
-e '/"GET [^ ]*\.pdf/b' \
-e '/"GET [^ ]*\.ppt/b' \
-e '/"GET [^ ]*\.avi/b' \
-e '/"GET [^ ]*\.doc/b' \
-e '/"GET [^ ]*\.mpg/b' \
-e '/"GET [^ ]*\.flv/b' \
-e '/"GET [^ ]*\.html/b' \
-e '/"GET [^ ]*\/ HTTP/b' \
-e d > "$temporary_story_file"
More importantly, if I specify good files, then I have to modify the script every time we add a new file type. And if I forget, it's not easy to notice that the overlooked file type is missing.
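One possible way to catch an overlooked file type, sketched under assumptions: tally the extensions among successful GET requests, so a new type shows up in the count even when the filter doesn't list it. The helper name count_extensions and the sample lines (including the .xls files) are hypothetical, not taken from the real logs.

```shell
# count_extensions: hypothetical helper. Reads log lines on stdin and prints
# "count extension" for successful (200) GET requests, most frequent first.
count_extensions() {
  sed -n 's/.*"GET [^ ?]*\.\([A-Za-z0-9][A-Za-z0-9]*\)[? ][^"]*" 200 .*/\1/p' |
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Made-up sample lines for illustration.
sample_log() {
  printf '%s\n' \
    '10.0.0.1 - - [20/May/2008:20:44:19 -0700] "GET /index.php HTTP/1.1" 200 512' \
    '10.0.0.1 - - [20/May/2008:20:44:20 -0700] "GET /about.php HTTP/1.1" 200 640' \
    '10.0.0.1 - - [20/May/2008:20:44:21 -0700] "GET /contact.php?from=home HTTP/1.1" 200 300' \
    '10.0.0.2 - - [20/May/2008:20:44:22 -0700] "GET /reports/q1.xls HTTP/1.1" 200 2048' \
    '10.0.0.2 - - [20/May/2008:20:44:23 -0700] "GET /reports/q2.xls HTTP/1.1" 200 2048' \
    '10.0.0.2 - - [20/May/2008:20:44:24 -0700] "GET /docs/guide.pdf HTTP/1.1" 200 4096' \
    '10.0.0.3 - - [20/May/2008:20:44:25 -0700] "GET /old.html HTTP/1.1" 404 210' \
    '10.0.0.3 - - [20/May/2008:20:44:26 -0700] "GET / HTTP/1.1" 200 100'
}

census=$(sample_log | count_extensions)
printf '%s\n' "$census"
```

Here the .xls requests appear in the tally even though no filter mentions them. Requests without an extension, such as "/", are simply not counted.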
The grep, grep, grep approach takes a while to process one log file: about 15 minutes so far...
Tue May 20 20:44:19 PDT 2008
Lines in apache access log:
636818 /export/www/htdocs/weblogs/2008/Q2/access_log_200804
Dropping requests for images:
Tue May 20 20:54:32 PDT 2008
Lines left in temporary work file:
214027 /tmp/stories_14566
Dropping requests for javascripts:
Tue May 20 20:55:01 PDT 2008
Lines left in temporary work file:
203919 /tmp/stories_2_14566
Dropping requests for css:
Tue May 20 20:55:39 PDT 2008
Lines left in temporary work file:
192682 /tmp/stories_14566
Dropping partial or unsuccessful requests:
Tue May 20 20:59:28 PDT 2008
Lines left in temporary work file:
140429 /tmp/stories_2_14566
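The run above makes a separate pass for each kind of file. A possible shortcut, sketched on made-up lines: let a single grep -v carry all the unwanted extensions, then keep only status 200. The extension list is an assumption (the script's actual patterns aren't shown here), as is the guess that "partial" means a 206 response; keep_pages is a hypothetical name.

```shell
# keep_pages: hypothetical helper. One grep drops images, scripts and
# stylesheets together; a second keeps only complete (status 200) responses.
keep_pages() {
  grep -Eiv '"GET [^ ?]*\.(gif|jpe?g|png|ico|js|css)[? ]' | grep '" 200 '
}

# Made-up sample lines for illustration.
sample_log() {
  printf '%s\n' \
    '10.0.0.1 - - [20/May/2008:20:44:19 -0700] "GET /index.php HTTP/1.1" 200 512' \
    '10.0.0.1 - - [20/May/2008:20:44:20 -0700] "GET /images/logo.gif HTTP/1.1" 200 900' \
    '10.0.0.1 - - [20/May/2008:20:44:20 -0700] "GET /js/menu.js HTTP/1.1" 200 300' \
    '10.0.0.1 - - [20/May/2008:20:44:21 -0700] "GET /style/main.css?v=2 HTTP/1.1" 200 700' \
    '10.0.0.2 - - [20/May/2008:20:44:22 -0700] "GET /missing.html HTTP/1.1" 404 210' \
    '10.0.0.2 - - [20/May/2008:20:44:23 -0700] "GET /docs/guide.pdf HTTP/1.1" 206 1024' \
    '10.0.0.2 - - [20/May/2008:20:44:24 -0700] "GET /docs/guide.pdf HTTP/1.1" 200 4096'
}

kept=$(sample_log | keep_pages)
printf '%s\n' "$kept"
```

Only the /index.php request and the complete /docs/guide.pdf request survive on this sample. Whether this is actually faster on a 600,000-line log would need measuring.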
anyway, next is:
(1.5. weed out search / h@x0r bots)
2. sort by IP address and time
3. clean up the lines to make a tab-delimited table with these fields:
IP address, datestamp, URL requested, referring URL, browser type
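Steps 2 and 3 might look something like this sketch. It assumes Apache's "combined" log format, where the request, referrer and browser strings are the three double-quoted fields, and it assumes the log is already in time order, so a stable sort on the IP column keeps each visitor's requests in sequence. The name make_story_table and the sample lines are made up, sort -s is a GNU/BSD extension rather than POSIX, and bot filtering (step 1.5) is not attempted.

```shell
# make_story_table: hypothetical helper. Reads "combined" format log lines on
# stdin; writes IP, datestamp, URL, referrer, browser as tab-separated columns,
# grouped by IP with the original (chronological) order kept inside each group.
make_story_table() {
  awk -F'"' '{
    split($1, head, " ")        # head[1] = IP, head[4] = "[20/May/2008:20:44:19"
    split($2, req, " ")         # req[2] = requested URL
    stamp = substr(head[4], 2)  # strip the leading "["
    printf "%s\t%s\t%s\t%s\t%s\n", head[1], stamp, req[2], $4, $6
  }' | LC_ALL=C sort -s -t "$(printf '\t')" -k1,1
}

# Made-up sample lines for illustration.
sample_log() {
  printf '%s\n' \
    '10.0.0.2 - - [20/May/2008:20:44:19 -0700] "GET /index.php HTTP/1.1" 200 512 "-" "Mozilla/5.0"' \
    '10.0.0.1 - - [20/May/2008:20:44:25 -0700] "GET /about.html HTTP/1.1" 200 640 "http://example.com/" "Opera/9.27"' \
    '10.0.0.2 - - [20/May/2008:20:45:02 -0700] "GET /docs/guide.pdf HTTP/1.1" 200 4096 "http://example.com/index.php" "Mozilla/5.0"'
}

table=$(sample_log | make_story_table)
printf '%s\n' "$table"
```

Splitting on the double quote is simple but fragile: a browser string containing an escaped quote would shift the columns. The IP column is also sorted as text rather than numerically, which is enough to group a visitor's lines together.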
Of course you can buy tools that claim to do this...
Labels: Apache log, bash, grep, sed, server admin, shell, web server logs