Tuesday, May 20, 2008

sed and grep: "Stories" of web visitors

I've been thinking for a long time about crunching the web logs to produce "stories" showing how visitors navigate our site. We don't get so many visitors that it would be overwhelming to look at a day or two.

So I've started writing shell scripts to do the crunching.

They are located at /home/steve/webstats/trunk

The plan is, starting from a raw log file:
1. find just requests for pages (.php, .html, .pdf) with result code 200.

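Right now I am using grep -v to get rid of anything that isn't covered by (1.) above.
Would it be simpler to use something like:
grep -E '] "GET [^ ]*\.(php|pdf|html)[^ ]* HTTP/1\.." 200 '
gsed '/] "GET [^ ]*\.\(php\|pdf\|html\)[^ ]* HTTP\/1\.." 200 /!d' # GNU sed only
to get "successful requests for pages"? (Square brackets won't work for the alternation: [php|pdf|html] is a bracket expression that matches a single character, so it has to be a parenthesized group.)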
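Here is a minimal sketch of the grep variant wrapped in a shell function, assuming Apache's common/combined log format. The function name successful_page_requests is made up for illustration, and the three extensions are just the ones from step 1.

```shell
# Keep only successful (status 200) GET requests for page-like files.
# Reads access-log lines on stdin and writes the matching lines to stdout.
# Note: the pattern also matches extensions that merely start with
# php/html/pdf (for example .php3), which may or may not be wanted.
successful_page_requests() {
    grep -E '] "GET [^ ]*\.(php|html|pdf)[^ ]* HTTP/1\.[01]" 200 '
}
```

It would be used as: successful_page_requests < access_log > pages_only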

sed doesn't do AAA OR BBB OR CCC very gracefully; it looks like this:
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d    # most seds
gsed '/AAA\|BBB\|CCC/!d' # GNU sed only
sed is very fast, though.
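For what it's worth, here is why the portable form works, as a small sketch (keep_any is a made-up name). A b command with no label jumps to the end of the script, so a matching line gets printed by sed's normal autoprint and never reaches the final d; only lines that match none of the patterns get deleted.

```shell
# Keep lines matching AAA or BBB or CCC; delete everything else.
# 'b' without a label branches to the end of the script, skipping 'd'.
keep_any() {
    sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
}

# Prints the first and third lines only.
printf '%s\n' 'xAAAx' 'nothing here' 'BBB' | keep_any
```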

This runs at about 10,000 raw log lines/sec:
sed '/" 200 /!d' "$logfilename" | \
sed -e '/"GET [^ ]*\.php/b' \
-e '/"GET [^ ]*\.pdf/b' \
-e '/"GET [^ ]*\.ppt/b' \
-e '/"GET [^ ]*\.avi/b' \
-e '/"GET [^ ]*\.doc/b' \
-e '/"GET [^ ]*\.mpg/b' \
-e '/"GET [^ ]*\.flv/b' \
-e '/"GET [^ ]*\.html/b' \
-e '/"GET [^ ]*\/ HTTP/b' \
-e d > "$temporary_story_file"
The catch is that if I list the good file types, I have to modify the script every time we add a new file type. And if I forget, the overlooked file type silently drops out of the stories, which is not easy to notice.
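One way to catch an overlooked file type might be to tally the extensions of all successful GET requests every so often and eyeball the list. This is only a sketch: extension_counts is a made-up name, and it assumes the usual Apache field layout, where the request path is the 7th whitespace-separated field and the status code is the 9th.

```shell
# Count successful GET requests per file extension, most frequent first.
# Assumes Apache common/combined log format: $6 is "GET, $7 the path, $9 the status.
extension_counts() {
    awk '$6 == "\"GET" && $9 == 200 {
        path = $7
        sub(/\?.*/, "", path)              # drop any query string
        n = split(path, parts, "/")
        file = parts[n]                    # last path component
        if (file == "")          ext = "(directory)"
        else if (file !~ /\./)   ext = "(none)"
        else { ext = file; sub(/.*\./, "", ext) }
        count[ext]++
    }
    END { for (e in count) print count[e] "\t" e }' | sort -rn
}
```

Running extension_counts < access_log prints one line per extension with its count, so a new type that the whitelist doesn't know about shows up in the list.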

The grep -v, grep -v, grep -v approach takes a while to process one log file: about 15 minutes so far. Here is the output from one run:

Tue May 20 20:44:19 PDT 2008
Lines in apache access log:
636818 /export/www/htdocs/weblogs/2008/Q2/access_log_200804

Dropping requests for images:

Tue May 20 20:54:32 PDT 2008
Lines left in temporary work file:
214027 /tmp/stories_14566

Dropping requests for javascripts:

Tue May 20 20:55:01 PDT 2008
Lines left in temporary work file:
203919 /tmp/stories_2_14566

Dropping requests for css:

Tue May 20 20:55:39 PDT 2008
Lines left in temporary work file:
192682 /tmp/stories_14566

Dropping partial or unsuccessful requests:

Tue May 20 20:59:28 PDT 2008
Lines left in temporary work file:
140429 /tmp/stories_2_14566
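The four drops above could probably be folded into a single pipeline that reads the log once and skips the intermediate work files. A rough sketch follows; drop_noise is a made-up name, and the extension list is my guess at "images, javascripts, css", not the patterns the real script uses.

```shell
# Drop requests for images, javascript and css, then keep only status 200.
# The extension list is an assumption; adjust it to the site.
drop_noise() {
    grep -Eiv '"GET [^ ]*\.(gif|jpe?g|png|ico|js|css)[? ]' |
    grep '" 200 '
}
```

Usage would be: drop_noise < access_log > stories_work_file. I haven't timed this against the chained version.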

Anyway, the next steps are:
(1.5. weed out search / h@x0r bots)
2. sort by IP address and time
3. clean up the lines to make a tab-delimited table with these fields:
IP address, datestamp, URL requested, referring URL, browser type
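Steps 2 and 3 might be sketched like this, assuming the combined log format (referrer and user agent in double quotes at the end of each line). make_stories is a made-up name. Since the log is already in time order, the sketch sorts on IP address and then on the original line number rather than trying to parse the datestamp. It will get confused by a user agent or referrer containing an escaped double quote.

```shell
# Turn access-log lines into a tab-delimited table grouped by visitor.
# Output columns: IP address, datestamp, URL requested, referring URL, browser type.
make_stories() {
    awk -F'"' -v OFS='\t' '{
        split($1, head, " ")           # head[1] = IP, head[4] = "[date:time"
        split($2, req, " ")            # req[2] = requested URL
        stamp = substr(head[4], 2)     # strip the leading "["
        # NR is carried along only as a sort key and removed again below
        print head[1], NR, stamp, req[2], $4, $6
    }' |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    cut -f1,3-
}
```

Usage would be: make_stories < pages_only > stories.tsv, which puts each visitor's requests together in the order they happened.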

Of course you can buy tools that claim to do this...
