Friday, January 05, 2007

Human-centric web log analysis

Analog, the widely-used web server log analysis package, seems designed to give answers to questions a server administrator asks: volume of traffic, which directories get most traffic, etc.

We are more interested in user behavior: how many people visited our site, which pages did they look at, etc.

To get close to an answer, I need to ignore a lot of server log lines.

Some are easy: in analog.cfg I can add:

# let's not even count these:
FILEEXCLUDE */images/*

This leaves only documents that a person would read: HTML, PDF, etc.

The next step is to ignore lines with status code 206 (Partial Content), which are responses to byte-range requests for part of a file.
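Before changing analog.cfg, it can help to see how many log lines carry that status. Here is a minimal sketch, assuming the Apache common or combined log format, where the status code is the ninth whitespace-separated field. The function names and the sample log lines are made up for illustration.

```shell
#!/bin/bash
# Assumption: Apache common/combined log format, so the status code is the
# 9th whitespace-separated field. Function names are hypothetical.

# count_partial FILE: print the number of lines with status 206
count_partial() {
    awk '$9 == 206 { n++ } END { print n + 0 }' "$1"
}

# drop_partial FILE: print every line whose status is not 206
drop_partial() {
    awk '$9 != 206' "$1"
}

# tiny made-up sample log
sample=$(mktemp)
cat > "$sample" <<'EOF'
192.0.2.1 - - [05/Jan/2007:10:00:00 -0800] "GET /doc/report.pdf HTTP/1.1" 206 32768 "-" "ExampleBrowser/1.0"
192.0.2.1 - - [05/Jan/2007:10:00:01 -0800] "GET /doc/report.pdf HTTP/1.1" 206 32768 "-" "ExampleBrowser/1.0"
192.0.2.2 - - [05/Jan/2007:10:05:00 -0800] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"
EOF

count_partial "$sample"          # how many 206 lines
drop_partial "$sample" | wc -l   # how many lines remain without them
```

In the sample, one reader fetching a single PDF in two ranges produces two log lines, which is why these lines inflate page counts.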

A lot of requests come from robots though.

Here's a bash script I wrote to generate HOSTEXCLUDE directives for analog.cfg:

#!/bin/bash -x

# name:
# synopsis: use join to auto-generate HOSTEXCLUDE lines for analog.cfg
# usage:

# Last Updated:
# Mon Jan 15 18:28:06 PST 2007

# check for YYYYMM on command line
# if [ -z $1 ] = if test: zero-length-string 1st arg
# the opening square bracket is actually the command 'test', so the following space is necessary;
# leaving it out is like saying "if test-z ..." and there is no command test-z
if [ -z "$1" ]
then
    echo "usage: `basename $0` YYYYMM"
    exit 1
fi

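# assert: there was at least one command-line argument;
# try forming a filename from it
# (the assignment is missing from this listing; log/access_log_YYYYMM is
# an assumption based on the hint printed below)
logfilename=log/access_log_$1

# if test: not is-a-file $logfilename
if [ ! -f "$logfilename" ]
then
    echo "$logfilename not found"
    echo "try 'ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/'"
    exit 1
fi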

# make tmp dir, if needed
# if test: not is-a-directory tmp
if [ ! -d tmp ]
then
    mkdir tmp
fi

# extract logfile lines with request for robots.txt
# save them in tmp/robot_access_lines_YYYYMM
# (assignment restored from the comment above)
robots_txt_req_lines_file=tmp/robot_access_lines_$1

if [ -f "$robots_txt_req_lines_file" ]
then
    echo "$robots_txt_req_lines_file already exists."
else
    # grep -F looks for fixed strings (no special character interpretation)
    grep -F robots.txt "$logfilename" > "$robots_txt_req_lines_file"
fi
wc -l "$robots_txt_req_lines_file" # show count of lines
# tmp/robot_access_lines_YYYYMM entries look like:
# <IP address> - - [<date/time> -0800] "GET /robots.txt HTTP/1.0" 200 119 "-" "<user agent>"

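# extract IP addresses, user agents from tmp/robot_access_lines_YYYYMM
# (the assignment is missing from this listing; the file name is an assumption)
robots_txt_req_IP_file=tmp/robot_IPs_$1

if [ -f "$robots_txt_req_IP_file" ]
then
    echo "$robots_txt_req_IP_file already exists."
else
    # strip the trailing quote, turn any literal pipes in the user agent into !,
    # then replace everything between the IP address and the user agent with " |"
    sed 's/"$//' "$robots_txt_req_lines_file" \
    | sed 's/|/!/g' \
    | sed 's/ .*"/ |/' \
    | sort -u \
    > "$robots_txt_req_IP_file"
fi
wc -l "$robots_txt_req_IP_file" # show count of lines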
# lines look like:
# <IP address> | <user-agent>
# the pipe (|) field separator will make it easier to write the HOSTEXCLUDE lines

# lookup hostnames for IP addresses

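# change the following line if you have installed a newer version of dnstran:
# (the assignments are missing from this listing; both paths are assumptions,
# the first based on the README, which says analog updates analog-www/dnscache)
dnscache_file=../analog-www/dnscache
dnscache_sorted_file=tmp/dnscache_sorted

# check whether dnscache_file exists
if ! [ -f "$dnscache_file" ]
then
    echo "$dnscache_file not found"
    exit 1
fi
# assert: $dnscache_file exists
# check whether dnscache_sorted_file exists and is newer than dnscache_file
if [ -f "$dnscache_sorted_file" -a "$dnscache_sorted_file" -nt "$dnscache_file" ]
then
    echo "$dnscache_sorted_file is up to date"
else
    echo "updating $dnscache_sorted_file"
    cut -d' ' -s -f2,3 "$dnscache_file" | sort -u > "$dnscache_sorted_file"
fi
ls -l "$dnscache_sorted_file"
# assert: $dnscache_sorted_file exists, is up to date, and is sorted by IP (lexical sort)
# lines look like:
# <IP address> <hostname>
# or, when no hostname was found:
# <IP address> *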

# hostname NOT known (last character of entry is an asterisk *):
# (the assignments below are missing from this listing; the file names are
# assumptions based on the names used in the README)
IP_only_file=IP_only_$1
if [ -f "$IP_only_file" ]
then
    echo "$IP_only_file already exists"
else
    # search for lines where last character is an asterisk (*):
    grep '\*$' "$dnscache_sorted_file" > "$IP_only_file"
fi

# hostname known (last character of entry is NOT an asterisk *):
hostname_file=hostname_$1
if [ -f "$hostname_file" ]
then
    echo "$hostname_file already exists"
else
    # search for lines where last character is NOT (-v) an asterisk (*):
    grep -v '\*$' "$dnscache_sorted_file" > "$hostname_file"
fi

# use join to auto-generate HOSTEXCLUDE lines for analog.cfg

sed 's/ / |/' $IP_only_file | join - $robots_txt_req_IP_file | awk 'BEGIN { FS = "|" } ; {print "HOSTEXCLUDE " $1 " # " $3}' > exclude_IPs_$1

sed 's/ / |/' $hostname_file | join - $robots_txt_req_IP_file | awk 'BEGIN { FS = "|" } ; {print "HOSTEXCLUDE " $2 " # " $3}' > exclude_hosts_$1

exit 0 # successful completion!

# still have to manually add lines to analog.cfg
# it would not be too hard to find the beginning of the hostexcludes
# and replace them with the new lines.
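
To see what the sed and join steps in the script do, here is a cut-down sketch run on made-up data. The file names, addresses, hostnames and user agents are all hypothetical, and setting LC_ALL=C is my own addition so that sort and join agree on ordering.

```shell
#!/bin/bash
# Made-up sample data; file names and addresses are hypothetical.
export LC_ALL=C          # byte-order collation, so sort and join agree
work=$(mktemp -d)

# two robots.txt requests in Apache combined log format
cat > "$work/robot_lines" <<'EOF'
192.0.2.7 - - [05/Jan/2007:01:02:03 -0800] "GET /robots.txt HTTP/1.0" 200 119 "-" "ExampleBot/1.0"
192.0.2.9 - - [05/Jan/2007:04:05:06 -0800] "GET /robots.txt HTTP/1.0" 200 119 "-" "OtherBot/2.1"
EOF

# dnscache entries after the cut step: "<IP address> <hostname or *>"
cat > "$work/dnscache_sorted" <<'EOF'
192.0.2.7 crawler.example.com
192.0.2.9 *
EOF

# reduce each log line to "<IP address> |<user agent>"
sed 's/"$//' "$work/robot_lines" \
    | sed 's/|/!/g' \
    | sed 's/ .*"/ |/' \
    | sort -u > "$work/robot_IPs"

# join on the IP address (first field), then format the directive;
# awk splits on | so $1 is the IP, $2 the hostname, $3 the user agent
hosts=$(grep -v '\*$' "$work/dnscache_sorted" | sed 's/ / |/' \
    | join - "$work/robot_IPs" \
    | awk 'BEGIN { FS = "|" } { print "HOSTEXCLUDE " $2 " # " $3 }')
ips=$(grep '\*$' "$work/dnscache_sorted" | sed 's/ / |/' \
    | join - "$work/robot_IPs" \
    | awk 'BEGIN { FS = "|" } { print "HOSTEXCLUDE " $1 " # " $3 }')

echo "$hosts"
echo "$ips"
```

Each output line is a HOSTEXCLUDE directive followed by the user agent as a comment, so you can see at a glance which robot each exclusion came from.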

Here's a README for the script:

I. Start by linking to the log files.
Log files are automatically generated by the Apache web server
and copied monthly by a cron script to /export/www/htdocs/weblogs/

cd to analog-www/:

$ cd analog-www

I like to make local links for the following reasons:

1. Having a local directory entry (link) makes scripting a little easier

2. Linking (making a local directory entry rather than copying) uses no disk space

3. Having a link to the logfile makes it harder to accidentally delete the logfile
(it won't go away until both directory entries have been rm'd)

Syntax is:

$ ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/

You can check the results:

$ ls -l log/
total 2564352
-rw-r--r-- 2 root root 184515350 Jan 31 23:58 access_log_200701
-rw-r--r-- 2 root root 148448116 Feb 28 23:53 access_log_200702
-rw-r--r-- 2 root root 172706277 Apr 1 01:00 access_log_200703
Note that the second column shows 2 links to each file.
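
The link behavior described above can be tried safely in a scratch directory. This is only a sketch with hypothetical file names; note that a hard link like this only works when both directories are on the same filesystem.

```shell
#!/bin/bash
# Hypothetical file names in a throwaway directory; no real logs are touched.
work=$(mktemp -d)
mkdir "$work/weblogs" "$work/log"
echo "sample log line" > "$work/weblogs/access_log_200701"

# make a second directory entry (hard link) for the same file
ln "$work/weblogs/access_log_200701" "$work/log/"

# the second column of ls -l is the link count
links=$(ls -l "$work/log/access_log_200701" | awk '{ print $2 }')
echo "link count: $links"

# removing one directory entry leaves the data reachable through the other
rm "$work/weblogs/access_log_200701"
survivor=$(cat "$work/log/access_log_200701")
echo "still readable: $survivor"
```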

II. Next, run analog (but we will run it again before we are done!)
Since we will run it more than once, I find it convenient to use a script to call analog.
The script is called


$ vi

change the filenames to the current time period.
For instance, this time I changed 2006 to 2007 and 12 to 01 (three places):

# Execute analog and report magic on the monthly access_log
nice /home/steve/web/analog-www/analog log/access_log_200701;
# change reports_File_Out and website_Title
nice /home/steve/web/rmagic-2.21/ -reports_File_Out=/export/www/htdocs/wwwlogs/2007.01/ -website_Title="Web Statistics 2007.01";

now run the script:

$ ./

This accomplishes two things:

1. It generates reports and saves them at URL

2. It updates analog-www/dnscache,
that is, it does DNS host lookup on all the IP addresses in the Apache logs,
that is, it tries to match a hostname to the IP address.

III. Next, run robotFinder/

$ cd ../robotFinder

$ ./ 200701

This grabs all requests for robots.txt from the logfile,
tries to match the IP address of each requester in dnscache,
auto-generates HOSTEXCLUDE directives for analog.cfg,
and writes the results to exclude_hosts_YYYYMM (hostname known) and exclude_IPs_YYYYMM (IP address only).

IV. paste the results into analog.cfg:

one way to do this, using vi:

$ cd ../analog-www

$ vi analog.cfg

In vi, find the beginning of the auto-generated HOSTEXCLUDES:

In command mode, type "/Finder" to find the line
# BEGIN read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM

do it again to find the end of the auto-generated HOSTEXCLUDES:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM

you can either note the line numbers of the beginning and end of the directives and
issue a vi command something like

:90,306 d
(delete lines 90 to 306)

or, from the first line of the directives, issue the command

:.,/Finder/ d
(delete from the current line (.) to the first line with "Finder" on it)
If you do it this way, make sure there are TWO lines like this:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM

(since you will zap the first of them!)

Now you can insert the new lines (the two files the script writes) with e.g. the commands

:r ../robotFinder/exclude_hosts_YYYYMM

:r ../robotFinder/exclude_IPs_YYYYMM

save analog.cfg and quit (vi command :wq OR ZZ)
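
The vi steps above could probably be automated. Here is a rough sketch, not something I have wired into the real workflow: it assumes the marker lines start with "# BEGIN read" and "# END read" as quoted above, keeps both markers, and replaces whatever sits between them. The function name replace_block and the sample files are hypothetical.

```shell
#!/bin/bash
# Sketch only: replace everything between the BEGIN and END marker lines of a
# config file with freshly generated directives. The function name and the
# sample files are hypothetical.

# replace_block CONFIG NEWLINES: print CONFIG with the marked block replaced
replace_block() {
    awk -v newfile="$2" '
        /^# BEGIN read/ {               # keep the marker, then emit new lines
            print
            while ((getline line < newfile) > 0) print line
            close(newfile)
            skipping = 1
            next
        }
        /^# END read/ { skipping = 0 }  # keep the END marker, resume copying
        !skipping { print }
    ' "$1"
}

# made-up config and replacement lines in a scratch directory
work=$(mktemp -d)
cat > "$work/analog.cfg" <<'EOF'
FILEEXCLUDE */images/*
# BEGIN read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
HOSTEXCLUDE old.example.com # OldBot/0.9
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
EOF
echo 'HOSTEXCLUDE crawler.example.com # ExampleBot/1.0' > "$work/new_lines"

result=$(replace_block "$work/analog.cfg" "$work/new_lines")
echo "$result"
```

Because the function prints to standard output, you would redirect it to a new file and check the result before moving it over analog.cfg, rather than overwriting the config in place.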

V. Now you have to run analog again!!!

(and you might have to rename or delete the web page directories in )

$ ./

