Human-centric web log analysis
Analog, the widely used web server log analysis package, seems designed to answer the questions a server administrator asks: volume of traffic, which directories get the most traffic, etc.
We are more interested in user behavior: how many people visited our site, which pages did they look at, etc.
To get close to an answer, I need to ignore a lot of server log lines.
Some are easy: in analog.cfg I can add:
# let's not even count these:
FILEEXCLUDE */images/*
FILEEXCLUDE */js/*
FILEEXCLUDE */css/*
This leaves only documents that a person would actually read: HTML, PDF, etc.
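Depending on the site, you might exclude a few more non-document requests the same way (favicon.ico is just an example; check what your own logs show):
FILEEXCLUDE /favicon.ico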
Next, ignore lines with status code 206 (Partial Content); these are byte-range requests for pieces of a file, not separate page views.
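To see how many lines this sets aside, you can count them directly (the status code is the ninth whitespace-separated field; compare the sample log line inside the script below, and the log/ link described in the README):
$ awk '$9 == 206' log/access_log_200701 | wc -l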
A lot of requests come from robots though.
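A rough count of the robot traffic, using the same heuristic the script below relies on (well-behaved robots request /robots.txt before crawling):
$ grep -cF robots.txt log/access_log_200701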
Here's a bash script I wrote to generate HOST_EXCLUDE directives for analog.cfg:
#!/bin/bash
# name: hostexclude.sh
# synopsis: use join to auto-generate HOSTEXCLUDE lines for analog.cfg
# usage: hostexclude.sh
# Last Updated:
# Mon Jan 15 18:28:06 PST 2007
# check for YYYYMM on command line
# if [ -z "$1" ] = if test: zero-length-string 1st arg
# square brackets are actually the command 'test', so the following space is necessary
# leaving it out is saying "if test-z ..." and there is no command test-z
if [ -z "$1" ]
then
echo "usage: `basename $0` YYYYMM"
exit 1
fi
# assert: there was at least one command-line argument;
# try forming a filename from it
logfilename="log/access_log_$1"
# if test: not is-a-file $logfilename
if [ ! -f "$logfilename" ]
then
echo "$logfilename not found"
echo "try 'ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/'"
exit 1
fi
# make tmp dir, if needed
# if test: not is-a-directory tmp
if [ ! -d tmp ]
then
mkdir tmp
fi
# extract logfile lines with request for robots.txt
# save them in tmp/robot_access_lines_YYYYMM
robots_txt_req_lines_file="tmp/robot_access_lines_$1"
if [ -f $robots_txt_req_lines_file ]
then
echo $robots_txt_req_lines_file already exists.
else
# grep -F looks for fixed strings (no special character interpretation)
grep -F robots.txt $logfilename > $robots_txt_req_lines_file
wc -l $robots_txt_req_lines_file # show count of lines
fi
# tmp/robot_access_lines_YYYYMM entries look like:
# 64.4.8.135 - - [<date/time> -0800] "GET /robots.txt HTTP/1.0" 200 119 "-" "<user agent>"
# extract IP addresses, user agents from tmp/robot_access_lines_YYYYMM
robots_txt_req_IP_file="tmp/robot_access_IPs_$1"
if [ -f $robots_txt_req_IP_file ]
then
echo $robots_txt_req_IP_file already exists.
else
sed 's/"$//' $robots_txt_req_lines_file \
| sed 's/|/!/' \
| sed 's/ .*"/ |/' \
| sort -u \
> $robots_txt_req_IP_file
wc -l $robots_txt_req_IP_file # show count of lines
# lines look like:
# <IP address> | <user-agent>
# the pipe (|) field separator will make it easier to write the HOSTEXCLUDE lines
fi
# lookup hostnames for IP addresses
# change the following line if you have installed a newer version of dnstran:
dnscache_file='../dnstran1.5.2/dnscache'
dnscache_sorted_file='tmp/dnscache_sorted_by_IP'
# check whether dnscache_file exists
if ! [ -f $dnscache_file ]
then
echo $dnscache_file not found
else
# assert: $dnscache_file exists
# check whether dnscache_sorted_file exists and is newer than dnscache_file
if [ -f $dnscache_sorted_file -a $dnscache_sorted_file -nt $dnscache_file ]
then
echo $dnscache_sorted_file is up to date
else
echo updating $dnscache_sorted_file
cut -d' ' -s -f2,3 $dnscache_file | sort -u > $dnscache_sorted_file
ls -l $dnscache_sorted_file
fi
# assert: $dnscache_sorted_file exists, is up to date, and is sorted by IP (lexical sort)
# lines look like:
# 12.0.29.254 *
# 12.0.47.2 cam-att-fw-ext.camsys.com
fi
# hostname NOT known (last character of entry is an asterisk *):
IP_only_file="tmp/IP_only_$1"
if [ -f $IP_only_file ]
then
echo "$IP_only_file already exists"
else
# search for lines where last character is an asterisk (*):
grep '\*$' $dnscache_sorted_file > $IP_only_file
fi
# hostname known (last character of entry is NOT an asterisk *):
hostname_file="tmp/hostname_$1"
if [ -f $hostname_file ]
then
echo "$hostname_file already exists"
else
# search for lines where last character is NOT (-v) an asterisk (*):
grep -v '\*$' $dnscache_sorted_file > $hostname_file
fi
# use join to auto-generate HOSTEXCLUDE lines for analog.cfg
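# both input files are sorted on their first field (the IP address), as join requires;
# join pairs each dnscache entry with the matching robots.txt requestor, then awk
# (splitting on |) prints "HOSTEXCLUDE <IP or hostname> # <user-agent>"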
sed 's/ / |/' $IP_only_file | join - $robots_txt_req_IP_file | awk 'BEGIN { FS = "|" } ; {print "HOSTEXCLUDE " $1 " # " $3}' > exclude_IPs_$1
sed 's/ / |/' $hostname_file | join - $robots_txt_req_IP_file | awk 'BEGIN { FS = "|" } ; {print "HOSTEXCLUDE " $2 " # " $3}' > exclude_hosts_$1
exit 0 # successful completion!
# still have to manually add lines to analog.cfg
#
# it would not be too hard to find the beginning of the hostexcludes
# and replace them with the new lines.
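Here is one possible way to automate that last step (a sketch, untested; it assumes the "# BEGIN read" / "# END read" marker lines described in the README below are present in analog.cfg, and uses 200701 as the example month):
$ sed -n '1,/^# BEGIN read/p' analog.cfg > analog.cfg.new
$ cat ../robotFinder/exclude_hosts_200701 ../robotFinder/exclude_IPs_200701 >> analog.cfg.new
$ sed -n '/^# END read/,$p' analog.cfg >> analog.cfg.new
$ mv analog.cfg.new analog.cfg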
Here's a README for the script:
I. Start by linking to the log files.
log files are automatically generated by the Apache web server
and copied monthly by a cron script to /export/www/htdocs/weblogs/
cd to analog-www/:
$ cd analog-www
I like to make local links for the following reasons:
1. Having a local directory entry (link) makes scripting a little easier
2. Linking (making a local directory entry rather than copying) uses no disk space
3. Having a link to the logfile makes it harder to accidentally delete the logfile
(it won't go away until both directory entries have been rm'd)
Syntax is:
$ ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/
You can check the results:
$ ls -l log/
total 2564352
-rw-r--r-- 2 root root 184515350 Jan 31 23:58 access_log_200701
-rw-r--r-- 2 root root 148448116 Feb 28 23:53 access_log_200702
-rw-r--r-- 2 root root 172706277 Apr 1 01:00 access_log_200703
^^^
Note that the second column shows 2 links to each file.
II. Next, run analog (but we will run it again before we are done!)
Since we will run it more than once, I find it convenient to use a script to call analog.
The script is called shanalog.sh
edit shanalog.sh:
$ vi shanalog.sh
change the filenames to the current time period.
For instance, this time I changed 2006 to 2007 and 12 to 01 (three places):
# Execute analog and report magic on the monthly access_log
nice /home/steve/web/analog-www/analog log/access_log_200701;
# change reports_File_Out and website_Title
nice /home/steve/web/rmagic-2.21/rmagic.pl -reports_File_Out=/export/www/htdocs/wwwlogs/2007.01/ -website_Title="Web Statistics 2007.01";
now run the script:
$ ./shanalog.sh
This accomplishes two things:
1. It generates reports and saves them at URL http://www.techtransfer.berkeley.edu/wwwlogs/
2. It updates analog-www/dnscache;
that is, it does a DNS lookup on every IP address in the Apache logs,
trying to match each one to a hostname.
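To spot-check the cache, you can pull out the same fields hostexclude.sh uses (field 2 is the IP address, field 3 the hostname, with * meaning the lookup failed); assuming you are still in analog-www/:
$ cut -d' ' -s -f2,3 dnscache | head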
III. Next, run robotFinder/hostexclude.sh:
$ cd ../robotFinder
$ ./hostexclude.sh 200701
hostexclude.sh grabs all requests for robots.txt from the logfile,
tries to match the IP address of the requestor in dnscache,
auto-generates HOSTEXCLUDE directives for analog.cfg,
and then writes the results to exclude_hosts_YYYYMM (requestors whose hostname is known) and exclude_IPs_YYYYMM (requestors known only by IP address).
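The generated lines look like this (hostname, IP address, and user-agent are made-up examples):
HOSTEXCLUDE crawler.example.com # ExampleBot/1.0
HOSTEXCLUDE 12.34.56.78 # OtherBot/2.0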
IV. paste the results into analog.cfg:
one way to do this, using vi:
$ cd ../analog-www
$ vi analog.cfg
In vi, find the beginning of the auto-generated HOSTEXCLUDES:
In command mode, type "/Finder" to find the line
# BEGIN read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
do it again to find the end of the auto-generated HOSTEXCLUDES:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
you can either note the line number of the beginning and end of the directives and
issue a vi command something like
:90,306d
(delete lines 90 to 306)
or, from the first line of the directives, issue the command
:.,/Finder/ d
(delete from the current line (.) to the first line with "Finder" on it)
If you do it this way, make sure there are TWO lines like this:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
(since you will zap the first of them!)
Now you can insert the new lines with e.g. the commands
:r ../robotFinder/exclude_hosts_YYYYMM
and
:r ../robotFinder/exclude_IPs_YYYYMM
(these are the files hostexclude.sh wrote in step III)
save analog.cfg and quit (vi command :wq OR ZZ)
V. Now you have to run analog again!!!
(and you might have to rename or delete the web page directories in
http://www.techtransfer.berkeley.edu/wwwlogs/2007.01/ )
$ ./shanalog.sh