Tuesday, January 16, 2007

How I do Web Stats

Distinct hosts served:
ACE: we don't have a quarterly report, so I estimate from the TTP numbers, using the ratios of the smallest and largest monthly counts of "Distinct hosts served" to the quarterly count.
TTP: cut -d' ' -f2 dnscache | sort -u | wc -l
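As a sketch, here is that TTP pipeline run on a few made-up dnscache-style lines (the timestamp/IP/hostname field layout is an assumption based on the cut -f2 above):

```shell
# count distinct client IPs; the three sample lines are fabricated,
# with a duplicate IP to show that sort -u collapses repeats
distinct=$(printf '%s\n' \
  '1168900000 12.0.29.254 *' \
  '1168900001 12.0.47.2 cam-att-fw-ext.camsys.com' \
  '1168900002 12.0.29.254 *' \
  | cut -d' ' -f2 | sort -u | wc -l)
echo "distinct hosts: $distinct"   # → distinct hosts: 2
```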

Successful requests for pages (approx # of human page-views)
ACE: look at the File Type Report; add up requests for .awp, .dll, .pdf, and .htm.
TTP: I just use the "Successful requests for pages" line in the General Summary report. (But this is apparently not what I did last quarter. What?) If I instead subtract requests for .gif and .jpg, I get too low a number. This depends on how "successful requests for pages" is defined in analog.cfg.

Friday, January 05, 2007

Human-centric web log analysis

Analog, the widely-used web server log analysis package, seems designed to give answers to questions a server administrator asks: volume of traffic, which directories get most traffic, etc.

We are more interested in user behavior: how many people visited our site, which pages did they look at, etc.

To get close to an answer, I need to ignore a lot of server log lines.

Some are easy: in analog.cfg I can add:

# let's not even count these:
FILEEXCLUDE */images/*
FILEEXCLUDE */js/*
FILEEXCLUDE */css/*

This leaves only documents that a person would read: HTML, PDF, etc.

Next is to ignore lines with status code 206 (Partial Content), which are byte-range continuations of earlier requests rather than separate page views.
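In analog.cfg that would look something like the following (I believe newer analog versions support a STATUSEXCLUDE directive; check your version's documentation before relying on it):

```
# don't count partial-content responses as page views:
STATUSEXCLUDE 206
```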

A lot of requests come from robots, though.

Here's a bash script I wrote to generate HOSTEXCLUDE directives for analog.cfg:


#!/bin/bash

# name: hostexclude.sh
# synopsis: use join to auto-generate HOSTEXCLUDE lines for analog.cfg
# usage: hostexclude.sh YYYYMM

# Last Updated:
# Mon Jan 15 18:28:06 PST 2007

# check for YYYYMM on command line
# if [ -z "$1" ] = if test: zero-length-string 1st arg
# square brackets are actually the command 'test', so the following space is necessary
# leaving it out is saying "if test-z ..." and there is no command 'test-z'
if [ -z "$1" ]
then
echo "usage: `basename $0` YYYYMM"
exit 1
fi

# assert: there was at least one command-line argument;
# try forming a filename from it
logfilename="log/access_log_$1"

# if test: not is-a-file $logfilename
if [ ! -f "$logfilename" ]
then
echo "$logfilename not found"
echo "try 'ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/'"
exit 1
fi

# make tmp dir, if needed
# if test: not is-a-directory tmp
if [ ! -d tmp ]
then
mkdir tmp
fi

# extract logfile lines with request for robots.txt
# save them in tmp/robot_access_lines_YYYYMM
robots_txt_req_lines_file="tmp/robot_access_lines_$1"

if [ -f $robots_txt_req_lines_file ]
then
echo $robots_txt_req_lines_file already exists.
else
# grep -F looks for fixed strings (no special character interpretation)
grep -F robots.txt $logfilename > $robots_txt_req_lines_file
wc -l $robots_txt_req_lines_file # show count of lines
fi
# tmp/robot_access_lines_YYYYMM entries look like:
# 64.4.8.135 - - [<date/time> -0800] "GET /robots.txt HTTP/1.0" 200 119 "-" "<user agent>"

# extract IP addresses, user agents from tmp/robot_access_lines_YYYYMM

robots_txt_req_IP_file="tmp/robot_access_IPs_$1"

if [ -f $robots_txt_req_IP_file ]
then
echo $robots_txt_req_IP_file already exists.
else
sed 's/"$//' $robots_txt_req_lines_file \
| sed 's/|/!/' \
| sed 's/ .*"/ |/' \
| sort -u \
> $robots_txt_req_IP_file
wc -l $robots_txt_req_IP_file # show count of lines
# lines look like:
# <IP address> | <user-agent>
# the pipe (|) field separator will make it easier to write the HOSTEXCLUDE lines
fi
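To see what the three sed passes actually do, here they are applied to a single made-up log line (the IP and user-agent are fabricated for illustration):

```shell
# 1st sed: strip the trailing quote; 2nd: neutralize any literal pipes
# in the user-agent; 3rd: collapse everything from the first space
# through the (now) last quote into " |", leaving "IP |user-agent"
line='64.4.8.135 - - [16/Jan/2007:10:00:00 -0800] "GET /robots.txt HTTP/1.0" 200 119 "-" "TestBot/1.0"'
extracted=$(printf '%s\n' "$line" \
  | sed 's/"$//' \
  | sed 's/|/!/' \
  | sed 's/ .*"/ |/')
echo "$extracted"   # → 64.4.8.135 |TestBot/1.0
```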


# lookup hostnames for IP addresses

# change the following line if you have installed a newer version of dnstran:
dnscache_file='../dnstran1.5.2/dnscache'
dnscache_sorted_file='tmp/dnscache_sorted_by_IP'

# check whether dnscache_file exists
if ! [ -f $dnscache_file ]
then
echo $dnscache_file not found
else
# assert: $dnscache_file exists
# check whether dnscache_sorted_file exists and is newer than dnscache_file
if [ -f $dnscache_sorted_file -a $dnscache_sorted_file -nt $dnscache_file ]
then
echo $dnscache_sorted_file is up to date
else
echo updating $dnscache_sorted_file
cut -d' ' -s -f2,3 $dnscache_file | sort -u > $dnscache_sorted_file
ls -l $dnscache_sorted_file
fi
# assert: $dnscache_sorted_file exists, is up to date, and is sorted by IP (lexical sort)
# lines look like:
# 12.0.29.254 *
# 12.0.47.2 cam-att-fw-ext.camsys.com
fi
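For reference, the cut invocation above just keeps the IP/hostname pair; on a sample dnscache-style line (the leading-timestamp layout is an assumption inferred from -f2,3):

```shell
# -s drops lines that lack the delimiter; -f2,3 keeps IP and hostname
pair=$(printf '1168900000 12.0.47.2 cam-att-fw-ext.camsys.com\n' \
  | cut -d' ' -s -f2,3)
echo "$pair"   # → 12.0.47.2 cam-att-fw-ext.camsys.com
```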

# hostname NOT known (last character of entry is an asterisk *):
IP_only_file="tmp/IP_only_$1"
if [ -f $IP_only_file ]
then
echo "$IP_only_file already exists"
else
# search for lines where last character is an asterisk (*):
grep '\*$' $dnscache_sorted_file > $IP_only_file
fi

# hostname known (last character of entry is NOT an asterisk *):
hostname_file="tmp/hostname_$1"
if [ -f $hostname_file ]
then
echo "$hostname_file already exists"
else
# search for lines where last character is NOT (-v) an asterisk (*):
grep -v '\*$' $dnscache_sorted_file > $hostname_file
fi

# use join to auto-generate HOSTEXCLUDE lines for analog.cfg

sed 's/ / |/' $IP_only_file \
| join - $robots_txt_req_IP_file \
| awk 'BEGIN { FS = "|" } ; { print "HOSTEXCLUDE " $1 " # " $3 }' \
> exclude_IPs_$1

sed 's/ / |/' $hostname_file \
| join - $robots_txt_req_IP_file \
| awk 'BEGIN { FS = "|" } ; { print "HOSTEXCLUDE " $2 " # " $3 }' \
> exclude_hosts_$1
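The join trick can be tried in isolation on two one-line files; both sides are already sorted by IP, which join requires (the IP and agent string are made up):

```shell
tmpd=$(mktemp -d)
# dnscache-style entry with no hostname:
echo '12.0.29.254 *' > "$tmpd/ip_only"
# robots.txt requester, "IP |user-agent" as produced by the sed pipeline:
echo '12.0.29.254 |TestBot/1.0' > "$tmpd/robot_ips"
directive=$(sed 's/ / |/' "$tmpd/ip_only" \
  | join - "$tmpd/robot_ips" \
  | awk 'BEGIN { FS = "|" } ; { print "HOSTEXCLUDE " $1 " # " $3 }')
echo "$directive"
rm -r "$tmpd"
```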



exit 0 # successful completion!

# still have to manually add lines to analog.cfg
#
# it would not be too hard to find the beginning of the hostexcludes
# and replace them with the new lines.



Here's a README for the script:


I. Start by linking to the log files.
Log files are automatically generated by the Apache web server
and copied monthly by a cron script to /export/www/htdocs/weblogs/

cd to analog-www/:

$ cd analog-www

I like to make local links for the following reasons:

1. Having a local directory entry (link) makes scripting a little easier

2. Linking (making a local directory entry rather than copying) uses no disk space

3. Having a link to the logfile makes it harder to accidentally delete the logfile
(it won't go away until both directory entries have been rm'd)
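Point 3 is easy to demonstrate in a scratch directory: the data survives until its last name is removed.

```shell
tmpd=$(mktemp -d)
echo 'sample log line' > "$tmpd/access_log"
ln "$tmpd/access_log" "$tmpd/local_link"               # second directory entry
links=$(ls -ld "$tmpd/local_link" | awk '{print $2}')  # link-count column of ls -l
rm "$tmpd/access_log"                                  # remove one name...
content=$(cat "$tmpd/local_link")                      # ...the data is still there
echo "links=$links content=$content"
rm -r "$tmpd"
```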

Syntax is:

$ ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/

You can check the results:

$ ls -l log/
total 2564352
-rw-r--r-- 2 root root 184515350 Jan 31 23:58 access_log_200701
-rw-r--r-- 2 root root 148448116 Feb 28 23:53 access_log_200702
-rw-r--r-- 2 root root 172706277 Apr 1 01:00 access_log_200703
^^^
Note that the second column shows 2 links to each file.

II. Next, run analog (but we will run it again before we are done!)
Since we will run it more than once, I find it convenient to use a script to call analog.
The script is called shanalog.sh

edit shanalog.sh:

$ vi shanalog.sh

change the filenames to the current time period.
For instance, this time I changed 2006 to 2007 and 12 to 01 (three places):

# Execute analog and report magic on the monthly access_log
nice /home/steve/web/analog-www/analog log/access_log_200701;
# change reports_File_Out and website_Title
nice /home/steve/web/rmagic-2.21/rmagic.pl -reports_File_Out=/export/www/htdocs/wwwlogs/2007.01/ -website_Title="Web Statistics 2007.01";

now run the script:

$ ./shanalog.sh

This accomplishes two things:

1. It generates reports and saves them at URL http://www.techtransfer.berkeley.edu/wwwlogs/

2. It updates analog-www/dnscache,
that is, it does DNS host lookup on all the IP addresses in the Apache logs,
that is, it tries to match a hostname to the IP address.

III. Next, run robotFinder/hostexclude.sh:

$ cd ../robotFinder

$ ./hostexclude.sh 200701

hostexclude.sh grabs all requests for robots.txt from the logfile,
tries to match the IP address of the requestor in dnscache,
auto-generates HOSTEXCLUDE directives for analog.cfg,
and then writes the resulting HOSTEXCLUDE lines to exclude_hosts_YYYYMM and exclude_IPs_YYYYMM.

IV. paste the results into analog.cfg:

one way to do this, using vi:

$ cd ../analog-www

$ vi analog.cfg

In vi, find the beginning of the auto-generated HOSTEXCLUDES:

In command mode, type "/Finder" to find the line
# BEGIN read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM

do it again to find the end of the auto-generated HOSTEXCLUDES:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM

you can either note the line number of the beginning and end of the directives and
issue a vi command something like
:90,306d
(delete lines 90 to 306)

or, from the first line of the directives, issue the command

:.,/Finder/ d
(delete from the current line (.) to the first line with "Finder" on it)
If you do it this way, make sure there are TWO lines like this:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM

(since you will zap the first of them!)

Now you can insert the new lines with e.g. the command

:r hostname_YYYYMM

and

:r IP_only_YYYYMM

save analog.cfg and quit (vi command :wq OR ZZ)
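If you'd rather not do step IV by hand, the marker-delimited block can be spliced with sed. This is a sketch, assuming GNU sed (for -i and the empty-regex `//!d` keep-the-markers idiom); the demo builds a throwaway config in a temp directory so nothing real is touched:

```shell
tmpd=$(mktemp -d)
cat > "$tmpd/analog.cfg" <<'EOF'
FILEEXCLUDE */images/*
# BEGIN read ../robotFinder/exclude_hosts_YYYYMM
HOSTEXCLUDE stale.example.com # OldBot
# END read ../robotFinder/exclude_hosts_YYYYMM
FILEEXCLUDE */css/*
EOF
echo 'HOSTEXCLUDE crawler.example.com # TestBot' > "$tmpd/exclude_hosts_200701"
# delete the old lines strictly between the markers (the markers survive):
sed -i '/^# BEGIN read/,/^# END read/{//!d;}' "$tmpd/analog.cfg"
# read the new directives in right after the BEGIN marker:
sed -i '/^# BEGIN read/r '"$tmpd/exclude_hosts_200701" "$tmpd/analog.cfg"
result=$(grep HOSTEXCLUDE "$tmpd/analog.cfg")
echo "$result"   # → HOSTEXCLUDE crawler.example.com # TestBot
rm -r "$tmpd"
```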


V. Now you have to run analog again!!!

(and you might have to rename or delete the web page directories in
http://www.techtransfer.berkeley.edu/wwwlogs/2007.03/ )

$ ./shanalog.sh