Human-centric web log analysis
Analog, the widely used web server log analysis package, seems designed to answer the questions a server administrator asks: volume of traffic, which directories get the most traffic, etc.
We are more interested in user behavior: how many people visited our site, which pages did they look at, etc.
To get close to an answer, I need to ignore a lot of server log lines.
Some are easy: in analog.cfg I can add:
# let's not even count these:
FILEEXCLUDE */images/*
FILEEXCLUDE */js/*
FILEEXCLUDE */css/*
This leaves only documents that a person would actually read: HTML, PDF, etc.
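Depending on the site, you might exclude a few more non-document requests the same way (favicon.ico is just an example; check what your own logs show):
FILEEXCLUDE /favicon.ico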
Next, ignore lines with status code 206 (Partial Content); these are byte-range requests for pieces of a file, not separate page views.
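To see how many lines this sets aside, you can count them directly (the status code is the ninth whitespace-separated field; compare the sample log line inside the script below, and the log/ link described in the README):
$ awk '$9 == 206' log/access_log_200701 | wc -l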
A lot of requests come from robots though.
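A rough count of the robot traffic, using the same heuristic the script below relies on (well-behaved robots request /robots.txt before crawling):
$ grep -cF robots.txt log/access_log_200701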
Here's a bash script I wrote to generate HOST_EXCLUDE directives for analog.cfg:
#!/bin/bash
# name: hostexclude.sh
# synopsis: use join to auto-generate HOSTEXCLUDE lines for analog.cfg
# usage: hostexclude.sh
# Last Updated:
# Mon Jan 15 18:28:06 PST 2007
# check for YYYYMM on command line
# if [ -z "$1" ] = if test: zero-length-string 1st arg
# square brackets are actually the command 'test', so the following space is necessary
# leaving it out is saying "if test-z ..." and there is no command test-z
if [ -z "$1" ]
then
echo "usage: `basename $0` YYYYMM"
exit 1
fi
# assert: there was at least one command-line argument;
# try forming a filename from it
logfilename="log/access_log_$1"
# if test: not is-a-file $logfilename
if [ ! -f "$logfilename" ]
then
echo "$logfilename not found"
echo "try 'ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/'"
exit 1
fi
# make tmp dir, if needed
# if test: not is-a-directory tmp
if [ ! -d tmp ]
then
mkdir tmp
fi
# extract logfile lines with request for robots.txt
# save them in tmp/robot_access_lines_YYYYMM
robots_txt_req_lines_file="tmp/robot_access_lines_$1"
if [ -f $robots_txt_req_lines_file ]
then
echo $robots_txt_req_lines_file already exists.
else
# grep -F looks for fixed strings (no special character interpretation)
grep -F robots.txt $logfilename > $robots_txt_req_lines_file
wc -l $robots_txt_req_lines_file # show count of lines
fi
# tmp/robot_access_lines_YYYYMM entries look like:
# 64.4.8.135 - - [<date/time> -0800] "GET /robots.txt HTTP/1.0" 200 119 "-" "<user agent>"
# extract IP addresses, user agents from tmp/robot_access_lines_YYYYMM
robots_txt_req_IP_file="tmp/robot_access_IPs_$1"
if [ -f $robots_txt_req_IP_file ]
then
echo $robots_txt_req_IP_file already exists.
else
sed 's/"$//' $robots_txt_req_lines_file \
| sed 's/|/!/' \
| sed 's/ .*"/ |/' \
| sort -u \
> $robots_txt_req_IP_file
wc -l $robots_txt_req_IP_file # show count of lines
# lines look like:
# <IP address> | <user-agent>
# the pipe (|) field separator will make it easier to write the HOSTEXCLUDE lines
fi
# lookup hostnames for IP addresses
# change the following line if you have installed a newer version of dnstran:
dnscache_file='../dnstran1.5.2/dnscache'
dnscache_sorted_file='tmp/dnscache_sorted_by_IP'
# check whether dnscache_file exists
if ! [ -f $dnscache_file ]
then
echo $dnscache_file not found
else
# assert: $dnscache_file exists
# check whether dnscache_sorted_file exists and is newer than dnscache_file
if [ -f $dnscache_sorted_file -a $dnscache_sorted_file -nt $dnscache_file ]
then
echo $dnscache_sorted_file is up to date
else
echo updating $dnscache_sorted_file
cut -d' ' -s -f2,3 $dnscache_file | sort -u > $dnscache_sorted_file
ls -l $dnscache_sorted_file
fi
# assert: $dnscache_sorted_file exists, is up to date, and is sorted by IP (lexical sort)
# lines look like:
# 12.0.29.254 *
# 12.0.47.2 cam-att-fw-ext.camsys.com
fi
# hostname NOT known (last character of entry is an asterisk *):
IP_only_file="tmp/IP_only_$1"
if [ -f $IP_only_file ]
then
echo "$IP_only_file already exists"
else
# search for lines where last character is an asterisk (*):
grep '\*$' $dnscache_sorted_file > $IP_only_file
fi
# hostname known (last character of entry is NOT an asterisk *):
hostname_file="tmp/hostname_$1"
if [ -f $hostname_file ]
then
echo "$hostname_file already exists"
else
# search for lines where last character is NOT (-v) an asterisk (*):
grep -v '\*$' $dnscache_sorted_file > $hostname_file
fi
# use join to auto-generate HOSTEXCLUDE lines for analog.cfg
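# both input files are sorted on their first field (the IP address), as join requires;
# join pairs each dnscache entry with the matching robots.txt requestor, then awk
# (splitting on |) prints "HOSTEXCLUDE <IP or hostname> # <user-agent>"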
sed 's/ / |/' $IP_only_file | join - $robots_txt_req_IP_file | awk 'BEGIN { FS = "|" } ; {print "HOSTEXCLUDE " $1 " # " $3}' > exclude_IPs_$1
sed 's/ / |/' $hostname_file | join - $robots_txt_req_IP_file | awk 'BEGIN { FS = "|" } ; {print "HOSTEXCLUDE " $2 " # " $3}' > exclude_hosts_$1
exit 0 # successful completion!
# still have to manually add lines to analog.cfg
#
# it would not be too hard to find the beginning of the hostexcludes
# and replace them with the new lines.
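Here is one possible way to automate that last step (a sketch, untested; it assumes the "# BEGIN read" / "# END read" marker lines described in the README below are present in analog.cfg, and uses 200701 as the example month):
$ sed -n '1,/^# BEGIN read/p' analog.cfg > analog.cfg.new
$ cat ../robotFinder/exclude_hosts_200701 ../robotFinder/exclude_IPs_200701 >> analog.cfg.new
$ sed -n '/^# END read/,$p' analog.cfg >> analog.cfg.new
$ mv analog.cfg.new analog.cfg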
Here's a README for the script:
I. Start by linking to the log files.
log files are automatically generated by the Apache web server
and copied monthly by a cron script to /export/www/htdocs/weblogs/
cd to analog-www/:
$ cd analog-www
I like to make local links for the following reasons:
1. Having a local directory entry (link) makes scripting a little easier
2. Linking (making a local directory entry rather than copying) uses no disk space
3. Having a link to the logfile makes it harder to accidentally delete the logfile
(it won't go away until both directory entries have been rm'd)
Syntax is:
$ ln /export/www/htdocs/weblogs/YYYY/Qn/access_log_YYYYMM log/
You can check the results:
$ ls -l log/
total 2564352
-rw-r--r-- 2 root root 184515350 Jan 31 23:58 access_log_200701
-rw-r--r-- 2 root root 148448116 Feb 28 23:53 access_log_200702
-rw-r--r-- 2 root root 172706277 Apr 1 01:00 access_log_200703
^^^
Note that the second column shows 2 links to each file.
II. Next, run analog (but we will run it again before we are done!)
Since we will run it more than once, I find it convenient to use a script to call analog.
The script is called shanalog.sh
edit shanalog.sh:
$ vi shanalog.sh
change the filenames to the current time period.
For instance, this time I changed 2006 to 2007 and 12 to 01 (three places):
# Execute analog and report magic on the monthly access_log
nice /home/steve/web/analog-www/analog log/access_log_200701;
# change reports_File_Out and website_Title
nice /home/steve/web/rmagic-2.21/rmagic.pl -reports_File_Out=/export/www/htdocs/wwwlogs/2007.01/ -website_Title="Web Statistics 2007.01";
now run the script:
$ ./shanalog.sh
This accomplishes two things:
1. It generates reports and saves them at URL http://www.techtransfer.berkeley.edu/wwwlogs/
2. It updates analog-www/dnscache;
that is, it does a DNS lookup on every IP address in the Apache logs,
trying to match each one to a hostname.
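To spot-check the cache, you can pull out the same fields hostexclude.sh uses (field 2 is the IP address, field 3 the hostname, with * meaning the lookup failed); assuming you are still in analog-www/:
$ cut -d' ' -s -f2,3 dnscache | head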
III. Next, run robotFinder/hostexclude.sh:
$ cd ../robotFinder
$ ./hostexclude.sh 200701
hostexclude.sh grabs all requests for robots.txt from the logfile,
tries to match the IP address of the requestor in dnscache,
auto-generates HOSTEXCLUDE directives for analog.cfg,
and then writes the results to exclude_hosts_YYYYMM (requestors whose hostname is known) and exclude_IPs_YYYYMM (requestors known only by IP address).
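The generated lines look like this (hostname, IP address, and user-agent are made-up examples):
HOSTEXCLUDE crawler.example.com # ExampleBot/1.0
HOSTEXCLUDE 12.34.56.78 # OtherBot/2.0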
IV. paste the results into analog.cfg:
one way to do this, using vi:
$ cd ../analog-www
$ vi analog.cfg
In vi, find the beginning of the auto-generated HOSTEXCLUDES:
In command mode, type "/Finder" to find the line
# BEGIN read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
do it again to find the end of the auto-generated HOSTEXCLUDES:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
you can either note the line number of the beginning and end of the directives and
issue a vi command something like
:90,306d
(delete lines 90 to 306)
or, from the first line of the directives, issue the command
:.,/Finder/ d
(delete from the current line (.) to the first line with "Finder" on it)
If you do it this way, make sure there are TWO lines like this:
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
# END read ../robotFinder/IP_only_YYYYMM ; read ../robotFinder/exclude_hosts_YYYYMM
(since you will zap the first of them!)
Now you can insert the new lines with e.g. the commands
:r ../robotFinder/exclude_hosts_YYYYMM
and
:r ../robotFinder/exclude_IPs_YYYYMM
(these are the files hostexclude.sh wrote in step III)
save analog.cfg and quit (vi command :wq OR ZZ)
V. Now you have to run analog again!!!
(and you might have to rename or delete the web page directories in
http://www.techtransfer.berkeley.edu/wwwlogs/2007.01/ )
$ ./shanalog.sh