Friday, May 23, 2008

find | xargs grep and Human readable site map

How to compile a human-readable site map?

Machine-readable Sitemaps are just a file full of URLs, but people generally want anchor text.

If you have access to your web server, one way is to grep out all the title tags from all the web files in all the directories. Since grep prints the filename before each match (when grepping multiple files), it's simple to change grep's output:
filename.php: <title>Page Title</title>
into anchor tags:
<a href="/filename.php">Page Title</a>
My grep doesn't have a recursive option (no 'grep -r' here), so a workaround is to use find and pipe the output to grep. What worked for me was
find .  -iname \*.php | xargs grep -i '<title'
-iname \*.php restricts the search to PHP files (grepping binaries like PDFs would be a waste of time).
xargs is needed because piping find's output straight into grep would make grep search the list of filenames as text; xargs turns the filenames into arguments so grep opens the files themselves.
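
Putting it together, a sed pass can rewrite grep's output into the anchor tags. This is only a sketch: it assumes lowercase <title> tags, one per line, and no colons in the filenames.
# turn "dir/page.php: <title>Page Title</title>"
# into  <a href="/dir/page.php">Page Title</a>
find . -iname \*.php | xargs grep '<title' | \
sed -e 's|^\./||' \
    -e 's|^\([^:]*\):.*<title>\([^<]*\)</title>.*|<a href="/\1">\2</a>|'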

This didn't capture the links the way they appear in the nav bars, so I manually grabbed the source from two of those (using Firefox's handy select, right-click, View Selection Source) and pulled out the URLs and anchor text.
This got me thinking that harvesting all the internal links from the site would be a good strategy also.
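
If your grep has -o (GNU grep does), something like this might be a starting point; it prints just the matching anchor tags, though it will miss any that span multiple lines:
find . -iname \*.php | xargs grep -ohi '<a href="[^"]*"[^>]*>[^<]*</a>' | sort -u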


Tuesday, May 20, 2008

sed and grep: "Stories" of web visitors

I've been thinking for a long time about crunching the web logs to produce "stories" showing how visitors navigate our site. We don't get so many visitors that it would be overwhelming to look at a day or two.

So I've started writing shell scripts to do the crunching.

They are located at /home/steve/webstats/trunk

The plan is, starting from a raw log file:
1. find just requests for pages (.php, .html, .pdf) with result code 200.

(Hmmm...
right now I am using grep -v to throw away the things that aren't (1.) above.
Is it simpler to use something like:
grep -E '] "GET [^ ]*\.(php|pdf|html)[^ ]* HTTP/1\.." 200 '
or
sed '/] "GET [^ ]*\.\(php\|pdf\|html\)[^ ]* HTTP\/1\.." 200 /!d'
to get "successful requests for pages?" (A bracket expression like [php|pdf|html] is a character class, not alternation, so the grep version needs -E and parentheses for the OR.)

sed doesn't do AAA OR BBB OR CCC very well; it looks like
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d    # most seds
gsed '/AAA\|BBB\|CCC/!d' # GNU sed only
sed is very fast, though.

This works at about 10,000 raw log lines/sec:
sed '/" 200 /!d' $logfilename | \
sed -e '/"GET [^ ]*\.php/b' \
-e '/"GET [^ ]*\.pdf/b' \
-e '/"GET [^ ]*\.ppt/b' \
-e '/"GET [^ ]*\.avi/b' \
-e '/"GET [^ ]*\.doc/b' \
-e '/"GET [^ ]*\.mpg/b' \
-e '/"GET [^ ]*\.flv/b' \
-e '/"GET [^ ]*\.html/b' \
-e '/"GET [^ ]*\/ HTTP/b' \
-e d > $temporary_story_file
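
For comparison, the same whitelist as a single egrep alternation might look like this (an untested sketch):
grep '" 200 ' $logfilename | \
grep -E '"GET [^ ]*(\.(php|pdf|ppt|avi|doc|mpg|flv|html)|/ HTTP)' > $temporary_story_file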
More importantly, if I whitelist the "good" file types, I have to modify the script every time we add a new type. And if I forget, the missing requests are easy to overlook.

The grep, grep, grep approach takes a while to process one log file: about 15 minutes so far...
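
That chain of greps is roughly this shape (a sketch, not the exact script; query strings like .gif?foo=bar would slip through):
grep -E -v '"GET [^ ]*\.(gif|jpe?g|png|ico) ' $logfilename | \
grep -v '"GET [^ ]*\.js ' | \
grep -v '"GET [^ ]*\.css ' | \
grep '" 200 ' > $temporary_story_file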

Tue May 20 20:44:19 PDT 2008
Lines in apache access log:
636818 /export/www/htdocs/weblogs/2008/Q2/access_log_200804

Dropping requests for images:

Tue May 20 20:54:32 PDT 2008
Lines left in temporary work file:
214027 /tmp/stories_14566

Dropping requests for javascripts:

Tue May 20 20:55:01 PDT 2008
Lines left in temporary work file:
203919 /tmp/stories_2_14566

Dropping requests for css:

Tue May 20 20:55:39 PDT 2008
Lines left in temporary work file:
192682 /tmp/stories_14566

Dropping partial or unsuccessful requests:

Tue May 20 20:59:28 PDT 2008
Lines left in temporary work file:
140429 /tmp/stories_2_14566


Anyway, next up is:
(1.5. weed out search / h@x0r bots)
2. sort by IP address and time
3. clean up the lines into a tab-delimited table with these fields (a rough sketch follows):
IP address | datestamp | requested URL | referring URL | browser type
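
Assuming the standard Apache combined log format, steps 1.5 through 3 might start out like this rough sketch (the bot list is just a few obvious names, stories.tab is a made-up output name, and sorting on the raw datestamp is only good enough within a single month's log):
# drop a couple of obvious bots by user agent, then split each line into
# IP, datestamp, requested URL, referrer, browser (tab-delimited)
grep -E -v -i 'googlebot|slurp|msnbot' $temporary_story_file | \
awk -F'"' '{
    split($1, head, " ")      # head[1] = IP, head[4] = "[dd/Mon/yyyy:hh:mm:ss"
    split($2, req, " ")       # req[2]  = requested URL
    sub(/^\[/, "", head[4])   # strip the leading "["
    print head[1] "\t" head[4] "\t" req[2] "\t" $4 "\t" $6
}' | sort -k1,1 -k2,2 > stories.tab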

Of course you can buy tools that claim to do this...
