Friday, May 23, 2008

find | xargs grep and Human readable site map

How to compile a human-readable site map?

Machine-readable Sitemaps are just a file full of URLs, but people generally want anchor text.

If you have access to your web server, one way is to grep out all the title tags from all the web files in all the directories. Since grep prints the filename before each match (when grepping multiple files), it's simple to change grep's output:
filename.php: <title>Page Title</title>
into anchor tags:
<a href="/filename.php">Page Title</a>
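That rewrite can be done with a sed one-liner. This is just a sketch, assuming each title tag fits on one line and the site is rooted at /; the sample filename is made up:

```shell
# Sketch: turn grep's "file: <title>...</title>" output into an anchor tag.
# Assumes one <title> per line; the filename here is hypothetical.
echo 'filename.php: <title>Page Title</title>' |
  sed -E 's|^(\./)?([^:]+): *<title>([^<]*)</title>.*|<a href="/\2">\3</a>|'
# → <a href="/filename.php">Page Title</a>
```

The optional `(\./)?` strips the `./` prefix that find puts on each path, so the href comes out site-rooted.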
GNU grep actually does have a recursive flag ('grep -r'), but piping find into grep works on just about any system and makes it easy to restrict the search to particular filenames. What worked for me was
find .  -iname \*.php | xargs grep -i '<title'
-iname \*.php restricts the search to PHP files (grepping binary files like PDFs would be a waste of time)
xargs is needed because grep takes filenames as arguments, not on standard input: piping find's output straight into grep would make grep search the list of filenames itself, rather than the files it names. xargs turns that list of names into arguments for grep.
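Both approaches can be tried side by side on a throwaway directory. A sketch, with made-up file names (note that --include is a GNU grep option):

```shell
# Sketch: compare find|xargs with grep's built-in recursion.
# The directory layout and filenames are invented for the demo.
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
echo '<title>Home</title>' > "$tmp/index.php"
echo '<title>Deep</title>' > "$tmp/sub/page.php"

# The find | xargs pipeline from the post (portable, filters by name):
find "$tmp" -iname '*.php' | xargs grep -i '<title'

# GNU grep's recursive mode with a filename filter finds the same matches:
grep -ri --include='*.php' '<title' "$tmp"

rm -rf "$tmp"
```

Either command prints one filename:match line per PHP file containing a title tag.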

This didn't capture the links as they appear on the nav bars, so I manually grabbed the source from two of those (using Firefox's handy select, right-click, View Selection Source) and pulled out the URLs and anchor text.
This got me thinking that harvesting all the internal links from the site would be a good strategy also.
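The same toolbox can do a rough first pass at harvesting links; here is a sketch that pulls href values out of a chunk of saved HTML (the sample anchors are invented, and real input would come from a saved page rather than echo):

```shell
# Sketch: extract href targets from HTML, one per line.
# The sample markup is hypothetical.
printf '%s\n' '<a href="/about.php">About</a> <a href="/contact.php">Contact</a>' |
  grep -Eo 'href="[^"]+"' |
  sed 's/^href="//; s/"$//'
# → /about.php
# → /contact.php
```

Filtering the output to paths on your own domain would leave just the internal links.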
