
I use Python for this kind of stuff, with SGMLParser (from the sgmllib module). There are some easy examples at http://www.boddie.org.uk/python/HTML.html. Python seems to be the preferred scripting language on Ubuntu. Get Eclipse and PyDev. Alternatively, IPython is a pretty neat shell with autocomplete, command history, persistent environment variables between shells, etc.
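To give a rough idea of what that kind of parsing looks like, here is a minimal sketch that pulls the link targets out of a page. Note that it uses html.parser.HTMLParser from the Python 3 standard library rather than SGMLParser, since sgmllib only exists in Python 2; the handler style is much the same. The names LinkCollector and extract_links and the sample HTML are made up for illustration.

```python
# Minimal sketch: collect href values from <a> tags.
# Uses html.parser.HTMLParser (Python 3 stdlib) as a stand-in for
# sgmllib.SGMLParser, which is only available in Python 2.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that remembers every href it sees on an <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed a string of HTML to the parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Made-up sample input, just to show the call.
    sample = '<p>See <a href="http://example.com/a">A</a> and <a href="/b">B</a>.</p>'
    print(extract_links(sample))
```

Subclassing the parser and overriding handle_starttag is the same pattern the SGMLParser examples at the link above follow, so they should translate without much trouble.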
Not sure what the Python equivalent of the DIY tools is. Perhaps myrddin can help out.
