Scraping HTML with web services
After checking out Bill Humphries’ approach to scraping yesterday, I recalled what Jon Udell has written about the URL-as-command-line, and the various places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that the W3C offers an HTML Tidy service.
<p>So… these are all URLs. I figured I could pull together the site <span class="caps">URL</span>, <a href="http://www.whump.com/dropbox/nationrss/nation.xsl">Bill’s <span class="caps">XSLT</span></a>, the Tidy service, and the <span class="caps">XSLT</span> service, and have a whole lot of scraping going on right in my browser or via wget or curl. Here are the steps I took to compose the <span class="caps">URL</span>:</p> <ol> <li><a href="http://www.thenation.com">http://www.thenation.com</a></li> </ol>
<p>Unfortunately, this doesn’t work. In particular, step #2 fails, with the Tidy service reporting a failure in processing the original <span class="caps">HTML</span>. I imagine that, had that worked, the whole process at step #3 would be producing <span class="caps">RSS</span>. On my command line, <span class="caps">HTML</span> Tidy works fine, so I’ve been thinking of throwing together my own web interface to that program and seeing if that works.</p> <p>If it does, this approach, with a cache added at each stage, could make for what I think is a pretty nifty, entirely web-based way of scraping news items from web sites.</p> <p>What would really be nice for apps like this is a better way to express the URLs-within-URLs without escaping and escaping and escaping and... I’m thinking some very lightweight scripting, or some LISP-ish expressions, would help.</p>
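<p>As a rough sketch of what such LISP-ish expressions might look like, here is a small Python helper that takes a nested (service, parameters) expression and does the escaping-within-escaping for you. The service endpoints (<code>tidy.example.org</code>, <code>xslt.example.org</code>) and the parameter names (<code>docAddr</code>, <code>xslfile</code>, <code>xmlfile</code>) are placeholders assumed for illustration, not the real services' addresses or interfaces; only the site <span class="caps">URL</span> and the stylesheet <span class="caps">URL</span> come from the steps above.</p>

```python
from urllib.parse import urlencode

# Hypothetical service endpoints: stand-ins for illustration only,
# not the actual addresses of the Tidy or XSLT services.
SERVICES = {
    "tidy": "http://tidy.example.org/tidy",
    "xslt": "http://xslt.example.org/xslt",
}


def compose(expr):
    """Turn a nested (service, params) expression into one escaped URL.

    A plain string is treated as a finished URL and returned unchanged.
    A tuple is (service_name, params); every parameter value is composed
    first, then percent-encoded once as it goes into the outer query string.
    """
    if isinstance(expr, str):
        return expr
    service, params = expr
    inner = {name: compose(value) for name, value in params.items()}
    return SERVICES[service] + "?" + urlencode(inner)


# The pipeline reads inside-out: tidy the site, then run the stylesheet
# over the tidied result. Parameter names here are assumed, not confirmed.
pipeline = (
    "xslt",
    {
        "xslfile": "http://www.whump.com/dropbox/nationrss/nation.xsl",
        "xmlfile": ("tidy", {"docAddr": "http://www.thenation.com"}),
    },
)

url = compose(pipeline)
print(url)
```

<p>Each level of nesting gets exactly one more round of percent-encoding, so the innermost site <span class="caps">URL</span> ends up escaped twice in the final string, and nobody has to do that by hand. The result could then be handed to a browser, wget, or curl as before.</p>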