Scraping HTML with web services
<p>After checking out Bill Humphries’ approach to scraping yesterday, I recalled the various things Jon Udell has written about URL-as-command-line, and the various places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that there’s an HTML Tidy service offered by W3C as well.</p>
<p>So… these are all URLs. I figured I could pull together the site <span class="caps">URL</span>, <a href="http://www.whump.com/dropbox/nationrss/nation.xsl">Bill’s <span class="caps">XSLT</span></a>, the tidy service, and the <span class="caps">XSLT</span> service, and have a whole lot of scraping going on right in my browser or via wget or curl. Here are the steps in how I composed the <span class="caps">URL</span>:</p>
<ol>
<li>Start with the page to scrape: <a href="http://www.thenation.com">http://www.thenation.com</a></li>
<li>Escape that URL once, as the Tidy service’s <code>docAddr</code> parameter: <code>http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.thenation.com</code></li>
<li>Escape the Tidy URL again, as the XSLT service’s <code>xmlfile</code> parameter, alongside Bill’s stylesheet as <code>xslfile</code>: <code>xslfile=http%3A%2F%2Fwww.whump.com%2Fdropbox%2Fnationrss%2Fnation.xsl&amp;xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3FdocAddr%3Dhttp%253A%252F%252Fwww.thenation.com&amp;transform=Submit</code></li>
</ol>
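<p>The double escaping in the steps above can be sketched in a few lines of Python. This is just an illustration of the composition: the XSLT service endpoint is a placeholder here, since its base URL isn’t spelled out above.</p>

```python
from urllib.parse import urlencode

site = "http://www.thenation.com"
xsl = "http://www.whump.com/dropbox/nationrss/nation.xsl"

# Step 2: wrap the site URL in the W3C Tidy service; the site URL is
# escaped once as Tidy's docAddr parameter.
tidied = "http://cgi.w3.org/cgi-bin/tidy?" + urlencode({"docAddr": site})

# Step 3: hand the tidied document and the stylesheet to the XSLT
# service; the Tidy URL gets escaped a second time as xmlfile.
XSLT_SERVICE = "http://example.org/xslt"  # placeholder, not the real endpoint
final = XSLT_SERVICE + "?" + urlencode(
    {"xslfile": xsl, "xmlfile": tidied, "transform": "Submit"}
)
print(final)
```

Each level of nesting adds one more round of percent-encoding, which is why the site URL ends up as <code>http%253A%252F%252F...</code> by the time it reaches the XSLT service.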
<p>Unfortunately, this doesn’t work: step #2 fails, with the Tidy service reporting an error while processing the original <span class="caps">HTML</span>. Had that worked, I imagine the whole process at step #3 would be producing <span class="caps">RSS</span>. On my command line, <span class="caps">HTML</span> Tidy works fine, so I’ve been thinking of throwing together my own web interface to that program and seeing if that works.</p>
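<p>A minimal sketch of such a web interface, mimicking the W3C service’s <code>docAddr</code> parameter and shelling out to the local <code>tidy</code> binary (assumed to be on the <code>PATH</code>):</p>

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen

class TidyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        doc_addr = params.get("docAddr", [None])[0]
        if not doc_addr:
            self.send_error(400, "missing docAddr parameter")
            return
        raw_html = urlopen(doc_addr).read()
        # -asxml asks tidy for well-formed XHTML that XSLT can consume;
        # --force-output keeps it emitting a document despite bad markup.
        result = subprocess.run(
            ["tidy", "-quiet", "-asxml", "--force-output", "yes"],
            input=raw_html, capture_output=True,
        )
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.end_headers()
        self.wfile.write(result.stdout)

# To run: HTTPServer(("", 8080), TidyHandler).serve_forever()
```

With this running locally, the <code>xmlfile</code> parameter from the steps above would point at <code>http://localhost:8080/?docAddr=...</code> instead of the W3C service.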
<p>If that works, this pipeline, with the addition of a cache at each stage, could allow for what I think is a pretty nifty, all web-based means of scraping news items from web sites.</p>
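<p>The per-stage cache could be as simple as a fetch wrapper that keeps recent responses on disk, so each service in the chain gets hit at most once per expiry window. A rough sketch, with illustrative names and paths:</p>

```python
import hashlib
import os
import time
from urllib.request import urlopen

CACHE_DIR = "/tmp/scrape-cache"   # illustrative location
MAX_AGE = 30 * 60                 # refetch after half an hour

def cache_path(url):
    """Stable on-disk filename for a URL's cached response."""
    return os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())

def cached_fetch(url):
    """Fetch url, reusing a recent on-disk copy when one exists."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = cache_path(url)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < MAX_AGE:
        with open(path, "rb") as f:
            return f.read()
    data = urlopen(url).read()
    with open(path, "wb") as f:
        f.write(data)
    return data
```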
<p>What would really be nice for apps like this is a better way to express URLs-within-URLs without escaping and escaping and escaping and... I’m thinking some very lightweight scripting, or some LISP-ish expressions, would help here.</p>
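<p>One way the LISP-ish idea might look: nested expressions stand for service calls, and a tiny evaluator does all the escaping. This is an entirely hypothetical notation, not an existing tool, and the XSLT endpoint is again a placeholder:</p>

```python
from urllib.parse import urlencode

def expand(expr):
    """Evaluate (base, {param: value}) pairs; a value may itself be
    such a pair, in which case it is expanded (and thus escaped by
    urlencode) one level deeper."""
    if isinstance(expr, str):
        return expr
    base, params = expr
    return base + "?" + urlencode({k: expand(v) for k, v in params.items()})

url = expand((
    "http://example.org/xslt",  # placeholder service endpoint
    {"xslfile": "http://www.whump.com/dropbox/nationrss/nation.xsl",
     "xmlfile": ("http://cgi.w3.org/cgi-bin/tidy",
                 {"docAddr": "http://www.thenation.com"}),
     "transform": "Submit"},
))
```

The writer only ever nests plain URLs; the evaluator handles the escaping-of-escaping automatically, however deep the chain goes.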