0xDECAFBAD

It's all spinning wheels and self-doubt until the first pot of coffee.

Scraping HTML with web services

After checking out Bill Humphries’ approach to scraping yesterday, I recalled the various things Jon Udell has written about URL-as-command-line and the various places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that the W3C offers an HTML Tidy service.

So… these are all URLs. I figured I could pull together the site URL, Bill’s XSLT (http://www.whump.com/dropbox/nationrss/nation.xsl), the tidy service, and the XSLT service, and have a whole lot of scraping going on right in my browser or via wget or curl. Here are the steps I took to compose the URL:

  1. http://www.thenation.com
  2. http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.thenation.com
  3. http://www.w3.org/2000/06/webdata/xslt?
     xslfile=http%3A%2F%2Fwww.whump.com%2Fdropbox%2Fnationrss%2Fnation.xsl&
     xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3F
     docAddr%3Dhttp%253A%252F%252Fwww.thenation.com&transform=Submit
Unfortunately, this doesn’t work. In particular, step #2 fails: the Tidy service reports a failure in processing the original HTML. I imagine that, had that worked, the whole process at step #3 would be producing RSS. On my command line, HTML Tidy works fine, so I’ve been thinking of throwing together my own web interface to that program and seeing if that works.
    
If it works, this, with the addition of a cache at each stage, could allow for what I think is a pretty nifty, all-web-based means of scraping news items from web sites.
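As a rough sketch of what a cache at each stage might look like, here is a small in-memory cache keyed by URL, written in Python. Everything in it is an assumption for illustration: the `StageCache` name, the `max_age` expiry, and the injectable `fetch` and `clock` parameters are hypothetical, and a real service would probably want an on-disk store and would honor HTTP caching headers.

```python
import time
from urllib.request import urlopen


class StageCache:
    """Hypothetical per-stage cache: remembers the body fetched for each URL."""

    def __init__(self, fetch=None, max_age=900, clock=time.time):
        # fetch: callable taking a URL and returning the response body.
        # Defaults to a plain HTTP GET; injectable so it can be swapped out.
        self._fetch = fetch if fetch is not None else self._http_get
        self._max_age = max_age  # seconds before an entry is considered stale
        self._clock = clock
        self._entries = {}  # url -> (time stored, body)

    @staticmethod
    def _http_get(url):
        with urlopen(url) as response:
            return response.read()

    def get(self, url):
        """Return the cached body for url, refetching if missing or stale."""
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

Each stage (the Tidy call, then the XSLT call) would ask the cache for its input URL instead of fetching it directly, so a repeated request within `max_age` seconds never reaches the upstream service.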
    
What would really be nice for apps like this is a better way to express the URLs-within-URLs without escaping and escaping and escaping and... I’m thinking some very lightweight scripting, or some LISP-ish expressions, would help.
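One way the lightweight-scripting idea could look, sketched in Python with the standard `urllib.parse` module: each service becomes a small function that returns a URL, and nesting the calls takes care of the repeated escaping. The function names `tidy` and `xslt` are hypothetical; the endpoints and parameter names (`docAddr`, `xslfile`, `xmlfile`, `transform`) are taken from the URLs composed above.

```python
from urllib.parse import urlencode

TIDY_SERVICE = "http://cgi.w3.org/cgi-bin/tidy"
XSLT_SERVICE = "http://www.w3.org/2000/06/webdata/xslt"


def tidy(doc_url):
    """Build the Tidy service URL that cleans up the document at doc_url."""
    # urlencode percent-escapes the value, including ':' and '/'.
    return TIDY_SERVICE + "?" + urlencode({"docAddr": doc_url})


def xslt(xsl_url, xml_url):
    """Build the XSLT service URL that applies xsl_url to xml_url."""
    query = urlencode({
        "xslfile": xsl_url,
        "xmlfile": xml_url,  # already-escaped parts get escaped a second time
        "transform": "Submit",
    })
    return XSLT_SERVICE + "?" + query


# The whole pipeline as one nested, LISP-ish expression.
scrape_url = xslt(
    "http://www.whump.com/dropbox/nationrss/nation.xsl",
    tidy("http://www.thenation.com"),
)
```

Here `scrape_url` comes out as the same string as the step 3 URL above, joined onto one line, with the `%253A`-style double escaping produced by the nesting rather than by hand.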
    


    Archived Comments

    • Does the w3c Tidy service support the force output option? That's what I had to do with command line Tidy to get something well formed from The Nation's home page.
    • Unfortunately, it seems that the W3C service only offers an indentation option.
    • It's tempting to take the script, and offer it as a service myself, with the force output option. However, I'd need to wrap an authorization service in front of it so it doesn't kill my bandwidth.