Scraping with web services: Success
Okay, so I took another shot at scraping HTML with web services, this time using a site that passes the HTML Tidy step. Luckily, it's a site I already scrape with my own tool, so I already have XPath expressions cooked up to dig out the info for RSS items. So, here are the vitals:
- Site: http://www.jlist.com
- XSL: http://www.decafbad.com/jlist.xsl
- Tidy URL: http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.jlist.com%2FUPDATES%2FPG%2F365%2F
- Final URL: http://www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fwww.decafbad.com%2Fjlist.xsl&xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3FdocAddr%3Dhttp%253A%252F%252Fwww.jlist.com%252FUPDATES%252FPG%252F365%252F&transform=Submit
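
The nesting in those URLs is easy to get wrong by hand, since the page address is percent-encoded once inside the Tidy URL and then a second time when the whole Tidy URL becomes the `xmlfile` parameter. Here's a minimal sketch of how the two URLs could be assembled in Python. The function and constant names are my own invention for illustration; only the service endpoints and parameter names come from the URLs listed above.

```python
from urllib.parse import urlencode

# Service endpoints and addresses taken from the URLs in this post.
TIDY_SERVICE = "http://cgi.w3.org/cgi-bin/tidy"
XSLT_SERVICE = "http://www.w3.org/2000/06/webdata/xslt"
PAGE_URL = "http://www.jlist.com/UPDATES/PG/365/"
XSL_URL = "http://www.decafbad.com/jlist.xsl"


def build_tidy_url(page_url):
    """Wrap a page address in a Tidy request (first layer of encoding)."""
    return TIDY_SERVICE + "?" + urlencode({"docAddr": page_url})


def build_feed_url(page_url, xsl_url):
    """Wrap the Tidy request in an XSLT request (second layer of encoding).

    The Tidy URL is encoded again as a parameter value, which is why
    '%3A' from the first layer shows up as '%253A' in the result.
    """
    params = [
        ("xslfile", xsl_url),
        ("xmlfile", build_tidy_url(page_url)),
        ("transform", "Submit"),
    ]
    return XSLT_SERVICE + "?" + urlencode(params)


tidy_url = build_tidy_url(PAGE_URL)
final_url = build_feed_url(PAGE_URL, XSL_URL)

print(tidy_url)
print(final_url)
```

Swapping in a different page or stylesheet is then just a matter of passing different arguments to `build_feed_url`, with the encoding depth handled for you.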
<p>Unfortunately, although it looks okay to me, this feed <a href="http://feeds.archive.org/validator/check?url=http%3A%2F%2Fwww.w3.org%2F2000%2F06%2Fwebdata%2Fxslt%3Fxslfile%3Dhttp%253A%252F%252Fwww.decafbad.com%252Fjlist.xsl%26xmlfile%3Dhttp%253A%252F%252Fcgi.w3.org%252Fcgi-bin%252Ftidy%253FdocAddr%253Dhttp%25253A%25252F%25252Fwww.jlist.com%25252FUPDATES%25252FPG%25252F365%25252F%26transform%3DSubmit">doesn’t validate yet</a>. I’m still poking around with it to get things straight, so feel free to help me out! :)</p>