After tinkering a bit last week with web services and XSLT-based scraping for generating RSS from HTML, I ripped out some work from a Java-based scraper I’d started last year and threw together a kit of XSLT files that does most everything I was trying to do.
I’m calling this kit XslScraper, and there’s further blurbage and download links available in the Wiki. Check it out. I’ve got shell scripts to run the stuff as a cron job, and CGI scripts to run it all from web services.
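To give a rough idea of the kind of transform involved, here's a minimal sketch. It is not the kit's code: XslScraper does this step with XSLT, while this swaps in Python's standard library purely for illustration. It also assumes the page has already been cleaned up into well-formed XHTML. The page layout (`h3` elements with class `headline`), the function name `scrape_to_rss`, and the example.com URLs are all made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def scrape_to_rss(xhtml, feed_title, feed_link):
    """Turn well-formed XHTML into a minimal RSS 2.0 document (as a string)."""
    page = ET.fromstring(xhtml)

    # Build the RSS skeleton: <rss><channel> with the required channel fields.
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped from " + feed_link

    # Assumed (hypothetical) layout: every headline looks like
    #   <h3 class="headline"><a href="...">Title</a></h3>
    for heading in page.iter("{%s}h3" % XHTML_NS):
        if heading.get("class") != "headline":
            continue
        anchor = heading.find("{%s}a" % XHTML_NS)
        if anchor is None or not anchor.get("href"):
            continue
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "".join(anchor.itertext()).strip()
        ET.SubElement(item, "link").text = anchor.get("href")

    return ET.tostring(rss, encoding="unicode")


# Made-up sample input standing in for a tidied page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h3 class="headline"><a href="http://example.com/a">First story</a></h3>
    <h3 class="other"><a href="http://example.com/skip">Not a headline</a></h3>
    <h3 class="headline"><a href="http://example.com/b">Second story</a></h3>
  </body>
</html>"""

if __name__ == "__main__":
    print(scrape_to_rss(SAMPLE, "Example feed", "http://example.com/"))
```

In an XSLT version, the loop would be a template matching the headline elements and the skeleton would be the stylesheet's root template. Either way, each site needs its own small set of selectors.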
For quick gratification, check out these feeds:
- The Nation (using Bill Humphries’ XSL)
- KurzweilAI.net
- J-List — You’ve got a friend in Japan!
- New JOBS at the University of Michigan (By Job Family)



One Comment
The KurzweilAI.net feed is empty.
5 Trackbacks/Pingbacks
Wow. Using web services and XSLT to scrape RSS from HTML
Les Orchard of 0xDECAFBAD has written an amazing piece of software called XslScraper. It’s written in Java and uses XSLT and HTMLTidy to scrape web pages in order to produce RSS. Better yet, the XSLT processor and Tidy are offered…
QuickLinks - September 02, 2003
Changes As I still don’t have a solution for a Quick Links XML feed (the usual approach with one…
XSL+Tidy RSS Scraper Service
Alright, L. M. Orchard took my technique of RSS scraping via Tidy and XSLT, and produced a RESTful web service to do this. You provide a URL for the site to scrape, and the URL of an XSLT transform and it returns RSS.
[...] Sep 02: Using web services and XSLT to scrape RSS from HTML Progressing from perl scraping hacks to cleaner XSLT-based feed [...]