
Using web services and XSLT to scrape RSS from HTML



After tinkering a bit last week with web services and XSLT-based scraping to generate RSS from HTML, I ripped out some of the work from a Java-based scraper I’d started last year and threw together a kit of XSLT files that does most everything I was trying to do.

I’m calling this kit XslScraper, and there’s further blurbage and download links available in the Wiki. Check it out. I’ve got shell scripts to run the stuff as a cron job, and CGI scripts to run it all through web services.
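To give a feel for the general idea, here is a minimal sketch of an XSLT transform turning already-tidied HTML into RSS, run through the JDK's built-in XSLT processor. It is not code from XslScraper itself. The class name `ScrapeSketch`, the `toRss` helper, the `source` parameter, and the assumption that headlines are links inside a `<div class="news">` are all hypothetical, picked only for illustration. It also assumes the page has already been cleaned into well-formed XML with no XHTML namespace, which real Tidy output may not satisfy without adjusting the XPath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ScrapeSketch {
    // Hypothetical stylesheet: every <a> inside <div class="news"> becomes an RSS item.
    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"xml\" indent=\"no\" omit-xml-declaration=\"yes\"/>",
        "  <xsl:param name=\"source\"/>",
        "  <xsl:template match=\"/\">",
        "    <rss version=\"2.0\"><channel>",
        "      <title><xsl:value-of select=\"/html/head/title\"/></title>",
        "      <link><xsl:value-of select=\"$source\"/></link>",
        "      <description>Scraped from <xsl:value-of select=\"$source\"/></description>",
        "      <xsl:for-each select=\"//div[@class='news']//a\">",
        "        <item>",
        "          <title><xsl:value-of select=\"normalize-space(.)\"/></title>",
        "          <link><xsl:value-of select=\"@href\"/></link>",
        "        </item>",
        "      </xsl:for-each>",
        "    </channel></rss>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Applies the stylesheet to well-formed (already tidied) HTML and returns the RSS text.
    static String toRss(String xhtml, String sourceUrl) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.setParameter("source", sourceUrl); // fills the channel-level <link>
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a page that has already been run through an HTML cleaner.
        String page = "<html><head><title>Example News</title></head><body>"
            + "<div class=\"news\"><a href=\"http://example.com/1\">First story</a>"
            + "<a href=\"http://example.com/2\">Second story</a></div></body></html>";
        System.out.println(toRss(page, "http://example.com/"));
    }
}
```

In a sketch like this, each site to be scraped would need its own stylesheet with XPath expressions matched to that page's layout, while the surrounding plumbing stays the same.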

For quick gratification, check out these feeds:


One Comment

  1. Posted September 3, 2003 at 7:03 pm | Permalink

    The KurzweilAI.net feed is empty.

5 Trackbacks/Pingbacks

  1. Wow. Using web services and XSLT to scrape RSS from HTML

    Les Orchard of 0xDECAFBAD has written an amazing piece of software called XslScraper. It’s written in Java and uses XSLT and HTMLTidy to scrape web pages in order to produce RSS. Better yet, the XSLT processor and Tidy are offered…

  2. QuickLinks - September 02, 2003

    Changes As I still don’t have a solution for a Quick Links XML feed (the usual approach with one…

  3. QuickLinks - September 02, 2003

    Changes As I still don’t have a solution for a Quick Links XML feed (the usual approach with one…

  4. XSL+Tidy RSS Scraper Service

    Alright, L. M. Orchard took my technique of RSS scraping via Tidy and XSLT, and produced a RESTful web service to do this. You provide a URL for the site to scrape, and the URL of an XSLT transform and it returns RSS.

  5. [...] Sep 02: Using web services and XSLT to scrape RSS from HTML Progressing from perl scraping hacks to cleaner XSLT-based feed [...]