Scraping HTML with curl, tidy, and XSL

Continuing with the theme of making it easier for “Big Pubs” to create RSS feeds: I’m assuming that they have a publishing system that wasn’t built with RSS in mind, but they want on the bandwagon anyway.

Using curl, tidy, and XSL to scrape content from HTML pages into an RSS feed: this is basically what I do now with a half-baked Java app using JTidy, XPath, and BeanShell. I keep meaning to release it, but so far it’s too embarrassing to share. Still, it’s been working well enough on the sites I’m interested in that I haven’t been too motivated to tidy it up and tarball it. One thing I like better about Bill Humphries’ approach, though, is that it doesn’t use Java :)
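
To make the idea concrete, here is a rough sketch of the scrape-then-transform steps using only Python’s standard library. To be clear, this is neither Bill Humphries’ curl/tidy/XSL demo nor my Java app: it swaps in `html.parser` for tidy and `xml.etree.ElementTree` for the XSL step, and it skips the fetch entirely by using an inline sample page. The sample markup and the names `LinkCollector`, `scrape`, and `build_rss` are all made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample page standing in for fetched HTML.
# The first <div> is deliberately left unclosed, as real pages often are.
SAMPLE_HTML = """
<html><body>
<div class="story"><a href="http://example.com/a">First headline</a>
<div class="story"><a href="http://example.com/b">Second &amp; last</a></div>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (title, link) pairs from every <a href="..."> in the page."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None


def scrape(html):
    """Return a list of (title, link) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def build_rss(title, link, items):
    """Serialize the scraped items as a bare-bones RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Scraped from " + link
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(build_rss("Example feed", "http://example.com/", scrape(SAMPLE_HTML)))
```

A real scraper would want to narrow down which links count as stories, which is exactly the job the XPath expressions in an XSL stylesheet do in the shell version.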

2 Comments

  1. Posted August 22, 2003 at 8:23 pm | Permalink

    Well, it could use Java if you really, really want to, since Xalan and Saxon have command-line variants. I’m using LibXSLT in the demo.

  2. Posted August 23, 2003 at 1:21 pm | Permalink

    Well, I actually like the idea of chaining a few shell programs together much better than the all-in-one Java scraper I was tinkering with. Seems so much easier all around.

2 Trackbacks/Pingbacks

  1. Interesting screen scraping technique

    Scraping HTML with curl, tidy, and XSL: This is pretty slick. It feels much cleaner than scraping with regexes. The…

  2. [...] Aug 22: Scraping HTML with curl, tidy, and XSL Trying to bring some cleaner approaches to web scraping beyond regular [...]