
XslScraper

Revision r1.2 - 02 Sep 2003 - 11:35 GMT - LesOrchard

Description

I've been working for almost a year on a monolithic scraping app in Java using BeanShell, XSL, and XPath. Since I haven't used RadioUserLand in a while, and RssDistiller went to a fee-based subscription, I've needed a way to subscribe to sites which don't provide feeds. The Java app worked well enough, but I've come to realize that I have more luck with small bits of hacking than with long bouts on monolithic apps.

Then, I caught Bill Humphries' post demonstrating basically what I was trying to get this Java beast to do, but he simply piped a few commands together to get the same result. This made me feel pretty dumb. Simplicity and UNIX philosophy trump the monolith.

So, although it's available if anyone's interested, I've abandoned the Java app. I've switched to a fully XSL-based solution, which assumes that the documents it operates on have first been cleaned up with Tidy. This has worked so much better. I now run a shell script as a cron job which cycles through all my scraper XSLT files to grab new content on a periodic basis.
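For the curious, here is a minimal sketch of what such a periodic job might look like. It is written in Python rather than as the shell script I actually use, and the file layout is purely an assumption for illustration: each scraper foo.xsl is imagined to sit next to a hypothetical foo.url file holding the page to scrape, and the result lands in foo.rss. It also assumes curl, tidy, and xsltproc are on the PATH.

```python
import subprocess
from pathlib import Path


def build_pipeline(source_url, xsl_path):
    """Return the three commands (fetch, tidy, transform) as argv lists."""
    return [
        ["curl", "-s", source_url],
        # -asxml asks Tidy for well-formed XHTML; force-output keeps it
        # producing a document even when the page has errors.
        ["tidy", "-q", "-asxml", "-numeric", "--force-output", "yes"],
        # "-" tells xsltproc to read the document from standard input.
        ["xsltproc", str(xsl_path), "-"],
    ]


def find_scrapers(scraper_dir):
    """Pair each *.xsl file with the URL in its sibling *.url file.

    The *.url convention is hypothetical, invented for this sketch.
    """
    jobs = []
    for xsl in sorted(Path(scraper_dir).glob("*.xsl")):
        url_file = xsl.with_suffix(".url")
        if url_file.exists():
            jobs.append((url_file.read_text().strip(), xsl))
    return jobs


def run_pipeline(commands):
    """Feed each command's output to the next, like a shell pipe."""
    data = b""
    for argv in commands:
        # Tidy exits non-zero on mere warnings, so exit codes are ignored.
        data = subprocess.run(argv, input=data, stdout=subprocess.PIPE).stdout
    return data


def main(scraper_dir="scrapers"):
    """Run every scraper found and write its output next to the stylesheet."""
    for url, xsl in find_scrapers(scraper_dir):
        feed = run_pipeline(build_pipeline(url, xsl))
        xsl.with_suffix(".rss").write_bytes(feed)
```

Calling main() from a cron entry would then refresh every feed in one pass. The point is only that the whole thing is three small tools piped together, not an application.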

Then, to make myself feel clever again, I decided that I might try applying the idea of URL-as-command-line to this whole thing. At first, I tried using the W3C XSLT servlet in conjunction with their HTML Tidy service to get the same effect, only entirely using web services. Unfortunately, the Tidy service fails on some pages and doesn't force output. Also, the XSLT servlet is apparently based on XT, which doesn't seem to support some of the EXSLT extensions I want to use.

So, I hacked up my own versions of the XSLT and Tidy services:

0xDECAFBAD XSLT service (based on LibXSLT)
- http://www.decafbad.com/2003/08/xsltproc
0xDECAFBAD HTML Tidy service
- http://www.decafbad.com/2003/08/tidy

Then, to streamline things a bit, I joined the two along with some caching:

0xDECAFBAD combined Tidy / XSLT service
- http://www.decafbad.com/2003/08/tidyxslt

All of these services serve up a simple form for entering the URLs used. The combined Tidy/XSLT service is what should be used if you want to link to any feeds generated. During development of XSLT scrapers, the parameter "cache=0" can be added to the tidyxslt URL. Otherwise, the combined service caches the results for each pair of source URL and XSLT URL for 3 hours before processing again.
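The caching rule is simple enough to sketch in a few lines of Python. This is not the service's actual code, just an illustration under stated assumptions: an in-memory dictionary stands in for whatever storage the real service uses, PairCache and its process callback are names made up here, and whether a cache=0 request also refreshes the stored copy is a guess on my part.

```python
import time

CACHE_TTL_SECONDS = 3 * 60 * 60  # the 3-hour window described above


class PairCache:
    """Cache keyed by the (source URL, XSLT URL) pair. Hypothetical sketch."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock   # injectable so expiry can be tested without waiting
        self.entries = {}    # (source_url, xslt_url) -> (stored_at, result)

    def fetch(self, source_url, xslt_url, process, use_cache=True):
        """Return a fresh cached result, else call process() and store it.

        use_cache=False mirrors adding "cache=0" to the URL: always reprocess.
        Storing the result in that case too is an assumption of this sketch.
        """
        key = (source_url, xslt_url)
        now = self.clock()
        hit = self.entries.get(key)
        if use_cache and hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        result = process(source_url, xslt_url)
        self.entries[key] = (now, result)
        return result
```

Because the key is the pair, the same page run through two different stylesheets is cached twice, and editing a scraper during development is exactly the case where bypassing the cache matters.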

Here are some sample feeds using the service:

For now, check out the example scrapers to figure out how to write new scrapers. One very neat thing about XSLT, and about the way these web services work, is that you can develop and host your own XSLT scrapers and feed them to my services if you like. That's worth saying again: even though the services are on my site, you can host your own scraping XSL file (as I do with Bill Humphries' in an example above) and even use the template modules that are a part of this project. You need not install any software yourself.

Like REST, it's all just URLs. Yay for web services. Given time and demand, I may work on better documentation of this process.
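To make the "it's all just URLs" point concrete, here is a small Python sketch that composes a feed URL for the combined service. Only the service address and the "cache=0" parameter come from this page. The query parameter names "url" and "xsl" are placeholders I've assumed for illustration, so check the service's form for the real names before relying on them.

```python
from urllib.parse import urlencode

TIDYXSLT_SERVICE = "http://www.decafbad.com/2003/08/tidyxslt"


def feed_url(source_url, xslt_url, cache=True,
             source_param="url", xslt_param="xsl"):
    """Build a link to the combined Tidy/XSLT service.

    source_param and xslt_param are hypothetical names, not confirmed ones.
    """
    params = [(source_param, source_url), (xslt_param, xslt_url)]
    if not cache:
        # Documented above: cache=0 forces reprocessing during development.
        params.append(("cache", "0"))
    # urlencode percent-escapes the nested URLs so they survive as values.
    return TIDYXSLT_SERVICE + "?" + urlencode(params)
```

A call such as feed_url("http://example.com/news.html", "http://example.org/news.xsl") gives a single link that an aggregator can subscribe to, with the scraper stylesheet hosted wherever you like.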

However, given all that... if you find all of this useful, you'll want to run your own copy of it eventually.

Oh yeah, and though I really need to work on documentation, you should know that all of the source code is public domain. ShareAndEnjoy.