XslScraper
Revision r1.2 - 02 Sep 2003 - 11:35 GMT - LesOrchard
Description
I've been working for almost a year on a monolithic scraping app in Java using BeanShell, XSL, and XPath. Since I haven't used RadioUserLand in a while, and RssDistiller went to a fee-based subscription, I've needed a way to subscribe to sites that don't provide feeds. The Java app worked well enough, but I've come to realize that I have more luck with small bits of hacking than with long bouts on monolithic apps.

Then I caught Bill Humphries' post demonstrating basically what I was trying to get this Java beast to do, except that he simply piped a few commands together to get the same result. This made me feel pretty dumb. Simplicity and the UNIX philosophy trump the monolith. So, although it's available if anyone's interested, I've abandoned the Java app and switched to a fully XSL-based solution, which assumes that the documents it operates on have first been cleaned up with Tidy. This has worked so much better. I now run a shell script as a cron job that cycles through all my scraper XSLT files to grab new content periodically.

Then, to make myself feel clever again, I decided to try applying the idea of URL-as-command-line to this whole thing. At first, I tried using the W3C XSLT servlet in conjunction with their HTML Tidy service to get the same effect using only web services. Unfortunately, the Tidy service fails on some pages and doesn't force output. Also, the XSLT servlet is apparently based on XT, which doesn't seem to support some of the EXSLT extensions I want to use. So, I hacked up my own versions of the XSLT and Tidy services:
- 0xDECAFBAD XSLT service (based on LibXSLT)
- 0xDECAFBAD HTML Tidy service
- 0xDECAFBAD combined Tidy / XSLT service
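
For anyone who wants to try the piped approach locally, here is a minimal sketch in Python of one fetch, tidy, transform cycle over a set of scraper stylesheets. It assumes `curl`, `tidy`, and `xsltproc` (the command-line tool that ships with LibXSLT) are installed. The `SCRAPERS` mapping, the file names, and the example URL are hypothetical, and this is neither the actual cron script nor the code behind the services listed above.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping: scraper stylesheet -> page it scrapes.
SCRAPERS = {
    "scrapers/example-news.xsl": "http://www.example.com/news.html",
}


def build_pipeline(source_url, xslt_path):
    """Return the commands equivalent to: curl | tidy | xsltproc."""
    return [
        # Fetch the raw HTML quietly.
        ["curl", "-s", source_url],
        # Clean it up into well-formed XML, even if the page has errors.
        ["tidy", "-q", "-asxml", "-numeric", "--force-output", "yes"],
        # Apply the scraper stylesheet to the tidied page read from stdin.
        ["xsltproc", xslt_path, "-"],
    ]


def run_pipeline(stages):
    """Run each command in turn, feeding its stdout to the next one's stdin."""
    data = b""
    for command in stages:
        # No check=True here: tidy exits non-zero even for mere warnings.
        done = subprocess.run(
            command,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
        data = done.stdout
    return data


def scrape_all(scrapers, out_dir):
    """Run every scraper once and write one output file per stylesheet."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for xslt_path, source_url in scrapers.items():
        feed = run_pipeline(build_pipeline(source_url, xslt_path))
        (out / (Path(xslt_path).stem + ".xml")).write_bytes(feed)
```

Calling `scrape_all(SCRAPERS, "feeds")` from a cron job would regenerate every feed on each run. Passing `--force-output yes` to Tidy is a deliberate choice here, since a page that Tidy refuses to emit leaves nothing for the stylesheet to work on.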
tidyxslt URL. Otherwise, the combined service caches the results for each pair of source URL and XSLT URL for 3 hours before processing again.
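
The caching behavior can be pictured with a small sketch like the one below. It is only an illustration of keeping one result per (source URL, XSLT URL) pair for 3 hours, not the service's actual code; the `PairCache` class, its method names, and the `compute` callback are all hypothetical.

```python
import time

CACHE_TTL_SECONDS = 3 * 60 * 60  # 3 hours, as described above


class PairCache:
    """Hypothetical in-memory cache keyed by (source URL, XSLT URL)."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock      # injectable so the expiry logic can be tested
        self._entries = {}       # (source_url, xslt_url) -> (stored_at, result)

    def get_or_compute(self, source_url, xslt_url, compute):
        """Return the cached result for this pair, reprocessing once it is stale."""
        key = (source_url, xslt_url)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Missing or older than the TTL: process again and remember the result.
        result = compute(source_url, xslt_url)
        self._entries[key] = (now, result)
        return result
```

Keying on the pair rather than on the source URL alone means the same page scraped with two different stylesheets gets two independent cache entries.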
Here are some sample feeds using the service:
- The Nation (using Bill Humphries' XSL)
- KurzweilAI.net
- J-List -- You've got a friend in Japan!
- New JOBS at the University of Michigan (By Job Family)
