0xDECAFBAD


FeedMagick, the feed filter that doesn't know much about feeds

FeedMagick is a set of PHP tools used in filtering, converting, and otherwise munging XML syndication feeds in RSS and Atom formats.

Source: FeedMagick - 0xDECAFBAD - Trac

Okay, so I held my nose and started doing a bit of PHP hacking this past weekend, and this is what I've got so far. It wasn't that bad, and there were a few nice toys to be found in PHP, but I think I'll need some peer review to tell me how well I'm following the local idiom. Also, my XML mojo may suffer from some unintentional ignorance.

Anyway, the main idea behind this feed filtering kit is that I'm not parsing and reconstituting feeds at the format level. Instead, I'm diving down to the XML level with SAX filters. Having finally grasped the meaning of Must Ignore, I found this a particularly interesting realization, so I hope you'll bear with me as I tell the story...

What it doesn't do:

See, I'm not using MagpieRSS to chew up feeds into PHP structures, and I'm not using PHP structures to splice together a new feed. Since starting with FeedSpool, I've come to believe that ignorance is bliss, and FeedMagick is a continuation of this notion.

In FeedMagick, I'm mostly ignoring feed format specifics. The only things this code really cares about are item and entry tags; the rest gets blindly passed along. Of course, you can write filter subclasses which do know and care about other feed elements, but the beauty is that neither you nor I need to write code that cares about all possible feed elements ever.

Why it doesn't do it:

When you write a general parser for feeds, there are a lot of permutations that need worrying about. And for a feed filter, that's just the first stage of the process. Next, you need to reconstitute that feed from parsed data—and that's going to suck.

You could consider your job done after implementing the bare feed spec, but what about the extensions from Apple and Yahoo! and Microsoft? Oh, and what about calendar events? What happens to your filter if people start doing interesting things with all of these extensions?

Even if the parser author managed to pull off passing along all the information contained in the feed—pretty much reinventing the XML wheel past a certain point—you'll need to anticipate what all that information might be in order to rebuild the feed basically from scratch on the other end of the filter. It's too much.

How it gets away with it:

Stop caring about the specifics so much. This is XML, right? It's possible to build tools at the XML level that can slice and dice and put it all back together again without harm. Could be RSS, could be Atom, could be XHTML, could be RecipeML. In any case, as long as the turtle-depth stops at angle brackets, we can start there and work up.

SAX filters it is, then—in PHP flavor. I built a simple chain consisting of a SAX parser to read in the XML and a SAX filter that writes XML. So far so good, it's an identity function.
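To make the identity chain concrete, here's a rough sketch in Python rather than PHP, using the standard library's xml.sax. It is not FeedMagick's code; the `identity` function and the `SAMPLE` feed are made up for illustration. A parser reads the XML and hands every event straight to a handler that writes XML back out.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

# Hypothetical sample feed, only here to have something to push through.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><title>Demo</title>
<item><title>One</title><dc:subject>php</dc:subject></item>
</channel></rss>"""


def identity(xml_text):
    """Parse XML and write the same events back out, unchanged."""
    out = io.StringIO()
    writer = XMLGenerator(out, encoding="utf-8")  # SAX events -> XML text
    parser = xml.sax.make_parser()                # XML text -> SAX events
    parser.setContentHandler(writer)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    print(identity(SAMPLE))
```

Nothing in there knows what a feed is. Namespace processing is off by default in xml.sax, so a tag like dc:subject is just a name that gets passed along.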

Next, I stuck a filter in the middle that just barely knows about item and entry tags and not much else. When this thing sees a feed item in the parsing event stream, it temporarily diverts all further parsing events for that item into a buffer. At the end of the item, it spews the buffered events down the filter chain.
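The buffering step might look something like the following. Again, this is a Python sketch built on xml.sax's `XMLFilterBase`, not the PHP from FeedMagick; the class name `ItemBufferFilter`, the `items_flushed` counter, and the sample feed are all assumptions for the sake of the example. It only buffers element and character events, which is enough to show the idea.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

SAMPLE = """<rss version="2.0">
<channel><title>Demo</title>
<item><title>One</title></item>
<item><title>Two</title></item>
</channel></rss>"""


class ItemBufferFilter(XMLFilterBase):
    """Diverts events inside item/entry into a buffer, then replays them."""

    ITEM_TAGS = {"item", "entry"}

    def __init__(self, parent=None):
        super().__init__(parent)
        self._events = None      # None means "not inside an item"
        self._depth = 0          # element nesting depth within the item
        self.items_flushed = 0   # illustrative counter, not part of SAX

    def startElement(self, name, attrs):
        if self._events is None and name in self.ITEM_TAGS:
            self._events = []    # start diverting events
        if self._events is None:
            super().startElement(name, attrs)
        else:
            self._depth += 1
            self._events.append(("startElement", (name, attrs)))

    def characters(self, content):
        if self._events is None:
            super().characters(content)
        else:
            self._events.append(("characters", (content,)))

    def endElement(self, name):
        if self._events is None:
            super().endElement(name)
            return
        self._events.append(("endElement", (name,)))
        self._depth -= 1
        if self._depth == 0:     # the item that opened the buffer just closed
            events, self._events = self._events, None
            self.items_flushed += 1
            for method, args in events:
                # Replay each buffered event on the next handler in the chain.
                getattr(self._cont_handler, method)(*args)


def run_buffered(xml_text):
    out = io.StringIO()
    filt = ItemBufferFilter(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue(), filt.items_flushed
```

The output is still the same document, so the chain remains an identity function. The only difference is that each item now arrives downstream in one burst instead of event by event.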

Where it gets useful:

Now, with all of this in place, you can make a decision in the middle: Got an item in the buffer? Take a peek at it just before it gets unbuffered—does it meet a set of filter criteria? If not, discard the buffer and keep moving. Otherwise, proceed as normal and send the item on its way.

And that's it, so far. All this filter knows or cares about is XML and the occasional feed item tag. Filter subclasses can care about more—a dc:subject element, for instance—and decide which items make it down the pipe. That item could be stuffed full of rich extensions and goodies, but this filter doesn't have to care. It can be ignorant beyond angle brackets.
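Here's a sketch of the keep-or-discard decision and a subclass that cares about dc:subject, written in Python with xml.sax rather than PHP. None of this is FeedMagick's actual code: `ItemFilter`, `SubjectFilter`, the `keep` hook, and the sample feed are hypothetical names for illustration. One assumption to flag loudly: the subclass matches the literal tag name "dc:subject", so it depends on the feed using the dc prefix.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><title>Demo</title>
<item><title>One</title><dc:subject>php</dc:subject><x:extra xmlns:x="urn:example">kept</x:extra></item>
<item><title>Two</title><dc:subject>coffee</dc:subject></item>
</channel></rss>"""


class ItemFilter(XMLFilterBase):
    """Buffers each item/entry, then forwards or discards it as a unit."""

    ITEM_TAGS = {"item", "entry"}

    def __init__(self, parent=None):
        super().__init__(parent)
        self._events = None   # None means "not inside an item"
        self._depth = 0

    def keep(self, events):
        """Hook for subclasses; the base class keeps every item."""
        return True

    def startElement(self, name, attrs):
        if self._events is None and name in self.ITEM_TAGS:
            self._events = []
        if self._events is None:
            super().startElement(name, attrs)
        else:
            self._depth += 1
            self._events.append(("startElement", (name, attrs)))

    def characters(self, content):
        if self._events is None:
            super().characters(content)
        else:
            self._events.append(("characters", (content,)))

    def endElement(self, name):
        if self._events is None:
            super().endElement(name)
            return
        self._events.append(("endElement", (name,)))
        self._depth -= 1
        if self._depth == 0:
            events, self._events = self._events, None
            if self.keep(events):          # otherwise the buffer is dropped
                for method, args in events:
                    getattr(self._cont_handler, method)(*args)


class SubjectFilter(ItemFilter):
    """Keeps only items having a dc:subject equal to the wanted subject."""

    def __init__(self, parent, subject):
        super().__init__(parent)
        self.subject = subject

    def keep(self, events):
        inside, text = False, []
        for method, args in events:
            if method == "startElement" and args[0] == "dc:subject":
                inside, text = True, []
            elif method == "characters" and inside:
                text.append(args[0])
            elif method == "endElement" and args[0] == "dc:subject":
                if "".join(text).strip() == self.subject:
                    return True
                inside = False
        return False


def filter_feed(xml_text, subject):
    out = io.StringIO()
    filt = SubjectFilter(xml.sax.make_parser(), subject)
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

Calling `filter_feed(SAMPLE, "php")` drops the second item and passes the first along whole, including the x:extra element the filter has never heard of.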

That's it.

So, yeah. This might all be obvious to some people, but it all finally made a lot of sense to me. It started making sense when I built FeedSpool, but now it's really sinking in. I really get the virtue of ignorance and laziness in XML now. Or, at least, I think I'm starting to.

Archived Comments

  • This virtue of laziness and ignorance applies to dynamic programming languages as well, for the same reasons. Not pointed at you, but I see a lot of C++/Java people get very excited about how flexible XML is, but completely miss out on the idea that programming languages like <holy-war> Python Perl Ruby Lisp Scheme etc. </holy-war> provide this flexibility for all their objects and data structures, all the time. I guess we could try to re-market dynamic languages as XPL: extensible programming languages.

    Well, sorry for the mini-rant. Welcome to the Bay Area.