<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: What&#8217;s old (scraping) is new again (microformats)</title>
	<atom:link href="http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/feed" rel="self" type="application/rss+xml" />
	<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats</link>
	<description>It's all spinning wheels and self-doubt until the first pot of coffee.</description>
	<pubDate>Thu, 20 Nov 2008 16:15:03 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7-beta3-9771</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Ian Bicking</title>
		<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/comment-page-1#comment-1642</link>
		<dc:creator>Ian Bicking</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.decafbad.com/blog/?p=643#comment-1642</guid>
		<description>&lt;p&gt;I dunno... HTMLParser and screen scraping have always been an unsatisfying experience for me.  Not very reliable, and they fail in unexpected ways.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>I dunno&#8230; HTMLParser and screen scraping have always been an unsatisfying experience for me.  Not very reliable, and they fail in unexpected ways.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: l.m. orchard</title>
		<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/comment-page-1#comment-1643</link>
		<dc:creator>l.m. orchard</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.decafbad.com/blog/?p=643#comment-1643</guid>
		<description>&lt;p&gt;Well, scraping is most certainly nothing you want to really rely on without watching it, but I've got at least a dozen or two useful and active feeds running as a result of scraping for the past few years-- it's certainly better than nothing :)&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Well, scraping is most certainly nothing you want to really rely on without watching it, but I&#8217;ve got at least a dozen or two useful and active feeds running as a result of scraping for the past few years&#8211; it&#8217;s certainly better than nothing :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Justin Mason</title>
		<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/comment-page-1#comment-1644</link>
		<dc:creator>Justin Mason</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.decafbad.com/blog/?p=643#comment-1644</guid>
		<description>&lt;p&gt;ha, I never realised you were one of the authors of that book -- awesome!  I wrote sitescooper, which was a proto-scraper for a wide variety of sites to transcode them into an offline-readable format for small-screen handheld devices.  &lt;/p&gt;

&lt;p&gt;I really like the microformat idea, thanks for the pointer.  one difficulty, however, of using it for scraping is that you'll have to use XPath and trust that the input XHTML is valid.  regexps won't work, because the close tags don't include the "id" or "class" attributes, so nested close tags won't match correctly with simple regexps.&lt;/p&gt;

&lt;p&gt;but then, we're all told that scraping and regexps are kludges anyway, so I guess we shouldn't be using them any more ;)&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>ha, I never realised you were one of the authors of that book &#8212; awesome!  I wrote sitescooper, which was a proto-scraper for a wide variety of sites to transcode them into an offline-readable format for small-screen handheld devices.  </p>
<p>I really like the microformat idea, thanks for the pointer.  one difficulty, however, of using it for scraping is that you&#8217;ll have to use XPath and trust that the input XHTML is valid.  regexps won&#8217;t work, because the close tags don&#8217;t include the &#8220;id&#8221; or &#8220;class&#8221; attributes, so nested close tags won&#8217;t match correctly with simple regexps.</p>
<p>but then, we&#8217;re all told that scraping and regexps are kludges anyway, so I guess we shouldn&#8217;t be using them any more ;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: l.m. orchard</title>
		<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/comment-page-1#comment-1645</link>
		<dc:creator>l.m. orchard</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.decafbad.com/blog/?p=643#comment-1645</guid>
		<description>&lt;p&gt;Well, the thing about the microformats, if I recall, is that you have to at least start with well-formed XHTML.  Otherwise, your microformatted content is broken.&lt;/p&gt;

&lt;p&gt;That said, though, I've been using Python's HTMLParser to lift data out of microformat content with a lot of success.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Well, the thing about the microformats, if I recall, is that you have to at least start with well-formed XHTML.  Otherwise, your microformatted content is broken.</p>
<p>That said, though, I&#8217;ve been using Python&#8217;s HTMLParser to lift data out of microformat content with a lot of success.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tantek</title>
		<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/comment-page-1#comment-1646</link>
		<dc:creator>Tantek</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.decafbad.com/blog/?p=643#comment-1646</guid>
		<description>&lt;p&gt;Leslie, that's exactly right.  More and more blogging tools publish in well-formed XHTML by default and are  becoming better and better at "tidying" ill-formed markup into well-formed markup.  &lt;/p&gt;

&lt;p&gt;In addition, right now you can get started with a bit of a hybrid approach:  you can use a regexp to find the &lt;em&gt;start&lt;/em&gt; of a microformat such as hCard, hCalendar, hReview, XOXO etc. simply by looking for class attributes that contain the right value.  Then at that point, you can hand the stream over to an XML parser to process the well-formed microformatted markup and handle the structured data as you wish!&lt;/p&gt;

&lt;p&gt;Tantek&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Leslie, that&#8217;s exactly right.  More and more blogging tools publish in well-formed XHTML by default and are  becoming better and better at &#8220;tidying&#8221; ill-formed markup into well-formed markup.  </p>
<p>In addition, right now you can get started with a bit of a hybrid approach:  you can use a regexp to find the <em>start</em> of a microformat such as hCard, hCalendar, hReview, XOXO etc. simply by looking for class attributes that contain the right value.  Then at that point, you can hand the stream over to an XML parser to process the well-formed microformatted markup and handle the structured data as you wish!</p>
<p>Tantek</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: joe @ metafy</title>
		<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/comment-page-1#comment-1647</link>
		<dc:creator>joe @ metafy</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.decafbad.com/blog/?p=643#comment-1647</guid>
		<description>&lt;p&gt;I remember that presentation at WWDC last year or the year before in which the very nice Sara ? explains that scraping is for H4x0Rs, to which I commented to her afterward that it's actually for 3L337 H4x0Rs...&lt;/p&gt;

&lt;p&gt;I particularly enjoy the "Scraping is Fun" bent of this thread, so I thought I'd bang my own drum a little to show how much I enjoy it, too:&lt;/p&gt;

&lt;p&gt;http://www.metafy.com/products/anthracite/&lt;/p&gt;

&lt;p&gt;It works great with Perl (or any other UNIX command) and AppleScript now, and even better later this week when we release the Automator actions for Anthracite (whoops, not supposed to say that yet until they're all done)...&lt;/p&gt;

&lt;p&gt;Among other nifty things you can do with it today are convert the results of a Google search into an RSS feed, and/or search those results via Spotlight.&lt;/p&gt;

&lt;p&gt;I hope it helps you enjoy scraping even more!&lt;/p&gt;

&lt;p&gt;Joe @ Metafy
Boulder, Colorado USA&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>I remember that presentation at WWDC last year or the year before in which the very nice Sara ? explains that scraping is for H4x0Rs, to which I commented to her afterward that it&#8217;s actually for 3L337 H4x0Rs&#8230;</p>
<p>I particularly enjoy the &#8220;Scraping is Fun&#8221; bent of this thread, so I thought I&#8217;d bang my own drum a little to show how much I enjoy it, too:</p>
<p><a href="http://www.metafy.com/products/anthracite/" rel="nofollow">http://www.metafy.com/products/anthracite/</a></p>
<p>It works great with Perl (or any other UNIX command) and AppleScript now, and even better later this week when we release the Automator actions for Anthracite (whoops, not supposed to say that yet until they&#8217;re all done)&#8230;</p>
<p>Among other nifty things you can do with it today are convert the results of a Google search into an RSS feed, and/or search those results via Spotlight.</p>
<p>I hope it helps you enjoy scraping even more!</p>
<p>Joe @ Metafy<br />
Boulder, Colorado USA</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Doug Ransom</title>
		<link>http://decafbad.com/blog/2005/05/08/whats-old-scraping-is-new-again-microformats/comment-page-1#comment-1648</link>
		<dc:creator>Doug Ransom</dc:creator>
		<pubDate>Tue, 30 Nov 1999 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.decafbad.com/blog/?p=643#comment-1648</guid>
		<description>&lt;p&gt;Here is a microformat I proposed a couple years ago http://internetalchemy.org/2003/04/rssInXHTML&lt;/p&gt;

&lt;p&gt;Generally, I think the microformat concept is great.  If an HTML author can at least use CSS nicely, it's trivial to create an RSS feed with tidy + XSL (or any other way).  That's useful if their content management system is some sort of lame HTML editing system that inserts the author's text into a template (i.e. only one page can be created from a single source) rather than generating markup for several pages (HTML, RSS) from a content source.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Here is a microformat I proposed a couple years ago <a href="http://internetalchemy.org/2003/04/rssInXHTML" rel="nofollow">http://internetalchemy.org/2003/04/rssInXHTML</a></p>
<p>Generally, I think the microformat concept is great.  If an HTML author can at least use CSS nicely, it&#8217;s trivial to create an RSS feed with tidy + XSL (or any other way).  That&#8217;s useful if their content management system is some sort of lame HTML editing system that inserts the author&#8217;s text into a template (i.e. only one page can be created from a single source) rather than generating markup for several pages (HTML, RSS) from a content source.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
