10 April, 2012

commandline RSS->text tool using Haskell arrows

I wanted barwen.ch to display news updates at login. I already have an RSS feed from the Drupal installation on the main page, and that feed is already gatewayed into the IRC channel, so it seemed an obvious place to get news updates from.

I wrote a tool, rsstty, to output the headlines to stdout. Then I wired it into the existing update-motd installation so that it fires every time someone logs in.

So you can say:

$ rsstty http://s0.barwen.ch/rss.xml
 * ZNC hosting(Thu, 01 Mar 2012 10:09:15 +0000)
 * finger server with cgi-like functionaity(Wed, 22 Feb 2012 18:43:08 +0000)
 * Welcome, people who are reading the login MOTD(Fri, 17 Feb 2012 23:56:44 +0000)
 * resized and rebooted(Wed, 25 Jan 2012 12:23:39 +0000)
 * One time passwords (HOTP/TOTP)(Wed, 18 Jan 2012 11:33:45 +0000)

I wrote the code in Haskell, using the arrow-xml package.

arrow-xml is a library for munging XML data. Programming using it is vaguely reminiscent of XSLT, but it is embedded inside Haskell, so you get to use Haskell syntax and Haskell libraries.

The interesting arrow bit of the code is this. Arrow syntax is kinda awkward to get used to: it is sufficiently different from regular Haskell syntax and from monad syntax that even if you know those, you still have to get used to it. If you want to get even more confused, try to figure out how it ties into category theory, which is possibly the worst way to learn arrows ever.

But basically, the definition below makes a Haskell arrow which turns a URL (of an RSS feed) into a stream of one-line text headlines with title and date (as above):

> arrow1 urlstring =
>  proc x -> do
>   url <- (arr $ const urlstring) -< x

This turns the supplied URL into a stream containing just that single URL (i.e. awkward plumbing).

>   rss <- readFromDocument [withValidate no, withCurl []] -< url

This uses that unixy favourite, curl (which already has Haskell bindings), to convert a stream of URLs into a stream of XML documents retrieved from those URLs - for each URL, there will be one corresponding XML document.

>   item <- deep (hasName "item" <<< isElem) -< rss

Now convert a stream of XML documents into a stream of <item> XML elements. Each XML document might have multiple item elements (and probably will - each RSS news item is supplied as an <item>) so there will be more things in the output stream than in the input stream.

>   title <- textOfChild "title" -< item
>   pubdate <- textOfChild "pubDate" -< item

Next, I'm going to pull out the text of the <title> and <pubDate> child elements of each item. There should be one of each per item.

>   returnA -< " * " ++ title ++ "(" ++ pubdate ++ ")\n"

When we get to this point, we should have a stream of items, a stream of titles corresponding to each item, and a stream of pubdates corresponding to each title. So now I can return (using the arrow-specific returnA) what I want using regular Haskell string operations: a stream of strings describing each item.
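
To see why the streams stay lined up, here is a minimal sketch that needs only base. Kleisli [] from Control.Arrow behaves like a list arrow, so the same proc shape can be tried on plain data without any XML. The Item type, the headlines arrow and sampleFeed are hypothetical names made up for this illustration; they are not part of rsstty.

```haskell
{-# LANGUAGE Arrows #-}

import Control.Arrow (Kleisli (..), arr, returnA)

-- Hypothetical stand-in for an RSS <item>: (title, publication date).
type Item = (String, String)

-- A list arrow: one input feed fans out into one output per item.
headlines :: Kleisli [] [Item] String
headlines = proc feed -> do
  item    <- Kleisli id -< feed   -- one stream element per item in the feed
  title   <- arr fst    -< item   -- exactly one title for each item
  pubdate <- arr snd    -< item   -- exactly one date for each item
  returnA -< " * " ++ title ++ "(" ++ pubdate ++ ")"

-- Made-up sample data, shaped like the feed shown earlier.
sampleFeed :: [Item]
sampleFeed =
  [ ("ZNC hosting", "Thu, 01 Mar 2012 10:09:15 +0000")
  , ("resized and rebooted", "Wed, 25 Jan 2012 12:23:39 +0000")
  ]
```

In GHCi, runKleisli headlines sampleFeed yields one headline string per item, in feed order. Each later line of the proc block runs once for every value produced by the lines before it, which is why a title never gets paired with the wrong date.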

The above arrow is wrapped in code which feeds in the URL from the command line, and displays the stream of one-line news items on stdout.
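
The wrapper is not shown here. A minimal sketch of what such a wrapper could look like follows, assuming the library is HXT (Text.XML.HXT.Core, plus the hxt-curl package for withCurl) and that textNodeToString behaves like HXT's getText. The usage message and the argument handling are guesses, not the original code. The arrow is repeated so that the sketch compiles on its own.

```haskell
{-# LANGUAGE Arrows #-}
module Main (main) where

import Control.Arrow (arr, returnA, (<<<))
import System.Environment (getArgs)
import System.Exit (exitFailure)
import System.IO (hPutStrLn, stderr)
import Text.XML.HXT.Core
import Text.XML.HXT.Curl (withCurl)

-- The headline arrow from the post, with an explicit type added.
arrow1 :: String -> IOSArrow XmlTree String
arrow1 urlstring =
  proc x -> do
    url     <- arr (const urlstring) -< x
    rss     <- readFromDocument [withValidate no, withCurl []] -< url
    item    <- deep (hasName "item" <<< isElem) -< rss
    title   <- textOfChild "title" -< item
    pubdate <- textOfChild "pubDate" -< item
    returnA -< " * " ++ title ++ "(" ++ pubdate ++ ")\n"

-- ASSUMPTION: getText stands in for the post's textNodeToString helper.
textOfChild :: ArrowXml a => String -> a XmlTree String
textOfChild name =
  getText <<< getChildren <<< hasName name <<< isElem <<< getChildren

main :: IO ()
main = do
  args <- getArgs
  case args of
    [url] -> do
      -- runX runs the arrow in IO and collects every output into a list.
      headlines <- runX (arrow1 url)
      -- Each headline already ends in a newline, so putStr is enough.
      mapM_ putStr headlines
    _ -> do
      -- Hypothetical usage message; the real tool may behave differently.
      hPutStrLn stderr "usage: rsstty URL"
      exitFailure
```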

The other interesting bit is a helper arrow, textOfChild, which extracts the text content of a named child of each element coming through an XML stream. Each part of this helper arrow is another arrow, and they're wired together using <<<. To read it, imagine feeding in XML elements at the right-hand side, with each arrow taking that stream and outputting a different stream: first each element is converted into a stream of its children; then only the element children are allowed through; then only the elements with the supplied name; then all of the children of any elements so selected; and then the text content of those. (It's quite a long chain, but that's what the XML infoset looks like...)

> textOfChild name =
>  textNodeToString <<< getChildren <<< hasName name <<< isElem <<< getChildren
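
The helper can be tried offline on a small inline element. This is only a sketch under two assumptions: that the library is HXT (Text.XML.HXT.Core), and that textNodeToString does the same job as HXT's getText, which is substituted here. The names sampleItem and childText are hypothetical and exist only for this illustration.

```haskell
import Text.XML.HXT.Core

-- ASSUMPTION: getText stands in for the post's textNodeToString helper.
textOfChild :: ArrowXml a => String -> a XmlTree String
textOfChild name =
  getText <<< getChildren <<< hasName name <<< isElem <<< getChildren

-- Hypothetical sample element, parsed from a string instead of fetched.
sampleItem :: String
sampleItem =
  "<item><title>ZNC hosting</title>"
    ++ "<pubDate>Thu, 01 Mar 2012 10:09:15 +0000</pubDate></item>"

-- xread parses the string into XML trees; runLA runs a pure list arrow
-- and returns every value in its output stream as a list.
childText :: String -> String -> [String]
childText name = runLA (xread >>> textOfChild name)
```

In GHCi, childText "title" sampleItem should give a one-element list holding the title text, while childText "link" sampleItem should give an empty list: an element without a matching child simply contributes nothing to the output stream.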