Building an Extensible RSS Parser in C#
The Need for an Extensible RSS Parser
This came up while I was trying to parse an RSS feed with an existing XML parsing library. The project needed additional fields that were available in the feed, such as "dc:publisher". However, the library's parser wasn't easy to extend to consume these fields. Editing the existing library code directly would force everyone who uses the library to have their implementations look for these custom fields. And copying the parser out of the library and embedding it directly in the project would not be good practice for code reuse. What would have been useful is a parser in the library that could easily be extended to include any additional fields requested.
In my quest to find the most awesomely extensible parser ever, I arrived at a solution that touches on a number of programming topics, including:
- Deserializing XML using attributes to identify objects and properties
- Use of Generics working with Abstract Classes and Interfaces
- Consuming custom or namespaced nodes found in RSS
- Building extensible library code
- Unit Testing
Laying Out the Requirements
The first thing we need to do is set out some requirements:
- The base RSS parser should be able to live in a library file and not require editing for new projects.
- The base RSS parser should be extensible so that I can specify the fields that I want to consume that vary from project to project.
- The model should implement an interface so that usage of the RSS parser remains consistent.
- I would like to use the .NET XmlSerializer class along with XmlElement attributes because I like the readability and intuitiveness it provides when defining new fields to consume.
Mocking up the API
From a design perspective, coding to an API is a great way to start. Here is a good post on API-First Design. The first line should let me specify my feed source, deserialize it, and return to me a feed object:
TFeed feed = Deserializer.GetFeed<TFeed>(source);
This line implies that we’ll have a static deserialization class with a generic GetFeed() method. The type argument specifies the type of my feed, which will be an extension of the RSS base class, and the method argument is the feed source. In this case the source will be a URL, but we’ll overload it so that it can take a FileStream as well. Once I have my feed object I’m going to want to be able to get a list of channels. We’ll look at using an Interface here to ensure we always have a GetChannels() method.
List<TChannel> channels = feed.GetChannels();
Finally, once I have a channel, I’m going to want a method to get my RSS items. We’ll also use an Interface here to make sure we always have a GetItems() method.
List<TItem> items = channel.GetItems();
Building the Library Code
Now that we have requirements and specifications, let’s put together some code. Here’s a Deserializer that takes in a generic feed type and feed source, and returns a feed object:
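using System.Collections.Generic;
using System.Xml;
using System.Xml.Serialization;

public static class RSSDeserializer
{
    public static T GetFeed<T>(string feedUrl)
    {
        if (string.IsNullOrEmpty(feedUrl)) return default(T);
        XmlSerializer xs = new XmlSerializer(typeof(T));
        // The original listing is cut off after the serializer is created.
        // The lines below are an assumed completion that reads straight from the URL.
        using (XmlReader reader = XmlReader.Create(feedUrl))
        {
            return (T)xs.Deserialize(reader);
        }
    }
}
Next come the interfaces that guarantee a GetRSSChannels() method on every feed and a GetRSSItems() method on every channel:
public interface IRSSFeed<TChannel>
{
    List<TChannel> GetRSSChannels();
}
public interface IRSSChannel<TItem>
{
    List<TItem> GetRSSItems();
}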
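The API mock-up mentions an overload that accepts a FileStream in addition to a URL. The listing for that overload is not shown, so what follows is only a minimal sketch of how it might look. The names StreamFeedReader and SampleFeed are hypothetical and exist purely so the example compiles on its own; accepting the Stream base class rather than FileStream is also an assumption.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for a feed type, so the sketch is self-contained.
[XmlRoot("rss")]
public class SampleFeed
{
    [XmlAttribute("version")]
    public string Version { get; set; }
}

// Hypothetical helper; in the library this would be a second GetFeed overload.
public static class StreamFeedReader
{
    // Deserializes a feed of type T from any readable stream.
    public static T GetFeed<T>(Stream feedStream)
    {
        if (feedStream == null) return default(T);
        XmlSerializer xs = new XmlSerializer(typeof(T));
        return (T)xs.Deserialize(feedStream);
    }
}
```

Because FileStream derives from Stream, a caller could pass the result of File.OpenRead to this method, and a MemoryStream works equally well in unit tests.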
Next, we’ll set up our abstract base classes. We’ll have one for a feed object, then a channel object, and lastly an item object. These make up the standard structure of any RSS feed (feed/channel/item) and distinguish it from a more generic XML deserializer model.
- Each class is abstract so we’ll need to extend these in our project’s implementation.
- Each class has a Serializable attribute
- Each class has an XmlRoot attribute that follows the standard RSS structure of feed/channel/item.
- Within each class we identify our required RSS fields with XmlElement attributes. Here is a link to a resource that shows which fields RSS requires as well as the standard optional fields: http://www.w3schools.com/xml/xml_rss.asp
- The Date field isn’t a required RSS field, so it could technically go in our extended implementation. Note the XmlIgnore attribute on the Date property in the BaseRSSItem class: the serializer skips Date and fills the pubDate string instead, and the Date getter converts that string to a DateTime on demand. I would have liked to do this conversion in a post-deserialization hook such as IDeserializationCallback.OnDeserialization or an [OnDeserialized] method, but it appears these are not supported when using the XmlSerializer object.
[Serializable]
[XmlRoot("rss", IsNullable = false)]
public abstract class BaseRSSFeed<TChannel> : IRSSFeed<TChannel>
{
    [XmlElement("channel")]
    public List<TChannel> RSSChannels { get; set; }

    // Never returns null, so callers can iterate without a null check.
    public virtual List<TChannel> GetRSSChannels()
    {
        if (RSSChannels == null)
        {
            return new List<TChannel>();
        }
        return RSSChannels;
    }
}
[Serializable]
[XmlRoot("channel", IsNullable = false)]
public abstract class BaseRSSChannel<TItem> : IRSSChannel<TItem>
{
    [XmlElement("title")]
    public string Title { get; set; }
    [XmlElement("description")]
    public string Description { get; set; }
    [XmlElement("link")]
    public string Link { get; set; }
    [XmlElement("item")]
    public List<TItem> RSSItems { get; set; }

    // Never returns null, so callers can iterate without a null check.
    public virtual List<TItem> GetRSSItems()
    {
        if (RSSItems == null)
        {
            return new List<TItem>();
        }
        return RSSItems;
    }
}
[Serializable]
[XmlRoot("item", IsNullable = false)]
public abstract class BaseRSSItem
{
    [XmlElement("title")]
    public string Title { get; set; }
    [XmlElement("description")]
    public string Description { get; set; }
    [XmlElement("pubDate")]
    public string pubDate { get; set; }
    [XmlElement("link")]
    public string Link { get; set; }

    // Not deserialized directly; derived from the pubDate string on each read.
    [XmlIgnore]
    public DateTime Date
    {
        get
        {
            DateTime _date;
            DateTime.TryParse(pubDate, out _date);
            return _date;
        }
    }
}
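The Date property above relies on DateTime.TryParse with the current culture and yields DateTime.MinValue when parsing fails. As a hedged alternative, the sketch below tries the RFC 1123 pattern (the "r" standard format string) with the invariant culture first, then falls back to a looser invariant parse, and returns null on failure. PubDateParser is a hypothetical helper name, and returning a nullable value is a design choice of this sketch rather than something taken from the original code.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the original library.
public static class PubDateParser
{
    // Returns the publication date in UTC, or null when the text cannot be parsed.
    public static DateTime? Parse(string pubDate)
    {
        if (string.IsNullOrEmpty(pubDate)) return null;

        // First attempt: exact RFC 1123 form, e.g. "Sat, 07 Sep 2002 09:42:31 GMT".
        DateTimeOffset exact;
        if (DateTimeOffset.TryParseExact(pubDate.Trim(), "r",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out exact))
        {
            return exact.UtcDateTime;
        }

        // Fallback: looser parse, independent of the machine's culture settings.
        DateTimeOffset loose;
        if (DateTimeOffset.TryParse(pubDate, CultureInfo.InvariantCulture,
                DateTimeStyles.AssumeUniversal, out loose))
        {
            return loose.UtcDateTime;
        }

        return null;
    }
}
```

A nullable result lets the caller tell a missing or malformed pubDate apart from a genuine date, which DateTime.MinValue does not make obvious.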
Common RSS Namespaces
Let’s assume that we have this code compiled into our utility or library DLL. We can then set up our use case and extend our base classes. Suppose we want to consume a feed from a source produced by BlogEngine.NET. This platform loads in many of the optional RSS elements as well as a plethora of namespaces such as our good friends Dublin Core, BlogChannel, PingBack, Slash, and Geo. Here is an example:
Here is a resource that shows many of the most commonly used namespaces found within RSS feeds.
Extending Our Library Code
Our scenario will be simple: we want all the required fields that the base classes provide, plus the Publisher field from Dublin Core. We'll make a class called CustomRSSFeed and extend the base classes like this:
[Serializable]
[XmlRoot("rss", IsNullable = false)]
public class CustomRSSFeed : BaseRSSFeed<CustomRSSChannel>
{
}
[Serializable]
[XmlRoot("channel", IsNullable = false)]
public class CustomRSSChannel : BaseRSSChannel<CustomRSSItem>
{
}
[Serializable]
[XmlRoot("item", IsNullable = false)]
public class CustomRSSItem : BaseRSSItem
{
    // Maps the Dublin Core <dc:publisher> element onto a differently named property.
    [XmlElement("publisher", Namespace = "http://purl.org/dc/elements/1.1/")]
    public string Author { get; set; }
}
A few things to note:
- Because we are using the XmlElement attribute, our property name does not need to be the same as our node name.
- The XmlElement attribute on the property specifies the Namespace. We need to do this for any namespaced element. This is another good reason to place such fields in an extended class rather than in a single class model.
- We didn't add any additional fields to our Channel class, but we could have.
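To see the namespaced element being picked up end to end, here is a small self-contained sketch that deserializes an in-memory feed. The Demo-prefixed classes are condensed, hypothetical stand-ins for the base and custom classes, the generic shape of the base classes is an assumption about how the pieces fit together, and the sample XML is invented for illustration rather than taken from a real BlogEngine.NET feed.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Condensed, hypothetical versions of the library base classes.
public abstract class DemoBaseFeed<TChannel>
{
    [XmlElement("channel")]
    public List<TChannel> Channels { get; set; }
}

public abstract class DemoBaseChannel<TItem>
{
    [XmlElement("title")]
    public string Title { get; set; }
    [XmlElement("item")]
    public List<TItem> Items { get; set; }
}

public abstract class DemoBaseItem
{
    [XmlElement("title")]
    public string Title { get; set; }
}

// Project-level extensions: only the item adds a field.
[XmlRoot("rss")]
public class DemoFeed : DemoBaseFeed<DemoChannel> { }

public class DemoChannel : DemoBaseChannel<DemoItem> { }

public class DemoItem : DemoBaseItem
{
    // The Namespace must match the URI bound to the "dc" prefix in the feed.
    [XmlElement("publisher", Namespace = "http://purl.org/dc/elements/1.1/")]
    public string Author { get; set; }
}

public static class NamespaceDemo
{
    // Invented sample feed with one channel and one item.
    public const string SampleXml = @"<rss version=""2.0"" xmlns:dc=""http://purl.org/dc/elements/1.1/"">
  <channel>
    <title>Sample Blog</title>
    <item>
      <title>First Post</title>
      <dc:publisher>Example Publisher</dc:publisher>
    </item>
  </channel>
</rss>";

    // Deserializes the sample feed into the extended model.
    public static DemoFeed Load()
    {
        XmlSerializer xs = new XmlSerializer(typeof(DemoFeed));
        using (StringReader reader = new StringReader(SampleXml))
        {
            return (DemoFeed)xs.Deserialize(reader);
        }
    }
}
```

After calling NamespaceDemo.Load(), the first item of the first channel carries both the inherited Title and the Author value read from dc:publisher. The serializer matches on the namespace URI, not on the "dc" prefix, so a feed that binds the same URI to a different prefix should deserialize the same way.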
Using the New API
Let’s take a look at the usage kicking off the feed parser. To see the full usage going through each channel and each item, see the default.aspx page in the downloadable solution:
private void doBlogParser()
{
    // generate our feed object
    CustomRSSFeed rssFeed = RSSDeserializer.GetFeed<CustomRSSFeed>("http://www.dotnetblogengine.net/syndication.axd");
    // get and validate channel list
    List<CustomRSSChannel> rssChannels = rssFeed.GetRSSChannels();
    if (rssChannels == null) return;
    // bind repeater for rss channels
    rptRSSChannels.DataSource = rssChannels;
    rptRSSChannels.DataBind();
}
And that’s the can of corn. Attached here is a VS 2008 solution which has these samples rolled into something you can open and test straight away. There are even some unit tests in there for good measure. Remember that all library code for your company should contain unit tests.
Any comments are appreciated. You may have many ways to improve or expand upon this and I’d love to hear about them. Some of the future things I’d like to do with this are:
- Update the model so our extended classes can be extended themselves.
- Add a class constraint to the generic in the Deserializer class.
- Modify the model so we have a base class for required fields and a way to also include the full set of RSS optional fields.
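As a rough idea of what the class constraint in the second item might look like, here is a minimal sketch. ConstrainedDeserializer and TinyFeed are hypothetical names used only to keep the example self-contained, and reading from a TextReader is an assumption made so the sketch does not need network access.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for a feed type.
[XmlRoot("rss")]
public class TinyFeed
{
    [XmlAttribute("version")]
    public string Version { get; set; }
}

// Hypothetical variant of the deserializer with a class constraint.
public static class ConstrainedDeserializer
{
    // "where T : class" rejects value types such as int at compile time
    // and lets the method return a plain null instead of default(T).
    public static T GetFeed<T>(TextReader source) where T : class
    {
        if (source == null) return null;
        XmlSerializer xs = new XmlSerializer(typeof(T));
        return xs.Deserialize(source) as T;
    }
}
```

With the constraint in place, a call such as GetFeed<int>(...) no longer compiles, which surfaces a misuse of the API earlier than a runtime serialization error would.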