
Sun, 26 Sep 2010

11:17 AM - Project progress

I've been working on a new web application to download and mix RSS feeds.  The idea is to cache feed items into a database and allow mixing of content from many sources.  I have about 1,300 RSS feeds so far on a variety of topics.  It's rather interesting to run LIKE queries on the tables and find all sorts of information.
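The database schema isn't described here, so as a rough sketch only: a cached feed item boils down to a source and a title (plus link, date, etc. in practice), and the LIKE queries amount to case-insensitive substring matches over the title column.  An in-memory Java stand-in for that idea:

```java
import java.util.ArrayList;
import java.util.List;

public class FeedCache {
    // Minimal stand-in for a cached feed item row; the real schema is an assumption.
    static class Item {
        final String source;
        final String title;
        Item(String source, String title) { this.source = source; this.title = title; }
    }

    private final List<Item> items = new ArrayList<>();

    void add(Item item) { items.add(item); }

    // In-memory equivalent of: SELECT * FROM item WHERE lower(title) LIKE '%term%'
    List<Item> search(String term) {
        String t = term.toLowerCase();
        List<Item> hits = new ArrayList<>();
        for (Item i : items) {
            if (i.title.toLowerCase().contains(t)) hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        FeedCache cache = new FeedCache();
        cache.add(new Item("blogA", "Mixing RSS feeds into one stream"));
        cache.add(new Item("blogB", "Weekend cooking tips"));
        System.out.println(cache.search("rss").size()); // prints 1
    }
}
```

In a real database the same search is one LIKE (or full-text) query, which is what makes poking around the cached tables so interesting.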

My long-term plan is to have it index all public JustJournal entries, plus content from other sites like Twitter (just the popular stuff).

There are services like this already, but many of them charge a lot of money for the information.  It's not like a typical search engine, because one wants to access the information repeatedly, and Google tends to block that.  It's an interesting problem we stumbled onto at work.  Most likely the content will go paid down the road, but I think there will still be free content as well.

The first version of the RSS fetcher is complete.  It has pulled in about 444,000 articles so far (RSS items), 56,000 categories (per the RSS 2.0 spec), and 55,000 enclosures (file attachments like podcasts and images).
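The fetcher itself isn't shown, but the structures it stores map straight onto RSS 2.0: each item may carry category elements and an enclosure element with url, length, and type attributes.  A minimal sketch with the JDK's built-in DOM parser (the sample item is made up for illustration):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RssItemDemo {
    // Hypothetical RSS 2.0 item for demonstration only.
    static final String SAMPLE =
        "<item>"
        + "<title>Example episode</title>"
        + "<category>podcasts</category>"
        + "<enclosure url=\"http://example.com/ep1.mp3\" length=\"123456\" type=\"audio/mpeg\"/>"
        + "</item>";

    static Element parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return doc.getDocumentElement();
    }

    // Text content of the first element with the given tag name.
    static String text(Element root, String tag) {
        return root.getElementsByTagName(tag).item(0).getTextContent();
    }

    // The enclosure's url attribute, per the RSS 2.0 spec.
    static String enclosureUrl(Element root) {
        Element enc = (Element) root.getElementsByTagName("enclosure").item(0);
        return enc.getAttribute("url");
    }

    public static void main(String[] args) throws Exception {
        Element item = parse(SAMPLE);
        System.out.println(text(item, "category") + " | " + enclosureUrl(item));
        // prints: podcasts | http://example.com/ep1.mp3
    }
}
```

Each parsed category and enclosure then becomes its own row keyed to the item, which is where those 56,000 and 55,000 counts come from.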

One service actually charges $75,000 for this functionality.  They have a lot more content than I do, but most of it is garbage from Twitter.  If you do a keyword search for, say, Clorox, you end up with posts about bleaching blood out, throwing it on people, cleaning tips, and other crazy things.  There's obviously a need for good filters and smart content searching.  I only know of one method to do this right now, and I'm not going to buy IBM OmniFind :)
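Just to make the filtering problem concrete, and not as any real solution: the crudest possible filter is a keyword match plus a hand-curated blocklist of noise phrases.  Everything here (the phrases, the method) is an invented illustration of why naive filtering falls short:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BrandFilter {
    // Hypothetical noise phrases for a "clorox" search; a real list would be curated.
    static final Set<String> NOISE =
        new HashSet<>(Arrays.asList("bleaching blood", "throwing it", "prank"));

    // Naive relevance check: the keyword must appear and no noise phrase may appear.
    static boolean relevant(String post, String keyword) {
        String lower = post.toLowerCase();
        if (!lower.contains(keyword.toLowerCase())) return false;
        for (String noise : NOISE) {
            if (lower.contains(noise)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(relevant("Clorox announced a new product line", "clorox")); // true
        System.out.println(relevant("tips for bleaching blood out with clorox", "clorox")); // false
    }
}
```

A blocklist like this never keeps up with real posts, which is exactly why "smart content searching" is the hard part.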

This is one of the many problem domains I deal with at work.  The thing is, they have no interest in collecting the content themselves, whereas I see a lot of potential in it.  This doesn't really overlap with work, but it's related to what I do right now.  Scary, isn't it?

The biggest hurdles are:

1. Bandwidth.  Downloading thousands of RSS feeds takes a long time.

2. Storage capacity.  I'm not sure how long I can retain content. 

3. Blocks.  I might get blacklisted while harvesting, so care has to be taken in fetching content.  The Java libraries I'm using right now don't honor RSS feed update intervals, so I'm limiting fetches to at least 60 minutes apart for now.

4. Legal.  I'm not planning on charging for this data now, and effectively I'm acting like a search engine: I spider content, collect it, and cache it for a period of time.  The only difference is how one accesses it.
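The 60-minute limit from hurdle 3 can be sketched as a small per-feed throttle.  This isn't the fetcher's actual code, just one way to enforce a minimum interval between fetches of the same feed URL:

```java
import java.util.HashMap;
import java.util.Map;

public class FetchThrottle {
    private final long minIntervalMillis;
    private final Map<String, Long> lastFetch = new HashMap<>();

    FetchThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Returns true (and records the time) if the feed may be fetched now;
    // returns false if it was fetched too recently.
    synchronized boolean tryAcquire(String feedUrl, long nowMillis) {
        Long last = lastFetch.get(feedUrl);
        if (last != null && nowMillis - last < minIntervalMillis) return false;
        lastFetch.put(feedUrl, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        FetchThrottle throttle = new FetchThrottle(60L * 60 * 1000); // 60 minutes
        String feed = "http://example.com/feed.xml";
        System.out.println(throttle.tryAcquire(feed, 0L));              // true
        System.out.println(throttle.tryAcquire(feed, 30L * 60 * 1000)); // false, too soon
    }
}
```

The fetch loop would simply skip any feed for which tryAcquire returns false, which keeps the crawler polite even when the feed library itself ignores the published update intervals.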

The next step in the project is creating the website.  I've got help with that phase.
