www.MediaCloud.org/
Janos Haits on 31 May 11
Media Cloud performs five basic functions: media definition, crawling, text extraction, word vectoring, and analysis. First, we define the set of media sources we want to collect and discover the feeds for each media source (which in the case of many newspapers includes hundreds of feeds). Second, we crawl each of those feeds several times each day to discover any new stories published by each feed and then download the HTML of each new story. Third, we extract just the substantive content of each story from each HTML page, leaving behind the ads, navigation, and other cruft. Fourth, we break that substantive text down into a set of word counts so that we can count, down to the level of individual sentences, which words each media source is using to talk about which topics. And finally, we have a set of tools for analyzing those word counts, including the Media Dashboard tool that acts as the front page for http://mediacloud.org.
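
The third and fourth steps (text extraction and word vectoring) can be illustrated with a minimal sketch. This is not Media Cloud's actual code: the names `TextExtractor`, `extract_text`, `sentence_word_counts`, and `story_word_counts` are hypothetical, and the rule of dropping `<script>`, `<style>`, and `<nav>` elements is a deliberately crude stand-in for real content extraction. It uses only the Python standard library.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text that is not inside <script>, <style>, or <nav>.

    Assumption: skipping these three tags approximates "leaving behind
    the ads, navigation, and other cruft". A real extractor needs more.
    """

    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return the page's visible text with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def sentence_word_counts(text):
    """Split text into sentences and count lowercase words in each one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]


def story_word_counts(html):
    """Sum the per-sentence counts into one count for the whole story."""
    total = Counter()
    for counts in sentence_word_counts(extract_text(html)):
        total.update(counts)
    return total
```

Keeping the counts per sentence and summing them afterwards mirrors the description above: the same data can then be aggregated per story, per feed, or per media source.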