
DJCamp2011: Group items tagged "scraping"

Tom Johnson

Comparison of Web-scraping software - 0 views

  • Scott Wilson of http://screen-scraper.com/ has created a useful comparison of web-scraping software and posted it as a public spreadsheet on Google Docs: https://docs.google.com/spreadsheet/ccc?key=0AsaY3Pb1lTh1dERtbWtTN0U3REgtYlNld0stV0NCV1E#gid=0
Tom Johnson

International Dataset Search - 0 views

  • International Dataset Search Description: The TWC International Open Government Dataset Catalog (IOGDC) is a linked data application based on metadata scraped from a growing number of international dataset catalog websites publishing a rich variety of government data. Metadata extracted from these catalog websites is automatically converted to RDF linked data, re-published via the TWC LOGD SPARQL endpoint, and made available for download. The TWC IOGDC demo site features an efficient, reconfigurable faceted browser with search capabilities, offering a compelling demonstration of the value of a common metadata model for open government dataset catalogs. We believe the vocabulary choices demonstrated by IOGDC highlight the potential for useful linked data applications built from open government catalogs and will encourage the adoption of such a standard worldwide. Warning: this demo will crash IE7 and IE8. Contributors: Eric Rozell, Jinguang Zheng, Yongmei Shi. Live demo: http://logd.tw.rpi.edu/demo/international_dataset_catalog_search Notes: This is an experimental demo, and some queries may take longer to respond (30-60 seconds); please refresh the page if the demo does not load. Our metadata model can be accessed here. The procedure for getting and publishing metadata is described here. The RDF dump of the datasets can be downloaded here. International OGD Catalog Search (searching 736,578 datasets)
  • Loads surprisingly quickly. Try entering your favorite search term in the top blue box; quotes can be used to define phrases.
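The catalog above republishes its scraped metadata through a SPARQL endpoint, which can be queried programmatically over HTTP. The sketch below shows the general pattern with only the standard library; the endpoint path and the title-matching predicate pattern are illustrative assumptions, not the actual IOGDC schema.

```python
# Minimal sketch of querying a SPARQL endpoint over HTTP, as one might do
# against the TWC LOGD endpoint described above. The endpoint URL and the
# wildcard-predicate title match are assumptions, not the real schema.
import json
import urllib.parse
import urllib.request


def build_query(keyword: str, limit: int = 10) -> str:
    """Build a simple SPARQL query matching dataset titles on a keyword."""
    return f"""
    SELECT ?dataset ?title WHERE {{
        ?dataset ?p ?title .
        FILTER(CONTAINS(LCASE(STR(?title)), "{keyword.lower()}"))
    }} LIMIT {limit}
    """


def run_query(endpoint: str, query: str) -> dict:
    """POST the query and parse the standard SPARQL JSON results format."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    req = urllib.request.Request(endpoint, data=data)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


query = build_query("budget")
# results = run_query("http://logd.tw.rpi.edu/sparql", query)  # hypothetical endpoint path
```

The network call is left commented out because the endpoint path is a guess; any SPARQL 1.1 endpoint accepting form-encoded POST queries would work the same way.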
Tom Johnson

Constructing the Open Data Landscape | ScraperWiki Data Blog - 0 views

  • Constructing the Open Data Landscape Posted on September 7, 2011 by Nicola Hughes In an article in today's Telegraph regarding Francis Maude's Public Data Corporation, Michael Cross asks: "What makes the state think it can be at the cutting edge of the knowledge economy?" He writes in terms of market and business share, giving the example of the satnav market, worth over $100bn a year, yet based on free data from the US Government's GPS system. He credits the internet revolution with transforming public sector data into a 'cashable proposition'. We, along with many other start-ups, foundations and civic coding groups, are part of this 'geeky world' of Open Data. So we'd like to add our piece concerning the Open Data movement. Michael has the right to ask this question because there is a constant custodial battle being fought, every day, every scrape and every script on the web, for the rights to data. So let me tell you about the geeks' take on Open Data.
Tom Johnson

We Just Ran Twenty-Three Million Queries of the World Bank's Website - Working Paper 36... - 0 views

  •  
    "Abstract: Much of the data underlying global poverty and inequality estimates is not in the public domain, but can be accessed in small pieces using the World Bank's PovcalNet online tool. To overcome these limitations and reproduce this database in a format more useful to researchers, we ran approximately 23 million queries of the World Bank's web site, accessing only information that was already in the public domain. This web scraping exercise produced 10,000 points on the cumulative distribution of income or consumption from each of 942 surveys spanning 127 countries over the period 1977 to 2012. This short note describes our methodology, briefly discusses some of the relevant intellectual property issues, and illustrates the kind of calculations that are facilitated by this data set, including growth incidence curves and poverty rates using alternative PPP indices. The full data can be downloaded at www.cgdev.org/povcalnet."
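The abstract describes recovering each survey's cumulative distribution by querying the same public tool at thousands of points. A hedged sketch of that query-grid idea follows; the base URL and parameter names are illustrative placeholders, not PovcalNet's actual API, and the grid of poverty lines is a guess at the kind of sweep involved.

```python
# Hedged sketch of the query grid the authors describe: many points on each
# survey's cumulative distribution, fetched one query at a time. The base
# URL and parameter names (Country, Year, PovertyLine) are illustrative
# placeholders, not PovcalNet's real interface.
from urllib.parse import urlencode

BASE = "http://example.org/povcalnet-placeholder"  # not the real endpoint


def query_urls(country: str, year: int, n_points: int = 10_000):
    """Yield one URL per point on the CDF, sweeping a hypothetical poverty line."""
    for i in range(1, n_points + 1):
        line = round(i * 0.01, 2)  # hypothetical grid: $0.01 steps per day
        yield BASE + "?" + urlencode(
            {"Country": country, "Year": year, "PovertyLine": line}
        )


urls = list(query_urls("IND", 2011))
```

At 10,000 points per survey across 942 surveys, a sweep like this lands in the low tens of millions of requests, which matches the order of magnitude the authors report.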
Tom Johnson

Corporate Accountability Data in Influence Explorer - Sunlight Labs: Blog - 0 views

  • Again, US-centric, but this might generate some ideas of what could be accomplished in your city/nation. Late yesterday we announced a bunch of new features for Influence Explorer: http://sunlightlabs.com/blog/2011/ie-corporate-accountability/ As the blog post explains, you can now find information about a corporation's EPA violations, federal advisory committee memberships, and participation in the rulemaking process -- all in one place. I wanted to highlight that last feature a bit more, though. To my knowledge, this is the first time that the full corpus of public comments submitted to regulations.gov has been available for bulk download and analysis. This isn't a coincidence: regulations.gov is built using technologies that make scraping it unusually difficult. This is unfortunate, since everyone seems to agree that federal rulemakings are gaining in importance -- both because of congressional gridlock that leaves the regulatory process as a second-best option, and because of calls to simplify the regulatory landscape as a pro-growth measure. It's an area where influence is certainly exerted -- rulemakers are obliged to review every comment -- but little attention is paid to who's flooding dockets with comments, and in which directions rules are being pushed. It's taken us several months to develop a reliable solution and to obtain past rulemakings, but we now have the data in hand. We plan to do much more with this dataset, and we're hoping that others will want to dig in, too. You can find a link to the bulk download options in the post above -- the full compressed archive of extracted text and metadata is ~16GB, but we've provided options for grabbing individual agencies' or dockets' data. If anyone wants the original documents (PDFs, DOCs, etc.) we can talk through how to make that happen, but as they clock in at 1.5TB we'll want to make sure folks know what they're getting into before we spend the time and bandwidth. Finally, note that we currently o
Tom Johnson

Data Docs: Interactive video and audio - 0 views

  • "Data Docs is a video platform that allows filmmakers and journalists to combine elements from the web, such as interactive graphics, text and scraped information, with linear media, such as video and audio. Having worked in video both in long-form documentary and web video, we understand the power of visual media. Videos are powerful vehicles that we can use to tell personable or explanatory immersive stories. But one of the drawbacks of video as a medium is that videos are finished products, which, after they have been published, become outdated fairly quickly. Advances in technology and databases have allowed data to be more flexible than video. Data visualizations and interactive infographics, for instance, can be up to date at any moment in time if they are hooked up to the right databases. Think of charts of stock markets that update every millisecond because APIs or other technological mechanisms feed them live data. We wanted to combine those two worlds - the world of immersive video storytelling and that of live and constantly updated data. This is why we created Data Docs. Through the Data Docs code library, filmmakers and developers can 'hook up' their video to live data and other up-to-date information from the web. The library also allows you to integrate your own interactives with specific fonts and styles into your video. It enables you to project HTML, CSS and JavaScript-based graphics on your video. This helps you make videos that will never be out of date or, in other words, to make videos that are evergreen."
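The core mechanism described above is pairing spans of video time with overlays whose data is fetched at render time rather than baked in. Data Docs itself is a JavaScript library; the sketch below is a language-neutral illustration in Python of that cue idea, with hypothetical names throughout, not the library's actual API.

```python
# Conceptual sketch (in Python, though Data Docs itself is JavaScript) of the
# core idea: cues that map a span of video time to a live-data overlay whose
# content is produced at render time, so it never goes stale.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cue:
    start: float               # seconds into the video
    end: float                 # exclusive end of the span
    fetch: Callable[[], str]   # called at render time, so data stays current


def active_cue(cues: List[Cue], t: float) -> Optional[Cue]:
    """Return the cue covering playback time t, if any."""
    for cue in cues:
        if cue.start <= t < cue.end:
            return cue
    return None


cues = [
    Cue(0.0, 12.5, lambda: "live stock chart"),
    Cue(12.5, 30.0, lambda: "updated poverty map"),
]
```

On each playback tick, a real implementation would look up the active cue and render its fetched HTML/CSS/JS graphic over the video frame.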
Tom Johnson

Needlebase - for acquiring, integrating, cleansing, analyzing and publishing data on th... - 1 views

  • ITA Software is proud to introduce Needlebase™, a revolutionary platform for acquiring, integrating, cleansing, analyzing and publishing data on the web. Using Needlebase through a web browser, without programmers or DBAs, your data team can easily:
    - Acquire data from multiple sources: a simple tagging process quickly imports structured data from complex websites, XML feeds, and spreadsheets into a unified database of your design.
    - Merge, deduplicate and cleanse: Needlebase uses intelligent semantics to help you find and merge variant forms of the same record. Your merges, edits and deletions persist even after the original data is refreshed from its source.
    - Build and publish custom data views: use Needlebase's visual UI and powerful query language to configure exactly your desired view of the data, whether as a list, table, grid, or map. Then, with one click, publish the data for others to see, or export a feed of the clean data to your own local database.
    Needlebase dramatically reduces the time, cost, and expertise needed to build and maintain comprehensive databases of practically anything. Read on to learn more about Needlebase's capabilities and our early adopters' success stories, or watch our tutorial videos. Then sign up to get started!
  • http://needlebase.com
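Needlebase's "intelligent semantics" for merging variant forms of the same record are proprietary and unspecified; a common baseline for that task is key normalization, sketched below. This is a generic illustration, not Needlebase's actual method, and the suffix list is an arbitrary example.

```python
# Generic sketch of merging variant forms of the same record by normalising
# a key field. Needlebase's actual "intelligent semantics" are proprietary
# and certainly more sophisticated than this toy normaliser.
import re
from collections import defaultdict
from typing import Dict, List


def normalise(name: str) -> str:
    """Lowercase, strip punctuation, drop common company suffixes, collapse spaces."""
    key = re.sub(r"[^\w\s]", "", name.lower())
    key = re.sub(r"\b(inc|llc|ltd|corp)\b", "", key)
    return " ".join(key.split())


def merge_records(records: List[Dict]) -> List[Dict]:
    """Group records whose normalised names match; keep the first of each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalise(rec["name"])].append(rec)
    return [group[0] for group in groups.values()]


records = [{"name": "Acme, Inc."}, {"name": "ACME Inc"}, {"name": "Globex Corp."}]
merged = merge_records(records)  # "Acme, Inc." and "ACME Inc" collapse to one record
```

Keeping the first record of each group mirrors the persistence behavior described above, where merges survive a refresh of the source data as long as the normalized keys still collide.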