
DJCamp2011: Group items tagged "scraping"

Tom Johnson

Comparison of Web-scraping software - 0 views

  • Scott Wilson of http://screen-scraper.com/ has created a useful comparison of web-scraping software and posted it as a public spreadsheet on Google Docs: https://docs.google.com/spreadsheet/ccc?key=0AsaY3Pb1lTh1dERtbWtTN0U3REgtYlNld0stV0NCV1E#gid=0
Tom Johnson

International Dataset Search - 0 views

  • International Dataset Search Description: The TWC International Open Government Dataset Catalog (IOGDC) is a linked data application based on metadata scraped from a growing number of international dataset catalog websites publishing a rich variety of government data. Metadata extracted from these catalog websites is automatically converted to RDF linked data, re-published via the TWC LOGD SPARQL endpoint, and made available for download. The TWC IOGDC demo site features an efficient, reconfigurable faceted browser with search capabilities, offering a compelling demonstration of the value of a common metadata model for open government dataset catalogs. We believe the vocabulary choices demonstrated by IOGDC highlight the potential for useful linked data applications built from open government catalogs and will encourage the adoption of such a standard worldwide. Warning: this demo will crash IE7 and IE8. Contributors: Eric Rozell, Jinguang Zheng, Yongmei Shi. Live demo: http://logd.tw.rpi.edu/demo/international_dataset_catalog_search Notes: This is an experimental demo, and some queries may take longer to respond (30-60 seconds); please refresh the page if the demo does not load. Our metadata model can be accessed here. The procedure for getting and publishing metadata is described here. The RDF dump of the datasets can be downloaded here. International OGD Catalog Search (searching 736,578 datasets)
  • Loads surprisingly quickly. Try entering your favorite search term in the top blue box; quotes can be used to define phrases.
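The catalog above republishes its scraped metadata through a SPARQL endpoint, which can be queried programmatically over HTTP. The sketch below shows the general pattern with only the standard library; the endpoint path and the title-matching predicate pattern are illustrative assumptions, not the actual IOGDC schema.

```python
# Minimal sketch of querying a SPARQL endpoint over HTTP, as one might do
# against the TWC LOGD endpoint described above. The endpoint URL and the
# wildcard-predicate title match are assumptions, not the real schema.
import json
import urllib.parse
import urllib.request


def build_query(keyword: str, limit: int = 10) -> str:
    """Build a simple SPARQL query matching dataset titles on a keyword."""
    return f"""
    SELECT ?dataset ?title WHERE {{
        ?dataset ?p ?title .
        FILTER(CONTAINS(LCASE(STR(?title)), "{keyword.lower()}"))
    }} LIMIT {limit}
    """


def run_query(endpoint: str, query: str) -> dict:
    """POST the query and parse the standard SPARQL JSON results format."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    req = urllib.request.Request(endpoint, data=data)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


query = build_query("budget")
# results = run_query("http://logd.tw.rpi.edu/sparql", query)  # hypothetical endpoint path
```

The network call is left commented out because the endpoint path is a guess; any SPARQL 1.1 endpoint accepting form-encoded POST queries would work the same way.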
Tom Johnson

Constructing the Open Data Landscape | ScraperWiki Data Blog - 0 views

  • Constructing the Open Data Landscape Posted on September 7, 2011 by Nicola Hughes In an article in today's Telegraph regarding Francis Maude's Public Data Corporation, Michael Cross asks: "What makes the state think it can be at the cutting edge of the knowledge economy?" He writes in terms of market and business share, giving the example of the satnav market, worth over $100bn a year, yet based on free data from the US Government's GPS system. He credits the internet revolution with transforming public sector data into a 'cashable proposition'. We, along with many other start-ups, foundations and civic coding groups, are part of this 'geeky world' of Open Data. So we'd like to add our piece concerning the Open Data movement. Michael has the right to ask this question because there is a constant custodial battle being fought, every day, every scrape and every script on the web, for the rights to data. So let me tell you about the geeks' take on Open Data.
Tom Johnson

We Just Ran Twenty-Three Million Queries of the World Bank's Website - Working Paper 36... - 0 views

  •  
    "Abstract: Much of the data underlying global poverty and inequality estimates is not in the public domain, but can be accessed in small pieces using the World Bank's PovcalNet online tool. To overcome these limitations and reproduce this database in a format more useful to researchers, we ran approximately 23 million queries of the World Bank's web site, accessing only information that was already in the public domain. This web scraping exercise produced 10,000 points on the cumulative distribution of income or consumption from each of 942 surveys spanning 127 countries over the period 1977 to 2012. This short note describes our methodology, briefly discusses some of the relevant intellectual property issues, and illustrates the kind of calculations that are facilitated by this data set, including growth incidence curves and poverty rates using alternative PPP indices. The full data can be downloaded at www.cgdev.org/povcalnet."
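The abstract describes recovering each survey's cumulative distribution by querying the same public tool at thousands of points. A hedged sketch of that query-grid idea follows; the base URL and parameter names are illustrative placeholders, not PovcalNet's actual API, and the grid of poverty lines is a guess at the kind of sweep involved.

```python
# Hedged sketch of the query grid the authors describe: many points on each
# survey's cumulative distribution, fetched one query at a time. The base
# URL and parameter names (Country, Year, PovertyLine) are illustrative
# placeholders, not PovcalNet's real interface.
from urllib.parse import urlencode

BASE = "http://example.org/povcalnet-placeholder"  # not the real endpoint


def query_urls(country: str, year: int, n_points: int = 10_000):
    """Yield one URL per point on the CDF, sweeping a hypothetical poverty line."""
    for i in range(1, n_points + 1):
        line = round(i * 0.01, 2)  # hypothetical grid: $0.01 steps per day
        yield BASE + "?" + urlencode(
            {"Country": country, "Year": year, "PovertyLine": line}
        )


urls = list(query_urls("IND", 2011))
```

At 10,000 points per survey across 942 surveys, a sweep like this lands in the low tens of millions of requests, which matches the order of magnitude the authors report.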
Tom Johnson

Corporate Accountability Data in Influence Explorer - Sunlight Labs: Blog - 0 views

  • Again, US-centric, but this might generate some ideas of what could be accomplished in your city/nation. Late yesterday we announced a bunch of new features for Influence Explorer: http://sunlightlabs.com/blog/2011/ie-corporate-accountability/ As the blog post explains, you can now find information about a corporation's EPA violations, federal advisory committee memberships, and participation in the rulemaking process -- all in one place. I wanted to highlight that last feature a bit more, though. To my knowledge, this is the first time that the full corpus of public comments submitted to regulations.gov has been available for bulk download and analysis. This isn't a coincidence: regulations.gov is built using technologies that make scraping it unusually difficult. This is unfortunate, since everyone seems to agree that federal rulemakings are gaining in importance -- both because of congressional gridlock that leaves the regulatory process as a second-best option, and because of calls to simplify the regulatory landscape as a pro-growth measure. It's an area where influence is certainly exerted -- rulemakers are obliged to review every comment -- but little attention is paid to who's flooding dockets with comments, and in which directions rules are being pushed. It's taken us several months to develop a reliable solution and to obtain past rulemakings, but we now have the data in hand. We plan to do much more with this dataset, and we're hoping that others will want to dig in, too. You can find a link to the bulk download options in the post above -- the full compressed archive of extracted text and metadata is ~16GB, but we've provided options for grabbing individual agencies' or dockets' data. If anyone wants the original documents (PDFs, DOCs, etc.) we can talk through how to make that happen, but as they clock in at 1.5TB we'll want to make sure folks know what they're getting into before we spend the time and bandwidth. Finally, note that we currently o
Tom Johnson

Data Docs: Interactive video and audio - 0 views

  • "Data Docs is a video platform that allows filmmakers and journalists to combine elements from the web, such as interactive graphics, text and scraped information, with linear media, such as video and audio. Having worked in video both in long-form documentary and web video, we understand the power of visual media. Videos are powerful vehicles that we can use to tell personable or explanatory immersive stories. But one of the drawbacks of video as a medium is that videos are finished products, which, after they have been published, become outdated fairly quickly. Advances in technology and databases have allowed data to be more flexible than video. Data visualizations and interactive infographics, for instance, can be up to date at any moment in time if they are hooked up to the right databases. Think of charts of stock markets that update every millisecond because APIs or other technological mechanisms feed them live data. We wanted to combine those two worlds - the world of immersive video storytelling and that of live and constantly updated data. This is why we created Data Docs. Through the Data Docs code library, filmmakers and developers can 'hook up' their video to live data and other up-to-date information from the web. The library also allows you to integrate your own interactives with specific fonts and styles into your video. It enables you to project HTML, CSS and JavaScript-based graphics on your video. This helps you make videos that will never be out of date or, in other words, to make videos that are evergreen."
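The core mechanism described above is pairing spans of video time with overlays whose data is fetched at render time rather than baked in. Data Docs itself is a JavaScript library; the sketch below is a language-neutral illustration in Python of that cue idea, with hypothetical names throughout, not the library's actual API.

```python
# Conceptual sketch (in Python, though Data Docs itself is JavaScript) of the
# core idea: cues that map a span of video time to a live-data overlay whose
# content is produced at render time, so it never goes stale.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cue:
    start: float               # seconds into the video
    end: float                 # exclusive end of the span
    fetch: Callable[[], str]   # called at render time, so data stays current


def active_cue(cues: List[Cue], t: float) -> Optional[Cue]:
    """Return the cue covering playback time t, if any."""
    for cue in cues:
        if cue.start <= t < cue.end:
            return cue
    return None


cues = [
    Cue(0.0, 12.5, lambda: "live stock chart"),
    Cue(12.5, 30.0, lambda: "updated poverty map"),
]
```

On each playback tick, a real implementation would look up the active cue and render its fetched HTML/CSS/JS graphic over the video frame.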
Tom Johnson

Needlebase - for acquiring, integrating, cleansing, analyzing and publishing data on th... - 1 views

  • ITA Software is proud to introduce Needlebase™, a revolutionary platform for acquiring, integrating, cleansing, analyzing and publishing data on the web. Using Needlebase through a web browser, without programmers or DBAs, your data team can easily:
    - Acquire data from multiple sources: a simple tagging process quickly imports structured data from complex websites, XML feeds, and spreadsheets into a unified database of your design.
    - Merge, deduplicate and cleanse: Needlebase uses intelligent semantics to help you find and merge variant forms of the same record. Your merges, edits and deletions persist even after the original data is refreshed from its source.
    - Build and publish custom data views: use Needlebase's visual UI and powerful query language to configure exactly your desired view of the data, whether as a list, table, grid, or map. Then, with one click, publish the data for others to see, or export a feed of the clean data to your own local database.
    Needlebase dramatically reduces the time, cost, and expertise needed to build and maintain comprehensive databases of practically anything. Read on to learn more about Needlebase's capabilities and our early adopters' success stories, or watch our tutorial videos. Then sign up to get started!
  • http://needlebase.com
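Needlebase's "intelligent semantics" for merging variant forms of the same record are proprietary and unspecified; a common baseline for that task is key normalization, sketched below. This is a generic illustration, not Needlebase's actual method, and the suffix list is an arbitrary example.

```python
# Generic sketch of merging variant forms of the same record by normalising
# a key field. Needlebase's actual "intelligent semantics" are proprietary
# and certainly more sophisticated than this toy normaliser.
import re
from collections import defaultdict
from typing import Dict, List


def normalise(name: str) -> str:
    """Lowercase, strip punctuation, drop common company suffixes, collapse spaces."""
    key = re.sub(r"[^\w\s]", "", name.lower())
    key = re.sub(r"\b(inc|llc|ltd|corp)\b", "", key)
    return " ".join(key.split())


def merge_records(records: List[Dict]) -> List[Dict]:
    """Group records whose normalised names match; keep the first of each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalise(rec["name"])].append(rec)
    return [group[0] for group in groups.values()]


records = [{"name": "Acme, Inc."}, {"name": "ACME Inc"}, {"name": "Globex Corp."}]
merged = merge_records(records)  # "Acme, Inc." and "ACME Inc" collapse to one record
```

Keeping the first record of each group mirrors the persistence behavior described above, where merges survive a refresh of the source data as long as the normalized keys still collide.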