Nat Torkington - Truly Open Data | O'Reilly Radar
Pranesh Prakash on 16 Aug 10
"Open source software developers have a powerful set of tools to make distributed authoring of software possible: diff to identify what's changed, patch to apply those changes elsewhere, version control to track changes over time and show provenance. Patch management would be just as important in a collaborative open data project, where users and other researchers might be submitting new or revised data. What would git for data look like? Heck, what would a local branch look like? I have a new attribute, you have a different projection, she has new rows, how does this all tie back together? (I eagerly await claims that RDF will solve this problem and all others.)
That's just development. The interface between developers and users is the release. State of the art for a lot of government data is the equivalent of source.tar.gz. No version numbers, much less the ability to download older versions of the datasets or separate stable and development branches. Why would we want to download the historic version of a dataset? Because a paper used it and we want to test the analysis software that the paper used to ensure we get the same answer. Or because we want to see what our analysis technique would have shown with the knowledge that was available back then. Or simply to be able to track defects.
The users of data will have to adapt to the idea of versions, like the users of software have. The maintainers of the dataset might release five different versions of it while you're writing your analysis code, so it can't be a painful process to incorporate the revised data into your project. With software we have shared libraries and dynamic libraries, supported by autotools and such packages. Our code has interfaces and a branch that promises backwards compatibility. What would that look like for data?
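As an illustration only (it is not part of the quoted article), the diff and patch idea above can be sketched for tabular data. The sketch assumes every row has a stable primary key, which real datasets often lack, and the names `diff_tables`, `apply_patch`, and `DataDiff` are hypothetical, as are the sample values.

```python
from typing import Any, Dict, NamedTuple, Tuple

Row = Dict[str, Any]
Table = Dict[str, Row]  # assumption: primary key -> row


class DataDiff(NamedTuple):
    """A keyed, row-level diff between two versions of a table."""
    added: Table
    removed: Table
    changed: Dict[str, Tuple[Row, Row]]  # key -> (old row, new row)


def diff_tables(old: Table, new: Table) -> DataDiff:
    """Identify what changed between two versions, like `diff` for rows."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return DataDiff(added, removed, changed)


def apply_patch(base: Table, patch: DataDiff) -> Table:
    """Apply a diff to a table, refusing to guess when the base has diverged."""
    result = dict(base)
    for key, row in patch.removed.items():
        if result.get(key) != row:
            raise ValueError(f"conflict: cannot remove {key!r}")
        del result[key]
    for key, (old_row, new_row) in patch.changed.items():
        if result.get(key) != old_row:
            raise ValueError(f"conflict: cannot update {key!r}")
        result[key] = new_row
    for key, row in patch.added.items():
        if key in result:
            raise ValueError(f"conflict: {key!r} already exists")
        result[key] = row
    return result


# Made-up sample data: two releases of the same small dataset.
v1: Table = {"r1": {"value": 10}, "r2": {"value": 20}}
v2: Table = {"r1": {"value": 11}, "r3": {"value": 30}}

patch = diff_tables(v1, v2)
rebuilt = apply_patch(v1, patch)
```

Here `rebuilt` equals `v2`, and applying the same patch to a table whose `r1` row has already changed raises a `ValueError` rather than merging silently. Column-level changes such as a new attribute or a different projection, which the passage also raises, are outside what this row-level sketch handles.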
And what is the data version of the dependency hell that software developers know all too well (M 1.5 depends on N 1.7 and P 2.0, but P 2.0 requires N 2.0, and upgrading N to 2.0 br