Having spent the better part of 18 months in the hunt for the DataNet grail, I read with interest Dorothea Salo's Caveat Lector rebuke of Cliff Lynch's remarks about the library's stake in data curation. Her angst is worth heeding. The community's first-generation institutional repository experience has been desultory at best, resulting in modest progress and (at least in Dorothea's view) loss of credibility with those who dispense local resources - administrators.
Repositories are fundamental components of the digital library -- the 'shelves' and the 'catalog' rolled into software -- and it is perhaps an understatement to say we haven't gotten this quite right. There are design issues, usability issues, turf to fight over. Dorothea's comments suggest the major problem, however, may be one of social engineering:
Institutional repositories did not, as far as I can tell, emerge from faculty needs. Rather, they were born of issues of ideology (Open Access), technology (how do we manage our electronic collections), and institutional reputation (our reputation will increase in proportion to our cache of IP). All eminently reasonable motivations, but the needs of faculty scarcely entered into it. A big part of the problem.
And now comes Big Data. At least institutional repositories contain human-readable artifacts. It's not so much of a stretch, then, to imagine that our 4th-generation Kindle-like devices may be calling home to one or many IRs. Data, on the other hand, is not nearly so warm and fuzzy, needing agreements about structure, rendering, analytic methods, and more. What are the use- and reuse-cases? Are there tractable business models?
Enter the DataNet solicitation. NSF will award five $20 million grants to data curation teams charged with protecting research investments while developing sustainable business models that increase the efficacy of science through reuse and repurposing of data. This is a huge task, with conflicting goals, uncertain methodologies, and unresolved incentive structures. One might forgive the skepticism of the Dorotheas of the community. One of the linchpins of success will be the ability to make faculty's lives easier while serving the larger technological and economic needs.
Among the compelling common threads concerning institutional repositories and data repositories is that the learning and research communities of the future must include the capabilities of both in one form or another, or risk wholesale losses of digital data and the investments they represent. However badly we've done them in the past, we must redo them until we get them right. They will be basic enabling infrastructure for our communities, supporting not only institutions, disciplines, and faculty, but the very fabric of innovation upon which we rely for our prosperity.
The social engineering of incentives and services will be as critical to success as the business models and cost structures. Dorothea suggests:
And seriousness here means systematic commitments of funders to sustainability, not just grant programs. System designers, on the other hand, must develop practical systems that assure researchers better access to data without compromising resources for innovation or ensnaring them in time- and soul-destroying submission procedures -- and with suitable professional incentives for participation.
But can we wait for all that, as a profession? I don't think we can. Data curation curricula are emerging in the iSchools (UIUC and UNC both have data curation activities of note). The DataNet Federation will provide important pointers to necessary services (and the skill sets that our community must develop and nurture to be useful), and of course we have provided strong leadership in the development of preservation standards such as PREMIS and OAIS that are key pieces of the curation puzzle. Jane Greenberg, of UNC, and I have recently launched a Dublin Core community to look at metadata and scientific datasets, another important piece of the puzzle.
The DataNet solicitation proposes:
...new types of organizations ...[that]... will integrate library and archival sciences, cyberinfrastructure, computer and information sciences, and domain science expertise to:
- provide reliable digital preservation, access, integration, and analysis capabilities for science and/or engineering data over a decades-long timeline;
- continuously anticipate and adapt to changes in technologies and in user needs and expectations;
- engage at the frontiers of computer and information science and cyberinfrastructure with research and development to drive the leading edge forward; and
- serve as component elements of an interoperable data preservation and access network.
Hardly a challenge we can meet alone... or afford to turn away from.
_____
Seattle at night, from the perch of Kerry Park on Queen Anne hill, February, 2009
