    August 04, 2006

    Digital Repository Interoperability

    Microsoft, the Mellon Foundation, the Coalition for Networked Information, the Digital Library Federation, and the Joint Information Systems Committee in the UK sponsored a meeting in April of 2006 to promote discussion and consensus on the characteristics of digital repositories that need to be standardized in order to achieve desirable levels of interoperability. The list of sponsors is a strong clue to the importance of this somewhat esoteric topic. What is at stake is a common set of functionality that will support automated interchange across a wide spectrum of repositories and assure auditable provenance for managed materials. Not a modest objective.

    The final report for this meeting is available at the Mellon Foundation. It is not casual reading: it is intended to capture a discussion rather than characterize the state of the art, though readers may find the background materials and recommendations generally useful.

    Those conversant with digital library research may be familiar with the early work of Robert Kahn and Robert Wilensky on repository architectures. It is interesting to note that, a dozen years later, there is still no commonly agreed terminology, let alone a universally accepted model for this critical piece of digital library infrastructure. It is a reminder of how new our digital workspace is, and how much effort remains to achieve even a rudimentary infrastructure for supporting reliable, persistent access to electronic assets.

    There is inherent tension between standardization and localized design, especially in an unstable technological environment. Builders want to build, not reconcile. The conferees at this meeting (largely implementors of early repository systems) did not even entirely agree on which aspects of functionality should be considered core for interoperability purposes (though progress was made towards this goal).

    It is perhaps not astounding that our understanding of repository core functionality is not so different now from what it was in 1995. The Kahn-Wilensky list included (I paraphrase) access, deposit, and tell-me-more. The 2006 meeting agreed on obtain and harvest, and talked at length about whether put should be there. The notions map roughly across the intervening years, though current understanding of the underlying details far exceeds the 1995 model.
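To make the three operations concrete, here is a minimal sketch of what a repository offering obtain, harvest, and put might look like. It is an illustration only: the class names, method signatures, and the "modified since" semantics of harvest are my assumptions, not anything specified in the meeting report.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterator, Optional


@dataclass
class DigitalObject:
    """A hypothetical stored item: an identifier, some bytes, a timestamp."""
    identifier: str
    content: bytes
    modified: datetime


@dataclass
class InMemoryRepository:
    """Toy repository exposing the three operations discussed above."""
    _objects: Dict[str, DigitalObject] = field(default_factory=dict)

    def obtain(self, identifier: str) -> DigitalObject:
        # Fetch a single object by identifier; raises KeyError if unknown.
        return self._objects[identifier]

    def harvest(self, since: Optional[datetime] = None) -> Iterator[DigitalObject]:
        # Enumerate objects in order of modification time, optionally only
        # those changed after `since` (an assumed, incremental-style filter).
        for obj in sorted(self._objects.values(), key=lambda o: o.modified):
            if since is None or obj.modified > since:
                yield obj

    def put(self, obj: DigitalObject) -> None:
        # Deposit (or overwrite) an object under its identifier.
        self._objects[obj.identifier] = obj


repo = InMemoryRepository()
april = datetime(2006, 4, 1, tzinfo=timezone.utc)
august = datetime(2006, 8, 4, tzinfo=timezone.utc)
repo.put(DigitalObject("obj-1", b"first", april))
repo.put(DigitalObject("obj-2", b"second", august))
newer = [o.identifier for o in repo.harvest(since=april)]
```

Interoperability is the hard part that this sketch leaves out: two repositories can both offer these three calls and still disagree on identifiers, data models, and policies.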

    If this sounds like scant progress for a decade, keep in mind that a great deal of experience has been garnered through the deployment of DSpace, arXiv, Fedora, ePrints, aDORe, and the like – serious repository applications that afford the practical experience necessary to bring together common expectations on how these technologies will work together.

    The problem is broad in scope… a data model and architecture to support documents, data archives, and formats, policies, and recombinant practices that are to a significant degree yet undeveloped. The architecture must accommodate the functional requirements of disparate domains with entirely different business models, legal requirements, and data demands. As the first crop of serious repository applications has matured, the field is now ready for the harder task of bringing these efforts into an interoperable framework. This meeting will have helped to focus attention and effort on this important goal and set the stage for additional progress.

    -----

    Image: Downtown Seattle from Elliott Bay, July, 2006

     

    May 04, 2006

    Not the end of an era

    Those of us with gray hair are fond of reminiscing about the cost of our first computers or how much memory we thought was impossibly more than we could ever use. I recall during my first months at OCLC that the Office of Research acquired its first 1 gigabyte disk pack… an expensive device about the size of a small refrigerator. Lots of cameras have more now, and it would be a rare automobile that does not eclipse the computing resources of the space shuttle. Now-quaint marvels such as these afford benchmarks that measure our progress along the digital byways.

    It is harder to identify with data standards such as the MARC record in the same way, especially in an age of global indexing and microformats (there may be a few of us who can remember their first 245 field, but these don’t have the same oomph in the retelling as, say, a 5 megabyte hard drive or IBM software on a cassette tape).

    The recent passing of Henriette Avram is an occasion for reflection on the importance of structured data to our community. Henriette, as architect of one of the world’s most important data standards, led a transformation of the profession of librarianship that will outlast most of us.  A large part of  every dollar I've earned in two decades comes from the industry she helped to spawn.

    Jim Gray, a Turing award winner and noted researcher for Microsoft, recently told me (on the day before Henriette’s death, as it turns out) that his slides on the history of libraries in the digital age number four: they start with Alexander and the library at Alexandria, and the third is of Henriette and the MARC record.

    Thank you, Henriette.
    -----
    Image: Park Avenue in New York City on a beautiful day in April, 2006

    Post Script:  Walt Crawford, my-soon-to-be-more-closely-related colleague, caught me out in a goof, which I fixed, and if you didn't find it, tough.  Check the Internet Archive.  I'm not a real librarian... I admit it. But I'm married to one!  She's not a cataloger either. (thanks, Walt!)

    January 31, 2006

    Identity and location on the Web

    URLs versus URIs

    In Sean McGrath’s piece on URLs and social commitment, he alludes to the common confusion between the acronyms URL and URI, correctly pointing out that only Web-head protocol-wonks are liable to be caught using the URI terminology (and most of them don't usually either).  Everyone else on the planet uses the largely-interchangeable and better-understood moniker of URL.

    Is this a distinction without a difference, as usage would lead us to believe? Unhappily, in today’s Web, it rarely matters a whit. Sean tells us why:

    The great thing about URLs is that you can click on them.

    That is a great thing, and it informs our expectation of URLs to the exclusion of all other possibilities. The social contract implied in http:// is, then, that URLs are actionable: you can click on them and bring the referent of the link into your machine and read it or listen to it or watch it. The link serves as a pointer to a location, and clicking on it invokes a behavior specified by the http protocol: voila!

    So, what's to be unhappy about?  Overloaded onto this simple actionable relationship is the additional important function of identity. The URL serves as both a key for a retrieval transaction, and an identifier. It is no accident that CNRI, in a stroke of marketing genius, chose the term Handle for their identifier protocol, for that is exactly the right term for such identifiers. We want a handle for otherwise-slippery electronic content so we can hold on to it, pass it back and forth, refer to it, and hang it over our desk to grab like a frying pan from a pot-rack.

    Mostly this overloading of identity and location/retrieval is fine. And to the extent that it is, the conflation of URL and URI is not a problem. So what is missing?

    Three things:

    1. Persistent reference pointers – There are many classes of electronic resources that we know we will want to refer to in a location-independent way for as long as we can imagine. Books, journal articles, or any component of a persistent resource of cultural, social, or economic importance. Yes, we want these to be actionable (clickable) as far as possible, but sustainable access requires that we distinguish between identity and resolution in the life cycle of any information resource of more than passing importance.  Conflating location and identity makes this harder (though not impossible).

    2. Appropriate copy resolution – In a world without access barriers, any copy is the appropriate copy. In a world of tradable intellectual property, individuals and organizations have differential access to resources. The Web should be neutral about business models, but it cannot be indifferent to them. Owners of IP must have the means to manage and meter access, and this generally implies a decoupling of identity and resolution.

    3. Conceptual resources – Our expectation that clicking 'gets' us something is not fully met if the resource is a conceptual asset. The development of Semantic Web technology demands the application of an identity architecture to concepts as well as documents, multimedia, and pizza-ordering forms. Proponents of the just-let-HTTP-do-it approach rightly point out that HTTP URLs are entirely capable of being used for identifying conceptual resources (to use the RDF parlance). This is undeniably true as far as the technology goes. The more interesting question is what happens when you click on such a link? What SHOULD happen? The answer is context dependent. Some people, myself included, are uncomfortable with the use of standard HTTP URLs for this purpose, because it breaks the widespread-if-informal social contract of URLs. You may wish to define such concepts, or locate them within a larger conceptual structure, or access various of their attributes, but in general you're not trying to retrieve them.

    Each of these examples begs for decoupling of identity and resolution in some contexts, but requires an additional layer of mapping of location or function that, getting back to Sean’s consumer contract:

    all come at very significant extra cost in terms of complexity. On this issue, the world has voted very loudly with its mouse clicking fingers. The world values hyperlinking simplicity over complexity by many orders of magnitude.

    The resolution of these conundrums will require daunting co-evolution of technology, business processes, and cross-community practices.  Technologies such as Handles, DOIs, OpenURLs, PURLs, and "INFO" URIs all represent approaches to addressing aspects of these problems.  At this time they are niche technologies.  Their impact on the constellation of problems we know as identity-versus-resolution will depend far more on business processes than on TECHNOLOGY (or the ideologies of their proponents or detractors).  Meanwhile, the URL rules.
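For readers who think better in code, the decoupling of identity and resolution, and the appropriate-copy problem in particular, can be sketched as a small lookup layer. This is a toy under stated assumptions, not how Handles, DOIs, or OpenURL resolvers are actually implemented: the `Resolver` and `Copy` names, the `example:` identifier, and the `.example` URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Copy:
    """One location of a resource; licensed_to=None means open to everyone."""
    url: str
    licensed_to: Optional[str] = None


class Resolver:
    """Maps a location-independent identifier to the copy a requester may use."""

    def __init__(self) -> None:
        self._copies: Dict[str, List[Copy]] = {}

    def register(self, identifier: str, copy: Copy) -> None:
        # The identifier stays fixed; locations can be added or replaced later.
        self._copies.setdefault(identifier, []).append(copy)

    def resolve(self, identifier: str, institution: Optional[str] = None) -> Optional[str]:
        copies = self._copies.get(identifier, [])
        # Prefer a copy licensed to the requester's institution...
        for copy in copies:
            if institution is not None and copy.licensed_to == institution:
                return copy.url
        # ...otherwise fall back to an openly available copy, if any.
        for copy in copies:
            if copy.licensed_to is None:
                return copy.url
        return None


resolver = Resolver()
resolver.register(
    "example:article/42",
    Copy("https://publisher.example/article/42", licensed_to="univ-a"),
)
resolver.register("example:article/42", Copy("https://archive.example/preprint/42"))

for_subscriber = resolver.resolve("example:article/42", institution="univ-a")
for_everyone_else = resolver.resolve("example:article/42", institution="univ-b")
```

The identifier never changes, yet two readers land on different copies. That extra mapping layer is precisely the complexity Sean's consumer contract argues against, which is why the business processes around it matter more than the code.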

    January 19, 2006

    Digital Futures Alliance

    I spent the morning Thursday at a workshop of the Digital Futures Alliance, an initiative begun in September of 2005 with the ambitious goal of catalyzing cross-sector cooperation (public, private, non-profit) to promote preservation of digital assets. The University of Washington Library is spearheading the effort, and the charter partners include many of the best-known technology companies of the Pacific Northwest. OCLC is a charter partner of the activity as well, represented at the meeting by CEO Jay Jordan.

    Digital preservation is a hot topic.  There are many efforts along a variety of international axes in this space. The first question that comes to mind in such an activity is what can a regional effort accomplish? If it remains regional, its reach may be limited (though, the well-known players at the table would undoubtedly have an impact simply by embedding successful approaches in the business practices of their given industries).

    The attempt to build a cross-sector alliance is central to the strategy, and the enthusiasm of the participants was evident.  UW's strong research library, a dynamic and diverse community of researchers in the iSchool, and the amalgam of innovation-rich companies in the region make for an ideal incubator for leadership in such an endeavor.

    Thursday’s session represented an attempt to refine a common understanding of the problems, and anticipate the form that solutions might take. A survey of a broad cross-section of businesses in the region had been conducted by NewEdge, Inc. following the September, 2005 kickoff meeting, and these results were presented by Greg Zick of UW.

    Lee Dirks, Director of Research Business Systems at Microsoft, then launched working group discussions in several areas (access and usage, selection, technical issues, and education & outreach). Short reports from each discussion table were shared in plenary session, and these results will be synthesized and used to formulate ongoing working groups to identify opportunities for progress.

    There is considerable enthusiasm and expertise in this group for tackling one of the great challenges of the digital age. The commitment of Betsy Wilson, Dean of Libraries at UW, is evident, and she has strong support from the University. Converting this regional effort into a broader initiative will be a formidable challenge, and teasing out the commonalities of the problems that can be addressed in scalable, reproducible solutions is daunting indeed. I’m looking forward to participating.


    November 10, 2005

    ReMIX

    Readers of Web4Lib will have seen a version of this post on that list. Take the rest of the day off!

    The discussion on Web4Lib concerns the relative merits of metadata-based retrieval and full text, link-enhanced retrieval. It raises interesting questions of great import to libraries in particular and information retrieval in general.

    Certainly we all agree that Google-like searching is powerful and useful. Our further hope and prejudice (given that metadata puts food on many of our tables) is that augmenting it with metadata search will improve retrieval in some use-cases with some resource classes. 

    Testing this hypothesis has always been fraught with overwhelming experimental difficulty and a substantial component of ideological bias. Indeed, as far as I know, there has never even been a serious attempt at arriving at an estimate of the cost effectiveness of MARC. Of course it's good! Right? RIGHT???

    As Google Print and its various spawn develop, the possibility of tractable experimentation is upon us.  Students of information retrieval will know of the TREC effort: information retrieval experimentation based on formal test collections.  Perhaps it is time for ReMIX: Resource Metadata and IndeXing Experiments?

    What are the domains of investigation? A quick list from the top of my head:

    • Nature of metadata
      • User-created
      • Library-created versus...?
      • Richness (MARC, DC, MODS, IEEE-LOM, ONIX...)
    • Nature of resources
      • Age
      • Type (books, articles, web resources, collections...)
    • Information use cases
      • Scholarly discovery
      • Commercial
      • End-user medical, legal, Government information....
      • User-types
      • ?

    Desirable elements:

    • A neutral home
    • A standard experimental corpus, balanced (whatever that means) and freely available
    • Open access indexes and linking information
    • Open access metadata of various types available to all
    • Open-Data repositories for the experimental results

    In other words, an open-access community-based project where the gradual accretion of knowledge on the subject would help all players understand the benefits of each mode, and combined modes as well, so as to improve retrieval performance and promote the development of more powerful systems over time.

     

    November 03, 2005

    For your "info"

    Giddiness is uncommon in a professional capacity, but that is a little bit how the authors feel at the end of a laborious and frustrating struggle to achieve formal recognition by the IETF of the "info" URI scheme.

    The effort stretched across nearly two and a half years and involved many episodes of frustration, angst, and ideological argumentation on both sides.

    The "info" URI scheme is predicated on the notion that the current Web identifier architecture is incomplete, and will benefit from a commonly recognized mechanism that:

    • acknowledges that sometimes it is useful to decouple identity and resolution,
    • supports a mechanism for bringing legacy identifiers into Web-space without directly maintaining Web server infrastructure, and
    • provides for simple registration of identifier namespaces that will benefit from a common registration and declaration mechanism.

    There are well-known and respectable advocates on both sides of this issue, but their opinions and ideologies will have almost no impact on the success or failure of this informational standard.  Certainly approval by the IETF in itself, while important, will not assure success.  The only thing that really counts is uptake.  Will "info" URIs attract the uptake necessary to generate network effect benefits? Will it add enough value to become a useful and persistent part of Web infrastructure?

    The answer is uncertain at this time, but there are reasons to be cautiously optimistic:

    • "info" URIs are the basis for the OpenURL naming architecture
    • The MPEG community is exploring the use of "info" for the identification of a wide variety of media assets that currently have non-URI identifiers
    • A number of communities are experimenting with "info" for identifying 'conceptual' resources -- terms from metadata sets, controlled vocabularies, and classification systems
    • SRW uses both HTTP: and "info" URIs to identify objects
    • "info" is a candidate for identifying digital assets stored in repositories such as PRONOM.

    There are sound reasons not to stray from the dominant linking idiom of the Web: HTTP URIs.  Andy Powell's arguments to this effect illustrate why using HTTP URIs should be the starting assumption in the design of every Web identifier system.  But there are also circumstances where the expectation of resolution (the implied promise of HTTP links) is specifically undesirable.  Distinguishing between these cases, and building sustainable services that address these needs, motivated the development of the "info" URI standard.  Wrapping that standard with policies and value-added business models is the next, and more difficult, challenge.

    More information about the "info" URI scheme and the registry of "info" namespaces can be found at http://info-uri.info.
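As a rough illustration of the shape of these identifiers, an "info" URI consists of the scheme, a registered namespace, a slash, and an identifier drawn from that namespace, for example `info:lccn/2002022641`. The sketch below splits such a URI into its parts. It is deliberately simplified: the function name and regular expression are mine, it does not validate the full grammar or apply any namespace-specific normalization rules, and it does not check the namespace against the registry.

```python
import re
from typing import NamedTuple


class InfoURI(NamedTuple):
    namespace: str
    identifier: str


# Simplified pattern: "info:", a namespace token, "/", then the identifier.
_INFO_RE = re.compile(r"^info:([A-Za-z][A-Za-z0-9+.\-]*)/(.+)$", re.IGNORECASE)


def parse_info_uri(uri: str) -> InfoURI:
    """Split an "info" URI into (namespace, identifier).

    Any "#fragment" is dropped before parsing. The namespace is lowercased
    for comparison; the identifier is returned untouched, since its rules
    belong to the namespace that defines it.
    """
    body, _, _fragment = uri.partition("#")
    match = _INFO_RE.match(body)
    if match is None:
        raise ValueError(f"not an info URI: {uri!r}")
    namespace, identifier = match.groups()
    return InfoURI(namespace.lower(), identifier)


parsed = parse_info_uri("info:lccn/2002022641")
```

Note what the function does not do: it does not fetch anything. Naming a thing without promising to deliver it is the whole point of the scheme.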

    August 04, 2005

    Crossed Wires

    The US may have invented the Internet, but as the old saw goes... what have you done for us lately?

    Thomas Friedman, in an essay in the New York Times, Calling All Luddites, raises provocative questions about why the US is falling behind in providing broadband connectivity to its people.

    Those who travel abroad have long been aware of the inferiority (and high cost) of cell service in the US compared to most other parts of the developed world.  The consequences of all this reach far beyond lousy phone reception or crippled devices that force you to rebuild your phone from scratch each time you 'upgrade' to a new handset.

    My colleague, Jean Godby, brought to my attention a recent article in Foreign Affairs, Down to the Wire, by Thomas Bleha, which provides a more in-depth treatment of the question. The following quote sets the stage:

    In the first three years of the Bush administration, the United States dropped from 4th to 13th place in global rankings of broadband Internet usage.

    This isn't an accident; it is the result of well-considered government policies (other governments) that  recognize that connectivity is a major driving force for future economic development.  Taking steps to assure this infrastructure is in place, ready to provide a springboard for innovation, better security, and higher quality of life, seems a basic function that should be nurtured by thoughtful public policy.

    The automobile industry was born and raised in the US, but has never fully recovered from the assault-by-quality of auto makers from other countries.  Is the same fate befalling us in the realm of the Internet?