    February 19, 2007

    Library standards in the mainstream

    Jon Udell's weekly podcast from February 16 features Dan Chudnov in a lengthy conversation about OpenURLs, COINS, xISBNs, FRBR, digital archiving, and other library geekery.  This is a long but rewarding conversation that brings many issues of technical librarianship to one of the technical world's A-list blogs.  If you've been putting off raising your understanding of OpenURLs and related technologies, this podcast will help contextualize contextual linking, and give you a roadmap for exploring a host of related technologies.

    One of Udell's pet peeves is the difficulty of assuring the archival preservation of blogs, a medium that he, more than most, has invested in as a serious record of thought.  Conventional thinking has it that blogs are ephemeral, and who really cares if they are preserved?  I'm with Udell -- Blogs will become (are?) an important cultural and historical record -- even if they are ephemeral.  Remember the days when libraries were wringing their collective hands about preserving slow-burning acid-paper collections?  It was the so-called ephemera of earlier eras that caused the greatest concern, as these were precisely the materials that would disappear most quickly, and leave holes in the cultural record of their time.  So, today -- blogs.

    Udell and Chudnov pod-pine for a business model for the preservation of blogs today.  Why not an inexpensive LOCKSS-like blog preservation service run by... who else... libraries?  Perhaps in Dan's new role in the Library of Congress he will be able to promote such a service.

    -----

    The canopy over the main platform of the Berlin Hauptbahnhof reprises the open elegance of major 20th century European train stations in a modern facility in the heart of the city.

    August 30, 2006

    Uncoupling Identity and Resolution

    My previous post outlined my understanding of Norman Walsh's position on URIs and Identifiers.  I indicated in that post that I believed there are several cases where it is desirable to uncouple identity and resolution, and why my personal view is that http:URIs are not ideal for such cases.

    I do not dispute that the requirements embedded in these cases can be addressed using http:URIs. I simply assert that they can be better served within a more complete naming architecture that explicitly accommodates pure identifiers... that is, identifiers that are explicitly uncoupled from resolution protocols. 

    The current state of Web architecture does not support pure identifiers, and in fact I believe it is fair to characterize the prevailing attitude towards them among Web and Internet architects as hostile.  Certainly discussions of their merits and demerits in various public fora have often been (over?) heated.

    Resolving Pure Identifiers

    Does this mean that pure identifiers are never subject to resolution? No. It means that a pure identifier must be explicitly bound to a resolution protocol by a process or within a system designed to exploit that identifier.  This is, of course, a far more constrained circumstance than one mostly encounters on the open Web.  What are the means for such resolution?

    1. Binding of a pure identifier and a resolution protocol can be done by convention, as is the practice with DOIs, for example. Such conventions require special knowledge that must be established and maintained within a community of use.  The commercial benefits of the use of DOIs to help publishers manage intellectual property has helped to overcome this barrier, though the success is modest and fairly closely contained within the publishing community.

    2. It can be done through the use of plugins for standard software such as browsers.  During the URN battles within the IETF, which coincided roughly with the emergence of 'standardized' browsers, much hope was pinned on the use of such plugins to help get URN usage off the ground.  These hopes live yet in some parts of the library community (a number of European libraries have coalesced around the registration of National Bibliography Numbers, or NBNs, as a registered URN namespace), but it is a hope that (in my judgment) is unlikely to be widely realized.

    Even if there were no organizational impediments to the use of plugins, getting users to install such plugins in a widespread way is unlikely.  In fact, the installation of plugins in large organizations is often strictly controlled for security reasons, and hence unlikely to be done widely on the open Web.

    3. The architecture of the Internet includes a registry for URI schemes, maintained by the Internet Assigned Numbers Authority (IANA), which in theory could be deployed so as to ease the recognition of registered URI schemes within the Web.  For a variety of reasons related to lack of compelling business cases, security issues, ideology, and perhaps the convenience of systems developers, very little of this capability has ever seen the light of working code.  The procedure for registering a new URI scheme was revamped as recently as last year (2005), suggesting that some in a position of influence still consider this important, but at this stage of Web technology development, the prospects of widely deployed software that can use arbitrary URI scheme declarations sensibly seem remote.

    4. Finally, labelled identifiers are often easily recognized and parsed from open Web data.  ISBNs are a good example of this, as they are widely recognized by end-users, have a public syntax, and are typically labeled with the "ISBN" token, making them easy to identify in unstructured data.  Oh... and there is a business case for their use.  Easy pickings.  Never mind that the identifier system itself has flaws... it works well enough, often enough, to generate value.   Given that there are many legacy identifier systems in the non-web world, and there is great confusion about how best to 'webulate' such systems, it is comforting that, as long as well-formed and labelled identifiers are used, they will be findable.

    This final approach may constitute the most reliable path to recognition of a given class of pure identifier.  It does not require agreement on the part of Internet architecture gatekeepers, and it is market driven.  So if there is in fact a market need, there is some chance that it can be filled by entrepreneurial zeal or even community-mindedness.
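    The fourth approach can be sketched in a few lines. What follows is a minimal illustration, not a production recognizer: it assumes identifiers are introduced by the literal "ISBN" token, that any separators are hyphens, and that a candidate is worth keeping only if its check digit validates. The pattern and the function names (find_isbns and friends) are my own inventions for this example.

```python
import re
from typing import List

# Assumed labelling convention: the token "ISBN" (optionally "-10" or "-13"),
# an optional colon, then digits and hyphens, possibly ending in X.
ISBN_PATTERN = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9-]{8,15}[0-9Xx])")


def is_valid_isbn10(digits: str) -> bool:
    """ISBN-10: the sum of digits weighted 10..1 must be divisible by 11."""
    if len(digits) != 10 or not digits[:9].isdigit():
        return False
    last = digits[9]
    if not (last.isdigit() or last == "X"):
        return False
    values = [int(c) for c in digits[:9]] + [10 if last == "X" else int(last)]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(digits: str) -> bool:
    """ISBN-13: the sum of digits weighted 1,3,1,3,... must be divisible by 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
    return total % 10 == 0


def find_isbns(text: str) -> List[str]:
    """Return labelled ISBNs found in free text, hyphens removed, check digit verified."""
    found = []
    for match in ISBN_PATTERN.finditer(text):
        candidate = match.group(1).replace("-", "").upper()
        if is_valid_isbn10(candidate) or is_valid_isbn13(candidate):
            found.append(candidate)
    return found


sample = ("See ISBN 0-306-40615-2 or ISBN-13: 978-0-306-40615-7; "
          "the string ISBN 0-306-40615-3 has a bad check digit.")
print(find_isbns(sample))
```

    Note that nothing here depends on a registry, a plugin, or a URI scheme: the label and the public syntax do all the work, which is the point of the argument above.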

    -----

    Image: Lady Washington, in Tacoma Harbor, with Rainier-san as a backdrop (August 27)

    Do I understand you to say...?

    Mortimer Adler argued that civil discourse requires first and foremost that one must have a clear notion of what one's fellow dialecticians are actually saying.  That is, one should begin every discussion with the question "Do I understand you to say...?"  In this spirit, I trust that others, or perhaps even Norman himself, might point out misinterpretations I have made in my reading of his post Names and Addresses.  My paraphrases correspond to the major headings in his post.

    "They [identifiers] are just strings"

    I agree with this point.  If you ignore the semantics (or implied semantics) of a name, it is just a collection of characters with parsing rules that ensure a globally unique string. The http:// is, for ID purposes, largely irrelevant and the Domain Name Service (DNS) provides a wonderfully effective means of providing for...

    Distributed Naming

    The DNS has proven to be robust and stable, and an effective means for distributing local naming authority in a globally distributed way.  That is, every domain owner has the authority and means to assign names within a particular domain (namespace), or even subdivide that authority into smaller namespaces.

    Norman evinces a confidence in the persistence of DNS name management as likely to outlast any new organization created to manage a newly created URI namespace (referred to in his post as newscheme:).  This is a reasonable bet if indeed a new organization were created solely for such purpose.  In fact, it is more likely that such functions will be managed within existing stable organizations (my own thoughts naturally run towards libraries and their sound reputation for curating information for the long haul).  I suspect Norman was thinking about the DOI Foundation, host of the DOI namespace, which emerged in response to the interests of commercial publishers in an autonomous identifier assignment entity.

    Globally unique (unambiguous) names are important

    The DNS, again, assures this global uniqueness.  But, it is within the purview of the local name authority (domain owner) to reassign names and their corresponding referents as it sees fit.  That is entirely appropriate in some cases, and not so in others.  Consider, for example, the following two URIs:

    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50-2006-08-17
    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50

    At the moment that I am writing this sentence, these two identifiers map to the same resource.  At some point in the future, the second one (the persistent identifier associated with the latest version) will map to another, more polished version, while the former identifier will remain associated with the current version as mandated by the policy of the resource curator (the W3C in this case).
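    The policy at work can be sketched as a toy registry. This is only an illustration of the idea under my own assumptions, not a description of how the W3C actually manages its URIs: dated identifiers are bound once and never reassigned, while the "latest version" identifier is an alias that moves with each new publication. The class name, its methods, and the later dated identifier are all hypothetical.

```python
class VersionedRegistry:
    """Toy in-memory registry: dated identifiers are fixed, the 'latest' alias moves."""

    def __init__(self):
        self._dated = {}   # dated identifier -> content, bound exactly once
        self._latest = {}  # latest-version identifier -> current dated identifier

    def publish(self, latest_id, dated_id, content):
        # Refuse to rebind a dated identifier: its referent must stay invariant.
        if dated_id in self._dated:
            raise ValueError(f"{dated_id} is already bound and may not be reassigned")
        self._dated[dated_id] = content
        self._latest[latest_id] = dated_id

    def resolve(self, identifier):
        # A latest-version alias is followed to its dated identifier first.
        dated_id = self._latest.get(identifier, identifier)
        return self._dated[dated_id]


LATEST = "http://www.w3.org/2001/tag/doc/URNsAndRegistries-50"
DATED_AUG = "http://www.w3.org/2001/tag/doc/URNsAndRegistries-50-2006-08-17"
# Hypothetical placeholder for some future dated version; not a real URI.
DATED_LATER = LATEST + "-HYPOTHETICAL-LATER-DATE"

registry = VersionedRegistry()
registry.publish(LATEST, DATED_AUG, "draft of 17 August 2006")
registry.publish(LATEST, DATED_LATER, "a later, more polished draft")

print(registry.resolve(DATED_AUG))  # still the August draft
print(registry.resolve(LATEST))     # now the later draft
```

    The invariant lives in the publish method, which is to say in policy rather than in the identifier syntax. Nothing about the strings themselves prevents reassignment.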

    For most scholarly assets, it is important that the relation of an identifier to its referent be invariant.  Norman correctly points out that this is a "matter of diligence and trust," a social issue.  The example  of the W3C URIs points out that the trust is invested in a particular policy for maintaining a chain of evidence (in this case, a series of versions).  It is not always simple to achieve this, and often will require close curatorial attention (the diligence part).

    Persistence is important

    The essence of this issue is also, as Norman states, social.  There is no technical assurance of persistence for either the names or the resources they identify.  The only guarantee of persistence is the commitment of the organizations with curatorial responsibility for them.
     
    Many of the early URLs identifying the beginnings of the Web have long since broken.  CERN, the birthplace of the Web, decided it was about doing and curating physics, not Web technology.  In those breathless early years, as custody was transferred, the identifiers did not always survive.  It is perhaps the case that some of the documents did not as well... I don't know. 

    It has been argued that the http component of URIs is a weak link, as protocols do not last forever. This argument is spurious, both because of the "It's just a string" argument, and because the world is so deeply dependent on this infrastructure at this point that anything that succeeds http will necessarily be backward-compatible.  Once again, we agree.

    Resolvable Identifiers -- HTTP is a winner

    I prefer to use the terminology of resolvable rather than retrievable as Norman does, as I think it better captures the notion of mapping of identifiers and resources.  Norman's point that http is a clear winner is true in one sense, but may mislead in another.  Our first point of disagreement, I think.

    I would not argue that there is a better alternative than the http protocol for retrieval purposes on Planet Web.  I do argue, however, that there are circumstances in which resolution should be explicitly uncoupled from identity.  I will explicate some of these circumstances in a subsequent post.

    I believe that Norman and others will counter with the "It's only a string" argument, and from a technical perspective, this is exactly correct.  There is no technical requirement that an http:URI must be resolvable.  It can act as a globally-unique string that maps conceptually to a real or abstract asset, whether or not an http server ever acts on it.

    My objection to this approach goes back to the outlandish success of the Web, and to the implicit social contract of resolvability.  You can correctly assert that http is just a substring and carries no promise, but not if you live on Planet Semantics as well as Planet Web.  Such URIs will be recognized by machines and resolved (or at least resolution will be attempted) and they will be recognized by people, who will expect something to be at the end of the link. To the extent that that something is other than what they expect, unwelcome surprise results.  The overloading of resolution and identity is not only benign, but advantageous in most circumstances.  But not all.
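    A small sketch may make the two planets concrete. It assumes a generic consumer that decides whether to dereference an identifier purely by looking at its scheme, which is my simplification of what browsers and crawlers do; the identifiers and the function name are hypothetical.

```python
from urllib.parse import urlsplit

# Schemes that a generic Web consumer is assumed to treat as clickable.
ACTIONABLE_SCHEMES = {"http", "https"}


def would_attempt_resolution(identifier: str) -> bool:
    """Mimic a generic consumer: anything with an http(s) scheme looks retrievable."""
    return urlsplit(identifier).scheme in ACTIONABLE_SCHEMES


# Hypothetical identifiers for the same abstract concept.
concept_as_http = "http://example.org/id/concept/persistence"
concept_as_pure = "info:example/persistence"

# On Planet Semantics, identity is plain string comparison; no server is involved.
statements = {concept_as_http: "persistence is a social commitment"}
print(concept_as_http in statements)

# On Planet Web, the scheme alone triggers the expectation of retrieval.
print(would_attempt_resolution(concept_as_http))
print(would_attempt_resolution(concept_as_pure))
```

    The identity test never touches the network, so the "just a string" argument holds. The second function shows where the implicit promise comes from: software cannot tell an http:URI minted purely as a name from one that is meant to be fetched.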

    The question as I see it is, is there sufficient value in the use of some newscheme: pure identifier to justify the effort in establishing a separate identifier protocol, maintaining an appropriate registry, and supporting its use?  Which brings us to Norman's last objection:

    Paying for names

    Norman asks "Why pay for something new when I've already got what I want?"  A good question, and for the great majority of assigners of identifiers, there is no need. Names, even DNS-based names, are not free.  We pay for domains, annually renewable.  We pay (largely hidden) costs of assigning and maintaining the integrity of the identifier mappings under our authority. As far as I am aware, there are no examples of any naming systems that require payment for the end-use of the names (resolving them), though at least one (DOIs) requires a fee for issuance or assignment to resources.

    Why would you pay for such identifiers if their equivalent is available in the technology (already paid for) at hand?  In the case of DOIs, it is presumably because their constituency (largely commercial publishers) finds their use productive in the context of their business model.

    In summary:

    • I agree almost entirely with the substance of Norman's arguments about the suitability of HTTP:URIs as a substrate for persistent, globally-unique identifiers to support resolution. 
    • We disagree, I think, on the promise of resolution implied by the http protocol token, and whether or not this has practical importance.
    • Many adherents to the "just use HTTP" argument reject the argument that it can be useful to uncouple identity and resolution, and with it, the assertion that http identifiers may sometimes be less desirable than 'pure' identifier alternatives.
    • Departing from the world's most widely-deployed identifier system (http:URIs) involves both costs and vulnerabilities.  Any effective alternative must offset these disadvantages through significant added value.

    I will elaborate on some of these points of difference in a subsequent post.

    POSTSCRIPT:

    As I was proofing this post after I published it, I found that one of the two URIs supposedly identifying the W3C TAG finding does not actually work.  The "latest version" link works, and that is the most important one, but the time-stamped version does not.  This is an excellent example of the difficulty that every organization has in actually meeting its responsibilities of "diligence and trust".  In my experience, the W3C takes these responsibilities seriously.

    Is such an issue a problem?  After all, the latest version link is the important one, no?  It is, but if you're interested in evidence chains in scholarship, then all the links are important, and as this example illustrates, they can be fragile.  In a subsequent post, I'll return to this issue and illustrate why both of these links are important to me personally.

    -----

    Image: Ship's prow, Tacoma Harbor, taken from Highlander, August 27

    On Identifiers, Scholarship, and Spittoons

    The problem in talking about identifiers is encapsulated in the Spittoon Joke.  If you're not familiar with this joke, I'm sorry, but it is too tasteless to relate away from the flicker of a campfire.  The essence is that there's no easy place to stop once you start, and the starting place isn't always obvious either.  This is the dilemma I've been struggling with, having agreed to comment on a blog post by Bruce D'Arcus on identifiers: URIs as Names.

    Bruce approaches the question from the perspective of tools for scholars, with the normal sets of problems that scholars have, including persistent citation.  Among the things that we want from citations is a convenient handle for any arbitrary resource, a handle that we can use to hang the resource on our scholarly pegboard, pick up the resource, pass it to others, let them hang it on their scholarly pegboard, and so forth.  Since it is easier to arrange our pegboards in standard ways, it is best if the handles are the same size and configuration, or barring that, that the number of different styles of handles is small.

    We'd all be fine if we could agree on a single style of handle for all the resources that we want to manage, right?   And Lo!  We now all live on Planet Web, as Norm Walsh puts it, and as Web denizens, we know that http:URIs are the obvious and most useful form of identifiers, and hence our problem is solved, now and always, and for all manner of resources ever to be conjured for scholarly or other purposes.

    We are nearly 15 years downstream from the New York Times article that served to awaken many of us to Tim Berners-Lee's marvelous creation. The URL, the now-discredited moniker that has been displaced in discussions such as these by the term http:URIs, has indeed become the most widely-used identifier in the world.

    For all that, we still don't enjoy the identifier heaven that the Web promises.  I'd like to explore some of the reasons I think this is the case, and perhaps even argue a particular perspective or two.  Bruce's post on the subject points to Norman Walsh's blog post in which he argues that http:URI schemes are entirely sufficient to the need, and to deviate from this is harmful.  Norman's post points also to a W3C TAG finding under development by Henry Thompson and David Orchard, presumably giving voice to the official W3C position on Web naming issues.  I will begin my exploration of these issues by reprising the arguments in these documents, in case you want to read ahead.  Reaching, now, for the spittoon....

    -----

    Image: Seattle evening skyline, featuring the Smith Tower (the short, pointy white building), for many years the tallest building west of the Mississippi River.  Taken from the Alaskan Way Viaduct, August 27

    January 31, 2006

    Identity and location on the Web

    URLs versus URIs

    In Sean McGrath’s piece on URLs and social commitment, he alludes to the common confusion between the acronyms URL and URI, correctly pointing out that only Web-head protocol-wonks are liable to be caught using the URI terminology (and most of them don't usually either).  Everyone else on the planet uses the largely-interchangeable and better-understood moniker of URL.

    Is this a distinction without a difference, as usage would lead us to believe? Unhappily, in today’s Web, it rarely matters a whit. Sean tells us why:

    The great thing about URLs is that you can click on them.

    That is a great thing, and it informs our expectation of URLs to the exclusion of all other possibilities. The social contract implied in http:// is, then, that they are actionable: you can click on them and bring the referent of the link into your machine and read it or listen to it or watch it. The link serves as a pointer to a location, and clicking on it invokes a behavior specified by the http protocol: voila!

    So, what's to be unhappy about?  Overloaded onto this simple actionable relationship is the additional important function of identity. The URL serves as both a key for a retrieval transaction, and an identifier. It is no accident that CNRI, in a stroke of marketing genius, chose the term Handle for their identifier protocol, for that is exactly the right term for such identifiers. We want a handle for otherwise-slippery electronic content so we can hold on to it, pass it back and forth, refer to it, and hang it over our desk to grab like a frying pan from a pot-rack.

    Mostly this overloading of identity and location/retrieval is fine. And to the extent that it is, the conflation of URL and URI is not a problem. So what is missing?

    Three things:

    1. Persistent reference pointers – There are many classes of electronic resources that we know we will want to refer to in a location-independent way for as long as we can imagine. Books, journal articles, or any component of a persistent resource of cultural, social, or economic importance. Yes, we want these to be actionable (clickable) as far as possible, but sustainable access requires that we distinguish between identity and resolution in the life cycle of any information resource of more than passing importance.  Conflating location and identity makes this harder (though not impossible).

    2. Appropriate copy resolution – In a world without access barriers, any copy is the appropriate copy. In a world of tradable intellectual property, individuals and organizations have differential access to resources. The Web should be neutral about business models, but it cannot be indifferent to them. Owners of IP must have the means to manage and meter access, and this generally implies a decoupling of identity and resolution.

    3. Conceptual resources – Our expectation that clicking 'gets' us something is not fully met if the resource is a conceptual asset.  The development of Semantic Web technology demands the application of an identity architecture to concepts as well as documents, multimedia, and pizza-ordering forms. Proponents of the just-let-HTTP-do-it school rightly point out that HTTP URLs are entirely capable of being used for identifying conceptual resources (to use the RDF parlance). This is undeniably true as far as the technology goes. The more interesting question is: what happens when you click on such a link? What SHOULD happen? The answer is context dependent. Some people, myself included, are uncomfortable with the use of standard HTTP URLs for this purpose, because it breaks the widespread-if-informal social contract of URLs.  You may wish to define concepts, or locate them within a larger conceptual structure, or access various of their attributes, but in general you're not trying to retrieve them.
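    The second case, appropriate copy resolution, lends itself to a short sketch. It assumes a resolver that holds several locations for one identifier and picks among them according to the requester's affiliation. The identifier, the locations, the affiliations, and the function name are all hypothetical, and real link resolvers are considerably more involved.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Copy:
    location: str
    licensed_to: FrozenSet[str]  # an empty set means the copy is open to everyone


# One identity, several locations. All values are made up for illustration.
COPIES = {
    "doi:10.9999/example.123": [
        Copy("https://publisher.example/article/123", frozenset({"university-a"})),
        Copy("https://aggregator.example/item/123", frozenset({"university-b"})),
        Copy("https://repository.example/preprint/123", frozenset()),
    ],
}


def resolve(identifier: str, affiliation: Optional[str]) -> Optional[str]:
    """Return the first copy this requester may use, or None if there is none."""
    for copy in COPIES.get(identifier, []):
        if not copy.licensed_to or affiliation in copy.licensed_to:
            return copy.location
    return None


for who in ("university-a", "university-b", None):
    print(who, "->", resolve("doi:10.9999/example.123", who))
```

    The identifier stays the same for every requester; only the binding to a location varies with context. That is the decoupling of identity and resolution in miniature, and it is also the extra layer of mapping whose cost is weighed below.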

    Each of these examples begs for decoupling of identity and resolution in some contexts, but requires an additional layer of mapping of location or function that, getting back to Sean’s consumer contract:

    all come at very significant extra cost in terms of complexity. On this issue, the world has voted very loudly with its mouse clicking fingers. The world values hyperlinking simplicity over complexity by many orders of magnitude.

    The resolution of these conundrums will require daunting co-evolution of technology, business processes, and cross-community practices.  Technologies such as Handles, DOIs, OpenURLs, PURLs, and "INFO" URIs all represent approaches to addressing aspects of these problems.  At this time they are niche technologies.  Their impact on the constellation of problems we know as identity-versus-resolution will depend far more on business processes than on TECHNOLOGY (or the ideologies of their proponents or detractors).  Meanwhile, the URL rules.