    August 31, 2006

    Version Identifiers, Digital Workflow, and Cyber Squatting

    I’ve been writing about identifiers lately, and I used an example of two identifiers pointing to one resource in one of my posts. It’s a very nice way of addressing the problem of versions on the Web, and while I can’t say for sure that the W3C was the originator of the technique, it is the first place I recall seeing it done.

    The basic model is:

    Latest copy:           http://[domain]/Local_ID
    A particular version:  http://[domain]/Local_ID-DateStamp
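As a rough illustration of this model, the following Python sketch splits a dated version identifier into its latest-copy identifier and its date stamp. The function name and the YYYY-MM-DD date format are assumptions made for illustration (the format is inferred from W3C-style identifiers, not taken from any formal rule).

```python
import re
from datetime import date

# Assumed convention: a version identifier is the latest-copy identifier
# followed by "-YYYY-MM-DD". This pattern is illustrative, not a W3C rule.
_DATED = re.compile(r"^(?P<latest>.+)-(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})$")


def split_version(uri):
    """Return (latest_uri, date_stamp); date_stamp is None for undated URIs."""
    match = _DATED.match(uri)
    if match is None:
        # No trailing date stamp: treat the URI as the latest-copy identifier.
        return uri, None
    # date() raises ValueError if the digits do not form a real calendar date.
    stamp = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return match["latest"], stamp


latest, stamp = split_version("http://example.org/Local_ID-2006-08-17")
print(latest, stamp)
```

Nothing here touches the network: the relationship between the two names is recoverable from the strings alone, which is what makes the convention transparent to a human reader as well.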

    I don’t keep old versions of articles I write, or at least not in any systematic way. Talking with my new-found literary friends, Mark Kelly and Brian Taylor, one of them remarked that most authors don’t keep more than 3 or 4 versions of their manuscripts. This is a big deal in the humanities, but something that is probably not managed with much diligence by writers who use computers to write.

    Keeping multiple versions of documents in a systematic way is easy with paper – a shoebox, or salvaged copier paper box is the system, and it works fine for the most part. It is no surprise that archives are organized around boxes’o’stuff, and that is largely sufficient to the task. Scholars who peel back the layers of this author or that artist will have a much harder time (and a lot less fun) peeling back the… what? File structure of someone’s hard disk? Especially since we throw out our old computer every few years (or lose it to a disk crash), the digital equivalent of throwing away our basement full of memorabilia.  What seems a virtue in the moment (a chance to reorganize and rationalize a now-chaotic old file system) may be a travesty in the eyes of some future cyberhistorian.

    So, the naming approach that the W3C uses for its official documents gives us a simple, understandable way to name collections of versions, and a transparent way to understand their evolution. Most of us won’t do it, even recognizing its value. We need digital shoeboxes, systems that understand versioning (that is, that have an inherent recognition of workflows of various types), and which help us to achieve the goals of such workflows.

    In a previous post I noted a personal interest in the provenance of a particular document currently under development at the W3C. As luck would have it, the latest-version URL for this document is intact, but the version identifier is not, helping me make the point that even in organizations with a diligent commitment to the importance of version chains, it is easy to make mistakes.

    Copied directly from the header information of the W3C TAG Finding URNs, Namespaces, and Registries:

    This version:
    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50-2006-08-17

    Latest version:
    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50

    As part of my explorations of identifier issues I’ve been reviewing this document carefully, as it touches closely on other work I’ve been involved in.

    As it happens, my views are at odds in certain respects with the prevailing views of the W3C Technical Architecture Group (the TAG), and in particular I’ve sparred in public meetings with one of the authors of this document (Henry Thompson). I have high regard for Henry’s intellect, and I further believe his motivations to be of the highest order. So, I was at once flattered and annoyed to find our disagreement memorialized in an editorial comment in this TAG finding:

    Editorial note: HST 2006-06-06

    HST now owns lccn.info and oclcnum.info, will sell to Stuart Weibel for a modest consideration :-)

    Better to be talked about than ignored. Now, I can only guess that this tender of two potentially important properties in cyberspace is unlikely to make the final editorial cut in this official TAG Finding(1). But, I want to be sure that this offer remains part of the formal W3C record. Who knows…I might want them someday. Though, Henry… isn’t this cyber squatting? I’m thinking if you hang on to them long enough, you might be willing to pay me to take them! I wonder if smiley faces have standing in contract law?

    Now... could someone at W3C please fix the URL?

    ;-)

    NOTES:

    (1) A TAG finding is a formal pronouncement on the part of the W3C Technical Architecture Group:

    The primary activity of the TAG is to develop Architectural Recommendations. The TAG findings listed below document fundamental principles that should be adhered to by all Web components. The TAG expects to include these findings in the TAG's Architectural Recommendations, to be published according to the requirements of the W3C Recommendation Track process.

    -----

    Image: Orcas Island Panorama, March 2006. The day I took this picture, you could faintly make out, from the highest point on Orcas, Mt. Rainier, over 200 kilometers distant.

    August 30, 2006

    Uncoupling Identity and Resolution

    My previous post outlined my understanding of Norman Walsh's position on URIs and Identifiers.  I indicated in that post that I believe there are several cases where it is desirable to uncouple identity and resolution, and that in my personal view http:URIs are not ideal for such cases.

    I do not dispute that the requirements embedded in these cases can be addressed using http:URIs. I simply assert that they can be better served within a more complete naming architecture that explicitly accommodates pure identifiers... that is, identifiers that are explicitly uncoupled from resolution protocols.

    The current state of Web architecture does not support pure identifiers, and in fact I believe it is fair to characterize the prevailing attitude towards them among Web and Internet architects as hostile.  Certainly discussions of their merits and demerits in various public fora have often been (over?) heated.

    Resolving Pure Identifiers

    Does this mean that pure identifiers are never subject to resolution? No. It means that a pure identifier must be explicitly bound to a resolution protocol by a process or within a system designed to exploit that identifier.  This is, of course, a far more constrained circumstance than one mostly encounters on the open Web.  What are the means for such resolution?

    1. Binding of a pure identifier and a resolution protocol can be done by convention, as is the practice with DOIs, for example. Such conventions require special knowledge that must be established and maintained within a community of use.  The commercial benefits of using DOIs to help publishers manage intellectual property have helped to overcome this barrier, though the success is modest and fairly closely contained within the publishing community.

    2. It can be done through the use of plugins for standard software such as browsers.  During the URN battles within the IETF, which coincided roughly with the emergence of 'standardized' browsers, much hope was pinned on the use of such plugins to help get URN usage off the ground.  These hopes live yet in some parts of the library community (a number of European libraries have coalesced around the registration of National Bibliography Numbers, or NBNs, as a registered URN namespace), but it is a hope that (in my judgment) is unlikely to be widely realized.

    Even if there were no organizational impediments to the use of plugins, getting users to install such plugins in a widespread way is unlikely.  In fact, the installation of plugins in large organizations is often strictly controlled for security reasons, and hence unlikely to be done widely on the open Web.

    3. The architecture of the Internet includes a registry for URI schemes, maintained by the Internet Assigned Numbers Authority (IANA), which in theory could be deployed so as to ease the recognition of registered URI schemes within the Web.  For a variety of reasons related to lack of compelling business cases, security issues, ideology, and perhaps the convenience of systems developers, very little of this capability has ever seen the light of working code.  The procedure for registering a new URI scheme was revamped as recently as last year (2005), suggesting that some in a position of influence still consider this important, but at this stage of Web technology development, the prospect of widely deployed software that can use arbitrary URI scheme declarations sensibly seems remote.

    4. Finally, labelled identifiers are often easily recognized and parsed from open Web data.  ISBNs are a good example of this, as they are widely recognized by end-users, have a public syntax, and are typically labeled with the "ISBN" token, making them easy to identify in unstructured data.  Oh... and there is a business case for their use.  Easy pickings.  Never mind that the identifier system itself has flaws... it works well enough, often enough, to generate value.   Given that there are many legacy identifier systems in the non-web world, and there is great confusion about how best to 'webulate' such systems, it is comforting that, as long as well-formed and labelled identifiers are used, they will be findable.

    This final approach may constitute the most reliable path to recognition of a given class of pure identifier.  It does not require agreement on the part of Internet architecture gatekeepers, and it is market driven.  So if there is in fact a market need, there is some chance that it can be filled by entrepreneurial zeal or even community-mindedness.
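The first and last of these approaches can be combined in a short Python sketch: find labelled identifiers in unstructured text, then bind each one to a resolver by convention. The resolver table is an assumption for illustration; `https://doi.org/` reflects the DOI proxy convention, while the ISBN resolver address is purely hypothetical. The patterns are deliberately simple and will miss many real-world variants (ISBNs written with spaces, for instance).

```python
import re

# Assumed bindings between identifier classes and resolvers. The binding
# lives in this table, outside the identifiers themselves.
RESOLVERS = {
    "doi": "https://doi.org/",                     # DOI proxy convention
    "isbn": "https://resolver.example.org/isbn/",  # hypothetical service
}

# Labelled identifiers: the "ISBN" or "doi:" token makes them easy to spot.
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9-]{8,15}[0-9Xx])")
DOI_RE = re.compile(r"\bdoi:\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)


def isbn_is_valid(isbn):
    """Check the ISBN-10 or ISBN-13 check digit of a hyphen-free string."""
    if len(isbn) == 10 and re.fullmatch(r"\d{9}[\dX]", isbn):
        total = sum((10 - i) * (10 if ch == "X" else int(ch))
                    for i, ch in enumerate(isbn))
        return total % 11 == 0
    if len(isbn) == 13 and isbn.isdigit():
        total = sum(int(ch) * (3 if i % 2 else 1)
                    for i, ch in enumerate(isbn))
        return total % 10 == 0
    return False


def find_identifiers(text):
    """Return (scheme, value) pairs for labelled identifiers found in text."""
    found = []
    for match in ISBN_RE.finditer(text):
        isbn = match.group(1).replace("-", "").upper()
        if isbn_is_valid(isbn):
            found.append(("isbn", isbn))
    for match in DOI_RE.finditer(text):
        # Drop sentence punctuation that the greedy \S+ may have swallowed.
        found.append(("doi", match.group(1).rstrip(".,;)")))
    return found


def bind(scheme, value, resolvers=RESOLVERS):
    """Bind a pure identifier to a resolver address by convention."""
    return resolvers[scheme] + value


sample = "Cited as ISBN 0-306-40615-2 and doi:10.1000/182, among others."
for scheme, value in find_identifiers(sample):
    print(scheme, value, bind(scheme, value))
```

The point of the design is that the identifier carries no resolution promise of its own. A different community could swap in a different resolver table without changing a single identifier.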

    -----

    Image: Lady Washington, in Tacoma Harbor, with Rainier-san as a backdrop (August 27)

    Do I understand you to say...?

    Mortimer Adler argued that civil discourse requires first and foremost that one must have a clear notion of what one's fellow dialecticians are actually saying.  That is, one should begin every discussion with the question "Do I understand you to say...?"  In this spirit, I trust that others, or perhaps even Norman himself, might point out misinterpretations I have made in my reading of his post Names and Addresses.  My paraphrases correspond to the major headings in his post.

    "They [identifiers] are just strings"

    I agree with this point.  If you ignore the semantics (or implied semantics) of a name, it is just a collection of characters with parsing rules that ensure a globally unique string. The http:// is, for ID purposes, largely irrelevant and the Domain Name Service (DNS) provides a wonderfully effective means of providing for...

    Distributed Naming

    The DNS system has proven to be robust and stable, and an effective means for distributing local naming authority in a globally distributed way.  That is, every domain owner has the authority and means to assign names within a particular domain (namespace), or even subdivide that authority into smaller namespaces.

    Norman evinces a confidence in the persistence of DNS name management as likely to outlast any new organization created to manage a newly created URI namespace (referred to in his post as newscheme:).  This is a reasonable bet if indeed a new organization were created solely for such purpose.  In fact, it is more likely that such functions will be managed within existing stable organizations (my own thoughts naturally run towards libraries and their sound reputation for curating information for the long haul).  I suspect Norman was thinking about the DOI Foundation, host of the DOI namespace, which emerged in response to the interests of commercial publishers in an autonomous identifier assignment entity.

    Globally unique (unambiguous) names are important

    The DNS, again, assures this global uniqueness.  But, it is within the purview of the local name authority (domain owner) to reassign names and their corresponding referents as it sees fit.  That is entirely appropriate in some cases, and not so in others.  Consider, for example, the following two URIs:

    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50-2006-08-17
    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50

    At the moment that I am writing this sentence, these two identifiers map to the same resource.  At some point in the future, the second one (the persistent identifier associated with the latest version) will map to another, more polished version, while the former identifier will remain associated with the current version as mandated by the policy of the resource curator (the W3C in this case).
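The policy described here can be sketched in a few lines of Python. The class below is a hypothetical illustration, not a description of how the W3C actually manages its documents: each publication binds a dated identifier permanently and rebinds the latest-version identifier to the new content.

```python
class VersionRegistry:
    """Hypothetical curator policy: dated names are fixed, 'latest' moves."""

    def __init__(self, base):
        self.base = base.rstrip("/")
        self._bindings = {}  # identifier -> content it currently refers to

    def publish(self, local_id, date_stamp, content):
        latest = f"{self.base}/{local_id}"
        dated = f"{latest}-{date_stamp}"
        if dated in self._bindings:
            # The invariant: a dated identifier is never reassigned.
            raise ValueError(f"{dated} is already bound")
        self._bindings[dated] = content
        self._bindings[latest] = content  # the latest name is rebound
        return latest, dated

    def resolve(self, identifier):
        return self._bindings[identifier]


registry = VersionRegistry("http://example.org/doc")
latest, first = registry.publish("Finding-50", "2006-08-17", "draft 1")
print(registry.resolve(latest) == registry.resolve(first))  # same resource now

registry.publish("Finding-50", "2006-09-01", "draft 2")
print(registry.resolve(latest), "|", registry.resolve(first))
```

The code enforces the invariant mechanically, but in practice it is the curator's diligence, not the software, that keeps the dated name bound to its version.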

    For most scholarly assets, it is important that the relation of an identifier to its referent be invariant.  Norman correctly points out that this is a "matter of diligence and trust," a social issue.  The example  of the W3C URIs points out that the trust is invested in a particular policy for maintaining a chain of evidence (in this case, a series of versions).  It is not always simple to achieve this, and often will require close curatorial attention (the diligence part).

    Persistence is important

    The essence of this issue is also, as Norman states, social.  There is no technical assurance of persistence for either the names or the resources they identify.  The only guarantee of persistence is the commitment of the organizations with curatorial responsibility for them.
     
    Many of the early URLs identifying the beginnings of the Web have long since broken.  CERN, the birthplace of the Web, decided it was about doing and curating physics, not Web technology.  In those breathless early years, as custody was transferred, the identifiers did not always survive.  It is perhaps the case that some of the documents did not as well... I don't know. 

    It has been argued that the http component of URIs is a weak link, as protocols do not last forever. This argument is spurious, both because of the "It's just a string" argument, and because the world is so deeply dependent on this infrastructure at this point that anything that succeeds http will necessarily be backward-compatible.  Once again, we agree.

    Resolvable Identifiers -- HTTP is a winner

    I prefer to use the terminology of resolvable rather than retrievable as Norman does, as I think it better captures the notion of mapping of identifiers and resources.  Norman's point that http is a clear winner is true in one sense, but may mislead in another.  Our first point of disagreement, I think.

    I would not argue that there is a better alternative than the http protocol for retrieval purposes on Planet Web.  I do argue, however, that there are circumstances in which resolution should be explicitly uncoupled from identity.  I will explicate some of these circumstances in a subsequent post.

    I believe that Norman and others will counter with the "It's only a string" argument, and from a technical perspective, this is exactly correct.  There is no technical requirement that an http:URI must be resolvable.  It can act as a globally-unique string that maps conceptually to a real or abstract asset, whether or not an http server ever acts on it.

    My objection to this approach goes back to the outlandish success of the Web, and to the implicit social contract of resolvability.  You can correctly assert that http is just a substring and carries no promise, but not if you live on Planet Semantics as well as Planet Web.  Such URIs will be recognized by machines and resolved (or at least resolution will be attempted) and they will be recognized by people, who will expect something to be at the end of the link. To the extent that that something is other than what they expect, unwelcome surprise results.  The overloading of resolution and identity is not only benign, but advantageous in most circumstances.  But not all.

    The question as I see it is, is there sufficient value in the use of some newscheme: pure identifier to justify the effort in establishing a separate identifier protocol, maintaining an appropriate registry, and supporting its use?  Which brings us to Norman's last objection:

    Paying for names

    Norman asks "Why pay for something new when I've already got what I want?"  A good question, and for the great majority of assigners of identifiers, there is no need. Names, even DNS-based names, are not free.  We pay for domains, annually renewable.  We pay (largely hidden) costs of assigning and maintaining the integrity of the identifier mappings under our authority. As far as I am aware, there are no examples of naming systems that require payment for the end-use of the names (resolving them), though at least one (the DOI system) requires a fee for issuance or assignment to resources.

    Why would you pay for such identifiers if their equivalent is available in the technology (already paid for) at hand?  In the case of DOIs, it is presumably because their constituency (largely commercial publishers) finds their use productive in the context of their business model.

    In summary:

    • I agree almost entirely with the substance of Norman's arguments about the suitability of HTTP:URIs as a substrate for persistent, globally-unique identifiers to support resolution. 
    • We disagree, I think, on the promise of resolution implied by the http protocol token, and whether or not this has practical importance.
    • Many adherents to the "just use HTTP" argument reject the argument that it can be useful to uncouple identity and resolution, and with it, the assertion that http identifiers may sometimes be less desirable than 'pure' identifier alternatives.
    • Departing from the world's most widely-deployed identifier system (http:URIs) involves both costs and vulnerabilities.  Any effective alternative must offset these disadvantages through significant added value.

    I will elaborate on some of these points of difference in a subsequent post.

    POSTSCRIPT:

    As I was proofing this post after I published it, I found that one of the two URIs supposedly identifying the W3C TAG finding does not actually work.  The "latest version" link works, and that is the most important one, but the time-stamped version does not.  This is an excellent example of the difficulty that every organization has in actually meeting its responsibilities of "diligence and trust".  In my experience, the W3C takes these responsibilities seriously.

    Is such an issue a problem?  After all, the latest version link is the important one, no?  It is, but if you're interested in evidence chains in scholarship, then all the links are important, and as this example illustrates, they can be fragile.  In a subsequent post, I'll return to this issue and illustrate why both of these links are important to me personally.

    -----

    Image: Ship's prow, Tacoma Harbor, taken from Highlander, August 27

    On Identifiers, Scholarship, and Spittoons

    The problem in talking about identifiers is encapsulated in the Spittoon Joke.  If you're not familiar with this joke, I'm sorry, but it is too tasteless to relate away from the flicker of a campfire.  The essence is that there's no easy place to stop once you start, and the starting place isn't always obvious either.  This is the dilemma I've been struggling with, having agreed to comment on a blog post by Bruce D'Arcus on identifiers: URIs as Names.

    Bruce approaches the question from the perspective of tools for scholars, with the normal sets of problems that scholars have, including persistent citation.  Among the things that we want from citations is a convenient handle for any arbitrary resource, a handle that we can use to hang the resource on our scholarly pegboard, pick up the resource, pass it to others, let them hang it on their scholarly pegboard, and so forth.  Since it is easier to arrange our pegboards in standard ways, it is best if the handles are the same size and configuration, or barring that, that the number of different styles of handles is small.

    We'd all be fine if we could agree on a single style of handle for all the resources that we want to manage, right?  And Lo!  We now all live on Planet Web, as Norm Walsh puts it, and as Web denizens, we know that http:URIs are the obvious and most useful form of identifiers, and hence our problem is solved, now and always, and for all manner of resources ever to be conjured for scholarly or other purposes.

    We are nearly 15 years downstream from the New York Times article that served to awaken many of us to Tim Berners-Lee's marvelous creation. The URL, the now-discredited moniker that has been displaced in discussions such as these by the term http:URIs, has indeed become the most widely-used identifier in the world.

    For all that, we still don't enjoy the identifier heaven that the Web promises.  I'd like to explore some of the reasons I think this is the case, and perhaps even argue a particular perspective or two.  Bruce's post on the subject points to Norman Walsh's blog post in which he argues that http:URI schemes are entirely sufficient to the need, and to deviate from this is harmful.  Norman's post points also to a W3C TAG finding under development by Henry Thompson and David Orchard, presumably giving voice to the official W3C position on Web naming issues.  I will begin my exploration of these issues by reprising the arguments in these documents, in case you want to read ahead.  Reaching, now, for the spittoon....

    -----

    Image: Seattle evening skyline, featuring the Smith Tower (the short, pointy white building), for many years the tallest building west of the Mississippi River.  Taken from the Alaskan Way Viaduct, August 27

    January 31, 2006

    Identity and location on the Web

    URLs versus URIs

    In Sean McGrath’s piece on URLs and social commitment, he alludes to the common confusion between the acronyms URL and URI, correctly pointing out that only Web-head protocol-wonks are liable to be caught using the URI terminology (and most of them don't usually either).  Everyone else on the planet uses the largely-interchangeable and better-understood moniker of URL.

    Is this a distinction without a difference, as usage would lead us to believe? Unhappily, in today’s Web, it rarely matters a whit. Sean tells us why:

    The great thing about URLs is that you can click on them.

    That is a great thing, and it informs our expectation of URLs to the exclusion of all other possibilities. The social contract implied in http:// is, then, that they are actionable: you can click on them and bring the referent of the link into your machine and read it or listen to it or watch it. The link serves as a pointer to a location, and clicking on it invokes a behavior specified by the http protocol: voila!

    So, what's to be unhappy about?  Overloaded onto this simple actionable relationship is the additional important function of identity. The URL serves as both a key for a retrieval transaction, and an identifier. It is no accident that CNRI, in a stroke of marketing genius, chose the term Handle for their identifier protocol, for that is exactly the right term for such identifiers. We want a handle for otherwise-slippery electronic content so we can hold on to it, pass it back and forth, refer to it, and hang it over our desk to grab like a frying pan from a pot-rack.

    Mostly this overloading of identity and location/retrieval is fine. And to the extent that it is, the conflation of URL and URI is not a problem. So what is missing?

    Three things:

    1. Persistent reference pointers – There are many classes of electronic resources that we know we will want to refer to in a location-independent way for as long as we can imagine. Books, journal articles, or any component of a persistent resource of cultural, social, or economic importance. Yes, we want these to be actionable (clickable) as far as possible, but sustainable access requires that we distinguish between identity and resolution in the life cycle of any information resource of more than passing importance.  Conflating location and identity makes this harder (though not impossible).

    2. Appropriate copy resolution – In a world without access barriers, any copy is the appropriate copy. In a world of tradable intellectual property, individuals and organizations have differential access to resources. The Web should be neutral about business models, but it cannot be indifferent to them. Owners of IP must have the means to manage and meter access, and this generally implies a decoupling of identity and resolution.

    3. Conceptual resources – Our expectation that clicking 'gets' us something is not fully met if the resource is a conceptual asset.  The development of Semantic Web technology demands the application of an identity architecture to concepts as well as documents, multimedia, and pizza-ordering forms. Proponents of the just-let-HTTP-do-it approach rightly point out that HTTP URLs are entirely capable of being used for identifying conceptual resources (to use the RDF parlance). This is undeniably true as far as the technology goes. The more interesting question is: what happens when you click on such a link? What SHOULD happen? The answer is context dependent. Some people, myself included, are uncomfortable with the use of standard HTTP URLs for this purpose, because it breaks the widespread-if-informal social contract of URLs.  With concepts, you may wish to define them, or locate them within a larger conceptual structure, or access various of their attributes, but in general you're not trying to retrieve them.

    Each of these examples begs for decoupling of identity and resolution in some contexts, but requires an additional layer of mapping of location or function that, getting back to Sean’s consumer contract:

    all come at very significant extra cost in terms of complexity. On this issue, the world has voted very loudly with its mouse clicking fingers. The world values hyperlinking simplicity over complexity by many orders of magnitude.
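To make the extra mapping layer concrete, here is a minimal Python sketch of appropriate-copy resolution. Everything in it is hypothetical: the identifier (in the `urn:example` namespace reserved for examples), the holdings table, the affiliation labels, and the locations. It shows only that one identity can resolve to different locations depending on who is asking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    location: str
    licensed_to: frozenset  # an empty set means the copy is open to anyone


# Hypothetical holdings: one identity, several copies under different terms.
HOLDINGS = {
    "urn:example:article-42": [
        Copy("http://publisher.example.com/a42", frozenset({"univ-a"})),
        Copy("http://repository.example.org/a42", frozenset()),
    ],
}


def appropriate_copy(identifier, affiliation, holdings=HOLDINGS):
    """Return the first location this requester may use, or None."""
    for copy in holdings.get(identifier, []):
        if not copy.licensed_to or affiliation in copy.licensed_to:
            return copy.location
    return None


print(appropriate_copy("urn:example:article-42", "univ-a"))
print(appropriate_copy("urn:example:article-42", "univ-b"))
```

Even this toy version needs a holdings table that someone must maintain, which is exactly the added complexity the quotation above warns about.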

    The resolution of these conundrums will require daunting co-evolution of technology, business processes, and cross-community practices.  Technologies such as Handles, DOIs, OpenURLs, PURLs, and "INFO" URIs all represent approaches to addressing aspects of these problems.  At this time they are niche technologies.  Their impact on the constellation of problems we know as identity-versus-resolution will depend far more on business processes than on TECHNOLOGY (or the ideologies of their proponents or detractors).  Meanwhile, the URL rules.