
    October 30, 2006

    Global Scope and Global Uniqueness are Different

    I have been poking around in the Life Sciences Identifier (LSID) spec in recent days, and in fact a few of the die-hard semgrail (semantic web discussion group) folks met to discuss it recently.

    At one point in the spec, the authors reprise LSID functional requirements with reference to URN requirements (LSIDs are registered as a URN sub-namespace). The laundry list includes both:

    • Global scope
    • Global uniqueness

    I stumbled at this distinction in my first reading. Global uniqueness is part of identifier canon law, but also trivial to achieve in today’s distributed networking environment. We know it is foundational, but we get it for free as part of DNS network addressing.

    But what about global scope? My first reaction was to think that they had merely parroted something from the URN specs and it had no direct implications for the functionality of LSIDs. 

    Further reflection occasioned an Aha! moment, however. Global scoping is not really an attribute of an identifier, but rather part of a service environment that comes into play when we deploy and propagate them.  My modest flash of insight on this is born of my thinking about identifiers having to do with WorldCat.org.  If we are as a community to increase the value of our links (that is, links to library-held materials), we need to agree on standard identifiers that are recognized and universally actionable (global scope of recognition).

    The following identifiers are all valid within WorldCat:

    http://www.worldcat.org/oclc/26160663&referer=brief_results

    http://www.worldcat.org/search?q=083890596X&qt=owc_search

    http://www.worldcat.org/search?q=083890596X

    http://www.worldcat.org/oclc/26160663

    Inspection tells us that three of the four are transactional identifiers and are thus less likely to be persistent. The last is an OCLC-number-based URL, which has, naturally enough, assumed the moniker of permalink in WorldCat. That’s the one to use. (It needs to be made more prominent in WorldCat.  Don’t make me click to get it… it should be in bold at the head of every record presentation.)
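    As a rough illustration of the difference, here is a minimal sketch that reduces a WorldCat URL to its OCLC-number permalink when the number is present in the path. The function name and the path pattern are my own assumptions, inferred only from the four example URLs above, not from any published WorldCat URL specification.

```python
import re
from urllib.parse import urlsplit

# Assumption: permalinks look like /oclc/<digits>, as in the examples above.
_OCLC_PATH = re.compile(r"^/oclc/(\d+)")
_WORLDCAT_HOSTS = {"www.worldcat.org", "worldcat.org"}


def worldcat_permalink(url):
    """Return the OCLC-number permalink for a WorldCat URL, or None.

    Transactional decoration after the number (such as a trailing
    '&referer=...') is dropped. Search URLs carry only a query string
    (for example an ISBN), so they cannot be reduced to a permalink
    without a lookup, and None is returned for them.
    """
    parts = urlsplit(url)
    if parts.hostname not in _WORLDCAT_HOSTS:
        return None
    match = _OCLC_PATH.match(parts.path)
    if match is None:
        return None
    return f"http://www.worldcat.org/oclc/{match.group(1)}"
```

    Note that the two search URLs yield nothing: the ISBN in the query string is a different identifier, and mapping it to an OCLC number takes a database lookup rather than string surgery.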

    Permalink is a good name here, but it is also misleading in that people use the term rather indiscriminately. For example, this blog entry has a permalink as well, but let’s face it, its persistence extends only as far as the business case that supports it (that is, until I stop paying my annual fee to TypePad or TypePad goes out of business). In the case of a WorldCat permalink, it is a reasonable bet that it will be useful as long as there is a globally-scoped bibliographic database. 

    Which brings me back to the point of this entry.  One of the virtues of having the entire WorldCat database available to the Internet Commons is that we have a stable identifier that is globally scoped and inclusive of what increasingly rounds to the world’s library assets (yes, I’m overreaching, but a bibliographic utility’s reach should exceed its grasp, or what’s a WorldCat for?).

    If I use an identifier bound to a local library system or any locally-scoped system, then I’ve done little to reduce barriers to more effective global access.  A record identifier for a given asset in the UW system is probably different from that in, say, OhioLINK.  An identifier in WorldCat will identify a given asset anywhere, and may be used for direct access within locally-scoped systems as well, if the systems are designed with global scoping in mind. 

    This doesn’t happen automatically, of course… system designers have to recognize and exploit the global scope of a given identifier. This is the nature of network value – the more something is used for a given network purpose, the more valuable it becomes. 

    For the first time, the library community has a public, globally-scoped identifier, and WorldCat provides it.  It should be the identifier of first choice in library systems.

    -----

    Image: Early morning view of the 520 bridge from the Union Bay Natural Area on Lake Washington.

    August 30, 2006

    Do I understand you to say...?

    Mortimer Adler argued that civil discourse requires first and foremost that one must have a clear notion of what one's fellow dialecticians are actually saying.  That is, one should begin every discussion with the question "Do I understand you to say...?"  In this spirit, I trust that others, or perhaps even Norman himself, might point out misinterpretations I have made in my reading of his post Names and Addresses.  My paraphrases correspond to the major headings in his post.

    "They [identifiers] are just strings"

    I agree with this point.  If you ignore the semantics (or implied semantics) of a name, it is just a collection of characters with parsing rules that ensure a globally unique string. The http:// is, for ID purposes, largely irrelevant and the Domain Name System (DNS) provides a wonderfully effective means of providing for...

    Distributed Naming

    The DNS has proven to be robust and stable, and an effective means for distributing local naming authority across a global network.  That is, every domain owner has the authority and means to assign names within a particular domain (namespace), or even subdivide that authority into smaller namespaces.

    Norman evinces a confidence in the persistence of DNS name management as likely to outlast any new organization created to manage a newly created URI namespace (referred to in his post as newscheme:).  This is a reasonable bet if indeed a new organization were created solely for such purpose.  In fact, it is more likely that such functions will be managed within existing stable organizations (my own thoughts naturally run towards libraries and their sound reputation for curating information for the long haul).  I suspect Norman was thinking about the DOI Foundation, host of the DOI namespace, which emerged in response to the interests of commercial publishers in an autonomous identifier assignment entity.

    Globally unique (unambiguous) names are important

    The DNS, again, assures this global uniqueness.  But, it is within the purview of the local name authority (domain owner) to reassign names and their corresponding referents as it sees fit.  That is entirely appropriate in some cases, and not so in others.  Consider, for example, the following two URIs:

    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50-2006-08-17
    http://www.w3.org/2001/tag/doc/URNsAndRegistries-50

    At the moment that I am writing this sentence, these two identifiers map to the same resource.  At some point in the future, the second one (the persistent identifier associated with the latest version) will map to another, more polished version, while the former identifier will remain associated with the current version as mandated by the policy of the resource curator (the W3C in this case).
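    The relationship between the two forms can be sketched in a few lines. This is only an illustration, assuming that a dated snapshot URI always ends in a "-YYYY-MM-DD" stamp as in the example above; the function names are hypothetical, and nothing here reflects how the W3C actually manages its URI space.

```python
import re

# Assumption: a dated snapshot ends in "-YYYY-MM-DD", as in the example above.
_DATE_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_dated_snapshot(uri):
    """True if the URI carries a trailing date stamp."""
    return _DATE_SUFFIX.search(uri) is not None


def latest_version_uri(uri):
    """Strip a trailing date stamp, giving the 'latest version' form.

    A URI without a date stamp is returned unchanged.
    """
    return _DATE_SUFFIX.sub("", uri)
```

    The string manipulation is trivial; what the code cannot express is the policy promise that the dated form will keep pointing at the same version while the undated form moves on.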

    For most scholarly assets, it is important that the relation of an identifier to its referent be invariant.  Norman correctly points out that this is a "matter of diligence and trust," a social issue.  The example  of the W3C URIs points out that the trust is invested in a particular policy for maintaining a chain of evidence (in this case, a series of versions).  It is not always simple to achieve this, and often will require close curatorial attention (the diligence part).

    Persistence is important

    The essence of this issue is also, as Norman states, social.  There is no technical assurance of persistence for either the names or the resources they identify.  The only guarantee of persistence is the commitment of the organizations with curatorial responsibility for them.
     
    Many of the early URLs identifying the beginnings of the Web have long since broken.  CERN, the birthplace of the Web, decided it was about doing and curating physics, not Web technology.  In those breathless early years, as custody was transferred, the identifiers did not always survive.  It is perhaps the case that some of the documents did not as well... I don't know. 

    It has been argued that the http component of URIs is a weak link, as protocols do not last forever. This argument is spurious, both because of the "It's just a string" argument, and because the world is so deeply dependent on this infrastructure at this point that anything that succeeds http will necessarily be backward-compatible.  Once again, we agree.

    Resolvable Identifiers -- HTTP is a winner

    I prefer to use the terminology of resolvable rather than retrievable as Norman does, as I think it better captures the notion of mapping of identifiers and resources.  Norman's point that http is a clear winner is true in one sense, but may mislead in another.  Our first point of disagreement, I think.

    I would not argue that there is a better alternative than the http protocol for retrieval purposes on Planet Web.  I do argue, however, that there are circumstances in which resolution should be explicitly uncoupled from identity.  I will explicate some of these circumstances in a subsequent post.

    I believe that Norman and others will counter with the "It's only a string" argument, and from a technical perspective, this is exactly correct.  There is no technical requirement that an http:URI must be resolvable.  It can act as a globally-unique string that maps conceptually to a real or abstract asset, whether or not an http server ever acts on it.

    My objection to this approach goes back to the outlandish success of the Web, and to the implicit social contract of resolvability.  You can correctly assert that http is just a substring and carries no promise, but not if you live on Planet Semantics as well as Planet Web.  Such URIs will be recognized by machines and resolved (or at least resolution will be attempted) and they will be recognized by people, who will expect something to be at the end of the link. To the extent that that something is other than what they expect, unwelcome surprise results.  The overloading of resolution and identity is not only benign, but advantageous in most circumstances.  But not all.

    The question as I see it is, is there sufficient value in the use of some newscheme: pure identifier to justify the effort in establishing a separate identifier protocol, maintaining an appropriate registry, and supporting its use?  Which brings us to Norman's last objection:

    Paying for names

    Norman asks "Why pay for something new when I've already got what I want?"  A good question, and for the great majority of assigners of identifiers, there is no need. Names, even DNS-based names, are not free.  We pay for domains, annually renewable.  We pay (largely hidden) costs of assigning and maintaining the integrity of the identifier mappings under our authority. As far as I am aware, there are no examples of any naming systems that require payment for the end-use of the names (resolving them), though at least one (DOIs) requires a fee for issuance or assignment to resources.

    Why would you pay for such identifiers if their equivalent is available in the technology (already paid for) at hand?  In the case of DOIs, it is presumably because their constituency (largely commercial publishers) finds their use productive in the context of their business model.

    In summary:

    • I agree almost entirely with the substance of Norman's arguments about the suitability of HTTP:URIs as a substrate for persistent, globally-unique identifiers to support resolution. 
    • We disagree, I think, on the promise of resolution implied by the http protocol token, and whether or not this has practical importance.
    • Many adherents to the "just use HTTP" argument reject the argument that it can be useful to uncouple identity and resolution, and with it, the assertion that http identifiers may sometimes be less desirable than 'pure' identifier alternatives.
    • Departing from the world's most widely deployed identifier system (http:URIs) involves both costs and vulnerabilities.  Any effective alternative must offset these disadvantages through significant added value.

    I will elaborate on some of these points of difference in a subsequent post.

    POSTSCRIPT:

    As I was proofing this post after I published it, I found that one of the two URIs supposedly identifying the W3C TAG finding does not actually work.  The "latest version" link works, and that is the most important one, but the time-stamped version does not.  This is an excellent example of the difficulty that every organization has in actually meeting its responsibilities of "diligence and trust".  In my experience, the W3C takes these responsibilities seriously.

    Is such an issue a problem?  After all, the latest version link is the important one, no?  It is, but if you're interested in evidence chains in scholarship, then all the links are important, and as this example illustrates, they can be fragile.  In a subsequent post, I'll return to this issue and illustrate why both of these links are important to me personally.

    -----

    Image: Ship's prow, Tacoma Harbor, taken from Highlander, August 27

    July 22, 2006

    Passion: the Plasma of Innovation

    The heads-up from Michael Braley that got my last few posts going included an interesting analysis at Hitchhiker 650 suggesting that ownership of the semantic Web devolves to those clever enough to extract semantics via a so-called 'normalization layer'.

    It may be true that Google leads the search space, and verbulation of the company name (Google is now an entry in Merriam Webster) testifies to its public mind share in search (or maybe Merriam Webster is making a shameless play for attention!?).  Their collateral gaze has been impressive as well, setting standards for innovation that scrambles markets and user expectations. They have a stunningly effective business model with only modest overtones of predation and nuanced hints of greed.  They do great stuff, and everyone else is playing catchup.  Maybe. 

    But a monopoly?  The character of their business is not monopolistic.  Yes, the financial capital necessary to deploy server farms on the scale of Google is an impediment, but there is a surprising number of organizations for whom this is not a serious barrier.  Witness the competition raging among Google, Yahoo, and Microsoft, with Amazon and others nipping at their heels.  And the observation that recently MySpace became the number-one visited site on the Web further challenges the notion that any one organization (or even one sector) has anything approaching monopolistic sway over the Web.

    Google is powerful because we love their stuff 'n style.  Their competitive edge is innovation, and conventional wisdom credits them with being the best at channeling that innovation into attention, which attracts advertising dollars.  If innovation is the Prime Raw Material of the new millennium, then success will accrue to those who learn to manage it most effectively.  There are quasi-methodological approaches to doing so, but it's more mystery than science.

    Remember fusion power?  The promise of limitless energy from 'burning' water to create a plasma as hot as a mini-sun. There's this issue of the containment 'vessel' though.  Passion is the plasma of innovation.  We don't know how to contain it any better.  To manage it... touch it... is to quench it.  You cannot create it with business plans or visions of fast cars and pretty boats.  It is born of inspiration, temperament, ideology, hope, even love.  Google is pretty good at attracting it and nurturing it, but I'm guessing they have no more ability to create it than anyone else.  Get distracted, take your eye off the [fire] ball, and it's gone, reemerging elsewhere.  Is this a sustainable competitive advantage, let alone a prescription for monopoly?  Nawwwww.

    We return you now to your regularly scheduled anxiety of the month... Net Neutrality.

    -----

    Image:(Red Tailed?) Hawk on Hurricane Ridge in Olympic National Park, July, 2006

    July 21, 2006

    When last we left our Hero....

    Other points that TBL made in his AAAI keynote (as inferred from news.com):

    • Developers should use semantic languages and RDF
    • Tagging will improve semantic efficacy on the Web
    • The Web is the database, and RDF defines its syntax

    We all agree that we want more from the Web. Not just pages, but structured data that facilitates recombination. This is essential to so-called Web 2.0 enterprise. Tim’s support of tagging acknowledges, importantly, I think, the role of people at large in driving semantics onto the Web. Not just catalogers, not just publishers, data-providers, and IT departments. End-users. (Never mind that there are no end-users in a graphical world… we’re mostly chain users.) The benefits of [bottom-up] tagging, and its relationship to top-down systems like ontologies, thesauri, knowledge organization systems, and controlled vocabularies, remain to be elucidated in any long-term way. Certainly the hype and expectation are strong, and evidence of value accumulates in the business plans and performance of myriad Start-Up 2.0s.

    We can agree that the Web is a database of sorts. The Database, in fact. It lacks the formalisms of relational databases, but embodies the powerful idiom of linkage, making it both naturally object-oriented and graph-based -- flexible, extensible, robust, self-organizing.  Unwieldy, too.

    RDF is intended to provide some of the missing formalisms, and thereby make the data more interoperable, recombinant, and hence re-useful. It adds little value in a closed system, and hence has experienced only modest adoption.  Presumably its value will be greater in the open system of the Web. Of course, most applications have been designed to work in closed systems, leaving RDF a long-suffering next-year’s technology.
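    To make "interoperable and recombinant" a little more concrete, here is a minimal sketch of the underlying idea: statements as subject-predicate-object triples that share identifiers, so that data from two sources can be merged by simple set union. Plain Python tuples stand in for RDF here (no RDF library or serialization is involved), and everything under example.org, along with the titles and names, is invented for illustration. Only the Dublin Core element namespace is a real vocabulary.

```python
# Dublin Core element namespace (a real vocabulary). Everything under
# example.org, and all the literal values, are invented for illustration.
DC = "http://purl.org/dc/elements/1.1/"
BOOK = "http://example.org/book/1"

# Two independent sources describing the same resource.
catalog_a = {
    (BOOK, DC + "title", "An Example Title"),
    (BOOK, DC + "creator", "A. Author"),
}
catalog_b = {
    (BOOK, DC + "creator", "A. Author"),
    (BOOK, DC + "date", "2006"),
}


def merge(*graphs):
    """Merge graphs by set union; identical statements collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph, subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)
```

    The merge only works because both sources use the same identifier for the book and the same identifiers for the properties, which is the open-system payoff: in a closed system you already control both sides and gain little.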

    The confusion between RDF Schema and XML Schema, overlapping but distinct schema declaration idioms, went on for years, and did little to bolster RDF’s prospects.  I'm afraid the W3C itself has to own up to responsibility for this.

    Has next-year arrived with Web 2.0 and the emergence of Web applications? As interoperability and recombination become the main attractions rather than sideshows, RDF and related enabling technologies may rise in importance, finally getting onto page-one specifications.  I think this is Tim's hope and expectation.

    Google’s Norvig might point out that injecting/extracting meaning from a proprietary commercial perspective… the normalization layer alluded to in Hitchhiker 650… is likely to happen faster and have longer legs than promulgating encoding standards that require well intentioned and well informed users to deploy.   In fact, I think that has been a frequent message.  The enthusiastic acceptance of so much of what Google does (and does well), is compelling evidence that there is plenty of semantics on the Web... just not necessarily the  W3C variety.  Will Google thus own the semantic Web, by virtue of having driven normalization-layer innovation? Thinking….

    -----

    Image: Rowboats at the Center for Wooden Boats on Lake Union in Seattle, taken during the annual wooden boat show.  I'm going to spend my next life in their workshop, leaving only for forays to Allegro for lattes and Magus for used books.

    In this Corner... Sir Tim, Inventor of the Web...!

    The face-off between Tim Berners-Lee and Peter Norvig (it is probably too pointed to call it that) came from Tim's AAAI keynote flogging the potential of the Semantic Web, and the importance of various enabling technologies in bringing that to pass.  Mr. "When I invented the Web" is the right man for the job.  His vision and practicality in bringing us this most wondrously cobbled together platform has succeeded, as wiser pundits than I have observed, in part because of its ability to fail gracefully.  Document not found?  Look elsewhere... the system doesn't fail, though.  Website gone? Pity, but there are lots of others, and the system doesn't fail.  The system... the Web... is simple and resilient, and is layered on another elegantly simple and resilient system -- the Domain Name System.

    But the Semantic Web... well, semantics aren't so simple.  If the news reports about Tim's talk (and my inferences) are close, the points he made are roughly:

    • Persistent Identifiers are critical
    • Developers should use semantic languages and RDF
    • Tagging will improve semantic efficacy
    • The Web is the database, and RDF defines its syntax

    Persistent Identifiers are indeed central to the problem.   I may assume too much, but I think Tim's perspective is that URIs as they exist today -- essentially, URLs -- are enough for all purposes.  On this, I would disagree. 

    From a technical point of view, he is, of course, correct.  The operational characteristics of URLs, in conjunction with the DNS system, are sufficient to meet any identity requirement for online resources.  It is in the realm of policy that problems emerge.   Tim's assertion that important identifiers be persistent speaks to this, though not convincingly enough in my judgment.

    The problem is that Identifiers are overloaded and multifaceted.  They play different roles at different points in the lifecycle of the resources they identify.  Changes are made to meet the exigencies of changing business models, and naming-theoretic arguments just don't hold up to those imperatives. 

    Norvig's observations that users often don't understand the significance of the technical decisions that they make (or don't make) bolster this view.

    <unsubstantiated supposition follows>
    So, without branded identifiers -- identifiers whose form and syntax proclaim that they are managed according to publicly defined policies -- achieving robust identity networks will be more difficult.
    -----
    image: Woodpecker, taken in May in the Nisqually National Wildlife Refuge, downstream from Mt. Rainier glaciers.  Andy Powell and I walked the refuge on his visit here for the DCMI Usage Board meeting.  I should note that Andy's admonition about identifiers... that departures from standard Web protocols in defining identifier systems will inevitably reduce their long term persistence, is always prominent in my thinking.

    Dr. Theory and Mr. Practice

    Michael Braley, one of our summer Semantic Web discussion group faithful, brought to our attention what sounds like it must have been a very interesting Q&A session at a recent keynote by Tim Berners-Lee, and a question by Peter Norvig of Google (reported by News.com).  Quite a moment, to have these larger-than-life icons of the New Millennium postulating on our future. 

    Piecing the discussion together from various news reports and blog posts, several interesting issues emerge.  Not for the first time, any of them... their appearance at News.com has mostly to do with the respective authority of speakers.  The high level points:

    TBL:

     

    • Persistent Identifiers are critical
    • Developers should use semantic languages and RDF
    • Tagging will improve semantic efficacy
    • The Web is the database, and RDF defines its syntax

    Norvig:

    • Users can be incompetent or venal:
      • They often can’t configure software or systems properly, let alone tag things usefully.
      • They often intentionally mislead so as to try to sell us Viagra
    • Market leaders have a disincentive to standardize – they benefit from differentiated services and products, not being the same. (talking, I infer, about a general competitive attitude, not necessarily about Google policies).

    And after-comments by Hitchhiker’s Guide to 650, on the topic of the web migrating towards private ownership and control via the ‘normalization layer’:

    The Google search engine is probably in the best position out of all
    the EXISTING technologies out there to be that layer, it kind of is
    for humans. The unfortunate ramification (that TBL would be sad to
    admit) is that SOMEONE will own the semantic web, not as a standards
    body or content owner, but as the normalization/extraction layer. If
    Google is able to garner monopolitic growth without a naturally
    monopolitic product or business model (search engine), it is not
    without a huge leap of faith that in the future a player (maybe
    google) will be able to exact a toll for being the de facto router/
    translator of data on the web. . . this is a scary thought . . . the
    end of the open web? (and these guys are freaked out about net
    neutrality?)

    Aside: I’m happy to acknowledge that I don’t know who all the cool people are, but I sort of wish they wouldn’t be quite so secretive so we might be able to find out without becoming investigative. Looking at the 650 blog, I could find that the author was apparently important at eBay, has since moved to Green Dot (whatever that is) in LA.  Not much more.

    More on these topics to follow....

    -----

    Image: The Suzzallo Library is the signature structure of the University of Washington.  This stunning reading room has few peers anywhere in the cathedral-of-reason department. (by the author, April 2006)

    April 12, 2006

    Ockham’s Bathroom Scale, Lego™ blocks, and Microformats

    Some years ago I attended a W3C workshop on something or other, and my major contribution to the workshop was a quote to the effect that:

    In today’s Internet, no protocol is implementable whose print rendition registers on a bathroom scale.

    It wasn’t exactly a significant contribution, or even true, but as a soundbite, it had some legs. I thought of that notion reading Ed Summers’s comment on my interview in the Netsquared series. So, this post is my reply (Thanks, Ed, for leaving that soapbox where I could trip over it!).

    The first thing I had to do was find out what microformat means. I’ve seen the term, and had a sense of it, but wasn’t sure what it meant precisely.  I can’t imagine I’m the only one in the sandbox to whom this happens daily, but being on sabbatical somehow makes my confession more admissible.

    Define:microformat in Google didn’t help (which made me feel a lot better!). Yahoo led me to the Wikipedia entry, leading to my relief in recognizing it as an implementation strategy for Ockham’s razor – do the least that works.  Gotcha.

    So, Ed observes:

    A few of us have been working on a citation microformat for a bit now, and it appears to be converging on something like DC and a reuse of other microformat modules where appropriate.

    There is a lot of subtlety packed into this statement, though at a high level, it makes all the sense in the world.   The partisan in me squawked… "something 'like' DC"? Why not a true subset of DC?  And I hope you've looked at the Citation Working Group... but I digress.

    The notion of modular, extensible building blocks of structured data is foundational to the Dublin Core, expressed in the Warwick Framework, a conceptual architecture that was a major outcome of the second Dublin Core workshop in 1996. It gave rise to my favorite metadata metaphor – the Lego™ metaphor – which I’ve used in scores of talks on metadata over the last decade.

    I’m not sure I’d go so far as to credit the idea of microformats to that early digital library work, but we certainly were on the same path. RSS, vCard, events, and a variety of related small-chunk data structures all fit very neatly into this idiom.

    The trend in modularizing library services that Lorcan has called ‘unplug & play’ is very much within this spirit as well. It is the way things are moving, and it seems the right way… perhaps the only sensible way in today’s environment.

    Are there clouds wrapped around these modular silver linings? Certainly some darkling questions.

    Monolithic services have limited flexibility (the spittoon joke comes to mind). All or nothing. Bring your data dump truck.

    Disaggregated services (I’m liking the term microformat more and more!) afford greater flexibility, easily configurable into new services unthought-of last week (the remix, or recombinant idiom). Hurrah!

    But there is no free lunch, and with greater numbers of smaller services, there are more blocks to manage, and the dark side of the Lego metaphor emerges:

    'Twas the morning of Christmas
    and all through the house
    not a creature was stirring
    for fear of walking on all the darned Lego™ blocks with their sharp little corners strewn in chaos over the floor

    In order for all our modules to interoperate we need to catalog them, understand their functionality, their interfaces, their maintenance and change history, and make them discoverable. They need to be (re)combined in coherent ways, which means they need to be designed according to a coherent architecture. Is this happening?

    I doubt it. One of the appeals of microformats is that they are quick and easy, simple solutions to simple problems. As stand-alones, they are economical and powerful and appealing. Irresistible, really. The Labnotes blog has a funny and insightful take on this (though speaking of citations… I dare you to find the creator for this site).  The flexibility that microformats afford is an essential feature of the hyper-innovation that characterizes Web 2.0. But will they magically fit together?

    Lego blocks may be child’s play, but they are engineered to tolerances that approach those of the internal combustion engine, and designed within a tightly-regulated architecture that spans half a century.

    The complexity of the world remains (increases, really), and while we may find more efficacy or efficiency in one strategy or another, there is no magic bullet for coping with that complexity. Deal with it in the architecture of the systems, the structure of the data, or the complexity of the applications, but deal with it we must.

    So… do I like microformats? Heck yes. I think I helped popularize the notion in a grey-bearded, Web-1.0-kinda-way. But we’d best pick up after ourselves, ‘cuz walking on all those blocks is going to be painful the morning after.

    -----

    Image: Skerries Harbor, Ireland.  June, 2004

    March 06, 2006

    Speaking of Hybrid Technology...

    My friend Jean Armour Polly used to sport a sig file that went something like:

    "Don’t underestimate the bandwidth of a minivan full of CDs on an interstate highway"

    Text ruled the net, and transferring images required strategic use of bandwidth if you even had a monitor that supported them. Quaint in today’s first-world telecom environment. But with nearly half the world having never made a phone call, and something like a third unconnected to the power grid, the characteristics of technology diffusion are of course different elsewhere.

    A colleague sent me the attached picture. Guy on a motorbike, right? Look carefully at his rider… a briefcase-size box with an antenna.  The fellow rides from village to village, his wireless antenna collecting and delivering store-and-forward email and Internet searches, then 'delivering them' to the central wireless hub, and round and round it goes. 

    This clever approach to low-cost networking comes from First Mile Solutions of Cambridge, Massachusetts.  They are in the technology transfer business -- deploying low-cost, hybrid technology solutions developed at MIT and intended to help bring the "next 2 billion people" on board the Internet. The website is a rich store of project descriptions, white papers, and planning tools to rough out a project.

     

    January 31, 2006

    Persistence or Permanence?

    My last two posts were motivated by Sean McGrath’s piece on URLs and the social contracts they imply. So, too, this one.

    Sean argues for the network-value of a naming convention for URLs, namely, the inclusion in URLs-of-permanent-intent of the string ‘purl’. When I first read his post, my egocentrism led me to think he was alluding to the PURL system, launched by OCLC a dozen years ago in response to our frustrations with the ground-hog-day character of the URN meetings in the IETF. We launched PURLs with an expectation that they would be widely adopted and deployed by all right-thinking Web managers (we had a LOT of silly ideas like that…). PURLs have never been as widely deployed as we had hoped, but they are still alive and growing, and remain both useful and an instructive data point in the evolution of the Internet naming architecture.

    One reason I was so ready to conclude that Sean was talking about PURLs is his argument:

    I am thinking of nothing more complicated than a social naming convention. What if permanent URLs contained the fragment '/purl/' for example? Would that not do the trick? As a consumer, I look at example.com/purl/info12.html and can immediately infer that it is a good candidate for bookmarking.

    From a URL consumer's perspective, this would be very handy I think. From a URL producer's perspective, it would also be very handy. In effect, it would allow URL producers to send out signals to the world. One signal would be: 'this URL is a good bookmark candidate. We won't be changing it and even if we change our systems internally, we will make every effort to ensure that this link will continue to work.'. The second signal would be 'This URL is not a good bookmark candidate. Bookmark it at your own risk.' Simply leaving '/purl/' out of a URL would send the latter signal.

    No new technology added! An approach predicated on the URL-equivalent of a smiley, a token that says that someone is looking after this identifier. It is an important component of the added-value that we envisioned for PURLs, as it happens.  People will see http://purl.org/.... and say... hey, a URL for the long haul. OK… we’re not exactly talking a firestorm of adoption. But the point was true then and true today.
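
    Just how little machinery the convention needs is easy to show. Here is a minimal sketch, assuming the token has to appear as a whole path segment (Sean's '/purl/') or, my own extension for the PURL case, as a label in the host name. The function name signals_persistence is hypothetical.

```python
from urllib.parse import urlsplit


def signals_persistence(url, marker="purl"):
    """Return True if the URL carries the marker as a whole path segment or host label."""
    # urlsplit only recognizes the host when a scheme separator is present,
    # so treat bare strings such as example.com/purl/x as network-path references.
    parts = urlsplit(url if "://" in url else "//" + url)
    path_segments = [segment.lower() for segment in parts.path.split("/") if segment]
    host_labels = (parts.hostname or "").split(".")
    # Checking the host is an assumption beyond Sean's proposal.
    return marker in path_segments or marker in host_labels
```

    With this, example.com/purl/info12.html qualifies as a bookmark candidate, while http://example.com/purloined/letter.html does not, because the token must stand alone as a segment. Of course, the check only reads the signal; whether anyone honors the promise behind it is a matter of social contract, not code.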

    Back to permanence and persistence.  As I harbored the delusional notion that Sean was singing about our PURLy fates, I thought… “oh… he got the P part wrong (invoking the word permanent rather than persistent).” It would appear, in fact, that he had arrived independently at a related acronym.

    The distinction is small, but important. What can permanence mean in a technological world where only one of twenty students in a master’s-level information science class recognized the phrase NCSA Mosaic?  (Well, it means that change is unrelenting and they don’t listen to big band music much either). And follow this link (top of the Google search set), if you think I'm blowing smoke:

    NCSA Mosaic Home Page
    Creation and history of the browser. All versions available for download.
    archive.ncsa.uiuc.edu/SDG/Software/Mosaic/

    I feel unlucky.

    Even in the hallowed halls of LibraryLand, we are (justly) reluctant to talk about preservation in terms longer than the odd millennium. Much of the discussion surrounding business models for digital preservation has to do with service contracts and the cost of assuring the integrity of a given bitstream for a given interval. It is NOT permanence we’re talking about. It is persistence. And the definition refers to a business process more than it does to a point in the past or future.

    If you have an identifier used for tracking the progress of a laptop from its point of manufacture in Shanghai to your doorstep, the business process that is identified (a shipment of a consumer good) is concluded in a period measured in hours or days (mine took 40 hours). Soon after, the identifier is so much digital chaff, duty done.   But certainly persistent in the context of its intended use.

    If we’re talking about cultural heritage assets, we have expectations measured in centuries. In either case, the success of the identifier is tied to the life cycle of the asset or process, not to a calendar. Thinking of our identifiers in these terms helps avoid staring towards a vanishing point. Sometimes.

    Identity and location on the Web

    URLs versus URIs

    In Sean McGrath’s piece on URLs and social commitment, he alludes to the common confusion between the acronyms URL and URI, correctly pointing out that only Web-head protocol-wonks are liable to be caught using the URI terminology (and most of them don't usually either).  Everyone else on the planet uses the largely-interchangeable and better-understood moniker of URL.

    Is this a distinction without a difference, as usage would lead us to believe? Unhappily, in today’s Web, it rarely matters a whit. Sean tells us why:

    The great thing about URLs is that you can click on them.

    That is a great thing, and it informs our expectation of URLs to the exclusion of all other possibilities. The social contract implied in http:// is, then, that they are actionable: you can click on them and bring the referent of the link into your machine and read it or listen to it or watch it. The link serves as a pointer to a location, and clicking on it invokes a behavior specified by the http protocol: voila!

    So, what's to be unhappy about?  Overloaded onto this simple actionable relationship is the additional important function of identity. The URL serves as both a key for a retrieval transaction, and an identifier. It is no accident that CNRI, in a stroke of marketing genius, chose the term Handle for their identifier protocol, for that is exactly the right term for such identifiers. We want a handle for otherwise-slippery electronic content so we can hold on to it, pass it back and forth, refer to it, and hang it over our desk to grab like a frying pan from a pot-rack.

    Mostly this overloading of identity and location/retrieval is fine. And to the extent that it is, the conflation of URL and URI is not a problem. So what is missing?

    Three things:

    1. Persistent reference pointers – There are many classes of electronic resources that we know we will want to refer to in a location-independent way for as long as we can imagine. Books, journal articles, or any component of a persistent resource of cultural, social, or economic importance. Yes, we want these to be actionable (clickable) as far as possible, but sustainable access requires that we distinguish between identity and resolution in the life cycle of any information resource of more than passing importance.  Conflating location and identity makes this harder (though not impossible).

    2. Appropriate copy resolution – In a world without access barriers, any copy is the appropriate copy. In a world of tradable intellectual property, individuals and organizations have differential access to resources. The Web should be neutral about business models, but it cannot be indifferent to them. Owners of IP must have the means to manage and meter access, and this generally implies a decoupling of identity and resolution.

    3. Conceptual resources – Our expectation that clicking 'gets' us something is not fully met if the resource is a conceptual asset.  The development of Semantic Web technology demands the application of an identity architecture to concepts as well as documents, multimedia, and pizza-ordering forms. Proponents of the just-let-HTTP-do-it approach rightly point out that HTTP URLs are entirely capable of being used for identifying conceptual resources (to use the RDF parlance). This is undeniably true as far as the technology goes. The more interesting question is what happens when you click on such a link? What SHOULD happen? The answer is context dependent. Some people, myself included, are uncomfortable with the use of standard HTTP URLs for this purpose, because it breaks the widespread-if-informal social contract of URLs.  You may wish to define them, or locate them within a larger conceptual structure, or access various of their attributes, but in general you're not trying to retrieve them.

    Each of these examples begs for decoupling of identity and resolution in some contexts, but requires an additional layer of mapping of location or function that, getting back to Sean’s consumer contract:

    all come at very significant extra cost in terms of complexity. On this issue, the world has voted very loudly with its mouse clicking fingers. The world values hyperlinking simplicity over complexity by many orders of magnitude.
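
    To make that extra layer concrete, here is a minimal sketch of what decoupling identity from resolution amounts to: a lookup table that sits between the identifier and whatever location currently serves it. The Resolver class, the "campus" context, and the example identifier and hosts are all hypothetical, and real systems such as Handles, DOIs, or PURLs differ in many details.

```python
class Resolver:
    """Maps a persistent identifier to its current location, per access context."""

    def __init__(self):
        self._locations = {}  # identifier -> {context: location}

    def register(self, identifier, location, context="default"):
        # Re-registering replaces the location; the identifier itself never changes.
        self._locations.setdefault(identifier, {})[context] = location

    def resolve(self, identifier, context="default"):
        copies = self._locations.get(identifier)
        if not copies:
            raise LookupError(f"unknown identifier: {identifier}")
        # Appropriate-copy resolution: prefer the caller's context, else the default copy.
        if context in copies:
            return copies[context]
        if "default" in copies:
            return copies["default"]
        raise LookupError(f"no copy of {identifier} available for {context}")


resolver = Resolver()
report = "info:example/report-42"  # hypothetical identifier
resolver.register(report, "http://old-host.example/reports/42.html")
resolver.register(report, "http://campus-mirror.example/r/42", context="campus")

# The publisher moves house; only the table entry changes.
resolver.register(report, "http://new-host.example/archive/42")
```

    Citations that carry the identifier survive the move untouched, and a campus user is routed to the copy they are entitled to. The price is exactly the one named above: someone has to run the table, keep it current, and pay for it.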

    The resolution of these conundrums will require daunting co-evolution of technology, business processes, and cross-community practices.  Technologies such as Handles, DOIs, OpenURLs, PURLs, and "INFO" URIs all represent approaches to addressing aspects of these problems.  At this time they are niche technologies.  Their impact on the constellation of problems we know as identity-versus-resolution will depend far more on business processes than on TECHNOLOGY (or the ideologies of their proponents or detractors).  Meanwhile, the URL rules.