    February 21, 2006

    Identifier Ideology: Opacity and Semantics

    My earliest exposure to the ideological battles that swirl around identifiers was in the Uniform Resource Name (URN) ‘discussions’ that took place within the Internet Engineering Task Force (IETF) back in the early nineties. Those meetings foundered again and again on the same issues, few of which had anything to do with technology or engineering. The IETF was not a good place to pursue solutions to these problems, but there wasn’t anyplace else to carry the work forward. The impact of these deliberations has been desultory at best.

    One of the ideologies that plagued us then and now has to do with whether an identifier is opaque or has embedded semantics (I’ll cautiously refer to this latter group as semantic identifiers, recognizing the potential ambiguity of this terminology).

    An opaque identifier is simply one without any trace of embedded meaning - pure as the driven know-what-it-is. They are homely, unfriendly, and as rare within our IT infrastructure as is purity elsewhere in human affairs. But they are also as impervious to bit rot and semantic deterioration as any identifier can be, and hence embody neutral virtue.
    Sean McGrath posted a note just this past week on this issue. Sean is firmly in the camp that believes identifiers with embedded semantics are bad for us:

    Ultimately, all semantic identifiers are incorrect

    There are both philosophic and practical reasons to hold this position. It is easy to identify failure modes for semantic identifiers, and if you are curating assets for which identity persistence is paramount, these failures will find you as surely as water flows downhill.
    Why, then, does anyone create anything other than opaque identifiers? Because identifiers serve multiple masters. To assign and manage identifiers is to balance countervailing requirements and to make compromises to satisfy those masters.

    It is helpful to examine the nature of the semantics we find in identifiers with an eye towards (a) understanding the pitfalls and relative risks, and (b) facing up to how pure we are (or want to be) in resisting the siren call of reader-friendly names.

    Location and Resolution

    It should be obvious why we want actionable identifiers. Click and go. But remembering an IP address is a lot harder than remembering a domain name, so the Domain Name System (DNS) was developed to map semantic names onto unmemorable numeric identifiers (you might be a geek if you actually know your IP address!). Is that good or bad? Depends.

    Branding

    The URL is a powerful branding instrument. Organizations want their brands in their asset identifiers, and will generally fight hard to maintain that territory.
    Apple.com, OCLC.org, and LoC.gov are shingles on the Internet, and they become embedded in identifiers as a matter of business practice.
    Product brand names are similarly important:

    http://www.apple.com/ipod/ipod.html

    Lots of 'valuable' semantics in that URL. 

    Movies all have URL names. Which of these do you suppose Warner Brothers wants used widely? Which is more useful for a consumer?

    http://wippub.warnerbros.com/movie/goodnight/goodnight.html

    http://www.imdb.com/title/tt0433383/

    Transaction Identifiers

    Transaction information often ends up as identifiers, though few identifiers are less persistent. Consider the following identifiers for the same resource in the same system:

    http://www.amazon.ca/exec/obidos/external-search/701-9964227-5871503?tag=bookstore0e86-20&keyword=Suitable+Boy+A&mode=books

    http://www.amazon.ca/exec/obidos/ASIN/0060786523/qid=1140048418/sr=2-1/ref=sr_2_3_1/701-9964227-5871503

    OK… strictly speaking these URLs do not identify the resource, they identify a surrogate that will lead you to a purchase option. Much more a locator than an identifier, but we pass around such links because they are useful to us.

    What about this one?

    http://www.worldcatlibraries.org/wcpa/isbn/1857990889

    Inspection of these resources reveals quite a lot of semantics, both about the site and the resource. There is a lot of recognizable instance metadata embedded in the URL. There are other approaches to this idea that have some traction in the community where the appropriate-copy problem is manifest: the OpenURL is intended to address this problem. More on this another time.
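
    To make that inspection concrete, here is a minimal Python sketch that pulls the embedded pieces out of the three URLs above. The helper names (path_segments, value_after) are invented for this illustration, and reading the token shared by the two Amazon URLs as a session or transaction marker is a guess, not something Amazon documents here.

```python
from urllib.parse import urlsplit, parse_qs

# The three URLs quoted in the post, rejoined onto single lines.
AMAZON_SEARCH = ("http://www.amazon.ca/exec/obidos/external-search/"
                 "701-9964227-5871503?tag=bookstore0e86-20&"
                 "keyword=Suitable+Boy+A&mode=books")
AMAZON_ITEM = ("http://www.amazon.ca/exec/obidos/ASIN/0060786523/"
               "qid=1140048418/sr=2-1/ref=sr_2_3_1/701-9964227-5871503")
WORLDCAT = "http://www.worldcatlibraries.org/wcpa/isbn/1857990889"


def path_segments(url):
    """Split the path of a URL into its non-empty segments."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


def value_after(url, key):
    """Return the path segment that follows `key`, or None if absent."""
    segments = path_segments(url)
    for i, seg in enumerate(segments[:-1]):
        if seg.lower() == key.lower():
            return segments[i + 1]
    return None


# Instance metadata that is likely to outlive the transaction.
asin = value_after(AMAZON_ITEM, "ASIN")
isbn = value_after(WORLDCAT, "isbn")

# Segments common to both Amazon URLs: site plumbing plus the shared token
# (assumed here to be transaction state rather than part of the identity).
shared = set(path_segments(AMAZON_SEARCH)) & set(path_segments(AMAZON_ITEM))

# The search URL carries its semantics in the query string instead.
query = parse_qs(urlsplit(AMAZON_SEARCH).query)

print(asin, isbn, sorted(shared), query["keyword"])
```

    Only the ASIN and the ISBN look like things worth citing; everything else in those paths describes the visit rather than the book.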

    Conceptual identity

    There is another element of location that is becoming more common: location in a conceptual hierarchy. If the Semantic Web is ever to fulfill its promise, it will require that we figure out how to assign conceptual identifiers in a manner that is coherent across vocabularies and languages. And like so many of these problems, the technical solution is less important than widespread adoption of consistent conventions.

    Structural elements of identifiers

    There are several kinds of semantics found in some identifiers that have nothing to do with the referent of the identifier, but rather are artifacts of the system of assignment:

    Sequence is a common attribute in identifiers, and is in fact a kind of semantics.  OCLC numbers are assigned to records entered into WorldCat on a sequential basis. Simple, but it does convey a certain degree of semantics.  In some systems even this clue about sequence can be important and even problematic. Can I assume something about the next item in a sequence? Is it a security problem? It can be.

    Other systems embed time stamps. LC Card Numbers (LCCN) include low-resolution time stamps: the first two digits (separated from the remainder of the number by a dash) represent the year that the card was created, and so provide a hint about the date of publication. I don’t know an example offhand, but undoubtedly someone is using the Unix seconds-since-January-1-1970 count as the basis for identifiers (a rather higher-resolution convention than is necessary for most assets that we care about).
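
    A rough Python sketch of how much each kind of stamp gives away. The LCCN "78-890351" is a made-up value used only to show the dashed form described above, the default list of centuries is an assumption, and treating the qid value from the Amazon URL earlier in this post as Unix seconds is a guess on my part.

```python
from datetime import datetime, timezone


def lccn_year_prefix(lccn):
    """Return the two-digit year prefix of a dashed LCCN such as '78-890351'."""
    prefix, dash, _serial = lccn.partition("-")
    if not dash or len(prefix) != 2 or not prefix.isdigit():
        raise ValueError(f"no two-digit year prefix in {lccn!r}")
    return int(prefix)


def candidate_years(lccn, centuries=(1900, 2000)):
    """List the years a two-digit prefix could stand for.

    The prefix alone cannot say which century is meant; the default
    centuries are an assumption made for this illustration.
    """
    yy = lccn_year_prefix(lccn)
    return [century + yy for century in centuries]


def decode_unix_seconds(seconds):
    """Interpret an integer as seconds since 1970-01-01 00:00:00 UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(candidate_years("78-890351"))     # hypothetical LCCN
print(decode_unix_seconds(1140048418))  # qid value from the Amazon URL
```

    The year prefix narrows things to one year per century; a seconds count, if that is what it is, pins the moment of assignment down to the second.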

    ISBNs have country and publisher palimpsests embedded in them, and many identifiers include checksums to afford some measure of error detection.
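
    As an illustration of the checksum idea, here is a small Python sketch of the ISBN-10 check: the ten characters are weighted 10 down to 1 and the total must be divisible by 11, with a final X standing for 10. The function name is invented for this example, and the test value is the ISBN from the WorldCat URL earlier in this post.

```python
def isbn10_is_valid(isbn):
    """Check the ISBN-10 check digit (weights 10..1, total divisible by 11)."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10  # 'X' is allowed only as the final check character
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


print(isbn10_is_valid("1857990889"))  # ISBN from the WorldCat URL
print(isbn10_is_valid("1857990888"))  # same number with the last digit altered
```

    The check digit says nothing about the book itself; it only lets a system notice that a number was mistyped.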

    Now, should we feel as strongly about this sort of semantics in our identifiers? Well, as LCCNs include assignments from more than a century, somebody is going to have to fix them. When 2038 rolls around, the Unix time stamp will break. Are checksums a problem?
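
    The 2038 date can be checked with a little arithmetic. The sketch below assumes the count of seconds is held in a signed 32-bit integer; that storage width is an assumption about a particular system, not a property of the time stamp itself, and the helper names are invented for the illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def wrap_int32(n):
    """Reduce n to the signed 32-bit range, as two's-complement overflow would."""
    return (n + 2**31) % 2**32 - 2**31


def as_utc(seconds):
    """Convert seconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


last_good = as_utc(INT32_MAX)                   # final second that still fits
after_wrap = as_utc(wrap_int32(INT32_MAX + 1))  # one tick later, after wrapping

print(last_good.isoformat())
print(after_wrap.isoformat())
```

    An identifier minted from such a counter one second after the limit would appear to come from 1901, which is a nice reminder that even structural semantics can go stale.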

    What is clear is that it is unlikely that very many identifier systems will be as ideologically pure as, say, the ARK identifier system is, and that we’ll have to cope with these meaningful warts in a world that is ever more dependent on digital identifiers. Lovers of irony will note that this excellent piece on the virtue of opaque identifiers by John Kunze is most easily (only?) found via its semantic identifier.

    Comments

    A bit short, but not bad at all.

    Actually the Unix timestamp will fail only if it's stored in a 32-bit integer, so there's some hope that we'll avoid complete disaster in 2038.
    --Th
