My earliest exposure to the ideological battles that swirl
around identifiers was in the Uniform Resource Name (URN) ‘discussions’ that
took place within the Internet Engineering Task Force (IETF) back in the early
nineties. Those meetings foundered again
and again on the same issues, few of which had anything to do with technology
or engineering. The IETF was not a good
place to pursue solutions to these problems, but there wasn’t anyplace else to
carry the work forward. The impact of
these deliberations has been desultory at best.
One of the ideologies that plagued us then and now has to do with whether an identifier is opaque or has embedded semantics (I’ll cautiously refer to this latter group as semantic identifiers, recognizing the potential ambiguity of this terminology).
An opaque identifier is simply one without any trace of
embedded meaning - pure as the driven know-what-it-is. They are homely, unfriendly, and as rare
within our IT infrastructure as is purity elsewhere in human affairs. But they are also as impervious to bit rot
and semantic deterioration as any identifier can be, and hence embody neutral
virtue.
Sean McGrath posted a note just this past week on this
issue. Sean is firmly in the camp that
believes identifiers with embedded semantics are bad for us:
There are both philosophic and practical reasons to hold
this position. It is easy to identify
failure modes for semantic identifiers, and if you are curating assets for
which identity persistence is paramount, these failures will find you as surely
as water flows downhill.
Why, then, does anyone create anything other than opaque
identifiers? Because identifiers serve
multiple masters. To assign and manage
identifiers is to balance countervailing requirements and to make compromises
to satisfy those masters.
It is helpful to examine the nature of the semantics we find in identifiers with an eye towards (a) understanding the pitfalls and relative risks, and (b) facing up to how pure we are (or want to be) in resisting the siren call of reader-friendly names.
Location and Resolution
It should be obvious why we want actionable identifiers. Click and go. But remembering an IP address is a lot harder than remembering a domain name, so the Domain Name System (DNS) was developed to map semantic names onto unmemorable numeric identifiers (you might be a geek if you actually know your IP address!). Is that good or bad? Depends.
Branding
The URL is a powerful branding instrument. Organizations want their brands in
their asset identifiers, and will generally fight hard to maintain that
territory.
Apple.com, OCLC.org, and LoC.gov are shingles on the Internet, and they become embedded in identifiers as a matter of business practice.
Product brand names are similarly important:
Lots of 'valuable' semantics in that URL.
Movies all have URL names. Which of these do you suppose Warner Brothers wants used widely? Which is more useful for a consumer?
http://wippub.warnerbros.com/movie/goodnight/goodnight.html
http://www.imdb.com/title/tt0433383/
Transaction Identifiers
Transaction information often ends up as identifiers, though few identifiers are less persistent. Consider the following identifiers for the same resource in the same system:
OK… strictly speaking these URLs do not identify the resource, they identify a surrogate that will lead you to a purchase option. Much more a locator than an identifier, but we pass around such links because they are useful to us.
What about this one?
Inspection of these resources reveals quite a lot of semantics, both about the site and the resource. There is a lot of recognizable instance metadata embedded in the URL. There are other approaches to this idea that have some traction in the community where the appropriate copy problem is manifest: the OpenURL is intended to address this problem. More on this another time.
Conceptual identity
There is another element of location that is becoming more common: location in a conceptual hierarchy. If the Semantic Web is ever to fulfill its promise, it will require that we figure out how to assign conceptual identifiers in a manner that is coherent across vocabularies and languages. And like so many of these problems, the technical solution is less important than widespread adoption of consistent conventions.
Structural elements of identifiers
There are several kinds of semantics found in some identifiers that have nothing to do with the referent of the identifier, but rather are artifacts of the system of assignment:
Sequence is a common attribute in identifiers, and is in fact a kind of semantics. OCLC numbers are assigned to records entered into WorldCat on a sequential basis. Simple, but it does convey a certain degree of semantics. In some systems even this clue about sequence can be important and even problematic. Can I assume something about the next item in a sequence? Is it a security problem? It can be.
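To make the sequence concern concrete, here is a minimal Python sketch. It is not a description of how OCLC or any real system assigns numbers; the function names and the starting value are invented for illustration. It contrasts a counter-based scheme, where one known identifier lets an observer guess its neighbors, with a random identifier that carries no ordering information.

```python
import itertools
import uuid


def sequential_ids(start=1):
    """Yield identifiers in assignment order (hypothetical counter-based scheme)."""
    return itertools.count(start)


def guess_neighbors(known_id):
    """Given one sequential identifier, return the adjacent ones an observer could try."""
    return known_id - 1, known_id + 1


def opaque_id():
    """Return a random 128-bit identifier as 32 hex characters, with no ordering clue."""
    return uuid.uuid4().hex
```

Whether guessable neighbors matter depends on what the identifier unlocks: for public catalog records it is harmless, for anything access-controlled it may not be.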
Other systems embed time stamps. LC Card Numbers (LCCN) include low resolution time stamps: the first two digits (separated from the remainder of the number by a dash) represent the year that the card was created, and so provide a hint about the date of publication. I don’t know an example offhand, but undoubtedly someone is using the Unix seconds-since-January-1-1970 as the basis for identifiers (a rather higher resolution convention than is necessary for most assets that we care about).
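As a rough illustration of how little a two-digit year prefix pins down, the following Python sketch pulls the year hint out of an identifier shaped like the LCCNs described above. The function name, the default list of centuries, and the sample value are all assumptions made for the example; real LCCN normalization has more rules than this.

```python
def lccn_year_candidates(lccn, centuries=(1900, 2000)):
    """Return the calendar years a two-digit LCCN-style year prefix could mean.

    Assumes the simplified form 'YY-serial'; the centuries tuple is an
    illustrative assumption, not part of any LCCN specification.
    """
    prefix, dash, serial = lccn.partition("-")
    if dash != "-" or len(prefix) != 2 or not prefix.isdigit() or not serial:
        raise ValueError(f"not a two-digit-year identifier: {lccn!r}")
    return [century + int(prefix) for century in centuries]
```

For a made-up value such as "52-12345" this yields two candidate years, one per century, which is exactly the ambiguity a century-spanning scheme has to resolve somewhere outside the identifier.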
ISBNs have country and publisher palimpsests embedded in them, and many identifiers include checksums to afford some measure of error detection.
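For readers who have not looked inside a checksum, here is a small Python sketch of the ISBN-13 check digit: the first twelve digits are weighted alternately by 1 and 3, and the final digit brings the total to a multiple of ten. The function names are mine, and the sketch covers only the 13-digit form, not the older 10-digit one.

```python
def isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit from the first twelve digits."""
    digits = [int(ch) for ch in first12]
    if len(digits) != 12:
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn):
    """Return True if the string is a 13-digit ISBN whose last digit matches."""
    compact = isbn.replace("-", "").replace(" ", "")
    if len(compact) != 13 or not (compact.isascii() and compact.isdigit()):
        return False
    return isbn13_check_digit(compact[:12]) == int(compact[12])
```

With this sketch, is_valid_isbn13("978-0-306-40615-7") is True, and changing the final digit to 8 makes it False. The check digit says nothing about the book itself; it only describes the other digits.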
Now, should we feel as strongly about this sort of semantics in our identifiers? Well, since LCCN assignments span more than a century, a two-digit year is ambiguous, and somebody is going to have to fix them. When 2038 rolls around, Unix time stamps stored as signed 32-bit integers will overflow. Are checksums a problem?
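The 2038 limit is easy to work out. The Python sketch below assumes the time stamp is stored as a signed 32-bit count of seconds since January 1, 1970 UTC, which is the case that breaks; the helper names are invented for the example, and systems using 64-bit counters are not affected in the same way.

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_int32_moment():
    """Return the latest instant a signed 32-bit seconds counter can represent."""
    return EPOCH + timedelta(seconds=INT32_MAX)


def wrap_int32(seconds):
    """Simulate storing a second count in a signed 32-bit field (two's complement wrap)."""
    return (seconds + 2**31) % 2**32 - 2**31


def as_datetime(seconds):
    """Interpret a (possibly negative) second count relative to the epoch."""
    return EPOCH + timedelta(seconds=seconds)
```

One second past last_int32_moment(), the wrapped counter goes negative and as_datetime() lands back in 1901, so any identifier that embeds such a value would appear to jump more than a century into the past.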
What is clear is that it is unlikely that very many identifier systems will be as ideologically pure as, say, the ARK identifier system is, and that we’ll have to cope with these meaningful warts in a world that is ever more dependent on digital identifiers. Lovers of irony will note that this excellent piece on the virtue of opaque identifiers by John Kunze is most easily (only?) found via its semantic identifier.