My earliest exposure to the ideological battles that swirl
around identifiers was in the Uniform Resource Name (URN) ‘discussions’ that
took place within the Internet Engineering Task Force (IETF) back in the early
nineties. Those meetings foundered again
and again on the same issues, few of which had anything to do with technology
or engineering. The IETF was not a good
place to pursue solutions to these problems, but there wasn’t anyplace else to
carry the work forward. The impact of
these deliberations has been desultory at best.
One of the ideologies that plagued us then, and plagues us still, has to do
with whether an identifier is opaque or has embedded semantics (I’ll cautiously
refer to this latter group as semantic identifiers, recognizing the potential
ambiguity of this terminology).
An opaque identifier is simply one without any trace of
embedded meaning: pure as the driven you-know-what-it-is. Such identifiers are homely, unfriendly, and as rare
within our IT infrastructure as is purity elsewhere in human affairs. But they are also as impervious to bit rot
and semantic deterioration as any identifier can be, and hence embody neutral
virtue.
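For illustration, here is a minimal sketch of minting an opaque identifier in Python. It assumes that a random UUID is an acceptable stand-in for "no embedded meaning"; the function name mint_opaque_identifier is hypothetical and is not drawn from any particular identifier system.

```python
import uuid


def mint_opaque_identifier() -> str:
    """Return a random 32-character hex string (hypothetical helper)."""
    # uuid4() is built from random bits, so the result says nothing about
    # the asset, its owner, or when the identifier was minted.
    return uuid.uuid4().hex


if __name__ == "__main__":
    print(mint_opaque_identifier())
```

Even here purity is relative: a version 4 UUID reserves a few bits to record its own version, a small artifact of the system of assignment rather than of the thing identified.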
Sean McGrath posted a note just this past week on this
issue. Sean is firmly in the camp that
believes identifiers with embedded semantics are bad for us:
Ultimately, all semantic identifiers are incorrect
There are both philosophic and practical reasons to hold
this position. It is easy to identify
failure modes for semantic identifiers, and if you are curating assets for
which identity persistence is paramount, these failures will find you as surely
as water flows downhill.
Why, then, does anyone create anything other than opaque
identifiers? Because identifiers serve
multiple masters. To assign and manage
identifiers is to balance countervailing requirements and to make compromises
to satisfy those masters.
It is helpful to examine the nature of the semantics we find
in identifiers with an eye towards (a) understanding the pitfalls and relative
risks, and (b) facing up to how pure we are (or want to be) in resisting the
siren call of reader-friendly names.
Location and Resolution
It should be obvious why we want actionable
identifiers. Click and go. But remembering an IP address is a lot harder
than remembering a domain name, so the Domain Name System (DNS) was developed to
map semantic names onto unmemorable numeric identifiers (you might be a geek if you actually know your IP address!). Is that good or bad? Depends.
Branding
The URL is a powerful branding instrument. Organizations want their brands in
their asset identifiers, and will generally fight hard to maintain that
territory.
Apple.com, OCLC.org, and LoC.gov are shingles on the Internet, and they become embedded in identifiers as a matter of business practice.
Product brand names are similarly important:
http://www.apple.com/ipod/ipod.html
Lots of 'valuable' semantics in that URL.
Movies all have URL names. Which of these do you suppose Warner Brothers wants used widely? Which is more useful for a consumer?
http://wippub.warnerbros.com/movie/goodnight/goodnight.html
http://www.imdb.com/title/tt0433383/
Transaction Identifiers
Transaction information often ends up as identifiers, though
few identifiers are less persistent. Consider the following identifiers for the same resource in the same
system:
http://www.amazon.ca/exec/obidos/external-search/701-9964227-5871503?tag=bookstore0e86-20&keyword=Suitable+Boy+A&mode=books
http://www.amazon.ca/exec/obidos/ASIN/0060786523/qid=1140048418/sr=2-1/ref=sr_2_3_1/701-9964227-5871503
OK… strictly speaking these URLs do not identify the
resource; they identify a surrogate that will lead you to a purchase option. Much more a locator than an identifier, but
we pass around such links because they are useful to us.
What about this one?
http://www.worldcatlibraries.org/wcpa/isbn/1857990889
Inspection of these resources reveals quite a lot of
semantics, both about the site and the resource. There is a lot of recognizable instance metadata
embedded in the URL. There are other approaches
to this idea that have some traction in the community where the appropriate-copy
problem is manifest: the OpenURL is intended to address this problem. More on this another time.
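To make the point about embedded instance metadata concrete, here is a rough sketch that pulls the visible semantics out of the URLs quoted above using Python's standard urllib.parse module. The function name embedded_semantics and the shape of the dictionary it returns are my own assumptions for illustration; they are not part of any of these sites' interfaces.

```python
from urllib.parse import parse_qs, urlsplit


def embedded_semantics(url: str) -> dict:
    """Split a URL into the parts a human reader would recognize."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,  # the brand
        "path_segments": [s for s in parts.path.split("/") if s],
        "query": parse_qs(parts.query),  # search or transaction state
    }


if __name__ == "__main__":
    urls = [
        "http://www.worldcatlibraries.org/wcpa/isbn/1857990889",
        "http://www.amazon.ca/exec/obidos/external-search/701-9964227-5871503"
        "?tag=bookstore0e86-20&keyword=Suitable+Boy+A&mode=books",
    ]
    for url in urls:
        print(embedded_semantics(url))
```

The WorldCat URL gives up an identifier scheme and an ISBN from its path alone, while the Amazon search URL carries a keyword and what looks like session state. The latter is exactly the kind of content that does not persist.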
Conceptual identity
There is another element of location that is becoming more
common: location in a conceptual hierarchy. If the Semantic Web is ever to fulfill its promise, it will require that
we figure out how to assign conceptual identifiers in a manner that is coherent
across vocabularies and languages. And
like so many of these problems, the technical solution is less important than widespread
adoption of consistent conventions.
Structural elements of identifiers
There are several kinds of semantics found in some
identifiers that have nothing to do with the referent of the identifier, but rather
are artifacts of the system of assignment:
Sequence is a common attribute in identifiers, and is in
fact a kind of semantics. OCLC numbers are assigned to records entered into WorldCat on
a sequential basis. Simple, but it does
convey a certain degree of semantics. In some systems even this clue about sequence
can be important and even problematic. Can I assume something about the next item in a sequence? Is it a security problem? It can be.
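A small sketch shows why sequence can be a security concern: given one sequentially assigned identifier, anyone can guess its neighbours. The function name neighbouring_ids and the record number 1000 are made up for illustration; nothing here describes how WorldCat or any other system actually behaves.

```python
def neighbouring_ids(known_id: int, span: int = 2) -> list:
    """List the identifiers within `span` of one known sequential identifier."""
    return [
        known_id + offset
        for offset in range(-span, span + 1)
        if offset != 0 and known_id + offset > 0
    ]


if __name__ == "__main__":
    # 1000 is a made-up record number, not a real OCLC number.
    print(neighbouring_ids(1000))
```

If each of those guesses resolves to a record that its owner expected to stay private, the sequence has leaked something.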
Other systems embed time stamps. LC Card Numbers (LCCN) include low-resolution
time stamps: the first two digits (separated from the remainder of the number
by a dash) represent the year that the card was created, and so provide a hint
about the date of publication. I don’t
know an example offhand, but undoubtedly someone is using the Unix
seconds-since-January-1-1970 as the basis for identifiers (a rather higher
resolution convention than is necessary for most assets that we care about).
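As a sketch of what a two-digit year prefix does and does not tell you, the following Python lists every year in a given range that matches the prefix. It assumes the simple "YY-serial" layout described above; the function name year_candidates, the sample numbers, and the year range are all hypothetical, and real LCCN rules may differ in detail.

```python
def year_candidates(identifier: str, earliest: int, latest: int) -> list:
    """Return every year in [earliest, latest] matching a two-digit prefix.

    Assumes a hypothetical 'YY-serial' layout.
    """
    prefix, _, _serial = identifier.partition("-")
    if len(prefix) != 2 or not prefix.isdecimal():
        raise ValueError("expected a two-digit year prefix before the dash")
    two_digit_year = int(prefix)
    return [y for y in range(earliest, latest + 1) if y % 100 == two_digit_year]


if __name__ == "__main__":
    # Both sample numbers are invented for illustration.
    print(year_candidates("75-123456", 1898, 2038))
    print(year_candidates("05-12345", 1898, 2038))
```

Once the range of assignment spans more than a hundred years, some prefixes match two years, and the hint stops being a hint.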
ISBNs have country and publisher palimpsests embedded in
them, and many identifiers include checksums to afford some measure of error
detection.
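The ISBN-10 check digit is a handy example of a checksum at work. The sketch below applies the standard ISBN-10 rule (weight the ten characters from 10 down to 1, and require the sum to be divisible by 11) to the two ISBNs that appear in the URLs above; the function name is_valid_isbn10 is my own.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check an ISBN-10 using its mod-11 check digit."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for ten, and only as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


if __name__ == "__main__":
    print(is_valid_isbn10("1857990889"))  # from the WorldCat URL
    print(is_valid_isbn10("0060786523"))  # from the Amazon URL
    print(is_valid_isbn10("1857990898"))  # last two digits transposed
```

Both ISBNs from the URLs pass, and the transposed variant fails. The check digit carries no meaning about the book at all, yet it is what catches the typo.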
Now, should we feel as strongly about this sort of semantics
in our identifiers? Well, as LCCNs include
assignments from more than a century, a two-digit year no longer pins down a single year, and somebody is going to have to fix
them. When 2038 rolls around, Unix
time stamps stored as signed 32-bit integers will overflow. Are checksums a
problem?
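The 2038 problem can be worked out in a few lines. This sketch assumes the time stamp is stored as a signed 32-bit count of seconds since January 1, 1970 UTC, which is the case that breaks; the helper names from_unix_seconds and wrap_int32 are mine, and wrap_int32 only simulates the wraparound rather than reproducing any particular system's behaviour.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def from_unix_seconds(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def wrap_int32(value: int) -> int:
    """Simulate signed 32-bit (two's complement) wraparound."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    # The last moment a signed 32-bit counter can represent.
    print(from_unix_seconds(INT32_MAX))
    # One second later the counter wraps to a large negative number.
    print(from_unix_seconds(wrap_int32(INT32_MAX + 1)))
```

The first line printed falls on January 19, 2038; the second lands back in December 1901. An identifier built on such a counter would start colliding with, or sorting before, everything minted earlier.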
What is clear is that it is unlikely that very many
identifier systems will be as ideologically pure as, say, the ARK identifier system
is, and that we’ll have to cope with these meaningful warts in a world that is
ever-more dependent on digital identifiers. Lovers of irony will note that this excellent piece on the virtue of opaque identifiers by John Kunze is most easily (only?) found via its semantic identifier.