Metadata 2.0: On the Rails or...?
Erik Duval, of
Katholieke Universiteit Leuven in Belgium, is a longstanding metadata
colleague I met in the early days of Web metadata. We've worked in
the service of related activities for some years, and our paths have
intersected in productive ways on a variety of occasions in a dozen
years (notably, here). So when he asked me to join a Metadata 2.0 workshop in Leuven by teleconference,
I was pleased to accept, even though it fell on the day of my two
presentations at the VALA2008 conference in Melbourne.
So, at 21:00 Melbourne time I called into the workshop (11:00 AM Leuven time) and visited my modeling ideology upon the hapless participants. Telepresence is hard to do effectively, especially for those who have to listen. I was afforded the dispensation of talking and going to bed soon after, so I was the lucky one! This post represents a reduction of the stock of slides I shared with the group -- hope it's beef and not turkey.
The dominant issue in promoting metadata interoperability, in my estimation, lies in harmonizing data models, not element names. We in the library community have been slow to understand this, primarily because we've gotten along without a formal data model for so long.
The Dublin Core group began as a heterogeneous amalgamation of information mavens -- a good thing if you believe in hybrid vigor (I do). A bad thing from the point of view of finding common vocabularies and modeling idioms. There were (are) lots of ways to do/say/express metadata assertions, and a large proportion of them were represented among us. Attempts at abstracting a data model foundered in contentious seas of misunderstandings and egos, and any urgency about arriving at a common model gave way to simply staying afloat in troubled waters. After all, lots of people were using DC, right? The library community gets along without a rigorous data model, and MARC remains one of the most successful resource description idioms.
But MARC had only a few distinct generators (software systems), and cross-cultural MARC dialects could be made to interoperate only with difficulty. We should have known better. Wishful thinking (and conflict avoidance) triumphed over clear reasoning, and the data modeling effort in DC came to fruition slowly, fitfully, painfully. It took a decade. That hard-won lesson, embodied in the Dublin Core Abstract Model (DCAM), remains, in my estimation, the golden nugget at the center of the Dublin Core ore.
I asserted to the Leuven group that metadata standards that don't share a common data model are doomed to perpetual lossy interoperability at best: costly bespoke mappings that never really satisfy. I've written in the past about the analogy of incompatible train gauges such as are still encountered, for instance, on the China-Mongolia border. An entire train is 'unloaded' from its Chinese bogies (wheel trucks) by being jacked up on hydraulic lifts, and Mongolian bogies are then rolled under the carriages amidst great clanking and hissing. The train is lowered and continues into the dark Gobi night. Is this the metadata model we want to perpetuate? Unpacking assertions in one model and repacking them into another? It is folly.
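To make the bogie-swap concrete, here is a minimal sketch of what unpacking and repacking costs. Everything in it is an assumption for illustration: the two toy models (a "rich" one whose statements can carry a value URI and a language tag, and a "flat" one of element names mapped to strings), the names `Statement`, `to_flat` and `from_flat`, and the example.org URI. It is not the DCAM, MARC, or any real standard, just two shapes that happen to share element names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Statement:
    """One assertion in a hypothetical 'rich' model."""
    prop: str                        # element name, e.g. "creator"
    value: str                       # human-readable value string
    value_uri: Optional[str] = None  # identifier for the value, if any
    lang: Optional[str] = None       # language tag of the value string


def to_flat(statements: List[Statement]) -> Dict[str, List[str]]:
    """Repack into a hypothetical flat model: element name -> list of strings."""
    flat: Dict[str, List[str]] = {}
    for s in statements:
        # The flat model has nowhere to put value_uri or lang, so they are dropped.
        flat.setdefault(s.prop, []).append(s.value)
    return flat


def from_flat(flat: Dict[str, List[str]]) -> List[Statement]:
    """Map back to the rich model; the dropped structure cannot be recovered."""
    return [Statement(prop=name, value=v)
            for name, values in flat.items()
            for v in values]


record = [
    Statement("creator", "A. Author", value_uri="http://example.org/person/42"),
    Statement("title", "Metadata 2.0", lang="en"),
]

round_tripped = from_flat(to_flat(record))

# Statements that did not survive the round trip unchanged.
lost = [s for s in record if s not in round_tripped]
print(f"{len(lost)} of {len(record)} statements changed in the round trip")
```

The element names match perfectly on both sides, and the record still comes back different: the value URI and the language tag had no place to live in the flat model. Agreeing on names did nothing to prevent the loss; only agreeing on structure would have.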
But it is still hard to find agreement in these spaces. Lessons learned unravel. There's always a higher abstraction level that will save us, no? Well, no, actually. Machine parsing requires precision. You agree about structure or you don't. I suspect that semantic interoperability decays across mappings, as with sound and light, as the cube of the 'distance' between the models. Multiplied, of course, by the sum of the metadata instances represented in each model. (How's that for an unsupportable assertion of cost?)
OK, but whose model? Did I mention there are egos involved? And money? And pride? And organizational investments? And NIH syndrome? Any one of these alone is a serious impediment to adoption.
In a further conversation with Erik, we discussed the general
suitability of the DCAM. Erik observed that the number of people, even
in technical groups, who have a strong grasp of its intricacies is
small. Unhappily, he is right. Is the DCAM needlessly complex, or is
the complexity matched to a proportionately difficult problem? And,
...should there be one model that we all build on or should we build something that overarches all existing models...
Isn't that then a common data model? If the DCAM is considered too complex, how will this help?
Answering these questions is the crux move for progress in Metadata 2.0. If the complexity is appropriate, then spare us yet another data model. If it is needlessly complex, then it behooves all parties to simplify and abstract until we have distilled the essence. Metadata 2.0 isn't social, isn't the next level, isn't the latest and greatest... it's a do-over, a mulligan, an after-school detention. We just don't have it right yet. My assertion is that the DCAM is roughly right. If there be flaws, expose them with evidence. If there are better ways, demonstrate their value. Otherwise, adopt and deploy with vigor and rigor.
Get the trains rolling on the same tracks.
-----
Inside the train longhouse on the China-Mongolia border
(October 2004). Hydraulic jacks line the longhouse, and raise the
entire train, allowing one gauge of bogie to be rolled out and another
set to be rolled in. The process, which includes a cabin-by-cabin
visitation by customs officials, took about two hours in the middle of
a Gobi-desert night.
@Jonathan: I think RDF (or RDF+RDFS) & the DCAM are in a _similar_ space in that they both define "an abstract model" for metadata.
And I think they are in a _different_ space because the DCAM was intended to reflect the perspectives of one particular metadata community, not (again, just IMHO!) as something to be adopted by many different metadata communities. The DCMI community _did_ have its own conceptualisation of "what DC metadata is", which - and here I guess I'm kinda agreeing with Ed's notion of "parallel evolution" - existed independently of the RDF model, albeit a conceptualisation that hadn't been very clearly articulated. The Usage Board's "Grammatical Principles" was about the closest we had, I think.
The DCAM is an attempt to articulate that model both in terms which reflect the terms of the DCMI community and also in a form which is compatible with the RDF model.
See also
http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0702&L=DC-ARCHITECTURE&P=R5678
and
http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0703&L=DC-ARCHITECTURE&P=R933
Why didn't DCMI just use RDF? Well, I think there had been a good deal of effort in that direction, but in c2001-2002, there still seemed to be a substantial sector of the DCMI community which just didn't buy into the RDF concepts and terminology, but nevertheless did have a (perhaps, as I say, vague & ill-defined) notion that "DC metadata" existed independently of its multiple syntactic representations.
Again speaking only for myself, I do worry about exactly this question, of whether creating the DCAM was The Right Thing to Do. But I try to argue to myself that if the end result is that the DCMI community has a model that reflects its own conceptualisations and also supports a direct mapping to the RDF model, then it benefits both that community and the wider Semantic Web.
Of course, if all we've ended up with is a poorly documented community model that is impenetrable to many of that community, then we're no further forward. :-(
I totally agree with you that, currently, the documentation is inadequate (I thought I'd better tone down the original term I used there), and newcomers approaching the DCMI web site and/or ancillary sources of information on the Web are faced with having to make sense of a set of potentially contradictory documents. Addressing that is probably the most important task facing DCMI at the moment.
Posted by: PeteJ | February 17, 2008 at 03:09 AM
And finally, one last one (this is such interesting stuff), Stu, you realize your defense of DCAM is exactly the one I was making to you of RDA and FRBR before?
If they're not good enough, why do you think you can do better by throwing them out and starting over? They represent the best that smart people working on it for a long time under various constraints of reality (which are still there) were able to do.
Posted by: Jonathan Rochkind | February 15, 2008 at 10:43 AM
Pete: Generally the reason you have special purpose abstraction tools is because they can be much simpler than more general purpose or universal ones.
Your comment seems to suggest that you believe that RDF is in fact in the same solution-space as DCAM (I've gotten different answers on this from different DCAM people), but that RDF is both general purpose AND simpler than DCAM! Then what do we need DCAM for at all? Why not just use RDF?
Posted by: Jonathan Rochkind | February 15, 2008 at 10:38 AM
"Is the DCAM needlessly complex, or is the complexity matched to a proportionately difficult problem?"
I think this is a key unanswered question. We really don't know. So let's say that we do know that if it's possible to make a simpler solution for this complex problem, it's not EASY. We'll only do it by experimenting and trying and evaluating our experiments. Like DCAM. So let's say that DCAM represents the 'state of the art' of building a general purpose metadata regime as simply as possible.
In which case we can say "Well, it's as simple as we know how to make it _right now_." The problem, then, is not necessarily complexity itself, but that if you want something as complicated as DCAM to take off--you've got to write good documentation so people can understand it. Where is this documentation? What do I read to understand DCAM--its goals, its principles, what it does, what it is? I have been unable to find it. This is a problem. To be sure, part of the difficulty is that the DCAM community is still -figuring it out- themselves, making it up as they go along. That's how innovative work is done. Nevertheless, if you want people to recognize what it's good for, they need to understand it, and if you want them to understand it, you've got to write some accessible documentation. This is the DCAM community's responsibility.
Stu, you may recall that I emailed you a few months ago out of a conversation I had with someone. He was convinced that DCAM was just about the "15 DC elements", and thus had no use outside of that. When I tried to convince him he was mistaken, he didn't believe me. When I tried to find DCAM documentation that would back me up---I couldn't. In fact, he found DCAM documentation that seemed to back him up!
So how can someone that hasn't been involved in the DCAM community from the start educate themselves as to what the heck is going on?
Posted by: Jonathan Rochkind | February 15, 2008 at 10:36 AM
Hi Stu,
I wasn't completely clear what you and/or Erik meant by "the general suitability of the DCAM".
Suitability as a model for DC metadata? Or suitability as a (or the?) common model, to be adopted beyond "the DC community"?
On the former point, I'd like to think the DCAM does provide a reasonable formalisation of the concepts that had been used within the DCMI community, and further it does so in a way which provides a direct mapping to the RDF model. I know, I would say that... but at least now I think we have definitions of those key DCMI concepts which are
more or less complete and consistent.
Of course, I'm also well aware that in practice a good number of applications which describe themselves as "using Dublin Core metadata" still operate on an ad hoc basis without any reference whatsoever to the DCAM!
On the second point, and here I should add that I'm speaking only for myself :-) in contributing to the creation of the DCAM, I saw myself contributing to the development of a model for a specific metadata community (albeit one which is broad and diverse and fuzzy at the edges); I didn't - and still don't - see myself as developing a more general model for use across many communities.
For me, the RDF model is a much better candidate for that role than the DCAM - RDF was designed for that role; it is simpler than the DCAM, it has a formally-defined semantics, a good fit with the Web Architecture and the support of a huge range of software tools - support which I can't see the DCAM ever achieving.
And that is fine: the DCAM is designed to be compatible with RDF, but it isn't (IMHO) intended to be a replacement for/competitor to RDF.
Posted by: PeteJ | February 15, 2008 at 09:08 AM
It's funny--as a "library-technology-person" who has recently started dabbling in RDF and semweb technologies, DublinCore seems pretty successful. It's a nice vocabulary to be able to invoke when describing resources, and it turns up in specs for FOAF, OAI-(ORE|PMH), RSS, Atom, RDFa, SKOS. The vocabulary I get--the DCAM is a tougher nut for me to crack. It hasn't been abundantly clear to me why it is needed when you have RDF already. I've summed it up to myself as the result of parallel evolution--but perhaps you could characterize it better. Maybe you already have? :-)
So I'd say that the vocabulary itself is a really wonderful achievement...and it has brought together a community of web metadata practitioners that I'm not sure would've existed otherwise. The ego/pride/money thing is a pain in the neck--but aren't they general problems of human endeavors? I don't think they are endemic to DublinCore, metadata or the Web. Perhaps NIH is the best summary of the net effect of those forces as they apply to specific technologies. It clearly has a pernicious effect (http://iandavis.com/blog/2004/03/theNucleusOfAtom).
Posted by: Ed Summers | February 15, 2008 at 05:51 AM