    February 18, 2008

    Metadata: Semantics; Structure; Syntax

    Peter Murray, aka the Disruptive Library Technology Jester, posted an encapsulated history of the origins of the Dublin Core, and observed that he still is

    trying to reconcile what differences exist between RDF and the DCAM based on these postings and comments from Stu’s blog.

    I'm glad that people are engaged in trying to sort this out, even as I'm unhappy that it's still unclear at this late date.  That it still IS unclear is incontrovertible (look at the caliber of people trying!).  I'm not very confident at this point that I can wash away the confusion, but it does seem potentially useful to reprise a part of my metadata talk that I used to give a lot.

    Sharing metadata requires agreements on three topics:

    1. Semantics: what is the meaning we are trying to convey in metadata assertions?  Meaning, of course, resides in the minds of people, not machines.  The focus of the Dublin Core effort has been to promote those shared meanings... and make them sharable.  The semantics bit is about agreeing about elements: author, publisher, date, etc.
    2. Syntax: how do you take a set of metadata assertions and pack them so that one machine can send them to another, where they can be unpacked and parsed by machine logic or displayed and read by a person, with high probability that the meaning of the assertions travels unchanged from one mind to another? RDF documents refer to serialization... the order of bits in a stream... actually putting the stuff 'on the wire.' (The careful readers and jaded among you may wonder why I changed the order of exposition from the title of this post.  Best for last? No... hardest.)
    3. Structure: You can't do syntax reliably unless you have unambiguous structure.  The sorts of things you have to specify in a well-structured metadata assertion (not an exhaustive list):
    • The boundaries of a set of assertions (what constitutes a record)
    • Cardinality - Can an element be repeated, and if so, is there a limit on the number?
    • How is a name structured? What is the delimiter separating elements of a compound name (Prince and Bono excepted, most names are compound structures, many with surprising and confounding complexity).
    • How is nesting managed?
    • How are dates encoded? YYYY-MM-DD? DD-MM-YYYY? MM-DD-YYYY?
    • How does one identify the encoding scheme that answers the question above?
    • How does one identify a value encoding scheme (e.g., LCSH, MeSH, Dewey) from which metadata values can be chosen?  Are such schemes required or optional?
    • Are metadata values specified by reference (URI) or by value (literal strings)?
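
    The date question in the list above can be made concrete with a minimal Python sketch.  The Statement class, its field names, and the scheme labels are hypothetical, invented for illustration; they are not drawn from DCAM or from any registered encoding scheme.  The sketch only shows that the same literal string yields two different dates unless the encoding scheme travels with the value.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

# Hypothetical scheme labels mapped to strptime patterns (illustrative only,
# not taken from any standard registry).
DATE_SCHEMES = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "DD-MM-YYYY": "%d-%m-%Y",
    "MM-DD-YYYY": "%m-%d-%Y",
}


@dataclass(frozen=True)
class Statement:
    """One metadata assertion: an element, a value, and how to read the value."""
    element: str                  # semantics: which property is asserted
    value: str                    # a literal string, or a URI if by_reference
    scheme: Optional[str] = None  # structure: the declared encoding scheme
    by_reference: bool = False    # True when value is a URI, not a literal


def parse_date(stmt: Statement) -> date:
    """Interpret a date statement, refusing to guess when no scheme is declared."""
    if stmt.scheme not in DATE_SCHEMES:
        raise ValueError(f"no usable date scheme declared for {stmt.value!r}")
    return datetime.strptime(stmt.value, DATE_SCHEMES[stmt.scheme]).date()


ambiguous = "03-04-2008"
as_dmy = parse_date(Statement("date", ambiguous, scheme="DD-MM-YYYY"))
as_mdy = parse_date(Statement("date", ambiguous, scheme="MM-DD-YYYY"))

print(as_dmy.isoformat())  # 2008-04-03
print(as_mdy.isoformat())  # 2008-03-04
```

    Refusing to parse an undeclared scheme is a design choice in this sketch: a guess would produce a valid-looking date that may be a month off.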

    Most of these issues are not addressed in RDF. They can be, of course... but without agreements about how to do so, people tend to do them this way and that, leaving us without the ability to share data effectively.  This is where the Dublin Core Abstract Model (DCAM) comes in, as it specifies how to structure these sorts of things in a way that makes the data sharable.

    Is it perfect and generalizable?  No... its authors, in comments on my posts, have made evident that they make no such claim. Is it the best that is available for descriptive metadata?  I assert that it is, and that efforts to work towards an Uber-Metadata-Model should start with this effort and simplify or complexify as is necessary and sufficient to assure that metadata  can be shared across communities.

    One last point.  DCAM is articulated in the vernacular of RDF, but the structure that it creates is independent of RDF.  If RDF passes into the graveyard  of once-or-never-mighty technologies, the abstractions it (DCAM) declares survive quite nicely.  Syntax independence: a goal we strove for from day 1 of the first DC metadata workshop.  It is a worthy metadata engineering principle.

    To sum up: Defining semantics is a political process of reaching consensus.  Syntax is arranging the bits reliably so they travel comfortably between computers (RDF is a fine way to do this, but by no means the only way), and structure is the specification of the details necessary to lay out and declare metadata assertions so they can be embedded unambiguously in a syntax.  A data model is the specification of this structure.
    -----
    I was influenced to include semicolons in the title of this post by an article in today's NYTs, forwarded to me by Marguerite.  I LIKE semicolons, even if they are stodgy.
    -----
    Wary Ibises (or something like them) in Barwon Heads, Australia

    February 15, 2008

    RDF & DCAM: parallel or complementary?

    Ed Summers posed an interesting question in reply to my assertion that the Dublin Core Abstract Model (DCAM) is the central jewel in the Dublin Core effort.

    'It's funny--as a "library-technology-person" who has recently started dabbling in RDF and semweb technologies DublinCore seems pretty successful. It's a nice vocabulary to be able to invoke when describing resources, and it turns up in specs for FOAF, OAI-(ORE|PMH), RSS, Atom, RDFa, SKOS. The vocabulary I get--the DCAM is a tougher nut for me to crack. It hasn't been abundantly clear to me why it is needed when you have RDF already. I've summed it up to myself as the result of parallel evolution--but perhaps you could characterize it better. Maybe you already have? :-)'

    It is always nice to see independent endorsement of the roughly-rightness of DC as a vocabulary, and I hope my earlier remarks in no way impugn the value of the global consensus these vocabulary terms represent.  They are valuable to a great many, but from the first workshop almost 15 years ago, we recognized (even in the name) that DC needed to be extensible and interoperable.  This is where the  abstract model is important.

    The evolutions of RDF and DCAM are not parallel in any exclusive way, but rather intimately intertwined.  Indeed, DC was the prototypical client for RDF, and DC mavens have from the beginning been an integral part of the RDF and Semantic Web development community. RDF was born at a meeting of four people (Bill Arms, then of CNRI, Jim Miller, then of the W3C, Dan Connolly, then and now of the W3C, and myself, representing the DC community).  The W3C folks recognized that the PICS effort then underway was inadequate to the larger needs for expressing general metadata, and thought the time was ripe for the development of something more broadly useful.

    PICS (Platform for Internet Content Selection) was an effort hastily conceived to fend off assertions that porn would infect every classroom unless the gubmint stepped in to protect us.  Someone (TimBL? Dan? Jim?) realized that there was benefit in building a general purpose architecture to support the declaration of reusable semantic assertions.  Bill knew of this, and of the DC effort, and brought us together in a meeting at the CNRI offices in Reston, Virginia.  My only contribution to the meeting was to say... 'gee, that sounds swell!'  Or something like that.

    So, some of the Web techies in the DC community jumped in enthusiastically and soon we had an RDF camp as an alternative to simple HTML META tag  attribute-value pairs.  DC fed functional requirements to the RDF folks, and we figured in a year or two the whole world would be declaring metadata using RDF.  Our tender naiveté makes me laugh and shake my head now.  We really thought we had this one by the tail.

    It didn't quite work out that way, of course.  Ten years later, and RDF still struggles in the technology marketplace (hoping for lots of shocked comments to this assertion).  Why is that?  Basically, because RDF fulfills a second order requirement: interoperability.  It is fairly straightforward to build a closed system where everyone knows what they need.  This is the way most systems used to be built, of course, and one of the wondrous things about the Web is it introduces global scope as an intrinsic technological attribute.  Not to say we always take full advantage.

    In the metadata realm we're trying to achieve global semantic scope as well as technological scope.  And we want it to be extensible.  And we hoped that applications would be built independently of one another on a technological platform that would make possible interoperability without pre-coordination.   If you believe TimBL, this is the future of the Web.  I've wanted to believe, and still want to.  If it is to happen, it requires more than RDF.  It requires conventions about how we structure our metadata assertions.  This is where DCAM comes in.  The abstract model provides a syntax-independent (hence the abstract bit) set of conventions for expressing metadata on the web.  RDF is the natural idiom for the expression of the DCAM, but it is NOT essential.  You can build any arbitrary syntactical representation of the metadata according to DCAM, and a lossless transformation to any other arbitrary syntactical representation should be possible between two machines that grok both syntaxes.
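
    The claim of lossless transformation between syntaxes can be illustrated with a toy sketch in Python.  The abstract record here (an ordered list of element/value statements) and the JSON and XML layouts are assumptions made up for this example; they are far simpler than DCAM and are not any official DC serialization.  The point is only that when both syntaxes agree on the same structure, a round trip through either returns the same assertions.

```python
import json
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Toy abstract structure (an assumption for illustration, not DCAM itself):
# an ordered list of (element, value) statements; elements may repeat.
Record = List[Tuple[str, str]]


def to_json(record: Record) -> str:
    """Serialize the abstract record as a JSON array of statement objects."""
    return json.dumps([{"element": e, "value": v} for e, v in record])


def from_json(text: str) -> Record:
    """Rebuild the abstract record from the JSON syntax."""
    return [(d["element"], d["value"]) for d in json.loads(text)]


def to_xml(record: Record) -> str:
    """Serialize the same abstract record as a flat XML document."""
    root = ET.Element("record")
    for element, value in record:
        stmt = ET.SubElement(root, "statement", {"element": element})
        stmt.text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Record:
    """Rebuild the abstract record from the XML syntax."""
    root = ET.fromstring(text)
    return [(s.get("element"), s.text or "") for s in root.findall("statement")]


record: Record = [
    ("title", "Metadata: Semantics; Structure; Syntax"),
    ("date", "2008-02-18"),
    ("subject", "RDF"),
    ("subject", "DCAM"),
]

# JSON -> abstract record -> XML -> abstract record: nothing is lost,
# because both syntaxes carry the same agreed structure.
via_both = from_xml(to_xml(from_json(to_json(record))))
print(via_both == record)  # True
```

    Neither serializer knows about the other; each only knows the shared abstract structure, which is what makes the pairing work without pre-coordination between the two syntaxes.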

    So, staying 'on the tracks' is a matter of adopting those conventions (not an intrinsic part of RDF, but naturally expressible in RDF).  If you happen to be using RDF, all the better, but we make no assumptions that RDF is the only appropriate syntactic rendition.  If you've reached the end of this post (I'm guessing the world-wide audience for this post is... say... 9), and want more (we're down to 3 now), you should talk to Andy Powell or Mikael Nilsson or their co-authors, who did the heavy lifting on getting this thing done.  The Metadata world owes them a substantial debt.

    -----

    Three dragons flying: Ok, the production values of the image in this post aren't exactly great... the iPhone will never win awards as a camera.  My OCLC Programs and Research colleague, Karen Yoshimura, scribbled this out on a paper restaurant table cloth faster (and far more beautifully) than I can write my name.  Man, I wish I could do that.

    February 14, 2008

    Metadata 2.0: On the Rails or...?

    Erik Duval, of Katholieke Universiteit Leuven in Belgium, is a longstanding metadata colleague I met in the early days of Web metadata.  We've worked in the service of related activities for some years, and our paths have intersected in productive ways on a variety of occasions in a dozen years (Notably, here).  So when he asked me to participate by teleconference in a Metadata 2.0 workshop in Leuven, I was pleased to accept, even though it fell on the day of my two presentations at the VALA2008 conference in Melbourne.

    So, at 21:00 Melbourne time I called into the workshop (11:00 AM Leuven time) and visited my modeling ideology upon the hapless participants.  Tele-presence is hard to do effectively, and especially for those who have to listen.  I was afforded the dispensation of talking and going to bed soon after, so I was the lucky one!  This post represents a reduction of the stock of slides I shared with the group -- hope it's beef and not turkey.

    The dominant issue in promoting metadata interoperability, in my estimation, lies in harmonizing data models, not element names.  We in the library community have been slow to understand this, primarily because we've gotten along without a formal data model for so long.

    The Dublin Core group began as a heterogeneous amalgamation of information mavens -- a good thing if you believe in hybrid vigor (I do).  A bad thing from the point of view of finding common vocabularies and modeling idioms.  There were (are) lots of ways to do/say/express metadata assertions, and a large proportion of them were represented among us.  Attempts at abstracting a data model foundered in contentious seas of misunderstandings and egos, and any urgency about arriving at a common model gave way to simply staying afloat in troubled waters.  After all, lots of people were using DC, right?  The library community gets along without a rigorous data model, and MARC remains one of the most successful resource description idioms. 

    But MARC had only a few distinct generators (software systems), and cross-cultural MARC dialects could be made to interoperate only with difficulty.  We should have known better.  Wishful thinking (and conflict avoidance) triumphed over clear reasoning, and the data modeling effort in DC came to fruition slowly, fitfully, painfully. It took a decade.  That hard-won lesson, embodied in the Dublin Core Abstract Model (DCAM), remains, in my estimation, the golden nugget at the center of the Dublin Core ore.

    I asserted to the Leuven group that metadata standards that don't share a common data model are doomed to perpetual lossy interoperability at best: costly bespoke mappings that never really satisfy.  I've written in the past about the analogy of incompatible train gauges such as are still encountered, for instance, on the China-Mongolia border.  An entire train is 'unloaded' from its Chinese bogies (wheel trucks) by being jacked up on hydraulic lifts, and Mongolian bogies are then rolled under the carriages amidst great clanking and hissing.  The train is lowered and continues into the dark Gobi night.  Is this the metadata model we want to perpetuate?  Unpacking assertions in one model and repacking them into another?  It is folly.
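
    A small Python sketch shows how unpacking and repacking across models loses information.  Both "models" are hypothetical toys invented for this example: one stores a name as structured family and given parts, the other as a single flat string.  The mapping functions are one plausible bespoke crosswalk, not anyone's actual practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredName:
    """Hypothetical model A: a name with explicit family and given parts."""
    family: str
    given: str


def to_flat(name: StructuredName) -> str:
    """Crosswalk A -> B: hypothetical model B keeps only one display string."""
    return f"{name.given} {name.family}"


def from_flat(text: str) -> StructuredName:
    """Crosswalk B -> A: guess the boundary by splitting at the last space."""
    given, _, family = text.rpartition(" ")
    return StructuredName(family=family, given=given)


simple = StructuredName(family="Duval", given="Erik")
compound = StructuredName(family="van Gogh", given="Vincent")

# The simple name happens to survive the round trip...
print(from_flat(to_flat(simple)) == simple)      # True
# ...but the compound family name comes back with the boundary in the wrong place.
print(from_flat(to_flat(compound)) == compound)  # False
print(from_flat(to_flat(compound)))
```

    The flat model never recorded where the family name begins, so no amount of cleverness in from_flat can recover it reliably; the loss is in the model mismatch, not in the code.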

    But it is still hard to find agreement in these spaces.  Lessons learned unravel. There's always a higher abstraction level that will save us, no?  Well, no, actually. Machine parsing requires precision. You agree about structure or you don't.  I suspect that semantic interoperability decays across mappings, as with sound and light, as the cube of the 'distance' between the models.  Multiplied, of course, by the sum of the metadata instances represented in each model.  (How's that for an unsupportable assertion of cost?)

    OK, but whose model?  Did I mention there are egos involved?  And money? And pride? And organizational investments?  And NIH syndrome? Any one of these alone is a serious impediment to adoption.

    In a further conversation with Erik, we discussed the general suitability of the DCAM.  Erik observed that the number of people, even in technical groups, who have a strong grasp of its intricacies is small.  Unhappily, he is right. Is the DCAM needlessly complex, or is the complexity matched to a proportionately difficult problem?  And,

    ...should there be one model that we all build on or should we build something that overarches all existing models...  

    Isn't that then a common data model?  If the DCAM is considered too complex, how will this help? 

    Answering these questions is the crux move for progress in Metadata 2.0.  If the complexity is appropriate, then spare us yet another data model.  If it is needlessly complex, then it behooves all parties to simplify and abstract until we have distilled the essence.  Metadata 2.0 isn't social, isn't the next level, isn't the latest and greatest... it's a do-over, a mulligan, an after-school detention.  We just don't have it right yet.  My assertion is that the DCAM is roughly right.  If there be flaws, expose them with evidence.  If there are better ways, demonstrate their value.  Otherwise, adopt and deploy with vigor and rigor.

    Get the trains rolling on the same tracks.

    -----
    Inside the train longhouse on the China-Mongolia border (October 2004).  Hydraulic jacks line the longhouse, and raise the entire train, allowing one gauge of bogie to be rolled out and another set to be rolled in.  The process, which includes a cabin-by-cabin visitation by customs officials, took about two hours in the middle of a Gobi-desert night.

    August 31, 2007

    DC-2007 in Singapore

    DC-2007 is in the books. As I write this post, I’m attending the DCMI Advisory Board meeting in a 15th floor conference room at the Singapore National Library Board, situated in a stunning new building that, when I saw it for the first time earlier this week, struck me as perhaps a multinational corporate headquarters among many in this dynamic city of international commerce. THAT is the Library??? Wow! Twenty-seven million visitors annually and an economy grounded firmly in information technology make it seem eminently sensible.

    The conference was across the street at the Intercontinental Hotel, where 190 delegates from 33 countries participated in a day of tutorials, three days of papers, workshops, and working group meetings, and a final day of seminars.

    This was my second visit to Singapore, and this time around I encountered more of its colonial past and hints of an era whose charm may be best appreciated from the comfort of air conditioning and the cosmopolitan friendliness that is a hallmark of this city. Nineteenth century colonialism has an undeniable architectural charm, but twenty-first century Singapore is grounded in an information future that is as strategic in the digital world as the Strait of Malacca has been in the shipping world. Not so many pirates though…few cities are safer or more welcoming to visitors.

    Among the news from the conference is the announcement that DCMI is changing its host from OCLC, where it started back in the heady days of the early Web, when there were all of a half-million addressable pages on the Web. After a dozen years, DCMI has embarked on the path of creating a stand-alone organization, and the National Library Board of Singapore will provide administrative support, consistent with a national goal of being as hospitable to international information standards activities as it is to visitors in general.

    Many thanks are due to the local organizing committee for DC-2007 for making this conference a success, and some of these same folks will earn our further gratitude for their future administrative efforts on behalf of DCMI.
    -----
    Singapore is home to a spectacular botanical garden that includes a wonderful orchid garden with more than a thousand species, and twice that many hybrids.

    August 29, 2007

    Digital Cultural Evolution in China

    This morning's keynote at DC-2007 in Singapore was delivered by Zhang Xiaoxing, Deputy Director of the National Cultural Information Resource Center in China.  Dr. Zhang described a national cultural information resources sharing project started in 2002 and funded by the Chinese Government.  According to Dr. Zhang, this system is intended to support multi-technology distribution of information to grass-roots centers, especially farmers and rural citizens who would have little access to such information otherwise.

    The data span many formats and content areas including a variety of cultural domains, agricultural science and technology, and laws and regulations.  The system is organized in tiers, beginning with a root national center, three regional centers, 33 provincial centers, and more than 8,000 local centers.

    DC is the core metadata standard, and has been further elaborated into application profiles to support the varieties of content made available.  The OAI-PMH protocol is used to facilitate sharing of data among the grass-roots centers.  And there is lots of it to share... currently some 58 terabytes.

    Dublin Core mavens would find Dr. Zhang's slides very familiar indeed, recapping ideas and principles argued and agreed over more than a decade of experimentation and wrangling (some of his screen shots of application profiles might yet provoke discussion among the architecture crowd). It is a genuine pleasure to see these efforts (and even some of the problems) echoed in a national effort such as this, with repercussions that can be expected to ramify widely in the countryside of Chinese society and culture, validating an awful lot of jetlag on the part of many people over the years.  I wish our colleagues in China all success with this project.
    -----
    Downtown Shanghai, DC-2004