    February 18, 2008

    Metadata: Semantics; Structure; Syntax

    Peter Murray, aka the Disruptive Library Technology Jester, posted an encapsulated history of the origins of the Dublin Core, and observed that he still is

    trying to reconcile what differences exist between RDF and the DCAM based on these postings and comments from Stu’s blog.

    I'm glad that people are engaged in trying to sort this out, even as I'm unhappy that it's still unclear at this late date.  That it still IS unclear is incontrovertible (look at the caliber of people trying!).  I'm not very confident at this point that I can wash away the confusion, but it does seem potentially useful to reprise a part of the metadata talk that I used to give a lot.

    Sharing metadata requires agreements on three topics:

    1. Semantics: what is the meaning we are trying to convey in metadata assertions?  Meaning, of course, resides in the minds of people, not machines.  The focus of the Dublin Core effort has been to promote those shared meanings... and make them sharable.  The semantics bit is about agreeing about elements: author, publisher, date, etc.
    2. Syntax: how do you take a set of metadata assertions and pack them so that one machine can send them to another, where they can be unpacked and parsed by machine logic or displayed and read by a person, with high probability that the meaning of the assertions travels unchanged from one mind to another? RDF documents refer to this as serialization... the order of bits in a stream... actually putting the stuff 'on the wire.' (The careful readers and jaded among you may wonder why I changed the order of exposition from the title of this post.  Best for last? No... hardest.)
    3. Structure: You can't do syntax reliably unless you have unambiguous structure.  The sorts of things you have to specify in a well-structured metadata assertion (not an exhaustive list):
    • The boundaries of a set of assertions (what constitutes a record)
    • Cardinality - Can an element be repeated, and if so, is there a limit on the number?
    • How is a name structured? What is the delimiter separating elements of a compound name (Prince and Bono excepted, most names are compound structures, many with surprising and confounding complexity).
    • How is nesting managed?
    • How are dates encoded? YYYY-MM-DD? DD-MM-YYYY? MM-DD-YYYY?
    • How does one identify the encoding scheme that answers the question above?
    • How does one identify a value encoding scheme (e.g. LCSH, MeSH, Dewey) from which metadata values can be chosen?  Are such schemes required or optional?
    • Are metadata values specified by reference (URI) or by value (literal strings)?
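
    To make a few of the structural questions above concrete, here is a minimal Python sketch. It is not the DCAM and not any real library: the names Statement, Record, MAX_OCCURS and the scheme label "ISO8601-date" are hypothetical, chosen only for illustration. It shows a record boundary, a cardinality rule, a date encoding check, and the by-reference versus by-value distinction.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """One metadata assertion: an element paired with a value."""
    element: str                            # e.g. "title", "date", "subject"
    value: str                              # a literal string or a URI
    by_reference: bool = False              # True: value is a URI; False: a literal
    encoding_scheme: Optional[str] = None   # hypothetical label, e.g. "LCSH"


# Hypothetical cardinality rules: maximum occurrences per element.
# None means the element may repeat without limit.
MAX_OCCURS = {"title": 1, "date": 1, "creator": None, "subject": None}


@dataclass
class Record:
    """The boundary of a set of assertions about one resource."""
    resource: str
    statements: list = field(default_factory=list)

    def add(self, st: Statement) -> None:
        # Cardinality: refuse a repeat beyond the agreed limit.
        limit = MAX_OCCURS.get(st.element)
        count = sum(1 for s in self.statements if s.element == st.element)
        if limit is not None and count >= limit:
            raise ValueError(f"{st.element!r} may appear at most {limit} time(s)")
        # Date encoding: if the statement declares the (assumed) ISO scheme,
        # the value must parse as an ISO 8601 date such as YYYY-MM-DD.
        if st.encoding_scheme == "ISO8601-date":
            date.fromisoformat(st.value)    # raises ValueError otherwise
        self.statements.append(st)
```

    With these assumptions, a second "title" or a date written DD-MM-YYYY is rejected at the point of entry rather than discovered later by whoever receives the record. The particular rules matter less than the fact that both parties have to agree on them in advance.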

    Most of these issues are not addressed in RDF. They can be, of course... but without agreements about how to do so, people tend to do them this way and that, leaving us without the ability to share data effectively.  This is where the Dublin Core Abstract Model (DCAM) comes in, as it specifies how to structure these sorts of things in a way that makes the data sharable.

    Is it perfect and generalizable?  No... its authors, in comments on my posts, have made evident that they make no such claim. Is it the best that is available for descriptive metadata?  I assert that it is, and that efforts to work towards an Uber-Metadata-Model should start with this effort and simplify or complexify as is necessary and sufficient to assure that metadata  can be shared across communities.

    One last point.  DCAM is articulated in the vernacular of RDF, but the structure that it creates is independent of RDF.  If RDF passes into the graveyard  of once-or-never-mighty technologies, the abstractions it (DCAM) declares survive quite nicely.  Syntax independence: a goal we strove for from day 1 of the first DC metadata workshop.  It is a worthy metadata engineering principle.

    To sum up: Defining semantics is a political process of reaching consensus.  Syntax is arranging the bits reliably so they travel comfortably between computers (RDF is a fine way to do this, but by no means the only way), and structure is the specification of the details necessary to lay out and declare metadata assertions so they can be embedded unambiguously in a syntax.  A data model is the specification of this structure.
    -----
    I was influenced to include semicolons in the title of this post by an article in today's NYT, forwarded to me by Marguerite.  I LIKE semicolons, even if they are stodgy.
    -----
    Wary Ibises (or something like them) in Barwon Heads, Australia

    February 15, 2008

    RDF & DCAM: parallel or complementary?

    Ed Summers posed an interesting question in reply to my assertion that the Dublin Core Abstract Model (DCAM) is the central jewel in the Dublin Core effort.

    'It's funny--as a "library-technology-person" who has recently started dabbling in RDF and semweb technologies DublinCore seems pretty successful. It's a nice vocabulary to be able to invoke when describing resources, and it turns up in specs for FOAF, OAI-(ORE|PMH), RSS, Atom, RDFa, SKOS. The vocabulary I get--the DCAM is a tougher nut for me to crack. It hasn't been abundantly clear to me why it is needed when you have RDF already. I've summed it up to myself as the result of parallel evolution--but perhaps you could characterize it better. Maybe you already have? :-)'

    It is always nice to see independent endorsement of the roughly-rightness of DC as a vocabulary, and I hope my earlier remarks in no way impugn the value of the global consensus these vocabulary terms represent.  They are valuable to a great many, but from the first workshop almost 15 years ago, we recognized (even in the name) that DC needed to be extensible and interoperable.  This is where the  abstract model is important.

    The evolutions of RDF and DCAM are not parallel in any exclusive way, but rather intimately intertwined.  Indeed, DC was the prototypical client for RDF, and DC mavens have from the beginning been an integral part of the RDF and Semantic Web development community. RDF was born at a meeting of four people (Bill Arms, then of CNRI, Jim Miller, then of the W3C, Dan Connolly, then and now of the W3C, and myself, representing the DC community).  The W3C folks recognized that the PICS effort then underway was inadequate to the larger needs for expressing general metadata, and thought the time was ripe for the development of something more broadly useful.

    PICS (Platform for Internet Content Selection) was an effort hastily conceived to fend off assertions that porn would infect every classroom unless the gubmint stepped in to protect us.  Someone (TimBL? Dan? Jim?) realized that there was benefit in building a general purpose architecture to support the declaration of reusable semantic assertions.  Bill knew of this, and of the DC effort, and brought us together in a meeting at the CNRI offices in Reston, Virginia.  My only contribution to the meeting was to say... 'gee, that sounds swell!'  Or something like that.

    So, some of the Web techies in the DC community jumped in enthusiastically and soon we had an RDF camp as an alternative to simple HTML META tag  attribute-value pairs.  DC fed functional requirements to the RDF folks, and we figured in a year or two the whole world would be declaring metadata using RDF.  Our tender naiveté makes me laugh and shake my head now.  We really thought we had this one by the tail.

    It didn't quite work out that way, of course.  Ten years later, and RDF still struggles in the technology marketplace (hoping for lots of shocked comments to this assertion).  Why is that?  Basically, because RDF fulfills a second order requirement: interoperability.  It is fairly straightforward to build a closed system where everyone knows what they need.  This is the way most systems used to be built, of course, and one of the wondrous things about the Web is it introduces global scope as an intrinsic technological attribute.  Not to say we always take full advantage.

    In the metadata realm we're trying to achieve global semantic scope as well as technological scope.  And we want it to be extensible.  And we hoped that applications would be built independently of one another on a technological platform that would make possible interoperability without pre-coordination.   If you believe TimBL, this is the future of the Web.  I've wanted to believe, and still want to.  If it is to happen, it requires more than RDF.  It requires conventions about how we structure our metadata assertions.  This is where DCAM comes in.  The abstract model provides a syntax-independent (hence the abstract bit) set of conventions for expressing metadata on the web.  RDF is the natural idiom for the expression of the DCAM, but it is NOT essential.  You can build any arbitrary syntactical representation of the metadata according to DCAM, and a lossless transformation to any other arbitrary syntactical representation should be possible between two machines that grok both syntaxes.
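
    The claim about lossless transformation between syntaxes can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the abstract structure is reduced to a list of (element, value, by_reference) tuples, and the JSON and XML layouts are invented here, not taken from any DCMI or W3C specification. The point is only that when two syntaxes carry the same agreed structure, a round trip through either one gives back exactly what went in.

```python
import json
import xml.etree.ElementTree as ET

# The shared abstract structure (an assumption for this sketch):
# an ordered list of (element, value, by_reference) tuples.


def to_json(statements):
    """Serialize the abstract statements to a hypothetical JSON layout."""
    return json.dumps(
        [{"element": e, "value": v, "ref": r} for e, v, r in statements]
    )


def from_json(text):
    """Parse the hypothetical JSON layout back into abstract statements."""
    return [(d["element"], d["value"], d["ref"]) for d in json.loads(text)]


def to_xml(statements):
    """Serialize the same statements to a hypothetical XML layout."""
    root = ET.Element("record")
    for e, v, r in statements:
        node = ET.SubElement(
            root, "statement", element=e, kind="uri" if r else "literal"
        )
        node.text = v
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the hypothetical XML layout back into abstract statements."""
    root = ET.fromstring(text)
    return [
        (n.get("element"), n.text or "", n.get("kind") == "uri") for n in root
    ]


def json_to_xml(text):
    """Cross-syntax transformation: go through the shared structure."""
    return to_xml(from_json(text))
```

    Neither serializer knows anything about the other. What makes json_to_xml possible without loss is that both agree on the structure in the middle, which is the role the abstract model plays in the argument above.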

    So, staying 'on the tracks' is a matter of adopting those conventions (not an intrinsic part of RDF, but naturally expressible in RDF).  If you happen to be using RDF, all the better, but we make no assumptions that RDF is the only appropriate syntactic rendition.  If you've reached the end of this post (I'm guessing the world-wide audience for this post is... say... 9), and want more (we're down to 3 now), you should talk to Andy Powell or Mikael Nilsson or their co-authors, who did the heavy lifting on getting this thing done.  The Metadata world owes them a substantial debt.

    -----

    Three dragons flying: Ok, the production values of the image in this post aren't exactly great... the iPhone will never win awards as a camera.  My OCLC Programs and Research colleague, Karen Yoshimura, scribbled this out on a paper restaurant table cloth faster (and far more beautifully) than I can write my name.  Man, I wish I could do that.

    February 14, 2008

    Metadata 2.0: On the Rails or...?

    Erik Duval, of Katholieke Universiteit Leuven in Belgium, is a longstanding metadata colleague I met in the early days of Web metadata.  We've worked in the service of related activities for some years, and our paths have intersected in productive ways on a variety of occasions in a dozen years (Notably, here).  So when he asked me to participate by teleconference in a Metadata 2.0 workshop in Leuven, I was pleased to participate, even though it fell on the day of my two presentations at the VALA2008 conference in Melbourne.

    So, at 21:00 Melbourne time I called into the workshop (11:00 AM Leuven time) and visited my modeling ideology upon the hapless participants.  Tele-presence is hard to do effectively, and especially for those who have to listen.  I was afforded the dispensation of talking and going to bed soon after, so I was the lucky one!  This post represents a reduction of the stock of slides I shared with the group -- hope it's beef and not turkey.

    The dominant issue in promoting metadata interoperability, in my estimation, lies in harmonizing data models, not element names.  We in the library community have been slow to understand this, primarily because we've gotten along without a formal data model for so long.

    The Dublin Core group began as a heterogeneous amalgamation of information mavens -- a good thing if you believe in hybrid vigor (I do).  A bad thing from the point of view of finding common vocabularies and modeling idioms.  There were (are) lots of ways to do/say/express metadata assertions, and a large proportion of them were represented among us.  Attempts at abstracting a data model foundered in contentious seas of misunderstandings and egos, and any urgency about arriving at a common model gave way to simply staying afloat in troubled waters.  After all, lots of people were using DC, right?  The library community gets along without a rigorous data model, and MARC remains one of the most successful resource description idioms. 

    But MARC had only a few distinct generators (software systems), and cross-cultural MARC dialects could be made to interoperate only with difficulty.  We should have known better.  Wishful thinking (and conflict avoidance) triumphed over clear reasoning, and the data modeling effort in DC came to fruition slowly, fitfully, painfully. It took a decade.  That hard-won lesson, embodied in the Dublin Core Abstract Model (DCAM), remains, in my estimation, the golden nugget at the center of the Dublin Core ore.

    I asserted to the Leuven group that metadata standards that don't share a common data model are doomed to perpetual lossy interoperability at best: costly bespoke mappings that never really satisfy.  I've written in the past about the analogy of incompatible train gauges such as are still encountered, for instance, on the China-Mongolia border.  An entire train is 'unloaded' from its Chinese bogies (wheel trucks) by being jacked up on hydraulic lifts, and Mongolian bogies are then rolled under the carriages amidst great clanking and hissing. The train is lowered and continues into the dark Gobi night.  Is this the metadata model we want to perpetuate?  Unpacking assertions in one model and repacking them into another?  It is folly.
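
    A toy Python sketch of the bogie change, under invented assumptions: "model A" holds repeatable creators with structured names, "model B" allows a single flat creator string, and the functions a_to_b and b_to_a stand in for the bespoke mappings between them. Neither model corresponds to a real standard.

```python
def a_to_b(rec_a):
    """Map hypothetical model A (structured, repeatable creators)
    onto hypothetical model B (one flat creator string)."""
    names = [f"{c['family']}, {c['given']}" for c in rec_a["creators"]]
    return {"title": rec_a["title"], "creator": "; ".join(names)}


def b_to_a(rec_b):
    """Map model B back to model A. This is a best-effort guess:
    the flat string no longer says where the name parts begin and end."""
    creators = []
    for chunk in rec_b["creator"].split("; "):
        family, _, given = chunk.partition(", ")
        creators.append({"family": family, "given": given})
    return {"title": rec_b["title"], "creators": creators}


# A record whose names happen to survive the trip.
simple = {
    "title": "Metadata 2.0",
    "creators": [{"family": "Weibel", "given": "Stuart"}],
}

# A record with a corporate name containing the delimiter itself.
tricky = {
    "title": "Metadata 2.0",
    "creators": [
        {"family": "Weibel", "given": "Stuart"},
        {"family": "Example Press, Inc.", "given": ""},
    ],
}

simple_survives = b_to_a(a_to_b(simple)) == simple
tricky_survives = b_to_a(a_to_b(tricky)) == tricky
```

    The simple record round-trips; the one with a comma inside a corporate name does not, because model B has no place to record that the comma was part of the name. Each such case can be patched with another special rule in the mapping, which is exactly how bespoke mappings become costly.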

    But it is still hard to find agreement in these spaces.  Lessons learned unravel. There's always a higher abstraction level that will save us, no?  Well, no, actually. Machine parsing requires precision. You agree about structure or you don't.  I suspect that semantic interoperability decays across mappings, as with sound and light, as the cube of the 'distance' between the models.  Multiplied, of course, by the sum of the metadata instances represented in each model.  (How's that for an unsupportable assertion of cost?)

    OK, but whose model?  Did I mention there are egos involved?  And money? And pride? And organizational investments?  And NIH syndrome? Any one of these alone is a serious impediment to adoption.

    In a further conversation with Erik, we discussed the general suitability of the DCAM.  Erik observed that the number of people, even in technical groups, who have a strong grasp of its intricacies is small.  Unhappily, he is right. Is the DCAM needlessly complex, or is the complexity matched to a proportionately difficult problem?  And,

    ...should there be one model that we all build on or should we build something that overarches all existing models...  

    Isn't that then a common data model?  If the DCAM is considered too complex, how will this help? 

    Answering these questions is the crux move for progress in Metadata 2.0.  If the complexity is appropriate, then spare us yet another data model.  If it is needlessly complex, then it behooves all parties to simplify and abstract until we have distilled the essence.  Metadata 2.0 isn't social, isn't the next level, isn't the latest and greatest... it's a do-over, a mulligan, an after-school detention.  We just don't have it right yet.  My assertion is that the DCAM is roughly right.  If there be flaws, expose them with evidence.  If there are better ways, demonstrate their value.  Otherwise, adopt and deploy with vigor and rigor.

    Get the trains rolling on the same tracks.

    -----
    Inside the train longhouse on the China-Mongolia border (October 2004).  Hydraulic jacks line the longhouse, and raise the entire train, allowing one gauge of bogie to be rolled out and another set to be rolled in.  The process, which includes a cabin-by-cabin visitation by customs officials, took about two hours in the middle of a Gobi-desert night.

    December 11, 2007

    Roll over, George

    Jonathan Rochkind made some thoughtfully peevish comments on my previous post on RDA and the Futures report, which drove me (perish the thought) back to the document itself.

    ...I'm confused by your apparent sympathy with the Working Group recommendations to suspend RDA work...

    ...That recommendation seems to instead be based on the fantasy that we need to spend lots of time 'testing' FRBR, at the end maybe deciding that FRBR is no good at all

    I don't see this in the recommendations at all.  What I read (in recommendation 4.2.1, p 29) is a clear mandate to resolve the existing ambiguities in the FRBR model in order to:

    provide a more robust framework for the creation of  the resource description and access rules that will be used in the future to support a broad range of searching options (also on page 29). 

    This is essential, and should be undertaken in the light of functional pragmatism, not ideology.  And certainly I agree with Jonathan that there is little time to waste.  The Futures report does not impugn the value of FRBR, but simply recognizes that we as a community do not agree about the importance of Expressions.  If it is critical in other ways, I missed it.

    There is much stronger concern expressed in the report about the uncertainties of RDA, having to do with unsubstantiated benefits, alignment with existing standards, and the business case for it (see the bottom of page 24).

    The subsequent recommendation (on the next page: 3.2.1) is stated more strongly than I might have chosen.  But the heading (Suspend Work on RDA) is elaborated with untils, and makes clear that useful work has been initiated with JSC and DCMI, and should continue.

    But any assertion that debates going on on the RDA list represent progress towards these goals is, in my view, whistling past the graveyard.

    And as for Jonathan's generous remark:

    In fact, I feel like you've expressed well the argument that I'd want to submit as comments to the Working Group

    I know that at least one of them reads my blog ;-)

    -----
    yes... THAT George Boole... taken in Cork, at the end of the DCC meeting on persistent identifiers in 2004

    In fact, specifically, THIS George Boole: http://worldcat.org/identities/lccn-n83-144364 (thanks, Thom)

    Thrashing in the Fields

    There are clues that tell us that a 'dialog' is out of control on a listserv.  Mine are (1) nested inclusion brackets and (2) "X wrote...y wrote" on successive lines. Recent discussions on the RDA listserv have tumbled deeply into that territory.

    My contribution to the confusion includes the following assertions:

    There is exactly one candidate for a content model that captures the relations among salient bibliographic entities that are needed to anchor library assets in the larger information sphere: FRBR.  It feels roughly right to most, though it would be unwise to underestimate the time we can (ill) afford to spend on thrashing around in the details.

    There are, unhappily, several candidates for syntactical models (variously called schemas, data models, and abstract models). These models are indifferent to what is encoded; rather, they define the permissible structures that can be encoded (think of sentence diagramming).

    To choose an idiom foreign to the Web for such encoding will assure the irrelevance of library data on the open Web. Recasting MARC in XML is, in my estimation, exactly such a choice.  It masquerades as Web-friendly, but the result is simply more-parseable confusion for any but cataloging geeks.

    The strongest alternative candidate is the Dublin Core Abstract Model, born of a decade of wrangling about data models in the web-metadata context.  Please do not confuse the data model with the element set.  I am not suggesting supplanting MARC cataloging with DC.

    I am asserting that embedding the library in the open Web demands:

    1. A coherent model of what we are describing and the relationships among those entities, and in which each entity is identified with a URI (FRBR, or something very like it).
    2. A carrier syntax that lives comfortably on the Web (the DC Abstract Model is my candidate)
    3. Rules for populating agreed structures (that at which RDA seems to be failing so earnestly).

    There is some urgency in agreeing on (1) and (2) before (3) can be achieved.  The recent Library of Congress Report on the Future of Bibliographic Control has committed the heresy (for some) of suggesting that RDA work be suspended and FRBR be subjected to more rigorous testing in order to increase the prospects of achieving our Web-destiny. I'm not sure I'd go that far, but I am convinced that our objectives will not be met through wrangling on mailing lists.  A coherent, well-funded, community-grounded research and development program is in order.  All the innovative OPACs, Web-services, and Web-2.0 social networks will avail us not if we fail to achieve this coherence.
    -----
    DC mavens will recognize the 'sentence diagramming' metaphor as originating with Tom Baker
    -----
    An early morning view from my rooms with a view in Seattle

    October 05, 2006

    It's the Model, Stupid

    A short history of data modeling in the Dublin Core Metadata Initiative, and what it means for the future of cataloging

    The cardinal rule at the first Dublin Core meeting way-back-when was Thou Shalt Not Conflate Syntax and Semantics. A rule honored more in the breach than in the observance perhaps, but it emphasizes the importance of separating these two fundamental facets of communicating structured information. We were right about this… almost. The missing part of the picture was a sound underlying data model.

    We knew what we were trying to say, we thought we knew how to say it. How hard is it to describe the basic characteristics of a resource, after all? Title of resource is…. Creator of resource is…. We even spoke about the initiative in terms of grammars and evolving pidgin languages (simple, emergent grammars). We hobbled along with only a vague common understanding – a model implicit in the aggregate projects using Dublin Core, propagated through imitation, nowhere formally specified.

    It isn't as though we didn't try.  Early attempts to arrive at a formal data model were fraught with contention and even acrimony. Difficult meetings (Washington… Dublin… Crete), leading to small beachheads that, lacking broad consensus, soon washed away. Maybe it wasn’t so important? After all, people still used DC, we continued to attract adherents. DC was spreading… 25 languages, 50 countries. And AACR2-MARC, arguably the world’s most successful resource description standard, didn’t have a data model either… how bad a problem can it be?

    But the chickens come home to roost. The lack of a formal model led to a plethora of non-interoperable systems, crippling one of the foundation principles of the Initiative. It took 10 years for DCMI to finally evolve and adopt a formal model, and one might wonder whether simple exhaustion was a factor in its ultimate acceptance. It will be a long time before this model channels practice sufficiently to bring the many flavors of DC closer to the goal of sharing metadata across systems, let alone across various other metadata frameworks.

    The Dublin Core Abstract Model, led by Andy Powell and Pete Johnston, and later attracting the efforts of Mikael Nilsson in the cause of bridging the DCMI framework with that of IEEE LOM, is a hybrid distillation of ideas gleaned from library practice and the Semantic Web’s cornerstone technology, RDF. It reflects insights of emerging Web practice (the use of URIs as persistent identifiers, for example), and embraces lessons learned from a decade of early metadata adopters.

    Yesterday afternoon at DC-2006, Diane Hillmann presented a summary of progress (and… surprise… contention) associated with the RDA effort, what most people understand as the international revision of AACR2. From my own uninformed perspective, it appears that this effort suffers from much the same problem that we have had in the Dublin Core – a data model implicit in years of practice and rule-revision on top of rule-revision, resulting in a focus on the minutiae of rules rather than being guided by formal principles of description.

    The Web has forced us all out of isolated communities of practice and into the Internet Commons. Certainly the practice and topography of librarianship is changing out from under us. As we struggle under the stress of these changes, it is perhaps predictable that legacy systems such as cataloging practice will change even more slowly. The RDA effort recognizes the importance of updating our profession to fit more comfortably into the Internet Commons. If we are to achieve anything like the interoperability we hope for, we will need common structural models. If the effort devolves to simply unraveling existing rules and rewinding the yarns, we will fall short of the integration we need to support our future. The successes and failures of the DC community in its own modeling struggles can be useful… and reusable.  I gather that the Joint Steering Committee has sought consultation with representatives of the IEEE LOM metadata community as well as with DCMI. It would be fitting if the DCMI could return some value to the community that has provided so much of the insight that has motivated its own progress.

    Mikael Nilsson’s exhortation is on the mark. Less talk about metadata sets, and more talk about models. It is both difficult and important to get this piece right.

    ------

    Image: DC-2006 welcome reception

    October 03, 2006

    Modeling in Manzanillo

    Manzanillo, Mexico

    I gather the likes of Bo Derek and Ken Kesey frequented this town in earlier eras. This week, Mexico’s largest Pacific port is the site of Dublin Core 2006. Three hundred attendees from 20 countries are here to exchange ideas about what’s happening in metadata.

    I write this from a plenary session on metadata architectures. The topic is more often relegated to smoke-filled back rooms, and the smoke is from topical combustion rather than tropical combustibles.  No topic has generated more contention over the years than data modeling.

    Mikael Nilsson is the current speaker, and his session is entitled Towards an Interoperability Framework for Metadata Standards. Mikael is, for my money, one of a handful of people in the metadata arena who are indispensable (Andy Powell and Pete Johnston are also on my short list, and are also on the bill of fare for the session).

    Interoperability has been a prominent goal from the first days of Dublin Core, and while the initiative has succeeded in many of its early goals, this one remains difficult and elusive. The only hope for achieving it lies in aligning the underlying information structures by which we construct our metadata instances. Given that our own (the DC) community took a decade in explicating the Dublin Core Abstract Model, we don’t even have broad interoperability across Dublin Core systems, let alone with other metadata communities.

    Few communities have done any better (the code word for not having a data model is to say it is implicit… uh huh). The prospects of explicating and aligning these models across frameworks, then, are grim. The DCMI and IEEE Learning Object Metadata (LOM) communities have been trying to do this for years now. For most of that time, ‘working towards interoperability’ meant agreeing it would be nice if we could do it. In recent years, thanks largely to the efforts of Mikael, Andy, and Pete, there is earnest progress toward the goal, and that progress rests on the explication of differences in the DC Abstract Model and the model implicit in the LOM package. To understand the problems clearly and in detail is a start towards solving them. At this time, the trains are still running on different gauge tracks, and it is likely to be years before this will change.

    Part of Mikael’s message in this talk can be summarized by the suggestion that we talk less about metadata standards and schemas, and more about abstract models, syntax, metadata vocabularies, and application profiles. Success is in the details, and there are lots of them to manage. It can’t be done without common underlying models.

    -----

    Image: DC-2006 Hotel.  Fortunately, it's too hot to be outside during session hours.