
    February 18, 2008

    Metadata: Semantics; Structure; Syntax

    Peter Murray, aka the Disruptive Library Technology Jester, posted an encapsulated history of the origins of the Dublin Core, and observed that he is still

    trying to reconcile what differences exist between RDF and the DCAM based on these postings and comments from Stu’s blog.

    I'm glad that people are engaged in trying to sort this out, even as I'm unhappy that it's still unclear at this late date. That it still IS unclear is incontrovertible (look at the caliber of people trying!). I'm not very confident at this point that I can wash away the confusion, but it does seem potentially useful to reprise a part of the metadata talk that I used to give a lot.

    Sharing metadata requires agreements on three topics:

    1. Semantics: what is the meaning we are trying to convey in metadata assertions?  Meaning, of course, resides in the minds of people, not machines.  The focus of the Dublin Core effort has been to promote those shared meanings... and make them sharable.  The semantics bit is about agreeing about elements: author, publisher, date, etc.
    2. Syntax: how do you take a set of metadata assertions and pack them so that one machine can send them to another, where they can be unpacked and parsed by machine logic or displayed and read by a person, with high probability that the meaning of the assertions travels unchanged from one mind to another? RDF documents refer to serialization... the order of bits in a stream... actually putting the stuff 'on the wire.' (The careful readers and jaded among you may wonder why I changed the order of exposition from the title of this post. Best for last? No... hardest.)
    3. Structure: You can't do syntax reliably unless you have unambiguous structure.  The sorts of things you have to specify in a well-structured metadata assertion (not an exhaustive list):
    • The boundaries of a set of assertions (what constitutes a record)
    • Cardinality - Can an element be repeated, and if so, is there a limit on the number?
    • How is a name structured? What is the delimiter separating elements of a compound name (Prince and Bono excepted, most names are compound structures, many with surprising and confounding complexity).
    • How is nesting managed?
    • How are dates encoded? YYYY-MM-DD? DD-MM-YYYY? MM-DD-YYYY?
    • How does one identify the encoding scheme that answers the question above?
    • How does one identify a value encoding scheme (e.g. LCSH, MeSH, Dewey) from which metadata values can be chosen? Are such schemes required or optional?
    • Are metadata values specified by reference (URI) or by value (literal strings)?
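
    To make the list above concrete, here is a minimal sketch of what pinning down a few of these structural choices could look like. It is not the DCAM or any DCMI-specified format. The class names (Value, Record), the cardinality table, and the scheme label "ISO8601-date" are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Value:
    """One metadata value, given by value (literal) or by reference (uri)."""
    literal: Optional[str] = None
    uri: Optional[str] = None
    scheme: Optional[str] = None  # hypothetical label for an encoding scheme


@dataclass
class Record:
    """A bounded set of assertions about a single resource."""
    subject: str
    statements: dict = field(default_factory=dict)

    def add(self, element: str, value: Value) -> None:
        self.statements.setdefault(element, []).append(value)


# Hypothetical cardinality rules: element -> (minimum, maximum); None = no limit.
CARDINALITY = {"title": (1, 1), "creator": (0, None), "date": (0, 1)}


def validate(record: Record) -> list:
    """Return a list of human-readable structural problems (empty if none)."""
    problems = []
    for element, (lo, hi) in CARDINALITY.items():
        n = len(record.statements.get(element, []))
        if n < lo:
            problems.append(f"{element}: expected at least {lo}, found {n}")
        if hi is not None and n > hi:
            problems.append(f"{element}: expected at most {hi}, found {n}")
    for v in record.statements.get("date", []):
        # Only check dates that declare the (assumed) YYYY-MM-DD scheme.
        if v.scheme == "ISO8601-date":
            try:
                date.fromisoformat(v.literal)
            except (TypeError, ValueError):
                problems.append(f"date: {v.literal!r} is not YYYY-MM-DD")
    return problems


rec = Record("http://example.org/post")
rec.add("title", Value(literal="Metadata: Semantics; Structure; Syntax"))
rec.add("date", Value(literal="18-02-2008", scheme="ISO8601-date"))
print(validate(rec))
```

    The point of the sketch is only that each bullet becomes an explicit, checkable decision: the record boundary is the Record object, repetition limits live in one table, and a date that declares a scheme can be tested against it instead of guessed at.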

    Most of these issues are not addressed in RDF. They can be, of course... but without agreements about how to do so, people tend to do them this way and that, leaving us without the ability to share data effectively. This is where the Dublin Core Abstract Model (DCAM) comes in, as it specifies how to structure these sorts of things in a way that makes the data sharable.

    Is it perfect and generalizable?  No... its authors, in comments on my posts, have made evident that they make no such claim. Is it the best that is available for descriptive metadata?  I assert that it is, and that efforts to work towards an Uber-Metadata-Model should start with this effort and simplify or complexify as is necessary and sufficient to assure that metadata  can be shared across communities.

    One last point.  DCAM is articulated in the vernacular of RDF, but the structure that it creates is independent of RDF.  If RDF passes into the graveyard  of once-or-never-mighty technologies, the abstractions it (DCAM) declares survive quite nicely.  Syntax independence: a goal we strove for from day 1 of the first DC metadata workshop.  It is a worthy metadata engineering principle.
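
    Syntax independence can be illustrated with a small sketch: one abstract description, written out in two unrelated syntaxes and read back to the same structure. This is an illustration only, not a DCMI-defined encoding. The dictionary layout and the XML element names ("description", "statement") are assumptions made up for the example.

```python
import json
import xml.etree.ElementTree as ET

# The abstract structure: one resource, an ordered list of (element, value) pairs.
description = {
    "about": "http://example.org/post",
    "statements": [
        ["title", "Metadata: Semantics; Structure; Syntax"],
        ["date", "2008-02-18"],
    ],
}


def to_json(desc: dict) -> str:
    return json.dumps(desc, sort_keys=True)


def from_json(text: str) -> dict:
    return json.loads(text)


def to_xml(desc: dict) -> str:
    # Hypothetical XML layout; any agreed layout would serve equally well.
    root = ET.Element("description", {"about": desc["about"]})
    for element, value in desc["statements"]:
        child = ET.SubElement(root, "statement", {"element": element})
        child.text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> dict:
    root = ET.fromstring(text)
    return {
        "about": root.get("about"),
        "statements": [[s.get("element"), s.text] for s in root.findall("statement")],
    }


# Both syntaxes carry the same structure across the wire and back.
print(from_json(to_json(description)) == description)
print(from_xml(to_xml(description)) == description)
```

    If either syntax fell out of use, the description itself would be untouched; only the pair of functions that packs and unpacks it would need replacing.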

    To sum up: Defining semantics is a political process of reaching consensus. Syntax is arranging the bits reliably so they travel comfortably between computers (RDF is a fine way to do this, but by no means the only way). Structure is the specification of the details necessary to lay out and declare metadata assertions so they can be embedded unambiguously in a syntax. A data model is the specification of this structure.
    -----
    I was influenced to include semicolons in the title of this post by an article in today's NYTs, forwarded to me by Marguerite.  I LIKE semicolons, even if they are stodgy.
    -----
    Wary Ibises (or something like them) in Barwon Heads, Australia


    Comments

    Andy, thanks. The idea of layering on some useful concepts on top of RDF seems to me like a potentially valuable thing. But I think we need to be more specific about what those things are, and how they can serve as an additional layer on top of RDF. Others in the DC community don't necessarily share this understanding of the DCAM/RDF relationship it seems--but if this were the relationship it seems like a rational one to me.

    On the other hand, that DCAM is just like RDF but more understandable to a certain community--is to me a non-starter. Especially because it doesn't seem particularly more understandable to me (depending on what you mean by 'the dc community', I think there may be large parts of the DC community which "don't get DCAM" any more than they do RDF.) But even if it were somehow more understandable, it doesn't seem like a valid justification for the inter-operability and inefficiency-of-intellectual labor (since both these frameworks are still works in progress) issues. I understand why you guys who have been involved with DCAM and/or RDF for so long want to say it, and I understand how it might explain why DCAM and RDF both came to be in historical context---but to me, "does so using language, terminology and concepts that are understandable by that community" still seems like a poor reason to have a separate-but-parallel metadata control regime.

    I think it does to Stu too, which is why Stu is focusing on the "DCAM does something at a more abstract level than RDF" (which is related but not the same as your "DCAM adds something on top of RDF") argument, rather than the "DCAM is pretty much like RDF, but it came out of a different community and that alone is justification for it to continue to exist separately" argument.

    I'm coming late to this discussion (for which I apologise) and I'm probably going to repeat stuff (for which I also apologise).

    I want to briefly address the question, "why do we need the DCAM when we already have RDF?".

    The answer, for me at least, lies in the history of the DCMI.

    In short, the DCAM tries to capture the set of metadata functional requirements (functional requirements isn't quite the right word here - but I can't think of anything better) that grew out of many years of debate within the DC community. Moreover, it does so using language, terminology and concepts that are understandable by that community.

    Whether it succeeds is another matter of course - and the fact that the kinds of questions being debated here exist at all, probably means we haven't totally succeeded? Whatever... that was our intention.

    DC and RDF have a long shared history. It hasn't always been plain sailing, at least in the sense that large parts of the DC community didn't get RDF and may well never get RDF.

    The DCAM was an attempt to capture that community's metadata needs in a way that was understandable by the community. RDF, on its own, didn't succeed in doing that - at least not IMHO.

    I think also that there are subtleties in the DCAM (the bounded nature of a description for example) that are pretty much fundamental to the DC community's understanding of the nature of metadata but that are not present in the RDF model per se. So, in a sense, the DCAM is a layering of some useful (at least to the DC community) additional concepts on top of the raw RDF model.

    It's not easy to capture what I want to say here, and I'm not sure that I've succeeded, but I hope this helps a little.

    I've not really followed this whole discussion/debate, but as someone who has for a long time been scratching my head about the DCAM, I'd just like to offer some gentle encouragement to Mikael's position on all of this.

    For the record -- from the sidelines -- it is somewhat comforting to see that it all hasn't been figured out yet. ;-)

    Well, I agree and I disagree somewhat.

    The Semantics, Syntax, Structure distinction is useful and to the point.

    However, I think there is more to the Semantics leg than apparent at a first glance. Sure, the human semantics of metadata terms is essential. But RDF also comes with formal semantics - the RDF Semantics specification, and thus rules for drawing automatic conclusions.

    It is my firm belief that the machine-processable semantic foundation that RDF provides is far more important than just the simple triple pattern (which is certainly very important).

    And this is where I see the DCAM lacking, currently - it has no strong formal semantic foundation.

    Inventing a new one especially for the DCAM would be a tremendous waste of time. Thus my assertion: the DCAM needs to be based firmly on RDF Semantics.

    That the DCAM provides a slightly different view of the structure of metadata records than pure RDF is fine then, as long as the semantic foundation is the *same*.

    I therefore disagree strongly that the DCAM is, can be or should be independent of RDF. Independent of the RDF structure and syntax, possibly, but not of the RDF semantics.

    So therefore, I'd state that there are *four* aspects, not three:

    * Formal semantics
    * Vocabulary (what Stu calls semantics)
    * Structure
    * Syntax

    Ah -- now it feels like we're getting somewhere. ("Feels like" at least -- it is too early to tell if we're actually getting somewhere.) Thank you for laying out the vocabulary for our discussion.

    In this discussion I'm trying to think beyond RDF. Which is to say that the definition of an RDF triple -- subject, predicate and object -- is so blindingly simple that it is hard to see past. (With the exception of adding another element -- context, or where the triple came from -- to form a quad. This is something we ran up against in the ORE technical committee as needed in some pretty important use cases.) And within that definition, using the URIs of DCMI Metadata Terms as predicates that carry the semantics of what we humans agree upon makes perfect sense. It is so obvious that it is hard to consider doing without the RDF concepts. It is sort of like thinking of doing without HTTP and the web architecture, now that it has been invented and firmly embedded in the network. (Okay, RDF isn't as embedded as HTTP, but I think you'll get where I'm heading.)

    So I think I get the reasoning behind constructing the DCAM. Still, so much of it is (at least) borrowed from RDF that it makes it difficult to know where one stops and the other begins. For the DCAM authors that may be reading this, it would be helpful to know where those boundaries are. I would guess that knowledge of RDF factored into the creation of DCAM; in the stark isolation of the DCAM specification itself it is difficult to pick up on the nuances of that discussion. Even if one feels the need to define the whole model, to include all or some large subset of what comes from RDF, explicitly describing the overlap in the definition document or some accompanying usage guide would be very helpful.

    P.S.: I like semicolons, too.

    Thanks, this is helpful.

    Except... a naive response is that syntax and structure are certainly addressed by RDF. I mean, certainly RDF, at least standard RDF-in-XML, has both a specified syntax and structure. Right?

    So can you clarify what aspects of syntax and structure of a metadata package are NOT addressed by RDF?
