David Madore's WebLog: On metadata

I'm slowly beginning to grasp the concept of metadata, so I thought I'd say a few words about it. In truth, the difficulty is not with the concept itself (there's nothing so complicated about it: metadata are just ancillary data that are not part of a document's content but describe the document itself), but with the strange and fascinating architecture that various organizations, notably the World Wide Web Consortium, have built around it, and with that bizarre language, RDF. One can easily read the RDF specs and guide (that is what happened to me) and still not have the slightest idea of what it's all about: it seems at once completely abstract and devoid of utility. (As a mathematician, I really shouldn't have any problem with notions that are abstract and devoid of utility, yet…) Still, RDF is not a content-free language, and although claims that it is the universal “semantic” language (whatever that may mean) are pompous and not very meaningful, it is an interesting idea.

The most basic use of metadata would be, say, in an HTML document: to indicate a list of keywords associated with the document, one might write <meta name="Keywords" content="foo, bar, baz, qux" />, for example. Or to indicate who wrote the document, <meta name="Creator" content="Doe, John" /> might be used. But who decides which names, like “Keywords” and “Creator”, are available? It could be, of course, a simple de facto list, with various names understood by various kinds of potential users. But there is a more formal aspect: metadata vocabularies can be defined and collected in so-called “profiles”. In HTML, the profile attribute of the head element specifies the metadata vocabulary profile that is used.
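To make this concrete, here is a minimal sketch of how a program might read such meta elements back out of a page, using only Python's standard html.parser module. The names MetaCollector, extract_metadata and SAMPLE are my own inventions for the illustration, and splitting the keywords on commas is an assumption about how the content string is meant to be read, not something HTML prescribes.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs and the <head> profile."""

    def __init__(self):
        super().__init__()
        self.profile = None  # value of <head profile="...">, if present
        self.metadata = {}   # meta name -> content string

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here, because
        # the default handle_startendtag delegates to handle_starttag.
        attributes = dict(attrs)
        if tag == "head":
            self.profile = attributes.get("profile")
        elif tag == "meta" and "name" in attributes:
            self.metadata[attributes["name"]] = attributes.get("content", "")


def extract_metadata(html):
    """Return (profile, {name: content}) for an HTML document string."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.profile, collector.metadata


# Hypothetical sample page built from the two meta elements quoted above.
SAMPLE = """<html><head>
<meta name="Keywords" content="foo, bar, baz, qux" />
<meta name="Creator" content="Doe, John" />
</head><body></body></html>"""

profile, metadata = extract_metadata(SAMPLE)
# Assumption: keywords are a comma-separated list inside one content string.
keywords = [word.strip() for word in metadata["Keywords"].split(",")]
print(profile, keywords, metadata["Creator"])
```

Note that the parser has no idea what “Keywords” or “Creator” mean: it only hands back strings, which is exactly why some agreed vocabulary is needed.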

One such profile is the Dublin Core element set, which specifies a small list of basic (“core”) metadata properties. The formal description of the Dublin Core namespace is an RDF file located at http://purl.org/dc/elements/1.1/. Dublin Core has no “Keywords” element as such: keywords go in the element called “Subject”. So one appropriate way to specify the keywords and creator for the document might be <head profile="http://purl.org/dc/elements/1.1/"> <meta name="Subject" content="foo, bar, baz, qux" /> <meta name="Creator" content="Doe, John" /> </head>; another way, which is recommended by RFC 2731, consists of using the <link> element as follows: <head> <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" /> <meta name="DC.Subject" content="foo, bar, baz, qux" /> <meta name="DC.Creator" content="Doe, John" /> </head>.
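The point of the <link rel="schema.DC"> declaration is that a prefixed name can be expanded mechanically into a full URI. The following is a rough sketch of that expansion, again with Python's standard html.parser. The class and function names are hypothetical, the sample uses DC.Subject (the Dublin Core element that conventionally carries keywords), and lowercasing the local part of the name to match the lowercase element names of the Dublin Core namespace is my own assumption rather than a rule I am quoting.

```python
from html.parser import HTMLParser


class PrefixedMetaResolver(HTMLParser):
    """Expand names like "DC.Creator" using <link rel="schema.DC" href=...>."""

    def __init__(self):
        super().__init__()
        self.schemas = {}  # prefix -> namespace URI
        self.raw = []      # (name, content) pairs in document order

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = attributes.get("rel") or ""
        if tag == "link" and rel.lower().startswith("schema."):
            prefix = rel[len("schema."):]
            self.schemas[prefix] = attributes.get("href") or ""
        elif tag == "meta" and "name" in attributes:
            self.raw.append((attributes["name"], attributes.get("content", "")))

    def resolved(self):
        """Return (identifier, content) pairs with known prefixes expanded."""
        pairs = []
        for name, content in self.raw:
            prefix, dot, local = name.partition(".")
            if dot and prefix in self.schemas:
                # Assumption: the namespace spells element names in lowercase.
                pairs.append((self.schemas[prefix] + local.lower(), content))
            else:
                pairs.append((name, content))  # no declared schema: keep as is
        return pairs


def resolve_prefixed_metadata(html):
    resolver = PrefixedMetaResolver()
    resolver.feed(html)
    resolver.close()
    return resolver.resolved()


# Hypothetical sample head in the RFC 2731 style.
SAMPLE_DC = """<head>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.Subject" content="foo, bar, baz, qux" />
<meta name="DC.Creator" content="Doe, John" />
</head>"""

for identifier, content in resolve_prefixed_metadata(SAMPLE_DC):
    print(identifier, "=", content)
```

Expansion is deferred to resolved() so that the order of the link and meta elements within the head does not matter.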

To exploit metadata to their full power, and to describe them commodiously, however, the RDF language is necessary. For example, to write the same metainformation as above in RDF, one would write, if I am not mistaken, <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/"> <rdf:Description rdf:about="http://www.somedomain.tld/some/uri/" dc:subject="foo, bar, baz, qux" dc:creator="Doe, John" /> </rdf:RDF> (keywords being carried by the Dublin Core element “subject”): this formally states that John Doe is the creator of http://www.somedomain.tld/some/uri/, which has keywords foo, bar, baz and qux. But RDF goes much beyond that: it is capable, for example, of making metastatements about the metadata themselves (as in “Jane Smith says that John Doe is the author of http://www.somedomain.tld/some/uri/”), or of defining the vocabulary that it uses (to some extent, of course: at some point it becomes necessary to express things in a natural language). Thus, the Dublin Core RDF vocabulary is itself expressed in RDF.
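What such an RDF description boils down to is a set of (subject, predicate, object) triples, and a metastatement is just more triples whose subject is a statement. Here is a small sketch of both ideas with Python's standard xml.etree.ElementTree. It is emphatically not a real RDF parser: it only understands rdf:Description elements with property attributes, which is the one form used above. The function names and the "_:statement1" identifier are hypothetical; the rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are RDF's own reification vocabulary.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.somedomain.tld/some/uri/"
                   dc:subject="foo, bar, baz, qux"
                   dc:creator="Doe, John" />
</rdf:RDF>"""


def triples_from_property_attributes(rdf_xml):
    """Return (subject, predicate, object) triples from rdf:Description attributes."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.iter("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for qname, value in description.attrib.items():
            # ElementTree spells namespaced names as "{namespace}local".
            if qname.startswith("{" + RDF_NS + "}"):
                continue  # rdf:about and the like are syntax, not properties
            if qname.startswith("{"):
                predicate = qname[1:].replace("}", "", 1)
                triples.append((subject, predicate, value))
    return triples


def reify(triple, statement_id):
    """Describe one triple as an rdf:Statement so that it can be talked about."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF_NS + "type", RDF_NS + "Statement"),
        (statement_id, RDF_NS + "subject", subject),
        (statement_id, RDF_NS + "predicate", predicate),
        (statement_id, RDF_NS + "object", obj),
    ]


triples = triples_from_property_attributes(RDF_XML)
for triple in triples:
    print(triple)

# Pick the "creator" statement and reify it under a hypothetical identifier.
creator = [t for t in triples if t[1].endswith("/creator")][0]
for triple in reify(creator, "_:statement1"):
    print(triple)
```

To say “Jane Smith says that…”, one would then add one more triple whose object is the statement identifier, using whatever property one has chosen for “says”.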

A strange and fascinating architecture, but beautiful in its way! I guess I should slowly start attaching some correctly labeled metadata to these pages.