In recent Friendfeed and Twitter posts, Chris Rusbridge (Director of the Digital Curation Centre at Edinburgh) raises some good questions on the topic of persistent data citation. First, on the value of the DOI and using the DataCite model:
What are the advantages of DOIs as dataset identifiers in citations? The premise of DataCite is that DOIs are important, and will let us link documents and data in a sensible way. DOIs imply metadata, and that’s good. But articles (small, stable things whose publisher may occasionally change identity) are different from data (multi-scalar, highly…
Then, as to whether using the DataCite model “breaks” linked data:
Does a DOI for data (eg in data citation) break #linkeddata rule? Should we use HTTP URI instead? Is dx.doi.org ok?
My simple answer is that DataCite and linked data — or, more to the point, the DOI and linked data — are in essence made for each other. A longer answer is that the DOI infrastructure provides conveniences, such as multiple resolution, and certain advantages, such as security, when referencing and accessing scientific and other datasets. The bottom line is that while the DOI infrastructure does depend on the non-HTTP protocols of the Handle System “under the hood,” from the consumer’s perspective DOI-based name resolution can (and usually does) operate entirely within web space. For linking to articles or datasets, the more familiar URI form of a DOI, which combines the DOI name with the URL of a Handle System proxy (e.g. http://dx.doi.org/10.1109/MIC.2009.93), may be used instead of the “native” DOI form.
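To make the two forms concrete, here is a minimal sketch (my own illustration, not an official DOI library) of combining a native DOI name with a Handle System proxy URL to produce the web-linkable form:

```python
# Illustrative sketch only: combine a "native" DOI name with the URL of
# a Handle System proxy to get the URI form used for web linking.
def doi_to_proxy_uri(doi, proxy="http://dx.doi.org"):
    """Return the HTTP proxy URI for a native DOI name."""
    return f"{proxy}/{doi}"

print(doi_to_proxy_uri("10.1109/MIC.2009.93"))
# → http://dx.doi.org/10.1109/MIC.2009.93
```

Any HTTP client that dereferences the resulting URI is doing DOI resolution without ever speaking the Handle protocol itself.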
Background: DOI… The Digital Object Identifier is a global system for persistently identifying objects in the digital environment. DOI® names may be assigned to any entity and are used to provide current information about objects, including URIs that can be resolved to manifestations of the resources and/or their metadata. Information about DOI-named digital objects may change over time, including where they may be found, but their DOI names will not change. The DOI System therefore provides a framework for persistent identification, managing intellectual content, managing metadata, linking customers with content suppliers, facilitating electronic commerce, and enabling automated management of media. DOI names can be used for any form of management of any data, whether commercial or non-commercial.
Background: DataCite… Perhaps the best way to think about the DataCite initiative and its DOI registration agency, the German National Library of Science and Technology (TIB), is that it brings to datasets what CrossRef has brought to journal articles for a decade. Since 2005 TIB has registered around 600,000 research datasets with DOI names, allowing easy access and improving citability. DataCite’s objectives are to make it easier for scientists to access research data over the Internet, to increase acceptance of research data as citable scientific objects in their own right (aka first-class objects), and thus to ensure that the rules of good scientific practice continue to be adhered to. See also DataCite: A global registration agency for research data (Jan Brase, 2009)
The DOI and Linked Data: I’ve been heard saying that “I’ve been active in the DOI community since before it was the DOI,” and more than 10 years ago I worked with friends and colleagues in the publishing industry, at CNRI and at the IDF on ways to leverage the DOI’s multiple resolution capabilities to supply a variety of object-specific metadata types. Our concept, first demonstrated on the floor of the Frankfurt Book Fair in 1998, was to store persistent, object-specific URI queries against a variety of type-specific datasets (descriptive metadata, rights metadata, bibliographic metadata, etc.) in each object’s individual DOI record. Then, in theory, any third-party application or service that had the DOI “in hand” could readily access whatever registered, “preferred” metadata sets had been associated with the object, each supplied by a different, authoritative vendor. Note: This approach possibly varies from the open world assumption inherent in the Semantic Web and linked open data worlds, and merits a future post!
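The multiple-resolution idea can be sketched as a toy data structure (a hypothetical illustration, not actual Handle System code or record syntax): a single DOI record holds several typed values, and a client holding only the DOI picks the value type it needs.

```python
# Toy model of multiple resolution: one DOI record carrying typed values,
# each pointing at a different authoritative metadata supplier.
# All URLs here are hypothetical.
HANDLE_RECORDS = {
    "10.1109/MIC.2009.93": {
        "URL": "http://publisher.example/landing/MIC.2009.93",
        "DESC_METADATA": "http://vendor-a.example/desc/MIC.2009.93",
        "RIGHTS_METADATA": "http://vendor-b.example/rights/MIC.2009.93",
    }
}

def resolve(doi, value_type="URL"):
    """Return the registered value of the requested type for a DOI."""
    return HANDLE_RECORDS[doi][value_type]

print(resolve("10.1109/MIC.2009.93", "RIGHTS_METADATA"))
```

The point of the design is that the record, not the client, is the authority on where each metadata type lives, so suppliers can change without breaking consumers.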
Since its inception the DOI has seemed to me to be an elegant way to manage and deliver highly-distributed metadata supply chains. In January 1998 I wrote in “Requirements for DOI-Based Applications and Services” that…
…there will be rich metadata sets available upon which to base these services. For these assumptions to prevail on an industry-wide basis, content producers and administrators must have a framework available that facilitates metadata collection at various stages of the production process, as well as post-production maintenance. This framework would establish essential (or “standard”) metadata elements upon which services are based, with each class of service having its own requisite set of elements. To allow for differentiation in quality of service between providers, the framework must also allow for the introduction of unique or value-added metadata…A point of useful discussion in the future should concern what metadata is proprietary, what should be left ‘open’ for universal exchange, and how do we implement protocols for metadata exchange. This latter discussion has in fact begun, as the W3C takes up work on a resource description framework (RDF)…
…At which point I apparently was kidnapped by aliens, to return 10 years later when this vision was actually made practical and scalable by way of linked data principles!
Seriously, the idea of applications and services consuming data from a variety of sources is old hat for the linked data community but has been gaining momentum in other worlds, including the library community. For example, Todd Carpenter of NISO summarized current developments in the standardization of metadata supply chains in his February 2009 Against the Grain column Transforming Metadata. And Renee Register (OCLC) created a fabulous presentation on multi-sourced metadata and interoperability in From ONIX to MARC and Back Again: New Frontiers in Metadata Creation at OCLC (ALA Midwinter 2009).
I welcome any comments, questions and especially corrections you might have on this topic!
Update (08 Feb 2010): Interested readers should read my recent post DOIs, URIs and Cool Resolution to understand some subtle issues regarding how the Handle System’s HTTP proxies do content negotiation, and how this might affect their use in linked data models. Subtle, but very important!
Update (28 Apr 2011): Great news! This week CrossRef, the IDF and CNRI announced the completion of their implementation of Content Negotiation for CrossRef DOIs. As that post describes, it is an implementation of “Option D” in last year’s CrossTech post, DOIs and Linked Data: Some Concrete Proposals. Ed Summers provides a great explanation of the significance of this in his recent INKDROID post, DOIs as Linked Data.
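As a rough sketch of what that content negotiation looks like from a client’s perspective (assuming Python’s standard urllib; the media type and proxy host are examples, and an actual fetch requires network access):

```python
import urllib.request

def doi_metadata_request(doi, accept="application/rdf+xml",
                         proxy="http://dx.doi.org"):
    """Build a content-negotiated HTTP request for a DOI's metadata.

    With an Accept header naming a metadata media type, the proxy can
    return machine-readable metadata instead of redirecting to the
    publisher's landing page.
    """
    return urllib.request.Request(f"{proxy}/{doi}",
                                  headers={"Accept": accept})

# To actually fetch (network required):
# with urllib.request.urlopen(doi_metadata_request("10.1109/MIC.2009.93")) as r:
#     print(r.read().decode())
```

The same URI thus serves both humans (browser, HTML landing page) and machines (RDF or other metadata), which is exactly the behavior linked data expects of an HTTP identifier.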
And now there’s the DOI Linked Data Service from CrossTech! I’m glad those aliens let you go 🙂
By: Ed Summers on April 28, 2011 at 12:19 am