Bitwacker Associates

Posted by: John Erickson | June 16, 2010

Regarding the Singularity

A recent set of articles in the New York Times and elsewhere, including the Kurzweil book, prompted a friend to ask me for my thoughts on the Singularity Movement. Here is an excerpt of the email I wrote:

Regarding the Singularity Movement, I think economic arguments such as that presented by Robin Hanson in IEEE Spectrum (2008) carry more weight than the gushing futurist predictions from the likes of Ray Kurzweil. In the Spectrum article Hanson cites two previous singularities — the agricultural and industrial revolutions — and suggests that a revolution in machine intelligence is leading to a third that will take shape over the next half-century.

I tend to take most of what futurists say with a grain of salt, because they rely on a belief/assumption/confidence that the introduction of disruptive technologies into a society yields predictable results — for good or bad — which never happens. The combination of factors including technologies being human constructions, the fact that we as humans never make completely rational decisions, and the fact that all of this takes place within a fundamentally chaotic, only approximately predictable context, means that we simply cannot know what will happen in the future!

Here’s what I know: We humans are wired to build and use tools and, to the extent possible, adapt to the environments we build, or die trying. Google, while amazing, is still a tool; an engineered system that (given enough time) I can explain to you. Ironically enough, the reason Google works so well is that it’s actually based on simpler but more fundamental principles than the systems which preceded it, closer to how naturally-occurring networks emerge and function. But the way Google has been adopted and applied in the “ecosystem,” while making sense in hindsight, could not have been predicted.

I’m currently reading Jonah Lehrer’s How We Decide, a wonderful exploration of the biochemistry of how we make decisions. Any such discussion naturally must touch on how various imbalances (e.g. of dopamine) affect that process, and how well-intentioned efforts by doctors to counteract certain imbalances lead to very unexpected and usually undesired results.

Lehrer’s book makes it profoundly clear that we never know for certain what will happen when we diddle with the decision-making processes in our brain, whether it involves extending the lower levels of the nervous system (the sensory level) or the higher-level processes. Researchers do know that we seem to adapt well to lower-level extensions such as neural prosthetics, but each higher-level process involves a synaptic algorithm that we don’t completely understand, mostly because our brain is a distributed system whose “result” is emergent, not a single “algorithm.”

That ultimately is my point: our brains are distributed systems that exhibit adaptive and unpredictable behaviors, and we can’t begin to understand what will happen when we explore higher-level prosthetics based on “intelligent machines.” Something will happen, but there is no reason to believe it will lead to either a Utopian or Dystopian existence any more than the agricultural or industrial revolutions resulted in one or the other. Indeed, the introduction of those practices to certain natural and economic ecosystems led to both regional successes and catastrophes.

For Further Information:

The Singularity University
Merely Human? That’s So Yesterday. New York Times (11 June 2010).
Economics Of The Singularity IEEE Spectrum (May 2008)
Singularity articles at http://www.kurzweilai.net
On Intelligence, the companion site to Jeff Hawkins’ provocative book of the same name. The book introduces the concept of Hierarchical Temporal Memory (HTM) based on a layered hierarchical model of how the neocortex functions.

Posted in Big Ideas, singularity movement

Posted by: John Erickson | May 28, 2010

Concerning the King Arthur Flour Expansion

Recently the King Arthur Flour Company, a global provider of quality baking supplies based in my home town of Norwich, Vermont, proposed an expansion that would include a sewer extension. This issue is being debated locally, and I thought it would provide good fodder for my blog… John Erickson

Since Jill and I moved to Norwich some 18 years ago, I’ve been troubled by what seems like a lack of support for sustainable economic development within our town. I’m proud that Norwich has a high-quality global company “like” King Arthur based here, a company that is employee-owned, successful and growing. At the same time I’m embarrassed that Norwich isn’t doing more to sustain the economic well-being of the Upper Valley.

Fifteen years ago this month, my partners and I began the process of launching a company called NetRights. Loving Norwich and Vermont, I had a vision of starting a sustainable high-tech company that would be based here and would create local jobs. The inevitable question of where to base our company arose; being the Vermonter in the mix and drinking from the KoolAid of iconic successes like Green Mountain Gringo, I argued for us to set up offices in Norwich, Wilder or WRJ. My co-founders thought this was ludicrous; not only did they envision the (obvious to them) negative tax implications, but they also perceived no end of difficulty with infrastructure, etc. Since they had been successful with a previous Lebanon-based software startup, I went along for the ride and we set up shop in downtown Lebanon.

But I wouldn’t give up that easily. At one point Vermont eTV (remember them?) had a call-in with Gov Dean’s youthful, energetic director of economic development. Vermont had recently provided incentives for ETI’s expansion, and my direct question to “Slick” was: what can Vermont do to keep companies like ours in Vermont? Or were my co-founders right that there weren’t any incentives to lure us to Vermont? His answer: regrettably, yes, my co-founders were right. If we needed money for bricks-n-mortar expansion to grow a widget-building business, yes, but since we were “knowledge-based,” nothing. Frankly, I was shocked, since this was during the same period that Gov Dean (who I’m a fan of!) was roaming the state advocating green high-tech businesses in cabins on mountaintops…

I’ve bored you with this ancient history in order to provide some context as to why I believe the citizens of Norwich should greet initiatives such as King Arthur’s with the question, what can we as neighbors do to help? Their opening proposal may or may not be ideal — I’m not saying “Roll over, little Norwich!” — but I do believe it is our responsibility to do what we can to foster economic development in this town, and this includes hearing their plans with an open mind.

I’m tired of Norwich not merely depending on, but assuming that other towns in the region will feed our hungry, host our homeless, pay our salaries, sell us our auto parts. Instead, we should be asking how we can help those among us with the initiative to bring it on home to Norwich…

Disclaimer: I am not affiliated with King Arthur Flour, but I do confess to loving their products and have been known to roam their jobs portal from time to time…

Posted in Big Ideas, politics

Posted by: John Erickson | March 28, 2010

Long Tails and “Scaling Down” Linked Data Services

This post first appeared in November, 2009 in the Blogger version of this blog. It is updated here as I believe it introduces points relevant to Leigh Dodds’ recent post, Enhanced Descriptions: “Premium Linked Data.” I’ve freshened it as appropriate based on progress since November…

Chris Anderson’s newest book FREE: The Future of a Radical Price received some attention this summer, but I’ve actually been meditating on principles he laid out three years ago in his blog post, Scaling up is good. Scaling down is even better. In that post he marveled at Google et al.’s ability to scale down, to run themselves efficiently enough to serve users who generate no revenue at all. Anderson’s principles offer guidance on conducting business such that, even if only a tiny percentage of one’s visitors “convert” into paying customers, one can still achieve big-time profitability by ensuring that this small percentage is of a very large number.

My goal with this post is to consider how these ideas might be applied to the domain of Linked Data, and specifically how they pertain to the provision of unique data that adds real value to the greater “Web of Data.”

In his blog Anderson gives us four keys to scaling down: Self-service, “Freemium” services, No-frills products and Crowdsourcing…

1. Self-service: give customers all the tools they need to manage their own accounts. It’s cheap, convenient, and they’ll thank you for it. Control is power, and the person who wants the work done is the one most motivated in seeing that it’s done properly.

“Self-service” applies to linked data services in oh-so-many ways! Self-service in this case is not as much about support (see “Crowdsourcing,” below) as it is about eliminating any and all intervention customers might need to customize or specialize how services perform for them. In principle, the goal should be to provide users with a flexible API and let them figure it out, with the support of their peers. Ensure that everything is doable from their side, and step out of the way.

Note #1 (29 Mar 2010): A great recent example of this is the OpenVocab Project, launched by Ian Davis of Talis. OpenVocab “enables anyone to participate in the creation of a open and shared RDF vocabulary. The project uses wiki principles to allow properties and classes to be created in the vocabulary.”

The (negative) corollary is this: if an organization must “baby sit” its customers by providing specialized services that require maintenance, then they own it and must eat the cost. If instead they allow specializations to be a user-side function, their users own it. But the users won’t be alone; they’ll have the support of their community!

2. “Freemium” services: As VC Fred Wilson puts it, “give your service away for free, possibly ad supported but maybe not, acquire a lot of customers very efficiently through word of mouth, referral networks, organic search marketing, etc, then offer premium priced value added services or an enhanced version of your service to your customer base.” Free scales down very nicely indeed.

There are any number of ways providers might apply this concept to the linked data world:

Free Access	Premium Access
Restricted vocabulary of assertions	Full access, all assertions
Limited query rate	Unlimited query rate
Limited query extent	Unlimited query extent
Limited data size	Unlimited data size
Read-only	Term upload capability
Narrow reuse rights	Broad reuse rights
Community support	Private/dedicated support
…	…

Note #2 (29 Mar 2010): In his recent post Enhanced Descriptions: “Premium Linked Data”, Leigh Dodds provides a great freemium/premium example: a base dataset provided for free, and an enhanced set provided at a premium and exposed via his proposed ov:enhancedDescription vocabulary term, which he defined in OpenVocab.

Note #3 (29 Mar 2010): Derek Gordon just pushed out a great piece, The Era Of APIs, that argues “APIs are at work reshaping the ways in which we understand search today, and will challenge our profession to stretch, grow and change significantly in the coming years.”

3. No-frills products: Some may come for the low cost, others for the simplicity. But increasingly consumers are sophisticated enough to know that they don’t need, or want to pay for premium brands and unnecessary features. It’s classic market segmentation, with most of the growth coming at the bottom.

In the linked data world, achieving “no frills” would seem easy because by definition it is only about the data! For linked data a “frill” is just added complexity that serves no purpose or detracts from the utility of the service. Avoid any temptation to gratuitously “add value” on behalf of customers, such as merging your core graph with others in an attempt to “make it easy” for them. Providers should also avoid “pruning” graphs, except in the case of automated filtering in order to differentiate between Freemium and Premium services.

Note #4 (29 Mar 2010): Providers should weigh this very carefully. It might well be that a “merged” graph truly is a value-added service to users, for which they are willing to pay a premium. My point is simply to avoid the gratuitous and respond to customer needs!

4. Crowdsourcing: From Amazon reviews to eBay listings, letting the customers do the work of building the service is the best way to expand a company far beyond what employees could do on their own.

By now it is not only obvious, but imperative that providers foster the development of communities within and around their services. Usually communities are about evangelism, and this is certainly true for linked data providers, but increasingly service providers realize that well-groomed communities can radically reduce their service costs.

Linked data providers should commit themselves to a minimum of direct support and invest in fostering an active community around their service. Every provider should have a means for members of their community to support each other. Every provider should leverage this community to demonstrate to potential adopters the richness of the support and the inherent value of their dataset.

Finally: In a thought-provoking post Linked Data and the Enterprise: A Two-way Street Paul Miller reminds the skeptical enterprise community that they, not merely their user community, will ultimately benefit from the widespread use of their data, and when developing their linked data strategy they should consider how they can “enhance” the value of the Web of Data, for paying and non-paying users alike:

…[A] viable business model for the data-curating Enterprise might be to expose timely and accurate enrichments to the Linked Data ecosystem; enrichments that customers might pay a premium to access more quickly or in more convenient forms than are available for free…

I’ve purposely avoided considering the legal and social issues associated with publishing certain kinds of enterprise data as linked data (see also this), which I addressed in a post, Protecting your Linked Data, on the Blogger version of this blog…

Posted in Big Ideas, linked data | Tags: chris anderson, crowdsourcing, freemium, linked data, long tail, paul miller, reuse rights, semantic web, web of data

Posted by: John Erickson | March 9, 2010

“This linked data went to market…wearing lipstick!?!”

Paraphrasing the nursery rhyme,

This linked data went to market,
This linked data stayed open,
This linked data was mashed-up,
This linked data was left alone.
And this linked data went…
Wee wee wee all the way home!

In his recent post Business models for Linked Data and Web 3.0 Scott Brinker suggests 15 business models that “offer a good representation of the different ways in which organisations can monetise — directly or indirectly — data publishing initiatives.” As is our fashion, the #linkeddata thread buzzed with retweets and kudos to Scott for crafting his post, which included a very seductive diagram.

My post today considers whether commercial members of the linked data community have been sufficiently diligent in analysing markets and industries to date, and what to do moving forward to establish a sustainable, linked data-based commercial ecosystem. I use as my frame of reference John W. Mullins’ The New Business Road Test: What entrepreneurs and executives should do before writing a business plan. I find Mullins’ guidance to be highly consistent with my experience!

So much lipstick…
As I read Scott’s post I wondered, aren’t we getting ahead of ourselves? Business models are inherently functions of markets — “micro” and “macro” [1] — and their corresponding industries, and I believe our linked data world has precious little understanding of the commercial potential of either. Scott’s 15 points are certainly tactics that providers, as the representatives of various industries, can and should weigh as they consider how to extract revenue from their markets, but these tactics will be so much lipstick on a pig if applied to linked data-based ecosystems without sufficient analysis of either the markets or the industries themselves.

To be specific, consider one of the “business models” Scott lists…

3. Microtransactions: on-demand payments for individual queries or data sets.

By whom? For what? Provided by whom? Competing against whom? Having at one time presented to investment bankers, I can say that “microtransactions” is no more of a business model for linked data than “Use a cash register!” is one for Home Depot or Sainsbury’s! What providers really need to develop is a deeper consideration of the specific needs they will fulfill, the benefits they will provide, and the scale and growth of the customer demand for their services.

Macro-markets: Understanding Scale
A macro-market analysis will give the provider a better understanding of how many customers are in its market and what the short- and long-term growth rates are expected to be. While it is useful for any linked data provider, whether commercial or otherwise, to understand the scale of its customer base, it is absolutely essential if the provider intends to take on investors, because they will demand credible, verifiable numbers!

Providers can quantify their macro-markets by identifying trends, including demographic, socio-cultural, economic, technological, regulatory and natural ones. Judging whether the macro-market is attractive depends upon whether the trends work in favour of the opportunity.

Micro-markets: Identifying Segments, Offering Benefits
Whereas macro-market analysis considers the macro-environment, micro-market analysis focuses on identifying and targeting segments where the provider will deliver specific benefits. To paraphrase John Mullins, successful linked data providers will be those who deliver great value to their specific market segments:

Linked data providers should be looking for segments where they can provide clear and compelling benefits to the customer; commercial providers should especially look to ease customers’ pain in ways for which they will pay.
Linked data providers must ask whether the benefits their services provide, as seen by their customers, are sufficiently different from and better than their competitors’, e.g. in terms of data quality, query performance, a more supportive community, better contract support services, etc.
Linked data providers should quantify the scale of the segment just as they do the macro-environment: how large is the segment and how fast is it growing?
Finally, linked data providers should ask whether the segment can be a launching point into other segments.

The danger of falling into the “me-too” trap is particularly glaring with linked data, since a provider’s competition may come from open data sources as well as other commercial providers: think Encarta vs. Wikipedia!

Having helped found a start-up in the mid-1990s, I am acutely aware of the difference between perceived and actual need. The formula for long-term success and fulfillment is fairly straightforward: provide a service that people need, and solve problems that people need solved!

Notes:

In an upcoming post I’ll discuss the need for providers to perform a linked data industry analysis as a complement to the market analysis described here…
On the topic of creating datasets that people need, see How to create datasets that the rest of the world needs on the Infochimps.org blog.
Finally, on the topic of taking your linked data to market, visit Google Apps Marketplace. Stare. Now go to Infochimps.org’s listing of datasets. Stare. Go back…Forward…Back…Forward…Thoughts?

References

John W. Mullins, The New Business Road Test (FT Prentice Hall, 2006)

Posted in Big Ideas, linked data | Tags: business models, linked data, lipstick, market analysis, pigs

Posted by: John Erickson | February 4, 2010

DOIs, URIs and Cool Resolution

The art of happiness is to serve all — Yogi Bhajan

Once we get beyond the question of the basic HTTP URI-ness of the digital object identifier (DOI), which is moot since for each DOI there exist DOI-based URIs due to the dx.doi.org and hdl.handle.net proxies, and old-skool questions of “coolness” based on the relative brittleness over time of creative URI encoding [1], we are then left with the more substantial question of whether DOI-based HTTP URIs really “behave” themselves within the “Web-of-Objects” universe. The purpose of this post is to identify the problem and propose a potential solution, implementation of which will require certain changes to the current Handle System platform. I believe that if the proposed changes are made, lingering questions concerning the “URI-ness” of DOIs (and Handles) will disappear, once and for all.

Note: It is beyond the scope of this post to present all of the gory background details regarding the Handle System, the DOI, and the 1998 and 2008 versions of “Cool URIs.” If there is enough interest in a stand-alone article, I will happily consider writing a longer version in the future, perhaps as a piece for D-Lib Magazine.

With the increasing influence of semantic web technologies there has been strong interest in assigning actionable HTTP URIs to non-document things, ranging from abstract ideas to real world objects. In the case of URI-named, Web-accessible physical items — sensors, routers and toasters — this is sometimes referred to as The Web of Things. Until 2005 the community disagreed as to what an HTTP URI could be assumed to represent, but a June 2005 decision by the W3C TAG settled the issue: If a server responds with an HTTP response code of 200 (aka a successful retrieval), the URI indeed is for an information resource; with no such response, or with a different code, no such assumption can be made. This “compromise” was said to have resolved the issue, leaving a “consistent architecture.” [3]

The result of this decision was to force consensus on how to apply the long-established principles of HTTP content negotiation in more consistent ways. In particular, “human” and “machine” requests to a given entity URI — a top-level URI representing a “thing” — should be treated differently; for example, there should be different responses to requests with HTTP headers specifying Accept: text/html (for an HTML-encoded page) versus Accept: application/rdf+xml (for RDF-modeled, XML-encoded data). This is most often seen in the semantic web and linked data worlds, where it is now common to have both textual and machine readable manifestations of the same URI-identified thing.

Modern web servers including Apache have been engineered to handle these requests through content negotiation [4]. Through standard configuration procedures, site administrators specify how their servers should respond to text/html and application/rdf+xml requests in the same way they specify what should be returned for alternate language and encoding requests (“en,” “fr,” etc.). Typically, when media-specific requests are made against entity URIs representing concepts, the accepted practice [2] is to return a 303 See Other response code with the URI of a resource containing a representation of the expected type, such as an HTML-encoded page or an XML document with RDF-encoded data.

Many readers of this post will be familiar with the basic idea of HTTP proxy-based Handle System name resolution: An HTTP resolution request for a DOI-based URI is made to a proxy (a registration agency-run proxy such as dx.doi.org or the “native” Handle System proxy hdl.handle.net), the appropriate local handle server is located, the handle record for the DOI is resolved, and the default value (e.g. the URL of a document information page) is returned to the client in a 302 Found response. In a Web of Documents this might make sense, but in a universe of URI-named real-world objects and ideas, not so much.

The 2008 document provides two requirements for dealing with URIs that identify real world objects:

Be on the Web: Given only a URI, machines and people should be able to retrieve a description about the resource identified by the URI from the Web. Such a look-up mechanism is important to establish shared understanding of what a URI identifies. Machines should get RDF data and humans should get a readable representation, such as HTML. The standard Web transfer protocol, HTTP, should be used.

Be unambiguous: There should be no confusion between identifiers for Web documents and identifiers for other resources. URIs are meant to identify only one of them, so one URI can’t stand for both a Web document and a real-world object.

In the post-2005 universe of URI usage as summarised above and detailed in [2], if DOI-based URIs are used to represent conceptual objects these rules will be broken! For example, Handle System proxies today cannot distinguish between the media types named in the Accept: request header; the only possible resolution is the default (first) element of the Handle record. (For hackers or merely the curious out there, I encourage you to experiment with curl at your command line or Python’s urllib2 library, hitting the DOI proxy with a DOI-based URL like http://dx.doi.org/10.1109/MIC.2009.93.) This problem with how proxies resolve DOIs and Handles is a lingering manifestation of the native Handle System protocol not being HTTP-based and the system of HTTP-based proxies being something of a work-around, but the vast majority of DOI and Handle System resolutions occur through and rely on these proxies.
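
For those who would rather script that experiment than type curl commands, here is a minimal sketch using urllib.request, the Python 3 successor to urllib2. The helper names (NoRedirect, build_request, probe) are my own inventions for illustration, and whatever the proxy returns will depend on how it is configured on the day you run it.

```python
import urllib.error
import urllib.request

DOI_URL = "http://dx.doi.org/10.1109/MIC.2009.93"  # the DOI-based URL from the text


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the proxy's own answer is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to follow; the 3xx then surfaces
        # as an HTTPError that still carries the status and headers.
        return None


def build_request(url, media_type):
    """Build a HEAD request that asks for one specific media type."""
    return urllib.request.Request(
        url, headers={"Accept": media_type}, method="HEAD"
    )


def probe(url, media_type):
    """Return (status code, Location header) for the first hop only."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(build_request(url, media_type), timeout=10) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")
```

Calling probe(DOI_URL, "text/html") and probe(DOI_URL, "application/rdf+xml") and comparing the two Location values shows whether the proxy pays any attention to the Accept: header.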

One possible solution would be to enable authorities (Registration Agencies) who operate within the Handle System to configure how content negotiation within their Handle prefix space is handled at the proxy. For document-based use of the DOI, an example of this would be to return the URI in the first element of the Handle record whenever a text/html request is made and (for example) the second element whenever an application/rdf+xml request is made. When a request is made to the proxy, request-appropriate representation URIs would be returned to the client along with the 303 See Other code. This approach treats the DOI-based URI as a conceptual or entity URI and gives the expected responses as per [2]. pax vobiscum…

Readers familiar with the Handle System will appreciate that there are many potential schemes for relating HTTP content type requests to elements of the Handle record; in the example above I use position (index value), but it is also possible to use special TYPEs.
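
To make the idea concrete, here is a rough sketch of the lookup such a proxy might perform. Everything in it is hypothetical: the HandleValue structure, the example.org URIs and the POLICY table stand in for whatever a real Handle System proxy and Registration Agency would actually use, and none of this is Handle System code. It redirects with 303 See Other, the response that [2] recommends for entity URIs.

```python
from typing import NamedTuple, Optional


class HandleValue(NamedTuple):
    index: int  # position of the value in the Handle record
    type: str   # Handle value TYPE, e.g. "URL"
    data: str   # the stored URI


# Hypothetical Handle record for one DOI; both URIs are placeholders.
RECORD = [
    HandleValue(1, "URL", "http://example.org/article/landing.html"),
    HandleValue(2, "URL", "http://example.org/article/metadata.rdf"),
]

# Hypothetical per-prefix policy set by a Registration Agency:
# requested media type -> index of the Handle value to redirect to.
# A TYPE-based scheme would key on HandleValue.type instead of index.
POLICY = {"text/html": 1, "application/rdf+xml": 2}

DEFAULT_INDEX = 1       # fall back to the first value of the record
REDIRECT_STATUS = 303   # 303 See Other


def resolve(record, policy, accept: Optional[str]):
    """Return (status, location) for one resolution request.

    Media types are tried in the order they are listed; q-values are
    ignored to keep the sketch short.
    """
    requested = [
        part.split(";")[0].strip().lower()
        for part in (accept or "").split(",")
    ]
    index = next((policy[m] for m in requested if m in policy), DEFAULT_INDEX)
    by_index = {value.index: value for value in record}
    return REDIRECT_STATUS, by_index[index].data
```

With this policy, a request carrying Accept: application/rdf+xml is redirected to the second value of the record, while a request with no Accept: header (or an unrecognised media type) falls back to the first.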

Handle servers are powerful repositories and can implement potentially many different models other than redirection as described above. Sometimes, for example, the desire is to use a Handle record as the primary metadata store. In that case, the preferred application/rdf+xml response might very well be an RDF-encoded serialisation of the Handle record. How this is handled should be a feature of the Handle server platform and a decision by registration agencies based on their individual value propositions, and not locked in by the code.

I eagerly look forward to your comments and reactions on these ideas!

Update 1: In a comment to this post, Herbert Van de Sompel argues that the real question is, what should DOIs represent? Herbert asserts that DOI-based URIs should model OAI-ORE resource aggregations and that Handle System HTTP proxies should behave according to OAI-ORE’s HTTP implementation guidelines. Herbert’s suggestion doesn’t conflict with what I’ve written above; this is a more subtle and (arguably) more robust view of how compound objects should be modeled, which I generally agree with.

Here’s how OAI-ORE resolution would work following the Handle proxy solution I’ve described above: Assume some DOI-based HTTP URI doi.A-1 identifies an abstract resource aggregation “A-1” (in OAI-ORE nomenclature, doi.A-1 is the Aggregation URI). Following the given HTTP implementation example, let there be two Resource Maps that “describe” this Aggregation, an Atom serialization and an RDF/XML serialization. Each of these Resource Maps is (indeed MUST be) available from a different HTTP URI, ReM-1 and ReM-2, but the desired behaviour is for either to be accessible through the DOI-based Aggregation URI, doi.A-1. Let these two URIs be persisted in the Handle record, preferably using TYPEs which distinguish how they should be returned to clients based on the naming authority’s configuration of the HTTP proxy. By the approach I describe above, the Handle System proxy would then respond to resolution requests for doi.A-1 with 303 See Other redirects to either ReM-1 or ReM-2 depending upon MIME-type preferences expressed in the Accept: headers of the requests.

Update 2: Complete listing of MIME types for OAI-ORE Resource Map serializations. Follow-up conversations with Herbert Van de Sompel, Carl Lagoze and others have reminded me I neglected to mention how the OAI-ORE model recommends handling “HTML” (application/xhtml+xml and text/html) requests! This is not a minor issue, since the purpose of ORE is to model aggregations of resources and not resources themselves, and so it is not immediately clear what such a page request should return. My solution (for the purposes of this blog post) is for Handle System HTTP proxies to respond to these requests also with 303 See Other redirects, supplying redirect URIs that map to appropriately-coded “splash screens.”

For completeness, the table below (repeated from [5]) lists the standard MIME types for Resource Map serializations. Continuing with the major theme of this post, Handle System HTTP proxies resolving requests for DOI-named ORE Resource Maps should follow these standards so the clients may request appropriate formats using HTTP Accept: headers.

Resource Map Type	MIME type
Atom	`application/atom+xml`
RDF/XML	`application/rdf+xml`
RDFa in XHTML	`application/xhtml+xml`

If a client prefers RDF/XML but can also parse Atom then it might use the following HTTP header in requests:

Accept: application/rdf+xml, application/atom+xml;q=0.5
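
How a server or proxy might read such a header can be sketched in a few lines of Python. This is a simplified illustration of q-value ranking, not a complete implementation of HTTP content negotiation: wildcards such as */* are not handled, and the function names parse_accept and choose are my own.

```python
def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        media_type, *params = [piece.strip() for piece in item.split(";")]
        if not media_type:
            continue
        quality = 1.0  # HTTP default when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q as "not acceptable"
        if quality > 0:  # q=0 means the client does not want this type
            # Sort by descending quality, then by order of appearance.
            ranked.append((-quality, position, media_type.lower()))
    return [media_type for _, _, media_type in sorted(ranked)]


def choose(header, available):
    """Pick the most preferred media type that the server can supply."""
    for media_type in parse_accept(header):
        if media_type in available:
            return media_type
    return None
```

For the header above, parse_accept ranks application/rdf+xml ahead of application/atom+xml, so a server holding both serializations would pick RDF/XML, while one holding only Atom would fall back to Atom.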

The table below lists the two common MIME types for HTML/XHTML Splash Pages following the W3C XHTML Media Types recommendations.

Splash Page Type	MIME type
XHTML	`application/xhtml+xml`
HTML (legacy)	`text/html`

Thus, if a client wishes to receive a Splash Page from the Aggregation URI and prefers XHTML to HTML then it might use the following HTTP header in requests:

Accept: application/xhtml+xml, text/html;q=0.5

As noted in [5] there is no way to distinguish a plain XHTML document from an XHTML+RDFa document based on MIME type. It is thus not possible for a client to request an XHTML+RDFa Resource Map in preference to an RDF/XML or Atom Resource Map without running the risk of a server correctly returning a plain XHTML Splash Page (without included RDFa) in response.

The Handle record for a given DOI or Handle identifying an ORE aggregation would therefore contain a set of URIs reflecting the mappings in the tables above. A content-negotiation-savvy Handle System HTTP proxy would then return the appropriate URI in the 303 See Other response, based on its configuration and policies.

Update 3 (28 Apr 2011): Great news! This week CrossRef, the IDF and CNRI announced the completion of their implementation of Content Negotiation for CrossRef DOIs. As that post describes, it is an implementation of “Option D” in last year’s CrossTech post, DOIs and Linked Data: Some Concrete Proposals. Ed Summers provides a great explanation of the significance of this in his recent INKDROID post, DOIs as Linked Data.

Update 4 (18 Feb 2012): It occurs to me that it might be instructive to readers to see Content Negotiation for CrossRef DOIs in operation, using curl at the command line. Here are the header responses for the DOI-based URI given earlier:

$ curl -H "accept: application/rdf+xml" http://dx.doi.org/10.1109/MIC.2009.93 -L -I
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://data.crossref.org/10.1109%2FMIC.2009.93
Expires: Sun, 19 Feb 2012 13:13:00 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 178
Date: Sat, 18 Feb 2012 20:55:42 GMT

HTTP/1.1 200 OK
Date: Sat, 18 Feb 2012 20:55:44 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.7
Vary: Accept
Content-Length: 11876
Status: 200
Connection: close
Content-Type: application/rdf+xml

You can see that the initial request to the generic DOI HDL proxy is redirected to CrossRef, which acknowledges that it can respond with the RDF content type requested in the Accept: header. An identical request without the trailing -I would get you the RDF serialized in XML.
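For readers who would rather script this than use curl, the same request can be sketched with Python's standard library. This is a rough equivalent, not a tested client for the CrossRef service: the function names build_request and fetch_headers are my own, and the proxy's behaviour today may differ from the 2012 transcript above.

```python
import urllib.request


def build_request(doi, accept="application/rdf+xml", proxy="http://dx.doi.org/"):
    """Build a request for a DOI that asks for a specific content type."""
    return urllib.request.Request(proxy + doi, headers={"Accept": accept})


def fetch_headers(doi, accept="application/rdf+xml"):
    """Follow redirects (like curl -L) and return (final_url, content_type).

    urlopen follows the 303 redirect automatically; the body is simply
    not read here, which approximates curl's -I for our purposes.
    """
    with urllib.request.urlopen(build_request(doi, accept)) as resp:
        return resp.geturl(), resp.headers.get("Content-Type")
```

Calling fetch_headers("10.1109/MIC.2009.93") performs the network request; reading resp.read() inside the with block instead would return the RDF/XML itself.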

References:

[1] Tim Berners-Lee, Cool URIs don’t change (1998)
[2] Leo Sauermann, et al., Cool URIs for the Semantic Web (2008)
[3] Tim Berners-Lee, What HTTP URIs Identify (2005)
[4] Apache Foundation, Content Negotiation (Apache 2.2)
[5] Carl Lagoze and Herbert Van de Sompel, et al., Common MIME Types for Resource Maps and Splash Pages, ORE User Guide: HTTP Implementation (17 Oct 2008).


Posted in linked data, metadata | Tags: doi, handle system, linked data, metadata, oai-ore, uri, web of data, web of objects, web of things

Posted by: John Erickson | February 3, 2010

Community as a Measure of Research Success

In his 02 Feb 2010 post entitled Doing the Right Thing vs. Doing Things Right, Matthias Kaiserswerth, the head of IBM Research – Zurich, sums up his year-end thinking with this question for researchers…

We have so many criteria of what defines success that one of our skills as research managers is to choose the right ones at the right time, so we work on the right things rather than only doing the work right…For the scientists that read this blog, how do you measure success at the end of the year?

Having just “graduated” after a decade with another major corporate research lab, this is a topic that is near and dear to my heart! My short answer was the following blog comment…

I can say with conviction that the true measure of a scientist must be their success in growing communities around their novel ideas. If you can look back over a period of time and say that you have engaged in useful discourse about your ideas, and in so doing have moved those ideas forward — in your mind and in the minds of others — then you have been successful…Publications, grad students and dollar signs are all artifacts of having grown such communities. Pursued as ends unto themselves, it is not a given that a community will grow. But if your focus is on fostering communities around your ideas, then these artifacts will by necessity follow…

My long answer is that those of us engaged in research must act as stewards of our ideas; we must measure our success by how we apply the time, skills, assets, and financial resources we have available to us to grow and develop communities around our ideas. If we can look back over a period of time — a day, a quarter, a year, or a career — and say that we have been “good stewards” by this definition, then we can say we have been successful. If on the other hand we spend time and money accumulating assets, but haven’t moved our ideas forward as evidenced by a growing community discourse supporting those ideas, then we haven’t been successful.

A very trendy topic over the past few years has been open innovation, as iconified by Henry Chesbrough’s 2003 book of the same name. Chesbrough’s “preferred” definition of OI, found in Open Innovation: Researching a New Paradigm (2006), reads as follows…

Open innovation is the use of purposive inflows and outflows of knowledge to accelerate internal innovation, and expand the markets for external use of innovation, respectively. [This paradigm] assumes that firms can and should use external ideas as well as internal ideas, and internal and external paths to market, as they look to advance their technology.

In very compact language Chesbrough (I believe) argues that innovators within organisations can best move their ideas forward through open, active engagement with internal and external participants. [1] Yes, individual engagement could be conducted through closed “tunnels,” but for the ideas to truly flourish (think Java) this is best done through open communities. I believe the most important (perhaps singular) responsibility of the corporate research scientist is to become a “master of their domain,” to know their particular area of interest and expertise better than anyone, to propose research agendas based upon that knowledge, and to leverage their companies’ assets to motivate communities of interest around those ideas. External communities that are successfully grown based on this view of OI can become force multipliers for the companies that invest in them!

To appreciate this one needs only to consider the world of open source software and the ways in which strong communities contribute dimensions of value that no single organisation could… I’ll pause while you contemplate this idea: open-source-like communities of smart people developing your ideas. Unconvinced? Then think about “Joy’s Law,” famously attributed to Sun Microsystems co-founder Bill Joy (1990):

No matter who you are, most of the smartest people work for someone else

Bill Joy’s point was that the best path to success is to create communities [2] in which all of the “world’s smartest people” are applying themselves to your problems and growing your ideas. As scientists, our measure of success must be how well we leverage the assets available to us to grow communities around our ideas.

Peter Block has given us a profound, alternative perspective on the role of leaders in the context of communities [3]. In his view, leaders provide context and produce engagement. In Block’s view, leaders…

Create a context that nurtures an alternative future, one based on gifts, generosity, accountability, and commitment;

Initiate and convene conversations that shift people’s experience, which occurs through the way people are brought together and the nature of the questions used to engage them;

Listen and pay attention.

Ultimately, I believe that successful researchers must first be successful community leaders, by this definition!

Update: In a 4 Feb 2010 editorial in the New York Times entitled Microsoft’s Creative Destruction, former Microsoft VP Dick Brass examines why Microsoft, America’s most famous and prosperous technology company, no longer brings us the future. As a root cause, he suggests:

What happened? Unlike other companies, Microsoft never developed a true system for innovation. Some of my former colleagues argue that it actually developed a system to thwart innovation. Despite having one of the largest and best corporate laboratories in the world, and the luxury of not one but three chief technology officers, the company routinely manages to frustrate the efforts of its visionary thinkers.

I believe Mr. Brass’ analysis is far too inwardly focused. Never in his editorial does Mr. Brass lift up the growing outreach by Microsoft Research, especially under the leadership of the likes of Tony Hey (CVP, External Research) and Lee Dirks (Director, Education & Scholarly Communications), to empower collaboration with and sponsorship of innovative researchers around the world. Through its outreach Microsoft is enabling a global community of innovators and is making an important contribution far beyond its bottom line. I think Mr. Brass would do well to focus on the multitude of possibilities Microsoft is helping to make real through its outreach, rather than focusing on what he perceives to be its problems…

Notes:

[1] One version of the open innovation model has been called distributed innovation. See e.g. Karim Lakhani and Jill Panetta, The Principles of Distributed Innovation (2007)
[2] Some authors have referred to “ecologies” or “ecosystems” when interpreting Bill Joy’s quote, but I believe the more accurate and useful term is community.
[3] For more on community building, see Peter Block, esp. Community: The Structure of Belonging (2008)


Posted in Big Ideas, web science | Tags: Big Ideas, community, innovation, research, web science

Posted by: John Erickson | January 26, 2010

Scale-free Networks and the Value of Linked Data

I believe vibrant, thriving networks are expressions of inherent value and provide ecosystems of opportunity for individuals and organizations to foster their growth through unique contributions in the form of content and data, tools, or infrastructure. For this reason, as I read Eric Hellman’s recent post 8 One-Way Business Models for Linked Data — a well-considered response to Scott Brinker’s seven business models (plus an eighth…) that can make Linked Data viable — I felt something was missing; something more needed to be said about the inherent value of growing networks embodying linked data principles. I therefore post this re-considered piece which first appeared in Dec 2009 on my decommissioned Blogspot blog…

Over the past few months Kingsley Idehen of OpenLink Software and others on the Business of Linked Data (BOLD) list have been debating a value proposition for linked data via Twitter (search for #linkeddata) and email. The discussion has included useful iterations on various “elevator pitches” and citations of recent successes, especially the application of GoodRelations e-commerce vocabularies at Best Buy. After some deep thought I decided to take the question of value in a different direction and to consider it from the perspective of the science of networks, especially with reference to the works of Albert-László Barabási, director of the Center for Complex Network Research and author of Linked: The New Science of Networks.

I’d like to test the idea here that data sharing between organisations based on linked open data principles is the approach most consistent with the core principles of a networked economy. I believe that the linked data model best exploits “networking thinking” and maximizes an organisation’s ability to respond to changes in relationships within the “global graph” of business. Using Barabási as a framework, linked data is the approach that most embodies a networked view of the economy from the macro- to the micro-economic level, and therefore best empowers the enterprise to understand and leverage the consequences of interconnectedness.

As has been noted numerous times elsewhere, the so-called Web of Data is perhaps the web in its purest form. Following Tim Berners-Lee’s principles or “rules” as stated in his Linked Data Design Issues memo from 2006, we have a very elegant framework for people and especially machines to describe the relationship between entities in a network. If we are smart about how we define those links and the entities we create to aggregate those links (the linked datasets we create), we can build dynamic, efficiently adaptive networks embodying the two laws that govern real networks: growth and preferential attachment. Barabási illustrates these two laws with an example “algorithm” for scale-free networks in Chapter 7 of Linked. The critical lessons are (a) networks must have a means to grow: there must not only be links, but the ability to add links, and (b) networks must provide some mechanism for entities to register their preference for other nodes by creating links to the more heavily-linked nodes. Preferential attachment ensures that the converse is also true: entities will “vote with their feet” and register their displeasure with nodes by eliminating links.
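The two laws can be sketched in a few lines of Python. What follows is my own minimal rendering of a growth-plus-preferential-attachment model in the Barabási–Albert style, not a transcription of the algorithm in Linked; the function names and the parameter m (links added per new node) are assumptions, and the sketch models only link creation, not link removal.

```python
import random


def grow_network(n_nodes, m=2, seed=None):
    """Grow a graph by the two laws: growth and preferential attachment."""
    rng = random.Random(seed)
    # Start from a small, fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `endpoints` once per link it has, so a uniform
    # draw from this list picks a node with probability proportional to
    # its current degree (preferential attachment).
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n_nodes):  # growth: one new node per step
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


def degrees(edges):
    """Count how many links touch each node."""
    counts = {}
    for a, b in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts
```

Running degrees(grow_network(10000, m=2, seed=1)) and sorting the counts is a quick way to look for the heavily linked hubs that such a process tends to produce alongside a long tail of sparsely linked nodes.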

In real networks, the rich get richer. In the Web, the value is inherent in the links. Google’s PageRank merely reinforced the “physical” reality that the most valuable properties in the Web of Documents are those resources that are most heavily linked-to. Those properties provide added value if they in turn provide useful links to other resources. The properties that are sensitive to demand and can adapt to the preferences of their consumers, especially to aggregate links to more resources that compound their value and distinguish them from other properties, are especially valuable and are considered hubs.

Openness is important. At this point it is tempting to jump to the conclusion that Tim Berners-Lee’s four principles are all we need to create a thriving Web of Data, but this would be premature; Sir Tim’s rules are necessary but not sufficient conditions. Within any “space” where Webs of Data are to be created, whether global or constrained within an organisation, the network must embody the open world assumption as it pertains to the web: when datasets or other information models are published, their providers must expect them to be reused and extended in ways they cannot control. In particular this means that entities within the network, whether powered by humans or machines, must be free to arbitrarily link to — to make assertions about — other entities within the network. The “friction” of obtaining permission in this linking process must approximate zero.

Don’t reinvent and don’t covet! The extent of graphs that are built within organisations should not stop at their boundaries; as the BBC has shown so beautifully with their use of linked data on the revamped BBC web site, the inherent value of their property was increased radically by not only linking to datasets provided elsewhere, openly on the “global graph,” but also by enabling reuse of their properties. The BBC’s top-level principles for the revamped site are all about openness and long-term value:

The site has been developed against the principles of linked open data and RESTful architecture where the creation of persistent URLs is a primary objective. The initial sources of data are somewhat limited but this will be extended over time. Here’s our mini-manifesto: Persistence…Linked open data…RESTful…One web

The BBC has created a valuable “ecosystem”; their use of other resources, especially MusicBrainz and DBPedia, has not only made the BBC site richer but in turn has increased the value of those properties. And those properties will continue to increase in value; by the principle of preferential attachment, every relationship “into” a dataset by valuable entities such as the BBC in turn increases the likelihood that other relationships will be established.

Links are not enough. It should be obvious that simply exposing datasets and providing value-added links to others isn’t enough; as Eric Hellman notes, dataset publishers must see themselves as service providers who add value beyond simply exposing data. Some will add value to the global graph by gathering, maintaining, and publishing useful datasets and fostering a community of users and developers; others will add value by combining datasets from other services in novel ways, possibly decorated by their own. Eric has argued that the only winners in the linked open data space have indeed been those who have provided such merged datasets as a service.

Provide value-adding services and foster community. For those dataset providers asking how you might realise the full value potential of publishing your datasets on the Web, I suggest that you examine whether, based on the principles I’ve outlined above, you have done everything you can to make your datasets part of the Web, rather than merely “on” it, and thereby are truly adding value to the global graph:

Do you view yourselves as a service?
Have you made your datasets as useful and easy-to-use as possible?
Have you provided the best possible community support, including wikis and other mechanisms?
Have you fully documented your vocabularies?
Have you clearly defined any claimed rights, and in particular have you considered adopting open data principles?

Updates:

For more on why “…linked data…is the best approach available for publishing data in a hugely diverse and distributed environment, in a gradual and sustainable way…” see Jeni Tennison’s recent post, Why Linked Data for data.gov.uk?.
One of the more important contributions to the study of networks in the past few years is Experience vs. Talent Shapes the Structure of the Web (2008; Joseph Kong, Nima Sarshar, and Vwani Roychowdhury) which used large-scale crawl data to “investigate and validate the dynamics that underlie the evolution of the structure of the web.” The authors’ study shows that neither age (“experience”) nor status as a promising upstart (“talent” or “fitness”) are immediate indicators of success. They suggest that a more experience-based fitness ranking could be included in the overall ranking of a search result; one simple way to think of this is if we could filter Google results based on how fast certain resources are rising in the rankings.



Posted in linked data, metadata, web science | Tags: bbc, business, linked data, metadata, semantic web, web science

Posted by: John Erickson | January 20, 2010

Protecting and Licensing Your Linked Data

Note: This entry was first published on 3 Dec 2009 in the Blogspot/Blogger version of this blog, which was disabled by Blogger’s “spam blog” detection algorithm December 2009 – January 2010. I’ve decided to re-post the best of that blog here…

One of the highlights of the recent ISWC2009 was a tutorial on Legal and Social Frameworks for Sharing Data on the Web. As one who during the rise of “Web 1.0” was writing and presenting frequently on topics like Copyright for Cybernauts and is now seduced by the world of linked data, I’ve been considering how the legal, business and technical worlds will reconcile themselves in this new world, a world where value will come from joining networks of data together. Eric Hellman puts this nicely:

Linked Data is the idea that the merger of a database produced by one provider and another database provided by a second provider has value much larger than that of the two separate databases… Eric Hellman, Databases are Services, NOT Content (Dec 2009)

The question is, what legal and technical strategies are available to a linked data provider to protect themselves as they pursue such a value proposition? The following post is an effort to try to rationalise this a bit more clearly.

I’m not a lawyer. I’m a technologist who has since the early 1990s immersed himself in the sometimes delicate, more often violent dance between technology, business and public policy that has been catalysed by the rise of the digital, networked environment. In particular I’ve been motivated by the question of how policies can, and more often can’t, be systematically “implemented” by technologies — as well as by the question of how technical architectures often enforce ad hoc policy regimes, inadvertently or otherwise (see esp. Lawrence Lessig’s Code v2, the community update of Code and Other Laws of Cyberspace).

As an early (and perhaps idiosyncratic) player in the DRM industry, I quickly concluded that the only sustainable solution to the problem of communicating rights for creative works in the digital domain was to evolve an infrastructure of identifiers and metadata, which has been realised to a great extent by the rise in prominence of the DOI, accessible templates for rights communications (due in large part to Creative Commons), the emergence of a variety of metadata standards, and a standard data model (RDF) for associating metadata with objects. The more recent emergence of standards of practice for linked data will only help to further disambiguate the rights world, as these practices make the expression and transferral of content-descriptive metadata orders of magnitude easier.

I’m interested in questions concerning the communication of intellectual property rights for data shared through linked data mechanisms: What rights can be claimed? What are the best practices for claiming and transferring rights? What technical mechanisms exist (in this case, specific vocabularies and protocols) for communicating rights to metadata? The four thought leaders at the ISWC2009 LSFSDW tutorial have done a fairly complete job; this post is an attempt to summarise and/or interpret their messages and resources found elsewhere. I’d like to highlight pioneering work by the Science Commons, an offshoot of CC which has considered these questions specifically for scientific data. Also, in preparing this post I stumbled across some works that I pored over more than a decade ago, that now seem prescient! David Lanzotti and Doug Ferguson’s thorough analysis circa 2006 shows that little has changed: IP protection for databases is nebulous territory.

Copyright does not apply to datasets: Most regimes hold that copyright applies only to original creative works. This means you can only claim copyright for works that are yours and which are “creative.” This second piece means you cannot claim copyright on databases unless their structure and organisation are sufficiently creative; the US Supreme Court held that “sweat of the brow” is not sufficient to cross this threshold, and that copyright protections do not extend to non-creative accumulations of facts (cf. Feist, 1991).

The individual elements of a dataset might themselves be extensive and creative enough to merit copyright protection; we’ll assume for this discussion that these are handled separately. In their FAQ the Open Data Commons nicely emphasises the difference between a dataset and the individual contents of that dataset, including text and images. Note also that the European Space Agency (ESA) web site includes a nice, concise explanation of the legal reasons why copyright cannot be applied to databases.

Intellectual property protection for datasets: The fact that copyright (generally) cannot be applied to datasets means that the Creative Commons body of work can’t be applied directly; indeed CC specifically discourages it. But is there an IP regime that covers accumulated data? If not copyright, patent or trademark, then what? Circa 1996 database “owners” thought that a sui generis (“of its own kind”) regime for protecting databases might proliferate, and in March 1996 the EU issued a Database Directive. International IP law requires reciprocal directives from member states, however, and the lack of adoption of this model around the world and most notably in the United States means IP protection for datasets is still nebulous.

In principle there are no “default” protections for datasets as there are with copyright; providers must be proactive and declare their terms of use up front, whether they choose to waive all restrictions, impose a limited set focused on attribution, or apply more extensive limitations based on customised licenses. It is clearly in the interests of both providers and consumers of datasets to ensure that rights are explicitly stipulated up front, especially since a key value proposition of linked data is (as we are reminded above) the merger of graphs; for certain applications graphs from different sources must be merged together within a single store so that inference can be applied. A service agency must know up front whether triples from particular sources can be “thrown in the hopper,” and whether there are any exclusions.
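As a rough illustration of what “knowing up front” could look like in practice, here is a small Python sketch in which each source declares a license URI and only acceptable sources are merged into the shared store. The data layout (dictionaries with name, license and triples keys) and the function name merge_sources are hypothetical; only the ODbL URI comes from the Open Data Commons.

```python
# License URI published by the Open Data Commons for ODbL 1.0.
ODBL = "http://opendatacommons.org/licenses/odbl/1.0/"


def merge_sources(sources, accepted_licenses):
    """Merge triples only from sources whose declared license we accept.

    Returns (merged_triples, excluded_source_names). A source with no
    declared license is excluded, since nothing can be assumed about it.
    """
    merged, excluded = set(), []
    for source in sources:
        if source.get("license") in accepted_licenses:
            merged.update(source["triples"])
        else:
            excluded.append(source["name"])
    return merged, excluded


# Toy example: one ODbL-licensed source and one with no stated terms.
sources = [
    {"name": "open-catalogue", "license": ODBL,
     "triples": [("ex:item1", "rdfs:label", "Widget")]},
    {"name": "mystery-feed",
     "triples": [("ex:item1", "ex:price", "9.99")]},
]
merged, excluded = merge_sources(sources, accepted_licenses={ODBL})
```

Here the unlicensed feed is left out of the merged graph and reported in excluded, so that someone can go and ask its provider for explicit terms.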

Templates for expressing licensing terms: The Open Data Commons provides a template Open Database License (ODbL) that specifies Attribution and Share-alike terms:

This {DATA(BASE)-NAME} is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/

The specific text of the ODbL license is quite extensive, but the gist of it is nicely summarised in the ODbL Plain Language Summary:

You are free: To Share…To Create…To Adapt…
As long as you: Attribute…Share-alike…Keep open…

(details of each stipulation omitted for simplicity)

My point in dwelling on ODbL is not to argue that commercial providers should adopt it, but rather that they should consider adapting it; I’m holding it up as an exemplar for the explicit expression of terms of use for a dataset.

Expressing your rights to linked data as linked data: One of the things that has impressed me about Creative Commons is that its rights expressions were intended from the start to be modelled in RDF and machine-readable; indeed CC has created ccREL: the Creative Commons Rights Expression Language, which primarily uses the idea of embedded RDF (via RDFa) in content pages to communicate rights. A recent development is Creative Commons guidance on how ccREL and RDFa might be applied to “deploy the Semantic Web.” Nathan Yergler’s (excellent) OpenWeb 2008 presentation explains this well, but doesn’t specifically deal with the linked data question. In particular, Nathan addresses CC+, a CC licensing model that allows providers to include a way for users to request rights beyond those stated in the basic CC license. Those who know me know what I’ll say next: this is another step forward as we converge on Henry Perritt’s ca. 1993 vision of permissions headers!

For further reading:

Jonathan Band and Jonathan S. Gowdy, Sui Generis Database Protection: Has Its Time Come? (1997)
Legal and Social Frameworks for Sharing Data on the Web. ISWC2009 Tutorial.(2009) Check out each of these presentations!
Open Database License (ODbL)
David Lanzotti and Doug Ferguson, Copyright and Databases
Open Data Commons, Making Your Data Open
Kaitlin Thaney, Ontology Sharing and Copyright Considerations
Creative Commons, ccREL: The Creative Commons Rights Expression Language

Updates:

Readers may be interested in my new post on mechanisms for providing access control to linked data, Thoughts on Securing Linked Data with OAuth and FOAF+SSL (20 January 2010)
The February issue of Talis’ Nodalities Magazine (http://blogs.talis.com/nodalities/) focuses on data sharing and includes an article by Science Commons’ Kaitlin Thaney entitled “Data Sharing on the Web.”


Posted in linked data, metadata, rights expression

Posted by: John Erickson | January 20, 2010

Thoughts on Securing Linked Data with OAuth and FOAF+SSL

Several weeks ago Leigh Dodds ended his post Thoughts on Linked Data Business Models with the following comment:

…From a technical perspective I’m interested to see how well protocols like OAuth and FOAF+SSL can be deployed to mediate access to licensed Linked Data…

Leigh asks a very subtle question, because choosing between OAuth and FOAF+SSL (I believe) has profound implications on different levels. In short, I believe that OAuth is more suited for organizations that will establish formal licenses with their community of users, and FOAF+SSL, being more in the peer-to-peer, “Web of Trust” vein, may be more suited to ad hoc access control with user webs where the users have particular shared, but not open, interests. In both cases, work must be done to reify licenses into actionable policies that can be implemented by services.

A typical OAuth-based access control use case might be controlling access to a particular value-added, protected data set that has been licensed to a consumer-facing provider. Users, by way of the client service they have authenticated with, experience this data. The OAuth protocol ensures that only the users of the authenticated partners of the protected dataset provider can access the data. Example: A major business journal decides to publish its data in a way that resembles data.nytimes.com but is restricted to paying customers. The protected service is operated as a separate entity; users’ clients gain access to the published data after first authenticating with the “parent” site. As an added measure of privacy (built into the protocol) users’ credentials do not pass on to the data service.

The credentials required for OAuth-based access control are no different than what we see between primary services like Facebook and Twitter and their various value-added partners; the user has an account with the parent service, and will either have separate credentials for that service or have it linked to their OpenID.

Note: This application of OAuth does not really take full advantage of its capabilities. Typically the relying service will ask for read or read/write access to certain user attributes from the user’s primary service (social networking platforms like Twitter or Facebook), in which the user has the choice to grant or deny access to their records in those services. Although we can imagine such applications playing within the linked data world, in today’s post I’d like to focus more on a third-party, value-added data provider rather than a consumer.

FOAF+SSL is not harder (at least for the user) and is no less secure, but is definitely different! The reader is referred to Henry Story’s many blog posts and this excellent presentation (listen to the audio!) for details, but the basic idea is this: the requesting user is granted access if they have sufficient status in a social graph maintained at the server. “Sufficient status” is my terminology, and means that if the user’s distinguishing URI (their WebID) meets certain conditions within a (e.g.) FOAF graph on the server, such as “is known by two or more members,” then the user will be granted access. Update: As Henry notes in his follow-up comment, his post Sketch of a RESTful photo Printing service with foaf+ssl (Oct 2009) provides a great example of applying FOAF+SSL in this way.
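To illustrate what a “sufficient status” rule might look like on the server side, here is a small Python sketch of the “is known by two or more members” condition. It assumes the TLS handshake has already verified that the requester controls the WebID, and it stands in for a real FOAF graph with a plain dictionary; the names is_authorized, KNOWS and MEMBERS, and the example.org WebIDs, are all hypothetical.

```python
def is_authorized(webid, knows, members, min_vouchers=2):
    """Grant access if at least `min_vouchers` members foaf:know the WebID.

    `knows` maps a member's WebID to the set of WebIDs that member knows.
    A member cannot vouch for themselves.
    """
    vouchers = {
        member
        for member in members
        if member != webid and webid in knows.get(member, set())
    }
    return len(vouchers) >= min_vouchers


# Toy social graph held at the server (hypothetical WebIDs).
ALICE = "https://example.org/alice#me"
BOB = "https://example.org/bob#me"
CAROL = "https://example.org/carol#me"
DAVE = "https://example.org/dave#me"

KNOWS = {
    ALICE: {CAROL},
    BOB: {CAROL, DAVE},
}
MEMBERS = {ALICE, BOB}
```

With this graph, is_authorized(CAROL, KNOWS, MEMBERS) is true because both members know Carol, while Dave, known only to Bob, would be refused. Richer policies could replace the simple count with any query over the graph.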

The FOAF+SSL approach is highly original and attractive for a number of applications because it naturally “fits” with the notion of dynamic access control based on community membership, but it seems like it must be stretched a bit to be applied to access control based on explicit terms that have been codified in a license. Still, I believe it can be made to work, and some very novel applications are possible; for example, I can imagine some pretty cool implication-based access control policies! One possible downside is that although I’ve seen many discussions detailing the client/server interactions — thinking that I believe is necessary — I haven’t seen much exploration of the “server-side” policies.

Spending time thinking about FOAF+SSL has naturally brought to mind the policy-aware web (and related) research of Lalana Kagal (MIT CSAIL) and others. The current post only concerns access control and assumes an out-of-band license over the data; a future post will explore how (and whether!) we can determine if an organization’s data usage complies with the terms under which they’ve licensed it, in a fashion similar to what Kagal and her co-authors Oshani Seneviratne and Tim Berners-Lee described in Policy Aware Content Reuse on the Web. For more general background, readers are encouraged to read The Semantic Web and Policy by Kagal, Berners-Lee and Jim Hendler.

Other updates:

In his post Signing FOAF files: FOAF files as certificates Bruno Harbulot considers in some depth how to create a FOAF-based Web-of-Trust securely, in a similar way to PGP. Bruno also mentions Jeremy Carroll’s related work on the topic, Signing RDF Graphs.
The ESW Wiki lists many resources for FOAF+SSL, including this detailed FOAF+SSL HOWTO.
WebAccessControl (also on the ESW Wiki) describes a “decentralized system for allowing different users and groups various forms of access to resources where users and groups are identified by HTTP URIs.” It uses FOAF+SSL for authentication. One of the listed implementations is the mod_authz_webid Apache WebID authorization module.


Posted in linked data, rights expression | Tags: FOAF+SSL, metadata, oauth, policy, policy-aware, rights expression, security

Posted by: John Erickson | January 19, 2010

The DOI, DataCite and Linked Data: Made for each other!

In recent Friendfeed and Twitter posts, Chris Rusbridge (Director of the Digital Curation Centre at Edinburgh) raises some good questions on the topic of persistent data citation. First, on the value of the DOI and using the DataCite model:

What are the advantages of DOIs as dataset identifiers in citations? The premise of DataCite is that DOIs are important, and will let us link documents and data in a sensible way. DOIs imply metadata, and that’s good. But articles (small, stable things whose publisher may occasionally change identity) are different from data (multi-scalar, highly…

Then, as to whether using the DataCite model “breaks” linked data:

Does a DOI for data (eg in data citation) break #linkeddata rule? Should we use HTTP URI instead? Is dx.doi.org ok?

My simple answer is that DataCite and linked data — or, more to the point, the DOI and linked data — are in essence made for each other. A longer answer is that the DOI infrastructure provides conveniences, such as multiple resolution, and also certain advantages, such as security, as they pertain to referencing and accessing scientific and other datasets. The bottom line is that while the DOI infrastructure does depend upon the non-HTTP protocols of the Handle System “under the hood,” from the consumer’s perspective DOI-based name resolution can (and usually does) operate completely within the “web space.” For linking to articles or datasets, the more familiar URI form of DOIs which combines a given DOI with the URL of a Handle System proxy (e.g. http://dx.doi.org/10.1109/MIC.2009.93) may be used instead of the “native” DOI form.
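As a minimal sketch of the URI form described above, the following Python turns a "native" DOI name into an HTTP URI by prefixing a proxy address. The function name `doi_to_uri`, the handling of a leading "doi:" label, and the basic sanity check are my own assumptions for illustration; only the proxy address and the example DOI come from the post.

```python
from urllib.parse import quote

# Proxy address taken from the example in the post.
DOI_PROXY = "http://dx.doi.org/"


def doi_to_uri(doi: str, proxy: str = DOI_PROXY) -> str:
    """Return the HTTP URI form of a DOI name (hypothetical helper)."""
    doi = doi.strip()
    # Assumption: tolerate the common "doi:" label in front of the name.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # DOI names start with the "10." directory indicator and contain a
    # "/" separating prefix from suffix; reject anything else.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode characters that are unsafe in a URI path, but keep "/".
    return proxy + quote(doi, safe="/")


print(doi_to_uri("doi:10.1109/MIC.2009.93"))
```

The result is an ordinary HTTP URI, so it can be used anywhere a linked data client expects one.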

Background: DOI… The Digital Object Identifier is a global system for persistently identifying objects in the digital environment. DOI® names may be assigned to any entity and are used to provide current information about objects, including URIs that can be resolved to manifestations of the resources and/or their metadata. Information about DOI-named digital objects may change over time, including where they may be found, but their DOI names will not change. The DOI System therefore provides a framework for persistent identification, managing intellectual content, managing metadata, linking customers with content suppliers, facilitating electronic commerce, and enabling automated management of media. DOI names can be used for any form of management of any data, whether commercial or non-commercial.

Background: DataCite… Perhaps the best way to think about the DataCite initiative and its DOI registration agency, the German National Library of Science and Technology (TIB), is that it brings to datasets what CrossRef has brought to journal articles for a decade. Since 2005 TIB has registered around 600,000 research datasets with DOI names, allowing easy access and improving citability. DataCite’s objectives are to make it easier for scientists to access research data over the Internet, to increase acceptance of research data as quotable scientific objects in themselves (aka first-class objects) and thus to ensure that the rules of good scientific practice continue to be adhered to. See also DataCite: A global registration agency for research data (Jan Brase, 2009).

The DOI and Linked Data: I’ve been heard saying that “I’ve been active in the DOI community since before it was the DOI,” and more than 10 years ago I worked with friends and colleagues in the publishing industry, at CNRI and at the IDF on ways to leverage the multiple resolution capabilities of the DOI to supply a variety of object-specific metadata types. Our concept, which we first demonstrated on the floor of the Frankfurt Book Fair in 1998, was to persist object-specific URI queries to a variety of type-specific datasets (descriptive metadata, rights metadata, bibliographic metadata, etc) in individual records of each object’s DOI. Then, in theory, any third-party application or service that had the DOI “in hand” could readily access whatever registered, “preferred” metadata sets had been associated with the object, each supplied by a different, authoritative vendor. Note: This approach possibly varies from the open world assumption inherent in the Semantic Web and the linked open data worlds, and merits a future post!
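The multiple-resolution idea above can be pictured with a toy data structure. This is only an illustration under my own assumptions: the class `DoiRecord`, its field names, the metadata type labels and the example.org/example.com URIs are all hypothetical, and it does not reflect the actual Handle System record format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DoiRecord:
    """Toy model of one DOI with several typed metadata pointers."""

    doi: str
    # Maps a metadata type label to the URI query that supplies it.
    metadata_uris: Dict[str, str] = field(default_factory=dict)

    def resolve(self, metadata_type: str) -> str:
        """Return the URI registered for the requested metadata type."""
        try:
            return self.metadata_uris[metadata_type]
        except KeyError:
            raise LookupError(
                f"no {metadata_type!r} metadata registered for {self.doi}"
            ) from None


# Hypothetical record: each metadata type comes from a different supplier.
record = DoiRecord(
    doi="10.1109/MIC.2009.93",
    metadata_uris={
        "descriptive": "http://metadata.example.org/describe?doi=10.1109/MIC.2009.93",
        "rights": "http://rights.example.com/query?doi=10.1109/MIC.2009.93",
    },
)

print(record.resolve("rights"))
```

An application holding only the DOI would ask for the metadata type it needs and follow the returned URI to that supplier.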

Since its inception the DOI has seemed to me to be an elegant way to manage and deliver highly-distributed metadata supply chains. In January 1998 I wrote in “Requirements for DOI-Based Applications and Services” that…

…there will be rich metadata sets available upon which to base these services. For these assumptions to prevail on an industry-wide basis, content producers and administrators must have a framework available that facilitates metadata collection at various stages of the production process, as well as post-production maintenance. This framework would establish essential (or “standard”) metadata elements upon which services are based, with each class of service having its own requisite set of elements. To allow for differentiation in quality of service between providers, the framework must also allow for the introduction of unique or value-added metadata…A point of useful discussion in the future should concern what metadata is proprietary, what should be left ‘open’ for universal exchange, and how do we implement protocols for metadata exchange. This latter discussion has in fact begun, as the W3C takes up work on a resource description framework (RDF)…

…At which point I apparently was kidnapped by aliens, to return 10 years later when this vision was actually made practical and scalable by way of linked data principles!

Seriously, the idea of applications and services consuming data from a variety of sources is old hat for the linked data community but has been gaining momentum in other worlds, including the library community. For example, Todd Carpenter of NISO summarized current developments in the standardization of metadata supply chains in his February 2009 Against the Grain column Transforming Metadata. And Renee Register (OCLC) created a fabulous presentation on multi-sourced metadata and interoperability in From ONIX to MARC and Back Again: New Frontiers in Metadata Creation at OCLC (ALA Midwinter 2009).

I welcome any comments, questions and especially corrections you might have on this topic!

Update (08 Feb 2010): Interested readers should read my recent post DOIs, URIs and Cool Resolution to understand some subtle issues regarding how the Handle System’s HTTP proxies do content negotiation, and how this might affect their use in linked data models. Subtle, but very important!

Update (28 Apr 2011): Great news! This week CrossRef, the IDF and CNRI announced the completion of their implementation of Content Negotiation for CrossRef DOIs. As that post describes, it is an implementation of “Option D” in last year’s CrossTech post, DOIs and Linked Data: Some Concrete Proposals. Ed Summers provides a great explanation of the significance of this in his recent INKDROID post, DOIs as Linked Data.
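To make content negotiation concrete, here is a minimal sketch of what a client request might look like. It only builds the request and does not send it. The helper name `build_rdf_request` and the choice of `application/rdf+xml` as the requested media type are my assumptions; which media types a given DOI actually returns depends on its registration agency.

```python
from urllib.request import Request

DOI_PROXY = "http://dx.doi.org/"


def build_rdf_request(doi: str, media_type: str = "application/rdf+xml") -> Request:
    """Build (but do not send) an HTTP request for a DOI's metadata.

    The Accept header asks the server for a machine-readable
    representation instead of the publisher's landing page.
    """
    return Request(DOI_PROXY + doi, headers={"Accept": media_type})


req = build_rdf_request("10.1109/MIC.2009.93")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the lookup. The point is that the same DOI-based URI serves both human readers and linked data clients, with the Accept header selecting the representation.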


Posted in linked data, metadata | Tags: crossref, doi, idf, libraries, linked data, metadata, oclc, rdf, semantic web
