Posted by: John Erickson | January 10, 2014

What’s all this about a W3C DRM standard?

Over the past few days there has been renewed discussion of the controversial W3C Encrypted Media Extension proposal, following the publication of a revised draft (07 Jan 2014). Today I’d like to provide a bit of background, based on my long experience in the digital rights management “game” and my familiarity with the W3C process.

Who are the players? The primary editors of the W3C EME draft are employed by Google, Microsoft and Netflix, but corporate affiliation really only speaks to one’s initial interest; W3C working groups try to work toward consensus, so we need to go deeper and see who is actually active in the formulation of the draft. Since W3C EME is a work product of the HTML Working Group, one of the W3C’s largest, the stakeholders for EME are somewhat hidden; one needs to trace the actual W3C “community” involved in the discussion. One forum appears to be the W3C Restricted Media Community Group; see also the W3C restricted media wiki and mailing list. A review of email logs and task force minutes indicates regular contributions from representatives of Google, Microsoft, Netflix, Apple, Adobe, Yandex, a few independent DRM vendors such as Verimatrix, and of course W3C. Typically these contributions are highly technical.

A bit of history: The “world” first began actively debating the W3C’s interest in DRM as embodied by the Encrypted Media Extension in October 2013, when online tech news outlets like Infoworld ran stories about W3C director Tim Berners-Lee’s decision to move forward and the controversy around that choice. In his usual role as anti-DRM advocate, Cory Doctorow first erupted that October, but the world seems to be reacting with renewed vigor now. EFF has also been quite vocal in its opposition to W3C entering this arena. Stakeholders blogged that EME was a way to “keep the Web relevant and useful.”

The W3C first considered action in the digital rights management arena in 2001, hosting the Workshop on Digital Rights Management (22-23 January 2001, INRIA, Sophia Antipolis, France), which was very well attended by academics and industrial types including the likes of HP Labs (incl. me), Microsoft, Intel, Adobe, RealNetworks, several leading publishers, etc.; see the agenda. The decision at that time was Do Not Go There, largely because it was impossible to get the stakeholders at that time to agree on anything “open,” but also because in-browser capability was limited. Since that time there have been considerable advancements in support for user-side rendering technologies, not to mention the evolution of JavaScript and the creation of HTML5; it is clear that W3C EME is a logical, if controversial, continuation in that direction.

What is this Encrypted Media Extension? The most concise way to explain EME is that it is an extension to HTML5’s HTMLMediaElement that enables proprietary controlled content handling schemes, including encrypted content. EME does not specify a specific content protection scheme, but instead allows vendor-specific schemes to be “hooked” in via API extensions. Or, as the editors describe it,

“This proposal allows JavaScript to select content protection mechanisms, control license/key exchange, and implement custom license management algorithms. It supports a wide range of use cases without requiring client-side modifications in each user agent for each use case. This also enables content providers to develop a single application solution for all devices. A generic stack implemented using the proposed APIs is shown below. This diagram shows an example flow: other combinations of API calls and events are possible.”

[Figure: A generic stack implemented using the proposed W3C Encrypted Media Extension APIs]
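To make that flow concrete, here is a minimal sketch of how a page might drive the EME APIs, based on the Promise-based shape of later drafts. The key system string and the license-exchange callback are assumptions of mine: EME deliberately leaves the license protocol to the application and the CDM vendor.

```javascript
// Sketch of the EME flow: select a key system, attach MediaKeys to a
// <video> element, then answer license requests from the CDM.
// 'com.example.drm' is a placeholder key system; `requestLicense` is an
// app-supplied callback, since EME does not define the license protocol.
async function setupEncryptedPlayback(nav, video, keySystem, requestLicense) {
  // 1. Ask the user agent which content protection scheme it can supply
  const access = await nav.requestMediaKeySystemAccess(keySystem, [{
    initDataTypes: ['cenc'],
    videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.42E01E"' }],
  }]);
  // 2. Create MediaKeys and bind them to the media element
  const mediaKeys = await access.createMediaKeys();
  await video.setMediaKeys(mediaKeys);
  // 3. When encrypted media is encountered, open a key session
  video.addEventListener('encrypted', async (ev) => {
    const session = mediaKeys.createSession();
    // 4. The CDM emits 'message' events carrying opaque license requests
    session.addEventListener('message', async (msg) => {
      const license = await requestLicense(msg.message); // app-defined exchange
      await session.update(license); // hand the license back to the CDM
    });
    await session.generateRequest(ev.initDataType, ev.initData);
  });
}
```

The essential point for the debate is visible in the sketch itself: the license request and response are opaque blobs, and the decryption happens inside a proprietary CDM that the open specification never describes.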

Why is EME needed? One argument is that EME allows content providers to adopt content protection schemes in ways that are more browser- and platform-independent than before. DRM has a long history of user-unfriendliness, brittle platform dependence and platform lock-in; widespread implementation could improve user experiences while giving content providers and creators more choices. The dark side, of course, is that EME could make content protection an easier choice for providers, thereby locking down more content.

The large technology stakeholders (Google, Microsoft, Netflix and others) will likely reach a consensus that accommodates their interests, and those of stakeholders such as the content industries. It remains unclear how the interests of the greater Internet are being represented. As an early participant in the OASIS XML Rights Language Technical Committee (ca. 2002) I can say these discussions are very “engineer-driven” and tend to be weighted to the task at hand — creating a technical standard — and rarely are influenced by those seeking to balance technology and public policy. With the recent addition of the MPAA to the W3C, one worries even more about how the voice of the individual user will be heard.


John Erickson is the Director of Web Science Operations (DirWebSciOps) with the Tetherless World Constellation at Rensselaer Polytechnic Institute, managing the delivery of large scale open government data projects that advance Semantic Web best practices. Previously, as a principal scientist at HP Labs John focused on the creation of novel information security, identification, management and collaboration technologies. As a co-founder of NetRights, LLC John was the architect of LicensIt(tm) and @ttribute(tm), the first digital rights management (DRM) technologies to facilitate dialog between content creators and users through the dynamic exchange of metadata. As a co-founder of Yankee Rights Management (YRM), John was the architect of Copyright Direct(tm), the first real-time, Internet-based service to fully automate the complex copyright permissions process for a variety of media types.

Posted by: John Erickson | July 29, 2013

Imagination, Policymaking and Web Science

On 26 July the Pew Research Center for the People & the Press released Few See Adequate Limits on NSA Surveillance Program…But More Approve than Disapprove, which they’ve summarized in this post. Here’s a snippet…

…(D)espite the insistence by the president and other senior officials that only “metadata,” such as phone numbers and email addresses, is being collected, 63% think the government is also gathering information about the content of communications – with 27% believing the government has listened to or read their phone calls and emails…Nonetheless, the public’s bottom line on government anti-terrorism surveillance is narrowly positive. The national survey by the Pew Research Center, conducted July 17-21 among 1,480 adults, finds that 50% approve of the government’s collection of telephone and internet data as part of anti-terrorism efforts, while 44% disapprove. These views are little changed from a month ago, when 48% approved and 47% disapproved.

A famous conclusion of the 9/11 Commission was that a chronic and widespread “failure of imagination” led to the United States leaving its defenses down and enabling Bin Laden’s plot to succeed. This is a bit of an easy defense, and history has shown it to not be completely true, but I think in general we do apply a kind of double-think when contemplating extreme scenarios. I think we inherently moderate our assumptions about how far our opponents might go to win and the range of methods they will consider. How we limit our creativity is complex, but it is in part fueled by how well informed we are.

The Pew results would be more interesting if the same questions had been asked before the Edward Snowden thing, because it would have created a “baseline” of sorts for how expansive our thinking was and is. What the NSA eruption has shown us is that our government is willing to collect data at a much greater scale than most people imagined. The problem lies with that word, imagined. What if we asked instead, “What is POSSIBLE?” Not “what is possible within accepted legal boundaries,” but rather “what is possible, period, given today’s technology?” For example, what if the NSA were to enlist Google’s data center architects to help them design a state-of-the-art platform?

Key lawmakers no doubt were briefed on the scale of the NSA’s programs years ago, but it is unlikely most of the legislators or their staffers were or are capable of fully appreciating what is possible with the data collected, esp. at scale. One wonders who is asking serious, informed questions about what is possible with the kind and scale of data collected. Who is evaluating the models? Who is on the outside, using science to make educated guesses about what’s “inside”?

Many versions of the web science definition declare our motivation ultimately to be “…to protect the Web.” We see the urgency and the wisdom in this call as we watch corporations and governments construct massive platforms that enable them to monitor, analyze and control large swaths and facets of The Global Graph. It is incumbent upon web scientists to not simply study the Web, but to use the knowledge we gain to ensure that society understands what influences the evolution of that Web. This includes the daunting task of educating lawmakers.

Why study web science? Frankly, because most people don’t know what they’re talking about. On the issues of privacy, tracking and security, most people have no idea what is possible in terms of large-scale data collection, what can be learned by applying modern analytics to collected network traffic, and what the interplay is between technological capabilities and laws. Fewer still have a clue how to shape the policy debate based on real science, especially a science rooted in the study of the Web.

Web science as a discipline gives us hope that there will be a supply of knowledgeable — indeed, imaginative — workers able to contribute to that discussion.

Posted by: John Erickson | July 23, 2013

Senator Leahy’s position on Aaron’s Law and CFAA Reform

Recently I wrote each member of Vermont’s congressional delegation, Senators Patrick Leahy and Bernie Sanders and Congressman Peter Welch, regarding Aaron’s Law, a proposed bill named in memory of the late Aaron Swartz that would implement critical changes to the notorious Computer Fraud and Abuse Act (CFAA) (18 U.S.C. 1030). As usual, Senator Leahy responded quickly and with meat:

Dear Dr. Erickson:

Thank you for contacting me about the need to reform the Computer Fraud and Abuse Act (CFAA). I appreciate your writing to me about this pressing issue.

In my position as Chairman of the Senate Judiciary Committee, I have worked hard to update the Computer Fraud and Abuse Act in a manner that protects our personal privacy and our notions of fairness. In 2011, I included updates to this law in my Personal Data Privacy and Security Act that would make certain that purely innocuous conduct, such as violating a terms of use agreement, would not be prosecuted under the CFAA. This bill passed the Judiciary Committee on November 22, 2011, but no further action was taken in the 112th Congress. I am pleased that others in Congress have joined the effort to clarify the scope of the CFAA through proposals such as Aaron’s law. Given the many threats that Americans face in cyberspace today, I believe that updates to this law are important. I am committed to working to update this law in a way that does not criminalize innocuous computer activity.

As technologies evolve, we in Congress must keep working to ensure that laws keep pace with the technologies of today. I have made this issue a priority in the past, and will continue to push for such balanced reforms as we begin our work in the 113th Congress.

Again, thank you for contacting me, and please keep in touch.


United States Senator

Thanks again for your great service to Vermont and the United States, Sen. Leahy!


Posted by: John Erickson | July 19, 2013

Whistleblowing, extreme transparency and civil disobedience

In her recent post Whistleblowing Is the New Civil Disobedience: Why Edward Snowden Matters the great danah boyd wrote:

Like many other civil liberties advocates, I’ve been annoyed by how the media has spilled more ink talking about Edward Snowden than the issues that he’s trying to raise. I’ve grumbled at the “Where in the World is Carmen Sandiego?” reality show and the way in which TV news glosses over the complexities that investigative journalists have tried to publish as the story unfolded. But then a friend of mine – computer scientist Nadia Heninger – flipped my thinking upside down with a simple argument: Snowden is offering the public a template for how to whistleblow; leaking information is going to be the civil disobedience of our age.

For several weeks I’ve debated with friends and colleagues over whether Mr. Snowden’s acts indeed represent civil disobedience and not some other form of protest. I’ve argued, for example, that they might not because he didn’t hang around to “face the consequences.” danah’s post provoked me to examine my views more deeply, and I sought out a more formal definition (from the Stanford Encyclopedia of Philosophy) to better frame my reflection. Based on how Mr. Snowden’s acts exhibit characteristics including conscientiousness, communication, publicity and non-violence, I do now see his whistleblowing as an example of civil disobedience.

Conscientiousness: All the evidence suggests that Mr. Snowden is serious, sincere and has acted with moral conviction. To paraphrase the Stanford Encyclopedia, he appears to have been motivated not only out of self-respect and moral consistency but also by his perception of the interests of his society.

Communication: Certainly Mr. Snowden has sought to disavow and condemn US policy as implemented by the NSA and has successfully drawn public attention to this issue; he has also clearly motivated others to question whether changes in laws and/or policies are required. The fact that he has legislators from both sides of the aisle arguing among themselves and with the Obama Administration is testimony to this. It is not clear to me what specific changes (if any) Mr. Snowden is actually seeking, and he certainly has not been actively engaged in instigating changes e.g. behind the scenes, but I don’t think this is required; his acts are clearly about effecting change by committing extreme acts of transparency.

Publicity: This is an interesting part of the argument; while e.g. Rawls and Bedau argue that civil disobedience must occur in public, openly, and with fair notice to legal authorities, Smart states what seems obvious: to provide notice in some cases gives political opponents and legal authorities the opportunity to suppress the subject’s efforts to communicate. We can safely assume that Mr. Snowden did not notify his superiors at the NSA, but his acts might still be regarded as “open,” as they were closely followed by an acknowledgment and a statement of his reasons for acting. He has not fully disclosed what other secret documents he has in his possession, but it does not appear he has anonymously released any documents, either.

Non-violence: To me this is an important feature of Mr. Snowden’s acts; as far as we know, Mr. Snowden has focused on exposing the truth, not on violence or destruction. This is not to say that forms of protest that do result in damage to property (e.g. web sites) are not civil disobedience; rather, the fact that he did not deface web sites or (to our knowledge) violate access control regimes does qualify his acts as non-violent.

I have no idea whether Mr. Snowden read Thoreau’s Civil Disobedience or even the Wikipedia article, but his acts certainly exhibit the characteristics of civil disobedience and may serve as a “template” for whistleblowers moving forward. As a technologist, my fear is that his acts also provide a “use case” for security architects, raising the bar for whistleblowers who aim to help us (in danah’s words) “critically interrogate how power is operationalized…”

Note: This post originally appeared as a comment to danah boyd, Whistleblowing Is the New Civil Disobedience: Why Edward Snowden Matters.

Posted by: John Erickson | June 10, 2013

Enabling Linked (Open) Data Commerce through Metadata

Posted by: John Erickson | January 22, 2013

I Heart Linux Mint

UPDATED 07 Jan 2016: Since late May 2009 I have been a Linux fanboy. My initial motivation for taking the plunge was learning that I would soon be euphemized from the research arm of a major computer corporation and would be on my own later that year. I was also interested in migrating toward a more researcher-friendly environment; many of the reference implementations for radical new directions in Web technology, including and especially Linked Data, were easier to get working on either a Linux derivative or MacOS, and I was increasingly frustrated by Windoze, the official corporate platform.

I first dipped my toe in the Linux pond ten years earlier, having set up Red Hat Linux on a test machine as a platform for breaking (mostly) server-side code, but was not comfortable with it for “primetime” use. All that changed with my first evaluation of Ubuntu Jaunty Jackalope (ca. April 2009). I found the shell to be more than usable; the selection of open source code was amazing, literally every application I needed; the performance on my tired machine was a radical improvement over Windoze; and certain essential tasks that had been extremely difficult under Red Hat (esp. VPN) were now clean and easy. I “sandblasted” my main work machine and haven’t gone back. For my remaining months with Giganticorp, if I needed to execute some stodgy Windoze-only crapware I fired up Windoze on VirtualBox, ever-amazed that it actually worked.

I’ve become an Ubuntu and esp. Linux Mint evangelist among my friends. Since the Linux kernel is so much more efficient than Windoze, I show anyone who will listen how they can prolong the life of their machine, and generally decrapulate their computing experience, by sandblasting it and installing the most recent release of Ubuntu. I continually win converts, to my utter amazement! My ultimate “feat-of-strength” is probably sandblasting a ca. 1999 iMac G3 “Blueberry” and successfully installing Ubuntu, thus (in theory) prolonging its life.

Sadly, good things can be negatively affected by entropy. With Natty Narwhal the geniuses in charge started messing around with the shell (previously GNOME), introducing an abomination called Unity with 11.04, ultimately committing to it with Oneiric Ocelot. This is when Linux Mint sauntered by my office window; I was soon out of my chair and chasing it down the street!

I think of Mint as “a more careful release of Ubuntu, without the crap and knee-jerk changes.” For a recent feature comparison see Linux Mint vs. Ubuntu. Mint is self-described as being “conservative” with updates and being sensitive to its users, especially from the developer community. The key is that Mint uses Ubuntu’s code repositories seamlessly, so the user does not sacrifice anything by choosing Mint over Ubuntu. Currently all my machines are running Linux Mint 17.3 “Rosa” (MATE) using the MATE shell.

John’s Linux Mint customizations: Immediately after installing a new distribution of Mint I install the following “essential” applications, using either the command line or Synaptic Package Manager:

NOTE: Be sure to disconnect external monitors before installing Linux Mint on laptops. If you don’t, the installer may get confused and mess up the hardware configuration. Linux Mint handles external monitors nicely after installation.

  • Docky: A cool MacOS-like application dock (use Synaptic; remember to enable “compositing”; add a Chrome icon afterward)
  • Google Chrome: My preferred web browser (may require separate installation of libcurl3)
  • libfile-mimeinfo-perl: Perl module to determine file types; required starting with Linux Mint 16 for “Show in folder” in Chrome to work properly (use Synaptic)
  • Skype: Skype needs no introduction (see the revised HOWTO, including how to force Skype to use Chrome)
  • Shutter: A great screen shot manager (use Synaptic)
  • VirtualBox: Virtual machine host (command line recommended; when installing Windoze, remember to enable passthrough)
  • HPLIP: HP Linux Imaging and Printing (essential!!)
  • vpnc: Command line VPN client (sudo apt-get install vpnc)
  • curl: Command line HTTP client (sudo apt-get install curl)
  • svn (subversion): Version control client (sudo apt-get install subversion)
  • Audacity: Insanely great audio editor (use Synaptic)
  • Gedit: My preferred text editor (use Synaptic)
  • Emacs: A workhorse text editor (use Synaptic)
  • texlive: LaTeX for Ubuntu (use Synaptic)
  • IHMC CmapTools: The concept mapping tool
  • Protege (desktop version): The ontology editor
  • Dropbox: Store and share your stuff in the cloud!
  • Filezilla: GUI-oriented ftp client, for maintaining ancient web sites (use Synaptic)
  • csv2rdf4lod automation: Tim Lebo’s awesome RDF conversion power tool; csv2rdf4lod is now a component of PRIZMS
  • Tor Browser Bundle: Protect your privacy; defend yourself against network surveillance and traffic analysis
  • MuseScore: Open source music composition and notation software (use Synaptic)
  • icedtea: Browser plugin to run Java applets, esp. in Chrome (use Synaptic)
  • Kismet: 802.11 layer-2 wireless network detector (use Synaptic; edit /etc/kismet/kismet.conf after installation)
  • YouTube audio ripping: youtube-dl, FFmpeg and lame work together to enable ripping of audio tracks from YouTube videos (other guides also available)
  • vmware-view: VMware Horizon client for running some virtual desktop interfaces (use Synaptic)
  • The R Language: R is a language and environment for statistical computing and graphics (see also the RStudio and R-updating HOWTOs)
  • Processing: “Processing is an amazingly simple open source programming language (and basic IDE) that makes it possible to easily prototype graphics-heavy and Arduino projects…”
  • Hardware Abstraction Layer (HAL): Enables Mint to deal with Hulu and Amazon’s DRM on streaming content
  • Java JDK: “Java Development Kit includes various development tools like the Java source compilers, bundling and deployment tools, debuggers, development libraries, etc…”

  • Latest Linux Mint version installed: Linux Mint 17.3 “Rosa” (MATE) (64 bit)
  • Since Linux Mint 15 I’ve installed using full disk encryption with no apparent loss of performance. For further information, see esp. “The Performance Impact Of Linux Disk Encryption On Ubuntu 14.04 LTS” (Michael Larabel, March 2014)
  • This list used to be longer, but applications like Pidgin are now installed by default, and I need only go looking for them.
  • I’ll usually “pin” applications like Chrome, Skype, Gedit, GIMP, Terminal, Synaptic Package Manager, etc. to Docky after verifying they are installed.
  • Happily, it is no longer necessary to wave the “chicken feet” to get multimedia features to work, a common ritual for Linux users!

Recently Dries Buytaert, creator of Drupal, gave a wonderful talk at the Berkman Center for Internet & Society on the topic of Making Large Volunteer-Driven Projects Sustainable. A podcast of Dries’ entire talk is available at the MediaBerkman site. Dries is also the founder of a Drupal-based solutions provider.

Here’s a snippet from the abstract of the talk:

Dries Buytaert — the original creator and project lead for the Drupal open source web publishing and collaboration platform, and president of the Drupal Association — shares his experiences on how he grew the Drupal community from just one person to over 800,000 members over the past 10 years, and, generally, how large communities evolve and how to sustain them over time.

As Dries recounts in his talk, the Drupal platform has experienced massive growth and adoption over the past decade, including significant penetration among web sites hosting open government data around the world — including the United States site and numerous other federal government sites.

I highly recommend this talk to those interested in Drupal, in the open source ecosystem, and generally in the care and feeding of communities. I found Dries’ thoughts on the economic relationship between the platform, its developers and their level of commitment to be particularly interesting: if developers depend upon a platform for their income, they are more likely to be passionate about advancing it as loyal contributors.

Drupal seems to be more than that; there seems to be an ethic that accepts refactoring of the platform to keep it and the Drupal community current with new technologies, giving developers the opportunity to explore new skills. There is a fascinating symbiotic relationship between economics and advancing technology that favors adopters and contributors passionate about being on the cutting edge.

This talk “re-factored” my own thinking about Drupal, and tweaked my thinking about the open source ecosystem!

Posted by: John Erickson | April 3, 2012

Is access to the Internet a basic human right?

This morning on my town’s listserv a neighbor quoted an Estonian colleague who observed (during a recent conference call),

“Internet access is a human right.”

I’m very familiar with this meme but was curious if the right to access communications infrastructure (of any kind) had any official standing.

Although the freedom to participate in communications networks is not specifically mentioned in the Universal Declaration of Human Rights, in June 2011 the UN Human Rights Council did release a report declaring the Internet to be “an indispensable tool for realizing a range of human rights, combating inequality, and accelerating development and human progress” and that “facilitating access to the Internet for all individuals, with as little restriction to online content as possible, should be a priority for all States.” You may remember that this caused headlines like “Internet access is a human right” to go around the world; you may also remember Secretary of State Hillary Clinton’s earlier remarks regarding Internet freedom. Here is a powerful excerpt from her statement:

There are many other networks in the world. Some aid in the movement of people or resources, and some facilitate exchanges between individuals with the same work or interests. But the internet is a network that magnifies the power and potential of all others. And that’s why we believe it’s critical that its users are assured certain basic freedoms. Freedom of expression is first among them. This freedom is no longer defined solely by whether citizens can go into the town square and criticize their government without fear of retribution. Blogs, emails, social networks, and text messages have opened up new forums for exchanging ideas, and created new targets for censorship.

In reading through the UDHR I was a bit surprised that speech is mentioned only once, in the Preamble, as what seems like an aspirational goal, and never in the thirty articles. Does anyone know the history of this omission? When the UDHR was written, was actual freedom of speech too much of a hot button? And, what official status do these UN reports have?

BTW: Vint Cerf, the co-inventor (with Bob Kahn) of the Internet (and current VP at Google), opined in Jan 2012 that while access to the Internet may be an enabler of human rights, access to the Internet itself is not. As I read the UN report and Hillary Clinton’s remarks, I believe the notion of Internet-as-enabler is their larger point, and Vint Cerf is perhaps splitting hairs…

Create Apps; Win Prizes!

The Tetherless World Constellation at RPI is pleased to announce that TWC and the SciVerse team at Elsevier are planning a Health and Life Sciences-themed, 24-hour hackathon to be held 27-28 June 2011. The event is sponsored by Elsevier and held at Pat’s Barn, on the campus of the Rensselaer Technology Park.

After a short tutorial period by TWC RPI staff and distinguished guests, participants will compete with each other to develop Semantic Web mashups using linked data from TWC and other sources, web APIs from Elsevier SciVerse, and visualization and other resources from around the Web.

The contest will encompass building apps utilizing the SciVerse API and other resources in multiple categories, including Health and Life Sciences and Open classes. Overall, there will be three winners:

  • First place: $1500
  • Second place: $1000
  • Third place: $500

A distinguished panel of judges has assembled that includes domain experts, faculty and senior representatives from Elsevier:

  • Paolo Ciccarese (Scientist and Senior Software Engineer, Mass General Hospital; Faculty, Harvard Medical School)
  • Chris Baker (Research Chair, Innovatia)
  • Bob Powers (Semantics Engineer, Consultant at Predictive Medicine)
  • M. Scott Marshall (Department of Medical Statistics and Bioinformatics, Leiden University Medical Center)
  • Ora Lassila (Principal Technologist, Nokia; co-author of the W3C RDF specification)
  • Elizabeth Brooks (Head of Computing & IT, UHI, Scotland)
  • Hajo Oltmanns (Elsevier: SVP Health Sciences Strategy)
  • Scott Virkler (Elsevier: SVP e-Products Global Medical Research)
  • Helen Moran (Elsevier: VP Smart Content Strategy)

All attendees will be provided lunch, dinner, and midnight snack on 27 June and breakfast and lunch on 28 June.

Travel Assistance
A small amount of travel assistance will be made available for students and non-profits on a competitive basis. Please see our Travel Assistance page or contact us for further details.

Travel and Lodging Information
See the Elsevier/Tetherless World Health and Life Sciences Hackathon web site for specific information about transportation and lodging near the venue. Please note that the Hackathon runs for 24 hours, so it is unlikely that participants will want lodging on the night of 27 June…

Please browse to the Contacts area of the Elsevier/Tetherless World Health and Life Sciences Hackathon web site or follow the Eventbrite event organizer link if you have questions.

Follow us on Twitter!
The hashtag for this event is #TWCHack11

Posted by: John Erickson | May 31, 2011

Energizing Innovation Research through Linked Open Patent Data

Please note this is a DRAFT and may change throughout the day (1 June 2011)

On June 17 I will be joining other researchers at a Patent Data Workshop jointly hosted by the USPTO and NSF at the U.S. Patent & Trademark Office in Alexandria, VA. This workshop, supported by the USPTO Office of Chief Economist and the Science of Science and Innovation Policy Program (SciSIP) at the NSF, will bring researchers together to share their ideas on how to facilitate the more efficient use of patent and trademark data, and ultimately to improve both the quantity and caliber of innovation policy scholarship.

The stated goals of this workshop include:

  1. Creating an information exchange infrastructure for both the production and informed evaluation of transparent, high-quality research into innovation;
  2. Promoting an intellectual environment particularly hospitable to high-impact quantitative studies;
  3. Creating a distinct community with well-developed research norms and cumulative influence; and
  4. Championing the development of a platform to support a robust body of empirical research into the economic and social consequences of innovation.

Each participant planning to attend this workshop has been asked to prepare a blog post that outlines (a) our understanding of the most significant theoretical or empirical challenges in this space, and/or (b) where the frontier of knowledge is, what innovative things are being done at the frontier — or within reach of being done to solve the set of problems — and where targeted funding could yield the highest payoffs in getting to solutions. The purpose of this post is to offer some of my thoughts based on progress made by linked open government data initiatives in the US and around the world.

Background: The Tetherless World and Linked Open Government Data
Since early 2010 the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute has collaborated with the White House team to make thousands of open government datasets more accessible for consumption by web-based applications and services, including mashups leveraging Semantic Web technologies. TWC has created an infrastructure, embodied by the TWC LOGD Portal, for automatically converting government data published in tabular (e.g. CSV) formats to RDF and enhancing it; publishing the converted datasets both as downloadable “dump files” and through SPARQL endpoints; and demonstrating highly effective methodologies for using such linked open government data as the basis for the agile creation of lightweight, powerful visualizations and other mashups. In addition to providing a searchable interface to thousands of converted datasets, the TWC LOGD Portal publishes a growing set of demos and tutorials for use by the LOGD community.

The LOGD partnership and similar international LOGD efforts, especially the UK’s initiative, have demonstrated the value and potential for innovation achieved by exposing government data using linked data principles. Indeed, the effective application of the linked data approach to a multitude of data sharing and integration challenges in commerce, industry and eScience has shown its promise as a basis for a more efficient, agile research information exchange infrastructure.

Recommendation: Create a “DBPedia” for Patent Data
The Linked Open Data Cloud diagram famously illustrates the growing number of providers of linked open data around the world. Careful examination of the LOD Cloud shows that most sources are sparsely linked, while a very few, most notably DBPedia, are extremely heavily linked. The reason is that the Web of Data has increasingly adopted DBPedia as a reliable source, or hub, for canonical entity URIs. This means that as providers put their datasets online, they enhance those datasets with sameAs links to DBPedia URIs for the named entities within them. This makes their datasets easy to link with other datasets and increases their utility and value as the basis for visualizations and linked data mashups.

Providers embrace DBPedia’s URI conventions as “canonical” in order to make their datasets more easily adopted. Our objective with patent and trademark reference data, and with research information in general, must be to break down barriers to its widespread use, recognizing that we may have no idea how it will be used. Linked data principles, and the Web of Data emerging from them, have rewritten what it means to make data integration easy. Whereas even a few short years ago it was useful simply to provide a searchable patent database through a proprietary UI, next-generation innovation infrastructures will be based on globally interlinked graphs driven by concept and descriptive metadata extracted from patent records, research publications, business publications and indeed data from social networks. Scholars of innovation will traverse these graphs and mash them up with other graphs in ways we cannot anticipate, and thus make serendipitous discoveries about the process of innovation that we cannot predict today.

My DBPedia reference comes from the idea of identifying concepts and specific manifestations of innovation in the patent corpus. Consider an arbitrary patent disclosure: it can be represented as a graph of concepts and related manifestations. The infrastructure I’m proposing would enable the interlinking of these URI-named concepts, not only with other patent records but also with the scientific literature, the financial and news media, social networks, etc. From a research standpoint, this will enable the study of the emergence, spread and influence of innovations along many dimensions.

The USPTO has already made great strides in improving access to and understanding of patent and trademark data; an excellent example is the Data Visualization Center and specific tools such as the Patent Dashboard, which provides graphic summaries of USPTO activities. These are “canned apps,” however; the next generation of open government will require finer-grained access to this data, presented as enhanced linked data under open licensing principles. As USPTO datasets are presented in this way, researchers will be able to interlink them with datasets from other sources, enabling more effective study of the causes of innovation, and indeed of the outcomes of government programs intended to stimulate it.

