Bitwacker Associates

Posted by: John Erickson | June 2, 2012

Dries Buytaert (Drupal founder) on Making Large Volunteer-Driven Projects Sustainable

Recently Dries Buytaert, creator of Drupal and founder of Drupal.org, gave a wonderful talk at the Berkman Center for Internet & Society on the topic of Making Large Volunteer-Driven Projects Sustainable. A podcast of Dries’ entire talk is available at the MediaBerkman site. Dries is also the founder of Acquia.com, a Drupal-based solutions provider.

Here’s a snippet from the abstract of the talk:

Dries Buytaert — the original creator and project lead for the Drupal open source web publishing and collaboration platform, and president of the Drupal Association — shares his experiences on how he grew the Drupal community from just one person to over 800,000 members over the past 10 years, and, generally, how large communities evolve and how to sustain them over time.

As Dries recounts in his talk, the Drupal platform has experienced massive growth and adoption over the past decade, including significant penetration among web sites hosting open government data around the world — including the United States Data.gov site and numerous other federal government sites.

I highly recommend this talk to those interested in Drupal, in the open source ecosystem, and generally in the care and feeding of communities. I found Dries’ thoughts on the economic relationship between the platform, its developers and their level of commitment to be particularly interesting: if developers depend upon a platform for their income, they are more likely to be passionate about advancing it as loyal contributors.

Drupal seems to be more than that; there seems to be an ethic that accepts re-factoring of the platform to keep it and the Drupal community current with new technologies, giving developers the opportunity to explore new skills. There is a fascinating symbiotic relationship between economics and advancing technology that favors adopters and contributors passionate about being on the cutting edge.

This talk “re-factored” my own thinking about Drupal, and tweaked my thinking about the open source ecosystem!

Posted in Big Ideas, Data.gov, drupal, open source, software development, software engineering

Posted by: John Erickson | April 3, 2012

Is access to the Internet a basic human right?

This morning on my town’s listserv a neighbor quoted an Estonian colleague who observed (during a recent conference call),

“Internet access is a human right.”

I’m very familiar with this meme but was curious if the right to access communications infrastructure (of any kind) had any official standing.

Although the freedom to participate in communications networks is not specifically mentioned in the Universal Declaration of Human Rights, in June 2011 the UN Human Rights Council did release a report declaring the Internet to be “an indispensable tool for realizing a range of human rights, combating inequality, and accelerating development and human progress” and that “facilitating access to the Internet for all individuals, with as little restriction to online content as possible, should be a priority for all States.” See analysis here and here. You may remember that this caused headlines like “Internet access is a human right” to go around the world; you may also remember Secretary of State Hillary Clinton’s earlier remarks regarding Internet freedom. Here is a powerful excerpt from her statement:

There are many other networks in the world. Some aid in the movement of people or resources, and some facilitate exchanges between individuals with the same work or interests. But the internet is a network that magnifies the power and potential of all others. And that’s why we believe it’s critical that its users are assured certain basic freedoms. Freedom of expression is first among them. This freedom is no longer defined solely by whether citizens can go into the town square and criticize their government without fear of retribution. Blogs, emails, social networks, and text messages have opened up new forums for exchanging ideas, and created new targets for censorship.

In reading through the UDHR I was a bit surprised that speech is mentioned only once, in the Preamble, as what seems like an aspirational goal, and never in the thirty articles. Does anyone know the history of this omission? When the UDHR was written, was actual freedom of speech too much of a hot button? And, what official status do these UN reports have?

BTW: Vint Cerf, the co-inventor (with Bob Kahn) of the Internet (and current VP at Google), opined in Jan 2012 that while access to the Internet may be an enabler of human rights, access to the Internet itself is not. As I read the UN report and Hillary Clinton’s remarks, I believe the notion of Internet-as-enabler is their larger point, and Vint Cerf is perhaps splitting hairs…

Posted in Uncategorized

Posted by: John Erickson | June 20, 2011

Elsevier/Tetherless World Health & Life Sciences Hackathon (27-28 June 2011)

Create Apps; Win Prizes!

The Tetherless World Constellation at RPI is pleased to announce that TWC and the SciVerse team at Elsevier are planning a Health and Life Sciences-themed, 24-hour hackathon to be held 27-28 June 2011. The event is sponsored by Elsevier and held at Pat’s Barn, on the campus of the Rensselaer Technology Park.

After a short tutorial period by TWC RPI staff and distinguished guests, participants will compete with each other to develop Semantic Web mashups using linked data from TWC and other sources, web APIs from Elsevier SciVerse, and visualization and other resources from around the Web.

Prizes
The contest will encompass building apps utilizing the SciVerse API and other resources in multiple categories, including Health and Life Sciences and Open classes. Overall, there will be three winners:

First place: $1500
Second place: $1000
Third place: $500

Judging
A distinguished panel of judges has been assembled that includes domain experts, faculty and senior representatives from Elsevier:

Paolo Ciccarese (Scientist and Senior Software Engineer, Mass General Hospital; Faculty, Harvard Medical School)
Chris Baker (Research Chair, Innovatia)
Bob Powers (Semantics Engineer, Consultant at Predictive Medicine)
M. Scott Marshall (Department of Medical Statistics and Bioinformatics, Leiden University Medical Center)
Ora Lassila (Principal Technologist, Nokia; co-author of the W3C RDF specification)
Elizabeth Brooks (Head of Computing & IT, UHI, Scotland)
Hajo Oltmanns (Elsevier: SVP Health Sciences Strategy)
Scott Virkler (Elsevier: SVP e-Products Global Medical Research)
Helen Moran (Elsevier: VP Smart Content Strategy)

Refreshments
All attendees will be provided lunch, dinner, and a midnight snack on 27 June and breakfast and lunch on 28 June.

Travel Assistance
A small amount of travel assistance will be made available for students and non-profits on a competitive basis. Please see our Travel Assistance page or contact us for further details.

Travel and Lodging Information
See the Elsevier/Tetherless World Health and Life Sciences Hackathon web site for specific information about transportation and lodging near the venue. Please note that the Hackathon runs for 24 hours, so it is unlikely that participants will want lodging on the night of 27 June…

Contacts
Please browse to the Contacts area of the Elsevier/Tetherless World Health and Life Sciences Hackathon web site or follow the Eventbrite event organizer link if you have questions.

Follow us on Twitter!
The hashtag for this event is #TWCHack11

Posted in computer science, linked data | Tags: elsevier, hackathon, health, life sciences, RPI, TWCHack11, TWCRPI

Posted by: John Erickson | May 31, 2011

Energizing Innovation Research through Linked Open Patent Data

Please note this is a DRAFT and may change throughout the day (1 June 2011)

On June 17 I will be joining other researchers at a Patent Data Workshop jointly hosted by the USPTO and NSF at the U.S. Patent & Trademark Office in Alexandria, VA. This workshop, supported by the USPTO Office of Chief Economist and the Science of Science and Innovation Policy Program (SciSIP) at the NSF, will bring researchers together to share their ideas on how to facilitate the more efficient use of patent and trademark data, and ultimately to improve both the quantity and caliber of innovation policy scholarship.

The stated goals of this workshop include:

Creating an information exchange infrastructure for both the production and informed evaluation of transparent, high-quality research into innovation;
Promoting an intellectual environment particularly hospitable to high-impact quantitative studies;
Creating a distinct community with well-developed research norms and cumulative influence; and
Championing the development of a platform to support a robust body of empirical research into the economic and social consequences of innovation.

Each participant planning to attend this workshop has been asked to prepare a blog post that outlines (a) our understanding of the most significant theoretical or empirical challenges in this space, and/or (b) where the frontier of knowledge is, what innovative things are being done at the frontier — or within reach of being done to solve the set of problems — and where targeted funding could yield the highest payoffs in getting to solutions. The purpose of this post is to offer some of my thoughts based on progress made by linked open government data initiatives in the US and around the world.

Background: The Tetherless World and Linked Open Government Data
Since early 2010 the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute has collaborated with the White House Data.gov team to make thousands of open government datasets more accessible for consumption by web-based applications and services, including mashups leveraging Semantic Web technologies. TWC has created an infrastructure, embodied by the TWC LOGD Portal, for automatically converting to RDF and enhancing government data published in tabular (e.g. CSV) format; publishing these converted datasets as downloadable “dump files” and through SPARQL endpoints; and demonstrating highly effective methodologies for using such linked open government data assets as the basis for the agile creation of lightweight, powerful visualizations and other mashups. In addition to providing a searchable interface to thousands of converted Data.gov datasets, the TWC LOGD Portal publishes a growing set of demos and tutorials for use by the LOGD community.
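To make the tabular-to-RDF step concrete, here is a minimal Python sketch of the general idea: each CSV row becomes a subject URI and each non-empty cell becomes one N-Triples statement. This is only an illustration, not the TWC LOGD converter; the base URI, the `/row/` and `/vocab/` URI patterns, and the sample data are all assumptions made up for the example.

```python
import csv
import io
from typing import List
from urllib.parse import quote

# Hypothetical base URI for the converted dataset (an assumption for this sketch).
BASE = "http://example.org/dataset/demo"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text: str, base: str = BASE) -> List[str]:
    """Turn each non-empty CSV cell into one N-Triples line with a plain literal."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{base}/row/{row_number}>"
        for column, value in row.items():
            # Skip surplus cells (column is None) and missing or empty values.
            if column is None or not isinstance(value, str) or value == "":
                continue
            local_name = quote(column.strip().lower().replace(" ", "_"))
            predicate = f"<{base}/vocab/{local_name}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


SAMPLE = "State,Magnitude\nAlaska,5.1\nCalifornia,\n"
for line in csv_to_ntriples(SAMPLE):
    print(line)
```

A real conversion would also add datatypes, labels and links to shared vocabularies (the "enhancing" step); this sketch stops at the raw, row-by-row translation.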

The Data.gov/TWC LOGD partnership and similar international LOGD efforts, especially the UK’s Data.gov.uk initiative, have demonstrated the value and potential for innovation achieved by exposing government data using linked data principles. Indeed, the effective application of the linked data approach to a multitude of data sharing and integration challenges in commerce, industry and eScience has shown its promise as a basis for a more efficient, agile research information exchange infrastructure.

Recommendation: Create a “DBPedia” for Patent Data
The Linked Open Data Cloud diagram famously illustrates the growing number of providers of linked open data around the world. Careful examination of the LOD Cloud shows that most sources are sparsely linked, and a very few, most notably DBPedia.org, are extremely heavily linked. The reason is that the Web of Data has increasingly adopted DBPedia as a reliable source or hub for canonical entity URIs. This means that as providers put their datasets online, they enhance their datasets by providing sameAs links to DBPedia URIs for named entities within these datasets. This enables their datasets to be easily linked to other datasets and increases their utility and value as the basis for visualizations and linked data mashups.
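As a rough illustration of what "providing sameAs links" means in practice, the Python sketch below emits `owl:sameAs` statements (as N-Triples) that tie a provider's own entity URIs to DBPedia resource URIs. The local URIs and labels are hypothetical, and deriving the DBPedia URI directly from a label is a simplifying assumption; a real pipeline would need proper entity resolution before asserting that two URIs name the same thing.

```python
from typing import Dict, List
from urllib.parse import quote

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def dbpedia_uri(label: str) -> str:
    """Naively guess a DBPedia resource URI from a label (spaces become underscores)."""
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"), safe="_(),'")


def same_as_triples(local_entities: Dict[str, str]) -> List[str]:
    """Map {local entity URI: label} to owl:sameAs statements in N-Triples syntax."""
    return [
        f"<{local_uri}> <{OWL_SAME_AS}> <{dbpedia_uri(label)}> ."
        for local_uri, label in local_entities.items()
    ]


# Hypothetical entities from a provider's dataset.
entities = {
    "http://example.org/patents/assignee/42": "Rensselaer Polytechnic Institute",
    "http://example.org/patents/place/7": "New York",
}
for triple in same_as_triples(entities):
    print(triple)
```

Once two datasets both point at the same DBPedia URI, a consumer can join them on that URI without either provider knowing about the other, which is what makes the hub so heavily linked.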

Providers embrace DBPedia’s URI conventions as “canonical” in order to make their datasets more easily adopted. Our objective with patent and trademark reference data and research information in general must be to break down barriers to its widespread use, recognizing that we may have no idea how it may be used. Linked data principles and the Web of Data emerging from them have re-written what it means to make data integration easy. Whereas even a few short years ago it was useful to simply provide a searchable patent database through a proprietary UI, next-generation innovation infrastructures will be based on globally interlinked graphs driven by concept and descriptive metadata extracted from patent records, research publications, business publications and indeed data from social networks. Scholars of innovation will traverse these graphs and mash them with other graphs in ways we cannot anticipate, and thus make serendipitous discoveries about the process of innovation we cannot predict today.

My DBPedia reference comes from the idea of identifying concepts and specific manifestations of innovation in the patent corpus. Consider an arbitrary patent disclosure; it can be represented as a graph of concepts and related manifestations. The infrastructure I’m proposing will enable the interlinking of URI-named concepts, not only with other patent records but also with scientific literature, the financial and news media, social networks, etc. From a research standpoint, this will enable the study of the emergence, spread and influence of innovation in many dimensions.

Conclusions
The USPTO has already made great strides in improving access to and understanding of patent and trademark data; an excellent example is the Data Visualization Center and specific data visualization tools such as the Patent Dashboard, which provides graphic summaries of USPTO activities. These are “canned apps,” however; the next generation of open government will require finer grained access to this data, presented as enhanced linked data and using open licensing principles. As USPTO datasets are presented in this way, researchers will be able to interlink this data with datasets from other sources, resulting in a more effective study of the causes of innovation and indeed the outcomes of government programs intended to stimulate innovation.

References

NSF Patent Data Workshop. NSF Award Abstract #1102468 (31 Jan 2011).
Julia Lane, The Science of Science and Innovation Policy (SciSIP) Program at the US National Science Foundation. OST Bridges vol. 22 (July 2009)

Posted in Uncategorized

Posted by: John Erickson | February 11, 2011

TWC LOGD Million Dataset Challenge

Many quotes have been attributed to Steve Jobs, but my favorite is the following:

Set totally outrageous goals!

Well, one of the more “outrageous” goals for the Linking Open Government Data project at the Tetherless World Constellation at RPI this term is to create the most comprehensive and useful catalog of open government datasets in the world.

To this end, I am challenging our students — and indeed everyone within earshot — to participate in what I’ve dubbed the TWC LOGD Million Dataset Challenge: I’m challenging you to help us create a master catalog of more than 1 million open government datasets from around the world! In return, we’ll make the catalog publicly available through our TWC LOGD Portal, as RDF dumps and via a SPARQL endpoint.

To get this thing started and to make it as easy as possible, I’ve created a Google Form-based interface. Follow the link, add metadata, move on…

I’ve structured the form to accept both catalog and individual dataset entries. Just choose the right options in the form…

To submit a dataset: http://bit.ly/g0d3tY

To view the current status (spreadsheet): http://bit.ly/eANqSg Total (18 Feb 2011): More than 331,345 datasets

A few resources to get started:

Worldwide Search CKAN: The Data Hub Prime source!
Guardian.co.uk’s catalog of over 12K world government datasets Prime source!
Where can I get large datasets open to the public? (Quora) Prime source!
DataMarket.com’s International Dataset Search page Prime source!
USA Data.gov raw data catalog Prime source!
USA Data.gov geodata catalog Prime source!
UK Data.gov.uk project Prime source!
Africa Africover datasets download (multiple countries) Prime source!
Blog listing many catalogs Prime source!
EU Official National Data Catalogs Prime source!
ePSI Public Sector Information (PSI) Data Catalogues (by governments) Prime source!
Open Data Euskadi International Catalog (Based on ePSI list) Prime source!
Open Knowledge List of European Open Data Catalogues Prime source!
Civic Commons List of Open Data Initiatives Prime source!
Univ. of Colorado (Boulder) Libraries Foreign Information by Country Prime source!
Factual.com’s “comprehensive” repository of government data
New York State GIS Clearinghouse
New York State Dept. of Health Statistics
California Dept of Education data collections
OpenGovernmentData.org catalogs
GovLoop.com’s List of Open Government Plans (US federal agencies)
Socrata open datasets
Danish data catalog
Finnish data catalog
Australian data catalog
spaghettiopendata.org Citizen-driven Italian Open Data site
Istat.it Italian government statistical data
Datasets from the Piedmont region of Italy
data.CNR.it The Italian National Research Council site
Public Datasets on Amazon Web Services
Alaska State GeoSpatial Data Clearinghouse (Alaska DNR)
Asian Development Bank Statistical Database Service

Note: As the dataset grows we’ll improve both the data entry and catalog interface. Our immediate goal is to grow the list…

Note: Watch for answers to this Quora question: What is the most comprehensive list of international open government datasets?

Posted in Big Ideas, Data.gov, government transparency, linked data, metadata | Tags: data.gov, gov 2.0, government transparency, linked data, linked open data, open government, open government data

Posted by: John Erickson | January 19, 2011

“Falling down is part of LIFE…Getting back up is LIVING”

Posted by: John Erickson | December 20, 2010

Fall 2010 TWC-RPI Undergraduate Research Summaries

The Fall 2010 semester marked the beginning of the Tetherless World Constellation’s undergraduate research program at Rensselaer Polytechnic Institute (RPI). Although TWC has enjoyed significant contributions from RPI undergrads since its inception, this term we stepped up our game by more “formally” incorporating a group of undergrads into TWC’s research programs, establishing regular meetings for the group and, with input from the students, beginning to outfit their own space in RPI’s Winslow Building.

Patrick West, my fellow TWC undergrad research coordinator, and I asked the students to blog about their work throughout the semester; with the end of term, we asked them to post summary descriptions of their work and their thoughts about the fledgling TWC undergrad research program itself. We’ve provided short summaries and links to those blogs below…

Cameron Helm began the term coming up to speed on SPARQL and RDF, experimented with several of the public TWC endpoints, and then worked with Philip on basic visualizations. He then slashed his way through the tutorials on TWC’s LOGD Portal, eventually creating impressive visualizations such as this earthquake map. Cameron is very interested in the subject of data visualization and looks to do more work in this area in the future.
After a short TWC learning period, Dan Souza began helping doctoral candidate Evan Patton create an Android version of the Mobile Wine Agent application, with all the amazing visualization and data integration required, including Twitter and Facebook integration. Mid-semester Dan also responded to the call to help with the “crash” development of the Android/iPhone TalkTracker app, in time for ISWC 2010 in early November. Dan continues to work with Evan and others for early 2011 releases of Android, iPhone/iPod Touch and iPad versions of the Mobile Wine Agent.
David Molik reports that he learned web coding skills, ontology creation, server installation and administration. David contributed to the development and operation of a test site for the new, semantic web savvy website for the Biological and Chemical Oceanography Data Management Office (BCO-DMO) of the Woods Hole Oceanographic Institution.
Jay Chamberlin spent much of his time working on the OPeNDAP Project, an open source server to distribute scientific data that is stored in various formats. His involvement included everything from learning his way around the OPeNDAP server, to working with infrastructure such as TWC’s LDAP services, to helping migrate documentation from the previous Wiki to the new Drupal site, to actually implementing required changes to the OPeNDAP code base.
Philip Ng worked on a wide variety of projects this fall, starting with basic visualizations, helping with ISWC applications, and including iPad development for the Mobile Wine Agent. Philip’s blog is fascinating to read as he works his way through the challenges of creating applications, including his multi-part series on implementing the social media features.
Alexei Bulazel began working with Dominic DiFranzo on a health-related mashup using Data.gov datasets and is now working on a research paper with David on “human flesh search engine” techniques, a topic that top thinkers including Tetherless World Senior Constellation Professor Jim Hendler have explored in recent talks. Note: For more background on this phenomenon, see e.g. China’s Cyberposse, NY Times (03 Mar 2010)

Many of these students will be continuing on with these or other projects at TWC in 2011; we also expect several new students to be joining the group. The entire team at the Tetherless World Constellation thanks them for their efforts and many important contributions this fall, and looks forward to being amazed by their continued great work in the coming year!

John S. Erickson, Ph.D.

Posted in computer science, Data.gov, linked data, web science | Tags: data.gov, Jim Hendler, linked data, Rensselaer, RPI, semantic web, Tetherless World Constellation, undergraduate research, web science

Posted by: John Erickson | December 19, 2010

The TWC/Elsevier Data.gov Dataset Search App

Since Summer 2010 I’ve had the privilege of working as a research engineer at the Tetherless World Constellation (TWC) at RPI, primarily helping the team in the execution of various projects related to their association with the Obama Administration’s Data.gov initiative. One of those projects is an applet for the Elsevier SciVerse Hub portal. The following is from the description page for our application.

Data.gov Dataset Search (Profile View)

The US Government Dataset Search application is an easy way for SciVerse users and developers to search from among over 300,000 available US government datasets at http://data.gov to automatically find matches to their queries. Based on the user’s SciVerse Hub query, searches are simultaneously made against all datasets published through Data.gov as well as the RDF-converted data and related demos at the Linking Open Government Data (LOGD) portal, created by the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI).

Any user with the ability to search SciVerse Hub can use the US Government Dataset Search application. The application and the government data it exposes are made available free of charge. The US Government Dataset Search application is targeted at both SciVerse end users (researchers) and application developers interested in applying government datasets to their applications. Researchers utilizing SciVerse Hub are able to discover and access contextually relevant data from the US Government. Developers may utilize SciVerse Hub to identify RDF-converted data sets based on the US Government data and access this data in their applications through SPARQL endpoints or retrieve the datasets themselves.

How the US Government Dataset Search application works: For each SciVerse query the user makes, a keyword search across all current Data.gov datasets is made via a SPARQL endpoint at the TWC LOGD portal. A summary of these results is presented on the Hub search results page. Detailed results are presented in tabular form in the ‘Canvas’ (larger) view by clicking on any link. On the canvas view, links are provided directly to the Data.gov dataset description pages as well as RDF-converted versions of these datasets at the TWC LOGD portal. Note that faceted search is not available with the application and only the original query in Hub will be submitted.

All queries are made against the LOGD SPARQL endpoint at http://logd.tw.rpi.edu/sparql The application also makes use of the Google Visualization toolkit.
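For readers curious what a keyword search over a SPARQL endpoint might look like, here is a small Python sketch that builds a query and the corresponding request URL. It is not the application's actual query: the use of `dcterms:title` as the property holding dataset titles, the function names, and the result limit are assumptions for illustration, and the sketch only constructs the request rather than sending it, since the endpoint may not be reachable.

```python
from urllib.parse import urlencode

# Endpoint named in the post; whether it is still online is not assumed here.
ENDPOINT = "http://logd.tw.rpi.edu/sparql"


def escape_sparql_string(text: str) -> str:
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_keyword_query(keyword: str, limit: int = 10) -> str:
    """Build a case-insensitive substring search over dataset titles.

    Assumes titles are published with dcterms:title (hypothetical for this catalog).
    """
    needle = escape_sparql_string(keyword.lower())
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT ?dataset ?title WHERE {\n"
        "  ?dataset dcterms:title ?title .\n"
        f'  FILTER CONTAINS(LCASE(STR(?title)), "{needle}")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(keyword: str, limit: int = 10) -> str:
    """Encode the query as the standard 'query' parameter of a SPARQL GET request."""
    return ENDPOINT + "?" + urlencode({"query": build_keyword_query(keyword, limit)})


print(build_keyword_query("Earthquake"))
print(build_request_url("Earthquake"))
```

`CONTAINS` and `LCASE` are SPARQL 1.1 functions; an older endpoint might need a `regex(..., "i")` filter instead.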

This application is optimized for Firefox, Chrome and Internet Explorer 8.

For more information about creating mashups using Data.gov datasets, please check out RPI’s Linking Open Government Data (LOGD) Portal at http://logd.tw.rpi.edu

About the TWC Linking Open Government Data project: The TWC LOGD team investigates opening and linking government data using Semantic Web technologies. TWC LOGD actively develops tools for the large-scale translation of government-related datasets into RDF, linking them into the ‘Web of Data’ and providing demos and tutorials on various means for consuming linked government data, including creating mashups, applications and data visualizations. The TWC LOGD Portal was awarded second place (open division) at the 2010 Semantic Web Challenge, held during the 2010 International Semantic Web Conference (ISWC 2010).

About the Tetherless World Constellation at RPI: The Tetherless World Constellation addresses the emerging area of Web Science, focusing on the World Wide Web and its future use. Faculty in the constellation lead explorations into the principles that underlie the Web; enhance the Web’s reach beyond the desktop and laptop computer; and develop new technologies and languages that expand the capabilities of the Web. TWC researchers use powerful scientific and mathematical techniques from many disciplines to explore the modeling of the Web from network- and information- centric views. TWC’s objectives include making the next generation web natural to use while being responsive to the growing variety of policy and social needs, whether in the area of privacy, intellectual property, general compliance, or provenance. The Tetherless World Constellation is designing new techniques to explore social, scientific, and legal impacts of the evolving technologies deployed on the Web.

News about the TWC/Elsevier US Government Dataset Search Application

Featured in Looking Back at 2010 at Rensselaer RPI News & Events (20 Dec 2010)
SciVerse Hub Application Connects Researchers with U.S. Government Datasets Information Today (20 Dec 2010)
U.S. Government Dataset Search Opens Data.gov to Scientists Data.gov website (14 Dec 2010)
New Application Allows Scientists Easy Access to Important Government Data RPI News & Events (10 Dec 2010)
New Application Allows Scientists Easy Access to Important Government Data Lab Manager Magazine (13 Dec 2010)
New Application Allows Scientists Easy Access to Important Government Data EurekAlert (10 Dec 2010)
New Application Allows Scientists Easy Access to Important Government Data Physorg.com (10 Dec 2010)
New Application Allows Scientists Easy Access to Important Government Data FirstScience.com (10 Dec 2010)
New Application Allows Scientists Easy Access to Important Government Data NewsoDrone.com (10 Dec 2010)

UPDATE: I’m currently developing an iGoogle Gadget version of the SciVerse app, based on the same core queries. A screen shot of the “profile” view of that app appears below. In addition to enabling me to monitor the health of our systems from my desktop, it also enables me to test out possible features for the SciVerse app itself.

iGoogle Gadget version of the US Government Dataset Search app

Posted in computer science, Data.gov, government transparency, linked data, web science | Tags: data.gov, elsevier, government transparency, linked data, mashups, open government data, RPI, sciverse, Tetherless World Constellation

Posted by: John Erickson | October 28, 2010

What I Want in a Software Developer(tm)

Professors and students in a nearby research group have been brainstorming a syllabus for a new, low-level computer science course. Normally I only “lurk” in such discussions, but this time I couldn’t hold my tongue. The following is my contribution, from my perspective as one who has interacted with “computer scientists” as a fellow team member, project leader, hiring manager, business partner and even corporate recruiter (interviewing mostly for other hiring managers).

This version has been edited slightly to make it better suited for a blog…

As an “old guy” who has interviewed his share of CS, CE and EE’s over the years (and hired and/or managed more than a few of them), here are my thoughts from an “outcomes” perspective…

It’s really exciting to work with a developer who groks the concepts to such a degree that specific languages and language boundaries simply don’t matter. Seeing a prototype done in Erlang because it was perfectly suited is SO much better than listening to whining over how it is hard to do it in Java or C# or Visual Basic N. Such developers are usually curious about everything; the dude that coded a prototype NoSQL-style data store for our team in Erlang had been playing with it for a few months, “just because…”

Methodical problem solving matters. Which some would equate to Engineering(tm). But really it’s about gaining a ton of experience attacking problems. The number one thing I’ve looked for over the years is actual experience — through project work, interesting course projects, and esp. internships — in completing cool projects. And please, don’t wait to be assigned; always look for problems, and just do them.

Join the software ecosystem. The most impressive developers I’ve met over the years — some are currently undergrads at the Tetherless World Constellation at RPI — understand how to contribute to software ecosystem(s); usually this is through the open source community. They understand the tools, they understand how to engage with other developers, they understand how to analyze and improve other people’s code.

Here’s one way to think about it: if you aspire to be a professional musician (or artist), chances are you’ve participated in the “music ecosystem” in a wide variety of ways for many years, even before entering college. The best developers I’ve met — and those “computer scientists” who are developers at heart — have done the same (one guy I know built his first Linux kernel when he was in middle school).

Understand systems end-to-end. Now we’re back to the topic at hand 😉 The best contributors over the years have been those who had hands-on experience with absolutely every aspect of the “system.” This doesn’t mean going From Relays to Twitter in 10 Weeks, but it does mean understanding the relationships between all system elements.

I doubt very much that this is a problem for anyone on this list, because the very nature of PKI work requires one to have just this sort of broad and deep knowledge; plus, your professor and I have had a few conversations about this over the years…BTW, my daughter’s now at Southampton working on her Ph.D in numerical relativity and writing code on a supercomputer cluster 😉

UPDATE (29 Oct 2010): Nature recently published this interesting article, Computational science: …Error …why scientific programming does not compute, (13 Oct 2010) on the increasing need for scientists to have hard-core software engineering skills to do their science.

Posted in Big Ideas, computer science, management, software development, software engineering | Tags: computer science, cs curricula, Erlang, management, numerical relativity, PKI, software engineering, SOTON

Posted by: John Erickson | July 12, 2010

Data Quality is in its Fitness to the Beholder

A few weeks ago Leigh Dodds began a thoughtful discussion on SemanticOverflow with the question:

There’s an increasing variety of data available as Linked Data coming from a range of different sources. I’m wondering what indicators we might use to judge the “quality” of a dataset…Clearly quality is a subjective thing, but I’d be interested to know what factors people might use to indicate whether a dataset was trustworthy, well modelled, sustainable, etc.

For starters, I think we can all agree at the highest level that the measure of data quality is subjective and that “beauty is in the eye of the beholder”: the quality of a dataset is measured by its fitness for use in specific applications. This question of determining and disseminating “fitness” scores is the rub!

In his answer to Leigh’s question, Tim Finin proposes adopting a PageRank-like mechanism, “LODrank,” based on measured usage:

We could define LODrank as a PageRank-like measure that was a function of the number of links to/from other LOD datasets weighted by their LODrank. Alternatively, it might [be] divided by the number of linkable instances in the collection, so that large datasets did not have an advantage…

This approach scores data quality based on observed fitness as evidenced by discovered use and has the advantage of automation.
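
To make the idea concrete, here is a minimal sketch of what a LODrank-style computation might look like. It is my own illustration, not Tim’s definition: the function names, the damping factor, the iteration count and the input format (a dictionary of link counts between datasets) are all assumptions, and it credits only the target of each link rather than both ends.

```python
def lodrank(links, damping=0.85, iterations=50):
    """PageRank-style score over a graph of datasets (illustrative sketch).

    links maps a source dataset to {target dataset: number of links}.
    The damping factor and iteration count are assumed defaults.
    """
    datasets = set(links)
    for targets in links.values():
        datasets.update(targets)
    datasets = sorted(datasets)
    n = len(datasets)

    # Start with equal rank everywhere.
    rank = {d: 1.0 / n for d in datasets}
    for _ in range(iterations):
        new_rank = {d: (1.0 - damping) / n for d in datasets}
        for source in datasets:
            targets = links.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # A dataset with no outbound links spreads its rank evenly.
                for d in datasets:
                    new_rank[d] += damping * rank[source] / n
            else:
                # Each target receives rank in proportion to its link count.
                for target, count in targets.items():
                    new_rank[target] += damping * rank[source] * count / total
        rank = new_rank
    return rank


def size_normalized(rank, instance_counts):
    """Divide each score by the dataset's number of linkable instances."""
    return {d: rank[d] / instance_counts[d] for d in rank}
```

The second function corresponds to the alternative in the quote: dividing by the number of linkable instances so that sheer size does not confer an advantage.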

My replies went in a different direction, focusing instead on the subjective nature of data quality and the need to aggregate consumer-space rankings of datasets across a set of dimensions. In his 2005 white paper Principles of Data Quality [1] Arthur D. Chapman writes,

Data quality is multidimensional, and involves data management, modelling and analysis, quality control and assurance, storage and presentation. As independently stated by Chrisman [2] and Strong et al. [3], data quality is related to use and cannot be assessed independently of the user. In a database, the data have no actual quality or value [4]; they only have potential value that is realized only when someone uses the data to do something useful. Information quality relates to its ability to satisfy its customers and to meet customers’ needs [5]

Chapman goes on to enumerate a set of factors that contribute to fitness-for-use, citing Redman [6]:

Accessibility
Accuracy
Timeliness
Completeness
Consistency with other sources
Relevance
Comprehensiveness
Providing a proper level of detail
Easy to “read”
Easy to “interpret”

Each of these factors is fundamentally subjective, even if mechanisms exist within particular domains to take their measure “objectively.” Indeed, in some domains such ratings might only be done by humans, either through voting mechanisms or by individual reviewers.

I believe the greater linked data community needs to develop vocabulary terms for expressing metrics for data quality — consider the ten points above — and then within individual communities develop agreed-upon means to determine those values. Arguably this is a “Dublin Core” approach to the problem, in the sense that terms like completeness or consistency would be reused across domains with inherently different domain-specific meanings, but such reuse would facilitate consumers from other communities choosing datasets from outside their expertise. A non-physicist might then say, “The physics community says this dataset is accurate, by their measures.”
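
As a rough sketch of how such community-scoped ratings might be aggregated, consider the following. Everything here is hypothetical: the dimension identifiers are just lowercase renderings of the ten factors above, the numeric scores assume some agreed scale, and a simple mean stands in for whatever method a community would actually adopt.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical identifiers for the ten fitness-for-use factors listed above.
DIMENSIONS = (
    "accessibility", "accuracy", "timeliness", "completeness",
    "consistency", "relevance", "comprehensiveness",
    "level_of_detail", "readability", "interpretability",
)


def aggregate_ratings(ratings):
    """Average scores per (community, dataset, dimension).

    ratings is an iterable of (community, dataset, dimension, score) tuples.
    The same dimension term is reused across communities, but scores are
    never mixed between communities.
    """
    buckets = defaultdict(list)
    for community, dataset, dimension, score in ratings:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        buckets[(community, dataset, dimension)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Keeping the community in the key is the point of the design: a consumer from outside the field sees the physics community’s accuracy score as theirs, by their measures, rather than a blended number.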

Some of these factors are even more deeply subjective and must be evaluated dynamically, based on the consumer’s immediate context. An example of this is relevance, which could be interpreted as equivalent to a recommendation.

If you have thoughts on data quality as it applies to linked data, consider answering Leigh’s question at SemanticOverflow!

References: (as cited by Chapman)

[1] Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.
[2] Chrisman, N.R. 1991. The Error Component in Spatial Data. pp. 165-174 in: Maguire D.J., Goodchild M.F. and Rhind D.W. (eds) Geographical Information Systems Vol. 1, Principles. Longman Scientific and Technical.
[3] Strong, D.M., Lee, Y.W. and Wang, R.W. 1997. Data quality in context. Communications of ACM 40(5): 103-110.
[4] Dalcin, E.C. 2004. Data Quality Concepts and Techniques Applied to Taxonomic Databases. Thesis for the degree of Doctor of Philosophy, School of Biological Sciences, Faculty of Medicine, Health and Life Sciences, University of Southampton. November 2004. 266 pp.
[5] English, L.P. 1999. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. New York: John Wiley & Sons, Inc. 518 pp.
[6] Redman, T.C. 2001. Data Quality: The Field Guide. Boston, MA: Digital Press.

Posted in data quality, linked data, metadata, ranking | Tags: data quality, linked data, ranking
