New data services 3: data.gov

The United States is the only Western country without a centralized data office. Instead, official statistics are produced by well over 100 agencies. This makes obtaining official US data difficult, and that’s somewhat of a paradox because in most cases these data are public and free. Of course, with data coming from so many sources, they also come in a variety of shapes and sizes. As Wired puts it,

Until now, the US government’s default position has been: If you can’t keep data secret, at least hide it on one of 24,000 federal Web sites, preferably in an incompatible or obsolete format.

A commitment made by the Obama administration was to tackle this and make data more widely available. To that end, a data portal was announced in early April, and data.gov was officially launched at the end of May.

Data.gov is three things in one.

A sign that this administration wants to make the data more accessible, especially to developers.

A shift towards open formats, such as XML.

A catalogue of datasets published by US government agencies.

The rationale is that with data.gov, data become available to wider audiences. There’s a fallacy in that, because the layperson cannot do much with an ESRI file. But hopefully, someone who can will build something out of it for the good of the community.

The aspect I found most interesting is the catalogue proper. For each indexed dataset, data.gov builds an abstract, inspired by the Dublin Core Metadata Initiative, with fields such as authoring agency, keywords, units, and the like. This, in itself, is not a technological breakthrough, but imagine if all the datasets produced by all the agencies were described in such a uniform fashion. Then retrieving data would be a breeze.
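To make that concrete, here is a rough sketch of what such a uniform abstract could look like, and how uniform metadata makes retrieval trivial. The field names loosely follow Dublin Core elements and are purely illustrative, not data.gov’s actual schema; the agency, values and URL are hypothetical.

```python
# Illustrative catalogue of dataset abstracts, loosely modelled on Dublin Core
# elements. Field names and values are examples only, not data.gov's schema.
catalog = [
    {
        "title": "Monthly unemployment rate by county",
        "creator": "Bureau of Labor Statistics",                     # authoring agency
        "subject": ["unemployment", "labour force"],                  # keywords
        "format": "XML",
        "coverage": "United States, by county",
        "units": "percent of labour force",
        "identifier": "https://example.gov/data/unemployment.xml",   # hypothetical location
    },
    # ... one such abstract per indexed dataset
]

def find_datasets(catalog, keyword):
    """Return every abstract whose keywords mention the search term."""
    keyword = keyword.lower()
    return [entry for entry in catalog
            if any(keyword in subject.lower() for subject in entry["subject"])]

for entry in find_datasets(catalog, "unemployment"):
    print(entry["title"], "-", entry["identifier"])
```

With every agency filling in the same fields, a one-line query like this is all it takes to locate a dataset, whoever produced it.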

Note that data.gov does not store the datasets itself. It provides a storefront that redirects users to the proper location once a dataset has been selected.

There have been other, similar initiatives. Fedstats.gov, allegedly, provided a link to every statistical item produced by the federal government. By its own admission, its home page was last updated in 2007, and its overall design hasn’t changed much since its launch by the Clinton administration in 1997 (a laudable effort at the time). Another initiative, http://usgovxml.com, is a private portal to all data available in XML format.

So, back to find > access > process > present > share. Where does data.gov fall?

It may come as a surprise that data.gov doesn’t touch the last three steps. Well, it certainly will surprise anyone expecting the government to open a user-centric, one-stop shop for data. Data.gov is certainly not a destination website for lay audiences.

It doesn’t host the data either; however, its existence drives agencies to publish their datasets in compliance with its standards. So we can say that it indirectly addresses access.

So what data.gov is really about is finding data. Currently, the site has two services to direct users to a dataset: a search engine and a catalogue. The browsable catalogue has only one layer of hierarchy, and while this is fine for its initial volume (47 datasets at launch, around 200 as of the end of June), it won’t suffice if the ambition is to index 100,000 federal data feeds.

All in all, it could be argued that data.gov doesn’t do much by itself. But what is interesting is what it enables others to do.

In the longer term, it will drive all agencies to publish their data under one publication standard. And if 100,000 datasets are published under that standard, and people use it to find them, then we will have a de facto industry standard for describing data. The consequences of that cannot be overestimated.

The other, less obvious long-term advantage is what it will allow developers to create. There are virtually no technical barriers to creating interesting applications on top of these datasets. Chances are that some of these applications will change our daily lives. And they will be invented not by the government, but by individuals, researchers or entrepreneurs. Quite something to look forward to.

New data services 2: Wolfram|Alpha

In March this year, überscientist Stephen Wolfram, of Mathematica fame, revealed to the world that he was working on something new, something big, something different. The first time I heard of this was through semantic web prophet Nova Spivack, who is not known to get excited by less-than-revolutionary projects. That, plus the fact that the project was announced so shortly before its release, built anticipation to huge levels.

[Screenshot: Wolfram|Alpha]

Wolfram|Alpha describes itself as a “computational knowledge engine” or, simply put, an “answer engine”. Like Google and other search engines, it tries to provide information based on a query. But while search engines simply try to retrieve the keywords of the query in their indexed pages, the answer engine tries to understand the query as a question and form an educated answer. In a sense, this is similar to the Freebase project, which aims to put all the knowledge of the world in a database where links can be established across items.

It attempts to detect the nature of each word of the query. Is it a city? A mathematical formula? A foodstuff? An economic variable? Once it understands the terms of the query, it gives the user all the data it can to answer.
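As a toy illustration of that term-typing step (and emphatically not Wolfram|Alpha’s actual algorithm, which is far more sophisticated), imagine looking each word of the query up in typed dictionaries:

```python
# Toy illustration of classifying query terms by type.
# The dictionaries and categories are made up for the example.
TYPED_DICTIONARIES = {
    "city": {"paris", "london", "new york"},
    "economic variable": {"gdp", "unemployment", "inflation"},
    "foodstuff": {"apple", "rice", "butter"},
}

def classify_term(term):
    """Return the candidate types of a single query term."""
    term = term.lower()
    return [kind for kind, words in TYPED_DICTIONARIES.items() if term in words]

def interpret_query(query):
    """Map each term of the query to its possible types."""
    return {term: classify_term(term) for term in query.split()}

print(interpret_query("unemployment Paris"))
# {'unemployment': ['economic variable'], 'Paris': ['city']}
```

Once every term has a type, the engine can decide which computations and displays make sense for the combination it was given.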

Here for instance:

[Screenshot: a Wolfram|Alpha query result]

Using the same find > access > process > present > share diagram as before:

Wolfram|Alpha has “find” covered. More about that below.

It lets you access the data. If data have been used to produce a chart, then there is a query that will retrieve those bare numbers in a table format.

Process is perhaps Wolfram|Alpha’s forte. It will internally reformulate and cook your query to produce every meaningful output it is capable of.

The presentation is excellent. It is very legible, consistent across the site, efficient and unpretentious. When charts are provided, which is often, they are small but relevant and informative: only the necessary data are plotted. This is unusual enough to be worth mentioning.

Wolfram|Alpha doesn’t allow people to share its outputs per se, but since a given query will produce consistent results, users can simply exchange queries or communicate links to a successful query result (see the sketch below).
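As a minimal sketch of what sharing amounts to, the snippet below builds such a link. The `?i=` query parameter is an assumption based on the site’s visible URLs, not a documented interface, so treat the exact format as approximate.

```python
from urllib.parse import quote_plus

def wolframalpha_link(query):
    """Build a shareable link to a Wolfram|Alpha result.

    The ?i= parameter is an assumption based on observed URLs,
    not documented behaviour.
    """
    return "http://www.wolframalpha.com/input/?i=" + quote_plus(query)

print(wolframalpha_link("GDP of France"))
# http://www.wolframalpha.com/input/?i=GDP+of+France
```

Since the same query yields the same result page, passing the link around is as good as passing the result around.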

Now back to finding data.

When a user submits a query, the engine does not query external sources of data in real time. Rather, it uses its internal, Freebase-like database, which is in turn updated from external sources when possible.

For each query, sources are listed. Unfortunately, the sources provided correspond to general categories rather than to the specific result. For instance, for all country-related information, the listed sources are the same: some are accurate and dependable (national or international statistical offices), others less reliable or verifiable (such as the CIA World Factbook, or what is cited as “Wolfram|Alpha curated data, 2009”). To me, that is the big flaw of this otherwise impressive system.

Granted, coverage is not perfect; that can only improve. The syntax is not always intuitive, and getting some results to appear in a particular way can be very elusive; this, too, will gradually get better over time. But being able to verify the data presented is a different kind of issue: either it is possible or it is not, and that makes a huge difference. That is the improvement I’m really looking forward to.

New data services 1: Google’s public data

Google’s public data service was launched somewhat unexpectedly at the end of April 2009.

The principle is as follows. When someone enters a search query that could be interpreted as a time series, Google displays a line graph of this time series before other results. Click on it, and you can do some more things with the chart.

[Screenshot: a Google public data search result]

The name public data can seem ambiguous.

Public, in one sense, refers to official, government-produced statistics. But, for content, public is also the opposite of copyrighted. Here, a little digging reveals that it’s clearly the latter sense: if you want this service to point to your data, they must be copyright-free.

I’ve seen Hans Rosling (of Gapminder fame, now Google’s data guru) deliver a few speeches to national statisticians in which he expressed all the difficulties he had accessing their data and battling with formatting or copyright issues. So I can understand where this is coming from. However, imagine the outcry if google.com decided to stop indexing websites that were not in the public domain!

Remember my find > access > process > present > share diagram?

I’d expect Google to solve the find problem. After all, they’re search people. But they don’t! You’ll only find a time series if you enter its exact name in Google. There is no such thing (yet; I imagine it would be easy to fix) as a list of their datasets.

They don’t tackle the access problem either. Once you see the visualizations, you’re not any step closer to actually getting the data. You can read them off, point by point, by mousing over the chart. I was also disappointed by the imprecision of their dataset citations: I’d have imagined that they’d provide a direct link to their sources, but they only state which agency produced the dataset. And finding a dataset from an agency is not a trivial matter.

They don’t deal with process, but who will hold that against them? What they do offer is a very nice, very crisp representation of data (presenting data). I was impressed by how legible the interface remains with many data series on the screen, while respecting Google’s look and feel and colour code.

Finally, it is also possible to share charts. Or rather, you can get a link to an image generated by Google’s Chart API, which is more than decent. A link to this static image, and a link to the chart on Google’s public data service, is all you should need (except, obviously, a link to the data proper!).
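To give an idea of what such a static image link looks like, here is a rough sketch. The parameter names (cht for chart type, chs for size, chd for data) are given from memory of Google’s Chart API, so treat the exact syntax as approximate rather than authoritative.

```python
from urllib.parse import urlencode

def chart_image_link(values, size="400x200"):
    """Build a static line-chart image URL in the style of Google's Chart API.

    Parameter names (cht, chs, chd) are recalled from the Chart API docs;
    the exact syntax should be treated as approximate.
    """
    params = {
        "cht": "lc",                                     # line chart
        "chs": size,                                     # image size, width x height
        "chd": "t:" + ",".join(str(v) for v in values),  # data in text encoding
    }
    return "http://chart.apis.google.com/chart?" + urlencode(params)

print(chart_image_link([4.9, 5.1, 5.8, 6.5, 7.2]))
```

The appeal is that the link is the chart: anyone you send it to gets the same rendered image, no account or plug-in required.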

Another issue comes from the selection of the datasets themselves.

One of the datasets is unemployment rates, available monthly and by US county. Now I can understand the rationale for matching a Google query of “unemployment rates” to that specific dataset. But there are really many unemployment rates, depending on what you divide by what: are we counting unemployed people or unemployed jobseekers? Which definition of unemployment are we using, the ILO’s or the BLS’s? And against what is the rate calculated: total population, population of working age, total labour force? (A small worked example follows below.) And how could that work if the system were expanded to another country? Obtaining the same level of granularity (down to a very narrow geographic area, for a period of a month) would require some serious cooking of the data, so you can’t have granularity, comparability and accuracy all at once.
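To see how much the choice of numerator and denominator matters, here is a small worked example with made-up round numbers (purely illustrative, not actual statistics): the same count of unemployed people yields noticeably different “rates” depending on the denominator chosen.

```python
# Made-up round numbers for one area, purely illustrative.
unemployed       = 8_000    # people without work and actively seeking it
labour_force     = 100_000  # employed + unemployed
working_age_pop  = 160_000  # e.g. everyone aged 15-64
total_population = 200_000

# Three different "unemployment rates" from the same counts:
rate_vs_labour_force = 100 * unemployed / labour_force      # 8.0 %
rate_vs_working_age  = 100 * unemployed / working_age_pop   # 5.0 %
rate_vs_population   = 100 * unemployed / total_population  # 4.0 %

print(rate_vs_labour_force, rate_vs_working_age, rate_vs_population)
```

A single label on a chart hides all of these choices, which is exactly why the definition behind “the” unemployment rate matters.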

I don’t think the system is sustainable. I don’t like that it gives people the impression that economic statistics can be measured in real time at any level, just like web usage statistics, for instance. They can’t simply be observed; they are calculated by people.

Google public data is still in its infancy. A usable list of the datasets, for instance, would address much of my negative commentary on the system. But for the time being, I’m not happy with the orientation they’ve chosen.