New data services 1: Google’s public data

Google’s public data was launched somewhat unexpectedly at the end of April 2009.

The principle is as follows. When someone enters a search query that could be interpreted as a time series, Google displays a line graph of this time series before other results. Click on it, and you can do some more things with the chart.


The name public data can seem ambiguous.

Public, in one sense, refers to official, government-produced statistics. But, for content, public is also the opposite of copyrighted. And here, a little bit of digging reveals that it’s clearly the latter sense. If you want this service to point to your data, it must be copyright-free.

I’ve seen Hans Rosling (of Gapminder fame, now Google’s data guru) deliver a few speeches to national statisticians in which he described all the difficulties he had accessing their data and battling with formatting or copyright issues. So I can understand where this is coming from. However, imagine the outcry if google.com decided to stop indexing websites which were not in the public domain!

Remember my find > access > process > present > share diagram?

I’d have expected Google to solve the find problem. After all, they’re search people. But they don’t! You’ll only find a time series if you enter its exact name in Google. There is no such thing as a list of their datasets (yet, as I imagine it would be easy to fix).

They don’t tackle the access problem either. Once you see the visualizations, you’re not any step closer to actually getting the data. You can only read the values, point by point, by mousing over the chart. I was also disappointed by how imprecisely they cite their datasets. I’d have imagined that they’d provide a direct link to their sources, but they only state which agency produced the dataset. And finding a dataset from an agency is not a trivial matter.

They don’t deal with process, but who will hold that against them? What they do offer is a very nice, very crisp representation of data (presenting data). I was impressed by how legible the interface remained with many data series on the screen, while respecting Google’s look and feel and colour code.

Finally, it is also possible to share charts. Or rather, you can have a link to an image generated by Google’s chart API, which is more than decent. A link to this static image, and a link to the chart on Google’s public data service, and that’s all you should need (except, obviously, a link to the data proper!).

Another issue comes from the selection of the data sets proper.

One of the datasets is the unemployment rate, which is available monthly and by US county. Now I can understand the rationale for matching a Google query of “unemployment rates” to that specific dataset. But there are really many unemployment rates, depending on what you divide by what. (Are we counting unemployed people? Unemployed jobseekers? Which definition of unemployment are we using: the ILO’s or the BLS’s? And against what is the rate calculated: total population, population of working age, or total labour force?) And how could that work if you expand the system to another country? Obtaining the same level of granularity (down to a very narrow geographic location and a period of a month) would require some serious cooking of the data, so you can’t have granularity, comparability and accuracy all at once.
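
To make the "what you divide by what" point concrete, here is a small sketch with entirely made-up figures for an imaginary county. None of the numbers come from Google, the BLS or any real dataset, and the three denominators are only the ones mentioned above. The same count of unemployed people yields three quite different "unemployment rates":

```python
# Hypothetical figures for one imaginary county (not real data).
unemployed = 5_000                # people without work who are looking for a job
labour_force = 100_000            # employed + unemployed
working_age_population = 150_000
total_population = 250_000


def rate(numerator, denominator):
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


# Same numerator, three different denominators.
rates = {
    "share of labour force": rate(unemployed, labour_force),
    "share of working-age population": rate(unemployed, working_age_population),
    "share of total population": rate(unemployed, total_population),
}

for name, value in rates.items():
    print(f"{name}: {value}%")
```

With these invented inputs the three figures are 5.0%, 3.3% and 2.0%, so a chart labelled simply "unemployment rate" does not tell you which of them you are looking at.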

I don’t think the system is sustainable. I don’t like the impression it gives people that economic statistics can be measured in real time at any level, just like web usage statistics for instance. They can’t simply be observed; they’re calculated by people.

Google public data is still in its infancy. Having a usable list of the datasets, for instance, would address many of my negative comments on the system. But for the time being, I’m not happy with the orientation they’ve chosen.