To solve any problem with data, there are 5 things you need to do in sequence.
- First, find the relevant data source.
- Then, access it and retrieve the numbers you need.
- Then, process those numbers so they answer your question.
- Then, present the results, for instance as a data visualization.
- And, optionally, share your analysis with your colleagues or the world.
I’m not going to go into detail here on the last 4 steps, which already get plenty of attention and dedicated tools.
That leaves us with finding data.
In many cases, finding data is taken as a given: to work with data, you have to start from a dataset. And many researchers already know where the data they need are, so they only have to get them, not find them.
And besides, since 2002, there’s been a canned answer to every search-related question: Google.
But the fact is, finding data is difficult, and, currently, Google isn’t doing a great job of helping.
Datasets are produced by subject-matter experts, who describe them in their own language. If your question is loosely defined, its terms won’t match the scientific description of the dataset. And even if you find something, you can’t tell for sure whether it’s the most appropriate dataset or whether it’s reliable, nor can you find related items, as you can when searching among objects of the same nature, like books or journal articles.
Publishers in those two fields addressed the problem by coming up with standard metadata, with systems like MARC records and Dublin Core. But that doesn’t work so well for data.
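To make the comparison concrete, here is a minimal sketch of what a Dublin Core record looks like, built with Python’s standard library. The dataset title, publisher, and field values are invented for illustration; only the element names come from the Dublin Core standard.

```python
# A minimal Dublin Core record, built with Python's standard library.
# All field values below are hypothetical, for illustration only.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")
# Each pair is one of the 15 elements of the Dublin Core element set.
for element, value in [
    ("title", "Quarterly unemployment rates"),   # hypothetical dataset
    ("creator", "Example Statistics Office"),    # hypothetical publisher
    ("date", "2015-01-01"),
    ("type", "Dataset"),
    ("format", "text/csv"),
]:
    ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value

xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```

Notice how little the record says about what is *in* the dataset — its dimensions, units, time coverage — which is exactly why these bibliographic standards fall short for data.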
To address this issue, the OECD published a white paper: We Need Publishing Standards for Datasets and Data Tables. (I’m credited at the end, but didn’t do much.)
Without structured metadata standards for datasets, it’s impossible to search data across several publishers with any degree of reliability. But what a huge step it would be if an agreement could be found.