    Misleading with road statistics

    Changing driving behaviors with campaigning alone is a tall order, but it is literally a life-or-death matter. Road fatalities range from about 40 per million population in Japan to about 6 times as many in Russia. Fortunately, the numbers tend to decrease in most places, due to better equipment, better roads, harsher punishment and safer behaviors.

    Of all of these factors, driver behavior is the only one which isn’t directly controlled by governments, so it’s no surprise that it’s what the agencies try to target. Almost every angle has been tried: blaming alcohol, blaming speed, showing the consequences of seemingly innocuous oversights, and, obviously, gore and shocking images.

    This year, in France, they’ve tried a different approach with a campaign called the 12000: thanking the drivers for their better behavior, which has saved, well, 12000 lives since 2003.

    I really appreciate the upbeat tone of the campaign and its very welcome positive spin. Unfortunately, it’s based on such a fallacy that it’s difficult to accept as such.

    road1

    Here’s one view of what has happened. The number of fatalities has dropped since 2003. (By the way, the unit for this and the following charts is fatalities per million population, indexed so that the value for France in 2003 is 100.) It can be argued that lives have been saved, because if the number of fatalities had remained constant since 2003, the area in green would represent extra fatalities (around 6,000).

    But this is what the agency wants us to believe:

    road21

    Says the website,

    12000 lives have been saved between 2003 and 2008. Fatalities have dropped from 6126 in 2003 to 4275 in 2008.

    To actually come up with that number of 12000 people saved, they’ve simply multiplied the difference between the 2003 and 2008 figures by 6, as if there had been a sudden and drastic drop in 2003.
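
    The gap between the two readings can be made concrete with a short sketch. Only the 2003 and 2008 totals are quoted above; the straight-line decline used for the years in between is an assumption made purely for illustration, not the actual yearly figures.

```python
# Fatality totals quoted in the post for the first and last year.
FATALITIES_2003 = 6126
FATALITIES_2008 = 4275
YEARS = list(range(2003, 2009))  # 2003..2008, six years


def campaign_style_estimate(first, last, n_years):
    """Multiply the end-to-end difference by the number of years."""
    return (first - last) * n_years


def gradual_decline_estimate(first, last, n_years):
    """Sum the yearly gaps to the first-year level, ASSUMING a linear decline."""
    step = (first - last) / (n_years - 1)
    yearly = [first - step * i for i in range(n_years)]
    return sum(first - value for value in yearly)


naive = campaign_style_estimate(FATALITIES_2003, FATALITIES_2008, len(YEARS))
gradual = gradual_decline_estimate(FATALITIES_2003, FATALITIES_2008, len(YEARS))
print(f"difference x 6: {naive}")
print(f"linear decline, relative to 2003: {gradual:.0f}")
```

    Under that assumption, multiplying the end-to-end difference by 6 gives about twice the figure obtained by adding up the yearly gaps, which is in line with the "around 6,000" read off the first chart.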

    I wonder why they do that. Behaviors have changed on the road. 75% of French drivers have a perfect driving record, and another 15% have only committed minor offences. Those are facts. So why inflate the numbers? And why, for instance, start at 2003 and not 2002, when mortality dropped by over 20%? 12000, as an absolute figure, is not more striking than 1000 or 100000.

    The visuals all repeat this figure. On all the posters of the campaign, we find the following footnote: “* If behaviors had not changed since 2002 in France, 12000 more people would have died on the road between 2003 and 2008. Source: ONISR.” The ONISR says no such thing in its report, so that number must have been invented for the campaign.

    Speaking of the ONISR reports, they estimate that if people observed speed, alcohol and seat belt legislation, the number of fatalities would drop by over 2000. So are we doing that well?

    road3

    That’s a comparison with the UK. Again, the units are ratios per population, not absolute figures. If France had had the same road fatality rate as the UK, over 10,000 people would not have been killed over the 2003-2008 period…

    Anyway. There’s no good reason why all western countries couldn’t go under 50 killed / million population within a reasonably short time frame.

    Using data visualization to disinform

    Two weeks ago I attended the DD4D conference, conveniently located at my workplace. I will write some more on DD4D; meanwhile you can see this post on infosthetics by Petra and Marian. One of the things that struck me at DD4D was that several talks were about data visualization either for advocacy or for education purposes. One speaker said that data visualization could be used to protect people against those who use numbers to mislead and disinform. Yesterday, I saw this typical example of such a manipulation, reminiscent of the famous Disraeli quote.
    disinform

    This is a poster for restaurants to display. Yesterday, VAT for restaurants in France was cut from 19.6% to 5.5%. This is the result of over 10 years of lobbying. Initially, restaurants asked for a VAT drop and committed to cut their listed prices accordingly. That cut in price would have attracted more consumers, eventually generating more profit and possibly more tax money. That would have been a win-win-win situation for the restaurant industry, the consumer and the state.

    But in the end, the changes to their price structure that restaurants have agreed to are as follows. They will cut the listed price of up to 10 menu items by 11.8% to “reflect the tax drop”. In exchange, they are allowed to display this poster, on which the chart ominously promises a massive price drop.

    In reality, 11.8% is only just enough to offset the VAT drop.

    Passing on the whole tax cut means lowering the listed price by 100*(1 - 1.055/1.196), or approximately 11.8%, which leaves the price excluding tax exactly where it was. Fast-food chains only have to drop some of their prices by 5% to get the poster.

    The poster claims: “a cut in VAT is a cut in prices!”. But what really happens? For most items, the listed price (incl. tax) is unchanged, which means their actual prices excluding tax rise by 13.4%, or 100*(1.196/1.055 - 1). And for the discounted items, the sales price excluding tax is at best unchanged in restaurants, and still rises by 7.7% in fast-food chains.
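
    The percentages in this comparison follow from the two VAT rates alone. Here is a minimal sketch of the arithmetic, assuming a listed price is simply the price excluding tax times (1 + VAT rate) and ignoring any rounding on real menus; the function names are made up for the example.

```python
OLD_VAT = 0.196  # rate before the cut, as quoted in the post
NEW_VAT = 0.055  # rate after the cut


def full_pass_through_cut(old_vat, new_vat):
    """Listed-price cut (as a fraction) that keeps the price excluding tax unchanged."""
    return 1 - (1 + new_vat) / (1 + old_vat)


def ex_tax_change(listed_price_cut, old_vat, new_vat):
    """Relative change in the price excluding tax for a given listed-price cut."""
    return (1 - listed_price_cut) * (1 + old_vat) / (1 + new_vat) - 1


needed = full_pass_through_cut(OLD_VAT, NEW_VAT)
print(f"cut needed to pass on the whole VAT drop: {needed:.1%}")

scenarios = [
    ("unchanged listed price", 0.0),
    ("restaurant, discounted item", 0.118),
    ("fast-food, discounted item", 0.05),
]
for label, cut in scenarios:
    change = ex_tax_change(cut, OLD_VAT, NEW_VAT)
    print(f"{label}: price excluding tax changes by {change:+.1%}")
```

    With these rates, a listed-price cut of about 11.8% leaves the price excluding tax essentially where it was, an unchanged listed price raises it by about 13.4%, and a 5% cut still raises it by about 7.7%.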

    Is this what was implied by the chart?

    In the past two weeks, I have collected more examples of shameless lies backed by seemingly official numbers and charts, and will continue to collect them.

    New data services 3: data.gov

    The United States is the only western country without a centralized data office. Instead, official statistics are produced by well over 100 agencies. This makes obtaining official US data difficult, which is somewhat of a paradox because in most cases these data are public and free. Of course, with data coming from so many sources, they also come in a variety of shapes and sizes. Says Wired,

    Until now, the US government’s default position has been: If you can’t keep data secret, at least hide it on one of 24,000 federal Web sites, preferably in an incompatible or obsolete format.

    A commitment made by the Obama administration was to tackle this and make data more widely available. To that end, a data portal was announced in early April and data.gov was officially launched at the end of May.

    Data.gov is three things in one.

    • A sign that this administration wants to make the data more accessible, especially to developers.
    • A shift towards open formats, such as XML.
    • A catalogue of datasets published by US government agencies.

    The rationale is that with data.gov, data are available to wider audiences. There’s a fallacy in that, because the layperson cannot do much with an ESRI file. But hopefully, someone will and may build something out of it for the good of the community.

    The aspect I found most interesting is the catalogue proper. For each indexed dataset, data.gov builds an abstract, inspired by the Dublin Core Metadata Initiative, with fields such as authoring agency, keywords, units, and the like. This, in itself, is not a technological breakthrough, but imagine if all the datasets produced by all the agencies were described in such a uniform fashion. Then, retrieving data would be a breeze.
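
    To illustrate why a uniform description makes retrieval easier, here is a minimal sketch of such a catalogue entry and a search over it. The field names, the `search` helper and the two entries are hypothetical and made up for this example; they are not data.gov’s actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical catalogue entry; field names are illustrative only."""
    title: str
    agency: str
    keywords: list = field(default_factory=list)
    units: str = ""
    url: str = ""  # the catalogue points to the data, it does not store it


def search(catalogue, term):
    """Return the records whose title, agency or keywords mention the term."""
    term = term.lower()
    return [
        record for record in catalogue
        if term in record.title.lower()
        or term in record.agency.lower()
        or any(term in keyword.lower() for keyword in record.keywords)
    ]


# Made-up entries, for illustration only.
catalogue = [
    DatasetRecord("County unemployment rates", "Agency A",
                  ["labour", "unemployment"], "percent", "https://example.org/a"),
    DatasetRecord("Road fatalities by state", "Agency B",
                  ["transport", "safety"], "persons", "https://example.org/b"),
]

for record in search(catalogue, "unemployment"):
    print(record.title, "->", record.url)
```

    Because every record carries the same fields, one simple function can search across entries from any agency, and the result only needs to hand back a link to where the data actually live.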

    Note that data.gov does not store the datasets. They provide a store-front which then redirects users to the proper location once a dataset has been selected.

    There have been other, similar initiatives. Fedstats.gov, allegedly, provided a link to every statistical item produced by the federal government. By their own admission, the home page was last updated in 2007, and its overall design hasn’t changed much since its launch by the Clinton administration in 1997 (a laudable effort at the time). Another initiative, http://usgovxml.com, is a private portal to all data available in XML format.

    So, back to “find, access, process, present, share”. Where does data.gov fall?

    It can come as a surprise that they don’t touch the last 3 steps. Well, it certainly will be a surprise for anyone expecting the government to open a user-centric, one-stop-shop for data. Data.gov is certainly not a destination website for lay audiences.

    It doesn’t host the data either. However, its existence drives agencies to publish their datasets in compliance with its standards. So we can say that it indirectly addresses access.

    So what it really is about is finding data. Currently, the site has two services to direct users to a dataset: a search engine and a catalogue. The browsable catalogue has only one layer of hierarchy, and while this is fine with their initial volume (47 datasets, around 200 as of end of June) that won’t suffice if their ambition is to host 100,000 federal data feeds.

    All in all, it could be argued that data.gov doesn’t do much by itself. But what is interesting is what it enables others to do.

    In the longer term, it will drive all agencies to publish their data under one publication standard. And if you have 100,000 datasets published under that standard, and if people use it to find them, then we will have a de facto industry standard to describe data. The consequences of that cannot be overestimated.

    The other, less obvious long-term advantage is what it will allow developers to create. There are virtually no technical barriers to creating interesting applications on top of these datasets. Chances are that some of these applications could change our daily lives. And they will be invented not by the government, but by individuals, researchers or entrepreneurs. Quite something to look forward to.

    New data services 2: Wolfram|alpha

    In March this year, überscientist Stephen Wolfram, of Mathematica fame, revealed to the world that he was working on something new, something big, something different. The first time I heard of this was through semantic web prophet Nova Spivack, who is not known to get excited by less-than-revolutionary projects. That, plus the fact that the project was announced so shortly before its release, helped build anticipation to huge levels.

    wolframalpha

    Wolfram|alpha describes itself as a “computational knowledge engine” or, simply put, as an “answer engine”. Like Google and other search engines, it tries to provide information based on a query. But while search engines simply try to retrieve the keywords of the query in their indexed pages, the answer engine tries to understand the query as a question and forms an educated answer. In a sense, this is similar to the freebase project, which aims to put all the knowledge of the world in a database where links can be established across items.

    It attempts to detect the nature of each word of the query. Is that a city? A mathematical formula? Foodstuff? An economic variable? Once it understands the terms of the query, it gives the user all the data it can to answer.

    Here for instance:

    wolframalpha-2

    Using the same find access process present share diagram as before,

    Wolfram|alpha’s got “find” covered. More about that below.

    It lets you access the data. If data have been used to produce a chart, then there is a query that will retrieve those bare numbers in a table format.

    Process is perhaps Wolfram|Alpha’s forte. It will internally reformulate and cook your query to produce all meaningful outputs in its capacity.

    The presentation is excellent. It is very legible, consistent across the site, efficient and unpretentious. When charts are provided, which is often, they are small but both relevant and informative, and only the necessary data are plotted. This is unusual enough to be worth mentioning.

    Wolfram|alpha doesn’t allow people to share its outputs per se, but since a given query will produce consistent results, users can simply exchange queries or communicate links to a successful query result.

    Now back to finding data.

    When a user submits a query, the engine does not query external sources of data in real time. Rather, it uses its internal, freebase-like database. This, in turn, is updated from external sources when possible.

    For each query, sources are available. Unfortunately, the data sources provided are for the general categories. For instance, for all the country-related information, the listed sources are the same: some are accurate and dependable (national or international statistical offices), while others are less reliable or verifiable (such as the CIA World Factbook or what’s cited as “Wolfram|Alpha curated data, 2009”). And to me that’s the big flaw of this otherwise impressive system.

    Granted, coverage is not perfect, but that can only improve. The syntax is not always intuitive: getting some results to appear in a particular way can be very elusive. This, too, will gradually get better over time. But whether the data presented can be verified makes a huge difference, and there is no middle ground: either it is possible or it is not. I’m really looking forward to this.

    New data services 1: Google’s public data

    Google’s public data was launched somewhat unexpectedly at the end of April 2009.

    The principle is as follows. When someone enters a search query that could be interpreted as a time series, Google displays a line graph of this time series before other results. Click on it, and you can do some more things with the chart.

    googlepublicdata1

    The name public data can seem ambiguous.

    Public, in one sense, refers to official, government-produced statistics. But, for content, public is also the opposite of copyrighted. And here, a little bit of digging reveals that it’s clearly the latter sense. If you want this service to point to your data, it must be copyright-free.

    I’ve seen Hans Rosling (of Gapminder fame, now Google’s data guru) deliver a few speeches to national statisticians in which he described all the difficulties he had accessing their data and battling with formatting or copyright issues. So I can understand where this is coming from. However, imagine the outcry if google.com decided to stop indexing websites which were not in the public domain!

    Remember my find > access > process > present > share diagram?

    I’d expect Google to solve the find problem. After all, they’re search people. But they don’t! You’ll only find a time series if you enter its exact name in Google. There is no such thing as a list of their datasets (yet, as I imagine it would be easy to fix).

    They don’t tackle the access problem either. Once you see the visualizations, you’re not any step closer to actually getting the data. You can only see them, point by point, by mousing over the chart. I was also disappointed by how imprecisely their datasets are cited. I’d have imagined that they’d provide a direct link to their sources, but they only state which agency produced the dataset. And finding a dataset from an agency is not a trivial matter.

    They don’t deal with process, but who will hold that against them? Now what they offer is a very nice, very crisp representation of data (presenting data). I was impressed how legible the interface remained with many data series on the screen, while respecting Google’s look and feel and colour code.

    Finally, it is also possible to share charts. Or rather, you can have a link to an image generated by Google’s chart API, which is more than decent. A link to this static image, and a link to the chart on Google’s public data service, and that’s all you should need (except, obviously, a link to the data proper!)

    Another issue comes from the selection of the data sets proper.

    One of the datasets is the unemployment rates, which are available monthly and by US county. Now I can understand the rationale for matching a Google query of “unemployment rates” to that specific dataset. But there are really many unemployment rates, depending on what you divide by what. (Are we counting unemployed people? Unemployed jobseekers? Which definition of unemployment are we using, the ILO’s or the BLS’s? And against what is the rate calculated: total population? Population of working age? Total labour force?) And how could that work if you expand the system to another country? To obtain the same level of granularity (down to a very narrow geographic location and a period of a month) would require some serious cooking of the data, so you can’t have granularity, comparability and accuracy all at once.
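
    A small sketch shows how far the “unemployment rate” can move depending on what is divided by what. All head counts below are hypothetical, invented for this example; they do not come from any agency.

```python
# Hypothetical head counts for one area, invented for this example.
population = {
    "total": 100_000,
    "working_age": 65_000,
    "labour_force": 50_000,
    "unemployed_jobseekers": 4_000,
    "unemployed_all": 5_000,  # also counts people not actively seeking work
}


def rate(numerator, denominator):
    """Share of numerator in denominator, as a percentage."""
    return 100 * numerator / denominator


candidates = {
    "jobseekers / labour force": rate(
        population["unemployed_jobseekers"], population["labour_force"]),
    "jobseekers / working-age population": rate(
        population["unemployed_jobseekers"], population["working_age"]),
    "jobseekers / total population": rate(
        population["unemployed_jobseekers"], population["total"]),
    "all unemployed / working-age population": rate(
        population["unemployed_all"], population["working_age"]),
}

for definition, value in candidates.items():
    print(f"{definition}: {value:.1f}%")
```

    The same area yields four different “rates”, so a chart that does not say which definition it uses cannot safely be compared with another one.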

    I don’t think the system is sustainable. I don’t like the idea that it gives the impression to people that economic statistics can be measured in real time at any level, just like web usage statistics for instance. They can’t be just observed, they’re calculated by people.

    Google public data is still in its infancy. Having a usable list of the datasets, for instance, would address many of my negative comments on the system. But for the time being, I’m not happy with the orientation they’ve chosen.

    Google public data, Wolfram Alpha and data.gov

    In the last two weeks, three high-profile data-related services have been released: Google’s public data, Wolfram|Alpha and data.gov.

    In the next couple of posts I’m going to review all three.

    But before that, I’d like to go back to 2007, when Swivel.com and Many-eyes.com were released.

    Those 2 services allow users to publish their own data visualizations, based on datasets uploaded by themselves or by others. At the OECD, we had used the services extensively, uploading hundreds of datasets and creating that many visualizations.

    At the end of the day, my main gripe with both services was never the visualization proper, or the interface, or any of the services to the data publisher - which the developers knew to be highly perfectible. No, what bothered me was the navigation within the site and how all the datasets were organized. There wasn’t any way to group your datasets or visualizations, then to group these groupings. Sure, they had an author name attached to them, and later, a theme, so it was possible to see all datasets from a specific author, or about a specific theme. But at the end of the day, that was a very long list, so the top titles received all the exposure, and the others, none. And indeed, we realized that some of our data objects got all the traffic, while others, not necessarily less interesting, had none - they were simply not seen.

    My reaction to swivel and many-eyes was to tell them, you are not going to be able to allow users to search for data sets on your site. It’s too difficult and it’s not your focus. Forget about being a community site or a portal, and instead, allow users to share what they make on your sites in the environments they like. Allow them to download images, embed applets, you name it, but the navigation will have to happen on their site, not on yours.

    So let’s see how this applies to these 3 new services.

    Finding data

    data flow

    To solve any problem with data, there are 5 things you need to do in sequence. 

    • First, find the relevant data source.
    • Then, access it and get the numbers you need. 
    • Then, process those numbers to make them answer your question. 
    • Present the results, for instance, as a data visualization, 
    • and, optionally, share your analysis with your colleagues or the world.
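
The five steps above can be sketched as a chain of small functions. This is only a toy illustration: the in-memory catalogue, its numbers and every function name are made up for the example.

```python
import statistics

# Toy in-memory "catalogue"; the name and numbers are invented for illustration.
CATALOGUE = {
    "road fatalities per million": {"2003": 100.0, "2004": 91.0, "2005": 88.0},
}


def find(question):
    """Step 1: locate a dataset whose name shares a word with the question."""
    words = set(question.lower().split())
    for name in CATALOGUE:
        if words & set(name.split()):
            return name
    return None


def access(name):
    """Step 2: get the numbers."""
    return CATALOGUE[name]


def process(series):
    """Step 3: turn the numbers into an answer (here, the mean)."""
    return statistics.mean(series.values())


def present(name, value):
    """Step 4: format the result for a reader."""
    return f"{name}: average {value:.1f}"


def share(text):
    """Step 5 (optional): publish; printing stands in for that here."""
    print(text)
    return text


name = find("How many road deaths?")
if name is not None:
    share(present(name, process(access(name))))
```

Note how fragile the first step is even in this toy: the question only matches because it happens to contain the word “road”, while “deaths” never matches “fatalities”.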

    I’m not going to go into details here on how the last 4 steps are getting a lot of attention and dedicated tools. 

    That leaves us with finding data. 

In many cases, finding data is a given: to work with data, you have to start from a dataset. And many researchers know for a fact where the data they need are, so they only need to get them, not to find them. 

And besides, since 2002, there’s a canned answer to every search-related question: Google.

But the fact is, finding data is difficult, and, currently, Google isn’t doing a great job of helping. 

    Datasets are produced by subject-matter experts who describe them in their own language. If you have a loosely-defined question, its terms won’t match the scientific description of the dataset. And even if you find something, you can’t tell for sure if you found the most appropriate dataset, if it’s reliable, or find related objects - actions which are common when one is searching among objects of the same nature, like books or journal articles. 

Publishers in those two activities addressed the problem by coming up with standard metadata, with systems like MARC records or Dublin Core. But that doesn’t work so well on data.

To address this issue, the OECD published a white paper: We Need Publishing Standards for Datasets and Data Tables. (I’m credited at the end, but didn’t do much).

    Without structured metadata standards for datasets, it’s impossible to search data across several publishers with any degree of reliability. But what a huge step that would be if an agreement could be found.

    Gerry McGovern paid us a visit

    And he gave us a talk about the internet in general. While I enjoyed the talk in general, there were some ideas which I really liked and some with which I’d adamantly disagree.

    Here goes.

    The task-centric internet.
    That’s the main theory. We went from a tool-centric internet to a content-centric internet. Now the web is (or should be) task-centric, that is focused around what people who come to your web site want to do. All the rest is clutter.

    I’m not too convinced about that. I like the idea of helping visitors achieve what they want to do, right from the homepage and without hassle. Now in a web site design, you should also consider what you want your visitors to do. Yes the choices a visitor faces should be kept to a minimum. But in my opinion it is ok to orient those choices. It is ok to send a message to tell your visitors about something they were not necessarily looking for, but which may be of interest to them.

    Navigation should help people, not reflect the brand.
    I mostly agree with that. This echoes what Jakob Nielsen says about links, which should look like links, i.e. in blue and underlined, with a different color for visited links. Now Nielsen is more subtle about this than McGovern was. Navigation links, menu options etc. are seldom underlined and this is generally for the best.

    In your text, use words that people search for.
    The two examples he gave were “low fares” vs. “cheap flights” and “climate change” vs. “global warming”. It turns out that airline companies liked to use “low fares” while customers were rather searching for “cheap flights”. And, in the academic literature, you’d find more mentions of “climate change” than “global warming”, although, again, people search for the latter. So the advice was to use the searched expressions.

    While it makes sense in the first case, it’s more questionable for the 2nd. If you write a website for academics, you want to attract the people who searched for “climate change”, not necessarily “global warming”, even if they are more numerous.

    Don’t write links in paragraphs.
    Huh? While I agree you shouldn’t write a paragraph around a link when the link itself suffices, I don’t see anything wrong with using a link within a paragraph, far from it. When writing for the web, connecting with other resources and websites has many benefits. The rationale he gave was that people are either reading or clicking, hence the paradox. To that I say, not anymore! There are so many things you can do with a link, like opening it in a new tab, bookmarking it or tagging it for future reference, etc.

    Keep headings short.
    Indeed. There are only advantages to that. It was quite interesting to see him bashing our clippings.

    An interesting point he raised was that before the internet, news releases were never meant to be published. Now they are available to the general public, and often redistributed as is by some e-journalists.

    Blogging is really a conversation.
    Blogging is about exchanging rather than proposing. I really didn’t like that analysis. In my book, unless you have something to say, unless you have substance, no one will want to exchange with you. You just can’t run a blog saying, ok you guys tell me what I should write about. Protagonists won’t materialize out of thin air. There are quite a few successful blogs without comments. In my view, comments are a side effect of blogging rather than its essence.

    Update your content frequently.
    Some content has a shorter life span than others. He showed us great examples of that on our own website. Basically, everything you’d write in the future tense is soon outdated. There’s some content, however, that in my opinion can stay online for a while.

    Monitor your content and take it out when needed.
    That was a very interesting point. On hearing something like that, most people’s reaction would be to say, my site is already huge, so I need extra resources to monitor my content. His approach is the opposite. He says that you should only build a site so big that you can monitor it with your current resources. If your web site is too big, you should downsize it. And in fact, most organizations are taking large chunks of their public web site offline!

    The state of presentations in 2008

    There have been many changes in how people understand presentations in 2008. How far have we gone?

    In 2008, 2 major books on the topic have been published. Presentation Zen, by Garr Reynolds, and slide:ology, by Nancy Duarte.

    People are accepting that a well-executed presentation can change the world. An Inconvenient Truth got nothing less than 2 Academy Awards and a Nobel Prize. And rumors about the health of master presenter Steve Jobs caused the stock markets to panic.

    People are also finding that tools to create successful presentations are incredibly commonplace. From a technical standpoint, anyone with a computer could have created “shift happens“, which has been viewed by 5 to 10 million people.

    As a result, blogs are now swarming with sensible presentation advice. A Google query for “death by powerpoint” returns 397000 hits today. A year ago, searching for presentation tips yielded ideological (as opposed to evidence-based) guidelines such as “no more than 7 bullets per slide” or “one slide per minute” (you can still find those as well).

    2008 was also the year when Slideshare took off. Not only did the viewership and amount of content increase drastically, but the quality, relevancy and sophistication of the best presentations are now incredible. Empowered by inspiring examples, clear guidelines and adequate tools, many are striving to emulate great presenters.

    So if I just end here, one could conclude that the world is definitely saved from ineffective presentations. The reality is slightly different.

    This year, I have seen so far approximately 400 live presentations, and god knows how many online. Some were excellent, many were good, most were at least adequate. But a good proportion of them were still boring and I’d be lying if I claimed I could remember as much as 10% of them.

    One explanation I came up with for that is that many presenters are still focusing on the final deliverable product rather than the fundamentals. These folks are very sensitive to advice like “mind your typography”, “illustrate your slides with large images”, or “forget bullets”. Now typography or images are important and can make a difference between a good and an excellent presentation. But it’s crucial to have a message to deliver and to focus on that message.

    Bulleted texts are accused of cluttering the presentations. But if every little point or anecdote is illustrated with a vividly-colored image, then the images themselves become the clutter and clog everyone’s limited attention. You’d remember the images and cool effects but not the point. And a week later, you’ll have forgotten the images and the presentation altogether.

    So my own piece of advice is that your big images won’t make your presentation. Your angle,  structure and consistency will. The best advice I got from Presentation Zen was to prepare a presentation away from a computer and only produce it once it’s final. It works. It really does.

    Once this becomes an accepted practice, seminars, classes and meetings will be much more exciting (let’s hope!).

    Go deep rather than go wide

    Yesterday, I attended a Presentation Zen webinar. One phrase that struck me was this advice to go deep rather than go wide. In a presentation, there is only so much time to present information, before everyone’s attention collapses.

    Rather than try to cover as much ground as possible, it’s much more efficient to focus on a subject and make sure to deliver.

    And if you are expected to deliver lots of information on a wide scope, then a written report is a more appropriate medium.

    The presentation should be available on Slideshare soon.