There are virtues to an illegible chart

It all started with an extreme pie chart

A few weeks back, a chart on aid made the rounds of the internet:

All US ODA by recipient, 2004-2008, OECD data, taken from USAidWatchers.com

What this chart shows is that US aid is concentrated in a few countries. The article explains that this is a result of the 3D doctrine, which ties development to diplomacy and defense. This is why the US gives so much to strategic countries like Afghanistan, Iraq, or Sudan, but relatively little to India, which is highlighted in the chart and has “a huge chunk of the world’s poor”.

When I saw that chart I was planning to create a chart or a data visualization on the same subject for my work. The original chart was being heavily criticized for its form, because half of it is not legible. Chart purists don’t like pie charts for that very reason – they are difficult to read, especially if you add more items. But I found the chart interesting. It states in a very striking way that more than a hundred countries in the world get next to nothing from the USA.

An apology for extreme charts

There are virtues to an illegible chart. In fact, I don’t believe that a chart should give equal prominence to each and every one of its data points. In most cases, it’s there to support a story, so all it should do is carry a message. Tufte popularized the notion of data-ink ratio, which states that a chart designer should use the largest share of ink to represent data, not everything else. I feel this is taken too literally by many.

There is a tradition of extreme charts which purposely break presentation rules because of the very nature of the subject they are plotting. A famous example is Al Gore on his lift – if CO2 emissions hadn’t increased so much, he wouldn’t need that lift to show his chart.

 

Al Gore on his lift, the most memorable image of An Inconvenient Truth

Another one – from the NY Times, one of the charts that Matthew Ericson showed in his Infovis 2007 keynote speech:

In perspective: America's conflicts. NY Times

Click to see the full image - it is big. I really like this chart.

Again, if the number of US soldiers killed per month had not been so high in WWII, the 2nd group of bars wouldn’t overwrite the text above and sky-rocket to the top of the page. The logical thing to do would have been to scale the chart so that the maximum values would fit in a well-delimited space, and maybe use a logarithmic scale so that the values for other wars would remain legible. That’s how we would have done it if we had to plot that kind of series in an OECD book. The fact that the NYT designers chose, on the contrary, to let the data rise all the way to the top of  the page expresses in a very powerful way the extreme nature of the WWII casualties.

“A conventional chart couldn’t hold all that horror”, the chart seems to say. Likewise, if CO2 emissions had grown more steadily over the past couple of centuries, Al Gore wouldn’t have needed a lift. By the same token, if aid values to about 100 countries were more than negligible, they could be seen on that chart. So granted, there could be more academic ways to show that, like a giant bar chart with values too small to see for all but a handful of states. But all in all I think the original pie chart does a good job of communicating that in a nutshell, ad absurdum if you will.

My take on the chart

I wanted to work on a specific subset of aid data, that which goes to fragile states, which are, simply put, the 43 countries in the greatest need of aid. Now official aid from developed countries, like US aid, is very concentrated, meaning that only 10 of these countries got more than $1b in 2008. Only 10 countries got more than $100 per capita in that year.

Another interesting aspect of the data is that for many of these countries, aid comes mostly from one or two donors, so they are vulnerable to a policy change in those donor countries. That’s what I wanted to show in the representation.

 

 

 

Making data meaningful – Style guide on the presentation of statistics

Making Data Meaningful part 2
Introducing Making Data Meaningful Part 2 – Style guide on the presentation of statistics – which, as its name cleverly suggests, is a compilation of advice on presenting information graphically.

It’s a follow-up to Making Data Meaningful Part 1, which focused on writing about data, as opposed to visualizing it.

The book is a collaboration between representatives of national statistical offices and intergovernmental organizations – all public statisticians, if you will. I hope it will help others to communicate their data better. Personally, I wrote the part about charts and contributed to some other chapters. But if I could sum up my advice in one sentence, it would be: go buy Stephen Few’s books. Start with Show Me the Numbers.

The list of people who collaborated on the book includes:

 

Using Tableau Public: first thoughts

I am currently beta testing Tableau Public. Essentially, Tableau Public lets you bring the power of Tableau analysis online. With Tableau Public, your audience doesn’t need to download a workbook file and open it in an offline software client – they can see and interact with your work directly on a web page.

There are quite a few examples of the things you can do with Tableau Public. These are the examples you are given when you start the product:

Tracking Economic Indicators by Freakalytics
A Tale of 100 Entrepreneurs by Christian Chabot
Bird strikes by airport by Crankyflier
Interactive Running Back Selector by CBS sports

And there are always more on Tableau’s own blog. I’ve done quite a few which I’ll share progressively on this blog and on my OECD blog, http://www.oecd.blog/statistics/factblog.

So that’s the context. What’s the verdict?

1. There is no comparable data visualization platform out there.

There are many ways to communicate data visually. Count them: 1320, 2875… and many more.

However, these tools have a narrower focus than Tableau, or require some programming ability from the user. For instance, Many Eyes offers a fixed set of visualization types which can be set up in seconds, but which cannot be customized. Conversely, Protovis is very flexible but requires some knowledge of JavaScript. And even for a skilled developer, coding an interactive data visualization from scratch takes time.

By contrast, Tableau is a fully-featured solution which doesn’t require programming. It has many representation types which can be deeply customized: every visual characteristic of a chart (colour, size, position, etc.) can depend on your data. Several charts can also be combined as one dashboard. On top of that, data visualization done in Tableau comes with many built-in controls, with an interface to highlight and filter data, or to get more details on demand. For dashboards, it is also possible to link charts, so that actions done on one chart (highlighting records, for instance) affect other charts.

2. The solution is not limitless.

Tableau enables you to do things which are not possible using other packages. But it doesn’t allow you to do just anything. That’s for your own good – it won’t allow you to do things that don’t make sense.

There are many safety nets in Tableau, which you may or may not run into. For instance, you can’t make a line chart for data which don’t have a temporal dimension – so much for parallel coordinates. However, the system is not fool-proof. Manipulating aggregates, for instance, can lead to errors that you wouldn’t have to worry about in plain old Excel, where the various steps through which data are computed to create a graph are more transparent (and more manual). Compared to Excel, you have to worry less about formatting – the default options for colours, fonts and positions are sterling – and be more vigilant about calculations.

3. Strength is in numbers.

Over the years, many of us grew frustrated with Excel’s visual capabilities. Others firmly believed that anything could be done with the venerable spreadsheet and have shown the world that nothing is impossible.

The same applies to Tableau. The vibrant Tableau community provides excellent advice. “Historic” Tableau users are not only proficient with the tool, but also have a better knowledge of data visualization practices than the average Excel user. As with any fully-featured product, there is a learning curve to Tableau, which means that there are experts (the proper in-house term is Jedis) who find hacks to make Tableau even more versatile. So of course, it is possible to do parallel coordinates with Tableau.

The forum, like the abundant training material (available as videos, manuals, lists of tips, or online sessions with an instructor), not only helps users solve their problems, but is also a fantastic source of inspiration.

With the introduction of Tableau Public, the forum will become even more helpful, as there will be more questions, more problems and more examples.

 

 

Using data visualization to disinform

Two weeks ago I was at the DD4D conference, conveniently located at my workplace. I will write some more on DD4D; meanwhile you can see this post on infosthetics by Petra and Marian. One of the things that struck me at DD4D was that several talks were about data visualization either for advocacy or for education purposes. One speaker said that data visualization could be used to protect people against those who use numbers to mislead and disinform. Yesterday, I saw a typical example of such manipulation, which brings to mind the famous Disraeli quote.

This is a poster for restaurants to display. Yesterday, VAT for restaurants in France was cut from 19.6% to 5.5%. This is the result of over 10 years of lobbying. Initially, restaurants asked for a VAT drop and committed to cut their listed prices accordingly. That cut in price would have attracted more consumers, eventually generating more profit and possibly more tax money. That would have been a win-win-win situation for the restaurant industry, the consumer and the state.

But eventually, the changes that restaurants agreed to make to their price structure are as follows. They cut the listed price of up to 10 menu items by 11.8% to “reflect the tax drop”. In exchange, they are allowed to display this poster, on which the chart ominously promises a massive price drop.

In reality, 11.8% is only just enough to offset the VAT drop, and only on the items it applies to.

A cut of approximately 11.8%, or 100*(1 – 1.055/1.196), in the listed price leaves the price excluding tax unchanged. Fast-food chains only have to drop some of their prices by 5% to get the poster.

The poster claims: “a cut in VAT is a cut in prices!”. But what really happens? For most items, the listed price (incl. tax) is unchanged, which means their price excluding tax rises by approximately 13.4%, or 100*(1.196/1.055 – 1). And for the discounted items, the price excluding tax is roughly unchanged in regular restaurants, and still rises by 7.7% in fast-food chains.
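The arithmetic is easy to check with a few lines of Python. This is only a sketch: it assumes that the price excluding tax is the listed price divided by (1 + VAT rate), and the function names are mine, chosen for illustration.

```python
OLD_VAT = 0.196  # restaurant VAT rate before the cut
NEW_VAT = 0.055  # restaurant VAT rate after the cut


def pretax_change(listed_cut: float) -> float:
    """Percent change in the price excluding tax when the listed
    (tax-inclusive) price is cut by `listed_cut` percent."""
    old_listed = 100.0
    new_listed = old_listed * (1 - listed_cut / 100)
    old_pretax = old_listed / (1 + OLD_VAT)
    new_pretax = new_listed / (1 + NEW_VAT)
    return 100 * (new_pretax / old_pretax - 1)


def full_pass_through_cut() -> float:
    """Cut in the listed price (percent) that leaves the price
    excluding tax exactly unchanged."""
    return 100 * (1 - (1 + NEW_VAT) / (1 + OLD_VAT))


# No cut, the fast-food cut, and the regular restaurant cut
for cut in (0.0, 5.0, 11.8):
    change = pretax_change(cut)
    print(f"listed price cut {cut:4.1f}% -> price excl. tax {change:+.1f}%")
print(f"full pass-through cut: {full_pass_through_cut():.1f}%")
```

Under these assumptions, an unchanged listed price comes out at roughly +13.4% excluding tax and a 5% cut at roughly +7.7%, while an 11.8% cut leaves the price excluding tax essentially flat.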

Is this what was implied by the chart?

In the past two weeks, I have collected more examples of shameless lies backed by seemingly official numbers and charts, and will continue to collect them.

 

Flowing Data’s chart contest

This week, FlowingData organized a contest. A chart was submitted, and contestants were asked to improve it.

A lot of my job revolves around reviewing and correcting graphs, so I was more than happy to compete. 

Here is the original graph, hosted & designed by Swivel:

Immigration to the U.S. by decade

The rules of the contest stated that the new graph should use the same data. But instead of re-using the dataset hosted on Swivel, I checked the source to answer some questions I had.

Here goes: 

 

Period           Total      Europe        Asia    Americas      Africa    Oceania*
1820-30        151,824     106,487          36      11,951          17      33,333
1831-40        599,125     495,681          53      33,424          54      69,911
1841-50      1,713,251   1,597,442         141      62,469          55      53,144
1851-60      2,598,214   2,452,577      41,538      74,720         210      29,169
1861-70      2,314,824   2,065,141      64,759     166,607         312      18,005
1871-80      2,812,191   2,271,925     124,160     404,044         358      11,704
1881-90      5,246,613   4,735,484      69,942     426,967         857      13,363
1891-00      3,687,564   3,555,352      74,862      38,972         350      18,028
1901-10      8,795,386   8,056,040     323,543     361,888       7,368      46,547
1911-20      5,735,811   4,321,887     247,236   1,143,671       8,443      14,574
1921-30      4,107,209   2,463,194     112,059   1,516,716       6,286       8,954
1931-40        528,431     347,566      16,595     160,037       1,750       2,483
1941-50      1,035,039     621,147      37,028     354,804       7,367      14,693
1951-60      2,515,479   1,325,727     153,249     996,944      14,092      25,467
1961-70      3,321,677   1,123,492     427,642   1,716,374      28,954      25,215
1971-80      4,493,314     800,368   1,588,178   1,982,735      80,779      41,254
1981-90      7,338,062     761,550   2,738,157   3,615,225     176,893      46,237
1991-00      9,095,417   1,359,737   2,795,672   4,486,806     354,939      98,263
2001-06      7,009,322   1,073,726   2,265,696   3,037,122     446,792     185,986
187 Years   72,066,614  39,346,127  10,525,281  20,082,410   1,075,980   1,036,816

* includes others unidentified by nationality

 

FAIR (the Federation for American Immigration Reform), which published this on its website, also made a chart out of this data:

 

So let’s take a look at the data. 

At first glance, it is very aggregated: data are not available per country or per year, but per continent and per decade. However, the last “decade” is only 6 years long. Also, Oceania includes all the unidentified immigrants. Immigrants from Africa and “Oceania” are a tiny fraction of the total flow so it would be difficult to draw a conclusion from their data.

So if I want to tell a story about this dataset, I would choose the following. 

The total flow of immigrants to the USA has gone through major changes. 

Looking at the composition of this flow: over 90% of the immigrants were Europeans at some point, but now that ratio is down to around 15%. 
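The share of Europeans can be recomputed directly from the table. The sketch below copies the Total and Europe figures for four periods only; the names `ROWS` and `europe_share` are illustrative, not part of any tool used for the graphs.

```python
# (total, Europe) immigrants per period, copied from the table above
ROWS = {
    "1820-30": (151_824, 106_487),
    "1891-00": (3_687_564, 3_555_352),
    "1991-00": (9_095_417, 1_359_737),
    "2001-06": (7_009_322, 1_073_726),
}


def europe_share(total: int, europe: int) -> float:
    """Share of European immigrants in the period total, in percent."""
    return 100 * europe / total


for period, (total, europe) in ROWS.items():
    print(f"{period}: {europe_share(total, europe):.1f}% from Europe")
```

For these four periods the share goes from about 70% in 1820-30 to about 96% in 1891-00, then down to about 15% in the two most recent periods.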

Now, for a critique of these two graphs. 

Swivel’s: 

 

  1. It’s not very telling to keep presenting those numbers aggregated by decade. 
  2. Especially if the last decade is not corrected. All curves seem to dip, although the underlying variables are actually growing.
  3. You can clearly see the point where American immigrants take over Europeans (and later, when Asians do the same). But again, those absolute figures are not very interesting. You cannot see the share of the various continents to the total. 
  4. The Africa and Oceania curves clutter the graph and bring little information. 
  5. The fact that Oceania includes other countries is not disclosed (not that it would change the graph tremendously). 
FAIR’s:
  1. To do this graph, they’ve annualised the data, which is a more sensible option. 
  2. The year labels are difficult to read. 
  3. The last column (2001-2006) looks exactly like the others, although it covers 6 years where they cover 10. 
  4. Again, Oceania and Africa don’t bring much to the graph. 
  5. It’s very difficult to see the evolution of one given continent, except Europe. 
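Correcting for the length of each period, the point raised in both critiques, can be sketched as follows. This is not necessarily how FAIR annualised its figures: I am assuming that each label covers its first and last year inclusively, which is at least consistent with the "187 Years" total row, and the helper names are hypothetical.

```python
PERIODS = [
    "1820-30", "1831-40", "1841-50", "1851-60", "1861-70", "1871-80",
    "1881-90", "1891-00", "1901-10", "1911-20", "1921-30", "1931-40",
    "1941-50", "1951-60", "1961-70", "1971-80", "1981-90", "1991-00",
    "2001-06",
]


def period_years(label: str) -> int:
    """Number of calendar years covered by a label such as '1891-00'."""
    start_text, end_text = label.split("-")
    start = int(start_text)
    end = (start // 100) * 100 + int(end_text)
    if end < start:  # '1891-00' ends in 1900, not 1800
        end += 100
    return end - start + 1  # both end years are counted


def annual_average(label: str, total: int) -> float:
    """Average number of immigrants per year over the period."""
    return total / period_years(label)


print(sum(period_years(p) for p in PERIODS))          # years covered overall
print(round(annual_average("1991-00", 9_095_417)))    # last full decade
print(round(annual_average("2001-06", 7_009_322)))    # last, 6-year period
```

On a per-year basis the 2001-06 total is higher than the 1991-00 one, which is why the dip at the end of the decade-based curves is misleading.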
An idea that I had and discarded was to show cumulated values (stocks). 
The left graph shows the cumulated values as part of the total. The second shows the cumulated values in absolute figures.
On both graphs, one can see the decline of the share of European immigrants. It’s more striking on the second, when the blue curve suddenly flattens around the turn of the century, while the green one (America) then the red one (Asia) start to thicken. 
So we have a story there. But then, what are these numbers? what would the sum of all those migrants mean, over nearly 200 years? That’s a very different number from the stock of all migrants currently living in the USA, because over so much time, most of them are dead. And it is also a very different number from the sum of all immigrants that ever came to the USA. Starting at 1820 is quite arbitrary – and does in fact exclude most African arrivals. So based on that dataset alone, which is the rule of the contest, it’s just not possible to work with cumulated values and get meaningful results.  
Then, I thought of doing a matrix chart instead of the stacked column chart done by FAIR. 

 

Doing a matrix chart like this (several charts on top of one another, using the SAME SCALE, which can be added vertically – and visually) is the textbook way of showing variables in such a way that one can see their evolution over time and their proportion in the total. 

This kind of chart is not natively supported in Excel, so I’ve done it with Processing.

(I wrote a program to make them in Excel, but will talk about that in a later post.)

It’s an interesting graph: it shows European immigration peaking, then the Americas taking off, followed by Asia. In the early 20th century, the Mexican revolutions caused much emigration to the US; this is the ripple in the graph. 

But then, I thought it was too complex. Frankly, by glancing at it, you don’t get anything. You might learn information by examining it. 

So I have done this one which I am going to submit. 

And here I have my 2 stories in a much lighter graph. 

The blue rectangles are the total immigrants. Various laws and events have shaped that curve; I first wanted to annotate them all, but decided against it. I just kept the Immigration Act, which was in force between 1924 and 1965 and which largely explains the drastic drop in immigration in that time. 

Without any other variable to compete with it, you can clearly follow its story. 

Then, I’ve added the share of Europeans in all the immigrants. That’s another clear story: in the 19th century, they made up the bulk of the immigrants, but then their share dropped sharply to around 15%. My guess, though, is that the shape of the first leg of this curve (from about 70% to over 90%) is due to the fact that many unidentified immigrants were really Europeans. 

For the title of the left axis, I’ve chosen naturalization over number of immigrants or another denomination because most of the “immigrants” of the last few decades are really people already residing in the USA who get naturalized.

But that’s another contradiction in the dataset. In 1868, when the 14th amendment to the Constitution came into force, about 4 million former slaves became American citizens. They are not shown in the data. In 1924, the Native Americans who were not yet citizens were also granted citizenship. They too are not included in the dataset. However, since 1965, most “immigrants” are change-of-status migrants who were already in the USA. But then, we are to play with this dataset, so that’s the best I could come up with.

Lastly, a few words about the design. I took some of the colours from a chart I really liked, by Viveka Weiley. In her chart she uses the MyriadPro font (guess she’s a Mac, but I’m a PC). I am using Frutiger which is quite similar.