A few weeks back, a chart on aid made the rounds of the internet:
All US ODA by recipient, 2004-2008, OECD data, taken from USAidWatchers.com
What this chart shows is that US aid is concentrated in a few countries. The article explains that this is a result of the 3D doctrine, which ties development to diplomacy and defense. This is why the US gives so much to strategic countries like Afghanistan, Iraq, or Sudan, but relatively little to India (highlighted in the chart), which has “a huge chunk of the world’s poor”.
When I saw that chart I was planning to create a chart or a data visualization on the same subject for my work. The original chart was being heavily criticized for its form, because half of it is not legible. Chart purists don’t like pie charts for that very reason – they are difficult to read, especially if you add more items. But I found the chart interesting. It states in a very striking way that more than a hundred countries in the world get next to nothing from the USA.
An apology for extreme charts
There are virtues to an illegible chart. In fact, I don’t believe that a chart should give equal prominence to each and every one of its data points. In most cases, it’s there to support a story, so all it should do is carry a message. Tufte popularized the notion of data-ink ratio, which states that a chart designer should use the largest share of ink to represent data, not everything else. I feel this is taken too literally by many.
There is a tradition of extreme charts which purposely break presentation rules because of the very nature of the subject they are plotting. A famous example is Al Gore on his lift – if CO2 emissions hadn’t increased so much, he wouldn’t need that lift to show his chart.
Al Gore on his lift, the most memorable image of An Inconvenient Truth
Another one – from the NY Times, one of the charts that Matthew Ericson showed in his Infovis 2007 keynote speech:
Click to see the full image - it is big. I really like this chart.
Again, if the number of US soldiers killed per month had not been so high in WWII, the 2nd group of bars wouldn’t overwrite the text above and sky-rocket to the top of the page. The logical thing to do would have been to scale the chart so that the maximum values would fit in a well-delimited space, and maybe use a logarithmic scale so that the values for other wars would remain legible. That’s how we would have done it if we had to plot that kind of series in an OECD book. The fact that the NYT designers chose, on the contrary, to let the data rise all the way to the top of the page expresses in a very powerful way the extreme nature of the WWII casualties.
“A conventional chart couldn’t hold all that horror”, the chart seems to say. Likewise, if CO2 emissions had grown more steadily over the past couple of centuries, Al Gore wouldn’t have needed a lift. By the same token, if aid values to about 100 countries were more than negligible, they could be seen on that chart. So granted, there could be more academic ways to show that, like a giant bar chart with values too small to see for all but a handful of states. But all in all I think the original pie chart does a good job in communicating that in a nutshell, ad absurdum if you will.
My take on the chart
I wanted to work on a specific subset of aid data, that which goes to fragile states, which are, simply put, the 43 countries in the greatest need of aid. Now official aid from developed countries, like US aid, is very concentrated, meaning that only 10 of these countries got more than $1b in 2008. Only 10 countries got more than $100 per capita in that year.
Another interesting aspect of the data is that for many of these countries, aid comes mostly from one or two donors, so they are vulnerable to a policy change in a donor country. That’s what I wanted to show in the representation.
I’ve played a bit with the other 3 datasets of the Tableau Public contest. Having manipulated them myself, I will find it easier to learn from what others have done once I get to see it. The one I’ve spent the most time with is the US budget spending one. Here’s the sheet I came up with:
(If the viz doesn’t show in the blog, here’s the direct link.)
A few explanations:
Unit: % of GDP
The dataset covers almost 40 years, and includes a notion of inflation. But even with that it’s too difficult to compare spending over time. Instead of trying to convert everything to 2009 constant dollars, it’s easier (and it makes more sense) to compare everything as percentage of GDP.
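As a minimal sketch of that normalization, the snippet below divides each year's spending by that year's GDP. The function name and the figures are made up for illustration; they are not taken from the contest dataset.

```python
def share_of_gdp(spending, gdp):
    """Return spending as a percentage of GDP for every year present in both dicts."""
    return {
        year: 100.0 * spending[year] / gdp[year]
        for year in spending
        if year in gdp
    }


# Hypothetical figures in billions of current dollars (illustration only).
spending = {1970: 50.0, 2009: 700.0}
gdp = {1970: 1000.0, 2009: 14000.0}

shares = share_of_gdp(spending, gdp)
for year, share in sorted(shares.items()):
    print(year, f"{share:.1f}% of GDP")
```

In this made-up case nominal spending is 14 times higher in 2009 than in 1970, yet the share of GDP is the same, which is the kind of comparison that current dollars hide.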
Filter: by function
The original dataset lists over 30 departments. I don’t think they are immediately comparable as is, some being much bigger than others. Besides, it’s just too complicated to ask people to choose between 30 items to make comparisons. So, instead, I grouped several departments by function, as defined by the COFOG (Classification of the Functions of Government, a UN classification). To be honest I wasn’t extra careful when I assigned some departments to a function; for instance, Veterans Affairs could have been assigned to Defense or to Social Protection (I chose the latter). But the assignments are fair. The added bonus is that using functions enables us to make international comparisons:
Comparing with OECD values
Not too long ago I made a chart comparing OECD countries’ budget expenditures. So what I didn’t like about this dataset is that it didn’t give a way to determine whether US spending in a given area was high or low. From the dataset proper, one can tell, for instance, that social protection expenses were never as high as in 2009. But are they really “high”? Or: defense expenditure was at an all-time low in 1999. But was it really low?
Comparing with other values helps answer those questions. To continue with these 2 examples, social protection expenditure, in 2009, was 7.2% of GDP, a much higher share than in 1965 (3.9%) but still very low compared to OECD countries, the average being 15.2%. Conversely, defense, in 1999, only represented 3.1% of GDP; it was as high as 9% during the Vietnam War, and it’s almost 5% today. Meanwhile, the OECD average is 1.4%.
Again, that comparison is not very scientific, because the numbers used for those OECD averages include other levels of government (states, cities…) which are not included here. But still, they help put the dataset in context.
This is the number of percentage points of the adult obesity rate which are not explained by the median income of the county. In other words, marks that appear very green are counties where people are less obese than in counties with similar incomes, and conversely, red marks show counties where obesity is more widespread than expected from income alone.
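One way to obtain such a number is the residual of a simple linear fit of obesity on income. This is only a sketch under that assumption: the text does not say how the unexplained points were computed, and the four data points below are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys):
    """Observed value minus the value predicted from x alone."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical counties: median income (thousands of dollars), obesity rate (%).
incomes = [30, 40, 50, 60]
obesity = [32, 28, 27, 21]

slope, intercept = fit_line(incomes, obesity)
unexplained = residuals(incomes, obesity)
# Negative residual: less obese than income predicts (green on the map).
# Positive residual: more obese than income predicts (red on the map).
```

With these invented values the third county gets a positive residual: its obesity rate is higher than its income alone would suggest.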
This suggests that there could be some cultural or regional explanations for the health situation. People in the mountain states, especially Colorado, show as very green, while people in the old South are very red. The Midwest and Northwest are average; New England and Florida tend to be better than average. Yesterday, there was a show on French TV calling Houston the fat capital of the USA, and explaining that by cultural reasons. But the fact is, obesity rates in Harris County are lower than average, and on this map the corresponding mark is green, showing that factors other than income play a positive, not negative, role. I love it when facts and numbers get in the way of a nice story.
And this one is put together, say, just for aesthetic purposes. It categorizes counties by their median income (in increments of $1,000, X-axis) and their obesity rates (in chunks of 0.5%, Y-axis) and plots the total population of the counties that meet both criteria. Then again, it doesn’t show the actual number of people who are in a specific income and obesity bracket; it just adds up the population of whole counties.
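The binning described above can be sketched as follows. The function name, the tuple layout and the three counties are assumptions made for the example; only the bin widths ($1,000 and 0.5 points) come from the text.

```python
import math
from collections import defaultdict


def bin_population(counties, income_step=1000, obesity_step=0.5):
    """Sum county populations per (income bin, obesity bin) cell.

    `counties` is an iterable of (median_income, obesity_rate, population).
    Each bin is labelled by its lower bound.
    """
    totals = defaultdict(int)
    for income, obesity, population in counties:
        income_bin = math.floor(income / income_step) * income_step
        obesity_bin = math.floor(obesity / obesity_step) * obesity_step
        totals[(income_bin, obesity_bin)] += population
    return dict(totals)


# Hypothetical counties (illustration only).
counties = [
    (41250, 27.3, 10000),
    (41900, 27.1, 5000),
    (55000, 22.6, 20000),
]
cells = bin_population(counties)
```

The first two counties land in the same cell, so their whole populations are added together, which is exactly the limitation mentioned above: the cell counts residents of counties, not people in that income and obesity bracket.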
So here it is. I chose to compete on the Activity Rates and Healthy Living dataset, because after downloading it I really enjoyed exploring it.
If the viz doesn’t show well in the blog, here’s a link to its page
My main reason for entering the contest is to be able to see what others have done. There are obviously many, many ways to tackle this and I am very much looking forward to seeing everyone’s work! My interactions with the Tableau community, especially through the forum, have always been very rewarding, and what better way to learn than by example!
So for the fellow contestants who will see my work, here is my train of thought for the dashboard.
The dataset
I’m aware of USDA’s food environment atlas. It’s an application where people can see various food-related indicators on a map. The dataset we were handed is actually the background data of this. So, there is already a place where people can consult food indicators.
Now this being Tableau and all, I wanted to create an analytical dashboard where people could understand if and how the input variables affected the output variables.
The dataset consists mostly of input variables: various indicators that influence how healthy a local population is. That status (output) is expressed through a few variables, such as adult and child obesity rates and adult diabetes rates. Those variables are highly correlated with each other, so in my work I chose to focus on the adult obesity rate, which is the simplest one.
Now, inputs. The rest of the variables fall in several categories:
income (median household income, poverty rates);
diet (consumption of various food items per capita);
shopping habits (for various types of stores or restaurants, the dataset would give their number and the money spent in each county, both in absolute numbers and per capita);
lifestyle information (data on households without cars and far from stores, on the physical activity level of the population, and the facilities offered by the state);
pricing variables (price ratios between some “healthy” food items and some less healthy, equivalent food items, for instance fruits vs. snacks; tax information on unhealthy food);
policy variables (measuring participation in various programmes such as SNAP or WIC);
socio-demographic variables (ethnic groups in population, “metro” status of county, whether the county was growing or shrinking, and voting preferences).
Yes, that’s a lot of variables (about 90, plus the county and state dimensions).
Oddly enough, there wasn’t a population measure in the dataset, and many indicators were available in absolute value only, so I constructed a proxy by dividing two variables on the same subject (something like “number of convenience stores” and “number of convenience stores / capita”).
That enabled me to build indicators per capita for all subjects, so I could see if they were correlated with my obesity rates.
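The proxy can be sketched as below. The function names and the numbers are hypothetical; the sketch also assumes the ratio is expressed per person (if a column were per 1,000 residents, the result would need scaling accordingly).

```python
def population_proxy(count, count_per_capita):
    """Estimate population as count / (count per capita).

    Returns None when the per-capita value is zero, since the ratio
    is undefined for a county that has none of the items counted.
    """
    if count_per_capita == 0:
        return None
    return count / count_per_capita


def per_capita(total, population):
    """Turn an absolute indicator into a per-capita one using the proxy."""
    if population is None:
        return None
    return total / population


# Hypothetical county: 50 convenience stores, 0.0005 stores per resident.
population = population_proxy(50, 0.0005)            # about 100,000 people
spending_per_capita = per_capita(2_000_000, population)  # about 20 per person
```

A county with no stores of the chosen type yields no proxy at all, so in practice one would probably fall back on another pair of variables for those counties.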
Findings – using Tableau desktop to make sense of the dataset
The indicators which were most correlated with obesity were the income ones, which came as no surprise. All income indicators were also highly correlated with each other. In the USA, poverty means having an income below a certain threshold which is defined at the federal level. But in other contexts, poverty is most often defined in relation to the median income (typically, a household is in poverty if its income is below half of the median income), so it can be used to measure the inequality of a community and the dispersion of incomes.
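To make the difference between the two definitions concrete, here is a small sketch. The incomes and the absolute threshold are invented for the example and are not the federal figures.

```python
from statistics import median


def absolute_poverty_rate(incomes, threshold):
    """Share of households below a fixed income threshold."""
    return sum(1 for income in incomes if income < threshold) / len(incomes)


def relative_poverty_rate(incomes, fraction=0.5):
    """Share of households below a fraction (by default half) of the median income."""
    threshold = fraction * median(incomes)
    return sum(1 for income in incomes if income < threshold) / len(incomes)


# Hypothetical household incomes (illustration only).
incomes = [10_000, 20_000, 30_000, 40_000, 100_000]

absolute_rate = absolute_poverty_rate(incomes, threshold=25_000)  # hypothetical threshold
relative_rate = relative_poverty_rate(incomes)  # threshold is half of the 30,000 median
```

The relative rate depends only on how incomes are spread around the median, which is why it doubles as a measure of inequality; the absolute rate moves with the overall income level of the community.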
As a result, many indicators appear to be correlated with obesity because they are not independent of income. This is the case for instance for most of the policy indicators: if a programme has many recipients in a county, it is because poverty is widespread, so residents are more likely to be affected by obesity. This makes it difficult to measure the impact of the programmes with this dataset. This is also the case, unfortunately, for racial indicators, as most of the counties with a very high black population have a low income.
Diet indicators also appear to be uncorrelated with obesity. This is counter-intuitive: isn’t eating vegetables or fresh farm produce the most certain way to prevent obesity? But one has to remember that this dataset is aggregated at the county level. Just because a county has a high level of, say, fruit consumption per capita doesn’t mean that every household is eating that much. Realistically, consumption will be very dispersed: the households where people cook, which are less likely to be affected by obesity, will buy all the fruit, and those where people don’t cook will simply buy none. Also, just because one buys more vegetables than average doesn’t mean they don’t also buy other, less recommended foodstuffs.
The only diet indicator that appears to be somewhat correlated with obesity is the consumption of soft drinks.
When it comes to lifestyle habits, surprisingly, the proportion of households without a car and living far from a store (people likely to walk more, and so to be healthier) is positively correlated with obesity. This is because counties where this indicator is high are also poorer than average; again, income explains most of this. However, physical activity in general plays a positive role. States where people are most active, such as Colorado, enjoy the lowest obesity figures. In fact, all the counties with less than 15% obesity are in Colorado.
Finally, pricing didn’t seem to have much impact on either obesity or consumption. Why is that? Economists would call this “low price elasticity”, meaning that price changes do not encourage people to switch products and habits. But there is another explanation. Again, people who can’t cook are not going to buy green vegetables because they are cheaper. Also, consider the tax rates that are applied: no more than 7% in the most aggressive states. Compare that figure to the 400%+ levy that is applied to cigarettes in many countries of the world! Clearly, 4-7% is not strong enough to change habits. However, this money can be used to sponsor programmes that can help people adopt safer behaviors.
What to show? Making the visualization
First, I wanted to show all of those findings. If 2 variables that you expect to be correlated (say, consumption of vegetables and obesity) are in fact not correlated, a point is made! But visually, nothing is less interesting than a scatterplot that doesn’t exhibit correlation. It’s just a stupid cloud of dots.
So instead I chose to focus on the correlations I could establish, namely: obesity and income, and obesity and activity. Those are the 2 lower scatterplots of my dashboard. I chose the poverty rate measure, because I’d rather have a trend line going up than going down.
I duplicated that finding with a bar chart made with median income bins. For each bin (which represents all the counties where the median income falls in that range), I would plot the average obesity rate, and, miracle! this comes up as a markedly decreasing bar chart. Now, this figure doesn’t establish correlation, let alone causality, but it certainly suggests it more efficiently than a scatterplot. Also, it can double as a navigation aid: clicking on a bar would highlight or select the relevant counties.
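The aggregation behind such a bar chart can be sketched like this. The bin width, the function name and the five counties are assumptions; the average is a plain, unweighted mean across counties, which may differ from what the dashboard actually computes.

```python
from collections import defaultdict


def average_obesity_by_income_bin(counties, bin_width=5000):
    """Average obesity rate of the counties falling in each median-income bin.

    `counties` is an iterable of (median_income, obesity_rate).
    Bins are labelled by their lower bound and returned in increasing order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for income, obesity in counties:
        income_bin = int(income // bin_width) * bin_width
        sums[income_bin] += obesity
        counts[income_bin] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical counties (illustration only).
counties = [
    (32000, 31.0),
    (34000, 33.0),
    (47000, 28.0),
    (61000, 24.0),
    (63000, 22.0),
]
bars = average_obesity_by_income_bin(counties)
```

Averaging within bins hides the scatter inside each bin, which is why the bars look so much tidier than the underlying cloud of dots.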
Finally, I decided to do a map. Well, actually, it was the first thing I had done, but I had second thoughts about it, and eventually I put it in. Why? First, to allow people to look up their county. Technically, my county is Travis County (Austin, TX) and I can find it easily on a map. Less so if I have to look for county names listed in order of any of their indicators. I added a quick filter on county name, for those who’d rather type than look up.
I also wanted to see whether there was a link between geography and obesity. So try the following.
Where are the counties with obesity rates less than 15% ? Colorado only.
If we raise the threshold a little, we get San Francisco and New York. But until 20%, these counties remain very localized.
Likewise, virtually all counties above 35% are in the South: Alabama, Louisiana, Mississippi.
Population also has an importance. The counties with a population above 1m people tend to have lower rates – their citizens also usually have higher incomes.
I decided to zoom the map on the lower 48 by default. It is possible to zoom out to see Alaska and Hawaii, but I don’t think that the advantage of seeing them all the time outweighs the inconvenience of a smaller view of the other states when they are not needed.
Regarding the marks. Originally, I didn’t assign any variable to their size, but then thought that the larger counties (i.e. LA, Harris (Houston), Cook (Chicago)…) were underrepresented. So I assigned my population proxy to size. But then, the density of the marks competed with the intensity of the color, which was attributed to the obesity rate. So I removed that and chose a size so that marks wouldn’t overlap each other too much. Regarding color, I wasn’t happy with the default scale. If I left it as is, it would consider that 12.5%, the minimum value of the dataset, is an extremely low number. But in absolute terms, it’s not. Most developed countries have obesity rates lower than that value at the national level. Japan and Korea are below 4%. So I made the scale start at 0. But I didn’t like the output: the counties with the highest values didn’t stand out. Eventually, I chose a diverging scale, which helped counties with high and low values to be more visible.
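For readers who have not met the term, a diverging scale maps values to two hues that fade toward a neutral midpoint. The sketch below is a generic illustration, not Tableau's palette: the green/white/red choice, the midpoint and the maximum are all assumptions; only the 12.5% minimum comes from the text.

```python
def diverging_color(value, low, mid, high):
    """Map a value to an (R, G, B) tuple: green at `low`, white at `mid`, red at `high`.

    Assumes low < mid < high. Values outside [low, high] are clamped.
    """
    value = max(low, min(high, value))
    if value <= mid:
        t = (mid - value) / (mid - low)   # 0 at the midpoint, 1 at the low end
        fade = round(255 * (1 - t))
        return (fade, 255, fade)
    t = (value - mid) / (high - mid)      # 0 at the midpoint, 1 at the high end
    fade = round(255 * (1 - t))
    return (255, fade, fade)


# 12.5 is the dataset minimum quoted above; 28 and 44 are hypothetical choices.
lowest = diverging_color(12.5, low=12.5, mid=28.0, high=44.0)
middle = diverging_color(28.0, low=12.5, mid=28.0, high=44.0)
highest = diverging_color(44.0, low=12.5, mid=28.0, high=44.0)
```

Because both ends are saturated and the middle is pale, counties at either extreme stand out, whereas a sequential scale starting at 0 compresses the whole dataset into the dark end.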
I edited a tooltip card for the view. In another version of the dashboard, I had a sheet with numbers written out that would change depending on which county was last brushed. I like the idea that this information can stay on. But I got confused in the configuration of the actions, and couldn’t completely prevent the filter that applied to this sheet from being disabled at times, which caused the numbers for all counties to overlap, with an annoying delay when that happened. So I made a tooltip instead. Anyway, it’s easier to format text like this. But the problem is that it can hide a good portion of the dashboard. So I exercised restraint and only chose what I found to be the 15 or so most relevant variables.
Voilà! That’s it. I hope you like my dashboard, and I look forward to seeing the work of others! If you are a contestant, please leave a link to your entry in the comments. Good luck to all!!
It’s a follow-up to Making Data Meaningful part 1, which focused on writing about data, as opposed to visualizing it.
The book is a cooperation between representatives of national statistical offices and intergovernmental organizations, all public statisticians, if you will. I hope it will help others to communicate their data better. Personally, I wrote the part about charts and contributed to some other chapters. But if I could sum up my advice in one sentence, it would be: go buy Stephen Few’s books. Start with Show Me the Numbers.
The list of people who collaborated on the book includes:
Yesterday’s post on Tableau Public generated a surge of traffic so I thought I should add more examples and practical information for people interested in the software.
Tableau Public doesn’t exactly allow you to do everything that Tableau does from the web. To prepare the views which are going to be published on the web, you need to use software that runs on your computer. It lets you do whatever you can do with the regular Tableau Desktop, with a couple of limitations: you have to stick to basic source file types (Access, Excel, and text files, no exotic databases) and you are limited to 100,000 records of data. One other difference with the regular Tableau Desktop is that you can’t save your work locally: you have to save it on the web, in your private space on Tableau servers. However, Tableau Public has the same analytical and visual features as Tableau Desktop.
When your work is published, users don’t have access to all the tools you had when creating the view: they can’t move dimensions around, create exotic filters or calculations. They really see the chart as you intended it to be seen. There are a certain number of interactions built-in, however: users can select, highlight, sort and filter. If you are publishing a dashboard, the different tables and charts of the dashboard can be linked, meaning that an action (such as highlighting one dimension) in one place will be replicated elsewhere, or not. The underlying data can also be downloaded. So there is a great deal of interactivity, but not enough to twist your display beyond recognition. That being said, other Tableau Public users can download your workbook and manipulate it with the client software.
About the beta: currently, Tableau Public is in closed beta. It will be in open beta in February, as far as I know. To get a spot in the closed beta, you need to write to the people at Tableau.
I am currently beta testing Tableau Public. Essentially, Tableau Public lets you bring the power of Tableau analysis online. With Tableau Public, your audience doesn’t need to download a workbook file and open it in an offline software client: they can see and interact with your work directly on a web page.
There are quite a few examples of the things you can do with Tableau Public. These are the examples you are given when you start the product:
1. There is no comparable data visualization platform out there.
There are many ways to communicate data visually. Count them: 13, 20, 28, 75… and many more.
However, these tools have a narrower focus than Tableau, or require some programming ability from the user. For instance, Many Eyes offers a certain number of types of data visualization which can be set up in seconds, but which cannot be customized. Conversely, Protovis is very flexible but requires some knowledge of JavaScript. And even for a skilled developer, coding an interactive data visualization from scratch takes time.
By contrast, Tableau is a fully-featured solution which doesn’t require programming. It has many representation types which can be deeply customized: every visual characteristic of a chart (colour, size, position, etc.) can depend on your data. Several charts can also be combined as one dashboard. On top of that, data visualization done in Tableau comes with many built-in controls, with an interface to highlight and filter data, or to get more details on demand. For dashboards, it is also possible to link charts, so that actions done on one chart (highlighting records, for instance) affect other charts.
2. The solution is not limitless.
Tableau enables you to do things which are not possible using other packages. But it doesn’t allow you to do just anything. That’s for your own good: it won’t allow you to do things that don’t make sense.
There are many safety nets in Tableau, which you may or may not run into. For instance, you can’t make a line chart for data which don’t have a temporal dimension – so much for parallel coordinates. However, the system is not fool-proof. Manipulating aggregates, for instance, can lead to errors that you wouldn’t have to worry about in plain old Excel, where the various steps through which data are computed to create a graph are more transparent (and more manual). Compared to Excel, you have to worry less about formatting – the default options for colours, fonts and positions are sterling – and be more vigilant about calculations.
3. Strength is in numbers.
Over the years, many of us grew frustrated with Excel’s visual capabilities. Others firmly believed that anything could be done with the venerable spreadsheet and have shown the world that nothing is impossible.
The same applies to Tableau. The vibrant Tableau community provides excellent advice. “Historic” Tableau users are not only proficient with the tool, but also have a better knowledge of data visualization practices than the average Excel user. Like any fully-featured product, there is a learning curve to Tableau, which means that there are experts (the proper in-house term is Jedis) who find hacks to make Tableau even more versatile. So of course, it is possible to do parallel coordinates with Tableau.
The forum, like the abundant training material (available as videos, manuals, lists of tips, or online sessions with an instructor), doesn’t only help users solve their problems; it is also a fantastic source of inspiration.
With the introduction of Tableau Public, the forum will become even more helpful, as there will be more questions, more problems and more examples.
The chart has since been debated and criticized by, among others, Jon Peltier, Andrew Gelman, and Evan Falchuk, who all made valid points. For instance, to show correlation and outliers, a scatterplot does a much better job. That being said, it’s difficult to see the country names with a scatterplot. On the substance, the number of doctor visits is not the most relevant variable to bring into this picture, mostly because this number directly depends on the compensation mode of these doctors, not on their efficiency. The notion of “universal coverage” is also quite arbitrary. France, for instance, which had what could be called universal coverage since 1945, got an even more “universal” one in 2000. And still, some people can’t receive the healthcare they need.
The chart is based on OECD data, from a recently released book: OECD Health at a Glance. For the release of the book, I had worked on 2 presentations, which remained unpublished. Since they were not formally published by the OECD, the standard disclaimer applies: they do not commit the organization and do not necessarily represent its point of view or that of its members.
Anyway, for anyone interested in health statistics in general and in USA healthcare specifically, here they are in their SlideShare glory:
While the look and feel is pleasing, I was bothered by a few design choices.
First, homicides and accidental deaths are not taken into account. I suspect that for some demographic categories, they represent a significant proportion of the deaths.
Second, the table doesn’t give an indication of the differences in mortality between the different age groups. For instance, there are over 15,000 deaths per 100,000 people over 85 years old, but only about 130 per 100,000 for young people aged 15-24. So the last item in the right-most column corresponds to many more deaths than the top item in the left-most column, although they have the same visual weight.
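The size of that gap is easy to work out from the two rates quoted above (the variable and function names are just for illustration):

```python
# Death rates per 100,000 people, as quoted in the text above.
rate_85_plus = 15_000
rate_15_to_24 = 130


def rate_ratio(high, low):
    """How many times higher one death rate is than another."""
    return high / low


ratio = rate_ratio(rate_85_plus, rate_15_to_24)
print(f"{ratio:.0f}x")  # prints "115x"
```

So, at equal group sizes, a cell in the oldest column stands for roughly a hundred times more deaths than a same-sized cell in the 15-24 column.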
Coincidentally, I got to try Tableau Public Beta and thought it would be a good exercise to give it a spin.
The data source is the same. I got my data through the CDC’s WONDER service.
Here goes:
By playing with the filters you can see the ranking of the causes of death. For instance, we can see that accidents and homicide are precisely the leading causes of death of young people aged 20 to 24. Now what if you want to see the demographic categories that one given cause of death affects most? Here’s a second visualization:
You can see that certain causes of death, for instance, only affect one gender or the other (such as certain forms of cancer).
I’ve made that last one to illustrate the evolution of mortality with age. No one would be surprised to learn that older people have a higher probability of dying, but in what proportions?
The idea is to illustrate topics on which we work by a chart which we’ll change regularly. So in order to do that, I’d have to be able to create charts of publishable quality.
Excel screenshots: not a good option
There are quite a few tools to create charts on the net. Despite this, the de facto standard is still a screenshot of Excel, a solution which is even used by the most reputable blogs.
This is taken from http://theappleblog.com/2009/12/18/iphone-and-ipod-touch-see-international-surge/
But alas, Excel is not fit for web publishing. First, you have to rely on Excel’s choice of colours and fonts, which won’t necessarily match those of your website. Second, you can’t control key characteristics of your output, such as its dimensions. And if your chart has to be resized, it will get pixelated. Clearly, there is a better way to do this.
That's a detail of the chart on the link I showed above. The letters and the data bars are not as crisp as they could have been.
How about interactive charts?
Then again, the most sensible way to present a chart on the web is by making it interactive. And there is no shortage of tools for that. But there are just as many issues.
Some come from the content management system or blogging environment. Many CMSs don’t allow you to use JavaScript and/or Java and/or Flash. So you’ll have to use a technology which is tolerated by your system.
Most JavaScript charting solutions rely on the <CANVAS> element. Canvas is supported by most major browsers, with the exception of the Internet Explorer family. IE users still represent roughly 40% of the internet, and much more in the case of my OECD blog, so I can’t afford to use a solution that isn’t IE-friendly. There is at least one library which works well with IE, RaphaelJS.
Using Java causes two problems. First, the hiccup caused by the plug-in loading is enough to discourage some users. Second, it may not be handled well by feed readers:
This is how one of my posts reads in Google Reader.
And it’s futile to believe that readers will read blogs from their home pages. So if not all feed readers can display it properly, it’s a show-stopper.
A tool to create good bitmap charts
So, in a variety of situations the good old bitmap image is still the most appropriate thing to post. That’s why I created my own tools with Processing.
When you unzip the files, you have a file called “mychart.txt” which is a set of parameters. Edit the file to your liking according to the instructions in “instructions.txt”, then launch the tool (plotter application). It will generate an image, called “mychart.png”.
With my tools, I wanted to address two things. First, I wanted to be able to create a chart and to have precise control of all of its components, especially the size. In Excel, by contrast, it’s difficult to control the size of the plotting area, or the placement of the title: all of these things are done automatically and are difficult to correct (when it’s possible at all). Second, I wanted to be able to create functional thumbnails.
If you have to create smaller versions of a chart from a bigger image, the easiest solution is to resize the chart using image-editing software. But this is what you’d get:
That's the original chart.
And that's the resized version. Legible? nah.
But what if it were just as easy to re-render the chart in a smaller size, than to resize it with an external program? My tool can do that, too.
Left: resized, right: re-rendered.
Here’s a gallery of various charts done with the tool. The tool supports: line charts, bar charts (both stacked and clustered), dot charts and area charts. No pie charts included. It’s best suited for simple charts with few series and relatively few data points.
Impact of energy subsidies on CO2 emissions
Temperature and emission forecasts
Greenhouse gas emission projections
I hope you find it useful. Tell me if you do, and let me know if you find bugs.