Working with data in protovis – interlude: protovis nesting vs tableau

Protovis, like Tableau, is based on the grammar of graphics framework. In a nutshell, in both environments, a chart designer can map visual attributes (such as x or y position, color, shape, etc.) to dimensions of the data.

The flat file on which Becker’s Barley is based can be used in Tableau Public nearly as is.
Here’s a side-by-side comparison of the results:

How it’s done in Protovis

In protovis, the flat file is nested several times, so that its elements can be called in a hierarchy: from a series of dots per variety, to a panel per site, to an overall panel. Legend and ticks are added by hand for a perfect finish. Still, some careful planning is required to prepare the data file and to adjust the various elements (choice of colors, sizes, etc.).

How it’s done in Tableau

The data file, unsurprisingly, has 4 dimensions: site, variety, year and yield. In tableau terms, yield is a measure (a numerical dimension) while the other 3 are categorical dimensions. With a few clicks, it is possible to get a result which is similar to the original vis in Protovis. We assign yield to column (horizontal attribute), site and variety to row in that order. We also untick aggregate measures in Analysis, so we get little circles and not big bars. Here, I’ve manually sorted the sites and the varieties.


It is much, much easier to achieve a similar result with Tableau; however, protovis provides finer control.


Working with data in protovis – part 3 of 5

Previous : Multi-dimensional arrays, inheritance and hierarchy

Short interlude: what can be done with arrays in javascript?

Now that we have a grasp on how arrays work and how they can be used in protovis, let’s take a step back and look at some very useful methods for working with arrays in standard javascript.

Sorting arrays

Using a method called, well, sort(), javascript arrays can be reordered in a variety of ways.

var a = [1, 1.2, 1.7, 1.5, .7, .3];

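Without any argument, a.sort() will sort the array in ascending order: [0.3, 0.7, 1, 1.2, 1.5, 1.7]. (Strictly speaking, the default sort compares elements as strings, which happens to give the numerical order here; for numbers in general, it is safer to pass a comparison function.)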

However, we can reorder the array differently with a comparison function, such as:

a.sort(function(a,b) b-a)

Note that this is protovis notation. In traditional javascript, you’d have written a.sort(function(a,b) {return b-a;});

This would sort a in descending order ([1.7, 1.5, 1.2, 1, 0.7, 0.3]). Here is how it works.

The comparison function takes 2 elements of the array. If the first element (corresponding to the first argument, a) should appear first, the function must return a negative number. If the result is positive, the second element appears first. With our example, b-a is negative when the first element is greater than the second, so the greater element comes first. This function therefore sorts an array in descending order.

Sort also works with arrays of associative arrays. For instance:

var data = [
  {key:"a", value:1},
  {key:"b", value:1.2},
  {key:"c", value:1.7}, 
  {key:"d", value:1.5},
  {key:"e", value:.7},
  {key:"f", value:.3}
];

data.sort(function(a,b) b.value-a.value);

Here, we have to specify which criterion is used to sort the array. In this example, we sort it in descending order by value. But we could have sorted it in ascending order by key, for instance.
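
To make the "ascending order by key" variant concrete, here is a minimal sketch in standard JavaScript (with explicit return statements rather than the protovis shorthand). The variable names byKey and byValueDesc are only illustrative, and the data is a shuffled subset of the sample above.

```javascript
// A shuffled subset of the sample data used in the text.
var data = [
  {key: "c", value: 1.7},
  {key: "a", value: 1},
  {key: "f", value: 0.3},
  {key: "b", value: 1.2}
];

// sort() reorders an array in place, so slice() is used to work on a copy.
// Keys are strings, so they are compared with < and > rather than subtracted.
var byKey = data.slice().sort(function (a, b) {
  return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
});

// Values are numbers, so subtraction gives the sign the comparator needs.
var byValueDesc = data.slice().sort(function (a, b) {
  return b.value - a.value;
});

console.log(byKey.map(function (d) { return d.key; }));       // [ 'a', 'b', 'c', 'f' ]
console.log(byValueDesc.map(function (d) { return d.key; })); // [ 'c', 'b', 'a', 'f' ]
```

Working on a copy is a design choice: it leaves the original order available for other charts.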

The method reverse() turns an array upside down.

var a = [1, 1.2, 1.7, 1.5, .7, .3];

a.reverse(); // gives [0.3, 0.7, 1.5, 1.7, 1.2, 1]

So, a.sort().reverse() will also sort that array in decreasing order.

Extracting sub-arrays

The method slice() extracts a sub-array out of an array; in other words, it retrieves certain elements. It works with one or two arguments. If you give only one argument, it returns all the elements from the index you specified to the end of the array.

For instance,

[1, 1.2, 1.7, 1.5, .7, .3].slice(3) // it's [1.5, 0.7, 0.3]

The index 3 corresponds to the 4th element of the array (0,1,2,3) so slice(3) will return the 4th, 5th and 6th elements.

With 2 arguments, you get a sub-array that starts at the first index but ends just before the 2nd one. For instance,

[1, 1.2, 1.7, 1.5, .7, .3].slice(0,3) // gives [1, 1.2, 1.7]

It is also possible to use negative arguments. Instead of counting from the beginning of the array, it means counting from the end.

[1, 1.2, 1.7, 1.5, .7, .3].slice(-1) // result: [0.3]
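
A few more combinations, as a small sketch in standard JavaScript (the variable names are just for illustration). It also shows that slice() leaves the original array untouched.

```javascript
var a = [1, 1.2, 1.7, 1.5, 0.7, 0.3];

var lastTwo = a.slice(-2);   // start two elements before the end
var middle = a.slice(1, -1); // drop the first and the last element
var copy = a.slice();        // no argument: a shallow copy of the whole array

console.log(lastTwo);  // [ 0.7, 0.3 ]
console.log(middle);   // [ 1.2, 1.7, 1.5, 0.7 ]
console.log(a.length); // 6, because slice() never modifies the array it is called on
```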

Turning a string into an array

This can be done via the split() method.

split() normally works with a separator, for instance "07/04/1975".split("/") is ["07", "04", "1975"], but if an empty string is used, then each character becomes an element of the new array. For instance,

"ABCDEFGHIJ".split("") // result: ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"] 

This can be very useful to generate data on the go.
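
As an illustration of generating data on the go, here is a minimal sketch in standard JavaScript. The values assigned to each letter are arbitrary, chosen only so that there is something to plot.

```javascript
// One record per letter; the value is an arbitrary function of the position.
var letters = "ABCDEF".split("");
var data = letters.map(function (letter, i) {
  return {key: letter, value: (i + 1) * 10};
});

// split() always returns strings; Number converts them when numbers are needed.
var dateParts = "07/04/1975".split("/").map(Number);

console.log(data.length); // 6
console.log(data[2]);     // { key: 'C', value: 30 }
console.log(dateParts);   // [ 7, 4, 1975 ]
```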

And what methods does protovis have for working with arrays and data?


Some simple, some very elaborate. In addition to visualization functions protovis has impressive data processing capabilities.

The workhorse: pv.range()

pv.range() is a method that creates an array of numbers.

pv.range(10), for instance, is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], or the first 10 integers starting from 0.
(This is the method I referred to in the example of the previous part.)

By using the other arguments, it is possible to generate arrays of numbers that go from one specific index to another, or to change the step. For instance:

pv.range(5,10) // is [5, 6, 7, 8, 9]
pv.range(5,21,5) // is [5, 10, 15, 20]

While this may not look incredible at first sight, pv.range() really shines when associated with other functions or methods such as map().

map() is an array method that creates a new array by applying a function to each element of the array it’s run against. For instance,

var data = pv.range(100).map(Math.random) // creates an array of 100 random numbers.

var data = pv.range(1000).map(function(a) Math.cos(a/1000)) // stores into data 1000 values 
// of cos(x) for all the numbers between 0 and 0.999 by increment of 0.001.

In other words, pv.range() and map() can quickly create very interesting datasets for protovis to visualize.
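
To show what is going on without protovis loaded, here is a rough stand-in written in standard JavaScript. The range function below is a hypothetical helper that only mimics the behaviour described above; it is not the protovis source.

```javascript
// Hypothetical stand-in for pv.range(): range(stop), range(start, stop)
// or range(start, stop, step). The stop value itself is excluded.
function range(start, stop, step) {
  if (arguments.length === 1) { stop = start; start = 0; }
  if (step === undefined) step = 1;
  if (step === 0) throw new Error("step must not be zero");
  var result = [];
  for (var v = start; step > 0 ? v < stop : v > stop; v += step) {
    result.push(v);
  }
  return result;
}

console.log(range(5, 10));    // [ 5, 6, 7, 8, 9 ]
console.log(range(5, 21, 5)); // [ 5, 10, 15, 20 ]

// Same idea as the cosine example, with an explicit return statement.
var cosines = range(1000).map(function (a) { return Math.cos(a / 1000); });
console.log(cosines.length); // 1000
```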

Simple statistical functions that make life easier

pv.min(), pv.max()

Protovis has functions that extract the maximum or the minimum value of an array.

pv.min([1, 1.2, 1.7, 1.5, .7, .3]) will return the value of the smallest element of the array, in that case 0.3.

Likewise, pv.max returns the value of the greatest element of the array.

Both of these can be used with an accessor function to help retrieve what will be compared.

For instance:

var data = [
  {key:"a", value:1},
  {key:"c", value:1.7}, 
  {key:"d", value:1.5},
  {key:"e", value:.7},
  {key:"f", value:.3}
];

pv.max(data, function(a) a.value); // should return 1.7

pv.sum(), pv.mean(), pv.median()

Likewise, protovis has aptly-named methods that return the sum, the mean and the median of the elements of an array. Note that it’s pv.mean() and not pv.average(), though.

pv.sum([1, 1.2, 1.7, 1.5, .7, .3]) // gives  6.4.
pv.mean([1, 1.2, 1.7, 1.5, .7, .3]) // gives 1.0666666666666667
pv.median([1, 1.2, 1.7, 1.5, .7, .3]) // gives 1.1.

Similarly to pv.min() you can use an optional accessor function.

pv.median(data, function(a) a.value);


pv.normalize() is a handy method that divides all elements in an array by a factor, so that the sum of all these elements is 1.

pv.normalize([1, 1.2, 1.7, 1.5, .7, .3])
// [0.15625, 0.18749999999999997, 0.265625, 0.234375, 0.10937499999999999, 0.04687499999999999]

pv.sum(pv.normalize([1, 1.2, 1.7, 1.5, .7, .3])) // result: 1.
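
For readers who want to see what these helpers amount to, here is a sketch of equivalent functions in standard JavaScript. They are illustrative re-implementations under my own names, not the protovis code, and the optional accessor f plays the same role as in the examples above.

```javascript
function identity(d) { return d; }

function sum(array, f) {
  f = f || identity;
  return array.reduce(function (total, d) { return total + f(d); }, 0);
}

function mean(array, f) {
  return sum(array, f) / array.length;
}

function median(array, f) {
  f = f || identity;
  // Work on a sorted copy of the extracted values.
  var sorted = array.map(function (d) { return f(d); })
                    .sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  // Odd length: the middle value. Even length: the mean of the two middle values.
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function normalize(array, f) {
  f = f || identity;
  var total = sum(array, f);
  return array.map(function (d) { return f(d) / total; });
}

var values = [1, 1.2, 1.7, 1.5, 0.7, 0.3];
var records = [{key: "a", value: 1}, {key: "c", value: 1.7}, {key: "d", value: 1.5}];

console.log(sum(values).toFixed(2));                                     // "6.40"
console.log(median(values).toFixed(2));                                  // "1.10"
console.log(mean(records, function (d) { return d.value; }).toFixed(2)); // "1.40"
console.log(sum(normalize(values)).toFixed(2));                          // "1.00"
```

The results are rounded with toFixed() because sums of decimal fractions are rarely exact in floating point.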

Combining arrays


pv.blend() simply turns an array of arrays into a simpler array where all elements follow each other.

pv.blend([["a", "b", "c"], [1,2,3]]) // gives  ["a", "b", "c", 1, 2, 3]

One benefit is that it is possible to run methods on this new array that wouldn’t work on an array of arrays.

For instance:

pv.max([[1,2],[3,4],[5,6]]) // result : NaN
// protovis cannot guess how to compare [1,2], [3,4] or [5,6] without any further instruction.
pv.max(pv.blend([[1,2],[3,4],[5,6]])) // result: 6.
pv.max([[1,2],[3,4],[5,6]], function(a) pv.max(a)) // also returns 6.


Given two arrays a and b, pv.cross(a,b) returns an array of all possible pairs of elements.

pv.cross(["a", "b", "c"], [1,2,3]) // returns [["a",1],["a",2],["a",3],["b",1],["b",2],["b",3],["c",1],["c",2],["c",3]]


pv.transpose() flips an array of arrays.

In our previous example –

pv.cross(["a", "b", "c"], [1,2,3])

The result was a 9×2 array: [["a",1],["a",2],["a",3],["b",1],["b",2],["b",3],["c",1],["c",2],["c",3]]

If we apply pv.transpose, we get a 2×9 array:

pv.transpose(pv.cross(["a", "b", "c"], [1,2,3])); // result: [["a","a","a","b","b","b","c","c","c"],[1,2,3,1,2,3,1,2,3]]
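
The two operations are easy to picture in standard JavaScript. The cross and transpose functions below are hypothetical helpers that mimic the behaviour described above; they are not the protovis implementation, and this transpose returns a new array rather than changing its argument.

```javascript
// All possible pairs [x, y], with x taken from a and y from b.
function cross(a, b) {
  var pairs = [];
  a.forEach(function (x) {
    b.forEach(function (y) { pairs.push([x, y]); });
  });
  return pairs;
}

// Rows become columns; assumes every row has the same length.
function transpose(rows) {
  return rows[0].map(function (unused, col) {
    return rows.map(function (row) { return row[col]; });
  });
}

var pairs = cross(["a", "b"], [1, 2]);
console.log(pairs);            // [ [ 'a', 1 ], [ 'a', 2 ], [ 'b', 1 ], [ 'b', 2 ] ]
console.log(transpose(pairs)); // [ [ 'a', 'a', 'b', 'b' ], [ 1, 2, 1, 2 ] ]
```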

Putting it all together

This short example will show how one can work from an unprepared dataset.
For the purpose of this example, I’m retrieving GDP per capita data of OECD countries. I have a table of 34 countries on 10 years from 2000 to 2009. The countries are in alphabetical order.
I would like to do a bar chart showing the top 12 countries ranked by their GDP per capita. How complex can this be? With the array methods, not so difficult.

var data=[
["Country",2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009],
["Australia",28045.6566279103, 29241.289462815, 30444.7839453338, 32079.5610505727, 33494.6654789274, 35091.7414600449, 37098.3968669994, 39002.0335375021, 39148.4759391917, 39660.1953738791],
["Austria",28769.5911274323, 28799.1968990511, 30231.0955219759, 31080.6122212092, 32598.0698959222, 33409.3453751649, 36268.9604225683, 37801.7320979073, 39848.9481928628, 38822.6960085478],
["Belgium",27624.4454637489, 28488.7963954409, 30013.7570016445, 30241.724380444, 31151.6315179167, 32140.9239149283, 34159.4723388042, 35597.2980660662, 36878.9710906881, 36308.4784216699],
["Canada",28484.9776553362, 29331.8223245112, 29911.3226371905, 31268.5521058006, 32845.6267954986, 35105.9892865177, 36853.5554314586, 38353.0812084648, 38883.0700787846, 37807.5259929247],
["Chile",9293.72322249492, 9712.95685376592, 9973.25946014534, 10475.6486664294, 11299.9329380664, 12194.1443507847, 13036.1285137662, 13896.6946083854, 14577.5715215544, 14346.4648487517],
["Czech Republic",14992.3582327884, 16173.6387263985, 16872.3146139754, 17991.9331910401, 19303.9208098529, 20365.7571142986, 22349.7033747705, 24578.7151794413, 25845.0252652635, 25530.2001372786],
["Denmark",28822.3343389384, 29437.5232757263, 30756.3323835463, 30427.5149228283, 32301.3394496912, 33195.8834383121, 36025.8628586246, 37730.736916861, 39494.1216134705, 37688.2516868117],
["Estonia",9861.80346448344, 10693.0362387373, 11966.6746413277, 13370.127724177, 14758.4181995435, 16530.7311083966, 19134.4130457855, 21262.1415030861, 21802.2185996096, 19880.2786070771],
["Finland",25651.0005450286, 26518.293212524, 27509.1023963486, 27592.2031438268, 29855.387224753, 30689.9753400459, 33095.030855128, 36149.32963148, 37795.237097856, 35236.9462175119],
["France",25272.4127936247, 26644.8404266553, 27776.4225007831, 27399.5459524877, 28273.6330332193, 29692.129719144, 31551.3458452695, 33300.5367942194, 34233.0770304046, 33697.9698933558],
["Germany",25949.0588445755, 26855.1296905696, 27587.1573355486, 28566.8930562768, 29900.6507502683, 31365.576121107, 33713.1689185469, 35623.4147543745, 37170.693412872, 36339.9009091421],
["Greece",18410.2161113813, 19928.6672542964, 21597.5997816753, 22701.7754699285, 24088.2202877303, 24571.6085727425, 26917.7358531616, 28059.5724529461, 29919.6385697499, 29121.8306842272],
["Hungary",12134.0140204384, 13576.4413471795, 14765.2333755651, 15424.3926636975, 16316.8885518624, 16937.9383138957, 18329.1579647528, 19187.2829138188, 20699.9056965458, 20279.5060192879],
["Iceland",28840.4899286393, 30443.907492292, 31083.7723760522, 30768.0910155965, 33697.6723843766, 35025.1338769129, 35807.880160537, 37178.9762849327, 39029.2396691797, 36789.0728215933],
["Ireland",28695.0931156375, 30524.3712319885, 33052.4517849575, 34525.4148515126, 36511.518217243, 38622.5520655621, 42267.9784043449, 45293.6482834027, 42643.6362934458, 39571.204030073],
["Israel",23496.1573531847, 23455.1048827219, 23534.8524310529, 22270.5507137525, 23650.3373360675, 23390.3809761211, 24960.0508863672, 26583.387317862, 27679.3777353666, 27661.1271269539],
["Italy",25594.3456589799, 27127.4298830414, 26803.97019843, 27137.5365317475, 27416.0961790796, 28144.0379218171, 30224.2469223694, 31897.7434207435, 33270.7464832409, 32407.5033618371],
["Japan",25607.7204586644, 26156.1615068187, 26804.9303541079, 27487.123875677, 29020.8955443678, 30311.5281317716, 31865.2733339475, 33577.1846704038, 33902.3807335349, 32476.7736535378],
["Korea",17197.0800334388, 18150.876787578, 19655.5961579558, 20180.932195274, 21630.1629926488, 22783.2288432845, 24286.1803272315, 26190.6122298674, 26876.6422899926, 27099.9481192024],
["Luxembourg",53645.7229595775, 53932.4594967082, 57559.2104243171, 60723.9861616145, 65021.6940738929, 68372.258619455, 78523.2988291034, 84577.2190776183, 89732.070876649, 84802.9716853677],
["Mexico",10046.1268819126, 10135.7265293228, 10397.8915415223, 10883.8643666506, 11535.0108530091, 12460.5385178945, 13672.6034191998, 14581.5178679935, 15290.896575069, 14336.6153896301],
["Netherlands",29405.5490140191, 30788.2508953647, 31943.4974056501, 31702.5798119046, 33208.8592768987, 35110.6577619843, 38063.6877336725, 40744.3819227526, 42887.4361825772, 40813.047575264],
["New Zealand",21113.9167562199, 22128.8483527737, 23115.0676326844, 23826.8007737809, 24775.7827578032, 25460.2942993256, 27277.7654223562, 28701.3328788203, 29231.151567007, 29097.307061683],
["Norway",36125.7036078917, 37091.7150478501, 37051.9473705873, 38298.8631304819, 42257.5943785044, 47318.7844737929, 53287.9750020125, 55042.2093246253, 60634.9819502298, 55726.7456848212],
["Poland",10567.0016558591, 10950.4739587211, 11562.6245018264, 11985.0042955849, 13014.5996003514, 13785.7656450352, 15067.4402978586, 16762.2045951617, 18062.1834289945, 18928.8205162032],
["Portugal",17748.6930918065, 18464.7472033934, 19088.1764760703, 19392.2862101711, 19796.1891048601, 21294.1660710906, 22869.9272824917, 24122.9886840593, 24962.2959136026, 24980.061318413],
["Slovak Republic",10980.1169029929, 12070.70145551, 12965.6228450353, 13597.9556337584, 14659.0096334623, 16174.4605836956, 18397.5822518716, 20916.5438777989, 23241.0972994031, 22870.8998469728],
["Slovenia",17468.5101911515, 18342.8685005009, 19702.0589885459, 20448.2040741134, 22200.7770986247, 23493.9370882038, 25432.3240768864, 27228.3376725332, 29240.5471362484, 27535.6540786883],
["Spain",21320.0690041202, 22591.3855327033, 24066.5003000439, 24748.2819467008, 25958.1017090553, 27376.7644983649, 30347.9357063518, 32251.8469529105, 33173.3334547015, 32254.0895139806],
["Sweden",27948.4749835611, 28231.3958224236, 29277.7736264795, 30418.0441004337, 32505.646687298, 32701.4327422078, 35680.1787996674, 38339.6698693908, 39321.3014501815, 36996.1394001578],
["Switzerland",31618.0958082618, 32103.4812665805, 33390.9169108411, 33266.3042144918, 34536.8476164193, 35478.0395058126, 39116.3945212067, 42755.5569054368, 45516.5601956426, 44830.3719226918],
["Turkey",9169.71780977686, 8613.21159842998, 8666.90348260238, 8790.01048446305, 10166.0963401732, 11391.3768057053, 12886.6195973915, 13897.3820310325, 14962.4944915831, 14242.7276329837],
["United Kingdom",26071.3066882302, 27578.2857038985, 28887.5850874466, 29848.7319363944, 31790.9933025905, 32724.4057819337, 34970.5193042987, 35719.2060247462, 36817.493156774, 35158.8005888247],
["United States",35050.1738557741, 35866.2624634202, 36754.5543204007, 38127.524970345, 40246.0630591955, 42466.1326203714, 44594.9199470326, 46337.2237397566, 46901.0697730874, 45673.7445647402]
];

var year=10;
var rows = data.slice(1,data.length);

var x = pv.Scale.ordinal(pv.range(12)).splitBanded(0, 240, 4/5);
var y = pv.Scale.linear(0, 80000).range(0, 170);

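// Note: the panel size, bar width and bottom offset were missing from this
// listing; the values used here are assumptions, adjust them to taste.
var vis = new pv.Panel().width(250).height(170);
vis.add(pv.Bar)
		.data(rows.sort(function(a,b) b[year]-a[year]).slice(0,12))
		.bottom(0)
		.width(x.range().band)
		.height(function(d) y(d[year]))
		.left(function() 5+x(this.index))
		.anchor("bottom").add(pv.Label).textAngle(Math.PI/2).text(function(d) d[0]).textBaseline("middle").textAlign("right");
vis.render();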

Here, in the data variable, I’ve put my 2-dimensional table as I got it. The first row contains headers, which I won’t need. So, right after the table, I create rows, which just removes that first line with a slice function: data.slice(1,data.length) effectively keeps all the lines but the first.
In the next few lines I create two scales for placing my bars, a standard vis panel and a bar chart. Now, what kind of data will I pass to the bar chart?
I want to sort the rows by the value of the latest year (which happens to be the 11th column, so index 10). This is what the rows.sort(function(a,b) b[year]-a[year]) part of the statement does. I’ve simply assigned 10 to the variable year, so if I want to display other years (with an HTML form for instance) it wouldn’t be difficult to modify.
And, since I only want the top 12 values, I just write .slice(0,12) after that.

At the end of the chain, I just add a label. pv.Label inherits the data of its parent. The data item is the whole row of my 2-dimensional table, so if I write function(d) d[0], I am referring to the left-most item, which is the country name.

So, with the use of simple array functions, I can easily rework an unprepared dataset in protovis, rather than having to tailor my dataset with manual (and error-prone) manipulations in external programs. Here is the result:
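
The data-preparation part of this example (drop the header, sort by one column, keep the top rows) can be isolated and tried without any drawing. Here is a minimal sketch in standard JavaScript; topRows is a hypothetical helper name, and the table keeps only the 2008 and 2009 columns for five of the countries above.

```javascript
var table = [
  ["Country", 2008, 2009],
  ["Chile", 14577.5715215544, 14346.4648487517],
  ["Luxembourg", 89732.070876649, 84802.9716853677],
  ["Norway", 60634.9819502298, 55726.7456848212],
  ["Switzerland", 45516.5601956426, 44830.3719226918],
  ["United States", 46901.0697730874, 45673.7445647402]
];

// Drop the header row, sort by the chosen column in descending order,
// and keep the first n rows. slice(1) returns a new array, so the
// original table is not reordered by sort().
function topRows(table, column, n) {
  return table
    .slice(1)
    .sort(function (a, b) { return b[column] - a[column]; })
    .slice(0, n);
}

var top3 = topRows(table, 2, 3).map(function (row) { return row[0]; });
console.log(top3); // [ 'Luxembourg', 'Norway', 'United States' ]
```

Passing the column index as a parameter is what makes it easy to switch years later, for instance from an HTML form.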

Next: reshaping complex arrays


Working with data in protovis – part 2 of 5

Previous post: simple arrays

Multi-dimensional arrays, associative arrays and protovis

Even for a simple chart, chances are that you’ll have more than a single series of numbers to represent. For instance, you could have labels to describe what they mean, several series, and so on and so forth.
So, let’s say we want to add these labels to our original examples, so we know what those bars represent.

var categories = ["a","b","c","d","e","f"];
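var vis = new pv.Panel().width(150).height(150);
vis.add(pv.Bar)
  .data([1, 1.2, 1.7, 1.5, .7, .3])
  .width(20).bottom(0)
  .height(function(d) d * 80)
  .left(function() this.index * 25)
  .anchor("bottom").add(pv.Label) // the anchor position is an assumption
    .text(function() categories[this.index]);
vis.render();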


While this did the trick, nothing guarantees that the data proper and the category names will stay coordinated. If a data point is deleted or moved and the change is not replicated in the categories, they will no longer match. A more integrated way to proceed is to group category and data information, like this:

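var data = [
  {key:"a", value:1},
  {key:"b", value:1.2},
  {key:"c", value:1.7}, 
  {key:"d", value:1.5},
  {key:"e", value:.7},
  {key:"f", value:.3}
];
var vis = new pv.Panel().width(150).height(150);
vis.add(pv.Bar)
  .data(data)
  .width(20).bottom(0)
  .height(function(d) d.value * 80)
  .left(function() this.index * 25)
  .anchor("bottom").add(pv.Label) // the anchor position is an assumption
    .text(function(d) d.key);
vis.render();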


This time, we group the values and the category names in a single variable, an array of associative arrays.
When drawing the bar chart, protovis will go through this array and retrieve an associative array for each bar.
We have to change the way the height function is written. The data element being accessed is no longer of the form 1 or 1.7, but {key:"a", value:1} or {key:"c", value:1.7}. So to get the number part of it, we must write d.value.

Likewise, instead of accessing an array of categories for the text part, we can use the current data element via an accessor function, and write d.key.

Hierarchy and inheritance

So we’ve seen that arrays, or associative arrays, can have several levels and can be nested one into another.
Interestingly, protovis elements, like panels, charts, mark etc. also work in a hierarchy. For instance, when you start working with protovis you create a panel object. Then, you add other objects to that panel, like a bar chart (our example), or another panel. You can add other objects to your new objects, or attach them to your first panel.
This diagram shows the hierarchy between elements in the previous example.

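var categories = ["a","b","c","d","e","f"];
var vis = new pv.Panel().width(150).height(150);

var bar = vis.add(pv.Bar)
  .data([1, 1.2, 1.7, 1.5, .7, .3])
  .width(20).bottom(0)
  .height(function(d) d * 80)
  .left(function() this.index * 25);
var label = bar.anchor("bottom").add(pv.Label) // the anchor position is an assumption
  .text(function() categories[this.index]);
vis.render();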


The bar object is considered to be the child of vis, which is its parent.

You may know that in protovis, child objects inherit properties of their parents.

For instance, if width wasn’t specified for the bar object, it would have the width of its parent, 150. Each mark would cover the whole screen.

For data, when a new object is added, data is either specified at that level, or obtained from the parent element of this object.

Let’s take our example and tweak it a bit.

var vis = new pv.Panel().width(150).height(150);
var bar = vis.add(pv.Bar)
  .data([1, 1.2, 1.7, 1.5, .7, .3]).width(20).bottom(0)
  .height(function(d) d * 80).left(function() this.index * 25);
bar.anchor("top").add(pv.Label); // the anchor position is an assumption
vis.render();

Here, I didn’t specify data or a text value for the labels I added. They just took the values of their parent element, the marks of the pv.Bar object.
Here’s another variation:

var vis = new pv.Panel().width(150).height(150);
vis.add(pv.Panel)
  .data([1, 1.2, 1.7, 1.5, .7, .3]).left(function() this.index * 25)
  .add(pv.Bar).width(20).bottom(0)
  .height(function(d) d * 80);
vis.render();

Here, I’m adding panels, then a bar in each panel.
From the root panel, I’m adding a group of panels with this data: [1, 1.2, 1.7, 1.5, .7, .3].
Since there are 6 elements here, I’m adding 6 panels.
Here, the left method applies to each of the panels. The first one is to the left, the next one is 25 pixels further, etc.
I’m then adding a bar object to each panel. Is that one group of bars? Technically yes, but each group has only one element! Each pv.Bar implicitly gets the data element of its parent, so the first bar gets [1], the next one gets [1.2], etc. The height of each bar is determined by multiplying the value of that element by 80.
Note that since the fillStyle properties are not defined for the bars, they get the ones which are attributed by default, which explains the color changes.

Further refinement: accessing the data of its parent!

var vis = new pv.Panel().width(150).height(150);
vis.add(pv.Panel)
  .data([1, 1.2, 1.7, 1.5, .7, .3]).left(function() this.index * 25)
  .add(pv.Bar).width(20).bottom(0)
  .height(function(a,b) b * 80);
vis.render();

Well, the output is exactly the same, but how I obtained the data is different. Instead of getting the data using the standard accessor function, I passed two arguments: function(a, b).
With two arguments, the first corresponds to the current data element of this object, and the second to that of its parent.

In this example, they happen to be the same, but this is how you can access the data of the parent objects.

Putting it all together

Let’s see how we can use protovis and the properties of hierarchy! This example is less trivial than the ones we’ve seen so far but with what we’ve seen it is quite accessible.
The challenge: re-create square pie charts.
How it’s done:

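var data=[36,78,63,24],  // arbitrary numbers
cellSize = 16,
cellPadding = 1,
// Parts of this listing were missing. The squarePadding value, the palette,
// the panel sizes and the label style below are assumptions.
squarePadding = 20,
colors = pv.Colors.category10().range();

var vis=new pv.Panel()
    .width(4*(cellSize*10+squarePadding))
    .height(cellSize*10);

var square = vis.add(pv.Panel)
    .data(data)
    .width(cellSize*10)
    .left(function() this.index*(cellSize*10+squarePadding));

var row = square.add(pv.Panel)
    .data([0,1,2,3,4,5,6,7,8,9])
    .height(cellSize)
    .bottom(function(d) d*cellSize);

var cell = row.add(pv.Panel)
    .data([0,1,2,3,4,5,6,7,8,9])
    .width(cellSize-2*cellPadding).height(cellSize-2*cellPadding)
    .left(function(d) d*cellSize+cellPadding)
    .fillStyle(function(a,b,c) (b*10+a)<c?colors[this.parent.parent.index].color:"lightGrey");

square.anchor("center").add(pv.Label)
    .text(function(d) d)
    .textStyle("rgba(0,0,0,.2)")
    .font("100px sans-serif");

vis.render();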


First, we initialize the data (4 arbitrary numbers from 1 to 100), and various parameters which will help size the square pie: size of the cells, space between them, space between the square pie charts. We also initialize a color palette.
Then, we are going to create 4 panels or groups of panels, each a child of the previous one.
First comes the vis panel, which groups everything.
Then come the square panels, which correspond to each square pie. It is to this panel that our data is assigned.
Then come the row panels, and, finally, the cell panels.
The numbers which we want to represent are assigned to the square panel. So, what data are we passing to the row and the cell panels? The only thing we want is to have 10 rows and 10 cells per row. So, we can use an array with 10 items. We are going to use [0,1,2,3,4,5,6,7,8,9] so the data value of the row and that of the cell will correspond to the coordinate of the row and the cell, respectively. In other words, the 5th row will be assigned the data value of 4, and the 7th cell in that row will get the data value of 6. We could retrieve the same numbers using “this.index” but this can lead to obfuscated formulas.

Note that in the next part of the tutorial, we’ll see that in Protovis, there is a more elegant way to write [0,1,2,3,4,5,6,7,8,9] or similar series of numbers. But, we’ll leave this more explicit form for now.

Back to our row panel. We position it with bottom(function(d) d*cellSize). Here, again, d represents the rank of the row, so the 1st row will get 0, and its bottom value will be 0, the next row will get 1, and its bottom value will be 1*cellSize or 16, etc.

Likewise, in the cell panel, the cells are positioned with left(function(d) d*cellSize+cellPadding). This is the same principle. (here, cellPadding is just used to fine-tune the position of the cell).

It is in the fillStyle line of the cell panel that we are really going to use hierarchy.

.fillStyle(function(a,b,c) (b*10+a)<c?colors[this.parent.parent.index].color:"lightGrey")

Here, a represents the data value of the cell, in other words the column number.
b is the data value of the cell’s parent, the row. This is the row number.
And c is the data value of the parent of the row, the square. This is the number that we are trying to represent.

So, what we are going to determine is whether b*10+a<c. If that’s the case, we color the cell; otherwise, we leave it in pale grey. To get the color we need, we go back to the palette that we defined at the beginning, and take the color corresponding to the square number (0 to 3 from left to right).
The square number can be obtained by this.parent.parent.index.
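
The fill rule itself can be checked independently of protovis. Here is a minimal sketch in standard JavaScript; squarePieGrid is a hypothetical helper that applies the same b*10+a<c test to every cell and returns a 10×10 grid of booleans (row 0 is the bottom row, as in the chart).

```javascript
// For a value c between 0 and 100, a cell at (row b, column a) is filled
// when b * 10 + a < c, which is the test used in the fillStyle function.
function squarePieGrid(value) {
  var indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
  return indices.map(function (row) {
    return indices.map(function (col) {
      return row * 10 + col < value;
    });
  });
}

var grid = squarePieGrid(36);

// Count the filled cells: there should be exactly as many as the value.
var filled = 0;
grid.forEach(function (row) {
  row.forEach(function (isFilled) { if (isFilled) filled += 1; });
});

console.log(filled);     // 36
console.log(grid[3][5]); // true:  3 * 10 + 5 = 35 is below 36
console.log(grid[3][6]); // false: 3 * 10 + 6 = 36 is not below 36
```

So the first three rows are entirely filled, the fourth has 6 filled cells, and the rest stay grey.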

Finally, we add the numbers in large transparent gray digits on top of the squares.

Here is the result:

Next: Javascript and protovis array functions


There are virtues to an illegible chart

It all started with an extreme pie chart

A few weeks back, a chart on aid made the rounds of the internet:

All US ODA by recipient, 2004-2008, OECD data, taken from

What this chart shows is that US aid is concentrated in a few countries. The article explains that this is a result of the 3D doctrine, which ties development with diplomacy and defense. This is why the US gives so much to strategic countries like Afghanistan, Iraq, or Sudan, but relatively little to India, which is highlighted in the chart and has “a huge chunk of the world’s poor”.

When I saw that chart I was planning to create a chart or a data visualization on the same subject for my work. The original chart was being heavily criticized for its form, because half of it is not legible. Chart purists don’t like pie charts for that very reason – they are difficult to read, especially if you add more items. But I found the chart interesting. It states in a very striking way that more than a hundred countries in the world get next to nothing from the USA.

An apology for extreme charts

There are virtues to an illegible chart. In fact, I don’t believe that a chart should give equal prominence to each and every one of its data points. In most cases, it’s there to support a story, so all it should do is carry a message. Tufte popularized the notion of data-ink ratio, which states that a chart designer should use the largest share of ink to represent data, not everything else. I feel this is taken too literally by many.

There is a tradition of extreme charts which purposely break presentation rules because of the very nature of the subject they are plotting. A famous example is Al Gore on his lift – if CO2 emissions hadn’t increased so much, he wouldn’t need that lift to show his chart.


Al Gore on his lift, the most memorable image of An Inconvenient Truth

Another one – from the NY Times, one of the charts that Matthew Ericson showed in his Infovis 2007 keynote speech:

In perspective: America's conflicts. NY Times

Click to see the full image - it is big. I really like this chart.

Again, if the number of US soldiers killed per month had not been so high in WWII, the 2nd group of bars wouldn’t overwrite the text above and sky-rocket to the top of the page. The logical thing to do would have been to scale the chart so that the maximum values would fit in a well-delimited space, and maybe use a logarithmic scale so that the values for other wars would remain legible. That’s how we would have done it if we had to plot that kind of series in an OECD book. The fact that the NYT designers chose, on the contrary, to let the data rise all the way to the top of the page expresses in a very powerful way the extreme nature of the WWII casualties.

“A conventional chart couldn’t hold all that horror”, the chart seems to say. Likewise, if CO2 emissions had grown more steadily over the past couple of centuries, Al Gore wouldn’t have needed a lift. By the same token, if aid values to about 100 countries were more than negligible, they could be seen on that chart. So granted, there could be more academic ways to show that, like a giant bar chart with values too small to see for all but a handful of states. But all in all I think the original pie chart does a good job in communicating that in a nutshell, ad absurdum if you will.

My take on the chart

I wanted to work on a specific subset of aid data, that which goes to fragile states, which are, simply put, the 43 countries in the greatest need of aid. Now official aid from developed countries, like US aid, is very concentrated, meaning that only 10 of these countries got more than $1b in 2008. Only 10 countries got more than $100 per capita in that year.

Another interesting aspect of the data is that, for many of these countries, aid comes mostly from one or two donors, so they are vulnerable to a policy change in those donor countries. That’s what I wanted to show in the representation.




Making data meaningful – Style guide on the presentation of statistics

Making Data Meaningful part 2
Introducing Making Data Meaningful Part 2 – Style guide on the presentation of statistics – which, as its name cleverly suggests, is a compilation of advice on presenting graphical information.

It’s a follow-up to Making Data Meaningful Part 1, which focused on writing about data, as opposed to visualizing it.

The book is a cooperation between representatives of national statistical offices and intergovernmental organizations, all public statisticians, if you will. I hope it will help others to communicate their data better. Personally, I have written the part about charts and contributed to some other chapters. But if I could sum up my advice in one sentence, it would be: go buy Stephen Few’s books. Start with Show me the numbers.

The list of people who collaborated on the book includes:


More on Tableau Public

Yesterday’s post on Tableau Public generated a surge of traffic so I thought I should add more examples and practical information for people interested in the software.


Here’s a quick one on health, based on OECD Health at a Glance:


Just select two indicators, and you see how one influences the other. Or rather, how the two are correlated, because correlation doesn’t imply causation!

Here are links to more examples made with Tableau Public.

Another Paris-based intergovernmental organisation is using Tableau: UNESCO.

These 2 have been done by PAHO to describe the situation in Haiti (the 2nd is really powered by Tableau Server, but it’s close enough)


There are further examples on the Tableau blog.

Now more about Tableau Public and the Beta.

Tableau Public doesn’t exactly allow you to do everything that Tableau does from the web. To prepare the views which are going to be published on the web, you need to use software that runs on your computer. It lets you do whatever you can do with the regular Tableau Desktop, with a couple of limitations: you have to stick to basic source file types (Access, Excel, and text files, no exotic database) and you are limited to 100,000 records of data. One other difference with the regular Tableau Desktop is that you can’t save your work locally: you have to save it on the web, in your private space on Tableau servers. However, Tableau Public has the same analytical and visual features as Tableau Desktop.

When your work is published, users don’t have access to all the tools you had when creating the view: they can’t move dimensions around, create exotic filters or calculations. They really see the chart as you intended it to be seen. There are a certain number of interactions built-in, however: users can select, highlight, sort and filter. If you are publishing a dashboard, the different tables and charts of the dashboard can be linked, meaning that an action (such as highlighting one dimension) in one place will be replicated elsewhere, or not. The underlying data can also be downloaded. So there is a great deal of interactivity, but not enough to twist your display beyond recognition. That being said, other Tableau Public users can download your workbook and manipulate it with the client software.

About the Beta: currently, Tableau Public is in closed beta. It will be in open beta in February, as far as I know. To get a spot in the closed beta, you need to write to the people at Tableau.



Health statistics

In the last days of 2009, this chart was published by the National Geographic blog:
the cost of care

The chart has since been debated and criticized by, among others, Jon Peltier, Andrew Gelman, and Evan Falchuk, who all made valid points. For instance, to show correlation and outliers, a scatterplot does a much better job. That being said, it’s difficult to see the country names with a scatterplot. On the substance, the number of doctor visits is not the most relevant variable to bring into this picture, mostly because this number directly depends on the compensation mode of these doctors, not on their efficiency. The notion of “universal coverage” is also quite arbitrary. France, for instance, which has had what could be called universal coverage since 1945, got an even more “universal” one in 2000. And still, some people can’t receive the healthcare they need.

The chart is based on OECD data, from a recently released book: OECD Health at a Glance. For the release of the book, I had worked on 2 presentations, which remained unpublished. Since they were not formally published by OECD, the standard disclaimer applies – they do not commit the organization and do not necessarily represent its point of view or that of its members.

Anyway, for anyone interested in health statistics in general and in USA healthcare specifically, here they are in their slideshare glory:


Testing Microsoft Office 2010

If you are using computers for work, chances are that you are spending a good portion of your day with Microsoft products such as the Office suite. Some hate it, some love it, but to hundreds of millions it’s part of our daily lives and its design choices affect how we think and work in a much more profound way than we are aware of. So, the release of a new version of Office is always a significant event.

I’ve just installed Office 2010 and here are my first impressions.

The UI is rationalized.

excel 2010

The interface will be familiar to Office 2007 users – it still uses the ribbon. Only a few buttons have been added to the applications I’ve tested, and the others have fortunately not moved since the previous version. However, the ribbon’s colours have been muted to a conservative white to grey gradient, which is much easier on the eyes. The added benefit is that highlighted sections of the ribbon stand out much more effectively.

excel highlight

Highlighting a section works much better against a sober gray than against a vivid blue.

The one button that changed was the top-left Office button. Frankly, what it was for was obvious to no-one in Office 2007. Due to its appearance, it wasn’t really clear that it was clickable, and the commands it gave access to were a mixed bunch – file control, program options, printing, document properties… which, before, were not in the same top-category.

This new area is called "Office backstage" and is a welcome change to the awkward "file" menu or office button from previous versions.

In Office 2010, the Office button is still there, but this time, it looks like a button and is much more inviting. This time, it presents the user with the various commands on a separate screen. That way, commands are well-categorized, and there is ample space for UI designers to explain those commands which are not clear. This had not been possible when all those commands were forced to fit in one tiny menu.

Another thing that jumped out at me when I started manipulating the programs was the improvement in the copy/paste interface. It’s fair to say that pasting has always been a very time-consuming task. It had never been easy, for instance, to paste values only or to keep source formatting, without having to open menus and choose options which require time and effort. Besides, some pasting options descriptions are cryptic and a bit daunting, so novice users aren’t encouraged to use them for fear of what might happen.

I've been using Excel for about 15 years so I know my way around. But improvement in the paste interface directly translates into productivity gains.

Now the various pasting options are promoted within the contextual menu – they are big icons, and it is possible to preview how pasted material would look before pasting. The best part is that these commands are now accessible via native keyboard shortcuts, so we no longer need a string of 4 mouse clicks, or to key in alt+E, V, S, enter or alt+H, V, S, V, enter in sequence. After a normal paste (ctrl+V) you can hold control and choose a one-key option, such as V for values, T for transposing, etc. Much better.

Changes in the Excel chart engine

There are 3 ways in Excel to represent numbers graphically: charts proper, pivot charts and sparklines.

Charts and pivot charts haven’t seen much improvement since the previous version of Excel. The formatting options move along in the direction initiated by Excel 2007: in addition to the controversial 3-D format set of options, users now have advanced “shadow” and “glow and soft edges” submenus to spice up their charts. The interface for designing gradient fills has been upgraded. The underlying functionality remains unchanged but it is now easier to control. However, the pattern fill option returns, which is great news for people who print their graphs in B&W.

Even more complex formatting options mean a greater chance to use them poorly

Sparklines are the real innovation of Excel 2010. Sparklines are a minimalist genre of chart that has been designed to fit in the regular flow of the text – they don’t require more space to be legible and efficient. While sparklines do not allow a user to look up the value of a specific data point, they are very efficient for communicating a trend. As such, they are increasingly used in dashboards and reports. There have been 3rd-party solutions to implement them in Excel but this native implementation is robust and well done. This will put sparklines on the radar for the great number of people who didn’t use them because they were not aware of their existence.

Sparklines give immediate insight on the trends in this data table. A dot marks when the maximum value was reached. That makes it easier to compare peaks at a glance.

Changes in other applications

Word has advanced options for OpenType fonts. For instance, if your font has several character sets, you can now access them from Word. This is especially good for distressed fonts or the excessively ornate ones. In addition to kerning, it is now possible to control ligatures (i.e. to specify how ff, fl or fi appear on screen, as one unique glyph or as two separate letters). Another new feature of Word is an advanced spell checker that is able to warn you of possible word choice errors, when using homonyms for instance.

On my setup, these 3 options didn’t really work, but it’s a beta and I understand the intent.

The advanced spell checker didn't catch those words which were quite obviously used out of context.

In French, it picked sides in a famous spelling controversy. Many people believe that Perrault originally wrote that Cinderella wore fur slippers (souliers de vair). Microsoft sides with Disney on that one and goes with glass slippers (souliers de verre).

Powerpoint features 3 high-level changes. The first is the possibility to structure a long presentation using sections, which helps somewhat. However, as far as I could see, sections are only a grouping feature. There are few operations that can be performed on the section as a whole (as opposed to on the whole presentation, or on each slide separately). For some tasks, you might think it is the case (as selecting the section implicitly selects its slides) but you’ll see that the operation only affected the current slide. Hmm. It can be useful to manage a presentation after it’s done, but IMO this will reduce the amount of time people spend designing their presentation away from powerpoint, which is ultimately a bad thing.

Powerpoint sections make it easier to manage very long documents.

Powerpoint 2010 also features 3D transitions not unlike those of Keynote ’08. It is also possible to include movie clips in presentations. Wasn’t this already the case? Previously, you’d have to embed video files in your presentations. Now it is possible to embed online videos as well. I’m not quite sure about these two options really, the first one for ideological reasons, the 2nd because I wouldn’t recommend that any speaker rely too heavily on an internet connection and a video hosting service during a live presentation.

The insert screenshot shows a gallery from all my open windows to choose from. The screen clipping tool allows one to insert only a section of the window. Neat!

There’s another thing available everywhere in Office but which is possibly most useful in powerpoint, that is, insert screenshot. By clicking on this button, you have a list of thumbnails of all your open windows to choose from. This really reduces the hassle of using a screen capture tool, or worse, of manually doing a screen capture, pasting it in an image editing program, cropping the image, saving it to an acceptable format and copy/pasting it again where you need it. It is possible to only copy part of these screens, too. It’s quite well done.

Overall impressions

I’m impressed with the thinking that went into the interface. The ribbon was already a great demonstration of out-of-the-box thinking and looked great on paper. I wasn’t thrilled to use it as the commands I had been using for some 15 years were not always easily found, but it seems that first-time users of Office 2007 outnumber those who’ve used previous versions. The execution of the ribbon in Office 2010 is improved, and the team allowed themselves to go beyond some arbitrary constraints they had imposed on themselves, as with the pasting options or the office button. Well done.

I’m happy that sparklines have been added to Excel. In the next few years, we’ll find even better usage for them. However, I’m disappointed that the charting options remain essentially unchanged. Take the pie chart for instance. Everyone is aware of its limitations. There are many alternatives which would be easy to implement in Excel. Also, I’m disappointed that the charting mechanism remains the same: present the user with a long list of chart types, without supporting their reasoning in the choice of one over the other. There should be a chart wizard that would ask users what they want to show with the data and suggest the best choice (and not many possible choices) of chart.

I am not sure about the improved spell checker. Improved means increased dependency on the tool, which is the reason why typos haven’t been eradicated despite the technology.

I am very skeptical about all the advances of the Office product into design. Office users are not designers. Or rather, to be a designer requires a specific form of critical reasoning, not a new tool. More sophisticated graphical options allow novice users to achieve complex results without going through that phase of reasoning, which ultimately won’t help them.


Review of Tableau 5.0

Over the last 2 weeks, I finally found time to give Tableau 5.0 a try. Tableau enjoys a stellar reputation among the data visualization community. About a year ago, I saw a live demo of Tableau by CEO and salesman extraordinaire Christian Chabot. Like most of the audience, I was very impressed, not so much by the capabilities of the software but by the ease and speed with which insightful analysis seemed to appear out of bland data. But what does it feel like from the user’s perspective?

Chartz: ur doing it wrong

Everyone who wrote about charts would pretty much agree that the very first step in making one is to decide what to show. The form of the display is a consequence of this choice.

Most software gets this wrong. It will ask you what you want your display to look like, then ask you for your data. Take this screenshot from Excel:


When you want to insert a chart, you must first choose what kind of chart (bar, line, column, pie, area, scatter, other charts) and one of its sub-types. You are not asked, what data does this apply to, and what that data really is. You are not asked, what you are trying to show through your chart – this is something you have to manage outside of the software. You just choose a chart.

I’m picking Excel because with 200m users, everyone will know what I’m talking about, but virtually all software packages ask the user to choose a rather rigid chart type as a prerequisite to seeing anything, despite overwhelming theoretical evidence that this approach is flawed. In Excel, like in many other packages, there is a world of difference between a bar chart and a column chart. They are not of the same nature.

A reversed perspective

Fortunately, Tableau does it the other way round. When you first connect with your data in Tableau, it distinguishes two types of variables you can play with: dimensions and measures. And measures can be continuous or discrete.

(This is from an example file).

Then, all you have to do is to drag your dimensions and your measures to the center space to see stuff happening. Let’s drag “close” to the rows…

We already see something, which is not terribly useful but still. Now if we drag Date into the columns…


Instant line chart! The software found out that this is the type of representation that made the most sense in this context. You’re trying to plot continuous variables over time, so it’s pretty much a textbook answer. Let’s suppose we want another display: we can click on the aptly named “show me!” button, and:


These are all the possible representations we have. Some are greyed out, because they don’t make sense in this context. For instance, you need to have dimensions with geographic attributes to plot things on a map (bottom left). But if you mouse over one of those greyed out icons, you’ll be told why you can’t use them. So we could choose anything: a table, a bar chart, etc.

A simple thing to do would be to switch rows and columns. What if we wanted to see date vertically and the close horizontally? Just drag and drop, and:


Crafting displays

Gone are the frontiers between artificial “chart types”. We’re no longer forcing data into preset representations; rather, we assign variables (or their automatic aggregation, more on that shortly) to possible attributes of the graph. Rows and columns are two, which shouldn’t be taken too literally – in most displays, those would be better described as abscissa and ordinate – but all the areas in light grey (called “shelves”) can welcome variables: pages, filters, path, text, color, size, level of detail, etc.


Here’s an example with a more complex dataset. Here, we’re looking at sales figures. We’re plotting profit against sales. The size of the marks corresponds to the volume of the order, and the colour to their category. Results are presented year by year. It is possible to loop through the years. So this display replicates the specs of the popular Trendalyzer / Motion chart tool, only simpler to set up.

Note that as I drag variables to shelves, Tableau often uses an aggregation that it thinks makes more sense. For instance, as I dragged Order Date to the page shelf, Tableau picked the year part of the date. I could ask the program to use every value of the date: the display would be almost empty, but there would be a screen for each day. Likewise, when I dragged Order Quantity to the Size shelf, Tableau chose to use the sum of Order Quantity instead. Not that it makes much of a difference here, as each bubble represents only one order. But the idea is that Tableau will automatically aggregate data in a way that makes sense to display, and that this can always be overridden.
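To make the idea of automatic aggregation more concrete, here is a minimal sketch in plain javascript of what "take the year part of the date, then sum the measure" amounts to. This is only an illustration and says nothing about how Tableau works internally; the records, the field names (orderDate, orderQuantity) and the sumByYear function are all made up for the example.

```javascript
// Hypothetical order records: field names and values are invented for illustration.
var orders = [
  { orderDate: "2007-03-14", orderQuantity: 12 },
  { orderDate: "2007-11-02", orderQuantity: 30 },
  { orderDate: "2008-01-20", orderQuantity: 7 },
  { orderDate: "2008-06-05", orderQuantity: 21 }
];

// Group records by the year part of a date field and sum a measure per group.
function sumByYear(records, dateField, measureField) {
  var totals = {};
  records.forEach(function (r) {
    var year = r[dateField].slice(0, 4); // "2007-03-14" -> "2007"
    totals[year] = (totals[year] || 0) + r[measureField];
  });
  return totals;
}

var quantityByYear = sumByYear(orders, "orderDate", "orderQuantity");
// quantityByYear is { "2007": 42, "2008": 28 }
```

Overriding the aggregation, in this sketch, would simply mean grouping on the full date string instead of its first four characters, which gives one group per day.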

But if I keep the data for all the years in the display, I can quickly see the transactions where profit was negative.

And I can investigate this set of values further.

So that’s the whole idea. Because you can assign any variable to any attribute of the visualization, in the Tableau example gallery you can see some very unusual examples of displays.

Using my own data

When I saw the demos, I was a little skeptical of the data being used. I mean, things were going so smoothly, evidence seemed to be jumping out at the analyst, begging to be noticed. Tableau’s not bad at connecting with data of all forms and shapes, so I gave it a whirl with my own data.

Like a lot of other official data providers, OECD’s format of choice for exporting data is SDMX, a flavor of XML. Unfortunately, Tableau can’t read that. So the next easiest thing for me was Excel.

I’m not going to get too much into details, but to come up with a worksheet that Tableau liked with more than a few tidbits of data required some tweaking and some guessing. The best way seems to be: a column for each variable, dimensions and dates included, and don’t include missing data (which we usually represent by “..” or by another similar symbol).

Some variables weren’t automatically recognized for what they were: some were detected as dimensions when they were measures, and date data wasn’t processed that well (I found that using 01/01/2009 instead of 2009 or 1/2009 worked much better). But again, that was nothing that a little bit of tweaking couldn’t overcome.

On a few occasions, I scratched my head quite hard as I was trying to understand why I could get Y-o-Y growth rates for some variables but not for others, or how to make custom calculated fields. Note that there are plenty of online training videos on the website. I found myself climbing the learning curve very fast (and have heard similar statements from recent users who quickly felt empowered) but am aware that practice is needed to become a Tableau Jedi. What I found reassuring is that without prior knowledge of the product, but with exposure to data design best practices, almost everything in Tableau seems logical and simple.
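For what it’s worth, the kind of tweaking described above can be sketched in a few lines of javascript. This is a rough illustration under assumptions of my own, not an official recipe: it supposes the source table has one row per country and one column per year, with ".." marking missing values, and it turns that into one row per observation with a full date. The data, the field names and the toLongFormat function are all hypothetical.

```javascript
// Hypothetical wide table: one row per country, one column per year.
// ".." marks a missing value. All names and numbers are made up.
var wide = [
  { country: "A", "2007": "10.5", "2008": "..", "2009": "11.2" },
  { country: "B", "2007": "..", "2008": "20.1", "2009": "20.4" }
];

// Turn each (row, year column) pair into its own record,
// skipping missing values and writing the year as a full date.
function toLongFormat(rows, idField, missingMarker) {
  var result = [];
  rows.forEach(function (row) {
    Object.keys(row).forEach(function (key) {
      if (key === idField) return;           // not a year column
      if (row[key] === missingMarker) return; // leave missing data out
      var record = {};
      record[idField] = row[idField];
      record.date = "01/01/" + key;           // 2009 -> 01/01/2009
      record.value = parseFloat(row[key]);
      result.push(record);
    });
  });
  return result;
}

var longRows = toLongFormat(wide, "country", "..");
// longRows holds 4 records, e.g. { country: "A", date: "01/01/2007", value: 10.5 }
```

The result has a column for each variable (country, date, value) and no placeholder symbols, which is the shape that worked for me.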

But anyway – I was in. Here’s my first Tableau dashboard:

A Dashboard is a combination of several displays (sheets) in one space. And believe me, it can become really sophisticated, but here let’s keep it simple. The top half is a map of the world with bubbles sized after the 2007 population of OECD countries. The bottom half is the same information as a bar chart, with a twist: the colour corresponds to the population change in the last 10 years. So the USA (green) has been gaining population while Hungary has seen its numbers decrease.

I’ve created an action called “highlighting on country” to link both displays. The best feature of these actions is that they are completely optional, so if you don’t want to have linked displays, it is entirely up to you and each part of the dashboard can behave independently. You can also bring in controls to filter or animate data, which I left out for the sake of simplicity. However, you can still select data points directly to highlight them in both displays, like this:

Here I’ve highlighted the top 5 countries. The other ones are muted in both displays. Here my colour choice is unfortunate because Japan and Germany, which are selected, don’t look too different from the other countries. Now I can select the values for the countries of Europe:


And you’ll see them highlighted in the bottom pane.

Display and style

Representing data in Tableau feels like flipping the pages of a Stephen Few book, which is more than coincidental as he is an advisor to Tableau. From my discussion with the Tableau consultant who called me, I gather that Tableau takes pride in its sober look and feel, which fervently follows the recommendations of Tufte and Few. I remember a few posts from Stephen’s blog where he lashed out at business intelligence vendors for their vacuous pursuit of glossiness over clarity and usefulness. Speaking of Few, I’ve upgraded my Tableau trial by re-reading his previous book, Information Dashboard Design, and I could really see where his philosophy and that of Tableau clicked.

So there isn’t anything glossy about Tableau. Yet the interface is state-of-the-art (no more, no less). Anyone who has used a PC in the past 10 years can use it without much guessing. Colours of the various screen elements are carefully chosen and command placement makes sense. Most commands are accessible in contextual menus, so you really feel that you are directly manipulating data the whole time.

When attempting to create sophisticated dashboards, I found that it was difficult to make many elements fit on one page, as the white space surrounding all elements becomes incompressible. I tried to replicate displays that I had made or that I had seen around. I was often successful (see the motion chart reproduction above), but sometimes I couldn’t achieve the level of customization that I had with visualizations coded from scratch. Then again, even Tableau’s simplest representations have many features and would be difficult to re-code.

Sharing data

According to Dan Jewett, VP of product development at Tableau,

“Today it is easier to put videos on the Web than to put data online.”

But my job is precisely to communicate data, so I’m very much looking forward to this state of affairs changing. Tableau’s answer is twofold.

The first half is Tableau Server. Tableau Server is software that organizes Tableau workbooks for a community so they can access them online, from a browser. My feeling is that Tableau Server is designed to distribute dashboards within an organization, less so with anyone on the internet.

That’s where the second part of the answer, Tableau Public, comes into play. Tableau Public is still in closed beta, but the principle is that users would have a free desktop application which can do everything that Tableau Desktop does, except saving files locally. Instead, workbooks would have to be published on Tableau servers for the world to see.

There are already quite a few dashboards made by Tableau Public’s first users around. See for instance How Long Does It Take To Build A Technology Empire? on one of the WSJ blogs.

Today, there is no shortage of tools that let users embed data online without technical manipulations. But as of today, there is no product that comes close to this embedded dashboard. Stephen McDaniel from Freakalytics notes that due to Tableau’s technical choices (javascript instead of flash), dashboards from Tableau Public can be seen on a variety of devices, including the iPhone.

I’ve made a few dashboards that I’d be happy to share with the world through Tableau Public.

This wraps up my Tableau review. I can see why the product has such an enthusiastic fan base. People such as Jorge Camoes, Stephen Few, Robert Kosara, Garr Reynolds, Nathan Yau, and even the Federal CIO Vivek Kundra have all professed their love for the product. The Tableau Customer Conference, which I’ve only been able to follow online so far, seems to be more interesting each year. Beyond testimonies, the gallery of examples (again at, but do explore from there to see videos and white papers), still in the making, shows the incredible potential of the software.