… and tableau contest results

have been announced. My entry got honorary mentions which is the best I could hope for not being based in the US! anyway, the rewards is going through the process, seeing what others have done. and learn from the process.

I was pleasantly surprised to see that Agnieszka – Russian Sphinx chose a subject similar to mine – real estate, but in Moscow. Here’s her entry:

Now she and I used a different approach to display our map and I didn’t really comment on how I made mine (let alone how she made hers)
Here’s my map for the record:

To create my map, I used the default mapping function of Tableau, where you declare the geographic role of two of your measures – one as latitude, one as longitude – and you can plot marks on a map, that can be zoomed, panned etc.

If one of your dimensions is geographical in nature, like a country, US state, US county name or code, Tableau can calculate latitude and longitude so you don’t have to provide them. This also works for zip codes. So, if you have a valid US address, you can extract the zip code out of that (before you put your data in Tableau) and Tableau can represent this on a map quite accurately.
In my case, I had tens of thousands of marks, which were available at a much finer level than the zip code (sometimes down to the appartment number) so I thought it would be a pity to aggregate them so. My data came with notions like neighborhood or block numbers, which is specific to the New York administration I suppose, but interesting in my context. That idea of block number sounded like the right level of aggregation for my data.
The problem is, how do I get the latitude and longitude of each of the blocks?
How I did this was to chose one address per block, and then geocode it with python and google maps. Geocoding is the process of converting addresses, well, in geographical coordinates and while doing it in Python does require coding it is trivial and there are many ready-made examples around which can be reused as is.
then, in my data, I put the coordinates of each block for every record. So any building sale can be associated with a laritude and longitude, which is that of their block.
In another view, I have a simpler map – by neighborhood. Neighborhoods are another proprietary notion, they are groups of blocks. So, I simply average the latitude and longitude of all the buildings in the neighborhood to position their mark.

While it can be desirable to let users interact with the map, in many cases it is not a safe option. Once you let your users unzoom and/or scrool away from the action, it is not obvious for them to get back.
So Agnieszka has used another nifty technique, that of background images.
Her map is really a static file imported with the background image function which is found in the data menu:

Like this, the user can really specify what to use for a map, when does it start, where does it end. The map won’t move afterwards.
However, by doing so one cannot use geographic coordinates to display marks. One has to obtain, outside of Tableau, the x and y coordinates of each mark on the screen. Then, those x and y coordinates can be made into measures, and assigned respectively to the column and row shelves of the viz.

I’ll let you figure out which one is best for your next map project. Note that you can generate highly customizable map backgrounds with a service like cloudmade.


La france qui exporte

This week, I was made aware of a new set of maps by French ministry of Foreign Trade, called cartographie de la France qui exporte (map of France exports) (link). Since I’m interested in the topic and that I know that French public services have killer cartographers I was eager to see what was so exciting about the first set of online maps on French exports.

I was a little underwhelmed to be honest. Online here meant static pdf files, although this is a dataset that just begs to be explored and manipulated.
On top of that, those where basic choropleth maps with markers such as this one here:

Now this map has two problems. First, it’s a choropleth with a discrete scale, but the values of adjacent areas can vary a lot. So, if you look at this portion of the map, what can be deduced on the values? not much I’m afraid.

Second, it’s difficult to compare the marks on the map. Which region has the biggest? the smallest? how do two specific regions compare? with this representation, this type of question is even more difficult to answer than with a table.

Also these charts answer one partial question. So this one, here, shows which region exports most food products. But to where? and how about the imports and balance? now if one given view was the most relevant and could illustrate some important finding, it can be highlighted but here the website gives us collections of many of such maps. As a citizen I’m leaving no more informed than I was.

Not being the one to criticize without proposing an alternative, I whipped out an interactive exploratory tool of France trade flows.

(The interactive vis is too wide to conveniently fit in a blog, but clicking on the image will open it in a new tab).

I don’t have access to the same dataset so I can’t show a strict equivalent. My data comes from COMTRADE, the UN database of trade flows, and shows all exports and imports to France in 2009. They are not broken down by region or by type of company, but I got the flows by partner country and product category.
The idea is that one can select something on one treemap to update the other. Also, it’s possible to alternate between a categorical view (where all groups of products and continents look neatly separated) and a view of the balance, which quickly shows which products or which countries get the bulk of French trade.

(technical explanation follows for those interested in the code proper)
Now following last week’s tutorial, of course it had to be done in protovis.
Actually it illustrates some interesting principles of working with arrays, trees, maps etc.

First, I want to do as much data manipulation as possible in protovis as opposed to manually. So, my source data for the treemap is stored as an array of associative arrays, which is probably the preferred form in protovis. This is no different than, for instance, Protovis’s barley example.
Now how do you get something of the shape –

var data=

to something shaped like a tree like:

var tree=
{0: {
       02: 101421,
       03: 0,
       04: 9272355,
       05: 0,

The solution is to use the rollup method.

First, if you look at my individual records, they are of the shape:


where com is commodity code, cat is product category, cou is country, con is continent, imp is imports and exp is exports.

For any country + commodity combination, there will be only one record.
What I’m interested to get in the tree I will use for the treemap are exports. That is what will determine the size of the leaves of the tree.

first I am going to nest my array:

var byProduct=pv.nest(data) 
	.key(function(d) {return d.cou;})
	.key(function(d) {return d.cat;})
	.key(function(d) {return d.com;})

once I have written this I could follow up with a .entries() statement which would return me a nested array, or with rollup() which could give me the tree I need.
Since, again, there is only one record for a combination of country (cou) and commodity (com), I can use any aggregation I want.

I define this function:

function rollup(data) {return pv.sum(data, function(d) {return d.exp;});} 

It returns the sum of all the export values. Since there is just one record, what it does is that it gives me the one export value I need in a tree form.

So the complete statement is:

function rollup(data) {return pv.sum(data, function(d) {return d.exp;});} 

var byProduct=pv.nest(data) 
	.key(function(d) {return d.cou;})
	.key(function(d) {return d.cat;})
	.key(function(d) {return d.com;})

This creates a tree, nested by country, then by product category, then by commodity. The corresponding values are the exports.

now creating my treemap data dynamically saves me a ton of hassle compared to trying to come up with a data file of the right shape and size, not mentioning the calculation errors which creep in each manual manipulation !

Another point of interestingness: how I computed the data to create the bar charts on the side.
For the left treemap (and left bar chart) the user has selected a country. (and for the right ones, it’s a given product, but let’s focus on the left side, the reasoning is the same for the other side anyway).

so first I am going to take the tree we made earlier and just look at the selected country. We can do that with a statement like:


(so now we have a tree with just 2 levels, product category and commodity).

Now I can’t run pv.nest and all that on a tree. I need an array! so I have to use flatten to turn that section of the tree into a bona fide array which I will be able to further process.

catsByCountry = pv.flatten(myProductTree).key("cat").key("com").key("exp").array(); 

Here, note that the arguments: “cat”, “com”, “exp” are completely arbitrary. But, since I’m recreating the array almost as it originally was, I might as well use the same names for the keys.

So now, I have like a little subset of my original dataset, only the records of the selected country.
I can now proceed to sum exports by categories using a standard rollup method, just as we’ve seen here.

catsByCountry = pv.nest(catsByCountry).key(function(d) {return d.cat;}).rollup(rollup);

Conveniently, the rollup function that I defined earlier sums the records! and here I do need summing, not any aggregation.

The problem is that the rollup() method creates an associative array, and if I need to use that in a bar chart I need a proper array! so, I use pv.values() which does just that, it creates an array out of the values of an associative array.

catsByCountry = pv.values(catsByCountry);

Now the values can vary a lot in absolute terms depending on the selected country. This is why in the actual bar chart, I use pv.normalize() to have only values from 0 to 1 which are much more convenient to plot.

	.data(function() pv.normalize(catsByCountry))

one last thing, to save space in the data set (which means: bandwidth + loading time) I’ve used short keys in my data file, and I’ve used codes for countries, commodities and the like.

so I have this:


instead of

    cat:"Food and live animals",

to get the names of the countries, categories etc. I have in my data file variables that associate, say, a country code to its long name, its continent etc.
so I can have to write things like:

countries["4"].name+" ("+continents[countries["4"].continent]+")"

instead of something simpler, but it’s a good trade-off because writing those names in full in the original dataset inflates the size of the file to megabytes (there are approx 10.000 records).


Working with data in protovis – part 4 of 5

Previous: array functions in javascript and protovis

Reshaping complex arrays

This really is what protovis data processing is all about.
In today’s tutorial, I am going to refer extensively to protovis’s Becker’s Barley example. One reason for that is that it’s also used in the API documentation of the methods we are going to cover, and also because I am posting a line-by-line explanation of this example that you can refer to.

So far we’ve seen that :

  • Associative arrays are great as data elements, as their various values can be used for various attributes.
    For instance, if the current data element is an associative array of this shape:

    { yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" }

    one could imagine a bar chart where the length of the bar would come from the yield, their fillStyle color from the variety, the label from the site, etc.

  • arrays of associative arrays are very practical to manipulate thanks to accessor functions.
    An array of the shape:

    var barley = [
      { yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" },
      { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
      { yield: 27.43334, variety: "Manchuria", year: 1931, site: "Morris" },
      { yield: 39.93333, variety: "Manchuria", year: 1931, site: "Crookston" },
      { yield: 32.96667, variety: "Manchuria", year: 1931, site: "Grand Rapids" },
      { yield: 28.96667, variety: "Manchuria", year: 1931, site: "Duluth" },
      { yield: 43.06666, variety: "Glabron", year: 1931, site: "University Farm" },
      { yield: 55.20000, variety: "Glabron", year: 1931, site: "Waseca" },
      { yield: 28.76667, variety: "Glabron", year: 1931, site: "Morris" },
      { yield: 38.13333, variety: "Glabron", year: 1931, site: "Crookston" },
      { yield: 29.13333, variety: "Glabron", year: 1931, site: "Grand Rapids" },
      { yield: 29.66667, variety: "Glabron", year: 1931, site: "Duluth" },
      { yield: 35.13333, variety: "Svansota", year: 1931, site: "University Farm" },
      { yield: 29.33333, variety: "Wisconsin No. 38", year: 1932, site: "Duluth" }

    could be easily sorted according to any of the keys – yield, variety, year, site, etc.

  • it is easy to access the data of an element’s parent, and in some cases it can greatly simplify the code.


For this last reason, you may want to turn one flat array of associative arrays into an array of arrays of associative arrays. This process is called nesting.

Simple nesting

If you turn a single array like the one on the left-hand side to an array of arrays like on the right-hand side, you could easily do 3 smaller charts, one next to the other, by creating them inside of panels. You could have some information at this panel level (for instance the variety) and the rest at a lower level.

Fortunately, there are protovis methods that can turn your flat list into a more complex array of arrays. And since protovis methods are meant to be chained, you can even go on and create arrays of arrays of arrays of arrays if needs be.
Even better – combined with the other data processing functions, you don’t only change the structure of your array, but you can also filter and control the order of the elements to show everything that you want and only what you want.

And how complicated can this be?
To do the above, all you have to type is

barley=pv.nest(barley).key(function(d) d.variety).entries();

What this does is that it nests your barley array, according to the variety key. entries() at the end is required to obtain the array of arrays needed.

Here is an example of what can be done with both kinds of data structures in just a few lines of code (which won’t include the data proper. The long, flat array is stored in the variable barley, as above).
Without nesting:

var vis = new pv.Panel()
	.left(function() this.index*5)
	.height(function(d) d.yield*2)
	.fillStyle(function(d) d.year==1931?"steelBlue":"Orange")

As the pv.Bar goes through the array, there is not much it can do to structure it. We can just size the bars according to the value of yield, and color them according to another key (here the year).

Now using nesting:

barley2=pv.nest(barley).key(function(d) d.variety).entries();
barley2=pv.nest(barley).key(function(d) d.variety).entries();
var vis = new pv.Panel()
var cell=vis.add(pv.Panel)
	.left(function() this.index*70)
cell.anchor("top").add(pv.Label).textAlign("center").font("9px sans-serif").text(function(d) d.key)
		.data(function(d) d.values)
		.left(function() 5+this.index%6*10)
		.height(function(d) d.yield*2)
		.fillStyle(function(d) d.year==1931?"steelBlue":"Orange")
		.add(pv.Label).text(function(d) d.site).textAngle(-Math.PI/2)
			.textAlign("left").left(function() 15+this.index%6*10).textStyle("#222")

Here we used the same simple nesting command as above (the original example uses a more complicated one which allows for more refinement in the display). This structure allows us to create first panels, which we can style by displaying the name of the sites for instance, then, within these panels, the corresponding bars.

Doing this with the data in its original form would have been possible, but would have required writing a much longer program. So the whole idea of nesting is to take some time to plan the data structure once, so that the code is as short and useful as possible.

Going further with nesting

However, it is possible to go beyond that:

  • by nesting further, which can be done by adding other .key() methods:

      .key(function(d) d.variety)
      .key(function(d) d.year)


  • by sorting keys or values using the sortKeys() and sortValues() methods, respectively.

    For instance, we can change the order in which the variety blocks are displayed with sortKeys():

      .key(function(d) d.variety)
      .key(function(d) d.variety)

By using sortKeys without argument, the natural order is used (alphabetical, since our key is a string). But we could provide a comparison function if we wanted a more sophisticated arrangement.

Nesting and hierarchy

If you run the double nesting command we discussed above,

  .key(function(d) d.variety)
  .key(function(d) d.year)

you’ll get as a result something of the form:

var barley=[
  {key:"Manchuria", values: [
    {key:"1931", values: [
      {site:"University farm", variety: "Manchuria", year: 1931, yield: 27},
      {site:"Waseca", variety: "Manchuria", year: 1931, yield: 48.86667},
      {site:"Morris", variety: "Manchuria", year: 1931, yield: 27.43334},
      {site:"Crookston", variety: "Manchuria", year: 1931, yield: 39.93333},
      {site:"Grand Rapids", variety: "Manchuria", year: 1931, yield: 32.96667},
      {site:"Duluth", variety: "Manchuria", year: 1931, yield: 28.96667}
    {key:"1932", values: [  
      {site:"University farm", variety: "Manchuria", year: 1932, yield: 26.9},
      {site:"Waseca", variety: "Manchuria", year: 1932, yield: 33.46667},
      {site:"Morris", variety: "Manchuria", year: 1932, yield: 34.36666},
      {site:"Crookston", variety: "Manchuria", year: 1932, yield: 32.96667},
      {site:"Grand Rapids", variety: "Manchuria", year: 1932, yield: 22.13333},
      {site:"Duluth", variety: "Manchuria", year: 1932, yield: 22.56667}
  {key: "Glabron", ...

and so on and so forth for all the varieties of barley. Now how can we use this structure in a protovis script? why not use multi-dimensional arrays instead, and if so, how would the code change?

Well. You’d start using this structure by creating a first panel and and passing it the nested structure as data.
Assuming your root panel is called vis, you’d start likewise:


now, since barley has been nested first by variety, it is now an array of 10 elements. You are going to create 10 individual panels. At some point you should worry about dimensioning and positioning them. But here we are only focusing on passing data to subsequent elements.

Next, you are going to create another set of panels (or any mark, really, this doesn’t change anything for the data)

    .data(function(d) d.values)

This is how you drill down to the next level of data, by using an accessor function with the key “values”.
Congratulations! you have created 2 panels in each of our 10 individual panels, one per year.

Finally, you are going to create a final mark (let’s say, a pv.Bar)

    .data(function(d) d.values)

Again, you use an accessor function of the same form. This will create a bar chart with 6 bars.
The data element corresponding to each bar is of the form:

{site:"University farm", variety: "Manchuria", year: 1932, yield: 26.9}

So, when you style the chart, you can access these properties with accessor functions, and write for instance:

    .height(function(d) d.yield)
    .add(pv.Label).text(function(d) d.variety)


To sum it up: you can create a hierarchical structure in protovis that corresponds to the shape of your nested array by adding elements and passing data using an accessor function with the key “values”.
At the lowest level of your structure you can access all the properties of the original array using accessor functions.

Now, what if instead we used a multi-dimensional, normal array without keys and values? don’t they have structure and hierarchy, too?
This is not only possible, but also advised when your dataset is getting really big, as you would plague your users with annoying loading times. This changes the structure of the code though.

An equivalent multi-dimensional array would be something like:

var yields =
[ // this is the level of the variety
  [ // this is the level of the year
    [ 27, 48.86667, 27.43334, 39.93333, 32.96667, 28.96667], 
    [ 26.9, 33.46667, 34.36666, 32.96667, 22.13333, 22.56667]
    [ 43.06666, 55.2, 28.76667, 38.13333, 29.13333, 29.66667],
    [ 36.8, 37.73333, 35.13333, 26.16667, 14.43333, 25.86667]
    [ 35.13333, 47.33333, 25.76667, 40.46667, 29.66667, 25.7],
    [ 27.43334, 38.5, 35.03333, 20.63333, 16.63333, 22.23333]
    [ 39.9, 50.23333, 26.13333, 41.33333, 23.03333, 26.3],
    [ 26.8, 37.4, 38.83333, 32.06666, 32.23333, 22.46667]
    [ 36.56666, 63.8333, 43.76667, 46.93333, 29.76667, 33.93333],
    [ 29.06667, 49.2333, 46.63333, 41.83333, 20.63333, 30.6]
    [ 43.26667, 58.1, 28.7, 45.66667, 32.16667, 33.6],
    [ 26.43334, 42.2, 43.53334, 34.33333, 19.46667, 22.7]
    [ 36.6, 65.7667, 30.36667, 48.56666, 24.93334, 28.1],
    [ 25.56667, 44.7, 47, 30.53333, 19.9, 22.5]
    [ 32.76667, 48.56666, 29.86667, 41.6, 34.7, 32],
    [ 28.06667, 36.03333, 43.2, 25.23333, 26.76667, 31.36667]
    [ 24.66667, 46.76667, 22.6, 44.1, 19.7, 33.06666],
    [ 30, 41.26667, 44.23333, 32.13333, 15.23333, 27.36667]
    [ 39.3, 58.8, 29.46667, 49.86667, 34.46667, 31.6],
    [ 38, 58.16667, 47.16667, 35.9, 20.66667, 29.33333]

and that’s the whole lot. It is indeed shorter. now this array is only the yields, you may want to create an array of the possible values of varieties, sites and years for good measure.

var varieties=["Manchuria", "Glabron", "Svansota", "Velvet", "Trebi",
     "No. 457", "No. 462", "Peatland", "No. 475", "Wisconsin No. 38"], 
    sites=["University Farm", "Waseca", "Morris", "Crookston", "Grand Rapids", "Duluth"],

And by the way, it is very possible to create these arrays out of the original array using the map() method or equivalent.

how can we create an equivalent structure?
we start like the above:


Likewise, our 3-dimensional array is really an array of 10 arrays of 2 arrays of 6 elements. So we are also creating 10 panels. Let’s continue and create panels for the years:

    .data(function(d) d)

To drill down one level in an array, you have to use this form. you say that you are giving the children of your object what’s inside the data property of their parent.

So naturally, you follow by

    .data(function(d) d)

now how you style your bars will be slightly different than before. What you passed your first panel was an array of yields. So that’s what you get now from your data. If you want something else, you’ll have to get it with this.index for instance.

    .height(function(d) d) // that's the yield
    .add(pv.Label).text(function() varieties[this.index])

All in all it’s trickier to work with arrays. The code is less explicit, and if you change one array even by accident, you’ll have to check that others are still synchronized. But it could make your vis much faster.


Sometimes, what you want out of an array is not a more complex array, but a simpler list of numbers. For instance, what if you could obtain the sum of all the values in the array for such or such property? This is also possible in protovis, and in fact, it looks a lot like what we’ve done. The difference is that instead of using the method entries(), we will use the method rollup().

Let’s suppose we have a flat array that looks like this: these are scores of students on 3 exams.

var scores=[
{student:"Adam", exam:1, score: 77},
{student:"Adam", exam:2, score: 34},
{student:"Adam", exam:3, score: 85},
{student:"Barbara", exam:1, score: 92},
{student:"Barbara", exam:2, score: 68},
{student:"Barbara", exam:3, score: 97},
{student:"Connor", exam:1, score: 84},
{student:"Connor", exam:2, score: 54},
{student:"Connor", exam:3, score: 37},
{student:"Daniela", exam:1, score: 61},
{student:"Daniela", exam:2, score: 58},
{student:"Daniela", exam:3, score: 64}

Now, we would like to get, in one simple object, the average for each student.
We know we could reshape the array if we wanted by using pv.Nest and entries():

pv.nest(scores).key(function(d) d.student).entries()

This would be something of the shape:

[{key:"Adam", values:[
    {exam:1, score: 77, student: "Adam"},
    {exam:2, score: 34, student: "Adam"},
    {exam:3, score: 85, student: "Adam"}
  {key:"Barbara", values:[
    {exam:1, score: 92, student: "Barbara"},
    {exam:2, score: 68, student: "Barbara"},
    {exam:3, score: 97, student: "Barbara"}

Useful, for instance, if we’d want to chart the progress of each student separately.

Now if instead of using entries() at the end, we use rollup(), we could get this:

Adam: 65.33333333333333
Barbara: 85.66666666666667
Connor: 58.333333333333336
Daniela: 61}

The exact statement is

  .key(function(d) d.student)
  .rollup(function(data) pv.mean(data, function(d) d.score))

To understand how this works, it helps to visualize what the pv.nest would have returned if we had asked for entries.
What rollup does is that it would go through each of the values that correspond to the keys, and return one aggregate value, depending on the function.

For the first student, “Adam”, the corresponding values array is like this:

    {exam:1, score: 77, student: "Adam"},
    {exam:2, score: 34, student: "Adam"},
    {exam:3, score: 85, student: "Adam"}

so rollup will just look at each element of this array and apply the function.
This is what (data) in “function(data)” corresponds to.
Next, we tell protovis what to do with these elements. Here, we are interested in the average, so we take pv.mean (not pv.average, remember?)
However, we can’t directly compute the average of an array of associative arrays – we must tell protovis exactly what to average. This is why we use an accessor function, function(d) d.score.

Of course, pv.mean used in this example can be replaced by just about any function.

In the name of clarity, especially if there is only one property that can be aggregated, you can declare a function outside of the rollup() method. This is useful if you are going to aggregate your array by different dimensions:

function meanScore(data) pv.mean(data, function(d) d.score);
var avgStudent=pv.nest(scores)
  .key(function(d) d.student)
var avgExam=pv.nest(scores)
  .key(function(d) d.exam)


Protovis also provides methods that turn a “nested” array back into a flat array. And methods that turn a normal array into a tree.
The main advantage of having a flat array is that you can nest it in a different way. This is useful, for instance, if you got your data in a nested form that doesn’t work for you. Likewise, a tree is easier to reshape in protovis than an array.

To create a flat array out of a nested one, you have to use pv.flatten and specify all the keys of the array and conclude by array().


It’s important to note that you need to specify all the keys, not just the keys that correspond to actual nesting. So again, if you start from a flat array, and you do


to reverse this, you’ll have to enter the full formula, using key four times:


Finally, pv.tree – well, I haven’t seen this method used outside the documentation. It’s not used in any live example, not covered by any question in the forum, and I haven’t found any trace of it in the internet. So I’d rather leave you with the explanation in the documentation which is fairly clear than come up with my own. If you find yourself in a situation like in the documentation, where you have a flat array of associative arrays, which have a property that could be interpreted as a hierarchy, then you could use this method to turn your array in something more useful to protovis.

Putting it all together

Instead of coming up with a specific new example for this section I refer you to my explanation of the Becker’s Barley example.
On the same subject, see a comparison of how to re-create Becker’s Barley with protovis and Tableau

next: working with layouts