Manipulating data like a boss with d3

Data is the first D in d3 (or possibly the 3rd, but it’s definitely one of these).

Anyway. Putting your data in the right form is crucial if you want concise code that runs fast and is easy to read (and, later, to troubleshoot).

So what shape should your data be in?
You undoubtedly have many options.

To follow along with this tutorial, let's assume you want to plot the relationship between R&D expenditure and GDP growth for a number of countries. You've got this file, full of tabular data, which lists for every country a name, a continent, the gross R&D expenditure as a percentage of GDP, GDP growth, and, for context, population and GDP per capita.

So one very basic approach would be to put each of these variables into its own independent array.

var GERD=[2.21367, 2.74826, 1.96158, 1.80213, 0.39451, 1.52652, 3.01937, 1.44122, 3.84137, 2.20646, 2.78056, 0.5921, 1.14821, 2.64107, 1.78988, 4.2504, 1.26841, 3.33499, 3.3609, 1.67862, 0.41322, 1.81965, 1.13693, 1.75922, 0.67502, 1.65519, 1.24252, 0.48056, 1.85642, 0.92523, 1.38357, 3.61562, 2.99525, 0.84902, 1.82434, 2.78518];
var growth=[2.48590317, 3.10741128, 1.89308521, 3.21494841, 5.19813626, 1.65489834, 1.04974368, 7.63563272, 2.85477157, 1.47996142, 2.99558644, -6.90796403, 1.69192342, -3.99988322, -0.42935239, 4.84602001, 0.43108032, 3.96559062, 6.16184325, 2.67806902, 5.56185685, 1.18517739, 2.33052515, 1.59773989, 4.34962928, -1.60958484, 4.03428262, 3.34920254, -0.17459255, 2.784, -0.06947685, 3.93555895, 2.71404473, 9.00558548, 2.09209263, 3.02171711];
var GDPcap=[40718.78167, 42118.46375, 38809.66436, 39069.91407, 15106.73205, 25956.76492, 40169.83173, 22403.02459, 37577.71225, 34147.98907, 39389.25874, 26878.00015, 21731.55484, 35641.55402, 40457.94273, 28595.68799, 32580.06572, 33751.23348, 29101.34563, 86226.3276, 15200.22119, 43455.30129, 29870.67748, 57230.89, 19882.99226, 25425.59561, 19833, 24429.61828, 27559.75186, 10497.583, 32779.3288, 41526.2995, 46621.77334, 15666.18783, 35715.4691, 46587.61843];
var population=[22319.07, 8427.318, 10590.44, 33909.7, 17248.45, 10286.3, 5495.246, 1335.347, 5366.482, 62747.78, 82852.47, 11312.16, 9993.116, 308.038, 4394.382, 7623.6, 59059.66, 126912.8, 48988.83, 483.701, 109219.9, 16480.79, 4291.9, 4789.628, 37725.21, 10684.97, 142822.5, 5404.493, 2029.418, 50384.55, 44835.48, 9276.365, 7889.345, 73497, 62761.35, 313232];
var country=["Australia", "Austria", "Belgium", "Canada", "Chile", "Czech Republic", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Israel", "Italy", "Japan", "Korea", "Luxembourg", "Mexico", "Netherlands", "New Zealand", "Norway", "Poland", "Portugal", "Russian Federation", "Slovak Republic", "Slovenia", "South Africa", "Spain", "Sweden", "Switzerland", "Turkey", "United Kingdom", "United States"];
var continent=["Oceania", "Europe", "Europe", "America", "America", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Asia", "Europe", "Asia", "Asia", "Europe", "America", "Europe", "Oceania", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Africa", "Europe", "Europe", "Europe", "Europe", "Europe", "America"];

(don’t bother scrolling, it’s more of the same :) )
Then, you can just create marks for each data item and fetch each attribute independently.
Let’s do a bubble chart for instance.
(small aside: in this post I won't go through the code to set up the svg container or the scales, focusing instead on the data structures. That code, which is really nothing special, can be found in the source code of the examples).

So to create our circles we would write something like:

svg.selectAll("circle").data(country).enter()
  .append("circle")
  .attr("cx",function(d,i) {return x(GERD[i]);})
  .attr("cy",function(d,i) {return y(growth[i]);})
  .attr("r",function(d,i) {return r(Math.sqrt(population[i]));})

  .style("fill",function(d,i) {return c(continent[i]);})
  .style("opacity",function(d,i) {return o(GDPcap[i]);})

    .append("title")
    .text(String)

and this works:

See example in its own tab or window
But this is hell to maintain. If for some reason there is an error in one of the values, for instance due to a cat or a small child in the proximity of the computer, the error will be very difficult to troubleshoot.
Another problem is that it's very difficult to apply any kind of subsequent treatment to the data. For instance, you will notice that there are smaller bubbles entirely within the large orange bubble which happens to be on top of them, so it's not possible to mouse over the smaller bubbles. One way to address that would be to sort the data in order of decreasing population (the size of the bubbles), so that this situation cannot arise. Now, while it is possible to sort 6 arrays according to the values of one of them, it's very messy, as the sketch below suggests.
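Here is a rough idea of the gymnastics involved, assuming you sort an array of indices first and then rebuild every array in that order (one slip in any of these lines and the arrays silently fall out of sync):

// sort an array of indices by decreasing population...
var indices = country.map(function(d, i) { return i; });
indices.sort(function(a, b) { return population[b] - population[a]; });

// ...then rebuild every single array in that order
GERD = indices.map(function(i) { return GERD[i]; });
growth = indices.map(function(i) { return growth[i]; });
GDPcap = indices.map(function(i) { return GDPcap[i]; });
continent = indices.map(function(i) { return continent[i]; });
country = indices.map(function(i) { return country[i]; });
population = indices.map(function(i) { return population[i]; });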

Ideally, you should have all the values that will be translated graphically within one, single object. You want to have an array of these objects that you will pass to the data method, and be able to write something like:

svg.selectAll("circle").data(data).enter()
  .append("circle")
  .attr("cx",function(d) {return x(+d.GERD);})
  .attr("cy",function(d) {return y(+d.growth);})
  .attr("r",function(d) {return r(Math.sqrt(+d.population));})

  .style("fill",function(d) {return c(d.continent);})
  .style("opacity",function(d) {return o(+d.GDPcap);})

Here, you have just one data source, which is much safer.

So if you’re thinking: I know, I should create a variable like this:

var data=[
  {"country":"Australia","continent":"Oceania","population":22319.07,"GDPcap":40718.78167,"GERD":2.21367,"growth":2.48590317},
  {"country":"Austria","continent":"Europe","population":8427.318,"GDPcap":42118.46375,"GERD":2.74826,"growth":3.10741128},
  {"country":"Belgium","continent":"Europe","population":10590.44,"GDPcap":38809.66436,"GERD":1.96158,"growth":1.89308521},
  {"country":"Canada","continent":"America","population":33909.7,"GDPcap":39069.91407,"GERD":1.80213,"growth":3.21494841},
  {"country":"Chile","continent":"America","population":17248.45,"GDPcap":15106.73205,"GERD":0.39451,"growth":5.19813626},
  {"country":"Czech Republic","continent":"Europe","population":10286.3,"GDPcap":25956.76492,"GERD":1.52652,"growth":1.65489834},
  {"country":"Denmark","continent":"Europe","population":5495.246,"GDPcap":40169.83173,"GERD":3.01937,"growth":1.04974368},
  {"country":"Estonia","continent":"Europe","population":1335.347,"GDPcap":22403.02459,"GERD":1.44122,"growth":7.63563272},
  {"country":"Finland","continent":"Europe","population":5366.482,"GDPcap":37577.71225,"GERD":3.84137,"growth":2.85477157},
  {"country":"France","continent":"Europe","population":62747.78,"GDPcap":34147.98907,"GERD":2.20646,"growth":1.47996142},
  {"country":"Germany","continent":"Europe","population":82852.47,"GDPcap":39389.25874,"GERD":2.78056,"growth":2.99558644},
  {"country":"Greece","continent":"Europe","population":11312.16,"GDPcap":26878.00015,"GERD":0.5921,"growth":-6.90796403},
  {"country":"Hungary","continent":"Europe","population":9993.116,"GDPcap":21731.55484,"GERD":1.14821,"growth":1.69192342},
  {"country":"Iceland","continent":"Europe","population":308.038,"GDPcap":35641.55402,"GERD":2.64107,"growth":-3.99988322},
  {"country":"Ireland","continent":"Europe","population":4394.382,"GDPcap":40457.94273,"GERD":1.78988,"growth":-0.42935239},
  {"country":"Israel","continent":"Asia","population":7623.6,"GDPcap":28595.68799,"GERD":4.2504,"growth":4.84602001},
  {"country":"Italy","continent":"Europe","population":59059.66,"GDPcap":32580.06572,"GERD":1.26841,"growth":0.43108032},
  {"country":"Japan","continent":"Asia","population":126912.8,"GDPcap":33751.23348,"GERD":3.33499,"growth":3.96559062},
  {"country":"Korea","continent":"Asia","population":48988.83,"GDPcap":29101.34563,"GERD":3.3609,"growth":6.16184325},
  {"country":"Luxembourg","continent":"Europe","population":483.701,"GDPcap":86226.3276,"GERD":1.67862,"growth":2.67806902},
  {"country":"Mexico","continent":"America","population":109219.9,"GDPcap":15200.22119,"GERD":0.41322,"growth":5.56185685},
  {"country":"Netherlands","continent":"Europe","population":16480.79,"GDPcap":43455.30129,"GERD":1.81965,"growth":1.18517739},
  {"country":"New Zealand","continent":"Oceania","population":4291.9,"GDPcap":29870.67748,"GERD":1.13693,"growth":2.33052515},
  {"country":"Norway","continent":"Europe","population":4789.628,"GDPcap":57230.89,"GERD":1.75922,"growth":1.59773989},
  {"country":"Poland","continent":"Europe","population":37725.21,"GDPcap":19882.99226,"GERD":0.67502,"growth":4.34962928},
  {"country":"Portugal","continent":"Europe","population":10684.97,"GDPcap":25425.59561,"GERD":1.65519,"growth":-1.60958484},
  {"country":"Russian Federation","continent":"Europe","population":142822.5,"GDPcap":19833,"GERD":1.24252,"growth":4.03428262},
  {"country":"Slovak Republic","continent":"Europe","population":5404.493,"GDPcap":24429.61828,"GERD":0.48056,"growth":3.34920254},
  {"country":"Slovenia","continent":"Europe","population":2029.418,"GDPcap":27559.75186,"GERD":1.85642,"growth":-0.17459255},
  {"country":"South Africa","continent":"Africa","population":50384.55,"GDPcap":10497.583,"GERD":0.92523,"growth":2.784},
  {"country":"Spain","continent":"Europe","population":44835.48,"GDPcap":32779.3288,"GERD":1.38357,"growth":-0.06947685},
  {"country":"Sweden","continent":"Europe","population":9276.365,"GDPcap":41526.2995,"GERD":3.61562,"growth":3.93555895},
  {"country":"Switzerland","continent":"Europe","population":7889.345,"GDPcap":46621.77334,"GERD":2.99525,"growth":2.71404473},
  {"country":"Turkey","continent":"Europe","population":73497,"GDPcap":15666.18783,"GERD":0.84902,"growth":9.00558548},
  {"country":"United Kingdom","continent":"Europe","population":62761.35,"GDPcap":35715.4691,"GERD":1.82434,"growth":2.09209263},
  {"country":"United States","continent":"America","population":313232,"GDPcap":46587.61843,"GERD":2.78518,"growth":3.02171711}
]

and get this done, and furthermore if you are thinking “Hey, I can do this in Excel from my csv file, with one formula that I will copy across the rows”, you need to stop right now in the name of all that is good and holy.
Even though it works:


See example in its own tab or window

This approach has a number of flaws which you can all avoid if you read on.
First, the execution of your program will be stopped while your browser reads the source code that contains the “data” variable. This is negligible for 36 rows, but as objects get bigger and more complex, an equivalent variable may take seconds or even minutes to load. And now we have a problem.
That’s a problem for your users. Now to you: creating a JSON variable from tabular data is tedious and error prone. The formula editing interface in Excel doesn’t really help you spot where you have misplaced a quote or a colon. As a result, this is very time-consuming.

Don’t do that: there is a much simpler way.

Enter the d3.csv function.

d3.csv("data.csv",function(csv) {
  // we first sort the data

  csv.sort(function(a,b) {return b.population-a.population;});

  // then we create the marks, which we put in an initial position

  svg.selectAll("circle").data(csv).enter()
    .append("circle")
    .attr("cx",function(d) {return x(0);})
    .attr("cy",function(d) {return y(0);})
    .attr("r",function(d) {return r(0);})

    .style("fill",function(d) {return c(d.continent);})
    .style("opacity",function(d) {return o(+d.GDPcap);})

      .append("title")
      .text(function(d) {return d.country;})
  
  // now we initiate - moving the marks to their position

  svg.selectAll("circle").transition().duration(1000)
    .attr("cx",function(d) {return x(+d.GERD);})
    .attr("cy",function(d) {return y(+d.growth);})
    .attr("r",function(d) {return r(Math.sqrt(+d.population));})
})

Here’s how it works.
You give the d3.csv function the location of a csv file (which we had all along) and a function that will run on the array of objects (what we always wanted) that d3 creates from that file, using the first row as keys.
In other words, once inside the d3.csv function, the "csv" variable will be worth exactly what we assigned to "data" earlier, with one major difference: we didn't have to manufacture this variable or do any kind of manual intervention. We are certain it corresponds to the file exactly.

One nice thing with this method is that since your variable is not explicitly in the source code, your browser can read it much faster. The data is only read when the d3.csv function is called, as opposed to the previous approach where the entirety of the source code (including the data) had to be read before the first statement could be executed. Of course, it only makes a difference when the data size is significant. But using the d3.csv approach would let you display a “loading data” warning somewhere on your page, and remove it when inside d3.csv. Much better than a blank page.
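For instance, here is a minimal sketch of that "loading data" pattern, assuming a hypothetical #loading element somewhere in the page:

d3.select("#loading").text("loading data…");

d3.csv("data.csv", function(csv) {
  // the file has been fetched and parsed: remove the warning
  d3.select("#loading").remove();

  // ... then create the marks as shown above ...
});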

Three caveats with this method.

  • This will no longer work on a local file system (i.e. just opening the file in the browser). The resulting page can only run on a web server, which can be local (i.e. the page has a URL).
  • Whatever happens within the d3.csv function is no longer in the global scope of the program. This means that after the program has run its course you cannot open the javascript console and inspect the value of "csv", for instance. This makes these programs slightly more difficult to debug (there are obviously ways, though).
  • Everything read from the file is treated as strings. JavaScript does a lot of type conversion, but be mindful of that or you will have surprises. This is why I wrote x(+d.GERD) for instance (a + before a string converts it to a number), as the snippet below illustrates.
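A quick illustration of that last point, using the GERD column and the x scale from above:

var d = {GERD: "2.21367"};   // what d3.csv gives you: strings
console.log(d.GERD + 1);     // "2.213671" (string concatenation, probably not what you wanted)
console.log(+d.GERD + 1);    // 3.21367 (the unary + turns the string into a number)
// so inside the attr callbacks, x(+d.GERD) hands the scale an actual number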

To celebrate this superior way of acquiring data, we've thrown in animated data entry: the circles are initiated at a default value and move towards their position. You may want to check the link to see the transition effect.


See example in its own tab or window

So, at the level of the mark (i.e. our circles), the most comfortable form of data is an object with at least as many keys as there will be graphical properties to change dynamically.
One flat array of data is fine if we have just one series of data. But what if we have several series? Indeed, most visualizations have a structure and a hierarchy.
So let’s proceed with our data but now let’s assume that we want to show values for different continents as different little scatterplots (“small multiples”).
Intuitively:

  • we’ll want to add 5 “g” groups to our svg container, one for each continent,
  • and then add one dot per country in each continent to those groups.

Our flat array won’t work so well then. What to do?

The d3 answer to this problem is the d3.nest() set of methods.
d3.nest() turns a flat array of objects, which thanks to d3.csv() is a very easily available format, into an array of arrays with the hierarchy you need.
Following our intuition, wouldn't it be nice if our data were:

  • An array of 5 items, one for each continent, so we could create the “g” groups,
  • And if each of these 5 items contained an array with the data of all the corresponding countries, still in that object format that we love?

This is exactly what d3.nest() does. d3.nest(), go!

var data=d3.nest()
  .key(function(d) {return d.continent;})
  .sortKeys(d3.ascending)
  .entries(csv);

With the .key() method, we are indicating what we will be using to create the hierarchy. We want to group those data by continent, so we use this syntax.
.sortKeys is used to sort the keys in alphabetical order, so our panels appear in the alphabetical order of the continents. If we omit that, the panels will show up in the order of the data (i.e. Oceania first, as Australia is the first country). We could have avoided that by sorting the data by continent before nesting it, but it's easier like this.
Here, we just have one level of grouping, but we could have several by chaining several .key() methods.
The last part of the statement, .entries(csv), says that we want to do that operation on our csv variable.
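For instance, a hypothetical second level that splits each continent between countries above and below $30,000 of GDP per capita would just chain another .key() (the threshold is made up for the example):

var data2 = d3.nest()
  .key(function(d) {return d.continent;}).sortKeys(d3.ascending)
  .key(function(d) {return +d.GDPcap > 30000 ? "above 30k" : "below 30k";})
  .entries(csv);
// each element of data2 now has a values property which is itself
// an array of {key, values} objects, one per GDP-per-capita group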

Here is what the data variable will look like:

[
  {"key":"Africa","values":[...]},
  {"key":"America","values":[
    {"country":"United States","continent":"America","population":"313232","GDPcap":"46587.61843","GERD":"2.78518","growth":"3.02171711"},
    {"country":"Mexico","continent":"America","population":"109219.9","GDPcap":"15200.22119","GERD":"0.41322","growth":"5.56185685"},
    {"country":"Canada","continent":"America","population":"33909.7","GDPcap":"39069.91407","GERD":"1.80213","growth":"3.21494841"},
    {"country":"Chile","continent":"America","population":"17248.45","GDPcap":"15106.73205","GERD":"0.39451","growth":"5.19813626"}
  ]},
  {"key":"Asia","values":[...]},
  {"key":"Europe","values":[...]},
  {"key":"Oceania","values":[...]}
]

Now that we have our data in an ideal form let’s draw those marks:

  // One cell for each continent
  var g=svg.selectAll("g").data(data).enter()
    .append("g")
    .attr("transform",function(d,i) {return "translate("+(100*i)+",0)";});
  // we add a rect element with a title element
  // so that mousing over the cell will tell us which continent it is
  g
    .append("rect")
    .attr("x",cmargin)
    .attr("y",cmargin)
    .attr("width",cwidth-2*cmargin)
    .attr("height",cheight-2*cmargin)
      .append("title")
      .text(function(d) {return d.key;})
  // we also write its name below.
  g
    .append("text")
    .attr("y",cheight+10)
    .attr("x",cmargin)
    .text(function(d) {return d.key;})
  
  // now marks, initiated to default values
  g.selectAll("circle")
  // we are getting the values of the countries like this:
  .data(function(d) {return d.values}) 
  .enter()
      .append("circle")
      .attr("cx",cmargin)
      .attr("cy",cheight-cmargin)
      .attr("r",1)
      // throwing in a title element
      .append("title")
        .text(function(d) {return d.country;});

  // finally, we animate our marks in position
  g.selectAll("circle").transition().duration(1000)
      .attr("r",3)
      .attr("cx",function(d) {return x(+d.GERD);})
      .attr("cy",function(d) {return y(+d.growth);})
      .style("opacity",function(d) {return o(+d.GDPcap);});

(you may want to click on the link to see the transition effect and read the full source).

See example in its own tab or window

This is all very nice but wouldn’t it be better if we could characterize some aggregate information from the continents? Let’s try to find out the average values for R&D expenditure and GDP growth.

Can it be done easily? This is a job for the other main d3.nest method, rollup.

rollup is the aggregating function. Here’s an example.

var avgs=d3.nest()
    .key(function(d) {return d.continent;})
    .sortKeys(d3.ascending)
    .rollup(function(d) {
      return {
        GERD:d3.mean(d,function(g) {return +g.GERD;}),
        growth:d3.mean(d,function(g) {return +g.growth})
      };
    })
    .entries(csv);

Remember how the combination of .key() and .entries() rearranges an array into arrays of smaller arrays, depending on these keys? Well, the value that is passed to the function inside the rollup method is each of these arrays (i.e. an array of all the objects corresponding to countries in America, then an array of all the objects corresponding to countries in Europe, etc.).
Also, since we used sortKeys in our previous nesting effort, we'd better use it here too, so that the aggregates line up with the groups.
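A quick way to convince yourself of what the rollup function receives is a rollup that simply counts its input:

var counts = d3.nest()
  .key(function(d) {return d.continent;})
  .sortKeys(d3.ascending)
  .rollup(function(leaves) {return leaves.length;})
  .entries(csv);
// [{"key":"Africa","values":1}, {"key":"America","values":4}, ...]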
Here is what the variable will look like:

[
  {
    "key":"Africa",
    "values":{
      "GERD":0.92523,
      "growth":2.784
    }
  },  
  {
    "key":"America",
    "values":{
      "GERD":1.34876,
      "growth":4.2491646575
    }
  },
  {
    "key":"Asia",
    "values":{
      "GERD":3.6487633333333336,
      "growth":4.991151293333334
    }
  },
  {
    "key":"Europe",
    "values":{
      "GERD":1.8769234615384616,
      "growth":1.7901778096153846
    }
  },
  {
    "key":"Oceania",
    "values":{
      "GERD":1.6753,
      "growth":2.40821416
    }
  }
]

Incredible! Just the values we need.
Now it’s just a matter of adding them to the sketch. Two little additions here:

// we add 2 lines for the average. They will start at a default value.
  g
    .append("line").classed("growth",1)
    .attr("x1",cmargin).attr("x2",cwidth-cmargin)
    .attr("y1",cheight-cmargin)
    .attr("y2",cheight-cmargin)
    // we give these lines a title for mouseover interaction.
      .append("title").text(function(d,i) {
        return "Average growth:"+avgs[i].values.growth
      });
  g.append("line").classed("GERD",1)
    .attr("y1",cmargin)
    .attr("y2",cheight-cmargin)
    .attr("x1",cmargin)
    .attr("x2",cmargin)
      .append("title").text(function(d,i) {
        return "Average GERD:"+avgs[i].values.GERD
      });

 ...
  // we also animate the lines
  g.select(".growth").transition().duration(1000)
    .attr("y1",function(d,i) {return y(avgs[i].values.growth);})
    .attr("y2",function(d,i) {return y(avgs[i].values.growth);})
  g.select(".GERD").transition().duration(1000)
    .attr("x1",function(d,i) {return x(avgs[i].values.GERD);})
    .attr("x2",function(d,i) {return x(avgs[i].values.GERD);})

This is the final example – again you may want to click on the link to see the transition and get the entirety of the source.

See example in its own tab or window

To wrap up:

  • At the mark level, you want to have objects with as many properties as you need graphical variables (like x, y, fill, etc.).
  • Using d3.csv() and a flat file will make this easy (d3 also provides functions like d3.json or d3.xml to process data in other formats).
  • d3.nest can help you group your entries to structure your data and create more sophisticated visualizations.
  • rollup can be used to aggregate the data grouped with d3.nest.

Making-of: the map of congress equality

It's all about this map (click for interactive version)

To my datavis readers, sorry for that string of posts in French but what better data to visualize than political data, and what better time to visualize political data than election time, and what better audience for such visualizations than the folks who are asked to vote?

Like last time, though, I am writing a follow-up technical post about how I dealt with the issues of this visualization.

So anyone who ever tried to make data visualizations knows that you can hardly start without data.

My ingredients for the recipe were:

2012 presidential election results by circonscription, plus those of 2007.

Results of the previous congressional election. There were 2 files, one per round, as opposed to a flat file of députés in place (I didn't find one that didn't require some significant editing to be of use). Most importantly, I needed their political orientation, which required some tweaking.

Matching tables between circonscriptions and cities.  From a previous project, presidential election data at the city level. Also, geo coordinates of the cities.

The data which was most painful to extract was the list of candidates. In all fairness, the UMP made it easier than the PS, as they had them all on one page. For the PS, they had a Google Fusion Table which had this as a data source. That file required a lot of massaging. Eventually, local pages of the PS site would list the candidates missing from the map (or provide alternate names). When it was up, I also used the http://www.elections-legislatives.fr/ site to check for the missing names.

Finally, I figured out the genders of all the candidates by extracting their first name and looking up all the ones I wasn’t sure about (there are quite a few unisex first names in French).

Now calculations.

There is a pretty strong statistical link between the score of a party in an election in a given territory and the chances of a congressional candidate of the same party winning that district.

Predicting these chances is a well-known problem, known as classification, for which the textbook method is logistic regression.

All we needed were the 575 districts for which I had results. We then associate the score of a party in the 2nd round of the election with whether the corresponding congressional candidate got elected (1 or 0). That gives us 1150 pairs of values, which we throw into the mathematical cooking pot.

And what we get is the following formula:
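The exact fitted coefficients are specific to this regression, but the curve has the standard logistic shape, with an intercept a (negative here) and a slope b (positive):

p(x) = 1 / (1 + e^(-(a + bx)))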

where x is the score in the previous election (between 0 and 1). As you can see, when x gets close to 0, the denominator becomes a very large number and the probability quickly drops to virtually nothing; conversely, when x gets close to 1, the denominator becomes very close to 1, so the probability rises up to 1 equally fast.

With this and that in place, it is possible to come up with a reasonable estimate of the chances of any candidate based on the recent results. As an aside, the current Prime Minister has renewed the tradition, started by his predecessor, of asking ministers to seek office and forcing them to step down if they fail to win their district. As a result, 24 out of the 37 ministers are campaigning. Out of those 24, 2 are taking very serious risks according to this model: Marie-Anne Carlotti and Benoît Hamon.

Finally, geography.

In an ideal world, there would be an abundance of GeoJSON files describing France and its many administrative entities. Usable data must exist somewhere, because the maps on www.elections-legislatives.fr have all been generated (by Raphael.js, says the source code). If I do another project on these elections I might reverse engineer those maps to extract the coordinates.

Without such a dataset, the work of drawing the boundaries of 577 districts is just huge. However, accuracy is not required, as I'm only putting the districts on a map so people can look up where they live or places they know. In my previous work, in order to let users change the composition of the districts, I wanted to be rigorous in the placement of everything, but here we can live with imperfection.

So I am using the same principle as I did before: Voronoi tessellation.

For each district I am picking the largest city, for which I have the coordinates. But most large cities belong to several districts. So I am adding random noise to each point. Then, I am drawing shapes around them.
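In code, that step looks roughly like this, assuming an array of projected [x, y] points (one per district, the coordinates of its largest city) and the svg container from the other examples; the jitter amplitude is arbitrary:

// nudge each point a little so that no two districts share the exact same location
var jittered = points.map(function(p) {
  return [p[0] + (Math.random() - 0.5) * 2, p[1] + (Math.random() - 0.5) * 2];
});

// d3.geom.voronoi gives one polygon per point: the area closer to it than to any other
svg.selectAll("path").data(d3.geom.voronoi(jittered)).enter()
  .append("path")
  .attr("d", function(d) { return "M" + d.join("L") + "Z"; });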

That would normally fill a rectangle, so in order to make it look like France, I have drawn a clipping mask on top of it (that, I’ve done by hand, picking coordinates of the outline of France).

That about wraps it up!

Parity: is it now?

Click on the image to go to the interactive application

I got the idea for this map when I managed to get my hands on the presidential election results by legislative circonscription (district). In 2007, the second-round results were a very good predictor of the congressional elections that followed: a district in which a presidential candidate got even just 52% in the second round has more than a 75% chance of being won by a député of the same party.
And that's even though everyone agreed that in 2007 the left had run a very good campaign and had contained the blue wave. The probabilities are surely even higher now.

Yet almost 80% of the districts were won with a score above 52%, which is a very comfortable margin. In more than a quarter of them, the winner even collected more than 60% of the votes…

In short: in most districts, there won't be much suspense. Where I vote, we rarely even get a second round.

As with redistricting, I find this a bit disturbing. The election is not so much the encounter between a person and a population that chooses them, but mostly the work of a party placing its pawns, especially once you add the "electoral agreements". I believe I have less of a chance of meeting the challenger candidate in my district than of seeing François Hollande or Nicolas Sarkozy "in the flesh".

So why not take advantage of this to get closer to gender parity in the Assembly?

First, why this isn't already the case: if a party does not field as many women as men, it gets fined. Or rather, as Alexandre Léchenet explains well, it loses funding. But the calculation method is biased. A party receives a certain amount per vote at the national level, and that amount is then reduced if women represent less than half of its candidates. To counter this gaming of the system, it would have been wiser to base it on the proportion of votes won by women, not on the share of women fielded at the start.

So women are sent to fight losing battles: they are placed in districts that are impossible to win, just to avoid the fine. There are almost 100 female candidates who find themselves up against the representative of a party that got more than 55% in the second round.

In Paris, for example, Annie Novelli challenges Claude Goasguen in the 14th district (which voted 77% for Sarkozy), and it is Agnès Pannier who faces Bernard Debré in the 4th (75% for Sarkozy). Meanwhile, Roxane Decorte takes on Daniel Vaillant in the 17th district (which voted 72% for Hollande).

And at least those women could, in theory, win; but in 238 districts, that is more than 40% of them, neither the PS nor the UMP nominated a woman at all, so there is no risk there.

As a result, even though 40% of the candidates are women, the number of women elected should end up around 175, or 28%. That would still be almost 70 more than today, despite the cynicism of the current head of the UMP. Let's hope they will be able to go to the Assembly dressed however they like.

For the record, women make up 51.5% of the population of France.

Making-of: cutting Paris in voting districts

Hi, in my previous post I showcased one of my recent projects. I really enjoyed building it and so would like to share how this has been done.

First, getting the data. I already scraped the results of both rounds of the presidential election by city. The districts for the congress election are also known, but it’s not possible to do a match, because large cities are almost systematically broken down into several such districts. Paris, for instance, will be represented by no less than 18 députés.

So I needed the results by the finest possible unit, that is by individual polling station. During the election night these results are compiled by city and centralized, so you would assume that the raw data of each polling station is available somewhere. That is not the case, unfortunately. Although it seems that they will be made public eventually, that may not be the case before the June 2012 election.

Fortunately, Open Data Paris had the results by polling station. Better still: it had their addresses and a matching of every inhabited building in Paris to its corresponding polling station.

To map the polling stations, my first intuition was to create a Voronoi tessellation of their projected, geocoded coordinates (I only had their addresses in the raw data file). In short, Voronoi polygons can be generated for a certain number of control points, and each corresponds to the area nearer to its control point than to any other. So it's a good approximation of the area which corresponds to a given polling station.

Problem: several polling stations could be at the same address, and for the Voronoi algorithm the control points have to be distinct. So I tried jittering them (adding random noise to each one). A tessellation could be done that kind of looked like Paris, but the voting districts looked messy, as there were frequent inversions between neighboring districts.

So I had to come up with a better approximation of what part of the city corresponded to what voting district. I used the address-to-polling-station correspondence, and for each polling station I took the first and the last street number of any street that was covered by it. Then I geocoded the whole lot. That's about 16000 points. It took some time.

Here's my polling station as an example.

Then, for each polling station area, I took the minimum and maximum longitude and latitude, which formed a bounding box, and assigned the polling station to the center of that box. Then, I used tessellation again.
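That amounts to something like the following sketch (the names are made up for the example; coords holds the geocoded [longitude, latitude] pairs of one polling station):

// bounding box of all the geocoded addresses of one polling station,
// and the center of that box, used as the station's control point
function boundingBoxCenter(coords) {
  var lons = coords.map(function(c) { return c[0]; }),
      lats = coords.map(function(c) { return c[1]; });
  return [(d3.min(lons) + d3.max(lons)) / 2,
          (d3.min(lats) + d3.max(lats)) / 2];
}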

I found a number of oddities in the geocoding that I had to correct manually, because if one address was not accurately coded, chances are it would change the shape of the bounding box drastically, and so the position of its corresponding polling station. Sometimes the geocoding service wouldn't find the street and/or would use a street of the same name in another city; sometimes it did find the street but the coordinates were way off… So the dataset required a lot of massaging before it got into shape.

The last geographic errand I had to do for this visualization was to create a perimeter of Paris to use as a clipping mask, else the tessellation would be done on a rectangular shape with the edge polygons being very large and very skewed. So I collected coordinates of points around Paris to create one polygon. Only what's inside this polygon is shown (.style("clip-path") in d3).

Once the data had been acquired, building the rest of the datavis was nothing special. I used mouseover and click events extensively to trigger transformations, as I always do, although this time I did prepare a lot of rules.

Originally I wanted to do the whole of France like this, though it would be difficult: one, to get the data, and two, to get it into shape. As of today the location (i.e. street address) of most of the polling stations is not available online, so even if we got the number of votes for each of the polling stations (there should be about 40000 of them) the geographic part of the problem would remain unsolved. Still, it's a worthy endeavour. While the election results have little interest at a macro-geographic level – by region or by département – they are very useful at a very fine level, as strategies can be constructed.

For instance, it's worthwhile to send heavyweights to conquer districts that are winnable, but it's a waste to keep them in their respective fiefdoms if victory in these districts is already certain. Also, when districts have to be redefined, having this kind of information can be invaluable to the political force which gets to draw their new limits, or to their opponents.

Cutting Paris into circonscriptions

My latest project lets you see the results of the presidential elections in Paris by polling station, and project them onto the circonscriptions (districts) that will be used for the congressional elections of June 2012.

Above all, it lets you change the composition of these districts, whose current boundaries are fairly arbitrary. There will be 18 districts in Paris, down from 21 today, and they do not follow the arrondissements.

The way these districts are drawn is decisive for the outcome of the elections. Today, for example, there are two districts where Nicolas Sarkozy collected more than 75% of the votes in the 2nd round of the presidential election; I imagine the left does not hold out much hope of winning those back. Likewise, there are no fewer than 9 districts where François Hollande received more than 60% of the votes. As things stand, 12 districts look safe for the left and 6 for the right, 3 of which could perhaps still be won by the left.

The current division is optimal neither for the left nor for the right. By modifying the boundaries of the districts, the left could win all of them, and the right could win 12 out of 18 (or perhaps more, 12 being my personal high score). To favor one side, the idea is to spread the most favorable polling stations across as many districts as possible, rather than concentrating them in a few. Generalize this across the whole country and you can imagine what it could produce!

So whatever the current mood, one redistricting or another can completely reshuffle the deck. It is an unsettling thought, because these redistrictings happen regularly and are relatively opaque. Incidentally, it is quite difficult to link the presidential election data to the congressional districts, because the results are only rarely available by polling station.

I will give the technical details of the implementation in a future post.

Discovering the presidential election results with parallel coordinates

A presidential election means results.

And results mean visual representations in the media.

And therefore, more often than not, maps.

That said, a map does not teach us much about what really happened in an election. True, it is useful for finding your region or your city, but the results of two geographically close cities (for example, Boulogne-Billancourt and Issy-les-Moulineaux) can have strictly nothing in common.

What's more, with a map it is very difficult to answer certain questions, such as: where did people vote most for Nicolas Sarkozy? What happened in the cities that voted heavily for Marine Le Pen or François Bayrou? Where does the explosion of blank votes in the second round come from?

To answer these questions, we can use parallel coordinates.

Each vertical axis corresponds to a possible vote – first those of the first round, then those of the second. Each colored line corresponds to a city (or to a département, a region, or France as a whole). Each of these lines crosses each axis at a height that corresponds to the proportion of people who made that particular choice. For example, the cities that voted heavily for Jean-Luc Mélenchon, such as Bagnolet or Gennevilliers, end up towards the top of the middle axis.

When the cursor gets close to an axis, it becomes a cross. You can then drag it along the axis to draw a rectangle and highlight all the lines that pass through that rectangle. For example, here are the cities that gave more than 70% of their second-round votes to François Hollande:

You can draw as many rectangles as you want. For example, you can keep only the places that had supported Bayrou (say, at more than 10%).

And voilà: only 2 of them are left.

To remove a selection, just click near the axis without dragging. If you click on a rectangle, you can also slide it along the axis.

With this technique, you can immediately see the strongholds of any given candidate: they are the lines that touch the top of the chart.

You can also answer more complex questions. For example, I mentioned the explosion of blank votes in the second round. What happened?

Let's select the places where blank votes exceeded 4% in the second round, while staying below 1.5% in the first round:

We see three peaks: the cities where Marine Le Pen, Jean-Luc Mélenchon or François Bayrou got a very big score. You can keep selecting and see that it is indeed the cities at the tip of one of the three triangles, and not those that pass through the bottom, that end up in this situation. Those voters preferred to cast a blank ballot rather than choose.

Happy exploring!

http://www.jeromecukier.net/projects/elections/dtour2012.html

See#7 conference


Last week, I had the privilege of attending the See#7 conference in Wiesbaden. I wrote a quick post summarizing my immediate feelings on the way back, so here's a more detailed follow-up.

Three things to know about the conference. It's one of the main events on visualization in Europe. The main event takes place in a church which seats hundreds.

Also, the conference is not called die Konferenz zur Visualisierung von Information for no reason. It’s mostly German-centric.

Finally, the conference's approach to talking about visualization is to take a step back, by inviting experts who do not work in information visualization proper but offer an interesting perspective on the field from their own point of view.

Videos from all talks will be made available shortly.

The first speaker was Dr Thomas Henningsen from Greenpeace, and his talk was about how the organization uses visual impact to further its agenda. Greenpeace indeed goes to considerable lengths to take the one picture that shows that what they are fighting is not an abstract possibility but something very tangible. Sometimes they create the picture, as in this example,

"If the planet were a bank, you would have saved it long ago"

sometimes they capture a certain moment, but they also use charts and data to make their point.

Next was Prof. Dr. Norbert Bolz. From what I heard from many of the native German speakers, this was the highlight of the day. As my German is somewhat rusty and as he didn't use visual aids, I admit I missed a lot of it. Prof. Bolz is presented on the See conference site as a media scholar and many of his books were on display at the conference. His talk was about how one can inform and be memorable through images. So here are the parts I grabbed. To be remarkable, a piece of information has to be new; it has to be about something that the reader didn't know. This is very different from being important. A lot of the talk was also about the memorability of images (Prägnanz), the idea that images cannot be cancelled.

The next speaker was Stefanie Posavec, who describes herself as a data illustrator. Like most people, you may think of data visualization as a supervised but automated process by which a computer generates an image based on data and a set of rules. Then enters the data illustrator, who works entirely by hand.

Perhaps Stefanie’s best-known work is the representation she’s done of Kerouac’s On the Road:

which most people assume is a fine example of generative art. Wrong: every. single. element. is. placed. by. hand.

Stefanie took us through her project and shared how she works, how she collects data and encodes it (she did bring a computer).

Here’s a picture I took at the workshop the next day along with some of her sketches.

Stefanie always seems to be apologetic that she does not write code, which is ironic considering that this approach is what makes her work unique. During the workshop, she took us through specifications that she had written for a developer to create an interactive visualization, and they were insanely detailed. Everybody who writes code would be really thrilled to be able to rely on such a structured document!

The next speaker was Ben Kreukniet from UnitedVisualArtists. Now UVA may not be a household name, but how about Red Hot Chili Peppers or U2? Remember how everyone was talking about the gigantic scenic structure on U2's last tour, which AFAIK was the highest-grossing concert tour ever? That was UVA's work. They are lighting artists who specialize in large installations, and by large they mean friggin' epic.

Ben’s talk took us through their work, with a focus on their “origin” project:

Origin is a gigantic cube of light and sound which is made to interact with an audience. During the workshop the next day, UVA showed us the tools they work with, which revolve around a platform they call d3 (though not that d3). And we had the privilege of previewing their next work, an advertising campaign to be shown in movie theaters.

The next speaker was Yannick Jacquet from the antiVJ collective. More than a portfolio talk, this was an introduction to what VJing is about for the many of us who only had a vague idea. Basically, VJing is about showing moving pictures. But it doesn’t have to be ugly psychedelic shapes moving on flat screens in night clubs. antiVJ was created in reaction to this reductive view of the field. Through a technique called video-mapping, VJs can use one projector to cast images on many separate surfaces, and with a couple of projectors which can be controlled from just one computer, they can cover very complex geometries.

Next was Michael Madsen. I covered his workshop talk in my previous post, but during the conference he presented his movie “Into Eternity”:

Here's the story. In Finland, law requires that nuclear waste be disposed of within the country. So a company is building a bunker to bury it deep, deep within the earth; eventually, canisters of nuclear waste will be stored 4000 meters underground. In 2100, the facility will reach its capacity, and it will be sealed and expected to remain undisturbed for the next 100,000 years. This is the first human creation designed with such a horizon. Contrary to religious buildings, which are built "forever", everything has been done to give the facility the highest chances of lasting 100,000 years. This led to surprising choices. First, the location of the facility is secret. Once it is sealed, there will be no distinctive mark at all. And yet there is the possibility of a next ice age or a similar global disaster, and with it the possibility that the security will be breached. This is precisely what the movie is about: which decisions were made, why, and with what perspective.

So documentaries are all about telling stories visually, which could also be said of data visualization. Only for documentaries, the angle – a combination of the subject, the approach, the questions that need to be answered – is the result of very elaborate research, which we sometimes do in datavis, and sometimes don't. More often than not it's tempting to just go ahead with a form that agrees with the dataset, though, so this work process is a welcome perspective.

The last talk of the conference was by none other than Manuel Lima, who pioneered visualization blogs with visualcomplexity. Both of his talks (with the exception of the description of his experience at Bing) relate to material in his highly recommended book, Visual Complexity: Mapping Patterns of Information, the See conference talk focusing more on trees and hierarchical displays of information and the second one more on networks.

Trees can be taken literally

This was definitely the closest talk to actual datavis practice. The principles he laid out really come to life with the examples, so I encourage you to watch his talk in its entirety (it should be available on the 12th of May). In passing, in his workshop talk he mentioned that most of his "ancient" examples come from an out-of-print grimoire called The Album of Science; I found one for $4 on Amazon, so you may want to check it out.

That about wraps it up.

Pros of see conference:

  • really cheap to attend from anywhere in Europe – transportation, accommodation and conference fees are super reasonable.
  • big.
  • not just about datavis, but rather on subjects which are particularly interesting for the datavis practitioner.
  • in pleasant Wiesbaden.
  • party-time/conference-time ratio about optimal.
  • workshop is really, really fantastic (and free to attend btw).

Cons:

  • you’re kind of expected to speak German.

Impressions from Wiesbaden

I'm just returning from the 7th See conference on information visualization. I'll do a longer, more descriptive post later (a lot happens in just 2 days) but for now I would like to mention one talk which really moved me. After the conference proper, which takes place in the impressive Lutherkirche, which seats hundreds, the See+ workshop was held in the offices of the organizing agency, Scholz und Volkmer. Most speakers of the main conference came back for a more laid-back discussion with a much smaller audience.

The theme of See+ was tools. Speakers were invited to tell us how they work.

One characteristic of the See conference is that it offers a broader perspective on data visualization. In addition to well-known datavis specialists, such as Manuel Lima this year, other guests include visual artists and designers who also work with data, as well as experts in communication who apply that skill to data.

Michael Madsen is a Danish film-maker who lives in Berlin. In his See conference talk, he presented Into Eternity, a powerful documentary about an incredible facility in Black Mesa, Finland, which is designed to hold nuclear waste for 100,000 years. While the eerie tone of the narration is fascinating, this was perhaps the topic most remote from "traditional" datavis (as in data, graphs and stuff).

Then came his See+ talk which took a turn I didn’t expect.

Pic courtesy of Joshua de Haseth. I'm actually the guy on the right.

Filming a documentary takes at least two years by a conservative count. A filmmaker's first worry is therefore to find what will drive them for that long. More than a 9-to-5 job, more than a simple theme, what they are looking for is a vision, one unique way to treat one unique subject. Michael elaborated on the difference between executing what you are told to do and a fulfilling calling that comes from the self. The flipside is that it is very difficult to go through times with no project. A subject powerful enough to give one a reason to work every day for years does not simply come by; it can take months or even years of doubt before one is found. In Michael's words: when I have no project, I have no identity. But when he does, his thrill is to seek and explore, as he is doing something that has never been done before.

(I’m leaving a lot out which is more directly related to film-making.)

As he was discussing that, I could sense people in the room tune in to this, as I did. We datavis practitioners are all makers, tinkerers, inventors. Fortunately, our work cycles are often shorter and it is easier to start a new one. Certainly, there are patterns and recipes and things that need to be done for work. There are also hacks found on stackoverflow (thanks for that) and inspiration from the works of others (and thanks for the debugging console).

But there is also a "great unknown" in visualization – data that has never been collected, let alone represented, techniques that have never been used, combinations that have never been tried – and things that cannot even be put into words. All of this requires curiosity, independence and dedication. And the outcome may not live up to expectations. But it's just a reminder that to do what we do we must leave our comfort zones and our ways, set off and explore.