Manipulating data like a boss with d3

1) I’ve put all the code examples in
2) I will rewrite this post after I’m done publishing my Visualization with React series, because it’s 4 years old and there are other ways to do this now.

Data is the first D in d3 (or possibly the 3rd, but it’s definitely one of these).

Anyway. Putting your data in the right form is crucial to writing concise code that runs fast and is easy to read (and, later, to troubleshoot).

So what shape should your data be in?
You undoubtedly have many options.

To follow through this tutorial, let’s assume you want to plot the relationship between R&D expenditure and GDP growth for a number of countries. You have got this file, full of tabular data, which lists for every country a name, a continent, the gross R&D expenditure as a percentage of GDP, GDP growth, and for context population and GDP per capita.

So one very basic approach would be to put each of these variables into one independent array.

var GERD=[2.21367, 2.74826, 1.96158, 1.80213, 0.39451, 1.52652, 3.01937, 1.44122, 3.84137, 2.20646, 2.78056, 0.5921, 1.14821, 2.64107, 1.78988, 4.2504, 1.26841, 3.33499, 3.3609, 1.67862, 0.41322, 1.81965, 1.13693, 1.75922, 0.67502, 1.65519, 1.24252, 0.48056, 1.85642, 0.92523, 1.38357, 3.61562, 2.99525, 0.84902, 1.82434, 2.78518];
var growth=[2.48590317, 3.10741128, 1.89308521, 3.21494841, 5.19813626, 1.65489834, 1.04974368, 7.63563272, 2.85477157, 1.47996142, 2.99558644, -6.90796403, 1.69192342, -3.99988322, -0.42935239, 4.84602001, 0.43108032, 3.96559062, 6.16184325, 2.67806902, 5.56185685, 1.18517739, 2.33052515, 1.59773989, 4.34962928, -1.60958484, 4.03428262, 3.34920254, -0.17459255, 2.784, -0.06947685, 3.93555895, 2.71404473, 9.00558548, 2.09209263, 3.02171711];
var GDPcap=[40718.78167, 42118.46375, 38809.66436, 39069.91407, 15106.73205, 25956.76492, 40169.83173, 22403.02459, 37577.71225, 34147.98907, 39389.25874, 26878.00015, 21731.55484, 35641.55402, 40457.94273, 28595.68799, 32580.06572, 33751.23348, 29101.34563, 86226.3276, 15200.22119, 43455.30129, 29870.67748, 57230.89, 19882.99226, 25425.59561, 19833, 24429.61828, 27559.75186, 10497.583, 32779.3288, 41526.2995, 46621.77334, 15666.18783, 35715.4691, 46587.61843];
var population=[22319.07, 8427.318, 10590.44, 33909.7, 17248.45, 10286.3, 5495.246, 1335.347, 5366.482, 62747.78, 82852.47, 11312.16, 9993.116, 308.038, 4394.382, 7623.6, 59059.66, 126912.8, 48988.83, 483.701, 109219.9, 16480.79, 4291.9, 4789.628, 37725.21, 10684.97, 142822.5, 5404.493, 2029.418, 50384.55, 44835.48, 9276.365, 7889.345, 73497, 62761.35, 313232];
var country=["Australia", "Austria", "Belgium", "Canada", "Chile", "Czech Republic", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Israel", "Italy", "Japan", "Korea", "Luxembourg", "Mexico", "Netherlands", "New Zealand", "Norway", "Poland", "Portugal", "Russian Federation", "Slovak Republic", "Slovenia", "South Africa", "Spain", "Sweden", "Switzerland", "Turkey", "United Kingdom", "United States"];
var continent=["Oceania", "Europe", "Europe", "America", "America", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Asia", "Europe", "Asia", "Asia", "Europe", "America", "Europe", "Oceania", "Europe", "Europe", "Europe", "Europe", "Europe", "Europe", "Africa", "Europe", "Europe", "Europe", "Europe", "Europe", "America"];

(don’t bother scrolling, it’s more of the same 🙂 )
Then, you can just create marks for each data item and fetch each attribute independently.
Let’s do a bubble chart for instance.
(small aside: in the post I won’t go through the code to set up the svg container or the scales, instead focusing on the data structures. That code, which is really nothing special, can be found in the source code of the examples).

So to create our circles we would write something like:

svg.selectAll("circle").data(country).enter()
  .append("circle")
  .attr("cx",function(d,i) {return x(GERD[i]);})
  .attr("cy",function(d,i) {return y(growth[i]);})
  .attr("r",function(d,i) {return r(Math.sqrt(population[i]));})

  .style("fill",function(d,i) {return c(continent[i]);})
  .style("opacity",function(d,i) {return o(GDPcap[i]);});


and this works:

See example in its own tab or window
But this is hell to maintain. If there is an error in one of the values, for instance due to a cat or a small child in the proximity of the computer, it will be very difficult to troubleshoot.
Another problem is that it’s very difficult to apply any kind of subsequent treatment to the data. For instance, you will notice that there are smaller bubbles entirely within the large orange bubble, which happens to be on top of them. So it’s not possible to mouseover the smaller bubbles. One way to address that would be to sort the data in order of decreasing population (the size of the bubbles), so that the larger bubbles are drawn first and the smaller ones end up on top. Now, while it is possible to sort 6 arrays according to the values of one, it’s very messy.
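To see why sorting parallel arrays gets messy, here is a minimal sketch in plain JavaScript. It uses three short made-up arrays rather than the six above, and the helper name sortParallel is hypothetical: it works out the sort order once from one array, then has to reapply that order to every other array.

```javascript
// Toy data for illustration only (not the values from the file).
var names = ["A", "B", "C"];
var sizes = [10, 300, 25];
var scores = [1.5, 2.5, 3.5];

// Hypothetical helper: reorder every array by decreasing values of `by`.
function sortParallel(by, arrays) {
  // Build the list of indices [0, 1, 2, ...] and sort it using the reference array.
  var order = by.map(function (d, i) { return i; });
  order.sort(function (a, b) { return by[b] - by[a]; });
  // Apply the same permutation to each array.
  return arrays.map(function (arr) {
    return order.map(function (i) { return arr[i]; });
  });
}

var sorted = sortParallel(sizes, [names, sizes, scores]);
var sortedNames = sorted[0];   // ["B", "C", "A"]
var sortedSizes = sorted[1];   // [300, 25, 10]
var sortedScores = sorted[2];  // [2.5, 3.5, 1.5]
```

Forget to pass one of the arrays and the chart silently pairs the wrong values, which is the maintenance problem described above.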

Ideally, you should have all the values that will be translated graphically within one, single object. You want to have an array of these objects that you will pass to the data method, and be able to write something like:

svg.selectAll("circle").data(data).enter()
  .append("circle")
  .attr("cx",function(d) {return x(+d.GERD);})
  .attr("cy",function(d) {return y(+d.growth);})
  .attr("r",function(d) {return r(Math.sqrt(+d.population));})

  .style("fill",function(d) {return c(d.continent);})
  .style("opacity",function(d) {return o(+d.GDPcap);});

Here, you have just one data source, which is much safer.
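If the data already exists as parallel arrays, a single pass can turn it into an array of objects of this kind. This is a minimal sketch with made-up values; zipColumns is a hypothetical helper, not part of d3.

```javascript
// Hypothetical helper: turns {key: [values...]} columns into an array of row objects.
function zipColumns(columns) {
  var keys = Object.keys(columns);
  var length = columns[keys[0]].length;
  var rows = [];
  for (var i = 0; i < length; i++) {
    var row = {};
    keys.forEach(function (k) { row[k] = columns[k][i]; });
    rows.push(row);
  }
  return rows;
}

// Toy columns for illustration only.
var rows = zipColumns({
  country: ["A", "B", "C"],
  population: [10, 300, 25],
  growth: [1.5, 2.5, 3.5]
});

// With one object per item, sorting is a one-liner and nothing gets out of step.
rows.sort(function (a, b) { return b.population - a.population; });
```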

So if you’re thinking: I know, I should create a variable like this:

var data=[
  {"country":"Czech Republic","continent":"Europe","population":10286.3,"GDPcap":25956.76492,"GERD":1.52652,"growth":1.65489834},
  {"country":"New Zealand","continent":"Oceania","population":4291.9,"GDPcap":29870.67748,"GERD":1.13693,"growth":2.33052515},
  {"country":"Russian Federation","continent":"Europe","population":142822.5,"GDPcap":19833,"GERD":1.24252,"growth":4.03428262},
  {"country":"Slovak Republic","continent":"Europe","population":5404.493,"GDPcap":24429.61828,"GERD":0.48056,"growth":3.34920254},
  {"country":"South Africa","continent":"Africa","population":50384.55,"GDPcap":10497.583,"GERD":0.92523,"growth":2.784},
  {"country":"United Kingdom","continent":"Europe","population":62761.35,"GDPcap":35715.4691,"GERD":1.82434,"growth":2.09209263},
  {"country":"United States","continent":"America","population":313232,"GDPcap":46587.61843,"GERD":2.78518,"growth":3.02171711}
];

and get this done, and furthermore if you are thinking “Hey, I can do this in Excel from my csv file, with one formula that I will copy across the rows”, you need to stop right now in the name of all that is good and holy.
Even though it works:

See example in its own tab or window

This approach has a number of flaws, all of which you can avoid if you read on.
First, the execution of your program will be stopped while your browser reads the source code that contains the “data” variable. This is negligible for 36 rows, but as objects get bigger and more complex, an equivalent variable may take seconds or even minutes to load. And now we have a problem.
That’s a problem for your users. Now to you: creating a JSON variable from tabular data is tedious and error-prone. The formula editing interface in Excel doesn’t really help you spot where you have misplaced a quote or a colon. As a result, this is very time-consuming.

Don’t do that: there is a much simpler way.

Enter the d3.csv function.

d3.csv("data.csv",function(csv) {
  // we first sort the data

  csv.sort(function(a,b) {return b.population-a.population;});

  // then we create the marks, which we put in an initial position

  svg.selectAll("circle").data(csv).enter()
    .append("circle")
    .attr("cx",function(d) {return x(0);})
    .attr("cy",function(d) {return y(0);})
    .attr("r",function(d) {return r(0);})

    .style("fill",function(d) {return c(d.continent);})
    .style("opacity",function(d) {return o(+d.GDPcap);})

    // each circle gets a title element with the name of the country
    .append("title")
      .text(function(d) {return d.country;});

  // now we initiate - moving the marks to their position

  svg.selectAll("circle").transition().duration(1000)
    .attr("cx",function(d) {return x(+d.GERD);})
    .attr("cy",function(d) {return y(+d.growth);})
    .attr("r",function(d) {return r(Math.sqrt(+d.population));});
});

Here’s how it works.
You give d3.csv the location of a csv file (which we had all along) and a function to run on the array of objects (what we always wanted) that d3.csv builds from the file, using the first row as keys.
In other words, once inside the d3.csv function, the “csv” variable holds what we assigned to “data” earlier, with one major difference: we didn’t have to manufacture this variable or do any kind of manual intervention, so we are certain it corresponds to the file exactly.
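To make the “first row as keys” idea concrete, here is a rough sketch of that conversion in plain JavaScript. It is not how d3.csv is implemented: parseSimpleCsv is a hypothetical helper that ignores quoting and escaping, and the sample text only mimics a few columns of the file.

```javascript
// Hypothetical helper: naive CSV-to-objects conversion (no support for quoted fields).
function parseSimpleCsv(text) {
  var lines = text.trim().split("\n");
  // The first line provides the keys used for every object.
  var keys = lines[0].split(",");
  return lines.slice(1).map(function (line) {
    var cells = line.split(",");
    var row = {};
    keys.forEach(function (k, i) { row[k] = cells[i]; });
    return row;
  });
}

var sampleText =
  "country,continent,GERD\n" +
  "Chile,America,0.39451\n" +
  "Japan,Asia,3.33499\n";

var parsed = parseSimpleCsv(sampleText);
// parsed[0] is {country: "Chile", continent: "America", GERD: "0.39451"}
// Note that GERD is the string "0.39451", not a number.
```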

One nice thing with this method is that since your variable is not explicitly in the source code, your browser can read it much faster. The data is only read when the d3.csv function is called, as opposed to the previous approach where the entirety of the source code (including the data) had to be read before the first statement could be executed. Of course, it only makes a difference when the data size is significant. But using the d3.csv approach would let you display a “loading data” warning somewhere on your page, and remove it when inside d3.csv. Much better than a blank page.
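The “loading data” idea can be sketched as follows. Everything named here is hypothetical: loadWithNotice is an illustrative wrapper, the setNotice argument stands for whatever updates the page, and fakeLoader stands in for a call such as d3.csv so that the sketch runs without a server.

```javascript
// Hypothetical wrapper: show a notice, start loading, clear the notice when data arrives.
function loadWithNotice(loader, setNotice, onData) {
  setNotice("loading data...");
  loader(function (rows) {
    setNotice("");
    onData(rows);
  });
}

// Stand-in for the real loader. In the page it could be, for instance:
//   function (callback) { d3.csv("data.csv", callback); }
// (assuming the one-argument callback style used in this post).
function fakeLoader(callback) {
  callback([{country: "Chile"}, {country: "Japan"}]);
}

var notices = [];      // records every notice, in order
var loadedRows = null;

loadWithNotice(
  fakeLoader,
  function (message) { notices.push(message); },
  function (rows) { loadedRows = rows; }
);
```

In a real page, the second argument would typically set the text of some element instead of pushing to an array.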

Three caveats with this method.

  • This will no longer work in a local file system (ie opening a file in the browser). The resulting file can only run on a webserver, which can be local (ie the page has a url).
  • Whatever happens within the d3.csv function is no longer in the global scope of the program. This means that after the program has run its course you cannot open the javascript console and inspect the value of “csv”, for instance. This makes these programs slightly more difficult to debug (there are obviously ways, though).
  • Everything read from the file is treated as strings. Javascript does a lot of type conversion but be mindful of that or you will have surprises. This is why I wrote x(+d.GERD) for instance (+ before a string converts it to a number).
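The third caveat is the one most likely to bite, so here is a small sketch of the surprises it refers to. The row is a hand-written example of what one parsed line could look like.

```javascript
// A row as it would come out of a CSV file: every value is a string.
var row = {country: "Chile", population: "17248.45", GERD: "0.39451"};

// "+" between two strings concatenates instead of adding.
var concatenated = row.population + row.GERD;   // "17248.450.39451"
// Unary + converts each string to a number first.
var added = +row.population + +row.GERD;

// Strings compare character by character, so "9" ranks after "10".
var stringCompare = "9" > "10";    // true
var numberCompare = +"9" > +"10";  // false

// Subtraction converts implicitly, which is why a comparator such as
// b.population - a.population still sorts correctly on strings.
var difference = "10" - "9";       // 1
```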

To celebrate this superior way of acquiring data, we’ve thrown in animated data entry: the circles are initiated at a default value and move towards their position. You may want to check the link to see the transition effect.

See example in its own tab or window

So, at the level of the mark (ie our circles) the most comfortable form of data is an object with at least as many keys as there will be graphical properties to change dynamically.
One flat array of data is fine if we have just one series of data. But what if we have several series? Indeed, most visualizations have a structure and a hierarchy.
So let’s proceed with our data but now let’s assume that we want to show values for different continents as different little scatterplots (“small multiples”).

  • we’ll want to add 5 “g” groups to our svg container, one for each continent,
  • and then add one dot per country in each continent to those groups.

Our flat array won’t work so well then. What to do?

The d3 answer to this problem is the d3.nest() set of methods.
d3.nest() turns a flat array of objects, which thanks to d3.csv() is a very easily available format, into an array of arrays with the hierarchy you need.
Following our intuition, wouldn’t it be nice if our data would be:

  • An array of 5 items, one for each continent, so we could create the “g” groups,
  • And if each of these 5 items contained an array with the data of all the corresponding countries, still in that object format that we love?

This is exactly what d3.nest() does. d3.nest(), go!

var data=d3.nest()
  .key(function(d) {return d.continent;})
  .sortKeys(d3.ascending)
  .entries(csv);

With the .key() method, we are indicating what we will be using to create the hierarchy. We want to group those data by continent, so we use this syntax.
.sortKeys is used to sort the keys in alphabetical order, so our panels appear in the alphabetical order of the continents. If we omit that, the panels will show up in the order of the data (ie Oceania first as Australia is the first country). We could have avoided that by sorting the data by continent first before nesting it, but it’s easier like this.
Here, we just have one level of grouping, but we could have several by chaining several .key() methods.
The last part of the statement, .entries(csv), says that we want to do that operation on our csv variable.
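For intuition, the grouping step can be imitated in a few lines of plain JavaScript. This is only a sketch of the output shape, not d3’s implementation; nestBy is a hypothetical helper and the rows are a small subset of the file.

```javascript
// Hypothetical helper: group rows into [{key, values}], with keys in alphabetical order.
function nestBy(rows, keyOf) {
  var groups = {};
  rows.forEach(function (row) {
    var k = keyOf(row);
    if (!groups[k]) { groups[k] = []; }
    groups[k].push(row);
  });
  return Object.keys(groups).sort().map(function (k) {
    return {key: k, values: groups[k]};
  });
}

var sampleRows = [
  {country: "Australia", continent: "Oceania"},
  {country: "Chile", continent: "America"},
  {country: "Japan", continent: "Asia"},
  {country: "Canada", continent: "America"}
];

var nested = nestBy(sampleRows, function (d) { return d.continent; });
// nested[0] is {key: "America", values: [the Chile row, the Canada row]}
```

Within each group, the rows keep the order they had in the source array, which is why sorting the flat array before nesting still matters.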

Here is what the data variable will look like (an excerpt, showing part of the “America” group):

[
  ...
  {"key":"America","values":[
    {"country":"United States","continent":"America","population":"313232","GDPcap":"46587.61843","GERD":"2.78518","growth":"3.02171711"},
    {"country":"Canada","continent":"America","population":"33909.7","GDPcap":"39069.91407","GERD":"1.80213","growth":"3.21494841"},
    {"country":"Chile","continent":"America","population":"17248.45","GDPcap":"15106.73205","GERD":"0.39451","growth":"5.19813626"}
  ]},
  ...
]

Now that we have our data in an ideal form let’s draw those marks:

  // One cell for each continent
  var g=svg.selectAll("g").data(data).enter()
    .append("g")
    .attr("transform",function(d,i) {return "translate("+(100*i)+",0)";});
  // we add a rect element with a title element
  // so that mousing over the cell will tell us which continent it is
  // (its position and size attributes are in the full source)
  g.append("rect")
    .append("title")
      .text(function(d) {return d.key;});
  // we also write its name below.
  g.append("text")
    .text(function(d) {return d.key;});
  // now marks, initiated to default values (set in the full source)
  g.selectAll("circle")
  // we are getting the values of the countries like this:
    .data(function(d) {return d.values;})
    .enter()
    .append("circle")
      // throwing in a title element
      .append("title")
        .text(function(d) {return d.country;});

  // finally, we animate our marks in position
  g.selectAll("circle").transition().duration(1000)
      .attr("cx",function(d) {return x(+d.GERD);})
      .attr("cy",function(d) {return y(+d.growth);})
      .style("opacity",function(d) {return o(+d.GDPcap);});

(you may want to click on the link to see the transition effect and read the full source).

See example in its own tab or window

This is all very nice but wouldn’t it be better if we could characterize some aggregate information from the continents? Let’s try to find out the average values for R&D expenditure and GDP growth.

Can it be done easily? This is a job for the other main d3.nest method, rollup.

rollup is the aggregating function. Here’s an example.

var avgs=d3.nest()
    .key(function(d) {return d.continent;})
    .sortKeys(d3.ascending)
    .rollup(function(d) {
      return {
        GERD:d3.mean(d,function(g) {return +g.GERD;}),
        growth:d3.mean(d,function(g) {return +g.growth;})
      };
    })
    .entries(csv);

Remember how the combination of .key() and .entries() rearranges an array into arrays of smaller arrays, depending on these keys? Well, the value that is being passed to the function inside the rollup method is each of these arrays (ie an array of all the objects corresponding to countries in America, then an array of all the objects corresponding to countries in Europe, etc.)
Also, if we use sortKeys in our previous nesting effort we’d better use it here too, so that both arrays list the continents in the same order.
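Here is what the variable will look like: an array with one entry per continent, each holding the continent name under “key” and the two averages under “values”. Africa, for instance, only contains South Africa, so its entry is simply:

{"key":"Africa","values":{"GERD":0.92523,"growth":2.784}}
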
Incredible! just the values we need.
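For readers who want to see the aggregation spelled out, here is a plain JavaScript sketch that produces the same kind of structure. It is not d3’s implementation: meanOf and rollupBy are hypothetical helpers, and the rows are made up so that the averages are easy to check by hand.

```javascript
// Hypothetical helper: average of a numeric field over an array of rows.
function meanOf(rows, field) {
  var total = 0;
  rows.forEach(function (row) { total += +row[field]; });
  return total / rows.length;
}

// Hypothetical helper: one {key, values} entry per group, values being an aggregate.
function rollupBy(rows, keyOf, aggregate) {
  var groups = {};
  rows.forEach(function (row) {
    var k = keyOf(row);
    (groups[k] = groups[k] || []).push(row);
  });
  return Object.keys(groups).sort().map(function (k) {
    return {key: k, values: aggregate(groups[k])};
  });
}

// Made-up rows, with strings as they would come from a CSV file.
var toyRows = [
  {continent: "Europe", GERD: "1", growth: "2"},
  {continent: "Europe", GERD: "3", growth: "4"},
  {continent: "Asia", GERD: "5", growth: "-1"}
];

var toyAvgs = rollupBy(toyRows, function (d) { return d.continent; }, function (group) {
  return {GERD: meanOf(group, "GERD"), growth: meanOf(group, "growth")};
});
// toyAvgs[0] is {key: "Asia", values: {GERD: 5, growth: -1}}
// toyAvgs[1] is {key: "Europe", values: {GERD: 2, growth: 3}}
```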
Now it’s just a matter of adding them to the sketch. Two little additions here:

  // we add 2 lines for the average. They will start at a default value
  // (their initial coordinates are set in the full source).
  g.append("line").classed("growth",true)
    // we give these lines a title for mouseover interaction.
      .append("title").text(function(d,i) {
        return "Average growth:"+avgs[i].values.growth;
      });
  g.append("line").classed("GERD",true)
      .append("title").text(function(d,i) {
        return "Average GERD:"+avgs[i].values.GERD;
      });

  // we also animate the lines
  g.select(".growth").transition().duration(1000)
    .attr("y1",function(d,i) {return y(avgs[i].values.growth);})
    .attr("y2",function(d,i) {return y(avgs[i].values.growth);});
  g.select(".GERD").transition().duration(1000)
    .attr("x1",function(d,i) {return x(avgs[i].values.GERD);})
    .attr("x2",function(d,i) {return x(avgs[i].values.GERD);});
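This code pairs each group with its averages by position (avgs[i]), which only holds while both nests list their keys in the same order. One possible alternative, sketched here with made-up values, is to index the averages by key; avgByKey and averageGrowth are hypothetical names.

```javascript
// Made-up averages in the {key, values} shape produced by the rollup.
var avgsExample = [
  {key: "Asia", values: {GERD: 5, growth: -1}},
  {key: "Europe", values: {GERD: 2, growth: 3}}
];

// Build a lookup table from continent name to its averages.
var avgByKey = {};
avgsExample.forEach(function (entry) { avgByKey[entry.key] = entry.values; });

// An accessor can then rely on d.key instead of the index i, for instance:
//   .attr("y1", function (d) { return y(avgByKey[d.key].growth); })
function averageGrowth(d) { return avgByKey[d.key].growth; }

var europeGrowth = averageGrowth({key: "Europe"});  // 3
```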

This is the final example – again you may want to click on the link to see the transition and get the entirety of the source.

See example in its own tab or window

To wrap up:

  • At the mark level, you want to have objects with as many properties as you need graphical variables (like x, y, fill, etc.).
  • Using d3.csv() and a flat file will make this easy (d3 also provides functions like d3.json or d3.xml to process data in another format).
  • d3.nest can help you group your entries to structure your data and create more sophisticated visualizations.
  • rollup can be used to aggregate the data grouped using d3.nest.



34 thoughts on “Manipulating data like a boss with d3”

  1. Excellent tutorial Jerome – while CSV doesn’t provide “spontaneous” data like a MySQL query, it does improve considerably on the manual entry approach. I suppose the CSV could be designed with enough flexibility and detail to simulate a database, using various filters in d3.

  2. Hi, this is great. I don’t like the variable way either.
    How can this be done using d3.xml? That is, loading data from an XML file, to a d3 tree layout/dendrogram.

    1. I don’t use d3.xml often, but the idea is that when you write something like d3.xml(“http://path/to/my/xmlfile.xml”, function(xml) {…})
      the xml variable inside the function call will hold the entire xml document, which you can then parse.

  3. Great tutorial on CSV. Everywhere else I got basic data structure like arrays or json data to create d3 visualization. Yours is the best tutorial on explaining how to generate d3 visualizations quickly using csv file. Great job, helped me a lot!

  4. This is very helpful! One question, though: I’ve tried to get this to work on partition-icicle-zoom. And to get it to work, I have to manipulate the csv to replace key -> name and values-> children. Is this possible in a newbie-kind-of-way?
    So that:

    Will be


    1. in that particular code you have this line:

      .attr(“fill”, function(d) { return color((d.children ? d : d.parent).name); })

      which is the reason why it expects “name” instead of “key”. you can change name into key here and keep key in the data, or do the opposite.

  5. Merci !
    Though the D3 API is well documented, it lacks examples for simple mortals like me.
    Did you learn D3 just by reading the API, or did you initially get help on message boards? Do you have some advices to understand the API better ? One example of something I found not obvious: the nest.rollup method – it’s quite easy when you see the example in this tutorial, but from the API you kinda have to guess what parameters get passed to the function (though one could argue it’s pretty logical, it becomes logical only when you are already a bit familiar with D3).

  6. Etienne, there is now an O’Reilly book on D3 from Scott Murray who blogs at

    “Interactive Data Visualization for the Web
    An Introduction to Designing With D3″

  7. Very useful tutorial!
    I’d like to ask two questions:
    first :can I rename the attributes GERD and growth,say , as avg_GERD or avg_growth instead of using the old name? (Think of the AS keyword in SQL )

    second question: Am I forced to group previously all the countries in order to average the same attributes (in this case, GERD and growth) over the array of countries in a continent? I’m supposed to work on a similarly nested structure and I’ve some trouble since I can’t define an accessor for the d3.mean() function.

    Thanks in advance!

    1. 1) yes.
      var avgs=d3.nest()
      .key(function(d) {return d.continent;})
      .rollup(function(d) {
      return {
      a:d3.mean(d,function(g) {return +g.GERD;}),
      b:d3.mean(d,function(g) {return +g.growth})
      };
      })
      .entries(csv);

      would work. (or any valid key name instead of a and b).
      2) technically, no. but you may want to.
      For any data operation in d3 or javascript in general, and this is something I could have put in the tutorial, there is the possibility to start from an empty structure (either an empty array or an empty object), loop through the source data and add something to the structure.
      for instance I could write:
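      var avgs={};
      csv.forEach(function (d) {
      var c=d.continent;
      if(avgs[c]) {
      avgs[c].GERD+=+d.GERD; avgs[c].growth+=+d.growth; avgs[c].n++;
      } else {
      avgs[c]={GERD:+d.GERD, growth:+d.growth, n:1};
      }
      });
      d3.keys(avgs).forEach(function(c) {
      avgs[c].GERD/=avgs[c].n; avgs[c].growth/=avgs[c].n;
      });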

      this will produce the same output as before without a nesting function, it’s a bit longer to write but possibly clearer. Less trivial operations are possible during the loop as well.

  8. Thanks a lot for the rapid answer! Eventually I did the same except for the fact I used just plain javascript .. I was curious if one could perform the task inside the .rollup() function. 🙂

    PS:kudos for your great teaching style, I’m learning d3 and this tutorial has been incredibly useful.

  9. bunch of comments here talking about working with a real database however the nice thing about this being from csv is that you can always be pulling the data from your database and be writing csv files with it – this way the data is both downloadable and works with the tutorial quite nicely. That would obv give the user more flexibility to use the data in other ways should they want to or need to.

  10. Hi Jerome. This is incredibly helpful for me. One question – how do I execute your code without knowing the names of the attributes (continent, gdp, etc)? I’m very new to d3 so sorry if this is simple. Your code for reference below:

    d3.csv(“data.csv”,function(csv) {
    // we first sort the data

    csv.sort(function(a,b) {return b.population-a.population;});

    // then we create the marks, which we put in an initial position

    .attr(“cx”,function(d) {return x(0);})
    .attr(“cy”,function(d) {return y(0);})
    .attr(“r”,function(d) {return r(0);})

    .style(“fill”,function(d) {return c(d.continent);})
    .style(“opacity”,function(d) {return o(+d.GDPcap);})

    .text(function(d) {return;})

    // now we initiate – moving the marks to their position

    .attr(“cx”,function(d) {return x(+d.GERD);})
    .attr(“cy”,function(d) {return y(+d.growth);})
    .attr(“r”,function(d) {return r(Math.sqrt(+d.population));})

    1. well, the name of the attributes would come from your data file. so this assumes that you know your data file.
      that being said, for any element of the array, you can do d3.keys() which will return the list of properties of that element. ie d3.keys(csv[0]).

  11. Hi Jerome,

    Great tutorial, this was super helpful for my entrance into d3 with csvs! I’m trying to make a line chart with date as the x-axis, sum of orders as the y-axis, and a line/color for each of my three stores. My dataset has a row for each order, so I need to sum the order by date and store:

    d3.csv(file, function(error, data) {

    color.domain(d3.keys(data[0]).filter(function(key) { return key == “store”; }));

    data = function (d) {
    return {
    date: parseDate(,
    orders: +d.orders };

    data = d3.nest()
    .key(function(d) {return;})
    .key(function(d) { return; })
    .rollup(function(d) {
    return {
    orders: d3.sum(d,function(t) {return +t.orders;})

    I keep getting an error at this line in the code, and though the axis are drawn on the page, there is no data:

    .attr(“class”, “line”)
    .attr(“d”, function(d) { console.log(d);return line(d.values); })
    .style(“stroke”, function(d) { return color(d.key); });

    Can you offer any insight? Thanks in advance!

  12. Hi Jerome,
    I am very new to D3.js and am currently working on a school project using d3 and JSON. Your tutorial was very helpful, so I decided to implement it in my project, since the nested data is very similar to mine, even though mine is a JSON file. At the moment it groups the data correctly, but unfortunately the individual data points, which are represented by the circles, fail to work. I was wondering what went wrong. I would really appreciate it if you could take a look at it.
    here is the link to the project (
    Thanks in advance!!

  13. Hey, thanks for the tutorial. I was a bit put off by the 3 caveats, but they’re actually much more minor than you make them sound:

    1. A really easy way to make something run on a webserver is using site44 – host the files from your dropbox.
    2. Not sure I understand this one. Just define a global variable, then assign values to it in the callback.
    3. Yep.

  14. Many tutorials about how to use json or csv with d3.json or d3.csv, but the simplest idea of using an array or data object is still a mystery…

    For simple applications, updating a json or csv file dynamically is a nightmare. Being able to load an object with the needed data dynamically should be simple, darn it!
