Designing data visualizations

Designing data visualization book cover

Noah Iliinsky and O’Reilly were kind enough to send me one review copy of Noah’s book and who says review copy says review, so here goes.

We need more introductory books to data visualization.

I’ve had several discussions with data visualization colleagues who feel that there are too many books already. I strongly believe otherwise.

As of this writing, there are 59 books tagged data visualization on Amazon, versus well over a thousand for Java (for example). And of those 59, I would say about a dozen qualify as introductory. Here are 3 reasons why introductory books are important.

  • You only need to know a little to start making effective visualizations. A small book won’t teach you all there is to know about visualization, but you don’t need that to get off to a good start. A lot of this has to do with asking yourself the right questions. But this is a very unnatural thing to do, especially when you feel you can do stuff. Fortunately, even a short book can help you to pause and think.
  • An effective visualization is not harder to make than a poor one. Well, actually it is: really good visualizations are built after many iterations on one promising concept. But the point is, a lot of effort and resources can go into abysmal visualizations. If you are in a position to buy visualization, having even basic knowledge of how data visualization works can prevent you from wasting your money.
  • There are many approaches to visualization. The right introductory book will be the one that resonates with you. Some people who are interested in this love to code, some are afraid of programming. Some are accomplished visual artists, some don’t know how to draw. Some have specific needs (business dashboards, presentations, interactive web applications, etc.).

Where does Designing Data Visualizations fit?

Designing Data Visualizations is a very short book – the advantage is that you can read this in a couple of hours. It’s perfect for a train or plane trip for instance. The format of the book (23 x 17.5 cm, flexible paperback) makes it easy to carry and read anywhere. And it’s an easy read – you won’t need to put down the book every few pages to make sure you understood.

The flipside of this is that you won’t learn any actionable skills from the book. The book is never trying to teach you to make things: this is explicitly outside of its scope. What it does is make you think about how to do stuff. It makes you consider the choices you make.

So you’re making a visualization. Does your choice of representation make sense? How about your colors? Placement? If you’re not confident that you know the answer to this kind of question, you must read the book right now; otherwise, you won’t be able to improve your work. And again that is what successful designers do – iterate and improve, again and again and again.

As a non-native speaker of English, one reason why I enjoy reading introductory books is their excellent formulation of things. You know, there are those things you have a vague idea of, and the writer puts the exact words on them. So I’ll go ahead and quote my favorite paragraph:

Consult [your goal] when you are about to be seduced by the siren song of circular layouts, the allure of extra data, the false prophet of “because I can”. These are distractions on your journey. As Bruce Lee would say, “It is like a finger pointing a way to the moon. Don’t concentrate on the finger or you will miss all that heavenly glory”.

Who is this book for?

I think the people who would benefit the most from the book fall into two categories:

  1. Those who know absolutely nothing about visualization but have some interest in the subject. And the subset of those who don’t really have time to find out all about it (think: your client, your n+2 boss). They will appreciate that there is real takeaway value in such a short book.
  2. Those who can create visualizations because, for instance, they are coders, designers, Excel users etc. and who see data visualization as a byproduct of their activity, so they never really asked themselves those questions. And among those, I’m thinking mostly of coders. Noah and I met at last year’s Strata conference, which is also attended by the cream of the crop of data scientists. I was surprised to see that some of them, despite being able to harness huge quantities of data, were severely limited in their visualization options because they never had an opportunity to learn. These people, who are already at ease with the tools, will see their activity supercharged thanks to the book.

For a data practitioner who already has an interest in theory, I won’t lie to you – reading the book will feel like patting yourself on the back and there will be little you will learn. But consider, for instance, giving copies to your customers, and think of all the fruitless discussions that this will save you in the course of a project.
 

Hollywood + data III: our info+beauty awards entry. Bonus: making of.

So Jen and I released our Info+beauty awards entry.

How did we end up with this?

It’s really cool working around movies, because it’s something we can relate to.

A part of my movie ticket stubs stash.

At first I wanted to do something out of keywords we could grab on the movies, but Jen came up with another idea I found more worth pursuing: working around the story types (which was the most interesting aspect of the curated contest dataset) and seeing if there was not some kind of grand truth we could unravel there. She also requested stars and glitter, because we were not going to work on this glamorous dataset with a tedious dashboard done in Excel.

That truth didn’t take so much time to find: the most frequently used story types (like comedy or movies with monsters) do not perform well at the box office, while different story types (stories of teens growing up, or when the main character turns into something else), which are used less often, are much more profitable. So why doesn’t Hollywood make more Junos and Black Swans and fewer College Road Trips or Dylan Dogs?

That’s the idea. Now the making.

Fair warning – the rest of this post is fairly technical. 

Making stars

If I had to contribute significantly to the project it had to be done in d3/svg.

Fortunately, it’s easy to generate star shapes in d3. Once you have the coordinates of where the points of one unitary star should be, you can easily make stars of any size with a function and a parameter.

var c1=Math.cos(.2*Math.PI),c2=Math.cos(.4*Math.PI),
    s1=Math.sin(.2*Math.PI),s2=Math.sin(.4*Math.PI),
    r=1,

    // the constant in the definition of r1 (1.5 here) is the thickness of the branches.
    // 1 is a "straight" star, less is narrower, more is thicker.

    r1=1.5*r*c2/c1,

    // this is the list of the pairs of coordinates of the points that make a star.
    star=[
        [0,-r],
        [r1*s1,-r1*c1],
        [r*s2,-r*c2],
        [r1*s2,r1*c2],
        [r*s1,r*c1],
        [0,r1],
        [-r*s1,r*c1],
        [-r1*s2,r1*c2],
        [-r*s2,-r*c2],
        [-r1*s1,-r1*c1],
        [0,-r]
        ];

// returns the path description of a star of radius k.
var lineStar=function(k) {
	var line=d3.svg.line()
		.x(function(d) {return d[0]*k;})
		.y(function(d) {return d[1]*k;});
	return line(star)+"Z"; // "Z" closes the path, stitching everything together.
};

Now, running lineStar(10) will return the path description of a star with a radius of 10, thusly:

"M0,-10L3.367709824346891,-4.635254915624212L9.510565162951535,-3.0901699437494745L5.449068960040206,
1.770509831248423L5.877852522924732,8.090169943749475L0,5.729490168751577L-5.877852522924732,
8.090169943749475L-5.449068960040206,1.770509831248423L-9.510565162951535,-3.0901699437494745
L-3.367709824346891,-4.635254915624212L0,-10Z"
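
The points of the unit star follow a regular pattern: going clockwise from the top, there is one point every 36 degrees, alternating between the outer radius and the inner one. As a minimal sketch, the same list can be generated with a loop and turned into a path string without d3. The names starPoints and starPath, and the rounding to three decimals, are illustrative assumptions and not part of the contest code.

```javascript
// Illustrative sketch: starPoints and starPath are hypothetical helpers,
// not functions from the original entry.

// Returns the 10 points of a unit star, clockwise from the top.
// thickness plays the role of the 1.5 constant used for r1.
function starPoints(thickness) {
  var inner = thickness * Math.cos(0.4 * Math.PI) / Math.cos(0.2 * Math.PI);
  var points = [];
  for (var i = 0; i < 10; i++) {
    var radius = (i % 2 === 0) ? 1 : inner; // even: tip, odd: inner corner
    var angle = i * 0.2 * Math.PI;          // 36 degrees per step
    points.push([radius * Math.sin(angle), -radius * Math.cos(angle)]);
  }
  return points;
}

// Builds an SVG path description for a star of radius k.
// Coordinates are rounded to 3 decimals to keep the string short.
function starPath(k, thickness) {
  var round = function (v) { return Math.round(v * 1000) / 1000; };
  return "M" + starPoints(thickness).map(function (p) {
    return round(p[0] * k) + "," + round(p[1] * k);
  }).join("L") + "Z";
}
```

Called as starPath(10, 1.5), this should describe the same star as lineStar(10), up to rounding, and changing the second argument gives thinner or thicker branches.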

Placing, moving (and spinning) the stars

The next idea was placing the stars.

And for this we need two things: being able to position them somewhere, and being able to move them easily from point A to point B, ideally with some cool effect in between.

An SVG path has no x and y attributes of its own, so moving a star by hand would mean rewriting the coordinates in its d attribute each time. I found it a better approach to rely on the transform attribute and translate. Each time I want to position a star somewhere, I need it to be set at an x and y coordinate, which will always correspond to either the data of the star, or that of a group above it. For instance, a star corresponding to a movie will need to be at the position corresponding to the data of that movie, or that of the story type above it if it’s still collapsed, or that of the high-level grouping of story types if that’s collapsed.

Now all of the data structures for that are arrays of objects which all have x and y keys. In other words, for any star-shaped object, I can always expect the underlying datum d to have d.x and d.y values. So, I wrote a function translate(d) which works on those 2 properties. And as a result, when I need to position any object all I have to write is:

.attr("transform",translate)

and the object will be positioned according to its underlying data. (This is equivalent to writing .attr("transform",function(d) {return translate(d);}) )

If I need to put them elsewhere, i.e. at the position of their parent, I can pass the data of that parent as an argument, for instance:

.attr("transform",function(d) {return translate(structs[d.struct]);})

For a cheap bit of extra action, I’ve added a spinning effect in the translate function. Since translate(d) returns a value for the transform attribute, nobody said it just had to be instructions for translation! So I’ve added a rotate after the translate. The arguments for the rotate function depend on the x and y properties of the argument as well, so when stars move across the screen, the rotate angle changes slightly with each increment of either coordinate, giving the impression of spinning.
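
As a rough sketch of what such a function can look like: the rotation formula below (the sum of the two coordinates, modulo 360) is an assumption made for illustration, not necessarily the one used in the entry. Any angle that varies with x and y gives the same kind of effect.

```javascript
// Hypothetical version of translate(d): the exact rotation formula
// used in the entry is not shown, so this one is an assumption.
function translate(d) {
  var angle = (d.x + d.y) % 360; // changes whenever x or y changes
  return "translate(" + d.x + "," + d.y + ") rotate(" + angle + ")";
}
```

The order matters: with the translate first, the rotation happens around the star’s own center rather than around the origin of the drawing.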

Explosions, starlets and other effects

Most of the cool things happening in the visualization rely on one very simple principle about d3 transitions: chaining them.
In the code you’ll often find this pattern:

.selectAll("someobject").data(...).enter().append(...) // creates the items
... // sets the initial attributes
...
.transition()
... // change the attributes
...
...
...
.each("end", function() { // stuff to be done on each item after the transition is over

and within that function, you’ll find either:
another transition which starts exactly when the previous one ends, so for instance opacity can decrease (causing a fading effect): d3.select(this).transition()…

or a command to remove the object: d3.select(this).remove().

When another transition is called, there can be another one after, then another one, then another one, then eventually the object can be removed (or not).

Now you may think of transitions as ways to get one object to change smoothly from state A to state B, like a rectangle moving across the screen. But if you start to think that the objects can be discarded after the transitions, you’ll realize that there is an unbelievable number of things that can be done with them.
For instance, upon clicking on some stars, I am creating another star shape at that same location. Initially it has the same size as the star, but I increase its radius to a large number (1000px) while decreasing its opacity to 0. So it seems that the new star is both exploding and fading. When it has become transparent I remove it.

gStructs.append("svg:path") // here I'm creating a "path" shape
.style("stroke","none") // with no outline
.style("fill",colorXp)  // with the fill color of the explosion
.style("opacity",0.2)  // and a low opacity to start with (translucent)
.attr("d",lineStar(d.size[sizeAxis])) // I give it the shape of a star and the size of the
                                      // star that's being clicked
.attr("transform",translate(d)) // and I position it on that star

.transition() // action!

.duration(500)	// a 500ms transition. Long enough to see the effect.
.attr("d",lineStar(1000)) // the star expands to a radius of 1000.
.style("opacity",0) // while fading to transparency.

.each("end",function() {d3.select(this).remove();}) // and when it's done - it's removed.

Changing axes

In this visualization I let the user change what’s plotted along the axes. It’s not very difficult to do, but it’s a hassle to do it late in the project, as was our case, because it requires a lot of housekeeping. This is really about the data structures that support our items. Instead of having just one value for x, y and size, they have an object with several keys, one per axis. Then we maintain one variable per axis type, so everywhere we would write d.x, we write d.x[xAxis] instead.
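
A small sketch of that data structure follows. The axis names (budget, gross, year), the values and the position helper are hypothetical and only serve the illustration.

```javascript
// Hypothetical axis names and values, for illustration only.
var xAxis = "budget", yAxis = "gross", sizeAxis = "budget";

// One item: x, y and size each hold one precomputed value per possible axis.
var movie = {
  name: "some movie",
  x: { budget: 120, gross: 340, year: 510 },
  y: { budget: 400, gross: 80, year: 250 },
  size: { budget: 12, gross: 30 }
};

// Reads the current position of an item, whatever the axes are set to.
function position(d) {
  return { x: d.x[xAxis], y: d.y[yAxis], size: d.size[sizeAxis] };
}
```

Changing an axis then boils down to assigning a new key to xAxis, yAxis or sizeAxis and transitioning every item to its new position.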

So when there is an axis change, of course, we do a transition so that the stars and everything move smoothly to their new position. But what if the objects were already moving? When an unplanned transition interferes with an ongoing one, the results are often ugly, especially if the current transition had chained transitions waiting to be triggered. In other words, this will leave a mess.

The way I’ve dealt with this is by keeping tabs on the number of transitions going on at any given time. An axis change can only occur if no other transition is taking place; if one is, the change is simply denied. There are other ways to do that, like a queue of actions, but this seemed a simple and adequate way to deal with it.
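
The counting idea can be sketched in a few lines. All the names below are hypothetical; this shows the principle rather than the code of the entry.

```javascript
// Hypothetical names; a sketch of the counting idea, not the entry's code.
var runningTransitions = 0;
var currentAxis = "budget";

// Call when a transition starts, e.g. right after .transition().
function transitionStarted() {
  runningTransitions += 1;
}

// Call from the "end" callback of the last transition in a chain.
function transitionEnded() {
  runningTransitions = Math.max(0, runningTransitions - 1);
}

// Changes the axis only when nothing is moving; returns whether it did.
function requestAxisChange(axis) {
  if (runningTransitions > 0) { return false; } // denied
  currentAxis = axis;
  return true;
}
```

The one thing to watch with this approach is that every started transition must eventually report its end, otherwise the counter never returns to zero and axis changes stay blocked.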

Bootstrap and Google Fonts

This was the first non-trivial project where I used Bootstrap and I’m just never going back. Bootstrap simply removes all the hassle of arranging all the elements of a visualization on a screen and is very easy to use. Plus, it comes with sensible options for buttons, forms, and the like. Since the contest it has evolved faster than a Pokémon: for instance, it is now possible to specify custom colors in a form and Bootstrap will generate the appropriate CSS files. Google Fonts are another great help, as they are a very easy way to choose fonts among a relatively large number of choices without relying on all the users having these fonts on their computer.

Wrapping it up

There are a lot of other hacks in the code which you are welcome to explore. I admit I don’t remember them all because I took too much time to write this blog post after creating the entry (bad). However, if there is one point you would like me to explain, please ask in the comments.
I’m not entirely sure of what happened when I submitted the entry though. First it wasn’t listed with the others, then I got a message saying it hadn’t been reviewed, so it didn’t win anything, yet some time after the prizes had been handed out it appeared in the “shortlisted” visualizations for the contest (which I found by accident). So whether or not it was good, I’ll let you guys judge; at any rate it was fun making.

 

Treemaps in Tableau? can be done.

Tableau can do many things natively, but there are a couple of basic primitives that are not built in because they behave somewhat differently from the overall logic. Treemaps are one of them. Then again, treemaps are arguably one of the best ways to express complex hierarchical information, i.e. to show the proportions in a large dataset.

Fortunately, thanks to Tableau's flexibility there are ways to do that. In this tutorial I'm going to cover 2 cases. First, we'll create a somewhat complex treemap off data which will not change at runtime. Then, we'll create mini-treemaps which can change dynamically.

A complex treemap

Before we go into the details: the main ideas are deceptively simple.

  • we use the polygon mark,
  • we generate the treemap layout outside of Tableau.

What we want (and what we'll get) is a dataset that can be directly imported in Tableau and -boom- makes a treemap in a few clicks.

To make this dataset we can use d3. The treemap I am making is directly inspired by the d3 treemap example. d3 is already computing all of the node positions, so what we'll do is modify the program slightly so that it outputs them in a way that can be directly used in Tableau.

Here is the modified file which you can download and run on your computer. To work, it needs to be in the same folder as a data file called data.js which will hold your hierarchical data and which has the same structure as the one linked here.

You can just copy/paste the table that's displayed below the treemap and put it in Tableau or save it in a file for good measure. Here is the output of the data file linked above.

Let's take a look at a few rows:

Id  Path                     Top-level category  Name                  Value  Corner  x    y
0   flare>analytics>cluster  flare               AgglomerativeCluster  3938   0       89   167
0   flare>analytics>cluster  flare               AgglomerativeCluster  3938   1       167  167
0   flare>analytics>cluster  flare               AgglomerativeCluster  3938   2       167  192
0   flare>analytics>cluster  flare               AgglomerativeCluster  3938   3       89   192
1   flare>analytics>cluster  flare               CommunityStructure    3812   0       102  138
1   flare>analytics>cluster  flare               CommunityStructure    3812   1       167  138
1   flare>analytics>cluster  flare               CommunityStructure    3812   2       167  167
1   flare>analytics>cluster  flare               CommunityStructure    3812   3       102  167
2   flare>analytics>cluster  flare               HierarchicalCluster   6714   0       89   192
2   flare>analytics>cluster  flare               HierarchicalCluster   6714   1       167  192
2   flare>analytics>cluster  flare               HierarchicalCluster   6714   2       167  236
2   flare>analytics>cluster  flare               HierarchicalCluster   6714   3       89   236

I'm creating 4 lines per "leaf" node. So in this example, which has 220 nodes, that amounts to 880 lines. Why 4? Because to draw a rectangle in Tableau you really need to define 4 corners. This is why there is a column "Corner" which takes the values 0, 1, 2 and 3. We will use it to tell Tableau to read our corners in bottom left, bottom right, top right, top left order, which produces a nice convex rectangle and not a self-crossing hourglass shape.
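
To make the corner order concrete, here is a sketch of how one node of the d3 treemap layout can be turned into four rows. It assumes nodes carry x, y, dx and dy, as the treemap layout of the d3 versions of that time produces. The function name cornerRows and the shape of the rows are illustrative and may differ from what the modified file does.

```javascript
// Illustrative helper: turns one treemap node into 4 rows, one per corner.
// Assumes the node has x, y (one corner) and dx, dy (width and height).
function cornerRows(node, id) {
  var corners = [
    [node.x, node.y],                     // corner 0
    [node.x + node.dx, node.y],           // corner 1
    [node.x + node.dx, node.y + node.dy], // corner 2
    [node.x, node.y + node.dy]            // corner 3
  ];
  return corners.map(function (c, i) {
    return { id: id, name: node.name, value: node.value, corner: i, x: c[0], y: c[1] };
  });
}
```

With a node at x=89, y=167 that is 78 wide and 25 high, this yields the corners (89,167), (167,167), (167,192) and (89,192), i.e. the four AgglomerativeCluster rows.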

Now off to Tableau with this data. 

Now it's just a matter of reproducing this screen. Unsurprisingly, the columns and rows are going to be determined by x and y. You want a polygon mark, and you absolutely must use your corner measure in the path. For color, you have a choice: you can use the top-level category column (as I have) or the full path, which will divide your treemap into finer parts. Finally, level of detail: you must use the Id and not the name, in case several of your nodes have the same name. It's quite important at this point to uncheck aggregate measures in Analysis. You do NOT want aggregate measures (though the result is quite pretty). To be able to use the name, you must first make a measure out of it. And finally, you'll want to update your infotip slightly.

All of this you can see if you download the Tableau file.

And voilà! Treemaps for your Tableau workbooks.

Caveat: the polygon mark doesn't support labels, so you can't write on top of the small rectangles what they are. But that's not the point of the treemap, which is instead to give an immediate first impression of the relative size of large groups of your data, then allow you to explore them. To that end the infotip function works just fine.

Simpler but dynamic treemaps

This is fine and dandy if your data doesn't change but it won't scale if you need to make many treemaps based on selections. What to do? You could use pie charts, but let's not.

To that end I've tried to emulate the Congress speaks visualization by Periscopic. I really like it. When you've selected representatives at the end of the process you are taken to a screen which shows the following mini-treemap:

There are just 5 rectangles. But they will change for any representative that we choose. Can this be done with Tableau? Obviously.

Now the Tableau part of this is slightly trickier than above. The idea is that we are going to use formulas to generate the coordinates of all 20 corners of the rectangles; in other words, we are going to let Tableau calculate the layout. We can do it because the way the rectangles are going to be arranged is quite predictable: there is one on the left, then 4 on the right, stacked one on top of the other. Again, we could compute all of these coordinates outside of Tableau, but that would be a hassle, and for a large number of cases it is easier and more reliable to do this inside of Tableau.

Data

For this I have used completely random data. I have generated 20 names, and for each I have generated 6 values in a likely range: number of possible votes, number of votes the representative actually cast, number of times they voted yes, number of times they voted yes with their party, and the same two counts for no (or nay, technically).

At the end of the day I need 20 records per representative (5 rectangles of 4 corners each), so I can either replicate the line 20 times, or use linked tables. The idea is to get something like this, for all of the representatives, into Tableau.

Id  representative  corner  rectangle          possible votes  total votes  voted yes  yes with party  voted no  no with party
16  Nelson Thiede   0       no against party   888             784          320        274             464       373
16  Nelson Thiede   1       no against party   888             784          320        274             464       373
16  Nelson Thiede   2       no against party   888             784          320        274             464       373
16  Nelson Thiede   3       no against party   888             784          320        274             464       373
16  Nelson Thiede   0       no vote            888             784          320        274             464       373
16  Nelson Thiede   1       no vote            888             784          320        274             464       373
16  Nelson Thiede   2       no vote            888             784          320        274             464       373
16  Nelson Thiede   3       no vote            888             784          320        274             464       373
16  Nelson Thiede   0       no with party      888             784          320        274             464       373
16  Nelson Thiede   1       no with party      888             784          320        274             464       373
16  Nelson Thiede   2       no with party      888             784          320        274             464       373
16  Nelson Thiede   3       no with party      888             784          320        274             464       373
16  Nelson Thiede   0       yes against party  888             784          320        274             464       373
16  Nelson Thiede   1       yes against party  888             784          320        274             464       373
16  Nelson Thiede   2       yes against party  888             784          320        274             464       373
16  Nelson Thiede   3       yes against party  888             784          320        274             464       373
16  Nelson Thiede   0       yes with party     888             784          320        274             464       373
16  Nelson Thiede   1       yes with party     888             784          320        274             464       373
16  Nelson Thiede   2       yes with party     888             784          320        274             464       373
16  Nelson Thiede   3       yes with party     888             784          320        274             464       373
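
Replicating each record can be scripted rather than done by hand. Here is a sketch, assuming one input object per representative that holds the counts. The function name expand and the property names are illustrative.

```javascript
// The 5 rectangles, in the order in which they appear in the table.
var rectangles = ["no against party", "no vote", "no with party",
                  "yes against party", "yes with party"];

// Illustrative helper: copies one representative's record 20 times,
// once per (rectangle, corner) pair.
function expand(rep) {
  var rows = [];
  rectangles.forEach(function (rectangle) {
    for (var corner = 0; corner < 4; corner++) {
      var row = { corner: corner, rectangle: rectangle };
      Object.keys(rep).forEach(function (key) { row[key] = rep[key]; });
      rows.push(row);
    }
  });
  return rows;
}
```

The rows can then be written out as tab-separated text and pasted into Tableau, in the same way as the output of the treemap script.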

In Tableau

In Tableau we are going to use the same idea as above: polygon mark, disable aggregate measures, and use x and y for columns and rows.

Only, x and y are going to be much more complex. Sorry about that. Well, not that complex but definitely longer.

Here's x:


case [rectangle]
when "no vote" then
     case [corner]
       when 0 then 0
       when 1 then (([possible votes]-[total votes])/[possible votes])
       when 2 then (([possible votes]-[total votes])/[possible votes])
       when 3 then 0
     end
else
     case [corner]
       when 0 then (([possible votes]-[total votes])/[possible votes])
       when 1 then 1
       when 2 then 1
       when 3 then (([possible votes]-[total votes])/[possible votes])
     end
end

Depending on the rectangle we are trying to draw we can find ourselves in one of two cases (hence the use of case).

If we draw "no vote" then we are on the left of our vis. The left corners are on the leftmost side of the vis (hence value: 0) and the right corners correspond to the proportion of possible votes which were not cast by this representative, which we can compute as ([possible votes]-[total votes])/[possible votes].

In the other case, we are drawing one of the 4 stacked rectangles, so the right corners are on the rightmost side of the vis (hence value: 1) and the left corners correspond to the value we just computed.
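
The same logic, written as a JavaScript sketch, can be handy to check the values outside Tableau. The function name xCoord and the property names are illustrative; the row is assumed to carry the counts of possible and cast votes.

```javascript
// Illustrative check of the x formula outside Tableau.
function xCoord(row, rectangle, corner) {
  // share of possible votes that were not cast: width of the "no vote" rectangle
  var split = (row.possibleVotes - row.totalVotes) / row.possibleVotes;
  var isLeftCorner = (corner === 0 || corner === 3);
  if (rectangle === "no vote") {
    return isLeftCorner ? 0 : split;
  }
  return isLeftCorner ? split : 1;
}
```

For a representative with 888 possible votes of which 784 were cast, the split between the left rectangle and the stack falls at 104/888, about 0.117.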

And now, y:

case [rectangle]
when "no vote" then
     case [corner]
       when 0 then 0
       when 1 then 0
       when 2 then 1
       when 3 then 1
     end
when "yes against party" then
     case [corner]
       when 0 then 0
       when 1 then 0
       when 2 then (([voted yes]-[yes with party])/[total votes])
       when 3 then (([voted yes]-[yes with party])/[total votes])
     end
when "yes with party" then
     case [corner]
       when 0 then (([voted yes]-[yes with party])/[total votes])
       when 1 then (([voted yes]-[yes with party])/[total votes])
       when 2 then ([voted yes]/[total votes])
       when 3 then ([voted yes]/[total votes])
     end
when "no with party" then
     case [corner]
       when 0 then ([voted yes]/[total votes])
       when 1 then ([voted yes]/[total votes])
       when 2 then (([voted yes]+[no with party])/[total votes])
       when 3 then (([voted yes]+[no with party])/[total votes])
     end
when "no against party" then
     case [corner]
       when 0 then (([voted yes]+[no with party])/[total votes])
       when 1 then (([voted yes]+[no with party])/[total votes])
       when 2 then 1
       when 3 then 1
     end
end

y is longer but this is the same general idea. For the "no vote" rectangle, the corners are either at the top or at the bottom of the vis. But for the others, we can predict where the rectangle will start and where it will end, as a proportion of the [total votes] field. The values we want correspond to these proportions, plus those of all the rectangles below, so we can achieve that stacked effect (as opposed to having all rectangles superimposed at the bottom of the vis). This is why I am entering the rectangles in stacking order. Each time, the bottom corners get the value of the top corners of the previous rectangle.
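
The stacking can also be sketched in JavaScript as a running total, which makes it easy to check that the last rectangle ends exactly at 1. The function name yBounds and the property names are illustrative, and the heights are taken as shares of the votes actually cast.

```javascript
// Illustrative check of the stacking outside Tableau.
// Returns, for each rectangle, its bottom and top as shares of the votes cast.
function yBounds(row) {
  var counts = [
    ["yes against party", row.votedYes - row.yesWithParty],
    ["yes with party", row.yesWithParty],
    ["no with party", row.noWithParty],
    ["no against party", row.votedNo - row.noWithParty]
  ];
  var bounds = { "no vote": { bottom: 0, top: 1 } }; // full height, on the left
  var running = 0; // votes stacked so far
  counts.forEach(function (c) {
    var bottom = running / row.totalVotes;
    running += c[1];
    bounds[c[0]] = { bottom: bottom, top: running / row.totalVotes };
  });
  return bounds;
}
```

With 784 votes cast, 320 yes (274 with party) and 464 no (373 with party), the four heights are 46, 274, 373 and 91 votes, so the tops land at about 0.059, 0.408, 0.884 and 1.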

Here is the final result: