d3: scales, and color.

In protovis, scales were super-useful in just about everything. That much hasn’t changed in d3, even though d3.scale is a bit different from pv.Scale. (do note that d3.scale is in lowercase for starters).

Scales: the main idea

Simply put: scales transform a number in a certain interval (called the domain) into a number in another interval (called the range).
an example of how scales work
For instance, let’s suppose you know your data is always over 20 and always below 80. You would like to plot it, say, in a bar chart, which can be only 120 pixels tall.
You could, obviously, do the math:

.attr("height", function(d) {return (d-20)*2;})

But what if you suddenly have more or less space? or your data changes? you’d have to go back to the entrails of your code and make the change. This is very error prone. So instead, you can use a scale:

var y=d3.scale.linear().domain(20,80).range(0,120);
...
.attr("height", y)

this is much simpler, elegant, and easy to maintain. Oh, and the latter notation is equivalent to

.attr("height", function(d) {return y(d);})

… only more legible and shorter.
And, there are tons of possibility with scales.

Fun with scales

In d3, quantitative scales can be of several types:

  • linear scales (including quantize and quantile scales,
  • logarithmic scales,
  • power scales (including square root scales)

While they behave differently, they have a lot in common.

Domain and range

For all scales, with the exception of quantize and quantile scales which are a bit different, domain and range work the same.
First, note that unlike in protovis, domain and range take an array as argument. Compare:

var y=pv.Scale.linear().range(20,60).domain(0,120);
var y=d3.scale.linear().range([20,60]).domain([0,120]);

This is because contrary to protovis, where domain could be a whole dataset, in d3, domain contains the bounds of the interval that is going to be transformed.
Typically, this is two numbers. If this is more, we are talking about a polypoint scale: there are as many segments in the intervals as there are numbers in the domain (minus one). The range must have as many numbers, and so as many segments. When using the scale, if a number is in the n-th segment of the domain, it is transformed into a number in the n-th segment of the range.
illustration of a multipoint scale
With this example, 30 finds itself in the first segment of the domain. So it’s transformed to a value in the first segment of the range. 60, however, is in the 2nd segment, so it’s transformed into a value in the 2nd segment of the range.
Also, bounds of domain and range need not be numbers, as long as they can be converted to numbers. One useful examples are colors. Color names can be used as range, for instance, to create color ramps:

var ramp=d3.scale.linear().domain([0,100]).range(["red","blue"]);

This will transform any value betwen 0 and 100 into the corresponding color between red and blue.

Clamping

What happends if the scale is asked to process a number outside of the domain? That’s what clamping controls. If it is set, then the bounds of the range are the minimum and maximum value that can be returned by the scale. Else, the same transformation applies to all numbers, whether they fall within the domain or not.
Clamping example
Here, with clamping, the result of the linear transformation is 120, but without it, it’s 160.

var clamp=d3.scale.linear().domain([20,80]).range([0,120]);
clamp(100); // 160
clamp.clamp(true);
clamp(100); // 120

Scales and nice numbers

More often than not, the bounds of the domain and/or those of the ranges will be calculated. So, chances are they won’t be round numbers, or numbers a human would like. Scales, however, come with a bunch of method to address that. d3 keeps in mind that scales are often used to position marks along an axis.

.nice()

When applied to a scale, the nice method expends the domain to “nicer” numbers. You wouldn’t want your axis to start at -2.347 and end at 7.431, right?
So, there.

var data=[-2.347, 4, 5.23,-1.234,6.234,7.431]; // or whatever.
var y=d3.scale.linear().range([0,120]);
y.domain([d3.min(data), d3.max(data)]); // domain takes bounds as arguments, not all numbers
y.domain() // [-2.347, 7.431];
y.nice() // [-3, 8]

.ticks(n)

Given a domain, and a number n (which, contrary to protovis, is mandatory in d3), the ticks method will split your domain in (more or less) n convenient, human-readable values, and return an array of these values. This is especially useful to label axes. Passing these values to the scale allows them to position ticks nicely on an axis.

var y=d3.scale.linear([20,80]).range([0,120]);
...
var ticks=axis.selectAll("line")
  .data(y.ticks(4)) // 20, 40, 60 and 80
  .enter().append("svg:line");
ticks
  .attr("x1",0).attr("x2",5)
  .attr("y1",y).attr("y2",y) // short and simple. 
  .attr("stroke","black");

.rangeRound()

If used instead of .range(), this will guarantee that the output of the scales are integers, which is better to position marks on the screen with pixel precision than numbers with decimals.

.invert()

The invert function turns the scale upside down: for one given number in the range, it returns which number of the domain would have been transformed into that number.
For instance:

var y=d3.scale.linear([20,80]).range([0,120]);
y(50); // 60
y.invert(60); // 50

That’s quite useful, for instance, when a user mouses over a chart, and you would like to know to what value the mouse coordinates correspond.

Power scales and log scales

The linearscale is a function of the form y=ax+b which works for both ends of the domain and range. In the example we’ve used most often until now, this function is really f(x): y=2×-40.
Power and logarithm scales work the same, only we are looking for a function of the form y=axk+b, or y=a.log(x)+b.
For the power scales, you can specify an exponent (k) with the .exponent() method. For instance, if we specify an exponent of 2, here is what the scale would look like:
an example of a power scale
The equation is now f(x): y=x²/50-8. So 20 still becomes 0 and 80 still becomes 120, but other than that the values at the beginning of the domain would be lower than with the linear scale, and those at the end of the scale will be higher.
For convenience, d3 includes a d3.scale.sqrt() (the square root scale) so you never have to type d3.scale.pow.exponent(0.5) in full.
Also note that if you are using a log scale, you cannot have 0 in the domain.

Quantize and quantile

quantize and quantile are specific linear scales.
quantize works with a discrete, rather than continuous, range: in other terms, the output of quantize can only take a certain number of values.
For instance:

var q=d3.scale.quantize().domain([0,10]).range([0,2,8]); 
q(0); // 0
q(3); // 0
q(3.33); // 0
q(3.34); // 2
q(5); // 2
q(6.66); // 2
q(6.67); // 8
q(8); // 8
q(1000); // 8

quantile on the other hand matches values in the domain (which, this time, is the full dataset) with their respective quantile. The number of quantiles is specified by the range.
For instance:

var q=d3.scale.quantile().domain([0,1,5,6,2,4,6,2,4,6,7,8]).range([0,100]);
q.quantiles(); // [4.5], only one quantile - the median
q(0); // 0
q(4); // 0
q(4.499); // 0
q(4.5); // 100 - over the median
q(5); // 100
q(10000); // 100
q.range([0,25,50,75,100]);
q.quantiles(); // [2, 4, 5.6, 6];
q(0); // 0 
q(2); // 25 - greater than the first quantile limit
q(3); // 25
q(4); // 50
q(6); // 100
q(10000); // 100

Ordinal scales

All the scales we’ve seen so far have been quantitative, but how about ordinal scales?
The big difference is that ordinal scales have a discrete domain, in other words, they turn a limited number of values into something else, without caring for what’s between those values.
Ordinal scales are very useful for positioning marks along an x axis. Let’s suppose you have 10 bars to position for your bar chart, each corresponding to a category, a month or whatever.
For instance:

var x=d3.scale.ordinal()
  .domain(["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]) // 7 items
  .rangeBands([0,120]);
x("Tuesday"); // 34.285714285714285

There are 3 possibilites for range. Two are similar: the .rangePoints() and .rangeBands() methods, which both work with an array of two numbers – i.e. .rangeBands([0,120]). The last one is to specify all values in the range with .range().

rangePoints() and rangeBands()

With .rangePoints(interval), d3 fits n points within the interval, n being the number of categories in the domain. In that case, the value of the first point is the beginning of the interval, that of the last point is the end of the interval.
With .rangeBands(interval), d3 fit n bands within the interval. Here, the value of the last item in the domain is less than the upper bound of the interval.
Those methods replace the protovis methods .split() and .splitBanded().
difference between rangeBands and rangePoints
This chart illustrates the difference between using rangeBands and rangePoints.

var x=d3.scale.ordinal()
  .domain(["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
x.rangePoints([0,120]);
x("Saturday"); // 120
x.rangeBands([0,120]);
x("Saturday"); // 102.85714285714286
x("Saturday")+x.rangeBand(); // 120

the range method

Finally, we can also use the .range method with several values.
We can specify the domain, or not. Then, if we use such a scale on a value which is not part of the domain (or if the domain is left empty), this value is added to the domain. If there are n values in the range, and more in the domain, then the n+1th value of the doamin is matched with the 1st value in the range, etc.

var x=d3.scale.ordinal().range(["hello", "world"]); 
x.domain(); // [] - empty still.
x(0); // "hello"
x(1); // "world"
x(2); // "hello"
x.domain(); // [0,1,2]

Color palettes

Unlike in protovis, which had them under pv.Colors – i.e. pv.Colors.category10(), in d3, built-in color palettes can be accessed through scales. Well, even in protovis they had been ordinal scales all along, only not called this way.
There are 4 built-in color palette in protovis: d3.scale.category10(), d3.scale.category20(), d3.scale.category20b(), and d3.scale.category20c().

A palette like d3.scale.category10() works exactly like an ordinal scale.

var p=d3.scale.category10();
var r=p.range(); // ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", 
                      // "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
var s=d3.scale.ordinal().range(r); 
p.domain(); // [] - empty
s.domain(); // [] - empty, see above
p(0); // "#1f77b4"
p(1); // "#ff7f0e"
p(2); // "#2ca02c"
p.domain(); // [0,1,2];
s(0); // "#1f77b4"
s(1); // "#ff7f0e"
s(2); // "#2ca02c"
s.domain(); // [0,1,2];

It’s noteworthy that in d3, color palette return strings, not pv.Color objects like in protovis.
Also:

d3.scale.category10(1); // this doesn't work
d3.scale.category10()(1); // this is the way.

Colors

Compared to protovis, d3.color is simpler. The main reason is that protovis handled color and transparency together with the pv.Color object, whereas in SVG, those two are distinct attributes: you handle the background color of a filled object with fill, its transparency with opacity, the color of the outline with stroke and the transparency of that color with stroke-opacity.

d3 has two color objects: d3_Rgb and d3_Hsl, which describe colors in the two of the most popular color spaces: red/green/blue, and hue/saturation/light.

With d3.color, you can make operations on such objects, like converting colors between various formats, or make colors lighter or darker.

d3.rgb(color), and d3.hsl(color) create such objects.
In this context, color can be (straight from the manual):

  • rgb decimal – “rgb(255,255,255)”
  • hsl decimal – “hsl(120,50%,20%)”
  • rgb hexadecimal – “#ffeeaa”
  • rgb shorthand hexadecimal – “#fea”
  • named – “red”, “white”, “blue”

Once you have that object, you can make it brighter or darker with the appropriate method.
You can use .toString() to get it back in rgb hexadecimal format (or hsl decimal), and .rgb() or .hsl() to convert it to the object in the other color space.

var c=d3.rgb("violet") // d3_Rgb object
c.toString(); // "#ee82ee"
c.darker().toString(); // "#a65ba6"
c.darker(2).toString(); // "#743f74" - even darker
c.brighter().toString();// "ffb9ff"
c.brighter(0.1).toString(); // "#f686f6" - only slightly brighter
c.hsl(); // d3_Hsl object
c.hsl().toString() // "hsl(300, 76, 72)"
 

d3: adding stuff. And, oh, understanding selections

From data to graphics

the d3 principle (and also the protovis principle)
d3 and protovis are built around the same principle. Take data, put it into an array, and for each element of data a graphical object can be created, whose properties are derived from the data that was provided.

Only d3 and protovis have a slightly different way of adding those graphical elements and getting data.

In protovis, you start from a panel, a protovis-specific object, to which you add various marks. Each time you add a mark, you can either:

  • not specify data and add just one,
  • or specify data and create as many as there are items in the array you pass as data.

.

How de did it in protovis

var vis=new pv.Panel().width(200).height(200); 
vis.add(pv.Panel).top(10).left(10)
  .add(pv.Bar)
    .data([1,4,3,2,5])
    .left(function() {return this.index*20;})
    .width(15)
    .bottom(0)
    .height(function(d) {return d*10;});
vis.render();

this simple bar chart in protovis
you first create a panel (first line), you may add an element without data (here, another panel, line 2), and add to this panel bars: there would be 5, one for each element in the array in line 4.

And in d3?

In d3, you also have a way to add either one object without passing data, or a series of objects – one per data element.

var vis=d3.select("body").append("svg:svg").attr("width",200).attr("height",200);
var rect=vis.selectAll("rect").data([1,4,3,2,5]).enter().append("svg:rect");
rect.attr("height",function(d) {return d*20;})
  .attr("width", 15)
  .attr("x",function(d,i) {return i*20;})
  .attr("y",function(d) {return 100-20*d;}
  .attr("fill","steelblue");

In the first line, we are creating an svg document which will be the root of our graphical creation. It behaves just as the top-level panel in protovis.

However we are not creating this out of thin air, but rather we are bolting it onto an existing part of the page, here the tag. Essentially, we are looking through the page for a tag named and once we find it (which should be the case often), that’s where we put the svg document.

Oftentimes, instead of creating our document on , we are going to add it to an existing <div> block, for instance:

<div id="chart"></div>
<script type="text/javascript">
var vis=d3.select("#chart").append("svg:svg");
...
</script>

Anyway. To add one element, regardless of data, what you do is:

The logic is : d3.select(where we would like to put our new object).append(type of new object).

Going back to our code:

var vis=d3.select("body").append("svg:svg").attr("width",200).attr("height",200);
var rect=vis.selectAll("rect").data([1,4,3,2,5]).enter().append("svg:rect");
rect.attr("height",function(d) {return d*20;})
  .attr("width", 15)
  .attr("x",function(d,i) {return i*20;})
  .attr("y",function(d) {return 100-20*d;}
  .attr("fill","steelblue");

On line 2, we see a different construct:

an existing selection, or a part of the page
.selectAll(something)
.data(an array)
.enter()
.append(an object type)

This sequence of methods (selectAll, data, enter and append) are the way to add a series of elements. If all you need to know is to create a bar chart, just remember that, but if you plan on taking your d3 skills further than where you stopped with protovis, look at the end of the post for a more thorough explanation of the selection process.

Attributes and accessor functions

At this stage, we’ve added our new rectangles, and now we are going to shape and style them.

rect.attr("height",function(d) {return d*20;})
  .attr("width", 15)
  .attr("x",function(d,i) {return i*20;})
  .attr("y",function(d) {return 100-20*d;}
  .attr("fill","steelblue");

All the attributes of a graphical element are controlled by the method attr(). You specify the attribute you want to set, and the value you want to give.
In some cases, the value doesn’t depend on the data. All the bars will be 15 pixels wide, and they will all be of the steelblue color.
In some others, the value do depend on the data. We decide that the height of each bar is 20 times the value of the underlying data, in pixels (so 1 becomes 20, 5 becomes 100 etc.). Like in protovis, once data has been attributed to an element, function(variable name) enables to return a dynamic value in function on that element. By convention, we usually write function(d) {…;} (d for data) although it could be anything. Those functions are still called accessor functions.
so for instance:

.attr("height",function(d) {return d*20;})

means that the height will be 20 times the value of the underlying data element (exactly what we said above).
In protovis, we could position the mark relatively to any corner of its parent, so we had a .top method and a .bottom method. But with SVG, objects are positioned relatively to the top-left corner. So when we specify the y position, it is also relative to the top of the document, not necessarily to the axis (and not in this case).
so –

.attr("y", function(d) {return 100-d*20;})

if we use scales (see next post), all of this will have no impact whatsoever anyway.
Finally, there is an attribue here which doesn’t so much depend on the value of the data, but of its rank in the data items: the x position.
for this, we write: function(d,i) {return i*20;}
Here is a fundamental difference with protovis. In protovis, when we passed a second argument to such a function, it meant the data of the parent element (grand parent for the third, etc.). But here in d3, the second parameter is the position of the data element in its array. By convention, we write it i (for index).
And since you have to know: there is no easy way to retrieve the data of the parent element.

Bonus: understanding selections

To add many elements at once we’ve used the sequence: selectAll, data, enter, append.
Why use 4 methods for what appears to be one elementary task? If you don’t care about manipulating nodes individually, for instance for animations, you can just remember the sequence. But if you want to know more, here is what each method does.

selectAll

the selectAll method
First, we select a point on which to add your new graphical objects. When you are creating your objects and use the selectAll method, it will return an empty selection but based on that given point. You may also use selectAll in another context, to update your objects for instance. But here, an empty selection is expected.

data

the data method
Then, you attribute data. This works quite similarly to protovis: d3 expects an array. d3 takes the concept further (with the concept of data joins) but you need not concern yourself with that until you look at transitions.
Anyway, at this stage you have an empty selection, based on a given point in the page, but with data.

enter

the enter method
The enter method updates the selection with nodes which have data, but no graphical representation. Using enter() is like creating stubs where the graphical elements will be grafted.

append

the append method
Finally, by appending we actually create the graphical objects. Each is tied to one data element, so it can be further styled (for instance, through “attr”) to derive its characteristics from the value of that data.

 

From protovis to d3

You’ve spent some time learning protovis only to find that its development is halted as authors have switched to work on d3. Have your efforts all been in vain? Fear not! This series of posts will help you adapt to d3 with a protovis background.

Before we go anywhere further, let me say that these posts won’t make you awesome at d3 (yet). We won’t be talking about how to do all amazing things you could never do in protovis. Rather, we’ll focus on enabling you to be as comfortable with d3 than you could have been with protovis. And once that’s done, nothing will prevent you from learning the more powerful aspects of d3.

Anyway, if you’re reading this, you are already awesome.

Why should I make the switch to d3?

Frankly, you don’t have to. Protovis is a fine framework and works well. Now you may want to switch to d3 for several reasons.

  • d3 is fast. d3 is better at handling scenes with hundreds or thousands of elements. So if you like scatterplots or network graphs, and who doesn’t, d3 has much stronger performance.
  • d3 does animation. There were workarounds to get animation in protovis but there were that. Workarounds. Animation and transitions are built in d3 and are a snap to implement.
  • More features. Just because development has stopped on protovis doesn’t mean that it has stopped elsewhere… for instance, d3 has more ready-to-use layouts, like voronoi tesselation or chords, and it has more methods and functions to make your life easier, to access and manipulate data for example.
  • Styling. In d3 it is possible to apply style sheets in CSS to graphical elements. This helps keeping the code and the format separate.

Yes but doesn’t everything change?

Short answer: no.

Less short answer: some things do change substantially. Most things stay the same. And then, some things look the same but have changed.

Things that stay the same

  • The general principle.Protovis is about transforming an array of data into the same number of graphical elements, with characteristics derived from that data. d3 does exactly this as well.
  • pv.Nest, which in my personal protovis experience has been the hardest to understand. Only, it’s called d3.nest now.
  • Methods that supplement the existing javascript array manipulation methods, like pv.min, pv.values, pv.entries etc. are also back (as d3.min, d3.values, d3.entries, but you’ve guessed it by now). Some, like pv.mean or pv.median, didn’t make it through but you could easily rewrite them, or continue using the protovis ones.

Things that look different, but which are largely the same

Protovis had a number of native graphical objects, or marks, that could be manipulated at will with methods.

var vis=new pv.Panel()
  .height(400)
  .width(400)
  .fillStyle("aliceblue")
  .lineWidth(1)
  .strokeStyle("steelblue");
vis.render();

In protovis, it is inherently different to set the height, the width or just any property of an object. This uses different methods.

var vis=d3.select("body").append("svg:svg");
  vis.append("svg:rect")
    .attr("height",400)
    .attr("width",400)
    .attr("fill","aliceblue")
    .attr("stroke","steelblue")
    .attr("stroke-width",1);

This produces essentially the same thing. We add a rectangle of a specified height, width, and colors. There are a few differences though. Here, controlling height, width or fill is essentially the same thing and uses the same method, .attr(). Notice also that we first created an svg document, then a shape within that document. And also, that we don’t need to use vis.render(); anymore.

The d3 approach looks longer. But if we define all the style information first we could make it much shorter, shorter than in protovis in fact!
For instance:

var vis=d3.select("body").append("svg:svg").append("svg:rect").attr("class", "myRect");

Much of the apparent differences between d3 and protovis come from using explicitly svg shapes (paths, polygons, ellipses, etc.) as opposed to native objects (pv.Panel, pv.Bar, pv.Dot, etc.), although – it’s essentially the same thing. Yes, you have to learn your SVG but it’s really on a need-to-know basis. In fact, if you’ve worked with protovis, or even if you’ve worked with HTML and CSS, you probably know more SVG than you thought.

SVG is more flexible than protovis objects. The flipside is that constructs which were once simple in protovis become less obvious in SVG. But for those cases, d3 has recreated some native objects, even if not as many as in protovis.

Things that look the same, but which are different

There have been some changes in methods that have kept the same name since protovis – some minor, some more substantial. In any case, the basic ways of using these methods (like scale, color, data…) doesn’t change much. It’s only their more exotic uses who do change.

 

Now reading: Visualize This

I’ve just read Visualize This by Nathan Yau and you should too.
Before I go and develop why, I’d like to say a few words about the author.

Thank you Nathan!

Nathan started flowingdata in 2007. This wasn’t the first blog on data visualization, and his author wasn’t the best known member of this community. Maintaining the blog wasn’t (and still isn’t) his full-time occupation. Yet flowingdata rose to be the most read data visualization blog, thus making a huge service to us all, bringing this science much needed visibility.
Nathan pulled this off by posting something everyday: original content, fine examples of visualizations, technological advances, tutorials, and so on.

As if that were not enough, the job boards at flowingdata are an invaluable ressources for anyone who seeks to hire and infovis expert… and for experts themselves obviously.

So, for all of this: thank you Nathan!

Visualizing data or visualize this?

One of the first book I read on visualization was Visualizing Data by Ben Fry (not the equally interesting Cleveland book of the same title). Back then I wanted a generic book about good practices on visualization without getting my hands dirty with code. But Visualizing Data is really a book about processing. This is both the greatest limitation and the greatest strength of this book, which will teach you processing through data examples. (I am never happier than in front of an empty processing sketch).

Visualize this!, by contrast, is more agnostic. It has the same ambition to help beginners get past the initial stumbling blocks of visualization, but with a greater variety of tools. Nathan’s favorite approach is to start with R then turn to Illustrator for the win. Yet he covers more tools such as python, protovis or flash.

Nathan doesn’t go very far into the nitty-gritty. Instead, he attempts to make those tools less intimidating, more approachable. Also and perhaps more importantly, the focus of the various chapters is to distill some good ideas and practices on top of the practical examples.

Which lead us to…

Visualize this or Show me the Numbers?

Show me the Numbers by Stephen Few is one the essential books on data visualization. I have said many times that if one should read only one book on charts and tables, it should be this. A few years after reading it, I find its recommendations to be a bit rigid (which its author is not).

For comparison’s sake, here is what Stephen says about pie charts.

Speaking of “difficult to read”, allow me to declare with no further delay that I don’t use pie charts, and I strongly recommend that you abandon them as well. My reason is simple: pie charts communicate poorly. This is a fundamental problem with all types of area graphs but especially with pie charts. Our visual perception is not designed to accurately assign quantitative values to 2-D areas, and we have an even harder time when the third dimension of depth is added.

Now here’s what Nathan has to say on the subject:

Pie charts have developed a stigma for not being as accurate as bar charts or position-based visuals, so some think you should avoid them completely. It’s easier to judge length than it is to judge areas and angles. That doesn’t mean you have to completely avoid them though.

You can use the pie chart without any problems just as long as you know its limitations. It’s simple. Keep your data organized, and don’t put too many wedges on one pie.

The above is fairly representative of the tone of the book, which is less directive, less scary perhaps, than other guides that have been written on the subject. Nathan will point you in the right direction, rather than force you to do things in a certain way (as an aside, the citation of Show me the Numbers is less representative of that book as it is the strongest-voiced recommendation).

Wrapping up

Once you close Visualize this! you won’t be a master of R. But you will be on the good track to become one if you choose to. If you know nothing about visualization it is a comforting way to start as it presents the field in an accessible way. This is arguably the first book to do so. If you know more on visualization there may be a thing for you. I knew nothing on Illustrator for instance, and I also learned things in the first and last chapters. I also appreaciated the more design-oriented approach compared to more technical books.

At the end of this month I will go on vacation. I will leave the book on my desk and it will disappear, like my favorite books on the subject before it. I might find who will have borrowed it from me, but not who will have borrowed it from them… What better fate can I hope for this book?