Protovis: analysis of the Becker’s barley example sketch

We are taking a look at the Protovis Becker’s Barley example.

<html>
  <head>
    <title>Barley Yields</title>
    <link type="text/css" rel="stylesheet" href="ex.css?3.2"/>
    <script type="text/javascript" src="../protovis-r3.2.js"></script>
    <script type="text/javascript" src="barley.js"></script>
    <style type="text/css">

#fig {
  width: 350px;
  height: 833px;
}

    </style>
  </head>
  <body><div id="center"><div id="fig">
    <script type="text/javascript+protovis">

/* Compute yield medians by site and by variety. */
function median(data) pv.median(data, function(d) d.yield);
var site = pv.nest(barley).key(function(d) d.site).rollup(median);
var variety = pv.nest(barley).key(function(d) d.variety).rollup(median);

/* Nest yields data by site then year. */
barley = pv.nest(barley)
    .key(function(d) d.site)
    .sortKeys(function(a, b) site[b] - site[a])
    .key(function(d) d.year)
    .sortValues(function(a, b) variety[b.variety] - variety[a.variety])
    .entries();

/* Sizing and scales. */
var w = 242,
    h = 132,
    x = pv.Scale.linear(10, 70).range(0, w),
    c = pv.Colors.category10();

/* The root panel. */
var vis = new pv.Panel()
    .width(w)
    .height(h * pv.keys(site).length)
    .top(15)
    .left(90)
    .right(20)
    .bottom(25);

/* A panel per site-year. */
var cell = vis.add(pv.Panel)
    .data(barley)
    .height(h)
    .top(function() this.index * h)
    .strokeStyle("#999");

/* Title bar. */
cell.add(pv.Bar)
    .height(14)
    .fillStyle("bisque")
  .anchor("center").add(pv.Label)
    .text(function(site) site.key);

/* A dot showing the yield. */
var dot = cell.add(pv.Panel)
    .data(function(site) site.values)
    .top(23)
  .add(pv.Dot)
    .data(function(year) year.values)
    .left(function(d) x(d.yield))
    .top(function() this.index * 11)
    .size(12)
    .lineWidth(2)
    .strokeStyle(function(d) c(d.year));

/* A label showing the variety. */
dot.anchor("left").add(pv.Label)
    .visible(function() !this.parent.index)
    .left(-1)
    .text(function(d) d.variety);

/* X-ticks. */
vis.add(pv.Rule)
    .data(x.ticks(7))
    .left(x)
    .bottom(-5)
    .height(5)
    .strokeStyle("#999")
  .anchor("bottom").add(pv.Label);

/* A legend showing the year. */
vis.add(pv.Dot)
    .extend(dot)
    .data([{year:1931}, {year:1932}])
    .left(function(d) 170 + this.index * 40)
    .top(-8)
  .anchor("right").add(pv.Label)
    .text(function(d) d.year);

vis.render();

    </script>
  </div></div></body>
</html>

Data:

var barley = [
  { yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" },
  { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
  { yield: 27.43334, variety: "Manchuria", year: 1931, site: "Morris" },
etc.

The code begins by defining a helper function that is used in the two statements that follow.

function median(data) pv.median(data, function(d) d.yield);

This function expects an array of associative arrays that each have a “yield” key, and returns the median of the values stored under that key.
If run against barley, it will return the median yield of all observations. If run against a subset of barley, it will return the median yield of that subset.
The point of this function is to keep the statements that follow short and readable.
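
To make the behaviour concrete, here is a minimal plain-JavaScript sketch of a median with an accessor, run on the three records quoted in the data excerpt. It does not use Protovis: medianBy is a hypothetical stand-in for pv.median, and it is written with ordinary function expressions because the brace-less function(d) d.yield shorthand used in the example is not standard JavaScript.

```javascript
// Hypothetical stand-in for pv.median(array, accessor); not the Protovis source.
function medianBy(data, accessor) {
  var values = data.map(accessor).sort(function (a, b) { return a - b; });
  var mid = Math.floor(values.length / 2);
  // Odd count: middle value. Even count: mean of the two middle values.
  return values.length % 2 ? values[mid] : (values[mid - 1] + values[mid]) / 2;
}

function median(data) {
  return medianBy(data, function (d) { return d.yield; });
}

// The three records quoted in the data excerpt.
var sample = [
  { yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" },
  { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
  { yield: 27.43334, variety: "Manchuria", year: 1931, site: "Morris" }
];

console.log(median(sample));             // 27.43334
console.log(median(sample.slice(0, 2))); // mean of the first two yields
```

The second call shows the "subset" case: with an even number of records, the sketch averages the two middle values.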

Now let’s see it put to good use.

var site = pv.nest(barley).key(function(d) d.site).rollup(median);

The output of this statement is easier to understand than the statement itself. It returns an associative array with each site as a key; the corresponding value is the median yield of all observations for that site.

{
  "Crookston": 39.03333,
  "Duluth": 28.533335,
  "Grand Rapids": 23.983335,
  "Morris": 34.699995,
  "University Farm": 31.383335,
  "Waseca": 47.949995
}

pv.nest(barley) means that we are going to create a hierarchical associative array (a tree) based on barley.
.key(function(d) d.site) means that the first key in the hierarchy of that tree will be the site. So, we are going to run an operation on all the entries of “barley” that share the same site.
That operation is specified by .rollup(median) at the end of the line: it crunches all the entries of a group and replaces them with the result of the function “median” applied to those records.
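
The group-then-aggregate step can be sketched in plain JavaScript as follows. rollupBy is a hypothetical helper that only mimics the shape of the result described here; it is not how Protovis implements pv.nest. The records are the three quoted in the data excerpt, so each site holds a single record in this run.

```javascript
// Hypothetical helper: group records by a key, then reduce each group to one value.
function rollupBy(records, keyFn, reduceFn) {
  var groups = {};
  records.forEach(function (d) {
    var k = keyFn(d);
    (groups[k] = groups[k] || []).push(d);
  });
  var result = {};
  Object.keys(groups).forEach(function (k) {
    result[k] = reduceFn(groups[k]);
  });
  return result;
}

// Median of the yield values of a group of records.
function median(group) {
  var v = group.map(function (d) { return d.yield; })
               .sort(function (a, b) { return a - b; });
  var mid = Math.floor(v.length / 2);
  return v.length % 2 ? v[mid] : (v[mid - 1] + v[mid]) / 2;
}

var sample = [
  { yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" },
  { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
  { yield: 27.43334, variety: "Manchuria", year: 1931, site: "Morris" }
];

var bySite = rollupBy(sample, function (d) { return d.site; }, median);
var byVariety = rollupBy(sample, function (d) { return d.variety; }, median);

console.log(bySite);    // one median per site
console.log(byVariety); // { Manchuria: 27.43334 }
```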

var variety = pv.nest(barley).key(function(d) d.variety).rollup(median);

This does the same as above, but by variety instead of by site.

{
  "Glabron": 32.4,
  "Manchuria": 30.96667,
  "No. 457": 33.966665,
  "No. 462": 30.45,
  "No. 475": 31.066665,
  "Peatland": 32.383335,
  "Svansota": 28.550005,
  "Trebi": 39.199995,
  "Velvet": 32.149995000000004,
  "Wisconsin No. 38": 36.95
}

/* Nest yields data by site then year. */
barley = pv.nest(barley)
    .key(function(d) d.site)
    .sortKeys(function(a, b) site[b] - site[a])
    .key(function(d) d.year)
    .sortValues(function(a, b) variety[b.variety] - variety[a.variety])
    .entries();

This will transform the variable barley, which until now was a flat list of records, into a tree.
pv.nest(barley) indicates we are turning barley into a tree.
.key(function(d) d.site) says that the first order of the hierarchy will be “site”.
So our tree should look like:

[
  {key: "Crookston", values: […]},
  {key: "Duluth", values: […]},
  …

The order of the keys will not necessarily be alphabetical, though, because of the next statement.
sortKeys sorts the keys using a comparator function: function(a, b) is called on pairs of keys and decides which one comes first. Here it compares site[b] with site[a], the median yields computed earlier, so a site with a higher median is put before a site with a lower one.
As a result, the first key should be Waseca, then Crookston, etc.
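
The resulting order can be checked with the standard Array sort and the same comparator, using the site medians listed earlier. This is a plain-JavaScript sketch, not a call into Protovis.

```javascript
// Median yield per site, as listed in the output of the first rollup.
var site = {
  "Crookston": 39.03333,
  "Duluth": 28.533335,
  "Grand Rapids": 23.983335,
  "Morris": 34.699995,
  "University Farm": 31.383335,
  "Waseca": 47.949995
};

// Same comparator as in sortKeys: a higher median comes first.
var ordered = Object.keys(site).sort(function (a, b) {
  return site[b] - site[a];
});

console.log(ordered);
// [ 'Waseca', 'Crookston', 'Morris', 'University Farm', 'Duluth', 'Grand Rapids' ]
```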
The next level of the hierarchy is the year. There is no sortKeys here, so the year keys are simply presented in their natural order.
So our tree will look like:

[
  {key: "Waseca", values: [
    {key: 1931, values: […]},
    {key: 1932, values: […]}
  ]},
  {key: "Crookston", values: [
    {key: 1931, values: […]},
    {key: 1932, values: […]}
  ]},
  …
]

The next statement, sortValues, orders the values.
What is in that values field (where the ellipses are) will now be sorted using another comparator function, on variety this time.
The sortKeys statement worked on keys (“Crookston”, “Duluth”, etc.), so one could write site[a] directly and get a value.
But this sortValues statement works on entries (records like { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" }), so one can’t write variety[a]; one has to write variety[a.variety] instead.
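
Here is the same comparator applied with the standard Array sort, as a sketch outside Protovis. The variety medians are taken from the rollup output above, and the three records are the Waseca 1931 entries that appear in the final tree further down, deliberately listed out of order.

```javascript
// Median yield per variety (subset of the rollup output above).
var variety = {
  "Trebi": 39.199995,
  "Wisconsin No. 38": 36.95,
  "No. 457": 33.966665
};

// Records, not keys: the comparator has to look up a.variety and b.variety.
var records = [
  { site: "Waseca", variety: "No. 457", year: 1931, yield: 58.1 },
  { site: "Waseca", variety: "Trebi", year: 1931, yield: 63.8333 },
  { site: "Waseca", variety: "Wisconsin No. 38", year: 1931, yield: 58.8 }
];

records.sort(function (a, b) {
  return variety[b.variety] - variety[a.variety];
});

console.log(records.map(function (d) { return d.variety; }));
// [ 'Trebi', 'Wisconsin No. 38', 'No. 457' ]
```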

Finally, the last statement, entries(), says that the values field should be filled with the actual records, as opposed to rollup, which collapses the records into a single aggregate value.

So the final tree will look like this:

[
  {key: "Waseca", values: [
    {key: 1931, values: [
      {site: "Waseca", variety: "Trebi", year: 1931, yield: 63.8333},
      {site: "Waseca", variety: "Wisconsin No. 38", year: 1931, yield: 58.8},
      {site: "Waseca", variety: "No. 457", year: 1931, yield: 58.1},
      …
    ]},
    {key: 1932, values: […]}
  ]},
  {key: "Crookston", values: […]},
  …
  {key: "Grand Rapids", values: […]}
]
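
For readers who want to experiment with this shape outside Protovis, a two-level nest can be sketched in a few lines of plain JavaScript. nestEntries is a hypothetical helper, not the Protovis implementation; it leaves out the sorting steps, keeps groups in order of first appearance, and stores keys as strings (an assumption of this sketch).

```javascript
// Hypothetical helper: nest records under successive keys, returning
// [{key, values}, ...] at each level and the raw records at the leaves.
function nestEntries(records, keyFns) {
  if (keyFns.length === 0) return records;
  var keyFn = keyFns[0], order = [], groups = {};
  records.forEach(function (d) {
    var k = String(keyFn(d));
    if (!groups.hasOwnProperty(k)) {
      groups[k] = [];
      order.push(k);
    }
    groups[k].push(d);
  });
  return order.map(function (k) {
    return { key: k, values: nestEntries(groups[k], keyFns.slice(1)) };
  });
}

// The three records quoted in the data excerpt.
var sample = [
  { yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" },
  { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
  { yield: 27.43334, variety: "Manchuria", year: 1931, site: "Morris" }
];

var tree = nestEntries(sample, [
  function (d) { return d.site; },
  function (d) { return d.year; }
]);

console.log(JSON.stringify(tree, null, 2));
```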

The next block of statements is more straightforward.

/* Sizing and scales. */
var w = 242,
    h = 132,
    x = pv.Scale.linear(10, 70).range(0, w),
    c = pv.Colors.category10();

w and h are constants for the width and height of the cells,
x is a scale that transforms yields into horizontal coordinates, so 10 will be drawn at the left-most side of the cell and 70 at the right-most side,
c is a standard 10-color categorical palette.
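
The mapping performed by x is a simple linear interpolation. As a sketch, with linearScale being a hypothetical stand-in for pv.Scale.linear(10, 70).range(0, w) rather than the library code:

```javascript
// Hypothetical stand-in for pv.Scale.linear(d0, d1).range(r0, r1).
function linearScale(d0, d1, r0, r1) {
  return function (v) {
    return r0 + (v - d0) / (d1 - d0) * (r1 - r0);
  };
}

var w = 242;
var x = linearScale(10, 70, 0, w);

console.log(x(10)); // 0   (left edge of the cell)
console.log(x(40)); // 121 (middle of the cell)
console.log(x(70)); // 242 (right edge of the cell)
```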

Now, we create the panels.

/* The root panel. */
var vis = new pv.Panel()
    .width(w)
    .height(h * pv.keys(site).length)
    .top(15)
    .left(90)
    .right(20)
    .bottom(25);

/* A panel per site-year. */
var cell = vis.add(pv.Panel)
    .data(barley)
    .height(h)
    .top(function() this.index * h)
    .strokeStyle("#999");

First we create the root panel. The height of the panel is determined by the number of sites.
Since site is an associative array, we first derive an array of the same length with pv.keys(site), which contains the names of the keys of that associative array (in plain English, the names of the sites). What we need is just how many there are, which is what the length property gives us.
(By the way, we could simply have written barley.length.)
We multiply that number by the height of each cell to get the height of the root panel.

We then create panels inside that root panel.
Note the simplicity with which data is pushed into those panels:

    .data(barley)

Now barley is an array of 6 objects. So we are creating 6 panels, and the data element of each will be of the form:

{key: "Waseca", values: [{key: "1931", values: [entries]}, {key: "1932", values: [entries]}]}

Since these cells will be positioned from the root panel, their top value is determined using a simple function of this.index, so the 1st one will be right on top, the next one will be at h pixels from the top, the next one at 2*h pixels, etc.
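
The arithmetic behind the root height and the cell positions can be written out as a short plain-JavaScript sketch. The site names are listed in the order produced by the sortKeys step; the variable names rootHeight and tops are only illustrative.

```javascript
var h = 132;

// Site names in descending order of median yield.
var sites = ["Waseca", "Crookston", "Morris",
             "University Farm", "Duluth", "Grand Rapids"];

// Equivalent of h * pv.keys(site).length for the root panel.
var rootHeight = h * sites.length;

// Equivalent of .top(function() this.index * h) for each cell.
var tops = sites.map(function (name, index) {
  return index * h;
});

console.log(rootHeight); // 792
console.log(tops);       // [ 0, 132, 264, 396, 528, 660 ]
```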

/* Title bar. */
cell.add(pv.Bar)
    .height(14)
    .fillStyle("bisque")
  .anchor("center").add(pv.Label)
    .text(function(site) site.key);

This just adds a title to the cells. Here, we use a bar, but it could have been a panel.
We only specify the height: the top and left values default to 0 and the width is that of the parent. We choose the SVG color “bisque” as the background color and we add a label in the middle.
Here’s how we obtain the text. Yes, it is the site name, but this has nothing to do with the choice of “site” as the parameter name in the accessor function.
That function just receives the data element, which is inherited directly from the parent. It is a nested associative array, but at its first level, the key property holds the site name. So, function(site) site.key returns the site name.

/* A dot showing the yield. */
var dot = cell.add(pv.Panel)
    .data(function(site) site.values)
    .top(23)
  .add(pv.Dot)
    .data(function(year) year.values)
    .left(function(d) x(d.yield))
    .top(function() this.index * 11)
    .size(12)
    .lineWidth(2)
    .strokeStyle(function(d) c(d.year));

Here’s the data representation.
We first add a group of panels to the panel cell.
This is a group, and not a single panel, because of what they get through the data method: the content of the values key of their parent.
Again, the data element of their parent, cell, is of the form:

{key: site name, values: [{key: "1931", values: […]}, {key: "1932", values: […]}]}

So if we isolate the content of the values key, we have an array of 2 values:

[{key: "1931", values: […]},
 {key: "1932", values: […]}]

Therefore, this statement creates 2 panels, not 1.
But those panels are superimposed: the only positioning instruction is that they start 23 pixels from the top, so both take the full width of the cell panel and extend to its bottom.
The system draws the first one, then the second one on top of it.

Then, we add a dot mark to each of those panels, which draws a series of circles.
How many dots will there be in each series? This, again, depends on what is passed to the data method.
What we get this time is:

    .data(function(year) year.values)

What this means is that we are looking at the data element of the parent, and taking what is stored under values.

The data element of the parent was of the form:

{key: year name, values: 
  [
   {site: site name, variety: variety name, year: year name, yield: yield value},
   {site: site name, variety: variety name, year: year name, yield: yield value}, 
   {site: site name, variety: variety name, year: year name, yield: yield value},
…
   {site: site name, variety: variety name, year: year name, yield: yield value}
  ]
}

So what we get is the array of 10 entries stored under values, one per variety.

.left(function(d) x(d.yield))

This means that each mark will be positioned from the left, according to the value of its yield property, once transformed by our scale.

.top(function() this.index * 11)

And they are simply positioned vertically depending on this.index, each one 11 pixels below the previous one. Remember, this is relative to the parent panel of the dots, not to cell.

.size(12)
.lineWidth(2)

We set a size and a line width, but no fill style, so we are drawing rings, not discs. Even where rings overlap, the one at the back should still be visible.

.strokeStyle(function(d) c(d.year));

Finally, we set the color of the ring. For that, we extract the year of the current entry using d.year, and the palette c converts it to a color.
There are other ways to get the year, but this is the simplest to write.
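
To see how the nested data drives the marks, the walk can be imitated in plain JavaScript: one panel per year entry, one dot per record, with left and top computed as in the properties discussed above. layoutDots is a hypothetical function for illustration; Protovis evaluates these properties itself when rendering. The data is a single site entry holding the three Waseca 1931 records quoted earlier.

```javascript
var w = 242;

// Same linear mapping as the x scale: 10 maps to 0, 70 maps to w.
function x(v) { return (v - 10) / (70 - 10) * w; }

var siteEntry = {
  key: "Waseca",
  values: [
    { key: 1931, values: [
      { site: "Waseca", variety: "Trebi", year: 1931, yield: 63.8333 },
      { site: "Waseca", variety: "Wisconsin No. 38", year: 1931, yield: 58.8 },
      { site: "Waseca", variety: "No. 457", year: 1931, yield: 58.1 }
    ] }
  ]
};

// Hypothetical walk over the nested data.
function layoutDots(entry) {
  var dots = [];
  entry.values.forEach(function (yearEntry, panelIndex) {
    yearEntry.values.forEach(function (d, index) {
      dots.push({
        panel: panelIndex,      // index of the parent (year) panel
        variety: d.variety,
        left: x(d.yield),       // .left(function(d) x(d.yield))
        top: index * 11         // .top(function() this.index * 11)
      });
    });
  });
  return dots;
}

var dots = layoutDots(siteEntry);
console.log(dots);
```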

Now if we consider the statement in its entirety again:

/* A dot showing the yield. */
var dot = cell.add(pv.Panel)
    .data(function(site) site.values)
    .top(23)
  .add(pv.Dot)
    .data(function(year) year.values)
    .left(function(d) x(d.yield))
    .top(function() this.index * 11)
    .size(12)
    .lineWidth(2)
    .strokeStyle(function(d) c(d.year));

What is really assigned to dot is not the first item created (the panels) but the dot mark proper. This is important to keep in mind for what follows:

/* A label showing the variety. */
dot.anchor("left").add(pv.Label)
    .visible(function() !this.parent.index)
    .left(-1)
    .text(function(d) d.variety);

What we’re trying to do here is to write labels for the dots. We don’t want to do this for each year: the labels would be on top of each other, and less legible than a single group of labels. This is why we use the visible property. this.parent refers to the parent panel of dot, so this.parent.index is 0 for the first panel and 1 for the second. !this.parent.index is therefore true only for index 0, so there will be only 1 set of labels, even if we add a third year, for instance.
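
The trick relies on JavaScript's truthiness rules: 0 is falsy, every other index is truthy. A tiny standalone check, with an assumed third panel added to show that it stays hidden too:

```javascript
// Index of the parent (year) panel: 0 for the first year, 1 for the second,
// 2 for a hypothetical third year.
var parentIndexes = [0, 1, 2];

// Same test as .visible(function() !this.parent.index).
var visible = parentIndexes.map(function (index) {
  return !index;
});

console.log(visible); // [ true, false, false ]
```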

Then, although we said we were adding the labels just to the left of the dot (this is what dot.anchor("left") does), what we really want is to have them on the left of the cell.
That’s why we use:

.left(-1)

This overrides the horizontal position that the label would otherwise take from its dot, so all the labels line up against the left edge of the panel.

.text(function(d) d.variety)

This runs on the data provided by the parent, in this case the dot mark. So it’s just the variety of the current entry.

/* X-ticks. */
vis.add(pv.Rule)
    .data(x.ticks(7))
    .left(x)
    .bottom(-5)
    .height(5)
    .strokeStyle("#999")
  .anchor("bottom").add(pv.Label);

This adds a series of 7 ticks to the bottom of the chart. They are rules, but their height has been limited to 5 pixels. They start just below the chart, thanks to bottom(-5), and we add the labels underneath them with anchor("bottom"). Since all the values are 2-digit numbers, we can keep the default formatting of the pv.Label objects, and not go into tickFormat or anything fancy.
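
As a rough illustration of where the ticks land, the sketch below assumes a step of 10, which is consistent with 7 ticks on a domain going from 10 to 70. Protovis picks the step itself in x.ticks(7); this plain-JavaScript sketch does not reproduce that logic.

```javascript
var w = 242;

// Same linear mapping as the x scale: 10 maps to 0, 70 maps to w.
function x(v) { return (v - 10) / (70 - 10) * w; }

// Assumed step of 10 between ticks (not computed the way Protovis does it).
var ticks = [];
for (var t = 10; t <= 70; t += 10) {
  ticks.push(t);
}

// Equivalent of .left(x): each rule is placed at the scaled tick value.
var positions = ticks.map(x);

console.log(ticks);                      // [ 10, 20, 30, 40, 50, 60, 70 ]
console.log(positions[0], positions[6]); // 0 242
```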

/* A legend showing the year. */
vis.add(pv.Dot)
    .extend(dot)
    .data([{year:1931}, {year:1932}])
    .left(function(d) 170 + this.index * 40)
    .top(-8)
  .anchor("right").add(pv.Label)
    .text(function(d) d.year);

Finally, the legend of the chart, in the form of a dot chart.
The .extend(dot) statement says that we are copying the properties of the other dot mark. So, if we decide to change the size of the rings or their thickness, the change will be reflected in the legend.
Here, we just pass the data manually. Note that the legend is added to the root panel (vis), so top(-8) means that it appears above the main chart.
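
A loose analogy in plain JavaScript may help: prototype inheritance behaves in a similar way, with the legend falling back to the dot's properties unless it overrides them. This is only an analogy with made-up object names (dotProps, legendProps), not a description of how Protovis implements extend.

```javascript
// Properties set on the original dot mark.
var dotProps = { size: 12, lineWidth: 2 };

// The legend inherits from the dot and overrides only what differs.
var legendProps = Object.create(dotProps);
legendProps.data = [{ year: 1931 }, { year: 1932 }];
legendProps.left = function (d, index) { return 170 + index * 40; };
legendProps.top = -8;

// Horizontal positions of the two legend dots.
var lefts = legendProps.data.map(legendProps.left);
console.log(lefts); // [ 170, 210 ]

// Changing the original is visible through the legend.
dotProps.size = 20;
console.log(legendProps.size); // 20
```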

In this example, the authors started from an unformatted, flat list of 120 entries.
Then, they thought about the shape their final visualization would take and the hierarchy of objects needed to accommodate it:
• two superimposed dot series, showing the various varieties for a given year/site combination (dot),
• each inside a panel, one for each year,
• which would be part of a parent panel, for each site (cell)
• which would be stacked vertically inside the root panel (vis).

So in order to do that seamlessly, they had to prepare a data variable with that hierarchy:
first by site, then by year, with the records of each year ordered by variety.

This is what the complex pv.nest command does at the beginning.
Once this is taken care of, the various objects can be nested one inside the other naturally.

 
