Hollywood + data

23 January, 2012 (22:53) | data visualization | By: jerome

Ever heard of the Information is Beautiful awards? It’s a series of visualization competitions put together by David McCandless of Information is Beautiful fame.

Part of it is a series of monthly competitions that run on a curated dataset. Jen Lowe and I have teamed up for the current one, about the movie industry, and we are going to deliver a competitive entry! Even our drafts are rocking! But while we were looking for that one great idea, I explored the data.

Part of the dataset is the total box office earnings of over 600 of the movies released in the US in the last 5 years. What I did was cross that list with the user-contributed plot keywords on IMDb. Then I ran a regression on that to find out how much each keyword would generate at the box office. (I only kept the keywords mentioned more than 5 times, which is just over 2,500, because the full list is 20,000+.) The full list is at the end of the article.
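For the curious, here is a minimal sketch of what such a regression can look like. It assumes ordinary least squares with one 0/1 column per keyword plus an intercept, which is only one possible way to set it up; the movies, the figures and the function name fitKeywordRegression are made up for illustration.

```javascript
// Made-up toy data: box office total (in $m) and plot keywords per movie.
const movies = [
  { gross: 150, keywords: ["superhero"] },
  { gross: 60, keywords: ["wedding"] },
  { gross: 170, keywords: ["superhero", "wedding"] },
  { gross: 40, keywords: [] },
];

// Ordinary least squares: gross = intercept + sum of the effects of the keywords present.
function fitKeywordRegression(movies, keywords) {
  // Design matrix: a leading 1 for the intercept, then one 0/1 column per keyword.
  const X = movies.map(m => [1].concat(keywords.map(k => (m.keywords.includes(k) ? 1 : 0))));
  const y = movies.map(m => m.gross);
  const n = keywords.length + 1;

  // Augmented normal equations [X'X | X'y].
  const A = [];
  for (let i = 0; i < n; i++) {
    const row = [];
    for (let j = 0; j < n; j++) {
      row.push(X.reduce((sum, x) => sum + x[i] * x[j], 0));
    }
    row.push(X.reduce((sum, x, r) => sum + x[i] * y[r], 0));
    A.push(row);
  }

  // Gauss-Jordan elimination with partial pivoting.
  for (let col = 0; col < n; col++) {
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
    }
    const tmp = A[col];
    A[col] = A[pivot];
    A[pivot] = tmp;
    if (Math.abs(A[col][col]) < 1e-12) {
      throw new Error("keyword columns are collinear, no unique fit");
    }
    for (let r = 0; r < n; r++) {
      if (r === col) continue;
      const f = A[r][col] / A[col][col];
      for (let c = col; c <= n; c++) A[r][c] -= f * A[col][c];
    }
  }

  // The matrix is now diagonal: read off the coefficients.
  const effects = {};
  keywords.forEach((k, i) => {
    effects[k] = A[i + 1][n] / A[i + 1][i + 1];
  });
  return { intercept: A[0][n] / A[0][0], effects: effects };
}

const fit = fitKeywordRegression(movies, ["superhero", "wedding"]);
console.log(fit.intercept, fit.effects);
```

With thousands of keywords and a few hundred movies the columns will often be collinear, so a real analysis would need a statistics package and some care; the sketch only shows the idea.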


Using d3 with a mySql database

2 January, 2012 (20:15) | d3, tips | By: jerome

Creating visualizations from static files is fine and dandy but sometimes you need to be able to access dynamic data. And some other times, you may want to somehow record interactions from your users. One way to do that is by interacting with a mySql database.

Without further ado here is the demo:

How does it work?

There are several parts to that.

First, one html file which holds everything together. By the way, for the styling I used Twitter’s bootstrap which makes it easy for all elements to find their place, and look at those purty buttons.

Second, one javascript file which contains the visualization proper.  If you have some familiarity with d3, there is really nothing scary in this script. I’ll go back to the parts where the script interacts with databases in detail.

Here’s what the rest does at a high level.

  1. We give some behaviors to the buttons
  2. Then we create a grid of small squares. All of these squares are positioned and given class names, so that the square with classes “r32” and “c17” is the 18th square from the left and the 33rd from the top (the class names start at 0).
  3. We catch the clicks on each square with a “clickme” function. In d3 logic, what is passed to that function is the underlying data of the element, in this case a two-element array with the x and y coordinates of the square being clicked on. In turn, the clickme function updates the data of that square and of the 4 surrounding squares (top, bottom, left and right) by either increasing or decreasing the elevation of the terrain they represent.
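The steps above can be sketched without d3 as follows. This is a simplified illustration and not the demo's actual code: the grid size, the step of 1 and the helper classNames are assumptions.

```javascript
// Assumed dimensions, kept small for the example.
const x = 5; // number of columns
const y = 4; // number of rows
let build = 1; // 1 raises the terrain, -1 lowers it (set by the buttons)

// Heights of every square, all starting at 0.
const data = Array.from({ length: y }, () => new Array(x).fill(0));

// One datum per square: [column, row]. In the d3 version each datum is bound
// to a square carrying the classes "r<row>" and "c<column>".
const squares = [];
for (let r = 0; r < y; r++) {
  for (let c = 0; c < x; c++) squares.push([c, r]);
}

// Class names for a square, counting from 0 (hypothetical helper).
function classNames(d) {
  return "r" + d[1] + " c" + d[0];
}

// Change one height, ignoring squares outside the grid and keeping
// the result between 0 and 100.
function update(r, c, v) {
  if (r >= 0 && r < y && c >= 0 && c < x) {
    data[r][c] = Math.max(0, Math.min(100, data[r][c] + v * build));
  }
}

// Click handler: receives the datum of the clicked square and updates it
// together with its 4 neighbours. The step of 1 is an assumption.
function clickme(d) {
  const c = d[0], r = d[1];
  update(r, c, 1);
  update(r - 1, c, 1); // top
  update(r + 1, c, 1); // bottom
  update(r, c - 1, 1); // left
  update(r, c + 1, 1); // right
}
```

Calling clickme([2, 1]) raises the square in column 2, row 1 and its four neighbours; clicking a corner simply skips the neighbours that fall outside the grid.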

Where it gets interesting is how the data is initialized and how it is updated.

d3.text("mapread.php", function(txt) {
	d3.selectAll("#loading").remove();
	txt.split("\n").forEach(function(line,i) {
		line.split(",").forEach(function(d,j) {
			data[i][j]=parseFloat(d);
			d3.selectAll(".r"+i+".c"+j).style("fill",function() {return cScale(data[i][j]);});
		})
	});
})

What’s really interesting here is the first line. We are asking d3 to go fetch a text file sitting at mapread.php, then do something with this file. The second part of the line, function(txt), defines the function that d3 will call, once the file has arrived, with the contents of this text as argument.

The second line just removes the loading message box. Then, d3 splits the text in lines, and each line being a string of comma-separated values, it splits that too and feeds a variable, data, with the result of all of this splitting. Then, it formats the squares by coloring them according to the retrieved values.
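The splitting step on its own can be sketched like this, without d3 and without the drawing part. The function name parseGrid is made up, and skipping blank lines is my addition: it avoids creating an extra row when the text ends with a newline.

```javascript
// Turn "1,2\n3,4\n" into [[1, 2], [3, 4]].
function parseGrid(txt) {
  return txt
    .split("\n")
    .filter(line => line.trim() !== "") // ignore the blank line after a trailing "\n"
    .map(line => line.split(",").map(d => parseFloat(d)));
}

console.log(parseGrid("0,10,20\n5,15,25\n"));
```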

At this stage you may think: but shouldn’t you load the data before drawing the scene? Well, what happens here is that loading the data takes much more time than drawing the scene, so it makes more sense to draw it first as an empty shell, load the data and then update the scene.

And as you may have guessed, this mapread.php is no ordinary text file, but a file generated dynamically from a mySql database. I won’t cover setting up a mySql database: tutorials on the subject abound, there are ISPs that offer free mySql hosting, and you can also install a local server on your computer, for instance EasyPHP for Windows users. And if your ISP limits the number of mySql databases you can have, you don’t need to create a new one; creating a new table within an existing one will be fine. All you really have to do is find your mySql credentials.

Next, you want to create a PHP file that goes like this:

<?php
$username="username"; //replace with your mySql username
$password="password";  //replace with your mySql password
$dsn="database";  //replace with your mySql database name
$host="host";  //replace with the name of the machine your mySql runs on
$link=mysql_connect($host,$username,$password);
?>

You can call this mysqlConfig.php or whatever; it is a convenience file so you don’t have to type in your credentials each time you need to connect to your mySql database.

Next, here is the script that reads the database and outputs a text file:


<?php
// load in mysql server configuration (connection string, user/pw, etc)
include 'mysqlConfig.php';
// connect to the database
@mysql_select_db($dsn) or die( "Unable to select database");

// reads the map db

$query="SELECT `height` FROM `v_map` ORDER BY `row`, `col`";

// runs the query (once) and stops with a message if it fails
$result = mysql_query($query,$link) or die('Errant query: '.$query);

// outputs the db as lines of text.
header('Content-type: text/plain; charset=us-ascii');
$i=0;
$line="";

if(mysql_num_rows($result)) {
	while($value = mysql_fetch_assoc($result)) {
		$line=$line.$value["height"];
		$i=$i+1;
		// a row of the map holds 52 values: after the 52nd, output the line and start a new one
		if ($i==52) {
			$i=0;
			echo $line."\n";
			$line="";
		}
		else {$line=$line.",";}
	}
}
mysql_close();
?>

And by the way, I am by no means a php expert. I hadn’t written a line of php in almost 10 years, so there may well be more effective ways to do that but the above works. The more interesting part is that we write an sql query which we store in $query and then we execute this query. Then, we loop over the results and echo the output.

Back to our javascript file, we also interact with another php file when we update the data.

function update(r,c,v) {
	if(r>=0 && r<y && c>=0 && c<x) {
		data[r][c]=d3.max([d3.min([100,data[r][c]+v*build]),0]);
		d3.selectAll(".r"+r+".c"+c).style("fill",function() {return cScale(data[r][c]);});
		d3.text("mapupdate.php?height="+data[r][c]+"&col="+c+"&row="+r,function() {console.log("cell on row "+r+" and col"+c+" updated to "+data[r][c]);});
	}
}

Here the last line is the interesting one. What it does is that, again, it attempts to fetch a text file from a url. In fact, there is no text there but just accessing this url will trigger an interaction with the database. (I guess it would be good practice to actually get some text in return, but hey).

The program tries to read a url of the form mapupdate.php?height=20&col=10&row=32. By calling this url, we are actually passing these parameters to a php file, which will read them and use them to construct a query to the mySql database.
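As an aside, here is one way to build such a url a bit more defensively. It is a sketch rather than what the demo does: buildUpdateUrl is a made-up name, and it relies on URLSearchParams, which is available in current browsers and in Node.

```javascript
// Build "mapupdate.php?height=..&col=..&row=.." from three numbers.
function buildUpdateUrl(row, col, height) {
  // Refuse anything that is not a finite number before it reaches the server.
  [row, col, height].forEach(value => {
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new Error("row, col and height must be finite numbers");
    }
  });
  const params = new URLSearchParams({
    height: String(height),
    col: String(col),
    row: String(row),
  });
  return "mapupdate.php?" + params.toString();
}

console.log(buildUpdateUrl(32, 10, 20));
```

The check on the browser side is only a convenience; the php script still has to treat whatever it receives with suspicion.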

Here goes:

<?php

// load in mysql server configuration (connection string, user/pw, etc)
include 'mysqlConfig.php';
// connect to the database
@mysql_select_db($dsn) or die( "Unable to select database");

// updates the map db

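// cast the parameters to numbers so that arbitrary text from the url cannot end up in the query
$height=(float)$_GET["height"];
$col=(int)$_GET["col"];
$row=(int)$_GET["row"];
$query="UPDATE `v_map` SET `height`=".$height." WHERE `col`= ".$col." and `row`= ".$row;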
mysql_query($query);
mysql_close();
?>

Here, the line that starts with $query is doing just that. The dot “.” is PHP’s concatenation operator, and the $_GET variable returns an associative array with the parameters passed to the script.

For the sake of completeness, I had two other php scripts, one to initiate the table to begin with, and one to reset it if something went wrong. Those are just plain SQL queries so no need to reproduce them here.

And voilà! Now all of you can interact with this terrain builder and create islands, forests, mountains etc. The graphics are kind of crude because, when I was looking for an example, I decided to recreate one of my earliest attempts at creative coding. In 1990, upon the release of Powermonger, I was so fascinated by the algorithmically-generated maps the game used as copy protection that I tried to code my own terrain generator. That was a time when a 320x240x16 resolution was considered generous. Only here, it’s your clicks that replace the algorithm!

I hope you enjoy the tutorial and working with persistent data with d3!

Don’t take my word for it

24 November, 2011 (03:18) | presentation | By: jerome

Inspiration

In June 2010, I attended a Wolfram|Alpha event called the London Computational Knowledge Summit where speakers mostly focused on how computers can transform the way we teach and transmit knowledge. Several of the presentations made a lasting impression, most of all the talk by Jon McLoone:

Jon’s point was that academic papers today look an awful lot like those of the 17th century. Granted, they’re not in Latin, they can be displayed online and there is color, but as far as maths is concerned it’s still long pages of difficult language and long formulas. The computer, however, can do so much more than transmit information. In the clip above (around 6’20″) Jon shows how a paper on edge detection can be much more effective if, instead of using a static example to demonstrate the technique, the paper uses a live example, such as input from the camera. In that talk and throughout the day, there were more examples of how interactive displays could be useful for teaching.

Teaching, telling stories and getting a message across use similar functions. Fast forward to VisWeek 2010 and the first “Telling Stories with Data” workshop. Some of the presentations there (I’m thinking of Nick Diakopoulos and Matthias Shapiro mostly) hinted that there could be a process through which readers/users/audience could be taken so they can make the most of an intended message. Interestingly, this process is not about transmitting as much data as effortlessly as possible but rather about engaging the audience and getting them to challenge their assumptions.

Those two events really made me pause and think. Ever since I had started working in visualization, all my efforts had been focused on being as clear as possible, and my focus, on efficient visuals. However, for some tasks, clarity just isn’t optimal. That wasn’t much of an issue in most of my OECD work where such an approach makes a lot of sense but I started seeing that there was a world of possibility when it comes to changing people’s perception on a subject or even persuading them.

Application

French pension reform

Right at the time of VisWeek 2010, France was plagued by strikes against the proposed pension reform. At the peak of the protests up to 3m people demonstrated (that’s as many as one adult out of 14). I was quite irritated by the protests. In theory, left and right had very comparable views on this problem and only disagreed on insignificant details. They both knew reform was unavoidable and, again, had similar plans. But when those of the current government were implemented, the opposition capitalized on the discontent and attacked the plan vigorously. Their rhetoric was entirely verbal – no numbers were harmed in the making of their discourse! Consequently, protesters and a large part of the population started to develop notions about the state of pensions which were completely disconnected from reality.

I believe that if numbers had been used early enough, they would have provided a counterpoint to such fallacies, and while that may not have prevented demonstrations, it would have greatly helped to dampen their effect. With that in mind and with official data I tried to build a model to show what would happen if one changed this or that parameter of pension policy. Pension mechanics are quite simple: what you give on one side, you take on another; the evolution of population is quite well known, so making such a model is pretty straightforward. But putting that in a visual application really showed how the current situation was unsustainable. In this application I challenge the user to find a solution – any solution – to the pension problem, by using the same levers as the policy makers. It turns out that there is just one viable possibility. Yet letting people find that by themselves and challenge that idea as hard as they could was very different from patronizing them and telling them that this was just the way it is.

Over the course of the year I got involved on several occasions in situations like this, where data visualization could be used to influence people’s opinion, and each time I tried to use that approach: instead of sending a top-down message (with or without data), confront the assumptions of the audience and get them to interact with a model. After this experience, their perception will have changed. This technique doesn’t try to bypass the viewers’ critical thinking, but instead to leverage their intelligence.

In politics

I am very concerned with the use of data visualization in politics for many reasons. One of them is that I’m a public servant. In my experience, most decisions are not taken by politicians, but by experts or technicians who are committed to the public good. Yet, when poorly explained, these decisions can be misunderstood and attacked. Visualization, I believe, can help defend such decisions (those which are justifiable at least) and explain them better to a greater number.

Although a lot of data is available out there (or perhaps for that very reason) only few people have a good grasp of the economic situation of their country. This just can’t be helped. It’s not possible to increase the percentage of people who can guesstimate the unemployment rate and it’s not really important. Very few people need to know such a number. What is important is to be able to use that information in context when it is useful. For instance, at election time, a voter should be able to know if the incumbent has created or destroyed jobs. This is something that data visualization can handle brilliantly.

Finally, my issue with political communication is that it is written by activists, for activists. It works well to motivate people with a certain sensitivity but it is not very effective at getting others to change sides. This is a bias which is difficult to detect by those in charge of political communications because, well, they’re activists too… And here this flavor of model-based data visualization, with its appearance of objectivity and neutrality, can complement the more verbal aspect of rhetoric quite well.

In the talk I used Al Gore’s An Inconvenient Truth as a counter example. This movie is a fine example of story-telling, operating at an emotional rather than at a rational level. I trust that people who feel concerned about climate change will be reinforced in their beliefs after seeing the movie. However, those who did not remained unconvinced. In fact, the movie also gave a strong boost to climate skeptics. There was a real barrage of blog posts and websites attempting to debunk the assertions of that “truth”, most often with data. There is a missed opportunity: if the really well-made stories of the movie had been complemented with a climate model that people could experiment with, it would have been perceived as less monolithic, less manichean, less dogmatic.

The conclusions

In my practice using an interactive model can help a lot to get a message across (and no, I don’t have a rigorous evaluation for “a lot”, that’s the advantage of not being an academic).

Such models engage the users, they come across as more objective and truthful than static representations, and they can be very useful to address preconceptions. Chances are they’re more fun, too.

Then again, just because a model is interactive and built on transparent data and equations doesn’t mean it’s objective. It is usually possible to control the model or the interface so that one interpretation is more likely than the other, and that’s precisely the point if you are using data visualization to influence.

It can be very cheap and easy to turn a static representation into an interactive display. Every chart with more than 2 dimensions can be turned into a visualization where the user controls one dimension and sees data for the others evolve.
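To make the idea concrete, here is a minimal sketch of the data side of such a display. The table, the figures and the function names choices and view are all made up; in a real page the chosen value would come from a slider or a menu, and the returned rows would feed the chart.

```javascript
// Made-up data with three dimensions: country, year and unemployment rate.
const table = [
  { country: "A", year: 2009, rate: 7.5 },
  { country: "B", year: 2009, rate: 9.1 },
  { country: "A", year: 2010, rate: 8.0 },
  { country: "B", year: 2010, rate: 8.4 },
];

// The values the user can pick for the dimension they control.
function choices(rows, dimension) {
  const unique = Array.from(new Set(rows.map(row => row[dimension])));
  return unique.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// The rows to draw once the user has picked a value for that dimension.
function view(rows, dimension, value) {
  return rows.filter(row => row[dimension] === value);
}

console.log(choices(table, "year"));
console.log(view(table, "year", 2010));
```

Each time the user moves the control, the chart is redrawn from view(table, "year", chosenYear), so a static table of country by year becomes a chart of countries that evolves with the year.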

And if you build a model like this, you must be very open and transparent about the data and the equations and sometimes find ways to get people to overcome their doubts.

Besides, having a working interactive model is no guarantee of success. You really have to be careful that your users are not likely to interpret your visualization in ways you never intended.

The presentation


All the examples I used in the presentation, both good and bad, both mine and others’, can be found at http://www.jeromecukier.net/data-stories/

Promising difficulties

13 November, 2011 (22:18) | data visualization | By: jerome

At the recent VisWeek conference, Jessica Hullman and her coauthors presented “Benefitting Infovis with Visual Difficulties (pdf)”, a paper that suggests that the charts which are read almost effortlessly are not necessarily the ones that readers understand or remember best. In response to that claim, Stephen Few wrote a rather harsh critique of this paper (pdf). As I read it I felt the original paper was not always fairly represented, but more importantly, that the views developed by both parties are not at all irreconcilable. Let me explain.

What is cognitive efficiency, or “say it with bar charts”

For quite some time, we were told that to better communicate with data, we had to make visuals as clear as possible.

The more complicated way of saying that is talking of “cognitive efficiency”. By reducing the number of tasks needed to understand a chart and simplifying them, which is sometimes called reducing the “cognitive cost” or “cognitive load”, we improve all virtues of the chart.

Various charts based on the same data points, shown in order of cognitive cost. From left to right, they make the task of comparing individual values increasingly easy.

For instance: bar charts are easier to process than pie charts, because it’s easier for the human eye to compare lengths than angles. So, with equivalent data, bar charts have a lower cognitive cost than pie charts. Likewise, bar charts which are ordered by value (smallest bars to largest bars) are easier to read than unordered ones, so ordering the bars lowers the cognitive cost further.

Conversely, adding non-data elements adds extra tasks for the reader and increases cognitive cost. These non-data elements have been reviled by Edward Tufte as “chartjunk”. His data-ink theory says that out of all the ink used for the chart, as much as possible should be devoted to data elements. Again, this goes in the direction of cognitive efficiency.

Engagement rather than immediacy?

Again, for quite some time those rules were held to be universal. Yet several people tried to challenge them, the latest being Jessica Hullman in her paper “Benefitting Infovis with Visual Difficulties”. This paper was so thought-provoking that it received an honorable mention at the recent IEEE Information Visualization Conference 2011 (as a note to the non-academic reader, this is quite a competitive achievement).

New information visualisation techniques are often evaluated. This paper argues that such evaluations typically consider response time or accuracy, and not how well users are able to interpret and remember visuals. When only the former criteria are taken into account then cognitive efficiency is the superior framework. But this is not the case for data storytelling (which is, arguably, a small subset of all data visualizations).

When visualizations attempt to transmit a message, then how well users can receive this message, as well as their capacity to remember it for a long time, are of utmost importance, much more than the ease with which a visualization is read.

In that case, Jessica Hullman proposes a trade-off between cognitive efficiency and “obstructions”. The idea is that such obstructions, or visual difficulties, can trigger active learning processes. In other words, if when trying to read a chart, a user doesn’t understand it effortlessly, but is somehow willing to get to the bottom of it, she will apply all her active brainpower to it. This effort surge will lead her to not only better interpret it but also to better remember it. To sum up, these obstructions can have positive effects; when they do, they are called desirable difficulties.

Desirable difficulties are tricky, because if the “obstruction” is too large, if a small additional effort is not enough to understand the chart, then it will not work. So, this is definitely not about maximizing the difficulty to understand the visualizations.

In the recommendations part of the paper the authors say:

Instead of minimizing the steps required to process visualization, induce constructive, self-directed, cognitive activity on the part of the user.

This doesn’t mean that anything goes. This paper does not argue to add as many difficulties as possible, to use every gratuitous effect in the book. Instead, the paper goes on to give actionable design suggestions to enhance reader stimulation and active information processing.

In my practice, for instance with the Better Life Index, I verify the analyses of the Hullman paper: the novelty of the form and the aesthetic appeal of the representation drive the users to overcome the difficulty posed by the unusual shape of the flower/glyph. Would bar charts have conveyed the data more efficiently and more accurately? Definitely! Would the user engagement have been comparable? Definitely not.

A critique by Stephen Few

Stephen Few, whose work I have praised on multiple occasions on this blog, has published a critique of this paper (pdf). Reading his article, then the paper again, I had the feeling that they didn’t talk about the same things. In certain contexts, difficulties are not desirable at all and must be eradicated. Yet, in other contexts, cognitive efficiency does not provide the optimal solution.

For instance, Stephen writes:

Long-term recall is rarely the purpose of information visualization.

Fair enough! So let’s agree that when it is not the case, we should not trouble ourselves with seeking to add obstructions to the display. For instance: business intelligence systems, dashboards (for monitoring), visual analytics (and more on this shortly). Spreadsheets, mostly. All usages of data that support decision, and most usages in the corporate world. The Hullman paper only applies in the other cases anyway.

He also writes (emphasis mine):

Skilled data analysts learn to view data from many perspectives to prevent knee-jerk conclusions based on inappropriate heuristics.

Agreed! And by all means, let them analyse and let them view data from as many perspectives as they see fit, and don’t get in the way of their job.

For context, check out www.palantirtech.com/government/analysis-blog/mortgage-fraud

This here is taken from a demo from Palantir government. Here analysts are tracking mortgage fraud. Each yellow dot on the top display is a transaction where a house has been sold for over 200% of its purchase value, and the ones which are connected are about the same house. We can immediately see 2 suspicious clusters where a property has been resold 4 times in these conditions. And if at the end of their work day the analysts don’t remember the address of the fraudulent transaction, it’s no big deal as long as they have identified a wrong practice.

Conversely, at the risk of repetition, the paper authors write of a trade-off between efficiency and obstructions – cognitive efficiency being generally positive. They say that obstructions become desirable difficulties only if they are constructive, that is if they are able to trigger active information processing. They are not championing 3D pie charts or atrocious dashboards such as the one at the end of Stephen’s article. Jessica signals that novelty enhances active information processing. I don’t know how to characterize speed dials in dashboards, for instance, but novel would not be the word I’d use, and again they wouldn’t be favoured by the authors of the paper. So, I think it’s a bit unfair to associate the paper with the terrible, terrible visuals presented in Stephen’s article, the ones in the original paper being a little bit more defensible.

To see this chart in context see http://www.oecd.org/dataoecd/41/50/47984536.pdf

Consider this other chart (and let’s assume for the sake of discussion that its cognitive cost is low, while it could be much lower by showing fewer time series for instance). This was published in an OECD publication almost 2 years before the 2008 crisis. I would say this chart is easy to read (we see mortgage delinquency rates dropping in most countries) but difficult to interpret and to recall. Like other charts of the document, this one is an oracle of financial apocalypse, as the proportion of delinquent mortgages in the US, the only one without a downward trend, would have the consequences that we know. So if a different way of showing the same data could have made that more obvious at the cost of legibility, I think it would have been worth a shot.

Are we on common ground yet?

If not, let’s assume now that there exist visualizations where long-term recall is, indeed, the main purpose. Examples would include use in journalism, politics, advocacy, marketing… Jessica has been involved in the series of workshops Telling Stories with Data at VisWeek. This suggests an interesting distinction.

  • visualizations which are tools with which a user accesses or manipulates data.
  • visualizations where an author, with a specific intent, tries to frame data in a certain way to an audience. In that case, the author wants to make sure the audience receives the message as intended, and remembers it.

See where I’m going?

In the first case, we want cognitive efficiency all the way.

In the second, we are mostly concerned with getting our message across and making it stick.

So, there is no contradiction between having a set of rules for one category of visuals, and a different one for the other, especially since the criteria of success are so different. To illustrate this I note that both the article and the paper refer to Tableau, a “cognitive efficiency” company. Yet, it turns out that Tableau is also very interested in doing as well as possible on the storytelling front, and the questions asked at the paper’s presentation by Tableau representatives show their interest in this research.

Where to from there?

We have proven methods to reduce the cognitive cost of a visual, and we can thank Stephen Few for making them more accessible. It’s much more difficult, though, to optimize the characteristics of a successful “data narrative”, that is interpretation and memorability. It’s an infographics jungle out there. Those of us who haven’t seen their share of indefensible visuals just haven’t searched enough: absolutely anything goes.

We still do not have an equivalent framework for visualizations that tell stories. InfoVis started to study them (such as in the remarkable Narrative Visualization: Telling Stories With Data) and characterize them, but we don’t have a systematic, reproducible way to make sure that a data narrative will work well, the way we have one for making the perfect dashboard. We do know that the best examples at large do not comply with the rules of cognitive efficiency though. Fortunately, practitioners have not waited for convincing research and are leading the way, even though many get lost in the process. This is why we need more research on that front! I for one am looking forward to new developments in this area of InfoVis.

An open letter to Tableau

4 November, 2011 (12:51) | data visualization | By: jerome

Normally, at this time of the year, I would be writing a recap of VisWeek. And I will – the writeups I have been doing for visualisingdata.com were just highlights of individual talks. But much of my VisWeek 2011 experience happened in between the talks. People you meet, ideas you glimpse, that collide and generate new ideas, plans you make…

One person I am always happy to meet at VisWeek, and who is visible from afar, is Tableau’s Jock MacKinlay. Since I couldn’t attend TCC11, I spent some time talking with Jock and Lori Williams about 7 and plans for the future. Whenever I would start a sentence by “hmm, you know…” Lori or Jock would write it down and ask me to send any remark in writing, so I thought I should do that, but rather than keeping this private, I’m going to stick to what’s written on one button I picked up on their stand: do it in public.

For a piece of software that complex, Tableau Desktop/Public has received improvements at an astonishing rate. So fast that one of the recommendations I had is already implemented or ready in the next release. So I have the conviction that Tableau product people care and are on the lookout for useful suggestions on how to move forward. Here are a few.

On design

More fonts please!

In Tableau Public, you’re stuck with the Windows XP default fonts. That’s not an awful lot. Is that enough?

For the first 15 years or so of the web, webmasters had to be content with so-called web-safe fonts. There was a time when the web was probably 90% Verdana and people accepted this like death and taxes. But this is now 2011 and any personal website can be a marvel of customized typography – at least, the parts around a Tableau viz.

More specifically there are two issues with the current set of fonts.

First, there is no condensed font (think Helvetica Condensed, etc.). But condensed fonts are extremely useful to display numbers in callouts or on axes.

Second, the default font is Arial. This is not a neutral choice. There are lots of adjectives that can be associated with Arial, and frankly I don’t think it’s right these qualifiers should apply to Tableau. Arial is the default font of Excel pre-2007, it’s the font of those bullet-ridden, unstyled, unsavory PowerPoint slides that bored us to death for over a decade of unnecessary meetings. Many say it’s a rip-off of Helvetica.

Tableau, you're better than Arial. Say it.

Being able to choose one’s font is essential to branding. The typography of my vizzes should be consistent with that of my website and my brand. Same goes with my colours (more on that in a second). Conversely, if any output on my website is not aligned with my brand it denotes a lack of control.

So what should Tableau do?

Offer more fonts! There are literally hundreds of proven fonts which would be very well suited for dashboards. Fonts which are not unheard of, yet not trite. Legible, yet distinct.

Ideally, Tableau should commission its own font. There is no agreement about which is the best font to display numbers. Settle the question by creating that font! Anyone who would use this font in Desktop, Public or otherwise will subtly scream “Ra Ra Tableau”.

Then, offer several themes to choose from. In my view, Tableau encourages users to focus on what question to ask the data. Once this is done the resulting “visual query” should look ok, or require very little rework to be publishable as is. Today, moving away from the Arial bold, Arial, light gray background on header scheme does require a lot of rework, and it shouldn’t. Other Washington-based software firms have been offering themes for ages. Now look at all the existing Office palettes. They all are quite good. I have trouble using the default colors because it really says: this guy hasn’t given half a thought to color. At least I am aware that many very acceptable choices exist. I think it’s a reasonable stance.

Node-link graphs

Node-link graphs, or as they really should be called, graphs, are the most useful way to represent relationships across data elements. If your dataset has items and relationships between them (like movements from one place to another, transactions from one entity to another, people and how they relate to one another… there are really many examples), a graph can show quickly and elegantly what there is to know about the data.

Mobile patent suits, by Mike Bostock

I insist on the “quick insight” part because I believe that fundamentally this is what Tableau is about. Many people, including me, use Tableau systematically on a new dataset they try to understand. It’s very quick and efficient to explore data with Tableau and find out what’s interesting in a dataset. You’ll find consensus in the visual community that Tableau is the best “drafting tool” ever.

Back on graphs. One problem with graphs is that since they are not in the Excel canonical list of charts, they are not in the mindset of most corporate users.

Graphs, I believe, would be relatively easy to include in the Tableau toolkit. There are several algorithms that can calculate where the nodes should be drawn. This is not unlike drawing a map – suddenly the x and y come from lat-lon coordinates which are deduced from the underlying data. Likewise an algorithm could assign x and y coordinates to any row.  Then all there remains to do is draw a dot or shape with all the standard attributes, color, size, etc. Links between nodes could be handled by the line object.

When I told Jock about graphs he asked me what I would do with that. Oh, I had some ready ideas: as an HR manager, I could visualize all the relationships in my company. There are many explicit (work in the same team) but also implicit (have worked together in the past, graduated from the same school, similar interests, offices close by etc.) relationships in HR files. Or, logistics – I could represent my itineraries non-geographically. Suddenly I can make cartograms (which are usually tricky to do) instead of maps.

In retrospect this is not the right way to approach the problem. I’ve only worked in a handful of positions and industries. There are thousands of different jobs in as many domains who can look at data with different lenses. What I do know as a data visualization person is that graphs are useful and that if people who traditionally didn’t have access to them could pull them instantly on their data – fantastic things may occur.

Treemaps and hierarchical charts

 

A circle map done with Tableau. This was slightly cumbersome and doesn’t work so well. This is also the biggest you can do as it uses the size property of circles, here at its maximum. Coordinates of circles are computed outside of Tableau. My attempts at making a treemap are, well, less polished. The dataset is the flare class hierarchy as in here

I could say pretty much the same thing about treemaps: they are very useful, especially since tabulated datasets, the bread and butter of Tableau, often contain hierarchies. Yet they are not part of the Excel family so if you’ve never worked with them you can’t feel you need them.

So I asked Jock, why doesn’t Tableau have treemaps? It makes sense, IMO. Treemaps are one of the few “advanced” visualization tools which have gained quasi-acceptance. I really see that as progress. Isn’t there a natural match with Tableau? Jock’s answer was that the problem with treemaps is that they do not fit well with Tableau algebra. To turn a dataset into a treemap would require a specific process that can’t generalize well like the rest of Tableau.

See? There’s your problem

So instead (or rather as a first step) I suggest Tableau adds an attribute to marks: a secondary size attribute. See it this way: if only one size attribute is filled, then it governs all aspects of the size of a mark (height, width, area, diagonal you name it. It would be proportional to all). But if there are two sizes one will affect height, the other, width. So yes, we can do rectangles. And ellipses too. Now that in itself is not half-bad. Fattened stars or what not, I don’t know what to do with them but rectangles and ellipses, now…

Now once you can handle rectangles, once you can handle layouts (which you can, because: maps), you can have treemaps, and/or any hierarchy or packing algorithm. I think we can stop at treemaps but folks at Tableau know there are wide, (mostly) unexplored territories beyond this point.

Tableau and storytelling

Stateful urls

OK this is the innovation that was already in the pipeline when I mentioned it. It would be very useful to have a system of stateful urls for Tableau public. What is that? The possibility to pass a set of parameters in addition to the url of a view, so that it is in a certain state. So the same dashboard could be shared in different states (for instance, with different items selected, or different filters) without having to be saved several times under several different names.

Bonus: dashboards under a certain state, if they can be uniquely identified through those parameters, can then be shared on facebook and twitter as is.

All of this is taken care of.

Interactive slides framework

Now the next step: an interactive slides framework.

The interactive slideshow, once quasi-exclusive to the New York Times, is now gaining in popularity.

The idea? there is a structure like the common slideshow, where the user can go to the next or previous slide, guided by a narrative. But at any step instead of following the flow, they can stop and interact with the view they have.

That said, despite being a simple idea, it requires quite an amount of work to implement properly. It makes sense for Tableau to have their own framework so users can nicely arrange a sequence of views in the desktop tool, for instance dashboards in specific states, with an extra layer of commentary if need be, and then deploy the finished product, which can be embedded in one location as opposed to requiring the user to embed several vizzes in one page. The advantage would be that the reader would go from one to the other as intended by the author, with a supporting narrative or explanations, and wouldn’t be required to explore or interact with each one to get an idea of what is going on.

What do you think?

What would you like in Tableau?

Open data and data journalism

14 October, 2011 (12:50) | presentation | By: jerome

Yesterday I attended a workshop organized by Etalab on data journalism. Since open data, data visualization and storytelling with data are my 3 work interests, I just could not be found elsewhere that day.

Interestingly, while speakers and attendees were very much discussing the same subject, what was said (or implied in questions asked) was very different. On some topics participants presented opposite opinions, while on others there was strong agreement.

Inspiration and enthusiasm

That was definitely the common denominator across presentations.

In short: visualization + journalism = win.

Every presenter (@dataveyes, Pierre Falga, @datastore, @sayseal, @we_do_data and @epelboin) showed or talked about things which were pretty awesome and which would not have been possible without data or visualization. While I was familiar with the other examples, I was most fired up by Fabrice Epelboin’s presentation of the Tunisian media Fhimt.com and its dataviz gallery.

What was interesting was how it was easy to tell a memorable story with the support of data. I think for the picture to be complete you also have to include in the big picture the viewer’s assumption and the presenter/journalist narration. One example which was shown by both Caroline Goulard and Simon Rogers is the relationship between tweets and UK riots.

The unsaid assumption was that social media have helped organize the riots.

Facts in hand, it turns out that the bulk of the tweets related to a riot happened after, not before, the event. So the narrator helps us conclude that riots caused tweets rather than the other way around.

Another example from fhimt.com:
We assume that tertiary graduates have better job prospects than those with less education.

This isn’t the case in Tunisia, where graduates endure a 23% unemployment rate, while the rate for those who haven’t completed primary school is around 5%.
Comment by Fabrice Epelboin: the only thing left for them to do is prepare the revolution. I find this a very clear and rational explanation of the Arab Spring, in contrast with how television presented those events.

Is this difficult?

It requires work

And no one denies this. Cécile Dehesdin and We Do Data presented their work process to us, from the original idea to the final piece. Cécile would stress more the usage aspects while Karen and François emphasized the benefits of illustration and aesthetics to the final result. They both tried to convey to us the amount of time and effort it takes to achieve something.

and resources… or not

Then Pierre Falga and Simon Rogers gave somewhat conflicting views of the inner workings of a newsroom. While Simon Rogers depicted the process as relatively effortless and quick thanks to freely available tools, Pierre Falga’s view was that an online newsroom’s resources are very thin, which prevents most media from fully embracing data journalism. To nuance Rogers’ position and bring it closer to consensus, he argued that the work-intensive part is not the output proper, but rather the data collection, and like Cécile and Pierre he had his share of horror stories on this front.

Thank you, open data

All presenters were grateful for data being increasingly accessible through open data initiatives. Not all is rosy in dataland, however, as institutions here and there are not all excited about the prospect of spending their own resources to retrieve data for journalists – even in cases where they are legally required to.
While data journalism obviously needs open data, the reverse is possibly truer – that may be the motive for Etalab to organize the event. So far, official data portals haven’t proved to be directly useful to the concerned citizen, so it is those who are able to use that free data and turn it into attention-arresting stories who give it a purpose and demonstrate very visibly that the open data process truly benefits all.

Is there a demand for data journalism?

Presenters didn’t all address this question head-on but seemed to have mixed opinions about it. The Guardian has been resorting to data journalism for over a century and gave no impression of ever having reconsidered the question. Others in the room, including attendees, had less faith in the matter. Pierre Falga and Eric Mettoux from lexpress.fr admitted their share of responsibility, as that demand is largely dependent on the supply of quality material from existing media.

More fundamentally, I see that the mix of data visualization and communication is commonly referred to as data journalism, which may be a slight oversimplification.
Why would the task of communicating with data visualization be restricted to journalists or media? Companies and government agencies alike have considerable budgets devoted to communication. IMO they should be the ones driving that effort. To a curious audience, that is, to the people who are actively seeking information on a certain topic, data visualization answers can be insanely more powerful and cost-effective than classic communication tailored for a more passive receiver.

Experiencing the sexperience 1000

5 September, 2011 (13:25) | data visualization, web sites | By: jerome

If you think of data visualization as a great way to bring together heaps of interesting data and, well, a visual language, one of the most exciting areas in terms of room for improvement has to be how surveys are visualized.
In conversations with my pollster friend @laguirlande, I often find myself regretting that when reading an opinion survey, we see answer tallies in isolation. That is: we can see what proportion of people gave the first answer to question A, and what proportion of people gave the second answer to question B. But what we don’t know is, out of the first group, how many find themselves in the second?
Another frustration: survey data is very tabulated. Pollsters know respondents’ age, gender, and all kinds of categories they can fit their respondents in. But again when reading the results, more often than not it is not possible to use this structure to filter answers by category – do men react like women to the issue?
Yet opinion surveys are always interesting and topical, else they wouldn’t be paid for. And the interestingness may lie there: little nuggets of data that could have escaped analysts could be found by readers as they try various manipulations.

And then last week, Channel 4 released the sexperience 1000, a project around the “Great British Sex Survey” by Ipsos MORI, which as its name implies is a large-scale survey on sexual practices of UK citizens.

The project walks us through around 20 questions where little icons represent individual respondents. When the reader changes questions, the icons arrange themselves in bars or circles to form a chart.

Yes, filtering! Yes, cross-tabulating!

It is then possible to click on a column label to “track” that group and find out what they’ve been up to.
Here, for instance, we can select the group of people who had over 101 partners and follow them through another question: the longest period without intercourse, for instance.

The selected individuals will show in green, so we can see that some of them went through long periods of abstinence.

It is also possible to select one interesting individual and see all of their answers.

It is also possible to filter the individuals according to many categories. Here, for instance, we can see that a majority of respondents had intercourse in a car. Including quite a few who don’t drive or don’t have a car!

Great interactivity, yet legibility could be improved

Those functions are truly great and encourage the reader to explore. Yet the choice of representation makes it a bit difficult to understand the answers.


Going back to the first chart I showed about the number of partners for instance, this representation highlights the mode, the most frequent answer to the question. In this case, this is the 11-20 bracket with 121 answers. A close second is just 1 partner (120 answers). The composition of the answer brackets has a lot of influence on that, because if we had made a 5-9 bracket for instance it would have outweighed both (217). Also, they chose not to directly represent the people who had 0 partners, which are over 7% of the respondents.
More fundamentally, I suppose that people want to know where they fit in the distribution. Are they normal? Less than the norm? Or better? From that chart it’s very difficult to tell.
With that in mind, it’s more relevant to show a cumulative chart.


Less than 5! who would have known?

The site chooses to display bar charts when the answer is quantitative, but to present circles for qualitative questions. This is fine when several circles are displayed: it’s ok to figure out if one is bigger than the other.

But it’s much harder when only one is shown, like for the question on places I’ve shown above. It’s difficult to know whether this is a “big circle” or “small circle” without any indication of the total size of the sample. I only know that the majority of respondents had intercourse in a car because of the small number 550 next to the circle, but there is no graphical way to show that.

It’s even harder to compare proportions across groups which have different sizes.

Here, for instance, I’m trying to see whether people with an iPhone are more likely to cheat on their partners. What I see is two grey bubbles – it’s relatively easy to say that one is bigger than the other – then two smaller blue and pink ones, which are harder to compare one to another because they are smaller. What’s even harder is to assess whether the ratio between the larger balls is greater or smaller than the one between the smaller balls. However, the proportion between both balls (that is the share of the respondents who have an iPhone) is relatively easy to figure out, too bad it doesn’t answer the question at all.
To compare across groups of different sizes, you just can’t escape switching to proportions. I understand the design choice and enjoy its consistency but in this case it goes against the function of the site. I feel that in this case, the designers chose playability and aesthetic appeal over ease of getting questions answered, which is not a bad choice per se considering the subject and audience, even if there are more academic possibilities. At any rate, the Sexperience 1000 shows the way survey results could be displayed and is a great improvement over the current situation.

d3: scales, and color.

11 August, 2011 (12:03) | d3, protovis, tips | By: jerome

In protovis, scales were super-useful in just about everything. That much hasn’t changed in d3, even though d3.scale is a bit different from pv.Scale. (do note that d3.scale is in lowercase for starters).

Scales: the main idea

Simply put: scales transform a number in a certain interval (called the domain) into a number in another interval (called the range).
an example of how scales work
For instance, let’s suppose you know your data is always over 20 and always below 80. You would like to plot it, say, in a bar chart, which can be only 120 pixels tall.
You could, obviously, do the math:

.attr("height", function(d) {return (d-20)*2;})

But what if you suddenly have more or less space? or your data changes? you’d have to go back to the entrails of your code and make the change. This is very error prone. So instead, you can use a scale:

var y=d3.scale.linear().domain([20,80]).range([0,120]); // in d3, domain and range take arrays
...
.attr("height", y)

this is much simpler, elegant, and easy to maintain. Oh, and the latter notation is equivalent to

.attr("height", function(d) {return y(d);})

… only more legible and shorter.
And, there are tons of possibilities with scales.
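
To make the idea concrete, here is a minimal sketch of what a linear scale does under the hood, in plain JavaScript with no d3 involved. The name makeLinear is made up for this illustration, and this is not d3’s actual source, just the same arithmetic.

```javascript
// A hand-rolled stand-in for d3.scale.linear(), for illustration only.
// makeLinear is a hypothetical helper, not part of d3.
function makeLinear(domain, range) {
  var d0 = domain[0], d1 = domain[1];
  var r0 = range[0], r1 = range[1];
  return function(x) {
    // how far x sits along the domain, as a fraction (0 at d0, 1 at d1)
    var t = (x - d0) / (d1 - d0);
    // the same fraction along the range
    return r0 + t * (r1 - r0);
  };
}

var yScale = makeLinear([20, 80], [0, 120]);
yScale(20); // 0
yScale(50); // 60
yScale(80); // 120
```

If the chart gets more room, only the range passed to makeLinear changes; the drawing code stays the same.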

Fun with scales

In d3, quantitative scales can be of several types:

  • linear scales (including quantize and quantile scales),
  • logarithmic scales,
  • power scales (including square root scales)

While they behave differently, they have a lot in common.

Domain and range

For all scales, with the exception of quantize and quantile scales which are a bit different, domain and range work the same.
First, note that unlike in protovis, domain and range take an array as argument. Compare:

var y=pv.Scale.linear().range(20,60).domain(0,120);
var y=d3.scale.linear().range([20,60]).domain([0,120]);

This is because contrary to protovis, where domain could be a whole dataset, in d3, domain contains the bounds of the interval that is going to be transformed.
Typically, this is two numbers. If there are more, we are talking about a multipoint scale (d3 calls it polylinear): there are as many segments in the intervals as there are numbers in the domain (minus one). The range must have as many numbers, and so as many segments. When using the scale, if a number is in the n-th segment of the domain, it is transformed into a number in the n-th segment of the range.
illustration of a multipoint scale
With this example, 30 finds itself in the first segment of the domain. So it’s transformed to a value in the first segment of the range. 60, however, is in the 2nd segment, so it’s transformed into a value in the 2nd segment of the range.
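
Here is a rough sketch of that segment logic in plain JavaScript. The bounds below are made up for the example (they are not the ones from the illustration), and makePolylinear is a hypothetical helper, not a d3 function.

```javascript
// Illustration only: a scale with several segments.
// domain and range must have the same number of bounds.
function makePolylinear(domain, range) {
  return function(x) {
    // find the segment x falls in; values past the last bound use the last segment
    var i = 1;
    while (i < domain.length - 1 && x > domain[i]) { i++; }
    // fraction of the way through that segment of the domain
    var t = (x - domain[i - 1]) / (domain[i] - domain[i - 1]);
    // same fraction of the matching segment of the range
    return range[i - 1] + t * (range[i] - range[i - 1]);
  };
}

var multi = makePolylinear([20, 40, 80], [0, 100, 120]);
multi(30); // 50: halfway through the 1st segment of the domain, so halfway through 0..100
multi(60); // 110: halfway through the 2nd segment, so halfway through 100..120
```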
Also, bounds of domain and range need not be numbers, as long as they can be converted to numbers. One useful example is colors. Color names can be used as range, for instance, to create color ramps:

var ramp=d3.scale.linear().domain([0,100]).range(["red","blue"]);

This will transform any value between 0 and 100 into the corresponding color between red and blue.
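
What “the corresponding color” means can be sketched by interpolating each of the red, green and blue channels separately. This is a simplified illustration in plain JavaScript: makeRamp and hex2 are names I made up, colors are passed as [r, g, b] arrays rather than names, and d3’s own interpolation may differ in details.

```javascript
// Illustration only: a two-color ramp computed channel by channel.
function hex2(n) {
  // two-digit hexadecimal for a channel value between 0 and 255
  var s = n.toString(16);
  return s.length < 2 ? "0" + s : s;
}

function makeRamp(domain, fromRgb, toRgb) {
  return function(x) {
    var t = (x - domain[0]) / (domain[1] - domain[0]);
    var out = "#";
    for (var i = 0; i < 3; i++) {
      out += hex2(Math.round(fromRgb[i] + t * (toRgb[i] - fromRgb[i])));
    }
    return out;
  };
}

var rampSketch = makeRamp([0, 100], [255, 0, 0], [0, 0, 255]); // red to blue
rampSketch(0);   // "#ff0000"
rampSketch(50);  // "#800080", a purple halfway between the two
rampSketch(100); // "#0000ff"
```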

Clamping

What happens if the scale is asked to process a number outside of the domain? That’s what clamping controls. If it is set, then the bounds of the range are the minimum and maximum value that can be returned by the scale. Else, the same transformation applies to all numbers, whether they fall within the domain or not.
Clamping example
Here, with clamping, the result of the linear transformation is 120, but without it, it’s 160.

var clamp=d3.scale.linear().domain([20,80]).range([0,120]);
clamp(100); // 160
clamp.clamp(true);
clamp(100); // 120
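
For the curious, clamping boils down to capping the result at the bounds of the range. A minimal sketch in plain JavaScript, where makeClampedLinear is a hypothetical helper and not how d3 exposes it:

```javascript
// Illustration only: a linear scale with an optional clamp flag.
function makeClampedLinear(domain, range, clamp) {
  var lo = Math.min(range[0], range[1]);
  var hi = Math.max(range[0], range[1]);
  return function(x) {
    var y = range[0] + (x - domain[0]) * (range[1] - range[0]) / (domain[1] - domain[0]);
    // with clamping on, never return anything outside the range
    return clamp ? Math.max(lo, Math.min(hi, y)) : y;
  };
}

var unclamped = makeClampedLinear([20, 80], [0, 120], false);
var clamped = makeClampedLinear([20, 80], [0, 120], true);
unclamped(100); // 160
clamped(100);   // 120
clamped(0);     // 0, instead of -40 without clamping
```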

Scales and nice numbers

More often than not, the bounds of the domain and/or those of the range will be calculated. So, chances are they won’t be round numbers, or numbers a human would like. Scales, however, come with a bunch of methods to address that. d3 keeps in mind that scales are often used to position marks along an axis.

.nice()

When applied to a scale, the nice method expands the domain to “nicer” numbers. You wouldn’t want your axis to start at -2.347 and end at 7.431, right?
So, there.

var data=[-2.347, 4, 5.23,-1.234,6.234,7.431]; // or whatever.
var y=d3.scale.linear().range([0,120]);
y.domain([d3.min(data), d3.max(data)]); // domain takes bounds as arguments, not all numbers
y.domain() // [-2.347, 7.431];
y.nice().domain() // [-3, 8] (nice() returns the scale, so we read the domain back)
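
One simple way to get such “nicer” bounds is to round the domain outwards to a multiple of a power of 10. The sketch below is plain JavaScript under that assumption; niceDomain is a made-up name and the exact rule d3 applies may differ.

```javascript
// Illustration only: round a [min, max] domain outwards to "nice" bounds.
function niceDomain(domain) {
  var extent = domain[1] - domain[0];
  // pick a power of 10 roughly one order of magnitude below the extent
  var step = Math.pow(10, Math.round(Math.log(extent) / Math.LN10) - 1);
  return [
    Math.floor(domain[0] / step) * step, // lower bound goes down
    Math.ceil(domain[1] / step) * step   // upper bound goes up
  ];
}

niceDomain([-2.347, 7.431]); // [-3, 8]
niceDomain([13, 968]);       // [0, 1000]
```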

.ticks(n)

Given a domain, and a number n (which, contrary to protovis, is mandatory in d3), the ticks method will split your domain in (more or less) n convenient, human-readable values, and return an array of these values. This is especially useful to label axes. Passing these values to the scale allows you to position ticks nicely on an axis.

var y=d3.scale.linear().domain([20,80]).range([0,120]);
...
var ticks=axis.selectAll("line")
  .data(y.ticks(4)) // 20, 40, 60 and 80
  .enter().append("svg:line");
ticks
  .attr("x1",0).attr("x2",5)
  .attr("y1",y).attr("y2",y) // short and simple.
  .attr("stroke","black");

.rangeRound()

If used instead of .range(), this will guarantee that the outputs of the scale are integers, which is better to position marks on the screen with pixel precision than numbers with decimals.

.invert()

The invert function turns the scale upside down: for one given number in the range, it returns which number of the domain would have been transformed into that number.
For instance:

var y=d3.scale.linear().domain([20,80]).range([0,120]);
y(50); // 60
y.invert(60); // 50

That’s quite useful, for instance, when a user mouses over a chart, and you would like to know to what value the mouse coordinates correspond.
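
Inverting a linear scale is just the same formula with domain and range swapped. A small sketch in plain JavaScript (makeInvertible is a hypothetical helper, not a d3 function):

```javascript
// Illustration only: a linear scale that carries its own inverse.
function makeInvertible(domain, range) {
  function scale(x) {
    return range[0] + (x - domain[0]) * (range[1] - range[0]) / (domain[1] - domain[0]);
  }
  // same formula, with the roles of domain and range swapped
  scale.invert = function(y) {
    return domain[0] + (y - range[0]) * (domain[1] - domain[0]) / (range[1] - range[0]);
  };
  return scale;
}

var ySketch = makeInvertible([20, 80], [0, 120]);
ySketch(50);        // 60
ySketch.invert(60); // 50
ySketch.invert(90); // 65: a mouse at pixel 90 points at the value 65
```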

Power scales and log scales

The linear scale is a function of the form y=ax+b which works for both ends of the domain and range. In the example we’ve used most often until now, this function is really f(x): y=2x-40.
Power and logarithm scales work the same, only we are looking for a function of the form y=ax^k+b, or y=a.log(x)+b.
For the power scales, you can specify an exponent (k) with the .exponent() method. For instance, if we specify an exponent of 2, here is what the scale would look like:
an example of a power scale
The equation is now f(x): y=x²/50-8. So 20 still becomes 0 and 80 still becomes 120, but other than that the values at the beginning of the domain would be lower than with the linear scale, and those at the end of the scale will be higher.
For convenience, d3 includes a d3.scale.sqrt() (the square root scale) so you never have to type d3.scale.pow().exponent(0.5) in full.
Also note that if you are using a log scale, you cannot have 0 in the domain.
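
Going back to the power scale, here is a quick sketch in plain JavaScript that checks the equation above. makePow is a made-up helper for this post, not d3’s implementation: it raises x to the exponent first, then does the usual linear mapping.

```javascript
// Illustration only: a power scale y = a*x^k + b fitted to both ends
// of the domain and range.
function makePow(domain, range, k) {
  var p0 = Math.pow(domain[0], k);
  var p1 = Math.pow(domain[1], k);
  return function(x) {
    return range[0] + (Math.pow(x, k) - p0) * (range[1] - range[0]) / (p1 - p0);
  };
}

var square = makePow([20, 80], [0, 120], 2);
square(20); // 0
square(50); // 42, which is 50²/50-8 (the linear scale gives 60)
square(80); // 120
```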

Quantize and quantile

quantize and quantile are specific linear scales.
quantize works with a discrete, rather than continuous, range: in other terms, the output of quantize can only take a certain number of values.
For instance:

var q=d3.scale.quantize().domain([0,10]).range([0,2,8]);
q(0); // 0
q(3); // 0
q(3.33); // 0
q(3.34); // 2
q(5); // 2
q(6.66); // 2
q(6.67); // 8
q(8); // 8
q(1000); // 8
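
What quantize does can be sketched as cutting the domain into as many equal slices as there are values in the range. Plain JavaScript again, with makeQuantize as a hypothetical helper rather than d3’s code:

```javascript
// Illustration only: map a continuous domain onto a few discrete outputs.
function makeQuantize(domain, range) {
  var n = range.length;
  return function(x) {
    // which of the n equal slices of the domain does x fall in?
    var i = Math.floor((x - domain[0]) / (domain[1] - domain[0]) * n);
    // values outside the domain stick to the first or last output
    i = Math.max(0, Math.min(n - 1, i));
    return range[i];
  };
}

var qSketch = makeQuantize([0, 10], [0, 2, 8]);
qSketch(3.33); // 0
qSketch(3.34); // 2
qSketch(6.67); // 8
qSketch(1000); // 8
```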

quantile on the other hand matches values in the domain (which, this time, is the full dataset) with their respective quantile. The number of quantiles is specified by the range.
For instance:

var q=d3.scale.quantile().domain([0,1,5,6,2,4,6,2,4,6,7,8]).range([0,100]);
q.quantiles(); // [4.5], only one quantile - the median
q(0); // 0
q(4); // 0
q(4.499); // 0
q(4.5); // 100 - over the median
q(5); // 100
q(10000); // 100
q.range([0,25,50,75,100]);
q.quantiles(); // [2, 4, 5.6, 6];
q(0); // 0
q(2); // 25 - greater than the first quantile limit
q(3); // 25
q(4); // 50
q(6); // 100
q(10000); // 100

Ordinal scales

All the scales we’ve seen so far have been quantitative, but how about ordinal scales?
The big difference is that ordinal scales have a discrete domain, in other words, they turn a limited number of values into something else, without caring for what’s between those values.
Ordinal scales are very useful for positioning marks along an x axis. Let’s suppose you have 10 bars to position for your bar chart, each corresponding to a category, a month or whatever.
For instance:

var x=d3.scale.ordinal()
  .domain(["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]) // 7 items
  .rangeBands([0,120]);
x("Tuesday"); // 34.285714285714285

There are 3 possibilities for range. Two are similar: the .rangePoints() and .rangeBands() methods, which both work with an array of two numbers – i.e. .rangeBands([0,120]). The last one is to specify all values in the range with .range().

rangePoints() and rangeBands()

With .rangePoints(interval), d3 fits n points within the interval, n being the number of categories in the domain. In that case, the value of the first point is the beginning of the interval, that of the last point is the end of the interval.
With .rangeBands(interval), d3 fits n bands within the interval. Here, the value of the last item in the domain is less than the upper bound of the interval.
Those methods replace the protovis methods .split() and .splitBanded().
difference between rangeBands and rangePoints
This chart illustrates the difference between using rangeBands and rangePoints.

var x=d3.scale.ordinal()
  .domain(["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
x.rangePoints([0,120]);
x("Saturday"); // 120
x.rangeBands([0,120]);
x("Saturday"); // 102.85714285714286
x("Saturday")+x.rangeBand(); // 120
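
The arithmetic behind those numbers is simple enough to sketch in plain JavaScript. pointsFor and bandsFor are names I made up; the sketch ignores padding and assumes at least two categories.

```javascript
// Illustration only: where each category lands with points vs. bands.
function pointsFor(categories, interval) {
  // n points spread from the start to the end of the interval: n-1 gaps
  var step = (interval[1] - interval[0]) / (categories.length - 1);
  var positions = {};
  for (var i = 0; i < categories.length; i++) {
    positions[categories[i]] = interval[0] + i * step;
  }
  return positions;
}

function bandsFor(categories, interval) {
  // n bands of equal width fill the interval; each position is a band's start
  var band = (interval[1] - interval[0]) / categories.length;
  var positions = {};
  for (var i = 0; i < categories.length; i++) {
    positions[categories[i]] = interval[0] + i * band;
  }
  return { positions: positions, band: band };
}

var days = ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"];
var points = pointsFor(days, [0, 120]);
var bands = bandsFor(days, [0, 120]);
points.Saturday;          // 120: the last point sits at the end of the interval
bands.positions.Saturday; // about 102.86: the start of the last band
bands.band;               // about 17.14: the width of each band
```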

the range method

Finally, we can also use the .range method with several values.
We can specify the domain, or not. Then, if we use such a scale on a value which is not part of the domain (or if the domain is left empty), this value is added to the domain. If there are n values in the range, and more in the domain, then the (n+1)th value of the domain is matched with the 1st value in the range, etc.

var x=d3.scale.ordinal().range(["hello", "world"]);
x.domain(); // [] - empty still.
x(0); // "hello"
x(1); // "world"
x(2); // "hello"
x.domain(); // [0,1,2]
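
That “add to the domain on the fly, then wrap around” behavior can be sketched like this in plain JavaScript. makeOrdinal is a hypothetical helper; it compares values with strict equality, which may not be exactly what d3 does internally.

```javascript
// Illustration only: an ordinal scale whose domain grows as it is used.
function makeOrdinal(range) {
  var domain = [];
  function scale(value) {
    var i = domain.indexOf(value);
    if (i < 0) {
      // unknown value: add it to the domain and use its new position
      i = domain.push(value) - 1;
    }
    // wrap around when the domain is longer than the range
    return range[i % range.length];
  }
  scale.domain = function() { return domain.slice(); };
  return scale;
}

var greet = makeOrdinal(["hello", "world"]);
greet.domain(); // []
greet(0); // "hello"
greet(1); // "world"
greet(2); // "hello" again: the 3rd value wraps to the 1st
greet.domain(); // [0, 1, 2]
```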

Color palettes

Unlike in protovis, which had them under pv.Colors – i.e. pv.Colors.category10(), in d3, built-in color palettes can be accessed through scales. Well, even in protovis they had been ordinal scales all along, only not called this way.
There are 4 built-in color palettes in d3: d3.scale.category10(), d3.scale.category20(), d3.scale.category20b(), and d3.scale.category20c().

A palette like d3.scale.category10() works exactly like an ordinal scale.

var p=d3.scale.category10();
var r=p.range(); // ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                      // "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
var s=d3.scale.ordinal().range(r);
p.domain(); // [] - empty
s.domain(); // [] - empty, see above
p(0); // "#1f77b4"
p(1); // "#ff7f0e"
p(2); // "#2ca02c"
p.domain(); // [0,1,2];
s(0); // "#1f77b4"
s(1); // "#ff7f0e"
s(2); // "#2ca02c"
s.domain(); // [0,1,2];

It’s noteworthy that in d3, color palettes return strings, not pv.Color objects like in protovis.
Also:

d3.scale.category10(1); // this doesn't work
d3.scale.category10()(1); // this is the way.

Colors

Compared to protovis, d3.color is simpler. The main reason is that protovis handled color and transparency together with the pv.Color object, whereas in SVG, those two are distinct attributes: you handle the background color of a filled object with fill, its transparency with opacity, the color of the outline with stroke and the transparency of that color with stroke-opacity.

d3 has two color objects: d3_Rgb and d3_Hsl, which describe colors in two of the most popular color spaces: red/green/blue, and hue/saturation/lightness.

With d3.color, you can make operations on such objects, like converting colors between various formats, or make colors lighter or darker.

d3.rgb(color), and d3.hsl(color) create such objects.
In this context, color can be (straight from the manual):

  • rgb decimal – “rgb(255,255,255)”
  • hsl decimal – “hsl(120,50%,20%)”
  • rgb hexadecimal – “#ffeeaa”
  • rgb shorthand hexadecimal – “#fea”
  • named – “red”, “white”, “blue”

Once you have that object, you can make it brighter or darker with the appropriate method.
You can use .toString() to get it back in rgb hexadecimal format (or hsl decimal), and .rgb() or .hsl() to convert it to the object in the other color space.

var c=d3.rgb("violet") // d3_Rgb object
c.toString(); // "#ee82ee"
c.darker().toString(); // "#a65ba6"
c.darker(2).toString(); // "#743f74" - even darker
c.brighter().toString();// "#ffb9ff"
c.brighter(0.1).toString(); // "#f686f6" - only slightly brighter
c.hsl(); // d3_Hsl object
c.hsl().toString() // "hsl(300, 76, 72)"
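
The darker values above are consistent with multiplying each channel by 0.7 for every step and rounding down. Here is a sketch of that rule in plain JavaScript; darkerSketch and hex2 are made-up names, and I am inferring the 0.7 factor from the results rather than quoting d3.

```javascript
// Illustration only: darken an [r, g, b] color by a factor of 0.7 per step.
function hex2(n) {
  // two-digit hexadecimal for a channel value between 0 and 255
  var s = n.toString(16);
  return s.length < 2 ? "0" + s : s;
}

function darkerSketch(rgb, k) {
  var factor = Math.pow(0.7, k === undefined ? 1 : k);
  return "#" +
    hex2(Math.floor(rgb[0] * factor)) +
    hex2(Math.floor(rgb[1] * factor)) +
    hex2(Math.floor(rgb[2] * factor));
}

var violet = [238, 130, 238]; // "#ee82ee"
darkerSketch(violet);    // "#a65ba6"
darkerSketch(violet, 2); // "#743f74"
```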

d3: adding stuff. And, oh, understanding selections

9 August, 2011 (14:49) | d3, protovis, tips | By: jerome

From data to graphics

the d3 principle (and also the protovis principle)
d3 and protovis are built around the same principle. Take data, put it into an array, and for each element of data a graphical object can be created, whose properties are derived from the data that was provided.

Only d3 and protovis have a slightly different way of adding those graphical elements and getting data.

In protovis, you start from a panel, a protovis-specific object, to which you add various marks. Each time you add a mark, you can either:

  • not specify data and add just one,
  • or specify data and create as many as there are items in the array you pass as data.


How we did it in protovis

var vis=new pv.Panel().width(200).height(200);
vis.add(pv.Panel).top(10).left(10)
  .add(pv.Bar)
    .data([1,4,3,2,5])
    .left(function() {return this.index*20;})
    .width(15)
    .bottom(0)
    .height(function(d) {return d*10;});
vis.render();

this simple bar chart in protovis
You first create a panel (first line), then you may add an element without data (here, another panel, line 2), and add bars to this panel: there would be 5, one for each element in the array in line 4.

And in d3?

In d3, you also have a way to add either one object without passing data, or a series of objects – one per data element.

var vis=d3.select("body").append("svg:svg").attr("width",200).attr("height",200);
var rect=vis.selectAll("rect").data([1,4,3,2,5]).enter().append("svg:rect");
rect.attr("height",function(d) {return d*20;})
  .attr("width", 15)
  .attr("x",function(d,i) {return i*20;})
  .attr("y",function(d) {return 100-20*d;})
  .attr("fill","steelblue");

In the first line, we are creating an svg document which will be the root of our graphical creation. It behaves just as the top-level panel in protovis.

However we are not creating this out of thin air, but rather we are bolting it onto an existing part of the page, here the <body> tag. Essentially, we are looking through the page for a tag named <body> and once we find it (which should be the case often), that’s where we put the svg document.

Oftentimes, instead of creating our document on <body>, we are going to add it to an existing <div> block, for instance:

<div id="chart"></div>
<script type="text/javascript">
var vis=d3.select("#chart").append("svg:svg");
...
</script>

Anyway. To add one element, regardless of data, the logic is:

d3.select(where we would like to put our new object).append(type of new object).

Going back to our code:

var vis=d3.select("body").append("svg:svg").attr("width",200).attr("height",200);
var rect=vis.selectAll("rect").data([1,4,3,2,5]).enter().append("svg:rect");
rect.attr("height",function(d) {return d*20;})
  .attr("width", 15)
  .attr("x",function(d,i) {return i*20;})
  .attr("y",function(d) {return 100-20*d;})
  .attr("fill","steelblue");

On line 2, we see a different construct:

an existing selection, or a part of the page
.selectAll(something)
.data(an array)
.enter()
.append(an object type)

This sequence of methods (selectAll, data, enter and append) is the way to add a series of elements. If all you need is to create a bar chart, just remember that. But if you plan on taking your d3 skills further than where you stopped with protovis, look at the end of the post for a more thorough explanation of the selection process.

Attributes and accessor functions

At this stage, we’ve added our new rectangles, and now we are going to shape and style them.

rect.attr("height",function(d) {return d*20;})
  .attr("width", 15)
  .attr("x",function(d,i) {return i*20;})
  .attr("y",function(d) {return 100-20*d;})
  .attr("fill","steelblue");

All the attributes of a graphical element are controlled by the method attr(). You specify the attribute you want to set, and the value you want to give.
In some cases, the value doesn’t depend on the data. All the bars will be 15 pixels wide, and they will all be of the steelblue color.
In others, the value does depend on the data. We decide that the height of each bar is 20 times the value of the underlying data, in pixels (so 1 becomes 20, 5 becomes 100, etc.). Like in protovis, once data has been bound to an element, function(variable name) lets us return a dynamic value computed from that element’s data. By convention, we usually write function(d) {…;} (d for data) although the name could be anything. Those functions are still called accessor functions.
So for instance:

.attr("height",function(d) {return d*20;})

means that the height will be 20 times the value of the underlying data element (exactly what we said above).
In protovis, we could position a mark relative to any corner of its parent, so we had a .top method and a .bottom method. But in SVG, objects are positioned relative to the top-left corner. So when we specify the y position, it is measured from the top of the document, not from the axis our bars stand on (here, the line 100 pixels from the top). To make each bar end on that line, its top must sit at 100 minus its height.
So:

.attr("y", function(d) {return 100-d*20;})

If we use scales (see next post), we won’t have to worry about any of this anyway.
Finally, there is an attribute here which depends not so much on the value of the data as on its rank among the data items: the x position.
For this, we write: function(d,i) {return i*20;}
Here is a fundamental difference with protovis. In protovis, when we passed a second argument to such a function, it meant the data of the parent element (grandparent for the third, etc.). But here in d3, the second parameter is the position of the data element in its array. By convention, we write it i (for index).
And since you have to know: there is no easy way to retrieve the data of the parent element.
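
To make these two conventions concrete, here is a small sketch in plain JavaScript, without d3, that calls the same accessor functions with (d, i) and collects the attributes each bar of our chart would get. The helper name barAttributes and its baseline parameter are made up for this illustration; in real code d3 does the looping for you.

```javascript
// Hypothetical helper (not part of d3): computes, for each data item,
// the attributes our accessor functions would return.
function barAttributes(data, baseline) {
  // Same accessors as in the bar chart: d is the data item, i its index.
  var height = function(d) { return d * 20; };
  var x = function(d, i) { return i * 20; };
  // SVG measures y from the top, so the top of a bar is baseline minus height.
  var y = function(d) { return baseline - height(d); };

  return data.map(function(d, i) {
    return { x: x(d, i), y: y(d), width: 15, height: height(d) };
  });
}

var bars = barAttributes([1, 4, 3, 2, 5], 100);
// Every bar ends on the same line: y + height equals the baseline.
console.log(bars);
```

Whatever the value of d, y plus height always adds up to the baseline, which is why the bars appear to stand on a common axis even though SVG counts from the top.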

Bonus: understanding selections

To add many elements at once we’ve used the sequence: selectAll, data, enter, append.
Why use 4 methods for what appears to be one elementary task? If you don’t care about manipulating nodes individually, for instance for animations, you can just remember the sequence. But if you want to know more, here is what each method does.

selectAll

First, we select a point on which to add our new graphical objects. When we are creating objects, selectAll returns an empty selection, but one anchored at that point. You may also use selectAll in another context, to update existing objects for instance. But here, an empty selection is expected.

data

Then, you assign data. This works much as in protovis: d3 expects an array. d3 takes the idea further (with the concept of data joins) but you need not concern yourself with that until you look at transitions.
Anyway, at this stage you have an empty selection, based on a given point in the page, but with data.

enter

The enter method updates the selection with nodes which have data, but no graphical representation. Using enter() is like creating stubs where the graphical elements will be grafted.

append

Finally, by appending we actually create the graphical objects. Each is tied to one data element, so it can be further styled (for instance, through “attr”) to derive its characteristics from the value of that data.
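
The same four steps can also be written one at a time, with a variable for each intermediate selection. This is only a sketch: the function name addBars and the variable names are made up for the illustration, and root is assumed to be a d3 selection such as the svg document we created earlier.

```javascript
// Hypothetical helper: root is assumed to be a d3 selection (e.g. our svg document).
function addBars(root, data) {
  // 1. selectAll: no rect exists yet, so this selection is empty.
  var empty = root.selectAll("rect");
  // 2. data: still nothing drawn, but the selection now carries the data.
  var joined = empty.data(data);
  // 3. enter: one placeholder per data item that has no graphical element.
  var placeholders = joined.enter();
  // 4. append: the rectangles are actually created, one per placeholder.
  var rects = placeholders.append("svg:rect");
  return rects;
}
```

Calling addBars(vis, [1,4,3,2,5]) should give back the same selection as line 2 of our bar chart, ready to be shaped and styled with attr.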

From protovis to d3

8 August, 2011 (16:47) | d3, protovis | By: jerome

You’ve spent some time learning protovis only to find that its development has halted, as its authors have switched to working on d3. Have your efforts all been in vain? Fear not! This series of posts will help you adapt to d3 with a protovis background.

Before we go any further, let me say that these posts won’t make you awesome at d3 (yet). We won’t be talking about how to do all the amazing things you could never do in protovis. Rather, we’ll focus on making you as comfortable with d3 as you were with protovis. And once that’s done, nothing will prevent you from learning the more powerful aspects of d3.

Anyway, if you’re reading this, you are already awesome.

Why should I make the switch to d3?

Frankly, you don’t have to. Protovis is a fine framework and works well. Now you may want to switch to d3 for several reasons.

  • d3 is fast. d3 is better at handling scenes with hundreds or thousands of elements. So if you like scatterplots or network graphs, and who doesn’t, d3 has much stronger performance.
  • d3 does animation. There were workarounds to get animation in protovis, but they were just that: workarounds. Animation and transitions are built into d3 and are a snap to implement.
  • More features. Just because development has stopped on protovis doesn’t mean that it has stopped elsewhere… For instance, d3 has more ready-to-use layouts, like Voronoi tessellation or chords, and it has more methods and functions to make your life easier, for example to access and manipulate data.
  • Styling. In d3 it is possible to apply CSS style sheets to graphical elements. This helps keep the code and the formatting separate.

Yes but doesn’t everything change?

Short answer: no.

Less short answer: some things do change substantially. Most things stay the same. And then, some things look the same but have changed.

Things that stay the same

  • The general principle. Protovis is about transforming an array of data into the same number of graphical elements, with characteristics derived from that data. d3 does exactly this as well.
  • pv.Nest, which in my personal protovis experience has been the hardest to understand. Only, it’s called d3.nest now.
  • Methods that supplement the existing javascript array manipulation methods, like pv.min, pv.values, pv.entries etc. are also back (as d3.min, d3.values, d3.entries, but you’ve guessed it by now). Some, like pv.mean or pv.median, didn’t make it through but you could easily rewrite them, or continue using the protovis ones.
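
As an illustration of that last point, here is one way the two missing helpers could be rewritten in plain JavaScript. This is a minimal sketch, not the protovis source: the names mean and median are my own, and it assumes a plain array of numbers.

```javascript
// Hypothetical stand-ins for pv.mean and pv.median, in plain JavaScript.
function mean(array) {
  var sum = 0;
  for (var i = 0; i < array.length; i++) sum += array[i];
  return sum / array.length; // NaN for an empty array
}

function median(array) {
  // Sort a copy numerically so the caller's array is left untouched.
  var sorted = array.slice().sort(function(a, b) { return a - b; });
  var middle = Math.floor(sorted.length / 2);
  // Odd length: the middle value. Even length: the average of the two middle values.
  return sorted.length % 2 ? sorted[middle] : (sorted[middle - 1] + sorted[middle]) / 2;
}

console.log(mean([1, 4, 3, 2, 5]), median([1, 4, 3, 2, 5]));
```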

Things that look different, but which are largely the same

Protovis had a number of native graphical objects, or marks, that could be manipulated at will with methods.

var vis=new pv.Panel()
  .height(400)
  .width(400)
  .fillStyle("aliceblue")
  .lineWidth(1)
  .strokeStyle("steelblue");
vis.render();

In protovis, setting the height, the width, or any other property of an object each goes through its own dedicated method.

var vis=d3.select("body").append("svg:svg");
vis.append("svg:rect")
  .attr("height",400)
  .attr("width",400)
  .attr("fill","aliceblue")
  .attr("stroke","steelblue")
  .attr("stroke-width",1);

This produces essentially the same thing. We add a rectangle of a specified height, width, and colors. There are a few differences though. Here, controlling height, width or fill is essentially the same thing and uses the same method, .attr(). Notice also that we first created an svg document, then a shape within that document. And also, that we don’t need to use vis.render(); anymore.

The d3 approach looks longer. But if we define all the style information first we could make it much shorter, shorter than in protovis in fact!
For instance:

var vis=d3.select("body").append("svg:svg").append("svg:rect").attr("class", "myRect");

Many of the apparent differences between d3 and protovis come from explicitly using svg shapes (paths, polygons, ellipses, etc.) as opposed to native objects (pv.Panel, pv.Bar, pv.Dot, etc.), although it’s essentially the same thing. Yes, you have to learn your SVG but it’s really on a need-to-know basis. In fact, if you’ve worked with protovis, or even if you’ve worked with HTML and CSS, you probably know more SVG than you think.

SVG is more flexible than protovis objects. The flipside is that constructs which were once simple in protovis become less obvious in SVG. But for those cases, d3 has recreated some native objects, even if not as many as in protovis.

Things that look the same, but which are different

Some methods have kept the same name since protovis but have changed, some in minor ways, some more substantially. In any case, the basic ways of using these methods (like scale, color, data…) don’t change much. It’s only their more exotic uses that change.