Interactive map of the subway

My last project is an interactive map of the Paris subway.
Interacive subway map

Interactive subway map

Quite some time ago, I had seen Tom Carden’s Tube Travel Times and was fascinated. So while I initially had tried to do something different I ended up borrowing a lot of his design, so it wouldn’t be fair to start this post without acknowledging that.
Then again, there are a few original things in this worth writing about.
A subway system can be seen as a network of nodes and edges. Nodes would be the stations, and edges, the subway lines that connect them. Finding the shortest path between two parts of the graph is a classic problem of computer science. That path is the sum of the length of all the edges it takes to traverse the graph from the source to the destination.
However a typical subway journey, and I should know, is not entirely spent in a train. You first have to get from the street to the platform, wait for your train; in the event of a connection, you have to go from one platform to another, wait for another train; and eventually you have to return to the street level.
So in my representation of the subway system in addition to the stations themselves, I’ve added nodes for all the platforms within a station and edges between them. So if you have to go from one station A to station B, which is three stops away on a given subway line, you would really traverse 5 edges: station A to platform, stop 1, stop 2, stop 3, then platform to station B.
While the Paris subway agency, RATP, has recently publicized its open data intent, the travel times between two stations are not public. I actually had reliable and accurate measured travel times for 90% of the network, on the basis of which I could estimate the rest. The great unknown that remains is the complexity of the stations, some are very straightforward so going from the street to a train is a matter of seconds, while some are downright mazes. While I think I have travelled through almost every stations by now, I definitely not have stopped in many of them. So as part of the project I am asking users to complement my data on the stations they know well. Actually, I didn’t use much of the datasets that RATP published. The position of the stations on the canonical plan has a lot of inaccuracies. For the geographical positions I used my own measures rather than the published data (I had been playing on and off with subway data for close to 5 years now, I did use the trafic data to give an idea of the occupancy of a given section, and the official color codes as well. Because I am using my own measures, I have not added other railway systems like the RER or tramway for which I don’t have enough data.

So when a user selects a station, the rest of the network moves according to their (shortest path) distance to the selected station. So at the heart of the exercise there is a shortest path calculation from any station to any other. Including stations, platforms, connections, entrances and exits, that’s a network of 680 nodes and 1710 edges (for 299 stations, I didn’t include a couple of the very last ones). And it’s a directed network, because there are some sections which go in only one direction. At most, a network that big could have close to 500,000 edges, so it’s fairly sparse.
There are several algorithms to compute shortest paths in the code. I am using the best known one, Dijkstra, which works on path with no negative edge length. The way it is implemented here, with binary heaps, makes it so fast that it’s not noticeable. Not using that improvement, or relying on a more versatile but slower shortest path algorithm, would result in considerably greater computing times.

When a user selects one station a few things happen. First, all of the others are arranged around the selected station. Imagine the station is the fixed point in a polar coordinate system, all the other stations retain their angular coordinate but their radii now depends on the shortest path time. That idea came directly from Tom’s applet. Another thing I took from him is the addition of rings in the background which give an indication of the travelling time to those destinations. But I settled for that after much fumbling and with the conviction that there was no better solution.
What also happens is that all the edges which are not part of any shortest path disappear. So subway lines are no longer continuous.
Once a station is selected, more things happen when hovering on a second station, the actual shortest path is revealed while all others fade out, and the station marks are hidden except those of the start, destination and all connections on the way, for which names are displayed.
Finally, an interace appears that let users enter estimations for the times I don’t know so well about the stations. I haven’t added any hard figures there so I am only asking for impressions, but I believe that with some usage the time estimations can get much more accurate.

Edit 1
the vis has been quite successful, so I decided to awesomize it a bit.
Now by clicking on any colored line, it’s possible to “block” any given subway section. The shortest routes become those which do not go through that section. You can simulate what happens, for instance, when a given line is down.
The other change is that I’ve added walking distances. For each station I’ve computed the beeline distance with its 5 nearest neighbors. I’ve translated that distance into time (knowing that pedestrians can’t traverse buildings, have to mind trafic etc.).
But in many cases it is still quicker to go from one station to the next on foot than by train.


Gun murders in America

Click for interactive visualization

I have created this map of every homicide in the USA using firearms for the latest year where detailed information was available. Every, that is from all the agencies that report homicides to the FBI, which is not an obligation – this is why the map lacks Florida data.
In the interactive version you can see how murders happen through the year and explore them according to several criteria that were available in the database. While large shooting sprees receive media attention, unfortunately there are thousands of cases each year in just about every community.
Technically this is my first foray with d3.v3 and it uses two of its major new features, topoJSON for easy, lightweight maps, and hex binning to represent many individual events in one hexagon. Thanks to Mike Bostock for tutoring on that.


Events in the Game of Thrones

Events in a Game of Thrones

Click for the interactive version

I’ve created an interactive visualization on the Game of Thrones.

(A more technical post on how this was done to follow)


La nuit blanche

TL;DR: go check the model here

So I live in Paris and try to go to the museums there as much as I can, which is often.
About ten years ago (2002, I think) the city came up with what I thought was a fantastic idea: “nuit blanche”, an art event during one night in October. There were a dozen or so exhibits planned, all of which were fairly ambitious. For instance Sophie Calle would welcome visitors on a secret bedroom on the very top of the Eiffel tower.

When I saw the programme it was almost to good to be true: all those world class artists would do something extraordinary in Paris! I thought I could just casually walk all night from one installation to another along with a few other like-minded night wanderers. so it would look a bit like this:

(how to read this: circles are installations. The large black circle represents their capacity, the inner green circle the occupancy. If too many people try to enter an installation, the green circle will grow larger than the black one and will turn red, this is when queuing appears. Dots represent visitors).

Little did I know that the city expected 100000 visitors. Actually, 1 million showed up.

If the total number of visitors that were in the streets at one given moment in time is an indication of success, Nuit Blanche has been fantastic. But to anybody who’s experienced it it was really a night of waiting and walking. There just wasn’t enough capacity in the various places that would host the event, and no way for a visitor to expect the size of the queue that they would be facing (more on that later).

Every year, it’s a struggle. I’m like, the programme is so exciting, but then I know I will spend my time queuing or walking (in later editions, public transportation got better, so let’s say queuing and moving).

And each year I end up going, and I end up thinking: never again. (but hey, hey, and hey).

In the last edition there’s been something like 2.5 million visitors. There are about 100 places to visit in Paris. Many of which are indoors with a limited capacity and so enforce queues. Some are outside, so people can just walk past it and enjoy the art (sometimes, even from a great distance).

What I found annoying though it that there was no way to tell in advance which installations were going to be crowded, and which were going to be empty.
This is what led me to build this model.

Here’s how it works.
A certain number of people and attractions are randomly distributed in the city.
People will go to the nearest attraction. If it’s possible, they will go in, else they will queue until there is room.
While inside, they will enjoy it for a certain time, then they will move out and go to the nearest one they haven’t visited.

This model, which is fairly close to how things work currently, will always end up creating large queues in some places while others will run under capacity.

Now imagine a simple change: people get an idea of how long the queues are everywhere. Instead of going to the nearest exhibit, they will go to the place where they would be able to get in the quickest, taking into account the time to get there and the time to queue if applicable. Since nobody likes to wait, people will shy away from the places with long queues, evening the load, and by the same token reducing the waiting time for other visitors. Technically the waiting time at all the installations become Lyapunov functions, and provided they have enough information agents will make it optimal.

If you run this model in its full screen incarnation with dashboards, bells and whistles, you will see that this doesn’t really help increasing the number of installations that people get to visit. What happens is that the time spent waiting in queue shifts to time spent moving between places. (the overall time spent inside installations should increase, though).

In order to reduce the time spent moving, one can improve the public transportation (represented by the average speed, in the model), or putting installations close together, which has been done to some extent since the start. But do that with a limit, because that moving time is an interesting buffer, it’s not as annoying as queuing and Paris is beautiful at night, and when you’re among 2.5 million art enthousiasts it’s as safe a city can get.

To further improve the time spent visiting exhibits, one has to improve capacity or reduce attendance (two options which I trust the city of Paris will be wary about). One other option would be to improve the percentage of art installations that can be enjoyed from the outside, another parameter in the simulation.

Lastly, here are a few things the model doesn’t include, which IMO don’t make a lot of difference in the results, but to be honest:
– the installations are presented as being equally attractive. In fact, some are always more interesting than others and get even more queue. (that would make the model where queues are ignored even more unbalanced).
– the capacities of the installations are determined by random, but uniformly distributed. That isn’t the case, the largest museums are open that night and they can welcome thousands of visitors, while some other places would be full at 100.
– opening times of installations haven’t been taken into account; most will close at some point during the night.

and on that note, enjoy the model


Some simulation models

Inspired by the coursera class on model thinking, by Scott E Page, I have implemented a bunch of simulation models with d3. It’s something I have done on and off for the past few weeks, and let’s say it’s really the first 6. Those interactive versions just graze the surface of all there is to see.

So what I have is:



d3 tutorial at visWeek 2012

Jeff Heer, Scott Murray and myself have done a d3 tutorial at visWeek 2012. You probably gathered that from the title of the post.

Here is a link to all the slides and code examples that we have presented:

d3 tutorial

For the purpose of the tutorial I have compiled a d3 cheat sheet, on 4 pages it groups some of the most common d3 functions. When I was learning d3 my number one problem was figuring out which property should be set using .attr, and which required .style. And also: which svg element support which property? All of this is addressed in the cheat sheet. It’s part of the link above, but if you want it directly without downloading a 13Mb file, here it is:

d3 cheat sheet


A game of data

Inspired by the, well, inspiring set of Lost visualizations released by Santiago Ortiz – Lostalgic, I decided to publish the one visualization on all the data I had gathered on the Song of Ice and Fire series of books.

Click to see the vis

Here’s the idea behind this one. Many books set in a fantasy world come with a map where all the places mentioned in the books are situated. I end up looking up these places very often to get an idea for the distances, for instance. But the way these locations are placed on a map is one specific convention. If two places are supposed to be fairly close from each other on a map but that it is very inconvenient to travel from one to the other, it is as if they were far, and conversely, if two places are a world apart but travel between them is fast and easy, it is as if they were close.

With that in mind I am drawing a subjective map of the Game of Thrones world.
In the books, chapters are broadly comparable. Since all chapters are narrated from the point of view of one character, I link two places between which this protagonist has travelled in the course of one chapter. I also add links for travels suggested in the chapter, even if not done by the point of view character.
Places which are linked are drawn one to the other. As a result, this creates an alternate, abstract geography, where distances represent the difficulties and obstacles in travel, rather than distance in the territory.

In addition the size of the nodes depend on the number of times they are visited in the books. A node could be large even for a relatively empty place, if a lot of the action takes place there, this is true for Castle Winterfell or Castle Black. Then again, large cities which are alluded to in the story, but where not much happens in the books, such as Casterly Rock or Sunspear, will appear as tiny dots. King’s Landing, which is the settings of roughly 25% of the books, and also probably the largest city in this world, is the largest node.


Dimensionality reduction

Following my Tableau politics contest entry, here is another view I had developed but which I didn’t include in the already full dashboard.

In the main view I have tried to show how the values of candidates relate to those of the French. It’s difficult to convey that graphically when these values are determined by the answers to as many as 19 questions (and there are many many more that could be used to that effect).

Enter a technique called dimensionality reduction. The idea is to turn a dataset with many dimensions into a dataset with much fewer dimensions, as little as one, two or three. So we compute new variables, so that they capture virtually all the variability of the original dataset. In other words, if two records have different values in the original dataset, they should have different values in the transformed dataset too.

If you’re not allergic to words like eigenvalues the math is actually pretty simple. But let’s not go into that. The point is that with this technique you can represent a complex dataset as a two-dimensional dataset with very little loss.

Of course the technique doesn’t tell you what these new variables represent. Getting a feel for the data I postulate that the one on the X axis represents the toughness of a candidate (pro-security measures, no sensitivity for minorities, etc.) and the one on the Y axis is happiness with previous government. Or possibly, lover of the capitalistic doctrine.

Now you get a better feel for how close or distant the various candidates are from individual voters. You can also see which “spaces” remain empty or which are competitive. The top-right quadrant, for instance, looks tempting, but it is really nearly empty (about 200 respondents on over 1500). The right half of the matrix, that is the one which is sensitive to strength, has only one possible competitor but also few voters (~400 respondents). It makes more sense to remain an acceptable choice for the top half (750 voters) and especially the top left (550). In other words the Sarkozy mark should drift slightly towards the top left for optimal impact.


Tableau 2012 politics contest – justification and making-of

what led me to those choices

I was technically happy of my entry for the sports contest. I had done what I wanted: obtain a hard-to-find, interesting dataset, attempt to create an exotic, hard-to-make and never-tableau’d-before shape with aesthetic appeal and insights.

Yet the rules stated that the entries shall be judged on the story-telling front. While there were interesting insights, indeed, they didn’t constitute a story, a structured narration with a beginning and an end. Having worked on that subject on occasion, I think there is an inherent contradiction between a dashboard tool that lets a user freely manipulate a bunch of data and that articulated story where the user is more led throughout a process.

So that’s what led most of the work.

The second idea was that there is an unspoken, but IMO unnecessary rule about making Tableau dashboards compact things, highly interactive and interconnected. First, the elephant in the room: Tableau public is slow. It’s too slow. So too many interactions do not make a pleasant experience. Second, it is true that in Tableau one can assemble a dashboard out of interconnected worksheets, where clicking on one makes things happen in another. But just because you can doesn’t mean you should. Remember the <BLINK> element in the webpage of the 1990s? And this is this interconnectivity that causes dashboards to be compact and fit over the fold. If clicking on one element causes changes on another, you’d better be able to see both even on a laptop screen.

So the second idea was to create instead a long dashboard where a user would be held by the hand as she’s taken from point A to point B. Along the way, there would be texts and images to explain what’s going on, data – not necessarily interconnected, worksheets with little interactivity which can be understood at first sight, and which can stand some manipulation but don’t need to.

When visualization and storytelling intersect there is one form that I like which is to start with a preconception and to let the user find through manipulation that this idea is wrong. So I tried to use that in the dashboard as well.

The subject

That’s actually the #1 issue in French politics right now. Which strategy should the main right-wing party adopt? Typically, during the presidential campaign, both large parties fight for the votes of the center and are less radical than usual. But during this campaign the UMP, the party of the former president Nicolas Sarkozy, steered hard to the right in an attempt to steal back the voters gone to the far-right.

Apparently, that strategy was successful, even if he lost the presidential rate, he managed to somehow catch up against his rival.

Yet there are those who argue that if the party was more moderate, it would have been more successful and possibly win.

Anyway. The presidential race is over. But now the party is deciding which way to go next by electing its next leader.

Fortunately, there is data that can be used to determine whether the far-right or the moderate strategy can be more fruitful. This is what it is about.

Making the viz

Tableau dashboards can go up to 4000px in height, so that’s what I shot for.

So let’s say it loud and clear, it’s hell to manipulate large dashboards in Tableau, even with a very strong computer. When you add a new worksheet the legend part and the quickfilter part are added whenever there is room which could be thousands of pixels away. Since you can’t drag an element across screens you may have to proceed in babysteps. Once there is a certain number of elements, be they text, blanks or very simple and stable worksheets, adding another element takes a very long time, so does moving them around, etc.

As usual fixed size is your only friend, fixed heights, fixed widths, alternating horizontal and vertical layout containers.

So up to the last 2 worksheets there is really nothing to write home about. Only this: when you interact on the published workbook on the web it is painstakingly slow as the dashboard is reloaded and recomputed in its entirety. While this is ok for most of the worksheet, for the most complex one (the one with many sliders) it’s just unacceptable because the sheet won’t have time to be redrawn between two interactions.

So I came up with an idea: create a secondary dashboard with just that sheet, publish it independently, and then, in the previous dashboard, I have added a web page object. And that web page pointed to that other dashboard. So in effect, there is a dashboard within a dashboard, so when there is interactions in the complex worksheet, the secondary, smaller dashboard is the only one which is reloaded and recomputed, which is noticeably faster. Still not faster as in fast, but usable.

now publishing aspects aside this worksheet is interesting. The idea is to update a model based on 19 criteria. For every record, the outcome depends the “closeness” of the answers of the record and those of the candidates. The 19 parameters control the position of one of the candidates: Nicolas Sarkozy. So what I’ve done is calculate, outside of Tableau, the “distance” between each record and each of the other 8, and in the data file, I’ve specified that minimal distance and the name of the corresponding candidate. Then, in Tableau, I compute in real time the distance between the record and the parameters, and if that score is inferior to the threshold in the data file, then Sarkozy is deemed to be the closest, else it is the one from the data file. The worksheet tallies up the number of records which are closest to each candidate. Also, in order to keep the parameters legible I have constrained them to 9 values, when they really represent numbers between -2 and 2.

Also for the record, I have made a French and an English version. Why? Because I hope to get the French version published in a media and weight in on the debate, while I need the English version for the contest. This raises a lot of issues, all the worksheets need to exist in 2 versions, many variables need to be duplicated as well. As a sidenote the marks concerning a candidate are colored in tableau blue /orange in English in order to highlight candidate Sarkozy, but according to the campaign colors of the candidates in the French version.

That’s about it. I hope you enjoy my viz!


Which way to the right?

Here is my entry for the Tableau 2012 politics contest.

Source of the data:

Economic statistics from OECD, opinion data from TNS Sofres.

Making-of and explanation post to follow.