Before going any further, I invite you to follow any of these accounts that you don’t yet follow on Twitter. You would be in great company (see below).
This week I was looking at the Twitter API and had the idea of comparing our respective friends and followers on Twitter. We have very different profiles.
First, we are all based in different countries, except for @filwd and @moritz_stefaner, who are both in Germany. The interests of our audiences are different, too.
Still, I was surprised to see how little overlap there was among our followers. For each of us, at least one third of our followers do not follow any of the other accounts. So, even compared with @infosthetics, which has over 8000 followers, the collective audience of the group is approximately twice as large.
I then ran some further analysis to answer questions such as:
which accounts follow several of us, but not one specific account?
among the followers unique to an account, which have many followers? (e.g. @datavis is followed, and occasionally RT’d, by @alyssa_milano!)
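For illustration, here is a minimal sketch of this kind of comparison using plain Python sets. The account names and follower ids are made up, the function names are my own, and nothing here calls the Twitter API: it assumes the follower lists have already been downloaded into one set per account.

```python
from collections import Counter

# Hypothetical data: one set of follower ids per account.
followers = {
    "account_a": {"u1", "u2", "u3"},
    "account_b": {"u2", "u3", "u4"},
    "account_c": {"u3", "u4", "u5"},
}


def unique_followers(followers):
    """For each account, the followers who follow none of the other accounts."""
    result = {}
    for account, ids in followers.items():
        others = set().union(
            *(f for name, f in followers.items() if name != account)
        )
        result[account] = ids - others
    return result


def follow_several_but_not(followers, target, min_count=2):
    """Users who follow at least min_count of the other accounts but not target."""
    counts = Counter()
    for account, ids in followers.items():
        if account != target:
            counts.update(ids)  # each id counts once per account followed
    return {
        user
        for user, n in counts.items()
        if n >= min_count and user not in followers[target]
    }


# Size of the collective audience: the union of all follower sets.
collective_audience = len(set().union(*followers.values()))
unique = unique_followers(followers)
missing_a = follow_several_but_not(followers, "account_a")
```

Ranking the unique followers by their own follower counts would need one more lookup per user, which is left out here.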
Unfortunately, I couldn’t find the time to complete mini-challenge 2 and the grand challenge. I’m doing this in my free time and had to balance all kinds of commitments, so I couldn’t secure enough time to finish. Unlike previous years, though, I managed to find enough time to start! So, in the words of Charlie Sheen: winning.
So what is this about?
In the fictional Vastopolis, a mysterious infection strikes. Where does it come from, and how is it transmitted? To answer these questions, we have one million tweets sent by residents over the past 3 weeks. Among that million, quite a few are from people reporting symptoms.
The first thing I did was come up with a method to tell whether a tweet was actually about a disease or not, so I scored them. I made a list of words that were required for a message to count as related to sickness. They were fairly unambiguous, like “sick”, “flu”, “pneumonia”, etc. Each of those words added one point to a “sickness” score. Then there was a second list of more ambiguous words and phrases like “a lot”, “pain”, “fire”, etc. I added one point for each of these, but only if the message already contained a required word. There were a few false negatives and a few false positives, but all in all it was fairly accurate.
Fairly soon I had the idea of showing the sum of all the scores in each part of the map, rather than each individual tweet. Originally, though, the sectors were quite large and I showed the data by day.
Then I worked with finer sectors and 6-hour chunks. That’s how I could show that people moved towards the center of the map by day and back to its edges every night. With finer geographic detail I could also see spikes in various areas of the map that I couldn’t see before, and which were not necessarily related to the disease.
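The scoring rule described above could be sketched roughly as follows. The two word lists only contain the examples quoted in the text (the real lists were longer), and the matching details are assumptions: case-insensitive, whole words or phrases, and one point per listed term that appears rather than per occurrence.

```python
import re

# Only the examples quoted in the text; the actual lists were longer.
REQUIRED = ["sick", "flu", "pneumonia"]
AMBIGUOUS = ["a lot", "pain", "fire"]


def count_terms(text, terms):
    """Count how many of the terms appear in text as whole words or phrases."""
    return sum(
        1 for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    )


def sickness_score(message):
    """One point per required term; ambiguous terms only count alongside one."""
    text = message.lower()
    required_hits = count_terms(text, REQUIRED)
    if required_hits == 0:
        return 0  # ambiguous terms alone never make a message "sick"
    return required_hits + count_terms(text, AMBIGUOUS)
```

With this rule, a message about a building on fire scores 0, while one that mentions being sick and in a lot of pain scores 3.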
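As a rough sketch of this aggregation, the snippet below sums scores per grid cell and per 6-hour chunk. The tuple layout (timestamp, x, y, score), the cell size and the sample values are all assumptions made for the example, not the format of the challenge data.

```python
from collections import defaultdict
from datetime import datetime


def bin_scores(tweets, cell_size, hours_per_chunk=6):
    """Sum scores per (grid cell, time chunk).

    tweets: iterable of (timestamp, x, y, score) tuples.
    Returns a dict keyed by ((col, row), (date, chunk_index)).
    """
    totals = defaultdict(int)
    for when, x, y, score in tweets:
        cell = (int(x // cell_size), int(y // cell_size))
        chunk = (when.date(), when.hour // hours_per_chunk)
        totals[(cell, chunk)] += score
    return dict(totals)


# Hypothetical sample: two tweets in the same cell and chunk, two elsewhere.
tweets = [
    (datetime(2011, 5, 1, 7, 30), 12.0, 3.0, 2),
    (datetime(2011, 5, 1, 11, 0), 14.0, 1.0, 1),
    (datetime(2011, 5, 1, 13, 0), 12.0, 3.0, 3),
    (datetime(2011, 5, 1, 23, 0), 45.0, 38.0, 1),
]
totals = bin_scores(tweets, cell_size=10.0)
```

Making `cell_size` smaller or `hours_per_chunk` shorter gives the finer view; setting `hours_per_chunk=24` gives back the by-day view.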
Eventually, I wanted to read what the tweets actually said, so I loaded the full text of the messages so that clicking on a square would reveal what was said there at that moment. In this dataset, every spike in volume corresponds to an event that’s been added by the designers, so it was fun to discover everything happening there, from baseball games to accidents or buildings catching fire. Often, there were articles in the mini-challenge 3 dataset that would give more information about what really happened.
So, what was mini-challenge 3 about? Nothing less than diagnosing a possible terrorist threat. This time we were given not one million tweets, but thousands of articles which were much longer than 140 characters! From reading a few sample articles, I saw that most didn’t talk about terrorism or Vastopolis at all. But couldn’t they contain clues that would let us put two and two together?
My first idea was to find all the entities in the articles, that is, names of people or names of organizations (which follow a certain syntax), and arrange them in a network. The problem is that there were just too many names and groups (thousands of both), and I couldn’t tell from such a list which sounded suspicious. That said, a group called “network of hate” is probably not a charity. I’m sure it is possible to solve the challenge like this, but I chose another way to get my first leads.
I did the same as in mini-challenge 1 and scored my articles, but I gave them several scores instead of just one, by comparing them to several series of words. One series, for instance, was all the proper names in Vastopolis, like names of neighborhoods, because articles about Vastopolis are probably more interesting. The other series corresponded to various kinds of threats.
That allowed me to create the scatterplot form, which I used both to represent articles and to narrow the selection by selecting an area if needed. Then, as time went by, I added more and more features to the tool: an interface to read articles with keywords highlighted, the possibility to filter articles by keyword in addition to the graphical interface, the ability to see all the articles as a list and select from that list, not just from the scatterplot, and finally the possibility to mark articles as interesting and group them in another list…
That was about when I felt I could run out of time, so I didn’t add the other features I had planned or work on making a decent interface. Also, I spent a lot of time not just trying to solve the challenge, but reading all the stories that were planted in the dataset, linking them to the tweets of MC1, etc.
Anyway. I quite enjoyed working on that and really, really appreciated the humongous work that went into creating the VAST Challenge universe. I’m looking forward to seeing what other teams came up with. On a side note, it’s probably my last Protovis project, as it makes sense to completely switch to D3 now…
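To make the idea concrete, here is a small sketch of scoring one article against several series of words at once. The series names and their contents are placeholders invented for the example (the real series are not listed in this post), and counting every occurrence of a word is an assumption.

```python
import re
from collections import Counter

# Placeholder series: the actual word lists are not given in the post.
SERIES = {
    "vastopolis": ["vastopolis", "downtown", "uptown"],
    "threat": ["bomb", "explosion", "threat"],
}


def score_article(text, series=SERIES):
    """Return one score per series: the number of occurrences of its words."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        name: sum(counts[word] for word in words)
        for name, words in series.items()
    }


article = (
    "A bomb threat was reported in downtown Vastopolis. "
    "Police found no bomb."
)
scores = score_article(article)
```

Any two of these scores can then serve as the x and y coordinates of an article in a scatterplot.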
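The two ways of narrowing the selection, by area of the scatterplot and by keyword, boil down to simple filters. The sketch below assumes each article is a dict with a `text` field and a `scores` dict; that structure, the score names and the sample articles are all hypothetical.

```python
def select_area(articles, x_key, y_key, x_range, y_range):
    """Keep articles whose two scores fall inside the selected rectangle."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    return [
        a for a in articles
        if x_min <= a["scores"][x_key] <= x_max
        and y_min <= a["scores"][y_key] <= y_max
    ]


def filter_by_keyword(articles, keyword):
    """Keep articles whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [a for a in articles if needle in a["text"].lower()]


# Hypothetical articles with precomputed scores.
articles = [
    {"id": 1, "text": "Bomb threat downtown",
     "scores": {"vastopolis": 1, "threat": 2}},
    {"id": 2, "text": "Baseball game results",
     "scores": {"vastopolis": 1, "threat": 0}},
    {"id": 3, "text": "Explosion abroad",
     "scores": {"vastopolis": 0, "threat": 1}},
]

in_area = select_area(articles, "vastopolis", "threat", (1, 5), (1, 5))
with_keyword = filter_by_keyword(articles, "explosion")
```

Because both functions return plain lists of articles, they can be chained, which is roughly what combining the graphical selection with a keyword filter amounts to.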