Salonanarchist | Leunstoelactivist

Spamming after all? Revisiting the repost ratios of Vox, Upshot and 538

Recently I wrote about people who share their URLs on Twitter, and then post them again, hoping to draw even more people to their site. I said that FiveThirtyEight reposts its URLs on average 0.3 times. I was wrong: it reposts its URLs far more often. And so do voxdotcom and UpshotNYT, which didn’t even make the top 5 in my original analysis. The Upshot reposts its URLs on average as many as 0.8 times.

The reason I underestimated the repost ratios in my original analysis is that tweets tend to contain shortened URLs. The same article can be shared through different shortened URLs, which look like different links. However, since they point to the same article, one should be treated as a repost of the other (or perhaps both are reposts of yet another one, who knows). If you don’t take this into account and treat them as different URLs, you’ll underestimate the number of reposts (red bar in the graph).

It’s not that I wasn’t aware of this problem when I did the first analysis. I first tried to account for this by looking up the non-shortened URLs, using the Python urllib2 module. It turned out this was very time-consuming, which was a problem since I wanted to look up quite a few URLs. Pragmatically, I decided instead to use the ‘expanded URL’ provided by the Twitter API. This method does yield higher repost ratios for 538 and the Upshot (grey bars in the graph). Still, it doesn’t really solve the problem, because the expanded URL provided by the Twitter API will sometimes be yet another shortened URL. That’s the reason I still underestimated how often people recycle their content on Twitter.
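The idea behind that approach can be sketched in a few lines. Here’s a Python 3 version using urllib.request, the successor to urllib2 (the function name and details are mine, not the original script’s):

```python
import urllib.request

def expand_url(url, timeout=10):
    """Follow any redirects (e.g. from a URL shortener) and
    return the final URL the link resolves to."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Each call makes a real HTTP request and waits for every redirect in the chain to resolve, which is why doing this for thousands of URLs is so time-consuming.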

When I realised the ratios I had originally calculated were still rather low given how many reposts there appeared to be in my timeline, I decided to recalculate repost ratios using urllib2 after all. Because this method is so time-consuming, I did this for just three accounts: Vox, 538 and Upshot NYT. This resulted in repost ratios that are substantially higher (light blue bars in the graph). The new Python script is here.

Note that the ratios are snapshots calculated on a sample of the 200 most recent tweets (that is, about one to two weeks of tweets).


Rise in Dutch cycling accidents, but Strava probably not to blame

The number of wielrenners (cyclists on racing bikes) treated at Dutch emergency departments has doubled since 2010, according to a study published today. Among a range of possible explanations the authors mention the popularity of apps like Strava:

The increasing popularity of smartphone apps like Strava, which let you keep track of cycling records for certain tracks and compare them with others, can lead to dangerous situations.

Like I said, this is just one of many possible explanations discussed in the report and the authors are by no means suggesting that Strava is a key factor causing cycling accidents. That said, the idea that Strava may have played a role doesn’t seem to be a priori absurd.

Strava was launched in 2009, but when did it become popular in the Netherlands? I couldn’t find any direct data on this, but Google Trends offers a plausible proxy.

The Google data are pretty clear: interest in Strava didn’t take off until February 2012 in the Netherlands (interestingly, the search volume index is highest in Limburg and Gelderland, which are also the main hilly regions in the Netherlands). As an extra check, I looked at messages on the forum pages (you need to log in to search the forum) containing the search term ‘strava’. There were 10 messages prior to 1 February 2012 and 1,843 after that date, which seems to confirm the Google pattern.

By contrast, the number of wielrenners at emergency departments saw its biggest increase between 2010 and 2011. The number was stable at about 2,000 prior to 2011, but rose to 3,700 in 2011 and 4,200 in 2012. So it seems Strava was largely unknown in the Netherlands at the time when the largest increase in cycling accidents happened.

The reason for the study was a media storm last year about supposed irresponsible behaviour of wielrenners towards ‘normal’ cyclists. Car lobby club ANWB even suggested wielrenners should stay at home on sunny days.

In a survey among wielrenners, 45% said wielrenners do not sufficiently adjust their speed and 51% said wielrenners often ride in (too) wide groups. An analysis of 2,849 injury-causing accidents involving two cyclists revealed that in 24 cases a ‘normal’ cyclist got injured as a result of a collision with a wielrenner. So while many wielrenners agree that (some) wielrenners behave irresponsibly, this doesn’t seem to be a major cause of injuries among other cyclists.

Wielrenners themselves have about 2.2 injuries per 100,000 hours of activity. This is much lower than the figure for all sports combined (7.1). However, 23% of wielrenners who go to the emergency department have to be treated in hospital, compared to 6% for all sports. So in terms of serious injuries, wielrennen doesn’t seem to be much safer or less safe than other sports.

While it’s difficult to pinpoint the exact cause of the rise in accidents involving wielrenners, the authors of the report suggest the capacity of cycle paths is no longer sufficient given the rising number of cyclists, including a rise in cycling among people above 55. One of their recommendations is to create more ‘cycling highways’ for fast cyclists.


Identify potential spammers in your timeline, using Python

(Also see follow-up article here) - Twitter has become an important tool to let people know you’ve published a new article on your website. It has been suggested that you can get more visitors if you tweet the article’s URL not once, but multiple times. Unfortunately, some people are following that advice and are systematically reposting URLs.

So who are those people? Identifying the biggest reposters in your timeline is quite straightforward (whether these people are spamming is up to you to decide). Here’s a script that calculates the repost ratio, that is, the average number of times people repost URLs, for each person you follow. For people who post URLs only once - in other words, who never repost them - the ratio will be zero. Here are the biggest reposters among the accounts followed by Data and Data Viz:

DataDrivenJournalism reposts URLs on average 0.36 times
FiveThirtyEight, 0.30
HelpMeViz, 0.30
Jon Schwabish, 0.23
Archie Tse, 0.20
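The calculation itself is simple. Here is one plausible way to compute the ratio (the helper below is my own simplification, not the linked script):

```python
from collections import Counter

def repost_ratio(urls):
    """Average number of reposts per unique URL:
    (total posts - unique URLs) / unique URLs.
    Posting every URL exactly once gives a ratio of zero."""
    counts = Counter(urls)
    unique = len(counts)
    if unique == 0:
        return 0.0
    return (len(urls) - unique) / unique
```

For example, someone who posted eight tweets linking to five distinct URLs would get a ratio of (8 − 5) / 5 = 0.6.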

To be fair, the numbers show that these people only repost some URLs. Further, people who do not normally repost URLs may still end up with a relatively high repost ratio if there are one or two URLs that they have reposted very often: these outliers would drive up their average number of reposts. Here are some potential outliers:

Cole Nussbaumer linked to this page with workshop dates 21 times in her 200 most recent tweets
WTFViz: WTFViz submit page, 9 times
Zack Beatty: tool, 8 times
HelpMeViz: Help Me Viz homepage, 8 times

These examples illustrate that there may be legitimate reasons to repost URLs. For example, Cole Nussbaumer’s page with workshop dates probably changes frequently, so reposting that URL would seem to make sense.

If you don’t want these often-posted URLs to drive up the repost ratio, you can calculate the repost ratio as the share of URLs that got reposted at least once. That way, you’ll disregard how often they got reposted. Here are the top 5 results by that method:

FiveThirtyEight now has a repost ratio of 0.27, which means it reposts about 1 in 4 URLs
DataDrivenJournalism, 0.23
HelpMeViz, 0.17
NPR visuals team, 0.16
Jon Schwabish, 0.14

In case you’re wondering: my own repost ratio is 0.06 by the first method and 0.05 by the second.
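The alternative, share-of-URLs-reposted measure can be sketched similarly (again, this helper is my simplification, not the linked script):

```python
from collections import Counter

def repost_share(urls):
    """Share of unique URLs that were posted more than once,
    disregarding how often each one was reposted."""
    counts = Counter(urls)
    if not counts:
        return 0.0
    reposted = sum(1 for n in counts.values() if n > 1)
    return reposted / len(counts)
```

Because a URL reposted twenty times counts the same as one reposted once, a single frequently-recycled link can no longer drive up the score.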

Not ditching R for Python just yet

As a result of the whole controversy over using Python vs R for statistical analysis and graphs, I thought I’d switch to Python. Mostly because I think it’s more practical to use the same language for different tasks, but also because it seems easier to make decent-looking graphs with Python (I’m sure some people will thoroughly disagree). And, of course, because googling for solutions using «Python» as a search term simply works better than searching for «R».

But now Brian Caffo, Roger Peng and Jeff Leek’s Data Science Specialization Course has started on Coursera and they use R. I guess I’ll have to postpone my decision.


Big Brother: state or capitalist

George Orwell’s Nineteen Eighty-Four describes a future characterized by total surveillance (with telescreens observing people in their own homes, even monitoring their heartbeat and recognizing their facial expression). This surveillance is carried out by the state and its helpers. Corporations play no role in it.

In fact, corporations and capitalism are a thing of the past in Nineteen Eighty-Four, for private property has been abolished. A children’s book explains that capitalists were rich, ugly men wearing top hats. The Party constantly emphasizes how terrible conditions were before the Revolution and how much better they are today. But the main character, Winston Smith, can’t help wondering whether things had really been that bad in the past and whether capitalists had really been such terrible creatures.

The suggestion is clear: the state is using capitalists as a scapegoat to mask its own failings (in fact, if I were a member of today’s whining one percent, I'd claim that Orwell had predicted the current «rising tide of hatred of the successful one percent»).

Today, thirty years after 1984, private property hasn’t been abolished, but we are approaching a level of surveillance pretty close to what Orwell described. When we try to explain what’s going on, we frequently use the term Big Brother. But when we do, are we referring to the state, as Orwell did, or do we have capitalists in mind?

To explore this matter, I looked up how often newspaper articles mention Big Brother in combination with either the names of government agencies, or the names Google and Facebook (of course I should have included Apple, notwithstanding their smart privacy patent, but I left them out for practical reasons explained below). The results are shown in the graph below. For the non-Dutch: NRC is a Dutch newspaper and AIVD is the Dutch intelligence service.

It appears that Google and Facebook turn up in combination with Big Brother far more often than government agencies like the CIA, MI5 or AIVD. However, as the red bars show, this has changed since the revelations of Edward Snowden. Since May last year, the NSA has been mentioned in combination with Big Brother more often than Google or Facebook (in the Guardian, the same applies to the GCHQ).

So Orwell didn’t foresee the role of corporations in mass surveillance, and we used to have a blind spot for the role of the state - but Snowden seems to have fixed that.


I used the Guardian and New York Times APIs to look up how often names of selected state agencies and corporations have appeared in combination with Big Brother in articles over the past ten years. I removed the results from the Guardian media section to get rid of most references to the Big Brother TV show. I wanted to include Apple, but unfortunately, the newspaper APIs don’t distinguish between apple and Apple. I thought searching for iPhone might be a practical solution, but the Guardian results included articles containing ‘I phone’. The NRC doesn’t have an API so I looked up the terms manually; the timeline to the right of the search results makes it quite easy to count the number of post-Snowden occurrences. In all cases, the method of searching the newspaper archives is imperfect in that it yields some unwanted results (e.g. articles mentioning somebody’s big brother that have nothing to do with Big Brother).
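As an illustration, a query against the Guardian’s content search endpoint might be built like this. The endpoint and the `q`, `from-date` and `api-key` parameter names come from the public Guardian API; the exact queries used in the analysis above are my assumption:

```python
from urllib.parse import urlencode

GUARDIAN_SEARCH = "https://content.guardianapis.com/search"

def search_url(term, api_key, from_date=None):
    """Build a Guardian Content API search URL for articles
    mentioning the given term (the q parameter supports
    boolean operators such as AND)."""
    params = {"q": term, "api-key": api_key}
    if from_date:
        # e.g. "2013-05-01" to count only post-Snowden articles
        params["from-date"] = from_date
    return GUARDIAN_SEARCH + "?" + urlencode(params)

print(search_url('"Big Brother" AND NSA', "YOUR-API-KEY"))
```

Fetching that URL returns JSON with a total result count, so counting co-occurrences is a matter of reading one field per query.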


Problematic cycling charts

You might think the graph above is about the effort required for climbing, with those little bicycles going up the slope, but it’s not (in fact, it shows for each bicycle type how much more power is required to cycle as speed increases). Apparently, somebody added the bicycles for «fun», without giving much thought to what the graph is supposed to communicate.

The graph is from the book Cycling Science (not to be confused with the intriguing Bicycling Science), a book full of charts that explain how cycling works. Unfortunately, it contains quite a bit of chart junk and some of the graphs raise more questions than they answer.

For example, the chapter on cycling safety has a map that suggests the Netherlands is the most unsafe country for cycling. The problem is that it shows the percentage of road deaths who are cyclists, which says more about how many people cycle than about cycling safety. Another graph says Chris Boardman managed to cycle more than 56 km in an hour when he assumed a super-aerodynamic position, but that he would only manage 15 km when sitting upright. Really?

Despite car sharing, still lots of cars in Amsterdam

Does car sharing mean the end of the car as we know it? A study by consultancy Alix Partners of American metropolitan areas claims that each vehicle in a car-sharing fleet leads to 32 fewer cars being bought.

I haven’t seen the original report, but apparently respondents were asked whether they had avoided buying a car due to their participation in a car-sharing scheme; 51% said yes. The average car-sharing service would have about 66 members per car, which roughly works out to the reported 32 canceled car sales per shared-use car. Of course, this is not the most rigorous way to measure the impact of car sharing. All the same, the study suggests that the impact may be huge.

In Amsterdam, the number of cars in car-sharing schemes has grown (xls) from 378 to 1,476 over the past ten years. If the Alix Partners number held true here, that growth would mean some 35,000 fewer cars sold. In reality, the number of cars for private use has risen from 184,000 in 2003 to 201,000 in 2013. The number of cars per 1,000 remained pretty stable at about 250. In the inner city, the total number of cars has risen from 19,190 in 2004 to 19,840 in 2012.
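The arithmetic behind these figures is easy to check. All numbers below come from the text; treating the reported 32 as a rounded-down version of 51% of 66 is my reading:

```python
# Alix Partners figures: 66 members per shared car,
# 51% of whom say they avoided buying a car
avoided_per_shared_car = 0.51 * 66
print(avoided_per_shared_car)  # roughly 33.7; the report works with 32

# Amsterdam: the shared fleet grew from 378 to 1,476 cars in ten years
fleet_growth = 1476 - 378
print(fleet_growth * 32)  # 35136, i.e. "some 35,000 fewer cars sold"
```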

Of course, it’s unrealistic to assume that the ratio of 32 cars not bought per car-sharing vehicle applies in Amsterdam. A study from 2006 on car sharing in the inner city found (pdf) that half the users hadn’t owned a car in the first place. In that study, each shared-use car replaced three privately-owned cars (this would still imply that two parking spaces can be freed up for each car-sharing vehicle introduced). Perhaps the ratio has gone up a bit since, that is, if the number of members relative to the fleet has gone up.

Anyway, it seems that the current number of car-sharing vehicles may have reduced car ownership by a few thousand at most. For a more substantial impact, we’d need more shared-use cars.

TNS NIPO is about to launch a monitor on car sharing in the Netherlands.


Efforts to raise turnout in elections may increase turnout inequality

Just the other day I posted something about unequal voter turnout in Amsterdam (higher turnout in neoliberal-voting neighbourhoods; lower turnout in left-voting neighbourhoods). The conclusion would seem obvious: raise turnout, and election outcomes will likely become more representative of the preferences of Amsterdammers.

Now it turns out things may not be that simple. Based on a smart analysis (via), Ryan Enos, Anthony Fowler and Lynn Vavreck find that «get out the vote» efforts may raise turnout disproportionally among people who are more likely to vote in the first place, thus exacerbating turnout inequality.

This is not inconsequential, for these «high-propensity» citizens are far from representative of the general population. They are:

wealthier, more educated, more likely to attend church, more likely to be employed, more likely to approve of Bush, more conservative, and more Republican. They are more supportive of abortion rights and less supportive of withdrawing troops from Iraq, domestic spending, affirmative action, minimum wage, gay marriage, federal housing assistance, and taxes on wealthy families.

All in all, it seems that in many respects, people who are likely to vote lean to the right compared to the general population; and that this right-wing bias may be exacerbated by efforts to raise turnout.

This is pretty sobering, but it doesn’t mean that the whole idea of raising turnout should be thrown out of the window. First of all, Enos et al. point out that their method can be used to gain a better understanding of the impact of interventions. Hopefully this will help develop interventions that reduce inequality instead of increasing it.

Second, it appears that the experiments analysed by Enos et al. randomly assigned people to treatment or control groups (I checked this for the largest experiments - the ones done by Gerber, Green & Larimer and Nickerson & Rogers). Of course, this is good practice from a research point of view.

However, it might still make sense to do voter mobilisations that specifically target a group of unlikely voters (instead of a randomly selected treatment group). For example, one might target a neighbourhood that normally has very low turnout. If I understand the findings of Enos et al. correctly, it’s conceivable that this would increase turnout inequality within the targeted neighbourhood, while at the same time reducing turnout inequality across the entire city.

Then again, perhaps we should consider compulsory voting after all (I’ll admit I used to be pretty sceptical of that idea). In a previous study, one of the authors (Anthony Fowler) analysed the impact of the introduction of compulsory voting in Australia in the first half of the 20th century. «When near-universal turnout was achieved, elections and policy shifted in favor of the working-class citizens who had previously failed to participate.» (pdf)


High turnout in liberal-voting neighbourhoods, low turnout in left-voting neighbourhoods

A ‘prominent civil servant with a social-democrat background’ gets to hand out 400,000 euros in subsidies to turn out ethnic minorities to vote, the Telegraaf newspaper reported last week. «It’s not difficult to guess which parties will benefit the most from a turnout campaign among hard to reach groups of voters.»

Ok, so they’re hyping it a bit, but the story is more or less accurate. Last year, the city council almost unanimously asked for a campaign that should result in «a turnout of at least 65% across Amsterdam and a substantial increase in turnout in districts that have a low turnout and among specific groups».

Turnout in elections is uneven, as the charts below illustrate. In neighbourhoods where many people voted economic left (SP or PvdA), turnout was low in 2010. By contrast, in neighbourhoods that tend to vote (neo)liberal (pro-market parties VVD and D66), turnout was high. At one end of the spectrum there’s Bijlmer Centrum: 57% voted economic left in 2010, but turnout was only 34%. At the other end, there’s for example the Apollobuurt: 57% voted liberal and turnout was 65%. A similar pattern occurred in previous elections.

What causes this correlation between political outcome and turnout? A possible explanation: highly educated, well-paid, white homeowners have more confidence that politicians will take their interests into account. Therefore, they’re more inclined to think it makes sense to vote. And they often vote liberal.

Interestingly, turnout isn’t always that unequal, as a comparison of the 2002 and 2006 elections serves to illustrate.

The boxplot to the left shows that turnout tended to be higher in 2006 than in 2002. At least as interesting is the fact that inequality in turnout decreased. The chart to the right shows how this happened. In almost all neighbourhoods, turnout rose relative to 2002, but it rose most in neighbourhoods that had low turnout in 2002. Examples include the Kolenkit in West, the Vogelbuurt in Noord and Bijlmer Centrum. Incidentally, turnout inequality rose again in 2010.

A similar development has taken place at the national level. In elections for the Lower Chamber, liberal-voting municipalities tend to have higher turnout than left-voting ones. Again, turnout inequality was lower in 2006 than in 2002 and 2003. (If you want to check the calculations: data and code for the analysis at both the local and the national level can be found here.)

2006 was a year in which left-wing parties got relatively many votes. For example, PvdA, GroenLinks, SP and AADG jointly got 33 seats in the Amsterdam council, compared to 26 in 2002. Since turnout was less unequal in 2006, it’s conceivable that the 2006 election result better reflected the preferences of Amsterdammers than that of 2002.

In any case: if we want a fairer election outcome, it’s important to get more people to vote, especially in neighbourhoods that tend to have low turnout. Whether the municipal turnout campaign will be effective is difficult to say on the basis of the plans, but it is possible to raise turnout. For example, by organising local elections on the same day as national elections.

About those weird Netflix genres

The hippest story on Twitter right now is how Alexis Madrigal of the Atlantic discovered the 76,897 genres Netflix uses to classify its movie offering. Some examples of these weirdly specific genres include Critically-acclaimed Cerebral Independent Films; Feel-good Movies starring Elvis Presley and Coming-of-age Animal Tales.

Madrigal explains how straightforward it is to navigate all the genre pages on the Netflix website by incrementing the id in the URL. But then he mentions that he retrieved the genres using «an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web». Surely a few lines of Python code could’ve done the job? In fact, I guess you could probably extract the subgenre structure and the genre elements - region, adjectives, time period etc. - with nltk and regex.
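For what it’s worth, those few lines of Python might look something like this. The genre-page URL pattern below is an assumption based on Madrigal’s description of incrementing a numeric id; the parsing simply grabs each page’s title:

```python
import re
import urllib.request

# Assumed URL pattern for a genre page with a numeric id
GENRE_URL = "http://movies.netflix.com/WiAltGenre?agid={}"

def genre_name(html):
    """Extract the genre name from a genre page's <title> tag,
    or return None if there is no title."""
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return match.group(1).strip() if match else None

def scrape_genres(genre_ids):
    """Fetch each genre page and map id -> genre name,
    skipping ids that fail to load or have no title."""
    names = {}
    for gid in genre_ids:
        try:
            with urllib.request.urlopen(GENRE_URL.format(gid), timeout=10) as r:
                html = r.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        name = genre_name(html)
        if name:
            names[gid] = name
    return names
```

Sorting the resulting names into subgenre structure - region, adjectives, time period and so on - would indeed be a job for nltk or a handful of regexes.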

Never mind that, though. Madrigal’s article is an interesting read. Here it is if you haven’t read it yet. And here’s a critique of Netflix’s algorithms by Felix Salmon of Reuters, who argues that its recommendations are no longer about quality but about offering more of the same. You watched one Dark Political Movie from the 1980s? Then we’ll show you some more Dark Political Movies from the 1980s.