Salonanarchist | Leunstoelactivist

Using Strava tweets to analyse cycling patterns

A recent report by traffic research institute SWOV analyses accidents reported by cyclists on racing bikes in the Netherlands. Among other things, the data show an early summer dip in accidents: 53 in May, 38 in June and 51 in August. A bit of googling revealed this is a common phenomenon, although the dip appears to occur earlier than elsewhere (cf this analysis of cycling accidents in Montréal).

Below, I discuss a number of possible explanations for the pattern.

Statistical noise

Given the relatively small number of reported crashes in the SWOV study, the pattern could be due to random variation. Also, respondents were asked in 2014 about crashes they had had in 2013, so memory effects may have had an influence on the reported month in which accidents took place. On the other hand, the fact that similar patterns have been found elsewhere suggests it may well be a real phenomenon.


An OECD report says the summer accident dip is specific to countries with «a high level of daily utilitarian cycling» such as Belgium, Denmark and the Netherlands. The report argues the drop is «most likely linked to a lower number of work-cycling trips due to annual holidays».

If you look at the data presented by the OECD, this explanation seems plausible. However, holidays can’t really explain the data reported by SWOV. Summer holidays started between 29 June and 20 July (there’s regional variation), so the dip should have occurred in August instead of June.

Further, you’d expect a drop in bicycle commuting during the summer, but surely not in riding racing bikes? I guess the best way to find out would be to analyse Strava data, but unfortunately Strava isn’t as forthcoming with its data as one might wish (in terms of open data, it would rank somewhere between Twitter and Facebook).

A possible way around this is to count tweets of people boasting their Strava achievements. Of course, there are several limitations to this approach (I discuss some in the Method section below). Despite these limitations, I think Strava tweets could serve as a rough indicator of road cycling patterns. An added bonus is that the length of the ride is often included in tweets.

The chart above shows Dutch-language Strava tweets for the period April 2014 - March 2015. Whether you look at the number of rides or the total distance, there’s no early summer drop in cycling. There’s a peak in May, but none in August - September.


According to the respondents of the SWOV study, 96% of accidents happened in daylight. Of course this doesn’t rule out that some accidents may have happened at dusk, and there may be a seasonal pattern to this.

Many tweets contain the time at which they were tweeted. This is a somewhat problematic indicator of the time at which trips took place, if only because it’s unclear how much time elapsed between the ride and the moment it was tweeted. But let’s take a look at the data anyway.

I think tweets tend to be posted rather early in the day. Also, the effect of switches between summer and winter time is missing in the median post time (perhaps Twitter converts the times to the current local time).

That said, the data suggests that rides take place closer to sunset during the winter, not during the months of May and August which show a rise in accidents. So, while no firm conclusions should be drawn on the basis of this data, there are no indications that daylight patterns can explain accident patterns.


Perhaps more accidents happen when many people cycle and there’s a lot of rain. In 2013, there was a lot of rain in May; subsequently the amount of rain declined, and there was a peak again in September (pdf). So at first sight, it seems that the weather could explain the accident peak in May, but not the one in August.


None of the explanations for the early summer drop in cycling accidents seem particularly convincing. It’s not so difficult to find possible explanations for the peak in May, but it’s unclear why this is followed by a decline and a second peak in August. This remains a bit of a mystery.


Method

Unfortunately, the Twitter API won’t let you access old tweets, so you have to use the advanced search option (sample url) and then scroll down (or hit CMD and the down arrow) until all tweets have been loaded. This takes some time. I used rit (ride) and strava as search terms; this appears to be a pretty robust way to collect Dutch-language Strava tweets.

It seems that Strava started offering a standard way to tweet rides as of April 2014. Before that date, the number of Strava tweets was much smaller and the wording of the tweets wasn’t uniform. So there’s probably little use in analysing tweets from before April 2014.

I removed tweets containing terms suggesting they are about running (even though I searched for tweets containing the term rit, there were still some that were obviously about running) and tweets containing references to mountain biking. I ended up with 9,950 tweets posted by 2,258 accounts. 1,153 people tweeted only once about a Strava ride. Perhaps the analysis could be improved by removing these.
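As a sketch of that filtering step: the actual cleaning script is in R, and the keyword lists below are made up for illustration, not the exact terms used.

```python
# Illustrative sketch of the tweet-cleaning step (the real script is in R).
# The keyword lists are examples, not the exact terms used in the analysis.

RUN_TERMS = {"hardlopen", "run", "loopje"}   # terms suggesting a running activity
MTB_TERMS = {"mtb", "mountainbike"}          # terms suggesting mountain biking

def is_road_ride(tweet: str) -> bool:
    """Keep a tweet only if it mentions neither running nor mountain biking."""
    text = tweet.lower()
    return not any(term in text for term in RUN_TERMS | MTB_TERMS)

tweets = [
    "Ik heb een rit van 65 km gereden met Strava",
    "Mooie MTB rit door het bos #strava",
    "Lekker hardlopen vanochtend, 10 km #strava",
]
road_rides = [t for t in tweets if is_road_ride(t)]   # only the first remains
```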

I had to add 9 hrs to the tweet time, probably because I had been using a VPN when I downloaded the data.
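The correction itself is just a fixed timestamp shift; a Python sketch (the nine-hour offset comes from the text above):

```python
from datetime import datetime, timedelta

# Shift tweet timestamps that came out nine hours off (probably a VPN
# artefact in the downloaded data).
def fix_time(ts: datetime) -> datetime:
    return ts + timedelta(hours=9)

corrected = fix_time(datetime(2014, 7, 1, 0, 30))   # 00:30 becomes 09:30
```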

A relevant question is how representative Strava tweets are of the amount of road cycling. According to the SWOV report, about two in three Dutch cyclists on racing bikes almost never use apps like Strava or Runkeeper; the percentage is similar for men and women. The average distance in Strava tweets is 65km; in the SWOV report most respondents report their average ride distance is 60 - 90km.

In any case, not all road cyclists use Strava and not all who use Strava consistently post their rides on Twitter (fortunately, one might add). Perhaps people who tweet their Strava rides are a bit more hardcore and perhaps more impressive rides are more likely to get tweeted.

Edit - the numbers reported above are for tweets containing the time they were posted; this information is missing in about one-third of the tweets.

Here’s the script I used to clean the twitter data.

Moroccan trade union protests and the Arab Spring

In an analysis in the Washington Post, political scientist Matt Buehler argues that the Arab Spring was not just a spontaneous eruption of youth protests: «labour unrest [...] foreshadowed the popular mobilization of youth activists of the Arab blogosphere». In turn, these youth mobilisations created new opportunities for unions.

He illustrates this with an analysis of events in Morocco. Even before the Arab Spring reached the country and culminated in large protests in February 2011, the country had seen trade union protests sparked by the inequality exacerbated by neoliberal reforms. The combination of union and youth protests forced the regime to make concessions, resulting, among other things, in substantial wage and pension increases.

Results from a simple search on Google Trends seem largely consistent with Buehler’s finding that trade union protests preceded the 20 February mobilisation. Searches for trade union names started to rise in 2008 and 2009, that is before the rise in searches for AMDH, a human rights organisation that played a key role in the 20 February protests. Similarly, searches for grève (strike) peaked in 2008 and 2009, whereas searches for manifestation (march / demonstration) and sit in (the latter not shown in the graph) didn’t really start to rise until the end of 2010. It’s also interesting to note that interest in union-related search terms surged again following the February protests.

Exporting Google Trends data

Google Trends has a «download as csv» option which seems handy enough, but it has some issues. For one thing, if you try to export data on multiple search terms, it often seems to omit data for one of the search terms, even if all search terms were correctly shown on screen. I have absolutely no clue what this is about.

A solution might be to download data for each search term separately. A drawback is that data would then be normalised on a per search term basis (i.e., for each term the highest value would be set at 100). This means that it would no longer be possible to compare volume across search terms, but it would still be possible to compare patterns.
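A sketch of what that per-term normalisation does (hypothetical volumes, Python):

```python
# Google Trends scales each series separately, so that its own maximum
# becomes 100. The shape of each curve survives, but absolute volumes of
# different terms can no longer be compared. Volumes below are made up.

def normalise(series):
    peak = max(series)
    return [round(100 * v / peak) for v in series]

greve = [20, 80, 40]          # made-up monthly volumes for «grève»
manifestation = [2, 8, 4]     # ten times lower volume, identical pattern

# Both come out as [25, 100, 50]: patterns comparable, volumes lost.
```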

However, you then run into the problem that Google will export the data on a per month basis if volume is low and on a per week basis if volume is higher. I don’t understand why Google doesn’t offer the possibility to download all data on a per month basis so you can more easily compare. A hack is suggested here, but I couldn’t get it to work.
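One way to make a weekly export comparable with a monthly one is to aggregate it yourself; a sketch with pandas (made-up values):

```python
import pandas as pd

# Aggregate a weekly Google Trends export to monthly averages, so that
# series exported at different granularities can be put on the same basis.
weekly = pd.Series(
    [10, 20, 30, 40, 50],
    index=pd.to_datetime(
        ["2011-01-02", "2011-01-09", "2011-01-16", "2011-01-23", "2011-02-06"]
    ),
)
monthly = weekly.resample("MS").mean()   # "MS" = calendar month start
```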


Why is the government counting the number of «new townspeople»?

The research bureau of the Amsterdam government recently released a dataset about Amsterdam’s neighbourhoods, which contains over 20 variables that in some way deal with the ethnicity of local residents. The Netherlands has always had a somewhat dubious obsession with categorising people by ethnic background (not just on the basis of where they were born, but where their parents were born). Even so, I was a bit surprised by the category new townspeople (nieuwe stedelingen). People are considered new townspeople if they meet the following criteria:

  • Between 18 and 55 years old; and
  • Registered as a resident of Amsterdam after their 18th birthday; and
  • Either both parents were born in the Netherlands, or the person him- or herself or at least one of the parents was born in a Western country.

So who would invent such a weird category? A bit of googling reveals that the term new townspeople is associated with students and knowledge workers (but apparently not from India or Turkey) and that it’s used in combination with terms such as post-industrial economy, creative industry, Richard Florida, Bagels & Beans and pine nut sandwiches. In other words, new townspeople are associated with gentrification. In policy documents, a high share of new townspeople is seen as a positive sign for a neighbourhood.

Sociologist Jan Rath recently criticised the gentrification thing:

It’s become a controversial term, but administrators really do pursue a population policy in the city. Officially it’s a search for the right social mix in a neighbourhood, but in reality it really boils down to reducing the number of houses for the people with the lowest incomes.

In addition to that, local administrators apparently don’t think it’s awkward to measure the success of their policies by counting the number of new townspeople, a bureaucratic term for new residents who are not ethnic minorities.


Amsterdammers like old canal houses and dislike 1950s architecture

The research bureau of the Amsterdam city government (O+S) has published an Excel file containing a wealth of data about Amsterdam’s neighbourhoods. Among other things, it tells us how beautiful Amsterdammers think houses in their neighbourhood are. The average ratings are shown on the map below.

According to locals, the most beautiful houses are to be found around the Leliegracht (rated 8.7 out of 10) in the western canal belt. The ugliest are at the messy margins of the city, for example around the Weespertrekvaart in the Omval neighbourhood.

It will hardly come as a surprise that there’s a pretty strong correlation between the value of houses and how beautiful locals think they are. Either Amsterdammers have a posh taste in houses, or beautiful houses are expensive because people are willing to pay more for them (probably it’s a bit of both).

It so happened I had recently come across a new dataset from Statistics Netherlands (CBS) containing data on the construction period of houses by 4-digit postcode. I linked this data to the O+S data (for the challenges involved see the Method section below). The scatterplot shows neighbourhoods by share of houses from a specified period, and rating.

A few conclusions can be drawn:

  • In neighbourhoods with a high share of historic (pre–1906) houses, locals tend to think houses are beautiful;
  • By contrast, in neighbourhoods with a high share of post-war (1945 - 1960) houses, such as the western garden cities, locals tend to be more critical of the houses in their neighbourhood;
  • And post–2011 architecture doesn’t appear to be very popular either.

My first reaction to these findings was disappointment in my fellow Amsterdammers. Mainly for these reasons:

  • They don’t seem to particularly appreciate the Amsterdam School architecture, which largely coincides with the 1906–1930 period (or there would have been a positive correlation between rating and the share of houses from this period);
  • On the other hand, they don’t seem to realise how ugly much of the 1980s architecture really is (otherwise you’d expect a negative correlation between rating and share of houses from the 1980s).

A deeper dive into the data resulted in a somewhat more nuanced view. For some of the neighbourhoods, data is available at a more detailed level than the level I used in my analysis.

As for the Amsterdam School: a pretty sensational example is the Tellegenbuurt in the neighbourhood Diamantbuurt, which gets a mediocre 7 out of 10 rating (just above the median rating of 6.9). However, the more detailed data shows that at least the western part of the Tellegenbuurt gets a somewhat better 7.4. Similarly, the iconic het Schip housing block is in the Spaarndammer- and Zeeheldenbuurt, where locals rate the houses a 6.9, but the western parts of the Spaarndammerbuurt proper get a rating of 7.5.

I still think Amsterdammers undervalue the 1906–1930 period, but at least they do seem to show some appreciation for some of the most-acclaimed highlights of the period.

As for the 1980s: this was a period of urban renewal. It resulted in dull housing blocks in otherwise decent-looking neighbourhoods such as the Dapperbuurt, the Oostelijke Eilanden and the eastern part of the Indische buurt. This mixture may explain why these neighbourhoods don’t necessarily get very low ratings.


Method

The ratings of houses were collected in 2013, by asking the question «How do you rate the houses in your neighbourhood? (1=very ugly, 10=very beautiful)». The O+S file containing these ratings is available here and the CBS file containing data on period of construction here.

The main challenge consisted in linking the two datasets. Fortunately, the CBS also has a file containing neighbourhood data with the most prevalent 4-digit postcode (and also information on the share of houses that have that postcode). The link between postcode and neighbourhood is imperfect but not too bad. For example, in 57 out of the 97 neighbourhoods in my final analysis, over 90% of the addresses have the postcode associated with the neighbourhood.

Somewhat surprisingly, the O+S spelling of neighbourhoods is in some cases slightly different from the CBS (why?!). For example, Bijlmer oost (e,g,k) versus Bijlmer-Oost (E, G, K). I created a separate table to link the different spellings.
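For illustration, the merge via such a lookup table looks like this (Python instead of the R actually used; the rating and share values are made up):

```python
import pandas as pd

# Link O+S ratings to CBS construction-period data via a small lookup table
# mapping the diverging spellings. Values below are made up for illustration.
spelling = pd.DataFrame({
    "os_name":  ["Bijlmer oost (e,g,k)"],
    "cbs_name": ["Bijlmer-Oost (E, G, K)"],
})
os_data = pd.DataFrame({
    "os_name": ["Bijlmer oost (e,g,k)"], "rating": [6.5],
})
cbs_data = pd.DataFrame({
    "cbs_name": ["Bijlmer-Oost (E, G, K)"], "pre1906_share": [0.01],
})

merged = os_data.merge(spelling, on="os_name").merge(cbs_data, on="cbs_name")
```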

I used R to merge the files and check for correlations between the share of houses from a specific period and the rating of the houses (code on Github). One shouldn’t expect very strong correlations, for two reasons: first, the share of houses from a certain period will at best be one among many factors that influence the rating; and second, the imperfect link between postcode and neighbourhood adds noise.

The share of pre–1906 houses showed the strongest correlation with the rating of the houses (.51). For 1945–1960 the correlation was -.32 and for post–2011 it was -.39. There was a weaker, but still statistically significant, correlation for the 1960s (-.22).

I initially created a map with Qgis, but then I decided the map needed some interactivity. I created a new version with Leaflet and D3, using this tutorial to figure out the basics of Leaflet and how to combine it with D3. The initial result wasn’t pretty, but then I found the black and white tiles by Stamen (better than the OSM black and white) and now I think it looks better (although I guess maps overlaid with a choropleth will always look a bit smudgy).

Bicycle path

Amazing. Apparently, they sweep the bicycle paths at the Veluwezoom.


Opening Surveymonkey files in R

Many people use Surveymonkey to conduct online surveys. You can get standard pdf reports of your data, but often you’ll want to do some more analysis or have more control over the design of the charts. An obvious option is to read the data into R. But there’s a practical problem: Surveymonkey uses the second row of its output file for answer categories and puts some other information in that row as well. As a result, R will treat numerical variables as factors.
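The package itself is in R, but the equivalent fix in pandas terms would be to skip the second line while keeping the header (illustrative data):

```python
import pandas as pd
from io import StringIO

# Surveymonkey exports put answer categories (and other metadata) in the
# second row; read as data, they turn numeric columns into strings.
# Skipping that line (index 1) while keeping the header solves it.
raw = StringIO(
    "age,score\n"
    "Response,Response\n"   # the problematic second row
    "34,7\n"
    "28,9\n"
)
df = pd.read_csv(raw, skiprows=[1])   # age and score now read as integers
```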

I wrote a few lines of code which, I think, deal with that problem and turned that into an R package. Until recently it’d never have occurred to me to create an R package, but then I read this post by Hilary Parker, who describes the process so clearly that it actually appeared doable. I took some additional cues from this video by trestletech. The steps are described here.

I thought of adding a function to read data from Limesurvey, an open source alternative to Surveymonkey. But apparently, that functionality is already available (I haven’t tested it).

The package is available on Github.


Step by step: creating an R package

With the help of posts by Hilary Parker and trestletech I managed to create my first R package in RStudio (here’s why). It wasn’t as difficult as I thought and it seems to work. Below is a basic step-by-step description of how I did it (this assumes you have one or more R functions to include in your package, preferably in separate R-script files):

If you want, you can upload the package to Github. Other people will then be able to install it with devtools’ install_github function.



A new balance in Amsterdam’s city council?

Last autumn, Amsterdam politicians discussed on Twitter whether the relations between coalition and opposition have changed since the March 2014 election, which resulted in a new coalition.

One way to look at this is to analyse voting behaviour on motions and amendments over the past two years. From a political perspective, proposals with broad support may not be very interesting:

For example, a party can propose a large number of motions that get very broad support, but materially change little in the stance, let alone the policy, of the government. In the literature, this is sometimes referred to as «hurrah voting»: everybody yells «hurrah!», but is there any real influence? (Tom Louwerse)

In a sense, it could be argued that the same applies to proposals supported by the entire coalition. More interesting are what I’ll call x proposals: proposals that do not have the support of the entire coalition, but are adopted nevertheless. In the Amsterdam situation these are often proposals opposed by the right-wing VVD. The explanation is simple: Amsterdam coalitions tend to lean to the right (relative to the composition of the city council). As a result, left-wing coalition parties have more allies outside the coalition.

Let’s start with the situation before the March 2014 election. The social-democrat PvdA was the largest party. The coalition consisted of green party GroenLinks, PvdA and VVD, but the larger left-wing parties PvdA, GroenLinks and socialist party SP had a comfortable majority. The chart below shows the parties that introduced x proposals. The arrows show who they got support from to get these proposals adopted.

The size of the circles corresponds to the size of the parties; pink circles represent coalition parties. The thickness of arrows corresponds to the number of times one party supported another party’s x proposal. The direction of the arrows is not only shown by the arrow heads but also by the curvature: arrows bend to the right.

The image is clear: PvdA and especially GroenLinks were the main mediators who managed to gain support for x proposals.

And now the situation after March 2014. By now neoliberal party D66 is the largest party and the coalition consists of SP, D66 and VVD. This means that PvdA and GroenLinks are now opposition parties, but it turns out they still play a key role in getting x proposals adopted. GroenLinks initiated as many as half the x proposals.

The most active mediator is Jorrit Nuijens (GroenLinks), followed by Maarten Poorter (PvdA) and Femke Roosma (GroenLinks).


Data is from the archive of the Amsterdam city council. Votes on motions and amendments as of January 2013 can be downloaded as an Excel file. The file (downloaded on 31 January 2015) contains data on 1,165 (versions of) proposals, put to a vote until 17 December 2014.

A few things can be said about the Excel file. On the one hand, it’s great this information is being made available. On the other hand, the file is a bit of a beast that takes quite a few lines of code to control. The way in which voting is described varies (e.g., «rejected with the votes of the SP in favour», «adopted with the votes of the council members Drooge and De Goede against»); the structure of the title changed in November 2014; Partij voor de Dieren is sometimes abbreviated and sometimes not; and sometimes the text describing voting has been truncated, apparently because it didn’t fit into a cell. Given the complexity of the file, it can’t be completely excluded that some proposals have been classified incorrectly.
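A simplified sketch of that kind of classification (Python; the phrasings here are the English paraphrases from the text, while the real file uses varying, sometimes truncated Dutch wording, so the actual patterns differ):

```python
import re

# Classify a free-text voting description and extract the dissenting parties.
# The phrasings mirror the English paraphrases above; the real file uses
# (varying, sometimes truncated) Dutch wording.
def parse_vote(text: str):
    outcome = "adopted" if text.lower().startswith("adopted") else "rejected"
    m = re.search(r"votes of (?:the )?(.+?) (?:in favour|against)", text)
    dissenters = m.group(1) if m else None
    return outcome, dissenters

result = parse_vote("rejected with the votes of the SP in favour")
# → ("rejected", "SP")
```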

The analysis (by necessity) focuses on visible influence. The first name on the list of persons introducing a proposal is considered as the initiator. In reality, it will probably sometimes occur that an initiator will let someone else take credit for a proposal.

The code for cleaning and analysing the data is available here. The D3 code for the network graphs is based on this example.

Deceptive charts - do they work?

Anyone mildly interested in data visualisation must have come across examples of shamelessly deceptive Fox News charts. Truncated y-axes, distorted x-axes, messing with units - nothing’s too bold when it comes to manipulating the audience. But does this kind of deception actually work? Anshul Vikram Pandey and his colleagues at New York University decided (pdf) to find out. They showed subjects either control or deceptive versions of a number of charts.

The deceptive versions were: a bar chart with a truncated y-axis; a bubble chart with one bubble too large relative to the other; a line chart with a more spread-out y-axis, resulting in a less steep rise than in the control version; and a chart with an inverted y-axis (inspired by Reuters’ famous Gun Deaths in Florida chart - interesting discussion here). In all cases, the correct numbers were included in the chart.

Of course a truncated y-axis can sometimes be defensible and needn’t be deceptive, as long as it is made clear what’s going on. More problematic is the aspect ratio chart. The authors claim the chart to the right is deceptive and the one to the left not, but how can you tell? You can’t. There’s no rule that says what the number of pixels per year on the x-axis should be.

Be that as it may, the authors found substantial differences in how the deceptive charts were interpreted compared to the control charts. Note that in most cases, they didn’t measure whether deceptive charts were interpreted incorrectly, just whether they were interpreted differently than the control charts. For example, participants were asked how much better access to drinking water was in Silvatown, represented by the bar to the right of the bar plot, relative to Willowtown, represented by the bar to the left (on a 5-point Likert scale ranging from slightly better to substantially better). When shown the control bar chart, the average score was 1.45; with the truncated y-axis the average score was 2.77.

The authors also tried to find out whether factors such as education and familiarity with charts had an influence on how charts were interpreted. It appears that people who are familiar with charts are less easily fooled by a truncated y-axis. Perhaps because truncated y-axes are second on the list of phenomena chart geeks love to hate and criticise (after 3D exploding pie charts, of course).


Peak economist

On Friday, the New York Times published an interesting article by Justin Wolfers about the kind of experts the paper mentions. Don’t worry, he’s aware of the methodological issues:

While the idea of measuring influence through newspaper mentions will elicit howls of protest from tweed-clad boffins sprawled across faculty lounges around the country, the results are fascinating.

To summarize: by his measure, economists have become the most influential profession among the social sciences, and their influence rises during economic crises. Or at least that’s the case in the New York Times. I looked up data for the Dutch newspaper NRC Handelsblad, which has data available from 1990.

Some conclusions can be drawn:

  • The current ranking is the same as for the NYT, with economists heading the list and demographers at the bottom;
  • Apparently, NRC Handelsblad has always had a pretty high regard for historians, but due to the crisis they lost their top position to economists;
  • There was a peak in mentions of psychologists in 2012, but some of that can be ascribed to reports of scientific fraud by psychologist Diederik Stapel.

For comparison, I tried reproducing Wolfers’ NYT chart for the years 1990 - 2014. Here’s what I got:

The sudden increase for all professions in 2014 is unexpected - see Method for possible explanations. If we leave 2014 aside, what emerges is that «peak economist» (to borrow an expression from Wolfers) seems to have happened earlier in the NYT than in NRC Handelsblad. Perhaps something to do with the fact that the crisis hit the US earlier than Europe.


Method

The NYT data were downloaded from the NYT Chronicle Tool (I had to separately download the data for each search term). Data from NRC Handelsblad were downloaded using the website’s search function. In order to get the total numbers per year I also did a search using «de» («the») as a search term («de» is the most frequently used word in written Dutch).

As mentioned above, I got a steep rise in the percentages for all professions in the NYT in 2014. I manually checked some of the percentages against those in the chart of the NYT Chronicle Tool, and they appear to be correct. The spike is not visible in Wolfers’ chart, but that may be because he uses three-year averages.

There may be an issue with the denominator, i.e. the total number of articles. The number for total_articles_published in the data I downloaded from the NYT was pretty stable at about 100,000 between 1990 and 2005. Then it rose to about 250,000 in 2013 (perhaps something to do with changed archiving practices, or with online publishing?). However, in 2014, it dropped to about one-third of the 2013 level.
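Made-up numbers illustrate how sensitive the shares are to such a denominator drop:

```python
# Share of articles mentioning economists, per year. The numbers are made
# up, but show how a drop in the yearly article total inflates the share.
mentions = {2013: 150, 2014: 90}
totals   = {2013: 30000, 2014: 10000}

share = {year: 100 * mentions[year] / totals[year] for year in mentions}
# 2013: 0.5%; 2014: 0.9% - fewer mentions, yet a higher share.
```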

The NRC Handelsblad data also has some fluctuations in the total number of articles per year, but less extreme and at first sight they don’t seem to coincide with unexpected fluctuations in the percentages of articles mentioning professions.

Code is available here.