champagne anarchist | armchair activist

Coursera Data Analysis and Interpretation

I was initially introduced to R by Nathan Yau’s Visualize This, but subsequently I learned a lot about R through some of the courses in Brian Caffo, Roger Peng and Jeff Leek’s Data Science Specialization at Coursera. In fact, the course was a reason for me to postpone switching from R to Python.

By now, I’ve decided to make the switch anyhow, and I think I’ve found another Coursera specialisation that will help me learn the tricks: Lisa Dierker and Jen Rose’s Data Analysis and Interpretation. It’s kind of basic, at least at the beginning, but that’s good. Some of the assignments require you to blog about a project of your choosing, so I’ll be posting about my homework here.


Can mistyped urls deliver representative samples?

An article on the Washington Post’s Monkey Cage blog describes how researchers managed to carry out opinion polls on executions in Bahrain, «one of the most difficult countries in the region for such sensitive research». In order to overcome the difficulties encountered, they ran two ‘innovative surveys’ in partnership with research company RIWI.

RIWI takes advantage of the fact that people sometimes make mistakes when they type a url in the address bar of their browser. If the url they mistakenly go to happens to be controlled by RIWI, they are redirected to a short questionnaire. RIWI claims this is a cheap way to obtain a non-biased sample.

This sounds like a smart approach that might actually work. But does it? Some people have doubts, such as one of the commenters on the Monkey Cage post:

Innovative is certainly one way to describe it. How can you possibly consider Internet typo redirects as a nationally representative sample? Would be very curious to see what the raw demographics look like compared to the population. Hope there was some sophisticated weighting used.

In a recent article in Nature, RIWI founder and CEO Neil Seeman explains his method. In a comment, one Charles Packer observes:

There are no citations here of publications that assess the validity of the company’s claims. Same for the corporate website: no discussion of the mechanics of its methodology.

When I searched for the company on Google, I found a lot of articles aimed at investors and very few discussing its research methods. The most detailed description of the methodology I found is in Seeman’s patent application. It explains that, for example, «Google could harvest the many thousands of users who inadvertently type in instead of and direct them to an online polling page, instead of simply redirecting them to the web site».

The main type of typos RIWI uses seems to be those where people type .cm, .co or .om instead of .com. RIWI uses the respondents’ IP addresses to guess their location. In his patent application, Seeman claims that his approach is successful in reducing bias:

Under the invention, every individual Internet user around the globe has the equal probability of being drawn into the potential respondent pool. This dramatically reduces selection bias and coverage bias as compared to all other current techniques of respondent identification and selection online. There is no reason to believe that the people who fail to randomly fall into the potential survey population (i.e., who do not make the typographical error) have distinct characteristics from the people who do, thus increasing the validity of the results. This makes the process of respondent selection scientifically valid, superior even to random digit telephone dialing.

Is that true? While their claims sound plausible, it’s still conceivable that bias occurs. It could arise, for example, through the selection of urls RIWI uses; because people who tend to make typos may be different from people who don’t; or because people who directly type urls into the address bar of their browser may be different from people who prefer to google for sites.
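To see why differences in typo behaviour would matter, here is a minimal sketch with made-up numbers (none of them come from RIWI or the patent application). It assumes that a group making up 30% of internet users either makes typos at the same rate as everyone else, or at twice that rate, and computes the group's expected share among the people who end up on the questionnaire.

```python
def expected_sample_share(population_share, typo_rate_group, typo_rate_others):
    """Expected share of a group among respondents recruited through typos.

    All inputs are hypothetical; the typo rates are the probabilities that a
    member of the group (or anyone else) lands on a survey page.
    """
    in_group = population_share * typo_rate_group
    others = (1 - population_share) * typo_rate_others
    return in_group / (in_group + others)


# Hypothetical scenario 1: everyone is equally likely to make the typo.
equal = expected_sample_share(0.30, 0.01, 0.01)

# Hypothetical scenario 2: the group makes the typo twice as often.
skewed = expected_sample_share(0.30, 0.02, 0.01)

print(f"equal typo rates: {equal:.2f}")
print(f"group makes twice as many typos: {skewed:.2f}")
```

With equal rates the group keeps its 30% share; with the doubled rate it makes up about 46% of the sample. Whether anything like this happens in practice is exactly what a published validation study would have to show.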

It has been claimed that RIWI has predicted election results in Egypt and Turkey more accurately than other firms. That sounds promising, but it would be helpful to know how many election outcomes RIWI has predicted and how accurate all of these predictions were. RIWI also refers to a validation study of one of their US samples, but the original study seems to have been removed from their website. The website’s FAQ says ‘third party and academic review’ is available, but only on request: «Yes, but please contact us first so we can get a sense of your needs and most applicable information to send you».

It’s quite possible that RIWI’s approach is superior to the survey panels used by other firms, but more openness about their methodology and results would make their case more convincing.


Base versus ggplot2

Yesterday, stats guru Jeff Leek confessed the ultimate unpopular opinion in data science: «I don’t use ggplot2 and I get nervous when other people do» (if you haven’t a clue what this is about, you may want to skip this post altogether). His confession met with ridicule, more ridicule, and an occasional «oh my god I thought I was the only one!».

I sort of assumed everybody uses ggplot now. I was wrong: I like base for graphics, is that weird? * Good reference for base graphics! (if, like me, you hate ggplot). * base graphics FTW re:slopegraph (it’s a royal pain to do this in ggplot). * I’m not a fan of ggplot. * Retro! * I kinda hate ggplot. * Vigorous group discussion on the merits of base plot vs #ggplot in #rstats.

For me, it’s six of one and half a dozen of the other: I’m planning to switch to Python.


Rabid feminists, fans and rightwingers

The Oxford Dictionary (the default dictionary on Mac OS X) has been accused of sexism in the examples it provides to illustrate how words are used. The debate focused on its definition of rabid:

1. having or proceeding from an extreme or fanatical support of or belief in something: a rabid feminist.
2. (of an animal) affected with rabies: her mother was bitten by a rabid dog.

Why this example? Why portray feminists as rabid?

Apparently, the Oxford Dictionary first ridiculed the critique, but later issued a statement:

We apologise for the offence that these comments caused. The example sentences we use are taken from a huge variety of different sources and do not represent the views or opinions of Oxford University Press. That said, we are now reviewing the example sentence for «rabid» to ensure that it reflects current usage.

«In other words, it’s not the dictionary that’s sexist, it’s the English-speaking world», David Shariatmadari commented in the Guardian. He adds a warning that the review the dictionary plans to do may well find that rabid in fact does occur more often in combination with feminist than with other words (especially if online discussions are included). Even so, the dictionary cannot simply hide behind a word count: its editors are still responsible for the editorial choices they make.

And how about the Guardian itself? The table below lists the words that appear most frequently after rabid in Guardian articles since 1999. The words have been stemmed so as to lump together terms like racist and racists.

term count
dog 136
anti 86
fan 63
right 33
support 21
nationalist 18
anim 15
press 14
rightwing 14
tori 13
republican 13
fanbas 13
rightw 12
puppi 11
follow 10
critic 9
antisemit 9
nation 9
racist 9
bat 9
feminist 9
crowd 9
home 9
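A count like the one in the table can be sketched in a few lines of Python. This is not the code used for the table: the sample texts are invented, fetching the Guardian articles is left out, and `crude_stem` is a hypothetical stand-in for a proper stemmer, so its output will not match the stems above.

```python
import re
from collections import Counter


def crude_stem(word):
    """Very rough stand-in for a real stemmer; for illustration only."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def words_after(term, texts):
    """Count the (stemmed) word that directly follows `term` in each text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\s+([a-z]+)", re.IGNORECASE)
    counts = Counter()
    for text in texts:
        for match in pattern.findall(text):
            counts[crude_stem(match.lower())] += 1
    return counts


# Invented sample texts, standing in for article bodies.
sample_texts = [
    "A rabid dog bit two rabid fans.",
    "Rabid dogs are rare; rabid fans are not.",
]
counts = words_after("rabid", sample_texts)
for term, count in counts.most_common():
    print(term, count)
```

Stemming is what lumps dog and dogs, or fan and fans, into a single row.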

The term anti deserves a separate analysis. The table below lists the most frequent words matching the pattern rabid anti[\s|\-]([a-z]+), again reduced to their stem.

term count
semit 11
european 10
communist 9

Terms like dog, anim[al] and bat obviously have to do with the second meaning of the term rabid (affected with rabies). Other than that, it’s clear that rabid is far more often used in combination with fan or rightwing than feminist. At least so in the Guardian.


I simply adapted the code I wrote earlier to analyse use of the term illegal in the Guardian and the New York Times.


Minister Jeroen Dijsselbloem takes up data visualisation challenge

Every year, Dutch Finance Minister Jeroen Dijsselbloem sends a report to Parliament on state participations - companies that are (partially) owned by the state. Recently, the minister answered questions from the Finance Committee of the Lower House. One question challenged the use of a stacked bar chart to show dividends, «since this isn’t very clear». The minister acknowledges the problem and takes up the challenge:

In creating this bar chart we aimed at comprehensiveness by including all dividends received from all state participations. Because of the large differences in dividend, this results in sub-optimal readability. For the 2015 annual report, it will be considered whether the readability can be improved without making concessions to comprehensiveness.

I’m sure he’ll be interested in good ideas, so if you have any suggestions for improving the chart, tweet them to @j_dijsselbloem. And if you want to give it a try yourself: here’s the data for 2010–2014.

Update: Jean Adams shows how the chart can be improved. Adams correctly points to a discrepancy between the csv and the original chart: the csv contains data on total dividend paid, whereas the original chart shows the amount received by the state (the two differ for companies that are less than 100% state-owned).
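The difference between the two figures is simple arithmetic. The sketch below uses invented company names, dividends and ownership shares (none of them are in the csv) and assumes dividends are paid out in proportion to ownership.

```python
def dividend_received_by_state(total_dividend, state_share):
    """Dividend received by the state, assuming a pro rata payout.

    `state_share` is the fraction of the company owned by the state (0 to 1).
    """
    return total_dividend * state_share


# Hypothetical companies: (name, total dividend paid, state ownership share).
companies = [
    ("Company A", 100.0, 1.0),
    ("Company B", 100.0, 0.5),
]

received = {
    name: dividend_received_by_state(total, share)
    for name, total, share in companies
}
for name, amount in received.items():
    print(name, amount)
```

For the fully state-owned company the two figures coincide; for the half-owned one the state receives only half of the total dividend paid.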