champagne anarchist | armchair activist

My entry for the Best Worst Viz competition

Number of tweets with hashtag #BestWorstViz, per date of the month April 2016 and time of the day. Times are UTC, 18 April is the deadline. Data updates every hour; clear browser history to refresh. Entry for Best Worst Viz competition, created by dirkmjk.

I love to hate bad graphs (who doesn’t), and I think Andy Kirk’s idea to organise a Best Worst Viz competition is quite brilliant. As he explains, there’s something fair about creating your own bad graph rather than criticising somebody else’s:

[..] picking on bad visualisation involves work by other people who we might never meet or have a chance to learn about what the true circumstances and intent of a project were. The essence of this challenge is based on your best worst visualisation - the best worst visualisation you can possibly make.

I had to give it a try. But how? An exploding 3D pie chart, truncated y-axis, out-of-control spaghetti chart - it all seemed a bit too obvious. I aimed for something different, drawing inspiration from the blink element of the early days of web design. The shifting colours of the stacked bar chart pointlessly illustrate the direction of time - or whatever. I think it’s pretty bad.

Standalone version of graph here.

Links between businesses and politics II: revolving door and access to ministers

Eline Huisman and Ariejan Korteweg of the Volkskrant have done some good investigative journalism by finding out how often companies, organisations and individuals have visited the current ministers (this data wasn’t publicly available in the Netherlands). It’s interesting to compare the top–10 of companies with access to ministers to the top–10 of revolving door companies (companies where national politicians have or have had a position).

Position of companies on the access to ministers ranking and the revolving door ranking

Company          Access   Revolving door
Air France-KLM   1        6
Rabobank         2        1
Shell            3        2
ING Bank         4        5
Schiphol         6        -
Aegon            7        8
KPN              8        -
SNS Reaal        9        -
KPMG             10       4
NS               -        7
Delta Lloyd      -        9

I’m sure more can be said about this, but the comparison shows there’s considerable overlap between the two lists (for the geeks among you: the Jaccard index is 0.54). The following companies score high on both measures of political ties: Air France-KLM, Rabobank, Shell, ING Bank, ABN Amro, Aegon and KPMG. Dutch Railways (NS) and PGGM don’t feature in the Volkskrant business ranking because they classify them as semipublic.
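For those who want to check the Jaccard index: it is the size of the intersection of two sets divided by the size of their union. Below is a minimal Python sketch, not the original calculation. The company names come from this post; treating ABN Amro as a member of both top–10s and PGGM as a member of the revolving door top–10 is an assumption based on the paragraph above.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


# Top-10 memberships as described in this post (ranks are not needed here).
# Assumption: ABN Amro is in both lists, PGGM only in the revolving door list.
access_top10 = {
    "Air France-KLM", "Rabobank", "Shell", "ING Bank", "ABN Amro",
    "Schiphol", "Aegon", "KPN", "SNS Reaal", "KPMG",
}
revolving_top10 = {
    "Air France-KLM", "Rabobank", "Shell", "ING Bank", "ABN Amro",
    "Aegon", "KPMG", "NS", "Delta Lloyd", "PGGM",
}

overlap = access_top10 & revolving_top10
index = jaccard(access_top10, revolving_top10)
print(sorted(overlap))
print(round(index, 2))
```

With seven companies on both lists and thirteen companies in total, this gives 7/13, which rounds to 0.54.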

Of course, these lists provide no basis for firm conclusions about cause and effect. However, one can imagine that companies that participate actively in the revolving door could have easier access to ministers.

The details of the Volkskrant investigation can be found in this visualisation, which unfortunately isn’t easily searchable. The underlying data are available here as csv. If you’d classify NS and PGGM as companies in the Volkskrant list, the overlap wouldn’t change because other companies would drop out of the top–10. Further, for comparability I’ve removed industry and lobby organisations such as employers’ organisation VNO-NCW from the access to ministers ranking. Alphabetical order was used where two companies have the same score.


Coursera Data Analysis and Interpretation

I was initially introduced to R by Nathan Yau’s Visualize This, but subsequently I learned a lot about R through some of the courses in Brian Caffo, Roger Peng and Jeff Leek’s Data Science Specialization at Coursera. In fact, the course was a reason for me to postpone switching from R to Python.

By now, I’ve decided to make the switch anyhow, and I think I’ve found another Coursera specialisation that will help me learn the tricks: Lisa Dierker and Jen Rose’s Data Analysis and Interpretation. It’s kind of basic, at least at the beginning, but that’s good. Some of the assignments require you to blog about a project of your choosing, so I’ll be posting about my homework here.


Can mistyped urls deliver representative samples?

An article on the Washington Post’s Monkey Cage blog describes how researchers managed to carry out opinion polls on executions in Bahrain, «one of the most difficult countries in the region for such sensitive research». In order to overcome the difficulties encountered, they ran two ‘innovative surveys’ in partnership with research company RIWI.

RIWI takes advantage of the fact that people sometimes make mistakes when they type a url in the address bar of their browser. If the url they mistakenly go to happens to be controlled by RIWI, they are redirected to a short questionnaire. RIWI claims this is a cheap way to obtain a non-biased sample.

This sounds like a smart approach that might actually work. But does it? Some people have doubts, such as one of the commenters on the Monkey Cage post:

Innovative is certainly one way to describe it. How can you possibly consider Internet typo redirects as a nationally representative sample? Would be very curious to see what the raw demographics look like compared to the population. Hope there was some sophisticated weighting used.

In a recent article in Nature, RIWI founder and CEO Neil Seeman explains his method. In a comment, one Charles Packer observes:

There are no citations here of publications that assess the validity of the company’s claims. Same for the corporate website: no discussion of the mechanics of its methodology.

When I searched the company on Google, I found a lot of articles aimed at investors and very few discussing their research methods. The most detailed description of their methodology I found is in Seeman’s patent application. It explains that, for example, «Google could harvest the many thousands of users who inadvertently type in instead of and direct them to an online polling page, instead of simply redirecting them to the web site».

The main type of typo RIWI uses seems to be typing .cm, .co or .om instead of .com. RIWI uses the respondents’ IP addresses to guess their location. In his patent application, Seeman claims that his approach is successful in reducing bias:

Under the invention, every individual Internet user around the globe has the equal probability of being drawn into the potential respondent pool. This dramatically reduces selection bias and coverage bias as compared to all other current techniques of respondent identification and selection online. There is no reason to believe that the people who fail to randomly fall into the potential survey population (i.e., who do not make the typographical error) have distinct characteristics from the people who do, thus increasing the validity of the results. This makes the process of respondent selection scientifically valid, superior even to random digit telephone dialing.

Is that true? While the claims sound plausible, it’s still conceivable that bias occurs, for example through the selection of urls RIWI uses, because people who tend to make typos may be different from people who don’t, or because people who directly type urls into the address bar of their browser may be different from people who prefer to google for sites.
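To illustrate how the last kind of bias could arise, here is a minimal Python sketch with entirely made-up numbers (the group shares, response probabilities and opinion rates are hypothetical and not taken from RIWI or any study). If people who type urls directly are more likely to land on a survey page and also hold different opinions, the expected sample result drifts away from the population value.

```python
def population_value(groups):
    """True share of 'yes' in the population.

    groups: list of (population_share, selection_prob, yes_rate) tuples.
    """
    return sum(share * yes for share, _, yes in groups)


def expected_sample_value(groups):
    """Expected share of 'yes' among the people who end up in the sample."""
    sampled = sum(share * prob for share, prob, _ in groups)
    sampled_yes = sum(share * prob * yes for share, prob, yes in groups)
    return sampled_yes / sampled


# Hypothetical groups: (share of population, chance of landing on the
# survey via a typo, share answering 'yes').
groups = [
    (0.5, 0.020, 0.30),  # people who type urls into the address bar
    (0.5, 0.005, 0.60),  # people who mostly search for sites
]

print(population_value(groups))       # population: about 0.45
print(expected_sample_value(groups))  # sample: about 0.36
```

If both groups had the same chance of landing on the survey page, the two values would coincide, which is the situation the patent application assumes.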

It has been claimed that RIWI has predicted election results in Egypt and Turkey more accurately than other firms. That sounds promising, but it would be helpful to know how many election outcomes RIWI has predicted and how accurate all of these predictions were. RIWI also refers to a validation study of one of their US samples, but the original study seems to have been removed from their website. The website’s FAQ says ‘third party and academic review’ is available, but only on request: «Yes, but please contact us first so we can get a sense of your needs and most applicable information to send you».

It’s quite possible that RIWI’s approach is superior to the survey panels used by other firms, but more openness about their methodology and results would make their case more convincing.


Base versus ggplot2

Yesterday, stats guru Jeff Leek confessed the ultimate unpopular opinion in data science: «I don’t use ggplot2 and I get nervous when other people do» (if you haven’t a clue what this is about, you may want to skip this post altogether). His confession met with ridicule, more ridicule, and an occasional «oh my god I thought I was the only one!».

I sort of assumed everybody uses ggplot now. I was wrong: I like base for graphics, is that weird? * Good reference for base graphics! (if, like me, you hate ggplot). * base graphics FTW re:slopegraph (it’s a royal pain to do this in ggplot). * I’m not a fan of ggplot. * Retro! * I kinda hate ggplot. * Vigorous group discussion on the merits of base plot vs #ggplot in #rstats.

For me, it’s six of one and half a dozen of the other: I’m planning to switch to Python.