On 18 September, Scotland may vote for independence. My understanding is that the referendum isn’t necessarily about kilts and haggis, but about left-leaning Scots who are fed up with London’s neoliberal policies, which have caused, among other things, a widening gap between the rich and the rest of society. In fact, the Scottish referendum has been called the «world’s first vote on economic inequality».
One way in which inequality manifests itself is geographically. An interesting question is whether income and political power coincide. In some countries, such as Germany and the Netherlands, the seat of government is in a region with a GDP comparable to the rest of the country. More often, governments are located in high-income regions. For example, France’s richest region is Hauts-de-Seine (home to business district la Défense), followed closely by Paris itself. Both have a GDP per inhabitant almost three times the national figure.
But the widest gap is found in the UK. Across Europe, only three out of 1357 regions have a GDP per inhabitant that is more than three times their national figure. For Polish boomtown Warsaw, the ratio is just above 3. For the German region of Wolfsburg, where VW has its headquarters, it is 3.4. But the list is headed by the UK, where the «Inner London - West» region has a GDP per inhabitant as much as 5.8 times the national figure.
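The comparison above boils down to dividing regional GDP per inhabitant by the national figure and flagging the outliers. A minimal sketch, with made-up numbers chosen only to reproduce the ratios mentioned (the real analysis uses the Eurostat NUTS 3 tables):

```python
def high_gdp_regions(regions, national, threshold=3.0):
    """Return (region, ratio) pairs where regional GDP per inhabitant
    exceeds `threshold` times the national figure, highest first."""
    out = []
    for r in regions:
        ratio = r["gdp_per_inhabitant"] / national[r["country"]]
        if ratio > threshold:
            out.append((r["name"], round(ratio, 1)))
    return sorted(out, key=lambda x: -x[1])

# Illustrative figures only, picked to match the ratios in the text.
regions = [
    {"name": "Inner London - West", "country": "UK", "gdp_per_inhabitant": 152000},
    {"name": "Warsaw", "country": "PL", "gdp_per_inhabitant": 30000},
    {"name": "Average UK region", "country": "UK", "gdp_per_inhabitant": 27000},
]
national = {"UK": 26000, "PL": 9900}
print(high_gdp_regions(regions, national))
```

With the real Eurostat data, the same loop over all 1357 regions yields the three outliers discussed above.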
All in all, Scots who are dissatisfied with the distribution of income in the UK clearly have a point. Should the No camp find itself looking for someone to blame on 19 September, then perhaps Ms. Thatcher might qualify.
Map of all of Europe here.
I used Eurostat data on gross domestic product per inhabitant by NUTS 3 regions in 2011. NUTS 3 are the smallest regions used by Eurostat and have populations ranging from 150,000 to 800,000. 2011 is the most recent year for which data are available. The map is from EuroGeographics. The R code for the analysis is available here.
Of course, comparing regional GDP to national GDP is just one way of measuring inequality; other measures may produce somewhat different outcomes. It would be interesting to use wealth rather than income data, but I doubt that wealth data are available for regions.
The Dutch minister of the interior, Ronald Plasterk, has asked the Bureau for Economic Policy Analysis (CPB) to evaluate the declining turnout in local elections. This is an important issue, given how inequality and low turnout are related.
More specifically, Plasterk would like to know: first, if turnout is correlated to population size and second, what effect do municipal mergers have on turnout (one suspects a lobby of local governments opposed to mergers behind these questions).
As for the first question, that’s an easy one: yes. Smaller cities tend to have higher turnout. I looked it up, and the correlation is actually pretty strong, if declining: 0.62 in 2002, 0.59 in 2006 and 0.50 in 2010 (somehow I couldn’t download data from the Kiesraad website, so I used data I had downloaded some time ago, which doesn’t include 2014 yet). I don’t think political scientists will be shocked by these outcomes.
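For anyone who wants to replicate this on the Kiesraad data, the computation is an ordinary Pearson correlation between municipal population and turnout. A small self-contained sketch with invented toy figures (note that with population coded this way, «smaller cities, higher turnout» shows up as a negative coefficient):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data: population vs. turnout share for five fictitious municipalities.
pop = [5000, 20000, 50000, 100000, 500000]
turnout = [0.72, 0.68, 0.61, 0.58, 0.50]
print(pearson(pop, turnout))  # strongly negative: bigger city, lower turnout
```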
More interesting is what kind of recommendations the CPB will come up with. Somehow I don’t think they’ll recommend cutting up large municipalities. Perhaps they should consider recommending a reintroduction of compulsory voting.
In Nickel and Dimed, her book on going undercover in low-wage America, Barbara Ehrenreich describes how not owning a car is one of the many factors making it difficult for low-paid workers to find better jobs. «Some of my co-workers, in Minneapolis as well as Key West, rode bikes to work, and this clearly limited their geographical range», she adds.
I was reminded of Ehrenreich’s book when I read a blog post by Michael Andersen. He argues that Denmark’s good quality bicycle infrastructure has contributed to the country’s egalitarian nature by making it easier to escape poverty. Danes with low incomes make a high share of their trips by bicycle. Rich Danes cycle too, but make far more trips by car.
In the comments to the blog post there’s a suggestion that in Amsterdam, it’s mainly the wealthy who ride bicycles. I couldn’t find recent data for Amsterdam, but geographical patterns may play a role. In the central area, where density is high and where the high-income districts Zuid and Centrum are located, people cycle more. In the peripheral districts, where distances to shops and other facilities tend to be longer, fewer trips are made by bicycle. Some of the poorest neighbourhoods are located there.
Statistics Netherlands (CBS) has data for the entire country, as well as for the cities with the highest address density (addresses per unit of surface area). These include Amsterdam, Rotterdam, the Hague, Utrecht and a number of smaller cities. The main conclusions:
- Like in Denmark, cycling infrastructure benefits all kinds of people, but low-income people even more so;
- In high-density cities, not just the lowest income groups, but also the richest are more likely to take advantage of cycling infrastructure.
Incidentally, this doesn’t mean that cyclists get the space they should get. In a recent opinion article in NRC Handelsblad, writer Fred Feddes says that bicycle lanes make up 11% of public space in Amsterdam’s inner city, while parked cars probably take up far more.
Update 20 August - Someone at the Fietsersbond dug up this (pdf) publication of the Amsterdam Municipality from 2010 which compares the mobility of Amsterdammers over the period 1986-1991 to 2005-2008. It suggests that cycling patterns in Amsterdam may in fact differ from the general pattern in high-density cities, with more cycling among high-income residents (as suggested by the commenter quoted above):
As for the development per income class, it turns out there are substantial differences. Among high-income residents the share of cycling in the total number of trips has more than doubled (from 15% to 33%), whereas the growth is only modest among low-income residents (from 26% to 33%). This means that relatively speaking, wealthy Amsterdammers today cycle more than low-income residents.
In an intriguing opinion article in Thursday’s NRC Handelsblad, an author named Fred Feddes suggests banning parked cars from Amsterdam’s city centre. He argues that the current 15,000 parking spaces in the inner city take up 18ha, amounting to as much as 40% of the 45ha public space.
Sure, parked cars use lots of space, but 40%? Apparently, I wasn’t the only one to find that figure incredible. Council member Zeeger Ernsting tweeted:
As much as I endorse the viewpoint, the figure of 40% parking can’t possibly be right. But indeed, cars [are] still far too dominant
I couldn’t immediately trace Feddes’ source and I’m sure there will be more debate on the issue. For now, here’s a quick and dirty calculation:
- According to this (pdf) document of the Centrum district, «traffic areas» and green areas amount to 86ha. That’s more than Feddes’ 45ha, although I think the green areas may include some non-public space.
- The district’s open data site has data on parking spaces (dating from 2010). All types combined, there were some 16,000 of them, slightly more than Feddes’ estimate.
- Assuming that one parking space takes up 12 to 14m2, this would amount to 19 to 22ha; again slightly more than Feddes’ 18ha.
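The back-of-envelope calculation in the last bullet is easy to check; here it is spelled out (the 12–14 m² per space is the assumption stated above, and 1 ha = 10,000 m²):

```python
spaces = 16_000  # parking spaces in the 2010 open-data set
low = spaces * 12 / 10_000   # 12 m2 per space, in hectares
high = spaces * 14 / 10_000  # 14 m2 per space, in hectares
print(f"{low:.1f} to {high:.1f} ha")                   # 19.2 to 22.4 ha
print(f"{low/45:.0%} to {high/45:.0%} of 45 ha")       # share of public space
```

Against Feddes’ 45 ha of public space, that range works out to roughly 43–50%, which is why his 40% estimate no longer looks incredible.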
Perhaps Ernsting could ask the local government to shed some more light on this issue. Meanwhile, my provisional conclusion is that Feddes’ estimate doesn’t seem as incredible as I initially thought. And even if parked cars use only about 25% of public space, that’s still an enormous amount of space if you think about it.
Update 3 January 2015 - in a new article on the issue, Feddes provides more detail on the data he uses. The 45ha public space refers to «traffic terrain» (verkeersterrein) in 2009. CBS data for 2008 also put that number at 45ha. A more recent table (xlsx) indicates that this has since grown to 58ha. Interestingly, these more recent data also differentiate between types of traffic space. Apparently, railways take up 19ha (and according to this pdf, tram and metro tracks haven’t even been included in that category since 1993), leaving only 40ha for road traffic. On the basis of that number, the share of space dominated by (parked) cars would be even larger. Amazing.
A while ago, Open Culture wrote about a 1955 US Army manual entitled How to spot a communist. According to the manual, communists have a preference for long sentences and tend to use expressions like:
integrative thinking, vanguard, comrade, hootenanny, chauvinism, book-burning, syncretistic faith, bourgeois-nationalism, jingoism, colonialism, hooliganism, ruling class, progressive, demagogy, dialectical, witch-hunt, reactionary, exploitation, oppressive, materialist.
What happened in the 1950s is pretty terrible, but that doesn’t mean we can’t have a bit of fun with the manual. I used the New York Times Article Search API to look up which of its writers actually use terms like hootenanny, book-burning and jingoism. The results are summarised below.
Interestingly, many of the users of «communist» terms are either foreign correspondents or art, music and film critics. While it’s possible that people who have an affinity with the arts tend to sympathise with communism, an alternative explanation would be that critics have more freedom than «regular» journalists to use somewhat exotic and expressive terms like the ones the US Army associated with communism.
Also of interest is that one of the current writers on the list is Ross Douthat, the main conservative columnist of the New York Times. In his articles, he uses terms like materialist, oppressive, reactionary, exploitation, vanguard, ruling class, progressive and chauvinism. Surely he wouldn’t be a reformed communist - would he?
The New York Times Article Search API is a great tool, but you have to keep in mind that digitising the archive isn’t an entirely error-free process. For example, sometimes bits of information end up in the lastname field that don’t belong there (e.g. "lastname": "DURANTYMOSCOW"). While it’s possible to correct some of these issues, it’s likely that search results will in some way be incomplete.
To get a manageable dataset, I looked up all articles containing any combination of two terms from the manual. I then calculated a score for each author by simply counting the number of unique terms they have used.
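The scoring step can be sketched as follows. The term list and articles here are toy stand-ins, not the actual NYT dataset; the real pipeline does the same thing over the API results:

```python
from collections import defaultdict

# Illustrative subset of the manual's terms, not the full list.
TERMS = {"hootenanny", "jingoism", "vanguard", "ruling class"}

def author_scores(articles):
    """articles: iterable of (author, text) pairs. An author's score is the
    number of distinct 'communist' terms they have ever used."""
    used = defaultdict(set)
    for author, text in articles:
        low = text.lower()
        for term in TERMS:
            if term in low:
                used[author].add(term)
    return {author: len(terms) for author, terms in used.items()}

articles = [
    ("A", "A hootenanny of jingoism."),
    ("A", "The vanguard again."),
    ("B", "Notes on the ruling class."),
]
print(author_scores(articles))
```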
An alternative would have been to correct for the total number of articles per author in the NYT archive. It took me a while to figure out how to search by author using the NYT API. It turns out you can search for terms appearing in the byline using ?fq=byline:("firstname middlename lastname"), even though this option isn’t mentioned in the documentation. I’m not entirely sure such a search will return articles where the byline/original field is empty.
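Building such a byline query can be sketched like this. The endpoint URL is the one in use at the time of writing and may change; the fq syntax is the undocumented one described above, and you need your own key from the NYT developer site:

```python
from urllib.parse import urlencode

BASE = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def byline_query(name, api_key):
    """Parameters for an author search using the undocumented fq byline syntax."""
    return {"fq": f'byline:("{name}")', "api-key": api_key}

params = byline_query("Ross Douthat", "MY_KEY")
url = BASE + "?" + urlencode(params)
print(url)
# Fetch with e.g.: requests.get(BASE, params=params).json()
# The hit count is in response["response"]["meta"] of the returned JSON.
```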
As you might expect, there’s a correlation between the number of articles per author and the number of unique terms this author has used.
All in all, it would be possible to calculate a relative score, for example number of terms used per 1,000 articles, but this may have unintended consequences. To take an extreme example: an author who has written one article which happened to contain three terms would get a score of 3,000 using this method, whereas an author who has thousands of articles and consistently uses a broad range of terms but not at a rate of three per article would get a (considerably) lower score.
I decided to stick with the absolute number of unique terms per author. This has the disadvantage that authors who have written few articles are unlikely to show up in the analysis, but I’m not sure that this problem can be adequately solved by calculating a relative score.
The Python and R code used to collect and analyse the data is available on Github.
Website Follow the Money has analysed the «revolving door» between politics and businesses in the Netherlands, adding that the examples discussed are far from exhaustive. I’ve expanded the list of connections between businesses and politics by checking the resumes of close to 700 politicians – government members and members of parliament – who have been active in Dutch politics after 2001.
The list is headed by Rabobank: 32 politicians have (had) a position there. This score can perhaps partly be explained by the fact that Rabobank is a cooperative of local banks, each with its own advisory board, so many people hold positions there. Number two is Royal Dutch Shell, the largest Dutch company (which is, of course, partly British).
From the list, it can be concluded that financial institutions play a central role in the connections between businesses and politics. The phenomenon is not politically neutral: almost three-quarters of the politicians who have (had) positions with the three largest banks are (or have been) affiliated to the conservative parties CDA and VVD.
One of them is former finance minister Gerrit Zalm (VVD). After his political career, he first moved to DSB Bank and then became chairman of the board of ABN Amro (for controversies, see the FTM article as well as this analysis by de Correspondent). Another example is Joop Wijn (CDA) who started at ABN Amro and subsequently served as minister and state secretary at the finance and economic affairs departments. After that, he had a management position at Rabobank and currently he’s on the executive board of ABN Amro.
Financial institutions aside, an interesting case is airline KLM, now part of Air France-KLM, which appears to have played a bit of an emancipatory role. Over the past years, as many as four former KLM stewardesses have obtained a position in national politics: Fransje Roscam Abbing-Bos (VVD, Senate); Gonny van Oudenallen (various parties, Lower House); Ing Yoe Tan (PvdA, Senate) and Kathleen Ferrier (CDA, Lower House).
I’ve created a list of Dutch companies using information from Wikipedia and Elsevier / Bureau van Dijk. I’ve checked these companies against resumes from the (very useful) website Parlement.com. Here’s the Python script I used to download the resumes and to analyse them. The results had to be cleaned up manually. For example, former MP Wijnand Duyvendak, who’s been in charge of the Friends of the Earth Schiphol campaign, should not be counted as having had a position with Schiphol. To be on the safe side, I also didn’t count positions on the pension board or the board of a foundation of a company.
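The matching step in that script amounts to checking each company name against each resume. A minimal sketch of the idea (toy data; as noted above, the raw counts still need manual cleaning, since a resume can mention a company without the politician having held a position there):

```python
def positions_per_company(companies, resumes):
    """resumes: dict mapping politician name -> resume text.
    Count how many politicians' resumes mention each company."""
    return {
        company: sum(company.lower() in cv.lower() for cv in resumes.values())
        for company in companies
    }

resumes = {
    "Zalm": "Minister of finance; later chairman of the board of ABN Amro.",
    "Wijn": "Started at ABN Amro; management position at Rabobank.",
    "Other": "Teacher; member of parliament.",
}
print(positions_per_company(["ABN Amro", "Rabobank"], resumes))
```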
Some websites offer data that you can download as an Excel or CSV file (e.g., Eurostat), or they may offer structured data in the form of an API. Other websites contain useful information, but they don’t provide that information in a convenient form. In such cases, a webscraper may be the tool to extract that information and store it in a format that you can use for further analysis.
If you really want control over your scraper the best option is probably to write it yourself, for example in Python. If you’re not into programming, there are some apps that may help you out. One is Outwit Hub. Below I will provide some step by step examples of how you can use Outwit Hub to scrape websites and export the results as an Excel file.
But first a few remarks:
- Outwit Hub comes in a free and a paid version; the paid version has more options. As far as I can tell, the most important limitation of the free version is that it will only let you extract 100 records per query. In the examples below, I’ll try to stick to functionality available in the free version.
- Information on websites may be copyrighted. Using that information for other purposes than personal use (e.g. publishing it) may be a violation of copyright.
- Webscraping is a messy process. The data you extract may need some cleaning up. More importantly, always do some checks to make sure the scraper is functioning properly. For example, is the number of results you got consistent with what you expected? Check some examples to see if the numbers you get are correct and if they have ended up in the right row and column.
Scraping a single webpage
Sometimes, all the information you’re looking for will be available from one single webpage.
Out of the box, Outwit Hub comes with a number of preset scrapers. These include scrapers for extracting links, tables and lists. In many cases, it makes sense to simply try Outwit Hub’s tables and lists scrapers to see if that will get you the results you want. It will save you some time, and often the results will be cleaner than when you create your own scraper.
Sometimes, however, you will have to create your own scraper. You do so by telling Outwit Hub which chunks of information it should look for. The output will be presented in the form of a table, so think of the information as cases (units of information that should go into one row) and within those cases, the different types of information you want to retrieve about those cases (the information that should go into the different cells within a row).
You tell Outwit Hub what information to look for by defining the «Marker Before» and the «Marker After». For example, you may want to extract the text of a title that is represented as <h1>Chapter One</h1> in the HTML code. In this case, the Marker Before could be <h1> and the Marker After </h1>. This tells Outwit Hub to extract any text between those two markers.
It may take some trial and error to get the markers right. Ideally, they should meet two criteria:
- They should capture all the instances you want included. For example, if some of the titles you want to extract aren’t h1 titles but h2 titles, the <h1> and </h1> markers will give you incomplete results. Perhaps you could use markers that both types of title have in common.
- They should capture as few irrelevant pieces of information as possible. For example, you may find that an interesting piece of information is located between <p> and </p> tags. However, p-tags (used to define paragraphs in a text) may occur a lot on a webpage, and you may end up with a lot of irrelevant results. So you may want to find markers that define more precisely what you’re looking for.
Some French workers have resorted to «bossnapping» as a response to mass layoffs during the crisis. If you’re interested in the phenomenon, you can find some information from a paper on the topic summarized here. From a webscraping perspective, this is pretty straightforward: all the information can be found in one table on a single webpage.
The easiest way to extract the information is to use Outwit Hub’s preset «tables» scraper:
Of course, rather than using the preset table scraper, you may want to try to create your own scraper:
Example: Wikipedia Yellow Jerseys table
If you’re interested in riders who won Yellow Jerseys in the Tour de France, you can find statistics on this Wikipedia page. Again, the information is presented in a single table on a single website.
Again, the easy way is to use Outwit Hub’s «tables» scraper:
And here’s how you create your own scraper:
Example: the Fall band members
Mark E. Smith of the Fall is a brilliant musician, but he does have a reputation for discarding band members. If you want to analyse the Fall band member turnover, you can find the data here. This time, the data is not in a table structure. The webpage does have a list structure, but the list elements are the descriptions of band members, not their names and the years in which they were band members. So Outwit Hub’s «tables» and «lists» scrapers won’t be much help in this case – you’ll have to create your own scraper.
To extract the information:
Navigating through links on a webpage
In the previous examples, all the information could be found on a single webpage. Often, the information will be spread out over a series of webpages. Hopefully, there will also be a page with links to all the pages that contain the relevant information. Let’s call the page with links the index page and the webpages it links to (where the actual information is to be found) the linked pages.
You’ll need a strategy to follow the links on the index page and collect the information from all the linked pages. Here’s how you do it:
- First visit one of the linked pages and create a scraper to retrieve the information you need from that page.
- Return to the index page and tell Outwit Hub to extract all the links from that page.
- Try to filter these links as well as you can to exclude irrelevant links (most webpages contain large numbers of links and most of them are probably irrelevant for your purposes).
- Tell Outwit Hub to apply the scraper (the one you created for one of the linked pages) to all the linked pages.
- Hopefully, all the linked pages have the same structure, but don’t count on it. You’ll need to check if your scraper works properly for all the linked pages.
- In the output window, make sure to set the catch / empty settings correctly because otherwise Outwit Hub will discard the output collected so far before moving to the next linked page.
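The strategy above can be summarised in code. In this sketch, `fetch` and `scrape` are placeholders you supply yourself; here a dictionary of in-memory pages stands in for real HTTP requests, so the example runs without a network:

```python
import re
from urllib.parse import urljoin

def scrape_linked_pages(index_html, base_url, fetch, scrape, link_filter):
    """Collect links from an index page, keep those passing `link_filter`,
    fetch each linked page and apply `scrape` to it."""
    links = re.findall(r'href="([^"]+)"', index_html)
    results = []
    for link in links:
        if link_filter(link):
            results.append(scrape(fetch(urljoin(base_url, link))))
    return results

# Fake index page and linked pages, standing in for real fetches.
index_html = '<a href="/q/1">one</a> <a href="/about">about</a> <a href="/q/2">two</a>'
pages = {
    "http://example.org/q/1": "<h1>First</h1>",
    "http://example.org/q/2": "<h1>Second</h1>",
}
results = scrape_linked_pages(
    index_html, "http://example.org/",
    fetch=pages.get,
    scrape=lambda html: re.findall(r"<h1>(.*?)</h1>", html)[0],
    link_filter=lambda link: link.startswith("/q/"),  # step 3: drop irrelevant links
)
print(results)
```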
Example: Tour de France 2013 stages
We’ll return to the Tour de France Yellow Jersey, but this time we’ll look in more detail into the stages of the 2013 edition. Information can be found on the official webpage of le Tour.
Navigating through multiple pages with links
Same as above, but now the links to the linked pages are not to be found on a single index page, but a series of index pages.
First create a web scraper for one of the linked pages, then collect the links from the index page so you can tell Outwit Hub to apply your scraper to all the linked pages. However, you’ll need one more step before you can tell Outwit Hub to apply the scraper: you’ll need to collect the links from all the index pages, not just the first one. In many cases, Outwit Hub will be able to find out by itself how to move through all the index pages.
Example: Proceedings of Parliament
Suppose you want to analyse how critically Dutch Members of Parliament have been following the Dutch intelligence service AIVD over the past 15 years or so. You can search the questions they have asked with a search query like this, which gives you 206 results, and their urls can be found on a series of 21 index pages (perhaps new questions have been asked since, in which case you’ll get a higher number of results). So the challenge is to create a scraper for one of the linked pages and then get Outwit Hub to apply this scraper to all the links from all 21 index pages.
- Paul Bradshaw - Scraping for Journalists (from basic to complex, uses Outwit Hub; Scraperwiki a.o.).
- Hartley Brody - The Ultimate Guide to Web Scraping (general background; examples use Python).
- Outwit Hub also has some nice tutorials – you can run them by copying their URL into the address bar of Outwit Hub.
- Codecademy (if you want to learn Python).
Don’t ask me why, but Oranjeverenigingen (Orange Associations - most focus on organising festivities on King’s Day) seem to be struggling with the new transparency rules of the tax authority.
Recently, new rules have been introduced for organisations that want to receive tax-exempt donations. Among other things, they must have a website and publish the compensation their board members receive. As a consequence of these new rules, over two thousand organisations have had their «anbi status» withdrawn, broadcaster NOS reported.
The tax authority has published a dataset on organisations that have, or used to have, the anbi status. Oranjeverenigingen appear to have been especially affected: six percent of all organisations had their anbi status withdrawn, but this happened to 75% of organisations with «oranje» in their name. Obviously, it’s a bit risky to draw conclusions from this as long as the explanation for the phenomenon is unclear.
Data from the tax authority are here, and here’s the R script I analysed the data with. I also checked this for other terms that occur frequently (organisations with the Dutch word for «first aid», «christian», «jehova», «education», «amsterdam», «third world aid shop» or «museum» in their names), but they don’t show the same pattern.
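The comparison itself is just a filtered share of withdrawals. A sketch of the logic in Python (my own analysis used R; the organisations and statuses below are invented to illustrate the calculation):

```python
def withdrawal_rate(orgs, keyword=None):
    """orgs: list of (name, status) pairs, status 'active' or 'withdrawn'.
    Share of withdrawn statuses, optionally restricted to names
    containing `keyword` (case-insensitive)."""
    selected = [status for name, status in orgs
                if keyword is None or keyword in name.lower()]
    return sum(status == "withdrawn" for status in selected) / len(selected)

# Invented toy data, not the actual tax authority dataset.
orgs = [
    ("Oranjevereniging A", "withdrawn"),
    ("Oranjevereniging B", "withdrawn"),
    ("Oranjecomite C", "withdrawn"),
    ("Oranjevereniging D", "active"),
    ("Museum X", "active"),
    ("EHBO Y", "active"),
    ("Stichting Z", "withdrawn"),
]
print(withdrawal_rate(orgs))            # overall share
print(withdrawal_rate(orgs, "oranje"))  # share among «oranje» organisations
```

Running the same comparison for the other frequent terms («museum», «christelijk» and so on) is a matter of changing the keyword.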