Salonanarchist | Leunstoelactivist

Scraping websites with Outwit Hub: Step by step tutorial

Some websites offer data that you can download as an Excel or CSV file (e.g., Eurostat), or they may offer structured data in the form of an API. Other websites contain useful information, but they don’t provide that information in a convenient form. In such cases, a webscraper may be the tool to extract that information and store it in a format that you can use for further analysis.

If you really want control over your scraper the best option is probably to write it yourself, for example in Python. If you’re not into programming, there are some apps that may help you out. One is Outwit Hub. Below I will provide some step by step examples of how you can use Outwit Hub to scrape websites and export the results as an Excel file.

But first a few remarks:

  • Outwit Hub comes in a free and a paid version; the paid version has more options. As far as I can tell, the most important limitation of the free version is that it will only let you extract 100 records per query. In the examples below, I’ll try to stick to functionality available in the free version.
  • Information on websites may be copyrighted. Using that information for purposes other than personal use (e.g. publishing it) may be a violation of copyright.
  • Webscraping is a messy process. The data you extract may need some cleaning up. More importantly, always do some checks to make sure the scraper is functioning properly. For example, is the number of results you got consistent with what you expected? Check some examples to see if the numbers you get are correct and if they have ended up in the right row and column.
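The checks in the last remark can be partly automated once the results are exported. Below is a minimal Python sketch, not a feature of Outwit Hub: the function name `check_rows` and the sample rows are made up for illustration, and it assumes the scraped data has been loaded as a list of rows.

```python
def check_rows(rows, expected_count, expected_columns):
    """Return a list of human-readable problems found in scraped rows."""
    problems = []
    # Is the number of results consistent with what we expected?
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Did every value end up in a cell, and are there as many cells as columns?
        if len(row) != expected_columns:
            problems.append(f"row {i} has {len(row)} cells, expected {expected_columns}")
        elif any(cell.strip() == "" for cell in row):
            problems.append(f"row {i} contains an empty cell")
    return problems


# Made-up rows: the second has an empty cell, the third is missing a cell.
rows = [["Rider A", "1975"], ["Rider B", ""], ["Rider C"]]
problems = check_rows(rows, expected_count=3, expected_columns=2)
for problem in problems:
    print(problem)
```

Automated checks like these don’t replace looking at a few examples by hand, but they catch the most obvious slips.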
The Outwit Hub app can be downloaded here (it’s also available as a Firefox plugin, but last time I checked it wasn’t compatible with the newest version of Firefox).

Scraping a single webpage

Sometimes, all the information you’re looking for will be available from one single webpage.

Strategy

Out of the box, Outwit Hub comes with a number of preset scrapers. These include scrapers for extracting links, tables and lists. In many cases, it makes sense to simply try Outwit Hub’s tables and lists scrapers to see if that will get you the results you want. It will save you some time, and often the results will be cleaner than when you create your own scraper.

Sometimes, however, you will have to create your own scraper. You do so by telling Outwit Hub which chunks of information it should look for. The output will be presented in the form of a table, so think of the information as cases (units of information that should go into one row) and within those cases, the different types of information you want to retrieve about those cases (the information that should go into the different cells within a row).

You tell Outwit Hub what information to look for by defining the «Marker Before» and the «Marker After». For example, you may want to extract the text of a title that is represented as <h1>Chapter One</h1> in the html code. In this case the Marker Before could be <h1> and the Marker After could be </h1>. This would tell Outwit Hub to extract any text between those two markers.
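If you’re curious what such marker-based extraction amounts to, here is a minimal Python sketch of the idea. It is not how Outwit Hub is implemented; the function name `extract_between` and the sample page are assumptions for illustration only.

```python
def extract_between(html, marker_before, marker_after):
    """Return every piece of text found between the two markers."""
    results = []
    pos = 0
    while True:
        start = html.find(marker_before, pos)
        if start == -1:
            break  # no further Marker Before
        start += len(marker_before)
        end = html.find(marker_after, start)
        if end == -1:
            break  # Marker Before without a matching Marker After
        results.append(html[start:end].strip())
        pos = end + len(marker_after)
    return results


# Made-up page with two h1 titles and one paragraph.
page = "<h1>Chapter One</h1><p>Some text</p><h1>Chapter Two</h1>"
titles = extract_between(page, "<h1>", "</h1>")
print(titles)
```

With these markers only the two titles are returned; the paragraph is ignored because it isn’t enclosed by the markers.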

It may take some trial and error to get the markers right. Ideally, they should meet two criteria:

  • They should capture all the instances you want included. For example, if some of the titles you want to extract aren’t h1 titles but h2 titles, the <h1> and </h1> markers will give you incomplete results. Perhaps you could use <h and </h as markers.
  • They should capture as few irrelevant pieces of information as possible. For example, you may find that an interesting piece of information is located between <p> and </p> tags. However, p-tags (used to define paragraphs in a text) may occur a lot on a webpage and you may end up with a lot of irrelevant results. So you may want to try to find markers that more precisely define what you’re looking for.

Example: Bossnappings

Some French workers have resorted to «bossnapping» as a response to mass layoffs during the crisis. If you’re interested in the phenomenon, you can find some information from a paper on the topic summarized here. From a webscraping perspective, this is pretty straightforward: all the information can be found in one table on a single webpage.

The easiest way to extract the information is to use Outwit Hub’s preset «tables» scraper:

Of course, rather than using the preset table scraper, you may want to try to create your own scraper:

Example: Wikipedia Yellow Jerseys table

If you’re interested in riders who won Yellow Jerseys in the Tour de France, you can find statistics on this Wikipedia page. Again, the information is presented in a single table on a single website.

Again, the easy way is to use Outwit Hub’s «tables» scraper:

And here’s how you create your own scraper:

Example: the Fall band members

Mark E. Smith of the Fall is a brilliant musician, but he does have a reputation for discarding band members. If you want to analyse the Fall band member turnover, you can find the data here. This time, the data is not in a table structure. The webpage does have a list structure, but the list elements are the descriptions of band members, not their names and the years in which they were band members. So Outwit Hub’s «tables» and «lists» scrapers won’t be much help in this case – you’ll have to create your own scraper.

To extract the information:

Navigating through links on a webpage

In the previous examples, all the information could be found on a single webpage. Often, the information will be spread out over a series of webpages. Hopefully, there will also be a page with links to all the pages that contain the relevant information. Let’s call the page with links the index page and the webpages it links to (where the actual information is to be found) the linked pages.

Strategy

You’ll need a strategy to follow the links on the index page and collect the information from all the linked pages. Here’s how you do it:

  • First visit one of the linked pages and create a scraper to retrieve the information you need from that page.
  • Return to the index page and tell Outwit Hub to extract all the links from that page.
  • Try to filter these links as well as you can to exclude irrelevant links (most webpages contain large numbers of links and most of them are probably irrelevant for your purposes).
  • Tell Outwit Hub to apply the scraper (the one you created for one of the linked pages) to all the linked pages.

Two remarks:

  • Hopefully, all the linked pages have the same structure, but don’t count on it. You’ll need to check if your scraper works properly for all the linked pages.
  • In the output window, make sure to set the catch / empty settings correctly because otherwise Outwit Hub will discard the output collected so far before moving to the next linked page.
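For those who would rather write the scraper themselves, the same strategy can be sketched in Python. This is an illustration under stated assumptions, not a description of Outwit Hub’s internals: the mini-site in `SITE`, the function names and the link filter are all hypothetical, and the pages are held in memory instead of being downloaded.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical mini-site held in memory; a real scraper would download pages.
SITE = {
    "/index": '<a href="/about">About</a><a href="/stage/1">1</a><a href="/stage/2">2</a>',
    "/stage/1": "<h1>Stage 1</h1>",
    "/stage/2": "<h1>Stage 2</h1>",
    "/about": "<h1>About us</h1>",
}


def scrape_title(html):
    """Toy scraper for a linked page: text between <h1> and </h1>, or None."""
    start = html.find("<h1>")
    end = html.find("</h1>", start)
    if start == -1 or end == -1:
        return None
    return html[start + len("<h1>"):end]


def scrape_linked_pages(index_url, keep):
    """Extract links from the index page, filter them, scrape each linked page."""
    collector = LinkCollector()
    collector.feed(SITE[index_url])
    relevant = [link for link in collector.links if keep(link)]
    # Results are accumulated in one list, so nothing is discarded between pages.
    return [(link, scrape_title(SITE[link])) for link in relevant]


results = scrape_linked_pages("/index", keep=lambda link: link.startswith("/stage/"))
print(results)
```

The `keep` filter plays the role of the link-filtering step: the «About» link is on the index page but is never visited.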

Example: Tour de France 2013 stages

We’ll return to the Tour de France Yellow Jersey, but this time we’ll look in more detail into the stages of the 2013 edition. Information can be found on the official webpage of le Tour.

Navigating through multiple pages with links

Same as above, but now the links to the linked pages are not to be found on a single index page, but a series of index pages.

Strategy

First create a web scraper for one of the linked pages, then collect the links from the index page so you can tell Outwit Hub to apply your scraper to all the linked pages. However, you’ll need one more step before you can tell Outwit Hub to apply the scraper: you’ll need to collect the links from all the index pages, not just the first one. In many cases, Outwit Hub will be able to find out by itself how to move through all the index pages.
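The extra step, collecting links from a whole series of index pages, could look roughly like this in Python. Again a sketch under assumptions: the pages in `INDEX_PAGES`, the `/doc/` link pattern and the `class="next"` link are invented for illustration, and real sites mark their «next page» links in many different ways.

```python
import re

# Hypothetical index pages held in memory; each one links to result pages
# and, except for the last, to the next index page.
INDEX_PAGES = {
    "/search?page=1": '<a href="/doc/a">A</a><a href="/doc/b">B</a>'
                      '<a class="next" href="/search?page=2">Next</a>',
    "/search?page=2": '<a href="/doc/c">C</a>'
                      '<a class="next" href="/search?page=3">Next</a>',
    "/search?page=3": '<a href="/doc/d">D</a>',
}


def collect_links(first_index_url):
    """Walk through the index pages and gather the links to the linked pages."""
    links = []
    seen = set()
    url = first_index_url
    while url and url not in seen:  # 'seen' guards against endless loops
        seen.add(url)
        html = INDEX_PAGES[url]
        links.extend(re.findall(r'<a href="(/doc/[^"]+)"', html))
        next_link = re.search(r'<a class="next" href="([^"]+)"', html)
        url = next_link.group(1) if next_link else None
    return links


all_links = collect_links("/search?page=1")
print(all_links)
```

Once all the links are collected, the scraper for the linked pages can be applied to each of them, as in the single index page case.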

Example: Proceedings of Parliament

Suppose you want to analyse how critically Dutch Members of Parliament have been following the Dutch intelligence service AIVD over the past 15 years or so. You can search the questions they have asked with a search query like this, which gives you 206 results, and their urls can be found on a series of 21 index pages (perhaps new questions have been asked since, in which case you’ll get a higher number of results). So the challenge is to create a scraper for one of the linked pages and then get Outwit Hub to apply this scraper to all the links from all 21 index pages.

King’s Day associations lose tax exempt status

Don’t ask me why, but Oranjeverenigingen (Orange Associations - most focus on organising festivities on King’s Day) seem to be struggling with the new transparency rules of the tax authority.

Recently, new rules have been introduced for organisations that want to receive tax-exempt donations. Among other things, they must have a website and publish the compensation their board members receive. As a consequence of these new rules, over two thousand organisations have had their «anbi status» withdrawn, broadcaster NOS reported.

The tax authority has published a dataset on organisations that have or used to have the anbi status. It appears that especially Oranjeverenigingen have been affected. Six percent of all organisations had their anbi status withdrawn, but this happened to 75% of organisations with «oranje» in their name. Obviously, it’s a bit risky to draw conclusions from this as long as the explanation of the phenomenon is unclear.

Method

Data from the tax authority are here, and here’s the R script I analysed the data with. I also checked this for other terms that occur frequently (organisations with the Dutch word for «first aid», «christian», «jehova», «education», «amsterdam», «third world aid shop» or «museum» in their names), but they don’t show the same pattern.
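The comparison itself boils down to computing the share of withdrawn statuses for all organisations and for those with a given term in their name. The Python sketch below is not the R script mentioned above; the function name, the record layout and the organisation names are made up, and the sample is constructed so that the «oranje» share is 75%, purely to show the calculation.

```python
def withdrawal_share(organisations, keyword=None):
    """Share of organisations whose status was withdrawn, optionally
    restricted to names containing the keyword (case-insensitive)."""
    if keyword is not None:
        organisations = [o for o in organisations
                         if keyword.lower() in o["name"].lower()]
    if not organisations:
        return None
    withdrawn = sum(1 for o in organisations if o["withdrawn"])
    return withdrawn / len(organisations)


# Made-up records, NOT taken from the tax authority's dataset.
sample = [
    {"name": "Oranjevereniging Voorbeeld", "withdrawn": True},
    {"name": "Oranje Comité Fictief", "withdrawn": True},
    {"name": "Stichting Oranje Test", "withdrawn": True},
    {"name": "Oranjevereniging Demo", "withdrawn": False},
    {"name": "Museum Voorbeeld", "withdrawn": False},
    {"name": "Stichting Onderwijs Test", "withdrawn": False},
    {"name": "EHBO Fictief", "withdrawn": False},
    {"name": "Wereldwinkel Demo", "withdrawn": False},
]

overall = withdrawal_share(sample)
oranje = withdrawal_share(sample, "oranje")
museum = withdrawal_share(sample, "museum")
print(overall, oranje, museum)
```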

Decline in cycling in the Netherlands?

Using new data from Statistics Netherlands (CBS), cycling expertise centre Fietsberaad reports that cycling has declined in the Netherlands over the past three years, both in terms of the distance traveled and the number of trips per person per day. The chart to the left is from their website.

Fietsberaad does warn against reading too much into this: there have been changes in how the data are collected and analysed, and the weather may have caused short-term fluctuations in cycling (meteorological institute KNMI reports that there were 46 days with minimum temperatures below 0°C in 2011; 50 in 2012 and 64 in 2013). Keeping all this in mind, it’s still interesting to note that the same period saw an increase in cycling in the four largest cities.

Be that as it may, the chart created by Fietsberaad does look worrisome. But what does it actually show? There are no values on the y-axis. Does the y-axis even start at zero? Apparently it doesn't, for otherwise the chart would have looked more like the one below. Which looks slightly less dramatic.

Belkin quits. How loyal are sponsors of cycling teams?

Last year, Belkin became the title sponsor of the former Rabobank cycling team, but today it announced that it will end its sponsorship by the end of the year. Various commentators have expressed concern over the lack of continuity in sponsoring. Which raises the question: is it normal for a sponsor to quit after such a short period? And is this becoming worse?

Some sponsors leave after one or two years, while others remain loyal for ten years or more (Française des Jeux, Lampre, Lotto, Quick Step).

The graph above shows the sponsor turnover of UCI Pro Tour teams (the share of sponsors that would quit the subsequent year). Turnover is about 25%, which suggests that a normal sponsorship duration should be about four years. So Belkin’s loyalty is not impressive by those standards.
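The step from 25% turnover to four years can be made explicit. If you assume, as a simplification, that every sponsor has the same chance of quitting each year regardless of how long it has been around, then the expected duration is one divided by the annual turnover. A small Python sketch of that arithmetic (the function names are just for illustration):

```python
def expected_duration(annual_turnover):
    """Expected sponsorship length in years, assuming a constant
    yearly probability of quitting (a simplifying assumption)."""
    if not 0 < annual_turnover <= 1:
        raise ValueError("turnover must be between 0 (exclusive) and 1")
    return 1 / annual_turnover


def share_remaining(annual_turnover, years):
    """Share of sponsors still active after the given number of years."""
    return (1 - annual_turnover) ** years


years = expected_duration(0.25)
still_there = share_remaining(0.25, 2)
print(years, still_there)
```

Under the same assumption, a bit more than half of all sponsors would still be around after two years.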

While the sponsorship duration fluctuates, there doesn’t appear to be a trend of sponsors becoming more or less loyal.

Method

I retrieved sponsor names from team names of UCI Pro Tour teams listed by Cycling News. Due to variations in spelling (Française des Jeux, FDJ, FDJ.fr), the data needed some cleaning up. If you want to check them: here’s a list of sponsors and the years in which I think they were active.

Cyclists should have priority here

Some crossings make you wonder: isn’t it weird that cyclists don’t have priority here? This occurs in Amsterdam, but more often in the countryside. There are different variants, but often there’s a bend in the cycle path just before a crossing. The cycle path is no longer part of the main road and cyclists are confronted with give way road markings. You have to give way to everybody: motorists coming from behind who turn right, oncoming traffic turning left and traffic from the right.

Often, you have to give way to rather secondary roads. For example, the exit to a tiny car park along the Oostvaardersdijk in Almere (photo above). Or the entrance of a government building at the Amsterdamseweg in Velsen-Zuid, where motorists who get priority subsequently have to stop at a gate anyway.

As a cyclist, you end up with a tricky crossing. You have to pay attention to traffic from behind, oncoming traffic and traffic from the right. The sense of insecurity mixes with indignation at the fact that apparently, people have specifically diverted the cycle path just to rob cyclists of their priority. Why are they doing this?

I put this question – in somewhat more neutral terms – to a number of road maintenance authorities, with illustrations from Velsen-Zuid, Watergang, Monnickendam, Weesp, Almere and Muiden. Their answers reveal that there are two reasons for bending cycle paths. First, this creates a space for motorists coming from the right where they can wait before entering or crossing the main road (this is a reason for bending the cycle path, but in itself not a reason to rob cyclists of their priority). Second, it’s about bicycle safety. In the words of the spokesperson of the Province of Noord-Holland:

For reasons of bicycle safety, we at the province often choose not to let cyclists have priority, especially outside the built-up area. It’s the same thing as with roundabouts: you may have priority as a cyclist, but whether you’ll be given priority is a different matter. And with roundabouts, it’s been shown that cyclists who have priority are more often involved in accidents, simply because they’re not given priority.

It’s good to know that the safety of cyclists is high on the agenda. But bending the cycle path and robbing cyclists of their priority – I’m not convinced that’s the right solution. In fact, it’s a bit twisted to reward motorists for not paying attention to cyclists who have priority. There have to be better ways to make them pay attention to cyclists and to slow them down.

As I said, such situations occur mainly in the countryside. You can point to situations in Amsterdam where cyclists should have priority, but mostly these don’t concern cycle paths along main roads that have been bent. However, there is a slightly similar situation opposite the entrance of the Westerpark.

The original Dutch version of this article appeared in the OEK (pdf). More examples here.

Map: How the fastfood workers’ fight just went global


In November 2012, fastfood workers in New York went on strike for decent wages. Since then, the fight has spread rapidly in the US and on 15 May, it went global. There were actions in cities like Dublin, Mumbai, São Paulo, Bandung, Kagoshima and many others. Security workers at Amsterdam Airport, who had just had their own action for real jobs, also showed their support.

The map above shows cities mentioned in tweets with the hashtag #FastFoodGlobal.

Method

The map above doesn’t even do justice to the scope of the action. For one thing, many other hashtags were used besides #FastFoodGlobal (e.g., #fastfoodstrike, #fightfor15, #raisethewage, #lowpayisnotok, and, quite often actually, #ronaldmacdonald). Further, it only captures references in the Latin alphabet, and only the transcription used by Wikipedia.

I used the Twitter API to collect some 50,000 tweets with the hashtag #FastFoodGlobal. I checked the text of these tweets against a list of cities with a population of 100,000 and over. Of course, it’s impossible to identify cities with 100% accuracy. I removed cities like Van (Turkish city but also a word in Spanish and Dutch) and Hamburg (cf. hamburger) as well as cities mentioned fewer than 25 times. The map is based on a tutorial by D3 Tips and Tricks.
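The matching step can be sketched in Python as follows. This is not the script used for the map: the function name, the sample tweets and the short city list are made up, and the whole-word matching is a design choice of this sketch (it keeps «Hamburg» from being found inside «hamburger», while ambiguous names like Van still need an explicit exclusion list).

```python
import re
from collections import Counter


def count_city_mentions(tweets, cities, excluded=(), min_mentions=1):
    """Count in how many tweets each city is mentioned; drop excluded
    cities and cities mentioned fewer than min_mentions times."""
    excluded = {city.lower() for city in excluded}
    counts = Counter()
    for text in tweets:
        lowered = text.lower()
        for city in cities:
            if city.lower() in excluded:
                continue
            # Whole-word, case-insensitive match.
            if re.search(r"\b" + re.escape(city.lower()) + r"\b", lowered):
                counts[city] += 1
    return {city: n for city, n in counts.items() if n >= min_mentions}


# Made-up tweets for illustration.
tweets = [
    "Strike in Dublin today #FastFoodGlobal",
    "Dublin and Mumbai stand together #FastFoodGlobal",
    "No more cheap hamburger wages #FastFoodGlobal",
    "Solidariteit van Amsterdam #FastFoodGlobal",
]
cities = ["Dublin", "Mumbai", "Hamburg", "Van", "Amsterdam"]

mentions = count_city_mentions(tweets, cities, excluded=["Van"], min_mentions=2)
print(mentions)
```

In this toy sample the threshold is 2 instead of 25, so only Dublin survives.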

Mountains and cycling culture: on winning jerseys in the Giro, Tour and Vuelta

Are there any characteristics that explain why some countries are more successful in pro cycling than others? An article at the Inner Ring blog discusses why Germany is «Europe’s Pro Cycling Black Hole», despite having some serious mountains and a vibrant cycling culture (as illustrated by the membership of the Bund Deutscher Radfahrer) - but note that the same author has also warned against simple explanations of why countries are successful. And in the UK, there has been some disappointment that successes in professional cycling haven’t led to more cycling in general.

So are mountains and cycling culture somehow related to success in professional cycling? Of course, there are different ways to answer that question. Here’s a look at some indicators, which suggest mountains - no, and cycling culture - maybe.

The graph below shows maximum elevation (to be more precise, the difference in elevation between the lowest and highest location on the country’s mainland) and the number of jerseys won in the Giro d’Italia, the Tour de France and the Vuelta a España over the past years.

There is only a weak and not statistically significant correlation between elevation span and the number of jerseys won. If you adjust for the size of the population, the relation is even negative, and still weak. Perhaps a different indicator for mountainousness would yield other results, but for now it appears that having mountains has little to do with success in the grand tours.

Then how about cycling culture? The graph below shows two indicators on the x-axis: the share of trips made by bicycle in the country’s capital (modal share), and the relative number of bikes sold. The y-axis shows the relative number of jerseys won over the past years. According to these variables, cycling culture is not related to success in professional cycling (in fact, there’s a weak, not significant, negative correlation).

Another possible indicator of a cycling culture is the membership of cyclists’ organisations. The graph below is a bit geekier than the previous ones: the scales are logarithmic (for example, the y-axis goes from 0.1 to 1 to 10 to 100).

It turns out that there is in fact a correlation between the membership of cycling organisations and the number of jerseys won. Perhaps bicycle sales and modal share are indicators of everyday bicycle use whereas membership of cycling organisations also says something about recreational use, which in turn might be related to success in professional cycling – but that’s just guessing. Whether there’s a causal relation between the two is yet another question.

See also: Giro, Tour and Vuelta: Which countries won jerseys over the past 111 yrs.

Method

The analysis is limited to the jerseys for the leaders of the general classification (the maglia rosa for the Giro d’Italia, the maillot jaune for the Tour de France and whatever colour the leader’s jersey had in the Vuelta a España that particular year). For each year and for each tour, for each rider who has won a jersey in that tour (regardless of how many days) a point was added to the country total of that rider’s country.
The D3 tooltip code is largely borrowed from D3 Tips and Tricks.
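The counting rule can be written down compactly. Below is a minimal Python sketch of the rule as described, assuming the input is one record per day spent in the leader’s jersey; the record layout and the placeholder riders and countries are hypothetical, not actual race data.

```python
from collections import Counter


def country_points(jersey_days):
    """One point per rider per tour per year, however many days
    the rider wore the jersey in that tour."""
    unique = {(year, tour, rider, country)
              for year, tour, rider, country in jersey_days}
    return Counter(country for _, _, _, country in unique)


# Placeholder records: (year, tour, rider, country).
days = [
    (2000, "Tour", "Rider A", "X"),
    (2000, "Tour", "Rider A", "X"),   # second day in the same tour: no extra point
    (2000, "Tour", "Rider B", "Y"),
    (2000, "Giro", "Rider A", "X"),   # different tour: new point
    (2001, "Tour", "Rider A", "X"),   # different year: new point
]

points = country_points(days)
print(points["X"], points["Y"])
```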

Note that Wikipedia explains that the modal share (the share of trips made by bicycle) is not measured in a consistent way and something similar may well apply to data for membership of cyclists’ organisations.

Giro, Tour and Vuelta: which countries won jerseys over the past 111 yrs

The graph below shows which countries have been successful at winning jerseys in the Giro d’Italia, the Tour de France and the Vuelta a España.

The graph shows among other things how France has been struggling since the 1990s, how Belgium (Eddy Merckx) and the Netherlands (Joop Zoetemelk, Gerrie Knetemann) did well in the 1970s and the success of the UK in the 2010s (Bradley Wiggins, Chris Froome, Mark Cavendish). If you adjust for population size (not shown), Luxembourg and Belgium are the most successful countries.

See also: Mountains and cycling culture: On winning jerseys in the Giro, Tour and Vuelta.

Method

The analysis is limited to the jerseys for the leaders of the general classification (the maglia rosa for the Giro d’Italia, the maillot jaune for the Tour de France and whatever colour the leader’s jersey had in the Vuelta a España that particular year). For each year and for each tour, for each rider who has won a jersey in that tour (regardless of how many days) a point was added to the country total of that rider’s country.
The D3 tooltip code is largely borrowed from D3 Tips and Tricks.

Spamming after all? Revisiting the repost ratios of Vox, Upshot and 538

Recently I wrote about people who share their URLs on Twitter, and then post them again, hoping to draw even more people to their site. I said that FiveThirtyEight reposts its URLs on average 0.3 times. I was wrong: it reposts its URLs far more often. And so do voxdotcom and UpshotNYT, who didn’t even make the top 5 in my original analysis. The Upshot reposts its URLs on average as many as 0.8 times.

The reason I underestimated the repost ratios in my original analysis has to do with the fact that tweets tend to contain shortened URLs. http://nyti.ms/1rFwue2 and http://nyti.ms/1iIujpo look like different URLs. However, they point to the same article, so one should be treated as a repost of the other (or perhaps both are a repost of yet another one, who knows). If you don’t take this into account and treat them as different URLs, you’ll underestimate the number of reposts (red bar in the graph).

It’s not that I wasn’t aware of this problem when I did the first analysis. I first tried to account for this by looking up the non-shortened URLs, using the Python urllib2 module. It turned out this was very time-consuming, which was a problem since I wanted to look up quite a few URLs. Pragmatically, I decided instead to use the ‘expanded URL’ provided by the Twitter API. This method does yield higher repost ratios for 538 and the Upshot (grey bars in the graph). Still, it doesn’t really solve the problem, because the expanded URL provided by the Twitter API will sometimes be yet another shortened URL. That’s the reason I still underestimated how often people recycle their content on Twitter.

When I realised the ratios I had originally calculated were still rather low given how many reposts there appeared to be in my timeline, I decided to recalculate repost ratios using urllib2 after all. Because this method is so time-consuming, I did this for just three accounts: Vox, 538 and Upshot NYT. This resulted in repost ratios that are substantially higher (light blue bars in the graph). The new Python script is here.
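To illustrate why resolving shortened URLs matters, here is a rough Python 3 sketch; it is not the script linked above. It assumes the repost ratio is the number of reposts divided by the number of distinct URLs, uses `urllib.request` (the Python 3 successor to urllib2) for the lookup, and demonstrates the effect with a made-up redirect table so that it runs without network access.

```python
import urllib.request
from collections import Counter


def resolve(url, timeout=10):
    """Follow redirects and return the final URL (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def repost_ratio(urls, resolver=lambda url: url):
    """Reposts per distinct URL; assumed definition, see the text above."""
    counts = Counter(resolver(url) for url in urls)
    if not counts:
        return 0.0
    reposts = sum(n - 1 for n in counts.values())
    return reposts / len(counts)


# Hypothetical lookup table standing in for real redirects.
FAKE_REDIRECTS = {
    "http://sho.rt/a": "http://example.com/story-1",
    "http://sho.rt/b": "http://example.com/story-1",
    "http://sho.rt/c": "http://example.com/story-2",
}
tweeted = list(FAKE_REDIRECTS)

naive = repost_ratio(tweeted)                          # short URLs all look different
resolved = repost_ratio(tweeted, FAKE_REDIRECTS.get)   # two point to the same story
print(naive, resolved)
```

For real tweets you would pass `resolve` as the resolver, which is the slow part since every URL requires at least one request.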

Note that the ratios are snapshots calculated on a sample of the 200 most recent tweets (that is, about one to two weeks of tweets).

Rise in Dutch cycling accidents, but Strava probably not to blame

The number of wielrenners (cyclists on racing bikes) treated at Dutch emergency departments has doubled since 2010, according to a study published today. Among a range of possible explanations the authors mention the popularity of apps like Strava:

The increasing popularity of smartphone apps like Strava, which let you keep track of cycling records for certain tracks and compare them with others, can lead to dangerous situations.

Like I said, this is just one of many possible explanations discussed in the report and the authors are by no means suggesting that Strava is a key factor causing cycling accidents. That said, the idea that Strava may have played a role doesn’t seem to be a priori absurd.

Strava was launched in 2009, but when did it become popular in the Netherlands? I couldn’t find any direct data on this, but Google trends is a plausible indicator.

The Google data are pretty clear: interest in Strava didn’t take off until February 2012 in the Netherlands (interestingly, the search volume index is highest in Limburg and Gelderland, which are also the main regions with hills in the Netherlands). As an extra check, I looked at messages at the Fiets.nl forum pages (you need to log in in order to be able to search the forum) containing the search term ‘strava’. There were 10 messages prior to 1 February 2012 and 1,843 after that date, which seems to confirm the Google pattern.

By contrast, the number of wielrenners at emergency departments saw its biggest increase between 2010 and 2011. The number was stable at about 2,000 prior to 2011, but rose to 3,700 in 2011 and 4,200 in 2012. So it seems Strava was largely unknown in the Netherlands at the time when the largest increase in cycling accidents happened.

The reason for the study was a media storm last year about supposedly irresponsible behaviour of wielrenners towards ‘normal’ cyclists. Car lobby club ANWB even suggested wielrenners should stay at home on sunny days.

In a survey among wielrenners, 45% said wielrenners do not sufficiently adjust their speed and 51% said wielrenners often ride in (too) wide groups. An analysis of 2,849 injury-causing accidents involving two cyclists revealed that in 24 cases a ‘normal’ cyclist got injured as a result of a collision with a wielrenner. So while many wielrenners agree that (some) wielrenners behave irresponsibly, this doesn’t seem to be a major cause of injuries among other cyclists.

Wielrenners themselves have about 2.2 injuries per 100,000 hours of activity. This is much lower than the number for all sports combined (7.1). However, 23% of wielrenners who go to the emergency department have to be treated in hospital, compared to 6% for all sports. So in terms of serious injuries, wielrennen doesn’t seem to be much safer or less safe than other sports.
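The reasoning behind that conclusion can be spelled out with a back-of-the-envelope calculation. The Python sketch below assumes that the injury rates refer to injuries treated at emergency departments, so that multiplying them by the share treated in hospital gives a rough rate of serious injuries; that assumption is mine and not stated in this form above.

```python
def hospital_rate(injuries_per_100k_hours, share_hospitalised):
    """Rough number of hospital treatments per 100,000 hours of activity."""
    return injuries_per_100k_hours * share_hospitalised


wielrennen = round(hospital_rate(2.2, 0.23), 2)
all_sports = round(hospital_rate(7.1, 0.06), 2)
print(wielrennen, all_sports)
```

Both come out at roughly 0.4 to 0.5 per 100,000 hours, which is why the difference in overall injury rates largely disappears when you look at serious injuries only.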

While it’s difficult to pinpoint the exact cause of the rise in accidents involving wielrenners, the authors of the report suggest the capacity of cycle paths is no longer sufficient given the rising number of cyclists, including a rise in cycling among people above 55. One of their recommendations is to create more ‘cycling highways’ for fast cyclists.
