An open data Master’s thesis

Just over four years ago, I frantically finished the most involved piece of writing in my life – my Master’s thesis at the Oxford Internet Institute.

It’s a 10,000-word investigation into the state of “open data” use in the UK and USA around 2010. In it, I looked at developers’ motivations for participating in the open data scene and the problems of tracking provenance for open data, and I tried to identify some of the “intermediaries” that occupied positions of power in the network of open data use and re-use across the world.

Sounds exciting right?

You can read the whole thing here, as a PDF.


What’s changed?

Four years is a long time online. It’s like fifty Internet years. So how does my thesis stand up? What’s changed?

Well, having been out of academia for the last four years, I can’t speak definitively. But I did end up working for a UK startup creating open data through web scrapers, and I do now work at the company that first liberated UK Parliament data in the form of TheyWorkForYou. So I’ve not moved far from the open data world. Maybe I’m as good a judge as any.

Some thoughts, in no particular order:

My key finding (that there weren’t yet any clear intermediaries over open data, as there were in other spheres like news provision and web search) still holds. Open data publication is still pretty distributed (some would say fractured). Government- or NGO-sanctioned catalogues hold small, overlapping fractions of what’s out there. Alongside them, developers mostly rely on web searches to find what they’re after, and if the data’s not out there, they’ll try to scrape it themselves.
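For the uninitiated, that scraping fallback is usually only a few lines of code. Here’s a minimal sketch in Python, using the requests and BeautifulSoup libraries; the URL and the page’s table structure are hypothetical stand-ins, not a real government site:

```python
# A minimal scraping sketch: fetch a (hypothetical) page that publishes
# data only as an HTML table, and re-publish it as machine-readable CSV.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.gov.uk/spending/2010"  # hypothetical page

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pull every row out of the first table on the page...
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# ...and write it out as something other people can actually reuse.
with open("spending.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```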

Speaking of which, ScraperWiki (whose CEO, Francis Irving, was one of my interview participants, and whom I went on to work for) is no longer a platform for collaborative data reuse.1 I don’t think they’ll mind me saying that we simply couldn’t find a scalable business model for collaborative data development online – mostly because the big corporates who do this professionally all use in-house teams, and the remainder (the activists, journalists, geeks) don’t really need collaboration, and are reluctant to pay for something that’s often either a hobby or a means of gathering data that may never actually be used.

On the other hand, where work has been funded instead by grants or non-commercial investment, collaborative open data development has continued apace. Morph.io, a modern re-imagining of ScraperWiki’s scraping platform, is a mainstay of activists and hackers looking to liberate and manipulate data on the web. It’s funded through its parent charity, the OpenAustralia Foundation. Likewise mySociety (which I now work for) has expanded beyond its popular UK-based sites, and now works with international partners on a little ecosystem of open data tools (like the Poplus components) and open data repositories (like EveryPolitician, which only last month marked the milestone of holding data about every politician in 200 national legislatures worldwide).

Meanwhile, governments, though not completely cold on open data, speak about it very little. I opened my thesis with a quote from Barack Obama in 2009 – “Government should be transparent. Government should be participatory. Government should be collaborative.” Soon after, he established data.gov, the world’s first proper government-sanctioned data portal.

But now, where are the politicians singing the praises of open data? Nowhere. In fact, a sceptic might point out that governments seem intent on shutting down data dissemination wherever possible – for example, by giving government ministers more opportunity to wriggle out of Freedom of Information requests, one of the major sources of forcibly opened government data.

Thankfully, however, there are lots of geeks inside government (like the team at data.gov.uk and the legendary Jukesie at the ONS) opening up government data regardless. And with increased pressure from GDS for departments to use open source software, it’s becoming much more common for data to be collected and stored in an open format from day one – the ideal situation.

Lastly, my expectation that the private sector would soon start to capitalise on open data (as Nike had in 2010) hasn’t really been borne out on any huge scale. But businesses have begun to use external data more generally to help them make decisions. It’s usually erroneously called “big data” (despite almost never being much bigger than a spreadsheet), but at its heart a lot of this stuff is open data, produced by governments or international bodies, which companies hope to take for free, combine with their own internal datasets, and… err… step three: profit!
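For what it’s worth, the “combine” step is rarely more exotic than a join. A minimal sketch in Python with pandas, where the file names and column names are entirely hypothetical:

```python
# A sketch of the "combine with internal data" step, using pandas.
# File names and columns here are hypothetical stand-ins.
import pandas as pd

# Open data: e.g. government statistics keyed by local authority.
population = pd.read_csv("ons_population_by_local_authority.csv")

# Internal data: e.g. the company's own sales figures.
sales = pd.read_csv("internal_sales_by_local_authority.csv")

# Steps one and two: join the free data onto your own...
combined = sales.merge(population, on="local_authority_code", how="left")

# ...then derive something you couldn't see before. Step three: profit!
combined["sales_per_capita"] = combined["revenue"] / combined["population"]
print(combined.sort_values("sales_per_capita", ascending=False).head())
```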

Anyway, enough talk. Go check out my pretty network diagram!

  1. ScraperWiki now focusses on data consulting. Scraping is still part of it, but it’s performed by a small team of in-house experts, rather than the wider community. They’ve also noticed a niche in accurate extraction of tabular data from PDFs, so they have a self-service tool for that too.