Cultural geographer Dr Sam Kinsley, lecturer in Human Geography and Co-Investigator on the ESRC-funded Contagion project, writes about the difficulties you might encounter when you conduct social media research.
This post originally appeared on the Contagion research project blog; it has also appeared on the LSE Impact of Social Science blog.
Social media research is on the rise but researchers are increasingly at the mercy of the changing limits and access policies of social media platforms. API and third party access to platforms can be unreliable and costly.
Sam Kinsley outlines the limitations and stumbling blocks when researchers gather social media data. Should researchers be using data sources (however potentially interesting/valuable) that restrict the capability of reproducing our research results?
Many of the research articles and blogs about conducting research with social media data, and in particular with Twitter data, offer overviews of their methods for harvesting data through an API. An Application Programming Interface is a set of software components that allow third parties to connect to a given application or system and utilise its capacities using their own code. Most of these research accounts tend to make this process seem rather straightforward. Researchers can either write a programme themselves or utilise one of several tools that provide a WYSIWYG interface for connecting to the social networking platform, such as yourTwapperKeeper or COSMOS, or a service such as ScraperWiki (to which I will return). However, what is little commented upon are the restrictions placed on access to data through many of the social networking platform APIs, in particular Twitter’s. The aim of this blog post is to address some of the issues around access to data and what we are permitted to do with it.
Restrictions to ‘free’ access to Twitter data
The restrictions imposed on access to data and their possible uses have a direct effect upon the kinds of questions one can ask of the data, and indeed the kind of research we can conduct. What are these restrictions? In the case of Twitter, there are two particular API access points of interest: the streaming API and the search API.
These both come with particular kinds of restrictions, which have the potential to affect the amounts of data one can access. The streaming API effectively filters the full stream of all of the tweets being posted at any given time (named the ‘firehose’) down to one per cent of the total (colloquially referred to as the ‘spritzer’) and the sampling method is not explained to users. As there are over 500,000,000 tweets per day, with an average in 2013 of 5,700 per second, one per cent remains rather a lot of data. Nevertheless, as a sample it may be seen as problematic. For example, researchers have compared the one per cent and firehose streams to statistically investigate how proportionate the ‘spritzer’ representation is of the full data set. Morstatter et al. (2013) suggest that for large datasets, or big issues that generate lots of traffic, the one per cent is apparently fairly ‘faithful’ to the full stream, with a common set of top keywords and hashtags. However, for smaller datasets the spritzer appears to be a less faithful representation of all activity – this would mean researchers using the API would possibly need to be selective about the issues they study. Further, they suggest there is a ‘blackboxed’ bias in the one per cent ‘spritzer’ API stream, which diverges from random one per cent samples they took from the ‘firehose’.
The search API is slightly more complicated. The data available is typically limited to the last week of activity, although for some search terms it may be slightly longer (this seems to vary). Access is governed by the number of requests to the API any given user can make in a set period (15 minutes). A user with an ‘access token’ can make 180 calls per 15 minutes fetching approximately 100 tweets per call. A user can utilise more than one access token but in their documentation Twitter allude to a limit on application-only authentication (without access tokens) of 450 calls per 15 minutes, so it might be reasonable to assume this is an absolute limit (I don’t have any experimental results to prove or disprove this).
As a thought experiment, if we assume that limit then the total amount of data accessible is 450 calls × 100 tweets per call × four 15-minute periods per hour = 180,000 tweets fetched per hour (in which period, on 2013 averages, 20,520,000 new tweets are added). Taken the other way around, suppose we can use lots of access tokens and want to be opportunistic and harvest all tweets related to a phenomenon that occurred in the last three days, with approximately 40,000,000 tweets in the corpus. We would need to collect all of those tweets in three days, as the oldest data is already three days old. At 180 calls per 15 minutes, one access token fetches about 72,000 tweets per hour, or roughly 5.2 million in three days, so we would need eight access tokens simultaneously gathering tweets, without any replication of data being harvested between them, for three solid days. There are two big assumptions here: first, we can use eight access tokens to harvest data at the maximum rate for 24 hours per day, without restriction; second, those accounts can be used so that only ‘fresh’ data is gathered, without replication across the eight.
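The arithmetic of this thought experiment can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constants are the rate limits and 2013 averages quoted above, while the function names are illustrative and not part of any Twitter library.

```python
import math

# Figures quoted in the post (search API limits and 2013 averages).
CALLS_PER_WINDOW_APP = 450      # assumed absolute limit per 15-minute window
CALLS_PER_WINDOW_TOKEN = 180    # limit for a single access token
TWEETS_PER_CALL = 100           # approximate tweets returned per call
WINDOWS_PER_HOUR = 4            # four 15-minute windows in an hour
TWEETS_PER_SECOND_2013 = 5_700  # average rate of new tweets in 2013


def tweets_per_hour(calls_per_window: int) -> int:
    """Maximum tweets fetched per hour at a given per-window call limit."""
    return calls_per_window * TWEETS_PER_CALL * WINDOWS_PER_HOUR


def tokens_needed(corpus_size: int, days: int) -> int:
    """Access tokens required to harvest a corpus within `days`,
    assuming each token runs flat out with no duplicated tweets."""
    per_token = tweets_per_hour(CALLS_PER_WINDOW_TOKEN) * 24 * days
    return math.ceil(corpus_size / per_token)


print(tweets_per_hour(CALLS_PER_WINDOW_APP))  # 180000 fetched per hour
print(TWEETS_PER_SECOND_2013 * 3600)          # 20520000 new tweets per hour
print(tokens_needed(40_000_000, 3))           # 8 tokens
```

Changing the corpus size or the number of days shows how quickly the number of tokens required grows for topics that trend heavily.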
In both forms of API access to Twitter we may be forgiven for thinking there’s not much wrong: lots of data is available. However, when a researcher begins to ask questions that they would like to answer with that data, particular kinds of problems can arise. By and large, to get to the maximum figures indicated for the API, above, one needs to implement a bespoke programme to ensure dedicated access in order to maximise the rate of data collection. Equally, using multiple ‘access tokens’ will, most likely, result in gathering some duplicate data, which will need to be filtered and refined.
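Filtering out duplicates gathered by several access tokens might look something like the sketch below. It assumes, purely for illustration, that each harvested tweet has been parsed into a Python dictionary with a unique `id` field; the function name and the sample records are hypothetical.

```python
from typing import Iterable, Iterator


def deduplicate(tweets: Iterable[dict]) -> Iterator[dict]:
    """Yield each tweet once, keeping the first copy seen.

    Assumes every tweet is a dict carrying a unique "id" key.
    """
    seen = set()
    for tweet in tweets:
        tweet_id = tweet["id"]
        if tweet_id in seen:
            continue  # already collected via another access token
        seen.add(tweet_id)
        yield tweet


# Hypothetical records, as if merged from two overlapping harvests.
harvest = [
    {"id": 1, "text": "first"},
    {"id": 2, "text": "second"},
    {"id": 1, "text": "first"},
]
unique = list(deduplicate(harvest))
print(len(unique))  # 2
```

Keeping only the set of IDs in memory, rather than whole tweets, keeps the filter workable for fairly large corpora.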
In practice, when gathering data through the service ScraperWiki we often encountered rate limiting, which we were powerless to affect. Even with yourTwapperKeeper, for example, one needs to have better than average IT skills in order to implement an effective data collection method (see Bruns and Liang for an overview of what might be needed). This can, of course, be addressed by working with colleagues with the appropriate skills and may lead to interesting cross-disciplinary collaborations. However, should you wish to search the historical archive of tweets (for example: searching for tweets concerning the UK riots in 2011) this is not possible through the API and you will have to pay a commercial reseller of Twitter data, or ‘certified partner’ in the jargon, to get those data. Therefore, in order to have a chance at gathering data, researchers using the API need to be opportunistic and set ‘scrapes’ of data running as close in time to the activities of interest as possible.
Equally, if one uses broad enough search terms it is entirely possible that the volume of tweets matching the criteria is so large that they drop out of the free-to-access pool of data before your search can reach them. Therefore, API-based data gathering for research is best suited to opportunistic, highly specific searches (such as the UK badger cull), rather than topics that significantly trend (such as anything to do with an international celebrity).
At the beginning of the Contagion project we accessed the API through the easy-to-use third party online system ScraperWiki. With that system it was easy for us to set up ‘scrapes’ for tweets and search and order the data we retrieved, download it and analyse it in various ways. However, earlier this year, ScraperWiki had their access to the Twitter API revoked. The tools for searching and collecting Twitter data were stopped and never reactivated. We have therefore had to seek alternative means of accessing data.
A political economy of ‘big data’
ESRC fund the Contagion project, which aims to investigate the various ways in which contagion is both studied and modelled within a cross-disciplinary setting.
Perhaps the more serious issue to which this situation of access to data alludes is the proprietary nature of access, and indeed the data itself. While (largely unlimited) use of Twitter as a service is free to any user that signs up, access to the data on the platform is not. Twitter is, of course, a business. Just like many other ‘social’ platforms the data Twitter receives from its users is valuable and can be packaged as a commodity. There is therefore a political economy to this kind of ‘big data’ and accordingly political economic issues for ‘big data’ research.
Access is a commodity
If a researcher relies on the free API access to a platform, with its attendant vagaries of how much data one can access and for how long, then that researcher is at the mercy of the changing limits and access policies of that API. On the other hand, if one pays for access to data, to avoid the uncertainty of access (how much data and for how long), then expect to pay handsomely. Both main ‘certified partners’ that sell access to Twitter data, Datasift and Gnip (recently bought by Twitter), render access a commodity – you not only pay for the data but also for the processing power/time it takes to extract it and the ‘enrichments’ they add, by resolving shortened URLs for you, attributing sentiment to a given tweet (positive, neutral, negative) and so on.
The costs charged by ‘resellers’ of data are not insignificant in terms of typical research budgets, with some charging through a subscription model – requiring customers to commit for a minimum of six months. Twitter themselves have advertised their own ‘data grant’ scheme, which came into operation this year, and offered a limited number of opportunities to access data through a competitive application process, not dissimilar to funding grant calls. Of the 1,300 applicants only six (or 0.5 per cent) were granted data (the numbers here come from this Fortune article).
Data are proprietary goods
The corollary to gaining access to proprietary data is that the license one agrees to abide by for access to Twitter data states that you cannot share that data. Therefore, investing in any form of data access (via the API or a ‘reseller’) through publicly funded research is problematic, for we are all asked to submit data obtained in a publicly funded project to data archives so that other researchers can access and use it, which is prohibited by Twitter’s Terms of Service (1.4.1). As others have observed, it is possible to get around this by archiving only the unique ID code for each tweet and leaving it up to any future researchers to download the tweets using those IDs, thereby not breaching the Terms of Service. However, with the limits to the API outlined above, for a large corpus of tweets (> 1m, say) this might take a rather long time. A quick calculation suggests that, using the statuses/lookup API with one ‘access token’, it would take around 13 hours 53 minutes of solid use of the API (at 100 tweets per request, 180 requests per 15 minutes = 72,000 tweets per hour, and without any hitches) to download one million tweets. Not impossible then, but perhaps significantly inconvenient – and reliant upon the system of unique IDs remaining the same for the foreseeable future. Furthermore, such restrictions arguably run counter to the requirements set on research data gathered using UK research council funds. The (UK) ESRC, who funded Contagion, have general principles in their Research Data Policy that suggest:
- Publicly-funded research data are a public good, produced in the public interest.
- Publicly-funded research data should be openly available to the maximum extent possible.
This asks difficult questions of us as researchers: Should we be using data sources (however potentially interesting/valuable) that restrict the capability of reproducing our research results? Should we be using public funds to pay for data that are restricted in such ways?
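To give a sense of the inconvenience of re-downloading archived tweet IDs, the estimate above can be reproduced with a short Python sketch. The default parameters are the limits quoted in this post (100 tweets per request, 180 requests per 15 minutes); the function name is illustrative, and the calculation treats the rate limit as a steady rate with no failed requests. At 72,000 tweets per hour, one million tweets comes to just under 14 hours.

```python
import math


def hydration_time(n_ids, ids_per_request=100,
                   requests_per_window=180, window_minutes=15):
    """Return (hours, minutes) of continuous API use needed to look up
    n_ids tweets, treating the rate limit as a steady rate."""
    requests_needed = math.ceil(n_ids / ids_per_request)
    total_minutes = requests_needed / requests_per_window * window_minutes
    return divmod(round(total_minutes), 60)


hours, minutes = hydration_time(1_000_000)
print(f"{hours} h {minutes} min for one million tweets")
```

The same function shows how the burden scales: a corpus of ten million IDs would tie up a single access token for several days.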
Not free, not easy
Some argue that conducting research using Twitter data has become something of a fad across academe, but in practice it proves neither to be easy (without non-trivial IT expertise and/or understanding of the policies of Twitter as a company), nor free: it requires investment in terms of hours of work (designing and/or operating systems to collect, store and analyse the data), it may require paid access (depending on what kind of sample of data you require), and it comes with usage restrictions.
This has led to the principal arenas of Twitter-based research occurring outside of the academy – a lot of data science, in fact, is conducted by commercial organisations. Whether or not this research is meaningful is open to interpretation. Nevertheless, it remains the case that, as others have suggested, an awful lot of (computationally-driven) social science is being done by ‘non-academic’ researchers, amongst whom there are significant numbers of people with advanced levels of relevant IT skills. However, I argue that one of the unfortunate effects of this shift in the locus of research is a lack of criticality.
One might convincingly argue, for example, that there is an awful lot of data visualisation for its own sake. It doesn’t necessarily argue anything; instead it describes an impressive amount of data in a visually appealing manner. Equally, there is a tendency in some technically-led social research to assume that the context of data, or even the hypotheses one might pose and use that data to address, are secondary to its formatting or scale. For example, in a conversation with a salesperson for a data provider I was advised that as a geographer I ought to study the picture sharing platform Instagram because that had the highest take-up of geo-located content. What that content represents, or what kinds of questions we can or might ask of it, is therefore of secondary importance to the fact that there is geo-location metadata.
This is not to suggest that valuable ‘theory building’ research cannot be conducted through forms of data mining. We might not know the questions we can ask of the kinds (and scales) of data we are being faced with without performing exploratory analyses. Nevertheless, if we want to be surprised by the data (which may include concluding it is not particularly interesting for various reasons), as others have suggested, we surely need to implement critical forms of inquiry.
The point of this blog post is that to study social media data, and in particular Twitter data, is to concern oneself with emerging economies of data and their attendant politics. Rather than being easy and plentiful sources of research data, platforms like commercial social networking systems require hard work: it is hard to gain access to that data (as non-technical and non-wealthy academic researchers), and some hard critical epistemological reflection is required upon what can and cannot be asked and/or concluded given the specificities of each kind of dataset and data source we use. The means of access, the APIs and other elements necessary to access the data, are important interlocutors in the stories we tell with these data.
It remains possible to do particular kinds of research with the Twitter data one can access through the APIs, but we have to think pretty carefully about what kinds of questions we can and should ask of these data, and the system from which they are derived.