Monthly Archives: January 2015

Philosophy of Data Science series – Sabina Leonelli: “What constitutes trustworthy data changes across time and space”

What are the implications of data-intensive science for social sciences and how can social scientists use this data in research?

Dr Sabina Leonelli is the Associate Director of the Exeter Centre for the Study of the Life Sciences (Egenis) and the Associate Editor of the journal History and Philosophy of the Life Sciences.

In this blog, Dr Sabina Leonelli, Associate Professor in the Department of Sociology, Philosophy and Anthropology, looks at the epistemological issues and practical applications raised by data-intensive research.

This post first appeared on the LSE Impact blog, as part of their Philosophy of Data Science series

The next instalment of the Philosophy of Data Science series is with Sabina Leonelli, Principal Investigator of the ERC project, The Epistemology of Data-Intensive Science. Last year she completed a monograph titled “Life in the Digital Age: A Philosophical Study of Data-Centric Biology”, currently under review with University of Chicago Press. Here she discusses with Mark Carrigan the history of data-centric science and research practice and data’s relation to pre-existing and emerging social structures. Data types are produced by many stakeholders, from citizens to industry and governmental agencies, which means that what constitutes data, for whom and for which purposes is constantly at stake.

Previous interviews in the Philosophy of Data Science series: Rob Kitchin, Evelyn Ruppert, Deborah Lupton, Susan Halford and Noortje Marres.

 

What is “data-intensive science”? How new is it? 

I take data-intensive science to be any research enterprise where major efforts are devoted to the generation, dissemination, analysis and/or interpretation of data. Indeed, my preferred term for this scientific approach is ‘data-centric’ rather than data-intensive, as a distinctive feature of such research is the high degree of attention and care devoted to data handling practices (which is not necessarily to the exclusion of theories, models, instruments, software and materials, since data practices are almost invariably intertwined with concerns about other components of research). Thus defined, data-centric science is definitely not new. This is clearly illustrated by the major data collection and curation efforts characterising 17th century astronomy and meteorology and 18th century natural history – cases which, together with many others, are documented within the ‘Historicising Big Data’ working group which I visited last year at the Max Planck Institute for the History of Science in Berlin.

At the same time, the current manifestations of data-centric science have distinctive features that relate to the technologies, institutions and governance structures of the contemporary scientific world. For instance, this approach is typically associated with the emergence of large-scale, multi-national networks of scientists; with a strong emphasis on the importance of sharing data and regarding them as valuable research outputs in and of themselves, regardless of whether or not they have yet been used as evidence for a given discovery; with the institutionalisation of procedures and norms for data dissemination through the Open Science and Open Data movements, and policies such as those recently adopted by RCUK and key research funders such as the European Research Council, the Wellcome Trust and the Gates Foundation; and with the development of instruments, building on digital technologies and web services, that facilitate the production and dissemination of data with a speed and geographical reach as yet unseen in the history of science. In my work, I stress how this peculiar conjuncture of institutional, socio-political, economic and technological developments has made data-centric science into a prominent research approach, which has considerably increased international debate and active reflection over processes of data production, dissemination and interpretation within science and beyond. This level of reflexivity over data practices is what I regard as the most novel and interesting aspect of contemporary data-centrism.

What are the epistemological issues raised by data-intensive science? 

Some obvious issues, raised both within the sciences and the humanities, concern the notion of data itself and the patterns of reasoning and methods associated with them. What are data, and how are they transformed into meaningful information? What is the status of so-called raw data with respect to other sources of evidence? What constitutes good, reliable data? What role do theory and materials play in data-intensive research? What patterns of reasoning characterise this scientific approach? What difference do the scale (itself a multifaceted notion), technological sophistication and institutional sanctioning of widespread data dissemination make to discovery and innovation? These are issues investigated by my current ERC project ‘The Epistemology of Data-Intensive Science’, which analyses data handling across a range of disciplines including plant biology, biomedicine and oceanography. Philosophical analysis can help to address these questions in ways that inform both current data practices and the ways in which they have been conceptualised within the social sciences and humanities, as well as by policy bodies and other institutions.

The epistemological aspect that interests me most, however, is even more fundamental. Given the central role of data in making scientific research into a distinctive, legitimate and non-dogmatic source of knowledge, I view the study of data-intensive science as offering the opportunity to raise foundational questions about the nature of knowledge and knowledge-making activities and interventions. Scientific research is often presented as the most systematic set of efforts in the contemporary world aimed to critically explore and debate what constitutes acceptable and sufficient evidence for any given belief about reality. The very term ‘data’ comes from the Latin ‘givens’, and indeed data are meant to document as faithfully and objectively as possible whatever entities or processes are being investigated. And yet, data collection is always steeped in a specific way of understanding the world and constrained by given material and social conditions, and the resulting data are therefore marked by the historical circumstances through which they were generated: what constitutes trustworthy or sufficient data changes across time and space, making it impossible to ever assemble a complete and intrinsically reliable dataset. Furthermore, data are valued and used for a variety of reasons within research, including as sources of evidence, tokens of exchange and personal identity, signifiers of status and markers of intellectual property; and myriads of data types are produced by as many stakeholders, from citizens to industry and governmental agencies, which means that what constitutes data, for whom and for which purposes is constantly at stake.

This landscape makes the study of data into an excellent entry point to reflect on the activities and claims associated with the idea of scientific knowledge, and on the implications of existing conceptualisations of various forms of knowledge production and use. This is nicely exemplified by an ongoing Leverhulme Trust Research Grant on the digital divide in data handling practices across developed and developing countries, particularly sub-Saharan Africa, which we are currently developing at Exeter: what constitutes knowledge, and a ‘scientific contribution’, varies enormously depending not only on access to data, but also on what is regarded as relevant data in the first place, and on what capabilities any research group has to develop, structure and disseminate their ideas.

What are the implications of data-intensive science for the social sciences? 

At a practical level, it constitutes an opportunity for social scientists to invest more time and energy in understanding the functioning of technologies geared towards data production, dissemination and analysis (such as complex data infrastructures, digital databases and software), their relation to pre-existing and emerging social structures and practices, and the ways in which they can be fruitfully and critically appropriated as research tools. It is also an occasion to revisit the importance of intertwining quantitative and qualitative data, which is particularly important at a time when regrettably few analysts work with both types of data. Spotting correlations through the analysis of so-called ‘big data’ is an exciting endeavour and an excellent opportunity to devise new research directions. At the same time, the significance of such findings can only be assessed in relation to in-depth understandings of social dynamics and their history, which are typically garnered through qualitative methods such as interviews and ethnography. Just as in the natural sciences, where multi-disciplinary networks are increasingly valued, social scientists need to cooperate with each other in order to combine the qualitative and quantitative skills needed to work with big data; with computer scientists and statisticians, so as to deepen their understanding of the analytic tools and technologies available to handle and interpret data; and with the humanities, particularly history and philosophy, to help with contextualising and reflecting upon the conditions under which data are obtained, disseminated, processed and used.

This interview is part of an ongoing series on the Philosophy of Data Science. Previous interviews in the series: Rob Kitchin, Evelyn Ruppert, Deborah Lupton, Susan Halford and Noortje Marres.

Note: This article gives the views of the author, and not the position of the Impact of Social Science blog, nor of the London School of Economics. Please review our Comments Policy if you have any concerns on posting a comment below.

Paris attack: Has history repeated a generational spiral into ultra-violence?

Bill Tupman is an honorary Research Fellow at the University of Exeter, who has 40 years’ experience in researching terrorism. In the wake of the Charlie Hebdo shootings he asks what we could learn from the past…?

As with so much in modern life, it was mobile phones that captured the brutal murder of 12 people, including two police officers and a maintenance worker, but mostly cartoonists and journalists, in the offices of Charlie Hebdo in Paris. The horrific visual images left an impression of anonymous ‘cold-blooded’, disciplined, merciless killers, who hid behind balaclavas and wore black.

It was an imitation of the forces of Islamic State, the murderous terrorists of Iraq and Syria. However, the images did not tell the whole truth. In reality, the attackers went to the wrong building first, had to force someone to let them into the offices, escaped but with no safe house established and were without a plan as to what to do next.

Nevertheless, we must ask, is this a homegrown attack or the first of the long-anticipated attacks by returning jihadis?

Attacks from those trained in warfare in Syria and Iraq were expected to be more militarised, disciplined, cold-blooded and violent, because the conflict in that area has become more and more vicious since the emergence of Islamic State from the remnants of al Qaeda in Iraq, which had offended many Iraqi Sunnis, leading to its defeat during the early years of the Obama presidency. Instead, the connection appears to be to the Yemen – perhaps less surprising when you consider that it is al Qaeda in the Yemen that has made threats against Europe, while Syrian groups have prioritised the Syrian and Iraqi conflicts.

Al Qaeda was established to provide support to a network of more than 20 organisations operating in the Islamic world, united by common experience in the war against the USSR in Afghanistan and by a desire to return to a more fundamental version of Islam. At present it consists of a number of affiliated regional groups and indirectly affiliated organisations.

The key components are: al Qaeda in the Islamic Maghreb, which attacked in Mali and against which French troops were deployed; al Qaeda in Somalia; al Qaeda in the Arabian peninsula, which the Paris attackers claimed to be representing; al Qaeda in Syria, which is a different organisation from Islamic State; and al Qaeda in the Indian sub-continent.

Generational change

This downward spiral into ultra-violence has been seen before and is not inherent in Islamic extremism.

During the 1970s, The Economist published an article introducing the idea of generational change in what began as an urban guerrilla movement and turned into full-blown terrorism.

The first generation was led by relatively experienced individuals, with a long background in political activity. They began by using violence against symbolic targets and differentiated between the ‘enemy’ and the public as a whole. Mostly, symbolic buildings were targeted.

The second generation saw violence against politicians, policemen and soldiers as acceptable. The third hoisted high the banner of ‘if you are not part of the solution you are part of the problem’, a slogan which justified civilian casualties.

Finally a fourth generation emerged which concentrated on attacking ‘soft’ civilian targets.

At the time this was considered a logical progression. As the older leadership was arrested or killed, leadership passed to individuals with fewer scruples and no real interest in building up public support. Ministers of the interior, prison warders, police officers and soldiers became targets because of the existence of comrades in prison and because other activists had been killed in action.

Provocation of overreaction by the authorities became a strategy. As potential targets were hardened and made more risky to attack, militants turned to softer targets, especially since public support decreased rather than increased. A further variable was the existence of training programmes in the refugee camps in Palestine and in other post-colonial countries.

Re-examining revolutionary groups

We may be able to learn more from a re-examination of what happened to revolutionary groups in the late 1960s and early 1970s.

There was a bewildering array of ideologies around, just as there are varieties of Islam. Just as today we have Sunni, Shia, Sufi, Salafi, Wahabi and other varieties, so we had anarchism, Maoism, several kinds of Trotskyism and orthodox communism on the revolutionary Left. We had different strategies and tactics and organisational splits because of them. We had broader campaigns within which all of these groups operated in different ways. Is it possible to learn anything by comparing the two periods? Is it possible to encourage splits and ultimately disintegration of contemporary movements?

Islamic State has been very successful at forcing groups with different ideologies together, but history tells us they will splinter just as quickly if its military successes are turned into retreat. Below the leadership is a much looser set of associations, following individual local charismatic leaders, who will go their own way when it suits them. The ‘foreign jihadis’, as well as providing shock troops, are also useful for intimidating elements of this loose coalition into submission. But the whole network is more fragile than it looks, and unforeseen events could rapidly produce internal conflicts over personalities, the tactic of indiscriminate violence, or even the teachings of Islam, leading to disillusion on the part of existing and potential recruits from Europe.

A political economy of Twitter data? Conducting research with proprietary data is neither easy nor free.

Cultural geographer Dr Sam Kinsley, lecturer in Human Geography and Co-Investigator on the ESRC-funded Contagion project, writes about the difficulties you might encounter when using social media data in research.

This post originally appeared on the Contagion research project blog; it has also appeared on the LSE Impact of Social Science blog.

Dr Sam Kinsley is a lecturer in Human Geography

 

Social media research is on the rise but researchers are increasingly at the mercy of the changing limits and access policies of social media platforms. API and third party access to platforms can be unreliable and costly. 

Sam Kinsley outlines the limitations and stumbling blocks when researchers gather social media data. Should researchers be using data sources (however potentially interesting/valuable) that restrict the capability of reproducing our research results?

Many of the research articles and blogs concerning conducting research with social media data, and in particular with Twitter data, offer overviews of their methods for harvesting data through an API. An Application Programming Interface is a set of software components that allow third parties to connect to a given application or system and utilise its capacities using their own code. Most of these research accounts tend to make this process seem rather straightforward. Researchers can either write a programme themselves or utilise one of the several tools that have emerged to provide a WYSIWYG interface for connecting to the social networking platform, such as yourTwapperKeeper or COSMOS, or use a service such as ScraperWiki (to which I will return). However, what is little commented upon are the restrictions put on access to data through many of the social networking platform APIs, in particular Twitter's. The aim of this blog post is to address some of the issues around access to data and what we are permitted to do with it.
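
By way of illustration, the ‘write a programme themselves’ route might look something like the minimal sketch below. This is only a sketch, not a recipe: it assumes application-only authentication with a bearer token already obtained and held in a TWITTER_BEARER_TOKEN environment variable (a name chosen here purely for illustration), and it uses the v1.1 search endpoint and parameters as documented at the time of writing, which may since have changed.

```python
# Minimal sketch: fetch one page of recent tweets from Twitter's v1.1 search API.
# Assumes application-only auth, with a bearer token already obtained and stored
# in the TWITTER_BEARER_TOKEN environment variable (illustrative name).
import os
import requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def search_tweets(query, count=100):
    """Return up to `count` recent tweets matching `query` (a single API call)."""
    headers = {"Authorization": "Bearer " + os.environ["TWITTER_BEARER_TOKEN"]}
    params = {"q": query, "count": count, "result_type": "recent"}
    response = requests.get(SEARCH_URL, headers=headers, params=params)
    response.raise_for_status()  # surfaces rate limiting (HTTP 429) and auth errors
    return response.json().get("statuses", [])

if __name__ == "__main__":
    for tweet in search_tweets("#badgercull"):
        print(tweet["id_str"], tweet["text"][:80])
```

Tools such as yourTwapperKeeper or ScraperWiki essentially wrap loops like this, together with storage and scheduling, behind a friendlier interface.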

Restrictions to ‘free’ access to Twitter data

The restrictions imposed on access to data and their possible uses have a direct effect upon the kinds of questions one can ask of the data, and indeed the kind of research we can conduct. What are these restrictions? In the case of Twitter, there are two particular API access points of interest:

  • the Streaming API, which delivers a filtered sample of tweets in real time as they are posted; and
  • the Search API, which allows retrospective queries over recent tweets (typically the last week of activity).

These both come with particular kinds of restrictions, which have the potential to affect the amounts of data one can access. The streaming API effectively filters the full stream of all of the tweets being posted at any given time (named the ‘firehose’) down to one per cent of the total (colloquially referred to as the ‘spritzer’), and the sampling method is not explained to users. As there are over 500,000,000 tweets per day, with an average in 2013 of 5,700 per second, one per cent remains rather a lot of data. Nevertheless, as a sample it may be seen as problematic. For example, researchers have compared the one per cent and firehose streams to statistically investigate how proportionate the ‘spritzer’ representation is of the full data set. Morstatter et al. (2013) suggest that for large datasets, or big issues that generate lots of traffic, the one per cent is apparently fairly ‘faithful’ to the full stream, with a common set of top keywords and hashtags. However, for smaller datasets the spritzer appears to be a less faithful representation of all activity – this would mean researchers using the API would possibly need to be selective about the issues they study. Further, they suggest there is a ‘blackboxed’ bias in the one per cent ‘spritzer’ API stream, which diverges from random one per cent samples they took from the ‘firehose’.
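
As a rough illustration of the kind of ‘faithfulness’ check performed by Morstatter et al., one might compare the most frequent hashtags in a spritzer sample against those in a fuller reference sample and measure the overlap. The sketch below is purely illustrative and assumes tweets are held as Python dictionaries in Twitter's standard JSON format (with an 'entities' field).

```python
# Illustrative sketch: how 'faithful' is a 1% spritzer sample to a fuller
# reference sample? Compare their top-k hashtags and report the overlap.
# Assumes each tweet is a dict in Twitter's standard JSON format.
from collections import Counter

def top_hashtags(tweets, k=20):
    """Return the k most common hashtags (lower-cased) in a list of tweet dicts."""
    counts = Counter(
        tag["text"].lower()
        for tweet in tweets
        for tag in tweet.get("entities", {}).get("hashtags", [])
    )
    return [tag for tag, _ in counts.most_common(k)]

def top_hashtag_overlap(spritzer_tweets, reference_tweets, k=20):
    """Fraction of the reference sample's top-k hashtags also found in the spritzer."""
    spritzer_top = set(top_hashtags(spritzer_tweets, k))
    reference_top = set(top_hashtags(reference_tweets, k))
    return len(spritzer_top & reference_top) / float(len(reference_top) or 1)
```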

The search API is slightly more complicated. The data available is typically limited to the last week of activity, although for some search terms it may be slightly longer (this seems to vary). Access is governed by the number of requests to the API any given user can make in a set period (15 minutes). A user with an ‘access token’ can make 180 calls per 15 minutes fetching approximately 100 tweets per call. A user can utilise more than one access token but in their documentation Twitter allude to a limit on application-only authentication (without access tokens) of 450 calls per 15 minutes, so it might be reasonable to assume this is an absolute limit (I don’t have any experimental results to prove or disprove this).

As a thought experiment, if we assume that limit then the total amount of data accessible is 450 calls × 100 tweets per call × four 15-minute periods (one hour) = 180,000 tweets fetched per hour (in which period, on 2013 averages, 20,520,000 new tweets are added). Taking it the other way around, suppose we can use lots of access tokens and we want to be opportunistic and harvest all tweets related to a phenomenon that occurred in the last three days, with approximately 40,000,000 tweets in the corpus. We would need to collect all of those tweets within three days, as the oldest data is already three days old, and so we would need eight access tokens simultaneously gathering tweets, without any replication of data being harvested between them, for three solid days. There are two big assumptions here: first, that we can use eight access tokens to harvest data at the maximum rate for 24 hours per day, without restriction; second, that those accounts can be used so that only ‘fresh’ data is gathered, without replication across the eight.
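
The arithmetic of that thought experiment is easy to reproduce; the snippet below simply restates the figures already given above (including the assumed 450-call ceiling):

```python
# Back-of-the-envelope arithmetic from the thought experiment above, using the
# rate limits and 2013 traffic figures quoted in the text.
CALLS_PER_15_MIN_APP = 450      # assumed application-level ceiling
CALLS_PER_15_MIN_TOKEN = 180    # per-access-token limit
TWEETS_PER_CALL = 100
TWEETS_PER_SECOND_2013 = 5700   # average new tweets posted per second in 2013

fetched_per_hour = CALLS_PER_15_MIN_APP * TWEETS_PER_CALL * 4
posted_per_hour = TWEETS_PER_SECOND_2013 * 3600
print(fetched_per_hour, posted_per_hour)   # 180000 fetched vs 20520000 posted

# Opportunistic harvest: a 40m-tweet corpus, 72 hours before the oldest tweets
# age out of the search window, one user token per collector.
corpus_size = 40_000_000
per_token_per_hour = CALLS_PER_15_MIN_TOKEN * TWEETS_PER_CALL * 4   # 72,000
print(corpus_size / (per_token_per_hour * 72))   # ~7.7, i.e. eight access tokens
```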

In both forms of API access to Twitter we may be forgiven for thinking there is not much wrong: lots of data is available. However, when a researcher begins to ask questions that they would like to answer with that data, particular kinds of problems can arise. By and large, to get to the maximum figures indicated for the API above, one needs to implement a bespoke programme to ensure dedicated access in order to maximise the rate of data collection. Equally, using multiple ‘access tokens’ will, most likely, result in gathering some duplicate data, which will need to be filtered and refined.
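
The deduplication step, at least, is straightforward once the data are in hand. A minimal sketch, assuming tweets are stored as dictionaries carrying Twitter's unique 'id_str' field, with the hypothetical names token_a_tweets and token_b_tweets standing in for data gathered under different access tokens:

```python
# Minimal sketch: merge tweets harvested under several access tokens and drop
# duplicates, keyed on Twitter's unique tweet identifier ('id_str').
def merge_and_deduplicate(*collections):
    """Combine several lists of tweet dicts into one list of unique tweets."""
    seen = {}
    for collection in collections:
        for tweet in collection:
            seen.setdefault(tweet["id_str"], tweet)  # keep the first occurrence
    return list(seen.values())

# e.g. unique_tweets = merge_and_deduplicate(token_a_tweets, token_b_tweets)
```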

In practice, when gathering data through the service ScraperWiki we often encountered rate limiting, which we were powerless to affect. Even with yourTwapperKeeper, for example, one needs better than average IT skills in order to implement an effective data collection method (see Bruns and Liang for an overview of what might be needed). This can, of course, be addressed by working with colleagues with the appropriate skills and may lead to interesting cross-disciplinary collaborations. However, should you wish to search the historical archive of tweets (for example, searching for tweets concerning the UK riots in 2011), this is not possible through the API and you will have to pay a commercial reseller of Twitter data, or ‘certified partner‘ in the jargon, to get those data. Therefore, in order to have a chance at gathering data, researchers using the API need to be opportunist and set ‘scrapes’ of data running as close in time to the activities of interest as possible.

Equally, if one uses broad enough search terms it is entirely possible that the volume of tweets matching the criteria is such that they cannot all be harvested before they drop out of the free-to-access pool of data. Therefore, API-based data gathering for research is best suited to opportunistic, highly specific searches (such as the UK badger cull), rather than topics that trend significantly (such as anything to do with an international celebrity).

At the beginning of the Contagion project we accessed the API through the easy-to-use third party online system ScraperWiki. With that system it was easy for us to set up ‘scrapes’ for tweets and search and order the data we retrieved, download it and analyse it in various ways. However, earlier this year, ScraperWiki had their access to the Twitter API revoked. The tools for searching and collecting Twitter data were stopped and never reactivated. We have therefore had to seek alternative means of accessing data.

A political economy of ‘big data’

The ESRC funds the Contagion project, which aims to investigate the various ways in which contagion is both studied and modelled within a cross-disciplinary setting.

Perhaps the more serious issue to which this situation of access to data alludes is the proprietary nature of access, and indeed the data itself. While (largely unlimited) use of Twitter as a service is free to any user that signs up, access to the data on the platform is not. Twitter is, of course, a business. Just like many other ‘social’ platforms the data Twitter receives from its users is valuable and can be packaged as a commodity. There is therefore a political economy to this kind of ‘big data’ and accordingly political economic issues for ‘big data’ research.

Access is a commodity

If a researcher relies on the free API access to a platform, with its attendant vagaries of how much data one can access and for how long, then that researcher is at the mercy of the changing limits and access policies of that API. On the other hand, if one pays for access to data, to avoid the uncertainty of access (how much data and for how long), then expect to pay handsomely. Both main ‘certified partners’ that sell access to Twitter data, Datasift and Gnip (recently bought by Twitter), render access a commodity – you not only pay for the data but also for the processing power/time it takes to extract it and the ‘enrichments’ they add, by resolving shortened URLs for you, attributing sentiment to a given tweet (positive, neutral, negative) and so on.
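
Some of these ‘enrichments’ can, with effort, be approximated in-house. Resolving a shortened URL, for instance, amounts to following HTTP redirects, as in the rough sketch below (sentiment scoring is a far less trivial matter):

```python
# Rough sketch: resolve a shortened URL (t.co, bit.ly and the like) by
# following HTTP redirects, approximating one of the resellers' 'enrichments'.
import requests

def resolve_url(short_url, timeout=10):
    """Return the final URL that a shortened link redirects to."""
    try:
        # HEAD is cheaper; some shorteners mishandle it, so fall back to GET.
        response = requests.head(short_url, allow_redirects=True, timeout=timeout)
        if response.status_code >= 400:
            response = requests.get(short_url, allow_redirects=True,
                                    timeout=timeout, stream=True)
        return response.url
    except requests.RequestException:
        return short_url  # leave unresolved rather than fail the whole pipeline
```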

The costs charged by ‘resellers’ of data are not insignificant in terms of typical research budgets, with some charging through a subscription model – requiring customers to commit for a minimum of six months. Twitter themselves have advertised their own ‘data grant‘ scheme, which came into operation this year, and offered a limited number of opportunities to access data through a competitive application process, not dissimilar to funding grant calls. Of the 1300 applicants only six (or 0.5 per cent) were granted data (the numbers here come from this Fortune article).

Data are proprietary goods

The corollary to gaining access to proprietary data is that the licence one agrees to abide by for access to Twitter data states that you cannot share that data. Therefore, investing in any form of data access (via the API or a ‘reseller’) through publicly funded research is problematic, for we are all asked to submit data obtained in a publicly-funded project to data archives so that other researchers can access and use it – which is prohibited by Twitter’s Terms of Service (1.4.1). As others have observed, it is possible to get around this by archiving only the unique ID code for each tweet and leaving it up to any future researchers to download the tweets using those IDs, thereby not breaching the Terms of Service (a sketch of this ‘rehydration’ step is given after the ESRC principles below). However, with the limits to the API outlined above, for a large corpus of tweets (> 1m, say) this might take a rather long time. A quick calculation suggests that, using the statuses/lookup API with one ‘access token’, it would take nearly 14 hours (at 100 tweets per request, 180 requests per 15 minutes = 72,000 tweets per hour) of solid use of the API, without any hitches, to download one million tweets. Not impossible then, but perhaps significantly inconvenient – and reliant upon the system of unique IDs remaining the same for the foreseeable future. Furthermore, such restrictions may be said to run counter to the requirements set on research data gathered using UK research council funds. The (UK) ESRC, who funded Contagion, have general principles in their Research Data Policy that suggest:

  • Publicly-funded research data are a public good, produced in the public interest.
  • Publicly-funded research data should be openly available to the maximum extent possible.
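
Returning to the ID-archiving workaround mentioned above, a minimal sketch of that ‘rehydration’ step might look like the following. It assumes user-context OAuth1 credentials (here via the requests_oauthlib library, with placeholder values to be supplied by the researcher) and the v1.1 statuses/lookup endpoint and rate limits quoted above, all of which may have changed since:

```python
# Sketch: 'rehydrate' archived tweet IDs via the v1.1 statuses/lookup endpoint,
# 100 IDs per request, paced to stay within 180 requests per 15 minutes for a
# single access token (i.e. one request every 5 seconds).
import time
import requests
from requests_oauthlib import OAuth1

LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

# Placeholder credentials: replace with your own application and user keys.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."
auth = OAuth1(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)

def rehydrate(tweet_ids):
    """Yield tweet dicts for a list of tweet ID strings, 100 at a time."""
    for start in range(0, len(tweet_ids), 100):
        batch = tweet_ids[start:start + 100]
        response = requests.get(LOOKUP_URL, auth=auth,
                                params={"id": ",".join(batch)})
        response.raise_for_status()
        for tweet in response.json():
            yield tweet
        time.sleep(5)  # 180 requests / 15 minutes == one request every 5 seconds

# At this pace, one million tweets take roughly 14 hours, as noted above.
```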

All of this asks difficult questions of us as researchers: Should we be using data sources (however potentially interesting/valuable) that restrict the capability of reproducing our research results? Should we be using public funds to pay for data that are restricted in such ways?

Not free, not easy

Some argue that conducting research using Twitter data has become something of a fad across academe, but in practice it proves neither to be easy (without non-trivial IT expertise and/or understanding of the policies of Twitter as a company), nor free: it requires investment in terms of hours of work (designing and/or operating systems to collect, store and analyse the data), it may require paid access (depending on what kind of sample of data you require), and it comes with usage restrictions.

This has led to the principal arenas of Twitter-based research occurring outside of the academy – a lot of data science, in fact, is conducted by commercial organisations. Whether or not this research is meaningful is open to interpretation. Nevertheless, it remains the case that, as others have suggested, an awful lot of (computationally-driven) social science is being done by ‘non-academic’ researchers, amongst whom there are significant numbers of people with advanced levels of relevant IT skills. However, I argue that one of the unfortunate effects of this shift in the locus of research is a lack of criticality.

One might convincingly argue, for example, that there is an awful lot of data visualisation for its own sake. It doesn’t necessarily argue anything; instead it describes an impressive amount of data in a visually appealing manner. Equally, there is a tendency in some technically-led social research to assume that the context of data, or even the hypotheses one might pose and use that data to address, are secondary to its formatting or scale. For example, in a conversation with a sales person for a data provider I was advised that, as a geographer, I ought to study the picture-sharing platform Instagram because it had the highest take-up of geo-located content. What that content represents, or what kinds of questions we can or might ask of it, is therefore of secondary importance to the fact that there is geo-location metadata.

This is not to suggest that valuable ‘theory building’ research cannot be conducted through forms of data mining. We might not know the questions we can ask of the kinds (and scales) of data we are being faced with without performing exploratory analyses. Nevertheless, if we want to be surprised by the data (which may include concluding it is not particularly interesting for various reasons), as others have suggested, we surely need to implement critical forms of inquiry.

The point of this blog post is that to study social media data, and in particular Twitter data, is to concern oneself with emerging economies of data and their attendant politics. Rather than being the easy and plentiful sources of research data they are sometimes taken for, platforms like commercial social networking systems require hard work: it is hard to gain access to the data (as non-technical and non-wealthy academic researchers), and some hard critical epistemological reflection is required upon what can and cannot be asked and/or concluded given the specificities of each kind of dataset and data source we use. The means of access, the APIs and other elements necessary to access the data, are important interlocutors in the stories we tell with these data.

It remains possible to do particular kinds of research with the Twitter data one can access through the APIs, but we have to think pretty carefully about what kinds of questions we can and should ask of these data, and the system from which they are derived.