Abstract
Introduction
Disclosure risk as a legal issue
How does big data differ compared to “traditional” data?
The ethical use of big data
Principle 1: Public good and avoiding causing harm
Principle 2, the risk of re-identification and attribute disclosure
SDC for big data? Implications and recommendations
Concluding remarks
References
Abstract
Big data holds great potential for research and for society: large volumes of varied data can be produced and made available to researchers much faster than ‘traditional’ data. Whilst this potential is recognized, there are ethical concerns which users of big data must consider. With the volume and variety of information in big data comes a greater risk of disclosure. Researchers and data access services working with highly detailed, sensitive, secure data have grappled with this risk for many years. The sector has developed both ethical frameworks and statistical disclosure control techniques which could be utilized by those working with big data. We discuss the challenges, present some of the frameworks and techniques, and conclude with recommendations for secure access to big data.
1. Introduction
Big data holds great potential: vast amounts of data make possible discoveries that could not have been made before. With the emergence of new forms of data have come new ways of thinking about and analyzing data, which in turn require new platforms for analysis. Big data is generated at higher frequencies than other forms of data, such as social surveys and national censuses, which can take months or even years to be made available to researchers. For researchers used to navigating the various, sometimes lengthy, application processes for other forms of data, the scale and speed of production and the ease of access make big data an attractive prospect, provided they have the computational skills and power to handle it.
There is no consensus on what makes data ‘big’, but a common way of thinking about big data is that it consists of multiple data sources that have been combined or explicitly linked to create a data source of significant size. Big data can be thought of as having three key characteristics: volume, variety and velocity (Soria-Comas and Domingo-Ferrer, 2016; Schroeder, 2014). Volume is self-explanatory: it refers to the size of the dataset, formed from multiple sources of data combined through some linking variable. Such datasets are much larger than social survey datasets and require significantly more computational power (Sfetcu, 2019). The combination of data sources leads to the second characteristic: variety. A big data set will contain information on many different aspects of people's lives. For example, digital trace data might combine information on retail transactions, location histories collected using a mobile phone's GPS, websites visited and so on. The final characteristic, velocity, refers to the increased speed of data collection and processing.
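To make the link between variety and disclosure concrete, the following minimal sketch joins two sources on a linking variable and counts how many records are unique on their combined attributes. It is an illustration only, not a method taken from the literature cited here: the toy data, the `person_id` keys, the attribute names and the helper functions `link` and `count_unique` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: two sources keyed by a shared linking variable (person_id).
retail = {
    1: {"store": "A"},
    2: {"store": "A"},
    3: {"store": "B"},
    4: {"store": "B"},
}
location = {
    1: {"area": "North"},
    2: {"area": "South"},
    3: {"area": "North"},
    4: {"area": "North"},
}


def link(left, right):
    """Join two sources on their shared key, keeping only keys present in both."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


def count_unique(records, attributes):
    """Count records whose combination of attribute values occurs exactly once."""
    combos = Counter(
        tuple(record[a] for a in attributes) for record in records.values()
    )
    return sum(
        1
        for record in records.values()
        if combos[tuple(record[a] for a in attributes)] == 1
    )


linked = link(retail, location)
unique_before = count_unique(retail, ["store"])
unique_after = count_unique(linked, ["store", "area"])
print(unique_before, unique_after)
```

In this toy data no record is unique on store alone, but two of the four records become unique once the area attribute is linked in. This illustrates how added variety can raise the risk that an individual is singled out.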
That big data offers great potential is certain, but the picture is not uniformly positive. As with all forms of data, big data brings challenges and concerns, and these have been widely discussed.¹ Many ethical issues have been raised in relation to big data, particularly around consent and privacy. These ethical issues are complex, and a complete overview is not attempted here. Neither is the goal to provide a complete guide on how to use big data ethically. The focus of this article is on disclosure risk (the risk that individual data subjects will be identified) as a key ethical issue of big data, and on exploring which lessons can be drawn from the experiences of secure data access services.