Data has become an integral part of day-to-day human life. Users leave behind a trail of digital footprints that includes both personal and non-personal information. A typical user hands 1.7 megabytes of data to service providers every second and trusts them to keep it safe. However, researchers have found that, in the name of improving the quality of service, service providers knowingly or accidentally put users' personal information at risk of falling into the hands of an adversary. Service providers usually apply masking or anonymization before releasing users' data. Anonymization techniques do not guarantee privacy preservation and have been shown to be prone to cross-linking attacks. In the past, researchers have successfully cross-linked multiple datasets to leak the sensitive information of various users. Because cross-linking attacks are always possible on anonymized datasets, service providers must use a technique that guarantees privacy preservation. Differential privacy is better suited to publishing sensitive information while protecting privacy. It provides mathematical guarantees and resists background-knowledge attacks, so that information remains private regardless of what an adversary might already know. This paper discusses how differential privacy can help achieve privacy guarantees for the release of sensitive heterogeneous datasets while preserving their utility.
Recent advancements in computing have enhanced the way users and organizations interact with data. From personalized recommendations and e-commerce to industrial and scientific research, data is needed everywhere and in great abundance. There are 4.6 billion internet users worldwide¹ (approximately 60% of the world population), each generating 1.7 megabytes of data every second, according to a report². The generated data, in turn, is fed to machine learning and deep learning algorithms to create solutions for self-driving cars, recommendation engines, automated game playing, and so on.
Data is generated across several domains, which makes it heterogeneous. In an Internet of Things (IoT) scenario, data corresponds to values read by sensors installed across the ecosystem. A self-driving car, by contrast, captures videos and images along with sensor values. Similarly, a smart meter generates time-series data of power consumption at the home where it is installed. The method of parsing a video is different from that of parsing an image, and raw sensor values are handled by entirely different tools and techniques. Because the data is so heterogeneous, coming up with a single solution to protect data privacy becomes a daunting task.
Conclusion and future works
According to the dictionary, the term “statistics” refers to data that describes the condition of a group or a community. Differential privacy, as we defined it, captures the same idea: if the presence or absence of an individual sample in a study does not affect the outcome of the study, we can say the outcome is about the group or the community. If, however, the outcome of the study changes, we can say that the outcome was about the few individuals whose data is included in the study. Therefore, differential privacy has two properties: (i) it is stable to small perturbations in the data, and (ii) it is statistical, in that the analysis tells us about the whole community.
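For reference, the informal statement above corresponds to the standard formulation of ε-differential privacy, restated here as a sketch. The symbols are assumptions introduced for illustration and may differ from the notation used elsewhere in this paper: $\mathcal{M}$ is a randomized mechanism, $D$ and $D'$ are neighbouring datasets that differ in one individual's record, $S$ is any set of possible outputs, and $\varepsilon > 0$ is the privacy budget.

```latex
% epsilon-differential privacy (standard form; notation assumed here):
% for all neighbouring datasets D, D' and all output sets S,
\Pr\left[\mathcal{M}(D) \in S\right]
  \;\le\; e^{\varepsilon} \, \Pr\left[\mathcal{M}(D') \in S\right]
```

A small $\varepsilon$ forces the two output distributions to be nearly indistinguishable, which is the sense in which the outcome is stable to the presence or absence of one individual (property (i)) and therefore speaks about the community rather than any single member (property (ii)).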
The first property leads directly to the use of differential privacy in securing machine learning and deep learning models. Differential privacy protects models from adversarial attacks that add small perturbations to the training data [42, 43]. Differential privacy also ensures generalization in adaptive data analysis. Adaptive analysis means that the questions asked and the hypotheses tested depend on the outcomes of earlier questions. Generalization refers to bringing the outcome of a test on a sample as close as possible to the ground truth of the underlying distribution. Generalization also requires the outcome to remain the same no matter how the data is sampled. Answering queries with differentially private mechanisms ensures both privacy and generalizability with high probability. Hence, differentially private mechanisms that add controlled noise show promising statistical results and applications.
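To make "adding controlled noise" concrete, the following is a minimal sketch of the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function name `laplace_count`, its parameters, and the toy `ages` data are hypothetical, and NumPy is assumed to be available. A counting query has sensitivity 1, because adding or removing one individual changes the count by at most 1, so the noise is drawn from a Laplace distribution with scale 1/ε.

```python
import numpy as np


def laplace_count(data, predicate, epsilon, rng=None):
    """Return a noisy count of the records in `data` that satisfy `predicate`.

    Hypothetical helper for illustration. Noise is drawn from
    Laplace(0, sensitivity / epsilon), with sensitivity = 1 for a count.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng

    # Exact answer to the counting query.
    true_count = sum(1 for record in data if predicate(record))

    # One individual's presence or absence shifts a count by at most 1.
    sensitivity = 1.0

    # Smaller epsilon means a larger noise scale and stronger privacy.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise


# Toy data (assumed): how many individuals are older than 40?
ages = [23, 35, 41, 52, 67, 29]
noisy_answer = laplace_count(ages, lambda age: age > 40, epsilon=0.5)
```

Each call returns a different value centred on the true count (3 for the toy data). Averaged over many individuals the answer stays useful, while any single release reveals little about whether a particular person is in the dataset. Repeated queries consume additional privacy budget, which this sketch does not track.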