Abstract
1 Introduction
2 Risks involved in releasing data
3 Differential privacy
4 Differential privacy for machine and deep learning
5 Strategies for data release
6 Heterogeneous datasets
7 Limitations of differential privacy
8 Use cases
9 Conclusion and future works
References
Abstract
Data has become an integral part of day-to-day human life. Users leave behind a trail of digital footprints that includes their personal and non-personal information. An average user puts 1.7 megabytes of data every second into the hands of service providers and trusts them to keep it safe. However, researchers have found that, in the name of improving quality of service, service providers, knowingly or accidentally, put users' personal information at risk of falling into the hands of an adversary. Service providers usually apply masking or anonymization before releasing users' data. Anonymization techniques do not guarantee privacy preservation and have proven prone to cross-linking attacks. In the past, researchers were able to successfully cross-link multiple datasets and leak the sensitive information of various users. Cross-linking attacks are always possible on anonymized datasets, and therefore service providers must use a technique that guarantees privacy preservation. Differential privacy is superior for publishing sensitive information while protecting privacy. It provides mathematical guarantees and prevents background-knowledge attacks, so that information remains private regardless of what information an adversary might have. This paper discusses how differential privacy can help achieve privacy guarantees for the release of sensitive heterogeneous datasets while preserving their utility.
Introduction
Recent advancements in computing have enhanced the way users and organizations interact with data. From personalized recommendations and e-commerce to industrial and scientific research, data is needed everywhere, and in vast quantities. There are 4.6 billion internet users worldwide (approximately 60% of the world population), each generating 1.7 megabytes of data every second, according to a report. The generated data, in turn, is used with machine learning and deep learning algorithms to create solutions for self-driving cars, recommendation engines, automated game playing, and so on.
Data generation is diversified across several domains, which makes the data heterogeneous. In an Internet of Things (IoT) scenario, data corresponds to values read by sensors installed across the ecosystem. A self-driving car, meanwhile, captures videos and images along with sensor readings, and a smart meter generates time-series data of power consumption at the home where it is installed. The method of parsing a video is different from that of an image, and raw sensor values are handled by entirely different tools and techniques. Due to this heterogeneous nature, coming up with a single solution to protect data privacy becomes a daunting task.
Conclusion and future works
By its dictionary definition, the term "statistics" deals with data that describes the condition of a group or a community. Differential privacy, as we defined it, states: if the presence or absence of an individual sample in a study does not affect the outcome of the study, we can say the outcome is about the group or the community. If, however, the outcome of the study changes, we can say that the outcome was about the few individuals whose data was included in the study. Therefore, differential privacy has two properties: (i) it is stable to small perturbations in the data, and (ii) it is statistical, in that the analysis tells us about the whole community.
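For reference, this informal statement corresponds to the standard formal guarantee (the notation below is ours, not quoted from this excerpt): a randomized mechanism \(\mathcal{M}\) is \(\varepsilon\)-differentially private if, for every pair of neighboring datasets \(D\) and \(D'\) differing in one individual's record, and every set of outcomes \(S\),

\[
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S].
\]

A small \(\varepsilon\) means that one individual's data barely shifts the distribution of outcomes, which is exactly the stability property stated above.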
The existence of the first property directly leads to the use of differential privacy in securing machine/deep learning models. Differential privacy protects the models from adversarial attacks that focus on adding small perturbations to the training data [42, 43]. Differential privacy also ensures generalization in adaptive data analysis. Adaptive analysis means that the questions asked and hypotheses tested depend on the outcomes of earlier questions. Generalization refers to bringing the outcome of a test on a sample as close to the ground truth of the underlying distribution as possible; it also requires the outcome to remain the same no matter how the data is sampled. Answering with differentially private mechanisms ensures both privacy and generalizability with high probability. Hence, differentially private mechanisms that add controlled noise have promising statistical results and applications.
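As a minimal sketch of what "adding controlled noise" looks like in practice, the snippet below implements the classic Laplace mechanism in Python. The dataset, threshold, and parameter values are illustrative assumptions, not taken from the paper.

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Standard calibration: Laplace noise with scale = sensitivity / epsilon
    # yields an epsilon-differentially-private numeric answer.
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical example: privately release a count. A counting query has
# sensitivity 1, since one person's record changes the count by at most 1.
ages = [23, 35, 41, 29, 52, 67, 31]  # toy data
true_count = sum(1 for age in ages if age > 40)
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(private_count)

Smaller values of epsilon produce noisier answers, trading utility for stronger privacy; this is the stability-versus-accuracy balance discussed above.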