Towards Privacy Preserving Intelligent Systems
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS. ORCID iD: 0000-0001-6934-0378
2023 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Intelligent systems, i.e., digital systems containing smart devices that can gather, analyze, and act in response to the data they collect from their surrounding environment, have progressed from theory to application, especially in the last decade, thanks to recent technological advances in sensors and machine learning. These systems can dynamically make decisions on users' behalf by learning their behavior over time. The number of such smart devices in our surroundings is increasing rapidly. Since these devices in most cases handle privacy-sensitive data, privacy concerns are increasing at a similar rate. However, privacy research has not kept pace with these developments. Moreover, the systems are heterogeneous in nature (e.g., in terms of form factor, energy, processing power, and use-case scenarios) and continuously evolving, which makes the privacy problem even more challenging.

In this thesis, we identify open privacy problems of intelligent systems and propose solutions to some of the most prominent ones. We first investigate privacy concerns in the context of data stored on a single smart device. We find that an ownership change of a smart device can leak privacy-sensitive information stored on the device. To address this, we propose a framework, based on context detection and data encryption, that protects the privacy of owners during ownership changes of smart devices. Moving from the single-device setting to more complex systems involving multiple devices, we conduct a systematic literature review and a review of commercial systems to identify the unique privacy concerns of home-based health monitoring systems. From the review, we distill a common architecture covering most commercial and academic systems, including an inventory of the concerns they address, their privacy considerations, and how they handle data. Based on this, we identify potential privacy intervention points in such systems.

For the publication of collected data, or of a machine-learning model trained on such data, we explore the potential of synthetic data as a tool for achieving a better trade-off between privacy and utility than traditional privacy-enhancing approaches. We perform a thorough assessment of the utility of synthetic tabular data. Our investigation reveals that none of the commonly used utility metrics for assessing how well synthetic data corresponds to the original data can predict whether, for a given univariate or multivariate statistical analysis that is not known beforehand, synthetic data achieves utility similar to the original data. For machine learning-based classification tasks, however, the Confidence Interval Overlap metric shows a strong correlation with how similarly machine learning models trained on synthetic versus original data perform. Concerning privacy, we explore membership inference attacks against machine learning models, which aim to find out whether some (or someone's) particular data was used to train the model. We find that training on synthetic data instead of original data can significantly reduce the effectiveness of membership inference attacks. For image data, we propose a novel methodology to quantify, improve, and tune the privacy-utility trade-off of the synthetic image data generation process compared to traditional approaches.

Overall, our exploration in this thesis reveals several open research questions regarding privacy at different phases of the data lifecycle of intelligent systems, such as privacy-preserving data storage, possible inferences due to data aggregation, and the quantification and improvement of the privacy-utility trade-off for achieving better utility at an acceptable level of privacy in a data release. The privacy concerns identified in this thesis, together with the corresponding solutions, will help the research community recognize and address the remaining privacy concerns in the domain. Solving these concerns will encourage end-users to adopt such systems and enjoy their benefits without having to worry about privacy.


Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2023, p. xii, 41.
Series
TRITA-EECS-AVL ; 2023:17
Keywords [en]
Privacy, Intelligent Systems, Synthetic data, Machine Learning
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-326694
ISBN: 978-91-8040-582-9 (print)
OAI: oai:DiVA.org:kth-326694
DiVA id: diva2:1755956
Public defence
2023-06-02, https://kth-se.zoom.us/j/66441177033, E2, Lindstedtsvägen 3, Floor 3, Stockholm, 09:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 7132
Note

QC 20230510

Available from: 2023-05-10. Created: 2023-05-09. Last updated: 2023-05-15. Bibliographically approved.
List of papers
1. chownIoT: Enhancing IoT Privacy by Automated Handling of Ownership Change
2018 (English). In: Privacy and Identity Management. Fairness, Accountability, and Transparency in the Age of Big Data, 2018, Vol. 547, p. 205-221. Conference paper, Published paper (Refereed).
Abstract [en]

Considering the increasing deployment of smart home IoT devices, their ownership is likely to change during their life cycle. IoT devices, especially those used in smart home environments, contain privacy-sensitive user data, and any ownership change of such devices can result in privacy leaks. The problem arises when users are either not aware of the need to reset or reformat the device to remove personal data, or not trained to do it correctly, as it can be unclear what data is kept where. In addition, if the ownership change is due to theft or loss, there is no opportunity to reset at all. Although there has been a lot of research on the security and privacy of IoT and smart home devices, to the best of our knowledge there is no prior work specifically on automatically securing ownership changes. We present a system called chownIoT for securely handling ownership change of IoT devices. chownIoT combines authentication (of both users and their smartphones), profile management, data protection by encryption, and automatic inference of ownership change. For the latter, we use a simple technique that leverages the context of a device. Finally, as a proof of concept, we develop a prototype that infers ownership change from changes in the WiFi SSID. The performance evaluation of the prototype shows that chownIoT has minimal overhead and is compatible with the dominant IoT boards on the market.
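The context-based inference described above can be illustrated with a minimal sketch (all names here are hypothetical; the actual chownIoT prototype also performs authentication and encryption, which are stubbed out): the device remembers its trusted WiFi SSIDs, and after several consecutive sightings of an unknown SSID it assumes an ownership change and protects the stored profile.

```python
class OwnershipMonitor:
    """Infer an ownership change from repeated sightings of an unknown WiFi SSID.

    Hypothetical sketch of the chownIoT idea; `protect_profile` stands in
    for the real encryption step applied to the owner's profile data.
    """

    def __init__(self, trusted_ssids, threshold=3):
        self.trusted_ssids = set(trusted_ssids)
        self.threshold = threshold      # consecutive unknown sightings needed
        self.unknown_streak = 0
        self.profile_protected = False

    def protect_profile(self):
        # In the real system: encrypt the current owner's profile on the device.
        self.profile_protected = True

    def observe(self, ssid):
        if ssid in self.trusted_ssids:
            self.unknown_streak = 0     # still in a known context
        else:
            self.unknown_streak += 1
            if self.unknown_streak >= self.threshold and not self.profile_protected:
                self.protect_profile()  # assume the device changed hands
        return self.profile_protected


monitor = OwnershipMonitor({"home-net"}, threshold=3)
for ssid in ["home-net", "cafe-wifi", "cafe-wifi", "cafe-wifi"]:
    protected = monitor.observe(ssid)
print(protected)  # True: three consecutive unknown SSIDs triggered protection
```

A streak threshold rather than a single sighting avoids locking the profile on a brief visit to an unknown network, at the cost of a short detection delay.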

Keywords
Ownership, Privacy, Smart home, IoT
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-267328 (URN)
10.1007/978-3-030-16744-8_14 (DOI)
000559105400014 ()
2-s2.0-85065286871 (Scopus ID)
978-3-030-16743-1 (ISBN)
978-3-030-16744-8 (ISBN)
Conference
IFIP International Summer School on Privacy and Identity Management 2018
Note

QC 20200608

Available from: 2020-02-06. Created: 2020-02-06. Last updated: 2024-03-18. Bibliographically approved.
2. Systematization of Knowledge of Ambient Assisted Living Systems: A Privacy Perspective
2022 (English). In: Transactions on Data Privacy, ISSN 1888-5063, E-ISSN 2013-1631, Vol. 15, no. 1, p. 1-40. Article in journal (Refereed). Published.
Abstract [en]

The confluence of several developments makes privacy of ambient assisted living (AAL) an increasingly important problem: an aging population; the scale and availability of sensors (IoT, health monitoring, smart home), leading to higher quality and quantity of sensitive data; and advances in data analysis and learning. Privacy research has not been in sync with these developments. For AAL systems to be useful and used, they need to be trustworthy and protect the users' privacy. We conducted a systematic literature review of recent AAL research to provide a map of potential privacy concerns. We also collected already available commercial systems for a comparison with those found in the academic literature. We were able to distill a common architecture covering most commercial and academic systems, including an inventory of the concerns they address, the technologies they apply, their data handling, and their privacy considerations. Based on this outcome, we identified potential intervention points for privacy.

Keywords
health monitoring, ambient assisted living, smart home, IoT, privacy
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-285908 (URN)
000793792600001 ()
2-s2.0-85130147671 (Scopus ID)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20220506

Available from: 2020-11-13. Created: 2020-11-13. Last updated: 2023-09-11. Bibliographically approved.
3. Utility Assessment of Synthetic Data Generation Methods
2022 (English). Conference paper, Published paper (Refereed).
Abstract [en]

Big data analysis poses the dual problem of privacy preservation and utility: how accurate do data analyses remain after the original data has been transformed to protect the privacy of the individuals the data is about, and are they accurate enough to be meaningful? In this paper, we investigate across several datasets whether different methods of generating fully synthetic data vary in their utility a priori (when the specific analyses to be performed on the data are not yet known), how closely their results conform to analyses on the original data a posteriori, and whether these two effects are correlated. We find that some methods (decision-tree based) perform better than others across the board; that some choices of imputation parameters (notably the number of released datasets) have sizeable effects; that broad utility metrics do not correlate with analysis accuracy; and that correlations vary for narrow metrics. We obtained promising findings for classification tasks when using synthetic data to train machine-learning models, which we consider worth exploring further, also in terms of mitigating privacy attacks against ML models such as membership inference and model inversion.
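One narrow metric highlighted in the thesis summary, Confidence Interval Overlap, can be computed as the average share of each interval covered by the other. This is a common formulation in the synthetic-data literature; the exact variant used in the paper may differ (some variants allow negative scores for disjoint intervals, which are clamped to zero here):

```python
def ci_overlap(l1, u1, l2, u2):
    """Confidence Interval Overlap between intervals [l1, u1] and [l2, u2].

    Returns the average of the overlap's share of each interval:
    1.0 for identical intervals, 0.0 for disjoint ones.
    """
    overlap = max(0.0, min(u1, u2) - max(l1, l2))
    return 0.5 * (overlap / (u1 - l1) + overlap / (u2 - l2))


# Identical intervals overlap fully; a half-shifted interval scores 0.5.
print(ci_overlap(0.0, 1.0, 0.0, 1.0))  # 1.0
print(ci_overlap(0.0, 1.0, 0.5, 1.5))  # 0.5
print(ci_overlap(0.0, 1.0, 2.0, 3.0))  # 0.0
```

In the synthetic-data setting, the two intervals would be the confidence intervals of the same estimate (e.g., a model coefficient) computed on the original and on the synthetic data; a score near 1 means the synthetic analysis supports the same inference.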

Keywords
Synthetic Data, Utility Metrics, Analysis, Correlation.
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-315977 (URN)
Conference
Privacy in Statistical Databases
Note

QC 20220817

Available from: 2022-08-03. Created: 2022-08-03. Last updated: 2023-05-09. Bibliographically approved.
4. The Impact of Synthetic Data on Membership Inference Attacks
(English). Manuscript (preprint) (Other academic).
Abstract [en]

Privacy of machine learning on big data has become a prominent issue in recent years due to the increased availability and usage of sensitive personal data to train models. Membership inference attacks are one such issue, identified as a major privacy threat against machine learning models. Several techniques, including differential privacy, have been advocated to mitigate the effectiveness of inference attacks; however, they come at the cost of reduced utility/accuracy. Synthetic data is one approach that has recently been widely studied as a tool for privacy preservation, but not yet much in the context of membership inference attacks. In this work, we aim to deepen the understanding of the impact of synthetic data on membership inference attacks. We compare models trained on original versus synthetic data, evaluate different synthetic data generation methods, and study the effect of overfitting on membership inference attacks. Our investigation reveals that training on synthetic data can significantly reduce the effectiveness of membership inference attacks compared to models trained directly on the original data. This also holds for highly overfitted models, which have been shown to increase the success rate of membership inference attacks. We also find that different synthetic data generation methods do not differ much in terms of membership inference attack accuracy, but they do differ in terms of utility (observed via train/test accuracy). Since synthetic data shows promising results against the binary classification-based membership inference attacks on classification models explored in this work, exploring the impact on other attack types, other models, and attribute inference attacks would be worthwhile.
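The intuition behind membership inference can be sketched with a simple confidence-thresholding attacker (a hypothetical illustration, not the attack evaluated in the paper): because an overfitted model is more confident on its training members than on unseen points, guessing "member" whenever the model's confidence exceeds a threshold already beats random guessing, and training on synthetic data closes that confidence gap.

```python
def mia_accuracy(member_conf, nonmember_conf, threshold):
    """Balanced accuracy of a threshold attacker that predicts 'member'
    whenever the target model's confidence on a sample exceeds `threshold`.

    0.5 corresponds to random guessing; higher means a more effective attack.
    """
    tpr = sum(c > threshold for c in member_conf) / len(member_conf)
    tnr = sum(c <= threshold for c in nonmember_conf) / len(nonmember_conf)
    return 0.5 * (tpr + tnr)


# Overfitted model trained on the original data: members get high confidence.
members = [0.99, 0.97, 0.95, 0.98]
nonmembers = [0.60, 0.72, 0.95, 0.91]
print(mia_accuracy(members, nonmembers, threshold=0.93))  # 0.875

# Model trained on synthetic data: confidences on the real members now look
# like confidences on non-members, so the attack falls to chance or below.
members_syn = [0.70, 0.62, 0.66, 0.58]
print(mia_accuracy(members_syn, nonmembers, threshold=0.93))  # 0.375
```

The confidence values and threshold here are made up for illustration; real attacks typically calibrate the threshold (or train an attack model) on shadow models.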

Keywords
Synthetic Data, Machine Learning, Membership Inference Attack, Accuracy
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-326693 (URN)
Funder
Knut and Alice Wallenberg Foundation
Note

QC 20230510

Available from: 2023-05-08. Created: 2023-05-08. Last updated: 2023-05-10. Bibliographically approved.
5. AnonFACES: Anonymizing Faces Adjusted to Constraints on Efficacy and Security
Open this publication in new window or tab >>AnonFACES: Anonymizing Faces Adjusted to Constraints on Efficacy and Security
2020 (English). In: Proceedings of the 19th Workshop on Privacy in the Electronic Society, Association for Computing Machinery (ACM), 2020, p. 87-100. Conference paper, Published paper (Refereed).
Abstract [en]

Image data analysis techniques such as facial recognition can threaten individuals' privacy. Whereas privacy risks can often be reduced by adding noise to the data, this approach reduces the utility of the images. For this reason, image de-identification techniques typically replace directly identifying features (e.g., faces, car number plates) present in the data with synthesized features, while still preserving other non-identifying features. To date, existing techniques mostly focus on improving the naturalness of the generated synthesized images without quantifying their impact on privacy. In this paper, we propose the first methodology and system design to quantify, improve, and tune the privacy-utility trade-off, while simultaneously also improving the naturalness of the generated images. The system design is broken down into three components that address separate but complementary challenges. These include a two-step cluster analysis component to extract low-dimensional feature vectors representing the images (embeddings) and to cluster the images into fixed-size clusters. While the importance of good clustering has mostly been neglected in previous work, we find that our novel approach of using low-dimensional feature vectors can improve the privacy-utility trade-off by better clustering similar images. These embeddings are particularly useful for ensuring high naturalness and utility of the synthetically generated images. By combining improved clustering and incorporating StyleGAN, a state-of-the-art generative neural network, into our solution, we produce more realistic synthesized faces than prior works, while also better preserving properties such as age, gender, skin tone, and even emotional expressions. Finally, our iterative tuning method exploits non-linear relations between privacy and utility to identify good privacy-utility trade-offs. An example benefit of these improvements is that our solution allows car manufacturers to train their autonomous vehicles while complying with privacy laws.
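The fixed-size clustering step can be illustrated with a simple greedy sketch over embedding vectors (a hypothetical illustration in the spirit of k-same clustering; the paper's actual two-step pipeline differs in its details): repeatedly take an unassigned image and group it with its k-1 nearest unassigned neighbors, so that each cluster of k faces can later be replaced by one synthesized face.

```python
import math

def fixed_size_clusters(embeddings, k):
    """Greedily partition embedding vectors into clusters of exactly k points.

    Hypothetical sketch of fixed-size clustering for face anonymization:
    replacing each cluster of k similar faces with a single synthesized
    face gives k-anonymity-style protection. Assumes len(embeddings) is a
    multiple of k.
    """
    unassigned = list(range(len(embeddings)))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        # Order remaining points by distance to the seed and take the k-1 nearest.
        unassigned.sort(key=lambda j: math.dist(embeddings[seed], embeddings[j]))
        members = [seed] + unassigned[: k - 1]
        unassigned = unassigned[k - 1:]
        clusters.append(sorted(members))
    return clusters


# Two tight groups in 2-D embedding space end up in separate clusters.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(fixed_size_clusters(points, 2))  # [[0, 1], [2, 3]]
```

Grouping only similar faces is what improves the trade-off: the synthesized replacement stays close to every face it stands in for, so utility (age, gender, expression) is preserved while each face still hides among k candidates.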

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-285906 (URN)10.1145/3411497.3420220 (DOI)2-s2.0-85097241828 (Scopus ID)
Conference
Workshop on Privacy in the Electronic Society
Note

QC 20201117

Available from: 2020-11-13 Created: 2020-11-13 Last updated: 2023-05-09Bibliographically approved

Open Access in DiVA

fulltext (783 kB), 356 downloads
File name: FULLTEXT01.pdf
File size: 783 kB
Checksum (SHA-512): 66278dbcd9e1a03d1292038c63c38d1cb62fad71dc34f3b6dcf8b5d8f44a2612fe43d551fbcafb7627a45cd62c688a6dd88780034e8f5ece7bd4a9471c86b8d0
Type: fulltext
Mimetype: application/pdf

Authority records

Khan, Md Sakib Nizam
