Introduction
Data mining looks for patterns, associations, and trends in data. During this discovery process, personal and sensitive data are sometimes exposed inadvertently, or extracted by covert attacks on the mining process. Maximizing data privacy begins with asking the question, ‘What data is actually needed for analysis?’ There are two general approaches to preserving privacy in data mining: perturbation and cryptographic.
Privacy techniques differ according to how the data is distributed. Centralized data, typically held in corporate or health databases, is protected with perturbation techniques. Privacy-preserving data mining then performs its analysis on data that was transformed before mining; e.g., the age 45 becomes the range 40 to 49, possibly represented by the numeric value 4. Adding ‘noise’, that is, applying techniques that blur the actual values, also helps to hide data. Distributed data, whether horizontally or vertically partitioned and typically associated with cross-database analysis, is protected with cryptographic techniques. Transferring the data encrypted, with sensitive values removed or hidden in some fashion, limits what an intruder can learn.
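The two perturbation ideas above, generalizing a value into a range and adding noise, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the bucket width of 10, the Gaussian noise model, and the sample values are assumptions made for the example, not details taken from the cited sources.

```python
import random


def generalize_age(age, width=10):
    """Map an exact age to a bucket code, e.g. 45 -> 4 (ages 40 to 49)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return age // width


def bucket_range(code, width=10):
    """Return the inclusive age range that a bucket code stands for."""
    low = code * width
    return (low, low + width - 1)


def add_noise(values, scale, rng):
    """Additive perturbation: add zero-mean Gaussian noise to each value.

    The Gaussian distribution and the scale are assumptions of this sketch.
    """
    return [v + rng.gauss(0.0, scale) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

ages = [23, 45, 47, 61]  # hypothetical records
codes = [generalize_age(a) for a in ages]

salaries = [52000.0, 61000.0, 58000.0, 75000.0]  # hypothetical confidential attribute
noisy_salaries = add_noise(salaries, scale=2000.0, rng=rng)

print(codes)            # bucket codes released instead of exact ages
print(bucket_range(4))  # the range that code 4 represents
print(noisy_salaries)   # perturbed values released instead of exact salaries
```

The mining step would then run on the bucket codes and the perturbed values instead of the raw records. A larger noise scale hides individual values better but distorts the aggregate statistics more.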
What is Data Mining
· Modeling:
o Initial Exploration of Data
o Model building or Pattern identification with Validation/Verification
o Deployment
· The Model-Building Process
o Inductive Learning
o Finding patterns
o Identifying classes
o Model must be adaptive
o Supervised Learning – definition
OR
o Unsupervised Learning – examples
· Methods Used
o Statistical Methods
o Machine Learning
· Where Does Privacy Preservation Come In?
o Analytical Processing versus Safeguarding Sensitive Data
o What is the need?
Types of Information to Protect in Data Mining
· Personal information
· Sensitive information
· Collaboration among different agencies
Privacy-Preserving Technology
· Data Mining Algorithm Classification
o Privacy-Preserving Association Rule Mining
o Privacy-Preserving Classification Mining
o Privacy-Preserving Clustering Mining
· Random Data Perturbation Methodologies
o Centralized Data
o Adds random noise to confidential numerical attributes
o Guarantees no complete disclosure
o Partial disclosure is still possible
· Cryptography-Based Methodologies
o Distributed Data
o Secure Multiparty Computation
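As an illustration of the cryptography-based, secure multiparty approach listed above, the following sketch computes a sum across several parties without any party revealing its own value. It uses additive secret sharing, which is one common textbook construction and not necessarily the protocol used in the cited works. The modulus, the function names, and the sample counts are assumptions, and the sketch assumes honest parties and non-negative inputs whose total is smaller than the modulus.

```python
import secrets

# Assumed public modulus; it must be larger than any possible total.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n random shares that sum to value modulo modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate the protocol in one process: return the total of all inputs."""
    n = len(private_values)
    # Row i holds the shares party i sends out; column j is what party j receives.
    distributed = [make_shares(v, n, modulus) for v in private_values]
    # Each party publishes only the sum of the shares it received.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # Combining the published partial sums yields the total of the inputs.
    return sum(partial_sums) % modulus


# Hypothetical per-site counts, e.g. one itemset's support at three databases.
site_counts = [120, 75, 310]
total = secure_sum(site_counts)
print(total)  # 505
```

Each individual share is a uniformly random number, so a party that sees only the shares sent to it learns nothing about another site's count. Only the combined total is revealed, which is what a distributed mining algorithm needs, for example to compute global support over horizontally partitioned data.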
References:
Acquisti, Alessandro, Gritzalis, Stefanos, Lambrinoudakis, Costas, De Capitani di Vimercati, Sabrina (2008). “Digital Privacy: Theory, Technologies, and Practices.” Auerbach Publications, Taylor & Francis Group, LLC.
Evfimievski, Alexandre, Grandison, Tyrone (2009). “Privacy Preserving Data Mining.” IBM Almaden Research Center.
Shen, Yanguang, Han, Junrui, Shao, Hui (2009). “Research on Privacy-Preserving Technology of Data Mining.” IEEE, Second International Conference on Intelligent Computation Technology and Automation.