What is Data Mining
Data mining looks for patterns, associations, and trends. In this discovery process, personal and sensitive data are sometimes inadvertently exposed, or extracted through covert actions that intrude into the mining process. Maximizing data privacy begins with asking the question, ‘What data is actually needed for the analysis?’
What Data Is Necessary in Data Mining
How much personal information matters depends on the industry. In healthcare or education, the name is important. Counting the number of times a person buys a certain laundry detergent, by contrast, does not require associating a name with the purchases. Data may span several databases from different organizations. Sensitive data such as salary is pertinent to accounting, whereas a health condition is relevant to a doctor, not to accounting. Data of interest may be a set of attributes associated with events, persons, or places. When the intent of a data analysis requires collaboration, how much data should be visible to the collaborating agencies?
Data Mining Process
Techniques for privacy differ according to the distribution of the data. Centralized data, typically held in corporate or health databases, uses perturbation privacy techniques. Privacy-preserving data mining performs its analysis on data that was transformed before mining; e.g., the age 45 would become the range 40 to 49, possibly represented by the numeric value 4. Adding ‘noise’, such as false or additional data, blurs the actual values and helps to hide them. Distributed data, whether horizontally or vertically partitioned and typically associated with cross-database analyses, employs cryptographic privacy techniques. Transferring encrypted data, with sensitive data removed or hidden in some fashion, minimizes the opportunity for intrusion.
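The two perturbation ideas described above, generalizing a value into a range and blurring a value with noise, can be illustrated with a minimal Python sketch. The function names, the sample ages and salaries, and the choice of uniform noise are assumptions made for illustration only; real systems choose bucket widths and noise distributions to suit the analysis.

```python
import random


def generalize_age(age: int) -> int:
    """Map an exact age to its decade bucket, e.g. 45 -> 4 (ages 40 to 49)."""
    return age // 10


def add_noise(value: float, scale: float, rng: random.Random) -> float:
    """Perturb a value with uniform noise drawn from [-scale, scale].

    Uniform noise is an illustrative assumption, not a recommendation.
    """
    return value + rng.uniform(-scale, scale)


# Hypothetical sample data, not taken from any real database.
rng = random.Random(0)
ages = [45, 41, 49, 23, 67]
salaries = [52000.0, 61000.0, 48000.0]

# The analysis would see only the buckets and the blurred salaries.
buckets = [generalize_age(a) for a in ages]
noisy_salaries = [add_noise(s, 1000.0, rng) for s in salaries]

print(buckets)
```

The miner then works with decade buckets and blurred salaries rather than the exact values, which is the sense in which the data is transformed prior to mining.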
The Model-Building Process
Mining data is primarily an inductive learning process. It focuses on finding patterns and identifying classes or groupings of attributes. The model developed by the data mining process must be adaptive, so that its focus can be steered toward a different set of patterns.
Under supervised control, the modeling process provides insight into trends and concentrations of specified attributes. Working from examples, it mines the data and discovers potential patterns that were not necessarily anticipated. Machine learning and statistical methods are the major tools of data mining analysis. The primary purpose of the data mining process is to uncover information from very large amounts of data.
Approaches to Privacy in Data Mining
Data mining is an analytical process applied to large data sets, and the process does not care what the data is. Its results provide information for making decisions or for reviewing the status of an organization’s products and services. Since the process is essentially blind to the data, sensitive and personal data must be preprocessed and/or hidden beforehand. Doing so starts with answering two questions: ‘What is the need?’ and ‘What is the outcome?’
There are three techniques used for preserving privacy: association, classification, and clustering. Algorithms use any combination of them to maximize privacy. Association rule mining discovers occurrences that happen together, e.g., ‘if this, then that’ tends to occur 40% of the time (Dunham, 2000). Once the 40% is accepted as a rule, the actual data occurrences need not be physically part of the data analysis. Classification mining determines rules of association that classify the data (Dutt, 2005). For example, someone with a number of years working in a career, a doctorate, and teaching experience is a top candidate for a CTU professor position, whereas someone with only a master’s degree would be a candidate for teaching undergraduate courses. Classification mining is an input to machine learning methods such as neural networks. The analysis may then view only the classifications. Data mining that employs clustering would see sensitive data replaced by categories. These approaches work for centralized data. In distributed data analysis, encryption techniques provide additional privacy.
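As an illustration of the ‘if this, then that’ figure, the following Python sketch computes how often a consequent appears among transactions that contain an antecedent. Reading the 40% as this ratio (often called rule confidence) is an assumption, and the function name and the sample purchases are hypothetical. The transactions carry no names, only items.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # The rule never applies, so report zero.
    both = [t for t in with_antecedent if consequent <= set(t)]
    return len(both) / len(with_antecedent)


# Hypothetical, anonymous purchase records: no customer names are stored.
transactions = [
    {"detergent", "softener"},
    {"detergent", "softener"},
    {"detergent"},
    {"detergent", "bleach"},
    {"detergent"},
    {"bleach"},
]

# "If detergent, then softener" holds in 2 of the 5 detergent purchases.
confidence = rule_confidence(transactions, {"detergent"}, {"softener"})
print(f"{confidence:.0%}")
```

Once such a rule is accepted, downstream analysis can work from the rule and its percentage instead of the underlying purchase records.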