KEYWORDS
Horizontal Division, Vertical Division, Encryption, Privacy, Database.
INTRODUCTION
The goal of privacy preserving data mining is to develop data mining methods without increasing the risk of misuse of the data used to generate those methods. The topic has been extensively studied by the data mining community in recent years. Many effective techniques for privacy preserving data mining have been proposed; they apply some transformation to the original data in order to preserve privacy. The transformed dataset is made available for mining and must meet privacy requirements without losing the benefit of mining. We classify these techniques into the following three categories:
The randomization method is a popular method in current privacy preserving data mining studies. It masks the values of the records by adding noise to the original data. The noise added is sufficiently large that the individual values of the records can no longer be recovered. However, the probability distribution of the aggregate data can be recovered and subsequently used for privacy-preservation purposes. In general, randomization method aims at finding an appropriate
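The noise-addition idea described above can be illustrated with a minimal sketch. It assumes zero-mean Gaussian noise, synthetic age values, and a hypothetical noise level `NOISE_STD`; none of these choices come from the paper. It only shows that individual values are masked while aggregate statistics (here the mean and the variance) remain approximately recoverable.

```python
import random
import statistics

NOISE_STD = 25.0  # assumed noise level, chosen large relative to the data spread


def randomize(values, noise_std, rng):
    """Mask each value by adding independent zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, noise_std) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [rng.randint(20, 60) for _ in range(10_000)]  # synthetic "original" data
masked = randomize(ages, NOISE_STD, rng)

# Aggregate statistics survive the masking:
# the noise has mean 0, so the mean is approximately preserved, and
# Var(masked) is roughly Var(original) + NOISE_STD**2 because the noise
# is independent of the data.
true_mean = statistics.fmean(ages)
masked_mean = statistics.fmean(masked)
true_var = statistics.pvariance(ages)
estimated_var = statistics.pvariance(masked) - NOISE_STD ** 2

print(f"mean: original={true_mean:.2f} masked={masked_mean:.2f}")
print(f"variance: original={true_var:.1f} estimated={estimated_var:.1f}")
```

A single masked value says little about the underlying record, because the noise standard deviation is larger than the spread of the data, yet the estimates computed over all records stay close to the true aggregates.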
RELATED WORK
The following methods play an important role in our project work, which aims to protect data from insider attacks and thereby improve security.
i) The Anonymization Method:
The anonymization method aims to make each individual record indistinguishable within a group of records by using the techniques of generalization and suppression. The representative anonymization method is k-anonymity. The motivating factor behind the k-anonymity approach is that many attributes in the data can often be considered quasi-identifiers, which can be used in conjunction with public records to uniquely identify the records. Many advanced methods have been proposed, such as p-sensitive k-anonymity, (α, k)-anonymity, t-closeness, l-diversity, m-invariance, and personalized anonymity. The anonymization method can ensure that the transformed data is true, but it also results in information loss to some extent.
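As an illustration of generalization and the k-anonymity condition, the following sketch generalizes two quasi-identifiers and checks that every generalized combination is shared by at least k records. The attribute names, the 10-year age bands, and the 3-digit ZIP prefix are assumptions made for the example, not choices taken from the paper.

```python
from collections import Counter


def generalize(record):
    """Generalize quasi-identifiers: age -> 10-year band, ZIP -> first 3 digits."""
    low = (record["age"] // 10) * 10
    return (f"{low}-{low + 9}", record["zip"][:3] + "**")


def is_k_anonymous(records, k):
    """True if every generalized QID combination is shared by at least k records."""
    groups = Counter(generalize(r) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical records: "age" and "zip" are quasi-identifiers,
# "disease" is the sensitive attribute.
records = [
    {"age": 23, "zip": "30301", "disease": "flu"},
    {"age": 27, "zip": "30305", "disease": "asthma"},
    {"age": 25, "zip": "30309", "disease": "flu"},
    {"age": 41, "zip": "30322", "disease": "diabetes"},
    {"age": 45, "zip": "30329", "disease": "flu"},
]

print(is_k_anonymous(records, 2))  # the smallest group has 2 records
print(is_k_anonymous(records, 3))  # the 40-49 group has only 2 records
```

The information loss mentioned above is visible here: after generalization, an analyst sees only an age band and a ZIP prefix rather than the exact values.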
ii) The Encryption Method:
The encryption method mainly addresses settings in which several parties jointly conduct mining tasks based on the private inputs they provide. These mining tasks may occur between mutually untrusted parties, or even between competitors; therefore, protecting privacy becomes a primary concern in the distributed data mining setting. There are two approaches to distributed privacy preserving data mining: methods for horizontally partitioned data and methods for vertically partitioned data. The encryption method may not be very efficient, but it ensures that the transformed data is exact and secure.
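The difference between the two partitioning schemes can be sketched as follows. The table, the attribute names, and the helper functions `horizontal_partition` and `vertical_partition` are hypothetical and serve only to show which part of the data each party holds; no encryption is performed here.

```python
def horizontal_partition(rows, n_parties):
    """Each party holds a subset of the rows, with all attributes."""
    return [rows[i::n_parties] for i in range(n_parties)]


def vertical_partition(rows, attribute_sets, key="id"):
    """Each party holds every row, but only its own attributes plus a join key."""
    return [
        [{a: row[a] for a in (key, *attrs)} for row in rows]
        for attrs in attribute_sets
    ]


# Hypothetical integrated table.
table = [
    {"id": 1, "age": 34, "zip": "30301", "purchase": "books"},
    {"id": 2, "age": 51, "zip": "30309", "purchase": "music"},
    {"id": 3, "age": 27, "zip": "30305", "purchase": "games"},
    {"id": 4, "age": 45, "zip": "30322", "purchase": "books"},
]

h_parts = horizontal_partition(table, 2)
v_parts = vertical_partition(table, [("age", "zip"), ("purchase",)])

print(h_parts[0])  # party 1: rows 1 and 3, all attributes
print(v_parts[1])  # party 2: all rows, only id and purchase
```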
We consider the collaborative data publishing setting with horizontally partitioned data across multiple data providers. Each provider contributes a subset of records Ti. A data provider could also be the data owner itself, contributing its own records. This is a commonly observed scenario in social networking and recommendation systems. Our main aim is to publish an anonymized view of the integrated data such that a data recipient, including the data providers, will not be able to compromise the privacy of the individual records provided by other parties.
Attacks by Data Providers Using Anonymized Data and Their Own Data: Each data provider, such as P1 in Figure 1, can also use the anonymized data T* and its own data (T1) to infer additional information about other records. Compared with the attack by the external recipient in the first attack scenario, each provider has additional knowledge of its own data records, which can help with the attack. This attack scenario is further worsened when multiple data providers collude with each other.
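The provider attack can be made concrete with a toy sketch. It assumes a single equivalence group of T* and uses k-anonymity as the privacy constraint; the provider labels, the sensitive values, and the brute-force check `is_m_private` are illustrative assumptions and not the verification heuristics referred to in the conclusion. The check simply asks whether the records left over after removing the records of any coalition of up to m providers still satisfy the constraint.

```python
from itertools import combinations

# One equivalence group of the anonymized table T*: all records share the same
# generalized QID values, so only the contributing provider and the sensitive
# value are listed. Labels and values are hypothetical.
group = [
    {"provider": "P1", "disease": "flu"},
    {"provider": "P1", "disease": "asthma"},
    {"provider": "P2", "disease": "cancer"},
]


def remaining_after_collusion(group, colluders):
    """Records that the colluding providers cannot recognize as their own."""
    return [r for r in group if r["provider"] not in colluders]


def is_m_private(group, k, m):
    """Brute force: the group stays k-anonymous against any coalition of <= m providers."""
    providers = sorted({r["provider"] for r in group})
    for size in range(m + 1):
        for coalition in combinations(providers, size):
            rest = remaining_after_collusion(group, set(coalition))
            # An empty remainder exposes nobody, so only non-empty
            # remainders smaller than k count as a violation.
            if rest and len(rest) < k:
                return False
    return True


print(is_m_private(group, k=3, m=0))  # external recipient only: 3 records remain
print(is_m_private(group, k=3, m=1))  # P1 removes its 2 records: 1 record remains
print(remaining_after_collusion(group, {"P1"}))
```

In this example the group looks 3-anonymous to an external recipient, but P1 can subtract its own two records and is left with a single record from P2, whose sensitive value is then exposed.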
PROBLEM STATEMENT
The proposed project work focuses on the problem of privacy in data publishing for databases, and aims to overcome the problem of “insider attack” in order to provide better security.
We consider the collaborative data publishing setting with horizontally distributed data across multiple data providers. Each data provider contributes a subset of records Ti. Each record has an owner, whose identity shall be protected. Each record attribute is either a sensitive attribute, which carries sensitive information about data owners; an identifier, which directly identifies the owner; or a quasi-identifier (QID), which may identify the owner if joined with a publicly known dataset. A data provider could also be the data owner itself, contributing its own records. An attacker wants to breach the privacy of the data using background knowledge as well as the anonymized data. Privacy is breached if one learns anything about the data.
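The role of quasi-identifiers can be illustrated with a small linkage sketch. Both datasets, the attribute names, and the function `link` are hypothetical. The published table has its identifier removed, yet joining it with a public dataset on the QID values re-identifies every owner whose QID combination is unique in both.

```python
from collections import defaultdict

QID = ("age", "zip", "sex")  # assumed quasi-identifier attributes

# Published records: identifier removed, QIDs and sensitive attribute kept.
published = [
    {"age": 34, "zip": "30301", "sex": "F", "disease": "flu"},
    {"age": 34, "zip": "30301", "sex": "M", "disease": "asthma"},
    {"age": 51, "zip": "30309", "sex": "F", "disease": "diabetes"},
]

# Hypothetical publicly known dataset containing identifiers and QIDs.
public = [
    {"name": "Alice", "age": 34, "zip": "30301", "sex": "F"},
    {"name": "Bob", "age": 34, "zip": "30301", "sex": "M"},
    {"name": "Carol", "age": 51, "zip": "30309", "sex": "F"},
    {"name": "Dana", "age": 51, "zip": "30309", "sex": "F"},
]


def link(published, public, qid):
    """Join on QID values; return owners matched to exactly one published record."""
    names = defaultdict(list)
    for person in public:
        names[tuple(person[a] for a in qid)].append(person["name"])
    records = defaultdict(list)
    for rec in published:
        records[tuple(rec[a] for a in qid)].append(rec)
    return {
        names[key][0]: recs[0]["disease"]
        for key, recs in records.items()
        if len(recs) == 1 and len(names[key]) == 1
    }


print(link(published, public, QID))
```

Carol and Dana share the same QID values, so the third published record cannot be attributed to either of them with certainty; this is the effect that generalization tries to create for every record.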
Privacy preserving data publishing for a single database has been extensively studied in recent years. A large body of work contributes to data anonymization, which transforms a dataset to meet a privacy principle such as k-anonymity, using techniques such as generalization or suppression (removal), so that it does not contain individually identifiable information. There are a number of potential approaches one may apply to enable privacy preserving data publishing for distributed databases. A naive approach is for each data custodian to perform data anonymization independently. Data recipients or clients can then query the individual anonymized databases or their integrated view. One main drawback is that the data is anonymized before integration, which causes the data utility to suffer. In addition, individual databases reveal their ownership of the anonymized data. An alternative approach assumes the existence of a third party that is trusted by each of the data owners. In this scenario, data owners send their data to this trusted third party, where data integration and anonymization are performed. Clients can then query the centralized database. However, finding such a trusted third party is not always feasible.
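The utility drawback of the naive approach can be seen in a toy comparison. The sketch assumes a single numeric quasi-identifier, a simplistic anonymizer that sorts the values and generalizes consecutive groups of at least k records to their value range, and an information-loss measure equal to the total width of the published ranges. All of these are illustrative assumptions rather than the algorithms discussed in the paper.

```python
def anonymize(ages, k):
    """Toy k-anonymization: sort, cut into groups of >= k, publish each group's range.

    Assumes len(ages) >= k. Returns one (low, high) range per record.
    """
    ordered = sorted(ages)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        # Merge an undersized last group into the previous one.
        groups[-2].extend(groups.pop())
    return [(g[0], g[-1]) for g in groups for _ in g]


def information_loss(ranges):
    """Total width of the published ranges; wider ranges mean less utility."""
    return sum(high - low for low, high in ranges)


# Hypothetical ages held by two data custodians.
p1 = [21, 59]
p2 = [22, 60]
k = 2

independent = anonymize(p1, k) + anonymize(p2, k)  # anonymize, then integrate
integrated = anonymize(p1 + p2, k)                 # integrate, then anonymize

print(information_loss(independent))  # each custodian must publish a wide range
print(information_loss(integrated))   # similar records from both custodians are grouped
```

When each custodian anonymizes alone, it can only group its own dissimilar records, whereas anonymizing the integrated data can group similar records across custodians and publish much narrower ranges.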
CONCLUSION & FUTURE WORK
We carried out a wide survey of the different approaches to privacy preserving data mining, analyzed the major algorithms available for each method, and pointed out their existing drawbacks. All the proposed methods are able to achieve our goal of privacy preservation, but each has drawbacks; hence there is a need to further refine those approaches or develop better-organized methods.
To this end, we believe that future work should concentrate on the following problems.
1) Privacy and accuracy are conflicting goals; improving one usually incurs a cost in the other. How to apply various optimizations to achieve a good trade-off should be researched in depth.
2) Side effects are unavoidable in the data sanitization process. How to measure and reduce their negative impact on privacy preservation needs to be considered carefully, and metrics for measuring them need to be defined.
3) In distributed privacy preserving data mining, we should try to develop more efficient algorithms and look for a balance between disclosure cost, computation cost, and communication cost.
4) How to deploy privacy-preserving techniques in practical applications also needs to be studied further.
We presented heuristics to verify m-privacy w.r.t. a privacy constraint C. A few of them check m-privacy for EG (equivalence group) monotonic C and use adaptive ordering techniques for higher efficiency. We also presented a provider-aware anonymization algorithm with an adaptive verification strategy to ensure high utility and m-privacy of the anonymized data. Experimental results confirmed that our heuristics perform better than or comparably to existing algorithms in terms of efficiency and utility.
Finally, we emphasize that privacy-preserving technology solves only one side of the problem. It is equally important to identify and overcome the nontechnical difficulties faced by decision makers when they deploy a privacy-preserving technology. Their typical concerns include the degradation of data or service quality, loss of valuable information, increased costs, and increased complexity. We believe that cross-disciplinary research is the key to removing these obstacles, and we urge computer scientists in the privacy protection field to conduct cross-disciplinary research with social scientists in sociology, psychology, and public policy studies.
In the future, the algorithm can be improved to support integrated databases, such as a combination of Oracle, MySQL, and MS-SQL databases, and the project can be made OS independent.
Figures at a glance
Figure 1
Figure 2