Security and privacy methods are used to protect data values. Private data values are secured with confidentiality and integrity methods, privacy models hide individual identities in public data collections, and sensitive attributes are protected using anonymity methods. Discrimination is the prejudicial treatment of an individual based on their membership in a certain group or category, and antidiscrimination acts are designed to prevent discrimination on the basis of a number of attributes in various settings. Public data collections are used to train association/classification rules in view of making automated decisions, and data mining can be both a source of discrimination and a means for discovering discrimination. Discrimination is divided into two types: direct and indirect. Direct discrimination occurs when decisions are made based on sensitive attributes; indirect discrimination occurs when decisions are made based on non-sensitive attributes that are strongly correlated with biased sensitive ones. Discrimination discovery and prevention address these antidiscrimination requirements, and direct and indirect discrimination prevention can be applied individually or both at the same time. The data values are transformed to remove direct and/or indirect discriminatory decision rules; data transformation techniques prepare the data values for discrimination prevention, and the rule protection, rule generalization, and direct and indirect discrimination prevention algorithms are used to prevent discrimination. In this work, the discrimination prevention model is integrated with a differential privacy scheme to achieve high privacy, and dynamic policy selection based discrimination prevention is adopted to generalize the system to all regions. The data transformation technique is improved to increase the utility rate, and the discrimination removal process is improved with rule hiding techniques.
Keywords
Discrimination, differential privacy, policy selection, rule protection, rule generalization
I. INTRODUCTION
Data mining and knowledge discovery in databases are two new research areas that investigate the automatic extraction of previously unknown patterns from large collections of data. Recent developments in data collection, data dissemination and related technologies have inaugurated a new era of research in which existing data mining algorithms should be reconsidered from a different point of view, that of privacy preservation. It is well documented that this unbounded explosion of new information through the Internet and other media has reached a point where threats against privacy are common on a daily basis and deserve serious attention.
Privacy preserving data mining is a novel research direction in data mining and statistical databases, in which data mining algorithms are analyzed for the side effects they incur on data privacy. The main consideration in privacy preserving data mining is twofold. First, sensitive raw data like identifiers, gender, religion, addresses and the like should be modified or removed from the original database, so that the recipient of the data cannot compromise another person’s privacy. Second, sensitive knowledge which can be mined from a database by using data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy. The main objective in privacy preserving data mining is to develop algorithms for changing the original data in some way, so that the private data and private knowledge remain private even after the mining process. The problem that arises when confidential information can be derived from released data by unauthorized users is commonly called the “database inference” problem.
II. RELATED WORK
Despite the wide deployment of information systems based on data mining technology in decision making, the issue of antidiscrimination in data mining did not receive much attention until 2008 [9]. Some proposals are oriented to the discovery and measurement of discrimination; others deal with the prevention of discrimination.
The discovery of discriminatory decisions was first proposed by Pedreschi et al. [5]. The approach is based on mining
classification rules (the inductive part) and reasoning on them (the deductive part) on the basis of quantitative measures of
discrimination that formalize legal definitions of discrimination. For instance, the US Equal Pay Act states that: “a selection
rate for any race, gender, or specific group which is less than four-fifths of the rate for the group with the highest rate will
generally be regarded as evidence of adverse impact.” This approach has been extended to encompass statistical
significance of the extracted patterns of discrimination in [3] and to reason about affirmative action and favoritism [4].
Moreover, it has been implemented as an Oracle-based tool in [6]. Current discrimination discovery methods consider each rule individually for measuring discrimination, without considering other rules or the relation between them. In this paper, however, we also take into account the relation between rules for discrimination discovery, based on the existence or nonexistence of discriminatory attributes.
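As a concrete illustration of such a quantitative measure, the following is a minimal sketch of the four-fifths rule quoted above. The group names, counts, and loan-granting context are illustrative assumptions, not data from the cited works.

```python
# Minimal sketch of the four-fifths (adverse impact) check quoted above.
# Group names and counts are illustrative assumptions.

def selection_rate(selected, total):
    """Fraction of applicants from a group that received the benefit."""
    return selected / total

def adverse_impact(groups, threshold=0.8):
    """Flag groups whose selection rate is below four-fifths of the
    highest group's rate, i.e., evidence of adverse impact."""
    rates = {g: selection_rate(s, t) for g, (s, t) in groups.items()}
    best = max(rates.values())
    return {g: r / best < threshold for g, r in rates.items()}

# Hypothetical loan-granting outcomes per group: (granted, applicants).
groups = {"group_A": (48, 60), "group_B": (30, 60)}
print(adverse_impact(groups))  # {'group_A': False, 'group_B': True}
```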
Discrimination prevention, the other major antidiscrimination aim in data mining, consists of inducing patterns that do not
lead to discriminatory decisions even if the original training data sets are biased. Three approaches are conceivable:
A. Preprocessing
Transform the original data in such a way that the discriminatory biases contained in the original data are completely removed, so that no discriminatory decision rule can be mined from the transformed data, and then apply any of the standard data mining algorithms. The preprocessing approaches of data transformation and hierarchy-based generalization can be adapted from the privacy preservation literature. Along this line, [7], [8] perform a controlled distortion of the training data from which a classifier is learned, making minimally intrusive modifications that lead to an unbiased data set. The preprocessing approach is useful for applications in which a data set should be published and/or in which data mining needs to be performed also by external parties.
B. In-processing
Change the data mining algorithms in such a way that the resulting models do not contain discriminatory decision rules. For example, an alternative approach to cleaning the discrimination from the original data set is proposed in [2], whereby the nondiscriminatory constraint is embedded into a decision tree learner by changing its splitting criterion and pruning strategy through a novel leaf relabeling approach. However, in-processing discrimination prevention methods must rely on new special-purpose data mining algorithms; standard data mining algorithms cannot be used.
C. Post-processing
Modify the resulting data mining models, instead of cleaning the original data set or changing the data mining algorithms. For example, in [3], a confidence-altering approach is proposed for classification rules inferred by the CPAR algorithm. The post-processing approach does not allow the data set to be released: only the modified data mining models can be released (knowledge publishing), hence data mining can be performed by the data owner only. One might think of a straightforward preprocessing approach consisting of just removing the discriminatory attributes from the data set.
Although this would solve the direct discrimination problem, it would cause much information loss and in general it would
not solve indirect discrimination. As stated in [9] there may be other attributes (e.g., Zip) that are highly correlated with the
sensitive ones (e.g., Race) and allow inferring discriminatory rules. Hence, there are two important challenges regarding
discrimination prevention: one challenge is to consider both direct and indirect discrimination instead of only direct
discrimination; the other challenge is to find a good tradeoff between discrimination removal and the quality of the
resulting training data sets and data mining models.
Although some methods have already been proposed for each of the above-mentioned approaches (preprocessing, in-processing, post-processing), discrimination prevention remains a largely unexplored research avenue. In this paper, we concentrate on discrimination prevention based on preprocessing, because the preprocessing approach seems the most flexible one: it does not require changing the standard data mining algorithms, unlike the in-processing approach, and it allows data releasing (rather than just knowledge publishing), unlike the post-processing approach.
III. DISCRIMINATION PREVENTION SCHEMES
In sociology, discrimination is the prejudicial treatment of an individual based on their membership in a certain group or category. It involves denying members of one group opportunities that are available to other groups. There is a list of antidiscrimination acts, which are laws designed to prevent discrimination on the basis of a number of attributes (e.g., race, religion, gender, nationality, disability, marital status, and age) in various settings (e.g., employment, access to public services, credit and finance, etc.).
Services in the information society allow for automatic and routine collection of large amounts of data. Those data
are often used to train association/classification rules in view of making automated decisions, like loan granting/denial,
insurance premium computation, personnel selection, etc. At first sight, automating decisions may give a sense of fairness:
classification rules do not guide themselves by personal preferences. However, at a closer look, one realizes that
classification rules are actually learned by the system (e.g., loan acceptance) from the training data. If the training data are
inherently biased for or against a particular community (e.g., black people), the learned model may show discriminatory behavior. In other words, the system may infer that simply being black is a legitimate reason for loan rejection. Discovering such potential biases and removing them from the training data without harming their decision-making utility is therefore highly complex. One must prevent data mining from becoming itself a source of discrimination, due to data mining tasks generating discriminatory models from biased data sets as part of automated decision making. In [9], it is demonstrated that data mining can be both a source of discrimination and a means for discovering discrimination.
Discrimination can be either direct or indirect (also called systematic) [1]. Direct discrimination consists of rules or procedures that explicitly mention minority or disadvantaged groups based on sensitive discriminatory attributes related to group membership. Indirect discrimination consists of rules or procedures that, while not explicitly mentioning discriminatory attributes, intentionally or unintentionally could generate discriminatory decisions. Redlining by financial institutions (refusing to grant mortgages or insurance in urban areas they consider deteriorating) is an archetypal example of indirect discrimination, although certainly not the only one. With a slight abuse of language for the sake of compactness, in this paper indirect discrimination will also be referred to as redlining, and rules causing indirect discrimination will be called redlining rules [9]. Indirect discrimination could happen because of the availability of some background knowledge (rules), for example, that a certain zip code corresponds to a deteriorating area or an area with a mostly black population. The background knowledge might be accessible from publicly available data (e.g., census data) or might be obtained from the original data set itself, because of the existence of nondiscriminatory attributes that are highly correlated with the sensitive ones. Discrimination prevention methods based on preprocessing published so far [7], [8] present some limitations, which we next highlight:
They attempt to find discrimination in the original data only for one discriminatory item and based on a single measure. This approach cannot ensure that the transformed data set is really discrimination free, because it is known that discriminatory behaviors can often be hidden behind several discriminatory items, and even behind combinations of them.
They only consider direct discrimination.
They do not provide any measure to evaluate how much discrimination has been removed and how much information loss has occurred.
IV. DISCRIMINATION PREVENTION ISSUES
Automated data acquisition and data mining techniques such as classification rule mining are used to make automated decisions. Discrimination is divided into two types: direct and indirect. Direct discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination occurs when decisions are made based on non-sensitive attributes which are strongly correlated with biased sensitive ones. Discrimination discovery and prevention address these anti-discrimination requirements. Direct and indirect discrimination prevention can be applied individually or both at the same time. The data values are transformed to remove direct and/or indirect discriminatory decision rules. Data transformation techniques are applied to prepare the data values for discrimination prevention. The rule protection and rule generalization algorithms and the direct and indirect discrimination prevention algorithm are used to prevent discrimination. The following drawbacks are identified in the existing system.
Static discrimination policy based scheme
Limited utility ratio
Low privacy assurance
Privacy association is not analyzed
V. DIRECT AND INDIRECT DISCRIMINATION PREVENTION ALGORITHM
Algorithm 1 details our proposed data transformation method for simultaneous direct and indirect discrimination prevention. The algorithm starts with the redlining rules. From each redlining rule (r : X → C), more than one indirect α-discriminatory rule (r’ : A, B → C) might be generated, for two reasons: 1) existence of different ways to group the items in X into a context item set B and a nondiscriminatory item set D correlated to some discriminatory item set A; and 2) existence of more than one item in DIs. Hence, as shown in Algorithm 1 (Step 5), given a redlining rule r, proper data transformation should be conducted for all indirect α-discriminatory rules r’ : (A ⊆ DIs), (B ⊆ X) → C ensuing from r.
Algorithm 1. Direct and Indirect Discrimination Prevention
If some rules can be extracted from DB as both direct and indirect α-discriminatory rules, there is overlap between MR and RR; in that case, data transformation is performed until both the direct and the indirect rule protection requirements are satisfied (Steps 13-18). This is possible because the same data transformation method (Method 2, consisting of changing the class item) can provide both DRP and IRP. However, if there is no overlap between MR and RR, data transformation is performed according to Method 2 for IRP until the indirect discrimination prevention requirement is satisfied (Steps 19-23) for each indirect α-discriminatory rule ensuing from each redlining rule in RR; this can be done without any negative impact on direct discrimination prevention. Then, for each direct α-discriminatory rule r’ ∈ MR\RR (that is, a rule only directly extracted from DB), data transformation satisfying the direct discrimination prevention requirement is performed (Steps 26-33), based on Method 2 for DRP; this can be done without any negative impact on indirect discrimination prevention. Performing rule protection or generalization for each rule in MR has no adverse effect on the protection achieved for other rules (i.e., rule protection at Step i + x to make r’ protective cannot turn a rule r made protective at Step i back into a discriminatory one) for the two following reasons: the kind of data transformation for each rule is the same (change the discriminatory item set or the class item of records), and there are no two α-discriminatory rules r and r’ in MR such that r = r’.
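To make the flow above concrete, the following is a minimal, hedged sketch of the loop just described, with the overlap case, the indirect-only case, and the direct-only case folded into a single rule list for brevity. The record format, the elift measure, the α value, and the class-flipping step standing in for Method 2 are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the combined direct/indirect prevention loop.
# Records are dicts: {"items": set of "attr=value" items, "class": label}.
# The helpers below are simplified stand-ins for Method 2 (changing the
# class item of records), not the authors' code.

ALPHA = 1.2  # assumed discrimination threshold

def conf(db, premise, outcome):
    """Confidence of the rule premise -> outcome over a list of records."""
    covered = [r for r in db if premise <= r["items"]]
    if not covered:
        return 0.0
    return sum(r["class"] == outcome for r in covered) / len(covered)

def elift(db, A, B, outcome):
    """Extended lift of A, B -> outcome: conf(A,B -> C) / conf(B -> C)."""
    base = conf(db, B, outcome)
    return conf(db, A | B, outcome) / base if base else 0.0

def method2_step(db, A, B, outcome):
    """Flip the class item of one record covered by the rule; this lowers
    conf(A,B -> outcome) and hence the rule's elift."""
    for r in db:
        if (A | B) <= r["items"] and r["class"] == outcome:
            r["class"] = "grant" if outcome == "deny" else "deny"
            return

def prevent(db, rules):
    """rules: (A, B, outcome) triples covering both the direct rules in MR
    and the indirect rules ensuing from each redlining rule in RR."""
    for A, B, outcome in rules:
        while elift(db, A, B, outcome) >= ALPHA:
            method2_step(db, A, B, outcome)
    return db
```

The loop terminates because each class flip removes one record supporting the rule's consequent, so the rule's confidence, and hence its elift, strictly decreases.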
VI. PROPOSED WORK
The proposed discrimination prevention model is integrated with a differential privacy scheme to achieve high privacy. Dynamic policy selection based discrimination prevention is adopted to generalize the system to all regions. The data transformation technique is improved to increase the utility rate. The discrimination removal process is improved with rule hiding techniques that hide sensitive rules.
The discrimination prevention system is designed to protect the decisions that are derived from the rule mining process.
The system is divided into five major modules. They are data cleaning process, privacy preservation, rule mining, rule
hiding and discrimination prevention.
6.1 Differential Privacy to Data
A. Formal Definition
K gives ε-differential privacy if, for all data sets DB, DB’ differing in a single element, and all S ⊆ Range(K):
Pr[K(DB) ∈ S] ≤ e^ε · Pr[K(DB’) ∈ S]
B. How to Achieve Differential Privacy
f : DB → R^d
K(f, DB) = f(DB) + [Noise]^d
E.g., Count(P, DB) = # rows in DB with property P
C. How does it work?
Δf = max_{DB, Me} |f(DB) − f(DB − Me)|, i.e., the maximum change in f when a single record Me is removed.
Theorem: To achieve ε-differential privacy, use scaled symmetric noise Lap(R) with R = Δf/ε.
D. Example
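As an illustration, here is a minimal sketch of the theorem above applied to the count query of part B: a count changes by at most 1 when one row changes, so Δf = 1 and the noise scale is 1/ε. The toy records and ε value are illustrative assumptions.

```python
import numpy as np

def count(db, has_property):
    """Count(P, DB): number of rows of DB with property P."""
    return sum(1 for row in db if has_property(row))

def private_count(db, has_property, epsilon):
    """epsilon-differentially private count: add Lap(delta_f / epsilon)
    noise, with delta_f = 1 for a count query (the theorem above)."""
    delta_f = 1.0
    noise = np.random.laplace(loc=0.0, scale=delta_f / epsilon)
    return count(db, has_property) + noise

# Toy usage: count records with income over 50, at epsilon = 0.5.
db = [{"income": 62}, {"income": 45}, {"income": 71}, {"income": 30}]
print(private_count(db, lambda r: r["income"] > 50, epsilon=0.5))
```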
VII. OVERALL FUNCTIONALITIES OF PROPOSED MODEL
A. Data Cleaning Process
Data population and missing value assignment operations are carried out in the data cleaning process. Textual data values are transferred into the Oracle database, and incomplete transactions are updated with alternate values. An aggregation-based data substitution method is used for the value assignment process.
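A minimal sketch of one form such aggregation-based substitution could take: missing numeric values are filled with the column mean and missing categorical values with the column mode, both computed from the complete records. The column names and data are illustrative assumptions.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with a per-column aggregate: mean for numeric
    columns, mode for categorical ones."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())        # aggregate: mean
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])  # aggregate: mode
    return out

df = pd.DataFrame({"age": [25, None, 40], "job": ["clerk", None, "clerk"]})
print(clean(df))
```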
B. Privacy Preservation
Privacy preservation is applied to protect sensitive attributes. The differential privacy technique is applied to sensitive attributes by adding noise to their values. A data transformation process then prepares the data for the rule mining process.
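One way to read this step is as per-value perturbation of a sensitive numeric attribute before mining; a minimal sketch under that assumption follows, with the attribute, scale, and data all illustrative.

```python
import numpy as np

def perturb_attribute(values, epsilon, sensitivity):
    """Add Laplace noise of scale sensitivity/epsilon to each value of a
    sensitive numeric attribute before the data are passed to mining."""
    scale = sensitivity / epsilon
    return [v + np.random.laplace(0.0, scale) for v in values]

salaries = [52_000, 48_500, 61_200]  # hypothetical sensitive attribute
print(perturb_attribute(salaries, epsilon=1.0, sensitivity=1_000))
```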
C. Rule Mining
The rule mining process is performed to extract frequent patterns. Candidate sets are prepared using attribute names and values, and support and confidence values are estimated over the item sets. Frequent patterns are identified using minimum support and confidence thresholds.
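A minimal sketch of this support/confidence filtering over transactions of "attribute=value" items; the thresholds, class items, and toy data are illustrative assumptions, not the paper's parameters.

```python
from itertools import combinations

MIN_SUP, MIN_CONF = 0.4, 0.7  # assumed thresholds

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def mine_rules(transactions, class_items):
    """Return rules premise -> class_item meeting both thresholds."""
    plain = set().union(*transactions) - class_items
    rules = []
    for k in (1, 2):  # candidate premises of size 1 and 2
        for premise in map(frozenset, combinations(sorted(plain), k)):
            p_sup = support(premise, transactions)
            if p_sup == 0:
                continue
            for c in class_items:
                sup = support(premise | {c}, transactions)
                if sup >= MIN_SUP and sup / p_sup >= MIN_CONF:
                    rules.append((premise, c, round(sup, 2), round(sup / p_sup, 2)))
    return rules

tx = [{"zip=1", "loan=deny"}, {"zip=1", "loan=deny"}, {"zip=2", "loan=grant"}]
print(mine_rules(tx, class_items={"loan=deny", "loan=grant"}))
```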
D. Rule Hiding
The rule hiding method is applied to protect sensitive rules. Rules derived from sensitive attributes are not released directly; instead, they are embedded within the nearest rule intervals.
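A minimal sketch of the withholding part of this step, assuming the (premise, outcome, support, confidence) rule tuples from the mining sketch above; the sensitive attribute names are illustrative, and the interval-embedding step is not modeled here.

```python
SENSITIVE = {"race", "religion", "gender"}  # assumed sensitive attributes

def releasable(rules):
    """Keep only rules that mention no sensitive attribute."""
    safe = []
    for premise, outcome, sup, conf in rules:
        attrs = {item.split("=")[0] for item in premise | {outcome}}
        if attrs & SENSITIVE:
            continue  # sensitive rule: hidden rather than released directly
        safe.append((premise, outcome, sup, conf))
    return safe
```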
E. Discrimination Prevention
The discrimination prevention process is designed to protect decisions. The rule generalization and rule protection algorithms are enhanced for the dynamic policy model, and the direct and indirect discrimination prevention algorithm is likewise tuned for the dynamic policy scheme. Discrimination is prevented with reference to both sensitive and non-sensitive attributes.
VIII. CONCLUSION
Data mining techniques are applied to extract hidden knowledge from databases. Discriminatory decisions are discovered and prevented with reference to the attributes involved. The direct and indirect discrimination prevention scheme is used to protect the decision rules, and it is enhanced with the dynamic policy selection model and differential privacy mechanisms. The system increases the data utility rate, and the policy selection based discrimination prevention model can be applied to all regions. The privacy preservation rate is improved by the system, and rule privacy is optimized with the rule generalization mechanism.