Title: | Fair Data Adaptation with Quantile Preservation |
---|---|
Description: | An implementation of the fair data adaptation with quantile preservation described in Plecko & Meinshausen (2019) <arXiv:1911.06685>. The adaptation procedure uses the specified causal graph to pre-process the given training and testing data in such a way as to remove the bias caused by the protected attribute. The procedure uses tree ensembles for quantile regression. |
Authors: | Drago Plecko [aut, cre], Nicolas Bennett [aut] |
Maintainer: | Drago Plecko <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.6 |
Built: | 2024-11-10 06:02:03 UTC |
Source: | https://github.com/dplecko/fairadapt |
Convenience function for returning adapted data
adaptedData(x, train = TRUE)

## S3 method for class 'fairadapt'
adaptedData(x, train = TRUE)

## S3 method for class 'fairadaptBoot'
adaptedData(x, train = TRUE)
x |
Object of class |
train |
A logical indicating whether train data should be returned.
Defaults to |
Either a data.frame when called on a fairadapt object, or a list of data.frames of length n.boot with the adapted data when called on a fairadaptBoot object.
A real dataset from Broward County, Florida. Contains information on individuals released on parole, and whether they reoffended within two years.
compas
A data frame with 1,000 rows and 9 variables:
sex of the individual
age, measured in years
race, binary with values Non-White and White
count of juvenile felonies
count of juvenile misdemeanors
count of other juvenile offenses
count of prior offenses
degree of charge, with two values, F (felony) and M (misdemeanor)
a logical TRUE/FALSE indicator of recidivism within two years after parole start
Compute Quantiles generic for the Quantile Learning step.
computeQuants(x, data, newdata, ind, ...)
x |
Object with an associated |
data |
|
newdata |
|
ind |
A |
... |
Additional arguments to be passed down to respective method functions. |
A vector of counterfactual values corresponding to newdata.
Implementation of fair data adaptation with quantile preservation (Plecko & Meinshausen 2019). Uses only plain R.
fairadapt(
  formula,
  prot.attr,
  adj.mat,
  train.data,
  test.data = NULL,
  cfd.mat = NULL,
  top.ord = NULL,
  res.vars = NULL,
  quant.method = rangerQuants,
  visualize.graph = FALSE,
  eval.qfit = NULL,
  ...
)
formula |
Object of class |
prot.attr |
A value of class |
adj.mat |
Matrix of class |
train.data , test.data
|
Training data & testing data, both of class
|
cfd.mat |
Symmetric matrix of class |
top.ord |
A vector of class |
res.vars |
A vector of class |
quant.method |
A function choosing the method used for quantile
regression. Default value is |
visualize.graph |
A |
eval.qfit |
Argument indicating whether the quality of the quantile
regression fit should be computed using cross-validation. Default value is
|
... |
Additional arguments forwarded to the function passed as
|
The procedure takes the training and testing data as an input, together with the causal graph given by an adjacency matrix and the list of resolving variables, which should be kept fixed during the adaptation procedure. The procedure then calculates a fair representation of the data, after which any classification method can be used. There are, however, several valid training options yielding fair predictions, and the best of them can be chosen with cross-validation. For more details we refer the user to the original paper. Most of the running time is due to the quantile regression step using the ranger package.
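The adjacency matrix mentioned above can be awkward to type as a block of zeros and ones. The following is a minimal base-R sketch, not part of the package, that builds the same matrix as the `uni_admission` example from a list of named edges and derives one valid topological ordering. It assumes the convention visible in the package examples, namely that entry `[i, j] = 1` encodes an edge from variable `i` to variable `j`. The helper `topo_order()` is hypothetical and only illustrates what an ordering of the graph looks like.

```r
# Variables of the causal graph, in the order used for rows and columns.
vars <- c("gender", "edu", "test", "score")

# Start from an empty graph with named rows and columns.
adj <- matrix(0L, nrow = length(vars), ncol = length(vars),
              dimnames = list(vars, vars))

# Each row is one directed edge "from -> to".
# ASSUMPTION: adj[from, to] = 1 encodes the edge from -> to.
edges <- rbind(
  c("gender", "edu"),
  c("gender", "test"),
  c("edu", "test"),
  c("edu", "score"),
  c("test", "score")
)
adj[edges] <- 1L  # a two-column character matrix indexes by dimnames

# Hypothetical helper: Kahn's algorithm for a topological ordering.
topo_order <- function(adj) {
  in_deg <- colSums(adj)
  remaining <- rownames(adj)
  ord <- character(0)
  while (length(remaining) > 0L) {
    roots <- remaining[in_deg[remaining] == 0]
    if (length(roots) == 0L) stop("graph contains a cycle")
    node <- roots[1L]
    ord <- c(ord, node)
    remaining <- setdiff(remaining, node)
    in_deg <- in_deg - adj[node, ]  # remove the outgoing edges of `node`
  }
  ord
}

ord <- topo_order(adj)
print(adj)
print(ord)
```

Naming the edges keeps the graph readable as it grows, and the cycle check catches a mistyped edge before any quantile regression is run.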
An object of class fairadapt, containing the original and adapted training and testing data, together with the causal graph and some additional meta-information.
Plecko, D. & Meinshausen, N. (2019). Fair Data Adaptation with Quantile Preservation
n_samp <- 200
uni_dim <- c("gender", "edu", "test", "score")

uni_adj <- matrix(
  c(0, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 0, 0),
  ncol = length(uni_dim),
  dimnames = rep(list(uni_dim), 2),
  byrow = TRUE
)

uni_ada <- fairadapt(score ~ .,
  train.data = head(uni_admission, n = n_samp),
  test.data = tail(uni_admission, n = n_samp),
  adj.mat = uni_adj,
  prot.attr = "gender"
)

uni_ada
The fairadapt() function performs data adaptation, but does so only once. Sometimes it is desirable to repeat this process in order to quantify the uncertainty of the data adaptation. The wrapper function fairadaptBoot() enables this by running the fairadapt() procedure multiple times and keeping the resulting data transformations in memory. For a worked example of how to use fairadaptBoot() for uncertainty quantification, see the fairadapt vignette.
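As a rough illustration of what such uncertainty estimates can look like, the sketch below summarises a list of adapted data frames row by row. It does not call the package: `boot_list` is a simulated stand-in for the list of length `n.boot` that `adaptedData()` is documented to return for a `fairadaptBoot` object, and the column name `score` and all numbers are made up for illustration.

```r
set.seed(2022)

n_boot <- 5  # number of bootstrap repetitions (stand-in for n.boot)
n_rows <- 4  # number of individuals

# ASSUMPTION: simulated stand-in for the list of adapted data.frames;
# with the package this list would come from adaptedData() instead.
boot_list <- lapply(seq_len(n_boot), function(i) {
  data.frame(score = rnorm(n_rows, mean = seq_len(n_rows)))
})

# One column per bootstrap repetition, one row per individual.
score_mat <- vapply(boot_list, function(df) df$score, numeric(n_rows))

# Per-individual mean and spread of the adapted value across repetitions.
summary_df <- data.frame(
  row  = seq_len(n_rows),
  mean = rowMeans(score_mat),
  sd   = apply(score_mat, 1L, sd)
)

print(summary_df)
```

A large standard deviation for an individual indicates that the adapted value is sensitive to the randomness in the adaptation procedure.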
fairadaptBoot(
  formula,
  prot.attr,
  adj.mat,
  train.data,
  test.data = NULL,
  cfd.mat = NULL,
  top.ord = NULL,
  res.vars = NULL,
  quant.method = rangerQuants,
  keep.object = FALSE,
  n.boot = 100,
  rand.mode = c("finsamp", "quant", "both"),
  test.seed = 2022,
  ...
)
formula |
Object of class |
prot.attr |
A value of class |
adj.mat |
Matrix of class |
train.data , test.data
|
Training data & testing data, both of class
|
cfd.mat |
Symmetric matrix of class |
top.ord |
A vector of class |
res.vars |
A vector of class |
quant.method |
A function choosing the method used for quantile
regression. Default value is |
keep.object |
a |
n.boot |
An integer corresponding to the number of bootstrap iterations. |
rand.mode |
A string, taking values |
test.seed |
a seed for the randomness in breaking quantiles for the
discrete variables. This argument is only relevant when |
... |
Additional arguments forwarded to the function passed as
|
An object of class fairadaptBoot, containing the original and adapted training and testing data, together with the causal graph and some additional meta-information.
Plecko, D. & Meinshausen, N. (2019). Fair Data Adaptation with Quantile Preservation
n_samp <- 200
uni_dim <- c("gender", "edu", "test", "score")

uni_adj <- matrix(
  c(0, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 0, 0),
  ncol = length(uni_dim),
  dimnames = rep(list(uni_dim), 2),
  byrow = TRUE
)

uni_ada <- fairadaptBoot(score ~ .,
  train.data = head(uni_admission, n = n_samp),
  test.data = tail(uni_admission, n = n_samp),
  adj.mat = uni_adj,
  prot.attr = "gender",
  n.boot = 5
)

uni_ada
Fair Twin Inspection convenience function.
fairTwins(x, train.id = seq_len(nrow(x$train)), test.id = NULL, cols = NULL)
x |
Object of class |
train.id |
A vector of indices specifying which rows of the training data should be displayed. |
test.id |
A vector of indices specifying which rows of the test data should be displayed. |
cols |
A |
A data.frame, containing the original and adapted values of the requested individuals. Adapted columns have _adapted appended to their original name.
n_samp <- 200
uni_dim <- c("gender", "edu", "test", "score")

uni_adj <- matrix(
  c(0, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 0, 0),
  ncol = length(uni_dim),
  dimnames = rep(list(uni_dim), 2),
  byrow = TRUE
)

uni_ada <- fairadapt(score ~ .,
  train.data = head(uni_admission, n = n_samp),
  test.data = tail(uni_admission, n = n_samp),
  adj.mat = uni_adj,
  prot.attr = "gender"
)

fairTwins(uni_ada, train.id = 1:5)
The dataset contains demographic, education and work information on employees of the US government. The data are taken from the 2018 US Census.
gov_census
A data frame with 204,309 rows and 17 variables:
gender of the employee
employee age in years
race of the employee
indicator of hispanic origin
citizenship of the employee
indicator of nativity to the US
marital status
size of the employee's family
number of children of the employee
education level measured in years
yearly salary in US dollars
hours worked every week
weeks worked in the given year
occupation classification
industry classification
economic region where the person is employed in the US
https://www.census.gov/programs-surveys/acs/microdata/documentation.html
Obtaining the graphical causal model (GCM)
graphModel(adj.mat, cfd.mat = NULL, res.vars = NULL)
adj.mat |
Matrix of class |
cfd.mat |
Symmetric matrix of class |
res.vars |
A vector of class |
An object of class igraph, containing the causal graphical model, with directed and bidirected edges.
adj.mat <- cfd.mat <- array(0L, dim = c(3, 3))
colnames(adj.mat) <- rownames(adj.mat) <-
  colnames(cfd.mat) <- rownames(cfd.mat) <- c("A", "X", "Y")

adj.mat["A", "X"] <- adj.mat["X", "Y"] <-
  cfd.mat["X", "Y"] <- cfd.mat["Y", "X"] <- 1L

gcm <- graphModel(adj.mat, cfd.mat, res.vars = "X")
Prediction function for new data from a saved fairadapt object.
## S3 method for class 'fairadapt'
predict(object, newdata, ...)
object |
Object of class |
newdata |
A |
... |
Additional arguments forwarded to |
The newdata argument should be compatible with the adapt.test argument that was used when constructing the fairadapt object. In particular, newdata should contain the column names that appear in the formula argument that was used when calling fairadapt() (apart from the outcome variable on the LHS of the formula).
A data.frame containing the adapted version of the new data.
n_samp <- 200
uni_dim <- c("gender", "edu", "test", "score")

uni_adj <- matrix(
  c(0, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 0, 0),
  ncol = length(uni_dim),
  dimnames = rep(list(uni_dim), 2),
  byrow = TRUE
)

uni_ada <- fairadapt(score ~ .,
  train.data = head(uni_admission, n = n_samp),
  adj.mat = uni_adj,
  prot.attr = "gender"
)

predict(object = uni_ada, newdata = tail(uni_admission, n = n_samp))
Prediction function for new data from a saved fairadaptBoot object.
## S3 method for class 'fairadaptBoot'
predict(object, newdata, ...)
object |
Object of class |
newdata |
A |
... |
Additional arguments forwarded to |
The newdata argument should be compatible with the adapt.test argument that was used when constructing the fairadaptBoot object. In particular, newdata should contain the column names that appear in the formula argument that was used when calling fairadaptBoot() (apart from the outcome variable on the LHS of the formula).
A data.frame containing the adapted version of the new data.
n_samp <- 200
uni_dim <- c("gender", "edu", "test", "score")

uni_adj <- matrix(
  c(0, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 0, 0),
  ncol = length(uni_dim),
  dimnames = rep(list(uni_dim), 2),
  byrow = TRUE
)

uni_ada_boot <- fairadaptBoot(score ~ .,
  train.data = head(uni_admission, n = n_samp),
  adj.mat = uni_adj,
  prot.attr = "gender",
  n.boot = 5,
  keep.object = TRUE
)

predict(object = uni_ada_boot, newdata = tail(uni_admission, n = n_samp))
Quality of quantile fit statistics.
quantFit(x, ...)
x |
Object of class |
... |
Ignored in this case. |
A numeric vector, containing the average empirical loss for the 25%, 50% and 75% quantile loss functions, for each variable.
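To make the notion of a quantile loss concrete, here is a small base-R sketch of the standard pinball (check) loss evaluated at the 25%, 50% and 75% levels. It is an assumption that this is the loss definition meant above; the function `pinball_loss()` is a hypothetical helper, not part of the package, and the data are simulated.

```r
# Hypothetical helper: average pinball loss of predicting quantile `q`
# at level `tau` for observations `y`.
pinball_loss <- function(y, q, tau) {
  u <- y - q
  mean(pmax(tau * u, (tau - 1) * u))
}

set.seed(1)
y <- rnorm(500)  # simulated outcomes
taus <- c(0.25, 0.5, 0.75)

# Loss of the empirical quantile at each level, as a named vector.
losses <- vapply(
  taus,
  function(tau) pinball_loss(y, quantile(y, probs = tau), tau),
  numeric(1L)
)
names(losses) <- paste0(taus * 100, "%")

print(losses)
```

Lower values indicate predicted quantiles that sit closer to the target level, which is what makes the loss usable for comparing quantile regression fits.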
n_samp <- 200
uni_dim <- c("gender", "edu", "test", "score")

uni_adj <- matrix(
  c(0, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 0, 0),
  ncol = length(uni_dim),
  dimnames = rep(list(uni_dim), 2),
  byrow = TRUE
)

uni_ada <- fairadapt(score ~ .,
  train.data = head(uni_admission, n = n_samp),
  test.data = tail(uni_admission, n = n_samp),
  adj.mat = uni_adj,
  prot.attr = "gender",
  eval.qfit = 3L
)

quantFit(uni_ada)
There are several methods that can be used for the quantile learning step in the fairadapt package. Each of the methods needs a specific constructor. The constructor is a function that takes the data (with some additional meta-information) and returns an object on which the computeQuants() generic can be called.
rangerQuants(data, A.root, ind, min.node.size = 20, ...)

linearQuants(
  data,
  A.root,
  ind,
  tau = c(0.001, seq(0.005, 0.995, by = 0.01), 0.999),
  ...
)

mcqrnnQuants(
  data,
  A.root,
  ind,
  tau = seq(0.005, 0.995, by = 0.01),
  iter.max = 500,
  ...
)
data |
A |
A.root |
A |
ind |
A |
min.node.size |
Forwarded to |
... |
Forwarded to further methods. |
tau |
Forwarded to |
iter.max |
Forwarded to |
Within the package, three methods are implemented, which use quantile regressors based on linear models, random forests and neural networks. There is additional flexibility, however, and users can provide their own quantile method. For this, the user needs to write (i) the constructor, which returns an S3 classed object (see examples below); (ii) a method for the computeQuants() generic for the S3 class returned in (i).
The rangerQuants() function uses random forests (ranger package) for quantile regression.
The linearQuants() function uses linear quantile regression (quantreg package) for the Quantile Learning step.
The mcqrnnQuants() function uses monotone quantile regression neural networks (mcqrnn package) in the Quantile Learning step.
A ranger or a rangersplit S3 object, depending on the value of the A.root argument, for rangerQuants().
A rqs or a quantregsplit S3 object, depending on the value of the A.root argument, for linearQuants().
An mcqrnn S3 object for mcqrnnQuants().
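The two steps for a user-supplied quantile method, a constructor returning an S3 classed object and a matching computeQuants() method, can be sketched as follows. This is a deliberately trivial toy that only shows the S3 wiring, not a usable quantile regressor. Several things here are assumptions: the constructor name `medianQuants` and class name `medianquants` are hypothetical, the first column of `data` is assumed to hold the variable being adapted, the values passed for `A.root` and `ind` are placeholders, and a stand-in generic is defined so that the snippet runs without the package (with fairadapt attached, the package's own generic would be used and this definition should be omitted).

```r
# Stand-in generic with the signature documented for computeQuants().
# ASSUMPTION: only needed when the fairadapt package is not attached.
computeQuants <- function(x, data, newdata, ind, ...) {
  UseMethod("computeQuants")
}

# (i) Hypothetical constructor with the same leading arguments as
#     rangerQuants(); it returns an S3 classed object.
medianQuants <- function(data, A.root, ind, ...) {
  # ASSUMPTION: the first column of `data` is the variable being adapted.
  structure(
    list(center = stats::median(data[[1L]])),
    class = "medianquants"
  )
}

# (ii) Method for the class returned in (i). It ignores all covariates
#      and returns the stored median once per row of `newdata`.
computeQuants.medianquants <- function(x, data, newdata, ind, ...) {
  rep(x$center, nrow(newdata))
}

# Toy data: response `y` and a binary attribute `a`.
toy <- data.frame(y = c(1, 3, 5, 7, 100), a = c(0, 0, 1, 1, 1))

fit <- medianQuants(toy, A.root = FALSE, ind = toy$a == 0)
res <- computeQuants(fit, data = toy, newdata = toy[toy$a == 1, ],
                     ind = toy$a == 0)
print(res)
```

A real method would replace the body of the computeQuants() method with a covariate-dependent quantile model; the constructor and dispatch structure would stay the same.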
A simulated dataset containing the evaluation of students' abilities.
uni_admission
A data frame with 1,000 rows and 4 variables:
the gender of the student
educational achievement, for instance GPA
performance on a university admission test
overall final score measuring the quality of a candidate
Visualize Graphical Causal Model
visualizeGraph(x, ...)
x |
Object of class |
... |
Additional arguments passed to the graph plotting function. |