Datasets

In addition to the causal inference models, this library offers datasets for validating them. In real life, knowing the causal effect of a particular action would require knowing what would have happened had you not taken it, which is impossible to observe. For that reason, most of this library's datasets are synthetic, which lets us know the true treatment effect for certain.
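As an illustration of why synthetic data helps, here is a minimal sketch that is not part of the library. The names y0 and y1 and the constant effect of 2.0 are assumptions chosen for the example. Because both potential outcomes are generated, the true effect is known, even though an observed dataset reveals only one of them per row.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical potential outcomes: what happens without and with the treatment.
y0 = rng.normal(0.0, 1.0, size=n)
y1 = y0 + 2.0  # assumed constant treatment effect of 2.0

# Each unit is either treated or not, so only one outcome is ever observed.
treatment = rng.integers(0, 2, size=n)
y_observed = np.where(treatment == 1, y1, y0)

# With synthetic data both potential outcomes exist, so the true effect is computable.
true_ate = np.mean(y1 - y0)
print(f"true average treatment effect: {true_ate:.2f}")
```

In a real dataset only treatment and y_observed would be available, which is why the true effect cannot be read off directly.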

ihdp_simulated_outcomes

Data downloaded from the supplemental materials tab available at: https://www.tandfonline.com/doi/suppl/10.1198/jcgs.2010.08162?scroll=top

The Infant Health and Development Program was a collaborative, randomized, longitudinal, multisite clinical trial designed to evaluate the efficacy of comprehensive early intervention in reducing the developmental and health problems of low birth weight, premature infants. An intensive intervention extending from hospital discharge to 36 months corrected age was administered between 1985 and 1988 at eight different sites. [GROSS, R. et al., 1993]

Simulated outcomes are computed as described in [HILL, J., 2011]. Starting with the experimental data, an observational study is created by throwing away a nonrandom portion of the treatment group: all children with nonwhite mothers. This leaves 139 children in the treatment group. The control group remains intact with 608 children. Thus the treatment and control groups are no longer balanced, and simple comparisons of outcomes would lead to biased estimates of the treatment effect. Ethnicity was chosen as the variable used to partition the data because it led to subgroups that were more distinct than those yielded by the other categorical variables.
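The effect of discarding a nonrandom part of the treatment group can be sketched on made-up data. Everything below is an assumption for illustration (the binary covariate group, the effect sizes, the sample size); it does not use the IHDP data or reproduce the procedure in [HILL, J., 2011].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical binary covariate standing in for the partitioning variable.
group = rng.integers(0, 2, size=n)
# Randomized treatment assignment, as in an experiment.
treatment = rng.integers(0, 2, size=n)
# Assumed outcome: the covariate shifts it by 2.0, the treatment by 1.0.
true_effect = 1.0
y = 2.0 * group + true_effect * treatment + rng.normal(0.0, 1.0, size=n)


def naive_difference(y, treatment):
    """Difference in mean outcome between treated and control rows."""
    return y[treatment == 1].mean() - y[treatment == 0].mean()


# Experimental data: the simple comparison is close to the true effect.
experimental = naive_difference(y, treatment)

# Observational version: discard treated rows with group == 1, keep all controls.
keep = ~((treatment == 1) & (group == 1))
observational = naive_difference(y[keep], treatment[keep])

print(f"true effect:            {true_effect:.2f}")
print(f"experimental estimate:  {experimental:.2f}")
print(f"observational estimate: {observational:.2f}")
```

After the nonrandom removal, the remaining treated rows differ systematically from the controls, so the simple comparison no longer recovers the true effect.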

The order of the variables from left to right is:
- treat: treatment indicator (1 if treated, 0 if not treated);
- bw: weight at birth (kg);
- b.head: head circumference at birth (inches);
- preterm: preterm indicator;
- birth.o: birth order of baby by mother;
- nnhealth: neonatal health score;
- momage: mother’s age at birth;
- sex;
- twin;
- b.marr: was mother married at birth;
- mom.lths;
- mom.hs: mother went to high school;
- mom.scoll: mother’s schooling at birth;
- cig: mother smoked during pregnancy;
- first: mother’s first baby;
- booze: mother drank alcohol during pregnancy;
- drugs: mother used drugs during pregnancy;
- work.dur: mother worked during pregnancy;
- prenatal: mother went through prenatal treatment;
- ark, ein, har, mia, pen, tex, was: one-hot encoded columns indicating place of enrollment in the program;
- y_A_sim: simulated outcome for surface A (linear) as described in [HILL, J., 2011];
- y_B_sim: simulated outcome for surface B (non-linear) as described in [HILL, J., 2011].
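Given this left-to-right order, the columns can be split into treatment, covariates, and outcomes by position. The sketch below uses a placeholder array of zeros instead of the real file, since how the data is loaded is not covered here; only the column names come from the list above.

```python
import numpy as np

# Column order as listed above, left to right.
columns = [
    "treat", "bw", "b.head", "preterm", "birth.o", "nnhealth", "momage",
    "sex", "twin", "b.marr", "mom.lths", "mom.hs", "mom.scoll", "cig",
    "first", "booze", "drugs", "work.dur", "prenatal",
    "ark", "ein", "har", "mia", "pen", "tex", "was",
    "y_A_sim", "y_B_sim",
]

# Placeholder standing in for the real file contents (assumption):
# 5 rows of zeros, one column per variable.
data = np.zeros((5, len(columns)))

treatment = data[:, columns.index("treat")]
outcome_a = data[:, columns.index("y_A_sim")]
outcome_b = data[:, columns.index("y_B_sim")]

# Everything between the treatment indicator and the outcomes is a covariate.
first_outcome = columns.index("y_A_sim")
covariate_names = columns[1:first_outcome]
covariates = data[:, 1:first_outcome]

print(len(columns), covariates.shape)
```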

lalonde_nsw_jobs

Data sets were downloaded from Rajeev Dehejia’s website, hosted by NBER: http://users.nber.org/~rdehejia/nswdata2.html

The data are drawn from a paper by Robert Lalonde, “Evaluating the Econometric Evaluations of Training Programs,” American Economic Review, Vol. 76, pp. 604-620. We are grateful to him for allowing us to use this data, assistance in reading his original data tapes, and permission to publish it here.

The order of the variables from left to right is:
- data_id: “Lalonde Sample” if the sample belongs to the initial Lalonde randomized controlled trial, “PSID” if the sample belongs to the observational data collected after the controlled experiment;
- treat: treatment indicator (1 if treated, 0 if not treated);
- age;
- education: years of formal education;
- married: 1 if married, 0 otherwise;
- black: 1 if black, 0 otherwise;
- hispanic: 1 if hispanic, 0 otherwise;
- white: 1 if white, 0 otherwise;
- nodegree: 1 if no degree, 0 otherwise;
- re75: earnings in 1975;
- re78: earnings in 1978.
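One way these columns might be used is to compare an experimental estimate with an observational one by filtering on data_id. The sketch below is only an illustration: the rows are made-up toy values, not taken from the dataset, and treating the PSID rows as a non-experimental comparison group is an assumption about intended usage.

```python
import numpy as np

# Made-up toy rows (assumption); real values come from the dataset itself.
data_id = np.array(["Lalonde Sample"] * 4 + ["PSID"] * 3)
treat = np.array([1, 1, 0, 0, 0, 0, 0])
re78 = np.array([6000.0, 5000.0, 4000.0, 3000.0, 20000.0, 25000.0, 15000.0])


def mean_difference(outcome, treated_mask, control_mask):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    return outcome[treated_mask].mean() - outcome[control_mask].mean()


experimental = data_id == "Lalonde Sample"
treated = experimental & (treat == 1)

# Benchmark: treated rows vs. randomized controls from the experiment.
benchmark = mean_difference(re78, treated, experimental & (treat == 0))

# Observational comparison: the same treated rows vs. the PSID rows.
observational = mean_difference(re78, treated, data_id == "PSID")

print(benchmark, observational)
```

A causal model applied to the observational comparison can then be judged by how close it gets to the experimental benchmark.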

Entirely synthetic dataset

pycausal_explorer.datasets.synthetic.create_synthetic_data(size=1000, target_type='continuous', random_seed=None)

Creates a synthetic dataset with explicit causal effects. The generating function is as follows:

the covariate x is normally distributed with mu = 1 and sigma = 1;
the treatment is binomially distributed with n = 1 and a chance of success of (x + 0.5) / 10. As a result, it is either 0 or 1, and larger values of x make 1 more likely;
the outcome y is 0.5 * x + (treatment effect) * treatment.

The result is a treatment and an outcome that both depend on the covariate x. Because x is a common cause of both, a naive comparison of treated and untreated outcomes gives a biased estimate of the causal effect.
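The generating process described above can be re-created in a few lines to see the bias. This is an illustrative sketch for the continuous case only, not the library's implementation: the function name sketch_synthetic_data, the treatment_effect parameter and its value of 1.0, and the clipping of the probability to [0, 1] are all assumptions.

```python
import numpy as np


def sketch_synthetic_data(size=1000, treatment_effect=1.0, random_seed=None):
    """Illustrative re-creation of the described process, not the library code."""
    rng = np.random.default_rng(random_seed)
    # Covariate: normal with mu = 1 and sigma = 1.
    x = rng.normal(1.0, 1.0, size=size)
    # Treatment probability (x + 0.5) / 10, clipped to [0, 1] (assumption),
    # because a normal draw can push the raw value below zero.
    p = np.clip((x + 0.5) / 10, 0.0, 1.0)
    treatment = rng.binomial(1, p)
    # Outcome: 0.5 * x plus the effect for treated rows.
    y = 0.5 * x + treatment_effect * treatment
    return x, treatment, y


x, treatment, y = sketch_synthetic_data(
    size=100_000, treatment_effect=1.0, random_seed=0
)

# Naive comparison: biased, since treated rows tend to have larger x.
naive = y[treatment == 1].mean() - y[treatment == 0].mean()

# Adjusting for x with ordinary least squares on [1, x, treatment].
design = np.column_stack([np.ones_like(x), x, treatment])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = coefficients[2]

print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

The naive difference in means overstates the assumed effect, while the regression that includes x recovers it, because in this sketch the outcome is an exact linear function of x and the treatment.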

size (int): Number of rows in the generated data.
target_type (str): “continuous” or “binary”. Whether the outcome should be continuous or binary.
random_seed (int, optional): Random seed for data generation.

out (tuple of ndarray): Returns a 3-element tuple containing the common-cause covariate, the treatment, and the outcome.