Empirical Methods in Finance

Part 6

Henrique C. Martins

Selection Bias

Selection Bias

Back to the selection bias example from earlier.

  • Imagine that John and Mary are moving to the north of Canada.

  • John has a history of respiratory disease and decides to buy insurance.

  • Mary does not have a history of respiratory disease and decides not to buy insurance.

                              John   Mary
State of insurance              1      0
Situation without insurance     3      5
Situation with insurance        4      5
Observed                        4      5
Effect                          ?      ?

The true average treatment effect is the average of the two individual effects:

\[\frac{1}{2}\left[(Y_{1,John} - Y_{0,John}) + (Y_{1,Mary} - Y_{0,Mary})\right] = \frac{(4 - 3) + (5 - 5)}{2} = 0.5\]

Selection Bias

Rearranging the terms:

\[\frac{1}{2}\left[\underbrace{(Y_{1,John} - Y_{0,Mary})}_{We\;see} + \underbrace{(Y_{1,Mary} - Y_{0,John})}_{We\;do\;not\;see}\right] = \frac{(4 - 5) + (5 - 3)}{2} = 0.5\]

The term \((Y_{1,Mary} - Y_{0,John}) = (5 - 3) = 2\) is the selection bias.

It exists because we are comparing two people who should not be compared.
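
A minimal R sketch of this arithmetic, using the numbers from the table above:

R
# Potential outcomes from the John/Mary example
y1_john <- 4; y0_john <- 3   # John with / without insurance
y1_mary <- 5; y0_mary <- 5   # Mary with / without insurance

# True average treatment effect: the average of the two individual effects
((y1_john - y0_john) + (y1_mary - y0_mary)) / 2   # 0.5

# Rearranged: the comparison we see plus the comparison we do not see
we_see        <- y1_john - y0_mary   # 4 - 5 = -1 (the observed comparison)
we_do_not_see <- y1_mary - y0_john   # 5 - 3 =  2 (the selection bias term)
(we_see + we_do_not_see) / 2         # also 0.5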

Selection Bias

Some notation:

\(d=1\) for the treated units (treatment group)

\(d=0\) for the untreated units (control group)

\(Y_{i}\) = Potential outcome of individual i.

\(Y_{i,1}\) or \(Y(1)\) = Potential outcome of individual i, treatment group.

\(Y_{i,0}\) or \(Y(0)\) = Potential outcome of individual i, control group.

Selection Bias

Some notation:

These are the representations of the causal effect we often want to estimate.

Average Treatment Effect:

ATE = \(E[Y_{i,1}] - E[Y_{i,0}]\)

Average Treatment Effect on the treated:

ATET = \(E[Y_{i,1}|D_i=1] - E[Y_{i,0}|D_i=1]\)

Average Treatment Effect on the untreated:

ATEU = \(E[Y_{i,1}|D_i=0] - E[Y_{i,0}|D_i=0]\)

Of course, again, we cannot observe both potential outcomes of the same unit i.
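
A useful way to see the problem is the standard decomposition of the naive treated-versus-untreated comparison: the observed difference equals the ATET plus a selection bias term.

\[E[Y_{i}|D_i=1] - E[Y_{i}|D_i=0] = \underbrace{E[Y_{i,1}|D_i=1] - E[Y_{i,0}|D_i=1]}_{ATET} + \underbrace{E[Y_{i,0}|D_i=1] - E[Y_{i,0}|D_i=0]}_{Selection\;bias}\]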

Selection Bias

When dealing with causal inference, we have to find ways to approximate what the hidden potential outcome of the treated units is.

That is, the challenge in identifying causal effects is that the untreated potential outcomes, \(Y_{i,0}\), are never observed for the treated group (\(D_i=1\)): the “second” term in the following equation.

ATET = \(E[Y_{i,1}|D_i=1] - E[Y_{i,0}|D_i=1]\)

We need an empirical design to “observe” what we do not really observe (i.e., the counterfactual).

Selection Bias

Many options:

  • Matching/Balancing
  • Difference-in-differences (DiD)
  • Instrumental variables
  • Regression discontinuity design (RDD)
  • Synthetic control (Synth)

Selection Bias

The process of finding units that are comparable is called matching.

Before we continue…

We will match on observables. We cannot match on unobservables.

Thus, in your article you may want to write that you address “selection bias due to observables”.

Cunningham:

Propensity score matching has not seen as wide adoption among economists as other nonexperimental methods like regression discontinuity or difference-in-differences. The most common reason given for this is that economists are oftentimes skeptical that CIA can be achieved in any dataset almost as an article of faith. This is because for many applications, economists as a group are usually more concerned about selection on unobservables than they are selection on observables, and as such, they reach for matching methods less often.

CIA stands for the conditional independence assumption (often used interchangeably with CMI, conditional mean independence).

Matching

Matching

Matching aims to compare the outcomes between observations that have the same values of all control variables, except that one unit is treated and the other is not.

In this literature, the control variables used to match are often called covariates.

That is, for each treated unit, the researcher finds an untreated unit that is similar in all covariates.

The implication is that the researcher can argue that “units are comparable after matching”.

Matching

The easiest to see is exact matching: it matches observations that have the exact same values.

  • It might be doable if you have only one covariate.

  • Naturally, if you match on only one covariate, you might still be left with some selection bias.

    • In the previous example, health history is one important covariate that makes John and Mary different.

    • But what about life style? Nutrition? Etc.

As the number of covariates grows, you cannot pursue exact matching. That is the job of PSM.

Matching

In exact matching, the causal effect estimator (ATET) is:

\[ATET = \frac{1}{N_T} \sum_{i:\,D_i=1} \left(Y_{i} - Y_{j(i)}\right)\]

Where \(N_T\) is the number of treated units and \(Y_{j(i)}\) is the outcome of the j-th unit, matched to the i-th unit because it is “closest to” the i-th unit on some covariate.

For instance, let’s say that a unit in the treatment group has a covariate with a value of 2 and we find another unit in the control group (exactly one unit) with a covariate value of 2.

Then we will impute the treatment unit’s missing counterfactual with the matched unit’s, and take a difference.
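
A minimal R sketch of this imputation logic on toy data (the numbers below are hypothetical, made up only for illustration):

R
# Toy data: one covariate x, exact 1-to-1 matching
treated <- data.frame(x = c(2, 5, 7), y = c(10, 14, 20))
control <- data.frame(x = c(2, 3, 5, 5, 7, 9), y = c(8, 9, 11, 12, 15, 18))

# For each treated unit, impute the counterfactual with its first exact match
matched_y0 <- control$y[match(treated$x, control$x)]

# ATET estimate: mean difference between treated and matched-control outcomes
mean(treated$y - matched_y0)   # (2 + 3 + 5) / 3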

Matching

Consider the following dataset from Cunningham:

Matching

Average ages are very different. The salary of a 24-year-old person is quite different from the salary of a 32-year-old person.

R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}
training_example <- read_data("training_example.dta") %>% slice(1:20)
summary(training_example$age_treat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18.00   20.25   23.00   24.30   28.50   33.00      10 
R
summary(training_example$age_control)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   23.50   31.50   31.95   39.00   51.00 
R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_example <- read_data("training_example.dta") %>% slice(1:20)

ggplot(training_example, aes(x=age_treat)) +
  stat_bin(bins = 10, na.rm = TRUE)

R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_example <- read_data("training_example.dta") %>% slice(1:20)

ggplot(training_example, aes(x=age_control)) +
  stat_bin(bins = 10, na.rm = TRUE)

R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_example <- read_data("training_example.dta") %>% slice(1:20)

ggplot(training_example, aes(x=age_matched)) +
  stat_bin(bins = 10, na.rm = TRUE)

Matching

In this example, you are literally finding the units in the control group that have the same age as the units in the treatment group.

You are exact matching 1-by-1 in this example.

You have only one covariate, i.e., age.

Distance Matching

Distance Matching

The last example was simple because you could exact match.

If you cannot find one exact match, you need an approximate match.

In order to do that, you have to use distance matching.

Distance matching minimizes the distance (i.e., how far the covariates are from each other) between the treatment and control groups.

Distance Matching

Euclidean distance = \(|X_i-X_j|=\sqrt{(X_i-X_j)'(X_i-X_j)}=\sqrt{\sum_{n=1}^k(X_{n,i}-X_{n,j})^2}\)

Distance Matching

Normalized Euclidean distance = \(|X_i-X_j|=\sqrt{(X_i-X_j)'\hat{V}^{-1}(X_i-X_j)}=\sqrt{\sum_{n=1}^k\frac{(X_{n,i}-X_{n,j})^2}{\hat{\sigma}^2_n}}\), where \(\hat{V}\) is the diagonal matrix of the sample variances \(\hat{\sigma}^2_n\).

The problem with the plain Euclidean distance is that the distance measure depends on the scale of the variables themselves.

For this reason, researchers typically will use some modification of the Euclidean distance, such as the normalized Euclidean distance, or they’ll use a wholly different alternative distance.

The normalized Euclidean distance is a commonly used distance, and what makes it different is that the distance of each variable is scaled by the variable’s variance.

Distance Matching

Mahalanobis distance = \(|X_i-X_j|=\sqrt{(X_i-X_j)'\hat{\Sigma}_X^{-1}(X_i-X_j)}\)

Where \(\hat{\Sigma}_X\) is the sample covariance matrix of X.
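
A small R sketch of the three distances for one pair of observations (toy data; note that base R’s mahalanobis() returns the squared distance, so we take the square root):

R
set.seed(1)
X  <- matrix(rnorm(200), ncol = 2)   # toy covariate matrix: 100 units, 2 covariates
xi <- X[1, ]                         # a treated unit
xj <- X[2, ]                         # a candidate control unit

# Plain Euclidean distance
sqrt(sum((xi - xj)^2))

# Normalized Euclidean distance: scale each covariate by its sample variance
v <- apply(X, 2, var)
sqrt(sum((xi - xj)^2 / v))

# Mahalanobis distance: scale by the full sample covariance matrix
sqrt(mahalanobis(xi, center = xj, cov = cov(X)))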

Distance Matching

Distance matching only goes so far…

The larger the dimensionality, the harder it is to use distance matching.

As the sample size increases, for a given number of covariates, the matching discrepancies tend to zero.

But the more covariates, the longer it takes.

At the end of the day, it is preferable to have many covariates, but that makes distance matching harder.

Coarsened Exact Matching (CEM)

Coarsened Exact Matching (CEM)

In coarsened exact matching, something only counts as a match if it exactly matches on each matching variable.

The “coarsened” part comes in because, if you have any continuous variables to match on, you need to “coarsen” them first by putting them into bins, rather than matching on exact values.

Coarsening means creating bins. Fewer bins make exact matches more likely.

CEM is not used much in empirical research in finance. It is used more in the big-data realm when there are many variables to match on.
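
Still, a minimal sketch using the MatchIt package, which implements CEM via method = "cem", on the same Cunningham data used in the example below (the cutpoints are chosen purely for illustration):

R
library(haven)
library(MatchIt)
data <- read_dta("files/cps1re74.dta")

# Coarsen the continuous covariates into bins, then exact-match on the bins
model_cem <- matchit(treat ~ age + black + educ, data = data,
                     method = "cem",
                     cutpoints = list(age = 5, educ = 4))  # number of bins
summary(model_cem)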

Propensity-score matching (PSM)

Propensity-score matching (PSM)

PSM is one way of matching on many covariates.

PSM aggregates all covariates into one score (propensity-score), which is the likelihood of receiving the treatment.

The idea is to match units that, based on observables, have the same probability (called propensity-score) of being treated.

The idea is to estimate a probit (the default in Stata) or logit model (first stage):

\[P(D=1|X)\]

The propensity score is the predicted probability of a unit being treated given all covariates X. The p-score is a single number per unit.
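
As a sketch, the first stage is just a binary-response regression. In R, with the same data as the example below (a logit here, which is also what MatchIt uses by default):

R
library(haven)
data <- read_dta("files/cps1re74.dta")

# First stage: model the probability of treatment given the covariates
first_stage <- glm(treat ~ age + black + educ, data = data,
                   family = binomial(link = "logit"))

# The propensity score is the fitted probability: a single number per unit
data$pscore <- predict(first_stage, type = "response")
summary(data$pscore)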

Propensity-score matching (PSM)

Considerations in PSM.

  1. How many neighbors to match? Nearest neighbor, radius, or kernel?

  2. With or without replacement?

  3. With or without common support?

    • Common support: drop treated observations whose pscore is higher than the maximum or lower than the minimum pscore of the controls (a sketch follows below).

  4. It is expected that, after PSM, you show the overlap of the propensity scores.
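
A hand-rolled sketch of imposing common support, continuing from the pscore estimated in the first-stage sketch above (this mimics what, e.g., psmatch2’s common option does; for illustration only):

R
# Common support: keep treated units whose pscore lies within the control range
lo <- min(data$pscore[data$treat == 0])
hi <- max(data$pscore[data$treat == 0])
on_support <- data$treat == 0 | (data$pscore >= lo & data$pscore <= hi)

table(dropped = !on_support, treat = data$treat)  # how many treated get dropped
data_cs <- data[on_support, ]                     # sample restricted to support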

Propensity-score matching (PSM)


The y-axis is the propensity-score.

Propensity-score matching (PSM)


Nearest-neighbor matching: for each treated observation i, find the control observation j that minimizes \(|p_i-p_j|\).

Propensity-score matching (PSM)


Kernel matching: Each treated observation i is matched with several control observations, with weights inversely proportional to the distance between treated and control observations.
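
Concretely, the counterfactual for treated unit i is a weighted average over all controls; a standard formulation (with kernel \(K\), bandwidth \(h\), and control group \(C\)) is:

\[\hat{Y}_{i,0} = \sum_{j \in C} w_{ij} Y_j, \qquad w_{ij} = \frac{K\left(\frac{p_j - p_i}{h}\right)}{\sum_{k \in C} K\left(\frac{p_k - p_i}{h}\right)}\]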

Propensity-score matching (PSM)


Radius matching: Each treated observation i is matched with the control observations j that fall within a specified radius r:

\[|p_i-p_j| < r\]

Propensity-score matching (PSM)


Common support: Restrict matching only based on the common range of propensity scores.

Propensity-score matching (PSM)

Seems good overlap, but “good” is arbitrary.

Propensity-score matching (PSM)

Seems bad overlap

Propensity-score matching (PSM)

Seems good overlap, but “good” is arbitrary.

Propensity-score matching (PSM)

Seems bad overlap


Example

Example

Let’s practice with an example. 185 treated units vs 15,992 control units.

R
# Load necessary packages
library(haven)
library(psych)
data <- read_dta("files/cps1re74.dta")
summary_stats <- by(data, data$treat, FUN = function(group) {
  c(
    mean = mean(group$age, na.rm = TRUE),
    variance = var(group$age, na.rm = TRUE),
    skewness = skew(group$age, na.rm = TRUE),
    count = length(group$age)
  )
})
summary_df <- as.data.frame(do.call(rbind, summary_stats))
colnames(summary_df) <- c("mean", "variance", "skewness", "count")
print(summary_df)
      mean variance  skewness count
0 33.22524 121.9968 0.3477684 15992
1 25.81622  51.1943 1.1063375   185
Python
import pandas as pd
from scipy.stats import skew
import statsmodels.api as sm
data = pd.read_stata("files/cps1re74.dta")
grouped_data = data.groupby('treat')['age'].agg(['mean', 'var', lambda x: skew(x, nan_policy='omit'), 'count']).reset_index()
grouped_data.columns = ['treat', 'mean', 'variance', 'skewness', 'count']
print(grouped_data)
   treat       mean    variance  skewness  count
0      0  33.225238  121.996792  0.347801  15992
1      1  25.816216   51.194301  1.115369    185
Stata
use files/cps1re74.dta, clear
qui estpost tabstat age black educ , by(treat) c(s) s(me v sk n) nototal
esttab .    ,varwidth(20) cells("mean(fmt(3)) variance(fmt(3)) skewness(fmt(3)) count(fmt(0))") noobs nonumber compress 
(DW Subset of LaLonde Data)



------------------------------------------------------------
                                                            
                          mean  variance  skewness     count
------------------------------------------------------------
0                                                           
age                     33.225   121.997     0.348     15992
black                    0.074     0.068     3.268     15992
educ                    12.028     8.242    -0.423     15992
------------------------------------------------------------
1                                                           
age                     25.816    51.194     1.115       185
black                    0.843     0.133    -1.888       185
educ                    10.346     4.043    -0.721       185
------------------------------------------------------------

Example

Clearly, the treated group is younger, mainly black, and less educated.

Also note that the variance and skewness of the two subsamples are different.

If we were to use these two subsamples in any econometric analysis without preprocessing to make them comparable, the estimated coefficients would likely suffer from selection bias.

Therefore, it is important to perform some matching method.

Let’s start with Propensity Score Matching (PSM). We will use the simplest matching, that is, without any additional options.

Example

Nearest neighbor, without replacement.

R
# install.packages("MatchIt")
library(haven)
library(psych)
library(MatchIt)
data <- read_dta("files/cps1re74.dta")
model <- matchit(treat ~ age + black + educ, data = data, method = "nearest")
summary(model)

Call:
matchit(formula = treat ~ age + black + educ, data = data, method = "nearest")

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.1445        0.0099          1.3615     7.5662    0.4987
age            25.8162       33.2252         -1.0355     0.4196    0.1863
black           0.8432        0.0735          2.1171          .    0.7697
educ           10.3459       12.0275         -0.8363     0.4905    0.0908
         eCDF Max
distance   0.7741
age        0.3427
black      0.7697
educ       0.4123

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.1445        0.1443          0.0020     1.0039    0.0002
age            25.8162       25.7081          0.0151     0.9244    0.0073
black           0.8432        0.8432          0.0000          .    0.0000
educ           10.3459       10.4054         -0.0296     0.7190    0.0117
         eCDF Max Std. Pair Dist.
distance   0.0162          0.0029
age        0.0270          0.1481
black      0.0000          0.0000
educ       0.0432          0.2554

Sample Sizes:
          Control Treated
All         15992     185
Matched       185     185
Unmatched   15807       0
Discarded       0       0
Stata
use files/cps1re74.dta, clear
psmatch2 treat age black educ , n(1) noreplacement
sum _weight , d
(DW Subset of LaLonde Data)


Probit regression                                       Number of obs = 16,177
                                                        LR chi2(3)    = 752.47
                                                        Prob > chi2   = 0.0000
Log likelihood = -634.83401                             Pseudo R2     = 0.3721

------------------------------------------------------------------------------
       treat | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0379829   .0042976    -8.84   0.000     -.046406   -.0295598
       black |   1.700675   .0793898    21.42   0.000     1.545074    1.856277
        educ |  -.0806962   .0142167    -5.68   0.000    -.1085604    -.052832
       _cons |  -.9031083   .2123282    -4.25   0.000    -1.319264   -.4869526
------------------------------------------------------------------------------

            psmatch2: weight of matched controls
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs                 370
25%            1              1       Sum of wgt.         370

50%            1                      Mean                  1
                        Largest       Std. dev.             0
75%            1              1
90%            1              1       Variance              0
95%            1              1       Skewness              .
99%            1              1       Kurtosis              .

Example

Notice that we are now creating weights.

R
# install.packages("MatchIt")
library(haven)
library(MatchIt)
data <- read_dta("files/cps1re74.dta")
model <- matchit(treat ~ age + black + educ, data = data, method = "exact")
summary(model$weights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.1457  0.0000 52.7952 
Stata
use files/cps1re74.dta, clear
qui psmatch2 treat age black educ , kernel
sum _weight , d
(DW Subset of LaLonde Data)

            psmatch2: weight of matched controls
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0024355       .0007862
 5%     .0024375       .0024348
10%       .00244       .0024348       Obs              16,177
25%     .0024517       .0024348       Sum of wgt.      16,177

50%     .0024919                      Mean            .022872
                        Largest       Std. dev.      .1130791
75%     .0026476              1
90%     .0038379              1       Variance       .0127869
95%     .0876547              1       Skewness       7.604874
99%            1              1       Kurtosis       64.26276

Example

Now, the descriptive statistics are much closer.

R
library(haven)
library(MatchIt)
#install.packages("e1071")
library(e1071)
data <- read_dta("files/cps1re74.dta")
model <- matchit(treat ~ age + black + educ, data = data, method = "exact")
matched_data <- match.data(model)
summary_stats <- by(matched_data, matched_data$treat, function(x) {
  c(mean(x$age), var(x$age), skewness(x$age), length(x$age))
})

result_df <- data.frame(
  Treatment = c("Control", "Treated"),
  Mean_Age = sapply(summary_stats, function(x) x[1]),
  Variance_Age = sapply(summary_stats, function(x) x[2]),
  Skewness_Age = sapply(summary_stats, function(x) x[3]),
  Count = sapply(summary_stats, function(x) x[4])
)
print(result_df)
  Treatment Mean_Age Variance_Age Skewness_Age Count
0   Control 25.58786     39.16294    0.7826941  2191
1   Treated 25.51807     51.59664    1.1298334   166
Stata
use files/cps1re74.dta, clear
qui psmatch2 treat age black educ , kernel
qui estpost tabstat age black educ [aweight = _weight], by(treat) c(s) s(me v sk n) nototal
esttab .    ,varwidth(20) cells("mean(fmt(3)) variance(fmt(3)) skewness(fmt(3)) count(fmt(0))") noobs  nonumber compress 
(DW Subset of LaLonde Data)




------------------------------------------------------------
                                                            
                          mean  variance  skewness     count
------------------------------------------------------------
0                                                           
age                     27.033    85.548     1.077     15992
black                    0.791     0.165    -1.434     15992
educ                    10.710     8.146    -0.883     15992
------------------------------------------------------------
1                                                           
age                     25.816    51.194     1.115       185
black                    0.843     0.133    -1.888       185
educ                    10.346     4.043    -0.721       185
------------------------------------------------------------

Entropy Balancing

Entropy Balancing

Here, instead of matching units, we reweight the observations so that the moments of the covariate distributions (mean, variance, skewness) are similar across the treatment and control groups.

  • The ebalance function implements a reweighting scheme. The user starts by choosing the covariates that should be included in the reweighting.

  • For each covariate, the user then specifies a set of balance constraints to equate the moments of the covariate distribution between the treatment and the reweighted control group.

  • The moment constraints may include the mean (first moment), the variance (second moment), and the skewness (third moment).

The outcome is a vector of weights such that the weighted mean, weighted variance, and weighted skewness of the covariates in the control group are similar to those in the treatment group.
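
Sketching Hainmueller’s formulation, the control-unit weights solve an entropy-minimization problem subject to moment constraints, roughly (here \(c_r\) are the moment functions, e.g. \(X\), \(X^2\), \(X^3\); \(m_r\) the treatment-group moments; and \(q_i\) the base weights):

\[\min_{w_i} \sum_{i:D_i=0} w_i \log\left(\frac{w_i}{q_i}\right) \quad \text{s.t.} \quad \sum_{i:D_i=0} w_i \, c_r(X_i) = m_r, \qquad \sum_{i:D_i=0} w_i = 1, \quad w_i > 0\]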

Entropy Balancing

R
library(haven)
#install.packages("ebal")
library(ebal)
data <- read_dta("files/cps1re74.dta")
treatment <-cbind(data$treat)
vars <-cbind(data$age, data$educ, data$black)
eb <- ebalance(treatment, vars)
Converged within tolerance 
R
# means in treatment group data
apply(vars[treatment==1,],2,mean)
[1] 25.8162162 10.3459459  0.8432432
R
# means in reweighted control group data
apply(vars[treatment==0,],2,weighted.mean,w=eb$w)
[1] 25.8163688 10.3460391  0.8431526
R
# means in raw data control group data
apply(vars[treatment==0,],2,mean)
[1] 33.22523762 12.02751376  0.07353677
Stata
use files/cps1re74.dta, clear
ebalance treat age black educ, targets(3)
(DW Subset of LaLonde Data)



Data Setup
Treatment variable:   treat
Covariate adjustment: age black educ (1st order). age black educ (2nd order). age black educ (3rd order).


Optimizing...
Iteration 1: Max Difference = 580799.347
Iteration 2: Max Difference = 213665.688
Iteration 3: Max Difference = 78604.7628
Iteration 4: Max Difference = 28918.6249
Iteration 5: Max Difference = 10640.1108
Iteration 6: Max Difference = 3915.82197
Iteration 7: Max Difference = 1442.09731
Iteration 8: Max Difference = 532.07826
Iteration 9: Max Difference = 197.376777
Iteration 10: Max Difference = 74.6380533
Iteration 11: Max Difference = 29.9524313
Iteration 12: Max Difference = 11.4337344
Iteration 13: Max Difference = 4.43722698
Iteration 14: Max Difference = 1.76899046
Iteration 15: Max Difference = .420548538
Iteration 16: Max Difference = .037814194
Iteration 17: Max Difference = .001164231
maximum difference smaller than the tolerance level; convergence achieved


Treated units: 185     total of weights: 185
Control units: 15992   total of weights: 185


Before: without weighting

             |              Treat              |        Control       
             |      mean   variance   skewness |      mean   variance 
-------------+---------------------------------+----------------------
         age |     25.82      51.19      1.115 |     33.23        122 
       black |     .8432      .1329     -1.888 |    .07354     .06813 
        educ |     10.35      4.043     -.7212 |     12.03      8.242 

             |  Control  
             |  skewness 
-------------+-----------
         age |     .3478 
       black |     3.268 
        educ |    -.4233 


After:  _webal as the weighting variable

             |              Treat              |        Control       
             |      mean   variance   skewness |      mean   variance 
-------------+---------------------------------+----------------------
         age |     25.82      51.19      1.115 |      25.8      51.17 
       black |     .8432      .1329     -1.888 |     .8421       .133 
        educ |     10.35      4.043     -.7212 |     10.34       4.04 

             |  Control  
             |  skewness 
-------------+-----------
         age |     1.122 
       black |    -1.877 
        educ |    -.7121 

Entropy Balancing

Stata
use files/cps1re74.dta, clear
qui ebalance treat age black educ, targets(3)
qui estpost tabstat age black educ [aweight = _webal], by(treat) c(s) s(me v sk n) nototal
esttab .    ,varwidth(20) cells("mean(fmt(3)) variance(fmt(3)) skewness(fmt(3)) count(fmt(0))") noobs  nonumber compress 
(DW Subset of LaLonde Data)




------------------------------------------------------------
                                                            
                          mean  variance  skewness     count
------------------------------------------------------------
0                                                           
age                     25.801    51.167     1.122     15992
black                    0.842     0.133    -1.877     15992
educ                    10.340     4.040    -0.712     15992
------------------------------------------------------------
1                                                           
age                     25.816    51.194     1.115       185
black                    0.843     0.133    -1.888       185
educ                    10.346     4.043    -0.721       185
------------------------------------------------------------

THANK YOU!

QUESTIONS?