Inferência Causal

Part 2

Henrique C. Martins

Selection Bias

Selection Bias

Back to the selection bias example from before.

  • Imagine that John and Mary are moving to the north of Canada.

  • John has a history of respiratory disease and decides to buy insurance.

  • Mary does not have a history of respiratory disease and decides not to buy insurance.

                              John   Mary
State of insurance               1      0
Situation without insurance      3      5
Situation with insurance         4      5
Observed                         4      5
Effect                           ?      ?

\[\frac{(Y_{1,John} - Y_{0,John}) + (Y_{1,Mary} - Y_{0,Mary})}{2} = \frac{(4 - 3) + (5 - 5)}{2} = 0.5\]

Selection Bias

Rearranging the terms:

\[\frac{(Y_{1,John} - Y_{0,Mary}) + (Y_{1,Mary} - Y_{0,John})}{2} = \frac{(4 - 5) + (5 - 3)}{2} = 0.5\] \[\frac{We\;see + We\;do\;not\;see}{2} = \frac{(4 - 5) + (5 - 3)}{2} = 0.5\]

The term \((Y_{1,Mary} - Y_{0,John}) = (5 - 3) = 2\) is the selection bias.

It exists because we are comparing two people that should not be compared.

Selection Bias

Some notation:

\(d=1\) for the treated units (treatment group)

\(d=0\) for the untreated units (control group)

\(Y_{i}\) = Potential outcome of individual i.

\(Y_{i,1}\) or \(Y(1)\) = Potential outcome of individual i in the treatment group.

\(Y_{i,0}\) or \(Y(0)\) = Potential outcome of individual i in the control group.

Selection Bias

Some notation:

These are the representations of the causal effect we often want to estimate.

Average Treatment Effect:

ATE = \(\frac{1}{N} (E[Y_{i,1}] - E[Y_{i,0}])\)

Average Treatment Effect on the treated:

ATET = \(\frac{1}{N} (E[Y_{i,1}|D_i=1] - E[Y_{i,0}|D_i=1])\)

Average Treatment Effect on the untreated:

ATEU = \(\frac{1}{N} (E[Y_{i,1}|D_i=0] - E[Y_{i,0}|D_i=0])\)

Of course, again, we cannot observe both potential outcomes of the same unit i.
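
To make these definitions concrete, here is a minimal R sketch with simulated potential outcomes (a hypothetical illustration: we can compute ATE, ATET, and ATEU only because both potential outcomes are simulated, which is never possible in real data):

R
# Simulated potential outcomes (only possible because we generate the data)
set.seed(123)
n  <- 1000
d  <- rbinom(n, 1, 0.5)              # treatment indicator
y0 <- rnorm(n, mean = 10)            # potential outcome if untreated
y1 <- y0 + 2 + 0.5 * d               # potential outcome if treated (larger effect for the treated)
c(ATE  = mean(y1 - y0),
  ATET = mean(y1[d == 1] - y0[d == 1]),
  ATEU = mean(y1[d == 0] - y0[d == 0]))
# In real data we only observe y = d * y1 + (1 - d) * y0, never both potential outcomes.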

Selection Bias

When dealing with causal inference, we have to find ways to approximate what the hidden potential outcome of the treated units is.

That is, the challenge in identifying causal effects is that the untreated potential outcomes, \(Y_{i,0}\), are never observed for the treated group (\(D_i= 1\)). The “second” term in the following equation:

ATET = \(\frac{1}{N} (E[Y_{i,1}|D_i=1] - E[Y_{i,0}|D_i=1])\)

We need an empirical design to “observe” what we do not really observe (i.e., the counterfactual).

Selection Bias

Many options:

  • Matching/Balancing
  • Difference-in-differences (DiD)
  • Instrumental variables
  • Regression discontinuity design (RDD)
  • Synthetic control (Synth)

Selection Bias

The process of finding units that are comparable is called matching.

Before we continue…

We will match on observables. We cannot match on unobservables.

Thus, you may want to write in your article “selection bias due to observables”.

Cunningham:

Propensity score matching has not seen as wide adoption among economists as in other nonexperimental methods like regression discontinuity or difference-in-differences. The most common reason given for this is that economists are oftentimes skeptical that CIA can be achieved in any dataset almost as an article of faith. This is because for many applications, economists as a group are usually more concerned about selection on unobservables than they are selection on observables, and as such, they reach for matching methods less often.

CIA = Conditional Independence Assumption (akin to the CMI condition discussed before).

Matching

Matching

Matching aims to compare the outcomes between observations that have the same values of all control variables, except that one unit is treated and the other is not.

In this literature, the control variables used to match are often called covariates.

That is, for each treated unit, the researcher finds an untreated unit that is similar in all covariates.

The implication is that the researcher can argue that “units are comparable after matching”.

Matching

The easiest case to see is exact matching: it matches observations that have exactly the same covariate values.

  • It might be doable if you have only one covariate.

  • Naturally, if you have only one covariate, you might still be left with some selection bias.

    • In the previous example, health history is one important covariate that makes John and Mary different.

    • But what about life style? Nutrition? Etc.

As the number of covariates grows, you cannot pursue exact matching. That is the job of PSM.

Matching

In exact matching, the causal effect estimator (ATET) is:

\[ATET = \frac{1}{N_T} \sum_{i:D_i=1} \left(Y_{i} - Y_{j(i)}\right)\]

Where \(Y_{j(i)}\) is the outcome of the control unit \(j\) matched to treated unit \(i\) (the \(j\)-th unit being the one “closest to” the \(i\)-th unit on the covariate), and \(N_T\) is the number of treated units.

For instance, let’s say that a unit in the treatment group has a covariate with a value of 2 and we find another unit in the control group (exactly one unit) with a covariate value of 2.

Then we will impute the treated unit’s missing counterfactual with the matched unit’s outcome, and take the difference.
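
A minimal R sketch of this imputation logic with made-up numbers (the variable names and values are illustrative, not the ones in the dataset that follows):

R
# Hypothetical exact matching on a single covariate (age)
treated  <- data.frame(age = c(18, 22, 25), earnings_treat = c(9500, 10500, 11000))
controls <- data.frame(age = c(18, 20, 22, 25, 30), earnings_control = c(9000, 10100, 10200, 10800, 12000))
# For each treated unit, find the control unit with exactly the same age
matched <- merge(treated, controls, by = "age")
# Impute the missing counterfactual with the matched control's outcome and take the difference
matched$diff <- matched$earnings_treat - matched$earnings_control
mean(matched$diff)   # exact-matching ATET estimate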

Matching

Consider the following dataset from Cunningham:

Matching

Average ages are very different. The salary of a 24-year-old person is quite different from the salary of a 32-year-old person.

R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}
training_example <- read_data("training_example.dta") %>% slice(1:20)
summary(training_example$age_treat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18.00   20.25   23.00   24.30   28.50   33.00      10 
R
summary(training_example$age_control)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   23.50   31.50   31.95   39.00   51.00 
R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_example <- read_data("training_example.dta") %>% slice(1:20)

ggplot(training_example, aes(x=age_treat)) +
  stat_bin(bins = 10, na.rm = TRUE)

R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_example <- read_data("training_example.dta") %>% slice(1:20)

ggplot(training_example, aes(x=age_control)) +
  stat_bin(bins = 10, na.rm = TRUE)

R
# Load necessary packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_example <- read_data("training_example.dta") %>% slice(1:20)

ggplot(training_example, aes(x=age_matched)) +
  stat_bin(bins = 10, na.rm = TRUE)

Matching

In this example, you are literally finding the units in the control group that have the same age as the units in the treatment group.

You are exact matching 1-by-1 in this example.

You have only one covariate, i.e., age.

Distance Matching

Distance Matching

The last example was simple because you could exact match.

If you cannot find one exact match, you need an approximate match.

In order to do that, you have to use distance matching.

Distance matching minimizes the distance (i.e., how far the covariates are from each other) between the treatment and control groups.

Distance Matching

Euclidean distance = \(|X_i-X_j|=\sqrt{(X_i-X_j)'(X_i-X_j)}=\sqrt{\sum_{n=1}^k(X_{n,i}-X_{n,j})^2}\)

Distance Matching

Normalized Euclidean distance = \(|X_i-X_j|=\sqrt{(X_i-X_j)'\hat{V}^{-1}(X_i-X_j)}=\sqrt{\sum_{n=1}^k\frac{(X_{n,i}-X_{n,j})^2}{\hat{\sigma}^2_n}}\), where \(\hat{V}\) is the diagonal matrix of the sample variances \(\hat{\sigma}^2_n\).

The problem with this measure of distance is that the distance measure itself depends on the scale of the variables themselves.

For this reason, researchers typically will use some modification of the Euclidean distance, such as the normalized Euclidean distance, or they’ll use a wholly different alternative distance.

The normalized Euclidean distance is a commonly used distance, and what makes it different is that the distance of each variable is scaled by the variable’s variance.

Distance Matching

Mahalanobis distance = \(|X_i-X_j|=\sqrt{(X_i-X_j)'\hat{\Sigma}_X^{-1}(X_i-X_j)}\)

Where \(\hat{\Sigma}_X\) is the sample covariance matrix of X.
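
A short R sketch comparing the normalized Euclidean and Mahalanobis distances between one (simulated) treated unit and a set of simulated controls; base R's mahalanobis() returns the squared distance, so we take its square root:

R
set.seed(1)
X_controls <- cbind(age  = rnorm(100, mean = 35, sd = 10),
                    size = rnorm(100, mean = 5,  sd = 2))   # simulated control covariates
x_treated  <- c(age = 30, size = 6)                         # covariates of one treated unit
# Normalized Euclidean distance: each squared difference is scaled by the covariate's variance
v      <- apply(X_controls, 2, var)
d_norm <- sqrt(colSums((t(X_controls) - x_treated)^2 / v))
# Mahalanobis distance: scaled by the full sample covariance matrix of X
d_mahal <- sqrt(mahalanobis(X_controls, center = x_treated, cov = cov(X_controls)))
# Closest control unit under each metric
c(nearest_normalized = which.min(d_norm), nearest_mahalanobis = which.min(d_mahal))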

Distance Matching

Distance matching only goes so far…

The larger the dimensionality, the harder it is to use distance matching.

As the sample size increases, for a given number of covariates, the matching discrepancies tend to zero.

But the more covariates you have, the longer this convergence takes.

At the end of the day, it is preferable to have many covariates, but that makes distance matching harder.

Coarsened Exact Matching (CEM)

Coarsened Exact Matching (CEM)

In coarsened exact matching, something only counts as a match if it exactly matches on each matching variable.

The “coarsened” part comes in because, if you have any continuous variables to match on, you need to “coarsen” them first by putting them into bins, rather than matching on exact values.

Coarsening means creating bins. Fewer bins makes exact matches more likely.

CEM is not used much in empirical research in finance. It is used more in the big-data realm, when you have many variables to match on.
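
For reference, a minimal sketch of CEM using the MatchIt package (which also implements coarsened exact matching), applied to the dataset of the PSM example below; the age cutpoints are purely illustrative:

R
library(haven)
library(MatchIt)
data <- read_dta("files/cps1re74.dta")
# Coarsen the continuous covariate (age) into illustrative bins; then match exactly on the bins
cem_model <- matchit(treat ~ age + black + educ, data = data,
                     method = "cem",
                     cutpoints = list(age = c(20, 25, 30, 40, 50)))
summary(cem_model)   # balance before/after coarsened exact matching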

Propensity-score matching (PSM)

Propensity-score matching (PSM)

PSM is one way of matching using many covariates.

PSM aggregates all covariates into one score (propensity-score), which is the likelihood of receiving the treatment.

The idea is to match units that, based on observables, have the same probability (called propensity-score) of being treated.

The idea is to estimate a probit (the default in Stata) or a logit model (first stage):

\[P(D=1|X)\]

The propensity-score is the predicted probability of a unit being treated given all covariates X. The p-score is just a single number.
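
A minimal sketch of this first stage in R, using the dataset of the example below: a logit of the treatment dummy on the covariates, with the propensity score as the fitted probability:

R
library(haven)
data <- read_dta("files/cps1re74.dta")
# First stage: logit of treatment status on the covariates
first_stage <- glm(treat ~ age + black + educ, data = data,
                   family = binomial(link = "logit"))
# The propensity score is the predicted probability of treatment given X (one number per unit)
data$pscore <- predict(first_stage, type = "response")
summary(data$pscore)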

Propensity-score matching (PSM)

Considerations in PSM:

  1. How many neighbors to match? Nearest neighbor, radius, or kernel?

  2. With or without replacement?

  3. With or without common support?

  • Common support: imposes a common support by dropping treated observations whose pscore is higher than the maximum or lower than the minimum pscore of the controls.

  4. It is expected that, after PSM, you show the overlap of propensity scores (see the sketch below).
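
A sketch of the overlap check mentioned in the last item, reusing the logit first stage from the previous sketch (repeated here so the chunk is self-contained):

R
library(haven)
library(ggplot2)
data <- read_dta("files/cps1re74.dta")
pscore_model <- glm(treat ~ age + black + educ, data = data,
                    family = binomial(link = "logit"))
data$pscore  <- predict(pscore_model, type = "response")
# Common support check: compare the pscore distributions of treated and control units
ggplot(data, aes(x = pscore, fill = factor(treat))) +
  geom_density(alpha = 0.4) +
  labs(x = "Propensity score", fill = "Treated")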

Propensity-score matching (PSM)


The y-axis is the propensity-score.

Propensity-score matching (PSM)


Nearest-neighbor matching: each treated observation i is matched with the control observation j whose pscore is closest (\(\min|p_i-p_j|\)).

Propensity-score matching (PSM)


Kernel matching: Each treated observation i is matched with several control observations, with weights inversely proportional to the distance between treated and control observations.

Propensity-score matching (PSM)


Radius matching: Each treated observation i is matched with control observations j that fall within a specified radius.

\[|p_i-p_j| <r\]

Propensity-score matching (PSM)


Common support: Restrict matching only based on the common range of propensity scores.

Propensity-score matching (PSM)

Seems good overlap, but “good” is arbitrary.

Propensity-score matching (PSM)

Seems bad overlap

Propensity-score matching (PSM)

Seems good overlap, but “good” is arbitrary.

Propensity-score matching (PSM)

Seems bad overlap


Example

Example

Let’s practice with an example. 185 treated units vs 15,992 control units.

R
# Load necessary packages
# Load necessary libraries
library(haven)
library(psych)
data <- read_dta("files/cps1re74.dta")
summary_stats <- by(data, data$treat, FUN = function(group) {
  c(
    mean = mean(group$age, na.rm = TRUE),
    variance = var(group$age, na.rm = TRUE),
    skewness = skew(group$age, na.rm = TRUE),
    count = length(group$age)
  )
})
summary_df <- as.data.frame(do.call(rbind, summary_stats))
colnames(summary_df) <- c("mean", "variance", "skewness", "count")
print(summary_df)
      mean variance  skewness count
0 33.22524 121.9968 0.3477684 15992
1 25.81622  51.1943 1.1063375   185
Python
import pandas as pd
from scipy.stats import skew
import statsmodels.api as sm
data = pd.read_stata("files/cps1re74.dta")
grouped_data = data.groupby('treat')['age'].agg(['mean', 'var', lambda x: skew(x, nan_policy='omit'), 'count']).reset_index()
grouped_data.columns = ['treat', 'mean', 'variance', 'skewness', 'count']
print(grouped_data)
   treat       mean    variance  skewness  count
0      0  33.225238  121.996792  0.347801  15992
1      1  25.816216   51.194301  1.115369    185
Stata
use files/cps1re74.dta, clear
qui estpost tabstat age black educ , by(treat) c(s) s(me v sk n) nototal
esttab .    ,varwidth(20) cells("mean(fmt(3)) variance(fmt(3)) skewness(fmt(3)) count(fmt(0))") noobs nonumber compress 
(DW Subset of LaLonde Data)



------------------------------------------------------------
                                                            
                          mean  variance  skewness     count
------------------------------------------------------------
0                                                           
age                     33.225   121.997     0.348     15992
black                    0.074     0.068     3.268     15992
educ                    12.028     8.242    -0.423     15992
------------------------------------------------------------
1                                                           
age                     25.816    51.194     1.115       185
black                    0.843     0.133    -1.888       185
educ                    10.346     4.043    -0.721       185
------------------------------------------------------------

Example

Clearly, the treated group is younger, mainly black, and less educated.

Also note that the variance and skewness of the two subsamples are different.

If we were to use these two subsamples in any econometric analysis without preprocessing to make them comparable, we would likely have coefficients biased by selection bias.

Therefore, it is important to perform some matching method.

Let’s start with Propensity Score Matching (PSM). We will use the simplest matching, that is, without using any additional functions.

Example

Nearest neighbor, without replacement.

R
# install.packages("MatchIt")
library(haven)
library(psych)
library(MatchIt)
data <- read_dta("files/cps1re74.dta")
model <- matchit(treat ~ age + black + educ, data = data, method = "nearest")
summary(model)

Call:
matchit(formula = treat ~ age + black + educ, data = data, method = "nearest")

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.1445        0.0099          1.3615     7.5662    0.4987
age            25.8162       33.2252         -1.0355     0.4196    0.1863
black           0.8432        0.0735          2.1171          .    0.7697
educ           10.3459       12.0275         -0.8363     0.4905    0.0908
         eCDF Max
distance   0.7741
age        0.3427
black      0.7697
educ       0.4123

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.1445        0.1443          0.0020     1.0039    0.0002
age            25.8162       25.7081          0.0151     0.9244    0.0073
black           0.8432        0.8432          0.0000          .    0.0000
educ           10.3459       10.4054         -0.0296     0.7190    0.0117
         eCDF Max Std. Pair Dist.
distance   0.0162          0.0029
age        0.0270          0.1481
black      0.0000          0.0000
educ       0.0432          0.2554

Sample Sizes:
          Control Treated
All         15992     185
Matched       185     185
Unmatched   15807       0
Discarded       0       0
Stata
use files/cps1re74.dta, clear
psmatch2 treat age black educ , n(1) noreplacement
sum _weight , d
(DW Subset of LaLonde Data)


Probit regression                                       Number of obs = 16,177
                                                        LR chi2(3)    = 752.47
                                                        Prob > chi2   = 0.0000
Log likelihood = -634.83401                             Pseudo R2     = 0.3721

------------------------------------------------------------------------------
       treat | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0379829   .0042976    -8.84   0.000     -.046406   -.0295598
       black |   1.700675   .0793898    21.42   0.000     1.545074    1.856277
        educ |  -.0806962   .0142167    -5.68   0.000    -.1085604    -.052832
       _cons |  -.9031083   .2123282    -4.25   0.000    -1.319264   -.4869526
------------------------------------------------------------------------------

            psmatch2: weight of matched controls
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs                 370
25%            1              1       Sum of wgt.         370

50%            1                      Mean                  1
                        Largest       Std. dev.             0
75%            1              1
90%            1              1       Variance              0
95%            1              1       Skewness              .
99%            1              1       Kurtosis              .

Example

Notice that we are creating weights now

R
# install.packages("MatchIt")
library(haven)
library(MatchIt)
data <- read_dta("files/cps1re74.dta")
model <- matchit(treat ~ age + black + educ, data = data, method = "exact")
summary(model$weights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.1457  0.0000 52.7952 
Stata
use files/cps1re74.dta, clear
qui psmatch2 treat age black educ , kernel
sum _weight , d
(DW Subset of LaLonde Data)

            psmatch2: weight of matched controls
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0024355       .0007862
 5%     .0024375       .0024348
10%       .00244       .0024348       Obs              16,177
25%     .0024517       .0024348       Sum of wgt.      16,177

50%     .0024919                      Mean            .022872
                        Largest       Std. dev.      .1130791
75%     .0026476              1
90%     .0038379              1       Variance       .0127869
95%     .0876547              1       Skewness       7.604874
99%            1              1       Kurtosis       64.26276

Example

Now, the descriptive statistics are much closer

R
library(haven)
library(MatchIt)
#install.packages("e1071")
library(e1071)
data <- read_dta("files/cps1re74.dta")
model <- matchit(treat ~ age + black + educ, data = data, method = "exact")
matched_data <- match.data(model)
summary_stats <- by(matched_data, matched_data$treat, function(x) {
  c(mean(x$age), var(x$age), skewness(x$age), length(x$age))
})

result_df <- data.frame(
  Treatment = c("Control", "Treated"),
  Mean_Age = sapply(summary_stats, function(x) x[1]),
  Variance_Age = sapply(summary_stats, function(x) x[2]),
  Skewness_Age = sapply(summary_stats, function(x) x[3]),
  Count = sapply(summary_stats, function(x) x[4])
)
print(result_df)
  Treatment Mean_Age Variance_Age Skewness_Age Count
0   Control 25.58786     39.16294    0.7826941  2191
1   Treated 25.51807     51.59664    1.1298334   166
Stata
use files/cps1re74.dta, clear
qui psmatch2 treat age black educ , kernel
qui estpost tabstat age black educ [aweight = _weight], by(treat) c(s) s(me v sk n) nototal
esttab .    ,varwidth(20) cells("mean(fmt(3)) variance(fmt(3)) skewness(fmt(3)) count(fmt(0))") noobs  nonumber compress 
(DW Subset of LaLonde Data)




------------------------------------------------------------
                                                            
                          mean  variance  skewness     count
------------------------------------------------------------
0                                                           
age                     27.033    85.548     1.077     15992
black                    0.791     0.165    -1.434     15992
educ                    10.710     8.146    -0.883     15992
------------------------------------------------------------
1                                                           
age                     25.816    51.194     1.115       185
black                    0.843     0.133    -1.888       185
educ                    10.346     4.043    -0.721       185
------------------------------------------------------------

Entropy Balancing

Entropy Balancing

Here, instead of matching units, we reweight the observations such that the moments of the distributions (mean, variance, skewness) are similar.

  • The ebalance function implements a reweighting scheme. The user starts by choosing the covariates that should be included in the reweighting.

  • For each covariate, the user then specifies a set of balance constraints to equate the moments of the covariate distribution between the treatment and the reweighted control group.

  • The moment constraints may include the mean (first moment), the variance (second moment), and the skewness (third moment).

The outcome is a vector of weights for the observations, such that the weighted mean, weighted variance, and weighted skewness of the covariates in the control group are similar to those in the treatment group.

Entropy Balancing

R
library(haven)
#install.packages("ebal")
library(ebal)
data <- read_dta("files/cps1re74.dta")
treatment <-cbind(data$treat)
vars <-cbind(data$age, data$educ, data$black)
eb <- ebalance(treatment, vars)
Converged within tolerance 
R
# means in treatment group data
apply(vars[treatment==1,],2,mean)
[1] 25.8162162 10.3459459  0.8432432
R
# means in reweighted control group data
apply(vars[treatment==0,],2,weighted.mean,w=eb$w)
[1] 25.8163688 10.3460391  0.8431526
R
# means in raw data control group data
apply(vars[treatment==0,],2,mean)
[1] 33.22523762 12.02751376  0.07353677
Stata
use files/cps1re74.dta, clear
ebalance treat age black educ, targets(3)
(DW Subset of LaLonde Data)



Data Setup
Treatment variable:   treat
Covariate adjustment: age black educ (1st order). age black educ (2nd order). a
> ge black educ (3rd order).


Optimizing...
Iteration 1: Max Difference = 580799.347
Iteration 2: Max Difference = 213665.688
Iteration 3: Max Difference = 78604.7628
Iteration 4: Max Difference = 28918.6249
Iteration 5: Max Difference = 10640.1108
Iteration 6: Max Difference = 3915.82197
Iteration 7: Max Difference = 1442.09731
Iteration 8: Max Difference = 532.07826
Iteration 9: Max Difference = 197.376777
Iteration 10: Max Difference = 74.6380533
Iteration 11: Max Difference = 29.9524313
Iteration 12: Max Difference = 11.4337344
Iteration 13: Max Difference = 4.43722698
Iteration 14: Max Difference = 1.76899046
Iteration 15: Max Difference = .420548538
Iteration 16: Max Difference = .037814194
Iteration 17: Max Difference = .001164231
maximum difference smaller than the tolerance level; convergence achieved


Treated units: 185     total of weights: 185
Control units: 15992   total of weights: 185


Before: without weighting

             |              Treat              |        Control       
             |      mean   variance   skewness |      mean   variance 
-------------+---------------------------------+----------------------
         age |     25.82      51.19      1.115 |     33.23        122 
       black |     .8432      .1329     -1.888 |    .07354     .06813 
        educ |     10.35      4.043     -.7212 |     12.03      8.242 

             |  Control  
             |  skewness 
-------------+-----------
         age |     .3478 
       black |     3.268 
        educ |    -.4233 


After:  _webal as the weighting variable

             |              Treat              |        Control       
             |      mean   variance   skewness |      mean   variance 
-------------+---------------------------------+----------------------
         age |     25.82      51.19      1.115 |      25.8      51.17 
       black |     .8432      .1329     -1.888 |     .8421       .133 
        educ |     10.35      4.043     -.7212 |     10.34       4.04 

             |  Control  
             |  skewness 
-------------+-----------
         age |     1.122 
       black |    -1.877 
        educ |    -.7121 

Entropy Balancing

Stata
use files/cps1re74.dta, clear
qui ebalance treat age black educ, targets(3)
qui estpost tabstat age black educ [aweight = _webal], by(treat) c(s) s(me v sk n) nototal
esttab .    ,varwidth(20) cells("mean(fmt(3)) variance(fmt(3)) skewness(fmt(3)) count(fmt(0))") noobs  nonumber compress 
(DW Subset of LaLonde Data)




------------------------------------------------------------
                                                            
                          mean  variance  skewness     count
------------------------------------------------------------
0                                                           
age                     25.801    51.167     1.122     15992
black                    0.842     0.133    -1.877     15992
educ                    10.340     4.040    -0.712     15992
------------------------------------------------------------
1                                                           
age                     25.816    51.194     1.115       185
black                    0.843     0.133    -1.888       185
educ                    10.346     4.043    -0.721       185
------------------------------------------------------------

DiD Introduction

DiD Introduction

Let’s introduce DiD using the most famous example in the topic: John Snow’s 1855 findings that demonstrated to the world that cholera was spread by fecally-contaminated water and not via the air (Snow 1855)

  • Snow compared the periods of 1849 and 1854 to analyze the impact of a change in London’s water supply dynamics.
  • Various water companies, each drawing water from different sections of the Thames river, served the city’s water needs.
  • The downstream areas of the Thames, where some companies sourced their water, were susceptible to contamination due to the disposal of various substances, including fecal matter from cholera-infected individuals.
  • In the interim between 1849 and 1854, a pivotal policy was implemented: the Lambeth Company was mandated by an Act of Parliament to relocate its water intake upstream of London.

DiD Introduction

This is what happened.

Region supplier                          Death rates 1849   Death rates 1854
Non-Lambeth only (dirty)                            134.9              146.6
Lambeth + others (mix dirty and clean)              130.1               84.9

The specific DiD estimate we can get here is:

  • \((84.9-130.1)-(146.6-134.9)=-56.9\).

This resembles a “modern” DiD.

Shocks

Shocks

Shock-based designs use an external shock to limit selection bias.

They are very hard to find, but if you find one, you can reasonably estimate causal effects.

They are sources of exogenous variation in the X, which is crucial to causality when we have some of the problems discussed before.

A shock is often considered the first-best solution to causal inference (randomized control trials are the second-best).

In social sciences, we often cannot run randomized control trials; experiments are rarely feasible.

That is why we often explore the idea of “natural experiments” or “natural shocks” (I often use the terms interchangeably).

Shocks

A natural experiment is an exogenous variation that changes a variable in a random subset of firms.

  • Regulations, laws, etc.
  • Natural disasters.
  • Sudden death of CEOs or a product, etc.
  • The gender of a newborn.

Because the variation that occurs in x is truly exogenous, the CMI holds and thus we can infer causality.

Shocks

A good shock has some conditions (source):

  1. Shock Strength: The shock is strong enough to significantly change firm behavior or incentives.

  2. Exogenous Shock: The shock came from “outside” the system one is studying.

  • Treated firms did not choose whether to be treated,
  • cannot anticipate the shock,
  • the shock is expected to be permanent, and
  • there is no reason to believe that which firms were treated depends on unobserved firm characteristics.

If the shock is exogenous, or appears to be, we are less worried that unobservables might be correlated with both assignment to treatment and the potential outcomes, and thus generate omitted variable bias.

Shock exogeneity should be defended, not just assumed.

Shocks

A good shock has some conditions (source):

  3. “As If Random” Assignment: The shock must separate firms into treated and controls in a manner which is close to random.

  4. Covariate Balance: The forcing and forced variables aside, the shock should produce reasonable covariate balance between treated and control firms, including “common support” (reasonable overlap between treated and control firms on all covariates).

Somewhat imperfect balance can be addressed with balancing methods, but severe imbalance undermines shock credibility, even if the reason for the imbalance is not obvious. Covariate balance should be reported.

Shocks

A good shock has some conditions (source):

  5. Only-Through Condition(s): We must have reason to believe that the apparent effect of the shock on the outcome came only through the shock (sometimes, through a specific channel).

The shock must be “isolated”: there must be no other shock, at around the same time, that could also affect treated firms differently than control firms. And if one expects the shock to affect outcomes through a particular channel, the shock must also affect the outcome only through that channel.

In IV analysis, this is called an “exclusion restriction” or “only-through condition”, because one assumes away (excludes) other channels.

Shocks

The idea of a natural shock is often mixed up with the difference-in-differences (DiD) design.

It is more common than it should be that people say “shock” when they actually mean a DiD design.

A DiD design exploits a natural shock to estimate causal effects.

A good shock generates, by nature, random assignment.

Remember

Remember:

When dealing with causal inference, we have to find ways to approximate what the hidden potential outcome of the treated units is.

That is, the challenge in identifying causal effects is that the untreated potential outcomes, \(Y_{i,0}\), are never observed for the treated group (\(D_i= 1\)). The “second” term in the following equation:

ATET = \(\frac{1}{N} (E[Y_{i,1}|D_i=1] - E[Y_{i,0}|D_i=1])\)

We need an empirical design to “observe” what we do not really observe (i.e., the counterfactual).

Single differences

Single differences

There are two single differences we can use to estimate the causal effect of a treatment (i.e., a shock).

1) Single Cross-Sectional Differences After Treatment

2) Single Time-Series Difference Before and After Treatment

Single differences

1) Single Cross-Sectional Differences After Treatment

One approach to estimating a parameter that summarizes the treatment effect is to compare the post-treatment outcomes of the treatment and control groups.

This method is often used when there is no data available on pre-treatment outcomes.

It takes the form:

\[y=\beta_0+\beta_1d_i+\epsilon\]

where \(d_i\) is a dummy marking the units that are treated. The treatment effect is given by \(\beta_1\)

This is a cross-sectional comparison, using only post-treatment values.

Single differences

1) Single Cross-Sectional Differences After Treatment

You may add the interaction between the treatment dummy and the years.

\[y=\beta_0+\beta_1d_i\times year_1+\beta_2d_i\times year_2 +\beta_3d_i\times year_3+ . . . +\epsilon\]

This design allows the treatment effect to vary over time by interacting the treatment dummy with period dummies.

What is the endogeneity concern here?

The endogeneity concern is that these firms’ \(y\) were different and could become more different between the treated and control groups even if the shock had not happened.

That is, there is something in the residual that explains the differences in \(y\) that is correlated with the \(d\).

Single differences

2) Single Time-Series Difference Before and After Treatment

A second way to estimate the treatment effect is to compare the outcome after the treatment with the outcome before the treatment for just those units that are treated.

The difference from before is that you have data before the treatment, but you only have data for the treated units.

\[y=\beta_0+\beta_1 time+\epsilon\]

where \(time\) marks the years after the treatment. The treatment effect is given by \(\beta_1\)

Single differences

2) Single Time-Series Difference Before and After Treatment

You can also estimate a multi-year regression.

\[y=\beta_0+\beta_1 year_1 +\beta_2 year_2 +\beta_3 year_3+ \dots +\epsilon\]

What is the endogeneity concern here?

The endogeneity concern is that these firms’ \(y\) could have changed over the period of observation even if the shock had not happened.

That is, there is something in the error term that explains \(y\) that is correlated with the \(year\) dummies.

Single differences

The combination of the 1) Single Cross-Sectional Differences After Treatment and 2) Single Time-Series Difference Before and After Treatment is called Difference-in-Differences.

Difference-in-Differences

Difference-in-Differences

This is one of the most popular methods in social sciences for estimating causal effects in non-experimental settings.

The literature has been exploding in recent years.

There is one group that is treated, and another group that serves as the control.

There are two periods of time, before and after the treatment.

And the treated group receives the treatment in the second period.

The key identifying assumption is that the average outcome among the treated and comparison populations would have followed “parallel trends” in the absence of treatment.

Difference-in-Differences

The two single difference estimators complement one another.

The cross-sectional comparison avoids the problem of omitted trends by comparing two groups over the same time period.

The time series comparison avoids the problem of unobserved differences between two different groups of firms by looking at the same firms before and after the change.

The double difference, difference-in-differences (DD), estimator combines these two estimators to take advantage of both estimators’ strengths.

(Roberts and Whited, 2014)

Difference-in-Differences

Difference-in-differences methods overcome the identification challenge via assumptions that allow us to impute the mean counterfactual untreated outcomes for the treated group by using

    1. the change in outcomes for the untreated group and
    2. the baseline outcomes for the treated group.

The key assumption for identifying the causal effect is the parallel trends assumption, which intuitively states that the average outcome for the treated and untreated units would have evolved in parallel if treatment had not occurred.

Assumption 1 (Parallel Trends).

\(E[Y_{i,0,t=2}-Y_{i,0,t=1}|D_i=1] = E[Y_{i,0,t=2}-Y_{i,0,t=1}|D_i=0]\)

In other words: this condition means that, in the absence of treatment, the average change in \(y\) would have been the same for both the treatment and control groups.

Difference-in-Differences

The counterfactual is determined by the assumption of a parallel trend between the treated and control groups. MM

Difference-in-Differences

Difference-in-Differences

Inserting more periods.

Difference-in-Differences

A more realistic example. (The effect)

Difference-in-Differences

Assumption 2 (No anticipatory effects).

The treatment has no causal effect prior to its implementation.

\[Y_{i,0,t=1}=Y_{i,1,t=1}\]

Prior to the treatment (t=1), there is no effect of the treatment.
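
Combining the two assumptions, a short standard derivation shows why the difference between the two group-level changes identifies the effect on the treated. Using no anticipation to write the treated group’s pre-period outcome as \(Y_{i,0,t=1}\):

\[DiD = \left(E[Y_{i,1,t=2}|D_i=1]-E[Y_{i,0,t=1}|D_i=1]\right)-\left(E[Y_{i,0,t=2}|D_i=0]-E[Y_{i,0,t=1}|D_i=0]\right)\]

Adding and subtracting \(E[Y_{i,0,t=2}|D_i=1]\):

\[DiD = \underbrace{E[Y_{i,1,t=2}-Y_{i,0,t=2}|D_i=1]}_{ATET} + \underbrace{E[Y_{i,0,t=2}-Y_{i,0,t=1}|D_i=1]-E[Y_{i,0,t=2}-Y_{i,0,t=1}|D_i=0]}_{=0\ \text{by parallel trends}}\]

So, under parallel trends and no anticipation, the observed DiD equals the ATET.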

Estimation

Estimation

Let’s see how to implement a DiD model. The canonical DiD model is the following:

\[y_{i,t} = \beta_0 + \beta_1 time_t + \beta_2 treated_i + \beta_3 treated_i \times time_t + \epsilon_{i,t}\]

Where:

\(treated_i\) is a dummy marking the units that received the treatment.

\(time_t\) is a dummy marking the year of the treatment and the years after.

Estimation

Let’s see how to implement a DiD model. The canonical DiD model is the following:

\[y_{i,t} = \beta_0 + \beta_1 time_t + \beta_2 treated_i + \beta_3 treated_i \times time_t + \epsilon_{i,t}\]

\(\beta_0\) is the average \(y_{i,t}\) for the control units before the treatment (\(time_t\) =0 and \(treated_i\) = 0).

\(\beta_0+\beta_1\) is the average \(y_{i,t}\) for the control units after the treatment (\(time_t\) =1 and \(treated_i\) = 0).

\(\beta_0+\beta_2\) is the average \(y_{i,t}\) for the treated units before the treatment (\(time_t\) =0 and \(treated_i\) = 1).

\(\beta_0+\beta_1+\beta_2+\beta_3\) is the average \(y_{i,t}\) for the treated units after the treatment (\(time_t\) =1 and \(treated_i\) = 1).

Estimation

            | Post-treat. (1)                      | Pre-treat. (2)      | Diff. (1-2)
Treated (a) | \(\beta_0+\beta_1+\beta_2+\beta_3\)  | \(\beta_0+\beta_2\) | \(\beta_1+\beta_3\)
Control (b) | \(\beta_0+\beta_1\)                  | \(\beta_0\)         | \(\beta_1\)
Diff. (a-b) | \(\beta_2+\beta_3\)                  | \(\beta_2\)         | \(\beta_3\)

This is why we call it difference-in-differences.

\(\beta_3\) gives us an estimate of the treatment effect on the treated units (ATET).

This is very popular because you can run it as a regression.

Many papers simply show this table as the final result.

However, it is also possible to include additional covariates (i.e., controls) in a DiD design.

You can also modify this to test the timing of the treatment (discussed later).
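
A minimal simulated example of the canonical model in R (the data and the built-in treatment effect of 5 are purely illustrative; the interaction coefficient should recover the effect):

R
set.seed(42)
n  <- 500
df <- data.frame(
  id      = rep(1:n, times = 2),
  treated = rep(rep(c(0, 1), each = n / 2), times = 2),  # half the units are treated
  time    = rep(c(0, 1), each = n)                       # before / after the treatment
)
# Group gap of 3, common time trend of 2, treatment effect of 5
df$y <- 10 + 3 * df$treated + 2 * df$time + 5 * df$treated * df$time + rnorm(2 * n)
did <- lm(y ~ time + treated + time:treated, data = df)
summary(did)   # the coefficient on time:treated estimates the effect on the treated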

Controls in DiD

Controls in DiD

If you add controls, you have something in the form:

\[y_{i,t} = \beta_0 + \beta_1 time_t + \beta_2 treated_i + \beta_3 treated_i \times time_t + \beta_4x_{1,i,t}+ \beta_5x_{2,i,t}+ ...+ \epsilon_{i,t}\]

Important

Bad Controls problem: Never add a control that is also affected by the treatment.

Tip

Using pre-treatment values is good practice.

If you are including a good control, you should get lower standard errors for the estimated effects.

Controls in DiD

One way of adding more controls is by adding FEs.

\[y_{i,t} = \beta_0 + \beta_1 time_t + \beta_2 treated_i + \beta_3 treated_i \times time_t + FirmFE+ YearFE + \epsilon_{i,t}\]

They help control for unobserved heterogeneity at the firm-level and in each year.

But what do \(\beta_1\) and \(\beta_2\) mean now?

Answer: they are perfectly collinear with the \(YearFE\) and \(FirmFE\), respectively.

Reason: \(time_t\) does not vary within a year, and \(treated_i\) does not vary within a firm.

The software will drop the collinear terms, so \(\beta_1\) and \(\beta_2\) mean nothing.

I am guilty of this mistake because a referee asked and I couldn’t reply properly.

Controls in DiD

You will be better off if you estimate:

\[y_{i,t} = \beta_0 + \beta_3 treated_i \times time_t + FirmFE+ YearFE + \epsilon_{i,t}\]

This is called a generalized DiD.

The \(FirmFE\) allow each firm to have its own intercept (i.e., its own average \(y_{i,t}\)).

The \(YearFE\) accommodate a potential shock to \(y_{i,t}\) in each year.
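
A sketch of the generalized DiD on simulated data using the fixest package (one common implementation choice, not the only one; clustering by firm is an assumption of this sketch):

R
library(fixest)
set.seed(42)
n <- 500
panel <- data.frame(
  firm    = rep(1:n, times = 2),
  year    = rep(c(0, 1), each = n),
  treated = rep(rep(c(0, 1), each = n / 2), times = 2)
)
panel$y <- 10 + 3 * panel$treated + 2 * panel$year +
  5 * panel$treated * panel$year + rnorm(2 * n)
panel$treat_post <- panel$treated * panel$year
# Firm and year fixed effects absorb treated_i and time_t; only the interaction is estimated
gen_did <- feols(y ~ treat_post | firm + year, data = panel, cluster = ~firm)
summary(gen_did)   # treat_post recovers the same effect as the canonical beta_3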

Controls in DiD

Additionally, adding good controls can help keep the shock exogenous (i.e., as good as random) conditional on X.

Imagine that a financial shock is more likely to affect high-leverage firms than low-leverage ones.

  1. the shock will not change the leverage decision of each firm.

  2. we believe that high-leverage firms will have a different \(y_{i,t}\) after the treatment.

You may add leverage as a control to make sure the shock is random given the firms’ levels of leverage.

If you believe that leverage may change after the treatment, you can use pre-treatment values (interacted with the time dummies).

In this case, you are controlling for the pre-treatment condition but are not introducing the bias that might come from changes in leverage after the treatment.

Controls in DiD

If assignment is random, then including additional covariates should have a negligible effect on the estimated treatment effect.

Thus, a large discrepancy between the treatment effect estimates with and without additional controls raises a red flag.

If assignment to treatment and control groups is not random but dictated by an observable rule, then controlling for this rule via covariates in the regression satisfies the conditional mean zero assumption required for unbiased estimates.

Regardless of the motivation, it is crucial to remember that any covariates included as controls must be unaffected by the treatment, a condition that eliminates other outcome variables and restricts most covariates to pre-treatment values.

(Roberts and Whited, 2014)

Comment about DiD

Comment about DiD

(The effect)

Parallel trends means we have to think very carefully about how our dependent variable is measured and transformed.

Because parallel trends is an assumption about the size of a gap remaining constant, which means something different depending on how you measure that gap.

For example, if parallel trends holds for the dependent variable \(y\), then it generally does not hold for \(ln(y)\), and vice versa.

For example, say that in the pre-treatment period \(y\) is 10 for the control group and 20 for the treated group. In the post-treatment period, in the counterfactual world where treatment never happened,\(y\) would be 15 for the control group and 25 for the treated group. Gap of \(20-10=10\) before, and \(25-15=10\) after. Parallel trends holds!

However: \(ln(20)-ln(10)=0.693\) before, and \(ln(25)-ln(15)=0.511\) after. In logs, parallel trends does not hold.

Example of DiD

Example of DiD

R
# Load the necessary library
library(foreign)
data <- read.dta("files/kielmc.dta")
data$y81_nearinc <- data$y81 * data$nearinc
model1 <- lm(rprice ~ y81 + nearinc + y81_nearinc, data = data)
model2 <- lm(rprice ~ y81 + nearinc + y81_nearinc + age + agesq, data = data)
model3 <- lm(rprice ~ y81 + nearinc + y81_nearinc + age + agesq + intst + land + area + rooms + baths, data = data)
summary(model3)

Call:
lm(formula = rprice ~ y81 + nearinc + y81_nearinc + age + agesq + 
    intst + land + area + rooms + baths, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-76721  -8885   -252   8433 136649 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.381e+04  1.117e+04   1.237  0.21720    
y81          1.393e+04  2.799e+03   4.977 1.07e-06 ***
nearinc      3.780e+03  4.453e+03   0.849  0.39661    
y81_nearinc -1.418e+04  4.987e+03  -2.843  0.00477 ** 
age         -7.395e+02  1.311e+02  -5.639 3.85e-08 ***
agesq        3.453e+00  8.128e-01   4.248 2.86e-05 ***
intst       -5.386e-01  1.963e-01  -2.743  0.00643 ** 
land         1.414e-01  3.108e-02   4.551 7.69e-06 ***
area         1.809e+01  2.306e+00   7.843 7.16e-14 ***
rooms        3.304e+03  1.661e+03   1.989  0.04758 *  
baths        6.977e+03  2.581e+03   2.703  0.00725 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19620 on 310 degrees of freedom
Multiple R-squared:   0.66, Adjusted R-squared:  0.6491 
F-statistic: 60.19 on 10 and 310 DF,  p-value: < 2.2e-16
Stata
use "files/kielmc.dta", clear
gen y81_nearinc = y81 * nearinc
eststo:qui reg rprice y81 nearinc y81_nearinc 
eststo:qui reg rprice y81 nearinc y81_nearinc age agesq 
eststo:qui reg rprice y81 nearinc y81_nearinc age agesq intst land area rooms baths
esttab , compress
(est1 stored)

(est2 stored)

(est3 stored)


-------------------------------------------------
                 (1)          (2)          (3)   
              rprice       rprice       rprice   
-------------------------------------------------
y81          18790.3***   21321.0***   13928.5***
              (4.64)       (6.19)       (4.98)   

nearinc     -18824.4***    9397.9       3780.3   
             (-3.86)       (1.95)       (0.85)   

y81_near~c  -11863.9     -21920.3***  -14177.9** 
             (-1.59)      (-3.45)      (-2.84)   

age                       -1494.4***    -739.5***
                         (-11.33)      (-5.64)   

agesq                       8.691***     3.453***
                          (10.25)       (4.25)   

intst                                   -0.539** 
                                       (-2.74)   

land                                     0.141***
                                        (4.55)   

area                                     18.09***
                                        (7.84)   

rooms                                   3304.2*  
                                        (1.99)   

baths                                   6977.3** 
                                        (2.70)   

_cons        82517.2***   89116.5***   13807.7   
             (30.26)      (37.04)       (1.24)   
-------------------------------------------------
N                321          321          321   
-------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Regression Discontinuity Design (RDD)

Regression Discontinuity Design (RDD)

The regression-discontinuity (RD) research design is a quasi-experimental method.

Here, the treatment is not a simple binary assignment as before: treated units are assigned based on a cutoff score of a continuous variable.

In an RDD, the treatment assignment is not random…

… but it is based on the value of an observed covariate, with units lying on either side of the threshold.

Regression Discontinuity Design (RDD)

Example Mastering Metrics:

  • Age for drinking is 18 in Brazil.

Why prohibit drinking before 18?

The theory behind this effort is that legal drinking at age 18 discourages binge drinking and promotes a culture of mature alcohol consumption.

The cutoff splits who can drink and who can’t.

Regression Discontinuity Design (RDD)

The name RDD comes from a jump, a discontinuity that occurs in a continuous variable.

In its simplest form, the design has a:

  • The assignment variable (e.g., age),

  • Two groups (above and below the cutoff),

  • The outcome variable.

  • You may include nonlinearities and control variables.

Regression Discontinuity Design (RDD)

RDD is:

\[D_i = 1(X_i>c)\]

where:

  • \(D_i = 1 \;\text{if}\; X_i>c\)

  • \(D_i = 0 \;\text{if}\; X_i \leq c\)

\(X\) is called the forcing variable because it “forces” units into treatment or control groups.

\(X\) may be correlated with \(Y_1\) or \(Y_0\), so simply comparing treated and control units would not provide causal effects

However, if units are as good as randomly assigned to treatment around the cutoff, we can estimate causal effects.

Regression Discontinuity Design (RDD)

The main assumption that allows using RDD as a causal method is that, close to the cutoff, the participants are similar; the only difference is which “side” of the cutoff each individual falls on.

Regression Discontinuity Design (RDD)

  • The cutoff value occurs at 50

  • What are the differences between someone that scores 49.99 and 50.01 in the X variable?

  • The intuition is that these individuals are similar and comparable.

In the absence of treatment, the assumption is that the solid line would “continue” with the same inclination and values.

There is a discontinuity, however. This implies that, in the absence of the treatment, the outcome after the cutoff should follow the dashed line.

The discontinuity is the causal effect of the treatment on Y (at the cutoff).

Regression Discontinuity Design (RDD)

Unlike the matching and regression strategies based on treatment-control comparisons conditional on covariates, the validity of RDD is based on our willingness to extrapolate across values of the running variable, at least for values in the neighborhood of the cutoff at which treatment switches on. MM

Regression Discontinuity Design (RDD)

Again the drinking example:

In the US it is 21 years (age & alcohol). Example Mastering Metrics:

Notice the x-axis.

Regression Discontinuity Design (RDD)

In the US it is 21 years. Example Mastering Metrics:

Notice the x-axis.

Regression Discontinuity Design (RDD)

Examples Almeida et al 2016

Regression Discontinuity Design (RDD)

Examples Flammer 2015

Regression Discontinuity Design (RDD)

This is (legally, it should be) an example of a sharp RDD.

A sharp RDD occurs when the cutoff is mandatory. There are no exceptions. In this case, there are no 17-year-olds drinking and driving.

The treatment is

\[D_a = 1 \;\text{if}\; a \geq 18, \quad D_a = 0 \;\text{if}\; a < 18\]

The alternative is a fuzzy RDD, which occurs when there is some misassignment.

  • People below the cutoff also receive the treatment (or people above it do not).
  • E.g., students that score 5.96, just below the passing cutoff, are usually approved in a course (this is a misassignment).
  • To estimate a fuzzy RDD, you can use the above-cutoff indicator as an instrumental variable for the treatment (Angrist & Pischke, 2009).

Regression Discontinuity Design (RDD)

Fuzzy RDD (The Effect)

Regression Discontinuity Design (RDD)

fuzzy RDD

Compliers: Takes treatment if above threshold but not if below threshold.

Always-Takers: Always takes the treatment, ignores the cut.

Never-Takers: Never takes the treatment, ignores the cut.

Defiers: Takes treatment if below the threshold, does not take the treatment if above the threshold.

Regression Discontinuity Design (RDD)

Things that are “good” for an RDD:

  • Age
  • Dates (you need to be 6 years old to start school; 5.99 years is not allowed)
    • Great example: most NHL players are those with birthdays just after the enrollment cutoff date
  • Ranking systems
  • Location (when people cannot “move” easily)

Estimation

Estimation

A Sharp RDD can take the form of:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 \tilde{x} + \epsilon\]

Where \(D_a\) is the treatment based on the cutoff.

\(\tilde{x}\) is the running variable centered at the cutoff (the difference between the actual \(x\) and the cutoff \(x_0\)).

  • For instance, months until the 18th birthday.

  • We can also write this as \(\tilde{x}=x-x_0\).

Estimation

You also should trim the sample to a reasonable window around the cutoff \(c\):

  • Something like \(c-h<X_i<c+h\), where \(h\) is a positive value that determines the window.

  • There is no theoretical answer to how big \(h\) should be. It invites robustness tests (see the sketch below).
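
One way to act on this in practice is to re-estimate the model for several bandwidths \(h\) and check whether the estimated jump is stable. A sketch using the same files/RDD.xlsx data and cutoff (100) as the example later in this deck:

R
library(readxl)
data <- read_excel("files/RDD.xlsx")
cut  <- 100
data$treated <- as.numeric(data$x >= 101)   # same treatment definition as the example below
data$xhat    <- data$x - cut                # running variable centered at the cutoff
# Re-estimate the RDD jump for several bandwidths around the cutoff
for (h in c(20, 30, 40, 50)) {
  sub <- subset(data, x > cut - h & x <= cut + h)
  fit <- lm(y ~ xhat + treated + xhat:treated, data = sub)
  cat("h =", h, " estimated jump =", round(coef(fit)["treated"], 2), "\n")
}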

Estimation

Final Step:

Decide on the model of E[Y|X]:

  • linear in both “sides” with same slope

  • linear with different slopes

  • non-linear

Tip: always perform a visual inspection to check which model is appropriate.

Also, always estimate different models and discuss the results.

Estimation

Fuzzy RDD:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 \tilde{x} + \beta_3(Z \times \tilde{x}) + \epsilon\]

Where \(Z\) is 1 if the unit is above the cutoff and 0 if the unit is below it.

Notice that \(D\) is not necessarily equal to \(Z\) in these cases; \(Z\) serves as the instrument for \(D\).
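
A hedged sketch of the standard 2SLS implementation of this idea on simulated data, using ivreg() from the AER package (the data-generating process and the effect of 10 are purely illustrative):

R
library(AER)   # provides ivreg()
set.seed(7)
n    <- 1000
x    <- runif(n, 0, 200)
xhat <- x - 100                          # running variable centered at the cutoff
z    <- as.numeric(x >= 100)             # above-cutoff indicator (the instrument)
d    <- rbinom(n, 1, 0.2 + 0.6 * z)      # fuzzy take-up: crossing the cutoff raises P(treatment)
y    <- 50 + 10 * d + 0.3 * xhat - 0.1 * z * xhat + rnorm(n, sd = 5)
dta  <- data.frame(y, d, z, xhat)
# 2SLS: d is instrumented by z; xhat and z:xhat are exogenous controls
fuzzy_rdd <- ivreg(y ~ d + xhat + z:xhat | z + xhat + z:xhat, data = dta)
summary(fuzzy_rdd)   # the coefficient on d is the effect at the cutoff (about 10 here)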

RDD Nonlinearities

RDD Nonlinearities

Example Mastering Metrics:

This is a linear relationship, with the same slopes.

\[E[Y|X,D] = \alpha + \beta_1 \times D + \beta_2 \times \tilde{X}\]

RDD Nonlinearities

(The Effect) This is a linear relationship, with different slopes.

RDD Nonlinearities

Example Mastering Metrics:

This is a nonlinear relationship.

RDD Nonlinearities

Example Mastering Metrics:

Is the relationship linear or nonlinear here? If you misjudge the relationship, it will be hard to tell a story credibly.

RDD Nonlinearities

The takeaway: RDD is a graphical method. You need to show the graphs.

Nobody will believe your story without the correct specification of the model.

In the first case:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 x_1 + \epsilon\]

In the second case:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 x_1 + \beta_3 x_1^2 + \epsilon\]

RDD Nonlinearities

We can also add an interaction term (notice that I am changing the notation now to make it similar to MM)

\[Y_i = \alpha + \beta_1 D_a + \beta_2 (x - x_0) + \beta_3 (x - x_0) D_a + \epsilon\]

This allows for different inclinations before and after the cut.

RDD Nonlinearities

Or even different nonlinearities before and after the cut:

\[Y_i = \alpha + \beta_1 D_a + \beta_2 (x - x_0) + \beta_3 (x - x_0)^2 + \beta_4 (x - x_0) D_a + \beta_5 (x - x_0)^2 D_a + \epsilon\]

Example RDD

Example RDD

Clearly, this is not linear.

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
ggplot(data, aes(x, y))  + 
  geom_point(size=1.2) + 
  labs(y = "", x="", title = "Evolution of Y") +
  theme(plot.title = element_text(color="black", size=20, face="bold"),
        panel.background = element_rect(fill = "grey95", colour = "grey95"),
        axis.text.y = element_text(face="bold", color="black", size = 12),
        axis.text.x = element_text(face="bold", color="black", size = 12),
        legend.title = element_blank(),
        legend.key.size = unit(1, "cm")) +
    geom_smooth(method = "lm", fill = NA)

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read Excel file
data = pd.read_excel("files/RDD.xlsx")
# Generate line graph - Including all observations together
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
scatter_plot = sns.scatterplot(x='x', y='y', data=data, s=50)
scatter_plot.set_title("Evolution of Y", fontsize=20, fontweight='bold')
scatter_plot.set_xlabel("", fontsize=12, fontweight='bold')
scatter_plot.set_ylabel("", fontsize=12, fontweight='bold')
# Add regression line
sns.regplot(x='x', y='y', data=data, scatter=False, line_kws={'color': 'blue'})
plt.show()

Example RDD

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
# Creating  groups
data$treated <- 0
data$treated[data$x >= 101] <- 1  
# Generate a line graph - two groups
ggplot(data, aes(x, y, group=treated, color = factor(treated)))  + 
    geom_point( size=1.25) + 
    labs(y = "", x="", title = "RDD example")+
    theme(plot.title = element_text(color="black", size=25, face="bold"),
          panel.background = element_rect(fill = "grey95", colour = "grey95"),
          axis.text.y = element_text(face="bold", color="black", size = 16),
          axis.text.x = element_text(face="bold", color="black", size = 16),
          legend.title = element_blank(),
          legend.key.size = unit(2, "cm")) +
    geom_smooth(method = "lm", fill = NA)

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read Excel file
data = pd.read_excel("files/RDD.xlsx")
# Create treated variable
data['treated'] = 0
data.loc[data['x'] >= 101, 'treated'] = 1
# Generate line graph with two groups
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
scatter_plot = sns.scatterplot(x='x', y='y', hue='treated', style='treated', data=data, s=50)
scatter_plot.set_title("RDD exemplo", fontsize=25, fontweight='bold')
scatter_plot.set_xlabel("", fontsize=16, fontweight='bold')
scatter_plot.set_ylabel("", fontsize=16, fontweight='bold')
scatter_plot.legend().set_title('')
scatter_plot.legend(title='', loc='upper left', fontsize='small')
# Add regression lines
sns.regplot(x='x', y='y', data=data[data['treated'] == 0], scatter=False, line_kws={'color': 'blue'})
sns.regplot(x='x', y='y', data=data[data['treated'] == 1], scatter=False, line_kws={'color': 'orange'})
plt.show()

Example RDD

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
# Creating  groups
data$treated <- 0
data$treated[data$x >= 101] <- 1  
# define cut
cut <- 100
band <- 50
xlow = cut - band
xhigh = cut + band
# subset the data for the bandwidth
data <- subset(data, x > xlow & x <= xhigh, select=c(x, y,  treated))
# Generate a line graph - two groups
ggplot(data, aes(x, y, group=treated, color = factor(treated)))  + 
  geom_point( size=1.25) + 
  labs(y = "", x="", title = "RDD example")+
  theme(plot.title = element_text(color="black", size=25, face="bold"),
        panel.background = element_rect(fill = "grey95", colour = "grey95"),
        axis.text.y = element_text(face="bold", color="black", size = 16),
        axis.text.x = element_text(face="bold", color="black", size = 16),
        legend.title = element_blank(),
        legend.key.size = unit(2, "cm")) +
  geom_smooth(method = "lm", fill = NA)

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read Excel file
data = pd.read_excel("files/RDD.xlsx")
# Create treated variable
data['treated'] = 0
data.loc[data['x'] >= 101, 'treated'] = 1
# Define cut, band, and subset data
cut = 100
band = 50
xlow = cut - band
xhigh = cut + band
data = data[(data['x'] > xlow) & (data['x'] <= xhigh)][['x', 'y', 'treated']]
# Generate line graph with two groups
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
scatter_plot = sns.scatterplot(x='x', y='y', hue='treated', style='treated', data=data, s=50)
scatter_plot.set_title("RDD example", fontsize=25, fontweight='bold')
scatter_plot.set_xlabel("", fontsize=16, fontweight='bold')
scatter_plot.set_ylabel("", fontsize=16, fontweight='bold')
scatter_plot.legend(title='', loc='upper left', fontsize='small')
# Add regression lines
sns.regplot(x='x', y='y', data=data[data['treated'] == 0], scatter=False, line_kws={'color': 'blue'})
sns.regplot(x='x', y='y', data=data[data['treated'] == 1], scatter=False, line_kws={'color': 'orange'})
plt.show()

Example RDD

R
library(readxl)
library(ggplot2)
data  <- read_excel("files/RDD.xlsx")
data$treated <- 0
data$treated[data$x >= 101] <- 1  
cut <- 100
band <- 50
xlow = cut - band
xhigh = cut + band
data <- subset(data, x > xlow & x <= xhigh, select=c(x, y,  treated))
# Generating xhat - Now we are going to the RDD
data$xhat <- data$x - cut
# Generating xhat * treated to allow different slopes on each side of the cut (the previous graph suggests each group has a different trend)
data$xhat_treated <- data$xhat * data$treated
# RDD Assuming different trends
rdd <- lm(y  ~ xhat + treated  + xhat_treated, data = data)
summary(rdd)

Call:
lm(formula = y ~ xhat + treated + xhat_treated, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9477  -3.2607   0.6875   3.2227  12.2004 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  55.83059    1.53681  36.329  < 2e-16 ***
xhat          0.29431    0.05405   5.445 3.97e-07 ***
treated      28.93921    2.20672  13.114  < 2e-16 ***
xhat_treated -0.51587    0.07644  -6.749 1.13e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.515 on 96 degrees of freedom
Multiple R-squared:  0.8942,    Adjusted R-squared:  0.8909 
F-statistic: 270.3 on 3 and 96 DF,  p-value: < 2.2e-16
Python
import pandas as pd
import statsmodels.api as sm
data = pd.read_excel("files/RDD.xlsx")
# Generate treated variable
data['treated'] = 0
data.loc[data['x'] >= 101, 'treated'] = 1
# Define cut and band
cut = 100
band = 50
xlow = cut - band
xhigh = cut + band
# Subset data to the bandwidth (use .copy() to avoid chained-assignment warnings)
data = data[(data['x'] > xlow) & (data['x'] <= xhigh)].copy()
# Generate xhat and xhat_treated
data['xhat'] = data['x'] - cut
data['xhat_treated'] = data['xhat'] * data['treated']
# Regression
X = data[['xhat', 'treated', 'xhat_treated']]
X = sm.add_constant(X)  # Add a constant term
y = data['y']
model = sm.OLS(y, X).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.891
Method:                 Least Squares   F-statistic:                     270.3
Date:                ter, 30 jan 2024   Prob (F-statistic):           1.14e-46
Time:                        15:36:45   Log-Likelihood:                -310.60
No. Observations:                 100   AIC:                             629.2
Df Residuals:                      96   BIC:                             639.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           55.8306      1.537     36.329      0.000      52.780      58.881
xhat             0.2943      0.054      5.445      0.000       0.187       0.402
treated         28.9392      2.207     13.114      0.000      24.559      33.320
xhat_treated    -0.5159      0.076     -6.749      0.000      -0.668      -0.364
==============================================================================
Omnibus:                        0.995   Durbin-Watson:                   2.114
Prob(Omnibus):                  0.608   Jarque-Bera (JB):                1.079
Skew:                          -0.221   Prob(JB):                        0.583
Kurtosis:                       2.747   Cond. No.                         151.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Stata
import excel "files/RDD.xlsx", firstrow clear
gen treated = 0
replace treated = 1 if x >= 101
gen cut = 100
gen band = 50
gen xlow = cut - band
gen xhigh = cut + band
keep if x > xlow & x <= xhigh
gen xhat = x - cut
gen xhat_treated = xhat * treated
regress y xhat treated xhat_treated
(2 vars, 200 obs)


(100 real changes made)





(100 observations deleted)

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(3, 96)        =    270.35
       Model |  24669.3025         3  8223.10084   Prob > F        =    0.0000
    Residual |  2920.00749        96  30.4167447   R-squared       =    0.8942
-------------+----------------------------------   Adj R-squared   =    0.8909
       Total |    27589.31        99  278.679899   Root MSE        =    5.5151

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        xhat |   .2943097   .0540479     5.45   0.000     .1870255     .401594
     treated |   28.93921   2.206717    13.11   0.000     24.55891    33.31951
xhat_treated |  -.5158703   .0764353    -6.75   0.000    -.6675932   -.3641475
       _cons |   55.83059   1.536805    36.33   0.000     52.78005    58.88112
------------------------------------------------------------------------------

Example RDD

The slope of x before the cut is 0.29 (t-stat 5.45). The interaction term shows that the slope changes by -0.52 after the cut (t-stat -6.75), implying a post-cut slope of roughly 0.29 - 0.52 = -0.22.

We also have the coefficient of the treatment, which is measured by the “jump” that occurs at the cut: an estimated coefficient of 28.9 (t-stat 13.11).

If this were a real example, this would be the causal effect of receiving the treatment (i.e., being beyond the cut).

Final comments RDD

Final comments RDD

Sensitivity to specification

Misspecification can lead to a “spurious jump”. Do not confuse nonlinearity with a discontinuity.

Sensitivity to window

You also need to check several alternative bandwidths \(h\).

Smoothness of the running variable around the cut

The distribution of X should be smooth around the cut. If it is not, it might indicate that the treatment assignment was not random.

Test comparability of units around the cut

Covariate balance. You can check whether there are jumps in the control variables as an additional robustness check.

Placebo tests

Test whether the estimated effect is zero when it should be zero (e.g., at fake cutoffs where no treatment is assigned).
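
As a minimal sketch of a placebo-cutoff check, assuming the same files/RDD.xlsx data and variable names used above (the fake cuts at 75 and 125 and the narrower bandwidth of 25 are illustrative choices, picked so the placebo windows do not span the true cut):

Python
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel("files/RDD.xlsx")

def rdd_estimate(df, cut, band=25):
    # Keep observations within the bandwidth around the (possibly fake) cut
    sub = df[(df['x'] > cut - band) & (df['x'] <= cut + band)].copy()
    sub['treated'] = (sub['x'] >= cut + 1).astype(int)
    sub['xhat'] = sub['x'] - cut
    sub['xhat_treated'] = sub['xhat'] * sub['treated']
    X = sm.add_constant(sub[['xhat', 'treated', 'xhat_treated']])
    fit = sm.OLS(sub['y'], X).fit()
    return fit.params['treated'], fit.pvalues['treated']

# True cut (should show a jump) vs. placebo cuts (should show nothing)
for c in [100, 75, 125]:
    est, p = rdd_estimate(data, cut=c)
    print(f"cut={c}: jump={est:.2f}, p-value={p:.3f}")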

Synthetic Control

Synthetic Control

It is a method to estimate the effect of events or policy interventions, often at an aggregate level (cities, states, etc.)

The event occurs often in only one unit.

It compares the evolution of the outcome for the treated unit to the evolution of the control group.

  • The control group contains many units.

The main limitation is often the selection of the control group, which can be quite ambiguous.

Synthetic Control

Abadie, Diamond, and Hainmueller (2010) apply synth by using a cigarette tax in California called Proposition 99.

In 1988, California passed comprehensive tobacco control legislation called Proposition 99.

Proposition 99 increased cigarette taxes by $0.25 a pack, spurred clean-air ordinances throughout the state, funded anti-smoking media campaigns, earmarked tax revenues to health and anti-smoking budgets, and produced more than $100 million a year in anti-tobacco projects.

Other states had similar control programs, and they were dropped from their analysis. Mastering Metrics

Synthetic Control

There was a trend before the treatment. How can we estimate the causal effect?

Synthetic Control

The goal is to select an optimal set of weights that, when applied to the rest of the country, produces the following figure:

Synthetic Control

The variables used to compute the weights are the following. You are creating weights such that, by weighting the other states, you can build a synthetic California. Notice that, so far, the product of this analysis is only two data points per period (treated and synthetic).

Synthetic Control

Tip: Synth is also a graphical method, so graphs like the following are common. This one shows the difference between the two series.

Inference

Inference

Notice that, so far, the product of this analysis is only two data points per period. How can you establish a “significant” causal effect?

Steps

  1. Apply synth to each state in the control group (also called “donor pool”).

  2. Obtain a distribution of placebos.

  3. Compare the gap for California to the distribution of the placebo gaps.

  4. Then, test whether the effect for the treated unit is large enough relative to the placebos (i.e., to the effect estimated for a placebo unit randomly selected).

Inference

Notice the bold line (treated unit) after the treatment. It is at the bottom.

Inference

Abadie, Diamond, and Hainmueller (2010) recommend iteratively dropping the states whose pre-treatment RMSPE is considerably different than California’s because as you can see, they’re kind of blowing up the scale and making it hard to see what’s going on. Mastering Metrics

Inference

The previous figure suggests the effect is large enough relative to the placebo effects.

  1. Compute the root mean squared prediction error (RMSPE):

\[RMSPE = \bigg (\dfrac{1}{T-T_0} \sum_{t=T_0+1}^T \bigg (Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt} \bigg )^2 \bigg )^{\tfrac{1}{2}}\]

It shows how far the synthetic predictions fall from the observed values (a Euclidean distance).

  2. Sort the ratio of post- to pre-treatment RMSPE in descending order.

  3. Calculate the p-value as \(\frac{Rank}{Total}\).

Basically, these steps tell you how extreme the treated unit’s gap is relative to the placebo gaps.
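
A minimal sketch of this ranking exercise in Python, assuming you have already computed a post/pre-treatment RMSPE ratio for the treated unit and for each placebo (the numbers below are made up for illustration):

Python
import pandas as pd

# Hypothetical post/pre-treatment RMSPE ratios: one for the treated unit,
# one for each placebo run on the donor pool (values are illustrative only)
ratios = pd.Series({
    "California": 12.1, "Utah": 3.4, "Nevada": 2.9, "Colorado": 2.2,
    "Connecticut": 1.8, "Montana": 1.5, "Georgia": 1.1,
    # ... the remaining donor-pool states would go here
})

# Sort in descending order and compute the exact p-value as Rank / Total
ranked = ratios.sort_values(ascending=False)
rank = ranked.index.get_loc("California") + 1
p_value = rank / len(ranked)
print(f"Rank {rank} of {len(ranked)} -> exact p-value = {p_value:.3f}")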

Inference

RMSPE: in the previous example, California has the largest increase in the error after the treatment. It ranks 1 out of 39 states, implying an exact p-value of \(\frac{1}{39}\approx 0.026\) (significant).

Example Synth

Example Synth

Stata
use files/synth_smoking.dta , clear
tsset state year
synth cigsale beer(1984(1)1988) lnincome retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) 
(Tobacco Sales in 39 US States)


Panel variable: state (strongly balanced)
 Time variable: year, 1970 to 2000
         Delta: 1 unit

-------------------------------------------------------------------------------
Synthetic Control Method for Comparative Case Studies
-------------------------------------------------------------------------------

First Step: Data Setup
-------------------------------------------------------------------------------
control units: for 38 of out 38 units missing obs for predictor lnincome in per
> iod 1970 -ignored for averaging
control units: for 38 of out 38 units missing obs for predictor lnincome in per
> iod 1971 -ignored for averaging
treated unit: for 1 of out 1 units missing obs for predictor lnincome in period
>  1970 -ignored for averaging
treated unit: for 1 of out 1 units missing obs for predictor lnincome in period
>  1971 -ignored for averaging
-------------------------------------------------------------------------------
Data Setup successful
-------------------------------------------------------------------------------
                Treated Unit: California
               Control Units: Alabama, Arkansas, Colorado, Connecticut,
                              Delaware, Georgia, Idaho, Illinois, Indiana,
                              Iowa, Kansas, Kentucky, Louisiana, Maine,
                              Minnesota, Mississippi, Missouri, Montana,
                              Nebraska, Nevada, New Hampshire, New Mexico,
                              North Carolina, North Dakota, Ohio, Oklahoma,
                              Pennsylvania, Rhode Island, South Carolina, South
                              Dakota, Tennessee, Texas, Utah, Vermont,
                              Virginia, West Virginia, Wisconsin, Wyoming
-------------------------------------------------------------------------------
          Dependent Variable: cigsale
  MSPE minimized for periods: 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
                              1980 1981 1982 1983 1984 1985 1986 1987 1988
Results obtained for periods: 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
                              1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
                              1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
                              2000
-------------------------------------------------------------------------------
                  Predictors: beer(1984(1)1988) lnincome retprice age15to24
                              cigsale(1988) cigsale(1980) cigsale(1975)
-------------------------------------------------------------------------------
Unless period is specified
predictors are averaged over: 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
                              1980 1981 1982 1983 1984 1985 1986 1987 1988
-------------------------------------------------------------------------------

Second Step: Run Optimization
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Optimization done
-------------------------------------------------------------------------------

Third Step: Obtain Results
-------------------------------------------------------------------------------
Loss: Root Mean Squared Prediction Error

---------------------
   RMSPE |  1.943233 
---------------------
-------------------------------------------------------------------------------
Unit Weights:

----------------------------
         Co_No | Unit_Weight
---------------+------------
       Alabama |           0
      Arkansas |           0
      Colorado |        .285
   Connecticut |        .101
      Delaware |           0
       Georgia |           0
         Idaho |           0
      Illinois |           0
       Indiana |           0
          Iowa |           0
        Kansas |           0
      Kentucky |           0
     Louisiana |           0
         Maine |           0
     Minnesota |           0
   Mississippi |           0
      Missouri |           0
       Montana |           0
      Nebraska |           0
        Nevada |        .245
 New Hampshire |           0
    New Mexico |           0
North Carolina |           0
  North Dakota |           0
          Ohio |           0
      Oklahoma |           0
  Pennsylvania |           0
  Rhode Island |           0
South Carolina |           0
  South Dakota |           0
     Tennessee |           0
         Texas |           0
          Utah |        .369
       Vermont |           0
      Virginia |           0
 West Virginia |           0
     Wisconsin |           0
       Wyoming |           0
----------------------------
-------------------------------------------------------------------------------
Predictor Balance:

------------------------------------------------------
                               |   Treated  Synthetic 
-------------------------------+----------------------
             beer(1984(1)1988) |     24.28   23.22596 
                      lnincome |  10.03176   9.867266 
                      retprice |  66.63684   65.40743 
                     age15to24 |  .1786624   .1825559 
                 cigsale(1988) |      90.1    92.6063 
                 cigsale(1980) |     120.2   120.3907 
                 cigsale(1975) |     127.1   126.7094 
------------------------------------------------------
-------------------------------------------------------------------------------

counter | pri_inf  | dual_inf  | pri_obj   | dual_obj  | sigfig | alpha  | nu 
----------------------------------------------------------------------------------
      0 | 8.29e+001 | 7.80e-006 | -1.26e+001 | -3.93e+002 |  0.000 | 0.0000 | 1.00e+002
      1 | 5.04e-001 | 4.74e-008 | -1.26e+001 | -7.02e+002 |  0.000 | 0.9939 | 3.05e-005
      2 | 2.85e-003 | 2.68e-010 | -1.25e+001 | -2.80e+001 |  0.000 | 0.9943 | 2.70e-006
      3 | 1.60e-004 | 1.51e-011 | -1.26e+001 | -1.34e+001 |  1.193 | 0.9438 | 5.40e-006
      4 | 1.57e-005 | 1.47e-012 | -1.26e+001 | -1.27e+001 |  2.000 | 0.9022 | 9.21e-007
      5 | 9.72e-006 | 9.13e-013 | -1.26e+001 | -1.27e+001 |  2.207 | 0.3806 | 6.37e-006
      6 | 2.91e-006 | 2.73e-013 | -1.26e+001 | -1.26e+001 |  2.714 | 0.7006 | 8.69e-007
      7 | 5.05e-007 | 4.75e-014 | -1.26e+001 | -1.26e+001 |  3.414 | 0.8263 | 8.90e-008
      8 | 1.70e-007 | 1.60e-014 | -1.26e+001 | -1.26e+001 |  3.881 | 0.6640 | 6.85e-008
      9 | 2.24e-008 | 2.10e-015 | -1.26e+001 | -1.26e+001 |  4.685 | 0.8682 | 3.46e-009
     10 | 2.57e-010 | 2.43e-017 | -1.26e+001 | -1.26e+001 |  6.512 | 0.9885 | 4.02e-012
     11 | 1.28e-012 | 1.87e-018 | -1.26e+001 | -1.26e+001 |  8.807 | 0.9950 | 1.14e-014
     12 | 6.45e-015 | 1.86e-018 | -1.26e+001 | -1.26e+001 | 11.104 | 0.9950 | 5.78e-017
     13 | 8.88e-016 | 2.02e-018 | -1.26e+001 | -1.26e+001 | 13.407 | 0.9950 | 2.91e-019
----------------------------------------------------------------------------------
optimization converged
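
As a rough illustration of what these unit weights do (this is not the actual synth reconstruction, which also uses the predictor weights and optimization), one could rebuild a synthetic California series by hand in Python, assuming the same files/synth_smoking.dta file with state, year, and cigsale columns and the non-zero weights reported above:

Python
import pandas as pd

# Read the smoking panel; read_stata is assumed to convert the state value
# labels to names (otherwise map the numeric codes to names first)
df = pd.read_stata("files/synth_smoking.dta")

# Non-zero unit weights reported by synth above
weights = {"Colorado": 0.285, "Connecticut": 0.101, "Nevada": 0.245, "Utah": 0.369}

# Pivot to a year x state matrix of cigarette sales
wide = df.pivot(index="year", columns="state", values="cigsale")

# Synthetic California = weighted average of the donor states
synthetic = sum(w * wide[s] for s, w in weights.items())
gap = wide["California"] - synthetic   # treated minus synthetic
print(gap.tail())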

Example Synth

Stata
use files/synth_smoking.dta , clear
synth cigsale beer lnincome(1980&1985) retprice cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) fig
quietly graph export figs/synth1.svg, replace

Example Synth

Authors using synthetic control must do more than merely run the synth command when doing comparative case studies.

They must find the exact p-values through placebo-based inference, check the quality of the pre-treatment fit, investigate the balance of the covariates used for matching, and check the validity of the model through placebo estimation (e.g., rolling back the treatment date).

Mastering Metrics

Mastering Metrics

In 1992, Texas expanded the operational capacity of its prison system.

Mastering Metrics

This is what happened.

Mastering Metrics

Synthetic Texas.

Mastering Metrics

Gap between Synthetic Texas and Texas.

Mastering Metrics

Building Placebos.

Mastering Metrics

Texas is the one in the far right tail.

Introduction IV

Introduction IV

Imagine the following model

\[Ln(wage)=\alpha + \beta_1 educ + \epsilon\]

We can argue that educ is correlated with ability, which is unobserved and therefore part of the error term.

educ in this case is “endogenous”.

\[Cov(educ, \epsilon) \neq 0\]

Introduction IV

The setup of an IV is

Suppose that we have an observable variable z that satisfies these two assumptions:

  1. z is uncorrelated with \(\epsilon\):

\[Cov(z, \epsilon) = 0\]

  2. z is correlated with x:

\[Cov(z, x) \neq 0\]

Then, we call z an instrumental variable for x, or sometimes simply an instrument for x.
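
A small simulation may help see these conditions in action. This is an illustrative sketch (not from the slides): ability is unobserved and sits in the error term, educ is correlated with it, and z shifts educ but not the error.

Python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000
ability = rng.normal(size=n)              # unobserved, ends up in the error term
z = rng.normal(size=n)                    # instrument: independent of ability
educ = 1.0 * z + 0.8 * ability + rng.normal(size=n)       # relevant, but endogenous
lwage = 0.1 * educ + 0.5 * ability + rng.normal(size=n)   # true effect of educ = 0.1

# OLS is biased upward because ability is omitted
ols = sm.OLS(lwage, sm.add_constant(educ)).fit()

# IV (Wald) estimator: Cov(z, y) / Cov(z, x)
beta_iv = np.cov(z, lwage)[0, 1] / np.cov(z, educ)[0, 1]
print(f"OLS: {ols.params[1]:.3f}  IV: {beta_iv:.3f}  (true = 0.100)")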

Introduction IV

Before we continue,

IV is not a model, it is an estimation method

I’ll call it a Design.

Do not say “I estimated an IV model” (which happens more often than it should).

Instrumental Variables

Instrumental Variables

Imagine that you have one independent variable that is “endogenous”:

  • \(Cov(x_k,\mu)\neq 0\)

  • You may have many other independent variables not “endogenous”

In this situation:

  • \(\beta_k\) is biased

  • The other betas will likely be biased as well, since the other Xs are usually correlated with \(x_k\)

Instrumental Variables

\(x_k\) is the endogenous variable.

  • It has “good” variation:

    • the part that varies that is not correlated with \(\mu\)
  • It has “bad” variation:

    • the part that varies that is correlated with \(\mu\)

Let’s assume now that you can find an instrument \(z\)

  • The instrument \(z\) is correlated with \(x_k\), but only with the "good" variation, not the "bad".

  • The instrument \(z\) does not explain \(y\) directly, only through \(x_k\).

    • This is the "only through" condition.

Instrumental Variables

Relevance Condition

The instrument \(z\) is correlated with \(x_k\).

  • This assumption is easy to test. Simply run a regression of \(x_k = f(z,\; all\;other\;Xs)\) and check the significance of \(\beta_z\).

  • This is called the first stage of an IV regression

    • Tip: Always show the beta coefficient and the R2 (even if low) of the first-stage.

Exclusion Condition

The instrument \(z\) is not correlated with \(\mu\).

  • That is \(cov(z,\mu) = 0\)

  • As with all correlations involving \(\mu\), you cannot test this assumption. You have to rely on theory and build a convincing argument for it.

Instrumental Variables

An example of IV Murray.

Instrumental Variables

An example of IV Murray.

Instrumental Variables

Mixtape

Good instruments should feel weird

Parents whose first two kids have the same gender are more likely to try for a third kid than parents with mixed-gender kids.

So, you may use the gender composition of the first two kids as an instrument for having a third kid when studying whether the mother goes back to the labor market.

Angrist’s example

Angrist’s example

Remember the Fuzzy RDD.

  • There is the treatment
  • There is the position (before or after the cut)

The position is an indication of whether the unit receives the treatment, but it does not determine it.

Thus, we can use the position as an IV for the treatment.

Angrist’s example

Mixtape

One of the more seminal papers in instrumental variables for the modern period is Angrist and Krueger (1991).

Their idea is simple and clever; a quirk in the United States educational system is that a child enters a grade on the basis of his or her birthday.

For a long time, that cutoff was late December. If children were born on or before December 31, then they were assigned to the first grade. But if their birthday was on or after January 1, they were assigned to kindergarten.

Thus two people—one born on December 31 and one born on January 1—were exogenously assigned different grades.

Students are allowed to leave school once they turn 16 (compulsory schooling ends at 16).

Angrist’s example

Mixtape

Angrist and Krueger had the insight that that small quirk was exogenously assigning more schooling to people born later in the year.

The person born in December would reach age 16 with more schooling than the person born in January.

Angrist’s example

Mixtape

What is the instrument here?

Angrist’s example

Mixtape

What is the instrument here?

The instrument is the quarter of birth.

  • People born in the 3rd and 4th quarter receive more education than others due to compulsory schooling.

Two-stage least squares (2SLS)

Two-stage least squares (2SLS)

One of the more intuitive instrumental variables estimators is the 2SLS.

The first stage is

\[x_k = \delta + \delta_1 z + \delta_2 x_1 + . . .+ \delta_n x_n + \mu\]

Then, you predict \(x_k\) using the first stage.

  • “Predict” means computing the fitted value of the dependent variable of the first stage (here, \(x_k\)) using the estimated coefficients

\[\hat{x_k} = \hat{\delta} + \hat{\delta_1} z + \hat{\delta_2} x_1 + . . . + \hat{\delta_n} x_n \]

Then, the second stage is:

\[y = \alpha + \beta_1 \hat{x_k} + \beta_2 x_1 + . . .+ \beta_n x_n + \mu\]

The idea using \(\hat{x_k}\) is that it represents only the variation that is not correlated with \(\mu\).
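
To make the two stages concrete, here is a sketch of the mechanics on simulated data (a manual two-step estimation; as discussed later, the second-stage standard errors from this manual approach are not valid, so use a proper IV routine in practice):

Python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
ability = rng.normal(size=n)
z = rng.normal(size=n)
x_k = 1.0 * z + 0.8 * ability + rng.normal(size=n)      # endogenous regressor
y = 0.1 * x_k + 0.5 * ability + rng.normal(size=n)      # true beta_1 = 0.1

# First stage: regress x_k on z (and any exogenous controls, omitted here)
first = sm.OLS(x_k, sm.add_constant(z)).fit()
x_hat = first.fittedvalues                               # predicted x_k

# Second stage: regress y on the predicted x_k
second = sm.OLS(y, sm.add_constant(x_hat)).fit()
print(f"Manual 2SLS beta_1: {second.params[1]:.3f} (true 0.100)")
# Caveat: the standard errors reported by `second` are NOT valid 2SLS
# standard errors; a dedicated IV routine computes them correctly.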

Instrumental Variables

We can write that:

\[\beta_1= \frac{Cov(z,y)}{Cov(z,x_k)}\]

  • It shows that \(\beta_1\) is the population covariance between z and y divided by the population covariance between z and x.

  • Notice how this fails if z and x are uncorrelated, that is, if \(Cov(z, x) = 0\)

\[\beta_1= \frac{\sum_{i=1}^n(z_i-\bar{z})(y_i-\bar{y})}{\sum_{i=1}^n(z_i-\bar{z})(x_i-\bar{x})}\]

Notice that if \(z\) and \(x\) are the same (i.e., perfect correlation), \(\beta_1\) above reduces to the OLS \(\beta\):

\[\beta_1= \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}\]
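
A quick numeric sanity check of this last point (illustrative only, on simulated data): plugging z = x into the IV formula returns the OLS slope.

Python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)
y = 2.0 + 0.5 * x + rng.normal(size=1_000)

def iv_slope(z, x, y):
    # beta_1 = sum (z - zbar)(y - ybar) / sum (z - zbar)(x - xbar)
    return ((z - z.mean()) * (y - y.mean())).sum() / ((z - z.mean()) * (x - x.mean())).sum()

ols_slope = np.polyfit(x, y, 1)[0]
print(f"IV formula with z = x: {iv_slope(x, x, y):.4f}   OLS slope: {ols_slope:.4f}")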

Weak Instruments

Weak Instruments

I did not demonstrate it here, but we can say that (see pp. 517-18 of Wooldridge):

  • though IV is consistent when \(z\) and \(\mu\) are uncorrelated and \(z\) and \(x\) have any positive or negative correlation, IV estimates can have large standard errors, especially if \(z\) and \(x\) are only weakly correlated.

This gives rise to the weak instruments problem.

We can write the probability limit of the IV estimator:

\[plim \hat{\beta_{iv}} = \beta_1 + \frac{Corr(z,\mu)}{Corr(z,x)} \times \frac{\sigma_{\mu}}{\sigma_x}\]

It shows that, even if \(Corr(z,\mu)\) is small, the inconsistency in the IV estimator can be very large if \(Corr(z,x)\) is also small.

Weak Instruments

Three tips:

  1. Look at the F-stat of the first stage. It should not be low.

  2. The sign of the coefficient in the first stage should be as expected.

  3. Look at the standard errors. When the instrument is weak, the standard errors are larger than they should be.

One more tip:

  • Avoid computing the first stage by hand; it gives wrong standard errors in the second stage.
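
A sketch of checking first-stage strength in Python, assuming the mroz data used in the example that follows (files/mroz.dta), with educ instrumented by fatheduc:

Python
import pandas as pd
import statsmodels.api as sm

data = pd.read_stata("files/mroz.dta").dropna(subset=["lwage", "educ", "fatheduc"])

# First stage: educ on the instrument (plus any exogenous controls, none here)
X = sm.add_constant(data["fatheduc"])
first = sm.OLS(data["educ"], X).fit()

# With a single instrument and no other controls, the model F-stat equals the
# squared t-stat on the instrument; a common rule of thumb is F > 10.
# This should roughly match the first-stage F reported by Stata below.
print(f"First-stage F-stat: {first.fvalue:.1f}")
print(f"t-stat on fatheduc: {first.tvalues['fatheduc']:.2f}")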

Example

Example

OLS

Stata
use files/mroz.dta , clear
reg lwage educ 
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(1, 426)       =     56.93
       Model |  26.3264237         1  26.3264237   Prob > F        =    0.0000
    Residual |  197.001028       426  .462443727   R-squared       =    0.1179
-------------+----------------------------------   Adj R-squared   =    0.1158
       Total |  223.327451       427  .523015108   Root MSE        =    .68003

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .1086487   .0143998     7.55   0.000     .0803451    .1369523
       _cons |  -.1851969   .1852259    -1.00   0.318    -.5492674    .1788735
------------------------------------------------------------------------------

Example

The IV estimate of the return to education is 5.9%, which is barely more than one-half of the OLS estimate. This suggests that the OLS estimate is too high and is consistent with omitted ability bias.

Stata
use files/mroz.dta , clear
ivreg lwage (educ =fatheduc ) , first
First-stage regressions
-----------------------

      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(1, 426)       =     88.84
       Model |  384.841983         1  384.841983   Prob > F        =    0.0000
    Residual |  1845.35428       426  4.33181756   R-squared       =    0.1726
-------------+----------------------------------   Adj R-squared   =    0.1706
       Total |  2230.19626       427  5.22294206   Root MSE        =    2.0813

------------------------------------------------------------------------------
        educ | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    fatheduc |   .2694416   .0285863     9.43   0.000     .2132538    .3256295
       _cons |   10.23705   .2759363    37.10   0.000     9.694685    10.77942
------------------------------------------------------------------------------


Instrumental variables 2SLS regression

      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(1, 426)       =      2.84
       Model |  20.8673618         1  20.8673618   Prob > F        =    0.0929
    Residual |  202.460089       426  .475258426   R-squared       =    0.0934
-------------+----------------------------------   Adj R-squared   =    0.0913
       Total |  223.327451       427  .523015108   Root MSE        =    .68939

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0591735   .0351418     1.68   0.093    -.0098994    .1282463
       _cons |   .4411035   .4461018     0.99   0.323    -.4357311    1.317938
------------------------------------------------------------------------------
Instrumented: educ
 Instruments: fatheduc
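
For reference, a sketch of the same estimation in Python; this assumes the linearmodels package is installed and should roughly reproduce the Stata ivreg output above:

Python
import pandas as pd
from linearmodels.iv import IV2SLS

data = pd.read_stata("files/mroz.dta").dropna(subset=["lwage", "educ", "fatheduc"])

# lwage on educ, with educ instrumented by fatheduc (constant as the only exogenous regressor)
iv = IV2SLS.from_formula("lwage ~ 1 + [educ ~ fatheduc]", data).fit(cov_type="unadjusted")
print(iv.summary)
# The educ coefficient should be close to 0.059, matching the Stata output above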

Final Comments

Final Comments

When you have only one Z, you can say that the model is just identified.

But you can have multiple IVs, in which case you will say that the model is overidentified.

  • You can implement the IV design just as before

  • The relevance and exclusion assumptions are there as well.

Assuming that both conditions are satisfied, you will have more asymptotic efficiency in the IV estimates.

Final Comments

It is rare to have multiple Zs. You should be happy if you have a good one!

But if you do have multiple IVs, you can test their quality…

  • If they are all valid, you should get consistent estimates…

  • … even if you use only a subset of them.

  • So the test is about how similar the estimates are if you use subsets of IVs.

But this test does not really tell you whether the IVs are good.

That always comes from theory.

Final Comments

Some more comments:

  1. If you have an interaction term between \(x_1\) and \(x_2\), and \(z\) is the instrument for \(x_1\), you can “create” the instrument \(zx_2\) for the interaction \(x_1x_2\).
  2. GMM uses lagged variables as instruments. But this is not a good choice if the variables are highly serially correlated.
  • Lagged total assets is not a good instrument for total assets.
  3. Using the group average of a variable X as an instrument is also problematic (e.g., the industry-average ownership concentration as an IV for firm-level ownership concentration).
  • This is not much different from a group fixed effect, making it hard to believe the exclusion restriction.

THANK YOU!