Empirical Methods in Finance

Part 4

Henrique C. Martins

More about Regressions

Regression

Linear regressions are the workhorse tool in econometrics

  • Simplicity: straightforward to understand, implement, and visualize.
  • Interpretability: coefficients have clear interpretations.
    • A coefficient represents the change in Y for a one-unit change in X, holding all other variables constant.
  • Versatility: simple linear regression or multiple linear regression.
  • Assumptions: linearity, independence of errors, homoscedasticity, and normality of errors.
  • Baseline Model: You can compare the performance of more advanced models to linear regression.
  • Estimation: provides confidence intervals and hypothesis tests for the coefficients.

Regression

In this setting, the variables \(y\) and \(x\) can have several names.

Y                        X
Dependent variable       Independent variable
Explained variable       Explanatory variable
Response variable        Control variable
Predicted variable       Predictor variable
Regressand               Regressor

Regression

Broadly, we are interested in how y is explained by x.

  • \(y_i = \alpha + \beta_1 x_i + \epsilon\)

Perhaps \(\epsilon\) is the most important part of a regression.

The interpretation is “everything that is not explained by X and that explains Y”.

A comment

  • Usually, the literature uses \(\epsilon\) for the “estimated” residual.

  • And \(\mu\) for the “true” residual, which necessarily implies that the assumptions hold.

  • At the end of the day, you don't need to worry too much about the notation of this term because we are almost always in the "estimated world", and almost never in the "true world".

  • The "true world" implies that you are studying the population or that you have a true random sample of the population.

    • \(y_i = \alpha + \beta_1 x_i + \mu\)

Regression

Remember

  • \(y, x\), and \(\mu\) are random variables
  • \(y\) and \(x\) are observable
  • \(\mu\) and \(\beta\) are unobservable
  • \(\mu\) captures everything that determines y after accounting for x

Our goal is to estimate \(\beta\)

Regression

There are some assumptions/requirements about \(\mu\) in an OLS regression

First assumption

  • E(\(\mu\)) = 0

    • This is a simple assumption, not strong at all.
    • It simply assumes that the average of \(\mu\) is zero in the population.
    • Basically, any non-zero mean is absorbed by the intercept
      • Say that E(\(\mu\)) = k
      • We could rewrite \(\mu = k + w\), where E(w) = 0
      • Then, the model becomes \(y = (\alpha + k) + \beta x + w\)
      • The intercept is now just \((\alpha + k)\), and the error, \(w\), has mean zero (illustrated in the sketch below)
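A minimal simulation (in Python, with an assumed data-generating process where \(k = 3\)) illustrates the point: the slope is unaffected, and the intercept simply absorbs the nonzero mean of the error.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n = 10000
x = np.random.normal(size=n)
u = 3 + np.random.normal(size=n)   # error with E(u) = k = 3 (assumed for illustration)
y = 2 + 1.5 * x + u                # assumed "true" alpha = 2 and beta = 1.5

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)  # intercept close to 2 + 3 = 5, slope close to 1.5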

Regression

Second assumption

  • E(\(\mu\)|x) = E(\(\mu\)) for all values of x
    • It says that the average value of \(\mu\) does not depend on the value of x (i.e., the slice of the population you are looking at).
    • We say that \(\mu\) is mean-independent of x.
    • This is true if x and \(\mu\) are independent of each other.
    • It implies that x and \(\mu\) are uncorrelated.
    • This is the Conditional Mean Independence (CMI) assumption.
    • It is one of the key assumptions for causal inference.

Regression

Second assumption

Example

Let’s say the model is:

\[wage = \alpha + \beta Schooling_{years} + \epsilon\]

  • where \(\epsilon\) represents unobserved ability.

Does CMI hold?

That is, is E(ability|x=8) = E(ability|x=10) = E(ability|x=12)?

Probably not, because unobserved ability is likely related to the years of schooling.

A solution (not trivial in practice) would be to include ability as an additional X, as in the sketch below.
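A sketch of this problem, using simulated data under an assumed process in which ability raises both schooling and wages: omitting ability biases the schooling coefficient upward, while including it recovers a value close to the true one.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n = 10000
ability = np.random.normal(size=n)
schooling = 12 + 2 * ability + np.random.normal(size=n)   # ability drives schooling (assumed)
wage = 500 + 50 * schooling + 80 * ability + np.random.normal(scale=100, size=n)

# Omitting ability: E(error | schooling) != 0, so the schooling coefficient is biased
short = sm.OLS(wage, sm.add_constant(schooling)).fit()
# Including ability: CMI is restored and the coefficient is close to 50
long = sm.OLS(wage, sm.add_constant(np.column_stack([schooling, ability]))).fit()
print(short.params[1], long.params[1])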

Regression

Another example

Consider the following model (with only one X)

\[Leverage_i = \alpha + \beta_1 Profitability_i + \mu_i\]

  • CMI says that the expected value of \(\mu\) is the same for every firm i, even when firms have different profitability.

  • Can you think of examples where this assumption may not hold in this model?

  1. Unprofitable firms have higher bankruptcy risk, which should lead them to have lower leverage (tradeoff theory).

  2. Unprofitable firms have low cash, which should lead them to use more leverage (pecking order theory).

Regression

The discussion of whether the CMI holds is the origin of the “endogeneity” problem.

You will face reviewers arguing about why CMI might not hold in your case.

  • Many will criticize a model by saying it has an "endogeneity problem", without explaining further.

    • This is very generic. Don't do that!

They should explain the source of the problem that makes the model violate CMI.

  • OVB, selection bias, reverse causality, simultaneity, etc.?

Generally speaking, endogeneity refers to a violation of CMI, meaning that \(x\) and \(\mu\) are correlated.

This is always a plausible possibility, since \(\mu\) captures many things (something in there is likely correlated with X).

Regression

Third assumption

Combining 1 and 2 leads to

  • E(\(\mu\)|x) = 0
    • This is a very important assumption called zero conditional mean assumption.

Ordinary Least Squares

Our focus is to find estimates for \(\alpha\) and \(\beta\). If we had access to the population, this would be easy. We could write:

\[y_i= \alpha + \beta_1x_i + \mu\]

But remember that,

\[E(\mu) = 0\] \[E(\mu|x) = 0\]

The second condition implies that the correlation between x and \(\mu\) is zero (we can write that as \(E(x\mu) = 0\)).

Important

Remember: \(\alpha\) and \(\beta\) are parameters to be estimated (i.e., constants), while \(X\) and \(Y\) are variables

Regression

So we can write that (in the population)

\(E(y - \alpha - \beta_1x ) = 0\)

\(E(x[y - \alpha - \beta_1x ]) = 0\)

So we can write that (in the sample)

\(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta_1} x_i ) = 0\). We will use this to find \(\hat{\alpha}\).

\(\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \hat{\alpha} - \hat{\beta_1} x_i ]) = 0\). We will use this to find \(\hat{\beta_1}\).

Finding \(\alpha\)

From before:

\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta_1} x_i ) =0\]

Passing the sum operator through

\[\frac{1}{n}\sum_{i=1}^n(y_i) - \frac{1}{n}\sum_{i=1}^n(\hat{\alpha}) - \frac{1}{n} \sum_{i=1}^n(\hat{\beta_1} x_i )=0\]

Coefficients are constants, so \(\frac{1}{n}\sum_{i=1}^n \hat{\alpha} = \hat{\alpha}\) and \(\hat{\beta_1}\) can be moved outside the sum.

\[\frac{1}{n}\sum_{i=1}^n(y_i) - \hat{\alpha} - \hat{\beta_1} \frac{1}{n} \sum_{i=1}^n(\ x_i )=0\]

Finding \(\alpha\)

  • We know that \(\frac{1}{n}\sum_{i=1}^n(y_i)\) is \(\bar{y_i}\) (the mean)

\[\bar{y_i} - \hat{\alpha} - \hat{\beta_1} \bar{x_i}=0\]

So we write:

\[\hat{\alpha} = \bar{y_i} - \hat{\beta_1} \bar{x_i}\]

Easy Peasy 😀

Finding \(\beta\)

From before:

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \hat{\alpha} - \hat{\beta_1} x_i ]) = 0\]

But now we have:

\[\hat{\alpha} = \bar{y_i} - \hat{\beta_1} \bar{x_i}\]

Thus,

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - (\bar{y_i} - \hat{\beta_1} \bar{x_i}) - \hat{\beta_1} x_i ]) = 0\]

Finding \(\beta\)

Turning into

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \bar{y_i} ]) - \frac{1}{n} \sum_{i=1}^n (x_i [\hat{\beta_1} x_i - \hat{\beta_1} \bar{x_i} ]) = 0\]

Or

\[\frac{1}{n} \sum_{i=1}^n (x_i [y_i - \bar{y_i} ]) = \hat{\beta_1} \frac{1}{n} \sum_{i=1}^n (x_i [ x_i - \bar{x_i} ]) \]

Last step (I am skipping some steps here)

\[\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x} )(y_i - \bar{y_i} ) = \hat{\beta_1} \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x_i} )^2 \]

Finding \(\beta\)

If the variance of x is not zero, we can write \(\beta\) as

\[\hat{\beta_1} = \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_i^n (x_i - \bar{x})^2} = \frac{Covariance(x_i,y_i)}{Variance(x_i)}\]
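These closed-form expressions are easy to check numerically. The sketch below (simulated data with assumed "true" coefficients) compares the covariance/variance formula with the coefficients returned by a standard OLS routine.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
x = np.random.normal(size=10000)
y = 1 + 5.5 * x + 12 * np.random.normal(size=10000)   # assumed "true" alpha = 1, beta = 5.5

# beta_hat = sample covariance / sample variance; alpha_hat = ybar - beta_hat * xbar
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

ols = sm.OLS(y, sm.add_constant(x)).fit()
print(alpha_hat, beta_hat)   # manual estimates
print(ols.params)            # the same values from statsmodels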

Finding \(\mu\)

Now that we have \(\hat{Y_i}=\hat{\alpha} + \hat{\beta_1} X_i\), we can estimate the residual \(\mu\)

\[\hat{\mu} = Y_i - \hat{Y_i}\]

Which is the same as:

\[\hat{\mu} = Y_i - \hat{\alpha} - \hat{\beta_1}x_i\]

Most residuals will not be 0, meaning that most observations do not lie exactly on the fitted line.

Finding \(\mu\)

The job of OLS is to find the parameters that minimize the sum of squared residuals (i.e., to find the best-fitting line).

\[\sum_{i=1}^n \hat{\mu}^2 = \sum_{i=1}^n(Y_i - \hat{Y_i})^2\]
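As a quick numerical check (on simulated data) that the OLS estimates do minimize this sum, the sketch below compares the sum of squared residuals at the OLS coefficients with the sum at slightly perturbed coefficients.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
x = np.random.normal(size=10000)
y = 1 + 5.5 * x + 12 * np.random.normal(size=10000)

ols = sm.OLS(y, sm.add_constant(x)).fit()
a_hat, b_hat = ols.params

def sse(a, b):
    # sum of squared residuals for a candidate intercept a and slope b
    return np.sum((y - a - b * x) ** 2)

print(sse(a_hat, b_hat))                                  # smallest value
print(sse(a_hat + 0.1, b_hat), sse(a_hat, b_hat + 0.1))   # both are larger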

Regression (Source)

Another example of regression, using simulated data. The differences in the coefficients across R, Python, and Stata come from the random number generators: even with the same seed, each language draws different values.

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
summary(reg_tb)

Call:
lm(formula = y ~ x, data = tb)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.528  -8.152  -0.173   7.978  44.718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.04991    0.11890   -0.42    0.675    
x            5.55690    0.11745   47.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.89 on 9998 degrees of freedom
Multiple R-squared:  0.1829,    Adjusted R-squared:  0.1828 
F-statistic:  2238 on 1 and 9998 DF,  p-value: < 2.2e-16
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
np.random.seed(1)

obs = 10000
data = pd.DataFrame({
    'x': np.random.normal(size=obs),
    'u': np.random.normal(size=obs),
})
data['y'] = 5.5 * data['x'] + 12 * data['u']

X = sm.add_constant(data['x'])
model = sm.OLS(data['y'], X).fit()

print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.183
Model:                            OLS   Adj. R-squared:                  0.183
Method:                 Least Squares   F-statistic:                     2237.
Date:                qua, 11 set 2024   Prob (F-statistic):               0.00
Time:                        17:07:22   Log-Likelihood:                -39049.
No. Observations:               10000   AIC:                         7.810e+04
Df Residuals:                    9998   BIC:                         7.812e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1114      0.120      0.927      0.354      -0.124       0.347
x              5.6887      0.120     47.293      0.000       5.453       5.924
==============================================================================
Omnibus:                        0.640   Durbin-Watson:                   2.050
Prob(Omnibus):                  0.726   Jarque-Bera (JB):                0.672
Skew:                          -0.012   Prob(JB):                        0.715
Kurtosis:                       2.968   Cond. No.                         1.01
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
reg y x 
Number of observations (_N) was 0, now 10,000.

      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(1, 9998)      =   2206.80
       Model |  327141.413         1  327141.413   Prob > F        =    0.0000
    Residual |   1482125.8     9,998  148.242229   R-squared       =    0.1808
-------------+----------------------------------   Adj R-squared   =    0.1807
       Total |  1809267.22     9,999  180.944816   Root MSE        =    12.175

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   5.598296    .119172    46.98   0.000     5.364695    5.831897
       _cons |  -.0750109   .1217548    -0.62   0.538    -.3136748     .163653
------------------------------------------------------------------------------

Regression (Source)

The same simulated regression, now plotting the fitted line. Again, the differences in the coefficients across languages come from the random number generators.

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
tb %>% 
  lm(y ~ x, .) %>% 
  ggplot(aes(x=x, y=y)) + 
  ggtitle("OLS Regression Line") +
  geom_point(size = 0.05, color = "black", alpha = 0.5) +
  geom_smooth(method = lm, color = "black") +
  annotate("text", x = -1.5, y = 30, color = "red", 
           label = paste("Intercept = ", reg_tb$coefficients[1])) +
  annotate("text", x = 1.5, y = -30, color = "blue", 
           label = paste("Slope =", reg_tb$coefficients[2]))

Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

# Create a scatterplot with the OLS regression line using Seaborn
sns.set(style='whitegrid')
plt.figure(figsize=(7, 5))
sns.scatterplot(x='x', y='y', data=data, color='black', alpha=0.5, s=5)
sns.regplot(x='x', y='y', data=data, color='black', scatter=False, line_kws={'color':'black'})
plt.title('OLS Regression Line')
plt.annotate(f'Intercept = {model.params[0]:.2f}', xy=(-1.5, 30), color='red')
plt.annotate(f'Slope = {model.params[1]:.2f}', xy=(1.5, -30), color='blue')
plt.show()

Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
reg y x 
predict yhat1 
gen yhat2 = -0.0750109  + 5.598296*x 
predict uhat1, residual 
gen uhat2=y-yhat2 
qui sum uhat* 
twoway (lfit y x, lcolor(black) lwidth(medium)) (scatter y x, mcolor(black) msize(tiny) msymbol(point)), title(OLS Regression Line) 
qui graph export "files/graph3.svg", replace

Regression (Source)

Using the previous regressions, we can show that:

  1. \(\hat{y_i} = \hat{\alpha} + \hat{\beta_1} x_i\)

  2. \(\hat{\mu_i} = y_i - \hat{y_i}\)

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 

tb <- tb %>% 
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = reg_tb$coefficients[1] + reg_tb$coefficients[2]*x, 
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )
summary(tb[-1:-3])
     yhat1               yhat2               uhat1              uhat2         
 Min.   :-20.45096   Min.   :-20.45096   Min.   :-51.5275   Min.   :-51.5275  
 1st Qu.: -3.79189   1st Qu.: -3.79189   1st Qu.: -8.1520   1st Qu.: -8.1520  
 Median : -0.13842   Median : -0.13842   Median : -0.1727   Median : -0.1727  
 Mean   : -0.08624   Mean   : -0.08624   Mean   :  0.0000   Mean   :  0.0000  
 3rd Qu.:  3.71578   3rd Qu.:  3.71578   3rd Qu.:  7.9778   3rd Qu.:  7.9778  
 Max.   : 21.12342   Max.   : 21.12342   Max.   : 44.7176   Max.   : 44.7176  
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
np.random.seed(1)
obs = 10000
x = np.random.normal(size=obs)
u = np.random.normal(size=obs)
y = 5.5 * x + 12 * u

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

tb = pd.DataFrame({'x': x, 'u': u, 'y': y})
tb['yhat1'] = model.predict(X)
tb['uhat1'] = y - tb['yhat1']
tb['yhat2'] = model.params[0] + model.params[1] * x
tb['uhat2'] = y - tb['yhat2']

print(tb[['yhat1','yhat2', 'uhat1','uhat2']].describe())
              yhat1         yhat2         uhat1         uhat2
count  10000.000000  10000.000000  1.000000e+04  1.000000e+04
mean       0.166975      0.166975 -1.691092e-16 -1.691092e-16
std        5.682040      5.682040  1.201339e+01  1.201339e+01
min      -20.688875    -20.688875 -4.142425e+01 -4.142425e+01
25%       -3.659775     -3.659775 -8.199882e+00 -8.199882e+00
50%        0.159473      0.159473  4.497835e-02  4.497835e-02
75%        3.933075      3.933075  8.147307e+00  8.147307e+00
max       23.018769     23.018769  5.000751e+01  5.000751e+01
Stata
set seed 1 
clear 
qui set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
qui reg y x 
predict uhat1, residual 
predict yhat1 
gen yhat2 = -0.0750109  + 5.598296*x 
gen uhat2=y-yhat2 
sum yhat*  uhat* 
(option xb assumed; fitted values)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       yhat1 |     10,000   -.0797351    5.719914  -23.06927   22.47639
       yhat2 |     10,000   -.0797351    5.719913  -23.06927   22.47639
       uhat1 |     10,000   -3.53e-09    12.17487   -49.8996   42.15174
       uhat2 |     10,000   -1.72e-08    12.17487   -49.8996   42.15174

Properties of OLS

Properties of OLS

We can easily show that (remember from before)

\[\sum_i^n(y_i - \hat{\alpha} - \hat{\beta_1} x_i) = 0\]

And that

\[\sum_i^n \hat{\mu} = 0\]

The graph on the next slide shows a spherical (patternless) cloud of points, suggesting that the residuals are not correlated with the fitted values (\(\hat{y_i}\))

Properties of OLS

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
tb <- tb %>% 
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = reg_tb$coefficients[1] + reg_tb$coefficients[2]*x, 
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )
tb %>% 
  lm(uhat1 ~ yhat1 , .) %>% 
  ggplot(aes(x=yhat1, y=uhat1)) + 
  geom_point(size = 0.1, color = "black") +
  geom_smooth(method = lm, color = "black")

Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(1)
obs = 10000
x = np.random.normal(size=obs)
u = np.random.normal(size=obs)
y = 5.5 * x + 12 * u

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

tb = pd.DataFrame({'x': x, 'u': u, 'y': y})
tb['yhat1'] = model.predict(X)
tb['uhat1'] = y - tb['yhat1']
tb['yhat2'] = model.params[0] + model.params[1] * x
tb['uhat2'] = y - tb['yhat2']
model = sm.OLS(tb['uhat1'], sm.add_constant(tb['yhat1'])).fit()
# Create a scatter plot with a regression line
sns.set(style="whitegrid")
plt.figure(figsize=(7, 4.5))
sns.scatterplot(x='yhat1', y='uhat1', data=tb, size=0.05, color='black', alpha=0.5)
sns.regplot(x='yhat1', y='uhat1', data=tb, scatter=False, color='black')
plt.xlabel('yhat1')
plt.ylabel('uhat1')
plt.title('Scatter Plot of uhat1 vs. yhat1')
plt.show()

Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
qui reg y x 
predict yhat1 
predict uhat1, residual 
twoway (lfit uhat1 yhat1 , lcolor(black) lwidth(large)) (scatter uhat1 yhat1 , mcolor(black)  msymbol(point))
qui graph export "files/graph4.svg", replace

Properties of OLS

Similarly, we can easily show that

\[\sum_i^nx_i(y_i - \hat{\alpha} - \hat{\beta_1} x_i) = 0\]

And that

\[\sum_i^nx_i\hat{\mu} = 0\]

Meaning that the sample covariance between X and the residuals will always be zero.

Properties of OLS

R
library(tidyverse)
set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u # notice that I am defining the beta1 here. The 5.5 is the "true" beta we want to estimate.
) 
reg_tb <- lm(y ~ x, data=tb) 
tb <- tb %>% 
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = reg_tb$coefficients[1] + reg_tb$coefficients[2]*x, 
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )
tb %>% 
  lm(uhat1 ~ x , .) %>% 
  ggplot(aes(x=x, y=uhat1)) + 
  geom_point(size = 0.1, color = "black") +
  geom_smooth(method = lm, color = "black")

Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(1)
obs = 10000
x = np.random.normal(size=obs)
u = np.random.normal(size=obs)
y = 5.5 * x + 12 * u

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

tb = pd.DataFrame({'x': x, 'u': u, 'y': y})
tb['yhat1'] = model.predict(X)
tb['uhat1'] = y - tb['yhat1']
tb['yhat2'] = model.params[0] + model.params[1] * x
tb['uhat2'] = y - tb['yhat2']
model = sm.OLS(tb['uhat1'], sm.add_constant(tb['yhat1'])).fit()
# Create a scatter plot with a regression line
sns.set(style="whitegrid")
plt.figure(figsize=(7, 4.5))
sns.scatterplot(x='x', y='uhat1', data=tb, size=0.05, color='black', alpha=0.5)
sns.regplot(x='x', y='uhat1', data=tb, scatter=False, color='black')
plt.xlabel('x')
plt.ylabel('uhat1')
plt.title('Scatter Plot of uhat1 vs. x')
plt.show()

Stata
set seed 1 
set obs 10000 
gen x = rnormal() 
gen u  = rnormal() 
gen y  = 5.5*x + 12*u 
qui reg y x 
predict yhat1 
predict uhat1, residual 
twoway (lfit uhat1 x , lcolor(black) lwidth(large)) (scatter uhat1 x , mcolor(black) msymbol(point))
qui graph export "files/graph5.svg", replace

Properties of OLS

Let’s say you estimate a model and find the \(\hat{\mu}\).

If you calculate the correlation between the X and \(\hat{\mu}\), you will find zero.

This is by construction! It is not evidence that CMI is not violated.

In fact, the OLS "assumes" and "forces" zero correlation in the sample.

It is intuitive: if you are "forcing" zero correlation when the true correlation is not zero, your coefficients will be biased.

The previous graphs do show zero correlation, but that is expected by construction and does not tell us whether CMI is violated.

At the end of the day, CMI is untestable and unverifiable.
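A simulated example (with an assumed omitted variable that is correlated with x) makes the point concrete: the model is misspecified and the slope is biased, yet the sample correlation between x and the estimated residuals is still zero by construction.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n = 10000
z = np.random.normal(size=n)               # omitted variable
x = 0.8 * z + np.random.normal(size=n)     # x is correlated with z
y = 2 + 3 * x + 5 * z + np.random.normal(size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()  # misspecified: z is omitted
print(fit.params[1])                       # biased: well above the assumed "true" 3
print(np.corrcoef(x, fit.resid)[0, 1])     # essentially zero, despite the bias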

Goodness-of-fit

Goodness-of-fit

Understanding what SSR, SSE and SST mean

  • SSE = Sum of Squares Explained = \(\sum_i^n(\hat{y_i}-\bar{y})^2\)
  • SSR = Sum of Squares Residuals = \(\sum_i^n\hat{\mu_i}^2\)
  • SST = Sum of Squares Total = SSE + SSR = \(\sum_i^n(y_i-\bar{y})^2\)

R-squared is simply the ratio of portion explained over the total that could be explained.

\[R^2 = \frac{SSE}{SST} = 1-\frac{SSR}{SST}\]

Goodness-of-fit

Goodness-of-fit

You can think this way:

  1. If X does not explain Y, then the best predictor of Y is \(\bar{y}\). In that case, your model does not explain anything of Y, thus \(R^2\) is zero, and \(\hat{y_i}=\bar{y}\).
  2. If X partially explains Y, then \(\hat{y_i} \neq \bar{y}\), meaning that the fitted line has some slope (like the figure on the next slide). This means that \(SSE>0\) and \(0 < R^2 < 1\).
  3. Whatever is not explained by \(\hat{y_i}\) is left to \(\sum_i^n\hat{\mu_i}^2\), meaning that SSR will be non-zero.
  4. The \(R^2\) is the ratio of the portion you can explain with \(\hat{y_i}\) over the total that is to be explained, \(\sum_i^n(y_i-\bar{y})^2\).

Goodness-of-fit

R
library(foreign) # importing dataset from a stata dta file
data <- read.dta("files/CEOSAL1.dta")
attach(data)
# Statistics of salary 
mean(salary)
[1] 1281.12
R
# OLS model
model <- lm(salary ~ roe)
salaryhat <- fitted(model)                      # Predict values for dependent variable
uhat <- resid(model)                            # Predict regression residuals
salarymean <- rep(mean(salary),length(salary))  # Generating the mean of salary 
summary(model)

Call:
lm(formula = salary ~ roe)

Residuals:
    Min      1Q  Median      3Q     Max 
-1160.2  -526.0  -254.0   138.8 13499.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   963.19     213.24   4.517 1.05e-05 ***
roe            18.50      11.12   1.663   0.0978 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1367 on 207 degrees of freedom
Multiple R-squared:  0.01319,   Adjusted R-squared:  0.008421 
F-statistic: 2.767 on 1 and 207 DF,  p-value: 0.09777
Python
import pandas as pd
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
print(data['salary'].mean())
1281.1196172248804
Python
# OLS model
X = data['roe']
X = sm.add_constant(X)
y = data['salary']

model = sm.OLS(y, X).fit()  # Fit the linear regression model
salaryhat = model.fittedvalues  # Predicted values for the dependent variable
uhat = model.resid  # Predict regression residuals
salarymean = pd.Series([y.mean()] * len(y))  # Generating the mean of salary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                qua, 11 set 2024   Prob (F-statistic):             0.0978
Time:                        17:08:12   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Stata
use "files/CEOSAL1.DTA" , replace
sum salary 
reg salary roe 
predict salaryhat , xb              
predict uhat, resid                 
egen salarymean = mean(salary)      
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      salary |        209     1281.12    1372.345        223      14822

      Source |       SS           df       MS      Number of obs   =       209
-------------+----------------------------------   F(1, 207)       =      2.77
       Model |  5166419.04         1  5166419.04   Prob > F        =    0.0978
    Residual |   386566563       207  1867471.32   R-squared       =    0.0132
-------------+----------------------------------   Adj R-squared   =    0.0084
       Total |   391732982       208  1883331.64   Root MSE        =    1366.6

------------------------------------------------------------------------------
      salary | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         roe |   18.50119   11.12325     1.66   0.098    -3.428196    40.43057
       _cons |   963.1913   213.2403     4.52   0.000     542.7902    1383.592
------------------------------------------------------------------------------

Goodness-of-fit

R
library(foreign) # importing dataset from a stata dta file
mydata <- read.dta("files/CEOSAL1.dta")
attach(mydata)
model <- lm(salary ~ roe)
salaryhat <- fitted(model)                      # Predict values for dependent variable
uhat <- resid(model)                            # Predict regression residuals
salarymean <- rep(mean(salary),length(salary))  # Generating the mean of salary 
# r-squared is simply the ratio of portion explained over total that could be explained - Understanding what SSR, SSE and SST mean 
plot(salary ~ roe)
abline(lm(salary ~ roe), col = "blue")
abline(lm(salarymean ~ roe), col = "red")

Python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']
y = data['salary']
salarymean = np.repeat(y.mean(), len(y))
X_mean = X.mean()
y_mean = y.mean()
slope = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
intercept = y_mean - slope * X_mean
salaryhat = slope * X + intercept
# Plotting the data and regression lines
plt.scatter(X, y,  alpha=0.7)
plt.plot(X, salaryhat,  color='blue', linewidth=2)
plt.plot(X, salarymean, color='red',  linewidth=2)
plt.xlabel('roe')
plt.ylabel('salary')
plt.show()

Stata
use "files/CEOSAL1.DTA" , replace
sum salary , d
reg salary roe 
predict salaryhat , xb              
predict uhat, resid                 
egen salarymean = mean(salary)      

twoway (scatter salary roe) (lfit salary roe) (lfit salarymean roe) 
qui graph export "files/graph4_6.svg", replace

Goodness-of-fit

Manually calculating \(R^2\)

R
library(foreign) # importing dataset from a stata dta file
mydata <- read.dta("files/CEOSAL1.dta")
attach(mydata)
model <- lm(salary ~ roe)
salaryhat <- fitted(model)                      # Predict values for dependent variable
uhat <- resid(model)                            # Predict regression residuals
salarymean <- rep(mean(salary),length(salary))  # Generating the mean of salary 

# r-squared is simply the ratio of portion explained over total that could be explained
ssr  <- sum(uhat^2)
ssrB <- sum((salary    - salaryhat)^2)
sst  <- sum((salary    - salarymean)^2)
sse  <- sum((salaryhat - salarymean)^2)
sse / sst
[1] 0.01318862
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']
y = data['salary']
X = sm.add_constant(X)  # Add a constant term (intercept)
model = sm.OLS(y, X).fit()
salaryhat = model.fittedvalues
uhat = model.resid
salarymean = np.repeat(y.mean(), len(y))
# Calculate R-squared
ssr = np.sum(uhat**2)
ssrB = np.sum((y - salaryhat)**2)
sst = np.sum((y - salarymean)**2)
sse = np.sum((salaryhat - salarymean)**2)
rsquared = sse / sst
print(rsquared)
0.013188624081034118
Stata
use "files/CEOSAL1.DTA" , replace
qui reg salary roe 
predict salaryhat , xb              
predict uhat, resid                 
egen salarymean = mean(salary)      
egen sst  = total((salary    - salarymean)^2)  
egen ssr  = total((salary    - salaryhat)^2)
egen ssrB = total(uhat^2)                   
egen sse  = total((salaryhat - salarymean)^2)   
di sse / sst
.01318862

Variance of coefficients

Variance of coefficients

When we estimate coefficients we have some “error of estimation”.

  • Basically, you are searching for the "true" coefficient using a sample, which should be representative of the population but is not the population itself.

  • This means that the coefficient is estimated with error.

  • We would like (in fact, we will need) to impose some "structure" on that error.

Variance of coefficients

Standard error and T-stat

To assess whether the variables are significantly related, you need to assess the significance of the \(\beta\) coefficients.

Using the example from Wooldridge, we know that the beta of ROE is 18.501, while the standard error of ROE is 11.123.

  • The standard error is a measure of the accuracy of your estimate. If you find a large standard error, your estimate does not have good accuracy.

  • Ideally, you would find small standard errors, meaning that your coefficient is accurately estimated.

  • However, you do not have good control over the magnitude of the standard errors.

Variance of coefficients

Standard error and T-stat

If you have a large standard error, your coefficient will probably not be significantly different from zero. You can test whether your coefficient is significantly different from zero by computing the t-statistic as follows:

\[t_{\beta} = \frac{\hat{\beta}}{se(\hat{\beta})}\]

If \(t_{\beta}\) is large enough, you can say that \(\beta\) is significantly different from zero. Usually, a \(t_{\beta}\) larger than 2 in absolute value is enough to be significant.

Variance of coefficients

In the previous example, you can find the t-stat manually as follows (\(t_{\beta} =\frac{\hat{\beta}}{se(\hat{\beta})}\)):

R
library(foreign) # importing dataset from a stata dta file
data <- read.dta("files/CEOSAL1.dta")
attach(data)
# OLS model
model <- lm(salary ~ roe)
summary(model)$coefficients[2,1] / summary(model)$coefficients[2,2] 
[1] 1.663289
R
summary(model)$coefficients[2,3]
[1] 1.663289
Python
import pandas as pd
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
# OLS model
X = data['roe']
X = sm.add_constant(X)
y = data['salary']
model = sm.OLS(y, X).fit()  
# Extract and calculate specific coefficients
coef_beta = model.params['roe']
coef_std_error = model.bse['roe']
# Calculate t-value
t_value = coef_beta / coef_std_error
# Print the coefficient and t-value
print("Coefficient (beta):", coef_beta)
Coefficient (beta): 18.501186345214933
Python
print("Standard Error:", coef_std_error)
Standard Error: 11.123250903287634
Python
print("t-value:", t_value)
t-value: 1.6632894920806511
Stata
use "files/CEOSAL1.DTA" , replace
qui reg salary roe 
local beta = _b[roe]
local std_error = _se[roe]
local t_value = `beta' / `std_error'
display "Coefficient (beta): " `beta'
display "Standard Error: " `std_error'
display "t-value: " `t_value'
Coefficient (beta): 18.501186

Standard Error: 11.123251

t-value: 1.6632895

Variance of coefficients

Naturally, the previous analysis requires an estimate of \(\beta\) and an estimate of \(\beta\)'s standard error.

The standard error can be defined as:

\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]

  • Where \(\hat{\sigma}\) is the standard deviation of the error term in the regression, which can be calculated as:

\[\hat{\sigma} = \sqrt{\frac{SSR}{n-2}}\]

  • The \(n-2\) here is an adjustment for the degrees of freedom in the regression.
  • \(SST_x\) is the total sum of squares of x: \(\sum_i^n(x_i-\bar{x})^2\)

Variance of coefficients

R
library(foreign) # importing dataset from a stata dta file
data <- read.dta("files/CEOSAL1.dta")
attach(data)
# OLS model
model <- lm(salary ~ roe)
# Extract the standard error of the coefficient for 'roe'
summary(model)$coefficients["roe", "Std. Error"]
[1] 11.12325
R
#calculating manually
# Extract the residuals
residuals <- resid(model)
# Number of observations (n)
n <- length(residuals)
# Calculate the mean of the independent variable (roe)
roe_mean <- mean(roe)
# Calculate the sum of squared deviations of roe from its mean (SXX)
SST <- sum((roe - roe_mean)^2)
# Calculate the sum of squared errors (SSE)
SSR <- sum(residuals^2)
# Calculate the standard error of beta
Sd_beta <- sqrt(SSR / ((n - 2)))
# Calculate S.E
Se_beta <- Sd_beta / sqrt(SST)
# Print the standard error of beta
print(Se_beta)
[1] 11.12325
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_stata("files/CEOSAL1.dta")
X = data['roe']
y = data['salary']
X = sm.add_constant(X)  
model = sm.OLS(y, X).fit()
# Extract the standard error of the coefficient for 'roe'
beta_se_summary = model.bse['roe']
print("Standard Error (from summary):", beta_se_summary)
Standard Error (from summary): 11.123250903287634
Python
# Calculate it manually
# Extract the residuals
residuals = model.resid
# Number of observations (n)
n = len(residuals)
# Calculate the mean of the independent variable (roe)
roe_mean = X['roe'].mean()
# Calculate the sum of squared deviations of roe from its mean (SST)
SST = np.sum((X['roe'] - roe_mean) ** 2)
# Calculate the sum of squared errors (SSE)
SSE = np.sum(residuals ** 2)
# Calculate the standard error of beta (Sd_beta)
Sd_beta = np.sqrt(SSE / (n - 2))
# Calculate SE_beta
SE_beta = Sd_beta / np.sqrt(SST)
print("Standard Error (manually calculated):", SE_beta)
Standard Error (manually calculated): 11.123250601798315
Stata
use "files/CEOSAL1.DTA" , replace
qui reg salary roe 
gen beta_se_summary = _se[roe]
gen n = _N
predict residuals, residuals
sum roe, meanonly
gen roe_diff = roe - r(mean)
egen roe_diff_sq = total(roe_diff^2)
gen residuals_sq = residuals^2
egen residuals_sq_sum = total(residuals_sq)
gen Sd_beta = sqrt(residuals_sq_sum / (n - 2))
gen SE_beta = Sd_beta / sqrt(roe_diff_sq)
display "Standard Error (from summary): " sum(beta_se_summary)
display "Standard Error (manually calculated): " sum(SE_beta)
Standard Error (from summary): 11.123251

Standard Error (manually calculated): 11.123251

Variance of coefficients

Another comment:

\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]

  1. The larger \(\hat{\sigma}\) is, the larger the variance of \(\hat{\beta}\). That is, the more "noise" in the relationship between x and y, the harder it is to learn something about \(\beta\).

  2. However, the more variation in x, the larger \(SST_x\), and so the smaller the variance of \(\hat{\beta}\). Both effects are illustrated in the sketch below.
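A small simulation (with assumed values and a helper function written just for this sketch) illustrates both points: noisier errors inflate the standard error of \(\hat{\beta}\), while more variation in x shrinks it.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n = 1000

def se_of_beta(x_sd, u_sd):
    # helper for this sketch: simulate y = 1 + 2x + u and return the estimated s.e. of beta
    x = np.random.normal(scale=x_sd, size=n)
    y = 1 + 2 * x + np.random.normal(scale=u_sd, size=n)
    return sm.OLS(y, sm.add_constant(x)).fit().bse[1]

print(se_of_beta(x_sd=1, u_sd=1))   # baseline
print(se_of_beta(x_sd=1, u_sd=2))   # noisier errors -> larger s.e.
print(se_of_beta(x_sd=2, u_sd=1))   # more variation in x -> smaller s.e.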

Robust standard errors

Robust standard errors

Looking at both equations below:

\[t_{\beta} = \frac{\hat{\beta}}{se(\hat{\beta})}\]

\[se(\hat{\beta_1})=\frac{\hat{\sigma}}{\sqrt{SST_x}}\]

What happens if \(\hat{\sigma}\) is not constant across the values of x?

In other words, how realistic is it to assume that the variance of the errors is the same for all slices of x?

Can you think of an example where that may happen?

Robust standard errors

Earnings = f(Education)

PhDs have a higher variance of earnings than people with little education.

Leverage = f(Size)

It is quite possible that small firms have fewer leverage options than large companies.

This means that a sub-sample of large companies will have higher variance in leverage decisions (and thus in the error terms) than the sub-sample of small firms.

Robust standard errors

One of the key assumptions in OLS estimators is homoscedasticity

That is, the assumption is that the variance of the errors is homoscedastic (constant variance in all slices of X).

It means that throughout all observations, the error term shows the same variance.

If the errors are not homoskedastic, we have the heteroskedasticity problem.

Heteroskedasticity does not cause bias or inconsistency in the OLS estimators of \(\beta\) the way an omitted variable would.

It also does not affect the \(R^2\).

What heteroskedasticity does is bias the standard errors of the estimates.
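A simulated example (with errors whose variance grows with x, an assumed data-generating process) shows the consequence: the coefficient estimate is fine, but the default and the heteroskedasticity-robust standard errors differ.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n = 10000
x = np.random.uniform(0, 10, size=n)
u = np.random.normal(scale=0.5 + x, size=n)   # error variance increases with x
y = 1 + 2 * x + u

X = sm.add_constant(x)
default = sm.OLS(y, X).fit()                  # homoskedasticity-based s.e.
robust = sm.OLS(y, X).fit(cov_type='HC1')     # White (heteroskedasticity-robust) s.e.
print(default.params[1], default.bse[1], robust.bse[1])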

Robust standard errors

Robust standard errors

Robust standard errors

Homoskedasticity = constant \(\hat{\sigma}\) across all slices of X.

Heteroskedasticity = non-constant \(\hat{\sigma}\) across slices of X.

Without homoskedasticity, OLS no longer has the minimum variance among linear unbiased estimators, and the usual estimated standard errors are biased, which in turn biases the t-stats and the inference you will make with your model.

Fortunately, we have an easy solution for that.

\[Var(\hat{\beta_1}) = \frac{\sum_i^n(x_i-\bar{x})^2\hat{\mu_i}^2}{SST^2_x}\]

This formula simply "includes" the heteroskedasticity in the calculation of \(Var(\hat{\beta_1})\), meaning it corrects the estimated standard errors for heteroskedasticity.

We call this correction Robust Standard Errors (White robust).

In other words, you should always use robust standard errors. It is easy to do in R, Python, or Stata.

Robust standard errors

Using Robust Standard-errors.

R
library(sandwich)
library(foreign) 
library(lmtest)

data <- read.dta("files/CEOSAL1.DTA")
model <- lm(salary ~ roe, data = data)
robust_model <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
SE_beta_robust <- robust_model["roe", "Std. Error"]
cat("Robust Standard Error :", SE_beta_robust, "\n")
Robust Standard Error : 6.899434 
Python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
data = pd.read_stata("files/CEOSAL1.DTA")
model = smf.ols("salary ~ roe", data=data)
results = model.fit(cov_type='HC3')  
SE_beta_robust = results.bse['roe']
print("Robust Standard Error :", SE_beta_robust)
Robust Standard Error : 6.899433834797476
Stata
use "files/CEOSAL1.DTA" , replace

qui reg salary roe 
gen beta_se_non = _se[roe]

qui reg salary roe , robust
gen beta_se_summary = _se[roe]

di "Standard Error (non-robust): " sum(beta_se_non)
di "Standard Error (robust): " sum(beta_se_summary)
Standard Error (non-robust): 11.123251

Standard Error (robust): 6.8294482

Robust standard errors

Notice that the standard errors have changed quite significantly in this example.

Usually, the robust standard errors are larger than the traditional ones in empirical works.

But, in this example, they are smaller.

Perhaps more importantly:

Once the standard errors change, you should expect the t-stats of the estimates to change as well.

Final comment: robust standard errors remain valid even when the errors are in fact homoskedastic.

Warning

Thus, you should always use robust S.E.

Clustered standard errors

Clustered standard errors

Almost always, someone will ask you whether you clustered your standard errors.

The intuition is the following:

  • When you do not cluster, you are assuming that all observations are independently and identically distributed (i.i.d.), which may or may not be true.

  • Imagine you are studying the effect of class size on students' achievement.

  • How much of an effect would the teacher of a class have?

  • In this design, the teacher influences the achievement of all the students in the same class, and one teacher cannot be in two classes at the same time.

  • Thus, it would be wise to cluster the errors at the class level. This assumes that the residual of each individual is correlated with the residuals of the other individuals in the same class.

In principle, clustering addresses the dependence of the residuals within the clusters you define.
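In the spirit of the class-size example above, a simulation (with an assumed common teacher shock within each class and a class-level regressor) shows why ignoring the cluster structure understates the standard errors.

Python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
n_classes, n_students = 50, 30
cluster = np.repeat(np.arange(n_classes), n_students)

# class size varies at the class level; the teacher shock is shared within each class
class_size = np.repeat(np.random.normal(25, 5, size=n_classes), n_students)
teacher = np.repeat(np.random.normal(scale=2, size=n_classes), n_students)
score = 70 - 0.3 * class_size + teacher + np.random.normal(size=n_classes * n_students)

X = sm.add_constant(class_size)
naive = sm.OLS(score, X).fit()                                                     # assumes i.i.d. errors
clustered = sm.OLS(score, X).fit(cov_type='cluster', cov_kwds={'groups': cluster})
print(naive.bse[1], clustered.bse[1])   # the clustered s.e. is considerably larger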

Clustered standard errors

In corporate finance and accounting research with panel data, the tradition is to cluster at the firm level.

  • The reason is that observations of the same firm are not independent through time; they are correlated.

But there is a lot of debate about this decision.

The tip is to cluster where the randomness exists. That is quite subjective. In the class-size example, the randomness comes from the teacher, since each teacher has their own way of teaching (materials, resources, etc.).

But it is good practice to stress this decision a bit in your own research, for instance by also showing results with standard errors clustered at the industry level.

Final tip: usually the minimum number of clusters is about 30. Fewer than that might be insufficient (but, again, the guidance on this topic is very subjective).

Clustered standard errors

The clustered standard errors shown here differ because I am fabricating the clusters for the sake of the example.

In your real research, you would have the cluster variable at hand.

R
library(sandwich)
library(foreign) 
library(lmtest)
library(plm)

data <- read.dta("files/CEOSAL1.DTA")
model <- lm(salary ~ roe, data = data)
robust_model <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
#clustered
data$cluster <- rep(1:35, length.out = nrow(data))
model <- plm(salary ~ roe, data = data, index = c("cluster"))

clustered_se <- vcovHC(model, type = "HC3", cluster = "group")
SE_beta_clustered <- sqrt(clustered_se["roe", "roe"])

cat("Standard Error (robust):", SE_beta_robust, "\n")
Standard Error (robust): 6.899434 
R
cat("Standard Error (clustered)::", SE_beta_clustered, "\n")
Standard Error (clustered):: 12.98492 
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Read the dataset
data = pd.read_stata("files/CEOSAL1.DTA")

# Create a 'cluster' variable with cluster numbers 1 to 35, covering all rows
data['cluster'] = np.arange(len(data)) % 35 + 1

X = sm.add_constant(data['roe'])
y = data['salary']

# Robust (HC3) standard errors
robust_results = sm.OLS(y, X).fit(cov_type='HC3')
SE_beta_robust = robust_results.bse['roe']

# Clustered standard errors (clustered by the fabricated 'cluster' variable)
clustered_results = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': data['cluster']})
SE_beta_clustered = clustered_results.bse['roe']

print("Robust Standard Error (HC3):", SE_beta_robust)
print("Clustered Standard Error:", SE_beta_clustered)
Stata
use "files/CEOSAL1.DTA" , replace

qui reg salary roe 
gen beta_se_non = _se[roe]

qui reg salary roe , robust
gen beta_se_summary = _se[roe]

egen cluster = seq(), block(6)
qui regress salary roe , vce(cluster cluster)
gen SE_beta_clustered = _se[roe]

di "Standard Error (non-robust): " sum(beta_se_non)
di "Standard Error (robust): " sum(beta_se_summary)
di "Standard Error (clustered): " sum(SE_beta_clustered)
Standard Error (non-robust): 11.123251

Standard Error (robust): 6.8294482

Standard Error (clustered): 6.4122176

THANK YOU!

QUESTIONS?