Imagine that you want to investigate the effect of Governance on Q
\(Q_{i} = \alpha + \beta_1 \times Gov_i + Controls + \epsilon_i\)
The ideal is to obtain estimates that allow you to infer that changing Gov will CAUSE a change in Q. However, without a sound empirical design, we cannot infer causality.
One source of bias is: reverse causation
Perhaps it is Q that causes Gov
OLS-based methods do not tell the difference between these two betas:
\(Q_{i} = \alpha + \beta_1 \times Gov_i + Controls + \epsilon_i\)
\(Gov_{i} = \alpha + \beta_1 \times Q_i + Controls + \epsilon_i\)
If one Beta is significant, the other will most likely be significant too.
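A minimal simulated sketch (hypothetical data, not from the chapter) of this point: Gov "causes" Q by construction below, yet both directions of the regression return a significant slope.

```r
# Hypothetical simulation: Gov causes Q by construction,
# but OLS finds a significant beta in both directions.
set.seed(123)
gov <- rnorm(1000)
q   <- 0.5 * gov + rnorm(1000)
summary(lm(q ~ gov))$coefficients[2, ]   # beta of Gov in the Q equation
summary(lm(gov ~ q))$coefficients[2, ]   # beta of Q in the Gov equation
```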
Lags to mitigate reverse causation
In a regression model, there is always the possibility that it is Y that causes X, meaning that your assumptions about the relationship of these variables can always be wrong (i.e., what causes what?).
We are not discussing causality yet (which is a whole new chapter). But one possible simple solution is to use lagged values of your X. Something along the following lines:

\[y_{i,t} = \alpha + \beta \times x_{i,t-1} + Controls + \epsilon_{i,t}\]
Notice the subscript \(t-1\) in x now. It means that you are using the previous period’s value of X as explanatory variable of the current period’s value of Y.
This type of structure mitigates the concern that variation in Y is the reason why X varies, since it is less likely that the current variation in Y provokes variation in X in previous periods (i.e., the idea is that the future does not affect the past).
Lags to mitigate reverse causation
This is not a perfect solution because Y and X in most accounting and finance research designs are usually autocorrelated, meaning that previous values are correlated with current values.
For instance, the firm’s leverage of 2015 is highly correlated with the firm’s leverage of 2016.
But in many cases it is a good idea to use lagged values. At the very least, you should have this solution in your toolbox.
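A minimal sketch of building a lagged X in R, assuming a hypothetical panel data frame `data` with columns `firm`, `year`, `y`, and `x` (none of these names come from the example above).

```r
library(dplyr)
# Hypothetical panel: lag x within each firm, then use the lag as the regressor
data <- data %>%
  group_by(firm) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(x_lag1 = dplyr::lag(x, n = 1)) %>%   # previous period's x
  ungroup()
summary(lm(y ~ x_lag1, data = data))
```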
\(\beta\) shows the difference in the average change of Y for units that experience a change in x during the same period.
Saying the same thing in a different way, \(\beta\) shows how much y changes, on average, where and when X increases by one unit.
Let’s say that x is binary and it changes from 0 to 1 for each firm during several different periods. So, y will change, on average, by \(\beta\) when x changes from 0 to 1.
As mentioned before, a fixed effect is equivalent to a binary variable marking one group of observations, for instance all observations from the same year, or from the same firm.
We can explore many interesting types of binary variables in corporate finance: for instance, whether the firm is listed in the "Novo Mercado", whether the firm has high levels of ESG, etc.
The implementation of a binary variable is quite simple: it takes the value of 0 for one group, and 1 for the other.
The interpretation is a bit trickier.
Binary variables
Let's think about example 7.1 of Wooldridge. He estimates the following equation:

\[wage = \beta_0 + \delta_1 \times female + \beta_1 \times educ + u\]
In model (7.1), only two observed factors affect wage: gender and education. Because \(female = 1\) when the person is female, and \(female = 0\) when the person is male, the parameter \(\delta_1\) has the following interpretation:
- \(\delta_1\) is the difference in hourly wage between females and males, given the same amount of education (and the same error term u). Thus, the coefficient \(\delta_1\) determines whether there is discrimination against women: if \(\delta_1<0\), then, for the same level of other factors, women earn less than men on average.
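A minimal sketch of estimating model (7.1), assuming the `wage1` data from the `wooldridge` package (which contains `wage`, `female`, and `educ`).

```r
library(wooldridge)
data("wage1")
# The coefficient on female is the estimate of delta_1
m71 <- lm(wage ~ female + educ, data = wage1)
summary(m71)
```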
Binary variables
In terms of expectations, if we assume the zero conditional mean assumption \(E(u \mid female, educ) = 0\), then

\[\delta_1 = E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ)\]
The key here is that the level of education is the same in both expectations; the difference, \(\delta_1\) , is due to gender only.
Binary variables
The visual interpretation is as follows: the situation can be depicted graphically as an intercept shift between males and females. The interpretation relies on \(\delta_1\). We observe that \(\delta_1 < 0\); this is evidence of a gender gap in wages.
Binary variables
Using our own example, we can make the case that it is necessary to separate firms into two groups: dividend payers and non-payers.
There is literature suggesting that dividend payers are not financially constrained, while those firms that do not pay dividends are.
If financial constraint is important to our model (and assuming that the right way to control for it is by including a dividend-payer dummy), we should include such a dummy.
data$w_div_payer <- ifelse(data$w_div_ta <= 0, 0, 1)
tapply(data$w_div_ta, data$w_div_payer, summary) # Summary by group using tapply
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.087e-05 0.000e+00 0.000e+00 -4.538e-07 0.000e+00 0.000e+00
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000000 0.0003314 0.0048639 0.0134473 0.0160092 0.0997752
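With the dummy created above, a minimal sketch of including it in a regression; `w_lev` and `w_size` are hypothetical column names for leverage and size, and only `w_div_payer` comes from the code above.

```r
# Hypothetical specification: leverage on size plus the dividend-payer dummy
summary(lm(w_lev ~ w_size + w_div_payer, data = data))
```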
Let's say you have a variable that may not show a clear linear relationship with another variable.
For instance, consider ownership concentration and firm value. There is a case to be made that the relationship between these variables is not linear. That is, at low levels of ownership concentration (say, 5% of shares), a small increase might lead to an increase in firm value. The argument is that, at such levels, an increase in ownership concentration leads the shareholder to monitor management more closely, increasing the likelihood of value-increasing decisions.
But now consider the case where the shareholder holds 60% or more of the firm's outstanding shares. Increasing concentration further might signal to the market that this shareholder is so powerful that they might start using the firm for personal benefits (which will not be shared with minority shareholders).
Models with squared variables
If this story is true, the relationship is inverted U-shaped. That is, at first the relationship is positive, then it becomes negative.
Theoretically, I could make an argument for a non-linear relationship between several variables of interest in finance. Take size and leverage: small firms might not be able to issue as much debt as mid-sized firms, while large firms might not need debt at all. The empirical relationship might be non-linear.
There is always a potential case to be made regarding the relationship between the variables.
\[Lev_{t,i} = \alpha + \beta_1 \times Size_{t,i} + \beta_2 \times Size^2_{t,i} + controls + \epsilon_{t,i}\] As Wooldridge says, misspecifying the functional form of a model can certainly have serious consequences. But, in this specific case, the problem seems minor since we have the data to fix it.
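A minimal sketch of that quadratic specification in R, again assuming hypothetical columns `w_lev` and `w_size` in `data`.

```r
# I(w_size^2) adds the squared term; its sign and significance speak to the U-shape story
summary(lm(w_lev ~ w_size + I(w_size^2), data = data))
```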
Models with squared variables
In this specific case, our theory does not hold, since the squared term is not significant. So we can conclude that in this model, the relationship is linear.
In some specific cases, you want to interact variables to test whether the interacted effect is significant. For instance, following Wooldridge's very traditional example 7.4, you might believe that married women face even more discrimination in the job market. So, you may prefer to estimate the following equation:

\[wage = \beta_0 + \beta_1 \times female + \beta_2 \times married + \beta_3 \times female \times married + \ldots + u\]
Where \(married\) is a binary variable marking all married people with 1, and 0 otherwise.
In this setting, the group of single men is the base case and is represented by \(\beta_0\). That is, both female and married are 0.
The group of single women is represented by \(\beta_0 + \beta_1\). That is, female is 1 but married is 0.
The group of married men is represented by \(\beta_0 + \beta_2\). That is, female is 0 but married is 1.
Finally, the group of married women is represented by \(\beta_0 + \beta_1 + \beta_2 + \beta_3\). That is, female and married are 1.
Models with Interactions
Using a random sample taken from the U.S. Current Population Survey for the year 1976, Wooldridge estimates that \(female<0\), \(married>0\), and \(female \times married<0\). This result makes sense for the 70s.
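A minimal sketch of the interaction specification, using the `wage1` data from the `wooldridge` package (the same dataset that appears later in this section).

```r
# female * married expands to female + married + female:married
m_int <- lm(lwage ~ female * married + educ + exper + expersq + tenure + tenursq,
            data = wage1)
summary(m_int)
```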
When the dependent variable is binary, we cannot interpret the linear models discussed so far in the usual way. We need a linear probability model (LPM). In such models, we are interested in how the probability of the occurrence of an event depends on the values of x. That is, we want to know \(P[y=1|x]\).
Imagine that y is employment status: 0 for unemployed, 1 for employed. This is our Y. Imagine that we are interested in estimating the probability that a person starts working after a training program. For these types of problems, we need a linear probability model.
The mechanics of estimating these models are similar to before, except that Y is binary.
The interpretation of the coefficients changes: a change in x now changes the probability of y = 1. So, let's say that \(\beta_1\) is 0.05. It means that changing \(x_1\) by one unit changes the probability of y = 1 (i.e., getting a job) by 5 percentage points, ceteris paribus.
The relationship between the probability of labor force participation and educ is plotted in the figure below. Fixing the other independent variables at 50, 5, 30, 1 and 6, respectively, the predicted probability is negative until education equals 3.84 years. This is odd, since the model is predicting a negative probability of employment for these specific values.
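A minimal sketch of this labor force participation LPM, assuming the `mroz` data from the `wooldridge` package, where `inlf` equals 1 if the woman is in the labor force.

```r
library(wooldridge)
data("mroz")
# Linear probability model: binary inlf regressed on the usual covariates
lpm_inlf <- lm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
               data = mroz)
summary(lpm_inlf)
```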
Linear probability model
Another example: the model predicts that going from 0 to 4 kids less than 6 years old reduces the probability of working by \(4\times 0.262 = 1.048\), which is impossible since a probability cannot exceed 1.
That is, one important caveat of the linear probability model is that predicted probabilities can fall outside the [0, 1] interval. If this is troublesome to us, we might need a different solution.
Logit and Probit
Although the linear probability model is simple to estimate and use, it has some limitations, as discussed. If that problem is important to us, we need a solution that addresses the problem of probabilities below zero or above one. That is, we need a binary response model.
In a binary response model, interest lies in the response probability.
\[P(y =1 | x) = P(y=1| x_1,x_2,x_3,...)\]
That is, we have a group of X variables explaining Y, which is binary. In an LPM, we assume that the response probability is linear in the parameters \(\beta\). This is the assumption that creates the problem discussed above.
Logit and Probit
We can change that assumption to a different function.
A logit model assumes a logistic function: \(G(z)=\frac{\exp(z)}{1+\exp(z)}\)
A probit model assumes a standard normal cumulative distribution function: \(G(z)=\Phi(z)=\int_{-\infty}^{z}\phi(v)\,dv\)
The adjustment is something as follows.
\[P(y =1 | x) = G(\beta_0 + \beta_1 x_1+ \beta_2 x_2 + \beta_3 x_3)\] Where G is either the logistic (logit) or the normal (probit) function.
We don't need to memorize these functions, but we need to understand the adjustment that assuming a different function makes. Basically, **we will no longer predict values below 0 or above 1, because the function flattens out at very low and very high values of z**.
Logit and Probit
Let’s estimate a logit and probit to compare with the LPM.
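A minimal sketch using glm(), with the same mroz specification as the LPM sketched earlier (an assumption; the chapter's own output is not reproduced here).

```r
# Same covariates as the LPM, now with logistic and probit link functions
logit  <- glm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
              family = binomial(link = "logit"), data = mroz)
probit <- glm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
              family = binomial(link = "probit"), data = mroz)
summary(logit)
summary(probit)
```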
Importantly, in an LPM, the coefficients have the usual interpretation. But logit and probit models yield coefficients that are harder to interpret.
In fact, often we do not make any interpretation of these coefficients. Instead, we usually transform them to arrive at an interpretation that is similar to what we have in LPM.
To make the magnitudes of probit and logit roughly comparable, we can multiply the probit coefficients by 1.6, or we can multiply the logit estimates by .625.
Also, the probit slope estimates can be divided by 2.5 to make them comparable to the LPM estimates.
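A sketch of those rule-of-thumb rescalings, using the hypothetical `lpm_inlf`, `logit`, and `probit` objects from the sketches above.

```r
# Rough comparability: logit * 0.625 ~ probit scale; probit / 2.5 ~ LPM scale
cbind(lpm            = coef(lpm_inlf),
      probit_div_2.5 = coef(probit) / 2.5,
      logit_x_0.625  = coef(logit) * 0.625,
      probit         = coef(probit))
```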
At the end of the day, the interpretation of logit and probit outputs is similar to the LPM's.
Tobit
Another problem in the dependent variable occurs when we have a limited dependent variable with a corner solution.
That is, a variable that equals zero for a nontrivial fraction of the population but takes on a wide range of positive values.
For instance, hours worked. Nobody works less than zero hours, but individuals in the population can work any number of positive hours.
When we have such type of dependent variable, we need to estimate a tobit model.
Tobit
Using Wooldridge’s example 17.2.
library(AER)
lpm <- lm(hours ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6, data = mroz)
tobit <- tobit(hours ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6, data = mroz)
summary(tobit)
One of the key assumptions of the OLS estimators is that \(var(u|x_1,x_2,x_3,...) = \sigma^2\). That is, the assumption is that the variance of the errors is homoskedastic (constant variance). It means that, throughout all observations, the error term shows the same variance \(\sigma^2\). If the errors are not homoskedastic, we have a heteroskedasticity problem.
Heteroskedasticity does not cause bias or inconsistency in the OLS estimators of the \(\beta\)s the way an omitted variable would. It also does not affect the \(R^2\). What heteroskedasticity does is bias the standard errors of the estimates.
Remember again that \(t_{\beta} = \frac{\hat{\beta}}{se(\hat{\beta})}\). So, if you have biased standard errors, you will not assess correctly the significance of your coefficients. It also affects the F statistics.
Graphically, we can think as follows.
Heteroscedasticity
Example of homoscedasticity:
Heteroscedasticity
Example of heteroscedasticity:
Heteroscedasticity
To give you more context, think in terms of the relationship that we've been discussing: \(leverage=f(size)\).
It is quite possible that small firms have fewer leverage options than large companies.
This means that a subsample of large companies will have higher variance in leverage decisions (and thus in the error terms) than a subsample of small firms.
So, we need to somehow correct for the heteroskedasticity problem to find unbiased standard errors for the independent variable size in this model.
Heteroscedasticity
The solution to this problem is to estimate Robust standard errors. Basically, we will need to change the estimator of the standard error to an unbiased version.
This is called the White robust standard error, or the heteroskedasticity-robust standard error, and was first shown by White (1980).
Heteroscedasticity
Before we estimate a model with robust standard errors, let’s visually check if there is heteroskedasticity in the errors of the model. I am using Wooldridge’s example 8.1.
library(dplyr)
library(sandwich)
library(lmtest)
library(ggthemes)   # needed for theme_solarized() below

wage1 <- wage1 %>%
  mutate(marmale = case_when(female == 0 & married == 1 ~ 1,
                             female == 0 & married == 0 ~ 0,
                             female == 1 & married == 1 ~ 0,
                             female == 1 & married == 0 ~ 0))
wage1 <- wage1 %>%
  mutate(marrfem = case_when(female == 0 & married == 1 ~ 0,
                             female == 0 & married == 0 ~ 0,
                             female == 1 & married == 1 ~ 1,
                             female == 1 & married == 0 ~ 0))
wage1 <- wage1 %>%
  mutate(singfem = case_when(female == 0 & married == 1 ~ 0,
                             female == 0 & married == 0 ~ 0,
                             female == 1 & married == 1 ~ 0,
                             female == 1 & married == 0 ~ 1))
wage_t <- lm(lwage ~ marmale + marrfem + singfem + educ + exper + expersq + tenure + tenursq,
             data = wage1)

library(tidyverse)
library(broom)
fitted_data <- augment(wage_t, data = wage1)
ggplot(fitted_data, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_solarized()
Heteroscedasticity
Visually, we do not see much variation in the spread of the residuals along the x-axis. There is not much evidence of heteroskedasticity.
But let’s formally test it using the Breusch-Pagan test.
The null hypothesis (\(H_0\)) of this test is homoskedasticity. Thus, if we reject it, heteroskedasticity is present.
The estimated p-value is 10.55%, above the usual significance levels. We cannot reject the null, so we conclude that heteroskedasticity is not a concern in this model.
bptest(wage_t)
studentized Breusch-Pagan test
data: wage_t
BP = 13.189, df = 8, p-value = 0.1055
Heteroscedasticity
Let’s estimate both standard errors to see their difference. Because we do not see much heteroskedasticity we should not see much difference in the estimated standard errors.
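A minimal sketch of both sets of standard errors, using coeftest() from lmtest with vcovHC() from sandwich (both loaded above); type = "HC1" is one common robust estimator.

```r
coeftest(wage_t)                                       # traditional standard errors
coeftest(wage_t, vcov = vcovHC(wage_t, type = "HC1"))  # heteroskedasticity-robust
```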
Notice that the standard errors have changed a bit, but not too much in this example.
Usually, robust standard errors are larger than the traditional ones in empirical work, but they can be smaller.
Also notice that most of the independent variables have similar significance levels, but some have less significance.
This is expected, since the robust standard errors are expected to be larger (but again not always).
A final note: the standard error estimates here are a bit different from those in Wooldridge; I assume this is due to the package I am using.
Finally, it is common and widely accepted to report robust standard errors instead of the traditional ones.
Multicollinearity
Multicollinearity is the term used when some of the independent variables in a model are highly correlated with each other.
This problem is not clearly formulated in econometrics textbooks because of its nature: there is no sharp threshold defining when correlation becomes a problem.
But one thing is evident: multicollinearity increases the standard errors of the affected coefficients.
In a sense, this problem is similar to having a small sample, from which it is hard (in a statistical sense, meaning high standard errors) to estimate the coefficients precisely.
As Wooldridge says: everything else being equal, for estimating \(\beta_j\), it is better to have less correlation between \(x_j\) and the other independent variables.
Multicollinearity
We can use the following test to verify if multicollinearity exists: variance inflation factor (VIF).
There is no formal threshold to interpret the test, but 10 is the value usually used as a cutoff.
That is, if there is one or more variables showing a VIF of 10 or higher, the interpretation is that there is evidence of multicollinearity.
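A minimal sketch of the VIF check, using vif() from the `car` package on the wage model estimated above (an assumption about which model the chapter tests).

```r
library(car)
vif(wage_t)   # values of 10 or more suggest multicollinearity
```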
Multicollinearity
We can observe below that the experience variables are multicollinear. This is expected, since one is the square of the other.
Assuming that expersq is important to be included in the model, there is not much we can do in this example.
Some scholars argue that we could ignore the multicollinearity altogether, others would argue to exclude expersq if possible.
The generally most accepted solution is to keep multicollinear variables if they are control variables.
That is, if your focus is on another variable, you can keep the multicollinear ones and move on. If the multicollinear variable is one of your variables of interest, you might want to consider dropping the other one.
Multicollinearity
Let’s investigate our previous OLS model. We see no evidence of multicollinearity.
In this example, we can say that the OVB is \(short = long + bias\).
That is, \(0.44535 = -0.38389 + bias\), or \(0.44535 = -0.38389 + 0.82924\).
Which is the same as: \(0.44535 = -0.38389 + \phi \times \omega\), where \(\phi\) is the coefficient from the auxiliary regression of the omitted variable on the non-omitted one (\(omitted = f(non\text{-}omitted)\)), and \(\omega\) is the \(\beta\) of the omitted variable in the long model.
ovbmodel <- lm(risky_firm ~ bad_decision, data = ovb)
# The OVB is 0.44535 = -0.38389 + 1.25146 * 0.66262
matrix1 <- summary(long)$coefficients
matrix2 <- summary(ovbmodel)$coefficients
# Calculating OVB
sum(matrix1[3, 1] * matrix2[2, 1])
[1] 0.8292402
Omitted variable bias (OVB)
We can see that omitting the variable “risky_firm” is problematic since it seems to explain the outcome of this regression.
tapply(ovb$performance, ovb$risky_firm, summary)
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0400 0.0800 0.1300 0.1472 0.2000 0.3100
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0500 0.4500 0.5800 0.6148 0.8000 0.9900
tapply(ovb$bad_decision, ovb$risky_firm, summary)
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0100 0.0500 0.0800 0.0976 0.1300 0.2000
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1200 0.4400 0.6300 0.6056 0.8200 0.9600