Practicing 4
Imagine that the population contains 5000 units, from which you can observe only 50.
You want to run a linear model to understand the relationship between x and y.
The “true” beta of this relationship is as follows. By “true” I mean the beta you would get if you could observe the entire population (remember, though, that you cannot).
Call:
lm(formula = df$y ~ df$x)

Residuals:
     Min       1Q   Median       3Q      Max
-2527.44 -1230.21     4.28  1246.20  2510.94

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.549e+03  4.103e+01   62.13   <2e-16 ***
df$x        1.871e-01  1.407e-02   13.30   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1435 on 4998 degrees of freedom
Multiple R-squared:  0.03416,  Adjusted R-squared:  0.03397
F-statistic: 176.8 on 1 and 4998 DF,  p-value: < 2.2e-16
So the “true” beta is 0.187, and its t-statistic is 13.296.
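The slides do not show how `df` is built. As a minimal sketch, assuming a hypothetical population with uniformly distributed x and normal noise, the code below generates 5000 units and reads the slope and its t-statistic from the fitted model. The parameter values are illustrative assumptions and will not reproduce the output above exactly.

```r
# Hypothetical population: every parameter below is an assumption for
# illustration, not the data behind the output above.
set.seed(1)
n_pop <- 5000
x <- runif(n_pop, min = 0, max = 5000)
y <- 2500 + 0.2 * x + rnorm(n_pop, mean = 0, sd = 1400)
df <- data.frame(x = x, y = y)

# Fit the model on the whole population and pull out the slope statistics.
fit_pop <- lm(y ~ x, data = df)
coef_table <- coef(summary(fit_pop))
beta_pop <- coef_table["x", "Estimate"]
t_pop <- coef_table["x", "t value"]

round(c(beta = beta_pop, t_stat = t_pop), 3)
```

Using `lm(y ~ x, data = df)` rather than `df$y ~ df$x` names the coefficient `x`, which makes it easier to look up by name.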
Plotting this relationship in a graph, you get:
library(ggplot2)
library(ggthemes)

# Fit the population model once and reuse its summary in the title.
pop_summary <- summary(lm(y ~ x, data = df))

ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "lightblue") +
  geom_smooth(method = "lm") +
  theme_solarized() +
  labs(title = paste("Beta is", round(pop_summary$coefficients[2, 1], 3),
                     ", Beta t-stat is", round(pop_summary$coefficients[2, 3], 3),
                     ", R2 is", round(pop_summary$r.squared, 3),
                     ", Sample Size is", nrow(df)))
If you run a linear model using the sample you can observe, you might get this.
library(dplyr)

set.seed(1235)
# Draw the 50 units you can observe and fit the model on that sample only.
random <- sample_n(df, 50)
reg <- lm(y ~ x, data = random)
reg_summary <- summary(reg)  # named reg_summary to avoid masking base::sum

ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "grey") +
  geom_point(data = random, color = "blue") +
  geom_smooth(data = random, method = "lm") +
  theme_solarized() +
  labs(title = paste("Beta is", round(coef(reg)["x"], 3),
                     ", Beta t-stat is", round(reg_summary$coefficients[2, 3], 3),
                     ", R2 is", round(reg_summary$r.squared, 3),
                     ", Sample Size is", nrow(random)))
Or maybe this:
set.seed(1242)
# A different seed draws a different set of 50 observable units.
random <- sample_n(df, 50)
reg <- lm(y ~ x, data = random)
reg_summary <- summary(reg)

ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "grey") +
  geom_point(data = random, color = "blue") +
  geom_smooth(data = random, method = "lm") +
  theme_solarized() +
  labs(title = paste("Beta is", round(coef(reg)["x"], 3),
                     ", Beta t-stat is", round(reg_summary$coefficients[2, 3], 3),
                     ", R2 is", round(reg_summary$r.squared, 3),
                     ", Sample Size is", nrow(random)))
Or maybe this:
set.seed(1243)
# Yet another draw of 50 observable units.
random <- sample_n(df, 50)
reg <- lm(y ~ x, data = random)
reg_summary <- summary(reg)

ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "grey") +
  geom_point(data = random, color = "blue") +
  geom_smooth(data = random, method = "lm") +
  theme_solarized() +
  labs(title = paste("Beta is", round(coef(reg)["x"], 3),
                     ", Beta t-stat is", round(reg_summary$coefficients[2, 3], 3),
                     ", R2 is", round(reg_summary$r.squared, 3),
                     ", Sample Size is", nrow(random)))
Or any of many other estimates, depending on which 50 units you happen to draw.
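To see how much the estimate can move from one draw to the next, you can repeat the sampling many times and look at the spread of the resulting betas. The sketch below is self-contained and uses a hypothetical simulated population (the parameters and the helper `sample_beta` are assumptions for illustration, not the data or code behind the slides).

```r
# Hypothetical population (assumed parameters, not the slides' data).
set.seed(1)
df <- data.frame(x = runif(5000, min = 0, max = 5000))
df$y <- 2500 + 0.2 * df$x + rnorm(5000, mean = 0, sd = 1400)

# Slope estimated from one random sample of n_obs units.
sample_beta <- function(data, n_obs = 50) {
  rows <- sample(nrow(data), n_obs)
  unname(coef(lm(y ~ x, data = data[rows, ]))["x"])
}

# Repeat the draw many times to see how much the estimate moves around.
betas <- replicate(1000, sample_beta(df))
beta_pop <- unname(coef(lm(y ~ x, data = df))["x"])

c(population = beta_pop,
  mean_of_samples = mean(betas),
  sd_of_samples = sd(betas))
quantile(betas, probs = c(0.025, 0.5, 0.975))
```

Under these assumptions, the average of the sample betas should sit close to the population beta, while individual draws of 50 units can land far from it.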
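So, the takeaway is: always remember that you can only observe a sample of the population. Even a random sample gives an estimate that can differ noticeably from the population value, and if the sample you observe is biased, your estimates will be biased too.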
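A biased sample can be illustrated with a simple selection rule. The sketch below assumes, purely for illustration, that only units with above-median y can be observed, and compares the average slope from such samples with the average from truly random samples. The population, the selection rule, and the helper `sample_beta` are all hypothetical.

```r
# Hypothetical population (assumed parameters, not the slides' data).
set.seed(1)
df <- data.frame(x = runif(5000, min = 0, max = 5000))
df$y <- 2500 + 0.2 * df$x + rnorm(5000, mean = 0, sd = 1400)

# Assumed selection rule: only units with above-median y can be observed.
observable <- df[df$y > median(df$y), ]

# Slope from one sample of n_obs units drawn from a given pool.
sample_beta <- function(pool, n_obs = 50) {
  rows <- sample(nrow(pool), n_obs)
  unname(coef(lm(y ~ x, data = pool[rows, ]))["x"])
}

betas_random <- replicate(1000, sample_beta(df))
betas_biased <- replicate(1000, sample_beta(observable))

c(population = unname(coef(lm(y ~ x, data = df))["x"]),
  random_samples = mean(betas_random),
  biased_samples = mean(betas_biased))
```

Under this assumed selection rule, the random samples average out near the population slope, whereas the samples selected on y are pulled toward zero no matter how many times the draw is repeated.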
QUESTIONS?
Henrique C. Martins (henrique.martins@fgv.br). Do not use without permission.