*USP 2/2019
*Course: FLS6183
*Authors: Lorena Barberia & Maria Leticia Claro
*Lab 6 - Omitted Variable bias

clear

* Part A. No correlation between X and Z

* We will create a data set with 500 observations. X has a mean of 7 and a standard deviation of 8.
* Y has a mean of 100 and a standard deviation of 20. Z has a mean of 20 and a standard deviation of 2.
* In addition, we stipulate that the correlation between X and Z is 0, the correlation between X and Y is 0.6, and the correlation between Y and Z is 0.4.
* x, y and z are randomly drawn from a normal distribution. Please note that in the matrix C below the order is (y, x, z) in each column and row.

* In this case, we have named the matrices "C", "m" and "sd". The next step is to define the elements of each matrix.
* This is done by row, with a comma between elements and a backslash ("\") separating each
* row.
matrix C = (1, 0.6, 0.4 \ 0.6, 1, 0 \ 0.4, 0, 1)
* Vector of means, in the order (y, x, z)
matrix m = (100,7,20)
* Vector of standard deviations, in the order (y, x, z)
matrix sd = (20,8,2)

* To generate a 500-observation data set with the
* correlation structure that we defined above, we issue the following command.
* drawnorm draws a sample from a multivariate normal distribution with the desired means and covariance matrix.
* The values generated are a function of the current random-number seed, or of the number specified in the seed() option.
drawnorm y x z, n(500) means(m) sds(sd) corr(C) seed(12345)

* where y, x and z are the new variables to be generated, n() is the number of observations and corr()
* is the correlation matrix used to generate the new variables.
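
* As a quick check (a sketch added for illustration, not a required step of the lab), we can compare
* the sample moments with the values specified above. With 500 draws they should be close to,
* but not exactly equal to, the targets in m, sd and C.
summarize y x z
correlate y x z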


* Now, let's see what happens when we estimate the models with the sample data:

eststo model1: regress y x
predict u_hat, resid

eststo model2: regress y x z
predict u_hat_2, resid


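* We are going to use a new command, estout, to show the results. We want the betas, standard errors and confidence intervals,
* with significance stars. Please note the format of the stars in the output. To see more options, type "help estout".
* eststo and estout belong to the user-written estout package; if it is not installed, type "ssc install estout".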
estout model1 model2, cells(b(star fmt(3)) se (fmt(3)) ci(par fmt(2))) stats(r2 N) ///
		legend label collabels(none) varlabels(_cons Constant)

* Let's look at the residuals
		
kdensity u_hat, normal	
kdensity u_hat_2, normal

scatter u_hat x || scatter u_hat_2 x
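
* OLS residuals are uncorrelated with the included regressors by construction, so the plots
* against x cannot reveal an omitted variable on their own. As a sketch of a more direct check,
* we can correlate each set of residuals with z: u_hat comes from the model that omits z,
* while u_hat_2 comes from the model that includes it.
correlate u_hat z
correlate u_hat_2 z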
	
* Part B. Correlation between X and Z =0.75

* We will create a data set with 500 observations. X has a mean of 7 and a standard deviation of 8.
* Y has a mean of 100 and a standard deviation of 20. Z has a mean of 20 and a standard deviation of 2.
* In addition, we stipulate that the correlation between X and Z is 0.75, the correlation between X and Y is 0.7, and the correlation between Y and Z is 0.3.
* x, y and z are randomly drawn from a normal distribution.
clear 

matrix m = (100,7,20)
matrix sd = (20,8,2)
matrix C = (1, 0.7, 0.3 \ 0.7, 1, 0.75 \ 0.3, 0.75, 1)
drawnorm y x z, n(500) means(m) sds(sd) corr(C) seed(12345)

eststo model4: regress y x
predict u_hat, resid

eststo model5: regress y x z
predict u_hat_2, resid

estout model4 model5, cells(b(star fmt(3)) se (fmt(3)) ci(par fmt(2))) stats(r2 N) ///
		legend label collabels(none) varlabels(_cons Constant)
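
* The gap between the two coefficients on x can be checked against the omitted variable bias
* formula: (coefficient on x in "regress y x") = b_x + b_z * delta, where b_x and b_z come from
* "regress y x z" and delta is the slope from regressing z on x. The sketch below verifies this
* in the sample; the scalar names (b_x, b_z, delta, b_short, b_check) are our own, chosen for illustration.
quietly regress y x z
scalar b_x = _b[x]
scalar b_z = _b[z]
quietly regress z x
scalar delta = _b[x]
quietly regress y x
scalar b_short = _b[x]
scalar b_check = b_x + b_z * delta
display "Short regression: " b_short "   Long regression plus bias term: " b_check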
	
kdensity u_hat, normal
kdensity u_hat_2, normal

scatter u_hat x || scatter u_hat_2 x
		
* Part C. Analysis with Dummy Variable
* 1. No correlation between X and Z

clear

matrix m = (100,7,20)
matrix sd = (20,8,2)
matrix C = (1, 0.7, 0.3 \ 0.7, 1, 0 \ 0.3, 0, 1)
drawnorm y x z, n(500) means(m) sds(sd) corr(C) seed(12345)

corr x y z
* Transform Z into a dummy variable equal to 0 if at or below the mean, and 1 if above the mean.

summarize z
return list
gen zmean = r(mean) // store the mean of z in a new variable

replace z=0 if z <= zmean // use the stored mean to recode z as a dummy
replace z=1 if z > zmean
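
* A quick check (a sketch): z should now take only the values 0 and 1. Because z was drawn
* from a symmetric distribution and split at its sample mean, roughly half of the
* observations should fall in each group.
tabulate z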

corr x y z

eststo model6: regress y x
predict u_hat, resid

eststo model7: regress y x z
predict u_hat_2, resid


estout model2 model5 model6 model7, cells(b(star fmt(3)) se (fmt(3)) ci(par fmt(2))) stats(r2 N) ///
		legend label collabels(none) varlabels(_cons Constant)


scatter u_hat x || scatter u_hat_2 x
		
* 2. Correlation between X and Z =0.75

clear 

matrix m = (100,7,20)
matrix sd = (20,8,2)
matrix C = (1, 0.7, 0.3 \ 0.7, 1, 0.75 \ 0.3, 0.75, 1)
drawnorm y x z, n(500) means(m) sds(sd) corr(C) seed(12345)

corr x y z

summarize z
return list
gen zmean = r(mean)

replace z=0 if z <= zmean
replace z=1 if z > zmean

corr x y z // the correlations between z and the other variables change when we transform z into a dummy

eststo model8: regress y x
predict u_hat, resid
 
eststo model9: regress y x z 
predict u_hat_2, resid

estout model8 model9, cells(b(star fmt(3)) se (fmt(3)) ci(par fmt(2))) stats(r2 N) ///
		legend label collabels(none) varlabels(_cons Constant)
		
scatter u_hat x || scatter u_hat_2 x

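* Advanced examples with unit effects

* Create an ID variable that is correlated with X.
* Because x is continuous, group(x) will in practice assign a separate ID to almost every observation.
egen newid = group(x)
* Raise the matrix-size limit so that older versions of Stata can fit a model with this many indicator variables.
set matsize 550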
regress y x z i.newid
regress y x i.newid
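* Create an ID variable that is not correlated with X.
* Note: no variable named "random" is created earlier in this file. As an assumption,
* we draw it here from a uniform distribution, independently of x.
gen random = runiform()
egen newid2 = group(random)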