* Lorena G. Barberia	
* Adapted from Andy Philips (2013)
* 11/1/2018
*
*	Bootstrapping and Clarify

* In this lab, we will briefly explore how bootstrapping and Clarify work.
*
*	-----------------------------------------------------------------------
*	This .do file is based on generated data

* Data Generating Process:
clear
set seed 345
set obs 120
gen e1 = rnormal()
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 2*x1 + 3*x2 + e1
kdensity y

* Now let's examine the regression results of the DGP
regress y x1 x2
estimates store m1_original

* let's examine the residuals and fitted values
predict uhat_m1_original, resid 
predict yhat_m1_original, xb 
sum uhat_m1_original yhat_m1_original


* BOOTSTRAPPING	--------------------------------------------------------------
/* To bootstrap the standard error of a statistic:
	1. write a program (only needed for a custom statistic)
	2. load the data
	3. drop observations with missing values (with a custom program, Stata
	cannot tell which observations have missing values in the variables used)
	4. drop unneeded variables (this speeds up the bootstrap)
	5. set the seed
	6. run the bootstrap
*/

* first drop any missing obs
foreach var in y x1 x2 {
	drop if missing(`var')
}
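
* A minimal sketch of steps 1 and 6 for a custom statistic, here the ratio of
* the two slopes. The program name ratio_b, the statistic, and the 200
* replications are illustrative assumptions, not part of the original lab.
* The RNG state is saved and restored (assumes Stata 14 or later) so that the
* results further down do not change. Run these lines together, because the
* saved state lives in a local macro.
capture program drop ratio_b
program define ratio_b, rclass
	quietly regress y x1 x2
	return scalar ratio = _b[x1]/_b[x2]
end

local rngstate `c(rngstate)'
bootstrap ratio = r(ratio), reps(200): ratio_b
set rngstate `rngstate'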

reg y x1 x2
bootstrap, reps(1000): regress y x1 x2
estimates store m2_bootstrap 
predict uhat_m1_bootstrap, resid 
predict yhat_m1_bootstrap, xb 
sum uhat_m1_original uhat_m1_bootstrap yhat_m1_original yhat_m1_bootstrap

* coefplot is a user-written command; if needed, install it once with: ssc install coefplot
coefplot m1_original m2_bootstrap, drop(_cons) xline(0) 

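/* note that the coefficients are identical in both models, because bootstrap
	only re-estimates the standard errors. the test statistics (reported as z
	rather than t after bootstrap) are slightly lower in the bootstrap.
	for speed we can drop unneeded vars.
*/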
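
* To see what bootstrap does under the hood, here is a minimal hand-rolled
* sketch: resample the rows with replacement, re-run the regression, and keep
* the slopes. The names bs_draws, b_x1 and b_x2 and the 200 replications are
* illustrative choices. Run the block in one go (it uses local macros);
* c(rngstate) assumes Stata 14 or later and is restored at the end so that
* the results further down do not change.
local rngstate `c(rngstate)'
tempname memhold
tempfile bs_draws
postfile `memhold' b_x1 b_x2 using `bs_draws'
forvalues r = 1/200 {
	preserve
	bsample                              // draw _N rows with replacement
	quietly regress y x1 x2
	post `memhold' (_b[x1]) (_b[x2])     // save this replication's slopes
	restore
}
postclose `memhold'
preserve
use `bs_draws', clear
summarize b_x1 b_x2
restore
set rngstate `rngstate'
* The Std. Dev. column above plays the role of the bootstrap standard error.
* It can be set against the standard errors in this side-by-side table:
estimates table m1_original m2_bootstrap, b(%9.4f) se(%9.4f)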

* Clarify --------------------------------------------------------------


* To install Clarify (only needed once), uncomment the next two lines:
* net from http://gking.harvard.edu/clarify
* net install clarify


estsimp regress y x1 x2 

* Note the labels for the simulated parameters b1, b2, b3, b4:
* b1 and b2 are the simulated coefficients (betas) on x1 and x2
* b3 is the simulated constant
* b4 is the simulated sigma^2 (error variance)

sum

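* How does the spread of the simulated coefficients compare with the original
* standard errors for the coefficients?
* The Std. Dev. of the b1 draws in the sum output is itself the simulation
* estimate of the standard error of the x1 coefficient. Dividing it by
* sqrt(1000) gives something different: the Monte Carlo standard error of the
* mean of the 1000 draws.
* The numbers below are copied by hand from one run of sum; update them if the
* seed, the data, or the number of simulations changes.

* Monte Carlo standard error of the mean of the b1 draws
di  .0881776/sqrt(1000)

* 95% interval for the mean of the b1 draws (not for the coefficient itself)

di   1.9432 + (.00278842*1.96)
di   1.9432 - (.00278842*1.96)

* Monte Carlo standard error of the mean of the b2 draws
di .1115355/sqrt(1000)

* 95% interval for the mean of the b2 draws (not for the coefficient itself)

di  2.792914 + (.00352706*1.96)
di  2.792914 - (.00352706*1.96)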
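
* The same quantities can be read from returned results instead of typing the
* numbers by hand. A minimal sketch using only summarize and _pctile; the
* display labels are illustrative.
foreach v of varlist b1 b2 {
	quietly summarize `v'
	display "`v': mean = " r(mean) "  SD of draws = " r(sd) "  SD/sqrt(N) = " r(sd)/sqrt(r(N))
	_pctile `v', percentiles(2.5 97.5)
	display "`v': 2.5th and 97.5th percentiles of the draws = " r(r1) "  " r(r2)
}
* The SD of the draws is the number to set against the Std. Err. column of the
* original regression. SD/sqrt(N) only measures how precisely the mean of the
* draws is known, and it shrinks as the number of simulations grows.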


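* Now let's compare these draws with the uhat above.
* sqrt(b4) is the simulated sigma (the residual standard deviation), not a
* residual, so compare its Mean with the Std. Dev. of the uhat variables.

gen uhat_clarify = sqrt(b4)

sum uhat_m1_original uhat_m1_bootstrap uhat_clarify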

* Let's compare predicted values (pv) and expected values (ev) of y,
* holding x1 and x2 at their means


setx mean

simqi, pv
simqi, ev
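
* A rough sketch of what the two simqi calls compute, built by hand from the
* simulated parameters (b1, b2 = slopes, b3 = constant, b4 = sigma^2). The
* names ev_manual and pv_manual are illustrative, and Clarify's own algorithm
* may differ in detail.
quietly summarize x1
local mx1 = r(mean)
quietly summarize x2
local mx2 = r(mean)
* expected value: the linear predictor at the means, one per parameter draw
gen ev_manual = b3 + b1*`mx1' + b2*`mx2'
* predicted value: the expected value plus one simulated error term
gen pv_manual = ev_manual + rnormal(0, sqrt(b4))
summarize ev_manual pv_manual
drop ev_manual pv_manual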

* let's compare with the original yhat and yhat under bootstrap
* note that sum reports the standard deviation of yhat across observations,
* while simqi reports the uncertainty of one quantity at the means of x1 and x2,
* so the two are not directly comparable
sum yhat_m1_original yhat_m1_bootstrap

drop b1 b2 b3 b4
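
* To see roughly where the b1-b3 draws came from, here is a hedged sketch that
* draws 1000 coefficient vectors from a multivariate normal centred on the OLS
* estimates, with the estimated variance-covariance matrix. The names b_ols,
* V_ols, sim_x1, sim_x2 and sim_cons are illustrative. This covers the betas
* only; Clarify's own algorithm (including how it draws sigma^2) may differ.
quietly regress y x1 x2
matrix b_ols = e(b)
matrix V_ols = e(V)
preserve
drawnorm sim_x1 sim_x2 sim_cons, n(1000) means(b_ols) cov(V_ols) clear
summarize sim_x1 sim_x2 sim_cons
restore
* The Mean and Std. Dev. columns can be set against the coefficients and
* standard errors of the original regression.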

* Questions 
* 1. How do the estimates from the bootstrap model compare to the original results? Why? 
* 2. How do the estimates from Clarify compare with the original regression estimates? Why? 
* 3. How do the standard errors from Clarify compare to the original results? Why? 
* 4. Why do the Clarify results for expected values and predicted values differ?