*USP 2/2017
*Course: FLS6183
*Author: Lorena Barberia
*Lab 1 - What is the difference between hypothesis tests using a difference 
*of means test versus a bivariate regression model
*when you have a categorical dummy variable as the explanatory variable?


*Our objective here is to compare a Difference of Means Test with Hypothesis Testing for a Bivariate Regression.
*When should you use a bivariate regression? Are there any differences 
*between these two hypothesis tests?

*In order to examine this question, we will use our own simulated data. 
*First of all, we need to set a seed. Setting a seed will allow us 
*to replicate our analysis. 

clear
set seed 12345

*We will also define how many observations we want to generate. For our exercise, we
*will set the number of obs to 100.
set obs 100 


*We will first create a random variable (z). We will then use z to create our dummy variable (x).
gen z=rnormal()

*Now, let's get a sense of our data with summary statistics and a kernel density plot.
*The normal option overlays a normal density on the plot for comparison.
sum z, detail
kdensity z, normal

*The estimated density looks very similar to the normal distribution. It is likely that
*if we increased our sample size they would be even more similar.

*Now that we have our auxiliary variable, z, we will create our dummy variable.
*We define x=0 for z<0 and x=1 for z>=0.
gen x=0
replace x=1 if z>=0
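
*The same dummy can also be built in one line, because a logical expression in
*Stata evaluates to 1 when true and 0 when false. The following is only a sketch:
*x_alt is a hypothetical variable name introduced here to check that both
*versions agree, and it is dropped afterwards.
gen x_alt = (z>=0)
assert x_alt == x
drop x_alt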

*Now, let's get a sense of our dummy variable with summary statistics.
*Because x only takes the values 0 and 1, we ask for a discrete histogram.
sum x, detail
histogram x, discrete percent

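*Now, let's generate our stochastic term, r, as a draw from a standard normal
*distribution (the inverse normal CDF applied to a uniform draw).
gen r=invnorm(uniform())

*We are going to use our x and r to generate y. By construction, the
*population mean of y is 1 when x=0 and 3 when x=1.
gen y = 1 + 2*x +r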

*Now, let's get a sense of our data with summary statistics.  
sum y, detail

*First, let us carry out a hypothesis test for a single variable using y,
*which is a continuous variable. The null hypothesis is that the mean of y is 0.5.
ttest y == 0.5

*Can you visually illustrate the results of the hypothesis test? 
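
*One possible way to illustrate the test (a sketch, not the only answer):
*plot the distribution of y and mark both the hypothesized mean (0.5, dashed line)
*and the sample mean (solid line). The local macro ybar is a name introduced here
*for illustration.
quietly summarize y
local ybar = r(mean)
histogram y, normal xline(0.5, lpattern(dash)) xline(`ybar')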

*Now, let's consider a new problem. 
*Our dummy variable could be a characteristic for which we want to test 
*if there are differences between groups, for example. 

*How can we carry out hypothesis testing in this case, in which we have 
*a continuous variable (y) that differs between two groups? 
*One possibility would be to carry out a difference of means test. 
*Another possibility would be to use a bivariate regression model. 
*Are there any differences between carrying out a hypothesis test of the difference 
*of the mean of y by group type (x=1 or x=0) and testing this hypothesis using 
*a bivariate regression in which we test the statistical significance of our
*dummy independent variable? When is a bivariate regression a more appropriate hypothesis test?

*As x is a categorical variable, we can examine the mean of y within each category of x. 
*To do so, we first sort the observations by x, which the by prefix requires. 
sort x
by x: sum y

*In this case, is the mean of y higher when x=1 or when x=0?

*Let's now do a t-test to test if there are differences assuming that the variances in each sample group are equal. 
ttest y, by(x)
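
*The test above assumes equal variances in the two groups. As a sketch of how
*this assumption could be checked and relaxed (these two commands are an
*addition to the lab, not part of the original exercise): sdtest compares the
*group variances, and the unequal option requests a t-test that does not
*assume equal variances.
sdtest y, by(x)
ttest y, by(x) unequal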

*Now, let's estimate a bivariate regression model. 
reg y x

*What can you observe in these results? Are they similar or different? How?
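
*To make the comparison concrete, here is a sketch that places the key numbers
*side by side. The scalar names diff_tt and t_tt are introduced here for illustration.
*Note that ttest reports mean(x=0) - mean(x=1), so we flip the sign to match
*the regression coefficient on x.
quietly ttest y, by(x)
scalar diff_tt = r(mu_2) - r(mu_1)
scalar t_tt = -r(t)
quietly regress y x
display "t-test:     difference  = " diff_tt "   t = " t_tt
display "regression: coefficient = " _b[x] "   t = " _b[x]/_se[x]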

*Do you think there is a preferred method in this particular case for this 
*type of data (y=continuous and x=dummy)? 

*Are there any advantages of using a regression in this case? 
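
*One possible advantage (a sketch, not a full answer): a regression is easy to
*extend. For example, vce(robust) requests heteroskedasticity-robust standard
*errors, which relaxes the equal-variance assumption, and further explanatory
*variables could be added to the right-hand side.
regress y x, vce(robust)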

*To help you think through your answer, let's look at a scatterplot 
*of the data and plot our estimated regression line. 
twoway (scatter y x) (lfit y x)