*USP 2/2019
*Course: FLS6183
*Authors: Lorena Barberia & Maria Leticia Claro
*Lab 5 - Multicollinearity simulation

* This is an exercise in which we will simulate various degrees of multicollinearity.
* In order to do this, we will have to introduce some new commands. As always, we will
* begin with a clear command:

clear

* The first command that we are going to introduce creates a matrix.
* In this command, "mat" tells Stata that we are going to define a matrix. In this case,
* we have named the matrix "C". The next step is to define the elements of the matrix.
* This is done by row, with a comma between elements and a backslash ("\") separating
* each row.

mat C=(1,0\0,1)

* This two-by-two matrix that we have just created will hold the correlations between
* the variables that we will generate. In this case,
* 1 0
* 0 1
* which is
* corr(x1,x1) corr(x1,x2)
* corr(x2,x1) corr(x2,x2)

* Because simulations are based on random number generation, we will get a different set
* of data each time we run the same program unless we specify a seed number. For
* instance:

set seed 999

* One useful Stata command for simulating different data scenarios is "corr2data".
* This command allows us to randomly generate a set of variables with a particular
* pattern of correlation. For instance, if we wanted to generate a 10-observation data
* set with the correlation structure that we defined above, we would issue the following
* command:

corr2data x1 x2, n(10) corr(C)

* where x1 and x2 are the new variables that will be generated, n() is the number of
* observations, and corr() is the correlation matrix used to generate the new variables.

* Let's look at what we just generated.

graph twoway scatter x1 x2

* Let's check whether Stata followed our command.
* The pwcorr command calculates pairwise correlation coefficients using all the
* available information. This command has the advantage that it can also report
* statistical significance.
pwcorr x1 x2, sig obs

* By now you have probably guessed where we are heading. We have our two independent
* variables, x1 and x2. The next step is to generate a stochastic component:

generate r=invnorm(uniform())

* And now, let's generate our Population Regression Function:

generate y=.5 + x1 + x2 + r

* What are the population parameters?
* Now, let's see what happens when we estimate them with the sample data:

regress y x1 x2

* Sometimes, we want to test a directional hypothesis for our regression coefficients.
* When we want to know whether our slope or intercept is higher or lower than 0, we can
* calculate the p-values directly from the regression output. When our estimated
* coefficient is positive (as our slope is), we can test three different hypotheses with
* the p-value as follows:
* H0: intercept=0   p-value = 0.078 (given in the output)
* H0: intercept<=0  p-value = 0.078/2 = 0.039
* H0: intercept>=0  p-value = 1-(0.078/2) = 0.961

* We can also use commands to calculate this in Stata:

test _b[x1]=0
local sign_x1 = sign(_b[x1])
display "Ho: coef <= 0 p-value = " ttail(r(df_r),`sign_x1'*sqrt(r(F)))
display "Ho: coef >= 0 p-value = " 1-ttail(r(df_r),`sign_x1'*sqrt(r(F)))

* "vif" is a command for estimating the variance inflation factor for each variable in
* the most recently estimated model:

vif

* We may also be interested in the confidence interval associated with the correlation
* coefficient. The Stata extension ci2 (you need to install this command before using
* it) allows us to obtain these results.

findit ci2
ci2 y x1, corr
ci2 y x2, corr

* What do you interpret from this output?

* Now, it is your turn. Please complete the analysis in the assignment this week. To do
* so, you will need to change the correlations between the explanatory variables and the
* number of observations.
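
* As an illustrative sketch of the kind of change the assignment asks for (the
* .9 correlation and n(10) below are values we assume for illustration, not the
* assignment's required specification), you could repeat the simulation with
* highly correlated regressors and compare the VIFs:

clear
set seed 999
mat C2=(1,.9\.9,1)
corr2data x1 x2, n(10) corr(C2)
generate r=invnorm(uniform())
generate y=.5 + x1 + x2 + r
regress y x1 x2
vif

* Because corr2data reproduces the requested correlations exactly, with
* corr(x1,x2)=.9 each VIF should be about 5.26, since VIF = 1/(1-R2_j) and here
* R2_j = .9^2 = .81. Compare this with the identity-matrix case above, where
* both VIFs equal 1.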