*USP 2/2017
*Course: FLS6183
*Author: Gabriel Zanlorenssi

*Lab 2 - Categorical and continuous variables. Relationship between variables.
*Correlation test vs regression test.

*Set the seed and the number of observations
clear
set seed 4321
set obs 200

**CATEGORICAL VARIABLES

*Generating a categorical variable
gen z = runiform()

*Let's look at the distribution
kdensity z

*Turning this variable into a categorical one
replace z = z * 5
replace z = round(z, 1)

*Take a look at the data viewer. What do you observe?

*Run a density plot for z again
kdensity z

*This is not the best way to visualize the variable, since
*its distribution is now discrete. Let's try a histogram
histogram z, discrete density
histogram z, discrete frequency
histogram z, discrete percent

*The histogram command, by default, expects the variable to be
*continuous. You have to add "discrete" after the comma to
*tell Stata that the variable is discrete. The option after
*discrete just sets what is measured on the Y axis.

*You can describe a categorical variable using "tabulate" or simply "tab"
tab z

*Summary statistics for a categorical variable can be meaningless in some cases
sum z
*What do the mean and standard deviation of z tell us?

**CONTINUOUS VARIABLES

*Now we are going to generate a normal distribution with mean 10 and standard deviation equal to 2
gen x = rnormal(10,2)

*Can you guess the mean, median, mode, max, and min values of this distribution
*without viewing the data?

*Let's look at the distribution
kdensity x

*How can we locate the mean, median, mode, max, and min visually?
*It looks normal, but a little bit skewed to the left of 10. Let's compare:
kdensity x, normal

*You can also use a histogram for continuous variables...
histogram x
*...but not with the discrete option!
histogram x, discrete frequency

*Another interesting option is a boxplot
graph box x
*How can we interpret this?
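*A sketch (these next lines are an addition, not part of the original
*lab): we can mark the sample mean and median on the density plot to
*locate them visually. kdensity accepts twoway options such as xline():
quietly sum x, detail
local m = r(mean)
local md = r(p50)
kdensity x, xline(`m') xline(`md', lpattern(dash))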
*Tabulate is also not ideal for x, because the variable can take many distinct values
tab x

*The best way to describe a continuous variable is to summarize it
sum x

*Detailed, with percentiles and other summary statistics:
sum x, detail

*Variable x is a column vector with 200 rows. In matrix notation: [200,1]
*Summary statistics are single values computed over the entire sample.
*Therefore, they are scalars [1,1].

*You can see these scalars using return list
sum x, detail
return list

*Stata can store scalars, vectors, and matrices from any output
scalar mean_x = r(mean)
scalar sd_x = r(sd)
scalar p5_x = r(p5)
scalar p95_x = r(p95)

*You can display scalars using "display" or "di"
di mean_x
di sd_x

*Or even perform operations
di mean_x - 1.66*sd_x
di mean_x + 1.66*sd_x

*If our distribution follows a Student's t, (mean_x - 1.66*sd_x) should be
*close to the 5th percentile and (mean_x + 1.66*sd_x) close to the 95th
*percentile (1.66 is roughly the 95th percentile of the t distribution
*with 100 degrees of freedom)
di mean_x - 1.66*sd_x
di p5_x
di mean_x + 1.66*sd_x
di p95_x
*What can we conclude?

**RELATIONSHIP BETWEEN VARIABLES

*First, let's generate another categorical variable
*and another continuous variable
gen u = rnormal()
replace u = 1 if u>0
replace u = 0 if u<0
gen w = rchi2(3)

*A chi-squared distribution tends to a normal as the
*degrees of freedom increase. But with 3 degrees of freedom,
*we will not have a normal distribution.
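*A quick empirical check (an addition to the lab, assuming x and the
*scalars mean_x and sd_x are still in memory): count the share of
*observations inside [mean_x - 1.66*sd_x, mean_x + 1.66*sd_x]; for a
*roughly normal x this share should be close to 90%.
count if x > mean_x - 1.66*sd_x & x < mean_x + 1.66*sd_x
di r(N)/_N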
*Let's look:
kdensity w, normal

*CATEGORICAL (z) vs CATEGORICAL (u)

*The best way to describe the relationship between two categorical
*variables is a cross-tab
tab u z

*With row and column percentages
tab u z, row col

*With a chi-squared bivariate test
tab u z, chi2

*Graphically, the best option is a bar chart
graph bar (count) z, over(u)
*Or a pie
graph pie z, over(u)

*CATEGORICAL (u) vs CONTINUOUS (x)

*Describing using tables
table u, contents(mean x)

*Testing with the mean test that we performed last class
ttest x, by(u)

*And graphically, using bar charts of group means
graph bar (mean) x, over(u)
*Or a pie
graph pie x, over(u)

*CONTINUOUS (w) vs CONTINUOUS (x)

*Any tabular description is very hard in this case.
*The best we can do is group by a criterion such as "1 if higher than 0"
*and "0 if lower than 0", and then use tables.

*The best way to show the relationship graphically is a
*scatterplot, where we can see the relation and joint distribution
*of the two variables
scatter w x

*Introducing a fit line:
scatter w x || lfit w x

*The bivariate test is the correlation test. What should we expect
*from the test after looking at the scatterplot?
pwcorr w x

*Including p-values
pwcorr w x, sig

*REGRESSION vs CORRELATION

*Let's generate a final variable that is directly correlated with x
*but not with z
gen r = rnormal() // our stochastic component
gen y = 3 + 4*x + r

*Summarize and describe y graphically
sum y, detail
graph box y
kdensity y

*And the relationship between y and x
scatter y x

*Now perform a correlation test between y and x
pwcorr y x
*What can we conclude?

*Do the same with the regress command
reg y x
*What can we conclude?

*Finally, we can add z to our model and test
reg y x z
*What can we conclude?
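*A sketch (an addition to the lab): the correlation coefficient equals
*the regression slope rescaled by the ratio of standard deviations,
*r = b * sd(x)/sd(y). We can verify this using stored results; the
*scalar names below are arbitrary:
quietly reg y x
scalar b_x = _b[x]
quietly sum x
scalar sdx = r(sd)
quietly sum y
scalar sdy = r(sd)
di b_x * sdx / sdy
*This should match the pwcorr y x result above.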