STATA Tutorial 2

Professor Erdinç

Please follow the directions below once you have located the Stata software on your computer. Room 114 (the Business Lab) has computers with Stata installed.

1. Wald Test

The Wald test is used to test the joint significance of a subset of coefficients. Take, for example, the two variables bedrms and baths in Model A below. These two variables are individually insignificant based on t-tests with very high p-values. But before dropping them both, we may want to test their joint significance using a Wald test.

Run in Stata:

test bedrms baths

The command test bedrms baths tests the null hypothesis that the coefficients of bedrms and baths are jointly zero. Since the F-statistic's p-value = 0.6375, we cannot reject the null. DROP both baths and bedrms from the regression equation; they do not belong in the model.

MODEL A

reg price sqft bedrms baths

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  3,    10) =   16.99
       Model |    85114.94     3   28371.6473          Prob > F      =  0.0003
    Residual |    16700.07    10   1670.00687          R-squared     =  0.8360
-------------+------------------------------           Adj R-squared =  0.7868
       Total |  101815.011    13    7831.9239          Root MSE      =  40.866

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sqft |      .1548    .0319404     4.85   0.001     .083632     .225968
      bedrms |   -21.5875    27.02933    -0.80   0.443    -81.8126    38.63758
       baths |   -12.1928       43.25    -0.28   0.784     -108.56    84.17425
       _cons |   129.0616    88.30326     1.46   0.175    -67.6903    325.8136
------------------------------------------------------------------------------

( 1)  bedrms = 0
( 2)  baths = 0

       F(  2,    10) =    0.47
            Prob > F =    0.6375

Special Wald Test

This is an F-test for the joint significance of all variables in the model, i.e., sqft, bedrms, and baths. Hence, the null states that the betas of all variables in the model are equal to zero.

Null H0: βsqft = βbedrms = βbaths = 0

Alternative HA: at least one β is nonzero.

Run this command in Stata:

test bedrms baths sqft

( 1)  bedrms = 0
( 2)  baths = 0
( 3)  sqft = 0

       F(  3,    10) =   16.99
            Prob > F =    0.0003

Notice that the F-test p-value is 0.0003, which is lower than α = 0.01. Hence, we can reject the null: at least one variable in this trio is significant. That variable is SQFT (based on the t-tests).

RESTRICTED MODEL

MODEL B

reg price sqft

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   54.86
       Model |  83541.4429     1   83541.4429          Prob > F      =  0.0000
    Residual |  18273.5678    12   1522.79731          R-squared     =  0.8205
-------------+------------------------------           Adj R-squared =  0.8056
       Total |  101815.011    13    7831.9239          Root MSE      =  39.023

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sqft |   .1387503    .0187329     7.41   0.000    .0979349     .179566
       _cons |    52.3509    37.28549     1.40   0.186   -28.88719     133.589
------------------------------------------------------------------------------

2. OLS Regression

regress yvar xvarlist Regress the dependent variable yvar on the independent variables xvarlist.

regress yvar xvarlist, vce(robust) Regress, but this time compute robust (Eicker-Huber-White) standard errors. We always use the vce(robust) option because we want consistent (i.e., asymptotically unbiased) results without having to assume homoskedasticity and normality of the random error terms. So remember always to specify the vce(robust) option after estimation commands. The "vce" stands for variance-covariance estimates (of the estimated model parameters).

regress yvar xvarlist, vce(robust) level(#) Regress with robust standard errors, and this time change the confidence interval to #% (e.g., use 99 for a 99% confidence interval).
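For instance, here is a minimal worked sketch using Stata's built-in auto dataset (the dataset and variables are illustrative, not part of the house-price example above):

sysuse auto, clear
regress price mpg weight, vce(robust)            // OLS with robust (Eicker-Huber-White) standard errors
regress price mpg weight, vce(robust) level(99)  // same regression, reporting 99% confidence intervals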

Improved Robust Standard Errors in Finite Samples

For robust standard errors, an apparent improvement is possible. Davidson and MacKinnon report two variance-covariance estimation methods that seem, at least in their Monte Carlo simulations, to converge more quickly to the correct variance-covariance estimates as the sample size n increases. Thus their methods seem to be better, although they require more computational time. By default, Stata makes Davidson and MacKinnon's recommended simple degrees-of-freedom correction by multiplying the estimated variance matrix by n/(n−K). However, we should learn about an alternative in which the squared residuals are rescaled. To use this formula, specify "vce(hc2)" instead of "vce(robust)". An alternative is "vce(hc3)" instead of "vce(robust)".
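A sketch of both options, again using the illustrative auto dataset:

sysuse auto, clear
regress price mpg weight, vce(hc2)   // HC2: rescales each squared residual by 1/(1 - h_i), where h_i is the observation's leverage
regress price mpg weight, vce(hc3)   // HC3: rescales by 1/(1 - h_i)^2, a more conservative correction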

Weighted Least Squares

We learn about (variance-) weighted least squares. If you know (to within a constant multiple) the variances of the error terms for all observations, this yields more efficient estimates (OLS with robust standard errors works properly using asymptotic methods but is not the most efficient estimator). Suppose you have, stored in a variable sdvar, a reasonable estimate of the standard deviation of the error term for each observation. Then weighted least squares can be performed as follows:

Run in Stata:

vwls yvar xvarlist, sd(sdvar)
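For instance, a hypothetical sketch in which the standard deviation of the error term is believed to be proportional to sqft (the proportionality assumption, and the reuse of the house-price variables, are purely illustrative):

generate sdvar = sqft                    // assumed estimate of each observation's error standard deviation
vwls price sqft bedrms baths, sd(sdvar)  // variance-weighted least squares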

3. Post-Estimation Commands

Commands described here work after OLS regression. They sometimes work after other estimation commands, depending on the command.

Fitted Values, Residuals, and Related Plots

predict yhatvar After a regression, create a new variable, having the name you enter here, that contains for each observation its fitted value ŷi.

predict rvar, residuals After a regression, create a new variable, having the name you enter here, that contains for each observation its residual ûi.

scatter y yhat x Plot variables named y and yhat versus x.

scatter resids x It is wise to plot your residuals versus each of your x-variables. Such “residual plots” may reveal a systematic relationship that your analysis has ignored. It is also wise to plot your residuals versus the fitted values of y, again to check for a possible nonlinearity that your analysis has ignored.

rvfplot Plot the residuals versus the fitted values of y.

rvpplot xvar Plot the residuals versus a "predictor" (the x-variable you name).
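Putting these commands together, a possible workflow after Model A (the new variable names yhat and resids are arbitrary):

regress price sqft bedrms baths
predict yhat               // fitted values
predict resids, residuals  // residuals
scatter resids sqft        // residuals versus one x-variable
rvfplot                    // residuals versus fitted values
rvpplot sqft               // residuals versus the predictor sqft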

Confidence Intervals and Hypothesis Tests

For a single coefficient in your statistical model, the confidence interval is already reported in the table of regression results, along with a 2-sided t-test for whether the true coefficient is zero. However, you may need to carry out F-tests, as well as compute confidence intervals and t-tests for “linear combinations” of coefficients in the model.

Here are example commands. Note that when a variable name is used in this subsection, it really refers to the coefficient (the βk) in front of that variable in the model equation. Run in Stata:

lincom logpl+logpk+logpf Compute the estimated sum of three model coefficients, which are the coefficients in front of the variables named logpl, logpk, and logpf. Along with this estimated sum, carry out a t-test with the null hypothesis being that the linear combination equals zero, and compute a confidence interval.

lincom 2*logpl+1*logpk-1*logpf Like the above, but now the formula is a different linear combination of regression coefficients.

lincom 2*logpl+1*logpk-1*logpf, level(#) As above, but this time change the confidence interval to #% (e.g. use 99 for a 99% confidence interval).

test logpl+logpk+logpf==1 Test the null hypothesis that the sum of the coefficients of variables logpl, logpk, and logpf totals 1. This only makes sense after a regression involving variables with these names. This is an F-test.

test (logq2==logq1) (logq3==logq1) (logq4==logq1) (logq5==logq1) Test the null hypothesis that four equations are all true simultaneously: the coefficient of logq2 equals the coefficient of logq1, the coefficient of logq3 equals the coefficient of logq1, the coefficient of logq4 equals the coefficient of logq1, and the coefficient of logq5 equals the coefficient of logq1; i.e., they are all equal to each other. This is an F-test.

test x3 x4 x5 Test the null hypothesis that the coefficient of x3 equals 0, the coefficient of x4 equals 0, and the coefficient of x5 equals 0. This is an F-test.
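As a combined illustration, consider a hypothetical log-cost regression containing input-price variables named logpl, logpk, and logpf (the variables logcost and logq are assumed for this sketch):

regress logcost logq logpl logpk logpf, vce(robust)
lincom logpl + logpk + logpf     // estimated sum of the three coefficients, with a t-test and confidence interval
test logpl + logpk + logpf == 1  // F-test that the input-price coefficients sum to 1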

Nonlinear Hypothesis Tests

After estimating a model, you could do something like the following:

testnl _b[popdensity]*_b[landarea] = 3000 Test a nonlinear hypothesis. Note that coefficients must be specified using _b, whereas the linear “test” command lets you omit the _b[].

testnl (_b[mpg] = 1/_b[weight]) (_b[trunk] = 1/_b[length]) For multi-equation tests you can put parentheses around each equation (or use multiple equality signs in the same equation).

Computing Estimated Expected Values for the Dependent Variable

di _b[xvarname] Display the value of an estimated coefficient after a regression. Use the variable name "_cons" for the estimated constant term. Of course there is no need just to display these numbers; the good thing is that you can use them in formulas. See the next example.

di _b[_cons] + _b[age]*25 + _b[female]*1 After a regression of y on age and female (but no other independent variables), compute the estimated value of y for a 25-year-old female. See also the predict command mentioned above. Stata's "adjust" command also provides a powerful tool to display predicted values when the x-variables take on various values (but for your own understanding, do the calculation by hand a few times before you try using adjust).

Displaying the Adjusted R² and Other Estimation Results

display e(r2_a) After a regression, the adjusted R-squared can be looked up as "e(r2_a)". (Stata does not report the adjusted R² when you do regression with robust standard errors, because robust standard errors are used when the variance (conditional on your right-hand-side variables) is thought to differ between observations, and this would alter the standard interpretation of the adjusted R² statistic. Nonetheless, people often report the adjusted R² in this situation anyway. It may still be a useful indicator, and often the (conditional) variance is still reasonably close to constant across observations, so that it can be thought of as an approximation to the adjusted R² statistic that would occur if the (conditional) variance were constant.)

ereturn list Display all results saved from the most recent model you estimated, including the adjusted R² and other items. Items that are matrices are not displayed; you can see them with the command "matrix list e(matrixname)".

Plotting Any Mathematical Function

twoway function y=exp(-x/6)*sin(x), range(0 12.57) Plot a function graphically, for any function (of a single variable x) that you specify. A command like this may be useful when you want to examine how a polynomial in one regressor (which here must be called x) affects the dependent variable in a regression, without specifying values for other variables.

Influence Statistics

Influence statistics give you a sense of how sensitive your estimates are to particular observations in the data. This may be particularly important if there might be errors in the data. After running a regression, you can compute how different the estimated coefficient of any given variable would be if any particular observation were dropped from the data. To do so for one variable, for all observations, use this command:

predict newvarname, dfbeta(varname) Computes the influence statistic ("DFBETA") for varname: how much the estimated coefficient of varname would change if each observation were excluded from the data. The change, divided by the standard error of varname's coefficient, is stored for each observation i in the ith observation of the newly created variable newvarname. Then you might use "summarize newvarname, detail" to find out the largest values by which the estimates would change (relative to the standard error of the estimate). If these are large (say, close to 1 or more), then you might be alarmed that one or more observations may completely change your results, so you had better make sure those results are valid or else use a more robust estimation technique (such as "robust regression," which is not related to robust standard errors, or "quantile regression," both available in Stata). If you want to compute influence statistics for many or all regressors, Stata's "dfbeta" command lets you do so in one step.
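For example, after Model A you might examine how influential each observation is for the sqft coefficient (the new variable name dfb_sqft is arbitrary):

regress price sqft bedrms baths
predict dfb_sqft, dfbeta(sqft)  // DFBETA for sqft, one value per observation
summarize dfb_sqft, detail      // check whether any values are near 1 or larger in magnitude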

Functional Form Test

It is sometimes important to ensure that you have the right functional form for the variables in your regression equation. Sometimes you do not need to be perfect; you just want to summarize roughly how some independent variables affect the dependent variable. But sometimes, e.g., if you want to control fully for the effects of an independent variable, it can be important to get the functional form right (e.g., by adding polynomials and interactions to the model). To check whether the functional form is reasonable and to consider alternative forms, it helps to plot the residuals versus the fitted values and versus the predictors. Another approach is to formally test the null hypothesis that the patterns in the residuals cannot be explained by powers of the fitted values. One such formal test is the Ramsey RESET test:

estat ovtest Ramsey's (1969) regression equation specification error test (RESET).
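For example, after Model A:

regress price sqft bedrms baths
estat ovtest  // RESET test using powers of the fitted values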

Heteroskedasticity Tests

After running a regression, you can carry out White's test for heteroskedasticity using the command:

estat imtest, white

Note, however, that there are many other heteroskedasticity tests that may be more appropriate. Stata’s imtest command also carries out other tests, and the commands hettest and szroeter carry out different tests for heteroskedasticity.

The Breusch-Pagan Lagrange multiplier test, which assumes normally distributed errors, can be carried out after running a regression, by using the command:

estat hettest, normal

Other tests that do not require normally distributed errors include:

estat hettest, iid (Heteroskedasticity test – Koenker's (1981) score test; assumes iid errors.)

estat hettest, fstat (Heteroskedasticity test – Wooldridge’s (2006) F-test, assumes iid errors.)

estat szroeter, rhs mtest(bonf) (Heteroskedasticity test – Szroeter (1978) rank test for null hypothesis that variance of error term is unrelated to each variable.)

estat imtest (Heteroskedasticity test – Cameron and Trivedi (1990); also includes tests for higher-order moments of residuals (skewness and kurtosis).)
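For example, several of these tests can be run after the same regression:

regress price sqft bedrms baths
estat imtest, white              // White's test
estat hettest, iid               // Koenker's score test
estat szroeter, rhs mtest(bonf)  // Szroeter's rank test for each right-hand-side variable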

Serial Correlation Tests

To carry out these tests in Stata, you must first "tsset" your data. For a Breusch-Godfrey test where, say, p = 3, run your regression and then use Stata's "estat bgodfrey" command:

estat bgodfrey, lags(1 2 3) (Breusch-Godfrey test for serial correlation of order up to 3.)
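A sketch with hypothetical time-series data (the time variable year and the regression variables are illustrative):

tsset year                   // declare the data to be a time series
regress y x1 x2
estat bgodfrey, lags(1 2 3)  // Breusch-Godfrey test with p = 3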

Other tests for serial correlation are available. For example, the Durbin-Watson d-statistic is available using Stata's "estat dwatson" command. However, the Durbin-Watson statistic assumes there is no endogeneity even under the alternative hypothesis, an assumption which is typically violated if there is serial correlation, so you really should use the Breusch-Godfrey test instead (or use Durbin's alternative test, "estat durbinalt").

4. LM (Lagrange Multiplier) Test for Nonlinearities and Model Specification / Likelihood Ratio Test

Run in Stata:

Step 1: reg y x1 x2 x3 (Run in Stata the regression of the dependent variable on the independent variables.)

Step 2: estimates store a1

Step 3: gen x2sq = x2^2 (Generate the square of the variable x2.)

Step 4: reg y x1 x2 x3 x2sq (Now run the regression including the new variable x2sq.)

Step 5: estimates store a2

Step 6: lrtest a1 a2

Step 7: Reject H0 if: a. the test statistic exceeds its critical value (lrtest reports a chi-squared statistic rather than an F-statistic); or, equivalently, b. the p-value < α.
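A concrete sketch of these steps using the illustrative auto dataset (here mpg plays the role of x2):

sysuse auto, clear
reg price mpg weight
estimates store a1
gen mpgsq = mpg^2            // square of the candidate nonlinear variable
reg price mpg weight mpgsq
estimates store a2
lrtest a1 a2                 // likelihood-ratio test of the restricted vs. augmented model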
