Why are linear probability models not homoscedastic? Because the error term is equal to y; yhat, or p-phat. P-hat depends on x, var phat also dep

11 May Why are linear probability models not homoscedastic? Because the error term is equal to y; yhat, or p-phat. P-hat depends on x, var phat also dep

Posted at 06:40h in Business & Finance / Economics by

i need help sample questions added

1) May 11 noon – May 11 11:59 pm

2) 12 hours

3) I will send you pics and you will solve and send them back to me

4) show steps

5) maybe 10-15 questions

6) yes

Final_practice_answers.docx

Suggested Answers if not directly in lecture notes

1. Why are linear probability models not homoscedastic?

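Because the outcome y is binary, the error term is u = y - p, where p = P(y=1|x) = β0 + β1x is the probability implied by the model. Conditional on x, y is a Bernoulli variable, so Var(u|x) = p(1 - p). Since p depends on x, the error variance also depends on x. In a linear probability model the variance is therefore highest for probabilities near 0.5 and lower as the probability moves toward 0 or 1.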
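To see the pattern numerically, here is a minimal Python sketch that evaluates p(1 - p) at a few probabilities. The function name and the grid of probabilities are illustrative choices, not part of the original answer.

```python
# Conditional error variance in a linear probability model: Var(u | x) = p * (1 - p),
# where p is the probability that y = 1 at a given x.
def lpm_error_variance(p):
    """Variance of a Bernoulli outcome with success probability p."""
    return p * (1.0 - p)

# Illustrative grid of fitted probabilities (assumed values).
probabilities = [0.1, 0.3, 0.5, 0.7, 0.9]
variances = [lpm_error_variance(p) for p in probabilities]

for p, v in zip(probabilities, variances):
    print(f"p = {p:.1f}  ->  Var(u | x) = {v:.2f}")
```

The variance peaks at p = 0.5 and shrinks symmetrically toward the extremes, which is why the homoscedasticity assumption fails.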

2. Suppose the CDC is worried that the rate of growth of flu this season is very different from the usual rate of 1 percent a week and is considering making flu vaccines free and mandatory for the remainder of the season to curb the growth rate for the next months. We collect data on the number of flu cases Y, per week, during 20 weeks, t=1,2, …19, and obtain the following estimates:

week R2 = .62

(5.11) (.007) n = 20

week R2 = .96

(31) (.0003) n = 20

Based on the estimates obtain a 95% confidence interval for the growth rate . What do you recommend to the CDC?

I would use the second estimated equation, because its slope coefficient gives the growth rate directly. The 95% confidence interval for the estimated growth rate is [0.012 - 1.96*0.0003, 0.012 + 1.96*0.0003] = [0.0114, 0.0126]. Does the interval contain 0.01? If yes, we cannot reject the usual 1% a week growth. If not, we reject equality and conclude that the growth rate is different from 1 percent/week. Here 0.01 lies below the lower bound, so we reject the usual rate.
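The interval arithmetic can be checked with a short Python sketch. The variable names are illustrative. The critical value 1.96 follows the answer above; with only 20 observations one could instead use the t critical value with 18 degrees of freedom (about 2.10), which widens the interval slightly but does not change the conclusion.

```python
# 95% confidence interval for the weekly growth rate, using the numbers quoted in the answer.
estimate = 0.012        # estimated growth rate (slope of the second equation)
std_error = 0.0003      # its standard error
critical_value = 1.96   # large-sample normal critical value, as used in the answer

lower = estimate - critical_value * std_error
upper = estimate + critical_value * std_error

usual_rate = 0.01
contains_usual_rate = lower <= usual_rate <= upper

print(f"95% CI: [{lower:.6f}, {upper:.6f}]")
print("Contains 0.01:", contains_usual_rate)
```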

3. From a random sample of agricultural yields Y (1000 dollars per acre), over years and region in the US, we have estimated the following equation for Y:

lnYhat= 0.49 + .01 GE R2 = .32

(.11) (.01) n = 1526 (these estimates are totally made up)

a. Interpret the results on the Genetically engineered (GE) technology on yields. (follow SSS= Sign Size Significance)

b. Suppose GE is used more on the West Coast, where crops also have higher yields. How would the estimated effect of GE be affected by including a West Coast region dummy in the equation? Justify your answer.

We have OVB. Corr(West, GE) > 0 and Corr(yield, West) > 0, so the GE coefficient had a positive OVB before. When West is included, the coefficient on GE would drop.

c. If we include region fixed effects, would we control for the factors in b? Justify your answer.

Yes. A region FE captures anything that is region specific affecting yields, so west coast is constant and region specific and would be controlled by a region FE

d. If yields have been generally improving over time and GE adoption also was more recently introduced in the USA, what would happen to the coefficient of GE if we include year fixed effects?

If we had a panel data set and did not include time dummies or a trend, then, given the above, the GE coefficient would have a positive OVB. Adding year FE would bring the coefficient on GE down. You can think of the time trend as an omitted variable that affects all observations in a given time period in the same way. Adding year fixed effects controls for this omitted variable and will change our estimates.

4. A recent paper investigates whether advertisement for Viagra causes increases in birth rates in the USA. Apparently, advertising for products, including Viagra, happens on TV and reaches households that have a TV within a Marketing region and does not happen in areas outside a designated marketing region. What the authors do is look at hospital birth rates in regions inside and near the advertising region border and collect data on dollars per 100 people (Ads) for a certain time, and compare those to the birth rates in hospitals located outside and near the advertising region designated border. They conduct a panel data analysis. The Table below has the main result in column 1 and then robustness check in column 2 adding weather controls.

a. Interpret SSS of all the estimates in column 1, where Ads variable is measured in $ dollars per 100 people.

What is the Marginal effect of Advertising on Birth rates (evaluate that at the mean of Ads=5.58 $/100 people).

ME add=0.0872 +2 (-0036)*ads. Plug in 5.58 to get answer

b. Why do they include Zip Code Fixed Effects, in particular what would be a variable that they are controlling for when adding Zip Code fixed effects that could cause a problem (and what problem) when interpreting the Ads Marginal Effects causal estimate in a?

They control for zip code FE because there could be differences by zip-code that could explain birthrates and also could be correlated with Viagra ads, and if not controlled for would result in OVB. For example, suppose a zip code has mostly seniors, therefore has low birth rates and also Viagra ads are more common.

c. Why do they control month FE?

Once again they are concerned with OVB, this time from factors that change over time and are common to all zip codes. Such factors affect birth rates and could also be correlated with ads. If not controlled for, they would be picked up by the estimates on Ads and Ads squared and result in OVB.

d. What is the conclusion from this Table in terms of Viagra Ads causing birth rates in the USA

One more

Viagra ad is predicted to, holding all else constant, using specif (3), increase birthrates significantly by 0.0954-2 * 00042*averageAds=0.0954-2 * 00042*5.58 children births per 1000 population. Viagra ads cause more babies.

5. Do more right hand side variables always improve the fit of an OLS regression in a SLR model?

No, R2 cannot go down when we add more variables, but it does not have to increase.

6. If x1i= 5 x2i+ei, where ei is a random term, can you estimate a regression of y on x1 and x2? What would be the problem? And how could you detect it?

There would be near-perfect multicollinearity. We could detect this by examining the size of the standard errors on x1 and x2 and observing that they are very large relative to when we only include one or the other.
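The detection idea can be illustrated with simulated data. The sketch below is an assumption-laden example, not part of the original answer: the sample size, noise scales, and the outcome equation are made up, and the helper `ols_standard_errors` is a hypothetical name. It compares the standard error on x1 when x2 is included with the standard error when x2 is left out.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficients and their conventional (homoscedastic) standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)   # estimated error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # variance-covariance matrix of beta
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x1 = 5 * x2 + rng.normal(scale=0.1, size=n)    # x1 = 5*x2 + e, with a small random e
y = 1 + 0.5 * x1 + rng.normal(size=n)          # simulated outcome (illustrative only)

const = np.ones(n)
_, se_both = ols_standard_errors(np.column_stack([const, x1, x2]), y)
_, se_x1_only = ols_standard_errors(np.column_stack([const, x1]), y)

print("SE on x1 with x2 included:", se_both[1])
print("SE on x1 with x2 excluded:", se_x1_only[1])
```

The regression can still be estimated because e breaks the exact linear dependence, but the standard error on x1 is far larger when both regressors are included.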

7. Please use the Stata output below to answer the following.

reg lwage educ exper female

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(3, 522)     =   94.75
       Model |  52.2939096     3  17.4313032           Prob > F      =  0.0000
    Residual |  96.0358418   522  .183976708           R-squared     =    ____
-------------+------------------------------           Adj R-squared =  0.3488
       Total |  148.329751   525   .28253286           Root MSE      =  .42893

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0912897   .0071232  _______    0.000    .0772962    .1052833
       exper |   .0094139   .0014493     6.50    0.000    .0065667     .012261
      female |  -.3435967   .0376668    -9.12    0.000   -.4175939   -.2695996
       _cons |   .4808357   .1050163     4.58    0.000    .2745292    .6871421
------------------------------------------------------------------------------

a) What is the t for educ missing in the output above and also the R squared ?
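Both blanks can be recovered from other entries in the output: the t statistic is the coefficient divided by its standard error, and R-squared is the model sum of squares divided by the total sum of squares. A minimal Python sketch, with illustrative variable names and numbers copied from the Stata output:

```python
# Missing t statistic for educ: coefficient divided by standard error.
coef_educ = 0.0912897
se_educ = 0.0071232
t_educ = coef_educ / se_educ

# Missing R-squared: explained (model) sum of squares over total sum of squares.
ss_model = 52.2939096
ss_total = 148.329751
r_squared = ss_model / ss_total

print(f"t for educ = {t_educ:.2f}")
print(f"R-squared  = {r_squared:.4f}")
```

The R-squared obtained this way is consistent with the reported adjusted R-squared of 0.3488, which must be slightly smaller.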

b) Interpret the coefficient on educ. Make sure to comment on the size and significance of the coefficient.

b) Test the null that female salaries are 50% lower than male salaries at 5 % significance. Show your work using the five steps in hypothesis testing.

Step 1:

H0 : βfemale = −0.5

H1 : βfemale ≠ −0.5

Step 2:

t = (−0.3435967 − (−0.5)) / 0.0376668 = 4.15

Step 3:

The critical value for a t-stat at 5% significance and 522 degrees of freedom is 1.96

Step 4:

|4.15| > 1.96

Step 5:

We reject the null that female salaries are 50% lower than male salaries at the 5% significance level
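The five steps can be reproduced in a few lines of Python. This is a sketch with illustrative variable names; the coefficient and standard error come from the Stata output, and 1.96 is the large-sample critical value used in Step 3.

```python
# Two-sided t test of H0: beta_female = -0.5 at the 5% level.
beta_hat = -0.3435967    # estimated coefficient on female
std_error = 0.0376668    # its standard error
beta_null = -0.5         # hypothesized value under H0
critical_value = 1.96    # 5% two-sided critical value with 522 degrees of freedom

t_stat = (beta_hat - beta_null) / std_error
reject_null = abs(t_stat) > critical_value

print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```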

c) Given the two outputs below, would you conclude that the wages are influenced by respondents being female and in the west coast, considered together. Show your work using the five steps in hypothesis testing.

reg lwage educ exper female west

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(4, 521)     =   73.27
       Model |  53.4024249     4  13.3506062           Prob > F      =  0.0000
    Residual |  94.9273265   521  .182202162           R-squared     =  0.3600
-------------+------------------------------           Adj R-squared =  0.3551
       Total |  148.329751   525   .28253286           Root MSE      =  .42685

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0909875   .0070898    12.83    0.000    .0770594    .1049156
       exper |   .0094465   .0014423     6.55    0.000     .006613      .01228
      female |  -.3487115    .037542    -9.29    0.000   -.4224638   -.2749591
        west |   .1226554   .0497271     2.47    0.014    .0249653    .2203456
       _cons |    .465773   .1046868     4.45    0.000    .2601128    .6714332
------------------------------------------------------------------------------

. reg lwage educ exper

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(2, 523)     =   86.86
       Model |  36.9850396     2  18.4925198           Prob > F      =  0.0000
    Residual |  111.344712   523  .212896199           R-squared     =  0.2493
-------------+------------------------------           Adj R-squared =  0.2465
       Total |  148.329751   525   .28253286           Root MSE      =  .46141

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0979356   .0076224    12.85    0.000    .0829613    .1129099
       exper |   .0103469   .0015551     6.65    0.000    .0072919     .013402
       _cons |   .2168544    .108595     2.00    0.046    .0035183    .4301904
------------------------------------------------------------------------------

Run an F-test (for process see notes)
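As a sketch of that F-test, the statistic can be computed from the residual sums of squares in the two outputs: the regression with female and west is the unrestricted model and the regression without them is the restricted model. The variable names are illustrative, and the critical value of roughly 3.0 for F(2, 521) at the 5% level is taken from a standard F table rather than computed here.

```python
# F test of H0: beta_female = beta_west = 0, using SSR from the two Stata outputs.
ssr_restricted = 111.344712     # residual SS from: reg lwage educ exper
ssr_unrestricted = 94.9273265   # residual SS from: reg lwage educ exper female west
q = 2                           # number of restrictions (female and west)
df_unrestricted = 521           # n - k - 1 in the unrestricted model

f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_unrestricted)

critical_value = 3.0            # approximate 5% value for F(2, 521), from a table (assumption)
reject_null = f_stat > critical_value

print(f"F = {f_stat:.2f}, reject H0: {reject_null}")
```

Since the statistic is far above the critical value, female and west are jointly significant at the 5% level.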

Question. 8

During a plastic bag ban implemented in Berkeley but not in Richmond, a co-author and I collected data on the number of paper bags used per transaction before (in 2012) and after the election (2013) in Berkeley and also in a neighboring city that did not pass the plastic bag ban. Please specify a linear regression model to estimate the CAUSAL impact of the plastic bag ban policy on the number of paper bags used per transaction. Clearly label all the variables, the Y, and all the variables on the right hand side and explain what each coefficient means in the regression you specify.

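Bags_used = beta_0 + beta_1 Berkeley + beta_2 Year_2013

+ beta_3 Berkeley X Year_2013 + u

where Bags_used is the number of paper bags used per transaction, Berkeley = 1 for transactions in Berkeley (0 for Richmond), and Year_2013 = 1 for transactions in 2013 (0 for 2012).

beta_0 = Average bags used in Richmond in 2012

beta_1 = Difference between Berkeley and Richmond in 2012

beta_2 = Change from 2012 to 2013 in Richmond

beta_3 = Diff-in-diff estimated treatment effect: the change in Berkeley from 2012 to 2013 minus the change in Richmond over the same period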
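With one dummy for city, one for year, and their interaction, the four coefficients map exactly onto the four group means. The Python sketch below shows that mapping. The group means are hypothetical numbers invented for illustration, not data from the study.

```python
# Hypothetical average paper bags per transaction in each city-year cell (made-up numbers).
mean_bags = {
    ("Richmond", 2012): 0.20,
    ("Richmond", 2013): 0.25,
    ("Berkeley", 2012): 0.30,
    ("Berkeley", 2013): 0.80,
}

# Coefficients of the diff-in-diff regression expressed through the cell means.
beta_0 = mean_bags[("Richmond", 2012)]
beta_1 = mean_bags[("Berkeley", 2012)] - mean_bags[("Richmond", 2012)]
beta_2 = mean_bags[("Richmond", 2013)] - mean_bags[("Richmond", 2012)]
beta_3 = (
    (mean_bags[("Berkeley", 2013)] - mean_bags[("Berkeley", 2012)])
    - (mean_bags[("Richmond", 2013)] - mean_bags[("Richmond", 2012)])
)

print(f"beta_0 = {beta_0:.2f}, beta_1 = {beta_1:.2f}, "
      f"beta_2 = {beta_2:.2f}, beta_3 = {beta_3:.2f}")
```

Adding all four coefficients recovers the Berkeley 2013 mean, which is a quick way to check that the interpretation of each coefficient is right.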

What data would you need to check for the main assumption behind the estimation strategy you use above?

PRE PERIOD TRENDS – we would want data from 2011 (at least) to check to see that pre-trends in the number of bags used was similar in Berkeley and Richmond.

9. Consider the model below that relates income to years of experience and years of education:

a) How would you modify this equation to check whether the effect of experience depends on the level of education? What test would you perform?

Add the interaction term of experience and education to the model. Then test to see if beta3 is equal to zero.

b) Suppose now that the effect of experience does not depend on education, but education is specified as a category variable, “no diploma”, “high-school diploma”, “college diploma and above”. How would you re-specify the model? How would you test that education has no influence on income?

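Replace years of education with two dummy variables, one for “high-school diploma” and one for “college diploma and above”, leaving “no diploma” as the omitted base category. Then test using an F-test that beta2 and beta3, the coefficients on the two education dummies, are jointly equal to zero.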

c) Suppose you found data from an IQ test for everyone in your sample. You see that the new IQ variable is positively correlated with the probability of someone graduating from high-school and college. It is also positively correlated with income. If you include IQ in your model from part b), describe what you expect to happen to the coefficient(s) on your education variable(s).

Both beta2 and beta3 should fall. This is because the omitted variable IQ was positively correlated with high-school and college and is positively correlated with income. This caused upward bias in our coefficients when IQ was omitted.

11. You have a data set that contains information about individuals’ gender, the number of children they have, their family income, and whether they are in the labor force

You estimate the following linear probability model: 

P(laborforce=1) = β0 + β1 children + β2 female + β3 (children × female) + u

a.) In terms of the model’s parameters, what is the marginal effect of having an additional child on a woman’s probability of being in the labor force? What is the marginal effect of having an additional child on a man’s probability of being in the labor force?

Marginal effect of a child = β1 + β3 female. This means the marginal effect is β1 for a man and β1 + β3 for a woman.

b.) Based on the graph below, what signs do you expect for the parameters β0, β1, β2 and β3? Be specific about your reasoning for each parameter.

· β0 is the intercept for the male line, therefore β0 > 0

· β0 + β2 is the intercept for the female line, which is less than that for the male, therefore β2 < 0

· β1 is the slope for the male line, which is positive, therefore β1 > 0

· β1 + β3 is the slope for the female line, which is negative, therefore β3 < 0

c.) Suppose that the government implements a program that provides free childcare to families with income below $20,000 a year (and no assistance to families with higher income).

i. Propose an estimation technique to estimate the effect of this program on labor participation. Write down the regression equation you would use for the technique (ignore gender and the number of children for this part).

ii. State the assumption you need to make in order for this technique to successfully recover the causal effect of the program on labor participation.

iii. Finally, give an example of a test you might perform to test this assumption?

i. We can conduct an RD around the $20,000 threshold to recover the effect of the program.

P(laborforce=1) = β0 + β1 program + β2 (income - 20,000) + β3 program × (income - 20,000) + u

where program = 1 if family income is below $20,000 and 0 otherwise, so β1 is the jump in participation at the threshold.

ii. We need to assume that the relationship between income and labor force participation would have been continuous at the $20,000 threshold were it not for the program. In other words, individuals just above and just below the threshold are comparable except for the program.

iii. We would want to check the smoothness across the threshold of other variables that should not be impacted by the program. For example, we can check levels of education or job experience. We could also check the histogram of income around the threshold to make sure that there is no bunching in the running variable.
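A simulated example may help make the RD regression concrete. Everything in the sketch below is assumed for illustration: the income range, the sample size, and the true jump of 0.15 at the threshold are invented, and income is measured in $1,000 so the cutoff is 20. The regression is the linear probability RD specification from part i, estimated by least squares with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

income = rng.uniform(10.0, 30.0, size=n)   # simulated annual income in $1,000
program = (income < 20.0).astype(float)    # eligible for free childcare below the cutoff
centered = income - 20.0                   # running variable centered at the threshold

true_jump = 0.15                           # assumed program effect in this simulation
prob = 0.5 + true_jump * program + 0.01 * centered
laborforce = (rng.uniform(size=n) < prob).astype(float)

# Regressors: constant, program, centered income, and their interaction.
X = np.column_stack([np.ones(n), program, centered, program * centered])
beta, *_ = np.linalg.lstsq(X, laborforce, rcond=None)
estimated_jump = beta[1]                   # coefficient on program = discontinuity at the cutoff

print(f"Estimated jump at the threshold: {estimated_jump:.3f}")
```

Centering income at the threshold is what lets the coefficient on program be read directly as the discontinuity.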

12. Below is the estimation of a standard model of household energy consumption:

(0.57) (0.61) (0.25)

where energy is the annual consumption of energy of the household (in 1000 kilowatt-hours), income its annual income (in $1,000), and price the average price of energy over the year (in cents/kilowatt-hours), and standard errors are in parentheses.

a. What is the economic interpretation of the model coefficient?

It is the own price elasticity of energy consumption.

b. Interpret the estimated coefficient .

Sign: As expected, price is negatively correlated with consumption of energy.

Size: A 10% increase in price leads to a predicted decrease of 7.9% in energy consumption.

Significance: |t| = 0.79/0.25 = 3.16 > 1.96. The coefficient is statistically different from zero at the 5% level, so we reject the null that price has no effect on energy consumption.

c. What is the predicted effect of a 15% increase in the price of energy on household average energy consumption?

0.15 × (-0.79) = -0.1185 -> 11.85% decrease in predicted energy consumption
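Because the coefficient is an elasticity, the predicted percentage change in consumption is the elasticity times the percentage change in price. A minimal Python sketch with illustrative variable names:

```python
# Predicted percentage change in energy consumption for a given percentage price change.
price_elasticity = -0.79    # estimated coefficient on log price (from the answer above)
price_change_pct = 15.0     # price increase, in percent

predicted_change_pct = price_elasticity * price_change_pct

print(f"Predicted change in energy consumption: {predicted_change_pct:.2f}%")
```

This linear approximation is the usual reading of a log-log coefficient and works best for small percentage changes.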

13. From a sample of 200 households, we estimated the following two models of gasoline consumption (t-statistics in parentheses):

a. Using the estimated coefficients in the first question, how does gasoline vary with income?

Both of the income variables are significant at the 10% level, although not at the 5% level. From the point estimates, gasoline consumption increases with income up to annual incomes of 5 million USD (0.25/0.00005 = 5000 thousand USD), and then begins to decrease. Since annual incomes in the millions of USD are extremely rare, gasoline consumption essentially always increases with income.

b. Are the two income variables jointly significant?

1. H0: , H1:

3. At the 5% level, with 2 numerator d.f. and 196 denominator d.f., c = 3

4. 5.02 > 3, so we reject H0

5. At the 5% significance level, we reject the null hypothesis that income does not affect gasoline consumption.

c. Comparing the suv coefficient in the second equation to the first, what do you conclude about the correlation between income and SUV ownership?

Because the coefficient on suv is larger in the second equation than in the first, we have upward bias. We also noted that gasoline consumption is generally increasing with income in our sample. Thus, by the omitted variable bias formula, the correlation between suv ownership and income must be positive too.

Final 2018- suggested answers

Introductory Applied Econometrics Final Exam

EEP/IAS 118

YOUR Section Day and Time

Question 1. Suppose you have a random sample of people in the U.S. with data on the average number of hours they sleep each week, and their age in years. You obtain the following regression results with this data:

Instead, a colleague of yours tries a quadratic functional form and obtains the following results:

Your colleague argues that the information above is enough to conclude that her regression, model (2), is a better fit. Is this correct? Explain briefly why or why not.

No, the above information is NOT enough. The two models have different numbers of regressors, and R2 never falls when a regressor is added, so it is necessary to compare adjusted R2 instead of R2.

Question 2. Arizona State University is experimenting with combining in-classroom and online learning. When students enroll in a class, they get randomly assigned to the traditional version or to a new version in which the professor posts videos of lecture online for students to watch at home and uses class time exclusively for activities.

After the first semester of this new teaching style, ASU hires you to evaluate the impact of these innovations in teaching on student learning. Since ASU is so big, each semester several different professors teach Econ 101. ASU gives you a data set where each row is a professor ID, average final grade given, and a dummy equal to 1 if the professor taught the new version of Econ 101.

What kind of data set are you working with? (ie, time series, cross-section, pooled cross-section, or panel) What is the unit of observation?

This is cross-sectional data (the data include many units in one time period). The unit of observation is a class (you could also say that the unit of observation is a professor — in the cross section, they are the same).

Write down a simple, univariate regression model that would tell you the difference in student outcomes between the hybrid and traditional versions of Econ 101. Explain how to interpret each coefficient.

grade_i = b0 + b1 hybrid_i + u_i

where grade_i is the average final grade given by professor i, and hybrid_i = 1 if professor i taught the hybrid version and 0 if she taught the traditional version. The intercept b0 is the average final grade for traditional classes; b1 is the difference in average final grade between hybrid and traditional classes. The average final grade for hybrid classes is b0 + b1.

Explain why the random assignment of students means we do not need to worry about omitted variable bias from student characteristics.

Under randomization, the treatment and control groups are both representative samples of the population with no statistically significant differences in observables before the treatment. Importantly, no characteristics that predict the outcome also predict treatment status. This ensures that the control group is a good counterfactual for the treatment group — if the students in the hybrid class had instead taken the traditional class, they would have gotten the same grades as the students who did take the traditional class.

Even though we don't need to worry about bias from student characteristics, there could be other omitted variables. Give an example of an omitted variable that could be causing bias. Explain why, using the two conditions for OVB.

Professor characteristics like experience, clarity, or quality of instructional materials will be correlated with final grades, and may also be correlated with hybrid since we only have one time period in our cross-sectional data.

You explain these concerns to ASU and convince them to allow you to collect data for another 3 semesters. By now, every professor has taught both the hybrid and traditional course twice. Your new data set includes a professor ID, average final grade given, dummy for being the new Econ 101, and the semester in which it was taught (1, 2, 3, or 4).

Now what kind of data are you working with? What is the unit of observation?

Now we have panel data, on each professor in each semester. The unit of observation is a professor-semester, i.e., Villas-Boas in Spring 2018, Villas-Boas in Fall 2017,…, etc.

Explain what kind of fixed effects you plan to include in your regression. Give an example of an omitted variable that each fixed effect controls for.

Now I can include both professor fixed effects and time fixed effects. Professor fixed effects control for all aspects of a professor that do not change over time. One example might be the number of office hours that each professor offers each week. Time fixed effects control for anything that might affect the grades in every class within the same time period (in this case, a semester). For example, maybe there was a staff strike in Spring 2018 which affected students’ access to library study space.

Do you expect the coefficient from the simple, univariate regression to change? Why or why not?

Yes — now that we’ve controlled for professor characteristics like quality, I think the coefficient on hybrid will change. I think that it is likely that hybrid was correlated with professor quality in the simple univariate model, and since it is also correlated with grades, this caused OVB. Now that we have included professor fixed effects, we have removed this OVB, so b1 should change.

Question 3. We are really concerned about the flu outbreak in the college campuses in the US. We collect data for a random sample of college students in 4 campuses for three years. An observation is whether a student in campus j in year t has the flu (HASFLU). We also have data on whether the student got the flu shot (SHOT) and whether the student lives in the dorm (DORM). In addition, we collect data on two characteristics of the 4 campuses, namely, on whether they are a private university or not (PRIVATE), and on the number of students in each campus per year (N).

Please test whether the two observed characteristics of the campuses significantly affect the probability of getting the flu, using the estimates from logit models in Stata below

Model 1

logit HASFLU SHOT DORM N PRIVATE

Logistic regression                               Number of obs   =     230800
                                                  LR chi2(6)      =     121.39
                                                  Prob > chi2     =     0.0000
Log likelihood = -372.90                          Pseudo R2       =     0.1179

------------------------------------------------------------------------------
        inlf |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        SHOT |  -.0350754   .0080669    -4.35    0.000   -.0508862   -.0192646
        DORM |   .2575602   .0409102     6.30    0.000    .1773777    .3377427
     PRIVATE |  -.0576886   .0128004    -4.51    0.000   -.0827769   -.0326003
           N |    .484777   .1980748     7.50    0.000    1.872996    1.096558

Model 2

logit HASFLU SHOT DORM

Logistic regression                               Number of obs   =     230800
                                                  LR chi2(6)      =     101.39
                                                  Prob > chi2     =     0.0000
Log likelihood = -392.90                          Pseudo R2       =     0.0779

------------------------------------------------------------------------------
        inlf |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        SHOT |  -.0450754   .0080669    -4.75    0.000   -.0518862   -.0182646
        DORM |   .2575602   .0409102     6.30    0.000    .1773777    .3377427

Perform the 5 steps in hypothesis testing.

1. H0: βprivate= βN=0

HA: either βprivate or βN or both are not equal to 0

2. test statistic= 2(LLUR-LLR) = 2(-372.90+392.90)=40

3. The critical value comes from a chi square distribution with q=2 degrees of freedom. At a 95% confidence level, c=5.99

4. 40>5.99 so we reject the null hypothesis

5. At a 95% confidence level, there is statistical evidence to support that whether a school is private or public and the number of students at that school jointly affect whether a student has the flu.
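The likelihood ratio test above can be reproduced in a few lines of Python. The variable names are illustrative. The critical value uses the fact that a chi-square distribution with exactly 2 degrees of freedom has upper tail probability exp(-x/2), so its 5% critical value is -2 ln(0.05); this shortcut applies only when q = 2.

```python
import math

# Likelihood ratio test of H0: beta_private = beta_N = 0.
ll_unrestricted = -372.90   # log likelihood of Model 1 (with N and PRIVATE)
ll_restricted = -392.90     # log likelihood of Model 2 (without them)

lr_stat = 2 * (ll_unrestricted - ll_restricted)

# 5% critical value of chi-square with 2 degrees of freedom (closed form for 2 df only).
critical_value = -2 * math.log(0.05)
reject_null = lr_stat > critical_value

print(f"LR = {lr_stat:.2f}, critical value = {critical_value:.2f}, reject H0: {reject_null}")
```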

Now, for b) and c), below are the estimated marginal effects from model 1.

Marginal effects after logit
      y  = Pr(HASFLU) (predict)
         =  .57425363

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
    Shot |  -.0085755      .00197   -4.34   0.000                       0.3429
    DORM |     .06297      .00999                                       0.1269
 PRIVATE |  -.0141041      .00313   -4.51   0.000  -.020229  -.007979   0.5378
       N |   .3630078      .04862    7.47   0.000  -.458302  -.267713   100005
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

Construct the 95 % confidence interval for the marginal effect of the flu shot on having the flu. (show your work and use the space provided)

CI= [-0.0085755±1.96*0.00197]=[-0.0124,-0.0047]

Please test whether being in a dorm affects the probability of having the flu. (5 steps)

1. H0: βdorm=0

HA: βdorm≠0

2. test statistic= 0.06297/0.00999=6.303303

3. At the 95% confidence level, with n-k-1=230,800 – 5 degrees of freedom, the critical value is 1.96.

4. 6.303>1.96 so we reject the null hypothesis

5. At a 95% confidence level, there is statistical evidence to support that whether a student lives in a dorm affects whether he or she gets the flu, holding constant whether the student has the flu shot, goes to a private university, and the size of the university.
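Both calculations on the marginal effects, the confidence interval for the flu shot and the test statistic for the dorm, can be checked with a short Python sketch. The variable names are illustrative and the numbers are copied from the marginal effects table.

```python
# 95% confidence interval for the marginal effect of the flu shot.
me_shot = -0.0085755
se_shot = 0.00197
critical_value = 1.96

ci_lower = me_shot - critical_value * se_shot
ci_upper = me_shot + critical_value * se_shot

# z statistic for H0: the marginal effect of living in a dorm is zero.
me_dorm = 0.06297
se_dorm = 0.00999
z_dorm = me_dorm / se_dorm
reject_null = abs(z_dorm) > critical_value

print(f"95% CI for SHOT: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"z for DORM = {z_dorm:.3f}, reject H0: {reject_null}")
```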

Question 4. On January 25, 2018, the EPA loosened air pollution regulations. Previously, facilities that were above some air pollution threshold, such as 10 tons per year, were forced to forever reduce their emissions significantly, such as to below half of the threshold. Since the change in January, facilities that have polluted above the threshold of 10 tons per year now only need to get their emissions below the threshold.

You want to use this policy change to study how air pollution affects asthma rates in children. You have data on facility location, facility emissions, whether the facility is in violation of emissions standards, and zip-code level child asthma rates.

You decide to run a differences-in-differences analysis. What is your treatment group? What is your control group?

This study could be run at the zip code level, or at the facility level if you aggregate the asthma rate data in some way to estimate asthma rates around a facility.

A zip code or facility is treated if it is affected by the policy. Therefore, it is treated if it was previously forced to reduce emissions by a large a
