Problem Set 1
Econometrics Review
EC 421: Introduction to Econometrics

Due before noon (11:59am) on July 1st, 2020 (on canvas)

To make grading slightly easier, please include all of your R code at the end of your word doc with your written answers.

OBJECTIVE This problem set has three purposes: (1) reinforce the econometrics topics we reviewed in class; (2) build your R toolset; (3) build your intuition for newer topics like heteroskedasticity and consistency.

Problem 0: Inference

For this question I used data from a survey conducted by the department of education in 1980. A description of the data can be found here. We want to test the effect of an extra year of education on wages. For this question, we have $n = 4739$ observations and $k = 2$ parameters. This means that we have $n - k = 4737$ degrees of freedom. For a 5% significance level, this gives us a t-critical value of $t_{0.025,\,4737} = 1.96$. You can use this information throughout the rest of the problem.

We are interested in the following regression:

$wage_i = \beta_0 + \beta_1 educ_i + \varepsilon_i$

where $wage_i$ is the individual's hourly wage and $educ_i$ is the individual's years of education. This regression yields the following parameter estimates:

where standard errors for each parameter are given below in parentheses.

0a. Conduct the appropriate statistical test to determine whether or not education has a statistically significant impact on wages. Write out all steps and be clear with your conclusion.

0b. Now write out the formula for the standard error of $\hat{\beta}_1$. Is the standard error increasing or decreasing in the sample size $n$? Next, write out the formula for the test statistic you calculated in 0a. Is the test statistic increasing or decreasing in the sample size? Lastly, use your answers to this question to determine whether the probability that you reject the null hypothesis is increasing or decreasing in the sample size. Hint: You don't have to write out this probability explicitly. Just explain the intuition behind what the test statistic is telling you and how this helps you answer the question.

0c. Use the information provided combined with the regression output to construct a 95% confidence interval for the parameter $\beta_1$. Write out the steps you took to get to the lower and upper bounds. Provide a careful interpretation of what this confidence interval tells you.

0d. Now suppose we think we omitted an important variable: gender. State the two conditions this variable must meet (in the context of this example) for it to cause omitted variables bias. Would increasing the sample size (working with big data) alleviate the issues caused by omitting gender from this regression?

0e. Luckily, our data contains information on whether individuals in our data are male or female. We now include two indicators in our regression, one for male and one for female, and drop the intercept. We have the following coefficient estimates and standard errors.

You don't need to calculate the next test (I have not given you enough information to do so), but write out how you would use this model to statistically test the null hypothesis that wages for males and females are different from each other. Write out each step.

Problem 1: Bias and variance

1a. Throughout this course, we will use the OLS estimator $\hat{\beta}$ to estimate $\beta$. Explain what it means for $\hat{\beta}$ to be biased for $\beta$.

Figure 1
Note: This figure shows the distributions of three estimators (A, B, and C) that each estimate the unknown parameter . E[A] = , E[B] = , E[C] =

1b. Which of the estimators in Figure 1 (above) are unbiased? Hint: There may be more than one.

1c. Which of the estimators in Figure 1 (above) has the minimum variance?

1d. Which of the estimators in Figure 1 (above) is the best (minimum variance) unbiased estimator?

1e. Suppose we want to estimate the effect of advertising on sales. Explain what bias would mean in this context.

1f. What does the term standard error mean?

1g. What does it mean for an estimator to be more efficient than another estimator?
Of the unbiased estimators, which one is efficient?

Problem 2: Getting Started with R

Problems 2 through 6 use data from the 2018 American Community Survey, which I downloaded from IPUMS. You can find this data on canvas.

2a. Load packages. You will probably want to load the tidyverse and here packages. Maybe some others as well.

2b. Load the data. The data can be found on canvas. To accomplish this, use the read_csv() command.

2c. Check your dataset. How many observations and variables do you have? Hint: Try dim(), ncol() and nrow().

Problem 3: Getting to know your data

3a. Plot a histogram of household income (hh_income) using ggplot2.

Remember: the hh_income variable is measured in tens of thousands (meaning a value of 3 means the household's income is $30,000).

This link provides a few good examples of how to create a histogram using ggplot2.

3b. What are the mean and median levels of household income? Based upon this answer and the previous histogram, is household income (fairly) evenly distributed or is it skewed? Explain.

3c. Run a regression summarizing the relationship between household income and household size. Interpret the results of the regression (e.g., tell me what the coefficients mean and comment on their statistical significance).

3d. Explain why you chose the specification that you did in the previous question.
- Was it linear, log-linear, log-log?
- What was the outcome variable?
- What was the explanatory variable?
- Why did you make these choices?

Problem 4: Regression Refresher

4a. Regress average commute time (time_commuting) on household income (hh_income). Interpret the coefficient and comment on its statistical significance.

4b. Regress the log of average commute time on household income. Interpret the coefficient and comment on its statistical significance.

4c. Regress the log of average commute time on the log of household income. Interpret the coefficient and comment on its statistical significance.

4d.
If you had to pick one of the above specifications to show your boss at work, which one would you pick? Why? (There is no right answer to this question; I just want you to start thinking about model specification.)

Problem 5: Multiple Linear Regression

We will now add some covariates to our regression model.

5a. Regress average commute time on household income and the share of individuals in the household who are non-white ethnicities (hh_share_nonwhite). Interpret the coefficients and comment on their statistical significance. Also compare your results to 4a. Has anything changed?

5b. Regress average commute time on the indicator variable for whether a household moved in the last year (i_moved). Interpret the coefficients and comment on their statistical significance.

5c. Add the share of the household that represents a non-white ethnicity (hh_share_nonwhite) to the regression from 5b. Note: Your outcome variable is still average household commute time, but you should now have two explanatory variables. Interpret the coefficients and comment on their statistical significance.

5d. Did adding this second explanatory variable change the coefficient of the first variable at all? What does that tell you? Explain your answer.

5e. One variable that we potentially omitted from our regression is an indicator for whether the individual lives in an urban or rural area. Does this variable (which we don't have) meet the criteria for an omitted variable? Specifically state both conditions it needs to meet for us to have classic omitted variables bias. Sign the bias on hh_income that results from omitting urban/rural status.

Problem 6: Heteroskedasticity

6a. Suppose we are interested in the relationship between a household's housing costs and its time spent commuting. Plot a scatter plot using ggplot2 with housing cost (cost_housing) on the y axis and commute time (time_commuting) on the x axis. Make sure to label your axes.

This link provides an example if you need help.

6b.
Based on your plot in 6a, if we regress cost_housing on time_commuting, do you think we would have an issue with heteroskedasticity? Explain your answer.

6c. What issues can heteroskedasticity cause? (Hint: there are at least two main issues.)

6d. Time for a regression. Regress cost_housing on time_commuting and hh_income. Report your results: interpret the coefficients and comment on their statistical significance. Be careful with your language here.

Remember: the hh_income variable is measured in tens of thousands (meaning a value of 3 means the household's income is $30,000).

6e. Use the residuals from your regression in 6d to conduct a Breusch-Pagan test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Justify your answer. Note: I will post an additional video that will help you write the code for this question. There is also sample code in the slides.

6f. Now conduct a Goldfeld-Quandt test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Here are some hints:
- We are still interested in the same regression (regressing the cost of housing on commute time and household income).
- Sort the dataset on time_commuting. This can be done with the arrange() function.
- Create two groups for the GQ test by using the first 8,000 and the last 8,000 observations (after sorting on commute time). The head() and tail() functions will help here.
- When you construct the GQ stat, put the larger SSE value in the numerator.

6g. Use the lm_robust() command from the estimatr package to calculate heteroskedasticity-robust standard errors. How do these standard errors compare to the plain OLS standard errors you previously found?

Hint: lm_robust(y ~ x, data = some_df, se_type = "HC2") will calculate heteroskedasticity-robust standard errors.

6h. Why did your coefficients remain the same in 6g, even though your standard errors changed?

Problem 7: Unbiasedness and consistency

Throughout this course, we will use the OLS estimator to estimate $\beta$.
We will continue to discuss situations in which the estimator (or other estimators) is (1) unbiased or (2) consistent.

7a. What is the formal (mathematical) definition of bias?

7b. Why do we care if the OLS estimator (or any estimator) is biased?

7c. What does it mean for an estimator to be consistent?

7d. True/False: Unbiasedness is a property for finite-sized samples, while consistency refers to an estimator as sample sizes approach infinity.

7e. Which of the following two estimators would you choose? Explain your reasoning.
- Estimator A is unbiased and inconsistent.
- Estimator B is biased and consistent.

Description of variables and names

Variable            Description
fips                County FIPS code
hh_size             Household size (number of people)
hh_income           Household total income in $10,000
cost_housing        Household's total reported cost of housing
n_vehicles          Household's number of vehicles
hh_share_nonwhite   Share of household members identifying as non-white ethnicities
i_renter            Binary indicator for whether any household members are renters
i_moved             Binary indicator for whether a household member moved in prior one year
i_foodstamp         Binary indicator for whether any household member participates in foodstamps
i_smartphone        Binary indicator for whether a household member owns a smartphone
i_internet          Binary indicator for whether the household has access to the internet
time_commuting      Average time spent commuting per day by household member

In general, I've tried to stick with a naming convention. Variables that begin with i_ denote binary indicator variables (taking on the value of 0 or 1). Variables that begin with n_ are numeric variables.
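
The Goldfeld-Quandt hints in 6f describe a sequence of steps (sort, split, compare SSEs). As a rough illustration of those mechanics only, here is a minimal sketch on simulated data. Everything in it is an assumption made for the example: the data frame sim_df, the variables x and y, the group size of 300, and the use of an upper-tail F p-value are hypothetical choices, not the problem set's data or its required method.

```r
# Minimal Goldfeld-Quandt sketch on SIMULATED data (hypothetical names throughout).
library(dplyr)

set.seed(421)
n <- 1000

# Simulated data: the error's standard deviation grows with x,
# so heteroskedasticity is present by construction.
x <- runif(n, min = 1, max = 10)
sim_df <- data.frame(
  x = x,
  y = 2 + 3 * x + rnorm(n, sd = x)
)

# Step 1: sort on the variable suspected of driving the heteroskedasticity.
sorted_df <- arrange(sim_df, x)

# Step 2: build two groups from the first and last observations.
n_group <- 300
low_df  <- head(sorted_df, n_group)
high_df <- tail(sorted_df, n_group)

# Step 3: run the same regression in each group and compute each SSE.
sse_low  <- sum(resid(lm(y ~ x, data = low_df))^2)
sse_high <- sum(resid(lm(y ~ x, data = high_df))^2)

# Step 4: put the larger SSE in the numerator.
gq_stat <- max(sse_low, sse_high) / min(sse_low, sse_high)

# Step 5: compare against an F distribution with (n_group - k) degrees of
# freedom in both numerator and denominator; k = 2 here (intercept and slope).
k <- 2
p_value <- pf(gq_stat, df1 = n_group - k, df2 = n_group - k, lower.tail = FALSE)

gq_stat
p_value
```

With the problem set's regression there would be three estimated parameters (intercept, time_commuting, hh_income) and 8,000 observations per group, so the degrees of freedom would change accordingly.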