MAT 4378 MAT 5317, Analysis of categorical data, Assignment 3 1
MAT 4378 MAT 5317, Analysis of categorical data
Assignment 3
Due date: in class on Monday, November 18, 2019
Remark: You can use R for your computations for Questions 2 to 4. If you use
R please provide the output. However, the R output is not an answer to a question.
Please provide one or two sentences to properly answer the question.
1. Consider a ratio estimator h(1,2) = 1/2, where the estimated variance-covariance
2. A carefully controlled experiment was conducted to study the effect of the size of
the deposit level on the likelihood that a returnable one-liter soft drink bottle
will be returned. The data to follow show the number of bottles that were
returned ($w_i$) out of 500 sold ($n_i$) at each of six deposit levels ($x_i$,
in cents):
Deposit level $x_i$        2    5   10   20   25   30
Number sold $n_i$        500  500  500  500  500  500
Number returned $w_i$     72  103  170  296  406  449
An analyst believes that a logistic regression model is appropriate for studying
the relation between the size of the deposit and the probability a bottle will be
returned.
(a) Find the maximum likelihood estimates for $\beta_0$ and $\beta_1$. Give the estimated
regression model.
(b) Obtain a scatter plot of the sample proportions against the level of the
deposit, and superimpose the estimated logistic response onto the plot.
Does the fitted logistic response function appear to fit well?
(c) Obtain $\exp(\hat{\beta}_1)$ and interpret this number.
(d) What is the estimated probability that a bottle will be returned when the
deposit is 15 cents?
(e) Estimate the amount of deposit for which 75% of the bottles are expected
to be returned.
(f) In part (e), we have an estimate $\hat{x} = g(\hat{\beta}_0, \hat{\beta}_1)$ for the level of the deposit
that corresponds to $\pi = 75\%$ of the bottles being returned. This estimator is
a non-linear function of $\hat{\beta}_0$ and $\hat{\beta}_1$. Use the delta-method to find an asymptotic
estimated standard error for this estimate. Hint: It will be helpful to
use the function vcov on your glm object. Furthermore, to multiply the
matrices A and B with R use A %*% B.
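As a rough sketch of how the pieces of Question 2 fit together in R, the code below enters the table as a data frame, fits the grouped logistic regression, and applies the delta method with vcov and %*% as the hint suggests. The object names (bottles, fit, x.hat, grad, se.x.hat) are illustrative choices rather than part of the assignment, and the printed output still needs a written interpretation.

```r
# Bottle-return data, copied from the table in Question 2
bottles <- data.frame(
  x = c(2, 5, 10, 20, 25, 30),          # deposit level (cents)
  n = rep(500, 6),                      # bottles sold
  w = c(72, 103, 170, 296, 406, 449)    # bottles returned
)

# Grouped-binomial logistic regression: response is (successes, failures)
fit <- glm(cbind(w, n - w) ~ x, family = binomial(link = "logit"),
           data = bottles)
b <- unname(coef(fit))                  # b[1] = intercept, b[2] = slope

# Inverse prediction: deposit at which the fitted probability equals pi0
pi0 <- 0.75
x.hat <- (qlogis(pi0) - b[1]) / b[2]

# Delta method: gradient of g(b0, b1) = (logit(pi0) - b0) / b1
grad <- c(-1 / b[2], -(qlogis(pi0) - b[1]) / b[2]^2)
se.x.hat <- sqrt(drop(t(grad) %*% vcov(fit) %*% grad))

c(estimate = x.hat, se = se.x.hat)
```

The gradient is taken with respect to the intercept and the slope in that order, so it lines up with the row and column order of vcov(fit).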
3. A marketing research firm was engaged by an automobile manufacturer to conduct
a pilot study to examine the feasibility of using logistic regression for
ascertaining the likelihood that a family will purchase a new car during the
next year. A random sample of 33 suburban families was selected. Data on
annual family income (x1, in thousands of dollars) and the current age of the
oldest family automobile (x2, in years) were obtained. A followup interview
conducted 12 months later was used to determine whether the family actually
purchased a new car (y = 1) or did not purchase a new car (y = 0) during the
year. The data is found in the file CarPurchase.csv.
(a) Find the maximum likelihood estimates of $\beta_0$, $\beta_1$, and $\beta_2$. State the estimated
logistic regression model.
(b) Obtain $\exp(\hat{\beta}_1)$ and $\exp(\hat{\beta}_2)$ and interpret these numbers.
(c) What is the estimated probability that a family with annual income of $50
thousand and an oldest car of 3 years will purchase a new car next year?
4. Rather than finding the probability of success at an explanatory variable value,
it is often of interest to find the value of an explanatory variable given a desired
probability of success. This is referred to as inverse prediction. One application
of inverse prediction involves finding the amount of pesticide or herbicide needed
to have a desired kill rate when applied to pests or plants. The lethal dose level
$x_\pi$ (commonly called LDz, where $z = 100\pi$) is defined as
$x_\pi = (\mathrm{cloglog}(\pi) - \beta_0)/\beta_1$
for the complementary log-log regression model
$\mathrm{cloglog}(\pi) = \beta_0 + \beta_1 x$.
(a) Show how $x_\pi$ is derived by solving for x in the complementary log-log
regression model.
(b) We can obtain a 95% confidence interval for $x_\pi$ as follows:
Describe how this confidence interval for $x_\pi$ is derived. (Note that there is
generally no closed-form solution for the confidence interval limits, which
leads to the use of iterative numerical procedures.)
(c) Turner et al. (1992) uses logistic regression to estimate the rate at which
picloram, a herbicide, kills tall larkspur, a weed. Their data was collected
by applying four different levels of picloram to separate plots, and the
number of weeds killed out of the number of weeds within the plot was
recorded. The data are in the file picloram.csv. Complete the following:
(i) We will use a cloglog model instead of a logistic regression model. Give
the estimated complementary log-log model.
(ii) Compute $e^{\hat{\beta}_1}$ and interpret this number within the context of the problem.
(iii) Plot the observed proportion of killed weeds and the estimated model.
Describe how well the model fits the data.
Note: Here are some commands that you might find helpful. We are
assuming that the dataframe is called picloram.data and that the
fitted model is called mod.
## Plot proportions versus x
with(picloram.data, plot(x = picloram, y = kill/total,
  xlab = "Picloram", ylab = "Proportion of weeds killed",
  panel.first = grid(col = "gray", lty = "dotted")))
# Put the estimated response on the plot
curve(expr = predict(object = mod,
  newdata = data.frame(picloram = x), type = "response"),
  col = "red", add = TRUE)
(iv) Estimate the 0.9 kill rate level LD90 for picloram. Add lines to the
plot in (iii) to illustrate how it is found (the segments() function can
be useful for this purpose).
(v) We are assuming that your fitted model is the glm object mod. Use
the following commands to compute a 95% confidence interval for the
0.9 kill rate. Note: The function uniroot solves for the root of a
function over an interval.
b0 <- summary(mod)$coefficients[1,1]
b1 <- summary(mod)$coefficients[2,1]
LD.x <- (log(-log(1-0.9))-b0)/b1
root.func <- function(x, mod.obj, pi0, alpha) {
  beta.hat <- mod.obj$coefficients
  cov.mat <- vcov(mod.obj)
  var.den <- cov.mat[1,1] + x^2*cov.mat[2,2] +
    2*x*cov.mat[1,2]
  abs(beta.hat[1] + beta.hat[2]*x - log(-log(1-pi0)))/
    sqrt(var.den) - qnorm(1-alpha/2) }
lower <- uniroot(f = root.func, interval =
  c(min(picloram.data$picloram), LD.x),
  mod.obj = mod, pi0 = 0.9, alpha = 0.05)
upper <- uniroot(f = root.func, interval =
  c(LD.x, max(picloram.data$picloram)),
  mod.obj = mod, pi0 = 0.9, alpha = 0.05)
lower$root
upper$root
(vi) In part (v), we found a 95% CI for $x_{0.9}$. Explain in a few sentences
how these commands give us the lower and the upper bound of the
confidence interval.
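For reference, the quantity that root.func evaluates in part (v) can be written out as below. This is read off the R code rather than taken from a formula in the assignment, so the notation (hats on the estimates, $Z_{1-\alpha/2}$ for the normal quantile returned by qnorm) is an assumption.

```latex
\[
f(x) \;=\;
\frac{\bigl|\hat{\beta}_0 + \hat{\beta}_1 x - \log\{-\log(1-\pi)\}\bigr|}
     {\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_0)
            + x^2\,\widehat{\mathrm{Var}}(\hat{\beta}_1)
            + 2x\,\widehat{\mathrm{Cov}}(\hat{\beta}_0,\hat{\beta}_1)}}
\;-\; Z_{1-\alpha/2}
\]
```

With $\pi = 0.9$ and $\alpha = 0.05$, uniroot is called twice to look for a value of x at which $f(x) = 0$: once on the interval to the left of the point estimate LD.x and once on the interval to its right.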

MT5761: Statistical Modelling
1 Week 3 Practical
Marks allocated for each question are indicated inside square brackets.
This practical is to be submitted by Wednesday 20th November 2019 at noon.
1.1 Modelling count data using Generalised Linear Models
In this section we are going to:
Use a Poisson GLM to fit a model to the counts (per unit area)
Use model selection tools to choose covariates from those available
Interpret some parameter estimates from a fitted model
In this practical we will continue to use the same data from previous practicals, but adopt a different
modelling strategy. You will need to load the data as before and convert three of the variables in the
data set into factors:
EIA$impact <- as.factor(EIA$impact)
EIA$MonthOfYear <- as.factor(EIA$MonthOfYear)
EIA$Year <- as.factor(EIA$Year)
attach(EIA)
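The effect of converting a numeric code to a factor can be seen on a small made-up data frame. The sketch below uses a hypothetical toy data set (not the EIA data) to show that a numeric Year contributes a single slope column to the design matrix, whereas a factor Year contributes one indicator column per non-baseline level.

```r
# Hypothetical toy data, for illustration only (not the EIA data)
toy <- data.frame(Year  = c(9, 10, 11, 12, 9, 10),
                  count = c(3, 0, 2, 1, 4, 0))

# Year as a number: intercept plus a single slope
cols.numeric <- colnames(model.matrix(count ~ Year, data = toy))

# Year as a factor: intercept plus one indicator per non-baseline level
toy$Year <- as.factor(toy$Year)
cols.factor <- colnames(model.matrix(count ~ Year, data = toy))

cols.numeric
cols.factor
levels(toy$Year)   # the first level is the baseline absorbed into the intercept
```

The same counting applies to any factor covariate: k categories give k - 1 coefficients under R's default treatment contrasts.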
1.1.1 Initial Model Fitting
Fit a Poisson-based GLM using a square-root link function, 7 covariates and 2 year-based interaction
terms. Use the following code:
fit.poisSqrt <- glm(count ~ tidestate + observationhour + DayOfMonth +
                    MonthOfYear + Year + x.pos + y.pos +
                    Year:x.pos + Year:y.pos, data=EIA,
                    family=poisson(link="sqrt"))
1. If the linear predictor for the fit.poisSqrt model is:
$\eta_{it} = \beta_0 + \beta_1 x_{1it} + \dots + \beta_{26} x_{26it}$
and the coefficients are listed in the order of the output produced using the code above, which of
the following describes the fit.poisSqrt model when the tide is Slack, in Month 10 and Year
10? [1]
(a) it = 0+2x2it +3x3it +4x4it +13x13it +16x16it +19x19it +20x20it +21x21it +24x24it
(b) it = 0 + 1x1it + 2x2it + 3x3it + 4x4it + 5x5it + 6x6it + 7x7it + 8x8it + 9x9it +
10x10it + 11x11it + 12x12it + 13x13it + 16x16it + 19x19it + 20x20it + 21x21it + 24x24it
(c) it = 0 + 1x1it + 2x2it + 3x3it + 4x4it + 5x5it + 6x6it + 7x7it + 8x8it + 9x9it +
10x10it + 11x11it + 12x12it + 13x13it + 14x14it + 15x15it + 16x16it + 17x17it + 18x18it +
19x19it + 20x20it + 21x21it + 22x22it + 23x23it + 24x24it + 25x25it + 26x26it
(d) it = 0+1x1it +2x2it +4x4it +13x13it +16x16it +19x19it +20x20it +21x21it +24x24it
(e) The correct answer is not provided as an option.
2. Which of the following BEST describes the fit.poisSqrt model? [1]
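(a) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \eta_{it}^2)$
(b) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \exp(\eta_{it})\,\mathrm{area}_{it})$
(c) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \eta_{it}^2\,\mathrm{area}_{it}^2)$
(d) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \exp(\eta_{it}^2))$
(e) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \exp(\eta_{it}))$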
Fit a Poisson-based GLM using a log link, 7 covariates and 2 year-based interaction terms. Use
the following code:
fit.pois <- glm(count ~ tidestate + observationhour + DayOfMonth +
MonthOfYear + Year + x.pos + y.pos +
Year:x.pos + Year:y.pos, data=EIA, family=poisson)
3. If the linear predictor for the fit.pois model is:
$\eta_{it} = \beta_0 + \beta_1 x_{1it} + \dots + \beta_{26} x_{26it}$
and the coefficients are listed in the order of the output produced using the code above, which of
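the following describes the fit.pois model when the tide is Flood, in Month 1 and Year 12? [1]
(a) $\eta_{it} = \beta_0 + \beta_1 x_{1it} + \beta_3 x_{3it} + \beta_4 x_{4it} + \beta_{18} x_{18it} + \beta_{19} x_{19it}
+ \beta_{20} x_{20it} + \beta_{23} x_{23it} + \beta_{26} x_{26it}$
(b) $\eta_{it} = \beta_0 + \beta_1 x_{1it} + \beta_2 x_{2it} + \beta_3 x_{3it} + \beta_4 x_{4it} + \beta_5 x_{5it} + \beta_6 x_{6it}
+ \beta_7 x_{7it} + \beta_8 x_{8it} + \beta_9 x_{9it} + \beta_{10} x_{10it} + \beta_{11} x_{11it} + \beta_{12} x_{12it}
+ \beta_{13} x_{13it} + \beta_{14} x_{14it} + \beta_{15} x_{15it} + \beta_{16} x_{16it} + \beta_{17} x_{17it} + \beta_{18} x_{18it}
+ \beta_{19} x_{19it} + \beta_{20} x_{20it} + \beta_{21} x_{21it} + \beta_{22} x_{22it} + \beta_{23} x_{23it} + \beta_{24} x_{24it}
+ \beta_{25} x_{25it} + \beta_{26} x_{26it}$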
(c) The correct answer is not provided as an option.
4. Which of the following BEST describes the fit.pois model? [1]
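(a) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \eta_{it}^2)$
(b) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \exp(\eta_{it})\,\mathrm{area}_{it})$
(c) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \eta_{it}^2\,\mathrm{area}_{it}^2)$
(d) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \exp(\eta_{it}^2))$
(e) $y_{it} \sim \mathrm{Poisson}(\lambda_{it} = \exp(\eta_{it}))$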
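When comparing the square-root and log links, it can help to inspect how each family object maps a linear predictor back to the mean. The sketch below uses the linkinv component of R's family objects on a few arbitrary linear-predictor values; the names eta, sqrt.fam and log.fam are illustrative only.

```r
# A few arbitrary linear-predictor values, for illustration only
eta <- c(0.5, 1, 2)

sqrt.fam <- poisson(link = "sqrt")
log.fam  <- poisson(link = "log")

# Inverse link: maps the linear predictor to the mean of the response
mean.sqrt <- sqrt.fam$linkinv(eta)
mean.log  <- log.fam$linkinv(eta)

cbind(eta, mean.sqrt, mean.log)
```

No offset appears in either mapping here, because neither fit.poisSqrt nor fit.pois includes one.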
Refit the Poisson model with a log link and an offset term (using area as an effort term) and check
for collinearity using the VIF function.
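A minimal sketch of a Poisson fit with an effort offset is given below, using simulated data rather than the EIA data. The names toy and fit.off, and the simulation settings, are hypothetical. For the collinearity check, vif() from the car package is assumed to be the VIF function meant here; it is not called in the sketch so that the code runs with base R alone.

```r
# Simulated stand-in data: counts whose mean is proportional to cell area
set.seed(1)
toy <- data.frame(area = runif(200, 0.5, 2), x.pos = rnorm(200))
toy$count <- rpois(200, lambda = toy$area * exp(0.3 + 0.5 * toy$x.pos))

# log(area) enters the linear predictor with its coefficient fixed at 1,
# so the remaining coefficients describe counts per unit area
fit.off <- glm(count ~ x.pos + offset(log(area)),
               family = poisson, data = toy)

coef(fit.off)   # no coefficient is estimated for the offset
```

With the log link the offset is specified on the log scale, which is why area is wrapped in log().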
5. Which of the following about collinearity is FALSE? [1]
(a) While there are no concerns about collinearity in this case, the standard errors for the year
coefficients are twice the size of what they would be if x.pos was not included.
(b) The values in the GVIF^(1/(2*Df)) column are used to quantify collinearity when there is more than
one coefficient associated with one or more of the model covariates.
(c) Fitting collinear covariates together in a model can result in unstable estimates with large
standard errors.
(d) Ignoring intolerable levels of collinearity in a model can result in one or more covariates
being excluded from the model due to large p-values.
(e) One remedy for collinearity is to exclude one of the collinear covariates from the model and
refit the new model.
1.1.2 Model Selection
Compare the AIC scores for the fit.pois, fit.poisSqrt and fit.poisOff models.
6. Which of the following about these AIC results is FALSE? [1]
(a) Comparing the AIC scores for the fit.pois and fit.poisOff models is not useful to
determine which model is preferable.
(b) In this case, the offset must be specified using the log() function owing to the log link
function used.
(c) We can use details about the survey design (and survey implementation) to help us decide
if an offset term is required.
(d) If the survey effort is uneven and we fail to include this information in the model, then we
may draw false conclusions about model covariates.
(e) An equivalent alternative to including an offset term in a model is to include the effort
covariate in the model, to account for uneven survey effort.
7. Carry out automated stepwise selection on the fit.poisOff model governed by the AIC and BIC
criteria: call these new models step.poisOff and step.poisOff_BIC respectively (as before
set direction = "both"). Based on these stepwise-selection results, which of the following is
FALSE? [1]
(a) The model selection results are the same regardless of whether the AIC or BIC criterion is
used.
(b) The model chosen using the AIC score suggests that the relationship between the response
and the x-coordinate changes with year.
(c) The model chosen using the BIC suggests the cost of including the DayOfMonth covariate
outweighs the benefits of doing so.
(d) There are three coefficients allocated to the tidestate covariate because there are four
categories for tidestate.
(e) The model chosen using the AIC criterion assumes the relationship between the y-coordinate
and the response is nonlinear in nature.
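One common way to run stepwise selection under both criteria uses the k argument of step(): the default k = 2 gives the AIC penalty and k = log(n) gives the BIC penalty. The sketch below applies this to simulated data; toy, full, step.aic and step.bic are hypothetical names, and only x1 truly affects the response.

```r
# Simulated stand-in data: x2 and x3 are pure noise
set.seed(2)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$count <- rpois(n, exp(0.2 + 0.6 * toy$x1))

full <- glm(count ~ x1 + x2 + x3, family = poisson, data = toy)

# AIC-governed search (default penalty k = 2)
step.aic <- step(full, direction = "both", trace = 0)

# BIC-governed search: penalty of log(n) per parameter
step.bic <- step(full, direction = "both", k = log(n), trace = 0)

formula(step.aic)
formula(step.bic)
```

Because log(n) exceeds 2 once n is larger than about 7, the BIC search penalises extra terms more heavily and tends to return a model no larger than the AIC search.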
8. Obtain likelihood ratio test results using the Anova function for the step.poisOff_BIC model.
Based on these results, which of the following is FALSE? [1]
(a) The p-value associated with the interaction term, Year:x.pos is calculated by comparing
the test statistic of 73.8 with a reference $\chi^2$ distribution with df = 3.
(b) Each p-value is based on comparisons between the likelihood values for a model with and
without each covariate separately, while retaining all other covariates in the model.
(c) While the (default) anova function always returns the same results as the (default) Anova
function, we are using the latter here because it automatically returns p-values.
(d) The Anova results suggest that all model terms should be retained in the model, regardless
of whether the 5% or 1% level is used to decide covariate retention.
(e) The Df column represents the number of coefficients associated with each model term.
9. Carry out all-possible-subsets selection on the step.poisOff_BIC model using the dredge function
and the default ranking criteria. Based on these results, which of the following is FALSE? [1]
(a) The model ranked the highest in this case has a very similar AICc score to the second
ranking model resulting in a very similar model weight.
(b) We could use the model weights to model-average which would result in model predictions
which are a weighted average in line with the model weights.
(c) While this function suggests no terms should be dropped from the BIC-selected model, this
function returns output which tells us how other candidate models compare with the highest
ranked model in this case.
(d) For small sample sizes, the AICc might suggest a different model is preferred compared with
results obtained using the AIC.
(e) This function investigates the fit of all possible models while the stepwise selection function
does not necessarily consider all candidate models.
10. Based on the step.poisOff_BIC model results, which of the following is FALSE? [1]
(a) There is no significant difference (at the 1% level) between average numbers (per unit area)
in an EBB or FLOOD tide state.
(b) There is no significant difference (at the 5% level) between average numbers (per unit area)
in month 1 and months 8, 10, 11 or 12. Average numbers (per unit area) in all the other
months are significantly different to average values in month 1.
(c) While average numbers (per unit area) are significantly fewer in years 11 and 12 compared
with year 9, there is no evidence for a difference in average numbers in years 9 and 10.
(d) The relationship between the x-coordinate and average numbers (per unit area) is significantly
steeper in years 10 and 12 compared with the x-coordinate relationship in year 9.
(e) The relationship between the y-coordinate and average numbers (per unit area) is significantly
steeper (at 5% level) in year 10 compared with the y-coordinate relationship in
year 9, but significantly shallower (at 5% level) in year 12 compared with the y-coordinate
relationship in year 9.
11. Based on the step.poisOff_BIC model, what is the predicted value on the scale of the response,
when tidestate=EBB, observationhour=10, month=1, year=11, x.pos=-2061, y.pos=-1158 and
the area of the cell is the mean of the area of all cells in the EIA data set? Report your answer
to 3 decimal places. [1]
12. Based on the step.poisOff_BIC model, what is the ratio of the predicted numbers in month 5
compared to predicted numbers in month 1? Report your answer to 3 decimal places. [1]
1.2 Modelling count data using overdispersed models
Objectives:
In this section we are going to:
Check for overdispersion in our data (given our model)
Select the best model and make some (updated) predictions
Assess the fit of our final model
1.2.1 Overdispersion
13. Check for overdispersion in the step.poisOff_BIC model using family=quasipoisson. Call this
new model step.poisOffOD. Based on these results, what is the estimated dispersion parameter?
Report your result to one decimal place. [1]
14. Compare the p-values from the overdispersed model with the p-values obtained under a strictly
Poisson model (when the dispersion parameter=1). Which of the following is FALSE? [1]
(a) The strictly Poisson model and the overdispersed model return identical model coefficients,
only the standard errors about these parameter estimates differ.
(b) In all cases, the standard errors are larger under the overdispersed model since the dispersion
parameter is estimated to be larger than 1.
(c) A likelihood ratio test can be performed using the overdispersed model in the same way it
is performed using the strictly Poisson model.
(d) Larger standard errors result in larger p-values and thus ignoring overdispersion when it is
present can result in model covariates being retained when there is no genuine relationship
with the response.
(e) Based on the p-values in the overdispersed model, both interaction terms would be dropped
from the step.poisOff_BIC model (at the 5% significance level).
15. Based on the overdispersed results, is the following statement TRUE or FALSE? [1]
While there is compelling evidence that average numbers (per unit area) change with the x-coordinate
and y-coordinate, there is no evidence that either of these relationships changes with
year.
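The mechanics of refitting a Poisson model as quasi-Poisson can be sketched on simulated overdispersed counts. The data below are generated from a negative binomial distribution purely for illustration, and the names toy, fit.p, fit.q, phi and se.ratio are hypothetical.

```r
# Simulated overdispersed counts (negative binomial), for illustration only
set.seed(3)
n <- 400
toy <- data.frame(x = rnorm(n))
toy$count <- rnbinom(n, mu = exp(1 + 0.4 * toy$x), size = 1)

fit.p <- glm(count ~ x, family = poisson, data = toy)
fit.q <- update(fit.p, family = quasipoisson)

# Estimated dispersion parameter (fixed at 1 under the strict Poisson model)
phi <- summary(fit.q)$dispersion

# Ratio of quasi-Poisson to Poisson standard errors
se.ratio <- summary(fit.q)$coefficients[, "Std. Error"] /
            summary(fit.p)$coefficients[, "Std. Error"]

phi
all.equal(coef(fit.p), coef(fit.q))
se.ratio
```

In this sketch the two fits share their coefficient estimates, and every standard error is inflated by the same factor, the square root of phi.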
1.3 Model Diagnostics
In this section we are going to:
Assess any residual patterns using the residualPlots function in the car library
Check for any correlation present in the residuals.
Assess what we have learned from this modelling process with reference to our research
questions.
Based on the analysis from the previous section update your model dropping insignificant covariates
and/or interaction terms.
16. Using the residualPlots function in the car library in R, which of the following statements
about linearity is TRUE? [1]
(a) The relationship of each covariate with the response is linear in a Poisson-based model.
(b) The residualPlots function output tells us that we should remove tidestate from the
model.
(c) The last plot is most appropriate for assessing whether the mean-variance relationship is
modelled appropriately.
(d) observationhour has evidence of non-linearity on the link scale.
17. Make a plot of observed vs fitted. Which of the following statements is FALSE? [1]
(a) Observed counts greater than 5 are severely under predicted.
(b) A good fitting model should show scatter about the 45° line.
(c) There are negative fitted values.
(d) The range of observed counts is larger than the range of predicted counts.
18. Plot fitted values vs scaled residuals to assess the mean-variance relationship. Which of the
following statements is FALSE? [1]
(a) There is a known linear mean-variance relationship for a Poisson-based model.
(b) There should be pattern in a plot of fitted vs scaled residuals for the mean-variance relationship
of an overdispersed Poisson-based model to hold.
(c) It is difficult to tell from this plot whether the mean-variance relationship is appropriately
modelled. Binning the data may help here.
(d) The variance of the final model increases at a rate approximately fifteen times faster than
the mean.
19. Use acf plots to determine the nature of any scaled residual correlation present. Which of the
following statements is FALSE? [1]
(a) The first data point is correlated with the 41st point
(b) There is correlation through time within gridcodes
(c) The acf plot indicates we should remove every 41st data point to deal with any correlation
present.
(d) If we re-order the data by gridcode, we may see a different pattern in the acf plot.
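A sketch of extracting residuals and computing their autocorrelation function follows, on simulated independent data. It assumes that "scaled residuals" can be taken to mean Pearson residuals, which the practical does not state explicitly; toy, fit, res.p and res.acf are hypothetical names.

```r
# Simulated independent counts, for illustration only
set.seed(4)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$count <- rpois(n, exp(0.5 + 0.3 * toy$x))

fit <- glm(count ~ x, family = quasipoisson, data = toy)

# Pearson residuals: (observed - fitted) / sqrt(variance function at fitted)
res.p <- residuals(fit, type = "pearson")

# Autocorrelation in the current row order of the data, lags 0 to 45
res.acf <- acf(res.p, lag.max = 45, plot = FALSE)

# Fitted values against residuals, then the acf plot itself
plot(fitted(fit), res.p, xlab = "Fitted values", ylab = "Pearson residuals")
plot(res.acf)
```

The acf is computed in whatever order the rows appear, so sorting the data frame differently (for example by a grid-cell identifier, then by time) before extracting residuals changes which pairs of observations each lag compares.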
20. Which of the following about summarising our model is FALSE? [1]
(a) The final model identified that there was a decline in animal density during the study period.
(b) There was no spatially explicit decline identified
(c) There was indication that some covariates are inappropriately modelled (e.g. should be tried
as non-linear terms.)
(d) If there is non-independence in the model residuals, the current p-values are likely to be
too small, which may lead to one or more covariates being kept in the model that should not
be.
(e) We cannot use a GLS model with Poisson errors so therefore this is the best model we can
achieve.