MT5764: Advanced Data Analysis
Major Course Assignment
10 April 2020

Housekeeping

This project is about modelling Covid-19 data. I hope you will find this an opportunity to demonstrate the skills that you have developed in this module while simultaneously aiding in the effort to answer questions of global interest. However, if you find this a distressing topic to work on, please contact me directly before Tuesday, 14th April 2020 to arrange for a different dataset to complete this assignment.

This major course assignment replaces your exam and thus comprises 60% of the overall module mark.

This is an individual project. The submitted coursework should reflect the work of you as an individual. Suspected cases of copying will be taken very seriously, so please adhere to the University's guidelines on good academic practice. If you have any uncertainties or questions about this, please contact me.

I recommend you attempt every part of the assignment; even if you do not finish everything, marks are likely to be awarded for incomplete tasks/code. Remember I cannot allocate marks to a blank sheet of paper, so help me to help you.

Submission

Write a succinct report that includes a clear and detailed description of how you have answered each task in the assignment, justifying each decision taken along the way and referring to the corresponding code. Only include model output summaries and well-labelled plots that you describe and refer to in the write-up. You will be penalised for including superfluous outputs and/or code that you do not discuss in the report or that does not attempt to answer the task. Please include just your student ID at the start of your report. Do not include your name anywhere in your report.

Please comment and annotate your final code, and name functions and variables sensibly. Make sure you only submit the code that was used to answer the specific tasks. Your code needs to be succinct and comprehensible.
Marks will be deducted if I cannot follow what you have done.

You can use R and/or SAS to answer each task in this assignment. For example, if you find the modelling more convenient in SAS but prefer R's graphical tools (or vice versa), then feel free to use a mix, making it clear in your report. Save your final, well-annotated code as single scripts named using your student ID, e.g. ID_12345.R (replace 12345 with your student ID). Please include just your student ID at the top of your script. Do not include your name anywhere in your scripts.

You are free to write your report using whatever software you are comfortable with, for example R Markdown, Jupyter, LaTeX or Word. However, you must convert your final report into a single PDF, saved using your student number (e.g. ID_12345.pdf).

Compress your report (e.g. ID_12345.pdf) and code (e.g. ID_12345.R and/or ID_12345.sas) into a single zip file (.zip), saved using your student ID (e.g. ID_12345.zip), and upload it to Moodle. To be clear, you are required to upload to Moodle a single zip file (e.g. ID_12345.zip), which contains a single PDF of your final report and one or two final scripts (R/SAS) [1].

Deadline is Friday, 15th May 2020, 23:59 (UK time). Please do not leave it to the last minute to upload your work.

The School has a lateness policy. The standard policy is an initial penalty of 15% of the maximum available mark, then a further 5% per 8-hour period, or part thereof, for work submitted late without good reason.

[1] If you are writing your report using interactive notebooks, such as R Markdown or Jupyter, then you do not need to re-upload your code, as long as it is all included and commented within your notebook.

Assignment

Data

The assignment involves in-depth statistical analysis of the following two Covid-19 datasets that you will need to download from Moodle.
The main data source is Johns Hopkins University's repository (which in turn pooled data from various other sources), coupled with some country-level statistics.

1. CovidCases.csv – The number of Covid-19 cases as of the day this assignment was set and a few country-level metrics.

  Country             Deaths Confirmed PopDensity MedianAge UrbanPop Bed  Lung HealthExp        GDP
1 Afghanistan             14       444         60        18       25 0.5 37.62       184   481.2432
2 Albania                 22       400        105        36       63 2.9 11.67       774  5357.5704
3 Algeria                205      1572         18        29       73 1.9  8.77      1031  3940.1799
4 Antigua and Barbuda      2        19        223        34       26 3.8 11.76      1105 17236.9778
5 Argentina               63      1715         17        32       93 5.0 29.27      1390  9856.4304
6 Armenia                  9       881        104        35       63 4.2 23.86       883  4536.9212

Country: Data from different states/regions/counties were pooled under a single Country name.
Deaths: The number of deaths due to Covid-19 (persons).
Confirmed: The number of confirmed Covid-19 cases (persons).
PopDensity: Population density (persons/km²).
MedianAge: Median age of the population (years).
UrbanPop: The percentage of the population that live in urban areas (%).
Bed: Hospital beds per 1,000 people (beds/1,000 persons).
Lung: Death rate per 100,000 people due to lung disease (deaths/100,000 persons).
HealthExp: Total health expenditure per capita in US dollars ($/person).
GDP: The nominal gross domestic product per capita (a measure of a country's economy) in US dollars ($/person).

2. CovidConfirmedTime.csv – The number of confirmed cases of Covid-19 over time, for the 30 worst affected countries, excluding China [2].

  Country   Day Confirmed
1 Australia   0       107
2 Australia   1       128
3 Australia   2       128
4 Australia   3       200
5 Australia   4       250
6 Australia   5       297

Country: Data from different states/regions/counties were pooled under a single Country name.
Day: Days since the 100th confirmed case.
Confirmed: The number of confirmed Covid-19 cases up to and including that day.

[2] Unfortunately, data for China from the early part of the outbreak are not available from the Johns Hopkins University repository.

Tasks for CovidCases.csv

1. Fit a generalised linear model (GLM) or quasi-likelihood model (whichever you deem the most pertinent in this case), using an appropriate error structure, to model the number of deaths per confirmed case (known as the case fatality rate). Use all the predictors available in the dataset (i.e. fit a full model). Justify your decisions along the way. Show and interpret your final model output; in particular, comment on the effect size of each predictor. [4 marks]

2. Refit the model identified in task 1, but now only consider countries that have recorded 10 or more deaths. Show and interpret your final model output relative to the model in task 1. Do you think it is more sensible to fit a model to this subset of the data if we were interested in performing inference on the factors associated with the case fatality rate due to Covid-19? Justify your answer. [3 marks]

3. Assess the assumptions of the model fitted in task 2 using appropriate model diagnostic tests and plots. Provide a clear explanation and interpretation for each test and plot used. [4 marks]

4. Starting with the full model identified in task 2, perform an all-possible-subsets model selection using an appropriate information criterion (justify your choice). Show the top 5 models and interpret the results. [3 marks]

5. Use data from countries that have recorded 10 or more deaths to fit a LASSO model. Consider all the available predictors and use 10-fold cross-validation to estimate the regularisation parameter λ. Plot how the regression coefficients (label them) and residual deviance change as a function of log(λ). On the plot, clearly highlight the value of λ that minimises the cross-validation (CV) error (quantified by the residual deviance).
Show and interpret both plots and the final fitted model (taken to be the one that minimises the CV error). [5 marks]

6. Use data from countries that have recorded 10 or more deaths to fit a penalised regression spline. Include a smooth term for each predictor. Set the value for k (the dimension of the basis used to represent the smooth term) to be the same for all covariates. Compare the partial residual plots (show the residuals and confidence bands) for each predictor for a model with k=5 and k=10. Show and discuss the fitted models. [5 marks]

Tasks for CovidConfirmedTime.csv

7. Explore the dataset using any appropriate plots. [2 marks]

8. Use generalised estimating equations (GEEs) to fit a generalised linear model with the number of confirmed Covid-19 cases as outcome and day as a single explanatory variable. Use an appropriate error structure and a within-group correlation matrix to accommodate observations from the same country (justify your choice). Show and interpret the fitted model. [5 marks]

9. Plot the trajectory for the number of confirmed Covid-19 cases over time for the average country and compare that to the observed trajectory for the UK and Germany. Are these countries recording cases at a faster, slower or similar pace to the average country? Use CovidCases.csv to comment on whether the case fatality rates for these two countries are associated with the rate at which they are acquiring new cases. Justify your answers. [3 marks]

10. Assess the GEE model fitted in task 8 using appropriate model diagnostic tests and plots. Provide a clear explanation and interpretation for each plot. [4 marks]

11. Fit a mixed model with the number of confirmed Covid-19 cases as outcome and day as a single explanatory variable, but allowing for each country to have its own intercept and slope. Use an appropriate error structure and a within-group correlation matrix to accommodate observations from the same country (justify your choice). Display and interpret the model output.
Hint: If you run into convergence issues you might want to model log(Confirmed) instead of Confirmed. [5 marks]

12. Extract and plot, as a histogram, the estimated slopes for each country. Pick the top three countries whose slope differs the most from the average country. For these three countries, compare (graphically) the fitted models to what was observed, and to the model fit for the average country. Comment on the results. [3 marks]

13. Refit the model in task 11, but this time assume a common intercept (i.e. only allow for random slopes). Repeat task 12 and compare and comment on the two sets of results. Hint: If you run into convergence issues you might want to model log(Confirmed) instead of Confirmed. [3 marks]

14. Comment on the validity of the models fitted in this section (using the CovidConfirmedTime.csv dataset) once countries have passed the peak of the outbreak. [1 mark]
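
For orientation, the two quantities that recur in the CovidCases.csv tasks (the case fatality rate from task 1, and the subset of countries with 10 or more deaths used from task 2 onwards) can be made concrete with the six example rows printed in the Data section. The following is only a minimal sketch in Python, not R or SAS as required for the submission; the names `rows`, `case_fatality_rate`, `cfr` and `at_least_10` are illustrative assumptions, and expressing the rate as a simple proportion is one reading of "deaths per confirmed case", not a modelling recommendation.

```python
# Six example rows from the Data section: (Country, Deaths, Confirmed).
# Only these printed rows are used; the full CSV is not assumed here.
rows = [
    ("Afghanistan", 14, 444),
    ("Albania", 22, 400),
    ("Algeria", 205, 1572),
    ("Antigua and Barbuda", 2, 19),
    ("Argentina", 63, 1715),
    ("Armenia", 9, 881),
]


def case_fatality_rate(deaths, confirmed):
    """Deaths per confirmed case, as a proportion between 0 and 1."""
    return deaths / confirmed


# Case fatality rate for each example country.
cfr = {country: case_fatality_rate(d, c) for country, d, c in rows}

# Subset referred to from task 2 onwards: countries with 10 or more deaths.
at_least_10 = [country for country, d, _ in rows if d >= 10]

for country, rate in cfr.items():
    print(f"{country}: {100 * rate:.1f}%")
print("10 or more deaths:", at_least_10)
```

Note that two of the six example countries fall below the 10-death threshold, and that with so few deaths their observed rates rest on very small counts, which is the kind of consideration task 2 asks you to discuss.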