Final Project
Stat 428

I. Simulation Problem (50 points)

In the lecture, we discussed nearest neighbor tests and the energy distance test for the two-sample testing problem. We consider another two tests: the two-sample Hotelling's T-square test and the graph-based two-sample test. Suppose we observe data $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$, where $X_i, Y_j \in \mathbb{R}^d$ are multivariate random vectors. Here, $X_1, \ldots, X_n$ are drawn from distribution $F$ and $Y_1, \ldots, Y_m$ are drawn from distribution $G$. The hypothesis of interest in the two-sample testing problem is

$H_0 : F = G$ and $H_1 : F \neq G$.

The graph-based two-sample test is defined in the following way. We pool all data together,

$\{Z_1, \ldots, Z_{n+m}\} = \{X_1, \ldots, X_n\} \cup \{Y_1, \ldots, Y_m\}$.

Based on these $n + m$ observations, we construct a graph $G = (V, E)$ such that the set of vertices is $V = \{1, \ldots, n + m\}$ and there is an edge between $i$ and $j$ if $\|Z_i - Z_j\| \le Q$, where $Q$ is a positive number. Let $E$ be the collection of edges. The graph-based two-sample test statistic is defined as [formula missing in the source], where $|E|$ is the number of edges in the edge set $E$. Here, $I_e = 1$ if the two vertices connected by $e$ have the same label and $I_e = 0$ otherwise.

Question 1 Report

A pharmaceutical company would like to test whether the effects of two treatments are similar or not. The manager wants to choose one two-sample testing method from among nearest neighbor tests, the energy distance test, Hotelling's T-square test and the graph-based two-sample test, and asks for your advice on the choice. First, could you help the manager implement these four methods from scratch: nearest neighbor tests, energy distance test, Hotelling's T-square test and graph-based two-sample test? Second, could you prepare a report to provide some suggestions for the manager?
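The graph construction described above can be sketched in R. This is only a sketch under two assumptions that the text does not confirm: an edge joins i and j when the Euclidean distance between Z_i and Z_j is at most Q, and the statistic is the fraction of edges whose two endpoints carry the same sample label. The function name graph_stat is hypothetical.

```r
# Sketch of a graph-based two-sample statistic.
# ASSUMPTIONS (not stated in the source): Euclidean distance, an edge when
# the distance is <= Q, and statistic = fraction of same-label edges.
graph_stat <- function(x, y, Q) {
  z <- rbind(x, y)                                  # pooled sample Z_1, ..., Z_{n+m}
  label <- c(rep(1L, nrow(x)), rep(2L, nrow(y)))    # 1 = from X, 2 = from Y
  d <- as.matrix(dist(z))                           # pairwise Euclidean distances
  # each unordered pair i < j within distance Q is one edge
  edges <- which(d <= Q & upper.tri(d), arr.ind = TRUE)
  if (nrow(edges) == 0) return(NA_real_)            # no edges: statistic undefined
  # I_e = 1 when both endpoints of edge e have the same label
  mean(label[edges[, 1]] == label[edges[, 2]])
}

# Toy usage: two well-separated samples in dimension d = 2
x <- matrix(c(0, 0, 0, 1), ncol = 2, byrow = TRUE)
y <- matrix(c(10, 10, 10, 11), ncol = 2, byrow = TRUE)
graph_stat(x, y, Q = 2)
```

With a small Q and well-separated samples, every edge joins two points from the same sample, so the value is at its maximum of 1. Recomputing the statistic after randomly permuting the labels is one possible way to obtain a reference distribution under the null hypothesis.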
In this report, you need to address at least four of the following points:

- Several different parts can be customized in these tests, e.g., the threshold Q in the graph-based test, the number of neighbors in the nearest neighbor test, and the specific form of distance in the energy distance test and the graph-based test. Could you provide some suggestions on the choice of these customized parts? You need to show some numerical experiments as your evidence.
- Are these tests sensitive to the dimension of the data, d?
- Are these tests sensitive to the specific distribution of F or G?
- Which test has larger power, and under what conditions?
- Clearly, the power of the test relies on the sample sizes n and m and on how different F and G are under the alternative hypothesis. Could you prepare a plot to show the effect of sample size on power? Could you prepare another plot to show the effect of the difference between F and G on power?
- Are these methods able to control Type I error?

You need to submit both the Rmd and pdf files of your report.

Question 2 Presentation and Slides

Based on your report, could you prepare a 3-5 minute presentation to summarize your findings and suggestions? Assume your audience is the manager from this pharmaceutical company, who has only a very limited statistics background. In this question, you need to submit a video (I need to see you in this video) and your slides (both Rmd and pdf).

Question 3 R package (Bonus question: extra 10 points for the final project)

Could you prepare an R package that includes all four of your two-sample testing methods and a manual that introduces how these methods can be used? To finish this question, you need to submit a compressed R package.

II. Real Data Problem (50 points)

The data for this project describe payments for child support made to a government agency. A case refers to a legal judgment that an absent parent (abbreviated in variable names as AP) must make child support payments. The data are distributed in four CSV files, which can be downloaded from Compass2g.
The data are distributed as is, as obtained from the agency (albeit anonymized). Most categorical variables are self-explanatory.

The file cases.csv has six columns and one row for each case:

  CASE_NUM          Unique case identifier
  CASE_STATUS       ACV (active), IN_ (inactive), IC_ (closed), IO_ (legal), IS_ (suspend)
  CASE_SUBTYPE      AO (arrears), EF (foster), MA (medical), NO (arrears), RA (regular), RN (regular)
  CASE_TYPE         AF (AFDC), NA (non-afdc), NI (other)
  AP_ID             Identifying number for absent parent
  LAST_PYMNT_DT     Recorded date of last payment

The file parents.csv has 10 columns and one row for each parent:

  AP_ID             Unique identifier for parent
  AP_ADDR_ZIP       Coded na for missing, 0 for known unknown, 1 for city, 2 south state, 3 north state, 4 other
  AP_DECEASED_IND   AP is deceased
  AP_CUR_INCAR_IND  AP is incarcerated
  AP_APPROX_AGE     Self-explanatory
  MARITAL_STS_CD    Self-explanatory
  SEX_CD            Categorical
  RACE_CD           Categorical
  PRIM_LANG_CD      Language code
  CITIZENSHIP_CD    Citizenship code

The file children.csv has 9 columns:

  CASE_NUM          Case number
  ID                Unique identifier for child
  SEX_CD
  RACE_CD
  MARITAL_STS_CD    Marital status code
  PRIM_LANG_CD      Primary language
  CITIZENSHIP_CD
  DATE_OF_BIRTH_DT
  DRUG_OFFNDR_IND   Past drug offence

The file payments.csv has only six columns, but more than 1.5 million records:

  CASE_NUM          Case number for the payment
  PYMNT_AMT         Dollar amount of payment
  COLLECTION_DT     Date of payment
  PYMNT_SRC         A (regular), C (worker comp), F (tax offset), I (interstate), S (st tax), W (garnish)
  PYMNT_TYPE        A (cash), B (bank), C (check), D (credit card), E (elec trans), M (money order)
  AP_ID             Absent parent ID

Question 1 File linkage integrity

(a) Read the four CSV files into R, building four data frames with the names Cases, Parents, Children and Payments. Show the dimensions of these data frames. (You may find it useful to save these data frames as Rdata objects in a file using the save command. You can then recover them with the load command more quickly than by reading the CSV files.)

(b) What is the distribution of the number of children attached to a case?
Show an appropriate plot of the distribution, and mark the location of the average number in the plot.

(c) The file children.csv may have more than one record for each child. What is the largest number of cases associated with a child? Indicate why you believe that this is indeed the same child.

(d) Does every absent parent (AP_ID) identified in the payments data have an identifying record in the parents data file?

Question 2 Recoding categories

Some categorical variables among these data frames are sparse (seldom observed). For example, the variable PYMNT_SRC in Payments has category M with 2 cases and category R with 7. These are too few for modeling in regression.

Write a function named pool_categories that recodes a categorical variable into a simpler factor with fewer categories by pooling categories with counts below a threshold into a category labeled Other (a factor level which your function should check does not already exist!). You might find the R function %in% useful for this exercise.

Question 3 Payment counts and amounts

You must use ggplot2 for generating the plots asked for in this question.

(a) Make a variable Payments$DATE which is a viable R date by converting the COLLECTION_DT variable. Use this variable to find (i) the range of dates of all payments and (ii) the percentage of the total number of payments made before May 1, 2015.

(b) Show a sequence plot of the total number of payments made on each day from May 1, 2015 through the end of the data.

(c) What explains the bimodal shape of the marginal distribution of the number of payments over this time period? Explain with some evidence how you reached your opinion.

(d) Describe the distribution of the payment amounts. Do you have an explanation for its shape? (You might find it useful to work with a sample for plotting.
R takes a while to draw 1.5 million points.)

Question 4 Most common parent

(a) Identify the parent with the most cases.

(b) Identify all of the different children associated with the cases of the parent identified in (a).

(c) What is the average age of these children, in years? Use their age as of Jan 1, 2017. (Fractions of a year are fine.)

(d) Show a plot of the payment history for this parent.

Question 5 Payments for cases

The unit of analysis for this question is the payment behavior of an absent parent. Hence, if the parent is involved in several cases, you will need to accumulate the relevant information. You may find it useful for this and the next question to build a data frame for parents that collects the relevant information for each parent. You may find dplyr useful here and elsewhere, but you don't have to use it.

(a) It has been conjectured that parents deemed responsible for more children are more likely to make either a larger number of payments or a larger total payment amount over this period. Is that true?

(b) It has been conjectured that parents responsible for younger children are more likely to make more payments. Is the average age of the children of an absent parent associated with the total amount of payments made by the absent parent? (Define a child's age as the age on Jan 1, 2017.)

(c) Does the location of the parent (AP_ADDR_ZIP) anticipate the total amount of payments made by the absent parent?

(d) Does the combination of attributes of the parent with the number and average age of the children involved predict the total amount of payments made by a parent? Explain your results briefly. (Note: It makes no sense to remove cases with missing values of a categorical variable. Missingness just defines another category of the variable.)

Question 6 Consistency

Again, the unit of analysis for this question is an absent parent. An important aspect of payments is the consistency of the payments over time.
A steady income stream is, for many, preferable to a highly volatile, unpredictable payment schedule, even if the latter has a higher average.

(a) Among all parents who made payments, is there any association between the SD of total daily payments and the average of total daily payments?

(b) The coefficient of variation (CV) is the ratio of the SD of daily payments to the mean. Show time sequence plots of the payments of 3 parents, with low, medium and high CV. That is, find three representative parents who make payments. One of these three should have a high CV, another a medium CV, and a third a low CV.

(c) Is the CV of payments associated with the total amount of payments over this time period?

(d) Do any attributes of the parent as revealed in these data anticipate that the parent will make consistent payments, that is, have a small CV?
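The CV defined in Question 6(b) can be sketched in base R as follows. This is a minimal sketch, assuming a data frame with the columns AP_ID and PYMNT_AMT from payments.csv plus a Date column DATE as built in Question 3(a). It also assumes that only days with at least one payment enter the mean and SD, which the text leaves open. The function name payment_cv is hypothetical.

```r
# Sketch: coefficient of variation (CV) of total daily payments per parent.
# ASSUMPTIONS: `payments` has columns AP_ID, PYMNT_AMT and a Date column DATE;
# days on which a parent paid nothing are NOT counted as zero-payment days.
payment_cv <- function(payments) {
  # total amount paid by each parent on each day
  daily <- aggregate(PYMNT_AMT ~ AP_ID + DATE, data = payments, FUN = sum)
  # mean and SD of the daily totals, one value per parent (same ordering)
  avg <- tapply(daily$PYMNT_AMT, daily$AP_ID, mean)
  sdv <- tapply(daily$PYMNT_AMT, daily$AP_ID, sd)
  data.frame(AP_ID = names(avg),
             mean  = as.numeric(avg),
             sd    = as.numeric(sdv),      # NA for parents with one payment day
             cv    = as.numeric(sdv / avg))
}

# Toy usage with made-up values (not taken from the real data)
toy <- data.frame(
  AP_ID     = c(1, 1, 1, 2, 2),
  PYMNT_AMT = c(10, 10, 40, 5, 5),
  DATE      = as.Date(c("2015-05-01", "2015-05-01", "2015-05-02",
                        "2015-05-01", "2015-05-02"))
)
payment_cv(toy)
```

Sorting the result by cv is one way to pick representative parents with low, medium and high CV. Parents with a single payment day have an undefined SD and would need to be handled separately.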