
Statistical Methods for Data Science
DATA7202
Semester 1, 2021
Assignment 1 (Weight: 25%)

(Assignment 1 is due on 22 Mar 2021, 17:00.)

Please answer the questions below. For theoretical questions, you should present rigorous proofs and appropriate explanations. Your report should be visually appealing, and all questions should be answered in the order of their appearance. For programming questions, you should present your analysis of data using Python, Matlab, or R, as a short report, clearly answering the objectives and justifying the modeling (and hence statistical analysis) choices you make, as well as discussing your conclusions. Do not include excessive amounts of output in your reports. All the code should be copied into the appendix, and the sources should be packaged separately and submitted on Blackboard in a zipped folder with the name:

student_last_name.student_first_name.student_id.zip

For example, suppose that the student name is John Smith and the student ID is 123456789. Then, the zipped file name will be John.Smith.123456789.zip.

1. [15 Marks] Repeat the advertisement exercise with the following changes.

(a) The data is generated via the following data generation mechanism: $X_i \sim U(0, 1)$ for $i \in \{1, 2, 3\}$; here $U(0, 1)$ stands for the continuous uniform distribution over the interval $[0, 1]$. However, we require that $X_1 + X_2 + X_3 = 1$, that is, the explanatory variables stand for percentages of the budget.

(b) In addition, the model for $Y$ is as follows:

$$Y = 0.5X_1 + 3X_2 + 5X_3 + 5X_2X_3 + 2X_1X_2X_3 + W, \tag{1}$$

where $W \sim U(0, 1)$.

Similar to the original example, generate train and test sets of size N = 1000. Fit the linear regression and the random forest models to the data. For the linear regression, make an inference about the coefficients; specifically, comment on the contributions of different advertisement types to sales.
Use the linear model and the random forest (RF, with 500 trees) to make predictions on the test set, and report the corresponding mean squared errors. When constructing the datasets, please use seeds 1 and 2 for the train and the test sets, respectively.

2. [10 Marks] Consider the following variant of the cross-validation procedure.

(i) Using the available data, find a subset of good predictors that show correlation with the response variable.
(ii) Using these predictors, construct a model (for regression or classification).
(iii) Use cross-validation to estimate the model prediction error.

Is this a good method? Do you expect to obtain the true prediction error? Explain your answer.

3. [5 Marks] Suppose that we observe $X_1, \ldots, X_n \sim F$. We model $F$ as a normal distribution with mean $\mu$ and standard deviation $\sigma$. For this problem, determine the hypothesis class

$$\mathcal{H} = \{f(x, \theta);\ \theta \in \Theta\},$$

and state explicitly what $\theta$ and $\Theta$ are.

4. [15 Marks] Let $\mathcal{H}$ be a class of binary classifiers over a set $\mathcal{Z}$. Let $\mathcal{D}$ be an unknown distribution over $\mathcal{X}$, and let $g$ be a target hypothesis in $\mathcal{H}$. Show that the expected value of $\mathrm{Loss}_T(g)$ over the choice of $T$ equals $\mathrm{Loss}_{\mathcal{D}}(g)$, namely,

$$\mathbb{E}_T\, \mathrm{Loss}_T(g) = \mathrm{Loss}_{\mathcal{D}}(g).$$

5. [15 Marks (see details below)] Consider the following dataset.

Now, suppose that we would like to consider two models:

Model 1: $y = \beta_1 x_1 + \varepsilon$,
and
Model 2: $y = \beta_0 + \beta_1 x_1 + \varepsilon$,

where $\varepsilon \sim N(0, 1)$. That is, we consider two linear models, with and without the intercept.

(a) [5 Marks] Fit these models to the data and write the corresponding coefficients. Namely, fill the following table:

| Model   | $\beta_0$ | $\beta_1$ |
| Model 1 | 0         |           |
| Model 2 |           |           |

(b) [5 Marks] Consider the squared error loss, the absolute error loss, and the $L_{1.5}$ loss. Find the average loss for each model. Namely, fill the following table:

| Model   | squared error loss | absolute error loss | $L_{1.5}$ loss |
| Model 1 |                    |                     |                |
| Model 2 |                    |                     |                |

(c) [5 Marks] Draw a conclusion from the obtained results.

6. [30 Marks (see details below)] Consider the Hitters data-set (given in Hitters.csv).
Our objective is to predict a hitter's salary via linear models.

(a) [5 Marks] Load the data-set and replace all categorical values with numbers. (You can use the LabelEncoder object in Python.)

(b) [5 Marks] Generally, it is better to use OneHotEncoder when dealing with categorical variables. Justify the usage of LabelEncoder in (a).

(c) [20 Marks] Fit linear regression and report the 10-fold cross-validation mean squared error.

7. [10 Marks] Consider a function

(2)

Suppose that a = 1, b = 2, and c = 3, and write a Crude Monte Carlo algorithm for the estimation of $\ell$ using a sample size of N = 10000. Deliver the 95% confidence interval. Compare the obtained estimate with the true value $\ell$ as given in (2).
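For Question 1, the data-generation mechanism might be sketched in Python as below. This is only a sketch under stated assumptions: the handout does not say how the constraint $X_1 + X_2 + X_3 = 1$ is to be enforced, so the code assumes each row of independent $U(0, 1)$ draws is rescaled by its sum. The function name `generate_data` and the use of the standard-library `random` module (rather than NumPy) are illustrative choices as well, so the seeded draws will differ from those of other generators.

```python
import random


def generate_data(n, seed):
    """Return (X, y) with n rows; each row of X sums to 1 (assumed normalization)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        # Draw three independent U(0, 1) values. Assumption: the budget
        # constraint is imposed by dividing each draw by the row total.
        raw = [rng.random() for _ in range(3)]
        total = sum(raw)
        x1, x2, x3 = (v / total for v in raw)
        w = rng.random()  # noise term W ~ U(0, 1)
        # Response from model (1).
        response = (0.5 * x1 + 3 * x2 + 5 * x3
                    + 5 * x2 * x3 + 2 * x1 * x2 * x3 + w)
        X.append((x1, x2, x3))
        y.append(response)
    return X, y


# Seeds 1 and 2 for the train and test sets, as the handout requests.
X_train, y_train = generate_data(1000, seed=1)
X_test, y_test = generate_data(1000, seed=2)
```

Note that after rescaling, the individual $X_i$ are no longer exactly $U(0, 1)$; other readings of the constraint are possible, and the choice should be stated in the report.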

