
Homework #2
CSE 446/546: Machine Learning

Please review all homework guidance posted on the website before submitting to Gradescope. Reminders:

- Please provide succinct answers along with succinct reasoning for all your answers. Points may be deducted if long answers demonstrate a lack of clarity. Similarly, when discussing the experimental results, concisely create tables and/or figures when appropriate to organize the experimental results. In other words, all your explanations, tables, and figures for any particular part of a question must be grouped together.
- When submitting to Gradescope, please link each question from the homework in Gradescope to the location of its answer in your Homework PDF. Failure to do so may result in point deductions. For instructions, see https://www.gradescope.com/get_started#student-submission.
- Please recall that B problems, indicated in boxed text, are only graded for 546 students, and that they will be weighted at most 0.2 of your final GPA (see website for details). In Gradescope there is a place to submit solutions to A and B problems separately. You are welcome to create just a single PDF that contains answers to both, submit the same PDF twice, but associate the answers with the individual questions in Gradescope.
- If you collaborate on this homework with others, you must indicate who you worked with on your homework. Failure to do so may result in accusations of plagiarism.

Conceptual Questions [10 points]

A0. The answers to these questions should be answerable without referring to external materials. Briefly justify your answers with a few words.

a. [2 points] Suppose that your estimated model for predicting house prices has a large positive weight on the feature "number of bathrooms". Does this imply that if we remove the feature "number of bathrooms" and refit the model, the new predictions will be strictly worse than before? Why?

b. [2 points] Compared to an L2-norm penalty, explain whether an L1-norm penalty is more likely to result in a larger number of 0s in the weight vector, and why.

c. [2 points] In at most one sentence each, state one possible upside and one possible downside of using the following regularizer:

d. [1 point] True or False: If the step-size for gradient descent is too large, it may not converge.

e. [2 points] In your own words, describe why SGD works.

f. [2 points] In at most one sentence each, state one possible advantage of SGD (stochastic gradient descent) over GD (gradient descent) and one possible disadvantage of SGD relative to GD.

Convexity and Norms

A1. A norm ‖·‖ over R^n is defined by the properties: (i) non-negativity: ‖x‖ ≥ 0 for all x ∈ R^n, with equality if and only if x = 0; (ii) absolute scalability: ‖a x‖ = |a| ‖x‖ for all a ∈ R and x ∈ R^n; (iii) triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ R^n.

a. [3 points] Show that f(x) = Σ_{i=1}^n |x_i| is a norm. (Hint: begin by showing that |a + b| ≤ |a| + |b| for all a, b ∈ R.)

b. [2 points] Show that g(x) = (Σ_{i=1}^n |x_i|^{1/2})^2 is not a norm. (Hint: it suffices to find two points in n = 2 dimensions such that the triangle inequality does not hold.)

Context: norms are often used in regularization to encourage specific behaviors of solutions. If we define ‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p}, then one can show that ‖x‖_p is a norm for all p ≥ 1. The important cases of p = 2 and p = 1 correspond to the penalty for ridge regression and the lasso, respectively.

B1. [6 points] For any x ∈ R^n, define the following norms: ‖x‖_1 := Σ_{i=1}^n |x_i|, ‖x‖_2 := (Σ_{i=1}^n x_i^2)^{1/2}, and ‖x‖_∞ := lim_{p→∞} ‖x‖_p = max_{i=1,…,n} |x_i|. Show that ‖x‖_∞ ≤ ‖x‖_2 ≤ ‖x‖_1.
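As a quick numeric sanity check of this chain of inequalities (an illustration only, not a proof; the point x = (3, 4) is an arbitrary choice), evaluate all three norms at a single point:

\[
\|x\|_\infty = \max(3, 4) = 4, \qquad
\|x\|_2 = \sqrt{3^2 + 4^2} = 5, \qquad
\|x\|_1 = 3 + 4 = 7,
\]

so indeed \(\|x\|_\infty \le \|x\|_2 \le \|x\|_1\) holds here.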
A2. [3 points] A set A ⊆ R^n is convex if λx + (1−λ)y ∈ A for all x, y ∈ A and λ ∈ [0, 1].

For each of the grey-shaded sets above (I–III), state whether each one is convex, or state why it is not convex using any of the points a, b, c, d in your answer.

A3. [4 points] We say a function f : R^d → R is convex on a set A if f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all x, y ∈ A and λ ∈ [0, 1].

For each of the grey-colored functions below (I–III), state whether each one is convex on the given interval or state why not with a counterexample using any of the points a, b, c, d in your answer.

a. Function in panel I on [a, c]
b. Function in panel II on [a, c]
c. Function in panel III on [a, d]
d. Function in panel III on [c, d]

B2. Use just the definitions above and let ‖·‖ be a norm.

a. [3 points] Show that f(x) = ‖x‖ is a convex function.
b. [3 points] Show that {x ∈ R^n : ‖x‖ ≤ 1} is a convex set.
c. Now consider g, the function from A1.b above specialized to n = 2. We know g is not a norm. Is the analogously defined set {x ∈ R^2 : g(x) ≤ 1} convex? Why not?

Context: It is a fact that a function f defined over a set A ⊆ R^n is convex if and only if the set {(x, z) ∈ R^{n+1} : z ≥ f(x), x ∈ A} is convex. Draw a picture of this for yourself to be sure you understand it.

B3. For i = 1, …, n let ℓ_i(w) be convex functions over w ∈ R^d (e.g., ℓ_i(w) = (y_i − wᵀx_i)^2), let ‖·‖ be any norm, and let λ ≥ 0.

a. [3 points] Show that
Σ_{i=1}^n ℓ_i(w) + λ‖w‖
is convex over w ∈ R^d. (Hint: show that if f, g are convex functions, then f(x) + g(x) is also convex.)

b. [1 point] Explain in one sentence why we prefer to use loss functions and regularized loss functions that are convex.

Lasso on a real dataset

Given λ ≥ 0 and data (x_1, y_1), …, (x_n, y_n), the Lasso is the problem of solving

min_{w ∈ R^d, b ∈ R}  Σ_{i=1}^n (x_iᵀw + b − y_i)^2 + λ‖w‖_1,

where λ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent method of Algorithm 1 that can solve the Lasso problem.

Algorithm 1: Coordinate Descent Algorithm for Lasso

while not converged do
    b ← (1/n) Σ_{i=1}^n (y_i − Σ_{j=1}^d w_j x_{i,j})
    for k ∈ {1, …, d} do
        a_k ← 2 Σ_{i=1}^n x_{i,k}^2
        c_k ← 2 Σ_{i=1}^n x_{i,k} (y_i − (b + Σ_{j≠k} w_j x_{i,j}))
        w_k ← (c_k + λ)/a_k  if c_k < −λ;   0  if c_k ∈ [−λ, λ];   (c_k − λ)/a_k  if c_k > λ
    end
end

You may use common computing packages (such as NumPy or SciPy), but do not use an existing Lasso solver (e.g., scikit-learn's).

Before you get started, here are some hints that you may find helpful:

- For-loops can be slow, whereas vector/matrix computation in NumPy is very optimized; exploit this as much as possible.
- The pseudocode provided has many opportunities to speed up computation by precomputing quantities like a_k before the for loop. These small changes can speed things up considerably.
- As a sanity check, ensure the objective value is nonincreasing with each step.
- It is up to you to decide on a suitable stopping condition. A common criterion is to stop when no element of w changes by more than some small δ during an iteration. If you need your algorithm to run faster, an easy place to start is to loosen this condition.
- You will need to solve the Lasso on the same dataset for many values of λ. This is called a regularization path. One way to do this efficiently is to start at a large λ, and then for each consecutive solution, initialize the algorithm with the previous solution, decreasing λ by a constant ratio (e.g., by a factor of 2) until finished.
- The smallest value of λ for which the solution ŵ is entirely zero is given by λ_max = max_k 2 |Σ_{i=1}^n x_{i,k} (y_i − ȳ)|, where ȳ = (1/n) Σ_{i=1}^n y_i. This is helpful for choosing the first λ in a regularization path.
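To make the pseudocode concrete, here is a minimal NumPy sketch of Algorithm 1. The names coordinate_descent_lasso, lasso_max_lambda, and tol are illustrative choices, not a prescribed interface, and the inner loop is deliberately left unoptimized:

import numpy as np

def lasso_max_lambda(X, y):
    # Smallest lambda for which w = 0 is optimal: max_k 2|x_k^T (y - ybar)|.
    return np.max(2 * np.abs(X.T @ (y - y.mean())))

def coordinate_descent_lasso(X, y, lam, w=None, tol=1e-4):
    n, d = X.shape
    w = np.zeros(d) if w is None else w.copy()
    a = 2 * np.sum(X ** 2, axis=0)   # a_k never changes; precompute once
    while True:                      # assumes no all-zero feature column
        b = np.mean(y - X @ w)       # bias update
        max_change = 0.0
        for k in range(d):
            # Residual with feature k's contribution removed.
            r = y - (b + X @ w - X[:, k] * w[k])
            c = 2 * X[:, k] @ r
            old = w[k]
            if c < -lam:
                w[k] = (c + lam) / a[k]
            elif c > lam:
                w[k] = (c - lam) / a[k]
            else:
                w[k] = 0.0
            max_change = max(max_change, abs(w[k] - old))
        if max_change < tol:         # stop when no coordinate moves much
            return w, b

For speed, maintain the residual y − (b + Xw) incrementally as each w_k changes rather than recomputing X @ w inside the loop; at d = 1000 this matters considerably.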
A4. We will first try out your solver with some synthetic data. A benefit of the Lasso is that if we believe many features are irrelevant for predicting y, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features. Suppose that x ∈ R^d, y ∈ R, k < d, and pairs of data (x_i, y_i) for i = 1, …, n are generated independently according to the model y_i = wᵀx_i + ε_i, where

w_j = j/k  if j ∈ {1, …, k},   and   w_j = 0  otherwise,

and ε_i ~ N(0, σ^2) is some Gaussian noise (in the model above, b = 0). Note that since k < d and w_j = 0 for j > k, the features k + 1 through d are unnecessary (and potentially even harmful) for predicting y.

With this model in mind, let n = 500, d = 1000, k = 100, and σ = 1. Generate some data by choosing x_i ∈ R^d, where each component is drawn from a N(0, 1) distribution, and y_i generated as specified above (a sketch of this generation appears after the parts below).

a. [10 points] With your synthetic data, solve multiple Lasso problems on a regularization path, starting at λ_max where 0 features are selected and decreasing λ by a constant ratio (e.g., 1.5) until nearly all the features are chosen. In plot 1, plot the number of non-zeros as a function of λ on the x-axis (Tip: use plt.xscale('log')).

b. [10 points] For each value of λ tried, record values for false discovery rate (FDR) (number of incorrect nonzeros in ŵ / total number of nonzeros in ŵ) and true positive rate (TPR) (number of correct nonzeros in ŵ / k). Here, for each j, ŵ_j is an incorrect nonzero if and only if ŵ_j ≠ 0 while w_j = 0. In plot 2, plot these values with the x-axis as FDR and the y-axis as TPR. Note that in an ideal situation we would have an (FDR, TPR) pair in the upper left corner, but that one can always trivially achieve (0, 0) and ((d−k)/d, 1).

c. [5 points] Comment on the effect of λ in these two plots.
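A sketch of the synthetic data generation and the FDR/TPR bookkeeping for part b, assuming ŵ comes from your solver above (the helper name fdr_tpr and the fixed seed are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)                  # fixed seed, for reproducibility
n, d, k, sigma = 500, 1000, 100, 1.0
w_true = np.zeros(d)
w_true[:k] = np.arange(1, k + 1) / k            # w_j = j/k for j = 1, ..., k
X = rng.standard_normal((n, d))                 # each component ~ N(0, 1)
y = X @ w_true + sigma * rng.standard_normal(n)

def fdr_tpr(w_hat, w_true, k):
    # FDR = incorrect nonzeros / total nonzeros; TPR = correct nonzeros / k.
    nonzero = w_hat != 0
    incorrect = np.sum(nonzero & (w_true == 0))
    correct = np.sum(nonzero & (w_true != 0))
    total = max(nonzero.sum(), 1)               # avoid 0/0 when nothing is selected
    return incorrect / total, correct / k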
A5. We'll now put the Lasso to work on some real data. Download the training data set crime-train.txt and the test data set crime-test.txt from the website under Homework 1. Store your data in your working directory and read in the files with:

import pandas as pd
df_train = pd.read_table("crime-train.txt")
df_test = pd.read_table("crime-test.txt")

This stores the data as Pandas DataFrame objects. DataFrames are similar to Numpy arrays but more flexible; unlike Numpy arrays, they store row and column indices along with the values of the data. Each column of a DataFrame can also, in principle, store data of a different type. For this assignment, however, all data are floats. Here are a few commands that will get you working with Pandas for this assignment:

df.head()                  # Print the first few lines of DataFrame df.
df.index                   # Get the row indices for df.
df.columns                 # Get the column indices.
df["foo"]                  # Return the column named "foo".
df.drop("foo", axis=1)     # Return all columns except "foo".
df.values                  # Return the values as a Numpy array.
df["foo"].values           # Grab column "foo" and convert to Numpy array.
df.iloc[:3, :3]            # Use numerical indices (like Numpy) to get 3 rows and cols.

The data consist of local crime statistics for 1,994 US communities. The response y is the rate of violent crimes reported per capita in a community. The name of the response variable is ViolentCrimesPerPop, and it is held in the first column of df_train and df_test. There are 95 features. These features include many variables. Some features are the consequence of complex political processes, such as the size of the police force. Others are demographic characteristics of the community, including self-reported statistics about race, age, education, and employment drawn from Census reports.

The goals of this problem are twofold: first, to encourage you to think deeply about models you might train and how they might be misused; and second, to see how Lasso encourages sparsity of linear models in settings where the feature set is very large relative to the number of training examples. We emphasize that training a model on this dataset can suggest a degree of correlation between a community's demographics and the rate at which a community experiences and reports violent crime. We strongly encourage students to consider why these correlations may or may not hold more generally, whether correlations might result from a common cause, and what issues can result in misinterpreting what a model can explain.

We have split the dataset into a training and test set with 1,595 and 399 entries, respectively. (The features have been standardized to have mean 0 and variance 1.) We'd like to use this training set to fit a model to predict the crime rate in new communities and evaluate model performance on the test set. As there are a considerable number of input variables and fairly few training datapoints, overfitting is a serious issue. In order to avoid this, use the coordinate descent LASSO algorithm you just implemented in the previous problem.

a. [4 points] Begin by reading the documentation for the original version of this dataset: https://archive.ics.uci.edu/ml/datasets/communities+and+crime. Report 3 features included in this dataset for which historical policy choices in the US would lead to variability in these features. As an example, the number of police in a community is often the consequence of decisions made by governing bodies, elections, and the amount of tax revenue available to decision-makers.

b. [4 points] Before you train a model for this prediction task, describe 3 features in the dataset which might, if found to have nonzero weight in the model, be interpreted as reasons for higher levels of violent crime, but which might actually be a result rather than (or in addition to being) the cause of this violence.

Now, we will run the LASSO solver with λ = λ_max defined above. For the initial weights, just use 0. Then, cut λ down by a factor of 2 and run again, but this time pass in the values of w from your λ = λ_max solution as your initial weights. This is faster than initializing with 0 weights each time. Continue the process of cutting λ by a factor of 2 until the smallest value of λ is less than 0.01 (a sketch of this loop appears after the parts below). For all plots, use a log-scale for the λ dimension (Tip: use plt.xscale('log')).

c. [4 points] Plot the number of nonzeros of each solution versus λ.

d. [4 points] Plot the regularization paths (in one plot) for the coefficients for input variables agePct12t29, pctWSocSec, pctUrban, agePct65up, and householdsize.

e. [4 points] On one plot, plot the squared error on the training and test data versus λ.

f. [4 points] Sometimes a larger value of λ performs nearly as well as a smaller value, but a larger value of λ will select fewer variables and perhaps be more interpretable. Inspect the weights (on features) for λ = 30. Which feature variable had the largest (most positive) Lasso coefficient? What about the most negative? Discuss briefly.

g. [4 points] Suppose there was a large negative weight on agePct65up and, upon seeing this result, a politician suggests policies that encourage people over the age of 65 to move to high-crime areas in an effort to reduce crime. What is the (statistical) flaw in this line of reasoning? (Hint: fire trucks are often seen around burning buildings; do fire trucks cause fire?)
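A sketch of this halving loop on the crime data, warm-starting each solve from the previous solution. It reuses the hypothetical coordinate_descent_lasso and lasso_max_lambda helpers sketched earlier; the column handling follows the Pandas commands above:

import numpy as np
import pandas as pd

df_train = pd.read_table("crime-train.txt")
y = df_train["ViolentCrimesPerPop"].values
X = df_train.drop("ViolentCrimesPerPop", axis=1).values

lam = lasso_max_lambda(X, y)        # start where 0 features are selected
w = np.zeros(X.shape[1])
lams, weights = [], []
while lam >= 0.01:                  # halve lambda until it drops below 0.01
    w, b = coordinate_descent_lasso(X, y, lam, w=w)   # warm start from previous w
    lams.append(lam)
    weights.append(w.copy())
    lam /= 2

nonzeros = [np.count_nonzero(wi) for wi in weights]   # for the plot in part c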
Logistic Regression

Binary Logistic Regression

A6. Let us again consider the MNIST dataset, but now just binary classification, specifically, recognizing if a digit is a 2 or 7. Here, let Y = 1 for all the 7s in the dataset, and use Y = −1 for the 2s. We will use regularized logistic regression. Given a binary classification dataset {(x_i, y_i)}_{i=1}^n, …
