COMP9417程序写作、Data Mining程序辅导

” COMP9417程序写作、Data Mining程序辅导COMP9417 Machine Learning and Data Mining Final Examination1. TIME ALLOWED 24 HOURS2. THIS EXAMINATION PAPER HAS 14 PAGES3. TOTAL NUMBER OF QUESTIONS 54. ANSWER ALL 5 QUESTIONS5. TOTAL MARKS AVAILABLE 1006. OPEN BOOK EXAM – LECTURE NOTES, TUTORIALS, AND ONLINE RESOURCESARE PERMITTED. PLEASE USE REFERENCES WHERE NECESSARY.7. SUBMISSION YOU WILL SUBMIT A SINGLE PDF FILE answers.pdfCONTAINING YOUR ANSWERS FOR ALL QUESTIONS ATTEMPTED. START EACHSUB-QUESTION ON A NEW PAGE. MARKS MAY BE DEDUCTED FOR UNCLEARWORK. YOU MAY TYPE YOUR SOLUTIONS USING LATEX, OR TAKE CLEARPHOTOS OF HANDWRITTEN WORK, OR MAKE USE OF A DIGITAL TABLET.FOR QUESTIONS THAT REQUIRE CODING, YOU WILL SUBMIT A SINGLEPYTHON FILE solutions.py (SEE TEMPLATE) CONTAINING YOUR CODE,THOUGH ALL GENERATED PLOTS/TABLES MUST BE INCLUDED IN THEPDF FILE. CODE MUST ONLY BE SUBMITTED IN THE PYTHONFILE solutions.py.8. DISCUSSION WITH OTHER STUDENTS IS STRICTLY PROHIBITED. CODESUBMISSIONS WILL BE CHECKED FOR PLAGIARISM. CHEATING WILL RESULTIN A FAILING GRADE FOR THE COURSE AND POTENTIAL FURTHER1 Please see overDISCIPLINARY ACTION.9. IF NEEDED, YOU ARE PERMITTED TO SEEK CLARIFICATION ON THE EXAMWEBCMS FORUM. QUESTIONS SPECIFIC TO CONTENT WILL NOT BE ANSWEREDAND MAY BE REMOVED. OFFENDERS MAY HAVE EXAM MARKS DEDUCTED IFINFORMATION IS DIVULGED ON THE FORUM DURING OR FOLLOWING THEEXAM.10. ENSURE YOUR ANSWER FORMATTING IS CLEAR AND LEGIBLE. BEGIN EACHSUB-QUESTION ON A NEW PAGE WITH A CLEAR HEADING. HIGHLIGHT FINALANSWERS AND MAKE USE OF DOT POINTS FOR DISCUSSION QUESTIONS. MARKSMAY BE DEDUCTED FOR WORK THAT IS DIFFICULT TO READ OR UNDERSTAND.11. SUBMIT YOUR EXAM ANSWERS AND SOLUTIONS VIA THE CSE GIVE SYSTEM.THE COMMAND TO TO SUBMIT WILL BE:COMP9417作业写作、Data Mining作业辅导give cs9417 exam answers.pdf solutions.pySUBMISSION WILL BE OPEN FROM 09:00 AUSTRALIAN EASTERN STANDARDTIME (AEST) FOR 24 HOURS.12. BY STARTING THIS EXAM AS A STUDENT OF THE UNIVERSITY OF NEWSOUTH WALES, YOU DO SOLEMNLY AND SINCERELY DECLARE THAT YOU HAVENOT SEEN ANY PART OF THIS SPECIFIC EXAMINATION PAPER FOR THEABOVE COURSE PRIOR TO ATTEMPTING THIS EXAM, NOR HAVE ANY DETAILSOF THE EXAMS CONTENTS BEEN COMMUNICATED TO YOU. IN ADDITION,YOU WILL NOT DISCLOSE TO ANY UNIVERSITY STUDENT ANY INFORMATIONCONTAINED IN THE ABOVEMENTIONED EXAM, AND YOU WILL COMPLETE THEEXAM ON YOUR OWN WITHOUT ANY EXTERNAL HELP. VIOLATION OF THISAGREEMENT IS CONSIDERED ACADEMIC MISCONDUCT AND PENALTIES MAYAPPLY.2 Please see overQuestion 1 [12 marks]This question covers Naive Bayes and requires you to refer to the training data given in thetable below, which shows the number of times each of four words A, B, C and D occurs in eachof eight documents, four in the positive class, and four negative.Document No. A B C D ClassConsider a new document x? with 1 occurrence of A, no occurrences of B, no occurrence of C,and 2 occurrences of D. Round all answers to four decimal places.(a) [4 marks]Recall that if X = (X1, X2, X3, X4) Multinomial(p1, p2, p3, p4), thenP(X = (x1, x2, x3, x4)) =P4i=1 xi.Construct a Naive Bayes classifier for the problem of predicting whether a document should beclassified as positive or negative using the multinomial distribution to model the probability ofa word occurring or not in a class. Under your model, what is the value of P(+|x?)? Do not usesmoothing.(b) [2 marks]Now, apply (add-1) smoothing to your probability estimates. Under the smoothed probabilities,what is P(|x?) under the multinomial model?(c) [4 marks]Recall that if X Bernoulli(p) thenP(X = x) = px(1 p)1x.3 Please see overConstruct a Naive Bayes classifier for the problem of predicting whether a document shouldbe classified as positive or negative using the multivariate Bernoulli distribution to model theprobability of a word occurring or not in a class. What is p(+|x?) now? Do not use smoothing.(d) [2 marks]Now, apply (add-1) smoothing to your probability estimates. Under the smoothed probabilities,what is P(|x?) under the multivariate Bernoulli model?4 Please see overQuestion 2 [23 marks] In this question, we will apply gradient descent to a simulated dataset.You may make use of numpy And matplotlib. You are not permitted to make use of any existingnumpy implementations of gradient descent (if they exist). Generate data using the followingPython code:1 import numpy as np2 import matplotlib . pyplot as plt34 np . Random . Seed (42) # make sure you run this line for consistency5 x = np . random . uniform (1 , 2 , 100)6 y = 1.2 + 2.9 * x + 1.8 * x **2 + np . random . normal (0 , 0.9 , 100)7 plt . scatter (x , y )8 plt . show ()Your data should look like the following:Then, consider the loss function,where c R is a hyper-parameter.5 Please see over(a) [3 marks] Consider the (simple) linear model yi = w0 +w1xifor i = 1, . . . , n. We can writethis more succinctly by letting w = (w0, w1)T and Xi = (1, xi)T, so that yi = wT Xi. The lossachieved by our model (w) on a given dataset of n observations is thenLc(y, y) = Xni=1Lc(yi, yi) = Xni=1 r1c2(yi wT Xi)2 + 1 1,Compute The following derivatives:Lc(yi, yi)w0and Lc(yi, yi)w1.If you are unable to figure this out, you may use Wolfram Alpha or a similar program to computethe derivatives in order to make use of them in later questions. Note that you must show yourworking for full marks, so solutions using a tool like Wolfram Alpha will receive a mark of zero.(b) [2 marks] Using the gradients computed in (a), write down the gradient descent updatesfor w0 and w1 (using pseudocode), assuming a step size of . Note that in gradient descent weconsider the loss over the entire dataset, not just at a single observation. For simplicity, assumethat your updates are always based on the values at the previous time step, even if you haveaccess to the value at the current time step (i.e., when updating multiple parameters, the updatefor w(t+1)1 might depend on w0, so you could use the new value of w0, w(t+1)0, since it has alreadybeen computed, but here we assume that we just use the old value w(t)0).(c) [12 marks] In this section, you will implement gradient descent from scratch on the generateddataset using the gradients computed in (a), and the pseudocode in (b). Initialise yourweight vector to w(0) = [1, 1]T, and let c = 2. Consider step sizes [1, 0.1, 0.01, 0.001, . . . , 0.00000001](a total of 9 step sizes). For each step-size, generate a plot of the loss achieved at each iterationof gradient descent. You should use 100 iterations in total. Generate a 3 3 grid of figuresshowing performance for each step-size. Add a screen shot of the Python code for this questionin your answers PDF file (as well as pasting the code into your solutions.py file). The followingcode may help with plotting:1 fig , ax = plt . subplots (3 ,3 , figsize =(10 ,10) )2 alphas = [10 e -1 , 10 e -2 , 10 e -3 ,10 e -4 ,10 e -5 ,10 e -6 ,10 e -7 , 10 e -8 , 10 e -9]3 for i , ax in enumerate ( ax . flat ) :4 # losses is a list of 9 elements . Each element is an array of length 100storing the loss at each iteration for5 # that particular step size6 ax . plot ( losses [ i ])7 ax . set_title ( f step size : { alphas [i]}) # plot titles8 plt . tight_layout () # plot formatting9 plt . show ()6 Please see over(d) [1 mark] Comment on your results in part (c): what do you see happening for differentstep sizes? Why Does this occur?(e) [3 marks] To find an appropriate step-size, re-run the gradient descent to find the optimalmodel parameters. State this step-size, the final model, and generate two plots: the first forthe weights w0 and w1 over the 100 iterations, and the second of the data with the final modelsuper-imposed. What do you think of the final model? Are there any obvious ways to improveit? Be sure to include the plots in your answers PDF file, and the code used in your solutions.pyfile.(f) [2 marks] Consider the following scenario: you re-run the analysis for various values of cand you notice that the optimal step-size varies across different values of c. Describe an approachto choosing an appropriate value of c.7 Please see overQuestion 3 [25 marks] This question covers the (kernel) perceptron and requires you to referto the following training data for parts (a)-(c). You are only permitted to make use of numpyand matplotlib. You are Not permitted to make use of any existing numpy implementations ofperceptrons (if they exist).x1 x2 y-0.8 1 13.9 0.4 -11.4 1 10.1 -3.3 -11.2 2.7 -1-2.45 0.1 -1-1.5 -0.5 11.2 -1.5 1Table 1: Data for Question 3(a) [3 marks] Plot the data in Table 1. Recall that the polynomial kernel is defined ask(x, y) = (m + xTy)dfor m {0, 1, 2, . . . } and d {1, 2, . . . }. Each such kernel corresponds toa feature representation of the original data. Find the simplest polynomial kernel for which thisdata becomes linearly separable (note: simplest in this case is defined as the polynomial kernelwith the smallest values of both m and d).(b) [2 marks] The optimal kernel found in (a) induces a feature representation in Rpfor someinteger p determined by your choice of kernel. Choose a subset of two coordinates from thisp-dimensional vector and plot the transformed data. For example, for vector (w, x, y, z)T R4,a subset of two coordinates could be the first two coordinates (w, x), or the first and thirdcoordinates (w, y), etc.). Is your transformed data linearly separable?(c) [10 marks] Train a kernel perceptron on this data with initial weight vector w(0) = 1p(the vector of ones in Rp). Use a learning rate of = 0.2. Note: this should be done innumpy. Provide a table outlining all updates of the weight vector, and the iteration number atwhich the update occured. State the final learned perceptron and the number of iterations untilconvergence. Demonstrate that your perceptron correctly classifies each of the points. You mayuse the following table as a template for presenting your results:8 Please see overIteration No. w0 w1 . . . wp0 1 1 . . . 1first update iteration 1+0 1+1 . . . 1+pwhere jis the update for wj computed at the first iteration. To demonstrate that your perceptronclassifies each point correctly, use the following table:xi (xi) yiT(xi)w[0.8, 1]T ([0.8, 1]T) r1 0[1.2, 1.5]T ([1.2, 1.5]T) r8 0where ri = yiT(xi)wshould be positive if your perceptron has converged. Along with yourtable(s) of results, provide a screen shot of your code in your answers PDF file, and add the codeto your solutions.py file.(d) [5 marks] Let x, y R2(i.e., x and y are two dimensional vectors), and consider the kernelk(x, y) = (2xTy + 3)3.Compute the feature vector (x) correpsonding to this kernel (in other words, the feature representationof x and y such that (x)T (y) = k(x, y)).(e) [5 marks]Consider the following dataset. The positive examples (class = +1) are:(2, 0), (0, 2), (1, 1),and the negative examples (class = 1) are:(2, 0), (0, 2), (1, 1),Claim: An ordering of the examples in this dataset on which a perceptron (with learning rate = 1 ) makes at least 5 mistakes during training cannot exist.If you agree with this claim, provide a proof of why it must be true. Otherwise, provide anordering of the examples such that the number of mistakes is achieved.9 Please see overQuestion 4 [20 marks] This question covers the Support Vector Machine and requires you torefer to the following two-dimensional training data:x1 x2 y-3 9 -1-2 4 -1-1 1 -10 -3 11 1 12 4 -13 9 -1(a) [1 mark] Plot the data (no need to provide any code if you do this in Python). Is the datalinearly separable?(b) [11 marks] In this section, we will build an SVM classifier by hand. Recall that the dualformulation of the SVM classifier for a dataset of size n isarg max, subject to i 0, for all i, Xni=1iyi = 0.Solve for (1, . . . , 7) for The provided dataset. (Hint: Using the plot in the previous sectionand your knowledge of SVMs, try to simplify The (dual) objective before doing any calculus.)(c) [3 marks] For your SVM, compute the corresponding weight vector (w R2) and bias t.Superimpose a plot of the SVM onto the scatter in (a). What is the margin for this model?(d) [2 marks] Discuss briefly the following claim: Linear classifiers are unable to representnon-linear functions.(e) [3 marks] In your own words, explain what is meant by the kernel trick. Why is this suchan important technique for machine learning?10 Please see overQuestion 5 [20 marks] In this question, we will consider the Scikit-learn implementations ofthe following classifiers: Decision Trees Random Forest AdaBoost Logistic Regression Multilayer Perceptron (Neural Network) Support Vector MachineYou are required to compare the performance of the above models on a binary classification task.The following code loads in these classifiers and defines a function to simulate a toy dataset:1 import numpy as np2 import matplotlib . Pyplot as plt3 from matplotlib . colors import ListedColormap4 import warnings5 warnings . simplefilter ( action =ignore , category = FutureWarning )67 import time8 from sklearn . svm import SVC9 from sklearn . linear_model import LogisticRegression10 from sklearn . ensemble import AdaBoostClassifier11 from sklearn . ensemble import RandomForestClassifier12 from sklearn . tree import DecisionTreeClassifier13 from sklearn . neural_network import MLPClassifier14 from sklearn . model_selection import train_test_split15 from sklearn . preprocessing import StandardScaler16 from sklearn . datasets import make_classification1718 def create_dataset () :19 X , y = make_classification ( n_samples =1250 ,20 n_features =2 ,21 n_redundant =0 ,22 n_informative =2 ,23 random_state =5 ,24 n_clusters_per_class =1)25 rng = np . random . RandomState (2)26 X += 3 * rng . uniform ( size = X . shape )27 linearly_separable = (X , y )28 X = StandardScaler () . fit_transform ( X )29 return X , y11 Please see over(a) [3 marks] Generate a dataset. Then, randomly split the dataset into training set Xtrainand test set Xtest, with 80 examples for training and 20 for testing. Plot the decision boundariesof each of the classifiers on the test set. You may wish to use the following plotter function whichplots the decision boundary and can work for any sklearn model.1 def plotter ( classifier , X , X_test , y_test , title , ax = None ) :2 # plot decision boundary for given classifier3 plot_step = 0.024 x_min , x_max = X [: , 0]. min () – 1 , X [: ,0]. max () + 15 y_min , y_max = X [: , 1]. min () – 1 , X [: ,1]. max () + 16 xx , yy = np . meshgrid ( np . arange ( x_min , x_max , plot_step ) ,7 np . arange ( y_min , y_max , plot_step ) )8 Z = classifier . predict ( np . c_ [ xx . ravel () , yy . ravel () ])9 Z = Z . reshape ( xx . shape )10 if ax :11 ax . contourf ( xx , yy , Z , cmap = plt . cm . Paired )12 ax . scatter ( X_test [: , 0] , X_test [: , 1] , c = y_test )13 ax . set_title ( title )14 else :15 plt . contourf ( xx , yy , Z , cmap = plt . cm . Paired )16 plt . scatter ( X_test [: , 0] , X_test [: , 1] , c = y_test )17 plt . title ( title )Note that you can use the Same general approach for plotting a grid as used in Question 2,and the plotter function supports an ax argument. Note that this plotter function differsslightly from the one in the sample final. Be sure to include all your code in solutions.py,and any figures generated in your answers PDF file.(b) [10 marks] Now, test how the performance of each of the classifiers varies as you increasethe size of the training set. Fix your training and test sets from part (a). Then, starting from arandom subset (chosen with replacement) of your training set of size 50, train your classificationmodel, and compute The accuracy on the test set. Repeat this process for training set sizes of[50, 100, 200, 300, . . . , 1000]. Repeat the experiment a total of 10 times for each classifier. Then,for each classifier, plot its average accuracy at each training set size. Compare the accuracyacross different algorithms in a single figure, and in 5 lines or less, discuss your observations.Please use the following color code for your plots: [Decision Tree, Random Forest, AdaBoost,Logistic Regression, Neural Network, SVM]. Be sure to include all your code in solutions.py, andany figures generated in your answers PDF file.12 Please see over(c) [7 marks] Using the time module, record the training time for each of the classifiers ateach of the training set sizes. Plot the average training time over the 10 trials of each classifieras a function of the training size. You may add code for this section to your code in part (b).What do you observe? In 5 lines or less, discuss your observations. Use the same color schemeas in (b). Be sure to include your code in solutions.py, and any figures generated in your answersPDF file.13 Please see overEND OF PAPER14如有需要，请加QQ：99515681 或邮箱：99515681@qq.com

“