Data Mining (CSC 503/SENG 474)
Assignment 1
Due on Friday, June 26th, 11:55pm

Instructions:
• You must complete this assignment on your own; this includes any coding/implementing, running of experiments, generating plots, analyzing results, writing up results, and working out problems. Assignments are for developing skills to make you strong. If you do the assignments well, you'll almost surely do better on the midterm and final.
• On the other hand, you can certainly have high-level discussions with classmates about course material. You are also welcome to come to office hours (or otherwise ask me and the TAs if you are stuck).
• You must type up your analysis and solutions; I strongly encourage you to use LaTeX to do this. LaTeX is great for presenting both figures and math.
• Please submit your solutions via conneX by the due date/time indicated above. This is a hard deadline. Any late assignment will result in a zero (I won't accept assignments even one day late; "I forgot to hit submit on conneX", etc. will not be met with any sympathy). However, I will make an exception if I am notified prior to the deadline with an acceptable excuse and if you can further (soon thereafter) provide a signed note related to this excuse.

Assignment 1

This assignment has two parts. The first part involves implementing some of the methods that you have already seen and analyzing their results on some classification problems; this part is for all students (both CSC 503 and SENG 474). The second part is only for graduate students (only CSC 503). The first part might seem like a lot of work. However, if you do it well, you'll become strong, and also gain valuable skills for your project. This is a good thing.

1 Experiments and Analysis

First, "implement" the following methods:
• Decision trees (with pruning). For simplicity, I suggest going with reduced error pruning (this is the form of pruning we covered in class).
For the split criterion, you can use information gain or the Gini index or some other good criterion. You might even try a few different ones as part of your analysis (explained further below).
• Random forests (no pruning). What forest size should you use? Well, you should experiment with different forest sizes and see what happens. For the size of the random sample (sampled with replacement) used to learn each tree, I suggest setting the number of samples drawn with replacement to be equal to the original sample size. For the number of random features when selecting the split feature at each decision node, I suggest starting at √d (for d features) and experimenting by going up and down from there.
• Neural networks. Any number of layers is fine, as long as there is at least one hidden layer; I suggest going with 3 layers (i.e. the input layer, 1 hidden layer, and the output layer).

I put "implement" in quotes because I won't actually require you to implement these methods; you can instead use machine learning software that you find online. If you do implement anything, you can use whatever programming language you like. For neural networks in particular, I shamefully recommend using an existing implementation here, but eventually (after this assignment), for your own edification, it would be great if you implemented a neural network yourself.

What to test on

You'll be analyzing the performance of these methods on two classification problems.

For the first problem, you'll be using the heart disease dataset from the UCI repository:
https://archive.ics.uci.edu/ml/datasets/Heart+Disease
In order to make your life a bit easier, I have provided a slightly cleaned and modified dataset, cleaned_processed.cleveland.data. In case you're interested, you can find details on how I prepared this dataset in Appendix A. The last attribute/feature (the label) takes values 0 and 1, where 0 indicates no heart disease and 1 indicates some level of heart disease.
Keep in mind that the dataset has not been split into a training and test set. You should do this. For any training/test split, a good rule of thumb is 80% for the training set and 20% for the test set.

The second problem will actually be designed by you. It will again be a classification problem, meaning that it consists of a set of training examples and a set of test examples for a classification task. For simplicity, I suggest going with a binary classification task. You can either generate the data yourself or find some data online. The important thing is that the problem should be interesting: that is to say, the learning task should not be so easy that every method can obtain zero test error, but it should be easy enough that some method is able to obtain a reasonably small test error. What does "reasonably small test error" mean? Well, for binary classification, this would be a test error well below 50%. Why? Because a classifier that just predicts randomly with equal probability will, on average, get close to 50% test error.

How to do the analysis

The core of this assignment, meaning what is worth the most marks, is the experiment analysis. Your analysis will go into a file called report.pdf. This file should:
• contain a description of the second classification problem (the one you came up with or selected), including an explanation of why the problem is interesting; specifically, explain why it is non-trivial and why it enables good comparison of the methods.
• present the performance of the methods in terms of the training and test error on each of the problems. In particular, you should present graphs that show how each of training error and test error varies with training set size. But beyond that, you should also play with the parameters of the methods to see the effect. For instance, what happens when you change the learning rate (or number of nodes in the hidden layer, or the number of iterations of training) for the neural network?
What happens if you change the number of random features used for the random forest? What happens if you change the pruning rule (or use a different split criterion) for the decision tree? As much as possible, try to present your results with plots.
• contain a detailed analysis of your results, including various design choices you made (like choice of split criterion for decision trees, choice of nonlinearity for neural networks, etc.). Try to explain the results that you got for the various experiments you ran, and use these results to compare the methods. Think about the best parameter settings for the methods (and maybe think about how the best parameter setting might change as the training sample size increases). Ideally, use your analysis to come up with ideas on how you might improve the methods.

Please make your analysis concise (yet thorough). Don't ramble, and don't make stuff up. Act as if you are a respected scientist presenting some work to your colleagues (and you want them to continue being your colleagues after the presentation).

What to submit

In all, for the analysis portion of the assignment, you should submit (as a zip file):
• the file report.pdf explained above;
• a file called README.txt which contains instructions for running your code (for any code you wrote) or gives proper attribution to any code/datasets you got from other sources (like the Internet, for example). If you mostly used some existing code/software but needed to modify a few files, then give attribution and mention the files you modified.
• a file called code.zip, containing your code. Please organize this file well (embrace the idea of directories). In case you mostly used some existing code/software but needed to modify a few files, then just provide those files here, and make sure your README.txt mentions those files in relation to the existing code/software.
• any additional files that you need; in particular, this should include the training and test sets for the second classification problem in case you came up with your own problem.

2 Problem-solving part – CSC 503 only

In this problem, we consider the gradient descent algorithm for a two-dimensional regression problem. So, we have 2 continuous features $x_1$ and $x_2$, and the task is to predict a continuous target $y$.

Suppose that for a given input $x_i = \begin{pmatrix} x_{1,i} \\ x_{2,i} \end{pmatrix}$, we predict using hypotheses of the following form:

$$f_w(x_i) = w_1 x_{1,i} + w_2 x_{2,i} + w_3 x_{1,i} x_{2,i} + b.$$

Assume that we have $n$ training examples $(x_1, y_1), \ldots, (x_n, y_n)$. Suppose that we measure the error according to the squared error with a further penalty on the squared Euclidean norm of $w$. Then, for a fixed, positive penalty weight, the training error can be written as

E(w) = 1 ...

In the below, assume the learning rate (also called the step size) is some positive number.
(a) Derive the gradient descent update for $b$.
(b) Derive the gradient descent update for $w_1$.
(c) Derive the gradient descent update for $w_3$.
(d) Show the full gradient descent update for the vector $w$.

3 How you'll be marked

• For each of CSC 503 and SENG 474, the total number of marks is 100, and you receive 10 marks for your code (in case you write code) together with the classification problem that you design.
• For undergrads (SENG 474), the analysis (in report.pdf) is worth 90 marks.
• For grad students (CSC 503), the analysis (in report.pdf) is worth 80 marks and the Problem-solving part is worth 10 marks.

A Cleaned version of processed.cleveland.data

In order to obtain the provided cleaned version of the data, I started from the original dataset processed.cleveland.data, which can be found in the Data Folder from the link above. You should not use this dataset!

In the original processed.cleveland.data dataset, 6 of the examples have one feature with the value ?. This is a missing value.
There are various ways to deal with missing values, but an easy way (when not many values are missing) is to simply discard those examples; this is what I did. Also, the last attribute/feature (the label) takes values {0, 1, 2, 3, 4}. I formed a binary classification task by:
• keeping the label 0 as is;
• grouping together the labels in the set {1, 2, 3, 4} by changing their label to 1. Note that all of these non-zero labels indicate some level of heart disease.
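The two cleaning steps described in this appendix (discard examples with a missing value, then merge labels 1 to 4 into a single label 1) can be sketched roughly as follows. This is only an illustration, not the script that was actually used to produce cleaned_processed.cleveland.data. It assumes the file is comma-separated with the label in the last column; the function name clean_cleveland is hypothetical, and the sample rows are made up and shortened to a few columns.

```python
import csv
import io


def clean_cleveland(lines):
    """Drop rows containing '?' and binarize the last column.

    Assumption: each line is comma-separated and the label is the
    last field. Label 0 stays 0; labels 1, 2, 3, 4 all become 1.
    """
    cleaned = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        row = [field.strip() for field in row]
        if "?" in row:  # missing value: discard the whole example
            continue
        *features, label = row
        binary_label = 0 if int(float(label)) == 0 else 1
        cleaned.append([float(v) for v in features] + [binary_label])
    return cleaned


# Made-up sample rows (not real records), shortened to 3 features + label.
sample = io.StringIO(
    "63.0,1.0,145.0,0\n"
    "67.0,1.0,?,2\n"
    "41.0,0.0,130.0,3\n"
)
rows = clean_cleveland(sample)
print(rows)
```

In this sketch the second sample row is dropped because it contains ?, and the third row's label 3 is mapped to 1. To run it on a real file, one could pass an open file handle instead of the StringIO object.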
