CS434 Implementation Assignment 3
Due 11:59PM May 09, 2020

General instructions.

1. Please add onto the Python starter code (version 3.6+). Note: you may need to install several packages (pip install --user), but try not to require anything else. I'll run it with the packages already there, and you should have everything you need. (Things like os are fine, though, since they're packaged with Python 3.6.)
2. You can work in a team of up to 3 people. Each team will only need to submit one copy of the source code and report. You need to explicitly state each member's contribution in percentages (a rough estimate).
3. Your source code and report will be submitted through Canvas.
4. You need to submit a readme file that contains the commands to run your code for the requested experiments (e.g. python main.py {args}).
5. Please make sure that you can run your code remotely on the class server (vm-cs434-1).
6. Be sure to answer all the questions in your report. You will be graded based on your code as well as the report. In particular, the clarity and quality of the report will be worth 10 pts, so please write your report in a clear and concise manner. Clearly label your figures, legends, and tables.
7. In your report, the results should always be accompanied by discussion. Do the results follow your expectation? Any surprises? What kind of explanation can you provide?

Decision Tree Ensemble for Predicting Election Results by US County Statistics
(total points: 90 pts + 10 report pts)

In this assignment we will work on a task to classify United States counties between Democratic and Republican majority vote in the 2016 US Election. We do this based on a set of discrete features related to each county. These features are a categorical representation of continuous features, which are described in the data description (county dictionary) file. The goal in this assignment is to develop variations of the decision tree algorithm and ensemble methods.

Data.
The data for this assignment is taken from a collection of US election data from the 2016 primary elections. The data has already been preprocessed for you. Here is a short description of each train and validation split:

(a) Train Set Features/Labels (x_train.csv, y_train.csv): Includes 2098 rows (samples, one for each county) in each file. Each sample in X contains 51 categorical features related to bins of continuous data encompassing various statistics of a given county. If you're curious, I can provide what the bins are, but it's not necessary for the assignment. Just think quantiles of some sort and you'll be fine understanding it.
(b) Test Set Features/Labels (x_test.csv, y_test.csv): Includes 700 rows. Each row obeys the same format given for the train set. This set will be used to see the performance of the models.

Important Guidelines. For all parts of this assignment:

(a) We have already assigned labels: class 1 is Republican and class 0 is Democrat. Be sure they aren't modified when making predictions, as the starter code assumes 0/1 labels, not -1/1 labels.
(b) Please do not add bias to the features.
(c) Please use the base classes provided for your implementations and modifications. You may add more functions if needed, but ensure the class structure preserves the format in the starter code.
(d) NOTE: This data is class imbalanced. That is, we have about a 2:1 ratio of positive to negative class. This will affect how we quantify predictions, as we may need to use more than just raw accuracy for model selection. You will be asked to explain your intuition behind why this is the case in the assignment.

Definitions

(a) Precision: TP / (TP + FP), where TP = true positive predictions and FP = false positive predictions. Interpretation: how well did we do recovering true positive predictions?
That is, high precision means a low false positive rate (not many predictions for the positive class were incorrect).
(b) Recall: TP / (TP + FN), where TP = true positive predictions and FN = false negative predictions. Interpretation: out of all the positive examples, how many did we find? That is, high recall means a low false negative rate (not many true positives got predicted as negative).
(c) F1 Score: 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall. Essentially, the balance between the two.

Part 1 (20 pts): Decision Tree (DT). For this part we are interested in using a decision tree with the below configuration:

- The DT uses Gini impurity to measure the uncertainty, rather than the entropy/information gain as presented in the slides. We do this to avoid taking logarithms, but the two serve a similar purpose. Specifically, if we have a node that splits a training list A into a left list A_L and a right list A_R, as depicted in Figure 1,

  [Figure 1: Split according to feature f_i tested against value v. Node A, with positive/negative counts C_+ and C_-, splits into A_L (counts C_L+, C_L-) and A_R (counts C_R+, C_R-).]

  then the benefit of splitting on feature f_i against value v is computed as:

      B = U(A) - p_l * U(A_L) - p_r * U(A_R)    (1)

  where U is the uncertainty measure, and p_l and p_r are the fractions of A's samples that go to A_L and A_R respectively. For this assignment, we will use the Gini index, which is computed for a list such as A_L as follows:

      U(A_L) = 1 - (C_L+ / (C_L+ + C_L-))^2 - (C_L- / (C_L+ + C_L-))^2    (2)

- The features are categorical with more than 2 possible values in most cases, so a feature might be tested multiple times against different values in the tree.

Most of the logic for this is done for you, but there are a few things you will need to do to get the basic decision tree working. Please implement the following steps:

(a) Please read through the starter code provided and work to understand how the recursive process is implemented for building a decision tree. I have chosen to use a fairly common, minimal approach for creating the trees. You will need to know this for modifying the classes for both Random Forest and AdaBoost.
(b) Add your code into the specified sections of the DecisionTreeClassifier class.
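For reference, the benefit of split in equation (1) with the Gini index can be sketched as below. This is a minimal, self-contained illustration assuming plain 0/1 label lists; the helper names are illustrative, and your DecisionTreeClassifier will likely organize the same computation differently.

```python
import numpy as np

def gini_impurity(labels):
    # Gini index U(A) = 1 - p_+^2 - p_-^2 for a list of 0/1 labels.
    # (Illustrative sketch; not the starter code's exact interface.)
    if len(labels) == 0:
        return 0.0
    p_pos = np.mean(labels)  # fraction of positive (class 1) samples
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2

def split_benefit(y, y_left, y_right):
    # Benefit B = U(A) - p_l*U(A_L) - p_r*U(A_R) of a candidate split,
    # where p_l and p_r are the fractions of samples sent left and right.
    p_l = len(y_left) / len(y)
    p_r = len(y_right) / len(y)
    return (gini_impurity(y)
            - p_l * gini_impurity(y_left)
            - p_r * gini_impurity(y_right))
```

For example, splitting the impure list [0, 0, 1, 1] (impurity 0.5) into the pure halves [0, 0] and [1, 1] (impurity 0 each) yields the maximum benefit of 0.5.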
You should only need to handle the Gini impurity calculation (as shown above). Your function should calculate all the impurities and return the gain.
(c) The base function included in main.py can be used to test your implementation. (Note: there is one included for Random Forest as well; feel free to make a similar one for AdaBoost when you implement it.)
(d) Now, add a function in main for creating trees with depths ranging from 1 to 25 (inclusive). Plot and explain the behavior of train/testing performance against the depth. That is, plot accuracy versus depth, as well as F1 score versus depth. (F1 score is the harmonic mean of precision and recall; I believe you've covered precision/recall already.)
(e) Report the depth that gives the best validation accuracy. You will probably hit a point where the tree does not need to be deeper, and it's best to use the simplest version.
(f) What is the most important feature for making a prediction? How can we find it? Report its name (see the data dictionary) as well as the value it takes for the split.

Part 2 (30 pts): Random Forest (Bagging). In this part we are interested in random forest, which is a variation of bagging without some of its limitations. Please implement the following steps:

(a) Implement a random forest with parameters:
    n_trees: The number of trees in the forest.
    max_features: The number of features for a tree.
    max_depth: Maximum depth of the trees in the forest.
Here is how the forest is created: the random forest is a collection of n_trees trees. All the trees in the forest have a maximum depth of max_depth. Each tree is built on a data set of size 2098 (the size of your original training data) sampled (with replacement) from the training data. You sample both X and y together and need to preserve the indexes that correspond to (X, y) pairs.
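One common way to implement the paired sampling just described is to draw row indices rather than sampling X and y independently. This is a sketch assuming X and y are NumPy arrays; the function name is illustrative and not part of the starter code.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    # Draw a bootstrap sample (with replacement) of the same size as the
    # training set. Sampling indices keeps each X row paired with its y label.
    n = X.shape[0]  # e.g. 2098 for this assignment's training data
    idx = rng.choice(n, size=n, replace=True)
    return X[idx], y[idx]

# Usage: one bootstrap sample per tree, each with its own draw.
# rng = np.random.default_rng(seed)
# X_boot, y_boot = bootstrap_sample(X_train, y_train, rng)
```

Because every tree sees a different resampled data set, the trees decorrelate, which is what makes averaging their predictions useful.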
In the process of building a tree of the forest, each time we try to find the best feature f_i to split on, we first sub-sample (without replacement) max_features features from the feature set and then pick the f_i with the highest benefit among those max_features sampled features. Much of this is handled already in your DecisionTreeClassifier, but you need to modify the DecisionTreeClassifier class to handle feature bagging. You will also fill in the missing RandomForestClassifier code in this class.
(b) For max_depth = 7, max_features = 11, and n_trees in [10, 20, ..., 200], plot the train and testing accuracy of the forest versus the number of trees in the forest n. Please also plot the train and testing F1 scores versus the number of trees.
(c) What effect does adding more trees to the forest have on the train/testing performance? Why?
(d) Repeat the above experiments for max_depth = 7, n = 50 trees, and max_features in [1, 2, 5, 8, 10, 20, 25, 35, 50]. How does max_features change the train/validation accuracy? Why?
(e) Optional: try to find the best combination of the three parameters using a similar idea.
(f) With your best result, run 10 trials with different random seeds (for the data/feature sampling you can use the same seed) and report the individual train/testing accuracies and F1 scores, as well as the average train/testing accuracy and F1 scores across the 10 trials. Overall, how do you think randomness affects the performance?

Part 3 (30 pts): AdaBoost (Boosting). For this part we are interested in applying AdaBoost to create yet another ensemble model with decision trees. Considering the AdaBoost algorithm described in the slides, please do the following steps:

(a) Modify your train y and test y vectors (class labels) to set class 0 to be class -1. That is, change all the 0s in your labels to be -1 instead.
AdaBoost assumes labels of 1 and -1, rather than 1 and 0. Use:

    y_train[y_train==0] = -1
    y_test[y_test==0] = -1

(b) Implement your own AdaBoostClassifier class using a modified DecisionTreeClassifier class for the weak learners provided in the code. You should implement it in the same format as the other two classes (DecisionTreeClassifier and RandomForestClassifier). Once again, you will need to slightly modify your DecisionTreeClassifier class. Make a new class such as DecisionTreeAdaBoost if you need to (since you'll have to add the weight vector as described below). However, you need to preserve the original code as much as possible and simply modify the gain and prediction functions to handle the weights described. This helps make grading easier.
(c) Let the weak learner be a decision tree with a depth of only 1 (a decision stump). The decision tree should get a weight parameter D, which is a vector of size 2098 (the training data size). Implement the decision tree with parameter D such that it considers D in its functionality. (Hint: D changes the probability estimates used for computing the Gini impurity, and also the predictions at the leaves: use majority weights instead of majority counts. Basically, your weight vector D dictates your predictions rather than the majority class.)
(d) Using the decision tree with parameter D implemented above, develop the AdaBoost algorithm as described in the slides (and part a) with a parameter L for the class (the number of base classifiers/number of trees).
(e) Report the train and validation accuracy for L in [10, 20, ..., 200].
(f) Plot the train/test accuracy and train/test F1 scores of AdaBoost against the parameter L.

Part 4 (10 pts): Class Imbalance Questions. By now, I hope you've noticed that with a 2:1 ratio of positive to negative class (approximately 66% of the data is class 1), accuracy might not be the best thing to report.
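For concreteness, the precision, recall, and F1 definitions given earlier can be computed directly from raw predictions. This is a minimal sketch assuming 0/1 labels in NumPy arrays; the function name is illustrative, not part of the starter code (libraries like scikit-learn provide equivalents, but the point here is what the metrics measure).

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    # Precision, recall, and F1 for the positive class (label 1),
    # computed from the confusion counts TP, FP, FN.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note what happens for the degenerate "predict positive every time" classifier on imbalanced data: recall is a perfect 1.0, but precision drops to the positive-class base rate, exposing a weakness that raw accuracy hides.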
Basically, we're used to seeing data with 50% of each class, and we know we're doing well if we do better than random guessing, which we can see immediately from accuracy. In this class-imbalanced case, we could predict the positive class every time and have a base accuracy of approximately 66%, which might trick us into thinking our model is doing okay. We've done nothing to handle this imbalance while training, but there are many things that are often done, and there's no single right answer for every case. What you need to do here is the following:

(a) Read this article: https://www.svds.com/learning-imbalanced-classes/
(b) Write a summary of ideas you have for handling the imbalanced classes in our data. There's no right answer here, but you should think about things we could use. I've added F1 score as a metric, which is coarse, but better than just accuracy. What else could I have done to see how good the classifier is on the testing data? I think this is important because sometimes (very, very often) in the real world we see much worse class imbalance than this, and we don't want to blindly report metrics like accuracy if we're reporting the wrong ones. Feel free to report ideas from the article, and if you're not familiar with precision/recall, refresh on it via Wikipedia or the slides.

Submission. Your submission should include the following:

1) The modified source code, with a short instruction on how to run the code for your experiments in a readme.txt.
2) Your report (preferably in PDF format), which begins with a general introduction section, followed by one section for each part of the assignment.
3) Please note that all the files should be in one folder and compressed only by .zip.