SIT742 (Modern Data Science)
Full Marks: 40
Assessment Task 02
2021 Trimester 1, Due: 8:00pm AEST, 22/05/2021

Students with difficulty in meeting the deadline because of illness, etc. must apply for an assignment extension in CloudDeakin no later than 12:00pm on 21/05/2021 (Friday).

This is a group task for groups of up to 3 members. If you choose to work on it individually, please seek approval from the unit chair via email.

There are folders for this task on CloudDeakin; please enrol into the group (2 or 3 members) before 15/05/2021 (12:00am):

    2021 Assessment Task 2 (1-member Group): for students with approval to work alone; approval required.
    2021 Assessment Task 2 (2-member Group): for groups of 2 members; self-enrollment required.
    2021 Assessment Task 2 (3-member Group): for groups of 3 members; self-enrollment required.

Please form the group first, and then self-enrol into the appropriate group before 15/05/2021 (12:00am).

Instructions

Six files are provided for this assessment task:

HTWebLog_p1.zip
    This compressed zip file is for Part I of this assessment task. It is a sample of the Hotel TULIP Web log dataset, which contains the web access log information from 11/2006 to 02/2007. [1]

Citation2003-2021.csv
    This CSV file is for Part II of this assessment task, and the file structure is provided.

Search-results.csv
    This CSV file is for Part II of this assessment task, and the file structure is provided.

SIT742Task2.ipynb
    This is the notebook file for the Python code in ipynb format, and the latest notebook is also released in SIT742Task2.ipynb.
    Web log: this code snippet contains all the coding requirements and also hints for Part I of this assessment task.
    Predictive Analysis: this code snippet contains all the coding requirements and also hints for Part II of this assessment task.
    You will need to complete the code in the notebook and make it runnable.
    The results of running the notebook will help you to develop your report, as well as generate the required files: Citation2003-2021.csv and Search-results.csv.

SIT742Task2-Report-Template.docx
    This is the Word template for your report SIT742Task2-Report.pdf.

[1] This file is exclusively for SIT742 educational purpose only. You are not allowed to further distribute it.

What to Submit?

You are required to submit the following completed files to the corresponding Assignment (Dropbox) in CloudDeakin:

SIT742Task2.ipynb
    The completed notebook with all the runnable code for all requirements.

SIT742Task2-Report.pdf
    Your report for both Part I and Part II of this assessment task.

Citation2003-2021.csv
    The completed citation information as a CSV file, sorted by year.

Search-results.csv
    The completed parameter grid search results as a CSV file.

Part I: Data Analytic – Web Log Data (20 marks)

Here is the hypothetical background:

Hotel TULIP (a hypothetical organisation) is a five-star hotel located in Australia. It is a very special hotel with an equally special purpose: not only does it embody all the creative energy and spirit of TULIP-Lab, it is also a learning environment in which tourism and hospitality students are trained to become future hoteliers.

In the past two decades, the Web server of Hotel TULIP has logged all the web traffic to the hotel website, and stored a large amount of data related to the use of various web pages.
The hotel's CIO, Dr Bear Guts (not Bill Gates!), believes that those log files are great resources to help their Information Technology Division improve their potential customers' online experience, and to help their Market Promotion Division identify potential customers and their behaviour patterns.

Hence, Hotel TULIP would like you, Group-SIT742 (a hypothetical data analytics group with up to 3 data analysers), to analyse the web log files and discover user access patterns across different web pages.

The Web server is using Microsoft Internet Information Services (IIS), and the Web log format can be found at: https://msdn.microsoft.com/en-us/library/ms525807(v=vs.90).aspx

Task Description

This task requires you to develop a data analysis report for the provided Hotel TULIP Web logs. Without exploration or further analysis, raw Web log data hardly reveals any insightful information.

In this part, you are required to complete the Python code snippets to generate suitable numeric and visual descriptions of the Hotel TULIP Web log dataset based on the detailed requirements in SIT742Task2.ipynb, and to develop the report SIT742Task2-Report.pdf to summarise the data analytic results. The detailed requirements can also be found in the notebook SIT742Task2.ipynb; here we summarise them as follows:

1 Data ETL (4 marks)

1.1 Load Data (2 marks)

Load data from files. In order to reduce the processing time, we will remove missing values, and select 30% of the total data for the following tasks.

Code
    Remove missing values. For the columns, if a column has 15% NAs, you need to remove that column.
    Then, for the rows, if there are any NAs in a row, you need to remove that row (request).
    Randomly select 30% of the total data into a new dataframe weblog_df.
Report
    Please show the number of requests in weblog_df.

1.2 Feature Selection (2 marks)

Code
    Select cs_method, cs_ip, cs_uri_stem, cs(Referer) as features and sc_status as the class label into a new dataframe ml_df for the following Machine Learning tasks.
Report
    Data description of ml_df.
    Show the top 5 rows of ml_df.

2 Unsupervised learning (4 marks)

You are required to complete this part using sklearn.

Code
    Perform unsupervised learning on ml_df with K-Means.
Report
    Visualization of K-Means performance using the elbow plot, with a varying K from 2 to 10.
    What is the best K for this dataset?

3 Supervised learning (8 marks)

You are required to complete this part using PySpark packages.

3.1 Data Preparation

Prepare the data for supervised learning by completing the code.

3.2 Logistic Regression (4 marks)

Code
    Perform supervised learning on ml_df with Logistic Regression.
    Evaluate the classification performance using the confusion matrix, including TP, TN, FP, FN.
    Evaluate the classification performance using Precision, Recall and F1 score.
Report
    Show the classification result using the confusion matrix.
    Evaluate the classification performance using the confusion matrix, including TP, TN, FP, FN.
    Evaluate the classification performance using Precision, Recall and F1 score.

3.3 K-fold Cross-Validation (4 marks)

You are required to use K-fold cross-validation to find the best hyper-parameter set, where K = 2.

Code
    Implement K-fold cross-validation for three (any three) classification models.
Report
    Your code design and running results.
    Your findings on the classification model or its hyper-parameters based on the cross-validation results (best results).

4 Association Rule Mining (4 marks)

You are required to complete this part using a suitable package for association rule mining.

Code
    Analyze the dataset using association rule mining.
    Choose suitable thresholds for confidence, support and/or other parameters.
Report
    Your code design and running results.
    Your findings on the association rule mining results.

Part II: Data Analytic – Prediction (8 marks)

Google Scholar is a web service that indexes the metadata of research articles by many scientists. The majority of computer scientists use Google Scholar to track their publications and research development. Therefore, web crawling on Google Scholar can provide the citation information for all professors with a public Google Scholar profile. After the crawling, prediction can be conducted to predict future citation numbers such as citation all.

Task Description

In 2021, to better introduce and understand the research works of the professors, the university wants to perform citation prediction for individual professors. You are required to implement a web crawler to crawl the citation information for A/Professor Gang Li from 2003 to 2021 (inclusive), and also to conduct several predictions as required. You will need to make sure that the web crawling code and prediction code meet the requirements. You are free to use any Python package for Web crawling and prediction in finishing the following tasks.

1. Crawl the citation information for A/Professor Gang Li from 2003 to 2021.
2. Train ARIMA on citation information from 2003 to 2017, and predict the 2018, 2019 and 2020 citation information. You need to draw the line plot to show the predicted citation for comparison (more details in the sections below).
3. Conduct the grid search on ARIMA parameters (p, d and q) to select the best parameter values and then use them to predict the citation information from 2021 to 2022. You also need to draw the prediction for comparison (more details in the sections below).

5 A/Professor Gang Li citation Information Extraction

You will need to import a suitable (or your chosen) web crawling library and use it to crawl the citation information from year 2003 to year 2021 (19 years) from A/Professor Gang Li's Google Scholar profile page: https://scholar.google.com/citations?user=dqwjm-0AAAAJ. E.g. the citation count for year 2020 is 839 and the citation count for year 2021 is 228.

Hint: In the right corner of the Google profile page, there is a hyperlink VIEW ALL. By clicking the hyperlink, you can see all the citations from 2003 to 2019.

5.1 Crawl and Generate the citation dataframe (1 mark)

The code must contain the necessary web crawling steps and the necessary data saving steps. Running the code will generate citation2003-2021.csv, which will contain the year column and the citations column. Data extraction without web crawling steps in the code will incur 0 marks.

6 Train ARIMA to predict the 2018 to 2020 citation

In this part, you need to train the ARIMA model, perform the prediction and also the evaluation.

6.1 Train ARIMA Model (1 mark)

You will need to use the crawled citation2003-2021.csv and then perform the ARIMA training with parameters p = 1, q = 1 and d = 1 on data from 2003 to 2017 (15 years).

6.2 Predicting the citation and Calculate the RMSE (1 mark)

Then you will need to use the trained ARIMA model to predict the citation for years 2018, 2019 and 2020. You will need to perform the evaluation by comparing the predicted citation from 2018 to 2020 with the true citation from 2018 to 2020 and calculating the root mean square error (RMSE).

6.3 Visualization for comparison (1 mark)

You will also need to use matplotlib to draw the line plot with the training data from 2013 to 2017, the testing true values, the predicted values and also the confidence interval.

Note

You will need to complete the notebook code, and insert the related self-written code and required results into the corresponding place of the report SIT742Task2-Report.pdf.

7 Parameter selection and 2021-2022 Prediction

In this part, you will need to conduct the grid search with ARIMA and select the best parameter values to predict the citations for 2021 and 2022.

7.1 Grid Search (2 marks)

You will need to run the grid search for parameters from the range p = [1, 2], q = [1, 2], d = [1, 2] with training data (year 2003 to 2017) and testing data (year 2018 to 2020). The result of the search on each parameter combination (e.g. p=1, q=1, d=1) will need to be stored in search-results.csv. The search-results.csv file will have an RMSE column and a parameter-set column.

7.2 Select the best parameter values and Predict for 2021 and 2022 (2 marks)

You will need to perform the training with ARIMA on data from 2003 to 2020 with the best parameter values you have found above, and then conduct the prediction for years 2021 and 2022.
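
As a rough illustration of the RMSE evaluation in 6.2 and the grid search bookkeeping in 7.1, the sketch below scores every (p, d, q) combination and writes the two required columns. It is a minimal sketch under stated assumptions, not the notebook's expected solution: `drift_forecast` is a hypothetical stand-in that is not an ARIMA model, the sample numbers are made up, and the column order and the text format of parameter-set are assumptions. In the notebook, the stand-in would be swapped for a real ARIMA fit from whichever package is chosen.

```python
import csv
import io
import itertools
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def grid_search(train, test, forecast_fn,
                p_values=(1, 2), d_values=(1, 2), q_values=(1, 2)):
    """Score every (p, d, q) combination and return rows sorted by RMSE."""
    rows = []
    for p, d, q in itertools.product(p_values, d_values, q_values):
        predictions = forecast_fn(train, (p, d, q), len(test))
        rows.append({
            "RMSE": rmse(test, predictions),
            # The text format of this column is an assumption.
            "parameter-set": f"p={p}, d={d}, q={q}",
        })
    return sorted(rows, key=lambda row: row["RMSE"])


def write_results(rows, handle):
    """Write the search results as CSV with the two required columns."""
    writer = csv.DictWriter(handle, fieldnames=["RMSE", "parameter-set"])
    writer.writeheader()
    writer.writerows(rows)


def drift_forecast(train, order, steps):
    """Hypothetical stand-in for an ARIMA fit (NOT an ARIMA model).

    Extends the series by the mean of the last p year-on-year changes;
    d and q are ignored. Replace with a real ARIMA forecast in practice.
    """
    p, _d, _q = order
    diffs = [b - a for a, b in zip(train[:-1], train[1:])]
    drift = sum(diffs[-p:]) / p
    return [train[-1] + drift * (i + 1) for i in range(steps)]


# Made-up numbers purely for illustration (not real citation counts).
train = [10, 20, 30, 40, 50]
test = [60, 70, 80]

results = grid_search(train, test, drift_forecast)
buffer = io.StringIO()  # stands in for open("search-results.csv", "w", newline="")
write_results(results, buffer)
```

Passing the forecaster in as a function keeps the RMSE and CSV logic independent of the modelling package, so the same loop can be reused once a real ARIMA forecast is plugged in.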
You will also need to use matplotlib to draw the line plot with the training data from 2013 to 2020 and the predictions for 2021 to 2022, together with their confidence interval.

Note

You will need to complete the notebook and insert the related self-written code and required results into the corresponding place of the report SIT742Task2-Report.pdf.

Part III: Self Reflection – Essay (12 marks)

8 Self Reflection Essay

Based on your experience with the assessment tasks, you are required to write an essay of 1200-2000 words to reflect your understanding of and thoughts on Big data, which should include the following information:

1. What are the Python packages that you find useful in manipulating and analyzing Big data? You can briefly analyze their advantages and disadvantages.
2. What are the Big data platforms that can help with storing, retrieving and analyzing big data? What are their advantages and disadvantages?
3. Compare and contrast the Python data analytical packages and their Spark counterparts.
4. What are your opinions on the privacy issues in the Big data era? Is there any example to further illustrate the risks?
5. What are the methods you think could help to solve the privacy issues in big data? Please list any successfully implemented method.
6. Any other thoughts about Data science, or suggestions to future students (or the teaching team) about this unit.

Referencing should be in Harvard style, and more information about essay writing can be found at: