COMP20008 Elements of Data Processing
Project 1
August 27, 2020

Due date
The assignment is worth 25 marks (25% of the subject grade) and is due at 8:00am on Monday 21st September 2020, Australia/Melbourne time.

Background
A web server has been set up at https://comp20008-jh.eng.unimelb.edu.au:9889/main/ containing a number of media reports on rugby games. As data scientists, we would like to extract information from those reports and use that information to improve our understanding of team performance.

Rugby scores
Understanding the rugby scoring system is important in order to be able to extract scores from match reports. A rugby score is listed as x-y, where x and y are the number of points obtained by each team. For example, the following are all valid scores:
10-8
16-0
4-12

Learning outcomes
The learning objectives of this assignment are:
- To gain practical experience in written communication skills for data science projects.
- To practice a selection of processing and exploratory analysis techniques through visualisation discussed in lectures and workshops.
- To practice crawling and scraping data from the Internet.
- To practice using widely used Python libraries for data processing and gain experience using library functions which may be unfamiliar and which require consultation of additional documentation from resources on the Web.

COMP20008 2020 SM2

Your tasks
You are to perform a small data science project including some data processing and analysis using Python. Your responses to Tasks 1-5 must be contained in a single .py file. Specifically, you have the following tasks:

Task 1 (2 marks)
Crawl the https://comp20008-jh.eng.unimelb.edu.au:9889/main/ website to find a complete list of articles available.
Produce a csv file containing the URL and headline of each of the articles your crawler has found. The csv file should have two column headings, url and headline, and be called task1.csv.
Note: You might want to test your crawling implementation on a smaller website first, such as https://comp20008-jh.eng.unimelb.edu.au:9889/sample/.

Task 2 (4 marks)
For each article found in Task 1,
a) extract the name of the first team mentioned in the article. You can find a list of team names as part of the rugby.json file provided. We will assume the article is written about that team (and only that team). (2 marks)
Note: Your implementation must make use of the list of teams in rugby.json. We will run your program with a different rugby.json file and expect to find all the articles that refer to the teams listed in the modified file. The file we use will follow the same format, but may have different teams.
b) extract the largest match score identified in the article. You will need to use regular expressions to accomplish this. We will assume this score relates to the first named team in the article. (2 marks)
Produce a csv file containing the URL, headline, first team mentioned and first complete match score of each of the articles your crawler has found. The csv file should have four column headings, url, headline, team and score, and be called task2.csv.
Note: Some articles may not contain a team name and/or a match score. These articles can be discarded.

Task 3 (1 mark)
For each article used in Task 2, identify the absolute value of the game difference. E.g. a 14-6 score and a 5-13 score both have a game difference of 8. The value is referred to as the game difference.
Produce a csv file containing the team name and average game difference for each team that at least one article has been written about.
The csv file should have two column headings, team and avg game difference, and be called task3.csv.

Task 4 (2 marks)
Generate a suitable plot showing the five teams that articles are most frequently written about and the number of times an article is written about each of those teams.
Save this plot as a png file called task4.png.

Task 5 (2 marks)
Generate a suitable plot comparing the average game difference for each team with their game difference. Ignore any teams that have no articles written about them.
Save this plot as a png file called task5.png.

Task 6 (14 marks)
Write a 3-4 page report to communicate the process and activities undertaken in the project, the analysis, and some limitations. Specifically, the report should contain the following information:
- A description of the crawling method and a brief summary of the output for Task 1. (2 marks)
- A description of how you scraped data from each page, including any regular expressions used for Task 2, and a brief summary of the output. (3 marks)
- An analysis of the information shown in the two plots produced for Tasks 4 and 5, including a brief summary of the data used. The plots are to be shown (included) along with your analysis. (4 marks)
- A discussion of the appropriateness of associating the first named team in the article with the first match score. (2 marks)
- At least two suggested methods for how you could figure out from the contents of the article whether the first named team won or lost the match being reported on, and a comment on the advantages and disadvantages of each approach. (2 marks)
- A discussion of what other information could be extracted from the articles to better understand team performance and a brief suggestion for how this could be done. (1 mark)

Submission instructions
Your responses to Tasks 1-5 must be contained in a single Python script (.py) file. As the output of this file will be verified automatically, it is essential that the program runs without producing errors.
For this assignment you may NOT install any additional packages that aren't present on the JupyterHub server, e.g. by using the pip install command. Doing so will cause your submission to fail our marking scripts.
Submission is via the LMS. Two submission links will be provided, one for the .py file containing your responses to Tasks 1-5 and a second for a .pdf or .docx file containing your response to Task 6.

Extensions and late submission penalties
If requesting an extension due to illness, please submit a medical certificate to the lecturer. If there are any other exceptional circumstances, please contact the lecturer with plenty of notice. Late submissions without an approved extension will attract the following penalties:
0 < hourslate <= 24: 2 marks deduction
24 < hourslate <= 48: 4 marks deduction
48 < hourslate <= 72: 6 marks deduction
72 < hourslate <= 96: 8 marks deduction
96 < hourslate <= 120: 10 marks deduction
120 < hourslate <= 144: 12 marks deduction
144 < hourslate: 25 marks deduction
where hourslate is the elapsed time in hours (or fractions of hours).
This project is expected to require 30-35 hours of work.

Academic honesty
You are expected to follow the academic honesty guidelines on the University website: https://academichonesty.unimelb.edu.au

Further information
A project discussion forum has also been created on the Ed forum. Please use this in the first instance if you have questions, since it will allow discussion and responses to be seen by everyone. There will also be a list of frequently asked questions on the project page.
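
Illustration: score format and game difference
The x-y score format from the Rugby scores section and the game difference from Task 3 can be illustrated with a short Python sketch. This is one possible reading, not a model solution: the regular expression, the limit of three digits per side, and the function names find_scores, game_difference and average_game_difference are all assumptions made for this example and do not come from the project specification.

```python
import re

# Assumed pattern (not from the spec): two runs of 1-3 digits joined by a
# hyphen, bounded by word boundaries so a season like "2019-20" is skipped.
SCORE_PATTERN = re.compile(r"\b(\d{1,3})-(\d{1,3})\b")


def find_scores(text):
    """Return every x-y score found in text as a list of (x, y) integer pairs."""
    return [(int(x), int(y)) for x, y in SCORE_PATTERN.findall(text)]


def game_difference(score):
    """Absolute difference between the two sides of an (x, y) score."""
    x, y = score
    return abs(x - y)


def average_game_difference(scores):
    """Mean game difference over a non-empty list of (x, y) scores."""
    return sum(game_difference(s) for s in scores) / len(scores)


# Hypothetical sentence, written for this example only.
sample = "They led 14-6 at the break before losing 5-13."
scores = find_scores(sample)
print(scores)                                # [(14, 6), (5, 13)]
print([game_difference(s) for s in scores])  # [8, 8]
```

A pattern this simple will also match hyphenated numbers that are not scores, such as some dates or ranges, so a real implementation would likely need extra checks.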