COMP20008 Elements of Data Processing
Assignment 1
March 3, 2021

Due date

The assignment is worth 20 marks (20% of the subject grade) and is due at 8:00am on Thursday 1st April 2021, Australia/Melbourne time.

Background

Learning outcomes

The learning objectives of this assignment are to:
- Gain practical experience in written communication skills for documenting data science projects.
- Practice a selection of processing and exploratory analysis techniques through visualisation.
- Practice text processing techniques using Python.
- Practice widely used Python libraries and gain experience in consulting additional documentation from Web resources.

Your tasks

There are three parts in this assignment: Part A, Part B, and Part C. Part A and Part B are worth 9 marks each and Part C is worth 2 marks.

Getting started

Before starting the assignment you must do the following:
- Create a GitHub account at https://www.github.com if you don't already have one.
- Visit https://classroom.github.com/a/FSvGXkWI and accept the assignment. This will create your personal assignment repository on GitHub.
- Clone your assignment repository to your local machine. The repository contains important files that you will need in order to complete the assignment.

COMP20008 2021 SM1

Part A (Total 9 marks)

For Part A, download the complete Our World in Data COVID-19 dataset (owid-covid-data) from https://covid.ourworldindata.org/data/owid-covid-data.csv.

Part A Task 1 Data pre-processing (3 marks)

Program in Python to produce a dataframe by
1. (2 marks) aggregating the values of the following four variables:
       total_cases
       new_cases
       total_deaths
       new_deaths
   by month and location in the year 2020.
   The dataframe should contain the following columns after completion of this sub-task:
       location
       month
       total_cases
       new_cases
       total_deaths
       new_deaths
   Note: if there are no entries for certain combinations of locations and months, there should be no entry for those combinations in the dataframe.
2. (1 mark) adding a new variable, case_fatality_rate, to the dataframe produced from sub-task 1. The variable case_fatality_rate is defined as the number of deaths per confirmed case in a given period. Do not impute missing values.

The final dataframe should contain the columns in the following order:
    location
    month
    case_fatality_rate
    total_cases
    new_cases
    total_deaths
    new_deaths
and the rows are to be sorted by location and month in ascending order.

Print the first 5 rows of the final dataframe to the standard output.

Save the new dataframe to a CSV file named owid-covid-data-2020-monthly.csv in the same directory as the Python program. Your program should be called from the command line as follows:
    python parta1.py owid-covid-data-2020-monthly.csv

Hint: You will need to use appropriate functions for the aggregation based on your understanding of the variables.

Part A Task 2 Visualisation (2 marks)

Program in Python to produce two scatter plots:
1. (1 mark) a scatter plot of case fatality rate (on the y-axis) and confirmed new cases (on the x-axis) by location in the year 2020.
   Output the plot to scatter-a.png in the same directory as the Python program.
2. (1 mark) a second scatter plot of the same data with only one change: the x-axis is changed to a log scale.
   Output the plot to scatter-b.png in the same directory as the Python program. For this plot, apply preprocessing if necessary.

Your program should be called from the command line as follows:
    python parta2.py scatter-a.png scatter-b.png

Part A Task 3 Discussion and visual analysis (4 marks)

Write a short report of your visual analysis of the two plots produced in Task 2. The visual analysis is expected to include:
1. (1.5 marks) a brief introduction/description of the raw data, including the source, any limitations you observe in the data and all preprocessing steps taken on the raw data to produce the visualisations,
2. (1.5 marks) an explanation of the plots and patterns observed, and
3. (1 mark) a discussion contrasting the two scatter plots.

The report is to be 500-600 words (maximum) excluding figures, about 1 page, in PDF format, and must include the two plots, scatter-a.png and scatter-b.png, produced in Part A Task 2.

The filename of the report must be owid-covid-2020-visual-analysis.pdf.

Part B (Total 9 marks)

For Part B, download the cricket dataset from the LMS. This dataset contains a sample of cricket-related articles from BBC News. We wish to build a search engine that will allow a user to specify keywords and find all articles related to those keywords.

Part B Task 1: Regular Expressions (1 mark)

Each article contains a document ID which uniquely identifies the document. This document ID consists of four letters followed by a hyphen, followed by three digits, optionally ending in a letter. For example, each of the following is a valid document ID:
    ABCD-123
    ABCD-123V
    XKCD-999A
    COMP-200

The document IDs are not located in a consistent place in each article. Use a regular expression to identify the document ID for each document in the dataset. Write a Python program in partb1.py that produces a CSV file called partb1.csv containing the filenames and document IDs for each document in the dataset.
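
The document ID format described above (four letters, a hyphen, three digits and an optional trailing letter) can be written as a single pattern. The following is only a minimal sketch, not a prescribed solution: it assumes the letters are uppercase, as in the examples, and the names DOCUMENT_ID_PATTERN and find_document_id are hypothetical.

```python
import re

# Assumption: letters are uppercase A-Z, as in the example IDs.
# \b on each side stops the pattern matching inside a longer token
# such as ABCDE-123 or ABCD-1234.
DOCUMENT_ID_PATTERN = re.compile(r"\b[A-Z]{4}-[0-9]{3}[A-Z]?\b")


def find_document_id(text):
    """Return the first document ID found in text, or None if there is none."""
    match = DOCUMENT_ID_PATTERN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Illustrative text only; it is not taken from the cricket dataset.
    sample = "Cricket report.\nRef: ABCD-123V\nEngland won the toss."
    print(find_document_id(sample))
```

Because the ID can appear anywhere in an article, the sketch uses search rather than match, which would only test the start of the text.
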
Your CSV file should contain the following columns in the order below:
    filename
    documentID

Your program should be called from the command line along with the name of the CSV file:
    python partb1.py partb1.csv

Part B Task 2: Preprocessing (1 mark)

We now wish to perform the following preprocessing on each article in the cricket folder in order to make them easier to search:
- Remove all non-alphabetic characters (for example, numbers and punctuation characters), except for spacing characters such as whitespaces, tabs and newlines.
- Convert all spacing characters such as tabs and newlines to whitespace and ensure that only one whitespace character exists between each word.
- Change all uppercase characters to lower case.

Create a Python program in partb2.py that performs this preprocessing.

Your program should be called from the command line along with the filename of a document. For example:
    python partb2.py cricket001.txt

Your program should then load the specified file, perform the preprocessing steps above and print the results to standard output.

Hint: You may wish to create a function for performing this preprocessing, as you will need to perform it as part of each task in Part B.

Part B Task 3: Basic Search (2 marks)

Create a Python program in partb3.py that will allow the user to search for articles containing particular keywords. Your program should be called from the command line along with the keywords being searched for. For example:
    python partb3.py keyword1 keyword2 keyword3

You can assume each keyword will be separated by a whitespace character and that between 1 and 5 keywords will be entered. Your program should then return the document IDs of the documents that contain all of the keywords in the user's search query. For this task:
- You should check for matches after performing the preprocessing in Task 2.
  For example, searching for the word "old" should return articles containing the words "Old" or "OLD".
- The keywords that the user searches for are separate keywords. You are not required to match exact phrases. For example, if a user searches for the keywords "captain early", these words do not need to appear consecutively in the document to constitute a match.
- Only documents that contain the actual keyword should return a match. For example, searching for the word "old" should not return articles containing the word "golden".

Your program should output the document IDs of each article containing all of the specified keywords.

Hint: You may wish to load partb1.csv back into your program.

Part B Task 4: Advanced Search (2 marks)

We now wish to expand the search feature to enable inexact matching. For example, a user should be able to specify the keyword "missing" and the search should also return articles containing the related words "missed" or "miss". Create a Python program in partb4.py, based on your response to Task 3, that uses a Porter Stemmer to enable this inexact matching. Your program should be called from the command line along with the keywords being searched for. For example:
    python partb4.py keyword1 keyword2 keyword3

Your program should output the document IDs of each article containing all of the specified keywords, or words considered by the Porter Stemmer to have the same base. For this task:
- You should check for matches after performing the preprocessing in Task 2. For example, searching for the word "old" should return articles containing the words "Old" or "OLD".
- The keywords that the user searches for are separate keywords. You are not required to match exact phrases. For example, if a user searches for the keywords "captain early", these words do not need to appear consecutively in the document to constitute a match.
- Other than inexact matches permitted by the Porter Stemmer, only documents that contain the actual keyword should return a match.
  For example, searching for the word "old" should not return articles containing the word "golden".

Note that, other than the final point, this list of requirements is the same as for Task 3.

Part B Task 5: Search Rankings (3 marks)

We wish to further expand the search feature to enable documents to be ranked, so that those most relevant to the user's keywords are displayed at the top of the list. One way of computing such a ranking is to use TF-IDF along with the cosine similarity measure, as discussed in lectures. Create a Python program in partb5.py, based on your response to Task 4, that ranks articles returned by Task 4 by cosine similarity score.

Your program should be called from the command line along with the keywords being searched for. For example:
    python partb5.py keyword1 keyword2 keyword3

Your program should output:
- The headings documentID and score.
- The document IDs of each article containing all of the specified keywords, or words considered by the Porter Stemmer to have the same base.
- The cosine similarity score between the vector of stemmed keywords and the vector of stemmed words appearing in the document for each document matched, rounded to four decimal places.

You should assume that the collection being used by TF-IDF is the complete list of stemmed words contained in articles returned by your Task 4 search. The output should be sorted in descending order by cosine similarity score with the search query. For example, one sample output might look like this:
    documentID score
    JDKC-105M 0.0618
    BTAR-174V 0.0182

Part C (Total 2 marks)

GitHub Submission

Ensure all of your completed code files as well as your report have been pushed to the GitHub repository you created in the Getting started section. We strongly encourage you to push an updated version of your code to your GitHub repository each time you make a major change.

Your repository must also contain a README file, which must contain your name and student ID.
It must also contain a brief description of your project and a list of dependencies.

Submission Instructions

Submit all Python scripts and the PDF discussion report via LMS. A complete submission includes the following items:
1. parta1.py
2. parta2.py
3. owid-covid-2020-visual-analysis.pdf
4. partb1.py
5. partb2.py
6. partb3.py
7. partb4.py
8. partb5.py
9. A link to your GitHub repository

You must also have pushed the above files to your GitHub repository, which the teaching staff already have access to.

Extensions and late submission penalties

If requesting an extension due to illness, please submit a medical certificate to the lecturer. If there are any other exceptional circumstances, please contact the lecturer with plenty of notice. Late submissions without an approved extension will attract the following penalties:
    0 < hourslate <= 24: (2 marks deduction)
    24 < hourslate <= 48: (4 marks deduction)
    48 < hourslate <= 72: (6 marks deduction)
    72 < hourslate <= 96: (8 marks deduction)
    96 < hourslate <= 120: (10 marks deduction)
    120 < hourslate <= 144: (12 marks deduction)
    144 < hourslate: (20 marks deduction)
where hourslate is the elapsed time in hours (or fractions of hours).

This project is expected to require 15-20 hours of work.

Academic honesty

You are expected to follow the academic honesty guidelines on the University website https://academichonesty.unimelb.edu.au

Further information

A project discussion forum has also been created on the Ed forum. Please use this in the first instance if you have questions, since it will allow discussion and responses to be seen by everyone. There will also be a list of frequently asked questions on the project page.