CITS1401 Computational Thinking with Python
Project 2, Semester 2 2020

Project 2: How Good (Positive and Patriotic) is Australia?

Submission deadline: 5:00 pm, Friday 23rd October 2020
Value: 20% of CITS1401
To be completed individually.

You should construct a Python 3 program containing your solution to the following problem and submit your program electronically on Moodle. No other method of submission is allowed. Your program will be automatically tested on Moodle. Remember that your first two checks against the tester on Moodle will not incur any penalty. However, any further check will carry a 10% penalty per check.

You are expected to have read and understood the University's guidelines on academic conduct. In accordance with this policy, you may discuss with other students the general principles required to understand this project, but the work you submit must be the result of your own effort. Plagiarism detection, and other systems for detecting potential malpractice, will therefore be used. Besides, if what you submit is not your own work then you will have learnt little and will therefore, likely, fail the final exam.

You must submit your project before the submission deadline listed above. Following UWA policy, a late penalty of 5% will be deducted for each day (or part day), after the deadline, that the assignment is submitted. No submissions will be allowed after 7 days following the deadline, except in approved special consideration cases.

Context:
For this project, imagine for a moment that you have successfully completed your UWA course and recently taken up a position with the Department of Prime Minister and Cabinet in Canberra with the Australian Federal Government. At first you were quite reluctant to leave Perth to move over east and, more generally, wondered what use a new graduate with a heavy focus on computing, programming and data could be to this department.
Regardless, the opportunity to gain experience in the real world was too good, and although it is not quite your own multi-million dollar technology start-up, there was no way you weren't taking up the offer.

Your first few weeks of orientation were mostly a blur. However, one thing you noticed was that any time you mentioned your skills in programming, and with Python [1] in particular, to any senior bureaucrat, or even some of the savvier politicians, their eyes seemed to light up and they suddenly became much more interested in whatever you were saying to them. After reflecting on these experiences, you wonder whether there might be some even more interesting opportunities for you in the near future.

[1] Actually their eyes are more likely to light up if or when you mention your skills in data science, machine learning and big data, for all of which Python is basically the foundational tool.

However, for now you decide to put these thoughts aside, as it's not as though the work you have been doing already has not been interesting, and this is what you need to focus on for today. At an early morning meeting with your immediate supervisor, you were told that the Government is very interested in reducing its spend on trying to understand what (and how) the Australian population currently thinks about it. Instead of spending millions of dollars calling randomised groups of Australian residents every quarter to ask about their opinions on various Government services, many senior bureaucrats have wondered for a while now whether there is any way to use the masses of freely available data on the internet to provide similar insights at a fraction of the cost.

It is within this context that your supervisor has asked you to develop a program, as a proof-of-concept, to demonstrate that it is possible to provide some of these insights at a much lower cost.
At your meeting your supervisor noted that, for the proof-of-concept stage, the use of any live internet data will not be possible without approval from the legal team (as well as possibly many others). This seemed like quite an obstacle until you thought back to one of your early Python units (maybe this one?) and remembered that there is an open source, freely available corpus of billions of recently crawled websites called the Common Crawl (https://commoncrawl.org/). More specifically, the Common Crawl corpus consists of tens of thousands of files saved in a certain format (the WARC format, see below), each of which contains the raw HTML of tens of thousands of web pages from a web crawl performed in the recent past. Being open source, this data is free for you to use, so with it you can immediately begin building your proof-of-concept.

The Project:
As your program is to be a proof-of-concept, both you and your supervisor decided that its scope should be kept as narrow as possible (but, of course, it must be broad enough so that it can successfully demonstrate some really good insights). For this reason, it was decided that your program is to focus on providing only four insights:
1. How positive is Australia generally?
2. How positive does Australia feel towards its Government specifically?
3. How patriotic is Australia compared with two other major English-speaking countries, the UK and Canada?
4. What are the most referred-to websites (domains) by all Australian websites (your team may want to use this information in the future to better understand how influential each Australian web result is to your insights, i.e.
highly-referred-to web domains should be counted as more influential, and lowly-referred-to web domains should be counted as less influential).

As outlined in the context section, in order to generate these insights (which will be discussed in greater detail later in this document), your program will need to examine the raw HTML from large quantities of Australian web pages, and such information is available in WARC format from the Common Crawl.

The Common Crawl and WARC format:
The WARC (Web ARChive) format is a standard format for mass storage of large amounts of web pages within a single file. The Common Crawl makes the results of its crawl freely available for download in this format (as well as the WAT and WET formats, which will not be used for this project). For this project we will use WARC files from the August 2020 crawl (https://commoncrawl.org/2020/08/august-2020-crawl-archive-now-available/). In order to access these files you need to download the WARC files list, which you can access by clicking on the CC-MAIN-2020-34/warc.paths.gz hyperlink in the table on the August 2020 crawl homepage.

Clicking on this link will download an archive which, when opened, will contain a text file. Once you open the text file you can download any of the WARC files from the Common Crawl by prepending https://commoncrawl.s3.amazonaws.com/ to any of the lines of this file and pasting the full address into your browser.

A couple of notes about the Common Crawl WARC files as discussed so far:
- The file list and all Common Crawl WARC files are compressed using gzip. These files can be unzipped automatically if you are using Linux or Mac OSX. For Windows you will have to download a free application to do this; try 7-Zip: https://www.7-zip.org/.
- The Common Crawl WARC files are very large: approximately 900MB compressed and up to 5GB uncompressed.
Each file contains approximately 45,000 individual crawl results.

Due to the size of the files above, a massively cut-down sample Common Crawl WARC file has been made available for this project on LMS as well as on the Moodle server. It is expected you will use this file to get familiar with the format and for the (initial) testing of your project. However, your submission will be tested with other WARC files.

To start getting familiar with WARC files, it is recommended you download the sample file and open it in a text editor (for Windows, WordPad performs better; you can also use Thonny). You will see that a WARC file consists of an overall file header, beginning with the text WARC/1.0, and the next time you see this text it describes either a request (WARC/1.0 followed by WARC-Type: request), a response (WARC/1.0 followed by WARC-Type: response) or possibly a metadata or other type of WARC category (e.g. WARC/1.0 followed by WARC-Type: metadata). For this project we are only interested in WARC responses (WARC/1.0 followed by WARC-Type: response), as these are the only category that contains the raw HTML data of the web pages we are analysing. [2]

[2] Note the use of \r with \n to signify a line ending in the WARC (and HTTP) headers. This is a standard line ending code for text files saved with Microsoft Windows and in some other scenarios. You will need to account for this when processing these headers.

Looking in more detail at WARC responses, you can see that these are further broken down into three sections, which are separated by blank lines. The first is the WARC response header (beginning with WARC/1.0). The second is the HTTP header (usually beginning with HTTP/1.1 200) and the third is the raw HTML data (usually but not necessarily beginning with <!DOCTYPE HTML>).
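The three-section layout described above can be sketched as follows. This is a minimal illustration rather than a reference solution: the function name split_responses and the sample text are made up for this example, and it assumes that every record begins with the marker WARC/1.0 followed by \r\n and that a blank line appears as two consecutive \r\n pairs.

```python
def split_responses(warc_text):
    """Return (warc_header, http_header, html) for each response record.

    Assumption: records start with 'WARC/1.0\r\n' and the three sections
    of a response are separated by blank lines ('\r\n\r\n').
    """
    responses = []
    # The chunk before the first marker is empty, so skip it. The marker
    # itself is consumed by split(), so each header starts at WARC-Type.
    for record in warc_text.split("WARC/1.0\r\n")[1:]:
        # maxsplit=2 keeps any blank lines inside the HTML intact.
        parts = record.split("\r\n\r\n", 2)
        if len(parts) == 3 and "WARC-Type: response" in parts[0]:
            responses.append((parts[0], parts[1], parts[2].strip()))
    return responses


# Made-up miniature WARC text: a file header, a request and a response.
sample = (
    "WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nsoftware: demo\r\n\r\n"
    "WARC/1.0\r\nWARC-Type: request\r\n"
    "WARC-Target-URI: http://example.com.au/\r\n\r\n"
    "GET / HTTP/1.1\r\n\r\n\r\n\r\n"
    "WARC/1.0\r\nWARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com.au/\r\n\r\n"
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    "<!DOCTYPE html><html><body>Hello</body></html>\r\n\r\n"
)

responses = split_responses(sample)
```

Here only the third record survives the filter, so responses holds a single tuple. Real WARC files may contain truncated or unusual records, which is why the sketch silently skips anything that does not split into exactly three parts.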
For the purposes of this project, you can assume that the first block of text (before the first blank line) is the WARC header, the second block of text (after the first blank line) is always the HTTP header, and the third block of text (i.e. anything after the second blank line and before the next WARC/1.0 heading) is the raw HTML that we need to analyse.

Taking into account the above, your program will need to be able to open a WARC file, discard or ignore the overall WARC file header, and then for each result:
1. Extract the URL from the WARC response header (this is stored in the line starting with WARC-Target-URI).
2. Extract the Content-Type from the HTTP header. For this project we are only interested in responses that are of Content-Type: text/html. Any other types of HTTP responses can be ignored.
3. Extract the raw HTML for this result and store it in a data structure so that it is associated with the URL you extracted (in point 1).

Extracting Raw Text from HTML:
If you were to have a look at the raw HTML you have extracted in detail, you would see that it doesn't quite (yet) look like nice words and sentences that you will be able to analyse to determine its positivity and patriotism, as you are required to do for insights 1 to 3. In order to get your text to this point, you are going to have to perform some transformations on it, namely:
- Removal of any HTML tags: any text between a < character and a > character can be assumed to be an HTML tag and needs to be removed before completing your analysis for insights 1, 2 and 3.
- Removal of JavaScript code: before you remove your HTML tags above, you will also need to remove any text that is between the <script> and </script> tags (again only for completing insights 1 to 3).

The Insights Themselves:
Some more details about what is required for each insight are below:

1. 
How positive is Australia generally?
For this insight, both you and your supervisor are keen to understand how much Australian websites use positive words compared to how much they use negative words. It was decided that, for this insight, your program should produce a list with five items. The first and second items in this list are the total counts of positive words and negative words respectively within the raw text of all Australian web pages that were in the WARC file provided to your program. The third item of the list should be the ratio of positive words to negative words, which can be calculated by dividing the former by the latter. The fourth and fifth items should be the average number of positive words and negative words respectively found in the typical Australian web page.

To assist you in this duty, your supervisor has provided you with a list of common positive English words, and a list of common negative English words. You can find these lists as text files on LMS and the Moodle server. For this project you can assume that any words that are not in either of these lists should not be included as part of the positive or negative counts.

Note: in order to produce accurate results here, you will have to make sure that your program counts the appearance of any positive and negative words in your text regardless of the word's case (uppercase, lowercase or a combination of both) and any punctuation at the start, end or within each word itself (e.g. commas, full stops, quotation marks, etc.).
In fact, it is recommended you remove all punctuation from your text before performing this step. For now, assume every non-alphanumeric character visible on a standard ANSI keyboard is a punctuation character that needs to be removed. If you bought your computer in Australia or the US then it is very likely you have an ANSI keyboard; if not, you can easily find out what punctuation an ANSI keyboard contains through a Google image search.

Note: the above analysis should be performed on Australian websites only. For this project, assume that a website is an Australian website if, and only if, its domain name ends in .au (domain names are discussed in more detail in insight 4).

2. How positive is Australia towards its Government?
As well as calculating how positive Australia is in general, the second outcome of this project is to determine how positive Australia is towards its Government. In order to determine this, your program should examine every sentence that contains the word "government" for any positive or negative words. You and your supervisor decided on the following rules for any sentences containing the word "government":
- If the sentence contains one or more positive words and no negative words then it should be counted as a positive sentence.
- If the sentence contains one negative word then it should be counted as a negative sentence; however, if the sentence contains two negative words then it should be counted as a positive sentence (i.e. it is likely that the writer has used a double negative). If the sentence contains three or more negative words then it should be counted as a negative sentence.
- If the sentence contains a combination of positive and negative words (or no positive or negative words) then it should not be counted as either positive or negative.

As with insight 1, your results should be provided in a list, with the first and second items being your raw positive and negative counts, the third item being the ratio of positive to negative counts, and the fourth and fifth items being your average number of positive sentences and negative sentences per web page respectively.

Also as with insight 1, the above analysis should be performed on Australian websites only. In addition, the same directions with regard to the words' case and punctuation apply, but you may wish to delay removing any sentence-ending punctuation characters (see below) to ensure you are able to split the raw text for each result into individual sentences first.

For this section you can assume that a sentence is any number of words ended by a sentence-ending character (a full stop, a question mark or an exclamation mark).

3. How patriotic is Australia compared with the other major English-speaking countries?
For this insight you are required to determine how often the word "australia" appears in the raw text of your Australian websites compared with how often the other countries' names appear in their websites, specifically focussing on two other major English-speaking countries, Canada and the United Kingdom (which both have their own unique TLDs):
- For Canada, your program should determine how often the word "canada" appears in the raw text for any URLs whose domain name ends in .ca.
- For the United Kingdom, your program should determine how often the word "uk" and the phrases "united kingdom" and "great britain" appear in the raw text for any URLs whose domain name ends in .uk.

All of these results are to be calculated as percentages with the following formula:

(total number of occurrences of all words / phrases for the country) / (aggregate number of words of every web result's raw text for that country) * 100

These percentages are then to be provided in a list in the order [Australia, Canada, United Kingdom].

The same directions from insight 1 with regard to the words' case and punctuation removal apply to the words you will examine in this insight also.

4. Web domain links and counts
For every Australian web page in your WARC file, your program should count every domain name that it links to. A domain name is the part of a URL that refers to the root site; for example, the domain name for the link https://www.google.com.au/example/testing.html is www.google.com.au. For this project you can consider that a link in a web page only exists when it appears within an <a href=…> tag within the page's raw HTML, for example <a href="https://www.google.com.au/example/testing.html">.

Your program should examine each link that each of your Australian web pages refers to, extract only its domain name (including any subdomains, such as www), and count the occurrences of these domain names across all of your Australian results. Any links that do not start with http:// or https:// can be ignored. Your program should return the top 5 most frequently occurring domain names and their counts in a list of tuples with the format [(domain_name, aggregate_count), …]. The tuples in the list should be in descending order by count.
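The domain extraction and ranking for insight 4 might look something like the sketch below. It is illustrative only: the names extract_domain and top_domains are hypothetical, pulling the URLs out of the <a href=…> tags is left out, and cutting the domain at the first /, ? or # is an assumption of this sketch rather than a rule from the brief. Equal counts are ordered alphabetically, which is the tie-break rule the brief gives.

```python
def extract_domain(url):
    """Return the lower-cased domain of an http(s) URL, or None otherwise."""
    url = url.lower()
    if url.startswith("http://"):
        rest = url[len("http://"):]
    elif url.startswith("https://"):
        rest = url[len("https://"):]
    else:
        return None  # relative links, mailto:, etc. are ignored
    # Assumption: the domain ends at the first '/', '?' or '#'.
    for stop in "/?#":
        cut = rest.find(stop)
        if cut != -1:
            rest = rest[:cut]
    return rest if rest else None


def top_domains(domains, n=5):
    """Count domains and return the n most common as (domain, count) tuples."""
    counts = {}
    for domain in domains:
        counts[domain] = counts.get(domain, 0) + 1
    # Sort by count descending, then by name ascending to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up links for illustration.
links = [
    "https://www.google.com.au/example/testing.html",
    "HTTP://WWW.Google.com.au/",
    "https://b.example.au/x",
    "https://a.example.au",
    "/relative/page.html",
]
found = []
for link in links:
    domain = extract_domain(link)
    if domain is not None:
        found.append(domain)
ranking = top_domains(found)
```

A dictionary keeps the counting to a single pass over the links, and the negated count in the sort key gives descending counts with ascending names in one sort.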
In case of a tie in their counts, rank them in ascending alphabetical order.

Note: URLs are case-insensitive, so all domain names (and <a href=…> tags) should always be converted to lower case before counting.

The Program Itself:
Your program should be written in Python and have the following main() signature (from where it will be called for testing and demonstrations):

    def main(WARC_fname, positive_words_fname, negative_words_fname):

The input arguments to this function are:
- WARC_fname: the name of the WARC file that your program will analyse.
- positive_words_fname: the name of the file containing the list of positive words [3] (one per line) that is to be input into your program.
- negative_words_fname: the name of the file containing the list of negative words [4] (one per line) that is to be input into your program.
(For Windows, WordPad is recommended for viewing the files.)

[3] This list is sourced from Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA (online access: https://gist.github.com/mkulakowski2/4289437).
[4] As above (online access: https://gist.github.com/mkulakowski2/4289441).

When called, your program should return four lists representing the outputs of each of your insights (in the order of insight 1 to insight 4).

Your program should not modify any of the words within your lists of positive_words and negative_words (e.g. do not remove any punctuation from these words, even if this means that there will be no instances of a particular word counted). All sample files are provided on LMS and the Moodle server in a zip file.

Of course, you are expected to structure your program using several helper functions that are called from within your main() function (and / or helper functions within these helper functions) so that your marker / colleagues can more easily understand your program.

For testing, your program will be called by using the main() function. For example:

    >>> gen_pos, gov_pos, pat, top_links = main("warc_sample_file.warc", "positive_words.txt", "negative_words.txt")
    >>> gen_pos
    [1256, 651, 1.9293, 27.3043, 14.1522]
    >>> gov_pos
    [22, 13, 1.6923, 0.4783, 0.2826]
    >>> pat
    [0.5431, 0.2793, 0.6286]
    >>> top_links
    [('www.industryupdate.com.au', 275), ('religionsforpeaceaustralia.org.au', 183), ('boundforsouthaustralia.history.sa.gov.au', 148), ('www.jcu.edu.au', 114), ('blogs.geelongcollege.vic.edu.au', 54)]

Additional Requirements:
Your program may be distributed to high-level bureaucrats, some of whom may not have the same level of technical skills that you and your team have. To ensure your program is a success in these people's hands you will need to ensure that:
- All inputs to your main() function are validated to ensure they are valid and what your program expects. If they are not then your program should gracefully terminate.
- If a user were to input any file name that cannot be found or opened then it should also gracefully terminate.
- As this is a proof-of-concept, the import of any Python modules is strictly prohibited. As with your previous project, while the use of certain modules (e.g. warc or warcio) would be a perfectly sensible thing to do in a production setting (provided these modules have been vetted by your information security team for use in programs that may be exploring and/or collating sensitive Government data), these would again take away from the core activities of the project, which are to become familiar with opening files, text processing and the use of inbuilt Python structures (which in turn are similar to basic structures from many other programming languages).
- Your program, of course, should be appropriately commented.

In addition to the above please note the following:
- WARC files will often contain characters that the default settings for your open() function call cannot handle.
Therefore, you should open your WARC file in binary mode, for example:

    warc_file_handler = open(WARC_fname, "rb")

  You should then use the read() function to read in its content, followed immediately by a call to the decode() function to convert the file's contents to basic text, ignoring any decode errors, for example:

    warc_text = warc_file_handler.read().decode("ascii", "ignore")

- Do not assume that the input file names will end in .warc. File name suffixes are not mandatory in systems other than Microsoft Windows. Do not enforce within your program that the file must end with .warc or any other extension (or try to add an extension onto the provided WARC_fname argument); doing so can easily lead to lost marks.
- Ensure your program does NOT call the input() function at any time. Calling the input() function will cause your program to hang, waiting for input that the automated testing system will not provide (in fact, if the marking program detects the call(s), it will not test your code at all, which may result in a zero grade).
- For the purposes of our testing your program should also not call the print() function at any time. If it has encountered an error state and is exiting gracefully then your program needs to return empty lists. At no point should you print the program's outputs instead of (or in addition to) returning them, or provide a printout of the program's progress in calculating such outputs.
- For any of your outputs that is a ratio or an average, if ever the denominator is zero then your program should return None for this ratio rather than zero or an error state.
- All of your outputs should be rounded to 4 decimal places (if they are required to be rounded). Rounding should only occur at the final step immediately before the value is saved into its final data structure (i.e. do not round off any values during any of your intermediate steps when calculating your outputs).
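The zero-denominator and rounding rules above can be combined in one small helper, sketched below. The names safe_ratio and summarise are hypothetical, and the page count of 46 used in the example is inferred from the sample averages in this brief (1256 / 27.3043 is approximately 46) rather than stated anywhere in it.

```python
def safe_ratio(numerator, denominator):
    """Divide and round to 4 decimal places, or return None if undefined."""
    if denominator == 0:
        return None  # the brief asks for None, not zero or an error
    # Rounding happens only here, at the final step.
    return round(numerator / denominator, 4)


def summarise(positive_total, negative_total, page_count):
    """Build the five-item list used by insights 1 and 2."""
    return [
        positive_total,
        negative_total,
        safe_ratio(positive_total, negative_total),
        safe_ratio(positive_total, page_count),
        safe_ratio(negative_total, page_count),
    ]


# Totals taken from the sample output; 46 pages is an inferred assumption.
example = summarise(1256, 651, 46)
empty = summarise(0, 0, 0)
```

Checking the denominator with an if statement, rather than catching ZeroDivisionError, also matches the efficiency guidance later in this brief.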
- If you wish to view the contents of the positive and negative words files in Microsoft Windows, please open these files in WordPad or another text editor besides Notepad, as Notepad does not correctly display text files whose lines end in just \n.

Submission:
Submit your solution on Moodle before the deadline. You are required to paste your code in the text box as well as upload the same code as a Python file. The name of the file must be your student_id.py. Read the submission guidelines on the Moodle portal. You need to contact the unit coordinator if you have special considerations or you plan to make a submission after the due date.

Marking Rubric:
Your program will be marked out of 40 (later scaled to be out of 20% of your final mark for CITS1401). 30 out of 40 marks will be awarded automatically based on how well your program completes a number of tests, reflecting normal use of the program, and also how the program handles various states including, but not limited to, different numbers of results in any WARC file and / or any error states. You need to think creatively about what your program may face. Your submission will be graded with data files other than those provided. Therefore you need to be creative in looking into corner or worst cases. There are some hidden tests in Moodle, and the project may undergo further automated testing after the deadline.

10 out of 40 marks will be awarded manually after the deadline. They will be based on style (5/10), meaning the code is clear to read, and efficiency (5/10), meaning your program is well constructed and runs efficiently. For style, think about use of comments, sensible variable names, your name and student ID at the top of the program, etc.
(Please look at your lecture notes, where this is discussed.)

Style Rubric:
0    Gibberish, impossible to understand
1-2  Style is quite poor
3-4  Style is adequate to good, with small lapses
5    Style is very good or excellent, your submission is very easy to read and follow

Your program will be traversing files of various sizes (possibly including very large WARC files) so try to minimise the number of times your program looks at the same data items. You may want to use different data structures such as tuples, lists, or dictionaries.

Efficiency Rubric:
0    Code too incomplete to judge efficiency, or wrong problem tackled
1-2  Very poor or inferior efficiency, many lapses
3-4  Acceptable or good efficiency, with some or few lapses
5    Excellent efficiency, should have no issues with large WARC files, etc.

Efficiency lapses can include (but are not limited to):
- The use of more loops than necessary
- Inappropriate use of readline()
- Opening files more than once (and not closing the files you open)
- Use of try / except when the error can be caught and handled using an if statement instead
- Blocks of code and / or helper functions run / called more times than is necessary

Automated Moodle testing is being used so that all submitted programs are tested the same way. However, there is randomness in the testing data. Sometimes it happens that there is one mistake in your program that means that no tests are passed, or your program gives an error for a test case, resulting in failure to proceed with other test cases, and you will get a zero grade. Remember there is a penalty for re-submissions, so it is better to check your program thoroughly in Thonny before attempting to submit it on Moodle.

Your program is running well on the provid