
University of St Andrews
School of Computer Science

CS1003 Programming with Data
P1 Text Processing

Deadline: 5 February 2021
Credits: 10% of coursework mark

MMS is the definitive source for deadline and credit details.

You are expected to have read and understood all the information in this specification and any accompanying documents at least a week before the deadline. You must contact the lecturer regarding any queries well in advance of the deadline.

This practical involves reading data from a file and using basic text processing techniques to solve a specified problem. You will need to decompose the problem into a number of methods as appropriate and classes if necessary. You will also need to test your solution carefully and write a report.

Task

The task is to write a Java program to perform string similarity search among words stored in a text file. The code you are going to write is similar to code that is found in spell-checkers. Your program should accept two command line arguments: the first is the path of a text file (which contains a dictionary of commonly used English words) and the second is a query string. The program should then read the text in the file and split it into lines, where each line contains a single word. Then, calculate a similarity score between the query word and each word read from the file. Finally, print the closest match from the file (the word with the highest similarity score) to standard output, together with the similarity score. Place your main method in a class called CS1003P1, in the file CS1003P1.java.
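The first part of the task (taking two command line arguments and reading the dictionary one word per line) could be sketched roughly as follows. This is only an illustration under assumptions: the class name DictionaryReaderSketch and the helper readWords are hypothetical, and Files.readAllLines from java.nio.file is one of several standard ways to read a file line by line, not necessarily the one the module intends.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch: reads a dictionary file into a list of words.
final class DictionaryReaderSketch {

    /** Returns every line of the file as one list entry (assumes one word per line, UTF-8). */
    static List<String> readWords(Path dictionary) throws IOException {
        return Files.readAllLines(dictionary, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Expect exactly two arguments: the dictionary path and the query string.
        if (args.length != 2) {
            System.err.println("Usage: java DictionaryReaderSketch <dictionary_file> <query>");
            return;
        }
        try {
            List<String> words = readWords(Paths.get(args[0]));
            System.out.println("Read " + words.size() + " words; query is " + args[1]);
        } catch (IOException e) {
            // A missing or unreadable file is reported instead of crashing with a stack trace.
            System.err.println("Could not read " + args[0] + ": " + e.getMessage());
        }
    }
}
```

Keeping the file reading in its own method makes it easy to test separately from the similarity calculation.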
Some example runs are as follows.

Searching for the closest word to strawberry:

    java CS1003P1 ../data/words_alpha.txt strawberry
    Result: strawberry
    Score: 1.0

Searching for the closest word to stravberry:

    java CS1003P1 ../data/words_alpha.txt stravberry
    Result: strawberry
    Score: 0.6923077

Searching for the closest word to ztravberry:

    java CS1003P1 ../data/words_alpha.txt ztravberry
    Result: strawberry
    Score: 0.46666667

String similarity

There are several ways of calculating a similarity score between strings; in this practical we ask you to use a Jaccard index on character bigrams. This might sound scary at first, but don't worry! We will now define what we mean and give an example.

Jaccard index

The Jaccard index is a similarity measure between sets of objects. It is calculated by dividing the size of the intersection of the two sets by the size of the union of the same two sets. If the two sets are very similar, the value of the Jaccard index will be close to 1 (if the two sets are identical it will be exactly 1). On the other hand, if the two sets are very dissimilar, the value of the Jaccard index will be close to 0 (if the two sets are disjoint it will be exactly 0). Try drawing a few simple Venn diagrams to convince yourselves of this! Wikipedia has a good article on the Jaccard index as well: https://en.wikipedia.org/wiki/Jaccard_index

Character bigrams

A character bigram is a sequence of two consecutive characters in a string. Bigrams have applications in several areas of text processing like linguistics, cryptography, speech recognition, and text search. In this practical, we will calculate the Jaccard index on sets of bigrams for calculating a similarity score between strings.
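The two definitions can be sketched in Java as shown here. This is a minimal illustration, not a reference solution: the class name JaccardSketch and both method names are hypothetical, the use of retainAll and addAll from java.util.Set for intersection and union is one possible choice, and treating two empty sets as identical is an assumption made to avoid dividing by zero.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of character bigrams and the Jaccard index.
final class JaccardSketch {

    /** Returns the set of two-character substrings of s (empty if s has fewer than two characters). */
    static Set<String> bigrams(String s) {
        Set<String> result = new LinkedHashSet<>(); // keeps first-seen order for readable output
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2)); // duplicates are dropped by the set
        }
        return result;
    }

    /** Jaccard index: size of the intersection divided by size of the union. */
    static <T> float jaccard(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b); // keep only elements also in b
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0f; // assumption: two empty sets count as identical
        }
        return (float) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(bigrams("cocoa")); // prints [co, oc, oa]
        Set<Integer> x = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> y = new HashSet<>(Arrays.asList(1, 2, 4));
        System.out.println(jaccard(x, y)); // prints 0.5
    }
}
```

Copying the inputs before calling retainAll and addAll matters, because both methods modify the set they are called on.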
Following is an example of the set of bigrams for the string cocoa: co, oc, oa.

Notice that since we generate a set of bigrams, we avoid repeating co twice.

Your program should contain a method to create a set of bigrams for a given string.

Top and tail

Adding special characters to the start and the end of a string before calculating the set of bigrams can improve string similarity search. This is often done by adding a ^ character to the beginning and a $ character to the end of the string. On the same example, cocoa, we first add the special characters to either side and get ^cocoa$. The set of bigrams becomes: ^c, co, oc, oa, a$.

Suggested steps

- Download the text file words_alpha.txt from StudRes (source: https://github.com/dwyl/english-words) and save it to a known location. You should not submit this file as part of your submission.
- Create a Java class called CS1003P1 and write a program that is able to read the data stored in this text file line by line. In order to check that this works, print each line to standard output. See method from the class.
- Write a method to calculate character bigrams of a given string and store them in a set. See the and classes and the method that they implement. You may test your method with the string cocoa; the output should match the given output above.
- Implement top-and-tail as described above and update your bigram calculation to use this functionality.
- Implement the Jaccard index calculation. Which two sets will you calculate the Jaccard index on? We suggest that you use the method for implementing set intersection and the method for implementing set union. The method returns the size of a set. If you calculate the Jaccard index between the set {1,2,3} and the set {1,2,4} (where the size of the intersection is 2 and the size of the union is 4) you should get 2/4 = 0.5 as the result.
- Combining character bigrams, top-and-tail and Jaccard index you now have a way of calculating a similarity score between two strings.
  Use this to calculate the score between the query word and each word from the file in a loop. Keep track of the best score (and the word that has the best score!) for reporting at the end.
- We suggest that you print the best matching string and the corresponding similarity score as you iterate through the dictionary during development. This can help you with testing your program.

Auto-checker and Testing

This assignment makes use of the School's automated checker stacscheck. You should therefore ensure that your program can be tested using the auto-checker. It should help you see how well your program performs on the tests we have made public and will hopefully give you an insight into any issues prior to submission. The automated checking system is simple to run from the command line in your CS1003-P1 directory:

Make sure to type the command exactly; occasionally, copying and pasting from the PDF specification will not work correctly. If you are struggling to get it working, ask a demonstrator.

The automated checking system will only check the basic operation of your program. It is up to you to provide evidence that you have thoroughly tested your program.

Submission

Report

Your report must be structured as follows:

- Overview: Give a short overview of the practical: what were you asked to do, and what did you achieve? Clearly list which parts you have completed, and to what extent.
- Design: Describe the design of your program. Justify the decisions you made. In particular, describe the classes you chose, the methods they contain, a brief explanation of why you designed your solution in the way that you did, and any interesting features of your Java implementation.
- Testing: Describe how you tested your program. In particular, describe how you designed different tests. Your report should include the output from a number of test runs to demonstrate that your program satisfies the specification.
  Please note that simply reporting the result of stacscheck is not enough; you should do further testing and explain in the report how you convinced yourself that your program works correctly.
- Evaluation: Evaluate the success of your program against what you were asked to do.
- Conclusion: Conclude by summarising what you achieved, what you found difficult, and what you would like to do given more time.

Don't forget to add a header including your matriculation number, the name of your tutor and the date.

Upload

Package up your CS1003-P1 folder and a PDF copy of your report into a zip file as in previous weeks, and submit it using MMS, in the slot for Practical P1. After doing this, it is important to verify that you have uploaded your submission correctly by downloading it from MMS. You should also double check that you have uploaded your work to the correct slot. You can then run stacscheck directly on your zip file to make sure that your code still passes stacscheck. For example, if your file is called, save it to your Downloads directory and run:

Rubric

Marking

1-6    Very little evidence of work, software which does not compile or run, or crashes before doing any useful work. You should seek help from your tutor immediately.
7-10   An acceptable attempt to complete the main task with serious problems such as not compiling, or crashing often during execution.
11-13  A competent attempt to complete the main task. Serious weaknesses such as using wrong data types, poor code design, weak testing, or a weak report riddled with mistakes.
14-16  A good attempt to complete the main task together with good code design, testing and a report.
17-18  Evidence of an excellent submission with no serious defects, good testing, accompanied by an excellent report.
19-20  An exceptional submission. A correct implementation of the main task, extensive testing, accompanied by an excellent report.
       In addition it goes beyond the basic specification in a way that demonstrates use of concepts covered in class and other concepts discovered through self-learning.

See also the standard mark descriptors in the School Student Handbook: https://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html#Mark_Descriptors

Lateness penalty

The standard penalty for late submission applies (Scheme B: 1 mark per 8 hour period, or part thereof): https://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/assessment.html#lateness-penalties

Good academic practice

The University Policy on Good Academic Practice applies: https://www.st-andrews.ac.uk/students/rules/academicpractice/

Going Further

Here are some additional questions and pointers for the interested student.

- Character bigrams are a special case. If you are interested, look into character n-grams, which are sequences of n characters where n is not necessarily 2 (as in bigrams).
- N-grams can be constructed at the word level instead of at the character level. Look into applications of word-level n-grams and think about the use cases of character-level vs word-level n-grams.
- There are many other string similarity methods! See https://en.wikipedia.org/wiki/String_metric as a starting point. Think about use cases for string similarity methods.
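For the character n-gram pointer, one possible starting point is to generalise the bigram similarity to an arbitrary n, as in the sketch below. Everything named here (NGramSketch, topAndTail, ngrams, similarity) is hypothetical. The sketch assumes ^ and $ as the top-and-tail markers and adds only one marker at each end regardless of n, which is a simplification rather than an established convention.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: Jaccard similarity on character n-grams with top-and-tail.
final class NGramSketch {

    /** Wraps s in the assumed top-and-tail markers ^ and $. */
    static String topAndTail(String s) {
        return "^" + s + "$";
    }

    /** Returns the set of all substrings of s that are exactly n characters long. */
    static Set<String> ngrams(String s, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    /** Jaccard index of the top-and-tailed n-gram sets of a and b. */
    static float similarity(String a, String b, int n) {
        Set<String> x = ngrams(topAndTail(a), n);
        Set<String> y = ngrams(topAndTail(b), n);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        if (union.isEmpty()) {
            return 1.0f; // assumption: both strings too short for any n-gram
        }
        x.retainAll(y); // x now holds the intersection
        return (float) x.size() / union.size();
    }

    public static void main(String[] args) {
        // With n = 2 the two words share 9 of 13 distinct bigrams.
        System.out.println(similarity("stravberry", "strawberry", 2));
        // With n = 3 the same pair is compared on trigrams instead.
        System.out.println(similarity("stravberry", "strawberry", 3));
    }
}
```

Larger values of n make each shared n-gram stronger evidence of similarity, but a single wrong character then removes more n-grams from the intersection, so scores for near-misses tend to drop.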
