辅导COMP26120课程、写作C++，Java

” 辅导COMP26120课程、写作C++，JavaAcademic Session: 2020-21Lab Exercise 3: Spellchecking (Better Trade-offs)Duration: 2 weeksYou should do all your work on the lab123 branch of the COMP26120 2020 repository – see Blackboardfor further details. You will need to make use of the existing code in the branch as a startingpoint.Important: This is the third of three labs. This is an assessed lab. You have the choice tocomplete the lab in C, Java or Python. Program stubs for this exercise exist for each language. Onlyone language solution will be marked. To submit your work please run the submit script in therelevant directory this means: If you are submitting a C solution you should run submit lab3 c.sh If you are submitting a Java solution you should run submit lab3 java.sh If you are submitting a Python Solution you should run submit lab3 python.shThe most recent submission is the one that will count.Also Important: After submitting your Code log in to COMPjudge to check that you are passingall of the tests (see Blackboard for details on COMPjudge). If you are not then you can resubmit.If you do not see the submission in COMPjudge then first check that you can see the tag in yourGitLab page make sure you are on the correct branch for the submit scripts to work.For this assignment, you can only get full credit if you do not use code from outsidesources.Note on extension activities: The marking scheme for this lab is organised so that you can getover 80% without attempting any of the extension activities. These activities are directed at thosewanting a challenge and may take considerable extra time or effort.Reminder: It is bad practice to include automatically generated files in source control (e.g. your gitrepositories). This applies to object files (C), class files (Java), and compiled bytecode files (Python).This is a general poor practice but for this course it can also cause issues with the online tests.1Learning ObjectivesBy the end of this lab you will be able to: Describe the role of the hash function in the implementation of hash tables and describe andcompare various hash collision strategies Describe the basic structure of a binary search tree as well as the basic operations and theircomplexities Write C, Java, or Python code to implement the above conceptsIntroductionThe aim of this exercise is to use the context of a simple spell-checking program to explore the binarysearch trees and hash tables data structures.Data structuresThe spell-checking stores the dictionary of words using a Set datatype. There are three data structuresused to implement this Set datatype in this lab. You have already used the dynamic array in Lab 1.In this exercise we use hash tables and binary search trees to implement the Set datatype. Thesehave been introduced in lectures and we describe these briefly below. You may want to look at therecommended textbooks or online to complete your knowledge.Hash Table. The Underlying memory structure for a hash table is an array. A hash function isthen used to map from a large key domain to the (much smaller) set of indices into the array, wherea value can be stored. When two values in the input domain map to the same index in the array thisis called a collision and there are multiple ways to resolve these collisions. To use a hash table torepresent a set we make the key and value the same – the result is usually called a hash set.Binary Search Tree. You can think of a tree as a linked list where every node has two children.This allows us to Differentiate between what happens in each sub-tree. In a binary search tree thegeneral idea is that the left sub-tree holds values smaller than that found in the current node, andthe right sub-tree holds larger values.Lab 3: Better StorageIn the Lab 1 we achieved a faster find function by first sorting the input. In this part we exploretwo data structures that organise the data in a way that should make it faster to perform the findoperation.Part a: Hash Table LookupSo far our solution to a fast find function has been sorting the input and in Part b we will look atstoring it in a sorted order. In this part you will take a different approach that relies on a hashfunction to distribute the values to store into a fixed size array. We have only provided you with verybasic program stubs. You need to decide what internal data structure you need etc.The hash-value(s) needed for inserting a string into your hash-table should be derived from the string.For example, you can consider a simple summation key based on ASCII values of the individual characters,or a more sophisticated polynomial hash code, in which different letter positions are associated2with different weights. (Warning: if your algorithm is too simple, you may find that yourprogram is very slow when using a realistic dictionary.) You should experiment with the varioushash functions described in the lectures and textbook and implement at least two for comparison.One can be terrible. These Can be accessed by the modes already given in the program stubs, pleasedo not change these. Write your code so that you can use the m parameter, which sets the mode.Initially, you need to use an open addressing hashing strategy so that collisions are dealt with by usinga collision resolution function. As a first step you should use linear probing, the most straightforwardstrategy. To understand what is going on you should add code to count the number of collisions soyou can calculate the average per access in print stats.An issue with open addressing is what to do when it is not possible to insert a value. To make thingssimple, to begin with you can simply fail in such cases. Your code should keep the num entries fieldof the table up to date. Your insert function should check for a full table, and exit the programcleanly should this occur. Once this is working you should increase (double?) the hash table sizewhen the table is getting Full and then rehash into a larger table. In C, beware of memory leaks!1Once you have done this, you can submit to COMPjudge to check that it works correctly.Extension Activities. If you still have time, consider also implementing alternative methods fordealing with collisions. But make sure you have completed Parts b c without these extensions firstas it is likely to be more work than it may initially appear. The alternative methods you may consider(and reasons for considering them) are:1. Quadratic Probing. It is a good idea to get used to the general quadratic probing scheme. Theobservant among you will notice its a common question in exam papers!2. Double Hashing. You have two hash functions anyway, put them to work!3. Separate Chaining. This is the most challenging as it requires making use of an additionaldata structure2 but it is also how many hash table implementations actually work so worthunderstanding.During the marking of this part you will be asked to discuss the asymptotic algorithmic complexityof your function f ind, And the potential problems that linear and quadratic probing may cause withrespect to clustering of the elements in a hash table. You may be asked these questions even ifyou did not implement these collisions resolution methods.Part b: Binary Search Tree LookupIn this part you need to edit the program stub in your chosen language for binary search trees. Youshould:1. Complete the insertion function by comparing the value to insert with the existing value. Whatshould you do if they are equal? Hint: this is representing a set.2. Complete the find function. Importantly, it should follow the same logic as insertion.3. Extend the code to record statistics such as the height and the average number of comparisonsper insertion or find. This will require some extra fields in the class (Java, Python) or nodestruct (C).1One can also get memory leaks in memory managed languages but these are more semantic e.g. preserving areference to something you will never use again.2You dont need to write this yourself. In Java and Python you can use the standard library. In C you can use animplementation you find online.3Once you have done this, submit to COMPjudge and check the tests.In this lab exercise we Will stop there with trees. However, it is worth noting that this implementationis not optimal. What will happen if the input dictionary is already sorted? In the next lab we will beexploring self-balancing trees.TestingOnce you have implemented hash sets and binary trees you should test them. We have provided a testscript and some data. For example, to run simple tests for your hash set implementation in modes 0,1, and 2 if your language is Java you should run./python3 test.py simple hashset java 2You might want to edit the script to do different things. We have also provided some large tests(replace simple by large Above) which will take a lot longer to run (see help box below). You willneed to unzip the files by running unzip henry.zip in the data/large directory. These large testsshould give you an insight into the relative performance of the different techniques.Failing Tests:If the tests are failing there are two possible explanations. Either your implementation iswrong or the test script is being more picky than you thought.For the first reason, make sure you understand what you expect to happen. Create a smalldictionary and input file yourself and test using these. Remember that the program shouldprint out all of the words that are in the input file but not in the dictionary. For binarysearch make sure that you are searching using the same order that you sort with. Makesure that your code is connected up properly in find so that it returns true/false correctly.For the second reason, make sure that you dont print anything extra at verbosity level 0. Ifyou look at test.py youll see that what its doing is comparing the output of your programbetween Spellchecking: and Usage statistics: against some expected output. It runs theprogram at verbosity level 0 and expects that the only spelling errors are output in-betweenthose two points. See the note below for C programs for an extra thing to check.You should use the relevant submit command to upload your work to GitLab and tag it appropriately.This provides a way for TAs to see your code and is a good backup there are a few cases everyyear where students lose code that could have been avoided if they had kept their work up to date onGitLab. When you do this, it will be picked up by COMPjudge for an online version of these tests see Blackboard for details on how to access these; they will be used as part of assessment later.Part c: Making a ComparisonYou should now perform an experimental evaluation like the one you did for Lab 2. Ideally, thisshould compare at least one implementation of dynamic arrays (with or without sorting) either usingour own implementation From Lab 1 or the model solutions, your implementation of binary trees andat least one implementation of hash sets.You should address the question:Under what conditions are different implementations of the dictionary data structurepreferable?4Note that for hash set the initial size of the data structure will now play a larger role. You may findit useful to correlate your findings with statistics such as the number of collisions in the hash table.We recommend you read, if you have not already done so, the Lab 2 instructions on how to generateinputs for your experiments. You are also encouraged to reuse the inputs you generated as part ofthat lab as part of your evaluation here.Write-Up. It is important to keep notes on Your experimental design and results. Ideally, thesewould then be written up as a detailed and well-argued LATEX report. However, we would like you tofocus your time on doing the evaluation rather than writing about it.Therefore, we ask you to complete the provided report.txt document. This includes somequestions to answer Before you start. If you wish to turn this into a LATEX document withgraphs etc then this is a bonus but not required.Note that it is more important to write clearly about your experimental design (justifying yourdecisions) than it is to write about your results and your analysis of them. The answers to the wrongquestion are useless.Instructions for C SolutionsIf you intend to provide a solution in C you should work in the c directory of the COMP26120 2020repository. All program stubs and other support files for C programs can be found in this directory.The completed programs will consist of several .c and .h files, combined together using make(the makefile covers Labs 1-3). You are given a number of complete and incomplete components tostart with: global.h and global.c – define some global variables and functions speller.c – the driver for each spell-checking program, which:1. reads strings from a dictionary file and inserts them in your data-structure2. reads strings from a second text file and f inds them in your data-structure- if a string is not found then it is assumed to be a spelling mistake and is reported to theuser, together with the line number on which it occurred set.h – defines the generic interface for the data-structure hashset.h and hashset.c – includes a partial implementation for hash sets that you need tocomplete in Part a. bstree.h and bstree.c – includes a partial implementation for binary search trees that you needto complete in Part b.Fix: Please note that there is a Small error in speller.c that will cause the test.py scriptto fail. On line 219 the print statement should be outside of the verbosity check so that italways prints. Please move this before using test.py.Note: The code in speller.c that reads words treats any non-alphabetic character as aseparator, so e.g. non-alphabetic would be read as the two words non and alphabetic.This is intended to extract words from any text for checking (is that – a hyphen or asubtraction?) so we must also do it for the dictionary to be consistent. This means thatyour code has to be able to deal with duplicates i.e. recognise and ignore them. For example,on my PC /usr/share/dict/words is intended to contain 479,829 words (1 per line) but isread as 526,065 words of which 418,666 are unique and 107,399 are duplicates.5Running your codeTo test your implementation, You will need a sample dictionary and a sample text file with some textthat contains the words from the sample dictionary (and perhaps also some spelling mistakes). Youare given several such files in the data directory of the COMP26120 2020 repository, and you willprobably need to create some more to help debug your code. These files will need to be copied to thedirectory where you are working or you will need to set your P AT H so that your program can beexecuted from anywhere.You should also test your program using a larger dictionary. One example is the Linux dictionarythat can be found in /usr/share/dict/words.Compile and link your code using make hashset (for part a), or make bstree (for part b). These willcreate executables called speller hashset and speller bstree respectively.When you run your spell-checker program, you can: use the d flag to specify a dictionary. (for part a) use the s flag to specify a hash table size: try a prime number greater than twicethe size of the dictionary (e.g. 1,000,003 for /usr/share/dict/words). use the m flag to specify a particular mode of operation for your code (e.g. to use a particularhashing algorithm). You Can access the mode setting by using the corresponding mode variablein the code you write. Note that the modes you should use have already been specified in existingheader files. use the v flag to turn on diagnostic printing, or vv for more printing (or vvv etc. – themore vs, the higher the value of the corresponding variable verbose). We suggest using thisverbose value to control your own debugging output.e.g.:speller hash -d /usr/share/dict/words -s 1000003 -m 2 -v sample-fileInstructions for Java SolutionsIf you intend to provide a solution in Java you should work in the java directory of the COMP26120 2020repository. The completed programs will form a Java package called comp26120. You can find this asa sub-directory of the java directory. All program stubs and other support files for Java programscan be found in this directory.You are given a number of complete and incomplete components to start with: GetOpt.java, LongOpt.java and MessagesBundle.properties – control command line optionsfor your final program. You should not edit these. speller config.java – defines configuration options for the program. speller.java – the driver for each spell-checking program, which:1. reads strings from a dictionary file and inserts them in your data-structure2. reads strings from a second text file and f inds them in your data-structure- if a string is not found Then it is assumed to be a spelling mistake and is reported to theuser, together with the line number on which it occurred6 set.java – defines the generic interface for the data-structure set factory.java – defines a factory class to return the appropriate data structure to the program.This will be used in later labs. hashset.java -includes a partial implementation for hash sets that you need to complete in Parta. speller hashset.java – sub-classes speller for use with hashset. bstree.java – includes a partial implementation for binary search trees that you need to completein Part b. You will need to edit this. speller bstree.java – sub-classes speller for use with bstree.Note: The code in speller.java that reads words treats any non-alphabetic character as aseparator, so e.g. non-alphabetic would be read as the two words non and alphabetic.This is intended to extract words from any text for checking (is that – a hyphen or asubtraction?) so we must also do it for the dictionary to be consistent. This means thatyour code has to be able to deal with duplicates i.e. recognise and ignore them. For example,on my PC /usr/share/dict/words is intended to contain 479,829 words (1 per line) but isread as 526,065 words of which 418,666 are unique and 107,399 are duplicates.Running your codeTo test your implementation, you will need a sample dictionary and a sample text file with some textthat contains the words from the sample dictionary (and perhaps also some spelling mistakes). Youare given several such files in the data directory of the COMP26120 2020 repository, and you willprobably need to create some more to help debug your code. These files will need to be copied tothe java directory or you will need to set your CLASSP AT H so that your program can be executedfrom anywhere.You should also test your program using a larger dictionary. One example is the Linux dictionarythat can be found in /usr/share/dict/words.Compile your code using Javac *.java. This will create an executable called speller bstree.class(for Part a) and speller hashset.class (for Part a). To run your program you should call javacomp26120.speller bstree sample-file and java comp26120.speller hashset sample-file respectively.Note that you will either need to set your CLASSP AT H to the java directory of theCOMP26120 2020 repository or call java comp26120.speller bstree sample-file (resp. javacomp26120.speller hashset sample-file) in that directory.When you run your spell-checker program, you can: use the d flag to specify a dictionary. (for part a) use the s flag to specify a hash table size: try a prime number greater than twicethe size of the dictionary (e.g. 1,000,003 for /usr/share/dict/words). use the m flag to specify a particular mode of operation for your code (e.g. to use a particularsorting or hashing algorithm). You can access the mode setting by using the correspondingmode variable in the code you write. Note that the modes you should use have already beenspecified in existing header files.7 use the v flag to turn on diagnostic printing, or vv for more printing (or vvv etc. – themore vs, the higher the value of the corresponding variable verbose). We suggest using thisverbose value to control your own debugging output.e.g.:java comp26120.speller hashset -d /usr/share/dict/words -s 1000003 -m 2 -v sample-fileInstructions for Python SolutionsIf you intend to provide a Solution in Python you should work in the python directory of theCOMP26120 2020 repository. All program stubs and other support files for Python programs canbe found in this directory. You will need to use Python 3.You are given a number of complete and incomplete components to start with: config.py – defines configuration options for the program. speller.py – the driver for each spell-checking program, which:1. reads strings from a dictionary file and inserts them in your data-structure2. reads strings from a second text file and f inds them in your data-structure- if a string is not found then it is assumed to be a spelling mistake and is reported to theuser, together with the line number on which it occurred set factory.py – defines a factory class to return the appropriate data structure to the program.This will be used in later labs. hashset.py – includes a partial implementation for binary search trees that you need to completein Part a. You will need to edit this. speller hashset.py – front end for use with hashset which then calls the functionality in speller.py. bstree.py – includes a partial implementation for binary search trees that you need to completein Part b. You will need to edit this. speller bstree.py – front end for use with bstree which then calls the functionality in speller.py.Note: The code in speller.py that reads words treats any non-alphabetic character as aseparator, so e.g. non-alphabetic would be read as the two words non and alphabetic.This is intended to extract words from any text for checking (is that – a hyphen or asubtraction?) so we must also do it for the dictionary to be consistent. This means thatyour code has to be able to deal with duplicates i.e. recognise and ignore them. For example,on my PC /usr/share/dict/words is intended to contain 479,829 words (1 per line) but isread as 526,065 words of which 418,666 are unique and 107,399 are duplicates.Running your codeImportant: To run the provided program stubs in python2.7 (which is supplied on the CS providedvirtual machine) you Will need to install the enum34 package. You can do this from the UNIXcommand line.To test your implementation, you will need a sample dictionary and a sample text file with sometext that contains the words from the sample dictionary (and perhaps also some spelling mistakes).You are given several such files in the data directory of the COMP26120 2020 repository, and youwill probably need to create some more to help debug your code. These files will need to be copied8to the python directory or you will need to set your P Y T HONP AT H so that your program can beexecuted from anywhere.You should also test your program using a larger dictionary. One example is the Linux dictionarythat can be found in /usr/share/dict/words.To run your program you should call python3 speller hashset.py sample-file (where sample f ile is a sample input file for spell checking) for Part a and python3 speller bstree.py sample-filefor Part b. Note that you will either need to set your P Y T HONP AT H to the python directory ofthe COMP26120 2020 repository or call python3 speller hashset.py sample-file (resp. python3speller bstree sample-file) in that directory.When you run your spell-checker program, you can: use the d flag to specify a dictionary. (for part a) use the s flag to specify a hash table size: try a prime number greater than twicethe size of the dictionary (e.g. 1,000,003 for /usr/share/dict/words). use the m flag to specify a particular mode of operation for your code (e.g. to use a particularsorting or hashing algorithm). You can access the mode setting by using the correspondingmode variable in the code you write. Note that the modes you should use have already beenspecified in existing Configuration files. use the v flag to turn on diagnostic printing, or vv for more printing (or vvv etc. – themore vs, the higher the value of the corresponding variable verbose). We suggest using thisverbose value to control your own debugging output.e.g.:python3 speller hashset.py -d /usr/share/dict/words -s 1000003 -m 2 -v sample-fileMarking SchemeThis coursework is worth 10% of your final mark for COMP26120. This means each mark in thismark scheme is worth 0.5% of your final mark for the module.Part a. Has the basic hash table been implemented including at least one suitable hashfunction and insertion/lookup using linear probing? (Maximum 3 marks)The hash table is properly initialised. The hash function is sensible (unusual hash functionsshould be carefully justified). Insertion and lookup using linear probing has been implementedcorrectly (ignores Duplicates, performs insertions when it should, always finds thingsthat are in the table). Statistics (including number of collisions) are reported.3As above but there is a small flaw e.g. failure to detect duplicates or statistics missing 2As above but a key component (e.g. the hash function) does not work as required or is farfrom optimal. For example, if the hash function does not produce a value in the requiredrange or lookup using linear probing terminates too early.1No attempt has been made 0Part a. Has a second hash function been given that is different from the first? (Maximum1 mark)There is a second hash function and the student can explain how it works 1A second hash function exists but it is either flawed or cannot be explained 0.5No attempt has been made 09Part a. Has rehashing been implemented? (Maxiumum 1 mark)Rehashing has been implemented and works correctly 1An attempt has been made to implement rehashing but it is flawed. 0.5No attempt has been made 0Part a. Does the student understand the complexity of hash table operations and theissues related to probing? (Maximum 2 marks)The student can explain the complexity of the insert and find functions. The student candiscuss potential problems that linear and quadratic probing may cause with respect toclustering of the elements in a hash table.2The student can almost explain all of the concepts 1.5The student can explain either the complexity or the problems with probing but not both. 1No attempt has been made 0Part b. Has the implementation of binary search tree been completed completely andcorrectly? Note that Code quality is addressed later. (Maximum 2 marks)Insertion uses a suitable ordering on values to create an ordered binary tree. Duplicates arehandled properly. Find uses the same ordering to correctly identify whether a value is inthe tree. Appropriate statistics are computed and presented.2As above but statistics are missing or inappropriate 1.5As above but duplicates are not detected. However, the student can explain how to detectduplicates AND what the negative impact of not detecting duplicates is.1.5As above but duplicates are not detected 1Some of the functionality is missing or does not work e.g. sometimes values are not insertedor sometimes find does not detect if a value is in the tree.0.5No attempt has been made 0Part b. The student understands how ordered binary trees work and the complexity ofthe associated algorithms. (Maximum 2 marks)The student can give the complexity of insert and find in an ordered binary tree and explain*why* this is the case. The student can compare this to a similar setting in linked lists.The student can explain what a balanced tree is and what impact the structure of the treehas on find.2The student can State the complexities of the different operations but struggles to explainwhy these hold.1The student can attempt an explanation but the topic needs to be revised. 0.5No attempt has been made 0Part c. Has the experimental analysis been well-designed? (Maximum 2 marks)There is evidence that the student has carefully considered how best to answer the statedquestion. The choice of conditions to vary have been justified. The student discusses whatthey expect to find and links this to the theoretical complexities of the different approaches.It is easy to read.2Design decisions are Justified but there is no reference to the theoretical complexities of thedifferent approaches The design is described without justification of the ch”