Assignment 3

Introduction

Grammar checkers like Grammarly or the one built into Microsoft Word might have helped you a lot when writing essays. In this assignment, you will use what you have learned about context-free grammars (CFGs) and constituency parsing to build a very simple grammar checker of your own. We provide a dataset of sentences written by English learners, some of which contain grammatical mistakes, so that you can evaluate the accuracy of your grammar checker. After that, you are required to analyse the errors made by the grammar checker and report your findings.

So how exactly does our simple grammar checker work? First of all, we have to define a set of grammar rules for English and make them machine-understandable. To be specific, we will try to come up with a very crude context-free grammar of English. Once we have the CFG for English, we run a constituency parser to parse the input sentence with that CFG. If the sentence can be successfully parsed with the CFG, we mark it as grammatically correct. If the sentence cannot be parsed, it is considered ungrammatical.

Here is an example to illustrate our grammar checker. We start with a simplified CFG that only covers the most basic sentence structure in English.

S -> Subj Verb Obj .        A sentence = subject + verb + object + period.
Subj -> NP                  The subject is a noun phrase.
Obj -> NP                   The object is also a noun phrase.
NP -> Det Noun              A noun phrase is a determiner + a noun.
Det -> the
Noun -> horse | apple
Verb -> ate
. -> .

The terminal symbols (the lexicon) are simplified for illustration purposes.

If we use a constituency parser (for example, the CYK algorithm) with the above CFG to parse the sentence "the horse ate the apple .", the parsing will succeed and produce a parse tree. But if you try the same thing with an ungrammatical sentence like "the horse ate ate the apple .", the parser will fail. In this way, we know that the sentence does not conform to the English grammar we defined and can call it ungrammatical.

[Figure: the workflow of our grammar checker.]

Usually, when writing grammars, one important component is the lexicon, i.e. the terminal symbols. For example, a grammar usually has a list of nouns like Noun -> horse | apple. We simplify this step by providing the part-of-speech (POS) tags for all input sentences. A part-of-speech tag indicates the grammatical function of a word in a sentence, for example whether the word is a noun, adjective, verb, or preposition.

In this assignment, instead of parsing the actual sentence, you will write a grammar and a parser that work on the POS tag sequence. This simplifies the task and relieves you from building a lexicon. For the example above, "The horse ate the apple .", you will be given the POS tag sequence "DT NN VBD DT NN .". To make the grammar work on POS sequences, we replace the English words in our CFG with tag names.

S -> Subj Verb Obj .        A sentence = subject + verb + object + period.
Subj -> NP                  The subject is a noun phrase.
Obj -> NP                   The object is also a noun phrase.
NP -> Det Noun              A noun phrase is a determiner + a noun.
Det -> DT
Noun -> NN
Verb -> VBD
. -> .

No need to put an entire English dictionary here!
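To make the accept/reject idea concrete, here is a small sketch (not part of the official template) of how the tag-level toy grammar above could be written in NLTK's .cfg syntax and used to decide whether a POS sequence parses. It assumes NLTK is installed; note that terminals are quoted in NLTK's format, and the period rule is written with a Punct non-terminal (as in the Part 1 template) because "." is awkward as a non-terminal name. The helper can_parse is purely illustrative.

```python
import nltk

# A tag-level version of the toy grammar above, in NLTK's .cfg syntax.
# Terminals (the POS tags) are quoted; Punct stands in for the period rule.
toy_grammar = nltk.CFG.fromstring("""
    S -> Subj Verb Obj Punct
    Subj -> NP
    Obj -> NP
    NP -> Det Noun
    Det -> 'DT'
    Noun -> 'NN'
    Verb -> 'VBD'
    Punct -> '.'
""")

parser = nltk.ChartParser(toy_grammar)

def can_parse(pos_tags):
    """Return True if the POS tag sequence has at least one parse tree."""
    try:
        return next(iter(parser.parse(pos_tags)), None) is not None
    except ValueError:
        # Raised by NLTK when some tag is not covered by the grammar at all.
        return False

good = "DT NN VBD DT NN .".split()      # the horse ate the apple .
bad = "DT NN VBD VBD DT NN .".split()   # the horse ate ate the apple .

print(can_parse(good))   # True  -> the checker calls the sentence grammatical
print(can_parse(bad))    # False -> the checker calls the sentence ungrammatical

for tree in parser.parse(good):         # show the successful parse
    tree.pretty_print()
```

In Parts 1 and 2 you will do the same thing with your own, much richer grammar loaded from grammars/toy.cfg and applied to the whole dataset.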
Tasks

Input: sentences with POS tags

The input is a tsv (tab-separated values) file like this sample:

id      label   sentence                                        pos
73      0       Many thanks in advance for your cooperation .   JJ NNS IN NN IN PRP$ NN .
74      1       At that moment we saw the bus to come .         IN DT NN PRP VBD DT NN TO VB .

The id column is the unique id of each sentence. The label column indicates whether a sentence contains grammar errors (1 means it has errors, 0 means it is error-free). In the sentence column, the original sentence is already tokenized, with tokens separated by a single space, so you can use the str.split() function to get the tokens. The pos column contains the POS tags for the tokens in the sentence, also separated by a single space. The POS tags follow the Penn Treebank (PTB) tagging scheme, as described here.

Part 1: Building a toy grammar

The first step is to write a toy CFG for English in NLTK's .cfg format. You can start with the grammar below (included in the repository) and modify its rules or add new production rules.

S -> NP VP Punct                            Sentence = noun phrase + verb phrase + period.
PP -> Adp NP                                Prepositional phrase = preposition + noun phrase.
NP -> Det Noun | NP PP                      Noun phrase.
VP -> Verb NP | VP PP                       Verb phrase.
Verb -> VB | VBD | VBG | VBN | VBP | VBZ    Include all the inflections of a verb.
......                                      More production rules go here.
Det -> DT
Noun -> NN
VB -> VB
Adp -> ADP
Punct -> PUNCT
......

This part will be included in the template. See the appendix for the definitions.

Part 2: Constituency parsing

Use the chart parser from NLTK to parse each POS sequence in the dataset with the toy grammar you wrote in Part 1. The results should be stored in a tsv file (not csv) with three columns:

column name     description
id              The id of the input sentence.
ground_truth    The ground-truth label of the input sentence, copied from the dataset.
prediction      1 if the sentence has grammar errors (the POS sequence could not be parsed with your grammar and parser), 0 if it could be parsed.

Part 3: Evaluation and error analysis

In this part, you will evaluate the performance of your grammar checker by calculating its precision and recall. From the output of Part 2 you can obtain the following counts:

TP (true positive): the sentence contains errors and your checker found them.
FP (false positive): the sentence is grammatical but your checker flagged it as ungrammatical.
FN (false negative): the sentence is ungrammatical but your checker labelled it as correct.
TN (true negative): the sentence is grammatical and your checker agrees.

Precision and recall are defined respectively as:

precision = TP / (TP + FP)
recall = TP / (TP + FN)

After computing these two numbers, look into how and why your grammar checker did not perform well (if that is the case). Identify at most 3 reasons for the false positives and at most 3 reasons for the false negatives.

Output: TSV file, grammars, code, report, and documentation

Your final repository should have this structure:

TSV file
Store the output from Part 2 in tsv format in the output/ folder.

Grammars
Save the toy grammar you created to grammars/toy.cfg. The grammar should be in NLTK's .cfg format.

Code
Put all the code in the code/ folder.

Report
In your report, include the precision, the recall, and how and why the grammar checker produced false positives and false negatives, as instructed in Part 3. Also answer the question: with our current design, is it possible to build a perfect grammar checker? If so, what resources or improvements are needed? If not, briefly justify your answer.

Documentation
You should let the TAs know how to set up a folder as the input folder, which libraries are needed to run your project, how to run your Python script, and where to look for the output file. You can do all of that in a README file placed inside your project folder.
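For Parts 2 and 3, the pipeline can be fairly short. The sketch below shows one possible way to wire it together; it is not a required implementation. The file paths (data/sentences.tsv, grammars/toy.cfg, output/predictions.tsv), the column names taken from the sample above, and the predict helper are all assumptions for illustration, so adjust them to your own repository layout and file headers.

```python
import csv
import nltk

# Hypothetical paths -- adjust them to your own folder layout.
INPUT_TSV = "data/sentences.tsv"
GRAMMAR_FILE = "grammars/toy.cfg"
OUTPUT_TSV = "output/predictions.tsv"

# Load the toy grammar (NLTK .cfg format) and build a chart parser for it.
with open(GRAMMAR_FILE, encoding="utf-8") as f:
    grammar = nltk.CFG.fromstring(f.read())
parser = nltk.ChartParser(grammar)

def predict(pos_sequence):
    """Return 1 (has errors) if the POS sequence cannot be parsed, else 0."""
    tags = pos_sequence.split()
    try:
        tree = next(iter(parser.parse(tags)), None)
    except ValueError:
        # Some tag is not covered by the grammar at all, so parsing fails.
        return 1
    return 0 if tree is not None else 1

# Part 2: parse every POS sequence and record the predictions.
# Column names ("id", "label", "pos") are assumed from the sample above.
rows = []
with open(INPUT_TSV, encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        rows.append({"id": row["id"],
                     "ground_truth": row["label"],
                     "prediction": predict(row["pos"])})

with open(OUTPUT_TSV, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "ground_truth", "prediction"],
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Part 3: precision and recall, where the positive class is "contains errors".
tp = sum(r["prediction"] == 1 and r["ground_truth"] == "1" for r in rows)
fp = sum(r["prediction"] == 1 and r["ground_truth"] == "0" for r in rows)
fn = sum(r["prediction"] == 0 and r["ground_truth"] == "1" for r in rows)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision = {precision:.3f}  recall = {recall:.3f}")
```

One practical note: chart parsing can become slow on long sentences once your grammar grows permissive, so stopping after the first parse tree (as the next(...) call above does) rather than enumerating all parses keeps the run time manageable.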
For more tips on how to write a clear and helpful README file, follow this link.

Suggestions

General suggestions
● If or when in doubt, ask for help right away.
● Don't push yourself too hard, as English is probably NOT a context-free language.
● Start early: familiarize yourself with the task and start thinking about it sooner rather than later.
● When building the toy grammar, look at English sentences from a corpus and try to sum up the rules. You are also encouraged to look for resources about English grammar or CFGs for English.

Useful links
● Chapter 8 of the NLTK book introduces grammars, CFGs, and constituency parsing in the context of NLTK.
● Chapter 12 of J+M introduces CFGs and English grammar rules. Chapter 13 covers algorithms for constituency parsing.
● The PTB POS tag set is described here.
● The NLTK documentation gives more details about its APIs.