CS 2461 Project 5: Information Retrieval (a Search Engine)

In this project you will implement a document retrieval system that searches through a set of documents to determine the relevance of the documents with respect to a search phrase/query. The goal of this project is to further expose you to working with pointers, memory allocation and file I/O operations in C. It builds on your knowledge of data structures, and of hash tables and linked lists in particular; be sure to go over Homework 7 and read the example on linked lists in the textbook (Chapter 19) very carefully. This project has an extra credit option; first focus on completing the basic specifications before moving to the extra credit component.

You must work on your own on this project: no code or algorithm collaboration of any kind is allowed, and you CANNOT use source code from any source apart from the textbook or lecture notes. (Note: We will be running your code through a code plagiarism detection tool to detect similarities.) You can discuss the general architecture of the system and C/Unix questions. You can use the linked list and hash table code you implemented in Homework 7 (and linked list code from Software Engineering CS 2113), but make sure you have fixed any problems you had with that code (you don't want points taken off multiple times for the same errors!). Failure to adhere to these policies will constitute a violation of GW's Academic Integrity code and you will be charged with a violation: a grade of 0 on the project and at least one grade lower on the course.

Important: This project requires planning and writing a substantial amount of C code, as well as a significant amount of time testing. It is highly recommended that you start designing your solution first on paper (using a flowchart to describe the modules in your system), then determine what data structures you need, and then build the functions to manipulate your data structures. This is not a project that you can put off till the weekend before it is due and then expect to have a working project by the deadline. Section 1 describes the problem and the ranking (relevance) algorithm. Section 2 describes a simple example. Section 3 provides the project specifications and requirements, and Section 4 describes an extra credit option that leads to a more realistic system.

Problem Statement: Alice Nedwons is interested in the task of searching through (secret) documents in a directory and identifying documents of interest, that is, documents that contain specific keywords or query words. However, the directory has a large number of files, including files which have no relevance to her search parameters, and manually inspecting every file's contents would take an unacceptable amount of time. Fortunately, she has decided to hire you for the job, since she knows that you have the necessary skills and knowledge to write an efficient program to automate the search process and give her a set of relevant documents. Her plan rests on the assumption that all relevant documents (stored as plain text files) are stored in one directory. And luckily for you, this problem reduces to a special case of the document retrieval problem that can be solved using techniques that you have already studied!

1. Algorithm for search and retrieval using the tf-idf ranking function

Your task is to search through documents (i.e., files/webpages) in a directory and identify documents of relevance for a search phrase (i.e., search query) submitted by a user.
For example, if the search phrase is "computer architecture GW", you need to find the documents (i.e., files) that contain (some of) the words in the query, and you would also like to rank the documents in order of relevance. The most relevant document would be ranked first, and the least relevant file would be ranked last. This is similar (but not necessarily identical) to how you search the web using a search engine (such as Google): the results from a search engine are sorted in order of relevance. (Google's PageRank algorithm uses a very different technique from what we describe here.)

1.1 Overall Architecture of the System

A naive solution to the problem is to read all the documents (i.e., all the words in the documents) and create a single (large) linked list. Then for each query keyword, you can search through the linked list. Question (a): If we have N documents, and document Di consists of mi words, how long does this simple solution take to search for a query consisting of K words? Give your answer in big O notation. A more efficient solution can be constructed using other data structures.

This problem is an instance of a document retrieval problem, and a solution can be architected as a composition of two main phases (steps): (1) the training phase and (2) the search phase. The first step, training, consists of reading in all the documents and creating an effective data structure that will be used during the search process. The training step is nothing but a pre-processing phase of your system, and the data structure you can use to help speed up the search is a hash table. Since we will be searching for words in a search query, we can create a hash table that stores information of the form (word, pointer to bucket). Each bucket is a linked list where a node in the linked list must contain (i) the word, (ii) the document ID, and (iii) the number of times that the word appears in that document. Thus a document may appear in multiple buckets; but a word appears in a single bucket, and a word can appear multiple times in a document. Towards the end of the training phase, all documents have been read by the program and the hash table created. The final step of the training phase is the removal of stop words. Once all documents have been read and the hash index created, the system determines the stop words for this set of documents and then removes them from the index. (Stop words are words that occur frequently, such as articles and prepositions in English, or words that do not help us in the relevance ranking since they occur in most documents. Removing stop words can not only improve relevance ranking but can also speed up the search process, since the size of the data structure is reduced.) Question (b): If we assume that the hash function maps words evenly across the buckets (i.e., each bucket gets the same number of words), then what is the time complexity (big O notation) of this improved solution? Use the same parameters as Question (a) for the size of each document and the size of the query set.
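To make the index structure concrete, here is a minimal sketch, in C, of one possible layout for the bucket nodes described above. The type and field names are illustrative only and are not part of the specification:

    #include <stdlib.h>

    #define MAX_WORD_LEN 20          /* per the assumptions in Section 3.1 */

    /* one node in a bucket's linked list: (word, document ID, occurrence count) */
    struct posting {
        char word[MAX_WORD_LEN + 1];
        int doc_id;                  /* which document this count belongs to */
        int count;                   /* times the word appears in that document */
        struct posting *next;
    };

    /* the hash table itself: an array of bucket heads, sized at run time */
    struct hash_table {
        unsigned int num_buckets;
        struct posting **buckets;    /* buckets[b] is the head of bucket b's list */
    };

    /* allocate a table with a user-specified number of buckets (all lists empty) */
    struct hash_table *ht_create(unsigned int num_buckets) {
        struct hash_table *t = malloc(sizeof *t);
        if (t == NULL) return NULL;
        t->num_buckets = num_buckets;
        t->buckets = calloc(num_buckets, sizeof *t->buckets);
        if (t->buckets == NULL) { free(t); return NULL; }
        return t;
    }

Because the number of buckets is only known at run time (a requirement in Section 3.1), the bucket array is allocated dynamically rather than declared with a fixed size.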
The second step is the search/retrieval and ranking phase. The user provides a search query consisting of a number of words/terms, and the program must return the document IDs in order of relevance. If the query contains multiple words, we perform a hash table lookup for each word. A table lookup for a word gives us the corresponding bucket, and by searching through the bucket we can determine if the word exists in any of the documents and, if so, its frequency of occurrence (i.e., the count of the number of times it appeared in that document). By performing this table lookup for each word in the search query, we can compute the score, or relevance, of each document for the query. The higher the score, the more relevant the document, and the result lists the documents in decreasing order of relevance. This is where the method used to determine the relevance ranking of a document comes into play. In this project, we use the term frequency-inverse document frequency (tf-idf) score to determine the relevance of a document. This method is described next.

1.2 The tf-idf Algorithm for Ranking

The simplest way to process a search query is to interpret it as a Boolean query: either a document contains all the words or it does not. However, this usually results in too few or too many results and, further, does not provide a ranking that returns the documents most likely to be useful to the user. We wish to assign a score to how well a document matches the query, and to do this we need a way to assign a score to a document-query pair. Consider questions such as: how many times did a word occur in the document, where does the word occur, and how important is the word. The goal of relevance functions (which are the secret sauce of search engines) is to produce a score that correlates with the relevance of a document.

The term frequency-inverse document frequency (tf-idf) method is one of the most common weighting algorithms used in information retrieval and text search problems. It starts with a bag-of-words model: the query is represented by the occurrence counts of each word in the query, which means the word order is lost (for example, "john is quicker than mary" and "mary is quicker than john" both have the same representation in this model). Thus, a query of size m can be viewed as a set of m search terms/words w1, w2, ..., wm. Note: We use "word" and "term" interchangeably in what follows.

Term Frequency: a measure that quantifies the frequency of occurrence of a word within a particular document. The term frequency tf(w,i) of a term (word) w in document i is a measure of the frequency of the word in the document. Using raw frequency, this is the number of times that term (word) appears in document i. Note: There are variations of term frequency used in different search algorithms; for example, since raw frequency may not always relate to relevance, some divide the frequency by the number of words in the document to get a normalized raw frequency. Further, some compute the log frequency weight of term w as the log of tf. For this project, you use the simple definition: the number of times the word appears in the document. You could, if you prefer, use the normalized variant (divide by the number of words in the document) or the logarithm-scaled variant, but clearly indicate in your report and code documentation which metric you are using IF you are not using the simple definition. One of the problems with the tf score is that common (and stop) words can get a high score, for example terms like "and", "of", etc. In addition, if the writer of a document wants a high score, they can bias the search engine by replicating words in the document.
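With the bucket layout sketched earlier, retrieving the raw term frequency is just a walk down the bucket's list. The following sketch repeats the same hypothetical node type so it stands alone:

    #include <string.h>

    struct posting { char word[21]; int doc_id; int count; struct posting *next; };

    /* raw term frequency tf(w,i): the count stored at the node for (word, doc_id),
       or 0 if the word never appears in that document */
    int raw_tf(const struct posting *bucket, const char *word, int doc_id) {
        const struct posting *p;
        for (p = bucket; p != NULL; p = p->next)
            if (p->doc_id == doc_id && strcmp(p->word, word) == 0)
                return p->count;
        return 0;
    }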
Inverse Document Frequency: Often a rare term/word is more informative than frequent terms, for example stop words (such as "the", "for", etc.). So we also consider how frequently the term occurs across the documents being searched (i.e., in the database), and the document frequency df(w) captures this aspect in the score. The document frequency df(w) is the number of documents that contain the term w. The inverse document frequency idf(w) of term w is defined as idf(w) = log(N / df(w)), where N is the total number of documents in the database (i.e., being searched). If df(w) = 0, then 1 is added to the denominator to handle the divide-by-zero case, i.e., for this case idf(w) = log(N / (1 + df(w))). (The logarithm is used, instead of N / df(w), to dampen the effect of idf.)

tf-idf weighting: The tf-idf score gives us a measure of how important a word is to a document among a set of documents. It uses a local factor (term frequency) and a global factor (inverse document frequency) to scale down the score of common terms and scale up the score of rare terms. The tf-idf(w,i) weight of a term (word) w in document i is the product of its term frequency and inverse document frequency:

    tf-idf(w,i) = tf(w,i) * idf(w)

A search query (i.e., search phrase) submitted by a user typically consists of a number of words/terms. Therefore we have to determine the relevance, or rank, of a document for the entire search phrase, consisting of some m words w1, w2, ..., wm, using the tf-idf scores for each word. The relevance, or rank, Ri of document i for this search phrase of m words is defined as the sum of the tf-idf scores for each of the m words:

    Ri = sum over j = 1 to m of tf-idf(wj, i)

Some references for more information on the tf-idf method for document retrieval:
- H. Wu, R. Luk, K. Wong and K. Kwok. Interpreting TF-IDF term weights as making relevance decisions. ACM Transactions on Information Systems, 26(3), 2008.
- J. Ramos. Using TF-IDF to determine word relevance in document queries.

1.3 Dealing with Stop Words

An important part of information retrieval algorithms involves dealing with (removing) stop words. Stop words are words which do not play a role in determining the significance or relevance of a document; these could be either insignificant words (for example, articles, prepositions, etc.) or words that are very common in the context of the documents being processed. Stop words are language dependent as well as context dependent, and there are a number of methods discussed in the literature to identify stop words and to create a list of stop words for the English language. In this project, you will use a simple heuristic (described in what follows) that identifies stop words based on the context (i.e., the set of documents). Simply stated, words that occur frequently across all documents can be tagged as stop words, since they will have little value in helping rank this set of documents for a user query. In terms of our metrics, term frequency and inverse document frequency, the lower the idf, the greater the probability that the word is a stop word. Simplifying this further, we can tag a word as a stop word if it has idf = 0 (i.e., it appears in all documents). (Note: A better solution would be to combine words with low idf with a list of stop words consisting of articles, prepositions, etc., which may or may not have a low idf score for the set of documents you are processing.) In this project you will implement a stop word removal function that removes words from your hash index based on an idf score of 0; i.e., after the hash index is built, the words with idf = 0 are removed from their buckets, resulting in a final hash index that has no words with idf = 0. This will lead to a more efficient search/query process. Question (c): Why does this lead to a more efficient search process? If you want to build a more realistic system, you can combine this algorithm with a statically provided stop word list (i.e., common prepositions, etc.) and look up the stop word list during insertion into the hash table.
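The worked example in Section 2 uses base-10 logarithms; any base works as long as you are consistent. A minimal sketch of the idf computation and the idf = 0 stop word test, assuming that convention, is:

    #include <math.h>

    /* idf(w) = log(N / df(w)); per the spec, when df is 0 the denominator becomes
       1 + df to avoid dividing by zero */
    double inverse_doc_freq(int df, int n_docs) {
        if (df == 0)
            df = 1;                            /* N / (1 + 0) */
        return log10((double)n_docs / (double)df);
    }

    /* Section 1.3: a word is treated as a stop word when its idf is 0,
       i.e., it appears in every document */
    int is_stop_word(int df, int n_docs) {
        return df == n_docs;
    }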
2. A Simple Example

Suppose you are given documents D1.txt, D2.txt and D3.txt whose contents are as follows:

D1.txt: computer architecture at GW is both torture and fun
D2.txt: computer architecture refers to the hardware and software architecture of a computer
D3.txt: Greco roman architecture is influenced by both greek architecture and roman architecture

The search query, consisting of three words/terms, is: computer architecture GW

2.1 Training (Pre-Processing) Phase

D1 contains 9 words, D2 contains 12 words, and D3 contains 12 words. The word "architecture" is common to all three documents, "computer" is common to D1 and D2, while "GW" appears only in D1. Suppose a hash function with 4 buckets (this is only an example; we are not using the actual hash function you will implement) hashes some of these words as follows (note that there are collisions; for example, both "computer" and "torture" are hashed to the same bucket):

Bucket   Words
0        computer, torture, roman
1        is, fun, and, greek, Greco, GW
2        architecture, refers, hardware
3        a, at, by, influenced, both

The pre-processing of documents D1, D2 and D3 will result in entries being placed into the hash table. Each bucket contains a pointer to a linked list; there are as many linked lists as there are buckets in the hash table. Note that the mapping of words to buckets is strictly dependent on the hash function being used, but each bucket will contain entries of the type described earlier and shown in the figure below. Note that if a word from D2 gets mapped to the same bucket, it should be added to the data structures (lists) for that bucket. If a word is repeated, then its count should be incremented; for example, the count of "computer" in D2 is 2, since "computer" appears twice in document D2.

2.2 Dealing with Stop Words

There are two options for organizing data to facilitate both the search process and the removal of stop words. The first (straightforward) approach is shown in Figure 1. In this case, each bucket has a linked list and the same word can appear multiple times in the list, but only once for each document. Further, the term frequency of the word in that document is stored at the node. To compute the idf score for each word (after reading in all the documents), the algorithm needs to traverse the linked list and compute the idf score, and if idf = 0 it should remove all occurrences (from all documents) of that word from the linked list. This option lets you use the hash map you created in Homework 7 directly, without major modifications.

The second approach is shown in Figure 2. In this case, we have a linked list node for each distinct word, and the document frequency df score for that word can be stored in that node of the bucket's list. Thus, after reading in all documents, removing stop words can be done by examining the df score at each node (in the upper-level linked list) and computing the idf score to determine whether that entire lower-level linked list needs to be deleted. If you choose to implement this option, then you will need to change your hash table implementation from Homework 7.
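A minimal sketch of the second (two-level) organization, with illustrative names only, might look like the following. Removing a stop word then amounts to unlinking one upper-level node and freeing its per-document list:

    /* lower level: one node per document in which the word appears */
    struct doc_count {
        int doc_id;
        int count;                  /* term frequency of the word in that document */
        struct doc_count *next;
    };

    /* upper level: one node per distinct word in the bucket */
    struct word_entry {
        char word[21];
        int df;                     /* document frequency: number of nodes in docs */
        struct doc_count *docs;     /* per-document counts for this word */
        struct word_entry *next;    /* next distinct word in the same bucket */
    };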
In the example, the word "and" appears in all three documents, so its document frequency is df = 3 and therefore idf = log(3/3) = 0; it is identified as a stop word and needs to be removed from the index. Question 1(d): Which of the two approaches is more efficient in terms of removing stop words, and why? Which option did you choose to implement?

2.3 Search/Retrieval Phase

During the search phase, the system takes a user query and searches for relevant documents. To determine the relevance of each document, it uses the tf-idf scoring (ranking) technique. In our example, the query set contains the words "computer", "architecture" and "GW". The retrieval process starts by searching for the first word in the query, "computer", and computing its score. "computer" gets hashed to the first bucket (bucket 0), so we search through this bucket and compute the tf-idf score of every document for the word "computer". In this example, we use the raw frequency (i.e., count) to determine the tf score.

The term frequency of the search term "computer" for each document is: tf(computer,1) = 1, tf(computer,2) = 2, tf(computer,3) = 0 (since "computer" does not occur in document D3). This step simply searches for the word (in the appropriate bucket) and retrieves the frequency count stored at the node in the list (if the word is found; otherwise it is zero).

Document frequency: the word "computer" occurs in 2 out of 3 documents, therefore df(computer) = 2.
Inverse document frequency: idf(computer) = log(3/2) = 0.17.
The tf-idf scores for the term "computer" for the three documents are:
o tf-idf(computer,1) = 1 * 0.17 = 0.17
o tf-idf(computer,2) = 2 * 0.17 = 0.34
o tf-idf(computer,3) = 0

We can similarly compute the tf-idf scores for the other two terms in the search query, "architecture" and "GW":
for the term "architecture":
o tf-idf(architecture,1) = 1 * log(3/3) = 0
o tf-idf(architecture,2) = 2 * log(3/3) = 0
o tf-idf(architecture,3) = 3 * log(3/3) = 0
for the term "GW":
o tf-idf(GW,1) = 1 * log(3/1) = 1 * 0.48 = 0.48
o tf-idf(GW,2) = 0
o tf-idf(GW,3) = 0

Using the above tf-idf scores for each term, we can compute the rank/relevance score (for the entire query "computer architecture GW") of each document as:
o R1 = 0.17 + 0 + 0.48 = 0.65
o R2 = 0.34 + 0 + 0 = 0.34
o R3 = 0 + 0 + 0 = 0

Based on the above relevance ranking, the system would return D1, D2 and D3, in that order of relevance. If the term frequency for all search terms is zero for a document, then that document should not be returned, since none of the terms in the search query appear in that document (in the example, D3 does contain "architecture"). We can also have a perfect match if all words in the query appear in a document (i.e., non-zero term frequency for all words in the query for that document). You have to figure out how to keep track of the scores and what data structure to use for this purpose.
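To make the bookkeeping concrete, here is a minimal sketch of the scoring loop. The lookup helpers are assumed (hypothetical signatures standing in for your own hash table lookups), document IDs are 0-based for simplicity, and only the score accumulation is shown:

    /* assumed helpers (not shown): wrappers around the hash table lookup */
    int lookup_tf(const char *word, int doc_id);   /* raw count of word in document */
    int lookup_df(const char *word);               /* number of documents containing word */
    double inverse_doc_freq(int df, int n_docs);   /* as sketched in Section 1 */

    /* accumulate score[d] = sum over query terms of tf(w,d) * idf(w) for every document d */
    void compute_scores(char query[][21], int n_terms, int n_docs, double *score) {
        int d, j;
        for (d = 0; d < n_docs; d++)
            score[d] = 0.0;
        for (j = 0; j < n_terms; j++) {
            double idf = inverse_doc_freq(lookup_df(query[j]), n_docs);
            for (d = 0; d < n_docs; d++)
                score[d] += lookup_tf(query[j], d) * idf;
        }
        /* the caller then sorts document IDs by descending score and omits documents
           in which no query term appears at all */
    }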
3. Specification and Requirements

This project is worth 150 points. A formal description of the problem can be stated as:

Task: Given a search query consisting of a number of words, retrieve (i.e., list) and rank the documents that are relevant to this search query.
Input: A set of plain text documents (D1, D2, ..., Dn) in a single directory that need to be stored and indexed, and a search query consisting of query words w1, ..., wq.
Output: A listing of the (names of the) documents ordered in descending order of relevance/score (i.e., the most relevant document, with the highest score, first).

3.1 Assumptions

Following are some assumptions and conditions that your solution must satisfy:
- For the sake of this assignment, assume there are at least three documents, D1.txt, D2.txt and D3.txt. You can design your solution with more than 3 documents, labeled D1.txt through Dn.txt. If you choose to implement the extra credit version (reading files from a directory), then the number of documents depends on the number of files in the directory.
- You can assume each document contains several words, and you can assume no word is longer than 20 characters (it is possible to design a solution without these assumptions).
- The documents only contain words from the English alphabet (i.e., no digits or special characters from the keyboard).
- For simplicity, you can assume all words are in lower case. But see if you can write a program that is case insensitive; if you implement this option, please indicate it clearly in your documentation (code comments).
- The query set (i.e., the search phrase/query) can be of arbitrary length (and you can again assume no word is longer than 20 characters). If you feel the need to simplify this and assume a maximum number of words in the query, then clearly state this assumption; you will lose 2.5 points for it.
- The query set (of keywords) is entered by the user at run time (after the pre-processing phase, when all the documents have been read), and you can assume the keywords are entered on one line. The program must prompt the user for the query keywords and then return the result of the search. After returning the results, the program will return to prompt for the next query set, or for the special symbol # to exit the program. If you set a maximum size for the query set, then include this in your prompt.
- The number of buckets in the hash table is specified at run time by the user.
  o If you make a simplifying assumption and assume that this size is specified statically at compile time in the program, you will incur a penalty of 5 points.
- You should not make any assumptions about the contents of the documents or the query words.

3.2 What you have to implement: Requirements and Specifications

- A hashing function that takes a string w as input and converts it into a number. Refer to Homework 7 for the function specifications. A simple (and general) process is: sum up the ASCII values of all the characters in the string to get a number S, and then apply the hash function to get bucket b. For the hash function, you can choose the simple b = S % N (i.e., S modulo N) function, where N is the number of buckets in the hash table (a minimal sketch of this scheme appears after this list). You should explore whether there are other, better hash functions for this application; if you choose a different hash function, you must define that function in your documentation and explain why you chose it. Note: Hash functions using a modulo N operation typically use a prime number for N (the number of buckets). Why do you think this is the case?
- A function that inserts a string w and the associated document number i into the hash table (into a bucket); refer to Homework 7 for the function specification (hm_put). Take care to ensure that the frequency of the string in that document is updated if the string has appeared before in the document (i.e., if it has already been inserted into the table). This function will need to call some of the linked list functions you need to implement. If you need to refer to code to implement the linked list functions, read Chapter 19; you can use the code provided in the book.
- A function training for the training process, i.e., pre-processing, that takes a set of documents as input and returns the populated hash table as output. Figure out the specifications for this function.
- A function read_query to read the search query.
- A function rank in the search/retrieval process that computes the score for each document and ranks the documents based on the tf-idf ranking algorithm. Your system should also determine if there is no document with a match, i.e., if none of the words in the search query appear in any of the documents.
- A function stop_word, part of (the last step of) the training process, that identifies stop words, removes them from the hash table, and adjusts the hash table and lists accordingly.
- A main function that first calls the training process to read all the documents and create the hash table.
  o Note that main must first prompt the user for the size of the hash table, i.e., "Enter number of buckets:".
  o Once the training phase is over, it enters the retrieval (search) phase to search for the keywords and find the documents that contain these keywords. main will prompt the user with "Enter S for search or X to Exit".
  o If the user enters S, main prompts for the query set (keywords entered on one line) and then calls the read_query function to read the query set. The program then computes the scores (calls function rank), prints out the documents in order of relevance (i.e., descending order of scores), and returns to the main prompt (i.e., to prompt for another search or to exit).
- A makefile. Think carefully about how you want to construct the different modules and therefore how you set up the makefile.
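As referenced in the hashing-function requirement above, a minimal sketch of the simple sum-of-ASCII-values scheme (one possible choice, not the only acceptable one) is:

    /* sum the ASCII codes of the characters in w, then reduce modulo the number of
       buckets chosen by the user at run time (a prime number of buckets is typical) */
    unsigned int hash(const char *w, unsigned int num_buckets) {
        unsigned int s = 0;
        while (*w != '\0')
            s += (unsigned char)*w++;
        return s % num_buckets;
    }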
3.3 Grading and Submission Instructions

- You must turn in, using Blackboard, a tar (or zip) file containing (1) a short document (report) describing your implementation, showing the flowchart, the algorithms, and how the different functions interact with each other; (2) the source code files; and (3) the makefile.
- We will test the code on shell.seas.gwu.edu, so be sure your code works on shell (and gcc) before submission.
- If your code does not compile, you will receive a zero for the project.
- If your code crashes during normal operation (i.e., within the specifications of the project), then it can result in a 50% penalty, depending on the severity of the reason for the crash.
- You are required to provide the makefile. How you break up your code into different files will play a role in your grade. To run the code, we will use the make command, so make sure you test your makefile before submitting.
- You will be graded on both correctness (60%) and efficiency (30%) of your solution, in addition to documentation and code style (10%).
  o Efficiency refers to the time complexity of your algorithm and also includes the data structures you use and memory management (no memory leaks!).
  o You must document your code; if you provide poor documentation you lose 10% of the grade. However, if you provide no documentation whatsoever you will be penalized 20%.
- Any assumptions you make on the specification of the input and search process (if the provided specifications do not cover your question) should be stated clearly in the report and in the comments in your code (in function main). Failure to do so may lead to your program being graded as one that does not meet specifications. Additionally, if you make an assumption that contradicts the specifications we provided, then you are not meeting the project specifications and points will be deducted.
- Read the next section for extra credit options.

If you choose to use your one-time late submission, you have 36 extra hours but will incur a 10% late penalty (in addition to any other points taken off during grading).

4. Extra Credit Options

Consider adding this extra credit option after you have completed the basic project. It will help you learn a few more C/Unix utilities (you can try to get an idea of how to query directories by running some sample code before integrating it into your project). There is no partial credit on the extra credit options: you must implement the specified functionality fully, and it must meet the specifications.

Option: Automatic reading of an arbitrary number of documents (10 points). In this project, we hard-coded the names of the files we are interested in.
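For this directory-reading option, one common approach on Unix systems is the POSIX dirent API. The sketch below is only illustrative (the function name and the .txt filter are assumptions, not part of the specification); it shows how a program can discover its input files at run time instead of hard-coding their names:

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    /* enumerate the .txt files in dirpath and print them; returns the number found,
       or -1 if the directory could not be opened */
    int list_text_files(const char *dirpath) {
        DIR *dir = opendir(dirpath);
        struct dirent *entry;
        int n = 0;
        if (dir == NULL) {
            perror("opendir");
            return -1;
        }
        while ((entry = readdir(dir)) != NULL) {
            const char *dot = strrchr(entry->d_name, '.');
            if (dot != NULL && strcmp(dot, ".txt") == 0) {
                printf("document %d: %s/%s\n", n, dirpath, entry->d_name);
                n++;
            }
        }
        closedir(dir);
        return n;
    }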