COMPSCI 753: Algorithms for Massive Data
Semester 2, 2020
Assignment 1: Locality-sensitive Hashing
Ninh Pham

Submission:
Please submit a single pdf file and the source code on CANVAS by 11:59pm, Sunday 23 August 2020. The answer file must contain your student ID, UPI and name.

Penalty Dates:
The assignment will not be accepted after the last penalty date unless there are special circumstances (e.g., sickness with a certificate). Penalties will be calculated as follows, as a percentage of the mark for the assignment.

  By 11:59pm, Sunday 23 August 2020: no penalty
  By 11:59pm, Monday 24 August 2020: 25% penalty
  By 11:59pm, Tuesday 25 August 2020: 50% penalty

1 Assignment problem (50 pts)

The assignment aims at investigating MinHash and the Locality-sensitive Hashing (LSH) framework on real-world data sets. In class, we have seen how to construct signature matrices using random permutations. However, in practice, randomly permuting a very large matrix is prohibitive. Section 3.3.5 of your textbook (Chapter 3, Mining of Massive Datasets [1] by J. Leskovec, A. Rajaraman, J. Ullman) introduces a simple but fast method to simulate this randomness using different hash functions. We encourage you to read through that section before attempting the assignment.

In the assignment, you write a program [2] to compute all-pairs similarity on the Bag of Words data set from the UCI Repository [3] using the Jaccard similarity. This problem is a core component of detecting plagiarism and finding similar documents in information retrieval.

The Bag of Words data set contains 5 text data sets which share the same pre-processing procedure. That is, after tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only important words that occurred more than 10 times for the large data sets.
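The trick from Section 3.3.5 replaces explicit random permutations with random hash functions of the form h(x) = (a*x + b) mod p applied to word IDs. A minimal sketch of that idea in Python follows; the function names, the seed, and the choice of prime are ours for illustration only, not prescribed by the assignment:

```python
import random

PRIME = 4294967311  # a prime larger than 2^32, so hash values rarely collide

def make_hash_funcs(d, seed=0):
    """Generate d random hash functions h(x) = (a*x + b) mod PRIME,
    each simulating one random permutation of the word IDs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(d)]

def minhash_signature(word_ids, hash_funcs):
    """Signature of one document (a set of word IDs): for each hash
    function, keep the minimum hash value over the document's words."""
    return [min((a * w + b) % PRIME for w in word_ids) for (a, b) in hash_funcs]

# Toy example: two documents sharing 2 of 4 distinct words (Jaccard = 0.5).
funcs = make_hash_funcs(d=10)
sig1 = minhash_signature({1, 2, 3}, funcs)
sig2 = minhash_signature({2, 3, 4}, funcs)
# Fraction of agreeing signature positions estimates the Jaccard similarity.
est = sum(x == y for x, y in zip(sig1, sig2)) / len(funcs)
```

The agreement fraction is an unbiased estimate of the Jaccard similarity, which is exactly what the accuracy measurement in Task 3 evaluates.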
For the small data sets, no truncation was applied.

Each file has the format: docID wordID count, where docID is the document ID, wordID is the word ID in the vocabulary, and count is the word frequency. Since the Jaccard similarity does not take word frequency into account, we simply ignore this information in the assignment. That is, we set count = 1 for every pair (docID, wordID). We treat each document as a set and each word as a set element, and use the Jaccard similarity.

We only use the KOS blog entries data set for this assignment, since we still need to run the brute-force algorithm to measure the accuracy of MinHash and LSH. However, students are encouraged to try larger data sets, such as the NYTimes news articles and the PubMed abstracts. Note that the data set is very sparse, so you should think about suitable data structures for fast processing.

The assignment tasks and their points are as follows.

1. Execute brute-force computation (10 pts): Compute all-pairs similarity with Jaccard and save the result to a file (you will need the brute-force result for the next tasks). You need to report:
   (a) The running time of your brute-force algorithm (5 pts).
   (b) The average Jaccard similarity over all pairs except identical pairs, i.e. J(d_i, d_j) where i ≠ j (5 pts).

2. Compute the MinHash signatures for all documents (10 pts): Compute the MinHash signatures (number of hash functions d = 10) for all documents. You need to report:
   (a) The running time of this step (10 pts).

3. Measure the accuracy of MinHash estimators (10 pts): Compute all pairwise similarity estimates based on MinHash. Repeat steps 2 and 3 with the number of hash functions d ranging over {10, 20, ..., 100} and plot the mean absolute error (MAE) of the MinHash estimators for the different values of d.

   MAE = ( Σ_{i,j=1, i≠j}^{n} |J(d_i, d_j) - Ĵ(d_i, d_j)| ) / (n² - n),

   where J(d_i, d_j) is the actual Jaccard similarity between d_i and d_j, and Ĵ(d_i, d_j) is their MinHash-based estimate.

   You need to report:
   (a) The running time of estimating all-pairs similarity based on MinHash with different values of d (5 pts).
   (b) The figure of MAEs, with the values of d on the x-axis and the MAE values on the y-axis (5 pts).

4. Exploit LSH (20 pts): Implement the LSH framework to solve the subproblem: find all pairs of similar documents with Jaccard ≥ 0.6. In particular, using d = 100 hash functions, you need to explain:
   (a) How to tune the parameters b (number of bands) and r (number of rows in one band) so that the false negative rate for 60%-similar pairs is at most 10% (5 pts).
   (b) The space usage affected by these parameters (5 pts).

   Given your chosen setting, from your experiment you need to report:
   (a) The false candidate ratio (5 pts):
       (number of candidate pairs with exact Jaccard < 0.6) / (number of candidate pairs).
   (b) The probability that a dissimilar pair with Jaccard ≤ 0.3 is a candidate pair (5 pts):
       (number of candidate pairs with exact Jaccard ≤ 0.3) / (number of pairs with exact Jaccard ≤ 0.3).

2 What to submit?

An answer.pdf file that reports the requested values and the explanation for each task.
A source code file that contains detailed comments.

Note: When taking the screenshots, make sure that you do not reveal any additional content you do not wish to share with us ;-).

[1] https://www.mmds.org/
[2] No restriction on the programming language used, but Python is preferred.
[3] https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
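For Task 4(a), the standard banding analysis says a pair with Jaccard similarity s agrees in one band of r rows with probability s^r, so it becomes a candidate with probability 1 - (1 - s^r)^b, and the false negative rate at s = 0.6 is (1 - 0.6^r)^b. A short exploratory sketch (our own helper, not a required part of the submission) enumerating the (b, r) settings with b * r = d = 100:

```python
def candidate_prob(s, b, r):
    """Probability that a pair with Jaccard similarity s becomes an
    LSH candidate under banding: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b

d = 100
for r in range(1, d + 1):
    if d % r:           # keep only factorizations with b * r = d exactly
        continue
    b = d // r
    fn = (1.0 - 0.6 ** r) ** b  # false negative rate for 60%-similar pairs
    print(f"b={b:3d}, r={r:3d}, "
          f"FN@0.6={fn:.4f}, P(candidate | s=0.3)={candidate_prob(0.3, b, r):.4f}")
```

Running the numbers this way suggests that, with b * r = 100, settings with r ≤ 4 (e.g. b = 25, r = 4, false negative rate ≈ 0.03) meet the 10% requirement while r = 5 (≈ 0.20) does not; smaller r also admits more dissimilar pairs as candidates, which is the trade-off you should discuss. You would still justify your chosen setting analytically in your answer.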