
COMP9313: Big Data Management
Sample Exam Questions

Question 1 HDFS

Explain the difference between a NameNode and a DataNode.

Given a file of 500MB, let the block size be 150MB and the replication factor be 3. How much space do we need to store this file in HDFS? Why?

Question 2 Spark

Given a large text file, your task is to find the top-k most frequent co-occurring term pairs. The co-occurrence of (w, u) is defined as: u and w appear in the same line (this also means that (w, u) and (u, w) are treated equally). Your Spark program should generate a list of k key-value pairs ranked in descending order according to the frequencies, where the keys are the pairs of terms and the values are the co-occurring frequencies (Hint: you need to define a function which takes an array of terms as input and generates all possible pairs).

    textFile = sc.textFile(inputFile)
    words = textFile.map(lambda x: x.lower().split())
    # fill your code here, and store the result in a pair RDD avgLen
    avgLen.collect()

Question 3 Finding Similar Items

Suppose we wish to find similar sets, and we apply locality-sensitive hashing with k=5 and l=2. If two sets had Jaccard similarity 0.6, what is the probability that they will be identified in the locality-sensitive hashing as candidates (i.e. they hash at least once to the same super-hash)? You may assume that there are no coincidences, where two unequal values hash to the same hash value.

Question 4 Mining Data Streams

Suppose we are maintaining a count of 1s using the DGIM method. We represent a bucket by (i, t), where i is the number of 1s in the bucket and t is the bucket timestamp (time of the most recent 1). Consider that the current time is 200, the window size is 60, and the current list of buckets is: (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (1, 197) (1, 200). At the next ten clocks, 201 through 210, the stream has 0101010101. What will the sequence of buckets be at the end of these ten inputs?

Question 5 Recommender Systems

Consider three users u1, u2, and u3, and four movies m1, m2, m3, and m4. The users rated the movies using a 4-point scale: -1: bad, 1: fair, 2: good, and 3: great. A rating of 0 means that the user did not rate the movie. The three users' ratings for the four movies are: u1 = (3, 0, 0, -1), u2 = (2, -1, 0, 3), u3 = (3, 0, 3, 1).

Which user has more similar taste to u1 based on cosine similarity, u2 or u3? Show the detailed calculation process.

User u1 has not yet watched movies m2 and m3. Which movie(s) are you going to recommend to user u1, based on the user-based collaborative filtering approach? Justify your answer.
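For Question 1, the storage arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes that the last HDFS block occupies just the data it actually holds (it is not padded to the full block size) and it ignores metadata and checksum overhead. The variable names are illustrative.

```python
import math

file_mb, block_mb, replication = 500, 150, 3

# The file is split into fixed-size blocks; the last one is only partly filled.
num_blocks = math.ceil(file_mb / block_mb)
last_block_mb = file_mb - (num_blocks - 1) * block_mb

# Every block is stored `replication` times, so raw usage scales with the
# file size rather than with num_blocks * block_mb.
total_mb = file_mb * replication
block_replicas = num_blocks * replication

print(num_blocks, last_block_mb, total_mb, block_replicas)
```

Under these assumptions the file needs 4 blocks (three of 150MB and one of 50MB), 12 block replicas in total, and 1500MB of raw storage.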
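For Question 2, the hint asks for a function that turns one line's terms into pairs. Below is a minimal pure-Python sketch of such a helper plus a local top-k count, so the logic can be tried without a cluster. The names `term_pairs` and `top_k_pairs` are hypothetical, and two points the question leaves open are treated as assumptions: a pair is counted at most once per line, and a term is never paired with itself.

```python
from collections import Counter
from itertools import combinations


def term_pairs(terms):
    """Return each unordered pair of distinct terms in one line, once."""
    # Sorting makes (w, u) and (u, w) map to the same key.
    # Assumption: duplicates within a line are collapsed first.
    unique = sorted(set(terms))
    return list(combinations(unique, 2))


def top_k_pairs(lines, k):
    """Count pairs over all lines and return the k most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(term_pairs(line.lower().split()))
    # Descending frequency; ties are broken by the pair itself.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


sample = ["a b c", "b a", "c a d"]
result = top_k_pairs(sample, 2)
print(result)
```

In the Spark skeleton from the question, the same helper could plausibly be plugged in as `words.flatMap(term_pairs).map(lambda p: (p, 1)).reduceByKey(lambda a, b: a + b)`, followed by `takeOrdered(k, key=lambda kv: -kv[1])` to obtain the ranked list. That wiring is a suggestion, not the official solution.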
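For Question 3, one way to check the arithmetic is shown below. The sketch assumes the usual reading of the parameters: k is the number of min-hash values combined into one super-hash (all must agree), and l is the number of independent super-hashes (at least one must agree). Under that reading the candidate probability is 1 - (1 - s^k)^l. The function name is illustrative.

```python
def candidate_probability(s, k, l):
    """Probability that two sets with Jaccard similarity s collide
    in at least one of l super-hashes built from k min-hashes each."""
    same_super_hash = s ** k            # all k min-hashes agree
    return 1 - (1 - same_super_hash) ** l


p = candidate_probability(0.6, 5, 2)
print(round(p, 4))
```

With s = 0.6, k = 5 and l = 2 this gives 0.6^5 = 0.07776 for a single super-hash and roughly 0.1495 overall.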
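For Question 4, the bucket updates can be traced with a small simulation. This is a sketch of the standard DGIM rules under two stated assumptions: a bucket is dropped once its timestamp is at or before the current time minus the window size, and whenever three buckets of the same size exist, the two oldest are merged and keep the more recent timestamp. The function name `dgim_update` is hypothetical.

```python
def dgim_update(buckets, t, bit, window):
    """Apply one stream bit at time t.

    buckets: list of (size, timestamp) tuples, oldest first.
    """
    # Drop buckets whose most recent 1 has left the window.
    buckets = [(s, ts) for (s, ts) in buckets if ts > t - window]
    if bit == 1:
        buckets.append((1, t))
        size = 1
        while True:
            idx = [i for i, (s, _) in enumerate(buckets) if s == size]
            if len(idx) <= 2:
                break
            # Merge the two oldest buckets of this size; the merged
            # bucket keeps the newer of the two timestamps.
            i, j = idx[0], idx[1]
            buckets[i:j + 1] = [(2 * size, buckets[j][1])]
            size *= 2
    return buckets


buckets = [(16, 148), (8, 162), (8, 177), (4, 183), (2, 192), (1, 197), (1, 200)]
for offset, ch in enumerate("0101010101"):
    buckets = dgim_update(buckets, 201 + offset, int(ch), window=60)
print(buckets)
```

Under these assumptions the run ends with (8, 162) (8, 177) (4, 183) (4, 200) (2, 204) (2, 208) (1, 210); the bucket (16, 148) falls out of the window along the way.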
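For Question 5, the cosine similarities and a simple user-based prediction can be sketched as follows. Unrated movies are treated as 0 inside the cosine, as the rating vectors in the question suggest. The prediction step uses a similarity-weighted average over the neighbours who rated the movie; that is one common variant of user-based collaborative filtering and an assumption here, not necessarily the variant the course expects.

```python
import math

ratings = {
    "u1": [3, 0, 0, -1],
    "u2": [2, -1, 0, 3],
    "u3": [3, 0, 3, 1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Similarity of u1 to each other user.
sims = {u: cosine(ratings["u1"], ratings[u]) for u in ("u2", "u3")}


def predict(movie_idx):
    """Similarity-weighted average of the neighbours' ratings (0 = unrated)."""
    rated = [(sims[u], ratings[u][movie_idx])
             for u in sims if ratings[u][movie_idx] != 0]
    if not rated:
        return None
    return sum(s * r for s, r in rated) / sum(abs(s) for s, _ in rated)


predictions = {"m2": predict(1), "m3": predict(2)}
print(sims, predictions)
```

By hand: u1·u2 = 6 - 3 = 3 with norms sqrt(10) and sqrt(14), giving about 0.254; u1·u3 = 9 - 1 = 8 with norms sqrt(10) and sqrt(19), giving about 0.580. So u3 is closer to u1, and under the weighted-average assumption m3 (predicted about 3) looks like the movie to recommend, while m2 (predicted about -1) does not.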






