” COMP9313课程程序 写作、 写作Python实验程序 写作COMP9313 Project 1-specs – Jupyter Notebooklocalhost:8888/notebooks/COMP9313 Project 1-specs.ipynb 1/4Deadline + Late PenaltyNote : It will take you quite some time to complete this project, therefore, we earnestly recommend that youstart working as early as possible.Submission deadline for the Project is 20:59:59 on 10th Jul, 2020 (Sydney Time).LATE PENALTY: 10% on day-1 and 30% on each subsequent day.Instructions1. This note book contains instructions for COMP9313 Project 1.2. You are required to complete your implementation in the file submission.py provided along with thisnotebook.3. You are not allowed to print out unnecessary stuff. We will not consider any output printed out on thescreen. All results should be returned in appropriate data structures via corresponding functions.4. You are required to submit the following files, via CSE give :(i) submission.py (your code),(ii) report.pdf (illustrating your implementation details)Note: detailed submission instructions will be announced later.5. We provide you with detailed instructions of the project in this notebook. In case of any problem, you canpost your query @Piazza. Please do not post questions regarding the implementation details.6. You are allowed to add other functions and/or import modules (you may have to for this project), but youare not allowed to define global variables. All the functions should be implemented insubmission.py .7. In this project, you may need to CREATE YOUR OWN TEST CASES in order to evaluate the correctness,while at the same time improving the efficiency of your implementation. DO NOT COMPLETELY RELY ONTHE TOY EXAMPLE IN THE SPEC!COMP9313课程作业 写作、 写作Python实验作业 写作In order to create your own test cases, you are expected to use real datasets or randomly generateddata, and generate hash functions by yourself.8. The testing environment is the same as that of Lab1 . Note: Importing other modules (not a part of theLab1 test environment) may lead to errors, which will result in ZERO score for the ENTIRE Project.After completing the project, the students are ENCOURAGED to attempt for BONUS part. Detailedinstructions for the BONUS part are given in the later part of this notebook.Task1: C2LSH (90 points)In this question, you will implement the C2LSH algorithm in Pyspark. Specifically, you are required to write amethod c2lsh() in the file submission.py that takes the following four arguments as input:1. data_hashes: is a rdd where each element (i.e., key,value pairs) in this rdd corresponds to (id, data_hash).id is an integer and data_hash is a python list that contains integers (i.e., hash values of the datapoint).2. query_hashes is a python list that contains integers (i.e., hash values of the query).3. alpha_m is an integer which indicates the minimum number of collide hash values between data and query(i.e., ).4. beta n is an integer which indicates the minimum number of candidates to be returned (i.e., ).2020/7/4 COMP9313 Project 1-specs – Jupyter Notebooklocalhost:8888/notebooks/COMP9313 Project 1-specs.ipynb 2/44. beta_n is an integer which indicates the minimum number of candidates to be returned (i.e., ).Note:1. You dont need to implement hash functions and generate hashed data, we will provide the data hashes foryou.2. Please follow the description of the algorithm provided in the lecture notes, which is slightly differentto the original C2LSH paper.3. While one of the main purposes of this project is to use spark to solve the problems. Therefore, it ismeaningless to circumvent pyspark and do it in other ways (e.g., collect the data and implement thealgorithm without transformations etc.). Any such attempt will be considered as a invalid implementation,hence will be assigned ZERO score. Specifically, you are not allowed to use the following PySparkfunctions:aggregate , treeAggregate , aggregateByKeycollect , collectAsMapcountByKey , countByValueforeachreduce , treeReducesaveAs* (e.g. saveAsTextFile )take* (e.g. take , takeOrdered )topfoldReturn FormatThe c2lsh() method returns a rdd which contains a sequence of candidate ids.Notice: The order of the elements in the list does not matter (e.g., we will collect the elements andevaluate them as a set).EvaluationYour implementation will be tested using 3 different test cases. We will be evaluating based on the followingfactors:the correctness of implemented c2lsh() . The output will be compared with the result from the correctimplementation. Any difference will be considered as incorrect.the efficiency of your implmentation. We will calculate the running time of c2lsh() in each testcase(denoted as ).For each testcase (worth 30 points), the following marking criteria will be used:Case 1, 0 points: the returned rdd is incorrect, orCase 2, 10 points: the returned rdd is correct, and ,Case 3, 20 points: the returned rdd is correct, and ,Case 4, 30 points: the returned rdd is correct, and .Where depend on the testing environment and the test cases. 11 22 33 1 2 3Task 2: Report (10 points)2020/7/4 COMP9313 Project 1-specs – Jupyter Notebooklocalhost:8888/notebooks/COMP9313 Project 1-specs.ipynb 3/4You are also required to submit your project report, named: report.pdf . Specifically, in the report, you are atleast expected to answer the following questions:1. Implementation details of your c2lsh() . Explain how your major transform function works.2. Show the evaluation result of your implementation using your own test cases.3. What did you do to improve the efficiency of your implementation?BonusIn order to encourage the students to come up with efficient implementations, we allow bonus part of the projectwith a maximum of 20 points. Rules for the BONUS part are as under:Prerequisites:1. You must have obtained 90 points for the implementation part.2. The total running time of your implementation for the three test cases is among the top-50 smallestrunning times of the class.All the submissions, satisfying the above-mentioned conditions will be tested against a more challengingdataset. Top-20 most-efficient and correct implementations will be awarded the bonus scores.We will rank the top-20 implementations in an increasing order w.r.t the running time. We will award 20points to the most efficient implementation (i.e., the one with smallest running time), 19 points to the 2ndmost efficient one, and so on. The implementation ranked 20-th on the list will get 1 bonus point.How to execute your implementation (EXAMPLE)2020/7/4 COMP9313 Project 1-specs – Jupyter Notebooklocalhost:8888/notebooks/COMP9313 Project 1-specs.ipynb 4/4In [1]:Project Submission and FeedbackFor the project submission, you are required to submit the following files:1. Your implementation in the python file submission.py .2. The report report.pdf .Detailed instruction about using give to submit the project files will be announced later via Piazza.In [ ]:Number of candidate: 10set of candidate: {0, 70, 40, 10, 80, 50, 20, 90, 60, 30}from pyspark import SparkContext, SparkConffrom time import timeImport pickleimport submissiondef createSC():conf = SparkConf()conf.setMaster(local[*])conf.setAppName(C2LSH)sc = SparkContext(conf = conf)Return scwith open(toy/toy_hashed_data, rb) as file:data = pickle.load(file)With open(toy/toy_hashed_query, rb) as file:query_hashes = pickle.load(file)alpha_m = 10beta_n = 10sc = createSC()data_hashes = sc.parallelize([(index, x) for index, x in enumerate(data)])Start_time = time()res = submission.c2lsh(data_hashes, query_hashes, alpha_m, beta_n).collect()end_time = time()sc.stop()# print(running time:, end_time – start_time)print(Number of candidate: , len(res))print(set of candidate: , set(res))如有需要,请加QQ:99515681 或邮箱:99515681@qq.com
“
添加老师微信回复‘’官网 辅导‘’获取专业老师帮助,或点击联系老师1对1在线指导。