DSCI553 Foundations and Applications of Data Mining
Spring 2021
Assignment 3
Deadline: Mar. 23rd 11:59 PM PST

1. Overview of the Assignment
In Assignment 3, you will complete three tasks. You will first implement Min-Hash and Locality Sensitive Hashing (LSH) to find similar businesses efficiently. Then you will implement various types of recommendation systems.

2. Requirements
2.1 Programming Requirements
a. You must use Python Spark to implement all tasks. You can only use the standard Python libraries (i.e., external libraries like numpy or pandas are not allowed).
b. You are required to use only Spark RDD, i.e., no points will be given for using Spark DataFrame or DataSet.
c. There will be a 10% bonus for a Scala implementation in each task. You can get the bonus only when both the Python and Scala implementations are correct.

2.2 Programming Environment
Python 3.6, Scala 2.11, and Spark 2.3.0
We will use Vocareum to automatically run and grade your submission. You must test your scripts on your local machine and the Vocareum terminal before submission.

2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you may find Python implementations of some of the required functions on the Web. Please do not look for or at any such code!
Plagiarism detection will combine all the code we can find from the Web (e.g., Github) as well as other students' code from this and other (previous) sections. We will report all detected plagiarism to the university.

3. Yelp Data
For this assignment, we have generated sample review data from the original Yelp review dataset using some filters, such as the condition: state == CA. We randomly took 80% of sampled reviews for training, 10% for testing, and 10% as the blind dataset. (We do not share the blind dataset.)
You can access and download the following JSON files either under the directory resource/asnlib/publicdata/ on Vocareum or on Google Drive (USC email only):
a. train_review.json
b. test_review.json: contains only the target user and business pairs for prediction tasks
c. test_review_ratings.json: contains the ground truth ratings for the testing pairs
d. user_avg.json: contains the average stars for the users in the train dataset
e. business_avg.json: contains the average stars for the businesses in the train dataset
f. stopwords
g. We do not share the blind dataset.

4. Tasks
You need to submit the following files on Vocareum (all in lowercase):
a. Python scripts: task1.py, task2train.py, task2predict.py, task3train.py, task3predict.py
b. Model files: task2.model, task3item.model, task3user.model
c. Result files: task1.res, task2.predict, task3item.predict, task3user.predict
d. Scala scripts: task1.scala, task2train.scala, task2predict.scala, task3train.scala, task3predict.scala; one jar package: hw3.jar
e. Model files: task2.scala.model, task3item.scala.model, task3user.scala.model
f. Result files: task1.scala.res, task2.scala.predict
g. [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions).

4.1 Task1: Min-Hash + LSH (2pts)
4.1.1 Task description
In this task, you will implement the Min-Hash and Locality Sensitive Hashing algorithms with Jaccard similarity to find similar business pairs in the train_review.json file. We focus on 0/1 ratings rather than the actual rating values in the reviews. In other words, if a user has rated a business, the user's contribution in the characteristic matrix is 1; otherwise, the contribution is 0 (Table 1).
Your task is to identify business pairs whose Jaccard similarity is ≥ 0.05.

Table 1: The left table shows the original ratings; the right table shows the converted 0 and 1 ratings.

You can define any collection of hash functions to permute the row entries of the characteristic matrix to generate Min-Hash signatures. Some potential hash functions are:
$f(x) = (ax + b) \bmod m$ or $f(x) = ((ax + b) \bmod p) \bmod m$
where $p$ is any prime number and $m$ is the number of bins. You can define any combination for the parameters ($a$, $b$, $p$, $m$) in your implementation.

After you have defined all hash functions, you will build the signature matrix using Min-Hash. Then you will divide the matrix into $b$ bands with $r$ rows each, where $b \times r = n$ ($n$ is the number of hash functions). You need to set $b$ and $r$ properly to balance the number of candidates and the computational cost. Two businesses become a candidate pair if their signatures are identical in at least one band.

Lastly, you need to verify the candidate pairs using their original Jaccard similarity. Table 2 shows an example of calculating the Jaccard similarity between two businesses. Your final outputs will be the business pairs whose Jaccard similarity is ≥ 0.05.

            user1  user2  user3  user4
business1     0      1      1      1
business2     0      1      0      0
Table 2: Jaccard similarity (business1, business2) = #intersection / #union = 1/3

4.1.2 Execution commands
Python $ spark-submit task1.py input_file output_file
Scala $ spark-submit --class task1 hw3.jar input_file output_file
input_file: the train review set
output_file: the similar business pairs and their similarities

4.1.3 Output format
You must write a business pair and its similarity in the JSON format using exactly the same tags as in the example in Figure 1. Each line represents a business pair, e.g., b1 and b2. For each business pair b1 and b2, you do not need to generate the output for b2 and b1 since the similarity value is the same as for b1 and b2.
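The pair-ordering rule and the Table 2 calculation can be illustrated with a minimal sketch. It is a brute-force check on toy data, not a Min-Hash or LSH implementation, and it does not use Spark. The tag names "b1", "b2", and "sim" are assumptions inferred from the text; the exact tags are the ones shown in Figure 1. The helper names `jaccard` and `similar_pair_lines` are hypothetical.

```python
import json
from itertools import combinations

# Users who rated each business (the 0/1 view from Table 2).
raters = {
    "business1": {"user2", "user3", "user4"},
    "business2": {"user2"},
}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pair_lines(raters, threshold=0.05):
    """Yield one JSON line per unordered pair whose similarity meets the threshold."""
    # Sorting the ids means each pair is visited once as (b1, b2), never as (b2, b1).
    for b1, b2 in combinations(sorted(raters), 2):
        sim = jaccard(raters[b1], raters[b2])
        if sim >= threshold:
            # Assumed tag names; replace with the exact tags from Figure 1.
            yield json.dumps({"b1": b1, "b2": b2, "sim": sim})

lines = list(similar_pair_lines(raters))
```

A brute-force pass like this may also be useful for generating a small ground truth to compare LSH output against, since it enumerates every pair.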
You do not need to truncate decimals for the sim values.
Figure 1: An example output for Task 1 in the JSON format

4.1.4 Grading
Your task 1 outputs (1pt) will be graded by the precision and recall metrics defined below.
Precision = # true positives / # output pairs
Recall = # true positives / # ground truth pairs
Your precision should be ≥ 0.95 (0.5pt), and recall should be ≥ 0.5 (0.5pt). The execution time on Vocareum should be less than 200 seconds. To evaluate the implementation, you can generate the ground truth that contains all the business pairs in the train_review.json file whose Jaccard similarity is ≥ 0.05 and calculate precision and recall by yourself.

4.2 Task2: Content-based Recommendation System (2pts)
4.2.1 Task description
In this task, you will build a content-based recommendation system by generating profiles from review texts for users and businesses in the train_review.json file. Then you will use the model to predict if a user prefers to review a given business by computing the cosine similarity between the user and item profile vectors.
During the training process, you will construct the business and user profiles as follows:
a. Concatenating all reviews for a business as one document and parsing the document, such as removing the punctuation, numbers, and stopwords. Also, you can remove extremely rare words to reduce the vocabulary size. Rare words could be the ones whose frequency is less than 0.0001% of the total number of words.
b. Measuring word importance using TF-IDF, i.e., term frequency multiplied by inverse document frequency
c. Using the top 200 words with the highest TF-IDF scores to describe the document
d. Creating a Boolean vector with these significant words as the business profile
e. Creating a Boolean vector for representing the user profile by aggregating the profiles of the items that the user has reviewed
During the prediction process, you will estimate if a user would prefer to review a business by computing the cosine similarity between the profile vectors.
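For Boolean profile vectors, cosine similarity reduces to a set computation: the size of the intersection divided by the square root of the product of the two set sizes. The following is a minimal sketch under stated assumptions: profiles are stored as Python sets of word ids, the user profile is aggregated by taking the union of business profiles (one possible aggregation, not prescribed by the handout), and the names `boolean_cosine` and `aggregate_user_profile` are hypothetical.

```python
import math

def boolean_cosine(profile_a, profile_b):
    """Cosine similarity of two Boolean vectors stored as sets of active word ids."""
    if not profile_a or not profile_b:
        return 0.0
    shared = len(profile_a & profile_b)
    return shared / math.sqrt(len(profile_a) * len(profile_b))

def aggregate_user_profile(business_profiles, reviewed_businesses):
    """Union of the profiles of the businesses a user reviewed (assumed aggregation)."""
    profile = set()
    for business in reviewed_businesses:
        profile |= business_profiles.get(business, set())
    return profile

# Toy data: word ids stand in for the top TF-IDF words of each business.
business_profiles = {"b1": {1, 2, 3}, "b2": {3, 4}, "b3": {4, 5, 6, 7}}
user = aggregate_user_profile(business_profiles, ["b1", "b2"])
sim = boolean_cosine(user, business_profiles["b3"])
```

In this toy case the user profile is {1, 2, 3, 4}, it shares one word with b3, and the similarity is 1 / sqrt(4 * 4) = 0.25.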
The (user, business) pair is valid if their cosine similarity is ≥ 0.01. You should only output these valid pairs.

4.2.2 Execution commands
Training commands:
Python $ spark-submit task2train.py train_file model_file stopwords
Scala $ spark-submit --class task2train hw3.jar train_file model_file stopwords
train_file: the train review set
model_file: the output model
stopwords: the file containing the stopwords that can be removed
Predicting commands:
Python $ spark-submit task2predict.py test_file model_file output_file
Scala $ spark-submit --class task2predict hw3.jar test_file model_file output_file
test_file: the test review set (only target pairs)
model_file: the model generated during the training process
output_file: the output results

4.2.3 Output format
Model format: There is no strict format requirement for the content-based model.
Prediction format: You must write the results in JSON format using exactly the same tags as in the example in Figure 2. Each line represents a predicted pair of (user_id, business_id). You do not need to truncate decimals for sim values.
Figure 2: An example prediction output for Task 2 in JSON format

4.2.4 Grading
You need to generate the content-based model and the prediction results (1pt). We will grade your prediction results by calculating precision and recall using the ground truth (i.e., the blind reviews). The definitions of precision and recall are the same as the ones in task 1. Your precision should be ≥ 0.8 (0.5pt) and recall should be ≥ 0.7 (0.5pt) for the blind datasets. The execution time of the training process on Vocareum should be less than 600 seconds. The execution time of the predicting process on Vocareum should be less than 300 seconds.

4.3 Task3: Collaborative Filtering Recommendation System (4pts)
4.3.1 Task description
In this task, you will build collaborative filtering (CF) recommendation systems using the train_review.json file.
After building the systems, you will use them to predict the ratings for a user and business pair. You are required to implement 2 cases:

Case 1: Item-based CF recommendation system (2pts)
During the training process, you will build a recommendation system by computing the Pearson correlation for the business pairs with at least three co-rated users. During the predicting process, you will use the system to predict the rating for a given pair of user and business. You must use at most N business neighbors who are the top N most similar to the target business for prediction (you can try various N, e.g., 3 or 5).

Case 2: User-based CF recommendation system with Min-Hash LSH (2pts)
During the training process, you should combine the Min-Hash and LSH algorithms in your user-based CF recommendation system since the number of potential user pairs might be too large to compute. You need to (1) identify the similarity of user pairs using their co-rated businesses, without considering their rating scores (similar to Task 1). This process reduces the number of user pairs you need to compare for the final Pearson correlation score. (2) Compute the Pearson correlation for the user pair candidates with Jaccard similarity ≥ 0.01 and at least three co-rated businesses.
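Both cases rely on the Pearson correlation restricted to co-rated entries. The following is a minimal sketch, assuming each side is a dict mapping an id to a star rating (user id for business pairs, business id for user pairs) and that means are taken over the co-rated entries only, which is one common convention and not confirmed by the handout. The function name `pearson` and its `min_corated` parameter are hypothetical.

```python
import math

def pearson(ratings_a, ratings_b, min_corated=3):
    """Pearson correlation over co-rated keys.

    Returns None when there are fewer than min_corated shared keys
    or when either side has zero variance on the shared keys.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if len(common) < min_corated:
        return None
    # Assumption: averages are computed over the co-rated entries only.
    mean_a = sum(ratings_a[k] for k in common) / len(common)
    mean_b = sum(ratings_b[k] for k in common) / len(common)
    numerator = sum((ratings_a[k] - mean_a) * (ratings_b[k] - mean_b) for k in common)
    norm_a = math.sqrt(sum((ratings_a[k] - mean_a) ** 2 for k in common))
    norm_b = math.sqrt(sum((ratings_b[k] - mean_b) ** 2 for k in common))
    if norm_a == 0 or norm_b == 0:
        return None
    return numerator / (norm_a * norm_b)

# Toy ratings for two businesses, keyed by user id.
business_x = {"u1": 5.0, "u2": 3.0, "u3": 1.0, "u4": 4.0}
business_y = {"u1": 4.0, "u2": 3.0, "u3": 2.0}
weight = pearson(business_x, business_y)
too_few = pearson({"u1": 5.0, "u2": 3.0}, business_y)
```

Here the three co-rated users give perfectly aligned deviations, so the weight is 1 up to floating-point rounding, while the second call returns None because only two users are shared.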
The predicting process is similar to Case 1.

4.3.2 Execution commands
Training commands:
Python $ spark-submit task3train.py train_file model_file cf_type
Scala $ spark-submit --class task3train hw3.jar train_file model_file cf_type
train_file: the train review set
model_file: the output model
cf_type: either item_based or user_based
Predicting commands:
Python $ spark-submit task3predict.py train_file test_file model_file output_file cf_type
Scala $ spark-submit --class task3predict hw3.jar train_file test_file model_file output_file cf_type
train_file: the train review set
test_file: the test review set (only target pairs)
model_file: the model generated during the training process
output_file: the output results
cf_type: either item_based or user_based

4.3.3 Output format
Model format: You must write the model in JSON format using exactly the same tags as in the example in Figure 3. Each line represents a business pair (b1, b2) for the item-based model (Figure 3a) or a user pair (u1, u2) for the user-based model (Figure 3b). There is no need to have (b2, b1) or (u2, u1). You do not need to truncate decimals for sim values.
Figure 3: (a) is an example of the item-based model and (b) is an example of the user-based model
Prediction format: You must write a target pair and its prediction in the JSON format using exactly the same tags as in the example in Figure 4. Each line represents a predicted pair of (user_id, business_id). You do not need to truncate decimals for stars values.
Figure 4: An example output for task3 in JSON format

4.3.4 Grading
You need to generate the item-based and user-based CF models. We will grade your model using precision and recall as defined in task 1. For your item-based model, precision should be ≥ 0.9 (0.25pt) and recall should be ≥ 0.9 (0.25pt).
For your user-based model, precision should be ≥ 0.4 (0.25pt) and recall should be ≥ 0.5 (0.25pt).
Besides, we will compare your prediction results against the ground truth in both test and blind datasets. You should output ONLY the predictions generated from the model. Then we use RMSE (Root Mean Squared Error), defined in the equation below, to evaluate the performance. For those pairs that your model cannot predict (e.g., due to the cold start problem or too few co-rated users), we will predict them with the business average stars for the item-based model and the user average stars for the user-based model. We provide two files containing the average stars for users and businesses in the training dataset, respectively. The value of the UNK tag, which can be used for predicting those new businesses and users, is the average stars over all reviews.
$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i} (Pred_i - Rate_i)^2}$
where $Pred_i$ is the prediction for business $i$, $Rate_i$ is the true rating for business $i$, and $n$ is the total number of user and business pairs.
The execution time of the training process on Vocareum should be less than 600 seconds. The execution time of the predicting process on Vocareum should be less than 100 seconds. RMSE for the item-based model in both test and blind datasets should be ≤ 0.91 (1.5pt), and for the user-based model in both datasets should be ≤ 1.01 (1.5pt). If the performance on only one of the datasets reaches the threshold, you will obtain 1pt.

5. About Vocareum
a. You can use the provided datasets under the directory resource: /asnlib/publicdata/
b. You should upload the required files under your workspace: work/
c. You must test your scripts on both the local machine and the Vocareum terminal before submission.
d. During the submission period, Vocareum will directly evaluate the following result files: task1.res, task2.predict, task3item.model, and task3user.model. Vocareum will also run the task3predict scripts and evaluate the prediction results for both test and blind datasets.
e. During the grading period, Vocareum will run both train and predict scripts. If the training or predicting process fails to run, you can get 50% of the score only if the submission report shows that your submitted models or results are correct (regrading).
f. Here are the commands that you can use to run Python scripts on Vocareum:
g. You will receive a submission report after Vocareum finishes executing your scripts. The submission report should show precision and recall for each task. We do not test the Scala implementation during the submission period.
h. Vocareum will automatically run both Python and Scala implementations during the grading period.
i. The total execution time of the submission period should be less than 600 seconds. The execution time of the grading period needs to be less than 3000 seconds.
j. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade your last submission.

6. Grading Criteria
(% penalty = % penalty of possible points you get)
a. You can use your free 5-day extension separately or together. You must submit a late-day request via https://forms.gle/6aDASyXAuBeV3LkWA. This form records the number of late days you use for each assignment. By default, we will not count the late days if no request is submitted.
b. There will be a 10% bonus for each task if your Scala implementations are correct. The Scala bonus will be calculated only when your Python results are correct. There is no partial credit for Scala.
c. There will be no points if your submission cannot be executed on Vocareum.
d. There is no regrading. Once the grade is posted on the Blackboard, we will only regrade your assignments if there is a grading error. No exceptions.
e. There will be a 20% penalty for late submission within one week and no points after that.
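For reference, the RMSE evaluation described in Section 4.3.4, including the fallback to average stars for pairs the model cannot predict, can be sketched as follows. This is an illustration under stated assumptions and not the grader's actual code: the average-stars data is assumed to be a dict keyed by business id with an extra "UNK" entry, predictions and ground truth are assumed to be dicts keyed by (user_id, business_id), and the names `rmse` and `business_fallback` are hypothetical.

```python
import math

def rmse(predictions, truth, fallback):
    """RMSE over every ground-truth pair; missing predictions use fallback(pair)."""
    total = 0.0
    for pair, rate in truth.items():
        pred = predictions.get(pair)
        if pred is None:
            pred = fallback(pair)  # e.g. business average for the item-based model
        total += (pred - rate) ** 2
    return math.sqrt(total / len(truth))

# Assumed layout of the business average data: id -> average stars, plus "UNK".
business_avg = {"b1": 4.0, "UNK": 3.5}

def business_fallback(pair):
    """Average stars of the business, or the UNK value for an unseen business."""
    _user, business = pair
    return business_avg.get(business, business_avg["UNK"])

truth = {("u1", "b1"): 5.0, ("u2", "b2"): 3.5, ("u3", "b1"): 3.0}
predictions = {("u1", "b1"): 4.0}  # the model could only predict one pair
score = rmse(predictions, truth, business_fallback)
```

In this toy case the squared errors are 1, 0, and 1, so the score is sqrt(2/3). For the user-based model the same function would take a fallback keyed on the user id instead.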