
INF553 Foundations and Applications of Data Mining
Summer 2020
Assignment 5
NO LATE SUBMISSIONS

1. Overview of the Assignment
In this assignment, you are going to implement three algorithms: Bloom filtering, the Flajolet-Martin algorithm, and reservoir sampling. For the first task, you will implement Bloom filtering for an off-line Yelp business dataset. "Off-line" here means you do not need to take the input as streaming data. For the second and third tasks, you need to deal with on-line streaming data directly. In the second task, you need to generate a simulated data stream with the Yelp dataset and implement the Flajolet-Martin algorithm with the Spark Streaming library. In the third task, you will do some analysis on a Twitter stream using fixed-size sampling (reservoir sampling).

2. Requirements

2.1 Programming Requirements
a. You must use Python and Spark to implement all tasks. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You will need the Spark Streaming library for task1 and task2. In task3, you will use the Twitter streaming API. You can use the Python library tweepy and the Scala library spark-streaming-twitter.
c. You can only use Spark RDD and standard Python or Scala libraries, i.e., no points if you use Spark DataFrame or DataSet.

2.2 Programming Environment
Python 3.6, Scala 2.11 and Spark 2.3.2
We will use Vocareum to automatically run and grade your submission. You must test your scripts on your local machine and the Vocareum terminal before submission.

2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code!
TAs will combine all the code we can find from the web (e.g., GitHub) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

2.4 What you need to turn in
You need to submit the following files on Vocareum (all lowercase):
a. [REQUIRED] three Python scripts, named: task1.py, task2.py, task3.py
b. [REQUIRED FOR SCALA] three Scala scripts, named: task1.scala, task2.scala, task3.scala
c. [REQUIRED FOR SCALA] one jar package, named: hw5.jar
d. You don't need to include your results. We will grade your code with our testing data (data will be in the same format).

3. Datasets

3.1.1 Yelp Business Data
For task1, you need to download business_first.json and business_second.json from Vocareum. The first file is used to set up the bit array for Bloom filtering, and the second file is used for prediction.

3.1.2 Yelp Streaming Data Simulation
For task2, you need to download the business.json file and generate_stream.jar from Vocareum. Please follow the instructions below to simulate streaming on your machine:
1) Run generate_stream.jar in the terminal to generate Yelp streaming data from business.json with the command:
java -cp <generate_stream.jar file path> StreamSimulation <business.json file path> 9999 100
- 9999 is a port number on the localhost. You can assign any available port to it.
- 100 represents 100 milliseconds (0.1 second), which is the time interval between items in the simulated data stream.
2) Keep step 1) running while testing your code. Use Ctrl+C to terminate it if necessary.
3) Add the following code to connect to the data stream in your Spark Streaming code:
ssc.socketTextStream("localhost", 9999)
- The first argument is the host name, which is "localhost" in this case.
- The second argument is the port number from step 1), which is 9999 in this case.
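Putting steps 1)-3) together, the snippet below is a minimal sketch (not a required deliverable) of how a PySpark job can attach to the simulated stream, assuming the simulator from step 1) is already running on port 9999. The application name and the count/pprint action are placeholders added only for illustration.

    # Minimal sketch: attach to the simulated Yelp stream (steps 1-3 above).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stream_connect_sketch")   # placeholder app name
    sc.setLogLevel("ERROR")

    # Batch duration in seconds; task1 uses 10 and task2 uses 5 (see section 4).
    ssc = StreamingContext(sc, 10)

    # Each record in the DStream is one line emitted by the simulator.
    lines = ssc.socketTextStream("localhost", 9999)
    lines.count().pprint()   # placeholder action, only to verify data is arriving

    ssc.start()
    ssc.awaitTermination()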
3.2 Twitter Stream Data
For task3, you need to analyze Twitter streaming data using the Twitter APIs. Please follow the instructions below to set up the Twitter APIs.
a. Create credentials for the Twitter APIs
- Register on https://apps.twitter.com/ by clicking on "Create new app", fill in the form, and click on "Create your Twitter app".
- Go to the newly created app and open "Keys and Access Tokens". Click on "Generate my access token". You will need to use these tokens as arguments when executing the code.
b. Add library dependencies in the code
- You can use the Python library tweepy. To install the library, you can use pip install tweepy.
- You can use the Scala libraries spark-streaming-twitter and spark-streaming. To install the libraries, you can add the library dependencies in sbt.
https://docs.tweepy.org/en/3.7.0/streaming_how_to.html
https://bahir.apache.org/docs/spark/current/spark-streaming-twitter/

4. Tasks

4.1 Task1: Bloom Filtering (4.5 pts)
You will implement the Bloom filtering algorithm to estimate whether the city of an incoming business in the data stream has been seen before. The details of the Bloom filtering algorithm can be found in the streaming lecture slides. You need to choose proper hash functions and the number of hash functions for the Bloom filtering algorithm.
In this task, you should keep a global filter bit array whose length is 200.
Some possible hash functions are:
f(x) = (ax + b) % m  or  f(x) = ((ax + b) % p) % m
where p is any prime number and m is the length of the filter bit array. You can use any combination of the parameters (a, b, p). The hash functions must stay the same once you have created them.
As the city of a business is a string, you need to convert it into an integer and then apply the hash functions to it. The following code shows one possible solution:
import binascii
int(binascii.hexlify(s.encode('utf8')), 16)
(We only treat exactly identical strings as the same city. You do not need to consider aliases.)

Execution Details
In Spark Streaming, set the batch duration to 10 seconds:
ssc = StreamingContext(sc, 10)
You will get a batch of data in Spark Streaming every 10 seconds, and you will use the Bloom filtering algorithm to estimate whether each incoming city has appeared before since the beginning of your program. You need to maintain a set of previously seen cities in order to calculate the false positive rate (FPR). We will test your code for 10 minutes.

Output Results
You need to save your results in a CSV file with the header "Time,FPR". Each line stores the timestamp when you receive the batch of data and the false positive rate. The time format should be YYYY-MM-DD hh:mm:ss (Figure 1 shows an example). You do not need to round your answer.
Figure 1: Output file format for task1
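To make the bookkeeping above concrete, here is a minimal sketch of the per-batch Bloom-filter check and FPR computation. The three hash functions, the (a, b, p) values, and the helper names (city_to_int, check_and_add, batch_fpr) are illustrative assumptions, not requirements; only the bit-array length of 200 and the binascii conversion come from this handout.

    # Minimal sketch of the Bloom-filter check and FPR bookkeeping for task1.
    # The 3 hash functions and the (a, b, p) values are arbitrary illustrations.
    import binascii

    M = 200                                                  # required bit-array length
    HASH_PARAMS = [(7, 11, 1543), (13, 17, 769), (31, 5, 12289)]   # (a, b, p), p prime
    bit_array = [0] * M
    previous_cities = set()                                  # ground truth, for FPR only

    def city_to_int(city):
        # String-to-integer conversion suggested in the handout.
        return int(binascii.hexlify(city.encode("utf8")), 16)

    def positions(city):
        x = city_to_int(city)
        return [((a * x + b) % p) % M for a, b, p in HASH_PARAMS]

    def check_and_add(city):
        # True if the filter claims the city was seen before; then set its bits.
        pos = positions(city)
        seen_by_filter = all(bit_array[i] for i in pos)
        for i in pos:
            bit_array[i] = 1
        return seen_by_filter

    def batch_fpr(cities):
        # FPR = FP / (FP + TN) over the cities of one batch.
        fp = tn = 0
        for city in cities:
            predicted_seen = check_and_add(city)
            actually_seen = city in previous_cities
            if predicted_seen and not actually_seen:
                fp += 1
            if not predicted_seen and not actually_seen:
                tn += 1
            previous_cities.add(city)
        return 0.0 if fp + tn == 0 else fp / (fp + tn)

In the actual task, something like batch_fpr would be invoked on each 10-second batch (e.g., inside foreachRDD) and the resulting (timestamp, FPR) pair appended to the output CSV.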
4.2 Task2: Flajolet-Martin algorithm (4.5 pts)
In task2, you will implement the Flajolet-Martin algorithm (including the step of combining estimations from groups of hash functions) to estimate the number of unique cities within a window of the data stream. The details of the Flajolet-Martin algorithm can be found in the streaming lecture slides. You need to choose proper hash functions and the number of hash functions for the Flajolet-Martin algorithm. (An illustrative sketch of one way to combine the group estimates appears at the end of this handout.)
Figure 2: Spark Streaming window

Execution Details
For this task, the batch duration should be 5 seconds, the window length should be 30 seconds, and the sliding interval should be 10 seconds. We will test your code for 10 minutes.

Output Results
You need to save your results in a CSV file with the header "Time,Ground Truth,Estimation". Each line stores the timestamp when you receive the batch of data, the actual number of unique cities in the window period, and the estimation result from the Flajolet-Martin algorithm. The time format should be YYYY-MM-DD hh:mm:ss (Figure 3 shows an example). You do not need to round your answer.
Figure 3: Flajolet-Martin output file format

4.3 Task3: Fixed Size Sampling on Twitter Streaming (3.5 pts)
You will use the Twitter streaming API to implement the fixed-size sampling method (the reservoir sampling algorithm) and find popular tags in tweets based on the samples. (An illustrative sketch appears at the end of this handout.)
In this task, we assume that memory can only hold 100 tweets, so we need to use the fixed-size sampling method to keep only part of the tweets as a sample of the stream. As the Twitter stream comes in, you can directly save the first 100 tweets in a list. After that, you keep the nth tweet with probability 100/n and otherwise discard it. If you keep the nth tweet, you need to randomly pick one tweet in the list to be replaced. If an incoming tweet has no tag, you can directly ignore it.
You also need to keep a global variable representing the sequence number of the tweet. If the incoming tweet has no tag, the sequence number does not increase; otherwise the sequence number increases by one.
Every time you receive a new tweet, you need to find the tags in the sample list with the top 3 frequencies.

Output Results: you just need to print your results in the terminal.
On the first line, you should print the sequence number of this new tweet as shown in the example. Then you should print the tags and their frequencies in descending order of frequency. If some tags share the same frequency, you should print them all, ordered lexicographically (Figure 4).
Figure 4: Twitter streaming printing information example

4.4 Execution Format
Python:
spark-submit task1.py <first_json_path> <second_json_path> <output_file_path>
spark-submit task2.py <port #> <output_file_path>
spark-submit task3.py
Scala:
spark-submit --class task1 hw5.jar <first_json_path> <second_json_path> <output_file_path>
spark-submit --class task2 hw5.jar <port #> <output_file_path>
spark-submit --class task3 hw5.jar

Input parameters:
1. port #: the simulated streaming port you listen to.
2. output_file_path: the output file, including file path, file name, and extension.

Note: it is OK to have the following error in your submission log:
[error message screenshot omitted]

5. Grading Criteria
(A percentage penalty is a percentage of the points you could otherwise earn.)
a. There will be a 10% bonus for each task if your Scala implementation is correct. The Scala bonus is only awarded when your Python results are correct. There are no partial points for Scala.
b. There will be no points if your submission cannot be executed on Vocareum.
c. There is no regrading. Once the grade is posted on Vocareum, we will only regrade your assignment if there is a grading error. No exceptions.
d. No late submissions are allowed.
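For reference from section 4.2, the following is a minimal, illustrative sketch of computing a Flajolet-Martin estimate for the set of cities seen in one window. Everything in it is an assumption made for illustration: the 12 hash functions, the grouping into four groups of three (average within a group, median across groups), the hash range M, and the (a, b, p) parameters. You still need to choose your own hash functions and wire the computation into the 30-second Spark Streaming window.

    # Illustrative Flajolet-Martin estimate for the cities seen in one window.
    import binascii
    import statistics

    M = 2 ** 10                                              # hash range (illustrative)
    HASH_PARAMS = [(3 + 2 * i, 5 + 3 * i, 12289) for i in range(12)]   # (a, b, p)
    GROUP_SIZE = 3                                           # 4 groups of 3 hashes

    def trailing_zeros(x):
        # Treat a hash value of 0 as having no trailing zeros (a pragmatic choice).
        if x == 0:
            return 0
        count = 0
        while x % 2 == 0:
            x //= 2
            count += 1
        return count

    def fm_estimate(cities):
        # Track the maximum tail length per hash function.
        max_r = [0] * len(HASH_PARAMS)
        for city in cities:
            x = int(binascii.hexlify(city.encode("utf8")), 16)
            for i, (a, b, p) in enumerate(HASH_PARAMS):
                h = ((a * x + b) % p) % M
                max_r[i] = max(max_r[i], trailing_zeros(h))
        # Per-hash estimates 2^R, averaged within groups, median across groups.
        estimates = [2 ** r for r in max_r]
        group_means = [sum(estimates[g:g + GROUP_SIZE]) / GROUP_SIZE
                       for g in range(0, len(estimates), GROUP_SIZE)]
        return statistics.median(group_means)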
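For reference from section 4.3, here is a minimal, illustrative sketch of reservoir sampling over tweet hashtags, assuming the tweepy 3.x streaming interface linked in section 3.2. The credential strings, the track keyword, the helper names, and the print wording are placeholders; the required output format is the one shown in Figure 4.

    # Illustrative reservoir sampling of tweets with hashtags (tweepy 3.x assumed).
    import random
    from collections import Counter

    import tweepy

    SAMPLE_SIZE = 100
    sample = []            # reservoir: one list of hashtags per kept tweet
    seq_num = 0            # counts only tweets that carry at least one hashtag

    class ReservoirListener(tweepy.StreamListener):
        def on_status(self, status):
            global seq_num
            tags = [h["text"] for h in status.entities.get("hashtags", [])]
            if not tags:
                return                         # tweets without tags are ignored
            seq_num += 1
            if len(sample) < SAMPLE_SIZE:
                sample.append(tags)            # first 100 tweets go straight in
            elif random.random() < SAMPLE_SIZE / seq_num:
                sample[random.randint(0, SAMPLE_SIZE - 1)] = tags   # replace one
            self.report()

        def report(self):
            # Placeholder wording; follow Figure 4 for the required format.
            counts = Counter(tag for tags in sample for tag in tags)
            print("The number of tweets with tags from the beginning: %d" % seq_num)
            for freq in sorted(set(counts.values()), reverse=True)[:3]:
                for tag in sorted(t for t, c in counts.items() if c == freq):
                    print("%s : %d" % (tag, freq))
            print()

    # Placeholder credentials; pass your own tokens as described in section 3.2.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    stream = tweepy.Stream(auth=auth, listener=ReservoirListener())
    stream.filter(track=["data"])              # illustrative keyword filter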
