COMP5349 Programming Assignment: Writing, Python Program

School of Computer Science
COMP5349: Cloud Computing Sem. 1/2021
Assignment 1: Data Analysis with Spark RDD API
Individual Work: 20%    01.04.2021

1 Introduction

This assignment tests your ability to implement simple data analytic workloads using the Spark RDD API. The data set you will work on is adapted from the Trending YouTube Video Statistics data from Kaggle. There are two workloads you should design and implement against the given data set. You are required to design and implement both workloads using ONLY the basic Spark RDD API. You should not use Spark SQL or other advanced features.

2 Input Data Set Description

The data set contains several months of records of the daily top trending YouTube videos in the following ten countries: Canada, France, Germany, India, Japan, Mexico, Russia, South Korea, United Kingdom and United States of America. There are up to 200 trending videos listed per day.

In the original data set, each country's data is stored in a separate CSV file, with each row representing a trending video record. If a video is listed as trending on multiple days, each trending appearance has its own record. The category names are stored in a few separate JSON files.

The following preprocessing has been done so that you can focus on the main workload design:

- Merge the 10 individual CSV files into a single CSV file.
- Add a column category to store the actual category name based on the mapping.
- Add a column country to store the trending country; each country is represented by a code of two capital letters.
- Remove rows with invalid video id values.
- Remove most columns that are not relevant to the workloads.

The result is a CSV file AllVideos.csv with 8 columns and no header row. The columns are as follows. The trending date column has the date format yy.dd.mm.

video_id,trending_date,category,views,likes,dislikes,country

3 Analysis Workload Description

3.1 Controversial Trending Videos Identification

Listing a video as trending would help it attract more views.
However, not all trending videos are liked by viewers. For some videos, being listed as trending increases the number of dislikes more than it increases the number of likes. This workload aims to identify such videos. Below are a few records of a particular video demonstrating how the various numbers change over time:

video id     trending date  views     likes    dislikes  country
QwZT7T-TXT0  2018-01-03     13305605  835378   629120    US
QwZT7T-TXT0  2018-01-04     23389090  1082422  1065772   US
QwZT7T-TXT0  …              …         …        …         US
QwZT7T-TXT0  2018-01-09     37539570  1402578  1674420   US
QwZT7T-TXT0  2018-01-03     13305605  835382   629123    GB
QwZT7T-TXT0  …              …         …        …         GB
QwZT7T-TXT0  2018-01-18     45349447  1572111  1944971   GB

The video has multiple trending appearances in US and GB. In both countries, its views, likes and dislikes all increase over time with each trending appearance. As the table above shows, the number of dislikes grows much faster than the number of likes. In both countries, the video ended with more dislikes than likes, despite starting with more likes.

In this workload, you are asked to find the top 10 videos with the fastest growth in dislikes between their first and last trending appearances. Here we measure the growth in dislikes as the difference between the increase in dislikes and the increase in likes between the first and last trending appearances in the same country. For instance, the dislikes growth of video QwZT7T-TXT0 in US is computed as follows:

(1674420 - 629120) - (1402578 - 835378) = 478100

The result of this workload should show the video id, the dislike growth value and the country code. Below are the sample results.

QwZT7T-TXT0, 579119, GB
QwZT7T-TXT0, 478100, US
BEePFpC9qG8, 365862, DE
RmZ3DPJQo2k, 334390, KR
q8v9MvManKE, 299044, IN
pOHQdIDds6s, 160365, CA
ZGEoqPpJQLE, 151913, RU
84LBjXaeKk4, 134836, FR
84LBjXaeKk4, 134834, DE
84LBjXaeKk4, 121240, RU

3.2 Category and Trending Correlation

Some videos are trending in multiple countries.
We are interested in whether there is any correlation between video category and trending popularity across countries. For instance, we may expect to see a common set of trending music videos in many countries and a distinctive set of trending political videos in each country. In this workload, you are asked to find the average country number for videos in each category, that is, the average number of countries in which a video of that category appears as trending.

The following sample data set contains five videos belonging to category Sports. Their trending data are as follows:

video id  category  trending date  views  country
1         Sports    18.17.02       700    US
1         Sports    18.18.02       1500   US
2         Sports    18.11.03       3000   US
2         Sports    18.11.03       2000   CA
2         Sports    18.11.03       5000   IN
2         Sports    18.12.03       7000   IN
3         Sports    18.17.04       2000   JP
4         Sports    18.16.04       3000   KR
4         Sports    18.17.04       9000   KR
5         Sports    18.16.04       4000   RU

We can see that videos 1, 3, 4 and 5 each appear in one country, while video 2 appears in three countries. If they are the only videos in the Sports category, the average country number would be (1 + 3 + 1 + 1 + 1)/5 = 1.4. You should print out the final result sorted by the average country number. The sample output of this workload is as follows.

(Trailers, 1.0),
(Autos Vehicles, 1.0190448285965426),
(News Politics, 1.052844979051223),
(Nonprofits Activism, 1.057344064386318),
(Education, 1.0628976994615762),
(People Blogs, 1.0640343760329336),
(Pets Animals, 1.0707850707850708),
(Howto Style, 1.0876256925918326),
(Travel Events, 1.0929411764705883),
(Gaming, 1.0946163477016635),
(Sports, 1.1422245184146431),
(Entertainment, 1.1447534885477444),
(Science Technology, 1.1626835588828102),
(Film Animation, 1.1677314564158094),
(Comedy, 1.2144120659156503),
(Movies, 1.25),
(Music, 1.310898044427568),
(Shows, 1.614678899082569)

A small number of videos have more than one category name. The category name may change over time. For instance, video id 119YrPUNM28 has changed its category name from News Politics to Entertainment. A video may be given different category names in different countries.
For instance, video id 7klO0p092Y is under category People Blogs in CA and DE but under category Entertainment in US. As the number of such videos is quite small, you do not need to identify and handle them separately. The sample answer double counts them in all categories in which they appear.

4 Coding and Execution Requirement

Below are the requirements on coding and execution:

- You should implement both workloads in PySpark using the Spark RDD API.
- You should implement both workloads in a single Jupyter notebook. There should be a clear indication of which cells belong to which workload.
- You should not modify the input data file in any way, and your code should read the data file from the same directory as the notebook file.

5 Deliverable and Demo

The source code should be submitted as a single Jupyter notebook file. The due date is week 7 Wednesday 21/04/21 23:59. Please name your notebook file as labCode-uniKey-firstName-lastName.ipynb.

There will be a 10-minute demo in week 7/8. You need to attend the demo to receive a mark for this assignment!

During the demo, the marker will run your notebook in their own environment to check the correctness of the result. You should also have your environment ready to run your code and to answer questions. The marker may ask you to explain the overall computation graph or a certain part of the implementation. You may be asked to add some statements to your code to show the structure of an intermediate RDD, or to apply various filters on intermediate RDDs to produce a slightly different result.
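The dislike growth measure defined in Section 3.1 can be checked by hand on the sample rows of video QwZT7T-TXT0. The sketch below is plain Python, not Spark, and is only meant to illustrate the arithmetic. The tuple layout, the use of `datetime.date` for the trending date, and the names `SAMPLE_ROWS`, `dislike_growth` and `top_dislike_growth` are assumptions made for this example; the elided rows of the table are simply left out.

```python
from datetime import date

# Rows copied from the table in Section 3.1 (elided rows omitted).
# Assumed layout: (video_id, trending_date, views, likes, dislikes, country)
SAMPLE_ROWS = [
    ("QwZT7T-TXT0", date(2018, 1, 3), 13305605, 835378, 629120, "US"),
    ("QwZT7T-TXT0", date(2018, 1, 4), 23389090, 1082422, 1065772, "US"),
    ("QwZT7T-TXT0", date(2018, 1, 9), 37539570, 1402578, 1674420, "US"),
    ("QwZT7T-TXT0", date(2018, 1, 3), 13305605, 835382, 629123, "GB"),
    ("QwZT7T-TXT0", date(2018, 1, 18), 45349447, 1572111, 1944971, "GB"),
]


def dislike_growth(rows):
    """Map (video_id, country) to its dislike growth value."""
    first_last = {}  # key -> (first record, last record), each (date, likes, dislikes)
    for video_id, day, _views, likes, dislikes, country in rows:
        key = (video_id, country)
        record = (day, likes, dislikes)
        if key not in first_last:
            first_last[key] = (record, record)
            continue
        first, last = first_last[key]
        if day < first[0]:
            first = record
        if day > last[0]:
            last = record
        first_last[key] = (first, last)

    growth = {}
    for key, (first, last) in first_last.items():
        likes_increase = last[1] - first[1]
        dislikes_increase = last[2] - first[2]
        growth[key] = dislikes_increase - likes_increase
    return growth


def top_dislike_growth(rows, n=10):
    """Return up to n (video_id, growth, country) tuples, largest growth first."""
    ranked = sorted(dislike_growth(rows).items(), key=lambda kv: kv[1], reverse=True)
    return [(video_id, value, country) for (video_id, country), value in ranked[:n]]


for video_id, value, country in top_dislike_growth(SAMPLE_ROWS):
    print(f"{video_id}, {value}, {country}")
```

On these five rows the sketch reproduces the first two lines of the sample results (579119 for GB and 478100 for US). In a Spark solution the same per-key reduction would presumably be expressed on a key-value RDD, for example with `reduceByKey`, but that design is left to the assignment.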

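The average country number from Section 3.2 can likewise be verified on the Sports sample. The sketch below is plain Python, not Spark, and only illustrates the definition. The tuple layout and the names `SPORTS_ROWS` and `average_country_number` are assumptions made for this example. Keying on the (category, video id) pair is one possible way to get the double counting described for videos with more than one category name.

```python
from collections import defaultdict

# Rows copied from the Sports sample table in Section 3.2.
# Assumed layout: (video_id, category, trending_date, views, country)
SPORTS_ROWS = [
    ("1", "Sports", "18.17.02", 700, "US"),
    ("1", "Sports", "18.18.02", 1500, "US"),
    ("2", "Sports", "18.11.03", 3000, "US"),
    ("2", "Sports", "18.11.03", 2000, "CA"),
    ("2", "Sports", "18.11.03", 5000, "IN"),
    ("2", "Sports", "18.12.03", 7000, "IN"),
    ("3", "Sports", "18.17.04", 2000, "JP"),
    ("4", "Sports", "18.16.04", 3000, "KR"),
    ("4", "Sports", "18.17.04", 9000, "KR"),
    ("5", "Sports", "18.16.04", 4000, "RU"),
]


def average_country_number(rows):
    """Return (category, average country number) pairs sorted by the average."""
    # Step 1: collect the distinct countries of each (category, video) pair.
    countries = defaultdict(set)
    for video_id, category, _day, _views, country in rows:
        countries[(category, video_id)].add(country)

    # Step 2: per category, sum the country counts and count the videos.
    totals = defaultdict(lambda: [0, 0])  # category -> [country count sum, video count]
    for (category, _video_id), seen in countries.items():
        totals[category][0] += len(seen)
        totals[category][1] += 1

    # Step 3: divide and sort ascending by the average.
    averages = [(category, total / videos) for category, (total, videos) in totals.items()]
    return sorted(averages, key=lambda pair: pair[1])


print(average_country_number(SPORTS_ROWS))
```

For the ten sample rows this gives a single pair, Sports with 1.4, matching the (1 + 3 + 1 + 1 + 1)/5 calculation in the text.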