辅导Data Analytics编程、写作Python实验编程

” 辅导Data Analytics编程、写作Python实验编程Data Analytics – Python other toolsFall 2020HW 1: End-to-end analysis of TMDb data, Argo-Lite, SQLite, D3 Warmup, OpenRefine, FlaskUse HW1 Skeleton zip file and unzip to folders.Homework OverviewIn Question 1 (Q1), you will collect data using an API for The Movie Database (TMDb). You will construct agraph representation of this data that will show which actors have acted together in various movies, and useArgo Lite to visualize this graph and highlight patterns that you find. This exercise demonstrateshow visualizing and interacting with data can help with discovery.In Q2, you will construct a TMDb database in SQLite, with tables capturing information such as how welleach movie did, which Actors acted in each movie, and what the movie was about. You will alsopartition and combine information in these tables in order to more easily answer questions such as whichactors acted in the highest number of movies?.In Q3, you will visualize temporal trends in movie releases, using a JavaScript-based library called D3. Thispart will show how creating interactive rather than static plots can make data more visually appealing,engaging and easier to parse.In Q4, you will use OpenRefine to clean data from Mercari, and construct GREL queries to filter the entriesin this dataset.In Q5, you will build A simple web application that displays a table of TMDb data on a single-page website.To do this, you will use Flask, a Python framework for building web applications that allows you to connectPython data processing on the back end with serving a site that displays these results.Extremely Important: folder structure content of submission zip fileThe zip files folder structure must exactly be (when unzipped):HW1-username/Q1/Q2/Q3/submission.pyQ2_SQL.pyindex.(html / js / css)q3.csvlib/d3/d3.Min.jsd3-fetch/d3-fetch.min.jsd3-dsv/d3-dsv.Min.jsQ4/Q5/properties_clean.csvchanges.jsonQ4Observations.txtwrangling.py2 Version 1Q1 [40 points] Collect data from TMDb and visualize co-actor networkQ1.1 [30 points] Collect data from TMDb and build a graphFor this Q1.1, you will be using and submitting a python file – submission.py in folder Q1Complete all tasks according To the instructions found in submission.py to complete the Graph class, theTMDbAPIUtils class, and the two global functions. The Graph class will serve as a re-usable way torepresent and write out your collected graph data. The TMDbAPIUtils class will be used to work with theTMDB API For data retrieval.NOTE: You must only use a version of Python 3.7.0 and 3.8 for this question. You must not use anyother versions (e.g., Python 3.8).NOTE: You must only use the modules and libraries provided at the top of submission.py and modulesfrom the Python Standard Library. Pandas and Numpy CANNOT be usedNOTE: We will call each function once in submission.py during grading. You may lose some points if yourprogram runs for unreasonably long time, such as more than 10 minutes during non-busy times. Theaverage runtime of the code during grading is expected to take approximately 4 seconds.a) [10 pts] Implementation of the Graph class according to the instructions in submission.pyb) [10 pts] Implementation of the TMDbAPIUtils class according to the instructions insubmission.py. You will use version 3 of the TMDb API to download data about actors and theirco-actors. To use the TMDb API:o Use the TMDb API key to access the TMDb data: 74a32a264202ce8e09c7o Refer to the TMDB API Documentation, which contains a helpful try-it-out feature forinteracting with the API calls.c) [10 pts] Producing correct nodes.csv and edges.csv. You must upload your nodes.csv andedges.csv file as directed in Q1.2.NOTE: Q1.2 builds on the results of Q1.13 Version 1Q1.2 [10 points] Visualizing a graph of co-actors using Argo-LiteUsing Argo Lite, visualize a network of actors and their co-actors. You can access Argo Lite hereYou will produce An Argo Lite graph snapshot your edges.csv and nodes.csv from Q1.1.c.a. To get started, review Argo Lites readme on GitHub. Argo Lite has been open-sourced.b. Importing your Graph● Launch Argo Lite● From the menu bar, click Graph Import CSV. In the dialogue that appears:o Select I have both nodes and edges file● Under Nodes, use Choose File to select nodes.csv from your computero Leave Has Headers selectedo Verify Column for Node ID is id● Under Edges, use Choose File to select edges.csv from your computero Verify Column for Source ID is sourceo Select Column for Target ID to targeto Verify Selected Delimiter is ,● At the bottom of the dialogue, verify that After import, show is All Nodes● The graph will load in the window. Note that the layout is paused by default; you can selectto Resume or Pause layout as needed.● Dragging a Node will pin it, freezing its position. Selecting a pinned node, right clicking it,then choosing unpin selected will unpin that node, so its position will once again becomputed by the graph layout algorithm. Experiment with pinning and unpinning nodes.c. [7 points] Setting graph display options● On Graph Options panel, under Nodes Modifying All Nodes, expand Color menuo Select Color by degree, with scale: Linear Scaleo Select a color gradient of your choice that will assign lighter colors to nodes with highernode degrees, and darker colors to nodes with lower degrees● Collapse the Color options, expand the Size options.o Select Scale by to degree, with scale: Linear Scaleo Select meaningful Size Range values of your choice or use the default range.● Collapse the Size options● On the Menu, click Tools Data Sheet● Within the Data Sheet dialogue:o Click Hide Allo Set 10 more nodes with highest degreeo Click Show and then close the Data Sheet dialogue● Click and drag a rectangle selection around the visible nodes● With the nodes selected, configure their node visibility by setting the following:o Go to Graph Options Labelso Click Show Labels of Selected Nodeso At the bottom of the menu, select Label By to nameo Adjust the Label Length so that the full text of the actor name is displayed● On the Menu, click Tools – Filters – Show All Nodes The result of this workflow yields agraph with the sizing and coloring depending upon the node degree and the nodes with thehighest degree are emphasized by showing their labels.●d. [3 points] Designing a meaningful graph layout4 Version 1Using the following guidelines, create a visually meaningful and appealing layout:● Reduce as much edge crossing as possible● Reduce node overlap as much as possible● Keep the graph compact and symmetric as possible● Use the nodes Spatial positions to convey information (e.g., clusters or groups)● Experiment with showing additional node labels. If showing all node labels creates too muchvisual complexity, show at least 10 important nodes. You may decide what importancemean to you. For example, you may consider nodes (actors) having higher connectivity aspotentially more important (based on how the graph is built).The objective of this task is to familiarize yourself with basic, important graph visualization features.Therefore, this is an open-ended task, and most designs receive full marks. So please experimentwith Argo Lites features, changing node size and shape, etc. In practice, it is not possible to createperfect visualizations for most graph datasets. The above guidelines are ones that generally help.However, like most design tasks, creating a visualization is about making selective designcompromises. Some guidelines could create competing demands and following all guidelines maynot guarantee a perfect design.If you want to save your Argo Lite graph visualization snapshot locally to your device, so you cancontinue working on it later, we recommend the following workflow.● Select Graph Save Snapshoto In the Save Snapshot` dialog, click Copy to Clipboardo Open an external text editor program such as TextEdit or Notepad. Paste the clipboardcontents of the Graph snapshot, and save it to a file with a .json extension. You shouldbe able to accomplish this with a default text editor on your computer by overriding thedefault file extension and manually entering .json.o You may save your progress by saving the snapshot and loading them into Argo Lite tocontinue your work.● To load a snapshot, choose Graph Open Snapshot● Select the graph snapshot you created.e. Publish and Share your graph snapshot● Select Graph Publish and Share Snapshot Share● Next, click Copy to Clipboard to copy the generated URL● Return the URL in the return_argo_lite_snapshot() function in submission.pyIf you modify your graph after you publish and share a URL, you will need to re-publish and obtain anew URL of your latest graph. Only the graph snapshot shared via the URL will be graded.Deliverables: Place the files listed below in the Q1 folder. submission.py: The completed Python fileQ2 [35 points] SQLiteSQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. Itis one of the worlds most popular embedded database systems. It is convenient to share data stored in anSQLite database just one cross-platform file which does not need to be parsed explicitly (unlike CSV5 Version 1files, which have to be parsed).You will modify the given Q2_SQL.py file in folder Q2 by adding SQL statements to it.NOTE: You must only use a version of Python 3.7.0 and 3.8 for this question. You must not use anyother versions (e.g., Python 3.8)NOTE: Do not modify the import statements, everything you need to complete this question has beenimported for you. You may not use other libraries for this assignment.A Sample class has been provided for you to see some sample SQL statements, you can turn off this outputby changing the global variable SHOW to False. NOTE: This must be set to false before uploading toGradescope and turning it in to Canvas.username – use This name, e.g. mhull32NOTE: For the questions in this section, you must only use INNER JOIN when performing a joinbetween two tables. Other types of joins may result in incorrect results.a. [9 points] Create tables and import data.i. [2 points] Create two tables (via two separate methods) named movies and movie_cast withcolumns having the indicated data types:1. movies1. id (integer)2. title (text)3. score (real)2. movie_cast1. movie_id (integer)2. cast_id (integer)3. cast_name (text)4. birthday (text)5. popularity (real)ii. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv intothe movie_cast table1. You will write Python code that imports the .csv files into the individual tables. This willinclude looping though the file and using the INSERT INTO SQL command. Only userelative paths while importing files since absolute/local paths are specific locations thatexist only on your computer and will cause the auto-grader to fail.iii. [5 points] Vertical Database Partitioning. Database partitioning is an important technique thatdivides large Tables into smaller tables, which may help speed up queries. For this question youwill create a new table cast_bio from the movie_cast table (i.e., columns in cast_bio willbe a subset of those in movie_cast) Do not edit the movie_cast table. Be sure that when youinsert into the new cast_bio that the values are unique. Please read this page for an exampleof vertical database partitioning.cast_bio1. cast_id (integer)2. cast_name (text)3. birthday (date)6 Version 14. popularity (real)b. [1 point] Create indexes. Create the following indexes for the tables specified below. This step increasesthe speed of subsequent operations; though the improvement in speed may be negligible for this smalldatabase, it is significant for larger databases.i. movie_index for the id column in movies tableii. cast_index for the cast_id column in movie_cast tableiii. cast_bio_index for the cast_id column in cast_bio tablec. [3 points] Calculate a Proportion. Find the proportion of movies having a score 50 and that has war inthe name. Treat each row as a different movie. The proportion should only be based on the total numberof rows in the movie table. Format all decimals to two places using printf(). Do NOT use theROUND() function as it does not work the same on every OS.Output format and sample value:7.70d. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movieappearances that have a popularity 10. Sort the results by the number of appearances in descendingorder, then by cast_name in alphabetical order.Output format and sample values (cast_name,appearance_count):Harrison Ford,2e. [4 points] Find the highest scoring movies with the smallest cast. List the 5 highest-scoring movies thathave the fewest cast members. Sort the results by score in descending order, then by number of castmembers in ascending order, then by movie name in alphabetical order. Format all decimals to twoplaces using printf().Output format and sample values (movie_title,movie_score,cast_count):Star Wars: Holiday Special,75.01,12War Games,58.49,33f. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average moviescores. Format all decimals to two places using printf(). Sort the output by average score in descending order, then by cast_name in alphabetical order. Do not include movies with score 25 in the average score calculation. Exclude cast members who have appeared in two or fewer movies.Output format and sample values (cast_id,cast_name,average_score):8822,Julia Roberts,53.00g. [6 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs ofactors who have had a good collaboration as defined here. Each row in the view describes one pair ofactors who appeared in at least 3 movies together AND the average score of these movies is = 40.The view should have the format:good_collaboration(cast_member_id1,7 Version 1cast_member_id2,movie_count,average_movie_score)For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lowernumeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). Thereshould not be any self pair where the value of cast_member_id1 is the same as that ofcast_member_id2.NOTE: Full points will only be awarded for queries that use joins for part g.Remember that Creating a view will not produce any output, so you should test your view with afew simple select statements during development. One such test has already been added to thecode as part of the auto-grading.NOTE: Do not submit any code that creates a TEMP or TEMPORARY view that you mayhave used for testing.Optional Reading: Why create views?i. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scoresfrom the good_collaboration view, and call this score the collaboration_score. Thisscore is the average of the average_movie_score corresponding to each cast member,including actors in cast_member_id1 as well as cast_member_id2. Format all decimals totwo places using printf(). Sort your output by this score in descending order, then by cast_name alphabetically.Output format (cast_id,cast_name,collaboration_score):2,Mark Hamil,99.321920,Winoa Ryder,88.32h. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTSdocumentation). Import movie overview data from the movie_overview.csv into a new FTS table calledmovie_overview with the schema:movie_overview id (integer) overview (text)NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR and NOTare case sensitive in FTS queries.i. [1 point] Count the number of movies whose overview field contains the word fight. Matchesare not case sensitive. Match full words, not word parts/sub-strings.e.g., Allowed: FIGHT, Fight, fight, fight.. Disallowed: gunfight, fighting, etc.Output format:12ii. [2 points] Count the number of movies that contain the terms space and program in the8 Version 1overview field with no more than 5 intervening terms in between. Matches are not casesensitive. As you Did in h(i)(1), match full words, not word parts/sub-strings. e.g., Allowed: InSpace there was a program, In this space program. Disallowed: In space you are notsubjected to the laws of gravity. A program., etc.Output format:6Deliverables: Place all the files listed below in the Q2 folder1. Q2_SQL.py: Modified file containing all the SQL statements you have used to answer parts a – h inthe proper sequence.Q3 [15 points] D3 (v5) WarmupRead chapters 4-8 of Scott Murrays Interactive Data Visualization for the Web, 2nd edition . . Thissimple reading provides important foundation you will need for Homework 2. This question uses D3version v5, while the book covers D3 v4. What you learn from the book is transferable to v5.NOTE the following important points:1. We highly recommend that you use the latest Firefox browser to complete this question. We will gradeyour work using Firefox 79.0 (or newer).2. For this homework, the D3 library is provided to you in the lib folder. You must NOT use any D3 libraries(d3*.js) other than the ones provided.3. You may need to setup an HTTP server to run your D3 visualizations. The easiest way is to use https.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder.4. We have provided sections of code along with comments in the skeleton to help you complete theimplementation. While you do not need to remove them, you may need to write additional code to make thingswork.5. All d3*.js files in the lib folder are referenced using relative paths in your html file. For example, since thefile Q3/index.html uses d3, its header contains:script type=text/javascript src=lib/d3/d3.min.js/scriptIt is incorrect to use an absolute path such as:script type=text/javascript src= https://d3js.org/d3.v5.min.js/scriptThe 3 files that are referenced are:- lib/d3/d3.min.js- lib/d3-dsv/d3-dsv.min.js- lib/d3-fetch/d3-fetch.min.js9 Version 16. For a question that reads in a dataset, you are required to submit the dataset too (as part of yourdeliverable). In your html / js code, use a relative path to read in the dataset file. For example, since Q3requires reading data from the q3.csv file, the path should be q3.csv and NOT an absolute path such asC:/Users/polo/HW1-skeleton/Q3/q3.csv. Absolute/local paths are specific locations that exist only on yourcomputer, which Means your code will NOT run on our machines when we grade (and you will lose points).7. You can and are encouraged (though not required) to decouple the style, functionality and markup in thecode for each question. That is, you can use separate files for CSS, JavaScript and HTML this is a goodprogramming practice in general.Deliverables: Place all the files/folders listed below in the Q3 folder● A folder named lib containing folders d3, d3-fetch, d3-dsv● q3.csv: the file that we have provided you, in the hw1 skeleton under Q3 folder, which contains thedata that will be loaded into the D3 plot.● index.(html / css / js) : when run in a browser, it should display a barplot with the followingspecifications:a. [1.5 points] Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv().b. [2 points] The barplot must display one bar per row in the q3.csv dataset. Each barcorresponds to the running total of movies for a given year. The height of each barrepresents the running total. The bars are ordered by ascending time with the earliestobservation at the far left. i.e., 1880, 1890, …, 2000c. [1 point] The bars must have the same fixed width, and there must be some space betweentwo bars, so that the bars do not overlap.d. [3 points] The plot must have visible X and Y axes that scale according to the generatedbars. That is, the axes are driven by the data that they are representing. Likewise, the tickson these axes must adjust automatically based on the values within the datasets, i.e., theymust not be hard-coded.e. [2 point] Set x-axis label to Year and y-axis label to Running Total.f. [1 point] Use a linear scale for the Y axis to represent the running total (recommendedfunction: d3.scaleLinear()).g. [3 points] Use a time scale for the X axis to represent year (recommended function:d3.scaleTime()). It may be necessary to use time parsing / formatting when you load anddisplay the year Data. The axis would be overcrowded if you display every year value so setthe X-axis ticks to display one tick for every 10 years.h. [1 point] Set the HTML title tag and display a title for the plot.■ Position the title Running Total of TMDb Movies by Year above the barplot.■ Set the HTML title tag (i.e., title Running Total of TMDb Movies by Year/title).10 Version 1i. [0.5 points] Add your username (usually includes a mix of letters and numbers,e.g. mhull32) to thearea beneath the bottom-right of the plot (see example image).The barplot should appear similar in style to the sample data plot provided below.Q4 [5 points] OpenRefineOpenRefine is a Java application and requires Java JRE to run. Download and install Java if you do nothave it (you can verify by typing java -version in your computers terminal or command prompt).a. Watch the videos on OpenRefines homepage for an overview of its features. Then, download( httpss://github.com/OpenRefine/OpenRefine/releases/tag/3.3)and install OpenRefine release 3.3.Do not use version 3.4 (which is in beta status)b. Import Dataset● Run OpenRefine and point your browser at 127.0.0.1:3333.● We use a products dataset from Mercari, derived from a Kaggle competition (Mercari PriceSuggestion Challenge). If you are interested in the details, visit the data description page.We have sampled a subset of the dataset provided as properties.csv.● Choose Create Project This Computer properties.csv. Click Next.● You will now see a preview of the data. Click Create Project at the upper right corner.c. Clean/Refine the DataNOTE: OpenRefine maintains a log of all changes. You can undo changes. Use the Undo/Redobutton at the upper left corner. Follow the exact output format specified in every part below.i. [0.5 point] Select the category_name column and choose Facet by Blank (Facet CustomizedFacets Facet by blank) to filter out the records that have blank values in this column. Provide thenumber of rows that return True in Q4Observations.txt. Exclude these rows.11 Version 1Output format and sample values:i.rows: 500ii. [1 point] Split the column category_name into multiple columns without removing the originalcolumn. For example, a row with Kids/Toys/Dolls Accessories in the category_name columnwould be split across the newly created columns as Kids, Toys and Dolls Accessories. Usethe existing functionality in OpenRefine that creates multiple columns from an existing column basedon a separator (i.e., in this case /) and does not remove the original category_name column.Provide the number of new columns that are created by this operation, excluding the originalcategory_name column.Output format and sample values:ii.columns: 10NOTE: There are many possible ways to split the data. While we have provided one way toaccomplish this in step ii, some methods could create columns that are completely empty. In thisdataset, none of the new columns should be completely empty. Therefore, to validate your output,we recommend that you verify that there are no columns that are completely empty, by sorting andchecking for null values.iii. [0.5 points] Select The column name and apply the Text Facet (Facet Text Facet). Cluster byusing (Edit Cells Cluster and Edit ) this opens a window where you can choose differentmethods and keying functions to use while clustering. Choose the keying function that producesthe smallest number of clusters under the Key Collision method. Click Select All and MergeSelected Close. Provide the name of the keying function and the number of clusters that wasproduced.Output format and sample values:iii.function: fingerprint, 200NOTE: Use the default Ngram size when testing Ngram-fingerprint.iv. [1 point] Replace the null values in the brand_name column with the text Unknown (Edit Cells – Transform). Provide the General Refine Evaluation Language (GREL) expression used.Output format and sample values:iv.GREL_categoryname: endsWith(food, ood)v. [1 point] Create a new column high_priced with the values 0 or 1 based on the price columnwith the following conditions: if the price is greater than 90, high_priced should be set as 1, else0. Provide the GREL expression used to perform this.Output format and sample values:v.GREL_highpriced: endsWith(food, ood)vi. [1 point] Create a new column has_offer with the values 0 or 1 based on theitem_description column with the following conditions: If it contains the text discount or offeror sale, then set the value in has_offer as 1, else 0. Provide the GREL expression used toperform this. Convert the text to lowercase before you search for the terms.12 End of HW1Version 1Output format and sample values:vi.GREL_hasoffer: endsWith(food, ood)Deliverables: Place all the files listed below in the Q4 folder● properties_clean.csv : Export the final table as a comma-separated values (.csv) file.● changes.json : Submit a list of changes made to file in json format. Use the Extract OperationHistory option under the Undo/Redo tab to create this file.● Q4Observations.txt : A text file with answers to parts c.i, c.ii, c.iii, c.iv, c.v, c.vi. Provide eachanswer in a new line in the exact output format specified. Your files final formatting should result in a.txt file that has each answer on a new line followed by one blank line (to help visually separately theanswers)Q5 [5 points] Introduction to Python FlaskFlask is a lightweight web application framework written in Python that provides you with tools, libraries andtechnologies to quickly build a web application. It allows you to scale up your application as needed.You will modify the given file: wrangling_scripts/wrangling.pyNOTE: You must only use a version of Python 3.7.0 and 3.8 for this question. You must not use anyother versions (e.g., Python 3.8).NOTE: You must only use the modules and libraries provided at the top of submission.py and modulesfrom the Python Standard Library (except Flask). Pandas and Numpy CANNOT be usedUsername()- Update the username() method inside wrangling.py by including your Username, e.g.mhull32. Get started by installing Flask on your machine by running pip install Flask (Note that youcan optionally create a virtual environment by following the steps here. Creating a virtualenvironment is Purely optional and can be skipped.) To run the code, you must navigate to the Q5 folder in your terminal/command prompt and executethe following command: python run.py. After running the command g”