Cardiff School of Computer Science and Informatics

Module Code: CMT655
Module Title: Manipulating and Exploiting Data
Assessment Title: Course Portfolio
Assessment Number: 1
Date Set: Friday 26th March 2021
Submission Date and Time: May 31st 2021, 9:30
Return Date: June 30th 2021

This assignment is worth 100% of the total marks available for this module. If coursework is submitted late (and where there are no extenuating circumstances):

1. If the assessment is submitted no later than 24 hours after the deadline, the mark for the assessment will be capped at the minimum pass mark;
2. If the assessment is submitted more than 24 hours after the deadline, a mark of 0 will be given for the assessment.

Your submission must include the official Coursework Submission Cover sheet, which can be found here: https://docs.cs.cf.ac.uk/downloads/coursework/Coversheet.pdf

Submission Instructions

Your coursework should be submitted via Learning Central by the above deadline. It consists of a portfolio divided into three assessments.

Assessment (1) consists of a set of exercises. The final deliverable is two Jupyter notebooks.

Assessment (2) is a machine-learning-powered web service which is able to train, evaluate and run predictions on unseen data, as well as store model configuration and results in a database. The deliverable is a zip file with the application source code, a README.txt file and (optionally) a requirements.txt file listing the dependencies and versions the app requires to run.

Assessment (3) is a reflective report (up to 2,000 words) describing solutions and design choices, and reflecting on the main challenges and ethical considerations addressed during the development of the solutions for Assessments (1) and (2).

You have to upload the following files:

Description  | Type                                       | Name
Cover sheet  | Compulsory – one PDF (.pdf) file           | [student number].pdf
Assessment 1 | Compulsory – one Jupyter notebook (.ipynb) | assessment1_db_creation_[student number].ipynb
Assessment 1 | Compulsory – one Jupyter notebook (.ipynb) | assessment1_queries_[student number].ipynb
Assessment 2 | Compulsory – one zip file (.zip)           | assessment2_webapp_[student number].zip
Assessment 3 | Compulsory – one PDF (.pdf) file           | assessment3_report_[student number].pdf

Any deviation from the submission instructions above (including the number and types of files submitted) may result in a mark of zero for the assessment or question part.

Assignment

In this portfolio, students demonstrate their familiarity with the topics covered in the module via three separate assessments.

Deliverable

Assessment 1
The deliverable for Assessment 1 consists of two Jupyter notebook (.ipynb) files. They must be submitted with all the output cells executed in a fresh run (i.e., Kernel – Restart and Run All). 20 marks.

Assessment 2
The deliverable for Assessment 2 is a zip file containing the webapp code, a README.txt and an optional requirements.txt file, which lists the dependencies the app requires. 25 marks.

Assessment 3
The deliverable for Assessment 3 is a PDF file based on the .docx template provided for this assessment in the starter package, available at Learning Central. 55 marks.

Assessment 1

In Assessment 1, students solve two main types of challenges: (1) data modeling and (2) database querying.

DATA MODELING AND QUERYING (20 Marks)
assessment1_db_creation_[student number].ipynb
assessment1_queries_[student number].ipynb

1. Data modeling (8 marks)

You are given an initial .csv dataset from Reddit (data_portfolio_21.csv, available in the starter package in Learning Central).
This data dump contains posts extracted from Covid-related subreddits, as well as from random subreddits. Your first task is to process this dump and design, create and implement a relational (MySQL) database, which you will then populate with all the posts and related data.

The dataset has information about three entities: posts, users and subreddits. The column names are self-explanatory: columns starting with the prefix user_ describe users, those starting with the prefix subr_ describe subreddits, the column subreddit is the subreddit name, and the rest of the columns are post attributes (author, post date, post title and text, number of comments, score, favorited by, etc.).

What to implement: Start from the notebook assessment1_db_creation_[student number].ipynb, replacing [student number] with your student number. Implement the following (not necessarily in this order):

– Python logic for reading in the data. [2 marks]
– SQL code for creating tables. [3 marks]
– SQL code for populating tables. [3 marks]

Use comments or markdown along the way to explain how you dealt with issues such as missing data, non-standard data types, multivalued columns, etc. You are not required to explain the database design process (normalization, integrity, constraints, etc.) in this notebook, as there is a dedicated part of the report in Assessment 3 for this. However, you may include pointers to design choices to make your implementation easier to follow.

All your code should be self-contained in Python, and therefore you will have to rely on a MySQL library for executing SQL statements and queries. Please use pymysql, the one we have used in class (a minimal, illustrative sketch is given at the end of this section).

You should submit your notebook with all the cells executed, from start to finish, in a fresh run (i.e., the first cell number should be [1], the second [2], etc.). You can achieve this by selecting Kernel – Restart and Run All. At the end of the run, your notebook should have populated a database on the university server which you will have created exclusively for this coursework.
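For orientation only, the snippet below is a minimal sketch of this read-create-populate workflow using pandas and pymysql. The connection details, the users table and the user_name / user_registered_at columns are illustrative assumptions made for this sketch; your own schema, constraints and handling of missing or multivalued data will differ and must be your own design.

    import pandas as pd
    import pymysql

    # Read the dump (assumes the CSV sits next to the notebook).
    df = pd.read_csv("data_portfolio_21.csv")

    # Connection details are placeholders for your own credentials.
    conn = pymysql.connect(host="localhost", user="your_user",
                           password="your_password", database="your_db")

    with conn.cursor() as cur:
        # Create one illustrative table; drop it first so the notebook re-runs cleanly.
        cur.execute("DROP TABLE IF EXISTS users")
        cur.execute("""
            CREATE TABLE users (
                user_name VARCHAR(255) PRIMARY KEY,
                user_registered_at DATETIME NULL
            )
        """)
        # Populate it with parameterised INSERTs, one row per unique user.
        insert_sql = "INSERT INTO users (user_name, user_registered_at) VALUES (%s, %s)"
        for _, row in df.drop_duplicates(subset=["user_name"]).iterrows():
            # Map pandas missing values to SQL NULL instead of the string 'nan'.
            registered = None if pd.isna(row["user_registered_at"]) else row["user_registered_at"]
            cur.execute(insert_sql, (row["user_name"], registered))

    conn.commit()
    conn.close()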
2. Querying (12 marks)

You are given a set of questions in natural language, for which you must implement queries to find the answer. While the queries will be answered in the provided Jupyter notebook, they have to be written in SQL, i.e., you cannot use Python to solve them.

What to implement: Start from the notebook assessment1_queries_[student number].ipynb, replacing [student number] with your student number. All the logic should be contained inside the provided (empty) functions. Then, a call to each function should show the output of these queries (an illustrative pattern is sketched after the question list). You are also required to submit your notebook after a fresh run (Kernel – Restart and Run All).

The questions are:

1 – Users with highest scores over time [0.5 marks]
● Implement a query that returns the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order. Your query should return two columns: username and aggr_scores.

2 – Favorite subreddits with numbers but not 19 [0.5 marks]
● Implement a query that returns the set of subreddit names that have been favorited at least once and that contain any number in their name, excluding those with the digit 19, as we want to filter out COVID-19 subreddit names. Your query should only return one column: subreddit.

3 – Most active users who add subreddits to their favorites [0.5 marks]
● Implement a query that returns the top 20 users in terms of the number of subreddits they have favorited. Since several users have favorited the same number of subreddits, you need to order your results, first, by the number of favourites per user and, secondly, alphabetically by username. The alphabetical order should be, first, any number, then A-Z (irrespective of case). Your query should return two columns: username and numb_favs.

4 – Awarded posts [0.5 marks]
● Implement a query that returns the number of posts that have received at least one award. Your query should return only one value.

5 – Find Covid subreddits in name and description [1 mark]
● Implement a query that retrieves the name and description of all subreddits whose name starts with covid or corona and whose description contains covid anywhere. The returned table should have two columns: name and description.

6 – Find users in haystack [1 mark]
● Implement a query that retrieves only the names of those users who have at least 3 posts with the same score as their number of comments, and whose username contains the string meme anywhere. Your returned table should contain only one column: username.

7 – Subreddits with the highest average upvote ratio [1 mark]
● Implement a query that shows the top 10 subreddits in terms of the average upvote ratio of the users that posted in them. Your query should return two columns: subr_name and avg_upv_ratio.

8 – What are the chances [1 mark]
● Implement a query that finds those posts whose length (in number of characters) is exactly the same as the length of the description of the subreddit they were posted in. You should retrieve the following columns: subreddit_name, posting_user, user_registered_at, post_full_text, post_description and dif (which should show the difference in characters between the subreddit description and the post).

9 – Most active December 2020 days [1 mark]
● Write a query that retrieves only a ranked list of the most prolific days in December 2020, where prolific is measured in number of posts per day. Your query should return those days in a single-column table (column name post_day) in the format YYYY-MM-DD.

10 – Top covid-mentioning users [1 mark]
● Retrieve the top 5 users in terms of how often they have mentioned the term covid in their posts. Your query should return two columns: username and total_count. You should count an occurrence of the word covid only when it appears with a whitespace before and after it (i.e., " covid ") and irrespective of case (both " Covid " and " covid " would be valid hits).

11 – Top 10 users whose posts reached the most users, but only in their favorite subreddits [2 marks]
● Write a query to retrieve a list of 10 users sorted in descending order by the number of users their posts reached, considering only the subset of users belonging to their favourite subreddits. Your query must return only one column: username.

12 – Users with high score for their posts [2 marks]
● Retrieve the number of users whose average post score is higher than the average score of all the posts in our dataset. Your query should return only one result, under the column result.
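The fragment below is only a sketch of the expected notebook pattern, using question 1 as an example. It assumes a pymysql connection like the one above and an illustrative posts table with author and score columns; it is not a model answer, and your table and column names will depend on your own schema.

    import pymysql

    conn = pymysql.connect(host="localhost", user="your_user",
                           password="your_password", database="your_db")

    def users_with_highest_scores():
        # Question 1: aggregate score per user, keep those above 10,000, descending.
        sql = """
            SELECT author AS username, SUM(score) AS aggr_scores
            FROM posts
            GROUP BY author
            HAVING SUM(score) > 10000
            ORDER BY aggr_scores DESC
        """
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

    # Calling the function in its own cell shows the query output.
    users_with_highest_scores()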
Assessment 2

In Assessment 2, you implement a Flask application which manages a machine-learning-based text classifier and speaks to both MongoDB and MySQL databases.

WEBAPP (25 Marks)
assessment2_webapp_[student number].zip

In this assessment the goal is to set up a web service based on Flask which will sit on top of the database you built in Assessment 1 and will have a machine learning component. Specifically, the app will have several functionalities for training, evaluating and deploying a covid-or-not classifier, which takes as input a message posted on social media (e.g., Reddit) and predicts whether it is about Covid-19 or not.

What to implement: You will pull your data from the MySQL database that you implemented in Assessment 1. Then, your task is to develop a web application based on Flask which has the following functionalities:

a) Run a classification experiment and store results, models and configuration in a MongoDB database;
b) Retrieve results for the experiments done so far, ranked based on a criterion of your choice; and
c) Perform inference, i.e., given a piece of text provided by the user, predict whether it is about Covid-19 or not.

You are provided with an empty skeleton which contains starter HTML and Python code. Your task is to implement the backend logic following the detailed instructions below. The provided skeleton has an index.html page with the layout shown in Figure 1.

Figure 1: Landing page of the ML-powered Flask application.

Your task will then be to implement the backend logic that is triggered when each of the three buttons shown in Figure 1 is clicked. The logic corresponding to each of these buttons is explained in detail below; an illustrative, non-prescriptive sketch of exercise 1 is included after its description.

1. Run a classification experiment [15 marks]

In this exercise, you have to implement the following workflow:

a) Reset, create and verify two VIEWS, which you will call training_data and test_data. These VIEWS should contain non-overlapping posts which you will use to train and evaluate your classifier. The logic will be implemented in the following (empty) functions, located in the main.py script, provided in the starter package:

- reset_views() – Drop (if they exist) and create the views. [1 mark]
- create_training_view() – Create an SQL VIEW for training data. This view will have two columns, the text and the label. It is up to you to decide the size of the dataset, the proportion of Covid vs. non-Covid posts, and which part of the post you take (title, body or both). You will justify these choices in Assessment 3. We will make the strong assumption that any post submitted to a Covid-related subreddit is about Covid-19, and that it is not related to Covid if it is submitted anywhere else. [3 marks]
- create_test_view() – Create an SQL VIEW for test data. This view will have two columns, the text and the label. The same principles as in the previous case apply. [3 marks]
- check_views() – Retrieve all records in training_data and test_data, and print their sizes to the console. This is a small sanity check. [1 mark]

b) Retrieve data from the views you created in step (a), train and evaluate a classifier, and return the results as a JSON object that is rendered in the browser. Implement this functionality in the experiment() method, again in the main.py script. It is up to you to decide on the classifier and its configuration. There is an opportunity to reflect on these choices in Assessment 3. [5 marks]

c) Take the model binaries, model configuration and the classification results you obtained in step (b), as well as the time at which the experiment was performed, and store this information in a dedicated collection. This exercise is open, i.e., there is no suggested format for how to store this data, what information to store for your models or which evaluation metrics to use. There is an opportunity to reflect on these choices in Assessment 3. [2 marks]
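As a point of reference only, here is a partial sketch of exercise 1. The function names (reset_views, create_training_view, create_test_view, check_views, experiment) come from the starter skeleton; everything else — the connection details, the posts/subreddits tables and their columns, the covid/corona name heuristic, the split on post_id, the TF-IDF + logistic regression pipeline, the /experiment route, the local MongoDB instance and the storage format — is an assumption made for illustration, not the required design.

    import datetime
    import pickle

    import pymysql
    from flask import Flask, jsonify
    from pymongo import MongoClient
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline

    app = Flask(__name__)

    def get_conn():
        # Placeholder credentials; one short-lived connection per call keeps the sketch simple.
        return pymysql.connect(host="localhost", user="your_user",
                               password="your_password", database="your_db")

    def reset_views():
        # Drop the views if they exist so the experiment can be re-run cleanly.
        with get_conn().cursor() as cur:
            cur.execute("DROP VIEW IF EXISTS training_data")
            cur.execute("DROP VIEW IF EXISTS test_data")

    def create_training_view():
        # Label 1 for posts in Covid-related subreddits, 0 otherwise (the brief's strong
        # assumption); MOD(post_id, 5) is just one way to carve out ~80% of posts for training.
        with get_conn().cursor() as cur:
            cur.execute("""
                CREATE VIEW training_data AS
                SELECT p.title AS text,
                       IF(s.subr_name LIKE 'covid%' OR s.subr_name LIKE 'corona%', 1, 0) AS label
                FROM posts p JOIN subreddits s ON p.subreddit_id = s.subreddit_id
                WHERE MOD(p.post_id, 5) > 0
            """)

    def create_test_view():
        # The held-out complement of the training view (the remaining ~20% of posts).
        with get_conn().cursor() as cur:
            cur.execute("""
                CREATE VIEW test_data AS
                SELECT p.title AS text,
                       IF(s.subr_name LIKE 'covid%' OR s.subr_name LIKE 'corona%', 1, 0) AS label
                FROM posts p JOIN subreddits s ON p.subreddit_id = s.subreddit_id
                WHERE MOD(p.post_id, 5) = 0
            """)

    def check_views():
        # Sanity check: print the size of each view to the console.
        with get_conn().cursor() as cur:
            for view in ("training_data", "test_data"):
                cur.execute(f"SELECT COUNT(*) FROM {view}")
                print(view, cur.fetchone()[0])

    @app.route("/experiment")
    def experiment():
        reset_views()
        create_training_view()
        create_test_view()
        check_views()
        # Step (b): pull the views, train and evaluate one possible classifier.
        with get_conn().cursor() as cur:
            cur.execute("SELECT text, label FROM training_data")
            X_train, y_train = zip(*cur.fetchall())
            cur.execute("SELECT text, label FROM test_data")
            X_test, y_test = zip(*cur.fetchall())
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        results = {"f1": float(f1_score(y_test, model.predict(X_test)))}
        # Step (c): persist binaries, configuration, results and a timestamp in MongoDB.
        MongoClient()["covid_app"]["experiments"].insert_one({
            "model": pickle.dumps(model),
            "config": str(model.get_params()),
            "results": results,
            "timestamp": datetime.datetime.utcnow(),
        })
        return jsonify(results)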
2. Retrieve information on the experiments conducted so far (5 marks)

a) In this exercise, you query the collection you implemented in step 1c, and show the top 3 experiments based on a certain criterion (best scoring according to metric X, the most recent experiments, the fastest experiments in training time, etc.). Your results will be returned as JSON objects and rendered in the browser. [5 marks]

3. Implement a covid-or-not predictor (5 marks)

a) In this exercise, you implement a functionality for predicting on the fly whether a piece of text is Covid-19-related or not. To this end, you will use the top-ranked model according to the ranking you implemented in step 2a. This model will then be applied to the input text and the results will be rendered in the browser as a JSON object with the format:

{input_text: some_input_text, prediction: the_prediction_of_your_classifier}. [5 marks]
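Continuing the illustrative assumptions of the previous sketch (the same Flask app object and a local covid_app database with an experiments collection holding a pickled model, its results and a timestamp), the fragment below shows one possible shape for exercises 2 and 3. The route names, the ranking criterion (highest F1) and the input_text form field are assumptions, not requirements.

    import pickle

    from flask import jsonify, request
    from pymongo import MongoClient

    experiments = MongoClient()["covid_app"]["experiments"]

    @app.route("/experiments")
    def top_experiments():
        # Exercise 2: top 3 experiments ranked by F1 score (one possible criterion),
        # excluding the model binary from the projection so the JSON stays small.
        top3 = experiments.find({}, {"model": 0}).sort("results.f1", -1).limit(3)
        docs = [{**doc, "_id": str(doc["_id"])} for doc in top3]
        return jsonify(docs)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Exercise 3: load the best-ranked model and classify the submitted text.
        text = request.form["input_text"]
        best = experiments.find_one(sort=[("results.f1", -1)])
        model = pickle.loads(best["model"])
        return jsonify({"input_text": text,
                        "prediction": int(model.predict([text])[0])})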
Assessment 3

Report (55 Marks)
assessment3_report_[student number].pdf

In Assessment 3, you write a technical report on Assessments 1 and 2, and discuss the ethical, legal and social implications of the development of this Covid-19 application in the context of the UK Data Ethics Framework. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support, and with additional empirical evidence (your own tests) to justify your decisions (e.g., performance or physical storage for the DBMS, training time or accuracy for the ML webapp solution). Maximum 2,000 words (not counting references, index, and table and figure captions).

This report should cover the following aspects, discussing the challenges and problems encountered and the solutions implemented to overcome them. The mark will be divided between three expected sections:

o [3a] Database Creation (DB choice, design, etc.), i.e., the research and findings stemming from the development of Assessment 1. Specifically, you should discuss any business rules that can be inferred from the dataset (reverse-engineering), normalization (identifying partial and transitive dependencies, if any, unnormalized relations, etc.), data integrity and constraints, bad data, etc. Moreover, the expectation is that any design decision (or lack thereof) will be supported empirically (e.g., with performance tests) and/or theoretically (pointing to peer-reviewed publications). [20 Marks]

o [3b] ML Application, explaining the implementation of the training and test VIEWS; the ML algorithm chosen (based on main features, hyperparameters used in the application, training speed as opposed to other alternatives, etc.); evaluation metrics; the overall logic followed by the app for storing and retrieving experimental results; and, finally, any further details that may be relevant to the covid-or-not inference functionality. You should also discuss the rationale behind the MongoDB interaction, with pointers both to the database and to the code that interacts with it. [20 Marks]

o [3c] Ethics and Bias in Data-driven Solutions, in the specific context of this dataset and the broader area of application of this project (automatic categorization of social media content to enable easier screening of public opinion). You should map your discussion to one of the five actions outlined in the UK's Data Ethics Framework, prioritizing the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three key principles outlined in the Framework, namely transparency, accountability and fairness. Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case. [15 Marks]

Learning Outcomes Assessed

This coursework covers the 7 LOs listed in the module description.

Criteria for assessment

Credit will be awarded against the following criteria.