BUSA90501 Machine Learning
Syndicate Project Description
Weight: 30%

1 Overview

Pairwise relationships are prevalent in real life: for example, friendships between people, communication links between computers, and pairwise similarity of images. Networks provide a way to represent a group of relationships. The entities in question are represented as network nodes and the pairwise relations as edges.

In real network data, there are often missing edges between nodes. This can be due to a bug or deficiency in the data collection process, a lack of resources to collect all pairwise relations, or simply uncertainty about those relationships. Analysis performed on incomplete networks with missing edges can bias the final output. For example, if we want to find the shortest path between two cities in a road network but are missing information about major highways between these cities, then no algorithm will be able to find the actual shortest path.

Furthermore, we might want to predict whether an edge will form between two nodes in the future. For example, in disease transmission networks, if health authorities determine a high likelihood of a transmission edge forming between an infected and an uninfected person, then the authorities might wish to vaccinate the uninfected person.

In this way, being able to predict (and correct for) missing edges is an important task.

Your task: In this project, you will learn from a training network and try to predict whether edges exist among test node pairs.

The training network is a fragment of an academic co-authorship graph. The nodes in the network (authors) have been given randomly assigned IDs, and an undirected edge between nodes A and B represents that authors A and B have published a paper together as co-authors.
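As a rough illustration of the undirected co-authorship graph described above, the sketch below stores each node's neighbours in a set and scores a candidate pair by the neighbours they share. The toy edges, the function names, and the choice of the common-neighbour count and Jaccard coefficient are all assumptions made for illustration; this spec does not prescribe any particular scoring method.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a mapping from node ID to a set of neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        # An undirected edge is recorded in both directions.
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def common_neighbours(neighbours, a, b):
    """Count the co-authors that nodes a and b share."""
    return len(neighbours.get(a, set()) & neighbours.get(b, set()))


def jaccard(neighbours, a, b):
    """Shared co-authors divided by all co-authors of either node (0.0 if none)."""
    na, nb = neighbours.get(a, set()), neighbours.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical toy network (not taken from the project data).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
graph = build_graph(toy_edges)

# Authors 2 and 4 are not connected but share co-author 3.
shared = common_neighbours(graph, 2, 4)
score = jaccard(graph, 2, 4)
```

The Jaccard value always lies in [0, 1], so it could be used directly as a prediction score, although a score this simple is only a starting point.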
The training network covers a fixed time period (2010-2017) and focuses on individuals in a specific academic subcommunity.

Your task is to predict whether an edge will form between two nodes in the future. We provide a development set and a test set as future link information to validate and evaluate your work. The development set is a list of 4,866 edges: 2,433 real edges from the year after the training period (2018) and 2,433 fake edges (pairs of nodes that are not connected). The test data is a list of 4,460 edges: 2,230 are real edges from the year after the development set (2019), while the other 2,230 do not actually exist.

To make the project fun, we will run it as a Kaggle in-class competition. Your assessment will be based partially on your final ranking in the privately held competition, partially on your absolute performance, and partially on your report.

2 Data Format

All data will be available in raw text. The training graph data will be given in a (tab-delimited) adjacency list format, where each row represents a node and its neighbours. For example, an adjacency list in this format represents the network illustrated in Figure 1.

Figure 1: Network diagram for the adjacency list example.

In addition to the edges, you are also provided with a file including several features of the nodes (authors). This file, nodes.json, is in JSON format and includes the following information from 2010-2017 for each author:

- their id in the graph
- the number of years since their first and last publication to 2017 (e.g. first:3 means the author published their first paper in 2014)
- their total number of publications, num_papers
- presence of specific keywords in the titles and abstracts of their publications (denoted keyword_X where X ∈ {0, 1, …, 53}, each being a binary value and only listed if its value is 1)
- publication at specific venues (denoted venue_X where X ∈ {0, 1, …, 303}, each being a binary value and only listed if its value is 1)

This gives you some additional information besides the network structure for your prediction task. [1]

The test edge set is in a comma-separated values (CSV) edge list format, which includes a one-line header followed by a line for each (source node, target node) edge. Your implemented algorithm should take the test CSV file as input and return a 4,461-row CSV file that has (a) in the first row, the string Id,Predicted; and (b) in all subsequent rows, a consecutive integer ID, a comma, then a float in the range [0,1]. These floats are your guesses or predictions as to whether the corresponding test edge was from the co-authorship network or not. Higher predictions correspond to being more confident that the edge is real.

For example, given the test edge set {(3, 5), (4, 12)} as represented in CSV format by

    Id,Source,Sink
    1,3,5
    2,4,12

if your prediction probabilities are 0.1 for edge (3, 5) and 0.99 for edge (4, 12), then your output file should be:

    Id,Predicted
    1,0.1
    2,0.99

The test set will be used to generate an AUC for your performance; you may submit test predictions multiple times per day (if you wish). During the competition, AUC on a 33% subset of the test set will be used to rank you on the public leaderboard. We will use the complete test set to determine your final AUC and ranking.
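The input and output formats above can be sketched as follows. This is a minimal illustration using only the Python standard library. The helper names `read_test_edges` and `write_submission`, and the lookup-table scoring function, are hypothetical; the column names and example values come from the description above.

```python
import csv
import io


def read_test_edges(fileobj):
    """Read rows of Id,Source,Sink (after the one-line header) as integer triples."""
    reader = csv.DictReader(fileobj)
    return [(int(row["Id"]), int(row["Source"]), int(row["Sink"])) for row in reader]


def write_submission(fileobj, edges, score_fn):
    """Write the Id,Predicted header, then one 'id,score' row per test edge."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    for edge_id, source, sink in edges:
        # Clamp to [0, 1] so every row satisfies the required range.
        score = min(max(float(score_fn(source, sink)), 0.0), 1.0)
        writer.writerow([edge_id, score])


# The two-edge example from the text, held in memory instead of on disk.
test_csv = io.StringIO("Id,Source,Sink\n1,3,5\n2,4,12\n")
edges = read_test_edges(test_csv)

# Stand-in for a real model: fixed scores for the two example edges.
example_scores = {(3, 5): 0.1, (4, 12): 0.99}
submission = io.StringIO()
write_submission(submission, edges, lambda s, t: example_scores[(s, t)])
output_text = submission.getvalue()
```

In practice the two `io.StringIO` objects would be replaced by files opened on test-public.csv and on the submission file, and the lookup table by the team's own model.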
The split of the test set during and after the competition is used to discourage you from constructing algorithms that overfit to the leaderboard.

The training graph train.txt, the development edges dev.csv, the development labels dev-labels.csv, the test edges test-public.csv, and a sample submission file sample.csv will be available on the Kaggle competition website. You should use the development set for hyperparameter tuning and model selection, then make predictions on the test set and submit them to the Kaggle competition.

[1] These features were calculated after excluding the hidden test edges from the network, to invalidate trivial approaches for prediction.

3 Links and Check List

Competition link: https://www.kaggle.com/t/aedc05f00c12488792c251818b2dd99e
Team registration: https://forms.gle/C1KTR6GtEavcXHnb7

The Kaggle in-class competition allows you to compete and benchmark against your peers. Please do the following by June 18th, 11pm:

1. Set up one account on Kaggle with your uni email ending in @student.unimelb.edu.au.
2. Your project team is your syndicate team.
3. Connect with your teammates on Kaggle as a Kaggle team. [2] Only submit via the team!
4. Register your team using the team registration Google Forms link above. One registration per team.
5. Complete the Group Agreement form from Canvas and upload it to Canvas, to record teammate expectations within your syndicate.

4 Student Groups

Teams should match assigned syndicate groups. We will mark all teams based on our expectations of what a typical syndicate team could achieve: you might consider roles such as researcher, feature engineering, learning, workflows/scripting, experimentation, ensembling of team models, generating validation data, etc., and divide your identified roles among your team. We expect you to complete a Group Agreement, found on Canvas with this spec, and upload it to Canvas.
We recommend tools such as Slack or Trello for group coordination; you may use your platform of choice.

By the date listed above, please enter the UoM and Kaggle usernames for each team member, along with your Kaggle team name, in the above registration Google Form so that we may match teams to students (one response per team, please).

We encourage active discussion among teams, but please refrain from colluding. Given that your marks are partially dependent on your final ranking in the competition, it is in your interest not to collude.

The Group Agreement is important in the process of group work, in setting internal expectations, and platforms like Slack/Trello/Git logs can be used to document contribution (or lack thereof). In the rare circumstance that a student is penalised for lack of contribution, that student will have the opportunity to appeal. Again, from past experience in a class of this size, we don't expect this process to come into effect for any teams. In the past, students have reported that this kind of project work is challenging, rewarding and fun.

5 Report

1. A brief description of the problem and an introduction of any notation that you adopt in the report.
2. A description of your final approach(es) to link prediction, the motivation and reasoning behind it, and why you think it performed well or not well in the competition.
3. Any other alternatives you considered and why you chose your final approach over these (this may be in the form of empirical evaluation, but it must support your reasoning; examples like "method A got AUC 0.6 and method B got AUC 0.7, hence we use method B", with no further explanation, will be marked down).

[2] See e.g. https://www.quora.com/How-do-I-form-a-team-in-Kaggle

Your description of the algorithm should be clear and concise. You should write it at a level that a postgraduate student can read and understand without difficulty.
If you use any existing algorithms, please do not rewrite the complete description, but provide a summary that shows your understanding and references to the relevant literature. In the report, we will be interested in seeing evidence of your thought processes and reasoning for choosing one algorithm over another.

Dedicate space to describing the features you used and tried, hyperparameter tuning, any interesting details about software setup or your experimental pipeline, and any problems you encountered and what you learned. In many cases these issues are at least as important as the learning algorithm, if not more important.

Report format rules. The report should be submitted as a PDF and be no more than five pages, single column. The font size should be 11 or above. If a report is longer than five pages, we will only read and assess the report up to page five and ignore further pages. (Don't waste space on cover pages.)

6 Submission

In addition to pre-submission of the team registration Google Form and the Group Agreement PDF to Canvas, the final submission will consist of three parts by the overall project deadline:

- A valid submission to the Kaggle in-class competition. This submission must be in the expected format as described above, and produce a place somewhere on the leaderboard. Invalid submissions do not attract marks for the competition portion of grading (see Section 7).
- To Canvas, a zip archive of the source code of your link prediction algorithm in any language, including any scripts for automation, and a README.txt describing in just a few lines what the files are for (but no data, please).
- To Canvas, a written research report in PDF format (see Section 5).

The submission link will be visible in Canvas prior to the deadline.

7 Assessment

The project will be marked out of 30 and contribute 30 percent towards your subject total mark. No late submissions will be accepted. You must inform your lecturer about sickness well before the deadline.
Submit early and often to guard against unexpected last-minute issues.

The assessment in this project will be broken down into two components. The following criteria will be considered when allocating marks.

Based on our experimentation with the project task, we expect that all reasonable efforts at the project will achieve a passing grade or higher.

Kaggle Competition (15/30):

Your final mark for the Kaggle competition is based on your rank in that competition. Assuming N teams of enrolled students compete, there are no ties, and you come in at place R (e.g. first place is 1, last is N) with an AUC of A ∈ [0, 1], then your mark is calculated as

$$
12 \times \frac{\max\{\min\{A,\ 0.80\} - 0.4,\ 0\}}{0.40} + 3 \times \frac{N - R}{N - 1}.
$$

Ties are handled so that you are not penalised by the tie: tied teams receive the rank of the highest team (as if no team were tied). For example, if teams A, B, C, D, E came 1st, 4th, 2nd, 2nd, 5th, then the rank-based mark terms (out of 3) for the five teams would be 3, 0.75, 2.25, 2.25, 0.

This complicated-looking expression can result in marks from 0 all the way to 15. We are weighting more towards your absolute AUC than your ranking. The component out of 12 for AUC gives a score of 0/12 for an AUC of 0.4 or lower; 12/12 for an AUC of 0.8 or higher; and scales linearly over the interval of AUCs [0.4, 0.8]. We believe that an AUC much higher than 0.5 (random classifier) is achievable with minimal work, while an AUC of 0.8 is an excellent result deserving of full marks. For example, an AUC of 0.7 for a team coming last would yield 9/15, or 10.5/15 if coming midway in the class.

The rank-based term encourages healthy competition and discourages collusion.
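The mark calculation above can be checked with a short sketch. The function name `kaggle_mark` is hypothetical, and the class size of five teams (with third place standing in for "midway") is an assumption chosen to reproduce the worked examples in the text.

```python
def kaggle_mark(auc: float, rank: int, n_teams: int) -> float:
    """Return the competition mark out of 15.

    auc     -- final AUC A in [0, 1]
    rank    -- final place R (1 is first, n_teams is last)
    n_teams -- number of competing teams N (at least 2)
    """
    if n_teams < 2:
        raise ValueError("the rank term needs at least two teams")
    # AUC term: 0 at AUC <= 0.4, 12 at AUC >= 0.8, linear in between.
    auc_term = 12 * max(min(auc, 0.80) - 0.4, 0.0) / 0.40
    # Rank term: 3 for first place, falling linearly to 0 for last place.
    rank_term = 3 * (n_teams - rank) / (n_teams - 1)
    return auc_term + rank_term


# Worked examples from the text, assuming a class of N = 5 teams.
last_place = kaggle_mark(0.7, rank=5, n_teams=5)   # roughly 9 out of 15
mid_place = kaggle_mark(0.7, rank=3, n_teams=5)    # roughly 10.5 out of 15

# Rank terms for teams placed 1st, 4th, 2nd, 2nd, 5th (AUC term is 0 at AUC 0.4).
rank_terms = [kaggle_mark(0.4, rank=r, n_teams=5) for r in (1, 4, 2, 2, 5)]
```

The results are floating-point values, so they may differ from the quoted figures by a tiny rounding error.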
The other, AUC-based term rewards teams who don't place at the top but nonetheless achieve good absolute results.

Note that invalid submissions will come last and will attract a mark of 0 for this part, so please ensure your output conforms to the specified requirements.

Report (15/30):

The marking rubric below outlines the criteria that will be used to mark your report.

Critical Analysis (maximum 10 marks):

- 10 marks: Final approach is well motivated and its advantages/disadvantages clearly discussed; thorough and insightful analysis of why the final approach works or does not work for the provided training data; insightful discussion and analysis of other approaches and why they were not used.
- 8 marks: Final approach is reasonably motivated and its advantages/disadvantages somewhat discussed; good analysis of why the final approach works or does not work for the provided training data; some discussion and analysis of other approaches and why they were not used.
- 6 marks: Final approach is somewhat motivated and its advantages/disadvantages are discussed; limited analysis of why the final approach works or does not work for the provided training data; limited discussion and analysis of other approaches and why they were not used.
- 4 marks: Final approach is marginally motivated and its advantages/disadvantages are discussed; little analysis of why the final approach works or does not work for the provided training data; little or no discussion and analysis of other approaches and why they were not used.
- 2 marks: Final approach is barely or not motivated and its advantages/disadvantages are not discussed; no analysis of why the final approach works or does not work for the provided training data; little or no discussion and analysis of other approaches and why they were not used.

Report Clarity and Structure (maximum 5 marks):

- 5 marks: Very clear and accessible description of all that has been done; a postgraduate student can pick up the report and read it with no difficulty.
- 4 marks: Clear description for the most part, with some minor deficiencies/loose ends.
- 3 marks: Generally clear description, but there are notable gaps and/or unclear sections.
- 2 marks: The report is unclear on the whole and the reader has to work hard to discern what has been done.
- 1 mark: The report completely lacks structure, omits all key references and is barely understandable.

Plagiarism policy: You are reminded that all submitted project work in this subject is to be your own individual team's work. Automated similarity checking software will be used to compare submissions. It is University policy that academic integrity be enforced. For more details, please see the policy at https://academichonesty.unimelb.edu.au/policy.html.
