
Support Vector Machines
MIT 15.097 Course Notes

Let's start with some intuition about margins. The margin of an example $x_i$ is its distance from the decision boundary, which (for a suitably normalized $f$) equals $y_i f(x_i)$. The margin is positive if the example is on the correct side of the decision boundary, and negative otherwise.

Here's the intuition for SVMs: we want all examples to have large margins, i.e., to be as far from the decision boundary as possible. That way the decision boundary is more stable, and we are confident in all decisions. Most other algorithms (logistic regression, decision trees, the perceptron) don't generally produce large margins. (AdaBoost generally does produce large margins.)

As in logistic regression and AdaBoost, the function $f$ is linear:
$$f(x) = \sum_{j=1}^{m} \theta^{(j)} x^{(j)} + \theta_0.$$
Note that the intercept term could be swept into $x$ by adding a 1 as the last component of each $x$; then $f(x)$ would be just $\theta^T x$. For this lecture we'll keep the intercept term separate, because the SVM handles that term differently than if you put the intercept in as a separate feature. We classify $x$ using $\mathrm{sign}(f(x))$.

If $x_i$ has a large margin, we are confident that we classified it correctly. So we're essentially suggesting to use the margin $y_i f(x_i)$ to measure the confidence in our prediction.

But there is a problem with using $y_i f(x_i)$ to measure confidence in prediction: there is some arbitrariness about it. How? (Rescaling $\theta$ and $\theta_0$ by any positive constant rescales $y_i f(x_i)$ without moving the decision boundary.) What should we do about it?

SVMs maximize the distance from the decision boundary to the nearest training example; that is, they maximize the minimum margin. There is a geometric perspective too. I set the intercept to zero for this picture (so the decision boundary passes through the origin): the decision boundary is the set of $x$'s where $\theta^T x = 0$.
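Before continuing with the geometry, the arbitrariness just mentioned can be checked numerically. Here is a minimal sketch with made-up numbers (not from the notes): rescaling $(\theta, \theta_0)$ changes the functional margin but not the distance to the boundary.

```python
import math

# Hypothetical numbers for illustration (not from the notes).
theta, theta0 = [2.0, -1.0], 0.5        # candidate f(x) = theta . x + theta0
x, y = [1.0, 3.0], -1                   # one labeled example

def functional_margin(theta, theta0, x, y):
    return y * (sum(t * xj for t, xj in zip(theta, x)) + theta0)

def geometric_margin(theta, theta0, x, y):
    # Signed distance from x to the hyperplane theta . x + theta0 = 0.
    return functional_margin(theta, theta0, x, y) / math.hypot(*theta)

# Scaling (theta, theta0) by c > 0 leaves the decision boundary (and the
# geometric margin) unchanged but multiplies the functional margin by c:
c = 10.0
big_theta, big_theta0 = [c * t for t in theta], c * theta0
print(functional_margin(theta, theta0, x, y))            # 0.5
print(functional_margin(big_theta, big_theta0, x, y))    # 5.0
print(geometric_margin(theta, theta0, x, y))             # 0.5 / sqrt(5)
print(geometric_margin(big_theta, big_theta0, x, y))     # same value
```

This is exactly why the derivation below normalizes by $\|\theta\|_2$.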
That means the unit vector for $\theta$ must be perpendicular to those $x$'s that lie on the decision boundary. Now that you have the intuition, we'll put the intercept back, and we have to translate the decision boundary, so it's really the set of $x$'s where $\theta^T x + \theta_0 = 0$.

The margin of example $i$ is denoted $\gamma_i$. Let $B$ be the point on the decision boundary closest to the positive example $x_i$. Then
$$B = x_i - \gamma_i \frac{\theta}{\|\theta\|_2},$$
since we moved $\gamma_i$ units along the unit vector $\theta / \|\theta\|_2$ to get from the example to $B$. Since $B$ lies on the decision boundary, it obeys $\theta^T x + \theta_0 = 0$, where $x$ is $B$ (I wrote the intercept there explicitly). That is,
$$\theta^T \Big( x_i - \gamma_i \frac{\theta}{\|\theta\|_2} \Big) + \theta_0 = 0,$$
and solving for $\gamma_i$,
$$\gamma_i = \Big( \frac{\theta}{\|\theta\|_2} \Big)^T x_i + \frac{\theta_0}{\|\theta\|_2}.$$
Note that here we normalized by $\|\theta\|_2$ so we wouldn't have the arbitrariness in the meaning of the margin. If the example is negative, the same calculation works, with a few sign flips (we'd need to move $-\gamma_i$ units rather than $\gamma_i$ units). So the geometric margin from the picture,
$$\gamma_i = y_i \Big( \Big( \frac{\theta}{\|\theta\|_2} \Big)^T x_i + \frac{\theta_0}{\|\theta\|_2} \Big),$$
is the same as the functional margin $y_i f(x_i)$ once $\theta$ is normalized.

Maximize the minimum margin

Support vector machines maximize the minimum margin. They would like all examples to be far from the decision boundary, so they'll choose $f$ this way:
$$\max_{\gamma, \theta, \theta_0} \gamma \quad \text{s.t.} \quad y_i \frac{\theta^T x_i + \theta_0}{\|\theta\|_2} \ge \gamma, \quad i = 1 \ldots m.$$
For any $\theta$ and $\theta_0$ that satisfy this, any positively scaled multiple satisfies them too, so we can arbitrarily set $\|\theta\|_2 = 1/\gamma$ so that the right side is 1. Now when we maximize $\gamma$, we're maximizing $\gamma = 1 / \|\theta\|_2$. The problem becomes
$$\min_{\theta, \theta_0} \frac{1}{2} \|\theta\|_2^2 \quad \text{s.t.} \quad y_i (\theta^T x_i + \theta_0) - 1 \ge 0, \quad i = 1 \ldots m \qquad (1)$$
(the 1/2 and square are just for convenience), which is the same as
$$\min_{\theta, \theta_0} \frac{1}{2} \|\theta\|_2^2 \quad \text{s.t.} \quad -\big[ y_i (\theta^T x_i + \theta_0) - 1 \big] \le 0, \quad i = 1 \ldots m.$$
Writing the KKT conditions, starting with Lagrangian stationarity, where we need to find the gradient with respect to $\theta$ and the derivative with respect to $\theta_0$:
$$\nabla_\theta L([\theta, \theta_0], \alpha) = \theta - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Longrightarrow\; \theta = \sum_{i=1}^m \alpha_i y_i x_i.$$
$$\frac{\partial}{\partial \theta_0} L([\theta, \theta_0], \alpha) = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Longrightarrow\; \sum_{i=1}^m \alpha_i y_i = 0.$$
$$\alpha_i \ge 0 \qquad \text{(dual feasibility)}$$
$$\alpha_i \big[ -y_i (\theta^T x_i + \theta_0) + 1 \big] = 0 \qquad \text{(complementary slackness)}$$
$$-y_i (\theta^T x_i + \theta_0) + 1 \le 0. \qquad \text{(primal feasibility)}$$
Using the KKT conditions, we can simplify the Lagrangian:
$$L([\theta, \theta_0], \alpha) = \frac{1}{2} \|\theta\|_2^2 - \theta^T \sum_{i=1}^m \alpha_i y_i x_i - \sum_{i=1}^m \alpha_i y_i \theta_0 + \sum_{i=1}^m \alpha_i \qquad (2)$$
(We just expanded terms.
Now we'll plug in the first KKT condition.)
$$= -\frac{1}{2} \|\theta\|_2^2 - \theta_0 \sum_{i=1}^m \alpha_i y_i + \sum_{i=1}^m \alpha_i$$
(Plug in the second KKT condition.)
$$= -\frac{1}{2} \|\theta\|_2^2 + \sum_{i=1}^m \alpha_i.$$
Again using the first KKT condition, we can rewrite the first term:
$$\frac{1}{2} \|\theta\|_2^2 = \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k.$$
Plugging back into the Lagrangian (2), which now only depends on $\alpha$, and putting in the second and third KKT conditions gives us the dual problem:
$$\max_\alpha L(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{s.t.} \quad \alpha_i \ge 0, \; i = 1 \ldots m, \quad \sum_{i=1}^m \alpha_i y_i = 0. \qquad (3)$$
We'll use the last two KKT conditions in what follows, for instance to get conditions on $\theta_0$, but what we've already done is enough to define the dual problem for $\alpha$.

We can solve this dual problem. Either (i) we'd use a generic quadratic programming solver, or (ii) we'd use another algorithm, like SMO, which I will discuss later. For now, assume we solved it, so we have $\alpha_1^*, \ldots, \alpha_m^*$. We can use the solution of the dual problem to get the solution of the primal problem: plug $\alpha^*$ into the first KKT condition to get
$$\theta^* = \sum_{i=1}^m \alpha_i^* y_i x_i, \qquad (4)$$
but we can see something cool in the process.

Support Vectors

Look at the complementary slackness KKT condition and the primal and dual feasibility conditions. For each $i$, either $\alpha_i > 0$, in which case the constraint is active and $y_i(\theta^T x_i + \theta_0) = 1$; or $\alpha_i = 0$, in which case $y_i(\theta^T x_i + \theta_0) \ge 1$. The examples in the first category, for which the scaled margin is 1 and the constraints are active, are called support vectors. They are the closest to the decision boundary.

Finish What We Were Doing Earlier

To get $\theta_0^*$, use the complementarity condition for any of the support vectors (in other words, use the fact that the unnormalized margin of the support vectors is one). If you take a positive support vector ($y_i = 1$), then
$$\theta_0^* = 1 - \theta^{*T} x_i.$$
Written another way, since the support vectors have the smallest margins,
$$\theta_0^* = 1 - \min_{i : y_i = 1} \theta^{*T} x_i.$$
So that's the solution! Just to recap: to get the scoring function $f^*$ for the SVM, you'd compute $\alpha^*$ from the dual problem (3), plug it into (4) to get $\theta^*$, plug that into the equation above to get $\theta_0^*$, and that's the solution to the primal problem, and the coefficients for $f^*$.

[Figure: the support vectors. Image by MIT OpenCourseWare.]

Because of the form of the solution,
$$f^*(x) = \sum_{i=1}^m \alpha_i^* y_i \, x_i^T x + \theta_0^*,$$
it is possible that $f^*$ is very fast to calculate. Why is that?
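As a quick aside (not in the original notes), the recap can be traced end to end on a made-up two-point dataset, where the constraint $\sum_i \alpha_i y_i = 0$ reduces the dual to a single variable:

```python
# Toy problem (hypothetical data, not from the notes):
# x1 = (1, 0) with y1 = +1, and x2 = (-1, 0) with y2 = -1.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]

def dual(alpha):
    """L(alpha) = sum_i alpha_i - 1/2 sum_{i,k} alpha_i alpha_k y_i y_k x_i^T x_k."""
    quad = sum(alpha[i] * alpha[k] * y[i] * y[k] * dot(X[i], X[k])
               for i in range(2) for k in range(2))
    return sum(alpha) - 0.5 * quad

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here, and
# L(a, a) = 2a - 2a^2, which is maximized at a = 1/2. Check it on a grid:
grid = [k / 1000.0 for k in range(2001)]
best = max(grid, key=lambda a: dual([a, a]))     # 0.5

# Recover the primal solution via the KKT conditions:
alpha = [best, best]
theta = [sum(alpha[i] * y[i] * X[i][d] for i in range(2)) for d in range(2)]  # eq. (4)
theta0 = 1.0 - min(dot(theta, X[i]) for i in range(2) if y[i] == 1)
# theta = [1, 0] and theta0 = 0: the boundary is the vertical axis, and both
# points are support vectors with unnormalized margin exactly 1.
```

Both points end up with active constraints, matching the complementary-slackness picture above.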
Think support vectors.

The Nonseparable Case

If there is no separating hyperplane, there is no feasible solution to the problem we wrote above, and most real problems are nonseparable. Let's fix our SVM so it can accommodate the nonseparable case. The new formulation will penalize mistakes the farther they are from the decision boundary. So we are allowed to make mistakes now, but we pay a price.

Let's change our primal problem (1) to this new primal problem:
$$\min_{\theta, \theta_0, \xi} \frac{1}{2} \|\theta\|_2^2 + C \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y_i (\theta^T x_i + \theta_0) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1 \ldots m. \qquad (5)$$
So the constraints allow some slack of size $\xi_i$, but we pay a price for it in the objective. That is, if $y_i f(x_i) \ge 1$ then $\xi_i$ gets set to 0 and the penalty is 0. Otherwise, $y_i f(x_i) = 1 - \xi_i$ and we pay price $\xi_i$. The parameter $C$ trades off between the twin goals of making $\frac{1}{2}\|\theta\|_2^2$ small (making what-was-the-minimum-margin $1/\|\theta\|_2$ large) and ensuring that most examples have margin at least $1/\|\theta\|_2$.

Going on a Little Tangent

Rewrite the penalty another way: if $y_i f(x_i) \ge 1$, zero penalty; else, pay price $\xi_i = 1 - y_i f(x_i)$. Third time's the charm: pay price
$$\xi_i = \lfloor 1 - y_i f(x_i) \rfloor_+,$$
where the notation $\lfloor z \rfloor_+$ means the maximum of $z$ and 0. Equation (5) becomes
$$\min_{\theta, \theta_0} \frac{1}{2} \|\theta\|_2^2 + C \sum_{i=1}^m \lfloor 1 - y_i f(x_i) \rfloor_+.$$
Does that look familiar?

The Dual for the Nonseparable Case

Form the Lagrangian of (5):
$$L(\theta, \theta_0, \xi, \alpha, r) = \frac{1}{2} \|\theta\|_2^2 + C \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i \big[ y_i (\theta^T x_i + \theta_0) - 1 + \xi_i \big] - \sum_{i=1}^m r_i \xi_i,$$
where the $\alpha_i$'s and $r_i$'s are Lagrange multipliers (constrained to be $\ge 0$). The dual turns out to be (after some work):
$$\max_\alpha \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1 \ldots m, \quad \sum_{i=1}^m \alpha_i y_i = 0. \qquad (6)$$
So the only difference from the original dual problem (3) is that $0 \le \alpha_i$ was changed to $0 \le \alpha_i \le C$. Neat!

Solving the dual problem with SMO

SMO (Sequential Minimal Optimization) is a type of coordinate ascent algorithm, but adapted to the SVM so that the solution always stays within the feasible region.

Start with (6). Let's say you want to hold $\alpha_2, \ldots, \alpha_m$ fixed and take a coordinate step in the first direction; that is, change $\alpha_1$ to maximize the objective in (6). Can we make any progress? Can we get a better feasible solution by doing this?

It turns out, no. Look at the constraint in (6), $\sum_{i=1}^m \alpha_i y_i = 0$. This means
$$\alpha_1 = -y_1 \sum_{i=2}^m \alpha_i y_i,$$
so, since $\alpha_2, \ldots, \alpha_m$ are fixed, $\alpha_1$ is also fixed. So, if we want to update any of the $\alpha_i$'s, we need to update at least two of them simultaneously to keep the solution feasible (i.e., to keep the constraints satisfied).

Start with a feasible vector $\alpha$. Let's update $\alpha_1$ and $\alpha_2$, holding $\alpha_3, \ldots, \alpha_m$ fixed. What values of $\alpha_1$ and $\alpha_2$ are we allowed to choose? Again, the constraint is $\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^m \alpha_i y_i =: \zeta$ (a fixed constant). We are only allowed to choose $\alpha_1, \alpha_2$ on this line, so when we pick $\alpha_2$, we get $\alpha_1$ automatically from
$$\alpha_1 = \frac{1}{y_1} (\zeta - \alpha_2 y_2) = y_1 (\zeta - \alpha_2 y_2)$$
($1/y_1 = y_1$ since $y_1 \in \{+1, -1\}$). Also, the other constraints in (6) say $0 \le \alpha_1, \alpha_2 \le C$. So $\alpha_2$ needs to be within some interval $[L, H]$ (in order for $\alpha_1$ to stay within $[0, C]$), where we will always have $0 \le L, H \le C$. To do the coordinate ascent step, we will optimize the objective over $\alpha_2$, keeping it within $[L, H]$. Intuitively, (6) becomes:
$$\max_{\alpha_2 \in [L, H]} \; \alpha_1 + \alpha_2 + \text{constants} - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k, \quad \text{where } \alpha_1 = y_1 (\zeta - \alpha_2 y_2). \qquad (7)$$
The objective is quadratic in $\alpha_2$. This means we can just set its derivative to 0 to optimize it and get $\alpha_2$ for the next iteration of SMO. If the optimal value is outside of $[L, H]$, just choose $\alpha_2$ to be either $L$ or $H$ for the next iteration. For instance, if a plot of (7)'s objective is a downward-opening parabola whose peak lies outside $[L, H]$ (sometimes it doesn't look like this; sometimes it's upside-down), then we'll choose the endpoint of $[L, H]$ nearest the peak. [Figure omitted. Image by MIT OpenCourseWare.]

Note: there are heuristics for choosing the order in which the $\alpha_i$'s are updated.

MIT OpenCourseWare
https://ocw.mit.edu
15.097 Prediction: Machine Learning and Statistics, Spring 2012
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
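As an appendix-style sketch (not part of the original notes): the pair update described above (pick two multipliers, maximize the quadratic in $\alpha_2$, clip to $[L, H]$, and recover $\alpha_1$ from the linear constraint) can be coded directly. The dataset and all names here are invented for illustration.

```python
# A sketch of SMO-style pairwise coordinate ascent for the dual (6).
# Cyclic over all pairs for simplicity; real SMO uses selection heuristics.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_toy(X, y, C=10.0, sweeps=50):
    m = len(X)
    K = [[dot(X[i], X[k]) for k in range(m)] for i in range(m)]
    alpha = [0.0] * m
    for _ in range(sweeps):
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                # "Errors" E_k = f(x_k) - y_k; the intercept cancels in
                # E_i - E_j, so it can be ignored here and recovered later.
                f = [sum(alpha[k] * y[k] * K[k][l] for k in range(m)) for l in range(m)]
                Ei, Ej = f[i] - y[i], f[j] - y[j]
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]   # curvature in alpha_j
                if eta <= 1e-12:
                    continue
                # Endpoints [L, H] for alpha_j so alpha_i stays in [0, C]
                # on the line alpha_i y_i + alpha_j y_j = constant.
                if y[i] == y[j]:
                    L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                else:
                    L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                if L >= H:
                    continue
                # Unconstrained maximizer of the quadratic, clipped to [L, H].
                aj = min(H, max(L, alpha[j] + y[j] * (Ei - Ej) / eta))
                alpha[i] += y[i] * y[j] * (alpha[j] - aj)  # keeps sum alpha_i y_i = 0
                alpha[j] = aj
    theta = [sum(alpha[i] * y[i] * X[i][d] for i in range(m)) for d in range(len(X[0]))]
    sv = [i for i in range(m) if alpha[i] > 1e-8]          # support vectors
    theta0 = sum(y[i] - dot(theta, X[i]) for i in sv) / len(sv)
    return theta, theta0, alpha

# Linearly separable toy data (hypothetical):
X = [[3.0, 0.0], [4.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
theta, theta0, alpha = smo_toy(X, y)
# All training margins y_i (theta^T x_i + theta0) come out >= 1 here.
```

On this toy problem the nearest points on each side end up as the support vectors, and every update preserves both the box constraints and the equality constraint, which is the whole point of moving two multipliers at once.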
