Sample complexity
In our setting, we have <math>h=\mathcal{A}(S_n)</math>, where <math>\mathcal{A}</math> is a learning algorithm and <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> is a sequence of <math>n</math> samples, each drawn independently from <math>\rho</math>. Define the optimal risk<math display="block">
\mathcal E^*_\mathcal{H} = \underset{h \in \mathcal H}{\inf}\mathcal E(h).
</math>Set <math>h_n=\mathcal{A}(S_n)</math> for each [[sample size]] <math>n</math>. Note that <math>h_n</math> is a [[random variable]], since it depends on the random sample <math>S_n</math>, which is drawn from the distribution <math>\rho^n</math>. The algorithm <math>\mathcal{A}</math> is called '''consistent''' if <math> \mathcal E(h_n) </math> converges in probability to <math> \mathcal E_\mathcal H^*</math>. That is, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math> such that, for all sample sizes <math>n \geq N</math>, we have
 
<math display="block">
\Pr_{\rho^n}\bigl[\mathcal E(h_n) - \mathcal E^*_\mathcal{H} \geq \epsilon\bigr] < \delta.
</math>The '''sample complexity''' of <math>\mathcal{A}</math> is then the minimum <math>N = N(\rho,\epsilon,\delta)</math> for which this holds, as a function of <math>\rho</math>, <math>\epsilon</math>, and <math>\delta</math>.
In other words, the sample complexity <math>N(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm: given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs to sample <math>N(\rho,\epsilon,\delta)</math> data points to guarantee that the risk of the output function is within <math>\epsilon</math> of the best possible, with probability at least <math>1 - \delta</math>.<ref name="Rosasco">{{citation |last = Rosasco | first = Lorenzo | title = Consistency, Learnability, and Regularization | series = Lecture Notes for MIT Course 9.520 | year = 2014 }}</ref>
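As an illustration only (the distribution, the noise rate <code>ETA</code>, the threshold <code>T_STAR</code>, and the helper functions below are assumptions chosen for the example, not part of the definition), the following sketch estimates by Monte Carlo simulation how the failure probability <math>\Pr[\mathcal E(h_n) - \mathcal E^*_\mathcal{H} > \epsilon]</math> decays with <math>n</math> when <math>h_n</math> is produced by empirical risk minimization over one-dimensional threshold classifiers. The smallest <math>n</math> at which the printed estimate falls below <math>\delta</math> approximates <math>N(\rho,\epsilon,\delta)</math> for this particular <math>\rho</math> and algorithm.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
ETA, T_STAR = 0.1, 0.3   # assumed label-noise rate and true threshold for this example

def sample(n):
    """Draw n i.i.d. pairs (x, y) from rho: x uniform on [0, 1),
    label = sign(x - T_STAR), flipped with probability ETA."""
    x = rng.random(n)
    y = np.where(x >= T_STAR, 1, -1)
    flip = rng.random(n) < ETA
    return x, np.where(flip, -y, y)

def erm_threshold(x, y):
    """Empirical risk minimization over threshold classifiers h_t(x) = sign(x - t).
    Candidate thresholds at the data points (plus the endpoints) suffice, because
    the empirical risk is piecewise constant in t."""
    ts = np.concatenate(([0.0], np.sort(x), [1.0]))
    errors = [np.mean(np.where(x >= t, 1, -1) != y) for t in ts]
    return ts[int(np.argmin(errors))]

def excess_risk(t):
    # For this rho, E(h_t) = ETA + (1 - 2*ETA)*|t - T_STAR|, minimized at t = T_STAR.
    return (1 - 2 * ETA) * abs(t - T_STAR)

eps, delta, trials = 0.05, 0.05, 200
for n in [25, 50, 100, 200, 400, 800]:
    fail = np.mean([excess_risk(erm_threshold(*sample(n))) > eps for _ in range(trials)])
    print(f"n = {n:4d}:  estimated P[E(h_n) - E* > eps] = {fail:.3f}  (target <= delta = {delta})")
</syntaxhighlight>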
 
In [[Probably approximately correct learning|probably approximately correct (PAC) learning]], one is concerned with whether the sample complexity is ''polynomial'', that is, whether <math>N(\rho,\epsilon,\delta)</math> is bounded by a polynomial in <math>1/\epsilon</math> and <math>1/\delta</math>. If <math>N(\rho,\epsilon,\delta)</math> is polynomial for some learning algorithm, then one says that the hypothesis space <math> \mathcal H </math> is '''PAC-learnable'''. This is a stronger notion than being learnable.
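For instance, in the special case of a finite hypothesis space <math>\mathcal H</math> and a loss bounded in <math>[0,1]</math> (a standard illustration, not specific to any particular algorithm discussed here), Hoeffding's inequality together with a union bound shows that empirical risk minimization achieves the guarantee above whenever<math display="block">
n \geq \frac{2}{\epsilon^2}\ln\frac{2|\mathcal H|}{\delta},
</math>so <math>N(\rho,\epsilon,\delta)</math> is bounded by a polynomial in <math>1/\epsilon</math> and <math>\log(1/\delta)</math>, and hence in <math>1/\epsilon</math> and <math>1/\delta</math>.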
 
==Unrestricted hypothesis space: infinite sample complexity==
 
=== An example of a PAC-learnable hypothesis space ===
Let <math>X = \R^d</math>, <math>Y = \{-1, 1\}</math>, and let <math>\mathcal H</math> be the space of affine functions on <math>X</math>, that is, functions of the form <math>x\mapsto \langle w,x\rangle+b</math> for some <math>w\in\R^d,b\in\R</math>. This is the problem of linear classification with offset. For example, four coplanar points forming a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. The VC dimension of <math>\mathcal H</math> is <math>d + 1</math>, which in particular is finite. It follows by the above characterization of PAC-learnable classes that <math>\mathcal H</math> is PAC-learnable, and by extension, learnable.
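The shattering claim about the square can be checked directly with a small script (an illustrative sketch; the function name <code>affinely_separable</code> and the margin-one linear program are choices made for the example, using SciPy's <code>linprog</code>). It enumerates all <math>2^4</math> labelings of the four vertices of the unit square in <math>\R^2</math> and tests whether each labeling is realized by the sign of some affine function, finding that only the two "diagonal" labelings are not.

<syntaxhighlight lang="python">
import itertools
import numpy as np
from scipy.optimize import linprog

# The four vertices of the unit square in R^2.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

def affinely_separable(X, y):
    """Return True if some affine function x -> <w, x> + b satisfies
    y_i * (<w, x_i> + b) >= 1 for every point, checked as a linear-programming
    feasibility problem (strict separation, rescaled to margin 1)."""
    n, d = X.shape
    # Variables are (w_1, ..., w_d, b); constraints -y_i * (<w, x_i> + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

# Enumerate all 2^4 labelings and collect those no affine function realizes.
not_realizable = [y for y in itertools.product([-1, 1], repeat=4)
                  if not affinely_separable(points, np.array(y, dtype=float))]
print(not_realizable)   # the two "diagonal" labelings: (-1, 1, -1, 1) and (1, -1, 1, -1)
</syntaxhighlight>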
 
=== Sample-complexity bounds ===