<math> \mathcal H </math> is '''PAC-learnable'''. Note that this is a stronger notion than being learnable.
==Unrestricted hypothesis space: infinite sample complexity==
One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm <math>ALG</math> such that, for all ''ε'', ''δ'' > 0, there exists a positive integer ''N'' such that for all ''n'' ≥ ''N'', we have<math display="block">
\sup_\rho\left(\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right)<\delta,
</math>where <math>h_n=ALG(S_n)</math>, with <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> as above. The [[No free lunch in search and optimization|No Free Lunch Theorem]] says that without restrictions on the hypothesis space <math>\mathcal H</math>, this is not the case, i.e., there always exist "bad" distributions for which the sample complexity is arbitrarily large.<ref name=":0">{{citation |last = Vapnik | first = Vladimir | title = Statistical Learning Theory | place = New York | publisher = Wiley | year = 1998}}</ref>
Thus, in order to make statements about the rate of convergence of the quantity <math display="block"> \sup_\rho\left(\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right),
</math>
one must either
*constrain the space of probability distributions <math>\rho</math>, e.g. via a parametric approach, or
*constrain the space of hypotheses <math>\mathcal H</math>, as in distribution-free approaches.
==Restricted hypothesis space: finite sample-complexity==
The latter approach leads to concepts such as [[VC dimension]] and [[Rademacher complexity]] which control the complexity of the space <math>\mathcal H</math>. A smaller hypothesis space introduces more bias into the inference process, meaning that <math>\mathcal E^*_\mathcal{H}</math> may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of [[regularization (mathematics)|regularization]].<ref name = "Rosasco" />
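For illustration (a minimal sketch, not drawn from the cited references; the helper <code>ridge_fit</code>, the synthetic data, and the choice of [[Tikhonov regularization|ridge regression]] are assumptions made here for concreteness), one standard way of restricting the effective complexity of a space of linear predictors is to penalize the norm of the weight vector; a larger penalty yields a more restricted, more biased, but more stable family of hypotheses:
<syntaxhighlight lang="python">
import numpy as np

def ridge_fit(X, y, lam):
    """Return argmin_w ||X w - y||^2 + lam * ||w||^2 (ridge regression)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (illustrative only): only the first coordinate matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[0] = 1.0
y = X @ w_true + 0.1 * rng.normal(size=50)

# Increasing the penalty restricts the hypothesis space: the fitted weight
# vectors shrink toward zero, reflecting the added bias of a smaller class.
for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lam={lam:>6}: ||w|| = {np.linalg.norm(w):.3f}")
</syntaxhighlight>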
In the binary classification setting, it is a theorem of [[Vapnik–Chervonenkis theory|VC theory]] that the following three conditions are equivalent for a hypothesis class <math>\mathcal H</math>:
# <math>\mathcal H</math> is PAC-learnable.
# The VC dimension of <math>\mathcal H</math> is finite.
# <math>\mathcal H</math> is a uniform [[Glivenko-Cantelli class]].
This gives a way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable.
=== An example of a PAC-learnable hypothesis space ===
Let ''X'' = '''R'''<sup>''d''</sup>, ''Y'' = {-1, 1}, and let <math>\mathcal H</math> be the space of affine functions on ''X'', that is, functions of the form <math>x\mapsto \langle w,x\rangle+b</math> for some <math>w\in\R^d,b\in\R</math>. This is the linear classification with offset learning problem. Now, note that the four corners of a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite corners and negative on the remaining two; more generally, [[Radon's theorem]] implies that no set of ''d'' + 2 points in '''R'''<sup>''d''</sup> can be shattered by affine functions. Thus, the VC dimension of <math>\mathcal H</math> is ''d'' + 1, in particular finite. It follows by the above characterization of PAC-learnable classes that <math>\mathcal H</math> is PAC-learnable, and by extension, learnable.
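The claim about the square can also be checked computationally. The following sketch (an illustrative aside, not taken from the cited sources; the helper <code>realizable</code> and the use of a linear-programming feasibility test are choices made here) enumerates all 16 labelings of the unit square's corners and reports which of them can be produced by an affine classifier; only the two labelings that are positive on one diagonal and negative on the other fail:
<syntaxhighlight lang="python">
import itertools
import numpy as np
from scipy.optimize import linprog

corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def realizable(points, labels):
    """Check feasibility of y_i * (<w, x_i> + b) >= 1 for all i via an LP.

    The variables are (w_1, ..., w_d, b); feasibility is equivalent to the
    labeling being achievable by some affine classifier x -> sign(<w, x> + b).
    """
    d = points.shape[1]
    # Rewrite y_i * (<w, x_i> + b) >= 1 as -y_i * [x_i, 1] . [w, b] <= -1.
    A_ub = -np.array(labels)[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # status 0 = feasible, status 2 = infeasible

for labels in itertools.product([-1, 1], repeat=4):
    print(labels, realizable(corners, labels))
# All labelings are realizable except the two that are positive on one diagonal
# and negative on the other, so the square's corners are not shattered.
</syntaxhighlight>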
=== Sample-complexity bounds ===
Suppose <math>\mathcal H</math> is a class of binary functions. Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size:
<ref>{{Cite journal|title=The optimal sample complexity of PAC learning
|author=Steve Hanneke|journal=Journal of Machine Learning Research|year=2016|url=http://dl.acm.org/citation.cfm?id=2946683}}</ref>
<math display="block">
N = O\bigg(\frac{VC(\mathcal H) + \ln{1\over \delta}}{\epsilon}\bigg)
</math>
Moreover, any <math>(\epsilon,\delta)</math>-PAC-learning algorithm for <math>\mathcal H</math> must have sample-complexity:<ref>{{Cite journal|doi=10.1016/0890-5401(89)90002-3|title=A general lower bound on the number of examples needed for learning|journal=Information and Computation|volume=82|issue=3|pages=247|year=1989|last1=Ehrenfeucht|first1=Andrzej|last2=Haussler|first2=David|last3=Kearns|first3=Michael|last4=Valiant|first4=Leslie}}</ref>
<math display="block">
N = \Omega\bigg(\frac{VC(\mathcal H) + \ln{1\over \delta}}{\epsilon}\bigg)
</math>
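For illustration (the numerical values of <math>\epsilon</math> and <math>\delta</math> below are arbitrary choices made here, not taken from the cited sources), consider the affine class of the example above, whose VC dimension is <math>d+1</math>. With <math>\epsilon=0.1</math> and <math>\delta=0.05</math>, so that <math>\ln\tfrac 1 \delta = \ln 20 \approx 3</math>, the upper bound becomes
<math display="block">
N = O\bigg(\frac{(d+1) + \ln 20}{0.1}\bigg) = O(10d + 40),
</math>
i.e., a number of samples growing linearly in the dimension <math>d</math>; the matching lower bound shows that this dependence on the VC dimension, <math>1/\epsilon</math> and <math>\ln\tfrac 1 \delta</math> cannot be improved in general.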
==Other settings==
In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[active learning]],<ref name="Balcan">{{cite journal |doi = 10.1007/s10994-010-5174-y|title = The true sample complexity of active learning|journal = Machine Learning|date = 2010|volume = 80|issue = 2–3|pages = 111–139|last1 = Balcan|first1 = Maria-Florina|last2 = Hanneke|first2 = Steve|last3 = Vaughan|first3 = Jennifer Wortman}}</ref> where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels.
==References==
{{reflist}}