Sample complexity
{{Machine learning bar}}
 
In [[machine learning]], the '''sample complexity''' of a machine learning algorithm is, roughly speaking, the number of training samples needed for the algorithm to successfully learn a target function. More specifically, the sample complexity is the number of samples needed for the function returned by the algorithm to be within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.
 
There are two variants of sample complexity. The weak variant fixes a particular input-output distribution, while the strong variant takes the worst-case sample complexity over all input-output distributions. A natural question in statistical learning is, for a given hypothesis space, whether the sample complexity is finite in the strong sense, that is, whether there is a bound on the number of samples needed so that an algorithm can approximately solve all possible learning problems on a particular input-output space, no matter the distribution of data over that space. The No Free Lunch Theorem, discussed below, says that this is impossible if the hypothesis space is not constrained.<ref name=":0" />
==Definition==
Let ''X'' be a space which we call the input space, and ''Y'' be a space which we call the output space, and let ''Z'' denote the product {{Math|''X'' × ''Y''}}. For example, in the setting of binary classification, ''X'' is typically a finite-dimensional vector space and ''Y'' is the set {{Math|{-1,1<nowiki>}</nowiki>}}.

Fix a hypothesis space <math>\mathcal H</math> of functions <math>f\colon X\to Y</math>. A learning algorithm over <math>\mathcal H</math> is a computable map from <math>Z^*</math> to <math>\mathcal H</math>. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from ''X'' to ''Y''. Typical learning algorithms include empirical risk minimization and empirical risk minimization with [[Tikhonov regularization]].

Fix a loss function <math>V\colon Y\times Y\to\R_{\geq 0}</math>, for example, the square loss <math>V(y,y')=(y-y')^2</math>. For a given distribution ''ρ'' on {{Math|''X'' × ''Y''}}, the expected risk of a function <math>f\in\mathcal H</math> is

:<math>\mathcal E(f) := \mathbb E_\rho[V(f(x),y)]=\int_{X\times Y} V(f(x),y)\,d\rho(x,y)</math>

In our setting, we have <math>f=A(S_n)</math>, where ''A'' is a learning algorithm and <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> is a sequence of vectors which are all drawn independently from ''ρ''. Define the optimal risk<math display="block">
\mathcal E^*_\mathcal{H} = \underset{f \in \mathcal H}{\inf}\mathcal E(f).
</math>Set <math>f_n=A(S_n)</math> for each ''n''. Note that {{Math|''f<sub>n</sub>''}} is a random variable and depends on the random variable {{Math|''S<sub>n</sub>''}}, which is drawn from the distribution {{Math|''ρ<sup>n</sup>''}}. The algorithm ''A'' is called consistent if <math>\mathcal E(f_n)</math> converges in probability to <math>\mathcal E_\mathcal H^*</math>; in other words, for all ''ε'', ''δ'' > 0, there exists a positive integer ''N'' such that for all ''n'' ≥ ''N'', we have<math display="block">
\Pr_{\rho^n}[\mathcal E(f_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]<\delta.
</math>The sample complexity of ''A'' is then the minimum ''N'' for which this holds, as a function of ''ρ'', ''ε'', and ''δ''. We write the sample complexity as ''N''(''ρ'', ''ε'', ''δ'') to emphasize that this value of ''N'' depends on ''ρ'', ''ε'', and ''δ''. If ''A'' is not consistent, then we set ''N''(''ρ'', ''ε'', ''δ'') = ∞. If there exists an algorithm for which ''N'' is finite, then we say that the hypothesis space <math>\mathcal H</math> is '''learnable'''.

In words, the sample complexity ''N''(''ρ'', ''ε'', ''δ'') defines the rate of consistency of the algorithm: given a desired accuracy ''ε'' and confidence ''δ'', one needs to sample at most ''N''(''ρ'', ''ε'', ''δ'') data points to guarantee that the risk of the output function is within ''ε'' of the best possible, with probability at least 1 − ''δ''.<ref name="Rosasco">{{citation |last = Rosasco | first = Lorenzo | title = Consistency, Learnability, and Regularization | series = Lecture Notes for MIT Course 9.520 | year = 2014 }}</ref>

In [[Probably approximately correct learning|probably approximately correct (PAC) learning]], one is concerned with whether the sample complexity is ''polynomial'', that is, whether ''N''(''ρ'', ''ε'', ''δ'') is bounded by a polynomial in 1/''ε'' and 1/''δ''. If ''N''(''ρ'', ''ε'', ''δ'') is polynomial for some learning algorithm, then one says that the hypothesis space <math>\mathcal H</math> is '''PAC-learnable'''. Note that this is a stronger notion than being learnable.
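The definitions above can be made concrete with a small simulation. The sketch below is an illustration added here, not part of the standard treatment: the threshold class, the chosen distribution, and all function names are assumptions. It runs empirical risk minimization over threshold classifiers on [0, 1] and estimates the failure probability <math>\Pr[\mathcal E(f_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]</math>, which should fall below any given ''δ'' once ''n'' is large enough:

```python
import random

def true_risk(t, t_star=0.3):
    # For x ~ Uniform[0,1], label y = 1 iff x >= t_star, and 0/1 loss,
    # the risk of threshold t is |t - t_star|; the optimal risk is 0.
    return abs(t - t_star)

def erm_threshold(sample):
    # Empirical risk minimization over thresholds: it suffices to check
    # thresholds at the sample points (plus 0.0), since the empirical
    # error only changes at those values.
    candidates = [0.0] + [x for x, _ in sample]
    def emp_err(t):
        return sum((x >= t) != y for x, y in sample) / len(sample)
    return min(candidates, key=emp_err)

def failure_prob(n, eps, t_star=0.3, trials=200, seed=0):
    # Monte Carlo estimate of Pr[ E(f_n) - E*_H >= eps ] for ERM.
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        sample = [(x, x >= t_star) for x in (rng.random() for _ in range(n))]
        if true_risk(erm_threshold(sample)) >= eps:
            fails += 1
    return fails / trials

# The estimated failure probability shrinks as n grows (consistency).
for n in [5, 10, 20, 40, 80]:
    print(n, failure_prob(n, eps=0.1))
```

The smallest ''n'' at which the estimated probability drops below ''δ'' gives a rough empirical stand-in for ''N''(''ρ'', ''ε'', ''δ'') for this particular distribution.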
 
==No Free Lunch Theorem (Machine Learning)==
One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm ''A'' such that, for all ''ε'', ''δ'' > 0, there exists a positive integer ''N'' such that for all ''n'' ≥ ''N'', we have<math display="block">
\sup_\rho\left(\Pr_{\rho^n}[\mathcal E(f_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right)<\delta,
</math>where <math>f_n=A(S_n)</math>, with <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> as above. The [[No free lunch in search and optimization|No Free Lunch Theorem]] says that without restrictions on the hypothesis space <math>\mathcal H</math>, there always exist "bad" distributions for which the sample complexity is arbitrarily large.<ref name=":0">{{citation |last = Vapnik | first = Vladimir | title = Statistical Learning Theory | place = New York | publisher = Wiley | year = 1998}}</ref> Thus in order to make statements about the rate of convergence of the quantity<math display="block">
\sup_\rho\left(\Pr_{\rho^n}[\mathcal E(f_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right),
</math>one must either
*constrain the space of probability distributions <math>\rho</math>, e.g. via a parametric approach, or
*constrain the space of hypotheses <math>\mathcal H</math>, as in distribution-free approaches.
The latter approach leads to concepts such as [[VC dimension]] and [[Rademacher complexity]] which control the complexity of the space <math>\mathcal H</math>. A smaller hypothesis space introduces more bias into the inference process, meaning that <math>\mathcal E^*_\mathcal{H}</math> may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of [[regularization (mathematics)|regularization]].<ref name = "Rosasco" />
 
It is a theorem from [[Vapnik–Chervonenkis theory|VC theory]] that the following three statements are equivalent for a hypothesis space <math>\mathcal H</math>:
# <math>\mathcal H</math> is PAC-learnable.
# The VC dimension of <math>\mathcal H</math> is finite.
# <math>\mathcal H</math> is a uniform [[Glivenko-Cantelli class]].
This gives a nice way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable.
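For a finite hypothesis class over a finite domain, the VC dimension in statement 2 can be computed by brute force. The sketch below is illustrative (the threshold class and the helper names are assumptions made here); it uses the fact that if no set of size ''k'' is shattered, no larger set can be either, since subsets of shattered sets are shattered:

```python
from itertools import combinations, product

def shatters(hypotheses, points):
    # H shatters a point set if every labeling of it is realized by some h in H.
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    # Brute-force VC dimension over a finite domain: the largest k such
    # that some k-subset of the domain is shattered.
    vc = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, pts) for pts in combinations(domain, k)):
            vc = k
        else:
            break  # no k-set is shattered, so no larger set is either
    return vc

# Threshold classifiers on a grid: h_t(x) = 1 iff x >= t. They shatter
# any single point but no pair, so the VC dimension is 1.
domain = range(10)
thresholds = [lambda x, t=t: x >= t for t in range(11)]
print(vc_dimension(thresholds, domain))
```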
 
== An example of a PAC-learnable hypothesis space ==
Let ''X'' = '''R'''<sup>''d''</sup>, ''Y'' = {-1, 1}, and let <math>\mathcal H</math> be the space of affine classifiers on ''X'', that is, functions of the form <math>x\mapsto \operatorname{sign}(\langle w,x\rangle+b)</math> for some <math>w\in\R^d,b\in\R</math>. This is the linear classification with offset learning problem. Now, note that four coplanar points forming a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. Thus, for ''d'' = 2 the VC dimension of <math>\mathcal H</math> is 3; in general it is ''d'' + 1, which is in particular finite. It follows by the above characterization of PAC-learnable classes that <math>\mathcal H</math> is PAC-learnable, and by extension, learnable.
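The shattering claim for the square can be verified exhaustively. The following sketch is an illustration added here (the small parameter grid is an assumption that happens to suffice for these four points): it searches affine classifiers for each of the 16 labelings of the unit square's corners and finds that exactly the two diagonal (XOR) labelings are unrealizable:

```python
from itertools import product

corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

def realizable(labels):
    # Search a small grid of affine classifiers x -> sign(w.x + b) for one
    # that produces the given labeling of the four corners. A coarse grid
    # suffices for realizable labelings here; the two failures are the
    # XOR labelings ruled out by the parity argument in the text.
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == lab
               for (x1, x2), lab in zip(corners, labels)):
            return True
    return False

unrealizable = [labels for labels in product([False, True], repeat=4)
                if not realizable(labels)]
print(unrealizable)  # the two labelings that assign True to one diagonal
```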
==Other Settings==
In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[active learning]],<ref name = "Balcan">{{cite journal |doi = 10.1007/s10994-010-5174-y|title = The true sample complexity of active learning|journal = Machine Learning|date = 2010|volume = 80|issue = 2-3|pages = 111–139|first = Maria-Florina|last = Balcan}}</ref> where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in [[reinforcement learning]],<ref>{{citation |last = Kakade | first = Sham | title = On the Sample Complexity of Reinforcement Learning | place = University College London | publisher = Gatsby Computational Neuroscience Unit. | series = PhD Thesis | year = 2003 | url = http://www.ias.tu-darmstadt.de/uploads/Research/NIPS2006/SK.pdf}}</ref> [[online machine learning|online learning]], and unsupervised algorithms, e.g. for [[dictionary learning]].<ref>{{cite journal |last1 = Vainsencher | first1 = Daniel | last2 = Mannor | first2 = Shie | last3 = Bruckstein | first3 = Alfred | title = The Sample Complexity of Dictionary Learning | journal = Journal of Machine Learning Research | volume = 12 | pages = 3259–3281 | date = 2011 | url = http://www.jmlr.org/papers/volume12/vainsencher11a/vainsencher11a.pdf}}</ref>
 
==References==