Sample complexity: Difference between revisions

Content deleted Content added
No edit summary
 
(59 intermediate revisions by 30 users not shown)
Line 1:
{{Short description|Attribute of machine learning models}}
{{Machine learning bar}}
The '''sample complexity''' of a [[machine learning]] algorithm represents the number of training-samples that it needs in order to successfully learn a target function.
 
More precisely, the sample complexity is the number of training-samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.
In [[machine learning]], '''sample complexity''' is the number of examples needed for the estimate of a target function to be within a given error rate.<ref name = "Balcan">{{cite journal | doi = 10.1007/s10994-010-5174-y | title=The true sample complexity of active learning | journal=Machine Learning | date=2010 | volume=80 | issue=2-3 | pages=111–139 | first=Maria-Florina | last=Balcan}}</ref> The '''sample complexity''' of a machine learning algorithm characterizes its rate of [[consistency (statistics)|consistency]].
 
There are two variants of sample complexity:
==Mathematical Setup==
* The weak variant fixes a particular input-output distribution;
Given a set of samples <math>S_n = \{(x_1,y_1),\ldots,(x_n,y_n)\}</math> drawn i.i.d. according to a distribution <math>\rho</math> from some input space <math>\mathcal X \times \mathcal Y</math>, a supervised learning algorithm chooses a function <math>f:\mathcal X \to \mathcal Y</math> from some hypothesis class <math>\mathcal H</math>. A desirable property of the algorithm is that it chooses functions with small expected prediction error with respect to <math>\rho</math> and some loss function <math>V:\mathcal Y \times \mathcal Y \to \mathbb R_+</math>. Specifically, it is desirable to have a consistent algorithm, or an algorithm that generates functions whose expected risks converge to the best possible expected risk. Formally, let
* The strong variant takes the worst-case sample complexity over all input-output distributions.
 
The [[No free lunch theorem]], discussed below, proves that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples.
<math>
\mathcal R^*_\mathcal{H} = \underset{f \in \mathcal H}{\inf}\mathbb E_\rho[V(f(x),y)],
</math>
 
However, if we are only interested in a particular class of target functions (e.g., only linear functions) then the sample complexity is finite, and it depends linearly on the [[VC dimension]] on the class of target functions.<ref name=":0" />
and let <math>f_n</math> be the functions generated by an algorithm as the number of data points <math>n</math> grows. The algorithm is consistent if
 
==Definition==
<math>
Let <math>X</math> be a space which we call the input space, and <math>Y</math> be a space which we call the output space, and let <math>Z</math> denote the product <math>X\times Y</math>. For example, in the setting of binary classification, <math>X</math> is typically a finite-dimensional vector space and <math>Y</math> is the set <math>\{-1,1\}</math>.
\underset{n \to \infty}{\lim}\mathbb P_{S_n}(\mathbb E_\rho[V(f_n(x),y)] - \mathcal R^*_\mathcal{H} > \epsilon) \to 0,
</math>
 
Fix a hypothesis space <math>\mathcal H</math> of functions <math>h\colon X\to Y</math>. A learning algorithm over <math>\mathcal H</math> is a computable map from <math>Z</math> to <math>\mathcal H</math>. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from <math>X</math> to <math>Y</math>. Typical learning algorithms include [[empirical risk minimization]], without or with [[Tikhonov regularization]].
for all <math>\epsilon > 0</math>, where <math>\mathbb P_{S_n}</math> denotes the probability measure <math>\rho^n</math>.
 
Fix a loss function <math>\mathcal{L}\colon Y\times Y\to\R_{\geq 0}</math>, for example, the square loss <math>\mathcal{L}(y, y') = (y - y')^2</math>, where <math>h(x) = y'</math>. For a given distribution <math>\rho</math> on <math>X\times Y</math>, the '''expected risk''' of a hypothesis (a function) <math>h\in\mathcal H</math> is
The consistency property is nice, but it says nothing about how fast the expected risks converge. Since in practice one always deals with finite data, it is important to answer the question of how many samples are needed to achieve a risk that is close, in the <math>\epsilon</math> sense, to the best possible for the function class. The notion of sample complexity answers this question. The sample complexity of a learning algorithm is a function <math>n(\rho,\epsilon,\delta)</math> such that for all <math>n \ge n(\rho,\epsilon,\delta)</math>,
 
:<math>\mathcal E(h) :=\mathbb E_\rho[\mathcal{L}(h(x),y)]=\int_{X\times Y} \mathcal{L}(h(x),y)\,d\rho(x,y)</math>
<math>
\mathbb P_{S_n}(\mathbb E_\rho[V(f(x),y)] - \mathcal R^*_\mathcal{H} \le \epsilon) \ge 1- \delta.
</math>
 
In our setting, we have <math>h=\mathcal{A}(S_n)</math>, where <math>\mathcal{A}</math> is a learning algorithm and <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> is a sequence of vectors which are all drawn independently from <math>\rho</math>. Define the optimal risk<math display="block">
In words, the sample complexity <math>n(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm. Given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs at most <math>n(\rho,\epsilon,\delta)</math> samples to guarantee that the expected risk of the output function is within <math>\epsilon</math> of the best possible expected risk with probability at least <math>1-\delta</math>.<ref name = "Rosasco">{{citation |last = Rosasco | first = Lorenzo | title = Consistency, Learnability, and Regularization | series = Lecture Notes for MIT Course 9.520. | year = 2014 }}</ref>
\mathcal E^*_\mathcal{H} = \underset{h \in \mathcal H}{\inf}\mathcal E(h).
</math>Set <math>h_n=\mathcal{A}(S_n)</math>, for each [[sample size]] <math>n</math>. <math>h_n</math> is a [[random variable]] and depends on the random variable <math>S_n</math>, which is drawn from the distribution <math>\rho^n</math>. The algorithm <math>\mathcal{A}</math> is called '''consistent''' if <math> \mathcal E(h_n) </math> probabilistically converges to <math> \mathcal E_\mathcal H^*</math>. In other words, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math>, such that, for all sample sizes <math>n \geq N</math>, we have
 
<math display="block">
==No Free Lunch Theorem (Machine Learning)==
\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]<\delta.
Optimistically one could hope for a stronger notion of sample complexity that is independent of the distribution <math>\rho</math> on the input and output spaces. However, it has been shown that without restrictions on the hypothesis class <math>\mathcal H</math>, there always exists "bad" distributions for which the sample complexity is arbitrarily large.<ref>{{citation |last = Vapnik | first = Vladimir | title = Statistical Learning Theory | place = New York | publisher = Wiley. | year = 1998}}</ref> Thus in order to make statements about the rate of convergence of the quantity
</math>The '''sample complexity''' of <math>\mathcal{A}</math> is then the minimum <math>N</math> for which this holds, as a function of <math>\rho, \epsilon</math>, and <math>\delta</math>. We write the sample complexity as <math>N(\rho, \epsilon, \delta)</math> to emphasize that this value of <math>N</math> depends on <math>\rho, \epsilon</math>, and <math>\delta</math>. If <math>\mathcal{A}</math> is '''not consistent''', then we set <math>N(\rho,\epsilon,\delta)=\infty</math>. If there exists an algorithm for which <math>N(\rho,\epsilon,\delta)</math> is finite, then we say that the hypothesis space <math> \mathcal H</math> is '''learnable'''.
 
In others words, the sample complexity <math>N(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm: given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs to sample <math>N(\rho,\epsilon,\delta)</math> data points to guarantee that the risk of the output function is within <math>\epsilon</math> of the best possible, with probability at least <math>1 - \delta</math> .<ref name="Rosasco">{{citation |last = Rosasco | first = Lorenzo | title = Consistency, Learnability, and Regularization | series = Lecture Notes for MIT Course 9.520. | year = 2014 }}</ref>
<math>
 
\underset{\rho}{\sup}\ \mathbb P_{S_n}(\mathbb E_\rho[V(f(x),y)] - \mathcal R^*_\mathcal{H} > \epsilon),
In [[Probably approximately correct learning|probably approximately correct (PAC) learning]], one is concerned with whether the sample complexity is ''polynomial'', that is, whether <math>N(\rho,\epsilon,\delta)</math> is bounded by a polynomial in <math>1/\epsilon</math> and <math>1/\delta</math>. If <math>N(\rho,\epsilon,\delta)</math> is polynomial for some learning algorithm, then one says that the hypothesis space <math> \mathcal H </math> is '''PAC-learnable'''. This is a stronger notion than being learnable.
 
==Unrestricted hypothesis space: infinite sample complexity==
<span id='No Free Lunch Theorem'></span>
One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm <math>\mathcal{A}</math>, such that, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math> such that for all <math>n \geq N</math>, we have
 
<math display="block">
\sup_\rho\left(\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right)<\delta,
</math>
where <math>h_n=\mathcal{A}(S_n)</math>, with <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> as above. The [[No free lunch in search and optimization|No Free Lunch Theorem]] says that without restrictions on the hypothesis space <math>\mathcal H</math>, this is not the case, i.e., there always exist "bad" distributions for which the sample complexity is arbitrarily large.<ref name=":0">{{citation |last = Vapnik | first = Vladimir | title = Statistical Learning Theory | place = New York | publisher = Wiley. | year = 1998}}</ref>
 
Thus, in order to make statements about the rate of convergence of the quantity
<math display="block">
\sup_\rho\left(\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right),
</math>
one must either
*Constrainconstrain the setspace of probability distributions <math>\rho</math>, e.g. via a parametric approach, or
*Constrainconstrain the setspace of hypotheses <math>\mathcal H</math> to be small, as in distribution -free approaches.
 
The latter approach leads to concepts such as [[VC dimension]] and [[Rademacher complexity]] which control the complexity of the space <math>\mathcal H</math>. A smaller hypothesis space introduces more bias into the inference process, meaning that <math>\mathcal R^*_\mathcal{H}</math> may be larger than the best possible expected risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce functions converging in expected risk to <math>\mathcal R^*_\mathcal{H}</math>. This trade-off leads to the concept of [[regularization (mathematics)|regularization]].<ref name = "Rosasco" />
==Restricted hypothesis space: finite sample-complexity==
The latter approach leads to concepts such as [[VC dimension]] and [[Rademacher complexity]] which control the complexity of the space <math>\mathcal H</math>. A smaller hypothesis space introduces more bias into the inference process, meaning that <math>\mathcal E^*_\mathcal{H}</math> may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of [[regularization (mathematics)|regularization]].<ref name = "Rosasco" />
==Other Settings==
 
In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[active learning]],<ref name = "Balcan" /> where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in [[reinforcement learning]],<ref>{{citation |last = Kakade | first = Sham | title = On the Sample Complexity of Reinforcement Learning | place = University College London | publisher = Gatsby Computational Neuroscience Unit. | series = PhD Thesis | year = 2003 | url = http://www.ias.tu-darmstadt.de/uploads/Research/NIPS2006/SK.pdf}}</ref> [[online machine learning|online learning]], and unsupervised algorithms, e.g. for [[dictionary learning]].<ref>{{cite journal |last1 = Vainsencher | first1 = Daniel | last2 = Mannor | first2 = Shie | last3 = Bruckstein | first3 = Alfred | title = The Sample Complexity of Dictionary Learning | journal = Journal of Machine Learning Research | volume = 12 | pages = 3259–3281 | date = 2011 | url = http://www.jmlr.org/papers/volume12/vainsencher11a/vainsencher11a.pdf}}</ref>
It is a theorem from [[Vapnik–Chervonenkis theory|VC theory]] that the following three statements are equivalent for a hypothesis space <math>\mathcal H</math>:
# <math>\mathcal H</math> is PAC-learnable.
# The VC dimension of <math>\mathcal H</math> is finite.
# <math>\mathcal H</math> is a uniform [[Glivenko-Cantelli class]].
This gives a way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable.
 
=== An example of a PAC-learnable hypothesis space ===
<math>X = \R^d, Y = \{-1, 1\}</math>, and let <math>\mathcal H</math> be the space of affine functions on <math>X</math>, that is, functions of the form <math>x\mapsto \langle w,x\rangle+b</math> for some <math>w\in\R^d,b\in\R</math>. This is the linear classification with offset learning problem. Now, four coplanar points in a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. Thus, the VC dimension of <math>\mathcal H</math> is <math>d + 1</math>, so it is finite. It follows by the above characterization of PAC-learnable classes that <math>\mathcal H</math> is PAC-learnable, and by extension, learnable.
 
=== Sample-complexity bounds ===
<span id='bounds'></span>
Suppose <math>\mathcal H</math> is a class of binary functions (functions to <math>\{0,1\}</math>). Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size:
<ref>{{Cite journal|title=The optimal sample complexity of PAC learning
|journal=J. Mach. Learn. Res.|volume=17|issue=1|pages=1319–1333|author=Steve Hanneke|year=2016|arxiv=1507.00473|url=https://www.jmlr.org/papers/v17/15-389.html}}</ref>
<math display="block">
N = O\bigg(\frac{VC(\mathcal H) + \ln{1\over \delta}}{\epsilon}\bigg)
</math>
where <math>VC(\mathcal H)</math> is the [[VC dimension]] of <math>\mathcal H</math>.
Moreover, any <math>(\epsilon,\delta)</math>-PAC-learning algorithm for <math>\mathcal H</math> must have sample-complexity:<ref>{{Cite journal|doi=10.1016/0890-5401(89)90002-3|title=A general lower bound on the number of examples needed for learning|journal=Information and Computation|volume=82|issue=3|pages=247|year=1989|last1=Ehrenfeucht|first1=Andrzej|last2=Haussler|first2=David|last3=Kearns|first3=Michael|last4=Valiant|first4=Leslie|doi-access=free}}</ref>
<math display="block">
N = \Omega\bigg(\frac{VC(\mathcal H) + \ln{1\over \delta}}{\epsilon}\bigg)
</math>
Thus, the sample-complexity is a linear function of the [[VC dimension]] of the hypothesis space.
 
Suppose <math>\mathcal H</math> is a class of real-valued functions with range in <math>[0,T]</math>. Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size:
<ref name=mr15>{{cite book|first1=Martin|last1=Anthony|first2=Peter L.|last2=Bartlett|title=Neural Network Learning: Theoretical Foundations|year=2009|isbn=9780521118620}}</ref><ref>{{cite conference|title=On the Pseudo-Dimension of Nearly Optimal Auctions|year=2015|conference=NIPS|url=http://papers.nips.cc/paper/5766-on-the-pseudo-dimension-of-nearly-optimal-auctions|arxiv=1506.03684|last1=Morgenstern|first1=Jamie|author1-link=Jamie Morgenstern|last2=Roughgarden|first2=Tim|pages=136–144|publisher=Curran Associates}}</ref>
<math display="block">
N = O\bigg(T^2\frac{PD(\mathcal H)\ln{T\over \epsilon} + \ln{1\over \delta}}{\epsilon^2}\bigg)
</math>
where <math>PD(\mathcal H)</math> is [[VC dimension#Generalizations|Pollard's pseudo-dimension]] of <math>\mathcal H</math>.
 
==Other settings==
In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[Active learning (machine learning)|active learning]],<ref name="Balcan">{{cite journal |doi = 10.1007/s10994-010-5174-y|title = The true sample complexity of active learning|journal = Machine Learning|date = 2010|volume = 80|issue = 2–3|pages = 111–139|last1=Balcan|first1=Maria-Florina|authorlink1= Maria-Florina Balcan|last2=Hanneke|first2=Steve|last3=Wortman Vaughan|first3=Jennifer|doi-access = free}}</ref> where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in [[reinforcement learning]],<ref>{{citation |last = Kakade | first = Sham | title = On the Sample Complexity of Reinforcement Learning | place = University College London | publisher = Gatsby Computational Neuroscience Unit. | series = PhD Thesis | year = 2003 | url = http://www.ias.tu-darmstadt.de/uploads/Research/NIPS2006/SK.pdf}}</ref> [[online machine learning|online learning]], and unsupervised algorithms, e.g. for [[dictionary learning]].<ref>{{cite journal |last1 = Vainsencher | first1 = Daniel | last2 = Mannor | first2 = Shie | last3 = Bruckstein | first3 = Alfred | title = The Sample Complexity of Dictionary Learning | journal = Journal of Machine Learning Research | volume = 12 | pages = 3259–3281 | date = 2011 | url = http://www.jmlr.org/papers/volume12/vainsencher11a/vainsencher11a.pdf}}</ref>
 
==Efficiency in robotics==
A high sample complexity means that many calculations are needed for running a [[Monte Carlo tree search]].<ref>{{cite conference |title=Monte-carlo tree search by best arm identification |author=Kaufmann, Emilie and Koolen, Wouter M |conference=Advances in Neural Information Processing Systems |pages=4897–4906 |year=2017 }}</ref> It is equivalent to a [[Model-free (reinforcement learning)|model-free]] brute force search in the state space. In contrast, a high-efficiency algorithm has a low sample complexity.<ref>{{cite conference |title=The chin pinch: A case study in skill learning on a legged robot |author=Fidelman, Peggy and Stone, Peter |conference=Robot Soccer World Cup |pages=59–71 |year=2006 |publisher=Springer }}</ref> Possible techniques for reducing the sample complexity are [[metric learning]]<ref>{{cite conference |title=Sample complexity of learning mahalanobis distance metrics |author=Verma, Nakul and Branson, Kristin |conference=Advances in neural information processing systems |pages=2584–2592 |year=2015 }}</ref> and model-based reinforcement learning.<ref>{{cite arXiv |title=Model-ensemble trust-region policy optimization |author=Kurutach, Thanard and Clavera, Ignasi and Duan, Yan and Tamar, Aviv and Abbeel, Pieter |eprint=1802.10592 |year=2018 |class=cs.LG }}</ref>
 
== See also ==
* [[Active learning (machine learning)]]
 
==References==