{{Short description|Attribute of machine learning models}}
{{Machine learning bar}}
The '''sample complexity''' of a [[machine learning]] algorithm represents the number of training samples that it needs in order to successfully learn a target function.
There are two variants of sample complexity:
* The weak variant fixes a particular input-output distribution;
* The strong variant takes the worst-case sample complexity over all input-output distributions.
The [[No free lunch theorem]], discussed below, says that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally optimal target function using a finite number of training samples.
However, if we are only interested in a particular class of target functions (e.g., only linear functions), then the sample complexity is finite, and it depends linearly on the [[VC dimension]] of the class of target functions.<ref name=":0" />
==Definition==
Let <math>X</math> be a space which we call the input space, and <math>Y</math> be a space which we call the output space, and let <math>Z</math> denote the product <math>X\times Y</math>. For example, in the setting of binary classification, <math>X</math> is typically a finite-dimensional vector space and <math>Y</math> is the set <math>\{-1,1\}</math>.
Fix a hypothesis space <math>\mathcal H</math> of functions <math>h\colon X\to Y</math>. A learning algorithm over <math>\mathcal H</math> is a computable map from <math>Z^*</math> to <math>\mathcal H</math>; that is, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from <math>X</math> to <math>Y</math>. Typical learning algorithms include [[empirical risk minimization]], without or with [[Tikhonov regularization]].
Fix a loss function <math>\mathcal{L}\colon Y\times Y\to\mathbb R_{\geq 0}</math>, for example, the square loss <math>\mathcal{L}(y,y')=(y-y')^2</math>. For a given distribution <math>\rho</math> on <math>X\times Y</math>, the '''expected risk''' of a hypothesis (a function) <math>h\in\mathcal H</math> is
:<math>\mathcal E(h) :=\mathbb E_\rho[\mathcal{L}(h(x),y)]=\int_{X\times Y} \mathcal{L}(h(x),y)\,d\rho(x,y).</math>
In our setting, we have <math>h=\mathcal{A}(S_n)</math>, where <math>\mathcal{A}</math> is a learning algorithm and <math>S_n=((x_1,y_1),\ldots,(x_n,y_n))\sim\rho^n</math> is a sequence of samples drawn independently from <math>\rho</math>. Define the optimal risk<math display="block">
\mathcal E^*_\mathcal{H} = \underset{h \in \mathcal H}{\inf}\mathcal E(h).
</math>Set <math>h_n=\mathcal{A}(S_n)</math>, for each [[sample size]] <math>n</math>. <math>h_n</math> is a [[random variable]] and depends on the random variable <math>S_n</math>, which is drawn from the distribution <math>\rho^n</math>. The algorithm <math>\mathcal{A}</math> is called '''consistent''' if <math> \mathcal E(h_n) </math> probabilistically converges to <math> \mathcal E_\mathcal H^*</math>. In other words, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math>, such that, for all sample sizes <math>n \geq N</math>, we have
<math display="block">
\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]<\delta.
</math>The '''sample complexity''' of <math>\mathcal{A}</math> is then the minimum <math>N</math> for which this holds, as a function of <math>\rho, \epsilon</math>, and <math>\delta</math>. We write the sample complexity as <math>N(\rho,\epsilon,\delta)</math> to emphasize that this value of <math>N</math> depends on <math>\rho, \epsilon</math>, and <math>\delta</math>. If <math>\mathcal{A}</math> is not consistent, then we set <math>N(\rho,\epsilon,\delta)=\infty</math>. If there exists an algorithm for which <math>N(\rho,\epsilon,\delta)</math> is finite, then we say that the hypothesis space <math>\mathcal H</math> is '''learnable'''.
In other words, the sample complexity <math>N(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm: given a target accuracy <math>\epsilon</math> and confidence parameter <math>\delta</math>, one needs to sample <math>N(\rho,\epsilon,\delta)</math> data points to guarantee that the risk of the output function is within <math>\epsilon</math> of the best possible, with probability at least <math>1-\delta</math>.
In [[Probably approximately correct learning|probably approximately correct (PAC) learning]], one is concerned with whether the sample complexity is ''polynomial'', that is, whether <math>N(\rho,\epsilon,\delta)</math> is bounded by a polynomial in <math>1/\epsilon</math> and <math>1/\delta</math>. If <math>N(\rho,\epsilon,\delta)</math> is polynomial for some learning algorithm, then one says that the hypothesis space <math>\mathcal H</math> is '''PAC-learnable'''. This is a stronger notion than learnability.
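The following is a minimal sketch (not drawn from the references above) of how the weak sample complexity can be estimated empirically in one concrete case: it assumes a fixed distribution <math>\rho</math> (uniform inputs on <math>[0,1]</math> with noiseless threshold labels), the hypothesis class of threshold classifiers, and an empirical-risk-minimization learner. The helper names (<code>erm_threshold</code>, <code>failure_probability</code>) are illustrative, and the number reported is only a Monte Carlo estimate of the smallest <math>n</math> at which the consistency condition above holds.
<syntaxhighlight lang="python">
# Illustrative sketch: estimating the weak sample complexity N(rho, eps, delta)
# for threshold classifiers h_t(x) = sign(x - t) under a fixed distribution rho.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. pairs from rho: x ~ Uniform(0, 1), y = sign(x - 0.5)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.where(x >= 0.5, 1, -1)
    return x, y

def erm_threshold(x, y):
    """Empirical risk minimization over thresholds: pick t minimizing 0-1 training error."""
    candidates = np.concatenate(([0.0], np.sort(x), [1.0]))
    errors = [np.mean(np.where(x >= t, 1, -1) != y) for t in candidates]
    return candidates[int(np.argmin(errors))]

def excess_risk(t):
    """True risk of h_t under rho; the optimal risk E*_H is 0 (realizable case)."""
    return abs(t - 0.5)

def failure_probability(n, eps, trials=200):
    """Monte Carlo estimate of Pr[ E(h_n) - E*_H >= eps ]."""
    failures = sum(excess_risk(erm_threshold(*sample(n))) >= eps for _ in range(trials))
    return failures / trials

eps, delta = 0.05, 0.05
n = 1
while failure_probability(n, eps) >= delta:
    n *= 2  # double until the consistency condition holds
print("consistency condition met at roughly n =", n)  # rough estimate of N(rho, eps, delta)
</syntaxhighlight>
Because the class of thresholds has VC dimension 1 and the labels are noiseless, the estimate grows roughly like <math>\tfrac{1}{\epsilon}\ln\tfrac{1}{\delta}</math>, in line with the bounds given below.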
==Unrestricted hypothesis space: infinite sample complexity==
<span id='No Free Lunch Theorem'></span>
One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm <math>\mathcal{A}</math> such that, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math> such that, for all sample sizes <math>n \geq N</math>, we have
<math display="block"> \sup_\rho\left(\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right)<\delta,
</math>
where <math>h_n=\mathcal A(S_n)</math>, with <math>S_n</math> a sample of size <math>n</math> drawn from <math>\rho^n</math> as above. The No free lunch theorem says that, without restrictions on the hypothesis space <math>\mathcal H</math>, this is not the case: there always exist "bad" distributions for which the sample complexity is arbitrarily large.
Thus, in order to make statements about the rate of convergence of the quantity
<math display="block">
\sup_\rho\left(\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]\right),
</math>
one must either
* constrain the space of probability distributions <math>\rho</math>, e.g. via a parametric approach, or
* constrain the space of hypotheses <math>\mathcal H</math>, as in distribution-free approaches.
=== An example of a PAC-learnable hypothesis space ===
=== Sample-complexity bounds ===
<span id='bounds'></span>
Suppose <math>\mathcal H</math> is a class of binary functions (functions to <math>\{0,1\}</math>). Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size:
<ref>{{Cite journal|title=The optimal sample complexity of PAC learning|journal=J. Mach. Learn. Res.|volume=17|issue=1|pages=1319–1333|author=Steve Hanneke|year=2016|arxiv=1507.00473}}</ref>
<math display="block">
N = O\bigg(\frac{VC(\mathcal H) + \ln{1\over \delta}}{\epsilon}\bigg)
</math>
where <math>VC(\mathcal H)</math> is the [[VC dimension]] of <math>\mathcal H</math>.
Moreover, any <math>(\epsilon,\delta)</math>-PAC-learning algorithm for <math>\mathcal H</math> must have sample-complexity:<ref>{{Cite journal|doi=10.1016/0890-5401(89)90002-3|title=A general lower bound on the number of examples needed for learning|journal=Information and Computation|volume=82|issue=3|pages=247|year=1989|last1=Ehrenfeucht|first1=Andrzej|last2=Haussler|first2=David|last3=Kearns|first3=Michael|last4=Valiant|first4=Leslie|doi-access=free}}</ref>
<math display="block">
N = \Omega\bigg(\frac{VC(\mathcal H) + \ln{1\over \delta}}{\epsilon}\bigg)
</math>
Thus, the sample-complexity is a linear function of the [[VC dimension]] of the hypothesis space.
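As an illustration of how these expressions behave, the short sketch below (not from the cited papers) plugs numbers into the common form <math>\tfrac{VC(\mathcal H)+\ln\frac{1}{\delta}}{\epsilon}</math>. The constant factors hidden by the <math>O(\cdot)</math> and <math>\Omega(\cdot)</math> notation are unknown here and simply set to <math>c=1</math>, and the helper name <code>vc_bound</code> is hypothetical.
<syntaxhighlight lang="python">
from math import ceil, log

def vc_bound(vc_dim, eps, delta, c=1.0):
    """Common shape of the upper and lower bounds above: c * (VC(H) + ln(1/delta)) / eps.
    The true constants hidden by O(.) and Omega(.) are not 1; c is only a placeholder."""
    return ceil(c * (vc_dim + log(1.0 / delta)) / eps)

# Example: half-planes (linear separators with offset) in R^2 have VC dimension 3.
print(vc_bound(vc_dim=3, eps=0.1, delta=0.05))  # ceil((3 + ln 20) / 0.1) = 60
print(vc_bound(vc_dim=6, eps=0.1, delta=0.05))  # adding 3 to VC(H) adds 3/eps = 30 samples
</syntaxhighlight>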
Suppose <math>\mathcal H</math> is a class of real-valued functions with range in <math>[0,T]</math>. Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size:
<ref name=mr15>{{cite book|first1=Martin|last1=Anthony|first2=Peter L.|last2=Bartlett|title=Neural Network Learning: Theoretical Foundations|year=2009|isbn=9780521118620}}</ref><ref>{{cite conference|title=On the Pseudo-Dimension of Nearly Optimal Auctions|year=2015|conference=NIPS|url=http://papers.nips.cc/paper/5766-on-the-pseudo-dimension-of-nearly-optimal-auctions|arxiv=1506.03684|author1=Jamie Morgenstern|author2=Tim Roughgarden}}</ref>
<math display="block">
N = O\bigg(T^2\frac{PD(\mathcal H)\ln{T\over \epsilon} + \ln{1\over \delta}}{\epsilon^2}\bigg)
</math>
where <math>PD(\mathcal H)</math> is [[VC dimension#Generalizations|Pollard's pseudo-dimension]] of <math>\mathcal H</math>.
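Similarly, a brief sketch (again with the constant hidden by <math>O(\cdot)</math> set to 1, and a hypothetical helper name) shows how the real-valued bound scales with the pseudo-dimension, the range <math>T</math>, and <math>\epsilon</math>; note the <math>1/\epsilon^2</math> dependence, in contrast to the <math>1/\epsilon</math> dependence of the binary case.
<syntaxhighlight lang="python">
from math import ceil, log

def pseudo_dim_bound(pd, T, eps, delta, c=1.0):
    """Shape of the bound N = O(T^2 (PD(H) ln(T/eps) + ln(1/delta)) / eps^2),
    with the constant hidden by O(.) replaced by the placeholder c."""
    return ceil(c * T ** 2 * (pd * log(T / eps) + log(1.0 / delta)) / eps ** 2)

print(pseudo_dim_bound(pd=5, T=1.0, eps=0.1, delta=0.05))  # 1451 with c = 1
</syntaxhighlight>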
==Other settings==
In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[Active learning (machine learning)|active learning]],<ref name="Balcan">{{cite journal |doi = 10.1007/s10994-010-5174-y|title = The true sample complexity of active learning|journal = Machine Learning|date = 2010|volume = 80|issue = 2–3|pages = 111–139|author1=Maria-Florina Balcan|author2=Steve Hanneke|author3=Jennifer Wortman Vaughan|doi-access=free}}</ref> where the algorithm can request labels for specifically chosen inputs in order to reduce the cost of obtaining many labels.
==Efficiency in robotics==
A high sample complexity means that many calculations are needed for running a [[Monte Carlo tree search]].<ref>{{cite conference |title=Monte-carlo tree search by best arm identification |author=Kaufmann, Emilie and Koolen, Wouter M |conference=Advances in Neural Information Processing Systems |pages=4897–4906 |year=2017 }}</ref> It is equivalent to a [[Model-free (reinforcement learning)|model-free]] brute force search in the state space. In contrast, a high-efficiency algorithm has a low sample complexity.<ref>{{cite conference |title=The chin pinch: A case study in skill learning on a legged robot |author=Fidelman, Peggy and Stone, Peter |conference=Robot Soccer World Cup |pages=59–71 |year=2006 |publisher=Springer }}</ref> Possible techniques for reducing the sample complexity are [[metric learning]]<ref>{{cite conference |title=Sample complexity of learning mahalanobis distance metrics |author=Verma, Nakul and Branson, Kristin |conference=Advances in neural information processing systems |pages=2584–2592 |year=2015 }}</ref> and model-based reinforcement learning.<ref>{{cite arXiv |title=Model-ensemble trust-region policy optimization |author=Kurutach, Thanard and Clavera, Ignasi and Duan, Yan and Tamar, Aviv and Abbeel, Pieter |eprint=1802.10592 |year=2018 |class=cs.LG }}</ref>
== See also ==
* [[Active learning (machine learning)]]
==References==
{{Reflist}}