{{Short description|Attribute of machine learning models}}
{{Machine learning bar}}
The '''sample complexity''' of a [[machine learning]] algorithm represents the number of training samples that it needs in order to successfully learn a target function.
There are two variants of sample complexity:
* The weak variant fixes a particular input-output distribution;
* The strong variant takes the worst-case sample complexity over all input-output distributions.
The [[No free lunch theorem]] says that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples.
However, if we are only interested in a particular class of target functions (e.g., only linear functions) then the sample complexity is finite, and it depends linearly on the [[VC dimension]] of the class of target functions.<ref name=":0" />
==Definition==
Let <math>X</math> be a space which we call the input space, and <math>Y</math> be a space which we call the output space, and let <math>Z</math> denote the product <math>X\times Y</math>. For example, in the setting of binary classification, <math>X</math> is typically a finite-dimensional vector space and <math>Y</math> is the set <math>\{-1,1\}</math>.
Fix a hypothesis space <math>\mathcal H</math> of functions <math>h\colon X\to Y</math>. A learning algorithm over <math>\mathcal H</math> is a computable map from <math>Z^*</math>, the set of finite sequences of elements of <math>Z</math>, to <math>\mathcal H</math>. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from <math>X</math> to <math>Y</math>.
Fix a loss function <math>\mathcal{L}\colon Y\times Y\to\R_{\geq 0}</math>, for example, the square loss <math>\mathcal{L}(y, y') = (y - y')^2</math>, where <math>y' = h(x)</math> is the predicted output and <math>y</math> is the true output. For a given distribution <math>\rho</math> on <math>X\times Y</math>, the '''expected risk''' of a hypothesis (a function) <math>h\in\mathcal H</math> is
<math display="block">
\mathcal E(h) := \mathbb E_\rho[\mathcal{L}(h(x), y)] = \int_{X\times Y} \mathcal{L}(h(x), y) \, d\rho(x, y).
</math>
In our setting, we have <math>h=\mathcal{A}(S_n)</math>, where <math>\mathcal{A}</math> is a learning algorithm and <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> is a sequence of vectors which are all drawn independently from <math>\rho</math>. Define the optimal risk<math display="block">
\mathcal E^*_\mathcal{H} = \underset{h \in \mathcal H}{\inf}\mathcal E(h).
</math>Set <math>h_n=\mathcal{A}(S_n)</math>, for each [[sample size]] <math>n</math>. The algorithm <math>\mathcal{A}</math> is called '''consistent''' if <math>\mathcal E(h_n)</math> converges in probability to <math>\mathcal E^*_\mathcal{H}</math>; that is, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math> such that, for all sample sizes <math>n \geq N</math>,
<math display="block">
\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\varepsilon]<\delta.
</math>The '''sample complexity''' of <math>\mathcal{A}</math> is then the minimum <math>N</math> for which this holds, as a function of <math>\rho, \epsilon</math>, and <math>\delta</math>. We write the sample complexity as <math>N(\rho, \epsilon, \delta)</math> to emphasize that this value of <math>N</math> depends on <math>\rho, \epsilon</math>, and <math>\delta</math>. If <math>\mathcal{A}</math> is '''not consistent''', then we set <math>N(\rho,\epsilon,\delta)=\infty</math>. If there exists an algorithm for which <math>N(\rho,\epsilon,\delta)</math> is finite, then we say that the hypothesis space <math> \mathcal H</math> is '''learnable'''.
In other words, the sample complexity <math>N(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm: given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs to sample <math>N(\rho,\epsilon,\delta)</math> data points to guarantee that the risk of the output function is within <math>\epsilon</math> of the best possible, with probability at least <math>1 - \delta</math>.<ref name="Rosasco">{{citation |last = Rosasco | first = Lorenzo | title = Consistency, Learnability, and Regularization | series = Lecture Notes for MIT Course 9.520. | year = 2014 }}</ref>
In [[Probably approximately correct learning|probably approximately correct (PAC) learning]], one is concerned with whether the sample complexity is ''polynomial'', that is, whether <math>N(\rho,\epsilon,\delta)</math> is bounded by a polynomial in <math>1/\epsilon</math> and <math>1/\delta</math>. If <math>N(\rho,\epsilon,\delta)</math> is polynomial for some learning algorithm, then one says that the hypothesis space <math> \mathcal H </math> is '''PAC-learnable'''.
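To make the definition concrete, the following sketch estimates <math>N(\rho,\epsilon,\delta)</math> by simulation for one fixed distribution <math>\rho</math> and the one-dimensional class of threshold classifiers. It is an editorial illustration only, not drawn from the cited references: the distribution, the consistent learner <code>learn_threshold</code>, and the trial counts are assumptions chosen for simplicity.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def learn_threshold(x, y):
    """A consistent learner for thresholds h_t(x) = sign(x - t):
    return the smallest positively labeled point (or 1.0 if there is none)."""
    positives = x[y == 1]
    return positives.min() if positives.size else 1.0

def excess_risk(t_hat, t_star=0.5):
    # With x ~ Uniform[0, 1] and y = sign(x - t_star), the learned threshold
    # errs exactly on [t_star, t_hat), so the excess risk is that interval's length.
    return max(t_hat - t_star, 0.0)

def estimate_sample_complexity(eps, delta, trials=2000, n_max=500):
    """Smallest n with Pr[excess risk >= eps] < delta, estimated by Monte Carlo."""
    for n in range(1, n_max + 1):
        failures = 0
        for _ in range(trials):
            x = rng.uniform(0.0, 1.0, size=n)
            y = np.where(x >= 0.5, 1, -1)
            if excess_risk(learn_threshold(x, y)) >= eps:
                failures += 1
        if failures / trials < delta:
            return n
    return None  # no sample size up to n_max achieves the requested accuracy

print(estimate_sample_complexity(eps=0.1, delta=0.05))
</syntaxhighlight>

In this realizable example the excess risk exceeds <math>\epsilon</math> exactly when no training point lands in <math>[0.5, 0.5+\epsilon)</math>, which has probability <math>(1-\epsilon)^n</math>, so the simulated value should be close to <math>\ln(1/\delta)/\epsilon \approx 30</math>.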
==Unrestricted hypothesis space: infinite sample complexity==
When the hypothesis space is the set of all measurable functions, no finite sample size suffices: for every learning algorithm and every sample size <math>n</math>, there exists a distribution for which the learned hypothesis has, with non-negligible probability, an expected risk far from the best achievable. This is the sense in which the [[no free lunch theorem]] mentioned above makes the strong sample complexity infinite.
=== An example of a PAC-learnable hypothesis space ===
Let <math>X = \R^d, Y = \{-1, 1\}</math>, and let <math>\mathcal H</math> be the class of classifiers induced by affine functions on <math>X</math>, that is, functions of the form <math>x\mapsto \operatorname{sign}(\langle w,x\rangle+b)</math> for some <math>w\in\R^d,b\in\R</math>. This is the linear classification with offset learning problem. Now, the VC dimension of <math>\mathcal H</math> is <math>d+1</math>, which is finite, so by the sample-complexity bound below <math>\mathcal H</math> is PAC-learnable.
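A quick way to see that the VC dimension is at least <math>d+1</math>: any <math>d+1</math> affinely independent points can be shattered, because for every labeling one can solve exactly for <math>(w,b)</math>. The sketch below (an editorial illustration assuming only NumPy, not part of the cited sources) checks this by brute force for the origin together with the standard basis vectors.

<syntaxhighlight lang="python">
import itertools
import numpy as np

d = 3                                         # dimension of the input space X = R^d
points = np.vstack([np.zeros(d), np.eye(d)])  # d+1 affinely independent points, shape (d+1, d)
A = np.hstack([points, np.ones((d + 1, 1))])  # row i is [x_i, 1]; invertible for these points

for labels in itertools.product([-1.0, 1.0], repeat=d + 1):
    y = np.array(labels)
    wb = np.linalg.solve(A, y)                # solve <w, x_i> + b = y_i exactly
    w, b = wb[:d], wb[d]
    assert np.array_equal(np.sign(points @ w + b), y)   # this labeling is realized

print(f"All {2 ** (d + 1)} labelings of {d + 1} points realized, so VC dim >= {d + 1}")
</syntaxhighlight>

That no set of <math>d+2</math> points can be shattered, so that the VC dimension is exactly <math>d+1</math>, follows from [[Radon's theorem]].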
=== Sample-complexity bounds ===
Suppose <math>\mathcal H</math> is a class of binary functions (functions to <math>\{0,1\}</math>). Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size:
<ref>{{Cite journal|title=The optimal sample complexity of PAC learning
|journal=J. Mach. Learn. Res.|volume=17|issue=1|pages=1319–1333|author=Steve Hanneke|year=2016|arxiv=1507.00473|url=https://www.jmlr.org/papers/v17/15-389.html}}</ref>
<math display="block">
N = O\bigg(\frac{VC(\mathcal H) + \ln{1\over \delta}}{\epsilon}\bigg)
</math>where <math>VC(\mathcal H)</math> is the [[VC dimension]] of <math>\mathcal H</math>.
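Turning the bound into a number requires fixing the constant hidden in the <math>O(\cdot)</math>; the short sketch below is an editorial illustration, with <code>C</code> an arbitrary placeholder constant rather than anything stated by the theorem or the cited source.

<syntaxhighlight lang="python">
import math

def vc_sample_size(vc_dim, eps, delta, C=8.0):
    """A sufficient sample size of the form C * (VC(H) + ln(1/delta)) / eps.
    C is a placeholder for the unspecified constant in the O(.) bound."""
    return math.ceil(C * (vc_dim + math.log(1.0 / delta)) / eps)

# Example: affine classifiers on R^10 have VC dimension 11 (see the example above).
print(vc_sample_size(vc_dim=11, eps=0.05, delta=0.01))
</syntaxhighlight>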
Suppose <math>\mathcal H</math> is a class of real-valued functions with range in <math>[0,T]</math>. Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size:
<ref name=mr15>{{cite book|first1=Martin|last1=Anthony|first2=Peter L.|last2=Bartlett|title=Neural Network Learning: Theoretical Foundations|year=2009|isbn=9780521118620}}</ref><ref>{{cite conference|title=On the Pseudo-Dimension of Nearly Optimal Auctions|year=2015|conference=NIPS|url=http://papers.nips.cc/paper/5766-on-the-pseudo-dimension-of-nearly-optimal-auctions|arxiv=1506.03684|last1=Morgenstern|first1=Jamie|author1-link=Jamie Morgenstern|last2=Roughgarden|first2=Tim|pages=136–144|publisher=Curran Associates}}</ref>
<math display="block">
N = O\bigg(T^2\frac{PD(\mathcal H)\ln{T\over \epsilon} + \ln{1\over \delta}}{\epsilon^2}\bigg)
</math>
where <math>PD(\mathcal H)</math> is [[VC dimension#Generalizations|Pollard's pseudo-dimension]] of <math>\mathcal H</math>.
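A parallel sketch for the real-valued case, with the same caveat that <code>C</code> stands in for the unspecified constant in the bound:

<syntaxhighlight lang="python">
import math

def pseudo_dim_sample_size(pd, T, eps, delta, C=8.0):
    """A sufficient sample size of the form
    C * T^2 * (PD(H) * ln(T/eps) + ln(1/delta)) / eps^2."""
    return math.ceil(C * T**2 * (pd * math.log(T / eps) + math.log(1.0 / delta)) / eps**2)

print(pseudo_dim_sample_size(pd=5, T=1.0, eps=0.1, delta=0.01))
</syntaxhighlight>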
==Other settings==
In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[Active learning (machine learning)|active learning]],<ref name="Balcan">{{cite journal |doi = 10.1007/s10994-010-5174-y|title = The true sample complexity of active learning|journal = Machine Learning|date = 2010|volume = 80|issue = 2–3|pages = 111–139|last1=Balcan|first1=Maria-Florina|authorlink1= Maria-Florina Balcan|last2=Hanneke|first2=Steve|last3=Wortman Vaughan|first3=Jennifer|doi-access = free}}</ref> where the algorithm can request labels for specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in [[reinforcement learning]],<ref>{{citation |last = Kakade | first = Sham | title = On the Sample Complexity of Reinforcement Learning | place = University College London | publisher = Gatsby Computational Neuroscience Unit. | series = PhD Thesis | year = 2003 | url = http://www.ias.tu-darmstadt.de/uploads/Research/NIPS2006/SK.pdf}}</ref> [[online machine learning|online learning]], and unsupervised algorithms, e.g. for [[dictionary learning]].<ref>{{cite journal |last1 = Vainsencher | first1 = Daniel | last2 = Mannor | first2 = Shie | last3 = Bruckstein | first3 = Alfred | title = The Sample Complexity of Dictionary Learning | journal = Journal of Machine Learning Research | volume = 12 | pages = 3259–3281 | date = 2011 | url = http://www.jmlr.org/papers/volume12/vainsencher11a/vainsencher11a.pdf}}</ref>
==Efficiency in robotics==
A high sample complexity means that many interactions with the environment are needed during training. This is costly for physical robots, where gathering experience is slow and hardware wear is expensive, so sample-efficient algorithms are particularly important in [[robot learning]].
== See also ==
* [[Active learning (machine learning)]]
==References==