Sample complexity: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 04:15, 31 January 2024 edit HeyElliott (talk \| contribs) Extended confirmed users 121,470 edits MOS:NOTE Tag: 2017 wikitext editor ← Previous edit		Latest revision as of 00:08, 25 June 2025 edit undo David Eppstein (talk \| contribs) Autopatrolled, Administrators 235,596 edits Jamie Morgenstern
(7 intermediate revisions by 6 users not shown)
Line 1: {{Short description\|Attribute of machine learning models}} {{Machine learning bar}} The '''sample complexity''' of a [[machine learning]] algorithm represents the number of training-samples that it needs in order to successfully learn a target function. Line 10 ⟶ 11: The [[No free lunch theorem]], discussed below, proves that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples. However, if we are only interested in a particular class of target functions (e.g., only linear functions) then the sample complexity is finite, and it depends linearly on the [[VC dimension]] on the class of target functions.<ref name=":0" /> ==Definition== Let <math>X</math> be a space which we call the input space, and <math>Y</math> be a space which we call the output space, and let <math>Z</math> denote the product <math>X\times Y</math>. For example, in the setting of binary classification, <math>X</math> is typically a finite-dimensional vector space and <math>Y</math> is the set <math>\{-1,1\}</math>. Fix a hypothesis space <math>\mathcal H</math> of functions <math>h\colon X\to Y</math>. A learning algorithm over <math>\mathcal H</math> is a computable map from <math>Z^</math> to <math>\mathcal H</math>. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from <math>X</math> to <math>Y</math>. Typical learning algorithms include [[empirical risk minimization]], without or with [[Tikhonov regularization]]. Fix a loss function <math>\mathcal{L}\colon Y\times Y\to\R_{\geq 0}</math>, for example, the square loss <math>\mathcal{L}(y, y') = (y - y')^2</math>, where <math>h(x) = y'</math>. For a given distribution <math>\rho</math> on <math>X\times Y</math>, the '''expected risk''' of a hypothesis (a function) <math>h\in\mathcal H</math> is Line 27 ⟶ 28: <math display="block"> \Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^_\mathcal{H}\geq\varepsilon]<\delta. </math>The '''sample complexity''' of <math>\mathcal{A}</math> is then the minimum <math>N</math> for which this holds, as a function of <math>\rho, \epsilon</math>, and <math>\delta</math>. We write the sample complexity as <math>N(\rho, \epsilon, \delta)</math> to emphasize that this value of <math>N</math> depends on <math>\rho, \epsilon</math>, and <math>\delta</math>. If <math>\mathcal{A}</math> is '''not consistent''', then we set <math>N(\rho,\epsilon,\delta)=\infty</math>. If there exists an algorithm for which <math>N(\rho,\epsilon,\delta)</math> is finite, then we say that the hypothesis space <math> \mathcal H</math> is '''learnable'''.▼ ~~</math>~~ ▲The '''sample complexity''' of <math>\mathcal{A}</math> is then the minimum <math>N</math> for which this holds, as a function of <math>\rho, \epsilon</math>, and <math>\delta</math>. We write the sample complexity as <math>N(\rho, \epsilon, \delta)</math> to emphasize that this value of <math>N</math> depends on <math>\rho, \epsilon</math>, and <math>\delta</math>. If <math>\mathcal{A}</math> is '''not consistent''', then we set <math>N(\rho,\epsilon,\delta)=\infty</math>. If there exists an algorithm for which <math>N(\rho,\epsilon,\delta)</math> is finite, then we say that the hypothesis space <math> \mathcal H</math> is '''learnable'''. In others words, the sample complexity <math>N(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm: given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs to sample <math>N(\rho,\epsilon,\delta)</math> data points to guarantee that the risk of the output function is within <math>\epsilon</math> of the best possible, with probability at least <math>1 - \delta</math> .<ref name="Rosasco">{{citation \|last = Rosasco \| first = Lorenzo \| title = Consistency, Learnability, and Regularization \| series = Lecture Notes for MIT Course 9.520. \| year = 2014 }}</ref> Line 79: Suppose <math>\mathcal H</math> is a class of real-valued functions with range in <math>[0,T]</math>. Then, <math>\mathcal H</math> is <math>(\epsilon,\delta)</math>-PAC-learnable with a sample of size: <ref name=mr15>{{cite book\|first1=Martin\|last1=Anthony\|first2=Peter L.\|last2=Bartlett\|title=Neural Network Learning: Theoretical Foundations\|year=2009\|isbn=9780521118620}}</ref><ref>{{cite conference\|title=On the Pseudo-Dimension of Nearly Optimal Auctions\|year=2015\|conference=NIPS\|url=http://papers.nips.cc/paper/5766-on-the-pseudo-dimension-of-nearly-optimal-auctions\|arxiv=1506.03684\|last1=Morgenstern\|first1=Jamie\|author1-link=Jamie Morgenstern\|last2=Roughgarden\|first2=Tim\|pages=136–144\|publisher=Curran Associates}}</ref> <math display="block"> N = O\bigg(T^2\frac{PD(\mathcal H)\ln{T\over \epsilon} + \ln{1\over \delta}}{\epsilon^2}\bigg) Line 85: where <math>PD(\mathcal H)</math> is [[VC dimension#Generalizations\|Pollard's pseudo-dimension]] of <math>\mathcal H</math>. ==Other ~~Settings~~settings== In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[Active learning (machine learning)\|active learning]],<ref name="Balcan">{{cite journal \|doi = 10.1007/s10994-010-5174-y\|title = The true sample complexity of active learning\|journal = Machine Learning\|date = 2010\|volume = 80\|issue = 2–3\|pages = 111–139\|last1=Balcan\|first1=Maria-Florina\|authorlink1= Maria-Florina Balcan\|last2=Hanneke\|first2=Steve\|last3=Wortman Vaughan\|first3=Jennifer\|doi-access = free}}</ref> where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in [[reinforcement learning]],<ref>{{citation \|last = Kakade \| first = Sham \| title = On the Sample Complexity of Reinforcement Learning \| place = University College London \| publisher = Gatsby Computational Neuroscience Unit. \| series = PhD Thesis \| year = 2003 \| url = http://www.ias.tu-darmstadt.de/uploads/Research/NIPS2006/SK.pdf}}</ref> [[online machine learning\|online learning]], and unsupervised algorithms, e.g. for [[dictionary learning]].<ref>{{cite journal \|last1 = Vainsencher \| first1 = Daniel \| last2 = Mannor \| first2 = Shie \| last3 = Bruckstein \| first3 = Alfred \| title = The Sample Complexity of Dictionary Learning \| journal = Journal of Machine Learning Research \| volume = 12 \| pages = 3259–3281 \| date = 2011 \| url = http://www.jmlr.org/papers/volume12/vainsencher11a/vainsencher11a.pdf}}</ref> ==Efficiency in robotics== A high sample complexity means, that many calculations are needed for running a [[Monte Carlo tree search]].<ref>{{cite conference \|title=Monte-carlo tree search by best arm identification \|author=Kaufmann, Emilie and Koolen, Wouter M \|conference=Advances in Neural Information Processing Systems \|pages=4897–4906 \|year=2017 }}</ref> ~~Its~~It ~~equal~~is equivalent to a [[Model-free (reinforcement learning)\|model -free]] brute force search in the state space. In contrast, a high -efficiency algorithm has a low sample complexity.<ref>{{cite conference \|title=The chin pinch: A case study in skill learning on a legged robot \|author=Fidelman, Peggy and Stone, Peter \|conference=Robot Soccer World Cup \|pages=59–71 \|year=2006 \|publisher=Springer }}</ref> Possible techniques for reducing the sample complexity are [[metric learning]]<ref>{{cite conference \|title=Sample complexity of learning mahalanobis distance metrics \|author=Verma, Nakul and Branson, Kristin \|conference=Advances in neural information processing systems \|pages=2584–2592 \|year=2015 }}</ref> and model -based reinforcement learning.<ref>{{cite arXiv \|title=Model-ensemble trust-region policy optimization \|author=Kurutach, Thanard and Clavera, Ignasi and Duan, Yan and Tamar, Aviv and Abbeel, Pieter \|eprint=1802.10592 \|year=2018 \|class=cs.LG }}</ref> == See also == * [[Active learning (machine learning)]] ==References==