Revision as of 18:57, 19 December 2014 by Pixtonc (minor edit; no edit summary)
Revision as of 07:56, 20 December 2014 by BG19bot (minor edit; WP:CHECKWIKI error fix for #61. Punctuation goes before References. Do general fixes if a problem exists. - using AWB (10514))
==Mathematical Setup==
Given a set of samples <math>S_n = \{(x_1,y_1),\ldots,(x_n,y_n)\}</math> drawn i.i.d. according to a distribution <math>\rho</math> from some input space <math>\mathcal X \times \mathcal Y</math>, a supervised learning algorithm chooses a function <math>f:\mathcal X \to \mathcal Y</math> from some hypothesis class <math>\mathcal H</math>. A desirable property of the algorithm is that it chooses functions with small expected prediction error with respect to <math>\rho</math> and some loss function <math>V:\mathcal Y \times \mathcal Y \to \mathbb R_+</math>. Specifically, it is desirable to have a consistent algorithm, that is, an algorithm that generates functions whose expected risks converge to the best possible expected risk.

Formally, let
<math>
\mathcal R^*_\mathcal{H} = \underset{f \in \mathcal H}{\inf}\mathbb E_\rho[V(f(x),y)],
</math>
and let <math>f_n</math> be the functions generated by an algorithm as the number of data points <math>n</math> grows. The algorithm is consistent if
<math>
\underset{n \to \infty}{\lim}\mathbb P_{S_n}(\mathbb E_\rho[V(f_n(x),y)] - \mathcal R^*_\mathcal{H} > \epsilon) = 0,
</math>
for all <math>\epsilon > 0</math>, where <math>\mathbb P_{S_n}</math> denotes the probability measure <math>\rho^n</math>.

Consistency is a useful property, but it says nothing about how fast the expected risks converge. Since in practice one always deals with finite data, it is important to know how many samples are needed to achieve a risk that is within <math>\epsilon</math> of the best possible for the function class. The notion of sample complexity answers this question.
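The consistency condition can be explored numerically. The following is a minimal sketch, not taken from the cited sources: it assumes a toy distribution (inputs uniform on [0,1], labels given by a threshold at 0.5 and flipped with probability 0.1), the 0-1 loss, a finite hypothesis class of 21 threshold classifiers, and empirical risk minimization as the learning algorithm. All names (`NOISE`, `TRUE_T`, `THRESHOLDS`, `failure_probability`) are hypothetical. It estimates by Monte Carlo the probability that the excess risk of the learned function exceeds <math>\epsilon</math> for a given <math>n</math>.

```python
import random

NOISE = 0.1                                # assumed label-flip probability
TRUE_T = 0.5                               # assumed true decision threshold
THRESHOLDS = [i / 20 for i in range(21)]   # assumed finite hypothesis class H


def sample(n, rng):
    """Draw n i.i.d. pairs (x, y) from the toy distribution rho."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x >= TRUE_T)
        if rng.random() < NOISE:
            y = 1 - y                      # flip the label with probability NOISE
        data.append((x, y))
    return data


def expected_risk(t):
    """Closed-form 0-1 risk of f_t(x) = 1[x >= t] under the toy rho."""
    return NOISE + (1 - 2 * NOISE) * abs(t - TRUE_T)


def erm(data):
    """Empirical risk minimization over the finite class of thresholds."""
    def empirical_errors(t):
        return sum(int(x >= t) != y for x, y in data)
    return min(THRESHOLDS, key=empirical_errors)


def failure_probability(n, eps, trials, seed=0):
    """Monte Carlo estimate of P(risk(f_n) - R*_H > eps) over samples S_n."""
    rng = random.Random(seed)
    best = min(expected_risk(t) for t in THRESHOLDS)   # R*_H for this class
    failures = 0
    for _ in range(trials):
        t_hat = erm(sample(n, rng))
        if expected_risk(t_hat) - best > eps:
            failures += 1
    return failures / trials
```

Calling `failure_probability` with a fixed `eps` and increasing `n` gives an empirical picture of how quickly the failure probability shrinks for this particular toy distribution; it says nothing about other distributions.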
The sample complexity of a learning algorithm is a function <math>n(\rho,\epsilon,\delta)</math> such that for all <math>n \ge n(\rho,\epsilon,\delta)</math>,
<math>
\mathbb P_{S_n}(\mathbb E_\rho[V(f_n(x),y)] - \mathcal R^*_\mathcal{H} > \epsilon) \le \delta.
</math>
In words, the sample complexity <math>n(\rho,\epsilon,\delta)</math> quantifies the rate of consistency of the algorithm. Given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs at most <math>n(\rho,\epsilon,\delta)</math> samples to guarantee that the expected risk of the output function is within <math>\epsilon</math> of the best possible expected risk with probability at least <math>1-\delta</math>.<ref name = "Rosasco">{{citation |last = Rosasco | first = Lorenzo | title = Consistency, Learnability, and Regularization | series = Lecture Notes for MIT Course 9.520. | year = 2014 }}</ref>

==No Free Lunch Theorem (Machine Learning)==
Optimistically, one could hope for a stronger notion of sample complexity that is independent of the distribution <math>\rho</math> on the input and output spaces.
However, it has been shown that without restrictions on the hypothesis class <math>\mathcal H</math>, there always exist "bad" distributions for which the sample complexity is arbitrarily large.<ref>{{citation |last = Vapnik | first = Vladimir | title = Statistical Learning Theory | place = New York | publisher = Wiley. | year = 1998}}</ref> Thus, in order to make statements about the rate of convergence of the quantity
<math>
\underset{\rho}{\sup}\ \mathbb P_{S_n}(\mathbb E_\rho[V(f_n(x),y)] - \mathcal R^*_\mathcal{H} > \epsilon),
</math>
one must either
* constrain the set of probability distributions <math>\rho</math>, e.g. via a parametric approach, or
* constrain the set <math>\mathcal H</math> to be small, as in distribution-free approaches.
The latter approach leads to concepts such as [[VC dimension]] and [[Rademacher complexity]], which control the complexity of the space <math>\mathcal H</math>. A smaller hypothesis space introduces more bias into the inference process, meaning that <math>\mathcal R^*_\mathcal{H}</math> may be larger than the best possible expected risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce functions converging in expected risk to <math>\mathcal R^*_\mathcal{H}</math>. This trade-off leads to the concept of [[regularization (mathematics)|regularization]].<ref name = "Rosasco" />

==Other Settings==
In addition to the supervised learning setting, sample complexity is relevant to [[semi-supervised learning]] problems including [[active learning]],<ref name = "Balcan" /> where the algorithm can request labels for specifically chosen inputs in order to reduce the cost of obtaining many labels.
The concept of sample complexity also shows up in [[reinforcement learning]],<ref>{{citation |last = Kakade | first = Sham | title = On the Sample Complexity of Reinforcement Learning | place = University College London | publisher = Gatsby Computational Neuroscience Unit. | series = PhD Thesis | year = 2003 | url = http://www.ias.tu-darmstadt.de/uploads/Research/NIPS2006/SK.pdf}}</ref> [[online learning]], and unsupervised algorithms, e.g. for [[dictionary learning]].<ref>{{cite journal |last1 = Vainsencher | first1 = Daniel | last2 = Mannor | first2 = Shie | last3 = Bruckstein | first3 = Alfred | title = The Sample Complexity of Dictionary Learning | journal = Journal of Machine Learning Research | volume = 12 | pages = 3259–3281 | date = 2011 | url = http://www.jmlr.org/papers/volume12/vainsencher11a/vainsencher11a.pdf}}</ref>

==References==

Sample complexity: Difference between revisions