Sample complexity

==Mathematical Setup==
Given a set of samples <math>S_n = \{(x_1,y_1),\ldots,(x_n,y_n)\}</math> drawn i.i.d. according to a distribution <math>\rho</math> on a product space <math>\mathcal X \times \mathcal Y</math> of inputs and outputs, a supervised learning algorithm chooses a function <math>f:\mathcal X \to \mathcal Y</math> from some hypothesis class <math>\mathcal H</math>. A desirable property of the algorithm is that it chooses functions with small expected prediction error with respect to <math>\rho</math> and some loss function <math>V:\mathcal Y \times \mathcal Y \to \mathbb R_+</math>. Specifically, it is desirable to have a consistent algorithm, or an algorithm that generates functions whose expected losses, or [[empirical risk minimization|risks]],
 
<math display="block">
\mathcal R(f) = \mathbb E_\rho[V(f(x),y)],
</math>

converge to the best possible risk

<math display="block">
\mathcal R^*_\mathcal{H} = \underset{f \in \mathcal H}{\inf}\mathcal R(f).
</math>

Formally, let <math>f_n</math> be the functions generated by an algorithm as the number of data points <math>n</math> grows. The algorithm is consistent if
 
<math display="block">
\underset{n \to \infty}{\lim}\mathbb P_{S_n}(\mathcal R(f_n) - \mathcal R^*_\mathcal{H} > \epsilon) = 0
</math>
 
for all <math>\epsilon > 0</math>, where <math>\mathbb P_{S_n}</math> denotes the i.i.d. sampling measure <math>\rho^n</math>.
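Consistency can be observed numerically. The following sketch, in Python, uses a hypothetical toy setup not taken from the article: inputs are uniform on <math>[0,1]</math>, labels follow a threshold at 0.5 with 10% label noise, and <math>\mathcal H</math> is a finite class of threshold classifiers. Under this distribution the classifier's risk has a closed form, so the excess risk <math>\mathcal R(f_n) - \mathcal R^*_\mathcal{H}</math> of empirical risk minimization can be checked directly as <math>n</math> grows.

```python
import random

def erm_threshold(n, thresholds, noise=0.1, seed=0):
    """Draw n i.i.d. samples from the toy distribution rho and return
    the empirical-risk-minimizing threshold classifier from the class."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)            # clean label from the true threshold
        if rng.random() < noise:    # flip the label with probability `noise`
            y = 1 - y
        data.append((x, y))

    def emp_risk(t):
        # empirical 0-1 loss of the classifier h_t(x) = 1[x > t]
        return sum(int(x > t) != y for x, y in data) / n

    return min(thresholds, key=emp_risk)

def true_risk(t, noise=0.1):
    # closed-form expected 0-1 loss of h_t under this particular rho
    return noise + (1 - 2 * noise) * abs(t - 0.5)

thresholds = [i / 10 for i in range(10)]  # finite class; contains the optimum 0.5
t_hat = erm_threshold(20000, thresholds)
excess = true_risk(t_hat) - true_risk(0.5)  # R(f_n) - R*_H, small for large n
```

With 20,000 samples the empirical risks concentrate well within the 0.08 risk gap between neighboring thresholds, so ERM reliably selects the optimal classifier and the excess risk is near zero, illustrating the limit in the consistency definition above.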
 
The consistency property is nice, but it says nothing about how fast the expected risks converge. Since in practice one always deals with finite data, it is important to answer the question of how large a sample is needed to achieve a risk that is close, in the <math>\epsilon</math> sense, to the best possible for the function class. The notion of sample complexity answers this question. The sample complexity of a learning algorithm is a function <math>n(\rho,\epsilon,\delta)</math> such that for all <math>n \ge n(\rho,\epsilon,\delta)</math>,
 
<math display="block">
\mathbb P_{S_n}(\mathcal R(f_n) - \mathcal R^*_\mathcal{H} > \epsilon) < \delta.
</math>
 
In words, the sample complexity <math>n(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm. Given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs at most <math>n(\rho,\epsilon,\delta)</math> data points to guarantee that the expected risk of the output function is within <math>\epsilon</math> of the best possible expected risk, with probability at least <math>1-\delta</math>.<ref name = "Rosasco">{{citation |last = Rosasco | first = Lorenzo | title = Consistency, Learnability, and Regularization | series = Lecture Notes for MIT Course 9.520. | year = 2014 }}</ref>
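For a finite hypothesis class with loss values in <math>[0,1]</math>, a standard Hoeffding-plus-union-bound argument (not derived in this section) gives an explicit, distribution-free sample-complexity bound for empirical risk minimization: <math>n \ge \tfrac{2}{\epsilon^2}\ln\tfrac{2|\mathcal H|}{\delta}</math> suffices for <math>\mathbb P_{S_n}(\mathcal R(f_n) - \mathcal R^*_\mathcal{H} > \epsilon) < \delta</math>. A minimal sketch of evaluating that bound, with an illustrative function name:

```python
import math

def sample_complexity(epsilon, delta, num_hypotheses):
    """Hoeffding-based sample-size bound for ERM over a finite class
    with losses in [0, 1]: once n >= (2 / eps^2) * ln(2|H| / delta),
    the excess risk exceeds epsilon with probability less than delta."""
    return math.ceil((2.0 / epsilon ** 2)
                     * math.log(2.0 * num_hypotheses / delta))

# e.g. 1000 hypotheses, accuracy 0.05, confidence 0.95
n = sample_complexity(0.05, 0.05, 1000)
```

Note the logarithmic dependence on both <math>|\mathcal H|</math> and <math>1/\delta</math>, versus the quadratic dependence on <math>1/\epsilon</math>; halving <math>\epsilon</math> quadruples the required sample size.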
 
==No Free Lunch Theorem (Machine Learning)==
Optimistically one could hope for a stronger notion of sample complexity that is independent of the distribution <math>\rho</math> on the input and output spaces. However, it has been shown that without restrictions on the hypothesis class <math>\mathcal H</math>, there always exist "bad" distributions for which the sample complexity is arbitrarily large.<ref>{{citation |last = Vapnik | first = Vladimir | title = Statistical Learning Theory | place = New York | publisher = Wiley. | year = 1998}}</ref> Thus in order to make statements about the rate of convergence of the quantity
 
<math display="block">
\underset{\rho}{\sup}\ \mathbb P_{S_n}(\mathcal R(f_n) - \mathcal R^*_\mathcal{H} > \epsilon),
</math>