==Definition==
Let <math>X</math> be a space which we call the input space, and <math>Y</math> be a space which we call the output space, and let <math>Z</math> denote the product <math>X\times Y</math>. For example, in the setting of binary classification, <math>X</math> is typically a finite-dimensional vector space and <math>Y</math> is the set <math>\{-1,1\}</math>.
 
Fix a hypothesis space <math>\mathcal H</math> of functions <math>h\colon X\to Y</math>. A learning algorithm over <math>\mathcal H</math> is a computable map from <math>Z^*</math> to <math>\mathcal H</math>. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from <math>X</math> to <math>Y</math>. Typical learning algorithms include [[empirical risk minimization]], without or with [[Tikhonov regularization]].
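
As an illustrative sketch (not part of the formal definition), the following code treats a learning algorithm as a map from a finite sequence of samples to a hypothesis. The hypothesis space, the 0-1 loss, and the grid of thresholds are arbitrary choices made here for concreteness; the algorithm shown is empirical risk minimization.

<syntaxhighlight lang="python">
# Illustrative sketch: a learning algorithm as a computable map from Z^* to H.
# Here X = [0, 1], Y = {-1, 1}, H is a finite set of threshold classifiers, and
# the algorithm is empirical risk minimization under the 0-1 loss.

def zero_one_loss(y_pred, y_true):
    """An example loss on Y x Y: 0 if the prediction is correct, 1 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

# Finite hypothesis space: h_t(x) = 1 if x >= t, else -1, for thresholds t on a grid.
thresholds = [i / 10 for i in range(11)]
hypothesis_space = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]

def erm(samples):
    """Learning algorithm ALG: maps a finite sequence of (x, y) pairs to some h in H."""
    def empirical_risk(h):
        return sum(zero_one_loss(h(x), y) for x, y in samples) / len(samples)
    return min(hypothesis_space, key=empirical_risk)

# Example usage: erm([(0.2, -1), (0.7, 1), (0.9, 1)]) returns a threshold classifier in H.
</syntaxhighlight>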
 
Fix a loss function <math>Loss\colon Y\times Y\to\R_{\geq 0}</math>, for example, the square loss <math>Loss(y,y')=(y-y')^2</math>. For a given distribution <math>\rho</math> on <math>X\times Y</math>, the expected risk of a hypothesis (a function) <math>h\in\mathcal H</math> is
 
:<math>\mathcal E(h) :=\mathbb E_\rho[Loss(h(x),y)]=\int_{X\times Y} Loss(h(x),y)\,d\rho(x,y)</math>
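
As a hedged illustration (assuming the square loss above and sampling access to <math>\rho</math>, neither of which is required by the definition), the expected risk can be approximated by averaging the loss over independent draws from <math>\rho</math>; the helper names below are chosen here for the example and do not come from the definition.

<syntaxhighlight lang="python">
import random

def square_loss(y_pred, y_true):
    """The square loss Loss(y, y') = (y - y')^2 used as the example above."""
    return (y_pred - y_true) ** 2

def expected_risk(h, sample_from_rho, loss=square_loss, num_draws=100_000):
    """Monte Carlo estimate of E(h) = E_rho[Loss(h(x), y)].

    `sample_from_rho` is an assumed helper returning one pair (x, y) drawn from rho;
    the true expected risk is the integral over X x Y and is only approximated here.
    """
    total = 0.0
    for _ in range(num_draws):
        x, y = sample_from_rho()
        total += loss(h(x), y)
    return total / num_draws

# Example: a toy rho with x uniform on [0, 1] and y = x plus Gaussian noise.
def sample_from_toy_rho():
    x = random.random()
    return x, x + random.gauss(0.0, 0.1)

# expected_risk(lambda x: 0.5, sample_from_toy_rho) approximates the risk of the constant 0.5.
</syntaxhighlight>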
 
In our setting, we have <math>h=ALG(S_n)</math>, where <math>ALG</math> is a learning algorithm and <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> is a sequence of vectors which are all drawn independently from <math>\rho</math>. Define the optimal risk<math display="block">
\mathcal E^*_\mathcal{H} = \underset{h \in \mathcal H}{\inf}\mathcal E(h).
</math>Set <math>h_n=ALG(S_n)</math> for each <math>n</math>. Note that <math>h_n</math> is a random variable and depends on the random variable <math>S_n</math>, which is drawn from the distribution <math>\rho^n</math>. The algorithm <math>ALG</math> is called consistent if <math>\mathcal E(h_n)</math> probabilistically converges to <math>\mathcal E_\mathcal H^*</math>; in other words, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math> such that, for all <math>n \geq N</math>, we have<math display="block">
\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\epsilon]<\delta.
</math>The sample complexity of <math>ALG</math> is then the minimum <math>N</math> for which this holds, as a function of <math>\rho</math>, <math>\epsilon</math>, and <math>\delta</math>. We write the sample complexity as <math>N(\rho, \epsilon, \delta)</math> to emphasize that this value of <math>N</math> depends on <math>\rho</math>, <math>\epsilon</math>, and <math>\delta</math>. If <math>ALG</math> is not consistent, then we set <math>N(\rho,\epsilon,\delta) =\infty</math>. If there exists an algorithm for which <math>N(\rho,\epsilon,\delta)</math> is finite, then we say that the hypothesis space <math>\mathcal H</math> is '''learnable'''.
 
In words, the sample complexity <math>N(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm: given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs to sample <math>N(\rho,\epsilon,\delta)</math> data points to guarantee that the risk of the output function is within <math>\epsilon</math> of the best possible, with probability at least <math>1-\delta</math>.<ref name="Rosasco">{{citation |last = Rosasco |first = Lorenzo |title = Consistency, Learnability, and Regularization |series = Lecture Notes for MIT Course 9.520 |year = 2014}}</ref>
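
The quantity <math>N(\rho,\epsilon,\delta)</math> can be probed numerically when <math>\rho</math> can be simulated. The sketch below is illustrative only: it assumes helpers such as the <code>erm</code> and <code>expected_risk</code> functions sketched above, together with a known (or separately estimated) optimal risk <math>\mathcal E^*_\mathcal{H}</math>, and for each <math>n</math> it estimates the probability that <math>\mathcal E(h_n)-\mathcal E^*_\mathcal{H}\geq\epsilon</math>, stopping at the first <math>n</math> whose estimate falls below <math>\delta</math>.

<syntaxhighlight lang="python">
def failure_probability(alg, n, epsilon, optimal_risk, sample_from_rho,
                        risk, num_trials=200):
    """Estimate Pr_{rho^n}[ E(h_n) - E*_H >= epsilon ] by repeatedly drawing S_n ~ rho^n."""
    failures = 0
    for _ in range(num_trials):
        samples = [sample_from_rho() for _ in range(n)]  # S_n drawn i.i.d. from rho
        h_n = alg(samples)                               # h_n = ALG(S_n)
        if risk(h_n, sample_from_rho) - optimal_risk >= epsilon:
            failures += 1
    return failures / num_trials

def estimate_sample_complexity(alg, epsilon, delta, optimal_risk,
                               sample_from_rho, risk, n_max=10_000):
    """Smallest n (up to n_max) whose estimated failure probability falls below delta.

    This only approximates N(rho, epsilon, delta): the definition requires the bound
    to hold for every n >= N, which a finite simulation cannot certify.
    """
    for n in range(1, n_max + 1):
        if failure_probability(alg, n, epsilon, optimal_risk,
                               sample_from_rho, risk) < delta:
            return n
    return None  # no such n found up to n_max; the algorithm may not be consistent
</syntaxhighlight>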
 
In [[Probably approximately correct learning|probably approximately correct (PAC) learning]], one is concerned with whether the sample complexity is ''polynomial'', that is, whether <math>N(\rho,\epsilon,\delta)</math> is bounded by a polynomial in <math>1/\epsilon</math> and <math>1/\delta</math>. If <math>N(\rho,\epsilon,\delta)</math> is polynomial for some learning algorithm, then one says that the hypothesis space <math>\mathcal H</math> is '''PAC-learnable'''. Note that this is a stronger notion than being learnable.
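
For example (a standard bound from PAC learning theory, stated here only as an illustration and not derived in this article), if <math>\mathcal H</math> is finite, the loss is the 0-1 loss, and the problem is realizable (the optimal risk is zero), then empirical risk minimization satisfies
<math display="block">N(\rho,\epsilon,\delta)\;\leq\;\left\lceil \frac{1}{\epsilon}\left(\ln|\mathcal H| + \ln\frac{1}{\delta}\right)\right\rceil,</math>
which is polynomial in <math>1/\epsilon</math> and <math>1/\delta</math> (and independent of <math>\rho</math>), so in this setting every finite hypothesis space is PAC-learnable.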
 
==No Free Lunch Theorem==