Sample complexity

This is an old revision of this page, as edited by Erel Segal (talk | contribs) at 11:17, 23 June 2016 (Definition: - improve notation). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The sample complexity of a machine learning algorithm represents the number of training-samples that it needs to successfully learn a target function.

More precisely, the sample complexity is the number of training-samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.

There are two variants of sample complexity:

  • The weak variant fixes a particular input-output distribution;
  • The strong variant takes the worst-case sample complexity over all input-output distributions.

A natural question in statistical learning is, for a given hypothesis space, whether the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that an algorithm can approximately solve all possible learning problems on a particular input-output space no matter the distribution of data over that space. The No Free Lunch Theorem, discussed below, says that this is always impossible if the hypothesis space is not constrained. [1]

Definition

Let   be a space which we call the input space, and   be a space which we call the output space, and let   denote the product  . For example, in the setting of binary classification,   is typically a finite-dimensional vector space and   is the set  .

Fix a hypothesis space   of functions  . A learning algorithm over   is a computable map from   to  . In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from   to  . Typical learning algorithms include empirical risk minimization, without or with Tikhonov regularization.

Fix a loss function  , for example, the square loss  . For a given distribution   on  , the expected risk of a hypothesis (a function)   is

 

In our setting, we have   where   is a learning algorithm and   is a sequence of vectors which are all drawn independently from  . Define the optimal risk Set   for each  . Note that   is a random variable and depends on the random variable  , which is drawn from the distribution  . The algorithm   is called consistent if   probabilistically converges to  , in other words, for all ε, δ > 0, there exists a positive integer N such that for all nN, we have The sample complexity of   is then the minimum N for which this holds, as a function of ρ, ε, and δ. We write the sample complexity as   to emphasize that this value of N depends on ρ, ε, and δ. If   is not consistent, then we set  . If there exists an algorithm for which   is finite, then we say that the hypothesis space   is learnable.

In words, the sample complexity   defines the rate of consistency of the algorithm: given a desired accuracy ε and confidence δ, one needs to sample   data points to guarantee that the risk of the output function is within ε of the best possible, with probability at least 1 - δ.[2]

In probabilistically approximately correct (PAC) learning, one is concerned with whether the sample complexity is polynomial, that is, whether   is bounded by a polynomial in 1/ε and 1/δ. If   is polynomial for some learning algorithm, then one says that the hypothesis space   is PAC-learnable. Note that this is a stronger notion than being learnable.

No Free Lunch Theorem

One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm A such that, for all ε, δ > 0, there exists a positive integer N such that for all nN, we have where  , with   as above. The No Free Lunch Theorem says that without restrictions on the hypothesis space  , this is not the case, i.e., there always exist "bad" distributions for which the sample complexity is arbitrarily large.[1] Thus in order to make statements about the rate of convergence of the quantity one must either

  • constrain the space of probability distributions  , e.g. via a parametric approach, or
  • constrain the space of hypotheses  , as in distribution-free approaches.

The latter approach leads to concepts such as VC dimension and Rademacher complexity which control the complexity of the space  . A smaller hypothesis space introduces more bias into the inference process, meaning that   may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of regularization.[2]

It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space  :

  1.   is PAC-learnable.
  2. The VC dimension of   is finite.
  3.   is a uniform Glivenko-Cantelli class.

This gives a nice way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable.

An example of a PAC-learnable hypothesis space

Let X = Rd, Y = {-1, 1}, and let   be the space of affine functions on X, that is, functions of the form   for some  . This is the linear classification with offset learning problem. Now, note that four coplanar points in a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. Thus, the VC dimension of   is 3, in particular finite. It follows by the above characterization of PAC-learnable classes that   is PAC-learnable, and by extension, learnable.

Other Settings

In addition to the supervised learning setting, sample complexity is relevant to semi-supervised learning problems including active learning,[3] where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in reinforcement learning,[4] online learning, and unsupervised algorithms, e.g. for dictionary learning.[5]

References

  1. ^ a b Vapnik, Vladimir (1998), Statistical Learning Theory, New York: Wiley.
  2. ^ a b Rosasco, Lorenzo (2014), Consistency, Learnability, and Regularization, Lecture Notes for MIT Course 9.520.
  3. ^ Balcan, Maria-Florina (2010). "The true sample complexity of active learning". Machine Learning. 80 (2–3): 111–139. doi:10.1007/s10994-010-5174-y.
  4. ^ Kakade, Sham (2003), On the Sample Complexity of Reinforcement Learning (PDF), PhD Thesis, University College London: Gatsby Computational Neuroscience Unit.
  5. ^ Vainsencher, Daniel; Mannor, Shie; Bruckstein, Alfred (2011). "The Sample Complexity of Dictionary Learning" (PDF). Journal of Machine Learning Research. 12: 3259–3281.