==Definition==
Let <math>X</math> be a space which we call the input space, and <math>Y</math> be a space which we call the output space, and let <math>Z</math> denote the product <math>X\times Y</math>. For example, in the setting of binary classification, <math>X</math> is typically a finite-dimensional vector space and <math>Y</math> is the set <math>\{-1,1\}</math>.
 
Fix a hypothesis space <math>\mathcal H</math> of functions <math>h\colon X\to Y</math>. A learning algorithm over <math>\mathcal H</math> is a computable map from <math>Z^*</math> to <math>\mathcal H</math>. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from <math>X</math> to <math>Y</math>. Typical learning algorithms include [[empirical risk minimization]], without or with [[Tikhonov regularization]].
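
As an illustrative sketch (not part of the formal definition), the following code treats a learning algorithm as a map from a finite sequence of samples to a hypothesis. The hypothesis space, the 0-1 loss, and the grid of thresholds are arbitrary choices made here for concreteness; the algorithm shown is empirical risk minimization.

<syntaxhighlight lang="python">
# Illustrative sketch: a learning algorithm as a computable map from Z^* to H.
# Here X = [0, 1], Y = {-1, 1}, H is a finite set of threshold classifiers, and
# the algorithm is empirical risk minimization under the 0-1 loss.

def zero_one_loss(y_pred, y_true):
    """An example loss on Y x Y: 0 if the prediction is correct, 1 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

# Finite hypothesis space: h_t(x) = 1 if x >= t, else -1, for thresholds t on a grid.
thresholds = [i / 10 for i in range(11)]
hypothesis_space = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]

def erm(samples):
    """Learning algorithm ALG: maps a finite sequence of (x, y) pairs to some h in H."""
    def empirical_risk(h):
        return sum(zero_one_loss(h(x), y) for x, y in samples) / len(samples)
    return min(hypothesis_space, key=empirical_risk)

# Example usage: erm([(0.2, -1), (0.7, 1), (0.9, 1)]) returns a threshold classifier in H.
</syntaxhighlight>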
 
Fix a loss function <math>Loss\colon Y\times Y\to\R_{\geq 0}</math>, for example, the square loss <math>Loss(y,y')=(y-y')^2</math>. For a given distribution <math>\rho</math> on <math>X\times Y</math>, the expected risk of a hypothesis (a function) <math>h\in\mathcal H</math> is
 
:<math>\mathcal E(h) :=\mathbb E_\rho[Loss(h(x),y)]=\int_{X\times Y} Loss(h(x),y)\,d\rho(x,y)</math>
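
As a hedged illustration (assuming the square loss above and sampling access to <math>\rho</math>, neither of which is required by the definition), the expected risk can be approximated by averaging the loss over independent draws from <math>\rho</math>; the helper names below are chosen here for the example and do not come from the definition.

<syntaxhighlight lang="python">
import random

def square_loss(y_pred, y_true):
    """The square loss Loss(y, y') = (y - y')^2 used as the example above."""
    return (y_pred - y_true) ** 2

def expected_risk(h, sample_from_rho, loss=square_loss, num_draws=100_000):
    """Monte Carlo estimate of E(h) = E_rho[Loss(h(x), y)].

    `sample_from_rho` is an assumed helper returning one pair (x, y) drawn from rho;
    the true expected risk is the integral over X x Y and is only approximated here.
    """
    total = 0.0
    for _ in range(num_draws):
        x, y = sample_from_rho()
        total += loss(h(x), y)
    return total / num_draws

# Example: a toy rho with x uniform on [0, 1] and y = x plus Gaussian noise.
def sample_from_toy_rho():
    x = random.random()
    return x, x + random.gauss(0.0, 0.1)

# expected_risk(lambda x: 0.5, sample_from_toy_rho) approximates the risk of the constant 0.5.
</syntaxhighlight>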
 
In our setting, we have <math>h=ALG(S_n)</math>, where <math>ALG</math> is a learning algorithm and <math>S_n = ((x_1,y_1),\ldots,(x_n,y_n))\sim \rho^n</math> is a sequence of vectors which are all drawn independently from <math>\rho</math>. Define the optimal risk<math display="block">
\mathcal E^*_\mathcal{H} = \underset{h \in \mathcal H}{\inf}\mathcal E(h).
</math>Set <math>h_n=ALG(S_n)</math> for each <math>n</math>. Note that <math>h_n</math> is a random variable and depends on the random variable <math>S_n</math>, which is drawn from the distribution <math>\rho^n</math>. The algorithm <math>ALG</math> is called consistent if <math>\mathcal E(h_n)</math> probabilistically converges to <math>\mathcal E_\mathcal H^*</math>; in other words, for all <math>\epsilon, \delta > 0</math>, there exists a positive integer <math>N</math> such that, for all <math>n \geq N</math>, we have<math display="block">
\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_\mathcal{H}\geq\epsilon]<\delta.
</math>The sample complexity of <math>ALG</math> is then the minimum <math>N</math> for which this holds, as a function of <math>\rho</math>, <math>\epsilon</math>, and <math>\delta</math>. We write the sample complexity as <math>N(\rho, \epsilon, \delta)</math> to emphasize that this value of <math>N</math> depends on <math>\rho</math>, <math>\epsilon</math>, and <math>\delta</math>. If <math>ALG</math> is not consistent, then we set <math>N(\rho,\epsilon,\delta) =\infty</math>. If there exists an algorithm for which <math>N(\rho,\epsilon,\delta)</math> is finite, then we say that the hypothesis space <math>\mathcal H</math> is '''learnable'''.
 
In words, the sample complexity <math>N(\rho,\epsilon,\delta)</math> defines the rate of consistency of the algorithm: given a desired accuracy <math>\epsilon</math> and confidence <math>\delta</math>, one needs to sample <math>N(\rho,\epsilon,\delta)</math> data points to guarantee that the risk of the output function is within <math>\epsilon</math> of the best possible, with probability at least <math>1-\delta</math>.<ref name="Rosasco">{{citation |last = Rosasco |first = Lorenzo |title = Consistency, Learnability, and Regularization |series = Lecture Notes for MIT Course 9.520 |year = 2014}}</ref>
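
The quantity <math>N(\rho,\epsilon,\delta)</math> can be probed numerically when <math>\rho</math> can be simulated. The sketch below is illustrative only: it assumes helpers such as the <code>erm</code> and <code>expected_risk</code> functions sketched above, together with a known (or separately estimated) optimal risk <math>\mathcal E^*_\mathcal{H}</math>, and for each <math>n</math> it estimates the probability that <math>\mathcal E(h_n)-\mathcal E^*_\mathcal{H}\geq\epsilon</math>, stopping at the first <math>n</math> whose estimate falls below <math>\delta</math>.

<syntaxhighlight lang="python">
def failure_probability(alg, n, epsilon, optimal_risk, sample_from_rho,
                        risk, num_trials=200):
    """Estimate Pr_{rho^n}[ E(h_n) - E*_H >= epsilon ] by repeatedly drawing S_n ~ rho^n."""
    failures = 0
    for _ in range(num_trials):
        samples = [sample_from_rho() for _ in range(n)]  # S_n drawn i.i.d. from rho
        h_n = alg(samples)                               # h_n = ALG(S_n)
        if risk(h_n, sample_from_rho) - optimal_risk >= epsilon:
            failures += 1
    return failures / num_trials

def estimate_sample_complexity(alg, epsilon, delta, optimal_risk,
                               sample_from_rho, risk, n_max=10_000):
    """Smallest n (up to n_max) whose estimated failure probability falls below delta.

    This only approximates N(rho, epsilon, delta): the definition requires the bound
    to hold for every n >= N, which a finite simulation cannot certify.
    """
    for n in range(1, n_max + 1):
        if failure_probability(alg, n, epsilon, optimal_risk,
                               sample_from_rho, risk) < delta:
            return n
    return None  # no such n found up to n_max; the algorithm may not be consistent
</syntaxhighlight>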
 
In [[Probably approximately correct learning|probably approximately correct (PAC) learning]], one is concerned with whether the sample complexity is ''polynomial'', that is, whether <math>N(\rho,\epsilon,\delta)</math> is bounded by a polynomial in <math>1/\epsilon</math> and <math>1/\delta</math>. If <math>N(\rho,\epsilon,\delta)</math> is polynomial for some learning algorithm, then one says that the hypothesis space <math>\mathcal H</math> is '''PAC-learnable'''. Note that this is a stronger notion than being learnable.
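
For example (a standard bound from PAC learning theory, stated here only as an illustration and not derived in this article), if <math>\mathcal H</math> is finite, the loss is the 0-1 loss, and the problem is realizable (the optimal risk is zero), then empirical risk minimization satisfies
<math display="block">N(\rho,\epsilon,\delta)\;\leq\;\left\lceil \frac{1}{\epsilon}\left(\ln|\mathcal H| + \ln\frac{1}{\delta}\right)\right\rceil,</math>
which is polynomial in <math>1/\epsilon</math> and <math>1/\delta</math> (and independent of <math>\rho</math>), so in this setting every finite hypothesis space is PAC-learnable.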
 
==No Free Lunch Theorem==