Revision as of 17:44, 14 June 2015 by Michael Hardy (→Algorithm outline: The whole point of the "exp" notation is to make it unnecessary to use a superscript in cases where it's typographically inconvenient.)
Revision as of 19:47, 12 August 2015 by Yobot (m WP:CHECKWIKI error fixes (#64 + others) using AWB (11371))
Line 1:
In [[machine learning]], '''local case-control sampling''' <ref name="LCC">{{cite journal|last1=Fithian|first1=William|last2=Hastie|first2=Trevor|title=Local case-control sampling: Efficient subsampling in imbalanced data sets|journal=The Annals of Statistics|date=2014|volume=42|issue=5|~~page~~pages=1693–1724|ref=http://arxiv.org/abs/1306.3706}}</ref> is an [[algorithm]] used to reduce the complexity of training a [[logistic regression]] classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of a (possibly unreliable) pilot estimate of the parameters. It then performs a single pass over the entire dataset, using the pilot estimate to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or from training on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods, such as [[~~Logistic_regression~~Logistic regression#Case-~~control_sampling~~control sampling|case control sampling]] and weighted case control sampling.

== Imbalanced datasets ==

Line 15:
# The output model is <math> \hat{\theta} = (\hat{\alpha}, \hat{\beta}) </math>, where <math>\hat{\alpha} \leftarrow \hat{\alpha}_S + \tilde{\alpha} </math> and <math>\hat{\beta} \leftarrow \hat{\beta}_S + \tilde{\beta} </math>.

The algorithm can be understood as selecting samples that surprise the pilot model. Intuitively, these samples are closer to the [[~~Decision boundary|~~decision boundary]] of the classifier and are thus more informative.

=== Obtaining the pilot model ===

In practice, for cases where a pilot model is naturally available, the algorithm can be applied directly to reduce the complexity of training.
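The case where a pilot model is already available can be sketched as follows. This is an illustrative sketch, not code from the cited paper. It assumes the acceptance probability <math>|y - \tilde{p}(x)|</math> described by Fithian and Hastie, where <math>\tilde{p}(x)</math> is the pilot's predicted probability, and it applies the correction step <math>\hat{\theta} = \hat{\theta}_S + \tilde{\theta}</math> given above. The helper names (<code>fit_logistic</code>, <code>local_case_control</code>), the synthetic data, and the pilot values are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Fit logistic regression by Newton's method; returns (alpha, beta)."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ theta)
        grad = Xb.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible (illustrative choice).
        hess = (Xb * (p * (1.0 - p))[:, None]).T @ Xb + ridge * np.eye(Xb.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta[0], theta[1:]


def local_case_control(X, y, alpha_pilot, beta_pilot, rng):
    """Single pass: keep each sample with probability |y - p_pilot(x)|."""
    p_pilot = sigmoid(alpha_pilot + X @ beta_pilot)
    accept_prob = np.abs(y - p_pilot)  # large when the pilot is "surprised"
    keep = rng.random(len(y)) < accept_prob
    # Fit on the subsample, then add the pilot back (the correction step).
    alpha_s, beta_s = fit_logistic(X[keep], y[keep])
    return alpha_s + alpha_pilot, beta_s + beta_pilot, keep


# Synthetic imbalanced data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 100_000
alpha_true, beta_true = -4.0, np.array([1.5, -1.0])
X = rng.standard_normal((n, 2))
y = (rng.random(n) < sigmoid(alpha_true + X @ beta_true)).astype(float)

# A deliberately imperfect pilot, standing in for prior knowledge.
alpha_pilot, beta_pilot = -3.5, np.array([1.2, -0.8])

alpha_hat, beta_hat, keep = local_case_control(X, y, alpha_pilot, beta_pilot, rng)
print("kept", int(keep.sum()), "of", n, "samples")
print("alpha_hat:", alpha_hat, "beta_hat:", beta_hat)
```

Under a correctly specified logistic model, the log-odds within the accepted subsample equal the true log-odds minus the pilot's log-odds, which is why adding the pilot parameters back recovers an estimate of the full model.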
In cases where a natural pilot is nonexistent, an estimate using a subsample selected through another sampling technique can be used instead. In the original paper describing the algorithm, the authors propose to use weighted case-control sampling with half the assigned sampling budget. For example, if the objective is to use a subsample with size <math> N=1000 </math>, first estimate a model <math>\tilde{\theta} </math> using <math> N_h = 500 </math> samples from weighted case control sampling, then collect another <math> N_h = 500 </math> samples using local case-control sampling.

=== Larger or smaller sample size ===

Line 34:
| volume=42 | issue=5 | ~~page~~pages=1693–1724 | ref=http://arxiv.org/abs/1306.3706}}</ref> }}

Local case-control sampling: Difference between revisions