In [[machine learning]], '''local case-control sampling'''<ref name="LCC">{{cite journal|last1=Fithian|first1=William|last2=Hastie|first2=Trevor|title=Local case-control sampling: Efficient subsampling in imbalanced data sets|journal=The Annals of Statistics|date=2014|volume=42|issue=5|pages=1693–1724|doi=10.1214/14-aos1220|pmid=25492979|pmc=4258397|arxiv=1306.3706}}</ref> is an [[algorithm]] used to reduce the complexity of training a [[logistic regression]] classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of a (possibly unreliable) pilot estimate of the model parameters. It then performs a single pass over the entire dataset using the pilot estimate to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or from a model trained on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods, such as [[Logistic regression#Case-control sampling|case-control sampling]] and weighted case-control sampling.
 
== Imbalanced datasets ==
A dataset is imbalanced when one of the two classes is far rarer than the other, as is common in applications such as fraud detection or rare-disease screening. In such settings, most samples from the majority class carry little information about the decision boundary, which is what makes subsampling attractive.
 
== Algorithm outline ==
In logistic regression, given the model <math> \theta = (\alpha, \beta) </math>, the prediction is made according to <math> \mathbb{P}(Y=1\mid X=x; \theta) = \tilde{p}_{\theta}(x) = \frac{\exp(\alpha+\beta^T x)}{1+\exp(\alpha+\beta^T x)} </math>. The local case-control sampling algorithm assumes the availability of a pilot model <math>\tilde{\theta} = (\tilde{\alpha}, \tilde{\beta}) </math>. Given the pilot model, the algorithm performs a single pass over the entire dataset to select the subset of samples to include in training the logistic regression model. For a sample <math> (x,y) </math>, define the acceptance probability as <math> a(x,y) = |y-\tilde{p}_{\tilde{\theta}}(x)| </math>. The algorithm proceeds as follows:
 
# Generate independent <math> z_i \sim \text{Bernoulli}(a(x_i,y_i)) </math> for <math> i \in \{1, \ldots, N\} </math>.
# Fit a logistic regression model to the subsample <math> S = \{(x_i, y_i) : z_i = 1\} </math>, obtaining the estimate <math> \hat{\theta}_S = (\hat{\alpha}_S, \hat{\beta}_S) </math>.
# The output model is <math> \hat{\theta} = (\hat{\alpha}, \hat{\beta}) </math>, where <math>\hat{\alpha} \leftarrow \hat{\alpha}_S + \tilde{\alpha} </math> and <math>\hat{\beta} \leftarrow \hat{\beta}_S + \tilde{\beta} </math>.
 
The algorithm can be understood as selecting samples that surprise the pilot model. Intuitively, these samples are closer to the [[Decision boundary|decision boundary]] of the classifier and are thus more informative.
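
The following is a minimal Python sketch of the procedure above, assuming a pilot model <math>(\tilde{\alpha}, \tilde{\beta})</math> is already available. The function name is illustrative, and the unpenalized scikit-learn fit is one possible choice rather than something prescribed by the original paper; <code>penalty=None</code> requires scikit-learn 1.2 or later.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_case_control_fit(X, y, pilot_alpha, pilot_beta, seed=0):
    """Illustrative sketch of local case-control sampling."""
    rng = np.random.default_rng(seed)
    # Pilot probabilities: p~(x) = exp(a + b'x) / (1 + exp(a + b'x))
    p_tilde = 1.0 / (1.0 + np.exp(-(pilot_alpha + X @ pilot_beta)))
    # Accept each point with probability a(x, y) = |y - p~(x)|,
    # i.e. draw z_i ~ Bernoulli(a(x_i, y_i)) and keep points with z_i = 1.
    accept = rng.random(len(y)) < np.abs(y - p_tilde)
    # Fit an unpenalized logistic regression on the accepted subsample.
    clf = LogisticRegression(penalty=None).fit(X[accept], y[accept])
    alpha_s, beta_s = clf.intercept_[0], clf.coef_[0]
    # Post-hoc correction: add the pilot parameters back.
    return alpha_s + pilot_alpha, beta_s + pilot_beta
</syntaxhighlight>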
 
=== Obtaining the pilot model ===
In practice, for cases where a pilot model is naturally available, the algorithm can be applied directly to reduce the complexity of training. In cases where a natural pilot is nonexistent, an estimate based on a subsample selected through another sampling technique can be used instead. In the original paper describing the algorithm, the authors propose using weighted case-control sampling with half of the assigned sampling budget. For example, if the objective is to use a subsample with size <math> N=1000 </math>, first estimate a pilot model <math>\tilde{\theta} </math> using <math> N_h = 500 </math> samples from weighted case-control sampling, then collect another <math> N_h = 500 </math> samples using local case-control sampling.
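
Continuing the sketch above, a hypothetical helper for this two-stage scheme might look as follows. The equal case/control split and the inverse-probability weights are the standard ingredients of weighted case-control sampling, but the function name and interface are assumptions of this sketch.

<syntaxhighlight lang="python">
def weighted_case_control_pilot(X, y, n_pilot, seed=0):
    """Illustrative pilot estimate from weighted case-control sampling."""
    rng = np.random.default_rng(seed)
    cases, controls = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    k = n_pilot // 2
    idx = np.concatenate([rng.choice(cases, k, replace=False),
                          rng.choice(controls, k, replace=False)])
    # Inverse-probability weights undo the enforced 50/50 class balance.
    w = np.where(y[idx] == 1, len(cases) / k, len(controls) / k)
    clf = LogisticRegression(penalty=None).fit(X[idx], y[idx], sample_weight=w)
    return clf.intercept_[0], clf.coef_[0]
</syntaxhighlight>

With a total budget of <math> N = 1000 </math>, one would call <code>weighted_case_control_pilot(X, y, 500)</code> and pass the resulting parameters as the pilot to the local case-control pass sketched earlier.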
 
=== Larger or smaller sample size ===
It is possible to control the sample size by multiplying the acceptance probability by a constant <math> c </math>. For a larger sample size, pick <math> c>1 </math> and adjust the acceptance probability to <math> \min(ca(x_i, y_i), 1) </math>; for a smaller sample size, pick <math> c<1 </math> and apply the same adjustment. When an exact sample size is required, a convenient alternative is to uniformly downsample from a larger subsample selected by local case-control sampling.
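
In terms of the earlier sketch (reusing <code>p_tilde</code> and <code>rng</code> from above), the adjustment is a one-line change; the <math> \min(\cdot, 1) </math> clipping keeps the acceptance value a valid probability:

<syntaxhighlight lang="python">
c = 4.0  # c > 1 enlarges the expected subsample; c < 1 shrinks it
accept = rng.random(len(y)) < np.minimum(c * np.abs(y - p_tilde), 1.0)
</syntaxhighlight>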
 
== Properties ==
The algorithm has the following properties. When the pilot is [[Consistency (statistics)|consistent]], the estimate obtained from the samples of local case-control sampling is consistent, even under [[Statistical model specification|model misspecification]]. If the model is correctly specified, then the algorithm has exactly twice the asymptotic variance of logistic regression on the full dataset. For a larger sample size with <math> c>1 </math>, the factor 2 is improved to <math> 1+\frac{1}{c} </math>.
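
Writing <math> \Sigma_{\text{full}} </math> for the asymptotic variance of the maximum-likelihood estimate on the full dataset (a symbol introduced here only for illustration), the statement above reads <math> \Sigma_{\text{LCC}} = \left(1+\tfrac{1}{c}\right)\Sigma_{\text{full}} </math> for <math> c \ge 1 </math>: taking <math> c=1 </math> recovers the factor of 2, while letting <math> c \to \infty </math> approaches the efficiency of using the entire dataset.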
 
== References ==
{{Reflist}}
<ref name="LCC">{{cite journal|last1=Fithian|first1=William|last2=Hastie|first2=Trevor
| title=Local case-control sampling: Efficient subsampling in imbalanced data sets
| journal=The Annals of Statistics
| date=2014
| volume=42
| issue=5
| page=1693–1724
| ref=http://arxiv.org/abs/1306.3706}}</ref>
}}
 
[[Category:Machine learning]]
[[Category:Logistic regression]]
[[Category:Regression analysis]]