Local case-control sampling: Difference between revisions

Yobot (talk | contribs)
m WP:CHECKWIKI error fixes (#64 + others) using AWB (11371)
m top: parameter misuse;
 
(7 intermediate revisions by 6 users not shown)
Line 1:
In [[machine learning]], '''local case-control sampling'''<ref name="LCC">{{cite journal|last1=Fithian|first1=William|last2=Hastie|first2=Trevor|title=Local case-control sampling: Efficient subsampling in imbalanced data sets|journal=The Annals of Statistics|date=2014|volume=42|issue=5|pages=1693–1724|doi=10.1214/14-aos1220|pmid=25492979|pmc=4258397|arxiv=1306.3706}}</ref> is an [[algorithm]] used to reduce the complexity of training a [[logistic regression]] classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of an (unreliable) pilot estimate of the parameters. It then performs a single pass over the entire dataset, using the pilot estimate to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or from training on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods, such as [[Logistic regression#Case-control sampling|case-control sampling]] and weighted case-control sampling.
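The single pass described above can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this text: following the cited paper as understood here, a sample is kept with probability <math>|y - \tilde{p}(x)|</math>, where <math>\tilde{p}</math> is the pilot's predicted probability, and the pilot parameters are added back to the coefficients fitted on the subsample. The function names (<code>fit_logistic</code>, <code>local_case_control</code>), the synthetic data, and the plain Newton solver are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson logistic regression; X must already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible (illustrative safeguard).
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta


def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: keep 'surprising' rows, fit, then add the pilot back."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)        # assumed acceptance rule |y - p_pilot(x)|
    keep = rng.random(len(y)) < accept_prob  # independent accept/reject per row
    theta_sub = fit_logistic(X[keep], y[keep])
    return theta_sub + theta_pilot, keep     # assumed correction: add pilot estimate


# Synthetic imbalanced data (assumption: intercept -4, slope 1.5).
rng = np.random.default_rng(0)
n = 100_000
theta_true = np.array([-4.0, 1.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot obtained by training on a small uniform subsample.
pilot_idx = rng.choice(n, size=5_000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

theta_hat, keep = local_case_control(X, y, theta_pilot, rng)
print("kept", int(keep.sum()), "of", n, "rows; estimate:", theta_hat)
```

In this sketch the rows the pilot already predicts well are mostly discarded, so the final fit runs on a small fraction of the data.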
 
== Imbalanced datasets ==
Line 24:
 
== Properties ==
The algorithm has the following properties. When the pilot is [[Consistency (statistics)|consistent]], the estimates obtained from the local case-control subsample are consistent even under [[Statistical model specification|model misspecification]]. If the model is correct, then the algorithm has exactly twice the asymptotic variance of logistic regression on the full data set. For a larger subsample, obtained with <math> c>1 </math>, the factor 2 improves to <math> 1+\frac{1}{c} </math>.
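As a small worked example of the factor <math> 1+\frac{1}{c} </math> stated above, the following sketch evaluates it for a few values of <math>c</math>. The helper name <code>variance_factor</code> is hypothetical; it only restates the arithmetic and says nothing about how <math>c</math> enters the sampling step.

```python
def variance_factor(c):
    """Asymptotic variance relative to full-data logistic regression, 1 + 1/c."""
    if c < 1:
        raise ValueError("the stated factor applies for c >= 1")
    return 1.0 + 1.0 / c


factors = {c: variance_factor(c) for c in (1, 2, 4, 10)}
print(factors)  # {1: 2.0, 2: 1.5, 4: 1.25, 10: 1.1}
```

The factor starts at 2 for <math>c=1</math> and approaches 1, the full-data variance, as <math>c</math> grows.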
 
== References ==
{{Reflist|colwidth=100em}}
 
[[Category:Machine learning]]
[[Category:Logistic regression]]
[[Category:Regression analysis]]