Local case-control sampling

In [[machine learning]], '''local case-control sampling'''<ref name="LCC">{{cite journal|last1=Fithian|first1=William|last2=Hastie|first2=Trevor|title=Local case-control sampling: Efficient subsampling in imbalanced data sets|journal=The Annals of Statistics|date=2014|volume=42|issue=5|pages=1693–1724|doi=10.1214/14-aos1220|pmid=25492979|pmc=4258397|arxiv=1306.3706}}</ref> is an [[algorithm]] used to reduce the complexity of training a [[logistic regression]] classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of a (possibly unreliable) pilot estimate of the model parameters. Using this pilot estimate, it performs a single pass over the entire dataset to identify the most "surprising" samples, i.e. points whose observed labels are poorly predicted by the pilot model. In practice, the pilot may come from prior knowledge or from a model trained on a small uniform subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods, such as [[Logistic regression#Case-control sampling|case-control sampling]] and weighted case-control sampling.
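
A minimal Python sketch of the scheme is given below. It follows the sampling rule of the cited paper, in which each point is retained with probability equal to the absolute difference between its label and the pilot model's predicted probability, and the pilot parameters are added back to the refitted parameters afterwards. The function name and signature are illustrative only, and scikit-learn ≥ 1.2 is assumed for the unpenalized fit.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression


def local_case_control_sample(X, y, pilot_intercept, pilot_coef, rng=None):
    """Illustrative sketch: subsample (X, y) using a pilot logistic model, refit, correct.

    Each point (x, y) is kept with probability |y - p_tilde(x)|, where p_tilde is the
    pilot model's predicted probability, so the points the pilot finds most
    "surprising" are the most likely to be retained.
    """
    rng = np.random.default_rng(rng)
    p_tilde = 1.0 / (1.0 + np.exp(-(pilot_intercept + X @ pilot_coef)))
    keep = rng.random(len(y)) < np.abs(y - p_tilde)   # accept "surprising" points

    # Unpenalized logistic regression on the (much smaller) accepted subsample.
    model = LogisticRegression(penalty=None, max_iter=1000).fit(X[keep], y[keep])

    # The subsampled population's log-odds are shifted by the pilot's log-odds,
    # so the pilot is added back to obtain the final estimate.
    return model.intercept_[0] + pilot_intercept, model.coef_[0] + pilot_coef, keep
</syntaxhighlight>

For example, with a heavily imbalanced synthetic dataset the pilot can be fit on a small uniform subsample, and the final model is then fit only on the locally sampled points:

<syntaxhighlight lang="python">
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = (rng.random(100_000) < 1.0 / (1.0 + np.exp(-(-4.0 + X @ beta)))).astype(int)  # rare positive class

pilot_idx = rng.choice(100_000, size=5_000, replace=False)  # uniform pilot subsample
pilot = LogisticRegression(penalty=None, max_iter=1000).fit(X[pilot_idx], y[pilot_idx])

b0, b, keep = local_case_control_sample(X, y, pilot.intercept_[0], pilot.coef_[0], rng=1)
print(keep.mean())  # fraction of the full dataset used for the final fit
</syntaxhighlight>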
 
== Imbalanced datasets ==