In [[machine learning]], '''local case-control sampling''' <ref name="LCC">{{cite journal|last1=Fithian|first1=William|last2=Hastie|first2=Trevor|title=Local case-control sampling: Efficient subsampling in imbalanced data sets|journal=The Annals of Statistics|date=2014|volume=42|issue=5|pages=1693–1724|arxiv=1306.3706}}</ref> is an [[algorithm]] used to reduce the complexity of training a [[logistic regression]] classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of an (unreliable) pilot estimate of the parameters. It then performs a single pass over the entire dataset using the pilot estimate to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or from training on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods, such as [[Logistic_regression#Case-control_sampling|case-control sampling]] and weighted case-control sampling.
== Imbalanced datasets ==
In [[Statistical classification|classification]], a dataset is a set of ''N'' data points <math> (x_i, y_i)_{i=1}^N </math>, where <math> x_i \in\mathbb R^d </math> is a feature vector and <math> y_i \in \{0,1\} </math> is a binary label. Intuitively, a dataset is imbalanced when certain important statistical patterns are rare. The lack of observations of certain patterns does not always imply their irrelevance. For example, in medical studies of rare diseases, the small number of infected patients (cases) conveys the most valuable information for diagnosis and treatment.
Formally, an imbalanced dataset exhibits one or more of the following properties:
* ''Marginal Imbalance''. A dataset is marginally imbalanced if one class is rare compared to the other class. In other words, <math> \mathbb{P}(Y=1) \approx 0 </math>.
* ''Conditional Imbalance''. A dataset is conditionally imbalanced when it is easy to predict the correct labels in most cases. For example, if <math> X \in \{0,1\} </math>, the dataset is conditionally imbalanced if <math> \mathbb{P}(Y=1 \mid X=0) \approx 0 </math> and <math> \mathbb{P}(Y=1 \mid X=1) \approx 1 </math> (a synthetic illustration follows below).
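A small synthetic sketch of both notions (not from the cited paper; the feature distribution, probabilities and sample size are arbitrary illustrative choices):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Single rare binary feature.
x = rng.binomial(1, 0.01, size=N)

# Conditional imbalance: the label is nearly deterministic given X,
# with P(Y=1 | X=0) = 0.001 and P(Y=1 | X=1) = 0.999.
p = np.where(x == 1, 0.999, 0.001)
y = rng.binomial(1, p)

# Marginal imbalance: P(Y=1) is close to 0 overall (roughly 0.011 here).
print("Empirical P(Y=1):", y.mean())
</syntaxhighlight>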
== Algorithm outline ==
In logistic regression, given the model <math> \theta = (\alpha, \beta) </math>, the prediction is made according to the predictive model <math> \mathbb{P}(Y=1 \mid X=x; \theta) = \tilde{p}_\theta(x) = \frac{\exp(\alpha + \beta^T x)}{1 + \exp(\alpha + \beta^T x)} </math>. The local case-control sampling algorithm assumes the availability of a pilot model <math> \tilde{\theta} = (\tilde{\alpha}, \tilde{\beta}) </math>. Given the pilot model, the algorithm performs a single pass over the entire dataset to select the subset of samples to include in training. For a sample <math> (x, y) </math>, define the acceptance probability <math> a(x,y) = |y - \tilde{p}_{\tilde{\theta}}(x)| </math>. The algorithm proceeds as follows:
# Generate independent <math> z_i \sim \text{Bernoulli}(a(x_i,y_i)) </math> for <math> i \in \{1, \ldots, N\} </math>.
# Fit a logistic regression model to the subsample <math> S = \{(x_i, y_i) : z_i =1 \} </math>, obtaining the unadjusted estimates <math> \hat{\theta}_S = (\hat{\alpha}_S, \hat{\beta}_S) </math>.
# The output model is <math> \hat{\theta} = (\hat{\alpha}, \hat{\beta}) </math>, where <math>\hat{\alpha} \leftarrow \hat{\alpha}_S + \tilde{\alpha} </math> and <math>\hat{\beta} \leftarrow \hat{\beta}_S + \tilde{\beta} </math>.
The algorithm can be understood as selecting samples that surprise the pilot model. Intuitively, these samples are closer to the [[decision boundary]] of the classifier and are thus more informative.
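A minimal sketch of the three steps above, assuming the pilot parameters are already available and using scikit-learn's <code>LogisticRegression</code> for the subsample fit (illustrative only; the function and variable names are not from the cited paper):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def lcc_fit(X, y, alpha_pilot, beta_pilot, rng=None):
    """Local case-control sampling with a given pilot model (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng()

    # Pilot predictions under the logistic model.
    p_pilot = 1.0 / (1.0 + np.exp(-(alpha_pilot + X @ beta_pilot)))

    # Step 1: accept each point with probability a(x, y) = |y - p_pilot(x)|.
    accept = rng.random(len(y)) < np.abs(y - p_pilot)

    # Step 2: ordinary logistic regression on the accepted subsample
    # (very large C so that regularisation is effectively disabled).
    model = LogisticRegression(C=1e8).fit(X[accept], y[accept])
    alpha_s, beta_s = model.intercept_[0], model.coef_[0]

    # Step 3: add the pilot parameters back to correct the sampling bias.
    return alpha_s + alpha_pilot, beta_s + beta_pilot
</syntaxhighlight>

Because the acceptance probability depends on the pilot, the subsample is deliberately biased toward points the pilot finds surprising; adding the pilot parameters back in the final step compensates for this bias.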
=== Obtaining the pilot model ===
In practice, for cases where a pilot model is naturally available, the algorithm can be applied directly to reduce the complexity of training. In cases where a natural pilot is nonexistent, an estimate obtained from a subsample selected through another sampling technique can be used instead. The original paper proposes weighted case-control sampling with half the assigned sampling budget: for example, if the goal is a subsample of 1000 points, a pilot can first be estimated from 500 points drawn by weighted case-control sampling, and the remaining 500 points then collected with local case-control sampling.<ref name="LCC" />
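As a rough illustration of this budget-splitting idea, the hypothetical helper below draws an equal number of cases and controls, fits a weighted logistic regression as the pilot, and hands the result to the earlier <code>lcc_fit</code> sketch (the weighting scheme is simplified and the names are not from the cited paper):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_cc_pilot(X, y, n_pilot, rng):
    """Crude weighted case-control pilot: equal numbers of cases and controls,
    re-weighted by the inverse of their sampling probabilities."""
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    k = n_pilot // 2
    idx = np.concatenate([
        rng.choice(cases, size=min(k, len(cases)), replace=False),
        rng.choice(controls, size=min(k, len(controls)), replace=False),
    ])
    # Inverse-probability weights so the weighted subsample mimics the full data.
    weights = np.where(y[idx] == 1, len(cases), len(controls)) / k
    model = LogisticRegression(C=1e8).fit(X[idx], y[idx], sample_weight=weights)
    return model.intercept_[0], model.coef_[0]

# Example: spend half of a 1000-point budget on the pilot, then run
# local case-control sampling with it.
# rng = np.random.default_rng(0)
# alpha_p, beta_p = weighted_cc_pilot(X, y, n_pilot=500, rng=rng)
# alpha_hat, beta_hat = lcc_fit(X, y, alpha_p, beta_p, rng=rng)
</syntaxhighlight>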
=== Larger or smaller sample size ===
It is possible to control the sample size by multiplying the acceptance probability by a constant <math> c </math>. For a larger sample size, pick <math> c>1 </math> and adjust the acceptance probability to <math> \min(ca(x_i, y_i), 1) </math>; for a smaller sample size, the same strategy applies with <math> c<1 </math>. When a precise number of samples is desired, a convenient alternative is to uniformly downsample from a larger subsample selected by local case-control sampling.
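In the sketches above, this adjustment only changes how the acceptance probability is computed; a hedged variant might be:

<syntaxhighlight lang="python">
import numpy as np

def acceptance_probability(y, p_pilot, c=1.0):
    """min(c * |y - p_pilot|, 1): c > 1 enlarges and c < 1 shrinks the
    expected subsample while keeping a valid probability."""
    return np.minimum(c * np.abs(y - p_pilot), 1.0)
</syntaxhighlight>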
== Properties ==
The algorithm has the following properties. When the pilot is [[Consistency (statistics)|consistent]], the estimates obtained from local case-control samples are consistent even under [[Statistical model specification|model misspecification]]. If the logistic regression model is correctly specified, the algorithm has exactly twice the asymptotic variance of logistic regression on the full dataset. For a larger sample size with <math> c>1 </math>, the factor of 2 improves to <math> 1 + 1/c </math>; for example, with <math> c = 4 </math> the asymptotic variance is only 1.25 times that of full-data logistic regression.
== References ==
{{Reflist}}
[[Category:Machine learning]]
[[Category:Sampling (statistics)]]