Conditional logistic regression

Revision as of 14:46, 4 November 2016 by Felixbalazard; latest revision as of 19:34, 17 July 2025 by Citation bot (removed URL that duplicated identifier). 27 intermediate revisions by 26 users not shown.
{{Short description|Statistical technique}}
'''Conditional logistic regression''' is an extension of [[logistic regression]] that allows one to account for [[stratification (clinical trials)|stratification]] and [[Matching (statistics)|matching]]. Its main field of application is [[observational studies]], in particular [[epidemiology]]. It was devised in 1978 by [[Norman Breslow]], [[Nick Day (statistician)|Nicholas Day]], [[Katherine Halvorsen]], [[Ross L. Prentice]] and C. Sabai.<ref name="pmid727199">{{cite journal | vauthors = Breslow NE, Day NE, Halvorsen KT, Prentice RL, Sabai C | title = Estimation of multiple relative risk functions in matched case-control studies | journal = Am J Epidemiol | year = 1978 | volume = 108 | issue = 4 | pages = 299–307 | pmid = 727199 | doi = 10.1093/oxfordjournals.aje.a112623}}</ref> It is the most flexible and general procedure for matched data.

==Background==
Observational studies use [[stratification (clinical trials)|stratification]] or [[Matching (statistics)|matching]] as a way to control for [[confounding]].

[[Logistic regression]] can account for stratification by having a different constant term for each stratum. Let <math>Y_{i\ell}\in\{0,1\}</math> denote the label (e.g. case status) of the <math>\ell</math>th observation of the <math>i</math>th stratum and <math>X_{i\ell}\in\mathbb{R}^p</math> the values of the corresponding predictors. We then take the likelihood of one observation to be
:<math> \mathbb{P}(Y_{i\ell}=1|X_{i\ell})=\frac{\exp(\alpha_i + \boldsymbol\beta^\top X_{i\ell})}{1+\exp(\alpha_i + \boldsymbol\beta^\top X_{i\ell})}</math>
where <math>\alpha_i</math> is the constant term for the <math>i</math>th stratum. The parameters in this model can be estimated using [[maximum likelihood estimation]].

For example, consider estimating the impact of exercise on the risk of cardiovascular disease. If people who exercise more are younger, have better access to healthcare, or have other differences that improve their health, then a logistic regression of cardiovascular disease incidence on minutes spent exercising may overestimate the impact of exercise on health. To address this, we can group people based on demographic characteristics like age and zip code of their home residence. Each stratum <math>i</math> is a group of people with similar demographics. The vector <math>X_{i\ell}</math> contains information about the variable of interest (in this case, minutes spent exercising) for individual <math>\ell</math> in stratum <math>i</math>.
The value <math>\alpha_i</math> is the impact of demographics on cardiovascular disease incidence <math>Y_{i\ell}</math>, which is assumed to be the same for all people in the stratum. The vector <math>\boldsymbol\beta</math> (which, in this example, is just a scalar) is the quantity of interest: the impact of exercise on cardiovascular disease. We can also include control variables within <math>X_{i\ell}</math>.

==Motivation==
Logistic regression as described above works satisfactorily when the number of strata is small relative to the amount of data. If we hold the number of strata fixed and increase the amount of data, estimates of the model parameters (<math>\alpha_i</math> for each stratum and the vector <math>\boldsymbol\beta</math>) converge to their true values.

Pathological behavior, however, occurs when we have many small strata, because the number of parameters grows with the amount of data. For example, if each stratum contains two datapoints, then the number of parameters in a model with <math>N</math> datapoints is <math>N/2 + p</math>, so the number of parameters is of the same order as the number of datapoints. In these settings, as we increase the amount of data, the asymptotic results on which [[maximum likelihood estimation]] is based are not valid and the resulting estimates are biased. In fact, it can be shown that the unconditional analysis of matched pair data results in an estimate of the [[odds ratio]] which is the square of the correct, conditional one.<ref>{{cite book |last1=Breslow |first1=N.E. |last2=Day |first2=N.E. |date=1980 |title=Statistical Methods in Cancer Research. Volume 1-The Analysis of Case-Control Studies |url=http://www.iarc.fr/en/publications/pdfs-online/stat/sp32/ |location=Lyon, France |publisher=IARC |pages=249–251 |access-date=2016-11-04 |archive-url=https://web.archive.org/web/20161226114802/http://www.iarc.fr/en/publications/pdfs-online/stat/sp32/ |archive-date=2016-12-26 |url-status=dead }}</ref> Conditional logistic regression fixes this issue.

In addition to tests based on logistic regression, several other tests existed before conditional logistic regression for matched data as shown in [[#Related tests|related tests]]. However, they did not allow for the analysis of continuous predictors with arbitrary stratum size. All of those procedures also lack the flexibility of conditional logistic regression and in particular the possibility to control for covariates.
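The squared odds ratio for matched pairs can be checked numerically in a simple special case. The following is a minimal sketch, assuming one binary exposure, 1:1 matched pairs, and made-up counts of discordant pairs (<code>n10</code> pairs where only the case is exposed, <code>n01</code> pairs where only the control is exposed). The function names are hypothetical and not taken from any library. Concordant pairs contribute only a constant to either likelihood, so they are omitted. The "conditional" likelihood here is the one obtained by conditioning on exactly one case per pair.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z))).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def argmax_concave(f, lo, hi, iters=200):
    # Ternary search for the maximizer of a concave function on [lo, hi].
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2


def pair_loglik(alpha, beta, x_case, x_control):
    # Unconditional contribution of one pair: the case has Y=1, the control Y=0.
    return (log_sigmoid(alpha + beta * x_case)
            + log_sigmoid(-(alpha + beta * x_control)))


def unconditional_profile(beta, n10, n01):
    # Each pair has its own intercept alpha_i; maximize it out pair by pair.
    def best(x_case, x_control):
        a = argmax_concave(
            lambda al: pair_loglik(al, beta, x_case, x_control), -20, 20)
        return pair_loglik(a, beta, x_case, x_control)
    return n10 * best(1, 0) + n01 * best(0, 1)


def conditional_loglik(beta, n10, n01):
    # Pair likelihood after conditioning on one case per pair (no intercepts).
    return n10 * log_sigmoid(beta) + n01 * log_sigmoid(-beta)


n10, n01 = 30, 10  # assumed illustrative counts, not real data
beta_cond = argmax_concave(lambda b: conditional_loglik(b, n10, n01), -10, 10)
beta_uncond = argmax_concave(lambda b: unconditional_profile(b, n10, n01), -10, 10)
or_cond = math.exp(beta_cond)      # approximately n10 / n01
or_uncond = math.exp(beta_uncond)  # approximately (n10 / n01) ** 2
```

With these counts the conditional odds ratio estimate is close to 3 and the unconditional one close to 9, which matches the squaring described above. Ternary search is used only to keep the sketch dependency-free. It relies on both log likelihoods being concave.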
==Conditional likelihood==
Conditional logistic regression uses a conditional likelihood approach that deals with the above pathological behavior by conditioning on the number of cases in each stratum. This eliminates the need to estimate the stratum parameters <math>\alpha_i</math>.

When the strata are pairs, where the first observation is a case and the second is a control, this can be seen as follows:
:<math>
\begin{align}
& \mathbb{P}(Y_{i1}=1,Y_{i2}=0|X_{i1},X_{i2},Y_{i1}+Y_{i2}=1) \\
& =\frac{\mathbb{P}(Y_{i1}=1|X_{i1}) \mathbb{P}(Y_{i2}=0|X_{i2})}{\mathbb{P}(Y_{i1}=1|X_{i1}) \mathbb{P}(Y_{i2}=0|X_{i2})+\mathbb{P}(Y_{i1}=0|X_{i1}) \mathbb{P}(Y_{i2}=1|X_{i2})}\\[6pt]
& =\frac{\frac{\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i1})}{1+\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i1})}\times\frac{1}{1+\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i2})}}{\frac{\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i1})}{1+\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i1})}\times\frac{1}{1+\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i2})}+\frac{1}{1+\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i1})}\times\frac{\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i2})}{1+\exp(\alpha_i+\boldsymbol{\beta}^\top X_{i2})}}\\[6pt]
& =\frac{\exp(\boldsymbol{\beta}^\top X_{i1})}{\exp(\boldsymbol{\beta}^\top X_{i1})+\exp(\boldsymbol{\beta}^\top X_{i2})}.
\end{align}
</math>
In the last step the common denominators and the factor <math>\exp(\alpha_i)</math> cancel, so the conditional probability depends only on <math>\boldsymbol\beta</math>.

With similar computations, the conditional likelihood of a stratum of size <math>m</math>, with the first <math>k</math> observations being the cases, is
:<math>
\mathbb{P}(Y_{ij}=1\text{ for }j\leq k,Y_{ij}=0\text{ for } k<j\leq m|X_{i1},...,X_{im},\sum_{j=1}^m Y_{ij}=k)=\frac{\exp(\sum_{j=1}^k \boldsymbol{\beta}^\top X_{ij})}{\sum_{J\in \mathcal{C}_{k}^{m}} \exp(\sum_{j\in J} \boldsymbol{\beta}^\top X_{ij})},
</math>
where <math>\mathcal{C}_{k}^{m}</math> is the set of all subsets of size <math>k</math> of the set <math>\{1,...,m\}</math>.
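The general formula can be evaluated directly by enumerating the subsets in <math>\mathcal{C}_{k}^{m}</math>. The following is a minimal sketch under that brute-force assumption. The function name <code>stratum_conditional_loglik</code> and the example values are hypothetical, chosen only for illustration, and the enumeration is practical only for small strata.

```python
import math
from itertools import combinations


def stratum_conditional_loglik(beta, X, case_idx):
    """Conditional log likelihood of one stratum.

    beta: list of p coefficients.
    X: list of m predictor vectors, each of length p.
    case_idx: indices (into X) of the k cases in the stratum.
    """
    # Linear predictor beta^T x for every observation in the stratum.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    k = len(case_idx)
    # Numerator (in logs): sum of the linear predictors of the observed cases.
    log_num = sum(eta[j] for j in case_idx)
    # Denominator: sum over all size-k subsets J of exp(sum_{j in J} eta_j).
    denom = sum(math.exp(sum(eta[j] for j in J))
                for J in combinations(range(len(X)), k))
    return log_num - math.log(denom)


# Matched pair (m = 2, k = 1): the result should agree with the pair formula.
beta = [0.7]
pair = [[1.5], [0.2]]
ll_pair = stratum_conditional_loglik(beta, pair, case_idx=[0])
closed_form = math.log(
    math.exp(0.7 * 1.5) / (math.exp(0.7 * 1.5) + math.exp(0.7 * 0.2)))

# With beta = 0 every subset is equally likely: 1 / C(4, 2) = 1 / 6.
ll_null = stratum_conditional_loglik(
    [0.0], [[1.0], [2.0], [3.0], [4.0]], case_idx=[0, 1])
```

No stratum intercept appears anywhere in the function, which reflects the cancellation of <math>\alpha_i</math> shown above.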
The full conditional log likelihood is then simply the sum of the log likelihoods for each stratum. The estimator is then defined as the <math>\boldsymbol\beta</math> that maximizes the conditional log likelihood.

==Implementation==
Conditional logistic regression is available in R as the function <code>clogit</code> in the <code>survival</code> package. It is in the <code>survival</code> package because the log likelihood of a conditional logistic model is the same as the log likelihood of a Cox model with a particular data structure.<ref>{{cite web |url=https://stat.ethz.ch/R-manual/R-devel/library/survival/html/clogit.html |title=R documentation Conditional logistic regression |last1=Lumley |first1=Thomas |access-date=November 3, 2016}}</ref>

It is also available in Python through the <code>statsmodels</code> package starting with version 0.14.<ref>{{cite web |url=https://www.statsmodels.org/dev/generated/statsmodels.discrete.conditional_models.ConditionalLogit.html |title=statsmodels.discrete.conditional_models.ConditionalLogit |access-date=March 25, 2023}}</ref>

==Related tests==
* A [[paired difference test]] can test the association between a binary outcome and a continuous predictor while taking pairing into account.
* A [[Cochran-Mantel-Haenszel test]] can test the association between a binary outcome and a binary predictor while taking into account stratification with arbitrary stratum size. When its conditions of application are met, it is identical to the conditional logistic regression [[score test]].<ref>{{cite journal | author = Day, N. E., Byar, D. P. | title = Testing hypotheses in case-control studies-equivalence of Mantel-Haenszel statistics and logit score tests | journal = Biometrics | date = 1979 | volume = 35 | issue = 3 | pages = 623–630 | doi = 10.2307/2530253 | jstor = 2530253 | pmid = 497345 }}</ref>

==Notes==
{{reflist}}

[[Category:Logistic regression]]