{{Short description|Class of statistical models}}
{{Distinguish|general linear model}}
{{Regression bar}}
In [[statistics]], a '''generalized linear model''' ('''GLM''') is a flexible generalization of ordinary [[linear regression]]. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a ''link function'' and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
Generalized linear models were formulated by [[John Nelder]] and [[Robert Wedderburn (statistician)|Robert Wedderburn]] as a way of unifying various other statistical models, including [[linear regression]], [[logistic regression]] and [[Poisson regression]].<ref>{{cite journal | last1= Nelder | first1 = John |author-link = John Nelder | first2 = Robert |last2 = Wedderburn | s2cid = 14154576 |author-link2 = Robert Wedderburn (statistician) | title = Generalized Linear Models | year=1972 | journal = Journal of the Royal Statistical Society. Series A (General) | volume= 135 |issue=3 | pages=370–384 | doi= 10.2307/2344614 | publisher= Blackwell Publishing | jstor= 2344614 }}</ref> They proposed an [[iteratively reweighted least squares]] [[iterative method|method]] for [[maximum likelihood estimation]] of the model parameters.
==Intuition==
Ordinary linear regression predicts the [[expected value]] of a given unknown quantity (the ''response variable'', a [[random variable]]) as a [[linear combination]] of a set of observed values (''predictors''). This implies that a constant change in a predictor leads to a constant change in the response variable (i.e. a ''linear-response model''). This is appropriate when the response variable can vary, to a good approximation, indefinitely in either direction, or more generally for any quantity that only varies by a relatively small amount compared to the variation in the predictive variables, e.g. human heights.
However, these assumptions are inappropriate for some types of response variables. For example, in cases where the response variable is expected to be always positive and varying over a wide range, constant input changes lead to geometrically (i.e. exponentially) varying, rather than constantly varying, output changes. As an example, suppose a linear prediction model learns from some data (perhaps primarily drawn from large beaches) that a 10 degree temperature decrease would lead to 1,000 fewer people visiting the beach. This model is unlikely to generalize well over beaches of different sizes: for a beach that typically receives 50 visitors, the same 10 degree decrease would yield an impossible predicted attendance of −950.
Similarly, a model that predicts a probability of making a yes/no choice (a [[Bernoulli distribution|Bernoulli variable]]) is even less suitable as a linear-response model, since probabilities are bounded on both ends (they must be between 0 and 1). Imagine, for example, a model that predicts the likelihood of a given person going to the beach as a function of temperature. A reasonable model might predict, for example, that a change in 10 degrees makes a person two times more or less likely to go to the beach. But what does "twice as likely" mean in terms of a probability? It cannot literally mean to double the probability value (e.g. 50% becomes 100%, 75% becomes 150%, etc.). Rather, it is the ''[[odds ratio|odds]]'' that are doubling: from 2:1 odds, to 4:1 odds, to 8:1 odds, etc. Such a model is a ''log-odds or [[Logistic regression|logistic]] model''.
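The odds-doubling arithmetic can be made concrete with a minimal Python sketch (the helper names `odds` and `prob_from_odds` are our own, chosen for illustration):

```python
# Doubling the *odds* (not the probability): a probability p corresponds to
# odds p/(1-p); doubling those odds and converting back always stays in (0, 1).
def odds(p):
    return p / (1 - p)

def prob_from_odds(o):
    return o / (1 + o)

for p in (0.50, 0.75, 0.90):
    doubled = prob_from_odds(2 * odds(p))
    print(f"p = {p:.2f} -> doubled odds give p = {doubled:.4f}")
```

Note that 50% maps to 2/3 and 75% to 6/7, never exceeding 1, which is exactly why the model works on the odds scale.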
Generalized linear models cover all these situations by allowing for response variables that have arbitrary distributions (rather than simply [[normal distribution]]s), and for an arbitrary function of the response variable (the ''link function'') to vary linearly with the predictors (rather than assuming that the response itself must vary linearly). For example, the case above of predicted number of beach attendees would typically be modeled with a [[Poisson distribution]] and a log link, while the case of predicted probability of beach attendance would typically be modeled with a [[Bernoulli distribution]] (or [[binomial distribution]], depending on exactly how the problem is phrased) and a log-odds (or [[logit]]) link function.
==Overview==
In a generalized linear model (GLM), each outcome '''Y''' of the [[dependent variable]]s is assumed to be generated from a particular [[probability distribution|distribution]] in an [[exponential family]], a large class of [[probability distributions]] that includes the [[normal distribution|normal]], [[binomial distribution|binomial]], [[poisson distribution|Poisson]] and [[gamma distribution|gamma]] distributions, among others. The conditional mean '''''μ''''' of the distribution depends on the independent variables '''X''' through:
: <math>\operatorname{E}(\mathbf{Y} \mid \mathbf{X}) = \boldsymbol\mu = g^{-1}(\mathbf{X}\boldsymbol\beta),</math>
where E('''Y''' | '''X''') is the [[expected value]] of '''Y''' [[conditional expectation|conditional]] on '''X'''; '''Xβ''' is the ''linear predictor'', a linear combination of unknown parameters '''''β'''''; and ''g'' is the link function.
In this framework, the variance is typically a function, '''V''', of the mean:
:<math> \operatorname{Var}(\mathbf{Y} \mid \mathbf{X}) = \operatorname{V}(\boldsymbol\mu) = \operatorname{V}\!\left(g^{-1}(\mathbf{X}\boldsymbol\beta)\right).</math>
It is convenient if '''V''' follows from an exponential family of distributions, but it may simply be that the variance is a function of the predicted value.
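A few common mean-variance relationships can be sketched directly (a Python illustration; the dictionary below is our own construction, not a standard API):

```python
# Common variance functions V(mu) -- the variance expressed as a function of
# the mean -- showing how the mean-variance relationship differs by family.
variance_functions = {
    "normal": lambda mu: 1.0,               # constant variance
    "poisson": lambda mu: mu,               # variance equals the mean
    "bernoulli": lambda mu: mu * (1 - mu),  # largest at mu = 0.5
    "gamma": lambda mu: mu ** 2,            # variance grows with the mean squared
}
print({name: v(0.5) for name, v in variance_functions.items()})
```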
: 1. A particular distribution for modeling <math> Y </math> from among those which are considered exponential families of probability distributions,
: 2. A linear predictor <math>\eta = X \beta</math>, and
: 3. A link function <math>g</math> such that <math>\operatorname{E}(Y \mid X) = \mu = g^{-1}(\eta)</math>.
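These three components can be illustrated with a small Python sketch of a log-link prediction (as in Poisson regression); the coefficient values below are made up for illustration:

```python
import math

# A GLM prediction applies the inverse link to the linear predictor:
# E(Y | X) = g^{-1}(X beta).  Here g = ln, so g^{-1} = exp.
beta = [1.2, 0.05]   # intercept and one slope (hypothetical values)
x = [1.0, 20.0]      # design row: constant term and a predictor value

eta = sum(b * xi for b, xi in zip(beta, x))  # linear predictor eta = X beta
mu = math.exp(eta)                           # inverse log link: mu = exp(eta)
print(eta, mu)
```

Because the inverse log link is always positive, the predicted mean can never be negative, regardless of the value of the linear predictor.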
=== Probability distribution ===
: <math> f_Y(y \mid \theta, \tau) = h(y,\tau) \exp \left(\frac{b(\theta)T(y) - A(\theta)}{d(\tau)} \right). \,\!</math>
<math>\boldsymbol\theta</math> is related to the mean of the distribution. If <math>\mathbf{b}(\boldsymbol\theta)</math> is the identity function, then the distribution is said to be in [[canonical form]] (or ''natural form''). Note that any distribution can be converted to canonical form by rewriting <math>\boldsymbol\theta</math> as <math>\boldsymbol\theta'</math> and then applying the transformation <math>\boldsymbol\theta = \mathbf{b}(\boldsymbol\theta')</math>. It is always possible to express <math>A(\boldsymbol\theta)</math> in terms of the new parametrization, even if <math>\mathbf{b}(\boldsymbol\theta')</math> is not a [[one-to-one function]]; see comments in the page on [[exponential families]].
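As a concrete check of this form, the Poisson pmf can be written in the exponential-family notation above (a Python sketch; the identification <math>\theta = \ln\lambda</math>, <math>b(\theta) = \theta</math>, <math>T(y) = y</math>, <math>A(\theta) = e^\theta</math>, <math>h(y,\tau) = 1/y!</math>, <math>d(\tau) = 1</math> is the standard one):

```python
import math

# The Poisson pmf lambda^y e^{-lambda} / y! matches the exponential-family
# form h(y, tau) * exp((b(theta) T(y) - A(theta)) / d(tau)) under the
# identification noted above; check the two expressions agree numerically.
def poisson_pmf(y, lam):
    return lam ** y * math.exp(-lam) / math.factorial(y)

def exp_family_pmf(y, lam):
    theta = math.log(lam)                       # canonical parameter
    return (1 / math.factorial(y)) * math.exp(theta * y - math.exp(theta))

for y in range(6):
    assert abs(poisson_pmf(y, 3.5) - exp_family_pmf(y, 3.5)) < 1e-12
```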
If, in addition, <math>\mathbf{T}(\mathbf{y})</math> is the identity function and <math>\tau</math> is known, then <math>\boldsymbol\theta</math> is called the ''canonical parameter'' (or ''natural parameter'') and is related to the mean through
:<math> \boldsymbol\mu = \operatorname{E}(\mathbf{y}) = \nabla A(\boldsymbol\theta). </math>
For scalar <math>y</math> and <math>\theta</math>, this reduces to
:<math> \mu = \operatorname{E}(y) = A'(\theta). </math>
Under this scenario, the variance of the distribution can be shown to be<ref>{{harvnb|McCullagh|Nelder|1989}}, Chapter 2.</ref>
:<math>\operatorname{Var}(\mathbf{y}) = \nabla^2 A(\boldsymbol\theta) \, d(\tau). </math>
For scalar <math>y</math> and <math>\theta</math>, this reduces to
:<math>\operatorname{Var}(y) = A''(\theta) \, d(\tau). </math>
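These moment identities can be checked numerically for the Poisson case, where <math>A(\theta) = e^\theta</math> and both derivatives equal <math>\lambda</math> (a Python sketch using finite differences; the helper functions are our own):

```python
import math

# For a canonical exponential family, mu = A'(theta) and Var = A''(theta) d(tau).
# For the Poisson, A(theta) = e^theta with theta = ln(lambda) and d(tau) = 1,
# so mean and variance both equal lambda.
def A(theta):
    return math.exp(theta)

def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)              # central difference

def second_deriv(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2    # second difference

lam = 4.0
theta = math.log(lam)
mean = deriv(A, theta)          # A'(theta), should be close to lambda
var = second_deriv(A, theta)    # A''(theta), should also be close to lambda
print(mean, var)
```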
The link function provides the relationship between the linear predictor and the [[Expected value|mean]] of the distribution function. There are many commonly used link functions, and their choice is informed by several considerations. There is always a well-defined ''canonical'' link function which is derived from the exponential of the response's [[density function]]. However, in some cases it makes sense to try to match the [[Domain of a function|___domain]] of the link function to the [[range of a function|range]] of the distribution function's mean, or use a non-canonical link function for algorithmic purposes, for example [[Probit model#Gibbs sampling|Bayesian probit regression]].
When using a distribution function with a canonical parameter <math>\theta</math>, the canonical link function is the function that expresses <math>\theta</math> in terms of <math>\mu</math>, i.e. <math>\theta = b(\mu)</math>.
Following is a table of several exponential-family distributions in common use and the data they are typically used for, along with the canonical link functions and their inverses (sometimes referred to as the mean function, as done here).
{| class="wikitable"
|+ Common distributions with typical uses and canonical link functions
! Distribution !! Support of distribution !! Typical uses !! Link name !! Link function, <math>\mathbf{X}\boldsymbol{\beta}=g(\mu)\,\!</math> !! Mean function
|-
| [[normal distribution|Normal]]
| rowspan="2" |real: <math>(-\infty,+\infty)</math> || rowspan="2" |Linear-response data || rowspan="2" | Identity
| rowspan="2" |<math>\mathbf{X}\boldsymbol{\beta}=\mu\,\!</math> || rowspan="2" | <math>\mu=\mathbf{X}\boldsymbol{\beta}\,\!</math>
|-
| [[Laplace distribution|Laplace]]
|-
| [[exponential distribution|Exponential]]
| real: <math>(0,+\infty)</math> || Exponential-response data, scale parameters || Negative inverse || <math>\mathbf{X}\boldsymbol{\beta}=-\mu^{-1}\,\!</math> || <math>\mu=-(\mathbf{X}\boldsymbol{\beta})^{-1}\,\!</math>
|-
| rowspan=2| [[categorical distribution|Categorical]]
| integer: <math>[0,K)</math>|| rowspan=2| outcome of single ''K''-way occurrence
| rowspan="3" | [[Logit]] || rowspan="3" |<math>\mathbf{X}\boldsymbol{\beta}=\ln \left(\frac \mu {1-\mu}\right) \,\!</math> || rowspan="3" | <math>\mu=\frac{\exp(\mathbf{X}\boldsymbol{\beta})}{1 + \exp(\mathbf{X}\boldsymbol{\beta})}\,\!</math>
|-
| ''K''-vector of integer: <math>[0,1]</math>, where exactly one element in the vector has the value 1
|-
| [[multinomial distribution|Multinomial]]
| ''K''-vector of integer: <math>[0,N]</math> || count of occurrences of different types (1, ..., ''K'') out of ''N'' total ''K''-way occurrences
|}
In the cases of the exponential and gamma distributions, the ___domain of the canonical link function is not the same as the permitted range of the mean. In particular, the linear predictor may be positive, which would give an impossible negative mean. When maximizing the likelihood, precautions must be taken to avoid this. An alternative is to use a noncanonical link function.
In the case of the Bernoulli, binomial, categorical and multinomial distributions, the support of the distributions is not the same type of data as the parameter being predicted. In all of these cases, the predicted parameter is one or more probabilities, i.e. real numbers in the range <math>[0,1]</math>. The resulting model is known as ''[[logistic regression]]'' (or ''[[multinomial logistic regression]]'' in the case that ''K''-way rather than binary values are being predicted).
For the Bernoulli and binomial distributions, the parameter is a single probability, indicating the likelihood of occurrence of a single event. The Bernoulli still satisfies the basic condition of the generalized linear model in that, even though a single outcome will always be either 0 or 1, the ''[[expected value]]'' will nonetheless be a real-valued probability, i.e. the probability of occurrence of a "yes" (or 1) outcome. Similarly, in a binomial distribution, the expected value is ''Np'', i.e. the expected proportion of "yes" outcomes will be the probability to be predicted.
The most typical link function is the canonical [[logit]] link:
:<math>g(p) = \operatorname{logit} p = \ln \left( { p \over 1-p } \right).</math>
GLMs with this setup are [[logistic regression]] models (or ''logit models'').
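A quick numerical round trip through the logit link and its inverse, the logistic function, confirms they are inverses on <math>(0,1)</math> (a Python sketch):

```python
import math

# The canonical logit link g(p) = ln(p / (1 - p)) maps (0, 1) onto the whole
# real line; its inverse is the logistic (sigmoid) function.
def logit(p):
    return math.log(p / (1 - p))

def logistic(eta):
    return 1 / (1 + math.exp(-eta))

for p in (0.1, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12

print(logit(0.5))  # 0.0: even odds map to a linear predictor of zero
```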
where ''μ'' is a positive number denoting the expected number of events. If ''p'' represents the proportion of observations with at least one event, its complement
:<math>(1-p) = \Pr(Y=0) = \exp(-\mu),</math>
and then
:<math>-\ln(1 - p) = \mu.</math>
A linear model requires the response variable to take values over the entire real line. Since ''μ'' must be positive, we can enforce that by taking the logarithm, and letting log(''μ'') be a linear model. This produces the "cloglog" transformation
:<math>\log(-\log(1 - p)) = \log(\mu).</math>
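The identity behind the cloglog transformation can be checked directly (a Python sketch):

```python
import math

# With Pr(Y = 0) = exp(-mu), the proportion with at least one event is
# p = 1 - exp(-mu), and the complementary log-log transform recovers
# log(mu) exactly: log(-log(1 - p)) = log(mu).
mu = 2.5
p = 1 - math.exp(-mu)
assert abs(math.log(-math.log(1 - p)) - math.log(mu)) < 1e-12
print(p)
```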
===Multinomial regression===
The binomial case may be easily extended to allow for a [[multinomial distribution]] as the response (also, a generalized linear model for counts, with a constrained total). There are two ways in which this is usually done, depending on whether the response categories are ordered or unordered.
====Ordered response====
====Unordered response====
If the response variable is a [[Level of measurement#Nominal measurement|nominal measurement]], or the data do not satisfy the assumptions of an ordered model, one may fit a model of the following form:
:<math> g(\mu_m) = \eta_m = \beta_{m,0} + X_1 \beta_{m,1} + \cdots + X_p \beta_{m,p} \text{ where } \mu_m = \mathrm{P}(Y = m \mid Y \in \{1,m\} ). \,</math>
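Converting the linear predictors of such an unordered model into category probabilities amounts to normalizing <math>\exp(\eta_m)</math> against a reference category with <math>\eta_1 = 0</math> (a Python sketch; the <math>\eta</math> values below are made up for illustration):

```python
import math

# Multinomial logit: each category m has a linear predictor eta_m measured
# against a reference category whose eta is fixed at 0; probabilities follow
# by normalizing exp(eta_m) so they sum to one.
etas = [0.0, 1.0, -0.5]                    # reference category first
denom = sum(math.exp(e) for e in etas)
probs = [math.exp(e) / denom for e in etas]
print(probs)
```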
The standard GLM assumes that the observations are [[uncorrelated]]. Extensions have been developed to allow for [[correlation]] between observations, as occurs for example in [[longitudinal studies]] and clustered designs:
* '''[[Generalized estimating equation]]s''' (GEEs) allow for the correlation between observations without the use of an explicit probability model for the origin of the correlations, so there is no explicit [[likelihood]]. They are suitable when the [[random effects]] and their variances are not of inherent interest, as they allow for the correlation without explaining its origin. The focus is on estimating the average response over the population ("population-averaged" effects) rather than the regression parameters that would enable prediction of the effect of changing one or more components of '''X''' on a given individual. GEEs are usually used in conjunction with [[Huber–White standard errors]].<ref>{{cite journal
|title = Models for Longitudinal Data: A Generalized Estimating Equation Approach |first1 = Scott L. |last1 = Zeger |last2 = Liang |first2 = Kung-Yee |last3 = Albert |first3 = Paul S. |author-link1=Scott Zeger |author-link2=Kung-Yee Liang |journal = Biometrics |volume = 44 |year = 1988 |pages = 1049–1060 |issue = 4
|doi = 10.2307/2531734
|pmid = 3233245 }}</ref>
==See also==
* [[Generalized estimating equation]]
== References ==