{{short description|Data visualization}}
[[File:Michelsonmorley-boxplot.svg|thumb|upright=1.
In [[descriptive statistics]], a '''box plot''' or '''boxplot''' is a method for
In addition to the box on a box plot, there can be lines (called ''whiskers'') extending from the box, indicating variability outside the upper and lower quartiles; the plot is therefore also called a '''box-and-whisker plot''' or '''box-and-whisker diagram'''. [[Outlier]]s that differ significantly from the rest of the dataset<ref>{{Cite journal|last=Grubbs|first=Frank E.|date=February 1969|title=Procedures for Detecting Outlying Observations in Samples|url=http://dx.doi.org/10.1080/00401706.1969.10490657|journal=Technometrics|volume=11|issue=1|pages=1–21|doi=10.1080/00401706.1969.10490657|issn=0040-1706|url-access=subscription}}</ref> may be plotted as individual points beyond the whiskers. Box plots are [[non-parametric]]: they display variation in samples of a [[statistical population]] without making any assumptions about the underlying [[probability distribution|statistical distribution]]<ref>{{Cite book|last=Boddy|first=Richard|url=http://worldcat.org/oclc/940679163|title=Statistical Methods in Practice : for Scientists and Technologists.|date=2009|publisher=John Wiley & Sons|isbn=978-0-470-74664-6|oclc=940679163}}</ref> (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length).

== History ==
The range-bar method was first introduced by [[Mary Eleanor Spear]] in her book "Charting Statistics" in 1952<ref>{{Cite book|title=Charting Statistics|last=Spear|first=Mary Eleanor|publisher=McGraw Hill|year=1952|pages=166}}</ref> and again in her book "Practical Charting Techniques" in 1969.<ref>{{Cite book|title=Practical charting techniques|last=Spear, Mary Eleanor.|date=1969|publisher=McGraw-Hill|isbn=0070600104|location=New York|oclc=924909765}}</ref> The box-and-whisker plot was first introduced in 1970 by [[John Tukey]], who later published on the subject in his book "Exploratory Data Analysis" in 1977.<ref name=":0">{{cite web |first1=Hadley |last1=Wickham |first2=Lisa |last2=Stryjewski |url=https://vita.had.co.nz/papers/boxplots.pdf |title=40 years of boxplots |access-date=December 24, 2020}}</ref>

==Elements==
[[File:Box-Plot mit Min-Max Abstand.png|thumb|
[[File:Box-Plot mit Interquartilsabstand.png|thumb|
A boxplot is a standardized way of displaying the dataset based on the [[five-number summary]]: the minimum, the maximum, the sample median, and the first and third quartiles.
* '''[[Median]] (''Q''<sub>2</sub> or 50th percentile)''': the middle value in the data set
* '''[[First quartile]] (''Q''<sub>1</sub> or 25th percentile)''': also known as the ''lower quartile'' ''q''<sub>''n''</sub>(0.25), it is the median of the lower half of the dataset.
* '''[[Third quartile]] (''Q''<sub>3</sub> or 75th percentile)''': also known as the ''upper quartile'' ''q''<sub>''n''</sub>(0.75), it is the median of the upper half of the dataset.<ref>{{cite journal |last1=Holmes |first1=Alexander |last2=Illowsky |first2=Barbara |last3=Dean |first3=Susan |title=Introductory Business Statistics |website=OpenStax |date=31 March 2015 |url=https://opentextbc.ca/introbusinessstatopenstax/chapter/measures-of-the-location-of-the-data/ |access-date=29 April 2020 |archive-date=27 July 2020 |archive-url=https://web.archive.org/web/20200727025431/https://opentextbc.ca/introbusinessstatopenstax/chapter/measures-of-the-location-of-the-data/ |url-status=dead }}</ref>
Besides the minimum and maximum values, another element commonly used to construct a box plot is the interquartile range (IQR), defined as follows:
* '''[[Interquartile range]] (IQR)'''
:: <math>\text{IQR} = Q_3 - Q_1 = q_n(0.75) - q_n(0.25)</math>
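The quartile definitions and the IQR formula above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name <code>five_number_summary</code> is hypothetical, and leaving the middle value out of both halves when the number of observations is odd is an assumed convention (other conventions exist). The data are the hourly temperatures from the example without outliers in this article.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule.

    Assumes at least two observations. When the count is odd, the middle
    value is left out of both halves (an assumed convention).
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]              # lower half of the ordered data
    upper = xs[half + n % 2:]      # upper half of the ordered data
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hourly temperatures (°F) from the example without outliers.
temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

minimum, q1, q2, q3, maximum = five_number_summary(temps)
iqr = q3 - q1                      # IQR = Q3 - Q1
print(minimum, q1, q2, q3, maximum, iqr)
```

For these data the sketch gives quartiles of 66, 70 and 75 and an IQR of 9, matching the values used in the worked example.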
A box-plot usually includes two parts, a box and a set of whiskers.
===Box===
The box is drawn from ''Q''<sub>1</sub> to ''Q''<sub>3</sub> with a horizontal line drawn inside it to denote the median. Some box plots include an additional character to represent the mean of the data.<ref name="frigge hoaglin iglewicz2">{{Cite journal|last1=Frigge|first1=Michael|last2=Hoaglin|first2=David C.|last3=Iglewicz|first3=Boris|date=February 1989|title=Some Implementations of the Boxplot|journal=[[The American Statistician]]|volume=43|issue=1|pages=50–54|doi=10.2307/2685173|jstor=2685173}}</ref><ref>{{cite journal|last1=Marmolejo-Ramos|first1=F.|last2=Tian|first2=S.|date=2010|title=The shifting boxplot. A boxplot based on essential summary statistics around the mean|journal=International Journal of Psychological Research|volume=3|issue=1|pages=37–46|doi=10.21500/20112084.823|doi-access=free|hdl=10819/6492|hdl-access=free}}</ref>
===Whiskers===
The whiskers must end at an observed data point, but can be defined in various ways. In the most straightforward method, the boundary of the lower whisker is the minimum value of the data set, and the boundary of the upper whisker is the maximum value of the data set. Because of this variability, it is appropriate to describe the convention that is being used for the whiskers and outliers in the caption of the box-plot.
Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile ('''''Q''<sub>3</sub>'''), a distance of 1.5 times the IQR is measured out and a whisker is drawn ''up to'' the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile ('''''Q''<sub>1</sub>''') and a whisker is drawn ''down to'' the lowest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as '''outliers'''.<ref>{{Cite book |title=A Modern Introduction to Probability and Statistics |url=https://archive.org/details/modernintroducti00dekk_722 |url-access=limited |last=Dekking |first=F.M. |publisher=Springer |year=2005 |isbn=1-85233-896-2 |pages=[https://archive.org/details/modernintroducti00dekk_722/page/n240 234]–238 }}</ref> The outliers can be plotted on the box-plot as a dot, a small circle, a star, ''etc.'' (see example below).
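The 1.5 IQR convention described above can be sketched as follows. This is an illustrative sketch only: the function name <code>whiskers_and_outliers</code> and its parameter <code>k</code> are hypothetical, and the quartiles are passed in rather than computed, since several quartile conventions exist. The data are the temperatures from the example with outliers in this article.

```python
def whiskers_and_outliers(data, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers) for the k*IQR rule."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr     # distance measured out below Q1
    upper_fence = q3 + k * iqr     # distance measured out above Q3
    # Whiskers must end at observed data points that lie within the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    # Every observation beyond a fence is reported as an outlier.
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return min(inside), max(inside), outliers

# Temperatures (°F) from the example with outliers; Q1 = 66 and Q3 = 75.
temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

low, high, outliers = whiskers_and_outliers(temps, q1=66, q3=75)
print(low, high, outliers)
```

The fences themselves (52.5 and 88.5 here) are not drawn; only the observed values nearest to them inside the fences become the whisker ends, which is why the two whiskers can differ in length.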
[[File:Box Plot Picture.png|thumb|A box plot representing data]]
In other conventions, the ends of the whiskers can represent different quantities, such as:
* One [[standard deviation]] above and below the mean of the data set
* The 9th percentile and the 91st percentile of the data set
* The 2nd percentile and the 98th percentile of the data set
Rarely, a box plot can be plotted without the whiskers. This
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to depict the [[seven-number summary]]. If the data are [[Normal distribution|normally distributed]], the locations of the seven marks on the box plot will be equally spaced. On some box plots, a cross-hatch is placed before the end of each whisker.
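The spacing claim for normally distributed data can be checked numerically. The sketch below is an illustration under the assumption of a standard normal distribution; it uses <code>statistics.NormalDist</code> from the Python standard library (Python 3.8 or later) to locate the seven marks and compares the gaps between neighbouring marks, which turn out to be approximately rather than exactly equal.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary: 2%, 9%, 25%, 50%, 75%, 91%, 98%.
percentiles = [0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98]

# Location of each mark for a standard normal distribution (mean 0, sd 1).
marks = [NormalDist().inv_cdf(p) for p in percentiles]

# Distance between each pair of neighbouring marks.
gaps = [b - a for a, b in zip(marks, marks[1:])]

for p, m in zip(percentiles, marks):
    print(f"{p:.0%}: {m:+.3f}")
print("gaps:", [round(g, 3) for g in gaps])
```

All six gaps are close to two thirds of a standard deviation, which is what makes these otherwise unusual percentiles convenient.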
==Variations==
[[File:Fourboxplots.svg|thumb|
Since the mathematician [[John W. Tukey]] first popularized this type of visual data display in 1969, several variations on the classical box plot have been developed, and the two most commonly found variations are the variable
'''Variable
'''Notched box''' plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide of the significance of the difference of medians; if the notches of two boxes do not overlap, this will provide evidence of a statistically significant difference between the medians.<ref name="mcgill tukey larsen" /> The height of the notches is proportional to the interquartile range (IQR) of the sample and is inversely proportional to the square root of the size of the sample. However, there is
One convention for obtaining the boundaries of these notches is to use a distance of <math alt="±1.58×IQR/sqrt(n)">\pm \frac{1.58 \text{ IQR}}{\sqrt n}</math> around the median.<ref name="Rboxplotstats">{{Cite web | title = R: Box Plot Statistics | work = R manual | url = http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/boxplot.stats.html | access-date = 26 June 2011}}</ref>
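The notch convention above can be written out directly. This is a small sketch with hypothetical names (<code>notch_bounds</code>, <code>factor</code>); it applies the ±1.58 IQR/√''n'' distance to the hourly temperature example used later in this article, where the median is 70, the IQR is 9 and ''n'' = 24.

```python
import math

def notch_bounds(median, iqr, n, factor=1.58):
    """Return (lower, upper) notch limits: median -/+ factor * IQR / sqrt(n)."""
    half_width = factor * iqr / math.sqrt(n)
    return median - half_width, median + half_width

# Values taken from the hourly temperature example: median 70, IQR 9, n = 24.
lower, upper = notch_bounds(median=70, iqr=9, n=24)
print(round(lower, 2), round(upper, 2))
```

Because the half-width shrinks with the square root of the sample size, quadrupling the number of observations halves the notch for the same IQR.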
=== Example without outliers ===
[[File:No Outlier.png|thumb|
A series of hourly temperatures was measured throughout the day in degrees Fahrenheit. The recorded values are listed in order as follows (°F): 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.
A box plot of the data set can be generated by first calculating five relevant values of this data set: minimum, maximum, median ('''''Q''<sub>2</sub>'''), first quartile ('''''Q''<sub>1</sub>'''), and third quartile ('''''Q''<sub>3</sub>''').
The minimum is the smallest number of the data set. In this case, the minimum recorded day temperature is 57 °F.
The maximum is the largest number of the data set. In this case, the maximum recorded day temperature is 81 °F.
The median is the "middle" number of the ordered data set. This means that half of the elements lie at or below the median and half lie at or above it. The median of this ordered data set is 70 °F.
The first quartile value ('''''Q''<sub>1</sub>''' '''or 25th percentile)''' is the number that marks one quarter of the ordered data set. In other words, roughly 25% of the elements are less than the first quartile and roughly 75% are greater than it. The first quartile value can be determined by finding the "middle" number between the minimum and the median, that is, the median of the lower half of the ordered data. For the hourly temperatures, the "middle" number found between 57 °F and 70 °F is 66 °F.
The third quartile value ('''''Q''<sub>3</sub>''' '''or 75th percentile)''' is the number that marks three quarters of the ordered data set. In other words, roughly 75% of the elements are less than the third quartile and roughly 25% are greater than it. The third quartile value can be obtained by finding the "middle" number between the median and the maximum, that is, the median of the upper half of the ordered data. For the hourly temperatures, the "middle" number between 70 °F and 81 °F is 75 °F.
The interquartile range, or IQR, can be calculated by subtracting the first quartile value ('''''Q''<sub>1</sub>''') from the third quartile value ('''''Q''<sub>3</sub>'''):
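: <math>\text{IQR}=Q_3-Q_1=75^\circ F-66^\circ F=9^\circ F.</math>
Hence, 1.5 IQR is 13.5 °F. Measuring this distance out above the third quartile and below the first quartile gives:
: <math>Q_3+1.5\text{ IQR}=75^\circ F+13.5^\circ F=88.5^\circ F.</math>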
: <math>Q_1-1.5\text{ IQR}=66^\circ F-13.5^\circ F=52.5^\circ F.</math>
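The upper whisker boundary of the box plot is the largest data value that is within 1.5 IQR above the third quartile. Here, 1.5 IQR above the third quartile is 88.5 °F and the maximum is 81 °F. Therefore, the upper whisker is drawn at the value of the maximum, which is 81 °F.
Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile. Here, 1.5 IQR below the first quartile is 52.5 °F and the minimum is 57 °F. Therefore, the lower whisker is drawn at the value of the minimum, which is 57 °F.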
=== Example with outliers ===
[[File:Boxplot with outlier.png|thumb|
Above is an example without outliers. Here is a
The ordered set for the recorded temperatures is (°F): 52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89.
In this example, only the first and the last number are changed. The median, third quartile, and first quartile remain the same.
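In this case, the maximum value in this data set is 89 °F, and 1.5 IQR above the third quartile is 88.5 °F. The maximum is greater than 88.5 °F, so it is an outlier. Therefore, the upper whisker is drawn at the greatest value that is smaller than 1.5 IQR above the third quartile, which is 79 °F.
Similarly, the minimum value in this data set is 52 °F, and 1.5 IQR below the first quartile is 52.5 °F. The minimum is smaller than 52.5 °F, so it is also an outlier. Therefore, the lower whisker is drawn at the smallest value that is greater than 1.5 IQR below the first quartile, which is 57 °F.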
=== In the case of large datasets ===
For a data set containing a large number of data points, the quantiles needed for a box plot can be obtained as follows:
==== General equation to compute empirical quantiles ====
Using the above example that has 24 data points (''n'' = 24), one can calculate the median, first and third quartile either mathematically or visually.
'''Median'''
: <math> \begin{align} q_n(0.5) & = x_{(12)} + (0.5\cdot25-12)\cdot(x_{(13)}-x_{(12)}) \\[5pt] & = 70+(0.5\cdot25-12)\cdot(70-70) = 70^\circ F
\end{align}
</math>
'''First quartile'''
: <math> \begin{align} q_n(0.25) & = x_{(6)} + (0.25\cdot25-6)\cdot(x_{(7)}-x_{(6)}) \\[5pt] & = 66 +(0.25\cdot25 - 6)\cdot(66-66) = 66^\circ F
\end{align}
</math>
'''Third quartile'''
: <math> \begin{align} q_n(0.75) & = x_{(18)} + (0.75\cdot25-18)\cdot(x_{(19)}-x_{(18)}) \\[5pt] & =75 + (0.75\cdot25-18)\cdot(75-75) = 75^\circ F
\end{align}
</math>
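The three calculations above share one pattern: with ''k'' taken as the integer part of ''p''(''n'' + 1) and ''α'' = ''p''(''n'' + 1) − ''k'', the quantile is ''x''<sub>(''k'')</sub> + ''α''(''x''<sub>(''k''+1)</sub> − ''x''<sub>(''k'')</sub>). The sketch below implements that pattern as inferred from the worked equations. The function name <code>empirical_quantile</code> is hypothetical, and returning the smallest or largest value when ''k'' falls outside 1 to ''n'' − 1 is an added assumption, not something stated in the text.

```python
import math

def empirical_quantile(sorted_data, p):
    """Quantile q_n(p) = x_(k) + alpha * (x_(k+1) - x_(k)).

    k is the integer part of p*(n+1) and alpha = p*(n+1) - k, with
    1-based order statistics x_(1) <= ... <= x_(n).
    """
    n = len(sorted_data)
    position = p * (n + 1)
    k = math.floor(position)
    alpha = position - k
    # Assumption: clamp to the extremes when x_(k) or x_(k+1) does not exist.
    if k < 1:
        return sorted_data[0]
    if k >= n:
        return sorted_data[-1]
    x_k, x_next = sorted_data[k - 1], sorted_data[k]   # 1-based to 0-based
    return x_k + alpha * (x_next - x_k)

# Hourly temperatures (°F) from the example without outliers, n = 24.
temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

median = empirical_quantile(temps, 0.5)
q1 = empirical_quantile(temps, 0.25)
q3 = empirical_quantile(temps, 0.75)
print(q1, median, q3)
```

For these data the neighbouring order statistics are equal in each case, so the interpolation term vanishes and the results agree with the values 66, 70 and 75 obtained above.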
== Visualization ==
[[File:Boxplot vs PDF.svg|thumb|upright=1.2|
Although box plots may seem more primitive than [[histogram
Although looking at a statistical distribution is more common than looking at a box plot, it can be useful to compare the box plot against the probability density function (theoretical histogram) for a normal N(0,''σ''<sup>2</sup>) distribution and observe their characteristics directly
[[File:Boxplots with skewness.png|thumb|Figure 8. Box-plots displaying the skewness of the data set]]
{{clear}}
* [[Bagplot]]
* [[Contour boxplot]]
* [[Data and information visualization]]
* [[Exploratory data analysis]]
* [[Five-number summary]]
* [[Functional boxplot]]
* [[Seasonality]]
* [[Seven-number summary]]
* [[Sina plot]]
* [[Violin plot]]