Box plot: Difference between revisions

remove unnecessary references to figures which are alongside relevant text, and "this is a picture of"
 
(14 intermediate revisions by 11 users not shown)
Line 1:
{{short description|Data visualization}}
[[File:Michelsonmorley-boxplot.svg|thumb|upright=1.535|Figure 1. Box plot of data from the [[Michelson–Morley experiment#Michelson experiment (1881)|Michelson experiment]]]]
 
In [[descriptive statistics]], a '''box plot''' or '''boxplot''' is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their [[quartile]]s.<ref>{{Cite book|last=C.|first=Dutoit, S. H.|url=http://worldcat.org/oclc/1019645745|title=Graphical exploratory data analysis.|date=2012|publisher=Springer|isbn=978-1-4612-9371-2|oclc=1019645745}}</ref>

In addition to the box on a box plot, there can be lines (which are called ''whiskers'') extending from the box indicating variability outside the upper and lower quartiles; the plot is therefore also called the '''box-and-whisker plot''' or the '''box-and-whisker diagram'''. [[Outlier]]s that differ significantly from the rest of the dataset<ref>{{Cite journal|last=Grubbs|first=Frank E.|date=February 1969|title=Procedures for Detecting Outlying Observations in Samples|url=http://dx.doi.org/10.1080/00401706.1969.10490657|journal=Technometrics|volume=11|issue=1|pages=1–21|doi=10.1080/00401706.1969.10490657|issn=0040-1706|url-access=subscription}}</ref> may be plotted as individual points beyond the whiskers on the box-plot.
 
Box plots are [[non-parametric]]: they display variation in samples of a [[statistical population]] without making any assumptions of the underlying [[probability distribution|statistical distribution]]<ref>{{Cite book|last=Richard.|first=Boddy|url=http://worldcat.org/oclc/940679163|title=Statistical Methods in Practice : for Scientists and Technologists.|date=2009|publisher=John Wiley & Sons|isbn=978-0-470-74664-6|oclc=940679163}}</ref> (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length). The spacings in each subsection of the box-plot indicate the degree of [[statistical dispersion|dispersion]] (spread) and [[skewness]] of the data, which are usually described using the [[five-number summary]]. In addition, the box-plot allows one to visually estimate various [[L-estimator]]s, notably the [[interquartile range]], [[midhinge]], [[range (statistics)|range]], [[mid-range]], and [[trimean]]. Box plots can be drawn either horizontally or vertically.
 
== History ==
Line 12 ⟶ 15:
 
==Elements==
[[File:Box-Plot mit Min-Max Abstand.png|thumb|Figure 2. Box-plot with whiskers from minimum to maximum]]
[[File:Box-Plot mit Interquartilsabstand.png|thumb|Figure 3. Same box-plot with whiskers drawn within the 1.5 IQR value]]
 
A boxplot is a standardized way of displaying the dataset based on the [[five-number summary]]: the minimum, the maximum, the sample median, and the first and third quartiles.
Line 25 ⟶ 28:
In addition to the minimum and maximum values used to construct a box-plot, another important element that can also be employed to obtain a box-plot is the interquartile range (IQR), as denoted below:
 
* '''[[Interquartile range]] (IQR)''' : the distance between the upper and lower quartiles
 
:: <math>\text{IQR} = Q_3 - Q_1 = q_n(0.75) - q_n(0.25)</math>
 
A box-plot usually includes two parts, a box and a set of whiskers as shown in Figure 2.
 
===Box===
Line 38 ⟶ 41:
 
Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile ('''''Q''<sub>3</sub>'''), a distance of 1.5 times the IQR is measured out and a whisker is drawn ''up to'' the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile ('''''Q''<sub>1</sub>''') and a whisker is drawn ''down to'' the lowest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as '''outliers'''.<ref>{{Cite book |title=A Modern Introduction to Probability and Statistics |url=https://archive.org/details/modernintroducti00dekk_722 |url-access=limited |last=Dekking |first=F.M. |publisher=Springer |year=2005 |isbn=1-85233-896-2 |pages=[https://archive.org/details/modernintroducti00dekk_722/page/n240 234]–238 }}</ref> The outliers can be plotted on the box-plot as a dot, a small circle, a star, ''etc.'' (see example below).
[[File:Box Plot Picture.png|thumb|A box plot representing data]]
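
The 1.5 IQR rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name <code>tukey_whiskers</code> and its parameters are hypothetical, and the quartiles are passed in directly (here 66 and 75, the values derived for the temperature example with outliers elsewhere in this article).

```python
def tukey_whiskers(data, q1, q3, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) under the k * IQR rule.

    The name, signature and default k are illustrative assumptions.
    """
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # boundary measured out below Q1
    upper_fence = q3 + k * iqr  # boundary measured out above Q3
    # Whiskers end at the most extreme observed points inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    # Everything beyond the fences is plotted individually as an outlier.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return min(inside), max(inside), outliers


# Temperature data (°F) from the example with outliers.
temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

low, high, outliers = tukey_whiskers(temps, q1=66, q3=75)
print(low, high, outliers)  # 57 79 [52, 89]
```

With an IQR of 9 the fences lie at 52.5 and 88.5, so the whiskers stop at the observed values 57 and 79, and the whisker lengths differ even though the fence distance is the same on both sides.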
 
There are other representations in which the whiskers can stand for several other things, such as:
 
Line 45 ⟶ 48:
* The 2nd percentile and the 98th percentile of the data set
 
Rarely, a box plot is drawn without whiskers. This can be appropriate for sensitive information, to avoid whiskers (and outliers) disclosing actual observed values.<ref name="DGRW">{{Cite book|last1=Derrick|first1=Ben|last2=Green|first2=Elizabeth|last3=Ritchie|first3=Felix|last4=White|first4=Paul|date=September 2022|chapter=The Risk of Disclosure When Reporting Commonly Used Univariate Statistics|title=Privacy in Statistical Databases|series=Lecture Notes in Computer Science |volume=13463|pages=119–129|doi=10.1007/978-3-031-13945-1_9|isbn=978-3-031-13944-4 }}</ref>
 
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to depict the [[seven-number summary]]. If the data are [[Normal distribution|normally distributed]], the locations of the seven marks on the box plot will be equally spaced. On some box plots, a cross-hatch is placed before the end of each whisker.
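
The spacing claim for normally distributed data can be checked numerically. The following sketch uses only the Python standard library; the variable names are illustrative, and the seven percentiles are the four whisker percentiles mentioned above together with the quartiles and the median.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary: whisker ends, cross-hatches,
# quartiles and median.
percentiles = [0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98]

# Locations of the seven marks for a standard normal distribution.
marks = [NormalDist().inv_cdf(p) for p in percentiles]

# Distances between neighbouring marks.
gaps = [b - a for a, b in zip(marks, marks[1:])]

for p, m in zip(percentiles, marks):
    print(f"{p:>5.0%}  {m:+.3f}")
print([round(g, 3) for g in gaps])
```

For the standard normal distribution the six gaps come out close to one another (all between roughly 0.67 and 0.71 standard deviations), so the equal spacing is approximate rather than exact.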
 
==Variations==
[[File:Fourboxplots.svg|thumb|upright=1.3|Figure&nbsp;4. Four box plots, with and without notches and variable width]]
 
Since the mathematician [[John W. Tukey]] first popularized this type of visual data display in 1969, several variations on the classical box plot have been developed, and the two most commonly found variations are the variable-width box plots and the notched box plots shown in Figure 4.
 
'''Variable-width box plots''' illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.<ref name="mcgill tukey larsen">{{Cite journal|last1=McGill|first1=Robert|last2=Tukey|first2=John W.|author2-link=John W. Tukey|last3=Larsen|first3=Wayne A.|date=February 1978|title=Variations of Box Plots|journal=[[The American Statistician]]|volume=32|issue=1|pages=12–16|doi=10.2307/2683468|jstor=2683468}}</ref>
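
As a rough illustration of the square-root convention, the sketch below scales box widths so that the largest group receives a chosen maximum width. The function name <code>box_widths</code> and the <code>max_width</code> parameter are hypothetical and not taken from any plotting library.

```python
import math


def box_widths(group_sizes, max_width=0.8):
    """Box widths proportional to the square root of each group's size.

    The largest group gets max_width; both names are illustrative.
    """
    roots = [math.sqrt(n) for n in group_sizes]
    largest = max(roots)
    return [max_width * r / largest for r in roots]


# A group four times as large is drawn only twice as wide.
widths = box_widths([25, 100])
print(widths)
```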
 
'''Notched box plots''' apply a "notch" or narrowing of the box around the median. Notches offer a rough guide to the significance of the difference between medians: if the notches of two boxes do not overlap, this provides evidence of a statistically significant difference between the medians.<ref name="mcgill tukey larsen" /> The height of the notches is proportional to the interquartile range (IQR) of the sample and inversely proportional to the square root of the size of the sample. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).<ref name="mcgill tukey larsen" /> The width of the notch is arbitrarily chosen to be visually pleasing, and should be consistent amongst all box plots being displayed on the same page.
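
The relationship between notch height, IQR and sample size can be sketched as follows. The multiplier is left as a parameter because, as noted above, the appropriate value is uncertain; the default of 1.57 is an assumption used here for illustration, and the function names are hypothetical.

```python
import math


def notch_interval(median, iqr, n, multiplier=1.57):
    """Return the (lower, upper) extent of the notch around the median.

    The half-height is multiplier * IQR / sqrt(n); the default multiplier
    is an illustrative assumption, not a fixed standard.
    """
    half_height = multiplier * iqr / math.sqrt(n)
    return median - half_height, median + half_height


def notches_overlap(a, b):
    """True if two (lower, upper) notch intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two hypothetical samples with the same IQR and size but different medians.
notch_a = notch_interval(median=70, iqr=9, n=24)
notch_b = notch_interval(median=80, iqr=9, n=24)
print(notch_a, notch_b, notches_overlap(notch_a, notch_b))
```

Because the half-height shrinks with the square root of the sample size, quadrupling ''n'' halves the notch height.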
Line 83 ⟶ 86:
 
=== Example without outliers ===
[[File:No Outlier.png|thumb|Figure 5. Boxplot of the example data with no outliers]]
A series of hourly temperatures was measured throughout the day in degrees Fahrenheit. The recorded values are listed in order as follows (°F): 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.
 
Line 117 ⟶ 120:
 
=== Example with outliers ===
[[File:Boxplot with outlier.png|thumb|Figure 6. Boxplot of the example data with outliers]]
The example above has no outliers. Here is a follow-up example for generating a box plot with outliers:
 
The ordered set for the recorded temperatures is (°F): 52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89.
Line 140 ⟶ 143:
Using the above example that has 24 data points (''n'' = 24), one can calculate the median, first and third quartile either mathematically or visually.
 
'''Median'''
: <math>
\begin{align}
q_n(0.5) & = x_{(12)} + (0.5\cdot25-12)\cdot(x_{(13)}-x_{(12)}) \\[5pt]
& = 70+(0.5\cdot25-12)\cdot(70-70) = 70^\circ \text{F}
\end{align}
</math>
 
'''First quartile'''
: <math>
\begin{align}
q_n(0.25) & = x_{(6)} + (0.25\cdot25-6)\cdot(x_{(7)}-x_{(6)}) \\[5pt]
& = 66 +(0.25\cdot25 - 6)\cdot(66-66) = 66^\circ \text{F}
\end{align}
</math>
 
'''Third quartile'''
: <math>
\begin{align}
q_n(0.75) & = x_{(18)} + (0.75\cdot25-18)\cdot(x_{(19)}-x_{(18)}) \\[5pt]
& = 75 + (0.75\cdot25-18)\cdot(75-75) = 75^\circ \text{F}
\end{align}
</math>
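
The three calculations above follow the same pattern: with ''n'' ordered values, take ''k'' as the integer part of ''p''(''n'' + 1) and interpolate between the ''k''-th and (''k'' + 1)-th values. A minimal Python sketch of that pattern is given below; the function name <code>q_n</code> mirrors the notation used here, while clamping to the smallest or largest value when ''k'' falls outside the data is an added assumption not covered by the worked example.

```python
import math


def q_n(sorted_data, p):
    """Empirical quantile: x_(k) + (p*(n+1) - k) * (x_(k+1) - x_(k))."""
    n = len(sorted_data)
    h = p * (n + 1)
    k = math.floor(h)  # 1-based index of the lower neighbour
    alpha = h - k      # fractional part used for interpolation
    # Assumption: clamp to the extremes when k is outside 1..n-1.
    if k < 1:
        return sorted_data[0]
    if k >= n:
        return sorted_data[-1]
    lower = sorted_data[k - 1]  # x_(k) in 1-based notation
    upper = sorted_data[k]      # x_(k+1)
    return lower + alpha * (upper - lower)


# Ordered temperatures (°F) from the example with outliers, n = 24.
temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

median = q_n(temps, 0.5)
q1 = q_n(temps, 0.25)
q3 = q_n(temps, 0.75)
print(median, q1, q3, q3 - q1)  # 70.0 66.0 75.0 9.0
```

Statistical software offers several quantile conventions, so other tools may return slightly different quartiles for the same data.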
 
== Visualization ==
[[File:Boxplot vs PDF.svg|thumb|upright=1.2|Figure 7. Box-plot and a [[probability density function]] (pdf) of a normal N(0,''σ''<sup>2</sup>) population]]
[[File:Boxplots with skewness.png|thumb|Figure 8. Box-plots displaying the skewness of the data set]]
 
Although box plots may seem more primitive than [[histogram]]s or [[kernel density estimation|kernel density estimates]], they do have a number of advantages. First, the box plot enables statisticians to do a quick graphical examination of one or more data sets. Box-plots also take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data in parallel (see Figure 1 for an example). Lastly, the overall shape of histograms and kernel density estimates can be strongly influenced by the choice of [[Histogram#Number of bins and width|number and width of bins]] and the choice of bandwidth, respectively, whereas a box plot requires no such choices.
 
Although looking at a statistical distribution is more common than looking at a box plot, it can be useful to compare the box plot against the probability density function (theoretical histogram) for a normal N(0,''σ''<sup>2</sup>) distribution and observe their characteristics directly (as shown in Figure 7).
 
{{clear}}
 
Line 162 ⟶ 180:
* [[Bagplot]]
* [[Contour boxplot]]
* [[Candlestick chart]]
* [[Data and information visualization]]
* [[Exploratory data analysis]]
Line 168 ⟶ 185:
* [[Five-number summary]]
* [[Functional boxplot]]
* [[Seasonality]]
* [[Seven-number summary]]
* [[Sina plot]]