m Open access bot: hdl updated in citation with #oabot. |
Polyamorph (talk | contribs) m clean up, typo(s) fixed: straight-forward → straightforward, a uncertainty → an uncertainty |
Line 32:
===Whiskers===
The whiskers must end at an observed data point, but can be defined in various ways. In the most
Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile ('''''Q''<sub>3</sub>'''), a distance of 1.5 times the IQR is measured out and a whisker is drawn ''up to'' the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile ('''''Q''<sub>1</sub>''') and a whisker is drawn ''down to'' the smallest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as '''outliers'''.<ref>{{Cite book |title=A Modern Introduction to Probability and Statistics |url=https://archive.org/details/modernintroducti00dekk_722 |url-access=limited |last=Dekking |first=F.M. |publisher=Springer |year=2005 |isbn=1-85233-896-2 |pages=[https://archive.org/details/modernintroducti00dekk_722/page/n240 234]–238 }}</ref> The outliers can be plotted on the box plot as a dot, a small circle, a star, ''etc.'' (see example below).
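The 1.5 IQR rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name <code>tukey_whiskers</code> is hypothetical, and the quartile convention (<code>method="inclusive"</code> in the standard library's <code>statistics.quantiles</code>) is an assumption, since other conventions give slightly different values of ''Q''<sub>1</sub> and ''Q''<sub>3</sub>.

```python
from statistics import quantiles


def tukey_whiskers(data, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) under the k * IQR rule.

    Hypothetical helper for illustration; needs at least two data points.
    """
    values = sorted(data)
    # Assumed quartile convention; other methods shift Q1 and Q3 slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at observed data points that lie inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Everything beyond the fences is reported as an outlier.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return inside[0], inside[-1], outliers


low, high, out = tukey_whiskers([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(low, high, out)
```

Because each whisker stops at an observed value rather than at the fence itself, the two whiskers generally have different lengths.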
Line 53:
'''Variable width box''' plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.<ref name="mcgill tukey larsen">{{Cite journal|last1=McGill|first1=Robert|last2=Tukey|first2=John W.|author2-link=John W. Tukey|last3=Larsen|first3=Wayne A.|date=February 1978|title=Variations of Box Plots|journal=[[The American Statistician]]|volume=32|issue=1|pages=12–16|doi=10.2307/2683468|jstor=2683468}}</ref>
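As a rough sketch of the square-root convention, the following Python function scales box widths so that each is proportional to the square root of its group size. The function name <code>relative_box_widths</code> and the <code>max_width</code> parameter are assumptions made for this illustration; plotting libraries expose their own options for this.

```python
import math


def relative_box_widths(group_sizes, max_width=0.8):
    """Return one width per group, proportional to sqrt(group size).

    Hypothetical helper: the largest group gets max_width and the
    others are scaled down relative to it.
    """
    roots = [math.sqrt(n) for n in group_sizes]
    largest = max(roots)
    return [max_width * r / largest for r in roots]


print(relative_box_widths([25, 100, 400]))
```

With this scaling, a group sixteen times larger than another is drawn four times as wide, not sixteen times.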
'''Notched box''' plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide to the significance of the difference between medians; if the notches of two boxes do not overlap, this provides evidence of a statistically significant difference between the medians.<ref name="mcgill tukey larsen" /> The height of the notches is proportional to the interquartile range (IQR) of the sample and is inversely proportional to the square root of the size of the sample. However, there is
One convention for obtaining the boundaries of these notches is to use a distance of <math alt="±1.58×IQR/sqrt(n)">\pm \frac{1.58 \text{ IQR}}{\sqrt n}</math> around the median.<ref name="Rboxplotstats">{{Cite web | title = R: Box Plot Statistics | work = R manual | url = http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/boxplot.stats.html | access-date = 26 June 2011}}</ref>
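The notch convention above can be illustrated with a short Python sketch. The function name <code>notch_limits</code> is hypothetical, and the quartiles are computed with <code>statistics.quantiles(method="inclusive")</code> as an assumption; statistical packages may use hinges or another quartile definition, so their notch limits can differ slightly.

```python
import math
from statistics import median, quantiles


def notch_limits(data, factor=1.58):
    """Return (lower, upper) notch limits: median +/- factor * IQR / sqrt(n).

    Hypothetical helper for illustration; the quartile method is an assumption.
    """
    values = sorted(data)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    half_width = factor * (q3 - q1) / math.sqrt(len(values))
    centre = median(values)
    return centre - half_width, centre + half_width


print(notch_limits(range(1, 17)))
```

The notch narrows as the sample size grows, because the half-width is divided by the square root of ''n''.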
Line 85:
A box plot of the data set can be generated by first calculating five relevant values of this data set: minimum, maximum, median ('''''Q''<sub>2</sub>'''), first quartile ('''''Q''<sub>1</sub>'''), and third quartile ('''''Q''<sub>3</sub>''').
The minimum is the smallest number of the data set. In this case, the minimum recorded day temperature is 57
The maximum is the largest number of the data set. In this case, the maximum recorded day temperature is 81
The median is the "middle" number of the ordered data set. This means that exactly 50% of the elements are less than the median and 50% of the elements are greater than the median. The median of this ordered data set is 70
The first quartile value ('''''Q''<sub>1</sub>''' '''or 25th percentile)''' is the number that marks one quarter of the ordered data set. In other words, exactly 25% of the elements are less than the first quartile and exactly 75% of the elements are greater than it. The first quartile value can be easily determined by finding the "middle" number between the minimum and the median. For the hourly temperatures, the "middle" number found between 57
The third quartile value ('''''Q''<sub>3</sub>''' '''or 75th percentile)''' is the number that marks three quarters of the ordered data set. In other words, exactly 75% of the elements are less than the third quartile and 25% of the elements are greater than it. The third quartile value can be easily obtained by finding the "middle" number between the median and the maximum. For the hourly temperatures, the "middle" number between 70
The interquartile range, or IQR, can be calculated by subtracting the first quartile value ('''''Q''<sub>1</sub>''') from the third quartile value ('''''Q''<sub>3</sub>'''):
Line 109:
: <math>Q_1-1.5\text{ IQR}=66^\circ F-13.5^\circ F=52.5^\circ F.</math>
The upper whisker boundary of the box plot is the largest data value that is within 1.5 IQR above the third quartile. Here, 1.5 IQR above the third quartile is 88.5
Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile. Here, 1.5 IQR below the first quartile is 52.5
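The whisker arithmetic for this example can be checked with a short Python sketch. It uses only the summary values quoted in this section: ''Q''<sub>1</sub> = 66 and 1.5 IQR = 13.5 come from the equation above, ''Q''<sub>3</sub> = 75 is inferred from 88.5 − 13.5 (an inference, not a quoted value), and the variable names are chosen for illustration.

```python
# Summary values from the worked example (degrees Fahrenheit).
q1 = 66.0
q3 = 75.0          # inferred: 88.5 - 13.5
data_min = 57.0    # smallest recorded temperature
data_max = 81.0    # largest recorded temperature

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # lowest value a whisker may reach
upper_fence = q3 + 1.5 * iqr     # highest value a whisker may reach

# If both extremes lie inside the fences, the whiskers end at the
# minimum and maximum and no point is flagged as an outlier.
whiskers_reach_extremes = lower_fence <= data_min and data_max <= upper_fence

print(iqr, lower_fence, upper_fence, whiskers_reach_extremes)
```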
=== Example with outliers ===
Line 121:
In this example, only the first and the last numbers are changed. The median, third quartile, and first quartile remain the same.
In this case, the maximum value in this data set is 89
Similarly, the minimum value in this data set is 52
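A minimal Python sketch of the outlier test for this example follows. The function name <code>is_outlier</code> is hypothetical. The quartiles are the same as in the previous example: ''Q''<sub>1</sub> = 66 is taken from the equation above, and ''Q''<sub>3</sub> = 75 is inferred from the stated upper boundary of 88.5.

```python
def is_outlier(value, q1, q3, k=1.5):
    """Return True if value lies more than k * IQR outside the quartiles.

    Hypothetical helper for illustration.
    """
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


Q1, Q3 = 66.0, 75.0  # Q3 inferred from 88.5 - 13.5

# The new extremes fall just beyond the boundaries of 52.5 and 88.5.
print(is_outlier(89, Q1, Q3), is_outlier(52, Q1, Q3))
# The extremes of the earlier example stay inside the boundaries.
print(is_outlier(81, Q1, Q3), is_outlier(57, Q1, Q3))
```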
=== In the case of large datasets ===
An additional example of obtaining a box plot from a data set containing a large number of data points is:
==== General equation to compute empirical quantiles ====
Line 149:
[[File:Boxplot vs PDF.svg|thumb|upright=1.2|Figure 7. Box-plot and a [[probability density function]] (pdf) of a Normal N(0,1σ<sup>2</sup>) Population]]
Although box plots may seem more primitive than [[histogram
Although looking at a statistical distribution is more common than looking at a box plot, it can be useful to compare the box plot against the probability density function (theoretical histogram) for a normal N(0,''σ''<sup>2</sup>) distribution and observe their characteristics directly (as shown in Figure 7).
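The comparison can also be made numerically. The sketch below uses Python's standard-library <code>statistics.NormalDist</code> to compute where the quartiles and the 1.5 IQR boundaries of a normal N(0,''σ''<sup>2</sup>) population fall, and what share of the population lies beyond those boundaries. The function name <code>normal_boxplot_summary</code> is hypothetical, and the 1.5 IQR whisker rule is assumed.

```python
from statistics import NormalDist


def normal_boxplot_summary(sigma=1.0, k=1.5):
    """Quartiles, fences and tail mass of N(0, sigma^2) under the k * IQR rule.

    Hypothetical helper for illustration.
    """
    dist = NormalDist(mu=0.0, sigma=sigma)
    q1 = dist.inv_cdf(0.25)
    q3 = dist.inv_cdf(0.75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Probability mass that would be drawn as outliers, both tails combined.
    outside = dist.cdf(lower_fence) + (1.0 - dist.cdf(upper_fence))
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outside": outside,
    }


summary = normal_boxplot_summary()
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For ''σ'' = 1 the quartiles lie at roughly ±0.6745''σ'' and the boundaries at roughly ±2.698''σ'', so only about 0.7% of a normal population falls outside the whisker boundaries.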