Exploratory data analysis: Difference between revisions

Fixed typo
Tags: Reverted possibly inaccurate edit summary Mobile edit Mobile web edit
Overview: Fixed typo
Line 11:
Tukey's EDA was related to two other developments in [[statistical theory]]: [[robust statistics]] and [[nonparametric statistics]], both of which tried to reduce the sensitivity of statistical inferences to errors in formulating [[statistical model]]s. Tukey promoted the use of the [[five number summary|five-number summary]] of numerical data: the two [[extreme value|extreme]]s ([[maximum]] and [[minimum]]), the [[median]], and the [[quartile]]s. He favored these because the median and quartiles, being functions of the [[empirical distribution function|empirical distribution]]<!-- [[statistical functional]]s (and the related [[interquartile range]] and [[range]]) -->, are defined for all distributions, unlike the [[mean value|mean]] and [[standard deviation]]; moreover, the quartiles and median are more robust to [[skewness|skewed]] or [[heavy-tailed distribution]]s than the traditional summaries (the mean and standard deviation). The packages [[S (programming language)|S]], [[S-PLUS]], and [[R (programming language)|R]] included routines using [[resampling (statistics)|resampling statistics]], such as Quenouille and Tukey's [[resampling (statistics)#Jackknife|jackknife]] and [[Bradley Efron|Efron]]{{'s}} [[bootstrapping (statistics)|bootstrap]], which are nonparametric and robust (for many problems).
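
As a rough illustration of the robustness claim, the following Python sketch computes a five-number summary for a small sample and for the same sample with one wild value, and draws bootstrap resamples of the median. It is not the routine shipped in S, S-PLUS, or R. The function names, the sample data, and the choice of the "inclusive" quartile convention are assumptions made for this example; other quartile conventions give slightly different values.

```python
import random
import statistics


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    values = sorted(data)
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # method="inclusive" is one of several quartile conventions (assumed here).
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return values[0], q1, med, q3, values[-1]


def bootstrap_medians(data, n_resamples=1000, seed=0):
    """Medians of resamples drawn with replacement from the data."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    values = list(data)
    return [
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]


# Hypothetical data: the second sample replaces the largest value
# with an extreme outlier.
clean = [2, 4, 4, 5, 6, 7, 9]
contaminated = clean[:-1] + [900]

if __name__ == "__main__":
    for name, sample in (("clean", clean), ("contaminated", contaminated)):
        print(name)
        print("  five-number summary:", five_number_summary(sample))
        print("  mean:", statistics.mean(sample))
        print("  standard deviation:", statistics.stdev(sample))
    medians = bootstrap_medians(contaminated)
    print("bootstrap medians range:", min(medians), "to", max(medians))
```

In this sketch the outlier changes only the maximum of the five-number summary, while the quartiles and median stay the same and the mean and standard deviation shift substantially.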
 
Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which were of concern to Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the [[analytic function|analytic]] theory of [[statistical hypothesis testing|testing statistical hypotheses]], particularly the [[Pierre-Simon Laplace|Laplacian]] tradition's emphasis on [[exponential family|exponential families]].<ref>{{cite journal |title=Conversation with John W. Tukey and Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler |journal=Statistical Science |volume=15 |issue=1 |year=2000 |pages=79–94 |doi=10.1214/ss/1009212675|last1=Morgenthaler |first1=Stephan |last2=Fernholz |first2=Luisa T. |doi-access=free }}</ref>

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs.[4] The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends, and patterns in data that merited further study.
 
 
== Development ==