Exploratory data analysis: Difference between revisions

Revision as of 00:45, 9 January 2024 by David Eppstein (edit summary: cleanup)
Latest revision as of 20:43, 25 May 2025 by Yash Thale (minor edit; edit summary: →top: Added some info about EDA/plotting/visualization libraries currently used with Python for EDA; tags: Mobile edit, Mobile app edit, Android app edit, App full source)
(12 intermediate revisions by 11 users not shown)
Line 1: {{short description\|Approach of analyzing data sets in statistics}}{{Data Visualization}} ~~{{Data Visualization}}~~ In [[statistics]], '''exploratory data analysis''' (EDA) is an approach of [[data analysis\|analyzing]] [[data set]]s to summarize their main characteristics, often using [[statistical graphics]] and other [[data visualization]] methods. A [[statistical model]] can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling and thereby contrasts with traditional hypothesis testing, in which a model is supposed to be selected before the data is seen. Exploratory data analysis has been promoted by [[John Tukey]] since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from [[Data analysis#Initial data analysis\|initial data analysis (IDA)]],<ref>{{cite book \|last=Chatfield \|first=C. \|year=1995 \|title=Problem Solving: A Statistician's Guide \|publisher=Chapman and Hall \|isbn=978-0412606304 \|edition=2nd }}</ref><ref>{{cite journal \|doi=10.1371/journal.pcbi.1009819\|title=Ten simple rules for initial data analysis\|year=2022\|last1=Baillie\|first1=Mark\|last2=Le Cessie\|first2=Saskia\|last3=Schmidt\|first3=Carsten Oliver\|last4=Lusa\|first4=Lara\|last5=Huebner\|first5=Marianne\|author6=Topic Group "Initial Data Analysis" of the STRATOS Initiative\|journal=PLOS Computational Biology\|volume=18\|issue=2\|pages=e1009819\|pmid=35202399\|pmc=8870512\|bibcode=2022PLSCB..18E9819B \|doi-access=free }}</ref> which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.▼ [[File:Optimizing edge intelligence.png\|thumb\|Exploratory Data Analysis: Unveiling Insights into Edge Intelligence Enhancement. 
In this comprehensive exploration, the graph traces the trajectories of two curves - one representing the quantitative assessment model for edge intelligence enhancement, and the other showcasing actual test results. Both embark from the origin (0,1) and converge meaningfully at (80,70), indicating a shared comprehensive proportion during this pivotal phase. Intriguingly, as the data unfolds beyond this point, a discernible divergence emerges. The Edge Intelligence Enhancement Model consistently surpasses actual test results, revealing a compelling reserve in comprehensive proportions. This nuanced visual narrative provides valuable insights into the intricate dynamics between modeled predictions and empirical outcomes, underscoring the significance of exploratory data analysis in unraveling the complexities of enhanced edge intelligence.]] ▲In [[statistics]], '''exploratory data analysis''' (EDA) is an approach of [[data analysis\|analyzing]] [[data set]]s to summarize their main characteristics, often using [[statistical graphics]] and other [[data visualization]] methods. A [[statistical model]] can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling and thereby contrasts traditional hypothesis testing. Exploratory data analysis has been promoted by [[John Tukey]] since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from [[Data analysis#Initial data analysis\|initial data analysis (IDA)]],<ref>{{cite book \|last=Chatfield \|first=C. 
\|year=1995 \|title=Problem Solving: A Statistician's Guide \|publisher=Chapman and Hall \|isbn=978-0412606304 \|edition=2nd }}</ref><ref>{{cite journal \|doi=10.1371/journal.pcbi.1009819\|title=Ten simple rules for initial data analysis\|year=2022\|last1=Baillie\|first1=Mark\|last2=Le Cessie\|first2=Saskia\|last3=Schmidt\|first3=Carsten Oliver\|last4=Lusa\|first4=Lara\|last5=Huebner\|first5=Marianne\|author6=Topic Group "Initial Data Analysis" of the STRATOS Initiative\|journal=PLOS Computational Biology\|volume=18\|issue=2\|pages=e1009819\|pmid=35202399\|pmc=8870512\|bibcode=2022PLSCB..18E9819B \|doi-access=free }}</ref> which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.

==Overview==

Line 8 ⟶ 7:

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."<ref>[http://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711 John Tukey-The Future of Data Analysis-July 1961]</ref>

Exploratory data analysis is ~~an analysis~~ a technique to analyze and investigate ~~the~~ a ~~data set~~ dataset and ~~summaries~~ summarize ~~the~~ its main characteristics ~~of the dataset~~. ~~Main~~ A main advantage of EDA is providing the ~~data~~ visualization of data after conducting ~~the~~ analysis.

Tukey's championing of EDA encouraged the development of [[Computational statistics\|statistical computing]] packages, especially [[S (programming language)\|S]] at [[Bell Labs]].<ref>{{Citation \|last=Becker \|first=Richard A. 
\|title=A Brief History of S \|publisher=AT&T Bell Laboratories \|place=Murray Hill, New Jersey \|access-date=2015-07-23 \|url=http://www2.research.att.com/areas/stat/doc/94.11.ps \|format=PS \|archive-url=https://web.archive.org/web/20150723044213/http://www2.research.att.com/areas/stat/doc/94.11.ps \|archive-date=2015-07-23 \|quotation="... we wanted to be able to interact with our data, using Exploratory Data Analysis (Tukey, 1971) techniques."}}</ref> The S programming language inspired the systems [[S-PLUS]] and [[R (programming language)\|R]]. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify [[outlier]]s, [[trend estimation\|trends]] and [[pattern recognition\|patterns]] in data that merited further study.

Tukey's EDA was related to two other developments in [[statistical theory]]: [[robust statistics]] and [[nonparametric statistics]], both of which tried to reduce the sensitivity of statistical inferences to errors in formulating [[statistical model]]s. Tukey promoted the use of the [[five number summary]] of numerical data—the two [[extreme value\|extreme]]s ([[maximum]] and [[minimum]]), the [[median]], and the [[quartile]]s—because the median and quartiles, being functions of the [[empirical distribution function\|empirical distribution]], <!-- [[statistical functional]]s (and the related [[interquartile range]] and [[range]]) -->are defined for all distributions, unlike the [[mean value\|mean]] and [[standard deviation]]~~;~~. ~~moreover~~ Moreover, the quartiles and median are more robust to [[skewness\|skewed]] or [[heavy-tailed distribution]]s than traditional summaries (the mean and standard deviation).
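The robustness of the five-number summary described above can be illustrated with a short Python sketch. It is not taken from Tukey's work or from any package named in this article. The helper name <code>five_number_summary</code>, the sample values, and the choice of the "inclusive" quartile convention in the standard-library function <code>statistics.quantiles</code> are assumptions made for illustration; other quartile conventions give slightly different values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Hypothetical helper for illustration only. Quartiles use the
    "inclusive" convention of statistics.quantiles (Python 3.8+).
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


# Made-up sample, and the same sample with its largest value
# replaced by an extreme outlier.
sample = [2, 3, 3, 4, 5, 6, 7, 8, 9]
contaminated = sample[:-1] + [900]

for name, data in (("sample", sample), ("contaminated", contaminated)):
    print(name)
    print("  five-number summary:", five_number_summary(data))
    print("  mean:", statistics.mean(data))
    print("  standard deviation:", statistics.stdev(data))
```

In this sketch the median and quartiles are the same for both data sets, while the mean and standard deviation change substantially once the outlier is introduced. Only the maximum in the five-number summary reflects the extreme value.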
The packages [[S (programming language)\|S]], [[S-PLUS]], and [[R (programming language)\|R]] included routines using [[resampling (statistics)\|resampling statistics]], such as Quenouille and Tukey's [[resampling (statistics)#Jackknife\|jackknife]] and [[Bradley Efron\|Efron]]{{'s}} [[bootstrapping (statistics)\|bootstrap]], which are nonparametric and robust (for many problems).

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, both of which ~~concerned~~ were of interest to Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the [[analytic function\|analytic]] theory of [[statistical hypothesis testing\|testing statistical hypotheses]], particularly the [[Pierre-Simon Laplace\|Laplacian]] tradition's emphasis on [[exponential family\|exponential families]].<ref>{{cite journal \|title=Conversation with John W. Tukey and Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler \|journal=Statistical Science \|volume=15 \|issue=1 \|year=2000 \|pages=79–94 \|doi=10.1214/ss/1009212675\|last1=Morgenthaler \|first1=Stephan \|last2=Fernholz \|first2=Luisa T. \|doi-access=free }}</ref>

== Development ==

Line 102 ⟶ 101:

* [[Orange (software)\|Orange]], an [[open-source software\|open-source]] [[data mining]] and [[machine learning]] software suite.
* [[Python (programming language)\|Python]], an open-source programming language widely used in data mining and machine learning.
* Matplotlib and Seaborn are Python libraries widely used for plotting and data visualization in EDA (as of 2025).
* [[R (programming language)\|R]], an open-source programming language for statistical computing and graphics. 
Together with Python, it is one of the most popular languages for data science.
* [[TinkerPlots]], EDA software for upper elementary and middle school students.

Line 131:

{{cite book\|author1=Martinez, W. L.\|author1-link= Wendy L. Martinez \|author2=Martinez, A. R. \|author3= Solka, J. \|name-list-style=amp \|year=2010\|title=Exploratory Data Analysis with MATLAB, second edition\|publisher=Chapman & Hall/CRC\|isbn= 9781439812204}}
Theus, M., Urbanek, S. (2008), Interactive Graphics for Data Analysis: Principles and Examples, CRC Press, Boca Raton, FL, {{ISBN\|978-1-58488-594-8}}
{{cite book \|author1=Tucker, L \|author2=MacCallum, R. \|title=Exploratory Factor Analysis \|year=1993 \|~~location~~ url= [http://www.unc.edu/~rcm/book/factornew.htm]}}
{{cite book \|last=Tukey \|first=John Wilder \|title=Exploratory Data Analysis \|year=1977 \|publisher=Addison-Wesley \|isbn=978-0-201-07616-5 \|url=https://archive.org/details/exploratorydataa00tuke_0 \|url-access=registration }}
{{cite book \|title=Applications, Basics and Computing of Exploratory Data Analysis \|last1=Velleman \|first1=P. F. \|last2=Hoaglin \|first2=D. C. \|year=1981 \|publisher=Duxbury Press \|isbn=978-0-87150-409-8 \|url-access=registration \|url=https://archive.org/details/applicationsbasi00vell }}

Line 138:

S. H. C. DuToit, A. G. W. Steyn, R. H. Stumpf (1986) [https://link.springer.com/book/10.1007%2F978-1-4612-4950-4 ''Graphical Exploratory Data Analysis'']. Springer {{ISBN\|978-1-4612-9371-2}}
<!-- unclear why these are repeated here when they are listed above
Andrienko, N & Andrienko, G (2005) Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach. Springer. ISBN 3-540-25994-5
Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007-12-12). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer. ISBN 9780387717616.
Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 978-0-471-09776-1.
Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 978-0-471-09777-8.
Young, F. W. Valero-Mora, P. and Friendly M. (2006) Visual Statistics: Seeing your data with Dynamic Interactive Graphics. Wiley ISBN 978-0-471-68160-1
Jambu M. (1991) Exploratory and Multivariate Data Analysis. Academic Press ISBN 0123800900
S. H. C. DuToit, A. G. W. Steyn, R. H. Stumpf (1986) Graphical Exploratory Data Analysis. Springer ISBN 978-1-4612-9371-2
-->

== External links ==