to be -> supposed to be
Yash Thale (talk | contribs) m →top: Added some info about EDA/plotting/visualization libraries currently used with Python for EDA. Tags: Mobile edit Mobile app edit Android app edit App full source
(4 intermediate revisions by 4 users not shown)
Line 1:
{{short description|Approach of analyzing data sets in statistics}}{{Data Visualization}}
In [[statistics]], '''exploratory data analysis''' (EDA) is an approach to [[data analysis|analyzing]] [[data set]]s to summarize their main characteristics, often using [[statistical graphics]] and other [[data visualization]] methods. A [[statistical model]] can be used or not, but primarily EDA is for seeing what the data can tell
==Overview==
Line 8 ⟶ 7:
Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."<ref>[http://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711 John Tukey-The Future of Data Analysis-July 1961]</ref>
Exploratory data analysis is
Tukey's championing of EDA encouraged the development of [[Computational statistics|statistical computing]] packages, especially [[S (programming language)|S]] at [[Bell Labs]].<ref>{{Citation |last=Becker |first=Richard A. |title=A Brief History of S |publisher=AT&T Bell Laboratories |place=Murray Hill, New Jersey |access-date=2015-07-23 |url=http://www2.research.att.com/areas/stat/doc/94.11.ps |format=PS |archive-url=https://web.archive.org/web/20150723044213/http://www2.research.att.com/areas/stat/doc/94.11.ps |archive-date=2015-07-23 |quotation="... we wanted to be able to interact with our data, using Exploratory Data Analysis (Tukey, 1971) techniques."}}</ref> The S programming language inspired the systems [[S-PLUS]] and [[R (programming language)|R]]. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify [[outlier]]s, [[trend estimation|trends]] and [[pattern recognition|patterns]] in data that merited further study.
Tukey's EDA was related to two other developments in [[statistical theory]]: [[robust statistics]] and [[nonparametric statistics]], both of which tried to reduce the sensitivity of statistical inferences to errors in formulating [[statistical model]]s. Tukey promoted the use of the [[five number summary|five-number summary]] of numerical data (the two [[extreme value|extreme]]s, [[maximum]] and [[minimum]]; the [[median]]; and the [[quartile]]s) because the median and quartiles, being functions of the [[empirical distribution function|empirical distribution]], <!-- [[statistical functional]]s (and the related [[interquartile range]] and [[range]]) -->are defined for all distributions, unlike the [[mean value|mean]] and [[standard deviation]]
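As a minimal sketch of the five-number summary described above, the following Python snippet computes the minimum, quartiles, median and maximum using only the standard library. The function name <code>five_number_summary</code> and the sample values are hypothetical, chosen for illustration, and the "inclusive" quartile convention is an assumption; other conventions give slightly different quartiles. The second sample replaces one value with a gross error to illustrate the related robustness point: the median and quartiles stay put while the mean shifts considerably.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # method="inclusive" is one of several quartile conventions (an
    # assumption of this sketch, not something the article prescribes).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical sample, and the same sample with one gross recording error.
clean = [2, 4, 4, 5, 6, 7, 9]
dirty = clean[:-1] + [900]

summary_clean = five_number_summary(clean)
summary_dirty = five_number_summary(dirty)

print("clean:", summary_clean, "mean =", round(statistics.mean(clean), 2))
print("dirty:", summary_dirty, "mean =", round(statistics.mean(dirty), 2))
```

Only the maximum changes between the two summaries; the three middle values are identical, whereas the mean of the second sample is dominated by the single erroneous value.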
Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, both of which
== Development ==
Line 102 ⟶ 101:
* [[Orange (software)|Orange]], an [[open-source software|open-source]] [[data mining]] and [[machine learning]] software suite.
* [[Python (programming language)|Python]], an open-source programming language widely used in data mining and machine learning.
* [[Matplotlib]] and Seaborn, Python libraries commonly used for plotting and data visualization in EDA.
* [[R (programming language)|R]], an open-source programming language for statistical computing and graphics. Together with Python, it is one of the most popular languages for data science.
* [[TinkerPlots]], EDA software for upper elementary and middle school students.