{{short description|none}} <!-- "none" is preferred when the title is sufficiently descriptive; see [[WP:SDNONE]] -->
{{Data Visualization}}
{{Computational physics}}
 
''Data analysis'' is a [[Process theory|process]] for obtaining [[raw data]] and subsequently converting it into information useful for decision-making by users.<ref name="Auerbach Publications"/> Statistician [[John Tukey]] defined data analysis in 1961 as:<blockquote>"Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."<ref>{{Cite journal |url=http://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711 |title=John Tukey-The Future of Data Analysis-July 1961 |journal=The Annals of Mathematical Statistics |date=March 1962 |volume=33 |issue=1 |pages=1–67 |doi=10.1214/aoms/1177704711 |access-date=2015-01-01 |archive-date=2020-01-26 |archive-url=https://web.archive.org/web/20200126232007/https://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711 |url-status=live |last1=Tukey |first1=John W. }}</ref></blockquote>
 
There are several phases, and they are [[Iteration|iterative]], in that feedback from later phases may result in additional work in earlier phases.<ref name="Schutt & O'Neil">{{cite book
| author2-last = O'Neil | author2-first= Cathy | author2-link= Cathy O'Neil
| author1-last = Schutt | author1-first= Rachel
| year = 2013
| title = Doing Data Science | publisher = [[O'Reilly Media]]
| isbn = 978-1-449-35865-5}}</ref>
 
===Data requirements===
===Modeling and algorithms===
'''Mathematical formulas''' or '''models''' (also known as '''[[algorithms]]''') may be applied to the data to identify relationships among the variables; for example, checking for [[Correlation and dependence|correlation]] or determining whether [[causality]] is present. In general terms, models may be developed to evaluate a specific variable based on other variable(s) contained within the dataset, with some ''[[Errors and residuals|residual error]]'' depending on the implemented model's accuracy (''e.g.'', Data = Model + Error).<ref>{{Cite journal|title=Figure 2. Variable importance by permutation, averaged over 25 models.|journal=eLife|date=28 February 2017|volume=6|pages=e22053|doi=10.7554/elife.22053.004|last1=Evans|first1=Michelle V.|last2=Dallas|first2=Tad A.|last3=Han|first3=Barbara A.|last4=Murdock|first4=Courtney C.|last5=Drake|first5=John M. |editor1=Brady, Oliver |doi-access=free }}</ref>
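
The following is a minimal illustrative sketch in Python (the variable names and values are hypothetical, not drawn from the cited sources) of checking for correlation between two variables:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical paired observations of two variables
advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient; values near +1 or -1 indicate a
# strong linear relationship, values near 0 indicate little or none
r = np.corrcoef(advertising, sales)[0, 1]
print(f"correlation r = {r:.3f}")
</syntaxhighlight>

Note that a high correlation by itself does not establish causality.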
 
[[Inferential statistics]] utilizes techniques that measure the relationships between particular variables.<ref>{{Cite journal|title=Table 3: Descriptive (mean ± SD), inferential (95% CI) and qualitative statistics (ES) of all variables between self-selected and predetermined conditions |journal=PeerJ|date=12 November 2020|volume=8|pages=e10361|doi=10.7717/peerj.10361/table-3 |last1=Watson|first1=Kevin|last2=Halperin|first2=Israel|last3=Aguilera-Castells|first3=Joan|last4=Iacono|first4=Antonio Dello |doi-access=free }}</ref> For example, [[regression analysis]] may be used to model whether a change in advertising (''independent variable X''), provides an explanation for the variation in sales (''dependent variable Y''), i.e. is Y a function of X? This can be described as (''Y'' = ''aX'' + ''b'' + error), where the model is designed such that (''a'') and (''b'') minimize the error when the model predicts ''Y'' for a given range of values of ''X''.<ref>{{Cite journal|last=Nwabueze|first=JC|date=2008-05-21|title=Performances of estimators of linear model with auto-correlated error terms when the independent variable is normal |url=http://dx.doi.org/10.4314/jonamp.v9i1.40071 |journal=Journal of the Nigerian Association of Mathematical Physics |volume=9 |issue=1|doi=10.4314/jonamp.v9i1.40071|issn=1116-4336}}</ref>
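
Continuing the same hypothetical data, the following sketch fits such a regression using ordinary least squares (here via NumPy's <code>polyfit</code>, one of several possible tools, not one prescribed by the cited sources):

<syntaxhighlight lang="python">
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # advertising (hypothetical)
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # sales (hypothetical)

# Least-squares fit of Y = a*X + b: polyfit chooses a and b so that the
# squared error between the model's predictions and the observed Y is minimized
a, b = np.polyfit(X, Y, deg=1)
residuals = Y - (a * X + b)  # the error term: Data - Model

print(f"Y = {a:.2f}*X + {b:.2f} + error")
print("residuals:", np.round(residuals, 2))
</syntaxhighlight>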
 
===Data product===
===Communication===
 
{{Main|Data and information visualization}}
Once data is analyzed, it may be reported in many formats to the users of the analysis to support their requirements.<ref>{{Citation|title=Data requirements for semiconductor die. Exchange data formats and data dictionary|url=http://dx.doi.org/10.3403/02271298|publisher=BSI British Standards|doi=10.3403/02271298|access-date=2021-05-31}}</ref> The users may have feedback, which results in additional analysis.
 
When determining how to communicate the results, the analyst may consider a variety of data visualization techniques to help convey the message more clearly and efficiently to the audience. Data visualization uses [[information displays]] (graphics such as tables and charts) to help communicate key messages contained in the data. [[Table (information)|Tables]] are valuable because they enable a user to query and focus on specific numbers, while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained in the data.<ref>{{Cite book|date=2021|title=Visualizing Data About UK Museums: Bar Charts, Line Charts and Heat Maps |url=http://dx.doi.org/10.4135/9781529768749 |doi=10.4135/9781529768749 |isbn=9781529768749 |s2cid=240967380}}</ref>
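
As an illustrative sketch (all figures hypothetical), the same data can be rendered as a bar chart or a line chart with a plotting library such as Matplotlib:

<syntaxhighlight lang="python">
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]  # hypothetical reporting periods
revenue = [120, 135, 128, 150]       # hypothetical values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(quarters, revenue)               # bar chart: compare categories
ax1.set_title("Revenue by quarter")
ax2.plot(quarters, revenue, marker="o")  # line chart: show a trend over time
ax2.set_title("Revenue trend")
plt.tight_layout()
plt.show()
</syntaxhighlight>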

==Techniques for analyzing quantitative data==
For the variables under examination, analysts typically obtain [[descriptive statistics]], such as the mean (average), [[median]], and [[standard deviation]]. They may also analyze the [[probability distribution|distribution]] of the key variables to see how the individual values cluster around the mean.<ref name="Koomey1"/>
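
A minimal sketch of computing such descriptive statistics, on a hypothetical sample:

<syntaxhighlight lang="python">
import numpy as np

values = np.array([2.0, 3.5, 3.9, 4.1, 4.4, 5.0, 9.5])  # hypothetical sample

print("mean  :", round(values.mean(), 2))
print("median:", np.median(values))
print("std   :", round(values.std(ddof=1), 2))  # sample standard deviation

# A mean well above the median (here pulled upward by the 9.5 outlier)
# hints that the values are skewed rather than clustered symmetrically
</syntaxhighlight>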
 
[[File:US_Employment_Statistics_-_March_2015.png|thumb|250px|right|An illustration of the [[MECE principle]] used for data analysis]]
 
[[McKinsey and Company]] coined the name [[MECE principle]] ("Mutually Exclusive and Collectively Exhaustive") for a technique of breaking down a quantitative problem into its component parts.<ref>{{Citation|title=Consultants Employed by McKinsey & Company|date=2008-07-30|url=http://dx.doi.org/10.4324/9781315701974-15|work=Organizational Behavior 5|pages=77–82|publisher=Routledge|doi=10.4324/9781315701974-15|isbn=978-1-315-70197-4|access-date=2021-06-03}}</ref> Each layer can be broken down into its components; each of the sub-components must be [[Mutually exclusive events|mutually exclusive]] of one another and must [[Collectively exhaustive events|collectively]] add up to the layer above them. For example, profit by definition can be broken down into total revenue and total cost.<ref>{{Cite journal|last=Carey|first=Malachy|date=November 1981|title=On Mutually Exclusive and Collectively Exhaustive Properties of Demand Functions |url=http://dx.doi.org/10.2307/2553697 |journal=Economica |volume=48|issue=192|pages=407–415|doi=10.2307/2553697|jstor=2553697|issn=0013-0427}}</ref>
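
A minimal sketch of checking a MECE breakdown (all figures hypothetical): each sub-component is listed once (mutually exclusive), and the sub-components must sum to the layer above them (collectively exhaustive).

<syntaxhighlight lang="python">
# Hypothetical breakdown: profit = total revenue - total cost
revenue_components = {"product sales": 900, "services": 300}
cost_components = {"materials": 400, "labour": 350, "overhead": 150}

total_revenue = sum(revenue_components.values())  # 1200
total_cost = sum(cost_components.values())        # 900
profit = total_revenue - total_cost               # 300

# Collectively exhaustive: the sub-components must reproduce the
# (hypothetical) reported totals of the layer above them
assert total_revenue == 1200 and total_cost == 900
print(f"profit = {total_revenue} - {total_cost} = {profit}")
</syntaxhighlight>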

===Confusing fact and opinion===
{{quote box|quote=You are entitled to your own opinion, but you are not entitled to your own facts.|source=[[Daniel Patrick Moynihan]]|width = 250px}}
 
Effective analysis requires obtaining relevant [[fact]]s to answer questions, support a conclusion or formal [[opinion]], or test [[hypotheses]].<ref>{{Citation|title=Information relevant to your job|date=2007-07-11 |url=http://dx.doi.org/10.4324/9780080544304-16 |work=Obtaining Information for Effective Management|pages=48–54 |publisher=Routledge|doi=10.4324/9780080544304-16|doi-broken-date=17 July 2025 |isbn=978-0-08-054430-4|access-date=2021-06-03}}</ref> Facts by definition are irrefutable, meaning that any person involved in the analysis should be able to agree upon them. The auditor of a public company must arrive at a formal opinion on whether financial statements of publicly traded corporations are "fairly stated, in all material respects".<ref>{{Cite journal|last=Gordon|first=Roger|date=March 1990|title=Do Publicly Traded Corporations Act in the Public Interest?|url=http://dx.doi.org/10.3386/w3303|___location=Cambridge, MA|doi=10.3386/w3303 |journal=National Bureau of Economic Research Working Papers}}</ref> This requires extensive analysis of factual data and evidence to support their opinion.
 
===Cognitive biases===
===Innumeracy===
Effective analysts are generally adept with a variety of numerical techniques. However, audiences may not have such literacy with numbers or [[numeracy]]; they are said to be innumerate.<ref>{{Cite web |title=Figure 6.7. Differences in literacy scores across OECD countries generally mirror those in numeracy|url=http://dx.doi.org/10.1787/888934081549|access-date=2021-06-03|doi=10.1787/888934081549}}</ref> Persons communicating the data may also be attempting to mislead or misinform, deliberately using bad numerical techniques.<ref>{{Cite web |url=http://www.bloombergview.com/articles/2014-10-28/bad-math-that-passes-for-insight |title=Bad Math that Passes for Insight |last=Ritholz |first=Barry |work=Bloomberg View |access-date=2014-10-29 |archive-date=2014-10-29 |archive-url=https://web.archive.org/web/20141029014527/http://www.bloombergview.com/articles/2014-10-28/bad-math-that-passes-for-insight |url-status=dead }}</ref>
 
For example, whether a number is rising or falling may not be the key factor. More important may be the number relative to another number, such as the size of government revenue or spending relative to the size of the economy (GDP) or the amount of cost relative to revenue in corporate financial statements.<ref>{{Cite journal |last1=Gusnaini |first1=Nuriska |last2=Andesto |first2=Rony |last3=Ermawati|date=2020-12-15|title=The Effect of Regional Government Size, Legislative Size, Number of Population, and Intergovernmental Revenue on The Financial Statements Disclosure |url=http://dx.doi.org/10.24018/ejbmr.2020.5.6.651 |journal=European Journal of Business and Management Research |volume=5 |issue=6 |doi=10.24018/ejbmr.2020.5.6.651 |s2cid=231675715|issn=2507-1076}}</ref> This numerical technique is referred to as normalization<ref name="Koomey1"/> or common-sizing. There are many such techniques employed by analysts, whether adjusting for inflation (i.e., comparing real vs. nominal data) or considering population increases, demographics, etc.<ref>{{cite book |last1=Taura |first1=Toshiharu |last2=Nagai |first2=Yukari |title=Design Creativity 2010 |date=2011 |publisher=Springer-Verlag London |___location=London |isbn=978-0-85729-223-0 |pages=165–171 |chapter=Comparing Nominal Groups to Real Teams}}</ref>
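
A minimal sketch of such normalization (all figures hypothetical): spending is hard to interpret in isolation, but spending as a share of GDP is comparable across years.

<syntaxhighlight lang="python">
# Hypothetical figures, in billions
gdp = {2010: 15000, 2020: 21000}
spending = {2010: 3500, 2020: 6500}

for year in gdp:
    share = spending[year] / gdp[year]  # normalize by the size of the economy
    print(f"{year}: spending = {spending[year]}, spending/GDP = {share:.1%}")
</syntaxhighlight>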
 
Analysts may also analyze data under different assumptions or scenarios. For example, when analysts perform [[financial statement analysis]], they will often recast the financial statements under different assumptions to help arrive at an estimate of future cash flow, which they then discount to present value based on some interest rate to determine the valuation of the company or its stock.<ref>{{Cite journal|last=Gross|first=William H.|date=July 1979|title=Coupon Valuation and Interest Rate Cycles |journal=Financial Analysts Journal|volume=35|issue=4|pages=68–71 |doi=10.2469/faj.v35.n4.68|issn=0015-198X}}</ref> Similarly, the [[Congressional Budget Office]] (CBO) analyzes the effects of various policy options on the government's revenue, outlays and deficits, creating alternative future scenarios for key measures.<ref>{{Cite web |title=25. General government total outlays|url=http://dx.doi.org/10.1787/888932348795|access-date=2021-06-03|doi=10.1787/888932348795}}</ref>
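
A minimal sketch of discounting a hypothetical cash-flow stream to present value under alternative interest-rate scenarios, using <math>PV = \textstyle\sum_t CF_t/(1+r)^t</math>:

<syntaxhighlight lang="python">
cash_flows = [100.0, 110.0, 121.0]  # hypothetical cash flows in years 1-3

for rate in (0.06, 0.08, 0.10):  # alternative discount-rate scenarios
    pv = sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))
    print(f"rate {rate:.0%}: present value = {pv:.2f}")
</syntaxhighlight>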
| pmid = 30520736
| pmc = 6340702
}} Supplementary file 1. Cross-validation schema. {{doi|10.7554/elife.40224.014|doi-access=free}}</ref> Cross-validation is generally inappropriate, though, if there are correlations within the data, e.g. with [[panel data]].<ref>{{Citation|last=Hsiao|first=Cheng|title=Cross-Sectionally Dependent Panel Data|url=http://dx.doi.org/10.1017/cbo9781139839327.012|work=Analysis of Panel Data|year=2014|pages=327–368|place=Cambridge|publisher=Cambridge University Press|doi=10.1017/cbo9781139839327.012|isbn=978-1-139-83932-7|access-date=2021-06-03}}</ref> Hence other methods of validation sometimes need to be used. For more on this topic, see [[statistical model validation]].<ref>{{Citation|last=Hjorth|first=J.S. Urban|title=Cross validation|date=2017-10-19|url=http://dx.doi.org/10.1201/9781315140056-3|work=Computer Intensive Statistical Methods|pages=24–56|publisher=Chapman and Hall/CRC|doi=10.1201/9781315140056-3|isbn=978-1-315-14005-6|access-date=2021-06-03}}</ref> A minimal cross-validation sketch appears after this list.
* ''[[Sensitivity analysis]]''. A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do that is via [[Bootstrapping (statistics)|bootstrapping]].<ref>{{Cite journal |last1=Sheikholeslami|first1=Razi|last2=Razavi|first2=Saman|last3=Haghnegahdar|first3=Amin|date=2019-10-10|title=What should we do when a model crashes? Recommendations for global sensitivity analysis of Earth and environmental systems models|journal=Geoscientific Model Development|volume=12|issue=10|pages=4275–4296|doi=10.5194/gmd-12-4275-2019|bibcode=2019GMD....12.4275S|s2cid=204900339|issn=1991-9603 |doi-access=free }}</ref>
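
The cross-validation sketch referenced above (synthetic data; scikit-learn is one common choice, not one prescribed by the cited sources):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, independent observations (see the panel-data caveat above)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# 5-fold cross-validation: fit on four folds, score (R^2) on the held-out fold
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("R^2 per fold:", np.round(scores, 3))
</syntaxhighlight>

For correlated observations such as panel data, group-aware splitters (e.g., scikit-learn's <code>GroupKFold</code>) keep related records in the same fold.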
 
 
== Reproducible analysis ==
The typical data analysis workflow involves collecting data, running analyses, creating visualizations, and writing reports. However, this workflow presents challenges, including a separation between analysis scripts and data, as well as a gap between analysis and documentation. Often, the correct order of running scripts is only described informally or resides in the data scientist's memory. The potential for losing this information creates issues for reproducibility.
 
To address these challenges, it is essential to document analysis script content and workflow. Additionally, overall documentation is crucial, as well as providing reports that are understandable by both machines and humans, and ensuring accurate representation of the analysis workflow even as scripts evolve.<ref>{{Cite book |last=Mailund |first=Thomas |title=Beginning Data Science in R 4: Data Analysis, Visualization, and Modelling for the Data Scientist |year=2022 |isbn=978-148428155-0 |edition=2nd}}</ref>
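
As an illustrative sketch (the file names and layout are hypothetical), one way to make the execution order explicit is to encode the whole workflow in a single entry point:

<syntaxhighlight lang="python">
"""Hypothetical pipeline: collect -> analyze -> report, with the order
encoded in code rather than kept in the analyst's memory."""
from pathlib import Path

def collect(raw_dir: Path) -> list[Path]:
    # Step 1: gather the raw data files
    return sorted(raw_dir.glob("*.csv"))

def analyze(files: list[Path]) -> dict:
    # Step 2: run the analyses (placeholder computation)
    return {"files_analyzed": len(files)}

def report(results: dict, out: Path) -> None:
    # Step 3: write a report readable by both humans and machines
    out.write_text(f"files analyzed: {results['files_analyzed']}\n")

if __name__ == "__main__":
    results = analyze(collect(Path("data/raw")))  # hypothetical layout
    report(results, Path("report.txt"))
</syntaxhighlight>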
 
* [[Kaggle]] competitions; the [[Kaggle]] platform is owned and run by [[Google]].<ref>{{cite news |title=The machine learning community takes on the Higgs |url=http://www.symmetrymagazine.org/article/july-2014/the-machine-learning-community-takes-on-the-higgs/ |access-date=14 January 2015|newspaper=Symmetry Magazine|date=July 15, 2014|archive-date=16 April 2021|archive-url=https://web.archive.org/web/20210416100455/https://www.symmetrymagazine.org/article/july-2014/the-machine-learning-community-takes-on-the-higgs|url-status=live}}</ref>
* [[LTPP International Data Analysis Contest|LTPP data analysis contest]]<ref>{{cite web |date = May 26, 2016 |url = https://www.fhwa.dot.gov/research/tfhrc/programs/infrastructure/pavements/ltpp/ |title = Data.Gov:Long-Term Pavement Performance (LTPP) |access-date = November 10, 2017 |archive-date = November 1, 2017 |archive-url = https://web.archive.org/web/20171101191727/https://www.fhwa.dot.gov/research/tfhrc/programs/infrastructure/pavements/ltpp/ |url-status = live }}</ref> held by [[FHWA]] and [[ASCE]].<ref name="Nehme 2016-09-29">{{cite web |first = Jean |last = Nehme |date = September 29, 2016 |url = https://www.fhwa.dot.gov/research/tfhrc/programs/infrastructure/pavements/ltpp/2016_2017_asce_ltpp_contest_guidelines.cfm |title = LTPP International Data Analysis Contest |publisher = Federal Highway Administration |access-date = October 22, 2017 |archive-date = October 21, 2017 |archive-url = https://web.archive.org/web/20171021010012/https://www.fhwa.dot.gov/research/tfhrc/programs/infrastructure/pavements/ltpp/2016_2017_asce_ltpp_contest_guidelines.cfm |url-status = live }}</ref>
 
==See also==