→Initial data analysis: not really cite journal
I made copyedits
Line 34:
===Data processing===
[[File:Relationship of data, information and intelligence.png|thumb|350px|The phases of the [[intelligence cycle]] used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.]]
Data, when initially obtained, must be processed or organized for analysis.<ref>{{Cite book|last=Nelson|first=Stephen L.|title=Excel data analysis for dummies|date=2014|publisher=Wiley|isbn=978-1-118-89810-9|oclc=877772392}}</ref><ref>{{Cite journal|title=Figure 3—source data 1. Raw and processed values were obtained through qPCR.|date=30 August 2017|doi=10.7554/elife.28468.029 |doi-access=free }}</ref> For instance, this may involve placing data into rows and columns in a table format (known as [[data model|structured data]]) for further analysis, often through the use of spreadsheet software (such as Excel) or statistical software.<ref name="Schutt & O'Neil"/>
===Data cleaning===
{{Main|Data cleansing}}
Once processed and organized, the data may be incomplete, contain duplicates, or contain errors.<ref name="Bohannon">{{Cite journal|last=Bohannon|first=John|date=2016-02-24|title=Many surveys, about one in five, may contain fraudulent data|journal=Science|doi=10.1126/science.aaf4104|issn=0036-8075|doi-access=free}}</ref><ref>{{Cite book|last1=Garber|first1=Jeannie Scruggs|last2=Gross|first2=Monty|last3=Slonim|first3=Anthony D.|title=Avoiding common nursing errors|date=2010|publisher=Wolters Kluwer Health/Lippincott Williams & Wilkins|isbn=978-1-60547-087-0|oclc=338288678}}</ref> The need for ''data cleaning'' will arise from problems in the way that the datum
===Exploratory data analysis===
Line 77:
Author [[Jonathan Koomey]] has recommended a series of best practices for understanding quantitative data.<ref>{{Cite journal|date=2008-10-01|title=Recommended Best Practices|url=http://dx.doi.org/10.14217/9781848590151-8-en|access-date=2021-06-03|doi=10.14217/9781848590151-8-en}}</ref> These include:
*Check raw data for anomalies prior to performing an analysis;
*Re-perform important calculations, such as verifying columns of data that are formula
*Confirm main totals are the sum of subtotals;
*Check relationships between numbers that should be related in a predictable way, such as ratios over time;
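Two of the checks listed above, confirming that totals equal the sum of subtotals and that related numbers keep a predictable ratio over time, can be sketched in Python. This is a minimal illustration only: the function names, the tolerance values, and the sample figures are hypothetical and are not taken from Koomey's text.

```python
import math


def total_matches_subtotals(total, subtotals, rel_tol=1e-9):
    """Return True if the reported total equals the sum of its subtotals."""
    return math.isclose(total, math.fsum(subtotals), rel_tol=rel_tol)


def unstable_ratios(numerators, denominators, max_rel_change=0.25):
    """Return the indices of periods whose ratio differs from the previous
    period's ratio by more than max_rel_change (a hypothetical threshold)."""
    ratios = [n / d for n, d in zip(numerators, denominators)]
    flagged = []
    for i in range(1, len(ratios)):
        previous = ratios[i - 1]
        if abs(ratios[i] - previous) > max_rel_change * abs(previous):
            flagged.append(i)
    return flagged


# Hypothetical figures, used only to exercise the two checks.
subtotals = [120.0, 80.5, 49.5]
reported_total = 250.0
costs = [40.0, 42.0, 44.0, 90.0]
revenue = [100.0, 105.0, 110.0, 112.0]

totals_ok = total_matches_subtotals(reported_total, subtotals)
suspect_periods = unstable_ratios(costs, revenue)
```

A flagged period is not necessarily an error; it only marks a value that an analyst may want to re-check against the raw data.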
Line 88:
Analysts may use robust statistical measurements to solve certain analytical problems.<ref>{{Cite journal|date=1968-06-03|title=Dual-use car may solve transportation problems|url=http://dx.doi.org/10.1021/cen-v046n024.p044|journal=Chemical & Engineering News Archive|volume=46|issue=24|pages=44|doi=10.1021/cen-v046n024.p044|issn=0009-2347}}</ref> [[Hypothesis testing]] is used when a particular hypothesis about the true state of affairs is made by the analyst and data is gathered to determine whether that state of affairs is true or false.<ref>{{Cite journal|last=Heckman|date=1978|title=Simple Statistical Models for Discrete Panel Data Developed and Applied to Test the Hypothesis of True State Dependence against the Hypothesis of Spurious State Dependence|url=http://dx.doi.org/10.2307/20075292|journal=Annales de l'inséé|issue=30/31|pages=227–269|doi=10.2307/20075292|jstor=20075292|issn=0019-0209}}</ref><ref>{{Cite book|first=Dean|last=Koontz|title=False Memory|date=2017|publisher=Headline Book Publishing|isbn=978-1-4722-4830-5|oclc=966253202}}</ref> For example, the hypothesis might be that "Unemployment has no effect on inflation", which relates to an economics concept called the [[Phillips Curve]].<ref>{{Citation|last=Munday|first=Stephen C. R.|title=Unemployment, Inflation and the Phillips Curve|date=1996|url=http://dx.doi.org/10.1007/978-1-349-24986-2_11|work=Current Developments in Economics|pages=186–218|place=London|publisher=Macmillan Education UK|doi=10.1007/978-1-349-24986-2_11|isbn=978-0-333-64444-7|access-date=2021-06-03}}</ref> Hypothesis testing involves considering the likelihood of [[Type I and type II errors]], which relate to whether the data supports accepting or rejecting the hypothesis.<ref>{{Cite journal|last=Louangrath|first=Paul I.|date=2013|title=Alpha and Beta Tests for Type I and Type II Inferential Errors Determination in Hypothesis Testing|url=http://dx.doi.org/10.2139/ssrn.2332756|journal=SSRN Electronic Journal|doi=10.2139/ssrn.2332756|issn=1556-5068}}</ref><ref>{{Cite book|first=Ann M.|last=Walko|title=Rejecting the second generation hypothesis : maintaining Estonian ethnicity in Lakewood, New Jersey|date=2006|publisher=AMS Press|isbn=0-404-19454-0|oclc=467107876}}</ref>
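The trade-off between the two error types can be illustrated with a small Python sketch. It assumes a one-sided z-test with a known standard deviation, which is a simplification chosen for illustration and not a method described in the cited sources; the function name and all numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_error_rates(mu0, mu1, sigma, n, alpha=0.05):
    """One-sided z-test of H0: mean == mu0 against H1: mean == mu1 (mu1 > mu0),
    assuming a known standard deviation sigma and sample size n.

    Returns (critical_sample_mean, type_i_rate, type_ii_rate).
    """
    if mu1 <= mu0:
        raise ValueError("this sketch assumes mu1 > mu0")
    standard_error = sigma / math.sqrt(n)
    null_dist = NormalDist(mu0, standard_error)
    alt_dist = NormalDist(mu1, standard_error)
    # H0 is rejected when the sample mean exceeds this value.
    critical = null_dist.inv_cdf(1 - alpha)
    # Type I error: rejecting H0 although H0 is true.
    type_i = 1 - null_dist.cdf(critical)
    # Type II error: failing to reject H0 although H1 is true.
    type_ii = alt_dist.cdf(critical)
    return critical, type_i, type_ii


# Hypothetical inputs: a larger sample lowers the type II error rate
# while the type I error rate stays fixed at alpha.
small_sample = z_test_error_rates(mu0=0.0, mu1=0.5, sigma=1.0, n=25)
large_sample = z_test_error_rates(mu0=0.0, mu1=0.5, sigma=1.0, n=100)
```

Under these assumptions the type I rate is set by the analyst through `alpha`, while the type II rate depends on the sample size and on how far the alternative lies from the null hypothesis.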
[[Regression analysis]] may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., "To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?").<ref name="Yanamandra 57–68">{{Cite journal|last=Yanamandra|first=Venkataramana|date=September 2015|title=Exchange rate changes and inflation in India: What is the extent of exchange rate pass-through to imports?|url=http://dx.doi.org/10.1016/j.eap.2015.07.004|journal=Economic Analysis and Policy|volume=47|pages=57–68|doi=10.1016/j.eap.2015.07.004|issn=0313-5926}}</ref> This is an attempt to model or fit an equation line or curve to the data, such that Y is a function of X.<ref>{{Cite book|first1=Nawarathna|last1=Mudiyanselage|first2=Pubudu Manoj|last2=Nawarathna|title=Characterization of epigenetic changes and their connection to gene expression abnormalities in clear cell renal cell carcinoma|oclc=1190697848}}</ref><ref>{{Cite journal|title=Appendix 1—figure 5. Curve data is included in Appendix 1—table 4 (solid points) and the theoretical curve by using the Hill equation parameters of Appendix 1—table 5 (curve line).|journal=eLife|date=29 June 2017|volume=6|pages=e25233|doi=10.7554/elife.25233.027|last1=Moreno Delgado|first1=David|last2=Møller|first2=Thor C.|last3=Ster|first3=Jeanne|last4=Giraldo|first4=Jesús|last5=Maurel|first5=Damien|last6=Rovira|first6=Xavier|last7=Scholler|first7=Pauline|last8=Zwier|first8=Jurriaan M.|last9=Perroy|first9=Julie|last10=Durroux|first10=Thierry|last11=Trinquet|first11=Eric|last12=Prezeau|first12=Laurent|last13=Rondard|first13=Philippe|last14=Pin|first14=Jean-Philippe|editor1=Chao, Moses V |doi-access=free }}</ref>
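As a minimal sketch of fitting a straight line so that Y is a function of X, the following Python code computes an ordinary least squares fit for one independent variable. The function name is hypothetical and the unemployment and inflation figures are invented for illustration; they are not real economic data.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = fmean(x), fmean(y)
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    if sum_sq_x == 0:
        raise ValueError("x must contain at least two distinct values")
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sum_xy / sum_sq_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented figures: unemployment rate (X) and inflation rate (Y), in percent.
unemployment = [3.0, 4.0, 5.0, 6.0, 7.0]
inflation = [4.0, 3.5, 3.0, 2.5, 2.0]

slope, intercept = fit_line(unemployment, inflation)
```

The slope estimates how much Y changes for a one-unit change in X; a fitted slope by itself does not establish that X causes Y.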
[[Necessary condition analysis]] (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., "To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?").<ref name="Yanamandra 57–68"/> Whereas (multiple) regression analysis uses additive logic where each X-variable can produce the outcome and the X's can compensate for each other (they are sufficient but not necessary),<ref>{{Cite web|url=https://doi.org/10.1049%2Fiet-tv.48.859|last=Feinmann|first=Jane|title=How Can Engineers and Journalists Help Each Other?|access-date=2021-06-03|doi=10.1049/iet-tv.48.859|url-access=subscription|type=Video|publisher=The Institute of Engineering & Technology}}</ref> necessary condition analysis (NCA) uses necessity logic, where one or more X-variables allow the outcome to exist, but may not produce it (they are necessary but not sufficient). Each single necessary condition must be present and compensation is not possible.<ref>{{Cite journal|last=Dul|first=Jan|date=2015|title=Necessary Condition Analysis (NCA): Logic and Methodology of 'Necessary But Not Sufficient' Causality|url=http://dx.doi.org/10.2139/ssrn.2588480|journal=SSRN Electronic Journal|doi=10.2139/ssrn.2588480|hdl=1765/77890|s2cid=219380122|issn=1556-5068}}</ref>
Line 210:
For example, whether a number is rising or falling may not be the key factor. More important may be the number relative to another number, such as the size of government revenue or spending relative to the size of the economy (GDP) or the amount of cost relative to revenue in corporate financial statements.<ref>{{Cite journal|last1=Gusnaini|first1=Nuriska|last2=Andesto|first2=Rony|last3=Ermawati|date=2020-12-15|title=The Effect of Regional Government Size, Legislative Size, Number of Population, and Intergovernmental Revenue on The Financial Statements Disclosure|url=http://dx.doi.org/10.24018/ejbmr.2020.5.6.651|journal=European Journal of Business and Management Research|volume=5|issue=6|doi=10.24018/ejbmr.2020.5.6.651|s2cid=231675715|issn=2507-1076}}</ref> This numerical technique is referred to as normalization<ref name="Koomey1"/> or common-sizing. There are many such techniques employed by analysts, whether adjusting for inflation (i.e., comparing real vs. nominal data) or considering population increases, demographics, etc.<ref>{{Citation|last1=Linsey|first1=Julie S.|author1-link=Julie Linsey|title=Effectiveness of Brainwriting Techniques: Comparing Nominal Groups to Real Teams|date=2011|url=http://dx.doi.org/10.1007/978-0-85729-224-7_22|work=Design Creativity 2010|pages=165–171|place=London|publisher=Springer London|isbn=978-0-85729-223-0|access-date=2021-06-03|last2=Becker|first2=Blake|doi=10.1007/978-0-85729-224-7_22}}</ref> Analysts apply a variety of techniques to address the various quantitative messages described in the section above.<ref>{{Cite journal|last=Lyon|first=J.|date=April 2006|title=Purported Responsible Address in E-Mail Messages|doi=10.17487/rfc4407|url=http://dx.doi.org/10.17487/rfc4407}}</ref>
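The two adjustments mentioned above, expressing a number relative to another number and converting nominal values to real values, can be sketched in Python as follows. The function names, the base index of 100, and all figures are hypothetical and serve only to illustrate the arithmetic.

```python
def as_share(values, bases):
    """Express each value as a fraction of its base,
    e.g. government spending relative to GDP."""
    return [value / base for value, base in zip(values, bases)]


def to_real(nominal, price_index, base_index=100.0):
    """Convert nominal amounts to real amounts using a price index,
    where base_index is the index value of the reference period."""
    return [amount * base_index / index
            for amount, index in zip(nominal, price_index)]


# Hypothetical figures for two periods.
spending = [200.0, 220.0]
gdp = [1000.0, 1100.0]
price_index = [100.0, 110.0]

spending_share = as_share(spending, gdp)        # spending relative to GDP
real_spending = to_real(spending, price_index)  # spending in base-period prices
```

In this invented example nominal spending rises, yet both its share of GDP and its inflation-adjusted value are unchanged, which is the kind of distinction normalization is meant to expose.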
Analysts may also analyze data under different assumptions or
==Other topics==
Line 241:
===Initial data analysis===
The most important distinction between the initial data analysis phase and the main analysis phase
| last = Jaech | first = J.L.
| date = April 21, 1960
|