===Data processing===
[[File:Relationship of data, information and intelligence.png|thumb|350px|The phases of the [[intelligence cycle]] used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.]]
Data, when initially obtained, must be processed or organized for analysis.<ref>{{Cite book|last=Nelson|first=Stephen L.|title=Excel data analysis for dummies|date=2014|publisher=Wiley|isbn=978-1-118-89810-9|oclc=877772392}}</ref><ref>{{Cite journal|title=Figure 3—source data 1. Raw and processed values were obtained through qPCR.|date=30 August 2017|doi=10.7554/elife.28468.029 |doi-access=free }}</ref> For instance, this may involve placing data into rows and columns in a table format (known as [[data model|structured data]]) for further analysis, often through the use of spreadsheet software (such as Excel) or statistical software.<ref name="Schutt & O'Neil"/>
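
As a minimal sketch of this processing step, the following Python example (using the pandas library; the records and field names are invented for illustration) organizes raw records into a table of typed rows and columns:
<syntaxhighlight lang="python">
import pandas as pd

# Raw records as they might arrive from a survey or a log file.
raw_records = [
    {"respondent": 1, "age": "34", "income": "52000"},
    {"respondent": 2, "age": "41", "income": "61000"},
]

# Organize the records into rows and columns (structured data).
df = pd.DataFrame(raw_records)

# Convert text fields to numeric types so statistics can be computed.
df["age"] = pd.to_numeric(df["age"])
df["income"] = pd.to_numeric(df["income"])

print(df.describe())
</syntaxhighlight>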
 
===Data cleaning===
{{Main|Data cleansing}}
Once processed and organized, the data may be incomplete, contain duplicates, or contain errors.<ref name="Bohannon">{{Cite journal|last=Bohannon|first=John|date=2016-02-24|title=Many surveys, about one in five, may contain fraudulent data|journal=Science|doi=10.1126/science.aaf4104|issn=0036-8075|doi-access=free}}</ref><ref>{{Cite book|first1=Garber|last1=Jeannie Scruggs|last2=Gross|first2=Monty|last3=Slonim|first3=Anthony D.|title=Avoiding common nursing errors|date=2010|publisher=Wolters Kluwer Health/Lippincott Williams & Wilkins|isbn=978-1-60547-087-0|oclc=338288678}}</ref> The need for ''data cleaning'' arises from problems in the way that the data are entered and stored.<ref name="Bohannon"/> Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracies, assessing the overall quality of existing data, deduplication, and column segmentation.<ref>{{cite web|title=Data Cleaning|url=http://research.microsoft.com/en-us/projects/datacleaning/|publisher=Microsoft Research|access-date=26 October 2013|archive-date=29 October 2013|archive-url=https://web.archive.org/web/20131029200356/http://research.microsoft.com/en-us/projects/datacleaning/|url-status=live}}</ref> Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers that are believed to be reliable.<ref>{{Cite journal|last1=Hancock|first1=R.G.V.|last2=Carter|first2=Tristan|date=February 2010|title=How reliable are our published archaeometric analyses? Effects of analytical techniques through time on the elemental analysis of obsidians|url=http://dx.doi.org/10.1016/j.jas.2009.10.004|journal=Journal of Archaeological Science|volume=37|issue=2|pages=243–250|doi=10.1016/j.jas.2009.10.004|bibcode=2010JArSc..37..243H |issn=0305-4403}}</ref><ref name="Koomey1">{{Cite web |url=http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf |title=Perceptual Edge-Jonathan Koomey-Best practices for understanding quantitative data-February 14, 2006 |access-date=November 12, 2014 |archive-date=October 5, 2014 |archive-url=https://web.archive.org/web/20141005075112/http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf |url-status=live }}</ref> Unusual amounts, above or below predetermined thresholds, may also be reviewed.
There are several types of data cleaning, depending on the type of data in the set; this could be phone numbers, email addresses, employers, or other values.<ref>{{Cite journal|last1=Peleg|first1=Roni|last2=Avdalimov|first2=Angelika|last3=Freud|first3=Tamar|date=2011-03-23|title=Providing cell phone numbers and email addresses to Patients: the physician's perspective|journal=BMC Research Notes|volume=4|issue=1|page=76|doi=10.1186/1756-0500-4-76|pmid=21426591|issn=1756-0500|pmc=3076270 |doi-access=free }}</ref><ref>{{Cite book|last=Goodman|first=Lenn Evan|title=Judaism, human rights, and human values|date=1998|publisher=Oxford University Press|isbn=0-585-24568-1|oclc=45733915}}</ref> Quantitative methods for outlier detection can be used to remove data that appear likely to have been entered incorrectly.<ref>{{Cite journal|title=Blind joint maximum likelihood channel estimation and data detection for single-input multiple-output systems|last=Hanzo|first=Lajos|url=http://dx.doi.org/10.1049/iet-tv.44.786|access-date=2021-05-29|doi=10.1049/iet-tv.44.786|url-access=subscription}}</ref> Spell checkers can be used to reduce the number of mistyped words in textual data, but it is harder to tell whether the words themselves are correct.<ref>{{cite journal|last=Hellerstein|first=Joseph|title=Quantitative Data Cleaning for Large Databases|journal=EECS Computer Science Division|date=27 February 2008|page=3|url=http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|access-date=26 October 2013|archive-date=13 October 2013|archive-url=https://web.archive.org/web/20131013011223/http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|url-status=live}}</ref>
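
A hedged illustration of some of these cleaning tasks in Python with pandas (the records, validation pattern, and threshold below are all hypothetical):
<syntaxhighlight lang="python">
import pandas as pd

df = pd.DataFrame({
    "phone": ["555-0101", "555-0101", "555-0199", "not-a-number"],
    "amount": [120.0, 120.0, 135.5, 9_999_999.0],
})

# Deduplication: drop exact duplicate records.
df = df.drop_duplicates()

# Validation: flag values that do not match an expected pattern.
valid_phone = df["phone"].str.fullmatch(r"\d{3}-\d{4}")

# Outlier review: flag amounts above a predetermined threshold.
within_threshold = df["amount"] < 10_000

# Keep only records that pass both checks.
df_clean = df[valid_phone & within_threshold]
print(df_clean)
</syntaxhighlight>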
 
===Exploratory data analysis===
Author [[Jonathan Koomey]] has recommended a series of best practices for understanding quantitative data.<ref>{{Cite journal|date=2008-10-01|title=Recommended Best Practices|url=http://dx.doi.org/10.14217/9781848590151-8-en|access-date=2021-06-03|doi=10.14217/9781848590151-8-en}}</ref> These include:
*Check raw data for anomalies prior to performing an analysis;
*Re-perform important calculations, such as verifying columns of data that are formula-driven;
*Confirm main totals are the sum of subtotals;
*Check relationships between numbers that should be related in a predictable way, such as ratios over time (a sketch of such checks follows below);
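
A brief sketch of two of these checks in Python (the subtotals and reported total are invented for illustration):
<syntaxhighlight lang="python">
import pandas as pd

# Hypothetical quarterly subtotals and a separately reported grand total.
subtotals = pd.Series({"Q1": 250.0, "Q2": 275.0, "Q3": 300.0, "Q4": 325.0})
reported_total = 1150.0

# Confirm the main total is the sum of its subtotals.
assert abs(subtotals.sum() - reported_total) < 1e-9, "total does not match subtotals"

# Check a relationship that should behave predictably:
# the quarter-over-quarter ratio.
ratios = subtotals / subtotals.shift(1)
print(ratios)  # an abrupt jump in these ratios would warrant investigation
</syntaxhighlight>
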
Analysts may use robust statistical measurements to solve certain analytical problems.<ref>{{Cite journal|date=1968-06-03|title=Dual-use car may solve transportation problems|url=http://dx.doi.org/10.1021/cen-v046n024.p044|journal=Chemical & Engineering News Archive|volume=46|issue=24|pages=44|doi=10.1021/cen-v046n024.p044|issn=0009-2347}}</ref> [[Hypothesis testing]] is used when a particular hypothesis about the true state of affairs is made by the analyst and data is gathered to determine whether that state of affairs is true or false.<ref>{{Cite journal|last=Heckman|date=1978|title=Simple Statistical Models for Discrete Panel Data Developed and Applied to Test the Hypothesis of True State Dependence against the Hypothesis of Spurious State Dependence|url=http://dx.doi.org/10.2307/20075292|journal=Annales de l'inséé|issue=30/31|pages=227–269|doi=10.2307/20075292|jstor=20075292|issn=0019-0209}}</ref><ref>{{Cite book|first=Dean|last=Koontz|title=False Memory|date=2017|publisher=Headline Book Publishing|isbn=978-1-4722-4830-5|oclc=966253202}}</ref> For example, the hypothesis might be that "Unemployment has no effect on inflation", which relates to an economics concept called the [[Phillips Curve]].<ref>{{Citation|last=Munday|first=Stephen C. R.|title=Unemployment, Inflation and the Phillips Curve|date=1996|url=http://dx.doi.org/10.1007/978-1-349-24986-2_11|work=Current Developments in Economics|pages=186–218|place=London|publisher=Macmillan Education UK|doi=10.1007/978-1-349-24986-2_11|isbn=978-0-333-64444-7|access-date=2021-06-03}}</ref> Hypothesis testing involves considering the likelihood of [[Type I and type II errors]], which relate to whether the data supports rejecting or failing to reject the hypothesis.<ref>{{Cite journal|last=Louangrath|first=Paul I.|date=2013|title=Alpha and Beta Tests for Type I and Type II Inferential Errors Determination in Hypothesis Testing|url=http://dx.doi.org/10.2139/ssrn.2332756|journal=SSRN Electronic Journal|doi=10.2139/ssrn.2332756|issn=1556-5068}}</ref><ref>{{Cite book|first=Ann M.|last=Walko|title=Rejecting the second generation hypothesis : maintaining Estonian ethnicity in Lakewood, New Jersey|date=2006|publisher=AMS Press|isbn=0-404-19454-0|oclc=467107876}}</ref>
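
As an illustrative sketch, the following Python example (using SciPy; the data are synthetic and generated so that the null hypothesis is in fact true) tests a "no effect" null hypothesis with a simple linear model:
<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic monthly observations: inflation is generated
# independently of unemployment, so the null hypothesis holds.
unemployment = rng.normal(5.0, 0.5, 60)
inflation = 2.0 + rng.normal(0, 0.3, 60)

# Null hypothesis: unemployment has no (linear) effect on inflation.
result = stats.linregress(unemployment, inflation)
print(f"slope={result.slope:.3f}, p-value={result.pvalue:.3f}")

# A large p-value means we fail to reject the null hypothesis.
# Rejecting a true null would be a Type I error; failing to
# reject a false null would be a Type II error.
</syntaxhighlight>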
 
[[Regression analysis]] may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., "To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?").<ref name="Yanamandra 57–68">{{Cite journal|last=Yanamandra|first=Venkataramana|date=September 2015|title=Exchange rate changes and inflation in India: What is the extent of exchange rate pass-through to imports?|url=http://dx.doi.org/10.1016/j.eap.2015.07.004|journal=Economic Analysis and Policy|volume=47|pages=57–68|doi=10.1016/j.eap.2015.07.004|issn=0313-5926}}</ref> This is an attempt to fit an equation, line, or curve to the data, such that Y is a function of X.<ref>{{Cite book|first1=Nawarathna|last1=Mudiyanselage|first2=Pubudu Manoj|last2=Nawarathna|title=Characterization of epigenetic changes and their connection to gene expression abnormalities in clear cell renal cell carcinoma|oclc=1190697848}}</ref><ref>{{Cite journal|title=Appendix 1—figure 5. Curve data is included in Appendix 1—table 4 (solid points) and the theoretical curve by using the Hill equation parameters of Appendix 1—table 5 (curve line).|journal=eLife|date=29 June 2017|volume=6|pages=e25233|doi=10.7554/elife.25233.027|last1=Moreno Delgado|first1=David|last2=Møller|first2=Thor C.|last3=Ster|first3=Jeanne|last4=Giraldo|first4=Jesús|last5=Maurel|first5=Damien|last6=Rovira|first6=Xavier|last7=Scholler|first7=Pauline|last8=Zwier|first8=Jurriaan M.|last9=Perroy|first9=Julie|last10=Durroux|first10=Thierry|last11=Trinquet|first11=Eric|last12=Prezeau|first12=Laurent|last13=Rondard|first13=Philippe|last14=Pin|first14=Jean-Philippe|editor1=Chao, Moses V |doi-access=free }}</ref>
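
A minimal sketch of fitting a line to paired observations in Python (the data points are invented and do not describe any real economy):
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical paired observations of X (unemployment) and Y (inflation).
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5])
y = np.array([3.1, 2.8, 2.6, 2.2, 2.0, 1.7])

# Fit a line y = a*x + b by least squares, so that Y is modeled
# as a function of X.
a, b = np.polyfit(x, y, deg=1)
print(f"inflation ~ {a:.2f} * unemployment + {b:.2f}")

# The fitted line can then be used to predict Y for new values of X.
print("predicted inflation at 5.2% unemployment:", a * 5.2 + b)
</syntaxhighlight>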
 
[[Necessary condition analysis]] (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., "To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?").<ref name="Yanamandra 57–68"/> Whereas (multiple) regression analysis uses additive logic where each X-variable can produce the outcome and the X's can compensate for each other (they are sufficient but not necessary),<ref>{{Cite web|url=https://doi.org/10.1049%2Fiet-tv.48.859|last=Feinmann|first=Jane|title=How Can Engineers and Journalists Help Each Other?|access-date=2021-06-03|doi=10.1049/iet-tv.48.859|url-access=subscription|type=Video|publisher=The Institute of Engineering & Technology}}</ref> necessary condition analysis (NCA) uses necessity logic, where one or more X-variables allow the outcome to exist, but may not produce it (they are necessary but not sufficient). Each single necessary condition must be present and compensation is not possible.<ref>{{Cite journal|last=Dul|first=Jan|date=2015|title=Necessary Condition Analysis (NCA): Logic and Methodology of 'Necessary But Not Sufficient' Causality|url=http://dx.doi.org/10.2139/ssrn.2588480|journal=SSRN Electronic Journal|doi=10.2139/ssrn.2588480|hdl=1765/77890|s2cid=219380122|issn=1556-5068}}</ref>
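
As a loose illustration of necessity logic only (this is not the ceiling-line methodology of NCA proper, just a hypothetical check in Python with invented data):
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical cases: X is the candidate necessary condition,
# Y is the outcome.
x = np.array([2, 3, 4, 5, 6, 7, 8])
y = np.array([1, 2, 2, 4, 5, 5, 7])

# Under necessity logic, high Y should not occur without high X.
# A simple check: among cases with high outcomes, find the minimum
# X observed, a rough estimate of the level of X required.
threshold_y = 4
required_x = x[y >= threshold_y].min()
print(f"every case with Y >= {threshold_y} had X >= {required_x}")
</syntaxhighlight>
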
For example, whether a number is rising or falling may not be the key factor. More important may be the number relative to another number, such as the size of government revenue or spending relative to the size of the economy (GDP) or the amount of cost relative to revenue in corporate financial statements.<ref>{{Cite journal|last1=Gusnaini|first1=Nuriska|last2=Andesto|first2=Rony|last3=Ermawati|date=2020-12-15|title=The Effect of Regional Government Size, Legislative Size, Number of Population, and Intergovernmental Revenue on The Financial Statements Disclosure|url=http://dx.doi.org/10.24018/ejbmr.2020.5.6.651|journal=European Journal of Business and Management Research|volume=5|issue=6|doi=10.24018/ejbmr.2020.5.6.651|s2cid=231675715|issn=2507-1076}}</ref> This numerical technique is referred to as normalization<ref name="Koomey1"/> or common-sizing. There are many such techniques employed by analysts, whether adjusting for inflation (i.e., comparing real vs. nominal data) or considering population increases, demographics, etc.<ref>{{Citation|last1=Linsey|first1=Julie S.|author1-link=Julie Linsey|title=Effectiveness of Brainwriting Techniques: Comparing Nominal Groups to Real Teams|date=2011|url=http://dx.doi.org/10.1007/978-0-85729-224-7_22|work=Design Creativity 2010|pages=165–171|place=London|publisher=Springer London|isbn=978-0-85729-223-0|access-date=2021-06-03|last2=Becker|first2=Blake|doi=10.1007/978-0-85729-224-7_22}}</ref> Analysts apply a variety of techniques to address the various quantitative messages described in the section above.<ref>{{Cite journal|last=Lyon|first=J.|date=April 2006|title=Purported Responsible Address in E-Mail Messages|doi=10.17487/rfc4407|url=http://dx.doi.org/10.17487/rfc4407}}</ref>
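
A short sketch of these normalization techniques in Python (all figures below are hypothetical):
<syntaxhighlight lang="python">
# Hypothetical figures, in billions of the local currency.
spending = {"2022": 900.0, "2023": 980.0}
gdp = {"2022": 21_000.0, "2023": 22_500.0}
cpi = {"2022": 100.0, "2023": 104.0}  # price index, base year 2022

for year in spending:
    # Common-sizing: spending relative to the size of the economy.
    share_of_gdp = spending[year] / gdp[year]
    # Inflation adjustment: nominal spending restated in base-year (real) terms.
    real_spending = spending[year] * cpi["2022"] / cpi[year]
    print(f"{year}: {share_of_gdp:.1%} of GDP, {real_spending:.0f} real")
</syntaxhighlight>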
 
Analysts may also analyze data under different assumptions or scenarios. For example, when analysts perform [[financial statement analysis]], they will often recast the financial statements under different assumptions to help arrive at an estimate of future cash flow, which they then discount to present value based on some interest rate, to determine the valuation of the company or its stock.<ref>{{Cite book|last=Stock|first=Eugene|title=The History of the Church Missionary Society Its Environment, its Men and its Work|date=10 June 2017|publisher=Hansebooks GmbH |isbn=978-3-337-18120-8|oclc=1189626777}}</ref><ref>{{Cite journal|last=Gross|first=William H.|date=July 1979|title=Coupon Valuation and Interest Rate Cycles|url=http://dx.doi.org/10.2469/faj.v35.n4.68|journal=Financial Analysts Journal|volume=35|issue=4|pages=68–71|doi=10.2469/faj.v35.n4.68|issn=0015-198X}}</ref> Similarly, the Congressional Budget Office (CBO) analyzes the effects of various policy options on the government's revenue, outlays and deficits, creating alternative future scenarios for key measures.<ref>{{Cite journal|title=25. General government total outlays|url=http://dx.doi.org/10.1787/888932348795|access-date=2021-06-03|doi=10.1787/888932348795}}</ref>
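
A minimal sketch of scenario-based discounting in Python (the cash flows and discount rate are invented for illustration):
<syntaxhighlight lang="python">
# Hypothetical annual cash-flow scenarios (in millions) and a discount rate.
scenarios = {
    "base":     [100, 105, 110, 115, 120],
    "downside": [100, 95, 90, 85, 80],
}
discount_rate = 0.08

for name, cash_flows in scenarios.items():
    # Discount each year's cash flow back to present value and sum.
    pv = sum(cf / (1 + discount_rate) ** (t + 1)
             for t, cf in enumerate(cash_flows))
    print(f"{name}: present value = {pv:.1f}")
</syntaxhighlight>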
 
==Other topics==
 
===Initial data analysis===
The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question.<ref>{{cite report
| last = Jaech | first = J.L.
| date = April 21, 1960