Predictive analytics: Difference between revisions

revert to last version not corrupted by obvious WP:COI edits and general LinkedIn-level shenanigans
restore good edits done in the interim since the reverted-to version
Line 13:
Predictive analytics is often defined as predicting at a more detailed level of granularity, i.e., generating predictive scores (probabilities) for each individual organizational element. This distinguishes it from [[forecasting]]. For example, "Predictive analytics—Technology that learns from experience (data) to predict the future behavior of individuals in order to drive better decisions."<ref>{{Cite book |last=Siegel |first=Eric |title=Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (1st ed.) |publisher=[[Wiley (publisher)|Wiley]] |year=2013 |isbn=978-1-1183-5685-2 |language=English}}</ref> In future industrial systems, the value of predictive analytics will be to predict and prevent potential issues to achieve near-zero break-down and further be integrated into [[prescriptive analytics]] for decision optimization.<ref>{{Cite book |last=Spalek |first=Seweryn |title=Data Analytics in Project Management |publisher=Taylor & Francis Group, LLC |year=2019 |language=English}}</ref>
 
== Big data ==
While there is no universal definition of big data, most definitions refer to processing a large set of data points to produce a finished product. Big data analytics comes into play when a dataset is too large to be analyzed with traditional techniques. However, size is not the only factor that defines big data.
 
Gartner's definition of big data is useful in explaining the defining properties of big data: "Big data is high-volume, high-velocity and/or high variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."<ref name=":2">{{Cite web |title=Definition of Big Data - Gartner Information Technology Glossary |url=https://www.gartner.com/en/information-technology/glossary/big-data |access-date=2022-04-28 |website=Gartner |language=en-US}}</ref> These properties are sometimes referred to as the 3 Vs of big data.
 
Volume refers to the size of the data. There is no universal criterion for the size at which a dataset becomes "big", because size is relative: one firm might consider terabytes of data to be big data, while another uses a larger unit of storage, such as a petabyte or an exabyte, as its criterion.
 
Velocity refers to the speed at which data is created, stored, and analyzed. Batch processing was traditionally used to process large blocks of data, but it is slow and useful only when decisions do not depend on fast data processing. Modern markets, however, require real-time processing for effective decision making in fast-changing, competitive environments.
 
Variety refers to the different types of data: structured, semi-structured, or unstructured. "Structured data is data that adheres to a predefined data model and is therefore straightforward to analyze."<ref name=":2" /> Structured data generally has rows and columns that can be sorted and searched with basic techniques; spreadsheets and relational databases are typical examples. Unstructured data, by contrast, does not adhere to a predefined data model and has no columns or rows to organize it. This makes unstructured data more difficult to process than structured data, which can be handled with traditional tools such as Excel and SQL. Examples of unstructured data include emails, PDF files, and Google searches. Storing and processing unstructured data has become much easier in recent years due to programs like Power BI and Tableau.
 
"Semi-structured data lies in between structured and unstructured data. It does not adhere to a formal data structure yet does contain tags and other markers to organize the data."<ref name=":1">{{Cite book |last1=McCarthy |first1=Richard |title=Applying Predictive Analytics: Finding Value in Data |last2=McCarthy |first2=Mary |last3=Ceccucci |first3=Wendy |publisher=Springer |year=2021}}</ref> The semi-structured category of data is much easier to analyze than unstructured data. Many big data tools can 'read' and process semi-structured forms of data like XML or JSON files.
 
The volume, variety, and velocity of big data have introduced challenges across the board for capture, storage, search, sharing, analysis, and visualization. Examples of big data sources include [[web log]]s, [[RFID]], [[Sensor network|sensor]] data, [[social network]]s, Internet search indexing, call detail records, military surveillance, and complex data in astronomy, biogeochemistry, genomics, and atmospheric sciences. Thanks to technological advances in computer hardware (faster CPUs, cheaper memory, and [[Massive parallel processing|MPP]] architectures) and new technologies such as [[Hadoop]], [[MapReduce]], and [[In-database processing|in-database]] and [[text analytics]] for processing big data, it is now feasible to collect, analyze, and mine massive amounts of structured and [[unstructured data]] for new insights. It is also possible to run predictive algorithms on streaming data. Today, exploring big data and using predictive analytics is within reach of more organizations than ever before, and new methods capable of handling such datasets continue to be proposed.
 
== Analytical techniques ==
The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques.
 
=== Machine learning ===
{{Main|Machine learning}}
Machine learning can be defined as the ability of a machine to learn and then mimic human behavior that requires intelligence. This is accomplished through artificial intelligence, algorithms, and models.<ref>{{Cite web |title=Machine learning, explained |url=https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained |access-date=2022-05-06 |website=MIT Sloan |language=en}}</ref>
 
Line 39 ⟶ 25:
Exponential smoothing models are one example of an ARIMA method. Exponential smoothing accounts for the difference in importance between older and newer observations, on the premise that more recent data is more valuable for predicting future values. To accomplish this, it assigns exponentially decreasing weights to observations as they get older, so that newer observations carry more weight in the calculation than older ones.<ref>{{Cite web |title=6.4.3. What is Exponential Smoothing? |url=https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc43.htm |access-date=2022-05-06 |website=www.itl.nist.gov}}</ref>
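
The weighting idea can be illustrated with a minimal Python sketch of simple (single) exponential smoothing. The function name, the smoothing constant <code>alpha</code>, the choice to initialize with the first observation, and the sample values are all assumptions made for illustration; other initializations and variants exist.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing (illustrative sketch).

    s[0] = values[0]
    s[t] = alpha * values[t] + (1 - alpha) * s[t - 1]

    Unrolling the recursion shows that the observation k steps in the
    past receives weight alpha * (1 - alpha) ** k, so older observations
    count exponentially less.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    if not values:
        return []
    smoothed = [float(values[0])]  # assumed initialization: first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


observations = [10.0, 12.0, 11.0, 15.0]  # hypothetical data
smoothed = exponential_smoothing(observations, alpha=0.5)
```

A larger <code>alpha</code> makes the smoothed series react faster to recent changes; a smaller one smooths more heavily.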
 
==== Time series models ====
Time series models are a subset of machine learning that uses time series to understand and forecast data from past values. A time series is the sequence of a variable's values over equally spaced periods, such as years or quarters in business applications.<ref>{{Cite web |title=6.4.1. Definitions, Applications and Techniques |url=https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc41.htm |access-date=2022-05-06 |website=www.itl.nist.gov}}</ref> To accomplish this, the data must be smoothed; that is, the random variation in the data must be removed in order to reveal trends. There are multiple ways to do this.
 
===== Single moving average =====
Single moving average methods compute the mean of successive smaller subsets of past data, rather than a single mean of the entire data set. This reduces the error associated with taking a single overall average and gives a more accurate estimate.<ref>{{Cite web |title=6.4.2.1. Single Moving Average |url=https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc421.htm |access-date=2022-05-06 |website=www.itl.nist.gov}}</ref>
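
As a rough sketch, a single moving average over a fixed window can be computed as follows. The function name, the <code>window</code> parameter, and the sample values are hypothetical and chosen only for illustration.

```python
def single_moving_average(values, window):
    """Return the mean of each run of `window` consecutive values.

    The i-th result averages values[i : i + window], so the output has
    len(values) - window + 1 entries.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


data = [9, 8, 9, 12, 9]  # hypothetical observations
averages = single_moving_average(data, window=3)
```

Each entry summarizes only the most recent <code>window</code> observations at that point, which is why the series tracks local changes that a single overall mean would hide.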
 
===== Centered moving average =====
Centered moving average methods build on the single moving average by placing each average at the middle time period of the data it covers. Because a set with an even number of periods has no single middle period, this method works more directly with odd-numbered sets than with even-numbered ones.<ref>{{Cite web |title=6.4.2.2. Centered Moving Average |url=https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc422.htm |access-date=2022-05-06 |website=www.itl.nist.gov}}</ref>
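
The odd/even distinction can be sketched in Python. For an even window, the sketch below averages two adjacent moving averages so that the result lines up with an actual time period; this is one common convention and is an assumption here, as are the function name and the sample values.

```python
def centered_moving_average(values, window):
    """Return (center_index, average) pairs for a centered moving average.

    Odd window: each average is placed at the middle index of its window.
    Even window: a window has no middle index, so two adjacent averages
    are averaged again, which centers the result on a whole index.
    """
    n = len(values)
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and len(values)")
    means = [sum(values[i:i + window]) / window for i in range(n - window + 1)]
    half = window // 2
    if window % 2 == 1:
        return [(i + half, m) for i, m in enumerate(means)]
    # Even window: means[i] is centered at i + (window - 1) / 2 and
    # means[i + 1] one step later, so their average is centered at i + half.
    return [(i + half, (means[i] + means[i + 1]) / 2)
            for i in range(len(means) - 1)]


data = [9, 8, 9, 12, 9, 12, 11]  # hypothetical observations
odd_result = centered_moving_average(data, window=3)
even_result = centered_moving_average(data, window=4)
```

With an odd window the averages need no second pass, which is the sense in which the method fits odd-numbered sets more naturally.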
 
=== Predictive modeling ===
{{Main|Predictive modelling}}
Predictive modeling is a statistical technique used to predict future behavior. It uses predictive models to analyze the relationship between a specific unit in a given sample and one or more features of that unit. The objective of these models is to assess the likelihood that a unit in another sample will display the same pattern. Predictive model solutions can be considered a type of data mining technology. The models can analyze both historical and current data and generate a model in order to predict potential future outcomes.<ref name=":1">{{Cite book |last1=McCarthy |first1=Richard |title=Applying Predictive Analytics: Finding Value in Data |last2=McCarthy |first2=Mary |last3=Ceccucci |first3=Wendy |publisher=Springer |year=2021}}</ref>
 
Regardless of the methodology used, the process of creating predictive models generally involves the same steps. First, determine the project objectives and desired outcomes and translate these into predictive analytic objectives and tasks. Then, analyze the source data to determine the most appropriate data and model building approach (models are only as useful as the applicable data used to build them). Select and transform the data in order to create models. Create and test models in order to evaluate whether they are valid and will be able to meet project goals and metrics. Apply the model's results to appropriate business processes (identifying patterns in the data doesn't necessarily mean a business will understand how to take advantage of or capitalize on them). Afterward, manage and maintain models in order to standardize and improve performance (demand will increase for model management in order to meet new compliance regulations).<ref name=":4" />
 
=== Regression analysis ===
{{Main|Regression analysis}}
Generally, regression analysis uses structural data along with the past values of independent variables and the relationship between them and the dependent variable to form predictions.<ref name=":0" />
 
==== Linear regression ====
{{Main|Linear regression}}
In [[linear regression]], a plot is constructed with the previous values of the dependent variable on the Y-axis and the independent variable being analyzed on the X-axis. A statistical program then constructs a regression line representing the relationship between the independent and dependent variables, which can be used to predict values of the dependent variable based only on the independent variable. Along with the regression line, the program also reports a slope-intercept equation for the line that includes a term for the regression error; the larger the error term, the less precise the regression model. To decrease the error term, other independent variables are introduced to the model, and similar analyses are performed on them.<ref name=":0" /><ref>{{Cite web |title=Linear Regression |url=http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm |access-date=2022-05-06 |website=www.stat.yale.edu}}</ref>
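
A minimal Python sketch of fitting such a line with one independent variable by ordinary least squares is shown below. Ordinary least squares is assumed as the fitting method, and the function names and sample values are hypothetical; statistical packages typically perform this computation internally.

```python
def fit_simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Observed minus predicted values: the sample errors of the fit."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


x_values = [1, 2, 3, 4]  # hypothetical independent variable
y_values = [2, 4, 5, 8]  # hypothetical dependent variable
slope, intercept = fit_simple_linear_regression(x_values, y_values)
errors = residuals(x_values, y_values, slope, intercept)
```

The residuals are the sample counterpart of the error term: the smaller they are overall, the more closely the line describes the data.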
 
== Applications ==
 
=== Analytical Review and Conditional Expectations in Auditing ===
{{Main|ARIMA}}
Analytical review is an important aspect of auditing. In analytical review, the auditor determines the reasonableness of the reported account balances being investigated. Auditors accomplish this by using predictive modeling to form predictions, called conditional expectations, of the balances being audited, using autoregressive integrated moving average (ARIMA) methods and general regression analysis methods,<ref name=":0" /> specifically through the Statistical Technique for Analytical Review (STAR) methods.<ref name=":3">{{Cite journal |last1=Kinney |first1=William R. |last2=Salamon |first2=Gerald L. |date=1982 |title=Regression Analysis in Auditing: A Comparison of Alternative Investigation Rules |journal=Journal of Accounting Research |volume=20 |issue=2 |pages=350–366 |doi=10.2307/2490745 |jstor=2490745 |issn=0021-8456}}</ref>
 
Line 82 ⟶ 72:
=== Child protection ===
Some child welfare agencies have started using predictive analytics to flag high-risk cases.<ref>{{Cite web |last=Reform |first=Fostering |date=2016-02-03 |title=New Strategies Long Overdue on Measuring Child Welfare Risk |url=https://imprintnews.org/blogger-co-op/new-strategies-long-overdue-measuring-child-welfare-risk/15442 |access-date=2022-05-03 |website=The Imprint |language=en-US}}</ref> For example, in [[Hillsborough County, Florida]], the child welfare agency's use of a predictive modeling tool has prevented abuse-related child deaths in the target population.<ref>{{Cite journal |date=2016 |title=Within Our Reach: A National Strategy to Eliminate Child Abuse and Neglect Fatalities |url=https://www.acf.hhs.gov/sites/default/files/documents/cb/cecanf_final_report.pdf |journal=Commission to Eliminate Child Abuse and Neglect Fatalities}}</ref>
 
=== Clinical decision support systems ===
Predictive analytics has found use in health care, primarily to determine which patients are at risk of developing conditions such as diabetes, asthma, or heart disease. Additionally, sophisticated [[clinical decision support system]]s incorporate predictive analytics to support medical decision making.
 
A 2016 study of [[Neurodegeneration|neurodegenerative disorders]] provides a powerful example of a CDS platform to diagnose, track, predict and monitor the progression of [[Parkinson's disease]].<ref>{{Cite journal |last1=Dinov |first1=Ivo D. |last2=Heavner |first2=Ben |last3=Tang |first3=Ming |last4=Glusman |first4=Gustavo |last5=Chard |first5=Kyle |last6=Darcy |first6=Mike |last7=Madduri |first7=Ravi |last8=Pa |first8=Judy |last9=Spino |first9=Cathie |last10=Kesselman |first10=Carl |last11=Foster |first11=Ian |date=2016-08-05 |title=Predictive Big Data Analytics: A Study of Parkinson's Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations |journal=[[PLOS ONE]] |volume=11 |issue=8 |pages=e0157077 |doi=10.1371/journal.pone.0157077 |issn=1932-6203 |pmc=4975403 |pmid=27494614 |bibcode=2016PLoSO..1157077D |doi-access=free}}</ref>
 
=== Predicting outcomes of legal decisions ===
Line 101 ⟶ 86:
* [[Artificial intelligence in healthcare]]
* [[Analytical procedures (finance auditing)]]
* [[Big data]]
* [[Computational sociology]]
* [[Criminal Reduction Utilising Statistical History]]