In-database processing

'''In-database processing''', sometimes called '''in-database analytics''', is the integration of data [[analytics]] into [[data warehousing]] functionality. Many large databases, such as those used for [[credit card fraud]] [[fraud detection|detection]] and [[investment bank]] [[risk management]], use this technology because it provides significant performance improvements over traditional methods.<ref>{{citation|title=What Is In-Database Processing?|url=http://www.wisegeek.com/what-is-in-database-processing.htm|publisher=Wise Geek|accessdate=May 14, 2012}}</ref>
 
==History==
Traditional approaches to data analysis require data to be moved out of the database into a separate analytics environment for processing, and then back to the database ([[SAS]] and [[SPSS]] from [[IBM]] are examples of tools that still do this). Performing the analysis in the data warehouse itself, where the data resides, eliminates the costs, time and security issues associated with moving the data.<ref name="DBTA">{{citation|last=Das|first=Joydeep|title=Adding Competitive Muscle with In-Database Analytics|url=http://www.dbta.com/Articles/Editorial/Trends-and-Applications/Adding-Competitive-Muscle-with-In-Database-Analytics-67126.aspx|publisher=Database Trends & Applications|date=May 10, 2010}}</ref>
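The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration only, using Python's built-in <code>sqlite3</code> module as a stand-in for a data warehouse; the <code>sales</code> table, its columns and the function names are hypothetical and do not come from any vendor's product.

```python
import sqlite3


def make_db():
    """Create a small in-memory table standing in for warehouse data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("east", 30.0), ("west", 5.0)],
    )
    return conn


def average_outside_database(conn):
    """Traditional approach: every row is pulled out, then analyzed by the client."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        running_sum, count = totals.get(region, (0.0, 0))
        totals[region] = (running_sum + amount, count + 1)
    return {region: s / n for region, (s, n) in totals.items()}


def average_in_database(conn):
    """In-database approach: the engine aggregates and returns one row per region."""
    query = "SELECT region, AVG(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))
```

Both functions return the same per-region averages, but the second transfers only the aggregated result across the database boundary.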
 
Though in-database capabilities were first commercially offered in the mid-1990s, as object-relational database systems from vendors including IBM, [[Illustra]]/[[Informix]] (now IBM) and [[Oracle Corporation|Oracle]], the technology did not begin to catch on until the mid-2000s.<ref name="IE">{{citation|last=Grimes|first=Seth|title=In-Database Analytics: A Passing Lane for Complex Analysis|url=http://intelligent-enterprise.informationweek.com/info_centers/data_int/showArticle.jhtml;jsessionid=YH5ZICM4SKOMRQE1GHPSKH4ATMY32JVN?articleID=212500351&cid=RSSfeed_IE_News|publisher=Intelligent Enterprise|date=December 15, 2008}}</ref> The concept of migrating analytics from the analytical workstation into the enterprise data warehouse was first introduced by Thomas Tileston in his presentation entitled “Have Your Cake & Eat It Too! Accelerate Data Mining Combining SAS & Teradata” at the [[Teradata]] Partners 2005 "Experience the Possibilities" conference in Orlando, FL, September 18–22, 2005. Tileston later presented this technique globally in 2006,<ref>{{Cite web|url=http://www.itworldcanada.com/article/business-intelligence-taking-the-sting-out-of-forecasting/7193|title=Business Intelligence – Taking the sting out of forecasting &#124; IT World Canada News|date=31 October 2006}}</ref> 2007<ref>http://www2.sas.com/proceedings/forum2007/371-2007.pdf {{Bare URL PDF|date=March 2022}}</ref><ref>{{Cite web |url=http://de.saswiki.org/wiki/SAS_Global_Forum_2007 |title=SAS Global Forum 2007 – SAS-Wiki |access-date=2014-08-21 |archive-date=2014-08-21 |archive-url=https://web.archive.org/web/20140821121434/http://de.saswiki.org/wiki/SAS_Global_Forum_2007 |url-status=dead }}</ref><ref>{{Cite web |url=http://lexjansen.com/cgi-bin/sug_proceedings_pdf.php?c=SUGI&x=SGF2007 |title=Archived copy |access-date=2014-08-21 |archive-url=https://web.archive.org/web/20140822051218/http://lexjansen.com/cgi-bin/sug_proceedings_pdf.php?c=SUGI&x=SGF2007 |archive-date=2014-08-22 |url-status=dead }}</ref> and 2008.<ref>http://www.teradata.kr/teradatauniverse/PDF/Track_2/2_2_Warner_Home_Thomas_Tileston.pdf {{Bare URL PDF|date=March 2022}}</ref>
 
At that point, the need for in-database processing had become more pressing as the amount of data available to collect and analyze continued to grow exponentially (due largely to the rise of the Internet), from megabytes to gigabytes, terabytes and petabytes. This “[[big data]]” is one of the primary reasons it has become important to collect, process and analyze data efficiently and accurately.
 
Also, the speed of business has accelerated to the point where a performance gain of nanoseconds can make a difference in some industries.<ref name="DBTA" /> Additionally, as more people and industries use data to answer important questions, the questions they ask become more complex, demanding more sophisticated tools and more precise results.
 
All of these factors in combination have created the need for in-database processing. The introduction of the [[column-oriented database]], specifically designed for analytics, data warehousing and reporting, has helped make the technology possible.
 
==Types==
There are three main types of in-database processing: translating a model into SQL code, loading C or C++ libraries into the database process space as built-in user-defined functions (UDFs), and running out-of-process libraries, typically written in C, C++ or Java, that are registered in the database as built-in UDFs and called in a SQL statement.
 
===Translating models into SQL code===
In this type of in-database processing, a predictive model is converted from its source language into SQL that can run in the database, usually in a [[stored procedure]]. Many analytic model-building tools can export their models in either SQL or [[PMML]] (Predictive Modeling Markup Language). Once the SQL is loaded into a stored procedure, values can be passed in through parameters and the model is executed natively in the database. Tools that can use this approach include SAS, SPSS, R and [[KXEN]].
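As a rough illustration of the translation step, the sketch below renders a linear scoring model as a SQL expression and lets the database evaluate it. It is a simplified example under stated assumptions: the model, its coefficient values, and the <code>transactions</code> table are invented for illustration, and SQLite (used here through Python's <code>sqlite3</code> module) has no stored procedures, so the generated SQL is run as an ordinary query instead.

```python
import sqlite3

# Hypothetical linear scoring model; the coefficient values are made up.
MODEL = {
    "intercept": -1.5,
    "coefficients": {"amount": 0.002, "num_prior_alerts": 0.8},
}


def model_to_sql(model):
    """Render the linear model as a SQL arithmetic expression over column names."""
    terms = [repr(model["intercept"])]
    for column, weight in model["coefficients"].items():
        terms.append(f"({weight!r} * {column})")
    return " + ".join(terms)


def score_in_database(conn, model):
    """Let the database engine compute the score for every row."""
    sql = f"SELECT id, {model_to_sql(model)} AS score FROM transactions ORDER BY id"
    return conn.execute(sql).fetchall()


def make_demo_db():
    """Build a tiny in-memory table with the columns the model expects."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions "
        "(id INTEGER PRIMARY KEY, amount REAL, num_prior_alerts INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, 100.0, 0), (2, 2500.0, 3)],
    )
    return conn
```

Calling <code>score_in_database(make_demo_db(), MODEL)</code> returns one <code>(id, score)</code> pair per row, without the input rows ever leaving the database. Column names are interpolated directly into the SQL text here, which is acceptable only because the model definition is trusted.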
 
===Loading C or C++ libraries into the database process space===
With C or C++ UDF libraries that run in process, the functions are typically registered as built-in functions within the database server and called like any other built-in function in a SQL statement. Running in process allows the function to have full access to the database server's memory, parallelism and processing management capabilities. Because of this, the functions must be well-behaved so as not to negatively impact the database or the engine. This type of UDF gives the highest performance of any method for OLAP, mathematical, statistical, univariate distribution and data mining algorithms. Vendors such as [http://www.FuzzyL.com Fuzzy Logix] (DBLytix) and RogueWave (IMSL) have pre-built libraries available. IBM Netezza, EMC Greenplum, Sybase and Teradata (AsterData) have the capability to do this type of in-database analytics. Some of these vendors allow customers to write their own custom in-process UDFs.
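The registration-and-call pattern can be sketched with SQLite, whose engine runs inside the host process. This is only an analogy: the function below is written in Python rather than C or C++, the name <code>logistic</code> is hypothetical, and warehouse products expose their own registration mechanisms that are not shown here.

```python
import math
import sqlite3


def logistic(x):
    """Scalar function that the SQLite engine calls inside this process."""
    if x is None:
        return None  # propagate SQL NULL
    return 1.0 / (1.0 + math.exp(-x))


def connect_with_udf():
    """Open an in-memory database and register the function under a SQL name."""
    conn = sqlite3.connect(":memory:")
    # Arguments: SQL function name, number of arguments, Python callable.
    conn.create_function("logistic", 1, logistic)
    return conn


if __name__ == "__main__":
    conn = connect_with_udf()
    conn.execute("CREATE TABLE scores (raw REAL)")
    conn.executemany("INSERT INTO scores VALUES (?)", [(-2.0,), (0.0,), (2.0,)])
    # Once registered, the UDF is called like any built-in SQL function.
    for row in conn.execute("SELECT raw, logistic(raw) FROM scores"):
        print(row)
```

Because the function shares the engine's process, an unhandled fault or a slow implementation affects the database directly, which is the reason in-process functions must be well-behaved.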
 
===Out-of-process===
Out-of-process UDFs are typically written in C, C++ or Java. Because they run in their own process space with their own resources, they do not pose the same risk to the database or the engine as in-process UDFs, but they are not expected to match the performance of an in-process UDF. They are still typically registered in the database engine and called through standard SQL, usually in a stored procedure. One vendor, [[Zementis Inc|Zementis]], has plug-ins for different database vendors that can be used to take a PMML model and convert it to a Java UDF that can be called through the native SQL. Out-of-process UDFs are a safe way to extend the capabilities of a database server and are an ideal way to add custom data mining libraries.
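The isolation described above can be approximated in a small sketch: a function is registered with the engine as usual, but each call is forwarded over pipes to a separate worker process that does the actual computation. Everything here is an assumption made for illustration (the class name, the <code>square_oop</code> function, and the line-based protocol); it uses Python and SQLite rather than C, C++ or Java, and it does not reflect how any particular vendor implements out-of-process UDFs.

```python
import sqlite3
import subprocess
import sys

# Source of the worker: reads one number per line, writes its square back.
WORKER_SOURCE = """
import sys
for line in sys.stdin:
    value = float(line)
    print(value * value, flush=True)
"""


class OutOfProcessFunction:
    """Callable that delegates each evaluation to a separate worker process."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def __call__(self, value):
        if value is None:
            return None  # propagate SQL NULL without contacting the worker
        self.proc.stdin.write(f"{value}\n")
        self.proc.stdin.flush()
        return float(self.proc.stdout.readline())

    def close(self):
        """Shut the worker down by closing its input, then release the pipes."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


if __name__ == "__main__":
    square = OutOfProcessFunction()
    conn = sqlite3.connect(":memory:")
    conn.create_function("square_oop", 1, square)
    print(conn.execute("SELECT square_oop(3.0)").fetchone())
    square.close()
```

Each call pays for a round trip between processes, which illustrates why this style trades performance for safety: a crash in the worker ends only the worker, not the database engine.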
 
==Uses==
In-database processing makes data analysis more accessible and relevant for high-throughput, real-time applications including fraud detection, credit scoring, risk management, transaction processing, pricing and margin analysis, usage-based micro-segmenting, behavioral ad targeting and recommendation engines, such as those used by customer service organizations to determine next-best actions.<ref name=Kobelius>{{citation|last=Kobelius|first=James|title=The Power of Predictions: Case Studies in CRM Next Best Action|url=http://www.forrester.com/The+Power+Of+Predictions/fulltext/-/E-RES60094|publisher=Forrester|date=June 22, 2011|access-date=May 15, 2012|archive-date=April 13, 2012|archive-url=https://web.archive.org/web/20120413193606/http://www.forrester.com/The+Power+Of+Predictions/fulltext/-/E-RES60094|url-status=dead}}</ref>
 
==Vendors==
In-database processing is performed and promoted as a feature by many of the major data warehousing vendors, including [[Teradata]] (and [[Aster Data Systems]], which it acquired), IBM (with its [[Netezza]], PureData Systems, and [https://www.ibm.com/analytics/data-management/data-warehouse Db2 Warehouse] products), EMC [[Greenplum]], [[Sybase]], [[ParAccel]], SAS, and [[EXASOL]]. Some of the products offered by these vendors, such as CWI's [[MonetDB]] or IBM's Db2 Warehouse, offer users the means to write their own functions (UDFs) or extensions (UDXs) to enhance the products' capabilities.<ref>{{cite web | url = https://www.monetdb.org/content/embedded-r-monetdb | title = Embedded R in MonetDB | date = 22 December 2014 | access-date = 22 December 2014 | archive-date = 13 November 2014 | archive-url = https://web.archive.org/web/20141113025427/https://www.monetdb.org/content/embedded-r-monetdb | url-status = dead }}</ref> [[Fuzzy Logix]] offers libraries of in-database models used for mathematical, statistical, data mining, simulation, and classification modelling, as well as financial models for equity, fixed income, interest rate, and portfolio optimization. [http://in-database.com In-DataBase Pioneers] collaborates with marketing and IT teams to institutionalize data mining and analytic processes inside the data warehouse for fast, reliable, and customizable consumer-behavior and predictive analytics.
 
==Related technologies==
In-database processing is one of several technologies focused on improving data warehousing performance. Others include [[parallel computing]], shared everything architectures, [[shared nothing architecture]]s and [[massive parallel processing]]. It is an important step towards improving [[predictive analytics]] capabilities.<ref name="TimManns">[https://timmanns.blogspot.com/2009/01/isnt-in-database-processing-old-news.html] "Isn't In-database processing old news yet?", Blog by Tim Manns (Data Mining Blog), January 8, 2009</ref>
 
==External links==
* [http://www.exasol.com/en/exasolution/exapowerlytics.html EXASOL EXAPowerlytics]
 
==References==
{{Reflist|2}}
 
 
{{Database models}}
 
[[Category:Database management systems]]
[[Category:Applied data mining]]
[[Category:Data modeling]]
[[Category:Database theory]]
[[Category:Project management]]
[[Category:System administration]]
[[Category:Transaction processing]]
 
{{comp-sci-stub}}