m →Files: Typo fixing, replaced: hierarchially → hierarchically
m →Databases: HTTP to HTTPS for Brown University
(48 intermediate revisions by 35 users not shown)
Line 1:
{{distinguish|Information Engineering}}
{{Short description|Software engineering approach to designing and developing information systems}}
{{Use mdy dates|date=August 2021}}
'''Data engineering'''
== History ==
Around the 1970s/1980s the term ''
In the early 2000s, data and data tooling were generally held by the [[information technology]] (IT) teams in most companies.<ref name="hist2">{{cite web |last1=Dodds |first1=Eric |title=The History of the Data Engineering and the Megatrends |url=https://www.rudderstack.com/blog/the-data-engineering-megatrend-a-brief-history |website=Rudderstack |access-date=31 July 2022}}</ref> Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skills between these parts of the business.
In the early 2010s, with the rise of the [[internet]], the massive increase in data volumes, velocity, and variety led to the term [[big data]] being used to describe the data itself. Data-driven tech companies such as [[Facebook]] and [[Airbnb]] started using the phrase '''data engineer'''.<ref name="hist1" /><ref name="hist2" /> Due to the new scale of the data, major firms such as [[Google]], Facebook, [[Amazon (company)|Amazon]], [[Apple Inc.|Apple]], [[Microsoft]], and [[Netflix]] started to move away from traditional [[Extract transform load|ETL]] and storage techniques. They started creating '''data engineering''', a type of [[software engineering]] focused on data, and in particular [[data infrastructure|infrastructure]], [[data warehouse|warehousing]], [[Information privacy|data protection]], [[cybersecurity]], [[data mining|mining]], [[data modelling|modelling]], [[data processing|processing]], and [[metadata]] management.<ref name="hist1" /><ref name="hist2" /> This change in approach was particularly focused on [[cloud computing]].<ref name="hist2" /> Data started to be handled and used by many parts of the business, such as [[sales]] and [[marketing]], and not just IT.<ref name="hist2" />
== Tools ==
=== Compute ===
High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is [[dataflow programming]], in which the computation is represented as a [[directed graph]] (dataflow graph); nodes are the operations, and edges represent the flow of data.<ref name="sigops">{{cite web |last1=Schwarzkopf |first1=Malte |title=The Remarkable Utility of Dataflow Computing |url=https://www.sigops.org/2020/the-remarkable-utility-of-dataflow-computing/ |website=ACM SIGOPS |access-date=31 July 2022 |date=7 March 2020}}</ref> Popular implementations include [[Apache Spark]] and the [[deep learning]]-specific [[TensorFlow]].<ref name="sigops" /><ref name="sparkpaper">{{cite web |url=https://cs.stanford.edu/~matei/papers/2016/cacm_apache_spark.pdf |access-date=31 July 2022|title=sparkpaper}}</ref><ref name="tensorflow paper">{{cite web |last1=Abadi |first1=Martin |last2=Barham |first2=Paul |last3=Chen |first3=Jianmin |last4=Chen |first4=Zhifeng |last5=Davis |first5=Andy |last6=Dean |first6=Jeffrey |last7=Devin |first7=Matthieu |last8=Ghemawat |first8=Sanjay |last9=Irving |first9=Geoffrey |last10=Isard |first10=Michael |last11=Kudlur |first11=Manjunath |last12=Levenberg |first12=Josh |last13=Monga |first13=Rajat |last14=Moore |first14=Sherry |last15=Murray |first15=Derek G. 
|last16=Steiner |first16=Benoit |last17=Tucker |first17=Paul |last18=Vasudevan |first18=Vijay |last19=Warden |first19=Pete |last20=Wicke |first20=Martin |last21=Yu |first21=Yuan |last22=Zheng |first22=Xiaoqiang |title=TensorFlow: A system for large-scale machine learning |url=https://research.google/pubs/pub45381/ |website=12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) |access-date=31 July 2022 |pages=265–283 |date=2016}}</ref> More recent implementations, such as [[Differential Dataflow|Differential]]/[[Timely Dataflow|Timely]] Dataflow, have used [[incremental computing]] for much more efficient data processing.<ref name="sigops" /><ref name="differential-paper">{{cite web |last1=McSherry |first1=Frank |last2=Murray |first2=Derek |last3=Isaacs |first3=Rebecca |last4=Isard |first4=Michael |title=Differential dataflow |website=[[Microsoft]] |url=https://www.microsoft.com/en-us/research/publication/differential-dataflow/ |access-date=31 July 2022 |date=5 January 2013}}</ref><ref name="differential-github">{{cite web |title=Differential Dataflow |url=https://github.com/TimelyDataflow/differential-dataflow |publisher=Timely Dataflow |access-date=31 July 2022 |date=30 July 2022}}</ref>
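As a minimal sketch of the dataflow idea, the following Python models a computation as a directed graph in which nodes are operations and edges carry data between them. The node names and the <code>run_dataflow</code> helper are hypothetical and chosen only for illustration; this is not how Spark or TensorFlow are implemented. It assumes Python 3.9 or later for the standard-library <code>graphlib</code> module.

```python
from graphlib import TopologicalSorter


def run_dataflow(operations, edges, inputs):
    """Evaluate a dataflow graph (hypothetical helper, for illustration).

    operations: node name -> function taking upstream results as positional args
    edges:      node name -> list of upstream node names (order = argument order)
    inputs:     node name -> value for source nodes
    """
    results = dict(inputs)
    # static_order() yields each node only after all of its upstream nodes.
    for node in TopologicalSorter(edges).static_order():
        if node in results:
            continue  # source node: value was supplied directly
        upstream = [results[name] for name in edges.get(node, [])]
        results[node] = operations[node](*upstream)
    return results


# Nodes are operations ...
operations = {
    "clean": lambda rows: [r for r in rows if r is not None],
    "total": lambda rows: sum(rows),
    "count": lambda rows: len(rows),
    "mean": lambda total, count: total / count,
}
# ... and edges describe which results flow into which operation.
edges = {
    "clean": ["raw"],
    "total": ["clean"],
    "count": ["clean"],
    "mean": ["total", "count"],
}

results = run_dataflow(operations, edges, {"raw": [4, None, 8, 6]})
print(results["mean"])  # 6.0
```

Because the graph makes dependencies explicit, independent nodes such as <code>total</code> and <code>count</code> could in principle be scheduled in parallel; this sketch simply runs them in one valid order.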
=== Storage ===
Data
Data engineers optimize data storage and processing systems to reduce costs, using techniques such as data compression, partitioning, and archiving.
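A minimal sketch of two of these techniques, partitioning and compression, using only the Python standard library. The record fields, the <code>key=value</code> directory layout (in the style of Hive partitioning), and the helper names are assumptions made for illustration, not a prescribed layout.

```python
import gzip
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(records, root, key="date"):
    """Write records as gzip-compressed JSON Lines, one directory per key value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    paths = {}
    for value, rows in groups.items():
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0.jsonl.gz"
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        paths[value] = path
    return paths


def read_partition(root, value, key="date"):
    """Read a single partition back without touching the others."""
    path = Path(root) / f"{key}={value}" / "part-0.jsonl.gz"
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


# Hypothetical sample data.
records = [
    {"date": "2022-07-30", "user": "a", "amount": 10},
    {"date": "2022-07-30", "user": "b", "amount": 5},
    {"date": "2022-07-31", "user": "a", "amount": 7},
]
root = tempfile.mkdtemp()
paths = write_partitioned(records, root)
july_31 = read_partition(root, "2022-07-31")
```

A query that only needs one day can read one small compressed file instead of scanning the whole dataset, which is where the cost saving comes from in this sketch.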
==== Databases ====
If the data
</ref><ref name="sigmodrecord">{{cite conference |first1=Andrew |last1=Pavlo |first2=Matthew |last2=Aslett |title=What's Really New with NewSQL? |book-title=SIGMOD Record |year=2016 |url=https://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf |access-date=February 22, 2020}}
</ref><ref>{{cite web |url=https://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext |title=NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps |first=Michael |last=Stonebraker |publisher=Communications of the ACM Blog |date=June 16, 2011 |access-date=February 22, 2020}}</ref><ref name="high scalability">{{cite web |url=http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html |title=Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In |first=Todd |last=Hoff |date=September 24, 2012 |access-date=February 22, 2020}}</ref>
Line 25 ⟶ 27:
==== Data warehouses ====
{{Main|Data warehouse}}
If the data
==== Data lakes ====
A [[
==== Files ====
If the data
* [[File system]]s represent data hierarchically in nested folders.<ref name="redhat1">{{cite web |title=File storage, block storage, or object storage? |url=https://www.redhat.com/en/topics/data-storage/file-block-object-storage |website=www.redhat.com |access-date=31 July 2022 |language=en}}</ref>
* [[Block storage]] splits data into regularly sized chunks;<ref name="redhat1" /> this often matches up with (virtual) [[hard drives]] or [[solid state drives]].
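To illustrate the block storage idea, the following sketch splits a byte string into regularly sized chunks. The 4096-byte block size, the zero padding of the final block, and the function names are assumptions chosen for illustration; real block devices manage this at a much lower level.

```python
def split_into_blocks(data: bytes, block_size: int = 4096, pad: bytes = b"\x00"):
    """Split data into fixed-size blocks, padding the final block to full size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append(block.ljust(block_size, pad))
    return blocks


def read_back(blocks, length: int) -> bytes:
    """Reassemble the original data; the true length must be stored separately."""
    return b"".join(blocks)[:length]


payload = b"x" * 10000
blocks = split_into_blocks(payload, block_size=4096)
restored = read_back(blocks, len(payload))
```

Because every block has the same size, any block can be located by simple arithmetic on its index, but the system has to record the original length (and which blocks belong together) somewhere else.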
Line 48 ⟶ 50:
=== Data modeling ===
{{
Data modeling is the analysis and representation of data requirements for an organization. It produces a data model, an abstract representation that organizes business concepts and the relationships and constraints between them. The resulting artifacts guide communication between business and technical stakeholders and inform database design.<ref name="simsionwitt">{{cite book |last1=Simsion |first1=Graeme |last2=Witt |first2=Graham |title=Data Modeling Essentials |edition=4th |publisher=Morgan Kaufmann |year=2015 |isbn=9780128002025}}</ref><ref name="date">{{cite book |last=Date |first=C. J. |title=An Introduction to Database Systems |edition=8th |publisher=Addison-Wesley |year=2004 |isbn=9780321197849}}</ref>
A common convention distinguishes three levels of models:<ref name="simsionwitt" />
* '''Conceptual model''' – a technology-independent view of the key business concepts and rules.
* '''Logical model''' – a detailed representation in a chosen paradigm (most commonly the relational model) specifying entities, attributes, keys, and integrity constraints.<ref name="date" />
* '''Physical model''' – an implementation-oriented design describing tables, indexes, partitioning, and other operational considerations.<ref name="date" />
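As a hedged illustration of how a logical model (entities, attributes, keys, and integrity constraints) might be carried through to a physical design (tables and an index), the sketch below uses Python's built-in <code>sqlite3</code> module. The <code>customer</code> and <code>customer_order</code> entities are hypothetical examples, not taken from the cited sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Logical level: entities, attributes, keys, and integrity constraints.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
);
-- Physical level: index the foreign key to speed up lookups by customer.
CREATE INDEX idx_order_customer ON customer_order(customer_id);
""")

conn.execute("INSERT INTO customer (customer_id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO customer_order (customer_id, total_cents) VALUES (1, 2500)")

try:
    # Violates the foreign key: customer 99 does not exist.
    conn.execute("INSERT INTO customer_order (customer_id, total_cents) VALUES (99, 100)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
```

The conceptual model has no direct counterpart in code here; it would typically be a diagram or glossary agreed with business stakeholders before tables like these are designed.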
Approaches include entity–relationship (ER) modeling for operational systems,<ref name="chen1976">{{cite journal |last=Chen |first=Peter P. |title=The Entity–Relationship Model—Toward a Unified View of Data |journal=ACM Transactions on Database Systems |volume=1 |issue=1 |year=1976 |pages=9–36 |doi=10.1145/320434.320440}}</ref> dimensional modeling for analytics and data warehousing,<ref name="kimball">{{cite book |last1=Kimball |first1=Ralph |last2=Ross |first2=Margy |title=The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling |edition=3rd |publisher=Wiley |year=2013 |isbn=9781118530801}}</ref> and the use of UML class diagrams to express conceptual or logical models in general-purpose modeling tools.<ref name="uml">{{cite report |title=Unified Modeling Language (UML) Version 2.5.1 |publisher=Object Management Group |date=2017 |url=https://www.omg.org/spec/UML/2.5.1/}}</ref>
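A minimal sketch of dimensional modeling, assuming a simple star schema with one fact table and two dimension tables. All table names, column names, and sample rows are hypothetical and serve only to show the shape of the design; the example again uses Python's built-in <code>sqlite3</code> module.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    revenue     REAL
);
""")
warehouse.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Editor", "Software")],
)
warehouse.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20220701, 2022, 7), (20220801, 2022, 8)],
)
warehouse.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 20220701, 2, 2000.0), (2, 20220701, 5, 100.0),
     (3, 20220801, 10, 500.0), (1, 20220801, 1, 1000.0)],
)

# A typical analytical query: join facts to dimensions, then aggregate.
revenue_by_category = warehouse.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)  # [('Hardware', 2022, 3100.0), ('Software', 2022, 500.0)]
```

Compared with a normalized ER design for an operational system, this layout trades some redundancy in the dimension tables for simpler, faster aggregate queries.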
Well-formed data models aim to improve data quality and interoperability by applying clear naming standards, normalization, and integrity constraints.<ref name="date" /><ref name="simsionwitt" />
== Roles ==
=== Data engineer ===
A '''data engineer''' is a type of software engineer who creates [[big data]] [[Extract, transform, load|ETL]] pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate them into [[business intelligence|insights]].<ref>{{
=== Data scientist ===
{{Main|Data science}}
'''Data scientists''' focus more on the analysis of the data and are typically more familiar with [[mathematics]], [[algorithms]], [[statistics]], and [[machine learning]].<ref name="hist1" /><ref>{{Cite web |date=Jan 5, 2017 |title=What is Data Science and Why it's Important |url=https://www.edureka.co/blog/what-is-data-science/ |publisher=Edureka}}</ref>
==See also==
Line 70 ⟶ 81:
==Further reading==
{{refbegin|2}}
*
*
*
* Ian Macdonald (1986). "Information engineering". In: ''Information Systems Design Methodologies''. T.W. Olle et al. (ed.). North-Holland.
* Ian Macdonald (1988). "Automating the Information engineering methodology with the Information Engineering Facility". In: ''Computerized Assistance during the Information Systems Life Cycle''. [[T.W. Olle]] et al. (ed.). North-Holland.
* [[James Martin (author)|James Martin]] and [[Clive Finkelstein]]. (1981). ''Information engineering''. Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK.
* James Martin (1989). ''Information engineering''. (3 volumes), Prentice-Hall Inc.
*
* {{cite book |last1=Reis |first1=Joe |last2=Housley |first2=Matt |title=Fundamentals of Data Engineering |date=2022 |publisher=O'Reilly Media |isbn=978-1-0981-0827-4 }}
{{refend}}
==External links==
{{commons category|Information Engineering}}
* [http://www.informatik.uni-bremen.de/uniform/gdpa/methods/m-iem.htm The Complex Method IEM] {{Webarchive|url=https://web.archive.org/web/20190720070308/http://www.informatik.uni-bremen.de/uniform/gdpa/methods/m-iem.htm |date=July 20, 2019 }}
* [https://web.archive.org/web/20060215222446/http://sysdev.ucdavis.edu/WEBADM/document/rad-archapproach.htm Rapid Application Development]
* [http://www.ies.aust.com Enterprise Engineering and Rapid Delivery of Enterprise Architecture]
Line 90 ⟶ 100:
{{Authority control}}
{{Engineering fields}}
[[Category:Software
[[Category:Information systems]]
[[Category:Data management]]
[[Category:Data engineering]]
[[Category:Engineering disciplines]]