{{Short description|Software engineering approach to designing and developing information systems}}
{{Use mdy dates|date=August 2021}}
'''Data engineering'''
== History ==
Around the 1970s/1980s the term ''
In the early 2000s, data and data tooling were generally held by the [[information technology]] (IT) teams in most companies.<ref name="hist2">{{cite web |last1=Dodds |first1=Eric |title=The History of the Data Engineering and the Megatrends |url=https://www.rudderstack.com/blog/the-data-engineering-megatrend-a-brief-history |website=Rudderstack |access-date=31 July 2022}}</ref> Other teams then used the data for their work (e.g. reporting), and there was usually little overlap in data skills between these parts of the business.
==== Databases ====
If the data is structured and some form of [[online transaction processing]] is required, then [[databases]] are generally used.<ref name="mit">{{cite web |title=Lecture Notes {{!}} Database Systems {{!}} Electrical Engineering and Computer Science {{!}} MIT OpenCourseWare |url=https://ocw.mit.edu/courses/6-830-database-systems-fall-2010/pages/lecture-notes/ |website=ocw.mit.edu |access-date=31 July 2022}}</ref> Originally, [[relational database]]s with strong [[ACID]] transaction correctness guarantees were used in most cases; most relational databases use [[SQL]] for their queries. However, with the growth of data in the 2010s, [[NoSQL]] databases also became popular, since they [[Horizontal scaling#Horizontal and vertical scaling|scale horizontally]] more easily than relational databases by giving up the ACID transaction guarantees, and they also reduce the [[object-relational impedance mismatch]].<ref name="leavitt">{{cite journal |last1=Leavitt |first1=Neal |title=Will NoSQL Databases Live Up to Their Promise? |journal=Computer |date=February 2010 |volume=43 |issue=2 |pages=12–14 |doi=10.1109/MC.2010.58 }}</ref> More recently, [[NewSQL]] databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.<ref name="aslett2012">{{cite web |url=
</ref><ref name="sigmodrecord">{{cite conference |first1=Andrew |last1=Pavlo |first2=Matthew |last2=Aslett |title=What's Really New with NewSQL? |book-title=SIGMOD Record |year=2016 |url=https://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf |access-date=February 22, 2020}}
</ref><ref>{{cite web |url=https://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext |title=NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps |first=Michael |last=Stonebraker |publisher=Communications of the ACM Blog |date=June 16, 2011 |access-date=February 22, 2020}}</ref><ref name="high scalability">{{cite web |url=http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html |title=Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In |first=Todd |last=Hoff |date=September 24, 2012 |access-date=February 22, 2020}}</ref>
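The atomicity part of the ACID guarantees can be illustrated with a minimal sketch. It assumes an in-memory SQLite database accessed through Python's standard <code>sqlite3</code> module; the <code>accounts</code> table and the helper names <code>make_db</code>, <code>transfer</code>, and <code>balances</code> are hypothetical and chosen only for illustration. A transfer whose second step violates a constraint leaves no trace of its first step.

```python
import sqlite3


def make_db():
    """Create a hypothetical two-account ledger in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`; return True only if committed."""
    try:
        # The connection used as a context manager commits on success and
        # rolls back every statement in the block if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # This debit fails the CHECK constraint when funds are short,
            # which also undoes the credit applied just above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return a mapping of account id to balance."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

In this sketch, a transfer that would overdraw the source account is rejected as a whole, so the balances seen afterwards are the same as before the attempt.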
=== Data modeling ===
{{
Data modeling is the analysis and representation of data requirements for an organization. It produces a data model, an abstract representation that organizes business concepts and the relationships and constraints between them. The resulting artifacts guide communication between business and technical stakeholders and inform database design.<ref name="simsionwitt">{{cite book |last1=Simsion |first1=Graeme |last2=Witt |first2=Graham |title=Data Modeling Essentials |edition=4th |publisher=Morgan Kaufmann |year=2015 |isbn=9780128002025}}</ref><ref name="date">{{cite book |last=Date |first=C. J. |title=An Introduction to Database Systems |edition=8th |publisher=Addison-Wesley |year=2004 |isbn=9780321197849}}</ref>
A common convention distinguishes three levels of models:<ref name="simsionwitt" />
* '''Conceptual model''' – a technology-independent view of the key business concepts and rules.
* '''Logical model''' – a detailed representation in a chosen paradigm (most commonly the relational model) specifying entities, attributes, keys, and integrity constraints.<ref name="date" />
* '''Physical model''' – an implementation-oriented design describing tables, indexes, partitioning, and other operational considerations.<ref name="date" />
Approaches include entity–relationship (ER) modeling for operational systems,<ref name="chen1976">{{cite journal |last=Chen |first=Peter P. |title=The Entity–Relationship Model—Toward a Unified View of Data |journal=ACM Transactions on Database Systems |volume=1 |issue=1 |year=1976 |pages=9–36 |doi=10.1145/320434.320440}}</ref> dimensional modeling for analytics and data warehousing,<ref name="kimball">{{cite book |last1=Kimball |first1=Ralph |last2=Ross |first2=Margy |title=The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling |edition=3rd |publisher=Wiley |year=2013 |isbn=9781118530801}}</ref> and the use of UML class diagrams to express conceptual or logical models in general-purpose modeling tools.<ref name="uml">{{cite report |title=Unified Modeling Language (UML) Version 2.5.1 |publisher=Object Management Group |date=2017 |url=https://www.omg.org/spec/UML/2.5.1/}}</ref>
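As an illustration of dimensional modeling, the following sketch builds a small star schema, in which one fact table references two dimension tables. It assumes an in-memory SQLite database and Python's standard <code>sqlite3</code> module; the table names, columns, sample rows, and the helpers <code>build_warehouse</code> and <code>revenue_by_category</code> are hypothetical and not taken from the cited sources.

```python
import sqlite3

# Hypothetical star schema: fact_sales holds the measures, and each row
# points at one row in every dimension table.
STAR_SCHEMA = """
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year INTEGER NOT NULL,
    month INTEGER NOT NULL
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    category TEXT NOT NULL
);
CREATE TABLE fact_sales (
    date_key INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    quantity INTEGER NOT NULL,
    revenue REAL NOT NULL
);
"""


def build_warehouse():
    """Create the schema and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAR_SCHEMA)
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20240101, 2024, 1), (20240201, 2024, 2)],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(20240101, 1, 3, 30.0), (20240101, 2, 1, 12.0), (20240201, 1, 2, 20.0)],
    )
    conn.commit()
    return conn


def revenue_by_category(conn):
    """Aggregate the fact table's revenue measure by a dimension attribute."""
    rows = conn.execute(
        "SELECT p.category, SUM(f.revenue) "
        "FROM fact_sales AS f "
        "JOIN dim_product AS p ON p.product_key = f.product_key "
        "GROUP BY p.category ORDER BY p.category"
    )
    return dict(rows)
```

Analytical queries in such a design typically join the fact table to one or more dimensions and group by dimension attributes, as <code>revenue_by_category</code> does here.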
Well-formed data models aim to improve data quality and interoperability by applying clear naming standards, normalization, and integrity constraints.<ref name="date" /><ref name="simsionwitt" />
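How integrity constraints protect data quality can be sketched with a small normalized schema. The example assumes SQLite through Python's standard <code>sqlite3</code> module; the <code>customer</code> and <code>customer_order</code> tables and the helpers <code>make_orders_db</code> and <code>try_insert</code> are hypothetical. Customer details are stored once and referenced by key, and rows that break a uniqueness, foreign key, or check constraint are rejected.

```python
import sqlite3


def make_orders_db():
    """Create a hypothetical normalized schema with declared constraints."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE customer_order (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            quantity INTEGER NOT NULL CHECK (quantity > 0)
        );
        """
    )
    conn.execute("INSERT INTO customer VALUES (1, 'ada@example.com')")
    conn.commit()
    return conn


def try_insert(conn, sql, params):
    """Run one INSERT and report whether the constraints accepted it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"
```

For example, an order for a customer key that does not exist, an order with a quantity of zero, and a second customer with an already registered email address would each be reported as <code>"rejected"</code> in this sketch.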
== Roles ==
=== Data engineer ===
A '''data engineer''' is a type of software engineer who creates [[big data]] [[Extract, transform, load|ETL]] pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into [[business intelligence|insights]].<ref>{{cite report |last1=Tamir |first1=Mike |last2=Miller |first2=Steven |last3=Gagliardi |first3=Alessandro |date=11 December 2015 |title=The Data Engineer |ssrn=2762013 }}</ref> Data engineers focus on the production readiness of data, including aspects such as formats, resilience, scaling, and security. They usually come from a software engineering background and are proficient in programming languages such as [[Java (programming language)|Java]], [[Python (programming language)|Python]], [[Scala (programming language)|Scala]], and [[Rust (programming language)|Rust]].<ref>{{Cite web|date=2019-02-07|title=Data Engineer vs. Data Scientist|url=https://
=== Data scientist ===