Data engineering: Difference between revisions

Revision as of 10:44, 5 June 2025 by Priyash48 (no edit summary; tag: Reverted), compared with the latest revision as of 07:33, 20 August 2025 by Bender the Bot (minor edit, →Databases: HTTP to HTTPS for Brown University; tag: AWB)
(4 intermediate revisions by 3 users not shown)
{{Short description|Software engineering approach to designing and developing information systems}}
{{Use mdy dates|date=August 2021}}

'''Data engineering''' is a [[software engineering]] approach to the building of [[data system]]s, to enable the collection and usage of [[data]]. This data is usually used to enable subsequent [[data analytics|analysis]] and [[data science]], which often involves [[machine learning]].<ref name="whatis1">{{cite web |title=What is Data Engineering? {{!}} A Quick Glance of Data Engineering |url=https://www.educba.com/what-is-data-engineering/ |website=EDUCBA |access-date=31 July 2022 |date=5 January 2020}}</ref><ref name="whatis2">{{cite web |title=Introduction to Data Engineering |url=https://www.dremio.com/resources/guides/intro-data-engineering/ |website=Dremio |access-date=31 July 2022}}</ref> Making the data usable usually involves substantial [[computer|compute]] and [[computer data storage|storage]], as well as [[data processing]].

== History ==
Around the 1970s and 1980s, the term '''[[information engineering]] methodology''' (IEM) was created to describe [[database design]] and the use of [[software]] for data analysis and processing.<ref name="hist1">{{cite web |last1=Black |first1=Nathan |title=What is Data Engineering and Why Is It So Important? |url=https://quanthub.com/what-is-data-engineering/ |website=QuantHub |access-date=31 July 2022 |date=15 January 2020}}</ref> These techniques were intended to be used by [[database administrator]]s (DBAs) and by [[systems analyst]]s, based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems.
A key early contributor (often called the "father" of information engineering methodology) was the Australian [[Clive Finkelstein]], who wrote several articles about it between 1976 and 1980, and also co-authored an influential [[Savant Institute]] report on it with James Martin.<ref>"Information engineering", [https://books.google.com/books?id=U2Da-O9RAgIC&pg=PA29 part 3], [https://books.google.com/books?id=aMrnCDJzb9MC&pg=RA1-PA1 part 4], [https://books.google.com/books?id=Ux9iw6tMs6MC&pg=PA32 part 5], [https://books.google.com/books?id=dPLZ7QidjbEC&pg=RA1-PA1 part 6], by Clive Finkelstein. In ''Computerworld, In depths, appendix.'' May 25 – June 15, 1981.</ref><ref>Christopher Allen, Simon Chatwin, Catherine Creary (2003). ''Introduction to Relational Databases and SQL Programming.''</ref><ref>[[Terry Halpin]], [[Tony Morgan (computer scientist)|Tony Morgan]] (2010). ''Information Modeling and Relational Databases.'' p. 343</ref> Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role in revamping IEM as well as helping to design the IEM software product (user data), which helped automate IEM.

In the early 2000s, data and data tooling were generally held by the [[information technology]] (IT) teams in most companies.<ref name="hist2">{{cite web |last1=Dodds |first1=Eric |title=The History of the Data Engineering and the Megatrends |url=https://www.rudderstack.com/blog/the-data-engineering-megatrend-a-brief-history |website=Rudderstack |access-date=31 July 2022}}</ref> Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.
==== Databases ====
If the data is structured and some form of [[online transaction processing]] is required, then [[databases]] are generally used.<ref name="mit">{{cite web |title=Lecture Notes {{!}} Database Systems {{!}} Electrical Engineering and Computer Science {{!}} MIT OpenCourseWare |url=https://ocw.mit.edu/courses/6-830-database-systems-fall-2010/pages/lecture-notes/ |website=ocw.mit.edu |access-date=31 July 2022}}</ref> Originally, mostly [[relational database]]s were used, with strong [[ACID]] transaction correctness guarantees; most relational databases use [[SQL]] for their queries. However, with the growth of data in the 2010s, [[NoSQL]] databases have also become popular, since they [[Horizontal scaling#Horizontal and vertical scaling|scale horizontally]] more easily than relational databases by giving up the ACID transaction guarantees, and they also reduce the [[object-relational impedance mismatch]].<ref name="leavitt">{{cite journal |last1=Leavitt |first1=Neal |title=Will NoSQL Databases Live Up to Their Promise? |journal=Computer |date=February 2010 |volume=43 |issue=2 |pages=12–14 |doi=10.1109/MC.2010.58 }}</ref> More recently, [[NewSQL]] databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.<ref name="aslett2012">{{cite web |url=https://cs.brown.edu/courses/cs227/archives/2012/papers/newsql/aslett-newsql.pdf |title=How Will The Database Incumbents Respond To NoSQL And NewSQL? |first=Matthew |last=Aslett |publisher=451 Group |publication-date=April 4, 2011 |year=2011 |access-date=February 22, 2020}}</ref><ref name="sigmodrecord">{{cite conference |first1=Andrew |last1=Pavlo |first2=Matthew |last2=Aslett |title=What's Really New with NewSQL? |book-title=SIGMOD Record |year=2016 |url=https://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf |access-date=February 22, 2020}}</ref><ref>{{cite web |url=https://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext |title=NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps |first=Michael |last=Stonebraker |publisher=Communications of the ACM Blog |date=June 16, 2011 |access-date=February 22, 2020}}</ref><ref name="high scalability">{{cite web |url=http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html |title=Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In |first=Todd |last=Hoff |date=September 24, 2012 |access-date=February 22, 2020}}</ref>

=== Data modeling ===
{{main article|Data modelling}}
Data modeling is the analysis and representation of data requirements for an organization. It produces a data model, an abstract representation that organizes business concepts and the relationships and constraints between them. The resulting artifacts guide communication between business and technical stakeholders and inform database design.<ref name="simsionwitt">{{cite book |last1=Simsion |first1=Graeme |last2=Witt |first2=Graham |title=Data Modeling Essentials |edition=4th |publisher=Morgan Kaufmann |year=2015 |isbn=9780128002025}}</ref><ref name="date">{{cite book |last=Date |first=C. J. |title=An Introduction to Database Systems |edition=8th |publisher=Addison-Wesley |year=2004 |isbn=9780321197849}}</ref>

This is the process of producing a [[data model]], an [[abstract model]] to describe the data and relationships between different parts of the data.<ref name="model1">{{cite web |title=What is Data Modelling? Overview, Basic Concepts, and Types in Detail |url=https://www.simplilearn.com/what-is-data-modeling-article |website=Simplilearn.com |access-date=31 July 2022 |date=15 June 2021}}</ref>

A common convention distinguishes three levels of models:<ref name="simsionwitt" />
* '''Conceptual model''' – a technology-independent view of the key business concepts and rules.
* '''Logical model''' – a detailed representation in a chosen paradigm (most commonly the relational model) specifying entities, attributes, keys, and integrity constraints.<ref name="date" />
* '''Physical model''' – an implementation-oriented design describing tables, indexes, partitioning, and other operational considerations.<ref name="date" />

Approaches include entity–relationship (ER) modeling for operational systems,<ref name="chen1976">{{cite journal |last=Chen |first=Peter P. |title=The Entity–Relationship Model—Toward a Unified View of Data |journal=ACM Transactions on Database Systems |volume=1 |issue=1 |year=1976 |pages=9–36 |doi=10.1145/320434.320440}}</ref> dimensional modeling for analytics and data warehousing,<ref name="kimball">{{cite book |last1=Kimball |first1=Ralph |last2=Ross |first2=Margy |title=The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling |edition=3rd |publisher=Wiley |year=2013 |isbn=9781118530801}}</ref> and the use of UML class diagrams to express conceptual or logical models in general-purpose modeling tools.<ref name="uml">{{cite report |title=Unified Modeling Language (UML) Version 2.5.1 |publisher=Object Management Group |date=2017 |url=https://www.omg.org/spec/UML/2.5.1/}}</ref>

Well-formed data models aim to improve data quality and interoperability by applying clear naming standards, normalization, and integrity constraints.<ref name="date" /><ref name="simsionwitt" />

== Roles ==

=== Data engineer ===
A '''data engineer''' is a type of software engineer who creates [[big data]] [[Extract, transform, load|ETL]] pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into [[business intelligence|insights]].<ref>{{cite report |last1=Tamir |first1=Mike |last2=Miller |first2=Steven |last3=Gagliardi |first3=Alessandro |date=11 December 2015 |title=The Data Engineer |ssrn=2762013 }}</ref> They are focused on the production readiness of data and on concerns such as formats, resilience, scaling, and security. Data engineers usually come from a software engineering background and are proficient in programming languages like [[Java (programming language)|Java]], [[Python (programming language)|Python]], [[Scala (programming language)|Scala]], and [[Rust (programming language)|Rust]].<ref>{{Cite web|date=2019-02-07|title=Data Engineer vs. Data Scientist|url=https://www.springboard.com/blog/data-engineer-vs-data-scientist/|access-date=2021-03-14|website=Springboard Blog|language=en-US}}</ref><ref name="hist1" /> They will be more familiar with databases, architecture, cloud computing, and [[Agile software development]].<ref name="hist1" />

=== Data scientist ===