First normal form: Difference between revisions

Added a bit more context to the section about the atomic values controversy
m Altered phrasing to "level of database normalization"
 
(8 intermediate revisions by 3 users not shown)
Line 1:
{{short description|Level of database normalization}}
{{citation style|date=May 2025}}
 
'''First normal form''' ('''1NF''') is the most basic level of [[database normalization]] defined by English computer scientist [[Edgar F. Codd]], the inventor of the [[relational database]]. A [[Relation (database)|relation]] (or a [[Table (database)|''table'']], in [[SQL]]) can be said to be in first normal form if each field is ''atomic'', containing a single value rather than a set of values or a [[nested table]]. In other words, a relation complies with first normal form if no [[attribute domain]] (the set of values allowed in a given column) has relations as elements.<ref>Codd, E. F. (1972). "Further Normalization of the Data Base Relational Model". p. 27</ref>
 
Most relational database management systems, including standard SQL, do not support creating or using table-valued columns, which means most relational databases will be in first normal form by necessity. Otherwise, normalization to 1NF involves eliminating nested relations by breaking them up into separate relations associated with each other using [[foreign key]]s.<ref name="Codd 1970">{{Cite journal |title=A relational model of data for large shared data banks |journal=Communications of the ACM |last=Codd |first=E. F. |volume=13 |issue=6 |pages=377&ndash;387 |author-link=Edgar F. Codd |year=1970 |doi=10.1145/362384.362685}}</ref>{{rp|pages=381}} This process is a necessary step when moving data from a non-relational (or [[NoSQL]]) database, such as one using a [[hierarchical database|hierarchical]] or [[document-oriented database|document-oriented]] model, to a relational database.
 
A database must satisfy 1NF to satisfy further "[[Database_normalization#Normal_forms|normal forms]]", such as [[Second normal form|2NF]] and [[Third normal form|3NF]], which enable the reduction of redundancy and anomalies. Other benefits of adopting 1NF include the introduction of increased [[data independence]] and flexibility (including features like [[Many-to-many (data model)|many-to-many]] relationships) and simplification of the [[relational algebra]] and [[query language]] necessary to describe operations on the database.
 
Codd considered 1NF mandatory for relational databases, while the other normal forms were merely guidelines for database design.<ref>{{Cite journal |title=Extending the database relational model to capture more meaning |journal=ACM Transactions on Database Systems |last=Codd |first=E. F. |volume=4 |issue=4 |pages=397&ndash;434 |author-link=Edgar F. Codd |year=1979 |doi=10.1145/320107.320109}}</ref>{{rp|page=413}}
 
== Background ==
First normal form was introduced in 1970 by [[Edgar F. Codd]] in his paper "A relational model of data for large shared data banks",{{r|Codd 1970}} although initially it was simply referred to as "normalization" or "normal form". It was renamed to "first normal form" when Codd introduced additional normal forms in his paper "Further Normalization of the Data Base Relational Model" in 1971.<ref>Codd, E. F. (1971). "Further Normalization of the Data Base Relational Model". ''Data Base Systems. Courant Computer Science Symposium 6 in Data Base Systems'' edited by Rustin, R.</ref>
 
The relational model was proposed as an improvement over the [[hierarchical database]]s which were prevalent at the time.{{r|Codd 1970|p=377}} A key difference lies in how relationships between records are represented. In a hierarchical database, one-to-many relationships are represented through containment: a single record may contain sets of records (known as repeating groups) as attribute values. Codd argued that hierarchy is not flexible or expressive enough for more complex data models. For example, many-to-many relationships cannot be represented through hierarchy.{{r|Codd 1970|p=378}} He therefore suggested eliminating nested records and instead representing relationships through [[foreign key|foreign keys]]. This allows richer relationships to be expressed, since a record can now participate in multiple relationships.{{r|Codd 1970|p=378}}
In a [[finitary relation|relation]] (or [[Table (database)|''table'']]), each attribute (or [[Column (database)|''column'']]) has a set of possible values, known as its [[Attribute domain|domain]] (e.g., the set of integers in a given range). A tuple (or [[Row (database)|''row'']]) contains exactly one element from the attribute domain per attribute. A domain can be any set, hence it might contain further relations as elements – this would allow tuples to themselves contain relations, in turn containing multiple tuples and attributes. Such domains can be found in [[non-relational database]]s.
 
A direct translation of a hierarchical database into relations would represent repeating groups as nested relations. Normalization is therefore defined as eliminating nested relations and instead representing one-to-many relationships through foreign keys.{{r|Codd 1970|p=381}}
A relation complies with first normal form when no attribute domain has relations as elements. Codd calls a domain which contains relations a "[[finitary relation|nonsimple domain]]", or repeating group,{{r|Codd 1970|p=380–381}} while a domain which does not contain relations is called a "simple domain". Normalization to 1NF is thus a process of eliminating nonsimple domains from all relations.
 
Codd distinguishes between "atomic" and "compound" data. Atomic (or "nondecomposable") data includes basic types such as numbers and [[String (computer science)|strings]] – broadly speaking, it "''cannot'' be decomposed into smaller pieces by the [[DBMS]] (excluding certain special functions)". Compound data is made up of structures such as [[Relation (database)|relations]] (or ''[[Table (database)|tables]]'', in [[SQL]]) which contain several pieces of atomic data and thus "''can'' be decomposed by the DBMS".<ref name="Codd 1990">{{Cite book |last=Codd |first=E. F. |title=The relational model for database management: version 2 |publisher=[[Addison-Wesley]] |isbn=978-0-201-14192-4 |publication-date=1 January 1990}}</ref>{{rp|page=6}}
Codd uses the terms ''atomic'' and ''nondecomposable'' for elements of simple domains.{{r|Codd 1970|p=380–381}} Thus, an atomic value is any value which is not a relation; atomic values cannot be decomposed using [[relational algebra]] operations like selection or projection. Precisely, Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the [[DBMS]] (excluding certain special functions)"{{r|Codd 1990}}{{rp|page=6}} and states that "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS".{{r|Codd 1990}}{{Page needed|date=May 2025}}
 
In a relation, each attribute (or [[Column (database)|''column'']]) has a set of allowed values known as its [[Attribute domain|domain]] (e.g., a "Price" attribute's domain may be the set of non-negative numbers with up to 2 fractional digits). Each tuple (or [[Row (database)|''row'']]) in the relation contains one value per attribute, and each must be an element in that attribute's domain. Codd distinguishes attributes which have "simple domains" containing only atomic data from attributes with "nonsimple domains" containing at least some forms of compound data.{{r|Codd 1970}}{{rp|page=380}} Nonsimple domains introduce a degree of structural complexity which can be difficult to navigate, to query and to update – for instance, it will be time-consuming to operate across several [[Nested table|nested relations]] (that is, tables containing further tables), which can be found in some [[non-relational database]]s.
 
First normal form therefore requires all attribute domains to be ''simple'' domains, such that the data in each field is atomic and no relation has relation-valued attributes. Precisely, Codd states that, in the relational model, "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS."<ref name="Codd 1990" />{{rp|page=6}} Normalization to 1NF is thus a process of eliminating nonsimple domains from all relations.
 
==Examples==
===Design that violates 1NF===
This table of customers' credit card transactions does not conform to first normal form, as each customer corresponds to a repeating group of transactions. Such a design can be represented in a [[hierarchical database]], but not in an SQL database, since SQL does not support nested tables.
 
{{Table alignment}}
{| class="wikitable col1right"
|+ Customer
! <u>CustomerID</u> !! Name !! Transactions
|-
| 1 || Abraham
||
{{Table alignment}}
{| class="wikitable col1right col3right"
! <u>TransactionID</u> !! Date !! Amount
|-
| 12890 || 2003-10-14 || &minus;87
|-
| 12904 || 2003-10-15 || &minus;50
|}
|-
| 2 || Isaac
||
{{Table alignment}}
{| class="wikitable col1right col3right"
! <u>TransactionID</u> !! Date !! Amount
|-
| 12898 || 2003-10-14 || &minus;21
|}
|-
| 3 || Jacob
||
{{Table alignment}}
{| class="wikitable col1right col3right"
! <u>TransactionID</u> !! Date !! Amount
|-
| 12907 || 2003-10-15 || &minus;18
|-
| 14920 || 2003-11-20 || &minus;70
|-
| 15003 || 2003-11-27 || &minus;60
|}
|}
Line 76 ⟶ 72:
For example, to find the monetary sum of all transactions that occurred in October 2003 for all customers, the [[database management system]] (DBMS) would first have to unpack the Transactions field of each customer, then sum the Amount of each transaction thus obtained where the Date of the transaction falls in October 2003.
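
The two-step unpacking described above can be sketched in Python. This is only an illustration: the nested customer records are modelled as plain dictionaries, and the field names and the function <code>october_2003_total</code> are hypothetical, not part of any DBMS.

```python
from datetime import date

# Unnormalized design: each customer record embeds its own repeating group
# of transactions (values taken from the example table above).
customers = [
    {"customer_id": 1, "name": "Abraham", "transactions": [
        {"transaction_id": 12890, "date": date(2003, 10, 14), "amount": -87},
        {"transaction_id": 12904, "date": date(2003, 10, 15), "amount": -50},
    ]},
    {"customer_id": 2, "name": "Isaac", "transactions": [
        {"transaction_id": 12898, "date": date(2003, 10, 14), "amount": -21},
    ]},
    {"customer_id": 3, "name": "Jacob", "transactions": [
        {"transaction_id": 12907, "date": date(2003, 10, 15), "amount": -18},
        {"transaction_id": 14920, "date": date(2003, 11, 20), "amount": -70},
        {"transaction_id": 15003, "date": date(2003, 11, 27), "amount": -60},
    ]},
]


def october_2003_total(customers):
    """Sum all amounts dated October 2003 across every customer."""
    total = 0
    for customer in customers:                 # step 1: visit each customer
        for tx in customer["transactions"]:    # step 2: unpack the nested group
            if tx["date"].year == 2003 and tx["date"].month == 10:
                total += tx["amount"]
    return total


print(october_2003_total(customers))  # -176
```

The inner loop is the extra work imposed by the nested design: no transaction can be reached without first going through its parent customer.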
 
===Design that complies with 1NF===
Codd described how a database like this could be made less structurally complex and more flexible by transforming it into a [[relational database]] in first normal form. To normalize the table so it complies with first normal form, attributes with nonsimple domains must be extracted to separate, stand-alone relations. Each extracted relation gains a [[foreign key]] referencing the [[primary key]] of the relation which initially contained it. This process can be applied recursively to nonsimple domains nested in multiple levels (i.e., domains containing tables within tables within tables, and so on).{{r|Codd 1970}}{{rp|pages=380&ndash;381}}
 
In this example, CustomerID is the primary key of the containing relation and will therefore be appended as a foreign key to the new relation:
 
{{Col-float}}
{{Col-float-break|style=margin-right: 20px;}}
{{Table alignment}}
{| class="wikitable col1right"
|+ Customer
! <u>CustomerID</u> !! Name
|-
| 1 || Abraham
|-
| 2 || Isaac
|-
| 3 || Jacob
|}
{{Col-float-break}}
 
{{Table alignment}}
{| class="wikitable col1right col2right col4right"
|+ Transaction
! <u>CustomerID</u> !! <u>TransactionID</u> !! Date !! Amount
|-
| 1 || 12890 || 2003-10-14 || &minus;87
Line 108:
| 3 || 15003 || 2003-11-27 || &minus;60
|}
{{Col-float-end}}
 
In this modified design, the primary key is {CustomerID} in the first relation and {CustomerID, TransactionID} in the second relation.
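
The extraction step can be sketched as a small Python function. This is a minimal illustration under stated assumptions: the nested input is a list of dictionaries, and the names <code>normalize</code>, <code>nested_customers</code> and the field names are hypothetical.

```python
# Nested (non-1NF) input: each customer carries a repeating group.
nested_customers = [
    {"customer_id": 1, "name": "Abraham", "transactions": [
        {"transaction_id": 12890, "date": "2003-10-14", "amount": -87},
        {"transaction_id": 12904, "date": "2003-10-15", "amount": -50},
    ]},
    {"customer_id": 2, "name": "Isaac", "transactions": [
        {"transaction_id": 12898, "date": "2003-10-14", "amount": -21},
    ]},
    {"customer_id": 3, "name": "Jacob", "transactions": [
        {"transaction_id": 12907, "date": "2003-10-15", "amount": -18},
        {"transaction_id": 14920, "date": "2003-11-20", "amount": -70},
        {"transaction_id": 15003, "date": "2003-11-27", "amount": -60},
    ]},
]


def normalize(nested):
    """Split nested customer records into two flat relations.

    The parent's primary key (customer_id) is copied into every extracted
    transaction row, where it acts as a foreign key.
    """
    customer_rows = []
    transaction_rows = []
    for customer in nested:
        customer_rows.append((customer["customer_id"], customer["name"]))
        for tx in customer["transactions"]:
            transaction_rows.append(
                (customer["customer_id"], tx["transaction_id"],
                 tx["date"], tx["amount"])
            )
    return customer_rows, transaction_rows


customer_rows, transaction_rows = normalize(nested_customers)
for row in transaction_rows:
    print(row)
```

Every row of the second relation is identified by the pair (customer_id, transaction_id), matching the composite primary key stated above. Deeper nesting would be handled by applying the same extraction to each level in turn.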
 
Now that a single, "top-level" relation contains all transactions, it will be simpler to run queries on the database. To find the monetary sum of all October transactions, the DBMS would simply find all rows with a Date falling in October and sum the Amount fields. All values are now easily exposed to the DBMS, whereas previously some values were embedded in lower-level structures that had to be handled specially. Accordingly, the normalized design lends itself well to general-purpose query processing, whereas the unnormalized design does not.
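
As a rough sketch of how the normalized design might be queried, the following uses Python's built-in <code>sqlite3</code> module with an in-memory database. The column types and the choice of SQLite are assumptions made for illustration; the table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Customer (
    CustomerID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL
);
CREATE TABLE "Transaction" (
    CustomerID    INTEGER NOT NULL REFERENCES Customer(CustomerID),
    TransactionID INTEGER NOT NULL,
    Date          TEXT    NOT NULL,   -- ISO 8601 dates compare correctly as text
    Amount        INTEGER NOT NULL,
    PRIMARY KEY (CustomerID, TransactionID)
);
""")

conn.executemany("INSERT INTO Customer VALUES (?, ?)",
                 [(1, "Abraham"), (2, "Isaac"), (3, "Jacob")])
conn.executemany('INSERT INTO "Transaction" VALUES (?, ?, ?, ?)', [
    (1, 12890, "2003-10-14", -87),
    (1, 12904, "2003-10-15", -50),
    (2, 12898, "2003-10-14", -21),
    (3, 12907, "2003-10-15", -18),
    (3, 14920, "2003-11-20", -70),
    (3, 15003, "2003-11-27", -60),
])

# One flat scan over a single relation: no nested field has to be unpacked.
october_total = conn.execute(
    'SELECT SUM(Amount) FROM "Transaction" '
    "WHERE Date >= '2003-10-01' AND Date < '2003-11-01'"
).fetchone()[0]

print(october_total)  # -176
```

The table name is quoted because <code>TRANSACTION</code> is an SQL keyword.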
 
It is worth noting that the revised design also meets the additional requirements for [[second normal form|second]] and [[third normal form]].
 
== Rationale and drawbacks ==
{{Confusing section|date=May 2025}}
 
Normalization to 1NF is the major theoretical component of transferring a database to the [[relational model]]. Use of a relational database in 1NF brings certain advantages:
 
* It enables data to be stored in regular [[two-dimensional array]]s; supporting nested relations would require more complex data structures.{{r|Codd 1970}}{{rp|page=381}}
* It allows for the use of a simpler [[query language]], like [[SQL]], since any data item can be identified using only a relation name, attribute name and key; addressing nested data items would require a more complex language with support for hierarchical data paths.
* Representing relationships using foreign keys is more flexible and allows for features such as [[Many-to-many (data model)|many-to-many]] relationships, while a hierarchical model can represent only [[One-to-one (data model)|one-to-one]] or [[One-to-many (data model)|one-to-many]] relationships.
* Since locating data items is not coupled to a parent–child hierarchy, a database in 1NF creates greater [[data independence]] and is more resilient to structural changes over time.{{Clarify|date=May 2025}}
* From 1NF, further normalization becomes possible (for example to [[Second normal form|2NF]] or [[Third normal form|3NF]]), which can reduce data redundancy and anomalies.
 
The use of 1NF also comes with certain drawbacks:
* Performance worsens for certain operations. In a hierarchical model, nested records are physically stored after the parent record, which means a whole subtree can be retrieved in a single read operation. In 1NF, this will require a join operation per record type, which can be costly, especially for complex trees. For this reason, [[document-oriented database]]s eschew 1NF.
* [[Object-oriented language]]s represent runtime state as trees or [[directed graph]]s of objects connected by [[Pointer (computer programming)|pointers]] or references. This does not map cleanly to a 1NF relational database, creating a gap sometimes called the [[object–relational impedance mismatch]], which [[object–relational mapping]] (ORM) libraries try to bridge.
* 1NF has been interpreted as not allowing complex data types for values. This is open to interpretation, however, and [[Christopher J. Date]] has argued that values can be arbitrarily complex objects.{{citation-needed|date=June 2023}}

== Controversy about atomic values ==
1NF disallows relations as attribute values but does not otherwise constrain what kinds of values are permitted. This has led to some discussion about the extent to which compound or complex values other than relations (such as [[Array (data structure)|arrays]] or [[XML]] data) are permitted in 1NF.{{Citation needed|date=May 2025}}

Codd states that relations are the only type of compound data allowed within the relational model (though not in attribute domains), since any additional type of compound data would add complexity without adding power; nevertheless, the model specifically allows "certain special functions" like <code>SUBSTRING</code> to decompose values otherwise considered atomic.{{r|Codd 1990}}{{rp|pages=6, 340}}
 
[[Hugh Darwen]] and [[Christopher J. Date]] have suggested that Codd's concept of an "atomic value" is ambiguous, and that this ambiguity has led to widespread confusion about how 1NF should be understood.<ref>Darwen, Hugh. "Relation-Valued Attributes; or, Will the Real First Normal Form Please Stand Up?", in C. J. Date and Hugh Darwen, ''Relational Database Writings 1989-1991'' (Addison-Wesley, 1992).</ref><ref>{{cite book |last=Date |first=C. J. |author-link=Christopher J. Date |chapter=Chapter 8: What First Normal Form Really Means |date=2007 |title=Date on Database: Writings 2000–2006 |publisher=Apress |isbn=978-1-4842-2029-0 |page=108 |quote='[F]or many years,' writes Date, 'I was as confused as anyone else. What's worse, I did my best (worst?) to spread that confusion through my writings, seminars, and other presentations.'}}</ref> In particular, the notion of an atomic value as a "value that cannot be decomposed" is problematic, as it would seem to imply that few, if any, data types are atomic:
*A [[String (computer science)|character string]] would seem not to be atomic, as an RDBMS typically provides operators to decompose it into [[substring]]s.
*A [[Fixed-point arithmetic|fixed-point]] number would seem not to be atomic, as an RDBMS typically provides operators to decompose it into integer and fractional components.
* An [[ISBN]] would seem not to be atomic, as it includes various parts, including the ''registration group'', ''registrant'' and ''publication'' elements.
Date suggests that "the notion of atomicity ''has no absolute meaning''":<ref name="Date 2007">{{cite book |last=Date |first=C. J. |author-link=Christopher J. Date |chapter=Chapter 8: What First Normal Form Really Means |date=2007 |title=Date on Database: Writings 2000–2006 |publisher=Apress |isbn=978-1-4842-2029-0}}</ref>{{rp|page=112}}<ref>{{cite book |last=Date |first=C. J. |author-link=Christopher J. Date |url=https://books.google.com/books?id=BCjkCgAAQBAJ&pg=PA50 |title=SQL and Relational Theory: How to Write Accurate SQL Code |date=6 November 2015 |publisher=O'Reilly Media |isbn=978-1-4919-4115-7 |pages=50– |access-date=31 October 2018}}</ref>{{Pages needed|date=May 2025}} a value may be considered atomic for some purposes, but may be considered an assemblage of more basic elements for other purposes. If this position is accepted, 1NF cannot be defined with reference to atomicity. Columns containing any conceivable data type (from strings and numeric types to [[Array data structure|arrays]] and tables) are then acceptable in a 1NF table,{{Citation needed|date=May 2025}} although perhaps not always desirable; for example, it may be more desirable to separate a CustomerName column into two separate columns, FirstName and Surname.
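
The three examples above can be illustrated with a short Python sketch. It is not drawn from Date's or Darwen's writings: the sample values and variable names are hypothetical, and ordinary Python operations stand in for the decomposition operators an RDBMS might provide.

```python
from decimal import Decimal

# A character string decomposes into substrings.
customer_name = "Edgar Codd"
first_name, surname = customer_name.split(" ", 1)

# A fixed-point number decomposes into integer and fractional components.
price = Decimal("19.99")
integer_part = int(price)                 # truncates toward zero
fractional_part = price - integer_part

# A hyphenated ISBN-13 decomposes into its constituent elements.
isbn = "978-0-201-14192-4"
prefix, registration_group, registrant, publication, check_digit = isbn.split("-")

print(first_name, surname)
print(integer_part, fractional_part)
print(registration_group, registrant, publication)
```

Each value is treated as a single unit when stored or compared, yet each can be taken apart when a particular purpose calls for it, which is the sense in which atomicity depends on context.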
 
==Christopher J. Date's definition of 1NF==
{{Importance section|date=May 2025}}
According to [[Christopher J. Date]]'s definition, a table is in first normal form if and only if it is "[[isomorphism|isomorphic]] to some relation", which means, specifically, that it satisfies the following five conditions:{{r|Date 2007}}{{rp|pages=127&ndash;128}}
 
# There is no specific top-to-bottom ordering of the rows.
Line 153 ⟶ 147:
Violation of any of these conditions would mean that the table is not strictly relational, and therefore that it is not in first normal form.
 
This definition of 1NF permits relation-valued attributes (tables within tables), which Date argues are useful in rare cases.{{r|Date 2007}}{{rp|pages=121&ndash;126}} Examples of tables (or [[view (database)|views]]) that would not meet this definition of first normal form are:
 
*A table that lacks a [[unique key]] [[constraint (database)|constraint]]. Such a table would be able to accommodate duplicate rows, in violation of condition 3.
*A view whose definition mandates that results be returned in a particular order, so that the row-ordering is an intrinsic and meaningful aspect of the view, in violation of condition 1. The [[tuple]]s in true relations are not ordered with respect to each other (such views cannot be created using [[SQL]] that conforms to the [[SQL:2003]] standard).
*A table with at least one [[Null (SQL)|nullable]] attribute. A nullable attribute would be in violation of condition 4, which requires every row-and-column intersection to contain exactly one value from the column's domain. This aspect of condition 4 is controversial; it marks an important departure from Codd's later vision of the [[relational model]],<ref>{{cite book |last=Date |first=C. J. |author-link=Christopher J. Date |year=2009 |title=SQL and Relational Theory |publisher=O'Reilly |chapter=Appendix A.2 |quote=Codd first defined the relational model in 1969 and didn't introduce nulls until 1979}}</ref> which made explicit provision for nulls.<ref>{{cite magazine |last=Codd |first=E. F. |author-link=Edgar F. Codd |date=14 October 1985 |title=Is Your DBMS Really Relational? |magazine=Computerworld |quote=Null values ... [must be] supported in a fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type.}} (the third of Codd's 12 rules)</ref>
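
The first and third cases can be sketched with Python's built-in <code>sqlite3</code> module. The table names <code>Loose</code> and <code>Keyed</code> and the sample rows are hypothetical, and SQLite is used only as a convenient stand-in for an SQL DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No key constraint and nullable columns: duplicate rows and nulls are accepted.
conn.execute("CREATE TABLE Loose (Name TEXT, Amount INTEGER)")
conn.executemany("INSERT INTO Loose VALUES (?, ?)",
                 [("Abraham", -87), ("Abraham", -87), ("Isaac", None)])

duplicate_groups = conn.execute(
    "SELECT Name, Amount, COUNT(*) FROM Loose "
    "GROUP BY Name, Amount HAVING COUNT(*) > 1"
).fetchall()
null_count = conn.execute(
    "SELECT COUNT(*) FROM Loose WHERE Amount IS NULL"
).fetchone()[0]

# A key constraint plus NOT NULL makes the DBMS reject both kinds of row.
conn.execute("""
CREATE TABLE Keyed (
    Name   TEXT    NOT NULL,
    Amount INTEGER NOT NULL,
    PRIMARY KEY (Name, Amount)
)
""")
conn.execute("INSERT INTO Keyed VALUES ('Abraham', -87)")

rejected = []
for row in [("Abraham", -87), ("Isaac", None)]:
    try:
        conn.execute("INSERT INTO Keyed VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

print(duplicate_groups)  # [('Abraham', -87, 2)]
print(null_count)        # 1
print(rejected)          # [('Abraham', -87), ('Isaac', None)]
```

Under Date's definition the <code>Loose</code> table is not in first normal form, since it admits duplicate rows (condition 3) and nulls (condition 4), while the constraints on <code>Keyed</code> rule both out.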
 
==See also==
Line 163 ⟶ 157:
*[[Second normal form]] (2NF)
*[[Third normal form]] (3NF)
*[[Boyce–Codd normal form]] (BCNF or 3.5NF)
*[[Fourth normal form]] (4NF)
*[[Fifth normal form]] (5NF)
Line 172 ⟶ 167:
==Further reading==
{{Refbegin}}
* Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. IBM Research Laboratory, San Jose, California.
* Codd, E. F. (1971). Further Normalization of the Data Base Relational Model. Courant Computer Science Symposium 6 in Data Base Systems edited by Rustin, R.
* Date, C. J., & Lorentzos, N., & Darwen, H. (2002). ''[https://archive.today/20121209052842/http://www.elsevier.com/wps/product/cws_home/680662 Temporal Data & the Relational Model]'' (1st ed.). Morgan Kaufmann. {{ISBN|1-55860-855-9}}.
* Date, C. J. (1999), ''[https://web.archive.org/web/20050404010227/http://www.aw-bc.com/catalog/academic/product/0,1144,0321197844,00.html An Introduction to Database Systems]'' (8th ed.). Addison-Wesley Longman. {{ISBN|0-321-19784-4}}.
* Kent, W. (1983) ''[http://www.bkent.net/Doc/simple5.htm A Simple Guide to Five Normal Forms in Relational Database Theory]'', ''Communications of the ACM'', vol. 26, pp.&nbsp;120–125.
{{Refend}}