{{Short description|Process of ensuring computer data is both correct and useful}}
{{redirect|Input validation||Improper input validation}}
{{more citations needed|date=November 2016}}
In [[computing]], '''data validation''' or '''input validation''' is the process of ensuring [[data]] has undergone [[data cleansing]] to confirm it has [[data quality]], that is, that it is both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a [[data dictionary]], or by the inclusion of explicit [[application program]] validation logic.
 
This is distinct from [[formal verification]], which attempts to prove or disprove the correctness of algorithms for implementing a specification or property.
 
==Overview==
Data validation is intended to provide certain well-defined guarantees for fitness and [[data consistency|consistency of data]] in an application or automated system. Data validation rules can be defined and designed using various methodologies, and be deployed in various contexts.<ref>[https://ec.europa.eu/eurostat/cros/system/files/methodology_for_data_validation_v1.0_rev-2016-06_final.pdf Methodology for data validation 1.0]</ref> Their implementation can use [[declarative programming|declarative]] [[data integrity]] rules, or [[imperative programming|procedure-based]] [[business rules]].<ref>[http://msdn.microsoft.com/en-us/library/aa291820(VS.71).aspx Data Validation, Data Integrity, Designing Distributed Applications with Visual Studio .NET]</ref>
 
The guarantees of data validation do not necessarily include accuracy, and it is possible for [[data entry]] errors such as misspellings to be accepted as valid. Other clerical and/or computer controls may be applied to reduce inaccuracy within a system.
 
==Different kinds==
In evaluating the basics of data validation, generalizations can be made regarding the different kinds of validation according to their scope, complexity, and purpose.
 
For example:
* Data type validation;
* Range and constraint validation;
* Code and cross-reference validation;
* [[Structure validation|Structured validation]]; and
* Consistency validation
 
===Data-type check===
Data type validation is customarily carried out on one or more simple data fields.
 
The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known [[Primitive data type|primitive data types]] as defined in a programming language or data storage and retrieval mechanism.
 
For example, an integer field may require input to use only characters 0 through 9.
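
A minimal sketch of such a check in Python; the function name, error message and the decision to accept only unsigned integers are illustrative assumptions, not part of any standard:

<syntaxhighlight lang="python">
def validate_integer_field(raw: str) -> int:
    """Data-type check: every character of the input must be one of 0-9."""
    if not raw or any(ch not in "0123456789" for ch in raw):
        raise ValueError(f"not a valid unsigned integer: {raw!r}")
    return int(raw)


for value in ("1024", "1O24"):      # the second value contains the letter 'O'
    try:
        print(value, "->", validate_integer_field(value))
    except ValueError as error:
        print(value, "->", error)
</syntaxhighlight>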
 
===Simple range and constraint check===
Simple range and constraint validation may examine input for consistency with a minimum/maximum range, or consistency with a test for evaluating a sequence of characters, such as one or more tests against regular expressions. For example, a counter value may be required to be a non-negative integer, and a password may be required to meet a minimum length and contain characters from multiple categories.
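
A short sketch of both kinds of check in Python; the specific limits (a non-negative counter, an eight-character password containing letters and digits) are arbitrary examples rather than recommendations:

<syntaxhighlight lang="python">
import re


def validate_counter(value: int) -> None:
    """Range check: the counter must be a non-negative integer."""
    if not isinstance(value, int) or value < 0:
        raise ValueError(f"counter out of range: {value!r}")


def validate_password(password: str) -> None:
    """Constraint check: minimum length plus characters from several categories."""
    if len(password) < 8:
        raise ValueError("password shorter than 8 characters")
    if not re.search(r"[A-Za-z]", password) or not re.search(r"[0-9]", password):
        raise ValueError("password must contain both letters and digits")


validate_counter(3)              # passes silently
validate_password("s3cretPass")  # passes silently
</syntaxhighlight>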
 
===Code and cross-reference check===
Code and cross-reference validation includes operations to verify that data is consistent with one or more possibly-external rules, requirements, or collections relevant to a particular organization, context or set of underlying assumptions. These additional validity constraints may involve cross-referencing supplied data with a known look-up table or directory information service such as [[LDAP]].
 
For example, a user-provided country code might be required to identify a current geopolitical region.
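
A sketch of a cross-reference check against a small look-up table in Python; the hard-coded set stands in for an externally maintained source such as a database table or an [[LDAP]] directory, and the codes shown are only a sample:

<syntaxhighlight lang="python">
# In practice this table would be loaded from a maintained external source,
# not written as a literal in the program.
KNOWN_COUNTRY_CODES = {"DE", "FR", "GB", "JP", "US"}


def validate_country_code(code: str) -> str:
    """Cross-reference check: the supplied code must appear in the look-up table."""
    normalized = code.strip().upper()
    if normalized not in KNOWN_COUNTRY_CODES:
        raise ValueError(f"unknown country code: {code!r}")
    return normalized


print(validate_country_code("us"))   # prints 'US'
</syntaxhighlight>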
 
===Structured check===
Structured validation allows for the combination of other kinds of validation, along with more complex processing. Such complex processing may include the testing of conditional constraints for an entire complex data object or set of process operations within a system.
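
A sketch of a structured check in Python; the order fields and the individual rules are illustrative assumptions, the point being that several simpler validations are combined over one complex object:

<syntaxhighlight lang="python">
def validate_order(order: dict) -> list[str]:
    """Structured check: apply several simpler validations to one object and
    collect every problem instead of stopping at the first."""
    problems = []
    if not order.get("items"):
        problems.append("an order must contain at least one item")
    for item in order.get("items", []):
        if item["quantity"] <= 0:
            problems.append(f"quantity must be positive for {item['sku']}")
    # A conditional constraint that spans several fields of the same object.
    if order.get("status") == "shipped" and not order.get("tracking_number"):
        problems.append("a shipped order needs a tracking number")
    return problems


print(validate_order({"items": [{"sku": "A-1", "quantity": 2}], "status": "shipped"}))
# ['a shipped order needs a tracking number']
</syntaxhighlight>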
 
===Consistency check===
Consistency validation ensures that data is logical. For example, the delivery date of an order can be prohibited from preceding its shipment date.
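
A sketch of that rule in Python; the function and field names are illustrative:

<syntaxhighlight lang="python">
from datetime import date


def validate_order_dates(ship_date: date, delivery_date: date) -> None:
    """Consistency check: an order cannot be delivered before it is shipped."""
    if delivery_date < ship_date:
        raise ValueError("delivery date precedes ship date")


validate_order_dates(date(2023, 5, 2), date(2023, 5, 6))   # passes silently
</syntaxhighlight>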
 
===Example===
Multiple kinds of data validation are relevant to 10-digit pre-2007 [[ISBN]]s (the 2005 edition of ISO 2108 required ISBNs to have 13 digits from 2007 onwards<ref>[http://www.lac-bac.gc.ca/iso/tc46sc9/isbn.htm ''Frequently Asked Questions about the new ISBN standard''] {{Webarchive|url=https://web.archive.org/web/20070610160919/http://www.lac-bac.gc.ca/iso/tc46sc9/isbn.htm |date=2007-06-10 }} [[International Organization for Standardization|ISO]].</ref>).
 
* Size. A pre-2007 ISBN must consist of 10 digits, with optional hyphens or spaces separating its four parts.
* Format checks. Each of the first 9 digits must be 0 through 9, and the 10th must be either 0 through 9 or an ''X''.
* [[Check digit]]. To detect transcription errors in which digits have been altered or transposed, the last digit of a pre-2007 ISBN must match the result of a mathematical formula incorporating the other 9 digits ([[International Standard Book Number#ISBN-10 check digits|ISBN-10 check digits]]).
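
The three checks above can be combined into a short Python sketch; it follows the weighted modulus-11 rule described at [[International Standard Book Number#ISBN-10 check digits|ISBN-10 check digits]], and strips optional hyphens and spaces before checking:

<syntaxhighlight lang="python">
import re


def validate_isbn10(raw: str) -> bool:
    """Size, format and check-digit validation for a 10-digit (pre-2007) ISBN."""
    # Size check: exactly 10 characters once hyphens and spaces are removed.
    isbn = raw.replace("-", "").replace(" ", "")
    if len(isbn) != 10:
        return False
    # Format check: nine digits followed by a digit or the letter X.
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    # Check digit: the sum weighted 10 down to 1 must be divisible by 11,
    # with a final 'X' standing for the value 10.
    values = [10 if ch == "X" else int(ch) for ch in isbn]
    weighted_sum = sum(weight * value
                       for weight, value in zip(range(10, 0, -1), values))
    return weighted_sum % 11 == 0


print(validate_isbn10("0-306-40615-2"))   # True  (valid check digit)
print(validate_isbn10("0-306-40615-3"))   # False (check digit does not match)
</syntaxhighlight>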
 
==Validation types==
;Allowed character checks
:Checks to ascertain that only expected characters are present in a field. For example, a numeric field may only allow the digits 0–9, the decimal point and perhaps a minus sign or commas. A text field such as a personal name might disallow characters used for [[Markup language|markup]]. An e-mail address might require at least one @ sign and various other structural details. [[Regular expression]]s are an effective way to implement such checks.
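
A sketch of such checks with regular expressions in Python; which characters each field accepts is an illustrative policy choice, not a fixed rule:

<syntaxhighlight lang="python">
import re

# Each pattern describes the complete set of characters a field may contain.
ALLOWED = {
    # Digits with an optional minus sign, thousands commas and two decimals.
    "amount": re.compile(r"-?[0-9][0-9,]*(\.[0-9]{1,2})?"),
    # Letters, spaces, hyphens and apostrophes; markup characters are excluded.
    "first_name": re.compile(r"[A-Za-z' -]{1,60}"),
    # A deliberately loose structural check: something@something.something
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def allowed_characters_ok(field: str, value: str) -> bool:
    """Allowed-character check: the entire value must match the field's pattern."""
    return ALLOWED[field].fullmatch(value) is not None


print(allowed_characters_ok("first_name", "O'Neill"))    # True
print(allowed_characters_ok("first_name", "<script>"))   # False
</syntaxhighlight>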
 
;Batch totals
:Checks for missing records. Numerical fields may be added together for all records in a batch. The batch total is entered and the computer checks that the total is correct, e.g., add the 'Total Cost' field of a number of transactions together.
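
A sketch of a batch-total check in Python; the field name and the use of the standard-library decimal type for monetary values are illustrative:

<syntaxhighlight lang="python">
from decimal import Decimal


def batch_total_ok(records: list[dict], declared_total: Decimal) -> bool:
    """Batch-total check: the computed sum of 'Total Cost' over the batch
    must equal the control total entered with the batch."""
    computed = sum(Decimal(record["Total Cost"]) for record in records)
    return computed == declared_total


batch = [{"Total Cost": "19.99"}, {"Total Cost": "5.01"}, {"Total Cost": "100.00"}]
print(batch_total_ok(batch, Decimal("125.00")))   # True
print(batch_total_ok(batch, Decimal("120.00")))   # False, e.g. a record is missing
</syntaxhighlight>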
 
;Cardinality check
:Checks that a record has a valid number of related records. For example, if a contact record is classified as "customer" then it must have at least one associated order (cardinality > 0). This type of rule can be complicated by additional conditions. For example, if a contact record in a payroll database is classified as "former employee" then it must not have any associated salary payments after the separation date (cardinality = 0).
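
A sketch of the first of those rules in Python; the record layout (a contact dictionary carrying a list of its orders) is an assumed structure used only for illustration:

<syntaxhighlight lang="python">
def cardinality_ok(contact: dict) -> bool:
    """Cardinality check: a contact classified as a customer needs at least one order."""
    if contact["classification"] == "customer":
        return len(contact["orders"]) > 0
    return True


print(cardinality_ok({"classification": "customer", "orders": ["A-1001"]}))  # True
print(cardinality_ok({"classification": "customer", "orders": []}))          # False
</syntaxhighlight>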
 
;Check digits
:Used for numerical data. To support error detection, an extra digit, calculated from the other digits, is appended to a number.
 
;Consistency checks
:Checks fields to ensure data in these fields correspond, e.g., if expiration date is in the past then status is not "active".
 
;Cross-system consistency checks
:Compares data in different systems to ensure it is consistent. Systems may represent the same data differently, in which case comparison requires transformation (e.g., one system may store customer name in a single Name field as 'Doe, John Q', while another uses First_Name 'John' and Last_Name 'Doe' and Middle_Name 'Quality').
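
A sketch of such a comparison in Python; the transformation shown (splitting a combined "Last, First Middle-initial" field) is an illustrative assumption about how the two systems differ:

<syntaxhighlight lang="python">
def names_consistent(single_field: str, first: str, last: str, middle: str) -> bool:
    """Cross-system consistency check: transform 'Doe, John Q' before comparing
    it with separate First_Name, Last_Name and Middle_Name fields."""
    last_part, _, rest = single_field.partition(",")
    parts = rest.split()
    first_part = parts[0] if parts else ""
    middle_initial = parts[1] if len(parts) > 1 else ""
    return (first_part == first
            and last_part.strip() == last
            and (middle_initial == "" or middle.startswith(middle_initial)))


print(names_consistent("Doe, John Q", "John", "Doe", "Quality"))   # True
</syntaxhighlight>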
 
;Data type checks
:Checks input conformance with typed data. For example, an input box accepting numeric data may reject the letter 'O'.
 
;File existence check
:Checks that a file with a specified name exists. This check is essential for programs that use file handling.
 
;Format check
:Checks that the data is in a specified format (template), e.g., dates have to be in the format YYYY-MM-DD. Regular expressions may be used for this kind of validation.
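
A sketch of a date format check in Python, using the standard library rather than a hand-written regular expression; the YYYY-MM-DD template follows the example above:

<syntaxhighlight lang="python">
from datetime import datetime


def valid_iso_date(value: str) -> bool:
    """Format check: the value must match the template YYYY-MM-DD and also
    be a real calendar date."""
    if len(value) != 10:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


print(valid_iso_date("2024-02-29"))   # True  (2024 is a leap year)
print(valid_iso_date("29/02/2024"))   # False (wrong template)
</syntaxhighlight>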
 
;Presence check
:Checks that data is present, e.g., customers may be required to have an email address.
 
;Range check
:Checks that the data is within a specified range of values, e.g., a probability must be between 0 and 1.
 
;Referential integrity
:Values in two relational [[database]] tables can be linked through a foreign key and a primary key. If values in the foreign key field are not constrained by internal mechanisms, then they should be validated to ensure that the referencing table always refers to a row in the referenced table.
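
A sketch of that check in Python over two in-memory tables; in a real database the same guarantee is normally declared as a foreign-key constraint instead:

<syntaxhighlight lang="python">
def dangling_foreign_keys(orders: list[dict], customers: list[dict]) -> list[dict]:
    """Referential-integrity check: every order must refer to an existing customer."""
    known_ids = {customer["id"] for customer in customers}
    return [order for order in orders if order["customer_id"] not in known_ids]


customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
orders = [{"order_no": 501, "customer_id": 1}, {"order_no": 502, "customer_id": 9}]
print(dangling_foreign_keys(orders, customers))
# [{'order_no': 502, 'customer_id': 9}]
</syntaxhighlight>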
 
;Spelling and grammar check
:Looks for spelling and grammatical errors.
 
;Uniqueness check
:Checks that each value is unique. This can be applied to a combination of several fields (e.g., Address, First Name, Last Name).
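
A sketch of a uniqueness check over a combination of fields in Python; the choice of key fields is illustrative:

<syntaxhighlight lang="python">
def duplicate_keys(records: list[dict], key_fields: tuple) -> set:
    """Uniqueness check: report key combinations that occur more than once."""
    seen, duplicates = set(), set()
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


people = [
    {"First Name": "Ann", "Last Name": "Lee", "Address": "1 Elm St"},
    {"First Name": "Ann", "Last Name": "Lee", "Address": "1 Elm St"},
]
print(duplicate_keys(people, ("First Name", "Last Name", "Address")))
# {('Ann', 'Lee', '1 Elm St')}
</syntaxhighlight>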
 
;Table look up check
:A table look up check compares data to a collection of allowed values.
 
===Post-validation actions===
{{More citations needed section|date=July 2012}}
;Enforcement Action
:Enforcement action typically rejects the data entry request and requires the input actor to make a change that brings the data into compliance. This is most suitable for interactive use, where a real person is sitting at the computer and making entries. It also works well for batch upload, where a file input may be rejected and a set of messages sent back to the input source explaining why the data was rejected.
:Another form of enforcement action involves automatically changing the data and saving a conformant version instead of the original version. This is most suitable for cosmetic changes. For example, converting an all-caps entry to a mixed-case entry does not need user input. Automatic enforcement is inappropriate where it leads to loss of business information: for example, silently saving a truncated comment when the input is longer than expected may discard significant data.
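
A sketch of the distinction in Python; the specific rules (reformatting an all-caps name, refusing to truncate a long comment) are illustrative assumptions:

<syntaxhighlight lang="python">
def enforce_name(raw: str) -> str:
    """Automatic enforcement of a cosmetic rule: an all-caps name is reformatted
    rather than rejected, because no information is lost."""
    return raw.title() if raw.isupper() else raw


def enforce_comment(raw: str, max_length: int = 500) -> str:
    """Rejecting enforcement: silently truncating would lose business information,
    so an over-long comment is sent back to the input actor instead."""
    if len(raw) > max_length:
        raise ValueError(f"comment longer than {max_length} characters; please shorten it")
    return raw


print(enforce_name("JOHN DOE"))   # 'John Doe'
</syntaxhighlight>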
 
;Advisory Action
:Advisory actions typically allow data to be entered unchanged but send a message to the source actor indicating those validation issues that were encountered. This is most suitable for non-interactive systems, for systems where the change is not business critical, for cleansing steps of existing data and for verification steps of an entry process.
 
;Verification Action
:Verification actions are special cases of advisory actions. In this case, the source actor is asked to verify that the data is really what they want to enter, in the light of a suggestion to the contrary. Here, the check step suggests an alternative (e.g., a check of a mailing address returns a different way of formatting that address or suggests a different address altogether). In this case, the user should be given the option of accepting the recommendation or keeping their version. By design this is not a strict validation process; it is useful for capturing addresses to a new ___location or to a ___location that is not yet supported by the validation databases.
 
;Log of validation
:Even in cases where data validation did not find any issues, providing a log of the validations that were conducted and their results is important. This is helpful in identifying missing data validation checks in light of later data issues, and in improving the validation process.
 
==Validation and security==
Failures or omissions in data validation can lead to [[data corruption]] or a [[software security vulnerability|security vulnerability]].<ref>[http://www.cgisecurity.com/owasp/html/ch10.html Chapter10. Data Validation]</ref> Data validation checks that data are fit for purpose,<ref>[https://web.archive.org/web/20171201042621/https://spotlessdata.com/blog/more-efficient-data-validation-spotless More Efficient Data Validation with Spotless]</ref> valid, sensible, reasonable and secure before they are processed.
 
== See also ==
* [[Data processing]]
* [[Data verification]]
* [[Triangulation (social science)]]
* [[Verification and validation]]
 
==References==
{{Reflist}}
 
== External links ==
* [https://www.owasp.org/index.php/Data_Validation Data Validation], [[OWASP]]
* [https://github.com/OWASP/CheatSheetSeries/blob/master/cheatsheets/Input_Validation_Cheat_Sheet.md Input Validation], OWASP Cheat Sheet Series, github.com
 
{{Data}}

{{DEFAULTSORT:Data Validation}}
[[Category:Data processing]]
[[Category:Data security]]
[[Category:Data quality]]
[[Category:Computer security]]