Data validation: Difference between revisions

{{Short description|Process of ensuring computer data is both correct and useful}}
{{redirect|Input validation||Improper input validation}}
{{more citations needed|date=November 2016}}
In [[computing]], '''data validation''' or '''input validation''' is the process of ensuring [[data]] has undergone [[data cleansing]] to confirm it has [[data quality]], that is, that it is both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for correctness, meaningfulness, and security of data that is input to the system. The rules may be implemented through the automated facilities of a [[data dictionary]],<ref>{{Cite web|title=Data dictionary}}</ref> or by the inclusion of explicit validation logic in the [[application program]].
 
This is distinct from [[formal verification]], which attempts to prove or disprove the correctness of algorithms for implementing a specification or property.
 
==Overview==
Data validation is intended to provide certain well-defined guarantees for fitness, accuracy, and [[data consistency|consistency of data]] in an application or automated system. Data validation rules can be defined and designed using various methodologies, and deployed in various contexts.<ref>[https://ec.europa.eu/eurostat/cros/system/files/methodology_for_data_validation_v1.0_rev-2016-06_final.pdf Methodology for data validation 1.0]</ref> Their implementation can use [[declarative programming|declarative]] [[data integrity]] rules, or [[imperative programming|procedure-based]] [[business rules]].<ref>[http://msdn.microsoft.com/en-us/library/aa291820(VS.71).aspx Data Validation, Data Integrity, Designing Distributed Applications with Visual Studio .NET]</ref>
 
Data validation rules may be defined, designed and deployed, for example:
 
'''Definition and design contexts:'''
* as part of the requirements-gathering phase in [[software engineering]] or when [[Specification (technical standard)|designing a software specification]]
* as part of an operations modeling phase in [[business process modeling|BPM]]
 
'''Deployment contexts:'''
* as a set of programs or business-logic routines in a [[programming language]]
* as a set of [[Stored procedure|stored-procedures]] in a [[database management system]]
 
The guarantees of data validation do not necessarily include accuracy, and it is possible for [[data entry]] errors such as misspellings to be accepted as valid. Other clerical and/or computer controls may be applied to reduce inaccuracy within a system.
For business applications, data validation can be defined through [[declarative programming|declarative]] [[data integrity]] rules, or [[imperative programming|procedure-based]] [[business rules]].<ref>[http://msdn.microsoft.com/en-us/library/aa291820(VS.71).aspx Data Validation, Data Integrity, Designing Distributed Applications with Visual Studio .NET]</ref> Data that does not conform to these rules will negatively affect business process execution. Therefore, data validation should start with the business process definition and the set of business rules within this process. Rules can be collected through the requirements capture exercise.<ref>Arkady Maydanchik (2007), "Data Quality Assessment", Technics Publications, LLC</ref>
 
==Different kinds of data validation==
In evaluating the basics of data validation, generalizations can be made regarding the different kinds of validation according to their scope, complexity, and purpose.
 
For example:
* Data type validation;
* Range and constraint validation;
* Code and cross-reference validation;
* [[Structure validation|Structured validation]]; and
* Consistency validation
 
===Data-type check===
Data type validation is customarily carried out on one or more simple data fields.
 
The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known [[Primitive data type|primitive data types]] as defined in a programming language or data storage and retrieval mechanism, such as integer, float (decimal), or string.
 
For example, an integer field may require input to use only characters 0 through 9.
Similarly, a telephone number field may allow digits along with the characters <code>+</code>, <code>-</code>, <code>(</code>, and <code>)</code> (plus, minus, and parentheses). A more sophisticated data validation routine would check that the user had entered a valid country code, i.e., that the number of digits entered matched the convention for the country or area specified.
 
A validation process involves two distinct steps: (a) the validation check and (b) the post-check action. The check step uses one or more computational rules (see section below) to determine if the data is valid. The post-check action sends feedback to help enforce validation.
 
===Simple range and constraint check===
Simple range and constraint validation may examine input for consistency with a minimum/maximum range, or consistency with a test for evaluating a sequence of characters, such as one or more tests against regular expressions. For example, a counter value may be required to be a non-negative integer, and a password may be required to meet a minimum length and contain characters from multiple categories.
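
The two examples above can be sketched in Python as follows. This is only an illustration: the function names, the four character categories, and the thresholds (8 characters, 3 categories) are assumptions chosen for the sketch, not part of any standard.

```python
import re

def is_valid_counter(value):
    """Range check: a counter must be a non-negative integer."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0

# Hypothetical character categories: lowercase, uppercase, digit, other.
CATEGORY_PATTERNS = [r"[a-z]", r"[A-Z]", r"[0-9]", r"[^a-zA-Z0-9]"]

def is_valid_password(password, min_length=8, min_categories=3):
    """Constraint check: minimum length plus characters from several categories."""
    if len(password) < min_length:
        return False
    matched = sum(1 for pattern in CATEGORY_PATTERNS if re.search(pattern, password))
    return matched >= min_categories
```

Each regular expression tests for the presence of one category, so the password rule reduces to counting how many of the tests succeed.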
 
===Code and cross-reference check===
Code and cross-reference validation includes tests for data type validation, combined with one or more operations to verify that the user-supplied data is consistent with one or more possibly external rules, requirements, or collections relevant to a particular organization, context, or set of underlying assumptions. These additional validity constraints may involve cross-referencing supplied data with a known look-up table or directory information service such as [[LDAP]].
 
For example, a user-provided country code might be required to identify a current geopolitical region.
As another example, an experienced user may enter a well-formed string that matches the specification for a valid e-mail address, as defined in RFC 5322<ref>(sections 3.2.3 and 3.4.1) and RFC 5321 – with a more readable form given in the informational RFC 3696</ref><ref>Written by J. Klensin, the author of RFC 5321</ref><ref>and the [http://www.rfc-editor.org/errata_search.php?rfc=3696 associated errata]</ref> but that well-formed string might not actually correspond to a resolvable domain connected to an active e-mail account.
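
A minimal Python sketch of the country code case combines a data type check with a look-up against a known collection. The set <code>KNOWN_COUNTRY_CODES</code> and the function name are hypothetical; a real system might query a database table or a directory service instead.

```python
# Hypothetical look-up collection, deliberately tiny for illustration.
KNOWN_COUNTRY_CODES = {"DE", "FR", "JP", "US"}

def validate_country_code(code):
    """Data type check followed by a cross-reference against known codes."""
    # Data type check: the input must be a string of exactly two ASCII letters.
    if not (isinstance(code, str) and len(code) == 2
            and code.isascii() and code.isalpha()):
        return False
    # Cross-reference check: the code must appear in the known collection.
    return code.upper() in KNOWN_COUNTRY_CODES
```

A value such as <code>"ZZ"</code> passes the data type check but fails the cross-reference, which is the distinction this kind of validation adds.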
 
===Structured check===
Structured validation allows for the combination of other kinds of validation, along with more complex processing. Such complex processing may include the testing of conditional constraints for an entire complex data object or set of process operations within a system.
 
A validation rule is a criterion or constraint used in the process of data validation, which is carried out after the data has been encoded onto an input medium and involves a data vet or validation program. This is distinct from [[formal verification]], where the operation of a program is determined to be that which was intended and to meet its purpose. The validation rule or check system still used by many major software manufacturers was designed by an employee at Microsoft sometime between 1997 and 1999.
 
The method is to check that data follows the appropriate parameters defined by the systems analyst. A judgement as to whether data is valid is made possible by the validation program, but it cannot ensure complete accuracy. This can only be achieved through the use of all the clerical and computer controls built into the system at the design stage. The difference between data validity and accuracy can be illustrated with a trivial example. A company has established a Personnel file and each record contains a field for the Job Grade. The permitted values are A, B, C, or D. An entry in a record may be valid and accepted by the system if it is one of these characters, but it may not be the correct grade for the individual worker concerned. Whether a grade is correct can only be established by clerical checks or by reference to other files. During systems design, therefore, data definitions are established which place limits on what constitutes valid data. Using these data definitions, a range of software validation checks can be carried out.
 
===Consistency check===
Consistency validation ensures that the entered data is logical. For example, the delivery date of an order can be prohibited from preceding its shipment date.
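
The date rule can be expressed as a comparison between two fields of the same record. The sketch below is illustrative only, and the function and parameter names are assumptions.

```python
from datetime import date

def is_consistent_order(shipment_date, delivery_date):
    """Consistency check: delivery must not precede shipment."""
    return delivery_date >= shipment_date
```

For instance, <code>is_consistent_order(date(2024, 3, 1), date(2024, 3, 4))</code> accepts the order, while swapping the two dates rejects it.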
 
===Range check===
This check does not apply to ISBNs, but typically data must lie within preset minimum and maximum values. For example, customer account numbers may be restricted to the values 10000 to 20000, if this is the arbitrary range of numbers used by the system.
 
===Example===
Multiple kinds of data validation are relevant to 10-digit pre-2007 [[ISBN]]s (the 2005 edition of ISO 2108 required ISBNs to have 13 digits from 2007 onwards<ref>[http://www.lac-bac.gc.ca/iso/tc46sc9/isbn.htm ''Frequently Asked Questions about the new ISBN standard''] {{Webarchive|url=https://web.archive.org/web/20070610160919/http://www.lac-bac.gc.ca/iso/tc46sc9/isbn.htm |date=2007-06-10 }} [[International Organization for Standardization|ISO]].</ref>).
 
* Size. The number of characters in a data item value is checked: a pre-2007 ISBN must consist of 10 digits, with optional hyphens or spaces separating its four parts.
* Format checks. Each of the first 9 digits must be 0 through 9, and the 10th must be either 0 through 9 or an ''X''.
* [[Check digit]]. To detect transcription errors in which digits have been altered or transposed, the last digit of a pre-2007 ISBN must match the result of a mathematical formula incorporating the other 9 digits ([[International Standard Book Number#ISBN-10 check digits|ISBN-10 check digits]]). When the number is input, the validation program repeats the calculation used to generate the check digit and compares the result.
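
The three checks listed above can be combined into one routine. The sketch below is a hedged illustration in Python: the function name is hypothetical, and the weighted sum (weights 10 down to 1, with ''X'' standing for 10, and a total divisible by 11) is the usual ISBN-10 check digit scheme, which the list above refers to but does not spell out.

```python
import re

def is_valid_isbn10(text):
    """Apply the size, format, and check digit rules to a 10-digit ISBN."""
    # Size: drop optional hyphens or spaces; 10 characters must remain.
    compact = re.sub(r"[- ]", "", text)
    # Format: nine digits followed by a digit or an X.
    if not re.fullmatch(r"[0-9]{9}[0-9X]", compact):
        return False
    # Check digit: weight the characters 10, 9, ..., 1 and sum them.
    total = 0
    for position, char in enumerate(compact):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    # A valid ISBN-10 gives a total that is a multiple of 11.
    return total % 11 == 0
```

Because the size and format tests run first, the check digit arithmetic only ever sees strings that are safe to convert to integers.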
 
==Validation types==
;Allowed character checks
:Checks to ascertain that only expected characters are present in a field. For example, a numeric field may only allow the digits 0–9, the decimal point and perhaps a minus sign or commas. A text field such as a personal name might disallow characters used for a [[Markup language|markup-based security attack]]. An e-mail address might require at least one @ sign and various other structural details. [[Regular expression]]s can be effective ways to implement such checks. (See also data type checks below.)
 
;Batch totals
Line 72 ⟶ 57:
 
;Cardinality check
:Checks that a record has a valid number of related records. For example, if a contact record is classified as "customer" then it must have at least one associated order (cardinality > 0). If no order exists for a "customer" record, then either the record must be changed to "seed" or the order must be created. This type of rule can be complicated by additional conditions. For example, if a contact record in a payroll database is classified as "former employee", then it must not have any associated salary payments after the separation date (cardinality = 0).
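
Both cardinality rules can be sketched over in-memory records. The dictionary keys used here (<code>id</code>, <code>classification</code>, <code>contact_id</code>, <code>employee_id</code>, <code>paid_on</code>, <code>separation_date</code>) are hypothetical field names chosen for the illustration.

```python
def customer_has_orders(contact, orders):
    """A "customer" contact needs at least one related order (cardinality > 0)."""
    if contact["classification"] != "customer":
        return True  # the rule only applies to customers
    related = [o for o in orders if o["contact_id"] == contact["id"]]
    return len(related) > 0

def former_employee_unpaid(employee, payments):
    """A "former employee" must have no payments after separation (cardinality = 0)."""
    if employee["classification"] != "former employee":
        return True  # the rule only applies to former employees
    late = [p for p in payments
            if p["employee_id"] == employee["id"]
            and p["paid_on"] > employee["separation_date"]]
    return len(late) == 0
```

The second function shows how an additional condition (the separation date) narrows which related records are counted.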
 
;Check digits
:Used for numerical data. To support error detection, an extra digit is added to a number which is calculated from the other digits. The computer checks this calculation when data is entered. For example, the last digit of a 13-digit ISBN is a check digit calculated modulo 10.[3]
 
;Consistency checks
:Checks fields to ensure data in these fields correspond, e.g., if expiration date is in the past then status is not "active".
 
;Control totals
:This is a total computed over one or more numeric fields that appear in every record. It is a meaningful total, e.g., the total payment for a number of customers.
 
;Cross-system consistency checks
:Compares data in different systems to ensure it is consistent. Systems may represent the same data differently, in which case comparison requires transformation (e.g., one system may store customer name in a single Name field as 'Doe, John Q', while another uses First_Name 'John', Last_Name 'Doe' and Middle_Name 'Quality'). To compare the two, the validation engine would have to transform data from the second system to match the first, for example, using SQL: <code>Last_Name || ', ' || First_Name || ' ' || substr(Middle_Name, 1, 1)</code> would yield 'Doe, John Q'.
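
The same transformation can be sketched in Python. The function names and the keyword-style record are assumptions made for the illustration; only the example values come from the text above.

```python
def to_single_name(first_name, last_name, middle_name=""):
    """Transform three name fields into the 'Last, First M' single-field form."""
    name = f"{last_name}, {first_name}"
    if middle_name:
        # Only the middle initial is kept in the single-field form.
        name += f" {middle_name[0]}"
    return name

def names_consistent(system_a_name, system_b_record):
    """Compare the single-field name with the transformed three-field record."""
    return system_a_name == to_single_name(**system_b_record)
```

Transforming toward the less detailed format is the safer direction here, since the full middle name cannot be recovered from the initial alone.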
 
;Data type checks
:Checks input conformance with typed data. For example, an input box accepting numeric data may reject the letter 'O'.
 
;File existence check
:Checks that a file with a specified name exists. This check is essential for programs that use file handling.
 
;Format or picture check
:Checks that the data is in a specified format (template), e.g., dates have to be in the format YYYY-MM-DD. Regular expressions may be used for this kind of validation.
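
A regular expression for the YYYY-MM-DD template might look like the following Python sketch. The names are hypothetical, and the pattern checks the format only: it does not confirm that the month or day values form a real calendar date, which would be a separate range or consistency check.

```python
import re

# Four digits, a hyphen, two digits, a hyphen, two digits.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def has_date_format(text):
    """Format (picture) check: does the whole string match YYYY-MM-DD?"""
    return DATE_PATTERN.fullmatch(text) is not None
```

Under this sketch <code>"2024-13-45"</code> is accepted, because it fits the template even though it is not a valid date.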
 
;Hash totals
:This is a batch total computed over one or more numeric fields that appear in every record. The total is meaningless in itself, e.g., adding together the telephone numbers of a number of customers.
 
;Limit check
:Unlike range checks, data is checked against one limit only, upper or lower, e.g., data should not be greater than 2 (<=2).
 
;Logic check
:Checks that an input does not yield a logical error, e.g., an input value should not be 0 when it will divide some other number somewhere in a program.
 
;Presence check
:Checks that important data is actually present and has not been missed out, e.g., customers may be required to have an email address.
 
;Range check
:Checks that the data is within a specified range of values, e.g., a probability must be between 0 and 1.
 
;Referential integrity
:Values in two relational [[database]] tables can be linked through foreign key and primary key. If values in the foreign key field are not constrained by internal mechanisms,[4] then they should be validated to ensure that the referencing table always refers to a valid row in the referenced table.[5]
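
When the database does not enforce the constraint itself, the check can be run over exported rows. The Python sketch below uses hypothetical names and plain dictionaries, and reports referencing rows whose foreign key matches no primary key.

```python
def find_orphans(referencing_rows, referenced_rows, foreign_key, primary_key):
    """Return referencing rows whose foreign key has no matching primary key."""
    # Collect the primary keys once so each look-up is a set membership test.
    valid_keys = {row[primary_key] for row in referenced_rows}
    return [row for row in referencing_rows if row[foreign_key] not in valid_keys]
```

An empty result means every row in the referencing table points at an existing row in the referenced table.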
 
;Spelling and grammar check
Line 120 ⟶ 93:
 
;Table look up check
:A table look up check compares data to a collection of allowed values.
 
===Post-validation actions===
Line 132 ⟶ 105:
 
;Verification Action
:Verification actions are special cases of advisory actions. In this case, the source actor is asked to verify that this data is what they really want to enter, in light of a suggestion to the contrary. Here, the check step suggests an alternative (e.g., a check of a mailing address returns a different way of formatting that address or suggests a different address altogether). The user should then be given the option of accepting the recommendation or keeping their version. This is not a strict validation process, by design, and is useful for capturing addresses to a new location or to a location that is not yet supported by the validation databases.
 
;Log of validation
:Even in cases where data validation did not find any issues, providing a log of validations that were conducted and their results is important. This is helpful to identify any missing data validation checks in light of data issues and in improving the validation.
 
==Validation and security==
Failures or omissions in data validation can lead to [[data corruption]] or a [[software security vulnerability|security vulnerability]].<ref>[http://www.cgisecurity.com/owasp/html/ch10.html Chapter 10. Data Validation]</ref> Data validation checks that data is fit for purpose,<ref>[https://web.archive.org/web/20171201042621/https://spotlessdata.com/blog/more-efficient-data-validation-spotless More Efficient Data Validation with Spotless]</ref> valid, sensible, reasonable and secure before it is processed.
 
== See also ==
* [[Data processing]]
* [[Data verification]]
* [[Triangulation (social science)]]
* [[Verification and validation]]
 
Line 154 ⟶ 129:
 
{{DEFAULTSORT:Data Validation}}
[[Category:Data processing]]
[[Category:Data security]]
[[Category:Data quality]]