Unicode collation algorithm

{{Short description|String collation algorithm}}
__NOTOC__
The '''Unicode collation algorithm''' ('''UCA''') is an algorithm defined in Unicode Technical Standard #10 (UTS #10). It is a customizable method for producing binary sort keys from [[String (computer science)|strings]] representing text in any [[writing system]] and [[language]] that can be represented with [[Unicode]]. These keys can then be compared efficiently, byte by byte, to [[collate]] or sort the strings according to the rules of the language, with options for ignoring case, accents, and similar distinctions.<ref name=":0">{{Cite web |last1=Whistler |first1=Ken |last2=Scherer |first2=Markus |last3=Davis |first3=Mark |author-link3=Mark Davis (Unicode) |date=2022-08-26 |title=UTS #10: Unicode Collation Algorithm |url=https://www.unicode.org/reports/tr10/ |access-date=2023-08-16 |website=[[Unicode]]}}</ref>
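
The idea of a multi-level binary sort key can be illustrated with a deliberately simplified sketch. The function name <code>toy_sort_key</code>, its <code>strength</code> parameter, and the weights it assigns are assumptions made for illustration only. The weights are derived from code points rather than taken from the collation element table used by UTS #10, so this is not an implementation of the UCA. The sketch only shows how base letters, accents, and case can be placed in successive levels of one byte string, so that a plain byte-by-byte comparison considers letters first, then accents, then case.

```python
import unicodedata

LEVEL_SEPARATOR = b"\x00\x00\x00"  # weight 0 never occurs inside a level


def _pack(weights):
    """Encode each weight as 3 big-endian bytes so bytes compare like ints."""
    return b"".join(w.to_bytes(3, "big") for w in weights)


def toy_sort_key(text, strength=3):
    """Build a toy three-level sort key (illustrative weights, not UCA data).

    strength=1 compares base letters only, strength=2 adds accents,
    strength=3 adds case.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # A combining mark (accent) only matters at the second level.
            secondary.append(ord(ch))
        else:
            # Base letter: case-insensitive weight at the first level.
            primary.extend(ord(c) for c in ch.casefold())
            secondary.append(1)                            # "no accent here"
            tertiary.append(2 if ch.isupper() else 1)      # lowercase first
    levels = [primary, secondary, tertiary][:strength]
    return LEVEL_SEPARATOR.join(_pack(level) for level in levels)


# "\u00e9" is the letter e with an acute accent.
words = ["r\u00e9sum\u00e9", "ros\u00e9", "Resume", "rose", "resume", "Rose"]
print(sorted(words, key=toy_sort_key))
# ['resume', 'Resume', 'résumé', 'rose', 'Rose', 'rosé']
```

Calling <code>toy_sort_key(word, strength=1)</code> in this sketch yields identical keys for "resume", "Resume" and "résumé", which corresponds to a comparison that ignores both case and accents.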
 
Unicode Technical Standard #10 also specifies the ''Default Unicode Collation Element Table'' (DUCET). This data file specifies a default collation ordering. The DUCET is customizable for different languages,<ref name=":0" /><ref name=":1">{{Cite book |last=Hosken |first=Martin |url=https://scriptsource.org/cms/scripts/render_download.php?format=file&media_id=..%2Fsites%2Fs%2Fmedia%2Fdatabase%2Fssproto%2Fentries%2Fpn%2Frn%2Fpnrnlhkrq9_sort_tutorial.pdf&filename=sort_tutorial.pdf |title=Unicode Sort Tailoring: Tutorial |date=2021-09-23 |publisher=[[SIL International|SIL Writing Systems Technology]] |edition=1.3 |pages=2–3 |access-date=2023-08-16}}</ref> and some such customizations can be found in the Unicode [[Common Locale Data Repository]] (CLDR).<ref>{{Cite web |title=CLDR Releases/Downloads |url=https://cldr.unicode.org/index/downloads |access-date=2023-08-16 |website=[[Common Locale Data Repository|Unicode CLDR]] |language=}}</ref>
When used with the DUCET, this collation method is similar to the [[European ordering rules]] for strings in most European languages. In particular, for strings in the [[Latin alphabet]], the ordering matches the usual sorting order in English and similar languages, because the comparison first considers only the base letters, ignoring any modifications or [[diacritic]]al marks.
 
An open source implementation of the UCA is included with the [[International Components for Unicode]] (ICU).<ref>{{Cite web |title=ICU - International Components for Unicode |url=https://icu.unicode.org/home |access-date=2023-08-16 |website=[[Unicode]]}}</ref><ref>{{Cite web |title=Collations |url=https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbadmin/natlang-s-7003956.html |access-date=2023-08-16 |website=SyBooks Online}}</ref> ICU supports tailoring, and the collation tailorings from CLDR are included in ICU.<ref>{{Cite web |title=Customization |url=https://unicode-org.github.io/icu/userguide/collation/customization/ |access-date=2023-08-16 |website=ICU Documentation |language=}}</ref><ref name=":1" />
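
The effect of tailoring can be pictured with a small sketch that uses only the Python standard library and does not call ICU. The names <code>primary_key</code> and <code>SWEDISH_LIKE</code>, and the numeric weights, are hypothetical. The default behaviour here simply strips diacritics, and the override table mimics the Swedish convention of treating å, ä and ö as separate letters after z. Actual tailorings in CLDR are expressed as rules against the DUCET, not as a dictionary of this kind.

```python
import unicodedata


def primary_key(text, tailoring=None):
    """Toy first-level key: base letters only, with optional overrides.

    `tailoring` maps single (precomposed) letters to replacement weights.
    """
    tailoring = tailoring or {}
    weights = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in tailoring:
            # Tailored letter: use the locale-specific weight.
            weights.append(tailoring[ch])
        else:
            # Default: drop combining marks and weigh the base letter.
            for c in unicodedata.normalize("NFD", ch):
                if not unicodedata.combining(c):
                    weights.append(ord(c))
    return tuple(weights)


# Hypothetical Swedish-like tailoring: a-ring, a-diaeresis and o-diaeresis
# become distinct letters placed after "z" (weights chosen for this toy only).
SWEDISH_LIKE = {
    "\u00e5": ord("z") + 1,  # å
    "\u00e4": ord("z") + 2,  # ä
    "\u00f6": ord("z") + 3,  # ö
}

words = ["zebra", "\u00e4rlig", "apa", "\u00f6l", "orm"]

print(sorted(words, key=primary_key))
# ['apa', 'ärlig', 'öl', 'orm', 'zebra']

print(sorted(words, key=lambda w: primary_key(w, SWEDISH_LIKE)))
# ['apa', 'orm', 'zebra', 'ärlig', 'öl']
```

In the untailored call the accented letters sort together with their base letters, while the tailored call moves them to the end of the alphabet.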
 
==See also==
* [[Collation]]
* [[ISO/IEC 14651]]
* [[European ordering rules]] (EOR)
* [[Common Locale Data Repository]] (CLDR)
 
== References ==
<references />
 
==External links==
* [https://www.unicode.org/reports/tr10/ Unicode Collation Algorithm]: Unicode Technical Standard #10
* [http://developer.mimer.com/collations/sql-unicode-collation-charts/index.tml Mimer SQL Unicode Collation Charts]
*[http://www.icu-project.org/ International Components for Unicode (ICU)]
*[http://www.collation-charts.org/mysql60/by-charset.html#utf8 MySQL UCA-based Unicode Collation Charts]
 
===Tools===
* [https://icu4c-demos.unicode.org/icu-bin/locexp?_=en_US&x=col ICU Locale Explorer] An online demonstration of the Unicode Collation Algorithm using [[International Components for Unicode]]
*[https://icu4c-demos.unicode.org/icu-bin/collation.html An ICU collation demo]
* [http://billposer.org/Software/msort.html msort] A sort program that provides an unusual level of flexibility in defining collations and extracting keys.
 
{{Unicode navigation}}
 
[[Category:String collation algorithms]]
[[Category:Unicode algorithms|Collation]]
[[Category:Collation]]
 
 