Unicode collation algorithm: Difference between revisions

Revision as of 19:59, 26 August 2009 by Babbage (edit summary: more infomative first sentence?)

Latest revision as of 00:30, 1 May 2025 by Urban Versis 32 (edit summary: Adding local short description: "String collation algorithm", overriding Wikidata description "algorithm"; tag: Shortdesc helper)
(48 intermediate revisions by 41 users not shown)
{{Short description|String collation algorithm}}
__NOTOC__
The '''Unicode collation algorithm''' ('''UCA''') is an algorithm, defined in Unicode Technical Standard #10 (UTS #10), that provides a customizable method for producing binary sort keys from [[String (computer science)|strings]] representing text in any [[writing system]] and [[language]] that can be represented with [[Unicode]]. These keys can then be compared efficiently, byte by byte, to [[collate]] or sort the strings according to the rules of the language, with options to ignore differences such as case and accents.<ref name=":0">{{Cite web |last1=Whistler |first1=Ken |last2=Scherer |first2=Markus |last3=Davis |first3=Mark |author-link3=Mark Davis (Unicode) |date=2022-08-26 |title=UTS #10: Unicode Collation Algorithm |url=https://www.unicode.org/reports/tr10/ |access-date=2023-08-16 |website=[[Unicode]]}}</ref>

UTS #10 also specifies the ''Default Unicode Collation Element Table'' (DUCET), a data file that defines a default collation ordering.
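The idea of a binary key that is compared byte by byte can be illustrated with a toy example. The sketch below is not the UCA and does not use DUCET weights: the function name <code>toy_sort_key</code>, the weight values and the fixed three-byte encoding are all assumptions made for illustration. It only mimics the general shape of a multi-level key (base letters first, then accents, then case) using Python's standard <code>unicodedata</code> module.

```python
import unicodedata

LEVEL_SEPARATOR = 0  # smaller than every weight used below


def toy_sort_key(text: str, strength: int = 3) -> bytes:
    """Build a simplified multi-level sort key (illustrative, not real UCA).

    strength=1 compares base letters only, 2 adds accents, 3 adds case.
    """
    primary, secondary, tertiary = [], [], []
    # Decompose so that accented letters become base letter + combining mark.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Combining marks are ignored at level 1 and
            # distinguished from plain letters at level 2.
            secondary.append(ord(ch))
            tertiary.append(2)
        else:
            # Hypothetical weights: the code point of the lowercased letter
            # stands in for a real primary weight.
            primary.append(ord(ch.lower()[0]) + 1)
            secondary.append(0x20)
            tertiary.append(8 if ch.isupper() else 2)

    weights = []
    for index, level in enumerate([primary, secondary, tertiary][:strength]):
        if index:
            weights.append(LEVEL_SEPARATOR)
        weights.extend(level)
    # Fixed-width big-endian encoding keeps byte order equal to weight order.
    return b"".join(weight.to_bytes(3, "big") for weight in weights)


words = ["côte", "Cote", "coté", "cote"]
print(sorted(words, key=toy_sort_key))  # ['cote', 'Cote', 'coté', 'côte']

# Lower strengths ignore accents and case, or case only.
print(toy_sort_key("côte", 1) == toy_sort_key("Cote", 1))  # True
print(toy_sort_key("cote", 2) == toy_sort_key("Cote", 2))  # True
```

Because the key is plain <code>bytes</code>, it can be computed once per string and then compared with an ordinary byte comparison, which is the property the paragraph above describes. A real implementation takes its weights from the collation element table rather than from code points.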
The DUCET is customizable for different languages,<ref name=":0" /><ref name=":1">{{Cite book |last=Hosken |first=Martin |url=https://scriptsource.org/cms/scripts/render_download.php?format=file&media_id=..%2Fsites%2Fs%2Fmedia%2Fdatabase%2Fssproto%2Fentries%2Fpn%2Frn%2Fpnrnlhkrq9_sort_tutorial.pdf&filename=sort_tutorial.pdf |title=Unicode Sort Tailoring: Tutorial |date=2021-09-23 |publisher=[[SIL International|SIL Writing Systems Technology]] |edition=1.3 |pages=2–3 |access-date=2023-08-16}}</ref> and some such customizations can be found in the Unicode [[Common Locale Data Repository]] (CLDR).<ref>{{Cite web |title=CLDR Releases/Downloads |url=https://cldr.unicode.org/index/downloads |access-date=2023-08-16 |website=[[Common Locale Data Repository|Unicode CLDR]] |language=}}</ref>

When used with the DUCET, this collation method is similar to the [[European ordering rules]] for strings in most European languages. In particular, for strings in the [[Latin alphabet]], the ordering is the same as the normal sorting order in English and similar languages, since it first looks only at letters stripped of any modifications or [[diacritic]]al marks.
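How a collation element table and a tailoring of it interact can be sketched as follows. The sample entries imitate the line layout of the DUCET data file (code points, a semicolon, then bracketed weights), but every weight value here is invented, and the helper names <code>parse_table</code> and <code>primary_key</code> are hypothetical. The tailoring adds a contraction so that "ch" sorts as a separate letter after "c", in the style of traditional Spanish ordering; only primary weights are compared, so this is a simplified illustration rather than a full implementation.

```python
import re

# Invented weights in a DUCET-like layout; these are NOT real DUCET values.
SAMPLE_TABLE = """\
# code points ; [.primary.secondary.tertiary] # name
0063 ; [.0110.0020.0002] # LATIN SMALL LETTER C
0064 ; [.0120.0020.0002] # LATIN SMALL LETTER D
0068 ; [.0160.0020.0002] # LATIN SMALL LETTER H
006F ; [.0190.0020.0002] # LATIN SMALL LETTER O
"""

ELEMENT = re.compile(r"\[[.*]([0-9A-F]+)\.([0-9A-F]+)\.([0-9A-F]+)\]")


def parse_table(text):
    """Map each character sequence to its list of (p, s, t) weight triples."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("@"):    # skip blanks and directives
            continue
        codes, elements = line.split(";", 1)
        key = "".join(chr(int(cp, 16)) for cp in codes.split())
        table[key] = [
            tuple(int(weight, 16) for weight in triple)
            for triple in ELEMENT.findall(elements)
        ]
    return table


def primary_key(text, table):
    """Collect primary weights, preferring the longest matching table entry."""
    longest = max(len(key) for key in table)
    weights, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                weights.extend(p for p, _s, _t in table[chunk] if p)
                i += size
                break
        else:
            # A full implementation derives weights for unlisted characters.
            raise KeyError(f"no table entry for {text[i]!r}")
    return tuple(weights)


default_table = parse_table(SAMPLE_TABLE)

# Tailoring: copy the default table and add a contraction for "ch" whose
# primary weight falls between those of "c" and "d".
tailored_table = dict(default_table)
tailored_table["ch"] = [(0x0118, 0x0020, 0x0002)]

words = ["co", "ch", "d"]
print(sorted(words, key=lambda w: primary_key(w, default_table)))   # ['ch', 'co', 'd']
print(sorted(words, key=lambda w: primary_key(w, tailored_table)))  # ['co', 'ch', 'd']
```

The default table is left untouched and the tailoring is expressed as a small set of overrides on top of it, which mirrors the relationship between a default ordering and language-specific customizations described above.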
An open source implementation of UCA is included with the [[International Components for Unicode]] (ICU).<ref>{{Cite web |title=ICU - International Components for Unicode |url=https://icu.unicode.org/home |access-date=2023-08-16 |website=[[Unicode]]}}</ref><ref>{{Cite web |title=Collations |url=https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbadmin/natlang-s-7003956.html |access-date=2023-08-16 |website=SyBooks Online}}</ref> ICU supports tailoring, and the collation tailorings from CLDR are included in ICU.<ref>{{Cite web |title=Customization |url=https://unicode-org.github.io/icu/userguide/collation/customization/ |access-date=2023-08-16 |website=ICU Documentation |language=}}</ref><ref name=":1" />

==See also==
* [[Collation]]
* [[ISO/IEC 14651]]
* [[European ordering rules]] (EOR)
* [[Common Locale Data Repository]] (CLDR)

== References ==
<references />

==External links==
* [https://www.unicode.org/reports/tr10/ Unicode Collation Algorithm]: Unicode Technical Standard #10
* [http://developer.mimer.com/sql-unicode-collation-charts/ Mimer SQL Unicode Collation Charts]
* [http://www.collation-charts.org/mysql60/by-charset.html#utf8 MySQL UCA-based Unicode Collation Charts]

===Tools===
* [https://icu4c-demos.unicode.org/icu-bin/locexp?_=en_US&x=col ICU Locale Explorer] An online demonstration of the Unicode Collation Algorithm using [[International Components for Unicode]]
* [https://icu4c-demos.unicode.org/icu-bin/collation.html An ICU collation demo]
* [http://billposer.org/Software/msort.html msort] A sort program that provides an unusual level of flexibility in defining collations and extracting keys.

{{Unicode navigation}}

[[Category:String collation algorithms]]
[[Category:Unicode algorithms|Collation]]
[[Category:Collation]]

{{algorithm-stub}}
{{standard-stub}}