Unicode Frequently Asked Questions

Specifications

Q: How can I find out whether a particular issue is covered by a specification published by the Consortium, and where do I look it up?

The Unicode Standard and related standards contain a number of specifications or guidelines for dealing with different programming tasks. Sometimes it's hard to find these as they are not all provided as specific, dedicated documents.

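The following table lists subject areas covered by the core specification of the Unicode Standard or by its annexes, technical standards, and technical reports, together with the place to look each one up.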

General

Character Properties: General Category, Default-Ignorable, plus those used in other specifications

Chapter 4

Unicode Han Database: character properties (Unihan)

UAX #38

Unicode Egyptian Hieroglyph Database (Unikemet)

UAX #57

Additional information for Cuneiform: references to additional data specific to the Sumero-Akkadian Cuneiform script

UTR #56

Unicode Character Database (UCD)

UAX #44

XML representation of the UCD

UAX #42

Default Case Algorithms; see also § 4.2, Case

§ 3.13

Characters with Unusual Properties: characters that implementers need to pay special attention to

§ 4.12

Script Property: usage model for determining text runs in a given script

UAX #24

Unicode Support of Mathematics: guidelines for mathematical usage

UTR #25

Unicode Emoji: emoji characters

UTS #51

Unicode Named Character Sequences: names for character sequences

UAX #34

Encodings

Unicode Encoding Forms: UTF-8, UTF-16, UTF-32 conversion and validation

§ 3.9

Unicode Encoding Schemes: UTF-8, UTF-16 (BE/LE), UTF-32 (BE/LE) conversion and validation

§ 3.10

Binary Order: UTF-8 order vs. UTF-16 order

§ 5.17

Unicode Character Mapping Markup Language: mappings for code pages

UTS #22

A Standard Compression Scheme for Unicode: how to compress Unicode to about the same size as legacy encodings

UTS #6

UTF-EBCDIC: for EBCDIC systems

UTR #16

Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)

UTR #26

ideographic variation sequences

§ 23.4

glyphs

UTS #37

Comparison (Collation)

canonical ordering

§ 3.11

Unicode Normalization Forms; see also § 3.11 definitions

UAX #15

Unicode Collation Algorithm: the default mechanism for comparing, searching, matching, and ordering Unicode text

UTS #10

Parsing

Hangul Syllables: boundaries, parsing, (de/)composition, names

§ 3.12

Decimal Numbers: conversion and validation

§ 5.5

Unicode Regular Expression Guidelines: the features required in supporting regular expressions with Unicode

UTS #18

Unicode Identifiers and Syntax: how to parse identifiers

UAX #31

Unicode Source Code Handling: guidance for programming language designers on handling security issues in Unicode program text

UTS #55

Language Information in Plain Text; see also § 23.9, Tag Characters

§ 5.10

Variation Selectors: use, validation

§ 23.4

Ideographic Description Sequences: use, validation

§ 18.2

Segmentation

Newline Guidelines: how to handle newline characters

§ 5.8

Line Breaking Algorithm: the default way to determine where to linewrap

UAX #14

Unicode Text Segmentation: grapheme clusters, words, and sentences

UAX #29

Rendering

The Bidirectional Algorithm: required for display of Arabic and Hebrew text

UAX #9

Arabic Mark Rendering: sequence details for stable rendering of multiple marks

UTR #53

East Asian Width: the default determination of character width in East Asian contexts

UAX #11

Minimal shaping requirements for Tamil, and other complex scripts

Chapters 9-15

Vertical orientation adjustments for characters

UAX #50

Locale Data

Unicode Locale Data Markup Language (LDML): locale data for internationalization

UTS #35

LDML data for hundreds of locales

CLDR

Identifiers and Security

Unicode Identifiers and Syntax: security issues for identifiers

UAX #31

Unicode Security Considerations: guidelines for recognizing Unicode security problems and dealing with them

UTR #36

Unicode Security Mechanisms: useful tools for detecting spoofs

UTS #39

Unicode IDNA Compatibility Processing: compatibility with IDNA2003

UTS #46

Unicode Source Code Handling: guidance for programming language designers and programming environment developers to avoid security issues from improper handling of Unicode program text

UTS #55
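To make a few of the rows above concrete, here is a minimal sketch using Python's standard `unicodedata` module, which exposes some of the properties and algorithms these specifications define. It is only an illustration: the module reflects whichever version of the Unicode Character Database is bundled with the interpreter, and the helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few UCD-derived facts about a single character.

    `describe` is a hypothetical helper for illustration only.
    """
    return {
        # General Category (core specification, Chapter 4; UAX #44)
        "general_category": unicodedata.category(ch),
        # Bidirectional class used by the Bidirectional Algorithm (UAX #9)
        "bidi_class": unicodedata.bidirectional(ch),
        # East Asian Width (UAX #11)
        "east_asian_width": unicodedata.east_asian_width(ch),
    }

# Normalization Forms (UAX #15): "e" + COMBINING ACUTE ACCENT composes under NFC.
composed = unicodedata.normalize("NFC", "e\u0301")

print(describe("A"))
print(describe("\u05d0"))           # HEBREW LETTER ALEF
print(describe("\u3042"))           # HIRAGANA LETTER A
print(composed == "\u00e9")         # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.unidata_version)  # UCD version bundled with this interpreter
```

Tasks such as collation (UTS #10), line breaking (UAX #14), or text segmentation (UAX #29) are not covered by this module and need a dedicated library.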

Q: Which Unicode specifications are normative?

Some Unicode specifications are normative. For the Unicode Standard, this includes the material in Chapter 3. Unicode Technical Standards (UTS) are specifications in their own right. For details, see About Unicode Technical Reports.

Q: Where can I find the rationale behind a given specification?

Specifications published by the Consortium do not always document the rationale behind a technical decision or behind individual encoded characters.

The following table lists sources of information on specific technical decisions or the rationale behind them.

Unicode Technical Committee

Minutes and supporting documents Register
Minutes and supporting documents Search

Character Encoding

ScriptSource, information on scripts
Unicode Status for each script (Example: Arabic)
Wikipedia, information on Unicode blocks
History section for each Unicode block (Example: Arabic)
Emoji Proposals By proposal
Emoji Proposals By code point

Algorithms

Line Breaking Algorithm

Q: Where can I find out when a character was encoded or a feature was added to a given specification?

For both the core specification of the Unicode Standard and its Annexes, as well as Technical Standards and Reports, a "Modifications" section highlights changes from the preceding version. Tracking these backwards shows when a particular change was introduced, but the granularity is not particularly fine, and there is no cross-reference to particular decisions and supporting documents. For characters, DerivedAge.txt indicates the version in which a character was added to the standard. For some specifications, an annotated version provides finer-grained documentation of the version and rationale for each change.
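As a rough illustration of the DerivedAge.txt lookup, the sketch below parses lines in that file's layout (`start[..end] ; version # comment`) and reports the version for a code point. The embedded sample is a hand-written excerpt in the same layout, not the real file, and the function names `parse_derived_age` and `age_of` are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hand-written excerpt in the DerivedAge.txt layout (not the full file).
SAMPLE = """\
# start[..end] ; version # comment
0041..005A    ; 1.1 #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
20AC          ; 2.1 #       EURO SIGN
1F600         ; 6.1 #       GRINNING FACE
"""

Range = Tuple[int, int, str]

def parse_derived_age(text: str) -> List[Range]:
    """Return (first, last, version) tuples, skipping comments and blank lines."""
    ranges: List[Range] = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, version = (part.strip() for part in data.split(";"))
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), version))
    return ranges

def age_of(code_point: int, ranges: List[Range]) -> Optional[str]:
    """Return the version in which code_point was added, or None if unlisted."""
    for first, last, version in ranges:
        if first <= code_point <= last:
            return version
    return None

ranges = parse_derived_age(SAMPLE)
print(age_of(0x20AC, ranges))   # EURO SIGN
print(age_of(0x1F600, ranges))  # GRINNING FACE
```

With the real file, the same functions could be applied to its full text; a linear scan is adequate for occasional lookups, while a sorted list with binary search would suit bulk queries.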
