
8. Unicode character classification and properties <unictype.h>

This include file declares functions that classify Unicode characters and that test whether Unicode characters have specific properties.

The classification assigns a “general category” to every Unicode character. This is similar to the classification provided by ISO C in <wctype.h>.

Properties are the data that guides various text processing algorithms in the presence of specific Unicode characters.


8.1 General category

Every Unicode character or code point has a general category assigned to it. This classification is important for most algorithms that work on Unicode text.

The GNU libunistring library provides two kinds of API for working with general categories. The object oriented API uses a variable to denote every predefined general category value or combination thereof. The low-level API uses a bit mask instead. The advantage of the object oriented API is that if only a few predefined general category values are used, the data tables are relatively small. When you combine general category values (using uc_general_category_or, uc_general_category_and, or uc_general_category_and_not), or when you use the low-level bit masks, a big table is used that holds the complete general category information for all Unicode characters.


8.1.1 The object oriented API for general category

Type: uc_general_category_t

This data type denotes a general category value. It is an immediate type that can be copied by simple assignment, without involving memory allocation. It is not an array type.

The following are the predefined general category values. Additional general categories may be added in the future.

The UC_CATEGORY_* constants reflect the systematic general category values assigned by the Unicode Consortium, whereas the other UC_* macros are aliases, for use when readable code is preferred.

Constant: uc_general_category_t UC_CATEGORY_L
Macro: uc_general_category_t UC_LETTER

This represents the general category “Letter”.

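Constant: uc_general_category_t UC_CATEGORY_LC
Macro: uc_general_category_t UC_CASED_LETTER

This represents the general category “Cased Letter”, that is, the union of “Letter, uppercase”, “Letter, lowercase”, and “Letter, titlecase”.

Constant: uc_general_category_t UC_CATEGORY_Lu
Macro: uc_general_category_t UC_UPPERCASE_LETTER

This represents the general category “Letter, uppercase”.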

Constant: uc_general_category_t UC_CATEGORY_Ll
Macro: uc_general_category_t UC_LOWERCASE_LETTER

This represents the general category “Letter, lowercase”.

Constant: uc_general_category_t UC_CATEGORY_Lt
Macro: uc_general_category_t UC_TITLECASE_LETTER

This represents the general category “Letter, titlecase”.

Constant: uc_general_category_t UC_CATEGORY_Lm
Macro: uc_general_category_t UC_MODIFIER_LETTER

This represents the general category “Letter, modifier”.

Constant: uc_general_category_t UC_CATEGORY_Lo
Macro: uc_general_category_t UC_OTHER_LETTER

This represents the general category “Letter, other”.

Constant: uc_general_category_t UC_CATEGORY_M
Macro: uc_general_category_t UC_MARK

This represents the general category “Mark”.

Constant: uc_general_category_t UC_CATEGORY_Mn
Macro: uc_general_category_t UC_NON_SPACING_MARK

This represents the general category “Mark, nonspacing”.

Constant: uc_general_category_t UC_CATEGORY_Mc
Macro: uc_general_category_t UC_COMBINING_SPACING_MARK

This represents the general category “Mark, spacing combining”.

Constant: uc_general_category_t UC_CATEGORY_Me
Macro: uc_general_category_t UC_ENCLOSING_MARK

This represents the general category “Mark, enclosing”.

Constant: uc_general_category_t UC_CATEGORY_N
Macro: uc_general_category_t UC_NUMBER

This represents the general category “Number”.

Constant: uc_general_category_t UC_CATEGORY_Nd
Macro: uc_general_category_t UC_DECIMAL_DIGIT_NUMBER

This represents the general category “Number, decimal digit”.

Constant: uc_general_category_t UC_CATEGORY_Nl
Macro: uc_general_category_t UC_LETTER_NUMBER

This represents the general category “Number, letter”.

Constant: uc_general_category_t UC_CATEGORY_No
Macro: uc_general_category_t UC_OTHER_NUMBER

This represents the general category “Number, other”.

Constant: uc_general_category_t UC_CATEGORY_P
Macro: uc_general_category_t UC_PUNCTUATION

This represents the general category “Punctuation”.

Constant: uc_general_category_t UC_CATEGORY_Pc
Macro: uc_general_category_t UC_CONNECTOR_PUNCTUATION

This represents the general category “Punctuation, connector”.

Constant: uc_general_category_t UC_CATEGORY_Pd
Macro: uc_general_category_t UC_DASH_PUNCTUATION

This represents the general category “Punctuation, dash”.

Constant: uc_general_category_t UC_CATEGORY_Ps
Macro: uc_general_category_t UC_OPEN_PUNCTUATION

This represents the general category “Punctuation, open”, a.k.a. “start punctuation”.

Constant: uc_general_category_t UC_CATEGORY_Pe
Macro: uc_general_category_t UC_CLOSE_PUNCTUATION

This represents the general category “Punctuation, close”, a.k.a. “end punctuation”.

Constant: uc_general_category_t UC_CATEGORY_Pi
Macro: uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION

This represents the general category “Punctuation, initial quote”.

Constant: uc_general_category_t UC_CATEGORY_Pf
Macro: uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION

This represents the general category “Punctuation, final quote”.

Constant: uc_general_category_t UC_CATEGORY_Po
Macro: uc_general_category_t UC_OTHER_PUNCTUATION

This represents the general category “Punctuation, other”.

Constant: uc_general_category_t UC_CATEGORY_S
Macro: uc_general_category_t UC_SYMBOL

This represents the general category “Symbol”.

Constant: uc_general_category_t UC_CATEGORY_Sm
Macro: uc_general_category_t UC_MATH_SYMBOL

This represents the general category “Symbol, math”.

Constant: uc_general_category_t UC_CATEGORY_Sc
Macro: uc_general_category_t UC_CURRENCY_SYMBOL

This represents the general category “Symbol, currency”.

Constant: uc_general_category_t UC_CATEGORY_Sk
Macro: uc_general_category_t UC_MODIFIER_SYMBOL

This represents the general category “Symbol, modifier”.

Constant: uc_general_category_t UC_CATEGORY_So
Macro: uc_general_category_t UC_OTHER_SYMBOL

This represents the general category “Symbol, other”.

Constant: uc_general_category_t UC_CATEGORY_Z
Macro: uc_general_category_t UC_SEPARATOR

This represents the general category “Separator”.

Constant: uc_general_category_t UC_CATEGORY_Zs
Macro: uc_general_category_t UC_SPACE_SEPARATOR

This represents the general category “Separator, space”.

Constant: uc_general_category_t UC_CATEGORY_Zl
Macro: uc_general_category_t UC_LINE_SEPARATOR

This represents the general category “Separator, line”.

Constant: uc_general_category_t UC_CATEGORY_Zp
Macro: uc_general_category_t UC_PARAGRAPH_SEPARATOR

This represents the general category “Separator, paragraph”.

Constant: uc_general_category_t UC_CATEGORY_C
Macro: uc_general_category_t UC_OTHER

This represents the general category “Other”.

Constant: uc_general_category_t UC_CATEGORY_Cc
Macro: uc_general_category_t UC_CONTROL

This represents the general category “Other, control”.

Constant: uc_general_category_t UC_CATEGORY_Cf
Macro: uc_general_category_t UC_FORMAT

This represents the general category “Other, format”.

Constant: uc_general_category_t UC_CATEGORY_Cs
Macro: uc_general_category_t UC_SURROGATE

This represents the general category “Other, surrogate”. All code points in this category are invalid characters.

Constant: uc_general_category_t UC_CATEGORY_Co
Macro: uc_general_category_t UC_PRIVATE_USE

This represents the general category “Other, private use”.

Constant: uc_general_category_t UC_CATEGORY_Cn
Macro: uc_general_category_t UC_UNASSIGNED

This represents the general category “Other, not assigned”. Some code points in this category are invalid characters.

The following functions combine general categories, as in a Boolean algebra, except that there is no ‘not’ operation.

Function: uc_general_category_t uc_general_category_or (uc_general_category_t category1, uc_general_category_t category2)

Returns the union of two general categories. This corresponds to the union of the two sets of characters.

Function: uc_general_category_t uc_general_category_and (uc_general_category_t category1, uc_general_category_t category2)

Returns the intersection of two general categories as bit masks. This does not correspond to the intersection of the two sets of characters.

Function: uc_general_category_t uc_general_category_and_not (uc_general_category_t category1, uc_general_category_t category2)

Returns the intersection of a general category with the complement of a second general category, as bit masks. This does not correspond to the intersection with complement, when viewing the categories as sets of characters.

The following functions associate general categories with their name.

Function: const char * uc_general_category_name (uc_general_category_t category)

Returns the name of a general category, more precisely, the abbreviated name. Returns NULL if the general category corresponds to a bit mask that does not have a name.

Function: const char * uc_general_category_long_name (uc_general_category_t category)

Returns the long name of a general category. Returns NULL if the general category corresponds to a bit mask that does not have a name.

Function: uc_general_category_t uc_general_category_byname (const char *category_name)

Returns the general category given by name, e.g. "Lu", or by long name, e.g. "Uppercase Letter". This lookup ignores spaces, underscores, or hyphens as word separators and is case-insensitive.

The following functions view general categories as sets of Unicode characters.

Function: uc_general_category_t uc_general_category (ucs4_t uc)

Returns the general category of a Unicode character.

This function uses a big table.

Function: bool uc_is_general_category (ucs4_t uc, uc_general_category_t category)

Tests whether a Unicode character belongs to a given category. The category argument can be a predefined general category or the combination of several predefined general categories.


8.1.2 The bit mask API for general category

The following are the predefined general category values as bit masks. Additional general categories may be added in the future.

Macro: uint32_t UC_CATEGORY_MASK_L
Macro: uint32_t UC_CATEGORY_MASK_LC
Macro: uint32_t UC_CATEGORY_MASK_Lu
Macro: uint32_t UC_CATEGORY_MASK_Ll
Macro: uint32_t UC_CATEGORY_MASK_Lt
Macro: uint32_t UC_CATEGORY_MASK_Lm
Macro: uint32_t UC_CATEGORY_MASK_Lo
Macro: uint32_t UC_CATEGORY_MASK_M
Macro: uint32_t UC_CATEGORY_MASK_Mn
Macro: uint32_t UC_CATEGORY_MASK_Mc
Macro: uint32_t UC_CATEGORY_MASK_Me
Macro: uint32_t UC_CATEGORY_MASK_N
Macro: uint32_t UC_CATEGORY_MASK_Nd
Macro: uint32_t UC_CATEGORY_MASK_Nl
Macro: uint32_t UC_CATEGORY_MASK_No
Macro: uint32_t UC_CATEGORY_MASK_P
Macro: uint32_t UC_CATEGORY_MASK_Pc
Macro: uint32_t UC_CATEGORY_MASK_Pd
Macro: uint32_t UC_CATEGORY_MASK_Ps
Macro: uint32_t UC_CATEGORY_MASK_Pe
Macro: uint32_t UC_CATEGORY_MASK_Pi
Macro: uint32_t UC_CATEGORY_MASK_Pf
Macro: uint32_t UC_CATEGORY_MASK_Po
Macro: uint32_t UC_CATEGORY_MASK_S
Macro: uint32_t UC_CATEGORY_MASK_Sm
Macro: uint32_t UC_CATEGORY_MASK_Sc
Macro: uint32_t UC_CATEGORY_MASK_Sk
Macro: uint32_t UC_CATEGORY_MASK_So
Macro: uint32_t UC_CATEGORY_MASK_Z
Macro: uint32_t UC_CATEGORY_MASK_Zs
Macro: uint32_t UC_CATEGORY_MASK_Zl
Macro: uint32_t UC_CATEGORY_MASK_Zp
Macro: uint32_t UC_CATEGORY_MASK_C
Macro: uint32_t UC_CATEGORY_MASK_Cc
Macro: uint32_t UC_CATEGORY_MASK_Cf
Macro: uint32_t UC_CATEGORY_MASK_Cs
Macro: uint32_t UC_CATEGORY_MASK_Co
Macro: uint32_t UC_CATEGORY_MASK_Cn

The following function views general categories as sets of Unicode characters.

Function: bool uc_is_general_category_withtable (ucs4_t uc, uint32_t bitmask)

Tests whether a Unicode character belongs to a given category. The bitmask argument can be a predefined general category bitmask or the combination of several predefined general category bitmasks.

This function uses a big table comprising all general categories.


8.2 Canonical combining class

Every Unicode character or code point has a canonical combining class assigned to it.

What is the meaning of the canonical combining class? Essentially, it indicates the priority with which a combining character is attached to its base character. The characters for which the canonical combining class is 0 are the base characters, and the characters for which it is greater than 0 are the combining characters. Combining characters are rendered near, attached to, or around their base character, and combining characters with smaller combining classes are attached “first” or “closer” to the base character.

The canonical combining class of a character is a number in the range 0..255. The possible values are described in the Unicode Character Database http://www.unicode.org/Public/UNIDATA/UCD.html. The list here is not definitive; more values can be added in future versions.

Constant: int UC_CCC_NR

The canonical combining class value for “Not Reordered” characters. The value is 0.

Constant: int UC_CCC_OV

The canonical combining class value for “Overlay” characters.

Constant: int UC_CCC_NK

The canonical combining class value for “Nukta” characters.

Constant: int UC_CCC_KV

The canonical combining class value for “Kana Voicing” characters.

Constant: int UC_CCC_VR

The canonical combining class value for “Virama” characters.

Constant: int UC_CCC_ATBL

The canonical combining class value for “Attached Below Left” characters.

Constant: int UC_CCC_ATB

The canonical combining class value for “Attached Below” characters.

Constant: int UC_CCC_ATA

The canonical combining class value for “Attached Above” characters.

Constant: int UC_CCC_ATAR

The canonical combining class value for “Attached Above Right” characters.

Constant: int UC_CCC_BL

The canonical combining class value for “Below Left” characters.

Constant: int UC_CCC_B

The canonical combining class value for “Below” characters.

Constant: int UC_CCC_BR

The canonical combining class value for “Below Right” characters.

Constant: int UC_CCC_L

The canonical combining class value for “Left” characters.

Constant: int UC_CCC_R

The canonical combining class value for “Right” characters.

Constant: int UC_CCC_AL

The canonical combining class value for “Above Left” characters.

Constant: int UC_CCC_A

The canonical combining class value for “Above” characters.

Constant: int UC_CCC_AR

The canonical combining class value for “Above Right” characters.

Constant: int UC_CCC_DB

The canonical combining class value for “Double Below” characters.

Constant: int UC_CCC_DA

The canonical combining class value for “Double Above” characters.

Constant: int UC_CCC_IS

The canonical combining class value for “Iota Subscript” characters.

The following functions associate canonical combining classes with their name.

Function: const char * uc_combining_class_name (int ccc)

Returns the name of a canonical combining class, more precisely, the abbreviated name. Returns NULL if the canonical combining class is a numeric value without a name.

Function: const char * uc_combining_class_long_name (int ccc)

Returns the long name of a canonical combining class. Returns NULL if the canonical combining class is a numeric value without a name.

Function: int uc_combining_class_byname (const char *ccc_name)

Returns the canonical combining class given by name, e.g. "BL", or by long name, e.g. "Below Left". This lookup ignores spaces, underscores, or hyphens as word separators and is case-insensitive.

The following function looks up the canonical combining class of a character.

Function: int uc_combining_class (ucs4_t uc)

Returns the canonical combining class of a Unicode character.


8.3 Bidi class

Every Unicode character or code point has a bidi class assigned to it. Before Unicode 4.0, this concept was known as bidirectional category.

The bidi class guides the bidirectional algorithm (http://www.unicode.org/reports/tr9/). The possible values are the following.

Constant: int UC_BIDI_L

The bidi class for “Left-to-Right” characters.

Constant: int UC_BIDI_LRE

The bidi class for “Left-to-Right Embedding” characters.

Constant: int UC_BIDI_LRO

The bidi class for “Left-to-Right Override” characters.

Constant: int UC_BIDI_R

The bidi class for “Right-to-Left” characters.

Constant: int UC_BIDI_AL

The bidi class for “Right-to-Left Arabic” characters.

Constant: int UC_BIDI_RLE

The bidi class for “Right-to-Left Embedding” characters.

Constant: int UC_BIDI_RLO

The bidi class for “Right-to-Left Override” characters.

Constant: int UC_BIDI_PDF

The bidi class for “Pop Directional Format” characters.

Constant: int UC_BIDI_EN

The bidi class for “European Number” characters.

Constant: int UC_BIDI_ES

The bidi class for “European Number Separator” characters.

Constant: int UC_BIDI_ET

The bidi class for “European Number Terminator” characters.

Constant: int UC_BIDI_AN

The bidi class for “Arabic Number” characters.

Constant: int UC_BIDI_CS

The bidi class for “Common Number Separator” characters.

Constant: int UC_BIDI_NSM

The bidi class for “Non-Spacing Mark” characters.

Constant: int UC_BIDI_BN

The bidi class for “Boundary Neutral” characters.

Constant: int UC_BIDI_B

The bidi class for “Paragraph Separator” characters.

Constant: int UC_BIDI_S

The bidi class for “Segment Separator” characters.

Constant: int UC_BIDI_WS

The bidi class for “Whitespace” characters.

Constant: