The include file <unictype.h> declares functions that classify Unicode characters and functions that test whether Unicode characters have specific properties.
The classification assigns a “general category” to every Unicode
character. This is similar to the classification provided by ISO C in
<wctype.h>.
Properties are the data that guide various text processing algorithms in the presence of specific Unicode characters.
Every Unicode character or code point has a general category assigned to it. This classification is important for most algorithms that work on Unicode text.
The GNU libunistring library provides two kinds of API for working with
general categories. The object-oriented API uses a variable to denote
every predefined general category value or combination thereof. The
low-level API uses a bit mask instead. The advantage of the object-oriented
API is that if only a few predefined general category values are used,
the data tables are relatively small. When you combine general category
values (using uc_general_category_or, uc_general_category_and,
or uc_general_category_and_not), or when you use the low-level
bit masks, a big table is used that holds the complete general category
information for all Unicode characters.
This data type denotes a general category value. It is an immediate type that can be copied by simple assignment, without involving memory allocation. It is not an array type.
The following are the predefined general category values. Additional general categories may be added in the future.
The UC_CATEGORY_* constants reflect the systematic general category
values assigned by the Unicode Consortium, whereas the other UC_*
macros are aliases, for use when readable code is preferred.
This represents the general category “Letter”.
This represents the general category “Letter, uppercase”.
This represents the general category “Letter, lowercase”.
This represents the general category “Letter, titlecase”.
This represents the general category “Letter, modifier”.
This represents the general category “Letter, other”.
This represents the general category “Mark”.
This represents the general category “Mark, nonspacing”.
This represents the general category “Mark, spacing combining”.
This represents the general category “Mark, enclosing”.
This represents the general category “Number”.
This represents the general category “Number, decimal digit”.
This represents the general category “Number, letter”.
This represents the general category “Number, other”.
This represents the general category “Punctuation”.
This represents the general category “Punctuation, connector”.
This represents the general category “Punctuation, dash”.
This represents the general category “Punctuation, open”, a.k.a. “start punctuation”.
This represents the general category “Punctuation, close”, a.k.a. “end punctuation”.
This represents the general category “Punctuation, initial quote”.
This represents the general category “Punctuation, final quote”.
This represents the general category “Punctuation, other”.
This represents the general category “Symbol”.
This represents the general category “Symbol, math”.
This represents the general category “Symbol, currency”.
This represents the general category “Symbol, modifier”.
This represents the general category “Symbol, other”.
This represents the general category “Separator”.
This represents the general category “Separator, space”.
This represents the general category “Separator, line”.
This represents the general category “Separator, paragraph”.
This represents the general category “Other”.
This represents the general category “Other, control”.
This represents the general category “Other, format”.
This represents the general category “Other, surrogate”. All code points in this category are invalid characters.
This represents the general category “Other, private use”.
This represents the general category “Other, not assigned”. Some code points in this category are invalid characters.
The following functions combine general categories, as in a Boolean algebra, except that there is no ‘not’ operation.
Returns the union of two general categories. This corresponds to the union of the two sets of characters.
Returns the intersection of two general categories as bit masks. This does not correspond to the intersection of the two sets of characters.
Returns the intersection of a general category with the complement of a second general category, as bit masks. This does not correspond to the intersection with complement, when viewing the categories as sets of characters.
The following functions associate general categories with their name.
Returns the name of a general category, more precisely, the abbreviated name. Returns NULL if the general category corresponds to a bit mask that does not have a name.
Returns the long name of a general category. Returns NULL if the general category corresponds to a bit mask that does not have a name.
Returns the general category given by name, e.g. "Lu", or by long
name, e.g. "Uppercase Letter".
This lookup is case-insensitive and treats spaces, underscores, and
hyphens as interchangeable word separators.
The following functions view general categories as sets of Unicode characters.
Returns the general category of a Unicode character.
This function uses a big table.
Tests whether a Unicode character belongs to a given category. The category argument can be a predefined general category or the combination of several predefined general categories.
The following are the predefined general category values as bit masks. Additional general categories may be added in the future.
The following function views general categories as sets of Unicode characters.
Tests whether a Unicode character belongs to a given category. The bitmask argument can be a predefined general category bitmask or the combination of several predefined general category bitmasks.
This function uses a big table comprising all general categories.
Every Unicode character or code point has a canonical combining class assigned to it.
What is the meaning of the canonical combining class? Essentially, it indicates the priority with which a combining character is attached to its base character. The characters for which the canonical combining class is 0 are the base characters, and the characters for which it is greater than 0 are the combining characters. Combining characters are rendered near, attached to, or around their base character, and combining characters with smaller combining class values are attached "first" or "closer" to the base character.
The canonical combining class of a character is a number in the range 0..255. The possible values are described in the Unicode Character Database http://www.unicode.org/Public/UNIDATA/UCD.html. The list here is not definitive; more values can be added in future versions.
The canonical combining class value for “Not Reordered” characters. The value is 0.
The canonical combining class value for “Overlay” characters.
The canonical combining class value for “Nukta” characters.
The canonical combining class value for “Kana Voicing” characters.
The canonical combining class value for “Virama” characters.
The canonical combining class value for “Attached Below Left” characters.
The canonical combining class value for “Attached Below” characters.
The canonical combining class value for “Attached Above” characters.
The canonical combining class value for “Attached Above Right” characters.
The canonical combining class value for “Below Left” characters.
The canonical combining class value for “Below” characters.
The canonical combining class value for “Below Right” characters.
The canonical combining class value for “Left” characters.
The canonical combining class value for “Right” characters.
The canonical combining class value for “Above Left” characters.
The canonical combining class value for “Above” characters.
The canonical combining class value for “Above Right” characters.
The canonical combining class value for “Double Below” characters.
The canonical combining class value for “Double Above” characters.
The canonical combining class value for “Iota Subscript” characters.
The following functions associate canonical combining classes with their name.
Returns the name of a canonical combining class, more precisely, the abbreviated name. Returns NULL if the canonical combining class is a numeric value without a name.
Returns the long name of a canonical combining class. Returns NULL if the canonical combining class is a numeric value without a name.
Returns the canonical combining class given by name, e.g. "BL", or by
long name, e.g. "Below Left".
This lookup is case-insensitive and treats spaces, underscores, and
hyphens as interchangeable word separators.
The following function looks up the canonical combining class of a character.
Returns the canonical combining class of a Unicode character.
Every Unicode character or code point has a bidi class assigned to it. Before Unicode 4.0, this concept was known as bidirectional category.
The bidi class guides the bidirectional algorithm (http://www.unicode.org/reports/tr9/). The possible values are the following.
The bidi class for “Left-to-Right” characters.
The bidi class for “Left-to-Right Embedding” characters.
The bidi class for “Left-to-Right Override” characters.
The bidi class for “Right-to-Left” characters.
The bidi class for “Right-to-Left Arabic” characters.
The bidi class for “Right-to-Left Embedding” characters.
The bidi class for “Right-to-Left Override” characters.
The bidi class for “Pop Directional Format” characters.
The bidi class for “European Number” characters.
The bidi class for “European Number Separator” characters.
The bidi class for “European Number Terminator” characters.
The bidi class for “Arabic Number” characters.
The bidi class for “Common Number Separator” characters.
The bidi class for “Non-Spacing Mark” characters.
The bidi class for “Boundary Neutral” characters.
The bidi class for “Paragraph Separator” characters.
The bidi class for “Segment Separator” characters.
The bidi class for “Whitespace” characters.