| # Unicode Character Data Tools | |||||
| - [ConScript Unicode Registry](#conscript-unicode-registry) | |||||
| # Unicode Character Database Tools | |||||
| - [Data Files](#data-files) | |||||
| - [Unicode Character Database](#unicode-character-database) | |||||
| - [ConScript Unicode Registry](#conscript-unicode-registry) | |||||
| - [C Library](#c-library) | |||||
| - [Querying Properties](#querying-properties) | |||||
| - [Case Conversion](#case-conversion) | |||||
| - [wctype Compatibility](#wctype-compatibility) | |||||
| - [Build Dependencies](#build-dependencies) | - [Build Dependencies](#build-dependencies) | ||||
| - [Debian](#debian) | - [Debian](#debian) | ||||
| - [Building](#building) | - [Building](#building) | ||||
| ---------- | ---------- | ||||
| The Unicode Character Data (UCD) Tools is a library for working with the | |||||
| Unicode Character Data from unicode.org. | |||||
| The Unicode Character Database (UCD) Tools is a set of Python tools and a C | |||||
| library. The Python tools are designed to support extracting and processing | |||||
| data from the text-based UCD source files, while the C library is designed | |||||
| to provide easy access to this information. | |||||
| ## Data Files | |||||
| The `ucd-tools` project provides support for UCD formatted data files from | |||||
| several different sources. | |||||
| It provides a compact replacement for various wide-character C APIs. These can | |||||
| be used in Android applications, as the Android C library does not have full | |||||
| wide-character support. | |||||
| ### Unicode Character Database | |||||
| In addition to this it provides APIs for: | |||||
| - querying the [Unicode General Category](http://www.unicode.org/reports/tr44/) values and groups; | |||||
| - querying the [ISO 15924](http://www.unicode.org/iso15924/iso15924-codes.html) script; | |||||
| - converting to upper, lower and title case. | |||||
| The following [Unicode Character Database](http://www.unicode.org/Public/7.0.0/ucd/) | |||||
| files from the [Unicode Consortium](http://www.unicode.org) are supported: | |||||
| The following data sets are used for the data tables: | |||||
| - [Unicode Character Data 7.0.0](http://www.unicode.org/Public/7.0.0/ucd/). | |||||
| * Blocks | |||||
| * DerivedAge | |||||
| * PropList | |||||
| * PropertyValueAliases | |||||
| * Scripts | |||||
| * UnicodeData | |||||
| ## ConScript Unicode Registry | |||||
| ### ConScript Unicode Registry | |||||
| If enabled, the following data from the | If enabled, the following data from the | ||||
| [ConScript Unicode Registry](http://www.evertype.com/standards/csur/) (CSUR) is | [ConScript Unicode Registry](http://www.evertype.com/standards/csur/) (CSUR) is | ||||
| This data is located in the `data/csur` directory in a form compatible with the | This data is located in the `data/csur` directory in a form compatible with the | ||||
| Unicode Character Data files. | Unicode Character Data files. | ||||
| ## C Library | |||||
| The C library provides several different facilities that make use of the UCD | |||||
| data. It provides a compact and efficient representation of the different data | |||||
| tables. | |||||
| Detailed documentation is provided in the `src/include/ucd/ucd.h` file in the | |||||
| Doxygen documentation format. | |||||
| ### Querying Properties | |||||
| The library exposes the following properties from the UCD data files: | |||||
| | Property | Description | | |||||
| |--------------------|-------------| | |||||
| | `General_Category` | A [General Category Value](http://www.unicode.org/reports/tr44/#General_Category_Values), including the higher-level grouping. | | |||||
| | `Script` | An [ISO 15924](http://www.unicode.org/iso15924/iso15924-codes.html) script code. | | |||||
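
As a rough illustration of the "higher-level grouping" mentioned for `General_Category`: in UAX #44 the first letter of a two-letter category alias identifies its group (for example `Lu` and `Ll` both belong to `L`, Letter). The helper below is a hypothetical, self-contained sketch and is not part of the `ucd-tools` API; the name `category_group_name` is an assumption made for this example.

```cpp
#include <string>

// Hypothetical helper (not from ucd-tools): map a two-letter
// General_Category alias such as "Lu" to the name of its group.
// Only the first letter is inspected; the second letter is not validated.
const char *category_group_name(const std::string &gc)
{
    if (gc.size() != 2)
        return nullptr; // not a two-letter alias

    switch (gc[0]) {
    case 'L': return "Letter";
    case 'M': return "Mark";
    case 'N': return "Number";
    case 'P': return "Punctuation";
    case 'S': return "Symbol";
    case 'Z': return "Separator";
    case 'C': return "Other";
    default:  return nullptr; // unknown group letter
    }
}
```

With this sketch, `category_group_name("Nd")` yields `"Number"`, while an unrecognised alias yields a null pointer.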
| ### Case Conversion | |||||
| The following character conversion functions are provided: | |||||
| * `ucd::tolower` -- convert letters to lower case | |||||
| * `ucd::totitle` -- convert letters to title case (UCD extension) | |||||
| * `ucd::toupper` -- convert letters to upper case | |||||
| __NOTE:__ These functions use the simple case mapping algorithm. That is, they | |||||
| only ever map to a single character. This is to provide a compatible signature | |||||
| to the standard C `wctype.h` APIs. | |||||
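
To make the note on simple case mapping concrete, here is a minimal self-contained sketch of a one-to-one upper-case mapping. It is not the library's implementation: `simple_toupper` and its tiny range-based table are assumptions for illustration, and it covers only ASCII `a`-`z` and the Latin-1 letters U+00E0 to U+00FE. The point is the signature: one code point in, one code point out, so a character such as U+00DF (ß), whose full upper-case form is the two-character sequence "SS", is returned unchanged.

```cpp
#include <cstdint>

typedef std::uint32_t codepoint_t;

// Hypothetical stand-in for a simple (single code point) upper-case mapping.
// Deliberately partial: handles ASCII a-z and Latin-1 U+00E0..U+00FE only.
codepoint_t simple_toupper(codepoint_t c)
{
    if (c >= 0x61 && c <= 0x7A)
        return c - 0x20; // ASCII a-z -> A-Z
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return c - 0x20; // Latin-1 lower-case letters; U+00F7 is the division sign
    return c;            // no single-code-point mapping here (e.g. U+00DF)
}
```

Under these assumptions, `simple_toupper(0xE9)` returns `0xC9` (é to É), and `simple_toupper(0xDF)` returns `0xDF` because a one-to-one mapping cannot express "SS".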
| ### wctype Compatibility | |||||
| To support platforms that don't have a usable wide-character classification | |||||
| library (`wctype.h`), or to provide more consistent behaviour across platforms, the | |||||
| `ucd-tools` C library provides a set of APIs that are compatible with `wctype.h`. | |||||
| The following character classification functions are provided: | |||||
| * `ucd::isalnum` | |||||
| * `ucd::isalpha` | |||||
| * `ucd::iscntrl` | |||||
| * `ucd::isdigit` | |||||
| * `ucd::isgraph` | |||||
| * `ucd::islower` | |||||
| * `ucd::isprint` | |||||
| * `ucd::ispunct` | |||||
| * `ucd::isspace` | |||||
| * `ucd::isupper` | |||||
| __NOTE:__ Equivalents for `isblank` and `isxdigit` are not provided. | |||||
| ## Build Dependencies | ## Build Dependencies | ||||
| In order to build ucd-tools, you need: | In order to build ucd-tools, you need: |