|
|
|
|
|
|
|
|
# Unicode Character Data Tools |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Unicode Character Database Tools |
|
|
|
|
|
|
|
|
|
|
|
- [Data Files](#data-files)
  - [Unicode Character Database](#unicode-character-database)
  - [ConScript Unicode Registry](#conscript-unicode-registry)
- [C Library](#c-library)
  - [Querying Properties](#querying-properties)
  - [Case Conversion](#case-conversion)
  - [wctype Compatibility](#wctype-compatibility)
|
|
- [Build Dependencies](#build-dependencies) |
|
|
- [Debian](#debian) |
|
|
- [Building](#building) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
---------- |
|
|
|
|
|
|
|
|
The Unicode Character Data (UCD) Tools is a library for working with the
Unicode Character Data from unicode.org.
|
|
|
|
|
|
|
|
The Unicode Character Database (UCD) Tools is a set of Python tools and a C
library. The Python tools are designed to support extracting and processing
data from the text-based UCD source files, while the C library is designed
to provide easy access to this information.
|
|
|
|
|
|
|
|
|
|
|
## Data Files |
|
|
|
|
|
|
|
|
|
|
|
The `ucd-tools` project provides support for UCD formatted data files from
several different sources.
|
|
|
|
|
|
|
|
It provides a compact replacement for various wide-character C APIs. These can
be used in Android applications, as the Android C library does not have full
wide-character support.

In addition to this, it provides APIs for:

- querying the [Unicode General Category](http://www.unicode.org/reports/tr44/) values and groups;
- querying the [ISO 15924](http://www.unicode.org/iso15924/iso15924-codes.html) script;
- converting to upper, lower and title case.

### Unicode Character Database
|
|
|
|
|
|
|
|
The following [Unicode Character Database](http://www.unicode.org/Public/7.0.0/ucd/)
files from the [Unicode Consortium](http://www.unicode.org) are supported:

* Blocks
* DerivedAge
* PropList
* PropertyValueAliases
* Scripts
* UnicodeData

The following data sets are used for the data tables:

- [Unicode Character Data 7.0.0](http://www.unicode.org/Public/7.0.0/ucd/).
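As an illustration of the text format these files share, here is a minimal Python sketch that reads UCD-style lines. It is not taken from the `ucd-tools` Python tools: the function name `parse_ucd_lines` is hypothetical, and the sketch assumes the common layout of semicolon-separated fields, `#` comments, and a first field holding either a single code point or an inclusive `XXXX..YYYY` range. `UnicodeData.txt` marks large ranges differently (with paired `First`/`Last` entries), which this sketch does not handle.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_ucd_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first, last, fields) for each data line (hypothetical helper)."""
    for line in lines:
        # Drop a trailing "# ..." comment, then surrounding whitespace.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        # The first field is one code point or an inclusive range "XXXX..YYYY".
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        yield start, end, fields[1:]


# Sample lines written in the style of Blocks.txt and PropList.txt.
sample = [
    "# Blocks-style sample",
    "0000..007F; Basic Latin",
    "0080..00FF; Latin-1 Supplement",
    "",
    "0020          ; White_Space # Zs       SPACE",
]
records = list(parse_ucd_lines(sample))
```

Each record carries the inclusive code point range and the remaining fields, which is usually enough to build a lookup table for a block or property.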
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### ConScript Unicode Registry |
|
|
|
|
|
|
|
|
If enabled, the following data from the
[ConScript Unicode Registry](http://www.evertype.com/standards/csur/) (CSUR) is
|
|
|
|
|
|
|
|
This data is located in the `data/csur` directory in a form compatible with the
Unicode Character Data files.
|
|
|
|
|
|
|
|
|
|
|
## C Library |
|
|
|
|
|
|
|
|
|
|
|
The C library provides several different facilities that make use of the UCD
data. It provides a compact and efficient representation of the different data
tables.
|
|
|
|
|
|
|
|
|
|
|
Detailed documentation is provided in Doxygen format in the
`src/include/ucd/ucd.h` file.
|
|
|
|
|
|
|
|
|
|
|
### Querying Properties |
|
|
|
|
|
|
|
|
|
|
|
The library exposes the following properties from the UCD data files: |
|
|
|
|
|
|
|
|
|
|
|
| Property           | Description |
|--------------------|-------------|
| `General_Category` | A [General Category Value](http://www.unicode.org/reports/tr44/#General_Category_Values), including the higher-level grouping. |
| `Script`           | An [ISO 15924](http://www.unicode.org/iso15924/iso15924-codes.html) script code. |
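To make the relationship between a General Category value and its higher-level grouping concrete, the following Python sketch uses the standard `unicodedata` module rather than the `ucd-tools` C API. The helper name `general_category` is hypothetical. It relies on the fact that the group of a two-letter General Category value is its first letter (for example `Lu` belongs to `L`). The standard module does not expose the `Script` property, so only `General_Category` is shown.

```python
import unicodedata
from typing import Tuple


def general_category(ch: str) -> Tuple[str, str]:
    """Return (value, group) for one character (hypothetical helper)."""
    value = unicodedata.category(ch)  # two-letter value such as "Lu"
    return value, value[0]            # the group is the first letter


upper_a = general_category("A")
digit_one = general_category("1")
space = general_category(" ")
```

Here `upper_a` is `("Lu", "L")`, `digit_one` is `("Nd", "N")` and `space` is `("Zs", "Z")`.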
|
|
|
|
|
|
|
|
|
|
|
### Case Conversion |
|
|
|
|
|
|
|
|
|
|
|
The following character conversion functions are provided: |
|
|
|
|
|
|
|
|
|
|
|
* `ucd::tolower` -- convert letters to lower case
* `ucd::totitle` -- convert letters to title case (UCD extension)
* `ucd::toupper` -- convert letters to upper case
|
|
|
|
|
|
|
|
|
|
|
__NOTE:__ These functions use the simple case mapping algorithm. That is, they
only ever map to a single character. This keeps their signatures compatible
with the standard C `wctype.h` APIs.
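The difference between simple and full case mapping can be sketched in Python. This is an illustration only, not the `ucd-tools` implementation: the helper `simple_case_mappings` is hypothetical and reads the simple uppercase, lowercase and titlecase fields (fields 12 to 14) of a `UnicodeData.txt` line, treating an empty field as "maps to itself". The sample lines are written in the `UnicodeData.txt` format for U+0061, U+00DF and U+01C4.

```python
from typing import Dict


def simple_case_mappings(line: str) -> Dict[str, int]:
    """Extract single-code-point case mappings from one UnicodeData.txt line."""
    fields = line.split(";")
    code = int(fields[0], 16)

    def mapped(field: str) -> int:
        # An empty mapping field means the character maps to itself.
        return int(field, 16) if field else code

    return {
        "upper": mapped(fields[12]),
        "lower": mapped(fields[13]),
        "title": mapped(fields[14]),
    }


small_a = simple_case_mappings("0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041")
sharp_s = simple_case_mappings("00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;")
dz_caron = simple_case_mappings(
    "01C4;LATIN CAPITAL LETTER DZ WITH CARON;Lu;0;L;<compat> 0044 017D;;;;N;"
    "LATIN CAPITAL LETTER D Z HACEK;;;01C6;01C5"
)

# Python's str.upper() applies the full mapping, which may expand to
# several characters; the simple mapping for U+00DF leaves it unchanged.
full_upper = "\u00df".upper()
```

With the simple mapping, U+00DF stays U+00DF because no single-character uppercase form is listed, whereas the full mapping held in `full_upper` is the two-character string `"SS"`. U+01C4 shows why a separate title case mapping is useful: its title case form is U+01C5, which differs from both its upper and lower case forms.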
|
|
|
|
|
|
|
|
|
|
|
### wctype Compatibility |
|
|
|
|
|
|
|
|
|
|
|
To facilitate working on platforms that don't have a usable wide-character
ctype library, or to provide more consistent behaviour across platforms, the
`ucd-tools` C library provides a set of APIs that are compatible with `wctype.h`.
|
|
|
|
|
|
|
|
|
|
|
The following character classification functions are provided: |
|
|
|
|
|
|
|
|
|
|
|
* `ucd::isalnum`
* `ucd::isalpha`
* `ucd::iscntrl`
* `ucd::isdigit`
* `ucd::isgraph`
* `ucd::islower`
* `ucd::isprint`
* `ucd::ispunct`
* `ucd::isspace`
* `ucd::isupper`
|
|
|
|
|
|
|
|
|
|
|
__NOTE:__ Equivalents for `isblank` and `isxdigit` are not provided. |
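One way to see how classification functions of this kind can be built on top of UCD data is to derive them from the General Category. The Python sketch below does that with the standard `unicodedata` module. The category choices are an assumption made for illustration and may differ from what the `ucd-tools` functions actually do, and every `sketch_` name is hypothetical.

```python
import unicodedata


def sketch_isalpha(ch: str) -> bool:
    # Assumption: any letter category (Lu, Ll, Lt, Lm, Lo) counts as alphabetic.
    return unicodedata.category(ch).startswith("L")


def sketch_isdigit(ch: str) -> bool:
    # Assumption: decimal digits are the characters in category Nd.
    return unicodedata.category(ch) == "Nd"


def sketch_isupper(ch: str) -> bool:
    return unicodedata.category(ch) == "Lu"


def sketch_islower(ch: str) -> bool:
    return unicodedata.category(ch) == "Ll"


def sketch_ispunct(ch: str) -> bool:
    # Assumption: any punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    return unicodedata.category(ch).startswith("P")


def sketch_iscntrl(ch: str) -> bool:
    return unicodedata.category(ch) == "Cc"


def sketch_isalnum(ch: str) -> bool:
    return sketch_isalpha(ch) or sketch_isdigit(ch)
```

For example, `sketch_isalpha("\u00e9")` is true for the Latin small letter e with acute, and `sketch_isdigit("\u0663")` is true for the Arabic-Indic digit three, both of which fall outside the ASCII range.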
|
|
|
|
|
|
|
|
## Build Dependencies |
|
|
|
|
|
|
|
|
In order to build ucd-tools, you need: |
|
|