# Unicode Character Database Tools

- [Data Files](#data-files)
  - [Unicode Character Database](#unicode-character-database)
  - [ConScript Unicode Registry](#conscript-unicode-registry)
- [C Library](#c-library)
  - [Querying Properties](#querying-properties)
  - [Case Conversion](#case-conversion)
  - [wctype Compatibility](#wctype-compatibility)
- [Build Dependencies](#build-dependencies)
  - [Debian](#debian)
- [Building](#building)
- [Updating the UCD Data](#updating-the-ucd-data)
- [Bugs](#bugs)
- [License Information](#license-information)

----------

The Unicode Character Database (UCD) Tools is a set of Python tools and a C
library with a C++ API binding. The Python tools are designed to support
extracting and processing data from the text-based UCD source files, while
the C library is designed to provide easy access to this information within
a C or C++ program.

## Data Files

The `ucd-tools` project provides support for UCD formatted data files from
several different sources.

### Unicode Character Database

The following [Unicode Character Database](http://www.unicode.org/Public/9.0.0/ucd/)
files are supported:

*  Blocks
*  DerivedAge
*  PropList
*  PropertyValueAliases
*  Scripts
*  UnicodeData

### ConScript Unicode Registry

If enabled, the following data from the
[ConScript Unicode Registry](http://www.evertype.com/standards/csur/) (CSUR) is
added:

| Code Range   | Script  |
|--------------|---------|
| `F8D0-F8FF`  | [Klingon](http://www.evertype.com/standards/csur/klingon.html) |

This data is located in the `data/csur` directory in a form compatible with the
Unicode Character Database files.

## C Library

The C library provides several different facilities that make use of the UCD
data. It provides a compact and efficient representation of the different data
tables.

Detailed documentation is provided as Doxygen comments in the
`src/include/ucd/ucd.h` header file.

### Querying Properties

The library exposes the following properties from the UCD data files:

| Property           | Description |
|--------------------|-------------|
| `General_Category` | A [General Category Value](http://www.unicode.org/reports/tr44/#General_Category_Values), including the higher-level grouping. |
| `Script`           | An [ISO 15924](http://www.unicode.org/iso15924/iso15924-codes.html) script code. |

### Case Conversion

The following character conversion functions are provided:

| C API         | C++ API        | Description |
|---------------|----------------|-------------|
| `ucd_tolower` | `ucd::tolower` | convert letters to lower case |
| `ucd_totitle` | `ucd::totitle` | convert letters to title case (UCD extension) |
| `ucd_toupper` | `ucd::toupper` | convert letters to upper case |

__NOTE:__ These functions use the simple case mapping algorithm, so each input
character maps to exactly one output character. This keeps their signatures
compatible with the standard C `wctype.h` APIs.
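As a rough illustration of what "simple" case mapping implies, the sketch below uses a tiny hand-written table instead of the library's generated data. The names `codepoint`, `simple_mapping`, `sketch_toupper` and `sketch_toupper_inplace` are hypothetical and are not part of the `ucd-tools` API; the table entries are taken from the simple uppercase mappings in `UnicodeData.txt`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Unicode code point type. */
typedef uint32_t codepoint;

/* One simple case mapping: exactly one code point in, one code point out. */
struct simple_mapping {
	codepoint from;
	codepoint to;
};

/* A handful of simple uppercase mappings (illustrative, far from complete). */
static const struct simple_mapping upper_sketch[] = {
	{ 0x0061, 0x0041 }, /* a -> A */
	{ 0x00E9, 0x00C9 }, /* e with acute -> E with acute */
	{ 0x01C6, 0x01C4 }, /* dz with caron digraph: small -> capital */
};

/* Return the simple uppercase mapping of c, or c itself if there is none. */
codepoint sketch_toupper(codepoint c)
{
	size_t count = sizeof(upper_sketch) / sizeof(upper_sketch[0]);
	for (size_t i = 0; i < count; ++i) {
		if (upper_sketch[i].from == c)
			return upper_sketch[i].to;
	}
	return c;
}

/* Because the mapping is one-to-one, a buffer can be converted in place
 * without its length ever changing. */
void sketch_toupper_inplace(codepoint *text, size_t len)
{
	assert(text != NULL);
	for (size_t i = 0; i < len; ++i)
		text[i] = sketch_toupper(text[i]);
}
```

U+00DF (LATIN SMALL LETTER SHARP S) has no simple uppercase mapping, so a simple mapping returns it unchanged. Its full uppercase mapping is the two-character sequence "SS", which cannot be expressed by a function that returns a single character.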

### wctype Compatibility

To support platforms that lack a usable wide-character classification library
(`wctype.h`), or to provide more consistent behaviour across platforms, the
`ucd-tools` C library provides a set of APIs that are compatible with `wctype.h`.

The following character classification functions are provided:

| C API          | C++ API         |
|----------------|-----------------|
| `ucd_isalnum`  | `ucd::isalnum`  |
| `ucd_isalpha`  | `ucd::isalpha`  |
| `ucd_isblank`  | `ucd::isblank`  |
| `ucd_iscntrl`  | `ucd::iscntrl`  |
| `ucd_isdigit`  | `ucd::isdigit`  |
| `ucd_isgraph`  | `ucd::isgraph`  |
| `ucd_islower`  | `ucd::islower`  |
| `ucd_isprint`  | `ucd::isprint`  |
| `ucd_ispunct`  | `ucd::ispunct`  |
| `ucd_isspace`  | `ucd::isspace`  |
| `ucd_isupper`  | `ucd::isupper`  |
| `ucd_isxdigit` | `ucd::isxdigit` |
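Because these functions mirror `wctype.h`, code written against the standard API should be able to switch with little more than a rename. The sketch below is one way to arrange that. It assumes the header is installed as `<ucd/ucd.h>` and that `ucd_isalpha` accepts a code point as an integer argument; the `USE_UCD_TOOLS` macro, the `IS_ALPHA` wrapper and the `count_alpha` function are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef USE_UCD_TOOLS
#include <ucd/ucd.h>  /* assumed install location of src/include/ucd/ucd.h */
#define IS_ALPHA(c) ucd_isalpha(c)  /* assumed to take a code point value */
#else
#include <wctype.h>
#define IS_ALPHA(c) iswalpha((wint_t)(c))  /* standard C fallback */
#endif

/* Count the alphabetic code points in a buffer of code point values. */
size_t count_alpha(const uint32_t *text, size_t len)
{
	assert(text != NULL);
	size_t n = 0;
	for (size_t i = 0; i < len; ++i) {
		if (IS_ALPHA(text[i]))
			++n;
	}
	return n;
}
```

Compiled without `USE_UCD_TOOLS`, the sketch falls back to the platform's `iswalpha`, whose results outside ASCII depend on the current locale. Defining the macro and linking against the library would route the same call through `ucd_isalpha` instead.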

## Build Dependencies

In order to build ucd-tools, you need:

1.  a functional autotools system (`make`, `autoconf`, `automake` and `libtool`);
2.  functional C and C++ compilers.

__NOTE:__ The C++ compiler is used to build the test for the C++ API.

To build the documentation, you need:

1.  the `doxygen` program to build the API documentation;
2.  the `dot` program from the Graphviz package to generate graphs in the API documentation.

### Debian

Core Dependencies:

| Dependency       | Install                                               |
|------------------|-------------------------------------------------------|
| autotools        | `sudo apt-get install make autoconf automake libtool` |
| C/C++ compilers  | `sudo apt-get install gcc g++`                        |

Documentation Dependencies:

| Dependency | Install                         |
|------------|---------------------------------|
| doxygen    | `sudo apt-get install doxygen`  |
| graphviz   | `sudo apt-get install graphviz` |

## Building

UCD Tools supports the standard GNU autotools build system. The source code
does not contain the generated `configure` files, so to build it you need to
run:

	./autogen.sh
	./configure --prefix=/usr
	make

The tests can be run by using:

	make check

The library can be installed using:

	sudo make install

The documentation can be built using:

	make html

## Updating the UCD Data

To regenerate the source files from the UCD data when a new version of
Unicode is released, you need to run:

	./configure --prefix=/usr --with-unicode-version=VERSION
	make ucd-update

where `VERSION` is the Unicode version (e.g. `6.3.0`).

Additionally, you can use the `UCD_FLAGS` option to control how the data is
generated. The following flags are supported:

| Flag          | Description |
|---------------|-------------|
| `--with-csur` | Add ConScript Unicode Registry data. |

## Bugs

Report bugs to the [ucd-tools issues](https://github.com/rhdunn/ucd-tools/issues)
page on GitHub.

## License Information

UCD Tools is released under the GNU General Public License (GPL), version 3 or
later.

The UCD data files in `data/ucd` are downloaded from the UCD website and are
licensed under the [Unicode Terms of Use](COPYING.UCD). These data files are
used in their unmodified form. They carry the following copyright notice:

    Copyright © 1991-2014 Unicode, Inc. All rights reserved.

The files in `data/csur` are based on the information from the ConScript
Unicode Registry maintained by John Cowan and Michael Everson.