eSpeak NG is an open source speech synthesizer that supports more than hundred languages and accents.
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Reece H. Dunn c9af03ee8b Add IDS_Binary_Operator support from PropList.txt. 8 years ago
_layouts Build HTML versions of the README and CHANGELOG files. 9 years ago
data/csur Klingon: Provide a more accurate Copyright notice. 10 years ago
docs Convert docs/ReleaseNotes.md to a more standard CHANGELOG.md file. 9 years ago
src Add IDS_Binary_Operator support from PropList.txt. 8 years ago
tests Add tests for the PropList API. 8 years ago
tools Add IDS_Binary_Operator support from PropList.txt. 8 years ago
.gitignore printcdata: a version of printucddata that uses the C APIs where available 8 years ago
AUTHORS Parse the UCD data files. 12 years ago
CHANGELOG.md ctype: return true in isupper/islower if there is a simple case mapping present 8 years ago
COPYING Parse the UCD data files. 12 years ago
COPYING.UCD README: Comply with the Unicode Terms of Use 10 years ago
Makefile.am Add White_Space support from PropList.txt. 8 years ago
README.md Add an iswxdigit compatibility API. 8 years ago
autogen.sh Convert docs/ReleaseNotes.md to a more standard CHANGELOG.md file. 9 years ago
configure.ac ucd-tools 9.0.0 8 years ago

README.md

Unicode Character Database Tools


The Unicode Character Database (UCD) Tools is a set of Python tools and a C library with a C++ API binding. The Python tools are designed to support extracting and processing data from the text-based UCD source files, while the C library is designed to provide easy access to this information within a C or C++ program.

Data Files

The ucd-tools project provides support for UCD formatted data files from several different sources.

Unicode Character Database

The following Unicode Character Database files are supported:

  • Blocks
  • DerivedAge
  • PropList
  • PropertyValueAliases
  • Scripts
  • UnicodeData

ConScript Unicode Registry

If enabled, the following data from the ConScript Unicode Registry (CSUR) is added:

Code Range Script
F8D0-F8FF Klingon

This data is located in the data/csur directory in a form compatible with the Unicode Character Data files.

C Library

The C library provides several different facilities that make use of the UCD data. It provides a compact and efficient representation of the different data tables.

Detailed documentation is provided in the src/include/ucd/ucd.h file in the Doxygen documentation format.

Querying Properties

The library exposes the following properties from the UCD data files:

Property Description
General_Category A General Category Value, including the higher-level grouping.
Script An ISO 15924 script code.

Case Conversion

The following character conversion functions are provided:

C API C++ API Description
ucd_tolower ucd::tolower convert letters to lower case
ucd_totitle ucd::totitle convert letters to title case (UCD extension)
ucd_toupper ucd::toupper convert letters to upper case

NOTE: These functions use the simple case mapping algorithm. That is, they only ever map to a single character. This is to provide a compatible signature to the standard C wctype.h APIs.

wctype Compatibility

To facilitate working on platforms that don’t have a useable wide-character ctypes library, or to provide a more consistent behaviour, the ucd-tools C library provides a set of APIs that are compatible with wctype.h.

The following character classification functions are provided:

C API C++ API
ucd_isalnum ucd::isalnum
ucd_isalpha ucd::isalpha
ucd_isblank ucd::isblank
ucd_iscntrl ucd::iscntrl
ucd_isdigit ucd::isdigit
ucd_isgraph ucd::isgraph
ucd_islower ucd::islower
ucd_isprint ucd::isprint
ucd_ispunct ucd::ispunct
ucd_isspace ucd::isspace
ucd_isupper ucd::isupper
ucd_isxdigit ucd::isxdigit

Build Dependencies

In order to build ucd-tools, you need:

  1. a functional autotools system (make, autoconf, automake and libtool);
  2. a functional C and C++ compiler.

NOTE: The C++ compiler is used to build the test for the C++ API.

To build the documentation, you need:

  1. the doxygen program to build the api documentation;
  2. the dot program from the graphviz library to generate graphs in the api documentation.

Debian

Core Dependencies:

Dependency Install
autotools sudo apt-get install make autoconf automake libtool
C++ compiler sudo apt-get install gcc g++

Documentation Dependencies:

Dependency Install
doxygen sudo apt-get install doxygen
graphviz sudo apt-get install graphviz

Building

UCD Tools supports the standard GNU autotools build system. The source code does not contain the generated configure files, so to build it you need to run:

./autogen.sh
./configure --prefix=/usr
make

The tests can be run by using:

make check

The program can be installed using:

sudo make install

The documentation can be built using:

make html

Updating the UCD Data

To re-generate the source files from the UCD data when a new version of unicode is released, you need to run:

./configure --prefix=/usr --with-unicode-version=VERSION
make ucd-update

where VERSION is the Unicode version (e.g. 6.3.0).

Additionally, you can use the UCD_FLAGS option to control how the data is generated. The following flags are supported:

Flag Description
--with-csur Add ConScript Unicode Registry data.

Bugs

Report bugs to the ucd-tools issues page on GitHub.

License Information

UCD Tools is released under the GPL version 3 or later license.

The UCD data files in data/ucd are downloaded from the UCD website and are licensed under the Unicode Terms of Use. These data files are used in their unmodified form. They have the following Copyright notice:

Copyright © 1991-2014 Unicode, Inc. All rights reserved.

The files in data/csur are based on the information from the ConScript Unicode Registry maintained by John Cowan and Michael Everson.