8 years ago · 9f8dc78422
--- a/docs/add_language.md
+++ b/docs/add_language.md
@@ -1,42 +1,48 @@
 # Adding or Improving a Language

 - [Language Code](#language-code)
  - [Language](#language)
  - [Accent](#accent)
 - [Considerations Before Preparation](#considerations-before-preparation)
  - [Language Tag](#language-tag)
  - [Language Family](#language-family)
 - [Language Files](#language-files)
 - [Voice File](#voice-file)
 - [Phoneme Definition File](#phoneme-definition-file)
 - [Dictionary Files](#dictionary-files)
  - [Accent (optional)](#accent-optional)
 - [Configuration Files](#configuration-files)
  - [Makefile.am file](#makefileam-file)
  - [Phonemes file](#phonemes-file)
  - [Voice File](#voice-file)
  - [Phoneme Definition File](#phoneme-definition-file)
  - [Dictionary Files](#dictionary-files)
 - [Program Code](#program-code)
 - [Compiling Rules File for Debugging](#compiling-rules-file-for-debugging)
 - [Improving a Language](#improving-a-language)

 ----------

 Most of the work doesn't need any programming knowledge. Just an
 understanding of the language, an awareness of its features, patience
 and attention to detail. Wikipedia is a good source of basic phonetic
 information, e.g.
 ## Considerations Before Preparation


 Most of the work doesn't need any programming knowledge, but to get immediate
 feedback by running and testing eSpeak,
 you should be able to [build](../README.md#building) it.

 You also need to understand the main concepts of the language, be aware of its features,
 and have patience and attention to detail.
 Wikipedia is a good source of basic phonetic information, e.g.
 [http://en.wikipedia.org/wiki/Vowel](http://en.wikipedia.org/wiki/Vowel).

 In many cases it should be fairly easy to add a rough implementation of
 a new language, hopefully enough to be intelligible. After that it's a
 gradual process of improvement.

 ## Language Code
 ### Language Tag

 The language is identified using the
 [BCP 47](https://en.wikipedia.org/wiki/IETF_language_tag) language tag. The
 list of valid tags originate from various standards and have been combined
 [BCP 47](https://en.wikipedia.org/wiki/IETF_language_tag) language tag.
 The list of valid tags originates from various standards and has been combined
 into the
 [IANA Language Subtag Registry](http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).

 ### Language

 These language tags are used to specify the language, such as:

 *  `de` (German) -- The [ISO 639-1](https://en.wikipedia.org/wiki/ISO_639-1)
 *  `fr` (French) -- The [ISO 639-1](https://en.wikipedia.org/wiki/ISO_639-1)
   2-letter language code for the language.

   __NOTE:__ BCP 47 uses ISO 639-1 codes for languages that are allocated
@@ -45,15 +51,26 @@ These language tags are used to specify the language, such as:
 *  `yue` (Cantonese) -- The [ISO 639-3](https://en.wikipedia.org/wiki/ISO_639-3)
   3-letter language codes for the language.

 *  `ta-Arab` (Tamil written in the Arabic alphabet) -- The
 *  `ta-arab` (Tamil written in the Arabic alphabet) -- The
   [ISO 15924](https://en.wikipedia.org/wiki/ISO_15924) 4-letter script code.

   __NOTE:__ Where the script is the primary script for the language, the script
   tag should be omitted.
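The language and script parts of such a tag can be told apart by length alone. The following is a minimal sketch, not part of eSpeak: `split_tag` is a hypothetical helper, and it assumes a tag made of only a 2- or 3-letter language code and an optional 4-letter script code. Other subtags that BCP 47 allows are deliberately not handled.

```python
import re

# Hypothetical helper, not an eSpeak API.
# language: 2 letters (ISO 639-1) or 3 letters (ISO 639-3)
# script:   4 letters (ISO 15924), optional
TAG = re.compile(r"^(?P<language>[A-Za-z]{2,3})(?:-(?P<script>[A-Za-z]{4}))?$")


def split_tag(tag):
    """Split a simple language tag into its language and script subtags."""
    match = TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported tag: {tag!r}")
    # Keep only the subtags that are present, normalized to lower case.
    return {key: value.lower() for key, value in match.groupdict().items() if value}
```

Lower-casing the result matches the `ta-arab` spelling used in the list above.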

 ### Accent
 ### Language Family

 The language tags are also used to specify the accent or dialect of a language,
 The voices are grouped by the closest language family to which the language belongs.
 These language families are defined in
 [ISO 639-5](https://en.wikipedia.org/wiki/ISO_639-5). See also Wiktionary's
 [List of language families](https://en.wiktionary.org/wiki/Wiktionary:List_of_families)
 for more details.

 For example, the Celtic languages (Welsh, Irish Gaelic, Scottish Gaelic, etc.)
 are listed under the `cel` language family code.

 ### Accent (optional)

 If necessary, a language tag can also specify the accent or dialect of a language,
 such as:

 *  `es-419` (Spanish (Latin America)) -- The
@@ -75,44 +92,73 @@ such as:
   [bcp47-data](https://github.com/rhdunn/bcp47-data) project and a private use
   tag will be defined for that accent.

 ### Language Family
 ## Configuration Files

 The voices are grouped by the closest language family the language belongs.
 These language families are defined in
 [ISO 639-5](https://en.wikipedia.org/wiki/ISO_639-5). See also Wikipedia's
 [List of language families] (https://en.wiktionary.org/wiki/Wiktionary:List_of_families)
 for more details.
 To add a new language, you have to create or edit the following files:

 For example, the Celtic languages (Welsh, Irish Gaelic, Scottish Gaelic, etc.)
 are listed under the `cel` language family code.
 |path/file                     |action  |
 |------------------------------|--------|
 | Makefile.am                  |edit    |
 | phsource/phonemes            |edit    |
 | phsource/ph_french           |create  |
 | dictsource/fr_list           |create  |
 | dictsource/fr_rules          |create  |
 | dictsource/fr_extra          |create (optional) |
 | espeak-data/voices/roa/fr    |create  |

 ## Language Files
 where:

 The following files are needed for your language.
 * __french__ is the name of the newly created language
 * __fr__ is the code of this language
 * __roa__ is the family of this language
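The naming pattern in the table can be written down as a small function. This is only an illustrative sketch: `language_files` is a hypothetical helper, not something shipped with eSpeak, and it does nothing more than substitute the language name, code and family into the paths listed above.

```python
def language_files(name, code, family):
    """Return (path, action) pairs for a new language.

    Hypothetical helper: it mirrors the table above and assumes the same
    directory layout; it does not touch the file system.
    """
    return [
        ("Makefile.am", "edit"),
        ("phsource/phonemes", "edit"),
        (f"phsource/ph_{name}", "create"),
        (f"dictsource/{code}_list", "create"),
        (f"dictsource/{code}_rules", "create"),
        (f"dictsource/{code}_extra", "create (optional)"),
        (f"espeak-data/voices/{family}/{code}", "create"),
    ]


if __name__ == "__main__":
    # Print the checklist for the French example used in this document.
    for path, action in language_files("french", "fr", "roa"):
        print(f"{action:18} {path}")
```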

  * `espeak-data/voices/roa/fr`. The voice file. This gives the language name
    and may set some options.
  * `phsource/ph_french`. The phoneme definition file. This contains phoneme
    definitions for the vowels and consonants which the language uses. Usually
    it will contain mostly vowels. Most consonants will be inherited from the
    common phoneme definitions in the master phoneme file, `phsource/phonemes`.
    The master phoneme file needs to be edited to call your new `ph_french` file.
  * `dictsource/fr_rules`. This contains the spelling-to-phoneme translation
     rules.
  * `dictsource/fr_list`. This contains pronunciations for numbers, letter and
    symbol names, and words with exceptional pronunciations. It also gives
    attributes such as "unstressed" and "pause" to some common words.

 The `fr_rules` and `fr_list` files are compiled to produce the
 `espeak-data/fr_dict` file, which eSpeak uses when it is speaking.
 ### Makefile.am File

 `Makefile.am` is the build configuration file.

 Search for the configuration of an existing language (e.g. English)
 and add similar lines for your language in the following sections.
 E.g. for French:

 	phsource/phonemes.stamp: \
 	...
 	  phsource/ph_french \
 	...
 	
 	dictionaries: \
 	...
 	  espeak-ng-data/fr_dict \
 	...
 	
 	fr: espeak-ng-data/fr_dict
 	dictsource/fr_extra:
 	  touch dictsource/fr_extra
 	espeak-ng-data/fr_dict: src/espeak-ng phsource/phonemes.stamp dictsource/fr_list dictsource/fr_rules dictsource/fr_extra
 	  cd dictsource && ESPEAK_DATA_PATH=$(PWD) LD_LIBRARY_PATH=../src:${LD_LIBRARY_PATH} ../src/espeak-ng --compile=fr && cd ..
 	...

 Note that you don't need to add the `fr_extra` references in the last group if your language doesn't have this file.
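Since the last group follows the same pattern for every language, it can be generated from the language code. The sketch below is an assumption-laden illustration, not part of the build system: `makefile_rules` is a hypothetical helper that reproduces the French rules shown above for an arbitrary code and leaves out the `_extra` lines when asked. Recipe lines start with a tab, as make requires.

```python
def makefile_rules(code, has_extra=True):
    """Return the per-language Makefile.am rules as text.

    Hypothetical helper: the rule layout is copied from the French
    example above and is not checked against the real Makefile.am.
    """
    dict_file = f"espeak-ng-data/{code}_dict"
    sources = [f"dictsource/{code}_list", f"dictsource/{code}_rules"]
    lines = [f"{code}: {dict_file}"]
    if has_extra:
        extra = f"dictsource/{code}_extra"
        sources.append(extra)
        # Rule that creates an empty extra file when it is missing.
        lines += [f"{extra}:", f"\ttouch {extra}"]
    deps = " ".join(["src/espeak-ng", "phsource/phonemes.stamp"] + sources)
    lines += [
        f"{dict_file}: {deps}",
        "\tcd dictsource && ESPEAK_DATA_PATH=$(PWD) "
        "LD_LIBRARY_PATH=../src:${LD_LIBRARY_PATH} "
        f"../src/espeak-ng --compile={code} && cd ..",
    ]
    return "\n".join(lines) + "\n"
```

The entries in the `phsource/phonemes.stamp` and `dictionaries` lists still have to be added by hand.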

 ### Phonemes File

 Open the file `phsource/phonemes` and add the following lines to it,
 so that it includes your new phoneme file, e.g. `ph_french`:

 	...
 	phonemetable fr base
 	include ph_french
 	...
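The same edit can be expressed as a text transformation. This is a minimal sketch under stated assumptions: `ensure_phoneme_table` is a hypothetical helper, it always uses `base` as the parent table (as in the French example), and it appends the two lines at the end of the master file only when the `include` line is not already present.

```python
def ensure_phoneme_table(master_text, code, name):
    """Append a phonemetable entry to the master phoneme file text.

    Hypothetical helper: works on the file contents as a string and
    returns the (possibly unchanged) text; the parent table is assumed
    to be `base`.
    """
    include_line = f"include ph_{name}"
    # Leave the text alone if the phoneme file is already included.
    if any(line.strip() == include_line for line in master_text.splitlines()):
        return master_text
    if master_text and not master_text.endswith("\n"):
        master_text += "\n"
    return master_text + f"\nphonemetable {code} base\n{include_line}\n"
```

Working on a string keeps the function easy to check before the result is written back to `phsource/phonemes`.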

 ### Voice File

 ## Voice File
 For example, `espeak-data/voices/roa/fr` is the voice file for French.
 This gives the language name and may set some options.

 Each language needs a voice file in `espeak-data/voices` grouped by the
 [language family](#language-family). The filename of the default voice for a
 language should be the same as the language code (e.g. `fr` for French).

 Details of the contents of voice files are given in [Voices](voices.md).

 The simplest voice file would contain just 2 lines to give the language
 name and language code, eg:
@@ -127,11 +173,18 @@ attributes in the voice file. For example you may want to start the
 implementation of a new language by using the phoneme table of an
 existing language.

 ## Phoneme Definition File
 Details of the contents of voice files are given in [Voices](voices.md).

 ### Phoneme Definition File

 E.g. `phsource/ph_french` is the phoneme definition file for French.
 This contains phoneme definitions for the vowels and consonants which the language uses.
 Usually it will contain mostly vowels. Most consonants will be inherited from the
 common phoneme definitions in the _master phoneme file_: `phsource/phonemes`.

 You must first decide on the set of phonemes (vowel and consonant
 sounds) for the language. These should be defined in a phoneme
 definition file `ph_xxxx`, where `ph_xxxx` is the name of your
 definition file `ph_french`, where `french` is the name of your
 language. A reference to this file is then included at the end of the
 master phoneme file, `phsource/phonemes`, e.g.:

@@ -169,7 +222,30 @@ in eSpeak, together with the available vowel files which can be used to
 define vowel phonemes, will be sufficient. At least for an initial
 implementation.

 ## Dictionary Files
 ### Dictionary Files

 There are usually two dictionary files, e.g. for French:

  * `dictsource/fr_list`. This contains pronunciations for numbers, letter and
    symbol names, and words with exceptional pronunciations. It also gives
    attributes such as "unstressed" and "pause" to some common words.
    The `fr_list` file contains:

       * Pronunciations which are exceptions to the rules in `fr_rules` (e.g. foreign
        names).
      * Pronunciation of letter names, symbol names, and punctuation names.
      * Pronunciation of numbers.
      * Attributes for words. For example, common function words which should not
        be stressed, or conjunctions which should be preceded by a pause. 

  * `dictsource/fr_rules`. This contains the spelling-to-phoneme translation
     rules.

 Details of the contents of the dictionary files are given in
 [Dictionary](dictionary.md).

 The `fr_rules` and `fr_list` files are compiled to produce the
 `espeak-data/fr_dict` file, which eSpeak uses when it is speaking.

 Once the language's phonemes have been defined, then pronunciation
 dictionary data can be produced in order to translate the language's
@@ -185,18 +261,6 @@ or by:

 	make fr

 Details of the contents of the dictionary files are given in
 [Dictionary](dictionary.md).

 The `fr_list` file contains:

  * Pronunciations which exceptions to the rules in `fr_rules`, (e.g. foreign
    names).
  * Pronunciation of letter names, symbol names, and punctuation names.
  * Pronunciation of numbers.
  * Attributes for words. For example, common function words which should not
    be stressed, or conjunctions which should be preceded by a pause. 

 ## Program Code

 The behaviour of the eSpeak program is controlled by various options