Reece H. Dunn
dd90d3812d
tokenizer.c: Support general symbol tokens.
8 years ago
Reece H. Dunn
786575c6ed
tokenizer.c: Support general punctuation tokens.
8 years ago
Reece H. Dunn
0705844bf8
tokenizer.c: Move general category classification that does not override property behaviour to the end, for generic classification.
8 years ago
Reece H. Dunn
683579f403
Make the tokenizer.h API public.
8 years ago
Reece H. Dunn
9af96da469
Make the encoding.h API public.
8 years ago
Reece H. Dunn
55bfbb4754
tokenizer.c: Support ellipsis tokens.
8 years ago
Reece H. Dunn
b847df63b5
tokenizer.c: Support semicolon tokens.
8 years ago
Reece H. Dunn
af7e8fc5a3
tokenizer.c: Support colon tokens.
8 years ago
Reece H. Dunn
7560070dcd
tokenizer.c: Support comma tokens.
8 years ago
Reece H. Dunn
c9199cfacb
tokenizer.c: Support exclamation mark tokens.
8 years ago
Reece H. Dunn
128ceaff6a
tokenizer.c: Support question mark tokens.
8 years ago
Reece H. Dunn
8f62e18324
tokenizer.c: Support full stop tokens.
8 years ago
chrislm
5d8bb74169
IT: new improvements tested in April 2017
Reduced length to 160 for unstressed syllables
Added some exceptions to the Italian dictionaries
8 years ago
Reece H. Dunn
d50f3f2fa5
tokenizer.c: Support word tokens.
8 years ago
Reece H. Dunn
d093513b65
tokenizer.c: Add an options parameter to the tokenizer_reset API.
8 years ago
Reece H. Dunn
c41ac642fa
tokenizer.c: Tokenise Zp codepoints as paragraphs.
8 years ago
Reece H. Dunn
fc7a4e6701
tokenizer.c: Recognise U+000C [FORM FEED (FF)] as a newline codepoint.
8 years ago
Reece H. Dunn
d2d718d700
tokenizer.c: Tokenize line separator codepoints as newline tokens.
8 years ago
Reece H. Dunn
bf45e7ce36
tokenizer.c: Recognise U+0085 [NEW LINE (NEL)] as a newline codepoint.
8 years ago
Reece H. Dunn
df6ca7a22c
tokenizer.c: Support whitespace tokens.
8 years ago
Reece H. Dunn
539edac795
tokenizer.c: Create a codepoint_type helper function to classify codepoints for the tokenizer.
8 years ago
Reece H. Dunn
8f0dae6a38
tokenizer.c: Support windows newlines.
8 years ago
Reece H. Dunn
b897ff5aa8
encoding.c: Support calling peekc past the end of the buffer. This makes calling peekc easier.
8 years ago
Reece H. Dunn
3f692f498b
encoding.c: Implement a peekc API.
8 years ago
Reece H. Dunn
1c8ed9c190
tokenizer.c: Support mac newlines.
8 years ago
Reece H. Dunn
7602c9ac18
tokenizer.c: Support linux newlines.
8 years ago
Reece H. Dunn
bce44316bb
Create a basic tokenizer API using a structure that mirrors the TtsTokenizer interface in the tts-dev-studio project.
8 years ago
Reece H. Dunn
3cc53d98f4
Add ucd.h to tokenizer.c to provide the definition of the ucd_category identifier for the emscripten build.
8 years ago
Reece H. Dunn
61d668c0cb
ucd-tools: Inverted_Terminal_Punctuation eSpeakNG extended property support; use in clause_type_from_codepoint.
8 years ago
Reece H. Dunn
5c6bc0e556
Armenian emphasis mark (U+055B) is used for interjections, so treat it as an exclamation mark.
8 years ago
Reece H. Dunn
bc13173ac4
ucd-tools: Punctuation_In_Word eSpeakNG extended property support; use in clause_type_from_codepoint.
8 years ago
Reece H. Dunn
1131d0924b
ucd-tools: Optional_Space_After eSpeakNG extended property support; use in clause_type_from_codepoint.
8 years ago
Reece H. Dunn
b932f3c493
ucd-tools: Extended_Dash eSpeakNG extended property support; use in clause_type_from_codepoint.
8 years ago
Reece H. Dunn
3100ca9d1b
Use ucd_properties to implement clause_type_from_codepoint for supported types.
8 years ago
Reece H. Dunn
1c4ce3dcd3
tokenizer.c: create and use a clause_type_from_codepoint function, with tests.
8 years ago
Reece H. Dunn
92f703d98b
Use defines instead of hard-coded numbers for more clause logic.
8 years ago
Reece H. Dunn
8749891069
Better specify the CLAUSE_ flags returned by ReadClause.
8 years ago
Reece H. Dunn
e4e1e4db0a
TranslateWord: remove the unused add_plural_suffix variable.
8 years ago
Reece H. Dunn
62d4aff9a9
Remove the now unused option_multibyte variable.
8 years ago
Reece H. Dunn
ec8a7b810f
Use the text decoder object at the top-level Synthesize/espeak_TextToPhonemes call, not in TranslateClause.
8 years ago
Reece H. Dunn
b3e0fbc8ed
encoding.c: Create a text_decoder_decode_string_multibyte helper to work with the espeakCHARS_* flags.
8 years ago
Reece H. Dunn
9dabf64680
encoding.c: Support determining the string length for length < 0.
8 years ago
Reece H. Dunn
b5ed1f28a5
encoding.c: Don't crash if NULL is passed as the string to the decode APIs.
8 years ago
Reece H. Dunn
d167d5649b
encoding.c: Implement support for the auto-detected character set (utf-8 + codepoint-encoding).
8 years ago
Reece H. Dunn
be480c12de
Make TranslateClause return 'const void *' to preserve constness.
8 years ago
Reece H. Dunn
6451917bde
encoding.c: Fix text_decoder_get_buffer at EOF.
8 years ago
Reece H. Dunn
7c16ac543c
Use the text decoder API in readclause.c.
8 years ago
Reece H. Dunn
8933185de4
Remove the unused f_in argument to the Read/Translate/SpeakNextClause functions.
8 years ago
Reece H. Dunn
0b0661cef0
Use the encoding.c tables for 8-bit encodings.
1. Store the encoding enumeration values in the Translation
object, instead of the charset table.
2. Use the encoding.c charset table data instead of the ones
in translate.c.
3. Remove the charset language file option -- it is only used
in the Arabic language file, but is used incorrectly there.
4. Specify ISO 8859-6 for the 8-bit encoding for Arabic instead
of UTF-8, so that espeakCHARS_8BIT and espeakCHARS_AUTO work
correctly for Arabic.
8 years ago
Reece H. Dunn
a714c0554b
encoding.c: Use a codepage table to implement ISO-8859-1.
8 years ago