Reece H. Dunn | dd90d3812d | tokenizer.c: Support general symbol tokens. | 8 years ago
Reece H. Dunn | 786575c6ed | tokenizer.c: Support general punctuation tokens. | 8 years ago
Reece H. Dunn | 0705844bf8 | tokenizer.c: Move general category classification that does not override property behaviour to the end, for generic classification. | 8 years ago
Reece H. Dunn | 683579f403 | Make the tokenizer.h API public. | 8 years ago
Reece H. Dunn | 9af96da469 | Make the encoding.h API public. | 8 years ago
Reece H. Dunn | 55bfbb4754 | tokenizer.c: Support ellipsis tokens. | 8 years ago
Reece H. Dunn | b847df63b5 | tokenizer.c: Support semicolon tokens. | 8 years ago
Reece H. Dunn | af7e8fc5a3 | tokenizer.c: Support colon tokens. | 8 years ago
Reece H. Dunn | 7560070dcd | tokenizer.c: Support comma tokens. | 8 years ago
Reece H. Dunn | c9199cfacb | tokenizer.c: Support exclamation mark tokens. | 8 years ago
Reece H. Dunn | 128ceaff6a | tokenizer.c: Support question mark tokens. | 8 years ago
Reece H. Dunn | 8f62e18324 | tokenizer.c: Support full stop tokens. | 8 years ago
Reece H. Dunn | d50f3f2fa5 | tokenizer.c: Support word tokens. | 8 years ago
Reece H. Dunn | d093513b65 | tokenizer.c: Add an options parameter to the tokenizer_reset API. | 8 years ago
Reece H. Dunn | c41ac642fa | tokenizer.c: Tokenise Zp codepoints as paragraphs. | 8 years ago
Reece H. Dunn | fc7a4e6701 | tokenizer.c: Recognise U+000C [FORM FEED (FF)] as a newline codepoint. | 8 years ago
Reece H. Dunn | d2d718d700 | tokenizer.c: Tokenize line separator codepoints as newline tokens. | 8 years ago
Reece H. Dunn | bf45e7ce36 | tokenizer.c: Recognise U+0085 [NEW LINE (NEL)] as a newline codepoint. | 8 years ago
Reece H. Dunn | df6ca7a22c | tokenizer.c: Support whitespace tokens. | 8 years ago
Reece H. Dunn | 539edac795 | tokenizer.c: Create a codepoint_type helper function to classify codepoints for the tokenizer. | 8 years ago
Reece H. Dunn | 8f0dae6a38 | tokenizer.c: Support windows newlines. | 8 years ago
Reece H. Dunn | 1c8ed9c190 | tokenizer.c: Support mac newlines. | 8 years ago
Reece H. Dunn | 7602c9ac18 | tokenizer.c: Support linux newlines. | 8 years ago
Reece H. Dunn | bce44316bb | Create a basic tokenizer API using a structure that mirrors the TtsTokenizer interface in the tts-dev-studio project. | 8 years ago
Reece H. Dunn | 3cc53d98f4 | Add ucd.h to tokenizer.c to provide the definition of the ucd_category identifier for the emscripten build. | 8 years ago
Reece H. Dunn | 61d668c0cb | ucd-tools: Inverted_Terminal_Punctuation eSpeakNG extended property support; use in clause_type_from_codepoint. | 8 years ago
Reece H. Dunn | 5c6bc0e556 | Armenian emphasis mark (U+055B) is used for interjections, so treat it as an exclamation mark. | 8 years ago
Reece H. Dunn | bc13173ac4 | ucd-tools: Punctuation_In_Word eSpeakNG extended property support; use in clause_type_from_codepoint. | 8 years ago
Reece H. Dunn | 1131d0924b | ucd-tools: Optional_Space_After eSpeakNG extended property support; use in clause_type_from_codepoint. | 8 years ago
Reece H. Dunn | b932f3c493 | ucd-tools: Extended_Dash eSpeakNG extended property support; use in clause_type_from_codepoint. | 8 years ago
Reece H. Dunn | 3100ca9d1b | Use ucd_properties to implement clause_type_from_codepoint for supported types. | 8 years ago
Reece H. Dunn | 1c4ce3dcd3 | tokenizer.c: create and use a clause_type_from_codepoint function, with tests. | 8 years ago
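
Note: the newline-related commits above (linux, mac and windows newlines, U+000C, U+0085, line separator, Zp) describe which codepoints end a line or a paragraph. The following is a minimal sketch of that classification, not the code in tokenizer.c. The names `token_kind` and `next_break_token` are hypothetical and exist only for this illustration, and the sketch assumes the input has already been decoded into an array of codepoints.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token kinds for this sketch; not the names used in tokenizer.c. */
typedef enum {
    TOKEN_END,
    TOKEN_NEWLINE,
    TOKEN_PARAGRAPH,
    TOKEN_OTHER
} token_kind;

/* Classify the codepoint at cps[*pos] and advance *pos past it.
 * A CR LF pair is consumed as a single newline token. */
static token_kind next_break_token(const uint32_t *cps, size_t len, size_t *pos)
{
    if (*pos >= len)
        return TOKEN_END;

    uint32_t c = cps[(*pos)++];
    switch (c) {
    case 0x000D: /* CARRIAGE RETURN: mac newline, or first half of a windows CR LF */
        if (*pos < len && cps[*pos] == 0x000A)
            (*pos)++; /* swallow the LF so CR LF yields one token */
        return TOKEN_NEWLINE;
    case 0x000A: /* LINE FEED: linux newline */
    case 0x000C: /* FORM FEED (FF) */
    case 0x0085: /* NEL */
    case 0x2028: /* LINE SEPARATOR, general category Zl */
        return TOKEN_NEWLINE;
    case 0x2029: /* PARAGRAPH SEPARATOR, general category Zp */
        return TOKEN_PARAGRAPH;
    default:
        return TOKEN_OTHER; /* words, punctuation, etc. are out of scope here */
    }
}
```

Handling CR first lets the same branch cover both a lone CR and a CR LF pair, so a windows line ending is never reported as two newlines.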