Reece H. Dunn | 128ceaff6a | tokenizer.c: Support question mark tokens. | 8 years ago
Reece H. Dunn | 8f62e18324 | tokenizer.c: Support full stop tokens. | 8 years ago
Reece H. Dunn | d50f3f2fa5 | tokenizer.c: Support word tokens. | 8 years ago
Reece H. Dunn | a902f451d8 | tests/tokenizer.test: Support printing the tokens from a provided file, making it easy to investigate tokenizer issues. | 8 years ago
Reece H. Dunn | d093513b65 | tokenizer.c: Add an options parameter to the tokenizer_reset API. | 8 years ago
Reece H. Dunn | c41ac642fa | tokenizer.c: Tokenise Zp codepoints as paragraphs. | 8 years ago
Reece H. Dunn | f3ea6f68f3 | tokenizer.c: Tokenise U+000B [VERTICAL TAB (VT)] as whitespace, not as newlines. | 8 years ago
Reece H. Dunn | fc7a4e6701 | tokenizer.c: Recognise U+000C [FORM FEED (FF)] as a newline codepoint. | 8 years ago
Reece H. Dunn | d2d718d700 | tokenizer.c: Tokenize line separator codepoints as newline tokens. | 8 years ago
Reece H. Dunn | bf45e7ce36 | tokenizer.c: Recognise U+0085 [NEW LINE (NEL)] as a newline codepoint. | 8 years ago
Reece H. Dunn | df6ca7a22c | tokenizer.c: Support whitespace tokens. | 8 years ago
Reece H. Dunn | 8f0dae6a38 | tokenizer.c: Support windows newlines. | 8 years ago
Reece H. Dunn | 1c8ed9c190 | tokenizer.c: Support mac newlines. | 8 years ago
Reece H. Dunn | 7602c9ac18 | tokenizer.c: Support linux newlines. | 8 years ago
Reece H. Dunn | bce44316bb | Create a basic tokenizer API using a structure that mirrors the TtsTokenizer interface in the tts-dev-studio project. | 8 years ago
Reece H. Dunn | 5c6bc0e556 | Armenian emphasis mark (U+055B) is used for interjections, so treat it as an exclamation mark. | 8 years ago
Reece H. Dunn | 1c4ce3dcd3 | tokenizer.c: create and use a clause_type_from_codepoint function, with tests. | 8 years ago
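
The newline-related commits above describe how individual codepoints are classified: LF, CR, CR LF, U+0085, U+000C and the line separator produce newline tokens, the paragraph separator produces a paragraph token, and U+000B is whitespace. The following is a minimal sketch of that classification, not the code from tokenizer.c. Every name in it (`sketch_token_type`, `sketch_classify`, `sketch_next_token`, `sketch_self_check`) is hypothetical, it works on an already decoded array of codepoints, and it lumps words and punctuation into a single "other" kind with one codepoint per token. Treating CR LF as one newline token is an assumption about what "windows newlines" means in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token kinds; the real tokenizer.c may name these differently. */
typedef enum {
    SKETCH_TOKEN_END,
    SKETCH_TOKEN_NEWLINE,
    SKETCH_TOKEN_PARAGRAPH,
    SKETCH_TOKEN_WHITESPACE,
    SKETCH_TOKEN_OTHER /* words and punctuation, not modelled here */
} sketch_token_type;

/* Classify a single codepoint following the commit messages above. */
static sketch_token_type sketch_classify(uint32_t c)
{
    switch (c) {
    case 0x000A: /* LF, "linux newlines" */
    case 0x000C: /* FF */
    case 0x000D: /* CR, "mac newlines" */
    case 0x0085: /* NEL */
    case 0x2028: /* LINE SEPARATOR, the only Zl codepoint */
        return SKETCH_TOKEN_NEWLINE;
    case 0x2029: /* PARAGRAPH SEPARATOR, the only Zp codepoint */
        return SKETCH_TOKEN_PARAGRAPH;
    case 0x0009: /* TAB */
    case 0x000B: /* VT is whitespace, not a newline */
    case 0x0020: /* SPACE */
        return SKETCH_TOKEN_WHITESPACE;
    default:
        return SKETCH_TOKEN_OTHER;
    }
}

/* Read one token starting at *pos and advance *pos past it.
 * Assumption: CR immediately followed by LF is consumed as one newline token. */
static sketch_token_type sketch_next_token(const uint32_t *cp, size_t len, size_t *pos)
{
    if (*pos >= len)
        return SKETCH_TOKEN_END;
    uint32_t c = cp[(*pos)++];
    if (c == 0x000D && *pos < len && cp[*pos] == 0x000A)
        ++*pos; /* swallow the LF of a CR LF pair */
    return sketch_classify(c);
}

/* Small usage example: "a", CR LF, "b" yields other, newline, other, end. */
void sketch_self_check(void)
{
    const uint32_t text[] = { 'a', 0x000D, 0x000A, 'b' };
    size_t pos = 0;
    assert(sketch_next_token(text, 4, &pos) == SKETCH_TOKEN_OTHER);
    assert(sketch_next_token(text, 4, &pos) == SKETCH_TOKEN_NEWLINE);
    assert(sketch_next_token(text, 4, &pos) == SKETCH_TOKEN_OTHER);
    assert(sketch_next_token(text, 4, &pos) == SKETCH_TOKEN_END);
}
```

Keeping the per-codepoint classification separate from the CR LF lookahead means each commit in the list maps to one `case` label, which is roughly how the history reads: one codepoint rule added at a time.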