MatchRule: Prevent non-eating special characters from eating characters
Special characters such as N, S1, etc. do not actually eat (consume)
characters. Handling them should thus *not* update pre_ptr and post_ptr;
otherwise those would underflow/overflow. For example, matching the rule
@) s (_NS1 [z]
would overflow. This is for instance noticeable with the memory sanitizer:
ESPEAK_DATA_PATH=$PWD ./src/espeak-ng -qX "capitals"
Translate 'capitals'
1 c [k]
1 a [a]
1 p [p]
1 i [I]
1 t [t]
1 a [a]
1 l [l]
20 l (C [l]
==2837201==WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x7f7f4422744b in utf8_in2 /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:281:2
#1 0x7f7f442281bc in utf8_in /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:332:9
#2 0x7f7f440e0d31 in MatchRule /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/dictionary.c:1767:21
#3 0x7f7f440d937f in TranslateRules /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/dictionary.c:2320:6
#4 0x7f7f44230e5f in TranslateWord3 /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:733:15
#5 0x7f7f44229844 in TranslateWord /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:1100:14
#6 0x7f7f44256e50 in TranslateWord2 /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:1361:11
#7 0x7f7f4424d6cc in TranslateClause /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:2623:17
#8 0x7f7f44213359 in SpeakNextClause /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/synthesize.c:1569:2
#9 0x7f7f441a9f56 in Synthesize /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/speech.c:457:2
#10 0x7f7f441a9023 in sync_espeak_Synth /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/speech.c:570:29
#11 0x7f7f441ad59f in espeak_ng_Synthesize /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/speech.c:678:10
#12 0x7f7f4410b3f4 in espeak_Synth /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/espeak_api.c:90:32
#13 0x4a8be3 in main /home/samy/brl/speech/espeak-ng-git/src/espeak-ng.c:691:3
#14 0x7f7f43a2e7fc in __libc_start_main csu/../csu/libc-start.c:332:16
#15 0x421449 in _start (/home/samy/ens/projet/1/speech/espeak-ng-git/src/.libs/espeak-ng+0x421449)
Uninitialized value was created by an allocation of 'sbuf' in the stack frame of function 'TranslateClause'
#0 0x7f7f4423a1f0 in TranslateClause /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:1941
While trying to match _NS1, MatchRule overflows the buffer.
This usually did not cause problems, because rules usually have these
non-eating special characters last in the rule, so it did not matter
that post_ptr pointed outside the valid text.
Some rules test that a character is not of a certain type. Such a test may
match the \0 beginning-of-text marker, and thus actually step over
it and let MatchRule continue with the uninitialized data before it, leading
to potentially random behavior.
This commit fixes it by making sure that we don't read before that \0.
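A minimal sketch of the idea, not the actual MatchRule code: a backward scan
for "not a vowel" that treats the leading \0 as a hard stop instead of as a
matching character. The buffer layout (a \0 at index 0, text from index 1) and
the names is_vowel and count_non_vowels_before are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true for ASCII vowels. The explicit '\0' check is
 * needed because strchr() also finds the terminator of its first argument. */
static int is_vowel(char c)
{
    return c != '\0' && strchr("aeiou", c) != NULL;
}

/* Count the consecutive non-vowels just before pos.
 * Assumed layout: the text is headed by a '\0' marker, so the scan stops
 * on it rather than stepping over it into whatever lies before the text. */
static size_t count_non_vowels_before(const char *pos)
{
    assert(pos != NULL);
    const char *pre_ptr = pos - 1;
    size_t n = 0;

    while (*pre_ptr != '\0' && !is_vowel(*pre_ptr)) {
        n++;
        pre_ptr--;
    }
    return n;
}
```

Without the `*pre_ptr != '\0'` test, the marker itself counts as "not a
vowel" and the loop keeps walking into the bytes before the text.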
LookupDict2 looks forward in the wtab array, but it should still stop at the
end of the array. Otherwise the memory sanitizer reports this:
testing en A. B C, D. E: F.
==65960==WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x7ff9d7ef0de8 in LookupDict2 /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/dictionary.c:2676:11
#1 0x7ff9d7eec2ec in LookupDictList /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/dictionary.c:2899:10
#2 0x7ff9d802860a in TranslateWord3 /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:588:12
#3 0x7ff9d80249d4 in TranslateWord /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:1100:14
#4 0x7ff9d8051fe0 in TranslateWord2 /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:1361:11
#5 0x7ff9d804885c in TranslateClause /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:2623:17
#6 0x7ff9d800e4e9 in SpeakNextClause /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/synthesize.c:1569:2
#7 0x7ff9d7fa50e6 in Synthesize /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/speech.c:457:2
#8 0x7ff9d7fa41b3 in sync_espeak_Synth /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/speech.c:570:29
#9 0x7ff9d7fa872f in espeak_ng_Synthesize /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/speech.c:678:10
#10 0x7ff9d7f06584 in espeak_Synth /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/espeak_api.c:90:32
#11 0x4a8be3 in main /home/samy/brl/speech/espeak-ng-git/src/espeak-ng.c:691:3
#12 0x7ff9d78297fc in __libc_start_main csu/../csu/libc-start.c:332:16
#13 0x421449 in _start (/home/samy/ens/projet/1/speech/espeak-ng-git/src/.libs/espeak-ng+0x421449)
Uninitialized value was created by an allocation of 'words' in the stack frame of function 'TranslateClause'
#0 0x7ff9d8035380 in TranslateClause /home/samy/brl/speech/espeak-ng-git/src/libespeak-ng/translate.c:1941
Strictly speaking, we are not supposed to use memcmp to compare strings,
since we are not supposed to read beyond \0, which memcmp may
do. Sanitizers would warn about it, and strncmp provides
the proper semantics without being noticeably slower, so better
just use it.
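As an illustration only (key_matches is a hypothetical helper, not a function
from the code base), a prefix comparison written with strncmp stops at the
first \0 in either string, whereas memcmp with the same length is allowed to
read all key_len bytes of both arguments:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first key_len bytes of word equal key.
 * strncmp() stops comparing at a '\0', so a word shorter than key_len
 * is never read past its terminator. */
static int key_matches(const char *word, const char *key, size_t key_len)
{
    assert(word != NULL && key != NULL);
    return strncmp(word, key, key_len) == 0;
}
```

With word "cat" and key "catalog", strncmp returns a mismatch after reading
the terminator of "cat"; memcmp(word, key, 7) would be entitled to read three
more bytes beyond it.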
IsLetterGroup: Do not blindly walk back in the word
strlen(p) may be arbitrarily long, so walking back by that amount would
underflow the word, for instance:
testing fr Latn
=================================================================
==3741805==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd733c1329 at pc 0x7ff5ffbad2de bp 0x7ffd733bf310 sp 0x7ffd733bf308
READ of size 1 at 0x7ffd733c1329 thread T0
#0 0x7ff5ffbad2dd in IsLetterGroup src/libespeak-ng/dictionary.c:714
#1 0x7ff5ffbbe425 in MatchRule src/libespeak-ng/dictionary.c:1979
#2 0x7ff5ffbc09e9 in TranslateRules src/libespeak-ng/dictionary.c:2301
#3 0x7ff5ffc26656 in TranslateWord3 src/libespeak-ng/translate.c:733
#4 0x7ff5ffc2a10b in TranslateWord src/libespeak-ng/translate.c:1100
#5 0x7ff5ffc2bef2 in TranslateWord2 src/libespeak-ng/translate.c:1361
#6 0x7ff5ffc374e2 in TranslateClause src/libespeak-ng/translate.c:2623
#7 0x7ff5ffc1d010 in SpeakNextClause src/libespeak-ng/synthesize.c:1569
#8 0x7ff5ffbfbd46 in Synthesize src/libespeak-ng/speech.c:492
#9 0x7ff5ffbfd52a in sync_espeak_Synth src/libespeak-ng/speech.c:570
#10 0x7ff5ffbfdd1f in espeak_ng_Synthesize src/libespeak-ng/speech.c:678
#11 0x7ff5ffbc72fd in espeak_Synth src/libespeak-ng/espeak_api.c:90
#12 0x5627511a3137 in main src/espeak-ng.c:691
#13 0x7ff5fee557fc in __libc_start_main ../csu/libc-start.c:332
#14 0x5627511a0569 in _start (/home/samy/ens/projet/1/speech/espeak-ng-git/src/.libs/espeak-ng+0x6569)
Address 0x7ffd733c1329 is located in stack of thread T0 at offset 1177 in frame
#0 0x7ff5ffc2f760 in TranslateClause src/libespeak-ng/translate.c:1941
This frame has 16 object(s):
[48, 52) 'cc' (line 1944)
[64, 68) 'source_index' (line 1945)
[80, 84) 'prev_in' (line 1948)
[96, 100) 'prev_out' (line 1949)
[112, 116) 'next_in' (line 1952)
[128, 132) 'char_inserted' (line 1954)
[144, 148) 'word_flags' (line 1963)
[160, 164) 'charix_top' (line 1975)
[176, 180) 'tone' (line 1985)
[192, 196) 'next2_in' (line 2294)
[208, 212) 'c_temp' (line 2518)
[224, 374) 'number_buf' (line 2522)
[448, 1048) 'num_wtab' (line 2523)
[1184, 1984) 'sbuf' (line 1982) <== Memory access at offset 1177 underflows this variable
[2112, 3720) 'charix' (line 1977)
[3856, 7456) 'words' (line 1978)
sbuf is however properly '\0'-headed, so we can make IsLetterGroup
carefully walk back in the word and report a mismatch if it walks back
too far.
Fixes #1108
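A rough sketch of such a careful walk, under the assumption stated above that
the buffer is headed by a \0. The function name matches_before and its
signature are hypothetical; the real IsLetterGroup handles letter groups and
is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the bytes immediately before word equal the string p.
 * Instead of jumping back strlen(p) bytes in one go, compare one byte at a
 * time from the end of p and give up as soon as the '\0' header is reached. */
static int matches_before(const char *word, const char *p)
{
    assert(word != NULL && p != NULL);
    size_t len = strlen(p);
    const char *w = word;

    while (len > 0) {
        w--;
        if (*w == '\0')
            return 0;   /* reached the header: p is longer than the text */
        if (*w != p[len - 1])
            return 0;   /* ordinary mismatch */
        len--;
    }
    return 1;
}
```

The walk never reads more than one byte before the start of the text, however
long p is.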
pre_ptr is already one byte before the current letter, so we do not want
to subtract 1 again. Otherwise this would for instance underflow word_iz
of addPluralSuffixes.
Use ESPEAKNG_DEFAULT_VOICE instead of hard coded "en".
This will make it easier to set a default voice other than
English. This is important for cases when a language will fall back to
the default voice.
Some references to L('e', 'n') still need to be changed.
Code cleanup: remove param2 from langopts and rename keyword option in language files.
- param2[] is only used to set a second value to LOPT_BRACKET_PAUSE. It is simpler
to have two values in param[] instead. This simplifies the codebase.
- Instead of setting "option bracket X Y" in language files, use
keywords "brackets X" and "bracketsAnnounced Y" instead to follow the
naming convention of other keywords.
- Add missing documentation to docs/voices.md.
code cleanup: Check all local includes with include-what-you-use
Going through files in src/libespeak-ng/, include-what-you-use removed a
few unnecessary includes and included explanations on why a certain
header should be included. This makes tracking globals and dependencies easier.
Running the codebase through IWYU should be repeated after each major
code restructuring. Includes of the standard C library weren't checked, to
avoid breaking builds on other platforms.
See https://github.com/include-what-you-use/include-what-you-use
The replacement tests for bs, hr, and sr are no longer marked as
broken as they work using the old code. The mk tests keep the
broken annotation, as they don't work in the old code either.
This reverts commit 801a8d197c.
This reverts commit 64d5701e5e.
This reverts commit 3b51ebf617.
This reverts commit 1fd235d2c0.
This reverts commit 9f0667de86.
LookupDict2: Fix searching entries longer than 128
This is a fix for https://github.com/nvaccess/nvda/issues/7740.
With the addition of emoji support, dictionary entries can now be
longer than 128 bytes. This fix makes sure the character is
interpreted as an unsigned byte so it does not treat long entries
as having a negative offset.
Treating the offset as a signed byte (as the previous code did)
could cause the hash chain search to loop indefinitely when
processing certain input, like the Tamil characters in the NVDA
issue noted above, which are added as a test case to translate.test.
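A simplified sketch of the signedness issue. The chain layout assumed here
(each entry starts with a byte holding its own length, and a length byte of 0
ends the chain) and the name count_entries are illustrative assumptions, not
the exact LookupDict2 code.

```c
#include <assert.h>
#include <stddef.h>

/* Count the entries in a chain, assuming each entry begins with its own
 * length in bytes and a 0 length byte terminates the chain. */
static size_t count_entries(const char *chain)
{
    assert(chain != NULL);
    size_t n = 0;
    const char *p = chain;

    for (;;) {
        /* Read the length as unsigned: on platforms where plain char is
         * signed, a byte of 200 would otherwise be seen as -56 and p would
         * move backwards instead of to the next entry. */
        unsigned char len = (unsigned char)*p;
        if (len == 0)
            break;
        n++;
        p += len;
    }
    return n;
}
```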
This is a similar change to b60d2452c3.
In this case, it is when tr->dictionary_name is passed as the name
parameter in LoadDictionary.
This happens in the SetTranslator2 function when loading the
dictionary for the second language translator object.
Copy name in LoadDictionary if not dictionary_name
compiledict.c sets dict_name to dictionary_name if dict_name is
not set, and passes that to LoadDictionary. LoadDictionary then
copies the passed in name to dictionary_name.
This causes -fsanitize=address to fail with overlapping memory
addresses passed to strncpy (copying the string to itself). As
such, don't copy the name in this case.
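A minimal sketch of that guard. The struct translator layout, the buffer size
DICT_NAME_SIZE and the helper name set_dictionary_name are assumptions made
for illustration; only the pointer comparison reflects the fix described
above.

```c
#include <assert.h>
#include <string.h>

#define DICT_NAME_SIZE 40   /* hypothetical buffer size */

struct translator {
    char dictionary_name[DICT_NAME_SIZE];
};

/* Store name in tr->dictionary_name, unless the caller passed that very
 * buffer: strncpy() has undefined behaviour on overlapping objects. */
static void set_dictionary_name(struct translator *tr, const char *name)
{
    assert(tr != NULL && name != NULL);
    if (name != tr->dictionary_name) {
        strncpy(tr->dictionary_name, name, sizeof(tr->dictionary_name) - 1);
        tr->dictionary_name[sizeof(tr->dictionary_name) - 1] = '\0';
    }
}
```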
synthesize.h now contains the definitions STRESS_IS_... that should be used with code related to syllable stress.
Note that isBreak and other defines were renumbered so that stress definitions could have values 0-6.
Possible TODOs:
1. Unify with terms used with phonemes, i.e. keywords like isDiminished in compiledata.c and stress_type in phsource/phonemes
2. Add functionality and documentation about STRESS_IS_PRIORITY and STRESS_IS_EMPHASIZED
fi: fix behaviour of S_2_TO_HEAVY (adding secondary stress)
Stress flag S_2_TO_HEAVY is currently only used by Finnish.
Current behaviour skips adding secondary stress if the following syllable is heavy. The behaviour should be to skip adding secondary stress if the rest of the word (excluding last syllable) contains a heavy syllable.
Source of grammar rule and examples of expected behaviour: http://scripta.kotus.fi/visk/sisallys.php?p=13
This replaces uses of:
memcpy(dst, src, strlen(src))
with:
strcpy(dst, src)
This fixes issues with reading past the end of the copied buffer
(e.g. when processing word-based replacements for emoji characters)
by ensuring that the destination buffer is null terminated.
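The difference can be seen in a small sketch (the two helper names are
hypothetical). Both variants assume dst is large enough for src plus its
terminator; strcpy does not check that.

```c
#include <assert.h>
#include <string.h>

/* Old pattern: copies the characters of src but not its terminating '\0',
 * so whatever was in dst after them stays part of the string. */
static void copy_unterminated(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, strlen(src));
}

/* New pattern: strcpy() also copies the terminating '\0'. */
static void copy_terminated(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    strcpy(dst, src);
}
```

If dst previously held "XXXXXXX", copying "hi" with the first variant leaves
"hiXXXXX", while the second leaves "hi".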
Reported by Michael Curran
This was identified by the clang static analyser. The letter
variable is set in the various match_type switch cases, so
does not need to be initialised in the start of the while loop.
Clang static analysis reports a 'Dereference of null pointer'
error when accessing wtab->flags. This is properly guarded
against when setting the wflags variable, so use that variable
instead.
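A sketch of the pattern, with a hypothetical struct word_tab, flag value and
function name: the flags are read once behind the NULL check, and later code
only uses the local copy.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_FIRST_UPPER 0x2u   /* hypothetical flag value */

struct word_tab {
    unsigned int flags;
};

/* Returns 1 if the word is flagged as starting with an upper-case letter.
 * wtab may be NULL; wflags then stays 0 and wtab is never dereferenced. */
static int starts_upper(const struct word_tab *wtab)
{
    unsigned int wflags = 0;

    if (wtab != NULL)
        wflags = wtab->flags;

    /* Use the guarded copy here, not wtab->flags. */
    return (wflags & FLAG_FIRST_UPPER) != 0;
}
```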