* Add: fuzzer files and modifications in config & compil
* add configure.ac change
* add minimize-corpus.sh
* add fuzzing directory and readme
* add check for whether CC supports libfuzzer
* Make workflow dump the crash POC
* Add debugging information
* Run fuzzing only once a week for now
Co-authored-by: kmamadoudram <[email protected]>
Co-authored-by: yocvito <[email protected]>
Co-authored-by: Samuel Thibault <[email protected]>
This commit implements support for [Totontepec Mixe](https://en.wikipedia.org/wiki/Totontepec_Mixe). The Espeak rules are based on the phonological inventory, orthographic mappings, and phonetic processes described in the "Esbozo fonológico" (phonological outline/sketch) chapter of Verónica Guzmán Guzmán's 2012 master's thesis in Indo American Linguistics awarded by the [Centro de Investigaciones y Estudios Superiores en Antropología Social](https://ciesas.edu.mx/) and *Vocabulario Mixe de Totontepec* (Totontepec Mixe vocabulary), compiled by Alvin Schoenhals and Louise C. Schoenhals and published by the Summer Institute of Linguistics in 1965.
This commit was developed as part of a project for [Computational Linguistics](https://jnw.domains.swarthmore.edu/ling073/syllabus.php) at [Swarthmore College](https://swarthmore.edu). We feel that this language is suitable for merge with "testing" status, but further verification/improvements by native speakers would be very helpful.
Co-authored-by: Elizabeth Resendiz <[email protected]>
And fill the last phlist prepause and newword fields, otherwise they are
detected as undefined:
==483407== Conditional jump or move depends on uninitialised value(s)
==483407== at 0x488E6AB: Generate (synthesize.c:1228)
==483407== by 0x488FD94: SpeakNextClause (synthesize.c:1587)
==483407== by 0x4887F56: Synthesize (speech.c:457)
==483407== by 0x488884C: sync_espeak_Synth (speech.c:570)
==483407== by 0x487B270: espeak_Synth (espeak_api.c:90)
==483407== by 0x10ACA0: main (espeak-ng.c:691)
==483407== Uninitialised value was created by a client request
==483407== at 0x4884893: MakePhonemeList (phonemelist.c:155)
==483407== by 0x4895712: TranslateClause (translate.c:2682)
==483407== by 0x488FCCF: SpeakNextClause (synthesize.c:1569)
==483407== by 0x4887F56: Synthesize (speech.c:457)
==483407== by 0x488884C: sync_espeak_Synth (speech.c:570)
==483407== by 0x487B270: espeak_Synth (espeak_api.c:90)
==483407== by 0x10ACA0: main (espeak-ng.c:691)
==483407==
==483407== Conditional jump or move depends on uninitialised value(s)
==483407== at 0x488E622: Generate (synthesize.c:1211)
==483407== by 0x488FD94: SpeakNextClause (synthesize.c:1587)
==483407== by 0x4887F56: Synthesize (speech.c:457)
==483407== by 0x488884C: sync_espeak_Synth (speech.c:570)
==483407== by 0x487B270: espeak_Synth (espeak_api.c:90)
==483407== by 0x10ACA0: main (espeak-ng.c:691)
==483407== Uninitialised value was created by a client request
==483407== at 0x4884893: MakePhonemeList (phonemelist.c:155)
==483407== by 0x4895712: TranslateClause (translate.c:2682)
==483407== by 0x488FCCF: SpeakNextClause (synthesize.c:1569)
==483407== by 0x4887F56: Synthesize (speech.c:457)
==483407== by 0x488884C: sync_espeak_Synth (speech.c:570)
==483407== by 0x487B270: espeak_Synth (espeak_api.c:90)
==483407== by 0x10ACA0: main (espeak-ng.c:691)
==483407==
This is changing the ssml.test output, but with no audible difference,
so this is probably a real fix for it.
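A minimal sketch of the kind of fix described above, under stated assumptions: `demo_phoneme`, `phcode` and `demo_terminate_list` are hypothetical names, and only the `prepause` and `newword` field names come from this message. The idea is that the terminating list entry gets a defined value in every field a later reader might inspect.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a phoneme list entry; only the field names
 * prepause and newword are taken from the commit message. */
typedef struct {
    int phcode;
    unsigned char prepause;
    unsigned char newword;
} demo_phoneme;

/* Write the terminating entry at index n. memset clears the whole entry
 * first, then the fields of interest are set explicitly so that no later
 * read of them depends on uninitialised memory. */
void demo_terminate_list(demo_phoneme *list, size_t n)
{
    memset(&list[n], 0, sizeof list[n]);
    list[n].phcode = -1;   /* hypothetical end-of-list marker */
    list[n].prepause = 0;
    list[n].newword = 0;
}
```

The caller must make sure index n is inside the array; the sketch does not check this.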
When pollint() returns 100.0, multiplying by 2.55 doesn't actually seem to
yield 255 on i386. Multiplying by 255 and dividing by 100, however,
does (probably because floating-point computations on small integer values
are guaranteed to give exact integer results).
Fixes #1151
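The difference between the two formulas can be sketched as follows. The function names are hypothetical, and the truncating cast to int is an assumption about how the result is used, not a copy of the actual code.

```c
/* Scale a percentage in [0, 100] to [0, 255] with the constant 2.55.
 * 2.55 has no exact binary representation, so the product for 100.0 can
 * land just below 255 and truncate to 254, depending on how the platform
 * evaluates floating-point expressions. */
int demo_scale_inexact(double percent)
{
    return (int)(percent * 2.55);
}

/* Same scaling with integer-valued constants. For percent == 100.0 the
 * intermediate 25500 and the quotient 255 are both exactly representable,
 * so no rounding error can occur. */
int demo_scale_exact(double percent)
{
    return (int)(percent * 255 / 100);
}
```

With this sketch, demo_scale_exact(100.0) is 255 on any IEEE 754 platform, while the result of demo_scale_inexact(100.0) depends on the evaluation precision.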
When the current locale doesn't match the current voice, grep would be
surprised by the produced output and believe that this is not text, for
instance with LC_ALL=ru_RU.CP1251 we get:
TEST tests/language-replace.test
[...]
testing mk
grep: (standard input): binary file matches
2d1
< Translate 'пејзаж'
But we can give -a to grep so it always considers its input as text.
CheckThousandsGroup: Avoid reading uninitialized data
For the case when word is smaller than 4 characters, we should not look at
the 3rd or 4th character before checking the previous ones, otherwise
we'd at best read uninitialized data, at worst non-existing data.
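A minimal sketch of the ordering argument, assuming a NUL-terminated word and a hypothetical helper name (this is not the actual CheckThousandsGroup code): each character is examined only after all earlier ones have passed, so the check never reads past the terminator.

```c
#include <ctype.h>

/* Hypothetical check: does the NUL-terminated word start with exactly
 * three digits followed by a non-digit?  The characters are tested
 * strictly left to right and && stops at the first failure, so word[k]
 * is read only after word[0..k-1] were all digits (and therefore not
 * the terminator). */
int demo_is_three_digit_group(const char *word)
{
    return isdigit((unsigned char)word[0])
        && isdigit((unsigned char)word[1])
        && isdigit((unsigned char)word[2])
        && !isdigit((unsigned char)word[3]);
}
```

For a two-character word such as "12", evaluation stops at word[2] (the terminator) and word[3] is never touched.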
TranslateWord2 uses phonemes in ph_list2. Apart from the breakable loops, it
may statically require up to 7 phonemes. Then TranslateClause always
uses 2 phonemes. We thus have to keep these margins along the loops to
avoid any overflow.
Fixes #1073, #1095
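The margin idea can be sketched like this. The buffer size and all names are hypothetical; only the figures 7 and 2 come from the message above. The breakable loop stops early enough that the fixed-size writes that follow always fit.

```c
#include <stddef.h>

#define DEMO_LIST_SIZE   16  /* hypothetical buffer capacity */
#define DEMO_WORD_MARGIN  7  /* fixed worst case per word */
#define DEMO_TAIL_MARGIN  2  /* always appended at the end of the clause */

/* Append up to `wanted` entries in a breakable loop, then the fixed-size
 * parts. The loop limit reserves room for both margins, so the later
 * writes cannot run past the buffer.
 * Precondition: used <= DEMO_LIST_SIZE - both margins on entry.
 * Returns the new fill level. */
size_t demo_fill(int *list, size_t used, size_t wanted)
{
    size_t limit = DEMO_LIST_SIZE - DEMO_WORD_MARGIN - DEMO_TAIL_MARGIN;
    size_t i;

    for (i = 0; i < wanted && used < limit; i++)
        list[used++] = (int)i;      /* breakable part */

    for (i = 0; i < DEMO_WORD_MARGIN; i++)
        list[used++] = -1;          /* fixed per-word worst case */

    for (i = 0; i < DEMO_TAIL_MARGIN; i++)
        list[used++] = -2;          /* fixed clause tail */

    return used;
}
```

Even when far more entries are requested than fit, the fill level ends at exactly DEMO_LIST_SIZE and never beyond it.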
According to Appendix E of The Lord of the Rings, ⟨k⟩ is used with the
same value as ⟨c⟩ in names from non-Elvish languages (both representing
/k/). However, in the Silmarillion, ⟨k⟩ is also used in some Elvish
names, such as Tulkas and Kementári, as well as in some words in the
Appendix (Elements in Quenya and Sindarin Names), e.g. kir- as an
element or root in Calacirya, Cirth, and other words. And in earlier
versions of the language (when Quenya was called Qenya and Sindarin
Gnomish), ⟨k⟩ also often occurs. Therefore, let’s support it as an
alternative spelling of ⟨c⟩.
Currently, eSpeak NG doesn’t seem to do the two-step replacement of
⟨kh⟩→⟨ch⟩→⟨x⟩, which means that ⟨kh⟩ is ultimately pronounced as /kh/
(or /kʰ/?) rather than [χ]; according to Appendix E, this is correct in
Dwarvish, while in Orkish and Adûnaic ⟨kh⟩ should be equivalent to ⟨ch⟩.
Since we’re not really aiming for pronouncing any of these languages,
either way is fine.
Consonants written twice always represent long consonants, not actual
repetition. eSpeak NG’s default behavior when speaking a doubled
consonant phoneme seems to work well enough for non-plosive consonants,
but for plosives, we need to tell it that the two input characters
correspond to one long phoneme, not a repeated regular one.
All three doubled voiceless plosives – ⟨tt⟩, ⟨pp⟩, ⟨cc⟩ – are regularly
found in Quenya, according to the Ambar Eldaron Quenya Dictionary [1].
Their voiced counterparts – ⟨dd⟩, ⟨bb⟩, ⟨gg⟩ – apparently don’t occur,
nor are any doubled plosives to be found in the Omikhleia Sindarin
Dictionary [2], voiced or not. But let’s define all six pairs in both
languages anyways, since it doesn’t cost us much to do so, and it seems
fairly clear that this is how these double consonants should be
pronounced, if they ever occurred.
[1]: https://ambar-eldaron.com/telechargements/quenya-engl-A4.pdf
[2]: https://www.jrrvf.com/hisweloke/sindar/index.html
⟨o⟩ almost certainly represents [ɔ] – Appendix E of The Lord of the
Rings describes it as the sound in English “for”. This means we should
use a phoneme [[O]], not [[o]]; we should also create our own phoneme
for this, since the one we inherit from Latin sounds much more like [o]
to me.
In Quenya, long ⟨ó⟩ (and, presumably, ⟨ô⟩) is, according to Appendix E,
“tenser and ‘closer’”, which presumably means [o]. (Online sources seem
to agree.) The Latin [[o:]] phoneme works well enough for this.
In Sindarin, ⟨ó⟩ has “the same quality” as ⟨o⟩ according to Appendix E,
so emit it as [[O:]] for [ɔː]. This sounds sensible enough to me.
I’m undecided whether “Lothlórien” should be in sjn_list, to pronounce
it with [oː] instead of [ɔː]. It’s composed of Sindarin “loth” and
Quenya “Lórien”, so that could potentially justify a pronunciation with
a Quenya ⟨ó⟩. But then again, maybe it should be a standard Sindarin
⟨ó⟩. For now, I’ve opted to not add it; in the film The Fellowship of
the Ring, Aragorn (Viggo Mortensen) says “Lothlórien” after the
Fellowship leave Moria, and to me his ⟨ó⟩ sounds more like [ɔː] than
[oː], so if this is wrong, at least it’s no more wrong than the famous
movie adaptation :)
⟨e⟩ almost certainly always represents [ɛ], not [e]. Appendix E of The
Lord of the Rings describes it as the sound in English “were”, and I’m
not aware of any English dialect that pronounces “were” with an [e].
In Quenya, long ⟨é⟩ (and, presumably, ⟨ê⟩) is, according to Appendix E,
“tenser and ‘closer’”, which I assume means [e]. Several online sources
agree with this as well.
In Sindarin, Appendix E is quite clear that ⟨é⟩ has “the same quality”
as ⟨e⟩, only differing from it in length: I assume this must mean that
⟨é⟩ is [ɛː] in Sindarin. The online information on this is confusing and
sometimes contradictory even within the same page; several sources claim
that Sindarin has an [eː], but I have not seen this claim substantiated
with a source from Tolkien, and I suspect it’s simply a confusion with
Quenya. It scarcely matters, anyway: Sindarin words with ⟨é⟩ or ⟨ê⟩ seem
to be pretty rare. (I’m aware of a single word with an ⟨é⟩ – the name
Eluréd, son of Dior – and the Omikhleia Sindarin dictionary [1] features
some words with ⟨ê⟩, giving their pronunciation with [ɛː].)
The [[EI]] phoneme for Sindarin ⟨ei⟩ is copied from the base2 phonemes.
[1]: https://www.jrrvf.com/hisweloke/sindar/index.html
Previously, we used vdiph/ui_4 for [[ui]]; I think the main reason for
that was that I didn’t like how the most common ⟨ui⟩, vdiph/ui, seemed
to almost vanish in “Cuiviénen”. However, vdiph/ui_4 has the curious
property that in some positions, e.g. ⟨uia⟩ in “tuia” or ⟨uil⟩ in
“tuilindo”, it sounds (to me) more like /ul/ than /ui/. (This also
affects Finnish, which seems to be the only other language that uses
vdiph/ui_4 [a few other languages also use it for [[ui]] but don’t seem
to emit that phoneme in their rules files] – listen to eSpeak NG
pronounce Finnish “luiun”, for instance.) I eventually found out that
this can be worked around by substantially lengthening the phoneme
(length 500 seems to work in all positions), but this extreme length
(the absolute maximum is just 511) becomes rather noticeable whenever
the ui is used, including in positions where it had sounded just fine
before. Meanwhile, the more standard vdiph/ui can be made to sound
reasonably well in “Cuiviénen” with a much smaller increment to its
length: 290 (as also in ph_lithuanian) instead of 240 (as in ph_base2)
is enough. In this version, [[uI]] sounds acceptable enough for Elvish
⟨ui⟩ in all positions, as far as I can tell.
According to Appendix E of The Lord of the Rings, this has the same
relation to ⟨y⟩ ([j]) as ⟨hw⟩ ([ʍ]) does to ⟨w⟩ ([w]) – this probably
means the voiceless palatal fricative [ç], though Wikipedia says a
voiceless palatal approximant (which would be closer to [j], the voiced
palatal approximant) is sometimes also posited.
We previously emitted [[hj]] for ⟨hy⟩, which sounds fairly close to [ç],
similar to how [[hw]] is fairly close to [ʍ] (see previous commit) –
however, translating it into [[C]] again means better --ipa output.
(In Sindarin, ⟨hy⟩ does not occur.)
This is “a voiceless w, as in English white (in northern pronunciation)”
according to Appendix E of The Lord of the Rings, and so we copy the
[[w#]] phoneme from the English phonemes. I can’t actually hear much of
a difference from the previous [[hw]] (I know what the difference
between [[w]] and [[w#]] should be, but [[hw]] already sounds like
[[w#]] to me), but at least this improves the --ipa output, changing it
from [hw] to [ʍ].
Both represent a “voiceless R”, which I believe means a voiceless
alveolar trill, [r̥]. Ideally this would be one phoneme, but I’m not sure
eSpeak NG currently has a phoneme for this. The Wikipedia article [1][2]
lists occurrences in comparatively few languages, and I chose Welsh for
guidance: eSpeak NG currently turns Welsh “Rhagfyr” into [[hr'agvYr]],
and [[h]] and [[r]] are apparently just two separate phonemes, so for
now we do the same for Quenya and Sindarin, and emit hR.
[1]: https://en.wikipedia.org/wiki/Voiceless_alveolar_trill
[2]: https://en.wikipedia.org/wiki/Special:PermanentLink/1024721264
Both represent a “voiceless L”; Appendix E of The Lord of the Rings
notes that the Quenya ⟨hl⟩ was pronounced like /l/ by the Third Age, but
for now we reproduce the original pronunciation. (Maybe we can later use
conditional rules for different pronunciations, but I think for now I
won’t go down that road.)
The B-side of the album Poems and Songs of Middle Earth begins with a
reading of the Sindarin poem A Elbereth Gilthoniel by J.R.R. Tolkien
himself, and in this recording, as best I can tell, he always pronounces
short i (i.e. ⟨i⟩, not ⟨í⟩ or ⟨î⟩) as /ɪ/ rather than /i/, regardless of
stress; for instance, the word “silivren” has the same i-sound twice (it
is not “silívren”). I believe this means that we should use the phoneme
[[I]], not [[i]], for ⟨i⟩ (in both Quenya and Sindarin).