Khmer ASR

Building speech recognition for the Khmer language.

Khmer Phonemic Inventory

At a high level, automatic speech recognition (ASR) is a computer application that takes sound waves as input and produces the spoken words, phrases, or sentences as text. It is a transcription task that turns spoken language into written form. The first stage of speech processing is therefore to describe the acoustic properties of the language and establish pronunciation rules, which make the transcription possible.

The section below outlines the acoustic features of the Khmer language and gives transcriptions in IPA and Arpabet format to represent the phonemes of each character.

Khmer alphabet and pronunciation

The Cambodian script (called Khmer letters) has 33 consonants, 24 dependent vowels, 12 independent vowels, and several diacritic symbols. Most consonants have reduced or modified forms, called sub-consonants, which are used when they occur as the second member of a consonant cluster. Vowels may be written before, after, over, or under a consonant symbol. [1]
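
The character classes described above can be told apart programmatically. The following is a minimal sketch, not part of the original material, that assumes the Unicode character names of the Khmer block are an acceptable way to classify characters. The function name `classify` and the class labels are hypothetical.

```python
import unicodedata


def classify(ch: str) -> str:
    """Classify one character using its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name == "KHMER SIGN COENG":
        # COENG marks the following consonant as a sub-consonant.
        return "coeng"
    if name.startswith("KHMER LETTER"):
        return "consonant"
    if name.startswith("KHMER INDEPENDENT VOWEL"):
        return "independent_vowel"
    if name.startswith("KHMER VOWEL SIGN"):
        return "dependent_vowel"
    if name.startswith("KHMER SIGN"):
        # Diacritics such as MUUSIKATOAN and TRIISAP, plus other signs.
        return "sign"
    return "other"


word = "\u1781\u17d2\u1798\u17c2\u179a"  # ខ្មែរ ("Khmer")
print([classify(ch) for ch in word])
```

Note that the Unicode block also contains letters used only for Sanskrit and Pali transliteration, so the "consonant" class here is slightly larger than the 33 consonants counted above.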

Consonants

Consonants can be divided into two series: the ɑ-series, which carries an inherent /ɑ/ sound, and the ɔ-series, which carries an inherent /ɔ/ sound. The /ɑ/ or /ɔ/ sound comes from an abstract (inherent) vowel, the first of the 24 vowels. In addition, the diacritics ៉ (MUUSIKATOAN) and ៊ (TRIISAP) are used to change a consonant to the ɑ-series and the ɔ-series respectively. [2]

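| ɑ-series | sub-script | ɔ-series | sub-script | sound (IPA) | Arpabet |
| --- | --- | --- | --- | --- | --- |
| ក | ្ក | គ | ្គ | k | K |
| ខ | ្ខ | ឃ | ្ឃ | kh | KH |
| ង៉ |  | ង | ្ង | ŋ | NG |
| ច | ្ច | ជ | ្ជ | c | C |
| ឆ | ្ឆ | ឈ | ្ឈ | ch | CH |
| ញ៉ |  | ញ | ្ញ | ɲ | GN |
| ដ | ្ដ | ឌ | ្ឌ | ɗ | D |
| ឋ, ថ | ្ឋ, ្ថ | ឍ, ធ | ្ឍ, ្ធ | th | TH |
| ណ | ្ណ | ន | ្ន | n | N |
| ត | ្ត | ទ | ្ទ | t | T |
| ប | ្ប | ប៊ |  | ɓ | B |
| ផ | ្ផ | ភ | ្ភ | ph | PH |
| ប៉ |  | ព | ្ព | p | P |
| ម៉ |  | ម | ្ម | m | M |
| យ៉ |  | យ | ្យ | j | Y |
| រ៉ |  | រ | ្រ | r | R |
| ឡ | ្ឡ | ល | ្ល | l | L |
| វ៉ |  | វ | ្វ | w | W |
| ស | ្ស | ស៊ |  | s | S |
| ហ | ្ហ | ហ៊ |  | h | HH |
| ឣ | ្ឣ | ឣ៊ |  | ʔ |  |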
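
The series rule and the two shifting diacritics can be sketched as a small lookup. This is an illustrative sketch only, covering a handful of rows from the consonant table. The names `CONSONANTS` and `consonant_phone` are hypothetical, and the special case for ប with MUUSIKATOAN follows the table's /p/ row rather than any general rule.

```python
A_SERIES = "ɑ"
O_SERIES = "ɔ"
MUUSIKATOAN = "\u17c9"  # ៉ : shifts a consonant to the ɑ-series
TRIISAP = "\u17ca"      # ៊ : shifts a consonant to the ɔ-series

# Base consonant -> (native series, Arpabet). Subset of the table, for illustration.
CONSONANTS = {
    "\u1780": (A_SERIES, "K"),   # ក
    "\u1782": (O_SERIES, "K"),   # គ
    "\u1784": (O_SERIES, "NG"),  # ង
    "\u1794": (A_SERIES, "B"),   # ប
    "\u1796": (O_SERIES, "P"),   # ព
    "\u1798": (O_SERIES, "M"),   # ម
    "\u179f": (A_SERIES, "S"),   # ស
}


def consonant_phone(base: str, shifter: str = "") -> tuple:
    """Return (series, Arpabet) for a base consonant and an optional shifter."""
    series, arpabet = CONSONANTS[base]
    if shifter == MUUSIKATOAN:
        series = A_SERIES
        if base == "\u1794":
            # The table lists ប៉ under /p/ (Arpabet P), not /ɓ/.
            arpabet = "P"
    elif shifter == TRIISAP:
        series = O_SERIES
    return series, arpabet


print(consonant_phone("\u1784"))               # ង
print(consonant_phone("\u1784", MUUSIKATOAN))  # ង៉
print(consonant_phone("\u179f", TRIISAP))      # ស៊
```

Returning the series together with the phone is a deliberate choice here, because the series is what later selects the vowel pronunciation.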

Dependent Vowels

The pronunciation of a vowel, including the inherent vowel, is determined by the series of the initial consonant or consonant cluster that it follows. [2]

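| Letter | ɑ-series (IPA) | Arpabet | ɔ-series (IPA) | Arpabet |
| --- | --- | --- | --- | --- |
| Inherent vowel | ɑː | AA | ɔː | OA |
| ា |  | AH | iːə | EA |
| ិ | e | EH | i | IH |
| ី | ej | EY |  | IY |
| ឹ | ə | OE | ɨ | EO |
| ឺ | œː | ER | ɨː | EU |
| ុ | o | OH | u | UH |
| ូ | ɔːo | OW |  | UW |
| ួ | uːə | UE | uːə | UE |
| ើ | aːə | AER | əː | EER |
| ឿ | ɨːə | EUR | ɨːə | EUR |
| ៀ | iːə | EA | iːə | EA |
| េ |  | IE |  | IE |
| ែ | aːɛ | AE | ɛː | AE |
| ៃ | aj | AY | ej | EY |
| ោ | aːo | AW |  | OW |
| ៅ | aw | AOW | əw | AUW |
| ុំ | om | OUM | um | UM |
| ំ | ɑm | OM | um | UM |
| ាំ | am | AM | oam | AOM |
| ះ | ah | EHX | ɛah | AHX |
| ិះ | eh | EEH | ih | IH |
| ឹះ | əh | ERH | ɨh | EOH |
| ុះ | oh | OUH | uh | UUH |
| េះ | eh | OEH | ih | IYH |
| ោះ | ɑh | AOH | uəh | UEH |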
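
The dependence of a vowel's pronunciation on the consonant series can be sketched as a two-column lookup. This is a minimal illustration that uses only a few rows of the table. The names `VOWELS`, `vowel_arpabet`, and `syllable` are hypothetical, and the sketch assumes the series of the onset is already known (consonant clusters are not handled).

```python
A_SERIES = "ɑ"
O_SERIES = "ɔ"

# Vowel spelling -> (ɑ-series Arpabet, ɔ-series Arpabet). Subset of the table.
VOWELS = {
    "": ("AA", "OA"),                # inherent vowel (nothing written)
    "\u17bb\u17c6": ("OUM", "UM"),   # ុំ
    "\u17b6\u17c6": ("AM", "AOM"),   # ាំ
    "\u17b7\u17c7": ("EEH", "IH"),   # ិះ
    "\u17bb\u17c7": ("OUH", "UUH"),  # ុះ
}


def vowel_arpabet(vowel: str, series: str) -> str:
    """Pick the vowel's Arpabet symbol according to the onset's series."""
    if series not in (A_SERIES, O_SERIES):
        raise ValueError(f"unknown series: {series!r}")
    a_form, o_form = VOWELS[vowel]
    return a_form if series == A_SERIES else o_form


def syllable(onset_arpabet: str, series: str, vowel: str = "") -> list:
    """Build the phone sequence for a simple consonant + vowel syllable."""
    return [onset_arpabet, vowel_arpabet(vowel, series)]


# Same vowel sign, different onset series, different pronunciation.
print(syllable("K", A_SERIES, "\u17b6\u17c6"))  # ក + ាំ
print(syllable("K", O_SERIES, "\u17b6\u17c6"))  # គ + ាំ
print(syllable("K", O_SERIES))                  # គ with the inherent vowel
```

In a fuller letter-to-sound pipeline the series would come from the consonant lookup, including any MUUSIKATOAN or TRIISAP shift, rather than being passed in by hand.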

Independent Vowels

Unlike dependent vowels, independent vowels can be the initial letter of a word. They can be followed immediately by consonants, but not by dependent vowels.

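| Letter | IPA | Arpabet |
| --- | --- | --- |
| ឣា | ʔaː | AH |
| ឥ | ʔe | EH |
| ឦ | ʔej | EY |
| ឧ | ʔu | UH |
| ឩ | ʔuː | UW |
| ឪ | ʔoːw | AUW |
| ឫ | ʔɨ | R EO |
| ឬ | ʔɨː | R EU |
| ឭ |  | L EO |
| ឮ | lɨː | L EU |
| ឯ | ʔaːɛ | AE |
| ឰ | ʔaj | AY |
| ឱ | ʔaːo | AW |
| ឲ | ʔaːo | AW |
| ឳ | ʔaw |  |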

All the tables above extend the text-to-sound mapping tables developed in [2].

References

  1. Center for Southeast Asia Studies (Khmer) - Northern Illinois University
  2. T.R. Annanda, S.M. Long, S. Heng, N. Long, K.H. Sok, “Complexity of Letter to Sound Conversion (LTS) in Khmer Language: under the context of Khmer Text-to-Speech (TTS)”. NLP lab, Department of Computer and Communication Engineering, Institute of Technology of Cambodia, Cambodia, PAN10 and IDRC Canada
  3. Research on Phonetic and Phonological Analysis of Khmer
  4. Omniglot - Khmer
  5. S. Seng, S. Sam, V.-B. Le, B. Bigi, and L. Besacier, “Which Units for Acoustic and Language Modeling for Khmer Automatic Speech Recognition?” presented at the International Workshop on Spoken Languages Technologies for Under-Resourced Languages, 2008.
