Learning to Pronounce Chinese Without a Pronunciation Dictionary

We demonstrate a program that learns to pronounce Chinese text in Mandarin, without a pronunciation dictionary. From non-parallel streams of Chinese characters and Chinese pinyin syllables, it establishes a many-to-many mapping between characters and pronunciations. Using unsupervised methods, the program effectively deciphers writing into speech. Its token-level character-to-syllable accuracy is 89%, which significantly exceeds the 22% accuracy of prior work.

The task of unsupervised grapheme-to-phoneme conversion was introduced by Knight and Yamada (1999). Given two non-parallel streams:
• A corpus of written language (characters).
• A corpus of spoken language (sounds).
the goal is to build:
• A mapping table between the character domain and the sound domain.
• A proposed pronunciation of the written character sequences.
Motivated by archaeological decipherment, Knight and Yamada (1999) view character sequences as "enciphered" phoneme sequences. Their evaluation compares the proposed pronunciations with actual pronunciations. With a noisy-channel expectation-maximization method, they obtain 96% phoneme accuracy on Spanish and 99% on Japanese kana, but only 22% syllable accuracy on Mandarin Chinese.
In this paper, we revisit the task of deciphering Chinese text into standard Mandarin pronunciations (Figure 1). We obtain an improved 89% syllable accuracy. We further explore exposing the internals of characters and syllables to the analyzer, as Chinese characters sharing written components often sound similar.
We find it compelling that pronunciation dictionaries are largely redundant with non-parallel text and speech corpora, even for writing systems as complex as Chinese. We also expect results may be of use in dealing with novel ways to write Chinese, such as Nüshu script (Zhang et al., 2016), with acoustic modeling of other Chinese languages and dialects, and with novel ways to phonetically encode and decode Chinese in online censorship applications (Zhang et al., 2014).

Chinese Writing
The most-popular modern Chinese writing system renders each spoken syllable token with a single character token (hanzi). There are over 400 syllable types in Mandarin (in this paper, we use standard pinyin syllable representation, and we refer strictly to Mandarin pronunciation) and several thousand character types. The mapping is many-to-many:
• Almost every syllable type can be written with different characters (e.g., zhong → {中, 重, 肿, ...}). The choice depends on context. For example, the word zhongguo ("China") is written 中国, but zhongyao ("important") is written 重要.
While most Chinese words have two syllables, individual characters carry rough semantic meanings (e.g., 中 = "middle", 重 = "weighty"). So it is no accident that the same character is used to write semantically-similar words:
• 中国 ("China = middle kingdom"), 中学 ("middle school"), 市中心 ("city center")
• 重要 ("important"), 重达 ("heavy"), 重点 ("focus")
• Similarly, the second syllable of "website" is spelled "site", not "sight".
Finally, many characters have loosely informative internal structure. For example, 鸦 can be analyzed into two character components: 牙 and 鸟. Character components are sometimes a clue to pronunciation and/or meaning. For example:
• The character 鸦 ("crow", yā) is composed of 鸟 (meaning "bird") and 牙 (sound yá).
• The 中 (zhōng) component of 肿 ("swollen") is a clue to its pronunciation zhǒng, though the 月 ("moon") component is more loosely suggestive of its meaning.
• For a character like 法 ("law"), the components "water" and "go" do not provide much of a phonetic or semantic clue. This is the case with many characters.
The vast majority of characters have two top-level components, arranged either side-by-side (as in the examples above), top-bottom, or outside-inside. It should be noted that a top-level character component may often be recursively divided into further sub-components.
Generally speaking, a student cannot reliably guess the pronunciation or meaning of a new character, though their guess may be better than chance.

Data Preparation
From a Chinese Wikipedia dump, we remove all non-Chinese characters, then convert to simplified characters. This forms our character corpus.
For our pronunciation corpus, we could record and transcribe Mandarin speech into pinyin syllables. Instead, we simulate this. We take a large subset of the Baidu Baike encyclopedia, but then immediately convert it to tone-marked pinyin syllables, using a comprehensive dictionary of 116,524 words and phrases. 99.97% of Baike character tokens are covered by this dictionary.
We substitute Baike character sequences with pinyin sequences in left-to-right, longest-match fashion. This strategy works well most of the time. For example, it correctly pronounces 睡觉 as shui jiao, and 觉得 as jue de, despite the ambiguity of 觉. However, it incorrectly pronounces 想睡觉, because the dictionary entry 想睡 matches first, before 睡觉 = shui jiao can be applied; it also has trouble with single-character words like 还.
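The left-to-right, longest-match substitution can be sketched as follows. This is a minimal illustration, not the paper's actual tooling; the tiny dictionary stands in for the 116,524-entry resource.

```python
def to_pinyin(text, dictionary, max_len=8):
    """Greedily replace the longest dictionary match at each position."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in dictionary:
                out.extend(dictionary[word])
                i += length
                break
        else:
            i += 1  # characters not covered by the dictionary are skipped

    return out

dictionary = {
    "睡觉": ["shui", "jiao"],
    "觉得": ["jue", "de"],
    "想睡": ["xiang", "shui"],
    "觉": ["jue"],
}
```

Running `to_pinyin("想睡觉", dictionary)` reproduces the failure mode described above: 想睡 matches first, leaving the lone 觉 to be pronounced jue.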
Using character sequences from Chinese Wikipedia and pinyin sequences from Baidu Baike is important. If we alternatively divided Chinese Wikipedia into two parts, unsupervised analysis could easily exploit high-frequency boilerplate expressions like 英 重定向：这是由 英 名 ，指向中文名 的重定向。它引 出英 至 遵循命名常 的合 名 ，能 助 者 作。 ("English Redirection: This is a redirect from the English name to the Chinese name. It guides the English title to a proper name that follows the naming convention and can assist the editor in writing.")

We also pinyin-ize the first 100 lines (6059 characters) of our character corpus as a gold-standard reference set, for later judging how well we phonetically decipher the character corpus. Unless stated otherwise, all results are for token accuracy on this reference set. Table 1 gives statistics on our corpora. We release our data at https://github.com/c2huc2hu/unsupervised-chinese-pronunciation-data.
For characters, we employ the thorough graphical decompositions given in Wikimedia Commons, which divide each character into (at most) two parts. This allows us to find, for example, 44 characters that include the second component 包 (咆, 孢, 狍, 炮, etc). We only use top-level decompositions in this work, forgoing any further recursive decompositions.

Supervised Comparison Points
Before turning to unsupervised methods, we briefly present two supervised comparison points.
First, if we had a large parallel stream of character tokens and their pinyin pronunciations, we could train a simple pronouncer that memorizes the most-frequent pinyin for each character type. Using the Baike data as processed above as a putative parallel resource, we obtain 99.1% pronunciation accuracy on the test set. The only errors involve ambiguous characters, showing that a deterministic character-to-pinyin mapping table, whether obtained by memorization or by unsupervised methods, is sufficient to solve the bulk of the problem.
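This memorization baseline amounts to a majority vote per character type. A minimal sketch, assuming a parallel stream of (character, pinyin) token pairs (the pair data below is a toy illustration):

```python
from collections import Counter, defaultdict

def train_majority_pronouncer(parallel_tokens):
    # count pinyin occurrences per character type
    counts = defaultdict(Counter)
    for char, pinyin in parallel_tokens:
        counts[char][pinyin] += 1
    # memorize the most frequent pinyin for each character type
    return {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}

pairs = [("中", "zhong"), ("中", "zhong"), ("重", "zhong"),
         ("重", "chong"), ("重", "zhong"), ("国", "guo")]
table = train_majority_pronouncer(pairs)
```

A deterministic table like this cannot, by construction, handle ambiguous characters such as 重, which is exactly where the residual 0.9% error comes from.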
Second, to investigate whether written character components predict pronunciation, we use gold pronunciations of the most common 2000 characters to predict pronunciations of the next 1000. If a test character X has second (e.g., rightmost) written component Y, then we use the pronunciation of Y as a guess for the pronunciation of X. We find this works 25% of the time if we do not consider tones, and 17% of the time if we do. Table 2 confirms that character components are only a loose guide to pronunciation, even with supervision.
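The component-based guesser can be sketched as a pair of table lookups. The `decomp` and `known` tables below are toy stand-ins for the gold data; 肿 illustrates a (no-tone) success, while 徘 (pronounced pai, not fei) illustrates why the method only works about a quarter of the time.

```python
# decompositions into (first, second) top-level components
decomp = {"肿": ("月", "中"), "徘": ("彳", "非")}
# no-tone pronunciations of common (component) characters
known = {"中": "zhong", "非": "fei"}

def component_guess(char):
    """Guess a character's pronunciation from its second component."""
    parts = decomp.get(char)
    if parts is None:
        return None  # no decomposition available
    return known.get(parts[1])
```

Here `component_guess("肿")` correctly yields zhong, while `component_guess("徘")` yields fei, an incorrect guess for a character actually pronounced pai.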

Unsupervised Vector Method
Borrowing from unsupervised machine translation, which learns mappings between words in different languages (Lample et al., 2018a; Artetxe et al., 2018), we attempt to learn a mapping between embeddings for characters and embeddings for pinyin symbols. We train fastText (Bojanowski et al., 2017) vectors of dimension 300 and default settings on each of our corpora and use the MUSE system to learn the relationship between the two vector spaces (Lample et al., 2018b).
There are two steps to this method: (1) map character vectors into pinyin space, (2) for each character type, find its nearest pinyin neighbor. This gives us a table that maps character types onto pinyin types. We apply this table to each character token of our 6059-character test set, obtaining token-level accuracy.
Unfortunately, this method does not work well. Only 0.5% of type mappings are correct, and token-level accuracy is similarly small. Reversing the mapping direction (pinyin embeddings into character embedding space) does not improve accuracy. It appears that the asymmetry of the mapping is difficult for the algorithm to capture. Each pinyin syllable should, in reality, be the nearest neighbor of many different characters. Moreover, the behavior of a pinyin syllable in running pinyin data may not be a good match for the behavior of any given character with that pronunciation.
Our next approach is to map words instead of characters. We break our long character sequence into a long word sequence, e.g., 竞争 很 激烈, by applying the Jieba tokenizer to Wikipedia. We similarly break our long pinyin sequence into a long pinyin-word sequence, e.g., wo xihuan chi jiaozi, by applying the Stanford tokenizer to pre-pinyinized Baidu Baike. We build embeddings for types on both sides, and we again map them into a shared space.
During the nearest-neighbor phase, we take each written word and look for the closest pinyin word, giving preference to pinyin words with the same number of syllables as the written word. If we cannot find a near neighbor with the correct number of syllables, we map to a sequence of de, the most common Chinese pinyin token. We find that matched word pairs are much more accurate than the individual character-pinyin mappings we previously obtained. To get token-level pronunciation accuracy, we segment our 6059-token character test set, apply our learned mapping table, and count how many characters are pronounced correctly. Table 3 shows that the accuracy of this method is 81.4%.
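The syllable-count-constrained nearest-neighbor search can be sketched as below. The two-dimensional vectors are toy stand-ins for the 300-dimensional MUSE-aligned embeddings, and `nearest_pinyin` is a hypothetical helper, not the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_pinyin(word_vec, n_chars, pinyin_vocab):
    """pinyin_vocab: list of (syllable_tuple, vector) pairs.
    Prefer candidates whose syllable count matches the word's character count."""
    candidates = [(s, v) for s, v in pinyin_vocab if len(s) == n_chars]
    if not candidates:
        # fall back to a sequence of "de", the most common pinyin token
        return ("de",) * n_chars
    return max(candidates, key=lambda sv: cosine(word_vec, sv[1]))[0]

vocab = [(("zhong", "guo"), [1.0, 0.2]),
         (("zhong", "yao"), [0.2, 1.0]),
         (("de",), [0.5, 0.5])]
```

With this toy vocabulary, a two-character word's vector is matched only against two-syllable pinyin words, and a three-character word (for which no length-matched candidate exists) falls back to de de de.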

Unsupervised Noisy Channel EM Method
We next turn to a noisy-channel approach, following Knight and Yamada (1999). We consider our character sequence C = c_1...c_n as derived from a hidden (no-tone) pinyin sequence P = p_1...p_n:

argmax_θ Pr(C) = argmax_θ Σ_P Pr(P) · Pr(C|P)
               = argmax_θ Σ_P Pr(P) · Π_{i=1}^{n} Pr_θ(c_i|p_i)

Pr(P) is a fixed language model over pinyin sequences. Pr_θ(c|p) values are modified to maximize the value of the whole expression. Examples of Pr_θ(c|p) parameters are Pr(中 | zhong), which we hope to be relatively high, and Pr(很 | zhong), which we hope to be zero.

Previous Noisy-Channel
We first faithfully re-implement Knight and Yamada (1999). They drive decipherment using a bigram Pr(P), pruning pinyin pairs that occur fewer than 5 times.
Unfortunately, they do not provide their training data or code, giving only the number of character types as 2113, and the number of observed pinyin pair types as 1177 (after pruning pairs occurring fewer than 5 times). Using our own data, we estimate their character corpus at ∼30,000 tokens.
We applied this re-implementation to our data. Their pinyin-pair pruning has little effect, due to the size of our pinyin corpus (155,219 unique pairs). We ran their expectation-maximization (EM) algorithm for 170 iterations on a character corpus of 300,000 tokens, then applied their decoding algorithm to our 6059-token test set, obtaining a token pronunciation accuracy of 8.6%. Because this accuracy is lower than their reported 22%, we confirmed our results with two separate implementations, and we took the best of 10 random restarts. Increasing the character corpus size to 10 million tokens yielded a worse 5.1% accuracy. We conjecture that Knight and Yamada (1999) used more homogeneous data.

Our Noisy-Channel
In this work, we use a pinyin-trigram model (rather than bigram), and we apply efficiency tricks that allow us to scale to our large data.
First, we reduce our character data C to a list of unique triples C tri , recording count(c 1 c 2 c 3 ) for each triple. A sample character triple is "的 人 口" (count = 43485).
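A toy sketch of EM over character triples follows, under strong simplifying assumptions: the pinyin language model is a small table over whole pinyin triples, and both vocabularies are tiny. The paper's version instead uses a pinyin-trigram model and restricts each E-step to the top M pinyin triples for efficiency.

```python
from collections import defaultdict

def em(triple_counts, lm, pinyins, chars, iters=20):
    """Estimate channel probabilities Pr(c|p) from unique character triples."""
    # uniform initialization of the channel model Pr(c|p)
    ch = {p: {c: 1.0 / len(chars) for c in chars} for p in pinyins}
    for _ in range(iters):
        frac = {p: defaultdict(float) for p in pinyins}
        for (c1, c2, c3), n in triple_counts.items():
            # E-step: posterior over pinyin triples for this character triple
            post = {trip: lp * ch[trip[0]][c1] * ch[trip[1]][c2] * ch[trip[2]][c3]
                    for trip, lp in lm.items()}
            z = sum(post.values())
            if z == 0:
                continue
            for trip, p in post.items():
                w = n * p / z
                for pi, ci in zip(trip, (c1, c2, c3)):
                    frac[pi][ci] += w  # accumulate fractional counts
        # M-step: renormalize fractional counts into Pr(c|p)
        for p in pinyins:
            z = sum(frac[p].values())
            if z > 0:
                ch[p] = {c: frac[p][c] / z for c in chars}
    return ch

# toy data: two pinyin triples in the LM, two observed character triples
lm = {("zhong", "guo", "zhong"): 0.6, ("guo", "zhong", "guo"): 0.4}
triple_counts = {("中", "国", "中"): 30, ("国", "中", "国"): 10}
channel = em(triple_counts, lm, ["zhong", "guo"], ["中", "国"])
```

On this toy instance, EM drives Pr(中 | zhong) and Pr(国 | guo) up from their uniform starting values, as the LM and triple counts jointly favor that alignment.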
After we have obtained Pr_θ(c|p) values, we decode our 6059-character-token test sequence (C = c_1...c_n) using the standard Viterbi algorithm (Viterbi, 1967). Our decoding criterion is:

argmax_P Pr(P|C) = argmax_P Pr(P) · Pr(C|P)
                 = argmax_P Pr(P) · Π_{i=1}^{n} Pr_θ(c_i|p_i)

where Pr(P) is implemented with a smoothed bigram pinyin model Pr(p_2|p_1). Figure 3 gives the outline. While EM only considers the top M pinyin triples, final decoding works on entire sentences and is free to create previously-unseen pinyin trigrams. Decoding is also free to pronounce the same character in different ways, depending on its context. We follow the prior work in cubing channel model values.
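The decoder can be sketched as a standard Viterbi pass. This is a minimal illustration, assuming `lm(p2, p1)` is the smoothed bigram model Pr(p2|p1), `start(p)` a start probability, and `ch(c, p)` the learned channel model; all probabilities must be nonzero here since we work in log space. Following the paper, channel values are cubed.

```python
import math

def viterbi(chars, pinyins, start, lm, ch):
    # delta[p]: best log-probability of a pinyin prefix ending in p
    delta = {p: math.log(start(p)) + 3 * math.log(ch(chars[0], p)) for p in pinyins}
    back = []
    for c in chars[1:]:
        new, ptr = {}, {}
        for p2 in pinyins:
            # best predecessor for p2 under the bigram pinyin model
            p1 = max(pinyins, key=lambda q: delta[q] + math.log(lm(p2, q)))
            new[p2] = delta[p1] + math.log(lm(p2, p1)) + 3 * math.log(ch(c, p2))
            ptr[p2] = p1
        delta = new
        back.append(ptr)
    # trace back the best pinyin path
    p = max(delta, key=delta.get)
    path = [p]
    for ptr in reversed(back):
        p = ptr[p]
        path.append(p)
    return list(reversed(path))

# toy models: two pinyin types, uniform bigram LM, peaked channel
channel = {("中", "zhong"): 0.9, ("中", "guo"): 0.1,
           ("国", "guo"): 0.9, ("国", "zhong"): 0.1}
decoded = viterbi("中国中", ["zhong", "guo"],
                  start=lambda p: 0.5,
                  lm=lambda p2, p1: 0.5,
                  ch=lambda c, p: channel[(c, p)])
```

Because the channel model is consulted at every position, the decoder is free to pronounce the same character differently in different contexts, exactly as described above.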
Because different random restarts yield different accuracy results, we report ranges. We are generally able to identify the best restart in an unsupervised way, due to the high correlation between EM's objective Pr(C_tri) and test-set accuracy. Table 4 shows decoding accuracy results. We achieve 71%, substantially beating the 22% accuracy reported by Knight and Yamada (1999), as well as the 8.6% of a re-implementation applied to our data.

We also experiment with exposing character components to EM, interpolating component-based channel models:

Pr_θ(c|p) = λ_1 · Pr_1(c|p)
          + λ_2 · Pr_2(part1(c)|p) · Pr_3(c|part1(c))
          + λ_3 · Pr_4(part2(c)|p) · Pr_5(c|part2(c))

As EM establishes a tentative high value for Pr_1(排 | pai), we hope to also create a high value for Pr_4(非 | pai), which will encourage pai to map to other characters with component 非 (such as 徘) in the following EM iterations.
Unfortunately, while we do see compelling Pr 4 entries, we do not see an overall improvement in test-set accuracy from this method.

Combining EM and Vector Methods
The EM method gives 71% accuracy, while the vector method gives 81% accuracy. We find that the two methods agree 47% of the time, and are 98.7% accurate in agreement cases, so in an unsupervised way, we distill out 261 high-confidence character/pinyin mappings.
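The distillation step amounts to intersecting the two methods' mapping tables. The tables below are toy illustrations, not the actual learned mappings:

```python
def distill(em_table, vec_table):
    # keep only character-to-pinyin mappings both methods agree on
    return {c: p for c, p in em_table.items() if vec_table.get(c) == p}

em_table = {"中": "zhong", "要": "yao", "很": "de"}
vec_table = {"中": "zhong", "要": "yao", "很": "hen"}
high_conf = distill(em_table, vec_table)
```

Since agreement cases are 98.7% accurate, the intersected table serves as near-gold supervision that remains fully unsupervised in origin.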
Improved EM results. We use the 261 high-confidence (c, p) mappings as our initial starting point, by replacing each one's random initial Pr_θ(c|p) value with a 1.0 weight. These weights bias the fractional counting in the first EM iteration. Table 5 shows that high-confidence mappings increase overall EM accuracy from 71% to 81%.
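A sketch of the seeded initialization. The renormalization after seeding is our assumption, made so each conditional distribution Pr(·|p) sums to one before EM begins:

```python
import random

def seed_channel(chars, pinyins, seeds, rng=None):
    rng = rng or random.Random(0)
    # random initial Pr(c|p) values for every (character, pinyin) pair
    ch = {p: {c: rng.random() for c in chars} for p in pinyins}
    # replace each high-confidence (c, p) pair's value with a 1.0 weight
    for c, p in seeds:
        ch[p][c] = 1.0
    # renormalize each conditional distribution Pr(.|p)  (our assumption)
    for p in pinyins:
        z = sum(ch[p].values())
        ch[p] = {c: v / z for c, v in ch[p].items()}
    return ch

seeded = seed_channel(["中", "国", "很"], ["zhong", "guo"], [("中", "zhong")])
```

The seeded pair starts as the largest entry in its distribution, so the first E-step's fractional counts are biased toward it, as described above.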
Improved vector-based results. We run the same initial procedure from Section 5, giving us a vector space inhabited by both written words and pinyin words. However, we modify the nearest-neighbor search that produces word/pronunciation mappings. Our modified nearest-neighbor search takes a written word's vector and returns the nearest pronunciation vector that is consistent with the 261 high-confidence (c, p) mappings. For example, given the word 重要, we prefer neighbors dang yao and zhong yao over dang pin, because (要, yao) is one of the high-confidence mappings originally proposed by both EM and vector-based approaches. We also use high-confidence mappings to improve de sequences (Section 5). This combination technique is also highly effective, raising accuracy from 81% to 89%.

Conclusion
We implement and evaluate techniques to pronounce Chinese text in Mandarin, without the use of a pronunciation dictionary or parallel resource. The EM method achieves a test-set accuracy of 71%, while the vector-based method achieves 81%. By combining the two methods, we obtain 89% accuracy, which significantly exceeds that of prior work.
We also demonstrate that current methods for unsupervised matching of vector spaces are sensitive to the structure of the spaces. In the presence of one-to-many mappings between pinyin and characters, mapping accuracy is severely degraded, leaving open an opportunity to design more robust unsupervised vector mapping systems.