Joint CTC/attention decoding for end-to-end speech recognition

End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding. We have applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and show performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.


Introduction
Automatic speech recognition (ASR) is currently a mature set of technologies that have been widely deployed, resulting in great success in interface applications such as voice search. A typical ASR system is factorized into several modules including acoustic, lexicon, and language models based on a probabilistic noisy channel model (Jelinek, 1976). Over the last decade, dramatic improvements in acoustic and language models have been driven by machine learning techniques known as deep learning (Hinton et al., 2012).
However, current systems lean heavily on the scaffolding of complicated legacy architectures that grew up around traditional techniques. For example, when we build an acoustic model from scratch, we first have to build a hidden Markov model (HMM) and Gaussian mixture model (GMM), followed by deep neural networks (DNN). In addition, the factorization of acoustic, lexicon, and language models is derived from conditional independence assumptions (especially Markov assumptions), although the data do not necessarily follow such assumptions, leading to model misspecification. This factorization also yields a local optimum, since the above modules are optimized separately. Further, to factorize acoustic and language models well, the system requires linguistic knowledge in the form of a lexicon model, which is usually based on a hand-crafted pronunciation dictionary that maps each word to a phoneme sequence. In addition to the pronunciation dictionary issue, some languages, which do not explicitly have word boundaries, need language-specific tokenization modules (Kudo et al., 2004; Bird, 2006) for language modeling. Finally, inference/decoding has to be performed by integrating all modules, resulting in complex decoding. Consequently, it is quite difficult for non-experts to use or develop ASR systems for new applications, especially for new languages.
End-to-end ASR has the goal of simplifying the above module-based architecture into a single-network architecture within a deep learning framework, in order to address the above issues. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming (Chorowski et al., 2014; Graves and Jaitly, 2014).
The attention-based end-to-end method solves the ASR problem as a sequence mapping from speech feature sequences to text by using an encoder-decoder architecture. The decoder network uses an attention mechanism to find an alignment between each element of the output sequence and the hidden states generated by the acoustic encoder network for each frame of acoustic input (Chorowski et al., 2014, 2015; Chan et al., 2015). This basic temporal attention mechanism is too flexible in the sense that it allows extremely non-sequential alignments. This may be fine for applications such as machine translation, where input and output word order are different (Wu et al., 2016). However, in speech recognition, the feature inputs and corresponding letter outputs generally proceed in the same order. Another problem is that the input and output sequences in ASR can have very different lengths, and these vary greatly from case to case, depending on the speaking rate and writing system, making it more difficult to track the alignment.
However, an advantage is that the attention mechanism does not require any conditional independence assumptions, and could address all the problems cited above. Although the alignment problems of attention-based mechanisms have been partially addressed by Chorowski et al. (2014) and Chorowski and Jaitly (2016) using various mechanisms, here we propose more rigorous constraints by using CTC-based alignment to guide the decoding.
CTC permits an efficient computation of a strictly monotonic alignment using dynamic programming (Graves et al., 2006; Graves and Jaitly, 2014), although it requires language models and graph-based decoding (Miao et al., 2015) except in the case of huge training data (Amodei et al., 2015; Soltau et al., 2016). We propose to take advantage of the constrained CTC alignment in a hybrid CTC/attention based system during decoding. The proposed method adopts a CTC/attention hybrid architecture, which was originally designed to regularize an attention-based encoder network by additionally using CTC during training (Kim et al., 2017). The proposed method extends the architecture to perform one-pass/rescoring joint decoding, where hypotheses of attention-based ASR are boosted by scores obtained by using CTC outputs. This greatly reduces irregular alignments without any heuristic search techniques.
The proposed method is applied to Japanese and Mandarin ASR tasks, which require extra linguistic resources, including a morphological analyzer (Kudo et al., 2004) or word segmentation (Xue et al., 2003) in addition to a pronunciation dictionary, to provide accurate lexicon and language models in conventional DNN/HMM ASR. Surprisingly, the method achieved performance comparable to, and in some cases superior to, several state-of-the-art DNN/HMM ASR systems, without using the above linguistic resources.

From DNN/HMM to end-to-end ASR
This section briefly provides a formulation of conventional DNN/HMM ASR and CTC or attention based end-to-end ASR.

Conventional DNN/HMM ASR
ASR deals with a sequence mapping from a $T$-length speech feature sequence $X = \{x_t \in \mathbb{R}^D \mid t = 1, \cdots, T\}$ to an $N$-length word sequence $W = \{w_n \in \mathcal{V} \mid n = 1, \cdots, N\}$. Here $x_t$ is a $D$-dimensional speech feature vector (e.g., log Mel filterbanks) at frame $t$, and $w_n$ is a word at position $n$ in vocabulary $\mathcal{V}$. ASR is mathematically formulated with Bayes decision theory, where the most probable word sequence $\hat{W}$ is estimated among all possible word sequences $\mathcal{V}^*$ as follows:
$$\hat{W} = \arg\max_{W \in \mathcal{V}^*} p(W|X). \tag{1}$$
The posterior distribution $p(W|X)$ is factorized into the following three distributions by using Bayes' theorem and introducing an HMM state sequence $S = \{s_t \in \{1, \cdots, J\} \mid t = 1, \cdots, T\}$:
$$\arg\max_{W \in \mathcal{V}^*} p(W|X) \approx \arg\max_{W \in \mathcal{V}^*} \sum_{S} p(X|S)\, p(S|W)\, p(W).$$
The three factors, $p(X|S)$, $p(S|W)$, and $p(W)$, are the acoustic, lexicon, and language models, respectively. These are further factorized by using the probabilistic chain rule and conditional independence assumptions as follows:
$$p(X|S) \approx \prod_{t} \frac{p(s_t|x_t)}{p(s_t)}, \qquad p(S|W) \approx \prod_{t} p(s_t|s_{t-1}, W), \qquad p(W) \approx \prod_{n} p(w_n|w_{n-1}, \ldots, w_{n-m+1}),$$
where the acoustic model is replaced with the product of framewise posterior distributions $p(s_t|x_t)$ computed by powerful DNN classifiers by using the so-called pseudo-likelihood trick (Bourlard and Morgan, 1994). $p(s_t|s_{t-1}, W)$ is represented by an HMM state transition given $W$, and the conversion from $W$ to HMM states is deterministically performed by using a pronunciation dictionary through a phoneme representation. $p(w_n|w_{n-1}, \ldots, w_{n-m+1})$ is obtained based on an $(m-1)$th-order Markov assumption as an $m$-gram model.
These conditional independence assumptions are often regarded as too strong, leading to model misspecification. Also, to train the framewise posterior $p(s_t|x_t)$, we have to provide a framewise state alignment $s_t$ as a target, which is often provided by a GMM/HMM system. Thus, conventional DNN/HMM systems make the ASR problem formulated with Eq. (1) feasible by using factorization and conditional independence assumptions, at the cost of the problems discussed in Section 1.

Connectionist Temporal Classification (CTC)
The CTC formulation also follows from Bayes decision theory (Eq. (1)). Note that the CTC formulation uses an $L$-length letter sequence $C = \{c_l \in \mathcal{U} \mid l = 1, \cdots, L\}$ with a set of distinct letters $\mathcal{U}$. Similarly to Section 2.1, by introducing a framewise letter sequence with an additional "blank" symbol <b>, $Z = \{z_t \in \mathcal{U} \cup \{\text{<b>}\} \mid t = 1, \cdots, T\}$, and by using the probabilistic chain rule and conditional independence assumptions, the posterior distribution $p(C|X)$ is factorized as follows:
$$p(C|X) \approx \underbrace{\sum_{Z} \prod_{t} p(z_t|z_{t-1}, C)\, p(z_t|X)}_{\triangleq\, p_{\mathrm{ctc}}(C|X)} \, \frac{p(C)}{p(Z)}. \tag{2}$$
As a result, CTC has three distribution components similar to the DNN/HMM case, i.e., the framewise posterior distribution $p(z_t|X)$, the transition probability $p(z_t|z_{t-1}, C)$, and the prior distributions of letter and hidden-state sequences, $p(C)$ and $p(Z)$, respectively. We also define the CTC objective function $p_{\mathrm{ctc}}(C|X)$ used in the later formulation. The framewise posterior distribution $p(z_t|X)$ is conditioned on all inputs $X$, and it is quite natural to model it by using a bidirectional long short-term memory (BLSTM) network: $p(z_t|X) = \mathrm{Softmax}(\mathrm{Lin}(h_t))$ and $h_t = \mathrm{BLSTM}(X)$. $\mathrm{Softmax}(\cdot)$ is a softmax activation function, and $\mathrm{Lin}(\cdot)$ is a linear layer that converts the hidden vector $h_t$ to a $(|\mathcal{U}| + 1)$-dimensional vector (the $+1$ accounts for the blank symbol introduced in CTC). Although Eq. (2) has to deal with a summation over all possible $Z$, it is efficiently computed by using dynamic programming (Viterbi/forward-backward algorithm) thanks to the Markov property. In summary, although CTC and DNN/HMM systems are similar to each other due to conditional independence assumptions, CTC does not require pronunciation dictionaries and omits a GMM/HMM construction step.
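As a rough illustration of the dynamic programming mentioned above, the following Python sketch evaluates the sum over all blank-augmented alignments Z of the product of framewise posteriors using the CTC forward recursion. It is a minimal sketch under stated assumptions: the function name `ctc_forward_prob`, the list-of-lists layout of the posteriors, and the choice of index 0 for the blank symbol are illustrative and not taken from the original system. Probabilities are kept in the linear domain for readability; a practical implementation would work in the log domain.

```python
def ctc_forward_prob(posteriors, labels, blank=0):
    """Sum over all alignments Z that collapse to `labels` of prod_t p(z_t|X).

    posteriors: list of T rows; posteriors[t][k] is p(z_t = k | X).
    labels: list of label ids (the letter sequence C, without blanks).
    blank: id of the blank symbol (assumed to be 0 in this sketch).
    """
    # Interleave blanks: <b> c_1 <b> c_2 ... c_L <b>
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    num_states, num_frames = len(ext), len(posteriors)

    # alpha[s]: probability of all partial alignments ending in state s.
    alpha = [0.0] * num_states
    alpha[0] = posteriors[0][blank]
    if num_states > 1:
        alpha[1] = posteriors[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * posteriors[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or in the trailing blank.
    return alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)


# Toy example: two frames, one letter (id 1) plus blank (id 0).
toy_posteriors = [[0.4, 0.6], [0.3, 0.7]]
p_single = ctc_forward_prob(toy_posteriors, [1])
```

With two frames, the alignments for the single letter are "a a", "a <b>", and "<b> a", so the cost of the recursion grows with the number of frames times the number of extended states rather than with the number of alignments.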

Attention mechanism
Compared with hybrid and CTC approaches, the attention-based approach does not make any conditional independence assumptions, and directly estimates the posterior $p(C|X)$ based on the probabilistic chain rule, as follows:
$$p(C|X) = \prod_{l} p(c_l|c_1, \cdots, c_{l-1}, X) \triangleq p_{\mathrm{att}}(C|X), \tag{3}$$
where $p_{\mathrm{att}}(C|X)$ is an attention-based objective function. $p(c_l|c_1, \cdots, c_{l-1}, X)$ is obtained by
$$h_t = \mathrm{Encoder}(X), \tag{4}$$
$$a_{lt} = \mathrm{Attention}(\{a_{l-1}\}_t, q_{l-1}, h_t), \tag{5}$$
$$r_l = \sum_{t} a_{lt} h_t, \tag{6}$$
$$p(c_l|c_1, \cdots, c_{l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1}).$$
Eq. (4) converts the input feature vectors $X$ into a framewise hidden vector $h_t$ in an encoder network based on BLSTM, i.e., $\mathrm{Encoder}(X) \triangleq \mathrm{BLSTM}(X)$. $\mathrm{Attention}(\cdot)$ in Eq. (5) is based on a content-based attention mechanism with convolutional features, as described in (Chorowski et al., 2015) (see Appendix A). $a_{lt}$ is an attention weight, and represents a soft alignment of hidden vector $h_t$ for each output $c_l$ based on the weighted summation of hidden vectors to form the letter-wise hidden vector $r_l$ in Eq. (6). The decoder network is another recurrent network conditioned on the previous output $c_{l-1}$ and hidden vector $q_{l-1}$, similar to an RNNLM, in addition to the letter-wise hidden vector $r_l$. We use $\mathrm{Decoder}(\cdot) \triangleq \mathrm{Softmax}(\mathrm{Lin}(\mathrm{LSTM}(\cdot)))$. Attention-based ASR does not explicitly separate each module, and potentially handles all the issues pointed out in Section 1. It implicitly combines acoustic models, lexicon, and language models as encoder, attention, and decoder networks, which can be jointly trained as a single deep neural network.
Compared with DNN/HMM and CTC, which are based on a transition form from t − 1 to t due to the Markov assumption, the attention mechanism does not maintain this constraint, and often provides irregular alignments. A major focus of this paper is to address this problem by using joint CTC/attention decoding.
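To make the soft alignment concrete, the sketch below computes attention weights over the encoder hidden vectors for one output step and forms the letter-wise hidden vector as their weighted sum. It is a simplified illustration, not the mechanism used in this paper: it assumes plain dot-product scoring between a decoder state and each encoder state, whereas the paper uses content-based attention with convolutional features. The function name `attention_context` and the list-based vectors are hypothetical.

```python
import math


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one output step.

    decoder_state: list of floats (plays the role of q_{l-1}).
    encoder_states: list of T lists of floats (the h_t vectors).
    Scoring is a plain dot product, which is an assumption of this sketch.
    """
    scores = [
        sum(q * h for q, h in zip(decoder_state, h_t))
        for h_t in encoder_states
    ]
    # Numerically stable softmax over frames gives the weights a_{lt}.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    # Weighted sum of encoder states gives the letter-wise vector r_l.
    dim = len(encoder_states[0])
    context = [
        sum(w * h_t[d] for w, h_t in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Nothing in the weights forces consecutive output steps to attend to
# consecutive frames, which is the source of the irregular alignments.
weights, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the weights are a softmax over all frames, each output step is free to attend anywhere in the utterance; no monotonicity constraint is built in.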

Joint CTC/attention decoding
This section explains a hybrid CTC/attention network, which potentially utilizes the benefits of both CTC and attention in ASR. Kim et al. (2017) use a CTC objective function as an auxiliary task to train the attention model encoder within the multitask learning (MTL) framework, and this paper uses the same architecture. Figure 1 illustrates the overall architecture of the framework, where the same BLSTM is shared by the CTC and attention encoder networks. Unlike the sole attention model, the forward-backward algorithm of CTC can enforce monotonic alignment between speech and label sequences during training. That is, rather than solely depending on data-driven attention methods to estimate the desired alignments in long sequences, the forward-backward algorithm in CTC helps to speed up the process of estimating the desired alignment. The objective to be maximized is a logarithmic linear combination of the CTC and attention objectives, i.e., $p_{\mathrm{ctc}}(C|X)$ in Eq. (2) and $p_{\mathrm{att}}(C|X)$ in Eq. (3):
$$\mathcal{L}_{\mathrm{MTL}} = \lambda \log p_{\mathrm{ctc}}(C|X) + (1 - \lambda) \log p_{\mathrm{att}}(C|X), \tag{7}$$
with a tunable parameter $\lambda$: $0 \leq \lambda \leq 1$.

Decoding strategies
The inference step of our joint CTC/attention-based end-to-end speech recognition is performed by label-synchronous decoding with a beam search similar to conventional attention-based ASR. However, we take the CTC probabilities into account to find a hypothesis that is better aligned to the input speech, as shown in Figure 1. Hereafter, we describe the general attention-based decoding and conventional techniques to mitigate the alignment problem. Then, we propose joint decoding methods with a hybrid CTC/attention architecture.
Figure 1: Joint CTC/attention based end-to-end framework: the shared encoder is trained by both CTC and attention model objectives simultaneously. The shared encoder transforms our input sequence $\{x_t, \cdots, x_T\}$ into high-level features $H = \{h_t, \cdots, h_T\}$, and the attention decoder generates the letter sequence $\{c_1, \cdots, c_L\}$.

Attention-based decoding in general
End-to-end speech recognition inference is generally defined as the problem of finding the most probable letter sequence $\hat{C}$ given the speech input $X$, i.e.,
$$\hat{C} = \arg\max_{C \in \mathcal{U}^*} \log p(C|X). \tag{8}$$
In attention-based ASR, $p(C|X)$ is computed by Eq. (3), and $\hat{C}$ is found by a beam search technique.
Let $\Omega_l$ be a set of partial hypotheses of length $l$. At the beginning of the beam search, $\Omega_0$ contains only one hypothesis with the starting symbol <sos>, and the hypothesis score $\alpha(\text{<sos>}, X)$ is set to 0. For $l = 1$ to $L_{\max}$, each partial hypothesis in $\Omega_{l-1}$ is expanded by appending possible single letters, and the new hypotheses are stored in $\Omega_l$, where $L_{\max}$ is the maximum length of the hypotheses to be searched. The score of each new hypothesis is computed in the log domain as
$$\alpha(h, X) = \alpha(g, X) + \log p(c|g, X), \tag{9}$$
where $g$ is a partial hypothesis in $\Omega_{l-1}$, $c$ is a letter appended to $g$, and $h$ is the new hypothesis such that $h = g \cdot c$. If $c$ is the special symbol <eos> that represents the end of a sequence, $h$ is added to $\hat{\Omega}$ but not to $\Omega_l$, where $\hat{\Omega}$ denotes a set of complete hypotheses. Finally, $\hat{C}$ is obtained by
$$\hat{C} = \arg\max_{C \in \hat{\Omega}} \alpha(C, X). \tag{10}$$
In the beam search process, $\Omega_l$ is allowed to hold only a limited number of hypotheses with higher scores to improve the search efficiency.
Attention-based ASR, however, may be prone to deletion and insertion errors because of its flexible alignment property, which can attend to any portion of the encoder state sequence to predict the next label, as discussed in Section 2.3. Since attention is generated by the decoder network, it may prematurely predict the end-of-sequence label, even when it has not attended to all of the encoder frames, making the hypothesis too short. On the other hand, it may predict the next label with a high probability by attending to the same portions as those attended to before. In this case, the hypothesis becomes very long and includes repetitions of the same letter sequence.
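The beam search described above can be sketched in a few lines of Python. This is an illustrative sketch only: `log_prob_fn` is a hypothetical stand-in for the attention decoder that returns log p(c|g, X) for a partial hypothesis g, and the toy model used in the example is invented for demonstration. Hypotheses are tuples of labels, and scores are accumulated in the log domain as in Eq. (9).

```python
import math


def beam_search(log_prob_fn, labels, beam_width, max_len,
                sos="<sos>", eos="<eos>"):
    """Label-synchronous beam search over partial hypotheses."""
    beam = [((sos,), 0.0)]   # Omega_0 with score alpha(<sos>, X) = 0
    complete = []            # set of complete hypotheses
    for _ in range(max_len):
        candidates = []
        for g, score_g in beam:
            for c in list(labels) + [eos]:
                h = g + (c,)
                score_h = score_g + log_prob_fn(g, c)   # Eq. (9)
                if c == eos:
                    complete.append((h, score_h))
                else:
                    candidates.append((h, score_h))
        # Keep only the best `beam_width` partial hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_width]
        if not beam:
            break
    if not complete:
        return None
    return max(complete, key=lambda item: item[1])


def toy_log_prob(prefix, c):
    """Hypothetical decoder that prefers the sequence a, b, <eos>."""
    target = ["a", "b", "<eos>"]
    position = len(prefix) - 1
    expected = target[position] if position < len(target) else "<eos>"
    return math.log(0.8) if c == expected else math.log(0.1)


best_hyp, best_score = beam_search(toy_log_prob, ["a", "b"],
                                   beam_width=2, max_len=5)
```

The search itself places no constraint on when <eos> may be emitted or on how often a label may repeat, which is why a decoder with a poor alignment can yield hypotheses that are too short or that loop.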

Conventional decoding techniques
To alleviate the alignment problem, a length penalty term is commonly used to control the length of the selected hypothesis (Chorowski et al., 2015; Bahdanau et al., 2016). With the length penalty, the decoding objective in Eq. (8) is changed to
$$\hat{C} = \arg\max_{C \in \mathcal{U}^*} \{\log p(C|X) + \gamma |C|\}, \tag{11}$$
where $|C|$ is the length of the sequence $C$, and $\gamma$ is a tunable parameter. However, it is actually difficult to completely exclude hypotheses that are too long or too short even if $\gamma$ is carefully tuned. It is also effective, to some extent, to control the hypothesis length by minimum and maximum lengths, which are selected as fixed ratios of the length of the input speech. However, since there are exceptionally long or short transcripts compared to the input speech, it is difficult to balance saving such exceptional transcripts against preventing hypotheses with irrelevant lengths.
Another approach is the coverage term recently proposed in (Chorowski and Jaitly, 2016), which is incorporated in the decoding objective in Eq. (11) as
$$\hat{C} = \arg\max_{C \in \mathcal{U}^*} \{\log p(C|X) + \gamma |C| + \eta \cdot \mathrm{coverage}(C|X)\}, \tag{12}$$
where the coverage term is computed by
$$\mathrm{coverage}(C|X) = \sum_{t=1}^{T} \left[ \sum_{l=1}^{L} a_{lt} > \tau \right]. \tag{13}$$
$\eta$ and $\tau$ are tunable parameters. The coverage term represents the number of frames that have received a cumulative attention greater than $\tau$. Accordingly, it increases when paying close attention to some frames for the first time, but does not increase when paying attention again to the same frames. This property is effective for avoiding looping of the same label sequence within a hypothesis. However, it is still difficult to obtain a common parameter setting for $\gamma$, $\eta$, $\tau$, and the optional min/max lengths so that they are appropriate for any speech data from different tasks.
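The two heuristics above can be illustrated with a short Python sketch. It assumes the attention weights of a hypothesis are available as a list of rows, one per output label, each holding one weight per frame; the function names `coverage` and `penalized_score` are hypothetical. The coverage term counts frames whose cumulative attention exceeds τ, and the score adds the length penalty and the weighted coverage term to the log probability.

```python
def coverage(attention, tau):
    """Count frames whose attention summed over output labels exceeds tau.

    attention: list of L rows; attention[l][t] is the weight a_{lt}.
    """
    num_frames = len(attention[0])
    return sum(
        1
        for t in range(num_frames)
        if sum(row[t] for row in attention) > tau
    )


def penalized_score(log_prob, length, attention, gamma, eta, tau):
    """Log probability plus length penalty and weighted coverage term."""
    return log_prob + gamma * length + eta * coverage(attention, tau)


# A looping hypothesis attends to frame 0 twice and covers one frame only.
looping = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
# A better-aligned hypothesis moves on to frame 1 for the second label.
moving = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
covered_looping = coverage(looping, tau=0.5)
covered_moving = coverage(moving, tau=0.5)
```

Attending to the same frame again does not raise the count, so the looping hypothesis gains nothing from its second label, while the hypothesis that moves forward is rewarded. The sketch also shows why three tunable values (γ, η, τ) are involved in this single score.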

Joint decoding
Our joint CTC/attention approach combines the CTC and attention-based sequence probabilities in the inference step, as well as in the training step. Suppose $p_{\mathrm{ctc}}(C|X)$ in Eq. (2) and $p_{\mathrm{att}}(C|X)$ in Eq. (3) are the sequence probabilities given by CTC and the attention model. The decoding objective is defined similarly to Eq. (7) as
$$\hat{C} = \arg\max_{C \in \mathcal{U}^*} \{\lambda \log p_{\mathrm{ctc}}(C|X) + (1 - \lambda) \log p_{\mathrm{att}}(C|X)\}. \tag{14}$$
The CTC probability enforces a monotonic alignment that does not allow large jumps or looping of the same frames. Accordingly, it is possible to choose a hypothesis with a better alignment and exclude irrelevant hypotheses without relying on the coverage term, length penalty, or min/max lengths.
In the beam search process, the decoder needs to compute a score for each partial hypothesis using Eq. (9). However, it is nontrivial to combine the CTC and attention-based scores in the beam search, because the attention decoder performs it output-label-synchronously while CTC performs it frame-synchronously. To incorporate the CTC probabilities in the hypothesis score, we propose two methods.

Rescoring
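The first method is a two-pass approach, in which the first pass obtains a set of complete hypotheses using the beam search, where only the attention-based sequence probabilities are considered. The second pass rescores the complete hypotheses using the CTC and attention probabilities, where the CTC probabilities are obtained by the forward algorithm for CTC (Graves et al., 2006). The rescoring pass obtains the final result according to
$$\hat{C} = \arg\max_{C \in \hat{\Omega}} \{\lambda \alpha_{\mathrm{ctc}}(C, X) + (1 - \lambda) \alpha_{\mathrm{att}}(C, X)\}, \tag{15}$$
where $\alpha_{\mathrm{ctc}}(C, X) \triangleq \log p_{\mathrm{ctc}}(C|X)$ and $\alpha_{\mathrm{att}}(C, X) \triangleq \log p_{\mathrm{att}}(C|X)$.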
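A minimal Python sketch of the second pass follows. It assumes each complete hypothesis from the first pass already carries its CTC and attention log probabilities; the tuple layout, the function name `rescore`, and the numbers in the example are hypothetical and chosen only to show the effect of the weight λ.

```python
def rescore(hypotheses, lam):
    """Pick the hypothesis maximizing lam * log p_ctc + (1 - lam) * log p_att.

    hypotheses: list of (labels, log_p_ctc, log_p_att) tuples produced by
    an attention-only first pass (layout assumed for this sketch).
    """
    return max(
        hypotheses,
        key=lambda hyp: lam * hyp[1] + (1.0 - lam) * hyp[2],
    )


# Invented scores: the looping hypothesis "abab" looks good to the attention
# model but receives a very low CTC probability.
n_best = [
    ("abab", -30.0, -1.0),
    ("ab", -2.0, -1.5),
]
attention_only = rescore(n_best, lam=0.0)
joint = rescore(n_best, lam=0.1)
```

With λ = 0 the attention score alone selects the looping hypothesis, whereas even a small CTC weight is enough to prefer the well-aligned one. Note that rescoring can only choose among hypotheses that survived the first pass.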

One-pass decoding
The second method is one-pass decoding, in which we compute the probability of each partial hypothesis using CTC and an attention model. Here, we utilize the CTC prefix probability (Graves, 2008), defined as the cumulative probability of all label sequences that have the partial hypothesis $h$ as their prefix:
$$p_{\mathrm{ctc}}(h, \ldots|X) = \sum_{\nu \in (\mathcal{U} \cup \{\text{<eos>}\})^{+}} p_{\mathrm{ctc}}(h \cdot \nu|X), \tag{16}$$
and we define the CTC score as
$$\alpha_{\mathrm{ctc}}(h, X) \triangleq \log p_{\mathrm{ctc}}(h, \ldots|X), \tag{17}$$
where $\nu$ represents all possible label sequences except the empty string. The CTC score cannot be obtained recursively as in Eq. (9), but it can be computed efficiently by keeping the forward probabilities over the input frames for each partial hypothesis. Then it is combined with $\alpha_{\mathrm{att}}(h, X)$.
The beam search algorithm for one-pass decoding is shown in Algorithm 1.

Algorithm 1 Joint CTC/attention one-pass decoding
 1: procedure ONEPASSBEAMSEARCH(X, Lmax)
 2:   Ω_0 ← {<sos>}
 3:   Ω̂ ← ∅
 4:   for l = 1 . . . Lmax do
 5:     Ω_l ← ∅
 6:     while Ω_{l−1} ≠ ∅ do
 7:       g ← HEAD(Ω_{l−1})
 8:       DEQUEUE(Ω_{l−1})
 9:       for each c ∈ U ∪ {<eos>} do
10:         h ← g · c
11:         α(h, X) ← λ α_ctc(h, X) + (1 − λ) α_att(h, X)
12:         if c = <eos> then
13:           ENQUEUE(Ω̂, h)
14:         else
15:           ENQUEUE(Ω_l, h)
16:           if |Ω_l| > beamWidth then
17:             REMOVEWORST(Ω_l)
18:           end if
19:         end if
20:       end for
21:     end while
22:     if ENDDETECT(Ω̂, l) = true then
23:       break
24:     end if
25:   end for
26:   return arg max_{h ∈ Ω̂} α(h, X)
27: end procedure

$\Omega_l$ and $\hat{\Omega}$ are initialized in lines 2 and 3 of the algorithm, and are implemented as queues that accept partial hypotheses of length $l$ and complete hypotheses, respectively. In lines 4-25, each partial hypothesis $g$ in $\Omega_{l-1}$ is extended by each label $c$ in the label set $\mathcal{U}$. Each extended hypothesis $h$ is scored in line 11, where the CTC and attention-based scores are obtained by $\alpha_{\mathrm{ctc}}()$ and $\alpha_{\mathrm{att}}()$. After that, if $c$ = <eos>, the hypothesis $h$ is assumed to be complete and is stored in $\hat{\Omega}$ in line 13. If $c \neq$ <eos>, $h$ is stored in $\Omega_l$ in line 15, and the number of hypotheses in $\Omega_l$ is checked in line 16. If the number exceeds the beam width, the hypothesis with the worst score in $\Omega_l$ is removed by REMOVEWORST() in line 17.
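The control flow of the one-pass search can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: `alpha_ctc` and `alpha_att` are hypothetical callables that return the CTC and attention scores of a hypothesis, pruning is done by sorting and truncating each level instead of calling REMOVEWORST after every insertion (which keeps the same hypotheses), and end detection is omitted. The toy scorers are invented for the example.

```python
def one_pass_beam_search(alpha_ctc, alpha_att, labels, lam,
                         beam_width, max_len, sos="<sos>", eos="<eos>"):
    """Beam search scoring each hypothesis with a weighted CTC/attention sum."""
    beam = [(sos,)]
    complete = {}
    for _ in range(max_len):
        scored = []
        for g in beam:
            for c in list(labels) + [eos]:
                h = g + (c,)
                score = lam * alpha_ctc(h) + (1.0 - lam) * alpha_att(h)
                if c == eos:
                    complete[h] = score
                else:
                    scored.append((score, h))
        # Keep the `beam_width` best partial hypotheses of this length.
        scored.sort(reverse=True)
        beam = [h for _, h in scored[:beam_width]]
    if not complete:
        return None
    return max(complete, key=complete.get)


TARGET = ("a", "b")


def toy_alpha_ctc(h):
    """Hypothetical CTC score: 0 if consistent with TARGET, else -20."""
    body = h[1:]
    if body and body[-1] == "<eos>":
        return 0.0 if body[:-1] == TARGET else -20.0
    return 0.0 if TARGET[:len(body)] == body else -20.0


def toy_alpha_att(h):
    """Hypothetical attention score that favors ending too early."""
    return sum(-0.1 if c == "<eos>" else -1.0 for c in h[1:])


att_only = one_pass_beam_search(toy_alpha_ctc, toy_alpha_att, ["a", "b"],
                                lam=0.0, beam_width=2, max_len=4)
joint = one_pass_beam_search(toy_alpha_ctc, toy_alpha_att, ["a", "b"],
                             lam=0.5, beam_width=2, max_len=4)
```

In this toy setting the attention-only search returns an empty transcript because the invented attention scorer rewards an early <eos>, while the joint score keeps the hypothesis that agrees with the CTC alignment. Unlike rescoring, the CTC term here also influences which partial hypotheses stay in the beam.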
In line 11, the CTC and attention model scores are computed for each partial hypothesis. The attention score is easily obtained in the same manner as Eq. (9), whereas the CTC score requires a modified forward algorithm that computes it label-synchronously. The algorithm to compute the CTC score is summarized in Appendix B. By considering the attention and CTC scores during the beam search, partial hypotheses with irregular alignments can be excluded, and the number of search errors is reduced.
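The label-synchronous computation of the CTC prefix probability can be sketched as below. This is an illustrative Python sketch of a standard prefix-probability recursion, written under the assumption that it corresponds to the procedure in Appendix B, which is not reproduced here. For each partial hypothesis it keeps two forward variables per frame: the probability of alignments ending in the last label and the probability of alignments ending in a blank. The function names, the use of linear-domain probabilities, and blank id 0 are choices made for this sketch; the CTC score would be the logarithm of the returned value.

```python
def initial_state(posteriors, blank=0):
    """Forward variables for the empty prefix: only blanks so far."""
    gamma_n = [0.0] * len(posteriors)   # alignments ending in a label
    gamma_b = []                        # alignments ending in a blank
    prod = 1.0
    for row in posteriors:
        prod *= row[blank]
        gamma_b.append(prod)
    return gamma_n, gamma_b


def extend(posteriors, state, last_label, c, blank=0):
    """Extend a prefix g (ending in last_label) by label c.

    Returns (prefix probability of g + [c], forward variables of g + [c]).
    """
    gn_g, gb_g = state
    num_frames = len(posteriors)
    gn = [0.0] * num_frames
    gb = [0.0] * num_frames
    # c can be emitted at the first frame only if g is empty.
    gn[0] = posteriors[0][c] if last_label is None else 0.0
    psi = gn[0]
    for t in range(1, num_frames):
        # Ways to start emitting c at frame t after g ended at t - 1;
        # a repeated label needs an intervening blank.
        phi = gb_g[t - 1] + (0.0 if last_label == c else gn_g[t - 1])
        gn[t] = (gn[t - 1] + phi) * posteriors[t][c]
        gb[t] = (gb[t - 1] + gn[t - 1]) * posteriors[t][blank]
        psi += phi * posteriors[t][c]
    return psi, (gn, gb)


def ctc_prefix_prob(posteriors, prefix, blank=0):
    """Probability that the CTC output starts with `prefix`."""
    state = initial_state(posteriors, blank)
    psi, last = 1.0, None
    for c in prefix:
        psi, state = extend(posteriors, state, last, c, blank)
        last = c
    return psi


toy_posteriors = [[0.4, 0.6], [0.3, 0.7]]
p_prefix = ctc_prefix_prob(toy_posteriors, [1])
```

In a beam search, the forward variables returned by `extend` would be stored with each partial hypothesis so that extending it by one label costs a single pass over the frames.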
We can optionally apply an end detection technique to reduce the computation by stopping the beam search before l reaches L max . Function ENDDETECT(Ω, l) in line 22 returns true if there is little chance of finding complete hypotheses with higher scores as l increases in the future.

Experiments
We used Japanese and Mandarin Chinese ASR benchmarks to show the effectiveness of the proposed joint CTC/attention decoding approach. The main reason for choosing these two languages is that these ideogram-based languages have relatively shorter letter sequences than alphabetic languages, which greatly reduces computational complexity and makes it easy to handle context information in a decoder network. Our preliminary investigation shows that Japanese and Mandarin Chinese end-to-end ASR can be easily scaled up, and shows state-of-the-art performance without using the various tricks developed for English tasks. Also, we would like to emphasize that the system did not use language-specific processing (e.g., morphological analyzer, Pinyin dictionary), and simply used all characters that appear in the transcriptions, including Japanese syllabic characters and Kanji, Chinese characters, Arabic numerals, and alphabet characters, as they are.

Corpus of Spontaneous Japanese (CSJ)
We conducted ASR experiments using the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000). CSJ is a standard Japanese ASR task based on a collection of monologue speech data including academic lectures and simulated presentations. It has a total of 581 hours of training data and three types of evaluation data, where each evaluation task consists of 10 lectures (5 hours in total). As input features, we used 40 mel-scale filterbank coefficients with their first and second order temporal derivatives, giving a 120-dimensional feature vector per frame. The encoder was a 4-layer BLSTM with 320 cells in each layer and direction, and each BLSTM layer was followed by a linear projection layer. The 2nd and 3rd bottom layers of the encoder read every second hidden state in the network below, reducing the utterance length by a factor of 4. We used the content-based attention mechanism (Chorowski et al., 2015), where 10 centered convolution filters of width 100 were used to extract the convolutional features. The decoder network was a 1-layer LSTM with 320 cells. The AdaDelta algorithm (Zeiler, 2012) with gradient clipping (Pascanu et al., 2012) was used for the optimization. $D_{\mathrm{end}}$ and $M$ in Eq. (18) were set to $\log 10^{-10}$ and 3, respectively. The hybrid CTC/attention ASR was implemented by using the Chainer deep learning toolkit (Tokui et al., 2015).
Table 1 first compares the character error rate (CER) for conventional attention and MTL based end-to-end ASR without the joint decoding. $\lambda$ in Eq. (7) was set to 0.1. When decoding, we manually set the minimum and maximum lengths of output sequences to 0.025 and 0.15 times the input sequence lengths, respectively. The length penalty $\gamma$ in Eq. (11) was set to 0.1. Multitask learning (MTL) significantly outperformed attention-based ASR in all evaluation tasks, which confirms the effectiveness of a hybrid CTC/attention architecture.
Table 1 also shows that joint decoding, described in Section 3.2, further improved the performance without setting any search parameters (maximum and minimum lengths, length penalty), but only setting a weight parameter $\lambda = 0.1$ in Eq. (15), similar to the MTL case. Figure 2 also shows the dependency of the CER on $\lambda$ for the CSJ evaluation tasks, and shows that the performance was not very sensitive to $\lambda$ if we set $\lambda$ around the value we used for MTL (i.e., 0.1).
We also compare the performance of the proposed MTL-large, which has a larger network (5-layer encoder network), with the conventional state-of-the-art techniques obtained by using linguistic resources. The state-of-the-art CERs of GMM discriminative training and DNN-sMBR/HMM systems are obtained from the Kaldi recipe (Moriya et al., 2015) and a system based on syllable-based CTC with MAP decoding (Kanda et al., 2016). The Kaldi recipe systems use academic lectures (236h) for AM training and all training-data transcriptions for LM training. Unlike the proposed method, these methods use linguistic resources including a morphological analyzer, pronunciation dictionary, and language model. Note that since the amount of training data and experimental configurations of the proposed and reference methods are different, it is difficult to compare the performance listed in the table directly. However, since the CERs of the proposed method are superior to those of the best reference results, we can state that the proposed method achieves state-of-the-art performance.

Table 1 (excerpt): training data in hours, followed by CER (%) on the three CSJ evaluation tasks.
Model                                    Hours                    Task 1  Task 2  Task 3
[...]                                    [...]                    [...]   6.2     6.9
MTL-large + joint decoding (one pass)    581                      8.4     6.1     6.9
GMM-discr. (Moriya et al., 2015)         236 for AM, 581 for LM   11.2    9.2     12.1
DNN/HMM (Moriya et al., 2015)            236 for AM, 581 for LM   9.0     7.2     9.6
CTC-syllable (Kanda et al., 2016)        581                      9.4     7.3     7.5

Mandarin telephone speech
We conducted ASR experiments on HKUST Mandarin Chinese conversational telephone speech recognition (MTS) (Liu et al., 2006). It has a 5-hour recording for evaluation, and we extracted 5 hours from the training data as a development set, and used the rest (167 hours) as a training set. All experimental conditions were the same as those in Section 4.1, except that we used $\lambda = 0.5$ in training and decoding instead of 0.1, based on our preliminary investigation, and 80 mel-scale filterbank coefficients with pitch features, as suggested in (Miao et al., 2016). In decoding, we also added a result of the coverage-term based decoding (Chorowski and Jaitly, 2016), as discussed in Section 3.2 ($\eta = 1.5$, $\tau = 0.5$, $\gamma = -0.6$ for the attention model and $\eta = 1.0$, $\tau = 0.5$, $\gamma = -0.1$ for MTL), since it was difficult to eliminate the irregular alignments during decoding by only tuning the maximum and minimum lengths and length penalty (we set the minimum and maximum lengths of output sequences to 0.0 and 0.1 times the input sequence lengths, respectively, and set $\gamma = 0.6$ in Table 2). Table 2 shows the effectiveness of MTL and joint decoding over the attention-based approach, and in particular the significant improvement obtained by the joint CTC/attention decoding. Similar to the CSJ experiments in Section 4.1, we did not use the length-penalty term or the coverage term in joint decoding. This is an advantage of joint decoding over conventional approaches that require many tuning parameters. We also generated more training data by linearly scaling the audio lengths by factors of 0.9 and 1.1 (speed perturb.). The final model achieved a CER of 29.9% without using linguistic resources, which outperforms moderate state-of-the-art systems including CTC-based methods.

Decoding speed
We evaluated the speed of the joint decoding methods described in Section 3.2.3. ASR decoding was performed with different beam widths of 1, 3, 5, 10, and 20, and the processing time and CER were measured using a computer with Intel(R) Xeon(R) processors, E5-2690 v3, 2.6 GHz. Although the processors were multicore CPUs and the computer had GPUs, we ran the decoding program as a single-threaded process on a CPU. Table 3 shows the relationships between the real-time factor (RTF) and the CER for the CSJ and HKUST tasks. We evaluated the rescoring and one-pass decoding methods when using the end detection in Eq. (18). For every beam width, we can see that the one-pass method runs faster with an equal or lower CER than the rescoring method. This result demonstrates that the one-pass decoding is effective for reducing search errors. Finally, we achieved 1xRT with one-pass decoding when using a beam width of around 3 to 5, even though it was a single-threaded process on a CPU. However, the decoding process has not yet achieved real-time ASR, since CTC and the attention mechanism need to access all of the frames of the input utterance even when predicting the first label. This is an essential problem of most end-to-end ASR approaches and will be addressed in future work.

Summary and discussion
This paper proposes end-to-end ASR using joint CTC/attention decoding, which outperformed ordinary attention-based end-to-end ASR by solving the misalignment issues. The joint decoding methods reduced most of the irregular alignments, which can be confirmed from the examples of recognition errors and alignment plots shown in Appendix C.
The proposed end-to-end ASR does not require linguistic resources, such as a morphological analyzer, pronunciation dictionary, and language model, which are essential components of conventional Japanese and Mandarin Chinese ASR systems. Nevertheless, the method achieved performance comparable or superior to the state-of-the-art conventional systems for the CSJ and MTS tasks. In addition, the proposed method does not require GMM/HMM construction for initial alignments, DNN pre-training, lattice generation for sequence discriminative training, or complex search in decoding (e.g., an FST decoder or a lexical tree search based decoder). Thus, the method greatly simplifies the ASR building process, reducing code size and complexity.
Future work will apply this technique to other languages, including English, where we have to solve the issue of long sequence lengths, which incurs a heavy computation cost and makes it difficult to train a decoder network. Neural machine translation handles this issue by using a subword unit (concatenating several letters to form a new subword unit) (Wu et al., 2016), which would be a promising direction for end-to-end ASR.