A Purely End-to-End System for Multi-speaker Speech Recognition

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.


Introduction
Conventional automatic speech recognition (ASR) systems recognize a single utterance given a speech signal, in a one-to-one transformation. However, restricting the use of ASR systems to situations with only a single speaker limits their applicability. Recently, there has been growing interest in single-channel multi-speaker speech recognition, which aims at generating multiple transcriptions from a single-channel mixture of multiple speakers' speech (Cooke et al., 2009).
To achieve this goal, several previous works have considered a two-step procedure in which the mixed speech is first separated, and recognition is then performed on each separated speech signal (Isik et al., 2016; Chen et al., 2017). Dramatic advances have recently been made in speech separation, via the deep clustering framework (Isik et al., 2016), hereafter referred to as DPCL. DPCL trains a deep neural network to map each time-frequency (T-F) unit to a high-dimensional embedding vector such that the embeddings for the T-F unit pairs dominated by the same speaker are close to each other, while those for pairs dominated by different speakers are farther away. The speaker assignment of each T-F unit can thus be inferred from the embeddings by simple clustering algorithms, to produce masks that isolate each speaker. The original method using k-means clustering was extended to allow end-to-end training by unfolding the clustering steps using a permutation-free mask inference objective (Isik et al., 2016). An alternative approach is to perform direct mask inference using the permutation-free objective function with networks that directly estimate the labels for a fixed number of sources. Direct mask inference was first used in Hershey et al. (2016) as a baseline method, but without showing good performance. This approach was revisited in Kolbaek et al. (2017) under the name permutation-invariant training (PIT). Combination of such single-channel speaker-independent multi-speaker speech separation systems with ASR was first considered in Isik et al. (2016) using a conventional Gaussian Mixture Model/Hidden Markov Model (GMM/HMM) system. Combination with an end-to-end ASR system was recently proposed in (Settle et al., 2018). Both these approaches either trained or pre-trained the source separation and ASR networks separately, making use of mixtures and their corresponding isolated clean source references.
While the latter approach could in principle be trained without references for the isolated speech signals, the authors found it difficult to train from scratch in that case. This ability can nonetheless be used when adapting a pre-trained network to new data without such references.
In contrast with this two-stage approach, Qian et al. (2017) considered direct optimization of a deep-learning-based ASR recognizer without an explicit separation module. The network is optimized based on a permutation-free objective defined using the cross-entropy between the system's hypotheses and reference labels. The best permutation between hypotheses and reference labels in terms of cross-entropy is selected and used for backpropagation. However, this method still requires reference labels in the form of senone alignments, which have to be obtained on the clean isolated sources using a single-speaker ASR system. As a result, this approach still requires the original separated sources. As a general caveat, generation of multiple hypotheses in such a system requires the number of speakers handled by the neural network architecture to be determined before training. However, Qian et al. (2017) reported that the recognition of two-speaker mixtures using a model trained for three-speaker mixtures showed almost identical performance with that of a model trained on two-speaker mixtures. Therefore, it may be possible in practice to determine an upper bound on the number of speakers. Chen et al. (2018) proposed a progressive training procedure for a hybrid system with explicit separation motivated by curriculum learning. They also proposed self-transfer learning and multi-output sequence discriminative training methods for fully exploiting pairwise speech and preventing competing hypotheses, respectively.
In this paper, we propose to circumvent the need for the corresponding isolated speech sources when training on a set of mixtures, by using an end-to-end multi-speaker speech recognition system without an explicit speech separation stage. In separation-based systems, the spectrogram is segmented into complementary regions according to sources, which generally ensures that different utterances are recognized for each speaker. Without this complementarity constraint, our direct multi-speaker recognition system could be susceptible to redundant recognition of the same utterance. In order to prevent degenerate solutions in which the generated hypotheses are similar to each other, we introduce a new objective function that enhances contrast between the network's representations of each source. We also propose a training procedure to provide permutation invariance with low computational cost, by taking advantage of the joint CTC/attention-based encoder-decoder network architecture proposed in (Hori et al., 2017a). Experimental results show that the proposed model is able to directly convert an input speech mixture into multiple label sequences without requiring any explicit intermediate representations. In particular, no frame-level training labels, such as phonetic alignments or corresponding unmixed speech, are required. We evaluate our model on spontaneous English and Japanese tasks and obtain comparable results to the DPCL-based method with explicit separation (Settle et al., 2018).
2 Single-speaker end-to-end ASR

Attention-based encoder-decoder network
An attention-based encoder-decoder network (Bahdanau et al., 2016) predicts a target label sequence $Y = (y_1, \dots, y_N)$ without requiring intermediate representation from a $T$-frame sequence of $D$-dimensional input feature vectors, $O = (o_t \in \mathbb{R}^D \mid t = 1, \dots, T)$, and the past label history. The probability of the $n$-th label $y_n$ is computed by conditioning on the past history $y_{1:n-1}$:

$$p_{\mathrm{att}}(Y \mid O) = \prod_{n=1}^{N} p_{\mathrm{att}}(y_n \mid O, y_{1:n-1}). \tag{1}$$

The model is composed of two main sub-modules, an encoder network and a decoder network. The encoder network transforms the input feature vector sequence into a high-level representation $H = (h_l \in \mathbb{R}^C \mid l = 1, \dots, L)$. The decoder network emits labels based on the label history $y$ and a context vector $c$ calculated using an attention mechanism which weights and sums the $C$-dimensional sequence of representation $H$ with attention weight $a$. A hidden state $e$ of the decoder is updated based on the previous state, the previous context vector, and the emitted label. This mechanism is summarized as follows:

$$H = \mathrm{Encoder}(O), \tag{2}$$
$$y_n \sim \mathrm{Decoder}(c_n, y_{n-1}), \tag{3}$$
$$c_n, a_n = \mathrm{Attention}(a_{n-1}, e_n, H), \tag{4}$$
$$e_n = \mathrm{Update}(e_{n-1}, c_{n-1}, y_{n-1}). \tag{5}$$

At inference time, the previously emitted labels are used. At training time, they are replaced by the reference label sequence $R = (r_1, \dots, r_N)$ in a teacher-forcing fashion, leading to conditional probability $p_{\mathrm{att}}(Y_R \mid O)$, where $Y_R$ denotes the output label sequence variable in this condition. The detailed definitions of Attention and Update are described in Section A of the supplementary material. The encoder and decoder networks are trained to maximize the conditional probability of the reference label sequence $R$ using backpropagation:

$$\mathcal{L}_{\mathrm{att}} = \mathrm{Loss}_{\mathrm{att}}(Y_R, R) \triangleq -\log p_{\mathrm{att}}(Y_R = R \mid O), \tag{6}$$

where $\mathrm{Loss}_{\mathrm{att}}$ is the cross-entropy loss function.

Joint CTC/attention-based encoder-decoder network
The joint CTC/attention approach (Kim et al., 2017; Hori et al., 2017a) uses the connectionist temporal classification (CTC) objective function (Graves et al., 2006) as an auxiliary task to train the network. CTC formulates the conditional probability by introducing a framewise label sequence $Z$ consisting of a label set $\mathcal{U}$ and an additional blank symbol, defined as $Z = \{z_l \in \mathcal{U} \cup \{\text{'blank'}\} \mid l = 1, \dots, L\}$:

$$p_{\mathrm{ctc}}(Y \mid O) = \sum_{Z} \prod_{l=1}^{L} p(z_l \mid z_{l-1}, Y)\, p(z_l \mid O), \tag{7}$$

where $p(z_l \mid z_{l-1}, Y)$ represents monotonic alignment constraints in CTC and $p(z_l \mid O)$ is the frame-level label probability computed by

$$p(z_l \mid O) = \mathrm{Softmax}(\mathrm{Linear}(h_l)), \tag{8}$$

where $h_l$ is the hidden representation generated by an encoder network, here taken to be the encoder of the attention-based encoder-decoder network defined in Eq. (2), and $\mathrm{Linear}(\cdot)$ is the final linear layer of the CTC to match the number of labels. Unlike the attention model, the forward-backward algorithm of CTC enforces monotonic alignment between the input speech and the output label sequences during training and decoding. We adopt the joint CTC/attention-based encoder-decoder network as the monotonic alignment helps the separation and extraction of high-level representation. The CTC loss is calculated as:

$$\mathcal{L}_{\mathrm{ctc}} = \mathrm{Loss}_{\mathrm{ctc}}(Y, R) \triangleq -\log p_{\mathrm{ctc}}(Y = R \mid O). \tag{9}$$

The CTC loss and the attention-based encoder-decoder loss are combined with an interpolation weight $\lambda \in [0, 1]$:

$$\mathcal{L}_{\mathrm{mtl}} = \lambda \mathcal{L}_{\mathrm{ctc}} + (1 - \lambda) \mathcal{L}_{\mathrm{att}}. \tag{10}$$

Both CTC and encoder-decoder networks are also used in the inference step. The final hypothesis is a sequence that maximizes a weighted conditional probability of CTC in Eq. (7) and the attention-based encoder-decoder network in Eq. (1):

$$\hat{Y} = \arg\max_{Y} \left\{ \gamma \log p_{\mathrm{ctc}}(Y \mid O) + (1 - \gamma) \log p_{\mathrm{att}}(Y \mid O) \right\}, \tag{11}$$

where $\gamma \in [0, 1]$ is an interpolation weight.
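To make the sum over framewise label sequences in Eq. (7) concrete, the following is a minimal sketch of the standard CTC forward recursion (Graves et al., 2006). It assumes the usual CTC alignment rules (optional blanks between labels, mandatory blank between repeated labels); the function name `ctc_likelihood` and the toy probabilities are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def ctc_likelihood(probs, labels, blank=0):
    """Sum the probability of all framewise alignments that collapse to `labels`.

    probs:  (L, K) array of frame-level label probabilities p(z_l | O).
    labels: list of label indices without blanks.
    """
    # Extended label sequence: blank, y_1, blank, y_2, ..., y_N, blank.
    ext = [blank]
    for y in labels:
        ext += [y, blank]
    num_frames, num_states = probs.shape[0], len(ext)

    alpha = np.zeros((num_frames, num_states))
    alpha[0, 0] = probs[0, ext[0]]
    if num_states > 1:
        alpha[0, 1] = probs[0, ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            total = alpha[t - 1, s]                  # stay on the same state
            if s >= 1:
                total += alpha[t - 1, s - 1]         # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]

    # Valid alignments end on the last label or on the trailing blank.
    tail = alpha[-1, -2] if num_states > 1 else 0.0
    return alpha[-1, -1] + tail

# Toy example: 2 frames, 2 symbols (blank and one label), uniform posteriors.
uniform = np.full((2, 2), 0.5)
p_single = ctc_likelihood(uniform, [1])
loss_ctc = -np.log(p_single)   # corresponds to Eq. (9) for this toy case
```

In the toy case, three of the four possible two-frame alignments collapse to the single label, so the likelihood is 0.75. A practical implementation would work in the log domain for numerical stability.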
3 Multi-speaker end-to-end ASR

Permutation-free training
In situations where the correspondence between the outputs of an algorithm and the references is an arbitrary permutation, neural network training faces a permutation problem. This problem was first addressed by deep clustering (Hershey et al., 2016), which circumvented it in the case of source separation by comparing the relationships between pairs of network outputs to those between pairs of labels. As a baseline for deep clustering, Hershey et al. (2016) also proposed another approach to address the permutation problem, based on an objective which considers all permutations of references when computing the error with the network estimates. This objective was later used in Isik et al. (2016) and Kolbaek et al. (2017). In the latter, it was referred to as permutation-invariant training.
This permutation-free training scheme extends the usual one-to-one mapping of outputs and labels for backpropagation to one-to-many by selecting the proper permutation of hypotheses and references, thus allowing the network to generate multiple independent hypotheses from a single-channel speech mixture. When a speech mixture contains speech uttered by $S$ speakers simultaneously, the network generates $S$ label sequence variables $Y^s = (y^s_1, \dots, y^s_{N_s})$ with $N_s$ labels from the $T$-frame sequence of $D$-dimensional input feature vectors $O$:

$$Y^s \sim g^s(O), \quad s = 1, \dots, S, \tag{12}$$

where the transformations $g^s$ are implemented as neural networks which typically share some components with each other. In the training stage, all possible permutations of the $S$ sequences $R^s = (r^s_1, \dots, r^s_{N_s})$ of $N_s$ reference labels are considered (considering permutations on the hypotheses would be equivalent), and the one leading to minimum loss is adopted for backpropagation. Let $\mathcal{P}$ denote the set of permutations on $\{1, \dots, S\}$. The final loss $\mathcal{L}$ is defined as

$$\mathcal{L} = \min_{\pi \in \mathcal{P}} \sum_{s=1}^{S} \mathrm{Loss}(Y^s, R^{\pi(s)}), \tag{13}$$

where $\pi(s)$ is the $s$-th element of a permutation $\pi$. For example, for two speakers, $\mathcal{P}$ includes two permutations $(1, 2)$ and $(2, 1)$, and the loss is defined as:

$$\mathcal{L} = \min\big(\mathrm{Loss}(Y^1, R^1) + \mathrm{Loss}(Y^2, R^2),\ \mathrm{Loss}(Y^1, R^2) + \mathrm{Loss}(Y^2, R^1)\big). \tag{14}$$

Figure 1 shows an overview of the proposed end-to-end multi-speaker ASR system. In the following Section 3.2, we describe an extension of the encoder network for the generation of multiple hidden representations. We further introduce a permutation assignment mechanism for reducing the computation cost in Section 3.3, and an additional loss function $\mathcal{L}_{\mathrm{KL}}$ for promoting the difference between hidden representations in Section 3.4.
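The minimum over permutations can be illustrated with a short sketch. It assumes the pairwise losses between every hypothesis and every reference have already been computed and stored in a matrix; the function name `permutation_free_loss` and the numbers are hypothetical.

```python
from itertools import permutations

def permutation_free_loss(loss_matrix):
    """Return (minimum total loss, best permutation).

    loss_matrix[s][v] holds Loss(Y^s, R^v) for hypothesis s and reference v
    (0-based indices here, whereas the text uses 1-based indices).
    """
    num_speakers = len(loss_matrix)
    best_loss, best_perm = None, None
    for perm in permutations(range(num_speakers)):
        total = sum(loss_matrix[s][perm[s]] for s in range(num_speakers))
        if best_loss is None or total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss, best_perm

# Two-speaker case: hypothesis 0 matches reference 1 and vice versa.
pairwise = [[5.0, 1.0],
            [2.0, 6.0]]
loss, perm = permutation_free_loss(pairwise)
```

For the two-speaker matrix above, the identity permutation costs 5.0 + 6.0 and the swapped permutation costs 1.0 + 2.0, so the swapped assignment is selected. The search enumerates all S! permutations, which is inexpensive for the small S considered in this paper.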

End-to-end permutation-free training
To make the network output multiple hypotheses, we consider a stacked architecture that combines both shared and unshared (or specific) neural network modules. The particular architecture we consider in this paper splits the encoder network into three stages: the first stage, also referred to as mixture encoder, processes the input mixture and outputs an intermediate feature sequence $H$; that sequence is then processed by $S$ independent encoder sub-networks which do not share parameters, also referred to as speaker-differentiating (SD) encoders, leading to $S$ feature sequences $H^s$; at the last stage, each feature sequence $H^s$ is independently processed by the same network, also referred to as recognition encoder, leading to $S$ final high-level representations $G^s$.

Figure 1: End-to-end multi-speaker speech recognition. We propose to use the permutation-free training for CTC and attention loss functions $\mathrm{Loss}_{\mathrm{ctc}}$ and $\mathrm{Loss}_{\mathrm{att}}$, respectively.
Let $u \in \{1, \dots, S\}$ denote an output index (corresponding to the transcription of the speech by one of the speakers), and $v \in \{1, \dots, S\}$ denote a reference index. Denoting by $\mathrm{Encoder}_{\mathrm{Mix}}$ the mixture encoder, $\mathrm{Encoder}^u_{\mathrm{SD}}$ the $u$-th speaker-differentiating encoder, and $\mathrm{Encoder}_{\mathrm{Rec}}$ the recognition encoder, an input sequence $O$ corresponding to an input mixture can be processed by the encoder network as follows:

$$H = \mathrm{Encoder}_{\mathrm{Mix}}(O), \tag{15}$$
$$H^u = \mathrm{Encoder}^u_{\mathrm{SD}}(H), \tag{16}$$
$$G^u = \mathrm{Encoder}_{\mathrm{Rec}}(H^u). \tag{17}$$

The motivation for designing such an architecture can be explained as follows, following analogies with the architectures in (Isik et al., 2016) and (Settle et al., 2018) where separation and recognition are performed explicitly in separate steps: the first stage in Eq. (15) corresponds to a speech separation module which creates embedding vectors that can be used to distinguish between the multiple sources; the speaker-differentiating second stage in Eq. (16) uses the first stage's output to disentangle each speaker's speech content from the mixture, and prepare it for recognition; the final stage in Eq. (17) corresponds to an acoustic model that encodes the single-speaker speech for final decoding.

The decoder network computes the conditional probabilities for each speaker from the $S$ outputs of the encoder network. In general, the decoder network uses the reference label $R$ as a history to generate the attention weights during training, in a teacher-forcing fashion. However, in the above permutation-free training scheme, the reference label to be attributed to a particular output is not determined until the loss function is computed, so we here need to run the attention decoder for all reference labels. We thus need to consider the conditional probability of the decoder output variable $Y^{u,v}$ for each output $G^u$ of the encoder network under the assumption that the reference label for that output is $R^v$:

$$p_{\mathrm{att}}(Y^{u,v} \mid O) = \prod_{n} p_{\mathrm{att}}(y^{u,v}_n \mid O, y^{u,v}_{1:n-1}), \tag{18}$$
$$c^{u,v}_n, a^{u,v}_n = \mathrm{Attention}(a^{u,v}_{n-1}, e^{u,v}_n, G^u), \tag{19}$$
$$e^{u,v}_n = \mathrm{Update}(e^{u,v}_{n-1}, c^{u,v}_{n-1}, r^v_{n-1}), \tag{20}$$
$$y^{u,v}_n \sim \mathrm{Decoder}(c^{u,v}_n, r^v_{n-1}). \tag{21}$$

The final loss is then calculated by considering all permutations of the reference labels as follows:

$$\mathcal{L}_{\mathrm{att}} = \min_{\pi \in \mathcal{P}} \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{att}}(Y^{s,\pi(s)}, R^{\pi(s)}). \tag{22}$$
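The data flow of the three encoder stages in Eqs. (15) to (17) can be sketched as follows. This is only a toy illustration under strong assumptions: each stage is replaced by a single frame-wise `tanh` layer with random weights (the paper uses VGG and BLSTM layers), no time subsampling is performed, and the helper names `make_layer` and `build_encoder` are hypothetical.

```python
import numpy as np

def make_layer(rng, d_in, d_out):
    """Toy frame-wise layer standing in for a VGG or BLSTM block."""
    weights = rng.uniform(-0.1, 0.1, size=(d_in, d_out))
    return lambda frames: np.tanh(frames @ weights)

def build_encoder(rng, d_in, d_hid, num_speakers):
    encoder_mix = make_layer(rng, d_in, d_hid)            # shared
    encoders_sd = [make_layer(rng, d_hid, d_hid)          # one per output,
                   for _ in range(num_speakers)]          # parameters not shared
    encoder_rec = make_layer(rng, d_hid, d_hid)           # shared

    def encode(O):
        H = encoder_mix(O)                                # Eq. (15)
        H_u = [sd(H) for sd in encoders_sd]               # Eq. (16)
        return [encoder_rec(h) for h in H_u]              # Eq. (17)

    return encode

rng = np.random.default_rng(0)
encode = build_encoder(rng, d_in=249, d_hid=16, num_speakers=2)
O = rng.standard_normal((50, 249))    # 50 frames of 249-dimensional features
G = encode(O)                         # list of S representations G^u
```

Only the speaker-differentiating stage holds separate parameters per output, so the two representations in `G` differ solely because of that middle stage.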

Reduction of permutation cost
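In order to reduce the computational cost, we fixed the permutation of the reference labels based on the minimization of the CTC loss alone, and used the same permutation for the attention mechanism as well. This is an advantage of using a joint CTC/attention-based end-to-end speech recognition. Permutation is performed only for the CTC loss by assuming synchronous output where the permutation is decided by the output of CTC:

$$\hat{\pi} = \arg\min_{\pi \in \mathcal{P}} \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{ctc}}(Y^s, R^{\pi(s)}), \tag{23}$$

where $Y^u$ is the output sequence variable corresponding to encoder output $G^u$. Attention-based decoding is then performed on the same hidden representations $G^u$, using teacher forcing with the labels determined by the permutation $\hat{\pi}$ that minimizes the CTC loss:

$$p_{\mathrm{att}}(Y^{u,\hat{\pi}(u)} \mid O) = \prod_{n} p_{\mathrm{att}}(y^{u,\hat{\pi}(u)}_n \mid O, y^{u,\hat{\pi}(u)}_{1:n-1}),$$
$$c^{u,\hat{\pi}(u)}_n, a^{u,\hat{\pi}(u)}_n = \mathrm{Attention}(a^{u,\hat{\pi}(u)}_{n-1}, e^{u,\hat{\pi}(u)}_n, G^u),$$
$$e^{u,\hat{\pi}(u)}_n = \mathrm{Update}(e^{u,\hat{\pi}(u)}_{n-1}, c^{u,\hat{\pi}(u)}_{n-1}, r^{\hat{\pi}(u)}_{n-1}),$$
$$y^{u,\hat{\pi}(u)}_n \sim \mathrm{Decoder}(c^{u,\hat{\pi}(u)}_n, r^{\hat{\pi}(u)}_{n-1}).$$

This corresponds to the "permutation assignment" in Fig. 1. In contrast with Eq. (18), we only need to run the attention-based decoding once for each output $G^u$ of the encoder network. The final loss is defined as the sum of two objective functions with interpolation $\lambda$:

$$\mathcal{L}_{\mathrm{mtl}} = \lambda \mathcal{L}_{\mathrm{ctc}} + (1 - \lambda) \mathcal{L}_{\mathrm{att}}, \tag{24}$$
$$\mathcal{L}_{\mathrm{ctc}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{ctc}}(Y^s, R^{\hat{\pi}(s)}), \tag{25}$$
$$\mathcal{L}_{\mathrm{att}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{att}}(Y^{s,\hat{\pi}(s)}, R^{\hat{\pi}(s)}). \tag{26}$$

At inference time, because both CTC and attention-based decoding are performed on the same encoder output $G^u$ and should thus pertain to the same speaker, their scores can be incorporated as follows:

$$\hat{Y}^u = \arg\max_{Y^u} \left\{ \gamma \log p_{\mathrm{ctc}}(Y^u \mid G^u) + (1 - \gamma) \log p_{\mathrm{att}}(Y^u \mid G^u) \right\}, \tag{27}$$

where $p_{\mathrm{ctc}}(Y^u \mid G^u)$ and $p_{\mathrm{att}}(Y^u \mid G^u)$ are obtained with the same encoder output $G^u$.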
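The saving from fixing the permutation with CTC can be sketched as below. The sketch assumes the pairwise CTC losses are available as a matrix and treats the teacher-forced attention decoder as an opaque callback; `assign_permutation`, `total_loss`, and `att_loss_fn` are hypothetical names, and the loss values are made up for illustration.

```python
from itertools import permutations

def assign_permutation(ctc_loss):
    """Pick the permutation minimizing the summed CTC loss.

    ctc_loss[u][v] holds Loss_ctc(Y^u, R^v) with 0-based indices.
    """
    num_speakers = len(ctc_loss)
    return min(permutations(range(num_speakers)),
               key=lambda pi: sum(ctc_loss[u][pi[u]] for u in range(num_speakers)))

def total_loss(ctc_loss, att_loss_fn, lam):
    """Interpolate CTC and attention losses under the CTC-chosen permutation.

    att_loss_fn(u, v) is assumed to run the teacher-forced decoder on G^u
    with reference R^v and return the attention loss.
    """
    pi_hat = assign_permutation(ctc_loss)
    num_speakers = len(ctc_loss)
    l_ctc = sum(ctc_loss[u][pi_hat[u]] for u in range(num_speakers))
    # Only S decoder runs are needed here, instead of S * S without the assignment.
    l_att = sum(att_loss_fn(u, pi_hat[u]) for u in range(num_speakers))
    return lam * l_ctc + (1.0 - lam) * l_att, pi_hat

# Record which (output, reference) pairs the decoder is actually run on.
decoder_calls = []
att_table = [[7.0, 1.0], [1.5, 6.0]]

def att_loss_fn(u, v):
    decoder_calls.append((u, v))
    return att_table[u][v]

loss, pi_hat = total_loss([[9.0, 2.0], [3.0, 8.0]], att_loss_fn, lam=0.1)
```

In this example the decoder callback is invoked for two pairs only, whereas evaluating every output against every reference would require four runs.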

Promoting separation of hidden vectors
A single decoder network is used to output multiple label sequences by independently decoding the multiple hidden vectors generated by the encoder network. In order for the decoder to generate multiple different label sequences, the encoder needs to generate sufficiently differentiated hidden vector sequences for each speaker. We propose to encourage this contrast among hidden vectors by introducing in the objective function a new term based on the negative symmetric Kullback-Leibler (KL) divergence. In the particular case of two-speaker mixtures, we consider the following additional loss function:

$$\mathcal{L}_{\mathrm{KL}} = -\eta \sum_{l=1}^{L} \left\{ \mathrm{KL}\big(\bar{G}^1(l) \,\|\, \bar{G}^2(l)\big) + \mathrm{KL}\big(\bar{G}^2(l) \,\|\, \bar{G}^1(l)\big) \right\}, \tag{28}$$

where $\eta$ is a small constant value, and $\bar{G}^u = (\mathrm{softmax}(G^u(l)) \mid l = 1, \dots, L)$ is obtained from the hidden vector sequence $G^u$ at the output of the recognition encoder $\mathrm{Encoder}_{\mathrm{Rec}}$ as in Fig. 1 by applying an additional frame-wise softmax operation in order to obtain a quantity amenable to a probability distribution.
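A minimal NumPy sketch of this negative symmetric KL term for two hidden vector sequences is given below. The small constant `eps` added inside the logarithm is an assumption for numerical safety and is not mentioned in the paper; the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def negative_symmetric_kl(G1, G2, eta=0.1, eps=1e-12):
    """Negative symmetric KL divergence between two (L, C) hidden sequences.

    A frame-wise softmax turns each hidden vector into a distribution, then
    KL(p || q) + KL(q || p) is summed over frames and scaled by -eta.
    """
    p, q = softmax(G1), softmax(G2)
    log_ratio = np.log(p + eps) - np.log(q + eps)
    # KL(p||q) + KL(q||p) = sum_c (p_c - q_c) * log(p_c / q_c) for each frame.
    symmetric_kl = np.sum((p - q) * log_ratio, axis=-1)
    return -eta * float(np.sum(symmetric_kl))

rng = np.random.default_rng(0)
G1 = rng.standard_normal((10, 8))
G2 = rng.standard_normal((10, 8))
loss_kl = negative_symmetric_kl(G1, G2)
```

Because the term is negated, minimizing it pushes the two frame-wise distributions apart; it is zero when the two sequences are identical and negative otherwise.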

Split of hidden vector for multiple hypotheses
Since the network maps acoustic features to label sequences directly, we consider various architectures to perform implicit separation and recognition effectively. As a baseline system, we use the concatenation of a VGG-motivated CNN network (Simonyan and Zisserman, 2014) (referred to as VGG) and a bi-directional long short-term memory (BLSTM) network as the encoder network. For the splitting point in the hidden vector computation, we consider two architectural variations as follows:
• Split by BLSTM: The hidden vector is split at the level of the BLSTM network. 1) The VGG network generates a single hidden vector $H$; 2) $H$ is fed into $S$ independent BLSTMs whose parameters are not shared with each other; 3) the output of each independent BLSTM $H^u$, $u = 1, \dots, S$, is further separately fed into a unique BLSTM, the same for all outputs. These steps correspond to Eqs. (15), (16), and (17), respectively.
• Split by VGG: The hidden vector is split at the level of the VGG network. The number of filters at the last convolution layer is multiplied by the number of mixtures $S$ in order to split the output into $S$ hidden vectors (as in Eq. (16)). The layers prior to the last VGG layer correspond to the network in Eq. (15), while the subsequent BLSTM layers implement the network in Eq. (17).

Experimental setup
We used English and Japanese speech corpora, WSJ (Wall Street Journal) (Consortium, 1994; Garofalo et al., 2007) and CSJ (Corpus of Spontaneous Japanese) (Maekawa, 2003). To show the effectiveness of the proposed models, we generated mixed speech signals from these corpora to simulate single-channel overlapped multi-speaker recording, and evaluated the recognition performance using the mixed speech data. For WSJ, we used WSJ1 SI284 for training, Dev93 for development, and Eval92 for evaluation. For CSJ, we followed the Kaldi recipe (Moriya et al., 2015) and used the full set of academic and simulated presentations for training, and the standard test sets 1, 2, and 3 for evaluation. We created new corpora by mixing two utterances with different speakers sampled from existing corpora. The detailed algorithm is presented in Section B of the supplementary material. The sampled pairs of two utterances are mixed at various signal-to-noise ratios (SNR) between 0 dB and 5 dB with a random starting point for the overlap. The durations of the original unmixed and generated mixed corpora are summarized in Table 1.
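The mixing step (a target SNR plus a random starting point for the overlap) can be sketched as follows. This is an illustration only: it assumes the SNR is defined as the power ratio of the first waveform to the second, computed over the unpadded samples, which the text does not specify, and the function name `mix_pair` is hypothetical. The utterance sampling with reuse counts is not covered here.

```python
import numpy as np

def mix_pair(x, y, snr_db, rng):
    """Mix two 1-D waveforms at a given SNR with a random overlap offset.

    Returns (mixture, gain), where gain is the factor applied to y so that
    10 * log10(power(x) / power(gain * y)) equals snr_db.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    gain = np.sqrt(np.mean(x ** 2) / (np.mean(y ** 2) * 10.0 ** (snr_db / 10.0)))
    y = gain * y

    length = max(len(x), len(y))

    def pad(signal):
        # The shorter signal is placed at a random offset inside the longer
        # one; the longer signal always gets offset 0.
        offset = int(rng.integers(0, length - len(signal) + 1))
        out = np.zeros(length)
        out[offset:offset + len(signal)] = signal
        return out

    return pad(x) + pad(y), gain

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 3.0 * rng.standard_normal(600)
snr_db = rng.uniform(0.0, 5.0)          # SNR drawn between 0 dB and 5 dB
mixture, gain = mix_pair(x, y, snr_db, rng)
```

The mixture has the duration of the longer input, which matches the description in Section B of the supplementary material.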

Network architecture
As input features, we used 80-dimensional log Mel filterbank coefficients with pitch features and their delta and delta-delta features (83 × 3 = 249 dimensions) extracted using Kaldi tools (Povey et al., 2011). The input features are normalized to zero mean and unit variance. As a baseline system, we used a stack of a 6-layer VGG network and a 7-layer BLSTM as the encoder network. Each BLSTM layer has 320 cells in each direction, and is followed by a linear projection layer with 320 units to combine the forward and backward LSTM outputs. The decoder network has a 1-layer LSTM with 320 cells. As described in Section 3.5, we adopted two types of encoder architectures for multi-speaker speech recognition. The network architectures are summarized in Table 2. The split-by-VGG network had speaker-differentiating encoders with a convolution layer (and the following max-pooling layer). The split-by-BLSTM network had speaker-differentiating encoders with two BLSTM layers. The architectures were adjusted to have the same number of layers. We used characters as output labels. The number of characters for WSJ was set to 49 including alphabets and special tokens (e.g., characters for space and unknown). The number of characters for CSJ was set to 3,315 including Japanese Kanji/Hiragana/Katakana characters and special tokens.

Optimization
The network was initialized randomly from a uniform distribution in the range -0.1 to 0.1. We used the AdaDelta algorithm (Zeiler, 2012) with gradient clipping (Pascanu et al., 2013) for optimization. We initialized the AdaDelta hyperparameters as $\rho = 0.95$ and $\epsilon = 10^{-8}$. $\epsilon$ is decayed by half when the loss on the development set degrades. The networks were implemented with Chainer (Tokui et al., 2015) and ChainerMN (Akiba et al., 2017). The optimization of the networks was done by synchronous data parallelism with 4 GPUs for WSJ and 8 GPUs for CSJ.
The networks were first trained on single-speaker speech, and then retrained with mixed speech. When training on unmixed speech, only one side of the network (with a single speaker-differentiating encoder) is optimized to output the label sequence of the single speaker. Note that only character labels are used, and there is no need for clean source reference corresponding to the mixed speech. When moving to mixed speech, the other speaker-differentiating encoders are initialized using the already trained one by copying the parameters with random perturbation, $w' = w \times (1 + \mathrm{Uniform}(-0.1, 0.1))$ for each parameter $w$. The interpolation value $\lambda$ for the multiple objectives in Eqs. (10) and (24) was set to 0.1 for WSJ and to 0.5 for CSJ. Lastly, the model is retrained with the additional negative KL divergence loss in Eq. (28) with $\eta = 0.1$.
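The perturbed initialization of the additional speaker-differentiating encoders can be sketched as below. The sketch assumes that an independent uniform factor is drawn for every scalar entry of every parameter array, which is one reading of "for each parameter w"; the parameter names and the function `perturbed_copy` are hypothetical.

```python
import numpy as np

def perturbed_copy(params, rng, scale=0.1):
    """Copy a dict of arrays, scaling each entry by (1 + Uniform(-scale, scale))."""
    return {
        name: w * (1.0 + rng.uniform(-scale, scale, size=w.shape))
        for name, w in params.items()
    }

rng = np.random.default_rng(0)
# Stand-in for the parameters of the already trained SD encoder.
trained_sd = {"W": np.ones((3, 4)), "b": np.full(4, 2.0)}
second_sd = perturbed_copy(trained_sd, rng)
```

The multiplicative form keeps every copied weight within 10% of its trained value while breaking the symmetry between the encoders, so that they can diverge during retraining on mixed speech.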

Decoding
In the inference stage, we combined a pretrained RNNLM (recurrent neural network language model) in parallel with the CTC and decoder network. Their label probabilities were linearly combined in the log domain during beam search to find the most likely hypothesis. For the WSJ task, we used both character- and word-level RNNLMs (Hori et al., 2017b), where the character model had a 1-layer LSTM with 800 cells and an output layer for 49 characters. The word model had a 1-layer LSTM with 1000 cells and an output layer for 20,000 words, i.e., the vocabulary size was 20,000. Both models were trained with the WSJ text corpus. For the CSJ task, we used a character-level RNNLM (Hori et al., 2017c), which had a 1-layer LSTM with 1000 cells and an output layer for 3,315 characters. The model parameters were trained with the transcript of the training set in CSJ. We added language model probabilities with an interpolation factor of 0.6 for the character-level RNNLM and 1.2 for the word-level RNNLM. The beam width for decoding was set to 20 in all the experiments. Interpolation $\gamma$ in Eqs. (11) and (27) was set to 0.4 for WSJ and 0.5 for CSJ.
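The log-domain combination used during beam search can be illustrated with a small sketch. The exact way the language model score enters the search is not spelled out in the text, so the additive form with weight `beta` below is an assumption, and the function names and example scores are hypothetical.

```python
def combined_score(logp_ctc, logp_att, logp_lm, gamma=0.4, beta=0.6):
    """Log-linear combination of CTC, attention, and language model scores."""
    return gamma * logp_ctc + (1.0 - gamma) * logp_att + beta * logp_lm

def prune(hypotheses, beam_width=20):
    """Keep the beam_width best (label_sequence, score) pairs."""
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)[:beam_width]

# Score three partial hypotheses with made-up log-probabilities and keep two.
raw = {"ab": (-1.0, -2.0, -3.0), "ac": (-0.5, -4.0, -1.0), "ad": (-6.0, -5.0, -2.0)}
scored = [(labels, combined_score(*logps)) for labels, logps in raw.items()]
beam = prune(scored, beam_width=2)
```

The defaults mirror the WSJ settings quoted above (interpolation 0.4 and a character-level language model weight of 0.6).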

Evaluation of unmixed speech
First, we examined the performance of the baseline joint CTC/attention-based encoder-decoder network with the original unmixed speech data. Table 3 shows the character error rates (CERs), where the baseline model showed 2.6% on WSJ and 7.8% on CSJ. Since the model was trained and evaluated with unmixed speech data, these CERs are considered lower bounds for the CERs in the succeeding experiments with mixed speech data.

Table 4 shows the CERs of the generated mixed speech from the WSJ corpus. The first column indicates the position of split as mentioned in Section 3.5. The second, third and fourth columns indicate CERs of the high energy speaker (HIGH E. SPK.), the low energy speaker (LOW E. SPK.), and the average (AVG.), respectively. The baseline model has very high CERs because it was trained as a single-speaker speech recognizer without permutation-free training, and it can only output one hypothesis for each mixed speech. In this case, the CERs were calculated by duplicating the generated hypothesis and comparing the duplicated hypotheses with the corresponding references. The proposed models, i.e., split-by-VGG and split-by-BLSTM networks, obtained significantly lower CERs than the baseline CERs, the split-by-BLSTM model in particular achieving 14.0% CER. This is an 83.1% relative reduction from the baseline model. The CER was further reduced to 13.7% by retraining the split-by-BLSTM model with the negative KL loss, a 2.1% relative reduction from the network without retraining. This result implies that the proposed negative KL loss provides better separation by actively improving the contrast between the hidden vectors of each speaker. Examples of recognition results are shown in Section C of the supplementary material.

Finally, we profiled the computation time for the permutations based on the decoder network and on CTC. Permutation based on CTC was 16.3 times faster than that based on the decoder network, in terms of the time required to determine the best match permutation given the encoder network's output in Eq. (17).

Table 5 shows the CERs for the mixed speech from the CSJ corpus. Similarly to the WSJ experiments, our proposed model significantly reduced the CER from the baseline, where the average CER was 14.9% and the reduction ratio from the baseline was 83.9%.

Visualization of hidden vectors
We show a visualization of the encoder network's outputs in Fig. 2 to illustrate the effect of the negative KL loss function. Principal component analysis (PCA) was applied to the hidden vectors on the vertical axis. Figures 2(a) and 2(b) show the hidden vectors generated by the split-by-BLSTM model without the negative KL divergence loss for an example mixture of two speakers. We can observe different activation patterns showing that the hidden vectors were successfully separated to the individual utterances in the mixed speech, although some activity from one speaker can be seen as leaking into the other. Figures 2(c) and 2(d) show the hidden vectors generated after retraining with the negative KL divergence loss. We can more clearly observe the different patterns and boundaries of activation and deactivation of hidden vectors. The negative KL loss appears to regularize the separation process, and even seems to help in finding the end-points of the speech.

Comparison with earlier work
We first compared the recognition performance with a hybrid (non end-to-end) system including DPCL-based speech separation and a Kaldi-based ASR system. It was evaluated under the same evaluation data and metric as in (Isik et al., 2016) based on the WSJ corpus. However, there are differences in the size of training data and the options in decoding step. Therefore, it is not a fully matched condition. Results are shown in Table 6. The word error rate (WER) reported in (Isik et al., 2016) is 30.8%, which was obtained with jointly trained DPCL and second-stage speech enhancement networks. The proposed end-to-end ASR gives an 8.4% relative reduction in WER even though our model does not require any explicit frame-level labels such as phonetic alignment, or clean signal reference, and does not use a phonetic lexicon for training. Although this is an unfair comparison, our purely end-to-end system outperformed a hybrid system for multi-speaker speech recognition.
Next, we compared our method with an end-to-end explicit separation and recognition network (Settle et al., 2018). We retrained our model previously trained on our WSJ-based corpus using the training data generated by Settle et al. (2018), because the direct optimization from scratch on their data caused poor recognition performance due to data size. Other experimental conditions are shared with the earlier work. Interestingly, our method showed comparable performance to the end-to-end explicit separation and recognition network, without having to pre-train using clean signal training references. It remains to be seen if this parity of performance holds in other tasks and conditions.

Related work
Several previous works have considered an explicit two-step procedure (Isik et al., 2016; Chen et al., 2017, 2018). In contrast with our work which uses a single objective function for ASR, they introduced an objective function to guide the separation of mixed speech. Qian et al. (2017) trained a multi-speaker speech recognizer using permutation-free training without an explicit objective function for separation. In contrast with our work which uses an end-to-end architecture, their objective function relies on a senone posterior probability obtained by aligning unmixed speech and text using a model trained as a recognizer for single-speaker speech. Compared with (Qian et al., 2017), our method directly maps a speech mixture to multiple character sequences and eliminates the need for the corresponding isolated speech sources for training.

Conclusions
In this paper, we proposed an end-to-end multi-speaker speech recognizer based on permutation-free training and a new objective function promoting the separation of hidden vectors in order to generate multiple hypotheses. In an encoder-decoder network framework, teacher forcing at the decoder network under multiple references increases computational cost if implemented naively. We avoided this problem by employing a joint CTC/attention-based encoder-decoder network.
Experimental results showed that the model is able to directly convert an input speech mixture into multiple label sequences under the end-to-end framework without the need for any explicit intermediate representation including phonetic alignment information or pairwise unmixed speech. We also compared our model with a method based on explicit separation using deep clustering, and showed comparable results. Future work includes data collection and evaluation in a real world scenario, since the data used in our experiments are simulated mixed speech, which is already extremely challenging but still leaves out some acoustic aspects, such as Lombard effects and real room impulse responses, that need to be addressed for further performance improvement. In addition, further study is required in terms of increasing the number of speakers that can be simultaneously recognized, and further comparison with the separation-based approach.
The encoder network in Eq. (2) maps the input feature vector sequence $O$ to the internal representation $H$ as follows:

$$H = \mathrm{Encoder}(O).$$

The decoder network sequentially generates the $n$-th label $y_n$ by taking the context vector $c_n$ and the label history $y_{1:n-1}$:

$$y_n \sim \mathrm{Decoder}(c_n, y_{n-1}).$$

The context vector is calculated in a location-based attention mechanism (Chorowski et al., 2015) which weights and sums the $C$-dimensional sequence of representation $H = (h_l \in \mathbb{R}^C \mid l = 1, \dots, L)$ with attention weight $a_{n,l}$:

$$c_n = \mathrm{Attention}(a_{n-1}, e_n, H) = \sum_{l=1}^{L} a_{n,l} h_l.$$
Algorithm 1: Generation of multi-speaker speech dataset
  n_reuse ⇐ maximum number of times the same utterance can be used.
  U ⇐ utterance set of the corpora.
  Sample utterance U_j from P(U) while ensuring that the speakers of U_i and U_j are different.
The location-based attention mechanism defines the weights $a_{n,l}$ as follows:

$$a_{n,l} = \frac{\exp(\alpha k_{n,l})}{\sum_{l'=1}^{L} \exp(\alpha k_{n,l'})},$$
$$k_{n,l} = w^{\top} \tanh(V_E e_n + V_H h_l + V_F f_{n,l} + b),$$
$$f_n = F * a_{n-1},$$

where $w$, $V_E$, $V_H$, $V_F$, $b$, $F$ are tunable parameters, $\alpha$ is a constant value called inverse temperature, and $*$ is the convolution operation. We used 10 convolution filters of width 200, and set $\alpha$ to 2.
The introduction of $f_n$ makes the attention mechanism take into account the previous alignment information. The hidden state $e$ is updated recursively by an updating LSTM function:

$$e_n = \mathrm{Update}(e_{n-1}, c_{n-1}, y_{n-1}).$$
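One step of the location-based attention described above can be sketched in NumPy as follows. This is a toy illustration with random parameters and small dimensions rather than the 10 filters of width 200 used in the experiments; the centered zero-padded convolution, the dictionary layout of the parameters, and the function name `location_attention` are assumptions.

```python
import numpy as np

def location_attention(a_prev, e_n, H, p, alpha=2.0):
    """One attention step; returns (context vector c_n, weights a_n).

    a_prev: (L,) previous attention weights, e_n: (E,) decoder state,
    H: (L, C) encoder outputs. p holds w (A,), V_E (A, E), V_H (A, C),
    V_F (A, K), b (A,), and F (K, width) convolution filters.
    """
    num_frames = H.shape[0]
    # f_n = F * a_{n-1}: convolve each filter with the previous weights and
    # keep the centered num_frames samples (zero padding at the edges).
    feats = []
    for filt in p["F"]:
        full = np.convolve(a_prev, filt, mode="full")
        start = (len(filt) - 1) // 2
        feats.append(full[start:start + num_frames])
    f = np.stack(feats, axis=1)                                   # (L, K)

    # k_{n,l} = w^T tanh(V_E e_n + V_H h_l + V_F f_{n,l} + b)
    pre = p["V_E"] @ e_n + H @ p["V_H"].T + f @ p["V_F"].T + p["b"]
    k = np.tanh(pre) @ p["w"]                                     # (L,)

    # a_{n,l}: softmax over frames with inverse temperature alpha.
    z = alpha * k
    a = np.exp(z - z.max())
    a /= a.sum()
    return a @ H, a

rng = np.random.default_rng(0)
L, C, E, A, K, W = 30, 8, 6, 5, 3, 7
p = {
    "w": rng.standard_normal(A),
    "V_E": rng.standard_normal((A, E)),
    "V_H": rng.standard_normal((A, C)),
    "V_F": rng.standard_normal((A, K)),
    "b": np.zeros(A),
    "F": rng.standard_normal((K, W)),
}
H = rng.standard_normal((L, C))
a0 = np.full(L, 1.0 / L)              # uniform initial attention weights
c1, a1 = location_attention(a0, rng.standard_normal(E), H, p)
```

A larger inverse temperature sharpens the attention weights toward the frames with the highest scores.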

B Generation of mixed speech
Each utterance of the corpus is mixed with a randomly selected utterance with the probability $P(U_k)$, which moderates over-selection of specific utterances. $P(U_k)$ is calculated in the first for-loop as a uniform probability. All utterances are used as one side of the mixture, and the other side is sampled from the distribution $P(U_k)$ in the second for-loop. The selected pairs of utterances are mixed at various signal-to-noise ratios (SNR) between 0 dB and 5 dB. We randomized the starting point of the overlap by padding the shorter utterance with silence whose duration is sampled from the uniform distribution within the length difference between the two utterances. Therefore, the duration of the mixed utterance is equal to that of the longer utterance among the unmixed speech. After the generation of the mixed speech, the count of selected utterances $C_j$ is decremented to prevent over-selection. All counts $C$ are initialized to $n_{\mathrm{reuse}}$, and we used $n_{\mathrm{reuse}} = 3$.

Table 7: Examples of recognition results. Errors are emphasized as capital letters. " " is a space character, and a special token "*" is inserted to pad deletion errors.
(1) Model w/ permutation-free training (CER of HYP1: 12.8%, HYP2: 0.9%)

Table 7 shows examples of recognition results. The first example (1) is one which accounts for a large portion of the evaluation set. The SNR of HYP1 is -1.55 dB and that of HYP2 is 1.55 dB. The network generates multiple hypotheses with a few substitution and deletion errors, but without any overlapped and swapped words. The second example (2) is one which leads to performance reduction. We can see that the network makes errors when there is a large difference in length between the two sequences. The word "thirty" of HYP2 is injected in HYP1, and there are deletion errors in HYP2. We added a negative KL divergence loss to mitigate this kind of error. However, there is further room to reduce errors by making unshared modules more cooperative.