PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding

We look into the task of generalizing word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words, without extra contextual information. We rely solely on the spellings of words and propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embeddings. We call the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords over all possible segmentations based on their likelihoods. Inspections and an affix prediction experiment show that PBoS is able to produce meaningful subword segmentations and subword rankings without any source of explicit morphological knowledge. Word similarity and POS tagging experiments show clear advantages of PBoS over previous subword-level models in the quality of the generated word embeddings across languages.


Introduction
Word embeddings pre-trained over large texts have demonstrated benefits for many NLP tasks, especially when labeled data for the task is scarce. However, many popular pre-trained sets of word embeddings assume fixed finite-size vocabularies, which hinders their ability to provide useful word representations for out-of-vocabulary (OOV) words. We look into the task of generalizing word embeddings: extrapolating a set of pre-trained word embeddings to words outside its fixed vocabulary, without extra access to contextual information (e.g. example sentences or a text corpus). In contrast, the more common task of learning word embeddings, or often just word embedding, is to obtain distributed representations of words directly from large unlabeled text. The motivation here is to extend the usefulness of pre-trained embeddings without expensive retraining over large text.
There have been works showing that contextual information can also help generalize word embeddings (for example, Khodak et al., 2018;Schick and Schütze, 2019a,b). We here, however, focus more on the research question of how much one can achieve from just word compositions. In addition, our proposed way of utilizing word composition information can be combined with the contextual embedding algorithms to further improve the performance of generalized embeddings.
The hidden assumption here is that words are made of meaningful parts (cf. morphemes) and that the meaning of a word is related to the meanings of its parts. This is how humans are often able to guess the meaning of a word or term they have never seen before. For example, "postEMNLP" probably means "after EMNLP". Different models have been proposed for this task of generalizing word embeddings using word composition, usually under the name of subword(-level) models. Stratos (2017), Pinter et al. (2017) and Kim et al. (2018b) model words at the character level. However, they have been surpassed by later subword-level models, probably because they put too much burden on the models to form and discover meaningful subwords from raw characters. Bag-of-subwords (BoS) is a simple yet effective model for learning (Bojanowski et al., 2017) and generalizing (Zhao et al., 2018) word embeddings. BoS composes a word embedding vector by taking the sum or average of the vectors of the subwords (character n-grams) that appear in the given word. However, it ignores the relative importance of different subwords, since all of them are given the same weight. Intuitively, "farm" and "land" should be more relevant in composing a representation for the word "farmland" than some random subword like "armla".
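For concreteness, the BoS composition just described can be sketched in a few lines; the n-gram range (3 to 6) and the averaging are illustrative choices here, not necessarily the exact settings of the cited works.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of `word` with lengths in [n_min, n_max]."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

def bos_embed(word, subword_vecs, dim):
    """Bag-of-Subwords: average the vectors of the word's n-grams that
    have an entry in the subword vector table; every n-gram gets the
    same weight, which is exactly the limitation PBoS addresses."""
    grams = [g for g in char_ngrams(word) if g in subword_vecs]
    if not grams:
        return [0.0] * dim
    acc = [0.0] * dim
    for g in grams:
        for k in range(dim):
            acc[k] += subword_vecs[g][k]
    return [x / len(grams) for x in acc]
```

Under this scheme, table entries for "farm" and "land" would contribute exactly as much to "farmland" as an entry for "armla", which motivates weighting subwords by their likelihood of being meaningful segments.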
Even more favorable would be a model's ability to discover meaningful subword segmentations on its own. Cotterell et al. (2016) bases their model over morphemes but needs help from an external morphological analyzer such as Morfessor (Virpioja et al., 2013). Sasaki et al. (2019) use trainable self-attention to combine subword vectors. While the attention implicitly facilitates interactions among subwords, there has been no explicit enforcement of mutual exclusiveness from subword segmentation, making it sometimes difficult to rule out less relevant subwords. For example, "her" is itself a likely subword, but is unlikely to be relevant for "higher" as the remaining "hig" is unlikely.
We propose the probabilistic bag-of-subwords (PBoS) model for generalizing word embedding. PBoS simultaneously models subword segmentation and composition of word representations out of subword representations. The subword segmentation part is a probabilistic model capable of handling ambiguity of subword boundaries and ranking possible segmentations based on their overall likelihood. For each segmentation, we compose a word vector as the sum of all subwords that appear in the segmentation. The final embedding vector is the expectation of the word vectors from all possible segmentations. An alternative view is that the model assigns word-specific weights to subwords based on how likely they appear as meaningful segments for the given word. Coupled with an efficient algorithm, our model is able to compose better word embedding vectors with little computational overhead compared to BoS.
Manual inspections show that PBoS is able to produce subword segmentations and subword weights that align with human intuition. Affix prediction experiment quantitatively shows that the subword weights given by PBoS are able to recover most eminent affixes of words with good accuracy.
To assess the quality of generated word embeddings, we evaluate with the intrinsic task of word similarity, which relates to semantics, as well as the extrinsic task of part-of-speech (POS) tagging, which requires rich information to determine each word's role in a sentence. The English word similarity experiment shows that PBoS improves the correlation scores over previous best models under various settings and is the only model that consistently improves over the target pre-trained embeddings. The POS tagging experiment over 23 languages shows that PBoS improves accuracy over the previous best models in all but one language, often by a big margin.
We summarize our contributions as follows:
• We propose PBoS, a subword-level word embedding model based on probabilistic segmentation of words into subwords, the first of its kind (Section 2).
• We propose an efficient algorithm that leads to an efficient implementation of PBoS with little overhead over the much simpler BoS (Section 3).
• Manual inspection and an affix prediction experiment show that PBoS is able to give reasonable subword segmentations and subword weights (Sections 4.1 and 4.2).
• Word similarity and POS tagging experiments show that word vectors generated by PBoS have better quality than those of previously proposed models across languages (Sections 4.3 and 4.4).

PBoS Model
Following the above intuition, in this section we describe the PBoS model in detail.
We first develop a model that segments a word into subwords and associates each subword segmentation with a likelihood based on the meaningfulness of its subword segments. We then apply BoS over each segmentation to compose a "segmentation vector". The final word embedding vector is the probabilistic expectation of all the segmentation vectors. The subword segmentation and likelihood association parts require no explicit source of morphological knowledge and are tightly integrated with the word vector composition part, which in turn gives rise to an efficient algorithm that considers all possible segmentations simultaneously (Section 3). The model can be trained by fitting a set of pre-trained word embeddings.

Terminology
For a given language, let $\Gamma$ be its alphabet. A word $w$ of length $l = |w|$ is a string made of $l$ letters in $\Gamma$, i.e. $w = c_1 c_2 \cdots c_l \in \Gamma^l$, where $w[i] = c_i$ is the $i$-th letter. Let $p_w \in [0, 1]$ be the probability that $w$ appears in the language. Empirically, this is proportional to the unigram frequency of word $w$ observed in large text in that language.
Note that we do not assume a vocabulary. That is, we do not distinguish words from arbitrary strings made out of the alphabet. The implicit assumption here is that a "word" in common sense is just a string associated with high probability. In this sense, p w can also be seen as the likelihood of string w being a "legit word". This blurs the boundary between words and non-words, and automatically enables us to handle unseen words, alternative spellings, typos, and nonce words as normal cases.
We say a string $s \in \Gamma^+$ is a subword of word $w$ if $s$ is a substring of $w$. The probability that subword $s$ appears in the language can then be defined as
$$p_s \propto \sum_{w} p_w \sum_{1 \le i \le j \le |w|} \mathbb{1}(w[i:j] = s), \tag{1}$$
where $\mathbb{1}(\mathrm{pred})$ gives 1 if pred holds and 0 otherwise. Note that a subword $s$ may occur more than once in the same word $w$. For example, subword "ana" occurs twice in the word "banana".
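Under the stated definition, subword probabilities can be estimated from a word frequency list as sketched below; the normalization over all substring occurrences is one plausible choice, as the exact normalizer of Eq. (1) is not fully recoverable from this text.

```python
from collections import Counter

def subword_probs(word_freqs):
    """Estimate p_s: weight each occurrence of substring s (counting
    multiplicity, e.g. "ana" occurs twice in "banana") by the unigram
    probability p_w of the word it occurs in, then normalize."""
    counts = Counter()
    total = sum(word_freqs.values())
    for w, freq in word_freqs.items():
        p_w = freq / total
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                counts[w[i:j]] += p_w
    z = sum(counts.values())
    return {s: c / z for s, c in counts.items()}
```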
A subword segmentation $g$ of word $w$, of length $k = |g|$, is a tuple $(s_1, s_2, \dots, s_k)$ of subwords of $w$, such that $w$ is the concatenation of $s_1, \dots, s_k$.

Probabilistic Subword Segmentation
A subword transition graph for word $w$ is a directed acyclic graph $G_w = (N_w, E_w)$. Let $l = |w|$. The vertices $N_w = \{0, \dots, l\}$ correspond to the positions between $w[i]$ and $w[i+1]$ for all $i \in [l-1]$, as well as to the beginning (vertex 0) and the end (vertex $l$) of the word. We use $G_w$ as a useful image for developing our model. Proposition 1. Paths from 0 to $|w|$ in $G_w$ are in one-to-one correspondence with segmentations of $w$.
Each edge $(i, j)$ is associated with a weight $p_{w[i:j]}$, the likelihood that $w[i:j]$ itself is a meaningful subword. We model the likelihood of $g$ being a segmentation of $w$ as proportional to the product of all its subword likelihoods, i.e. the transitions along a path from 0 to $|w|$ in $G_w$:
$$p_{g|w} \propto \prod_{s \in g} p_s. \tag{2}$$

[Figure 1: Diagram of probabilistic subword transitions for the word "higher". Some edges are omitted to reduce clutter. Each edge is labeled by a subword $s$ of the word, associated with $p_s$. Bold edges constitute a path from node 0 to node 6, corresponding to the segmentation of the word into "high" and "er".]

Example. Figure 1 illustrates $G_w$ for word $w$ = "higher" of length 6. Bold edges (0, 4) and (4, 6) form a path from 0 to 6, which corresponds to the segmentation ("high", "er"). The likelihood $p_{(\text{"high"}, \text{"er"})|w}$ of this particular segmentation is proportional to $p_{\text{"high"}} \, p_{\text{"er"}}$, the product of the weights along the path.
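Proposition 1 and Eq. (2) can be checked with a brute-force sketch that enumerates all paths (exponential in |w|, so this is only for intuition; Section 3 gives the efficient algorithm). The tiny default likelihood for unseen subwords is an illustrative assumption.

```python
def segmentations(w):
    """All ways to split w into non-empty subwords, i.e. all paths
    from node 0 to node |w| in the subword transition graph."""
    if not w:
        return [()]
    return [(w[:j],) + rest
            for j in range(1, len(w) + 1)
            for rest in segmentations(w[j:])]

def seg_likelihoods(w, p, eps=1e-6):
    """p_{g|w}: proportional to the product of subword likelihoods
    along the corresponding path, normalized over all segmentations."""
    segs = segmentations(w)
    scores = []
    for g in segs:
        score = 1.0
        for s in g:
            score *= p.get(s, eps)
        scores.append(score)
    z = sum(scores)
    return {g: sc / z for g, sc in zip(segs, scores)}
```

A word of length l has 2^(l-1) segmentations (one per subset of the l-1 internal boundaries), which is what makes a dynamic program necessary.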

Probabilistic Bag-of-Subwords
Based on the above modeling of subword segmentations, we propose the Probabilistic Bag-of-Subwords (PBoS) model for composing word embeddings. The embedding vector $\mathbf{w}$ for word $w$ is the expectation of all its segmentation-based word embeddings:
$$\mathbf{w} = \sum_{g \in \mathrm{Seg}_w} p_{g|w}\, \mathbf{g}, \tag{3}$$
where $\mathbf{g}$ is the embedding for segmentation $g$. Given a subword segmentation $g$, we adopt the Bag-of-Subwords (BoS) model (Bojanowski et al., 2017; Zhao et al., 2018) for composing the word embedding from subwords. Specifically, we apply BoS over the subword segments in $g$:
$$\mathbf{g} = \sum_{s \in g} \mathbf{s}, \tag{4}$$
where $\mathbf{s}$ is the vector representation for subword $s$, as if the current segmentation $g$ were the "golden" segmentation of the word. In that case, we assume the meaning of the word is the combination of the meanings of all its subword segments. Combining the two, we have
$$\mathbf{w} = \sum_{g \in \mathrm{Seg}_w} p_{g|w} \sum_{s \in g} \mathbf{s}. \tag{5}$$
We maintain a look-up table of subword vectors $\mathbf{s}$. Given a set of target pre-trained word vectors $\mathbf{w}^*$ defined for words within a finite vocabulary $W$, our model can be trained by minimizing the mean square loss:
$$\min \frac{1}{|W|} \sum_{w \in W} \|\mathbf{w} - \mathbf{w}^*\|^2. \tag{6}$$
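The composition just described can be rendered directly as an (exponential-time) sketch, before the efficient reformulation of Section 3; the default likelihood for unseen subwords and the plain-dict subword table are illustrative assumptions.

```python
def pbos_embed_naive(w, p, subword_vecs, dim, eps=1e-6):
    """Expectation over all segmentations of the BoS vector of each
    segmentation: w_vec = sum_g p(g|w) * sum_{s in g} s_vec."""
    def segs(x):
        if not x:
            yield ()
            return
        for j in range(1, len(x) + 1):
            for rest in segs(x[j:]):
                yield (x[:j],) + rest

    vec = [0.0] * dim
    z = 0.0
    for g in segs(w):
        lik = 1.0
        for s in g:
            lik *= p.get(s, eps)   # unnormalized segmentation likelihood
        z += lik
        for s in g:
            v = subword_vecs.get(s)
            if v is not None:
                for k in range(dim):
                    vec[k] += lik * v[k]
    return [x / z for x in vec]   # dividing by z turns sums into an expectation
```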

Efficient Algorithm
PBoS simultaneously considers all possible subword segmentations and their contributions in composing word representations. However, summing over the embeddings of all possible segmentations can be prohibitively inefficient, as simply enumerating all segmentations of $w$ takes a number of steps exponential in the length of $w$ (Proposition 2). We therefore need an efficient way to compute Eq. (5).

Alternative View: Weighted Subwords
Exchanging the order of summations in Eq. (5) from segmentation-first to subword-first, we get
$$\mathbf{w} = \sum_{s \subseteq w} a_{s|w}\, \mathbf{s}, \tag{7}$$
where
$$a_{s|w} \propto \sum_{g \in \mathrm{Seg}_w,\, g \ni s} p_{g|w} \tag{8}$$
is the weight accumulated for subword $s$, summing over all segmentations of $w$ that contain $s$. Eq. (7) provides an alternative view of the word vector composed by our model: a weighted sum of all the word's subword vectors. Compared to BoS, we assign a different importance $a_{s|w}$, instead of a uniform weight, to each subword. $a_{s|w}$ can be viewed as the likelihood of subword $s$ being a meaningful segment of the particular word $w$, considering both the likelihood of $s$ itself being meaningful and, at the same time, how likely the rest of the word can still be segmented into meaningful subwords.
Example. Consider the contribution of subword $s$ = "gher" in word $w$ = "higher". Possible contributions only come from segmentations that contain "gher": $g_1$ = ("h", "i", "gher") and $g_2$ = ("hi", "gher"). Each such segmentation $g$ adds weight $p_{g|w}$ to $a_{s|w}$. In this case, $a_{\text{"gher"}|w}$ will be smaller than $a_{\text{"er"}|w}$ because both $p_{g_1|w}$ and $p_{g_2|w}$ are rather small.

Computing Subword Weights
Now we can efficiently compute Eq. (7) if we can efficiently compute $a_{s|w}$. Here we present an algorithm that computes $a_{s|w}$ for all $s \subseteq w$ in $O(|w|^2)$ time.
The specific structure of the subword transition graph means that edges only go from left to right. Thus, we can split every path going through an edge $e$ into three parts: the edges to the left of $e$, $e$ itself, and the edges to the right of $e$. In terms of subwords, that is, for $s = w[i:j]$ and $l = |w|$, each segmentation $g$ that contains $s$ can be divided into three parts: a segmentation $g_{w[1:i-1]}$ over $w[1:i-1]$, the subword $s$ itself, and a segmentation $g_{w[j+1:l]}$ over $w[j+1:l]$. Based on this, we can rewrite Eq. (8) as
$$a_{s|w} \propto b_{1,i-1}\; p_s\; b_{j+1,l}, \tag{9}$$
where
$$b_{i',j'} = \sum_{g' \in \mathrm{Seg}_{w[i':j']}} \prod_{s' \in g'} p_{s'}.$$
Now we can efficiently compute $a_{s|w}$ if we can efficiently compute $b_{1,i-1}$ and $b_{j+1,l}$ for all $1 \le i, j \le l$. Fortunately, we can do so for $b_{1,i}$ using the recursive relation
$$b_{1,i} = \sum_{j=0}^{i-1} b_{1,j}\; p_{w[j+1:i]}$$
for $i = 1, \dots, l$, with $b_{1,0} = 1$. Similar formulas hold for $b_{j,l}$, $j = 1, \dots, l$, with $b_{l+1,l} = 1$.
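The forward and backward recursions translate directly into an O(l^2) sketch of the subword weights; the eps default for unseen subwords is an illustrative assumption.

```python
def subword_weights(w, p, eps=1e-6):
    """O(l^2) computation of a_{s|w} via a_{s|w} ∝ b_fwd[i] * p_s * b_bwd[j]
    for s = w[i:j], following the forward/backward recursions."""
    l = len(w)
    pw = lambda i, j: p.get(w[i:j], eps)
    # b_fwd[i]: total likelihood of all segmentations of the prefix w[:i]
    b_fwd = [1.0] + [0.0] * l
    for i in range(1, l + 1):
        b_fwd[i] = sum(b_fwd[j] * pw(j, i) for j in range(i))
    # b_bwd[j]: total likelihood of all segmentations of the suffix w[j:]
    b_bwd = [0.0] * l + [1.0]
    for j in range(l - 1, -1, -1):
        b_bwd[j] = sum(pw(j, k) * b_bwd[k] for k in range(j + 1, l + 1))
    a = {}
    for i in range(l):
        for j in range(i + 1, l + 1):
            s = w[i:j]   # repeated substrings accumulate, e.g. "ana" in "banana"
            a[s] = a.get(s, 0.0) + b_fwd[i] * pw(i, j) * b_bwd[j]
    z = sum(a.values())
    return {s: v / z for s, v in a.items()}
```

On "higher" with likelihoods favoring "high", "er" and "her", the weight of "er" dwarfs that of "her", matching the earlier intuition: "her" is a likely subword on its own, but the remaining "hig" is unlikely.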
Based on this, we devise Algorithm 1 for computing $a_{s|w}$ for all $s \subseteq w$. Here we take the alternative view of our model as a weighted average of all possible subwords (hence the normalization in Line 12), which is an extension of the unweighted averaging of subwords used in Zhao et al. (2018).
Algorithm 1 Computing $a_{s|w}$.
1: Input: Word $w$; $p_s$ for all $s \subseteq w$; $l = |w|$.
2: $b_{1,0} \leftarrow 1$; $b_{l+1,l} \leftarrow 1$
3: for $i = 1, \dots, l$ do
4:   $b_{1,i} \leftarrow \sum_{j=0}^{i-1} b_{1,j}\, p_{w[j+1:i]}$
5:   $b_{l-i+1,l} \leftarrow \sum_{j=l-i+1}^{l} p_{w[l-i+1:j]}\, b_{j+1,l}$
6: end for
7: $\tilde{a}_{s|w} \leftarrow 0$ for all $s \subseteq w$
8: for all $1 \le i \le j \le l$ do
9:   $s \leftarrow w[i:j]$
10:  $\tilde{a}_{s|w} \leftarrow \tilde{a}_{s|w} + b_{1,i-1}\, p_s\, b_{j+1,l}$
11: end for
12: $a_{s|w} \leftarrow \tilde{a}_{s|w} / \sum_{s' \subseteq w} \tilde{a}_{s'|w}$ for all $s \subseteq w$
13: return $a_{\cdot|w}$

Time complexity As we only access each subword once in each for-statement, the number of multiplications and additions involved is bounded by the number of subword locations of $w$. Each of Line 4 and Line 5 takes $i$ multiplications and $i-1$ additions. So Lines 3 to 6 in total take about $2l^2$ computations. Lines 8 to 11 take $\frac{3l(l+1)}{2}$ computations. Thus, the time complexity of Algorithm 1 is $O(l^2)$. Given a word of length 20, $O(l^2)$ ($20^2 = 400$) is much better than enumerating all $O(2^l)$ ($2^{20} = 1{,}048{,}576$) segmentations.
Using the setting in Section 4.3, PBoS takes only 30% more time on average (590 µs vs. 454 µs) than BoS (obtained by disabling the $a_{s|w}$ computation) to compose a 300-dimensional word embedding vector.

Experiments
We design experiments to answer two questions: Do the segmentation likelihood and subword weights computed by PBoS align with their meaningfulness? Are the word embedding vectors generated by PBoS of good quality?
For the former, we inspect segmentation results and subword weights (Section 4.1), and see how good they are at predicting word affixes (Section 4.2). For the latter, we evaluate the word embeddings composed by PBoS at word similarity task (Section 4.3) and part-of-speech (POS) tagging task (Section 4.4).
Due to the page limit, we only report the most relevant settings and results in this section. Other details, including hardware, running time and detailed list of hyperparameters, can be found in Appendix A.

Subword Segmentation
In this subsection, we provide anecdotal evidence that PBoS is able to assign meaningful segmentation likelihoods and subword weights. Table 1 shows the top subword segmentations and the resulting top subwords calculated by PBoS for some example words, ranked by their likelihoods and weights respectively. The calculation is based on word frequencies derived from the Google Web Trillion Word Corpus. We use the same list for word probability $p_w$ throughout our experiments if not otherwise mentioned. All other settings are the same as described for PBoS in Section 4.3.
We can see that the segmentation likelihoods and subword weights favor the whole word as a subword segment if the word appears in the word list, e.g. "higher", "farmland". This allows the model to closely mimic the word embeddings of frequent words that are probably part of the target vectors.
Second to the whole-word segmentation, or when the word is rare, e.g. "penpineapplepie", "paradichlorobenzene", we see that PBoS gives higher likelihood to meaningful segmentations such as "high/er", "farm/land", "pen/pineapple/pie" and "para/dichlorobenzene" than to other possible segmentations. Consequently, the respective subword segments get higher weights among all possible subwords of the word, often by a good amount. This behavior helps PBoS focus on meaningful subwords when composing word embeddings. The fact that this can be achieved without any explicit source of morphological knowledge is itself interesting.

Affix Prediction
We quantitatively evaluate the quality of subword segmentations and subsequent subword weights by testing if our PBoS model is able to discover the most eminent word affixes. Note this has nothing to do with embeddings, so no training is involved in this experiment.
The affix prediction task is to predict the most eminent affix for a given word. For example, "-able" for "replaceable" and "re-" for "rename".
Models We get affix predictions from our PBoS by taking the top-ranked subword that is one of the possible affixes. To show our advantage, we compare it with a BoS-style baseline affix predictor. Because BoS gives the same weight to all subwords in a given word, the baseline randomly chooses one of the possible affixes that appear as a subword of the word.
[Footnote 6: https://www.kaggle.com/rtatman/english-word-frequency]
[Footnote 7: A slight exception is "farmlan/d", probably because "-d" is a frequent suffix.]
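The PBoS-side predictor can be sketched as follows, assuming a precomputed subword-weight dictionary and hypothetical prefix/suffix sets (the names and data here are illustrative, not the benchmark's actual affix inventory):

```python
def predict_affix(word, weights, prefixes, suffixes):
    """Return the highest-weighted subword of `word` that is a possible
    affix: a known prefix at the start or a known suffix at the end."""
    candidates = []
    for s, a in weights.items():
        if s in prefixes and word.startswith(s):
            candidates.append((a, s + "-"))
        if s in suffixes and word.endswith(s):
            candidates.append((a, "-" + s))
    return max(candidates)[1] if candidates else None
```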
Benchmark We use the derivational morphology dataset from Lazaridou et al. (2013). The dataset contains 7449 English words in total, along with their most eminent affixes. Because no training is needed in this experiment, we use all the words for evaluation. To make the task more challenging, we drop trivial instances where only one possible affix appears as a subword of the given word. For example, "rename" is dropped because only the prefix "re-" is present; on the other hand, "replaceable" is kept because both "re-" and "-able" are present. Besides excluding such trivial cases, we also exclude instances labeled with the suffix "-y", because it is always contained in "-ly" and "-ity". Altogether, we obtain 3546 words with 17 possible affixes for this evaluation.
Results Affix prediction results in terms of macro precision, recall, and F1 score are shown in Table 2. We can see a definite advantage of PBoS at predicting most word affixes: all metrics improve by about 0.4 and F1 almost doubles compared to BoS, providing evidence that PBoS is able to assign meaningful subword weights.

Word Similarity
Given that PBoS is able to produce sensible segmentation likelihoods and subword weights, we now turn our focus to the quality of the generated word embeddings. In this section, we evaluate the word vectors' ability to capture word senses using the intrinsic task of word similarity.
[Footnote 8: http://marcobaroni.org/PublicData/affix_complete_set.txt.gz]
Word similarity aims to test how well word embeddings capture words' semantic similarity. The task is given as pairs of words, along with similarity scores labeled by language speakers. Given a set of word embeddings, we compute the similarity scores induced by the cosine similarity between the embedding vectors of each pair of words. The performance is then measured by Spearman's correlation ρ over all pairs.

Model Setup PBoS composes word embeddings out of subword vectors exactly as described in Section 3. Unlike some previous models, we do not add special characters to indicate word boundaries and do not set any constraint on subword lengths. PBoS is trained for 50 epochs using vanilla SGD with initial learning rate 1 and inverse square root decay.

Target Vectors The target pre-trained embeddings are listed in Table 3, along with their word similarity scores and OOV rates over the benchmarks. As we can see, both pre-trained embeddings yield decent correlations with human-labeled word similarity. However, the scores drop significantly as the OOV rate goes up. Polyglot vectors yield lower scores, probably due to their smaller dimension and smaller token coverage.
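The evaluation protocol reduces to cosine similarity between vector pairs plus Spearman's ρ over the induced scores. The sketch below assumes no tied ranks (library implementations such as scipy.stats.spearmanr handle ties):

```python
def cosine(u, v):
    """Cosine similarity; 0.0 for zero-norm vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def spearman_rho(xs, ys):
    """Spearman's rank correlation, assuming no tied scores."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```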
Results Word similarity results of the three subword-level models are summarized in Table 4. [11] PBoS achieves scores better than or at least comparable to BoS and KVQ-FH in all but one of the six combinations of target vectors and word similarity benchmarks. Viewed as an extension of BoS, PBoS is in the majority of cases better than BoS, often by a good margin, suggesting the effectiveness of the subword weighting scheme. Compared to KVQ-FH, PBoS can often match and sometimes surpass it, even though PBoS is a much simpler model with better explainability. Compared to the scores of the target embeddings alone (Table 3, All pairs), PBoS is the only model that demonstrates improvement across all cases.
[Footnote 9: https://polyglot.readthedocs.io/en/latest/Download.html]
[Footnote 10: https://code.google.com/archive/p/word2vec/]
[Footnote 11: We regard training and prediction time as less of a concern here, as all three models are able to finish a training epoch in under a minute. Details and discussions can be found in Appendix A.2.]
The only case where PBoS does not do well is with Polyglot vectors and the RW benchmark. After many manual inspections, we conjecture that this may be related to vector norms: sometimes the vector of a relevant subword has a small norm and is prone to being overwhelmed by less relevant subword vectors. To counter this, we tried normalizing subword vectors before summing them up into a word vector (PBoS-n). PBoS-n showed good improvement in the Polyglot RW case (25 to 32), matching the performance of the other two models.
One may argue that PBoS has an advantage from using the largest number of parameters. However, this is largely because we do not constrain the length of subwords as in BoS or use hashing as in KVQ-FH. In fact, restricting subword length and using hashing helped those models for the word similarity task. We found that PBoS is insensitive to subword length constraints and decided to keep the setting simple. Despite being an interesting direction, we decided not to involve hashing in this work, in order to focus on the effect of our unique weighting scheme.
FastText Comparison Albeit targeted at a different task (training word embeddings) with access to contextual information, the popular fastText (Bojanowski et al., 2017) also uses a subword-level model. We train fastText over the same English corpus on which the Polyglot target vectors are trained, in order to understand the quantitative impact of contextual information. To ensure a fair comparison, we restrict the vocabulary sizes and embedding dimensions to match those of the Polyglot vectors. The word similarity scores we get for the trained fastText model are 65/40/14 for WS/RW/Card. We note the great gain for WS and RW, suggesting the helpfulness of contextual information in learning and generalizing word embeddings in the setting of small to moderate OOV rates. Surprisingly, we find that in the case of extremely high OOV rate (Card), PBoS slightly surpasses fastText, suggesting PBoS' effectiveness in generalizing embeddings to OOV words even without any help from context.

Multilingual Results
To evaluate and compare the effectiveness of PBoS across languages, we further train the models targeting the multilingual Wikipedia2Vec vectors (Yamada et al., 2020) and evaluate them on multilingual WordSim353 and SimLex999 from Leviant and Reichart (2015), which are available in English, German, Italian and Russian. To better assess the models' ability to generalize, we only take the top 10k words from the target vectors for training, which yields decent OOV rates, ranging from 23% to 84%. Detailed results can be found in Appendix Section A.3. In summary, we find 1) that PBoS surpasses KVQ-FH for English and German and is comparable to KVQ-FH for Italian; 2) that PBoS and KVQ-FH surpass BoS for English, German and Italian; and 3) no definitive trend among the three models for Russian.

POS Tagging
We further assess the quality of the generated word embeddings via the extrinsic task of POS tagging. The task is to categorize each word in a given context into a particular part of speech, e.g. noun, verb, or adjective.

POS Tagging Model
We follow the evaluation protocol for sequential labeling used by Kiros et al. (2015) and Li et al. (2017), and use a logistic regression classifier as the model for POS tagging. When predicting the tag for the $i$-th word $w_i$ in a sentence, the input to the classifier is the concatenation of the vectors $\mathbf{w}_{i-2}, \mathbf{w}_{i-1}, \mathbf{w}_i, \mathbf{w}_{i+1}, \mathbf{w}_{i+2}$ for the word itself and the words in its context. This setup allows a more direct evaluation of the quality of the word vectors themselves, and thus gives better discriminative power.

Dataset We train and evaluate the performance of generated word embeddings over the 23 languages at the intersection of the Polyglot (Al-Rfou et al., 2013) embeddings and the Universal Dependencies (UD) dataset. The Polyglot embeddings have a 100k vocabulary for each language and are used as target vectors for each of the subword-level embedding models in this experiment. For PBoS, we use the Polyglot word counts for each language as the base for subword segmentation and subword weight calculation. UD is used as the POS tagging dataset to train and test the POS tagging model. We use the default partition into training and testing sets. Statistics vary from language to language; see Appendix A.4 for more details.
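The classifier input described above can be sketched as follows; the zero-padding at sentence boundaries is an assumption on our part, as this text does not specify how edge positions are handled.

```python
def context_features(vectors, sentence, i, window=2, dim=3):
    """Concatenate the embedding vectors of w_{i-2}, ..., w_{i+2};
    positions outside the sentence (and words without a vector)
    contribute zero vectors."""
    zero = [0.0] * dim
    feats = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(sentence):
            feats.extend(vectors.get(sentence[j], zero))
        else:
            feats.extend(zero)
    return feats
```

The resulting 5·dim feature vector is what the logistic regression classifier consumes for each token.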
Results Table 5 shows the POS tagging accuracy over the 23 languages that appear in both Polyglot and UD. All the subword-level embedding models follow the same hyperparameters as in Section 4.3. Following Sasaki et al. (2019), we tune the regularization term of the logistic regression model when evaluating KVQ-FH. Even with that, PBoS achieves the best POS tagging accuracy in all but one language, regardless of morphological type, OOV rate, and the number of training instances (Appendix Table 12). In particular, PBoS improves accuracy by more than 0.1 for 9 languages. For the one language (Tamil) where PBoS is not the most accurate, the difference from the best is small (0.003). KVQ-FH gives no significantly more accurate predictions than BoS, despite being more complex and being the only model with tuned hyperparameters.
Overall, Table 5 shows that the word embeddings composed by PBoS are effective at predicting POS tags for a wide range of languages.

Related Work
Popular word embedding methods, such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), often assume finite-size vocabularies, giving rise to the problem of OOV words.
FastText (Bojanowski et al., 2017; Joulin et al., 2017) attempted to alleviate the problem using a subword-level model, and was followed by a line of interest in using subword information to improve word embeddings (Wieting et al., 2016; Cao and Lu, 2017; Li et al., 2017; Athiwaratkun et al., 2018; Li et al., 2018; Salle and Villavicencio, 2018; Xu et al., 2019; Zhu et al., 2019). Among them are Charagram by Wieting et al. (2016), which, albeit trained on specific downstream tasks, is similar to BoS followed by a non-linear activation, and the systematic evaluation by Zhu et al. (2019) over various choices of word composition functions and subword segmentation methods. However, all the works above either pay little attention to the interaction among subwords inside a given word, or treat subword segmentation and composing word representations as separate problems.
Another interesting thread of work (Oshikiri, 2017; Kim et al., 2018a, 2019) attempts to model language solely at the subword level and learn subword embeddings directly from text, providing evidence for the power of subword-level models, especially as the notion of a word is considered doubtful by some linguists (Haspelmath, 2011).
Besides the recent interest in subwords, there have been long efforts of using morphology to improve word embedding (Luong et al., 2013;Cotterell and Schütze, 2015;Cui et al., 2015;Soricut and Och, 2015;Bhatia et al., 2016;Cao and Rei, 2016;Xu et al., 2018;Üstün et al., 2018;Edmiston and Stratos, 2018;Chaudhary et al., 2018;Park and Shin, 2018). However, most of them require an external oracle, such as Morfessor (Creutz and Lagus, 2002;Virpioja et al., 2013), for the morphological segmentations of input words, limiting their power to the quality and availability of such segmenters. The only exception is the character LSTM model by Cao and Rei (2016), which has shown some ability to recover the morphological boundary as a byproduct of learning word embedding.
The most related works in generalizing pretrained word embeddings have been discussed in Section 1 and compared throughout the paper.

Conclusion and Future Work
We propose PBoS model for generalizing pretrained word embeddings without contextual information. PBoS simultaneously considers all possible subword segmentations of a word and derives meaningful subword weights that lead to better composed word embeddings. Experiments on segmentation results, affix prediction, word similarity, and POS tagging over 23 languages support the claim.
In the future, it would be interesting to see if PBoS can also help with the task of learning word embedding, and how hashing would impact the quality of composed embedding while facilitating a more compact model.

A Experimental Details
Here we list the details of our experiments that are omitted in the main paper due to space constraints. We run all our experiments on a machine with an 8-core Intel i7-6700 CPU @ 3.40GHz, 32GB Memory, and GeForce GTX 970 GPU.

A.1 Hyperparameters
The meanings of the hyperparameters shown in Table 6, Table 7 and Table 8 are explained as follows.

Subwords
• min len: The minimum length for a subword to be considered.
• max len: The maximum length for a subword to be considered.
• word boundary: Whether to add special characters to annotate word boundaries.

Training
• epochs: The number of training epochs.
• lr decay: Whether to set the learning rate to be inversely proportional to the square root of the epoch number.
• normalize semb: Whether to normalize subword embeddings before composing word embeddings.
• prob eps: Default likelihood for unknown characters.

Evaluation
• C: The inverse regularization term used by the logistic regression classifier.
A.2 Word Similarity

During the evaluation, we use 0 as the similarity score for a pair of words if we cannot get a word vector for one of the words, or if the magnitude of the word vector is too small. This is especially the case when we evaluate the target vectors, where OOV rates can be significant. Table 9 lists the experimental results for word similarity in greater detail.
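The scoring rule can be stated precisely as a small sketch; the norm threshold `min_norm` is an illustrative assumption, as the exact cutoff for "too small" is not given here.

```python
def similarity_or_zero(emb, w1, w2, min_norm=1e-8):
    """Cosine similarity of the two word vectors, or 0.0 when either
    vector is missing or has (near-)zero magnitude."""
    u, v = emb.get(w1), emb.get(w2)
    if u is None or v is None:
        return 0.0
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    if nu < min_norm or nv < min_norm:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)
```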
Regarding the training epoch time, note that KVQ-FH uses a GPU and is implemented using a deep learning library with underlying optimized C code, whereas our PBoS is implemented in pure Python and uses only a single CPU thread. We omit the prediction time for KVQ-FH, as we found it hard to separate the actual inference time from the time used for other processes such as batching and data transfer between CPU and GPU. However, we believe the overall trend should be similar to that of the training time.
One may notice that the prediction time for BoS in Table 9 differs from what was reported at the end of Section 3. This is largely because the BoS in Table 9 has a different (smaller) set of possible subwords to consider, due to the subword length limits. In Section 3, to fairly assess the impact of the subword weight computation, we ensure that BoS and PBoS work with the same set of possible subwords (the one used by PBoS in Section 4.3), and thus observe a slightly longer prediction time for BoS.

A.3 Multilingual Word Similarity
We use Wikipedia2Vec (Yamada et al., 2020) as target vectors, and keep the most frequent 10k words to get decent OOV rates. The OOV rates and word similarity scores can be found in Table 10.
We do not clean or filter words as we did for the English word similarity experiments, because we found it difficult to pre-process words consistently across languages. For PBoS, we use the word frequencies from Polyglot for subword segmentation and subword weight calculation, the same as for the multilingual POS tagging experiment (Section 4.4).
We evaluate all the models on multilingual WordSim353 (mWS) and SimLex999 (mSL) from Leviant and Reichart (2015), which are available in English, German, Italian and Russian. The dataset also contains the relatedness (rel) and similarity (sim) benchmarks derived from mWS.
We list the results for multilingual word similarity in Table 11.

A.4 POS Tagging

Table 7 and Table 8 show the hyperparameter values used in the POS tagging experiment (Section 4.4). For the prediction model, we use the logistic regression classifier from scikit-learn 0.19.1 with the default settings.
Following the observation in Sasaki et al. (2019), we tune the regularization parameter C for KVQ-FH over all values $a \times 10^b$ where $a = 1, \dots, 9$ and $b = -1, 0, \dots, 4$. We use the POS tagging accuracy for English as the criterion and choose C = 70. Table 12 lists some statistics of the datasets used in the POS tagging experiment. PBoS achieves better accuracy than BoS and KVQ-FH in all but one language, regardless of morphological type, OOV rate, and the number of training instances for POS tagging.

Table 9: Word similarity performance of subword-level models measured in Spearman's ρ × 100, along with training and prediction time.