Nonsymbolic Text Representation

We introduce the first generic text representation model that is completely nonsymbolic, i.e., it does not require the availability of a segmentation or tokenization method that attempts to identify words or other symbolic units in text. This applies to training the parameters of the model on a training corpus as well as to applying it when computing the representation of a new text. We show that our model performs better than prior work on an information extraction and a text denoising task.


Introduction
Character-level models can be grouped into three classes. (i) End-to-end models learn a separate model on the raw character (or byte) input for each task; these models estimate task-specific parameters, but no representation of text that would be usable across tasks is computed. Throughout this paper, we refer to r(x) as the "representation" of x only if r(x) is a generic rendering of x that can be used in a general way, e.g., across tasks and domains. The activation pattern of a hidden layer for a given input sentence in a multilayer perceptron (MLP) is not a representation according to this definition if it is not used outside of the MLP. (ii) Character-level models of words derive a representation of a word w from the character string of w, but they are symbolic in that they need text segmented into tokens as input. (iii) Bag-of-character-ngram models, bag-of-ngram models for short, use character ngrams to encode sequence-of-character information, but sequence-of-ngram information is lost in the representations they produce.
Our premise is that text representations are needed in NLP. A large body of work on word embeddings demonstrates that a generic text representation, trained in an unsupervised fashion on large corpora, is useful. Thus, we take the view that group (i) models, i.e., end-to-end learning without any representation learning, are not a good general approach for NLP.
We distinguish training and utilization of the text representation model. We use "training" to refer to the method by which the model is learned and "utilization" to refer to the application of the model to a piece of text to compute a representation of the text. In many text representation models, utilization is trivial. For example, for word embedding models, utilization amounts to a simple lookup of a word to get its precomputed embedding. However, for the models we consider, utilization is not trivial and we will discuss different approaches.
Both training and utilization can be either symbolic or nonsymbolic. We define a symbolic approach as one that is based on tokenization, i.e., a segmentation of the text into tokens. Symbol identifiers (i.e., tokens) can have internal structure (a tokenizer may recognize tokens like "to and fro" and "London-based" that contain delimiters) and may be morphologically analyzed downstream. We define a nonsymbolic approach as one that is tokenization-free, i.e., no assumption is made that there are segmentation boundaries and that each segment (e.g., a word) should be represented (e.g., by a word embedding) in a way that is independent of the representations (e.g., word embeddings) of neighboring segments. Methods for training text representation models that require tokenized text include word embedding models like word2vec (Mikolov et al., 2013) and most group (ii) methods, i.e., character-level models like fastText skipgram (Bojanowski et al., 2016).
Bag-of-ngram models, group (iii) models, are text representation utilization models that typically compute the representation of a text as the sum of the embeddings of all character ngrams occurring in it, e.g., WordSpace (Schütze, 1992) and CHARAGRAM (Wieting et al., 2016). WordSpace and CHARAGRAM are examples of mixed training-utilization models: training is performed on tokenized text (words and phrases), utilization is nonsymbolic.
We make two contributions in this paper. (i) We propose the first generic method for training text representation models without the need for tokenization and address the challenging sparseness issues that make this difficult. (ii) We propose the first nonsymbolic utilization method that fully represents sequence information, in contrast to utilization methods like bag-of-ngrams that discard sequence information that is not directly encoded in the character ngrams themselves.

Motivation
Chung et al. (2016) give two motivations for their work on character-level models. First, tokenization (or, equivalently, segmentation) algorithms make many mistakes and are brittle: "we do not have a perfect word segmentation algorithm for any one language". Tokenization errors then propagate throughout the NLP pipeline.
Second, there is currently no general solution for morphology in statistical NLP. For many languages, high-coverage and high-quality morphological resources are not available. Even for well resourced languages, problems like ambiguity make morphological processing difficult; e.g., "rung" is either the singular of a noun meaning "part of a ladder" or the past participle of "to ring". In many languages, e.g., in German, syncretism, a particular type of systematic morphological ambiguity, is pervasive. Thus, there is no simple morphological processing method that would produce a representation in which all inflected forms of "to ring" are marked as having a common lemma; and no such method in which an unseen form like "aromatizing" is reliably analyzed as a form of "aromatize" whereas an unseen form like "antitrafficking" is reliably analyzed as the compound "anti+trafficking".
Of course, it is an open question whether nonsymbolic methods can perform better than morphological analysis, but the foregoing discussion motivates us to investigate them. Chung et al. (2016) focus on problems with the tokens produced by segmentation algorithms. Equally important is the problem that tokenization fails to capture structure across multiple tokens. The job of dealing with cross-token structure is often given to downstream components of the pipeline, e.g., components that recognize multiwords and named entities in English or in fact any word in a language like Chinese that uses no overt delimiters. However, there is no linguistic or computational reason in principle why we should treat the recognition of a unit like "electromechanical" (containing no space) as fundamentally different from the recognition of a unit like "electrical engineering" (containing a space). Character-level models offer the potential of uniform treatment of such linguistic units.

Methodology
Many text representation learning algorithms can be understood as estimating the parameters of the model from a unit-context matrix C where each row corresponds to a unit u_i, each column to a context c_j and each cell C_ij measures the degree of association between u_i and c_j. For example, the skipgram model is closely related to an SVD factorization of a pointwise mutual information matrix (Levy and Goldberg, 2014). Many text representation learning algorithms are formalized as matrix factorization (e.g., Deerwester et al., 1990; Hofmann, 1999; Stratos et al., 2015), but there may be no big difference between implicit (e.g., Pennington et al., 2014) and explicit factorization methods; see also Mohamed (2011) and Rastogi et al. (2015).
Our goal in this paper is not to develop new matrix factorization methods. Instead, we will focus on defining the unit-context matrix in such a way that no symbolic assumption has to be made. This unit-context matrix can then be processed by any existing or still to be invented algorithm.
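To make the notion of a unit-context matrix concrete, the following Python sketch builds a sparse matrix C from a sequence of units and fills each cell with positive pointwise mutual information. This is an illustration only: the symmetric window, the PPMI weighting and the function names `cooccurrence_counts` and `ppmi` are assumptions made for the example, not the setup used in our experiments.

```python
import math
from collections import Counter


def cooccurrence_counts(units, window=2):
    """Count (unit, context) pairs within a symmetric window of units."""
    counts = Counter()
    for idx, unit in enumerate(units):
        lo = max(0, idx - window)
        hi = min(len(units), idx + window + 1)
        for j in range(lo, hi):
            if j != idx:
                counts[(unit, units[j])] += 1
    return counts


def ppmi(counts):
    """Turn raw counts into a sparse positive-PMI unit-context matrix."""
    total = sum(counts.values())
    unit_totals = Counter()
    context_totals = Counter()
    for (u, c), n in counts.items():
        unit_totals[u] += n
        context_totals[c] += n
    matrix = {}
    for (u, c), n in counts.items():
        pmi = math.log(n * total / (unit_totals[u] * context_totals[c]))
        if pmi > 0:  # keep only positive associations
            matrix[(u, c)] = pmi
    return matrix


# Units need not be words; here they are segments that ignore word boundaries.
units = ["a@ros", "e@is@a@ros", "e@is", "@a@rose"]
C = ppmi(cooccurrence_counts(units, window=1))
```

Any factorization method could then be applied to C; the sketch deliberately stops at the matrix itself.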
Definition of units and contexts. How to define units and contexts without relying on segmentation boundaries? In initial experiments, we simply generated all character ngrams of length up to k_max (where k_max is a parameter), including character ngrams that cross token boundaries; i.e., no segmentation is needed. We then used a skipgram-type objective for learning embeddings that attempts to predict, from ngram g_1, an ngram g_2 in g_1's context. Results were poor because many training instances consist of pairs (g_1, g_2) in which g_1 and g_2 overlap, e.g., one is a subsequence of the other. So the objective encourages trivial predictions of ngrams that have high string similarity with the input and nothing interesting is learned.
In this paper, we propose an alternative way of defining units and contexts that supports well-performing nonsymbolic text representation learning: multiple random segmentation. A pointer moves through the training corpus. The current position i of the pointer defines the left boundary of the next segment. The length l of the next move is uniformly sampled from [k_min, k_max] where k_min and k_max are the minimum and maximum segment lengths. The right boundary of the segment is then i + l. Thus, the segment just generated is c_{i,i+l}, the subsequence of the corpus between (and including) positions i and i + l. The pointer is positioned at i + l + 1, the next segment is sampled and so on. An example of a random segmentation from our experiments is "@he@had@b egu n@to@show @his@cap acity@f" where space was replaced with "@" and the next segment starts with "or@".
The corpus is segmented this way m times (where m is a parameter) and the m random segmentations are concatenated. The unit-context matrix is derived from this concatenated corpus.
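The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sampled value is treated directly as the segment length (so segments have between k_min and k_max characters, as in the example above), the final segment of a pass may be shorter, and the function names are hypothetical.

```python
import random


def random_segmentation(text, k_min=3, k_max=9, rng=None):
    """Split text into consecutive, non-overlapping segments of random length."""
    rng = rng or random.Random()
    segments = []
    i = 0
    while i < len(text):
        length = rng.randint(k_min, k_max)   # uniform, both ends inclusive
        segments.append(text[i:i + length])  # the last segment may be shorter
        i += length
    return segments


def multiple_random_segmentation(text, m=50, k_min=3, k_max=9, seed=0):
    """Segment the same text m times; each pass uses fresh random boundaries."""
    rng = random.Random(seed)
    text = text.replace(" ", "@")  # free up whitespace as the boundary marker
    return [random_segmentation(text, k_min, k_max, rng) for _ in range(m)]


corpus = "he had begun to show his capacity for"
for segments in multiple_random_segmentation(corpus, m=3):
    print(" ".join(segments))  # one randomly segmented copy per line
```

Concatenating the m printed lines gives the corpus from which the unit-context matrix is derived.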
Multiple random segmentation has two advantages. First, there is no redundancy since, in any given random segmentation, two ngrams do not overlap and are not subsequences of each other. Second, a single random segmentation would only cover a small part of the space of possible ngrams. For example, a random segmentation of "a rose is a rose is a rose" might be "[a ros][e is a ros][e is][a rose]". This segmentation does not contain the segment "rose" and this part of the corpus can then not be exploited to learn a good embedding for the fourgram "rose". However, with multiple random segmentation, it is likely that this part of the corpus does give rise to the segment "rose" in one of the segmentations and can contribute information to learning a good embedding for "rose".
We took the idea of random segmentation from work on biological sequences (Asgari and Mofrad, 2015;Asgari and Mofrad, 2016). Such sequences have no delimiters, so they are a good model if one believes that delimiter-based segmentation is problematic for text.

Ngram equivalence classes/Permutation
Form-meaning homomorphism premise. Nonsymbolic representation learning does not preprocess the training corpus by means of tokenization and considers many ngrams that would be ignored in tokenized approaches because they span token boundaries. As a result, the number of ngrams that occur in a corpus is an order of magnitude larger for tokenization-free approaches than for tokenization-based approaches. See supplementary for details.
We will see below that this sparseness impacts performance of nonsymbolic text representation negatively. We address sparseness by defining ngram equivalence classes. All ngrams in an equivalence class receive the same embedding.
The relationship between form and meaning is mostly arbitrary, but there are substructures of the ngram space and the embedding space that are systematically related by homomorphism. In this paper, we will assume the following homomorphism: g_1 ∼_τ g_2 ⇒ v(g_1) ≅ v(g_2), where g_1 ∼_τ g_2 holds iff τ(g_1) = τ(g_2) for a string transduction τ and v(g) is the embedding of ngram g. As a simple example consider a transduction τ that deletes spaces at the beginning of ngrams, e.g., τ(@Mercedes) = τ(Mercedes). This is an example of a meaning-preserving τ since for, say, English, τ will not change meaning. We will propose a procedure for learning τ below.
We define ≅ as "closeness", not as identity, because of estimation noise when embeddings are learned. We assume that there are no true synonyms and therefore the direction g_1 ∼_τ g_2 ⇐ v(g_1) ≅ v(g_2) also holds. For example, "car" and "automobile" are considered synonyms, but we assume that their embeddings are different because only "car" has the literary sense "chariot". If they were identical, then the homomorphism would not hold since "car" and "automobile" cannot be converted into each other by any plausible meaning-preserving τ.
Learning procedure. To learn τ, we define three templates that transform one ngram into another: (i) replace character a_1 with character a_2, (ii) delete character a_1 if its immediate predecessor is character a_2, (iii) delete character a_1 if its immediate successor is character a_2. The learning procedure takes a set of ngrams and their embeddings as input. It then exhaustively searches for all pairs of ngrams, for all pairs of characters a_1/a_2, for each of the three templates. When two matching embeddings exist, we compute their cosine. For example, for the operation "delete space before M", an ngram pair from our embeddings that matches is "@Mercedes" / "Mercedes" and we compute its cosine. As the characteristic statistic of an operation we take the average of all cosines; e.g., for "delete space before M" the average cosine is .7435. We then rank operations according to average cosine and take the first N_o as the definition of τ where N_o is a parameter. For characters that are replaced by each other (e.g., 1, 2, 3 in Table 1), we compute the equivalence class and then replace the learned operations with ones that replace a character by the canonical member of its equivalence class (e.g., 2 → 1, 3 → 1).
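The search over templates can be sketched as follows. This is a toy illustration, not our implementation: it assumes that an operation applies to every matching occurrence in an ngram, it uses a brute-force loop that is only feasible for small inputs, it omits the final equivalence-class step, and the names `rank_operations`, `delete_before` and `delete_after` as well as the toy vectors are made up for the example.

```python
import math
from collections import defaultdict
from itertools import permutations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def substitute(g, a1, a2):
    """Template (i): replace a1 with a2."""
    return g.replace(a1, a2)


def delete_after(g, a1, a2):
    """Template (ii): delete a1 where its immediate predecessor is a2."""
    return g.replace(a2 + a1, a2)


def delete_before(g, a1, a2):
    """Template (iii): delete a1 where its immediate successor is a2."""
    return g.replace(a1 + a2, a2)


TEMPLATES = {"substitute": substitute,
             "delete_after": delete_after,
             "delete_before": delete_before}


def rank_operations(embeddings, n_ops):
    """Return the n_ops operations with the highest average cosine."""
    alphabet = sorted({ch for g in embeddings for ch in g})
    cosines = defaultdict(list)
    for name, fn in TEMPLATES.items():
        for a1, a2 in permutations(alphabet, 2):
            for g, vec in embeddings.items():
                if a1 not in g:
                    continue
                h = fn(g, a1, a2)
                if h != g and h in embeddings:  # both ngrams have embeddings
                    cosines[(name, a1, a2)].append(cosine(vec, embeddings[h]))
    averages = {op: sum(cs) / len(cs) for op, cs in cosines.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n_ops]


# Toy embeddings (invented values, two dimensions).
toy = {"@Mercedes": [1.0, 0.1], "Mercedes": [1.0, 0.0],
       "html": [0.0, 1.0], "htm": [1.0, 1.0]}
ranked = rank_operations(toy, n_ops=5)
```

On the toy data, the operation that deletes "@" before "M" ranks above the one that deletes "l" after "m", because the embeddings of its matching pair are closer.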
Permutation premise. Tokenization algorithms can be thought of as assigning a particular function or semantics to each character and making tokenization decisions accordingly; e.g., they may disallow that a semicolon, the character ";", occurs inside a token. If we want to learn representations from the data without imposing such hard constraints, then characters should not have any particular function or semantics. A consequence of this desideratum is that if any two characters are exchanged for each other, this should not affect the representations that are learned. For example, if we interchange space and "A" throughout a corpus, then this should have no effect on learning: what was the representation of "NATO" before, should now be the representation of "N TO". We can also think of this type of permutation as a sanity check: it ensures we do not inadvertently make use of text preprocessing heuristics that are pervasive in NLP. Let A be the alphabet of a language, i.e., its set of characters, π a permutation on A, C a corpus and π(C) the corpus permuted by π. For example, if π(a) = e, then all "a" in C are replaced with "e" in π(C). The learning procedure should learn identical equivalence classes on C and π(C). So, if g_1 ∼_τ g_2 after running the learning procedure on C, then π(g_1) ∼_τ π(g_2) after running the learning procedure on π(C).
This premise is motivated by our desire to come up with a general method that does not rely on specific properties of a language or genre; e.g., the premise rules out exploiting the fact through feature engineering that in many languages and genres, "c" and "C" are related. Such a relationship has to be learned from the data.
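A corpus permutation of this kind is easy to set up. The sketch below is illustrative only; it assumes that the alphabet is simply the set of characters observed in the corpus, and the function names are hypothetical.

```python
import random


def random_permutation(alphabet, seed=0):
    """Draw a random bijection pi from the alphabet onto itself."""
    chars = sorted(alphabet)
    shuffled = chars[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(chars, shuffled))


def permute(text, pi):
    """Apply pi to every character, including space."""
    return "".join(pi[ch] for ch in text)


corpus = "NATO is an alliance"
pi = random_permutation(set(corpus))
inverse = {image: ch for ch, image in pi.items()}

permuted = permute(corpus, pi)       # pi(C): same structure, different symbols
restored = permute(permuted, inverse)  # mapping back for readability
```

Since π is a bijection, applying the inverse recovers the original text, which is how examples from π(C) can be displayed in readable form.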

Experiments
We run experiments on C, a 3 gigabyte English Wikipedia corpus, and train word2vec skipgram (W2V, (Mikolov et al., 2013)) and fastText skipgram (FTX, (Bojanowski et al., 2016)) models on C and its derivatives. We randomly generate a permutation π on the alphabet and learn a transduction τ (details below). In Table 2 (left), the columns "method", π and τ indicate the method used (W2V or FTX) and whether experiments in a row were run on C, π(C) or τ(π(C)). The values of "whitespace" are: (i) ORIGINAL (whitespace as in the original), (ii) SUBSTITUTE (what π outputs as whitespace is used as whitespace, i.e., π^{-1}(" ") becomes the new whitespace) and (iii) RANDOM (random segmentation with parameters m = 50, k_min = 3, k_max = 9). Before random segmentation, whitespace is replaced with "@"; this character occurs rarely in C, so that the effect of conflating two characters (original "@" and whitespace) can be neglected. The random segmenter then indicates boundaries by whitespace, unambiguously since it is applied to text that contains no whitespace.
We learn τ on the embeddings learned by W2V on the random segmentation version of π(C) (C-RANDOM in the table) as described in §3.2 for N_o = 200. Since the number of equivalence classes is much smaller than the number of ngrams, τ reduces the number of distinct character ngrams from 758M in the random segmentation version of π(C) (C/D-RANDOM) to 96M in the random segmentation version of τ(π(C)) (E/F-RANDOM). Table 1 shows a selection of the N_o operations. Throughout the paper, if we give examples from π(C) or τ(π(C)) as we do here, we convert characters back to the original for better readability. The two uppercase/lowercase conversions shown in the table (E→e, C→c) were the only ones that were learned (we had hoped for more). The postdeletion rule ml→m usefully rewrites "html" as "htm", but is likely to do more harm than good. We inspected all 200 rules and, with a few exceptions like ml→m, they looked good to us.

substitution   predeletion   postdeletion
2→1            /r→r          m@→m
E→e            @H→H          ml→m
C→c            @I→I

Table 1: String operations that on average do not change meaning. "@" stands for space. ‡ is the left or right boundary of the ngram.
Evaluation. We evaluate the three models on an entity typing task, similar to (Yaghoobzadeh and Schütze, 2015), but based on an entity dataset released by Xie et al. (2016) in which each entity has been assigned one or more types from a set of 50 types. For example, the entity "Harrison Ford" has the types "actor", "celebrity" and "award winner" among others. We extract mentions from FACC (http://lemurproject.org/clueweb12/FACC1) if an entity has a mention there or we use the Freebase name as the mention otherwise. This gives us a data set of 54,334, 6,085 and 6,747 mentions in train, dev and test, respectively. Each mention is annotated with the types that its entity has been assigned by Xie et al. (2016). The evaluation has a strong cross-domain aspect because of differences between FACC and Wikipedia, the training corpus for our representations. For example, of the 525 mentions in dev that have a length of at least 5 and do not contain lowercase characters, more than half have 0 or 1 occurrences in the Wikipedia corpus, including many like "JOHNNY CARSON" that are frequent in other case variants.
Since our goal in this experiment is to evaluate tokenization-free learning, not tokenization-free utilization, we use a simple utilization baseline, the bag-of-ngram model (see §1). A mention is represented as the sum of the embeddings of all character ngrams in the mention for which embeddings were learned. Linear SVMs (Chang and Lin, 2011) are then trained, one for each of the 50 types, on train and applied to dev and test. Our evaluation measure is micro F_1 on all typing decisions; e.g., one typing decision is: "Harrison Ford" is a mention of type "actor". We tune thresholds on dev to optimize F_1 and then use these thresholds on test.
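The bag-of-ngram baseline amounts to a lookup and a sum. The following sketch illustrates it under two assumptions: every occurrence of an ngram contributes once to the sum, and ngrams without a learned embedding are skipped. The function names and the dictionary-based embedding table are hypothetical.

```python
def char_ngrams(text, k_min=3, k_max=9):
    """All character ngrams of length k_min..k_max, shorter ngrams first."""
    return [text[i:i + k]
            for k in range(k_min, k_max + 1)
            for i in range(len(text) - k + 1)]


def bag_of_ngrams(text, embeddings, dim, k_min=3, k_max=9):
    """Sum the embeddings of all ngrams of text that have an embedding."""
    total = [0.0] * dim
    for g in char_ngrams(text, k_min, k_max):
        vec = embeddings.get(g)
        if vec is not None:  # ngrams without an embedding are ignored
            total = [t + x for t, x in zip(total, vec)]
    return total


# Toy embedding table with invented two-dimensional vectors.
toy = {"Har": [1.0, 0.0], "For": [0.0, 1.0], "rison@F": [0.5, 0.5]}
mention_vector = bag_of_ngrams("Harrison@Ford", toy, dim=2)
```

The resulting fixed-length vector is what a per-type linear classifier would receive as input.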

Results
Results are presented in Table 2 (left). Overall performance of FTX is higher than W2V in all cases. For ORIGINAL, FTX's recall is a lot higher than W2V's whereas precision decreases slightly. This indicates that FTX is stronger in both learning and application: in learning it can generalize better from sparse training data and in application it can produce representations for OOVs and better representations for rare words. For English, prefixes, suffixes and stems are of particular importance, but there often is not a neat correspondence between these traditional linguistic concepts and internal FTX representations; e.g., Bojanowski et al. (2016) show that "asphal", "sphalt" and "phalt" are informative character ngrams of "asphaltic".
Running W2V on random segmentations can be viewed as an alternative to the learning mechanism of FTX, which is based on character ngram cooccurrence; so it is not surprising that for RANDOM, FTX has only a small advantage over W2V.
For C/D-SUBSTITUTE, we see a dramatic loss in performance if tokenization heuristics are not used. This is not surprising, but shows how powerful tokenization can be.
C/D-ORIGINAL is like C/D-SUBSTITUTE except that we artificially restored the space, so the permutation π is applied to all characters except for space. By comparing C/D-ORIGINAL and C/D-SUBSTITUTE, we see that the space is the most important text preprocessing feature employed by W2V and FTX. If space is restored, there is only a small loss of performance compared to A/B-ORIGINAL. So text preprocessing heuristics other than whitespace tokenization in a narrow definition of the term (e.g., downcasing) do not seem to play a big role, at least not for our entity typing task.
For tokenization-free embedding learning on random segmentation, there is almost no difference between original data (A/B-RANDOM) and permuted data (C/D-RANDOM). This confirms that our proposed learning method is insensitive to permutations and makes no use of text preprocessing heuristics.

Analysis of ngram embeddings
Table 2 (right) shows nearest neighbors of ten character ngrams, for the A-RANDOM space. Queries were chosen to contain only alphanumeric characters. To highlight the difference to symbol-based representation models, we restricted the search to 9-grams that contained a delimiter at positions 3, 4, 5, 6 or 7.
Lines 5-9 are cases of ambiguous or polysemous words that are disambiguated through "character context". "stem", "cell", "rear", "wheel", "crash", "land", "scripts", "through", "downtown" all have several meanings. In contrast, the meanings of "stem cell", "rear wheel", "crash land", "(write) scripts for" and "through downtown" are less ambiguous. A multiword recognizer may find the phrases "stem cell" and "crash land" automatically. But the examples of "scripts for" and "through downtown" show that what is accomplished here is not multiword detection, but a more general use of character context for disambiguation.
Line 10 shows that a 9-gram of "face-to-face" is the closest neighbor to a 9-gram of "facilitating". This demonstrates that form and meaning sometimes interact in surprising ways. Facilitating a meeting is most commonly done face-to-face. It is not inconceivable that form (the shared trigram "fac" or the shared fourgram "faci" in "facilitate" / "facing") is influencing meaning here in a way that also occurs historically in cases like "ear" 'organ of hearing' / "ear" 'head of cereal plant', originally unrelated words that many English speakers today intuit as one word.

Utilization: Tokenization-free representation of text

Methodology
The main text representation model that is based on ngram embeddings similar to ours is the bag-of-ngram model. A sequence of characters is represented by a single vector that is computed as the sum of the embeddings of all ngrams that occur in the sequence. In fact, this is what we did in the entity typing experiment. In most work on bag-of-ngram models, the sequences considered are words or phrases (see (Schuetze, 2016) for citations). In a few cases, the model is applied to longer sequences, including sentences and documents; e.g., (Schütze, 1992), (Wieting et al., 2016). The basic assumption of the bag-of-ngram model is that sequence information is encoded in the character ngrams and therefore a "bag-of" approach (which usually throws away all sequence information) is sufficient. The assumption is not implausible: for most bags of character sequences, there is only a single way of stitching them together to one coherent sequence, so in that case information is not necessarily lost (although this is likely when embeddings are added). But the assumption has not been tested experimentally.
Here, we propose position embeddings, character-ngram-based embeddings that more fully preserve sequence information. The simple idea is to represent each position as the sum of all ngrams that contain that position. When we set k_min = 3, k_max = 9, this means that the position is the sum of Σ_{3≤k≤9} k = 42 ngram embeddings (if all of these ngrams have embeddings, which generally will be true for some, but not for most positions). A sequence of n characters is then represented as a sequence of n such position embeddings.

POS  char  r = 1      r = 2      r = 3      r = 4      r = 5
2    e     wealthies  accolades  bestselle  bestselli  Billboard
3    s     estseller  wealthies  bestselli  accolades  bestselle
15   o     fortnight  afternoon  overnight  allowance  Saturdays
16   n     fortnight  afternoon  Saturdays  Wednesday  magazines
23   o     superhero  ntagraphi  adventure  Astonishi  bestselli
24   m     superhero  ntagraphi  anthology  Daredevil  Astonishi
29   o     anthology  paperback  superhero  Lovecraft  tagraphic
30   o     anthology  paperback  tagraphic  Lovecraft  agraphics
34   u     antagraph  agraphics  paperback  hardcover  ersweekly
35   b     ublishing  ublishers  ublicatio  antagraph  aperbacks

Table 3: Nearest ngram embeddings (rank r ∈ [1, 5]) of the position embeddings for "POS", the positions 2/3 (best), 15/16 (monthly), 23/24 (comic), 29/30 (book) and 34/35 (publications) in the Wikipedia excerpt "best-selling monthly comic book publications sold in North America".
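As a minimal sketch of this computation (assuming a dictionary from ngrams to vectors and skipping ngrams that have no embedding; the function name is hypothetical):

```python
def position_embeddings(text, embeddings, dim, k_min=3, k_max=9):
    """One vector per character position: the sum of the embeddings of all
    ngrams of length k_min..k_max that cover that position."""
    positions = [[0.0] * dim for _ in text]
    for k in range(k_min, k_max + 1):
        for start in range(len(text) - k + 1):
            vec = embeddings.get(text[start:start + k])
            if vec is None:  # ngram without a learned embedding
                continue
            for p in range(start, start + k):  # every position the ngram covers
                positions[p] = [a + b for a, b in zip(positions[p], vec)]
    return positions


# Toy check of the ngram count: give every ngram the embedding [1.0], so that
# each position embedding counts the ngrams that cover the position.
text = "abcdefghijklmnopqrstuvwxyz0123"
ones = {text[i:i + k]: [1.0]
        for k in range(3, 10)
        for i in range(len(text) - k + 1)}
counts = position_embeddings(text, ones, dim=1)
```

For a position far enough from both ends of the string, the count equals 3 + 4 + ... + 9 = 42; positions near the ends are covered by fewer ngrams.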

Experiments
We again use the embeddings corresponding to A-RANDOM in Table 2. We randomly selected 2,000,000 contexts of size 40 characters from Wikipedia. We then created a noise context for each of the 2,000,000 contexts by replacing one character at position i (15 ≤ i ≤ 25, uniformly sampled) with space (probability p = .5) or a random character otherwise. Finally, we selected 1000 noise contexts randomly and computed their nearest neighbors among the 4,000,000 contexts (excluding the noise query). We did this in two different conditions: for a bag-of-ngram representation of the context (sum of all character ngrams) and for the concatenation of 11 position embeddings, those between 15 and 25. Our evaluation measure is mean reciprocal rank of the clean context corresponding to the noise context. This simulates a text denoising experiment: if the clean context has rank 1, then the noisy context can be corrected. Table 4 shows that sequence-preserving position embeddings perform better than bag-of-ngram representations. Table 5 shows an example of a context in which position embeddings did better than bag-of-ngrams, demonstrating that sequence information is lost by bag-of-ngram representations, in this case the exact position of "Seahawks". Table 3 gives further intuition about the type of information position embeddings contain, showing the ngram embeddings closest to selected position embeddings; e.g., "estseller" (the first 9-gram on the line numbered 3 in the table) is closest to the embedding of position 3 (corresponding to the first "s" of "best-selling"). The kNN search space is restricted to alphanumeric ngrams.

       bag-of-ngram   position embeddings
MRR    .64            .76

Table 4: Mean reciprocal rank (MRR) of the clean context in the denoising experiment.

Table 6: Cosine similarity of ngrams that cross word boundaries and disambiguate polysemous words. The tables show three disambiguating ngrams for "exchange" and "rates" that have different meanings as indicated by low cosine similarity. In phrases like "floating exchange rates" and "historic exchange rates", disambiguating ngrams overlap. Parts of the word "exchange" are disambiguated by preceding context (ic@exchang, ing@exchan) and parts of "exchange" provide context for disambiguating "rates" (xchange@ra).
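The corruption step and the evaluation measure can be sketched as follows. The sketch assumes 0-based positions and a lowercase ASCII alphabet for the random replacement character; both are choices made for the example, and the function names are hypothetical.

```python
import random
import string


def make_noise_context(context, rng, lo=15, hi=25, p_space=0.5,
                       alphabet=string.ascii_lowercase):
    """Replace one character at a position sampled uniformly from lo..hi with
    a space (probability p_space) or with a random character otherwise."""
    i = rng.randint(lo, hi)
    replacement = " " if rng.random() < p_space else rng.choice(alphabet)
    return context[:i] + replacement + context[i + 1:], i


def mean_reciprocal_rank(ranks):
    """ranks[q] is the 1-based rank of the clean context for noisy query q."""
    return sum(1.0 / r for r in ranks) / len(ranks)


rng = random.Random(0)
clean = "best-selling monthly comic book publicat"  # a 40-character context
noisy, position = make_noise_context(clean, rng)
mrr = mean_reciprocal_rank([1, 2, 4])  # ranks invented for illustration
```

An MRR of 1 would mean that the clean context is always the nearest neighbor of its noisy version.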

Discussion
Single vs. multiple segmentation. The motivation for multiple segmentation is exhaustive coverage of the space of possible segmentations. An alternative approach would be to attempt to find a single optimal segmentation. Our intuition is that in many cases overlapping segments contain complementary information. Table 6 gives an example. Historic exchange rates are different from floating exchange rates and this is captured by the low similarity of the ngrams ic@exchang and ing@exchan. Also, the meaning of "historic" and "floating" is noncompositional: these two words take on a specialized meaning in the context of exchange rates. The same is true for "rates": its meaning is not its general meaning in the compound "exchange rates". Thus, we need a representation that contains overlapping segments, so that "historic" / "floating" and "exchange" can disambiguate each other in the first part of the compound and "exchange" and "rates" can disambiguate each other in the second part of the compound. A single segmentation cannot capture these overlapping ngrams.

Table 5: [...] Table 4. "rep. space" = "representation space". We want to correct the error in the corrupted "noise" context (line 2) and produce "correct" (line 1). The nearest neighbor to line 2 in position-embedding space is the correct context (line 3, r = 1). The nearest neighbor to line 2 in bag-of-ngram space is incorrect (line 4, r = 1) because the precise position of "Seahawks" in the query is not encoded. The correct context in bag-of-ngram space is instead at rank r = 6 (line 5). "similarity" is average cosine (over eleven position embeddings) for position embeddings.
What text-type are tokenization-free approaches most promising for? The reviewers thought that language and text-type were badly chosen for this paper. Indeed, a morphologically complex language like Turkish and a noisy text-type like Twitter would seem to be better choices for a paper on robust text representation.
However, robust word representation methods like FTX are effective for within-token generalization, in particular, effective for both complex morphology and OOVs. If linguistic variability and noise only occur on the token level, then a tokenization-free approach has fewer advantages.
On the other hand, the foregoing discussion of cross-token regularities and disambiguation applies to well-edited English text as much as it does to other languages and other text-types, as the example of "exchange" shows (which is disambiguated by prior context and provides disambiguating context to following words) and as is also exemplified by lines 5-9 in Table 2 (right).
Still, this paper does not directly evaluate the different contributions that within-token character ngram embeddings vs. cross-token character ngram embeddings make, so this is an open question. One difficulty is that few corpora are available that allow the separate evaluation of whitespace tokenization errors; e.g., OCR corpora generally do not distinguish a separate class of whitespace tokenization errors.
Position embeddings vs. phrase/sentence embeddings. Position embeddings may seem to stand in opposition to phrase/sentence embeddings. For many tasks, we need a fixed length representation of a longer sequence; e.g., sentiment analysis models compute a fixed-length representation to classify a sentence as positive / negative.
To see that position embeddings are compatible with fixed-length embeddings, observe first that, in principle, there is no difference between word embeddings and position embeddings in this respect. Take a sequence that consists of, say, 6 words and 29 characters. The initial representation of the sentence has length 6 for word embeddings and length 29 for position embeddings. In both cases, we need a model that reduces the variable length sequence into a fixed length vector at some intermediate stage and then classifies this vector as positive or negative. For example, both word and position embeddings can be used as the input to an LSTM whose final hidden unit activations are a fixed length vector of this type.
So assessing position embeddings is not a question of variable-length vs. fixed-length representations. Word embeddings give rise to variable-length representations too. The question is solely whether the position-embedding representation is a more effective representation.
A more specific form of this argument concerns architectures that compute fixed-length representations of subsequences on intermediate levels, e.g., CNNs. The difference between position-embedding-based CNNs and word-embedding-based CNNs is that the former have access to a vastly increased range of subsequences, including substrings of words (making it easier to learn that "exchange" and "exchanges" are related) and cross-token character strings (making it easier to learn that "exchange rate" is noncompositional). Here, the questions are: (i) how useful are subsequences made available by position embeddings and (ii) is the increased level of noise and decreased efficiency caused by many useless subsequences worth the information gained by adding useful subsequences.

Independence of training and utilization. We note that our proposed training and utilization methods are completely independent. Position embeddings can be computed from any set of character ngram embeddings (including FTX) and our character ngram learning algorithm could be used for applications other than position embeddings, e.g., for computing word embeddings.
Context-free vs. context-sensitive embeddings. Word embeddings are context-free: a given word w like "king" is represented by the same embedding independent of the context in which w occurs. Position embeddings are context-free as well: if the maximum size of a character ngram is k_max, then the position embedding of the center of a string s of length 2k_max − 1 is the same independent of the context in which s occurs, since every ngram of length at most k_max that covers the center lies entirely within s.
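A minimal sketch of this property follows. It assumes that the position embedding of position i is the sum of the embeddings of all character ngrams of length k_min to k_max that cover i; the ngram embeddings are random placeholders, and K_MIN, K_MAX, ngram_embedding and position_embedding are hypothetical names and values chosen for illustration.

```python
import random

DIM = 4              # hypothetical embedding size
K_MIN, K_MAX = 1, 3  # small ngram range so the example stays readable


def ngram_embedding(ngram):
    """Deterministic random vector; a placeholder for a trained ngram embedding."""
    rng = random.Random(ngram)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def position_embedding(text, i):
    """Sum of the embeddings of all ngrams (K_MIN..K_MAX) that cover position i.

    Assumption: summation is used as the composition function.
    """
    total = [0.0] * DIM
    for k in range(K_MIN, K_MAX + 1):
        for start in range(i - k + 1, i + 1):
            if start < 0 or start + k > len(text):
                continue  # ngram would extend past the text boundary
            vec = ngram_embedding(text[start:start + k])
            total = [t + v for t, v in zip(total, vec)]
    return total


s = "chang"  # length 2 * K_MAX - 1 = 5
context_a = "exchange rate"
context_b = "unchanged"

# Index of the center character of s within each context.
center_a = context_a.index(s) + K_MAX - 1
center_b = context_b.index(s) + K_MAX - 1

emb_a = position_embedding(context_a, center_a)
emb_b = position_embedding(context_b, center_b)
print(emb_a == emb_b)
```

The two embeddings coincide because the ngrams covering the center of s are the same in both contexts. Positions closer to the edge of s are covered by ngrams that reach into the surrounding text, so their embeddings generally differ between contexts.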
It is conceivable that text representations could be context-sensitive. For example, the hidden states of a character language model have been used as a kind of nonsymbolic text representation (Chrupala, 2013; Evang et al., 2013; Chrupala, 2014) and these states are context-sensitive. However, such models will in general be a second level of representation; e.g., the hidden states of a character language model generally use character embeddings as the first level of representation. Conversely, position embeddings can also be the basis for a context-sensitive second-level text representation. We have to start somewhere when we represent text. Position embeddings are motivated by the desire to provide a representation that can be computed easily and quickly (i.e., without taking context into account), but that on the other hand is much richer than the symbolic alphabet.
Processing text vs. speech vs. images. Gillick et al. (2016) write: "It is worth noting that noise is often added . . . to images . . . and speech where the added noise does not fundamentally alter the input, but rather blurs it. [bytes allow us to achieve] something like blurring with text." It is not clear to what extent blurring on the byte level is useful; e.g., if we blur the bytes of the word "university" individually, then it is unlikely that the noise generated is helpful in, say, providing good training examples in parts of the space that would otherwise be unexplored. In contrast, the text representation we have introduced in this paper can be blurred in a way that is analogous to images and speech. Each embedding of a position is a vector that can be smoothly changed in every direction. We have shown that the similarity in this space gives rise to natural variation.
Prospects for completely tokenization-free processing. We have focused on whitespace tokenization and proposed a whitespace-tokenization-free method that computes embeddings of higher quality than tokenization-based methods. However, there are many properties of edited text beyond whitespace tokenization that a complex rule-based tokenizer exploits. In a small exploratory experiment, we replaced all nonalphanumeric characters with whitespace and repeated experiment A-ORIGINAL for this setting. This results in an F1 of .593, better by .01 than the best tokenization-free method. This illustrates that there is still a lot of work to be done before we can obviate the need for tokenization.
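The preprocessing step of this exploratory experiment can be sketched as below. The sketch assumes that "alphanumeric" means ASCII letters and digits, which the description above does not specify, and the function name is hypothetical.

```python
import re

# Assumption: "alphanumeric" is taken to mean ASCII letters and digits.
NON_ALNUM = re.compile(r"[^0-9A-Za-z]")


def replace_non_alphanumeric(text):
    """Replace every nonalphanumeric character with a single space."""
    return NON_ALNUM.sub(" ", text)


raw = "rate: 1.5%"
cleaned = replace_non_alphanumeric(raw)
print(repr(cleaned))    # punctuation becomes whitespace
print(cleaned.split())  # whitespace tokenization of the result
```

After this replacement, whitespace tokenization also separates words from punctuation, which approximates part of what a rule-based tokenizer does.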

Conclusion
We introduced the first generic text representation model that is completely nonsymbolic, i.e., it does not require the availability of a segmentation or tokenization method that identifies words or other symbolic units in text. This is true for the training of the model as well as for applying it when computing the representation of a new text. In contrast to prior work that has assumed that the sequence-of-character information captured by character ngrams is sufficient, position embeddings also capture sequence-of-ngram information. We showed that our model performs better than prior work on entity typing and text denoising.

Figure 1: The graph shows how many different character ngrams (k_min = 3, k_max = 10) occur in the first n bytes of the English Wikipedia for symbolic (tokenization-based) vs. nonsymbolic (tokenization-free) processing. The number of ngrams is an order of magnitude larger in the nonsymbolic approach. We counted all segments, corresponding to m = ∞. For the experiments in the paper (m = 50), the number of nonsymbolic character ngrams is smaller.