F^2-Softmax: Diversifying Neural Text Generation via Frequency Factorized Softmax

Despite recent advances in neural text generation, encoding the rich diversity of human language remains elusive. We argue that sub-optimal text generation is mainly attributable to the imbalanced token distribution, which particularly misdirects the learning model when trained with the maximum-likelihood objective. As a simple yet effective remedy, we propose two novel methods, F^2-Softmax and MefMax, for balanced training even under a skewed frequency distribution. MefMax assigns tokens uniquely to frequency classes, aiming to group tokens with similar frequencies and to equalize the frequency mass between classes. F^2-Softmax then decomposes the probability distribution of the target token into a product of two conditional probabilities of (i) the frequency class, and (ii) the token within the target frequency class. Models then learn more uniform probability distributions, because each is confined to a subset of the vocabulary. Significant gains on seven relevant metrics suggest the superiority of our approach in improving not only the diversity but also the quality of generated texts.


Introduction
Neural text generation is one of the most extensively studied tasks in natural language processing (NLP), as it forms the basis for dialogue systems (Chen et al., 2017), machine translation (Chaudhary and Patel, 2018), and text summarization (Kryscinski et al., 2019). However, texts generated by existing methods are often monotonous or dull and do not fully reflect the rich diversity and expressiveness of human language (Welleck et al., 2020). In particular, models tend to overproduce words frequently appearing in the data while hardly utilizing informative words. Even pre-training on large corpora fails to resolve the issue (Holtzman et al., 2019).
Possible causes of text degeneration have been illuminated, such as defects specific to model architectures (Vig, 2018) or the discrepancy between the training data and the true distribution (Holtzman et al., 2018; Jiang et al., 2019). Recently, the emphasis has been placed on investigating flaws in the maximum likelihood objective (Holtzman et al., 2019). Concretely, likelihood training pays little attention to the top ranks of the target token probabilities (Welleck et al., 2020), and maximizing likelihood itself does not adequately reflect human language processing (Holtzman et al., 2019). As a result, with maximum likelihood-based training, models learn to produce tokens frequently appearing in the data even more often.
We argue, however, that the primary reason behind the sub-optimal performance of the likelihood objective is the imbalanced token distribution inherent in natural language. Natural language is extremely skewed in distribution: following Zipf's law (Zipf, 1949), the hundred most frequently used (top-100) words occupy nearly half of a typical corpus (Fagan and Gençay, 2011). Training a classifier on inherently imbalanced data with maximum likelihood estimation (MLE) leads to classification boundaries biased in favor of the majority classes (Khan et al., 2019). In other words, models struggle to learn from an imbalanced label (i.e., token) distribution (He et al., 2008b).
We hypothesize that text generation can be enriched by balancing out the training data distribution. To this end, we introduce F^2-Softmax (Fig. 1(B), Section 3.2), which factorizes the probability distribution of the target token into a product of two conditional probabilities of (i) the frequency class, and (ii) the token within the target frequency class. It ensures training over balanced data, since the frequency classes are designed to have a distribution close to uniform, and token distributions
within a class are confined to subsets of vocabularies grouped by similar frequencies. To enable this, all unique tokens are assigned to a frequency class prior to training by our novel mean efficiency maximization (MefMax; Fig. 1(A), Section 3.3). MefMax evaluates and maximizes the class-labeling performance with the normalized entropy (i.e., efficiency), making the probability distributions to be learned as uniform as possible.
We conduct extensive performance evaluations on seven relevant metrics that quantify the diversity and quality of generated texts. In terms of diversity, our approach significantly outperforms not only the MLE baseline (Radford et al., 2019) but also other diversity-promoting alternatives (Welleck et al., 2020; Jiang et al., 2019). We also achieve state-of-the-art results on most of the quality metrics.

Diversity-promoting Text Generation
In the field of neural text generation, prior studies take either a training-based approach or a decoding-based approach to promote diversity in the generated texts.
Training-based Approach. In dialogue generation, stimulating models to generate texts that share high mutual information with the contexts has been shown to improve the diversity of output tokens, by adding a maximum mutual information (MMI) constraint to the standard likelihood objective. Meanwhile, FACE (Jiang et al., 2019) dynamically weights the cross-entropy losses based on target token frequencies, to prevent excessive weight updates for some frequently used words. In another line of work on language modeling, text diversity has been promoted by a learning-to-cooperate framework in which multiple discriminators cooperate to reach a common goal (Holtzman et al., 2018). Also, the unlikelihood training strategy penalizes repetition with auxiliary loss terms (Welleck et al., 2020). Such works are orthogonal to ours, since F^2-Softmax focuses on decomposing the softmax function without employing an auxiliary loss or loss re-scaling.
Decoding-based Approach. One of the widely used decoding tactics for promoting the diversity and richness of texts is stochastic decoding. Top-k sampling stochastically samples the next token from the top-k candidates of the predicted probability distribution (Fan et al., 2018). Another pillar of stochastic decoding is nucleus sampling, which selects the next token from the top-p portion of the probability mass (Holtzman et al., 2019). Other studies include beam blocking (Paulus et al., 2017), in which the probabilities of tokens are set to zero if they would create repeating n-grams, and diverse beam search (Vijayakumar et al., 2018), which integrates dissimilarity terms into beam scores. Iterative beam search (Kulikov et al., 2019) enhances diverse beam search with multiple iterations of beam search over different search spaces. These techniques are agnostic to model architecture and training method, so our approach can be seamlessly combined with them.

Softmax Decomposition
Decomposing the softmax function has long been studied in language modeling. Goodman (2001) decomposed the softmax function using a two-level hierarchy, an idea later generalized to deeper hierarchies (Mnih and Hinton, 2009). Approaches to constructing softmax hierarchies have followed, such as utilizing word clusters obtained from k-means algorithms (Le et al., 2011) or implementing Huffman coding with word frequencies (Mikolov et al., 2013). Furthermore, dynamic programming has been applied to obtain an optimal set of word classes with minimal computational cost for calculating the softmax function (Zweig and Makarychev, 2013), and the same process has been streamlined to fit modern GPU environments (Grave et al., 2017). These techniques resemble ours in their use of softmax decomposition. However, our goal is fundamentally different: we aim to balance the data distribution during training, whereas previous approaches share the primary goal of reducing computational costs.

Imbalanced Classification
Our assignment of tokens to classes with balanced distributions shares a similar goal with work on imbalanced classification in the computer vision domain. One widely adopted technique for imbalanced classification is re-sampling, which includes removing examples from the majority classes (under-sampling) and adding samples to the minority classes (over-sampling) (Buda et al., 2018). Techniques for over-sampling include interpolating between neighboring samples (Chawla et al., 2002) and adaptively synthesizing samples (He et al., 2008a). Cost-sensitive learning dynamically re-weights costs based on sample difficulty (Dong et al., 2017) or the effective number of samples (Cui et al., 2018). Other studies of the data imbalance problem consider metric learning (Huang et al., 2016), knowledge transfer (Wang et al., 2017), and Bayesian estimation (Khan et al., 2019).

Maximum Likelihood
The goal of language modeling is to learn a model p̂(x) that best approximates the true joint distribution p(x), where x = [x_1, ..., x_T] is a sequence of tokens and x_i ∈ V is a token from a vocabulary set V. In an auto-regressive manner, p(x) can be factorized into a product of conditional token probabilities: p(x) = Π_t p(x_t | x_{<t}). The conventional training approach is to maximize the log-likelihood of a sequence x:

L_MLE(p̂) = Σ_{t=1}^{T} log p̂(x_t | x_{<t}).  (1)
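As a concrete illustration of Eq. (1), the following is a minimal sketch (the helper name is ours, not from the paper) of the sequence-level negative log-likelihood that MLE training minimizes:

```python
import math

def mle_loss(token_probs):
    """Negative log-likelihood of a sequence (Eq. 1, negated for minimization).

    token_probs[t] is the model's probability p(x_t | x_<t) assigned to the
    gold token at step t; maximizing likelihood equals minimizing this sum."""
    return -sum(math.log(p) for p in token_probs)

# A model certain of every gold token incurs zero loss, while tokens assigned
# low probability (e.g., rare tokens) dominate the loss.
loss = mle_loss([0.9, 0.5, 0.1])
```

This also hints at the imbalance problem discussed above: frequent tokens receive many gradient updates simply because they appear in many terms of the sum.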

F^2-Softmax
We propose to factorize the posterior p̂(x_t | x_{<t}) into a product of two conditional probabilities:

p̂(x_t | x_{<t}) = p̂_1(c_t | x_{<t}) · p̂_2(x_t | c_t, x_{<t}),  (2)

where c_t ∈ C denotes the frequency class label assigned to the token x_t according to the token's global frequency in the corpus, and C is the set of frequency classes. Following Eq. (2), the updated objective L_{F^2}(p̂) is then formulated as:

L_{F^2}(p̂) = Σ_{t=1}^{T} [ log p̂_1(c_t | x_{<t}) + log p̂_2(x_t | c_t, x_{<t}) ].  (3)

The objective thus amounts to learning to classify the frequency class of the target token and to select the exact token given that class. The factorized probabilities p̂_1(c_t | x_{<t}) and p̂_2(x_t | c_t, x_{<t}) are defined using softmax functions:

p̂_2(x_t = i | c_t, x_{<t}) = exp(o_i^T h_{t-1}) / Σ_{i' ∈ V_{c_t}} exp(o_{i'}^T h_{t-1}),
p̂_1(c_t = j | x_{<t}) = exp(u_j^T h_{t-1}) / Σ_{j' ∈ C} exp(u_{j'}^T h_{t-1}),  (4)

where h_{t-1} is the hidden state of the context x_{<t}; o_i and u_j can be viewed as output embedding vectors for token i ∈ V_{c_t} and class j ∈ C, respectively, and V_{c_t} is the subset of the vocabulary assigned to the class c_t. Note that p̂_2(x_t | c_t, x_{<t}) is computed over the narrowed pool of tokens V_{c_t} rather than the full vocabulary set V. Since classes are differentiated by token frequency, tokens in the same class have similar frequencies. This ensures that the within-class frequency distribution of tokens is closer to uniform than that of the full vocabulary set.
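The factorization in Eqs. (2)-(4) can be sketched numerically as follows. This is a minimal NumPy illustration under assumed shapes, not the paper's implementation; `class_vocab`, `U`, and `O` are hypothetical names for the class-to-token mapping and the class/token output embedding matrices:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def f2_probability(h, U, O, class_vocab, c, token):
    """Eq. (2): p(x_t | x_<t) = p1(c_t | x_<t) * p2(x_t | c_t, x_<t).

    h: context hidden state (d,); U: class output embeddings (|C|, d);
    O: token output embeddings (|V|, d); class_vocab[c]: token ids in V_c."""
    p1 = softmax(U @ h)        # Eq. (4): distribution over frequency classes
    ids = class_vocab[c]       # narrowed pool V_c, not the full vocabulary
    p2 = softmax(O[ids] @ h)   # Eq. (4): softmax restricted to V_c
    return p1[c] * p2[ids.index(token)]
```

Because the classes partition the vocabulary, summing the factorized probability over every (class, token) pair recovers a proper distribution over the full vocabulary.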

MefMax for Class Optimization
The more uniform a label distribution is, the less likely decision boundaries are biased in favor of frequent classes. Therefore, to avoid the class imbalance problem (Buda et al., 2018) during training, we aim to maximize the uniformity of the frequency distributions of both (i) tokens within each class and (ii) the classes themselves (i.e., the sums of token frequencies within each class). This is formalized as:

maximize_{C, {V_c}}  (1 / (|C| + 1)) [ U({F(V_c)}_{c ∈ C}) + Σ_{c ∈ C} U({f_i}_{i ∈ V_c}) ],  (5)

where f_i is the corpus frequency of token i, F(V_c) = Σ_{i ∈ V_c} f_i is the frequency mass of class c, and U is a function that measures the uniformity of the frequency distribution of a given set. While any test of uniformity can serve as U, we adopt Shannon's entropy (Shannon, 1948), a decent proxy for measuring uniformity (Dudewicz and Van Der Meulen, 1981).
Normalized Entropy. Since the entropy depends on the number of samples, it cannot be used directly. To marginalize out the effect of the sample size, we use efficiency, also known as the normalized entropy (Wijesekera and Dillon, 1997), defined for a distribution P = (p_1, ..., p_n) as:

U(P) = H(P) / log n = ( -Σ_{i=1}^{n} p_i log p_i ) / log n.  (6)

It is the ratio of the entropy to the maximum entropy, i.e., the entropy the distribution would have if it were perfectly uniform. By applying the efficiency to Eq. (5), our objective is to find a set of classes and their vocabularies such that their mean efficiency is maximized.
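The efficiency measure of Eq. (6) is straightforward to compute from raw counts; a small sketch (the function name is ours):

```python
import math

def efficiency(freqs):
    """Normalized entropy (Eq. 6): H(P) / log n, ranging over [0, 1].

    freqs are raw counts; they are normalized to a distribution first."""
    total = sum(freqs)
    if total == 0:
        return 1.0  # degenerate: treat an empty set as trivially uniform
    probs = [f / total for f in freqs if f > 0]
    if len(probs) <= 1:
        return 1.0  # a single outcome is trivially uniform
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(probs))
```

A perfectly uniform set scores 1.0 while a heavily skewed set scores close to 0, regardless of how many tokens it contains, which is exactly why it can be compared across classes of different sizes.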
Greedy Approach. The remaining issue is the computational overhead since the cost for exploring all possible class boundaries grows exponentially with the vocabulary size, not to mention the challenge of finding the optimal number of classes.
To improve computational efficiency, we employ a straightforward greedy mechanism. It is based on the assumption that the mean efficiency is maximized when each class holds approximately the same total frequency mass. This assumption allows us to reduce the objective to optimizing the number of classes: given a frequency-sorted vocabulary set V and a candidate number of classes K, we divide the vocabulary so that each class holds the same 1/K share of the total frequency, and the optimal number of classes is the one that maximizes the mean efficiency. Algorithm 1 shows the complete pseudo-code. The computation time is linear in the vocabulary size.
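Algorithm 1 is not reproduced here, but the greedy search it describes can be sketched as follows. This is our own reading of the procedure (the function names and the `max_classes` cap are assumptions), not the authors' code:

```python
import math

def efficiency(freqs):
    """Normalized entropy of a list of raw counts (Eq. 6)."""
    total = sum(freqs)
    if total == 0:
        return 1.0
    probs = [f / total for f in freqs if f > 0]
    if len(probs) <= 1:
        return 1.0
    return -sum(p * math.log(p) for p in probs) / math.log(len(probs))

def mefmax(sorted_freqs, max_classes=10):
    """Greedy MefMax sketch: for each candidate K, cut the frequency-sorted
    vocabulary so each class holds ~1/K of the total frequency mass, then
    keep the K with the highest mean efficiency (class-mass distribution
    plus every within-class token distribution)."""
    total = sum(sorted_freqs)
    best_k, best_score, best_bounds = None, -1.0, None
    for k in range(1, max_classes + 1):
        bounds, mass = [], 0.0
        for i, f in enumerate(sorted_freqs):
            mass += f
            if len(bounds) < k - 1 and mass >= total * (len(bounds) + 1) / k:
                bounds.append(i + 1)      # close a class at this token
        bounds.append(len(sorted_freqs))
        classes = [sorted_freqs[s:e] for s, e in zip([0] + bounds[:-1], bounds)]
        score = (efficiency([sum(c) for c in classes])
                 + sum(efficiency(c) for c in classes)) / (len(classes) + 1)
        if score > best_score:
            best_k, best_score, best_bounds = k, score, bounds
    return best_k, best_bounds
```

Each candidate K requires only one pass over the sorted vocabulary, which is where the linear cost comes from.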

Decoupled Decoding
For the decoding stage, we decouple p̂_1 from p̂ in Eq. (2) by first selecting a single frequency class from p̂_1 and then generating the next token based on the selected class. For the target class c_t = i sampled from the distribution p̂_1(c_t | x_{<t}), the probability of the next token is defined as:

p̂(x_t | x_{<t}) := p̂_2(x_t | c_t = i, x_{<t}).  (7)

The target class can be selected in either a deterministic or a stochastic manner, depending on the decoding strategy. We found that the advantages of training on balanced data distributions can be fully leveraged by sequentially performing the tasks of frequency class prediction and token generation from the selected class.
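Decoupled decoding per Eq. (7) can be sketched as below, combining a stochastic class choice with top-k sampling inside the chosen class. This is an illustrative sketch with assumed tensor shapes and names, not the released implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoupled_decode(h, U, O, class_vocab, rng, top_k=3):
    """Eq. (7): sample a frequency class from p1, then sample the next
    token from p2 restricted to that class's vocabulary V_c."""
    p1 = softmax(U @ h)                  # distribution over frequency classes
    c = rng.choice(len(p1), p=p1)        # stochastic class selection
    ids = np.asarray(class_vocab[c])
    p2 = softmax(O[ids] @ h)             # token distribution within V_c only
    keep = np.argsort(p2)[-top_k:]       # top-k sampling inside the class
    probs = p2[keep] / p2[keep].sum()
    return int(ids[keep[rng.choice(len(keep), p=probs)]])
```

Replacing `rng.choice` over classes with an argmax gives the deterministic variant discussed in Section 5.2.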

Training and Evaluation Details
In this section, we describe the experimental details. Exact hyperparameter settings and data statistics are provided in the Appendix.

Datasets
Two datasets that differ in language and text type are selected. Wikitext-103 is a collection of English articles extracted from Wikipedia. Containing more than 100 million words, it is widely regarded as a benchmark dataset for language modeling. Melo-Lyrics is a Korean lyrics dataset we crawled from multiple music streaming websites, including Soribada, Genius, etc. Tokens in lyrics show a distribution largely different from that of general articles; for instance, repeated phrases are abundant in lyrics. It therefore provides an additional unique angle for model evaluation and comparison. The dataset includes approximately 478 thousand songs with 51 million words in total.

Model Architecture
We use the Transformer (Vaswani et al., 2017), an architecture well-suited for neural text generation (Lewis et al., 2019; Welleck et al., 2020). Specifically, we apply the Transformer decoder used in the GPT-2 model (Radford et al., 2019). Input texts are tokenized with byte pair encoding (Sennrich et al., 2016).

Baseline Models
As the baseline, we consider maximum likelihood estimation (MLE), the standard approach for text generation. Also compared are alternative models for promoting text diversity, including the recently proposed FACE (Jiang et al., 2019) and unlikelihood training (UL) (Welleck et al., 2020). FACE improves text diversity by dynamically scaling losses, while the latter employs auxiliary losses.

Training
Training is carried out in a single-GPU environment with 24GB of memory. For fair comparison, we use the same hyperparameters for all approaches, tuned on the validation losses of the MLE baseline. We additionally optimize the approach-specific hyperparameters of the diversity-promoting baselines.

Generation
We generate texts for evaluation by completing sequences from prefixes. Specifically, we batchify the test set, select the first 50 tokens of each batch as prefixes, and have the models generate a continuation of 100 tokens from each prefix. The experiments include both deterministic and stochastic decoding: we apply greedy search for deterministic decoding and top-k sampling for stochastic decoding.

Evaluation Metrics
Of the seven quantitative metrics we adopt to evaluate our model, Perplexity (Bengio et al., 2003), KL-Divergence (Kullback, 1997), and MS-Jaccard (Alihosseini et al., 2019) are closely related to the likelihood of generated texts. The other four metrics, namely Self-BLEU (Zhu et al., 2018), Distinct-n, Repetition (Holtzman et al., 2019), and Uniq (Welleck et al., 2020), measure text diversity. Perplexity quantifies the prediction difficulty of the next token and is regarded as a general performance metric for text generation. KL-Divergence measures the difference between two probability distributions; we use the unigram distributions of the generated texts and the test data. MS-Jaccard computes the similarity between the model's output and the ground truths by matching n-grams. Self-BLEU evaluates inter-text diversity by computing the BLEU (Papineni et al., 2002) score of each generated text with the other outputs as references. Distinct-n quantifies intra-text diversity based on the distinct n-grams in each text. Repetition examines whether texts are stuck in repetitive loops. Uniq quantifies the richness of a model's output by the number of unique generated tokens.
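As an example of the diversity metrics, Distinct-n can be computed as follows. This is the common formulation (unique n-grams over total n-grams); the exact tokenization used in the paper may differ:

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over a set of
    whitespace-tokenized generated texts; higher means more diverse."""
    unique, total = set(), 0
    for text in texts:
        toks = text.split()
        for i in range(len(toks) - n + 1):
            unique.add(tuple(toks[i:i + n]))
            total += 1
    return len(unique) / max(total, 1)
```

A fully repetitive output drives the score toward 0, while a text with no repeated n-grams scores 1.0.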

Quantitative Comparisons
In this section, we report the scores computed from fully-trained models on the two benchmarks, Wikitext-103 and Melo-Lyrics, compared against baselines. Table 1 shows the results of stochastic decoding, while the results of deterministic decoding are reported in Table 2.

Stochastic Decoding
Wikitext-103. The desired quality we aim for in a text generation model is to generate human-like texts with a wide spectrum of token choices. Coupled with top-k sampling, our F^2-Softmax achieves both goals, outperforming the baselines on nearly all metrics compared and closely approaching the human gold standard. As shown in Table 1(a), our model is particularly effective in capturing the token diversity of the corpus. Notably, F^2-Softmax significantly improves both Self-BLEU and Distinct, with relative gaps to the human gold standard of 3.4% and 3%, respectively. The corresponding gaps for the second-best methods are 6.5% (FACE) and 8.1% (UL-token+seq). A surprising result is that F^2-Softmax improves Rep by 50% over MLE, without an explicit penalty on repeating tokens. Another notable contribution is the 30% relative increase in unique tokens used for generation, from the previous state-of-the-art level of 10.6k to 15.7k, as shown by the Uniq metric. This level closely reflects the human use of 15.2k tokens.
In PPL, which reflects the likelihood of the generated texts, the diversity-promoting baselines tend to perform worse than MLE, presumably due to the trade-off between the diversity and the likelihood of texts. In contrast, F^2-Softmax shows the smallest performance drop in PPL. F^2-Softmax also improves KLD and MS-Jaccard by 59% and 19% over MLE, respectively, large margins compared to the other baselines.
Melo-Lyrics. Significant performance gains of F^2-Softmax are also observed in lyrics generation in Table 1(b). The diversity-promoting baselines display more severe degradation in PPL, KLD, and MS-Jaccard than on the Wikitext-103 dataset. In particular, their repetition levels differ significantly from that of the ground-truth data. We attribute this observation to the distinctive characteristics of lyrics, in which the same phrases are rhythmically repeated throughout a song in the form of a chorus or hook. Thus, on a lyrics dataset, forcing models to avoid reusing previously used tokens may adversely affect the likelihood of the generated texts. Since F^2-Softmax helps models diversify the output without an explicit regularization, models learn to generate well-chosen tokens from a diversified token pool of 25.2k (Uniq), with state-of-the-art performances in KLD, MS-Jaccard, Self-BLEU, Distinct, and Rep.

Deterministic Decoding
In deterministic decoding, no single method outperforms the others on all metrics. For example, UL-token+seq exhibits the best performance in Distinct and Rep while presenting the worst score in MS-Jaccard. Similarly, FACE improves Self-BLEU in exchange for performance losses in PPL and MS-Jaccard. Since we have seven metrics to compare, we conduct pair-wise evaluations between the compared methods, in which one method outperforms another when a majority of metrics score higher. Our approach beats the compared methods seven out of eight times (Table 9). This result supports the superiority of our approach regardless of the choice of decoding strategy. However, deterministic decoding does not see the same benefits obtained from stochastic decoding. We empirically find from our analyses that the argmax operation in deterministic settings may harm diversity when the target class probabilities are nearly evenly distributed. We plan to delve deeper into this issue to improve our approach further.

Learning Balanced Distribution
The characteristic markers of monotonous texts are an overproduction of frequently used tokens and an under-representation of rare tokens. To compare how models differentially generate frequent and rare tokens, we count the number of generated tokens in four predefined categories: frequent, medium, rare, and very rare. Tokens in each category are defined from the Wikitext-103 training set. Fig. 2 plots the resulting distributions. MLE produces frequent tokens 34% more often than humans, while under-producing rare and very rare tokens by 40%. The unlikelihood training baselines (UL-token, UL-token+seq) improve the diversity over MLE, but their results remain relatively far from the real distribution. FACE manages to regulate the disproportionate use of frequent tokens, but it fails to generate an adequate number of very rare tokens. The generation results of our F^2-Softmax are closest to the gold standard, with the differences in frequent and rare tokens falling within 6%.

Ablation Studies
In this section, we justify the pivotal roles of MefMax (Section 3.3) and the decoupled decoding strategy (Section 3.4). To assess their contributions to the final performance, we conduct a series of ablation tests with stochastic decoding.

Ablation on MefMax
MefMax finds a desirable number of classes, with the intent of balancing the frequency distribution of tokens between classes. Does MefMax achieve better generation results than possible variants of class assignment? We answer this question by comparing the final performances against two simpler variants of MefMax. We name the first variant fixed-eq-token, in which tokens are distributed in equal numbers to a fixed number of classes. The second variant, fixed-eq-freq, also assumes a fixed number of classes, but tokens are assigned so as to minimize the difference in frequency mass between classes. Fig. 3 presents the results. Clearly, fixed-eq-freq outperforms fixed-eq-token. This indicates that decomposing the softmax function without considering the data distribution (i.e., the frequency distribution) degrades both the likelihood and the token diversity, regardless of the number of classes. For fixed-eq-token, we find that models tend to over-predict the first class, which contains most of the total frequency mass, so that most tokens are generated from a fraction of the total vocabulary. This finding also supports the hypothesis that a balanced data distribution is an important factor in text generation.
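The two ablation variants can be sketched as simple partition rules over the frequency-sorted vocabulary; the function names are ours:

```python
def fixed_eq_token(sorted_freqs, k):
    """fixed-eq-token: k classes with (nearly) equal numbers of tokens,
    ignoring how much frequency mass each class ends up holding."""
    n = len(sorted_freqs)
    return [sorted_freqs[i * n // k:(i + 1) * n // k] for i in range(k)]

def fixed_eq_freq(sorted_freqs, k):
    """fixed-eq-freq: k classes whose total frequencies are as close to
    1/k of the overall mass as a greedy left-to-right cut allows."""
    total, classes, cur, mass = sum(sorted_freqs), [], [], 0.0
    for f in sorted_freqs:
        cur.append(f)
        mass += f
        if len(classes) < k - 1 and mass >= total * (len(classes) + 1) / k:
            classes.append(cur)
            cur = []
    classes.append(cur)
    return classes
```

On a skewed distribution the two rules differ sharply: equal token counts concentrate almost all frequency mass in the first class, which is exactly the failure mode observed for fixed-eq-token above.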
Assigning classes based on frequency (i.e., fixed-eq-freq) continues to improve MS-Jaccard and Self-BLEU until the number of classes reaches the class choice of MefMax. With more classes than that choice, performances either plateau or decrease, demonstrating that MefMax is capable of selecting the optimal class size. Interestingly, perplexity deteriorates significantly when the number of classes exceeds the optimal number decided by MefMax.

Ablation on Decoupled Decoding
Decoupled decoding, formalized in Eq. (7), fully leverages the benefits of F^2-Softmax by sequentially performing the frequency class prediction and token generation tasks. Table 3 reports the results of an ablation test on decoupled decoding, in which we replace it with the full posterior of Eq. (2). We observe significant performance gaps, indicating that decoupled decoding is an indispensable component of our model. Notably, even without decoupled decoding, our model outperforms the MLE baseline.

Qualitative Comparisons
To further examine the generation quality, we sample texts from the trained models. Table 4 compares texts generated from the same prefix. The results suggest that all trained models are capable of generating texts semantically coherent with the prefix. However, they differ in their rare-token usage patterns. While our F^2-Softmax exhibits the highest usage of rare tokens, we observe two issues with the baselines. The first is that models tend to repeat the same rare token across all sentences after its first appearance (MLE). The other is that the generated rare tokens are mostly pronouns (UL-token+seq). Unlike the baselines, F^2-Softmax utilizes the broadest range of rare tokens with significantly fewer, but more plausible, repetitions. Further, F^2-Softmax is adept at utilizing non-pronoun rare tokens, such as 'eccentric' or 'vanished'.

Table 4: Generated texts on the Wikitext-103 test set. A prefix from the first batch was selected to avoid cherry-picking. VR denotes the ratio of very rare tokens (see Section 4.3 for the definition) to the text length. All colored and bold-faced tokens indicate very rare tokens; green denotes repeated tokens, and red is reserved for non-pronoun words.

Conclusion
In this paper, we proposed F^2-Softmax, a simple but effective method for better learning the rich diversity of text. F^2-Softmax encourages models to diversify text generation by readjusting class formation and motivating models to learn a more balanced token distribution. Quantitative and qualitative analyses validate the diversity-promoting performance of our approach. Since it can readily replace the traditional likelihood objective, we believe F^2-Softmax has broad applicability. Future work thus involves extending the method to related tasks, such as machine translation and text summarization, and investigating the potential gains from transfer learning.

A.1 Melo-Lyrics Data Collection
Few datasets are publicly available for Korean text generation, and none has gained public consensus as a benchmark dataset, partly due to their small sample sizes. We collected lyrics data for three reasons. First, we can test our model on a language other than English. Second, a large number of songs and lyrics are available. Lastly, lyrics show distributional characteristics at odds with Wikipedia. The crawling session was held from 5 July 2019 to 29 July 2019. After crawling, we discarded lyrics containing more than ten foreign-language words, except for English; English was an exception since using English in Korean lyrics is natural and common. We also manually refined the lyrics by deleting noise, including advertisements and unnecessary meta-information about the lyrics writer or transcriber.
The remaining data consist of roughly 478 thousand lyrics with 51 million words. Indexed data can be downloaded from the URL below. We plan to release the raw data for research purposes only.

A.2 Data Statistics
The numbers of articles (songs) and words for the training, validation, and test splits are reported in Table 5.

B Hyperparameter Configurations
The detailed hyperparameter configurations used are listed.

C Transfer Learning
Pre-trained language models are widely used for downstream applications by adapting them to domain-specific distributions. Significant gains can be realized if F^2-Softmax is successfully applied to fine-tune pre-trained language models, as most pre-trained language models are trained with the standard softmax function. To validate our approach in transfer learning settings, we pre-train a language model on a news corpus consisting of 10GB of articles and fine-tune the model on the Melo-Lyrics dataset. The results are shown in Table 8. The transferred model improves both the quality and the diversity of the generation. However, it exhibits narrower vocabulary usage. We conjecture this arises from the vocabulary mismatch between the datasets used for pre-training and fine-tuning. We plan to investigate the vocabulary mismatch problem further.

Generated samples from Table 4 (Model | Text | VR %):

F^2-Softmax (VR 8.9): set a team record with eight National League runs scored in 110 innings pitched in his first three seasons. A right @-@ handed pitch attempts to make the pitching swing start, but his starting place as an extra @-@ major leagues team goes so far as to boost his chances of victory. He is best known for his efforts in both pitching and baseruster fastball ( known as the " Wild Wings " ) and his willingness to throw pitches into bats.

Prefix: = Hed PE = Hed PE, also known as ( hed ) Planet Earth and stylized as ( ), is an American rock band from Huntington Beach, California. Formed in 1994, the band performs a style of
MLE: music called the and has released two studio albums, The of which were released as the group's second studio album in 1995. The band has released four studio albums, two albums, two albums and two albums, as well as a live album of their own. = = History = = Hed Pe was born in Huntington Beach, California, to a family of American immigrants, in the town of Huntington Beach.

MLE (VR 4.6): The poem is the first of Eliot's poems to be published by the public. The first of these is " The Great ", the first poem to be translated into the English language. Eliot wrote in his diary of the poem, " The Great ", that the poem " was written in the morning and afternoon of the night ".

F^2-Softmax (VR 6.0): August 1969 with poetry about progress in salvation, and was praised by critics for its lyrical content. The poem was positively received by those who criticized its emphasis on contempation, lyrical content, and political themes, while Eliot's critical essays about the poetry and short story were criticized. The poem is set in light of the faith, which focuses on the desire for freedom, change in nature, and self @-@ esteem.

Prefix: = Olmec colossal heads = The Olmec colossal heads are at least seventeen monumental stone representations of human heads sculpted from large basalt boulders. The heads date from at least before 900 BC and are a
MLE (VR 3.4): type of stone carved from the walls of the Olmec colossal head of a small stone temple. It has a wide and narrow head and has a long and wide head that is a wide, broad face. The heads are long and long and triangular with rounded corners. The head is long and broad with long, curved corners. The legs are short and narrow, with short, straight sides, which are not wide.

F^2-Softmax: broad, highly curved structure that includes the uppermost skeleton of the king @-@ made objects. The body of Olmec colossal is elongated and consists of approximately eight layers on the obverse, followed by two large fragments. The jaws were composed of small bony projections formed by various iron skeletons and long thin joints. The skull is unusually similar to contemporary humans, with only one skeleton estimated at 0 @.@ 25 m ( 1 @.@ 5 ft ) across.