Texar: A Modularized, Versatile, and Extensible Toolbox for Text Generation

We introduce Texar, an open-source toolkit that aims to support the broad set of text generation tasks. Unlike many existing toolkits that are specialized for specific applications (e.g., neural machine translation), Texar is designed to be highly flexible and versatile. This is achieved by abstracting the common patterns underlying the diverse tasks and methodologies, creating a library of highly reusable modules and functionalities, and enabling arbitrary model architectures and various algorithmic paradigms. These features make Texar particularly suitable for technique sharing and generalization across different text generation applications. The toolkit places a strong emphasis on extensibility and modularized system design, so that components can be freely plugged in or swapped out. We conduct extensive experiments and case studies to demonstrate the use and advantages of the toolkit.


Introduction
Text generation spans a broad set of natural language processing tasks that aim to produce natural language from input data or machine representations. Such tasks include machine translation (Brown et al., 1990; Bahdanau et al., 2014), dialog systems (Williams and Young, 2007; Serban et al., 2016; Tang et al., 2019), text summarization (Hovy and Lin, 1998), text paraphrasing and manipulation (Madnani and Dorr, 2010; Lin et al., 2019), and more. Recent years have seen rapid progress in this active area, in part due to the integration of modern deep learning approaches in many of the tasks. On the other hand, considerable research effort is still needed to improve techniques and enable real-world applications.
A few remarkable open-source toolkits have been developed (section 2), which largely focus on one or a few specific tasks or algorithms. Emerging applications and approaches, in contrast, are often developed by individual teams in a more ad-hoc manner, which can easily result in hard-to-maintain custom code and duplicated effort.
Despite their variety, text generation tasks have many common properties and share a set of key underlying techniques, such as neural encoder-decoders (Sutskever et al., 2014), attention (Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017), memory networks (Sukhbaatar et al., 2015), adversarial methods (Goodfellow et al., 2014; Lamb et al., 2016), reinforcement learning (Ranzato et al., 2015; Tan et al., 2018), and structured supervision, as well as optimization techniques, data pre-processing and result post-processing, evaluation, and so forth. These techniques are often combined in various ways to tackle different problems. Figure 1 summarizes examples of various model architectures.
It is therefore highly desirable to have an open-source platform that unifies the development of these diverse yet closely related applications, backed by clean and consistent implementations of the core algorithms. Such a platform would enable reuse of common components; standardize design, implementation, and experimentation; foster reproducibility; and, importantly, encourage technique sharing among tasks, so that an algorithmic advance developed for a specific task can quickly be evaluated and generalized to many others.
This paper presents Texar, an open-source toolkit that provides researchers and practitioners a flexible framework for building their models. Texar has two versions, built upon TensorFlow (tensorflow.org) and PyTorch (pytorch.org), respectively, with the same uniform design. At the core of Texar's design is a principled anatomy of extensive text generation models and learning algorithms, which subsumes the diverse cases in Figure 1 and beyond, enabling a unified formulation and consistent implementation. Texar emphasizes three key properties:

Versatility.
Texar contains a wide range of features and functionalities for 1) arbitrary model architectures built as combinations of encoders, decoders, embedders, discriminators, memories, and many other modules; and 2) different modeling and learning paradigms such as sequence-to-sequence models, probabilistic models, adversarial methods, and reinforcement learning. On this basis, both workhorse and cutting-edge solutions to the broad spectrum of text generation tasks are either already included or can be easily constructed.

Modularity.
Users can construct models at a high conceptual level just like assembling building blocks. It is convenient to plug in or swap out modules, configure rich module options, or even switch between distinct modeling paradigms. For example, switching from adversarial learning to reinforcement learning involves only minimal code changes (e.g., Figure 4). Modularity makes Texar particularly suitable for fast prototyping and experimentation.

Extensibility.
The toolkit provides interfaces ranging from simple configuration files to full library APIs. Users with different needs and levels of expertise are free to choose different interfaces for the appropriate degree of programmability and internal accessibility. The library APIs are fully compatible with the native TensorFlow/PyTorch interfaces, which allows seamless integration of user-customized modules and enables the toolkit to take advantage of the vibrant open-source community by effortlessly importing any external components as needed.
Furthermore, Texar emphasizes well-structured code, clean documentation, rich tutorial examples, and distributed GPU training.

Related Work
Several toolkits exist that focus on one or a few specific tasks. For neural machine translation and related tasks, there are Tensor2Tensor (Vaswani et al., 2018), OpenNMT (Klein et al., 2017), and others. Differing from these task-focused tools, Texar aims to cover as many text generation tasks as possible. This goal of versatility poses unique design challenges.
On the other end of the spectrum, there are libraries for more general NLP or ML applications: AllenNLP (allennlp.org), GluonNLP (gluon-nlp.mxnet.io), and others are designed for broad NLP tasks in general, while Keras (keras.io) is for high conceptual-level programming without a specific task focus. In comparison, Texar focuses on the text generation sub-area, and provides a comprehensive set of modules and functionalities that are well-tailored and readily usable for the relevant tasks. For example, Texar provides rich text decoders with optimized interfaces that support over ten decoding methods (see section 3.3 for an example).

Structure and Design

The Design of Texar
Designing a versatile toolkit is challenging due to the large variety of text generation tasks and the fast-growing set of new models. We tackle the challenge by adopting a principled anatomy of the modeling and experimentation pipeline. Specifically, we break down the complexity of the rich tasks into three dimensions of variation, namely, varying data types/formats, arbitrary combinational model architectures and inference procedures, and diverse learning algorithms.

Within this unified abstraction, each learning paradigm specifies one or multiple loss functions (e.g., cross-entropy loss, policy gradient loss), along with an optimization procedure that improves the losses:

$$\min_{\theta} \mathcal{L}(f_\theta, D),$$

where $f_\theta$ is the model that defines the model architecture and the inference procedure; $D$ is the data; $\mathcal{L}$ is the learning objectives (losses); and $\min$ denotes the optimization procedure. Note that the above can involve multiple losses imposed on different parts of the model (e.g., in adversarial learning).

Further, as illustrated in the right panel of Figure 2, we decouple learning, inference, and model architecture, forming abstraction layers of learning-inference-architecture. That is, different architectures implement the same set of inference procedures and expose the same interfaces, so that learning algorithms can call the proper inference procedures as subroutines while staying agnostic to the underlying architecture and implementation details. For example, maximum likelihood learning uses teacher-forcing decoding (Mikolov et al., 2010); a policy gradient algorithm can invoke stochastic or greedy decoding (Ranzato et al., 2015); and adversarial learning can use either stochastic decoding for policy gradient-based updates (Yu et al., 2017) or Gumbel-softmax reparameterized decoding (Jang et al., 2016) for direct gradient back-propagation. Users can thus switch between learning algorithms for the same model by simply specifying the corresponding inference strategy and plugging in a new learning module, without adapting the model architecture (see section 3.3 for a running example).
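As a concrete (and simplified) illustration of this decoupling, the sketch below builds one decoder and drives it with two different inference procedures; a learning paradigm then only attaches its own loss. The snippet paraphrases the TensorFlow version of Texar's API; module names follow the paper's figures, but exact signatures, hyperparameter names, and the special token ids are assumptions.

```python
import tensorflow as tf
import texar.tf as tx  # TensorFlow version of Texar (API details assumed)

vocab_size = 10000
batch = {  # hypothetical data batch, fed at run time
    'text_ids': tf.placeholder(tf.int64, [None, None]),
    'length': tf.placeholder(tf.int32, [None]),
}

# Architecture layer: one embedder and one decoder, defined once.
embedder = tx.modules.WordEmbedder(vocab_size=vocab_size, hparams={'dim': 512})
decoder = tx.modules.BasicRNNDecoder(vocab_size=vocab_size)

# Inference layer: the same decoder exposes multiple decoding strategies.
# (1) Teacher-forcing decoding, used by maximum-likelihood learning.
tf_outputs, _, _ = decoder(
    decoding_strategy='train_greedy',
    inputs=embedder(batch['text_ids'][:, :-1]),
    sequence_length=batch['length'] - 1)

# (2) Stochastic decoding, used e.g. by policy-gradient learning.
start_tokens = tf.zeros_like(batch['text_ids'][:, 0], dtype=tf.int32)  # BOS=0 (assumed)
sample_outputs, _, sample_lengths = decoder(
    decoding_strategy='infer_sample',
    embedding=embedder,
    start_tokens=start_tokens,
    end_token=1,  # EOS=1 (assumed)
    max_decoding_length=60)

# Learning layer: a paradigm is specified by the loss it imposes on the
# outputs, e.g., cross-entropy for maximum-likelihood learning.
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=batch['text_ids'][:, 1:],
    logits=tf_outputs.logits,
    sequence_length=batch['length'] - 1)
train_op = tx.core.get_train_op(mle_loss)
```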

Assemble Arbitrary Model Architectures
We develop an extensive set of frequently used modules (e.g., various encoders, decoders, embedders, and classifiers). Crucially, Texar allows free concatenation of these modules in order to assemble arbitrary model architectures. Such concatenation can be done by directly interfacing two modules, or through an intermediate connector module that provides general functionalities such as reshaping, reparameterization, and sampling.
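For instance, a connector can transform an encoder's final state into the initial state a decoder expects, so the two modules compose freely. Below is a minimal sketch using Texar TF module names; call signatures and default hyperparameters are assumptions.

```python
import tensorflow as tf
import texar.tf as tx  # API details assumed

vocab_size = 10000
src_ids = tf.placeholder(tf.int64, [None, None])
src_lengths = tf.placeholder(tf.int32, [None])
tgt_ids = tf.placeholder(tf.int64, [None, None])
tgt_lengths = tf.placeholder(tf.int32, [None])

embedder = tx.modules.WordEmbedder(vocab_size=vocab_size, hparams={'dim': 256})
encoder = tx.modules.UnidirectionalRNNEncoder()
_, final_state = encoder(embedder(src_ids), sequence_length=src_lengths)

decoder = tx.modules.BasicRNNDecoder(vocab_size=vocab_size)

# The connector reshapes/transforms the encoder's final state into the
# structure required by the decoder's initial state (an MLP transform here).
connector = tx.modules.MLPTransformConnector(decoder.state_size)

outputs, _, _ = decoder(
    initial_state=connector(final_state),
    decoding_strategy='train_greedy',
    inputs=embedder(tgt_ids[:, :-1]),
    sequence_length=tgt_lengths - 1)
```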
Besides the flexibility of arbitrary assembly, it is critical for the toolkit to provide proper abstractions that relieve users from excessive concern with low-level implementations. Texar provides two major types of user interfaces at different abstraction levels, i.e., YAML configuration files and full Python library APIs. Figure 3 shows an example of specifying an attentional encoder-decoder model through the two interfaces, respectively.
A configuration file passes hyperparameters to a predefined model template, which instantiates the model for training and evaluation. Text highlighted in blue in the figure (left panel) specifies the names of the modules to use. Most hyperparameters have sensible default values, so users only have to specify values that differ from the defaults. This interface is easily understandable for non-expert users, and has also been adopted in other tools (e.g., Klein et al., 2017).
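Within the library, such configurations are plain hierarchical dictionaries, which the YAML file maps onto. The hypothetical snippet below overrides only a few values; the key names are illustrative of Texar's hyperparameter format and should be checked against the documentation.

```python
# Hypothetical decoder configuration: only values that differ from the
# defaults are given; everything else falls back to sensible defaults.
decoder_hparams = {
    'rnn_cell': {
        'type': 'LSTMCell',                   # which cell module to use
        'kwargs': {'num_units': 512},
        'dropout': {'output_keep_prob': 0.5},
    },
    'attention': {
        'type': 'LuongAttention',             # attention mechanism, by name
    },
}
```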
Library APIs offer clean function calls. Users can efficiently build any desired pipelines at a high conceptual level. Power users have the option to access the full internal states for low-level manipulations. Texar modules support convenient variable re-use. That is, each module instance creates its own sets of variables, and automatically re-uses them on subsequent calls. Hence TensorFlow variable scope is transparent to users.
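For example, calling the same embedder instance on two different inputs creates the underlying variables once and re-uses them on the second call. A minimal sketch (API details assumed):

```python
import tensorflow as tf
import texar.tf as tx  # API details assumed

embedder = tx.modules.WordEmbedder(vocab_size=10000, hparams={'dim': 512})

src_ids = tf.placeholder(tf.int64, [None, None])
tgt_ids = tf.placeholder(tf.int64, [None, None])

# Both calls go through the same module instance: the embedding variables
# are created on the first call and automatically re-used on the second,
# with no explicit tf.variable_scope management by the user.
src_embeds = embedder(src_ids)
tgt_embeds = embedder(tgt_ids)
```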

Plug-in and Swap-out Modules
It is convenient to change from one modeling paradigm to another by simply plugging in or swapping out one or a few modules, or sometimes by merely changing a configuration parameter. For example, given the base code of an encoder-decoder model in Figure 3 (right panel), Figure 4 illustrates how one can switch between different learning paradigms by changing only Lines 14-19 of the original code (maximum-likelihood learning). In particular, Figure 4 shows adversarial learning and reinforcement learning, which invoke Gumbel-softmax decoding and random-sample decoding, respectively.
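In the spirit of Figure 4, the sketch below swaps the maximum-likelihood lines for a Gumbel-softmax adversarial setup while the decoder itself is untouched. Module names follow the Texar TF API; signatures, special token ids, and the temperature value are assumptions.

```python
import tensorflow as tf
import texar.tf as tx  # API details assumed

vocab_size = 10000
batch = {'text_ids': tf.placeholder(tf.int64, [None, None]),
         'length': tf.placeholder(tf.int32, [None])}
embedder = tx.modules.WordEmbedder(vocab_size=vocab_size, hparams={'dim': 512})
decoder = tx.modules.BasicRNNDecoder(vocab_size=vocab_size)
start_tokens = tf.zeros_like(batch['text_ids'][:, 0], dtype=tf.int32)  # BOS=0 (assumed)

# (a) Maximum-likelihood learning: teacher-forcing decoding + cross-entropy.
mle_outputs, _, _ = decoder(
    decoding_strategy='train_greedy',
    inputs=embedder(batch['text_ids'][:, :-1]),
    sequence_length=batch['length'] - 1)
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=batch['text_ids'][:, 1:],
    logits=mle_outputs.logits,
    sequence_length=batch['length'] - 1)

# (b) Adversarial learning: the same decoder, now driven by a Gumbel-softmax
# helper so that gradients can flow back from a discriminator.
gumbel_helper = tx.modules.GumbelSoftmaxEmbeddingHelper(
    embedder.embedding, start_tokens, end_token=1, tau=0.5)  # EOS=1, tau assumed
adv_outputs, _, _ = decoder(helper=gumbel_helper, max_decoding_length=60)
# adv_outputs.sample_id holds soft samples to be scored by a discriminator,
# whose loss then replaces mle_loss in the training op.
```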

Customize with Extensible Interfaces
Texar emphasizes extensibility and allows easy addition of customized or external modules without editing the Texar codebase. Specifically, with the YAML configuration file, users can directly insert their own modules by providing the Python import path to the module. For example, to use a customized RNN cell in the encoder, one can simply change Line 7 of Figure 3 (left panel) to type: path.to.MyCell, as long as MyCell has an interface compatible with the other parts of the model. Using customized modules with the library APIs is even more flexible, since the APIs are designed to be fully compatible with native TensorFlow/PyTorch programming interfaces.
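For illustration, a hypothetical MyCell that wraps a GRU cell could look as follows; once it is importable as path.to.MyCell, the configuration line above resolves to and instantiates it. The sketch assumes the TF-1.x RNNCell interface that RNN-based modules build on.

```python
import tensorflow as tf

class MyCell(tf.nn.rnn_cell.RNNCell):
    """A custom RNN cell (hypothetical). Anything exposing the standard
    RNNCell interface can be plugged into compatible encoder/decoder modules."""

    def __init__(self, num_units):
        super(MyCell, self).__init__()
        self._inner = tf.nn.rnn_cell.GRUCell(num_units)

    @property
    def state_size(self):
        return self._inner.state_size

    @property
    def output_size(self):
        return self._inner.output_size

    def call(self, inputs, state):
        # Custom per-step logic would go here; this sketch just delegates
        # to the wrapped GRU cell.
        return self._inner(inputs, state)
```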

Case Study: Transformer on Different Tasks
We present a case study showing that Texar can greatly reduce implementation effort and enable technique sharing among different tasks. The Transformer, first introduced by Vaswani et al. (2017), has greatly improved machine translation results and given rise to other successful models such as BERT for text embedding (Devlin et al., 2019) and GPT-2 for language modeling (Radford et al., 2018). Texar supports easy construction of these models and fine-tuning of pretrained weights. We can also deploy the Transformer components in various other tasks and obtain improved results.

Figure 4: Switching between different learning paradigms of a decoder involves modifying only Lines 14-19 of Figure 3 (maximum-likelihood learning). The same decoder is called with different decoding modes, and a discriminator or reinforcement learning agent is added as needed. (Left): the module structure of each paradigm; (Right): the respective code snippets. For adversarial learning in (b), a continuous Gumbel-softmax approximation (Jang et al., 2016) to the generated samples is used to enable gradient propagation from the discriminator to the decoder.

The first task we explore is variational autoencoder (VAE) language modeling (Bowman et al., 2015). We test two models: one with an LSTM RNN decoder, as traditionally used for the task, and one with a Transformer decoder. All other model configurations, including parameter size, are the same across the two models. Table 1 (top panel) shows that the Transformer VAE consistently improves over the LSTM VAE. With Texar, changing the decoder from an LSTM to a Transformer is achieved by modifying only 3 lines of code. It is also worth noting that building the VAE language model (including data reading, model construction, and optimization) on Texar takes only 70 lines of code (with each line under 80 characters). As a rough reference, a popular public TensorFlow implementation (Li, 2017) of the same model uses around 400 lines of code for the same parts (without a line-length limit).
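Concretely, the decoder swap amounts to replacing the decoder construction, roughly as in the following sketch; constructor arguments and hyperparameter names are assumptions to be checked against the Texar documentation, and the rest of the VAE code is unchanged.

```python
import texar.tf as tx  # API details assumed

vocab_size = 10000

# Before: a conventional LSTM RNN decoder for the VAE.
decoder = tx.modules.BasicRNNDecoder(
    vocab_size=vocab_size,
    hparams={'rnn_cell': {'type': 'LSTMCell', 'kwargs': {'num_units': 512}}})

# After: a Transformer decoder; the surrounding VAE code (data reading,
# latent-code computation, optimization) stays as it was.
decoder = tx.modules.TransformerDecoder(
    vocab_size=vocab_size,
    hparams={'num_blocks': 6, 'dim': 512})
```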
The second task is conversation generation. The dialog history is encoded with the HierarchicalRNNEncoder module, which is followed by a decoder that generates the response. We study the performance of a Transformer decoder vs. a conventional GRU RNN decoder. Table 1 (bottom panel) shows that the Transformer outperforms the GRU. Regarding implementation effort, the Texar code has around 100 lines, while the reference TensorFlow code (Zhao et al., 2017) involves over 600 lines.
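A rough sketch of this pipeline is given below; module names follow the Texar TF API, while the input layout and call signatures are assumptions.

```python
import tensorflow as tf
import texar.tf as tx  # API details assumed

vocab_size = 10000
# Dialog history as [batch, num_utterances, max_utterance_length] token ids.
history_ids = tf.placeholder(tf.int64, [None, None, None])

embedder = tx.modules.WordEmbedder(vocab_size=vocab_size, hparams={'dim': 300})

# Hierarchical encoding: a minor-level RNN encodes each utterance, and a
# major-level RNN encodes the resulting sequence of utterance vectors.
encoder = tx.modules.HierarchicalRNNEncoder()
enc_outputs, _ = encoder(embedder(history_ids))

# A decoder (a GRU RNN here; a TransformerDecoder is a drop-in alternative)
# then generates the response, conditioning on the dialog encoding.
decoder = tx.modules.BasicRNNDecoder(vocab_size=vocab_size)
```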