Latent Structure Models for Natural Language Processing

Latent structure models are a powerful tool for modeling compositional data, discovering linguistic structure, and building NLP pipelines. They are appealing for two main reasons: they allow incorporating structural bias during training, leading to more accurate models; and they allow discovering hidden linguistic structure, which provides better interpretability. This tutorial will cover recent advances in discrete latent structure models. We discuss their motivation, potential, and limitations, then explore in detail three strategies for designing such models: gradient approximation, reinforcement learning, and end-to-end differentiable methods. We highlight connections among all these methods, enumerating their strengths and weaknesses. The models we present and analyze have been applied to a wide variety of NLP tasks, including sentiment analysis, natural language inference, language modeling, machine translation, and semantic parsing. Examples and evaluation will be covered throughout. After attending the tutorial, a practitioner will be better informed about which method is best suited for their problem.


Description
Latent structure models are a powerful tool for modeling compositional data, discovering linguistic structure, and building NLP pipelines (Smith, 2011). Words, sentences, paragraphs, and documents represent the fundamental units in NLP, and their discrete, compositional nature is well suited to combinatorial representations such as trees, sequences, segments, or alignments. When available from human experts, such structured annotations (like syntactic parse trees or part-of-speech information) can help higher-level models perform or generalize better. However, linguistic structure is often hidden from practitioners, in which case it becomes useful to model it as a latent variable.
While it is possible to build powerful models that disregard linguistic structure almost completely (such as LSTMs and Transformer architectures), there are two main reasons why modeling it is desirable. First, incorporating structural bias during training can lead to better generalization, since it corresponds to a more informed and more appropriate prior. Second, discovering hidden structure provides better interpretability: this is particularly useful in conjunction with neural networks, whose typical architectures are not amenable to interpretation. The learnt structure offers valuable insight into how the model organizes and composes information.
This tutorial will cover recent advances in latent structure models in NLP. In the last couple of years, the general idea of hidden linguistic structure has been married to latent representation learning via neural networks. This has allowed powerful modern NLP models to learn to uncover, for example, latent word alignments or parse trees, jointly, in an unsupervised or semi-supervised fashion, from the signal of higher-level downstream tasks like sentiment analysis or machine translation. This avoids the need for preprocessing data with off-the-shelf tools (e.g., parsers, word aligners) and engineering features based on their outputs. It is also an alternative to techniques based on parameter sharing, transfer learning, multi-task learning, or scaffolding (Swayamdipta et al., 2018; Peters et al., 2018; Devlin et al., 2019; Strubell et al., 2018), as well as to techniques that incorporate structural bias directly in model design (Dyer et al., 2016; Shen et al., 2019).
The proposed tutorial is about such discrete latent structure models. We discuss their motivation, potential, and limitations, then explore in detail three strategies for designing such models:
• Reinforcement learning;
• Surrogate gradients;
• End-to-end differentiable methods.
A challenge with structured latent models is that they typically involve computing an "argmax" (i.e., finding a best-scoring discrete structure such as a parse tree) in the middle of a computation graph. Since this operation has zero gradients almost everywhere, gradient backpropagation cannot be used out of the box for training. The methods we cover in this tutorial differ in how they handle this issue.
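The zero-gradient problem can be seen in a toy, unstructured setting. The sketch below is only an illustration under stated assumptions: the names `argmax_one_hot`, `downstream_loss`, and `finite_difference_grad`, the three candidate "structures", and the linear downstream cost are all hypothetical and do not come from any of the cited works.

```python
def argmax_one_hot(scores):
    """Return a one-hot vector selecting the highest-scoring candidate."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def downstream_loss(scores, costs):
    """Hypothetical downstream loss: the cost of the selected candidate."""
    z = argmax_one_hot(scores)
    return sum(c * zi for c, zi in zip(costs, z))


def finite_difference_grad(scores, costs, eps=1e-4):
    """Approximate d(loss)/d(scores) by bumping each score by a small eps."""
    base = downstream_loss(scores, costs)
    grad = []
    for i in range(len(scores)):
        bumped = list(scores)
        bumped[i] += eps
        grad.append((downstream_loss(bumped, costs) - base) / eps)
    return grad


scores = [1.2, 0.3, -0.5]   # assumed scores for three candidate structures
costs = [2.0, 0.5, 1.0]     # assumed downstream cost of each candidate
grad = finite_difference_grad(scores, costs)
print(grad)  # small perturbations never change the selected candidate
```

Because a small perturbation of the scores leaves the selected candidate unchanged, the loss is piecewise constant in the scores and carries no learning signal for the scoring model.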
Reinforcement learning. In a stochastic computation graph, such methods seek the hidden discrete structures that minimize an expected loss on a downstream task, which is analogous to maximizing an expected reward in reinforcement learning with discrete actions. Estimated stochastic gradients are typically obtained with a combination of Monte Carlo sampling and the score function estimator (a.k.a. REINFORCE; Williams, 1992). Such estimators often suffer from instability and high variance, requiring care (Havrylov et al., 2019).
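As a minimal sketch of the score function estimator, assume the latent variable is a single categorical choice among three candidates, parameterized by a softmax over scores. The function names, the scores, and the per-candidate losses below are hypothetical; real structured models sample from far larger spaces and usually add variance reduction such as baselines.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(scores, losses):
    """Exact gradient of E_p[loss] w.r.t. the scores, by enumeration."""
    p = softmax(scores)
    expected = sum(pi * li for pi, li in zip(p, losses))
    return [pi * (li - expected) for pi, li in zip(p, losses)]


def score_function_gradient(scores, losses, num_samples, rng):
    """Monte Carlo estimate: average of loss(z) * d log p(z) / d scores."""
    p = softmax(scores)
    k = len(scores)
    grad = [0.0] * k
    for _ in range(num_samples):
        z = rng.choices(range(k), weights=p)[0]
        for j in range(k):
            # For a softmax, d log p(z) / d scores[j] = 1[z == j] - p[j].
            grad[j] += losses[z] * ((1.0 if j == z else 0.0) - p[j])
    return [g / num_samples for g in grad]


rng = random.Random(0)
scores = [0.5, 0.0, -0.5]   # assumed scores of three candidate structures
losses = [1.0, 3.0, 2.0]    # assumed downstream loss of each candidate
exact = exact_gradient(scores, losses)
estimate = score_function_gradient(scores, losses, 20000, rng)
print(exact)
print(estimate)
```

The estimate is unbiased but noisy: with few samples it can deviate substantially from the enumerated gradient, which is the variance issue noted above.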
Surrogate gradients. Such techniques usually involve approximating the gradient of a discrete, argmax-like mapping by the gradient of a continuous relaxation. Examples are the straight-through estimator (Bengio et al., 2013) and the structured projection of intermediate gradients optimization technique (SPIGOT; Peng et al., 2018). In stochastic graphs, surrogate gradients yield biased but lower-variance gradient estimators compared to the score function estimator. Related is the Gumbel softmax (Jang et al., 2017; Maddison et al., 2017; Choi et al., 2018; Maillard and Clark, 2018), which uses the reparametrization trick and a temperature parameter to build a continuous surrogate of the argmax operation, which one can then differentiate through. Structured versions were recently explored by Corro and Titov (2019a,b). One limitation of straight-through estimators is that backpropagating with respect to the sample-independent means may cause discrepancies between the forward and backward pass, which biases learning.
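The two ideas above can be sketched in the unstructured case. This is an illustration under assumptions, not the formulation of any cited paper: the straight-through variant shown uses the softmax Jacobian in the backward pass (other variants use the identity), and all function names, scores, and the incoming gradient are hypothetical.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(scores):
    """Hard, discrete choice used in the forward pass."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def straight_through_grad(scores, grad_wrt_z):
    """Surrogate backward pass: pretend the forward pass was a softmax.

    Computes the vector-Jacobian product of softmax with the gradient
    that the downstream loss sends back to the one-hot vector z.
    """
    p = softmax(scores)
    dot = sum(pi * gi for pi, gi in zip(p, grad_wrt_z))
    return [pi * (gi - dot) for pi, gi in zip(p, grad_wrt_z)]


def gumbel_softmax_sample(scores, temperature, rng):
    """Return a relaxed sample and the hard sample it approximates."""
    noisy = []
    for s in scores:
        u = max(rng.random(), 1e-12)              # guard against log(0)
        noisy.append(s - math.log(-math.log(u)))  # add Gumbel(0, 1) noise
    return softmax(noisy, temperature), one_hot_argmax(noisy)


scores = [2.0, 1.0, 0.1]        # assumed scores
grad_wrt_z = [0.5, -1.0, 2.0]   # assumed gradient arriving from downstream
z = one_hot_argmax(scores)
st_grad = straight_through_grad(scores, grad_wrt_z)
relaxed, hard = gumbel_softmax_sample(scores, 0.5, random.Random(0))
print(z, st_grad)
print(relaxed, hard)
```

Lower temperatures push the relaxed sample closer to the hard one-hot sample, at the price of larger gradient magnitudes; the forward/backward mismatch in `straight_through_grad` is the source of the bias mentioned above.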
End-to-end differentiable approaches. Here, we directly replace the argmax by a continuous relaxation for which the exact gradient can be computed and backpropagated normally. Examples are structured attention networks and related work (Kim et al., 2017; Maillard et al., 2017; Liu and Lapata, 2018; Mensch and Blondel, 2018), which use marginal inference, or SparseMAP (Niculae et al., 2018a,b), a new inference strategy which yields a sparse set of structures. While the former is usually limited in that the downstream model can only depend on local substructures (not the entire latent structure), the latter allows combining the best of both worlds. Another line of work imbues structure into neural attention via sparsity-inducing priors (Martins and Astudillo, 2016; Niculae and Blondel, 2017; Malaviya et al., 2018).
This tutorial will highlight connections among all these methods, enumerating their strengths and weaknesses. The models we present and analyze have been applied to a wide variety of NLP tasks, including sentiment analysis, natural language inference, language modeling, machine translation, and semantic parsing. In addition, evaluations specific to latent structure recovery have been proposed (Nangia and Bowman, 2018; Williams et al., 2018). Examples and evaluation will be covered throughout the tutorial. After attending the tutorial, a practitioner will be better informed about which method is best suited for their problem.
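To illustrate a relaxation with an exact gradient, the sketch below implements sparsemax (Martins and Astudillo, 2016) in the unstructured case, next to softmax for comparison. It is a minimal pure-Python version written for this illustration, not the authors' code; the function names and the example scores are assumptions.

```python
import math


def softmax(scores):
    """Dense relaxation: every candidate receives nonzero probability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, s in enumerate(sorted_scores, start=1):
        cumulative += s
        # The k-th largest score is in the support if this condition holds;
        # the last k that satisfies it determines the threshold.
        if 1.0 + k * s > cumulative:
            threshold = (cumulative - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


def sparsemax_backward(scores, grad_wrt_p):
    """Exact vector-Jacobian product of sparsemax at `scores`."""
    p = sparsemax(scores)
    support = [i for i, pi in enumerate(p) if pi > 0.0]
    mean = sum(grad_wrt_p[i] for i in support) / len(support)
    return [grad_wrt_p[i] - mean if p[i] > 0.0 else 0.0
            for i in range(len(scores))]


scores = [1.0, 0.9, -1.0]   # assumed scores of three candidates
dense = softmax(scores)
sparse = sparsemax(scores)
grad = sparsemax_backward(scores, [1.0, 0.0, 5.0])
print(dense)   # all entries strictly positive
print(sparse)  # the low-scoring candidate gets exactly zero probability
print(grad)    # gradient is zero outside the support
```

Unlike the surrogate approaches, the backward pass here is the true derivative of the forward mapping, so there is no forward/backward mismatch; the sparse output also makes it possible to inspect which candidates the model actually uses.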

Type of Tutorial & Relationship to Recent Tutorials
The proposed tutorial mixes the introductory and cutting-edge types. It will offer a gentle introduction to recent advances in structured modeling with discrete latent variables, which were not previously covered in any ACL/EMNLP/IJCNLP/NAACL related tutorial. The closest related topics covered in recent tutorials at NLP conferences are:
• Variational inference and deep generative models (Aziz and Schulz, 2018);
• Deep latent-variable models of natural language (Kim et al., 2018).
Our tutorial offers a complementary perspective in which the latent variables are structured and discrete, corresponding to linguistic structure. We will briefly discuss the modeling alternatives above in the final discussion.

Outline
Below we sketch an outline of the tutorial, which will take three hours, split into two halves by a 30-minute coffee break.

Breadth
We aim to provide the first unified perspective on multiple related approaches. Of the 31 referenced works, only 6 are co-authored by the presenters. In the outline, the first half presents exclusively work by other researchers and the second half presents a mix of our own work and other people's work.

Prerequisites and reading
The audience should be comfortable with:
• math: basics of differentiability.
• language: basic familiarity with the building blocks of structured prediction problems in NLP, e.g., syntax trees and dependency parsing.
• machine learning: familiarity with neural networks for NLP, basic understanding of backpropagation and computation graphs.

Instructors
André She has a master's degree in Information Retrieval from the Sofia University, where she was also a teaching assistant in Artificial Intelligence. She is part of the organizers of a shared task in SemEval 2019.
Nikita Nangia is a PhD student at New York University, advised by Samuel Bowman. She is working on building neural network systems in NLP that simultaneously do structured prediction and representation learning. This work focuses on finding structure in language without direct supervision and using it for semantic tasks like natural language inference and summarization.
Vlad Niculae is a postdoc in the DeepSPIN project at the Instituto de Telecomunicações in Lisbon, Portugal. His research aims to bring structure and sparsity to neural network hidden layers and latent variables, using ideas from convex optimization and motivations from natural language processing. He earned a PhD in Computer Science from Cornell University in 2018. He received the inaugural Cornell CS Doctoral Dissertation Award, and co-organized the NAACL 2019 Workshop on Structured Prediction for NLP (http://structuredprediction.github.io/SPNLP19).