Classical Structured Prediction Losses for Seq2Seq Learning

This paper nicely summarizes various classical sequence-level loss functions from the structured prediction literature and shows that these methods work well for neural sequence prediction tasks. The authors demonstrate results on two machine translation datasets, IWSLT’14 De-En and WMT’14 En-Fr, and on the Gigaword abstractive summarization dataset.

Paper

Published at NAACL’18.

Classical Structured Prediction Losses for Sequence to Sequence Learning, Sergey Edunov, Myle Ott, Michael Auli, David Grangier, Marc’Aurelio Ranzato

Takeaways:

Motivation

A large body of work exists on using sequence-level objectives to train log-linear models for sequence prediction tasks such as NER. This paper picks up those objectives and evaluates them on two neural seq2seq tasks: machine translation and summarization.

Notation:

- $\textbf{x}$: source sequence; $\textbf{t} = (t_1, \cdots, t_n)$: target sequence.
- $\mathcal{U}(\textbf{x})$: a set of candidate output sequences generated by the model for $\textbf{x}$; $\textbf{u}$ denotes a candidate and $\textbf{u}^{*}$ the pseudo-reference.
- $s(\textbf{u}|\textbf{x})$: the model's score for candidate $\textbf{u}$; $p(\textbf{u}|\textbf{x})$: the corresponding probability.
- $c(\textbf{t}, \textbf{u})$: the cost of candidate $\textbf{u}$ with respect to the target $\textbf{t}$, e.g. $1 - \text{BLEU}$ for MT.

Model

The authors use Gated Convolutional Neural Networks (Gehring et al., 2017b) for both the encoder and the decoder.

Losses:

Token NLL:

This is the usual teacher-forcing (token-level maximum likelihood) training.

\[\mathcal{L}_{TokNLL} = - \sum_{i=1}^{n} \log p(t_i | t_1, \cdots, t_{i-1}, \textbf{x})\]
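Here is a minimal PyTorch sketch of this objective (my own code, not the paper's; tensor names like `logits` and `targets` are illustrative):

```python
import torch.nn.functional as F

def token_nll(logits, targets, pad_idx):
    """Token-level NLL with teacher forcing.

    logits:  (batch, seq_len, vocab) decoder outputs given the gold prefix
    targets: (batch, seq_len) gold target tokens t_1 .. t_n
    """
    log_probs = F.log_softmax(logits, dim=-1)                        # log p(. | t_<i, x)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # -log p(t_i | t_<i, x)
    mask = targets.ne(pad_idx).float()                               # ignore padding positions
    return (nll * mask).sum()
```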

Token NLL w/ Label Smoothing:

Likelihood training makes the model extremely confident in its predictions, which hurts generalization performance. Label smoothing acts as a regularizer by ensuring the model does not drift far from a prior distribution $f$ (independent of the input $\textbf{x}$), which in most cases is the uniform distribution.

\[\mathcal{L}_{TokLS} = - \sum_{i=1}^{n} \log p(t_i | t_1, \cdots, t_{i-1}, \textbf{x}) + \sum_{i=1}^{n} D_{KL}\big(f \,\|\, p(\cdot | t_1, \cdots, t_{i-1}, \textbf{x})\big)\]
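A sketch of one common way to implement this objective with a uniform prior $f$ (again my own code): with $f$ uniform, the KL term reduces, up to constants, to the negative mean log-probability over the vocabulary, and the two terms are blended with a weight $\epsilon$.

```python
import torch.nn.functional as F

def token_nll_label_smoothing(logits, targets, pad_idx, epsilon=0.1):
    """Label-smoothed token NLL: pulls p(. | t_<i, x) toward a uniform prior."""
    log_probs = F.log_softmax(logits, dim=-1)                        # (batch, seq, vocab)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # -log p(t_i | ...)
    smooth = -log_probs.mean(dim=-1)                                 # -(1/V) sum_v log p(v | ...)
    mask = targets.ne(pad_idx).float()
    # Convex blend, equivalent (up to scaling and constants) to NLL + KL-to-uniform.
    loss = (1.0 - epsilon) * nll + epsilon * smooth
    return (loss * mask).sum()
```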

SeqNLL:

\[\mathcal{L}_{SeqNLL} = - \log \big( p(\textbf{u}^{*} | \textbf{x}) \big) + \log \Bigg( \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} p(\textbf{u} | \textbf{x}) \Bigg)\]

Computing SeqNLL (and the other sequence-level objectives) over the space of all possible output sequences is intractable, so the losses are computed over a subset $\mathcal{U}(\textbf{x})$ of candidates generated by the model. The strategies used to generate these candidate sets are discussed below.

In SeqNLL, the pseudo-reference $\textbf{u}^{*}$ (the lowest-cost candidate in $\mathcal{U}(\textbf{x})$) is used in place of the target $\textbf{t}$: according to the authors, using $\textbf{t}$ directly can lead to degenerate solutions.
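A minimal sketch of SeqNLL over a candidate set (my code; it assumes the per-candidate scores $s(\textbf{u}|\textbf{x})$, e.g. sums of token log-probabilities, and costs $c(\textbf{t}, \textbf{u})$ have already been computed):

```python
import torch

def seq_nll(cand_scores, cand_costs):
    """Sequence NLL over a candidate set U(x).

    cand_scores: (num_cands,) model scores s(u|x)
    cand_costs:  (num_cands,) costs c(t, u), e.g. 1 - sentence-BLEU vs. the target
    """
    pseudo_ref = cand_costs.argmin()                 # u*: the lowest-cost candidate
    log_z = torch.logsumexp(cand_scores, dim=0)      # log of the normalizer over U(x)
    return -(cand_scores[pseudo_ref] - log_z)        # -log p(u* | x), normalized over U(x)
```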

Minimum Risk Training or Expected Risk Minimization:

\[\mathcal{L}_{Risk} = \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} c(\textbf{t}, \textbf{u}) \frac{p(\textbf{u} | \textbf{x})}{\sum_{\textbf{u}^{'} \in \mathcal{U}(\textbf{x})} p(\textbf{u}^{'} | \textbf{x})}\]

One difference between SeqNLL and MRT (or ERM) is that MRT increases the probability of several low-cost candidates, rather than concentrating probability mass on a single sequence as SeqNLL does.
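A sketch of the risk objective under the same candidate-set assumptions as above:

```python
import torch

def expected_risk(cand_scores, cand_costs):
    """Expected risk (MRT) over a candidate set U(x): the expected cost under the
    model's distribution renormalized over the candidates."""
    probs = torch.softmax(cand_scores, dim=0)        # p(u|x) / sum_u' p(u'|x)
    return (probs * cand_costs).sum()
```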

REINFORCE and Risk:

Both objectives optimize an expected cost/reward. The main practical difference is that REINFORCE estimates the expectation by sampling from the model (usually with a baseline to reduce variance), whereas Risk computes the expectation exactly, but only over the small candidate set $\mathcal{U}(\textbf{x})$.

Max-Margin, Softmax-Margin, and Multi-Margin:

\[\mathcal{L}_{MaxMargin} = \max\Big[0, c(\textbf{t}, \hat{\textbf{u}}) - c(\textbf{t}, \textbf{u}^{*}) + s(\hat{\textbf{u}}|\textbf{x}) - s(\textbf{u}^{*}|\textbf{x})\Big]\]

where $\hat{\textbf{u}}$ is the greedy decoding of $p$ and $\textbf{u}^{*}$ is the pseudo-reference.

\[\mathcal{L}_{MultiMargin} = \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} \max\Big[0, c(\textbf{t}, \textbf{u}) - c(\textbf{t}, \textbf{u}^{*}) + s(\textbf{u}|\textbf{x}) - s(\textbf{u}^{*}|\textbf{x})\Big]\]

\[\mathcal{L}_{SoftmaxMargin} = - \log \big( p(\textbf{u}^{*} | \textbf{x}) \big) + \log \Bigg( \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} \exp \Big[ s(\textbf{u}|\textbf{x}) + c(\textbf{t}, \textbf{u}) \Big] \Bigg)\]
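Hedged sketches of the two hinge losses under the same candidate-set setup; here I take $\hat{\textbf{u}}$ to be the highest-scoring candidate in $\mathcal{U}(\textbf{x})$, standing in for the greedily decoded sequence:

```python
import torch

def max_margin(cand_scores, cand_costs):
    """Hinge loss between the model's preferred candidate and the pseudo-reference."""
    pseudo_ref = cand_costs.argmin()                          # u*
    best = cand_scores.argmax()                               # \hat{u}
    margin = cand_costs[best] - cand_costs[pseudo_ref]        # cost gap sets the required margin
    return torch.clamp(margin + cand_scores[best] - cand_scores[pseudo_ref], min=0.0)

def multi_margin(cand_scores, cand_costs):
    """The same hinge, summed over every candidate in U(x)."""
    pseudo_ref = cand_costs.argmin()
    margins = cand_costs - cand_costs[pseudo_ref]
    return torch.clamp(margins + cand_scores - cand_scores[pseudo_ref], min=0.0).sum()
```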

Candidate Generation:

The authors explore two candidate generation strategies: beam search and sampling from the model.

Beam search produces high-quality candidates but lacks diversity, whereas sampling usually yields more diverse candidates. Top-k and top-p (nucleus) sampling were not explored, as they had not yet been introduced. Following Shen et al. (2016), the target sequence is not included in the candidate set; the authors state that including it destabilizes training.
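For illustration, a bare-bones ancestral-sampling sketch; `step_fn` is a hypothetical callable standing in for the decoder (not the paper's API) that returns next-token log-probabilities given the current prefix:

```python
import torch

def sample_candidates(step_fn, bos_idx, eos_idx, num_candidates=16, max_len=100):
    """Build U(x) by sampling sequences token-by-token from the model."""
    candidates = []
    for _ in range(num_candidates):
        seq = [bos_idx]
        for _ in range(max_len):
            log_probs = step_fn(torch.tensor(seq))            # (vocab,) log p(. | prefix, x)
            next_tok = torch.multinomial(log_probs.exp(), 1).item()
            seq.append(next_tok)
            if next_tok == eos_idx:                           # stop once the model emits EOS
                break
        candidates.append(seq)
    return candidates
```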

Weighted Combination of MLE and Sequence Level Losses:

\[\mathcal{L}_{Weighted} = \alpha \mathcal{L}_{TokLS} + (1 - \alpha) \mathcal{L}_{Risk}\]
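As a usage note, the blend is a one-liner on top of the earlier sketches (the default $\alpha$ below is illustrative, not the paper's value):

```python
def weighted_loss(tok_ls_loss, risk_loss, alpha=0.5):
    """Blend the token-level label-smoothing loss with the sequence-level risk loss."""
    return alpha * tok_ls_loss + (1.0 - alpha) * risk_loss
```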

Experiments and Results:

Experiments were conducted on two sequence prediction tasks: machine translation and summarization. For MT, two datasets were used: IWSLT’14 De-En and WMT’14 En-Fr; summarization results were reported on the Gigaword corpus. All results are averaged over 5 runs.

IWSLT’14 Results:

The results show that MRT does best among all the loss objectives, though the comparison to BSO, Actor-Critic, and Huang et al. is not entirely fair, since those works use different models and preprocessing.

TokenNLL vs. TokenLS initialization:

The results show that it is beneficial to initialize sequence-level training from a model that has been optimized with label smoothing.

Weighting w/ TokenLS helps:

Combining the sequence-level loss with the token-level label-smoothing loss (the weighted objective above) performs better than using the sequence-level loss alone.

Candidate Set Size Experiments:

Increasing the candidate set size helps, but the gains are small. The authors use 16 candidates in their experiments.

Conclusion

This paper implements classical sequence-level losses for neural seq2seq models and shows that they are competitive. Additionally, the authors run careful ablations with clear takeaways.

Takeaways:

Experimental Setup Details:

IWSLT’14 De-En:

WMT’14 En-Fr:

Gigaword Summarization:

Model Configuration:

MT Models:

Summarization Model: