Paper
Published at ACL’18.
Classical Structured Prediction Losses for Sequence to Sequence Learning, Sergey Edunov, Myle Ott, Michael Auli, David Grangier, Marc’Aurelio Ranzato
Takeaways:
- MRT is the best sequence level loss among the ones evaluated.
- Do not include target sequence in your candidate set.
- Weighted average between token level and sequence level loss ($\alpha=0.3$) helps.
- Increasing candidate generation set size leads to diminishing returns.
- Initializing sequence-level training from a label-smoothed model is better than initializing from a model trained with plain NLL.
Motivation
A large body of work exists on using sequence-level objectives to train log-linear models for sequence prediction tasks such as NER. This paper takes those objectives and evaluates them on two tasks: machine translation and summarization.
Notation:
- Source sentence: $\textbf{x} = (x_1, \cdots, x_m)$
- Output sentence: $\textbf{u} = (u_1, \cdots, u_n)$
- Target sentence: $\textbf{t} = (t_1, \cdots, t_k)$
- Candidate subset (generated with, e.g., beam search): $\mathcal{U}(\textbf{x})$
- Pseudo-Reference: $\textbf{u}^{*}(\textbf{x}) = \arg\max_{\textbf{u} \in \mathcal{U}(\textbf{x})} \text{BLEU}(\textbf{t}, \textbf{u})$
- Score of a token $u_i$: $s_i$.
- Sequence Score: $s(\textbf{u}|\textbf{x}) = \frac{1}{n} \sum_{i=1}^{n} s(u_i|u_1,\cdots,u_{i-1}, \textbf{x})$
- The per-token probability: $p(u_i|u_1, \cdots, u_{i-1}, \textbf{x}) = \text{softmax}(s_i)$
- Normalized sequence probability: $p(\textbf{u}|\textbf{x}) = \exp\bigg(\frac{1}{n} \sum_{i=1}^{n} \log\Big(p(u_i|u_1,\cdots,u_{i-1}, \textbf{x})\Big)\bigg)$
- Cost Function: $c(\textbf{t}, \textbf{u}) = 1 - \text{BLEU}(\textbf{t}, \textbf{u})$
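As a minimal sketch (with made-up per-token probabilities), the length-normalized sequence probability defined above can be computed as:

```python
import math

def normalized_seq_prob(token_probs):
    """p(u|x) = exp((1/n) * sum_i log p(u_i | u_1..u_{i-1}, x)),
    i.e. the geometric mean of the per-token probabilities."""
    n = len(token_probs)
    avg_logprob = sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_logprob)

# Hypothetical per-token probabilities for a 4-token candidate.
probs = [0.9, 0.8, 0.7, 0.6]
p_u = normalized_seq_prob(probs)  # geometric mean of the four values
```

Because of the $\frac{1}{n}$ normalization this is just the geometric mean of the token probabilities, which keeps longer candidates from being penalized purely for their length.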
Model
The authors use Gated Convolutional Neural Networks (Gehring et al., 2017b) for the encoder and the decoder architecture.
Losses:
Token NLL:
This is usual teacher forcing training.
\[\mathcal{L}_{TokNLL} = - \sum_{i=1}^{n} \log\big( p(t_i| t_1, \cdots, t_{i-1}, \textbf{x})\big)\]
Token NLL w/ Label Smoothing:
Likelihood training makes the model extremely confident in its predictions, which hurts generalization. Label smoothing acts as a regularizer by keeping the model close to a prior distribution $f$ (independent of the input $\textbf{x}$), which in most cases is the uniform distribution.
\[\mathcal{L}_{TokLS} = - \sum_{i=1}^{n} \log\big( p(t_i| t_1, \cdots, t_{i-1}, \textbf{x})\big) + D_{KL}\left(f \,\|\, p(t_i| t_1, \cdots, t_{i-1}, \textbf{x})\right)\]
SeqNLL:
\[\mathcal{L}_{SeqNLL} = - \log \big( p(\textbf{u}^{*} | \textbf{x}) \big) + \log \Bigg( \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} p(\textbf{u} | \textbf{x}) \Bigg)\]
Computing SeqNLL (and the other sequence-level objectives) over the space of all possible output sequences is intractable, so the losses are computed over a subset $\mathcal{U}(\textbf{x})$ generated by the model. The approaches to generate these subsets are discussed below.
In SeqNLL, the pseudo-reference $\textbf{u}^{*}$ is used instead of the target $\textbf{t}$, since, per the authors, using $\textbf{t}$ can lead to degenerate solutions.
Minimum Risk Training or Expected Risk Minimization:
\[\mathcal{L}_{Risk} = \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} c(\textbf{t}, \textbf{u}) \frac{p(\textbf{u} | \textbf{x})}{\sum_{\textbf{u}^{'} \in \mathcal{U}(\textbf{x})} p(\textbf{u}^{'} | \textbf{x})}\] One difference between SeqNLL and MRT (ERM) is that MRT increases the probability of several low-cost candidates instead of focusing on just one sequence, as SeqNLL does.
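A minimal sketch of the risk objective over a candidate set, with made-up costs and (length-normalized) probabilities:

```python
def risk_loss(candidates):
    """Expected cost over the candidate set U(x).
    candidates: list of (cost, prob) pairs, where cost = 1 - BLEU(t, u)
    and prob = normalized sequence probability p(u|x).
    Probabilities are renormalized over the candidate set."""
    z = sum(p for _, p in candidates)
    return sum(c * (p / z) for c, p in candidates)

# Three hypothetical candidates: low-cost ones carry most of the mass.
cands = [(0.1, 0.5), (0.4, 0.3), (0.9, 0.2)]
loss = risk_loss(cands)
```

The weighted objective $\mathcal{L}_{Weighted}$ discussed in this note is then simply `alpha * tok_ls_loss + (1 - alpha) * loss` with $\alpha = 0.3$.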
REINFORCE and Risk:
- A major similarity between Risk and REINFORCE is that both optimize expected cost.
- Differences include:
- REINFORCE typically operates on a single sampled sequence per update, whereas MRT relies on multiple candidate sequences.
- REINFORCE uses a baseline to determine the sign of the gradient for the current sequence; because MRT renormalizes over multiple candidates, it does not need a baseline reward.
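To make the contrast concrete, here is a sketch (not the paper's code) of the weights each method assigns to the gradient of $\log p(\textbf{u}|\textbf{x})$:

```python
def reinforce_weight(cost, baseline):
    """REINFORCE: one sampled sequence per update; (cost - baseline)
    sets both the magnitude and the sign of its gradient."""
    return cost - baseline

def mrt_weights(costs, probs):
    """MRT: every candidate u gets weight q_u * (c_u - expected cost),
    where q renormalizes p over the candidate set. The expected cost
    acts as a built-in baseline, so none has to be learned."""
    z = sum(probs)
    q = [p / z for p in probs]
    expected = sum(c * qi for c, qi in zip(costs, q))
    return [qi * (c - expected) for c, qi in zip(costs, q)]
```

Note that the MRT weights always sum to zero: candidates cheaper than the expected cost are pushed up and the rest are pushed down, without any learned baseline.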
Max-Margin, Softmax-Margin, and Multi-Margin:
\[\mathcal{L}_{MaxMargin} = \max\Big[0, c(\textbf{t}, \hat{\textbf{u}}) - c(\textbf{t}, \textbf{u}^{*}) - s(\textbf{u}^{*}|\textbf{x}) + s(\hat{\textbf{u}}|\textbf{x})\Big]\] where $\hat{\textbf{u}}$ is the greedy decoding of $p$ and $\textbf{u}^{*}$ is the pseudo-reference.
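A sketch of the max-margin hinge with scalar stand-ins for the costs and scores:

```python
def max_margin_loss(cost_hat, cost_star, score_hat, score_star):
    """Structured hinge: the pseudo-reference u* should outscore the
    greedy hypothesis u_hat by at least their cost difference.
    cost_* are (1 - BLEU) values, score_* are the s(.|x) scores."""
    margin = cost_hat - cost_star
    return max(0.0, margin - score_star + score_hat)

# Hypothetical values: u* already outscores u_hat by more than the
# required margin, so the loss is zero.
loss = max_margin_loss(0.5, 0.1, 0.2, 0.8)
```

Minimizing this pushes $s(\textbf{u}^{*}|\textbf{x})$ up and $s(\hat{\textbf{u}}|\textbf{x})$ down until the margin is satisfied, at which point the loss (and gradient) is zero.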
\[\mathcal{L}_{MultiMargin} = \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} \max\Big[0, c(\textbf{t}, \textbf{u}) - c(\textbf{t}, \textbf{u}^{*}) - s(\textbf{u}^{*}|\textbf{x}) + s(\textbf{u}|\textbf{x})\Big]\] \[\mathcal{L}_{SoftmaxMargin} = - \log \big( p(\textbf{u}^{*} | \textbf{x}) \big) + \log \Bigg( \sum_{\textbf{u} \in \mathcal{U}(\textbf{x})} \exp \Big[ s(\textbf{u}|\textbf{x}) + c(\textbf{t}, \textbf{u}) \Big] \Bigg)\]
Candidate Generation:
Authors explore two candidate generation strategies:
Beam search produces high-quality samples but lacks diversity, whereas sampling usually yields more diverse candidates. The authors did not try top-k or top-p (nucleus) sampling, as those had not yet been introduced. Following Shen et al., 2016, they do not include the target sequence/sentence in the candidates; the authors state that including it destabilizes training.
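A toy sketch of sampling-based candidate generation that drops duplicates and excludes the target (`sample_fn` is a hypothetical stand-in for drawing one output sequence from the model):

```python
import random

def make_candidates(sample_fn, target, k=16, max_tries=1000, seed=0):
    """Draw up to k distinct candidate sequences by sampling,
    skipping duplicates and the gold target sequence."""
    rng = random.Random(seed)
    seen = {tuple(target)}
    cands = []
    for _ in range(max_tries):
        if len(cands) == k:
            break
        u = tuple(sample_fn(rng))
        if u not in seen:
            seen.add(u)
            cands.append(list(u))
    return cands

# Toy "model": uniform 2-token sequences over a 3-symbol vocabulary.
toy_sampler = lambda rng: [rng.choice("abc") for _ in range(2)]
cands = make_candidates(toy_sampler, target=["a", "a"], k=4)
```

A beam-search variant would replace the sampling loop with the top-k beam hypotheses, which are higher quality but tend to overlap heavily.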
Weighted Combination of MLE and Sequence Level Losses:
\[\mathcal{L}_{Weighted} = \alpha \mathcal{L}_{TokLS} + (1 - \alpha) \mathcal{L}_{Risk}\]
Experiments and Results:
Experiments were conducted on two sequence prediction tasks: Machine Translation and Summarization. For MT, two datasets were evaluated: IWSLT’14 De-En and WMT’14 En-Fr, and summarization results were reported on Gigaword corpus. All results are averaged over 5 runs.
IWSLT’14 Results:
The results show that MRT does best among all the loss objectives, though the comparisons to BSO, Actor-Critic, and Huang et al. are not strictly fair, as the models and preprocessing differ.
TokenNLL vs. TokenLS initialization:
There is a benefit to initializing the sequence-level model from a model optimized with label smoothing.
Weighting w/ TokenLS helps ($\alpha = 0.3$ works best):
Candidate Set Size Experiments:
Increasing candidate set size does help but gains are minimal. The authors use 16 candidates in their experiments.
Conclusion
This paper implements classical sequence-level losses and shows that they are competitive. Additionally, the authors run thorough ablations with clear takeaways.
Experimental Setup Details:
IWSLT’14 De-En:
- Training Data: 160K sentence pairs, validation set 7k sentences.
- Test Set: tst2010, tst2011, tst2012, tst2013, dev2010 concatenated.
- Data is lowercased, tokenized with BPE of 14000 types.
- Evaluated using case-insensitive BLEU.
WMT’14 En-Fr:
- Removed sentences longer than 175 words
- Removed pairs with source/target length ratio exceeding 1.5.
- Total 35.5M sentence pairs.
- Source and target vocab based on 40K BPE types.
- Validation set: Randomly sampled 26,658 sentence-pairs.
- Test set: newstest2014.
Gigaword Summarization:
- Gigaword corpus as training data.
- pre-process similar to Rush et al. 2015.
- 3.8M training and 190K validation examples.
- Test: Gigaword test set (2,000 pairs).
- Metric: F1 ROUGE.
- Source and Target vocabulary 30k words.
Model Configuration:
MT Models:
- 4 convolutional encoder layers and 3 decoder layers.
- Kernel Width: 3, 256 dim. hidden state and input layer.
- Optimization: Nesterov accelerated gradient w/ LR: 0.25 and momentum: 0.99.
- Grad Norm: 0.1
- For Sequence-level model initialization,
- Token level method trained for 200 epochs.
- The token-level baseline's learning rate is then annealed by a factor of 10 every epoch until LR < 1e-4.
- Sequence level models are further fine-tuned for 10-20 epochs.
- Dropout ratio of 0.3 is used for embedding, decoder output and convolutional blocks.
- All scores and sequence-level probabilities were length-normalized, to discourage the model from favoring short sequences.
- Maximum Length of candidates generated was limited to 200.
- 16 candidates were generated for the experiments.
Summarization Model:
- 12 layers each of encoder and decoder with 256 hidden and input units with kernel width of 3.
- Batch size 8k tokens, and optimized for 20 epochs with LR: 0.25 and then LR was annealed in a similar fashion as above.