News - Helsinki-NLP/hnmt GitHub Wiki

Table of Contents

New publications, developments and events

Other sources

General

  • CNN Is All You Need (Qiming Chen, Ren Wu) \\ Incredible improvement in BLEU scores - is this for real? Check the discussion to see the reason ...
  • Attention Is All You Need (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin) \\ The Transformer model from Google, using neither convolutional nor recurrent layers
  • Convolutional Sequence to Sequence Learning (Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin) \\
  Facebook's convolutional NMT system; translation accuracy comparable to Google's system, but much faster.
  • Various details about Google's NMT model.
  • Context gates that control the influence of source and target context when generating words. Intuition: content words should rely more on source-language context, whereas function words should look more at target-language context. ([[https://github.com/tuzhaopeng/NMT|code available here]])
  • Add a coverage vector that keeps track of the attention history to avoid under- and over-translation; see the sketch after this list. ([[https://github.com/tuzhaopeng/NMT|code available here]] and the older version [[https://github.com/tuzhaopeng/NMT-Coverage|here]])
  • Add a reconstruction layer to improve the adequacy of the model: the system has to reconstruct the source sentence after decoding.
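
The coverage idea above boils down to a small change to standard attention: keep a running sum of past attention weights for each source position and feed it into the attention scorer. Below is a minimal NumPy sketch of one such decoder step; the parameter names (W_a, U_a, V_a, v_a) and the additive scoring form are illustrative assumptions, not taken from the papers' code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coverage_attention_step(enc_states, dec_state, coverage, W_a, U_a, V_a, v_a):
    """One decoder step of additive attention with a coverage vector.

    coverage[j] accumulates how much attention source position j has already
    received; feeding it into the scorer lets the model avoid attending to the
    same words again (over-translation) or ignoring others (under-translation).
    """
    scores = np.array([
        v_a @ np.tanh(W_a @ dec_state + U_a @ h_j + V_a * coverage[j])
        for j, h_j in enumerate(enc_states)
    ])
    alpha = softmax(scores)                        # attention weights over source
    context = (alpha[:, None] * enc_states).sum(axis=0)
    return context, alpha, coverage + alpha        # updated attention history

# toy shapes: 6 source positions, encoder/decoder dim 8, attention dim 5
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))
ctx, alpha, cov = coverage_attention_step(
    enc, rng.normal(size=8), np.zeros(6),
    W_a=rng.normal(size=(5, 8)), U_a=rng.normal(size=(5, 8)),
    V_a=rng.normal(size=5), v_a=rng.normal(size=5))
```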

Multilingual Models

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation (Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean) \\
  Multilingual translation by simply adding a language-selection token to the training data and sharing all other parameters (a small preprocessing sketch follows this list).
  They go beyond Dong et al. (2015) below, using many-to-many translation. While the number of parameters is linear in the number of languages, as far as I can tell the computational complexity is still quadratic, so it would be challenging with Europarl and out of the question with the Bible corpus.
Multi-task Sequence to Sequence Learning (Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, Lukasz Kaiser) \\
  One-to-many translation using a simple sequence-to-sequence model with attention.
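
The language-token trick in Johnson et al. needs no architecture change at all: training examples are preprocessed so that the source sentence starts with an artificial token naming the desired target language, and the shared model learns the rest. A minimal sketch (the exact <2xx> token format used here is an assumption for illustration):

```python
def add_target_language_token(src_sentence: str, tgt_lang: str) -> str:
    """Prepend an artificial target-language token to the source sentence.

    The model itself stays an ordinary sequence-to-sequence network with
    attention; only the training (and test) data are modified.
    """
    return f"<2{tgt_lang}> {src_sentence}"

# the same German source sentence, once paired with an English and once
# with a Finnish reference translation in the multilingual training set:
print(add_target_language_token("ein kleines Beispiel", "en"))  # <2en> ein kleines Beispiel
print(add_target_language_token("ein kleines Beispiel", "fi"))  # <2fi> ein kleines Beispiel
```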

Subword and character based methods

  Character-to-character model with convolutions on the source side to reduce sequence length. They also train with multiple languages by just mixing source sentences from different languages into each minibatch, so the network implicitly learns to identify the language.
  Sample source sequence before encoding, resample after decoding. Implemented in [[https://github.com/swordyork/dcnmt|Blocks]]
  They use a limited vocabulary but use character LSTMs instead of <UNK> tokens, both for the encoder and decoder. This works better than a purely word-based approach with heuristics for <UNK> substitution. They also find that purely character-based models work, but are slow (about three months) to train.
A Character-level Decoder without Explicit Segmentation for Neural Machine Translation (Junyoung Chung, Kyunghyun Cho and Yoshua Bengio, ACL 2016) \\
  Using a character-level decoder but subword (byte-pair encoding) encoder, which also works well. Chung et al. also propose a special recurrent network to capture short- and long-range dependencies, but the advantage of this seems pretty limited, although it might help for very long sentences.
  Similar idea to Luong and Manning (2016), but using convolution rather than LSTM and applied only on the source side. The target side seems to be completely word-based.
  Using byte-pair encoding to create subword units, which can be used out of the box with standard NMT models to reduce data sparsity; see the sketch after this list.
  RNN working at multiple layers of segmentation, which are learned in an unsupervised way. Related to their [[https://www.aclweb.org/anthology/P/P16/P16-1160.pdf|ACL 2016]] paper, but with more empirical results (the ACL paper is interesting but does not show very convincing NMT results in my opinion).
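
The byte-pair-encoding approach mentioned above is simple enough to sketch end to end: start from characters (plus an end-of-word marker) and repeatedly merge the most frequent adjacent symbol pair. A toy version of the merge-learning loop, roughly following Sennrich et al.'s description (function names are my own):

```python
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a {word-as-symbol-tuple: freq} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn subword merge operations from a {word: frequency} dictionary."""
    # start from single characters, with an end-of-word marker
    vocab = {tuple(w) + ('</w>',): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# toy example
print(learn_bpe({"lower": 5, "low": 7, "newest": 3, "widest": 2}, 5))
```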

Discourse-level NMT / wider context

Unsupervised / semi-supervised models

Improved alignment models

Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model (Shi Feng, Shujie Liu, Mu Li and Ming Zhou, 2016) \\
  Extensions to the basic attention mechanism that do not assume independence between alignment links (as IBM model 1 does), using a recurrent attention state.
  Another approach that borrows ideas from the higher IBM models into attention models for NMT. After skimming through, it seems like they simply feed the kind of statistics the IBM models use (jump lengths, fertility, etc.) directly into the attention subnetwork; a rough sketch follows this list.
Agreement between attention-based alignments in different directions
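
As a rough picture of how such IBM-style statistics could be wired into an attention model (my own simplified sketch of the general idea, not any paper's exact formulation): besides the usual content-based score, each source position gets a distortion feature (distance to the previously attended position) and a coverage/fertility feature (attention mass it has already received).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_with_alignment_features(enc_states, dec_state, prev_alpha,
                                       coverage, W, U, v, w_jump, w_cov):
    """Attention scorer augmented with IBM-model-style features (sketch).

    prev_alpha  -- attention distribution from the previous decoder step
    coverage[j] -- attention mass source position j has received so far
    w_jump/w_cov are scalar feature weights; all names are invented here.
    """
    prev_pos = int(prev_alpha.argmax())          # crude guess at the last aligned word
    scores = []
    for j, h_j in enumerate(enc_states):
        content = v @ np.tanh(W @ dec_state + U @ h_j)   # standard additive score
        jump = abs(j - prev_pos)                         # distortion / jump length
        scores.append(content + w_jump * jump + w_cov * coverage[j])
    alpha = softmax(np.array(scores))
    return alpha, coverage + alpha

# toy call: 6 source positions, encoder/decoder dim 8, attention dim 5
rng = np.random.default_rng(1)
alpha, cov = attention_with_alignment_features(
    rng.normal(size=(6, 8)), rng.normal(size=8), np.full(6, 1 / 6), np.zeros(6),
    W=rng.normal(size=(5, 8)), U=rng.normal(size=(5, 8)), v=rng.normal(size=5),
    w_jump=-0.1, w_cov=-0.5)
```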

Hybrid models (in whatever sense)

  Translate the training data using PB-SMT and feed this into a neural system: the system learns to use the phrase-based translation as additional information when translating source-language sentences.
  Neural machine translation with a phrase memory: incorporates phrase pairs in symbolic form, mined from a corpus or specified by human experts.
  Combines hierarchical SMT with NMT, which leads to improvements over the individual systems (NMT and hierarchical SMT).
  Use discrete translation lexicons in neural MT; see the sketch below.
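
For the discrete-lexicon item above, one common way to use such a lexicon (sketched here under my own assumptions about shapes and the combination method; the actual papers may differ in detail) is to form the attention-weighted lexicon distribution over target words and either add it as a bias inside the softmax or interpolate it with the NMT distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lexicon_biased_probs(logits, attention, lex_probs, mode="bias",
                         eps=1e-6, lam=0.5):
    """Combine NMT output scores with a discrete translation lexicon (sketch).

    logits    -- (V,)        decoder scores over the target vocabulary
    attention -- (src_len,)  attention weights for the current target word
    lex_probs -- (src_len, V) lexicon probabilities p(target word | source word j)
    """
    # expected lexicon distribution under the current attention weights
    p_lex = attention @ lex_probs                       # shape (V,)
    if mode == "bias":
        # use the lexicon as an extra bias inside the softmax
        return softmax(logits + np.log(p_lex + eps))
    # alternative: linear interpolation of the two distributions
    return lam * p_lex + (1 - lam) * softmax(logits)

# toy call: 3 source words, target vocabulary of 5 words
rng = np.random.default_rng(2)
lex = rng.dirichlet(np.ones(5), size=3)                 # rows sum to one
p = lexicon_biased_probs(rng.normal(size=5), np.array([0.7, 0.2, 0.1]), lex)
```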

Supervision at different layers

  Supervising multi-task learning with some kind of hierarchical structure at multiple layers works well.

Optimization and regularization methods

  Dropout has been very successful for regularizing different types of networks, but it has been difficult to apply to RNNs. Gal presents a method that actually works, has a theoretical foundation in variational Bayesian methods (so it is sometimes referred to as "variational dropout"), and has already been adopted by several people. It drastically reduces overfitting, but comes at the cost of somewhat slower convergence; see the sketch after this list. Implemented in BNAS.
  Similar to Batch Normalization, but normalizing over the nodes in a layer rather than over the same node in a minibatch. Easy to apply to recurrent networks, and our experiments show that their first LSTM variant (equations 20--22) works better than the second one (equations 29--31), although there are issues with numerical stability.
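
To make the variational-dropout point above concrete: the trick is to sample the dropout masks once per sequence and reuse them at every time step of the RNN, for both the inputs and the recurrent state, instead of drawing a fresh mask per step. A small NumPy sketch (with a toy tanh cell standing in for the real LSTM; all names are mine):

```python
import numpy as np

def variational_dropout_masks(rng, batch_size, input_dim, hidden_dim, p=0.3):
    """Sample input and recurrent dropout masks once per sequence."""
    keep = 1.0 - p
    m_x = rng.binomial(1, keep, size=(batch_size, input_dim)) / keep
    m_h = rng.binomial(1, keep, size=(batch_size, hidden_dim)) / keep
    return m_x, m_h

def run_rnn(cell, inputs, h0, m_x, m_h):
    """Run an RNN cell over time, applying the *same* masks at every step."""
    h = h0
    for x_t in inputs:                    # inputs: list of (batch, input_dim) arrays
        h = cell(x_t * m_x, h * m_h)      # masks fixed across time steps
    return h

# toy usage with a tanh cell standing in for an LSTM
rng = np.random.default_rng(3)
W_x, W_h = rng.normal(size=(8, 16)), rng.normal(size=(16, 16))
cell = lambda x, h: np.tanh(x @ W_x + h @ W_h)
m_x, m_h = variational_dropout_masks(rng, batch_size=4, input_dim=8, hidden_dim=16)
h_final = run_rnn(cell, [rng.normal(size=(4, 8)) for _ in range(5)],
                  np.zeros((4, 16)), m_x, m_h)
```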

Domain adaptation

Cool stuff, possibly useful

  Multilingual word vectors without parallel or even comparable corpora, obtained by simply trying to enforce similar distributions between the vector spaces of different languages. Seems to work so-so under these very restricted unsupervised conditions, but could it be used to improve low-resource word vectors?

Events
