CNN Is All You Need (Qiming Chen, Ren Wu) \\ Incredible improvement in BLEU scores - is this for real? Check discussion and see the reason ...
Attention Is All You Need (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin) \\ The Transformer model from Google, using neither convolutional nor recurrent layers
Context gates that control the influence of source and target context when generating words. Intuition: Content words should rely more on source language context whereas function words should look more at target language context. ([[https://github.com/tuzhaopeng/NMT|code available here]])
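A minimal numpy sketch of the idea (not the paper's exact formulation; the matrix names, dimensions, and random initialization here are hypothetical): a learned sigmoid gate interpolates elementwise between the source-side context and the target-side context.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(source_ctx, target_ctx, Wz, Uz):
    # z near 1 -> rely on source context (content words);
    # z near 0 -> rely on target context (function words).
    z = sigmoid(source_ctx @ Wz + target_ctx @ Uz)
    return z * source_ctx + (1.0 - z) * target_ctx

d = 4  # toy hidden size (hypothetical)
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))
s_ctx = rng.normal(size=d)  # e.g. attention output over source states
t_ctx = rng.normal(size=d)  # e.g. previous decoder state / target embedding
gated = context_gate(s_ctx, t_ctx, Wz, Uz)
```

Because z is in (0, 1), each output component lies between the corresponding source and target components, so the gate smoothly trades off the two information sources per dimension.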
Add a coverage vector to keep track of the attention history to avoid under- and over-translation. ([[https://github.com/tuzhaopeng/NMT|code available here]] and the older version [[https://github.com/tuzhaopeng/NMT-Coverage|here]])
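A toy sketch of coverage-augmented additive attention in this spirit (the weight names, toy dimensions, and the simple "accumulate the attention weights" coverage update are illustrative assumptions, not the paper's exact model): each source position's score also sees how much attention it has received so far.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_coverage(enc_states, dec_state, coverage, Wa, Ua, va, vc):
    # Usual additive attention score per source position, plus a term
    # fed by the accumulated attention (coverage) at that position.
    scores = np.tanh(dec_state @ Wa + enc_states @ Ua
                     + coverage[:, None] * vc) @ va
    alpha = softmax(scores)
    coverage = coverage + alpha          # record this step in the history
    context = alpha @ enc_states
    return context, alpha, coverage

T, d = 5, 4  # toy source length and hidden size (hypothetical)
enc = rng.normal(size=(T, d))
Wa, Ua = rng.normal(size=(d, d)), rng.normal(size=(d, d))
va, vc = rng.normal(size=d), rng.normal(size=d)
cov = np.zeros(T)
for step in range(3):  # three decoder steps
    ctx, alpha, cov = attend_with_coverage(enc, rng.normal(size=d),
                                           cov, Wa, Ua, va, vc)
```

The coverage vector sums to the number of decoder steps taken, so positions that have already been attended to can be penalized (over-translation) and neglected positions encouraged (under-translation).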
They go beyond Dong et al. (2015) below, using many-to-many translation. While the number of parameters is linear in the number of languages, as far as I can tell the computational complexity is still quadratic, so it would be challenging with Europarl and out of the question with the Bible corpus.
Character-to-character model with convolutions on the source side to reduce sequence length. They also train with multiple languages by just mixing source sentences from different languages into each minibatch, so the network implicitly learns to identify the language.
They use a limited word vocabulary but fall back to character LSTMs instead of <UNK> tokens, in both the encoder and decoder. This works better than a purely word-based approach with heuristics for <UNK> substitution. They also find that purely character-based models work, but are very slow to train (about three months).
A Character-level Decoder without Explicit Segmentation for Neural Machine Translation (Junyoung Chung, Kyunghyun Cho and Yoshua Bengio, ACL 2016) \\
Using a character-level decoder but subword (byte-pair encoding) encoder, which also works well. Chung et al. also propose a special recurrent network to capture short- and long-range dependencies, but the advantage of this seems pretty limited, although it might help for very long sentences.
Similar idea to Luong and Manning (2016), but using convolution rather than LSTM and applied only on the source side. The target side seems to be completely word-based.
RNN working at multiple layers of segmentation, which are learned in an unsupervised way. Related to their [[https://www.aclweb.org/anthology/P/P16/P16-1160.pdf|ACL 2016]] paper, but with more empirical results (the ACL paper is interesting but does not show very convincing NMT results in my opinion).
Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model (Shi Feng, Shujie Liu, Mu Li and Ming Zhou, 2016) \\
Extensions to the basic attention mechanism that do not assume independence between alignment links (as IBM model 1 does), using a recurrent attention state.
Another approach that borrows ideas from the higher IBM models into attention models for NMT. After skimming through, it seems like they are simply feeding the kind of statistics IBM models use (jump lengths, fertility, etc.) directly into the attention subnetwork.
Translate training data using PB-SMT and feed this into a neural system: The system learns to use phrase-based translation as additional information when translating source language sentences.
Dropout has been very successful for regularization of different types of networks, but it has been difficult to apply to RNNs. Gal presents a method that actually works, has a theoretical foundation in variational Bayesian methods (so it is sometimes referred to as "variational dropout"), and has already been adopted by several people. Drastically reduces overfitting, but comes at the cost of somewhat slower convergence. Implemented in BNAS.
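The core trick can be sketched in a few lines of numpy (a toy illustration, not Gal's derivation or the BNAS implementation; the tiny RNN and all sizes here are hypothetical): sample one dropout mask per sequence and reuse it at every timestep, instead of resampling per step.

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_dropout_mask(p, shape):
    # One mask per sequence, scaled by 1/keep so expected activations
    # are unchanged (inverted dropout).
    keep = 1.0 - p
    return rng.binomial(1, keep, size=shape) / keep

batch, d, T = 2, 8, 5  # toy sizes (hypothetical)
Wx = rng.normal(size=(d, d)) * 0.1
Wh = rng.normal(size=(d, d)) * 0.1
mask_x = variational_dropout_mask(0.25, (batch, d))
mask_h = variational_dropout_mask(0.25, (batch, d))

h = np.zeros((batch, d))
xs = rng.normal(size=(T, batch, d))
for x in xs:
    # Key point: the SAME masks are reused at every timestep, so the
    # recurrence sees a consistent "thinned" network across the sequence,
    # unlike naive per-step dropout on the hidden state.
    h = np.tanh((x * mask_x) @ Wx + (h * mask_h) @ Wh)
```

At test time the masks are simply dropped (i.e. replaced by ones), as with standard inverted dropout.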
Layer Normalization (Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton, 2016) \\
Similar to Batch Normalization, but normalizing over the nodes in a layer rather than over the same node in a minibatch. Easy to apply to recurrent networks, and our experiments show that their first LSTM variant (equations 20--22) works better than the second one (equations 29--31), although there are issues with numerical stability.
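The basic operation is simple enough to sketch in numpy (this is the plain feed-forward form with learned gain and bias; the example values are arbitrary and the eps placement is one common choice, not necessarily the paper's):

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Normalize over the feature (layer) axis, not the batch axis, so the
    # statistics are the same for batch size 1 and need no running averages,
    # which is what makes it easy to use inside recurrent networks.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gain * (x - mean) / (std + eps) + bias

h = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
out = layer_norm(h, gain=np.ones(4), bias=np.zeros(4))
```

Note that the second row is just the first scaled by 10, and both normalize to (almost) the same output: the operation is invariant to rescaling of the incoming weights, which is part of its appeal.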
Multilingual word vectors without parallel or even comparable corpora, simply trying to enforce similar distributions between the vector spaces of different languages. Seems to work so-so under these very restricted unsupervised conditions, but could it be used to improve low-resource word vectors?