Typical sequence-to-sequence (seq2seq) models are encoder-decoder models, which usually consist of two parts: an encoder and a decoder. These two parts can be implemented with recurrent neural networks (RNN) or Transformers, primarily to handle input/output sequences of dynamic length.
The encoder encodes a sequence of inputs, such as text, video or audio, into a single vector, which can be viewed as an abstract representation of the inputs, containing information about the whole sequence.
The decoder decodes the encoder's output vector one step at a time, until the final output sequence is complete. Every decoding step is affected by previous step(s). Generally, one adds "<BOS>" at the beginning of the sequence to indicate the start of decoding, and "<EOS>" at the end to indicate the end of decoding.
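As a toy illustration (the token ids below are made up), the target sentence is framed with these special tokens, and the decoder input is simply the right-shifted target:

# hypothetical token ids: 1 = <BOS>, 2 = <EOS>
target             = [28, 29, 205, 2]    # what the decoder is trained to produce
prev_output_tokens = [1, 28, 29, 205]    # what the decoder reads, one token per step
# at step t the decoder has seen prev_output_tokens[:t+1] and should predict target[t]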
Homework Description
English to Chinese (Traditional) Translation
Input: an English sentence (e.g. tom is a student .)
Output: the Chinese translation (e.g. 湯姆 是 個 學生 。)
TODO
Train a simple RNN seq2seq model to achieve translation
Switch to a Transformer model to boost performance
Apply back-translation to further boost performance
import sys
import pdb
import pprint
import logging
import os
import random
import re
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import data
import numpy as np
import tqdm.auto as tqdm
from pathlib import Path
from argparse import Namespace
from fairseq import utils
data_dir = './DATA/rawdata'
dataset_name = 'ted2020'
urls = (
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214989&authkey=AGgQ-DaR8eFSl1A"',
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214987&authkey=AA4qP_azsicwZZM"',
    # # If the above links die, use the following instead.
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/ted2020.tgz",
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/test.tgz",
    # # If the above links die, use the following instead.
    # "https://mega.nz/#!vEcTCISJ!3Rw0eHTZWPpdHBTbQEqBDikDEdFPr7fI8WxaXK9yZ9U",
    # "https://mega.nz/#!zNcnGIoJ!oPJX9AvVVs11jc0SaK6vxP_lFUNTkEcK2WbxJpvjU5Y",
)
file_names = (
    'ted2020.tgz',  # train & dev
    'test.tgz',     # test
)
prefix = Path(data_dir).absolute() / dataset_name
def strQ2B(ustring):
    """Full width -> half width"""
    # reference: https://ithelp.ithome.com.tw/articles/10233122
    ss = []
    for s in ustring:
        rstring = ""
        for uchar in s:
            inside_code = ord(uchar)
            if inside_code == 12288:  # full-width space: direct conversion
                inside_code = 32
            elif 65281 <= inside_code <= 65374:  # full-width chars (except space) conversion
                inside_code -= 65248
            rstring += chr(inside_code)
        ss.append(rstring)
    return ''.join(ss)
def clean_s(s, lang):
    if lang == 'en':
        s = re.sub(r"\([^()]*\)", "", s)         # remove ([text])
        s = s.replace('-', '')                    # remove '-'
        s = re.sub('([.,;!?()\"])', r' \1 ', s)   # keep punctuation
    elif lang == 'zh':
        s = strQ2B(s)                             # Q2B
        s = re.sub(r"\([^()]*\)", "", s)          # remove ([text])
        s = s.replace(' ', '')
        s = s.replace('—', '')
        s = s.replace('“', '"')
        s = s.replace('”', '"')
        s = s.replace('_', '')
        s = re.sub('([。,;!?()\"~「」])', r' \1 ', s)  # keep punctuation
    s = ' '.join(s.strip().split())
    return s
def len_s(s, lang):
    if lang == 'zh':
        return len(s)
    return len(s.split())
def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1):
    if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
        print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
        return
    with open(f'{prefix}.{l1}', 'r') as l1_in_f:
        with open(f'{prefix}.{l2}', 'r') as l2_in_f:
            with open(f'{prefix}.clean.{l1}', 'w') as l1_out_f:
                with open(f'{prefix}.clean.{l2}', 'w') as l2_out_f:
                    for s1 in l1_in_f:
                        s1 = s1.strip()
                        s2 = l2_in_f.readline().strip()
                        s1 = clean_s(s1, l1)
                        s2 = clean_s(s2, l2)
                        s1_len = len_s(s1, l1)
                        s2_len = len_s(s2, l2)
                        if min_len > 0:  # remove short sentences
                            if s1_len < min_len or s2_len < min_len:
                                continue
                        if max_len > 0:  # remove long sentences
                            if s1_len > max_len or s2_len > max_len:
                                continue
                        if ratio > 0:  # remove pairs whose length ratio is too large
                            if s1_len / s2_len > ratio or s2_len / s1_len > ratio:
                                continue
                        print(s1, file=l1_out_f)
                        print(s2, file=l2_out_f)
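Optionally, the helpers above can be sanity-checked on a couple of arbitrary example sentences before cleaning the whole corpus (the inputs here are made up for illustration):

# full-width ASCII-range characters become half-width
print(strQ2B('ＡＢＣ１２３，'))
# English: parenthesized text removed, punctuation padded with spaces
print(clean_s('Tom (laughter) is a student.', 'en'))
# Chinese: full-width converted, parenthesized text and spaces removed, punctuation padded
print(clean_s('湯姆（笑聲）是個學生。', 'zh'))
# sentence length: words for English, characters for Chinese
print(len_s('tom is a student .', 'en'), len_s('湯姆是個學生。', 'zh'))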
select ‘unigram’ or ‘byte-pair encoding (BPE)’ algorithm
import sentencepiece as spm
vocab_size = 8000
if (prefix/f'spm{vocab_size}.model').exists():
    print(f'{prefix}/spm{vocab_size}.model exists. skipping spm_train.')
else:
    spm.SentencePieceTrainer.train(
        input=','.join([f'{prefix}/train.clean.{src_lang}',
                        f'{prefix}/valid.clean.{src_lang}',
                        f'{prefix}/train.clean.{tgt_lang}',
                        f'{prefix}/valid.clean.{tgt_lang}']),
        model_prefix=prefix/f'spm{vocab_size}',
        vocab_size=vocab_size,
        character_coverage=1,
        model_type='unigram',  # 'bpe' works as well
        input_sentence_size=1e6,
        shuffle_input_sentence=True,
        normalization_rule_name='nmt_nfkc_cf',
    )
spm_model = spm.SentencePieceProcessor(model_file=str(prefix/f'spm{vocab_size}.model'))
in_tag = {
    'train': 'train.clean',
    'valid': 'valid.clean',
    'test': 'test.raw.clean',
}
for split in ['train', 'valid', 'test']:
    for lang in [src_lang, tgt_lang]:
        out_path = prefix/f'{split}.{lang}'
        if out_path.exists():
            print(f"{out_path} exists. skipping spm_encode.")
        else:
            with open(prefix/f'{split}.{lang}', 'w') as out_f:
                with open(prefix/f'{in_tag[split]}.{lang}', 'r') as in_f:
                    for line in in_f:
                        line = line.strip()
                        tok = spm_model.encode(line, out_type=str)
                        print(' '.join(tok), file=out_f)
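As a quick check, the trained model can be used for an encode/decode round trip (the exact subword pieces depend on the trained model; the sentence below is just an example):

# encode a sentence into subword pieces and merge them back
pieces = spm_model.encode('Thank you so much, Chris.', out_type=str)
print(pieces)                    # a list of pieces such as ['▁thank', '▁you', ...] (model dependent)
print(spm_model.decode(pieces))  # pieces merged back into a detokenized sentence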
    # cpu threads when fetching & processing data.
    num_workers=2,
    # batch size in terms of tokens. gradient accumulation increases the effective batch size.
    max_tokens=8192,
    accum_steps=2,

    # the lr is calculated by the Noam lr scheduler. you can tune the maximum lr by this factor.
    lr_factor=2.,
    lr_warmup=4000,

    # maximum epochs for training
    max_epoch=30,
    start_epoch=1,

    # beam size for beam search
    beam=5,
    # generate sequences of maximum length ax + b, where x is the source length
    max_len_a=1.2,
    max_len_b=10,
    # when decoding, post-process the sentence by removing sentencepiece symbols and jieba tokenization spaces.
    post_process="sentencepiece",

    # checkpoints
    keep_last_epochs=5,
    resume=None,  # if resume from checkpoint name (under config.savedir)

    # logging
    use_wandb=False,
)
Logging
The logging package logs ordinary messages
wandb logs the loss, BLEU, etc. during training
logger.info("loading data for epoch 1") task.load_dataset(split="train", epoch=1, combine=True) # combine if you have back-translation data. task.load_dataset(split="valid", epoch=1)
def load_data_iterator(task, split, epoch=1, max_tokens=4000, num_workers=1, cached=True):
    batch_iterator = task.get_batch_iterator(
        dataset=task.dataset(split),
        max_tokens=max_tokens,
        max_sentences=None,
        max_positions=utils.resolve_max_positions(
            task.max_positions(),
            max_tokens,
        ),
        ignore_invalid_inputs=True,
        seed=seed,
        num_workers=num_workers,
        epoch=epoch,
        disable_iterator_cache=not cached,
        # Set this to False to speed up. However, if set to False, changing max_tokens beyond
        # the first call of this method has no effect.
    )
    return batch_iterator
Each batch is a Python dict with string keys and Tensor values. Its contents are described below:
batch = {
    "id": id,                    # id for each example
    "nsentences": len(samples),  # batch size (sentences)
    "ntokens": ntokens,          # batch size (tokens)
    "net_input": {
        "src_tokens": src_tokens,                  # sequence in source language
        "src_lengths": src_lengths,                # sequence length of each example before padding
        "prev_output_tokens": prev_output_tokens,  # right-shifted target, as mentioned above
    },
    "target": target,            # target sequence
}
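To see this in practice, one can pull a single batch from the iterator above (a small sketch; it assumes task and seed have already been set up as in the rest of the notebook):

# grab one small validation batch and inspect its fields
demo_epoch_obj = load_data_iterator(task, "valid", epoch=1, max_tokens=40, num_workers=1, cached=False)
demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=False)
sample = next(demo_iter)
print(sample["nsentences"], sample["ntokens"])            # number of sentences / tokens in the batch
print(sample["net_input"]["src_tokens"].shape)            # (batch, padded source length)
print(sample["net_input"]["prev_output_tokens"][0][:10])  # right-shifted target of the first example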
Model Architecture
We inherit from fairseq's encoder, decoder and model classes, so that at test time we can directly leverage fairseq's beam search decoder.
from fairseq.models import (
    FairseqEncoder,
    FairseqIncrementalDecoder,
    FairseqEncoderDecoderModel
)
Encoder
The Encoder is an RNN or Transformer encoder; the following description is for the RNN case. For every input token, the Encoder produces an output vector and a hidden state vector, and the hidden state is passed on to the next timestep. In other words, the Encoder sequentially reads in the input sequence and outputs a vector at each timestep, finally producing the hidden states of the last timestep, also called the context vector, which summarizes the whole sequence.
Parameters:
args
encoder_embed_dim: the dimension of the embeddings; this maps each one-hot token vector into a fixed, lower-dimensional space (dimension reduction)
encoder_ffn_embed_dim is the dimension of hidden states and output vectors
encoder_layers is the number of layers for Encoder RNN
dropout determines the probability of a neuron’s activation being set to 0, in order to prevent overfitting. Generally this is applied in training, and removed in testing.
dictionary: the dictionary provided by fairseq. it’s used to obtain the padding index, and in turn the encoder padding mask.
embed_tokens: an instance of token embeddings (nn.Embedding)
Inputs:
src_tokens: integer sequence representing the English sentence, e.g. 1, 28, 29, 205, 2
Outputs:
outputs: the outputs of the RNN at each timestep, which can be further processed by attention
final_hiddens: the hidden states of each layer at the final timestep, which will be passed to the Decoder to initialize decoding
encoder_padding_mask: this tells the decoder which positions to ignore
        # Since the Encoder is bidirectional, we need to concatenate the hidden states of the two directions
        final_hiddens = self.combine_bidir(final_hiddens, bsz)
        # hidden = [num_layers x batch x num_directions*hidden]

        encoder_padding_mask = src_tokens.eq(self.padding_idx).t()
        return tuple(
            (
                outputs,  # seq_len x batch x hidden
                final_hiddens,  # num_layers x batch x num_directions*hidden
                encoder_padding_mask,  # seq_len x batch
            )
        )
    def reorder_encoder_out(self, encoder_out, new_order):
        # This is used by fairseq's beam search. How and why is not particularly important here.
        return tuple(
            (
                encoder_out[0].index_select(1, new_order),
                encoder_out[1].index_select(1, new_order),
                encoder_out[2].index_select(1, new_order),
            )
        )
Attention
When the input sequence is long, the "context vector" alone cannot accurately represent the whole sequence, so the attention mechanism provides the Decoder with more information.
At each decoding timestep, the current Decoder embedding/state is matched against the Encoder outputs to determine their correlation, and the Encoder outputs are then summed, weighted by these correlations, to form an extra input for the Decoder RNN.
Common attention implementations use a neural network or a dot product to compute the correlation between the query (the decoder state) and the keys (the Encoder outputs), apply a softmax to obtain a distribution, and finally take the weighted sum of the values (the Encoder outputs) under that distribution.
Parameters:
input_embed_dim: dimensionality of the query, i.e. of the decoder vector used to attend to the others
source_embed_dim: dimensionality of the keys/values, i.e. of the vectors being attended to (the Encoder outputs)
output_embed_dim: dimensionality of the vector produced after attention, as expected by the next layer
Inputs:
inputs: the query, i.e. the decoder vectors that attend to the others
encoder_outputs: the keys/values, i.e. the vectors being attended to
encoder_padding_mask: this tells the decoder which positions to ignore
    def forward(self, inputs, encoder_outputs, encoder_padding_mask):
        # inputs: T x B x dim
        # encoder_outputs: S x B x dim
        # padding mask: S x B

        # convert all to batch first
        inputs = inputs.transpose(1, 0)                              # B x T x dim
        encoder_outputs = encoder_outputs.transpose(1, 0)            # B x S x dim
        encoder_padding_mask = encoder_padding_mask.transpose(1, 0)  # B x S

        # project the decoder vectors to the dimensionality of encoder_outputs
        x = self.input_proj(inputs)

        # compute dot-product attention scores: (B, T, dim) x (B, dim, S) = (B, T, S)
        attn_scores = torch.bmm(x, encoder_outputs.transpose(1, 2))

        # cancel the attention at positions corresponding to padding
        if encoder_padding_mask is not None:
            # leveraging broadcast: B x S -> B x 1 x S
            encoder_padding_mask = encoder_padding_mask.unsqueeze(1)
            attn_scores = (
                attn_scores.float()
                .masked_fill_(encoder_padding_mask, float("-inf"))
                .type_as(attn_scores)
            )  # FP16 support: cast to float and back

        # softmax on the dimension corresponding to the source sequence
        attn_scores = F.softmax(attn_scores, dim=-1)

        # weighted sum: (B, T, S) x (B, S, dim) = (B, T, dim)
        x = torch.bmm(attn_scores, encoder_outputs)

        # (B, T, dim)
        x = torch.cat((x, inputs), dim=-1)
        x = torch.tanh(self.output_proj(x))  # concat + linear + tanh
Decoder
The hidden states of the Decoder will be initialized by the final hidden states of the Encoder (the context vector)
At the same time, the Decoder updates its hidden states based on the input of the current timestep (the outputs of previous timesteps), and generates an output
Attention improves the performance
The seq2seq steps are implemented in the Decoder, so that later the Seq2Seq class can accept RNN and Transformer models without further modification.
Parameters:
args
decoder_embed_dim: the dimensionality of the decoder embeddings, similar to encoder_embed_dim
decoder_ffn_embed_dim: the dimensionality of the decoder RNN hidden states, similar to encoder_ffn_embed_dim
decoder_layers: number of layers of the RNN decoder
share_decoder_input_output_embed: usually, the output projection matrix of the decoder shares weights with the decoder input embeddings
dictionary: the dictionary provided by fairseq
embed_tokens: an instance of token embeddings (nn.Embedding)
Inputs:
prev_output_tokens: integer sequence representing the right-shifted target e.g. 1, 28, 29, 205, 2
encoder_out: encoder’s output.
incremental_state: in order to speed up decoding during test time, we will save the hidden state of each timestep. see forward() for details.
Outputs:
outputs: the logits (before softmax) output by the decoder at each timestep
        assert args.decoder_layers == args.encoder_layers, \
            f"seq2seq rnn requires that encoder and decoder have the same number of layers. " \
            f"got: {args.encoder_layers, args.decoder_layers}"
        assert args.decoder_ffn_embed_dim == args.encoder_ffn_embed_dim * 2, \
            f"seq2seq-rnn requires the decoder hidden dim to be 2*encoder hidden dim. " \
            f"got: {args.decoder_ffn_embed_dim, args.encoder_ffn_embed_dim * 2}"
    def forward(self, prev_output_tokens, encoder_out, incremental_state=None, **unused):
        # extract the outputs from the encoder
        encoder_outputs, encoder_hiddens, encoder_padding_mask = encoder_out
        # outputs: seq_len x batch x num_directions*hidden
        # encoder_hiddens: num_layers x batch x num_directions*encoder_hidden
        # padding_mask: seq_len x batch

        if incremental_state is not None and len(incremental_state) > 0:
            # if the information from the last timestep is retained, we can continue from there instead of starting from bos
            prev_output_tokens = prev_output_tokens[:, -1:]
            cache_state = self.get_incremental_state(incremental_state, "cached_state")
            prev_hiddens = cache_state["prev_hiddens"]
        else:
            # incremental state does not exist: either this is training time, or the first timestep of test time
            # prepare for seq2seq: pass the encoder hidden states to the decoder hidden states
            prev_hiddens = encoder_hiddens

        bsz, seqlen = prev_output_tokens.size()

        # embed tokens
        x = self.embed_tokens(prev_output_tokens)
        x = self.dropout_in_module(x)

        # project to embedding size (if hidden differs from embed size, and share_embedding is True,
        # we need to do an extra projection)
        if self.project_out_dim is not None:
            x = self.project_out_dim(x)

        # project to vocab size
        x = self.output_projection(x)

        # T x B x C -> B x T x C
        x = x.transpose(1, 0)

        # if incremental, record the hidden states of the current timestep, which will be restored in the next timestep
        cache_state = {
            "prev_hiddens": final_hiddens,
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)

        return x, None
    def reorder_incremental_state(
        self,
        incremental_state,
        new_order,
    ):
        # This is used by fairseq's beam search. How and why is not particularly important here.
        cache_state = self.get_incremental_state(incremental_state, "cached_state")
        prev_hiddens = cache_state["prev_hiddens"]
        prev_hiddens = [p.index_select(0, new_order) for p in prev_hiddens]
        cache_state = {
            "prev_hiddens": torch.stack(prev_hiddens),
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)
        return
Seq2Seq
Composed of Encoder and Decoder
Receives inputs and passes them to the Encoder
Passes the outputs from the Encoder to the Decoder
The Decoder decodes according to the outputs of previous timesteps as well as the Encoder outputs
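Putting the two together, a minimal sketch of such a wrapper could look like the following (it builds on the FairseqEncoderDecoderModel imported above; the exact keyword arguments passed to the encoder and decoder are assumptions):

# a minimal sketch of the Seq2Seq wrapper, not the definitive implementation
class Seq2Seq(FairseqEncoderDecoderModel):
    def __init__(self, args, encoder, decoder):
        super().__init__(encoder, decoder)
        self.args = args

    def forward(self, src_tokens, src_lengths, prev_output_tokens, **kwargs):
        # 1. encode the source sentence
        encoder_out = self.encoder(src_tokens, src_lengths=src_lengths)
        # 2. decode conditioned on the encoder outputs and the right-shifted target
        logits, extra = self.decoder(prev_output_tokens, encoder_out=encoder_out)
        return logits, extra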
def build_model(args, task):
    """ build a model instance based on hyperparameters """
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary
# patches on default parameters for Transformer (those not set above)
# from fairseq.models.transformer import base_architecture
# base_architecture(arch_args)

# add_transformer_args(arch_args)
if config.use_wandb:
    wandb.config.update(vars(arch_args))
model = build_model(arch_args, task)
logger.info(model)
Optimization
Loss: Label Smoothing Regularization
lets the model learn to generate less concentrated distributions and prevents over-confidence
sometimes the ground truth may not be the only answer; thus, when calculating the loss, we reserve some probability mass for labels other than the correct one
    def forward(self, lprobs, target):
        if target.dim() == lprobs.dim() - 1:
            target = target.unsqueeze(-1)
        # nll: negative log likelihood, the cross-entropy when the target is one-hot.
        # the following line is the same as F.nll_loss
        nll_loss = -lprobs.gather(dim=-1, index=target)
        # reserve some probability for other labels. thus when calculating cross-entropy,
        # it is equivalent to summing the log probs of all labels
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
        if self.ignore_index is not None:
            pad_mask = target.eq(self.ignore_index)
            nll_loss.masked_fill_(pad_mask, 0.0)
            smooth_loss.masked_fill_(pad_mask, 0.0)
        else:
            nll_loss = nll_loss.squeeze(-1)
            smooth_loss = smooth_loss.squeeze(-1)
        if self.reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()
        # when calculating cross-entropy, add the loss of other labels
        eps_i = self.smoothing / lprobs.size(-1)
        loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss
        return loss
# generally, 0.1 is good enough
criterion = LabelSmoothedCrossEntropyCriterion(
    smoothing=0.1,
    ignore_index=task.target_dictionary.pad(),
)
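In formula form, the criterion above computes the following, with \epsilon the smoothing value and V the vocabulary size (matching eps_i = smoothing / V in the code):

\text{loss} = (1-\epsilon)\,\text{nll\_loss} + \frac{\epsilon}{V}\,\text{smooth\_loss},
\quad \text{nll\_loss} = -\log p_\theta(y),
\quad \text{smooth\_loss} = -\sum_{k=1}^{V}\log p_\theta(k)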
Optimizer: Adam + lr scheduling
Inverse square root scheduling is important for stability when training Transformers. It was later applied to RNNs as well.
Update the learning rate according to the following equation: increase it linearly in the first stage, then decay it proportionally to the inverse square root of the update step.
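This is the inverse-square-root (Noam) schedule from Vaswani et al. (2017); scaled by the lr_factor from the config, the rate at update step t is roughly:

lr(t) = \text{lr\_factor} \cdot d_{\text{model}}^{-0.5} \cdot \min\left(t^{-0.5},\ t \cdot \text{lr\_warmup}^{-1.5}\right)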
    def multiply_grads(self, c):
        """Multiplies grads by a constant *c*."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.data.mul_(c)

    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
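For reference, a possible rate() method implementing the schedule above could look like this (the attribute names self.factor, self.model_size and self.warmup are assumptions about how the scheduler wrapper was initialized; adapt them to your own class):

    def rate(self, step=None):
        # a sketch of the inverse-square-root schedule described above
        if step is None:
            step = self._step
        return self.factor * self.model_size ** -0.5 * min(step ** -0.5, step * self.warmup ** -1.5)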
    model.train()
    progress = tqdm.tqdm(itr, desc=f"train epoch {epoch_itr.epoch}", leave=False)
    for samples in progress:
        model.zero_grad()
        accum_loss = 0
        sample_size = 0
        # gradient accumulation: update every accum_steps samples
        for i, sample in enumerate(samples):
            if i == 1:
                # emptying the CUDA cache after the first step can reduce the chance of OOM
                torch.cuda.empty_cache()
        scaler.unscale_(optimizer)
        optimizer.multiply_grads(1 / (sample_size or 1.0))  # (sample_size or 1.0) handles the case of a zero gradient
        gnorm = nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm)  # grad norm clipping prevents gradient explosion
# fairseq's beam search generator
# given the model and an input sequence, produce translation hypotheses by beam search
sequence_generator = task.build_generator([model], config)
def decode(toks, dictionary):
    # convert from Tensor to a human-readable sentence
    s = dictionary.string(
        toks.int().cpu(),
        config.post_process,
    )
    return s if s else "<unk>"
def inference_step(sample, model):
    gen_out = sequence_generator.generate([model], sample)
    srcs = []
    hyps = []
    refs = []
    for i in range(len(gen_out)):
        # for each sample, collect the input, hypothesis and reference, later used to calculate BLEU
        srcs.append(decode(
            utils.strip_pad(sample["net_input"]["src_tokens"][i], task.source_dictionary.pad()),
            task.source_dictionary,
        ))
        hyps.append(decode(
            gen_out[i][0]["tokens"],  # 0 indicates using the top hypothesis in the beam
            task.target_dictionary,
        ))
        refs.append(decode(
            utils.strip_pad(sample["target"][i], task.target_dictionary.pad()),
            task.target_dictionary,
        ))
    return srcs, hyps, refs
    # save epoch samples
    with open(savedir/f"samples{epoch}.{config.source_lang}-{config.target_lang}.txt", "w") as f:
        for s, h in zip(stats["srcs"], stats["hyps"]):
            f.write(f"{s}\t{h}\n")

    # keep track of the best validation bleu
    if getattr(validate_and_save, "best_bleu", 0) < bleu.score:
        validate_and_save.best_bleu = bleu.score
        torch.save(check, savedir/f"checkpoint_best.pt")

    del_file = savedir / f"checkpoint{epoch - config.keep_last_epochs}.pt"
    if del_file.exists():
        del_file.unlink()

    return stats
def try_load_checkpoint(model, optimizer=None, name=None):
    name = name if name else "checkpoint_last.pt"
    checkpath = Path(config.savedir)/name
    if checkpath.exists():
        check = torch.load(checkpath)
        model.load_state_dict(check["model"])
        stats = check["stats"]
        step = "unknown"
        if optimizer is not None:
            optimizer._step = step = check["optim"]["step"]
        logger.info(f"loaded checkpoint {checkpath}: step={step} loss={stats['loss']} bleu={stats['bleu']}")
    else:
        logger.info(f"no checkpoints found at {checkpath}!")
Main
Training loop
model = model.to(device=device)
criterion = criterion.to(device=device)
!nvidia-smi
logger.info("task: {}".format(task.__class__.__name__)) logger.info("encoder: {}".format(model.encoder.__class__.__name__)) logger.info("decoder: {}".format(model.decoder.__class__.__name__)) logger.info("criterion: {}".format(criterion.__class__.__name__)) logger.info("optimizer: {}".format(optimizer.__class__.__name__)) logger.info( "num. model params: {:,} (num. trained: {:,})".format( sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad), ) ) logger.info(f"max tokens per batch = {config.max_tokens}, accumulate steps = {config.accum_steps}")
# averaging a few checkpoints can have a similar effect to an ensemble
checkdir=config.savedir
!python ./fairseq/scripts/average_checkpoints.py \
    --inputs {checkdir} \
    --num-epoch-checkpoints 5 \
    --output {checkdir}/avg_last_5_checkpoint.pt
Confirm model weights used to generate submission
# checkpoint_last.pt       : latest epoch
# checkpoint_best.pt       : highest validation bleu
# avg_last_5_checkpoint.pt : the average of the last 5 epochs
try_load_checkpoint(model, name="avg_last_5_checkpoint.pt")
validate(model, task, criterion, log_to_wandb=False)
None
    model.eval()
    progress = tqdm.tqdm(itr, desc=f"prediction")
    with torch.no_grad():
        for i, sample in enumerate(progress):
            # move the batch to GPU
            sample = utils.move_to_cuda(sample, device=device)
            # do inference
            s, h, r = inference_step(sample, model)

            hyps.extend(h)
            idxs.extend(list(sample['id']))

    # sort based on the order before preprocessing
    hyps = [x for _, x in sorted(zip(idxs, hyps))]

    with open(outfile, "w") as f:
        for h in hyps:
            f.write(h + "\n")
generate_prediction(model, task)
raise  # intentionally stop execution here; the cells below are for back-translation
Back-translation
Train a backward translation model
Switch the source_lang and target_lang in config
Change the savedir in config (e.g. "./checkpoints/transformer-back"), as sketched below
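In code, these two changes amount to something like the following (assuming config is the Namespace defined earlier; the backward model translates zh to en):

# flip the translation direction and save the backward model separately
config.source_lang, config.target_lang = 'zh', 'en'
config.savedir = './checkpoints/transformer-back'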
urls = (
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214986&authkey=AANUKbGfZx0kM80"',
    # # If the above links die, use the following instead.
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/ted_zh_corpus.deduped.gz",
    # # If the above links die, use the following instead.
    # "https://mega.nz/#!vMNnDShR!4eHDxzlpzIpdpeQTD-htatU_C7QwcBTwGDaSeBqH534",
)
file_names = (
    'ted_zh_corpus.deduped.gz',
)
Then you can use generate_prediction(model, task, split="split_name") to generate the translation predictions.
# Add the binarized monolingual data to the original data directory, and name it with "split_name"
# e.g. ./DATA/data-bin/ted2020/[split_name].zh-en.["en", "zh"].["bin", "idx"]
!cp ./DATA/data-bin/mono/train.zh-en.zh.bin ./DATA/data-bin/ted2020/mono.zh-en.zh.bin
!cp ./DATA/data-bin/mono/train.zh-en.zh.idx ./DATA/data-bin/ted2020/mono.zh-en.zh.idx
!cp ./DATA/data-bin/mono/train.zh-en.en.bin ./DATA/data-bin/ted2020/mono.zh-en.en.bin
!cp ./DATA/data-bin/mono/train.zh-en.en.idx ./DATA/data-bin/ted2020/mono.zh-en.en.idx
# hint: do prediction on split='mono' to create prediction_file
# generate_prediction( ... , split=... , outfile=... )
TODO: Create new dataset
Combine the prediction data with the monolingual data (see the sketch after this list)
Use the original spm model to tokenize the data into subword units
Change the datadir in config ("./DATA/data-bin/ted2020_with_mono")
Switch back the source_lang and target_lang in config ("en", "zh")
Change the savedir in config (e.g. "./checkpoints/transformer-bt")
Train the model
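One possible way to carry out the first step is to pair the back-translated English predictions with the original monolingual Chinese into a synthetic parallel corpus. The sketch below is only an illustration; every file name in it is an assumption, so adjust the paths to wherever your monolingual data and prediction file actually live:

from pathlib import Path

mono_zh = Path('./DATA/rawdata/mono/mono.tok.zh')    # assumed: the cleaned monolingual Chinese sentences
pred_en = Path('./DATA/rawdata/mono/mono.pred.en')   # assumed: output of generate_prediction(..., split='mono')
out_dir = Path('./DATA/rawdata/mono')

zh_lines = mono_zh.read_text().splitlines()
en_lines = pred_en.read_text().splitlines()
assert len(zh_lines) == len(en_lines), "each Chinese sentence needs exactly one predicted English sentence"

# write the synthetic pair: back-translated English as the source side, original Chinese as the target side
with open(out_dir/'mono.synthetic.en', 'w') as en_f, open(out_dir/'mono.synthetic.zh', 'w') as zh_f:
    for en, zh in zip(en_lines, zh_lines):
        print(en, file=en_f)
        print(zh, file=zh_f)

# next: encode both files with the original spm8000 model (as in the earlier spm_encode step),
# binarize them, and place the result under ./DATA/data-bin/ted2020_with_mono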
References
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., … & Auli, M. (2019, June). fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations) (pp. 48-53).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017, December). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 6000-6010).
Reimers, N., & Gurevych, I. (2020, November). Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
Tiedemann, J. (2012, May). Parallel Data, Tools and Interfaces in OPUS. In Lrec (Vol. 2012, pp. 2214-2218).
Kudo, T., & Richardson, J. (2018, November). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 66-71).
Sennrich, R., Haddow, B., & Birch, A. (2016, August). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 86-96).
Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 489-500).