Setup
Sockeye expects tokenized input data. The data used in this tutorial has already been tokenized, but keep this requirement in mind for any other data set you want to use with Sockeye. In addition to tokenization, we will split words into subwords using Byte Pair Encoding (BPE). To do so, we use a tool called subword-nmt. Run the following commands to set up the tool:
git clone https://github.com/rsennrich/subword-nmt.git
export PYTHONPATH=$(pwd)/subword-nmt:$PYTHONPATH
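To give a feel for what BPE does to the data, here is a minimal toy sketch of how merge operations can be learned. It is not the code that subword-nmt runs: the function names (get_pair_counts, merge_pair, learn_bpe), the toy vocabulary, and the </w> end-of-word marker are assumptions made purely for illustration.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    # Match the pair only when it is bounded by whitespace or string edges.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Repeatedly merge the most frequent adjacent pair (hypothetical helper)."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Toy vocabulary (assumed data): words are pre-split into characters,
# with </w> marking the end of a word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges, segmented = learn_bpe(toy_vocab, 3)
print(merges)
print(segmented)
```

Each merge joins the pair of adjacent symbols that occurs most often, so frequent character sequences gradually become single subword units while rare words remain split into smaller pieces.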
We will visualize training progress using TensorBoard. Install it using:
pip install tensorboard
More on: https://awslabs.github.io/sockeye/tutorials/wmt.html