Name Gender Classification with RNN

Let’s explore the application of an RNN to a simple task: name gender classification. The data has the following format

Name,Gender
Terrone,M
Annaley,F

The goal is to classify the gender by reading the name with a character-level model. The data is provided as train and test splits. Use accuracy to evaluate model performance.

Data Preprocessing

The data is stored in CSV format and easily fits into memory. We suggest reading the CSV file with pandas and then converting the DataFrame to a numpy array.
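
A minimal sketch of this step, assuming the train split is stored in a file named train.csv (the actual file name may differ):

import pandas as pd

# Read the CSV and convert it to a numpy array of shape (n_samples, 2):
# column 0 holds the names, column 1 holds the gender labels.
data = pd.read_csv("train.csv").values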

The dataset is a collection of strings of variable length, and the labels for the training samples are strings as well. This format is not very friendly for learning algorithms. Below, we discuss how to preprocess the data before passing it to the training algorithm.

Machine-Readable Representation

The simplest way to convert the string representation into a machine-readable format is to substitute each character with a unique integer identifier. This is easily achieved by creating a character vocabulary. Assume you have read the CSV and converted the data into numpy’s ndarray

unique = list(set("".join(data[:, 0])))  # collect all distinct characters
unique.sort()                            # fix the ordering for reproducibility
vocab = dict(zip(unique, range(1, len(unique) + 1)))  # character -> integer id

Here we start indexing at 1 so that 0 remains reserved for padding variable-length input.

Learning algorithms are generally bad at handling variable-length input. Even recurrent networks in practice take inputs of a fixed size to optimize computations. To handle variable-length names, find the length of the longest name and use it as the maximum length. If you encounter a name longer than this maximum, crop it.

Normalize every name to a format where the letters are replaced with their identifiers and the remaining positions are padded with zeros. For example, the name Elizabeth is converted to

[ 5 38 35 52 27 28 31 46 34  0  0  0  0  0  0]

where the maximum length of a name is 15.
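
A minimal sketch of this normalization, assuming data and vocab from above (encode_name is a hypothetical helper name):

import numpy as np

def encode_name(name, vocab, max_len):
    # Replace characters with their ids; crop names longer than max_len.
    ids = [vocab[c] for c in name[:max_len]]
    # Pad the remaining positions with zeros.
    return ids + [0] * (max_len - len(ids))

max_len = max(len(name) for name in data[:, 0])
encoded = np.array([encode_name(n, vocab, max_len) for n in data[:, 0]], dtype=np.int32)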

Character Embeddings

On their own, integers are not a very good representation of the data, especially for neural networks, which inherently assume that the data belongs to a continuous space. One approach to convert the integer sequence representation into something meaningful is to create an embedding for every character. For our dataset of English names, we have 52 unique characters.

The simplest type of embedding is the one-hot embedding.
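
A minimal one-hot sketch, using the values established above (T = 15 is the maximum name length, vocab_size = 52; the placeholder name names is illustrative):

import tensorflow as tf

T = 15          # maximum name length, i.e. the number of time steps
vocab_size = 52 # number of unique characters

names = tf.placeholder(shape=[None, T], dtype=tf.int32)
# Depth is vocab_size + 1 to account for the padding id 0.
one_hot_names = tf.one_hot(names, depth=vocab_size + 1)  # [None, T, vocab_size + 1]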

If you decide to use trainable embeddings, the strategy is similar to the Word2Vec case: create a variable that stores the embeddings, then retrieve them with embedding_lookup

embedding_matrix = tf.get_variable("emb_matr", shape=(vocab_size, embedding_size))
input_ = tf.placeholder(shape=[None, T], dtype=tf.int32)  # T is the maximum length of a name, i.e. the number of time steps
embedded_names = tf.nn.embedding_lookup(embedding_matrix, input_)  # has shape [None, T, embedding_size]

Also, you probably want a zero embedding to be used for the zero padding. For this, you can create a zero-vector constant and concatenate it with the embedding matrix. This ensures the zero vector stays zero and is not altered during training.
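
A minimal sketch of this, assuming vocab_size and embedding_size from above (the variable names are illustrative):

# Row 0 is a non-trainable zero vector used for padding positions;
# gradients flow only into char_embeddings, so row 0 stays zero.
pad_row = tf.zeros(shape=(1, embedding_size))
char_embeddings = tf.get_variable("char_emb", shape=(vocab_size, embedding_size))
embedding_matrix = tf.concat([pad_row, char_embeddings], axis=0)
embedded_names = tf.nn.embedding_lookup(embedding_matrix, input_)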

Classification Models

Baseline LSTM

One of the simplest classification models for this task is an LSTM with one dense layer. To create an LSTM layer, first create an LSTM cell. Then use TensorFlow’s dynamic_rnn to unroll the LSTM for all T time steps.

lstm = tf.nn.rnn_cell.LSTMCell(lstm_hidden_size)  # cell with hidden state of size lstm_hidden_size
outputs, states = tf.nn.dynamic_rnn(cell=lstm, inputs=embedded_names, dtype=tf.float32)  # unroll over all T steps

dynamic_rnn returns the output values for every time step, as well as the final hidden state. The shape of outputs is [None, T, lstm_hidden_size]; for an LSTMCell, states is a tuple of the final cell and hidden states, each of shape [None, lstm_hidden_size]. We are interested only in outputs[:, -1, :], since we use the output from the last time step to make the classification decision.

Calculate logits with a single neuron

logits = tf.layers.dense(outputs[:, -1, :], 1)  # one output neuron on the last time step

Use sigmoid cross-entropy loss for optimization.
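
A minimal sketch, using the F->0 and M->1 label mapping and the Adam settings described below (the placeholder name labels is illustrative):

labels = tf.placeholder(shape=[None, 1], dtype=tf.float32)  # F -> 0., M -> 1.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

# Accuracy for evaluation: threshold the sigmoid output at 0.5.
predictions = tf.cast(tf.sigmoid(logits) > 0.5, tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))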

This is the baseline LSTM model. Assume the following label mapping: F->0 and M->1. In our baseline model, we create trainable embeddings of size 5, create an LSTM cell with a hidden state size of 5, and train it with the Adam optimizer for 100 epochs with a learning rate of 0.001. The performance of our baseline LSTM model:

trainable parameters: 486
Epoch 99, train loss 0.29588, train acc 0.87891, test loss 0.42925, test accuracy 0.80204
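
A training loop for these settings could look like the following sketch (iterate_minibatches is a hypothetical batching helper; train_x and train_y are the preprocessed name and label arrays; input_, labels, and train_op come from the snippets above):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(100):
        # Feed one mini-batch at a time and run a single optimization step.
        for batch_x, batch_y in iterate_minibatches(train_x, train_y):
            sess.run(train_op, feed_dict={input_: batch_x, labels: batch_y})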

Also, we sorted names by length before creating mini-batches to help the model learn shorter sequences first. The train loss above was calculated on the last mini-batch of the dataset sorted by length. You will probably see dramatically different results when testing on mini-batches with shorter names. One explanation is that trailing padding negatively influences performance.

Classical Neural Network

Since we fixed the maximum length of a name, why not apply a regular neural network? A simple NN requires the input tensor to have rank two, whereas after performing the embedding lookup we have a tensor of rank three. The simplest way to handle this discrepancy is to flatten the input

flat_names = tf.reshape(embedded_names, shape=(-1, T * embedding_size))

The performance of this model:

trainable parameters: 1610
Epoch 99, train loss 0.44921, train acc 0.83203, test loss 0.50852, test accuracy 0.74993

There are also other ways to flatten the input (see the sketch after this list):

  • Maxpool You can use reduce_max along the T or embedding_size axis to flatten the input.
  • Average You can use reduce_mean along the T or embedding_size axis to flatten the input.
  • Weighted Average Instead of using the mean value, you can apply a weighted average along the T or embedding_size axis to flatten the input.
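
A minimal sketch of these alternatives along the T axis, assuming embedded_names of shape [None, T, embedding_size] (the position-weight variable for the weighted average is an illustrative choice):

max_pooled = tf.reduce_max(embedded_names, axis=1)    # [None, embedding_size]
mean_pooled = tf.reduce_mean(embedded_names, axis=1)  # [None, embedding_size]

# Weighted average: trainable per-position weights, normalized with softmax.
position_logits = tf.get_variable("pos_weights", shape=(1, T, 1))
position_weights = tf.nn.softmax(position_logits, axis=1)
weighted = tf.reduce_sum(embedded_names * position_weights, axis=1)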

The implementation of the classification pipeline can be found on GitHub.

Hyperparameter Tuning

The performance of a model depends on a number of hyperparameters. For a regular neural network, these include

  • learning rate
  • layer sizes
  • activation functions
  • number of epochs
  • etc.

In this assignment, you are going to implement several classification models and try to find the best architecture and parameters based on the model performance on the test set.

During the parameter search, keep in mind that complex models are not always the best. Do a simple sanity check by comparing performance on the test and training data. If your model performs much better on the training data than on the test data, you are probably overfitting.

When comparing different models, keep in mind that there are multiple dimensions of comparison. The most obvious are:

  • model performance
  • number of trainable parameters
  • training speed

(Figure: an example of a training speed comparison between models.)

Try to evaluate different models along these dimensions and make a fair comparison. For example, if your fully connected model is much better than the LSTM model but uses five times more parameters, the comparison is probably unfair.

To get the number of trainable parameters in TensorFlow, you can run

np.sum([np.prod(v.get_shape().as_list()) for v in tf.trainable_variables()])  # total element count of all trainable variables
