
How to use pre-trained word vectors with Keras

Ever wondered how to use pre-trained word vectors like Word2Vec or FastText to get the best performance out of your neural network? Here's where to start.

What are pre-trained word vectors?

Word vectors are a representation of actual words using vectors of numbers.

But wait, what are vectors?

In computer science, a vector is simply a row of numerical values. It is represented like this:

[23, 18, 45, 56, 41, 98]

Usually, programming languages index positions of a vector starting from 0. In our vector, position 0 has value 23, position 1 has value 18 and so on.
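
A minimal illustration in Python, using a NumPy array for the same numbers:

import numpy as np

# the same six values as above, stored in a NumPy array
v = np.array([23, 18, 45, 56, 41, 98])

print(v[0])      # 23 -> the value at position 0
print(v[1])      # 18 -> the value at position 1
print(v.shape)   # (6,) -> a vector with six components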

How can word vectors help us?

In Natural Language Processing, word vectors are very popular because they tell the model which contexts a word typically appears in. Adding context to an NLP model can significantly improve its accuracy.

The values of a word vector are the coordinates of that specific word in a (usually) 300-dimensional space.

(Image source: https://www.adityathakker.com/introduction-to-word2vec-how-it-works/)

These vectors place semantically similar words close to each other. That means you can even do math operations with words.

Like: husband – man + woman = wife
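
You can try this kind of word arithmetic yourself with gensim. The sketch below assumes you are willing to download Google's pre-trained News vectors through gensim's downloader (a large file, over 1 GB):

import gensim.downloader as api

# downloads and caches the pre-trained Google News vectors the first time it runs
wv = api.load('word2vec-google-news-300')

# husband - man + woman ≈ ...
print(wv.most_similar(positive=['husband', 'woman'], negative=['man'], topn=3))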

When do we need word vectors?

As I said above, word vectors can help the model understand the context in our text, but context is not always needed.

In a machine learning problem, we use word vectors when the classes or continuous values our model has to predict genuinely depend on the semantic context of the text in our dataset.

What is Word2Vec?

Word2Vec is a shallow, two-layered neural network that is trained on a large corpus of text and outputs a vector space with hundreds of dimensions.

The Word2Vec model can be trained using different architectures to produce different outputs.
– CBOW (continuous bag-of-words): the order of the context words does not influence the prediction.
– Skip-gram: nearby context words are weighted more heavily than distant ones.

We will see later in the code how to train and use this kind of model.
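
As a quick preview, gensim (the library we use below) lets you pick the architecture through the sg flag: 0 trains CBOW (the default), 1 trains skip-gram. A minimal sketch, using the gensim 3.x parameter names from the rest of this tutorial:

import gensim

sentences = [['machine', 'learning', 'is', 'cool'],
             ['i', 'really', 'like', 'nlp']]

# sg=0 -> CBOW (the default), sg=1 -> skip-gram
cbow_model = gensim.models.Word2Vec(sentences, size=300, min_count=1, sg=0)
skipgram_model = gensim.models.Word2Vec(sentences, size=300, min_count=1, sg=1)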

Further reading – https://en.wikipedia.org/wiki/Word2vec

What is FastText?

FastText is an open-source library from Facebook AI Research for learning word representations and text classification.

The library represents character-level information using n-grams, which means each word is split into overlapping chunks of n characters. Let's take the word 'machine' with n=3:

<ma, mac, ach, chi, hin, ine, ne>
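
Here is a tiny helper that reproduces this split (a simplified sketch: the real library extracts n-grams for a range of n, 3 to 6 by default, and also keeps the full word as its own token):

# character n-grams with the < and > boundary markers FastText adds
def char_ngrams(word, n=3):
  padded = '<' + word + '>'
  return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('machine'))
# ['<ma', 'mac', 'ach', 'chi', 'hin', 'ine', 'ne>']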

In this tutorial we will use a pre-trained FastText model provided on their website.

Further reading:
– Enriching Word Vectors with Subword Information – https://arxiv.org/abs/1607.04606
– Bag of Tricks for Efficient Text Classification – https://arxiv.org/abs/1607.01759

Let’s get to the code

In this tutorial I’m using Python 3.6

We are going to build a sentiment analyzer using first Word2Vec and then FastText.

Word2Vec

With help from the gensim library you can train your own Word2Vec model on your own dataset.

# importing required libraries
import gensim, re
import numpy as np
import pandas as pd

# some sample data - add yours if you want
data = ['I love machine learning',
        'I don\'t like reading books.',
        'Python is horrible',
        'Machine learning is cool!',
        'I really like NLP']

labels = ['positive', 'negative', 'negative', 'positive', 'positive']

# pre-process our text
text = [re.sub(r'([^\s\w]|_)+', '', sentence) for sentence in data]
text = [sentence.lower().split() for sentence in text]

# train Word2Vec model on our data
word_model = gensim.models.Word2Vec(text, size=300, min_count=1, iter=10)

Now let's see what the Word2Vec class expects as parameters:
– size: the dimensionality of the word vectors (300 in our case).
– min_count: ignore words that appear fewer than this many times (1 in our case, so every word is kept).
– iter: the number of training epochs (10 in our case).
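
Note that gensim 4.x renamed some of these parameters: size became vector_size and iter became epochs. If you are on a newer gensim, the equivalent call looks like this:

# gensim >= 4.0: `size` -> `vector_size`, `iter` -> `epochs`
word_model = gensim.models.Word2Vec(text, vector_size=300, min_count=1, epochs=10)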

After training, we can inspect what the Word2Vec model has learned by using the most_similar function.

# check the most similar word to 'python'
word_model.wv.most_similar(positive='python')
""" Output 
[('an', 0.1256755292415619),
 ('with', 0.10256481170654297),
 ('im', 0.0629001259803772),
 ('word2vec', 0.050945259630680084),
 ('not', 0.028482362627983093),
 ('dance', 0.024584736675024033),
 ('is', 0.021632617339491844),
 ('cool', -0.004487916827201843),
 ('eating', -0.007875476032495499),
 ('hard', -0.011869342997670174)] """

The results are not that great because our dataset is tiny. Performing this on bigger datasets results in much better vectors.
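
You can also pull out the raw 300-dimensional vector learned for any word in the vocabulary, or compare two words directly:

# the learned vector for a single word
vec = word_model.wv['python']
print(vec.shape)  # (300,)

# cosine similarity between two words from our vocabulary
print(word_model.wv.similarity('machine', 'learning'))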

Now, let's prepare the newly trained vectors for a Keras Embedding layer.

Before we build the embedding matrix and the model, we need to tokenize our text.
Tokenization is the process of turning text into sequences of integers, because neural networks cannot work with raw text, only numbers.

# more imports
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# how many of the most frequent words the tokenizer should keep
features = 500
tokenizer = Tokenizer(num_words = features)
# fit the tokenizer on our text
tokenizer.fit_on_texts(text)

# get all words that the tokenizer knows
word_index = tokenizer.word_index

# put the tokens in a matrix
X = tokenizer.texts_to_sequences(text)
X = pad_sequences(X)

# prepare the labels
y = pd.get_dummies(labels)

# split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)
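
Now that every word has an integer index, we can build the embedding matrix so that row i holds the Word2Vec vector of the word the tokenizer mapped to index i (row 0 stays all zeros because the tokenizer reserves index 0 for padding). A side note: in a real project you would fit the tokenizer on the training split only and call texts_to_sequences on both splits, to avoid leaking test data into preprocessing; we keep it simple here because the dataset is tiny.

# save the vectors in a matrix whose rows line up with the tokenizer's indices
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
  if word in word_model.wv:
    embedding_matrix[i] = word_model.wv[word]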

Let’s create the model.

from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding

# init model
model = Sequential()
# embed the pre-trained word vectors (frozen, so they are not updated during training)
model.add(Embedding(len(word_index) + 1, 300, input_length=X.shape[1], weights=[embedding_matrix], trainable=False))
# learn the correlations
model.add(LSTM(300, return_sequences=False))
model.add(Dense(y.shape[1], activation="softmax"))
# output model skeleton
model.summary()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['acc'])

""" Output
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 5, 300)            4500      
_________________________________________________________________
lstm_1 (LSTM)                (None, 300)               721200    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 602       
=================================================================
Total params: 726,302
Trainable params: 721,802
Non-trainable params: 4,500
_________________________________________________________________ """

Let’s get to training!

batch = 32
epochs = 12
model.fit(X_train, y_train, batch_size=batch, epochs=epochs)

After training, we call evaluate to measure the model's performance on the held-out test data; it returns the loss and the accuracy metric we compiled with.

model.evaluate(X_test,y_test)
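
If you want to try the trained model on a brand-new sentence, a minimal sketch (reusing the same tokenizer and padding length) looks like this:

# predict the sentiment of an unseen sentence
new_text = ['i really love machine learning']
new_seq = pad_sequences(tokenizer.texts_to_sequences(new_text), maxlen=X.shape[1])
pred = model.predict(new_seq)
# pd.get_dummies orders the columns alphabetically: ['negative', 'positive']
print(y.columns[pred.argmax(axis=-1)[0]])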

And that’s it! That’s everything you need to do to train a model with a pre-trained Word2Vec embedding layer!


Let’s repeat the process for FastText’s vectors

FastText

The process of using FastText with Keras is slightly different: instead of training our own vectors, we will simply download pre-trained ones from the web.

Let’s get our data first.

# importing required libraries
import gensim, re
import numpy as np
import pandas as pd

# some sample data - add yours if you want
data = ['I love machine learning',
        'I don\'t like reading books.',
        'Python is horrible',
        'Machine learning is cool!',
        'I really like NLP']

labels = ['positive', 'negative', 'negative', 'positive', 'positive']

# pre-process our text
text = [re.sub(r'([^\s\w]|_)+', '', sentence) for sentence in data]
text = [sentence.lower().split() for sentence in text]

To get FastText's vectors, head over to FastText's downloads page.
Pick the language that matches your data and download the vectors, or simply stream them into Python like this (our sample sentences are in English, so we use the English vectors):

from urllib.request import urlopen
import gzip

# get the pre-trained vectors (the English file is large, so this takes a while)
file = gzip.open(urlopen('https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz'))

Now let’s prepare this file for vector extraction.

vocab_and_vectors = {}
# the first line of a .vec file is just a header: "<vocabulary size> <dimension>"
file.readline()
# put words as dict keys and vectors as values
for line in file:
  values = line.split()
  word = values[0].decode('utf-8')
  vector = np.asarray(values[1:], dtype='float32')
  vocab_and_vectors[word] = vector
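
Alternatively, if you have already downloaded the .vec file to disk, gensim can parse it for you (it usually handles the .gz compression transparently, and the limit argument loads only the most frequent words to save memory). A sketch, assuming the file sits next to your script:

from gensim.models import KeyedVectors

# load only the 100,000 most frequent words from the local file
ft_vectors = KeyedVectors.load_word2vec_format('cc.en.300.vec.gz', binary=False, limit=100000)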

Before we create our embedding weights we will tokenize the input.

# more imports
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# how many of the most frequent words the tokenizer should keep
features = 500
tokenizer = Tokenizer(num_words = features)
# fit the tokenizer on our text
tokenizer.fit_on_texts(text)

# get all words that the tokenizer knows
word_index = tokenizer.word_index

# put the tokens in a matrix
X = tokenizer.texts_to_sequences(text)
X = pad_sequences(X)

# prepare the labels
y = pd.get_dummies(labels)

# split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)

Let’s prepare our weights now.

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
  embedding_vector = vocab_and_vectors.get(word)
  # words not found in the pre-trained vocabulary keep an all-zeros row
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector
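
It is worth checking how many of our words actually received a pre-trained vector; rows that stayed all zeros mean the word was not found:

# count the words that got a non-zero (pre-trained) vector
found = sum(1 for i in range(1, len(word_index) + 1) if embedding_matrix[i].any())
print(found, 'of', len(word_index), 'words have a pre-trained vector')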

Now, the process for creating the model is the same as for Word2Vec.

from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding

# init model
model = Sequential()
# embed the pre-trained word vectors (frozen, so they are not updated during training)
model.add(Embedding(len(word_index) + 1, 300, input_length=X.shape[1], weights=[embedding_matrix], trainable=False))
# learn the correlations
model.add(LSTM(300, return_sequences=False))
model.add(Dense(y.shape[1], activation="softmax"))
# output model skeleton
model.summary()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['acc'])

""" Output
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_9 (Embedding)      (None, 5, 300)            4500      
_________________________________________________________________
lstm_5 (LSTM)                (None, 300)               721200    
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 602       
=================================================================
Total params: 726,302
Trainable params: 721,802
Non-trainable params: 4,500
_________________________________________________________________
"""

Let’s start training.

batch = 32
epochs = 12
model.fit(X_train, y_train, batch_size=batch, epochs=epochs)

Great, now let’s evaluate our model.

model.evaluate(X_test,y_test)

And that’s it! You just used Word2Vec and FastText to create your new, improved model for NLP tasks.

Sample code

You can access the full code here – https://drive.google.com/open?id=1qF4Vg5GGnFGc_Wk5XKbI3DdTcle0kQzv
After opening the link, just press the 'Open with Colaboratory' button at the top of your screen.

Thank you for reading!

This was my first blog post! Thanks a lot for reading, and make sure you follow my website for more interesting posts and share this tutorial with your friends.

