Machine Learning Engineer Test - Problem 1
%matplotlib inline
from keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LinearRegression, LogisticRegression, PassiveAggressiveClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from sklearn.svm import SVC
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten, Convolution1D, MaxPooling1D, Input, LSTM
from keras.preprocessing import sequence
from keras.regularizers import l2
from keras.layers import Merge
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
1. Understanding and preprocessing the data
I understand this task as being similar to a sentiment analysis task. The provided training data contains sentences that are already tokenized. Each line has two parts: the first part is the sentiment label of the sentence (0 or 1), and the second part is a list of tokens (words).
I first analyse and pre-process the data:
train_small_file = "data/training-data-small.txt"
train_small_df = pd.read_csv(train_small_file, sep='\t', names=["Label", "Text"])
train_small_df.head()
 | Label | Text
---|---|---
0 | 0 | X773579,Y2640,Y2072,Z4,Z15 |
1 | 0 | X166074297,X123474229,X147204623,X51578397,X23... |
2 | 1 | X374616379,X773579,X344420902,Y1940,Y1705,Z4,Z... |
3 | 0 | X103413307,X37875376,X62716332,X277692318,X344... |
4 | 0 | X123474229,X551805107,X62716661,Y2307,Y2,Y1222... |
train_small_df.Label.hist()
plt.title("Training Label Distribution")
[Histogram output: counts of each label, titled "Training Label Distribution"]
The label distribution shows that the dataset is quite imbalanced; the quick check below quantifies the class proportions.
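To quantify this (a quick check that is not part of the original run), the class proportions can be printed directly from the dataframe:
# Proportion of each label value; shows how skewed the classes are.
train_small_df.Label.value_counts(normalize=True)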
Next, I analyse the length of the texts.
train_small_df_text_len = train_small_df.Text.str.split(",").apply(len)
train_small_df_text_len.describe()
count 10000.000000
mean 26.251200
std 17.841097
min 2.000000
25% 12.000000
50% 22.000000
75% 36.000000
max 100.000000
Name: Text, dtype: float64
The average text is about 26 tokens long, and the longest text has 100 tokens.
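As an optional complement to the summary statistics (a sketch, not part of the original run), the length distribution can also be plotted:
# Histogram of text lengths; confirms that padding/truncating to the maximum
# length will not discard much information.
train_small_df_text_len.hist(bins=50)
plt.title("Text Length Distribution")
plt.xlabel("Number of tokens")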
X, y = np.array(train_small_df.Text), np.array(train_small_df.Label)
Next, I want to transform each sentence into a sequence of integers (the indices of the sentence's tokens in the vocabulary of all tokens in the dataset). I could write a tokenizer to do this myself, but here I use the Tokenizer from Keras. When transforming sentences to sequences, I build the vocabulary from both the training set and the test set.
Load test data:
test_small_file = "data/test-data-small.txt"
test_small_df = pd.read_csv(test_small_file, sep=" ", names=["Text"])
test_small_df_text_len = test_small_df.Text.str.split(",").apply(len)
X_test = np.array(test_small_df.Text)
X_total = np.concatenate((X, X_test))
tknzr = Tokenizer(lower=False, split=',')
tknzr.fit_on_texts(X_total)
XS = tknzr.texts_to_sequences(X)
XS_test = tknzr.texts_to_sequences(X_test)
The sequences have different lengths, so I need to truncate and pad them so that they are all the same length (the length of the longest sentence in the dataset) before modeling.
max_len = max(train_small_df_text_len.max(), test_small_df_text_len.max())
XS = sequence.pad_sequences(XS, maxlen=max_len)
XS_test = sequence.pad_sequences(XS_test, maxlen=max_len)
XS.shape, XS_test.shape
((10000, 102), (100000, 102))
Let’s get the vocabulary size of the dataset:
vocab_size = len(tknzr.word_counts)
vocab_size
127
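As a quick sanity check (a sketch, not in the original notebook), the most frequent tokens can be inspected via the Tokenizer's word_counts dictionary:
# Ten most frequent tokens and their counts across the combined corpus.
sorted(tknzr.word_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]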
I have transformed the dataset into integer sequences with 102 features each.
The problem now becomes a sequence classification problem.
2. Modeling
I will experiment with several different models. First, I create helper functions so that I can evaluate the models with cross-validation.
def cv_evaluate_classifier(classifier, X, y, n_splits=5, **kwargs):
    # Build the classifier with the given keyword arguments and report its
    # mean accuracy over stratified k-fold cross-validation.
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    clf = classifier(**kwargs)
    results = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')
    print('\nModel average accuracy: {:.2f}'.format(results.mean()))
def cv_evaluate_nn_model(build_fn, X, y, nb_epoch=5, n_splits=5, batch_size=32, **kwargs):
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
model = KerasClassifier(build_fn=build_fn, nb_epoch=nb_epoch, batch_size=batch_size, verbose=1)
results = cross_val_score(model, X, y, cv=kfold)
print('\nModel average accuracy: {:.2f}'.format(results.mean()))
I start with a LogisticRegression model:
def create_lr_model():
lr_model = LogisticRegression()
return lr_model
cv_evaluate_classifier(create_lr_model, XS, y)
Model average accuracy: 0.66
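The regularization strength C of LogisticRegression could also be tuned with the GridSearchCV import above; a minimal sketch (the grid values are untuned guesses, and this was not part of the original experiments):
# Grid-search C over the same stratified 5-fold CV used elsewhere.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=kfold, scoring='accuracy')
grid.fit(XS, y)
print(grid.best_params_, grid.best_score_)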
Next, I try an SVM model:
def create_svm_model():
svm_model = SVC(C=10, kernel='linear')
return svm_model
cv_evaluate_classifier(create_svm_model, XS, y)
This takes too long in my environment, so I skip it.
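A much faster substitute for the kernel SVC (a sketch with untuned hyperparameters, not what the original cell computes) is a linear SVM trained with stochastic gradient descent, available through the linear_model import above:
# SGD-trained linear SVM (hinge loss); scales far better than kernel SVC here.
cv_evaluate_classifier(linear_model.SGDClassifier, XS, y,
                       loss='hinge', penalty='l2', random_state=0)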
Experiment with Deep Learning models
I experiment with some neural network models to see whether deep learning works for this dataset. First, I try a simple NN with a single hidden layer.
def single_nn_model():
model = Sequential([
Embedding(vocab_size+1, 32, input_length=max_len),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.5),
Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
cv_evaluate_nn_model(single_nn_model, XS, y, nb_epoch=2)
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5665 - acc: 0.7061
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.4828 - acc: 0.7659
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5670 - acc: 0.6979
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.4893 - acc: 0.7644
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5632 - acc: 0.7047
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.4838 - acc: 0.7724
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5699 - acc: 0.7009
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.4887 - acc: 0.7634
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5646 - acc: 0.7053
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.4811 - acc: 0.7707
Model average accuracy: 0.77
We get an average accuracy of 0.77 with this simple NN model, which is much better than the LogisticRegression model (0.66).
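Because the labels are imbalanced, accuracy alone can be misleading. A sketch (not run in the original notebook) that uses the train_test_split and classification_report imports above to check per-class precision and recall for this model:
# Hold out a stratified validation split and report per-class metrics.
X_tr, X_val, y_tr, y_val = train_test_split(XS, y, test_size=0.2,
                                            stratify=y, random_state=0)
val_model = single_nn_model()
val_model.fit(X_tr, y_tr, batch_size=32, nb_epoch=2, verbose=0)
val_preds = (val_model.predict(X_val) > 0.5).astype(int).ravel()
print(classification_report(y_val, val_preds))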
Next, I try a more complex NN: a single convolutional layer with max pooling.
def create_conv_nn():
conv = Sequential([
Embedding(vocab_size+1, 32, input_length=max_len, dropout=0.2),
Dropout(0.2),
Convolution1D(64, 3, border_mode='same', activation='relu'),
Dropout(0.2),
MaxPooling1D(),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid')
])
conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return conv
cv_evaluate_nn_model(create_conv_nn, XS, y, nb_epoch=2)
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5867 - acc: 0.6816
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.5256 - acc: 0.7406
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5876 - acc: 0.6801
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.5327 - acc: 0.7344
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5776 - acc: 0.6911
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.5216 - acc: 0.7466
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5766 - acc: 0.6934
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.5340 - acc: 0.7435
Epoch 1/2
8000/8000 [==============================] - 0s - loss: 0.5752 - acc: 0.6943
Epoch 2/2
8000/8000 [==============================] - 0s - loss: 0.5289 - acc: 0.7421
Model average accuracy: 0.76
The accuracy does not seem to improve.
Above I used a CNN layer with a single kernel size (3). Here I try another model with multiple kernel sizes in the CNN layer.
def create_multi_cnn_model():
    graph_in = Input(shape=(max_len, 32))  # matches the Embedding output shape fed into this sub-model
convs = []
for ker_size in range(3,6):
x = Convolution1D(64, ker_size, border_mode='same', activation='relu')(graph_in)
x = MaxPooling1D()(x)
x = Flatten()(x)
convs.append(x)
out = Merge(mode='concat')(convs)
graph = Model(graph_in, out)
model = Sequential([
Embedding(vocab_size+1, 32, input_length=max_len, dropout=0.2),
Dropout(0.2),
graph,
Dropout(0.5),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
cv_evaluate_nn_model(create_multi_cnn_model, XS, y, nb_epoch=2)
Epoch 1/2
8000/8000 [==============================] - 1s - loss: 0.5743 - acc: 0.6950
Epoch 2/2
8000/8000 [==============================] - 1s - loss: 0.5323 - acc: 0.7363
Epoch 1/2
8000/8000 [==============================] - 1s - loss: 0.5769 - acc: 0.6940
Epoch 2/2
8000/8000 [==============================] - 1s - loss: 0.5335 - acc: 0.7400
Epoch 1/2
8000/8000 [==============================] - 1s - loss: 0.5790 - acc: 0.6886
Epoch 2/2
8000/8000 [==============================] - 1s - loss: 0.5245 - acc: 0.7436
Epoch 1/2
8000/8000 [==============================] - 1s - loss: 0.5815 - acc: 0.6924
Epoch 2/2
8000/8000 [==============================] - 1s - loss: 0.5301 - acc: 0.7421
Epoch 1/2
8000/8000 [==============================] - 1s - loss: 0.5828 - acc: 0.6913
Epoch 2/2
8000/8000 [==============================] - 1s - loss: 0.5297 - acc: 0.7455
Model average accuracy: 0.77
I also try an RNN model (LSTM):
def create_lstm_model():
model = Sequential([
Embedding(vocab_size+1, 32, input_length=max_len, mask_zero=True, W_regularizer=l2(1e-6), dropout=0.2),
LSTM(100, consume_less='gpu'),
Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
cv_evaluate_nn_model(create_lstm_model, XS, y, nb_epoch=2)
Epoch 1/2
8000/8000 [==============================] - 12s - loss: 0.5752 - acc: 0.6937
Epoch 2/2
8000/8000 [==============================] - 12s - loss: 0.5136 - acc: 0.7543
Epoch 1/2
8000/8000 [==============================] - 12s - loss: 0.5607 - acc: 0.7127
Epoch 2/2
8000/8000 [==============================] - 12s - loss: 0.5129 - acc: 0.7499
Epoch 1/2
8000/8000 [==============================] - 12s - loss: 0.5555 - acc: 0.7177
Epoch 2/2
8000/8000 [==============================] - 12s - loss: 0.5151 - acc: 0.7522
Epoch 1/2
8000/8000 [==============================] - 12s - loss: 0.5584 - acc: 0.7146
Epoch 2/2
8000/8000 [==============================] - 12s - loss: 0.5121 - acc: 0.7501
Epoch 1/2
8000/8000 [==============================] - 12s - loss: 0.5589 - acc: 0.7106
Epoch 2/2
8000/8000 [==============================] - 12s - loss: 0.5069 - acc: 0.7592
Model average accuracy: 0.76
After evaluating several models, I think that with this small dataset, training more complex NN architectures does not have a big impact on performance. For the sake of training speed, I choose the small NN (the single-hidden-layer neural network) to make a submission:
sub_model = single_nn_model()
Train the model on the full dataset (I train for only 3 epochs; with more epochs the model overfits heavily):
sub_model.fit(XS, y, batch_size=32, nb_epoch=3)
Epoch 1/3
10000/10000 [==============================] - 0s - loss: 0.5536 - acc: 0.7101
Epoch 2/3
10000/10000 [==============================] - 0s - loss: 0.4818 - acc: 0.7732
Epoch 3/3
10000/10000 [==============================] - 0s - loss: 0.4557 - acc: 0.7856
<keras.callbacks.History at 0x7effb46f3b70>
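Instead of hand-picking 3 epochs, one could let early stopping choose when to stop based on a validation split; a sketch using Keras's EarlyStopping callback (an addition, not part of the original run):
# Train with a 10% validation split and stop once val_loss stops improving.
from keras.callbacks import EarlyStopping
es_model = single_nn_model()
es_model.fit(XS, y, batch_size=32, nb_epoch=20, validation_split=0.1,
             callbacks=[EarlyStopping(monitor='val_loss', patience=2)])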
y_preds = sub_model.predict(XS_test)
y_preds[-3]
array([ 0.72289443], dtype=float32)
submission_file_name = 'subm/problem_1_submission.csv'
np.savetxt(submission_file_name, y_preds, fmt='%.5f')
from IPython.display import FileLink
FileLink(submission_file_name)
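The file above contains the predicted probabilities. If the submission format instead expects hard 0/1 labels (an assumption about the format, not stated here), they could be written as follows; the file name is only illustrative:
# Threshold the sigmoid outputs at 0.5 and save integer labels
# (hypothetical alternative file, only if labels rather than probabilities are required).
y_labels = (y_preds > 0.5).astype(int)
np.savetxt('subm/problem_1_submission_labels.csv', y_labels, fmt='%d')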