# Sample baseline approach to CS5785 final

The idea of this simple baseline is to fit a linear regression that maps a vector representation of each description to a vector representation of its image. We can then use this regression model to rank the candidate images for each test description by the Euclidean distance between the vector predicted from the description and each image vector in the test set.

To form the description vectors we use word2vec, which provides pre-trained 300-dimensional vector representations for a large vocabulary of English words. We downloaded the word vectors (the download command is given in the first code section below) and use the gensim library to access them easily in our code. The feature vector for a given description is then formed by averaging the 300-dimensional word2vec vectors of all the words in the description that appear in the word2vec vocabulary.
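To make the averaging step concrete, here is a minimal sketch on a toy vocabulary. The three-dimensional vectors, the `toy_vectors` dictionary, and the `average_embedding` helper are hypothetical stand-ins for illustration; the zero-vector fallback for a description with no known words is also an assumption, not something the notebook does.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; the real word2vec vectors have 300 dimensions.
toy_vectors = {
    "a": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
    "runs": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(sentence, vectors, dim):
    """Average the vectors of in-vocabulary tokens; unknown tokens are skipped."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        # Assumption: fall back to a zero vector when no token is known.
        return np.zeros(dim)
    return np.mean(found, axis=0)

# "quickly" is not in the toy vocabulary, so only three vectors are averaged.
toy_doc_vec = average_embedding("a dog runs quickly", toy_vectors, dim=3)
print(toy_doc_vec)
```

Because unknown words are dropped before averaging, two descriptions that differ only in out-of-vocabulary words end up with identical feature vectors.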

To form the target vector for each image we took the 1,000 ResNet features from the final ResNet layer and performed a random projection of these features down to 100 dimensions.
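The sketch below illustrates why a random projection is a reasonable way to shrink the targets: a Gaussian projection roughly preserves pairwise distances. The data here is synthetic, and the fixed seed and the `1 / sqrt(out_dim)` scaling are assumptions added for illustration. The notebook's own projection is unscaled, which only multiplies every distance by the same constant and so does not change any ranking.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility

n_images, in_dim, out_dim = 50, 1000, 100
features = rng.standard_normal((n_images, in_dim))  # synthetic stand-in for ResNet features

# Gaussian projection matrix; dividing by sqrt(out_dim) keeps distances on the original scale.
projection = rng.standard_normal((in_dim, out_dim)) / np.sqrt(out_dim)
projected = features @ projection

def pairwise_dists(a):
    """Euclidean distance between every pair of rows of `a`."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

orig_d = pairwise_dists(features)
proj_d = pairwise_dists(projected)

# Compare distances before and after projection, ignoring the zero diagonal.
mask = ~np.eye(n_images, dtype=bool)
ratios = proj_d[mask] / orig_d[mask]
print("mean distance ratio:", ratios.mean())
```

A mean ratio close to 1 indicates that distances in the 100-dimensional space track those in the original 1,000-dimensional space.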

That means our linear regression model maps a 300-dimensional description vector to a 100-dimensional image vector. For the regression we used Ridge with cross-validation to select the best regularization coefficient.
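For intuition about what the ridge fit computes, the following sketch solves the multi-output ridge problem in closed form on synthetic data and checks that the gradient of the objective vanishes at the solution. It omits the intercept for simplicity (scikit-learn's `Ridge` fits one by default), and the shapes, seed, and `alpha` value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 300))  # synthetic stand-in for description vectors
Y = rng.standard_normal((200, 100))  # synthetic stand-in for projected image vectors
alpha = 1.0                          # regularization strength (assumed value)

# Closed-form ridge solution without an intercept:
#   W = (X^T X + alpha * I)^{-1} X^T Y
W = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

# At the minimum of ||XW - Y||^2 + alpha * ||W||^2 the gradient is zero.
grad = 2 * X.T @ (X @ W - Y) + 2 * alpha * W
print(W.shape, np.abs(grad).max())
```

Each of the 100 output dimensions gets its own column of `W`, but all columns share the same regularization strength, which is why a single `alpha` is selected by cross-validation.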

Our approach was validated on a held-out development set (a randomly selected 20% subset of the training set) so that MAP@20 could be estimated before submitting to Kaggle. However, the final model used to generate test predictions for submission was trained on the entire training set.
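A held-out split like this can be made reproducible by seeding the shuffle. The sketch below is one possible way to do it; the `split_indices` helper and the seed value are hypothetical, and the notebook itself shuffles without a fixed seed, so its reported numbers will vary slightly from run to run.

```python
import numpy as np

def split_indices(n_total, n_dev, seed=0):
    """Return (train_idx, dev_idx) from a seeded random permutation of range(n_total)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    return perm[n_dev:], perm[:n_dev]

# 10,000 training examples with 2,000 (20%) held out for development.
train_idx, dev_idx = split_indices(10000, 2000, seed=0)
print(len(train_idx), len(dev_idx))
```

Using the same index arrays for both the description vectors and the image vectors keeps each description aligned with its image.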

### First we load libraries, define our train/dev split, and load the word2vec dictionary using gensim

To download the pre-trained word2vec vectors, run the following on the command line: `wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"`

In [1]:
import os
import csv
import random
import gensim
import numpy as np

num_train = 8000
num_dev = 2000
num_test = 2000
split_idx = list(range(num_train + num_dev))
random.shuffle(split_idx)
word2vec = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
print("Loaded word vectors successfully!")

#from gensim.models import Word2Vec
#word2vec = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, norm_only=True)

Loaded word vectors successfully!



### Next we parse the descriptions to form the X matrices


In [2]:
def parse_descriptions(data_dir, num_doc):
    # read description files 0.txt ... (num_doc - 1).txt in order
    docs = []
    for i in range(num_doc):
        path = os.path.join(data_dir, "%d.txt" % i)
        with open(path) as f:
            docs.append(f.read())
    return docs

def doc_to_vec(sentence, word2vec):
    # get list of word vectors in sentence, skipping out-of-vocabulary words
    # (`w in word2vec` works across gensim versions; `.vocab` was removed in gensim 4)
    word_vecs = [word2vec.get_vector(w) for w in sentence.split() if w in word2vec]
    # return average
    return np.stack(word_vecs).mean(0)

# build x matrices
train_dev_desc = parse_descriptions("descriptions_train", num_doc=(num_train+num_dev))
test_desc = parse_descriptions("descriptions_test", num_doc=num_test)
x_train = np.array([doc_to_vec(train_dev_desc[i], word2vec) for i in split_idx[:num_train]])
x_dev = np.array([doc_to_vec(train_dev_desc[i], word2vec) for i in split_idx[num_train:]])
x_test = np.array([doc_to_vec(d, word2vec) for d in test_desc])

print("Built all x matrices!")
print("x_train shape:", x_train.shape)
print("x_dev shape:", x_dev.shape)
print("x_test shape:", x_test.shape)


Built all x matrices!
x_train shape: (8000, 300)
x_dev shape: (2000, 300)
x_test shape: (2000, 300)



### In addition we parse the ResNet features to form the Y matrices


In [17]:
def parse_features(features_path):
    # map image id -> feature vector, then return rows sorted by image id
    vec_map = {}
    with open(features_path) as f:
        for row in csv.reader(f):
            img_id = int(row[0].split("/")[1].split(".")[0])
            vec_map[img_id] = np.array([float(x) for x in row[1:]])
    return np.array([v for k, v in sorted(vec_map.items())])

# build y matrices
p = np.random.randn(1000, 100)
y_train_dev = parse_features("features_train/features_resnet1000_train.csv") @ p
y_train = y_train_dev[split_idx[:num_train]]
y_dev = y_train_dev[split_idx[num_train:]]
y_test = parse_features("features_test/features_resnet1000_test.csv") @ p

print("Built all y matrices!")
print("y_train shape:", y_train.shape)
print("y_dev shape:", y_dev.shape)
print("y_test shape:", y_test.shape)

Built all y matrices!
y_train shape: (8000, 100)
y_dev shape: (2000, 100)
y_test shape: (2000, 100)



### Now we train a linear model to predict the ResNet features from the mean word vectors


In [18]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# train ridge regression, selecting alpha by 10-fold cross-validated grid search
parameters = {"alpha": [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0]}
reg = GridSearchCV(Ridge(), parameters, cv=10)
reg.fit(x_train, y_train)
reg_best = reg.best_estimator_

print("Trained linear regression model!")
print("Summary of best model:")
print(reg_best)









Trained linear regression model!
Summary of best model:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
 normalize=False, random_state=None, solver='auto', tol=0.001)





### Next we test out our linear model on our development data, computing its MAP@20, and investigating the quality of the rankings


In [20]:
def dist_matrix(x1, x2):
    # pairwise Euclidean distances between rows of x1 and rows of x2
    return ((np.expand_dims(x1, 1) - np.expand_dims(x2, 0)) ** 2).sum(2) ** 0.5

# test performance on development set
y_dev_pred = reg.predict(x_dev)
dev_distances = dist_matrix(y_dev_pred, y_dev)
dev_scores = []
dev_pos_list = []

for i in range(num_dev):
    pred_dist_idx = list(np.argsort(dev_distances[i]))
    # position of the true image in the ranking (0 = ranked first)
    dev_pos = pred_dist_idx.index(i)
    dev_pos_list.append(dev_pos)
    if dev_pos < 20:
        dev_scores.append(1 / (dev_pos + 1))
    else:
        dev_scores.append(0.0)

print("Development MAP@20:", np.mean(dev_scores))
print("Mean index of true image", np.mean(dev_pos_list))
print("Median index of true image", np.median(dev_pos_list))

Development MAP@20: 0.12207414318352787
Mean index of true image 112.394
Median index of true image 32.0
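Since each description has exactly one correct image, the score computed above is the reciprocal rank of that image, set to zero when it falls outside the top 20. The sketch below shows one vectorized way to compute the same quantity and checks it on a tiny hand-made distance matrix. The `map_at_k` helper is hypothetical, and it breaks ties slightly differently from the `argsort`-based loop: candidates tied with the true image are not counted as ranked ahead of it.

```python
import numpy as np

def map_at_k(distances, k=20):
    """distances[i, j] is the distance from query i to candidate j; candidate i is the true match."""
    n = distances.shape[0]
    true_dist = distances[np.arange(n), np.arange(n)][:, None]
    # 0-based rank = number of candidates strictly closer than the true one.
    ranks = (distances < true_dist).sum(axis=1)
    scores = np.where(ranks < k, 1.0 / (ranks + 1), 0.0)
    return scores.mean()

toy = np.array([
    [0.1, 0.5, 0.9],  # true image ranked 1st, score 1
    [0.2, 0.3, 0.8],  # true image ranked 2nd, score 1/2
    [0.4, 0.6, 0.7],  # true image ranked 3rd, score 1/3
])
print(map_at_k(toy, k=20))  # mean of 1, 1/2, 1/3
print(map_at_k(toy, k=2))   # third query falls outside the top 2 and scores 0
```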



### Finally we use our model to compute top-20 predictions on the test data that can be submitted to Kaggle


In [21]:
# create test predictions
x_train_all = np.concatenate([x_train, x_dev])
y_train_all = np.concatenate([y_train, y_dev])
reg_best.fit(x_train_all, y_train_all)
y_test_pred = reg_best.predict(x_test)
test_distances = dist_matrix(y_test_pred, y_test)
pred_rows = []

for i in range(num_test):
    test_dist_idx = list(np.argsort(test_distances[i]))
    top_20 = test_dist_idx[:20]
    # use a distinct name so the comprehension does not shadow the loop index
    row = ["%d.jpg" % img_idx for img_idx in top_20]
    pred_rows.append(" ".join(row))

with open("test_submission.csv", "w") as f:
    # header must match the competition's sample submission exactly; spelling left unchanged
    f.write("Descritpion_ID,Top_20_Image_IDs\n")
    for i, row in enumerate(pred_rows):
        f.write("%d.txt,%s\n" % (i, row))

print("Output written!")

Output written!
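One practical note on the `dist_matrix` helper used above: it materializes an intermediate array of shape (n, m, d), which for 2,000 × 2,000 × 100 values in float64 is roughly 3.2 GB. If memory is tight, the same distances can be computed from the expansion ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b. The following is a sketch of that alternative, not the notebook's method; the name `dist_matrix_lowmem` is hypothetical, and the check runs on small synthetic inputs.

```python
import numpy as np

def dist_matrix_lowmem(x1, x2):
    """Pairwise Euclidean distances without building an (n, m, d) intermediate array."""
    sq1 = (x1 ** 2).sum(axis=1)[:, None]  # squared norms of x1 rows, shape (n, 1)
    sq2 = (x2 ** 2).sum(axis=1)[None, :]  # squared norms of x2 rows, shape (1, m)
    sq = sq1 + sq2 - 2.0 * (x1 @ x2.T)
    # Clip tiny negative values caused by floating-point rounding before the square root.
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 100))
b = rng.standard_normal((7, 100))

# Reference result using the broadcasting approach from the notebook.
reference = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
print(np.allclose(dist_matrix_lowmem(a, b), reference))
```

The expansion trades a little numerical precision for memory, which does not matter for ranking unless two candidates are almost exactly equidistant.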
