{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sample baseline approach to CS5785 final"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of this simple baseline approach is to perform linear regression in order to map a vector representation of the descriptions to a vector representation of the images. We can then use this regression model to rank candidate images for each test description based on the Euclidean distance between the vector predicted from the description, and each of the image vectors in the test set.\n",
"\n",
"In order to form the description vectors we use word2vec, which provides a pre-trained 300-dimensional vector representation of most words in the English language. We downloaded the word vectors from here use the gensim library in order to access the vectors easily in our code. The feature vector for a given description was then formed by averaging the 300-dimensional word2vec vectors of all the words in the description.\n",
"\n",
"In order to form the target image vectors for each image we took the 1,000 ResNet features from the final ResNet layer, and performed a random projection of thse features down to 100 dimensions.\n",
"\n",
"That means our linear regression model is mapping a 300-dimensional description vector to the 100-dimensional image vector. For linear regression we used Ridge with cross-validation to select the best regularization coefficient.\n",
"\n",
"Our approach was validated on a held-out development set (randomly selected 20% subset of the training set) so MAP@20 could be estimated before submitting to Kaggle. However the final model used to generate test predictions for submission was trained on the entire training set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### First we load libraries, define our train/test split, and load the word2vec dictionary using gensim\n",
"\n",
"How to download the pretrained word2vec representation: run in command line `wget -c \"https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz\"`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded word vectors successfully!\n"
]
}
],
"source": [
"import os\n",
"import csv\n",
"import random\n",
"import gensim\n",
"import numpy as np\n",
"\n",
"num_train = 8000\n",
"num_dev = 2000\n",
"num_test = 2000\n",
"split_idx = list(range(num_train + num_dev))\n",
"random.shuffle(split_idx)\n",
"word2vec = gensim.models.KeyedVectors.load_word2vec_format(\"GoogleNews-vectors-negative300.bin.gz\", binary=True)\n",
"print(\"Loaded word vectors successfully!\")\n",
"\n",
"#from gensim.models import Word2Vec\n",
"#word2vec = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, norm_only=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Next we parse the descriptions to form the X matrices\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Built all x matrices!\n",
"x_train shape: (8000, 300)\n",
"x_dev shape: (2000, 300)\n",
"x_test shape: (2000, 300)\n"
]
}
],
"source": [
"def parse_descriptions(data_dir, num_doc):\n",
" docs = []\n",
" for i in range(num_doc):\n",
" path = os.path.join(data_dir, \"%d.txt\" % i)\n",
" with open(path) as f:\n",
" docs.append(f.read())\n",
" return docs\n",
"\n",
"def doc_to_vec(sentence, word2vec):\n",
" # get list of word vectors in sentence\n",
" word_vecs = [word2vec.get_vector(w) for w in sentence.split() if w in word2vec.vocab]\n",
" # return average\n",
" return np.stack(word_vecs).mean(0)\n",
"\n",
"# build x matrices\n",
"train_dev_desc = parse_descriptions(\"descriptions_train\", num_doc=(num_train+num_dev))\n",
"test_desc = parse_descriptions(\"descriptions_test\", num_doc=num_test)\n",
"x_train = np.array([doc_to_vec(train_dev_desc[i], word2vec) for i in split_idx[:num_train]])\n",
"x_dev = np.array([doc_to_vec(train_dev_desc[i], word2vec) for i in split_idx[num_train:]])\n",
"x_test = np.array([doc_to_vec(d, word2vec) for d in test_desc])\n",
"\n",
"print(\"Built all x matrices!\")\n",
"print(\"x_train shape:\", x_train.shape)\n",
"print(\"x_dev shape:\", x_dev.shape)\n",
"print(\"x_test shape:\", x_test.shape)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### In addition we parse the ResNet features to form the Y matrices\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Built all y matrices!\n",
"y_train shape: (8000, 100)\n",
"y_dev shape: (2000, 100)\n",
"y_test shape: (2000, 100)\n"
]
}
],
"source": [
"def parse_features(features_path):\n",
" vec_map = {}\n",
" with open(features_path) as f:\n",
" for row in csv.reader(f):\n",
" img_id = int(row[0].split(\"/\")[1].split(\".\")[0])\n",
" vec_map[img_id] = np.array([float(x) for x in row[1:]])\n",
" return np.array([v for k, v in sorted(vec_map.items())])\n",
"\n",
"# build y matrices\n",
"p = np.random.randn(1000, 100)\n",
"y_train_dev = parse_features(\"features_train/features_resnet1000_train.csv\") @ p\n",
"y_train = y_train_dev[split_idx[:num_train]]\n",
"y_dev = y_train_dev[split_idx[num_train:]]\n",
"y_test = parse_features(\"features_test/features_resnet1000_test.csv\") @ p\n",
"\n",
"print(\"Built all y matrices!\")\n",
"print(\"y_train shape:\", y_train.shape)\n",
"print(\"y_dev shape:\", y_dev.shape)\n",
"print(\"y_test shape:\", y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Now we train a linear model to predict the ResNet features from the mean word vectors\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Trained linear regression model!\n",
"Summary of best model:\n",
"Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,\n",
" normalize=False, random_state=None, solver='auto', tol=0.001)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n",
"/Users/huyichun/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:420: FutureWarning: The default value of multioutput (not exposed in score method) will change from 'variance_weighted' to 'uniform_average' in 0.23 to keep consistent with 'metrics.r2_score'. To specify the default value manually and avoid the warning, please either call 'metrics.r2_score' directly or make a custom scorer with 'metrics.make_scorer' (the built-in scorer 'r2' uses multioutput='uniform_average').\n",
" \"multioutput='uniform_average').\", FutureWarning)\n"
]
}
],
"source": [
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# train OLS model with regression\n",
"parameters = {\"alpha\": [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0]}\n",
"reg = GridSearchCV(Ridge(), parameters, cv=10)\n",
"reg.fit(x_train, y_train)\n",
"reg_best = reg.best_estimator_\n",
"\n",
"print(\"Trained linear regression model!\")\n",
"print(\"Summary of best model:\")\n",
"print(reg_best)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Next we test out our linear model on our development data, computing its MAP@20, and investigating the quality of the rankings\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Development MAP@20: 0.12207414318352787\n",
"Mean index of true image 112.394\n",
"Median index of true image 32.0\n"
]
}
],
"source": [
"def dist_matrix(x1, x2):\n",
" return ((np.expand_dims(x1, 1) - np.expand_dims(x2, 0)) ** 2).sum(2) ** 0.5\n",
"\n",
"# test performance on development set\n",
"y_dev_pred = reg.predict(x_dev)\n",
"dev_distances = dist_matrix(y_dev_pred, y_dev)\n",
"dev_scores = []\n",
"dev_pos_list = []\n",
"\n",
"for i in range(num_dev):\n",
" pred_dist_idx = list(np.argsort(dev_distances[i]))\n",
" dev_pos = pred_dist_idx.index(i)\n",
" dev_pos_list.append(dev_pos)\n",
" if dev_pos < 20:\n",
" dev_scores.append(1 / (dev_pos + 1))\n",
" else:\n",
" dev_scores.append(0.0)\n",
"\n",
"print(\"Development MAP@20:\", np.mean(dev_scores))\n",
"print(\"Mean index of true image\", np.mean(dev_pos_list))\n",
"print(\"Median index of true image\", np.median(dev_pos_list))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Finally we use our model to compute top-20 predictions on the test data that can be submitted to Kaggle\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Output written!\n"
]
}
],
"source": [
"# create test predictions\n",
"x_train_all = np.concatenate([x_train, x_dev])\n",
"y_train_all = np.concatenate([y_train, y_dev])\n",
"reg_best.fit(x_train_all, y_train_all)\n",
"y_test_pred = reg_best.predict(x_test)\n",
"test_distances = dist_matrix(y_test_pred, y_test)\n",
"pred_rows = []\n",
"\n",
"for i in range(num_test):\n",
" test_dist_idx = list(np.argsort(test_distances[i]))\n",
" top_20 = test_dist_idx[:20]\n",
" row = [\"%d.jpg\" % i for i in test_dist_idx[:20]]\n",
" pred_rows.append(\" \".join(row))\n",
"\n",
"with open(\"test_submission.csv\", \"w\") as f:\n",
" f.write(\"Descritpion_ID,Top_20_Image_IDs\\n\")\n",
" for i, row in enumerate(pred_rows):\n",
" f.write(\"%d.txt,%s\\n\" % (i, row))\n",
"\n",
"print(\"Output written!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}