Less overfitting - testing - regression

nikolaos giannakoulis
Jan 17, 2020
13 min read

Updated: Jun 15, 2022

2 problems of overfitting that I try to resolve in this kernel:

use information group by installation_id could lead to overfitting
use all data of train_label for validation dataset. Indeed, the validation score is low if I only use the last assessment of every installation_id as validation dataset compared to a validation score where I use random validation dataset.

solutions:

I just don't use information group by installation_id but I will use data between an assessment and his previous assessment (e.g: the number of 4070 between two assess)

I use specific validation datasets:Validation dataset1: last assessment of every installation_id of train_labelTrain dataset1 = final_train: other assessments of train_label]
I do the same thing on final_train:Validation dataset2: second last assessment of every installation_id on train_labelsTrain dataset2 = final_train2: other assessments of final_train]
I do the same thing on final_train2:Validation dataset3: third last assessment of every installation_id on train_labelsTrain dataset3 = final_train3: other assessments of final_train2

The goal of the point 2 is to only have the last assessments in the validation dataset.

You can use these 3 validation datasets to evaluate your model. The 3 validation datasets are really different. I fit a lot of models with these dataset and I have always a big difference between validation scores(particularly with the validation dataset with last assessment). It would mean that the models are not really good.

Have a good and same score on the 3 validation scores + good score score on test data would mean your model is less impact by overfitting.

I will use the lgb regression model with parameters of Andrew Lukyanenko kernel and also his optimization cutoffs.

Importing libraries and data

OutputHideIn [1]:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

HideIn [2]:

import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mode
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from tqdm import tqdm

In [3]:

%%time
train=pd.read_csv('../input/data-science-bowl-2019/train.csv')
train_labels=pd.read_csv('../input/data-science-bowl-2019/train_labels.csv')
test = pd.read_csv('../input/data-science-bowl-2019/test.csv')
submission=pd.read_csv('../input/data-science-bowl-2019/sample_submission.csv')

encode title
# make a list with all the unique 'titles' from the train and test set
list_of_user_activities = list(set(train['title'].value_counts().index).union(set(test['title'].value_counts().index)))
# make a list with all the unique 'event_code' from the train and test set
list_of_event_code = list(set(train['event_code'].value_counts().index).union(set(test['event_code'].value_counts().index)))
# create a dictionary numerating the titles
activities_map = dict(zip(list_of_user_activities, np.arange(len(list_of_user_activities))))
activities_labels = dict(zip(np.arange(len(list_of_user_activities)), list_of_user_activities))

assess_titles = list(set(train[train['type'] == 'Assessment']['title'].value_counts().index).union(set(test[test['type'] == 'Assessment']['title'].value_counts().index)))

# replace the text titles withing the number titles from the dict
train['title'] = train['title'].map(activities_map)
test['title'] = test['title'].map(activities_map)
train_labels['title'] = train_labels['title'].map(activities_map)

HideIn [8]:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data = pd.concat([train,test],sort=False)

data['installation_id_encoder'] = 0
encode = data[['installation_id']].apply(encoder.fit_transform)
data['installation_id_encoder'] = encode
data.head(10)

train = data[:len(train)]
test = data[len(train):]

In [9]:

win_code = dict(zip(activities_map.values(), (4100*np.ones(len(activities_map))).astype('int')))
# then, it set one element, the 'Bird Measurer (Assessment)' as 4110, 10 more than the rest
win_code[activities_map['Bird Measurer (Assessment)']] = 4110

In [10]:

# this is the function that convert the raw data into processed features
def get_data(user_sample, test_set=False):
    '''
    The user_sample is a DataFrame from train or test where the only one 
    installation_id is filtered
    And the test_set parameter is related with the labels processing, that is only requered
    if test_set=False
    '''
    # Constants and parameters declaration
    last_activity = 0
    user_activities_count = {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
    
    # news features: time spent in each activity
    time_spent_each_act = {actv: 0 for actv in list_of_user_activities}
    event_code_count = {eve: 0 for eve in list_of_event_code}
    last_session_time_sec = 0
    
    accuracy_groups = {0:0, 1:0, 2:0, 3:0}
    all_assessments = []
    accumulated_accuracy_group = 0
    accumulated_accuracy=0
    accumulated_correct_attempts = 0 
    accumulated_uncorrect_attempts = 0 
    accumulated_actions = 0
    
    last_accuracy_title = {'acc_' + title: -1 for title in assess_titles}
    last_game_time_title = {'lgt_' + title: 0 for title in assess_titles}
    ac_game_time_title = {'agt_' + title: 0 for title in assess_titles}
    ac_true_attempts_title = {'ata_' + title: 0 for title in assess_titles}
    ac_false_attempts_title = {'afa_' + title: 0 for title in assess_titles}
    
    counter = 0
    time_first_activity = float(user_sample['timestamp'].values[0])
    durations = []
    time_play = 0
    title_just_before = 0
    title_assessment_before = 0
    assessment_before_accuracy = 0
    dif2030 = 0
    dif4070 = 0
    dif3010 = 0
    dif3020 = 0
    dif4030 = 0
    dif3110 = 0
    dif4025 = 0
    dif4035 = 0
    dif3120 = 0
    dif2010 = 0
    somme_clip_game_activity = 0
    
    # itarates through each session of one instalation_id
    for i, session in user_sample.groupby('game_session', sort=False):
        # i = game_session_id
        # session is a DataFrame that contain only one game_session
        
        # get some sessions information
        session_type = session['type'].iloc[0]
        session_title = session['title'].iloc[0]
        
        # get current session time in seconds
        if session_type != 'Assessment':
            time_spent = int(session['game_time'].iloc[-1] / 1000)
            time_spent_each_act[activities_labels[session_title]] += time_spent
            time_play += time_spent
            
            title_just_before = session_title
            session_title_text = activities_labels[session_title]
            
        # for each assessment, and only this kind off session, the features below are processed
        # and a register are generated
        if (session_type == 'Assessment') & (test_set or len(session)>1):
            # search for event_code 4100, that represents the assessments trial
            all_attempts = session.query(f'event_code == {win_code[session_title]}')
            # then, check the numbers of wins and the number of losses
            true_attempts = all_attempts['event_data'].str.contains('true').sum()
            false_attempts = all_attempts['event_data'].str.contains('false').sum()
            # copy a dict to use as feature template, it's initialized with some itens: 
            # {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
            features = user_activities_count.copy()
            
            features.update(last_accuracy_title.copy())
            features.update(last_game_time_title.copy())
            features.update(ac_game_time_title.copy())
            features.update(ac_true_attempts_title.copy())
            features.update(ac_false_attempts_title.copy())
            
            session_title = session['title'].iloc[0]
            time_spent = int(session['game_time'].iloc[-1] / 1000)
            time_spent_each_act[activities_labels[session_title]] += time_spent
            
            features.update(time_spent_each_act.copy())
            features.update(event_code_count.copy())
            # add title as feature, remembering that title represents the name of the game
            features['session_title'] = session['title'].iloc[0] 
            # the 4 lines below add the feature of the history of the trials of this player
            # this is based on the all time attempts so far, at the moment of this assessment
            features['accumulated_correct_attempts'] = accumulated_correct_attempts
            features['accumulated_uncorrect_attempts'] = accumulated_uncorrect_attempts
            accumulated_correct_attempts += true_attempts 
            accumulated_uncorrect_attempts += false_attempts
            
            session_title_text = activities_labels[session_title]
            ac_true_attempts_title['ata_' + session_title_text] += true_attempts
            ac_false_attempts_title['afa_' + session_title_text] += false_attempts
            last_game_time_title['lgt_' + session_title_text] = session['game_time'].iloc[-1]
            ac_game_time_title['agt_' + session_title_text] += session['game_time'].iloc[-1]
            
            # the time spent in the app so far
            if durations == []:
                features['duration_mean'] = 0
            else:
                features['duration_mean'] = np.mean(durations)
            durations.append((session.iloc[-1, 2] - session.iloc[0, 2] ).seconds)
            
            accuracy = true_attempts/(true_attempts+false_attempts) if (true_attempts+false_attempts) != 0 else 0
            accumulated_accuracy += accuracy
            last_accuracy_title['acc_' + session_title_text] = accuracy
            # a feature of the current accuracy categorized
            # it is a counter of how many times this player was in each accuracy group
            
            # assessment_before_accuracy
            features['assessment_before_accuracy'] = assessment_before_accuracy
                     
            if accuracy == 0:
                features['accuracy_group'] = 0
                assessment_before_accuracy = 0
            elif accuracy == 1:
                features['accuracy_group'] = 3
                assessment_before_accuracy = 3
            elif accuracy == 0.5:
                features['accuracy_group'] = 2
                assessment_before_accuracy = 2
            else:
                features['accuracy_group'] = 1
                assessment_before_accuracy = 1
                
            features.update(accuracy_groups)
            accuracy_groups[features['accuracy_group']] += 1
            # mean of the all accuracy groups of this player
            features['accumulated_accuracy_group'] = accumulated_accuracy_group/counter if counter > 0 else 0
            accumulated_accuracy_group += features['accuracy_group']
            # how many actions the player has done so far, it is initialized as 0 and updated some lines below
            features['accumulated_actions'] = accumulated_actions
            
            # encode installation_id
            features['installation_id_encoder'] = session['installation_id_encoder'].iloc[0]
            
            # time play on the app
            features['time_play'] = time_play
            time_play += int(session['game_time'].iloc[-1] / 1000)
            
            # title_assessment_before
            features['title_assessment_before'] = title_assessment_before
            
            # concat (session_title + title_assessment_before) / title_assessment_before is the title of the previous assessment :
            features['title*title_assessment_before'] = int( str(session['title'].iloc[0]) + str(title_assessment_before) )
            title_assessment_before = session['title'].iloc[0]
            
            # concat (session_title + title_just_before) / title_just_before is title game play just before the assessment :
            features['title*title_just_before'] = int( str(session['title'].iloc[0]) + str(title_just_before) )
            title_just_before = session['title'].iloc[0]
            
            
            # 4070 dif
            if features['Assessment'] == 0:
                features['4070_dif'] = features[4070]
                dif4070 = features[4070]
            else:
                features['4070_dif'] = features[4070] - dif4070
                dif4070 = features[4070]
                
            # 2030 dif
            if features['Assessment'] == 0:
                features['2030_dif'] = features[2030]
                dif2030 = features[2030]
            else:
                features['2030_dif'] = features[2030] - dif2030
                dif2030 = features[2030]
                
            # 3010 dif
            if features['Assessment'] == 0:
                features['3010_dif'] = features[3010]
                dif3010 = features[3010]
            else:
                features['3010_dif'] = features[3010] - dif3010
                dif3010 = features[3010]
                
            # 3020 dif
            if features['Assessment'] == 0:
                features['3020_dif'] = features[3020]
                dif3020 = features[3020]
            else:
                features['3020_dif'] = features[3020] - dif3020
                dif3020 = features[3020]
                
            # 4030 dif
            if features['Assessment'] == 0:
                features['4030_dif'] = features[4030]
                dif4030 = features[4030]
            else:
                features['4030_dif'] = features[4030] - dif4030
                dif4030 = features[4030]
                
            # 3110 dif
            if features['Assessment'] == 0:
                features['3110_dif'] = features[3110]
                dif3110 = features[3110]
            else:
                features['3110_dif'] = features[3110] - dif3110
                dif3110 = features[3110]
                
            # 4035 dif
            if features['Assessment'] == 0:
                features['4035_dif'] = features[4035]
                dif4035 = features[4035]
            else:
                features['4035_dif'] = features[4035] - dif4035
                dif4035 = features[4035]
                
            # 4025 dif
            if features['Assessment'] == 0:
                features['4025_dif'] = features[4025]
                dif4025 = features[4025]
            else:
                features['4025_dif'] = features[4025] - dif4025
                dif4025 = features[4025]
                
            # 3120 dif
            if features['Assessment'] == 0:
                features['3120_dif'] = features[3120]
                dif3120 = features[3120]
            else:
                features['3120_dif'] = features[3120] - dif3120
                dif3120 = features[3120]
                
            # 2010 dif
            if features['Assessment'] == 0:
                features['2010_dif'] = features[2010]
                dif2010 = features[2010]
            else:
                features['2010_dif'] = features[2010] - dif2010
                dif2010 = features[2010]
                
                
            # time play assessment
            features['time_play_assessment'] = sum(durations)
            
            # clip+game+activity before assessment
            somme = features['Clip']+features['Game']+features['Activity']
            features['somme_clip_game_activity'] = somme - somme_clip_game_activity
            somme_clip_game_activity = somme
            
            # there are some conditions to allow this features to be inserted in the datasets
            # if it's a test set, all sessions belong to the final dataset
            # it it's a train, needs to be passed throught this clausule: session.query(f'event_code == {win_code[session_title]}')
            # that means, must exist an event_code 4100 or 4110
            if test_set:
                all_assessments.append(features)
            elif true_attempts+false_attempts > 0:
                all_assessments.append(features)
                
            counter += 1
        
        # this piece counts how many actions was made in each event_code so far
        n_of_event_codes = Counter(session['event_code'])
        
        for key in n_of_event_codes.keys():
            event_code_count[key] += n_of_event_codes[key]

        # counts how many actions the player has done so far, used in the feature of the same name
        accumulated_actions += len(session)
        if last_activity != session_type:
            user_activities_count[session_type] += 1
            last_activitiy = session_type
    # if it't the test_set, only the last assessment must be predicted, the previous are scraped
    if test_set:
        return all_assessments[-1]
    # in the train_set, all assessments goes to the dataset
    return all_assessments[:-1]

Engineering train data

In [11]:

from tqdm import tqdm_notebook as tqdm
from collections import Counter
# here the get_data function is applyed to each installation_id and added to the compile_data list
compiled_data = []
compiled_data_last = []
# tqdm is the library that draws the status bar below
for i, (ins_id, user_sample) in tqdm(enumerate(train.groupby('installation_id', sort=False)), total=3614):
    # user_sample is a DataFrame that contains only one installation_id
    L = get_data(user_sample)
    compiled_data += L
    a = get_data(user_sample,test_set=True )
    compiled_data_last.append(a)

# the compiled_data is converted to DataFrame and deleted to save memmory
final_train = pd.DataFrame(compiled_data)

final_train_last brings together every last assessment of installation_id from train dataset :

In [12]:

final_train_last = pd.DataFrame(compiled_data_last)

In [13]:

name_colonne = list(final_train.iloc[:,4:29])

final_train doesn't contain data from final_train_last

And we can now use final_train as a train dataset and final_train_last as a validation dataset

In [14]:

pd.set_option('display.max_columns', None)
final_train

I remove outliers id :

In [15]:

final_train = final_train[~final_train.installation_id_encoder.isin([2572,4351,2901,3064,3759,2012,3804,1406,2055,2096,3766,3347,3994,4209,4444])]
final_train_last = final_train_last[~final_train_last.installation_id_encoder.isin([2572,4351,2901,3064,3759,2012,3804,1406,2055,2096,3766,3347,3994,4209,4444])]

final_train_last2 brings together every second last assessment of installation_id from train :

equivalent to bring together every last assessment of installation_id from final_train

In [16]:

# add a column index_t helping the separation between last assessment of final_train and others assessment
final_train = final_train.reset_index()
final_train.rename(columns={'index':'index_t'},inplace=True)

In [17]:

# every last assessment of installation_id from final_train
final_train_last2 = final_train.groupby('installation_id_encoder', sort=False,as_index=False).last()

In [18]:

final_train_last2

final_train2 = final_train - final_train_last2

In [19]:

not_req=(set(final_train.index_t.unique()) - set(final_train_last2.index_t.unique()))
final_train2=final_train['index_t'].isin(not_req)
final_train2 = final_train.where(final_train2,try_cast=True)
final_train2.dropna(inplace=True)
final_train2['index_t']=final_train2.index.astype(int)

colonne = list(final_train2)
colonne_float = ['accumulated_accuracy_group','duration_mean','accumulated_accuracy']
for name in colonne:
    if name not in colonne_float:
        final_train2[name] = final_train2[name].astype(int)

In [20]:

final_train2

We can now use final_train2 as a train dataset and final_train_last2 as a validation dataset

We do exactly the same to get every third last assessment of installation_id from train:

equivalent to bring together every last assessment of installation_id from final_train2

In [21]:

final_train_last3 = final_train2.groupby('installation_id_encoder', sort=False,as_index=False).last()

In [22]:

not_req=(set(final_train2.index_t.unique()) - set(final_train_last3.index_t.unique()))
final_train3=final_train2['index_t'].isin(not_req)
final_train3 = final_train2.where(final_train3,try_cast=True)
final_train3.dropna(inplace=True)
final_train3['index_t']=final_train3.index.astype(int)

colonne = list(final_train3)
colonne_float = ['accumulated_accuracy_group','duration_mean','accumulated_accuracy']
for name in colonne:
    if name not in colonne_float:
        final_train3[name] = final_train3[name].astype(int)

We can now use final_train3 as a train dataset and final_train_last3 as a validation dataset

In [23]:

final_train = final_train.drop(['index_t'],1)
final_train2 = final_train2.drop(['index_t'],1)
final_train3 = final_train3.drop(['index_t'],1)
final_train_last2 = final_train_last2.drop(['index_t'],1)
final_train_last3 = final_train_last3.drop(['index_t'],1)

Engineering test data

In [24]:

# process test set, the same that was done with the train set
new_test = []
test_to_train = []
for ins_id, user_sample in tqdm(test.groupby('installation_id', sort=False), total=1000):
    L = get_data(user_sample)
    a = get_data(user_sample,test_set=True )
    new_test.append(a)
    test_to_train += L
    
final_test = pd.DataFrame(new_test)
test_to_train = pd.DataFrame(test_to_train)
final_test.shape

test_to_train contains data that we don't use for final_test, I put them in train dataset:

In [25]:

final_train = pd.concat([final_train,test_to_train])
final_train = final_train.reset_index(drop=True)
final_train2 = pd.concat([final_train2,test_to_train])
final_train2 = final_train2.reset_index(drop=True)
final_train3 = pd.concat([final_train3,test_to_train])
final_train3 = final_train3.reset_index(drop=True)

Feature selection

In [26]:

keep = ['accuracy_group','session_title','accumulated_accuracy_group','4070_dif','2030_dif',
        'duration_mean','4030_dif','accumulated_uncorrect_attempts','Chow Time','Clip',
        'somme_clip_game_activity','assessment_before_accuracy','accumulated_actions',0,3] + name_colonne
final_train = final_train[keep]
final_train_last = final_train_last[keep]
final_train2 = final_train2[keep]
final_train_last2 = final_train_last2[keep]
final_train3 = final_train3[keep]
final_train_last3 = final_train_last3[keep]
final_test = final_test[keep]

In [27]:

print(final_train_last.shape)
print(final_train_last2.shape)
print(final_train_last3.shape)

(3599, 40)
(2572, 40)
(1939, 40)

In [28]:

final_train

LGBregressor / params from Andrew Lukyanenko kernel

In [29]:

params = {'n_estimators':2000,
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'subsample': 0.75,
            'subsample_freq': 1,
            'learning_rate': 0.04,
            'feature_fraction': 0.9,
         'max_depth': 15,
            'lambda_l1': 1,  
            'lambda_l2': 1,
            'verbose': 100,
            'early_stopping_rounds': 100, 'eval_metric': 'cappa'
            }

In [30]:

def lgb_regression(x_train, y_train, x_val, y_val, **kwargs):
    
    models = []
    scores=[]
    categoricals = ['session_title']
        
    train_set = lgb.Dataset(x_train, y_train, categorical_feature=categoricals)
    val_set = lgb.Dataset(x_val, y_val, categorical_feature=categoricals)
        
    model = lgb.train(train_set=train_set, valid_sets=[train_set, val_set],**kwargs)
    models.append(model)
        
    pred_val=model.predict(x_val)
    oof = pred_val.reshape(len(x_val))
        
    return models,oof

final_train3 as a train data and final_train_last3 as a validation data :

In [31]:

X_train3 = final_train3.drop('accuracy_group',axis=1)
y_train3 = final_train3['accuracy_group'].astype(float)
X_end3 = final_train_last3.drop('accuracy_group',axis=1)
y_end3 = final_train_last3['accuracy_group'].astype(float)

models3,oof3 = lgb_regression(X_train3, y_train3, X_end3, y_end3, params=params, num_boost_round=100000,
                  early_stopping_rounds=500, verbose_eval=40)

/opt/conda/lib/python3.6/site-packages/lightgbm/engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/opt/conda/lib/python3.6/site-packages/lightgbm/engine.py:153: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')

Training until validation scores don't improve for 100 rounds
[40]	training's rmse: 0.972607	valid_1's rmse: 0.998428
[80]	training's rmse: 0.921042	valid_1's rmse: 0.97156
[120]	training's rmse: 0.891496	valid_1's rmse: 0.968952
[160]	training's rmse: 0.86797	valid_1's rmse: 0.968023
[200]	training's rmse: 0.846927	valid_1's rmse: 0.968253
[240]	training's rmse: 0.827633	valid_1's rmse: 0.967966
[280]	training's rmse: 0.809415	valid_1's rmse: 0.968051
[320]	training's rmse: 0.792976	valid_1's rmse: 0.969288
[360]	training's rmse: 0.776607	valid_1's rmse: 0.970471
Early stopping, best iteration is:
[277]	training's rmse: 0.810806	valid_1's rmse: 0.967615

final_train2 as a train data and final_train_last2 as a validation data :

In [32]:

X_train2 = final_train2.drop('accuracy_group',axis=1)
y_train2 = final_train2['accuracy_group'].astype(float)
X_end2 = final_train_last2.drop('accuracy_group',axis=1)
y_end2 = final_train_last2['accuracy_group'].astype(float)

models2,oof2 = lgb_regression(X_train2, y_train2, X_end2, y_end2, params=params, num_boost_round=40000,
                  early_stopping_rounds=500, verbose_eval=40)

Training until validation scores don't improve for 100 rounds
[40]	training's rmse: 0.975339	valid_1's rmse: 1.02659
[80]	training's rmse: 0.927441	valid_1's rmse: 1.00575
[120]	training's rmse: 0.90148	valid_1's rmse: 1.00066
[160]	training's rmse: 0.880238	valid_1's rmse: 0.99983
[200]	training's rmse: 0.861493	valid_1's rmse: 0.999327
[240]	training's rmse: 0.844213	valid_1's rmse: 1.00034
[280]	training's rmse: 0.827746	valid_1's rmse: 0.999932
Early stopping, best iteration is:
[199]	training's rmse: 0.861839	valid_1's rmse: 0.999225

final_train as a train data and final_train_last as a validation data :

In [33]:

X_train1 = final_train.drop('accuracy_group',axis=1)
y_train1 = final_train['accuracy_group'].astype(float)
X_end1 = final_train_last.drop('accuracy_group',axis=1)
y_end1 = final_train_last['accuracy_group'].astype(float)

models1,oof1 = lgb_regression(X_train1, y_train1, X_end1, y_end1, params=params, num_boost_round=40000,
                  early_stopping_rounds=500, verbose_eval=40)

models_LGBM = models1 + models2 + models3

Training until validation scores don't improve for 100 rounds
[40]	training's rmse: 0.982355	valid_1's rmse: 1.18679
[80]	training's rmse: 0.93716	valid_1's rmse: 1.17221
[120]	training's rmse: 0.913494	valid_1's rmse: 1.16852
[160]	training's rmse: 0.894284	valid_1's rmse: 1.16732
[200]	training's rmse: 0.877768	valid_1's rmse: 1.1658
[240]	training's rmse: 0.862443	valid_1's rmse: 1.166
[280]	training's rmse: 0.848007	valid_1's rmse: 1.16518
[320]	training's rmse: 0.834298	valid_1's rmse: 1.16599
[360]	training's rmse: 0.821164	valid_1's rmse: 1.16735
Early stopping, best iteration is:
[262]	training's rmse: 0.85414	valid_1's rmse: 1.16502

We can note that the scores on validation data are really different. The validation rmse score on final_train_last is quite high.

Maybe the model is not really robust considering different model.

In [34]:

lgb.plot_importance(models_LGBM[0],max_num_features=20,importance_type='gain')

Out[34]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9863cfd780>

from functools import partial
import scipy as sp
class OptimizedRounder(object):
    """
    An optimizer for rounding thresholds
    to maximize Quadratic Weighted Kappa (QWK) score
    # https://www.kaggle.com/naveenasaithambi/optimizedrounder-improved
    """
    def __init__(self):
        self.coef_ = 0

    def _kappa_loss(self, coef, X, y,X2, y2,X3, y3):
        """
        Get loss according to
        using current coefficients
        
        :param coef: A list of coefficients that will be used for rounding
        :param X: The raw predictions of final_train_last
        :param y: The ground truth labels
        :param X2: The raw predictions of final_train_last2
        :param y2: The ground truth labels
        :param X3: The raw predictions of final_train_last3
        :param y3: The ground truth labels
        """
        X_p = pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3])
        X_p2 = pd.cut(X2, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3])
        X_p3 = pd.cut(X3, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3])
        
        print("validation score of the last assessment: score1 = {}".format(qwk(y, X_p)))
        
        return ( qwk(y2, X_p2) - qwk(y, X_p) ) + ( qwk(y3, X_p3) - qwk(y, X_p) )  + (0.51 - qwk(y, X_p))*2

    def fit(self, X, y,X2, y2,X3, y3,coef_ini):
        """
        Optimize rounding thresholds
        
        :param X: The raw predictions
        :param y: The ground truth labels
        """
        loss_partial = partial(self._kappa_loss, X=X, y=y,X2=X2, y2=y2,X3=X3, y3=y3)
        initial_coef = coef_ini
        self.coef_ = sp.optimize.minimize(loss_partial, initial_coef, method='nelder-mead')

    def predict(self, X, coef):
        """
        Make predictions with specified thresholds
        
        :param X: The raw predictions
        :param coef: A list of coefficients that will be used for rounding
        """
        return pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3])


    def coefficients(self):
        """
        Return the optimized coefficients
        """
        return self.coef_['x']

score1 = validation score on final_last_assessment

score2 = validation score on final_last_assessment2

score3 = validation score of on final_last_assessment3

In my previous versions, I noted that the quadratic weighted kappa score on final_train_last is low compared to others and the differences (score2 - score1) and (score3 - score1) seem to be negatively correlated to test score.

So I modificated the function OptimizedRounder. I try now to reduce the differences between scores. My goal is to have the same score on the 3 validation scores and also a good score (it's why I added (0.49 - qwk(y, X_p))*2 ). 0.49 seem to be the limit of my validation score on final_train_last with this model.

In [36]:

%%time

optR = OptimizedRounder()
optR.fit(oof1,y_end1,oof2,y_end2,oof3,y_end3,[1.12, 1.83, 2.27])
coefficients = optR.coefficients()
print(coefficients)

oof3[oof3 <= coefficients[0]] = 0
oof3[np.where(np.logical_and(oof3 > coefficients[0], oof3 <= coefficients[1]))] = 1
oof3[np.where(np.logical_and(oof3 > coefficients[1], oof3 <=coefficients[2]))] = 2
oof3[oof3 > coefficients[2]] = 3
pred3 = np.round(oof3).astype('int')

score3 = qwk(y_end3, pred3)
print("validation score of the third last assessment: score3 = {}".format(score3))

validation score of the third last assessment: score3 = 0.6016405057821032

In [38]:

oof2[oof2 <= coefficients[0]] = 0
oof2[np.where(np.logical_and(oof2 > coefficients[0], oof2 <= coefficients[1]))] = 1
oof2[np.where(np.logical_and(oof2 > coefficients[1], oof2 <=coefficients[2]))] = 2
oof2[oof2 > coefficients[2]] = 3
pred2 = np.round(oof2).astype('int')

score2 = qwk(y_end2, pred2)
print("validation score of the second last assessment: score2 = {}".format(score2))

validation score of the second last assessment: score2 = 0.5605230891587825

In [39]:

oof1[oof1 <= coefficients[0]] = 0
oof1[np.where(np.logical_and(oof1 > coefficients[0], oof1 <= coefficients[1]))] = 1
oof1[np.where(np.logical_and(oof1 > coefficients[1], oof1 <=coefficients[2]))] = 2
oof1[oof1 > coefficients[2]] = 3
pred1 = np.round(oof1).astype('int')

score1 = qwk(y_end1, pred1)
print("validation score of the last assessment: score1 = {}".format(score1))

validation score of the last assessment: score1 = 0.4959575412970093

Training on all train dataset with seed averaging

In [40]:

params = {'n_estimators':1000,
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'subsample': 0.75,
            'subsample_freq': 1,
            'learning_rate': 0.04,
            'feature_fraction': 0.9,
          'seed':42,
         'max_depth': 15,
            'lambda_l1': 1,  
            'lambda_l2': 1,
            'verbose': 100,
            'early_stopping_rounds': 100, 'eval_metric': 'cappa'
            }
X = pd.concat([X_train1,X_end1])
Y = pd.concat([y_train1,y_end1])
models_all = []
oof_all = np.zeros(len(X))
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train, val in kf.split(X, Y):
    model_all,oof = lgb_regression(X.iloc[train], Y.iloc[train], X.iloc[val], Y.iloc[val], params=params, num_boost_round=40000,
                     early_stopping_rounds=500, verbose_eval=40)
    models_all.append(model_all[0])
    oof_all[val] = oof

oof_all[oof_all <= coefficients[0]] = 0
oof_all[np.where(np.logical_and(oof_all > coefficients[0], oof_all <= coefficients[1]))] = 1
oof_all[np.where(np.logical_and(oof_all > coefficients[1], oof_all <=coefficients[2]))] = 2
oof_all[oof_all > coefficients[2]] = 3
oof_all = np.round(oof_all).astype('int')

score = qwk(Y, oof_all)
print("validation score with StratifiedKFold_n=5: score = {}".format(score))

validation score with StratifiedKFold_n=5: score = 0.563029969942255

In [43]:

params = {'n_estimators':1000,
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'subsample': 0.75,
            'subsample_freq': 1,
            'learning_rate': 0.04,
            'feature_fraction': 0.9,
          'seed':15,
         'max_depth': 15,
            'lambda_l1': 1,  
            'lambda_l2': 1,
            'verbose': 100,
            'early_stopping_rounds': 100, 'eval_metric': 'cappa'
            }

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
for train, val in kf.split(X, Y):
    model_all,oof = lgb_regression(X.iloc[train], Y.iloc[train], X.iloc[val], Y.iloc[val], params=params, num_boost_round=40000,
                     early_stopping_rounds=500, verbose_eval=200)
    models_all.append(model_all[0])

params = {'n_estimators':1000,
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'subsample': 0.75,
            'subsample_freq': 1,
            'learning_rate': 0.04,
            'feature_fraction': 0.9,
          'seed':12,
         'max_depth': 15,
            'lambda_l1': 1,  
            'lambda_l2': 1,
            'verbose': 100,
            'early_stopping_rounds': 100, 'eval_metric': 'cappa'
            }

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
for train, val in kf.split(X, Y):
    model_all,oof = lgb_regression(X.iloc[train], Y.iloc[train], X.iloc[val], Y.iloc[val], params=params, num_boost_round=40000,
                     early_stopping_rounds=500, verbose_eval=200)
    models_all.append(model_all[0])
    
params = {'n_estimators':1000,
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'subsample': 0.75,
            'subsample_freq': 1,
            'learning_rate': 0.04,
            'feature_fraction': 0.9,
          'seed':11,
         'max_depth': 15,
            'lambda_l1': 1,  
            'lambda_l2': 1,
            'verbose': 100,
            'early_stopping_rounds': 100, 'eval_metric': 'cappa'
            }

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
for train, val in kf.split(X, Y):
    model_all,oof = lgb_regression(X.iloc[train], Y.iloc[train], X.iloc[val], Y.iloc[val], params=params, num_boost_round=40000,
                     early_stopping_rounds=500, verbose_eval=200)
    models_all.append(model_all[0])

X_test = final_test.drop(columns=['accuracy_group'])

In [45]:

predictions = []
for model in models_all:
    predictions.append(model.predict(X_test))

L=[]
for i in range (len(predictions[0])):
    mean = []
    for j in range (len(predictions)):
        mean.append(predictions[j][i])
    L.append(np.mean(mean))
    
predictions = np.array(L)

predictions[predictions <= coefficients[0]] = 0
predictions[np.where(np.logical_and(predictions > coefficients[0], predictions <= coefficients[1]))] = 1
predictions[np.where(np.logical_and(predictions > coefficients[1], predictions <=coefficients[2]))] = 2
predictions[predictions > coefficients[2]] = 3

pred = np.round(predictions).astype('int')

sub_LGB_test=pd.DataFrame({'installation_id':submission.installation_id,'accuracy_group':pred})
sub_LGB_test.to_csv('submission.csv',index=False)
sub_LGB_test['accuracy_group'].plot(kind='hist')

Less overfitting - testing - regression

I remove outliers id :

Recent Posts

Comments