scikit-learn - Python sklearn: what is the difference between accuracy_score and the learning_curve score?


I'm using Python sklearn (version 0.17) to select the ideal model on a data set. To do this, I followed these steps:

  1. Split the data set using cross_validation.train_test_split with test_size = 0.2.
  2. Use GridSearchCV to select the ideal k-nearest-neighbors classifier on the training set.
  3. Pass the classifier returned by GridSearchCV to plot_learning_curve. plot_learning_curve gave the plot shown below.
  4. Run the classifier returned by GridSearchCV on the test set obtained.

From the plot, I can see that the score at the max. training size is about 0.43. This score is the score returned by the sklearn.learning_curve.learning_curve function.

But when I run the best classifier on the test set, I get an accuracy score of 0.61, as returned by sklearn.metrics.accuracy_score (correctly predicted labels / number of labels).
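For reference, the ratio in parentheses above can be written out by hand. This is only a minimal sketch of "correctly predicted labels / number of labels"; the `accuracy` helper and the sample labels are hypothetical and are not taken from the wine data set or from scikit-learn's source.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


# Hypothetical labels, for illustration only.
y_true = [5.0, 6.0, 6.0, 5.0, 7.0]
y_pred = [5.0, 6.0, 5.0, 5.0, 6.0]

example_accuracy = accuracy(y_true, y_pred)  # 3 of the 5 labels match
print("accuracy = %.2f" % example_accuracy)
```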

Link to image: graph plot for the KNN classifier

This is the code I am using. I have not included the plot_learning_curve function, as it would take a lot of space. I took plot_learning_curve from here.

import sys

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

filename = sys.argv[1]
data = np.loadtxt(fname=filename, delimiter=',')
x = data[:, 0:-1]
y = data[:, -1]   # last column is the label column

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)

params = {'n_neighbors': [2, 3, 5, 7, 10, 20, 30, 40, 50],
          'weights': ['uniform', 'distance']}

# Grid search over k and the weighting scheme, fitted on the training set only
clf = GridSearchCV(KNeighborsClassifier(), param_grid=params)
clf.fit(x_train, y_train)
y_true, y_pred = y_test, clf.predict(x_test)
acc = accuracy_score(y_pred, y_test)
print 'accuracy on test set =', acc

print clf.best_params_
for params, mean_score, scores in clf.grid_scores_:
    print "%0.3f (+/-%0.03f) %r" % (
        mean_score, scores.std() / 2, params)

y_true, y_pred = y_test, clf.predict(x_test)
#pred = clf.predict(np.array(features_test))
acc = accuracy_score(y_pred, y_test)
print classification_report(y_true, y_pred)
print 'accuracy last =', acc
print

# Note: the learning curve is drawn on the whole data set (x, y)
plot_learning_curve(clf, "KNeighborsClassifier",
                    x, y,
                    train_sizes=np.linspace(.05, 1.0, 5))

Is this normal? I can understand that there might be a difference in scores, but this is a difference of 0.18, which, converted to percentages, is 43% against 61%. The classification_report also gives an average recall of 0.61.

Am I doing something wrong? Is there a difference in how learning_curve calculates scores? I also tried passing scoring='accuracy' to the learning_curve function to see if it matches the accuracy score, but it didn't make any difference.

Any advice would be helpful.

I'm using the wine quality (white) data set from UCI, and I removed the header before running the code.

When you call the learning_curve function, it performs cross-validation on the whole data. As you are leaving the cv parameter empty, it uses a 3-fold cross-validation splitting strategy. And here comes the tricky part, because as stated in the documentation, "if the estimator is a classifier or if y is neither binary nor multiclass, KFold is used". And your estimator is a classifier.

So, what's the difference between KFold and StratifiedKFold?

KFold = split the dataset into k consecutive folds (without shuffling by default)

StratifiedKFold = "The folds are made by preserving the percentage of samples for each class."

Let's make a simple example:

  • Your data labels are [4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0]
  • A non-stratified 3-fold split divides them into the subsets [4.0, 4.0, 4.0], [5.0, 5.0, 5.0], [6.0, 6.0, 6.0]
  • Each fold is then used as the validation set once, while the k - 1 (3 - 1 = 2) remaining folds form the training set. So, for example, you would be training on [5.0, 5.0, 5.0, 6.0, 6.0, 6.0] and validating on [4.0, 4.0, 4.0]
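The example above can be reproduced with a few lines of plain Python. This is a sketch only: `consecutive_folds` is a hypothetical hand-written helper that mimics consecutive, unshuffled folds, not scikit-learn's own implementation. For every fold it lists the classes in the validation set that never appear in the corresponding training set.

```python
def consecutive_folds(n_samples, k):
    """Split the indices 0..n_samples-1 into k consecutive, unshuffled folds."""
    # The first n_samples % k folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


labels = [4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0]
folds = consecutive_folds(len(labels), 3)

unseen_per_fold = []
for val_idx in folds:
    val_labels = set(labels[i] for i in val_idx)
    train_labels = set(labels[i] for i in range(len(labels)) if i not in val_idx)
    # Classes the model is asked to predict but has never been trained on.
    unseen = sorted(val_labels - train_labels)
    unseen_per_fold.append(unseen)
    print("validation classes %s, never seen in training: %s"
          % (sorted(val_labels), unseen))
```

With labels sorted like this, every validation fold consists entirely of a class that is absent from its training set, so a classifier cannot get any of those samples right.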

This explains the low accuracy you get when drawing your learning curve (~0.43). Of course this is an extreme example to illustrate the situation, but your data is somehow structured and you need to shuffle it.

But when you obtain the ~61% accuracy, you have split the data with the method train_test_split, which by default shuffles the data, so both parts keep roughly the same class proportions.

Just look at this: I've performed a simple test to support my hypothesis:

x_train2, x_test2, y_train2, y_test2 = train_test_split(x, y, test_size=0., random_state=2) 

In your example you fed learning_curve all of your data x, y. I'm doing a little trick here, which is to split the data with test_size=0., meaning that all the data ends up in the train variables. This way I'm still keeping all the data, but now it is shuffled because it went through the train_test_split function.

Then I've called your plotting function, but with the shuffled data:
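The effect of that shuffle can be sketched without scikit-learn. Everything here is an assumption made for illustration: `shuffle_together` is a hypothetical helper, and the data is synthetic (100 samples per class, sorted by label), not the wine data set. After shuffling, consecutive folds are cut from the labels and the classes found in each fold are listed.

```python
import random


def shuffle_together(x, y, seed):
    """Return shuffled copies of x and y, keeping rows and labels aligned."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    return [x[i] for i in order], [y[i] for i in order]


# Synthetic, ordered data: the single feature is just the original row index.
y = [4.0] * 100 + [5.0] * 100 + [6.0] * 100
x = [[float(i)] for i in range(len(y))]

x_shuffled, y_shuffled = shuffle_together(x, y, seed=2)

# Three consecutive folds taken from the shuffled labels.
fold_size = len(y_shuffled) // 3
classes_per_fold = [
    sorted(set(y_shuffled[i * fold_size:(i + 1) * fold_size]))
    for i in range(3)
]
print("classes per fold after shuffling: %s" % classes_per_fold)
```

Once the rows are shuffled, each consecutive fold is a random sample of the data, so it is overwhelmingly likely to contain every class, which is what the unshuffled split in the toy example lacked.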

plot_learning_curve(clf, "KNeighborsClassifier", x_train2, y_train2, train_sizes=np.linspace(.05, 1.0, 5))

Now the output at the max. number of training samples is 0.59 instead of 0.43, which makes a lot more sense given the GridSearch results.

Observation: I think the whole point of plotting a learning curve is to determine whether adding more samples to the training set lets our estimator perform better or not (so you can decide, for example, when there is no need to add more examples). As in train_sizes you just feed the values np.linspace(.05, 1.0, 5) --> [ 0.05 , 0.2875, 0.525 , 0.7625, 1. ], I'm not entirely sure that this is the usage you're pursuing in this kind of test.
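To make those five fractions concrete, here is a small sketch that recomputes them and turns them into sample counts. The `linspace` helper is a hand-written stand-in for np.linspace, and `n_max_train = 4000` is a made-up training-set size chosen only for illustration; the real count depends on the data set and on the cross-validation split.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (stand-in for np.linspace)."""
    step = (stop - start) / float(num - 1)
    return [start + i * step for i in range(num)]


fractions = linspace(0.05, 1.0, 5)

# Hypothetical size of the largest training set available in one CV split.
n_max_train = 4000
sizes = [int(round(f * n_max_train)) for f in fractions]

for f, n in zip(fractions, sizes):
    print("fraction %.4f -> about %d training samples" % (f, n))
```

The curve is therefore evaluated at only five points, from 5% of the available training samples up to all of them.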

