scikit-learn - Python sklearn: What is the difference between accuracy_score and learning_curve score?
I'm using Python sklearn (version 0.17) to select the ideal model for a data set. To do this, I followed these steps:
- Split the data set using cross_validation.train_test_split with test_size = 0.2.
- Use GridSearchCV to select the ideal k-nearest-neighbors classifier on the training set.
- Pass the classifier returned by GridSearchCV to plot_learning_curve. plot_learning_curve gave the plot shown below.
- Run the classifier returned by GridSearchCV on the test set obtained.
From the plot, I can see that the score for the max. training size is 0.43. This score is the one returned by the sklearn.learning_curve.learning_curve function.
But when I run the best classifier on the test set, I get an accuracy score of 0.61, as returned by sklearn.metrics.accuracy_score (correctly predicted labels / number of labels).
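For reference, the definition in parentheses above can be checked directly. This is a minimal sketch with made-up labels (the arrays are hypothetical, not taken from the data set), showing that accuracy_score matches the fraction of correctly predicted labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, chosen only for illustration.
y_true = np.array([4.0, 5.0, 6.0, 6.0, 5.0])
y_pred = np.array([4.0, 6.0, 6.0, 5.0, 5.0])

# Correctly predicted labels / number of labels.
manual = np.mean(y_true == y_pred)
library = accuracy_score(y_true, y_pred)

print(manual, library)
```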
Link to the image: 
This is the code I'm using. I have not included the plot_learning_curve function, as it would take a lot of space. I took plot_learning_curve from here
    import pandas as pd
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import classification_report
    from matplotlib import pyplot as plt
    import sys
    from sklearn import cross_validation
    from sklearn.learning_curve import learning_curve
    from sklearn.grid_search import GridSearchCV
    from sklearn.cross_validation import train_test_split

    filename = sys.argv[1]
    data = np.loadtxt(fname = filename, delimiter = ',')
    x = data[:, 0:-1]
    y = data[:, -1]  # last column is the label column

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)

    params = {'n_neighbors': [2, 3, 5, 7, 10, 20, 30, 40, 50], 'weights': ['uniform', 'distance']}
    clf = GridSearchCV(KNeighborsClassifier(), param_grid=params)
    clf.fit(x_train, y_train)
    y_true, y_pred = y_test, clf.predict(x_test)
    acc = accuracy_score(y_pred, y_test)
    print 'accuracy on test set =', acc
    print clf.best_params_
    for params, mean_score, scores in clf.grid_scores_:
        print "%0.3f (+/-%0.03f) %r" % (mean_score, scores.std() / 2, params)

    y_true, y_pred = y_test, clf.predict(x_test)
    #pred = clf.predict(np.array(features_test))
    acc = accuracy_score(y_pred, y_test)
    print classification_report(y_true, y_pred)
    print 'accuracy last =', acc
    print
    plot_learning_curve(clf, "KNeighborsClassifier", x, y, train_sizes=np.linspace(.05, 1.0, 5))

Is this normal? I can understand that there might be a difference in scores, but this is a difference of 0.18, which when converted to percentages is 43% against 61%. The classification_report also gives an average recall of 0.61.
Am I doing something wrong? Is there a difference in how learning_curve calculates scores? I also tried passing scoring='accuracy' to the learning_curve function to see if it matches the accuracy score, but it didn't make any difference.
Any advice would be helpful.
I'm using the Wine Quality (white) data set from UCI, and I removed the header before running the code.
When you call the learning_curve function, it performs a cross-validation on your whole data. As you are leaving the cv parameter empty, it uses a 3-fold cross-validation splitting strategy. And here comes the tricky part, because as stated in the documentation, "If the estimator is a classifier or if y is neither binary nor multiclass, KFold is used". And your estimator is a classifier.
So, what's the difference between KFold and StratifiedKFold?
KFold = split the dataset into k consecutive folds (without shuffling by default)
StratifiedKFold = "The folds are made by preserving the percentage of samples for each class."
Let's make a simple example:
- Your data labels are [4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0]
- A non-stratified 3-fold split divides them into the subsets [4.0, 4.0, 4.0], [5.0, 5.0, 5.0], [6.0, 6.0, 6.0]
- Each fold is then used as a validation set once, while the k - 1 (3 - 1 = 2) remaining folds form the training set. So, for example, you would be training on [5.0, 5.0, 5.0, 6.0, 6.0, 6.0] and validating on [4.0, 4.0, 4.0]
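The example above can be reproduced with a short sketch. It assumes a newer scikit-learn than 0.17, where the splitters live in sklearn.model_selection; the features array is a placeholder introduced only so the splitters have something to index:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

labels = np.array([4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0])
features = np.zeros((len(labels), 1))  # placeholder; only the labels matter here

# Consecutive folds, no shuffling: each validation fold is a single class.
kfold_val = [labels[val].tolist()
             for _, val in KFold(n_splits=3).split(features)]

# Stratified folds: each validation fold holds one sample of every class.
strat_val = [labels[val].tolist()
             for _, val in StratifiedKFold(n_splits=3).split(features, labels)]

print("KFold validation folds:          ", kfold_val)
print("StratifiedKFold validation folds:", strat_val)
```

With consecutive folds, the classifier is always validated on a class it never saw during training, which is the extreme version of the problem described here.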
This could explain the low accuracy you get when drawing your learning curve (~0.43, i.e. 43%). Of course this is an extreme example to illustrate the situation, but your data is somehow structured and you need to shuffle it.
But when you obtain the ~61% accuracy, you have split the data with the method train_test_split, which by default shuffles the data and so roughly keeps the class proportions.
Just look at this. I've performed a simple test to support my hypothesis:
    x_train2, x_test2, y_train2, y_test2 = train_test_split(x, y, test_size=0., random_state=2)

In your example you fed learning_curve with all of your data x, y. I'm doing a little trick here, which is to split the data telling it test_size=0., meaning that all the data ends up in the train variables. This way I'm still keeping all the data, but now it is shuffled because it went through the train_test_split function.
Then I've called your plotting function, but with the shuffled data:

    plot_learning_curve(clf, "KNeighborsClassifier", x_train2, y_train2, train_sizes=np.linspace(.05, 1.0, 5))

Now the output at the max. number of training samples is 0.59 instead of 0.43, which makes a lot more sense given the GridSearch results.
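The effect of shuffling can also be sketched on synthetic data. Everything below is an assumption made for illustration: the blobs stand in for the wine data, the API is the newer sklearn.model_selection one, and sklearn.utils.shuffle replaces the test_size=0. trick, which newer versions of train_test_split may reject:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold, learning_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle

# Synthetic stand-in data: three well-separated classes, 100 samples each.
features, labels = make_blobs(n_samples=300,
                              centers=[[-10, 0], [0, 10], [10, 0]],
                              cluster_std=1.0, random_state=0)

# Sort by label to mimic a data set that is structured by class.
order = np.argsort(labels)
features, labels = features[order], labels[order]

knn = KNeighborsClassifier(n_neighbors=5)
sizes = np.linspace(0.2, 1.0, 3)
cv = KFold(n_splits=3)  # consecutive folds, no shuffling

# Structured data: every validation fold is a class missing from training.
_, _, val_sorted = learning_curve(knn, features, labels,
                                  cv=cv, train_sizes=sizes)

# Same data and same folds, but with the rows shuffled beforehand.
features_s, labels_s = shuffle(features, labels, random_state=2)
_, _, val_shuffled = learning_curve(knn, features_s, labels_s,
                                    cv=cv, train_sizes=sizes)

print("sorted data, mean validation score:  ", val_sorted.mean(axis=1))
print("shuffled data, mean validation score:", val_shuffled.mean(axis=1))
```

The gap on this toy data is far larger than the 0.43 versus 0.59 seen above, because the classes here are perfectly sorted; real data is usually only partly ordered.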
Observation: I think the whole point of plotting a learning curve is to determine whether adding more samples to the training set lets our estimator perform better or not (so you can decide, for example, when there is no need to add more examples). As in train_sizes you just feed the values np.linspace(.05, 1.0, 5) --> [0.05, 0.2875, 0.525, 0.7625, 1.], I'm not entirely sure that this is the usage you are pursuing in this kind of test.
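As a quick check of those values, here is a small sketch of what the train_sizes argument contains. learning_curve treats float entries as fractions of the largest training set that the chosen cross-validation split allows:

```python
import numpy as np

# Five evenly spaced fractions between 5% and 100% of the training data.
fractions = np.linspace(0.05, 1.0, 5)
print(fractions)
```

A denser grid, for example np.linspace(.05, 1.0, 10), would give more points on the curve at the cost of more model fits.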