python - In SelectKBest, what does length of get_support() represent?
When reproducing this cross-validation example, I get, for a 2x4 train matrix (xtrain), a len(b.get_support()) of 1 000 000. Does this mean that 1 000 000 features have been created in the model? Or only 2, since the number of features that have an impact is 2? Thanks!
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

### create data
def hidden_model(x):
    # y is a linear combination of columns 5 and 10...
    result = x[:, 5] + x[:, 10]
    # ... with a little noise
    result += np.random.normal(0, .005, result.shape)
    return result

def make_x(nobs):
    return np.random.uniform(0, 3, (nobs, 10 ** 6))

x = make_x(20)
y = hidden_model(x)

scores = []
clf = LinearRegression()
for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    # select the 2 best features using the training fold only
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]  # get_support: mask or integer index of selected features
    xtest = xtest[:, b.get_support()]
    print(len(b.get_support()))

    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel('predicted')
plt.ylabel('observed')

print("cv score (r_square) is", np.mean(scores))
It represents the mask that can be applied to x to get the features that have been selected using the SelectKBest routine.
print(x.shape)
print(b.get_support().shape)
print(np.bincount(b.get_support()))
outputs:
(20, 1000000)
(1000000,)
[999998 2]
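This shows that you have 20 examples of 1000000-dimensional data, and a boolean array of length 1000000 in which only 2 entries are ones. So the length of get_support() is the number of input features, not the number of selected ones.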
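As a minimal sketch of the same point on a smaller problem: the mask always has one entry per input column, and only k of them are True. The names x_small, y_small and selector are made up for this illustration, and the 50-column size is an assumption chosen to keep it quick to run. get_support(indices=True) returns the integer positions instead of the boolean mask.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical small dataset: 20 samples, 50 features (not the 10**6 of the question)
rng = np.random.RandomState(0)
x_small = rng.uniform(0, 3, (20, 50))
y_small = x_small[:, 5] + x_small[:, 10] + rng.normal(0, .005, 20)

selector = SelectKBest(f_regression, k=2).fit(x_small, y_small)

mask = selector.get_support()                 # boolean mask, one entry per input feature
indices = selector.get_support(indices=True)  # integer positions of the selected features

print(mask.shape)   # (50,)  -> length equals the number of input features
print(mask.sum())   # 2      -> number of selected features, i.e. k
print(indices)      # which columns were kept

# Applying the mask by hand gives the same result as selector.transform
print(np.array_equal(selector.transform(x_small), x_small[:, mask]))
```

If you only need the reduced matrix, selector.transform(x_small) avoids handling the mask yourself.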
Hope that helps!