python - In SelectKBest, what does the length of get_support() represent?


When reproducing this cross-validation example, I get, for a 2x4 train matrix (xtrain), a len(b.get_support()) of 1,000,000. Does this mean that 1,000,000 features have been created in the model? Or only 2, since the number of features that have an impact is 2? Thanks!

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

### create data
def hidden_model(x):
    # y is a linear combination of columns 5 and 10...
    result = x[:, 5] + x[:, 10]
    # ... with a little noise
    result += np.random.normal(0, .005, result.shape)
    return result


def make_x(nobs):
    return np.random.uniform(0, 3, (nobs, 10 ** 6))

x = make_x(20)
y = hidden_model(x)

scores = []
clf = LinearRegression()

for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]  # get_support: mask or integer index of selected features
    xtest = xtest[:, b.get_support()]
    print(len(b.get_support()))

    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel('predicted')
plt.ylabel('observed')

print("cv score (r_square) is", np.mean(scores))

It represents the mask that can be applied to x to get the features that have been selected using the SelectKBest routine.

print(x.shape)
print(b.get_support().shape)
print(np.bincount(b.get_support()))

This outputs:

(20, 1000000)
(1000000,)
[999998      2]

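This shows that we have 20 examples of 1,000,000-dimensional data, and a boolean array of length 1,000,000 in which only 2 entries are ones. So the length of get_support() is the number of input features, not the number of selected ones; only the 2 columns marked True are passed on to the model.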
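The same relationship can be checked on a much smaller problem. The snippet below is only a minimal sketch: the 20x50 matrix, the seed and the variable names (`selector`, `mask`, `indices`, `model`) are made up for illustration and are not taken from the question. It also uses `get_support(indices=True)`, which returns the integer positions of the selected columns instead of a boolean mask.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Hypothetical small stand-in for the question's data: 20 rows, 50 columns
n_samples, n_features, k = 20, 50, 2
X = rng.uniform(0, 3, (n_samples, n_features))
y = X[:, 5] + X[:, 10] + rng.normal(0, 0.005, n_samples)

selector = SelectKBest(f_regression, k=k).fit(X, y)

mask = selector.get_support()                  # boolean, one entry per input column
indices = selector.get_support(indices=True)   # integer positions of the kept columns

print(mask.shape)    # length equals the number of input features
print(mask.sum())    # number of True entries equals k
print(indices)

# Applying the mask by hand gives the same matrix as selector.transform
X_selected = X[:, mask]
assert np.array_equal(X_selected, selector.transform(X))

# The downstream model only ever sees the k selected columns
model = LinearRegression().fit(X_selected, y)
print(model.coef_.shape)
```

Here `len(mask)` is 50 while the fitted regression has only 2 coefficients, which mirrors the 1,000,000 versus 2 situation in the question.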

Hope this helps!

