StratifiedShuffleSplit: ValueError: The Least Populated Class In Y Has Only 1 Member, Which Is Too Few.
I'm using the StratifiedShuffleSplit cross-validator to predict house prices in the Boston dataset, with the sample code below.
    def fit_model_S(labels, features, step, clf, parameters):
        cv = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
        print(cv)
        for train_index, test_index in cv.split(features, labels):
            labels_train, labels_test = labels[train_index], labels[test_index]
            features_train, features_test = features[train_index], features[test_index]
Running it gives the error below. The code works with ShuffleSplit. Does this mean that StratifiedShuffleSplit cannot be used with numeric labels?
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-141-b290147edcbf> in <module>()
         33 dt_steps = [('decision', clf)]
         34
    ---> 35 fit_model_S(labels, features,dt_steps,clf,parameters4)
         36
         37

    <ipython-input-141-b290147edcbf> in fit_model_S(labels, features, step, clf, parameters)
          8     cv = StratifiedShuffleSplit(n_splits=2,test_size=0.10, random_state = 42)
          9     print (cv)
    ---> 10     for train_index, test_index in cv.split(features,labels):
         11
         12         labels_train, labels_test = labels[train_index], labels[test_index]

    C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
       1194         """
       1195         X, y, groups = indexable(X, y, groups)
    -> 1196         for train, test in self._iter_indices(X, y, groups):
       1197             yield train, test
       1198

    C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _iter_indices(self, X, y, groups)
       1535         class_counts = np.bincount(y_indices)
       1536         if np.min(class_counts) < 2:
    -> 1537             raise ValueError("The least populated class in y has only 1"
       1538                              " member, which is too few. The minimum"
       1539                              " number of groups for any class cannot"

    ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
A sample of the dataset is below.
          RM  LSTAT  PTRATIO      MEDV
    0  6.575   4.98     15.3  504000.0
    1  6.421   9.14     17.8  453600.0
    2  7.185   4.03     17.8  728700.0
    3  6.998   2.94     18.7  701400.0
    4  7.147   5.33     18.7  760200.0
MEDV is the label in this case.
The Boston Housing data is a dataset for a regression problem, and you are using StratifiedShuffleSplit to divide it into train and test sets.
StratifiedShuffleSplit, as described in the docs, is:
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Note the last sentence: "preserving the percentage of samples for each class". StratifiedShuffleSplit therefore treats each distinct y value as its own class. That cannot work here because your y is a regression target (continuous numerical data): almost every value occurs only once, so the least populated class has a single member, which is exactly what the error reports.
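To make this concrete, here is a minimal sketch that reproduces the problem. The data is synthetic (random values standing in for MEDV and the three feature columns, not the real Boston data), and the variable names are only illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for a continuous target such as MEDV (assumption: not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.uniform(100_000, 1_000_000, size=50)

# Stratification treats every distinct y value as a separate class.
classes, counts = np.unique(y, return_counts=True)
n_classes = len(classes)
min_count = counts.min()
print(n_classes, "distinct 'classes'; smallest has", min_count, "member(s)")

# With single-member classes, no stratified split is possible.
cv = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
try:
    next(cv.split(X, y))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

With a continuous target, nearly every sample ends up in its own class, so there is no class proportion to preserve between train and test.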
Use ShuffleSplit or train_test_split to divide your data instead. See the scikit-learn documentation for more details on cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
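As a rough sketch of what that could look like, the loop from the question can be kept almost unchanged by swapping in ShuffleSplit, which does not need the labels to build the split. The arrays below are synthetic placeholders for your features and MEDV labels, assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Synthetic placeholders for the real features and MEDV labels (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
labels = rng.uniform(100_000, 1_000_000, size=50)

# Same parameters as in the question, but without stratification.
cv = ShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
split_sizes = []
for train_index, test_index in cv.split(features):
    features_train, features_test = features[train_index], features[test_index]
    labels_train, labels_test = labels[train_index], labels[test_index]
    split_sizes.append((len(train_index), len(test_index)))
print(split_sizes)

# For a single train/test split, train_test_split is the shorter route.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.10, random_state=42
)
print(X_train.shape, X_test.shape)
```

If your features and labels are pandas objects rather than NumPy arrays, index them with `.iloc[train_index]` and `.iloc[test_index]`, since the splitter yields positional indices.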