Stratified Shuffle Split ValueError: The Least Populated Class In Y Has Only 1 Member, Which Is Too Few

I'm struggling to get my stratified shuffle split to work. I have two arrays, features and labels, and I'm trying to get back my results object, which should hold all of the accuracy/precision/recall/F1 scores.

However, I think I'm getting muddled about how this is supposed to return results to me. Can anyone spot what I'm doing wrong here?

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

sss = StratifiedShuffleSplit(n_splits=1, random_state=42, test_size=0.33)

clf_obj = RandomForestClassifier(n_estimators=10)


scoring = {'accuracy' : make_scorer(accuracy_score), 
           'precision' : make_scorer(precision_score),
           'recall' : make_scorer(recall_score), 
           'f1_score' : make_scorer(f1_score)}

results = cross_validate(estimator=clf_obj,
                         X=features,
                         y=labels,
                         cv=sss,
                         scoring=scoring)

I suppose what's confusing me here is that I'm getting this error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

But I don't understand what's going wrong with my X and y values. The first line flagged in the traceback seems to point at the scoring parameter:

---> 29 scoring=scoring)

... but as far as I can see, I've filled in the parameters of the cross_validate() function correctly?

Full error trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-2af4c433ccc9> in <module>
     27                             y=labels,
     28                             cv=sss,
---> 29                             scoring=scoring)

/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    238             return_times=True, return_estimator=return_estimator,
    239             error_score=error_score)
--> 240         for train, test in cv.split(X, y, groups))
    241 
    242     zipped_scores = list(zip(*scores))

/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    915             # remaining jobs.
    916             self._iterating = False
--> 917             if self.dispatch_one_batch(iterator):
    918                 self._iterating = self._original_iterator is not None
    919 

/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    752             tasks = BatchedCalls(itertools.islice(iterator, batch_size),
    753                                  self._backend.get_nested_backend(),
--> 754                                  self._pickle_cache)
    755             if len(tasks) == 0:
    756                 # No more tasks available in the iterator: tell caller to stop.

/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, iterator_slice, backend_and_jobs, pickle_cache)
    208 
    209     def __init__(self, iterator_slice, backend_and_jobs, pickle_cache=None):
--> 210         self.items = list(iterator_slice)
    211         self._size = len(self.items)
    212         if isinstance(backend_and_jobs, tuple):

/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in <genexpr>(.0)
    233                         pre_dispatch=pre_dispatch)
    234     scores = parallel(
--> 235         delayed(_fit_and_score)(
    236             clone(estimator), X, y, scorers, train, test, verbose, None,
    237             fit_params, return_train_score=return_train_score,

/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
   1313         """
   1314         X, y, groups = indexable(X, y, groups)
-> 1315         for train, test in self._iter_indices(X, y, groups):
   1316             yield train, test
   1317 

/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
   1693         class_counts = np.bincount(y_indices)
   1694         if np.min(class_counts) < 2:
-> 1695             raise ValueError("The least populated class in y has only 1"
   1696                              " member, which is too few. The minimum"
   1697                              " number of groups for any class cannot"

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Answer

The error message actually says it all:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

You probably have a class in your y that is represented by only a single sample, so any stratified split is impossible: stratification needs at least two samples of every class so that each class can appear in both the train and the test part of the split.
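You can confirm this by counting the samples per class before calling cross_validate; any class with a count below 2 will trigger exactly this error. A minimal sketch with hypothetical toy labels (in your case you would count your own labels array instead):

from collections import Counter

# Hypothetical toy labels: class 2 occurs only once
labels = [0, 0, 1, 1, 1, 2]

# Count samples per class and list every class that cannot be stratified
counts = Counter(labels)
too_small = [cls for cls, n in counts.items() if n < 2]
print(too_small)   # -> [2]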

What you can do is remove that (single) sample from your data before splitting; in any case, classes represented by a single sample are of no use for classification anyway...
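A minimal sketch of that filtering, assuming features and labels are parallel array-likes as in your snippet (the toy data below is only for illustration):

import numpy as np
from collections import Counter

from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): class 2 has a single sample
features = np.array([[0.1, 1.0], [0.2, 0.9], [0.3, 0.8],
                     [1.1, 0.1], [1.2, 0.2], [9.9, 9.9]])
labels = np.array([0, 0, 0, 1, 1, 2])

# Keep only samples whose class occurs at least twice
counts = Counter(labels)
mask = np.array([counts[c] >= 2 for c in labels])
features, labels = features[mask], labels[mask]

# The stratified split now works on the filtered data
sss = StratifiedShuffleSplit(n_splits=1, random_state=42, test_size=0.33)
for train_idx, test_idx in sss.split(features, labels):
    print(train_idx, test_idx)

After filtering, your original cross_validate(...) call with cv=sss should run without raising this ValueError.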

source: stackoverflow.com