ValueError: Unknown Label Type: 'unknown' In RandomForestClassifier

I'm trying to train a model on my dataset using RandomForestClassifier:

transformer = TfidfVectorizer(lowercase=True, stop_words=stop, max_features=500)
X = transformer.fit_transform(df.Text)
y = transformer.fit_transform(df.category)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier()

model.fit(X_train, y_train)

My dataset looks like this:

Review(text format)    Category(text format)
Its good product       good product
Its damaged product    damaged product

I get this error:

raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'

Could anyone suggest any solution to solve it?

Answer

A RandomForestClassifier instance expects the following data as the labels:

y : array-like, shape = [n_samples] or [n_samples, n_outputs] The target values (class labels in classification, real numbers in regression).

But transformer.fit_transform(df.category) returns a sparse matrix of tf-idf values with dtype numpy.float64, which is not a valid label format.
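To see what the classifier actually receives as labels, here is a minimal sketch on toy data. The category strings are made up for illustration and only stand in for df.category; they are not the asker's real data.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy category column (an assumption standing in for df.category)
categories = ["good product", "damaged product", "unknown", "good product"]

# Vectorizing the labels yields one tf-idf column per distinct word
y_bad = TfidfVectorizer().fit_transform(categories)

print(issparse(y_bad))  # True: a SciPy sparse matrix, not a 1-D label array
print(y_bad.dtype)      # float64
print(y_bad.shape)      # (4, 4): 4 samples x 4 distinct words
```

Continuous float values in a 2-D sparse matrix are not something a classifier can interpret as class labels, which is why fit rejects them.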

If you're trying to classify data into a restricted number of categories, e.g. "good product", "damaged product", etc., you can encode the categories not word by word but as whole labels, using a label encoder:

(For multilabel classification that predicts each word, see below.)

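from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# df and stop are the DataFrame and stop-word list from the question
transformer = TfidfVectorizer(lowercase=True, stop_words=stop, max_features=500)
X = transformer.fit_transform(df.Text)
# Encode each whole category string as a single integer class label
le = LabelEncoder()
y = le.fit_transform(df.category)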
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier()
model.fit(X_train, y_train)

le.inverse_transform(model.predict(X_test))
Out:
array(['good product', 'good product'], dtype=object)

(or similar). This is the simplest solution.
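As a small illustration of what LabelEncoder does with whole category strings, here is a sketch on toy data (the list below is an assumption, not the asker's DataFrame). Classes are sorted alphabetically and each one is mapped to an integer:

```python
from sklearn.preprocessing import LabelEncoder

# Toy category column (an assumption standing in for df.category)
categories = ["good product", "damaged product", "unknown", "good product"]

le = LabelEncoder()
y = le.fit_transform(categories)

print(le.classes_.tolist())  # ['damaged product', 'good product', 'unknown']
print(y.tolist())            # [1, 0, 2, 1]

# inverse_transform maps integer predictions back to the original strings
print(le.inverse_transform([1, 0]).tolist())  # ['good product', 'damaged product']
```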

If you're planning to do multilabel classification instead, there are two problems:

  1. There will be a lot of labels, depending on the number of distinct words in the df.category column.
  2. The sparse matrix can be converted to a numpy.array, but that costs memory, and the matrix contains floats, because the values are tf-idf weights, whereas RandomForestClassifier works fine with integer labels.

For example, the tf-idf labels look like this:

y.toarray()
array([[0.        , 0.77722116, 0.62922751, 0.        ],
       [0.84292635, 0.        , 0.53802897, 0.        ],
       [0.        , 0.        , 0.        , 1.        ],
       [0.        , 0.77722116, 0.62922751, 0.        ]])

This could be converted to a {0, 1} integer array, but it's easier to use MultiLabelBinarizer (note that split is applied to each row to get word-wise, not character-wise, binarization):

transformer = TfidfVectorizer(lowercase=True, stop_words=stop, max_features=500)
X = transformer.fit_transform(df.Text)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df.category.map(lambda x: x.split()))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier()
model.fit(X_train, y_train)

In that case, y is:

y
Out:
array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 1, 1, 0]])

And it can predict words:

mlb.inverse_transform(model.predict(X_test))
Out:
[('good', 'product'), ('good', 'product')]
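The remark about split can be checked with a short sketch on toy data (the two strings are assumptions for illustration). MultiLabelBinarizer iterates over each sample, so a bare string is treated as a sequence of characters:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy category column (an assumption standing in for df.category)
categories = ["good product", "damaged product"]

# Without split(): each string is iterated character by character
mlb_chars = MultiLabelBinarizer()
mlb_chars.fit_transform(categories)
print(len(mlb_chars.classes_))  # 12 single-character "labels", including ' '

# With split(): each sample becomes a list of words
mlb_words = MultiLabelBinarizer()
y_words = mlb_words.fit_transform([c.split() for c in categories])
print(mlb_words.classes_.tolist())  # ['damaged', 'good', 'product']
print(y_words.tolist())             # [[0, 1, 1], [1, 0, 1]]
```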

Refitting TfidfVectorizer is dangerous
Off-topic, but you refitted the vectorizer here:

X = transformer.fit_transform(df.Text)
print(transformer.vocabulary_)
y = transformer.fit_transform(df.category)
print(transformer.vocabulary_)
Out:
{'its': 3, 'good': 1, 'product': 6, 'damaged': 0, 'sttate': 7, 'is': 2, 'unknown': 8, 'one': 5, 'more': 4}
{'good': 1, 'product': 2, 'damaged': 0, 'unknown': 3}

This can cause errors if you later try to use transformer on the Text data. It's better to instantiate two transformers and use them separately.
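A minimal sketch of the difference, using toy strings assumed for illustration (the names shared, text_vectorizer and category_vectorizer are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (assumptions standing in for df.Text and df.category)
texts = ["Its good product", "Its damaged product"]
categories = ["good product", "damaged product"]

# Reusing one vectorizer: the second fit overwrites the text vocabulary
shared = TfidfVectorizer()
shared.fit_transform(texts)
shared.fit_transform(categories)
print("its" in shared.vocabulary_)  # False: the text vocabulary is gone

# Separate vectorizers keep each vocabulary intact
text_vectorizer = TfidfVectorizer()
category_vectorizer = TfidfVectorizer()
X = text_vectorizer.fit_transform(texts)
category_vectorizer.fit_transform(categories)
print("its" in text_vectorizer.vocabulary_)  # True

# New text can still be transformed into the same feature space as X
X_new = text_vectorizer.transform(["Its damaged"])
print(X_new.shape[1] == X.shape[1])  # True
```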

source: stackoverflow.com