13

I am using a bag of words to classify text. It works well, but I am wondering how to add a feature that is not a word. How do I add another feature (text length) to the existing bag-of-words classification in scikit-learn?

Here is my sample code.

import numpy as np 
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 

X_train = np.array(["new york is a hell of a town", 
        "new york was originally dutch", 
        "new york is also called the big apple", 
        "nyc is nice", 
        "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.", 
        "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.", 
        "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.", 
        "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",]) 
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]] 

X_test = np.array(["it's a nice day in nyc", 
        'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.' 
        ]) 
target_names = ['Class 1', 'Class 2'] 

classifier = Pipeline([ 
    ('vectorizer', CountVectorizer(min_df=1,max_df=2)), 
    ('tfidf', TfidfTransformer()), 
    ('clf', OneVsRestClassifier(LinearSVC()))]) 
classifier.fit(X_train, y_train) 
predicted = classifier.predict(X_test) 
for item, labels in zip(X_test, predicted): 
    # np.atleast_1d handles both a single label and a sequence of labels per sample
    print('%s => %s' % (item, ', '.join(target_names[x] for x in np.atleast_1d(labels))))

Now, it is clear that the texts about London tend to be much longer than the texts about New York. How would I add the length of the text as a feature? Do I have to use a separate classification method and then combine the two predictions? Is there a way to do it together with the bag of words? Some sample code would be great. I am very new to machine learning and scikit-learn.

+0

Your code doesn't work, because you are using OneVsRestClassifier when there is only a single target. – joc
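To illustrate the comment above, here is a minimal sketch of the single-target case, assuming each document has exactly one class. The shortened training data is hypothetical, and replacing OneVsRestClassifier with a plain LinearSVC is one possible reading of the comment, not something the commenter spelled out.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical, shortened training data (not the full set from the question)
X_train = np.array(["new york is a hell of a town",
                    "nyc is nice",
                    "london is in the uk. they speak english there.",
                    "london is in great britain. it rains a lot in britain."])
# One label per document, as a flat 1-D array instead of a list of lists
y_train = np.array([0, 0, 1, 1])

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),  # handles a single binary target directly
])
classifier.fit(X_train, y_train)

# predict returns one class label per input document
predicted = classifier.predict(np.array(["it's a nice day in nyc"]))
print(predicted)
```

With a flat label array, each prediction is a single class index, so it can be used to index target_names directly.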

+4

The following link does almost exactly what you are after, using sklearn's FeatureUnion: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html – joc

+0

Take a look at this answer: http://stackoverflow.com/questions/39001956/sklearn-pipeline-transformation-on-only-certain-features/39009125#39009125 – maxymoo

Answer

3

As the comments suggest, this can be done by combining a FunctionTransformer, a Pipeline, and a FeatureUnion.

import numpy as np 
from sklearn.pipeline import Pipeline, FeatureUnion 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 
from sklearn.preprocessing import FunctionTransformer 

X_train = np.array(["new york is a hell of a town", 
        "new york was originally dutch", 
        "new york is also called the big apple", 
        "nyc is nice", 
        "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.", 
        "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.", 
        "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.", 
        "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",]) 
y_train = np.array([[0],[0],[0],[0],[1],[1],[1],[1]]) 

X_test = np.array(["it's a nice day in nyc", 
        'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.' 
        ]) 
target_names = ['Class 1', 'Class 2'] 


def get_text_length(x): 
    # Return a column vector of shape (n_samples, 1) with each document's character count
    return np.array([len(t) for t in x]).reshape(-1, 1) 

classifier = Pipeline([ 
    ('features', FeatureUnion([ 
     ('text', Pipeline([ 
      ('vectorizer', CountVectorizer(min_df=1,max_df=2)), 
      ('tfidf', TfidfTransformer()), 
     ])), 
     ('length', Pipeline([ 
      ('count', FunctionTransformer(get_text_length, validate=False)), 
     ])) 
    ])), 
    ('clf', OneVsRestClassifier(LinearSVC()))]) 

classifier.fit(X_train, y_train) 
predicted = classifier.predict(X_test) 
print(predicted) 

This adds the length of the text to the features used by the classifier.
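One way to check that the length really ends up among the features is to run only the feature step and inspect its output. The sketch below assumes three hypothetical toy documents and default CountVectorizer settings. It fits the FeatureUnion on its own and compares its output with the text-only features.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import FunctionTransformer


def get_text_length(x):
    # One row per document, one column holding its character count
    return np.array([len(t) for t in x]).reshape(-1, 1)


# Hypothetical toy documents, used only for this check
docs = np.array(["nyc is nice", "london is in the uk", "new york is big"])

text_features = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
features = FeatureUnion([
    ('text', text_features),
    ('length', FunctionTransformer(get_text_length, validate=False)),
])

# The union stacks the tf-idf columns and the length column side by side
combined = features.fit_transform(docs)
# Text-only features on the same documents, for comparison
text_only = text_features.fit_transform(docs)

print(text_only.shape, combined.shape)
print(combined.toarray()[:, -1])  # last column: the text lengths
```

The combined matrix has exactly one more column than the text-only matrix, and that last column holds the raw character counts. Those counts are much larger than typical tf-idf values, and LinearSVC is sensitive to feature scale, so scaling the length column may be worth trying.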
