In this final task, my goal is to predict the Amazon score (1-5) from the review text, which is a multiclass text classification problem. This post covers five pipeline models built with scikit-learn. The next post will introduce Keras for building a sequential deep learning model.


Scikit Learn
First, let's apply a typical test/train split:

# sklearn.cross_validation was removed in scikit-learn 0.20; these now live in sklearn.model_selection
from sklearn.model_selection import KFold, train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np
from scipy.stats import sem
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn import metrics

# Hold out 20% of the reviews as a test set
X, y = data2['text_cln'], data2['Score']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)

Next, let's set up our cross-validation:

def evaluate_cross_validation(clf, X, y, K):
  # K shuffled folds; random_state makes the split reproducible
  cv = KFold(n_splits=K, shuffle=True, random_state=0)
  scores = cross_val_score(clf, X, y, cv=cv)
  print(scores)
  # Mean accuracy across folds and the standard error of that mean
  print("Mean score:{0:.3f}(+/-{1:.3f})".format(np.mean(scores),sem(scores)))
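The reviews are heavily skewed toward 5 stars (roughly 64% of the test set reported further down), so one variation worth considering is a stratified split that keeps the class ratio similar in every fold. The sketch below is only an illustration on made-up toy data; the function name `evaluate_stratified_cv` and the toy reviews are my own assumptions, not part of the original analysis.

```python
import numpy as np
from scipy.stats import sem
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 12 five-star and 6 one-star reviews (imbalanced on purpose)
texts = ["great tasty snack love it"] * 12 + ["awful stale terrible waste"] * 6
labels = np.array([5] * 12 + [1] * 6)

def evaluate_stratified_cv(clf, X, y, k):
    """Same reporting as evaluate_cross_validation, but folds preserve the class ratio."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score:{0:.3f}(+/-{1:.3f})".format(np.mean(scores), sem(scores)))
    return scores

toy_clf = Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])
toy_scores = evaluate_stratified_cv(toy_clf, texts, labels, 3)
```

With a plain KFold, a small or rare class can end up under-represented in some folds; stratification avoids that at no extra cost.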

Next, we create five pipelines, each pairing a vectorizer with a classifier:

  • CountVectorizer with Multinomial Naive Bayes
  • HashingVectorizer with Multinomial Naive Bayes
  • TfidfVectorizer with Multinomial Naive Bayes
  • TfidfVectorizer with Logistic Regression
  • TfidfVectorizer with LinearSVC

    Creating multiple models through pipelines:

    clf_1 = Pipeline([('vect', CountVectorizer(decode_error='ignore', strip_accents='unicode', stop_words='english')),
    ('clf', MultinomialNB()),
    ])
    
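    clf_2 = Pipeline([
    # alternate_sign=False replaces the non_negative=True flag, which was removed in scikit-learn 0.21;
    # it keeps the hashed features non-negative, as MultinomialNB requires
    ('vect', HashingVectorizer(decode_error='ignore', strip_accents='unicode', stop_words='english', alternate_sign=False)),
    ('clf', MultinomialNB()),
    ])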
    
    clf_3 = Pipeline([
    ('vect', TfidfVectorizer(decode_error='ignore', strip_accents='unicode', stop_words='english')),
    ('clf', MultinomialNB(alpha=.01)),
    ])
    
    clf_4 = Pipeline([
    ('vect', TfidfVectorizer(decode_error='ignore', strip_accents='unicode', stop_words='english')),
    ('clf', LogisticRegression()),
    ])
    
    clf_5 = Pipeline([
    ('vect', TfidfVectorizer(decode_error='ignore', strip_accents='unicode', stop_words='english')),
    ('clf', LinearSVC()),
    ])
    
    #Evaluate each model using the K-fold cross-validation with 5 folds
    clfs = [clf_1, clf_2, clf_3, clf_4, clf_5]
    
    for clf in clfs:
         evaluate_cross_validation(clf, data2['text_cln'], 
         data2['Score'], 5)
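Each pipeline above chains a vectorizer and a classifier, so fitting or predicting with the pipeline runs both steps on the raw text. The toy sketch below checks that a pipeline gives the same predictions as performing the two steps by hand. The reviews and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews, not the Amazon data
train_texts = ["love these chips", "great snack", "tasted awful", "stale and awful"]
train_labels = [5, 5, 1, 1]
new_texts = ["awful chips", "great chips"]

# Pipeline: vectorize and classify in one object
pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# Manual equivalent: fit the vectorizer, fit the classifier, then transform + predict
vect = TfidfVectorizer()
X_train_vec = vect.fit_transform(train_texts)
svc = LinearSVC(random_state=0).fit(X_train_vec, train_labels)
manual_pred = svc.predict(vect.transform(new_texts))

print(pipe_pred, manual_pred)
```

The practical benefit in cross-validation is that the vectorizer is refit on each training fold, so no vocabulary information leaks from the validation fold.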
    


    Now, after we evaluate each pipeline, we get an accuracy score for each of the five folds and the mean accuracy:


    clf1
    [ 0.70503382 0.70498984 0.70557036 0.70416304 0.70628903] Mean score:0.705(+/-0.000)

    clf2
    [ 0.63822994 0.63876648 0.63848502 0.64095663 0.63752309] Mean score:0.639(+/-0.001)

    clf3
    [ 0.70141876 0.70185855 0.70214881 0.70458524 0.7019087 ] Mean score:0.702(+/-0.001)

    clf4
    [ 0.74205522 0.742926 0.7429348 0.74464118 0.74180667] Mean score:0.743(+/-0.000)

    clf5
    [ 0.75884635 0.7610277 0.76046477 0.7631123 0.75872988] Mean score:0.760(+/-0.001)
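As a quick check on how those summary lines come about, the sketch below recomputes the mean and the standard error of the mean (the value after +/-) from the clf5 fold scores printed above. The variable names are my own.

```python
import numpy as np
from scipy.stats import sem

# Fold accuracies for clf5, copied from the output above
clf5_scores = np.array([0.75884635, 0.7610277, 0.76046477, 0.7631123, 0.75872988])

mean_score = np.mean(clf5_scores)
# sem = sample standard deviation (ddof=1) divided by sqrt(number of folds)
std_err = sem(clf5_scores)

summary = "Mean score:{0:.3f}(+/-{1:.3f})".format(mean_score, std_err)
print(summary)
```

The tiny +/- values only say that the five folds agree with each other; they are not a confidence interval on how the model will do on new reviews.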

    It's time to train each model on the training split and evaluate it on the held-out test set:

    def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
       clf.fit(X_train, y_train)
    
       print("Accuracy on training set:")
       print(clf.score(X_train, y_train))
       print("Accuracy on testing set:")
       print(clf.score(X_test, y_test))
    
       y_pred = clf.predict(X_test)
    
       print("Classification Report:")
       print(metrics.classification_report(y_test, y_pred))
       print("Confusion Matrix:")
       print(metrics.confusion_matrix(y_test, y_pred))
    

    Now we can review the accuracy, classification report, and confusion matrix for each clf:

    clf1

    train_and_evaluate(clf_1, X_train, X_test, y_train, y_test)
    


    Accuracy on training set:
    0.731330385278
    Accuracy on testing set:
    0.705033819739

    Classification Report:
    class        precision    recall  f1-score   support
    1                 0.56      0.64      0.59     10267
    2                 0.47      0.20      0.28      6185
    3                 0.43      0.28      0.34      8450
    4                 0.40      0.34      0.37     16229
    5                 0.81      0.89      0.84     72560
    avg / total       0.68      0.71      0.69    113691
    
    Confusion Matrix:
    
    [[ 6533   689   654   479  1912]
    [ 1443  1226   968   787  1761]
    [ 1089   296  2371  1804  2890]
    [  767   141   770  5573  8978]
    [ 1883   235   797  5192 64453]]
    

    clf2

    train_and_evaluate(clf_2, X_train, X_test, y_train, y_test)
    


    Accuracy on training set:
    0.638943801497
    Accuracy on testing set:
    0.638229939045

    Classification Report:
    class        precision    recall  f1-score   support
    1                 1.00      0.00      0.00     10267
    2                 0.00      0.00      0.00      6185
    3                 0.00      0.00      0.00      8450
    4                 0.00      0.00      0.00     16229
    5                 0.64      1.00      0.78     72560
    avg / total       0.50      0.64      0.50    113691
    
    Confusion Matrix:
    [[    1     0     0     0 10266]
    [    0     0     0     0  6185]
    [    0     0     0     0  8450]
    [    0     0     0     0 16229]
    [    0     0     0     0 72560]]
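Reading this matrix by column shows what clf2 is actually doing. The short sketch below (variable names are mine) counts the predictions per class and compares the accuracy with the share of 5-star reviews in the test set.

```python
import numpy as np

# clf2 confusion matrix from above: rows = true score 1..5, columns = predicted score 1..5
cm_clf2 = np.array([
    [1, 0, 0, 0, 10266],
    [0, 0, 0, 0, 6185],
    [0, 0, 0, 0, 8450],
    [0, 0, 0, 0, 16229],
    [0, 0, 0, 0, 72560],
])

predicted_counts = cm_clf2.sum(axis=0)            # how often each score was predicted
accuracy = np.trace(cm_clf2) / cm_clf2.sum()      # correct predictions / all predictions
majority_share = cm_clf2.sum(axis=1).max() / cm_clf2.sum()  # share of the largest true class

print(predicted_counts)
print(round(accuracy, 4), round(majority_share, 4))
```

All but one of the predictions are 5 stars, so the 0.638 accuracy is essentially what you would get by always guessing the most common score.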
    

    clf3

    train_and_evaluate(clf_3, X_train, X_test, y_train, y_test)
    


    Accuracy on training set:
    0.743461979097
    Accuracy on testing set:
    0.701418757861

    Classification Report:
    class        precision    recall  f1-score   support
    1                 0.74      0.44      0.55     10267
    2                 0.76      0.11      0.20      6185
    3                 0.67      0.12      0.20      8450
    4                 0.61      0.12      0.21     16229
    5                 0.70      0.99      0.82     72560
    avg / total       0.69      0.70      0.63    113691
    
    Confusion Matrix:
    [[ 4479   106   141    94  5447]
    [  637   708   144   190  4506]
    [  426    59   984   424  6557]
    [  231    28   103  2005 13862]
    [  313    31    93   554 71569]]
    

    clf4

    train_and_evaluate(clf_4, X_train, X_test, y_train, y_test)
    


    Accuracy on training set:
    0.761820992473
    Accuracy on testing set:
    0.742064015621

    Classification Report:
    class        precision    recall  f1-score   support
    1                 0.68      0.67      0.68     10267
    2                 0.60      0.19      0.29      6185
    3                 0.54      0.27      0.36      8450
    4                 0.53      0.24      0.33     16229
    5                 0.78      0.97      0.86     72560
    avg / total       0.71      0.74      0.70    113691
    
    Confusion Matrix:
    [[ 6911   336   304   240  2476]
    [ 1520  1183   762   471  2249]
    [  791   284  2259  1239  3877]
    [  377    95   557  3926 11274]
    [  583    79   335  1476 70087]]
    

    clf5

    train_and_evaluate(clf_5, X_train, X_test, y_train, y_test)
    


    Accuracy on training set:
    0.814890393458
    Accuracy on testing set:
    0.758846346677

    Classification Report:
    class        precision    recall  f1-score   support
    1                 0.69      0.71      0.70     10267
    2                 0.62      0.29      0.40      6185
    3                 0.57      0.34      0.42      8450
    4                 0.57      0.30      0.39     16229
    5                 0.80      0.96      0.87     72560
    avg / total       0.73      0.76      0.73    113691
    
    Confusion Matrix:
    [[ 7305   391   336   284  1951]
    [ 1452  1821   678   464  1770]
    [  790   401  2859  1168  3232]
    [  404   157   670  4891 10107]
    [  687   176   482  1817 69398]]
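The classification report and the confusion matrix carry the same information. As a sketch of how they relate, the snippet below recomputes per-class recall and precision for clf5 straight from the matrix above (the variable names are mine).

```python
import numpy as np

# clf5 confusion matrix from above: rows = true score 1..5, columns = predicted score 1..5
cm_clf5 = np.array([
    [7305, 391, 336, 284, 1951],
    [1452, 1821, 678, 464, 1770],
    [790, 401, 2859, 1168, 3232],
    [404, 157, 670, 4891, 10107],
    [687, 176, 482, 1817, 69398],
])

correct = np.diag(cm_clf5)
recall = correct / cm_clf5.sum(axis=1)      # correct / all reviews that truly have this score
precision = correct / cm_clf5.sum(axis=0)   # correct / all reviews predicted as this score
accuracy = correct.sum() / cm_clf5.sum()

for score, r, p in zip(range(1, 6), recall, precision):
    print("score {0}: recall={1:.2f} precision={2:.2f}".format(score, r, p))
print("accuracy={0:.4f}".format(accuracy))
```

Row 4 makes the weak spot easy to see: 10107 of the 16229 true 4-star reviews were predicted as 5 stars.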
    


    [Figure: logistic regression]

    [Figure: LinearSVC. Ref: https://www.jvrb.org/past-issues/3.2006/760]

    Now, let's see how well clf4 (logistic regression) and clf5 (LinearSVC) predict on a few real Amazon reviews. I wanted to use one of the neat Amazon product review web scrapers out there, but realized that I needed my own AWS account to access the API and get a token. I guess that'll go on my bucket list. Here are some resources for those of you interested:

  • https://www.scrapehero.com/how-to-scrape-amazon-product-reviews/
  • https://blog.hartleybrody.com/scrape-amazon/
  • https://gist.github.com/dzhou/2632394

    Now, back to Cheetos. I chose just four reviews of varying star ratings and wrote them into a tiny corpus:

    corpus = [
      "Perfect couldnt be any better it came in under 3 and a half days and it taste so good like wow and it feals like it has more then regular bags just amazing i recomend getting it",
      "husband loves these but i can't take the orange fingers. still he loves them so....",
      "They smelled and tasted like dog poop",
      "Theyre are good. But youre paying 10 Dollars for a bag of chips."
    ]
    

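    Now, just like with the model pipelines, we can vectorize this corpus. Note that each fitted pipeline already contains its own vectorizer, so the predictions further down take the raw strings directly and do not use this standalone vect: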

    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,stop_words='english')
    vect = tf_vectorizer.fit_transform(corpus)
    

    LinearSVC doesn't offer a predict_proba() method, although I did read that "scikit-learn provides CalibratedClassifierCV which can be used to solve this problem: it allows to add probability output to LinearSVC or any other classifier which implements decision_function method". I tried running this as my fourth classifier, but it took just about forever (sorry, didn't time it) on my MacBook Pro, so I kinda gave up and decided to kick the tires on ye olde logistic regression instead.
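For reference, here is a minimal sketch of what the CalibratedClassifierCV approach looks like. It runs on a handful of made-up reviews rather than the Amazon data, and the names `toy_texts`, `toy_labels`, and `calibrated_svc` are my own; it is not the configuration I tried (and abandoned) above.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews: 6 positive (5 stars) and 6 negative (1 star)
toy_texts = [
    "love these chips", "great crunchy snack", "so tasty and fresh",
    "perfect cheesy flavor", "amazing would buy again", "delicious and fresh",
    "tasted awful", "stale and bland", "terrible smell",
    "waste of money", "awful stale chips", "bland and terrible",
]
toy_labels = [5] * 6 + [1] * 6

# The wrapper fits LinearSVC on cv folds and learns a mapping from
# decision_function scores to probabilities
calibrated_svc = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", CalibratedClassifierCV(LinearSVC(random_state=0), cv=3)),
])
calibrated_svc.fit(toy_texts, toy_labels)

toy_probas = calibrated_svc.predict_proba(["stale awful chips", "great tasty snack"])
print(calibrated_svc.classes_)
print(toy_probas)
```

Because the wrapper refits the base classifier once per fold, training takes several times as long as a single LinearSVC fit, which is consistent with the slowdown I ran into on the full dataset.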

    Prediction for the LinearSVC which had the highest training accuracy:

    predicted = clf_5.predict(corpus)
    print(predicted)
    


    [5 5 1 4]


    Prediction for the Logistic Regression model, which had the second highest training accuracy. It's interesting to see that it predicted the fourth review, "Theyre are good. But youre paying 10 Dollars for a bag of chips", as a 5-star rather than a 4-star.

    predicted=clf_4.predict(corpus)
    probas = clf_4.predict_proba(corpus)
    print(predicted, probas)
    


    [5 5 1 5]

    [[ 3.75934195e-04 3.63309774e-03 2.19956217e-03 8.07333606e-02 9.13058045e-01]
    [ 2.03279754e-04 1.85061482e-03 8.29297196e-03 3.92534317e-02 9.50399702e-01]
    [ 4.43243634e-01 9.96319899e-02 1.78067717e-01 2.84542527e-02 2.50602406e-01]
    [ 1.15504075e-01 9.87625810e-02 1.83887123e-01 2.57270762e-01 3.44575459e-01]]
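To make the probability table easier to read, the sketch below maps each row back to a star rating. It assumes the columns are in the order 1 to 5 (which is what clf_4.classes_ would give for these labels) and also pulls out the runner-up score for each review.

```python
import numpy as np

# Assumed column order, matching clf_4.classes_ for scores 1..5
classes = np.array([1, 2, 3, 4, 5])

# Probabilities copied from the output above, one row per review
probas = np.array([
    [3.75934195e-04, 3.63309774e-03, 2.19956217e-03, 8.07333606e-02, 9.13058045e-01],
    [2.03279754e-04, 1.85061482e-03, 8.29297196e-03, 3.92534317e-02, 9.50399702e-01],
    [4.43243634e-01, 9.96319899e-02, 1.78067717e-01, 2.84542527e-02, 2.50602406e-01],
    [1.15504075e-01, 9.87625810e-02, 1.83887123e-01, 2.57270762e-01, 3.44575459e-01],
])

predicted = classes[np.argmax(probas, axis=1)]            # most probable score per review
runner_up = classes[np.argsort(probas, axis=1)[:, -2]]    # second most probable score

print(predicted)
print(runner_up)
```

For the fourth review the model is far from certain: 5 stars gets about 0.34 and 4 stars about 0.26, so the two classifiers disagreeing here is not surprising.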

    In summary, here are some good next steps we could take for a follow-on model:

    1. Use GridSearchCV to better tune the hyperparameters for Logistic Regression and LinearSVC
    2. Use the confusion matrix to identify the classes that suffered from low recall (both models had poor recall on the 2-, 3-, and 4-star reviews; recall is the share of a class's actual reviews that the model labels correctly)
    3. Potentially try a deep learning model...

    Which leads me to my next post, Part VI.
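For the first of those next steps, a minimal sketch of a grid search over a pipeline might look like the following. The toy reviews and the parameter values in `param_grid` are assumptions picked for illustration, not tuned choices for the Amazon data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews: 6 positive (5 stars) and 6 negative (1 star)
toy_texts = [
    "love these chips", "great crunchy snack", "so tasty and fresh",
    "perfect cheesy flavor", "amazing would buy again", "delicious and fresh",
    "tasted awful", "stale and bland", "terrible smell",
    "waste of money", "awful stale chips", "bland and terrible",
]
toy_labels = [5] * 6 + [1] * 6

pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Given how imbalanced the scores are, a scoring option such as "f1_macro" may be a better target than plain accuracy when tuning on the real data.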