Hi friends, this is part 2 of the NLP analysis based on Amazon reviews. Here we will describe the machine learning stage that we will apply to our data, specifically the Random Forest model.
In part 1 we obtained the data, cleaned it and built a bag of words after tokenizing the reviews and applying a vectorizer.
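As a quick recap, here is a minimal bag-of-words sketch with scikit-learn's CountVectorizer (purely illustrative: the toy corpus is made up, get_feature_names_out assumes sklearn >= 1.0, and the real pipeline lives in part 1):
# Toy recap of part 1: turn raw reviews into a bag-of-words dataframe
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus = ['great product, loved it', 'terrible, it broke in two days']  # made-up reviews
cv = CountVectorizer()
X_bow = cv.fit_transform(corpus)  # sparse document-term matrix
X_toy_bow_df = pd.DataFrame(X_bow.toarray(), columns=cv.get_feature_names_out())  # mimics the real X_features_df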
Let's start by explaining what a Random Forest is. For now, I will just say that it is an ensemble of decision trees. So, what is a decision tree?
Decision Tree
A decision tree is a supervised learning algorithm mainly used in classification problems. It divides the data into two or more homogeneous sets based on the most significant differentiators in the input variables: the tree identifies the most significant variable and the value of it that produces the most homogeneous population sets. All input variables and all possible split points are evaluated, and the one with the best result is chosen (a short code sketch follows the lists below).
A picture is worth a thousand words, ok?
Advantages:
Easy to explain and visualize the model results
Automatic feature selection (not affected by irrelevant features)
No need to normalize / scale the data
Missing values have little impact on the result
Disadvantages:
Prone to overfitting
Sensitive to changes in data (result can change dramatically)
Long training times
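Here is the promised sketch: a single decision tree fitted with scikit-learn on a made-up toy dataset (the data and parameters are illustrative assumptions, not part of our reviews pipeline):
# A minimal decision tree on toy data (illustrative only)
from sklearn.tree import DecisionTreeClassifier
X_toy = [[0, 1], [1, 1], [1, 0], [0, 0]]  # two input variables
y_toy = [1, 1, 0, 0]                      # binary labels: here they follow the second variable
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_toy, y_toy)
print(tree.predict([[1, 1]]))  # -> [1]: the tree found the split on the second variable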
Now that we understand what a decision tree is, let's go to our forest :)
Random Forest
One of the problems that appeared with the creation of decision trees is that, if we give it enough depth, the tree tends to “memorize” the solutions instead of generalizing the learning => overfitting :( The solution to avoid this is to create many trees and make them work together => the Forest :)
A Random Forest is an ensemble of decision trees combined with bagging (bootstrap aggregating). The model takes the training data and gives each tree a bootstrap sample: a random sample of the same size drawn with replacement, so some samples are randomly repeated in each subset. In addition, each tree considers only a random subset of the features at each split, which makes the trees less correlated with each other.
For classification problems, the results of the decision trees are usually combined using soft voting; this means giving more weight to the results in which the trees are more confident.
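To make soft voting concrete, here is a minimal sketch (my illustration, not code from the notebook) that averages the class probabilities of already-fitted trees and picks the most probable class:
# Soft voting: average each tree's class probabilities, then take the argmax
import numpy as np
def soft_vote(trees, X):
    # trees: list of fitted classifiers exposing predict_proba; X: feature matrix
    avg_probas = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return np.argmax(avg_probas, axis=1)
This averaging of probabilistic predictions is also what scikit-learn's RandomForestClassifier does internally when it predicts.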
Advantages:
Generalizes better, dramatically reduces DT overfitting
Good performance on unbalanced data
Can handle large datasets (many features and many samples)
Missing values have little impact on the result
It is widely used to extract the most important features of a dataset
Disadvantages:
Features need to have predictive power
It behaves like a black box: the result obtained can hardly be interpreted
With this brief introduction to the random forest model, let's move on to our classification problem.
Build the train and test data
In part 1 we obtained our features in the form of a Bag of Words and generated a pandas dataframe with them (X_features_df); we also have our labels (y_amz_rev_part).
# Split train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_features_df, y_amz_rev_part, test_size=0.2)
# Class balance ??
print('Class 0-Bad review: {} - Class 1-Good review: {}'.format(y_train.count(0), y_train.count(1)))
We check the balance of our classes:
Class 0-Bad review: 5054 - Class 1-Good review: 4946
Build the RF model, fit and predict
RF has a large number of hyperparameters for tuning; we'll explain three of the most important, which we'll use later in our practice.
n_estimators : The number of trees in the forest. Default=100
max_depth : The maximum depth of the tree. If None (default), then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
n_jobs : The number of jobs to run in parallel (default None, meaning 1). Set it to -1 to use all processors.
You have the detailed list in the sklearn documentation.
# Now, build a RandomForest with parameters, and fit the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)
We said that one of the advantages of RF is that we can use it to obtain the most important features:
# view the 10 most important features from the model.
# zip and sort (reverse)
sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]
# Predict with X_test and obtain score metrics
from sklearn.metrics import precision_recall_fscore_support as score
y_pred = rf_model.predict(X_test)
precision, recall, fscore, support = score(y_test, y_pred, average='binary')
print('Precision: {} / Recall: {} / Fscore: {} / Accuracy: {}'.format(
round(precision,3), round(recall,3), round(fscore,3), round((y_pred==y_test).sum() / len(y_pred),3)))
Precision: 0.792 / Recall: 0.803 / Fscore: 0.797 / Accuracy: 0.796
Try to obtain a better score
Let's try changing the hyperparameters of our model (tuning) that we consider most important, to try to improve our prediction level. For this we'll use the GridSearchCV method provided by sklearn; we pass it the whole dataset, since the method itself fits and evaluates with cross-validation. Let's go to the code:
from sklearn.model_selection import GridSearchCV
# using GridSearch with 9 combinations
rf = RandomForestClassifier()
# hyperparameters setting
param = {'n_estimators': [10, 100, 150],
'max_depth': [10, 30, 60]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_features_df, y_amz_rev_part)
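As a side note, GridSearchCV also exposes the winning combination and its mean cross-validated score directly:
# Quick check of the best combination found (values taken from our run below)
print(gs_fit.best_params_)  # {'max_depth': 60, 'n_estimators': 150}
print(round(gs_fit.best_score_, 3))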
To see the results, we will create a dataframe ordered by one of the columns provided by GridSearch ('mean_test_score'):
import pandas as pd
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
Looking at the top 5, we see that we have to use 150 estimators with a max depth of 60 to obtain the best combination of hyperparameters for our model, so we will regenerate our RF with these values :)
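As a sketch of that final step (the names rf_best and rf_best_model are mine, not from the notebook):
# Regenerate the RF with the best hyperparameters found by GridSearchCV
rf_best = RandomForestClassifier(n_estimators=150, max_depth=60, n_jobs=-1)
rf_best_model = rf_best.fit(X_train, y_train)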
Take into account that the prediction values may be low because we are using very little data (12 thousand reviews) out of the original dataset (4 million reviews).
OK, now it's time for you to do your own exploration with your Jupyter notebook and explore all the corners of this fascinating AI topic.
As always, the GitHub link with the complete sentiment analysis Jupyter notebook is attached so that you can verify the code for yourself.
Your comments and/or your like are appreciated ;)