```python
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Keep negations and negatory contractions out of the stopword list
stop_words = set(stopwords.words('english')) - {
    'not', 'no', 'couldn', "couldn't", "wouldn't", "shouldn't", "isn't",
    "aren't", "wasn't", "weren't", "don't", "doesn't", "hadn't", "hasn't",
    "won't", "can't", "mightn't", "needn't", "nor", "shouldn", "should've", "should",
    "weren", "wouldn", "mustn't", "mustn", "didn't", "didn", "doesn", "did", "does", "hadn",
    "hasn", "haven't", "haven", "needn", "shan't"}

def preprocess(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text into words
    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the words back into a single string
    text = ' '.join(words)
    return text
```
Introduction
This article explores the use of deep learning models, such as feed-forward neural networks (FFNN) and recurrent neural networks (RNN), including the Bidirectional LSTM (BiLSTM), for sentiment analysis. This is one of our first deep learning projects, and we took the opportunity to build a strong foundation in the basics.
To analyze sentiment trends over an extended period, we use a substantial dataset of 3 million Amazon product reviews sourced from the Stanford Network Analysis Project (SNAP). The data spans 18 years, providing a rich longitudinal view of consumer sentiment. However, given our system constraints, we could use only about 0.1 million reviews for modeling.
Each review includes a numeric score representing the sentiment polarity: a negative review is labeled class 1, while a positive review is labeled class 2. This label serves as the foundational target for sentiment analysis.
Data Preprocessing
In the preprocessing pipeline for sentiment analysis, the following steps were performed using the NLTK library: punctuation removal, tokenization, stopword elimination, and lemmatization. Notably, the stopword list has been specifically curated to retain negations such as “not,” “no,” and other negatory contractions.
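As a quick illustration, the `preprocess` function defined earlier can be applied to a raw review (the review text below is made up for illustration):

```python
sample_review = "This product wasn't great, but I don't regret buying it!"
print(preprocess(sample_review))
# Note: punctuation is stripped before stopword removal, so contractions such as
# "wasn't" become "wasnt" and are kept because they no longer match any stopword.
```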
Data Visualization
This word cloud is characterized by a significant presence of highly positive terms such as “love,” “great,” “best,” “perfect,” and “excellent.” These words indicate strong satisfaction and enjoyment, commonly found in reviews that endorse a product. The words “highly recommend,” “amazing,” and “favorite” suggest that positive reviews often include recommendations and personal favoritism towards the products. The presence of words like “beautiful” and “enjoy” also emphasizes an emotional connection with the product.
The negative word cloud features words such as “disappoint,” “waste,” “poor,” “bad,” and “problem.” These strongly negative terms are indicative of dissatisfaction and issues with the products. Terms like “return” and “refund” suggest actions taken by dissatisfied customers. Words like “boring,” “dull,” and “worst” reflect critical opinions about the product’s quality or entertainment value.
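The word clouds themselves are built from the review text; a minimal sketch of how they could be generated is shown below, assuming the preprocessed reviews live in a pandas DataFrame `df` with `text` and `label` columns (these names are assumptions, not from the original code):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Assumed structure: df["text"] holds preprocessed reviews, df["label"] is 1 (negative) or 2 (positive)
positive_text = " ".join(df.loc[df["label"] == 2, "text"])
negative_text = " ".join(df.loc[df["label"] == 1, "text"])

for title, corpus in [("Positive reviews", positive_text), ("Negative reviews", negative_text)]:
    wc = WordCloud(width=800, height=400, background_color="white").generate(corpus)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()
```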
Traditional Machine Learning
After conducting a thorough evaluation, we concluded that the Random Forest model outperformed SVM in multiple metrics, making it the preferred baseline model. This initial selection lays the foundation for further exploration and refinement of sentiment analysis techniques.
Models | Hyper Parameters | Train Accuracy | Validation Accuracy | Test Accuracy |
---|---|---|---|---|
Random Forest with Count Vectorizer | Estimators : 200, Max Depth:20, Min Samples Split : 2 | 82% | 79% | 79% |
Random Forest with TF-IDF | Estimators : 200, Max Depth:20, Min Samples Split : 5 | 86% | 83% | 84% |
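As a reference for the better-performing baseline, a minimal scikit-learn sketch using the TF-IDF hyperparameters from the table above might look like this (variable names such as `X_train_text` and the `max_features` setting are assumptions, not values from the original):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# TF-IDF features followed by a Random Forest with the hyperparameters from the table above
rf_tfidf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000)),  # max_features is an illustrative assumption
    ("rf", RandomForestClassifier(n_estimators=200, max_depth=20, min_samples_split=5, n_jobs=-1)),
])

rf_tfidf.fit(X_train_text, y_train)        # X_train_text / y_train are assumed variable names
print(rf_tfidf.score(X_val_text, y_val))   # validation accuracy
```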
Neural Networks
In our pursuit of advancing practical expertise in deep learning applications, we executed the following steps in a phased manner within the Neural Networks framework:
Preprocessing in neural networks
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize a tokenizer with an out-of-vocabulary token
tokenizer = Tokenizer(oov_token="<UNK>")

# Fit the tokenizer on the training data to build the vocabulary
tokenizer.fit_on_texts(X_train)

# Add a special padding token to the word index with index 0
tokenizer.word_index['<PAD>'] = 0

# Convert the text data into sequences of token indices using the trained tokenizer
X_sequences_train = tokenizer.texts_to_sequences(X_train)
X_sequences = tokenizer.texts_to_sequences(X)

# Pad the sequences to ensure uniform length
# maxlen is set to 100: sequences longer than 100 tokens are truncated, shorter ones are padded
X_train = pad_sequences(X_sequences_train, maxlen=100)
X = pad_sequences(X_sequences, maxlen=100)
```
Feed Forward Network
Models | Data Size | Train Accuracy | Validation Accuracy | Test Accuracy | Comments |
---|---|---|---|---|---|
Feed Forward Network | 10,000 to 3.6 million | 51% | 51% | 51% | Poor model |
Its poor performance is likely due to the fact that the input words are represented as plain integer indices, which carry no semantic meaning. In traditional machine learning, the TF-IDF representation produced better results. The lesson learned is that we need to convert the input words into a better representation, such as one-hot encoding.
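For reference, the feed-forward model just discussed amounts to dense layers applied directly to the padded integer sequences; a minimal sketch under that assumption (layer sizes are illustrative, not the exact ones used):

```python
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense

# Feed-forward network on raw token indices (no embedding layer)
inputs = Input(shape=(100,))                 # padded sequences of length 100
x = Dense(64, activation="relu")(inputs)     # integer indices carry no semantic ordering
x = Dense(32, activation="relu")(x)
output = Dense(1, activation="sigmoid")(x)   # assumes labels mapped to 0 (negative) / 1 (positive)

ffn_model = keras.Model(inputs=inputs, outputs=output)
ffn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```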
One Hot Encoding
Having converted the input data into one-hot vectors, we can see that the number of parameters to be learned is also huge.
Models | Data Size | Train Accuracy | Validation Accuracy | Test Accuracy | Comments |
---|---|---|---|---|---|
Feed Forward Network with one hot encoding | 10,000 | System Crashed | System Crashed | System Crashed | System Crashed |
Immediately after this, the system crashed because of the size of the one-hot vectors and the computation they require; even a machine with 50 GB of RAM could not sustain it.
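A rough back-of-envelope estimate makes the crash unsurprising; the vocabulary size below is an assumed figure for illustration, since the exact number is not reported:

```python
# Rough memory estimate for one-hot encoded sequences (illustrative numbers)
vocab_size = 50_000       # assumed vocabulary size
seq_len = 100             # padded sequence length used throughout
n_samples = 10_000        # data size from the table above
bytes_per_float = 4       # float32

total_bytes = n_samples * seq_len * vocab_size * bytes_per_float
print(f"~{total_bytes / 1e9:.0f} GB")   # roughly 200 GB, far beyond 50 GB of RAM
```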
Therefore, the lesson learned is that we need to convert the input words into a better representation, such as an embedding layer. We then experimented with different architectures to explore which one fits the model best.
Neural Networks with Embedding Layer
Models | Data Size | Train Accuracy | Validation Accuracy | Test Accuracy | Comments |
---|---|---|---|---|---|
Feed Forward Network with Embedding Layer | 10,000 | 100% | 85% | 85% | Overfitting |
GRU with Embedding Layer | 10,000 | 100% | 80% | 80% | Overfitting |
LSTM with Embedding Layer | 10,000 | 100% | 80% | 80% | Overfitting |
Bi-LSTM with Embedding Layer | 10,000 | 100% | 81% | 81% | Overfitting |
Bi-LSTM with Embedding Layer | 100,000 | 100% | 86% | 86% | Overfitting |
The overfitting is likely because, given the size of the data, the embedding layer learns its parameters only from the vocabulary of the training data. During validation and testing, there may be many out-of-vocabulary words, leading to underperformance. When we increased the dataset to 100K, the accuracy improved. The lesson learned is that feeding the model pre-trained embeddings learned on a larger corpus should give us a better and more balanced model.
Neural Networks with Pre-Trained Embedding Layer
Models | Data Size | Train Accuracy | Validation Accuracy | Test Accuracy | Comments |
---|---|---|---|---|---|
Bi-LSTM with Pretrained twitter Embeddings of 50D | 10,000 | 90% | 85% | 84% | Decent Model |
Bi-LSTM with Pretrained twitter Embeddings of 200D | 100,000 | 94% | 87% | 85% | Decent Model |
Given the success of pre-trained embeddings with larger dimensions, we aim to retain this learning and proceed to incorporate a more advanced architecture. The lesson we learned is that larger-dimensional embeddings capture richer attributes of words, which is beneficial. As part of our efforts to enhance the model’s learning, we decided to add an attention layer. This layer allows the model to focus on specific words, further improving its performance.
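The `embedding_matrix_twitter_200d` used in the code below is not constructed in the snippets shown in this article; a plausible sketch, assuming the 200-dimensional GloVe Twitter vectors (`glove.twitter.27B.200d.txt`, path assumed) and the fitted `tokenizer` from earlier:

```python
import numpy as np

embedding_dim = 200
vocab_size = len(tokenizer.word_index) + 1

# Load the GloVe Twitter vectors into a word -> vector lookup
embeddings_index = {}
with open("glove.twitter.27B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Row i holds the vector for the word with index i; words missing from GloVe stay all-zero
embedding_matrix_twitter_200d = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix_twitter_200d[i] = vector
```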
Bi-LSTMs with Attention Layer
We explored three variants of attention. Two of them are custom-built, and the third uses the Keras Self-Attention layer.
## Custom Made Simple Attention Layer
```python
import tensorflow as tf

class Attention(tf.keras.Model):
    def __init__(self, units):
        super(Attention, self).__init__()
        # Initialize the attention mechanism's parameters
        self.W1 = tf.keras.layers.Dense(units, activation="tanh")  # Dense layer to compute attention scores
        self.V = tf.keras.layers.Dense(1)  # Dense layer for the attention weight calculation

    def call(self, features):
        # Compute attention scores
        score = self.W1(features)

        # Apply softmax activation to obtain attention weights
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # Compute context vector as the weighted sum of features
        context_vector = attention_weights * features

        return context_vector
```
## Custom Made Slightly Complicated Attention Layer
```python
class Attention_Update(tf.keras.Model):
    def __init__(self, units):
        super(Attention_Update, self).__init__()
        # Initialize parameters for the attention mechanism
        self.W1 = tf.keras.layers.Dense(units, activation="tanh")  # Dense layer to compute attention scores
        self.V = tf.keras.layers.Dense(1)  # Dense layer for attention weight calculation

    def build(self, input_shape):
        # Initialize trainable weights for the attention mechanism
        self.Wa = self.add_weight(name="att_weight_1", shape=(input_shape[-1], 8),
                                  initializer="normal")  # Weight matrix for context vector computation
        self.Wb = self.add_weight(name="att_weight_2", shape=(input_shape[-1], 8),
                                  initializer="normal")  # Weight matrix for input features
        self.b = self.add_weight(name="att_bias_2", shape=(input_shape[1], 8),
                                 initializer="zeros")  # Bias term for context vector computation
        super(Attention_Update, self).build(input_shape)

    def call(self, features):
        # Compute attention scores
        score = self.W1(features)

        # Apply softmax activation to obtain attention weights
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # Compute context vector as the weighted sum of features
        context_vector = attention_weights * features

        # Update the hidden state using the attention mechanism
        new_hidden_state = tf.tanh(tf.matmul(context_vector, self.Wa) + tf.matmul(features, self.Wb) + self.b)

        return new_hidden_state
```
```python
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, SimpleRNN, Dense
from tensorflow.keras.models import Model
from keras_self_attention import SeqSelfAttention

# Define input layer with shape (100,)
inputs = Input(shape=(100,))

# Create an embedding layer with pre-trained weights
# vocab_size: size of the vocabulary
# output_dim: dimension of the embedding space
# input_length: length of input sequences
# weights: pre-trained embedding matrix
# trainable: set to False to keep the pre-trained weights fixed during training
embedding_layer = Embedding(input_dim=vocab_size, output_dim=200, input_length=100,
                            weights=[embedding_matrix_twitter_200d], trainable=False)(inputs)

# Apply bidirectional LSTM to capture contextual information
bilstm = Bidirectional(LSTM(4, activation='tanh', return_sequences=True))(embedding_layer)

# Apply self-attention mechanism to focus on important features
context_vector = SeqSelfAttention(attention_activation='sigmoid')(bilstm)

# Apply SimpleRNN to capture sequential patterns
simplernn = SimpleRNN(4, activation="tanh")(context_vector)

# Output layer with sigmoid activation for binary classification
output = Dense(1, activation="sigmoid")(simplernn)

# Define the model
model_lstm_bi_embed_selfattention = Model(inputs=inputs, outputs=output)
```
Models | Data Size | Train Accuracy | Validation Accuracy | Test Accuracy | Comments |
---|---|---|---|---|---|
Bi-LSTM with Pretrained twitter Embeddings of 200D - Simple Attention | 100,000 | 94% | 90% | 90% | Good Model |
Bi-LSTM with Pretrained twitter Embeddings of 200D - Slightly Complicated Attention | 100,000 | 94% | 89% | 90% | Good Model |
Bi-LSTM with Pretrained twitter Embeddings of 200D - Keras Self Attention Layer | 100,000 | 94% | 90% | 90% | Good Model |
All these models performed equally well; however, our intention is to create an even better model. Therefore, we proceeded to develop a custom model consisting of two Bi-LSTMs with a simple attention layer, followed by an RNN, two feed-forward layers, and finally a sigmoid output layer.
Custom Made Neural Network Block
We feel this block can enhance sentence comprehension by learning relevant words and their dependencies within the sentence. It consists of several components that work together to achieve this goal:
Combining Two Bi-LSTMs: This increases its complexity and enables it to learn sentence context in both forward and backward directions. This helps to capture a more comprehensive understanding of the text.
Attention Layer: It focuses on relevant words within the sentence, allowing the model to concentrate on key information while disregarding irrelevant details. This mechanism helps to improve the overall accuracy of the model.
Simple RNN: It helps to learn and capture relevant parameters based on context. This facilitates the understanding of word dependencies within the sentence and enables the model to achieve more accurate sentiment analysis.
Notably, this block operates without a dense layer. Instead, it leverages Bi-LSTMs, attention mechanisms, and a simple RNN to achieve effective sentence comprehension and sentiment analysis. However, dense layers can be added to the model to introduce more complexity and enable interactions between words and their attributes, thus improving overall comprehension and analysis.
```python
from tensorflow import keras
from tensorflow.keras.layers import Flatten
from tensorflow.keras.regularizers import l2

# Define hyperparameters
lstm_units = 64
attention_units = 96
rnn_units = 64
dense_units = 128
learning_rate = 0.001

# Define input layer with shape (100,)
inputs = Input(shape=(100,))

# Create an embedding layer with pre-trained weights
embedding_layer = Embedding(input_dim=vocab_size, output_dim=200, input_length=100,
                            weights=[embedding_matrix_twitter_200d], trainable=False)(inputs)

# Apply bidirectional LSTM layers with regularization
bilstm = Bidirectional(LSTM(lstm_units, activation='tanh', return_sequences=True, kernel_regularizer=l2(0.0001)))(embedding_layer)
bilstm = Bidirectional(LSTM(lstm_units, activation='tanh', return_sequences=True, kernel_regularizer=l2(0.0001)))(bilstm)

# Apply attention mechanism
context_vector = Attention(attention_units)(bilstm)

# Apply SimpleRNN layer with regularization
simplernn = SimpleRNN(rnn_units, activation="tanh", return_sequences=True, kernel_regularizer=l2(0.0001))(context_vector)

# Flatten the output for feedforward layers
flatten = Flatten()(simplernn)

# Apply two feedforward layers with regularization
ffn = Dense(dense_units, activation='relu', kernel_regularizer=l2(0.001))(flatten)
ffn = Dense(dense_units, activation='relu', kernel_regularizer=l2(0.001))(ffn)

# Output layer with sigmoid activation for binary classification
output = Dense(1, activation="sigmoid")(ffn)

# Define the model
model_lstm_bi_embed_attention_complex_regularized_tuned = Model(inputs=inputs, outputs=output)

# Compile the model
optimizer = keras.optimizers.Adam(learning_rate)
model_lstm_bi_embed_attention_complex_regularized_tuned.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Print model summary
model_lstm_bi_embed_attention_complex_regularized_tuned.summary()
```
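The training call itself is not shown above; a minimal sketch of how the compiled model could be fit, with early stopping as a guard against overfitting (batch size, epoch count, the callback settings, and the `y_train`/`X_val`/`y_val` names are assumptions, not the exact values used):

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

history = model_lstm_bi_embed_attention_complex_regularized_tuned.fit(
    X_train, y_train,                 # padded sequences and 0/1 labels (assumed names)
    validation_data=(X_val, y_val),   # assumed held-out validation split
    epochs=20,
    batch_size=128,
    callbacks=[early_stop],
)
```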
Models | Data Size | Train Accuracy | Validation Accuracy | Test Accuracy | Comments |
---|---|---|---|---|---|
Final Custom block model with hyper tuned parameters | 100K | 91% | 91% | 91% | Balanced Model |
Conclusion
Neural networks, when fine-tuned, regularized, and expanded, have a great capacity to produce better models. While the traditional machine learning approach gave us 85% accuracy, the inherent flexibility of neural networks enables us to create more sophisticated models. We feel these models are better at capturing intricate patterns within the data, ultimately leading to superior performance.
We found great fulfillment in undertaking this project, prioritizing our learning journey beyond the confines of grading rubrics. It provided us with invaluable insights and a deeper understanding of the intricacies involved.
The entire code can be downloaded from this link.