Hugging Face - Representation Models

Exploration of Representation Models & Text Classification
Representation Models
Hugging Face
Sentiment Classification
Author

Arun Koundinya Parasa

Published

February 24, 2025

In this module, we will explore the basics of two approaches to text classification using Encoder Transformers:
- Using BERT
- Using Label Encodings (Sentence Transformers)

I encourage you to explore this article to understand the background and intuition behind these two models.

In this article, we will also delve into sentiment classification through the following methods:
- Without training
- Using BERT LLM and Logistic Regression
- Using Sentence Transformers LLM and Logistic Regression
- Creating labels when they are not available

Open In Colab

Installing & Loading Libraries

!pip install datasets
Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from datasets) (3.17.0)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.11/dist-packages (from datasets) (1.26.4)
Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.11/dist-packages (from datasets) (17.0.0)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (from datasets) (2.2.2)
Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.11/dist-packages (from datasets) (2.32.3)
Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.11/dist-packages (from datasets) (4.67.1)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Requirement already satisfied: fsspec<=2024.12.0,>=2023.1.0 in /usr/local/lib/python3.11/dist-packages (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets) (2024.10.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.11/dist-packages (from datasets) (3.11.12)
Requirement already satisfied: huggingface-hub>=0.24.0 in /usr/local/lib/python3.11/dist-packages (from datasets) (0.28.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from datasets) (24.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.11/dist-packages (from datasets) (6.0.2)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets) (2.4.6)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets) (1.3.2)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets) (25.1.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets) (6.1.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets) (0.2.1)
Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets) (1.18.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub>=0.24.0->datasets) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests>=2.32.2->datasets) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests>=2.32.2->datasets) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests>=2.32.2->datasets) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests>=2.32.2->datasets) (2025.1.31)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas->datasets) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas->datasets) (2025.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.17.0)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.4/485.4 kB 15.2 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 15.1 MB/s eta 0:00:00
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.5/143.5 kB 16.8 MB/s eta 0:00:00
Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.8/194.8 kB 22.9 MB/s eta 0:00:00
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-3.3.2 dill-0.3.8 multiprocess-0.70.16 xxhash-3.5.0
from google.colab import drive
import os

import pandas as pd

from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
import tensorflow as tf
import numpy as np

import datasets
from datasets import Dataset, DatasetDict

BERT Model - Sentiment Prediction w/o Training

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.

We can observe that all the related base files are loaded; this includes the model configuration, the model weights, and the vocabulary file.

Now, making a prediction is as simple as using ChatGPT.
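For instance, here is a minimal sketch of a single prediction (the example sentence is hypothetical), reusing the tokenizer and model loaded above:

sample = ["This product exceeded my expectations!"]  # hypothetical example review
inputs = tokenizer(sample, truncation=True, padding=True, return_tensors="tf")
logits = model(inputs).logits
pred = int(tf.argmax(logits, axis=1).numpy()[0])
print(model.config.id2label[pred])  # e.g. POSITIVE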

model.summary()
Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0 (unused)
                                                                 
=================================================================
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Since we have used TFAutoModelForSequenceClassification, the model comes with a default classifier head which predicts the output as either positive or negative.

os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

test_data = pd.read_csv('test_data_sample_complete.csv')
train_data = pd.read_csv('train_data_sample_complete.csv')

test_data = test_data.sample(n=1500, random_state=42)
train_data = train_data.sample(n=1500, random_state=42)

test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})
train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})

test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')
train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')

test_data = Dataset.from_pandas(test_data)
train_data = Dataset.from_pandas(train_data)
raw_data = DatasetDict()
raw_data["test"] = test_data
raw_data["train"] = train_data

print(raw_data)
DatasetDict({
    test: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 1500
    })
    train: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 1500
    })
})

This dataset contains Amazon reviews, downloaded from Kaggle, pre-processed, and stored for a project assignment I completed about a year ago.

Using the Datasets package, we have converted the dataset into the format required by Hugging Face Transformers.

Dataset.to_pandas(raw_data['test'])
class_index review_combined_lemma __index_level_0__
0 1 great book must preface saying not religious l... 23218
1 0 huge disappointment big time long term trevani... 20731
2 1 wayne tight cant hang turk album hot want howe... 39555
3 1 excellent read book elementary school probably... 147506
4 0 not anusara although book touted several anusa... 314215
... ... ... ...
1495 0 indifferent hears big dog little dog yap away ... 316639
1496 1 movie watch grandchild good movie little gore ... 91834
1497 1 patriot did win superbowl great piece memorabi... 176737
1498 0 11 stinker never fan series cd really bizarre ... 298198
1499 0 reason sampler no soul intensity orchestra pla... 277986

1500 rows × 3 columns

tokenized_ids = tokenizer(raw_data['test']['review_combined_lemma'], truncation=True,padding=True,return_tensors="tf", max_length=128)
model_output = model(tokenized_ids)

Here, we converted the raw text into numerical form using the tokenizer, which maps the text to token IDs using the downloaded vocabulary.

These tokens are passed into the model and the output is captured.

Since we are not training the model again, we tokenize only the test dataset.
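To make the tokenizer step concrete, here is a small inspection sketch (the shapes assume the 1,500-review test split and max_length=128):

print(tokenized_ids["input_ids"].shape)       # e.g. (1500, 128)
print(tokenized_ids["attention_mask"].shape)  # e.g. (1500, 128)
print(tokenizer.convert_ids_to_tokens(tokenized_ids["input_ids"][0][:10].numpy().tolist()))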

from sklearn.metrics import classification_report

tf.keras.backend.clear_session()

print(classification_report(raw_data['test']['class_index'], tf.argmax(model_output.logits,axis=1)))
              precision    recall  f1-score   support

           0       0.72      0.95      0.82       722
           1       0.93      0.67      0.78       778

    accuracy                           0.80      1500
   macro avg       0.83      0.81      0.80      1500
weighted avg       0.83      0.80      0.80      1500

Here, we can see that the default BERT foundation model gives us 80% accuracy, which is very good :).

BERT with Logistic Regression

from transformers import TFAutoModel

bert_model = TFAutoModel.from_pretrained(checkpoint)
bert_model.summary()
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Model: "tf_distil_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
=================================================================
Total params: 66362880 (253.15 MB)
Trainable params: 66362880 (253.15 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Since we will be training the classifier layer ourselves, we load the model without a classifier layer using TFAutoModel. You can compare the two model summaries to see the difference.
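To see the difference concretely, here is a small sketch that runs one hypothetical sentence through both heads: the classification model returns two logits, while the bare model returns per-token hidden states.

probe = tokenizer(["great value for the price"], return_tensors="tf")  # hypothetical probe sentence
print(model(probe).logits.shape)                  # (1, 2): one logit per sentiment class
print(bert_model(probe).last_hidden_state.shape)  # (1, seq_len, 768): one vector per token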

tokenized_ids = tokenizer(raw_data['train']['review_combined_lemma'], truncation=True,padding=True,return_tensors="tf", max_length=128)
bert_output = bert_model(tokenized_ids)

We are tokenizing the training dataset.

bert_output.last_hidden_state.numpy().mean(axis=1).shape
reshaped_output = bert_output.last_hidden_state.numpy().mean(axis=1)

Extracting the last-layer hidden states and mean-pooling them across tokens gives one 768-dimensional vector per review.
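Note that the plain mean above also averages over padding positions. A masked-mean variant (a sketch reusing tokenized_ids and bert_output from above; we keep using the plain mean below) averages only over real tokens:

mask = tokenized_ids["attention_mask"].numpy()[..., None]   # (batch, seq_len, 1)
hidden = bert_output.last_hidden_state.numpy()              # (batch, seq_len, 768)
masked_mean = (hidden * mask).sum(axis=1) / mask.sum(axis=1)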

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(reshaped_output, raw_data['train']['class_index'])
LogisticRegression()

We feed the BERT last-layer output into a logistic regression model and train the logistic regression.

from sklearn.metrics import classification_report
tokenized_ids = tokenizer(raw_data['test']['review_combined_lemma'], truncation=True,padding=True,return_tensors="tf", max_length=128)
bert_output = bert_model(tokenized_ids)
reshaped_output = bert_output.last_hidden_state.numpy().mean(axis=1)
y_pred = lr.predict(reshaped_output)
print(classification_report(raw_data['test']['class_index'], y_pred))
              precision    recall  f1-score   support

           0       0.84      0.86      0.85       722
           1       0.87      0.84      0.85       778

    accuracy                           0.85      1500
   macro avg       0.85      0.85      0.85      1500
weighted avg       0.85      0.85      0.85      1500

On the test dataset, we can see that the accuracy has jumped from 80% to 85% with a mere logistic classifier at the end. Isn’t it beautiful? The only drawback of this approach is that it consumes a lot of GPU memory.
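If GPU memory becomes a constraint, one workaround (a sketch; encode_in_batches is a hypothetical helper built from the tokenizer and bert_model above) is to run the forward pass in batches so peak memory is bounded by the batch size rather than the whole dataset:

def encode_in_batches(texts, batch_size=64):
    # Run BERT on batch_size reviews at a time and mean-pool each batch.
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], truncation=True, padding=True,
                          return_tensors="tf", max_length=128)
        out = bert_model(batch).last_hidden_state.numpy().mean(axis=1)
        chunks.append(out)
    return np.concatenate(chunks, axis=0)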

Sentence Transformers with Logistic Regression

os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

test_data = pd.read_csv('test_data_sample_complete.csv')
train_data = pd.read_csv('train_data_sample_complete.csv')

test_data = test_data.sample(n=10000, random_state=42)
train_data = train_data.sample(n=100000, random_state=42)

test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})
train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})

test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')
train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')

test_data = Dataset.from_pandas(test_data)
train_data = Dataset.from_pandas(train_data)
raw_data = DatasetDict()
raw_data["test"] = test_data
raw_data["train"] = train_data

print(raw_data)
DatasetDict({
    test: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 100000
    })
})

I have reloaded the dataset to demonstrate that sentence transformers can handle larger datasets more efficiently compared to the BERT model shown earlier. Sentence transformers effortlessly convert text into embeddings, reducing memory usage for tokenization and subsequent model processing.

Although both models are based on BERT, sentence transformers offer better memory efficiency.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_embeddings = model.encode(raw_data['train']['review_combined_lemma'], show_progress_bar=True)
test_embeddings = model.encode(raw_data['test']['review_combined_lemma'], show_progress_bar=True)

Loaded the model and converted both train data and test data into embeddings.
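Part of the memory efficiency comes from the fact that SentenceTransformer.encode batches the input internally and returns pooled NumPy vectors directly, so we never materialize token-level hidden states for the whole corpus. As a small sketch (using a slice of the test split for illustration; batch_size and normalize_embeddings are encode arguments in recent versions of the library):

subset = raw_data['test']['review_combined_lemma'][:256]  # small slice, just for illustration
emb = model.encode(subset, batch_size=64, normalize_embeddings=True, show_progress_bar=False)
print(emb.shape)  # (256, 768)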

train_embeddings.shape
(100000, 768)
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(train_embeddings, raw_data['train']['class_index'])
LogisticRegression(max_iter=1000)

Furthermore, we trained a lightweight logistic regression model using those embeddings.

from sklearn.metrics import classification_report

y_pred = lr.predict(test_embeddings)
print(classification_report(raw_data['test']['class_index'], y_pred))
              precision    recall  f1-score   support

           0       0.88      0.86      0.87      4972
           1       0.86      0.88      0.87      5028

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Here, we can see that our accuracy increased from 85% to 87%. However, we cannot directly attribute this improvement to sentence transformers alone, as both BERT and sentence transformers capture contextual information. That said, in my experience, sentence transformers are faster, more scalable, and more reliable.

Creating Labels Using Sentence Transformers

Let’s assume that instead of predicting positive or negative sentiment, we want to classify sentiment on a 5-point Likert scale. Sentence transformers come in handy here, as they allow us to explore the similarity between the labels and the input text, helping us tag the input accordingly.

label_embeddings = model.encode( ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"], show_progress_bar=True)
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(test_embeddings, label_embeddings)
array([[0.15364526, 0.17884818, 0.12452998, 0.1333864 , 0.08642562],
       [0.27978075, 0.19118355, 0.12162416, 0.17023209, 0.17311683],
       [0.07127699, 0.14324695, 0.07260972, 0.08962228, 0.07847168],
       ...,
       [0.15041098, 0.13494283, 0.01669509, 0.1404528 , 0.17394802],
       [0.00270087, 0.05694368, 0.01807276, 0.0432991 , 0.03236848],
       [0.13147888, 0.17518383, 0.14696477, 0.15878314, 0.17004938]],
      dtype=float32)

It’s simple: we have computed the cosine similarity between the input-text embeddings and the label embeddings we defined above.
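As a sanity check, cosine similarity is just the normalized dot product of the two vectors; this small sketch recomputes one entry by hand (first review vs. the "Negative" label):

a, b = test_embeddings[0], label_embeddings[1]
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual)  # should match the [0, 1] entry of the matrix above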

sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
y_pred
array([1, 0, 1, ..., 4, 1, 1])

labels = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
y_pred_labels = [labels[i] for i in y_pred]

test_df = Dataset.to_pandas(raw_data['test'])
y_pred_df = pd.DataFrame(y_pred_labels, columns=['Predicted_Labels'])

combined_df = pd.concat([test_df.reset_index(drop=True), y_pred_df.reset_index(drop=True)], axis=1)
combined_df
class_index review_combined_lemma __index_level_0__ Predicted_Labels
0 1 great book must preface saying not religious l... 23218 Negative
1 0 huge disappointment big time long term trevani... 20731 Very Negative
2 1 wayne tight cant hang turk album hot want howe... 39555 Negative
3 1 excellent read book elementary school probably... 147506 Positive
4 0 not anusara although book touted several anusa... 314215 Negative
... ... ... ... ...
9995 0 left many question read book recently diagnose... 105263 Positive
9996 1 liked wontrom reading rest great book no doubt... 334968 Negative
9997 1 recorder product durable bought fourth grader ... 355111 Very Positive
9998 1 like book elizabeth von arnim enjoy gardening ... 95143 Negative
9999 0 disappointed copy book offered sale catalog wa... 158471 Negative

10000 rows × 4 columns

Woohoo!!! We have created our own predicted labels using sentence transformers. Although they might not be completely accurate, they help us arrive at a quick conclusion when we have no labels for the input text.
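One optional refinement (a sketch; the phrasing below is my own, not part of the original labels) is to encode the labels as full sentences, which often aligns better with review embeddings than single words:

label_texts = [f"This review expresses a {s.lower()} opinion" for s in labels]  # hypothetical phrasing
label_embeddings_v2 = model.encode(label_texts)
y_pred_v2 = np.argmax(cosine_similarity(test_embeddings, label_embeddings_v2), axis=1)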

This programming article enhanced my understanding of how to use representation models in practice, providing new insights and uncovering exciting possibilities for leveraging embedding models. More to come—stay tuned!