from gensim.models import Word2Vec
sentences = [
    ["the", "king", "rules", "the", "kingdom", "with", "wisdom"],
    ["the", "queen", "leads", "the", "kingdom", "with", "grace"],
    ["a", "prince", "inherits", "the", "throne", "from", "his", "father"],
    ["a", "princess", "inherits", "the", "throne", "from", "her", "mother"],
    ["the", "king", "and", "the", "queen", "govern", "together"],
    ["the", "prince", "respects", "his", "father", "the", "king"],
    ["the", "princess", "admires", "her", "mother", "the", "queen"],
    ["the", "actor", "performs", "in", "a", "grand", "theater"],
    ["the", "actress", "won", "an", "award", "for", "her", "role"],
    ["a", "man", "is", "known", "for", "his", "strength", "and", "wisdom"],
    ["a", "woman", "is", "admired", "for", "her", "compassion", "and", "intelligence"],
    ["the", "future", "king", "is", "the", "prince"],
    ["daughter", "is", "the", "princess"],
    ["son", "is", "the", "prince"],
    ["only", "a", "man", "can", "be", "a", "king"],
    ["only", "a", "woman", "can", "be", "a", "queen"],
    ["the", "princess", "will", "be", "a", "queen"],
    ["queen", "and", "king", "rule", "the", "realm"],
    ["the", "prince", "is", "a", "strong", "man"],
    ["the", "princess", "is", "a", "beautiful", "woman"],
    ["the", "royal", "family", "is", "the", "king", "and", "queen", "and", "their", "children"],
    ["prince", "is", "only", "a", "boy", "now"],
    ["a", "boy", "will", "be", "a", "man"],
]
model = Word2Vec(sentences, vector_size=2, window=5, min_count=1, workers=4, sg=1, epochs=100, negative=10)
The prime purpose of this article is to demonstrate the creation of embedding vectors using gensim Word2Vec and to reconstruct the famous analogy king - man + woman = queen.
Here we have created a few random sentences and trained a Word2Vec model on them. vector_size=2 is used so that, to start with, we can visualize the embedding vectors in 2D space.
word_vectors_dict = {word: model.wv[word].tolist() for word in model.wv.index_to_key}
Here we have converted the embeddings into a dictionary for easy plotting.
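For instance, we could peek at a couple of entries to confirm that each word now maps to a two-element coordinate (just a quick sanity check; the actual numbers will differ on every training run):

# Inspect a few word -> 2D coordinate pairs (values vary per run)
for word in ["king", "queen", "man", "woman"]:
    print(word, word_vectors_dict[word])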
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for word in list(word_vectors_dict.keys()):
    coord = word_vectors_dict.get(word)
    if coord is not None:
        plt.scatter(coord[0], coord[1])
        plt.annotate(word, (coord[0], coord[1]))
plt.title("2D Visualization of Word Embeddings")
plt.show()
Based on this result we can see that a few words are grouped together, such as admires, rules, and respects, and also award, performs, and theater.
However, most of the other word groups don't make much sense at this point. The prime reason is the small embedding vector size; if we increase the vector size, more nuanced relationships can be captured.
Even so, this demonstrates that the Word2Vec model tries to gather similar words together.
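Even with this tiny model we can quantify how close two words are with gensim's similarity method. This is purely an exploratory check; with only two dimensions the scores are noisy and will vary from run to run:

# Cosine similarity between selected word pairs (values vary per run)
print(model.wv.similarity("king", "queen"))
print(model.wv.similarity("king", "theater"))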
Now let us try the famous example of king and queen.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)
[('be', 0.9999998211860657)]
Oops! It failed.
No worries!!! Let's increase the dimensions of the vector :)
model_highdimensional = Word2Vec(sentences, vector_size=16, window=5, min_count=1, workers=4, sg=1, epochs=1000, negative=10)
result = model_highdimensional.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)
[('queen', 0.8557833433151245)]
Hurray, we got it!!! :)
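Out of curiosity, we can also run the analogy in the opposite direction, queen - woman + man. This is just an exploration; the top hit will depend on the random seed and training run:

# Reverse analogy: queen - woman + man (result depends on the training run)
print(model_highdimensional.wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=1))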
Exploring Similarities
"mother", topn=5) model_highdimensional.wv.most_similar(
[('admires', 0.927798867225647),
('inherits', 0.8661039471626282),
('princess', 0.8615163564682007),
('from', 0.8585997819900513),
('throne', 0.8579676151275635)]
"father", topn=5) model_highdimensional.wv.most_similar(
[('respects', 0.9215425848960876),
('throne', 0.8824634552001953),
('prince', 0.861110270023346),
('from', 0.8586324453353882),
('inherits', 0.8541713356971741)]
Here, we can see that the most similar word to mother is admires, which makes perfect sense. Similarly, for father, the most similar word is respects.
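We can also probe the vocabulary in other ways, for example by asking gensim which word does not belong in a group. Again, this is purely exploratory, and the answer depends on how this particular model happened to train:

# Find the odd one out in a small group of words (result depends on the trained model)
print(model_highdimensional.wv.doesnt_match(["king", "queen", "prince", "theater"]))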
This also helps us explore the idea of finding the most similar words, akin to finding the most similar products in a recommendation system. Imagine that instead of sentences, we have a transaction table. We need to find the most similar products to recommend together.
Rather than using the traditional Apriori algorithm for association rule mining, we can apply Word2Vec to a large transaction dataset to derive a set of recommendations. This can be used in conjunction with traditional association algorithms to improve the quality of the recommendations.
Extending Similarities vis-à-vis Recommendations
import random
transactions = [
    ["laptop", "mouse", "keyboard", "usb_c_hub"],
    ["smartphone", "wireless_charger", "earbuds", "phone_case"],
    ["gaming_console", "gaming_controller", "headset", "gaming_mouse"],
    ["tv", "soundbar", "streaming_device"],
    ["tablet", "stylus", "tablet_case"],
    ["laptop", "external_hard_drive", "usb_c_hub"],
    ["smartphone", "screen_protector", "phone_case"],
    ["gaming_console", "gaming_headset", "gaming_keyboard"],
    ["tv", "bluetooth_speaker", "universal_remote"],
    ["tablet", "portable_charger", "tablet_stand"],
    ["camera", "tripod", "memory_card", "camera_bag"],
    ["drone", "drone_batteries", "camera", "action_cam"],
    ["smartwatch", "fitness_band", "wireless_earbuds"],
    ["gaming_console", "gaming_mouse", "gaming_monitor"],
    ["smartphone", "portable_charger", "wireless_earbuds"],
]
random.shuffle(transactions)
from gensim.models import Word2Vec
model = Word2Vec(sentences=transactions, vector_size=100, window=4, min_count=1, workers=4, sg=1, epochs=1000, negative=5)
= "laptop"
product = model.wv.most_similar(product, topn=5)
recommendations
print(f"Top recommendations for '{product}':")
for item, score in recommendations:
print(f"{item}: {score:.4f}")
Top recommendations for 'laptop':
mouse: 0.9697
usb_c_hub: 0.9696
camera: 0.9673
keyboard: 0.9666
tripod: 0.9665
Hurray!! We have constructed a mini, simple recommendation model using Word2Vec.
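Before wrapping up, one small extension worth sketching: the same model could suggest items for a whole basket by averaging the vectors of the products already in the cart, which is exactly what most_similar does when given several positive terms. This is an assumed extension, not part of the walkthrough above, and the scores will vary per run:

# Hypothetical basket recommendation: average the vectors of items already in the cart
basket = ["smartphone", "wireless_earbuds"]
basket_recs = model.wv.most_similar(positive=basket, topn=5)
for item, score in basket_recs:
    print(f"{item}: {score:.4f}")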
Wow! Isn't this beautiful?