New, Improved Song Recommender on MySwar

By Abhishek Gupta

On MySwar, every song has a Similar Songs section that shows songs related to the one you’re browsing. Our earlier system used a basic static algorithm where similar songs were pre-generated for each track. This meant recommendations were often stale and inflexible — any change in the logic required regenerating the entire dataset, and new songs wouldn’t appear until the next batch run. We wanted something real-time, dynamic, and capable of evolving as our catalog grew.

We decided to build a system that generates similar songs on the fly using scikit-learn for feature encoding and ChromaDB for vector storage and search. First, we identified the attributes that most influence similarity. The following is an illustrative subset of attributes and their weights (there are many more).

attributes = {
    "genre": ["rock", "pop", "jazz", "classical"],
    "mood": ["happy", "sad", "energetic", "calm"],
    "instrument": ["guitar", "piano", "violin", "drums"],
}

weights = {
    "genre": 2.0,      # Genre is most important
    "mood": 1.5,       # Mood is moderately important
    "instrument": 1.0  # Instrument is least important
}

Using scikit-learn’s MultiLabelBinarizer, we transform these categorical attributes into multi-hot vectors and then apply weights to reflect their relative importance. The SongVectorizer class encapsulates this logic:

from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

class SongVectorizer:
    def __init__(self, attributes, weights=None):
        self.attributes = attributes
        self.weights = weights if weights else {attr: 1.0 for attr in attributes}
        self.encoders = {
            attr: MultiLabelBinarizer().fit([values])
            for attr, values in attributes.items()
        }

    def vectorize_song(self, song):
        vector = []
        for attr, encoder in self.encoders.items():
            values = song.get(attr, [])
            encoded = encoder.transform([values])[0].astype(float)
            encoded *= self.weights[attr]
            vector.append(encoded)
        return np.concatenate(vector)
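To make the encoding concrete, here is a standalone sketch of the per-attribute step the class performs, using the illustrative genre list and its 2.0 weight from above (note that MultiLabelBinarizer stores its classes in sorted order):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Standalone illustration of the encoding step for a single attribute.
genres = ["rock", "pop", "jazz", "classical"]
encoder = MultiLabelBinarizer().fit([genres])  # classes_ is stored sorted

encoded = encoder.transform([["rock", "pop"]])[0].astype(float)
weighted = encoded * 2.0  # apply the genre weight

print(list(encoder.classes_))  # ['classical', 'jazz', 'pop', 'rock']
print(weighted)                # [0. 0. 2. 2.]
```

The full vector is just these weighted per-attribute slices concatenated, so its length is the total number of attribute values (12 for the three illustrative attributes above).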

To store and query these vectors efficiently, we run ChromaDB inside Docker:

docker run -d \
  --name chromadb \
  -p 8000:8000 \
  -v $(pwd)/chroma_data:/chroma/chroma \
  chromadb/chroma:latest

Once ChromaDB is running, we connect to it from Python and insert our song vectors.

import chromadb

chroma_client = chromadb.HttpClient(host="localhost", port=8000)
collection = chroma_client.get_or_create_collection(
    name="similar_songs",
    metadata={"hnsw:space": "cosine"},  # use cosine distance for similarity search
)

vectorizer = SongVectorizer(attributes, weights)

songs = [
    {"id": "s1", "genre": ["rock", "pop"], "mood": ["energetic"], "instrument": ["guitar"]},
    {"id": "s2", "genre": ["pop"], "mood": ["happy"], "instrument": ["piano"]},
    {"id": "s3", "genre": ["jazz"], "mood": ["calm"], "instrument": ["violin"]},
]

for song in songs:
    vec = vectorizer.vectorize_song(song).tolist()
    collection.add(ids=[song["id"]], embeddings=[vec])

When a user views a song, we vectorize it on the fly and query ChromaDB to find the closest matches using cosine distance:

def search_similar_songs(query_song, top_n=3):
    vec = vectorizer.vectorize_song(query_song).tolist()
    results = collection.query(query_embeddings=[vec], n_results=top_n)
    return results[“ids”][0], results[“distances”][0]

query_song = {“genre”: [“rock”], “mood”: [“happy”], “instrument”: [“guitar”]}
similar_ids, distances = search_similar_songs(query_song)

for sid, dist in zip(similar_ids, distances):
    print(f"Song: {sid}, Distance: {dist}")
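One practical detail: when the viewed song is already in the collection, it comes back as its own nearest neighbour with distance 0. A small wrapper can over-fetch one extra result and drop the song itself (the helper name and approach here are our sketch, not verbatim production code):

```python
def search_similar_songs_excluding(collection, vectorizer, song, top_n=3):
    """Query one extra result and filter out the song itself if indexed."""
    vec = vectorizer.vectorize_song(song).tolist()
    results = collection.query(query_embeddings=[vec], n_results=top_n + 1)
    ids, dists = results["ids"][0], results["distances"][0]
    # Drop the query song's own id, then trim back down to top_n.
    filtered = [(i, d) for i, d in zip(ids, dists) if i != song.get("id")]
    return filtered[:top_n]
```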

With this approach, recommendations are no longer static. New songs become discoverable instantly, relevance can be tuned by adjusting weights, and we can scale to millions of songs without running into the limitations of our old batch-generated system.
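Indexing a large catalog is best done in batches rather than one song at a time. A minimal sketch (the batch size of 500 is an arbitrary assumption, tune as needed) using ChromaDB's upsert so that re-indexing updates existing entries instead of creating duplicates:

```python
BATCH = 500  # assumed batch size, tune for your catalog

def index_songs(collection, vectorizer, songs):
    """Index songs into ChromaDB in fixed-size batches."""
    for i in range(0, len(songs), BATCH):
        batch = songs[i:i + BATCH]
        collection.upsert(  # upsert: re-runs update vectors rather than duplicating ids
            ids=[s["id"] for s in batch],
            embeddings=[vectorizer.vectorize_song(s).tolist() for s in batch],
        )
```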

A few examples below:

Similar Songs for Chaudhvin Ka Chand Ho (Chaudhvin Ka Chand, 1960):

Similar songs for Hum Dil De Chuke Sanam (Hum Dil De Chuke Sanam, 1999):

Similar songs for Apna Time Aayega (Gully Boy, 2019):

