Predicting Spotify Bangers and Classifying Genres with Supervised ML Algorithms

You can find the whole project and dataset here:

[Figure: correlation plot with distributions]

The dataset we will explore, analyze, and model is the Spotify dataset, which contains song information across the decades. It includes each track's name, artist name, danceability, key, acousticness, speech, tempo, liveness, valence, popularity, and decade, along with other features that help us determine whether a song can be classified as a hit.

Music is highly subjective, and people have very different preferences in what they listen to. However, if a large number of people like a song, we can reasonably call it a hit: it has mass appeal, is played often, and is therefore popular.

For example, we would like to know whether songs with more speech in them had a bigger appeal in the 1970s than in the 2000s.

Using exploratory data analysis (EDA) on the Spotify dataset, we hope to gain deeper insight into which factors contribute to the popularity of a song, as well as how music popularity has changed over the decades. Correlation plots, pair plots, and principal component analysis (PCA) plots help identify the features that affect a track's popularity, and those features can then be used in model building.
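To make the EDA step concrete, here is a minimal sketch of a correlation check and a two-component PCA. It runs on synthetic data, not the real Spotify table: the column names (`danceability`, `acousticness`, `valence`, `tempo`, `popular`) and the way the hit label is generated are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Spotify table; column names are assumptions.
rng = np.random.default_rng(0)
n = 500
danceability = rng.uniform(0, 1, n)
songs = pd.DataFrame({
    "danceability": danceability,
    "acousticness": rng.uniform(0, 1, n),
    "valence": 0.6 * danceability + 0.4 * rng.uniform(0, 1, n),
    "tempo": rng.normal(120, 25, n),
})
# Hypothetical hit label (1 = hit, 0 = not a hit), loosely tied to danceability.
songs["popular"] = (danceability + rng.normal(0, 0.3, n) > 0.5).astype(int)

# Pearson correlation of each feature with the hit label.
corr_with_target = (
    songs.corr()["popular"].drop("popular").sort_values(ascending=False)
)
print(corr_with_target)

# PCA on standardized features, so tempo's larger scale does not dominate.
features = songs.drop(columns="popular")
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)
```

On the real data, the same two steps would show which audio features move together with popularity and how much variance the first two components capture.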

For popularity classification, we try out logistic regression, KNN, random forest, and XGBoost. Comparing the performance of these models gives us insight into the advantages and drawbacks of each method and helps us decide which is most suitable for similar problems. Below are the accuracies I obtained with these popular supervised ML algorithms.
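A comparison like this can be sketched with scikit-learn as follows. This is not the project's actual code: the data comes from `make_classification` as a placeholder for the song features, and the hyperparameters (`n_neighbors=5`, `n_estimators=200`, the 75/25 split) are assumptions. XGBoost is left out to keep the sketch free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for audio features and a hit/not-hit label.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Logistic regression and KNN are scale-sensitive, so they get a scaler;
# the random forest does not need one.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.2%}")
```

Scoring every model on the same held-out split keeps the comparison fair. The numbers printed here come from synthetic data and say nothing about the Spotify results.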

Spoiler Alert: Random Forest works best!

KNN: 59.05% accuracy for song hit prediction

Random Forest: 76.3% accuracy for song hit prediction

XGBoost: 73.14% accuracy for song hit prediction

Logistic Regression: 71.7% accuracy for song hit prediction

These accuracies don't tell us much about whether a future Spotify listener will actually like a song predicted for them, or even hear it. What we can do is look back at history to make an overall prediction about which groups of people are more likely to enjoy a particular type of track and genre.

Inspired by this Warren Buffett quote, I decided to explore whether history can tell us anything about music and the possible cyclicality of people's taste.

“In the business world, the rearview mirror is always clearer than the windshield.”

To do that, I measured the change in song popularity over the decades by constructing stacked bar plots of popular, unpopular, and all songs in our dataset, broken down by genre. After studying the plot below for several minutes, we can say with some confidence that there appears to be a cyclical pattern in the popularity of music. Pop, shown in a blue-ish color, has never been consistently popular but has had waves of popularity over the decades. The same applies to the R&B genre.
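The table behind such a stacked bar plot can be sketched with a pandas crosstab. The rows below are made-up toy data, and the column names (`decade`, `genre`, `popular`) are assumptions about how the real dataset might be laid out.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Spotify dataset.
songs = pd.DataFrame({
    "decade": ["1970s", "1970s", "1980s", "1980s",
               "1990s", "1990s", "2000s", "2000s"],
    "genre": ["pop", "rock", "pop", "r&b", "rock", "r&b", "pop", "r&b"],
    "popular": [1, 0, 0, 1, 1, 0, 1, 1],
})

# Count popular songs per decade (rows) and genre (columns).
hits = songs[songs["popular"] == 1]
hits_by_decade = pd.crosstab(hits["decade"], hits["genre"])

# Convert counts to each genre's share of that decade's hits.
share_by_decade = hits_by_decade.div(hits_by_decade.sum(axis=1), axis=0)
print(hits_by_decade)
print(share_by_decade)
```

With matplotlib installed, `hits_by_decade.plot(kind="bar", stacked=True)` draws one stacked bar per decade. Repeating the crosstab for unpopular songs and for all songs gives the other two panels.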

In future explorations, I'd like to investigate how audio features can be used to classify the genre of a song by building a decision tree, training a CNN, and training an RNN. This can help us understand the audio characteristics of different genres and how the popularity of song genres changes over time. A potential challenge is the difficulty of obtaining the actual genre of the songs in our dataset: some songs don't have a genre, and some have multiple genres. We will try to tackle this by using the artist's genre instead of the song's genre, incorporating other datasets, and trying out Spotify's developer API.
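One way the artist-genre fallback could look is sketched below. Everything here is hypothetical: the `tracks` table, the `artist_genres` lookup, and the rule of taking an artist's first listed genre are illustrative assumptions, not part of the project or of Spotify's API.

```python
import pandas as pd

# Hypothetical track table in which most songs lack a genre label.
tracks = pd.DataFrame({
    "track": ["A", "B", "C", "D"],
    "artist": ["x", "y", "x", "z"],
    "song_genre": ["pop", None, None, None],
})

# Hypothetical artist-to-genres lookup; an artist may have several or none.
artist_genres = {"x": ["pop", "dance"], "y": ["r&b"], "z": []}


def primary_genre(artist):
    """Return the artist's first listed genre, or None if there is none."""
    genres = artist_genres.get(artist, [])
    return genres[0] if genres else None


# Keep the song's own genre when present; otherwise fall back to the artist's.
tracks["genre"] = tracks["song_genre"].fillna(
    tracks["artist"].map(primary_genre)
)
print(tracks)
```

Tracks whose artist has no known genre stay unlabeled and would need to be dropped or filled in from another dataset before training a genre classifier.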
