Analyzing Metadata from Spotify Playlists
I recently came across this post about a Python module, GSA, that allows you to download metadata from Spotify playlists.[1] This was great, because it was a perfect opportunity for me to test RStudio 1.4’s improved support for Python, which I had been excited to try out since I saw the news.
Download Metadata
The original blog post lays out the steps you need to install everything properly, so I won’t repeat them here. However, I’ll briefly go over a few problems I encountered while trying to get it to work in the context of RStudio.
The blog post tells you to run
pip install -r requirements.txt
to install the necessary packages for their GSA module. If you are using reticulate in R/RStudio 1.4, you may also need to install those packages with py_install so that they are installed into the environment used by reticulate.
In order for import GSA to work, Python has to be able to find GSA.py (and the other Python scripts) in either the current working directory or some other directory on Python's search path. You can check what the current working directory is with os.getcwd(). If GSA.py is not there, you can add the directory containing it to the path with sys.path.insert. For example, in my case, when I downloaded the Python scripts, I had them in a subdirectory of my project folder, so I had to add that subdirectory, “GeneralizedSpotifyAnalyser-main,” to my Python path:
import os
os.getcwd() # Check where your working directory is
import sys
sys.path.insert(0, './GeneralizedSpotifyAnalyser-main') # Add subdirectory to path
import GSA
import pandas as pd
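If the import still fails, it can help to confirm which Python interpreter is actually running and whether the required packages are visible to it. The following is a minimal sketch using only the standard library. The helper name missing_packages is my own, and the package list passed to it is an assumption: pandas is the only dependency named in this post, so take the rest from requirements.txt.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Path of the interpreter running this code, and the root of its environment.
print(sys.executable)
print(sys.prefix)

# Hypothetical list: replace with the import names of the packages in requirements.txt.
print(missing_packages(["pandas"]))
```

Anything printed in the last list is a candidate for py_install. Note that the sketch only handles top-level package names, not dotted submodule paths.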
After successfully importing GSA, I followed the sample script GSA_basicExample.py to download the playlist metadata. The playlist I’m using as an example is “Chillin’ on a Dirt Road.”
GSA.authenticate()
# Here's an example playlist ID: 37i9dQZF1DX3hgbB9nrEB1
myPlaylist = GSA.getInformation('37i9dQZF1DX3hgbB9nrEB1', verbose=True)
# Read the .pkl-file to get a dataframe
myPlaylistInformation = pd.read_pickle(myPlaylist)
I also modified the GSA.getInformation function slightly to add an additional column for artist names. If you poke around the track variable in the function, you may also find other metadata fields that you are interested in, since the function by default doesn’t save all of the fields into the data frame. The artists are under track['track']['artists'] and their names can be accessed in Python as [x['name'] for x in track['track']['artists']].
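As a small illustration of that list comprehension, here is a sketch on a hand-made dictionary shaped like the track variable. Only the track['track']['artists'] nesting and the 'name' key come from what I saw in the function. The sample values and the helper name artist_names are made up for illustration.

```python
def artist_names(track):
    """Collect the artist names from one playlist item."""
    return [artist["name"] for artist in track["track"]["artists"]]


# Hypothetical playlist item; real items carry many more fields.
sample_track = {
    "track": {
        "name": "Example Song",
        "artists": [{"name": "Artist A"}, {"name": "Artist B"}],
    }
}

print(artist_names(sample_track))             # ['Artist A', 'Artist B']
print(", ".join(artist_names(sample_track)))  # Artist A, Artist B
```

Whether you store the names as a list or as one joined string is a design choice. A joined string is easier to read in a printed data frame, while a list is easier to count over later.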
Regarding RStudio 1.4’s integration with Python, I did have quite a smooth experience running everything. I could send lines of Python code to the RStudio console just like R code. I also noticed that RStudio will automatically detect whether you are trying to run Python or R code and enter and exit reticulate for you. In addition, Python variables show up in the Environment tab of RStudio, which isn’t something I personally use a lot, but it’s still cool to see. Based on this limited experience, RStudio seems to be able to function like an IDE for Python.
Some of these features may have already existed pre-RStudio 1.4, because I will say that the last time I used reticulate, I remember having a rather good experience already. For example, something really cool about reticulate is that you can switch from Python back to R and access any Python object with py$name_of_object. I don’t know how this extraordinary magic is implemented, but it’s amazing.
playlist_meta_raw = py$myPlaylistInformation
Below I’ve printed out the data frame in R, which was originally downloaded in Python.
head(playlist_meta_raw)
## # A tibble: 6 x 21
## X1 playlistID TrackName TrackID SampleURL ReleaseYear Artists Genres
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 0 37i9dQZF1D… Worth It 5APWbHd… https://p.… 2017-12-01 ['Dani… ['alt z…
## 2 1 37i9dQZF1D… Seeing Bl… 65wnZsZ… https://p.… 2017-10-20 ['Nial… ['dance…
## 3 2 37i9dQZF1D… I Could U… 09iyGil… https://p.… 2017-03-17 ['Mare… ['conte…
## 4 3 37i9dQZF1D… I Want Cr… 4FkgULe… https://p.… 2011 ['Hunt… ['conte…
## 5 4 37i9dQZF1D… Downtown 4kY7rYt… https://p.… 2013-01-01 ['Lady… ['conte…
## 6 5 37i9dQZF1D… Tequila 7Il2yWQ… https://p.… 2018-01-10 ['Dan … ['conte…
## # … with 13 more variables: danceability <dbl>, energy <dbl>, loudness <dbl>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, key <dbl>, mode <dbl>,
## # duration_ms <dbl>, Popularity <dbl>
Plot the Data
Once I have the data, I can make all sorts of interesting plots. First, I clean up the data a bit using the R packages janitor (for column names) and lubridate (for dates).
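library(tidyverse) # assumed here for %>%, as_tibble, ggplot2, tidyr, and forcats
library(janitor)   # clean_names
library(lubridate) # ymd, year
playlist_meta = playlist_meta_raw %>%
  as_tibble() %>%
  clean_names()
# truncated = 2 lets ymd() parse values that lack a month and/or day, such as "2011"
playlist_meta = playlist_meta %>%
  mutate(release_date = ymd(release_year, truncated = 2),
         release_year = year(release_date))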
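The truncated = 2 argument matters because some rows hold only a year (for example 2011 in the output above) rather than a full date. If you would rather normalize the dates on the Python side before handing the data frame to R, a rough equivalent might look like the following. This is only a sketch: parse_release_date is a hypothetical helper, and it assumes every value has the form YYYY, YYYY-MM, or YYYY-MM-DD.

```python
from datetime import date


def parse_release_date(value):
    """Parse 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD', defaulting missing parts to 1."""
    parts = [int(p) for p in value.split("-")]
    parts += [1] * (3 - len(parts))  # fill in a missing month and/or day
    return date(*parts)


print(parse_release_date("2017-12-01"))  # 2017-12-01
print(parse_release_date("2011"))        # 2011-01-01
print(parse_release_date("2011").year)   # 2011
```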
Then I use ggplot2 to make some fun visualizations of the songs in the playlist.
# Release date
playlist_meta %>%
  ggplot(aes(x = release_year)) +
  geom_bar() +
  scale_x_continuous(breaks = 1990:2020) +
  labs(x = "Release Date",
       y = "# of Songs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
# Song duration
playlist_meta %>%
  ggplot(aes(x = release_year, y = duration_ms/1000/60)) +
  geom_point() +
  labs(x = "Release Year",
       y = "Song Duration (in minutes)") +
  theme_minimal()
# Genres
# Table of counts for genres (each song counted multiple times)
popular_genres = sapply(playlist_meta$genres,
                        function(x) strsplit(gsub("(\\[)|(\\])|(\\')", "", x), ", ")) %>%
  unname() %>%
  unlist() %>%
  table()
popular_genres_tb = tibble(genres = names(popular_genres),
                           count = popular_genres) %>%
  arrange(-count)
popular_genres_tb %>%
  uncount(as.integer(count)) %>%
  mutate(genres = fct_lump_n(fct_infreq(genres), 10)) %>%
  filter(genres != "Other") %>%
  ggplot(aes(x = genres)) +
  geom_bar() +
  labs(x = "Genre", y = "# of Songs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
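In the output shown earlier, the Genres column holds each song's genres as the text of a Python list, which is why the R code strips brackets and quotes before splitting. On the Python side, the same tally could be sketched with ast.literal_eval and collections.Counter. The sample strings below are invented for illustration, count_genres is a hypothetical helper, and the sketch assumes every value is a well-formed list literal.

```python
import ast
from collections import Counter


def count_genres(genre_strings):
    """Tally genres across songs; a song with several genres counts once per genre."""
    counts = Counter()
    for text in genre_strings:
        counts.update(ast.literal_eval(text))  # "['a', 'b']" -> ['a', 'b']
    return counts


# Hypothetical values standing in for the Genres column.
sample = ["['genre a', 'genre b']", "['genre b']", "[]"]
print(count_genres(sample).most_common(10))
# [('genre b', 2), ('genre a', 1)]
```

Using literal_eval rather than string replacement keeps genre names that themselves contain commas or apostrophes intact, at the cost of failing loudly on malformed input.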
[1] After I wrote this post, I discovered that an R package for Spotify has since been released, which I expect will be easier to use for R users.