Revised Data Loading and Cleaning

Status

  • This year's countdown is the 885 Greatest Songs by Women.
  • Last year's playlist fetching logic should work.
  • Got the basics working.
  • Silly solution to the "countdown stops at 6ish" problem: concatenating per-day playlists. There's probably a better way, but this works.
  • Avoid duplicates in the playlist.

Overview

Keeping data loading separate. Normal people want stats; fewer people want the "how did you get this" details. If you want to see how this used to happen, or see data for earlier years, see the old Data Loading Notebook.

Setup

Under the covers, this is mostly requests to fetch the data, which is now JSON, so we can use the standard-library json support, plus Pandas for data munging. So let's start with the imports.

In [1]:
%matplotlib inline
from IPython.display import display, HTML
import requests 
import json
import pandas as pd
import numpy as np
from datetime import date, datetime, time
from os import path, mkdir
import re
from functools import reduce  # builtin in Python 2; the explicit import keeps this Python 3 friendly

Setup Cache directories

When we can, we'll cache data. This is only partially for speed. Checking in the data allows for repeatability if sources go away or change. And for some partial results, it's useful to publish data. There are others in the XPN community doing data analysis or just asking questions, so CSV files are nice to leave around.

In [2]:
cache_dir = './cache'
playlist_cache_dir = path.join(cache_dir, 'playlists')
a2z90s_cache_dir = path.join(cache_dir, 'a2z90s')
bestwomen_cache_dir = path.join(cache_dir, 'bestwomen')
musicbrainz_cache_dir = path.join(cache_dir, 'musicbrainz')
data_dir = './data'

for d in (cache_dir, playlist_cache_dir, a2z90s_cache_dir, bestwomen_cache_dir, data_dir, musicbrainz_cache_dir):
    if not path.exists(d): mkdir(d)

Generic XPN Playlist Scraping

XPN updated their site this year. The playlists are now JSON, and they now include album info, which is great.
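
Each entry in that JSON array looks roughly like this (the field names are the ones the parsing code below relies on; the values here are invented for illustration):

{
    "artist": "Some Artist",
    "song": "Some Song",
    "album": "Some Album",
    "timeslice": "2023-12-05 08:00:37",
    "image": "https://origin.xpn.org/...",
    "streamPreview": "https://origin.xpn.org/..."
}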

In [3]:
def fetch_daily_playlist(day, cache_dir=None, verbose = False):
    """
    Fetches the XPN playlist for a given date
    
    Args:
        day (datetime.date) : The day to fetch the playlist for
        cache_dir (string)  : Path to the cache directory, or None to avoid caching
        verbose (bool)      : If True, print progress as we go
    
    Returns:
        DataFrame containing Artist, Title, and Album as strings and 'Air Time' as Timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Air Time'])
    cache_file = None
    if cache_dir is not None:
        cache_file = path.join(cache_dir, "%04d-%02d-%02d.csv" % \
                               (day.year, day.month, day.day))
    if cache_file is not None and path.exists(cache_file):
        songs = pd.read_csv(cache_file, encoding='utf-8')
        songs['Air Time'] = pd.to_datetime(songs['Air Time'], errors='coerce')
        if verbose: print("Got %d rows from %s" % (len(songs), cache_file))
    else:
        # example url
        # https://origin.xpn.org/utils/playlist/json/2022-11-30.json
        playlist_url = 'https://origin.xpn.org/utils/playlist/json/%s.json' % (day.isoformat())
        page = requests.get(playlist_url)
        if verbose: print("fetching %s returned status %s" % (day.isoformat(), page.status_code))
        # return is a json array of playlist entries
        # each playlist entry is a dict of
        #  - artist : artists name
        #  - song : song title
        #  - album : album title
        #  - timeslice : string containing an iso date with second granularity in Eastern time
        #  - image: url of album cover art
        #  - streamPreview : url of a short mp3 outtake from the track
        # 
        # Not all tracks are music.  Shows like World Cafe or Echoes
        # put the show name, bound by vertical bars, in the artist field, so we can skip them easily enough
        track_count = 0
        for track in page.json():
            if track["artist"][0] == '|':
                # skip non-song show titles
                continue
            if verbose: print ("adding %s %s %s %s" % (track['artist'], track['song'], track['album'], 
                                                       datetime.strptime(track['timeslice'],'%Y-%m-%d %H:%M:%S')))
            songs = songs.append({'Artist': track['artist'],
                                  'Title': track['song'],
                                  'Album': track['album'],
                                  'Air Time': datetime.strptime(track['timeslice'],'%Y-%m-%d %H:%M:%S')},
                                 ignore_index = True)
            if verbose: print "size = %d" % len(songs)
            track_count += 1
        
        if verbose: print 'added %d tracks' % (track_count)
        if cache_file is not None:
            songs.to_csv(cache_file, index=False, encoding='utf-8')
            if verbose: print 'write %d rows to %s' % (len(songs), cache_file)
        
    return songs
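
As a quick smoke test of the fetcher (this assumes either network access to origin.xpn.org or a cache primed by an earlier run):

# fetch a single day; verbose shows whether it hit the cache or the network
sample = fetch_daily_playlist(date(2023, 12, 5), playlist_cache_dir, verbose=True)
print("%d tracks on 2023-12-05" % len(sample))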
In [4]:
def fetch_playlist(start, end, cache_dir=None):
    """
    Fetch all the playlist entries for a range of time.
    
    Args:
        start (datetime.datetime) : The inclusive start time to fetch entries for
        end (datetime.datetime)   : The exclusive end time to fetch entries for
        cache_dir (string)        : path to the cache directory, or None to avoid caching
    
    Returns:
        DataFrame containing Artist, Title, and Album as strings, and 'Air Time' as Timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Air Time'])
    for day in pd.date_range(start.date(), end.date()):
        songs = songs.append(fetch_daily_playlist(day.date(), cache_dir), ignore_index=True)
    songs = songs[songs['Air Time'] >= start]
    songs = songs[songs['Air Time'] < end]
    # sometimes the playlist entries are duplicated
    songs = songs.drop_duplicates(subset=['Artist', 'Title', 'Album'])
    songs = songs.sort_values(by = 'Air Time')
    
    
    return songs

Load The Playlists

Fetch all the playlists for the duration of the countdown, pulling from the local cache when possible.

885 Songs by Women

In [5]:
# This is slightly awkward.  Since the countdown stops each evening and restarts
# in the morning, we can treat it as multiple playlists, one per day.
women_day1 = fetch_playlist(datetime(2023, 12, 5, 8, 0), datetime(2023, 12, 5, 17, 19), playlist_cache_dir)
women_day2 = fetch_playlist(datetime(2023, 12, 6, 8, 0), datetime(2023, 12, 6, 17, 53), playlist_cache_dir)
women_day3 = fetch_playlist(datetime(2023, 12, 7, 8, 0), datetime(2023, 12, 7, 17, 32), playlist_cache_dir)
women_day4 = fetch_playlist(datetime(2023, 12, 8, 8, 0), datetime(2023, 12, 8, 16, 0), playlist_cache_dir)
women_day5 = fetch_playlist(datetime(2023, 12, 9, 10, 0), datetime(2023, 12, 9, 18, 0), playlist_cache_dir)
women_day6 = fetch_playlist(datetime(2023, 12, 10, 11, 0), datetime(2023, 12, 10, 15, 0), playlist_cache_dir)
women_day7 = fetch_playlist(datetime(2023, 12, 11, 8, 0), datetime(2023, 12, 11, 17, 25), playlist_cache_dir)
women_day8 = fetch_playlist(datetime(2023, 12, 12, 8, 0), datetime(2023, 12, 12, 17, 54), playlist_cache_dir)

women_days = [women_day1, women_day2, women_day3, women_day4, women_day5, women_day6, women_day7, women_day8]
print "got %d rows" % reduce(lambda x,y: x + len(y), women_days, 0)
got 885 rows
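
Since avoiding duplicates in the playlist was one of this year's goals, a quick cross-day sanity check doesn't hurt (fetch_playlist already de-duplicates within each day's slice):

# count tracks that appear in more than one day's slice
all_days = pd.concat(women_days, ignore_index=True)
print("%d duplicated tracks across days" % all_days.duplicated(subset=['Artist', 'Title', 'Album']).sum())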
In [6]:
HTML(women_day1.head(5).to_html())
Out[6]:
    | Artist                 | Title          | Album             | Air Time
183 | Martha & The Vandellas | Nowhere To Run | Dance Party       | 2023-12-05 08:00:37
182 | Belly                  | Gepetto        | Star              | 2023-12-05 08:04:47
181 | The Pointer Sisters    | I'm So Excited | So Excited!       | 2023-12-05 08:08:05
180 | Mary J. Blige          | Real Love      | What's The 411?   | 2023-12-05 08:12:41
179 | Brandi Carlile         | Dreams         | Give Up The Ghost | 2023-12-05 08:17:08
In [7]:
HTML(women_day8.tail(5).to_html())
Out[7]:
  | Artist          | Title         | Album                                  | Air Time
4 | Tracy Chapman   | Fast Car      | Tracy Chapman                          | 2023-12-12 17:29:02
3 | Brandi Carlile  | The Story     | The Story                              | 2023-12-12 17:34:02
2 | Dolly Parton    | Jolene        | Jolene                                 | 2023-12-12 17:40:31
1 | Joni Mitchell   | A Case Of You | Blue                                   | 2023-12-12 17:43:12
0 | Aretha Franklin | Respect       | I Never Loved A Man The Way I Love You | 2023-12-12 17:50:44

90s A-Z

In [8]:
# ninties = fetch_playlist(datetime(2022, 12, 1, 8, 0), datetime(2022, 12, 8, 8, 30),
#                           playlist_cache_dir)
# print "got %d rows" % len(ninties)

90s Non-alphabetics

In [9]:
# ninties_extras = fetch_playlist(datetime(2022, 12, 8, 8, 30), datetime(2022, 12, 8, 11, 8),
#                           playlist_cache_dir)
# print "got %d rows" % len(ninties_extras)

Augmenting The Data

Estimating Durations

Since we have air times, we can approximate durations by subtracting each track's air time from the next track's air time. There are a couple of catches with this:

  • we need to pass in an explicit end time for the last track, but that's minor
  • we need to add some logic to 'skip over' the Free at Noons that happen on Fridays from 12 noon till "like 12:40 or so" and don't appear in the playlist at all
  • there's no clear way to account for "non-song time" like station promos, hosts introducing songs, station ids, and so forth. Fortunately, the percentage of time that is really music is pretty high thanks to XPN being listener supported.
In [10]:
def estimate_durations(playlist, end_time=None):
    """
    Estimate the song durations
    Args: 
        playlist (DataFrame): playlist with minimally an 'Air Time' attribute
        end_time (datetime): end time of the play list, or None if still going
    Return:
        modified DataFrame with 'Duration' attribute added.
    """
    
    playlist['Duration'] = pd.Series([0 for x in range(len(playlist.index))], index=playlist.index)
    previous = None
    last_idx = None
    for idx, row in playlist.iterrows():
        if previous is not None:
            if row['Air Time'].date().weekday() == 4 and previous.hour == 11 and row['Air Time'].hour == 12:
                # we just fell into a Free at Noon; assume the track ended at noon
                playlist.loc[last_idx, 'Duration'] = 60 - previous.minute
            else:
                # subtract the previous start from this one, in whole minutes
                delta = row['Air Time'] - previous
                playlist.loc[last_idx, 'Duration'] = delta.seconds // 60
        previous = row['Air Time']
        last_idx = idx

    # fixup the last row
    if end_time is not None:
        delta = end_time - playlist.loc[last_idx, 'Air Time']
        playlist.loc[last_idx, 'Duration'] = delta.seconds // 60
    
    return playlist
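
As a sanity check on the arithmetic, here's the estimator on a tiny synthetic playlist (invented timestamps, not real data): tracks airing at 8:00, 8:04, and 8:09 with an 8:12 end time should come out at 4, 5, and 3 minutes.

# toy example with synthetic air times
demo = pd.DataFrame({'Air Time': pd.to_datetime(['2023-12-05 08:00:00',
                                                 '2023-12-05 08:04:00',
                                                 '2023-12-05 08:09:00'])})
demo = estimate_durations(demo, datetime(2023, 12, 5, 8, 12))
print(demo['Duration'].tolist())  # [4, 5, 3]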
In [11]:
# ninties = estimate_durations(ninties, datetime(2022,12, 8, 8, 30))
# ninties_extras = estimate_durations(ninties_extras, datetime(2022, 12, 8, 11, 9))
women_day1 = estimate_durations(women_day1, datetime(2023, 12, 5, 17, 19))
women_day2 = estimate_durations(women_day2, datetime(2023, 12, 6, 17, 53))
women_day3 = estimate_durations(women_day3, datetime(2023, 12, 7, 17, 32))
women_day4 = estimate_durations(women_day4, datetime(2023, 12, 8, 16, 0))
women_day5 = estimate_durations(women_day5, datetime(2023, 12, 9, 18, 0))
women_day6 = estimate_durations(women_day6, datetime(2023, 12, 10, 15, 0))
women_day7 = estimate_durations(women_day7, datetime(2023, 12, 11, 17, 25))
women_day8 = estimate_durations(women_day8, datetime(2023, 12, 12, 17, 54))
women = pd.concat([women_day1, women_day2, women_day3, women_day4, women_day5, women_day6, women_day7, women_day8], ignore_index=True)

Place in countdown

For countdowns it's useful to have a column for position in the countdown. So, for example, in the 885 Greatest Songs by Women, the first track played is number 885, and the last is number 1.

In [12]:
def assign_positions(playlist, size):
    """
    Assign positions to songs in a playlist
    Args:
        playlist (DataFrame): playlist with rows in order of Air Time
        size (int): eventual size of the playlist
    Return:
        modified playlist with a 'Position' column appended
    """
    return playlist.assign(Position=range(size, size - len(playlist), -1))
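
A quick check on a synthetic three-row frame confirms the numbering counts down from size:

# hypothetical 3-row playlist in an 885-song countdown
demo = pd.DataFrame({'Title': ['first', 'second', 'third']})
print(assign_positions(demo, 885)['Position'].tolist())  # [885, 884, 883]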
In [13]:
women = assign_positions(women, 885)

Extracting Initial Letters

This might be the lamest, simplest data augmentation, but these are mostly A to Z countdowns. Besides, nothing is ever really that simple: blanks and initial punctuation (e.g. 'Til) have shown up before.

In [14]:
def first_char(s):
    """Return the first alphabetic character of s, upper-cased,
    falling back to the literal first character if there is none."""
    for c in s:
        if c.isalpha():
            return c.upper()
    return s[0]

# ninties = ninties.join(ninties.apply(lambda x: first_char(x['Title']), axis=1).to_frame('Letter'))
women = women.join(women.apply(lambda x: first_char(x['Title']), axis=1).to_frame('Letter'))
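
A few spot checks on invented titles show both the skipping and the fall-through behavior:

# invented examples: skip leading punctuation and digits, fall back to s[0]
for t in ["'Til I Hear It From You", "867-5309/Jenny", "1999"]:
    print("%s -> %s" % (t, first_char(t)))
# 'Til I Hear It From You -> T
# 867-5309/Jenny -> J
# 1999 -> 1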

For the non-alphabetic leftovers, we'll take the literal first character instead, with no skipping past non-alphabetics.

In [15]:
# ninties_extras = ninties_extras.join(ninties_extras.apply(lambda x: x[1][0].upper(), axis=1).to_frame('First Character'))

Extracting First Words

Not sure how interesting this is, but the "should we include leading articles" question was the genesis of this effort back in 2016. Besides, it's easy.

In [16]:
from nltk.tokenize import RegexpTokenizer
custom_tokenize = RegexpTokenizer(r"[\w'\-]+|[^\w'\s\-]").tokenize
# ninties = ninties.join(ninties.apply(lambda x: custom_tokenize(x['Title'])[0], axis=1).to_frame('First Word'))
# ninties_extras = ninties_extras.join(ninties_extras.apply(lambda x: custom_tokenize(x['Title'])[0], axis=1).to_frame('First Word'))
women = women.join(women.apply(lambda x: custom_tokenize(x['Title'])[0], axis=1).to_frame('First Word'))
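
Spot-checking the tokenizer on a few invented titles shows why the custom pattern matters: contractions and hyphenated words stay in one token, though leading punctuation still becomes its own token:

# invented examples for the first-word tokenizer
for t in ["Ain't No Mountain High Enough", "Semi-Charmed Life", "(Don't Fear) The Reaper"]:
    print("%s -> %s" % (t, custom_tokenize(t)[0]))
# Ain't No Mountain High Enough -> Ain't
# Semi-Charmed Life -> Semi-Charmed
# (Don't Fear) The Reaper -> (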

Publication Years

For the 90s A-Z, the realtime list used a single JSON request to get the list of songs, and it contained the publication year. That might just be a lot easier than using MusicBrainz. For the 885 Songs by Women there's no such feed, so we'll need to use MusicBrainz below.

In [17]:
def fetch_years():
    # the 90s-specific playlist page has a JSON feed with years,
    # so we can just fetch that as
    # https://origin.xpn.org/countdown/2022/2022_90s_az.json
    # one catch: unlike the normal playlist, it stuffs escaped escapes into "specials",
    # for example "Don\\'t need no \\'cape here" rather than "Don't need no 'cape here"
    #
    # their data isn't much better than what I did myself;
    # for the moment, drop anything outside 1990-1999.
    years = pd.DataFrame(None, columns=['Artist', 'Title', 'Album', 'Year'])
    az_url = 'https://origin.xpn.org/countdown/2022/2022_90s_az.json'
    page = requests.get(az_url)
    for track in page.json():
        if len(track['releaseDate']) == 4:
            release_year = int(track['releaseDate'])
            if release_year < 1990 or release_year > 1999:
                release_year = 0
        else:
            release_year = 0
        years = years.append({'Artist': track['artist'].replace("\\", ""),
                              'Title': track['song'].replace("\\", ""),
                              'Album': track['album'].replace("\\", ""),
                              'Year': release_year},
                             ignore_index = True)
    return years
    
In [18]:
# ninties = ninties.merge(fetch_years(), on = ['Artist', 'Title', 'Album'], how='left')
# ninties['Year'] = ninties['Year'].fillna(value=0).astype(int)
# print "Of %d 90s tracks, %d had valid dates and %d did not" % \
#     (len(ninties), len(ninties[ninties['Year'] > 0]), len(ninties[ninties['Year'] == 0]))
# ninties_extras = ninties_extras.merge(fetch_years(), on = ['Artist', 'Title', 'Album'], how='left')
# ninties_extras['Year'] = ninties_extras['Year'].fillna(value=0).astype(int)
# print "Of %d 90s extra tracks, %d had valid dates and %d did not" % \
#     (len(ninties_extras), len(ninties_extras[ninties['Year'] > 0]), len(ninties_extras[ninties_extras['Year'] == 0]))

MusicBrainz Data

MusicBrainz is a free online music database with an XML web service, supported in Python via the musicbrainzngs library. I'd originally used it to get publication years for the 2016 countdown, but abandoned it in 2017 since that year's playlist page had lists by year. Since there's no year data in 2023's 885 Greatest Songs by Women, I'm bringing it back.

There are a couple of potential issues with querying MusicBrainz:

  • MusicBrainz has its own rules about how to enter data that don't always match WXPN's, so sometimes searches fail due to data mismatches.
  • As a free, volunteer-based service, there's no guarantee the data is there, though their data set is quite complete.
  • Finding the right recording is an art at best. My general approach has been to look for the oldest official release of any recording matching the title and artist. That mostly works.

And when that all fails, I've been known to resort to just searching various music sites and manually updating the data.

One consequence is that we'll always lag on publication year data during the running of the playlists.

In [19]:
def add_musicbrainz_data(playlist, min_year = 1900, cache_file = None):
    """
    Add data from the musicbrainz database.  Currently just first year of publication.
    The input data frame should contain at least Title and Artist fields
    and the resulting dataframe will have a new Year field.
    The cache file if used, should have been generated by a previous run of
    this function.
    Using a cache is strongly encouraged,
    as the MusicBrainz search interface is rate limited to one search per second,
    so this can be very slow for large playlists.
    
    Args:
        playlist (Dataframe) : playlist to update
        min_year (int)       : minimum year to consider
        cache_file (string)  : path to cache file
         
    Returns:
        Dataframe containing the augmented playlist
    """
    import musicbrainzngs as mb
    mb.set_useragent('xpn-a2z', '0.1','https://github.com/asudell/a2z')
    
    # keep a list of artists named differently
    # at MusicBrainz than XPN, so we can 'fix' them
    artist_names = {
        "R. E. M.": "REM",
        "Run-DMC": "Run-D.M.C.",
        "The Ramones": "Ramones"
    }
    
    # load the cache if we have one
    if cache_file is not None and path.exists(cache_file):
        years = pd.read_csv(cache_file, encoding='utf-8')
        years = years.drop_duplicates()
    else:
        years = pd.DataFrame(None, columns=('Title','Artist', 'Year'))
    
    augmented = playlist.merge(years, how = 'left')
    
    # Lookup any unaugmented rows
    new_mb_rows = []
    for index, row in augmented[augmented['Year'].isnull()].iterrows():
        if row['Artist'] in artist_names:
            artist = artist_names[row['Artist']]
        else:
            artist = row['Artist']
        result = mb.search_recordings(row['Title'],
                                      artist = artist,
                                      status = 'official',
                                      strict = True,
                                      limit = 25)
        rel_year = None
        for recording in result['recording-list']:
            if recording['release-list']:
                for release in recording['release-list']:
                    if 'date' in release and len(release['date']) > 0:
                        y = int(release['date'].split('-')[0])
                        if rel_year is None or rel_year > y:
                            if y >= min_year:
                                # assume years before 1900 are typos
                                rel_year = y
        if rel_year is not None:
            new_mb_rows.append([row['Title'], row['Artist'], rel_year])
    
    new_years = pd.DataFrame(new_mb_rows, columns=('Title','Artist', 'Year'))
    # if we found new data, resave the cache and rebuild the augmented data
    if len(new_years) > 0:
        years = years.append(new_years, ignore_index=True)
        years = years.drop_duplicates()
        if cache_file is not None:
            years.to_csv(cache_file, index=False, encoding='utf-8')
        augmented = playlist.merge(years, how = 'left')
    
    return augmented
In [20]:
women = add_musicbrainz_data(women, 1900, path.join(musicbrainz_cache_dir, 'women_years.csv'))
# pandas doesn't support NAs in int columns, so set missing years to 0
women['Year'] = women['Year'].fillna(0).astype(int)
# save a copy of anything without a year for manual review
women_missing = women[women['Year'] == 0][['Title', 'Artist']]
women_missing.to_csv(path.join(musicbrainz_cache_dir, 'women_need_years.csv'),
                    index=False, encoding='utf-8')

Checking The Results

In [21]:
HTML(women.head(5).to_html())
Out[21]:
  | Artist                 | Title          | Album             | Air Time            | Duration | Position | Letter | First Word | Year
0 | Martha & The Vandellas | Nowhere To Run | Dance Party       | 2023-12-05 08:00:37 |        4 |      885 | N      | Nowhere    | 1965
1 | Belly                  | Gepetto        | Star              | 2023-12-05 08:04:47 |        3 |      884 | G      | Gepetto    | 1992
2 | The Pointer Sisters    | I'm So Excited | So Excited!       | 2023-12-05 08:08:05 |        4 |      883 | I      | I'm        | 1982
3 | Mary J. Blige          | Real Love      | What's The 411?   | 2023-12-05 08:12:41 |        4 |      882 | R      | Real       | 1992
4 | Brandi Carlile         | Dreams         | Give Up The Ghost | 2023-12-05 08:17:08 |        7 |      881 | D      | Dreams     | 2009
In [22]:
HTML(women.tail(5).to_html())
Out[22]:
    | Artist          | Title         | Album                                  | Air Time            | Duration | Position | Letter | First Word | Year
880 | Tracy Chapman   | Fast Car      | Tracy Chapman                          | 2023-12-12 17:29:02 |        5 |        5 | F      | Fast       | 1988
881 | Brandi Carlile  | The Story     | The Story                              | 2023-12-12 17:34:02 |        6 |        4 | T      | The        | 2005
882 | Dolly Parton    | Jolene        | Jolene                                 | 2023-12-12 17:40:31 |        2 |        3 | J      | Jolene     | 1974
883 | Joni Mitchell   | A Case Of You | Blue                                   | 2023-12-12 17:43:12 |        7 |        2 | A      | A          | 1971
884 | Aretha Franklin | Respect       | I Never Loved A Man The Way I Love You | 2023-12-12 17:50:44 |        3 |        1 | R      | Respect    | 1971

Saving The Data

In [23]:
# ninties_data_file = path.join(data_dir, '90sA2Z.csv')
# ninties.to_csv(ninties_data_file, index=False, encoding='utf8')
# ninties_extras_data_file = path.join(data_dir, '90sextras.csv')
# ninties_extras.to_csv(ninties_extras_data_file, index=False, encoding='utf8')

women_data_file = path.join(data_dir, 'women.csv')
women.to_csv(women_data_file, index=False, encoding='utf8')

Availability

The code for this project is in my GitHub repo, and this file specifically is Dataloading2.

License

This project is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. You are free to use it for commercial or non-commercial purposes, so long as you attribute the source and also allow sharing.