A trip through sex, data, and rock ’n’ roll
In December I had to give a presentation on a topic of my choice.
Are you following? Yeah, you know the type!
Predictably, the topic I picked involved Discogs. Yeah, you know me too well!
At first I considered simply showcasing some Venn diagrams or random insights and fun facts gleaned from its raw data but, alas, it turns out that data is only interesting if it has more than one dimension. Prose is not data. Discogs isn’t exactly “Big Data” either. Privacy was another aspect worth diving into.
After months of research through discussions, webinars, videos and articles, I had to conclude that the individual topics were far too deep and convoluted to condense into an informal one-hour presentation for my colleagues.
But at the very heart of this contentious data there was a message I could easily convey: Where does it come from, and why is much of it garbage?
Data is useful only if you have control over it.
As it turns out, many others have realised this too, and there are several organisations vying for this position of power while Discogs, along with other music databases such as MusicBrainz, has been at the forefront all along – perhaps even without knowing it.
What follows is a redacted and updated version of the original presentation.
Testing, testing, 1 – 2 – 3!
OK, let’s get going. Let’s rock.
In this presentation I’d like to point out something that’s been lacking in the modern music business.
It’s not content. It’s not variety, nor is it sales. There’s more music out there than there has ever been before. We listeners, music fans, consumers, customers – we’ve got so much to choose from that it’s impossible not to get lost in the selection of music and songs and anything else that you might want to listen to. It’s overload!
And some of it is filled with wrong data.
Have you ever looked for a specific song or an artist, and the results are completely off?
That’s music metadata not working. And have you ever wondered how it got there, and who enters it, and what standards and decisions have to be made to make it useful?
We’re going to scratch the surface using one particular database that you may or may not have heard of as an example.
According to the latest IFPI report, we’re listening to more music than before.
We’re consuming more music than before, and I’m sure we’re all aware that streaming is currently the most popular format.
Few people still buy physical media in the form of CDs, but Japan and Germany remain the top consumers of compact discs, although, according to the latest “Global Music & Chart Report” (Billboard) from September this year, Japan is now the fastest growing streaming market.
Many people may have taken to buying vinyl records again — some for the first time ever. Even cassettes are making a minor comeback.
We all have a bunch of favourite songs, and we know the name of the songs and the performers, maybe the writer or composers, and we may have bought a few albums of theirs or have dedicated Spotify or Amazon playlists, or whatever. Music, in whatever form, is an important part of humanity’s collective culture — no matter where you’re from, what you like to listen to, or how you listen.
But music is also data.
The graphs above show revenue figures, and those are generally well-documented because… accountants!
What I’m referring to specifically is recorded music. Recorded music and its carriers are surrounded by a lot of metadata, and some of it is missing or wrong because… artists!
This presentation will take you down rabbit holes you may not have known existed by pointing out some of the “behind the scenes” activities. There are discussions and decisions you may never think about as you queue up the next song on your music player.
Bear with me because we’re going through a range of musical genres. Don’t laugh!
Here’s a German “Volksmusik” legend by the name of Alfons Bauer.
Alfons plays an instrument called an alpine zither, and he is the king of cool.
I mean, just look at this guy!
This guy rocks! Alfons is the life of the party. Ed Sheeran has nothing on him!
Alfons was so prolific that he put out hundreds of records during his lifetime and appears on thousands more the world over, in different translations, and on different labels.
Here, check this out: I was close to throwing away this record but then decided to use it as an example for this presentation. It may say “Alfons Bauer” on the front but not all tracks are by Alfons Bauer.
On the back it lists “Alfons Bauer und seine Almdudler” for some tracks while others are credited to “Alfons Bauer mit seiner Gautinger Hackbrettmusi”. And here’s an EP billed to “Alfons Bauer Mit Seinen Böhmerwald-Musikanten”. What’s going on?
Are these all the same group or the same ensemble? Who knows! It’s not clear.
In fact, Alfons was so busy that a service like Spotify has trouble telling apart or grouping together names of Alfons’ variations, basic abbreviations or even rudimentary translations. Oh, and Apple Music isn’t much better.
And that’s just one artist. Alfons is dead. He’s not likely going to complain about the few cents that his music on Spotify or Deezer would’ve netted him today. He’s not Eric Clapton or Prince with a greedy estate and devoted fans behind his name.
“In the music world, metadata most commonly refers to the song credits you see on services like Spotify or Apple Music, but it also includes all the underlying information tied to a released song or album, including titles, songwriter and producer names, the publisher(s), the record label, and more. That information needs to be synchronized across all kinds of industry databases to make sure that when you play a song, the right people are identified and paid. And often, they aren’t.” — The Verge
So where do these inconsistencies come from?
What are their sources, and who collects and delivers the data? In fact, where does it get collected, and what database is there? Doesn’t somebody check it?
If you think that institutions or agencies like GEMA, the RIAA or the IFPI keep track of these things, then the answer is no. In many countries, this would be the task of the national library (in Germany, the Deutsche Nationalbibliothek and, specifically, the Deutsches Musikarchiv; Frankfurt and Leipzig should each hold one so-called “Pflichtexemplar”, a mandatory deposit copy). On Spanish and Italian-language releases, you may have seen the term “depósito legal” (legal deposit) which pertains to physical media.
There isn’t a single, common and global service that everyone the world over can access and that features details of every single song that has ever been released or every single artist that has ever put out an album, let alone one that is 100% accurate!
There is no such thing right now.
Libraries tend to collect physical copies of the object whereas the streaming services such as those by Apple, Deezer, Tencent Music, Spotify, last.fm, Pandora, Tidal, SoundCloud, Qobuz, YouTube, Bandcamp, Amazon and dozens of others will have certain (read: many) songs. They have digital copies of the music as a service that they can (re-)sell to you, and sales or streaming figures are easy to quantify.
But which music? Digital manifestations with songs that potentially sell, music that customers potentially want to listen to, and work whose performers and writers would certainly like to be paid for. To be fair, their catalogues do include a large number of unsigned or amateur artists as well as podcasts, so from a consumer’s point of view their services are definitely attractive. But don’t make the mistake of confusing their stock with librarianship; there are strictly commercial interests at play here.
Each track you know and have ever heard comes with a load of metadata. The obvious kind of metadata is the song’s title and performer. In the case of classical music, it would typically be the composer and the opus (or a movement thereof), and often also the orchestra and conductor, and perhaps a featured soloist. Like Alfons.
Think of a random song title: What additional info springs to mind? Who produced it? Which album is it taken from? There, that’s some of the metadata we’re talking about.
The song has a duration or playing time, and it has a year (date) when it came out, and maybe even the name of the album or EP it came out on. Humans are visual creatures, so the album or whatever will have some form of artwork – nowadays simply described as the cover – and that’s usually the immediate visual representation when you think of a song. And I haven’t even started on music videos.
Now that we’ve looked at the basics, let’s go a little deeper.
Who wrote those songs? Where was it recorded, who mixed it? Who played bass, who played keyboards, what other tunes may have been sampled? Who owns the rights to the song? Who’s the publisher, who would I contact and who gets paid if I want to legally release a cover of the tune? This should be objective metadata – factual and undisputed.
But it’s not consistently documented. Often, it’s simply missing or has been forgotten.
Where’s the band from? What language is the song in? Who is the target audience?
How would you best describe the song? Is it a Christmas song? Is it a rock song, or pop or hip-hop? Cloud rap or vaporwave or acid jazz or lolicore or ambient? Is it a parody or a cover of something else? What mood or key is the song in… is it a happy or a sad song? What tempo? Can you dance to it? One person’s rave is another person’s house. Certain soft and subjective attributes may even change over time because yesterday’s hip-hop is today’s pop, and what was shocking to our parents is completely lame to our kids today.
In the days of shellac, vinyl, 8-tracks and cassettes we relied on liner notes and inserts to provide us — the consumer, the listener — with this information. Even the advent of the compact disc (a digital medium, in case you forgot) or the MiniDisc sort of forgot to include metadata in its specifications (although they’re capable of supporting it). And if you’ve ever read through those liner notes you may have found a connection between one album and another so that you might seek out similar work of the same producer or used the name of the mastering engineer to dig for a different album for the simple reason that you like their sound. You didn’t need AI to make those recommendations for you!
Then, in the nineties, things got really out of control.
If you’re of a certain age group and have ever ripped a music CD to MP3, you’ve probably come across the names “CDDB”, “gnudb”, “FreeDB” or “Gracenote”. There are programs like MP3 file taggers that look up these databases via an API. There is a program called “Picard” by MusicBrainz that looks up their database to automatically fill in track titles, artists and even thumbnail artwork for you.
Personally, I’ve always done it by hand — manually — because I don’t trust them enough; I’ve seen my share of errors because that information came from the general public — not the record companies themselves! And even they often make mistakes.
Whatever information you found online, it was usually crowdsourced.
Gracenote, which you’ve now seen mentioned on a few slides, started in 1993 as an open-source project named CDDB. The data behind it was crowdsourced. The data was valuable – so valuable, in fact, that Sony paid 260 million dollars for it in 2008. It’s now owned by Nielsen, which paid 560 million dollars for it about five years ago.
Music metadata is big business. Everybody wants a slice of this pie.
“At first, Metadata might sound like an insignificant little thing, but consider the following. Every time a user searches for a song on Spotify; every time BMI attributes performance royalties; every time Pandora’s algorithm queues up a song — metadata is at play. It’s the oil that makes the cogs of the industry spin.” – Soundcharts
We’ve already seen that the music industry is recovering from a slump. Revenues are climbing steadily now that the internet (which basically obliterated the music industry) is busy rebuilding it. Music streaming has reached a critical stage of acceptance by both the consumer and the music providers. And the music has some metadata to go with it.
For instance, if you were to ask Alexa to play you the latest Taylor Swift album, she’d easily be able to do that, and if you were to ask her to play you the first song that Taylor Swift ever wrote, Alexa might be able to do that for you too. But if you were to ask Alexa to play you the first song that Ed Sheeran ever wrote, she could just be stumped. She’ll probably play you the wrong song.
Why? Alexa doesn’t know this because she doesn’t have the information, or the information is incomplete – even though it’s no secret that Ed Sheeran himself is trying to buy back all copies of the first album he ever put out: a CD-R from when he was still in school, which he is embarrassed about today.
During this new shift towards digitisation and towards streaming the music industry itself realised that it was lacking metadata about its own songs, or that information wasn’t passed on to streaming providers, or it was in the wrong format because Spotify requires one schema while Apple wants it in another, and Tidal has their own database standard… you get the idea.
Then consider the situation with Hungarian names, which are often given as surname first, then first name (so that John Smith is listed as Smith John). You have the same situation with Japanese names, which might have been listed in Kanji, Hiragana, Katakana or some romanised form. If you remember the early-’80s synthpop band Yazoo, in the USA they had to use the name Yaz, and The Beat became the English Beat in the USA and the British Beat in Australia, and so forth.
Albums may have been renamed for release in other countries, or they might include bonus tracks exclusive to certain territories like Japan.
Another situation: How would you treat artist names like Simon & Garfunkel? Is it a compound name in the form of the duo “Simon & Garfunkel”, or is it two separate artists — Paul Simon and another fellow called Art Garfunkel — who collaborated?
Danny Wilson, Matt Bianco, Harvey Danger, Gnarls Barkley, Johnny Socko, Alice Cooper… what are they: names of people, or are they bands? What about our friend Alfons and all his bands? Questions, questions!
You can now well imagine how many artists there are that are mixed up, or how much revenue may be unallocated or misdirected elsewhere because of inconsistencies across the supply chain.
In an MIT paper about the peer-to-peer music sharing phenomenon from 2002 (that’s almost 20 years ago) it was already predicted that “music will be a service, not a product.”
“As wireless connectivity delivers what the user wants, whenever they want it, the desire to own ‘molecules’ decreases. MP3’s will gradually fade as a downloadable medium, as bandwidth increases and users have the ability to stream content.”
Mr. Ghosemajumder also wrote that this was on the condition that consumers could find and discover the music they wanted. Remember that this was from even before the advent of smartphones and is reliant on accurate metadata from trustworthy sources.
Because if they can’t find it, consumers will go elsewhere.
But it’s 2021! Why is this still an issue in this day and age?
Because it’s difficult. It’s not easy to extrapolate each piece of data, each value, each name into specific fields of a global database – if there were such a thing.
As a matter of fact, there are several databases.
Some are large, some focus on specific genres, and only very few aim to be complete.
Who’s ever heard of any of these? What do you use them for? Is anyone in the audience an active member?
[crickets]
Here, for example, are the artist pages for Pink Floyd at MusicBrainz, Musiksammler, the Deutsche Nationalbibliothek, Spotify, and Apple Music.
It’s clear that the databases are somewhat dry and text-heavy whereas the stores and streaming services are the most… let’s say certifiably pretty and flashy.
But have you ever wondered where that information comes from?
How is it structured and interlinked and edited and verified? Most of it is crowdsourced and shoehorned into a regulated set of fields, using a variety of rules and standards.
And as an example I’m going to use Discogs.
The story of Discogs is a simple one: It was launched in late 2000 by an Intel engineer named Kevin Lewandowski who wanted to catalogue his techno records (yes, it’s actually older than Wikipedia). He eventually got other volunteers to join in and help add data.
The company is based in Portland, Oregon, with another major office in Amsterdam.
At the time of writing, the database boasts some 15 million releases, almost 8 million individual artist names, and some 1.8 million labels (although it must be said that “labels” include record companies, pressing plants, and recording locations in the form of studios, stadiums and even churches). It requires a degree of inside knowledge to make sense of these numbers. It can get very complicated.
The top user has contributed over 54,000 new releases. Just to explain: a “release” is an individual entry, like a record or a CD or a 7-inch single — and everything it entails.
By comparison, user #100 has added over 7,000 items, and me, personally, I just hit the 1,500 new entries milestone a few weeks ago. It’s a hugely active and devoted community.
I became aware of the site somewhere around the year 2002 or so because, well, I was missing metadata. Indeed, I was a bad boy, and the stuff you’d have found on all those disruptive services like Napster and eMule and Gnutella was missing the relevant tags or was highly questionable. Discogs had much of the information I was looking for at the time, and I regarded it as fairly trustworthy.
But I also had new information. I had metadata to share with the world. Over the years I’ve been a vocal contributor to the site and the community, more often than not stretching the site’s limits. And in case anyone’s wondering – I’ve actually been buying more music as a result, and no, I did not pay for that Alfons Bauer record!
Any music – or indeed audio carrier – is potentially eligible for entry into Discogs. I say “potentially”, because that CD-R collection of your favourite tunes you made for your car 20 years ago is not, nor are those old tapes of stuff you recorded off the radio as a kid. There are nonsense entries that get removed every day.
Just about anything that has sound on it and was available to the public is eligible. This includes the CD you bought at Media Markt or the record from the record store down the road (if you can still find one), or the CD-R you got directly from the band after the show, or the free CD you got with a magazine, as well as lathe cuts, white labels, test pressings, promos, demos, audiobooks, and so forth.
There’s a range of audio carriers out there that most people will never have heard of.
Music has been made available on an almost absurd range of formats and mediums — far beyond the records and CDs and tapes which most of us would be familiar with.
Here are photos of some of the media I mentioned.
Just about everything is supported, and there are rules and norms for all of them.
Now that’s just the media carriers – the physical aspect of it. A database should cater for each possible format it wants to support, and each format will have its own set of rules. And what’s on the media carrier? Songs, usually. Tracks, specifically, and they have titles.
OK, so we’ve got an item of media with a song list — pretty straightforward so far.
There’s a first song, and there’s a last song. The list of songs is given on the track listing on the music carrier or the supporting booklet or sleeve. Usually they correspond. I say “usually” because sometimes there are mispresses, misprints, hidden tracks, or the order doesn’t match. Of course we have rules and guidelines for how to deal with that too.
I also mentioned “first and last track”. A sequence.
A sequence suggests numbering, so that a CD or an SACD or a Betamax tape (which are single-sided media) has tracks numbered 1 to whatever (but a maximum of 99 for a CD).
Records and cassettes have two sides. One side is called “A”, and the other “B” – even if the label sometimes says “Side 1” and “Side 2”. Double LPs have sides numbered “A” and “B” and “C” and “D” and so forth — you get the idea. A database of this magnitude and complexity needs to have standards, and these are the ones that have been chosen to make it easy for users entering the data, and understandable for those reading or using the data either through the web interface itself or externally via an API.
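To make the numbering convention a little more tangible, here is a minimal Python sketch. The function name `track_positions` and the combined form "A1" (side letter followed by a running number) are my own assumptions for the sake of illustration; they are not quoted from any guideline.

```python
from string import ascii_uppercase


def track_positions(tracks_per_side, two_sided):
    """Return position labels for the tracks of one release.

    tracks_per_side: list of track counts, one entry per side
                     (a single entry for single-sided media such as a CD).
    two_sided:       True for records and cassettes, False for CDs and the like.
    """
    if not two_sided:
        # Single-sided media: plain running numbers 1, 2, 3, ...
        total = sum(tracks_per_side)
        return [str(n) for n in range(1, total + 1)]

    if len(tracks_per_side) > len(ascii_uppercase):
        raise ValueError("this sketch only knows sides A to Z")

    labels = []
    for side_index, count in enumerate(tracks_per_side):
        side = ascii_uppercase[side_index]  # A, B, C, D, ...
        # Assumption: numbering restarts at 1 on every side.
        labels.extend(f"{side}{n}" for n in range(1, count + 1))
    return labels


print(track_positions([3], two_sided=False))
print(track_positions([2, 2], two_sided=True))
```

A double LP would simply be passed in as four sides, for example `track_positions([1, 1, 1, 1], two_sided=True)`.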
As an aside, there was a thread on Twitter a few months ago where people argued that the first side of an album is “Side 1” and the flip side is “Side 2” – but on a 7-inch single it’s “A” and “B”? It was amazing what different answers you saw, often depending on which side of the Atlantic people were on.
And here we’re talking about something as rudimentary as the numbering for a sequence of song titles?! Now imagine trying to set standards that make sense to everyone around the world — with as little manipulation of the source data as possible.
Standards ensure that information transcribed from elsewhere is normalised.
Letter casing is our favourite bugbear.
Sometimes an album will have its track titles written in upper case, sometimes in mixed case, or mIXed CAse on the sleeve, UPPERCASE on the labels – pretty much anything goes. Vaporwave and chiptune music thrive on strange character renderings and emojis as track names.
For the sake of readability Discogs has applied a “First Letter Uppercase” normalisation standard. We regularly get criticised for it. Every month there’s some rookie who complains that it doesn’t apply to his language, and it’s wrong… and blah blah blah!
We know this. Those rookies are usually Dutch, French or German. Scandinavians don’t seem to care, Russians are cool with it, Japanese don’t understand because they have no uppercase or lowercase symbols, and English speakers have begrudgingly accepted it.
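For the curious, a naive version of such a normalisation could look like the Python sketch below. It is only an illustration: the function name is made up, and lowercasing the rest of each word is my own simplifying assumption (it would mangle acronyms such as "DJ", which a real guideline has to deal with separately).

```python
def first_letter_uppercase(title):
    """Capitalise the first letter of every space-separated word.

    Assumption for this sketch: the remaining letters are lowercased.
    Scripts without letter case (Japanese, for example) pass through unchanged.
    """
    words = title.split(" ")
    return " ".join(word[:1].upper() + word[1:].lower() for word in words)


print(first_letter_uppercase("mIXed CAse"))
print(first_letter_uppercase("über den wolken"))
```

The second call shows exactly what the complaints are about: a German speaker would not capitalise "den", but the rule does it anyway.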
For those who understand German, I’ll just leave you with this:
Admittedly, the site is somewhat western-centric in content but there’s also a large amount of Chinese, Japanese, Arabic, Cyrillic, Armenian, Hangul, Mongolian, Hebrew or Bangla music included – and I mention this only because initially these scripts and languages weren’t supported. Discogs is quite international.
Even within mere track titles there are certain rules that specifically address dance music with its myriad versions and remixes, such as within this promo version of the Blue Man Group’s rendition of the classic “I Feel Love” as an example.
It has nine tracks. They’re all remixes – so that you have “Track Title (Version Name)”. That’s the standard we use.
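Because the pattern is so regular, it can be taken apart again by software. The following Python sketch splits a title of the form "Track Title (Version Name)" into its two parts. The function name and the version names in the example are hypothetical and not taken from the actual release.

```python
import re

# A title, optional whitespace, then one trailing parenthesised version name.
_VERSION_RE = re.compile(r"(?P<title>.*\S)\s*\((?P<version>[^()]+)\)\s*")


def split_version(track_title):
    """Return (title, version); version is None if there is no trailing '(...)'."""
    cleaned = track_title.strip()
    match = _VERSION_RE.fullmatch(cleaned)
    if match is None:
        return cleaned, None
    return match.group("title"), match.group("version").strip()


print(split_version("I Feel Love (Extended Mix)"))  # hypothetical version name
print(split_version("I Feel Love"))
```

A title that consists of nothing but brackets, such as "(Untitled)", is deliberately left alone because there would be no title left over.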
For classical music, we have what are called “index tracks” to indicate the title of an opus and sub-tracks to indicate their individual movements.
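In data terms, that is a small tree: one heading with a list of children. Here is a minimal Python sketch of the idea. The class and field names are my own invention and say nothing about how Discogs actually stores this.

```python
from dataclasses import dataclass, field


@dataclass
class SubTrack:
    position: str  # e.g. "1" or "A1"
    title: str     # the individual movement


@dataclass
class IndexTrack:
    title: str                                      # the opus as a whole
    sub_tracks: list = field(default_factory=list)  # its movements, in order


def render(index_track):
    """Return the heading followed by one indented line per movement."""
    lines = [index_track.title]
    lines += [f"  {sub.position}  {sub.title}" for sub in index_track.sub_tracks]
    return lines


symphony = IndexTrack(
    "Symphony No. 5 In C Minor, Op. 67",
    [SubTrack("1", "Allegro Con Brio"), SubTrack("2", "Andante Con Moto")],
)
print("\n".join(render(symphony)))
```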
And that’s just track titles. Can you imagine how complicated the rest can get?
Let’s cover something different: Labels.
A label is a company’s branding — much like its identity.
We’d all be familiar with names such as EMI or Universal, Sony, Vertigo, Columbia, CBS, Motown and so forth – these are some larger mainstream labels you would’ve probably heard of. Those are just some of the 1.8 million label entries I mentioned earlier.
Of these 1.8 million, some 188,000 are “pseudo labels” — not something I’m thrilled about personally — which serve as umbrella names just to keep everything by a certain artist or bogus operation under one roof. Yes, by “bogus operation” I refer to those that pretend to be a genuine label – mostly counterfeits. We have standards on how to capture and deal with those too — but I’ll spare you the details here.
Like I said, these are well-known labels we’re familiar with; there are thousands upon thousands more, and not unlike with pop singers or bands, there are fans who are fiercely loyal to a label’s entire product range – along the lines of “if it’s on this label, it’ll be good”. Then, of course there are those people who collect an entire label’s output simply because that’s what completist collectors do.
That’s our audience: hardcore collectors and nerds (and fanboys too).
Obviously there are thousands of independent or underground labels the world over, and there are others completely obscure because they put out one single album before disappearing from the scene entirely, or how about those DIY labels whose material came out on cassette only, with photocopied inserts – they’re highly sought after today and get traded for hundreds of dollars. I’m certainly not going to list those but what I can show you are some companies – brands, in fact – that take on the role of labels that you certainly will have heard of and that, in their own right, have put out music by themselves.
Some might surprise you.
Then we also capture studios and recording locations like the legendary Sun Studios and Abbey Road, and recordings made at the Bataclan, Wembley Stadium, the Oberhausen Arena, or even the Notre Dame Cathedral.
Why? Because that’s what a database is for: to capture quantifiable and identifiable data, and to present it in such a way that if you’re curious about information you will be able to find it and then make your own connections.
Also thrown into this collective are “series”, like you would typically encounter with compilations of the current top hits such as the “Now That’s What I Call Music!” series, or something Germans might be familiar with: “Future Trance” or “Bravo Hits” or “The Dome” or older ones like “Der Große Preis” or “Ronny’s Pop Show” and so forth. There are series of audiobooks like “Die drei Fragezeichen” or “Bibi und Tina” or “John Sinclair”.
There’s hundreds of them from all over the world, and a series name offers another data point – along with the series number – to follow a sequence of items, neatly grouped together in a chronological order. The facility didn’t exist initially and was added following user requests — very much like studios and other recording locations I mentioned above.
What started as a simple list of techno records became a complex network of data sets describing titles, abstract ideas like music as well as artwork, places and people.
This brings me to the next point.
I just mentioned that the “Now That’s What I Call Music!” series may not be familiar to the German members of the audience. It’s a British series that was launched in 1983, and the title has been licensed to other territories such as Mexico, Denmark, Japan, Argentina, and Israel, each with their own selection of hits. Americans only picked it up in 1998.
Why is this worth mentioning?
As the database started filling up, the number of these became unwieldy and cluttered, and we had to figure out a way to differentiate the different countries’ editions. Initially we named them “Now That’s What I Call Music! (UK)” and “Now That’s What I Call Music! (Mexico)”, but these names weren’t exactly in accordance with our own standard.
So how do you keep things apart that have the same name? We just add a number.
That’s how we disambiguate. The original UK series is number one (so it doesn’t have a number). A South African series launched next about a year later (so that was made number 2), New Zealand then joined in as well and became number 3 – and so forth. It’s worth pointing out that once this set of data records became impracticably large, we sat down and discussed how to best sort and number them. Things are not fixed, the data is flexible — but only to the point of getting more accurate and more useful as time goes on.
The downside is that it’s made data entry – especially for new users – increasingly difficult over the years. Discogs has a steep learning curve.
Then there are dedicated fields for values such as copyright information which can help in determining not only the copyright owner (and effectively the label) but they would help you in dating a product – such as an album which could have a ℗ date of 1974 but there’s also a © 2007 date that clearly tells you that it’s not the original product but a reissue, and who it’s been reissued by.
Furthermore, a 1974 version of an LP will most certainly not have a UPC/EAN barcode as those only started getting introduced on records in 1979 in the USA — and gradually elsewhere in the world. There’s a dedicated field for it too, and because it ought to be a unique value it can help identify the exact pressing of a record in your hands.
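Barcodes also carry a built-in sanity check: the last digit of a UPC or EAN is a check digit computed from the others, so many typing errors can be caught automatically. The Python sketch below shows the standard calculation for 12-digit UPC-A and 13-digit EAN-13 codes. The function names are mine, and stripping spaces and dashes before checking is an assumption about how the value might have been typed in.

```python
def normalise_barcode(raw):
    """Keep only the digits; barcodes are often written with spaces or dashes."""
    return "".join(ch for ch in raw if ch in "0123456789")


def has_valid_check_digit(raw):
    """Return True if raw is a UPC-A or EAN-13 code with a correct check digit."""
    digits = normalise_barcode(raw)
    if len(digits) == 12:
        # A UPC-A code is an EAN-13 code with a leading zero.
        digits = "0" + digits
    if len(digits) != 13:
        return False
    # The first twelve digits are weighted 1, 3, 1, 3, ... from left to right.
    weights = [1, 3] * 6
    total = sum(int(d) * w for d, w in zip(digits[:12], weights))
    expected = (10 - total % 10) % 10
    return expected == int(digits[12])


print(has_valid_check_digit("4006381333931"))
print(has_valid_check_digit("0 36000 29145 2"))
```

A valid check digit only tells you the number was transcribed plausibly; it says nothing about whether it belongs to the record in your hands.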
I hope I’m not boring you to sleep yet!
Similar dedicated fields exist for ASIN, price codes, SPARS codes, and SID codes (introduced by Philips in 1992, those can be found on the clamping ring on the underside of a CD with a prefix of IFPI). Unique to Brazil is a so-called “ABPD Code” which is like a batch ID on optical media that tells you whether it’s a first press or a repress, and how many items were made (we haven’t yet created a dedicated field for those).
Then, those scribbles and weird hieroglyphics you may have noticed in the deadwax of vinyl records? They can tell you where that record was pressed or who mastered it.
These are things we learn as we go along, reach a wider audience, and fill the database.
Along those lines, who knows what this is… most Germans will have seen this before?
It’s a Label Code – every major label has one – and it’s a unique 4-digit or 5-digit music label identification code that is assigned by the German “Gesellschaft zur Verwertung von Leistungsschutzrechten” (GVL). The “Labelcode” was created by them in 1976 and introduced by the IFPI in 1977 in order to unmistakably identify who gets to exploit and sell the recordings on that media carrier. In fact, its very presence will tell you that your record will likely not be of Canadian or Malaysian origin.
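Since the code is just the letters "LC" followed by four or five digits, it is easy to fish out of transcribed sleeve text. Here is a small Python sketch. The function name is made up, the sample codes are invented for the example, and tolerating a space or dash between "LC" and the digits is my assumption about how people type it.

```python
import re

# "LC", optional spaces or dashes, then exactly four or five digits.
_LABEL_CODE_RE = re.compile(r"\bLC[\s-]*(\d{4,5})\b")


def extract_label_codes(text):
    """Return the digit part of every Label Code found in the text."""
    return _LABEL_CODE_RE.findall(text)


# The codes below are invented for illustration only.
print(extract_label_codes("Made in Germany LC 0123 Stereo"))
print(extract_label_codes("LC-12345"))
```

Anything with too few or too many digits is ignored rather than guessed at.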
Much of it is like “reverse engineering” because there’s no global authority that describes and details any of this – and those that do exist today and try to standardise it only came along after Discogs was launched. Many have probably adopted some of our ideas and yes, even we occasionally refer to the competition at, let’s say Musiksammler, who happen to keep the most concise list of Label Codes. We’re all amateurs – although it must be said that there are industry professionals and musicians and producers and DJs and librarians and archivists within our community.
All of these minute details are important; that’s what the core audience of Discogs users demand — there are hardcore collectors who own dozens of different pressings of the same album because… collectors!
We’ve mentioned SID codes and barcodes and label codes, we’ve mentioned some series and record companies and labels and even studios and track titles but there’s one critical data point I’ve not really talked about yet in the context of Discogs: Artists.
Who’s the performer, who gets top billing?
We call them the “main artist” in Discogs, and of course there is another set of rules and standards surrounding the artist name, too. In most cases it’s really stupid simple: If it’s an album by Pink Floyd you enter “Pink Floyd”, and that becomes a linked entry so that you can see a list of all albums and singles and compilations and videos by Pink Floyd.
For a compilation by various artists you’d enter “Various”, if it’s sound effects there’s a placeholder artist entry called “No Artist”, and there’s an author named “Traditional”.
And at what point does an album become a “Various Artists” compilation or album? Where do you draw the line?
On Discogs we have an unwritten rule which recommends that four main-artist-level album artists are acceptable. Now, in case anyone’s wondering how it is that we get to make up these arbitrary standards then, well… Apple, Spotify, Symphonic and others had to come up with similar rules.
This particular rule at Discogs is some 10 or 15 years old, so it’s a bit of a case of chicken vs. egg in that I’m not sure if it’s Apple and Spotify who were inspired by our guidelines or someone in the forum discussion followed theirs — but I like to think that we independently came up with a similar recommendation. We’ve all got the same problems.
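For the programmers in the room, that rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the fallback to “Various” once there are more than four main artists is my reading of the unwritten rule, not an official Discogs algorithm.

```python
# Hypothetical helper. The limit of four comes from the rule of thumb
# described above; falling back to "Various" beyond it is an assumption.
MAX_MAIN_ARTISTS = 4

def main_artist_credit(artists):
    """Return the main-artist credit for a release as a list of names."""
    # Drop duplicate names while keeping the billing order.
    unique = list(dict.fromkeys(artists))
    if len(unique) > MAX_MAIN_ARTISTS:
        return ["Various"]
    return unique
```

Four names are kept as billed; a fifth tips the release over into “Various”.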
Even the Simpsons have their own page because… that’s how they’re presented!
Santa Claus is credited as a main artist, and so is Barbie, and the Smurfs, Donald Duck, the Muppets, and Beavis & Butthead (another compound artist), and all sorts of other weird entries because that’s how they’re billed. Strange as it may seem at first, we treat them no different to bona fide artists like Pink Floyd or Billie Eilish.
And in case you’re wondering if that’s absurd — well, Spotify and Deezer do the same. Duh!
Now, let’s get real: Who’s this album by, who’s the main artist here?
Correct. It’s Adele. But which one?
[Somebody shouted, “The one and only Adele!”]
Ah, but this particular Adele has been slotted into the database at #3, so she’s Adele (3).
Erhmmm… wait, what? Surprise! She’s not the first artist to use the name “Adele”.
Her début album came out in 2008, Discogs has been going since 2000, so there were two other artists with that name that had been entered into the database before she even showed up. There are at least another 17 artists named “Adele” in the database right now.
Like I explained before with the NOW series, that’s how we disambiguate. The first entry gets that name explicitly; all others have a numerical suffix appended.
For example, there’s a Korean boy band who have become hugely popular in the western market in recent years. Anyone know them? BTS. They’re entry number 4 for BTS. It gets worse: at the time of writing, there are 690 artists named just plain “Alex”!
Similarly, there are 112 entries for artists named “Alice”.
And those numbers are mostly fixed. They get assigned, and that’s how they should stay. Adele will likely remain Adele (3) for as long as Discogs is around.
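If you want to picture the numbering scheme, here is a minimal sketch in Python. The class and method names are invented for illustration and this is not how Discogs is actually implemented; it only shows the idea that the first entry keeps the plain name and each later one gets the next number, which then never changes.

```python
from collections import defaultdict

class ArtistRegistry:
    """Hypothetical registry that hands out disambiguated artist names."""

    def __init__(self):
        # How many entries have been created so far under each plain name.
        self._count = defaultdict(int)

    def register(self, name):
        # The counter only ever goes up, so a number that has been
        # assigned once is never handed out again.
        self._count[name] += 1
        n = self._count[name]
        return name if n == 1 else f"{name} ({n})"

registry = ArtistRegistry()
first = registry.register("Adele")   # the first entry keeps the plain name
second = registry.register("Adele")
third = registry.register("Adele")   # the third entry becomes "Adele (3)"
```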
I said “mostly” because — occasionally, every once in a while — a user does come along to point out that the name originally chosen wasn’t correct. Or it was misspelt all along.
OK, so this won’t likely happen with someone like Adele but was the case recently with a guy named John Strohm. There’s a “John Strohm (2)” because there’s a “John Strohm” (#1), and there’s also an entry for someone named “John P. Strohm” which people initially thought was a mistake. After a bit of a discussion and research all three names were combined into one entry for John P. Strohm.
It’s constant work, and we’re constantly getting better and more accurate.
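A merge like the John Strohm one boils down to repointing every credit from the duplicate entries to the one that survives. Here is a rough Python sketch of that idea. The function name and the release titles are placeholders I made up, and the real process involves discussion and voting rather than a single function call.

```python
def merge_artists(credits, duplicates, target):
    """Return a copy of the credits with duplicate entries repointed to target."""
    return {
        release: [target if artist in duplicates else artist for artist in artists]
        for release, artists in credits.items()
    }

# Placeholder release titles; only the artist names come from the example above.
credits = {
    "Release A": ["John Strohm"],
    "Release B": ["John Strohm (2)", "Someone Else"],
    "Release C": ["John P. Strohm"],
}
merged = merge_artists(
    credits,
    duplicates={"John Strohm", "John Strohm (2)"},
    target="John P. Strohm",
)
```

The original dictionary is left untouched, so the change can be reviewed, or reverted, before it replaces anything.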
You can therefore imagine how chaotic things might become for people with vanilla names like Sam Smith, John Williams, Jack White, or even Mabel. A few years ago I wrote a guest post for the Discogs blog describing this all too common situation with a few gentlemen named Scott Wilson. What I’m talking about pertaining to database management may seem obvious to this audience – but I must stress once again: there need to be methods and processes in place that allow not only the detection of mistakes but also a means to report, discuss, and then correct them.
It is at this point where errors get very personal because artist names have real people behind them. Like with the Scott Wilson example above, some take it very personally because, after all, there’s credibility and royalties involved. I’ve seen many artists who treat their Discogs artist page like their professional LinkedIn business profile (which is good, we’re proud if they do) but they have little control over it.
Then you get the scenario of Alice Cooper and Alice Cooper, or Marilyn Manson and Marilyn Manson.
With Alice Cooper, it was a guy named Vincent Furnier who renamed himself Alice Cooper after the band Alice Cooper had split; in the second case, it’s Marilyn Manson who’s in charge of a band that bears his name.
Confused yet? You can’t dare call yourself a music database if you don’t tell them apart.
While we’re on this particular topic: Marilyn Manson the person is not in the band Marilyn Manson. The head honcho of the band is a fellow called Brian Warner (2).
Erhmmm, what?
Yes, I said “Brian Warner”, because that’s his real name – much in the same way that Ringo Starr is not the drummer for the Beatles: it’s a bloke called Richard Starkey.
How did this come to be… what’s this all about? Simple. Performers, artists, whatever often use multiple names (we call them “aliases”) sometimes simultaneously, sometimes to differentiate their work, sometimes for the hell of it — and in order to link them to a band they’re in we had to make a decision: which of several possible names should we relate to their band or bands?
Which method can you use that applies to every artist equally?
A lot of people don’t like that it’s Paul Hewson who is the lead singer of U2, and not Bono, or that it is Saul Hudson who’s the guitarist with Guns N’ Roses, and not Slash. But hey: standards! You have to have standards, you have to set rules.
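To make the alias rule concrete, here is a small Python sketch. The dictionaries and function names are invented for this example, the member lists only contain the people mentioned in this talk, and none of it reflects the real Discogs data model. The point is simply that band membership is stored once, under the real name, and every alias resolves to it.

```python
# Alias -> real name, using the examples from this talk.
ALIASES = {
    "Bono": "Paul Hewson",
    "Slash": "Saul Hudson",
    "Ringo Starr": "Richard Starkey",
    "Marilyn Manson": "Brian Warner (2)",  # the person, not the band
}

# Band -> members, stored under real names only (deliberately incomplete).
MEMBERS = {
    "U2": ["Paul Hewson"],
    "Marilyn Manson": ["Brian Warner (2)"],  # the band, not the person
}

def real_name(name):
    """Resolve an alias to the real name; unknown names pass through unchanged."""
    return ALIASES.get(name, name)

def is_member(name, band):
    """Check membership by real name, whichever alias was supplied."""
    return real_name(name) in MEMBERS.get(band, [])
```

Keeping people and bands in separate lookups is what lets “Marilyn Manson” mean two different things without the two getting mixed up.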
Sometimes, though, an artist comes along who doesn’t want to be artist number 121, and this can cause major havoc and teeth-grinding. Some of them whine like babies… they insist on being John Smith number 1. They are the real John Smith, all the others are irrelevant, they’re the one who has a new tune out on SoundCloud and must promote it. Oh, those people can be fun to reason with!
Here’s a recent one: A couple of months ago some American rapper named John Anthony showed up and hijacked the profile of the existing John Anthony entry. When this was politely pointed out to him, he responded that he is “a independent rapper. An artist who is making music TODAY, not some 1960’s British whoever who nobody ever cares about. Just wait in a few years and you’ll see who the REAL John Anthony is.” [sic]
Now spot the irony: The rapper’s real name is “Anthony Giovanni Blaise Grandinetti”, so he can’t really claim the name “John Anthony”, and the British whoever happens to be a producer who worked with the likes of Genesis, Roxy Music and Queen, and he eventually became the head of A&R at A&M Records in New York. Not exactly a “nobody”!
Teeth were gnashed, knuckles were rapped, damage was reverted, and as for the rapper? Well, I assigned him the next “John Anthony” slot. Number 23 was still available — and what’s all the more tragic about the situation is that on other services he is/was also mixed up with different artists of the same name. Metadata gone rogue.
The problem, by and large, is that artists have the singular motivation to get their name noticed. A database doesn’t. We’re music fans, of course, but we don’t really care. We apply a “controlled vocabulary” of our own tried and tested standards which were codified into a set of guidelines using the most neutral and objective approach possible. It’s impossible to expect those directly mentioned to stay impartial because musicians, producers, DJs, and label reps obviously have their own agenda and sometimes do weird stuff just to get their tunes and their names listed in the database. Everybody wants to be in Discogs, it seems.
Usually they’re willing to listen to reason and settle in to the rules and guidelines, and everyone’s happy because, ultimately, this is also in their interest. But sometimes they barge in, make a mess, and then give up completely, so we moderators try to make sense of it and whip it into some sort of passable structure, and sometimes it gets outright deleted because it’s all a complete joke or unidentifiable gobbledegook. There’s a lot of trolling going on. Some people are plain stupid, and some are very nasty. Disgusting, even.
Bored teenagers on the internet are a pest.
We had a situation not long ago where someone submitted a record which doesn’t exist, entering titles that say that such and such person is a rapist and did this and that, and all sorts of accusatory rubbish. Of course it gets removed as soon as it’s reported.
It’s outright slander and vandalism. It’s like spraying graffiti or swastikas on a wall.
Then there are some users who think adding a t-shirt with Iron Maiden’s name qualifies, or some poster of their favourite band, or mugs. When I said that everybody wants to be on Discogs, it includes a lot of weirdos. Some people are just looking for attention.
Another example I can mention was the case of a certain avant-garde artist who added a slice of pizza as a “music release”. And he had an explanation for it, too:
It’s a vaccum sealed slice of pizza brought back from Malta in February 2019 as part of a collaborative project.
… This is a conceptual soundwork. Limited edition of 1.
… as publisher of the release and the author of the content I think that I am more qualified than you are to say whether or not it is an audio format – it may baffle you but a lot of others would agree. I say it is a valid audioformat in both a physical and metaphysical way. It is decaying and producing sound , it may be quiet but with the right devices you can listen. How is this not an audio format ? Also if you throw it against a wall it makes a sound . [sic]
Sure thing, dude! The entry was duly deleted.
Due to recent incidents it became necessary to put a mechanism in place that allows Discogs to “lock” artist entries if necessary, so that only certain users may edit them. Wikipedia naturally has similar problems and similar safeguards. Apparently not everyone appreciates the fact that Discogs has artist entries for certain controversial characters who did, in fact, output or are responsible for bona fide recordings.
The hatred that sometimes gets directed at us is akin to a mob burning down libraries because they don’t like that certain books are available there. Those are extreme cases.
Usually, it’s just some artists who come along and try to remove their entire discography because they don’t want to be associated with some bootleg remixes they did, or the pop album they made twenty years ago suddenly doesn’t correspond with their religious beliefs now that they’ve found Jesus, or how about the guy who tried to remove his old black metal band from the database because he “doesn’t want his family to find out”.
Well, it’s not like their records or tapes suddenly disappear off people’s shelves, is it?!
These things exist. They are out there. People have them. Remove it from the database today and someone else will re-add it tomorrow – if not here, then elsewhere. Because it exists. Someone, somewhere has a copy of that record. Why am I stressing this?
A database of this kind needs to be policed and monitored, and there need to be mechanisms in place to prevent random acts of vandalism.
Unlike Apple, Discogs doesn’t censor anything. But make no mistake — there is a swathe of culturally insensitive, violent, vulgar, racist or pornographic material out there, whether in lyrical content, track titling, or cover art. We’ll vigorously defend its database entry.
All we do is document what’s out there because it exists.
And no, I’m not going to show you examples of those!
Right, so I’ve talked about metadata in the world of rock ‘n roll.
We’ve even talked about pizza. When do we get to the “sex” part?
Now. We’re going to talk about sex now or, more specifically, gender identity.
A delicate issue that was raised in forum discussions a few times in recent months was that of “deadnames”. Wikipedia defines Deadnaming as the use of the birth or other former name (i.e., a name that is “dead”) of a transgender or non-binary person without the person’s consent. This is relevant because there are musicians who’ve changed genders, and it’s a far more pressing issue than Cat Stevens now performing as Yusuf Islam or Cassius Clay changing his name to Muhammad Ali.
Recently we had a Filipina complain that Discogs still showed the name [redacted] which, according to her, is a deadname. She now goes by the name of “Jasmine Ng Labrador” and wrote that the mere mention of her previous name causes her emotional anguish, describing it as “targeted harassment”.
In all fairness, how are you supposed to deal with this?
Likewise, there’s punk guitarist Tom Gabel who now goes by the name of Laura Jane Grace, or metal guitarist Dan Martinez who transitioned to Marissa Martinez-Hoadley, and of course there’s Walter Carlos who became Wendy Carlos (the composer of the soundtracks to The Shining, A Clockwork Orange, and Tron).
These facts are known and in the open. The artists, by all appearances, have no issues with this being known, and most normal people don’t really make a big deal about it either.
Some do, causing a major conflict of interest: From an archivist’s or librarian’s point of view, it’s impossible to change the name appearing on the record standing on my shelf. That DVD of Juno will for all eternity read that the pregnant girl it’s about was played by someone named “Ellen Page”, even if she recently decided to transition to “Elliot Page” and Netflix, IMDb, Spotify and Wikipedia rushed to change all credits and title cards to the new identity, surely on account of snowflake populist pressure.
Digital and streaming platforms are apparently just as fluid in that regard. On Discogs we will not do this because yes, he/she/they do indeed have an entry courtesy of a song on the soundtrack to the film Juno. And that song will always be credited to “Ellen Page”.
Because that’s what we do: We transcribe information. We do not rewrite history.
You cannot unprint a book. Artists and people don’t always get their way. That’s just the way it is, and it may even be in contravention of certain laws such as Article 17 of the GDPR, which offers the right to erasure (better known as “the right to be forgotten”) and thus puts every library, every archive, and every database in an uncomfortable situation.
Give this a bit of thought.
Now, to conclude: That’s one database. Those are some of the standards and norms Discogs has, and some examples of typical data capture and manipulation issues.
Is it perfect? Absolutely not, no.
Nor will it ever be complete, and it’ll never be 100% accurate, despite — or maybe because of — thousands of international users constantly working on entering and updating and cleaning data. The world has changed in the twenty years since it launched. Requirements and concerns have changed.
Earlier in this presentation I said that there was no global database with such a level of information. That was a small lie. As we’ve just seen, Discogs is one, MusicBrainz is another contender – although neither is or will ever be complete. That’s impossible to achieve.
But the music industry itself hasn’t stood still either.
Over the last few years, they seem to have settled on something called DDEX ERN, the “Digital Data Exchange Electronic Release Notification” Message Suite Standard. It’s an XML specification which is…
“…used to enable content owners/administrators to inform DSPs about new releases that are available for distribution, and the terms and conditions under which those releases can be made available. It includes complete metadata about the release and all resources contained within the release. In addition, it includes information about the “deals” that describe when, where and how the release can be made available.”
The standard is currently at version 4.
Signatories include Apple, SoundExchange, numerous mechanical rights societies, Amazon, Google, Pandora, and a host of other record companies and vendors, which certainly sounds impressive.
Except… that it covers digital audio – downloads and streams — and, guess what?! As it turns out, they have the exact same issues with disambiguation and canonical artist names and collaborative artist names and capitalisation standards and version names – just about every topic we’ve covered with the legacy releases we shoehorn daily into Discogs.
Other considerations include nicknames (Bruce Springsteen AKA “The Boss”) and — something rarely covered in written articles — international pronunciations due to users’ dialects and accents, now that voice-controlled smart speakers are commonplace.
A few weeks ago I attended a DDEX webinar about “Venue Identification” wherein topics such as unambiguous definitions and parameters of a studio’s location or that of live concert and festival recordings were discussed. It didn’t go far. There is much to be done.
But why should I care? I’m not in the music business!
I’m not your stereotypical record collector either. In fact, why do I even bother?
Because it makes you appreciate the importance of nuggets of otherwise trivial data.
It helps you understand the reasoning and background scenarios behind something as banal as a few simple words or digits that you may find on the record or CD in your hand. Music metadata and Discogs are as fascinating as they are frustrating. It’s my geek porn.
According to Maria Sicilia’s “Study of the motivations behind a crowdsourced online discography”, Discogs manages to fulfil my altruistic and intrinsic personal needs by allowing me to participate in (what I perceive through rose-tinted glasses as) a grand project for the greater good. It provides a sense of accomplishment and a sense of belonging as I learn and collaborate with fellow data hoarders.
So, the next time you find something to watch on Netflix or Hulu, or the next time there’s a suggestion made for an artist that you could like on Spotify or Apple Music, or the next time you’re looking for a book to buy, or an audiobook to listen to on Audible, or whenever you look up movie credits on IMDb or almost anything on Wikipedia, please do spare a moment of thought for the keyboard warriors, database guardians and administrative nerds behind the scenes who argue, bicker, standardise, curate, enter, and/or safeguard those little snippets of metadata that you casually take for granted.
As always, thanks for watching. Go away now!
All photos and slides by hmvhDOTnet unless specified otherwise.