Jordi Villar

The Algorithm Behind Shazam

No neural networks, just signal processing and a hash table

I never really thought about how Shazam works. You hold your phone up, it tells you the song, you move on. Then I read this article by Per Thirtysix and got curious enough to dig in.

What I found surprised me. The core algorithm was published back in 2003 by Avery Wang[1]. No machine learning, no neural networks. Just signal processing and a hash table. The kind of thing you can implement in a few hundred lines of code.

So I did. Pick a clip and follow along.

The raw signal

This is what your phone’s microphone captures: amplitude over time, thousands of pressure values per second. You can see where the sound is loud and where it’s quiet, but you can’t tell what notes are playing. Two completely different songs can produce waveforms that look almost identical.

The information is there. It’s just not in a useful form yet.

Looking inside the sound

Here’s where it gets interesting. Instead of looking at the signal as amplitude over time, you chop it into short overlapping windows of about 100 milliseconds[2] and run a fast Fourier transform (FFT)[3] on each one. The FFT decomposes each window into its frequency components: which frequencies are present and how loud they are.

Stack the results side by side and you get a spectrogram. Time on the x-axis, frequency on the y-axis, brightness is intensity. Now you can see the music. A piano note shows up as a bright horizontal line at its fundamental frequency. A chord is multiple lines. This is the representation that makes matching possible.
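To make the windowing step concrete, here is a minimal sketch in plain Python. It is not the demo's actual code: the function names, the 50% overlap (`hop=512`), and the Hann window are my assumptions, and the textbook recursive FFT is only there to keep the example free of dependencies. The 1024-sample window at 11025 Hz matches the numbers in footnote 2.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def spectrogram(samples, window_size=1024, hop=512):
    """Return a list of time slices, each a list of magnitudes per frequency bin."""
    # A Hann window tapers each chunk toward zero at its edges (assumed choice).
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / window_size)
            for i in range(window_size)]
    slices = []
    for start in range(0, len(samples) - window_size + 1, hop):
        chunk = [samples[start + i] * hann[i] for i in range(window_size)]
        spectrum = fft(chunk)
        # Real input gives a mirrored spectrum, so keep bins 0..window_size/2.
        slices.append([abs(c) for c in spectrum[: window_size // 2 + 1]])
    return slices

# Example input: a pure 440 Hz tone sampled at 11025 Hz.
SAMPLE_RATE = 11025
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(4096)]
spec = spectrogram(tone)

# Each bin covers SAMPLE_RATE / window_size Hz (about 10.8 Hz here).
loudest_bin = max(range(len(spec[0])), key=lambda b: spec[0][b])
print("loudest bin:", loudest_bin, "~", loudest_bin * SAMPLE_RATE / 1024, "Hz")
```

The loudest bin lands next to 440 Hz rather than exactly on it, because each bin is roughly 10.8 Hz wide. That coarseness is fine here: the later steps only need peaks to land in the same bin every time.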

Finding landmarks

The spectrogram is dense. Most of it is noise or low-energy content that won’t survive the trip from speaker to microphone in a bar. Shazam keeps only the loudest points, the ones most likely to survive compression, background noise, and speaker distortion.

The algorithm divides the frequency range into bands and picks the strongest peak in each band per time slice. The result is a sparse “constellation map”: just a few hundred points out of millions. Those points are the landmarks that survive real-world conditions.
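The peak picking can be sketched like this. The band edges, the `min_magnitude` floor, and the function name are illustrative assumptions, not values from the paper or from Shazam. The toy spectrogram is made up so the result is easy to check by eye.

```python
def constellation(spec, band_edges, min_magnitude=1.0):
    """Return (time_index, freq_bin) for the strongest bin in each band of each slice.

    spec          -- list of time slices, each a list of magnitudes per bin
    band_edges    -- bin indices that split the frequency axis into bands
    min_magnitude -- hypothetical floor so near-silent bands add no peak
    """
    peaks = []
    for t, magnitudes in enumerate(spec):
        for lo, hi in zip(band_edges, band_edges[1:]):
            band = magnitudes[lo:hi]
            if not band:
                continue
            best = max(range(len(band)), key=lambda i: band[i])
            if band[best] >= min_magnitude:
                peaks.append((t, lo + best))
    return peaks

# Toy spectrogram: 2 time slices, 8 frequency bins, split into 2 bands.
toy = [
    [0.1, 5.0, 0.2, 0.1, 0.3, 9.0, 0.2, 0.1],
    [0.1, 0.2, 6.0, 0.1, 0.3, 0.2, 0.4, 0.1],
]
peaks = constellation(toy, band_edges=(0, 4, 8))
print(peaks)
```

The upper band of the second slice contributes nothing because its strongest bin is below the floor. Without some floor like that, silence would still produce one "peak" per band.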

Turning landmarks into hashes

You could compare constellation maps directly, but that means checking every point against every point in every song. Slow. Instead, Shazam turns them into hashes by pairing each peak with nearby peaks. Each pair produces a hash from two frequencies and the time gap between them. Something like 46|222|10 means “a peak at bin 46 paired with a peak at bin 222, ten time slices apart.”

Why pairs and not single peaks? A single peak isn’t distinctive. Plenty of songs have a peak at bin 46. But a pair with a specific time gap is much rarer, and that rarity is what makes each hash a useful fingerprint.
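A rough sketch of the pairing step, using the same `freq|freq|gap` string format as the example above. The `fan_out` and `max_gap` limits are assumed parameters for illustration, and real systems pack the three numbers into an integer rather than a string. Each hash is stored with the time of its first peak, which the matching step needs later.

```python
def make_hashes(peaks, fan_out=5, max_gap=50):
    """Pair each anchor peak with up to `fan_out` later peaks.

    peaks -- list of (time_index, freq_bin) tuples
    Returns a list of (hash_string, anchor_time) tuples.
    """
    ordered = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(ordered):
        paired = 0
        for t2, f2 in ordered[i + 1:]:
            gap = t2 - t1
            if gap > max_gap:
                break  # peaks are sorted by time, so the rest are too far away
            if gap == 0:
                continue  # skip peaks in the same time slice
            hashes.append((f"{f1}|{f2}|{gap}", t1))
            paired += 1
            if paired == fan_out:
                break
    return hashes

peaks = [(0, 46), (3, 120), (10, 222), (70, 46)]
hashes = make_hashes(peaks)
for h, t in hashes:
    print(h, "anchored at slice", t)
```

The last peak at slice 70 pairs with nothing because it is more than `max_gap` slices from every other peak. Limiting the gap and the fan-out keeps the number of hashes per song roughly proportional to the number of peaks.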

Now matching each hash is an average-case O(1) hash table[4] lookup. A random pair is highlighted in pink on the constellation map above.

Finding the match

Looking up hashes tells you which songs share the same frequency patterns. But lots of songs might share individual hashes by coincidence. The trick is time-offset agreement.

For each hash hit, you compute song_time - clip_time. If the clip really comes from a particular song, most hashes will agree on the same offset. They all point to the same position in the recording. Wrong songs produce random scattered offsets. The correct song has a massive spike at one offset. That’s your match.
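The voting can be sketched in a few lines. This assumes each hash is stored alongside the time of its anchor peak. The song names, hashes, and times below are invented for the example, and `build_index` and `best_match` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def build_index(songs):
    """songs: {name: [(hash, time), ...]} -> {hash: [(name, time), ...]}"""
    index = defaultdict(list)
    for name, hashes in songs.items():
        for h, t in hashes:
            index[h].append((name, t))
    return index

def best_match(index, clip_hashes):
    """Vote on (song, song_time - clip_time) and return the biggest bucket."""
    votes = Counter()
    for h, clip_time in clip_hashes:
        for name, song_time in index.get(h, ()):
            votes[(name, song_time - clip_time)] += 1
    if not votes:
        return None
    (name, offset), count = votes.most_common(1)[0]
    return name, offset, count

# Made-up fingerprints: the clip starts 100 slices into song_a.
songs = {
    "song_a": [("46|222|10", 100), ("120|222|7", 103), ("30|90|4", 110)],
    "song_b": [("46|222|10", 40), ("30|90|4", 7)],
}
clip = [("46|222|10", 0), ("120|222|7", 3), ("30|90|4", 10)]

index = build_index(songs)
result = best_match(index, clip)
print(result)
```

Here song_b shares two hashes with the clip, but they disagree on the offset (40 and -3), so each gets a single vote. All three song_a hits agree on offset 100, which is the spike the histogram shows.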

The histograms below show exactly that: the winner’s bars all stack at one offset, while the runner-up’s are scattered.

What this demo doesn’t do

This is just a way of replicating the Shazam algorithm at a high level, not a production system. A few things the real Shazam does differently:

  • Scale. Shazam’s index has tens of millions of songs. This demo has five. At that scale you need a proper database, not a JSON file.
  • Noise robustness. The real system is tuned for microphone capture in noisy environments. These clips are clean recordings, which is why the match is so clean.
  • Covers and remixes. The fingerprint is tied to the exact recording. Same song, different performance, different spectrogram. It identifies recordings, not compositions.

The elegant part

What I find elegant about this approach is that every step has a clear reason. The FFT reveals frequency content. Peak picking keeps only robust features. Hashing makes lookup fast. Time-offset voting makes matching robust. Each step strips away what you don’t need while preserving what you do.

No training data, no model weights, no hyperparameters. Just signal processing and a hash table. Published in 2003 and still the foundation of how audio recognition works today.

Footnotes

  1. Wang, Avery. “An Industrial Strength Audio Search Algorithm.” Proceedings of the 4th International Society for Music Information Retrieval Conference (ISMIR), 2003.

  2. ~93ms in this demo (1024 samples at 11025 Hz), rounded to 100ms in the text. Longer than the original paper uses, but it keeps the spectrogram visually legible.

  3. Probably the algorithm I’ve used the most out of everything I learned at university.

  4. Yes, they are everywhere.
