May 31, 2014

Sound decoding: A tale of visualization

Let's keep informed by listening to the radio

Who isn't thrilled by mystery spy stories and espionage, by challenges and ciphers to understand and decrypt? I was recently fascinated by The Conet Project, twenty years of recordings of unknown broadcasts on shortwave radio. The public distribution of these recordings created some small communities striving to comprehend the messages.

Most of the recordings are numbers spoken by text-to-speech software, probably meant to be combined with a one-time pad in order to retrieve a message. But some of them contain tones which probably map to numbers. Let's find these numbers!

Down the rabbit hole while riding a snake

We'll concentrate on track 39 of disk 2 of The Conet Project, which sounds like this:

Here is the link to the audio if your browser does not support HTML5.

It begins with a repetitive pattern, probably to draw the attention (and not ¡Atención!) of potential listeners. Then, at 1:30, a preamble-like pattern emerges, which gives way at 1:33 to a one-second clock-like pattern followed by a message. The booklet of the recording states:
"High Pitch Polytone: 5 Figure groups, represented by tones, the lowest being a space. No real information on these stations, save that they may be related to M12."
M12 is the Enigma identifier of a Russian-operated number station. More information is available on the Priyom entry for M12. Recent recordings of M12 seem to be monotonal Morse, but anyway... I'll let your mind ramble on these ideas.

Fantasies aside, how can we visualize and extract the data contained in this particular track? By gleaning various intertubes references such as this or this, I came up with the satisfactory amplitude + spectrogram chart shown in this figure:
As the zoomed figure shows, the tones are clearly visible. Each one is defined by its fundamental frequency, the darkest red bar at the bottom of the spectrum chart, and its overtones, the lighter red bars. These figures were obtained by this Python code:

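A minimal sketch of that computation, using matplotlib's mlab.specgram (the same DFT machinery that backs the plotting call) on a synthetic square-like 430 Hz tone standing in for the recording:

```python
import numpy as np
from matplotlib import mlab

rate = 44100
t = np.arange(rate) / rate                    # one second of signal
tone = np.sign(np.sin(2 * np.pi * 430 * t))   # square-like 430 Hz tone

# 512-sample blocks, one spectrogram column per block
spec, freqs, times = mlab.specgram(tone, NFFT=512, Fs=rate, noverlap=0)

# The strongest bin of the first column sits near the fundamental
peak = freqs[spec[:, 0].argmax()]
print(round(peak))
```

For the actual figures, plt.specgram draws this same array directly, with the amplitude plot stacked on top.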

This code uses the specgram functionality of matplotlib, which is based on the Discrete Fourier Transform (DFT), to obtain the frequencies contained in blocks of data. I chose blocks with a fixed size of 512 samples, so every column of the spectrogram is the Fourier transform of one 512-sample block of the audio signal. This size represents 11.6 ms of signal and allows the analysis of frequencies from 22,050 Hz (half of 44,100 Hz, because of the Nyquist frequency) down to 43 Hz ( $ \dfrac{44100 \; \mathrm{Hz}}{2 \cdot 512} $ ). All that is well and interesting, but how can we automatically find the values represented by each tone?
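Those figures follow directly from the sample rate and block size:

```python
fs = 44100   # sample rate, Hz
nfft = 512   # block size, samples

print(nfft / fs * 1000)   # block duration: ~11.6 ms
print(fs / 2)             # Nyquist frequency: 22050.0 Hz
print(fs / (2 * nfft))    # lowest analyzable frequency: ~43 Hz
```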

The plan is to take every column of the spectrogram, namely the Fourier transform of each block, and find its global maximum. While we're at it, let's also take all the local maxima, which will give us the shape of the emitted tone. Here is what a single column looks like when the color of the spectrogram is mapped to the y-axis:

The top chart shows the previous spectrogram and highlights the band at $ t = 2 \; \mathrm{s} $. The middle chart shows the Fourier transform. The bottom one shows the same transform but with a logarithmic y-axis, or, as we are used to seeing it, enlarged to show texture. As we can see, there is clearly a single prominent fundamental frequency around 430 Hz followed by its overtones. We can also determine that the tone was synthesized up to ~13.1 kHz; the rest seems to be noise. Interesting fact: the even harmonics are much more attenuated than the odd harmonics. This is because the generated tone is an approximation of a square wave. The code for this is available here.
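That attenuation can be verified numerically: an ideal square wave contains only odd harmonics, whose amplitudes fall off as 1/k. A quick check, assuming a 430 Hz square wave like the observed tone:

```python
import numpy as np

rate = 44100
t = np.arange(rate) / rate
square = np.sign(np.sin(2 * np.pi * 430 * t))   # ideal 430 Hz square wave

spectrum = np.abs(np.fft.rfft(square))
freqs = np.fft.rfftfreq(len(square), 1 / rate)

# Magnitude at the first five harmonics: the odd ones dominate,
# the even ones are essentially zero
for k in range(1, 6):
    amp = spectrum[np.abs(freqs - k * 430).argmin()]
    print(k, round(amp))
```

The recorded tone is only an approximation of a square wave, which is why its even harmonics are attenuated rather than absent.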

So, the only thing remaining is to find the peaks of the Fourier transform. This can be done using the find_peaks_cwt function of scipy, which convolves wavelets with the signal to find the desired features. But the data points of our frequency signal are spaced too far apart for this kind of processing. This is caused by the size of the blocks we used to compute the Fourier transform: the wider we are in the time domain, the narrower the bins are in the frequency domain, and vice versa. This could be solved by invoking specgram() with a larger NFFT parameter, but another way is to resample the frequency signal.
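Here is a sketch of the idea on a toy column, a Gaussian bump for the fundamental plus two overtones (all values illustrative): without resampling, each bump spans only a couple of ~86 Hz bins, too few for the wavelet convolution.

```python
import numpy as np
from scipy.signal import find_peaks_cwt, resample

rate, nfft = 44100, 512
freqs = np.arange(nfft // 2 + 1) * rate / nfft   # bin centres, ~86 Hz apart

# Toy column: a 430 Hz fundamental and two overtones, as in the recording
column = sum(a * np.exp(-0.5 * ((freqs - f) / 120) ** 2)
             for f, a in [(430, 1.0), (1290, 0.33), (2150, 0.2)])

# Resample 8x so the peaks are wide enough for the wavelet convolution
dense = resample(column, len(column) * 8)
dense_freqs = np.arange(len(dense)) * freqs[1] / 8

peaks = find_peaks_cwt(dense, widths=np.arange(5, 30))
print([round(dense_freqs[p]) for p in peaks])
```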

Let's apply this peak-finding function to every block in our recording. After that, the only thing left is to merge similar consecutive data together in order to get the beginning and the end of each tone. To do so, we'll look for similar frequencies with a high amplitude. In a single-tone signal, the most powerful harmonic is called the fundamental frequency. If the fundamental frequency of the following block is within a certain percentage margin of the fundamental frequency of the current block, it should indeed be the same tone.
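A sketch of that merging step, assuming we already have one fundamental estimate per spectrogram column (the 3% margin is an illustrative choice):

```python
def merge_tones(fundamentals, margin=0.03):
    """Group per-block fundamentals (Hz) into (start, end, mean_freq) tones.

    Consecutive blocks belong to the same tone as long as their fundamental
    frequencies stay within `margin` (a relative tolerance) of each other.
    """
    tones = []
    start = 0
    for i in range(1, len(fundamentals)):
        if abs(fundamentals[i] - fundamentals[i - 1]) > margin * fundamentals[i - 1]:
            tones.append((start, i, sum(fundamentals[start:i]) / (i - start)))
            start = i
    tones.append((start, len(fundamentals),
                  sum(fundamentals[start:]) / (len(fundamentals) - start)))
    return tones

# Three blocks around 431 Hz, two around 300 Hz, then a jump to 520 Hz
print(merge_tones([430, 432, 431, 300, 301, 520]))
# → [(0, 3, 431.0), (3, 5, 300.5), (5, 6, 520.0)]
```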

Once we have the number of tones and their frequencies, we'll decode their values. We can get a pretty good idea of what we are dealing with by analyzing a histogram of the frequencies:
We can forget the outlier at 133 Hz, which is the buzz or scratch at the end of the recording. We can see the four most common tones from the pattern in the first minute and a half: ~305 Hz, ~400 Hz, ~340 Hz, ~422 Hz. Another interesting note: the data ranges from 300 Hz to 521 Hz. Curiously, this range gives roughly 10 notes on the equal-tempered scale. We can try to put the tones in bins between E♭4 and C5. This gives us the following key number sequence:
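The binning itself is a nearest-semitone lookup. Numbering keys as on a standard 88-key piano (A4 = 440 Hz is key 49, so E♭4 is key 43 and C5 is key 52), a sketch might look like:

```python
import math

def piano_key(freq_hz):
    # Nearest key on an 88-key piano: A4 (440 Hz) is key 49
    return round(12 * math.log2(freq_hz / 440)) + 49

# The four most common tones of the opening pattern
print([piano_key(f) for f in (305, 400, 340, 422)])
# → [43, 47, 45, 48]
```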
[43, 47, 45, 48, 43, 45, 43, 47, 45, 48, 43, 48, 45, 48, 43, 47, 45,  48, 43, 45, 43, 47, 45, 48, 43, 47, 45, 48, 43, 47, 45, 48, 43, 45,  43, 47, 45, 48, 43, 48, 45, 48, 43, 48, 46, 48, 43, 45, 44, 48, 45,  48, 43, 48, 45, 48, 43, 47, 48, 45, 45, 45, 48, 43, 45, 43, 48, 45,  45, 48, 43, 48, 45, 48, 43, 48, 45, 48, 43, 45, 43, 48, 45, 48, 43,  48, 45, 48, 43, 47, 45, 48, 43, 45, 43, 51, 51, 43, 43, 43, 43, 43,  43, 45, 52, 46, 47, 49, 43, 45, 52, 45, 46, 48, 43, 47, 51, 49, 46,  52, 43, 51, 48, 44, 47, 52, 43, 51, 45, 50, 46, 43, 46, 50, 47, 45,  48, 43, 46, 52, 49, 47, 43, 49, 51, 46, 43, 46, 49, 48, 45, 43, 45,  49, 50, 51, 50, 43, 45, 47, 50, 45, 48, 43, 45, 50, 52, 46, 43, 45,  49, 45, 52, 46, 44, 45, 47, 48, 52, 49, 43, 45, 48, 50, 43, 45, 47,  50, 45, 43, 48, 45, 47, 51, 43, 51, 52, 48, 44, 50, 43, 50, 45, 49,  50, 48, 43, 49, 46, 52, 46, 48, 43, 49, 52, 49, 52, 51, 43, 48, 51,  45, 48, 50, 43, 45, 47, 49, 48, 43, 45, 50, 48, 49, 43, 50, 49, 50,  48, 43, 48, 46, 52, 47, 48, 43, 47, 48, 52, 50, 52, 44, 52, 44, 52,  44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52]
This was given by the code over here. Note that it is only a quick proof of concept patched together in an evening; many improvements could be made to this code. Things like enhanced robustness (an adaptive constant for NFFT, for instance) or averaging the harmonics over the whole tone instead of taking only the harmonic footprint of the first sample would greatly benefit it.

Using Lilypond, this script (sorry for the hardcoding) can generate a beautiful sheet-music representation of the tones. Note lengths were omitted for this figure; only pitch was considered.
Another representation may be of interest: text. As the booklet suggests, we can convert the lowest note to a space and all the other notes to the letters a through i using this simple code:
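A minimal version of that conversion, using the piano-key numbers from the sequence above (43, the lowest note, becomes the space delimiter):

```python
def key_to_char(key):
    # Key 43 (the lowest note) is a space; keys 44-52 map to 'a'-'i'
    if key == 43:
        return " "
    return chr(ord("a") + key - 44)

# The first five keys of the decoded sequence
print("".join(key_to_char(k) for k in [43, 47, 45, 48, 43]))
# → " dbe "
```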

Once done so, the following text emerges:
 dbe b dbe ebe dbe b dbe dbe dbe b dbe ebe ece baebe ebe debbbe b ebbe ebe ebe b ebe ebe dbe b hh      bicdf bibce dhfci headi hbgc cgdbe cifd fhc cfeb bfghg bdgbe bgic bfbicabdeif beg bdgb ebdh hieag gbfge fcice fifih ehbeg bdfe bgef gfge ecide deigiaiaiaiaiaiaiaiaiaiai
This could be a great starting point for analyzing its content. We realize that the two alternating notes at 1:30 were wrongly decoded as "hh". Also, some segments were wrongly decoded, since the data after 1:33 always comes in groups of 5 tones. You can convince yourself by listening to it in VLC at reduced play speed. Hence, we've got 11 false decodings out of 27. Enhancing the tone-merging algorithm should do the trick for most errors, as the input is noisy and sometimes distorted. If the data were not grouped as it is, text statistics could be applied to check whether these are really words: if the histogram of word lengths matched an English corpus, this data could be considered words. But since it is always grouped by 5, the likelihood of it being words is really low. If we replace the letters with numerals, the following sequence is found:
 425 2 425 525 425 2 425 425 425 2 425 525 535 21525 525 452225 2 5225 525 525 2 525 525 425 2 88      29346 29235 48639 85149 8273 37425 3964 683 3652 26787 24725 2793 26293124596 257 2472 5248 89517 72675 63935 69698 58257 2465 2756 7675 53945 4597919191919191919191919
We see that the last notes are a repetition of the highest and lowest notes (excluding the space delimiter). This looks like an end-of-transmission pattern.

Spy phone home

Truth be told, signal analysis is kind of my nemesis. To take this exercise to the next level, I decoded a phone number from a recording of a telephone dialing signal. The idea came from two main sources: first, a commission currently taking place on my local news did not censor the dial tones of telephone clips played in public audiences; secondly, a friend of mine vigorously criticized the poor conclusion I was trying to pull off, which was supposed to be: "Oh, I found the sequence, but there's no known way to decrypt it!"

So I loaded this phone number signal (read: generated it here):

Here is the link to the audio if your browser does not support HTML5.

Using the aforementioned notions results in this code, which gives the correct decoding: 562-4897 (randomly chosen number, dial it at your own risk!). One thing merits mention: phone signaling is (usually) done using Dual-Tone Multi-Frequency (DTMF), meaning that a tone containing two different frequencies (which are not multiples of one another) is used for each keypad entry. To decode these two frequencies, I took the two most prominent frequencies (those with the highest amplitude) and picked the symbol which minimized the sum of the squares of the errors, the common method known as least squares.
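A sketch of that matching step over the nominal DTMF frequency table (the measured peaks fed in below are made up for illustration):

```python
# Nominal DTMF (low, high) frequency pair for each keypad symbol
DTMF = {
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477),
    "4": (770, 1209), "5": (770, 1336), "6": (770, 1477),
    "7": (852, 1209), "8": (852, 1336), "9": (852, 1477),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477),
}

def decode_pair(f_low, f_high):
    # Least squares: pick the symbol minimizing the summed squared error
    return min(DTMF, key=lambda s: (DTMF[s][0] - f_low) ** 2
                                   + (DTMF[s][1] - f_high) ** 2)

# Slightly off-nominal measurements still resolve to the right key
print(decode_pair(695, 1340))   # → '2'
print(decode_pair(938, 1335))   # → '0'
```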


Decoding signals can be daunting, but it should be a breeze with the proper tools. I am impressed to see how close we are to extracting a polyphonic music sheet from an audio file. These notions could also be used to recognize the emitter of a sound, among a handful of other uses. I'll let you be creative.

But enough conspiracy theories for today. Hope you enjoyed!
Thanks to Alexandre Boily, who told me about the existence of The Conet Project.