Cracking a Skype call using phonemes
Written by Mike James   
Thursday, 26 May 2011

Modern computational linguistics can crack the encryption on VOIP calls well enough to reconstruct what is being said. Even though they are encrypted, the frames that make up a Skype call contain clues about what phonemes are being spoken.

Cracking a code is usually a complex matter of mathematics and nothing much else other than mathematics. However, if the encrypted data contain any statistical relationship to the  original data then there can be shortcut ways to decryption that make the whole thing much less secure. This is well known and yet you might be surprised to discover that Skype, and many other forms of VOIP telephone systems, are vulnerable to this sort of attack.

The reason is that the best form of compression for voice data makes use of the structure of speech - the Linear Predictive Filter. The basic idea is that the data is compressed by using an input code word that represents the sound made in the throat by the vocal chords. Then a set of parameters are set in a filter which represents the shape of the mouth and resonant cavities. The parameters are set so that the output matches the sound as well as it can - this is an example of analysis by synthesis, i.e. you analyze a signal by setting up a system that creates it accurately.

Skype uses Code Excited Linear Prediction in which the data in a frame consists of a code word, the gain coefficient and a set of linear prediction coefficients. The next step in processing the data is that the frame is compressed using a variable bit rate scheme and this produces a frame that has a size that does depend on the type of phoneme that has been encoded. The encryption step that follows doesn't change the size of the frame and so the encrypted data that is transmitted has a correlation between frame size and phoneme uttered.

In theory working out what is said from a loose correlation between frame size and phoneme should be very difficult. However a computer scientists and linguists at the University of North Carolina have used the grammar of phonemes to restrict the possibilities for pairs and larger groups of phonemes in the data stream. This allows them to match the patterns of data frame sizes with most likely patterns of phonemes. These phoneme patterns are then mapped to most likely words, a technique they call "Phonotactic Reconstruction".

PhonemeMethod

In practice it turned out to be surprisingly effective - although probably not good enough to eavesdrop on any conversation at any time. Some times the method works much better than expected and some times it can't crack the data stream. The researchers state that with improved computational linguistics they could achieve a much better success rate.

What ever the future holds it is clear that the method of compression leaves too much information in the clear after encryption.  A solution might be to break the data up into fixed sized frames but this would make it more difficult to reconstruct the data if there was packet loss.

More information

Phonotactic Reconstruction of Encrypted VoIP Conversations

speechproduction

If you would like to be informed about new articles on I Programmer you can either follow us on Twitter or Facebook or you can subscribe to our weekly newsletter.

 

Banner


Apache Releases Tomcat 11
07/11/2024

Apache has announced the release of Tomcat 11, as well as marking the 25th anniversary of the first commit to the Apache Tomcat source code repository since becoming an ASF project.



Kotlin Ktor Improves Client-Server Support
04/11/2024

Kotlin Ktor 3 is now available with better performance and improvements including support for server-sent events and CSRF (Cross-Site Request Forgery) protection.


More News

Last Updated ( Thursday, 26 May 2011 )