Interesting Data Science Application: Steganography

The Art and Science of Encrypting, Embedding and Hiding Messages in Pictures and Videos.

This is related to data encryption and security. Imagine that you need to transmit the details of a patent or a confidential financial transaction over the Internet. There are three critical issues:

  • Making sure the message is not captured and decrypted by a third party
  • Making sure no third party can identify who sent the message
  • Making sure no third party can identify the recipient

Encrypting the message is a first step, but it might not guarantee high security. Steganography is about hiding a confidential message (e.g. a scanned document such as a contract) inside an image, a video, an executable file, or some other carrier. Combined with encryption, it is an effective way to transmit confidential or classified documents without raising suspicion.

Image of a cat embedded (and invisible) into the tree image

Here we describe a statistical technique that leverages 24-bit images (bitmaps such as Windows BMP images) to achieve this goal. Steganography that hides information in the lower bits of a digital image has been in use for more than 30 years. The technique described here is more advanced and should make steganalysis (reverse-engineering steganography) more difficult: in other words, safer for the user.

While we focus here on the widespread BMP image format, our technique can be used with other lossless image formats. It even works with compressed images, as long as information loss is minimal.

A bit of reverse engineering science

The BMP image format created by Microsoft is one of the best formats to use for steganography. The format is publicly documented, and public source code is available to produce BMP images. Yet there are so many variants and parameters that it might be easier to reverse-engineer the format than to spend hours reading hundreds of pages of documentation to figure out how it works. In a nutshell, the uncompressed 24-bit variant is the easiest to work with: it consists of a 54-byte header followed by the bitmap itself. Each pixel has three components, the red, green and blue channels (stored in blue, green, red order), so it takes 3 bytes to store each pixel; there is no alpha channel in this variant (the 32-bit variant adds a fourth byte that you can ignore). Each row of pixels is padded to a multiple of 4 bytes.
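To make the layout concrete, here is a minimal Python sketch that reads the main header fields and computes where a given pixel lives in the file. It assumes the most common variant only (14-byte file header plus 40-byte info header, uncompressed, rows stored bottom-up); the function names `parse_bmp_header` and `pixel_offset` are hypothetical helpers, not part of any library.

```python
import struct

def parse_bmp_header(data: bytes) -> dict:
    """Read the main fields of a 54-byte BMP header (14-byte file header + 40-byte info header)."""
    signature, file_size, _, _, data_offset = struct.unpack_from("<2sIHHI", data, 0)
    info_size, width, height, planes, bits_per_pixel, compression = struct.unpack_from("<IiiHHI", data, 14)
    if signature != b"BM":
        raise ValueError("not a BMP file")
    return {
        "file_size": file_size,
        "data_offset": data_offset,      # where the bitmap starts (54 for this variant)
        "width": width,
        "height": height,                # assumed positive: rows are stored bottom-up
        "bits_per_pixel": bits_per_pixel,
        "compression": compression,      # 0 means uncompressed
    }

def pixel_offset(header: dict, x: int, y: int) -> int:
    """Byte offset of the blue byte of pixel (x, y), with y = 0 being the top row."""
    bytes_per_pixel = header["bits_per_pixel"] // 8
    # Each stored row is padded up to a multiple of 4 bytes.
    stride = (header["width"] * bytes_per_pixel + 3) // 4 * 4
    row_in_file = header["height"] - 1 - y   # the bottom row comes first in the file
    return header["data_offset"] + row_in_file * stride + x * bytes_per_pixel
```

For a 600 x 600 image, each row takes 1,800 bytes (already a multiple of 4, so no padding), and the three bytes found at `pixel_offset(header, x, y)` are the blue, green and red values of that pixel.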

One way to reverse-engineer this format is to produce a blank image, add one pixel (say purple, that is 50% red, 50% blue, 0% green), change the color of the pixel, then change the location of the pixel, and see how the BMP binary code changes each time. That’s how the author of this article figured out how the 256-color BMP format (also known as the 8-bit BMP format) works. The 24-bit format is not only easier to understand, it is also more flexible and more useful for steganography.
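The experiment described above can be reproduced in a few lines of Python. This is only a sketch: `tiny_bmp` is a hypothetical helper that builds a small white 24-bit BMP in memory (where one would normally save files from an image editor), and `diff_offsets` lists the byte positions at which two versions differ.

```python
import struct

def tiny_bmp(width, height, colored=None):
    """White 24-bit BMP as bytes; `colored` maps (x, y) to an (r, g, b) tuple, y = 0 at the top."""
    stride = (width * 3 + 3) // 4 * 4            # rows are padded to a multiple of 4 bytes
    header = struct.pack("<2sIHHIIiiHHIIiiII",
                         b"BM", 54 + stride * height, 0, 0, 54,   # 14-byte file header
                         40, width, height, 1, 24, 0,             # 40-byte info header
                         stride * height, 2835, 2835, 0, 0)
    rows = []
    for y in range(height - 1, -1, -1):          # the bottom row is written first
        row = bytearray()
        for x in range(width):
            r, g, b = (colored or {}).get((x, y), (255, 255, 255))
            row += bytes((b, g, r))              # stored in blue, green, red order
        rows.append(bytes(row) + bytes(stride - len(row)))
    return header + b"".join(rows)

def diff_offsets(a, b):
    """Offsets at which two equally sized byte strings differ."""
    return [i for i, (p, q) in enumerate(zip(a, b)) if p != q]

blank = tiny_bmp(4, 4)
purple_top_left = tiny_bmp(4, 4, {(0, 0): (128, 0, 128)})
purple_moved = tiny_bmp(4, 4, {(1, 0): (128, 0, 128)})

print(diff_offsets(blank, purple_top_left))   # [90, 91, 92]
print(diff_offsets(blank, purple_moved))      # [93, 94, 95]
```

Only three bytes change per pixel, the header is untouched, and moving the pixel one step to the right shifts the changed bytes by 3. The top-left pixel lands near the end of this 4 x 4 file because the rows are stored bottom-up.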

To hide a secret code, image, or message in a target image, you first need to choose an original (target) image. Some images are great candidates for this purpose; others are very poor and could lead to you being compromised. Avoid images that are color-poor or that have very uniform areas, since altered pixels stand out in them. Conversely, color-rich images with no uniform areas are good candidates. So the first piece of a good steganography algorithm is a mechanism to detect which images are good candidates in which to bury your secret message.
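The article does not specify how candidate images are detected. One possible sketch, assuming the image is already loaded as a list of rows of (r, g, b) tuples, is to measure color richness (share of distinct colors) and uniformity (share of adjacent pixel pairs with identical colors). The function names and both thresholds below are illustrative assumptions, not values from the original technique.

```python
def image_suitability(pixels):
    """Return (distinct-color ratio, uniform-neighbor ratio) for rows of (r, g, b) tuples."""
    height, width = len(pixels), len(pixels[0])
    distinct = len({color for row in pixels for color in row})
    same = pairs = 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:                     # compare with the right-hand neighbor
                pairs += 1
                same += pixels[y][x] == pixels[y][x + 1]
            if y + 1 < height:                    # compare with the neighbor below
                pairs += 1
                same += pixels[y][x] == pixels[y + 1][x]
    uniform_ratio = same / pairs if pairs else 0.0
    return distinct / (width * height), uniform_ratio

def is_good_candidate(pixels, min_color_ratio=0.5, max_uniform_ratio=0.05):
    """Hypothetical thresholds: color-rich and almost no identical adjacent pixels."""
    color_ratio, uniform_ratio = image_suitability(pixels)
    return color_ratio >= min_color_ratio and uniform_ratio <= max_uniform_ratio
```

A flat, single-color image scores a uniform-neighbor ratio of 1.0 and is rejected, while a noisy photograph typically has few identical adjacent pixels. A more careful detector would probably also look at local variance, since two nearly identical colors still count as different here.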

Our technology

Once you have found a good image to hide your message in, here is how to proceed. We assume that the message you want to hide is a text message, based on an 80-character alphabet (26 lowercase letters, 26 uppercase letters, 10 digits, and a few special characters such as parentheses). Let’s assume that your secret message is 300 KB long (300,000 1-byte characters), and that you are going to bury it in a 600 x 600 pixel, 24-bit image (that is, a 1,080 KB image; 1,080 KB = 600 x 600 x 3, since each pixel requires 3 bytes of storage, one per RGB channel). The image has 360,000 pixels, so the message will occupy about 83% of them.
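The sizes in this example are easy to check. The short Python sketch below computes the bitmap size for 3 bytes per pixel (standard 24-bit BMP) and, for comparison, for 4 bytes per pixel (a 32-bit variant with an extra byte), along with the share of pixels the message uses. The helper name `bitmap_bytes` is hypothetical.

```python
import math

def bitmap_bytes(width, height, bytes_per_pixel=3):
    """Size of the pixel data, with each row padded to a multiple of 4 bytes."""
    stride = (width * bytes_per_pixel + 3) // 4 * 4
    return stride * height

width, height = 600, 600
message_chars = 300_000
alphabet_size = 80

pixel_count = width * height                      # 360,000 pixels
size_24_bit = bitmap_bytes(width, height, 3)      # 1,080,000 bytes
size_32_bit = bitmap_bytes(width, height, 4)      # 1,440,000 bytes
fill_ratio = message_chars / pixel_count          # share of pixels carrying a character
bits_per_char = math.log2(alphabet_size)          # information carried by one character

print(pixel_count, size_24_bit, size_32_bit)
print(f"{fill_ratio:.1%} of pixels used, {bits_per_char:.2f} bits per character")
```

Since each character occupies one pixel, the hard capacity limit for this image is 360,000 characters; a 300,000-character message leaves only about one pixel in six untouched.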

Algorithm

Step 1: You first need to create a (one-to-many) table in which each of the 80 characters in your alphabet is associated with 1,000 RGB colors, widely spread in the RGB universe, and with no collision (no RGB color associated with more than one character). So you need an 80,000-record look-up table, each record being 3 bytes long (so the size of this look-up table is 240 KB). This table plays a role similar to that of a key in encryption systems.
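As an illustration of Step 1, the sketch below draws 80,000 distinct colors uniformly at random from the 16,777,216 possible 24-bit colors and assigns 1,000 of them to each character. The specific 18 special characters, the function name `build_lookup_table` and the use of a seed are assumptions made for the example; Python's `random` module is also not cryptographically secure, so a real key would need a stronger source of randomness.

```python
import random
import string

# 26 lowercase + 26 uppercase + 10 digits + 18 special characters (an assumed choice) = 80
ALPHABET = string.ascii_letters + string.digits + " .,;:!?()-'\"/@#$%&"
assert len(ALPHABET) == 80

def build_lookup_table(alphabet, colors_per_char=1000, seed=None):
    """Map distinct (r, g, b) colors to characters, `colors_per_char` colors for each character."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no color is assigned twice (no collision).
    codes = rng.sample(range(1 << 24), len(alphabet) * colors_per_char)
    table = {}
    for i, code in enumerate(codes):
        color = (code >> 16, (code >> 8) & 255, code & 255)
        table[color] = alphabet[i // colors_per_char]
    return table

table = build_lookup_table(ALPHABET, seed=2024)
print(len(table), "colors,", len(table) * 3, "bytes of color data")
```

Uniform random sampling spreads each character's colors across the RGB cube on average, though it does not enforce an even spacing. Storing the table as a dictionary keyed by color makes the decoding direction (color to character) a single look-up.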

Step 2: Embed your message in the target image

  • 2.1. Pre-processing: In the target image, replace each pixel whose color matches one of the 80,000 entries in your look-up table with a very close neighboring color that is not in the table. For instance, if pixel color R=231, G=134, B=098 is both in the target image and in the 80,000-entry look-up table, replace this color in the target image with (say) R=230, G=134, B=099.
  • 2.2. Randomly select 300,000 pixel locations in the target 600 x 600 image. This is where the 300,000 characters of your message are going to be stored. Characters are assigned to these locations in scan order (for example left to right, top to bottom), so that the message can later be read back in sequence.
  • 2.3. For each of the 300,000 locations, replace the RGB color with its closest neighbor among the 1,000 look-up table colors associated with the character to be stored at that location.
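The three sub-steps above can be sketched in Python as follows. The sketch assumes the image is a mutable list of rows of (r, g, b) tuples and that the look-up table is a dictionary mapping colors to characters. It also makes two assumptions that the description leaves open: characters are written to the chosen locations in scan order, and "closest" means smallest squared Euclidean distance in RGB space, restricted to the colors of the character being stored. The names `nudge` and `embed` are hypothetical.

```python
import random

def nudge(color, table):
    """Step 2.1 helper: return a nearby color that is not an entry of the look-up table."""
    r, g, b = color
    for step in range(1, 256):
        for dr, dg, db in ((step, 0, 0), (-step, 0, 0), (0, step, 0),
                           (0, -step, 0), (0, 0, step), (0, 0, -step)):
            candidate = (r + dr, g + dg, b + db)
            if all(0 <= v <= 255 for v in candidate) and candidate not in table:
                return candidate
    raise ValueError("no free neighboring color found")

def embed(pixels, message, table, seed=None):
    """Hide `message` in `pixels` (modified in place); returns the chosen positions."""
    rng = random.Random(seed)
    height, width = len(pixels), len(pixels[0])
    if len(message) > width * height:
        raise ValueError("message is longer than the number of pixels")
    # Group the table by character so that each character only sees its own colors.
    colors_for = {}
    for color, ch in table.items():
        colors_for.setdefault(ch, []).append(color)
    # 2.1: remove accidental matches between the image and the look-up table.
    for row in pixels:
        for x, color in enumerate(row):
            if color in table:
                row[x] = nudge(color, table)
    # 2.2: pick random locations, then sort them so characters are stored in scan order.
    positions = sorted(rng.sample(range(width * height), len(message)))
    # 2.3: replace each chosen pixel with the closest color encoding the character.
    for pos, ch in zip(positions, message):
        y, x = divmod(pos, width)
        r, g, b = pixels[y][x]
        pixels[y][x] = min(colors_for[ch],
                           key=lambda c: (c[0] - r) ** 2 + (c[1] - g) ** 2 + (c[2] - b) ** 2)
    return positions
```

The brute-force nearest-color search costs 1,000 distance computations per character, which is slow in pure Python for 300,000 characters; a spatial index such as a k-d tree per character would be a natural optimization. The returned positions are not needed for decoding and are only useful for inspection.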

Provided each replacement color is close to the original one, the target (original) image should look the same to the naked eye once your message has been embedded into it.

How to decode the image

Scan the image in the same pixel order used for embedding, keep the pixels whose RGB color is found in the 80,000-color look-up table, and match each of them with the character it represents. It should be straightforward since this look-up table has two fields: character (80 unique characters in your alphabet) and RGB representation (1,000 different RGB representations per character).
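Decoding is short enough to sketch in a few lines of Python, assuming the image is a list of rows of (r, g, b) tuples read top to bottom and left to right, and that the look-up table is a dictionary mapping colors to characters. The function name `decode` and the tiny table in the example are hypothetical.

```python
def decode(pixels, table):
    """Collect, in scan order, the characters of all pixels whose color is in the table."""
    return "".join(table[color] for row in pixels for color in row if color in table)

# Tiny illustration with a two-entry table instead of the full 80,000 entries.
example_table = {(10, 10, 10): "h", (20, 20, 20): "i"}
example_image = [[(0, 0, 0), (10, 10, 10)],
                 [(20, 20, 20), (5, 5, 5)]]
print(decode(example_image, example_table))   # hi
```

This only works if the carrier pixels contain no accidental table colors, which is why the pre-processing step matters, and if the colors reach the recipient unaltered: any recompression that changes pixel values, even slightly, would turn table colors into non-table colors.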

How to post your message?

With your message securely encoded and hidden in an image, you would think that you just have to email the image to the recipient, who will easily extract the encoded message.

This is a dangerous strategy: even if the encrypted message cannot be decoded, if your email account or your recipient’s email account is hijacked (e.g. by the NSA), the hijacker will at least be able to figure out who sent the message, and/or to whom.

A better mechanism to deliver the message is to post your image anonymously on Facebook or another public forum. Our next article on this subject will be about how to be really anonymous, using bogus Facebook profiles to post highly confidential content (hidden in images using our steganography technique) as well as your 240 KB look-up table, without revealing your IP address.

Note: The message hidden in your image should not contain identifiers. That way, even if the message is captured and decoded, the hijacker might not be able to figure out who sent it, and/or to whom.
