Interesting Data Science Application: Steganography

The Art and Science of Encrypting, Embedding and Hiding Messages in Pictures and Videos.

This is related to data encryption and security. Imagine that you need to transmit the details of a patent or a confidential financial transaction over the Internet. There are three critical issues:

  • Making sure the message is not captured by a third party, and decrypted
  • Making sure no third party can identify who sent the message
  • Making sure no third party can identify the recipient

Having the message encrypted is a first step, but it might not guarantee high security. Steganography is about hiding a confidential message (e.g. a scanned document such as a contract) inside an image, a video, an executable file or some other carrier. Combined with encryption, it is an effective way to transmit confidential or classified documents without raising suspicion.


Image of a cat embedded (and invisible) into the tree image

Here we describe a statistical technique that uses 24-bit images (bitmaps such as Windows BMP images) to achieve this goal. Steganography that hides information in the lower bits of a digital image has been in use for more than 30 years. The technique described here is more advanced and should make steganalysis (the detection and reverse engineering of steganography) more difficult: in other words, safer for the user.

While we focus here on the widespread BMP image format, our technique can be used with other lossless image formats. It even works with compressed images, as long as information loss is minimal.

A bit of reverse engineering science

The BMP image format created by Microsoft is one of the best formats to use for steganography. The format is publicly documented, and public source code is available to produce BMP images. Yet there are so many variants and parameters that it might be easier to reverse-engineer this format than to spend hours reading hundreds of pages of documentation to figure out how it works. In a nutshell, the 24-bit format is the easiest to work with: it consists of a 54-byte header, followed by the bitmap itself. Each pixel has RGB (red, green, blue channel) values; strictly speaking, a 24-bit BMP stores only these three bytes per pixel (on disk in blue, green, red order, with each row padded to a multiple of 4 bytes), and the fourth, alpha byte (which you can ignore) is present only in the 32-bit variant. The size calculations in this article assume the 4-bytes-per-pixel layout.
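As a rough illustration of the header layout, here is a minimal Python sketch. It assumes the most common BMP variant: a 14-byte file header followed by a 40-byte BITMAPINFOHEADER (54 bytes in total), uncompressed, 24 bits per pixel. The function names make_bmp24 and read_bmp_header are hypothetical and are not taken from the C code mentioned in this article.

```python
import struct

def make_bmp24(width, height, pixels):
    """Build a minimal uncompressed 24-bit BMP in memory.
    pixels[y][x] is an (R, G, B) tuple, top row first."""
    row_size = (width * 3 + 3) // 4 * 4            # each row is padded to a multiple of 4 bytes
    data = bytearray()
    for row in reversed(pixels):                   # BMP stores the bottom row first
        line = bytearray()
        for r, g, b in row:
            line += bytes((b, g, r))               # on disk the channel order is B, G, R
        line += b"\x00" * (row_size - width * 3)   # row padding
        data += line
    # 14-byte file header: magic, file size, two reserved fields, offset of pixel data
    file_header = struct.pack("<2sIHHI", b"BM", 54 + len(data), 0, 0, 54)
    # 40-byte info header: size, width, height, planes, bits per pixel, compression,
    # image size, horizontal and vertical resolution, palette size, important colors
    info_header = struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24, 0,
                              len(data), 2835, 2835, 0, 0)
    return file_header + info_header + bytes(data)

def read_bmp_header(blob):
    """Read the fields of the 54-byte header that matter for steganography."""
    magic, file_size, _, _, pixel_offset = struct.unpack_from("<2sIHHI", blob, 0)
    _, width, height, _, bits_per_pixel = struct.unpack_from("<IiiHH", blob, 14)
    return {"magic": magic, "file_size": file_size, "pixel_offset": pixel_offset,
            "width": width, "height": height, "bits_per_pixel": bits_per_pixel}

# A 2 x 2 black image with one purple pixel in the top-left corner
black, purple = (0, 0, 0), (128, 0, 128)
blob = make_bmp24(2, 2, [[purple, black], [black, black]])
header = read_bmp_header(blob)
```

Changing the color or the position of the single non-black pixel and comparing the resulting bytes is one way to observe the layout directly.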

Detailed C code for 24-bit BMP images is linked from the original article. One way to reverse-engineer this format is to produce a blank image, add one pixel (say purple, that is 50% red, 50% blue, 0% green), change the color of the pixel, then change the location of the pixel, to see how the BMP binary code changes. That's how the author of this article figured out how the 256-color BMP format (also known as the 8-bit BMP format) works. The 24-bit format is not only easier to understand, it is also more flexible and useful for steganography.

To hide a secret code, image, or message in a target image, you first need an original (target) image. Some images are great candidates for this purpose; others are very poor and could lead to your being compromised. Avoid images that are color-poor or that contain very uniform areas. Conversely, color-rich images with no uniform areas are good candidates. So the first piece of a good steganography algorithm is a mechanism to detect which images are good candidates in which to bury your secret message.
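The article does not specify how candidate images are detected. As one possible sketch, the two criteria above (color richness and absence of uniform areas) can be turned into simple statistics. The function names and the thresholds min_richness and max_uniformity below are illustrative assumptions, not values from the article.

```python
import random

def cover_image_stats(pixels):
    """pixels[y][x] is an (R, G, B) tuple. Returns two crude screening statistics."""
    height, width = len(pixels), len(pixels[0])
    distinct = len({color for row in pixels for color in row})
    # Count horizontally adjacent pixel pairs that have exactly the same color
    same_neighbor = sum(
        1 for row in pixels for x in range(width - 1) if row[x] == row[x + 1]
    )
    return {
        "color_richness": distinct / (width * height),                 # 1.0 = every pixel unique
        "uniformity": same_neighbor / max(1, height * (width - 1)),    # 1.0 = completely flat
    }

def is_good_cover(pixels, min_richness=0.5, max_uniformity=0.05):
    """Hypothetical thresholds; they would need tuning on real images."""
    stats = cover_image_stats(pixels)
    return (stats["color_richness"] >= min_richness
            and stats["uniformity"] <= max_uniformity)

# A flat image is a poor candidate; a noisy, color-rich one is a good candidate
rng = random.Random(0)
flat_image = [[(200, 200, 200)] * 20 for _ in range(20)]
noisy_image = [[tuple(rng.randrange(256) for _ in range(3)) for _ in range(20)]
               for _ in range(20)]
```

A more careful screen would look at local neighborhoods in both directions rather than only at horizontal pairs.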

Our technology

Once you have found a good image in which to hide your message, here is how to proceed. We assume that the message you want to hide is a text message, based on an 80-character alphabet (26 lowercase letters, 26 uppercase letters, 10 digits, and 18 special characters such as parentheses). Let's assume that your secret message is 300 KB long (300,000 1-byte characters), and that you are going to bury it in a 600 x 600 pixel image, so that 300,000 of its 360,000 pixels will carry a character. With 4 bytes of storage per pixel (3 for the RGB channels, 1 for the alpha channel), this is a 1,440 KB image: 1,440 KB = 600 x 600 x (3+1). A strict 24-bit BMP, without the alpha byte, would take 3 bytes per pixel, or about 1,080 KB.
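The arithmetic behind these figures can be checked in a few lines of Python. The variable names are illustrative; the 54-byte header and the absence of row padding (600 x 3 bytes is already a multiple of 4) are assumptions about the plain 24-bit BMP layout.

```python
WIDTH, HEIGHT = 600, 600
MESSAGE_CHARS = 300_000                       # one character per selected pixel

pixel_count = WIDTH * HEIGHT                  # total pixels in the cover image
fraction_used = MESSAGE_CHARS / pixel_count   # share of pixels that carry a character

size_with_alpha = pixel_count * 4             # 4 bytes per pixel, as assumed in the article
size_24_bit = pixel_count * 3 + 54            # strict 24-bit BMP: 3 bytes per pixel + header

print(pixel_count, round(fraction_used, 3), size_with_alpha, size_24_bit)
```

Note that about five pixels out of six end up carrying a character, which is a high embedding rate.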


Step 1: You first need to create a (one-to-many) table in which each of the 80 characters in your alphabet is associated with 1,000 RGB colors, widely spread in the RGB universe, and with no collision (no RGB color associated with more than one character). So you need an 80,000-record look-up table, each record being 3 bytes long (so the size of this look-up table is 240 KB). This table is roughly the equivalent of a key in encryption systems.
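A look-up table of this kind could be generated as in the sketch below. This is only one possible construction: it draws 80,000 distinct colors uniformly at random from the 16,777,216 available, which spreads them across the RGB cube on average. The choice of special characters, the seed parameter and the function name are assumptions, and Python's random module is not a cryptographically secure generator, so a real key would need a stronger source of randomness.

```python
import random
import string

# 26 + 26 + 10 letters and digits, plus 18 special characters (an illustrative choice)
ALPHABET = (string.ascii_lowercase + string.ascii_uppercase + string.digits
            + " .,;:!?()-'\"/@#$%&")

def build_lookup_table(seed, colors_per_char=1000):
    """Return a dict mapping (R, G, B) -> character, with no color used twice."""
    rng = random.Random(seed)
    # sample() returns distinct values, so no color can map to two characters
    codes = rng.sample(range(1 << 24), len(ALPHABET) * colors_per_char)
    table = {}
    for i, code in enumerate(codes):
        rgb = (code >> 16, (code >> 8) & 0xFF, code & 0xFF)
        table[rgb] = ALPHABET[i // colors_per_char]
    return table

table = build_lookup_table(seed=2013)
table_size_bytes = len(table) * 3   # 3 bytes per color record
```

Sender and recipient would both need the same table (or the same seed) to encode and decode.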

Step 2: Embed your message in the target image

  • 2.1. Pre-processing: In the target image, replace each pixel that has a color matching one of the 80,000 entries from your look-up table with a very close neighboring color that is not in the table. For instance, if pixel color R=231, G=134, B=098 is both in the target image and in the 80,000-entry look-up table, replace this color in the target image with (say) R=230, G=134, B=099.
  • 2.2. Randomly select 300,000 pixel locations in the target 600 x 600 image. This is where the 300,000 characters of your message are going to be stored, one character per location, in the order in which the pixels are scanned.
  • 2.3. For each of the 300,000 locations, replace the RGB color with its closest neighbor among the 1,000 look-up table colors associated with the character to be stored at that location.
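Steps 2.1 to 2.3 can be sketched in Python as follows, on a deliberately tiny example (a 4-character alphabet with 10 colors each, and a 400-pixel cover). Several details are assumptions rather than part of the article: the image is a flat list of (R, G, B) tuples in scan order, the selected locations are sorted so that scanning the image reproduces the message order (consistent with the author's comments that no reshuffling is used and that locations need not be known), and the nearest color is found by a plain linear search. All function names are hypothetical.

```python
import random

def nudge(color, reserved):
    """Step 2.1 helper: return a nearby color that is not in the look-up table."""
    for step in range(1, 256):
        for channel in range(3):
            for sign in (-1, 1):
                candidate = list(color)
                candidate[channel] += sign * step
                if 0 <= candidate[channel] <= 255 and tuple(candidate) not in reserved:
                    return tuple(candidate)
    raise ValueError("no free neighboring color found")

def nearest(color, candidates):
    """Closest candidate by squared Euclidean distance in RGB space."""
    return min(candidates, key=lambda c: sum((a - b) ** 2 for a, b in zip(color, c)))

def embed(pixels, message, colors_by_char, rng):
    """pixels: flat list of (R, G, B) in scan order.
    colors_by_char: dict mapping each character to its list of table colors."""
    reserved = {c for colors in colors_by_char.values() for c in colors}
    # 2.1 Pre-processing: remove accidental matches with the look-up table
    out = [nudge(p, reserved) if p in reserved else p for p in pixels]
    # 2.2 Random locations, sorted so that scan order equals message order (assumption)
    locations = sorted(rng.sample(range(len(out)), len(message)))
    # 2.3 Replace each selected pixel with the closest color coding for its character
    for location, char in zip(locations, message):
        out[location] = nearest(out[location], colors_by_char[char])
    return out

def to_rgb(value):
    return (value >> 16, (value >> 8) & 0xFF, value & 0xFF)

# Tiny demonstration
rng = random.Random(7)
codes = rng.sample(range(1 << 24), 40)
colors_by_char = {ch: [to_rgb(v) for v in codes[i * 10:(i + 1) * 10]]
                  for i, ch in enumerate("abcd")}
cover = [to_rgb(rng.randrange(1 << 24)) for _ in range(400)]
cover[0] = colors_by_char["a"][0]          # force one accidental match to exercise step 2.1
stego = embed(cover, "abbacabd", colors_by_char, rng)
```

At the full scale described here (300,000 characters, 1,000 colors each), the linear search would be slow in pure Python, and a spatial index such as a k-d tree would be a more practical choice.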

The target (original) image will visually look exactly the same once your message has been embedded into it.

How to decode the image

Scan the image in order, look for the pixels that have an RGB color found in the 80,000-color look-up table, and match each of them with the character that it represents. It should be straightforward, since this look-up table has two fields: character (80 unique characters in your alphabet) and RGB representations (1,000 different RGB representations per character).
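Decoding then reduces to a filtered scan, as in this minimal sketch. It assumes the pixels are available as a flat list of (R, G, B) tuples in scan order and that the look-up table is a dictionary from color to character; the tiny table below is made up for the example.

```python
def decode(pixels, table):
    """Keep only the pixels whose color is in the look-up table, in scan order."""
    return "".join(table[color] for color in pixels if color in table)

# Made-up miniature table: two colors for "H", two for "i"
demo_table = {
    (10, 20, 30): "H", (11, 22, 33): "H",
    (200, 100, 50): "i", (201, 99, 48): "i",
}
demo_pixels = [(0, 0, 0), (10, 20, 30), (5, 5, 5), (201, 99, 48), (255, 255, 255)]
decoded = decode(demo_pixels, demo_table)   # "Hi"
```

This only works if the pre-processing step was applied before embedding, so that no ordinary pixel happens to carry a table color.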

How to post your message?

With your message securely encoded and hidden in an image, you would think that you just have to email the image to the recipient, who will then easily extract the encoded message.

This is a dangerous strategy: even if the encrypted message cannot be decoded, if your email account or your recipient's email account is hijacked (e.g. by the NSA), the hijacker will at least be able to figure out who sent the message, and/or to whom.

A better mechanism to deliver the message is to post your image on Facebook or another public forum, anonymously. Our next article on this subject will be about how to be really anonymous, using bogus Facebook profiles to post highly confidential content (hidden in images using our steganography technique) as well as your 240 KB look-up table, without revealing your IP address.

Note: The message hidden in your image should not contain identifiers. That way, even if the message is captured and decoded, the hijacker might not be able to figure out who sent it, and/or to whom.


Comment by Joe M on October 28, 2017 at 11:33am

Is it better to skip bytes like that rather than use the least significant bits in every (or nearly every) pixel? 

It may be -- just wondering, although I think I see a better way.

Also isn't there a chance that there will be a crash between an actual pixel value and a code pixel?  

Comment by Vincent Granville on November 1, 2013 at 8:48pm

@David: There is no reshuffling of the characters (from the original message) in my algorithm. A reshuffling scheme could of course be added for additional security. This would add roughly 630 KB to the 240 KB key, to encode the permutation of order 300,000 used for reshuffling (such a permutation takes about log2(300,000!), or roughly 5 million bits). Perfectly feasible, in my opinion.

Comment by Vincent Granville on November 1, 2013 at 12:17pm

Important note: Most modern images contain meta-tags, to help easily categorize and retrieve images when users do a Google search based on keywords. However, these meta-tags are a security risk: They might contain information about who created the image, his IP address, time stamp and machine ID. Thus it might help hijackers discover who you are. That's why you should alter or remove these meta-tags, or use images found on the web (not created by you) for your cover images, or write the 54 bytes of BMP header yourself, e.g. using the C code provided in this article, so that you have control over the meta-tags.

Meta-tags should not look fake, as this could have your image flagged as suspicious. Reusing existing images acquired externally (not produced on your machines) for your cover images, is a good solution.

Comment by Vincent Granville on November 1, 2013 at 11:54am

@ZK: I recently bought the book "Steganography in Digital Media", by Jessica Fridrich, published by Cambridge University Press in 2010. There is no mention of an algorithm like mine, based on alphabet look-up tables (alphabets are mentioned nowhere in the book, but are at the core of my algorithm). Of course it does not mean that my algorithm is new, nor does it mean that it is better than existing algorithms.

The book in question puts more emphasis on steganalysis (detecting and extracting content hidden in these images) than on steganography (encoding techniques).

Comment by Zero Knowledge on November 1, 2013 at 7:08am

Hi Vincent,

There have been similar schemes proposed and detection techniques proposed. I would suspect most of the images currently would be analyzed for known steganographic techniques. Some of the detection tools are available for download.

Comment by Vincent Granville on November 1, 2013 at 6:52am

@David: The nice thing with my algorithm is that you don't need to reassemble them back. You don't need to know the locations of these pixels. When you see a pixel with a color that is in the 80,000 color table, you know that the pixel in question (its color) is part of the message.

Comment by Atif Farid Mohammad on October 30, 2013 at 4:57am

Threat intelligence can be done by going through these feeds, nice thought sharing, thanks again.

Comment by Atif Farid Mohammad on October 28, 2013 at 8:57am

Interesting; however, there are decryption algorithms one can use.

Comment by Vincent Granville on October 27, 2013 at 11:02pm

One way to increase security is to use a double system of look-up tables (the 240 KB tables). Let's say you have 10,000 images with embedded encoded messages, stored somewhere on the cloud. The look-up tables are referred to as keys, and they can themselves be embedded into images. You can increase security by adding 10,000 bogus (decoy) images with no encoded content, and two keys A and B. Thus you would have 20,002 images in your repository. You need key A to decode key B, and then key B to decode the other images. Since all the image files (including the keys) look the same, you can only decode the images if you know the filenames corresponding to keys A and B. So if your cloud is compromised, it is unlikely that your encoded messages will be successfully decoded by the hijacker.

© 2021   TechTarget, Inc.