Comparative Study of Different Adversarial Text to Image Methods
Automatic synthesis of realistic images from text has become popular with deep convolutional and recurrent neural network architectures to aid in learning discriminative text feature representations.
The discriminative power and strong generalization properties of attribute representations are attractive, but obtaining attributes is a complex process that requires domain-specific knowledge. Over the years, these techniques have evolved as adversarial networks and the wider space of machine learning algorithms continue to evolve.
In comparison, natural language offers an easy, general, and flexible interface that can be used to identify and describe objects across multiple domains by means of visual categories. The ideal approach is to combine the generality of text descriptions with the discriminative power of attributes.
This blog covers different text-to-image synthesis algorithms based on GANs (Generative Adversarial Networks) that aim to map words and characters directly to image pixels using natural language representation and image synthesis techniques.
The featured algorithms learn a text feature representation that captures the important visual details and then use these features to synthesize a compelling image that a human might mistake for real.
1. Generative Adversarial Text to Image Synthesis
- This image synthesis mechanism uses deep convolutional and recurrent text encoders to learn a correspondence function with images, conditioning the model on text descriptions instead of class labels.
- An effective approach that enables text-based image synthesis using a character-level text encoder and class-conditional GAN. The purpose of the GAN is to view (text, image) pairs as joint observations and train the discriminator to judge pairs as real or fake.
- Equipped with a manifold interpolation regularizer (regularization procedure which encourages interpolated outputs to appear more realistic) for the GAN generator that significantly improves the quality of generated samples.
- Both the generator network G and the discriminator network D are trained to enable feed-forward learning and inference by conditioning tightly only on textual features.
LICENSE- Apache 2.0
- The discriminator D has several layers of stride-2 convolution with spatial batch normalization, followed by leaky ReLU.
- The GAN is trained in mini-batches with SGD (Stochastic Gradient Descent).
- In addition to the real and fake inputs fed to the discriminator during training, it also receives a third type of input consisting of real images paired with mismatched text, which the discriminator learns to score as fake.
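To make the three kinds of discriminator input concrete, here is a minimal sketch of a matching-aware objective together with the embedding interpolation used by the regularizer. The function names, the example scores, and the 0.5 weighting of the two "fake" terms are illustrative assumptions, not code taken from the repository below.

```python
import math


def discriminator_objective(s_real, s_wrong, s_fake):
    """Objective the discriminator tries to maximize.

    s_real  : score for (real image, matching text)
    s_wrong : score for (real image, mismatched text)
    s_fake  : score for (generated image, matching text)
    All scores are assumed to be probabilities in (0, 1).
    """
    return math.log(s_real) + 0.5 * (math.log(1.0 - s_wrong) + math.log(1.0 - s_fake))


def generator_objective(s_fake):
    """Objective the generator tries to maximize: fool D on (fake image, matching text)."""
    return math.log(s_fake)


def interpolate_embeddings(t1, t2, beta=0.5):
    """Blend two caption embeddings to create an extra (synthetic) text condition."""
    return [beta * a + (1.0 - beta) * b for a, b in zip(t1, t2)]


# A discriminator that rejects mismatched text scores higher than one that is fooled by it.
good_d = discriminator_objective(s_real=0.9, s_wrong=0.1, s_fake=0.1)
fooled_d = discriminator_objective(s_real=0.9, s_wrong=0.9, s_fake=0.1)
blended = interpolate_embeddings([0.0, 2.0], [2.0, 0.0])
```

The mismatched-text term is what forces the discriminator to check that the image agrees with the caption, rather than only checking that the image looks realistic.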
The below figure illustrates text to image generation samples of different types of birds.
(Open Source Apache 2.0 License)
git clone https://github.com/zsdonghao/text-to-image.git   # requires TensorFlow 1.0+, TensorLayer 1.4+, NLTK (for the tokenizer)
python downloads.py      # download the Oxford-102 flower dataset and caption files (run this first)
python data_loader.py    # load data for further processing
python train_txt2im.py   # train a text-to-image model
python utils.py          # helper functions
python models.py         # models
2. Multi-Scale Gradient GAN for Stable Image Synthesis
Multi-Scale Gradient Generative Adversarial Network (MSG-GAN) addresses the training instability that arises when the gradients passing from the discriminator to the generator become uninformative because of a learning imbalance during training. It allows gradients to flow from the discriminator to the generator at multiple scales, which helps generate synchronized multi-scale images.
- The discriminator not only looks at the final output (highest resolution) of the generator but also at the outputs of the intermediate layers as illustrated in the below figure. As a result, the discriminator becomes a function of multiple scale outputs of the generator (by using concatenation operations) and importantly, passes gradients to all the scales simultaneously.
The architecture of MSG-GAN for generating synchronized multi-scale images. (Open Source MIT License)
- MSG-GAN is robust to changes in the learning rate and shows a more consistent increase in image quality than progressive growing (ProGAN).
- MSG-GAN shows the same convergence behavior and consistency at all resolutions, and images generated at higher resolutions maintain the symmetry of certain features, such as the same color for both eyes or earrings in both ears. Moreover, the training phase allows a better understanding of image properties (e.g., quality and diversity).
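Since the discriminator sees generator outputs at several resolutions, the real images it compares against need matching resolutions too. The sketch below builds such a pyramid with 2x2 average pooling on a single-channel image stored as nested lists. The pooling choice and the helper names are assumptions made for illustration; the linked repository may downsample differently.

```python
def avg_pool_2x(image):
    """Halve height and width by averaging each 2x2 block (sides must be even)."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def real_image_pyramid(image, num_scales):
    """Return [full, half, quarter, ...] resolutions, num_scales entries in total."""
    pyramid = [image]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool_2x(pyramid[-1]))
    return pyramid


# An 8x8 toy "image" whose pixel value encodes its position.
real = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = real_image_pyramid(real, num_scales=3)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

Each level of the pyramid would be paired with the generator output of the same resolution before being passed to the discriminator.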
Library and Usage
git clone https://github.com/akanimax/BMSG-GAN.git   # PyTorch
python train.py --depth=7 \
    --latent_size=512 \
    --images_dir=<path to images> \
    --sample_dir=samples/exp_1 \
    --model_dir=models/exp_1
3. T2F-text-to-face-generation-using-deep-learning (StackGan++ and ProGAN)
- The ProGAN architecture works on the principle of adding new layers that model increasingly fine details as training progresses. Both the generator and the discriminator start with low-resolution images and add finer image detail in subsequent steps. This makes training more stable and faster.
- StackGAN architecture consists of multiple generators and discriminators in a tree-like structure. The different branches of the tree represent images of varying scales, all belonging to the same scene. StackGAN has been known for yielding different types of approximate distributions. These multiple related distributions include multi-scale image distributions and joint conditional and unconditional image distributions.
- T2F uses a combined architecture of ProGAN and StackGAN. ProGAN is known for the synthesis of facial images, while StackGAN is known for text encoding, with conditioning augmentation as its core working principle. The textual description is encoded into a summary vector using an LSTM network. The summary vector (the Embedding in the diagram below) is passed through the Conditioning Augmentation block (a single linear layer) to obtain the textual part of the latent vector for the GAN, using a VAE-like reparameterization technique.
- The second part of the latent vector is random Gaussian noise. The latent vector yielded is then fed to the generator part of the GAN. The embedding thus formed is finally fed to the final layer of the discriminator for conditional distribution matching. The training of the GAN proceeds layer by layer. Every next layer adds spatial resolutions at an increasing level.
- The fade-in technique is used to introduce any new layer. This step helps to remember and restore previously learned information.
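The latent vector construction and the fade-in step described above can be sketched in a few lines. This is a rough illustration under stated assumptions: the conditioning block is assumed to output a mean and a log standard deviation, and all names and dimensions here are hypothetical rather than taken from the T2F code.

```python
import math
import random


def conditioning_augmentation(mu, log_sigma, rng):
    """VAE-style reparameterization: c = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def build_latent(text_part, noise_dim, rng):
    """Concatenate the textual part with random Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return text_part + noise


def fade_in(old_upsampled, new_output, alpha):
    """Blend a newly added layer in gradually; alpha grows from 0 to 1 during training."""
    return [(1.0 - alpha) * o + alpha * n for o, n in zip(old_upsampled, new_output)]


rng = random.Random(0)
mu = [0.2, -0.1, 0.5]            # hypothetical output of the linear layer
log_sigma = [-1.0, -1.0, -1.0]   # hypothetical output of the linear layer
text_part = conditioning_augmentation(mu, log_sigma, rng)
latent = build_latent(text_part, noise_dim=5, rng=rng)

halfway = fade_in([1.0, 1.0], [3.0, 3.0], alpha=0.5)
```

At `alpha=0.0` the network behaves exactly as it did before the new layer was added, which is why previously learned information is not lost when the resolution increases.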
T2F architecture for generating face from textual descriptions, LICENSE-MIT
The figure below illustrates facial images generated from textual captions.
Library and Usage
Source — https://github.com/akanimax/T2F.git , LICENSE-MIT
git clone https://github.com/akanimax/T2F.git
pip install -r requirements.txt
mkdir training_runs
mkdir training_runs/generated_samples training_runs/losses training_runs/saved_models
python train_network.py --config=configs/11.conf
4. Object-driven Text-to-Image Synthesis via Adversarial Training
AttnGAN LICENSE — MIT
- Object-driven Attentive GAN (Obj-GAN) performs fine-grained text-to-image synthesis in two steps. First, a semantic layout (class labels, bounding boxes, and shapes of salient objects) is generated; then the image is synthesized by a de-convolutional image generator.
- Semantic layout generation takes the sentence as input to Obj-GAN, which then generates a sequence of objects specified by their bounding boxes (with class labels) and shapes.
- The box generator is trained as an attentive seq2seq model to generate a sequence of bounding boxes, followed by a shape generator to predict and generate the shape of each object in its bounding box.
- In the image generation step, the object-driven attentive generator and object-wise discriminator are designed to enable image generation conditioned on the semantic layout generated in the first step. The generator concentrates on synthesizing the image region within a bounding box by focusing on words that are most relevant to the object in that bounding box.
- Attention-driven context vectors encode information from the words that are most relevant to a given image region. Both patch-wise and object-wise context vectors are computed for the defined image regions.
- A Fast R-CNN based object-wise discriminator is also used. It is able to offer rich object-wise discrimination signals. These signals help to determine whether the synthesized object matches the text description and the pre-generated layout.
- Object-driven attention (paying attention to the most relevant words and pre-generated class labels) performs better than traditional grid attention and is capable of generating complex scenes in high quality.
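As a rough illustration of an attention-driven context vector, the sketch below scores each word of a caption against a query vector, normalizes the scores with a softmax, and returns the weighted sum of word embeddings. The 2-d embeddings and the use of a class-label embedding as the object-wise query are assumptions chosen for the example, since the Obj-GAN code is not available.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_vector(query, word_embeddings):
    """Attention-weighted sum of word embeddings for one query vector."""
    weights = softmax([dot(query, w) for w in word_embeddings])
    dim = len(word_embeddings[0])
    context = [
        sum(weight * w[d] for weight, w in zip(weights, word_embeddings))
        for d in range(dim)
    ]
    return context, weights


# Hypothetical 2-d embeddings for the caption "a red bus".
words = ["a", "red", "bus"]
embeddings = [[0.0, 0.1], [0.9, 0.2], [0.3, 1.0]]

# Object-wise query: assumed embedding of the class label attached to one bounding box.
bus_query = [0.2, 1.0]
context, weights = context_vector(bus_query, embeddings)
most_relevant = words[weights.index(max(weights))]
```

A patch-wise context vector would use the same computation, with the feature of an image patch as the query instead of a class-label embedding.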
The open-source code for Obj-GAN from Microsoft is not available yet.
5. MirrorGAN: Learning Text-to-Image Generation by Redescription
- MirrorGAN is built to emphasize global-local attentive features. It helps in the semantic-preserving text-to-image-to-text framework.
- MirrorGAN is equipped to learn text-to-image generation by re-description. It is composed of three modules: “a semantic text embedding module (STEM), a global-local collaborative attentive module for cascaded image generation (GLAM), and a semantic text regeneration and alignment module (STREAM)”.
- STEM uses a recurrent neural network (RNN) to embed the given text description into local word-level features and global sentence-level features.
- GLAM has a multi-stage cascaded generator. It stacks three image generation networks sequentially to generate target images from coarse to fine scales. During target image generation, it leverages both local word attention and global sentence attention. This helps to progressively enhance the diversity and semantic consistency of the generated images.
- STREAM aims to regenerate the text description from the generated image so that it semantically aligns with the given text description.
- The word-level attention model takes in neighboring, contextually related words to generate an attentive word-context feature. The word embedding and the visual feature are the inputs at each stage. The word embedding is first converted into the underlying common semantic space of visual features by a perception layer and multiplied with the visual feature to obtain the attention score. Finally, the attentive word-context feature is obtained by calculating the inner product between the attention score and the perception-layer output of the word embedding.
- MirrorGAN’s semantic text regeneration and alignment module keeps the input text and the output image in sync. It regenerates the text description from the generated image, so that the output semantically aligns with the given text description. An encoder-decoder image captioning framework is used to generate the captions: the encoder is a convolutional neural network (CNN) and the decoder is an RNN.
- MirrorGAN performs better than AttnGAN at all settings by a large margin, demonstrating the superiority of the proposed text-to-image-to-text framework and the global-local collaborative attentive module since MirrorGAN generated high-quality images with semantics consistent with the input text descriptions.
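The re-description idea can be illustrated with a cross-entropy loss between the caption regenerated from the image and the original caption. This is a minimal sketch that assumes the captioning decoder outputs one probability distribution over the vocabulary per step; the tiny vocabulary and the probabilities are made up for the example.

```python
import math


def redescription_loss(step_probs, caption_ids):
    """Cross-entropy between regenerated word distributions and the input caption.

    step_probs[t]  : decoder probabilities over the vocabulary at step t
    caption_ids[t] : vocabulary index of the ground-truth word at step t
    """
    return -sum(math.log(probs[idx]) for probs, idx in zip(step_probs, caption_ids))


vocab = ["a", "small", "yellow", "bird"]
caption_ids = [0, 2, 3]  # "a yellow bird"

# Image whose regenerated caption agrees with the input text.
aligned = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Image whose regenerated caption says "small" where the input text said "yellow".
misaligned = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

loss_aligned = redescription_loss(aligned, caption_ids)
loss_misaligned = redescription_loss(misaligned, caption_ids)
```

A lower loss means the generated image carries enough of the original semantics for the caption to be recovered, which is the signal passed back to the generator.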
Library and Usage
git clone git@github.com:komiya-m/MirrorGAN.git   # Python 3.6.8, Keras 2.2.4, TensorFlow 1.12.0; dependencies: easydict, pandas, tqdm
python main_clevr.py
cd MirrorGAN
python pretrain_STREAM.py
python train.py
6. StoryGAN: A Sequential Conditional GAN for Story Visualization
- Story visualization takes a multi-sentence paragraph as input and generates a sequence of images as output, one for each sentence.
- The story visualization task is a sequential conditional generation problem in which the current input sentence is considered jointly with the contextual information.
- StoryGAN focuses less on the continuity of the generated images (frames) and more on the global consistency across dynamic scenes and characters.
- It relies on the Text2Gist component in the Context Encoder, where the Context Encoder dynamically tracks the story flow in addition to providing the image generator with both local and global conditional information.
- The two-level discriminator and the recurrent structure on the inputs help to enhance the image quality and ensure consistency across the generated images and the story to be visualized.
The figure below illustrates the StoryGAN architecture. The variables in gray solid circles are the input story S and the individual sentences s1, ..., sT, together with random noise ε1, ..., εT. The generator network is built from three customized components: the Story Encoder, the Context Encoder, and the Image Generator. Two discriminators on top judge whether each image-sentence pair and each image-sequence-story pair is real or fake.
The framework of StoryGAN, LICENSE-MIT
The StoryGAN story discriminator distinguishes real from fake stories using the feature vectors of the images and sentences in the story, concatenated across the story. The product of the image and text features gives a compact feature representation that serves as input to a fully connected layer. The fully connected layer uses a sigmoid non-linearity to predict whether the story pair is real or fake.
Structure of the story discriminator, LICENSE-MIT
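The story discriminator described above can be sketched as follows: concatenate the per-frame image features and the per-sentence text features, multiply them element-wise, and pass the result through a fully connected layer with a sigmoid. The feature values, weights, and the assumption that both feature sets already have the same size are hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def story_discriminator_score(image_features, sentence_features, weights, bias):
    """Probability that an (image sequence, story) pair is real.

    image_features    : one feature vector per generated frame
    sentence_features : one feature vector per sentence (same sizes assumed)
    """
    # Concatenate the per-step feature vectors across the whole story.
    image_vec = [v for feat in image_features for v in feat]
    text_vec = [v for feat in sentence_features for v in feat]
    # The element-wise product gives a compact joint representation.
    joint = [i * t for i, t in zip(image_vec, text_vec)]
    # Fully connected layer followed by a sigmoid non-linearity.
    logit = sum(w * j for w, j in zip(weights, joint)) + bias
    return sigmoid(logit)


image_features = [[0.5, 1.0], [0.2, 0.4]]      # two frames, 2-d features each
sentence_features = [[1.0, 0.5], [0.5, 1.0]]   # two sentences, 2-d features each
score = story_discriminator_score(
    image_features, sentence_features, weights=[1.0, 1.0, 1.0, 1.0], bias=0.0
)
```

The image discriminator works on single image-sentence pairs, while this one judges the whole sequence at once, which is what encourages global consistency across frames.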
Library and Usage
git clone https://github.com/yitong91/StoryGAN.git   # Python 2.7, PyTorch, cv2
python main_clevr.py
7. Text to Image Translation in Keras (DCGAN)
In Keras, text-to-image translation is achieved using a GAN and Word2Vec together with recurrent neural networks.
It uses DCGAN (Deep Convolutional Generative Adversarial Network), which has been a breakthrough in GAN research because it introduces major architectural changes to tackle problems such as training instability, mode collapse, and internal covariate shift.
Sample DCGAN Architecture to generate 64×64 RGB pixel images from the LSUN dataset, Source, License -MIT
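To see how a DCGAN-style generator can reach 64×64 pixels, it helps to work through the output size of each transposed convolution. The sketch below assumes a 4×4 starting feature map and a kernel of 4, stride of 2, and padding of 1 per layer; these are common settings, not values confirmed by the figure or the repository.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output side length of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def generator_resolutions(start=4, num_upsampling_layers=4):
    """Side length of the feature map after each upsampling layer."""
    sizes = [start]
    for _ in range(num_upsampling_layers):
        sizes.append(conv_transpose_out(sizes[-1]))
    return sizes


resolutions = generator_resolutions()                               # 64x64 target
small_resolutions = generator_resolutions(num_upsampling_layers=3)  # 32x32 target
```

With these settings every layer doubles the spatial size, so four layers take a 4×4 map to 64×64 and three layers are enough for 32×32 images.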
Library and Usage
git clone https://github.com/chen0040/keras-text-to-image.git

import os
import sys
import numpy as np
from random import shuffle


def train_DCGan_text_image():
    seed = 42
    np.random.seed(seed)

    current_dir = os.path.dirname(__file__)
    # add the keras_text_to_image module to the system path
    sys.path.append(os.path.join(current_dir, '..'))
    # compare strings by value, not identity
    current_dir = current_dir if current_dir != '' else '.'

    img_dir_path = current_dir + '/data/pokemon/img'
    txt_dir_path = current_dir + '/data/pokemon/txt'
    model_dir_path = current_dir + '/models'

    img_width = 32
    img_height = 32
    img_channels = 3

    from keras_text_to_image.library.dcgan import DCGan
    from keras_text_to_image.library.utility.img_cap_loader import load_normalized_img_and_its_text

    # load (normalized image, caption text) pairs and shuffle them
    image_label_pairs = load_normalized_img_and_its_text(img_dir_path, txt_dir_path,
                                                         img_width=img_width, img_height=img_height)
    shuffle(image_label_pairs)

    gan = DCGan()
    gan.img_width = img_width
    gan.img_height = img_height
    gan.img_channels = img_channels
    gan.random_input_dim = 200
    gan.glove_source_dir_path = './very_large_data'

    batch_size = 16
    epochs = 1000
    gan.fit(model_dir_path=model_dir_path, image_label_pairs=image_label_pairs,
            snapshot_dir_path=current_dir + '/data/snapshots',
            snapshot_interval=100,
            batch_size=batch_size,
            epochs=epochs)


def load_generate_image_DCGaN():
    seed = 42
    np.random.seed(seed)

    current_dir = os.path.dirname(__file__)
    sys.path.append(os.path.join(current_dir, '..'))
    current_dir = current_dir if current_dir != '' else '.'

    img_dir_path = current_dir + '/data/pokemon/img'
    txt_dir_path = current_dir + '/data/pokemon/txt'
    model_dir_path = current_dir + '/models'

    img_width = 32
    img_height = 32

    from keras_text_to_image.library.dcgan import DCGan
    from keras_text_to_image.library.utility.image_utils import img_from_normalized_img
    from keras_text_to_image.library.utility.img_cap_loader import load_normalized_img_and_its_text

    image_label_pairs = load_normalized_img_and_its_text(img_dir_path, txt_dir_path,
                                                         img_width=img_width, img_height=img_height)
    shuffle(image_label_pairs)

    # load the trained generator
    gan = DCGan()
    gan.load_model(model_dir_path)

    for i in range(3):
        # each pair is assumed to be (normalized image, caption text)
        image_label_pair = image_label_pairs[i]
        normalized_image = image_label_pair[0]
        text = image_label_pair[1]

        # save the reference image for this caption
        image = img_from_normalized_img(normalized_image)
        image.save(current_dir + '/data/outputs/' + DCGan.model_name + '-generated-' + str(i) + '-0.png')

        # generate three images from the same caption
        # note: j == 0 reuses the '-0.png' file name of the reference image saved above
        for j in range(3):
            generated_image = gan.generate_image_from_text(text)
            generated_image.save(current_dir + '/data/outputs/' + DCGan.model_name + '-generated-' + str(i) + '-' + str(j) + '.png')
Here I have presented some of the popular techniques for generating images from text. You can explore more techniques at https://github.com/topics/text-to-image. Happy Coding!!