Automatic Image Caption Generation is considered as one of the challenging research fields in Artificial Intelligence. The main task in Image caption generation is to take an image, analyze its visual content and then generate a textual description accordingly. Since this field needs both visual and textual understanding, it combines both Computer Vision (CV) and Natural Language Processing (NLP) techniques 1.For the past five years until now, Automatic Image Caption Generation has been an area of interest for many researchers, since it has many useful applications based on image captions such as classifying images in separate albums, filtering harmful or violence images for kids, detecting cyberbullying from images, recognizing interest of people in social media platforms based posted images and much more. In this survey, we discuss the three main approaches used in automatic image caption generation in early work and recent work, and highlight their advantages and disadvantages.
Many papers were proposed to discuss different image caption generation approaches. We summarize the three main approaches in the below diagram, with the focus on the third approach, which is the recent work in the field.
• Template based:
In Template based approach, automatic image caption generation follows a standard pipeline. First, computer vision techniques are used to extract the visual contents in the image such as objects, scenes and actions. Then, the generated words from the first step are combined to form a full sentence using NLP techniques (grammar rules, n-grams, etc.). Kulkarni et al 2 used CV techniques to extract the image attribute tuples (object, visual attribute, spatial relationships), and then the generated words are combined using n-gram based language models to get the final sentence. Elliott and Keller 3 made an explicit use of the image structure instead of using the image attributes like Kulkarni. They created a visual dependency representation(VDR) graph of the image to give a meaningful relationship between each region in the image. Template based image caption generation results in a correct and relevant sentence, since it highly depends on the visual contents. However, the approach is strictly constrained to the contents of the image, which will not give us any complex- generated sentences or understands the context of the image, and therefore it makes the generated sentence too simple and less natural than the human’s sentence.
• Retrieval Based:
This approach states that, given a query image, the caption is generated by retrieving one or a set of sentences that are pre-defined by humans. Ordonez et al. 4 proposed IM2TEXTMODEL, which retrieves a matching set of images to the query image from a web scale captioned collection, then they extract high level information about image content to perform re-ranking for the images and finally, choose the top four associated captions. Mason and Charniak’s approach 5 solves the problem in Ordonez’s approach of having noisy estimations of the visual content and poor alignment between images, by doing the re-ranking based on textual information. Retrieval based image caption generation usually results in a grammatically correct and fluent phrase, since the output depends on human-written sentences. However, using this approach will require large amount of training data so it can generate the correct relevant description. Also, this approach can’t adapt to new combination of objects that does not exist in the training set, and may result in irrelevant caption generation.
• Deep Neural Network based
The first two approaches were proposed in early work for image caption generation. However, recent work relies mostly on the concept of Deep Neural Networks (DNN). There are many approaches used in DNN , we mention here some of the important ones.
o DNN based on Multimodal training
In this approach, both visual and textual data are used for training the model. Therefore, for any
given image query, the representation of image-description is used to perform cross-modal retrieval. The approach first extracts the image features using a feature extractor , then these features are fed into a Neural Language model in order to predict the words. Kiros et al. 6 have used the Convolutional Neural Network (CNN) to extract the features of the image, then, Recurrent Neural Networks (RNN) language model is used to train the model to generate the next word based on the previous words and image features. The used approach thereby is considered as a language-visual model(multi-modal).
o Retrieval based approach Augmented by DNN
This approach uses the method of Retrieval based and utilizes the use of Neural Networks to extract features from the images and generate phrases. Socher et.a 7 used a Deep Neural Network as a visual model to extract the features from images, and a Dependency Tree Recursive Neural Network (DT-RNN) as a compositional vector .After getting the multi-modal features, they are mapped into a common space to finally generate the caption. Karpathy et al. 8 then improved the sentence retrieval performance obtained in Socher’s paper, by using the fragments of images and sentence in mapping instead of mapping the entire image and sentence.
o Based on Encoder- Decoder framework
An Encoder – Decoder framework in neural network encodes an image into an intermediate representation, and then a decoder RNN takes the intermediate representation as input and generates a phrase word by word. Vinyals9 et al. used CNN to encode image features, and Long Short Term Memory RNN to decode the image features into sentences. Donahue et.al 10 created a model that feeds the system both image and word features at each stage, making their model more flexible than Vinyal’s to be applied to a variety of vision tasks involving sequential inputs and outputs.
DNN provides a better understanding of the image and more realistic phrase generation than the template and retrieval based. It also does not depend on any existing sentences or images unlike early approaches.
In this survey, we have summarized the three main approaches used in image caption generation and highlighted their advantages and advantages. We conclude that using the Deep Neural Networks approach is the most efficient way to produce captions for the images, since it uses Deep learning algorithms to extract the features from the image and in generating the phrases. Most of the proposed research papers in this field involves generating a single language caption only. Therefore, generating multilingual captions for the image can be considered as one of the future directions for this research. Generating Arabic captions for the images can be also considered, since most of the papers proposed are for generating English captions only.