A Rapid Review of Image Captioning

Image captioning is the automatic generation of text that describes the content observed in an image. In this paper we conduct a rapid review, design a framework, and build application models. We organize the review of image captioning into four categories: input model, process model, output model, and lingual image caption. The input model is based on caption criteria, method, and dataset. The process model is based on the type of learning, encoder-decoder, image extractor, and evaluation metric. The output model is based on architecture, feature extraction, feature mapping, model, and the number of captions. The lingual image caption category is based on the language model and has two groups: bilingual image caption and cross-language image caption. We also design three framework models and build three application models. Finally, we give our view on trends and on future research that can be developed with image caption generation, such as the comparison of computer vision versus human vision.
Keywords: image, caption, language, model


Introduction
Image captioning is a popular and relatively new research topic in image analysis and text analysis. It is an automatic process for creating image captions. The caption is natural language text based on the image [1]. Image captioning is defined as the process of producing a textual description for an image [2]. In other words, it takes an image as input and produces a description of it in natural language.
Image captioning starts with computer vision, which identifies objects, attributes, and their relationships. Natural language processing then handles syntax and semantics, and finally machine learning is used to produce the text [3]. In other words, it is a technology that produces textual descriptions of images for semantic indexing. Recognizing objects in an image, understanding events through the relationships between the recognized objects, and analyzing the image all benefit from this technology [4]. Image captioning arises from the need to translate between two different modalities (multimodal), which usually come in pairs.
Content-based image retrieval (content-based information retrieval) can be done with image captioning [5]. The applications of image captioning are broad and significant, especially in the field of human-computer interaction. In practice, image captioning helps people with disabilities interact with other people [6]. It is also used to create descriptions of multimedia content, to assist e-commerce companies with digital marketing, and to create narratives for online news content [7]. Image captioning can also help people with visual impairments understand image content. In the medical field, image captioning can support diagnosis or medical assistance based on image content analysis [8]. Image captioning also plays an important role in interaction and textual communication on social media, where text is an important component of posts that pair images with captions and can increase user engagement in campaigns and advertisements. Apart from social media, image captions also help produce sentences that better reflect user activity, support activity on business sites, and provide captions for digital organization and company profiles.
The main contributions of this paper are summarized as follows: 1) A categorization model for image-to-text synthesis based on the input model, process model, output model, and lingual image caption model. 2) Framework models as the conceptual model for image captioning generation, and application models as application designs for implementing image captioning. 3) Computer vision, natural language processing, and machine learning as the main components of image-to-text synthesis.
We conduct a rapid review of articles and papers from IEEE Xplore, arXiv, ACM, Science Direct, and Elsevier. We select articles with the keyword "image captioning". We find 12 survey and review papers, 9 technical papers for the keyword "image lingual caption", and 9 technical papers for the keyword "image lingual caption plus multi-modal".
A rapid review organized around the categorization model used in this study is underexplored for the topic of image captioning. Image captioning is an applied and forward-looking development in image-to-text synthesis.

Input Model
The input model, based on caption criteria, method, and dataset, is shown in Table 1. The template-based approach uses a predefined template with several empty slots for generating text. In this approach, objects, attributes, and actions are detected first and then used to fill the empty slots in the template [5]. The text for a query image is selected from a set of candidate texts. Novel captions are produced by analyzing the visual content of an image and generating a description of it with a language model [5]. The image caption is generated under syntactic and semantic constraints. In the retrieval-based approach, the caption is taken from an existing, predetermined set of sentences [10]. This method finds visually similar images in the training dataset and draws the candidate text from their captions. In image captioning applications, novel concepts that are not in the predetermined vocabulary can be learned from sentence data paired with images, so there is no need to retrain the entire system when images with new concepts emerge [10].
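The slot-filling idea behind the template-based approach can be illustrated with a minimal sketch. It assumes the detector output is already available as a dictionary of slot values; the template string and the function name `fill_template` are hypothetical and are not taken from the reviewed papers.

```python
# Hypothetical template with three empty slots (illustrative only).
TEMPLATE = "A {attribute} {object} is {action}."


def fill_template(detections, template=TEMPLATE):
    """Fill the empty slots of a caption template with detected words.

    `detections` stands in for the output of an object, attribute, and
    action detector; a real system would compute it from the image.
    """
    required = ("attribute", "object", "action")
    missing = [slot for slot in required if slot not in detections]
    if missing:
        raise ValueError(f"missing slots: {missing}")
    return template.format(**detections)


caption = fill_template(
    {"attribute": "brown", "object": "dog", "action": "running"}
)
print(caption)  # A brown dog is running.
```

The sketch also shows the main limitation of templates: the sentence structure is fixed, so only the slot words vary between captions.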
The attention mechanism operates on the signal from the input image, frame by frame, as it is fed into the encoder-decoder [10]. This model is made to replicate the way people pay attention to objects in an image and the way people make annotations between objects in the image [11]. People can select primary information and ignore secondary information. This ability to choose is called attention. Novel object captioning can describe objects that are not in the paired image-text dataset. Steps: (1) A separate lexical classifier and a language model are trained. (2) A captioning model is trained on paired image-text data. (3) The models are combined and trained jointly to produce text for novel objects. The semantic concept-based method selectively attends to semantic concepts extracted from the image. Steps: (1) Image features are encoded by the encoder. (2) The image features are fed into the input of the language model. (3) Semantic concepts are added to the various hidden states of the language model. (4) Semantic concepts offer a clear way to feed sentiment into the image captioning system [11]. An important aspect of the attention mechanism is how to decide what to describe and in what order in the image captioning process [13].
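As an illustration of how attention selects information, the following minimal sketch computes soft attention over image regions. The dot-product scoring, the function names, and the toy feature values are assumptions made for this example; the reviewed papers use learned scoring functions.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Soft attention: weight each region by its relevance to the query.

    Relevance is a plain dot product here (an assumption for this sketch).
    Returns the attention weights and the weighted sum of region features.
    """
    scores = [
        sum(f * q for f, q in zip(region, query))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Three toy image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(regions, query=[2.0, 0.0])
print(weights)
```

Regions whose features align with the query (here the first and third) receive larger weights, which mirrors the idea of focusing on primary information.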
Flickr8K is a dataset of about 8,000 images extracted from Flickr. The photos (mainly of humans and animals) are split into 6,000 training images, 1,000 validation images, and 1,000 test images [1]. This dataset has 5 reference captions for each image [5].
Flickr30K is a dataset of about 30,000 images, each with 5 reference captions provided by humans. This dataset also contains detectors for common objects and a color classifier [5]. It is an extension of the Flickr8K dataset. The pictures in this dataset are mainly of daily human activities. The caption for each image is annotated by humans through crowdsourcing services [10]. The images in this dataset come from Yahoo's photos. These images can be used for training, testing, and validation in captioning [12].
MS COCO (Microsoft Common Objects in Context) is a very large dataset for image recognition, object segmentation, and captioning. It contains hundreds of thousands of images, millions of instances, dozens of object categories, and five captions per image [5]. This dataset targets scene understanding and captures images of complex everyday events. It is more challenging because the images contain many objects, cluttered backgrounds, and complex semantic relationships. Other datasets for image captioning are BBC News, Pascal UIUC, SBU, PASCAL 1K, AI Challenger Dataset, STAIR Captions, IAPR TC-12, Stock3M, FlickrStyle10K, and Visual Genome [12].

Process Model
The process model is based on the type of learning, encoder-decoder, image extractor, and evaluation metric. The process model is shown in Table 2. In supervised learning, the training data comes with the desired outputs, called labels, while unsupervised learning deals with data that are not labeled. Reinforcement learning is an approach that discovers data and/or labels through exploration and reward signals.
A GRU (Gated Recurrent Unit) uses gates to control the flow of information [5]. A CNN (Convolutional Neural Network) is used to understand image content. An LSTM (Long Short-Term Memory) network focuses on modelling text data, combining context information with image information, and predicting the word distribution [12]. An RNN (Recurrent Neural Network) is used in the encoder-decoder process, where the encoder is a CNN and the decoder is an RNN.
Commonly used image feature extractors are VGG-Net and Res-Net. VGG-Net is a simple and powerful model. Res-Net is the most efficient of the extractors compared here. Alternative image feature extractors are Alex-Net, Google-Net, and Dense-Net. Google-Net is also called Inception-X Net [11].
Evaluation metrics are used to measure linguistic quality and semantic correctness. They are necessary for comparing texts, analyzing images, and assessing generated sentences. Evaluation metrics include BLEU, ROUGE, METEOR, SPICE, and CIDEr. BLEU (Bilingual Evaluation Understudy) counts the words that a candidate caption shares with the reference caption [3] and is used to measure the similarity between two sentences. BLEU compares candidate sentences with reference sentences [10], following the same procedure as in machine translation evaluation: the score reflects how far the candidate sentence is from the reference sentences. BLEU can also be used to perform sentence-level analysis [12].
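To make the comparison of candidate and reference sentences concrete, the sketch below computes clipped n-gram precision and a small BLEU score with a brevity penalty. It is a simplified illustration: the default of `max_n=2` with uniform weights is an assumption chosen for short captions (BLEU is usually reported with up to 4-grams), and no smoothing is applied.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams of a tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [
        modified_precision(candidate, references, n)
        for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    # Use the reference whose length is closest to the candidate length.
    ref_len = min(
        (abs(len(r) - len(candidate)), len(r)) for r in references
    )[1]
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(bleu("the cat sat on the mat".split(), [reference]))
```

Clipping is what prevents a degenerate candidate such as "the the the the the the the" from scoring well: only two of its seven words are credited, because "the" appears twice in the reference.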
CIDEr (Consensus-based Image Description Evaluation) is a metric that measures the similarity of a generated sentence to a set of ground-truth sentences [12]. It encodes how often the n-grams of the candidate sentence appear in the reference sentences. This metric is used to evaluate sentences in terms of grammaticality, salience, importance, and accuracy [10]. CIDEr is specially designed for image annotation problems and measures image annotation consistency.
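The n-gram encoding used by this consensus metric can be sketched as a cosine similarity between TF-IDF weighted n-gram vectors. The sketch below is a simplification under stated assumptions: it scores a single n-gram order, whereas the full metric averages several orders, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a tokenized sentence."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def document_frequency(references_per_image, n):
    """For each n-gram, count the images whose references contain it."""
    df = Counter()
    for references in references_per_image:
        seen = set()
        for ref in references:
            seen.update(ngram_counts(ref, n))
        df.update(seen)
    return df


def tfidf_vector(tokens, n, df, num_images):
    """Weight each n-gram by term frequency times inverse doc frequency."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return {
        gram: (count / total) * math.log(num_images / max(1, df[gram]))
        for gram, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consensus_score(candidate, references, df, num_images, n=1):
    """Average TF-IDF cosine similarity between candidate and references."""
    cand_vec = tfidf_vector(candidate, n, df, num_images)
    scores = [
        cosine(cand_vec, tfidf_vector(ref, n, df, num_images))
        for ref in references
    ]
    return sum(scores) / len(scores)


# Toy corpus: two images with one reference caption each.
corpus = [["a dog runs".split()], ["a cat sleeps".split()]]
df = document_frequency(corpus, 1)
print(consensus_score("a dog runs".split(), corpus[0], df, len(corpus)))
```

In this toy corpus the word "a" occurs in the references of every image, so its inverse document frequency is zero and it contributes nothing to the score; only the informative words "dog" and "runs" count.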
METEOR (Metric for Evaluation of Translation with Explicit Ordering) is another metric from machine translation. It aligns the candidate sentence with the reference sentences and calculates a score from the matching results. The calculation involves the precision, recall, and alignment of the matched words. This metric can overcome weaknesses of the BLEU metric. Apart from exact word matches, it can also match synonyms, and it is designed to correlate well at the segment or sentence level [3]. It is based on matched n-gram precision [10]. METEOR combines the harmonic mean of precision and recall with paraphrase matching between reference sentences and candidate sentences. It gives a more accurate evaluation, especially when the number of reference sentences is small [12].
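The harmonic mean of precision and recall at the core of METEOR can be sketched as follows. This is a deliberately reduced version: it uses exact unigram matches only and omits synonym matching, paraphrasing, and the word-order penalty. The recall-heavy weight `alpha=0.9` is an assumption for this sketch.

```python
from collections import Counter


def unigram_f_mean(candidate, reference, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall.

    Only exact word matches are counted (a simplification). With
    alpha close to 1, recall dominates the score.
    """
    matches = sum((Counter(candidate) & Counter(reference)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


candidate = "a dog runs".split()
reference = "a dog runs fast".split()
print(unigram_f_mean(candidate, reference))
```

Here every candidate word is matched (precision 1.0) but one reference word is missing (recall 0.75), and the score stays close to the recall value because recall carries most of the weight.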
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is designed to use the longest common subsequence between a candidate sentence and a set of reference sentences. The longest common subsequence between two sentences requires the matching words to appear in the same order, but they do not have to be consecutive. ROUGE is used to evaluate text summarization algorithms. A higher ROUGE score indicates better performance [1].
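The longest-common-subsequence variant of this metric (commonly called ROUGE-L) can be sketched with a standard dynamic-programming table. The function names and the default `beta=1.0`, which weights precision and recall equally, are assumptions for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.

    Matching tokens must keep their order but need not be adjacent.
    """
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference, beta=1.0):
    """F-score built from LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(lcs_length(candidate, reference))  # 5
print(rouge_l(candidate, reference))
```

In the example the common subsequence is "the cat on the mat": the words are in the same order in both sentences, although "sat" and "is" interrupt them.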
SPICE (Semantic Propositional Image Caption Evaluation) measures how effectively a caption recovers objects, attributes, and relationships. SPICE can capture human judgments about model-generated captions [1]. This metric computes caption similarity from the scene graph tuples of the candidate sentences and the reference sentences [12].
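The tuple comparison can be illustrated with an F-score over sets of scene graph tuples. The sketch assumes the tuples have already been extracted by a parser, which is the hard part and is not shown; it also uses exact matching only, so the function name and the toy tuples are purely illustrative.

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """F1 score over scene graph tuples (objects, attributes, relations).

    Tuples are assumed to be given; a real evaluation would obtain them
    by parsing the captions into scene graphs.
    """
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy tuples: (object,), (object, attribute), (subject, relation, object).
candidate = [("dog",), ("dog", "brown"), ("dog", "on", "grass")]
reference = [("dog",), ("grass",), ("dog", "brown"), ("dog", "on", "grass")]
print(tuple_f1(candidate, reference))
```

Because the comparison is made on meaning-bearing tuples rather than on word sequences, two captions with different wording but the same objects, attributes, and relations receive a high score.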
BLEU works well for evaluating short sentences. ROUGE can be used to evaluate different types of text. BLEU and METEOR are suited to machine translation, ROUGE to automatic summaries, and CIDEr and SPICE to captions [1]. METEOR can perform evaluations on various segments of information. SPICE is better at understanding the semantic breakdown of text than the other metrics [5]. The most widely used evaluation metrics are BLEU and ROUGE, but they correlate weakly with human judgment. METEOR and CIDEr correlate better with human assessment [12].

Output Model
The output model is based on architecture, feature extraction, feature mapping, model, and the number of captions. The output model is shown in Table 3. The compositional architecture-based method consists of several independent functional building blocks: (1) Semantic concepts are extracted from the image by a CNN. (2) A set of candidate captions is generated by the language model. (3) The candidate captions are re-ranked using an RNN [5]. The encoder-decoder is a framework for transforming one representation into another. The encoder network encodes the input into context vectors, and the decoder network translates the context vectors to produce the output [9]. In this encoder-decoder process, the encoder first encodes an image into an intermediate representation; the decoder then takes the intermediate representation as input and generates the caption word by word [10]. The encoder "reads" the image (the input is an image) and extracts a high-level feature representation of it. The decoder "produces words" (the output is words) that represent the encoded image. This process produces words that describe the image in sentences that are syntactically and semantically correct [11].
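The word-by-word generation performed by the decoder can be sketched as a greedy decoding loop. The decoder itself is replaced by a hypothetical lookup table (`toy_step`) so that the example runs on its own; in a real system that function would be a trained RNN, LSTM, or GRU conditioned on the encoded image. The start and end tokens are also assumptions for this sketch.

```python
START, END = "<start>", "<end>"


def greedy_decode(step_fn, context, max_len=20):
    """Generate a caption word by word, always taking the likeliest word.

    `step_fn(context, words)` must return a dict mapping each candidate
    next word to its probability.
    """
    words = [START]
    for _ in range(max_len):
        probabilities = step_fn(context, words)
        next_word = max(probabilities, key=probabilities.get)
        if next_word == END:
            break
        words.append(next_word)
    return words[1:]  # drop the start token


def toy_step(context, words):
    """Stand-in for a trained decoder (hypothetical fixed probabilities).

    It ignores `context`; a real decoder would condition on the
    intermediate representation produced by the encoder.
    """
    table = {
        START: {"a": 0.9, "dog": 0.1},
        "a": {"dog": 0.8, END: 0.2},
        "dog": {"runs": 0.6, END: 0.4},
        "runs": {END: 1.0},
    }
    return table[words[-1]]


caption = greedy_decode(toy_step, context=[0.2, 0.7])
print(" ".join(caption))  # a dog runs
```

Greedy decoding is the simplest choice; beam search, which keeps several partial captions at each step, is a common alternative when caption quality matters more than speed.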
Handcrafted features are based on maximum likelihood estimation. This approach learns the visual detectors and language models from the image captioning dataset. It is used to analyze images, detect objects, and generate descriptions. Deep learning features can be used to generate text for image captions. The RNN is used as an attention mechanism in image captioning. This approach achieves good results in language modelling [1]. It can handle large and varied image sets. CNNs are widely used for feature learning, with a classifier such as soft-max used for classification. A CNN is used for image understanding and an RNN is used for text generation [5].
Text can be generated by mapping in visual space or in multimodal space. The visual space-based method performs an explicit mapping from images to captions. In contrast, the multimodal space-based method combines vision and language in an implicit mapping [5]. In dense captioning, text is generated for each region of a scene in an image. Region-based descriptions are objective and detailed, and they are better suited to local image descriptions than to global image descriptions. The dense localization layer processes the image efficiently, implicitly predicting a set of regions of interest. Another method is whole-scene captioning, which generates text for the entire scene and covers the encoder-decoder, compositional, attention, semantic concept, novel object, and deep learning approaches. This method generates single or multiple captions for the entire scene [5].
The classic model builds natural language from visual content. Its visual-to-text techniques fall into four distinct categories: (1) language rules or template-based generation, (2) retrieving descriptions from other visual content, (3) supplementing descriptions with visual recognition, and (4) building a language model to produce descriptions. The advanced model covers the most recent image-to-text captioning approaches. It encodes an image and passes it to a text decoder, translating images into sentences. This model includes visual-text embedding and attention mechanisms [12].

Lingual Model
The bilingual image caption is shown in Table 4, and the cross-lingual image caption is shown in Table 5. Lingual models for image captions are divided into 2 groups: bilingual (image to text) and cross-language (image to text plus multimodal). Bilingual captioning generates image captions from English into a non-English language (1 source language to 1 target language). Cross-lingual captioning generates image captions from English into non-English languages (1 source language to more than 1 target language). The cross-lingual caption is based on the language model, and the cross-modal caption is based on multimodal data (text, image, speech, sound, audio, and video). Building cross-lingual models depends on their ability to support non-English languages with annotation-based model trends. Such models are needed because labeled non-English data is scarce. Machine translation benefits more from large data (bilingual and multilingual contexts).

Table 4. Bilingual image caption
4. [17]: Bengali (India) language generator, framework for the translator, Gaussian-smoothed semantic features, framework for composition and decomposition. Image caption generation from English text to the Hindi language.
5. Turkish Image Captioning [18]: Turkish dataset, CNN encoder to RNN decoder, clustering for the Microsoft COCO dataset. Image caption generation from English text to the Turkish language.
6. Indonesian Image Captioning [19]: Inception-V3 deep CNN with the Image-Net dataset and CNN encoder to GRU decoder. Image caption generation from English text to the Indonesian language.
7. Hindi Image Captioning [20]: Attention mechanism, CNN encoder to GRU decoder, and Res-Net 101 dataset. Image caption generation from English text to the Hindi language.
8. Arabic Image Captioning [21]: Deep learning for the Arabic language, Microsoft COCO dataset, Flickr8k dataset, English-Arabic translator, RNN-LSTM model, and deep CNN. Image caption generation from English text to the Arabic language.
9. Japanese Image Captioning [22]: STAIR Captions dataset and neural network for STAIR Captions. Image caption generation from English text to the Japanese language.

Table 5. Cross-lingual image caption
1. English text to Hindi language caption, Hindi Visual Genome dataset, RNN encoder to LSTM decoder, parallel text (neural machine translation attention model), image caption generation (image + Hindi text), and multimodal (image + parallel text).
2. Cross-Lingual Caption [23]: Mandarin language tagging with cascading MLP, COCO-CN dataset, captioning with sequential learning, and Enhanced W2VV for extraction.
3. Cross-Modal Caption [24]: Cross-modal language generation, Pivot Language Generation Stabilization, primer annotation with the translator, text-to-text translation, image-to-text in cross-modal with parallel annotation, and monolingual English model converted to target languages (French, Italian, German, Spanish, and Hindi).
4. Cross-Modal Caption [25]: English text to image captioning, coherence model, various coherence relations (visible, subjective, action, story, and meta), annotation protocol, the task of learning image-to-text relations, and coherence prediction.
5. Cross-Modal Caption [26]: English text to image captioning, attention based on the encoder-decoder model, global and local information exploring and distilling (GLIED), visual region, attribute co-location, context attention, Microsoft COCO dataset, and whole-scene caption.
6. Cross-Lingual Caption [27]: Image captioning in the Chinese language, reinforcement learning for language errors, cross-lingual non-pairwise, supervised learning, and non-pairwise English text.
7. Multi-Modal Caption [28]: English text to Chinese image captioning, annotation for cross-lingual caption, zero-shot adaptive, learning without injection, Flickr8k-CN dataset, and three visual models.
8. Cross-Lingual Caption [29]: Supervised learning, fluency-guided learning framework, English-Chinese dataset, minimal manual annotation, and caption generation in two languages (English-Chinese).
9. Cross-Lingual Caption [30]: Caption generation in the Japanese language, large corpus for image caption generation, bilingual corpus vs. monolingual corpus, deep recurrent architecture, and corpus equal to the English language dataset.

Framework Model
The frameworks created as models for building image captioning can be seen in Figure 1, Figure 2, and Figure 3. Framework 2 classifies image captioning by input, process, and output: the input uses computer vision, the process uses machine learning, and the output uses natural language processing. Framework 3 also presents image captioning by input, process, and output. First, the input image, taken from the Flickr8k dataset, is prepared with an image transform. Next, the encoder step uses the CNN model to produce image features and embedded image feature vectors. Next, the decoder step uses the LSTM model, RNN model, or GRU model. Finally, the generated text is displayed as the image caption using natural language processing. Framework 1, Framework 2, and Framework 3 are conceptual models. This framework is popularly used to develop image-to-text synthesis for image captioning generation.
Figure 4 shows a CNN as the encoder that understands the image (image understanding) and an RNN as the decoder that generates words (text generation). The CNN encodes the image and the RNN decodes a sentence from the embedding. Together they convert an image to text to produce the output (image captioning). At time t, the input x_t is used to calculate the hidden state h_t. We multiply h_t by the matrix W to predict y_t. The RNN predicts each word from the hidden state of the previous time step and the current input.
Figure 5 shows a CNN as the encoder (image understanding) and an LSTM as the decoder (text generation). Together they convert an image to text to produce the output (image captioning). The output layer is a soft-max function, which generates the probability distribution vector for each character. This function helps to generate a complete translated word. The soft-max (Sm) is used to predict the probability of each candidate for the next caption word. The word embedding (We) converts a word index to a vector in a higher-dimensional space.
Figure 6 shows a CNN as the encoder (image understanding) and a GRU as the decoder (text generation). Together they convert an image to text to produce the output (image captioning). A GRU is a type of RNN. It uses 2 gates (a reset gate and an update gate) instead of the 3 gates of an LSTM. The captions associated with each image have to be transformed by a tokenizer into a list of tokenized words. The image model and the language model are combined by addition and fed into another fully connected layer. A pre-trained word embedding model is used in training our embedding layer.
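The recurrent step described for Figure 4 and the soft-max output layer can be sketched in a few lines of plain Python. The weight names (`w_xh`, `w_hh`, `w_hy`), the tanh activation, and the toy weight values are assumptions for this illustration and are not taken from the figures; a real decoder would learn these matrices during training.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Probability distribution over the vocabulary."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x_t, h_prev, w_xh, w_hh, w_hy):
    """One decoder step.

    h_t = tanh(w_xh x_t + w_hh h_prev)   new hidden state
    y_t = softmax(w_hy h_t)              next-word probabilities
    """
    pre_activation = [
        a + b for a, b in zip(matvec(w_xh, x_t), matvec(w_hh, h_prev))
    ]
    h_t = [math.tanh(p) for p in pre_activation]
    y_t = softmax(matvec(w_hy, h_t))
    return h_t, y_t


# Toy sizes: 2-dimensional input and hidden state, 3-word vocabulary.
w_xh = [[1.0, 0.0], [0.0, 1.0]]
w_hh = [[0.0, 0.0], [0.0, 0.0]]
w_hy = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h_t, y_t = rnn_step([1.0, 0.0], [0.0, 0.0], w_xh, w_hh, w_hy)
print(h_t, y_t)
```

The hidden state `h_t` is passed back in as `h_prev` at the next time step, which is how the prediction of each word depends on the words generated before it. LSTM and GRU decoders replace the single tanh update with gated updates but keep the same overall loop.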

Conclusion
We have discussed image captioning generation. We conducted a rapid review, created frameworks, and built application models. This review provides a comprehensive analysis of the input, process, output, and lingual caption aspects of image captioning generation. The framework design serves as the basis for the application models. These applications can serve as a technical reference and as examples for building an image captioning application.
In our view, there is much room for improvement and innovation in research on image captioning generation. Image captioning can be further developed based on the synthesis of computer vision vs. human vision, deep learning vs. broad learning as a learning model, and the synthesis of image to text and text to image by using natural language processing. Image captioning generation has become a multidisciplinary research area (image analysis, text analysis, and content analysis), an integration of applications and models (computer vision, machine learning, and natural language processing), and a context system (multimedia to multimodal), which makes it an exciting and interesting field of research.