Image captioning is the automatic generation of text describing the content observed in an image. In this paper, we review the literature, design frameworks, and build application models. We organize the review of image captioning into four categories: input model, process model, output model, and lingual image caption. The input model is characterized by caption criteria, method, and dataset. The process model is characterized by type of learning, encoder-decoder, image extractor, and evaluation metric. The output model is characterized by architecture, feature extraction, feature mapping, model, and number of captions. The lingual image caption category is based on the language model and comprises two groups: bilingual image captioning and cross-lingual image captioning. We also design a framework with three framework models and build an application with three application models. Finally, we offer opinions on research trends and future directions for image caption generation, which can be developed further along the line of computer vision versus human vision.

Article Details

How to Cite
Adriyendi, A. (2021). A Rapid Review of Image Captioning. Journal of Information Technology and Computer Science, 6(2), 158–169.


References

  1. Wang, H., Zhang, Y., & Yu, X.: An Overview of Image Caption Generation Methods. Computational Intelligence and Neuroscience, 1–13 (2020).
  2. M., Khan, A., Mahar, M. S., Hassan, S., Ghafoor, A., & Khan, M.: Image Captioning using Deep Learning: A Systematic Literature Review. International Journal of Advanced Computer Science and Applications, 11(5), 278–286 (2020).
  3. Kalra, S., & Leekha, A.: Survey of Convolutional Neural Networks for Image Captioning. Journal of Information and Optimization Sciences, 41(1), 239–260 (2020).
  4. Bang, S., & Kim, H.: Context-based Information Generation for Managing UAV-acquired Data using Image Captioning. Automation in Construction, 112(103116), 1–10 (2020).
  5. Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H.: A Comprehensive Survey of Deep Learning for Image Captioning. ACM Computing Surveys, 51(6), 118:1–118:36 (2019).
  6. Huang, Y., Chen, J., Ouyang, W., Wan, W., & Xue, Y.: Image Captioning With End-to-End Attribute Detection and Subsequent Attributes Prediction. IEEE Transactions on Image Processing, 29, 4013–4026 (2020).
  7. Lam, Q. H., Le, Q. D., Nguyen, K. V., & Nguyen, N. L.-T.: UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning. arXiv:Cs.CL, 1–12 (2020).
  8. Yang, Zhenyu, & Liu, Q.: ATT-BM-SOM: A Framework of Effectively Choosing Image Information and Optimizing Syntax for Image Captioning. IEEE Access, 8, 50565–50573 (2020).
  9. Yang, Zhilin, Yuan, Y., Wu, Y., Salakhutdinov, R., & Cohen, W. W.: Review Networks for Caption Generation. In: 30th Conference on Neural Information Processing Systems, 1–9 (2016).
  10. Bai, S., & An, S.: A Survey on Automatic Image Caption Generation. Neurocomputing, 311, 291–304 (2018).
  11. Staniūtė, R., & Šešok, D.: A Systematic Literature Review on Image Captioning. Applied Sciences, 9(2024), 1–20 (2019).
  12. Li, S., Tao, Z., Li, K., & Fu, Y.: Visual to Text: Survey of Image and Video Captioning. IEEE Transactions on Emerging Topics in Computational Intelligence, 3(4), 297–312 (2019).
  13. He, S., Liao, W., Tavakoli, H. R., Yang, M., Rosenhahn, B., & Pugeault, N.: Image Captioning through Image Transformer. arXiv:Cs.CV, 1–17 (2020).
  14. Li, X., Xu, C., Wang, X., Lan, W., Jia, Z., Yang, G., & Xu, J.: COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval. IEEE Transactions on Multimedia, 7(7), 1–14 (2019).
  15. Zhang, B., Zhou, L., Song, S., Chen, L., Jiang, Z., & Zhang, J.: Image Captioning in Chinese and Its Application for Children with Autism Spectrum Disorder. In: Proceedings of the 12th International Conference on Machine Learning and Computing, 426–432 (2020).
  16. Aung, S. P. P., Pa, W. P., & Nwe, T. L.: Automatic Myanmar Image Captioning using CNN and LSTM-Based Language Model. In: Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL), 139–143 (2020).
  17. Sur, C.: Gaussian Smoothen Semantic Features (GSSF) - Exploring the Linguistic Aspects of Visual Captioning in Indian Languages (Bengali) Using MSCOCO Framework. arXiv:Cs.CL, 1–12 (2020).
  18. Yılmaz, B. D., Demir, A. E., Sönmez, E. B., & Yıldız, T.: Image Captioning in Turkish Language. In: Innovations in Intelligent Systems and Applications Conference, 1–5 (2019).
  19. Nugraha, A. A., Arifianto, A., & Suyanto, S.: Generating Image Description on Indonesian Language using Convolutional Neural Network and Gated Recurrent Unit. In: 7th International Conference on Information and Communication Technology (ICoICT), 1–6 (2019).
  20. Dhir, R., Mishra, S. K., Saha, S., & Bhattacharyya, P.: A Deep Attention based Framework for Image Caption Generation in Hindi Language. Computación y Sistemas, 23(3), 693–701 (2019).
  21. Al-muzaini, H. A., Al-yahya, T. N., & Benhidour, H.: Automatic Arabic Image Captioning using RNN-LSTM-Based Language Model and CNN. International Journal of Advanced Computer Science and Applications, 9(6), 67–73 (2018).
  22. Yoshikawa, Y., Shigeto, Y., & Takeuchi, A.: STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset. arXiv:Cs.CL, 1–5 (2017).
  23. Song, Y., Chen, S., Zhao, Y., & Jin, Q.: Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards. In: Proceedings of the 27th ACM International Conference on Multimedia, 1–9 (2019).
  24. Thapliyal, A. V., & Soricut, R.: Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage. arXiv:Cs.CL, 1–11 (2020).
  25. Alikhani, M., Sharma, P., Li, S., Soricut, R., & Stone, M.: Clue: Cross-modal Coherence Modeling for Caption Generation. arXiv:Cs.CL, 1–11 (2020).
  26. Liu, F., Ren, X., Liu, Y., Lei, K., & Sun, X.: Exploring and Distilling Cross-Modal Information for Image Captioning. arXiv:Cs.CV, 1–7 (2020).
  27. Wei, Q., Wang, X., & Li, X.: Harvesting Deep Models for Cross-Lingual Image Annotation. In: Proceedings of CBMI, 1–5 (2017).
  28. Meetei, L. S., Singh, T. D., & Bandyopadhyay, S.: WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset. In: Proceedings of the 6th Workshop on Asian Translation, 181–188 (2019).
  29. Lan, W., Li, X., & Dong, J.: Fluency-Guided Cross-Lingual Image Captioning. arXiv:Cs.CL, 1–9 (2017).
  30. Miyazaki, T., & Shimizu, N.: Cross-Lingual Image Caption Generation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1780–1790 (2016).