Abstract

Image captioning is the automatic generation of text describing the content observed in an image. In this paper we review the literature, design a framework, and build an application model. We organize image captioning research into four categories: input model, process model, output model, and lingual image caption. The input model is classified by caption criteria, method, and dataset. The process model is classified by type of learning, encoder-decoder architecture, image extractor, and evaluation metric. The output model is classified by architecture, feature extraction, feature mapping, model, and number of captions. The lingual image caption category is based on the language model and falls into two groups: bilingual image captioning and cross-lingual image captioning. We then design a framework comprising three framework models and build an application comprising three application models. Finally, we offer opinions on trends and future research in image caption generation; in particular, image captioning can be developed further by comparing computer vision against human vision.
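The process model above centers on the encoder-decoder pipeline: a CNN image extractor encodes the image into a feature vector, and a recurrent language model decodes it into a caption. As a purely illustrative sketch (our assumption, not a model from any of the surveyed papers), the following PyTorch code shows that shape of architecture; the class names, ResNet-50 backbone, and layer sizes are hypothetical choices.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNEncoder(nn.Module):
    """Image extractor: a CNN whose pooled features are projected
    into the decoder's embedding space."""
    def __init__(self, embed_size: int):
        super().__init__()
        # weights=None keeps the sketch offline; in practice a
        # pretrained backbone (e.g. ImageNet weights) is used.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):                       # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)          # (B, 2048)
        return self.fc(feats)                        # (B, embed_size)

class RNNDecoder(nn.Module):
    """Language model: an LSTM that predicts the caption token by token,
    conditioned on the image feature as its first input step."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feats, captions):        # captions: (B, T) token ids
        inputs = torch.cat([image_feats.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)                # (B, T+1, hidden_size)
        return self.fc(hidden)                       # logits over the vocabulary

# Toy forward pass: 2 images, 5-token captions, a 100-word vocabulary.
encoder, decoder = CNNEncoder(256), RNNDecoder(256, 512, vocab_size=100)
logits = decoder(encoder(torch.randn(2, 3, 224, 224)), torch.randint(0, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 100])
```

For the evaluation-metric part of the process model, captioning work commonly reports n-gram overlap scores such as BLEU. A minimal sentence-level check with NLTK (the tokenized captions are toy data, and `nltk` is assumed installed):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "on", "the", "beach"]]        # ground-truth caption(s)
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]  # generated caption

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
print(f"BLEU: {sentence_bleu(references, candidate, smoothing_function=smooth):.3f}")
```

Published results usually report corpus-level BLEU-1 through BLEU-4, often alongside METEOR, ROUGE-L, and CIDEr, rather than a single sentence pair.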

Article Details

How to Cite
Adriyendi, A. (2021). A Rapid Review of Image Captioning. Journal of Information Technology and Computer Science, 6(2), 158–169. https://doi.org/10.25126/jitecs.202162316
