Envisioning Answers: Unleashing Deep Learning for Visual Question Answering in Artistic Images

Document Type: Research Article

Authors

1 Deep Learning Research Lab, Department of Computer Engineering, Faculty of Engineering, College of Farabi, University of Tehran, Iran

2 Department of Computer Engineering, Faculty of Engineering, College of Farabi, University of Tehran, Iran

Abstract

In specialized domains, answering visual questions accurately is crucial for practical applications. This study improves a visual question-answering (VQA) model for artistic images using a dataset that contains both visual and knowledge-based questions. A pre-trained BERT model first classifies the type of each question; visual questions are then answered by the iQAN model equipped with the MLB and MUTAN fusion mechanisms, while knowledge-based questions are handled by an XLNet-based model. The approach reaches 78.92% accuracy on visual questions and 47.71% on knowledge-based questions, for an overall accuracy of 55.88% when the two branches are combined. The study also examines how parameters such as the number of glances and the choice of activation function affect the model's performance.
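The pipeline described above amounts to a question-type router placed in front of two answer branches. The PyTorch sketch below is an illustrative reconstruction, not the authors' released code: the feature dimensions, answer vocabulary size, dummy router output, and the linear stand-in for the XLNet branch are all assumptions; only the MLB-style Hadamard fusion and the visual/knowledge routing follow the abstract.

import torch
import torch.nn as nn

# MLB-style low-rank bilinear fusion [15]: each modality is projected to a
# shared space and the two are merged with a Hadamard (elementwise) product.
class MLBFusion(nn.Module):
    def __init__(self, img_dim=2048, q_dim=768, hidden=1200, num_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.head = nn.Linear(hidden, num_answers)

    def forward(self, img_feat, q_feat):
        # tanh on each projection, as in the original MLB formulation; the
        # article also studies the effect of swapping this activation.
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.head(fused)

def route_and_answer(router_logits, img_feat, q_feat, visual_branch, knowledge_branch):
    # router_logits stands in for the fine-tuned BERT question-type
    # classifier; class 0 = visual question, class 1 = knowledge-based.
    if router_logits.argmax(dim=-1).item() == 0:
        return visual_branch(img_feat, q_feat)
    return knowledge_branch(q_feat)  # placeholder for the XLNet branch

# Toy usage: random tensors stand in for ResNet image features and BERT
# question embeddings; all shapes and sizes here are illustrative only.
visual_branch = MLBFusion()
knowledge_branch = nn.Linear(768, 1000)
img_feat, q_feat = torch.randn(1, 2048), torch.randn(1, 768)
router_logits = torch.tensor([[0.9, 0.1]])  # pretend BERT judged it visual
logits = route_and_answer(router_logits, img_feat, q_feat, visual_branch, knowledge_branch)
print(logits.shape)  # torch.Size([1, 1000])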

References
[1] Z. Falomir et al., "Categorizing paintings in art styles based on qualitative color descriptors, quantitative global features and machine learning (QArt-Learn)," Expert Systems with Applications, vol. 97, pp. 83–94, 2018.
[2] Y. Deng et al., "Exploring the representativity of art paintings," IEEE Transactions on Multimedia, vol. 23, pp. 2794–2805, 2020.
[3] D. Ma et al., "From part to whole: Who is behind the painting?," in Proceedings of the 25th ACM International Conference on Multimedia, 2017.
[4] C. S. Rodriguez, M. Lech, and E. Pirogova, "Classification of style in fine-art paintings using transfer learning and weighted image patches," in 2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS), IEEE, 2018.
[5] R. Huang, "Research on classification and retrieval of digital art graphics based on hollow convolution neural network," in 2022 International Conference on Artificial Intelligence and Autonomous Robot Systems (AIARS), IEEE, 2022.
[6] N. García, B. Renoust, and Y. Nakashima, "Context-aware embeddings for automatic art analysis," 2019.
[7] N. Huckle, N. García, and Y. Nakashima, "Demographic influences on contemporary art with unsupervised style embeddings," arXiv preprint arXiv:2009.14545, 2020.
[8] S. Matsuo and K. Yanai, "CNN-based style vector for style image retrieval," in Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016.
[9] D. Liu and H. Yao, "Artistic image synthesis with tag-guided correlation matching," Multimedia Tools and Applications, pp. 1–12, 2023.
[10] J. Fumanal-Idocin et al., "ARTxAI: Explainable artificial intelligence curates deep representation learning for artistic images using fuzzy techniques," arXiv preprint arXiv:2308.15284, 2023.
[11] N. García and G. Vogiatzis, "How to read paintings: Semantic art understanding with multi-modal retrieval," 2018.
[12] N. García et al., "A dataset and baselines for visual question answering on art," arXiv preprint arXiv:2008.12520, 2020.
[13] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, and X. Wang, "Visual question generation as dual task of visual question answering," arXiv preprint arXiv:1709.07192, 2017.
[14] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," arXiv preprint arXiv:1906.08237, 2019.
[15] J.-H. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B.-T. Zhang, "Hadamard product for low-rank bilinear pooling," arXiv preprint arXiv:1610.04325, 2016.
[16] H. Ben-Younes, R. Cadène, M. Cord, and N. Thome, "MUTAN: Multimodal Tucker fusion for visual question answering," arXiv preprint arXiv:1705.06676, 2017.
[17] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, "ABC-CNN: An attention based convolutional neural network for visual question answering," arXiv preprint arXiv:1511.05960, 2015.
[18] P. Wang, Q. Wu, C. Shen, A. R. Dick, and A. van den Hengel, "FVQA: Fact-based visual question answering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 2413–2427, 2018.
[19] K. J. Shih, S. Singh, and D. Hoiem, "Where to look: Focus regions for visual question answering," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4613–4621, 2016.
[20] I. Ilievski, S. Yan, and J. Feng, "A focused dynamic attention model for visual question answering," arXiv preprint arXiv:1604.01485, 2016.
[21] H. Xu and K. Saenko, "Dual attention network for visual question answering," 2017.
[22] C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma, "Structured attentions for visual question answering," in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1300–1309, 2017.
[23] Q. Li, Q. Tao, S. R. Joty, J. Cai, and J. Luo, "VQA-E: Explaining, elaborating, and enhancing your answers for visual questions," arXiv preprint arXiv:1803.07464, 2018.
[24] C. Wu, J. Liu, X. Wang, and R. Li, "Differential networks for visual question answering," in AAAI, 2019.
[25] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–1149, 2015.
[26] B. N. Patro and V. P. Namboodiri, "Differential attention for visual question answering," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7680–7688, 2018.
[27] J.-B. Alayrac et al., "Flamingo: A visual language model for few-shot learning," in Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[30] N. Srimaneekarn et al., "Binary response analysis using logistic regression in dentistry," International Journal of Dentistry, vol. 2022, 2022.
[31] M. Grootendorst, "BERTopic: Neural topic modeling with a class-based TF-IDF procedure," arXiv preprint arXiv:2203.05794, 2022.
[32] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," arXiv preprint arXiv:1901.02860, 2019.
[33] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, "Making the V in VQA matter: Elevating the role of image understanding in visual question answering," arXiv preprint arXiv:1612.00837, 2016.
[34] A. Singh et al., "MMF: A multimodal framework for vision and language research," 2020.
[35] M. Heilman and N. A. Smith, "Good question! Statistical ranking for question generation," pp. 609–617, Jun. 2010.
[36] X. Du, J. Shao, and C. Cardie, "Learning to ask: Neural question generation for reading comprehension," arXiv preprint arXiv:1705.00106, 2017.
[37] W. Malfliet, "The tanh method: A tool for solving certain classes of nonlinear evolution and wave equations," Journal of Computational and Applied Mathematics, vol. 164–165, pp. 529–541, 2004.
[38] A. F. Agarap, "Deep learning using rectified linear units (ReLU)," arXiv preprint arXiv:1803.08375, 2018.
[39] B. Ghojogh and A. Ghodsi, "Recurrent neural networks and long short-term memory networks: Tutorial and survey," arXiv preprint arXiv:2304.11461, 2023.
[40] J.-H. Kim, J. Jun, and B.-T. Zhang, "Bilinear attention networks," arXiv preprint arXiv:1805.07932, 2018.