Envisioning Answers: Unleashing Deep Learning for Visual Question Answering in Artistic Images

Document Type: Research Article

Authors

1 M.Sc. Graduate, Deep Learning Research Lab, Department of Computer Engineering, Faculty of Engineering, College of Farabi, University of Tehran, Iran

2 Faculty Member, Department of Computer Engineering, Faculty of Engineering, College of Farabi, University of Tehran, Iran

3 B.Sc. Student, Deep Learning Research Lab, Department of Computer Engineering, Faculty of Engineering, College of Farabi, University of Tehran, Iran

Abstract

In specialized domains, accurately answering visual questions is crucial for practical applications. This study improves a visual question answering (VQA) model for artistic images by using a dataset that contains both visual and knowledge-based questions. The approach employs a pre-trained BERT model to classify the nature of each question, routes visual questions to an iQAN model with MLB and MUTAN fusion mechanisms, and handles knowledge-based questions with an XLNet-based model. The system achieves 78.92% accuracy on visual questions, 47.71% on knowledge-based questions, and 55.88% overall when the two branches are combined. The study also examines how parameters such as the number of glances and the choice of activation function affect model performance.
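
To make the two-branch design concrete, the following is a minimal sketch, not the paper's exact implementation, of the routing and fusion steps described above. It assumes pre-extracted image features of dimension 2048 (e.g., from a ResNet backbone) and 768-dimensional BERT question embeddings; the hidden size, answer-vocabulary size, and the linear stand-ins for the BERT question-type classifier and the XLNet knowledge branch are illustrative placeholders.

    import torch
    import torch.nn as nn

    class MLBFusion(nn.Module):
        # MLB-style fusion: both modalities are projected into a shared space
        # and combined with an element-wise product, a low-rank approximation
        # of full bilinear pooling.
        def __init__(self, img_dim=2048, q_dim=768, hidden_dim=1200, num_answers=1000):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, hidden_dim)
            self.q_proj = nn.Linear(q_dim, hidden_dim)
            self.classifier = nn.Linear(hidden_dim, num_answers)
            self.act = nn.Tanh()

        def forward(self, img_feat, q_feat):
            fused = self.act(self.img_proj(img_feat)) * self.act(self.q_proj(q_feat))
            return self.classifier(fused)  # logits over the answer vocabulary

    class TwoBranchVQA(nn.Module):
        # Routes each question to a visual branch (MLB-style fusion) or a
        # knowledge branch, based on a binary question-type classifier.
        def __init__(self, q_dim=768, num_answers=1000):
            super().__init__()
            self.type_classifier = nn.Linear(q_dim, 2)  # 0 = visual, 1 = knowledge
            self.visual_branch = MLBFusion(num_answers=num_answers)
            self.knowledge_branch = nn.Linear(q_dim, num_answers)  # stand-in for the XLNet branch

        def forward(self, img_feat, q_feat):
            is_knowledge = self.type_classifier(q_feat).argmax(dim=-1, keepdim=True)
            visual_logits = self.visual_branch(img_feat, q_feat)
            knowledge_logits = self.knowledge_branch(q_feat)
            # Select, per sample, the logits from the branch chosen by the classifier.
            return torch.where(is_knowledge.bool(), knowledge_logits, visual_logits)

    # Usage with random placeholder features:
    model = TwoBranchVQA()
    img = torch.randn(4, 2048)   # pre-extracted image features
    q = torch.randn(4, 768)      # BERT [CLS] embeddings of the questions
    logits = model(img, q)       # shape (4, 1000)

In the actual system, the question-type classifier is a fine-tuned BERT model and the knowledge branch is XLNet-based; both are replaced here by simple linear layers so the routing logic stays self-contained.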

