Graph-structured representations for VQA

This paper is one of the earliest works that leverages graph-structured representations for multimodal learning. It shows how to represent both the image and the question as graphs, and how to extract features from the two modalities using graph neural networks.

Graph Representations

In this section we look at how the images and the questions are converted into graphs that can be processed by graph neural networks.

Language

The input data for each training or test instance is a question and a parameterized description of the contents of the scene. The question is processed with the Stanford dependency parser, which yields the question graph: each word becomes a node, and the syntactic dependencies between words become the edges.
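Below is a minimal sketch of this construction. It uses spaCy's dependency parser purely as a stand-in for the Stanford parser used in the paper, and the returned node/edge representation is an illustrative assumption rather than the authors' exact data structure.

```python
# Sketch: build a question graph from a dependency parse.
# Assumption: spaCy replaces the Stanford parser; feature choices are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def question_graph(question: str):
    doc = nlp(question)
    nodes = [token.text for token in doc]            # one node per word
    edges = {}                                       # (head index, dependent index) -> relation type
    for token in doc:
        if token.head.i != token.i:                  # skip the root's self-reference
            edges[(token.head.i, token.i)] = token.dep_
    return nodes, edges

nodes, edges = question_graph("What is the cat doing?")
print(nodes)   # ['What', 'is', 'the', 'cat', 'doing', '?']
print(edges)   # e.g. {(4, 0): 'nsubj', ...} depending on the parse
```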

Vision

The abstract scenes dataset provides, for each image, the list of objects present in the scene together with their attributes (such as category, position, and size). These objects form the nodes of the scene graph; the graph is fully connected, and each edge encodes the relative spatial relationship between a pair of objects.
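A rough sketch of the corresponding scene-graph construction is given below; the particular object attributes and the relative-position edge features are assumptions for illustration, not the paper's exact feature choices.

```python
# Sketch: build a scene graph from the scene annotations.
# Assumption: attribute and edge-feature choices are illustrative.
import numpy as np

def scene_graph(objects):
    """objects: list of dicts with 'category', 'x', 'y', 'size' from the annotations."""
    node_feats = np.array([[o["category"], o["x"], o["y"], o["size"]] for o in objects],
                          dtype=np.float32)
    n = len(objects)
    edge_feats = np.zeros((n, n, 3), dtype=np.float32)    # fully connected graph
    for i in range(n):
        for j in range(n):
            dx = objects[j]["x"] - objects[i]["x"]
            dy = objects[j]["y"] - objects[i]["y"]
            edge_feats[i, j] = [dx, dy, np.hypot(dx, dy)]  # relative spatial relation
    return node_feats, edge_feats
```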

Processing on the graphs

Obtaining the feature representations from Graphical Structures

The feature vectors of all the nodes and edges are first projected to a common embedding space, using separate projection matrices for nodes and for edges.
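A minimal sketch of this projection step, assuming PyTorch; the dimensions and the ReLU non-linearity are illustrative choices rather than the paper's exact configuration.

```python
# Sketch: project node and edge features into a common space
# with separate matrices (PyTorch is an assumption).
import torch
import torch.nn as nn

class FeatureProjection(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden_dim):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, hidden_dim)  # projection matrix for nodes
        self.edge_proj = nn.Linear(edge_dim, hidden_dim)  # projection matrix for edges

    def forward(self, x, e):
        # x: (N, node_dim) node features, e: (N, N, edge_dim) edge features
        return torch.relu(self.node_proj(x)), torch.relu(self.edge_proj(e))
```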

The graphs for the image and the question are each processed by a recurrent neural network, a GRU here, after aggregating the contributions of the neighbouring nodes. For each node \(i\) of either graph, the process can be written mathematically as

\[h_i^0 = 0\] \[n_i = pool_j(e'_{ij} \cdot x'_j)\] \[h_i^t = \mathbf{GRU}(h_i^{t-1}, [x'_i; n_i])\]

where \([;]\) represents the concatenation of the vectors and \(\cdot\) represents the Hadamard product. Average pooling was found to work best in the paper’s implementation. The final feature for each node (an object for the visual modality, a word for the linguistic modality) is obtained by taking the output of the final GRU state. Mathematically,

\[x''_i = h_i^T\]
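The recurrent update above can be sketched as follows, assuming PyTorch; the number of propagation steps and the tensor shapes are illustrative assumptions.

```python
# Sketch: recurrent propagation over one graph, following the equations above.
# PyTorch (nn.GRUCell) is an assumption; shapes are illustrative.
import torch
import torch.nn as nn

class GraphGRU(nn.Module):
    def __init__(self, hidden_dim, num_steps=3):
        super().__init__()
        self.cell = nn.GRUCell(2 * hidden_dim, hidden_dim)  # input is [x'_i; n_i]
        self.num_steps = num_steps

    def forward(self, x, e):
        # x: (N, D) projected node features, e: (N, N, D) projected edge features
        n = (e * x.unsqueeze(0)).mean(dim=1)      # n_i = average pooling of e'_ij * x'_j over j
        h = torch.zeros_like(x)                   # h_i^0 = 0
        inp = torch.cat([x, n], dim=-1)           # [x'_i; n_i]
        for _ in range(self.num_steps):
            h = self.cell(inp, h)                 # h_i^t = GRU(h_i^{t-1}, [x'_i; n_i])
        return h                                  # x''_i = h_i^T
```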

Fusion through Cross-Attention and Classification

The fusion can be divided into three parts: computing cross-attention weights between question words and scene objects, forming attention-weighted joint embeddings of the word–object pairs, and aggregating these embeddings into answer scores:

\[a_{ij} = \sigma\left(W\left(\frac{x^{'Q}_i}{\|x^{'Q}_i\|} \cdot \frac{x^{'S}_j}{\|x^{'S}_j\|}\right) + b\right)\] \[y_{ij} = a_{ij}\,[x^{''Q}_i; x^{''S}_j]\] \[y'_i = \mathbf{ReLU}\left(W_1 \displaystyle\sum_{j=1}^{N^S} y_{ij} + b_1\right)\] \[y'' = \mathbf{SoftMax}\left(W_2 \displaystyle\sum_{i=1}^{N^Q} y'_i + b_2\right)\]

\(y''\) is the vector of predicted scores over the candidate answers.
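A sketch of this fusion and classification stage is given below; PyTorch and the shapes are assumptions, and the attention-weighted concatenation follows the equations above.

```python
# Sketch: cross-attention fusion of question and scene features, then classification.
# PyTorch is an assumption; dimension names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fusion(nn.Module):
    def __init__(self, hidden_dim, num_answers):
        super().__init__()
        self.att = nn.Linear(hidden_dim, 1)               # W, b for the attention scores a_ij
        self.fc1 = nn.Linear(2 * hidden_dim, hidden_dim)  # W_1, b_1
        self.fc2 = nn.Linear(hidden_dim, num_answers)     # W_2, b_2

    def forward(self, xq1, xs1, xq2, xs2):
        # xq1/xs1: pre-GRU features x'^Q (N_Q, D) and x'^S (N_S, D)
        # xq2/xs2: post-GRU features x''^Q (N_Q, D) and x''^S (N_S, D)
        q = F.normalize(xq1, dim=-1).unsqueeze(1)          # (N_Q, 1, D)
        s = F.normalize(xs1, dim=-1).unsqueeze(0)          # (1, N_S, D)
        a = torch.sigmoid(self.att(q * s))                 # a_ij, shape (N_Q, N_S, 1)
        y = torch.cat([xq2.unsqueeze(1).expand(-1, xs2.size(0), -1),
                       xs2.unsqueeze(0).expand(xq2.size(0), -1, -1)], dim=-1)
        y = a * y                                          # y_ij = a_ij [x''^Q_i; x''^S_j]
        yi = torch.relu(self.fc1(y.sum(dim=1)))            # sum over scene objects j
        return F.softmax(self.fc2(yi.sum(dim=0)), dim=-1)  # scores over candidate answers
```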

Experiments, Ablations and Results

The authors evaluate their model on the original abstract scenes dataset for visual question answering and on its balanced extension. The full results, ablations, and qualitative examples can be found in the paper [1].

References

  1. Teney, D., Liu, L., & van den Hengel, A. (2016). Graph-Structured Representations for Visual Question Answering. arXiv preprint arXiv:1609.05600.