This work employs the question-answering capabilities of End-to-End Module Networks (Hu et al. [1]). Language guides the generation of a neural architecture that, when combined with visual embeddings, maximizes the likelihood of answering the question correctly.
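The idea is easiest to see in code. Below is a minimal sketch, not the authors' implementation: a hypothetical Find module maps image features and a question embedding to a spatial attention map, and a hypothetical Describe module pools the attended features into answer logits. The text dimension (512) and answer vocabulary size (1000) are illustrative assumptions; the feature shape matches the (14, 14, 2048) maps described below.

```python
import torch
import torch.nn as nn

class Find(nn.Module):
    """Image features + question embedding -> (N, 1, 14, 14) attention map."""
    def __init__(self, img_dim=2048, txt_dim=512, hidden=512):
        super().__init__()
        self.img_proj = nn.Conv2d(img_dim, hidden, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, img_feats, txt):  # img_feats: (N, 2048, 14, 14), txt: (N, 512)
        joint = self.img_proj(img_feats) * self.txt_proj(txt)[:, :, None, None]
        scores = self.score(joint)                       # (N, 1, 14, 14)
        n, _, h, w = scores.shape
        # Softmax over the 14x14 spatial grid yields a normalized attention map.
        return torch.softmax(scores.view(n, -1), dim=1).view(n, 1, h, w)

class Describe(nn.Module):
    """Attention map + image features + question embedding -> answer logits."""
    def __init__(self, img_dim=2048, txt_dim=512, n_answers=1000):
        super().__init__()
        self.fc = nn.Linear(img_dim + txt_dim, n_answers)

    def forward(self, att, img_feats, txt):
        attended = (att * img_feats).sum(dim=(2, 3))     # (N, 2048) pooled features
        return self.fc(torch.cat([attended, txt], dim=1))
```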
-
A seq2seq architecture translates open-ended questions into a sequence of available modules (Age, Gender, Emotion, Find, Transform, Locate, And, Describe, etc.) whose in-order traversal represents the hierarchical relations between modules. Each of the 25,050 unique questions generates a hierarchical module network. Some modules receive visual and language features, while others receive attention maps.
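A minimal sketch of turning a predicted module sequence into an executed network follows. It assumes a post-order (reverse Polish) linearization, which Hu et al. [1] use and which makes tree assembly unambiguous; the arity table and module bindings are illustrative.

```python
# Illustrative arities: how many child outputs each module consumes.
ARITY = {"Find": 0, "Locate": 0, "Transform": 1, "And": 2,
         "Describe": 1, "Age": 1, "Gender": 1, "Emotion": 1}

def execute(layout, modules, img_feats, txt):
    """Run a linearized module layout with a stack.

    layout:  post-order list of module names, e.g. ["Find", "Describe"].
    modules: dict mapping names to callables taking
             (child_outputs, img_feats, txt) and returning a tensor.
    """
    stack = []
    for name in layout:
        # Pop this module's children (in left-to-right order) off the stack.
        args = [stack.pop() for _ in range(ARITY[name])][::-1]
        stack.append(modules[name](args, img_feats, txt))
    assert len(stack) == 1, "layout must reduce to a single root output"
    return stack[0]  # answer logits from the root module (e.g. Describe)
```

For example, the question "What is the person's age?" might be translated into the layout ["Find", "Age"]: Find attends to the person, and Age maps the attended region to an answer.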
-
The res5c layer of a ResNet-152 pretrained on ImageNet produces embedding vectors of shape (1, 14, 14, 2048).
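A minimal sketch of this extraction with torchvision: res5c in the original Caffe model corresponds to the output of layer4 in torchvision's ResNet-152, and a 448x448 input yields the 14x14 grid. The image path is a placeholder.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.eval()

# Keep everything up to and including layer4; drop avgpool and fc.
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])

preprocess = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # placeholder input path
with torch.no_grad():
    feats = feature_extractor(preprocess(image).unsqueeze(0))  # (1, 2048, 14, 14)
feats = feats.permute(0, 2, 3, 1)  # (1, 14, 14, 2048), matching the text
```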
[1] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, K. Saenko. Learning to Reason: End-to-End Module Networks for Visual Question Answering. arXiv preprint arXiv:1704.05526, 2017.
@article{hu2017learning,
title={Learning to Reason: End-to-End Module Networks for Visual Question Answering},
author={Hu, Ronghang and Andreas, Jacob and Rohrbach, Marcus and Darrell, Trevor and Saenko, Kate},
journal={arXiv preprint arXiv:1704.05526},
year={2017}
}



