MUSTER: Multimodal processing of Spatial and TEmporal expRessions: Toward Understanding Space and Time in Language Enhanced by Vision.
MUSTER is a fundamental pilot research project that introduces a new multi-modal framework for the machine-readable representation of meaning. The focus of MUSTER lies on exploiting visual and perceptual input, in the form of images and videos, coupled with the textual modality to build structured multi-modal semantic representations for the recognition of objects and actions and of their spatial and temporal relations. The MUSTER project will investigate whether such novel multi-modal representations improve the performance of automated understanding of human language. MUSTER starts from the current state-of-the-art approach to language representation learning, known as text embeddings, but introduces the visual modality to provide the contextual world knowledge that text-only models lack and that humans draw on when understanding language. MUSTER will propose a new pilot framework for joint representation learning from text and vision data, tailored for spatial and temporal language processing. The constructed framework will be evaluated on a series of semantic tasks that closely mimic the processes of human language acquisition and understanding.
MUSTER will rely on recent advances in multiple research disciplines, spanning natural language processing, computer vision, machine learning, representation learning, and human language technologies, which together contribute to building structured machine-readable multi-modal representations of spatial and temporal language phenomena.
Within this framework, MUSTER will focus on building semantic representations of nouns and verbs based on distributional information from textual corpora and on implicit knowledge encoded in knowledge bases. It will also integrate those models with information based on visual features. Finally, it will evaluate the new combined representation models on semantic tasks such as semantic textual similarity, disambiguation, or spatial role labeling.
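As a rough illustration of what combining a text-based and a vision-based representation can look like, the sketch below fuses the two by weighted concatenation of length-normalized vectors and compares words by cosine similarity. This is one simple, commonly used fusion strategy and is not taken from the project description; the function names, the weighting parameter `alpha`, and all vector values are invented for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse(text_vec, visual_vec, alpha=0.5):
    """Weighted concatenation of the normalized text and visual vectors.

    alpha is a hypothetical mixing weight: 1.0 keeps only the text part,
    0.0 keeps only the visual part.
    """
    text_part = [alpha * x for x in l2_normalize(text_vec)]
    visual_part = [(1.0 - alpha) * x for x in l2_normalize(visual_vec)]
    return text_part + visual_part


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example (not real embeddings or image features).
TEXT_VECTORS = {
    "cup": [0.9, 0.1, 0.0],
    "mug": [0.8, 0.2, 0.1],
    "run": [0.0, 0.1, 0.9],
}
VISUAL_VECTORS = {
    "cup": [0.7, 0.3],
    "mug": [0.6, 0.4],
    "run": [0.1, 0.9],
}


def multimodal_similarity(word_a, word_b, alpha=0.5):
    """Cosine similarity between the fused representations of two words."""
    fused_a = fuse(TEXT_VECTORS[word_a], VISUAL_VECTORS[word_a], alpha)
    fused_b = fuse(TEXT_VECTORS[word_b], VISUAL_VECTORS[word_b], alpha)
    return cosine(fused_a, fused_b)


if __name__ == "__main__":
    print("cup vs mug:", multimodal_similarity("cup", "mug"))
    print("cup vs run:", multimodal_similarity("cup", "run"))
```

In an evaluation of the kind mentioned above, scores like these would be compared against human similarity judgments; a jointly learned representation would replace the fixed concatenation used in this sketch.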