Mr. Shagan Sah - Multi-Modal Deep Learning to Understand Vision and Language

Oak Ridge, TN

December 12, 2018

Mr. Shagan Sah
PhD Candidate
Center for Imaging Science
Rochester Institute of Technology

Time:  9:00 a.m.
Location:  Building 5700, Conference Room E104


Developing intelligent agents that can perceive and understand the rich visual world around us has been a long-standing goal in the field of artificial intelligence.  In the last few years, significant progress has been made towards this goal, and recent advances in general visual and language understanding have been attributed to deep learning.  To survey these methods, this talk is divided into two broad parts.  First, we introduce a general-purpose attention mechanism that uses a continuous function for video understanding.  Combining an attention-based hierarchical approach with automatic boundary detection advances state-of-the-art video captioning results.  We also develop techniques for summarizing and annotating long videos.  Second, we introduce architectures and training techniques that produce a common connection space in which natural language sentences are connected with visual modalities.  In this connection space, similar concepts lie close together, while dissimilar concepts lie far apart, irrespective of their modality.  We discuss four modality transformations:  visual to text, text to visual, visual to visual, and text to text.  We introduce a novel attention mechanism to align multi-modal embeddings, which are learned through a multi-modal metric loss function.  The models are shown to advance the state of the art on tasks that require joint processing of images and natural language, including cross-modal retrieval and zero-shot learning.


Shagan Sah is a Ph.D. candidate in the Center for Imaging Science at Rochester Institute of Technology (RIT).  He obtained a Bachelor of Engineering degree from the University of Pune, India, and a Master of Science degree in Imaging Science from RIT.  His current work focuses primarily on developing artificial intelligence applications for image and video understanding.  He has authored over 25 publications in peer-reviewed journals and conferences.  He has interned in research labs at Xerox-PARC, Motorola, and NVIDIA.  Shagan has won numerous awards, including the RIT Graduate Scholarship, the MathWorks Best Paper Award, and the International Association for Pattern Recognition Travel Grant.