Using Deep Learning to Better Understand Vision and Language


Deep learning has enabled incredible advances in computer vision, natural language processing, and general image and video understanding. Recurrent neural networks have demonstrated the ability to generate text from visual stimuli, while generative image networks have shown a remarkable ability to create photorealistic images. To explore these methods, this talk is divided into two parts. First, we introduce a general-purpose Steered Gaussian Attention Model for video understanding. An attention-based hierarchical approach, combined with automatic boundary detection, delivers state-of-the-art results on popular video captioning datasets. In the second part of the talk, we discuss four modality transformations: visual to text, text to visual, visual to visual, and text to text. In addition to reviewing recent techniques, we introduce improvements in all four transformations. To conclude, we present results showing how generative methods can seamlessly integrate written and visual modalities in both directions.
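To give a flavor of the temporal-attention idea behind the talk, the sketch below shows a generic Gaussian attention over video frame features. This is an illustrative assumption on our part, not the talk's actual Steered Gaussian Attention Model: the function names, parameters (`center`, `width`), and feature shapes are all hypothetical.

```python
import numpy as np

def gaussian_attention_weights(num_frames, center, width):
    """Soft temporal attention: a Gaussian bump over frame indices,
    normalized so the weights sum to 1 (illustrative sketch only)."""
    t = np.arange(num_frames)
    w = np.exp(-0.5 * ((t - center) / width) ** 2)
    return w / w.sum()

def attend(frame_features, center, width):
    """Collapse per-frame feature vectors into one context vector
    via a weighted sum: (num_frames,) @ (num_frames, dim) -> (dim,)."""
    w = gaussian_attention_weights(len(frame_features), center, width)
    return w @ frame_features

# Toy example: 10 frames with 4-dimensional features,
# attention "steered" toward frame 4.
feats = np.random.rand(10, 4)
ctx = attend(feats, center=4.0, width=1.5)
```

In a full captioning model, a decoder would typically predict `center` and `width` at each step, steering the Gaussian to the temporal segment relevant to the next word; here they are fixed constants for clarity.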



  Location


  • Rochester Institute of Technology
  • One Lomb Memorial Drive
  • Rochester, New York
  • United States
  • Building: CIS, Bldg. 76
  • Room Number: 1st Floor Fishbowl Room

  • Co-sponsored by IS&T Rochester
  • Registration closed


  Speakers

Shagan Sah


Topic:

Using Deep Learning to Better Understand Vision and Language


Biography:

Shagan Sah is a Ph.D. candidate in the Center for Imaging Science at Rochester Institute of Technology. He obtained a Bachelor of Engineering degree from the University of Pune, India, followed by a Master of Science degree in Imaging Science from Rochester Institute of Technology (RIT), New York, USA, with the aid of an RIT Graduate Scholarship. He is interested in the intersection of machine learning, natural language, and computer vision. His current work primarily lies in applications of deep learning for image and video understanding. He has authored over 20 publications in various journals and conferences. He has worked at Xerox PARC as a Video Analytics Intern, at Motorola as a Camera Intern, and at Cisco Systems as a Software Engineer.