Image and Video Understanding

We combine computer vision, natural language processing, and machine learning to enable computers to understand images and videos, rather than merely operating at the pixel level. We are currently working on automatic video description generation, which has many potential applications. Generated descriptions capture the objects appearing in a video along with their attributes, locations, actions, and relations to other objects, and so serve as a machine-readable account of its content. With these descriptions, we can automatically summarize and query the contents of videos, and we can help visually impaired people follow video content more easily.
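As a minimal illustration of querying a video through its generated descriptions, the sketch below searches hypothetical per-segment captions for a keyword; the captions, timestamps, and function name are illustrative assumptions, not output of our actual system.

```python
# Minimal sketch of querying a video via its generated descriptions.
# The captions and timestamps below are hypothetical stand-ins for
# what a video description model might produce.

def query_video(descriptions, keyword):
    """Return (start, end) segments whose caption mentions the keyword."""
    return [(start, end) for start, end, caption in descriptions
            if keyword.lower() in caption.lower()]

# Hypothetical per-segment captions with start/end times in seconds.
descriptions = [
    (0.0, 4.2, "A man opens a door and enters a kitchen"),
    (4.2, 9.8, "He places a pan on the stove"),
    (9.8, 15.0, "A dog runs into the kitchen"),
]

print(query_video(descriptions, "dog"))  # [(9.8, 15.0)]
```

The same index of captions supports summarization (concatenating or compressing the captions) as well as retrieval.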

In addition to automatic video description generation, we also have plans for other tasks such as:

  • Video entity linking
  • Image/video-based question answering
  • Video summarization
  • Image/video querying

Data to Text

Our research aims to establish a generic framework for automatically producing natural language descriptions of numerical data, such as stock market statistics. This framework will be an essential component of automated systems that comprehend real-world activities, which we observe in the form of numerical data.

Applications of our framework include:

  • Automatic generation of market summaries
  • Automatic generation of weather reports
  • Live broadcasting
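A minimal template-based sketch of this kind of data-to-text generation is shown below, turning a day's stock statistics into a one-sentence summary. The input fields, thresholds, and wording are illustrative assumptions, not the actual framework.

```python
# Template-based data-to-text sketch: one-sentence market summary
# from open/close prices. Thresholds and phrasing are illustrative.

def summarize_quote(name, open_price, close_price):
    change = close_price - open_price
    pct = 100.0 * change / open_price
    if abs(pct) < 0.1:
        trend = "closed nearly flat"
    elif change > 0:
        trend = f"rose {pct:.1f}%"
    else:
        trend = f"fell {abs(pct):.1f}%"
    return f"{name} {trend}, ending the day at {close_price:.2f}."

print(summarize_quote("Nikkei 225", 27000.0, 27405.0))
# Nikkei 225 rose 1.5%, ending the day at 27405.00.
```

A generic framework replaces such hand-written templates with components learned from paired data and text, but the input-to-text mapping it must realize is the same.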

Semantic Parsing

Our objective is to create human-machine interfaces that let people communicate with machines in plain language. That is, we want to create programs by simply writing sentences in English, to query large databases by speaking to our smartphones, or to check whether a statement is true or false according to Wikipedia. To tackle these challenges, we use structured approaches to natural language analysis, including tree transducers and graph grammars, to better model the complexity of language meaning, and we leverage state-of-the-art machine learning techniques for data-driven solutions.
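To make the idea of mapping language to an executable meaning representation concrete, here is a toy rule-based sketch: a single regular-expression rule maps an English question to a function that executes against a tiny database. The grammar, database, and names are illustrative assumptions; real systems use tree transducers, graph grammars, and learned models rather than one hand-written pattern.

```python
# Toy semantic parser: map an English question to an executable
# meaning representation (here, a zero-argument lookup function).
# The single-pattern grammar and the database are illustrative only.

import re

CAPITALS = {"france": "Paris", "japan": "Tokyo"}  # toy database

def parse(question):
    """Return an executable representation of the question."""
    m = re.match(r"what is the capital of (\w+)\??", question.lower())
    if m:
        country = m.group(1)
        return lambda: CAPITALS.get(country, "unknown")
    raise ValueError("question not covered by the toy grammar")

print(parse("What is the capital of Japan?")())  # Tokyo
```

Separating parsing (producing the representation) from execution (running it against a database) is the key design point: the same representation could be executed against Wikipedia, a relational database, or a program interpreter.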

Some of the applications and tools our work touches on include:

  • Executable meaning representations
  • Semantic parsing
  • Deep learning
  • Question answering
  • Textual entailment

Visual Grammar

Visual objects are like words in a natural language, which need a system of grammar to be composed into meaningful sentences. We believe the visual world also has such a system of grammar, and we are exploring it as a new research theme.

We propose a 4-tier framework to realize a computational system of visual grammar:

  • Geometrical grammar: Given a 2D blank rectangular frame, are there locations in this frame that are perceived as more important than others?
  • Armature-based grammar: Are there patterns of composing objects in pictures?
  • Perspectival grammar: What makes a picture look real (i.e., logical presentation of the physical world)?
  • Contextual/Semantic grammar: The aesthetic value of images, semantic contradictions between objects, context-dependent arrangement and semantics, etc.