CVPR 2019 Tutorial on
Action Classification and Video Modelling

Held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019.

Description of the tutorial and its relevance

In recent years, Deep Learning has been a major force of change in most Computer Vision and Machine Learning tasks. In video analysis problems, such as action recognition and detection, motion analysis and tracking, progress has arguably been slower, with shallow models remaining surprisingly competitive. Recent developments (Tran et al., 2014), (Feichtenhofer et al., 2016), (Carreira et al., 2017), (Tran et al., 2015), (Bilen et al., 2017), (Ghodrati et al., 2018), (Feichtenhofer et al., 2018), (Simonyan et al., 2014), (Kalchbrenner et al., 2017) have demonstrated that careful model design and end-to-end training, together with large and well-annotated datasets, have finally led to strong results with deep architectures for video analysis. However, the details and the secrets to achieving good accuracy with deep models are not always transparent. Furthermore, it is not always clear whether the networks resulting from end-to-end training truly provide better video models, or whether they simply overfit their large capacity to the idiosyncrasies of each dataset.

This tutorial aims to answer these questions. Specifically, the core topics to be explored are:

- What are the state-of-the-art video representations and action classification models and how does one train them?
- What constitutes a strong video representation?
- Should short and long videos be treated equally when training action classifiers?
- Are shallow models still relevant for state-of-the-art classification?
- How can an action classification system be trained in an unsupervised manner when supervised labels are not enough?
- Relevant benchmarks and challenges

This tutorial is organized by experts on action classification and video representation learning: a) Dr. E. Gavves, Assistant Professor at the University of Amsterdam, The Netherlands, b) J. Carreira, Research Scientist at DeepMind, UK, c) Dr. B. Fernando, Research Scientist at A*STAR, Singapore, d) Dr. C. Feichtenhofer, Senior Researcher at Facebook AI Research (FAIR), USA, and e) Dr. L. Torresani, Associate Professor at Dartmouth College and Research Scientist at Facebook AI Research (FAIR), USA.

Topics

The tutorial focuses on the following topics:

- Deep learning for action classification and optical flow. We discuss the latest deep networks for action classification, including C3D (Tran et al., 2014), I3D (Carreira et al., 2017), Two-Stream models (Simonyan et al., 2014), Two-Stream Fusion (Feichtenhofer et al., 2016), and Dynamic Images (Bilen et al., 2016).
- Deep networks for video modeling. We discuss and analyze various options for modeling videos, including TSN (Wang et al., 2018), spatiotemporal convolutions (Tran et al., 2015) and factorized spatiotemporal convolutions (Tran et al., 2018) (see the (2+1)D sketch after this list), Time-Aligned DenseNets (Ghodrati et al., 2018), and Dynamic Image Networks (Bilen et al., 2017).
- Deep spatiotemporal models beyond classification. While video models in the Computer Vision community have been designed primarily for action classification, their applicability extends to video generative models (Kalchbrenner et al., 2017), video compression (Wu et al., 2018), visualization (Feichtenhofer et al., 2018), velocity estimation (Kampelmuhler et al., 2018), tracking (Tao et al., 2016), spatiotemporal object detection (Feichtenhofer et al., 2017), and future video prediction (Ghodrati et al., 2018).
- Unsupervised video representation learning. As in the still-image domain, it has been customary to fine-tune pretrained video models on the target dataset. However, this is not always optimal, due to large gaps between the source and the target domain (Shkodrani et al., 2018) or because of unconventional architectures (Ghodrati et al., 2018). More importantly, while supervised learning certainly results in highly accurate models, it does not take advantage of the plethora of unlabelled video available. We discuss alternatives for training video representation models in an unsupervised or self-supervised manner, including arrow-of-time prediction (Wei et al., 2018) (a minimal sketch follows this list), audio-video synchronization, and odd-one-out models (Fernando et al., 2017).
- Long-term video understanding. The majority of action classification and video representation systems focus on rather short video sequences, typically no more than 10 seconds long. However, applications often require processing much longer videos, or even streaming video. We discuss models that are specifically designed to handle long videos and capture the spatiotemporal intricacies involved, such as Timeception (Hussein et al., 2019) and VideoGraph (Hussein et al., 2019).
- Large-scale video processing and evaluation. Careful evaluation of action classifiers and video representations is crucial for developing the next generation of models. Interestingly, while current benchmarks measure the accuracy of action classifiers well, it is not always clear how to evaluate how well temporal models capture the sequence itself. We discuss various benchmarks and frameworks for evaluating action classification, as well as for evaluating video representations directly.
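To make the factorized spatiotemporal convolutions mentioned above concrete, here is a minimal PyTorch-style sketch of a (2+1)D block: a full 3D convolution is split into a 2D spatial convolution followed by a 1D temporal convolution. The class name, channel sizes, and hidden width are illustrative assumptions, not the exact configuration from Tran et al. (2018).

```python
# Minimal sketch of a factorized (2+1)D convolution block (cf. Tran et al., 2018).
# Channel sizes and the mid-channel width below are illustrative assumptions.
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Replaces a full 3D conv (t x k x k) with a spatial (1 x k x k) conv
    followed by a temporal (t x 1 x 1) conv, with a nonlinearity in between."""
    def __init__(self, in_channels, out_channels, mid_channels, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, k, k), padding=(0, k // 2, k // 2))
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(t, 1, 1), padding=(t // 2, 0, 0))

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Usage: a batch of two clips, each 8 RGB frames at 112x112 resolution.
clip = torch.randn(2, 3, 8, 112, 112)
block = Conv2Plus1D(in_channels=3, out_channels=64, mid_channels=45)
print(block(clip).shape)  # torch.Size([2, 64, 8, 112, 112])
```

The extra nonlinearity between the spatial and temporal convolutions is one of the reasons the factorized form can outperform a single 3D convolution with a comparable parameter budget.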
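Similarly, a minimal sketch of arrow-of-time self-supervision: labels come for free by reversing clips along the time axis and training a classifier to tell forward from backward playback. The toy encoder and helper function are assumptions for illustration, not the architecture of Wei et al. (2018).

```python
# Minimal sketch of arrow-of-time self-supervision (cf. Wei et al., 2018).
# The encoder is a stand-in; any spatiotemporal backbone could be plugged in.
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # toy clip encoder, illustrative only
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, 2),                         # logits: forward vs. backward
)
optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def arrow_of_time_batch(clips):
    """clips: (batch, channels, time, height, width) unlabelled video clips.
    Returns a mixed batch of forward/reversed clips and the free labels."""
    reversed_clips = torch.flip(clips, dims=[2])      # reverse the time axis
    inputs = torch.cat([clips, reversed_clips], dim=0)
    labels = torch.cat([torch.zeros(len(clips), dtype=torch.long),
                        torch.ones(len(clips), dtype=torch.long)])
    return inputs, labels

# One illustrative training step on random tensors standing in for real clips.
clips = torch.randn(4, 3, 8, 64, 64)
inputs, labels = arrow_of_time_batch(clips)
loss = criterion(encoder(inputs), labels)
loss.backward()
optimizer.step()
print(float(loss))
```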

Program

Date: June 16, 2019

Time Event Speaker
13.00-13.25 Revisiting Spatiotemporal Convolutions for Action Recognition (link) Lorenzo Torresani
13.25-14.10 Action classification and detection architectures (link) Christoph Feichtenhofer
14.10-14.40 3D spatiotemporal networks, datasets and evaluation (link) Joao Carreira
14.40-15.15 The Machine Learning of Time in Long Videos (link1), (link2) Efstratios Gavves
15.15-15.35 Break
15.35-15.45 Action Recognition in Untrimmed Videos (link) Lorenzo Torresani
15.45-16.00 Self-supervised learning using the time axis (link) Lorenzo Torresani
16.00-16.15 Self-supervised learning of temporal correspondence Joao Carreira
16.15-16.30 Self-supervised and multimodal video learning (link) Efstratios Gavves