Exploring Temporal Granularity in Self-Supervised Video Representation Learning
Research output: Working paper › Preprint › Research
Standard
Exploring Temporal Granularity in Self-Supervised Video Representation Learning. / Li, Yeqing; Yuan, Liangzhe; Gong, Boqing; Liu, Ting; Brown, Matthew; Belongie, Serge; Yang, Ming-Hsuan; Adam, Hartwig; Cui, Yin.
arXiv.org, 2022.
RIS
TY - UNPB
T1 - Exploring Temporal Granularity in Self-Supervised Video Representation Learning
AU - Li, Yeqing
AU - Yuan, Liangzhe
AU - Gong, Boqing
AU - Liu, Ting
AU - Brown, Matthew
AU - Belongie, Serge
AU - Yang, Ming-Hsuan
AU - Adam, Hartwig
AU - Cui, Yin
PY - 2022
Y1 - 2022
N2 - This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
AB - This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
UR - https://arxiv.org/abs/2112.04480
M3 - Preprint
BT - Exploring Temporal Granularity in Self-Supervised Video Representation Learning
PB - arXiv.org
ER -
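
Note: the abstract describes the two TeG training objectives concretely enough to sketch. Below is a minimal PyTorch illustration, not the authors' implementation: the encoder, the use of InfoNCE as the similarity loss, the temperature, and all names (info_nce, teg_losses, ToyEncoder) are assumptions made for illustration; see the preprint at https://arxiv.org/abs/2112.04480 for the actual method.

# A minimal sketch of the two TeG objectives described in the abstract.
# Everything here is an illustrative assumption, not the authors' recipe.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    """Contrastive loss: row i of q should match row i of k."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                 # (N, N) similarities
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def teg_losses(encoder, long_clip, start, short_len):
    """long_clip: (B, T, C, H, W); the short clip is frames
    [start, start + short_len) of the long clip, as in the abstract."""
    short_clip = long_clip[:, start:start + short_len]

    # Dense temporal embeddings: one feature vector per timestep.
    z_long = encoder(long_clip)                      # (B, T, D)
    z_short = encoder(short_clip)                    # (B, short_len, D)

    # Fine-grained objective: maximize similarity between corresponding
    # temporal embeddings of the short clip and the long clip.
    z_aligned = z_long[:, start:start + short_len]
    fine = info_nce(z_short.flatten(0, 1), z_aligned.flatten(0, 1))

    # Persistent objective: pull together the global (temporally pooled)
    # embeddings of the two clips.
    persistent = info_nce(z_short.mean(dim=1), z_long.mean(dim=1))
    return fine, persistent

if __name__ == "__main__":
    # Toy encoder (spatial mean pool + linear projection), only so the
    # sketch runs end to end; a real system would use a video backbone.
    class ToyEncoder(torch.nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.proj = torch.nn.Linear(3, dim)
        def forward(self, x):                        # (B, T, C, H, W)
            return self.proj(x.mean(dim=(-1, -2)))   # (B, T, D)

    video = torch.randn(4, 16, 3, 32, 32)            # 4 clips, 16 frames each
    fine, persistent = teg_losses(ToyEncoder(), video, start=4, short_len=8)
    print(fine.item(), persistent.item())

In this sketch the negatives for each timestep come from all other timesteps and videos in the batch; the preprint's actual objectives, sampling scheme, and benchmark results are specified at the URL above.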