Abstract¶
Paper [1]
Key point: visual tracking can provide the supervision signal: "visual tracking provides the supervision"
Two patches connected by a track should have similar visual representations in deep feature space, since they most likely correspond to the same object or object part:
two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part
So the paper's starting point is to sample two patches connected by a track in a video and make their features similar. This alone admits a trivial (collapsed) solution, so a third, dissimilar patch is added, forming Siamese triplets, and a ranking loss is used to train the CNN:
design a Siamese-triplet network with a ranking loss function to train this CNN representation
Introduction¶
propose to exploit visual tracking for learning CNNs in an unsupervised manner
two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object
(This repeats the abstract.)
Overview¶
The base network follows the AlexNet architecture: "follow the AlexNet architecture to design our base network."
Take the first and last frames of each track of an object or object part: "first and last tracked frames correspond to the same instance of the moving object or object part."
So in feature space these two patches should be close: "any visual representation that we learn should keep these two data points close in the feature space."
This constraint alone is not sufficient, since it admits a trivial solution where everything collapses to one point: "just using this constraint is not sufficient: all points can be mapped to a single point in feature space."
One way is to increase the number of training triplets.
Analogous to hard negative mining, the third patch is selected from multiple candidate patches that violate the constraint.
Video Frame Mining¶
- Obtain SURF [2] interest points and use Improved Dense Trajectories (IDT) [6] to obtain the motion of each SURF point.
- Classify a point as moving if its flow magnitude is more than 0.5 pixels.
- Reject frames where (a) very few (< 25%) SURF interest points are classified as moving, since that is likely just noise, or (b) the majority (> 75%) are classified as moving, since that likely corresponds to camera motion.
- Find the best bounding box, i.e., the one containing the most moving SURF points. The box size is fixed at h × w, and it is slid over the frame as a window (see the sketch below).
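A rough sketch of this mining step in Python/OpenCV, just to pin down the thresholds. This is my reconstruction, not the paper's code: Farneback dense optical flow stands in for IDT, `mine_frame`, `h`, `w`, and `stride` are hypothetical names, and SURF needs an opencv-contrib (non-free) build:

```python
import cv2
import numpy as np

def mine_frame(prev_gray, gray, h=227, w=227, stride=16):
    # SURF requires opencv-contrib-python; Farneback flow is a simple
    # stand-in here for Improved Dense Trajectories.
    surf = cv2.xfeatures2d.SURF_create()
    kps = surf.detect(prev_gray, None)
    if not kps:
        return None
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    pts = np.array([kp.pt for kp in kps])                    # (K, 2), (x, y)
    mags = np.linalg.norm(
        flow[pts[:, 1].astype(int), pts[:, 0].astype(int)], axis=1)
    moving = mags > 0.5                # moving if flow magnitude > 0.5 px

    frac = moving.mean()
    if frac < 0.25 or frac > 0.75:     # (a) likely noise / (b) camera motion
        return None

    # Sliding window: keep the h x w box containing the most moving points.
    mx, my = pts[moving, 0], pts[moving, 1]
    best, best_box = -1, None
    H, W = gray.shape
    for y in range(0, H - h + 1, stride):
        for x in range(0, W - w + 1, stride):
            count = ((mx >= x) & (mx < x + w) &
                     (my >= y) & (my < y + h)).sum()
            if count > best:
                best, best_box = count, (x, y, w, h)
    return best_box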
Tracking:
Given the initial bounding box, tracking is performed with the KCF tracker [4]. After tracking for 30 frames, the second patch is obtained; it acts as the similar patch to the query patch in the triplet. Note that the KCF tracker does not use any supervised information apart from the initial bounding box.
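A minimal sketch of the tracking step, assuming OpenCV's KCF implementation; `track_patch_pair` and the frame-list interface are hypothetical:

```python
import cv2

def track_patch_pair(frames, init_box, n_frames=30):
    # On some builds this lives under cv2.legacy.TrackerKCF_create.
    tracker = cv2.TrackerKCF_create()
    tracker.init(frames[0], init_box)
    box = init_box
    for frame in frames[1:n_frames + 1]:
        ok, box = tracker.update(frame)
        if not ok:
            return None                 # lost track: discard this pair
    qx, qy, qw, qh = (int(v) for v in init_box)
    x, y, w, h = (int(v) for v in box)
    query_patch = frames[0][qy:qy + qh, qx:qx + qw]
    similar_patch = frames[n_frames][y:y + h, x:x + w]
    return query_patch, similar_patch   # (X_i, X_i^+)
```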
Learning from Videos¶
Siamese Triplet Network¶
The base is the AlexNet architecture [5]. Two fully connected layers are stacked on the pool5 outputs, with 4096 and 1024 neurons respectively; the final output of each single branch is thus a 1024-dimensional feature f(·).
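A hedged PyTorch sketch of one branch (the framework choice is mine, not the paper's); since the three branches share weights, one module applied three times suffices. The 9216 input size is pool5's 256 × 6 × 6 output:

```python
import torch.nn as nn
import torchvision.models as models

class TripletBranch(nn.Module):
    """One branch of the Siamese triplet network: AlexNet conv layers
    up to pool5, followed by two new FC layers (4096, 1024)."""
    def __init__(self):
        super().__init__()
        alexnet = models.alexnet(weights=None)   # train from scratch
        self.features = alexnet.features         # conv1 .. pool5
        self.fc = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 1024),
        )

    def forward(self, x):                         # x: (N, 3, 227, 227)
        x = self.features(x).flatten(1)           # (N, 9216)
        return self.fc(x)                         # (N, 1024) = f(x)

# Weight sharing: the same branch embeds query, tracked, and negative patches.
net = TripletBranch()
```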
Ranking Loss Function¶
The goal is to learn an image similarity model in the form of a CNN. The distance between two image patches $X_1, X_2$ is defined via the cosine distance in feature space:
$$D(X_1, X_2) = 1 - \frac{f(X_1) \cdot f(X_2)}{\|f(X_1)\|\,\|f(X_2)\|}$$
Given a triplet of a query patch $X_i$, its tracked similar patch $X_i^{+}$, and a dissimilar patch $X_i^{-}$, the ranking loss is defined by the hinge loss
$$L(X_i, X_i^{+}, X_i^{-}) = \max\{0,\, D(X_i, X_i^{+}) - D(X_i, X_i^{-}) + M\}$$
where $M$ is the gap (margin) parameter between the two distances, set to $M = 0.5$ in the experiments. The objective function for training is then
$$\min_{W}\ \frac{\lambda}{2}\|W\|_2^2 + \sum_{i=1}^{N} \max\{0,\, D(X_i, X_i^{+}) - D(X_i, X_i^{-}) + M\}$$
where $W$ denotes the parameter weights of the network, i.e., the parameters of $f(\cdot)$, $N$ is the number of triplets, and $\lambda$ is the weight-decay constant, set to $\lambda = 0.0005$.
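A minimal PyTorch sketch of this objective (again my framework choice, not the paper's); the $\frac{\lambda}{2}\|W\|_2^2$ term is delegated to the optimizer's weight decay:

```python
import torch
import torch.nn.functional as F

def ranking_loss(f_query, f_pos, f_neg, margin=0.5):
    """Hinge ranking loss over cosine distances, as defined above.

    f_query, f_pos, f_neg: (N, 1024) embeddings of X_i, X_i^+, X_i^-.
    """
    # D(X1, X2) = 1 - cosine similarity
    d_pos = 1 - F.cosine_similarity(f_query, f_pos, dim=1)
    d_neg = 1 - F.cosine_similarity(f_query, f_neg, dim=1)
    # sum_i max{0, D(i,+) - D(i,-) + M}
    return torch.clamp(d_pos - d_neg + margin, min=0).sum()

# The (lambda/2)||W||_2^2 term corresponds to weight decay, e.g.:
# optimizer = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=0.0005)
```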
Hard Negative Mining for Triplet Sampling¶
Random Selection¶
Randomly sample K negative patches within the same batch; this yields K triplets per query patch.
Hard Negative Mining¶
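The notes leave this subsection empty; below is a minimal sketch of my reading of "select the third patch from multiple patches that violate the constraint": for each query, pick the candidate negative closest in feature space, since that one contributes the largest hinge loss.

```python
import torch
import torch.nn.functional as F

def hardest_negatives(f_query, f_candidates):
    """Batch-level hard negative selection (my interpretation).

    f_query:      (N, 1024) query embeddings
    f_candidates: (K, 1024) candidate negatives from the same batch;
                  assumed not to come from the query's own track.
    """
    q = F.normalize(f_query, dim=1)
    c = F.normalize(f_candidates, dim=1)
    dist = 1 - q @ c.T                  # (N, K) pairwise cosine distances
    hard_idx = dist.argmin(dim=1)       # hardest = closest negative
    return f_candidates[hard_idx]
```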
Adapting for Supervised Tasks¶
The model is applied to two different tasks: object detection and surface normal estimation.
The simplest option is directly applying the ranking model as a pre-trained network for the target task, very similar to the approach used in R-CNN [3].
Beyond that, an iterative fine-tuning scheme is used. Starting from the initial unsupervised network, first fine-tune on the PASCAL VOC data. Then use this fine-tuned network to re-adapt to the ranking triplet task, again transferring the convolutional parameters. Finally, the re-adapted network is fine-tuned on the VOC data once more, yielding a better model. The experiments show this circular approach improves performance, and the network converges after two iterations of it (see the schematic below).
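A schematic of that loop, with hypothetical helper names (these functions stand for the fine-tuning and re-adaptation steps described above, not any real API):

```python
net = unsupervised_ranking_net()        # trained on triplets, as above
for _ in range(2):                      # converges after two iterations
    det_net = finetune_on_voc(net)      # transfer conv layers, train on VOC
    net = readapt_to_triplets(det_net)  # transfer conv layers back, re-rank
final_detector = finetune_on_voc(net)
```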
Model Ensemble¶
Implementation Details¶
Experiments¶
I can't yet make sense of the experiments section.
References¶
1. Xiaolong Wang and Abhinav Gupta. Unsupervised Learning of Visual Representations using Videos. In ICCV, 2015.
2. H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. In ECCV, 2006.
3. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
4. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 2015.
5. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
6. H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.