摘要¶
论文^{1}
关键点： 视觉追踪能提供超视觉信息 visual tracking provides the supervision
视觉追踪中的两帧在深度特征空间中，应该有着相似的视觉表示(visual representation)，因为它们很可能对应同一物体或物体部分
two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part
所以，本文的初始目标是从视觉追踪的视频流中抽取两帧，计算相似度， 又为了对比效果（过拟合？），需要不相似的另一帧，组成 孪生(Siamese)三元组的数据，加上一个有序损失函数，用以训练CNN模型
design a Siamesetriplet network with a ranking loss function to train this CNN representation
导论¶
propose to exploit visual tracking for learning CNNs in an unsupervised manner
two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object
和摘要重复了~~~
overview¶
使用 AlexNet架构 follow the AlexNet architecture to design our base network.
从对物体或物体部分追踪的视频实例中选取开头、结尾两帧 first and last tracked frames correspond to the same instance of the moving object or object part
所以在特征空间里，这两帧（数据点）的视觉表示 应该是相近的 any visual representation that we learn should keep these two data points close in the feature space
仅仅如此并不充分， 容易过拟合， 比如所有点都映射到一个点上就一了百了了 just using this constraint is not sufficient: all points can be mapped to a single point in feature space
overfit
One way is to increase the number of training triplets
analogous to hardnegative mining, we select the third patch from multiple patches that violates the constraint
视频帧挖掘¶
 obtain SURF^{2} interest points and use Improved Dense Trajectories (IDT)^{6} to obtain motion of each SURF point.
 classify these points as moving if the flow magnitude is more than 0.5 pixels
 reject frames (a) very few (< 25%) SURF interest points are classified as moving because it might be just noise (b) majority of SURF interest points (> 75%) are classified as moving asit corresponds to moving camera
 find the best bounding box such that it contains most of the moving SURF points. The size of the bounding box is set as h × w, and we perform sliding window with it in the frame
Tracking:
Given the initial bounding box, we perform tracking using the KCF tracker^{4}. After tracking along 30 frames in the video, we obtain the second patch. This patch acts as the similar patch to the query patch in the triplet. Note that the KCF tracker does not use any supervised information except for the initial bounding box
视频学习¶
Siamese Triplet Network¶
AlexNet architecture^{5}
stack two fully connected layers on the pool5 outputs, whose neuron numbers are 4096 and 1024 respectively. Thus the final output of each single network is 1024 dimensional feature space f(·).
ranking loss function¶
to learn an image similarity model in the form of CNN
define the distance of two image patches X1, X2 based on the cosine distance in the feature space as
$D(X_1,X_2) = 1  \frac{f(X_1) * f(X_2)}{f(X_1)f(X_2)}$
$X_i$ $X_i^{+}$ $X_i^{}$
the loss of our ranking model is defined by hinge loss as
$$L(i, +, ) = max{0, D(i,+)D(i,) + M}$$
Me M represents the gap parameters between two distances. We set M = 0.5 in the experiment. Then our objective function for training can be represented as
$$\min_{W} \frac{\lambda}{2} W2^2 + \sum$$}^Nmax{0,D(i,+)  D(i,)+M
W is the parameter weights of the network, i.e., parameters for function f(·). N is the number of the triplets of samples. $\lambda$ is a constant representing weight decay, which is set to $\lambda$λ = 0.0005
Hard Negative Mining for Triplet Sampling¶
random selection¶
randomly sample K negative matches in the same batch, e have K sets of triplet of samples
Hard Negative Mining¶
Adapting for Supervised Tasks¶
apply our model to two different tasks including object detection and surface normal estimation
¶
directly applying our ranking model as a pretrained network for the target task
very similar to the approach applied in RCNN^{3}
we use an iterative finetuning scheme. Given the initial unsupervised network, we first finetune using the PASCAL VOC data. Given the new finetuned network, we use this network to readapt to ranking triplet task. Here we again transfer convolutional parameters for readapting. Finally, this readapted network is finetuned on the VOC data yielding a better trained model. We show in the experiment that this circular approach gives improvement in performance. We also notice that after two iterations of this approach the network converges
Model Ensemble¶
实现细节¶
实验¶
目前还看不懂实验部分

Unsupervised Learning of Visual Representations using Videos, Xiaolong Wang, Abhinav Gupta ↩

H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features. In ECCV, 2006 features ↩

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR ↩

J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Highspeed tracking with kernelized correlation filters. TPAMI, 2015 ↩

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012 ↩

H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013 ↩