摘要¶
论文官网 用低级别(low-level)的移动聚合线索(motion-based grouping cues)学习视觉表示.
特针对: 视频的无监督移动图像分割, 使用CNN方法。
导言¶
静态物体过度分割oversegment static objects:
有一种Gestalt原则:一起移动的像素很可能有相同的归属。静态场景处理能力会慢慢提升,意味着,动态聚合的能力出现较早,而静态聚合较晚,静态聚合很可能是由动态线索引导出来的。
Gestalt principle of common fate [34, 47]: pixels that move together tend to belong together. The ability to parse static scenes improves [32] over time, suggesting that while motion-based grouping appears early, static grouping is acquired later, possibly bootstrapped by motion cues.
所以,要弄一种 无监督的动态分割方法 unsupervised motion segmentation
动态分割就需要基于视觉流 optical flow
相关工作¶
个人对 Learning from motion and action 感兴趣
Wang and Gupta2 train a ConvNet to distinguish between pairs of tracked patches in a single video, and pairs of patches from different videos
Li et al1 use motion boundary detection to bootstrap a ConvNet-based contour detector, but find that this does not lead to good feature representations
评估特征表示¶
feature representations
略
用聚合学习来学习特征¶
通过学习聚合(group)来学习特征.
core intuition behind this paper is that training a ConvNet to group pixels in static images into objects without any class labels will cause it to learn a strong , high-level feature representation
用CNN去分割图像¶
CNN to 分割物体, 是物体的像素点置1, 不是的置0.
从完整图中剪一个物体
prevent cheating ConvNet by given a precise bounding box: jitter the box in position and scale
take input $w \times w$ image to output $s \times s$ mask.
实验¶
AlexNet, $s=56, w=227$
观察物体移动来学习¶
无监督移动分割¶
The key idea behind motion segmentation is that if there is a single object moving with respect to the background through the entire video, then pixels on the object will move differently from pixels on the background.
复杂物体单帧可能只移动部分, 所以需要多帧进行聚合
e NLC approach from Faktor and Irani [12], utilizes an edge detector that was trained on labeled edge images 。 replace the trained edge detector in NLC with unsupervised superpixels
- uNLC computes a per-frame saliency map based on motion, by looking for
- either pixels that move in a mostly static frame
- or, if the frame contains significant motion, pixels that move in a direction different from the dominant one
- per-pixel saliency is averaged over superpixels.
- nearest neighbor graph is computed over the superpixels in the video using location and appearance as features.
- use a nearest neighbor voting scheme to propagate the saliency across frames
结¶
真看不明白~~~没有基础知识,想看算法没有,有代码,但代码多复杂啊~~~