Quick paper read: "One Flight Over the Gap: A Survey from Perspective to Panoramic Vision"

【PDF:One Flight Over the Gap A Survey from Perspective to Panoramic Vision - Lin 等 - 2025.pdf】

Note

I recently switched research directions, so I am starting to read up on panoramic vision.

Overview

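This is a survey of panoramic vision. It summarizes three main challenges of panoramic tasks: severe geometric distortions near the poles, the non-uniform sampling in Equirectangular Projection (ERP), and the periodic continuity of panoramic boundaries. In my view, geometric distortion and non-uniform sampling are causally related. The paper then divides panoramic vision into four major directions: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation.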

A thought experiment: if we unroll a sphere onto a cylinder, the two poles of the sphere are bound to be heavily deformed.

The root cause of ERP distortion

ERP (equirectangular projection) maps spherical coordinates linearly onto a rectangular plane:

(\theta, \phi) \rightarrow (x, y)

where \theta is the longitude and \phi is the latitude, both laid out uniformly on the pixel grid.
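To make the linear mapping concrete, here is a minimal sketch of the forward and inverse ERP mapping. The function names, the radian ranges, and the axis convention (longitude 0 at the image center, north pole on the top row) are my own assumptions, not something fixed by the paper.

```python
import numpy as np

def sphere_to_erp(theta, phi, width, height):
    """Map longitude theta in [-pi, pi) and latitude phi in [-pi/2, pi/2]
    to continuous ERP pixel coordinates (x, y).

    Assumed convention: x grows with longitude, y grows downward,
    so phi = +pi/2 (north pole) lands on y = 0.
    """
    x = (theta + np.pi) / (2.0 * np.pi) * width
    y = (np.pi / 2.0 - phi) / np.pi * height
    return x, y

def erp_to_sphere(x, y, width, height):
    """Inverse of sphere_to_erp under the same assumed convention."""
    theta = x / width * 2.0 * np.pi - np.pi
    phi = np.pi / 2.0 - y / height * np.pi
    return theta, phi
```

Both directions are plain linear rescalings, which is exactly why equal steps in x always mean equal steps in \theta, regardless of latitude.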

The origin of Non-uniform Spatial Sampling

At latitude \phi on the sphere, the actual arc length in the horizontal direction is:

ds = r\cos\phi \cdot d\theta

In ERP, however, every horizontal pixel corresponds to a fixed d\theta. Therefore:

Equator ( \phi=0 ): \cos\phi=1 , sampling is relatively uniform
Poles ( \phi=\pm 90° ): \cos\phi \to 0 , the same number of pixels covers a tiny area of the sphere → severe oversampling

This follows directly from the metric tensor of the projection; it is a property of the mapping itself.
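The oversampling can be checked numerically. The area element of a sphere is r^2 \cos\phi \, d\theta \, d\phi, so on the unit sphere each ERP pixel covers a solid angle of roughly \cos\phi \cdot d\theta \cdot d\phi. The sketch below computes this per row; the function names and the pixel-center convention are my own assumptions.

```python
import numpy as np

def erp_row_weights(height):
    """cos(latitude) at the center of each ERP row (top row = north pole side)."""
    phi = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    return np.cos(phi)

def erp_pixel_solid_angle(width, height):
    """Approximate solid angle (steradians) covered by one pixel in each row."""
    d_theta = 2.0 * np.pi / width   # fixed longitude step per pixel
    d_phi = np.pi / height          # fixed latitude step per pixel
    return erp_row_weights(height) * d_theta * d_phi

# Usage: per-pixel solid angle for a 1024 x 512 panorama.
omega = erp_pixel_solid_angle(1024, 512)
pole_to_equator_ratio = omega[0] / omega[256]
```

Summing the per-pixel solid angle over the whole image gives approximately 4\pi, the area of the unit sphere, while `pole_to_equator_ratio` is far below 1: a pixel in the top row covers only a small fraction of what an equatorial pixel covers.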

Introduction and Background

In the introduction, the paper points out how this differs from conventional CV:

The planar assumptions embedded in conventional deep models hinder their ability to handle spherical geometry and full-scene coverage, thus limiting the adaptability of perspective-based techniques and slowing progress in omnidirectional vision.

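In other words, conventional convolutions do not work well in heavily distorted regions such as those in ERP images. Many downstream methods are built on CNN-based backbones like ResNet50, so many downstream tasks cannot be transferred directly to panoramas.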

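The paper then covers some background knowledge for the panoramic field, including seven types of panoramic imaging systems, panorama stitching techniques, and several coordinate representations.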

Structural Challenges and Solution Strategies

For panoramic segmentation, there are these three challenges:

Geometric Distortion: the consequence is that “(CNNs) translation-invariant filters are ill-suited for spherical geometry”
Non-uniform Spatial Sampling: the cause is that “Each horizontal line of the ERP corresponds to a constant latitude on the sphere, which leads to the density of pixels to vary across different latitudes.”
Boundary Continuity
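Of the three, boundary continuity is the easiest to illustrate: the left and right edges of an ERP image are neighbors on the sphere. One common way to respect this (my own illustration, not a method attributed to the paper) is to pad the width axis circularly before applying a convolution. A minimal NumPy sketch, with a hypothetical helper name:

```python
import numpy as np

def circular_pad_width(erp, pad):
    """Wrap-pad an ERP image of shape (H, W) or (H, W, C) along the width axis.

    The left padding is filled with the rightmost columns and vice versa,
    so a filter sliding over the seam sees the true spherical neighbors.
    """
    pad_spec = [(0, 0), (pad, pad)] + [(0, 0)] * (erp.ndim - 2)
    return np.pad(erp, pad_spec, mode="wrap")

# Usage: a tiny 3 x 4 "panorama" padded by one column on each side.
demo = np.arange(12).reshape(3, 4)
padded = circular_pad_width(demo, 1)   # shape (3, 6)
```

The height axis is deliberately left unpadded here, because the top and bottom rows are the poles and do not wrap onto each other in the same simple way.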

To address these, two main technical routes are used:

Distortion-Aware Methods
Projection-Driven Methods

Distortion-Aware Methods

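These methods give the model an awareness of distortion through architectural design, for example spatially adaptive convolution kernels or other adaptive variants of CNNs and Transformers. Another family is more direct: the distortion map itself is fed to the model in some form, for example: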

distortion maps can be (1) concatenated with the input panorama to provide pixel-wise distortion cues, (2) fused with intermediate feature layers to modulate representation learning adaptively, and (3) incorporated into the loss function as weighted penalties to emphasize errors in highly distorted regions.
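Options (1) and (3) from the quote can be sketched in a few lines. Everything below is an illustrative assumption rather than a method from the paper: the distortion measure 1 - \cos\phi (0 at the equator, approaching 1 at the poles), the function names, and the weighting scheme 1 + distortion.

```python
import numpy as np

def distortion_map(height, width):
    """Hypothetical per-pixel distortion cue: 1 - cos(latitude)."""
    phi = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    return np.repeat((1.0 - np.cos(phi))[:, None], width, axis=1)

def concat_distortion_channel(erp):
    """Option (1): append the distortion map to an (H, W, C) panorama as an extra channel."""
    h, w, _ = erp.shape
    d = distortion_map(h, w)[..., None].astype(erp.dtype)
    return np.concatenate([erp, d], axis=-1)

def weighted_l2_loss(pred, target, weights):
    """Option (3): weighted mean squared error over (H, W) maps."""
    return float(np.sum(weights * (pred - target) ** 2) / np.sum(weights))

# Usage: weights that emphasize errors in highly distorted (polar) rows.
d = distortion_map(4, 8)
weights = 1.0 + d
```

With these weights, the same error costs more near the poles than at the equator. A metric that wants the opposite behavior (down-weighting oversampled polar pixels) would use \cos\phi as the weight instead.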

Projection-Driven Methods

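Projection-driven methods instead re-project the panorama in a suitable way, for example by unfolding it onto a cube or an icosahedron. This is usually combined with fusing features from several different projections.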
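As a rough illustration of re-projection, the sketch below samples a single cube face from an ERP image with nearest-neighbor lookup. The conventions are my own assumptions: the face looks toward longitude 0 and latitude 0, has a 90° field of view, longitude 0 sits at the image center, and the north pole is the top row. A full cubemap would repeat this for six view directions and normally use bilinear interpolation.

```python
import numpy as np

def erp_to_front_face(erp, face_size):
    """Sample the front cube face (looking at longitude 0, latitude 0) from an ERP image.

    erp: array of shape (H, W) or (H, W, C). Returns (face_size, face_size[, C]).
    """
    h, w = erp.shape[:2]
    # Pixel-center coordinates on the face plane, in [-1, 1].
    t = (np.arange(face_size) + 0.5) / face_size * 2.0 - 1.0
    u, v = np.meshgrid(t, -t)             # u points right, v points up
    # Ray through (forward=1, right=u, up=v), converted to longitude/latitude.
    theta = np.arctan2(u, 1.0)
    phi = np.arctan2(v, np.sqrt(1.0 + u ** 2))
    # Same linear ERP mapping as in the text, then nearest-neighbor indices.
    x = (theta + np.pi) / (2.0 * np.pi) * w
    y = (np.pi / 2.0 - phi) / np.pi * h
    cols = np.floor(x).astype(int) % w    # longitude wraps around
    rows = np.clip(np.floor(y).astype(int), 0, h - 1)
    return erp[rows, cols]

# Usage: an 8 x 16 ERP whose pixel value is its own column index.
demo_erp = np.tile(np.arange(16), (8, 1))
face = erp_to_front_face(demo_erp, 4)
```

Because the face covers only the central 90° of longitude, `face` pulls from the columns around the image center, and the distortion within the face is much milder than in the polar rows of the ERP.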

Panoramic Tasks

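The paper then goes through a large number of panoramic tasks. There are quite a lot, so I only note a few interesting ones here.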

Reflection Removal

Most perspective-based methods assume the whole image is glass-mixed with weak and blurred reflections, which often fails in 360 panoramas, making transmission–reflection separation particularly challenging.

Visual Quality Assessment

panoramic quality assessment faces unique challenges arising from ERP distortions near the poles and from user-dependent viewports that expose only localized regions at a time.

Knowledge Alignment

Knowledge Alignment focuses on transferring semantic consistency from the source perspective domain to the target panoramic domain, through explicit or implicit alignment mechanisms.

Semantic Mapping

Semantic mapping converts egocentric panoramic inputs into bird’s-eye view (BEV) representations, emphasizing spatial localization of objects rather than pixel-level segmentation.

Redefining Spherical Bounding Boxes and IoU Metrics

In contrast, transformation-based approaches such as Sph2Pob [130] map spherical boxes to rotated planar ones, reducing complexity and offering competitive accuracy, though with limited geometric fidelity.

Layout Detection

Layout detection aims to recover the structural boundaries of indoor scenes, including walls, floors, and ceilings from panoramic images.

Projection-Driven Methods align perspective and panoramic views within a unified spatial framework, reducing inconsistencies across representations.

Geometric-Based Methods adapt to panoramic geometry by converting 2D layouts into 1D horizon sequences

Structural-Aware Methods improve the robustness and generalization by integrating geometric priors, ambiguity modeling, and data-efficient strategies. Examples of such priors are lines and vanishing points.

Challenges and Future Work

This part is fairly general: mostly the usual point that there is not enough data, plus a brief mention of foundation models.