Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

The Chinese University of Hong Kong

Abstract

Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by substantial training overhead. We identify two key issues that contribute to this overhead: (1) the quadratic computational burden of processing long-horizon historical observations as massive token sequences, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process that collects agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectories for both training and inference.

To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off.

Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.

This figure compares the Success Rate (SR) on the R2R-CE benchmark (y-axis) against training overhead in GPU hours (x-axis, log scale).

Efficient-VLN

Efficient Memory Representation

To tackle the computational overhead of processing massive sequences of memory tokens, we introduce two novel, efficient memory representations.
  1. A progressive memory representation is constructed via progressive spatial compression, inspired by the human forgetting mechanism (i.e., the gradual fading of memories over time). Specifically, it applies a low compression ratio to visual tokens from recent frames and gradually increases this ratio for temporally distant ones (see the first sketch after this list).
  2. A learnable recursive memory uses the key-value (KV) cache of a set of learnable tokens as the memory state. By propagating this state across consecutive steps, the memory is recursively updated while remaining a fixed size (see the second sketch after this list).
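
Below is a minimal PyTorch sketch of the progressive spatial compression. The per-frame patch grid, the pooling operator, and the keep_sizes schedule are illustrative assumptions; the paper's exact compression ratios may differ.

```python
# A minimal sketch of progressive spatial compression (assumption: visual tokens
# arrive as a per-frame patch grid, e.g. 16x16; the pooling schedule below is
# illustrative, not the paper's exact ratios).
import torch
import torch.nn.functional as F

def progressive_memory(frame_tokens, keep_sizes=(16, 8, 4, 2, 1)):
    """Compress older frames more aggressively than recent ones.

    frame_tokens: list of [H, W, C] tensors, ordered oldest -> newest.
    keep_sizes:   spatial side length kept for the most recent frames;
                  frames older than the schedule use the smallest size.
    """
    memory = []
    for age, tokens in enumerate(reversed(frame_tokens)):  # age 0 = newest frame
        side = keep_sizes[min(age, len(keep_sizes) - 1)]
        # [H, W, C] -> [1, C, H, W] for pooling, then back to a token sequence.
        x = tokens.permute(2, 0, 1).unsqueeze(0)
        x = F.adaptive_avg_pool2d(x, side)                  # coarser for older frames
        memory.append(x.flatten(2).transpose(1, 2).squeeze(0))  # [side*side, C]
    memory.reverse()                                        # restore oldest -> newest order
    return torch.cat(memory, dim=0)                         # compact memory token sequence

# Usage: 10 frames of 16x16 patch tokens with 768-dim features.
frames = [torch.randn(16, 16, 768) for _ in range(10)]
print(progressive_memory(frames).shape)  # far fewer tokens than 10 * 256
```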

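The following sketch illustrates the recursive, fixed-size update of a learnable memory. The module name, token counts, and the choice to cache the updated memory embeddings (rather than raw key/value projections) are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class RecursiveMemory(nn.Module):
    """Minimal sketch of a learnable recursive memory (sizes are illustrative).

    A fixed set of learnable tokens is updated at every navigation step by
    attending to the previous memory state and the current observation tokens,
    so the memory stays a constant size regardless of trajectory length. For
    simplicity this sketch caches the updated memory embeddings, from which
    keys/values are recomputed, rather than the raw KV projections.
    """

    def __init__(self, num_mem=16, dim=768, heads=8):
        super().__init__()
        self.init_mem = nn.Parameter(torch.randn(1, num_mem, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def step(self, obs_tokens, prev_mem=None):
        """obs_tokens: [B, N_obs, dim]; prev_mem: [B, num_mem, dim] or None."""
        B = obs_tokens.size(0)
        queries = self.init_mem.expand(B, -1, -1) if prev_mem is None else prev_mem
        # Memory tokens attend to (previous memory state, current observations).
        context = torch.cat([queries, obs_tokens], dim=1)
        updated, _ = self.attn(queries, context, context)
        return self.norm(queries + updated)       # new fixed-size memory state

# Usage: the memory stays [B, 16, 768] no matter how many steps are processed.
mem_module = RecursiveMemory()
mem = None
for _ in range(5):
    obs = torch.randn(2, 196, 768)                # current-step visual tokens
    mem = mem_module.step(obs, mem)
print(mem.shape)                                  # torch.Size([2, 16, 768])
```
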
DAgger with Dynamic Mixed Policy

We introduce a dynamic mixed policy to better balance the exploration-efficiency trade-off in DAgger. Specifically, this policy starts with the learned policy, so that compounding errors (and the error-recovery behavior they trigger) are exposed, and then progressively shifts toward the oracle policy to ensure eventual task completion. This design enables the collection of error-recovery data without incurring excessively long trajectories, thereby accelerating training.
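
A minimal sketch of such a rollout schedule is shown below; the environment and policy interfaces and the linear schedule are hypothetical placeholders, not the exact formulation used in Efficient-VLN.

```python
import random

def dynamic_mixed_rollout(env, learned_policy, oracle_policy,
                          max_steps=64, warmup_frac=0.25):
    """Sketch of a dynamic mixed policy for DAgger data collection.

    Assumptions: `env`, `learned_policy`, and `oracle_policy` are placeholders.
    Early steps follow the learned policy, so compounding errors (and the
    error-recovery behavior they trigger) appear in the data; later steps shift
    toward the oracle so the trajectory still reaches the goal without growing
    excessively long.
    """
    obs = env.reset()
    trajectory = []
    for t in range(max_steps):
        # Probability of following the oracle grows from 0 to 1 over the episode.
        p_oracle = min(1.0, max(0.0, (t / max_steps - warmup_frac) / (1 - warmup_frac)))
        oracle_action = oracle_policy(obs)            # expert label is always recorded
        action = oracle_action if random.random() < p_oracle else learned_policy(obs)
        trajectory.append((obs, oracle_action))       # DAgger trains on expert labels
        obs, done = env.step(action)
        if done:
            break
    return trajectory
```
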
Left: Architectures of the progressive memory representation and the recursive memory representation.
Right: DAgger with the dynamic mixed policy.

Main Results

The proposed model achieves state-of-the-art performance with 64.2% SR on R2R-CE and 67.0% SR on RxR-CE, while consuming merely 282 H800 GPU hours, far less than competing methods.
Comparison with state-of-the-art methods.