StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval [SIGIR 26]
Shaokun Wang1, Weili Guan1*, Jizhou Han2, Jianlong Wu1, Yupeng Hu3, Liqiang Nie1
1 Harbin Institute of Technology (Shenzhen) 2 Xi'an Jiaotong University 3 Shandong University
* Corresponding author
- 📄 Paper: ACM Digital Library
- 💻 Code Repository: GitHub
- [04/2026] Paper accepted at SIGIR 2026
- [07/2026] Initial open-source release
- 🎯 Task: Continual Text-to-Video Retrieval (CTVR)
- 🧠 Problem: Intra-modal drift + cross-modal misalignment
- ⚙️ Key Idea: Structured cross-modal alignment via ETF geometry
- 🚀 Result: Competitive performance against state-of-the-art continual retrieval approaches on MSRVTT & ACTNET
Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text–video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method shows competitive advantages over state-of-the-art continual retrieval approaches.
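To make the geometric prior concrete, here is a minimal NumPy sketch of building simplex ETF prototypes and pulling text and video features toward them. It is an illustration under stated assumptions, not the released implementation: the function names `simplex_etf` and `etf_alignment_loss`, the random orthonormal basis, and the `1 - cosine` form of the loss are all hypothetical choices; the exact loss used by StructAlign is defined in the paper.

```python
import numpy as np


def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (num_classes, dim) matrix whose rows form a simplex ETF.

    Standard construction: M = sqrt(K / (K - 1)) * U (I - 11^T / K),
    where U is a (dim x K) matrix with orthonormal columns.
    """
    if num_classes < 2 or dim < num_classes:
        raise ValueError("this construction needs 2 <= num_classes <= dim")
    rng = np.random.default_rng(seed)
    # Reduced QR gives a (dim x K) matrix with orthonormal columns.
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    prototypes = np.sqrt(k / (k - 1)) * (u @ centering)  # shape (dim, K)
    return prototypes.T


def _l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def etf_alignment_loss(text_feats, video_feats, labels, prototypes) -> float:
    """Assumed loss: mean (1 - cosine) to the class prototype, per modality."""
    targets = prototypes[labels]  # prototypes already have unit norm
    text_cos = np.sum(_l2_normalize(text_feats) * targets, axis=1)
    video_cos = np.sum(_l2_normalize(video_feats) * targets, axis=1)
    return float(np.mean(1.0 - text_cos) + np.mean(1.0 - video_cos))
```

Every prototype has unit norm and every pair of distinct prototypes has cosine similarity `-1 / (K - 1)`, which is the maximally separated arrangement of K unit vectors. Because both modalities share the same fixed targets, they are pulled toward a common structure rather than drifting independently.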
- We propose StructAlign, a structured cross-modal alignment framework for CTVR, which explicitly models and mitigates catastrophic forgetting induced by both intra-modal feature drift and non-cooperative feature drift across modalities.
- We introduce a simplex ETF geometric prior together with a cross-modal ETF alignment loss to enforce a well-separated category-level structure in the shared embedding space. In addition, we design a cross-modal relation preserving loss that leverages cross-modal similarity relations to constrain intra-modal feature updates during continual learning.
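The relation preserving idea in the second contribution can be sketched as a distillation term on text-to-video similarity distributions. The version below is only one plausible reading: the function name `relation_preserving_loss`, the use of a frozen previous-task model to produce `text_old` and `video_old`, the KL divergence, and the `temperature` value are all assumptions made for illustration and may differ from the loss defined in the paper.

```python
import numpy as np


def _l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _log_softmax(z: np.ndarray) -> np.ndarray:
    # Row-wise log-softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def relation_preserving_loss(text_new, video_new, text_old, video_old,
                             temperature: float = 0.05) -> float:
    """Assumed form: KL(old || new) over row-wise text-to-video similarities.

    *_old are features from the frozen model of the previous task,
    *_new are features from the model currently being trained.
    """
    sim_new = _l2_normalize(text_new) @ _l2_normalize(video_new).T / temperature
    sim_old = _l2_normalize(text_old) @ _l2_normalize(video_old).T / temperature
    log_p_new = _log_softmax(sim_new)
    log_p_old = _log_softmax(sim_old)
    kl_per_text = np.sum(np.exp(log_p_old) * (log_p_old - log_p_new), axis=1)
    return float(kl_per_text.mean())
```

The term is zero when the new model reproduces the old cross-modal similarity pattern and grows as the pattern changes, so features within each modality may move as long as their relations to the other modality are kept.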
- MSRVTT
  - Video clips: 10,000 short videos
  - Type: Short video–text paired dataset
- ACTNET
  - Video clips: ~20,000 long, untrimmed videos
  - Type: Long-form activity recognition dataset
We follow the CTVR protocol proposed in StableFusion:
- All categories are evenly divided into K tasks
- Two settings are used:
  - K = 10
  - K = 20
| Dataset | #Category | Shot/Category | #Task |
|---|---|---|---|
| MSRVTT | 20 | 16 | K=10,20 |
| ACTNET | 200 | 16 | K=10,20 |
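As a small illustration of the protocol above, the sketch below splits a category list evenly into K tasks. The helper name `split_categories` is hypothetical, and keeping the categories in their given order is an assumption; the actual protocol may shuffle or use a fixed category order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_categories(categories: Sequence[T], num_tasks: int) -> List[List[T]]:
    """Divide categories evenly into num_tasks consecutive groups."""
    if num_tasks <= 0 or len(categories) % num_tasks != 0:
        raise ValueError("categories must divide evenly into num_tasks")
    per_task = len(categories) // num_tasks
    return [
        list(categories[i * per_task:(i + 1) * per_task])
        for i in range(num_tasks)
    ]


if __name__ == "__main__":
    msrvtt = list(range(20))    # 20 categories, as in the table above
    actnet = list(range(200))   # 200 categories, as in the table above
    for name, cats in (("MSRVTT", msrvtt), ("ACTNET", actnet)):
        for k in (10, 20):
            tasks = split_categories(cats, k)
            print(f"{name} K={k}: {len(tasks[0])} categories per task")
```

With the category counts from the table, MSRVTT yields 2 or 1 categories per task and ACTNET yields 20 or 10 categories per task for K = 10 and K = 20 respectively.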
Install the dependencies needed to run the code with Python 3.x.
First, create and activate a new conda environment, then run:
pip install -r requirements.txt
# Download MSRVTT data
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip MSRVTT.zip -d datasets/MSRVTT
# Place videos in:
datasets/MSRVTT/MSRVTT_Videos
# Process video frames
python datasets/utils/process_msrvtt.py
# Download ActivityNet data from the official website: http://activity-net.org/download.html
# Save it in:
datasets/ACTNET
# Place videos in:
datasets/ACTNET/Activity_Videos
# Process video clips
python datasets/utils/process_actnet.py
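Before running the processing scripts, it can help to confirm that the videos sit where the steps above place them. The following is a small optional check, not part of the repository: the helper name `missing_dataset_dirs` is hypothetical, and it only looks for the two video directories named in this README.

```python
from pathlib import Path

# Video directories named in the preparation steps of this README.
EXPECTED_DIRS = (
    "datasets/MSRVTT/MSRVTT_Videos",
    "datasets/ACTNET/Activity_Videos",
)


def missing_dataset_dirs(root="."):
    """Return the expected video directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    for d in missing:
        print(f"missing: {d}")
    if not missing:
        print("dataset layout looks complete")
```

Run it from the repository root; only the dataset you plan to use needs to be present.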
After downloading the datasets you need, run the following command to obtain the training samples used in the few-shot and easy-to-hard classification tasks.
sh run.sh
Results will be saved in log/.
- Predefined class number requirement: This work constructs the Simplex ETF geometry under the assumption that the total number of classes is known in advance. Although this requirement can be mitigated in practice by setting a sufficiently large upper bound on the number of classes, it may introduce mild redundancy in the representation space and reduce parameter efficiency in some scenarios.
- Additional training overhead from regularization terms: The proposed cross-modal ETF alignment loss and cross-modal relation preserving loss are both regularization-based strategies. While they effectively improve anti-forgetting capability in continual learning, they also introduce a modest increase in training cost.
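The redundancy mentioned in the first limitation can be quantified with the standard simplex ETF property. The numbers below are illustrative only: $K = 20$ matches the MSRVTT category count in the table above, while the upper bound $K_{\max} = 200$ is a hypothetical choice, not a setting reported by the authors.

```latex
% Pairwise cosine between distinct unit-norm prototypes of a simplex ETF
% built for K_max classes:
\[
  \cos\angle(\mathbf{m}_i, \mathbf{m}_j) = -\frac{1}{K_{\max} - 1},
  \qquad i \neq j .
\]
% Exact class count (K = 20) versus a loose upper bound (K_max = 200):
\[
  -\frac{1}{20 - 1} \approx -0.053
  \qquad \text{vs.} \qquad
  -\frac{1}{200 - 1} \approx -0.005 .
\]
```

With a loose upper bound the prototypes that are actually used sit closer to mutually orthogonal than to the maximally separated arrangement for the true class count, and the unused prototypes occupy directions in the embedding space that no category ever claims.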
If you find our work useful for your research, please cite it:
@inproceedings{StructAlign,
author = {Wang, Shaokun and Guan, Weili and Han, Jizhou and Wu, Jianlong and Hu, Yupeng and Nie, Liqiang},
title = {StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval},
year = {2026},
booktitle = {Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {},
numpages = {}
}
We thank the following repo for providing helpful functions used in our work.

