SIGIR26-StructAlign

StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval [SIGIR 26]

👥 Authors

Shaokun Wang¹, Weili Guan¹*, Jizhou Han², Jianlong Wu¹, Yupeng Hu³, Liqiang Nie¹

¹ Harbin Institute of Technology (Shenzhen) ² Xi'an Jiaotong University ³ Shandong University

* Corresponding author

🔗 Links

📄 Paper: ACM Digital Library
💻 Code Repository: GitHub

📢 Updates

[04/2026] Paper accepted at SIGIR 2026
[07/2026] Initial open-source release

📌 Overview

🎯 Task: Continual Text-to-Video Retrieval (CTVR)
🧠 Problem: Intra-modal drift + cross-modal misalignment
⚙️ Key Idea: Structured cross-modal alignment via ETF geometry
🚀 Result: Competitive advantages over state-of-the-art continual retrieval approaches on MSRVTT & ACTNET

📖 Abstract

Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text–video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method shows competitive advantages over state-of-the-art continual retrieval approaches.

🏗️ Framework

- We propose StructAlign, a structured cross-modal alignment framework for CTVR, which explicitly models and mitigates catastrophic forgetting induced by both intra-modal feature drift and non-cooperative feature drift across modalities.

- We introduce a simplex ETF geometric prior together with a cross-modal ETF alignment loss to enforce a well-separated category-level structure in the shared embedding space. In addition, we design a cross-modal relation preserving loss that leverages cross-modal similarity relations to constrain intra-modal feature updates during continual learning.
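
The simplex ETF prior described above can be illustrated with a small, dependency-free sketch. This is not the repository's implementation: the function names (`simplex_etf`, `cosine`, `etf_alignment_loss`), the use of the identity basis instead of a general orthonormal matrix, and the `1 - cosine` form of the loss are all assumptions made for illustration. The block only shows that K unit-norm prototypes with pairwise cosine -1/(K-1) can be constructed, and that pulling text and video features toward the same class prototype is one plausible way to express a cross-modal alignment objective.

```python
import math
from typing import List, Sequence


def simplex_etf(num_classes: int, dim: int) -> List[List[float]]:
    """Build K unit-norm prototypes in R^dim with pairwise cosine -1/(K-1)."""
    k = num_classes
    if k < 2 or dim < k:
        raise ValueError("need num_classes >= 2 and dim >= num_classes")
    scale = math.sqrt(k / (k - 1))
    prototypes = []
    for i in range(k):
        # Row i of sqrt(K/(K-1)) * (I_K - (1/K) 1 1^T), zero-padded to `dim`.
        # Assumption: identity basis; the general form allows any orthonormal basis.
        row = [scale * ((1.0 if j == i else 0.0) - 1.0 / k) for j in range(k)]
        prototypes.append(row + [0.0] * (dim - k))
    return prototypes


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def etf_alignment_loss(text_feats, video_feats, labels, prototypes) -> float:
    """Hypothetical loss: mean (1 - cosine) of each feature to its class prototype."""
    total = 0.0
    for t, v, y in zip(text_feats, video_feats, labels):
        p = prototypes[y]
        total += (1.0 - cosine(t, p)) + (1.0 - cosine(v, p))
    return total / (2 * len(labels))


# Toy usage: 4 categories in an 8-dimensional embedding space.
prototypes = simplex_etf(num_classes=4, dim=8)
print(cosine(prototypes[0], prototypes[1]))  # close to -1/3

# Features that coincide with their prototypes give a loss close to 0.
labels = [0, 1, 2, 3]
print(etf_alignment_loss(prototypes, prototypes, labels, prototypes))
```

Because the prototypes are fixed in advance and shared by both modalities, text and video features for the same category are pulled toward a common target, which matches the role of the geometric prior described above.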

📊 Datasets and Protocols

🎬 MSRVTT

Video clips: 10,000 short videos
Type: Short video–text paired dataset

🎥 ACTNET

Video clips: ~20,000 long, untrimmed videos
Type: Long-form activity recognition dataset

⚙️ Evaluation Protocol

We follow the CTVR protocol proposed in StableFusion:

All categories are evenly divided into K tasks
Two settings are used:
- K = 10
- K = 20

Dataset	#Category	Shot/Category	#Task
MSRVTT	20	16	K=10,20
ACTNET	200	16	K=10,20
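
The even split into K tasks can be sketched as follows. This is a minimal illustration, not the protocol code: the helper name `split_categories`, the placeholder category names, and the seeded shuffle are assumptions, since the README does not state how categories are ordered before being divided.

```python
import random
from typing import List, Sequence


def split_categories(categories: Sequence[str], num_tasks: int, seed: int = 0) -> List[List[str]]:
    """Evenly divide categories into `num_tasks` disjoint groups."""
    if num_tasks <= 0 or len(categories) % num_tasks != 0:
        raise ValueError("number of categories must be divisible by num_tasks")
    order = list(categories)
    # Assumption: a seeded shuffle fixes the category order reproducibly.
    random.Random(seed).shuffle(order)
    per_task = len(order) // num_tasks
    return [order[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]


# MSRVTT has 20 categories: K=10 gives 2 per task, K=20 gives 1 per task.
msrvtt_categories = [f"category_{i}" for i in range(20)]  # placeholder names
tasks_k10 = split_categories(msrvtt_categories, num_tasks=10)
tasks_k20 = split_categories(msrvtt_categories, num_tasks=20)
print(len(tasks_k10), len(tasks_k10[0]))
print(len(tasks_k20), len(tasks_k20[0]))
```

With the 200 ACTNET categories, the same arithmetic gives 20 categories per task for K = 10 and 10 per task for K = 20.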

⚙️ Installation

Install the dependencies needed to run the code with Python 3.x.

First, create and activate a new conda environment, then install the requirements:

pip install -r requirements.txt

🔄 Data Processing

🎬 MSRVTT

# Download MSRVTT data
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip MSRVTT.zip -d datasets/MSRVTT

# Place videos in:
datasets/MSRVTT/MSRVTT_Videos

# Process video frames
python datasets/utils/process_msrvtt.py

🎥 ACTNET

# Download the ActivityNet data from the official website:
http://activity-net.org/download.html

# Save it in:
datasets/ACTNET

# Place videos in:
datasets/ACTNET/Activity_Videos

# Process video clips
python datasets/utils/process_actnet.py

🏃 Training

After downloading the datasets you need, run the following command to obtain the training samples used in the few-shot and easy-to-hard classification tasks.

sh run.sh

📈 Results

Results will be saved in log/.

🔍 Limitations

Predefined class number requirement:
This work constructs the Simplex ETF geometry under the assumption that the total number of classes is known in advance. Although this requirement can be mitigated in practice by setting a sufficiently large upper bound on the number of classes, it may introduce mild redundancy in the representation space and reduce parameter efficiency in some scenarios.
Additional training overhead from regularization terms:
The proposed cross-modal ETF alignment loss and cross-modal relation preserving loss are both regularization-based strategies. While they effectively improve anti-forgetting capability in continual learning, they also introduce a modest increase in training cost.
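
To make the first limitation more concrete, recall a standard property of the simplex ETF: K unit-norm prototypes have pairwise cosine -1/(K-1), and they need an embedding dimension of at least K-1. The worked numbers below assume a hypothetical upper bound of 100 classes, which is not a setting reported in this README, and compare it with the 20 MSRVTT categories listed above.

```latex
\cos\theta_{K} = -\frac{1}{K-1}, \qquad
\cos\theta_{20} = -\frac{1}{19} \approx -0.053 \ (\theta_{20} \approx 93.0^\circ), \qquad
\cos\theta_{100} = -\frac{1}{99} \approx -0.010 \ (\theta_{100} \approx 90.6^\circ)
```

Under this assumed bound, only 20 of the 100 reserved prototypes would ever be used, and the used prototypes would sit slightly closer together than in an ETF sized exactly for 20 classes. This is one way to read the mild redundancy mentioned above.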

📝 Citation

If you find our work useful for your research, please cite it:

@inproceedings{StructAlign,
author = {Wang, Shaokun and Guan, Weili and Han, Jizhou and Wu, Jianlong and Hu, Yupeng and Nie, Liqiang},
title = {StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval},
year = {2026},
booktitle = {Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {},
numpages = {}
}

🙏 Acknowledgments

We thank the following repositories for providing helpful functions used in our work.

StableFusion

POLO
