Mysteriousplayer/SIGIR26-StructAlign

SIGIR26-StructAlign

StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval [SIGIR 26]


👥 Authors

Shaokun Wang¹, Weili Guan¹*, Jizhou Han², Jianlong Wu¹, Yupeng Hu³, Liqiang Nie¹

¹ Harbin Institute of Technology (Shenzhen), ² Xi'an Jiaotong University, ³ Shandong University

* Corresponding author


🔗 Links


📢 Updates

  • [04/2026] Paper accepted at SIGIR 2026
  • [07/2026] Initial open-source release

📌 Overview

  • 🎯 Task: Continual Text-to-Video Retrieval (CTVR)
  • 🧠 Problem: Intra-modal drift + cross-modal misalignment
  • ⚙️ Key Idea: Structured cross-modal alignment via ETF geometry
  • 🚀 Result: Competitive with state-of-the-art continual retrieval approaches on MSRVTT & ACTNET

📖 Abstract

Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text–video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method shows competitive advantages over state-of-the-art continual retrieval approaches.
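
The simplex ETF prior mentioned above can be illustrated with a small sketch. The construction is the standard one, M = sqrt(C/(C-1)) · U · (I − 11ᵀ/C) for C classes, which yields unit-norm prototypes whose pairwise cosine is −1/(C−1). The choice of U, the function names, and the cosine-based alignment loss are assumptions made for illustration; they are not taken from the released code, and the paper's loss may differ. Pure Python is used only to keep the example dependency-free.

```python
import math


def simplex_etf(num_classes, dim):
    """Return num_classes unit-norm prototypes (as rows) forming a simplex ETF.

    Assumption: U is the first num_classes standard basis vectors of R^dim,
    so this particular construction needs dim >= num_classes.
    """
    if num_classes < 2 or dim < num_classes:
        raise ValueError("need 2 <= num_classes <= dim")
    n = num_classes
    scale = math.sqrt(n / (n - 1))
    protos = []
    for i in range(n):
        # Row i of sqrt(n/(n-1)) * (I - 11^T/n), zero-padded up to dim.
        row = [scale * ((1.0 if i == j else 0.0) - 1.0 / n) for j in range(n)]
        protos.append(row + [0.0] * (dim - n))
    return protos


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def etf_alignment_loss(text_feats, video_feats, labels, protos):
    """Hypothetical alignment loss: mean (1 - cos) to the class prototype,
    averaged over both modalities."""
    total = 0.0
    for t, v, y in zip(text_feats, video_feats, labels):
        total += (1.0 - cosine(t, protos[y])) + (1.0 - cosine(v, protos[y]))
    return total / (2 * len(labels))
```

For example, `simplex_etf(4, 8)` gives four prototypes in an 8-dimensional space with pairwise cosine −1/3, and the sketched loss is zero when every text and video feature coincides with its class prototype.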


🏗️ Framework

  • We propose StructAlign, a structured cross-modal alignment framework for CTVR, which explicitly models and mitigates catastrophic forgetting induced by both intra-modal feature drift and non-cooperative feature drift across modalities.

  • We introduce a simplex ETF geometric prior together with a cross-modal ETF alignment loss to enforce a well-separated category-level structure in the shared embedding space. In addition, we design a cross-modal relation preserving loss that leverages cross-modal similarity relations to constrain intra-modal feature updates during continual learning.
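
One plausible reading of the cross-modal relation preserving idea can be sketched as follows: the text-to-video similarity matrix produced by the previous-task model serves as a reference, and features from the updated model are penalized when their similarities to the other modality's old features deviate from it. The function names, the use of old features as anchors, and the squared-error penalty are all assumptions for illustration; the paper's actual formulation may differ.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(text_feats, video_feats):
    """Cosine similarity between every text feature and every video feature."""
    t = [_normalize(x) for x in text_feats]
    v = [_normalize(x) for x in video_feats]
    return [[sum(a * b for a, b in zip(ti, vj)) for vj in v] for ti in t]


def relation_preserving_loss(old_text, old_video, new_text, new_video):
    """Hypothetical loss: mean squared deviation of cross-modal similarities
    from the reference relations computed with the old model's features."""
    ref = similarity_matrix(old_text, old_video)
    # Each modality's new features are compared against the *other*
    # modality's old features, which act as a fixed anchor.
    text_side = similarity_matrix(new_text, old_video)
    video_side = similarity_matrix(old_text, new_video)
    rows, cols = len(ref), len(ref[0])
    err = 0.0
    for i in range(rows):
        for j in range(cols):
            err += (text_side[i][j] - ref[i][j]) ** 2
            err += (video_side[i][j] - ref[i][j]) ** 2
    return err / (2 * rows * cols)
```

Under this sketch the loss is zero when the updated features reproduce the old cross-modal relations exactly, and it grows as either modality drifts relative to the other.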


📊 Datasets and Protocols

🎬 MSRVTT

  • Video clips: 10,000 short videos
  • Type: Short video–text paired dataset

🎥 ACTNET

  • Video clips: ~20,000 long, untrimmed videos
  • Type: Long-form activity recognition dataset

⚙️ Evaluation Protocol

We follow the CTVR protocol proposed in StableFusion:

  • All categories are evenly divided into K tasks
  • Two settings are used:
    • K = 10
    • K = 20

| Dataset | #Category | Shot/Category | #Task (K) |
|---------|-----------|---------------|-----------|
| MSRVTT  | 20        | 16            | 10, 20    |
| ACTNET  | 200       | 16            | 10, 20    |
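
The even split of categories into K tasks can be sketched as below. The function name is hypothetical, and assigning categories to tasks in index order is an assumption; the actual protocol may use a different (for example shuffled) category order.

```python
def split_into_tasks(num_categories, num_tasks):
    """Evenly partition category indices 0..num_categories-1 into num_tasks
    consecutive groups. Assumes an exact division, as in the table above."""
    if num_tasks <= 0 or num_categories % num_tasks != 0:
        raise ValueError("num_categories must be divisible by num_tasks")
    per_task = num_categories // num_tasks
    return [
        list(range(t * per_task, (t + 1) * per_task))
        for t in range(num_tasks)
    ]


if __name__ == "__main__":
    # Categories per task for each dataset and setting.
    for name, n_cat in (("MSRVTT", 20), ("ACTNET", 200)):
        for k in (10, 20):
            tasks = split_into_tasks(n_cat, k)
            print(name, "K =", k, "categories per task:", len(tasks[0]))
```

With these numbers, MSRVTT yields 2 categories per task at K = 10 and 1 at K = 20, while ACTNET yields 20 and 10 respectively.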

⚙️ Installation

Install the requirements needed to run the code with Python 3.x.

First, create and activate a new conda environment, then install the dependencies:

pip install -r requirements.txt

🔄 Data Processing

🎬 MSRVTT

# Download MSRVTT data
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip MSRVTT.zip -d datasets/MSRVTT

# Place videos in:
datasets/MSRVTT/MSRVTT_Videos

# Process video frames
python datasets/utils/process_msrvtt.py

🎥 ACTNET

# Download ActivityNet data from official website
http://activity-net.org/download.html

# Save it in:
datasets/ACTNET

# Place videos in:
datasets/ACTNET/Activity_Videos

# Process video clips
python datasets/utils/process_actnet.py
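
Before running the processing scripts, it can help to confirm that the videos are in the directories listed above. The helper below is a hypothetical convenience check, not part of the repository; it only tests that the two video directories named in this section exist.

```python
from pathlib import Path

# Video directories named in the data processing steps above.
EXPECTED_DIRS = (
    "datasets/MSRVTT/MSRVTT_Videos",
    "datasets/ACTNET/Activity_Videos",
)


def missing_dataset_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing video directories:", ", ".join(missing))
    else:
        print("All expected video directories are present.")
```

If only one dataset is used, the other directory will be reported as missing, which is harmless.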

🏃 Training

After downloading the datasets you need, use the following command to obtain the training samples used in the few-shot and easy-to-hard classification tasks.

sh run.sh

📈 Results

Results will be saved in log/.


🔍 Limitations

  • Predefined class number requirement:
    This work constructs the Simplex ETF geometry under the assumption that the total number of classes is known in advance. Although this requirement can be mitigated in practice by setting a sufficiently large upper bound on the number of classes, it may introduce mild redundancy in the representation space and reduce parameter efficiency in some scenarios.

  • Additional training overhead from regularization terms:
    The proposed cross-modal ETF alignment loss and cross-modal relation preserving loss are both regularization-based strategies. While they effectively improve anti-forgetting capability in continual learning, they also introduce a modest increase in training cost.
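
To make the first limitation more concrete, recall a standard property of the simplex ETF (stated here as background, not as a result from the paper): when the geometry is built for an upper bound of C_max classes, any two prototypes m_i and m_j have the same pairwise cosine, which depends only on C_max. The values C = 20 and C_max = 100 below are hypothetical and chosen only for illustration.

```latex
\cos\angle(\mathbf{m}_i, \mathbf{m}_j) = -\frac{1}{C_{\max} - 1}, \quad i \neq j,
\qquad \text{e.g.} \quad
-\frac{1}{19} \approx -0.053 \;\; (C_{\max} = C = 20)
\quad \text{vs.} \quad
-\frac{1}{99} \approx -0.010 \;\; (C_{\max} = 100).
```

A larger bound therefore moves the prototypes closer to mutual orthogonality, and prototypes reserved for classes that never appear occupy directions of the embedding space that remain unused.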


📝 Citation

If you find our work useful for your research, please cite it:

@inproceedings{StructAlign,
author = {Wang, Shaokun and Guan, Weili and Han, Jizhou and Wu, Jianlong and Hu, Yupeng and Nie, Liqiang},
title = {StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval},
year = {2026},
booktitle = {Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {},
numpages = {}
}

🙏 Acknowledgments

We thank the following repositories for providing helpful functions used in our work.

StableFusion

POLO
