ICML 2025
PhyGenBench proposes a comprehensive benchmark evaluating physical commonsense in video generation models, towards building world simulators.
VLM & WAM Post-Training
I am a research intern at Ant Group working on post-training for VLM and WAM (World Action Model).
My main interests are video generation for egocentric embodied scenarios and long-video understanding, along with WAM and RL post-training.
NeurIPS 2025
VideoREPA learns physical knowledge for video generation via relational alignment with foundation models.
Preprint
Gym-V provides a unified environment for agentic vision research, enabling systematic evaluation of vision-language models.
ICCV 2025
LangBridge proposes interpreting images as combinations of language embeddings, bridging vision and language representations.
Preprint · ⭐ 100+ Stars
WISE introduces a world knowledge-informed semantic evaluation framework for text-to-image generation.
Preprint
SRUM proposes a fine-grained self-rewarding mechanism for training unified multimodal models.
Preprint
UniSandbox investigates the relationship between understanding and generation capabilities in unified multimodal models.
A benchmark for multi-day, multimodal coworker agents with living-world tasks and rule-based scoring.