MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
arXiv:2510.07915v2 Announce Type: replace
Abstract: The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs)...