  • Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
  • AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks
  • MyVLM: Personalizing VLMs for User-Specific Queries
  • InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
  • VidLA: Video-Language Alignment at Scale
  • SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
  • AllHands: Ask Me Anything on Large-scale Verbatim Feedback via Large Language Models
  • DreamReward: Text-to-3D Generation with Human Preference
  • Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
  • LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
  • Evolutionary Optimization of Model Merging Recipes
  • SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
  • When Do We Not Need Larger Vision Models?
  • HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
  • ZigMa: Zigzag Mamba Diffusion Model
  • DepthFM: Fast Monocular Depth Estimation with Flow Matching
  • mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
  • AnimateDiff-Lightning: Cross-Model Diffusion Distillation
  • Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers