LLaVA combines with Magvit Image tokenizer, training MLLM without an Vision Encoder. Unifying image understanding and generation. - View it on GitHub
Star
35
Rank
591303