Large language models (LLMs) can learn to generate visual images directly as compressed files (JPEG format).
A new study has introduced JPEG-LM and AVC-LM, two large language models (LLMs) that generate images and videos directly as compressed files in the canonical JPEG and H.264/AVC formats. By modeling the raw bytes of these codecs with a standard autoregressive architecture, rather than pixels or learned visual tokens, these models are set to change the way we create and process visual content.
The authors of the paper trained JPEG-LM on 23 million 256x256 images encoded as JPEG with a quality factor of 25 and "4:2:0" subsampling. This resulted in 114 billion JPEG tokens for each epoch, with an average of 5,000 tokens per image. AVC-LM, on the other hand, was trained on 2 million 256x144 videos (15 frames each at 3 frames per second) encoded with H.264/AVC using a constant quantization parameter of 37, producing 42 billion AVC tokens with an average of 15,000 tokens per video.
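To make that data pipeline concrete, here is a minimal sketch in Python (using Pillow) of how an image might be serialized to JPEG bytes and mapped to a token sequence. The function name and the one-token-per-byte mapping are illustrative assumptions; the paper's exact vocabulary construction may differ.

    from io import BytesIO
    from PIL import Image

    def image_to_jpeg_tokens(path):
        """Serialize an image to JPEG bytes and map each byte to a token id.

        Encoding settings mirror the paper's preprocessing (quality factor
        25, 4:2:0 chroma subsampling); the byte-to-token mapping is a
        simplifying assumption for illustration.
        """
        img = Image.open(path).convert("RGB").resize((256, 256))
        buf = BytesIO()
        # In Pillow, subsampling=2 selects 4:2:0.
        img.save(buf, format="JPEG", quality=25, subsampling=2)
        return list(buf.getvalue())  # one token id (0-255) per byte

    tokens = image_to_jpeg_tokens("example.png")  # any local image file
    print(len(tokens))  # on the order of 5,000 for a 256x256 image

The resulting byte sequence is what the language model is trained to predict autoregressively, just as it would predict text tokens.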
In the evaluation, the authors used a zero-shot image completion task on the ImageNet-1K (5,000 samples) and FFHQ (1,000 samples) datasets, with the Fréchet Inception Distance (FID) as the primary metric. JPEG-LM achieved lower (better) FID scores than all baselines on both datasets, including a VQ transformer, with its largest margins on long-tail visual elements.
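For reference, FID measures the distance between Gaussian fits to Inception-network features of the real and generated image sets, where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature means and covariances of the real and generated images respectively:

    \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
                 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

Lower is better; the score penalizes both mean shifts and covariance mismatches in feature space.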
The authors hypothesize that the method is effective because the non-neural, fixed JPEG compression preserves important visual details that learned VQ encodings tend to lose. This hypothesis was further supported by a statistically significant correlation between JPEG-LM's performance advantage and the rarity of the image class.
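As an illustration of what such a correlation check could look like, here is a small sketch with synthetic numbers; the arrays below are invented for the example and are not the paper's data, and the paper's exact test statistic is not reproduced here.

    from scipy.stats import spearmanr

    # Synthetic illustration only -- not the paper's data.
    # fid_gap[i]: JPEG-LM's per-class FID advantage over the VQ baseline.
    # class_count[i]: how often class i appears in the training distribution.
    fid_gap = [4.1, 3.2, 2.5, 1.8, 1.1, 0.6]
    class_count = [120, 450, 900, 2400, 8000, 20000]

    rho, p = spearmanr(fid_gap, class_count)
    print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
    # A negative rho means the rarer the class, the larger the advantage.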
The paper's key contribution is demonstrating that standard LLM architectures can learn to model and generate canonical visual file encodings without any vision-specific modifications, and that this approach outperforms pixel-based and vector-quantization baselines on image generation tasks across multiple datasets.
The use of compressed representations also cuts sequence length and computational overhead relative to modeling raw pixels: a 256x256 RGB image is 196,608 pixel values, while its JPEG encoding here averages about 5,000 tokens, roughly a 40x reduction. This efficiency opens up new possibilities for unified multimodal AI systems, accelerating progress in areas like multimodal reasoning, visual storytelling, and open-ended video generation.
However, the paper also discusses limitations: the byte sequences are longer than VQ token sequences, which raises scalability concerns, and generation is not yet easily controllable. Open questions remain about the approach's scalability, flexibility, and applicability to visual understanding tasks. Future research directions include exploring different codec choices, evaluating the models on visual understanding tasks, developing methods for controlled generation, and adaptive compression techniques.
The integration of compressed visual data with large language models paves the way for more natural, unified AI systems capable of understanding and generating multimodal content at scale. This advancement holds promise for enhancing content creation, improving video understanding and summarization, and driving progress in various sectors such as media, gaming, virtual reality, education, surveillance, entertainment, autonomous driving, robotics, and augmented reality.
With JPEG-LM and AVC-LM, AI-driven technology takes a new step: large language models that generate images and videos directly in compressed formats. Trained on vast amounts of visual content, these models outperform existing baselines, particularly in generating long-tail visual elements, and their use of standard compressed representations keeps computational overhead low, pointing toward efficient, unified systems for image and video generation.