Llama 3.2 90B Vision Instruct

Experience the cutting edge of AI with Llama 3.2 90B Vision Instruct. This 90B-parameter multimodal LLM excels at image understanding, reasoning, captioning, and more.


Pricing

Serverless Pricing

Buy credits that can be used anywhere on Segmind

Input: $1.20 per million tokens. Output: $1.20 per million tokens.
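As a rough illustration of the per-token pricing, the sketch below estimates the charge for a single request. The function name and the example token counts are hypothetical, and it assumes billing is strictly linear in token count. How image inputs are converted to billable tokens is not stated here, so the sketch leaves that out.

```python
# Rates taken from the pricing line: USD per one million tokens.
INPUT_USD_PER_MILLION = 1.20
OUTPUT_USD_PER_MILLION = 1.20


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the charge in USD for one request (hypothetical helper).

    Assumes cost scales linearly with token counts, with no minimum fee.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 2,000 input tokens and 500 output tokens.
    print(f"${estimate_cost_usd(2_000, 500):.4f}")  # prints $0.0030
```

Because input and output are priced identically, the estimate depends only on the total number of tokens in a request.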

Llama 3.2-90B Vision-Instruct

Llama 3.2-90B Vision-Instruct is a multimodal large language model (LLM) developed by Meta. It processes both text and image inputs, with a focus on image understanding and reasoning.

Key Features of Llama 3.2-90B Vision-Instruct

  • Parameter Count: Nominally 90 billion parameters (88.8 billion by exact count).

  • Input Modalities: Supports text and image inputs, enabling versatile applications.

  • Output Modality: Generates text outputs, making it suitable for a wide range of tasks.

  • Architecture: Built upon the Llama 3.1 text-only model, enhanced with a vision adapter. The vision adapter employs cross-attention layers to integrate image encoder representations into the core LLM.

  • Context Length: Supports a context length of 128k tokens.
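To make the architecture point concrete, here is a toy sketch of single-head cross-attention, in which text token states act as queries and image encoder outputs supply the keys and values. This is a generic textbook formulation written in plain Python for illustration, not Meta's implementation. The dimensions, random weights, and function names are all assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, image_states, w_q, w_k, w_v):
    """Single-head cross-attention: text tokens attend to image features.

    Returns (attended, weights), where attended has one row per text token
    and weights[i][j] is how much text token i attends to image feature j.
    """
    q = matmul(text_states, w_q)    # queries come from the text side
    k = matmul(image_states, w_k)   # keys come from the image encoder
    v = matmul(image_states, w_v)   # values come from the image encoder
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)         # shape: (num_text, num_image)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


if __name__ == "__main__":
    rng = random.Random(0)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    text = rand_matrix(3, 4)    # 3 text tokens, hidden size 4 (toy values)
    image = rand_matrix(5, 4)   # 5 image patch features, hidden size 4
    w_q, w_k, w_v = (rand_matrix(4, 4) for _ in range(3))
    attended, weights = cross_attention(text, image, w_q, w_k, w_v)
    print(len(attended), len(attended[0]))  # prints 3 4
```

Each text token receives a weighted mix of image features, which is the basic mechanism by which visual information can reach a language model that was originally text-only.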

Technical Specifications

  • Training Data: Trained on a dataset of 6 billion image-text pairs.

  • Data Cutoff: The pretraining data has a cutoff of December 2023.

  • Instruction Tuning: Fine-tuned on publicly available vision instruction datasets and over 3 million synthetically generated examples, combining supervised fine-tuning (SFT) with reinforcement learning from human feedback (RLHF).

Intended Use Cases

The model is optimized for visual recognition, image reasoning, captioning, and answering questions about images.
