Qwen2 VL 72B Instruct

Qwen2-VL-72B-Instruct is a state-of-the-art multimodal model excelling in image and video understanding, with advanced capabilities for text-based interaction.


API

If you're looking for an API, you can use the sample request below in your preferred programming language.

POST
const axios = require('axios');
const fs = require('fs');
const path = require('path');

// helper function to help you convert your local images into base64 format
async function toB64(imgPath) {
  const data = fs.readFileSync(path.resolve(imgPath));
  return Buffer.from(data).toString('base64');
}

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/qwen2-vl-72b-instruct";

const data = {
  "messages": [
    {
      "role": "user",
      "content": "tell me a joke on cats"
    },
    {
      "role": "assistant",
      "content": "here is a joke about cats..."
    },
    {
      "role": "user",
      "content": "now a joke on dogs"
    }
  ]
};

(async function() {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    console.error('Error:', error.response.data);
  }
})();
RESPONSE
application/json
HTTP Response Codes
200 - OK: Response generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had an issue processing the request
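The sample request above only logs error.response.data; if you want to branch on the status codes listed here, a minimal sketch using the axios error object might look like this (the message strings simply restate the table above):

// Map the documented status codes to log messages.
// Assumes `error` was thrown by the axios.post call in the sample above.
function describeApiError(error) {
  if (!error.response) {
    return `Network error: ${error.message}`;
  }
  switch (error.response.status) {
    case 401: return 'Unauthorized: check your x-api-key';
    case 404: return 'Not Found: the requested URL does not exist';
    case 405: return 'Method Not Allowed: this endpoint expects POST';
    case 406: return 'Not Acceptable: not enough credits';
    case 500: return 'Server Error: the server had an issue processing the request';
    default:  return `Unexpected status ${error.response.status}`;
  }
}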

Attributes


messages (array)

An array of message objects, each containing a role and a content field.


role (str)

One of "user", "assistant", or "system".


content (str)

A string containing the user's query or the assistant's response.
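The sample request above only shows text turns, while the toB64 helper suggests local images are sent as base64 strings. Below is a minimal sketch of a single-turn image question; note that the "image" field name is an assumption for illustration and is not documented on this page, so check the full API reference for the actual schema.

const axios = require('axios');
const fs = require('fs');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/qwen2-vl-72b-instruct";

(async () => {
  // Read a local image and base64-encode it, as the toB64 helper above does
  const imageB64 = fs.readFileSync('./example.png').toString('base64');

  const data = {
    messages: [
      { role: "user", content: "Describe this image" }
    ],
    // ASSUMPTION: the field name "image" is illustrative only; this page
    // does not document how image payloads are attached to the request.
    image: imageB64
  };

  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();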

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
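For example, with the axios call from the sample above (reusing its url, data, and api_key variables), the header can be read from the response object; axios exposes response headers with lowercased names:

(async () => {
  const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
  // axios normalizes header names to lowercase
  console.log('Remaining credits:', response.headers['x-remaining-credits']);
})();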

Qwen2-VL-72B-Instruct

Qwen2-VL-72B-Instruct is an advanced image-text-to-text model designed for a wide range of visual understanding and reasoning tasks. This model is a significant upgrade from the previous Qwen-VL, incorporating several key enhancements.

Key Features of Qwen2-VL-72B-Instruct

  • Superior Image Understanding: Qwen2-VL achieves state-of-the-art performance on various visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. It demonstrates strong capabilities in processing images with different resolutions and aspect ratios.

  • Agent Capabilities: Qwen2-VL can be integrated with devices like mobile phones and robots for automatic operation based on visual environment and text instructions, demonstrating complex reasoning and decision-making skills.

  • Multilingual Support: Beyond English and Chinese, the model supports understanding text within images in many languages, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

  • Dynamic Resolution Handling: Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens for a more human-like visual processing experience.

  • Advanced Positional Embedding: The model uses Multimodal Rotary Position Embedding (M-ROPE) to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

Technical Specifications

  • Model Architecture: The model employs a large-scale transformer architecture with 72 billion parameters.

  • Resolution Flexibility: The model can process a wide range of image resolutions; its computational cost can be tuned by setting minimum and maximum pixel counts to match specific hardware, or by resizing images to an exact width and height.

Limitations

  • The model has limitations in recognizing specific individuals or intellectual property.

  • It may struggle with complex, multi-step instructions.

  • Counting accuracy is not high in complex scenes.

  • Spatial reasoning skills, especially in 3D spaces, require further improvements.