Qwen2 VL 72B Instruct

Qwen2-VL-72B-Instruct is a state-of-the-art multimodal model excelling in image and video understanding, with advanced capabilities for text-based interaction.


API

To call the model through the API, choose your preferred programming language.

POST
const axios = require('axios');
const fs = require('fs');
const path = require('path');

// Helper function that converts a local image file into a base64 string
async function toB64(imgPath) {
  const data = fs.readFileSync(path.resolve(imgPath));
  return Buffer.from(data).toString('base64');
}

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/qwen2-vl-72b-instruct";

const data = {
  "messages": [
    { "role": "user", "content": "tell me a joke on cats" },
    { "role": "assistant", "content": "here is a joke about cats..." },
    { "role": "user", "content": "now a joke on dogs" }
  ]
};

(async function() {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined when no HTTP response was received (for example, a network failure)
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
RESPONSE
application/json
HTTP Response Codes
200 - OK: Response generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing
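
The status codes listed above can be turned into readable messages on the client side. The following is a minimal sketch: the names STATUS_MEANINGS, describeStatus, and isRetryable are hypothetical and not part of the API, and treating only 500 as worth retrying is an assumption rather than documented behavior.

```javascript
// Meanings copied from the HTTP response code list above.
const STATUS_MEANINGS = {
  200: 'Response generated',
  401: 'User authentication failed',
  404: 'The requested URL does not exist',
  405: 'The requested HTTP method is not allowed',
  406: 'Not enough credits',
  500: 'Server had some issue with processing',
};

// Hypothetical helper: map a status code to a human-readable message.
function describeStatus(status) {
  return STATUS_MEANINGS[status] || `Unexpected status ${status}`;
}

// Hypothetical helper: assumes only a server-side failure (500) is worth retrying.
// The other listed errors need a change on the caller's side (key, URL, method, credits).
function isRetryable(status) {
  return status === 500;
}
```

In the axios example, describeStatus(error.response.status) could be logged inside the catch block whenever error.response is defined.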

Attributes


messages (Array)

An array of objects containing the role and content


role (str)

One of "user", "assistant", or "system".


content (str)

A string containing the user's query or the assistant's response.
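
Checking the request body against the attributes above before sending it can catch mistakes without spending an API call. This is a sketch only: validateMessages and ALLOWED_ROLES are hypothetical names, and the checks cover just what is documented here (an array of objects with a role and a string content). The server may accept or reject more than this.

```javascript
// Roles listed in the attribute description above.
const ALLOWED_ROLES = new Set(['user', 'assistant', 'system']);

// Hypothetical helper: returns a list of problems; an empty list means the
// payload matches the documented shape.
function validateMessages(messages) {
  if (!Array.isArray(messages)) {
    return ['messages must be an array'];
  }
  const problems = [];
  messages.forEach((msg, i) => {
    if (msg === null || typeof msg !== 'object') {
      problems.push(`messages[${i}] must be an object`);
      return;
    }
    if (!ALLOWED_ROLES.has(msg.role)) {
      problems.push(`messages[${i}].role must be "user", "assistant" or "system"`);
    }
    if (typeof msg.content !== 'string') {
      problems.push(`messages[${i}].content must be a string`);
    }
  });
  return problems;
}
```

Calling validateMessages(data.messages) on the payload from the request example returns an empty array, since every entry has a listed role and a string content.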

To track your credit usage, inspect the response headers of each API call. The x-remaining-credits header reports the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
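
Reading the credit header might look like the sketch below. The function name remainingCredits is hypothetical, and two things are assumed rather than documented: that header names arrive lowercased (as they do with axios on Node.js), and that the header value parses as a number.

```javascript
// Hypothetical helper: extract the remaining credit count from response headers.
// Assumes lowercased header names and a numeric value.
// Returns null when the header is missing or is not a number.
function remainingCredits(headers) {
  const raw = headers ? headers['x-remaining-credits'] : undefined;
  if (raw === undefined || raw === null || raw === '') {
    return null;
  }
  const value = Number(raw);
  return Number.isNaN(value) ? null : value;
}
```

With the axios example, this would be called as remainingCredits(response.headers) right after the request resolves.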

Qwen2-VL-72B-Instruct

Qwen2-VL-72B-Instruct is an advanced image-text-to-text model designed for a wide range of visual understanding and reasoning tasks. It is a significant upgrade over the previous Qwen-VL and incorporates several key enhancements.

Key Features of Qwen2-VL-72B-Instruct

  • Superior Image Understanding: Qwen2-VL achieves state-of-the-art performance on various visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. It demonstrates strong capabilities in processing images with different resolutions and aspect ratios.

  • Agent Capabilities: Qwen2-VL can be integrated with devices like mobile phones and robots for automatic operation based on visual environment and text instructions, demonstrating complex reasoning and decision-making skills.

  • Multilingual Support: Beyond English and Chinese, the model supports understanding text within images in many languages, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

  • Dynamic Resolution Handling: Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens for a more human-like visual processing experience.

  • Advanced Positional Embedding: The model uses Multimodal Rotary Position Embedding (M-RoPE) to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

Technical Specifications

  • Model Architecture: The model employs a large-scale transformer architecture with 72 billion parameters.

  • Resolution Flexibility: The model can process a range of image resolutions. Its computational requirements can be adjusted by setting minimum and maximum pixel counts, which helps tune performance for specific hardware. Images can also be resized to a specific width and height.
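
The idea of a minimum and maximum pixel count can be illustrated with a small resizing calculation. This sketch is not the model's actual preprocessing, which is not described here. fitToPixelBudget is a hypothetical helper that scales an image's dimensions, keeping the aspect ratio, so that the total pixel count falls inside a chosen budget.

```javascript
// Hypothetical helper: scale (width, height) so that width * height lies
// between minPixels and maxPixels, preserving the aspect ratio.
function fitToPixelBudget(width, height, minPixels, maxPixels) {
  if (!(width > 0 && height > 0)) {
    throw new RangeError('width and height must be positive');
  }
  if (!(minPixels > 0 && minPixels <= maxPixels)) {
    throw new RangeError('expected 0 < minPixels <= maxPixels');
  }
  const pixels = width * height;
  let scale = 1;
  if (pixels > maxPixels) {
    scale = Math.sqrt(maxPixels / pixels); // shrink to fit the upper bound
  } else if (pixels < minPixels) {
    scale = Math.sqrt(minPixels / pixels); // enlarge to reach the lower bound
  }
  // Round down when shrinking and up when enlarging, so rounding never
  // pushes the result back across the bound that was being enforced.
  const round = scale < 1 ? Math.floor : Math.ceil;
  return {
    width: Math.max(1, round(width * scale)),
    height: Math.max(1, round(height * scale)),
  };
}
```

A lower maxPixels means fewer pixels to process per image, trading detail for compute, while a higher minPixels keeps small images from losing detail.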

Limitations

  • The model has limited ability to recognize specific individuals or intellectual property.

  • It may struggle with complex, multi-step instructions.

  • Counting accuracy is not high in complex scenes.

  • Spatial reasoning skills, especially in 3D spaces, require further improvements.
