Qwen: Qwen3 VL 32B Instruct

Description

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text comprehension, enabling fine-grained spatial reasoning, document and scene analysis, and long-horizon video understanding. It also offers robust OCR in 32 languages and enhanced multimodal fusion through the Interleaved-MRoPE and DeepStack architectures. Optimized for agentic interaction and visual tool use, Qwen3-VL-32B delivers state-of-the-art performance for complex real-world multimodal tasks.

How this model compares

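"Overall" compares this model against the full catalog. "By plan" compares it only against models available on that tier (the same rules that determine the available models in your list). Each chart shows the model's position between the group minimum, average, and maximum. Prices use the higher of the prompt or completion price per token, shown per 1M tokens.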
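The pricing rule above can be sketched in a few lines. This is a minimal illustration, not the catalog's actual implementation: the function name and the per-token prices in the example are hypothetical, chosen only to show the "higher of prompt or completion, scaled to 1M tokens" arithmetic.

```python
def comparison_price_per_million(prompt_per_token: float,
                                 completion_per_token: float) -> float:
    """Return the higher of the two per-token prices, scaled to 1M tokens."""
    return max(prompt_per_token, completion_per_token) * 1_000_000


# Hypothetical per-token prices (not this model's real rates).
example_price = comparison_price_per_million(
    prompt_per_token=0.0000002,
    completion_per_token=0.0000005,
)
print(f"${example_price:.3f} / 1M tokens")
```

Taking the maximum means a model with cheap prompts but expensive completions is ranked by its more expensive side.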

Price (per 1M tokens)

336 models in this group

Min: $0.04
Avg: $12.571321
Max: $750.00

This model: $0.416 / 1M tokens

Context length (tokens)

336 models in this group

Min: 4,095 tokens
Avg: 398,336.839 tokens
Max: 2,000,000 tokens

This model: 131,072 tokens
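As a rough way to read these figures, the sketch below places each value on a 0-to-1 scale between the group minimum and maximum. It assumes a simple linear scale, which the page does not confirm (the real chart may use a different one), and the helper name `linear_position` is hypothetical.

```python
def linear_position(value: float, lo: float, hi: float) -> float:
    """Fraction of the way from lo to hi at which value sits (assumes a linear scale)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (value - lo) / (hi - lo)


# Figures taken from the listing above.
price_pos = linear_position(0.416, lo=0.04, hi=750.00)
context_pos = linear_position(131_072, lo=4_095, hi=2_000_000)

print(f"price position:   {price_pos:.4%}")
print(f"context position: {context_pos:.4%}")
```

Under that assumption both values sit near the low end of their ranges, which matches both being below the group averages listed above.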
