Qwen2.5-VL

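Introduction

In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and given us valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.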

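Key enhancements:

Visual understanding: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agentic capability: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tool use, including operating computers and phones.
Long-video understanding and event capture: Qwen2.5-VL can comprehend videos longer than one hour, and it gains the new ability to capture events by pinpointing the relevant video segments.
Visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output: for scanned data such as invoices and tables, Qwen2.5-VL supports structured output of their contents, which is useful in finance, commerce, and similar fields.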
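As a rough illustration of consuming the JSON localization output described above, the sketch below parses a list of detections and computes each box area. The key names `bbox_2d` and `label`, the `[x1, y1, x2, y2]` corner order, and the sample reply are assumptions made for this example; check the actual model output before relying on them.

```python
import json


def parse_detections(raw: str) -> list:
    """Parse a JSON list of detections into dicts with a label, a box, and the box area."""
    detections = []
    for item in json.loads(raw):
        # Assumed key name and corner order: [x1, y1, x2, y2] in pixels.
        x1, y1, x2, y2 = item["bbox_2d"]
        detections.append(
            {
                "label": item.get("label", ""),
                "box": (x1, y1, x2, y2),
                "area": max(0, x2 - x1) * max(0, y2 - y1),
            }
        )
    return detections


# Hypothetical model reply, written by hand for illustration.
sample_reply = (
    '[{"bbox_2d": [10, 20, 110, 220], "label": "bird"},'
    ' {"bbox_2d": [0, 0, 50, 40], "label": "flower"}]'
)

for det in parse_detections(sample_reply):
    print(det["label"], det["box"], det["area"])
```

Keeping the parsing in one small function makes it easy to adjust if the real reply uses different keys or wraps the JSON in extra text.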

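Model architecture updates:

Dynamic resolution and frame-rate training for video understanding:

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, which lets the model comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, so that the model can learn temporal sequence and speed and ultimately acquire the ability to pinpoint specific moments.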
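To make the idea of absolute time alignment more concrete, here is a simplified sketch, not the model's actual implementation. It assigns each sampled frame a temporal ID proportional to its timestamp, so frames taken at the same moment get the same ID regardless of the sampling rate. The function name and the `ids_per_second` parameter are hypothetical.

```python
def temporal_position_ids(num_frames: int, fps: float, ids_per_second: float = 2.0) -> list:
    """Map each sampled frame to a temporal ID proportional to its timestamp in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Frame i is taken at time i / fps; the ID grows with time, not with the frame index.
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


# The same 4-second clip sampled at two different rates.
slow = temporal_position_ids(num_frames=5, fps=1.0)  # frames at 0, 1, 2, 3, 4 s
fast = temporal_position_ids(num_frames=9, fps=2.0)  # frames at 0, 0.5, ..., 4 s
print(slow)
print(fast)
```

With index-based IDs, the two samplings would disagree about where the 2-second mark sits. With time-based IDs, both place it at the same position, which is the property the paragraph above attributes to the updated mRoPE.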

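Streamlined and efficient vision encoder

We improve both training and inference speed by strategically introducing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.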
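The speedup from window attention can be estimated with a back-of-the-envelope count of query-key pairs in a single attention layer. The sketch below compares full attention over a patch grid with attention restricted to non-overlapping square windows. The 64x64 grid and the window size of 8 patches are illustrative assumptions, not values taken from this article.

```python
from typing import Optional


def attention_pairs(grid_h: int, grid_w: int, window: Optional[int] = None) -> int:
    """Count query-key pairs for full attention, or for non-overlapping window attention."""
    if window is None:
        n = grid_h * grid_w
        return n * n
    total = 0
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            # Windows on the bottom/right edge may be smaller than `window`.
            h = min(window, grid_h - top)
            w = min(window, grid_w - left)
            total += (h * w) ** 2
    return total


full = attention_pairs(64, 64)
windowed = attention_pairs(64, 64, window=8)
print(full, windowed, full // windowed)
```

Full attention grows quadratically with the number of patches, while window attention grows linearly once the window size is fixed. That is why it helps most on high-resolution inputs.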

We have four models, with 3, 7, 32, and 72 billion parameters. This repo contains the instruction-tuned 32B Qwen2.5-VL model. For more information, visit our blog and GitHub.

Evaluation

Vision

Dataset	Qwen2.5-VL-72B	Qwen2-VL-72B	Qwen2.5-VL-32B
MMMU	70.2	64.5	70
MMMU Pro	51.1	46.2	49.5
MMStar	70.8	68.3	69.5
MathVista	74.8	70.5	74.7
MathVision	38.1	25.9	40.0
OCRBenchV2	61.5/63.7	47.8/46.1	57.2/59.1
CC-OCR	79.8	68.7	77.1
DocVQA	96.4	96.5	94.8
InfoVQA	87.3	84.5	83.4
LVBench	47.3	–	49.00
CharadesSTA	50.9	–	54.2
VideoMME	73.3/79.1	71.2/77.8	70.5/77.9
MMBench-Video	2.02	1.7	1.93
AITZ	83.2	–	83.1
Android Control	67.4/93.7	66.4/84.4	69.6/93.3
ScreenSpot	87.1	–	88.5
ScreenSpot Pro	43.6	–	39.4
AndroidWorld	35	–	22.0
OSWorld	8.83	–	5.92

Text

Model	MMLU	MMLU-PRO	MATH	GPQA-diamond	MBPP	HumanEval
Qwen2.5-VL-32B	78.4	68.8	82.2	46.0	84.0	91.5
Mistral-Small-3.1-24B	80.6	66.8	69.3	46.0	74.7	88.4
Gemma3-27B-IT	76.9	67.5	89	42.4	74.4	87.8
GPT-4o-Mini	82.0	61.7	70.2	39.4	84.8	87.2
Claude-3.5-Haiku	77.6	65.0	69.2	41.6	85.6	88.1

Requirements

The code for Qwen2.5-VL is available in the latest Hugging Face transformers, and we advise you to build from source with the following command:


pip install git+https://github.com/huggingface/transformers accelerate

Otherwise, you might encounter the following error:


KeyError: 'qwen2_5_vl'

Quickstart

Below, we provide simple examples that show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

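We offer a toolkit that helps you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it with the following command: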

# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils[decord]==0.0.8

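If you are not on Linux, you might not be able to install decord from PyPI. In that case, you can use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.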
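If you want to check in advance which video path your environment would take, a small probe like the one below may help. It only tests whether a module can be imported. It is a sketch of the fallback idea described above, not the selection logic that qwen-vl-utils itself uses, and the function name is hypothetical.

```python
import importlib.util


def pick_video_backend(preferred: str = "decord", fallback: str = "torchvision") -> str:
    """Return the preferred backend name if its module is importable, else the fallback name."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


print(pick_video_backend())
```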

Using 🤗 Transformers to Chat

Here is a code snippet that shows how to use the chat model with transformers and qwen_vl_utils:

import torch  # only needed for the commented-out flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-32B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device that holds the model instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
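The trimming step in the snippet above exists because `generate` on a decoder-only model returns the prompt tokens followed by the newly generated ones. The toy example below shows the same slicing on plain Python lists. The token IDs are made up, and `trim_prompt` is a hypothetical helper name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs; the numbers are invented for illustration.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]
print(trim_prompt(prompts, outputs))
```

Without this step, decoding would return the prompt text along with the answer.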


More Usage Tips

For input images, we support local files, base64, and URLs. For videos, we currently support only local files.

# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
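For the base64 case, the string can be produced from raw image bytes with the standard library. The sketch below builds the `data:image;base64,...` form shown above and wraps it in a single-turn message. Both helper names are hypothetical, and the three bytes used as input are only a stand-in for real image data.

```python
import base64


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes in the `data:image;base64,...` form used in the messages above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image;base64,{encoded}"


def image_message(image_ref: str, prompt: str) -> list:
    """Build a single-turn message list with one image part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


jpeg_magic = b"\xff\xd8\xff"  # first bytes of a JPEG file, used here as stand-in data
messages = image_message(to_data_uri(jpeg_magic), "Describe this image.")
print(messages[0]["content"][0]["image"])
```

In practice you would read the bytes from a file opened in binary mode and pass them to `to_data_uri`.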

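Image Resolution for Performance Boost

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. You can set the minimum and maximum number of pixels to fit your needs, for example a token range of 256-1280, to balance speed and memory usage.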

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

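In addition, we provide two methods for fine-grained control over the image size fed to the model:

Define min_pixels and max_pixels: images are resized to keep their aspect ratio while keeping the pixel count between min_pixels and max_pixels.
Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.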

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
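To get a feel for how a pixel budget translates into visual tokens, the sketch below rescales an image size to multiples of 28 within a given budget and counts one token per 28x28 block, following the `256 * 28 * 28` arithmetic above. It approximates the resizing behaviour described in this section and is not guaranteed to match the toolkit's implementation. The function names are hypothetical.

```python
import math

PATCH = 28  # one visual token per 28x28 pixel block, per the min_pixels arithmetic above


def fit_to_pixel_budget(height: int, width: int, min_pixels: int, max_pixels: int):
    """Rescale (height, width) to multiples of 28 inside the pixel budget, roughly keeping aspect ratio."""
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


def visual_tokens(height: int, width: int) -> int:
    """Count 28x28 blocks in an image whose sides are multiples of 28."""
    return (height // PATCH) * (width // PATCH)


h, w = fit_to_pixel_budget(1080, 1920, 256 * 28 * 28, 1280 * 28 * 28)
print(h, w, visual_tokens(h, w))
print(visual_tokens(280, 420))  # the fixed-size example above
```

Running a few candidate image sizes through a helper like this is a cheap way to choose `min_pixels` and `max_pixels` before loading the model.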

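Processing Long Texts

The current config.json is set for a context length of up to 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that improves length extrapolation so that the model keeps performing well on long texts.

For supported frameworks, you can add the following to config.json to enable YaRN:

{ ..., "type": "yarn", "mrope_section": [16, 24, 24], "factor": 4, "original_max_position_embeddings": 32768 }

Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks, so it is not recommended.

For long video inputs, since MRoPE itself is more economical with ids, max_position_embeddings can instead be changed directly to a larger value, such as 64k.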
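If you prefer to patch the configuration programmatically, a sketch along the following lines could work. The YaRN field values come from the snippet above. The name of the section that receives them (`rope_scaling` here) is an assumption, because the snippet elides the surrounding keys. The `base` dict is a stand-in for a loaded config.json, and both function names are hypothetical.

```python
import copy
import json

# Field values as given in the config snippet above.
YARN_FIELDS = {
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
}


def with_yarn(config: dict, section: str = "rope_scaling") -> dict:
    """Return a copy of `config` with the YaRN fields merged into config[section].

    The default section name is an assumption; adjust it to match your config.json.
    """
    updated = copy.deepcopy(config)
    updated.setdefault(section, {}).update(copy.deepcopy(YARN_FIELDS))
    return updated


def nominal_context(fields: dict) -> int:
    """Nominal extended context length: scaling factor times the original length."""
    return fields["factor"] * fields["original_max_position_embeddings"]


base = {"max_position_embeddings": 32768}  # stand-in for a loaded config.json
print(json.dumps(with_yarn(base), indent=2))
print(nominal_context(YARN_FIELDS))
```

Because `with_yarn` works on a copy, the original configuration stays untouched. That makes it easy to compare runs with and without YaRN, given the localization caveat above.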

Citation

If you find our work helpful, feel free to cite us.

@article{Qwen2.5-VL,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}

Original article. If you repost it, please credit 诺德美地科技.
