Get started with AI Toolkit for Visual Studio Code

2025-05-30

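AI Toolkit for VS Code (AI Toolkit) is a VS Code extension that enables you to download, test, fine-tune, and deploy AI models with your apps or in the cloud. For more information, see the AI Toolkit overview.

Note

Additional documentation and tutorials for the AI Toolkit for VS Code are available in the VS Code documentation: AI Toolkit for Visual Studio Code. There you'll find guidance on the Playground, working with AI models, fine-tuning local and cloud-based models, and more.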

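In this article, you'll learn how to:

Install the AI Toolkit for VS Code
Download a model from the catalog
Run the model locally using the playground
Integrate an AI model into your application using REST or the ONNX Runtime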

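Prerequisites

VS Code must be installed. For more information, see Download VS Code and Getting started with VS Code.

When using AI features, we recommend that you review: Developing Responsible Generative AI Applications and Features on Windows.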

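Install

The AI Toolkit is available in the Visual Studio Marketplace and can be installed like any other VS Code extension. If you're unfamiliar with installing VS Code extensions, follow these steps:

In the Activity Bar in VS Code, select "Extensions"
In the Extensions search bar, type "AI Toolkit"
Select "AI Toolkit for Visual Studio Code"
Select "Install"

Once the extension is installed, the AI Toolkit icon appears in the Activity Bar.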

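Download a model from the catalog

The primary sidebar of the AI Toolkit is organized into "My Models", "Catalog", "Tools", and "Help and Feedback". The "Tools" section provides the Playground, Bulk Run, Evaluation, and Fine tuning features. To get started, select "Models" from the "Catalog" section to open the "Model Catalog" window:

Screenshot of the AI Toolkit model catalog window in VS Code

You can use the filters at the top of the catalog to filter by "Hosted by", "Publisher", "Tasks", and "Model type". There is also a "Fine-Tuning Support" switch that you can toggle to show only models that can be fine-tuned.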

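Tip

The Model type filter lets you show only models that run locally on CPU, GPU, or NPU, or models that support only remote access. For optimized performance on devices that have at least one GPU, select the "Local run w/ GPU" model type. This helps you find a model optimized for the DirectML accelerator.

To check whether your Windows device has a GPU, open Task Manager and select the Performance tab. If you have GPUs, they are listed under names such as "GPU 0" or "GPU 1".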

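Note

For Copilot+ PCs with a Neural Processing Unit (NPU), you can select models that are optimized for the NPU accelerator. The DeepSeek R1 distilled models are optimized for the NPU and are available to download on Snapdragon-powered Copilot+ PCs running Windows 11. For more information, see Running distilled DeepSeek R1 models locally on Copilot+ PCs, powered by Windows AI Foundry.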

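The following models are currently available for Windows devices with one or more GPUs:

Mistral 7B (DirectML - Small, Fast)
Phi 3 Mini 4K (DirectML - Small, Fast)
Phi 3 Mini 128K (DirectML - Small, Fast)

Select the Phi 3 Mini 4K model and click "Download":

Note

The Phi 3 Mini 4K model is approximately 2GB-3GB in size. Depending on your network speed, the download could take a few minutes.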

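Run the model in the playground

Once the model has downloaded, it appears in the "My Models" section under "Local models". Right-click the model and select "Load in Playground" from the context menu.

Screenshot of the "Load in Playground" context menu item

In the chat interface of the playground, enter the following message and press Enter:

Screenshot: playground selection

You should see the model response streamed back:

Screenshot: generated response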

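Warning

If your device has no GPU available but you selected the Phi-3-mini-4k-directml-int4-awq-block-128-onnx model, the model response will be very slow. Download the CPU-optimized version instead: Phi-3-mini-4k-cpu-int4-rtn-block-32-acc-level-4-onnx.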

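You can also change:

Context Instructions: Help the model understand the bigger picture of your request. This could be background information, examples or demonstrations of what you want, or an explanation of the purpose of the task.
Inference parameters:
- Maximum response length: The maximum number of tokens the model will return.
- Temperature: Controls how random the language model's output is. A higher temperature means the model takes more risks and produces a wider variety of word combinations. A lower temperature makes the model play it safe and favor more focused, predictable responses.
- Top P: Also known as nucleus sampling, this setting controls how many candidate words or phrases the language model considers when predicting the next word.
- Frequency penalty: Influences how often the model repeats words or phrases in its output. Higher values (closer to 1.0) more strongly encourage the model to avoid repeating words or phrases.
- Presence penalty: Encourages diversity and specificity in the generated text. Higher values (closer to 1.0) encourage the model to include more novel and diverse tokens. Lower values make the model more likely to produce common or clichéd phrases.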
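To make the temperature and Top P settings more concrete, here is a minimal, self-contained sketch of how such parameters are commonly applied when sampling a token. It is an illustration only and not the AI Toolkit's or any model's actual implementation; the function names and the toy scores are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


def sample_token(logits, temperature=0.7, top_p=1.0, rng=random):
    """Sample one token index: temperature scaling first, then nucleus (Top P) filtering."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
    for t in (0.2, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
    print(sample_token(toy_logits, temperature=0.7, top_p=0.9))
```

Running the script prints the probability distribution at three temperatures, which shows the distribution flattening as the temperature rises, followed by one sampled token index.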

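Integrate an AI model into your application

There are two options for integrating the model into your application:

The AI Toolkit comes with a local REST API web server. This lets you test your application locally using the endpoint http://127.0.0.1:5272/v1/chat/completions without relying on a cloud AI model service. Use this option if you intend to switch to a cloud endpoint in production. You can use OpenAI client libraries to connect to the web server.
Use the ONNX Runtime. Use this option if you intend to ship the model with your application and run inference on the device.

Local REST API web server

The local REST API web server lets you build and test your application locally without relying on a cloud AI model service. You can interact with the web server using REST or with an OpenAI client library. The examples in this section show a raw REST request first, then Python, then C#.

Here is an example body for your REST request: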

{
    "model": "Phi-3-mini-4k-directml-int4-awq-block-128-onnx",
    "messages": [
        {
            "role": "user",
            "content": "what is the golden ratio?"
        }
    ],
    "temperature": 0.7,
    "top_p": 1,
    "top_k": 10,
    "max_tokens": 100,
    "stream": true
}

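Note

You may need to update the model field to the name of the model you downloaded.

You can test the endpoint with an API tool such as Postman or with the curl utility. The following command assumes the request body above is saved as body.json: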

curl -vX POST http://127.0.0.1:5272/v1/chat/completions -H 'Content-Type: application/json' -d @body.json
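If curl is not available, the same request can be sketched with the Python standard library. This is a minimal sketch under two assumptions that the article does not confirm: that the local server streams OpenAI-style server-sent events (lines of the form `data: {...}` ending with `data: [DONE]`), and that each chunk carries text in `choices[0].delta.content`. The helper names are hypothetical.

```python
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:5272/v1/chat/completions"  # endpoint given in this article


def build_request(model, prompt, stream=True):
    """Build a POST request equivalent to the curl command above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 100,
        "stream": stream,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def text_from_sse_line(line):
    """Extract generated text from one 'data: {...}' line (assumed OpenAI-style chunk format)."""
    line = line.strip()
    if not line.startswith("data:"):
        return ""
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return ""
    chunk = json.loads(payload)
    choices = chunk.get("choices") or [{}]
    return choices[0].get("delta", {}).get("content") or ""


def stream_chat(model, prompt):
    """Send the request and print text as it arrives; requires the local server to be running."""
    with urllib.request.urlopen(build_request(model, prompt)) as response:
        for raw in response:
            print(text_from_sse_line(raw.decode("utf-8")), end="", flush=True)
    print()
```

With the AI Toolkit server running and a model loaded, calling `stream_chat("Phi-3-mini-4k-directml-int4-awq-block-128-onnx", "what is the golden ratio?")` should print the reply incrementally, provided the streaming format matches the assumption above.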

For Python, install the OpenAI Python library, then run the script that follows:

pip install openai

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5272/v1/",
    api_key="x" # required by API but not used
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "what is the golden ratio?",
        }
    ],
    model="Phi-3-mini-4k-directml-int4-awq-block-128-onnx",
)

print(chat_completion.choices[0].message.content)

For C#, add the Azure OpenAI client library for .NET to your project using NuGet:

dotnet add {project_name} package Azure.AI.OpenAI --version 1.0.0-beta.17

Add a C# file named OverridePolicy.cs to your project and paste in the following code:

// OverridePolicy.cs
using Azure.Core.Pipeline;
using Azure.Core;

internal partial class OverrideRequestUriPolicy(Uri overrideUri)
    : HttpPipelineSynchronousPolicy
{
    private readonly Uri _overrideUri = overrideUri;

    public override void OnSendingRequest(HttpMessage message)
    {
        message.Request.Uri.Reset(_overrideUri);
    }
}

Next, paste the following code into your Program.cs file:

// Program.cs
using Azure.AI.OpenAI;

Uri localhostUri = new("http://localhost:5272/v1/chat/completions");

OpenAIClientOptions clientOptions = new();
clientOptions.AddPolicy(
    new OverrideRequestUriPolicy(localhostUri),
    Azure.Core.HttpPipelinePosition.BeforeTransport);
OpenAIClient client = new(openAIApiKey: "unused", clientOptions);

ChatCompletionsOptions options = new()
{
    DeploymentName = "Phi-3-mini-4k-directml-int4-awq-block-128-onnx",
    Messages =
    {
        new ChatRequestSystemMessage("You are a helpful assistant. Be brief and succinct."),
        new ChatRequestUserMessage("What is the golden ratio?"),
    }
};

StreamingResponse<StreamingChatCompletionsUpdate> streamingChatResponse
    = await client.GetChatCompletionsStreamingAsync(options);

await foreach (StreamingChatCompletionsUpdate chatChunk in streamingChatResponse)
{
    Console.Write(chatChunk.ContentUpdate);
}

Note

If you downloaded the CPU version of the Phi3 model, update the model field to Phi-3-mini-4k-cpu-int4-rtn-block-32-acc-level-4-onnx.
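One way to avoid editing the model name in several places is to keep both names in one spot and select between them. The following is a small sketch for the Python client; the helper name `chat_request_kwargs` and the `use_gpu_model` flag are hypothetical, and the model names are the two given in this article.

```python
# Model names as given in this article.
GPU_MODEL = "Phi-3-mini-4k-directml-int4-awq-block-128-onnx"
CPU_MODEL = "Phi-3-mini-4k-cpu-int4-rtn-block-32-acc-level-4-onnx"


def chat_request_kwargs(prompt, use_gpu_model=True):
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": GPU_MODEL if use_gpu_model else CPU_MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
```

With the OpenAI client from the Python example, the call would then read `client.chat.completions.create(**chat_request_kwargs("what is the golden ratio?", use_gpu_model=False))`.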

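ONNX Runtime

The ONNX Runtime Generate API provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. You can call a high-level generate() method, or run each iteration of the model in a loop, generating one token at a time and optionally updating generation parameters inside the loop.

It supports greedy and beam search as well as TopP and TopK sampling to generate token sequences, plus built-in logits processing such as repetition penalties. The following code is an example of how you can use the ONNX Runtime in your applications.

For REST, see the example shown in Local REST API web server. The AI Toolkit REST web server is built using the ONNX Runtime.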
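To illustrate the token-by-token loop described above without any dependency on ONNX Runtime, here is a toy sketch of greedy search with a repetition penalty. Everything in it is hypothetical: `toy_model` stands in for real inference, the vocabulary has four tokens, and the penalty rule is one common formulation rather than the one ONNX Runtime necessarily uses.

```python
def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Lower the score of tokens that already appear in the sequence (assumed formulation)."""
    adjusted = list(logits)
    for token in set(generated):
        # Dividing positive scores and multiplying negative ones both lower the score.
        if adjusted[token] > 0:
            adjusted[token] /= penalty
        else:
            adjusted[token] *= penalty
    return adjusted


def toy_model(sequence):
    """Hypothetical stand-in for a model: fixed scores over a 4-token vocabulary."""
    return [1.0, 3.0, 2.9, 2.8]  # token 3 plays the role of end-of-sequence


def generate(prompt_tokens, max_length=8, eos_token=3, penalty=1.2):
    """Greedy search: append the highest-scoring token each step until EOS or max_length."""
    sequence = list(prompt_tokens)
    while len(sequence) < max_length:
        logits = toy_model(sequence)                                   # inference
        logits = apply_repetition_penalty(logits, sequence, penalty)   # logits processing
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy search
        sequence.append(next_token)
        if next_token == eos_token:
            break
    return sequence


if __name__ == "__main__":
    print(generate([0], penalty=1.0))  # no penalty: the same token keeps winning
    print(generate([0], penalty=1.2))  # with penalty: repeated tokens lose ground
```

Here `max_length` counts the prompt tokens plus the generated tokens. Comparing the two printed sequences shows how the penalty step changes which token wins at each iteration.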

For Python, install NumPy:

pip install numpy

Next, install the ONNX Runtime Python package into your project according to your platform and GPU availability:

Platform	GPU available	PyPI
Windows	Yes (AMD, NVIDIA, Intel, Qualcomm, and others supported)	`pip install --pre onnxruntime-genai-directml`
Linux	Yes (NVIDIA CUDA)	`pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/`
Windows or Linux	No	`pip install --pre onnxruntime-genai`
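The table can also be expressed as a small lookup, which may be handy in a setup script. This is a sketch only: the helper name `genai_package` is hypothetical, GPU detection is left to the caller, and platforms not listed in the table raise an error rather than guessing.

```python
import platform

# Package names copied from the table above.
PACKAGES = {
    ("Windows", True): "onnxruntime-genai-directml",
    ("Linux", True): "onnxruntime-genai-cuda",
    ("Windows", False): "onnxruntime-genai",
    ("Linux", False): "onnxruntime-genai",
}


def genai_package(system=None, has_gpu=False):
    """Return the PyPI package name for a platform, defaulting to the current one."""
    system = system or platform.system()
    try:
        return PACKAGES[(system, has_gpu)]
    except KeyError:
        raise ValueError(
            f"No package listed for {system!r} (GPU available: {has_gpu})"
        ) from None
```

For example, `genai_package("Windows", has_gpu=True)` returns the DirectML package name. The CUDA package additionally needs the `--index-url` shown in the table.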

Tip

We recommend installing Python packages into a virtual environment using venv or conda.

Next, copy and paste the following code into a Python file named app.py:

# app.py
import onnxruntime_genai as og
import argparse

def main(args):
    print("Loading model...")
    model = og.Model(f'{args.model}')
    print("Model loaded")
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    search_options = {
        'max_length': 2048
    }

    chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

    # Keep asking for input prompts in a loop
    while True:
        text = input("Input: ")
    
        # If there is a chat template, use it
        prompt = f'{chat_template.format(input=text)}'

        input_tokens = tokenizer.encode(prompt)

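        params = og.GeneratorParams(model)
        params.set_search_options(**search_options)
        # Assumption: this script targets an earlier onnxruntime-genai release.
        # Newer releases are understood to drop params.input_ids and
        # compute_logits() in favor of generator.append_tokens(input_tokens).
        params.input_ids = input_tokens

        generator = og.Generator(model, params)
        print("\nOutput: ", end='', flush=True)
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()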
            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
              
        print()
        print()

        # Delete the generator to free the captured graph for the next generator, if graph capture is enabled
        del generator


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--model', type=str, required=True, help='Onnx model folder path (must contain config.json and model.onnx)')
    args = parser.parse_args()
    main(args)

To run the Python app, use the following command:

python app.py --model ~/.aitk/models/{path_to_folder_containing_onnx_file}

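Note

AI Toolkit caches downloaded models in a hidden folder named .aitk in your user directory. Update the --model argument to point to the folder that contains the ONNX model files, for example ~/.aitk/models/microsoft/Phi-3-mini-4k-instruct-onnx/directml/Phi-3-mini-4k-directml-int4-awq-block-128-onnx/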
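To find the right folder without browsing by hand, the cache can be searched for .onnx files. The following is a minimal sketch; the helper name `find_model_folders` is hypothetical, and the default location `~/.aitk/models` is taken from the example path in this article.

```python
from pathlib import Path


def find_model_folders(root=None):
    """Return folders under the AI Toolkit cache that contain at least one .onnx file."""
    root = Path(root) if root is not None else Path.home() / ".aitk" / "models"
    if not root.is_dir():
        return []  # nothing downloaded yet, or the cache lives elsewhere
    return sorted({path.parent for path in root.rglob("*.onnx")})


if __name__ == "__main__":
    for folder in find_model_folders():
        print(folder)
```

Any folder it prints can be passed to the app, for example `python app.py --model <folder>`.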

For C#, install the ONNX Runtime NuGet package into your project according to your platform and GPU availability:

Platform	GPU available	NuGet
Windows	Yes (AMD, NVIDIA, Intel, Qualcomm, and others supported)	Microsoft.ML.OnnxRuntimeGenAI.DirectML
Linux	Yes (NVIDIA CUDA)	Microsoft.ML.OnnxRuntimeGenAI.Cuda
Windows or Linux	No	Microsoft.ML.OnnxRuntimeGenAI

Copy and paste the following code into a C# file:

using Microsoft.ML.OnnxRuntimeGenAI;

// update user_name and path placeholders
string modelPath = "C:\\Users\\{user_name}\\.aitk\\models\\{path}"; 
Console.Write("Loading model from " + modelPath + "...");
using Model model = new(modelPath);
Console.Write("Done\n");
using Tokenizer tokenizer = new(model);
using TokenizerStream tokenizerStream = tokenizer.CreateStream();

while (true)
{
    Console.Write("User:");
   
    string? input = Console.ReadLine();
    string prompt = "<|user|>\n" + input + "<|end|>\n<|assistant|>";

    var sequences = tokenizer.Encode(prompt);

    using GeneratorParams generatorParams = new GeneratorParams(model);
    generatorParams.SetSearchOption("max_length", 200);
    generatorParams.SetInputSequences(sequences);

    Console.Out.Write("\nAI:");
    using Generator generator = new(model, generatorParams);
    while (!generator.IsDone())
    { 
        generator.ComputeLogits();
        generator.GenerateNextToken();
        Console.Out.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
        Console.Out.Flush();
    }
    Console.WriteLine();
}

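Note

AI Toolkit caches downloaded models in a hidden folder named .aitk in your user directory. Update modelPath in the code to point to the folder that contains the ONNX model files, for example ~/.aitk/models/microsoft/Phi-3-mini-4k-instruct-onnx/directml/Phi-3-mini-4k-directml-int4-awq-block-128-onnx/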

Next steps

Fine-tune a model with the AI Toolkit for VS Code
