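Structured output

How to return structured data from a model

In real applications, we often need a model to return data that matches a specific structure, for example to extract information from text and insert it into a database, or to feed another downstream system. This guide covers several strategies for getting structured output from a model.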

Using the `.with_structured_output()` method

Supported models
See the LangChain documentation for the list of models that support this method.

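This is the simplest and most reliable way to get structured output. The with_structured_output() method builds on a model's native API for structuring output, such as tool/function calling or JSON mode.

The method takes a schema as input. The schema specifies the names, types, and descriptions of the desired output attributes. The method returns a model-like Runnable that, instead of outputting strings or messages, outputs an object corresponding to the given schema. The schema can be a TypedDict class, a JSON Schema, or a Pydantic class. With a TypedDict or JSON Schema the Runnable returns a dict; with a Pydantic class it returns a Pydantic object.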

Example: generate a joke and separate the setup from the punchline

First, we pick a chat model and install the required library:

bash
pip install -qU "langchain[groq]"

Then set the API key and initialize the model:

python
import getpass
import os

if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("llama3-8b-8192", model_provider="groq")

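Using a Pydantic class
If we want the model to return a Pydantic object, we only need to pass in the desired Pydantic class. The main advantage of Pydantic is that the model-generated output is validated: if a required field is missing or a field has the wrong type, Pydantic raises an error.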

python
from typing import Optional
from pydantic import BaseModel, Field

class Joke(BaseModel):
    """Joke content."""
    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline of the joke")
    rating: Optional[int] = Field(default=None, description="How funny the joke is, from 1 to 10")

structured_llm = llm.with_structured_output(Joke)
structured_llm.invoke("Tell me a joke about cats")

Example output:

python
Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!', rating=7)

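Using TypedDict or JSON Schema
If you don't want to use Pydantic, don't want the arguments validated, or want to be able to stream the model output, you can define the schema with a TypedDict class. We can also use the Annotated syntax supported by LangChain to specify a default value and a description for each field.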

python
from typing import Optional
from typing_extensions import Annotated, TypedDict

class Joke(TypedDict):
    """Joke content."""
    setup: Annotated[str, ..., "The setup of the joke"]
    punchline: Annotated[str, ..., "The punchline of the joke"]
    rating: Annotated[Optional[int], None, "How funny the joke is, from 1 to 10"]

structured_llm = llm.with_structured_output(Joke)
structured_llm.invoke("Tell me a joke about cats")

Example output:

python
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!', 'rating': 7}

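Using JSON Schema
We can also pass in a JSON Schema dict directly. This requires no imports or classes and makes the documentation of each parameter explicit, at the cost of being somewhat more verbose.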

python
json_schema = {
    "title": "joke",
    "description": "Joke content.",
    "type": "object",
    "properties": {
        "setup": {"type": "string", "description": "The setup of the joke"},
        "punchline": {"type": "string", "description": "The punchline of the joke"},
        "rating": {"type": "integer", "description": "How funny the joke is, from 1 to 10", "default": None},
    },
    "required": ["setup", "punchline"],
}

structured_llm = llm.with_structured_output(json_schema)
structured_llm.invoke("Tell me a joke about cats")

Example output:

python
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!', 'rating': 7}
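
Because dictionary outputs are not validated the way Pydantic objects are, it can be useful to check the returned dict yourself. The following is a minimal, standard-library-only sketch of such a check. `check_against_schema` and `PYTHON_TYPES` are hypothetical names introduced here for illustration, not part of LangChain, and the check covers only the `required` list and the two primitive types used in this guide.

```python
# A trimmed copy of the schema above (descriptions omitted for brevity).
json_schema = {
    "title": "joke",
    "type": "object",
    "properties": {
        "setup": {"type": "string"},
        "punchline": {"type": "string"},
        "rating": {"type": "integer"},
    },
    "required": ["setup", "punchline"],
}

# Map the JSON Schema type names used above to Python types (assumption:
# only these two types occur in the schema).
PYTHON_TYPES = {"string": str, "integer": int}


def check_against_schema(data: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the dict passed the check."""
    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        value = data.get(name)
        if value is None:
            continue  # absent or null values are not type-checked here
        expected = PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for integers
        wrong_type = not isinstance(value, expected)
        if wrong_type or (expected is int and isinstance(value, bool)):
            problems.append(f"{name} should be of type {spec['type']}")
    return problems


good = {
    "setup": "Why was the cat sitting on the computer?",
    "punchline": "Because it wanted to keep an eye on the mouse!",
    "rating": 7,
}
bad = {"setup": "Why was the cat sitting on the computer?", "rating": "seven"}

print(check_against_schema(good, json_schema))  # []
print(check_against_schema(bad, json_schema))
```

For anything beyond a quick sanity check, a full JSON Schema validator or a Pydantic class is likely the better choice.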

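Choosing between multiple schemas
The simplest way to let the model choose between multiple schemas is to create a parent schema that has a Union-typed attribute.

Using Pydantic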

python
from typing import Union

class Joke(BaseModel):
    """Joke content."""
    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline of the joke")
    rating: Optional[int] = Field(default=None, description="How funny the joke is, from 1 to 10")

class ConversationalResponse(BaseModel):
    """Respond in a conversational manner. Be friendly and helpful."""
    response: str = Field(description="A conversational response to the user's query")

class FinalResponse(BaseModel):
    final_output: Union[Joke, ConversationalResponse]

structured_llm = llm.with_structured_output(FinalResponse)
structured_llm.invoke("Tell me a joke about cats")

Example output:

python
FinalResponse(final_output=Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!', rating=7))

Using TypedDict

python
from typing import Optional, Union
from typing_extensions import Annotated, TypedDict

class Joke(TypedDict):
    """Joke content."""
    setup: Annotated[str, ..., "The setup of the joke"]
    punchline: Annotated[str, ..., "The punchline of the joke"]
    rating: Annotated[Optional[int], None, "How funny the joke is, from 1 to 10"]

class ConversationalResponse(TypedDict):
    """Respond in a conversational manner. Be friendly and helpful."""
    response: Annotated[str, ..., "A conversational response to the user's query"]

class FinalResponse(TypedDict):
    final_output: Union[Joke, ConversationalResponse]

structured_llm = llm.with_structured_output(FinalResponse)
structured_llm.invoke("Tell me a joke about cats")

Example output:

python
{'final_output': {'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!', 'rating': 7}}
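
With the TypedDict variant the result is a plain dict, so `isinstance` cannot tell you which schema the model picked. One possible approach, sketched below with the standard library only, is to inspect the keys of `final_output`. `describe` is a hypothetical helper introduced for illustration, and the sample dicts are hand-written rather than real model output.

```python
def describe(result: dict) -> str:
    """Report which of the two schemas a FinalResponse-shaped dict contains."""
    output = result["final_output"]
    if "setup" in output and "punchline" in output:
        return f"joke: {output['setup']} {output['punchline']}"
    if "response" in output:
        return f"reply: {output['response']}"
    raise ValueError(f"unrecognized output shape: {output!r}")


# Hand-written samples standing in for structured_llm.invoke(...) results.
joke_result = {
    "final_output": {
        "setup": "Why was the cat sitting on the computer?",
        "punchline": "Because it wanted to keep an eye on the mouse!",
        "rating": 7,
    }
}
chat_result = {"final_output": {"response": "Hello! How can I help?"}}

print(describe(joke_result))
print(describe(chat_result))
```

With the Pydantic variant, the same decision can be made with `isinstance(result.final_output, Joke)` instead of a key check.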

Streaming
When the output type is a dict (that is, when the schema is a TypedDict class or a JSON Schema dict), we can stream the output of the structured model.

python
from typing_extensions import Annotated, TypedDict

class Joke(TypedDict):
    """Joke content."""
    setup: Annotated[str, ..., "The setup of the joke"]
    punchline: Annotated[str, ..., "The punchline of the joke"]

structured_llm = llm.with_structured_output(Joke)
for chunk in structured_llm.stream("Tell me a joke about cats"):
    print(chunk)

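Example output (final chunk; earlier chunks are partial dictionaries):

python
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!'}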
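
To illustrate how a streamed result might be consumed, here is a small sketch that runs without a model. It assumes each chunk is a cumulative snapshot of the dict built so far rather than a delta, so the last chunk is the complete result. `fake_stream` is a hypothetical stand-in for `structured_llm.stream(...)`, and its chunk boundaries are invented for illustration.

```python
from typing import Dict, Iterator

SETUP = "Why was the cat sitting on the computer?"
PUNCHLINE = "Because it wanted to keep an eye on the mouse!"


def fake_stream() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for structured_llm.stream(...): yields growing dicts."""
    yield {}
    yield {"setup": "Why was the cat"}
    yield {"setup": SETUP}
    yield {"setup": SETUP, "punchline": "Because it wanted"}
    yield {"setup": SETUP, "punchline": PUNCHLINE}


final: Dict[str, str] = {}
for chunk in fake_stream():
    # Each chunk supersedes the previous one, so nothing has to be merged.
    final = chunk

print(final)
```

If the chunks were deltas instead, they would need to be merged key by key, which is why it is worth confirming the behavior of the model you use.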

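(Advanced) Raw output
LLMs are not perfect at generating structured output, especially as schemas become complex. By passing include_raw=True you can avoid raising exceptions and handle the raw output yourself. This changes the output format to include the raw message output, the parsed value (if parsing succeeded), and any resulting error.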

python
structured_llm = llm.with_structured_output(Joke, include_raw=True)
structured_llm.invoke("Tell me a joke about cats")

Example output:

python
{'raw': AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_f25ZRmh8u5vHlOWfTUw8sJFZ', 'function': {'arguments': '{"setup":"Why was the cat sitting on the computer?","punchline":"Because it wanted to keep an eye on the mouse!","rating":7}', 'name': 'Joke'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 33, 'prompt_tokens': 93, 'total_tokens': 126}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_4e2b2da518', 'finish_reason': 'stop', 'logprobs': None}, id='run-d880d7e2-df08-4e9e-ad92-dfc29f2fd52f-0', tool_calls=[{'name': 'Joke', 'args': {'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!', 'rating': 7}, 'id': 'call_f25ZRmh8u5vHlOWfTUw8sJFZ', 'type': 'tool_call'}], usage_metadata={'input_tokens': 93, 'output_tokens': 33, 'total_tokens': 126}),
 'parsed': {'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!', 'rating': 7},
 'parsing_error': None}
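
One way to handle this output is to check `parsing_error` before using `parsed`. The sketch below uses hand-written dicts with the same three keys as the output shown above. `parsed_or_fallback` is a hypothetical helper, and the `raw` values are placeholder strings rather than real AIMessage objects.

```python
def parsed_or_fallback(result: dict, fallback: dict) -> dict:
    """Return the parsed value, or `fallback` when parsing did not succeed."""
    if result["parsing_error"] is not None or result["parsed"] is None:
        return fallback
    return result["parsed"]


# Hand-written samples shaped like the include_raw=True output.
ok_result = {
    "raw": "<placeholder for the raw AIMessage>",
    "parsed": {
        "setup": "Why was the cat sitting on the computer?",
        "punchline": "Because it wanted to keep an eye on the mouse!",
        "rating": 7,
    },
    "parsing_error": None,
}
failed_result = {
    "raw": "<placeholder for the raw AIMessage>",
    "parsed": None,
    "parsing_error": ValueError("could not parse model output"),
}

default_joke = {"setup": "", "punchline": "", "rating": None}

print(parsed_or_fallback(ok_result, default_joke)["rating"])
print(parsed_or_fallback(failed_result, default_joke))
```

Depending on the application, retrying the call or logging the `raw` message may be more appropriate than returning a default value.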

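Prompting and parsing model output directly
Not all models support .with_structured_output(), since not all models support tool calling or JSON mode. For such models you need to prompt the model directly to use a specific format, and use an output parser to extract the structured response from the raw model output.

Using PydanticOutputParser
The following example uses the built-in PydanticOutputParser to parse the output of a chat model that has been prompted to match the given Pydantic schema.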

python
from typing import List
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person, in meters")

class People(BaseModel):
    """Information about all people mentioned in a text."""
    people: List[Person]

# Set up the parser
parser = PydanticOutputParser(pydantic_object=People)

# Prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user query. Wrap the output in `json` tags\n{format_instructions}"),
    ("human", "{query}"),
]).partial(format_instructions=parser.get_format_instructions())

# Inspect what will be sent to the model
query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.invoke({"query": query}).to_string())

# Invoke the model
chain = prompt | llm | parser
chain.invoke({"query": query})

Example output:

python
People(people=[Person(name='Anna', height_in_meters=1.8288)])
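
The query gives the height in feet, while the schema asks for meters, so the value in the sample output reflects a unit conversion made by the model. As a quick check of that arithmetic, using the exact definition 1 ft = 0.3048 m (the variable names are illustrative only):

```python
FEET_TO_METERS = 0.3048  # exact by definition of the international foot

height_in_feet = 6
height_in_meters = height_in_feet * FEET_TO_METERS

# Round to four decimals to hide binary floating-point noise.
print(round(height_in_meters, 4))  # 1.8288
```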

Custom parsing
You can also create a custom prompt and parser with the LangChain Expression Language (LCEL), using a plain function to parse the model output.

python
import json
import re
from typing import List
from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person, in meters")

class People(BaseModel):
    """Information about all people mentioned in a text."""
    people: List[Person]

# Prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user query. Output your answer as JSON that matches the following schema: ```json\n{schema}\n```. Make sure to wrap the answer in ```json and ``` tags"),
    ("human", "{query}"),
]).partial(schema=People.model_json_schema())  # Pydantic v2; on Pydantic v1 use People.schema()

# Custom parser
def extract_json(message: AIMessage) -> List[dict]:
    """Extract the JSON content embedded between ```json and ``` tags in a message."""
    text = message.content
    pattern = r"```json(.*?)```"
    matches = re.findall(pattern, text, re.DOTALL)
    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception as exc:
        raise ValueError(f"Failed to parse: {message}") from exc

# Inspect what will be sent to the model
query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.format_prompt(query=query).to_string())

# Invoke the model
chain = prompt | llm | extract_json
chain.invoke({"query": query})

Example output:

python
[{'people': [{'name': 'Anna', 'height_in_meters': 1.8288}]}]
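
The extraction step can be tried without calling a model. The sketch below applies the same regular-expression idea to a hand-written reply string. `extract_json_blocks`, `FENCE`, and `reply` are hypothetical names introduced for illustration, and the function takes a plain string instead of an AIMessage so that it depends only on the standard library.

```python
import json
import re
from typing import List

# Three backticks, built at runtime so the literal does not appear in this example.
FENCE = "`" * 3


def extract_json_blocks(text: str) -> List[dict]:
    """Parse every fenced json block found in `text`."""
    pattern = re.escape(FENCE) + r"json(.*?)" + re.escape(FENCE)
    return [json.loads(match.strip()) for match in re.findall(pattern, text, re.DOTALL)]


# A hand-written string imitating a model reply.
reply = (
    "Here is the result:\n"
    f"{FENCE}json\n"
    '{"people": [{"name": "Anna", "height_in_meters": 1.8288}]}\n'
    f"{FENCE}\n"
)

print(extract_json_blocks(reply))
print(extract_json_blocks("no JSON in this reply"))  # []
```

Note that a reply without any fenced block yields an empty list rather than an error, so callers may want to treat an empty result as a parsing failure.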

With these approaches, you can flexibly obtain structured data from a model and process it further as needed.
