快速入門

Realtime agents 讓你可以透過 OpenAI Realtime API 與你的 AI 智能代理進行語音對話。本指南將帶你快速建立第一個即時語音代理。

Beta feature

Realtime agents 目前為 beta 版本。在我們持續改進實作時，請預期可能會有重大變動。

先決條件

Python 3.9 或更高版本
OpenAI API 金鑰
基本熟悉 OpenAI Agents SDK

安裝

如果尚未安裝，請先安裝 OpenAI Agents SDK：

pip install openai-agents

建立你的第一個 Realtime agent

1. 匯入所需元件

import asyncio
from agents.realtime import RealtimeAgent, RealtimeRunner

2. 建立即時代理 (realtime agent)

agent = RealtimeAgent(
    name="Assistant",
    instructions="You are a helpful voice assistant. Keep your responses conversational and friendly.",
)

3. 設定 runner

runner = RealtimeRunner(
    starting_agent=agent,
    config={
        "model_settings": {
            "model_name": "gpt-realtime",
            "voice": "ash",
            "modalities": ["audio"],
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
            "turn_detection": {"type": "semantic_vad", "interrupt_response": True},
        }
    }
)

4. 開始一個 session（會話）

# Start the session
session = await runner.run()

async with session:
    print("Session started! The agent will stream audio responses in real-time.")
    # Process events
    async for event in session:
        try:
            if event.type == "agent_start":
                print(f"Agent started: {event.agent.name}")
            elif event.type == "agent_end":
                print(f"Agent ended: {event.agent.name}")
            elif event.type == "handoff":
                print(f"Handoff from {event.from_agent.name} to {event.to_agent.name}")
            elif event.type == "tool_start":
                print(f"Tool started: {event.tool.name}")
            elif event.type == "tool_end":
                print(f"Tool ended: {event.tool.name}; output: {event.output}")
            elif event.type == "audio_end":
                print("Audio ended")
            elif event.type == "audio":
                # Enqueue audio for callback-based playback with metadata
                # Non-blocking put; queue is unbounded, so drops won’t occur.
                pass
            elif event.type == "audio_interrupted":
                print("Audio interrupted")
                # Begin graceful fade + flush in the audio callback and rebuild jitter buffer.
            elif event.type == "error":
                print(f"Error: {event.error}")
            elif event.type == "history_updated":
                pass  # Skip these frequent events
            elif event.type == "history_added":
                pass  # Skip these frequent events
            elif event.type == "raw_model_event":
                print(f"Raw model event: {_truncate_str(str(event.data), 200)}")
            else:
                print(f"Unknown event type: {event.type}")
        except Exception as e:
            print(f"Error processing event: {_truncate_str(str(e), 200)}")

def _truncate_str(s: str, max_length: int) -> str:
    if len(s) > max_length:
        return s[:max_length] + "..."
    return s

完整範例

以下是一個完整可運作的範例：

import asyncio
from agents.realtime import RealtimeAgent, RealtimeRunner

async def main():
    # Create the agent
    agent = RealtimeAgent(
        name="Assistant",
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
    )
    # Set up the runner with configuration
    runner = RealtimeRunner(
        starting_agent=agent,
        config={
            "model_settings": {
                "model_name": "gpt-realtime",
                "voice": "ash",
                "modalities": ["audio"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
                "turn_detection": {"type": "semantic_vad", "interrupt_response": True},
            }
        },
    )
    # Start the session
    session = await runner.run()

    async with session:
        print("Session started! The agent will stream audio responses in real-time.")
        # Process events
        async for event in session:
            try:
                if event.type == "agent_start":
                    print(f"Agent started: {event.agent.name}")
                elif event.type == "agent_end":
                    print(f"Agent ended: {event.agent.name}")
                elif event.type == "handoff":
                    print(f"Handoff from {event.from_agent.name} to {event.to_agent.name}")
                elif event.type == "tool_start":
                    print(f"Tool started: {event.tool.name}")
                elif event.type == "tool_end":
                    print(f"Tool ended: {event.tool.name}; output: {event.output}")
                elif event.type == "audio_end":
                    print("Audio ended")
                elif event.type == "audio":
                    # Enqueue audio for callback-based playback with metadata
                    # Non-blocking put; queue is unbounded, so drops won’t occur.
                    pass
                elif event.type == "audio_interrupted":
                    print("Audio interrupted")
                    # Begin graceful fade + flush in the audio callback and rebuild jitter buffer.
                elif event.type == "error":
                    print(f"Error: {event.error}")
                elif event.type == "history_updated":
                    pass  # Skip these frequent events
                elif event.type == "history_added":
                    pass  # Skip these frequent events
                elif event.type == "raw_model_event":
                    print(f"Raw model event: {_truncate_str(str(event.data), 200)}")
                else:
                    print(f"Unknown event type: {event.type}")
            except Exception as e:
                print(f"Error processing event: {_truncate_str(str(e), 200)}")

def _truncate_str(s: str, max_length: int) -> str:
    if len(s) > max_length:
        return s[:max_length] + "..."
    return s

if __name__ == "__main__":
    # Run the session
    asyncio.run(main())

設定選項

模型設定

model_name：從可用的即時語音模型（例如 gpt-realtime）中選擇
voice：選擇語音（alloy、echo、fable、onyx、nova、shimmer）
modalities：啟用文字或音訊（["text"] 或 ["audio"]）

音訊設定

input_audio_format：輸入音訊格式（pcm16、g711_ulaw、g711_alaw）
output_audio_format：輸出音訊格式
input_audio_transcription：轉錄（transcription）設定

輪流偵測（turn detection）

type：偵測方法（server_vad、semantic_vad）
threshold：語音活動閾值（0.0-1.0）
silence_duration_ms：偵測輪流結束的靜音時長
prefix_padding_ms：語音開始前的音訊補白

下一步

深入瞭解 Realtime agents
查看 examples/realtime 資料夾中的實作範例
為你的 agent 加入工具（Tools）
實作代理（Agents）之間的交接（Handoffs）
建立 guardrail 以確保安全性

驗證（Authentication）

請確保你的 OpenAI API 金鑰已設定於環境變數中：

export OPENAI_API_KEY="your-api-key-here"

或者在建立 session（會話）時直接傳入：

session = await runner.run(model_config={"api_key": "your-api-key"})