Let's try out Audio generation
This is a feature OpenAI only released in October 2024. It uses the gpt-4o-audio-preview model and supports both audio input and audio output.
Unlike the Realtime API, it still goes through /chat/completions rather than a WebSocket.
Audio: $100.00 / 1M input tokens, $200.00 / 1M output tokens
$100 may look alarming, but it's actually the same cost as whisper-1
(audio input costs approximately 6¢ per minute; audio output costs approximately 24¢ per minute).
Here, though, I'm using it for real-time speech translation, so I'll focus on the audio-in, text-out part.
Audio generation
Scenario: translate Chinese speech into Japanese text.
{
"model": "gpt-4o-audio-preview",
"modalities": ["text"],
"messages": [
{
"role": "system",
"content": "You are a translation expert. I will ask you to translate a language into setting languages"
},
{
"role": "user",
"content": [
{ "type": "text", text: "將語音,用'jp-JP'完整翻譯,不要發音、注解。" },
{ "type": "input_audio", "input_audio": { "data": base64str, "format": "mp3" } }
]
}
]
}
Response (took 2,970 ms):
{
"message": {
"role": "assistant",
"content": "皆さん、こんばんは。私のチャンネルへようこそ。",
"refusal": null
},
"usage": {
"prompt_tokens": 100,
"completion_tokens": 15,
"total_tokens": 115,
}
}
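For completeness, here's a minimal sketch of how this request can be sent with plain fetch (Node 18+; in a browser you'd proxy through your own backend instead of exposing the key). The function name is mine, and it assumes OPENAI_API_KEY is set and base64str already holds the mp3 audio with its data-URL header removed:

// Minimal call to /chat/completions with audio input.
// Assumes base64str is raw base64 mp3 (no "data:audio/...;base64," prefix).
async function translateAudioToJapanese(base64str: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-audio-preview",
      modalities: ["text"],
      messages: [
        {
          role: "system",
          content: "You are a translation expert. I will ask you to translate a language into setting languages",
        },
        {
          role: "user",
          content: [
            { type: "text", text: "將語音,用'jp-JP'完整翻譯,不要發音、注解。" },
            { type: "input_audio", input_audio: { data: base64str, format: "mp3" } },
          ],
        },
      ],
    }),
  });
  const json = await res.json();
  return json.choices[0].message.content; // the translated Japanese text
}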
This usage is almost identical to the Whisper API, but here are the pitfalls I ran into!!
- Whisper supports five or six audio formats, including webm and ogg, but gpt-4o-audio-preview only supports mp3 and wav. That means a browser recording can't be used directly; it has to be converted first.
- When sending base64 as the audio input, you also have to strip the data-URL header from the wav base64 string (see the sketch after this list):
base64data.split(',')[1]
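Here's a minimal sketch of both workarounds in the browser: decoding a MediaRecorder webm Blob with the Web Audio API, re-encoding it as 16-bit PCM wav, and then producing the header-stripped base64 string. The helper names are mine, and this assumes the recording can be decoded by decodeAudioData:

// Decode a webm recording and re-encode it as 16-bit PCM wav (mono, first channel only).
async function webmBlobToWav(blob: Blob): Promise<Blob> {
  const ctx = new AudioContext();
  const audio = await ctx.decodeAudioData(await blob.arrayBuffer());
  await ctx.close();
  const samples = audio.getChannelData(0);
  const view = new DataView(new ArrayBuffer(44 + samples.length * 2));
  const writeStr = (o: number, s: string) =>
    [...s].forEach((c, i) => view.setUint8(o + i, c.charCodeAt(0)));
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true);
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);                   // PCM chunk size
  view.setUint16(20, 1, true);                    // format: PCM
  view.setUint16(22, 1, true);                    // channels: mono
  view.setUint32(24, audio.sampleRate, true);     // sample rate
  view.setUint32(28, audio.sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);                    // block align
  view.setUint16(34, 16, true);                   // bits per sample
  writeStr(36, "data");
  view.setUint32(40, samples.length * 2, true);
  samples.forEach((s, i) => {
    const clamped = Math.max(-1, Math.min(1, s));
    view.setInt16(44 + i * 2, clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff, true);
  });
  return new Blob([view], { type: "audio/wav" });
}

// Read the wav/mp3 Blob as a data URL and strip the "data:audio/...;base64," header.
function blobToBase64Audio(blob: Blob): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve((reader.result as string).split(",")[1]);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(blob);
  });
}

// Usage: const base64str = await blobToBase64Audio(await webmBlobToWav(recordedWebm));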
Scenario: translate Chinese speech into several specified languages, using Function Calling.
{
"model": "gpt-4o-audio-preview",
"modalities": ["text"],
"messages": [
{
"role": "system",
"content": "You are a translation expert. I will ask you to translate a language into setting languages"
},
{
"role": "user",
"content": [
{ "type": "text", text: "將語音,請翻譯成'zh-CN簡體中文' 'jp-JP日本語' 'en-US美式英文' 完整翻譯,不要發音、注解。" },
{ "type": "input_audio", "input_audio": { "data": base64str, "format": "mp3" } }
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "split_text",
"description": "Identify Chinese and other languages in text.",
"parameters": {
"type": "object",
"properties": {
"zh-CN": {
"type": "string",
"description": "Chinese reslut"
},
"jp-JP": {
"type": "string",
"description": "Japanese reslut"
},
"en-US": {
"type": "string",
"description": "English reslut"
}
},
"required": [
"zh-CN"
]
}
}
}
],
"tool_choice": "auto"
}
Output (took 4,970 ms):
{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_dAdfaFvPBoL4JrgL4QRphIfo",
"type": "function",
"function": {
"name": "split_text",
"arguments": "{\"zh-CN\": \"大家好,今天是10月21号,星期一,欢迎来到我们的节目。\"}"
}
},
{
"id": "call_HnffYc7jMsIZntdBBhPbgrsp",
"type": "function",
"function": {
"name": "split_text",
"arguments": "{\"jp-JP\": \"皆さん、こんにちは。 今日は10月21日、月曜日です。 私たちのプログラムへようこそ。\"}"
}
},
{
"id": "call_VfDkfyBKwKzxRHkNjLyyqnd7",
"type": "function",
"function": {
"name": "split_text",
"arguments": "{\"en-US\": \"Hello everyone, today is October 21st, Monday. Welcome to our show.\"}"
}
}
],
"refusal": null
},
"usage": {
"prompt_tokens": 188,
"completion_tokens": 123,
"total_tokens": 311,
}
}
Function Calling is supported, although, just like in my earlier experiments, the output still varies from run to run.
Still, Function Calling is very flexible, so it should make quite a few interesting things possible.
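Since the model returned one split_text call per language, the arguments still have to be merged on my side. A minimal sketch that follows the response shape above (the function name is mine):

// Merge per-language split_text tool calls into a single object,
// e.g. { "zh-CN": "...", "jp-JP": "...", "en-US": "..." }.
type ToolCall = { function: { name: string; arguments: string } };

function mergeTranslations(toolCalls: ToolCall[]): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const call of toolCalls) {
    if (call.function.name !== "split_text") continue;
    Object.assign(merged, JSON.parse(call.function.arguments));
  }
  return merged;
}

// Usage with the response above: mergeTranslations(message.tool_calls)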
Scenario: use Structured Outputs to return the translation in several specified languages.
{
"model": "gpt-4o-audio-preview",
"modalities": ["text"],
"messages": [
{
"role": "system",
"content": "You are a translation expert. I will ask you to translate a language into setting languages"
},
{
"role": "user",
"content": [
{ "type": "text", text: "將語音,用完整翻譯,不要發音、注解。" },
{ "type": "input_audio", "input_audio": { "data": base64str, "format": "mp3" } }
]
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "split_text",
"schema": {
"type": "object",
"properties": {
"zh-CN": {
"type": "string",
"description": "Chinese reslut"
},
"jp-JP": {
"type": "string",
"description": "Japanese reslut"
},
"en-US": {
"type": "string",
"description": "Englisg reslut"
}
},
"required": [
"zh-CN"
]
}
}
}
}
The request above follows the approach from my Structured Outputs experiment article, turning the audio result into the multilingual translation I want.
Let's look at the output (took 339 ms):
{
"error": {
"message": "Invalid parameter: 'response_format' of type 'json_schema' is not supported with this model. Learn more about supported models at the Structured Outputs guide: https://platform.openai.com/docs/guides/structured-outputs",
"type": "invalid_request_error",
"param": null,
"code": null
}
}
The dream was beautiful, but reality is harsh…
gpt-4o-audio-preview simply does not support the new Structured Outputs format.
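Until that changes, one workaround is to detect this specific error and fall back to the function-calling request from the previous scenario. A rough sketch; the request builders and sender below are hypothetical placeholders for the payloads and fetch call shown earlier:

// Hypothetical fallback: try response_format json_schema first, and if the model
// rejects it (as above), reuse the tools / function-calling payload instead.
async function translateWithFallback(
  send: (body: unknown) => Promise<any>, // e.g. a wrapper around the fetch call above
  jsonSchemaBody: unknown,               // the json_schema request from this scenario
  functionCallingBody: unknown           // the tools request from the previous scenario
) {
  const first = await send(jsonSchemaBody);
  if (first.error?.type === "invalid_request_error") {
    return send(functionCallingBody);
  }
  return first;
}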
Conclusion
The traditional approach is to transcribe the speech with whisper first, then call an LLM with the resulting text.
Audio generation instead sends the audio straight to OpenAI and runs the LLM in the same request, which is one step faster.
That said, from my development experience there are two big drawbacks that need fixing:
- No webm support is a real pain. Converting to mp3 or wav before uploading not only makes the audio file bigger, it also means waiting for the conversion.
- Structured Outputs isn't supported. You're part of the GPT-4o family, why can't I use it?!
I hope these two shortcomings get fixed soon, so my real-time speech translation can get just a little bit faster!