Chat Completions

Overview

The Chat Completions endpoint is the most widely used API for interacting with large language models. It accepts a conversation made up of multiple messages and returns the model's response. CometAPI routes this endpoint to many providers through a single unified interface, including OpenAI, Anthropic Claude (via a compatibility layer), Google Gemini, and others. You can switch between models by changing only the model parameter.

This endpoint follows the OpenAI Chat Completions format. Most OpenAI-compatible SDKs and tools work with CometAPI once you set base_url to https://api.cometapi.com/v1.
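Because only the model parameter changes when you switch providers, the same request body can be reused across models. A minimal sketch using only the standard library; make_payload is a hypothetical helper, the model IDs are ones that appear elsewhere on this page, and the /chat/completions path is assumed from the OpenAI convention.

```python
import copy

BASE_URL = "https://api.cometapi.com/v1"
# Assumed path: the OpenAI-style Chat Completions route under the base URL.
CHAT_URL = BASE_URL + "/chat/completions"


def make_payload(model, messages, **params):
    """Build a Chat Completions request body (hypothetical helper)."""
    body = {"model": model, "messages": copy.deepcopy(messages)}
    body.update(params)
    return body


messages = [{"role": "user", "content": "Hello!"}]
# Same conversation and settings; only the target model differs.
a = make_payload("gpt-5.4", messages, temperature=0.2)
b = make_payload("gpt-4.1", messages, temperature=0.2)
```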

Important notes

Per-model behavior: different models may support different subsets of parameters and return slightly different response fields. For example, reasoning_effort applies only to reasoning models (o-series, GPT-5.1+), and some models may not support logprobs or n > 1.

Response passthrough: CometAPI forwards the model's response without modification (apart from format normalization when routing between providers), so the output you receive is consistent with the provider's native API.

OpenAI Pro models: for models in the OpenAI Pro line (for example, o1-pro), use the responses endpoint instead.

Message roles

Role	Description
`system`	Sets the assistant's behavior and persona. Placed at the start of the conversation.
`developer`	Replaces `system` for newer models (o1+). Provides instructions the model must follow regardless of user input.
`user`	Messages from the end user.
`assistant`	The model's previous responses, used to maintain conversation history.
`tool`	Results of tool/function calls. Must include a `tool_call_id` that matches the original tool call.
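A client-side check can catch role mistakes before a request is sent. A minimal sketch; validate_messages is a hypothetical helper, and the set of valid roles is taken from the table above.

```python
VALID_ROLES = {"system", "developer", "user", "assistant", "tool"}


def validate_messages(messages):
    """Return a list of problems found in a messages array (hypothetical helper)."""
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        # A tool result must point back at the tool call it answers.
        if role == "tool" and not msg.get("tool_call_id"):
            problems.append(f"message {i}: tool message without tool_call_id")
    return problems


ok = [
    {"role": "developer", "content": "Answer briefly."},
    {"role": "user", "content": "Hi"},
]
bad = [{"role": "tool", "content": "22C"}, {"role": "bot", "content": "?"}]
```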

For newer models (GPT-4.1, the GPT-5 series, o-series), prefer developer over system for instruction messages. Both work, but developer provides stronger instruction following.

Multimodal input

Many models accept images and audio alongside text. To send a multimodal message, use the array form of content:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this image"},
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.png",
        "detail": "high"
      }
    }
  ]
}

The detail parameter controls how closely the image is analyzed:

low: faster, uses fewer tokens (fixed cost)
high: detailed analysis, consumes more tokens
auto: the model decides (default)
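Building the content array by hand is easy to get wrong, so a small constructor can help. A sketch; image_message is a hypothetical helper that produces the same shape as the JSON example above and rejects detail values other than the three listed.

```python
ALLOWED_DETAIL = ("low", "high", "auto")


def image_message(text, image_url, detail="auto"):
    """Build a user message with one text part and one image part (hypothetical helper)."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {ALLOWED_DETAIL}")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
        ],
    }


msg = image_message("Describe this image", "https://example.com/image.png", "high")
```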

Streaming

When stream is set to true, the response is delivered as Server-Sent Events (SSE). Each event carries a chat.completion.chunk object whose delta holds the next piece of content:

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
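The stream can be consumed with only the standard library. A sketch assuming n = 1 (a single choice) and input already split into lines; collect_stream is a hypothetical helper, and the sample lines are the ones shown above.

```python
import json


def collect_stream(lines):
    """Join delta.content from SSE data lines; return (text, finish_reason)."""
    parts, finish_reason = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # The first delta carries only the role, so content may be absent.
            parts.append(choice.get("delta", {}).get("content") or "")
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


sample = [
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
text, reason = collect_stream(sample)
```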

To include token usage statistics in a streaming response, set stream_options.include_usage to true. The usage data appears in the last chunk before [DONE].
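In the OpenAI format, the chunk that carries usage typically has an empty choices array, while earlier chunks carry usage: null. A sketch that works on already-decoded chunk dicts; stream_request and usage_from_chunks are hypothetical helpers, and the token counts are illustrative, not real measurements.

```python
def stream_request(model, messages):
    """Request body that asks for token usage at the end of the stream."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,
        "stream_options": {"include_usage": True},
    }


def usage_from_chunks(chunks):
    """Return the usage object from parsed chunks, or None if none was sent."""
    for chunk in reversed(chunks):
        if chunk.get("usage"):
            return chunk["usage"]
    return None


chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hi"}, "finish_reason": "stop"}], "usage": None},
    # Final usage-only chunk; the numbers are made up for the example.
    {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 1, "total_tokens": 10}},
]
```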

Structured Outputs

Use response_format to force the model to return valid JSON that matches a specific schema:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "result",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "answer": {"type": "string"},
          "confidence": {"type": "number"}
        },
        "required": ["answer", "confidence"],
        "additionalProperties": false
      }
    }
  }
}

JSON Schema mode (json_schema, with strict: true as in the example) ensures the output matches your schema exactly. JSON Object mode (json_object) only guarantees valid JSON; the structure is not enforced.
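With json_object mode (and as a defensive habit in json_schema mode), it is worth checking the reply before using it. A standard-library sketch that checks message.content against the example schema above; parse_result is a hypothetical helper, and a full JSON Schema validator such as the third-party jsonschema package would be the more general choice.

```python
import json

# Field names and Python types mirroring the example schema above.
EXPECTED = {"answer": str, "confidence": (int, float)}


def parse_result(content):
    """Parse message.content and check it against the example schema (sketch)."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = sorted(set(EXPECTED) - set(data))
    extra = sorted(set(data) - set(EXPECTED))  # additionalProperties: false
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    for key, types in EXPECTED.items():
        value = data[key]
        # bool is a subclass of int in Python, so rule it out explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            raise TypeError(f"{key}: unexpected type {type(value).__name__}")
    return data


result = parse_result('{"answer": "Paris", "confidence": 0.93}')
```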

Tool / Function Calling

Let the model call external functions by providing tool definitions:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

When the model decides to call a tool, the response has finish_reason: "tool_calls", and the message.tool_calls array contains the function name and its arguments. You then execute the function and send the result back as a tool message with the matching tool_call_id.
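The round trip described above can be sketched as follows. get_weather here is a hypothetical local stub returning canned data, and the call ID "call_1" is made up; in the OpenAI format the function arguments arrive as a JSON-encoded string.

```python
import json


def get_weather(location):
    """Hypothetical local implementation; returns canned data for the sketch."""
    return {"location": location, "temperature_c": 21}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message):
    """Execute each requested tool call and build the matching tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        # Arguments are a JSON string, not a dict, so decode them first.
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"location": "Hanoi"}'},
    }],
}
tool_messages = run_tool_calls(assistant_message)
```

To continue the conversation, append assistant_message and then tool_messages to messages and send the request again.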

Response Fields

Field	Description
`id`	Unique completion identifier (for example, `chatcmpl-abc123`).
`object`	Always `chat.completion` for non-streaming responses (streamed events use `chat.completion.chunk`).
`model`	The model that generated this response (may include a version suffix).
`choices`	Array of completion choices (usually 1 unless `n` > 1).
`choices[].message`	The assistant's response message, with `role`, `content`, and possibly `tool_calls`.
`choices[].finish_reason`	Why the model stopped: `stop`, `length`, `tool_calls`, or `content_filter`.
`usage`	Token usage details: `prompt_tokens`, `completion_tokens`, `total_tokens`, plus more detailed sub-counts.
`system_fingerprint`	Fingerprint of the backend configuration, useful when debugging reproducibility.
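Most callers only need a handful of these fields. A sketch over a decoded non-streaming response; read_completion is a hypothetical helper, and the sample response is abbreviated with illustrative token counts.

```python
def read_completion(resp):
    """Pull the commonly used fields out of a non-streaming response dict."""
    choice = resp["choices"][0]
    message = choice["message"]
    return {
        "text": message.get("content"),
        "tool_calls": message.get("tool_calls") or [],
        "finish_reason": choice["finish_reason"],
        "total_tokens": (resp.get("usage") or {}).get("total_tokens"),
    }


# Abbreviated response; the token counts are illustrative placeholders.
resp = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "model": "gpt-5.4-2025-07-16",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hello!"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 9, "completion_tokens": 2, "total_tokens": 11},
}
info = read_completion(resp)
```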

Cross-provider notes

Parameter support across providers

Parameter	OpenAI GPT	Claude (via compat)	Gemini (via compat)
`temperature`	0–2	0–1	0–2
`top_p`	0–1	0–1	0–1
`n`	1–128	1 only	1–8
`stop`	Up to 4	Up to 4	Up to 5
`tools`	✅	✅	✅
`response_format`	✅	✅ (json_schema)	✅
`logprobs`	✅	❌	❌
`reasoning_effort`	o-series, GPT-5.1+	❌	❌ (use `thinking` for native Gemini)
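When one code path targets several providers, the table above can be encoded as data and checked before sending. A sketch; LIMITS is a snapshot transcribed from the table (the provider keys are hypothetical labels), so it may drift as compatibility layers change, and check_params is a hypothetical helper.

```python
# Snapshot of the table above; provider keys are hypothetical labels.
LIMITS = {
    "openai": {"temperature": (0, 2), "n": (1, 128), "max_stop": 4, "logprobs": True},
    "claude": {"temperature": (0, 1), "n": (1, 1), "max_stop": 4, "logprobs": False},
    "gemini": {"temperature": (0, 2), "n": (1, 8), "max_stop": 5, "logprobs": False},
}


def check_params(provider, params):
    """Return problems for params that exceed a provider's documented limits."""
    lim = LIMITS[provider]
    problems = []
    lo, hi = lim["temperature"]
    if "temperature" in params and not lo <= params["temperature"] <= hi:
        problems.append(f"temperature must be in [{lo}, {hi}]")
    lo, hi = lim["n"]
    if not lo <= params.get("n", 1) <= hi:
        problems.append(f"n must be in [{lo}, {hi}]")
    stop = params.get("stop", [])
    stops = [stop] if isinstance(stop, str) else stop  # stop may be a string or a list
    if len(stops) > lim["max_stop"]:
        problems.append(f"at most {lim['max_stop']} stop sequences")
    if params.get("logprobs") and not lim["logprobs"]:
        problems.append("logprobs not supported")
    return problems
```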

max_tokens vs. max_completion_tokens

max_tokens: the legacy parameter. Works with most models but is deprecated for newer OpenAI models.
max_completion_tokens: the recommended parameter for GPT-4.1, the GPT-5 series, and o-series models. Required for reasoning models, because it counts both visible output tokens and reasoning tokens.

CometAPI handles the mapping automatically when routing to different providers.
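Because CometAPI maps the parameter for you, choosing it client-side is optional; a client that also talks to providers directly might still want to pick it. A heuristic sketch: token_limit and the prefix pattern are assumptions based only on the model families named above, not an official list.

```python
import re

# Heuristic only: the families this page names as preferring max_completion_tokens.
_NEW_FAMILIES = re.compile(r"^(gpt-4\.1|gpt-5|o\d)")


def token_limit(model, limit):
    """Return the token-limit parameter to send for `model` (hypothetical helper)."""
    key = "max_completion_tokens" if _NEW_FAMILIES.match(model) else "max_tokens"
    return {key: limit}
```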

system vs. developer role

system: the traditional instruction role. Works with all models.
developer: introduced with the o1 models. Provides stronger instruction following on newer models and falls back to system behavior on older ones.

Use developer for new projects targeting GPT-4.1+ or o-series models.

Frequently asked questions

How do I handle rate limits?

When you receive 429 Too Many Requests, retry with exponential backoff:

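import time
import random
from openai import OpenAI, RateLimitError

# Point the official OpenAI SDK at CometAPI.
client = OpenAI(
    base_url="https://api.cometapi.com/v1",
    api_key="<COMETAPI_KEY>",
)

def chat_with_retry(messages, max_retries=3):
    """Call Chat Completions, retrying on HTTP 429 with exponential backoff."""
    for i in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-5.4",
                messages=messages,
            )
        except RateLimitError:
            if i < max_retries - 1:
                # Wait 1s, 2s, 4s, ... plus up to 1s of random jitter.
                wait_time = (2 ** i) + random.random()
                time.sleep(wait_time)
            else:
                # Out of attempts: re-raise the last 429 to the caller.
                raise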

How do I maintain conversation context?

Include the full conversation history in the messages array:

messages = [
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level programming language..."},
    {"role": "user", "content": "What are its main advantages?"},
]
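Chat Completions does not remember earlier requests, so after each call the reply must be appended to the history before the next user turn. A sketch; add_turn is a hypothetical helper that returns a new list rather than mutating the old one.

```python
def add_turn(history, assistant_reply, next_user_message):
    """Return a new history with the model's reply and the next user turn appended."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


history = [
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
history = add_turn(
    history,
    "Python is a high-level programming language...",
    "What are its main advantages?",
)
```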

What does `finish_reason` mean?

Value	Meaning
`stop`	The model finished naturally or hit a stop sequence.
`length`	The `max_tokens` or `max_completion_tokens` limit was reached.
`tool_calls`	The model called one or more tools/functions.
`content_filter`	The output was filtered under the content policy.
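Client code usually branches on this value. A sketch; the ACTIONS messages are a hypothetical policy, and next_action is a hypothetical helper applied to one entry of choices.

```python
# Hypothetical policy: what a caller might do for each documented value.
ACTIONS = {
    "stop": "done",
    "length": "truncated: raise the token limit or shorten the prompt",
    "tool_calls": "run the requested tools and send tool messages back",
    "content_filter": "filtered: do not treat the content as a complete answer",
}


def next_action(choice):
    """Map a choice's finish_reason to a suggested next step."""
    reason = choice.get("finish_reason")
    return ACTIONS.get(reason, f"unexpected finish_reason: {reason!r}")
```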

How do I control costs?

Use max_completion_tokens to cap output length.
Choose cost-effective models (for example, gpt-5.4-mini or gpt-5.4-nano for simpler tasks).
Keep prompts concise and avoid redundant context.
Monitor token usage in the usage field of each response.
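Monitoring usage can be as simple as summing each response's usage object. A sketch with collections.Counter; add_usage is a hypothetical helper, and the sample numbers are illustrative, not real measurements.

```python
from collections import Counter

USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def add_usage(totals, usage):
    """Add one response's usage object into a running Counter."""
    for key in USAGE_KEYS:
        totals[key] += (usage or {}).get(key, 0)
    return totals


totals = Counter()
# Illustrative usage objects from two hypothetical responses.
for usage in [
    {"prompt_tokens": 120, "completion_tokens": 40, "total_tokens": 160},
    {"prompt_tokens": 80, "completion_tokens": 25, "total_tokens": 105},
]:
    add_usage(totals, usage)
```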

Authorization

Authorization (string, header, required)
Bearer token authentication. Use your CometAPI key.
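For clients that do not use an SDK, the header is an ordinary Authorization: Bearer value. A standard-library sketch that prepares the request without sending it; build_request is a hypothetical helper, and the /chat/completions path is assumed from the OpenAI convention.

```python
import json
import urllib.request


def build_request(api_key, body, base_url="https://api.cometapi.com/v1"):
    """Prepare (but do not send) a Chat Completions POST with Bearer auth."""
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("<COMETAPI_KEY>", {
    "model": "gpt-5.4",
    "messages": [{"role": "user", "content": "Hello!"}],
})
# To actually send it: urllib.request.urlopen(req, timeout=60)
```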

Body

application/json

model (string, required; default: gpt-5.4)
Model ID to use for this request. See the Models page for current options. Example: "gpt-4.1"

messages (object[], required)
A list of messages forming the conversation. Each message has a role (system, developer, user, assistant, or tool) and content (a text string or a multimodal content array).

stream (boolean)
If true, partial response tokens are delivered incrementally as server-sent events (SSE). The stream ends with a data: [DONE] message.

temperature (number; default: 1; range: 0 <= x <= 2)
Sampling temperature between 0 and 2. Higher values (e.g., 0.8) produce more random output; lower values (e.g., 0.2) make output more focused and deterministic. Adjust either this or top_p, but not both.

top_p (number; default: 1; range: 0 <= x <= 1)
Nucleus sampling parameter. The model considers only the smallest set of tokens whose cumulative probability reaches top_p. For example, 0.1 means only the tokens making up the top 10% of probability mass are considered. Adjust either this or temperature, but not both.

n (integer; default: 1)
Number of completion choices to generate for each input message. Not every model supports n > 1 (Claude via compat accepts only 1).

stop (string | string[])
Up to 4 sequences (up to 5 for Gemini via compat) where the API will stop generating further tokens. Can be a string or an array of strings.

max_tokens (integer)
Maximum number of tokens to generate in the completion. The total of input + output tokens is capped by the model's context length. Deprecated for newer OpenAI models; use max_completion_tokens instead.

presence_penalty (number; default: 0; range: -2 <= x <= 2)
Number between -2.0 and 2.0. Positive values penalize tokens based on whether they have already appeared, encouraging the model to explore new topics.

frequency_penalty (number; default: 0; range: -2 <= x <= 2)
Number between -2.0 and 2.0. Positive values penalize tokens in proportion to how often they have appeared, reducing verbatim repetition.

logit_bias (object)
A JSON object mapping token IDs to bias values from -100 to 100. The bias is added to the model's logits before sampling. Values between -1 and 1 subtly adjust likelihood; values of -100 or 100 effectively ban or force selection of a token.

user (string)
A unique identifier for your end user. Helps with abuse detection and monitoring.

max_completion_tokens (integer)
An upper bound on the number of tokens to generate, including visible output tokens and reasoning tokens. Use this instead of max_tokens for GPT-4.1+, GPT-5 series, and o-series models.

response_format (object)
Specifies the output format. Use {"type": "json_object"} for JSON mode, or {"type": "json_schema", "json_schema": {...}} for strict structured output.

tools (object[])
A list of tools the model may call. Currently supports function type tools.

tool_choice (default: auto)
Controls how the model selects tools. auto (default): the model decides. none: the model calls no tools. required: the model must call at least one tool.

logprobs (boolean; default: false)
Whether to return log probabilities of the output tokens.

top_logprobs (integer; range: 0 <= x <= 20)
Number of most likely tokens to return at each position (0-20). Requires logprobs to be true.

reasoning_effort (enum<string>: low, medium, high)
Controls the reasoning effort for o-series and GPT-5.1+ models.

stream_options (object)
Options for streaming. Only valid when stream is true.

service_tier (enum<string>: auto, default, flex, priority)
Specifies the processing tier.
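The ranges and cross-field rules above can be checked before a request leaves the client. A sketch; validate_body is a hypothetical helper covering only the rules listed in this section, and the token ID used in the usage lines is made up.

```python
RANGES = {
    "temperature": (0, 2),
    "top_p": (0, 1),
    "presence_penalty": (-2, 2),
    "frequency_penalty": (-2, 2),
    "top_logprobs": (0, 20),
}


def validate_body(body):
    """Check the documented numeric ranges and cross-field rules of a request body."""
    errors = []
    for name, (lo, hi) in RANGES.items():
        if name in body and not lo <= body[name] <= hi:
            errors.append(f"{name}={body[name]} outside [{lo}, {hi}]")
    if "top_logprobs" in body and not body.get("logprobs"):
        errors.append("top_logprobs requires logprobs=true")
    if "stream_options" in body and not body.get("stream"):
        errors.append("stream_options is only valid when stream is true")
    for token_id, bias in (body.get("logit_bias") or {}).items():
        if not -100 <= bias <= 100:
            errors.append(f"logit_bias[{token_id}]={bias} outside [-100, 100]")
    # A recommendation rather than a hard rule, so report it as a warning.
    if "temperature" in body and "top_p" in body:
        errors.append("warning: adjust temperature or top_p, not both")
    return errors


# Usage: "50256" is a hypothetical token ID.
print(validate_body({"temperature": 3, "top_logprobs": 5}))
print(validate_body({"logit_bias": {"50256": -100}}))
```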

Response

200 (application/json): Successful chat completion response.

id (string)
Unique completion identifier. Example: "chatcmpl-abc123"

object (enum<string>: chat.completion)
Example: "chat.completion"

created (integer)
Unix timestamp of creation. Example: 1774412483

model (string)
The model used (may include a version suffix). Example: "gpt-5.4-2025-07-16"

choices (object[])
Array of completion choices.

usage (object)
Token usage details.

service_tier (string)
Example: "default"

system_fingerprint (string | null)
Example: "fp_490a4ad033"
