Reasoning effort in GPT-5.4 and GPT-5.5: when to use low, medium, high and xhigh
If you work with GPT-5.4 or GPT-5.5 through the API, one parameter shapes the result more strongly than the wording of the prompt: reasoning.effort. It determines how deeply the model “thinks” before answering, how much the request costs, and how long you wait.
Most developers either ignore this parameter (leaving the default) or set high everywhere just in case, and both approaches are suboptimal. This article goes through each level separately, with real numbers for price and latency, and gives clear selection criteria.
What is Reasoning Effort
reasoning.effort controls the compute budget the model spends on “thinking” before it produces a final answer. Reasoning models generate reasoning tokens: internal tokens the model uses to break a request into parts and weigh different approaches before responding.
Key technical point: reasoning tokens are billed as output tokens. There is no separate surcharge, but the higher the effort, the more tokens are generated and the more you pay. A high-effort request can generate twice as many tokens as a low-effort one for the same task, and you pay for all of them.
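To see how much of a bill comes from reasoning, the usage block of a response can be split into visible and reasoning tokens. Below is a minimal sketch, assuming a usage dictionary shaped like the Responses API usage object (`output_tokens` plus `output_tokens_details.reasoning_tokens`). The helper name `reasoning_share` is hypothetical and the sample numbers are invented for illustration.

```python
def reasoning_share(usage: dict) -> float:
    """Return the fraction of billed output tokens that were reasoning tokens."""
    output_tokens = usage.get("output_tokens", 0)
    details = usage.get("output_tokens_details") or {}
    reasoning_tokens = details.get("reasoning_tokens", 0)
    if output_tokens == 0:
        return 0.0
    return reasoning_tokens / output_tokens


# Invented sample: 1,200 billed output tokens, 900 of them reasoning.
sample_usage = {
    "input_tokens": 350,
    "output_tokens": 1200,
    "output_tokens_details": {"reasoning_tokens": 900},
}
print(f"{reasoning_share(sample_usage):.0%} of output tokens were reasoning")
```

Logging this ratio per effort level on your own traffic shows where the extra spend actually goes.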
Five levels: What each means
Supported values are model-specific and may include none, minimal, low, medium, high and xhigh. Not all models support the entire set – it is worth checking the documentation of a particular model before choosing a setup.
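Because support is model-specific, it can help to validate the effort value before a request is sent. The sketch below assumes a hand-maintained support table: the model names and their sets are placeholders, not documented facts, and `build_request` is a hypothetical helper. The returned dictionary follows the Responses API request shape, where effort is passed as `reasoning={"effort": ...}`.

```python
ALL_EFFORTS = ("none", "minimal", "low", "medium", "high", "xhigh")

# Hypothetical support table: names and sets are placeholders.
# Fill it in from the documentation of the models you actually use.
SUPPORTED = {
    "example-new-model": {"none", "low", "medium", "high", "xhigh"},
    "example-older-model": {"minimal", "low", "medium", "high"},
}


def build_request(model: str, prompt: str, effort: str) -> dict:
    """Build request kwargs, rejecting effort values the model does not support."""
    if effort not in ALL_EFFORTS:
        raise ValueError(f"unknown effort level: {effort!r}")
    allowed = SUPPORTED.get(model)
    if allowed is None:
        raise ValueError(f"no support table entry for model {model!r}")
    if effort not in allowed:
        raise ValueError(f"{model!r} does not support effort {effort!r}")
    return {"model": model, "input": prompt, "reasoning": {"effort": effort}}


request = build_request("example-new-model", "Summarize this diff.", "xhigh")
print(request["reasoning"])
```

With the OpenAI Python SDK, a dictionary like this would typically be unpacked into `client.responses.create(**request)`.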
none – no reasoning
The model behaves like a non-reasoning model: it responds immediately, without internal “thinking.” This is the fastest and cheapest option.
When to use: tasks that need neither reasoning nor chains of tool calls, such as short voice replies, quick lookups, and classification.
low – efficient reasoning
The lowest level of reasoning, with an emphasis on speed. The model still “thinks,” but briefly.
When to use: latency-sensitive tasks. If tools, planning, search, or multi-step decisions are involved, evaluate low rather than none, because none may be too limited for such scenarios.
medium – the balance point (default)
GPT-5.5 uses medium reasoning effort by default. This is the recommended starting point for balancing quality, reliability, latency, and cost.
If you do not specify reasoning.effort at all, the model uses this level. For most API tasks it is the right default.
When to use: the baseline for anything that is not clearly a none/low or a high/xhigh case.
high – for complex agentic tasks
A high level of reasoning intended for complex agentic tasks that require serious thinking and where latency is not critical.
When to use: batch processing (code review, document analysis, data extraction). The user is not blocked waiting, so the extra latency does not matter, and the accuracy gain adds up across hundreds of items.
xhigh – maximum depth for the most complex tasks
xhigh is intended for the most complex asynchronous agentic tasks and for evals that probe the limits of model intelligence.
xhigh was added as a reasoning effort level for models released after GPT-5.1 Codex Max; earlier models do not support it at all.
When to use: High-risk single queries – codebase security audit, complex migration planning, new algorithm development. This is where increased computing pays off.
Real numbers: latency and cost by level
This is the most important part: concrete data rather than general statements.
Time to first token (TTFT)
The time to the first token at xhigh for GPT-5.5 is about 115 seconds on the Responses API – this is not a typo. If the product interface is designed for a streaming response within five seconds, xhigh cannot be put on the main user path.
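One way to enforce this is a latency gate that picks the highest effort whose measured time to first token fits the interface budget. This is a sketch: only the 115 seconds for xhigh comes from the text above, while the other figures are placeholders to be replaced with your own measurements, and `highest_effort_within` is a hypothetical helper.

```python
from typing import Dict, Optional

EFFORT_ORDER = ["none", "low", "medium", "high", "xhigh"]


def highest_effort_within(budget_s: float, measured_ttft_s: Dict[str, float]) -> Optional[str]:
    """Return the highest effort level whose measured TTFT fits the budget."""
    best = None
    for effort in EFFORT_ORDER:
        ttft = measured_ttft_s.get(effort)
        if ttft is not None and ttft <= budget_s:
            best = effort
    return best


# Placeholder measurements in seconds; only the xhigh value is from the article.
measured = {"none": 0.5, "low": 2.0, "medium": 6.0, "high": 30.0, "xhigh": 115.0}

print(highest_effort_within(5.0, measured))    # low
print(highest_effort_within(120.0, measured))  # xhigh
```

A return value of `None` means no measured level fits the budget, which is a signal to rethink the interface rather than the effort setting.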
Cost
Measurements on the Artificial Analysis Intelligence Index at different effort levels for GPT-5.5 (the medium row is a per-task figure from ProfBench):
| Effort level | Tokens generated | Run cost |
|---|---|---|
| medium (default) | ~23,773 (per task, ProfBench) | baseline |
| high | ~45M tokens total | $2,159 |
| xhigh | ~75M tokens total | $3,357 |
For comparison, GPT-5.4 at xhigh on the same benchmark generated about 120 million tokens in total at a cost of $2,851, which is more than GPT-5.5 at high.
An xhigh request can cost 3-5 times more than the same request at low. Factor this into the project budget, especially when batch processing a large number of tasks.
Basic API prices (per 1M tokens)
| Model | Input | Output | Cached input |
|---|---|---|---|
| GPT-5.4 | $2.50 | $15 | not listed |
| GPT-5.5 | $5 | $30 | $0.50 (90% discount) |
| GPT-5.5 Pro | $30 | $180 | no cache discount |
An important nuance: GPT-5.5 Pro has no discount on cached input. If your workflow has a long, stable preamble (a repeated system prompt or context), this removes one of the main reasons to keep long prefixes. For requests with a lot of repeated context, it is wiser either to use regular GPT-5.5 or to shorten the context.
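The effect of the missing cache discount is easiest to see in a per-request estimate. The sketch below uses the per-1M-token prices from the table above; the dictionary keys are illustrative labels rather than confirmed model identifiers, `request_cost` is a hypothetical helper, and the token counts are an invented example. Reasoning tokens are counted inside `output_tokens`, since they are billed as output.

```python
# Prices in USD per 1M tokens, taken from the table above.
# GPT-5.5 Pro has no cache discount, so cached input costs the full input price.
PRICES_PER_M = {
    "gpt-5.5": {"input": 5.00, "output": 30.00, "cached_input": 0.50},
    "gpt-5.5-pro": {"input": 30.00, "output": 180.00, "cached_input": 30.00},
}


def request_cost(model: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request; output_tokens includes reasoning tokens."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    price = PRICES_PER_M[model]
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * price["input"]
        + cached_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


# Invented example: 20,000 input tokens (18,000 cached) and 3,000 output tokens.
for model in PRICES_PER_M:
    print(model, round(request_cost(model, 20_000, 18_000, 3_000), 3))
```

For this example the long cached prefix costs $0.009 on GPT-5.5 and $0.54 on GPT-5.5 Pro, which is why a stable preamble stops being cheap on the Pro tier.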
The main mistake: “higher = better” is not true
This is the key thesis to remember: higher reasoning effort does not automatically mean a better result.
If a task contains conflicting instructions, weak stop criteria, or open-ended access to tools, higher effort can lead to overthinking, excessive searching, or even worse output.
This is counterintuitive but logical: given a long “time to think” and fuzzy instructions, the model starts doing your thinking for you, generating extra steps, re-checks, and alternative paths that were not needed and that pull away from the core of the task.
What to do instead of increasing effort
In developers' practical experience, it pays to check and improve the following before raising reasoning effort. This often gives a better result than switching to high or xhigh:
- Clarity of instructions. State precisely what output you need.
- Few-shot examples. A few examples of the desired result often help more than extra “thinking.”
- Structured output. Use the response format to constrain and steer the shape of the answer.
- Verification steps. Ask the model to check its work before the final answer.
- Decomposition. Break a complex task into simpler subtasks: several medium requests instead of one huge xhigh request.
The advice from community discussions: first improve completion rules, verification loops, and tool-usage rules, and only then raise reasoning effort.
verbosity: the second parameter, often confused with effort
In addition to reasoning.effort, GPT-5.x has a separate parameter, text.verbosity, with the values low, medium, and high. These are different things: effort affects how much the model thinks, while verbosity affects how much it writes in the final answer.
The verbosity parameter consistently scales both the length and the depth of the model's output while preserving correctness and reasoning quality, without changing the prompt itself.
In practice, the difference is as follows:
- low verbosity: a minimal, functional result without extra comments or structure
- medium verbosity: adds explanatory comments, function structure, and reproducibility controls
- high verbosity: a comprehensive, production-ready result with argument parsing, several approaches, runtime checks, and usage notes
An important nuance for GPT-5.5: on this model, low verbosity produces more concise answers than the same low value on GPT-5.4. The same parameter value compresses output to a different degree on different models.
Effort and verbosity - independent axes
These are two different settings that can be combined:
| | low verbosity | high verbosity |
|---|---|---|
| low effort | Fast, concise, minimal reasoning | Brief reasoning, but verbose output |
| high effort | Deep reasoning, concise answer | Deep reasoning plus a detailed answer (most expensive) |
For example, for classifying edge cases, medium effort + low verbosity makes sense: the model thinks enough to classify correctly but does not spend tokens explaining each decision.
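In a request, the two axes are set independently. The sketch below builds keyword arguments in the Responses API shape, with effort under `reasoning` and verbosity under `text`. The function name `classification_request` and the model identifier `"gpt-5.5"` are assumptions made for illustration.

```python
VERBOSITY_LEVELS = ("low", "medium", "high")


def classification_request(prompt: str, effort: str = "medium",
                           verbosity: str = "low", model: str = "gpt-5.5") -> dict:
    """Build request kwargs with effort and verbosity set as separate knobs."""
    if verbosity not in VERBOSITY_LEVELS:
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    return {
        "model": model,                      # assumed identifier
        "input": prompt,
        "reasoning": {"effort": effort},     # how much the model thinks
        "text": {"verbosity": verbosity},    # how much the model writes
    }


request = classification_request("Label this ticket: 'refund not received'")
print(request["reasoning"], request["text"])
```

Changing one knob leaves the other untouched, so the same prompt can be tested across the four combinations in the table.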
Practical table of choice by type of task
| Task type | Recommended effort | Rationale |
|---|---|---|
| Autocomplete, real-time chat | none or low | Latency is critical; reasoning adds no value |
| Classification, simple lookup | none | The task does not require multi-step thinking |
| Voice replies (voice UI) | none | The user expects an instant response |
| General API tasks with no special requirements | medium (default) | Best balance of quality, price, and speed for most cases |
| Code review in a pipeline (does not block the user) | high | Latency does not matter; accuracy gains accumulate across many tasks |
| Batch document analysis | high | Same reasoning: asynchronous processing |
| Codebase security audit | xhigh | The high cost of an error justifies 12x the compute |
| Complex migration planning | xhigh | One-off task where a wrong decision is costly |
| Refactoring with subtle architectural invariants | xhigh or an alternative model | See the section on competitors below |
| Eval / model benchmarking | xhigh | The goal is to probe the limits of the model's capabilities |
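The table can be encoded as a small lookup so the effort choice lives in one place instead of being scattered across call sites. This is a sketch: the task-type keys are hypothetical labels, and where the table offers two options the first one listed is returned. Unknown task types fall back to medium, the default level.

```python
# Keys are hypothetical labels; values mirror the table's recommendations in order.
RECOMMENDED_EFFORT = {
    "autocomplete": ("none", "low"),
    "realtime_chat": ("none", "low"),
    "classification": ("none",),
    "voice_reply": ("none",),
    "general": ("medium",),
    "pipeline_code_review": ("high",),
    "batch_document_analysis": ("high",),
    "security_audit": ("xhigh",),
    "migration_planning": ("xhigh",),
    "eval": ("xhigh",),
}


def recommend_effort(task_type: str) -> str:
    """Return the first recommended effort for a task type, defaulting to medium."""
    return RECOMMENDED_EFFORT.get(task_type, ("medium",))[0]


print(recommend_effort("security_audit"))
print(recommend_effort("something_new"))
```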
Computer Use and Web Search: Special Effort Requirements
A separate technical nuance: the web search tool in the API requires a reasoning model; non-reasoning GPT-5 variants do not expose this tool in the same way.
For Computer Use, GPT-5.4 scored 75% on the OSWorld benchmark, exceeding the human expert baseline of 72.4%. This mode is enabled by passing the computer_use tool type, and it is worth testing different effort levels here too, since UI-control tasks are usually multi-step and benefit from reasoning above low.
How GPT-5.5 compares to competitors at different effort levels
If you are choosing between models for a specific task, here are a few reference points that were current when GPT-5.5 was released.
Against Claude Opus 4.7 on complex coding tasks: GPT-5.5 trails Opus 4.7 on SWE-bench Pro, 58.6% versus 64.3%. If your scenario is closest to “fix a real GitHub bug across 40 files,” Opus may be the default choice regardless of the GPT-5.5 effort level.
Against price-optimized models: alternatives such as DeepSeek V4 Pro cost about 7 times less than standard GPT-5.5 and remain competitive on several intelligence benchmarks. If cost is the main factor in your project and you are choosing GPT-5.5 for overall quality rather than for unique features, test cheaper alternatives before switching to high/xhigh.
GPT-5.5 Pro and xhigh should be reserved for genuinely frontier tasks: research, complex mathematics, multi-file refactorings with subtle invariants. Do not put such requests on the hot path of a high-load product.
Technical details for developers
Parallel tool calls and minimal effort
Parallel tool calls are not supported when reasoning_effort is set to minimal, which matters if your agent relies on calling several tools at once.
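A small guard can keep this combination out of production requests. The sketch below assumes request keyword arguments stored in a dictionary and handles both spellings of the setting: top-level `reasoning_effort` (Chat Completions) and nested `reasoning.effort` (Responses API). `parallel_tool_calls` is the request flag being switched off, and the helper names are hypothetical.

```python
from typing import Optional


def effort_of(request: dict) -> Optional[str]:
    """Read the effort level from either request shape."""
    if "reasoning_effort" in request:
        return request["reasoning_effort"]
    return (request.get("reasoning") or {}).get("effort")


def enforce_tool_call_policy(request: dict) -> dict:
    """Return a copy of the request with parallel tool calls disabled at minimal effort."""
    adjusted = dict(request)
    if effort_of(adjusted) == "minimal":
        adjusted["parallel_tool_calls"] = False
    return adjusted


request = {"reasoning": {"effort": "minimal"}, "parallel_tool_calls": True}
print(enforce_tool_call_policy(request)["parallel_tool_calls"])  # False
```

Silently disabling the flag is one design choice; raising an error instead would make the conflict visible to whoever configured the agent.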
System and developer messages
Modern reasoning models support system messages to ease migration. Using both a developer message and a system message in the same request is not recommended, because it can cause conflicts in how instructions are processed.
Chat Completions vs Responses API
Reasoning models use the max_completion_tokens parameter in the Chat Completions API, whereas the Responses API uses max_output_tokens. This matters most at high/xhigh effort: without an explicit limit, the model can generate far more reasoning tokens than planned, and you will get an unexpected bill.
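Since the limit has a different name in each API, a thin helper can set the right one. This is a sketch with a hypothetical function name and hypothetical `api` labels; only the two parameter names come from the text above.

```python
def with_output_cap(api: str, request: dict, cap: int) -> dict:
    """Return a copy of the request with the output token limit for the given API."""
    if cap <= 0:
        raise ValueError("cap must be a positive number of tokens")
    capped = dict(request)
    if api == "responses":
        capped["max_output_tokens"] = cap
    elif api == "chat_completions":
        capped["max_completion_tokens"] = cap
    else:
        raise ValueError(f"unknown api: {api!r}")
    return capped


base = {"reasoning": {"effort": "xhigh"}}
print(with_output_cap("responses", base, 20_000))
```

The limit covers reasoning tokens as well as the visible answer, so a cap that is too tight at high effort can leave little room for the answer itself.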
Preamble to reduce perceived delay
For applications that are sensitive to latency, you can ask the model to generate a short preamble before moving on to deeper reasoning – this gives the user a faster first visible token, even if the final answer is generated longer.
Adaptive reasoning
Models reason adaptively within a given effort level, using fewer tokens for the simple parts of a request and more for the complex ones. Even at high effort, the model does not spend the same number of tokens on each subtask; simple parts are processed faster.
Checklist before changing reasoning effort
● Baseline benchmark run at medium (default) for comparison
● Completion criteria checked: are the stop conditions in the prompt clear?
● Instructions checked for contradictions: are there conflicting requirements?
● Few-shot examples tried before raising effort
● Structured output enforced through the response format
● Self-check step added before the final answer
● Complex task decomposed into subtasks
● Only now, if none of the above helps, high/xhigh tested
● For high/xhigh, a max output token limit is set to avoid uncontrolled cost growth
● For latency-sensitive paths, xhigh is excluded from the product's hot path
● Verbosity configured separately from effort to get the desired answer format
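The baseline-first logic of the checklist can be expressed as a decision rule over eval results. This is a sketch under stated assumptions: the pass rates are invented, the 5-point minimum gain is an arbitrary threshold, and `choose_effort` is a hypothetical helper rather than an established method.

```python
ESCALATION_ORDER = ["medium", "high", "xhigh"]


def choose_effort(pass_rates: dict, baseline: str = "medium", min_gain: float = 0.05) -> str:
    """Keep the baseline unless a higher level beats it by at least min_gain."""
    base_rate = pass_rates[baseline]
    start = ESCALATION_ORDER.index(baseline) + 1
    for effort in ESCALATION_ORDER[start:]:
        if effort in pass_rates and pass_rates[effort] - base_rate >= min_gain:
            return effort
    return baseline


# Invented pass rates from a hypothetical eval set.
print(choose_effort({"medium": 0.82, "high": 0.84, "xhigh": 0.90}))  # xhigh
print(choose_effort({"medium": 0.82, "high": 0.83}))                 # medium
```

The rule returns the lowest level that clears the threshold, so a small gain at high does not justify the switch by itself.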
Summary
Reasoning effort is not a “quality slider” to be turned up to the maximum for better results. It is a trade-off slider between speed, price, and depth of thought, and for most API tasks the right value is medium, which is the default.
Raise it to high only for asynchronous batch jobs where latency is not critical and the accuracy gained across many requests justifies the higher cost. Raise it to xhigh only for rare tasks where errors are expensive: security audits, architectural planning, frontier research. Never put xhigh on a product's hot path, given latency of up to 115 seconds.
Before increasing effort, always check the cheaper levers first: prompt clarity, few-shot examples, structured output, verification, and task decomposition. They often improve quality more than moving from medium to high, at a fraction of the cost.