What to do if the API sometimes fails to respond: queues, retry and circuit breaker in simple words


You've written code that calls an external service: a weather API, a payment system, an email service. Everything works on your machine. You deploy to production, and every few hours something goes wrong. The API responds slowly, returns an error, or does not respond at all. The user sees a white screen or a 500 error.

It's not a bug in your code. It's normal life: any external service is sometimes unavailable. The question is how your app behaves when that happens.

This article covers three tools that solve the problem: retries, queues, and the circuit breaker. Each is explained in simple words, with analogies, and shown in code.


Why the API sometimes fails to respond

Before treating the problem, you need to understand its causes. An API can fail to respond in different ways, and this matters because different causes call for different solutions.

Temporary overload. The server normally copes with the load, but right now it has received too many requests. In a second or two it works again. Solution: wait and try again.

Exceeding the request limit (rate limit). You sent too many requests in a short time. The API says, "Stop, wait a minute." Solution: respect the limit and pace your requests, for example through a queue.

Lost packet. Your server and the API server are both up, but one particular request did not get through. Solution: repeat it.

The API is really down. Something broke on the provider's side, and the service will be down for an hour or two. Solution: stop trying for a while and continue later.

Slow response. The API is up, but it takes 30 seconds to answer instead of the usual 2. Solution: don't wait forever, set a timeout.

Each case requires its own strategy. Let's go through them in order.
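
For the slow-response case, a timeout can be sketched as a small wrapper around any promise. This is a minimal sketch, not a definitive implementation: the name withTimeout and the 5-second default are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms = 5000) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(
            () => reject(new Error(`Timed out after ${ms} ms`)),
            ms
        );
    });
    // Whichever settles first wins; the timer is cleared in either case.
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Usage would look like withTimeout(api.get('/data'), 2000). Note that this only stops waiting; it does not cancel the underlying request. If your HTTP client accepts an AbortSignal (fetch does), passing AbortSignal.timeout(ms) is one way to cancel the request as well.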


Retry: Try again, but wisely

What is it and why

The simplest idea: if the request fails, try again. It's like calling someone and hearing "the subscriber is unavailable": you hang up and call back in a minute.

The naive implementation looks like this:

javascript
// Bad retry: three attempts in a row, with no pause
async function fetchData() {
    for (let i = 0; i < 3; i++) {
        try {
            return await api.get('/data');
        } catch (error) {
            if (i === 2) throw error; // last attempt: rethrow the error
        }
    }
}

It's better than nothing. But there's a problem: the three attempts fire instantly, one after another. If the server is overloaded, you don't give it a break; you send it three requests instead of one.

Pause between attempts: exponential delay

A proper retry uses an increasing pause. The first attempt failed: wait one second. Failed again: wait two seconds. Then four seconds. And so on.

This is called exponential backoff. The term sounds complicated, but the idea is simple: each pause is twice as long as the previous one.

plaintext
Attempt 1: request → error → pause 1 second
Attempt 2: request → error → pause 2 seconds
Attempt 3: request → error → pause 4 seconds
Attempt 4: request → success

Why an increasing pause rather than a fixed one? When the server is recovering, it can barely handle the load at first. Fixed pauses create the same wave of requests over and over again. Increasing pauses give the server more time to recover.

Jitter: adding randomness

Imagine that 100 users get an error at the same moment. All 100 wait exactly 1 second and repeat the request. The server goes down again under the load. Everyone waits 2 seconds, and the synchronized hit repeats.

Jitter is a small random shift in the delay. Instead of exactly 1 second, the pause is, for example, anywhere from 0.75 to 1.25 seconds. It sounds like a trifle, but the 100 users are now spread out over time rather than hitting at once.

javascript
// Retry with exponential delay and jitter
async function withRetry(fn, maxAttempts = 3) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (error) {
            // If this was the last attempt, rethrow the error
            if (attempt === maxAttempts) throw error;

            // Base delay: 1s, 2s, 4s...
            const baseDelay = 1000 * Math.pow(2, attempt - 1);

            // Jitter: ±25% of the base delay
            const jitter = baseDelay * 0.25 * (Math.random() * 2 - 1);
            const delay = Math.round(baseDelay + jitter);

            console.log(`Attempt ${attempt} failed. Retrying in ${delay} ms...`);
            await new Promise(resolve => setTimeout(resolve, delay));
        }
    }
}

// Usage: your code does not change, you just wrap the call
const data = await withRetry(() => api.get('/weather'));

Important: not everything should be retried

Retrying only makes sense for temporary problems. If the API returned an "invalid API key" error, the answer won't change no matter how many times you repeat the request. Retry only on:

  • network errors (connection dropped)
  • status 429 (too many requests: wait, then retry)
  • status 500, 502, 503, 504 (temporary server problems)

Do not retry on:

  • 400 (bad request: fix the request)
  • 401 (invalid key: check the key)
  • 403 (no access: contact the provider)
  • 404 (the resource does not exist)
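
The rule above can be sketched as a predicate that the retry loop consults before trying again. This is a sketch under assumptions: it supposes that your HTTP client puts the status code on error.status (many clients use a different shape, such as error.response.status) and it treats any error without a status as a network error. The names isRetryable and retryIfTemporary are hypothetical.

```javascript
// Status codes that usually indicate a temporary problem.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);

// Assumption: HTTP errors carry a numeric `status`; errors without one
// (connection reset, DNS failure, timeout) are treated as network errors.
function isRetryable(error) {
    if (typeof error.status !== 'number') return true;
    return RETRYABLE_STATUSES.has(error.status);
}

async function retryIfTemporary(fn, { maxAttempts = 3, baseDelayMs = 1000 } = {}) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (error) {
            // Permanent errors (400, 401, 403, 404) and the last attempt are rethrown at once.
            if (!isRetryable(error) || attempt === maxAttempts) throw error;
            const delay = baseDelayMs * 2 ** (attempt - 1);
            await new Promise(resolve => setTimeout(resolve, delay));
        }
    }
}
```

In real code you would probably narrow the "no status" branch to known network error codes, so that an ordinary programming error is not retried by mistake.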

Queues: do the work in order, without losing tasks

The problem without a queue

Imagine that when a user registers, your site should send a welcome email through an email service. You write:

javascript
app.post('/register', async (req, res) => {
    await createUser(req.body);
    await sendWelcomeEmail(req.body.email); // right here
    res.json({ ok: true });
});

What happens if the email service is unavailable at the moment of registration? The user gets an error. The account was created, but the user doesn't know it: the request failed and no confirmation email arrived. You can add retry, but then the user waits several seconds while you try again. Bad.

What is a task queue

A task queue is a list of things to do. You add a task to the list ("send an email to this address") and immediately tell the user "all good". The email is sent a little later, in the background, by a separate process.

A good analogy is a supermarket checkout. When the cashier scans the goods, they do not immediately call the supplier for a new batch. The checkout just records the sale, and the warehouse deals with restocking later. The checkout does not wait for the supplier; it keeps working.

plaintext
User registers
↓
Your server: creates the account + adds a task to the queue
↓
Response to the user: "All done" (fast!)
↓ (in the background, separate process)
Worker takes the task from the queue
↓
Worker sends the email
↓ (if it failed)
Worker tries again in N seconds

What a queue gives you

Reliability. The task is not lost. If the worker crashes, on restart it picks up the unfinished tasks and continues.

Response speed. The user does not wait until the email is actually sent. They get an answer immediately.

Load control. You can limit the processing speed. If the email service accepts 10 emails per second, the worker sends at exactly that rate, no more.

Retry out of the box. Good queue libraries can automatically repeat a failed task with an exponential delay.
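
To see the mechanics behind these points without installing anything, here is a minimal in-memory sketch. It is only an illustration: TinyQueue is a hypothetical class, it keeps tasks in process memory (so, unlike a real queue backed by Redis, tasks are lost on restart), and it processes one task at a time.

```javascript
// Hypothetical in-memory queue: "add now, process later, retry on failure".
class TinyQueue {
    constructor(handler, { attempts = 3, baseDelayMs = 1000 } = {}) {
        this.handler = handler;         // async function that does the actual work
        this.attempts = attempts;       // maximum attempts per task
        this.baseDelayMs = baseDelayMs; // first pause; doubles after each failure
        this.tasks = [];
        this.failed = [];               // tasks that exhausted all attempts
        this.running = false;
    }

    // Returns immediately; the caller does not wait for the work itself.
    add(data) {
        this.tasks.push({ data, attemptsMade: 0 });
        if (!this.running) this.drain = this._run();
    }

    async _run() {
        this.running = true;
        while (this.tasks.length > 0) {
            const task = this.tasks.shift();
            try {
                task.attemptsMade++;
                await this.handler(task.data);
            } catch (error) {
                if (task.attemptsMade < this.attempts) {
                    // Simplification: the pause blocks the whole queue, then the
                    // task goes back to the end of the line.
                    const delay = this.baseDelayMs * 2 ** (task.attemptsMade - 1);
                    await new Promise(resolve => setTimeout(resolve, delay));
                    this.tasks.push(task);
                } else {
                    this.failed.push({ data: task.data, error });
                }
            }
        }
        this.running = false;
    }
}
```

A registration handler would call queue.add({ email }) and respond right away. A production queue adds what this sketch lacks: persistence, several workers, and rate limiting.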

Example with BullMQ (Node.js)

BullMQ is a popular queue library for Node.js; it uses Redis to store tasks.

bash
npm install bullmq ioredis

javascript
// queue.js: queue setup
const { Queue, Worker } = require('bullmq');
const Redis = require('ioredis');

// BullMQ workers require maxRetriesPerRequest: null on an ioredis connection
const connection = new Redis(process.env.REDIS_URL, { maxRetriesPerRequest: null });

// Create the queue
const emailQueue = new Queue('emails', { connection });

// Add a task: called on registration
async function scheduleWelcomeEmail(userEmail, userName) {
    await emailQueue.add(
        'welcome',                    // task name
        { email: userEmail, name: userName }, // data
        {
            attempts: 5,              // at most 5 attempts
            backoff: {
                type: 'exponential', // the delay grows
                delay: 2000,         // starting from 2 seconds
            },
        }
    );
    console.log(`Email task for ${userEmail} added to the queue`);
}

// Worker: processes the tasks (usually run as a separate process).
// sendEmail is your own function that calls the email service.
const worker = new Worker('emails', async (job) => {
    console.log(`Sending email: ${job.data.email}`);
    await sendEmail(job.data.email, job.data.name);
    console.log(`Email sent: ${job.data.email}`);
}, { connection });

worker.on('failed', (job, error) => {
    console.error(`Task ${job?.id} failed: ${error.message}`);
});

module.exports = { scheduleWelcomeEmail };

javascript
// server.js: registration endpoint
const { scheduleWelcomeEmail } = require('./queue');

app.post('/register', async (req, res) => {
    // Create the user
    const user = await createUser(req.body);

    // Add to the queue; do not wait for the email to be sent!
    await scheduleWelcomeEmail(user.email, user.name);

    // Respond to the user immediately
    res.json({ ok: true, userId: user.id });
});

Note: scheduleWelcomeEmail is a fast operation (just a write to Redis). The user receives a response in milliseconds. The email goes out a little later, in the background.

When you need a queue and when you don't

A queue is needed when:

  • the task may take more than a second
  • the task may fail and need to be repeated later
  • it is important not to lose the task when the server restarts
  • you need to control the processing rate

A queue is not needed when:

  • the user needs the result right now (search, authorization)
  • the operation is simple and always fast
  • losing the task is acceptable

Circuit Breaker: Automatic fuse

The problem without a circuit breaker

Imagine that the payment system API went down for two hours. You have retry set up: three attempts with pauses. Each user request now takes about 15 seconds (three attempts with increasing pauses) and still ends in an error.

At a load of 100 requests per minute, dozens of requests are "stuck" at any moment, and the number grows with the load. They take up workers, memory, and database connections. The server degrades or goes down, all because of an external API that is not even part of your code.

What is a circuit breaker

Circuit breaker literally means "automatic breaker". It is the same kind of fuse as in an electrical panel. As long as the current is normal, everything works. When a short circuit occurs, the fuse trips and disconnects the circuit, so the whole wiring does not burn.

In programming, a circuit breaker watches for errors in calls to an API. If there are too many, it trips and temporarily stops sending requests to the failing service. After a while it tries again, and if all is well, it returns to normal operation.

Three states

A circuit breaker works like a switch with three positions:

CLOSED: normal operation. Requests pass through freely. The circuit breaker counts errors. As long as there are few, nothing happens.

OPEN: protection activated. There were too many errors. The circuit breaker no longer lets requests through to the API; it returns an error immediately, without wasting time on attempts. The user gets a quick "the service is temporarily unavailable" instead of waiting 15 seconds.

HALF-OPEN: a check. Enough time has passed. The circuit breaker lets one trial request through. If it succeeds, the breaker goes back to CLOSED. If not, it goes back to OPEN.

plaintext
             too many errors
CLOSED ───────────────────────────> OPEN
   ^                                 │ ^
   │ trial request                   │ │ trial request
   │ succeeded     wait time expired │ │ failed
   │                                 v │
   └───────────────────────────── HALF-OPEN

Analogy for understanding

Imagine that you are calling support. Three times in a row you hear "all operators are busy" and hang up. On the fourth time you realize that something is clearly wrong and there is no point in calling every 30 seconds. You decide to call back in an hour.

An hour later, you call again. A circuit breaker follows the same logic, only automatically.

Example of a circuit breaker (Node.js)

javascript
// lib/circuit-breaker.js
class CircuitBreaker {
    constructor(options = {}) {
        // How many consecutive errors trip the breaker
        this.failureThreshold = options.failureThreshold || 5;

        // How many milliseconds to wait before a trial request
        this.recoveryTimeout = options.recoveryTimeout || 60_000; // 1 minute

        // Internal state
        this.failures = 0;           // consecutive error counter
        this.state = 'CLOSED';       // current state
        this.nextAttemptTime = null; // when we may try again
    }

    async call(fn) {
        // If OPEN, check whether it is time to try again
        if (this.state === 'OPEN') {
            if (Date.now() < this.nextAttemptTime) {
                // Too early: return an error immediately, without a request
                const waitSec = Math.round((this.nextAttemptTime - Date.now()) / 1000);
                throw new Error(`Service temporarily unavailable. Retry in ${waitSec}s`);
            }
            // Time is up: switch to HALF-OPEN for a trial request
            this.state = 'HALF-OPEN';
            console.log('Circuit Breaker: trial request...');
        }

        try {
            const result = await fn();
            this._onSuccess();
            return result;
        } catch (error) {
            this._onFailure();
            throw error;
        }
    }

    _onSuccess() {
        // The request succeeded: reset the counter, back to normal
        if (this.state === 'HALF-OPEN') {
            console.log('Circuit Breaker: service recovered, switching to CLOSED');
        }
        this.failures = 0;
        this.state = 'CLOSED';
    }

    _onFailure() {
        this.failures++;
        if (this.state === 'HALF-OPEN') {
            // The trial request failed: block again
            console.warn('Circuit Breaker: trial request failed, OPEN again');
            this.state = 'OPEN';
            this.nextAttemptTime = Date.now() + this.recoveryTimeout;
        } else if (this.failures >= this.failureThreshold) {
            // Error threshold exceeded: trip the breaker
            console.error(`Circuit Breaker: ${this.failures} errors in a row, switching to OPEN`);
            this.state = 'OPEN';
            this.nextAttemptTime = Date.now() + this.recoveryTimeout;
        }
    }
}

module.exports = { CircuitBreaker };

javascript
// Usage
const { CircuitBreaker } = require('./lib/circuit-breaker');

// Create one breaker per external service (not per request!)
const paymentBreaker = new CircuitBreaker({
    failureThreshold: 5,    // 5 errors in a row: trip
    recoveryTimeout: 60_000 // after 1 minute: trial request
});

async function processPayment(orderData) {
    try {
        return await paymentBreaker.call(() =>
            paymentApi.post('/charge', orderData)
        );
    } catch (error) {
        // Matches the message thrown by CircuitBreaker.call while OPEN
        if (error.message.includes('temporarily unavailable')) {
            // The circuit breaker tripped: the API is clearly down
            // You can put the task in a queue for later
            await paymentQueue.add('retry_payment', orderData, {
                delay: 5 * 60_000 // try again in 5 minutes
            });
            return { status: 'queued', message: 'The payment will be processed later' };
        }
        throw error;
    }
}

How it all works together

The three tools do not compete; they complement each other and cover different scenarios.

Retry is for short-term disruptions. The API hiccuped and came back after 2 seconds. Retry handles it on its own, and the user does not even notice.

Circuit breaker is for prolonged failures. The API is down for an hour. Without a circuit breaker, each request waits 15 seconds for timeouts and retries, eating up server resources. With it, the answer "unavailable" comes back immediately and the load does not pile up.

The queue is for tasks that are not needed right now. Sending emails, notifications, generating reports: all of this can be done later. The task is not lost and is retried when the service recovers.

plaintext
API request
    │
    ▼
[Circuit Breaker]
    ├─ OPEN? ──> immediately "unavailable" → add to the queue
    └─ CLOSED / HALF-OPEN
          │
          ▼
     [Retry with delay]
          ├─ Success ──> return the result
          └─ Error after all attempts ──> report to the circuit breaker
                                          if not urgent, add to the queue
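
The flow in the diagram can be sketched in code roughly as follows. This is a sketch under assumptions, not the one correct way to wire things: retry and createBreaker are compact stand-ins written for this example (jitter and logging are omitted for brevity), and callWithProtection and its enqueue option are hypothetical names.

```javascript
// Compact stand-in for a retry helper with exponential delay.
async function retry(fn, { maxAttempts = 3, baseDelayMs = 1000 } = {}) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (error) {
            if (attempt === maxAttempts) throw error;
            await new Promise(r => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
        }
    }
}

class BreakerOpenError extends Error {}

// Compact stand-in for a circuit breaker: returns a function that guards calls.
function createBreaker({ failureThreshold = 5, recoveryTimeoutMs = 60_000 } = {}) {
    let failures = 0;
    let openUntil = 0; // timestamp until which calls are rejected immediately
    return async function call(fn) {
        if (Date.now() < openUntil) {
            throw new BreakerOpenError('Service temporarily unavailable');
        }
        try {
            const result = await fn(); // after openUntil this acts as the trial request
            failures = 0;
            return result;
        } catch (error) {
            failures++;
            if (failures >= failureThreshold) openUntil = Date.now() + recoveryTimeoutMs;
            throw error;
        }
    };
}

// Breaker outside, retry inside: a call that fails after all retries
// counts as one failure for the breaker.
async function callWithProtection(breaker, fn, { enqueue, retryOptions } = {}) {
    try {
        const result = await breaker(() => retry(fn, retryOptions));
        return { status: 'done', result };
    } catch (error) {
        if (enqueue) {
            await enqueue(); // not urgent: hand the task over to a queue
            return { status: 'queued' };
        }
        throw error;         // urgent: let the caller show a clear message
    }
}
```

Putting the breaker outside the retry is a design choice: while the breaker is open, no retries are attempted at all, which is exactly the load relief described above.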

What to show the user when nothing helps

This is an important part that is often forgotten. If all attempts are exhausted, the user should get a clear answer, not an “Internal Server Error”.

Bad:

plaintext
500 Internal Server Error

Good:

plaintext
The payment service is temporarily unavailable. Your order is saved.
We will process the payment automatically within 15 minutes.
We'll send a confirmation email.

The difference is huge. The first answer is scary and unclear. The second is honest: it explains what happened and what will happen next.

If the task was added to the queue, say so:

javascript
app.post('/send-report', async (req, res) => {
    try {
        // Try to do it right away
        const result = await reportService.generate(req.body);
        return res.json({ status: 'done', url: result.url });
    } catch (error) {
        // It failed: add the task to the queue
        // (assuming a BullMQ-style queue, where add() resolves to a job object)
        const job = await reportQueue.add('generate', req.body);
        return res.json({
            status: 'queued',
            message: 'The report is being generated. We will email it to you when it is ready.',
            jobId: job.id
        });
    }
});

Ready-made libraries: no need to write it yourself

Circuit breaker and retry are standard patterns, and there are proven libraries for them.

| Language | Library | What it does |
| --- | --- | --- |
| Node.js | cockatiel | Circuit breaker, retry, timeout, all in one |
| Node.js | opossum | Circuit breaker with metrics |
| Node.js | async-retry | Simple retry with increasing delay |
| Python | tenacity | Retry with flexible rules |
| Python | circuitbreaker | Circuit breaker as a decorator |
| Node.js (queues) | BullMQ | Redis-based queues with retry out of the box |
| Python (queues) | Celery | Task queues with Redis and RabbitMQ support |

An example with cockatiel, which implements everything from this article in a few lines:

javascript
// Note: this uses the Policy-based API of older cockatiel releases; newer major
// versions expose standalone functions instead, so check the docs for your version.
const { Policy, ConsecutiveBreaker, ExponentialBackoff } = require('cockatiel');

// Create a policy: circuit breaker + retry
const policy = Policy.wrap(
    // Circuit breaker: trip after 5 errors in a row, try to recover after 30 seconds
    Policy.handleAll().circuitBreaker(30_000, new ConsecutiveBreaker(5)),
    // Retry: 3 attempts with exponential delay
    Policy.handleAll().retry().attempts(3).exponential()
);

// Usage: just wrap any call
const data = await policy.execute(() => api.get('/data'));

Checklist

plaintext
When to add retry:
● The request may temporarily fail to get through the network
● The API sometimes returns 429 or 500
● An increasing delay is added (not a fixed one)
● Jitter is added (random shift)
● Only temporary errors are retried (not 400, 401, 404)

When to add a queue:
● The user does not need the result right now
● The task can take more than 2-3 seconds
● The task must not be lost when the server restarts
● You need to control the processing rate

When to add a circuit breaker:
● The API is critical and can be down for a long time
● Many parallel requests go to one service
● You want to answer "unavailable" quickly instead of waiting for timeouts

What to show the user:
● A clear message, not a technical error code
● If the task is queued, say so
● If possible, give an approximate waiting time

Summary

External APIs go down. This is not an exception; it is the norm. A developer's job is not to "write code that never fails" but to "write code that behaves correctly when something goes wrong."

The three tools in this article cover most scenarios. Retry handles minor short-term failures. The queue keeps tasks from being lost and keeps the user from waiting at the screen. The circuit breaker keeps a prolonged failure in one service from dragging down your entire server.

Start small: add retry to the most important external calls. Then add a queue for background tasks like emails and notifications. Add a circuit breaker when you see that a single failed API starts affecting the entire application.


FAQ

Do you need all three tools at once? No. Start with retry: it gives 80% of the reliability for 20% of the effort. Add a queue when background tasks appear (emails, notifications, reports). Add a circuit breaker when you notice that a single API failure affects the entire service.

What should I use to store the task queue? For most projects, Redis + BullMQ (Node.js) or Redis + Celery (Python) is enough. Redis is fast and reliable, and you probably already use it for caching or sessions.

How many retry attempts should I make? Usually 3-5 is enough. More rarely makes sense: if the API didn't respond in 5 attempts, the problem is not temporary. For critical operations, it is better to add the task to a queue than to make 10 attempts in a row.

The circuit breaker trips too often. What should I tune? Increase failureThreshold, the number of consecutive errors before it trips. Or reduce recoveryTimeout, the time before the trial request. The optimal values depend on the specific API: how often it is unstable and how long it usually takes to recover.
