RAG: How to Connect Your Data to a Language Model and Get a Smart Assistant

◷ 17 min read 6/7/2026



You ask ChatGPT about your project and it makes things up because it has never seen your documentation. You ask Claude to explain the logic of the code, and it doesn't know the architectural decisions you made three months ago. You build a customer support bot, and the model invents answers instead of looking them up in the knowledge base.

This is a classic problem: LLMs are good at reasoning, but they only know what they were trained on. Your data was not part of that.

**RAG (Retrieval-Augmented Generation)** is an architectural pattern that fixes this. The model doesn't memorize your data: it searches for the relevant snippets at answer time and builds the answer from them. It is cheaper than fine-tuning, works with any LLM, and is updated without retraining.

How RAG works: Three steps

The scheme is simpler than the name suggests.

Step 1: Indexing (done once)

Take your documents (Markdown files, PDFs, Notion pages, database records), cut them into chunks (fragments of 300-1000 tokens), convert each chunk into a vector with an embedding model, and save the vectors to a vector database.

A vector is an array of numbers that describes the meaning of a text. Similar texts produce vectors that are close to each other. This is what lets you search "not by words, but by meaning."
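To make "close vectors" concrete, here is a minimal sketch of cosine similarity, the closeness measure most embedding models are used with. The three-dimensional vectors and their names are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" of three texts
deploy_guide = [0.9, 0.1, 0.0]
server_setup = [0.8, 0.2, 0.1]
cookie_recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(deploy_guide, server_setup))   # close to 1: similar meaning
print(cosine_similarity(deploy_guide, cookie_recipe))  # close to 0: unrelated
```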

Step 2: Retrieval (with each request)

The user asks a question. The question is converted into a vector by the same embedding model. The vector database finds the K nearest chunks: those whose vectors are closest to the question vector.
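The retrieval step can be sketched as a brute-force scan: score every stored chunk against the question vector and keep the K best. This is only an illustration with made-up vectors and texts; a real vector database replaces the scan with an index.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec, index, k=2):
    """Brute-force nearest-neighbour search: score every chunk, keep the k best."""
    scored = ((cosine_similarity(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored)

# Hypothetical index: (chunk text, toy 3-dimensional vector)
index = [
    ("Deploying to a VPS with systemd", [0.9, 0.1, 0.0]),
    ("Configuring nginx as a reverse proxy", [0.7, 0.3, 0.1]),
    ("Team holiday schedule", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # pretend this came from the embedding model

for score, text in top_k(query_vec, index, k=2):
    print(f"{score:.3f}  {text}")
```

With these toy numbers the two deployment-related chunks come out on top and the holiday schedule is dropped.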

Step 3: Generation (with each request)

The chunks found are inserted into the prompt along with the user's question. The LLM sees: “Here’s the context of the documents, here’s the question – answer based on the context.” The model responds by relying on real data, not memory.
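Assembling the prompt is plain string formatting. The sketch below shows one possible layout; the function name `build_prompt`, the instruction wording, and the sample chunks are assumptions for illustration, not a fixed format.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user's question into one prompt."""
    context = "\n\n---\n\n".join(
        f"[chunk {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical chunks returned by the retrieval step
prompt = build_prompt(
    "How to set up a deploy on VPS?",
    ["Deploys run through a shell script.", "The VPS needs Docker installed."],
)
print(prompt)
```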

plaintext

User: “How to set up a deploy on VPS?”
↓
Embedding query → question vector
↓
Vector search → top 5 chunks from documentation
↓
Prompt: [chunk 1] ... [chunk 5] + "How to set up a deploy on VPS?"
↓
LLM → answer based on your documentation

RAG vs Fine Tuning: What to Choose

A common question is: why RAG if you can train the model on your data?

Criterion	RAG	Fine-tuning
Cost	Cheap (vector database + API)	Expensive (GPUs, time, dataset)
Data updates	Instant: add a document, reindex	Requires retraining
Transparency	You can see where the answer came from	"Black box"
Accuracy on a narrow topic	Good with proper chunking	Very high
Hallucinations	Fewer: the model relies on the context	More without context
Barrier to entry	Low: up and running in an evening	High

For most vibe coding tasks (support bot, documentation assistant, knowledge search, personal agent) RAG covers 90% of needs without expensive fine-tuning.

Key Components: What to Choose

Before the code, a quick overview of the tools a RAG system is assembled from.

The embedding model

Converts text into a vector. The quality of the search depends on the choice.

Model	Vector size	Where it runs	Cost
`text-embedding-3-small` (OpenAI)	1536	Cloud	~$0.02 / 1M tokens
`text-embedding-3-large` (OpenAI)	3072	Cloud	~$0.13 / 1M tokens
`nomic-embed-text`	768	Local (Ollama)	Free
`mxbai-embed-large`	1024	Local (Ollama)	Free
`multilingual-e5-large`	1024	Local / HF	Free, good for Russian

For Russian-language content, use multilingual-e5-large or text-embedding-3-small from OpenAI.
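The per-token prices translate into a one-off indexing cost with simple arithmetic. The sketch below uses the approximate prices from the table above; the corpus size is a made-up example, and chunk overlap would add a little on top.

```python
# Approximate prices in USD per 1M tokens, as listed in the table above
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def indexing_cost(total_tokens: int, model: str) -> float:
    """Approximate one-off cost in USD of embedding a corpus."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Hypothetical corpus: 2,000 documents of about 1,500 tokens each
corpus_tokens = 2_000 * 1_500  # 3,000,000 tokens

for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${indexing_cost(corpus_tokens, model):.2f}")
```

For a corpus of this size the embedding bill is a matter of cents either way, so the choice between the two cloud models is mostly about search quality and vector size.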

Vector database

Stores vectors and makes a quick search for nearest neighbors (ANN).

Database	Type	When to use
ChromaDB	Embedded / server	Local development, prototype
pgvector	PostgreSQL extension	You already use Postgres
Qdrant	Standalone service	Production, large volumes
Weaviate	Standalone service	You need hybrid search and schemas
FAISS	Library (in-memory)	Research, no persistence

To get started, use ChromaDB: zero infrastructure, everything in one Python package. For production with an existing Postgres, use pgvector.
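When choosing between an in-memory library and a database, it helps to estimate how much space the vectors take. A rough sketch, assuming vectors are stored as 32-bit floats and ignoring index and metadata overhead (both assumptions; actual storage depends on the database):

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_bytes(num_chunks: int, dimensions: int) -> int:
    """Size of the raw float32 vectors alone, without index or metadata overhead."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32

# Hypothetical corpus of 100,000 chunks
for name, dims in [("nomic-embed-text", 768), ("text-embedding-3-small", 1536)]:
    mib = raw_vector_bytes(100_000, dims) / 2**20
    print(f"{name}: {mib:.0f} MiB")
```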

Orchestrator

LangChain and LlamaIndex are frameworks that glue the components together. LangChain is more general-purpose; LlamaIndex is tailored specifically to RAG.

For small projects an orchestrator is not needed at all: 100 lines of plain code are enough.

Building RAG from scratch: code

Example: a RAG assistant for a local folder of Markdown documents. The stack: ChromaDB + OpenAI embeddings + GPT-4o mini (or any other LLM).

Installation

bash

pip install chromadb openai tiktoken langchain langchain-openai langchain-community

Indexing the documents

python

import os
import glob
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

DOCS_DIR = "./docs"       # folder with your .md files
CHROMA_DIR = "./chroma_db"  # where to store the database

def load_markdown_files(directory: str) -> list[dict]:
    """Loads all .md files and returns a list of {text, source} dicts."""
    documents = []
    for filepath in glob.glob(f"{directory}/**/*.md", recursive=True):
        with open(filepath, "r", encoding="utf-8") as f:
            text = f.read()
        documents.append({"text": text, "source": filepath})
    return documents

def build_index(docs_dir: str, chroma_dir: str):
    # Load the documents
    raw_docs = load_markdown_files(docs_dir)
    print(f"Documents loaded: {len(raw_docs)}")

    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,       # characters per chunk
        chunk_overlap=100,    # overlap, so context is not lost at chunk boundaries
        separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    )

    chunks = []
    metadatas = []
    for doc in raw_docs:
        parts = splitter.split_text(doc["text"])
        chunks.extend(parts)
        metadatas.extend([{"source": doc["source"]}] * len(parts))

    print(f"Chunks after splitting: {len(chunks)}")

    # Create the vector store
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_texts(
        texts=chunks,
        embedding=embeddings,
        metadatas=metadatas,
        persist_directory=chroma_dir,
    )
    print(f"Index saved to {chroma_dir}")
    return vectorstore

if __name__ == "__main__":
    build_index(DOCS_DIR, CHROMA_DIR)

Search and answer generation

python

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.messages import HumanMessage, SystemMessage

CHROMA_DIR = "./chroma_db"

def load_vectorstore(chroma_dir: str):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return Chroma(
        persist_directory=chroma_dir,
        embedding_function=embeddings,
    )

def ask(question: str, vectorstore, k: int = 5) -> tuple[str, list[str]]:
    # Find the relevant chunks
    results = vectorstore.similarity_search(question, k=k)
    context = "\n\n---\n\n".join([doc.page_content for doc in results])
    sources = list({doc.metadata.get("source", "") for doc in results})

    # Build the prompt
    system_prompt = """You are a documentation assistant.
Answer only on the basis of the provided context.
If the answer is not in the context, say so honestly.
Do not make up information."""

    user_prompt = f"""Context from the documentation:
{context}

Question: {question}"""

    # Generate the answer
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    response = llm.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt),
    ])

    # Return both the answer text and the list of source files
    return response.content, sources

if __name__ == "__main__":
    vs = load_vectorstore(CHROMA_DIR)
    answer, sources = ask("How do I set up auto-deploy on a VPS?", vs)
    print(answer)
    print("\nSources:", sources)

Fully local RAG via Ollama

If you don’t want to send data to the cloud, everything can be run locally: embedding and LLM via Ollama, storage through ChromaDB.

bash

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download the models
ollama pull nomic-embed-text   # embedding model
ollama pull llama3.2           # or qwen2.5, mistral

python

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_local_index(docs_dir: str, chroma_dir: str):
    # load_markdown_files is the helper defined in the indexing script above
    raw_docs = load_markdown_files(docs_dir)

    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    chunks, metadatas = [], []
    for doc in raw_docs:
        parts = splitter.split_text(doc["text"])
        chunks.extend(parts)
        metadatas.extend([{"source": doc["source"]}] * len(parts))

    # Local embeddings via Ollama
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = Chroma.from_texts(
        texts=chunks,
        embedding=embeddings,
        metadatas=metadatas,
        persist_directory=chroma_dir,
    )
    return vectorstore

def ask_local(question: str, vectorstore) -> str:
    results = vectorstore.similarity_search(question, k=5)
    context = "\n\n".join([doc.page_content for doc in results])

    llm = Ollama(model="llama3.2")
    prompt = f"""Context: {context}

Question: {question}

Answer only on the basis of the context. If the answer is not there, say so."""

    return llm.invoke(prompt)

Performance on a local machine depends on the hardware: an 8B model such as llama3.1:8b runs well on an M1/M2 Mac or on a GPU from the RTX 3080 up. On a CPU, use 1-3B models (qwen2.5:1.5b, llama3.2:1b).

RAG with pgvector: if you are already using PostgreSQL

If the project already has Postgres, adding the pgvector extension is easier than standing up a separate service.

sql

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table for chunks
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source TEXT,
    embedding vector(1536),  -- text-embedding-3-small size
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast search (HNSW: the best speed/accuracy trade-off)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

python

import psycopg2
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def insert_chunk(conn, content: str, source: str):
    embedding = get_embedding(content)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (content, source, embedding) VALUES (%s, %s, %s)",
            (content, source, embedding)
        )
    conn.commit()

def search_similar(conn, query: str, k: int = 5) -> list[dict]:
    query_embedding = get_embedding(query)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, source,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, k))
        rows = cur.fetchall()
    return [
        {"content": r[0], "source": r[1], "similarity": r[2]}
        for r in rows
    ]

<=> is the cosine distance operator in pgvector: the smaller the distance, the closer the vectors. The HNSW index makes search roughly O(log n) instead of the O(n) of a brute-force scan, at the cost of returning approximate rather than exact nearest neighbors.

Chunking: The Most Important and Underrated Part

The quality of RAG depends heavily on how you slice the documents. Get the chunking wrong and the search will not find the right fragment, even if it is in the database.

Slicing rules

Chunk size. 300-500 tokens for point facts (FAQ, API documentation). 800-1,200 tokens for conceptual explanations. Chunks that are too small lose context. Chunks that are too large dilute relevance.

Overlap. 10-15% of the chunk size. It ensures that a thought that starts at the end of one chunk is not cut off for the model.

Slice boundaries. Cut on semantic boundaries such as headings (##, ###) and paragraphs (\n\n), not in the middle of a sentence. RecursiveCharacterTextSplitter with the right separators does this automatically.

Metadata. Save the source, section title, and date. This lets you show the user where the answer came from and filter the search by section.

python

# Example: chunking that keeps the section heading in the metadata
def chunk_markdown_with_headers(text: str, source: str) -> list[dict]:
    """Splits Markdown into sections, keeping the section heading in the metadata."""
    chunks = []
    current_header = ""
    current_text = []

    for line in text.split("\n"):
        if line.startswith("## "):
            # Save the text accumulated so far
            if current_text:
                chunks.append({
                    "text": "\n".join(current_text).strip(),
                    "source": source,
                    "section": current_header,
                })
                current_text = []
            current_header = line.lstrip("# ").strip()
        current_text.append(line)

    # The last section
    if current_text:
        chunks.append({
            "text": "\n".join(current_text).strip(),
            "source": source,
            "section": current_header,
        })

    return [c for c in chunks if len(c["text"]) > 50]  # drop near-empty chunks
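The size and overlap rules can also be sketched without any library. The function below packs whole paragraphs into chunks and starts each new chunk with the tail of the previous one. It is a simplified illustration with assumed names and defaults: sizes are in characters as a stand-in for tokens, the overlap is a raw character tail, and a single paragraph longer than `max_chars` is kept whole.

```python
def chunk_paragraphs(text: str, max_chars: int = 800, overlap_chars: int = 100) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars with a tail overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
            continue
        chunks.append(current)
        # Start the next chunk with the tail of the previous one
        current = f"{current[-overlap_chars:]}\n\n{para}"
    if current:
        chunks.append(current)
    return chunks

doc = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])  # three toy paragraphs
for chunk in chunk_paragraphs(doc, max_chars=90, overlap_chars=10):
    print(len(chunk), repr(chunk[:15]))
```

In the toy run the first two paragraphs fit into one chunk, and the second chunk begins with the last ten characters of the first, so the boundary is covered from both sides.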

Hybrid Search: Vector + Keyword

Pure vector search is good at finding matches by meaning, but it can miss exact names, versions, and code constants. Hybrid search combines keyword search (BM25 or full-text ranking) with vector search. Qdrant supports it natively, and with pgvector you can pair vector search with PostgreSQL's built-in full-text search.

sql

-- Add full-text search to the table
ALTER TABLE documents ADD COLUMN ts tsvector
    GENERATED ALWAYS AS (to_tsvector('russian', content)) STORED;
CREATE INDEX ON documents USING gin(ts);

-- Hybrid query: weighted sum of full-text rank (ts_rank) and cosine similarity
SELECT content, source,
       (0.5 * ts_rank(ts, query) + 0.5 * (1 - (embedding <=> $1::vector))) AS score
FROM documents, to_tsquery('russian', $2) query
WHERE ts @@ query OR (embedding <=> $1::vector) < 0.4
ORDER BY score DESC
LIMIT 5;

This works best for technical documents that have specific terms and names.
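The weighting idea behind the hybrid query can be shown in a few lines. This is a sketch with invented candidates and scores; it assumes both scores are already scaled to the 0-1 range, which raw full-text ranks usually are not, so a real system would normalize them first.

```python
def hybrid_score(keyword_score: float, vector_similarity: float, alpha: float = 0.5) -> float:
    """Weighted sum of a keyword score and a vector similarity, both expected in [0, 1]."""
    return alpha * keyword_score + (1 - alpha) * vector_similarity

# Hypothetical candidates: (text, keyword score, cosine similarity)
candidates = [
    ("Upgrade guide for v2.4.1", 0.9, 0.55),
    ("General notes on upgrading", 0.1, 0.80),
    ("Office lunch menu", 0.0, 0.10),
]

ranked = sorted(
    candidates,
    key=lambda c: hybrid_score(c[1], c[2]),
    reverse=True,
)
for text, kw, sim in ranked:
    print(f"{hybrid_score(kw, sim):.3f}  {text}")
```

The chunk that mentions the exact version wins even though its vector similarity is lower, which is the behavior pure vector search tends to miss.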

Real application scenarios in vibe coding

Project documentation bot. Index the README, CHANGELOG, and team wiki, and you get a bot that answers new developers' questions without "go ask a senior." Updating the database is one script that runs on every commit to /docs.

Personal search over Obsidian notes. RAG on a folder of Markdown files turns into an assistant that knows everything you have ever written down. You ask "what did I think about microservices architecture" and get excerpts from your own notes with an explanation.

Customer support without hallucinations. Index the knowledge base, FAQ, and product instructions. The bot answers strictly from the documents and honestly says "I don't know" if there is no answer.

Agent code reviewer. Index the corporate style guide, architectural decision records (ADRs), and past code reviews. Claude Code or Codex gets this context through an MCP server and reviews PRs against the project's actual standards.

RAG on top of a Telegram channel. Parse the channel (see the article about channel parsers), index the posts, and search the archive by meaning, not by keywords.

Common mistakes and how to avoid them

Mistake	Consequence	Fix
Chunks too large (>2000 tokens)	The model gets "lost" in the context, accuracy drops	800-1000 tokens is the sweet spot
One vector for the whole document	Search misses details inside a long text	Split into chunks
Different embedding models for indexing and search	Complete nonsense in the output	Always use the same model for both steps
k=1 when searching	One bad chunk in the database means a wrong answer	k=5-10, let the LLM pick what is relevant
No source metadata	You cannot verify where the answer came from	Always store source and section
Reindexing everything on every change	Slow and expensive	Incremental indexing of changed files only

Incremental indexing

python

import glob
import hashlib
import json
import os

HASH_FILE = ".index_hashes.json"

def file_hash(filepath: str) -> str:
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def get_changed_files(docs_dir: str) -> list[str]:
    """Returns only the files that are new or changed since the last run."""
    hashes = {}
    if os.path.exists(HASH_FILE):
        with open(HASH_FILE) as f:
            hashes = json.load(f)

    changed = []
    current = {}
    for filepath in glob.glob(f"{docs_dir}/**/*.md", recursive=True):
        h = file_hash(filepath)
        current[filepath] = h
        if hashes.get(filepath) != h:
            changed.append(filepath)

    with open(HASH_FILE, "w") as f:
        json.dump(current, f)

    return changed
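The function above reports new and modified files. Deleted files matter too, because their chunks should be removed from the index. A possible extension, sketched here with a hypothetical helper `diff_hashes` and made-up paths, is to compare the old and new hash snapshots and classify every path:

```python
def diff_hashes(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: hash} snapshots and classify every path."""
    return {
        "added": sorted(p for p in current if p not in previous),
        "changed": sorted(
            p for p in current if p in previous and previous[p] != current[p]
        ),
        "removed": sorted(p for p in previous if p not in current),
    }

# Hypothetical snapshots of the hash file before and after an edit session
before = {"docs/deploy.md": "a1", "docs/faq.md": "b2", "docs/old.md": "c3"}
after = {"docs/deploy.md": "a1", "docs/faq.md": "ff", "docs/new.md": "d4"}

print(diff_hashes(before, after))
```

Chunks of "changed" and "removed" files would be deleted from the vector store by their source metadata, and "added" and "changed" files would then be re-chunked and embedded.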

RAG quality assessment

Before deploying, it is worth checking that the system really works. Metrics for evaluation:

**Relevance**: how closely the retrieved chunks relate to the question. Checked manually on a test sample of questions.

**Faithfulness**: is there any information in the answer that was not in the context? Hallucinations are the main enemy of RAG.

**Answer correctness**: does the answer match the reference? For this you need a dataset of question → correct answer pairs.
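A manual relevance check can be turned into a number with very little code. The sketch below computes a hit rate: the share of test questions for which the expected source file appears among the top K retrieved sources. The function name, file paths, and test set are invented for illustration.

```python
def hit_rate_at_k(retrieved: list[list[str]], expected: list[str], k: int = 5) -> float:
    """Share of questions whose expected source appears in the top-k retrieved sources."""
    hits = sum(
        1 for sources, target in zip(retrieved, expected) if target in sources[:k]
    )
    return hits / len(expected)

# Hypothetical test set: for each question, the sources the retriever returned (ranked)
retrieved = [
    ["docs/deploy.md", "docs/faq.md"],
    ["docs/faq.md", "docs/billing.md"],
    ["docs/api.md", "docs/deploy.md"],
    ["docs/faq.md", "docs/api.md"],
]
# The source a human marked as containing the answer to each question
expected = ["docs/deploy.md", "docs/billing.md", "docs/auth.md", "docs/api.md"]

print(hit_rate_at_k(retrieved, expected, k=2))
print(hit_rate_at_k(retrieved, expected, k=1))
```

Tracking this number while changing chunk size or the embedding model shows whether retrieval is getting better or worse before the LLM is involved at all.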

For automatic evaluation, there is the **RAGAS** library:

bash

pip install ragas

python

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# dataset: evaluation records with question, answer, contexts, ground_truth
# (the exact field names vary between ragas versions)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)

Checklist: RAG in production

plaintext

Indexing:
● Embedding model selected (the same one for indexing and search)
● Chunk size 300-1000 tokens, overlap 10-15%
● Splitting on semantic boundaries (headings, paragraphs)
● Metadata: source, section, date
● Incremental reindexing implemented

Search:
● k = 5-10 chunks per search
● The similarity metric matches the model (cosine for most)
● Hybrid search considered if there are exact terms/names

Generation:
● The system prompt explicitly requires answering from the context only
● The model honestly says "don't know" if the answer is not in the context
● Sources are shown to the user

Infrastructure:
● Vector database with persistence (not in-memory)
● All requests and retrieved chunks are logged
● Quality monitoring on test questions

Security:
● API keys in environment variables
● If the data is private, a local LLM (Ollama) is used
● No system prompt leakage in responses
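For the "API keys in environment variables" item, a small helper is usually enough. The sketch below reads a key from the environment and fails with a clear message if it is missing; the helper name `get_api_key` is an assumption, while `OPENAI_API_KEY` is the variable the OpenAI SDK reads by default.

```python
import os

def get_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

Calling `get_api_key()` at startup surfaces a missing key immediately instead of on the first request, and keeps the key itself out of the source code and the repository.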

Summary

RAG is not a complicated technology. It is a three-part pattern: cut documents into chunks, find the relevant chunks at query time, and let the model answer based on them. It can be implemented in an evening, is updated without retraining, and works with any LLM.

The quality of the system is 80% determined by chunking and metadata, not by the choice of LLM or the complexity of the orchestrator. Start simple: a folder of Markdown files, ChromaDB, OpenAI embeddings. Make sure the search returns the right snippets, and only then add complexity.

The next level after basic RAG is agentic RAG with query reformulation, multi-hop reasoning, and self-checking of the answer. But that is a separate article.

*ChromaDB 0.6+, pgvector 0.8+, LangChain 0.3+, Ollama 0.5+. June 2026.*