RAG: How to Connect Your Data to a Language Model and Get a Smart Assistant
You ask ChatGPT about your project and it hallucinates because it has never seen your documentation. You ask Claude to explain the logic of the code, and it doesn't know the architectural decisions made three months ago. You build a customer support bot, and the model makes up answers instead of looking them up in the knowledge base.
This is a classic problem: LLMs are good at reasoning, but they only know what they were trained on. Your data was not part of that.
**RAG (Retrieval-Augmented Generation)** is an architectural pattern that fixes this. The model does not memorize your data: at answer time it retrieves the relevant snippets and builds its answer from them. It is cheaper than fine-tuning, works with any LLM, and is updated without retraining.
How RAG works: Three steps
The scheme is simpler than the name suggests.
Step 1: Indexing (done once)
Take your documents (Markdown files, PDFs, Notion pages, database records), cut them into chunks (fragments of 300-1000 tokens), convert each chunk into a vector with an embedding model, and save the vectors to a vector database.
A vector is an array of numbers that encodes the meaning of a text. Texts with similar meaning produce vectors that are close together. This is what lets you search “not by words, but by meaning.”
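To make "close vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for illustration; real embedding models return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for this example
deploy_guide = [0.9, 0.1, 0.0]
server_setup = [0.8, 0.2, 0.1]
cake_recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(deploy_guide, server_setup))  # close to 1: similar meaning
print(cosine_similarity(deploy_guide, cake_recipe))   # close to 0: unrelated
```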
Step 2: Retrieval (with each request)
The user asks a question. The question is converted into a vector by the same embedding model. The vector database then finds the K nearest chunks: those whose vectors are closest to the question vector.
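What the vector database does at this step can be sketched as a brute-force search: score every chunk against the query and keep the best K. This is only an illustration with invented toy vectors; a real database uses an approximate index instead of scanning everything.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Return (chunk index, similarity) for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunks and hand-made vectors, for illustration only
chunks = ["Deploying to a VPS", "Configuring nginx", "Team lunch menu"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1], [0.0, 0.1, 0.9]]
query_vec = [0.8, 0.2, 0.0]

for index, score in top_k(query_vec, chunk_vecs, k=2):
    print(f"{score:.3f}  {chunks[index]}")
```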
Step 3: Generation (with each request)
The chunks found are inserted into the prompt along with the user's question. The LLM sees: “Here’s the context of the documents, here’s the question – answer based on the context.” The model responds by relying on real data, not memory.
User: “How to set up a deploy on VPS?”
↓
Query embedding → question vector
↓
Vector search → top 5 chunks from the documentation
↓
Prompt: [chunk 1] [chunk 2] ... [chunk 5] + “How to set up a deploy on VPS?”
↓
LLM → answer based on your documentation
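The prompt-assembly step from the scheme can be sketched as a single function. The function name, the numbering of chunks, and the exact instruction wording are assumptions made for this example, not a fixed format.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble retrieved chunks and the user's question into one prompt string."""
    context_parts = []
    for number, chunk in enumerate(chunks, start=1):
        # Numbering the chunks and naming the source makes answers easier to trace
        context_parts.append(f"[{number}] (source: {chunk['source']})\n{chunk['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunks
found = [
    {"text": "Deploys run through a GitHub Actions workflow.", "source": "docs/deploy.md"},
    {"text": "The VPS needs Docker installed.", "source": "docs/server.md"},
]
print(build_prompt("How to set up a deploy on VPS?", found))
```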
RAG vs Fine Tuning: What to Choose
A common question is: why RAG if you can train the model on your data?
| Criterion | RAG | Fine-tuning |
|---|---|---|
| Cost | Cheap (vector database + API) | Expensive (GPU, time, dataset) |
| Data updates | Immediate: add a document, re-index | Requires retraining |
| Transparency | You can see where the answer came from | "Black box" |
| Accuracy on a narrow topic | Good with proper chunking | Very high |
| Hallucinations | Fewer: the model relies on the context | More without context |
| Barrier to entry | Low: up and running in an evening | High |
For most vibe coding tasks (a support bot, a documentation assistant, knowledge search, a personal agent), RAG covers 90% of needs without expensive fine-tuning.
Key Components: What to Choose
Before the code, a quick overview of the tools a RAG system is assembled from.
The embedding model
Converts text into a vector. The quality of the search depends on the choice.
| Model | Vector size | Where it runs | Cost |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Cloud | ~$0.02 / 1M tokens |
| text-embedding-3-large (OpenAI) | 3072 | Cloud | ~$0.13 / 1M tokens |
| nomic-embed-text | 768 | Local (Ollama) | Free |
| mxbai-embed-large | 1024 | Local (Ollama) | Free |
| multilingual-e5-large | 1024 | Local / HF | Free, good for Russian |
For Russian-language content, use multilingual-e5-large or OpenAI's text-embedding-3-small.
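One detail that is easy to miss with the E5 family: its model card asks for a "query: " prefix on search queries and a "passage: " prefix on indexed texts. A minimal sketch of that convention follows; the helper names are hypothetical, and the prefixes should be checked against the model card of the exact checkpoint you use.

```python
def e5_passages(chunks: list[str]) -> list[str]:
    """Prefix chunks before indexing them with an E5 model (assumed convention)."""
    return [f"passage: {chunk}" for chunk in chunks]


def e5_query(question: str) -> str:
    """Prefix a user question before embedding it with an E5 model (assumed convention)."""
    return f"query: {question}"


print(e5_passages(["Deploys run through CI.", "The VPS needs Docker."]))
print(e5_query("How to set up a deploy on VPS?"))
```

The prefixed strings are what gets passed to the embedding model; the stored chunk text shown to the LLM stays unprefixed.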
Vector database
Stores vectors and performs fast approximate nearest-neighbor (ANN) search.
| Database | Type | When to use |
|---|---|---|
| ChromaDB | Embedded / server | Local development, prototype |
| pgvector | PostgreSQL extension | You already use Postgres |
| Qdrant | Standalone service | Production, large volume |
| Weaviate | Standalone service | You need hybrid search and schemas |
| FAISS | Library (in-memory) | Research, no persistence |
For a quick start, ChromaDB: zero infrastructure, everything in one Python package. For production with an existing Postgres, pgvector.
Orchestrator
LangChain and LlamaIndex are frameworks that glue the components together. LangChain is more general-purpose, LlamaIndex is tailored specifically to RAG.
For small projects an orchestrator is not needed at all: about 100 lines of plain code are enough.
Building RAG from scratch: code
Example: a RAG assistant for a local folder of Markdown documents. ChromaDB + OpenAI embeddings + an OpenAI chat model (gpt-4o-mini in the code below, or any other LLM).
Installation
pip install chromadb openai tiktoken langchain langchain-openai langchain-community
Indexing documents
import glob

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

DOCS_DIR = "./docs"          # folder with your .md files
CHROMA_DIR = "./chroma_db"   # where to store the database

def load_markdown_files(directory: str) -> list[dict]:
    """Load all .md files and return a list of {text, source}."""
    documents = []
    for filepath in glob.glob(f"{directory}/**/*.md", recursive=True):
        with open(filepath, "r", encoding="utf-8") as f:
            text = f.read()
        documents.append({"text": text, "source": filepath})
    return documents

def build_index(docs_dir: str, chroma_dir: str):
    # Load the documents
    raw_docs = load_markdown_files(docs_dir)
    print(f"Documents loaded: {len(raw_docs)}")

    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,      # characters per chunk (not tokens)
        chunk_overlap=100,   # overlap, so context is not lost at chunk boundaries
        separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    )
    chunks = []
    metadatas = []
    for doc in raw_docs:
        parts = splitter.split_text(doc["text"])
        chunks.extend(parts)
        # One metadata entry per chunk, pointing back to the source file
        metadatas.extend([{"source": doc["source"]}] * len(parts))
    print(f"Chunks after splitting: {len(chunks)}")

    # Create the vector database
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_texts(
        texts=chunks,
        embedding=embeddings,
        metadatas=metadatas,
        persist_directory=chroma_dir,
    )
    print(f"Index saved to {chroma_dir}")
    return vectorstore

if __name__ == "__main__":
    build_index(DOCS_DIR, CHROMA_DIR)
Retrieval and answer generation
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.messages import HumanMessage, SystemMessage

CHROMA_DIR = "./chroma_db"

def load_vectorstore(chroma_dir: str):
    # Must be the same embedding model that was used for indexing
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return Chroma(
        persist_directory=chroma_dir,
        embedding_function=embeddings,
    )

def ask(question: str, vectorstore, k: int = 5) -> tuple[str, list[str]]:
    # Find the relevant chunks
    results = vectorstore.similarity_search(question, k=k)
    context = "\n\n---\n\n".join([doc.page_content for doc in results])
    sources = list({doc.metadata.get("source", "") for doc in results})

    # Build the prompt
    system_prompt = """You are a documentation assistant.
Answer only on the basis of the provided context.
If the answer is not in the context, say so honestly.
Do not make up information."""
    user_prompt = f"""Context from the documentation:
{context}

Question: {question}"""

    # Generate the answer
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    response = llm.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt),
    ])
    return response.content, sources

if __name__ == "__main__":
    vs = load_vectorstore(CHROMA_DIR)
    answer, sources = ask("How do I set up auto-deploy on a VPS?", vs)
    print(answer)
    print("\nSources:", sources)
Fully local RAG via Ollama
If you don’t want to send data to the cloud, everything can be run locally: embedding and LLM via Ollama, storage through ChromaDB.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Download the models
ollama pull nomic-embed-text   # embeddings
ollama pull llama3.2           # or qwen2.5, mistral
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# load_markdown_files() is the helper defined in the indexing script above

def build_local_index(docs_dir: str, chroma_dir: str):
    raw_docs = load_markdown_files(docs_dir)
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    chunks, metadatas = [], []
    for doc in raw_docs:
        parts = splitter.split_text(doc["text"])
        chunks.extend(parts)
        metadatas.extend([{"source": doc["source"]}] * len(parts))

    # Local embeddings through Ollama
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = Chroma.from_texts(
        texts=chunks,
        embedding=embeddings,
        metadatas=metadatas,
        persist_directory=chroma_dir,
    )
    return vectorstore

def ask_local(question: str, vectorstore) -> str:
    results = vectorstore.similarity_search(question, k=5)
    context = "\n\n".join([doc.page_content for doc in results])
    llm = Ollama(model="llama3.2")
    prompt = f"""Context: {context}

Question: {question}

Answer only on the basis of the context. If the answer is not there, say so."""
    return llm.invoke(prompt)
Performance on a local machine depends on the hardware: 8B-class models (for example llama3.1:8b) run well on an M1/M2 Mac or on a GPU from the RTX 3080 up; the default llama3.2 tag is a smaller 3B model. On CPU only, take 1-3B models (qwen2.5:1.5b, llama3.2:1b).
RAG with pgvector: if you are already using PostgreSQL
If the project already has Postgres, adding the pgvector extension is easier than standing up a separate service.
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for chunks
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source TEXT,
    embedding vector(1536),  -- text-embedding-3-small size
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast search (HNSW: the best speed/accuracy trade-off)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def insert_chunk(conn, content: str, source: str):
    embedding = get_embedding(content)
    with conn.cursor() as cur:
        # psycopg2 sends the list as an array; the explicit cast turns it into a vector
        cur.execute(
            "INSERT INTO documents (content, source, embedding) VALUES (%s, %s, %s::vector)",
            (content, source, embedding),
        )
    conn.commit()

def search_similar(conn, query: str, k: int = 5) -> list[dict]:
    query_embedding = get_embedding(query)
    with conn.cursor() as cur:
        # Order by cosine distance; report similarity as 1 - distance
        cur.execute("""
            SELECT content, source,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, k))
        rows = cur.fetchall()
    return [
        {"content": r[0], "source": r[1], "similarity": r[2]}
        for r in rows
    ]
<=> is the cosine distance operator in pgvector. The smaller the distance, the closer the vectors, which is why the query reports similarity as 1 minus the distance. The HNSW index makes the search approximate, but roughly O(log n) instead of the O(n) of a full brute-force scan.
Chunking: The Most Important and Underrated Part
The quality of RAG depends heavily on how you slice the documents. Get the chunking wrong and retrieval will not find the right passage, even if it is in the database.
Slicing rules
Chunk size. 300-500 tokens for point facts (FAQ, API documentation); 800-1,200 tokens for conceptual explanations. Chunks that are too small lose context; chunks that are too large dilute relevance.
Overlap. 10-15% of the chunk size, so that a thought that starts at the end of one chunk is not cut off for the model.
Slice boundaries. Cut along semantic boundaries: headings (##, ###) and paragraphs (\n\n), not in the middle of a sentence. RecursiveCharacterTextSplitter with suitable separators does this automatically.
Metadata. Store the source, section title, and date. This lets you show the user where the answer came from and filter the search by section.
# Example: splitting that keeps the section heading
def chunk_markdown_with_headers(text: str, source: str) -> list[dict]:
    """Split Markdown on "## " headings, keeping the section title in the metadata."""
    chunks = []
    current_header = ""
    current_text = []
    for line in text.split("\n"):
        if line.startswith("## "):
            # A new section starts: save the text accumulated so far
            if current_text:
                chunks.append({
                    "text": "\n".join(current_text).strip(),
                    "source": source,
                    "section": current_header,
                })
                current_text = []
            current_header = line.lstrip("# ").strip()
        current_text.append(line)
    # The last section
    if current_text:
        chunks.append({
            "text": "\n".join(current_text).strip(),
            "source": source,
            "section": current_header,
        })
    return [c for c in chunks if len(c["text"]) > 50]  # drop empty and near-empty chunks
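The size and overlap rules can also be sketched without any library as a sliding window. Here whitespace-separated words stand in for tokens, which is only a rough approximation; the function name and the numbers in the example are illustrative assumptions.

```python
def split_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding-window split: each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the window has reached the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in split_with_overlap(words, size=4, overlap=1):
    print(chunk)
```

In the example each chunk starts with the last word of the previous one, which is the "thought not cut off at the boundary" effect the overlap rule is after.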
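Hybrid Search: Vector + Keyword
Pure vector search is good at finding matches by meaning, but it can miss exact names, versions, and code constants. Hybrid search combines keyword search (BM25 or, in Postgres, the built-in full-text ranking) with vector search. Qdrant supports it natively; with pgvector you combine the vector column with Postgres full-text search in one query.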
-- Add full-text search to the table
ALTER TABLE documents ADD COLUMN ts tsvector
    GENERATED ALWAYS AS (to_tsvector('russian', content)) STORED;
CREATE INDEX ON documents USING gin(ts);

-- Hybrid query: weighted sum of full-text rank and cosine similarity
SELECT content, source,
       0.5 * ts_rank(ts, query) + 0.5 * (1 - (embedding <=> $1::vector)) AS score
FROM documents, to_tsquery('russian', $2) query
WHERE ts @@ query OR (embedding <=> $1::vector) < 0.4
ORDER BY score DESC
LIMIT 5;
This works best for technical documents that have specific terms and names.
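The same weighted-sum idea can be sketched outside the database, for cases where keyword and vector results come from different systems. This sketch assumes both score sets are already scaled to a comparable 0..1 range; the function name, document ids, and scores are invented for illustration.

```python
def hybrid_scores(
    keyword: dict[str, float],
    vector: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Blend keyword and vector scores per document; a missing score counts as 0."""
    doc_ids = set(keyword) | set(vector)
    blended = {
        doc_id: alpha * keyword.get(doc_id, 0.0) + (1 - alpha) * vector.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from the two searches
keyword_hits = {"nginx.md": 1.0, "deploy.md": 0.2}
vector_hits = {"deploy.md": 0.9, "faq.md": 0.6}

for doc_id, score in hybrid_scores(keyword_hits, vector_hits):
    print(f"{score:.2f}  {doc_id}")
```

A document found by both searches (deploy.md here) rises above one that scored highly in only one of them, which is the point of combining the two signals.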
Real-world use cases in vibe coding
**Project documentation bot.** Index the README, CHANGELOG, and team wiki, and get a bot that answers new developers' questions without "go ask the seniors." Updating the database is one script that runs on every commit to /docs.
**Personal search over Obsidian notes.** RAG over a folder of Markdown files turns into an assistant that knows everything you have ever written down. You ask “what did I think about microservices architecture” and get excerpts from your own notes with an explanation.
**Customer support without hallucinations.** Index the knowledge base, FAQ, and product instructions. The bot answers strictly from the documents and honestly says “don’t know” if there is no answer.
**Agent code reviewer.** Index the corporate style guide, architecture decision records (ADRs), and past code reviews. Claude Code or Codex gets this context through an MCP server and reviews PRs against the project's actual standards.
**RAG on top of a Telegram channel.** Parse the channel (see the article on channel parsers), index the posts, and search the archive by meaning rather than by keywords.
Common mistakes and how to avoid them
| Mistake | Consequence | Fix |
|---|---|---|
| Chunks too large (>2000 tokens) | The model "gets lost" in the context, accuracy drops | 800-1000 tokens is the sweet spot |
| One vector for the whole document | Search misses details inside a long text | Split into chunks |
| Different embedding models for indexing and search | Complete nonsense in the output | Always the same model for both steps |
| k=1 when searching | One bad match in the database means a wrong answer | k=5-10, let the LLM pick what is relevant |
| No source metadata | You cannot verify where the answer came from | Always store source and section |
| Re-indexing everything on every change | Slow and expensive | Incremental indexing of changed files only |
Incremental indexing
import glob
import hashlib
import json
import os

HASH_FILE = ".index_hashes.json"

def file_hash(filepath: str) -> str:
    # MD5 is used only for change detection here, not for security
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def get_changed_files(docs_dir: str) -> list[str]:
    """Return only the files that are new or changed since the last run."""
    hashes = {}
    if os.path.exists(HASH_FILE):
        with open(HASH_FILE) as f:
            hashes = json.load(f)
    changed = []
    current = {}
    for filepath in glob.glob(f"{docs_dir}/**/*.md", recursive=True):
        h = file_hash(filepath)
        current[filepath] = h
        if hashes.get(filepath) != h:
            changed.append(filepath)
    # Save the fresh hashes for the next run
    with open(HASH_FILE, "w") as f:
        json.dump(current, f)
    return changed
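The function above reports new and modified files. Files that were deleted would leave stale chunks in the index, so a possible extension is to compare the old and new hash maps and report both lists. This is a sketch with a hypothetical helper name; removing the stale chunks from the vector database is left out.

```python
def diff_hashes(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare two {path: hash} maps and return (changed_or_new, removed) paths."""
    changed = sorted(path for path, digest in new.items() if old.get(path) != digest)
    removed = sorted(path for path in old if path not in new)
    return changed, removed


# Hypothetical state before and after an edit session
previous = {"docs/a.md": "111", "docs/b.md": "222", "docs/old.md": "333"}
current = {"docs/a.md": "111", "docs/b.md": "999", "docs/new.md": "444"}

changed, removed = diff_hashes(previous, current)
print("re-index:", changed)
print("delete from index:", removed)
```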
RAG quality assessment
Before deploying, it is worth checking that the system really works. Metrics to evaluate:
**Relevance**: how well the retrieved chunks relate to the question. Checked manually on a test set of questions.
**Faithfulness**: does the response contain information that was not in the context? Hallucinations are the main enemy of RAG.
**Answer correctness**: does the answer match the reference? This requires a dataset of question → correct answer pairs.
For automated evaluation there is the **RAGAS** library:
pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
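# dataset: a Hugging Face datasets.Dataset with columns question, answer, contexts, ground_truth
# (column names differ between RAGAS versions, check the docs for the one you install)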
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)
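Retrieval relevance can also be checked without any library. A minimal sketch of a "hit rate at k" check: for each test question, did the expected source file show up among the top-k results? The function name, the shape of the test cases, and the stand-in retriever are assumptions for this example; in a real setup `retrieve` would wrap your vector search.

```python
from typing import Callable


def hit_rate_at_k(test_cases: list[dict], retrieve: Callable[[str, int], list[str]], k: int = 5) -> float:
    """Share of questions whose expected source appears among the top-k retrieved sources."""
    if not test_cases:
        return 0.0
    hits = 0
    for case in test_cases:
        sources = retrieve(case["question"], k)
        if case["expected_source"] in sources:
            hits += 1
    return hits / len(test_cases)


def fake_retrieve(question: str, k: int) -> list[str]:
    """Stand-in for a real vector search, returning canned source lists."""
    canned = {
        "How do I deploy?": ["docs/deploy.md", "docs/nginx.md"],
        "How do I reset a password?": ["docs/faq.md"],
    }
    return canned.get(question, [])[:k]


cases = [
    {"question": "How do I deploy?", "expected_source": "docs/deploy.md"},
    {"question": "How do I reset a password?", "expected_source": "docs/auth.md"},
]
print(hit_rate_at_k(cases, fake_retrieve, k=5))
```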
Checklist: RAG in production
Indexing:
● Embedding model selected (the same one for indexing and search)
● Chunk size 300-1000 tokens, overlap 10-15%
● Slicing along semantic boundaries (headings, paragraphs)
● Metadata: source, section, date
● Incremental re-indexing implemented
Search:
● k = 5-10 chunks per search
● The similarity metric matches the model (cosine for most)
● Hybrid search considered if there are exact terms/names
Generation:
● The system prompt explicitly requires answering only from the context
● The model honestly says "don't know" if the answer is not in the context
● Sources are shown to the user
Infrastructure:
● Vector database with persistence (not in-memory)
● All requests and retrieved chunks are logged
● Quality is monitored on test questions
Safety:
● API keys are kept in environment variables
● If the data is private, a local LLM (Ollama) is used
● The system prompt does not leak into responses
Summary
RAG is not a complicated technology. It is a three-part pattern: cut the documents into chunks, find the relevant chunks at query time, and let the model answer based on them. It can be implemented in an evening, is updated without retraining, and works with any LLM.
The quality of the system is 80% determined by chunking and metadata, not by the choice of LLM or the complexity of the orchestrator. Start simple: a folder of Markdown files, ChromaDB, OpenAI embeddings. Make sure the search returns the right snippets, and only then add complexity.
The next level after basic RAG is agentic RAG with query reformulation, multi-hop reasoning, and self-checking of answers. But that is a separate article.
*Versions: ChromaDB 0.6+, pgvector 0.8+, LangChain 0.3+, Ollama 0.5+. June 2026.*