Knowledge Base

The knowledge base is a per-instance corpus of documents an agent can consult during a turn. It is how you give a Polyant instance the things it should know — company policies, product datasheets, regulations, onboarding guides, runbooks — without baking them into the prompt. Unlike memory, which is auto-extracted from conversations, knowledge is content you (or the agent itself) put there deliberately.

Knowledge content lives in PostgreSQL — there is no filesystem source of truth. Content enters through two paths: admin-panel upload and the agent-driven writeKnowledge tool. Both go through the same pipeline: the engine chunks the content on sentence boundaries, embeds each chunk with OpenAI’s text-embedding-3-small, and persists everything in two tables: knowledge_documents (the parent record) and knowledge_chunks (the searchable units).

What goes in the knowledge base

The knowledge base is intentionally generic. Polyant does not impose a content shape — anything UTF-8 fits. Two broad uses emerge:

  • Static reference docs — markdown notes, exported PDFs (after extraction), policy text, glossaries, product spec sheets. These are uploaded once via the admin panel or seed scripts and rarely change. Operators own this content.
  • Dynamic agent-written facts — the agent can call the writeKnowledge tool to persist things like “user prefers replies in Italian”, “user works at Acme Srl”, or short summaries of completed projects. This complements the automatic memory store: memory captures conversational facts as they pass through the supervisor; knowledge captures structured notes the agent commits intentionally, by filename.

The two channels are deliberately separate. Memory dedups by cosine similarity and is unpredictable about what survives an extraction pass. Knowledge is addressable by filename, mutable in place, and the agent can getKnowledge a specific document by name when it needs a verbatim copy.

Tools the agent sees

When knowledgeEnabled is true on the instance, three tools are wired into the supervisor:

  • searchKnowledge — query the corpus by natural language and get back the top-N relevant chunks with their source filename. This is the default retrieval path; the agent reaches for it whenever it needs information it does not already have in context.
  • getKnowledge — fetch a document by exact filename. Used when the agent already knows which doc to read (e.g. an earlier searchKnowledge returned "refund-policy.md").
  • writeKnowledge — create, overwrite, or append a document by filename. After the write, a fire-and-forget pass re-chunks and re-embeds the document in the background. The document is immediately readable via getKnowledge, but new chunks become searchable only after the reindex completes.
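
The read-after-write behaviour of writeKnowledge can be sketched with an in-memory model. This is an illustration only, not the engine's code: the class name KnowledgeStoreSketch, its methods, and the injectable schedule callback are hypothetical, and a substring match stands in for vector search.

```typescript
type Job = () => void;

// Hypothetical in-memory stand-in for the knowledge store.
class KnowledgeStoreSketch {
  private documents = new Map<string, string>(); // filename -> raw content
  private chunks = new Map<string, string[]>(); // filename -> indexed chunks
  private schedule: (job: Job) => void;

  // The scheduler is injectable so the background pass can be controlled in tests.
  constructor(schedule?: (job: Job) => void) {
    this.schedule = schedule ?? ((job) => { setTimeout(job, 0); });
  }

  write(filename: string, content: string, mode: "overwrite" | "append" = "overwrite"): void {
    const previous = this.documents.get(filename) ?? "";
    const next = mode === "append" && previous ? `${previous}\n${content}` : content;
    this.documents.set(filename, next); // readable immediately
    // Fire-and-forget: the caller does not wait for the reindex to finish.
    this.schedule(() => this.reindex(filename));
  }

  get(filename: string): string | undefined {
    return this.documents.get(filename);
  }

  // Only indexed chunks are searchable; substring match replaces vector search here.
  search(term: string): string[] {
    const hits: string[] = [];
    this.chunks.forEach((parts, filename) => {
      if (parts.some((part) => part.includes(term))) hits.push(filename);
    });
    return hits;
  }

  // Replaces any previous chunks for the document with freshly built ones.
  private reindex(filename: string): void {
    const content = this.documents.get(filename);
    if (content === undefined) {
      this.chunks.delete(filename);
      return;
    }
    this.chunks.set(filename, content.split("\n").filter(Boolean));
  }
}
```

With a scheduler that only collects jobs, get returns the new content right after write, while search finds nothing until the collected job runs.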

When knowledgeEnabled is false, none of these tools are registered for that instance — they do not appear in the supervisor’s tool list at all. The gate lives in buildTools() and skips every registered tool whose category === "knowledge".
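
A minimal sketch of that gate, assuming a flat tool registry. The ToolDefinition and InstanceConfig shapes, the registry contents, and the "memory" category are illustrative assumptions; only the category === "knowledge" check and the knowledgeEnabled flag come from this page.

```typescript
// Hypothetical shapes; the real registry entries carry more than a name and a category.
interface ToolDefinition {
  name: string;
  category: string;
}

interface InstanceConfig {
  knowledgeEnabled: boolean;
}

// Illustrative registry. The "memory" category is an assumption.
const registeredTools: ToolDefinition[] = [
  { name: "searchKnowledge", category: "knowledge" },
  { name: "getKnowledge", category: "knowledge" },
  { name: "writeKnowledge", category: "knowledge" },
  { name: "searchMemory", category: "memory" },
];

// Skip every knowledge tool unless the instance has knowledge enabled.
function buildTools(instance: InstanceConfig, registered: ToolDefinition[]): ToolDefinition[] {
  return registered.filter(
    (tool) => instance.knowledgeEnabled || tool.category !== "knowledge",
  );
}
```

Filtering at build time means a disabled instance never advertises the tools to the model, so there is nothing for the agent to call by mistake.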

Chunking, multilingual-aware

chunker.ts splits text on sentence boundaries with a target chunk size around 2000 characters (~500 tokens) and a 200-character overlap. The splitter is multilingual-aware: a curated set of abbreviations (Dr., Dott., Prof., Sig.ra, Ing., Avv., plus English equivalents) does not trigger a sentence break, so an Italian document like “Il Dr. Rossi ha confermato l’appuntamento” stays in one chunk rather than fragmenting at every honorific.
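
The splitting strategy can be approximated as follows. This is a simplified sketch, not the contents of chunker.ts: the function names are hypothetical, the abbreviation set is a partial list (the English entries are guesses), whitespace is normalized to single spaces, and the overlap is taken as a raw character tail that may start mid-word.

```typescript
// Partial, illustrative abbreviation list; the English entries are assumptions.
const ABBREVIATIONS = new Set(["dr.", "dott.", "prof.", "ing.", "avv.", "mr.", "mrs."]);

// Split on sentence-ending punctuation, except after a known abbreviation.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let current = "";
  for (const token of text.split(/\s+/).filter(Boolean)) {
    current = current ? `${current} ${token}` : token;
    const endsSentence = /[.!?]$/.test(token) && !ABBREVIATIONS.has(token.toLowerCase());
    if (endsSentence) {
      sentences.push(current);
      current = "";
    }
  }
  if (current) sentences.push(current);
  return sentences;
}

// Pack whole sentences into chunks of roughly targetSize characters.
// A single sentence longer than targetSize still becomes one oversized chunk.
function chunkText(text: string, targetSize = 2000, overlap = 200): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitSentences(text)) {
    if (current && current.length + 1 + sentence.length > targetSize) {
      chunks.push(current);
      // Carry the tail of the finished chunk forward as overlap.
      current = overlap > 0 ? current.slice(-overlap).trimStart() : "";
    }
    current = current ? `${current} ${sentence}` : sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}

// "Dr." does not end a sentence, so this yields two sentences, not three.
const exampleSentences = splitSentences("Il Dr. Rossi ha confermato l’appuntamento. Grazie.");
```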

Each chunk gets its own row in knowledge_chunks with a vector(1536) embedding, a chunkIndex, and a back-pointer to its parent knowledge_documents row.
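
As a rough picture of the two row shapes, the types below mirror the columns named on this page. The camelCase field names, the string ids, and the toChunkRows helper are assumptions made for illustration; the actual definitions live in schema.ts.

```typescript
type DocumentStatus = "uploading" | "processing" | "ready" | "error";

// Parent record (knowledge_documents).
interface KnowledgeDocumentRow {
  id: string;
  instanceId: string;
  filename: string;
  rawContent: string;
  status: DocumentStatus;
  source: "upload" | "agent";
}

// Searchable unit (knowledge_chunks).
interface KnowledgeChunkRow {
  id: string;
  documentId: string; // back-pointer to the parent document
  content: string;
  embedding: number[]; // stored as vector(1536)
  chunkIndex: number;
}

const EMBEDDING_DIM = 1536;

// Hypothetical helper: pair chunk texts with embeddings and assign chunk indexes.
function toChunkRows(documentId: string, chunks: string[], embeddings: number[][]): KnowledgeChunkRow[] {
  if (chunks.length !== embeddings.length) {
    throw new Error("chunk and embedding counts differ");
  }
  return chunks.map((content, chunkIndex) => {
    const embedding = embeddings[chunkIndex] ?? [];
    if (embedding.length !== EMBEDDING_DIM) {
      throw new Error(`expected ${EMBEDDING_DIM} dimensions, got ${embedding.length}`);
    }
    return { id: `${documentId}:${chunkIndex}`, documentId, content, embedding, chunkIndex };
  });
}
```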

Retrieval

searchKnowledge today is a pure pgvector search: the query is embedded, then cosine similarity against the chunks table returns the top-N matches, joined back to their parent document’s filename. The retrieval helper lives in packages/engine/src/knowledge/search.ts.
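
To make the ranking step concrete, here is an in-memory equivalent of a cosine top-N search. In the engine this work is done by pgvector inside Postgres (pgvector exposes cosine distance through its `<=>` operator); the types and function names below are hypothetical and the toy vectors are two-dimensional rather than 1536-dimensional.

```typescript
interface IndexedChunk {
  filename: string; // filename of the parent document
  content: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query embedding and keep the n best.
function topChunks(queryEmbedding: number[], chunks: IndexedChunk[], n: number) {
  return chunks
    .map((chunk) => ({
      filename: chunk.filename,
      content: chunk.content,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((left, right) => right.score - left.score)
    .slice(0, n);
}
```

A real index avoids scoring every row; the exhaustive scan here is only meant to show what "top-N by cosine similarity" returns.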

For consistency across the retrieval surfaces, see Hybrid Search — that page describes the RRF + pgvector + FTS algorithm used by searchMemory. The knowledge surface uses the same embedding model (OpenAI text-embedding-3-small, 1536d) and the same simple Postgres FTS configuration when keyword search is enabled.
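
For orientation only, reciprocal rank fusion (RRF) in its generic form merges several ranked lists by summing 1 / (k + rank) for each item. The sketch below is the textbook formulation with the conventional k = 60, not a copy of Polyant's code; the Hybrid Search page remains the reference for how searchMemory applies it, and the function name is hypothetical.

```typescript
// Generic RRF: each ranking lists ids from best to worst.
function reciprocalRankFusion(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  return fused.sort((left, right) => right.score - left.score);
}
```

An item that appears in both the vector ranking and the keyword ranking accumulates two contributions, which is why it tends to rise above items found by only one method.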

Lifecycle

Documents land in the same knowledge_documents table regardless of how they were created. The source column records which path each one came from:

  • Upload (source = "upload") — admin panel or POST /api/instances/:slug/knowledge. The document is created with status = "uploading", then ingestion runs and moves it through "processing" to "ready" (or "error"). This is the path the admin UI’s drag-and-drop uses, and the canonical way to ship a curated corpus into a new instance.
  • Agent-authored (source = "agent") — the writeKnowledge tool creates or appends documents under filenames chosen by the agent. A fire-and-forget reindex runs after every write. This is what the agent uses to persist user-volunteered facts that don’t fit the conversational shape of memory.
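
The status transitions for an upload can be sketched as below. The record shape and the createUpload and ingest helpers are hypothetical, and the processing step is passed in as a synchronous callback purely to keep the example small; in the engine, processDocument() in ingestion.ts owns these transitions.

```typescript
type DocumentStatus = "uploading" | "processing" | "ready" | "error";

interface DocumentRecord {
  filename: string;
  source: "upload" | "agent";
  status: DocumentStatus;
  rawContent: string;
}

// An uploaded document starts in "uploading".
function createUpload(filename: string, rawContent: string): DocumentRecord {
  return { filename, source: "upload", status: "uploading", rawContent };
}

// Run the (injected) chunk + embed + persist step and record the outcome.
function ingest(doc: DocumentRecord, process: (content: string) => void): DocumentRecord {
  const processing: DocumentRecord = { ...doc, status: "processing" };
  try {
    process(processing.rawContent);
    return { ...processing, status: "ready" };
  } catch {
    // Any failure during processing leaves the document in "error".
    return { ...processing, status: "error" };
  }
}
```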

Other lifecycle events:

  • Reindex — any writeKnowledge or admin-side edit reuses the same ingestion path: previous chunks are deleted, the new content is re-chunked and re-embedded.
  • Delete — removing a document cascades to all its chunks.
  • Disable — flipping knowledgeEnabled = false on the instance leaves the data intact but hides the tools from the supervisor.

How it works

    upload (admin panel | writeKnowledge tool)
                     |
                     v
    chunker.ts (sentence boundaries, multilingual abbrev list)
                     |
                     v
    embed-each-chunk
      OpenAI text-embedding-3-small, 1536 dim
                     |
                     v
    persist
      knowledge_documents { id, instance_id, filename, raw_content, status }
      knowledge_chunks    { id, document_id, content, embedding, chunk_index }
                     |
                     v
    retrieval (per turn, only if knowledgeEnabled=true)
                     |
            +--------+----------------------------+
            |                                     |
            v                                     v
    searchKnowledge(query)                getKnowledge(filename)
      embed query                           fetch raw_content
      pgvector cosine search                return document body
      top-N chunks + filenames

Code reference

  • packages/engine/src/knowledge/schema.ts — knowledge_documents, knowledge_chunks, knowledge_document_status enum.
  • packages/engine/src/knowledge/chunker.ts — Sentence splitter with the multilingual abbreviation set.
  • packages/engine/src/knowledge/ingestion.ts — processDocument(): chunk + embed + persist + status transitions.
  • packages/engine/src/knowledge/store.ts — Document CRUD (upsertAgentDocument, appendAgentDocument, searchByVector).
  • packages/engine/src/knowledge/search.ts — searchKnowledge() query path.
  • packages/engine/src/agents/tools/search-knowledge.tool.ts — Tool wrapper consumed by the supervisor.
  • packages/engine/src/agents/tools/get-knowledge.tool.ts — Filename-based fetch.
  • packages/engine/src/agents/tools/write-knowledge.tool.ts — Write/append + fire-and-forget reindex.
  • packages/engine/src/agents/supervisor/index.ts — buildTools() gates knowledge tools on knowledgeEnabled.

See also

  • Memory — sister concept: auto-extracted conversational facts vs. agent-written knowledge.
  • Hybrid Search — the retrieval algorithm shared with memory.
  • Tools — registry and per-instance enablement.
  • Knowledge admin UI — upload and manage knowledge documents.