How a Conversational AI Agent Works Inside

Engineering

12 min read

1 June 2026

The 6 stages of a conversation turn in OpenClaw — with real latency, cost per conversation and the 4 lines of defence against hallucination.

OpenClaw Team · Engineering & Product Team

The OpenClaw Team is made up of engineers, designers and AI specialists dedicated to building the best conversational agent platform for Brazilian businesses. We combine expertise…

How a Conversational AI Agent Works Inside (OpenClaw Architecture)

How does a conversational AI agent work in practice, turn by turn? This post opens the black box of OpenClaw: from the moment the customer's message arrives on WhatsApp to the text the agent writes back. It gets technical. It's worth the read if you make product architecture decisions, if you're about to buy a solution and want to evaluate its fundamentals, or if you simply enjoy knowing what happens behind the conversation.

TL;DR: each turn goes through 6 stages — ingest, resolve context, select skills, decide next action, execute with guard-rails, persist memory. The entire cycle runs in <2 seconds on Cloudflare's edge, without a fixed server.

Why architecture matters

Conversational agents that seem to work in a demo but break in production generally have one of these 4 problems:

High latency — customer waits 8 seconds for a response, conversation dies.
Uncontrolled hallucination — agent makes up price, time, policy.
Lost context — customer returns after 2 days and agent "forgets" everything.
Uncontrolled cost — each long conversation fills up the prompt and you pay a fortune in tokens.

All 4 are architecture choices, not model limitations. OpenClaw was built to avoid all 4, and the way to understand how is to look at the cycle of a turn.

The cycle of a turn (6 stages)

Imagine the customer has just sent the message "I want to book for Saturday morning". What happens between "received" and the agent's response?

Stage 1 — Ingest (edge worker, <50ms)

The WhatsApp message arrives via Meta's webhook directly at a Cloudflare Worker running in the geographically closest point of presence (PoP). In Brazil, this means São Paulo or Rio, with network latency under 20ms.

The worker does three things:

Validates the signature of the webhook (HMAC against WABA secret).
Identifies the tenant by the receiver's phone number (multi-tenant by to_number).
Normalises the payload — audio becomes transcription, image becomes description, location becomes {lat,lng}, text stays as is.

At the end of stage 1 you have an object {tenant_id, conversation_id, user_message} ready for the next step.
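
The three ingest steps can be sketched as follows. This is a minimal illustration in Python, not OpenClaw's actual worker code: the payload shape, the `TENANTS_BY_NUMBER` table, the `from_number` field and the way `conversation_id` is built are all assumptions made for the example. The `sha256=<hex>` signature format is the one Meta uses in its `X-Hub-Signature-256` webhook header; audio transcription and image description are assumed to have happened upstream.

```python
import hashlib
import hmac
import json

# Hypothetical tenant table: receiving phone number -> tenant id.
TENANTS_BY_NUMBER = {"+5511999990000": "tenant_demo"}


def verify_signature(raw_body: bytes, header: str, secret: bytes) -> bool:
    """Check an 'sha256=<hex>' HMAC-SHA256 signature over the raw request body."""
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected, header)


def normalise(message: dict) -> str:
    """Reduce every message type to text (simplified, assumed payload shape)."""
    kind = message["type"]
    if kind == "text":
        return message["text"]
    if kind == "audio":
        return message["transcription"]
    if kind == "image":
        return message["description"]
    if kind == "location":
        return json.dumps({"lat": message["lat"], "lng": message["lng"]})
    raise ValueError(f"unsupported message type: {kind}")


def ingest(raw_body: bytes, signature: str, secret: bytes) -> dict:
    """Validate, identify the tenant by to_number, and normalise the message."""
    if not verify_signature(raw_body, signature, secret):
        raise PermissionError("invalid webhook signature")
    payload = json.loads(raw_body)
    tenant_id = TENANTS_BY_NUMBER[payload["to_number"]]
    return {
        "tenant_id": tenant_id,
        # Assumed scheme: one conversation per (tenant, customer number) pair.
        "conversation_id": f'{tenant_id}:{payload["from_number"]}',
        "user_message": normalise(payload["message"]),
    }
```

The signature is checked against the raw bytes before any JSON parsing, so a forged request is rejected without its content ever being interpreted.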

Stage 2 — Resolve context (D1 + KV, ~80ms)

The agent needs 3 pieces of context before deciding:

Recent history of the conversation (last N relevant turns).
Long-term memory of the customer (preferences, purchase history, notes).
Agent state (persona, enabled skills, rules).

All come from D1 (Cloudflare's distributed SQLite). D1 replaces traditional Postgres/Mongo — no database server to maintain, access in just a few ms from the worker, multi-tenant by tenant_id.

Key point: we don't load the entire conversation into the prompt. OpenClaw's Memory Manager v2 (described in our internal documentation) selects only the turns relevant to the current one (the last N plus N of high semantic relevance). This keeps token cost predictable even in conversations with 100+ turns.
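
To make the "last N + N relevant" idea concrete, here is a small sketch. It is not the Memory Manager v2 implementation, which the article does not show: the function name `select_turns`, its parameters and the word-overlap score are assumptions, and a real system would more likely score relevance with embeddings.

```python
def select_turns(history, query, recent_n=2, relevant_n=2):
    """Return up to `relevant_n` relevant older turns plus the last `recent_n` turns."""
    recent = history[-recent_n:] if recent_n > 0 else []
    older = history[:-recent_n] if recent_n > 0 else list(history)
    query_words = set(query.lower().split())

    def overlap(turn):
        # Toy relevance score (Jaccard word overlap), a stand-in for
        # embedding similarity.
        words = set(turn["text"].lower().split())
        union = words | query_words
        return len(words & query_words) / len(union) if union else 0.0

    # Rank older turns by relevance, keep the best ones, then restore
    # chronological order so the prompt still reads as a conversation.
    ranked = sorted(enumerate(older), key=lambda p: overlap(p[1]), reverse=True)
    top = sorted(ranked[:relevant_n], key=lambda p: p[0])
    return [turn for _, turn in top if overlap(turn) > 0] + recent
```

Whatever the scoring function, the output never exceeds `recent_n + relevant_n` turns, which is what keeps the prompt size bounded as the conversation grows.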

Stage 3 — Skill selection (policy engine, ~20ms)

Each agent has a set of available skills — functions it can invoke. Examples: check_calendar, create_event, generate_payment_link, check_order, call_human.

Given the message "I want to book for Saturday morning", the policy engine filters:

Skills compatible with the detected intent (scheduling).
Skills allowed for this conversation phase (not every skill is available all the time).
Skills that this tenant has enabled (calendar only appears if the tenant has integrated it).

In the end you have a small subset of skills passed to the model — not all 50 possible ones, just the 4 that make sense here. This drastically reduces the chance of the model invoking the wrong skill.
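
The three filters can be expressed as a single predicate over a skill catalogue. The sketch below is illustrative only: the skill names come from the list above, but the `Skill` structure, the intent labels and the phase names are invented for the example and are not OpenClaw's policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    intents: frozenset  # intents this skill can serve
    phases: frozenset   # conversation phases in which it may run


# Hypothetical catalogue: names from the article, metadata assumed.
CATALOGUE = [
    Skill("check_calendar", frozenset({"scheduling"}), frozenset({"discovery", "booking"})),
    Skill("create_event", frozenset({"scheduling"}), frozenset({"booking"})),
    Skill("generate_payment_link", frozenset({"payment"}), frozenset({"checkout"})),
    Skill("check_order", frozenset({"order_status"}), frozenset({"discovery"})),
    Skill("call_human", frozenset({"scheduling", "payment", "order_status"}),
          frozenset({"discovery", "booking", "checkout"})),
]


def select_skills(intent, phase, tenant_enabled):
    """Apply the three filters: intent, conversation phase, tenant configuration."""
    return [
        skill.name
        for skill in CATALOGUE
        if intent in skill.intents
        and phase in skill.phases
        and skill.name in tenant_enabled
    ]
```

With these assumed values, a scheduling message in the discovery phase exposes `check_calendar` and `call_human` to the model, and a tenant that has not integrated a calendar is left with `call_human` only.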

Stage 4 — Decision (LLM call, 400-1200ms)

Now the model comes in. At this stage OpenClaw makes a single call to a frontier LLM (Anthropic Claude, OpenAI GPT or Google Gemini, configurable per tenant) with:

System prompt = agent persona + rules + available skills.
History = turns selected in stage 2.
User message = current turn's message.

The model responds with one of two things:

Final response (direct text to the customer).
Tool call (request to execute a specific skill with parameters).

In the example "I want to book for Saturday morning", the model typically returns:

{
  "tool": "check_calendar",
  "args": { "date_range": "2026-04-19 06:00 to 12:00" }
}
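
Downstream code has to tell the two outcomes apart before anything runs. The sketch below assumes the model's output reaches the pipeline as a string that is either plain text or a JSON object shaped like the one above. Provider SDKs usually return tool calls as structured objects instead, so `parse_model_output` is a hypothetical helper, not a real API.

```python
import json


def parse_model_output(raw: str, allowed_skills: set) -> dict:
    """Classify the model's reply as a tool call or a final response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Plain text: treat it as the final response to the customer.
        return {"kind": "final", "text": raw}
    if isinstance(data, dict) and "tool" in data:
        # Only skills selected in stage 3 may be invoked.
        if data["tool"] not in allowed_skills:
            raise ValueError(f"skill outside the allowed subset: {data['tool']}")
        return {"kind": "tool_call", "tool": data["tool"], "args": data.get("args", {})}
    return {"kind": "final", "text": raw}
```

Rejecting a tool name that was not in the subset passed to the model is a cheap second check on top of the stage 3 filtering.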

Stage 5 — Execution with guard-rails (variable, ~100-500ms)

The skill does not run in the model. It runs in our code, which:

Validates parameters (is date_range in the correct format? does it comply with tenant rules?).
Checks permission (does this agent have the right to query this calendar?).
Executes the call (Google Calendar API in this case).
Returns structured result to the model.

Why does this matter? Because the model never fabricates the result. If the calendar returns [10h, 11h], that's exactly what goes to the next call. If the skill fails, the model knows it failed. Zero risk of the agent "making up" that there's a slot at 9h when there isn't.

For cases involving sensitive information (price, deadline, customer name), the pipeline enforces a tool call: it doesn't let the model respond from its own "knowledge". This eliminates the most common class of hallucination in commercial agents.
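
The four execution steps in this stage can be sketched as a single wrapper. Everything here is an assumption for illustration: the `run_skill` name, the shape of `agent` and `backends`, the error labels, and the `date_range` pattern, which simply mirrors the format of the example tool call above.

```python
import re

# Format used in the article's example: "YYYY-MM-DD HH:MM to HH:MM".
DATE_RANGE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2} to \d{2}:\d{2}")


def run_skill(agent, tool, args, backends):
    """Validate, authorise, execute, and always return a structured result."""
    # 1. Validate parameters before touching any external system.
    if tool == "check_calendar" and not DATE_RANGE.fullmatch(str(args.get("date_range", ""))):
        return {"ok": False, "error": "invalid_arguments"}
    # 2. Check that this agent is allowed to use the skill.
    if tool not in agent["permissions"]:
        return {"ok": False, "error": "permission_denied"}
    # 3. Execute the call (e.g. a wrapper around the calendar API).
    try:
        data = backends[tool](**args)
    except Exception as exc:
        # The failure is reported to the model instead of being hidden.
        return {"ok": False, "error": f"skill_failed: {exc}"}
    # 4. Return the backend's data unchanged.
    return {"ok": True, "data": data}
```

Because every path returns an explicit `ok` flag, the model's next call sees either the real data or a clear failure, never an empty result it might fill in on its own.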

Stage 6 — Response and persistence (~50ms)

With the skill result in hand, the pipeline calls the model a second time, now to compose the final response to the customer. For example:

"I have Saturday at 10h and 11h. Which do you prefer?"

In parallel, the worker:

Sends the message back via the WhatsApp API.
Persists the complete turn (user + assistant + tool calls + duration) in D1.
Updates long-term memory if the turn produced new facts (e.g., "customer prefers Saturday").
Emits observability event (latency metric, token cost, escalation rate).

Because these steps run concurrently, persistence does not block message sending: the customer doesn't wait for D1.
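
The non-blocking pattern can be illustrated with `asyncio`. This is a language-neutral sketch in Python rather than worker code: `finish_turn` and its callback parameters are hypothetical. On Cloudflare Workers the comparable mechanism is `ctx.waitUntil()`; the article does not say whether OpenClaw uses it.

```python
import asyncio


async def finish_turn(reply, send, persist, update_memory, emit_metrics):
    """Send the reply; run persistence, memory and metrics without delaying it."""
    # Schedule the bookkeeping first so it overlaps with the network send.
    background = asyncio.gather(
        persist(reply),
        update_memory(reply),
        emit_metrics(reply),
        return_exceptions=True,  # a failed write must not break the turn
    )
    # The customer-facing step completes on its own schedule.
    await send(reply)
    # The caller awaits this future only after the reply has gone out.
    return background
```

With `return_exceptions=True`, a failed write shows up as an exception object in the gathered results, where it can be logged or retried, while the message to the customer has already been sent.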

Where the defence against hallucination lies

Agents that hallucinate in production lose trust quickly. OpenClaw has 4 lines of defence:

Forced source-of-truth. Factual data (price, time, name) always comes from skills, never from the model alone.
Double verification on sensitive data. Appointments are confirmed with the customer before persisting. Payment is confirmed before granting access.
Explicit negative rules. Each agent's persona includes "never make up X, Y, Z" — the model complies.
Fallback to human. When no skill covers the question, the agent says "let me check with the team" and opens a ticket — it doesn't guess.
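
The fourth line of defence reduces to one routing decision. The sketch below is only an illustration of that decision: `answer_or_escalate`, the intent-to-skill mapping and the `open_ticket` callback are hypothetical names, and the reply text is the one quoted above.

```python
def answer_or_escalate(intent, skills_by_intent, open_ticket):
    """Route to a skill when one covers the intent; otherwise hand off to a human."""
    skill = skills_by_intent.get(intent)
    if skill is None:
        # No source of truth for this question: escalate instead of guessing.
        ticket_id = open_ticket(intent)
        return {"reply": "Let me check with the team.", "ticket": ticket_id, "skill": None}
    return {"reply": None, "ticket": None, "skill": skill}
```

The agent never reaches the model with an uncovered factual question, so there is nothing for it to guess at.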

In audits we conducted over the past 6 months (real conversations manually reviewed), the factual hallucination rate stayed below 0.3% of turns, and almost all cases were due to configuration (the tenant forgot to enable the relevant skill), not model error.

The cost per conversation

Good architecture is invisible until you look at the bill. Given that each turn makes 1-2 LLM calls + D1 lookups, the typical cost per complete conversation (10-15 turns) is:

OpenClaw Team

Published on 1 June 2026