Model Context Protocol (MCP) is an open standard that defines how AI agents connect to external tools, data sources, and services. Introduced by Anthropic in late 2024 and now adopted across the industry, MCP reached mainstream production use in 2026 — shipping inside major IDEs, agent frameworks, and enterprise AI deployments. If you are building an LLM-powered application today and have not yet thought about MCP, you are likely about to encounter it. This guide covers the architecture, the three core primitives, why memory is the most strategically important tool type, and how to evaluate an MCP server before integrating one.
What Is Model Context Protocol?
The Model Context Protocol specification defines a client-server protocol for connecting LLM hosts to capability providers. Before MCP, every team integrating an LLM with an external service wrote custom integration code — a bespoke function-calling wrapper, a custom retrieval layer, an ad-hoc prompt injection scheme. These integrations did not compose. Swapping one provider for another meant rewriting the glue code.
MCP solves this by standardizing three things: how capabilities are declared, how they are invoked, and how results flow back to the model. A conforming MCP server exposes its capabilities once, and any conforming MCP client can consume them without custom integration work. The protocol runs over stdio (for local, in-process servers) or HTTP with Server-Sent Events (for remote servers), and the wire format is JSON-RPC 2.0.
The problem MCP solves is not just developer ergonomics. It is composability at scale. As agents become multi-step and multi-tool, a shared protocol prevents the integration matrix from growing without bound.
How MCP Servers Work
The MCP architecture has three roles: host, client, and server.
- Host. The application that contains the LLM — a chat interface, a coding assistant, an agent framework. The host is responsible for orchestrating the conversation and deciding when to invoke external capabilities.
- Client. A component inside the host that speaks the MCP protocol. One client maintains one connection to one server. A host can run multiple clients simultaneously, each connected to a different server.
- Server. An external process or service that exposes capabilities via MCP. Servers are lightweight and purpose-built — a server for web search, a server for file access, a server for memory.
Servers expose capabilities through three primitives:
Tools
Functions that the LLM can call. Each tool has a name, a description, and a JSON Schema describing its parameters. The model reads the description, decides whether to call the tool, emits a structured call, and receives the result. Tools are the workhorse of MCP — they are how agents take action in the world.
Resources
Data that the server exposes for the host to read directly, without model invocation. Resources are identified by URI and can be read on demand or subscribed to for live updates. They are suited for configuration data, file contents, and structured context that does not require model reasoning to retrieve.
Prompts
Reusable prompt templates that the server defines. These allow server authors to encode expert prompt engineering — for their specific domain — and make it available to any host that connects. A memory server might expose a prompt template for summarizing recalled context; a code server might expose one for generating test cases.
At runtime, the host sends the tool list to the LLM as part of the system context. The model decides which tools to call based on their descriptions. Results come back as structured JSON that the host injects into the conversation before the model's next generation step. The whole loop is synchronous from the model's perspective but can trigger arbitrarily complex server-side logic.
Why Memory Is MCP's Most Valuable Tool Type
Most MCP tools are stateless. A web search tool fetches results and returns them. A calculator evaluates an expression. A database query tool runs a query and returns rows. These tools are useful, but they do not change what the agent fundamentally is. The agent is still stateless between sessions. It still forgets everything when the context window clears.
Memory tools are different. They give an agent the ability to accumulate understanding over time — to know things about a user across sessions, across topics, across months. This is not a quality-of-life improvement. It is an architectural shift in what agents can do.
Consider the difference between an agent that answers questions well and one that answers questions well for this specific person, given everything it has learned about them. The second agent can personalize recommendations without asking for preferences every session. It can notice that a user's goals have shifted over time. It can handle sensitive topics with awareness of previously disclosed context. It can prioritize what matters to this user rather than treating every piece of information as equally important.
None of that is possible with stateless tools alone. RAG retrieves documents by similarity — it does not score memories by importance, it does not model decay, and it has no mechanism for handling personal disclosures safely. A purpose-built memory server does all three.
This is why memory is not just another MCP tool type. It is the tool type that determines whether an agent can build a relationship with a user or whether it starts from zero every time.
What to Look for in an MCP Memory Server
Not all memory servers are equivalent. When evaluating one for production use, these are the dimensions that matter:
- Salience scoring. Does the server score memories by importance, or does it treat all memories as equal weight? A well-designed memory server evaluates each memory across multiple signal dimensions — how central it is to the user's identity, how linguistically distinctive the original disclosure was, how often it has been referenced — and surfaces the highest-importance memories at query time, not just the most recently stored ones.
- Temporal decay with processing modulation. Memories should fade over time, but the decay rate should be sensitive to how the user has engaged with a topic. Memories that have been worked through — resolved, confirmed, processed — should clear faster than unresolved concerns that have never been directly addressed. This is processing-modulated decay, and it is what separates a memory system from a timestamped log.
- Per-user isolation. In any multi-tenant deployment, memory graphs must be strictly isolated by user. A server that comingles user data is not suitable for production, regardless of how capable its retrieval is.
- Safety and compliance. A memory server that handles personal disclosures must include, at minimum: PII detection and scrubbing at ingestion time, GDPR-compliant deletion, and some mechanism for handling sensitive content (crisis language, trauma disclosures, topic suppression requests) without surfacing it indiscriminately. These are not optional for any serious deployment.
- Multi-channel retrieval. A single ranking signal produces brittle results. Production memory systems retrieve across multiple channels — highest-salience memories, recent context, and hard constraints (safety-critical information, relationship boundaries) — then assemble them within a token budget. If a server offers only a single query endpoint with no channel structure, it will underperform in complex conversations.
- Explainability. Developers need to understand why a memory surfaced. A server that can decompose its scoring — showing which signals contributed to a memory's importance — is far easier to debug and trust than one that returns results from an opaque ranking function.
KAPEX as an MCP Memory Server
KAPEX is a salience-scored memory middleware that exposes its capabilities as an MCP server. It connects to any MCP-compatible host over stdio transport and provides eight tools that cover the full memory lifecycle.
Store. Write a new memory node to the user's memory graph. The server scores the memory at ingestion time across 12 independent signal dimensions — linguistic, behavioral, temporal, and contextual — and assigns it a salience score that governs how prominently it will surface in future retrievals.
Query. Retrieve the highest-salience memories relevant to a given context. The retrieval engine runs across three channels simultaneously: importance-ranked memories above the injection threshold, recent-context memories from a sliding time window, and always-inject constraint nodes (safety pins, relationship boundaries). Results are assembled within a configurable token budget and framed according to their confidence level.
Explain scoring. Given a memory node ID, return a full decomposition of how its current salience score was computed — which signals contributed, what the decay curve looks like, and where it sits relative to the injection threshold. This is the tool developers use when debugging unexpected retrieval behavior.
Register a processing event. Signal that a memory has been engaged with — referenced in conversation, confirmed by the user, or otherwise actively processed. This is the mechanism behind processing-modulated decay: memories that have been worked through fade faster, clearing resolved topics from active retrieval while unresolved content persists. That direction — the mathematical inverse of all published approaches — is the core of KAPEX's patent-pending scoring architecture.
Reactivate a decayed memory. Apply a context-dependent spike to a memory that has faded below the retrieval threshold. When the user brings up an old topic or when a life event makes a previously low-salience memory suddenly relevant, this tool pulls it back into active retrieval range. The spike decays separately from the base score, with its own half-life, so reactivation does not permanently alter a memory's long-term trajectory.
List. Enumerate memory nodes with optional filters — by domain, salience range, node type, or recency. Useful for building dashboards, auditing what the system knows, or feeding analytics pipelines.
Delete. Permanently remove a memory node and all its associated edges from the graph. This is the GDPR Article 17 right-to-erasure endpoint. Deletion is hard — the node is gone, not soft-deleted — and it propagates to dependent structures.
Status. Return a health summary of the user's memory graph: node counts by type, average salience, graph connectivity, and system health indicators. Useful for monitoring and for surfacing graph maturity to the application layer (a graph with five nodes should behave differently than one with five hundred).
KAPEX's 13-module safety pipeline runs on every ingestion and retrieval operation. See the features page for the full capability overview, or the pricing page to start a pilot.
MCP vs. Direct API: When to Use Each
MCP and direct REST API integration are not mutually exclusive, but they are suited to different contexts. Here is a practical decision framework:
| Scenario | MCP | Direct API |
|---|---|---|
| Agent framework integration | Preferred — most frameworks speak MCP natively | Works, but requires custom wrapper code |
| IDE / coding assistant | Preferred — standard integration point | Not typical |
| Server-side application | Works via HTTP+SSE transport | Often simpler for backend-to-backend calls |
| Multiple LLM providers | Preferred — protocol is provider-agnostic | Requires provider-specific adaptation |
| Fine-grained control over prompts | Possible via Prompts primitive | Full control, no protocol overhead |
| Batch / offline processing | Less natural fit | Preferred — direct HTTP, no session management |
The short version: if the LLM is making the decision to call the capability, use MCP. If your application code is making the call programmatically, a direct API is often cleaner. Many production systems use both — MCP for agent-facing tools and direct API calls for server-side ingestion and analytics.
For KAPEX specifically, the MCP server and the REST API expose the same underlying capabilities. You can start with one and add the other without any architectural rework. The MCP servers repository on GitHub is a useful reference for seeing how the broader ecosystem structures server implementations.
Key Takeaways
- MCP is a protocol, not a product. It defines how LLM hosts connect to external capabilities — tools, resources, and prompts — using a shared wire format (JSON-RPC 2.0 over stdio or HTTP+SSE).
- The three primitives are tools, resources, and prompts. Tools are the most important for agentic applications — they are how the model takes action and how memory is read and written.
- Memory tools change what agents can do. Stateless tools make agents more capable within a session. Memory tools make agents more capable across sessions — they are the difference between a capable assistant and one that actually knows the user.
- Evaluate memory servers on scoring, decay, safety, and isolation. A server that cannot explain why a memory surfaced, handle sensitive disclosures safely, or delete data on request is not production-ready.
- MCP and direct API are complementary. Use MCP when the LLM drives the tool call. Use a direct API when your application code drives it. Most serious deployments use both.