How to Evaluate an AI Memory System: A Developer's Checklist

If you're building an AI product that needs to remember users across sessions — a companion, a sales assistant, a therapy tool, a tutor — you will eventually face the same decision: which memory system do you use? The choices range from rolling your own in a vector database to dropping in a third-party middleware layer. Most developers underestimate what "good memory" actually requires until they've shipped something and watched users churn at 90 days.

This checklist covers the criteria that actually matter. Not just "does it store things" — any key-value store can store things — but the properties that determine whether a memory system produces an AI that genuinely feels like it knows the user.

1. Does It Score What Matters, Not Just What's Recent?

The most common failure mode in DIY memory implementations is recency bias. You store everything and retrieve the most recent N chunks. This works for a demo. It fails in production.

A user who mentioned losing a parent six months ago and has never brought it up since is not going to thank your AI for ignoring that. Recency does not equal significance.

What to evaluate:

Does the system distinguish between casual mentions and emotionally significant disclosures?
Does it use multiple signals to compute significance — linguistic markers, disclosure depth, frequency, and others — rather than just timestamp or cosine similarity?
Can it retrieve a piece of context from three months ago that's more relevant than something mentioned yesterday?

What to watch for: Systems that claim "importance scoring" but only re-weight by recency or edit frequency. Ask which specific signals they use to compute significance, and whether those signals are independent of the retrieval-time query embedding.
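To make the difference concrete, here is a minimal sketch that ranks the same memories by recency and by a multi-signal significance score. The signal names, the weights, and the hand-assigned signal values are assumptions for illustration only; a real system would derive the signals from the text and would likely use more of them.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    age_days: float
    # Hypothetical signals, each in [0, 1]. Hand-assigned here for illustration.
    emotional_markers: float
    disclosure_depth: float
    mention_frequency: float


# Illustrative weights, not tuned values.
WEIGHTS = {"emotional_markers": 0.5, "disclosure_depth": 0.3, "mention_frequency": 0.2}


def significance(memory: Memory) -> float:
    """Weighted sum of the signals. Age is deliberately not an input."""
    return sum(weight * getattr(memory, name) for name, weight in WEIGHTS.items())


def top_by_recency(memories, n):
    return sorted(memories, key=lambda m: m.age_days)[:n]


def top_by_significance(memories, n):
    return sorted(memories, key=significance, reverse=True)[:n]


memories = [
    Memory("Lost my mother in the spring", 180, 0.9, 0.9, 0.1),
    Memory("Had pasta for lunch", 1, 0.0, 0.1, 0.1),
    Memory("Train was late again", 2, 0.1, 0.1, 0.3),
]

print(top_by_recency(memories, 1)[0].text)       # the most recent chunk
print(top_by_significance(memories, 1)[0].text)  # the six-month-old disclosure
```

The recency ranking surfaces the lunch; the significance ranking surfaces the disclosure from six months ago, because age never enters the score.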

2. Does It Model Decay? And How?

Memory isn't static. A concern that dominated a user's life six months ago may have been resolved. A topic that seemed minor when first mentioned may have become central. A well-designed memory system accounts for this through decay modeling.

The direction of decay matters enormously. The naive approach — and the published academic default — is to decay everything uniformly over time. This produces an AI that forgets things as if the passage of time were the only variable.

The more defensible approach is temporal decay (patent pending) that is modulated by user behavior rather than by elapsed time alone: memories the user has actively worked through and resolved fade faster, while unresolved content persists. The AI doesn't forget that the relationship ended; it forgets the running commentary once the user has clearly moved on.
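One simple way to picture behavior-modulated decay is exponential decay with a half-life that depends on whether the content has been resolved. This is a sketch under stated assumptions, not a description of any particular product's method: the half-life values, the `resolved` flag, and the function name are all hypothetical.

```python
# Hypothetical half-lives in days; illustrative values only.
HALF_LIFE_RESOLVED = 14.0
HALF_LIFE_UNRESOLVED = 365.0


def effective_salience(base_salience: float, days_since_last_mention: float, resolved: bool) -> float:
    """Exponential decay whose rate depends on user behavior, not only on elapsed time."""
    half_life = HALF_LIFE_RESOLVED if resolved else HALF_LIFE_UNRESOLVED
    return base_salience * 0.5 ** (days_since_last_mention / half_life)


# Same starting salience, same age, different resolution state.
commentary = effective_salience(0.8, 60, resolved=True)
core_fact = effective_salience(0.8, 60, resolved=False)
print(f"resolved commentary: {commentary:.3f}, unresolved fact: {core_fact:.3f}")
```

Because the function is pure, the effective salience of any node can be recomputed and inspected at query time, which is one of the properties worth asking a vendor about.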

What to evaluate:

Does the system implement decay at all, or is everything stored permanently?
Is decay rate modulated by user behavior, or purely time-based?
Can you inspect the effective salience of any given memory node at query time?

Red flags: No decay at all (retrieval pools become garbage over time). Uniform time decay with no behavioral modulation. Decay rates that are hardcoded and non-configurable.

3. Can It Cold-Start?

Cold-start is the moment a user says something for the very first time. There's no prior frequency signal. There's no edit history. There's no pattern to reference. What does the system do?

In most vector-database approaches, the chunk gets stored and nothing else happens. Its retrieval weight is the same as that of everything else stored that day.

A real memory system needs to be able to compute the significance of a first mention from linguistic signals alone — how the user phrased it, what they chose to disclose, whether it was offered voluntarily or extracted. If the system can't do this, the first session produces no useful memory state.

What to evaluate:

How does the system handle first-mention disclosures?
Does it distinguish between high-significance cold-start signals and low-significance ones?
Does the initial memory state evolve as the system sees more from the user, or is the initial score locked in?
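As a rough illustration of the questions above, the sketch below scores a first mention from the text alone and then blends in later observations so the initial score is not locked in. The marker list, the score increments, and the blending weight are assumptions chosen for readability; a production system would more likely use a trained classifier than keyword matching.

```python
import re

# Hypothetical marker list for illustration; not a real lexicon.
HIGH_SIGNIFICANCE_MARKERS = ("died", "passed away", "diagnosed", "divorce", "lost my", "never told")


def cold_start_score(text: str, volunteered: bool) -> float:
    """Score a first mention with no history: only the wording and how it was offered."""
    lowered = text.lower()
    score = 0.1  # baseline for any first mention
    if any(marker in lowered for marker in HIGH_SIGNIFICANCE_MARKERS):
        score += 0.6
    if re.search(r"\b(i|my|me)\b", lowered):  # first-person framing
        score += 0.1
    if volunteered:  # offered unprompted rather than extracted by a question
        score += 0.2
    return min(score, 1.0)


def update_score(prior: float, observation: float, weight: float = 0.3) -> float:
    """Blend the stored score with a new observation so the first estimate can evolve."""
    return (1 - weight) * prior + weight * observation


high = cold_start_score("My dad passed away last year", volunteered=True)
low = cold_start_score("The weather is nice", volunteered=False)
print(high > low)
```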

4. Does It Resolve Entities Across Sessions?

Your users don't talk in normalized database keys. They say "my dad," "Father," "him," "my old man," and occasionally his actual name — all referring to the same person. A memory system that stores these as separate chunks produces an AI that doesn't understand that these references are connected.

Entity resolution is the capability to track that multiple surface forms refer to the same underlying entity, and to build a coherent memory structure around that entity over time.

What to evaluate:

Does the system perform coreference resolution at ingestion time?
Are entities tracked as structured nodes, or as raw text chunks?
Can the system handle aliases, nicknames, and implicit references ("the company I was telling you about")?
Does it accumulate entity knowledge over sessions, so the AI's understanding of "my dad" deepens as more is disclosed?

What to watch for: Pure vector stores that store text chunks with no entity structure. These produce retrieval that works on isolated queries but fails to demonstrate coherent, growing knowledge of the user's world.
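The structural difference between entity nodes and raw chunks can be sketched in a few lines. This example assumes a coreference step has already decided which surface forms belong together (resolving pronouns such as "him" needs an NLP model and is out of scope here); the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EntityNode:
    canonical: str
    aliases: set = field(default_factory=set)
    facts: list = field(default_factory=list)


class EntityGraph:
    def __init__(self):
        self._nodes = {}        # canonical name -> EntityNode
        self._alias_index = {}  # lowercased surface form -> canonical name

    def link(self, canonical, *aliases):
        """Register surface forms that refer to the same underlying entity."""
        node = self._nodes.setdefault(canonical, EntityNode(canonical))
        for alias in (canonical, *aliases):
            node.aliases.add(alias.lower())
            self._alias_index[alias.lower()] = canonical
        return node

    def resolve(self, mention):
        canonical = self._alias_index.get(mention.lower())
        return self._nodes.get(canonical) if canonical else None

    def add_fact(self, mention, fact):
        """Attach a fact to whichever entity the mention resolves to, creating one if needed."""
        node = self.resolve(mention) or self.link(mention)
        node.facts.append(fact)
        return node


graph = EntityGraph()
graph.link("father", "my dad", "my old man", "Robert")
graph.add_fact("My dad", "Retired in 2020")    # session 1
graph.add_fact("Robert", "Lives in Ohio")      # session 5
print(graph.resolve("my old man").facts)
```

Facts disclosed under different names across sessions accumulate on one node, which is what lets the AI's picture of "my dad" deepen over time.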

5. Is It Multi-Tenant by Design?

If you're building a B2B product, you are serving multiple end users, possibly across multiple organizations. Memory isolation is not optional.

Multi-tenancy in a memory system means more than separate database rows. It means the retrieval path is structurally incapable of leaking context between users. It means administrative operations (bulk delete, export) are scoped to individual users. It means audit trails are per-user, not per-deployment.

What to evaluate:

Is tenant isolation enforced at the data layer, or only at the application layer?
Can a misconfigured query or bug in your application code cause cross-tenant leakage?
Does the system support per-user GDPR/CCPA deletion without cascading effects on other users?

Red flags: Isolation that relies entirely on application-layer filtering with a shared embedding space. If the tenant ID is only in a WHERE clause and not in the index structure, you are one query bug away from a data incident.
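The sketch below shows what "structurally incapable of leaking" can look like in miniature. It is an in-memory stand-in, and the class names are hypothetical: each tenant gets its own partition, and callers only ever receive a handle bound to one partition, so there is no filter clause to forget. In a real deployment the same idea would map to per-tenant indexes, namespaces, or database-enforced row security.

```python
class TenantScope:
    """Handle bound to one tenant's partition. It holds no reference to other tenants' data."""

    def __init__(self, partition):
        self._partition = partition

    def add(self, text):
        self._partition.append(text)

    def search(self, term):
        return [m for m in self._partition if term.lower() in m.lower()]

    def delete_all(self):
        # Bulk delete is scoped to this tenant by construction.
        self._partition.clear()


class TenantMemoryStore:
    def __init__(self):
        self._partitions = {}  # tenant_id -> that tenant's own list of memories

    def for_tenant(self, tenant_id):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        return TenantScope(self._partitions.setdefault(tenant_id, []))


store = TenantMemoryStore()
alice = store.for_tenant("org-a:alice")
bob = store.for_tenant("org-b:bob")
alice.add("Negotiating a renewal with Acme")
bob.add("Prefers morning meetings")

print(bob.search("acme"))     # bob's scope cannot see alice's memory
alice.delete_all()
print(bob.search("morning"))  # alice's deletion does not touch bob
```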

6. What Are the Compliance Capabilities?

Depending on your vertical, you may be subject to GDPR (right to erasure), HIPAA (PHI handling), CCPA (right to deletion), or EU AI Act provisions. Your memory system needs to support these requirements.

What to evaluate:

Per-node deletion: Can you delete a specific memory without destroying the user's entire memory state? (An erasure request under GDPR Article 17 can target specific data rather than the whole account.)
PII detection: Does the system scan for and handle Personally Identifiable Information at ingestion time?
Audit trails: Can you produce a log of what was stored, when, and what was retrieved for any given session?
Data residency: Where does memory data live? Can you specify region?
Retention limits: Can you configure automatic expiry of memory data?

GDPR's right to erasure is frequently misunderstood by teams building memory systems. "We can delete all their data" is not always enough: a user may ask for specific content to be erased while the rest of their history is preserved. Systems that can only do full user deletion force an all-or-nothing choice and are likely to struggle in regulated verticals.
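Per-node deletion and an audit trail can be sketched together. This is a minimal in-memory example with hypothetical names, not a compliance implementation; the one design choice worth noting is that the audit log records node IDs and actions, never the content, so erased text does not survive in the log.

```python
import uuid
from datetime import datetime, timezone


class UserMemory:
    def __init__(self, user_id):
        self.user_id = user_id
        self.nodes = {}      # node_id -> stored text
        self.audit_log = []  # IDs and actions only, no content

    def _log(self, action, node_id):
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "node_id": node_id,
        })

    def store(self, text):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = text
        self._log("store", node_id)
        return node_id

    def delete_node(self, node_id):
        """Selective erasure: remove one memory, leave the rest of the history intact."""
        removed = self.nodes.pop(node_id, None) is not None
        if removed:
            self._log("delete", node_id)
        return removed

    def delete_all(self):
        """Full erasure, built on the same per-node primitive."""
        for node_id in list(self.nodes):
            self.delete_node(node_id)


memory = UserMemory("user-123")
keep_id = memory.store("Works as a nurse")
drop_id = memory.store("Mentioned a past diagnosis")
memory.delete_node(drop_id)
print(list(memory.nodes.values()))
print([entry["action"] for entry in memory.audit_log])
```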

7. Is Safety Treated as a First-Class Requirement?

Memory systems sit between the user and the LLM. They see everything the user says. In consumer products especially — companions, therapy tools, wellness apps — this means the memory system will see disclosures of crisis, trauma, abuse, and sensitive personal history.

A memory system with no safety layer is a liability. A safety layer that can be overridden by memory state is also a liability.

What to evaluate:

Does the system include crisis detection that operates independently of memory state?
Are safety triggers evaluated at retrieval time, not just at ingestion time?
Does the system apply special handling to sensitive disclosures (crisis, trauma, abuse, substance use)?
Is topic suppression supported — the ability to block specific topics from appearing in responses even if they're present in the memory graph?
Can safety constraints persist across sessions as non-decaying flags?

The correct architecture is a safety layer that wraps the memory layer — one that runs regardless of what the memory state contains. If the safety layer is downstream of retrieval, it can be compromised by what the memory system surfaced.
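The ordering described above can be sketched as a wrapper: the crisis check runs on the raw message before any retrieval and never reads memory state, and topic suppression filters whatever retrieval returns. The term list, function names, and the stubbed memory layer are all assumptions for illustration; real crisis detection needs far more than substring matching.

```python
# Hypothetical placeholder list; a real detector would not rely on keywords alone.
CRISIS_TERMS = ("kill myself", "end my life", "hurt myself")


def detect_crisis(message: str) -> bool:
    """Looks only at the current message. It takes no memory state as input."""
    lowered = message.lower()
    return any(term in lowered for term in CRISIS_TERMS)


def retrieve_memories(user_id: str, message: str) -> list:
    """Stand-in for the memory layer. Note the stored content could mislead a downstream check."""
    return ["User enjoys dark humor and says not to take jokes seriously."]


def build_context(user_id: str, message: str, suppressed_topics=()) -> dict:
    # 1. Safety runs first, on the raw message, regardless of what memory contains.
    if detect_crisis(message):
        return {"mode": "crisis", "memories": []}
    # 2. Only then is memory consulted.
    memories = retrieve_memories(user_id, message)
    # 3. Suppression is applied at retrieval time, even if the topic is in the graph.
    allowed = [
        m for m in memories
        if not any(topic.lower() in m.lower() for topic in suppressed_topics)
    ]
    return {"mode": "normal", "memories": allowed}


print(build_context("u1", "Sometimes I want to end my life")["mode"])
print(build_context("u1", "Tell me a joke", suppressed_topics=("humor",))["memories"])
```

In this arrangement a stored note like "don't take jokes seriously" cannot talk the system out of a crisis response, because the check never sees it.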

8. What Are the Latency and Throughput Characteristics?

Memory retrieval is on the critical path. It happens before the LLM generates a response, which means it adds directly to your time-to-first-token. For real-time conversation, this matters.

What to evaluate:

What are the p50 and p99 retrieval latencies with a mature memory graph (thousands of nodes)?
Does performance degrade as the memory graph grows?
Is there a caching layer? How is the cache invalidated after writes?
What is the throughput ceiling? Can it handle concurrent users at your target scale?

Get benchmarks at scale, not at the "hello world" graph size the documentation examples use.
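A small harness like the one below is enough to run that comparison yourself. The `retrieve` callable is a stand-in (a naive linear scan) that you would replace with a call to the system under evaluation; the graph sizes and query count are arbitrary choices for the sketch.

```python
import statistics
import time


def measure_latencies(retrieve, queries):
    """Return per-query wall-clock latency in milliseconds."""
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        retrieve(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def percentile(values, pct):
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    return statistics.quantiles(values, n=100)[pct - 1]


def make_linear_scan(node_count):
    """Stand-in retrieval whose cost grows with graph size."""
    nodes = [f"node-{i}" for i in range(node_count)]

    def retrieve(query):
        return [n for n in nodes if query in n]

    return retrieve


for size in (100, 10_000):
    latencies = measure_latencies(make_linear_scan(size), ["node-7"] * 100)
    print(f"{size:>6} nodes  p50={percentile(latencies, 50):.3f} ms  p99={percentile(latencies, 99):.3f} ms")
```

Report p99 alongside p50: tail latency is what users feel in a real-time conversation, and it is usually where growth in the memory graph shows up first.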

9. Build vs. Buy: The Real Calculus

The "just use Postgres and a vector column" approach is compelling until you've shipped it. Each of the capabilities above takes significant engineering time to build correctly: significance scoring across multiple independent signal dimensions, temporal decay (patent pending), entity resolution, cold-start handling, multi-tenant isolation, compliance primitives, and safety. Together, they can represent 12-18 months of specialized work.

The build case makes sense if your memory requirements are genuinely simple (recency-weighted, single-tenant, low volume) or if memory is core IP that differentiates your product.

The buy case makes sense when memory is infrastructure — necessary but not differentiating — and you want your team building the product, not the plumbing.

If you're evaluating third-party solutions, apply every criterion in this checklist. Most current options (Mem0, Zep, vector-database DIY) pass on storage and basic retrieval but fail on significance scoring, decay modeling, and compliance. That gap is where your users' experience will ultimately live or die.

Key Takeaways

Significance scoring is not recency weighting. Ask specifically how any candidate system computes importance for a first-mention disclosure.
Decay modeling should be behavior-modulated, not purely time-based. Resolved content should fade. Unresolved content should persist.
Entity resolution determines whether your AI understands users as whole people or as bags of text chunks.
Multi-tenancy must be enforced at the data layer. Application-layer isolation is not sufficient for production B2B products.
GDPR compliance calls for per-node deletion. A full user wipe is an all-or-nothing answer to an Article 17 request that targets specific content.
Safety must be independent of memory state, not downstream of it.
Build vs. buy: if you're not planning to build a team around memory infrastructure, the engineering cost of doing it correctly is often 5-10x what most teams estimate.

KAPEX is patent-pending memory middleware that provides salience-scored, decay-modeled persistent memory for any LLM application. It covers all the criteria in this checklist — significance scoring across 12 independent signal dimensions, temporal decay (patent pending), entity resolution, multi-tenant isolation, a 13-module safety pipeline, and per-node GDPR deletion.

