Benchmarks · June 2026

ContextStream scores 90.0% on LongMemEval-S.

On the full 500-instance LongMemEval-S benchmark, ContextStream scores 90.0% (89.6% single-shot). That beats supermemory with statistical significance and matches Zep within the confidence interval. On our agentic project-memory benchmark, wake-bench, the same memory layer raises task success from 58% to 96%.

LongMemEval-S full 500 · 450 correct · gpt-5.5 reader, self-consistency k=3 · GPT-4o judge · published June 14, 2026

01 Standard benchmark

Full LongMemEval-S result.

LongMemEval-S tests conversational memory over roughly 115k-token multi-session haystacks. We ran the complete 500-instance suite against ContextStream retrieval, with a disclosed reader and the official pinned GPT-4o judge.

LongMemEval-S result of record
90.0%

450 of 500 correct (gpt-5.5 reader, self-consistency k=3; 89.6% single-shot). Wilson 95% CI [87.1%, 92.3%]. This beats supermemory's 85.4% number with statistical significance and is a statistical tie with Zep's 90.2% — we say matches, not beats, because 0.2 pt is inside the measurement noise. ByteRover reports a higher 92.8%, but on a Gemini judge rather than the official GPT-4o autoeval, so it is a different comparability class (see note).
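
The interval above can be checked from the raw counts. The following is a minimal sketch using the standard Wilson score formula; the function name `wilson_interval` is ours for illustration and is not part of any ContextStream tooling. The competitor scores are the vendor-published point estimates cited in this report.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(450, 500)
print(f"[{low:.1%}, {high:.1%}]")  # [87.1%, 92.3%]

# Vendor-published point estimates, as cited in this report.
for name, score in {"supermemory": 0.854, "Zep": 0.902}.items():
    position = "inside" if low <= score <= high else "outside"
    print(f"{name}: {score:.1%} is {position} the interval")
```

supermemory's 85.4% falls below the lower bound, while Zep's 90.2% sits inside the interval, which is why the claim is parity with Zep rather than a win.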

450/500

Correct
Complete LongMemEval-S run, not a subset. 89.6% single-shot; 90.0% with self-consistency.

98.4%

Evidence ceiling
Answer-bearing sessions were retrieved for almost every question.

+8.8 pt

Cycle gain
Up from the verified 81.2% plateau after retrieval and reader work.

[87.1, 92.3]

Wilson 95% CI
Beats supermemory significantly; Zep sits inside the interval (parity).

System | LongMemEval-S | Source | Scope
ByteRover | 92.8% | ByteRover blog | Vendor-published · Gemini 3.x judge, not the official GPT-4o autoeval
Zep | 90.2% | Zep research | Vendor-published · cited 2026-06-13
ContextStream | 90.0% | This run | Full 500 · official GPT-4o judge · gpt-5.5 reader, self-consistency k=3 (89.6% single-shot)
supermemory | 85.4% | supermemory comparison | Vendor-published · cited 2026-06-13

Competitor numbers are cited from each vendor's own publication and were not re-run by ContextStream. On judge comparability: LongMemEval's official autoeval is gpt-4o-2024-08-06, which ContextStream uses. ByteRover reports a higher 92.8% but scored it with a Gemini 3.x judge rather than the official GPT-4o autoeval (per their own writeup), so it is not directly comparable to the GPT-4o-judged numbers here — a different judge can move scores in either direction. We make no claim to beat ByteRover; among systems on the official GPT-4o judge, ContextStream ties Zep and beats supermemory.

Question family | Accuracy | Evidence ceiling | n
single-session assistant | 98.2% | 100.0% | 56
single-session user | 97.1% | 100.0% | 70
knowledge update | 93.6% | 100.0% | 78
temporal reasoning | 90.2% | 95.5% | 133
single-session preference | 86.7% | 100.0% | 30
multi-session | 81.2% | 98.5% | 133
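
The per-family figures can be rolled up to confirm that they agree with the headline numbers. This is a rough sketch: the per-family integer counts are not published here, so the code assumes they can be recovered by rounding each reported percentage against its n.

```python
# (family, accuracy %, evidence ceiling %, n) as reported in this section.
families = [
    ("single-session assistant", 98.2, 100.0, 56),
    ("single-session user", 97.1, 100.0, 70),
    ("knowledge update", 93.6, 100.0, 78),
    ("temporal reasoning", 90.2, 95.5, 133),
    ("single-session preference", 86.7, 100.0, 30),
    ("multi-session", 81.2, 98.5, 133),
]


def to_count(percent: float, n: int) -> int:
    """Recover an integer count from a rounded percentage (an assumption)."""
    return round(percent / 100 * n)


total = sum(row[3] for row in families)
correct = sum(to_count(acc, n) for _, acc, _, n in families)
reachable = sum(to_count(ceiling, n) for _, _, ceiling, n in families)

print(total, correct, reachable)  # 500 450 492
print(f"overall accuracy: {correct / total:.1%}")    # 90.0%
print(f"evidence ceiling: {reachable / total:.1%}")  # 98.4%
```

Under that rounding assumption, the families sum to 450 of 500 correct and a 98.4% evidence ceiling, matching the figures reported above.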

Zep's 2025 paper includes per-family results for an older GPT-4o-reader setup. In that view, ContextStream is materially higher on multi-session recall: 81.2% vs Zep's 57.9%. Treat that as directional evidence, not the headline comparison: the same paper scores around 70% overall, and Zep has not published per-family results for its 90.2% headline. The table above is the apples-to-apples overall comparison.

What improved: hybrid lexical + vector recall, relevance-first ranking, wider candidate pools, full-content hydration, and ask-date handling moved the verified plateau from 81.2% to 89.6%; a self-consistency reader (majority vote of k=3) added the last 0.4 pt to 90.0%. The disclosed reader is gpt-5.5; the judge is the official pinned GPT-4o.
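
As an illustration of the self-consistency step, here is a minimal majority-vote sketch. It is not ContextStream's reader code: the `sample` callable stands in for one reader call, and the normalization and tie-breaking rules are assumptions, since this report does not describe them.

```python
from collections import Counter
from typing import Callable


def normalize(answer: str) -> str:
    """Collapse case and whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def self_consistent_answer(sample: Callable[[], str], k: int = 3) -> str:
    """Draw k reader answers and return the most common one.

    Ties fall back to the earliest answer, because Counter.most_common
    keeps first-seen order among equal counts.
    """
    answers = [sample() for _ in range(k)]
    counts = Counter(normalize(a) for a in answers)
    winner, _ = counts.most_common(1)[0]
    # Return the first raw answer whose normalized form won the vote.
    return next(a for a in answers if normalize(a) == winner)


# Stub reader that returns canned answers instead of calling a model.
canned = iter(["Lisbon", "Porto", "lisbon "])
print(self_consistent_answer(lambda: next(canned)))  # Lisbon
```

With k=3, a single stray answer is outvoted by two agreeing ones, which is consistent with the small gain reported over single-shot.
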
02 The agentic knowledge gap

Three tasks needed knowledge the repository did not have.

A release procedure. A customer-data spec. An ops-handoff format. Real teams carry this kind of knowledge in decisions, runbooks, chats, and previous agent sessions. The benchmark asks whether that knowledge can reach the next agent before work starts.

Without project memory
0/9

Nine attempts across the three memory-dependent tasks failed. The agents could only use what the repository exposed, and the missing knowledge was not there.

With ContextStream memory
9/9

Every memory-equipped run pulled the relevant runbook, decision, or skill before acting.

03 The mechanism

Knowledge arrives before the work starts.

A memory-equipped session does not have to rediscover what the team already learned. Relevant lessons, prior diagnoses, and conventions appear in the first context call.

# task: "page 2 of search returns the same items as page 1"
$ context("find the root cause and fix it")
✓ prior diagnosis surfaced: pagination ignored the computed offset
✓ lesson surfaced: reseed test data before running the suite
agent: "known failure mode, applying the recorded fix pattern"

Same task, no memory

The agent searches and re-derives first. It may still find the fix, but it spends turns rediscovering a failure mode the team had already captured.

Same task, with memory

The relevant prior diagnosis arrives at the start of the run. The agent can spend its budget on the work instead of repeating the investigation.

04 How we measured wake-bench

Same agent. Same repo. Same tasks. Memory was the variable.

Each run started from a fresh clone and a fresh session. The prompts stayed identical. Hidden acceptance tests checked the work after the fact, and a blind judge scored the diffs without seeing which setup produced them.

8

Tasks
The suite covers repo-solvable work and tasks that require team knowledge outside the code.

3

Trials
Each task ran three times per setup to catch one-off luck and failure.

48

Scored runs
Every run used a fresh isolated clone and session.

1

Agent model
Claude Sonnet 4.6 was used for both arms of the benchmark.

What the result means: ContextStream helped most when the task depended on a prior decision, runbook, lesson, or convention that was not present in the repository.

05 The agentic result

Across all tasks, memory changed 14/24 into 23/24.

The same agent, the same repository, the same eight tasks. The only change was whether ContextStream memory was connected.

No ContextStream

model + repo only
14/24
58% tasks passed · $0.33 median / task

ContextStream memory

index + project memory
23/24
96% tasks passed · $0.49 median / task

8 tasks × 3 trials per setup · identical prompts · fresh isolated clone and session per run
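
The run matrix and pass rates can be reproduced from the counts in this report. The sketch below is illustrative only: the setup labels are taken from the cards above, and the split into repo-solvable runs is derived on the assumption that the 0/9 and 9/9 memory-dependent results are subsets of the 24-run totals.

```python
from itertools import product

TASKS = 8
TRIALS = 3
MEMORY_DEPENDENT_TASKS = 3
SETUPS = ("no ContextStream", "ContextStream memory")

# One scored run per (setup, task, trial) combination.
runs = list(product(SETUPS, range(TASKS), range(TRIALS)))
print(len(runs))  # 48

# Pass counts as reported in this document.
passed = {"no ContextStream": 14, "ContextStream memory": 23}
memory_passed = {"no ContextStream": 0, "ContextStream memory": 9}

per_setup = TASKS * TRIALS                       # 24 runs per setup
memory_runs = MEMORY_DEPENDENT_TASKS * TRIALS    # 9 runs per setup
repo_runs = per_setup - memory_runs              # 15 runs per setup

repo_solvable = {}
for setup in SETUPS:
    repo_solvable[setup] = passed[setup] - memory_passed[setup]
    print(
        f"{setup}: {passed[setup]}/{per_setup} = {passed[setup] / per_setup:.0%} overall, "
        f"{repo_solvable[setup]}/{repo_runs} on repo-solvable runs"
    )
```

Under that assumption both setups pass 14 of 15 repo-solvable runs, so the overall gap comes entirely from the nine memory-dependent runs.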

06 Methodology & limits

What this does and does not claim.

Control | How it runs | Why it matters
LongMemEval-S full run | 500 instances, one isolated project per instance | Reports the complete benchmark with a confidence interval, not a subset or the earlier n=96 interim.
Pinned judge | Official GPT-4o autoeval judge (gpt-4o-2024-08-06) | Keeps the score comparable to vendor-published LongMemEval numbers. Verdicts audited both ways; no judge false-positives.
Disclosed reader | gpt-5.5, self-consistency k=3 (we also report 89.6% single-shot) | The reader is a disclosed, swappable component; publishing both configs keeps the comparison auditable rather than hiding a strong reader.
Honest claim | “Matches Zep, beats supermemory”; never “beats Zep” | 90.0% vs Zep's 90.2% is 0.2 pt, inside the reader's run-to-run noise. Parity is the true claim; claiming a win off a point estimate inside the noise band would be dishonest.
Competitor handling | Cite vendors' own published numbers, with source and date | We never re-run a competitor. Avoids the credibility trap of re-implementing a rival incorrectly and posting a deflated number.
Judge comparability | All ContextStream numbers use the official gpt-4o-2024-08-06 autoeval | Some vendors score with a different judge; for example, ByteRover's 92.8% uses a Gemini judge. A non-standard judge is a different comparability class, not a head-to-head; we flag it rather than absorb it into the same table silently.
Wake-bench isolation | Fresh clone and fresh session per agentic task run | Keeps the project-memory benchmark separate from conversational recall.
Limited scope | One wake-bench repo, one agent model, three trials per cell | Wake-bench single-task deltas are directional; LongMemEval is the standard full-suite number.

Your project already produces this knowledge. Keep it.

ContextStream captures decisions, fixes, lessons, and conventions once, then gives them to the next agent at the start of work.