On the full 500-instance standard memory benchmark, ContextStream scores 90.0% (89.6% single-shot): it beats supermemory with statistical significance and matches Zep within the confidence interval. On our agentic project-memory benchmark, the same memory layer raises task success from 58% to 96%.
LongMemEval-S tests conversational memory over roughly 115k-token multi-session haystacks. We ran the complete 500-instance suite against ContextStream retrieval, with a disclosed reader and the official pinned GPT-4o judge.
450 of 500 correct, or 90.0% (gpt-5.5 reader, self-consistency k=3; 89.6% single-shot). Wilson 95% CI [87.1%, 92.3%]. This beats supermemory's 85.4% with statistical significance, since 85.4% falls below the interval's lower bound, and is a statistical tie with Zep's 90.2%. We say matches, not beats, because the 0.2 pt gap is inside the measurement noise. ByteRover reports a higher 92.8%, but with a Gemini judge rather than the official GPT-4o autoeval, so it belongs to a different comparability class (see note).
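The interval quoted above can be checked with a few lines of Python. This is a minimal sketch of the standard Wilson score formula, not ContextStream's own evaluation code; the function name is illustrative, and the vendor numbers are treated as fixed point estimates (an assumption, since their own sampling error is not published here).

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(450, 500)
print(f"{low:.1%} to {high:.1%}")  # 87.1% to 92.3%

# Vendor-published scores, treated as fixed values (assumption: their own
# sampling error is not modeled in this comparison).
supermemory, zep = 0.854, 0.902
print("supermemory below interval:", supermemory < low)
print("Zep inside interval:", low <= zep <= high)
```

Comparing a fixed vendor score against one system's interval is a simplification. A two-sample test would need each vendor's instance-level results, which are not available in this write-up.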
Correct
Complete LongMemEval-S run, not a subset. 89.6% single-shot; 90.0% with self-consistency.
Evidence ceiling
Answer-bearing sessions were retrieved for almost every question.
Cycle gain
Up from the verified 81.2% plateau after retrieval and reader work.
Wilson 95% CI
Beats supermemory significantly; Zep sits inside the interval (parity).
| System | LongMemEval-S | Source | Scope |
|---|---|---|---|
| ByteRover | 92.8% | ByteRover blog | Vendor-published · Gemini 3.x judge — not the official GPT-4o autoeval |
| Zep | 90.2% | Zep research | Vendor-published · cited 2026-06-13 |
| ContextStream | 90.0% | This run | Full 500 · official GPT-4o judge · gpt-5.5 reader, self-consistency k=3 (89.6% single-shot) |
| supermemory | 85.4% | supermemory comparison | Vendor-published · cited 2026-06-13 |
Competitor numbers are cited from each vendor's own publication and were not re-run by ContextStream. On judge comparability: LongMemEval's official autoeval is gpt-4o-2024-08-06, which ContextStream uses. ByteRover reports a higher 92.8% but scored it with a Gemini 3.x judge rather than the official GPT-4o autoeval (per their own writeup), so it is not directly comparable to the GPT-4o-judged numbers here — a different judge can move scores in either direction. We make no claim to beat ByteRover; among systems on the official GPT-4o judge, ContextStream ties Zep and beats supermemory.
Zep's 2025 paper includes per-family results for an older GPT-4o-reader setup. In that view, ContextStream is materially higher on multi-session recall: 81.2% vs Zep's 57.9%. Treat that as directional evidence, not the headline comparison: Zep scores only around 70% overall in that paper and has not published per-family results for its 90.2% headline. The table above is the apples-to-apples overall comparison.
A release procedure. A customer-data spec. An ops-handoff format. Real teams carry this kind of knowledge in decisions, runbooks, chats, and previous agent sessions. The benchmark asks whether that knowledge can reach the next agent before work starts.
Nine attempts across the three memory-dependent tasks failed. The agents could only use what the repository exposed, and the missing knowledge was not there.
Every memory-equipped run pulled the relevant runbook, decision, or skill before acting.
A memory-equipped session does not have to rediscover what the team already learned. Relevant lessons, prior diagnoses, and conventions appear in the first context call.
Without memory, the agent searches and re-derives first. It may still find the fix, but it spends turns rediscovering a failure mode the team had already captured.
With memory, the relevant prior diagnosis arrives at the start of the run. The agent can spend its budget on the work instead of repeating the investigation.
Each run started from a fresh clone and a fresh session. The prompts stayed identical. Hidden acceptance tests checked the work after the fact, and a blind judge scored the diffs without seeing which setup produced them.
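One way to picture the blind-judging step is sketched below. It assumes a simple scheme, not the harness actually used: each diff gets an opaque ID, the arm label is dropped, the order is shuffled, and a private key maps IDs back to runs after scoring. The `Run` type, the arm labels, and the task name are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    task: str
    arm: str   # hypothetical labels: "baseline" or "memory"
    diff: str


def blind(runs: list[Run], seed: int = 0) -> tuple[list[dict], dict[str, Run]]:
    """Return label-free judging items plus a private key for un-blinding."""
    rng = random.Random(seed)
    shuffled = runs[:]
    rng.shuffle(shuffled)  # ordering must not leak the arm either
    items, key = [], {}
    for i, run in enumerate(shuffled):
        item_id = f"item-{i:03d}"
        # The judge sees the task and the diff, never the arm.
        items.append({"id": item_id, "task": run.task, "diff": run.diff})
        key[item_id] = run
    return items, key


runs = [
    Run("release-procedure", "baseline", "diff A"),
    Run("release-procedure", "memory", "diff B"),
]
items, key = blind(runs)
print(items)
```

The key stays with the harness and is only consulted once all scores are recorded.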
Tasks
The suite covers repo-solvable work and tasks that require team knowledge outside the code.
Trials
Each task ran three times per setup to catch one-off luck and failure.
Scored runs
Every run used a fresh isolated clone and session.
Agent model
Claude Sonnet 4.6 was used for both arms of the benchmark.
The same agent, the same repository, the same eight tasks. The only change was whether ContextStream memory was connected.
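The design described above implies a small, fully crossed run matrix. The sketch below enumerates it under the stated parameters (eight tasks, two setups, three trials each). Task and arm names are placeholders, and the totals are derived from those parameters, not quoted from the results.

```python
from itertools import product

TASKS = [f"task-{i}" for i in range(1, 9)]  # eight tasks; names are placeholders
ARMS = ["baseline", "memory"]               # without / with ContextStream memory
TRIALS = 3

run_matrix = [
    {"task": task, "arm": arm, "trial": trial}
    for task, arm, trial in product(TASKS, ARMS, range(1, TRIALS + 1))
]

per_arm = len(run_matrix) // len(ARMS)
memory_dependent_attempts = 3 * TRIALS  # three memory-dependent tasks per arm

print(len(run_matrix), per_arm, memory_dependent_attempts)  # 48 24 9
```

Under these parameters each arm has 24 scored runs, and the nine failed attempts on memory-dependent tasks correspond to three tasks times three trials in the arm without memory.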
| Control | How it runs | Why it matters |
|---|---|---|
| LongMemEval-S full run | 500 instances, one isolated project per instance | Reports the complete benchmark with a confidence interval, not a subset or the earlier n=96 interim. |
| Pinned judge | Official GPT-4o autoeval judge (gpt-4o-2024-08-06) | Keeps the score comparable to vendor-published LongMemEval numbers. Verdicts audited both ways; no judge false-positives. |
| Disclosed reader | gpt-5.5, self-consistency k=3 (we also report 89.6% single-shot) | The reader is a disclosed, swappable component; publishing both configs keeps the comparison auditable rather than hiding a strong reader. |
| Honest claim | “Matches Zep, beats supermemory” — never “beats Zep” | 90.0% vs Zep's 90.2% is 0.2 pt, inside the reader's run-to-run noise. Parity is the true claim; claiming a win off a point estimate inside the noise band would be dishonest. |
| Competitor handling | Cite vendors' own published numbers, with source and date | We never re-run a competitor. Avoids the credibility trap of re-implementing a rival incorrectly and posting a deflated number. |
| Judge comparability | All ContextStream numbers use the official gpt-4o-2024-08-06 autoeval | Some vendors score with a different judge — e.g. ByteRover's 92.8% uses a Gemini judge. A non-standard judge is a different comparability class, not a head-to-head; we flag it rather than absorb it into the same table silently. |
| Wake-bench isolation | Fresh clone and fresh session per agentic task run | Keeps the project-memory benchmark separate from conversational recall. |
| Limited scope | One wake-bench repo, one agent model, three trials per cell | Wake-bench single-task deltas are directional; LongMemEval is the standard full-suite number. |
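The "self-consistency k=3" reader setting in the table can be illustrated with a majority vote over sampled answers. This is a generic sketch of that technique, assuming simple whitespace and case normalization; the text does not specify how ContextStream's reader aggregates its samples, and `sample` stands in for a hypothetical call to the reader model.

```python
from collections import Counter
from typing import Callable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variants count as one answer."""
    return " ".join(answer.lower().split())


def self_consistent_answer(sample: Callable[[], str], k: int = 3) -> str:
    """Draw k answers and return the most common one (ties go to the first seen)."""
    answers = [sample() for _ in range(k)]
    counts = Counter(normalize(a) for a in answers)
    winner, _ = counts.most_common(1)[0]
    # Return the original surface form of the winning answer.
    return next(a for a in answers if normalize(a) == winner)


# Stand-in for three reader samples on one question (hypothetical data).
fake_samples = iter(["Paris", "paris ", "Lyon"])
print(self_consistent_answer(lambda: next(fake_samples)))  # Paris
```

With k=1 the function reduces to a single-shot reader, which is why both configurations can be reported from the same pipeline.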
ContextStream captures decisions, fixes, lessons, and conventions once, then gives them to the next agent at the start of work.