Benchmarks · July 2026

ContextStream scores 90.0% on LongMemEval-S.

On the full 500-instance standard memory benchmark, ContextStream scores 90.0% (89.6% single-shot), which beats supermemory with statistical significance and matches Zep within the confidence interval. On our agentic project-memory benchmark, the same memory layer raises task success from 58% to 96%. Our independently judged code-search benchmark publishes every category win and loss below.

Start free: 3,000 credits, no card → See LongMemEval

LongMemEval-S full 500 · 450 correct · gpt-5.5 reader, self-consistency k=3 · GPT-4o judge · published June 14, 2026

01 Standard benchmark

Full LongMemEval-S result.

LongMemEval-S tests conversational memory over roughly 115k-token multi-session haystacks. We ran the complete 500-instance suite against ContextStream retrieval, with a disclosed reader and the official pinned GPT-4o judge.

LongMemEval-S result of record

90.0%

450 of 500 correct (gpt-5.5 reader, self-consistency k=3; 89.6% single-shot). Wilson 95% CI [87.1%, 92.3%]. This beats supermemory's 85.4% with statistical significance and is a statistical tie with Zep's 90.2%: we say matches, not beats, because 0.2 pt is inside the measurement noise. ByteRover reports a higher 92.8%, but with a Gemini judge rather than the official GPT-4o autoeval, so it belongs to a different comparability class (see note).

450/500

Correct
Complete LongMemEval-S run, not a subset. 89.6% single-shot; 90.0% with self-consistency.

98.4%

Evidence ceiling
Answer-bearing sessions were retrieved for almost every question.

+8.8 pt

Cycle gain
Up from the verified 81.2% plateau after retrieval and reader work.

[87.1, 92.3]

Wilson 95% CI
Beats supermemory significantly; Zep sits inside the interval (parity).
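The interval can be reproduced from the two published counts alone. The following is a minimal sketch of the standard Wilson score interval, not ContextStream's evaluation code; the function name is illustrative, and comparing a competitor's published point estimate against the interval treats that figure as a fixed reference value, which is an assumption made here because per-sample competitor results are not available.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


low, high = wilson_interval(450, 500)
print(f"[{low:.1%}, {high:.1%}]")  # [87.1%, 92.3%]

# Vendor-published scores as cited on this page, treated as fixed values.
print(0.854 < low)            # True: supermemory falls below the interval
print(low <= 0.902 <= high)   # True: Zep falls inside it
```

The Wilson form is used rather than the simpler normal approximation because it stays well behaved for proportions near 0% or 100%.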

System	LongMemEval-S	Source	Scope
ByteRover	92.8%	ByteRover blog	Vendor-published · Gemini 3.x judge — not the official GPT-4o autoeval
Zep	90.2%	Zep research	Vendor-published · cited 2026-06-13
ContextStream	90.0%	This run	Full 500 · official GPT-4o judge · gpt-5.5 reader, self-consistency k=3 (89.6% single-shot)
supermemory	85.4%	supermemory comparison	Vendor-published · cited 2026-06-13

Competitor numbers are cited from each vendor's own publication and were not re-run by ContextStream. On judge comparability: LongMemEval's official autoeval is gpt-4o-2024-08-06, which ContextStream uses. ByteRover reports a higher 92.8% but, per its own writeup, scored it with a Gemini 3.x judge rather than the official GPT-4o autoeval, so it is not directly comparable to the GPT-4o-judged numbers here; a different judge can move scores in either direction. We make no claim to beat ByteRover; among systems on the official GPT-4o judge, ContextStream ties Zep and beats supermemory.

single-session assistant

98.2%

ceiling 100.0% · n=56

single-session user

97.1%

ceiling 100.0% · n=70

knowledge update

93.6%

ceiling 100.0% · n=78

temporal reasoning

90.2%

ceiling 95.5% · n=133

single-session preference

86.7%

ceiling 100.0% · n=30

multi-session

81.2%

ceiling 98.5% · n=133
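The six family results above can be cross-checked against the headline figures. The sketch below assumes each per-family count is recoverable by rounding the published percentage times its n; the page publishes percentages only, so the reconstructed counts are an inference, not reported data.

```python
# (family, accuracy %, evidence ceiling %, n) as published above
families = [
    ("single-session assistant", 98.2, 100.0, 56),
    ("single-session user", 97.1, 100.0, 70),
    ("knowledge update", 93.6, 100.0, 78),
    ("temporal reasoning", 90.2, 95.5, 133),
    ("single-session preference", 86.7, 100.0, 30),
    ("multi-session", 81.2, 98.5, 133),
]


def to_count(percent: float, n: int) -> int:
    """Recover an integer count from a rounded percentage (an assumption)."""
    return round(percent / 100 * n)


total = sum(n for _, _, _, n in families)
correct = sum(to_count(acc, n) for _, acc, _, n in families)
reachable = sum(to_count(ceiling, n) for _, _, ceiling, n in families)

print(f"{correct}/{total} = {correct / total:.1%}")                      # 450/500 = 90.0%
print(f"evidence ceiling {reachable}/{total} = {reachable / total:.1%}")  # 492/500 = 98.4%
```

Under that rounding assumption, the family scores sum to the 450 of 500 headline and the family ceilings sum to the 98.4% evidence ceiling.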

Zep's 2025 paper includes per-family results for an older GPT-4o-reader setup. In that view, ContextStream is materially higher on multi-session recall: 81.2% vs Zep's 57.9%. Treat that as directional evidence, not the headline comparison: the same paper scores around 70% overall, and Zep has not published per-family results for its 90.2% headline. The table above is the apples-to-apples overall comparison.

What improved: hybrid lexical + vector recall, relevance-first ranking, wider candidate pools, full-content hydration, and ask-date handling moved the verified plateau from 81.2% to 89.6%; a self-consistency reader (majority vote of k=3) added the last 0.4 pt to 90.0%. The disclosed reader is gpt-5.5; the judge is the official pinned GPT-4o.
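For readers unfamiliar with self-consistency, the idea is to sample the reader k times and keep the most common answer. The sketch below is illustrative only: the function names are hypothetical, and grouping answers by exact match after whitespace and case normalization is an assumption, since the page does not describe how the production reader decides that two free-text answers agree.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Collapse case and whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def self_consistent_answer(samples: list[str]) -> str:
    """Majority vote over reader samples; ties go to the earliest sample."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(normalize(s) for s in samples)
    best = max(counts.values())
    for sample in samples:
        if counts[normalize(sample)] == best:
            return sample
    raise AssertionError("unreachable: some sample always has the top count")


# k=3 samples for one hypothetical question
print(self_consistent_answer(["Lisbon", "lisbon ", "Porto"]))  # Lisbon
```

With k=3, a single stray sample is outvoted, which is consistent with the small 0.4 pt gain reported over the single-shot reader.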

These results come from project memory your repo already produces.

Start free: 3,000 credits, no card → See team plans

02 Code search

Real queries. Pinned code. Losses included.

We mined verbatim developer queries from merged pull requests, ran every full-corpus search lane twice, and had two model families grade the pooled results against immutable source. A third model saw only their disagreements. GitHub's rate-limited REST search is reported separately on its fixed requested subset.
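The grading flow described above, two graders plus a third that sees only their disagreements, can be sketched as follows. This is a hedged illustration rather than the benchmark's actual pipeline: the function names, the integer grade scale, and the file paths are hypothetical, and the tie-breaker here is a stub standing in for the third model.

```python
from typing import Callable


def adjudicate(
    grades_a: dict[str, int],
    grades_b: dict[str, int],
    tie_breaker: Callable[[str, int, int], int],
) -> dict[str, int]:
    """Keep grades the two graders agree on; send only disagreements onward."""
    if grades_a.keys() != grades_b.keys():
        raise ValueError("both graders must grade the same pooled files")
    final = {}
    for path in sorted(grades_a):
        a, b = grades_a[path], grades_b[path]
        final[path] = a if a == b else tie_breaker(path, a, b)
    return final


escalated = []


def stub_third_grader(path: str, a: int, b: int) -> int:
    """Stand-in for the third model: records the case, takes the lower grade."""
    escalated.append(path)
    return min(a, b)


final = adjudicate(
    {"src/pager.py": 2, "src/util.py": 0},
    {"src/pager.py": 2, "src/util.py": 1},
    stub_third_grader,
)
print(final)      # {'src/pager.py': 2, 'src/util.py': 0}
print(escalated)  # ['src/util.py']
```

Restricting the third grader to disagreements keeps its influence limited to the contested labels.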

180

Queries judged
Every query received an independent validity judgment; invalid queries remain in the published sensitivity view.

177

Valid queries scored
The headline view includes only queries the independent judgment chain marked valid; every other label remains in the sensitivity view.

97.6%

Useful file in the top 10
ContextStream auto, two-run cold mean; run-to-run spread 0.0 pt.

91.5%

Best-answer file in the top 10
The stricter primary-answer measure; run-to-run spread 0.0 pt.

Recall@10, in plain English: did at least one useful file appear in the first ten results? A useful file at rank 11 is a miss. Primary Recall@10 is stricter: it requires one of the highest-relevance answer files in that window.
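Both measures reduce to a per-query hit test. The sketch below assumes integer relevance grades where 0 means not useful, and it interprets "highest-relevance" as the top grade assigned for that query; both are assumptions, and the file paths are made up for illustration.

```python
def recall_at_k(ranked_files: list[str], relevance: dict[str, int],
                k: int = 10, min_grade: int = 1) -> float:
    """1.0 if any of the first k files has grade >= min_grade, else 0.0."""
    return float(any(relevance.get(path, 0) >= min_grade for path in ranked_files[:k]))


def primary_recall_at_k(ranked_files: list[str], relevance: dict[str, int],
                        k: int = 10) -> float:
    """Stricter variant: only files with the query's top grade count as hits."""
    top_grade = max(relevance.values(), default=0)
    if top_grade == 0:
        return 0.0
    return recall_at_k(ranked_files, relevance, k, min_grade=top_grade)


ranked = [f"src/file_{i}.py" for i in range(1, 12)]    # 11 results
relevance = {"src/file_3.py": 1, "src/file_11.py": 2}  # best answer sits at rank 11

print(recall_at_k(ranked, relevance))          # 1.0: a useful file is at rank 3
print(primary_recall_at_k(ranked, relevance))  # 0.0: the best answer is at rank 11
```

The reported percentages would then be the mean of these per-query values over the scored queries.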

System	Setup	Queries	Recall@10 (run-to-run spread)	Primary@10	First-useful rank (MRR)	Overall order (nDCG@10)	Negative rejection	Successful p50	Cold failed
ContextStream auto	no local checkout	177	97.6% (0.0 pt)	91.5%	0.930	0.750	100.0%	112 ms	0 / 354
Embedding-only (BGE-small)	local checkout	177	84.2% (0.0 pt)	71.5%	0.677	0.476	0.0%	12 ms	0 / 354
ripgrep fixed literal	local checkout	177	67.3% (0.0 pt)	41.8%	0.645	0.360	100.0%	13 ms	0 / 354
GitHub REST code search	no local checkout	29	0.0% (0.0 pt)	0.0%	0.000	0.000	100.0%	489 ms	36 / 58

Quality values are the mean of two cold runs; “spread” is the max-minus-min difference between those runs, not a confidence interval. Latency is calculated only over successful responses, while every cold-run error or missing execution remains a quality miss and is reported separately. “First-useful rank” is MRR: it rewards putting the first useful file earlier. “Overall order” is nDCG@10: it rewards placing all higher-relevance files nearer the top. Negative rejection means correctly returning nothing for a query whose answer is not in the pinned repository. GitHub covers its disclosed fixed subset, so its row is not a full-corpus head-to-head. Scope details: ContextStream auto: Full judged-valid corpus; production API. Embedding-only (BGE-small): Full judged-valid corpus; local checkout. ripgrep fixed literal: Full judged-valid corpus; local checkout. GitHub REST code search: Fixed 30-query requested subset; moving default branches and unsupported/corpus-mismatch responses remain misses.
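The two ranking measures and the run-to-run spread can be sketched as below. Several details are assumptions: the reciprocal rank is cut off at rank 10, nDCG uses linear gain with a log2 discount (one common variant among several), and the paths and grades are invented. The published results file is the authority on the exact formulas used.

```python
from math import log2


def reciprocal_rank(ranked_files: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """1/rank of the first useful file in the top k, or 0.0 if there is none."""
    for rank, path in enumerate(ranked_files[:k], start=1):
        if relevance.get(path, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_files: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """Discounted gain of the ranking divided by that of the ideal ordering."""
    def dcg(grades: list[int]) -> float:
        return sum(g / log2(i + 1) for i, g in enumerate(grades, start=1))

    gains = [relevance.get(path, 0) for path in ranked_files[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


def mean_and_spread(run_scores: list[float]) -> tuple[float, float]:
    """Mean across runs, and max-minus-min spread (not a confidence interval)."""
    return sum(run_scores) / len(run_scores), max(run_scores) - min(run_scores)


ranked = ["a.py", "b.py", "c.py"]
relevance = {"b.py": 2, "c.py": 1}

print(reciprocal_rank(ranked, relevance))       # 0.5: first useful file is at rank 2
print(round(ndcg_at_k(ranked, relevance), 3))   # below 1.0 because a.py ranks first
print(mean_and_spread([0.976, 0.976]))          # (0.976, 0.0)
```

MRR averages the first value over queries, and nDCG@10 averages the second; identical cold runs give a spread of exactly zero.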

This is a fixed-pool, model-judged benchmark—not human qrels. New systems or changed rankings must be pooled and reviewed again. The complete evidence pack, reproduction commands, agreement statistics, variance, and negative results are in the published results file.

03 The agentic knowledge gap

Three tasks needed knowledge the repository did not have.

A release procedure. A customer-data spec. An ops-handoff format. Real teams carry this kind of knowledge in decisions, runbooks, chats, and previous agent sessions. The benchmark asks whether that knowledge can reach the next agent before work starts.

Without project memory

0/9

Nine attempts across the three memory-dependent tasks failed. The agents could only use what the repository exposed, and the missing knowledge was not there.

With ContextStream memory

9/9

Every memory-equipped run pulled the relevant runbook, decision, or skill before acting.

04 The mechanism

Knowledge arrives before the work starts.

A memory-equipped session does not have to rediscover what the team already learned. Relevant lessons, prior diagnoses, and conventions appear in the first context call.

# task: "page 2 of search returns the same items as page 1"

$ context("find the root cause and fix it")

✓ prior diagnosis surfaced: pagination ignored the computed offset

✓ lesson surfaced: reseed test data before running the suite

agent: "known failure mode, applying the recorded fix pattern"

Same task, no memory

The agent searches and re-derives first. It may still find the fix, but it spends turns rediscovering a failure mode the team had already captured.

Same task, with memory

The relevant prior diagnosis arrives at the start of the run. The agent can spend its budget on the work instead of repeating the investigation.

05 How we measured wake-bench

Same agent. Same repo. Same tasks. Memory was the variable.

Each run started from a fresh clone and a fresh session. The prompts stayed identical. Hidden acceptance tests checked the work after the fact, and a blind judge scored the diffs without seeing which setup produced them.

Tasks
The suite covers repo-solvable work and tasks that require team knowledge outside the code.

Trials
Each task ran three times per setup to catch one-off luck and failure.

Scored runs
Every run used a fresh isolated clone and session.

Agent model
Claude Sonnet 4.6 was used for both arms of the benchmark.

What the result means: ContextStream helped most when the task depended on a prior decision, runbook, lesson, or convention that was not present in the repository.

06 The agentic result

Across all tasks, memory changed 14/24 into 23/24.

The same agent, the same repository, the same eight tasks. The only change was whether ContextStream memory was connected.

No ContextStream

model + repo only

14/24

58% tasks passed · $0.37 median / task

ContextStream memory

index + project memory

23/24

96% tasks passed · $0.51 median / task

8 tasks × 3 trials per setup · identical prompts · fresh isolated clone and session per run
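The headline pass rates follow directly from the run counts. The sketch below also splits each arm into memory-dependent and other runs, assuming the 0/9 and 9/9 counts reported for the three memory-dependent tasks are subsets of the 24 runs per setup; the counts for the remaining tasks are obtained by subtraction and are implied, not separately published.

```python
TASKS, TRIALS, MEMORY_DEPENDENT_TASKS = 8, 3, 3

runs_per_setup = TASKS * TRIALS                # 24
memory_runs = MEMORY_DEPENDENT_TASKS * TRIALS  # 9
other_runs = runs_per_setup - memory_runs      # 15

# Pass counts as reported on this page.
reported = {
    "No ContextStream": {"passed": 14, "memory_passed": 0},
    "ContextStream memory": {"passed": 23, "memory_passed": 9},
}


def breakdown(passed: int, memory_passed: int) -> dict[str, float]:
    """Overall pass rate plus the implied pass count on the other tasks."""
    return {
        "pass_rate": passed / runs_per_setup,
        "other_passed": passed - memory_passed,
    }


for setup, counts in reported.items():
    b = breakdown(**counts)
    print(
        f"{setup}: {counts['passed']}/{runs_per_setup} ({b['pass_rate']:.0%}), "
        f"memory-dependent {counts['memory_passed']}/{memory_runs}, "
        f"other tasks {b['other_passed']}/{other_runs}"
    )
```

Under that assumption, both arms pass 14 of the 15 runs on the other tasks, so the whole difference comes from the memory-dependent ones.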

07 Methodology & limits

What this does and does not claim.

Control	How it runs	Why it matters
LongMemEval-S full run	500 instances, one isolated project per instance	Reports the complete benchmark with a confidence interval, not a subset or the earlier n=96 interim.
Pinned judge	Official GPT-4o autoeval judge (gpt-4o-2024-08-06)	Keeps the score comparable to vendor-published LongMemEval numbers. Verdicts audited both ways; no judge false-positives.
Disclosed reader	gpt-5.5, self-consistency k=3 (we also report 89.6% single-shot)	The reader is a disclosed, swappable component; publishing both configs keeps the comparison auditable rather than hiding a strong reader.
Honest claim	“Matches Zep, beats supermemory” — never “beats Zep”	90.0% vs Zep's 90.2% is 0.2 pt, inside the reader's run-to-run noise. Parity is the true claim; claiming a win off a point estimate inside the noise band would be dishonest.
Competitor handling	Cite vendors' own published numbers, with source and date	We never re-run a competitor. Avoids the credibility trap of re-implementing a rival incorrectly and posting a deflated number.
Code-search corpus	180 verbatim merged-PR queries across three public repositories pinned at fixed SHAs	Real development text avoids prompts invented around what the current search system already handles well.
Code-search judgments	Two independent blinded reviews, disagreement-only adjudication, and an all-query sensitivity view	The score is based on inspected pinned source, not files mechanically inferred from the PR that supplied a query.
Fixed-pool limit	Equal-depth top-10 pooling from every published project-scoped lane	Paths outside the frozen pool are unjudged, not proven irrelevant. A new system or changed ranking must be pooled and reviewed again.
Judge comparability	All ContextStream numbers use the official gpt-4o-2024-08-06 autoeval	Some vendors score with a different judge — e.g. ByteRover's 92.8% uses a Gemini judge. A non-standard judge is a different comparability class, not a head-to-head; we flag it rather than absorb it into the same table silently.
Wake-bench isolation	Fresh clone and fresh session per agentic task run	Keeps the project-memory benchmark separate from conversational recall.
Limited scope	One wake-bench repo, one agent model, three trials per cell	Wake-bench single-task deltas are directional; LongMemEval is the standard full-suite number.

Your project already produces this knowledge. Keep it.

ContextStream captures decisions, fixes, lessons, and conventions once, then gives them to the next agent at the start of work.

Start free: 3,000 credits, no card → See team plans · Read the docs