Evals & Benchmarks
Performance evaluations and benchmark results from Captain.
BEIR Text Retrieval Benchmark
April 2026
We ran Captain against Elasticsearch v8 and Gemini File Search on three BEIR datasets, scored by nDCG@10 with human relevance judgments.
- SciFact, NFCorpus, and FiQA
- 1,271 queries across 66,454 documents
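For readers unfamiliar with the metric: nDCG@10 compares the graded relevance of a system's top-10 results against the best possible ordering of the judged documents for that query. A minimal sketch of the computation (our actual harness uses the standard BEIR tooling; the toy judgments below are illustrative only):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k: DCG of the system's ranking, normalized by the ideal ranking."""
    ideal = sorted(all_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

# Toy example: graded relevance of the docs a system returned, in rank order,
# plus the full set of judged relevances for the query.
ranked = [3, 2, 0, 1, 0]
judged = [3, 2, 1, 0, 0, 0]
print(round(ndcg_at_k(ranked, judged), 3))  # 0.985
```

A perfect ordering scores 1.0; swapping a relevant document below an irrelevant one costs more the higher up it happens, because of the log-discount.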
Results
- Captain averaged 0.534 overall
- +24% over Gemini, +38% over Elasticsearch
- Gemini File Search returned nothing for 10-31% of queries, depending on the dataset
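The headline deltas and empty-result rates follow directly from the per-system numbers in the table; a quick re-derivation (no new data, just the arithmetic):

```python
# Re-derive the overall deltas from the average nDCG@10 scores.
captain, elasticsearch, gemini = 0.534, 0.386, 0.432
print(f"+{captain / gemini - 1:.0%} vs Gemini")                # +24% vs Gemini
print(f"+{captain / elasticsearch - 1:.0%} vs Elasticsearch")  # +38% vs Elasticsearch

# Empty-result rates for Gemini File Search, per dataset.
for name, empty, total in [("SciFact", 58, 300), ("NFCorpus", 32, 323), ("FiQA", 201, 648)]:
    print(f"{name}: {empty / total:.0%} of queries return nothing")
```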
BEIR Benchmark: Text Retrieval Quality
1,271 queries | 66,454 documents | 3 datasets | nDCG@10 with human relevance judgments
| | Captain (Production API) | Elasticsearch (v8, English analyzer) | Gemini File Search (v1 beta, Flash Lite) |
|---|---|---|---|
| **Retrieval Accuracy (nDCG@10)** | | | |
| Overall (3 datasets · 1,271 queries) | 0.534 (+24% vs Gemini) | 0.386 | 0.432 |
| SciFact (5,183 docs · 300 queries) | 0.745 (+9% vs Gemini) | 0.610 | 0.685 |
| NFCorpus (3,633 docs · 323 queries) | 0.375 (+31% vs Gemini) | 0.295 | 0.286 |
| FiQA (57,638 docs · 648 queries) | 0.481 (+48% vs Gemini) | 0.254 | 0.325 |
| **Reliability (queries that return results)** | | | |
| SciFact | 100% | 100% | 81% (58 of 300 return nothing) |
| NFCorpus | 100% | 100% | 90% (32 of 323 return nothing) |
| FiQA | 100% | 100% | 69% (201 of 648 return nothing) |
| **Indexing Speed** | | | |
| SciFact (5,183 documents) | ~3 min | ~10 s | 27 min |
| NFCorpus (3,633 documents) | ~2 min | ~7 s | 18 min |
| FiQA (57,638 documents) | ~20 min | ~60 s | 4.85 hrs |
Retrieval Accuracy by Dataset
nDCG@10 on BEIR benchmarks · higher is better
- SciFact (scientific fact verification · 5,183 docs · 300 queries): Captain 0.745 (+9%), Gemini File Search 0.685, Elasticsearch 0.610
- NFCorpus (medical research · 3,633 docs · 323 queries): Captain 0.375 (+31%), Elasticsearch 0.295, Gemini File Search 0.286
- FiQA (financial Q&A · 57,638 docs · 648 queries): Captain 0.481 (+48%), Gemini File Search 0.325, Elasticsearch 0.254