Evals & Benchmarks

Performance evaluations and benchmark results from Captain.

BEIR Text Retrieval Benchmark

April 2026

We ran Captain against Elasticsearch v8 and Gemini File Search on three BEIR datasets, scored by nDCG@10 with human relevance judgments.

  • SciFact, NFCorpus, and FiQA
  • 1,271 queries across 66,454 documents

Results

  • Captain averaged 0.534 nDCG@10 across the three datasets
  • +24% over Gemini File Search (0.432), +38% over Elasticsearch (0.386)
  • Gemini File Search returned nothing for 10-31% of queries depending on dataset
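The headline numbers can be checked against the per-dataset scores reported further down this page. The sketch below assumes the overall score is an unweighted mean of the three dataset scores. That is an inference from the arithmetic, not something the page states. The helper names are hypothetical.

```python
# Per-dataset nDCG@10 scores as reported on this page.
NDCG_AT_10 = {
    "Captain": {"SciFact": 0.745, "NFCorpus": 0.375, "FiQA": 0.481},
    "Elasticsearch": {"SciFact": 0.610, "NFCorpus": 0.295, "FiQA": 0.254},
    "Gemini File Search": {"SciFact": 0.685, "NFCorpus": 0.286, "FiQA": 0.325},
}


def macro_average(per_dataset):
    """Unweighted mean over datasets (assumed aggregation, see lead-in)."""
    return sum(per_dataset.values()) / len(per_dataset)


def relative_lift_pct(score, baseline):
    """Percentage improvement of score over baseline."""
    return (score - baseline) / baseline * 100


overall = {system: macro_average(scores) for system, scores in NDCG_AT_10.items()}

for system, value in overall.items():
    print(f"{system}: {value:.3f}")

for baseline in ("Gemini File Search", "Elasticsearch"):
    lift = relative_lift_pct(overall["Captain"], overall[baseline])
    print(f"Captain vs {baseline}: {lift:+.0f}%")
```

With these inputs the means come out to 0.534, 0.386, and 0.432, and the lifts round to +24% and +38%, matching the figures above.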

BEIR Benchmark: Text Retrieval Quality

1,271 queries  |  66,454 documents  |  3 datasets  |  nDCG@10 with human relevance judgments

Systems compared

  • Captain: Production API
  • Elasticsearch: v8, English analyzer
  • Gemini File Search: Gemini v1 beta, Flash Lite

Retrieval Accuracy (nDCG@10)

| Dataset | Captain | Elasticsearch | Gemini File Search | Captain vs Gemini |
|---|---|---|---|---|
| Overall (3 datasets · 1,271 queries) | 0.534 | 0.386 | 0.432 | +24% |
| SciFact (5,183 docs · 300 queries) | 0.745 | 0.610 | 0.685 | +9% |
| NFCorpus (3,633 docs · 323 queries) | 0.375 | 0.295 | 0.286 | +31% |
| FiQA (57,638 docs · 648 queries) | 0.481 | 0.254 | 0.325 | +48% |

Reliability (queries that return results)

| Dataset | Captain | Elasticsearch | Gemini File Search |
|---|---|---|---|
| SciFact | 100% | 100% | 81% (58 of 300 return nothing) |
| NFCorpus | 100% | 100% | 90% (32 of 323 return nothing) |
| FiQA | 100% | 100% | 69% (201 of 648 return nothing) |

Indexing Speed

| Dataset | Captain | Elasticsearch | Gemini File Search |
|---|---|---|---|
| SciFact (5,183 documents) | ~3 min | ~10s | 27 min |
| NFCorpus (3,633 documents) | ~2 min | ~7s | 18 min |
| FiQA (57,638 documents) | ~20 min | ~60s | 4.85 hrs |
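The Gemini File Search reliability percentages follow directly from the empty-result counts reported above. A short sketch of that arithmetic, with a hypothetical helper name:

```python
# (queries returning nothing, total queries) for Gemini File Search, as reported above.
EMPTY_RESULTS = {
    "SciFact": (58, 300),
    "NFCorpus": (32, 323),
    "FiQA": (201, 648),
}


def answered_pct(empty, total):
    """Percentage of queries that returned at least one result."""
    return (total - empty) / total * 100


for dataset, (empty, total) in EMPTY_RESULTS.items():
    answered = answered_pct(empty, total)
    print(f"{dataset}: {answered:.0f}% return results, {100 - answered:.0f}% return nothing")
```

Rounded to whole percentages, this gives 81%, 90%, and 69% of queries returning results, so the share returning nothing is about 19%, 10%, and 31% respectively.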

Retrieval Accuracy by Dataset

nDCG@10 on BEIR benchmarks · Higher is better

  • SciFact (Scientific fact verification · 5,183 docs · 300 queries): Captain 0.745 (+9% vs Gemini File Search), Gemini File Search 0.685, Elasticsearch 0.610
  • NFCorpus (Medical research · 3,633 docs · 323 queries): Captain 0.375 (+31% vs Gemini File Search), Gemini File Search 0.286, Elasticsearch 0.295
  • FiQA (Financial Q&A · 57,638 docs · 648 queries): Captain 0.481 (+48% vs Gemini File Search), Gemini File Search 0.325, Elasticsearch 0.254