Code for ‘Lost in OCR Translation?’: robust document retrieval under degradation. Compares OCR-based, vision-only, and hybrid pipelines; includes SambaNova LLaMA Vision OCR, Nougat, and ViDoRe baselines. Provides QA data generation, RAG evaluation, and metrics (Levenshtein, nDCG@k, Recall@k, EM/F1) with reproducible scripts. Includes dataset guides.
Stars: 2
Rank: 4137785
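For readers unfamiliar with the metrics named in the description, the following is a minimal sketch of how Levenshtein distance, Recall@k, nDCG@k, and EM/F1 are commonly computed. It is not taken from the repository: the function names, the binary-relevance assumption for nDCG, and the lowercase/whitespace normalization for EM and F1 are all illustrative assumptions, and the project's own scripts may differ.

```python
import math
from collections import Counter
from typing import Sequence, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k under binary relevance (assumption: gain is 1 or 0)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def exact_match(prediction: str, gold: str) -> bool:
    """EM after lowercasing and trimming (assumed normalization)."""
    return prediction.strip().lower() == gold.strip().lower()


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 over whitespace-split, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, Levenshtein distance would typically measure how far OCR output drifts from ground-truth text, Recall@k and nDCG@k would score the retrieval stage, and EM/F1 would score the generated answers.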