Code for ‘Lost in OCR Translation?’: robust document retrieval under degradation. Compares OCR-based, vision-only, and hybrid pipelines; includes SambaNova LLaMA Vision OCR, Nougat, and ViDoRe baselines. Provides QA data generation, RAG evaluation, and metrics (Levenshtein, nDCG@k, Recall@k, EM/F1) with reproducible scripts. Includes dataset guides.
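For readers unfamiliar with the metrics named above, here is a minimal Python sketch of three of them: Levenshtein distance (commonly used to score OCR output against ground truth), Recall@k, and nDCG@k with binary relevance. It is an illustration under stated assumptions, not the repository's own implementation; the function names `levenshtein`, `recall_at_k`, and `ndcg_at_k` are hypothetical, and the repository's scripts may normalize text or define relevance differently.

```python
import math
from typing import Sequence, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary relevance (an assumption; graded gains are also common)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, `levenshtein("kitten", "sitting")` returns 3, and a ranking that places the single relevant document first gives `ndcg_at_k` a value of 1.0.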