# FinanceBench-LLM: Domain-Adapted Financial QA
Built with NVIDIA NIM and NeMo Customizer (LoRA fine-tuning), and evaluated with LLM-as-a-Judge on the FinanceBench dataset.
Powered by NVIDIA NIM | NVIDIA DLI "Evaluation and Light Customization of LLMs" course workflow
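NeMo Customizer trains small low-rank adapters rather than updating the full model weights. As a rough illustration of the LoRA idea only (this is not the NeMo Customizer API, and the matrices below are toy values), the effective weight is the frozen base weight plus a scaled low-rank product:

```python
# LoRA: instead of updating the full weight matrix W (d_out x d_in),
# train two small matrices B (d_out x r) and A (r x d_in) with r << d,
# then serve W_eff = W + (alpha / r) * (B @ A).

def matmul(X, Y):
    """Plain-Python matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha):
    """Merge a LoRA update into a base weight matrix."""
    r = len(A)                  # rank = number of rows of A
    scale = alpha / r
    delta = matmul(B, A)        # (d_out x d_in) low-rank update
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# Toy example: d_out = d_in = 2, rank r = 1
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[0.5, 0.5]]                # r x d_in
B = [[2.0], [0.0]]              # d_out x r
W_eff = lora_effective_weight(W, A, B, alpha=1.0)
print(W_eff)  # → [[2.0, 1.0], [0.0, 1.0]]
```

Only A and B are trained (2·d·r parameters instead of d², for rank r much smaller than d), which is why the customization step is cheap enough to run per-domain.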
## Sample Questions

Query the LoRA fine-tuned financial QA model directly, or start from one of the example questions below; an SEC filing excerpt can optionally be supplied as context.
| Financial Question | Optional Context (SEC filing excerpt) |
|---|---|
## Full Evaluation: Base vs. ICL vs. LoRA Fine-tuned
| Model | Exact Match | F1 Score | Faithfulness | Correctness | Conciseness | Elo |
|---|---|---|---|---|---|---|
| Base (Llama-3.1-8B) | 0.23 | 0.41 | 3.2 / 5 | 2.8 / 5 | 3.5 / 5 | 835 |
| ICL (5-shot) | 0.34 | 0.56 | 3.9 / 5 | 3.6 / 5 | 3.8 / 5 | 1023 |
| LoRA Fine-tuned | 0.52 | 0.71 | 4.4 / 5 | 4.2 / 5 | 4.1 / 5 | 1142 |
### Key Findings
- LoRA fine-tuning achieves the best results on every metric (+126% Exact Match over base, 0.23 → 0.52)
- ICL (5-shot) delivers a sizable improvement at zero training cost (+48% Exact Match, 0.23 → 0.34)
- Correctness (2.8 → 4.2) and Faithfulness (3.2 → 4.4) show the largest judge-score gaps between the base and LoRA models
- Elo ratings from 1,000 pairwise comparisons confirm the ordering LoRA > ICL > Base
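The Elo ranking above works like chess ratings: each pairwise judge verdict nudges the two models' scores toward the observed outcome. A minimal sketch with K-factor 32; the starting ratings and the three outcomes below are hypothetical, not the project's actual comparison log:

```python
K = 32  # K-factor: maximum rating change per comparison

def expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a):
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    e_a = expected(r_a, r_b)
    return (r_a + K * (score_a - e_a),
            r_b + K * ((1.0 - score_a) - (1.0 - e_a)))

# Hypothetical comparison outcomes (winner, loser); all models start at 1000.
ratings = {"base": 1000.0, "icl": 1000.0, "lora": 1000.0}
for winner, loser in [("lora", "base"), ("lora", "icl"), ("icl", "base")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)
print(ratings)  # lora ends highest, base lowest
```

With K = 32, an upset win against an evenly rated opponent moves each rating by 16 points; over many comparisons the ratings converge to a stable ordering.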
### Methodology
- Automated metrics: Exact Match and token-level F1 (SQuAD-style answer matching)
- LLM-as-a-Judge: Llama-3.1-70B evaluates correctness, faithfulness, and conciseness (1-5 scale)
- Elo ranking: pairwise comparisons derived from judge scores, updated with K-factor = 32
- Dataset: PatronusAI/financebench (150+ real 10-K/10-Q QA pairs)
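The exact answer normalization used for Exact Match and token-level F1 isn't specified in this README, so the sketch below assumes a common SQuAD-style scheme (lowercasing, light punctuation stripping, whitespace tokenization), keeping `$`, `.`, and `%` since financial answers are often dollar figures and percentages:

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation except $ . % (common in financial answers),
    # and tokenize on whitespace. This normalization is an assumption.
    return re.sub(r"[^a-z0-9$.%\s]", " ", text.lower()).split()

def exact_match(pred, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)          # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Net revenue was $14.2 billion", "$14.2 billion"))  # ≈ 0.571
```

Token-level F1 gives partial credit when a verbose prediction contains the gold answer, which is why the F1 column in the table above sits well above Exact Match for every model.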
## Side-by-Side: Base vs. ICL vs. LoRA
Enter a question to see how each model configuration responds. Pre-cached comparisons are available for sample questions.
Built with: NVIDIA NIM | NeMo Customizer | Hugging Face Transformers + PEFT | GitHub | NVIDIA DLI Course