convfinqa-qwen3.5-4b-lora

LoRA adapter for Qwen/Qwen3.5-4B, fine-tuned on the ConvFinQA train split to answer conversational, multi-turn numerical questions over single-page financial documents (a 10-K page consisting of pre-text, a table, and post-text).

Trained on Tinker via tinker-cookbook.

A merged single-artefact version is at sharick008/convfinqa-qwen3.5-4b. Use that one if you would rather not deal with PEFT.

Result

Execution accuracy on the 421-record ConvFinQA dev split, graded turn-by-turn against the dataset's gold executed answers:

| Metric | Value |
|---|---|
| Records | 421 / 421 (zero failures) |
| Turns graded | 1,490 |
| Turns correct | 1,227 |
| Execution accuracy | 82.35% |
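
A turn counts as correct when the value passed to submit_answer matches the record's gold executed answer. A minimal sketch of that comparison, assuming a small relative tolerance for numeric answers and case-insensitive matching for yes/no answers (the exact tolerance used by the harness is an assumption here, not documented above):

def turn_is_correct(predicted, gold, rel_tol=1e-2):
    # Yes/no answers: case-insensitive string match.
    if isinstance(gold, str):
        return str(predicted).strip().lower() == gold.strip().lower()
    # Numeric answers: relative tolerance, with a floor so gold values near
    # zero do not make the comparison impossibly strict.
    try:
        return abs(float(predicted) - float(gold)) <= rel_tol * max(abs(float(gold)), 1.0)
    except (TypeError, ValueError):
        return False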

For reference, the same agent harness on the base Qwen/Qwen3.5-4B (no fine-tune, with the calculator and submit_answer tool spec injected via the renderer's tools-prefix helper) reaches 69.56% on the same dev split (1,026 / 1,475 turns; the bare base model fails to emit a parseable submit_answer on 15 turns, which the fine-tune fixes). The fine-tune adds +12.79 points through identical inference plumbing.

Training config

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Method | LoRA SFT (Tinker, qwen3_5 renderer) |
| LoRA rank | 32 |
| Learning rate | 3e-4, cosine decay |
| Batch size | 64 |
| Epochs | 3 |
| Optimisation steps | 837 |
| Max sequence length | 16,384 |
| Train rows | 18,078 per-target rows from 3,002 ConvFinQA train conversations |
| Held-out test | 200 rows |
| Final train mean NLL | 0.0002 |
| Final test mean NLL | 0.010 |
| Train-on-what | Last assistant message per row |

Each training row is a prefix of one full multi-turn dialogue, ending at one assistant target message. Each question turn contributes two rows (one for each assistant message: the calculate tool call and the submit_answer tool call), so an N-turn dialogue produces 2N rows.

The qwen3.5 renderer does not satisfy the extension property, so training with ALL_ASSISTANT_MESSAGES over a multi-turn dialogue puts loss on prefix tokens that do not match what build_generation_prompt would produce at that point. Splitting into per-target rows and switching to LAST_ASSISTANT_MESSAGE gives every loss-bearing sample the same on-policy prefix the model would actually see at inference.
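
A minimal sketch of that per-target split over a plain list of chat messages (illustrative only; the actual rows are built with tinker-cookbook's datum builders, whose API is not shown here):

def split_into_target_rows(messages):
    # One row per assistant message: the full dialogue prefix up to and
    # including that message. Training with LAST_ASSISTANT_MESSAGE then puts
    # loss only on the final (assistant) message of each row.
    rows = []
    for i, message in enumerate(messages):
        if message["role"] == "assistant":
            rows.append(messages[: i + 1])
    return rows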

Assistant tool calls are synthesised from the dataset's gold turn_program. Multi-step DSL programs like subtract(206588, 181001), divide(#0, 181001) fold into a single Python expression ((206588 - 181001) / 181001) emitted as one calculate call, followed by a submit_answer call carrying the gold value and an inferred unit.
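
A sketch of that folding step, assuming the gold program arrives as an ordered list of DSL steps and that only the four arithmetic ops appear (greater(...) records are skipped, as discussed under evaluation below; the function name is illustrative):

OPS = {"add": "+", "subtract": "-", "multiply": "*", "divide": "/"}

def fold_turn_program(steps):
    # e.g. ["subtract(206588, 181001)", "divide(#0, 181001)"]
    #   -> "((206588 - 181001) / 181001)"
    results = []
    for step in steps:
        name, args = step.split("(", 1)
        a, b = (s.strip() for s in args.rstrip(")").split(","))
        # "#k" refers back to the result of step k; inline its expression.
        a = results[int(a[1:])] if a.startswith("#") else a
        b = results[int(b[1:])] if b.startswith("#") else b
        results.append(f"({a} {OPS[name.strip()]} {b})")
    return results[-1]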

Usage

This is not a vanilla chat model. It expects to be driven by an agentic loop with two tools:

  • calculate(expression): evaluate a Python arithmetic expression (+ - * / **).
  • submit_answer(value, unit): emit the final answer for the current question. unit is one of fraction, percent, absolute, count, yes_no.

It produces Qwen3.5 XML tool-call syntax (<tool_call><function=name><parameter=...>...</parameter></function></tool_call>).
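
A minimal sketch of the parsing side of such a loop, assuming the XML shape above (the regexes and function name are illustrative, not the released harness):

import re

CALL_RE = re.compile(r"<function=(\w+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=(\w+)>\s*(.*?)\s*</parameter>", re.DOTALL)

def parse_tool_call(completion):
    # Returns (tool_name, {param: value}) or None if no tool call was emitted.
    match = CALL_RE.search(completion)
    if match is None:
        return None
    name, body = match.groups()
    return name, dict(PARAM_RE.findall(body))

In the loop, a calculate call is evaluated and its result appended as a tool message before re-prompting; a submit_answer call ends the current question turn.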

Loading the adapter directly with PEFT

from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoTokenizer

base_id = "Qwen/Qwen3.5-4B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForImageTextToText.from_pretrained(
    base_id, dtype="bfloat16", device_map="auto"
)
model = PeftModel.from_pretrained(base, "sharick008/convfinqa-qwen3.5-4b-lora")
model.eval()

Qwen/Qwen3.5-4B is an image-text-to-text model (its config dispatches to Qwen3_5ForConditionalGeneration). For ConvFinQA we only ever send text, but you must use AutoModelForImageTextToText (not AutoModelForCausalLM) to load the multimodal config without partial-weight warnings.
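
A single-turn generation sketch with the adapter loaded as above. This is a rough sketch only: the system prompt is abbreviated, and the tool definitions must be included in the prompt in the same form the training-time renderer injected them (see the worked example below for the prompt shape):

messages = [
    {"role": "system", "content": "You are a financial analyst ... <document>...</document>"},
    {"role": "user", "content": "what was the total cash paid for interest in the years of 2015 and 2016, combined?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
# Expected: a <tool_call> block invoking calculate, as in the worked example below.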

Loading 4-bit with bitsandbytes (~6 GB VRAM)

from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoTokenizer, BitsAndBytesConfig

base_id = "Qwen/Qwen3.5-4B"
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForImageTextToText.from_pretrained(
    base_id, quantization_config=quant, device_map="auto"
)
model = PeftModel.from_pretrained(base, "sharick008/convfinqa-qwen3.5-4b-lora")

Serving via vLLM

vllm serve Qwen/Qwen3.5-4B \
    --enable-lora \
    --max-lora-rank 32 \
    --lora-modules convfinqa=sharick008/convfinqa-qwen3.5-4b-lora

--max-lora-rank 32 is required: vLLM's default cap is 16. Then call the OpenAI-compatible endpoint with model="convfinqa".
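
A matching client call against the OpenAI-compatible endpoint (base URL and API key are vLLM's local defaults; the prompt is abbreviated as above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="convfinqa",  # the name registered via --lora-modules
    messages=[
        {"role": "system", "content": "You are a financial analyst ... <document>...</document>"},
        {"role": "user", "content": "what was the total debt in 2016?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)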

Worked example: one ConvFinQA record

Dev record Single_APD/2016/page_96.pdf-1. Three turns of conversational arithmetic over a debt summary table.

System message (truncated; the full prompt is ~80 lines of instructions about answer units, table-scale handling, and worked examples, then the document):

You are a financial analyst reading one page from a 10-K annual report filed
with the U.S. Securities and Exchange Commission. Your job is to answer
numerical questions about this page.

[... unit instructions and worked examples ...]

## Document
<document>
<pre_text>
15 . debt the tables below summarize our outstanding debt at 30 september
2016 and 2015 : total debt .
</pre_text>
<table>
| | 2016 | 2015 |
|---|---|---|
| current portion of long-term debt | 371.3 | 435.6 |
| long-term debt | 4918.1 | 3949.1 |
| total debt | 6225.2 | 5879.0 |
| bank obligations | 133.1 | 234.3 |
| commercial paper | 802.7 | 1260.0 |
| total short-term borrowings | 935.8 | 1494.3 |
</table>
<post_text>
the weighted average interest rate of short-term borrowings outstanding at
30 september 2016 and 2015 was 1.1% and 0.8% respectively. cash paid for
interest, net of amounts capitalized, was $121.1 in 2016, $97.5 in 2015,
and $132.4 in 2014.
</post_text>
</document>

Turn 1. User: "what was the total cash paid for interest in the years of 2015 and 2016, combined?"

Model emits:

<tool_call>
<function=calculate>
<parameter=expression>
(121.1 + 97.5)
</parameter>
</function>
</tool_call>

Tool returns 218.6. Model then emits:

<tool_call>
<function=submit_answer>
<parameter=value>
218.6
</parameter>
<parameter=unit>
absolute
</parameter>
</function>
</tool_call>

Turn 2. User: "including the year of 2014, what then becomes this total?" Model: calculate("(132.4 + (121.1 + 97.5))") → 351.0 → submit_answer(value=351.0, unit="absolute").

Turn 3. User: "and what was the average cash paid for interest between those three years?" Model: calculate("(((121.1 + 97.5) + 132.4) / 3)") → 117.0 → submit_answer(value=117.0, unit="absolute").

Gold executed answers: [218.6, 351.0, 117.0]. All three correct.

Evaluation: where errors come from

Per-turn-position accuracy is flat between 75% and 81% across positions 0 to 4, so multi-turn co-reference is not the bottleneck. Compared to the un-tuned base model on the same harness, fine-tuning specifically fixes:

  • Sign errors. The base model occasionally drops a negative sign on subtractions; the fine-tune does not.
  • Table-scale errors. Many tables are stated "(in millions)". The base model sometimes multiplies out (e.g. predicting 932,000,000 when the table cell reads 932 and gold is 932); the fine-tune handles these consistently.

A deliberate trade-off shows up on yes/no questions:

  • Boolean comparisons. 35 of 3,037 ConvFinQA train records (1.1%) use the greater(a, b) op in their gold programs to answer "is X higher than Y?"-style questions. Our calculator tool exposes only arithmetic, so at SFT-build time those 35 records were skipped rather than expanding the tool surface for ~1% of the data. The cost is visible on the dev split: on a small handful of yes/no questions the base model's general boolean reasoning answers correctly, while the fine-tune leans harder on its newly strengthened "emit a number" prior and misses them. Closing this is a mechanical change, listed in the follow-up work below.

What the fine-tune did not fix:

  • Wrong cell selection from the table remains the largest residual error class. The model occasionally pulls the cell next to the right one, or picks the wrong year column.
  • Inverted divisions on a small number of "what fraction of X is Y" turns.

Limitations

  • English only. All training data is English-language US 10-K excerpts.
  • Single-page documents only. Each ConvFinQA record contains exactly one page of context. The model has not been trained to retrieve from longer documents.
  • Numerical reasoning over tables and short prose. It will not generalise well to free-form financial commentary, multi-document synthesis, or domains outside US-equity 10-Ks.

Suggested follow-up work

  1. Cell-grounding in SFT traces. Current assistant turns go straight from the user question to calculate(literal_numbers). The literal numbers come from the table or post-text, but the trace never says where. Adding a one-sentence assistant message before the calculate call ("Reading 2014 cash paid for interest = 132.4 from the post-text") would give the model a place to ground its cell reads and should attack the dominant failure mode directly.
  2. Scale-aware data augmentation. Rewrite a subset of train traces so the assistant explicitly reads the "(in millions)" caption before quoting a cell. The current SFT data gives the model no demonstrated example of checking scale captions, only an instruction in the system prompt.
  3. Restore the 35 boolean train records. Extend the calculator to accept Python comparison operators (>, <, >=, <=), returning "yes" / "no" strings instead of Python True / False, and fold the dataset's greater(a, b) programs into (a > b) at SFT-build time (a minimal sketch follows this list). This closes the yes/no trade-off described in the evaluation section above.
  4. DPO on cell-selection mistakes. Build preference pairs from a dev run: chosen = a re-prompted trace that produces the correct executed answer; rejected = the original incorrect trace. Bias the model away from the off-by-one cell errors that dominate the failure mix.
  5. Verifier pass. Add a second-pass call that, given the question, the document, and the proposed answer, judges whether the answer is plausible and either accepts or asks the agent to retry. The cost is manageable (one extra short call per turn) and would absorb a chunk of the cell-selection misses.
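
For item 3, a minimal sketch of the comparison-aware calculator, built on a restricted AST walk rather than raw eval (names are illustrative; the released tool exposes only the arithmetic subset):

import ast
import operator

ARITH = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
         ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}
COMPARE = {ast.Gt: operator.gt, ast.Lt: operator.lt,
           ast.GtE: operator.ge, ast.LtE: operator.le}

def calculate(expression):
    # Arithmetic as before; single comparisons return "yes" / "no" strings so
    # the agent can submit them directly, matching greater(a, b) programs
    # rewritten as "(a > b)" at SFT-build time.
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ARITH:
            return ARITH[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in ARITH:
            return ARITH[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Compare) and len(node.ops) == 1 and type(node.ops[0]) in COMPARE:
            result = COMPARE[type(node.ops[0])](walk(node.left), walk(node.comparators[0]))
            return "yes" if result else "no"
        raise ValueError(f"unsupported expression: {ast.dump(node)}")
    return walk(ast.parse(expression, mode="eval"))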

Citation

ConvFinQA dataset:

Chen, Z., Li, S., Smiley, C., Ma, Z., Shah, S., & Wang, W. Y. (2022). ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. EMNLP.

This adapter is open-weight under Apache-2.0.

Framework versions

  • tinker-cookbook: 0.3.0
  • transformers: 5.6.2
  • peft: 0.18.0
  • torch: 2.11.0