aaron1141 committed
Commit e783436 (0 parents)

initial hf spaces demo
.gitignore ADDED
Binary file (126 Bytes).
 
LICENSE ADDED
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

   "License" shall mean the terms and conditions for use, reproduction,
   and distribution as defined by Sections 1 through 9 of this document.

   "Licensor" shall mean the copyright owner or entity authorized by
   the copyright owner that is granting the License.

   "Legal Entity" shall mean the union of the acting entity and all
   other entities that control, are controlled by, or are under common
   control with that entity. For the purposes of this definition,
   "control" means (i) the power, direct or indirect, to cause the
   direction or management of such entity, whether by contract or
   otherwise, or (ii) ownership of fifty percent (50%) or more of the
   outstanding shares, or (iii) beneficial ownership of such entity.

   "You" (or "Your") shall mean an individual or Legal Entity
   exercising permissions granted by this License.

   "Source" form shall mean the preferred form for making modifications,
   including but not limited to software source code, documentation
   source, and configuration files.

   "Object" form shall mean any form resulting from mechanical
   transformation or translation of a Source form, including but
   not limited to compiled object code, generated documentation,
   and conversions to other media types.

   "Work" shall mean the work of authorship, whether in Source or
   Object form, made available under the License, as indicated by a
   copyright notice that is included in or attached to the work
   (an example is provided in the Appendix below).

   "Derivative Works" shall mean any work, whether in Source or Object
   form, that is based on (or derived from) the Work and for which the
   editorial revisions, annotations, elaborations, or other modifications
   represent, as a whole, an original work of authorship. For the purposes
   of this License, Derivative Works shall not include works that remain
   separable from, or merely link (or bind by name) to the interfaces of,
   the Work and Derivative Works thereof.

   "Contribution" shall mean any work of authorship, including
   the original version of the Work and any modifications or additions
   to that Work or Derivative Works thereof, that is intentionally
   submitted to Licensor for inclusion in the Work by the copyright owner
   or by an individual or Legal Entity authorized to submit on behalf of
   the copyright owner. For the purposes of this definition, "submitted"
   means any form of electronic, verbal, or written communication sent
   to the Licensor or its representatives, including but not limited to
   communication on electronic mailing lists, source code control systems,
   and issue tracking systems that are managed by, or on behalf of, the
   Licensor for the purpose of discussing and improving the Work, but
   excluding communication that is conspicuously marked or otherwise
   designated in writing by the copyright owner as "Not a Contribution."

   "Contributor" shall mean Licensor and any individual or Legal Entity
   on behalf of whom a Contribution has been received by Licensor and
   subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
   this License, each Contributor hereby grants to You a perpetual,
   worldwide, non-exclusive, no-charge, royalty-free, irrevocable
   copyright license to reproduce, prepare Derivative Works of,
   publicly display, publicly perform, sublicense, and distribute the
   Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
   this License, each Contributor hereby grants to You a perpetual,
   worldwide, non-exclusive, no-charge, royalty-free, irrevocable
   (except as stated in this section) patent license to make, have made,
   use, offer to sell, sell, import, and otherwise transfer the Work,
   where such license applies only to those patent claims licensable
   by such Contributor that are necessarily infringed by their
   Contribution(s) alone or by combination of their Contribution(s)
   with the Work to which such Contribution(s) was submitted. If You
   institute patent litigation against any entity (including a
   cross-claim or counterclaim in a lawsuit) alleging that the Work
   or a Contribution incorporated within the Work constitutes direct
   or contributory patent infringement, then any patent licenses
   granted to You under this License for that Work shall terminate
   as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
   Work or Derivative Works thereof in any medium, with or without
   modifications, and in Source or Object form, provided that You
   meet the following conditions:

   (a) You must give any other recipients of the Work or
       Derivative Works a copy of this License; and

   (b) You must cause any modified files to carry prominent notices
       stating that You changed the files; and

   (c) You must retain, in the Source form of any Derivative Works
       that You distribute, all copyright, patent, trademark, and
       attribution notices from the Source form of the Work,
       excluding those notices that do not pertain to any part of
       the Derivative Works; and

   (d) If the Work includes a "NOTICE" text file as part of its
       distribution, then any Derivative Works that You distribute
       must include a readable copy of the attribution notices
       contained within such NOTICE file, excluding those notices
       that do not pertain to any part of the Derivative Works, in
       at least one of the following places: within a NOTICE text
       file distributed as part of the Derivative Works; within
       the Source form or documentation, if provided along with the
       Derivative Works; or, within a display generated by the
       Derivative Works, if and wherever such third-party notices
       normally appear. The contents of the NOTICE file are for
       informational purposes only and do not modify the License.
       You may add Your own attribution notices within Derivative
       Works that You distribute, alongside or as an addendum to
       the NOTICE text from the Work, provided that such additional
       attribution notices cannot be construed as modifying the License.

   You may add Your own copyright statement to Your modifications and
   may provide additional or different license terms and conditions
   for use, reproduction, or distribution of Your modifications, or
   for any such Derivative Works as a whole, provided Your use,
   reproduction, and distribution of the Work otherwise complies with
   the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
   any Contribution intentionally submitted for inclusion in the Work
   by You to the Licensor shall be under the terms and conditions of
   this License, without any additional terms or conditions.
   Notwithstanding the above, nothing herein shall supersede or modify
   the terms of any separate license agreement you may have executed
   with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
   names, trademarks, service marks, or product names of the Licensor,
   except as required for reasonable and customary use in describing the
   origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
   agreed to in writing, Licensor provides the Work (and each
   Contributor provides its Contributions) on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
   implied, including, without limitation, any warranties or conditions
   of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
   PARTICULAR PURPOSE. You are solely responsible for determining the
   appropriateness of using or redistributing the Work and assume any
   risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
   whether in tort (including negligence), contract, or otherwise,
   unless required by applicable law (such as deliberate and grossly
   negligent acts) or agreed to in writing, shall any Contributor be
   liable to You for damages, including any direct, indirect, special,
   incidental, or consequential damages of any character arising as a
   result of this License or out of the use or inability to use the
   Work (including but not limited to damages for loss of goodwill,
   work stoppage, computer failure or malfunction, or any and all
   other commercial damages or losses), even if such Contributor
   has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
   the Work or Derivative Works thereof, You may choose to offer,
   and charge a fee for, acceptance of support, warranty, indemnity,
   or other liability obligations and/or rights consistent with this
   License. However, in accepting such obligations, You may act only
   on Your own behalf and on Your sole responsibility, not on behalf
   of any other Contributor, and only if You agree to indemnify,
   defend, and hold each Contributor harmless for any liability
   incurred by, or claims asserted against, such Contributor by reason
   of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS
README.md ADDED
---
title: DataFlow-VQA Data Curation
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---

# DataFlow-VQA

**[中文文档](README_zh.md)**

A pipeline for extracting, curating, and generating chain-of-thought (CoT) data from PDF textbooks and exam papers.

[🤗Dataset](https://huggingface.co/datasets/OpenDCAI/FlipVQA)

## Overview

![DataFlow-VQA overview](static/overview_2.png)

DataFlow-VQA processes PDF documents through three sequential stages:

- Stage 1 (**Section 3.1: VQA Extraction**): Parses PDFs using [MinerU](https://github.com/opendatalab/MinerU) for document layout analysis, then uses an LLM to extract structured question-answer pairs with images.
- Stage 2 (**Section 3.2.1 to Section 3.2.5: Data Curation**): Filters and cleans the extracted QA pairs — splits sub-questions, classifies question types, extracts concise answers, and removes low-quality items.
- Stage 3 (**Section 3.2.6: CoT Generation**): Generates chain-of-thought reasoning via reject sampling — an LLM generates answers, which are verified against the ground truth, and incorrect ones are retried.

## Installation

This project is built on top of [DataFlow](https://github.com/OpenDCAI/DataFlow). Clone and install it first:

```shell
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[pdf2vqa]"
```

Then clone this repository:

```shell
git clone <this-repo-url>
cd DataFlow-VQA
```

## Configuration

### API Keys

Two API keys are required:

- `DF_API_KEY`: API key for the LLM service (OpenAI, Google Gemini, DeepSeek, etc.)
- `MINERU_API_KEY`: API key for [MinerU](https://mineru.net/apiManage/token) document layout parsing

```shell
export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"
```

### LLM Endpoint

Each pipeline accepts `--api_url` and `--model` arguments. Any [OpenAI-compatible API](https://platform.openai.com/docs/api-reference) endpoint is supported, including OpenAI, Google Gemini (via proxy), DeepSeek, and others.

Provide the **base URL** without `/chat/completions` (e.g. `https://api.openai.com/v1`).

---

## Stage 1: VQA Extraction

### Input Format

Create a JSONL file where each line describes one PDF extraction task:

```jsonl
{"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}
```

- `input_pdf_paths`: A single PDF (questions and answers interleaved) or a list of two or more PDFs (question PDFs before answer PDFs).
- `name`: A unique identifier for this task (used for directory naming and caching).
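If you have many PDFs, the input file can be generated programmatically. A minimal sketch, reusing the example tasks above (the paths and names are illustrative):

```python
import json

# Each entry becomes one line of the Stage 1 input JSONL.
tasks = [
    {"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"},
    {"input_pdf_paths": ["./examples/VQA/math_question.pdf",
                         "./examples/VQA/math_answer.pdf"], "name": "math2"},
]

with open("vqa_extract_test.jsonl", "w", encoding="utf-8") as f:
    for task in tasks:
        f.write(json.dumps(task, ensure_ascii=False) + "\n")
```

Keep `name` unique across tasks, since it also names the per-task cache directory.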
### Run

```bash
python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro
```

**Important:** We recommend a strong model here. Weaker models such as `gpt-5-mini` may perform poorly.

### Output

- `{output_dir}/raw_vqa.jsonl`: Extracted QA pairs with image references
- `{output_dir}/{name}/vqa_images/`: Extracted images
- `cache/{name}/extracted_vqa.jsonl`, `merged_qa_pairs.jsonl`, `merged_qa_pairs.md`: Per-task intermediate files

Each QA item contains:

```json
{
  "question": "Compute $x$ such that $x^2 - 1 = 0$.",
  "answer": "$x = 1$ or $x = -1$",
  "solution": "Factor as $(x-1)(x+1)=0$.",
  "label": 1,
  "question_chapter_title": "Chapter 1: Quadratic Equations",
  "answer_chapter_title": "Chapter 1: Quadratic Equations",
  "image_basedir": "/path/to/your/images"
}
```
+ ### Note
117
+
118
+ **We also support using a local MinerU deployment**: Replace `FileOrURLToMarkdownConverterAPI` with `FileOrURLToMarkdownConverterLocal` or `FileOrURLToMarkdownConverterFlash` in `pipelines/vqa_extract_optimized_pipeline.py`:
119
+
120
+ ```python
121
+ # Original opendatalab local version
122
+ self.mineru_executor = FileOrURLToMarkdownConverterLocal(
123
+ intermediate_dir="intermediate",
124
+ mineru_model_path="path/to/mineru/model",
125
+ )
126
+
127
+ # Accelerated version (Flash)
128
+ self.mineru_executor = FileOrURLToMarkdownConverterFlash(
129
+ intermediate_dir="intermediate",
130
+ mineru_model_path="path/to/mineru/model",
131
+ batch_size=4,
132
+ replicas=1,
133
+ num_gpus_per_replica=1,
134
+ engine_gpu_util_rate_to_ray_cap=0.9,
135
+ )
136
+ ```
137
+
138
+ See [DataFlow's MinerU operators](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/mineru_operators.py) for full parameter documentation.
139
+
140
+ <details>
141
+ <summary>Pipeline details</summary>
142
+
143
+ The extraction pipeline runs six steps:
144
+
145
+ 1. **PDF Merging** (`PDF_Merger`): If multiple PDFs are provided, merges them into one.
146
+ 2. **Document Layout Parsing** (`FileOrURLToMarkdownConverterAPI`): Calls the MinerU API to produce structured JSON layout tokens and page images.
147
+ 3. **Layout Preprocessing** (`MinerU2LLMInputOperator`): Flattens list items and re-indexes IDs to prepare LLM-ready input.
148
+ 4. **LLM Extraction** (`ChunkedPromptedGenerator`): Chunks the layout JSON (max 128k tokens per chunk) and calls the LLM with `QAExtractPrompt` to extract QA pairs as structured XML.
149
+ 5. **Output Parsing** (`LLMOutputParser`): Parses the XML response into JSONL and copies images to `vqa_images/`.
150
+ 6. **QA Merging** (`QA_Merger`): For separated question/answer PDFs, matches question and answer blocks by chapter title and question number.
151
+ This operator includes a `strict_title_match` parameter: When set to True, the operator performs an exact string match on chapter titles. Otherwise, the operator attempts to extract Chinese or English sequence numbers from the titles for matching.
152
+
153
+ </details>
154
+
---

## Stage 2: Data Curation

```bash
python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini
```

Output is saved as `curated_vqa.jsonl` in the same directory as `--input_file`.

<details>
<summary>Pipeline details</summary>

Four sequential steps:

**1. Sub-question Splitting**

Questions with multiple independent parts (e.g. (a), (b), (c)) are split into separate items. Each sub-question is paired with its corresponding sub-answer and sub-solution. Items where the question is empty, or where both the answer and the solution are empty, are discarded.

Sub-questions that are context-sensitive (e.g. (b) uses the result of (a)) are not split into separate items.

Adds field: `split_qa`

**2. Question Type Classification**

Each question is classified as one of: `Calculation`, `Proof`, `Explanation`, `Fill-in`, `Multiple-choice`, `Sketching`, `Other`.

By default, only `Calculation`, `Fill-in`, and `Multiple-choice` are retained. To change this, edit the `filter_rules` list in `DataCurationPipeline.__init__`.

Adds fields: `type`, `type_reason`

**3. Answer Extraction**

Extracts a concise final answer from the `solution` field and writes it to `answer`. Items that already have a non-empty `answer` are skipped (set `overwrite=True` in `AnswerExtractionOperator` to override).

**4. QA Filtering**

Removes items based on the following criteria:

- The question must pose a clear, specific problem suitable for an exam. Examples, statements without questions, and open-ended discussions are rejected.
- The answer must directly address the question.
- The question and answer must be self-contained, without relying on external references or omitted context.

Adds fields: `filter_result`, `filter_reason`

</details>

---

## Stage 3: Generate CoT

The answer model and judge model can use different API endpoints and API keys, which is useful when the answer model is a self-hosted open-source VLM (e.g. Qwen3-VL served via vLLM) and the judge model is a commercial API.

Use `--answer_api_key_env` / `--judge_api_key_env` to specify which environment variable holds the API key for each model (default: `DF_API_KEY` for both).

```bash
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx"  # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY
```

Output is saved as `curated_vqa_with_cot.jsonl` in the same directory as `--input_file`.

<details>
<summary>Pipeline details</summary>

Uses reject sampling over up to `max_retries` rounds:

**1. Answer Generation** (`VQAReasoningAnswerGenerator`)

The LLM generates a step-by-step answer, stored in `generated_cot`. Set `skip_text_only=True` in `RejectSamplingPipeline` to process only VQA items (questions containing images); set it to `False` to process all items.

**2. Thinking Cleanup**

Strips `<think>...</think>` content from the generated answer to reduce verification cost. The cleaned answer is stored in `llm_short_answer`. This assumes the model outputs `<think>THINK</think>ANSWER` or `THINK</think>ANSWER`.
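Under that output convention, the cleanup amounts to keeping only what follows the last closing think tag. A minimal sketch of the idea (not the repository's exact implementation):

```python
import re

def strip_thinking(text: str) -> str:
    """Drop `<think>...</think>` reasoning and return only the answer part.

    Handles both `<think>THINK</think>ANSWER` and `THINK</think>ANSWER`.
    """
    if "</think>" in text:
        # Keep only what follows the last closing tag.
        return text.rsplit("</think>", 1)[1].strip()
    # No closing tag: remove any unterminated opening tag and its content.
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()
```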
**3. Answer Verification** (`BenchDatasetEvaluatorQuestion`)

Compares `llm_short_answer` against the ground-truth `answer` using semantic LLM evaluation (with a 5% numerical tolerance). Items that pass are marked `answer_match_result = True` and skipped in subsequent rounds.
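For purely numerical answers, the 5% tolerance can be pictured as a relative-error check. A hedged sketch for intuition only (the actual judging is done by an LLM):

```python
def within_tolerance(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Relative-error check: accept if the prediction is within
    rel_tol (default 5%) of the reference value."""
    if reference == 0.0:
        # Relative error is undefined at zero; fall back to an absolute check.
        return abs(predicted) <= rel_tol
    return abs(predicted - reference) / abs(reference) <= rel_tol
```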
Set `support_subquestions=True` to evaluate each sub-question independently; `answer_match_result` is `False` if any sub-question is wrong.

Evaluation statistics (overall accuracy, sub-question accuracy) are saved to `./cot_cache/eval_results.jsonl`:

```json
{
  "total_samples": 23584,
  "matched_samples": 12281,
  "accuracy": 0.521,
  "total_subquestions": 26380,
  "correct_subquestions": 13807,
  "subquestion_accuracy": 0.523
}
```

</details>

---

## Examples

Sample PDFs and input JSONL are provided in `examples/VQA/`:

```
examples/VQA/
├── vqa_extract_test.jsonl       # Example input for Stage 1
├── questionextract_test.pdf     # Single PDF with interleaved Q&A
├── math_question.pdf            # Questions PDF (for separated Q&A demo)
└── math_answer.pdf              # Answers PDF (for separated Q&A demo)
```

To run the full pipeline on the examples:

```bash
# Stage 1: Extract
python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro

# Stage 2: Curate
python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini

# Stage 3: Generate CoT
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx"  # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY
```

## Note

The implementation in this repository is intended only for running a small-scale demo. If you wish to run the pipeline on a large number of books, you will likely need the [Checkpoint Resume](https://opendcai.github.io/DataFlow-Doc/en/guide/resume/) and [Batched Inference](https://opendcai.github.io/DataFlow-Doc/en/guide/batch/) features.

## License

This project is licensed under the [Apache License 2.0](LICENSE).
README_zh.md ADDED
# DataFlow-VQA

从 PDF 教材和试卷中提取、清洗、生成思维链(CoT)数据的流水线工具。

[🤗数据集](https://huggingface.co/datasets/OpenDCAI/FlipVQA)

## 概览

![DataFlow-VQA overview](static/overview_2.png)

DataFlow-VQA 通过三个顺序阶段处理 PDF 文件:

- 第一步(**Section 3.1:VQA 抽取**):使用 [MinerU](https://github.com/opendatalab/MinerU) 进行文档版面分析,再用 LLM 从中抽取带图片的结构化问答对。
- 第二步(**Section 3.2.1 到 Section 3.2.5:数据清洗**):对抽取到的问答对进行过滤和清洗——拆分小题、判断题型、抽取简洁答案、去除低质量内容。
- 第三步(**Section 3.2.6:生成 CoT**):通过 Reject Sampling 生成思维链——LLM 生成回答,与标准答案核对,答错的重新生成。

## 安装

本项目基于 [DataFlow](https://github.com/OpenDCAI/DataFlow),请先 clone 并安装:

```shell
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[pdf2vqa]"
```

然后 clone 本仓库:

```shell
git clone <this-repo-url>
cd DataFlow-VQA
```

## 配置

### API 密钥

需要两个 API Key:

- `DF_API_KEY`:LLM 服务的 API Key(OpenAI、Google Gemini、DeepSeek 等均可)
- `MINERU_API_KEY`:[MinerU](https://mineru.net/apiManage/token) 文档版面解析的 API Key

```shell
export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"
```

### LLM 端点

每个 pipeline 均支持 `--api_url` 和 `--model` 参数,可兼容任何 [OpenAI 兼容接口](https://platform.openai.com/docs/api-reference)(OpenAI、Gemini 代理、DeepSeek 等)。

`--api_url` 传入**基础 URL**(不含 `/chat/completions`),例如 `https://api.openai.com/v1`。

---

## 第一步:VQA 抽取

### 输入格式

创建一个 JSONL 文件,每行描述一个抽取任务:

```jsonl
{"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}
```

- `input_pdf_paths`:单个 PDF(题目和答案混排),或两个及以上 PDF 的列表(题目 PDF 放在答案 PDF 前面)。
- `name`:该任务的唯一标识符(用于目录命名和缓存)。

### 运行

```bash
python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro
```

**重要:** 我们推荐在这里使用强推理模型。较弱的模型(比如 `gpt-5-mini`)在这一阶段可能表现较差。

### 输出

- `{output_dir}/raw_vqa.jsonl`:包含图片引用的问答对
- `{output_dir}/{name}/vqa_images/`:抽取出的图片
- `cache/{name}/`:中间文件(`extracted_vqa.jsonl`、`merged_qa_pairs.jsonl`、`merged_qa_pairs.md`)

每个 QA 条目包含:

```json
{
  "question": "计算 $x$ 使得 $x^2 - 1 = 0$。",
  "answer": "$x = 1$ 或 $x = -1$",
  "solution": "因式分解 $(x-1)(x+1)=0$。",
  "label": 1,
  "question_chapter_title": "第一章 二次方程",
  "answer_chapter_title": "第一章 二次方程",
  "image_basedir": "/path/to/your/images"
}
```

### 提示

**我们也支持使用本地 MinerU 部署**:在 `pipelines/vqa_extract_optimized_pipeline.py` 中将 `FileOrURLToMarkdownConverterAPI` 替换为 `FileOrURLToMarkdownConverterLocal` 或 `FileOrURLToMarkdownConverterFlash`:

```python
# 原版 opendatalab 本地版
self.mineru_executor = FileOrURLToMarkdownConverterLocal(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
)

# 加速版 Flash
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
    batch_size=4,
    replicas=1,
    num_gpus_per_replica=1,
    engine_gpu_util_rate_to_ray_cap=0.9,
)
```

详细参数参见 [DataFlow 的 MinerU 算子文档](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/mineru_operators.py)。

<details>
<summary>代码逻辑简介</summary>

抽取流水线共六步:

1. **PDF 合并**(`PDF_Merger`):如果提供了多个 PDF,先合并为一个。
2. **文档版面解析**(`FileOrURLToMarkdownConverterAPI`):调用 MinerU API,生成结构化版面 JSON 和页面图片。
3. **版面预处理**(`MinerU2LLMInputOperator`):展平列表项并重新编号,生成 LLM 输入格式。
4. **LLM 抽取**(`ChunkedPromptedGenerator`):将版面 JSON 分块(每块最多 128k token),用 `QAExtractPrompt` 提示词批量调用 LLM,生成 XML 格式的问答对。
5. **输出解析**(`LLMOutputParser`):将 XML 响应解析为 JSONL,并将图片复制到 `vqa_images/`。
6. **问答合并**(`QA_Merger`):对于题目和答案分离的 PDF,根据章节标题和题目序号进行启发式匹配。可通过 `strict_title_match` 参数控制:设置为 `True` 时对章节标题进行严格匹配,否则会尝试提取标题中的中文/英文序号再匹配。

</details>

---

## 第二步:数据清洗

```bash
python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini
```

输出保存为 `--input_file` 同目录下的 `curated_vqa.jsonl`。

<details>
<summary>代码逻辑简介</summary>

共四步:

**1. 切小题**

将含多个独立小问的题目(如 (a)(b)(c))拆分为独立条目,每个小题配上对应的答案和解析。question 为空、或 answer 和 solution 均为空的条目会被丢弃。

题目内的小问如果互相有联系(比如 (b) 需要 (a) 的结果),则不会拆分为独立条目。

新增字段:`split_qa`

**2. 判断题型**

将每道题归类为以下之一:`Calculation`、`Proof`、`Explanation`、`Fill-in`、`Multiple-choice`、`Sketching`、`Other`。

默认只保留 `Calculation`、`Fill-in`、`Multiple-choice`。可通过修改 `DataCurationPipeline.__init__` 中的 `filter_rules` 自定义保留范围。

新增字段:`type`、`type_reason`

**3. 抽取答案**

从 `solution` 字段中抽取简洁答案并写入 `answer`。如 `answer` 已有内容则跳过(可在 `AnswerExtractionOperator` 中设置 `overwrite=True` 覆盖)。

**4. 题目过滤**

过滤掉不符合要求的条目,标准包括:

- 必须是明确的考题,不能是示例、纯陈述或开放性讨论。
- 答案必须直接回答问题。
- 题目和答案须自洽完整,不能依赖外部引用或省略的上下文。

新增字段:`filter_result`、`filter_reason`

</details>

---

## 第三步:生成 CoT

答题模型和评判模型可以使用不同的 API 端点和 API Key,这在答题模型是本地部署的开源 VLM(如通过 vLLM 部署的 Qwen3-VL)而评判模型是商业 API 时非常实用。

使用 `--answer_api_key_env` / `--judge_api_key_env` 指定各自使用哪个环境变量作为 API Key(默认均为 `DF_API_KEY`)。

```bash
# 示例:本地 Qwen3-VL 生成答案,OpenAI 作为评判
export VLLM_API_KEY="token-xxxx"  # 如果 vLLM server 不需要 key 可以不设
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY
```

输出保存为 `--input_file` 同目录下的 `curated_vqa_with_cot.jsonl`。

<details>
<summary>代码逻辑简介</summary>

在最多 `max_retries` 轮中进行 Reject Sampling:

**1. LLM 回答**(`VQAReasoningAnswerGenerator`)

LLM 生成分步推理过程,结果存入 `generated_cot`。在 `RejectSamplingPipeline` 中设置 `skip_text_only=True` 可只处理包含图片的题目,`False` 则处理全部题目。

**2. 清理 thinking 内容**

从生成结果中删除 `<think>...</think>` 部分以降低验证成本。清理后的答案存入 `llm_short_answer`。假设模型输出格式为 `<think>THINK</think>ANSWER` 或 `THINK</think>ANSWER`。
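在上述输出约定下,该清理步骤相当于只保留最后一个闭合 think 标签之后的内容。以下为该约定的一个简化示意(并非仓库的实际实现):

```python
import re

def strip_thinking(text: str) -> str:
    """删除 `<think>...</think>` 推理内容,只保留答案部分。

    同时兼容 `<think>THINK</think>ANSWER` 和 `THINK</think>ANSWER` 两种格式。
    """
    if "</think>" in text:
        # 只保留最后一个闭合标签之后的内容
        return text.rsplit("</think>", 1)[1].strip()
    # 没有闭合标签:删除未闭合的开标签及其后的全部内容
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()
```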
**3. LLM 核对**(`BenchDatasetEvaluatorQuestion`)

将 `llm_short_answer` 与标准答案 `answer` 进行语义比较(数值允许 5% 误差)。答对的条目标记为 `answer_match_result = True`,后续轮次跳过。

设置 `support_subquestions=True` 会逐个评估小题,只要有一道小题答错,整题的 `answer_match_result` 即为 `False`。

评估统计(整体正确率、小题正确率)保存至 `./cot_cache/eval_results.jsonl`:

```json
{
  "total_samples": 23584,
  "matched_samples": 12281,
  "accuracy": 0.521,
  "total_subquestions": 26380,
  "correct_subquestions": 13807,
  "subquestion_accuracy": 0.523
}
```

</details>

---

## 示例

`examples/VQA/` 目录提供了示例 PDF 和输入 JSONL:

```
examples/VQA/
├── vqa_extract_test.jsonl       # 第一步的示例输入
├── questionextract_test.pdf     # 题目答案混排 PDF
├── math_question.pdf            # 题目 PDF(分离式示例)
└── math_answer.pdf              # 答案 PDF(分离式示例)
```

完整流水线示例:

```bash
# 第一步:抽取
python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro

# 第二步:清洗
python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini

# 第三步:生成 CoT
# 示例:本地 Qwen3-VL 生成答案,OpenAI 作为评判
export VLLM_API_KEY="token-xxxx"  # 如果 vLLM server 不需要 key 可以不设
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY
```

## 提示

目前的实现仅适用于跑小规模的示例。如果想用我们的方法处理大规模的书籍,你可能会需要[断点续传](https://opendcai.github.io/DataFlow-Doc/zh/guide/resume/)和[分批推理](https://opendcai.github.io/DataFlow-Doc/en/guide/batch/)这两个功能。

## 许可证

本项目基于 [Apache License 2.0](LICENSE) 开源。
app.py ADDED
@@ -0,0 +1,166 @@
+ import os
+ import sys
+ import re
+ import shutil
+ import tempfile
+ import traceback
+
+ import gradio as gr
+
+ # Ensure the repo root is on the Python path
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+
+ def run_curation(
+     input_file,
+     api_url: str,
+     api_key: str,
+     model_name: str,
+     max_workers: int,
+     progress=gr.Progress(track_tqdm=True),
+ ):
+     if input_file is None:
+         return None, "请先上传输入的 JSONL 文件。"
+     if not api_key.strip():
+         return None, "请填写 API Key。"
+
+     # Inject the key so DataFlow's APILLMServing_request picks it up
+     os.environ["DF_API_KEY"] = api_key.strip()
+
+     # Use a dedicated temp workspace so parallel runs don't collide
+     workspace = tempfile.mkdtemp(prefix="dataflow_vqa_")
+     cache_dir = os.path.join(workspace, "cache")
+     os.makedirs(cache_dir, exist_ok=True)
+
+     # curate_data.py hardcodes cache_path="./cache", so we work from workspace
+     original_cwd = os.getcwd()
+     try:
+         os.chdir(workspace)
+
+         # Late import after path & cwd are set up
+         from pipelines.curate_data import DataCurationPipeline
+
+         progress(0.05, desc="初始化 pipeline…")
+
+         pipeline = DataCurationPipeline(
+             input_file=input_file,
+             api_url=api_url.rstrip("/"),
+             model_name=model_name,
+             max_workers=int(max_workers),
+         )
+         pipeline.compile()
+
+         progress(0.15, desc="正在运行 pipeline(可能需要几分钟)…")
+         pipeline.forward()
+
+         # Locate the highest-numbered step file
+         step_files = [
+             f for f in os.listdir(cache_dir)
+             if re.match(r"curate_data_step\d+\.jsonl", f)
+         ]
+         if not step_files:
+             return None, "Pipeline 运行完成,但未找到输出文件。请检查日志。"
+
+         max_step = max(
+             int(re.findall(r"curate_data_step(\d+)\.jsonl", f)[0])
+             for f in step_files
+         )
+         output_path = os.path.join(cache_dir, f"curate_data_step{max_step}.jsonl")
+
+         # Copy to a stable temp file so Gradio can serve it
+         result_file = os.path.join(workspace, "curated_vqa.jsonl")
+         shutil.copy(output_path, result_file)
+
+         progress(1.0, desc="完成!")
+         return result_file, f"✅ 完成!共执行 {max_step} 步,结果已保存为 curated_vqa.jsonl。"
+
+     except Exception:
+         tb = traceback.format_exc()
+         return None, f"❌ 运行出错:\n```\n{tb}\n```"
+     finally:
+         os.chdir(original_cwd)
+
+
+ # ── Gradio UI ──────────────────────────────────────────────────────────────────
+
+ with gr.Blocks(
+     title="DataFlow-VQA · 数据清洗 Demo",
+     theme=gr.themes.Soft(),
+ ) as demo:
+     gr.Markdown(
+         """
+ # 🔬 DataFlow-VQA — 数据清洗 Pipeline Demo
+
+ 将从 PDF 中提取的原始 VQA 数据(`raw_vqa.jsonl`)通过多步 LLM 清洗,输出高质量的 `curated_vqa.jsonl`。
+
+ **清洗步骤:** 子问题拆分 → 题型分类过滤 → 答案提取 → 填空补全 → 文本清理 → QA 质量过滤
+
+ > 注意:所有 LLM 调用均通过您提供的 API 完成,本 Space 不存储任何数据或密钥。
+ """
+     )
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("### 📥 输入")
+             input_file = gr.File(
+                 label="上传输入 JSONL 文件(raw_vqa.jsonl)",
+                 file_types=[".jsonl"],
+             )
+             gr.Markdown("### ⚙️ API 配置")
+             api_url = gr.Textbox(
+                 label="API Base URL(不含 /chat/completions)",
+                 value="https://api.openai.com/v1",
+                 placeholder="https://api.openai.com/v1",
+             )
+             api_key = gr.Textbox(
+                 label="API Key",
+                 placeholder="sk-...",
+                 type="password",
+             )
+             model_name = gr.Textbox(
+                 label="模型名称",
+                 value="gpt-4o-mini",
+                 placeholder="gpt-4o-mini / gemini-2.0-flash / deepseek-chat …",
+             )
+             max_workers = gr.Slider(
+                 label="并发 Worker 数量",
+                 minimum=1,
+                 maximum=50,
+                 value=5,
+                 step=1,
+                 info="HF Spaces 免费版资源有限,建议不超过 10",
+             )
+             run_btn = gr.Button("▶ 开始清洗", variant="primary", size="lg")
+
+         with gr.Column(scale=1):
+             gr.Markdown("### 📤 输出")
+             status_box = gr.Textbox(
+                 label="运行状态",
+                 interactive=False,
+                 lines=6,
+                 placeholder="点击「开始清洗」后,状态信息将显示在这里…",
+             )
+             output_file = gr.File(
+                 label="下载清洗结果(curated_vqa.jsonl)",
+                 interactive=False,
+             )
+
+     gr.Markdown(
+         """
+ ---
+ **输入格式**:每行一个 JSON 对象,需包含 `question`、`answer`、`solution` 字段。
+
+ **支持的 API**:任何 OpenAI 兼容接口,包括 OpenAI、Google Gemini(via proxy)、DeepSeek、vLLM 等。
+
+ **项目地址**:[OpenDCAI/DataFlow-VQA](https://github.com/OpenDCAI/DataFlow-VQA)
+ """
+     )
+
+     run_btn.click(
+         fn=run_curation,
+         inputs=[input_file, api_url, api_key, model_name, max_workers],
+         outputs=[output_file, status_box],
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
examples/VQA/vqa_extract_test.jsonl ADDED
@@ -0,0 +1,2 @@
+ {"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
+ {"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}
operators/answer_extractor.py ADDED
@@ -0,0 +1,90 @@
+ from dataflow.utils.registry import OPERATOR_REGISTRY
+ from dataflow import get_logger
+ from dataflow.core import OperatorABC
+ from dataflow.utils.storage import DataFlowStorage
+ import pandas as pd
+ from typing import Union
+
+ @OPERATOR_REGISTRY.register()
+ class AnswerExtractionOperator(OperatorABC):
+     def __init__(self, llm_serving: Union[None, object] = None, overwrite: bool = False):
+         self.logger = get_logger()
+         self.llm_serving = llm_serving
+         self.overwrite = overwrite
+         self.system_prompt = "You are a professional question answering system. You will be given a question with corresponding solution. Extract a concise and accurate answer from the provided solution. Output only the answer without any additional text."
+
+     @staticmethod
+     def get_desc(lang: str = "zh"):
+         if lang == "zh":
+             return (
+                 "该算子用于从解答中提取答案,读取解答字段并调用LLM提取答案。"
+                 "输入参数:\n"
+                 "- input_solution_key:解答字段名,默认为'solution'\n"
+                 "- output_key:答案字段名,默认为'answer'\n"
+                 "- overwrite:是否覆盖已有答案,默认为False\n"
+                 "输出参数:\n"
+                 "- output_key:提取的答案"
+             )
+         elif lang == "en":
+             return (
+                 "This operator extracts answers from solutions, reading from the solution field and using LLM to extract answers."
+                 "Input Parameters:\n"
+                 "- input_solution_key: Solution field name, default 'solution'\n"
+                 "- output_key: Answer field name, default 'answer'\n"
+                 "- overwrite: Whether to overwrite existing answers, default False\n"
+                 "Output Parameters:\n"
+                 "- output_key: Extracted answer"
+             )
+         else:
+             return "AnswerExtractionOperator extracts answers from solutions using LLM."
+
+     def run(self, storage: DataFlowStorage, input_question_key: str = "question", input_solution_key: str = "solution", output_key: str = "answer"):
+         dataframe = storage.read("dataframe")
+
+         if input_solution_key not in dataframe.columns:
+             raise ValueError(f"input_solution_key: {input_solution_key} not found in dataframe columns.")
+
+         # -----------------------------------
+         # Unified empty-value check
+         # -----------------------------------
+         def _is_valid(x, *, empty_ok=False):
+             """
+             empty_ok=False: is the solution valid? (blank -> invalid)
+             empty_ok=True: is the output "empty"? (blank -> empty)
+             """
+             if x is None:
+                 return empty_ok
+             if isinstance(x, float) and pd.isna(x):
+                 return empty_ok
+             if isinstance(x, str) and x.strip() == "":
+                 return empty_ok
+             return not empty_ok
+
+         # The solution must be valid (non-null, non-blank)
+         mask = dataframe[input_solution_key].apply(lambda x: _is_valid(x, empty_ok=False))
+
+         # With overwrite=False, only process rows whose output_key is empty (blank counts as empty)
+         if not self.overwrite and output_key in dataframe.columns:
+             mask = mask & dataframe[output_key].apply(lambda x: _is_valid(x, empty_ok=True))
+
+         # Collect the valid solutions
+         solutions = dataframe.loc[mask, input_solution_key].tolist()
+         questions = dataframe.loc[mask, input_question_key].tolist()
+
+         # Call the LLM
+         if self.llm_serving:
+             prompts = [
+                 self.system_prompt + f"\n\nQuestion: {q}\nSolution: {s}\nNow extract the answer."
+                 for q, s in zip(questions, solutions)
+             ]
+             answers = self.llm_serving.generate_from_input(prompts)
+         else:
+             answers = solutions
+
+         # Write back
+         dataframe.loc[mask, output_key] = answers
+
+         output_file = storage.write(dataframe)
+         self.logger.info(f"Extracted answers saved to {output_file}")
+
+         return [output_key]
operators/bench_evaluate.py ADDED
@@ -0,0 +1,307 @@
+ import os
+ import sys
+ # Add the repo root (one level up) to sys.path
+ parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
+ if parent_dir not in sys.path:
+     sys.path.insert(0, parent_dir)
+ from dataflow.utils.reasoning.AnswerExtraction import StringCleaner, UnitTextManager, AnswerExtractor
+ from prompts.bench_evaluate import AnswerJudgePromptQuestion, AnswerJudgeMultipleQuestionsPrompt
+ from dataflow.core.prompt import DIYPromptABC
+ from dataflow.utils.registry import OPERATOR_REGISTRY
+ from dataflow.utils.storage import DataFlowStorage
+ from dataflow.core import LLMServingABC
+ from dataflow.core import OperatorABC
+
+ from math_verify import parse, verify
+ from dataflow import get_logger
+ from typing import Literal
+ import pandas as pd
+ import numpy as np
+ import time
+ import re
+ import json
+ import json5
+
+ @OPERATOR_REGISTRY.register()
+ class BenchDatasetEvaluatorQuestion(OperatorABC):
+     def __init__(self,
+                  eval_result_path: str = None,
+                  compare_method: Literal["match", "semantic"] = "match",
+                  system_prompt: str = "You are a helpful assistant specialized in evaluating answer correctness.",
+                  llm_serving: LLMServingABC = None,
+                  prompt_template: DIYPromptABC = None,
+                  support_subquestions: bool = False,
+                  skip_true: bool = False,  # whether to skip samples already verified as True
+                  keep_all_samples: bool = True,  # whether to keep rows without reference answers in the output
+                  ):
+
+         if eval_result_path is None:
+             timestamp = int(time.time())
+             eval_result_path = f"result_bencheval/BenchDatasetEvaluator_result_{timestamp}.json"
+
+         self.eval_result_path = eval_result_path
+         self.compare_method = compare_method
+         self.empty_responses_count = 0  # counter for empty responses
+
+         if compare_method == "match":
+             self.compare = self.math_verify_compare
+             unit_manager = UnitTextManager()
+             string_cleaner = StringCleaner(unit_manager)
+             self.answer_extractor = AnswerExtractor(string_cleaner)
+         else:
+             if prompt_template is None:
+                 prompt_template = AnswerJudgePromptQuestion() if not support_subquestions else AnswerJudgeMultipleQuestionsPrompt()
+             self.prompt_template = prompt_template
+             self.system_prompt = system_prompt
+             self.llm_serving = llm_serving
+         self.support_subquestions = support_subquestions
+         self.skip_true = skip_true
+         self.keep_all_samples = keep_all_samples
+
+         self.logger = get_logger()
+
+     def math_verify_compare(self, answer, ground_truth):
+         try:
+             return verify(parse(str(ground_truth)), parse(str(answer)))
+         except Exception:
+             try:
+                 return verify(parse(ground_truth), parse(answer))
+             except Exception:
+                 return False
+
+     def ResolveResponse(self, response):
+         if not self.support_subquestions:
+             # Check for an empty response
+             if response is None or (isinstance(response, str) and response.strip() == ''):
+                 self.empty_responses_count += 1
+                 return False
+             try:
+                 pattern = re.compile(r'"judgement_result"\s*:\s*(true|false)', re.IGNORECASE)
+                 match = pattern.search(response)
+                 if match:
+                     result_value = match.group(1).lower()
+                 else:
+                     # Fallback parsing: check whether the response contains "true"
+                     result_value = "true" if "true" in response.lower() else "false"
+                 return result_value == "true"
+             except Exception as e:
+                 self.logger.error(f"Response format error: {response}. Error: {e}")
+                 return False
+
+         # With sub-questions, the response carries a judgement list; return "correct/total"
+         correct_num = 0
+         total_num = 0
+         try:
+             response = json5.loads(response, strict=False)  # json5 tolerates lax formatting
+             judgement = response.get("judgement", [])
+         except Exception as e:
+             self.logger.error(f"Response JSON parse error: {response}. Error: {e}")
+             self.empty_responses_count += 1
+             return "0/0"
+         for resp in judgement:
+             if isinstance(resp, bool):
+                 if resp:
+                     correct_num += 1
+                 total_num += 1
+             elif isinstance(resp, str):
+                 if resp.lower() == "true":
+                     correct_num += 1
+                     total_num += 1
+                 elif resp.lower() == "false":
+                     total_num += 1
+                 elif resp.lower() == "empty":
+                     continue  # not counted toward the total
+
+         return f"{correct_num}/{total_num}"
+
+     @staticmethod
+     def get_desc(lang: str = "zh"):
+         if lang == "zh":
+             return (
+                 "该算子用于对比预测答案与标准答案的匹配度,支持两种评估模式:\n\n"
+                 "1. 字符串匹配(match):使用数学验证方法比较答案,适用于有明确答案的问题\n"
+                 "2. 语义匹配(semantic):使用LLM评估答案的语义相似度,适用于开放性问题\n\n"
+                 "输入参数:\n"
+                 "- input_test_answer_key:预测答案字段名\n"
+                 "- input_gt_answer_key:标准答案字段名\n"
+                 "- input_question_key:问题字段名(语义匹配模式下必需)\n"
+                 "- compare_method:比较方法(match/semantic)\n\n"
+                 "输出参数:\n"
+                 "- answer_match_result:匹配结果(True/False)\n"
+                 "- 统计结果将保存到指定的eval_result_path路径\n"
+             )
+         elif lang == "en":
+             return (
+                 "This operator compares predicted answers against ground truth using two evaluation modes:\n\n"
+                 "1. String Matching (match): Uses mathematical verification to compare answers, suitable for questions with definitive answers\n"
+                 "2. Semantic Matching (semantic): Uses LLM to evaluate semantic similarity, suitable for open-ended questions\n\n"
+                 "Input Parameters:\n"
+                 "- input_test_answer_key: Predicted answer field\n"
+                 "- input_gt_answer_key: Ground truth field\n"
+                 "- input_question_key: Question field (required for semantic mode)\n"
+                 "- compare_method: Comparison method (match/semantic)\n\n"
+                 "Output Parameters:\n"
+                 "- answer_match_result: Matching result (True/False)\n"
+                 "- Statistics will be saved to the specified eval_result_path\n"
+             )
+         else:
+             return "BenchEvaluator performs answer validation using string matching or semantic comparison"
+
+     def check_column(self, required_columns: list[str], dataframe: pd.DataFrame):
+         for column in required_columns:
+             if column not in dataframe.columns:
+                 self.logger.error(f"Required column '{column}' not found in dataframe")
+                 return False
+         return True
+
+     def statistic(self, file_name_prefix: str, dataframe: pd.DataFrame, compare_method: Literal["match", "semantic"]):
+         total_samples = len(dataframe)
+         valid_samples = len(dataframe) - self.empty_responses_count
+         matched_samples = sum(dataframe['answer_match_result'])
+         accuracy = matched_samples / valid_samples if valid_samples > 0 else 0
+
+         # Build the statistics dict
+         stats = {
+             "bench_name_or_prefix": file_name_prefix,
+             "total_samples": total_samples,
+             "valid_samples": valid_samples,
+             "matched_samples": matched_samples,
+             "accuracy": float(accuracy),  # ensure JSON-serializable
+             "empty_responses_count": self.empty_responses_count,
+             "compare_method": compare_method
+         }
+
+         if self.support_subquestions:
+             total_subquestions = dataframe['total_subquestions'].sum()
+             correct_subquestions = dataframe['correct_answer_num'].sum()
+             subquestion_accuracy = correct_subquestions / total_subquestions if total_subquestions > 0 else 0
+             stats.update({
+                 "total_subquestions": int(total_subquestions),
+                 "correct_subquestions": int(correct_subquestions),
+                 "subquestion_accuracy": float(subquestion_accuracy)
+             })
+
+         # Convert the dict to a DataFrame
+         stats_df = pd.DataFrame([stats])
+
+         # Write the statistics straight to self.eval_result_path
+         os.makedirs(os.path.dirname(self.eval_result_path), exist_ok=True)
+         stats_df.to_json(self.eval_result_path, orient="records", force_ascii=False, indent=2)
+         self.logger.success(f"Statistics saved to {self.eval_result_path}")
+
+         return stats_df
+
+     def run(
+         self,
+         storage: DataFlowStorage,
+         input_test_answer_key: str = "generated_cot",
+         input_gt_answer_key: str = "golden_answer",
+         input_question_key: str = None,
+     ) -> list:
+
+         self.test_answer_key = input_test_answer_key
+         self.gt_answer_key = input_gt_answer_key
+         self.question_key = input_question_key
+
+         dataframe = storage.read("dataframe")
+         if 'answer_match_result' not in dataframe.columns:
+             dataframe['answer_match_result'] = False
+
+         if self.compare_method == "match":
+             required_columns = [input_test_answer_key, input_gt_answer_key]
+             if self.check_column(required_columns=required_columns, dataframe=dataframe) is False:
+                 return required_columns
+
+             answers = dataframe[self.test_answer_key]
+             ground_truths = dataframe[self.gt_answer_key]
+             for i in range(len(answers)):
+                 final_answer = self.answer_extractor.extract_answer(answers[i], None)
+                 if self.compare(final_answer, ground_truths[i]):
+                     dataframe.at[i, 'answer_match_result'] = True
+                 else:
+                     dataframe.at[i, 'answer_match_result'] = False
+
+             output_file = storage.write(dataframe)
+
+             # Compute statistics and write them straight to the JSON file
+             stats = self.statistic(storage.file_name_prefix, dataframe, self.compare_method)
+
+             return [self.test_answer_key, self.gt_answer_key, 'answer_match_result']
+         else:
+             required_columns = [input_test_answer_key, input_gt_answer_key, input_question_key]
+             if self.check_column(required_columns=required_columns, dataframe=dataframe) is False:
+                 return required_columns
+
+             empty_reference_mask = dataframe[input_gt_answer_key].isna() | (dataframe[input_gt_answer_key] == '')
+             if self.skip_true:
+                 empty_reference_mask = empty_reference_mask | (dataframe['answer_match_result'] == True)
+             skipped_rows = dataframe[empty_reference_mask]
+             valid_rows = dataframe[~empty_reference_mask]
+             skipped_count = len(skipped_rows)
+
+             if len(valid_rows) == 0 and not self.skip_true:
+                 self.logger.warning("No valid samples with reference answers found. All samples skipped.")
+                 if self.keep_all_samples:
+                     output_file = storage.write(dataframe)  # keep all rows; answer_match_result stays False
+                 else:
+                     output_file = storage.write(pd.DataFrame(columns=dataframe.columns))  # keep no rows
+                 self.logger.info(f"Dataframe saved to {output_file}. Skipped {skipped_count} samples due to missing reference answers.")
+                 return required_columns + ['answer_match_result']
+
+             # Build prompts and call the LLM only for rows that have a reference answer
+             inputs = [self.prompt_template.build_prompt(
+                 question=row[input_question_key],
+                 answer=row[input_test_answer_key],
+                 reference_answer=row[input_gt_answer_key]
+             ) for _, row in valid_rows.iterrows()]
+
+             responses = self.llm_serving.generate_from_input(user_inputs=inputs, system_prompt=self.system_prompt)
+
+             results = [self.ResolveResponse(response) for response in responses]
+
+             # Update answer_match_result for the valid rows
+             valid_indices = valid_rows.index
+             if not self.support_subquestions:
+                 for i, idx in enumerate(valid_indices):
+                     dataframe.at[idx, 'answer_match_result'] = results[i]
+             else:
+                 for i, idx in enumerate(valid_indices):
+                     correct_answer_num = int(results[i].split('/')[0])
+                     total_subquestions = int(results[i].split('/')[1])
+                     dataframe.at[idx, 'correct_answer_num'] = correct_answer_num
+                     dataframe.at[idx, 'total_subquestions'] = total_subquestions
+                     dataframe.at[idx, 'answer_match_result'] = (correct_answer_num == total_subquestions) and (total_subquestions > 0)  # True only if every sub-answer is correct
+                     dataframe.at[idx, 'response_evaluation'] = responses[i]  # keep the judge's raw response
+
+             output_file = storage.write(dataframe)
+
+             # Compute statistics and write them straight to the JSON file
+             stats = self.statistic(storage.file_name_prefix, dataframe, self.compare_method)
+
+             # Reset the empty-response counter
+             self.empty_responses_count = 0
+
+             return [input_test_answer_key, input_gt_answer_key, input_question_key, 'answer_match_result']
operators/pdf2vqa/__init__.py ADDED
@@ -0,0 +1,4 @@
+ from .mineru_to_llm_input_operator import MinerU2LLMInputOperator
+ from .llm_output_parser import LLMOutputParser
+ from .qa_merger import QA_Merger
+ from .pdf_merger import PDF_Merger
operators/pdf2vqa/llm_output_parser.py ADDED
@@ -0,0 +1,146 @@
+ import os
+ import json
+ import re
+ import shutil
+ from pathlib import Path
+ from typing import Literal
+ from dataflow.core import OperatorABC
+ from dataflow.utils.registry import OPERATOR_REGISTRY
+ from dataflow.utils.storage import DataFlowStorage
+ from dataflow import get_logger
+
+ @OPERATOR_REGISTRY.register()
+ class LLMOutputParser(OperatorABC):
+     def __init__(self,
+                  output_dir,
+                  intermediate_dir: str = "intermediate",
+                  ):
+         self.logger = get_logger()
+         self.output_dir = output_dir
+         self.intermediate_dir = intermediate_dir
+
+     @staticmethod
+     def get_desc(lang: str = "zh") -> str:
+         if lang == 'zh':
+             return (
+                 "LLM输出解析算子。"
+                 "将LLM生成的包含题目和答案ID的响应文本,"
+                 "转换为结构化的QA列表,并复制相关图片到输出目录。"
+             )
+         else:
+             return (
+                 "LLM output parsing operator. "
+                 "Converts LLM-generated response text containing question and answer IDs "
+                 "into a structured QA list and copies related images to the output directory."
+             )
+
+     def _id_to_text(self, input_ids, input_json, image_prefix="images"):
+         texts = []
+         id_list = input_ids.replace(' ', '').split(',')
+         for block_id in id_list:
+             try:
+                 block_id = int(block_id)
+             except Exception:
+                 continue
+             if block_id < len(input_json):
+                 try:
+                     item = input_json[block_id]
+                 except Exception:
+                     continue
+                 if 'text' in item:
+                     texts.append(item['text'])
+                 elif 'table_body' in item:
+                     texts.append(item['table_body'])
+                 elif 'img_path' in item:
+                     try:
+                         img_path = item.get('img_path', '')
+                         img_name = os.path.basename(img_path)
+                         new_path = f"{image_prefix}/{img_name}"
+                         # image_caption is a list of strings; fall back to a plain "image" alt text
+                         texts.append(f"![{' '.join(item.get('image_caption', ['image']))}]({new_path})")
+                     except Exception:
+                         pass
+                 elif item.get('type', '') == 'list':
+                     if item['sub_type'] == 'text':
+                         try:
+                             texts.append(item['list_items'].pop(0))
+                         except Exception:
+                             pass
+         return '\n'.join(texts)
+
+     def _convert_response(self, input_response, input_json_path, image_prefix="images"):
+         qa_list = []
+         with open(input_json_path, 'r', encoding='utf-8') as infile:
+             input_json = list(json.load(infile))
+         # Extract the chapter title
+         for chapter_block in re.findall(r'<chapter>(.*?)</chapter>', input_response, flags=re.DOTALL):
+             title = re.search(r'<title>(.*?)</title>', chapter_block, flags=re.DOTALL)
+             if title:
+                 chapter_title = self._id_to_text(title.group(1).strip(), input_json, image_prefix)
+             else:
+                 chapter_title = ""
+             # Find all qa_pair blocks
+             for pair in re.findall(r'<qa_pair>(.*?)</qa_pair>', chapter_block, flags=re.DOTALL):
+                 # Extract the question part
+                 q_match = re.search(r'<question>(.*?)</question>', pair, flags=re.DOTALL)
+                 # Extract the answer part
+                 a_match = re.search(r'<answer>(.*?)</answer>', pair, flags=re.DOTALL)
+                 # Extract the solution part
+                 s_match = re.search(r'<solution>(.*?)</solution>', pair, flags=re.DOTALL)
+                 # Extract the label
+                 label_match = re.search(r'<label>(.*?)</label>', pair, flags=re.DOTALL)
+                 if not ((q_match and label_match) or (a_match and label_match) or (s_match and label_match)):
+                     continue
+                 label = label_match.group(1).strip()
+                 qa_list.append({
+                     'question': self._id_to_text(q_match.group(1).strip(), input_json, image_prefix) if q_match else "",
+                     'answer': a_match.group(1).strip() if a_match else "",
+                     'solution': self._id_to_text(s_match.group(1).strip(), input_json, image_prefix) if s_match else "",
+                     'label': label,
+                     'chapter_title': chapter_title
+                 })
+         return qa_list
+
+     def run(self, storage: DataFlowStorage,
+             input_response_path_key,
+             input_converted_layout_path_key,
+             input_name_key,
+             output_qalist_path_key,
+             ):
+         dataframe = storage.read("dataframe")
+
+         # Convert each response
+         for idx, row in dataframe.iterrows():
+             converted_json_path = row[input_converted_layout_path_key]
+             response = Path(row[input_response_path_key]).read_text(encoding='utf-8')
+             name = row[input_name_key]
+
+             # Use a bare folder name for Markdown image links; prefixing it with
+             # `name` (e.g. "math1/vqa_images") would bake a wrong relative path
+             # into the generated JSON and Markdown.
+             image_prefix = "vqa_images"
+             qa_list = self._convert_response(response, converted_json_path, image_prefix)
+             output_qalist_path = os.path.join(self.output_dir, name, "extracted_vqa.jsonl")
+             os.makedirs(os.path.dirname(output_qalist_path), exist_ok=True)
+             with open(output_qalist_path, 'w', encoding='utf-8') as outfile:
+                 for qa in qa_list:
+                     json.dump(qa, outfile, ensure_ascii=False)
+                     outfile.write('\n')
+
+             # Copy the images
+             src_dir = os.path.dirname(converted_json_path)
+             src_images = os.path.join(src_dir, 'vlm', 'images')
+             if not os.path.exists(src_images):
+                 src_images = os.path.join(src_dir, 'images')
+             if not os.path.exists(src_images):
+                 self.logger.warning(f"Images directory {src_images} not found, skipping image copy (PDF may contain no images).")
+             else:
+                 dst_images = os.path.join(self.output_dir, name, image_prefix)
+                 try:
+                     shutil.copytree(src_images, dst_images)
+                 except Exception as e:
+                     self.logger.warning(f"Failed to copy images from {src_images} to {dst_images}: {e}")
+
+             dataframe.loc[idx, output_qalist_path_key] = output_qalist_path
+
+         storage.write(dataframe)
operators/pdf2vqa/mineru_to_llm_input_operator.py ADDED
@@ -0,0 +1,76 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ from dataflow.core import OperatorABC
3
+ from dataflow.utils.registry import OPERATOR_REGISTRY
4
+ from dataflow.utils.storage import DataFlowStorage
5
+
6
+ from pathlib import Path
7
+
8
+ @OPERATOR_REGISTRY.register()
9
+ class MinerU2LLMInputOperator(OperatorABC):
10
+ def __init__(self):
11
+ pass
12
+
13
+    @staticmethod
+    def get_desc(lang: str = "zh") -> str:
+        if lang == 'zh':
+            return (
+                "MinerU格式转换为LLM输入格式算子。"
+                "将MinerU生成的内容列表JSON文件转换为适合LLM处理的格式,"
+                "包括展平列表项并重新编号。"
+            )
+        else:
+            return (
+                "Convert MinerU format to LLM input format operator. "
+                "Transforms the content list JSON file generated by MinerU into a format suitable for LLM processing, "
+                "including flattening list items and re-indexing."
+            )
+
+    def _convert_json(self, input_file, output_file):
+        with open(input_file, 'r', encoding="utf-8") as infile:
+            data = list(json.load(infile))
+
+        new_data = []
+        item_id = 0  # avoid shadowing the built-in `id`
+        for item in data:
+            item['id'] = item_id
+            item.pop('bbox', None)       # drop layout metadata the LLM does not need
+            item.pop('page_idx', None)
+            if item.get('type', '') == 'list':
+                if item['sub_type'] == 'text':
+                    for idx, list_item in enumerate(item.get('list_items', [])):
+                        new_item = {
+                            'type': 'text',
+                            'text': list_item,
+                            'id': item_id + idx,
+                        }
+                        new_data.append(new_item)
+                item_id += len(item.get('list_items', []))
+            else:
+                new_data.append(item)
+                item_id += 1
+
+        with open(output_file, 'w', encoding='utf-8') as outfile:
+            json.dump(new_data, outfile, ensure_ascii=False)
+
+    def run(self, storage: DataFlowStorage,
+            input_markdown_path_key,
+            output_converted_layout_key,
+            ):
+        dataframe = storage.read("dataframe")
+
+        for index, row in dataframe.iterrows():
+            md_path = Path(row[input_markdown_path_key])
+            try:
+                input_json_path = list(md_path.parent.glob("*_content_list.json"))[0]
+            except IndexError:
+                raise ValueError("No _content_list.json file found in the API result. There might be an error with the MinerU API.")
+
+            converted_path = str(input_json_path).replace('.json', '_converted.json')
+            self._convert_json(input_json_path, converted_path)
+            dataframe.at[index, output_converted_layout_key] = converted_path
+
+            with open(converted_path, 'r', encoding='utf-8') as infile:
+                data = json.load(infile)
+            assert isinstance(data, list), f"Expected list, got {type(data)} for {input_json_path}"
+
+        storage.write(dataframe)
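The flattening and re-indexing done by `_convert_json` can be sketched as a small standalone function. The sample input below is hypothetical, not real MinerU output, and unlike the operator this sketch keeps list blocks whose `sub_type` is not `"text"` instead of dropping them:

```python
def flatten_content_list(data):
    new_data = []
    next_id = 0
    for item in data:
        item = dict(item)          # work on a copy
        item.pop('bbox', None)     # layout metadata the LLM does not need
        item.pop('page_idx', None)
        if item.get('type') == 'list' and item.get('sub_type') == 'text':
            # Expand each list entry into its own sequentially numbered text item
            for list_item in item.get('list_items', []):
                new_data.append({'type': 'text', 'text': list_item, 'id': next_id})
                next_id += 1
        else:
            item['id'] = next_id
            new_data.append(item)
            next_id += 1
    return new_data
```

Every output item ends up with a consecutive `id`, whether it came from a flattened list or was passed through unchanged.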
operators/pdf2vqa/pdf_merger.py ADDED
@@ -0,0 +1,83 @@
+import os
+from pypdf import PdfWriter
+from dataflow.core import OperatorABC
+from dataflow.utils.registry import OPERATOR_REGISTRY
+from dataflow.utils.storage import DataFlowStorage
+
+@OPERATOR_REGISTRY.register()
+class PDF_Merger(OperatorABC):
+    def __init__(self, output_dir: str):
+        """
+        Initialize the PDF merging operator.
+
+        :param output_dir: Root directory where merged PDF files are stored
+        """
+        self.output_dir = output_dir
+        os.makedirs(self.output_dir, exist_ok=True)  # exist_ok makes a separate existence check redundant
+
+    @staticmethod
+    def get_desc(lang: str = "zh") -> str:
+        if lang == 'zh':
+            return (
+                "PDF 文件合并算子。"
+                "输入 PDF 路径列表,按顺序合并为一个 PDF 文件,"
+                "并保存到指定目录。"
+            )
+        else:
+            return (
+                "PDF merging operator. "
+                "Takes a list of PDF paths, merges them in order into a single PDF, "
+                "and saves it to the specified directory."
+            )
+
+    def run(self,
+            storage: DataFlowStorage,
+            input_pdf_list_key: str,
+            input_name_key: str,
+            output_pdf_path_key: str
+            ):
+        """
+        Execute the merge logic.
+
+        :param input_pdf_list_key: Column in the DataFrame holding the PDF path list (str or list[str])
+        :param input_name_key: Column used for naming (e.g. a file name or ID)
+        :param output_pdf_path_key: Column the merged result path is written to
+        """
+        dataframe = storage.read("dataframe")
+
+        for idx, row in dataframe.iterrows():
+            pdf_paths = row[input_pdf_list_key]
+            if isinstance(pdf_paths, str):
+                pdf_paths = [pdf_paths]
+            name = row[input_name_key]
+
+            # Build the output path: output_dir/<name>/<name>_merged.pdf
+            save_dir = os.path.join(self.output_dir, str(name))
+            os.makedirs(save_dir, exist_ok=True)
+            output_path = os.path.join(save_dir, f"{name}_merged.pdf")
+
+            merger = PdfWriter()
+            try:
+                valid_count = 0
+                for path in pdf_paths:
+                    if os.path.exists(path):
+                        merger.append(path)
+                        valid_count += 1
+
+                if valid_count > 0:
+                    with open(output_path, "wb") as f:
+                        merger.write(f)
+                    # Write the result back to the dataframe
+                    dataframe.loc[idx, output_pdf_path_key] = output_path
+                else:
+                    dataframe.loc[idx, output_pdf_path_key] = None
+            except Exception as e:
+                print(f"Error merging PDFs for {name}: {e}")
+                dataframe.loc[idx, output_pdf_path_key] = None
+            finally:
+                merger.close()  # close the writer on both success and failure paths
+
+        storage.write(dataframe)
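The path handling in `run` (normalising the input, skipping missing files, building `output_dir/<name>/<name>_merged.pdf`) can be isolated into a pure function. The `exists` parameter below is an assumption introduced purely so the logic is testable without touching the filesystem:

```python
import os

def plan_merge(pdf_paths, name, output_dir, exists=os.path.exists):
    """Mirror the path handling above: normalise the input to a list, drop
    paths that do not exist, and build output_dir/<name>/<name>_merged.pdf."""
    if isinstance(pdf_paths, str):
        pdf_paths = [pdf_paths]          # a single path is treated as a one-item list
    valid = [p for p in pdf_paths if exists(p)]
    output_path = os.path.join(output_dir, str(name), f"{name}_merged.pdf")
    return valid, output_path
```

Separating the planning from the actual `PdfWriter` calls keeps the merge loop itself trivial to reason about.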
operators/pdf2vqa/qa_merger.py ADDED
@@ -0,0 +1,84 @@
+import os
+import json
+import re
+
+from dataflow.core import OperatorABC
+from dataflow.utils.registry import OPERATOR_REGISTRY
+from dataflow.utils.storage import DataFlowStorage
+from utils.format_utils import merge_qa_pair, jsonl_to_md
+
+@OPERATOR_REGISTRY.register()
+class QA_Merger(OperatorABC):
+    def __init__(self, output_dir, strict_title_match=False):
+        self.output_dir = output_dir
+        self.strict_title_match = strict_title_match
+
+    @staticmethod
+    def get_desc(lang: str = "zh") -> str:
+        if lang == 'zh':
+            return (
+                "QA对合并算子。"
+                "将问题和答案的QA列表进行合并,生成最终的QA对文件,"
+                "并转换为Markdown格式。"
+            )
+        else:
+            return (
+                "QA pair merging operator. "
+                "Merges question and answer QA lists to generate final QA pair files, "
+                "and converts them to Markdown format."
+            )
+
+    def run(self, storage: DataFlowStorage,
+            input_qalist_path_key,
+            input_name_key,
+            output_merged_qalist_path_key,
+            output_merged_md_path_key,
+            output_qa_item_key="qa_item"  # column that will hold the exploded QA items
+            ):
+        dataframe = storage.read("dataframe")
+
+        # Initialize the column as object dtype so it can hold list values
+        dataframe[output_qa_item_key] = None
+        dataframe[output_qa_item_key] = dataframe[output_qa_item_key].astype(object)
+
+        for idx, row in dataframe.iterrows():
+            qa_list_path = row[input_qalist_path_key]
+            name = row[input_name_key]
+
+            output_merged_qalist_path = os.path.join(self.output_dir, name, "merged_qa_pairs.jsonl")
+            merge_qa_pair(qa_list_path, output_merged_qalist_path, strict_title_match=self.strict_title_match)
+
+            output_merged_md_path = os.path.join(self.output_dir, name, "merged_qa_pairs.md")
+            jsonl_to_md(output_merged_qalist_path, output_merged_md_path)
+
+            qa_pairs = []
+            if os.path.exists(output_merged_qalist_path):
+                with open(output_merged_qalist_path, 'r', encoding='utf-8') as f:
+                    qa_pairs = [json.loads(line) for line in f]
+
+            dataframe.at[idx, output_qa_item_key] = qa_pairs
+
+            dataframe.loc[idx, output_merged_qalist_path_key] = output_merged_qalist_path
+            dataframe.loc[idx, output_merged_md_path_key] = output_merged_md_path
+
+        dataframe = dataframe.explode(output_qa_item_key).reset_index(drop=True)
+
+        # When aggregating the JSONL, image paths in ![alt](path) must be rewritten to name/path
+        def fix_image_paths(row):
+            qa_item = row[output_qa_item_key]
+            name_val = str(row[input_name_key])
+
+            if isinstance(qa_item, dict):
+                keys_to_check = ["question", "answer", "solution"]
+                for key in keys_to_check:
+                    if key in qa_item and isinstance(qa_item[key], str):
+                        qa_item[key] = re.sub(
+                            r'!\[(.*?)\]\((.*?)\)',
+                            lambda m: f"![{m.group(1)}]({os.path.join(name_val, m.group(2))})",
+                            qa_item[key]
+                        )
+            return qa_item
+
+        dataframe[output_qa_item_key] = dataframe.apply(fix_image_paths, axis=1)
+
+        storage.write(dataframe)
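The image-path rewrite inside `fix_image_paths` is the only non-trivial piece; as a minimal sketch, it amounts to prefixing the path of every Markdown image with the per-document directory:

```python
import os
import re

def prefix_image_paths(text, name):
    """Rewrite every ![alt](path) so that path is prefixed with the
    per-document directory `name`, as fix_image_paths does above."""
    return re.sub(
        r'!\[(.*?)\]\((.*?)\)',
        lambda m: f"![{m.group(1)}]({os.path.join(name, m.group(2))})",
        text,
    )
```

Text without image markers passes through unchanged, since `re.sub` simply finds no matches.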
operators/question_answer_clean.py ADDED
@@ -0,0 +1,104 @@
+from dataflow.utils.registry import OPERATOR_REGISTRY
+from dataflow import get_logger
+from dataflow.utils.storage import DataFlowStorage
+from dataflow.core import OperatorABC
+from dataflow.core import LLMServingABC
+
+
+@OPERATOR_REGISTRY.register()
+class LLMTextCleanerOperator(OperatorABC):
+    def __init__(
+        self,
+        llm_serving: LLMServingABC,
+        prompt_template,
+        max_batch_size: int = 32
+    ):
+        if prompt_template is None:
+            raise ValueError("prompt_template cannot be None")
+
+        self.logger = get_logger()
+        self.llm_serving = llm_serving
+        self.prompt_template = prompt_template
+        self.max_batch_size = max_batch_size
+
+    def apply_deletions(self, original_text, deletion_output):
+        """Delete the LLM-specified fragments from the original text."""
+        if not deletion_output or deletion_output.strip() == "NONE":
+            return original_text
+        # Split the fragments on ||
+        fragments = [frag.strip() for frag in deletion_output.split("||") if frag.strip()]
+        # Longest first, so deleting a short fragment cannot clip part of a longer one
+        fragments = sorted(fragments, key=len, reverse=True)
+        cleaned = original_text
+        for frag in fragments:
+            cleaned = cleaned.replace(frag, "", 1)  # delete at most one occurrence
+        return cleaned
+
+    def run(
+        self,
+        storage: DataFlowStorage,
+        output_key: str = "cleaned_dataframe",
+        question_column: str = "question",
+        answer_column: str = "answer",
+        **input_keys
+    ):
+        self.storage: DataFlowStorage = storage
+        self.output_key = output_key
+        self.question_column = question_column
+        self.answer_column = answer_column
+        self.logger.info("Running LLMTextCleanerOperator...")
+
+        dataframe = storage.read('dataframe')
+        self.logger.info(f"Loading dataframe, number of rows: {len(dataframe)}")
+
+        if len(dataframe) == 0:
+            self.logger.warning("No data to process")
+            storage.write(dataframe)
+            return output_key
+
+        question_prompts = []
+        answer_prompts = []
+
+        for _, row in dataframe.iterrows():
+            question = str(row.get(question_column, ""))
+            answer = str(row.get(answer_column, ""))
+
+            question_prompts.append(self.prompt_template.build_question_prompt(question))
+            answer_prompts.append(self.prompt_template.build_answer_prompt(answer))
+
+        self.logger.info(f"Prepared {len(question_prompts)} question prompts and {len(answer_prompts)} answer prompts")
+
+        question_deletion_outputs = self.llm_serving.generate_from_input(question_prompts)
+        self.logger.info("Completed question cleaning prompts processing")
+
+        answer_deletion_outputs = self.llm_serving.generate_from_input(answer_prompts)
+        self.logger.info("Completed answer cleaning prompts processing")
+
+        cleaned_questions = []
+        cleaned_answers = []
+
+        for i in range(len(question_deletion_outputs)):
+            original_question = str(dataframe.iloc[i][question_column])
+            original_answer = str(dataframe.iloc[i][answer_column])
+
+            cleaned_questions.append(self.apply_deletions(original_question, question_deletion_outputs[i]).strip())
+            cleaned_answers.append(self.apply_deletions(original_answer, answer_deletion_outputs[i]).strip())
+
+        result_dataframe = dataframe.copy()
+        result_dataframe[question_column] = cleaned_questions
+        result_dataframe[answer_column] = cleaned_answers
+
+        storage.write(result_dataframe)
+        self.logger.info(f"Cleaning completed, processed {len(result_dataframe)} rows")
+
+        return output_key
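The deletion protocol used by this operator (the LLM returns `||`-separated fragments to remove, or the sentinel `NONE`) can be exercised in isolation; this is a standalone copy of `apply_deletions`:

```python
def apply_deletions(original_text, deletion_output):
    """The LLM's output is either 'NONE' (keep the text) or a list of
    fragments to delete, joined by '||'."""
    if not deletion_output or deletion_output.strip() == "NONE":
        return original_text
    fragments = [frag.strip() for frag in deletion_output.split("||") if frag.strip()]
    # Longest first, so deleting a short fragment cannot clip part of a longer one
    for frag in sorted(fragments, key=len, reverse=True):
        original_text = original_text.replace(frag, "", 1)  # at most one occurrence each
    return original_text
```

The longest-first ordering matters: deleting `"ab"` before `"abc"` would leave an orphaned `"c"` behind.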
operators/question_refiner.py ADDED
@@ -0,0 +1,66 @@
+from dataflow.utils.registry import OPERATOR_REGISTRY
+from dataflow import get_logger
+
+from dataflow.utils.storage import DataFlowStorage
+from dataflow.core import OperatorABC
+from dataflow.core import LLMServingABC
+
+@OPERATOR_REGISTRY.register()
+class AddMissingBlankOperator(OperatorABC):
+    def __init__(
+        self,
+        llm_serving: LLMServingABC,
+        prompt_template,
+    ):
+        self.logger = get_logger()
+        self.llm_serving = llm_serving
+        self.prompt_template = prompt_template
+        if prompt_template is None:
+            raise ValueError("prompt_template cannot be None")
+
+    def run(
+        self,
+        storage: DataFlowStorage,
+        output_key: str = "question",
+        **input_keys
+    ):
+        self.storage: DataFlowStorage = storage
+        self.output_key = output_key
+        self.logger.info("Running AddMissingBlankOperator...")
+        self.input_keys = input_keys
+
+        need_fields = set(input_keys.keys())
+
+        # Load the raw dataframe from the input file
+        dataframe = storage.read('dataframe')
+        self.logger.info(f"Loading, number of rows: {len(dataframe)}")
+
+        # Only process rows where type == "Fill-in"
+        indices = []  # initialized here so the write-back loop below is always defined
+        if 'type' not in dataframe.columns:
+            self.logger.warning("No 'type' column found, skipping LLM generation.")
+        else:
+            mask = dataframe['type'] == "Fill-in"
+            indices = dataframe.index[mask].tolist()
+            if not indices:
+                self.logger.info("No rows with type=='Fill-in' to process.")
+
+        generated_outputs = []
+        if indices:
+            llm_inputs = []
+            for idx in indices:
+                row = dataframe.loc[idx]
+                key_dict = {key: row[input_keys[key]] for key in need_fields}
+                llm_inputs.append(self.prompt_template.build_prompt(need_fields, **key_dict))
+            self.logger.info(f"Prepared {len(llm_inputs)} prompts for LLM generation.")
+            generated_outputs = self.llm_serving.generate_from_input(llm_inputs)
+
+        # Write generated outputs back only to the selected rows; other rows keep their original values
+        for idx, gen_output in zip(indices, generated_outputs):
+            if gen_output != "ORIGINAL":
+                dataframe.at[idx, output_key] = gen_output
+
+        self.storage.write(dataframe)
+        return output_key
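The select-then-write-back pattern in `run` can be shown on a toy DataFrame. The rows and "generated" outputs below are made up; `"ORIGINAL"` is the sentinel the operator itself uses to mean "leave the question unchanged":

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["Fill-in", "Calculation", "Fill-in"],
    "question": ["The capital of France is", "What is 2+2?", "Water boils at"],
})

# Select only fill-in-the-blank rows, mirroring the mask in run()
mask = df["type"] == "Fill-in"
indices = df.index[mask].tolist()

# Hypothetical LLM outputs, one per selected row; "ORIGINAL" means no change
generated_outputs = ["The capital of France is ____.", "ORIGINAL"]

# Write back only rewritten questions; unselected rows are untouched
for idx, gen_output in zip(indices, generated_outputs):
    if gen_output != "ORIGINAL":
        df.at[idx, "question"] = gen_output
```

Only the first row is updated here: the second `Fill-in` row returned the sentinel and the `Calculation` row was never sent to the LLM at all.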
operators/vqa_answer_generator.py ADDED
@@ -0,0 +1,229 @@
+from dataflow.utils.registry import OPERATOR_REGISTRY
+from dataflow import get_logger
+from dataflow.utils.storage import DataFlowStorage
+from dataflow.core import OperatorABC
+from dataflow.core import LLMServingABC
+
+from dataflow.prompts.reasoning.math import MathAnswerGeneratorPrompt
+from dataflow.prompts.reasoning.general import GeneralAnswerGeneratorPrompt
+from dataflow.prompts.reasoning.diy import DiyAnswerGeneratorPrompt
+from dataflow.core.prompt import prompt_restrict, DIYPromptABC
+
+import pandas as pd
+from typing import Union, List, Tuple
+import re
+import os
+
+@prompt_restrict(
+    MathAnswerGeneratorPrompt,
+    GeneralAnswerGeneratorPrompt,
+    DiyAnswerGeneratorPrompt
+)
+@OPERATOR_REGISTRY.register()
+class VQAReasoningAnswerGenerator(OperatorABC):
+    '''
+    Answer Generator is a class that generates answers for given questions.
+    '''
+    def __init__(self,
+                 llm_serving: LLMServingABC,
+                 prompt_template: Union[MathAnswerGeneratorPrompt, GeneralAnswerGeneratorPrompt, DiyAnswerGeneratorPrompt, DIYPromptABC, None] = None,
+                 skip_text_only: bool = False,
+                 input_image_default_basedir="./"
+                 ):
+        self.logger = get_logger()
+
+        if prompt_template is None:
+            # Default to a prompt *instance*; a class default would never be None
+            prompt_template = MathAnswerGeneratorPrompt()
+        self.prompts = prompt_template
+        self.llm_serving = llm_serving
+        self.skip_text_only = skip_text_only
+        self.input_image_default_basedir = input_image_default_basedir
+
+    @staticmethod
+    def get_desc(lang: str = "zh"):
+        if lang == "zh":
+            return (
+                "该算子用于为给定问题生成答案,调用大语言模型进行推理。\n"
+                "输入参数:\n"
+                "- llm_serving:LLM服务实例,用于生成答案\n"
+                "- prompt_template:提示模板对象,用于构建生成提示词\n"
+                "输出参数:\n"
+                "- output_key:生成的答案字段,默认'generated_cot'"
+            )
+        elif lang == "en":
+            return (
+                "This operator generates answers for given questions using LLMs for reasoning. \n"
+                "Input Parameters:\n"
+                "- llm_serving: LLM serving instance for answer generation\n"
+                "- prompt_template: Prompt template object for constructing generation prompts\n"
+                "Output Parameters:\n"
+                "- output_key: Generated answer field, default 'generated_cot'"
+            )
+        else:
+            return "AnswerGenerator produces answers for questions using large language models."
+
+    def _validate_dataframe(self, dataframe: pd.DataFrame):
+        # Only the question column is strictly required; the basedir, caption and
+        # skip columns are optional, and run() falls back to defaults when absent.
+        if self.input_key not in dataframe.columns:
+            raise ValueError(f"Missing required column: {self.input_key}")
+
+    def _prepare_vlm_inputs(self, dataframe) -> Tuple[List[str], List[List[str]], List[List[str]], List[int], List[int]]:
+        """
+        Parses prompts for image markdown, extracts paths and text segments,
+        and structures them into interleaved lists for the VLM server.
+
+        Returns:
+            user_prompts: List[str] (all question prompts)
+            list_of_image_paths: List[List[str]] (absolute image paths per request)
+            list_of_text_segments: List[List[str]] (image labels per request)
+            vqa_ids: List[int] (indices of questions that contain images)
+            unskipped_ids: List[int] (indices of questions that were not skipped;
+                skipped rows keep their existing answers instead of being re-answered)
+        """
+        list_of_image_paths: List[List[str]] = []
+        list_of_text_segments: List[List[str]] = []
+        user_prompts: List[str] = []
+        vqa_ids = []
+
+        # Markdown image pattern: ![label](path)
+        markdown_pattern = re.compile(r"!\[(.*?)\]\((.*?)\)")
+
+        questions = dataframe[self.input_key].tolist()
+
+        unskipped_ids = []
+
+        for index, question in enumerate(questions):
+
+            # 1. Determine the base directory for the images
+            base_dir = self.input_image_default_basedir
+            if self.input_image_basedir_key in dataframe.columns:
+                row_base_dir = dataframe.loc[index, self.input_image_basedir_key]
+                if row_base_dir:
+                    base_dir = row_base_dir
+
+            # 2. Prepare the structures for this request
+            current_paths: List[str] = []
+            current_segments: List[str] = []
+            current_user_prompt: str = ""
+
+            last_end = 0
+
+            # Find all image matches
+            matches = list(markdown_pattern.finditer(question))
+
+            # 3. Handle text-only prompts
+            if not matches:
+                if not self.skip_text_only:
+                    # Text-only prompt: build it directly, with no image segments
+                    if self.input_skip_key is not None and self.input_skip_key in dataframe.columns:
+                        if dataframe.loc[index, self.input_skip_key]:
+                            continue
+                    final_prompt_text = self.prompts.build_prompt(question)
+                    # If a caption column exists, append the caption information
+                    if self.input_caption_key is not None and self.input_caption_key in dataframe.columns:
+                        captions = dataframe.loc[index, self.input_caption_key]
+                        if captions and isinstance(captions, list):
+                            for cap_i, caption in enumerate(captions):
+                                final_prompt_text += f"\n Description of image {cap_i+1}: {caption}"
+                    user_prompts.append(final_prompt_text)
+                    list_of_image_paths.append([])
+                    list_of_text_segments.append([])
+                    unskipped_ids.append(index)
+                continue  # text-only rows never enter the image loop below
+
+            vqa_complete = True
+            # 4. Walk the matches, extracting interleaved text segments and image paths
+            for match in matches:
+                leading_text = question[last_end:match.start()].strip()
+                if leading_text:
+                    current_user_prompt += leading_text
+
+                label = match.group(1).strip()
+                path = match.group(2).strip()
+
+                current_segments.append(label)
+
+                # 4c. Record the absolute path
+                full_path = os.path.join(base_dir, path)
+
+                # Check that the image file exists
+                if not os.path.isfile(full_path):
+                    self.logger.warning(f"Image file not found: {full_path} (from question index {index})")
+                    vqa_complete = False
+                    break
+
+                current_paths.append(full_path)
+
+                last_end = match.end()
+
+            trailing_text = question[last_end:].strip()
+            if trailing_text:
+                current_user_prompt += trailing_text
+
+            # If a caption column exists, append the caption information
+            if self.input_caption_key is not None and self.input_caption_key in dataframe.columns:
+                captions = dataframe.loc[index, self.input_caption_key]
+                if captions and isinstance(captions, list):
+                    for cap_i, caption in enumerate(captions):
+                        current_user_prompt += f"\n Description of image {cap_i+1}: {caption}"
+
+            # 5. Store the results for this request
+            if vqa_complete:
+                vqa_ids.append(index)
+                if self.input_skip_key is not None and self.input_skip_key in dataframe.columns:
+                    if dataframe.loc[index, self.input_skip_key]:
+                        continue
+                list_of_image_paths.append(current_paths)
+                list_of_text_segments.append(current_segments)
+                user_prompts.append(self.prompts.build_prompt(current_user_prompt))
+                unskipped_ids.append(index)
+
+        return user_prompts, list_of_image_paths, list_of_text_segments, vqa_ids, unskipped_ids
+
+    def run(
+        self,
+        storage,
+        input_key: str = "instruction",
+        output_key: str = "generated_cot",
+        input_caption_key: str | None = None,
+        input_skip_key: str | None = None,
+        input_image_basedir_key="image_basedir",
+    ):
+        '''
+        Runs the answer generation process, reading from the input file and saving results to output.
+        '''
+        self.input_key, self.output_key = input_key, output_key
+        self.input_caption_key = input_caption_key
+        self.input_skip_key = input_skip_key
+        self.input_image_basedir_key = input_image_basedir_key
+        dataframe = storage.read("dataframe")
+        self._validate_dataframe(dataframe)
+
+        # 1. Prepare the VLM inputs: parse the markdown and collect paths and text segments
+        user_prompts, list_of_image_paths, list_of_image_labels, vqa_ids, unskipped_ids = self._prepare_vlm_inputs(dataframe)
+
+        # 2. System prompt (a fixed default; self.prompts does not carry one)
+        system_prompt = "You are an intelligent chatbot good at college subjects."
+
+        answers = self.llm_serving.generate_from_input_multi_images(
+            list_of_image_paths=list_of_image_paths,
+            list_of_image_labels=list_of_image_labels,
+            system_prompt=system_prompt,
+            user_prompts=user_prompts
+        )
+
+        if self.skip_text_only:
+            # Keep only the rows listed in vqa_ids; note the index is deliberately not reset
+            dataframe = dataframe.loc[vqa_ids].copy()
+
+        for idx, ans in zip(unskipped_ids, answers):
+            dataframe.at[idx, self.output_key] = ans
+
+        output_file = storage.write(dataframe)
+        self.logger.info(f"Results saved to {output_file}")
+
+        return [output_key]
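The core of `_prepare_vlm_inputs` is the split of a question into interleaved text and `![label](path)` image markers. A minimal sketch of just that parsing step, with the existence checks and prompt building left out: note that, like the operator, the stripped text segments are concatenated without a separator.

```python
import re

MD_IMAGE = re.compile(r"!\[(.*?)\]\((.*?)\)")

def split_question(question):
    """Split a question containing ![label](path) markers into
    (text, labels, paths)."""
    labels, paths, text_parts = [], [], []
    last_end = 0
    for m in MD_IMAGE.finditer(question):
        text_parts.append(question[last_end:m.start()].strip())  # text before the image
        labels.append(m.group(1).strip())
        paths.append(m.group(2).strip())
        last_end = m.end()
    text_parts.append(question[last_end:].strip())               # trailing text
    text = "".join(part for part in text_parts if part)
    return text, labels, paths
```

Questions with no image markers come back unchanged, with empty label and path lists, which is exactly the text-only branch of the operator.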
pipelines/curate_data.py ADDED
@@ -0,0 +1,271 @@
+import os
+import sys
+import json
+import json5
+import pandas as pd
+sys.path.append(os.path.dirname(os.path.dirname(__file__)))
+from dataflow.operators.core_text import PandasOperator, FormatStrPromptedGenerator
+from operators.bench_evaluate import BenchDatasetEvaluatorQuestion
+from operators.answer_extractor import AnswerExtractionOperator
+from operators.question_refiner import AddMissingBlankOperator
+from operators.question_answer_clean import LLMTextCleanerOperator
+
+from dataflow.pipeline import PipelineABC
+from dataflow.serving import APILLMServing_request
+from dataflow.utils.storage import FileStorage
+from dataflow.operators.reasoning import (
+    ReasoningAnswerGenerator,
+    ReasoningAnswerGroundTruthFilter
+)
+from dataflow.prompts.reasoning.general import GeneralAnswerGeneratorPrompt
+from prompts.curate_data import TypeClassifyPrompt, SubQuestionSplitingPrompt, QAFilterPrompt
+from prompts.question_refine import AddMissingBlankPrompt
+from prompts.question_answer_clean import TextCleaningPrompt
+from dataflow.operators.core_text import GeneralFilter
+import argparse
+import re
+import shutil
+
+
+class DataCurationPipeline(PipelineABC):
+    def __init__(self, input_file, api_url, model_name, max_workers=100):
+        super().__init__()
+        self.storage = FileStorage(
+            first_entry_file_name=input_file,
+            cache_path="./cache",
+            file_name_prefix="curate_data",
+            cache_type="jsonl",
+        )
+
+        self.llm_serving = APILLMServing_request(
+            api_url=f"{api_url}/chat/completions",
+            model_name=model_name,
+            max_workers=max_workers,
+        )
+
+        self.sub_qa_justify = FormatStrPromptedGenerator(
+            llm_serving=self.llm_serving,
+            prompt_template=SubQuestionSplitingPrompt()
+        )
+        self.sub_qa_spliter = PandasOperator(
+            [split_generated_content]
+        )
+
+        # Extract concise answers from solutions
+        self.answer_extractor = AnswerExtractionOperator(
+            llm_serving=self.llm_serving,
+            overwrite=False
+        )
+
+        # Classify question types
+        self.type_filter = FormatStrPromptedGenerator(
+            llm_serving=self.llm_serving,
+            prompt_template=TypeClassifyPrompt()
+        )
+        self.type_filter_processor = PandasOperator(
+            [extract_type_and_reason]
+        )
+        self.type_filter_executor = GeneralFilter(
+            filter_rules=[lambda df: df['type'].isin(["Calculation", "Fill-in", "Multiple-choice"])]
+        )
+
+        self.add_missing_blank = AddMissingBlankOperator(
+            llm_serving=self.llm_serving,
+            prompt_template=AddMissingBlankPrompt()
+        )
+
+        # Filter items with unverifiable or poorly paired QA
+        self.qa_filter = FormatStrPromptedGenerator(
+            llm_serving=self.llm_serving,
+            prompt_template=QAFilterPrompt()
+        )
+        self.qa_filter_processor = PandasOperator(
+            [extract_filter_result_and_reason]
+        )
+        self.qa_filter_executor = GeneralFilter(
+            filter_rules=[lambda df: df['filter_result'] == 'true']
+        )
+
+        # Non-content cleaning of question and answer text
+        self.text_cleaner = LLMTextCleanerOperator(
+            llm_serving=self.llm_serving,
+            prompt_template=TextCleaningPrompt()
+        )
+
+    def forward(self):
+        self.sub_qa_justify.run(
+            storage=self.storage.step(),
+            output_key="split_qa",
+            input_question="question",
+            input_answer="answer",
+            input_solution="solution",
+        )
+
+        self.sub_qa_spliter.run(
+            storage=self.storage.step(),
+        )
+
+        self.type_filter.run(
+            storage=self.storage.step(),
+            input_question="question",
+            input_answer="answer",
+            output_key="question_type"
+        )
+        self.type_filter_processor.run(
+            storage=self.storage.step(),
+        )
+        self.type_filter_executor.run(
+            storage=self.storage.step(),
+        )
+
+        self.answer_extractor.run(
+            storage=self.storage.step(),
+            input_question_key="question",
+            input_solution_key="solution",
+            output_key="answer"
+        )
+
+        self.add_missing_blank.run(
+            storage=self.storage.step(),
+            input_question="question",
+            input_answer="answer",
+            output_key="question",
+        )
+
+        self.text_cleaner.run(
+            storage=self.storage.step(),
+            question_column="question",
+            answer_column="answer",
+            output_key="cleaned_dataframe"
+        )
+
+        self.qa_filter.run(
+            storage=self.storage.step(),
+            input_question="question",
+            input_answer="answer",
+            output_key="qa_judgement"
+        )
+        self.qa_filter_processor.run(
+            storage=self.storage.step(),
+        )
+        self.qa_filter_executor.run(
+            storage=self.storage.step(),
+        )
+
+
+def split_generated_content(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Split the JSON array in the 'split_qa' column into multiple rows.
+    All other columns are preserved, and each sub_question/sub_answer is expanded.
+    Rows missing a sub_question, or missing both sub_answer and sub_solution, are dropped.
+    """
+    rows = []
+    for _, row in df.iterrows():
+        content = row.get("split_qa", None)
+
+        if pd.isna(content) or not str(content).strip():
+            continue
+
+        try:
+            # Parse the JSON array
+            items = json5.loads(content)
+            if not isinstance(items, list):
+                items = [items]
+        except Exception:
+            print(f"⚠️ JSON parse error in row: {content[:80]}...")
+            continue
+
+        for item in items:
+            sub_question = item.get("sub_question", "").strip()
+            sub_answer = item.get("sub_answer", "").strip()
+            sub_solution = item.get("sub_solution", "").strip()
+
+            # Keep only rows that have a sub_question and either a sub_answer or a sub_solution
+            if not sub_question or not (sub_answer or sub_solution):
+                continue
+
+            new_row = row.to_dict()
+            new_row["question"] = sub_question if sub_question != "ORIGINAL" else row["question"]
+            new_row["answer"] = sub_answer if sub_answer != "ORIGINAL" else row["answer"]
+            new_row["solution"] = sub_solution if sub_solution != "ORIGINAL" else row.get("solution", "")
+            rows.append(new_row)
+
+    if not rows:
+        return pd.DataFrame(columns=list(df.columns))
+
+    return pd.DataFrame(rows, columns=list(df.columns))
+
+
+def extract_type_and_reason(df: pd.DataFrame) -> pd.DataFrame:
+    df["type"] = None
+    df["type_reason"] = None
+
+    for idx, row in df.iterrows():
+        val = row.get("question_type", "")
+        if pd.isna(val) or not str(val).strip():
+            continue
+        try:
+            # Try to parse as JSON
+            j = json.loads(val)
+            df.at[idx, "type"] = j.get("type", None)
+            df.at[idx, "type_reason"] = j.get("reason", None)
+        except json.JSONDecodeError:
+            # Not JSON; fall back to splitting on ":"
+            if ":" in val:
+                parts = val.split(":", 1)
+                df.at[idx, "type"] = parts[0].strip()
+                df.at[idx, "type_reason"] = parts[1].strip()
+            else:
+                df.at[idx, "type"] = val.strip()
+                df.at[idx, "type_reason"] = ""
+
+    return df
+
+
+def extract_filter_result_and_reason(df: pd.DataFrame) -> pd.DataFrame:
+    df["filter_result"] = None
+    df["filter_reason"] = None
+
+    for idx, row in df.iterrows():
+        val = row.get("qa_judgement", "")
+        if pd.isna(val) or not str(val).strip():
+            continue
+        try:
+            # Try to parse as JSON
+            j = json.loads(val)
+            judgement = j.get("judgement", "")
+            if isinstance(judgement, bool):
+                judgement = "true" if judgement else "false"
+            df.at[idx, "filter_result"] = judgement.lower()
+            df.at[idx, "filter_reason"] = j.get("reason", None)
+        except json.JSONDecodeError:
+            df.at[idx, "filter_result"] = ""
+            df.at[idx, "filter_reason"] = ""
+
+    return df
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Data Curation Pipeline")
+    parser.add_argument("--input_file", type=str, required=True, help="Path to the input JSONL file (raw_vqa.jsonl)")
+    parser.add_argument("--api_url", type=str, default="https://api.openai.com/v1", help="Base URL of the OpenAI-compatible API (e.g. https://api.openai.com/v1)")
+    parser.add_argument("--model", type=str, default="gpt-5-mini", help="LLM model name to use for curation")
+    parser.add_argument("--max_workers", type=int, default=100, help="Number of parallel API workers")
+    args = parser.parse_args()
+
+    model = DataCurationPipeline(args.input_file, api_url=args.api_url, model_name=args.model, max_workers=args.max_workers)
+    model.compile()
+    model.forward()
+
+    # Find the latest curate_data cache step file
+    cache_files = os.listdir("./cache")
+    step_files = [f for f in cache_files if re.match(r"curate_data_step\d+\.jsonl", f)]
+    step_numbers = [int(re.findall(r"curate_data_step(\d+)\.jsonl", f)[0]) for f in step_files]
+    if not step_numbers:
+        sys.exit("No curate_data_step*.jsonl cache files found in ./cache")
+    max_step = max(step_numbers)
+    max_step_file = f"./cache/curate_data_step{max_step}.jsonl"
+
+    # Copy the final step file to the output directory as curated_vqa.jsonl.
+    # Output is placed alongside input_file so relative image paths remain valid.
+    output_dir = os.path.dirname(args.input_file)
+    output_file = os.path.join(output_dir, "curated_vqa.jsonl")
+    shutil.copy(max_step_file, output_file)
+    print(f"Curated data saved to: {output_file}")
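The cache-file selection at the end of the pipeline can be expressed as a small pure function; this sketch adds an explicit guard for the empty case, which the inline version would otherwise hit as a `max()` on an empty sequence:

```python
import re

def latest_step_file(filenames, prefix="curate_data"):
    """Pick the highest-numbered '<prefix>_step<N>.jsonl' file from a listing."""
    pattern = re.compile(rf"{re.escape(prefix)}_step(\d+)\.jsonl")
    steps = []
    for f in filenames:
        m = pattern.fullmatch(f)
        if m:
            steps.append((int(m.group(1)), f))
    if not steps:
        raise FileNotFoundError(f"no {prefix}_step*.jsonl cache files found")
    return max(steps)[1]  # tuples compare by step number first
```

Comparing `(step, name)` tuples means the numeric step number, not the string name, decides which file is newest, so `step12` correctly beats `step2`.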
pipelines/generate_cot.py ADDED
@@ -0,0 +1,166 @@
+ import os
+ import sys
+ import pandas as pd
+ sys.path.append(os.path.dirname(os.path.dirname(__file__)))
+ from dataflow.operators.core_text import PandasOperator
+ from operators.bench_evaluate import BenchDatasetEvaluatorQuestion
+ from operators.vqa_answer_generator import VQAReasoningAnswerGenerator
+
+ from dataflow.serving import APILLMServing_request, APIVLMServing_openai, LocalVLMServing_vllm
+ from dataflow.utils.storage import FileStorage
+ from dataflow.operators.reasoning import (
+     ReasoningAnswerGenerator,
+     ReasoningAnswerGroundTruthFilter
+ )
+ from dataflow.prompts.reasoning.math import MathAnswerGeneratorPrompt
+ from dataflow.operators.core_text import GeneralFilter
+ from dataflow import get_logger
+ from dataflow.pipeline import PipelineABC
+
+ from typing import Iterable
+ import re
+ import argparse
+ import shutil
+
+
+ def make_remove_think_fn(input_key, output_key):
+     pattern = re.compile(r'<think>.*?</think>', flags=re.DOTALL | re.IGNORECASE)
+     def fn(df):
+         df = df.copy()
+         if input_key in df.columns:
+             def clean_text(t):
+                 if pd.isna(t):
+                     return t
+                 if "</think>" not in t:
+                     return t.strip()
+                 s = "<think>" + str(t)
+                 return pattern.sub("", s).strip()
+
+             df[output_key] = df[input_key].apply(clean_text)
+
+         return df
+
+     return fn
+
+ class RejectSamplingPipeline(PipelineABC):
+     def __init__(self, first_entry_file_name, answer_api_url, judge_api_url, answer_model, judge_model,
+                  answer_api_key_env="DF_API_KEY", judge_api_key_env="DF_API_KEY",
+                  max_retries=5, max_workers=100):
+         super().__init__()
+         self.storage = FileStorage(
+             first_entry_file_name=first_entry_file_name,
+             cache_path="./cot_cache",
+             file_name_prefix="reject_sampling",
+             cache_type="jsonl",
+         )
+
+         self.max_retries = max_retries
+         self.logger = get_logger()
+
+         self.llm_answer_serving = APIVLMServing_openai(
+             api_url=answer_api_url,
+             model_name=answer_model,
+             key_name_of_api_key=answer_api_key_env,
+             max_workers=max_workers,
+             timeout=600.0,
+             max_tokens=8192,
+             temperature=0.7,
+         )
+
+         self.llm_serving = APILLMServing_request(
+             api_url=f"{judge_api_url}/chat/completions",
+             model_name=judge_model,
+             key_name_of_api_key=judge_api_key_env,
+             max_workers=max_workers,
+             read_timeout=300.0
+         )
+
+         # Difficulty filter (keep items where accuracy <= 1.0)
+         self.difficulty_filter = GeneralFilter(
+             filter_rules=[lambda df: df['accuracy'] <= 1.0]
+         )
+
+         # LLM answer generation
+         self.answer_generator = VQAReasoningAnswerGenerator(
+             llm_serving=self.llm_answer_serving,
+             prompt_template=MathAnswerGeneratorPrompt(),
+             skip_text_only=False,
+         )
+
+         self.think_cleaner = PandasOperator(process_fn=[make_remove_think_fn(input_key="generated_cot", output_key="llm_short_answer")])
+
+         self.noop = PandasOperator(process_fn=[lambda df: df])
+
+         # LLM verification
+         self.answer_groundtruth_filter = BenchDatasetEvaluatorQuestion(
+             compare_method="semantic",
+             llm_serving=self.llm_serving,
+             prompt_template=None,  # using default prompt
+             eval_result_path="./cot_cache/eval_results.jsonl",
+             support_subquestions=True,
+             skip_true=True
+         )
+
+     def forward(self):
+         self.noop.run(storage=self.storage.step(), output_key="answer_match_result")  # no-op for pipeline compilation; does nothing to the data
+         for i in range(self.max_retries):
+
+             input_skip_key = "answer_match_result" if i > 0 else None
+
+             # Generate answers (skip items already answered correctly)
+             self.answer_generator.run(
+                 storage=self.storage.step(),
+                 input_key="question",
+                 output_key="generated_cot",
+                 input_skip_key=input_skip_key,
+                 input_image_basedir_key="image_basedir",
+             )
+
+             self.think_cleaner.run(storage=self.storage.step(), output_key="llm_short_answer")
+
+             self.answer_groundtruth_filter.run(
+                 storage=self.storage.step(),
+                 input_test_answer_key="llm_short_answer",
+                 input_gt_answer_key="answer",
+                 input_question_key="question",
+             )
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="CoT Generation Pipeline with Reject Sampling")
+     parser.add_argument("--input_file", type=str, required=True, help="Path to the input JSONL file (curated_vqa.jsonl)")
+     parser.add_argument("--max_retries", type=int, default=5, help="Maximum number of reject sampling rounds")
+     parser.add_argument("--answer_api_url", type=str, default="https://api.xxx.com/v1", help="URL where you serve your Qwen model (e.g. via vLLM)")
+     parser.add_argument("--judge_api_url", type=str, default="https://api.openai.com/v1", help="Base URL of the OpenAI-compatible API for answer verification (e.g. https://api.openai.com/v1)")
+     parser.add_argument("--answer_model", type=str, default="qwen3-vl-235b-thinking", help="Model to use for answer generation")
+     parser.add_argument("--judge_model", type=str, default="gpt-5-mini", help="Model to use for answer verification")
+     parser.add_argument("--answer_api_key_env", type=str, default="DF_API_KEY", help="Environment variable name holding the API key for the answer model")
+     parser.add_argument("--judge_api_key_env", type=str, default="DF_API_KEY", help="Environment variable name holding the API key for the judge model")
+     parser.add_argument("--max_workers", type=int, default=100, help="Number of parallel API workers")
+     args = parser.parse_args()
+
+     model = RejectSamplingPipeline(
+         args.input_file,
+         answer_api_url=args.answer_api_url,
+         judge_api_url=args.judge_api_url,
+         answer_model=args.answer_model,
+         judge_model=args.judge_model,
+         answer_api_key_env=args.answer_api_key_env,
+         judge_api_key_env=args.judge_api_key_env,
+         max_retries=args.max_retries,
+         max_workers=args.max_workers,
+     )
+     model.compile()
+     model.forward()
+
+     # Find the latest reject_sampling cache step file
+     cache_files = os.listdir("./cot_cache")
+     step_files = [f for f in cache_files if re.match(r"reject_sampling_step\d+\.jsonl", f)]
+     step_numbers = [int(re.findall(r"reject_sampling_step(\d+)\.jsonl", f)[0]) for f in step_files]
+     max_step = max(step_numbers)
+     max_step_file = f"./cot_cache/reject_sampling_step{max_step}.jsonl"
+
+     # Copy output alongside input_file so relative image paths remain valid
+     output_dir = os.path.dirname(args.input_file)
+     output_file = os.path.join(output_dir, "curated_vqa_with_cot.jsonl")
+     shutil.copy(max_step_file, output_file)
+     print(f"Curated data with CoT saved to: {output_file}")
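The `<think>`-stripping step above can be exercised in isolation. A minimal sketch mirroring `clean_text` inside `make_remove_think_fn` (the function name `strip_think` is ours); note the quirk it handles: generated text often arrives with the opening `<think>` tag already consumed, so the tag is re-prepended before the regex removes the whole reasoning span:

```python
import re
import pandas as pd

pattern = re.compile(r'<think>.*?</think>', flags=re.DOTALL | re.IGNORECASE)

def strip_think(t):
    # Mirrors clean_text from the diff: if there is no closing tag, the text
    # is returned as-is; otherwise <think> is re-prepended and the span removed.
    if pd.isna(t):
        return t
    if "</think>" not in t:
        return t.strip()
    s = "<think>" + str(t)
    return pattern.sub("", s).strip()

df = pd.DataFrame({"generated_cot": ["step 1... step 2...</think>The answer is 7.", "no tags here"]})
df["llm_short_answer"] = df["generated_cot"].apply(strip_think)
print(df["llm_short_answer"].tolist())  # ['The answer is 7.', 'no tags here']
```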
pipelines/vqa_extract_optimized_pipeline.py ADDED
@@ -0,0 +1,134 @@
+ from dataflow.operators.knowledge_cleaning import FileOrURLToMarkdownConverterAPI
+
+ from dataflow.serving import APILLMServing_request
+ from dataflow.utils.storage import FileStorage
+ from operators.pdf2vqa import MinerU2LLMInputOperator, LLMOutputParser, QA_Merger, PDF_Merger
+ from dataflow.operators.core_text import ChunkedPromptedGenerator
+
+ from dataflow.pipeline import PipelineABC
+ from prompts.pdf2vqa import QAExtractPrompt
+
+ from pypdf import PdfWriter
+
+ import os
+ import json
+ import re
+ import argparse
+
+ class PDF_VQA_extract_optimized_pipeline(PipelineABC):
+     def __init__(self, input_file, api_url, model_name, max_workers=100):
+         super().__init__()
+         self.storage = FileStorage(
+             first_entry_file_name=input_file,
+             cache_path="./cache",
+             file_name_prefix="vqa",
+             cache_type="jsonl",
+         )
+
+         self.llm_serving = APILLMServing_request(
+             api_url=f"{api_url}/chat/completions",
+             key_name_of_api_key="DF_API_KEY",
+             model_name=model_name,
+             max_workers=max_workers,
+         )
+
+         self.vqa_extract_prompt = QAExtractPrompt()
+
+         self.pdf_merger = PDF_Merger(output_dir="./cache")
+
+         self.mineru_executor = FileOrURLToMarkdownConverterAPI(intermediate_dir="intermediate")
+
+         self.input_formatter = MinerU2LLMInputOperator()
+         self.vqa_extractor = ChunkedPromptedGenerator(
+             llm_serving=self.llm_serving,
+             system_prompt=self.vqa_extract_prompt.build_prompt(),
+             max_chunk_len=128000,
+         )
+         self.llm_output_parser = LLMOutputParser(output_dir="./cache", intermediate_dir="intermediate")
+         self.qa_merger = QA_Merger(output_dir="./cache", strict_title_match=False)
+
+
+     def forward(self):
+         self.pdf_merger.run(
+             storage=self.storage.step(),
+             input_pdf_list_key="input_pdf_paths",
+             input_name_key="name",
+             output_pdf_path_key="merged_pdf_path",
+         )
+         self.mineru_executor.run(
+             storage=self.storage.step(),
+             input_key="merged_pdf_path",
+             output_key="vqa_markdown_path",
+         )
+         self.input_formatter.run(
+             storage=self.storage.step(),
+             input_markdown_path_key="vqa_markdown_path",
+             output_converted_layout_key="converted_vqa_layout_path",
+         )
+         self.vqa_extractor.run(
+             storage=self.storage.step(),
+             input_path_key="converted_vqa_layout_path",
+             output_path_key="extracted_llm_vqa_path",
+         )
+         self.llm_output_parser.run(
+             storage=self.storage.step(),
+             input_response_path_key="extracted_llm_vqa_path",
+             input_converted_layout_path_key="converted_vqa_layout_path",
+             input_name_key="name",
+             output_qalist_path_key="extracted_vqa_path",
+         )
+         self.qa_merger.run(
+             storage=self.storage.step(),
+             input_qalist_path_key="extracted_vqa_path",
+             input_name_key="name",
+             output_merged_qalist_path_key="output_merged_vqalist_path",
+             output_merged_md_path_key="output_merged_md_path",
+             output_qa_item_key="vqa_pair",
+         )
+
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Run PDF VQA Extract Optimized Pipeline")
+     parser.add_argument("--input_file", type=str, default="./examples/VQA/vqa_extract_test.jsonl", help="Path to the input JSONL file")
+     parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save the output files")
+     parser.add_argument("--api_url", type=str, default="https://generativelanguage.googleapis.com/v1beta/openai/", help="Base URL of the OpenAI-compatible API (e.g. https://api.openai.com/v1)")
+     parser.add_argument("--model", type=str, default="gemini-2.5-pro", help="LLM model name to use for VQA extraction; please use a powerful reasoning model")
+     parser.add_argument("--max_workers", type=int, default=100, help="Number of parallel API workers")
+     args = parser.parse_args()
+
+     pipeline = PDF_VQA_extract_optimized_pipeline(
+         input_file=args.input_file,
+         api_url=args.api_url,
+         model_name=args.model,
+         max_workers=args.max_workers,
+     )
+     pipeline.compile()
+     pipeline.forward(resume_step=5)
+
+     output_dir = args.output_dir
+     os.makedirs(output_dir, exist_ok=True)
+
+     # Find the latest cache step file
+     cache_files = os.listdir("./cache")
+     step_files = [f for f in cache_files if re.match(r"vqa_step\d+\.jsonl", f)]
+     step_numbers = [int(re.findall(r"vqa_step(\d+)\.jsonl", f)[0]) for f in step_files]
+     max_step = max(step_numbers)
+     max_step_file = f"./cache/vqa_step{max_step}.jsonl"
+
+     # Extract QA items and save to output_dir/raw_vqa.jsonl
+     output_qa_item_key = "vqa_pair"
+     with open(max_step_file, "r") as f_in, open(os.path.join(output_dir, "raw_vqa.jsonl"), "w") as f_out:
+         for line in f_in:
+             data = json.loads(line)
+             qa_item = data[output_qa_item_key]
+             name = data["name"]
+             output_data = {"name": name, **qa_item, "image_basedir": os.path.abspath(output_dir)}
+             if not output_data["solution"]:
+                 output_data["solution"] = output_data["answer"]
+             f_out.write(json.dumps(output_data, ensure_ascii=False) + "\n")
+
+             # Copy per-task image directory to output_dir
+             src_dir = os.path.join("cache", name)
+             if os.path.exists(src_dir):
+                 os.system(f"cp -r {src_dir} {output_dir}")
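The final step shells out with `os.system("cp -r …")`, which is shell-dependent and breaks on paths containing spaces or quotes. A portable sketch of the same per-task copy using the standard library (the helper name `copy_task_images` is ours; `dirs_exist_ok` requires Python 3.8+):

```python
import os
import shutil
import tempfile

def copy_task_images(src_dir, output_dir):
    """Portable stand-in for `cp -r src_dir output_dir`: copies src_dir into
    output_dir as a subdirectory, merging into it if it already exists."""
    if not os.path.isdir(src_dir):
        return None
    dest = os.path.join(output_dir, os.path.basename(src_dir))
    shutil.copytree(src_dir, dest, dirs_exist_ok=True)  # Python 3.8+
    return dest

# A path with spaces, which would need careful quoting with os.system:
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "task name with spaces")
    os.makedirs(src)
    open(os.path.join(src, "fig1.jpg"), "w").close()
    out = os.path.join(root, "output")
    os.makedirs(out)
    print(os.listdir(copy_task_images(src, out)))  # ['fig1.jpg']
```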
prompts/bench_evaluate.py ADDED
@@ -0,0 +1,82 @@
+
+ from dataflow.utils.registry import PROMPT_REGISTRY
+ from dataflow.core.prompt import PromptABC
+ '''
+ A collection of prompts for model evaluation.
+ '''
+
+ @PROMPT_REGISTRY.register()
+ class AnswerJudgePromptQuestion(PromptABC):
+     """
+     Prompt template for judging the correctness of an answer.
+     """
+     def __init__(self):
+         pass
+
+     def build_prompt(self, question, answer, reference_answer):
+         prompt = f"""
+ As an answer evaluation expert, please assess whether the following answer is correct.
+
+ Question: {question}
+
+ Reference Answer: {reference_answer}
+
+ Current Answer: {answer}
+
+ Please carefully analyze whether the current answer is semantically consistent with the reference answer.
+ Focus only on comparing the answers themselves, not on how the problem is solved.
+ Don't just look at the surface text; understand the essential content of the answers.
+ If the current answer is semantically consistent with the reference answer, even if expressed differently, it should be judged as correct.
+ For numerical calculation problems, also consider whether the answer is within the acceptable error range (typically 5%). Be careful to differentiate whether the question is indeed a numerical calculation or one that requires a strictly identical answer.
+
+ Please return your judgment result in JSON format:
+ {{"judgement_result": true}} indicates the answer is correct
+ {{"judgement_result": false}} indicates the answer is incorrect
+
+ Your judgment:
+ """
+         return prompt
+
+ @PROMPT_REGISTRY.register()
+ class AnswerJudgeMultipleQuestionsPrompt(PromptABC):
+     """
+     Prompt template for answer judging that supports multiple sub-questions.
+     """
+     def __init__(self):
+         pass
+
+     def build_prompt(self, answer, reference_answer, question=None):
+         prompt = f"""
+ As an answer evaluation expert, please assess whether the following answer is correct.
+
+ Question: {question}
+
+ Reference Answer: {reference_answer}
+
+ Current Answer: {answer}
+
+ Please carefully analyze whether the current answer is semantically consistent with the reference answer.
+ Focus only on comparing the answers themselves, not on how the problem is solved.
+ Don't just look at the surface text; understand the essential content of the answers.
+ If the current answer is semantically consistent with the reference answer, even if expressed differently, it should be judged as correct.
+ For numerical calculation problems, also consider whether the answer is within the acceptable error range (typically 5%). Be careful to differentiate whether the question is indeed a numerical calculation or one that requires a strictly identical answer.
+
+ The question may contain multiple sub-questions (e.g., ①②③ or (a)(b), etc.).
+ You should first identify the sub-questions in the question, then evaluate the correctness of each corresponding part in the current answer.
+ You need to provide your reason for each sub-question's judgment.
+
+ Your judgement should be a JSON array, where each element is "true" or "false" (use strings instead of booleans), indicating whether the answer to each sub-question is correct.
+ If there is only one question, also return a single-element array.
+
+ If the reference answer is incomplete so that you are not able to judge some sub-questions, mark the corresponding sub-questions as "empty".
+
+ Example:
+ Question: ① 1+2=? ② What is 2+2? ③ What is 3+3?
+ Reference Answer: ① 3 ③ 6
+ Current Answer: ① Three ② Four ③ Seven
+ Output: {{"reason": "The answer to sub-question 1 is correct as 'Three' is semantically consistent with '3'. The reference answer does not provide information for sub-question 2, so it is marked as 'empty'. The answer to sub-question 3 is incorrect as 'Seven' is not semantically consistent with '6'.", "judgement": ["true", "empty", "false"]}}
+
+
+ Your judgment:
+ """
+         return prompt
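The multi-sub-question judge is instructed to reply with `{"reason": ..., "judgement": ["true", "empty", "false"]}`, but real model replies often wrap the object in prose or a ```json fence. A tolerant parser sketch (the function name `parse_judgement` is ours; the repo's actual parsing lives in `BenchDatasetEvaluatorQuestion`, which we have not seen):

```python
import json
import re

def parse_judgement(raw):
    """Parse a judge reply into a list of 'true'/'false'/'empty' strings,
    or None if no valid judgement object can be found."""
    # Grab the first {...} span so fences and surrounding prose are ignored.
    m = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not m:
        return None
    try:
        obj = json.loads(m.group(0))
    except json.JSONDecodeError:
        return None
    j = obj.get("judgement")
    if isinstance(j, list) and all(v in ("true", "false", "empty") for v in j):
        return j
    return None

reply = ('Here is my verdict:\n```json\n'
         '{"reason": "sub-question 2 missing from reference", "judgement": ["true", "empty", "false"]}\n```')
print(parse_judgement(reply))  # ['true', 'empty', 'false']
```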
prompts/curate_data.py ADDED
@@ -0,0 +1,232 @@
+ import json
+ from dataflow.utils.registry import PROMPT_REGISTRY
+ from dataflow.core.prompt import PromptABC, DIYPromptABC
+ from typing import Set
+ import string
+
+ @PROMPT_REGISTRY.register()
+ class SubQuestionSplitingPrompt(DIYPromptABC):
+     def __init__(self, f_str_template: str = "{input_text}", on_missing: str = "raise"):
+         self.on_missing = on_missing
+         self.f_str_template = """
+ You are an educational question structure analysis assistant. Below is a composite question and its corresponding answer. Please split it into several independent sub-questions.
+ The requirements are as follows:
+
+ 1. The question may contain multiple sub-questions (e.g., ①②③ or (a)(b), etc.); please accurately identify and split them one by one. Only split sub-questions with clear labels.
+ Do not split implicit sub-questions (such as "What is the value of x and y?" or multiple question marks).
+ 2. Each sub-question must be self-contained and answerable. If the original question contains contextual information, include it in each sub-question to preserve full meaning.
+ If sub-questions are related (e.g., "① Find x. ② Using the value of x, find y."), do not split them; keep them as one sub-question.
+ 3. If an answer and/or solution is provided, try to match each sub-question with its corresponding part of the answer and/or solution based on semantics.
+ 4. If the original answer and/or solution contains LaTeX formulas, preserve them exactly as they appear.
+ 5. If the original answer and/or solution is missing or cannot be clearly aligned, leave `"sub_answer"` and/or `"sub_solution"` as an empty string.
+ 6. The output must be a valid JSON array, where each element contains:
+
+ * `"sub_id"`: the index of the sub-question (an integer starting from 1)
+ * `"sub_question"`: the complete text of the sub-question (or "ORIGINAL" if no splitting is needed)
+ * `"sub_answer"`: the corresponding answer, empty string if unavailable (or "ORIGINAL" if no splitting is needed)
+ * `"sub_solution"`: the corresponding solution, empty string if unavailable (or "ORIGINAL" if no splitting is needed)
+
+ [Important Notice]
+ 1. In some questions, answers or solutions, there will be figures written as `![image](image_url)`. When splitting, please keep these figure references in the corresponding sub-questions, sub-answers, or sub-solutions EXACTLY as they are.
+ 2. If the question does not need to be split, return an array with a single element, simplified as: [{"sub_id": 1, "sub_question": "ORIGINAL", "sub_answer": "ORIGINAL", "sub_solution": "ORIGINAL"}]
+ In this case, you only need to output "ORIGINAL" instead of the full text for sub_question, sub_answer, and sub_solution, so that we can save tokens.
+
+ ## Example Input:
+
+ **Question:**
+ A class has 40 students, including 25 boys and 15 girls. ![image](question_images/a284h5iuh38.jpg) ① Find the percentage of boys in the class. ② Find the percentage of girls in the class.
+
+ **Answer:**
+ ① 62.5%. ② 37.5%.
+
+ **Solution:**
+ Percentage of boys = (25/40) * 100 = 62.5%, percentage of girls = (15/40) * 100 = 37.5%.
+ ----------------------------------------------
+
+ ## Example Output:
+
+ ```json
+ [
+     {
+         "sub_id": 1,
+         "sub_question": "A class has 40 students, including 25 boys and 15 girls. ![image](question_images/a284h5iuh38.jpg) Find the percentage of boys in the class.",
+         "sub_answer": "62.5%.",
+         "sub_solution": "Percentage of boys = (25/40) * 100 = 62.5%."
+     },
+     {
+         "sub_id": 2,
+         "sub_question": "A class has 40 students, including 25 boys and 15 girls. ![image](question_images/a284h5iuh38.jpg) Find the percentage of girls in the class.",
+         "sub_answer": "37.5%.",
+         "sub_solution": "Percentage of girls = (15/40) * 100 = 37.5%."
+     }
+ ]
+ ```
+ Now, please split the following question according to the above requirements:
+ [Question]
+ {input_question}
+
+ [Answer]
+ {input_answer}
+
+ [Solution]
+ {input_solution}
+ """
+
+     def build_prompt(self, need_fields, **kwargs):
+         # Check for missing fields
+         missing = [f for f in need_fields if f not in kwargs]
+         if missing:
+             if self.on_missing == "raise":
+                 raise KeyError(f"Missing fields for prompt: {missing}")
+             # Lenient mode: fill missing fields with empty strings
+             for f in missing:
+                 kwargs[f] = ""
+         prompt = self.f_str_template
+         for key, value in kwargs.items():
+             prompt = prompt.replace(f"{{{key}}}", str(value))
+
+         return prompt
+
+ @PROMPT_REGISTRY.register()
+ class TypeClassifyPrompt(DIYPromptABC):
+     def __init__(self, f_str_template: str = "{input_text}", on_missing: str = "raise"):
+         self.on_missing = on_missing
+         self.f_str_template = '''
+ [Role]
+ You are an education expert familiar with textbook question formats at high school and university levels.
+ Your task is to determine the question type based on the question and answer provided.
+
+ [Possible Categories]
+ Choose exactly one of the following types:
+
+ 1. Proof problem - requires proving a statement, identity, inequality, or property.
+
+ 2. Explanation problem - asks for reasoning, causes, interpretation, principle, or conceptual explanation.
+
+ 3. Fill-in problem - asks to fill in blanks, complete missing expressions, or supply intermediate steps.
+
+ 4. Calculation problem - involves explicit numerical or symbolic computation, formula manipulation, or value derivation.
+ Even if the final answer is a short conclusion such as “thus xxx increases” or “so the velocity decreases,”
+ it should still be considered a Calculation problem if the majority of the reasoning is computational.
+
+ 5. Multiple-choice problem - asks to choose or identify the correct option (e.g., “Which of the following…”).
+
+ 6. Sketching/Plotting problem - requires sketching a figure, diagram, graph, or geometric representation.
+
+ 7. Other - for tasks that don't fit any of the above types.
+
+ [Judgment Rules]
+
+ 1. If the problem explicitly says “prove,” “show that,” “derive,” and does not have a short final answer → classify as Proof problem.
+
+ 2. If it mainly contains explanations, reasoning, or conceptual analysis without detailed calculation → Explanation problem.
+
+ 3. If the question has blanks, missing terms, or placeholders (e.g., “( )” or “____”), or the question seems **incomplete** → Fill-in problem.
+
+ 4. If there are multiple formula derivations, substitutions, or numeric results → Calculation problem,
+ even if followed by a brief explanatory conclusion.
+
+ 5. If it asks to select the correct answer among options (A/B/C/D, etc.) → Multiple-choice problem.
+
+ 6. If the question explicitly requires producing a figure, diagram, plot, or geometric construction → Sketching/Plotting problem.
+
+ 7. If none of these clearly apply or the problem type is mixed → Other.
+
+ [Output Format]
+ Return a JSON object with the following fields:
+ {
+     "type": "Calculation | Proof | Explanation | Fill-in | Multiple-choice | Sketching/Plotting | Other",
+     "reason": "Brief justification for the classification."
+ }
+
+ Please determine the type of the following question, using exactly one of the above category names
+ (Proof, Explanation, Fill-in, Calculation, Multiple-choice, Sketching/Plotting, Other) in the "type" field.
+
+ [Question]
+ {input_question}
+
+ [Answer]
+ {input_answer}
+ '''
+
+     def build_prompt(self, need_fields, **kwargs):
+         # Check for missing fields
+         missing = [f for f in need_fields if f not in kwargs]
+         if missing:
+             if self.on_missing == "raise":
+                 raise KeyError(f"Missing fields for prompt: {missing}")
+             # Lenient mode: fill missing fields with empty strings
+             for f in missing:
+                 kwargs[f] = ""
+         prompt = self.f_str_template
+         for key, value in kwargs.items():
+             prompt = prompt.replace(f"{{{key}}}", str(value))
+
+         return prompt
+
+ @PROMPT_REGISTRY.register()
+ class QAFilterPrompt(DIYPromptABC):
+     """
+     Prompt for filtering out unsuitable question-answer pairs.
+     """
+     def __init__(self, on_missing: str = "raise"):
+         self.on_missing = on_missing
+         self.f_str_template = """
+ [Role]
+ You are an education expert familiar with textbook question formats at high school and university levels.
+ Your task is to determine whether the provided question and answer pair is suitable to serve as a problem in an exam.
+
+ Question: {input_question}
+
+ Answer: {input_answer}
+
+ [Criteria]
+ 1. Clarity: The question must be suitable for an exam setting, meaning it should raise **a clear problem** that requires a specific solution.
+ For example, **statements without questions**, open-ended discussions, and other content that does not pose a clear problem are not suitable.
+ Questions like "Give an example of..." that can have many valid answers are also not suitable.
+ You should be particularly careful with questions that **only provide a topic or theme** without a specific problem to solve.
+ For instance, "all primes less than 100" is not a valid question, because it does not specify what to do (listing, counting, ...) with those primes.
+ Instead, a question like "List all primes less than 100" or "How many primes are there less than 100?" would be suitable.
+ 2. Relevance: The answer must directly address the question asked.
+ If the answer seems to be addressing a different question and is wrongly paired with the given question, it is not suitable.
+ 3. Completeness and Self-Containment: The question and answer should be complete and self-contained, providing all necessary information for understanding and solving it without requiring external context.
+ Questions that rely heavily on prior context or external references are not suitable.
+ Answers such as "Refer to theorem X", "Corollary of previous result", "Answered in the text above", "Omitted for brevity" are not acceptable.
+ Incomplete questions or answers that leave out critical information are also not suitable.
+ 4. Explicit Task Requirement: The question must contain an explicit task phrase (such as "compute", "determine", "find", "prove", "list", "show", "give the value of", etc.).
+ Pure expressions or noun phrases are NOT acceptable even if they are commonly understood as implicit tasks in mathematical contexts.
+ If the question does not include an explicit verb specifying what the student must do, it must be judged unsuitable.
+ Of course, if the question is in a multiple-choice or fill-in-the-blank format, the choices or blanks themselves will serve as the explicit task requirement.
+
+ [Important Notice]
+ 1. You do not need to evaluate the correctness of the answer, only whether it is appropriate and complete in relation to the question.
+ 2. A short answer with no explanation (calculation, proof, counterexample, ...) is acceptable as long as it directly addresses the question.
+ 3. There might be figures in the question or answer, represented as `![image](image_url)`. However, we do not give you those figures.
+ You can assume that if the question or answer contains such figure references, they are correctly placed and provide necessary information.
+ 4. Sometimes in a fill-in question, the blanks like "___" may be missing due to OCR errors. In this case, if the question is otherwise clear and complete, you can still judge it as suitable.
+ 5. You should be very strict in your evaluation. If any of the criteria above are not fully met, the question-answer pair should be considered unsuitable.
+
+ [Output Format]
+ Return a JSON object with the following fields:
+ {
+     "reason": "Brief justification of your judgement.",
+     "judgement": "true | false"
+ }
+
+ Your judgment:
+ """
+
+     def build_prompt(self, need_fields, **kwargs):
+         # Check for missing fields
+         missing = [f for f in need_fields if f not in kwargs]
+         if missing:
+             if self.on_missing == "raise":
+                 raise KeyError(f"Missing fields for prompt: {missing}")
+             # Lenient mode: fill missing fields with empty strings
+             for f in missing:
+                 kwargs[f] = ""
+         prompt = self.f_str_template
+         for key, value in kwargs.items():
+             prompt = prompt.replace(f"{{{key}}}", str(value))
+
+         return prompt
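All three prompt classes fill placeholders with plain `str.replace` over `{key}` markers rather than `str.format`. That choice matters: the templates contain literal JSON braces (e.g. `{"judgement": ...}`) that `str.format` would reject as malformed format fields. A standalone sketch of the substitution (the name `fill_template` is ours):

```python
def fill_template(template, **kwargs):
    # Plain replace instead of str.format: the templates contain literal JSON
    # braces like {"judgement": "true | false"} that str.format would choke on.
    out = template
    for key, value in kwargs.items():
        out = out.replace(f"{{{key}}}", str(value))
    return out

tpl = 'Question: {input_question}\nReturn JSON: {"judgement": "true | false"}'
print(fill_template(tpl, input_question="What is 2+2?"))
# Question: What is 2+2?
# Return JSON: {"judgement": "true | false"}
```

The trade-off is that unknown placeholders are silently left in place, which is why the classes also validate `need_fields` up front.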
prompts/pdf2vqa.py ADDED
@@ -0,0 +1,75 @@
+ from dataflow.utils.registry import PROMPT_REGISTRY
+ from dataflow.core.prompt import PromptABC
+
+ @PROMPT_REGISTRY.register()
+ class QAExtractPrompt(PromptABC):
+     def __init__(self):
+         pass
+
+     def build_prompt(self) -> str:
+         PROMPT = f"""
+ You are an expert at answering college-level questions. You are given a JSON file. Your task is to segment the content, insert image tags, and extract labels:
+ 1. Every JSON item has an "id" field. Your main task is to output this field.
+ 2. You need to segment the content into multiple `<qa_pair>`…`</qa_pair>` blocks, each containing a question and its corresponding answer with solution.
+ 3. If the problem or answer/solution is not complete, omit them. An answer/solution should be considered complete as long as either the answer or the solution exists.
+ 4. You need to put the image ids into their proper positions. You can look at the caption or context to decide where to put the image tags.
+ 5. You will also need to extract the chapter title and each problem's label/number from the text.
+ 6. You only need to output the "id" field for **chapter titles, questions and solutions**. DO NOT OUTPUT ORIGINAL TEXT. Use ',' to separate different ids.
+ 7. However, use original labels/numbers for labels, and use original numbers for answers. DO NOT output the "id" field for labels and answers. You will need to extract them from the text.
+ """
+         PROMPT += f"""
+ Strict extraction rules:
+ ** About questions and answers/solutions **
+ - Preserve each problem’s original label/number, such as "例1", "Example 3", "习题1", "11". Do not include the period after the number. Use Arabic numerals only. For example, if the label is "例一", convert it to "例1". If the label is "IV", convert it to "4".
+ - If the full label is "三、16", keep only "16". If the full label is "5.4", keep only "4".
+ - If there are multiple sub-questions (such as "(1)", "(a)") under one main question, always put them together in the same `<qa_pair>`…`</qa_pair>` block.
+ - If a question and its answer/solution are contiguous, wrap them together as a single `<qa_pair>`…`</qa_pair>` block, e.g.:
+ `<qa_pair><label>1</label><question>…</question><answer>…</answer><solution>…</solution></qa_pair>`
+ - If a question and its answer/solution are NOT contiguous (e.g. only question; only answer and/or solution; all questions at the front and all answers/solutions at the back), wrap each question or answer/solution in a `<qa_pair>`…`</qa_pair>` block with the missing part left empty. For example, if only questions appear:
+ `<qa_pair><label>1</label><question>…</question><answer></answer><solution></solution></qa_pair>`
+ - In total, there are 7 possibilities: only question, only answer, only solution, question with answer, question with solution, answer with solution, full question and answer and solution.
+ - If multiple qa pairs appear, wrap each qa pair in its own `<qa_pair>`…`</qa_pair>` block.
+ - If you do not see the full solution, only extract the short answer and leave the solution empty. YOU MUST KEEP SHORT ANSWERS !!!
+ ** About chapter/section titles **
+ - Always enclose qa pairs in a `<chapter>`…`</chapter>` block, where <title>MAIN_TITLE_ID</title> is the id of the chapter title or section title.
+ - Normally, chapter/section titles appear before the questions/answers in an independent JSON item.
+ - There could be multiple `<chapter>`…`</chapter>` blocks if multiple chapters/sections exist.
+ - **Any title followed by a question/answer whose label/number is not 1, or a title with a score such as "一、选择题(每题1分,共10分)", should NOT be extracted.**
+ - Do not use nested titles.
+ - Leave the title blank if there is no chapter title.
+ ** About figures/diagrams **
+ - Whenever the question or answer/solution refers to a figure or diagram, record its "id" in question/answer/solution just like other text content.
+ - You MUST include all images referenced in the question/answer/solution.
+
+
+ If no qualifying content is found, output:
+ <empty></empty>
+
+ Output format (all tags run together, no extra whitespace or newlines except between entries):
+ <chapter><title>MAIN_TITLE_ID</title>
+ <qa_pair><label>LABEL(EXTRACTED FROM TEXT)</label><question>QUESTION_IDS</question>
+ <answer>ANSWER(EXTRACTED FROM SOLUTION)</answer><solution>SOLUTION_IDS</solution></qa_pair>
+ <qa_pair><label>LABEL(EXTRACTED FROM TEXT)</label><question>QUESTION_IDS</question>
+ <answer>ANSWER(EXTRACTED FROM SOLUTION)</answer><solution></solution></qa_pair>
54
+ </chapter>
55
+ <chapter><title>MAIN_TITLE_ID</title>
56
+ <qa_pair><label>LABEL(EXTRACTED FROM TEXT)</label><question>QUESTION_IDS</question>
57
+ <answer>ANSWER(EXTRACTED FROM SOLUTION)</answer><solution>SOLUTION_IDS</solution></qa_pair>
58
+ </chapter>
59
+
60
+
61
+ Example:
62
+ <chapter><title>7</title>
63
+ <qa_pair><label>1</label><question>2,3</question>
64
+ <answer>Yes</answer><solution>5,6,7</solution></qa_pair>
65
+ <qa_pair><label>2</label><question>8,9,10</question>
66
+ <answer>3.14</answer><solution></solution></qa_pair>
67
+ </chapter>
68
+ <chapter><title>12</title>
69
+ <qa_pair><label>1</label><question></question>
70
+ <answer>2^6</answer><solution>16</solution></qa_pair>
71
+ </chapter>
72
+
73
+ Please now process the provided json and output your result.
74
+ """
75
+ return PROMPT
prompts/question_answer_clean.py ADDED
@@ -0,0 +1,67 @@
+ from dataflow.utils.registry import PROMPT_REGISTRY
+ from dataflow.core.prompt import PromptABC
+
+ @PROMPT_REGISTRY.register()
+ class TextCleaningPrompt(PromptABC):
+     """
+     Prompt templates for stripping non-content fragments from question/answer text.
+     """
+     def __init__(self):
+         self.question_prompt_template = """你是一名数据清洗专家。请从以下题目文本中识别所有与问题实质无关的非内容性信息片段,这些片段应被完全移除。
+
+ 【非内容性信息定义】(应删除):
+ - 题号、例号、习题编号(如 "1.1"、"例3"、"Problem 2.4"、"习题2-5");
+ - 章节标记(如 "§2.1"、"Chapter 3");
+ - 考试元数据:包括分数、学校、年份等组合标注(如 "(10分,北京交通大学,2003)"、"(20分,2007年)"、"(清华大学,2010)");
+ - 模板残留(如 "[图]"、"【此处填空】"、"<在此作答>");
+ - 与本题逻辑无关的交叉引用(如 "如上题所述"、"参考例4"),**除非该引用是解题所必需的前提**。
+
+ 【重要内容定义】(必须保留,禁止删除):
+ - **图片引用**:包括 Markdown 图片语法(如 `![图2-1](question_images/xxx.jpg)`)、纯路径(如 `question_images/xxx.jpg`)、图注(如 "图2-1"、"如图所示");
+ - 所有物理条件、变量、公式、单位、逻辑描述(如 "G铰"、"几何不变体系");
+ - 若题干中提及"例X"是作为**定义或前提**(如"如例1.2中定义的模型"),则保留;否则(如开头的"例1")应删除。
+
+ 【重要规则】:
+ 1. **不要重写、不要改写、不要总结**原始文本;
+ 2. **仅输出需要删除的子字符串**,多个片段用 `||` 分隔;
+ 3. 如果没有非内容性信息,输出 `NONE`;
+ 4. **必须原样输出片段**(包括空格、括号、标点、中文顿号等);
+ 5. **特别注意**:任何包含 `question_images/` 的路径、`![...](...)` 结构、或"图X-X"形式的图标识,**一律不得删除**;
+ 6. 考试元数据(如"(10分,北京交通大学,2003)")**必须整段删除**,包括括号。
+ 7. **务必保留必要的前提信息。**
+ 8. **删除后的问题,一定还能构成一个完整的问题**。输出前请三思。
+ 例如对于"In Exercises 3-6, calculate the size of the set.\n4. {{x|x is a prime number less than 10}}",
+ 这个必要的前缀不应该被删除,你要删除的内容应该是 " In Exercises 3-6, || 4."。
+ 如果你对这一点感到困惑、为难,请**务必保留更多内容,而不是删除**。
+ 9. 应**最小化**删除与题目无关的文本,不要过分删除。如果你有任何疑问,请优先选择保留,甚至直接输出 `NONE`。
+
+ 题目文本:{text}
+
+ 请输出待删除的片段(用 `||` 分隔)或 `NONE`:"""
+
+         self.answer_prompt_template = """你是一名数据清洗专家。请从以下答案文本中识别所有与答案实质无关的非内容性信息片段,这些片段应被完全移除。
+
+ 【非内容性信息定义】:
+ - 答案引导词(如 "答:"、"答案:"、"Solution:"、"解:");
+ - 习题引用(如 "(见习题2.3)"、"同例4"、"参考教材P30");
+ - 模板残留(如 "[计算过程略]"、"{{result}}");
+ - 与答案结论无关的附加说明(如 "详见附录");
+ - 其他非答案核心内容的元信息。
+
+ 【重要规则】:
+ 1. 不要重写、不要改写、不要总结原始文本;
+ 2. 仅输出需要删除的子字符串,多个片段用 `||` 分隔;
+ 3. 如果没有非内容性信息,输出 `NONE`;
+ 4. 必须原样输出片段(包括冒号、空格、括号等)。
+
+ 答案文本:{text}
+
+ 请输出待删除的片段(用 `||` 分隔)或 `NONE`:"""
+
+     def build_question_prompt(self, text):
+         """Build the question-cleaning prompt."""
+         return self.question_prompt_template.format(text=text)
+
+     def build_answer_prompt(self, text):
+         """Build the answer-cleaning prompt."""
+         return self.answer_prompt_template.format(text=text)
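These templates ask the model for fragments to delete (joined by `||`, or the literal `NONE`) rather than rewritten text, so a small post-processing step has to apply the deletions downstream. A minimal sketch of such a step; the `apply_removals` helper is hypothetical, not part of this repo, and it assumes whitespace around the `||` separator is incidental:

```python
def apply_removals(text: str, model_output: str) -> str:
    """Apply a cleaning model's output ("NONE" or "||"-separated fragments)."""
    if model_output.strip() == "NONE":
        return text
    for fragment in model_output.split("||"):
        fragment = fragment.strip()
        if fragment:
            # Remove only the first occurrence to avoid over-deleting
            text = text.replace(fragment, "", 1)
    return text.strip()

cleaned = apply_removals("例3 求半径为2的圆的面积。", "例3")
# cleaned == "求半径为2的圆的面积。"
```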
prompts/question_refine.py ADDED
@@ -0,0 +1,58 @@
+ from dataflow.utils.registry import PROMPT_REGISTRY
+ from dataflow.core.prompt import PromptABC
+
+ @PROMPT_REGISTRY.register()
+ class AddMissingBlankPrompt(PromptABC):
+     """
+     Adds missing blank markers to fill-in-the-blank questions.
+     """
+     def __init__(self):
+         self.f_str_template = """
+ [Role]
+ You are an education expert familiar with textbook question formats at high school and university levels.
+ You will be given a "fill-in-the-blank" question along with its answer.
+ However, the question may be missing some placeholders that indicate blanks.
+ You can use the provided answer to help determine where the blanks should be placed.
+ Ensure that the modified question clearly indicates all the blanks using placeholders.
+
+ Question: {input_question}
+
+ Answer: {input_answer}
+
+ [Important Notice]
+ 1. If the original question already has some placeholders (which could take different forms such as "()", "__", "____"), do not remove them. Instead, add any missing "___" based on the answer.
+ If the question is already complete with all necessary blanks (regardless of the form of the placeholders), return "ORIGINAL" (no quotes).
+ 2. Do not change any other part of the question except for adding the missing "___" !!!
+
+ [Examples]
+ Original Question: The capital of France is and the capital of Britain is ____.
+ Answer: Paris; London
+ Return: The capital of France is ___ and the capital of Britain is ____.
+
+ Original Question: The two legs of a right triangle are 3 and 4, then the third side is.
+ Answer: 5
+ Return: The two legs of a right triangle are 3 and 4, then the third side is ___.
+
+ Original Question: The area of a circle with radius r is ( ).
+ Answer: πr^2
+ Return: ORIGINAL
+
+ [Output Format]
+ Only output the full modified question with blanks represented by "___". Do not include any additional explanations or text.
+
+ """
+
+     def build_prompt(self, need_fields, **kwargs):
+         # Validate that all required fields were supplied
+         missing = [f for f in need_fields if f not in kwargs]
+         if missing:
+             # "on_missing" may be defined by the PromptABC base class; default to lenient
+             if getattr(self, "on_missing", "fill") == "raise":
+                 raise KeyError(f"Missing fields for prompt: {missing}")
+             # Lenient mode: fill missing fields with empty strings
+             for f in missing:
+                 kwargs[f] = ""
+         prompt = self.f_str_template
+         for key, value in kwargs.items():
+             prompt = prompt.replace(f"{{{key}}}", str(value))
+
+         return prompt
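`build_prompt` fills placeholders with plain `str.replace` rather than `str.format`, which can raise on stray braces in the template body (such as set-builder notation in an example question). A self-contained sketch of the same technique; the names here are illustrative, not from the repo:

```python
def fill_template(template: str, **kwargs) -> str:
    # Plain replacement: "{key}" -> value, tolerant of other braces in the text
    for key, value in kwargs.items():
        template = template.replace(f"{{{key}}}", str(value))
    return template

prompt = fill_template(
    "Question: {input_question}\nAnswer: {input_answer}",
    input_question="2 + 2 = ___",
    input_answer="4",
)
# prompt == "Question: 2 + 2 = ___\nAnswer: 4"
```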
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ # Gradio UI
+ gradio>=4.44.0
+
+ # DataFlow core (install from GitHub; includes pdf2vqa extras)
+ git+https://github.com/OpenDCAI/DataFlow.git#egg=dataflow[pdf2vqa]
+
+ # Runtime dependencies used by curate_data.py
+ json5
+ pandas
utils/format_utils.py ADDED
@@ -0,0 +1,136 @@
+ import json
+ import re
+
+ def refine_title(title: str, strict_title_match=False):
+     # TODO: more sophisticated title-cleaning logic may be needed here
+     # Strip all whitespace and line breaks from the title
+     title = re.sub(r'\s+', '', title)
+     if not strict_title_match:
+         # Prefer an Arabic chapter/section number (e.g. "1.1", "2")
+         match = re.search(r"\d+\.\d+|\d+", title)
+         if not match:
+             # Fall back to a Chinese-numeral chapter number (e.g. "六", "二十四")
+             match = re.search(r'[一二三四五六七八九零十百]+', title)
+         if match:
+             title = match.group()
+     return title
+
+ def merge_qa_pair(vqa_jsonl, output_jsonl, strict_title_match=False):
+     already_complete_count = 0
+     question_list = []
+     answer_list = []
+     with open(vqa_jsonl, 'r', encoding='utf-8') as vqa_file:
+         for line in vqa_file:
+             data = json.loads(line)
+             if data["question"] != "":
+                 question_list.append(data)
+             else:
+                 # Supports PDFs that put all questions first and all answers at the back
+                 answer_list.append(data)
+
+     with open(output_jsonl, 'w', encoding='utf-8') as out_file:
+         chapter_id = 0
+         chapter_title = ""
+         label = float('inf')
+         questions = {}
+         answers = {}
+         for data in question_list:
+             label_match = re.search(r'\d+', data["label"])
+             if label_match:
+                 data["label"] = label_match.group()
+             if data["chapter_title"] == "":
+                 data["chapter_title"] = chapter_title
+
+             try:
+                 data["label"] = int(data["label"])
+             except Exception:
+                 continue
+
+             if data["chapter_title"] != "" and data["chapter_title"] != chapter_title:
+                 if data["label"] < label:
+                     chapter_id += 1
+                     chapter_title = data["chapter_title"]
+                 else:
+                     # If the label keeps increasing while the chapter title changes, a
+                     # sub-heading was probably mis-extracted; keep the previous title.
+                     data["chapter_title"] = chapter_title
+             label = data["label"]
+             data["original_chapter_title"] = data["chapter_title"]
+             data["chapter_title"] = refine_title(data["chapter_title"], strict_title_match)
+             if data['label'] > 0:
+                 # Questions that already carry an answer/solution go straight to out_file
+                 if data["answer"] or data["solution"]:
+                     already_complete_count += 1
+                     qa_pair = {
+                         "question_chapter_title": data["original_chapter_title"],
+                         "answer_chapter_title": data["original_chapter_title"],
+                         "label": data['label'],
+                         "question": data["question"],
+                         "answer": data["answer"],
+                         "solution": data.get("solution", "")
+                     }
+                     out_file.write(json.dumps(qa_pair, ensure_ascii=False) + '\n')
+
+                 else:
+                     questions[(data["chapter_title"], data['label'])] = data
+
+         chapter_id = 0
+         chapter_title = ""
+         label = float('inf')
+         for data in answer_list:
+             label_match = re.search(r'\d+', data["label"])
+             if label_match:
+                 data["label"] = label_match.group()
+             if data["chapter_title"] == "":
+                 data["chapter_title"] = chapter_title
+
+             try:
+                 data["label"] = int(data["label"])
+             except Exception:
+                 continue
+
+             if data["chapter_title"] != "" and data["chapter_title"] != chapter_title:
+                 if data["label"] < label:
+                     chapter_id += 1
+                     chapter_title = data["chapter_title"]
+                 else:
+                     # If the label keeps increasing while the chapter title changes, a
+                     # sub-heading was probably mis-extracted; keep the previous title.
+                     data["chapter_title"] = chapter_title
+             label = data["label"]
+             # Keep the raw title as well; the final merge step reads it back
+             data["original_chapter_title"] = data["chapter_title"]
+             data["chapter_title"] = refine_title(data["chapter_title"], strict_title_match)
+             # Update incrementally so a duplicated label cannot overwrite an
+             # earlier entry's solution or answer
+             if data['label'] > 0:
+                 if not answers.get((data["chapter_title"], data['label'])):
+                     answers[(data["chapter_title"], data['label'])] = data
+                 else:
+                     if not answers[(data["chapter_title"], data['label'])].get("solution") and data.get("solution"):
+                         answers[(data["chapter_title"], data['label'])]["solution"] = data["solution"]
+                     if not answers[(data["chapter_title"], data['label'])].get("answer") and data.get("answer"):
+                         answers[(data["chapter_title"], data['label'])]["answer"] = data["answer"]
+
+         for label in questions:
+             if label in answers:
+                 qa_pair = {
+                     "question_chapter_title": questions[label]["original_chapter_title"],
+                     "answer_chapter_title": answers[label]["original_chapter_title"],
+                     "label": label[1],
+                     "question": questions[label]["question"],
+                     "answer": answers[label]["answer"],
+                     "solution": answers[label].get("solution", "")
+                 }
+                 out_file.write(json.dumps(qa_pair, ensure_ascii=False) + '\n')
+
+     print(f"Merged QA pairs: {len(questions.keys() & answers.keys()) + already_complete_count}")
+
+ def jsonl_to_md(jsonl_file, md_file):
+     with open(jsonl_file, 'r', encoding='utf-8') as in_file, open(md_file, 'w', encoding='utf-8') as out_file:
+         for line in in_file:
+             data = json.loads(line)
+             out_file.write(f"### Question {data['label']}\n\n")
+             out_file.write(f"{data['question']}\n\n")
+             out_file.write(f"**Answer:** {data['answer']}\n\n")
+             if data.get('solution'):
+                 out_file.write(f"**Solution:**\n\n{data['solution']}\n\n")
+             out_file.write("---\n\n")
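For reference, `refine_title`'s non-strict path reduces a noisy heading to its bare chapter number, which is what makes question and answer chapter keys comparable. An equivalent standalone sketch of that behavior (a mirror of `utils/format_utils.refine_title`, not the module itself):

```python
import re

def refine_title(title: str, strict_title_match: bool = False) -> str:
    # Strip whitespace, then prefer Arabic section numbers,
    # falling back to Chinese numerals.
    title = re.sub(r"\s+", "", title)
    if not strict_title_match:
        match = re.search(r"\d+\.\d+|\d+", title) or re.search(
            r"[一二三四五六七八九零十百]+", title
        )
        if match:
            title = match.group()
    return title

# refine_title("第 2.1 节 习题") == "2.1"
# refine_title("第六章 复习题") == "六"
```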