DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
Abstract
A lightweight retrieval model called DARE incorporates data distribution information into function representations to improve R package retrieval, achieving superior performance over existing embedding models while enabling more reliable statistical analysis through an R-oriented LLM agent.
Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore the distribution of the data being analyzed, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, together with a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
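The NDCG at 10 figure reported above (commonly written NDCG@10) measures how close a ranked list of retrieved functions is to the ideal ordering within the top 10 positions. The following is a minimal sketch of the standard metric, not the paper's evaluation code. It assumes linear gain (some variants use 2^rel - 1, which is identical for binary labels) and computes the ideal ranking from the relevance labels passed in; the example relevance lists are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades.

    The item at 0-based position `rank` is discounted by log2(rank + 2),
    so the first position has a discount of log2(2) = 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item was labeled for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query with a single relevant function among ten candidates.
hit_at_rank_1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
hit_at_rank_3 = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(ndcg_at_k(hit_at_rank_1))  # 1.0
print(ndcg_at_k(hit_at_rank_3))  # 0.5
```

With one relevant item per query, the score depends only on the position of that item: rank 1 gives 1.0 and rank 3 gives 1 / log2(4) = 0.5. A benchmark score is then the mean over all queries.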
Community
We introduce DARE, an embedding model that improves LLM agents on R package retrieval and downstream statistical analysis tasks. DARE outperforms open-source embedding models on R retrieval in both efficiency and accuracy.
Paper: https://arxiv.org/abs/2603.04743
Website: https://ama-cmfai.github.io/DARE_webpage/
Model: https://huggingface.co/Stephen-SMJ/DARE-R-Retriever
Database: https://huggingface.co/datasets/Stephen-SMJ/RPKB
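To illustrate why distribution information can matter for retrieval, here is a toy sketch. It is not DARE's architecture and does not use the released model: a bag-of-words vector stands in for a learned encoder, and fusion is plain text concatenation. The knowledge-base descriptions and data profiles are hypothetical; only the R function names (`stats::lm`, `stats::glm`, `pscl::zeroinfl`) refer to real functions.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse(description, data_profile):
    """Fuse function text with a data-distribution profile (assumed: concatenation)."""
    return embed(description + " " + data_profile)


# Hypothetical knowledge-base entries: (function, description, data profile).
KB = [
    ("stats::lm", "fit regression model", "continuous gaussian response"),
    ("stats::glm", "fit regression model", "count poisson response"),
    ("pscl::zeroinfl", "fit regression model", "count zero-inflated response"),
]


def retrieve(query, data_profile, use_profile=True):
    """Rank KB functions by similarity; ties keep KB order (sorted is stable)."""
    q = fuse(query, data_profile if use_profile else "")
    scored = [
        (cosine(q, fuse(desc, prof if use_profile else "")), name)
        for name, desc, prof in KB
    ]
    return [name for _, name in sorted(scored, key=lambda pair: -pair[0])]


query = "fit a regression model"
profile = "count zero-inflated response"
print(retrieve(query, profile, use_profile=False))
print(retrieve(query, profile, use_profile=True))
```

In this toy setup the three descriptions are identical, so text-only matching cannot separate the candidates and simply returns them in knowledge-base order. Once the data profile is fused into both the query and the function representations, the zero-inflated count model ranks first.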