DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units
Maxime Poli, Manel Khentout, Angelo Ortiz Tandazo et al.
cs.CL · cs.SD · eess.AS · Mar 19, 2026
We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of p…
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Ziyin Zhang, Zihan Liao, Hang Yu et al.
cs.CL · cs.AI · Mar 19, 2026
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available h…
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Zhuolin Yang, Zihan Liu, Yang Chen et al.
cs.CL · cs.AI · cs.LG · Mar 19, 2026
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical an…
Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity
Qiawen Ella Liu, Marina Dubova, Henry Conklin et al.
cs.AI · cs.CL · Mar 19, 2026
Are large language models (LLMs) creative in the same way humans are, and can the same interventions increase creativity in both? We evaluate a promising but largely untested intervention for creativi…
How Uncertainty Estimation Scales with Sampling in Reasoning Models
Maksym Del, Markus Kängsepp, Marharyta Domnich et al.
cs.AI · cs.CL · cs.LG · Mar 19, 2026
Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box app…
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah et al.
cs.RO · cs.AI · cs.CL · cs.CV · Mar 19, 2026
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge"…
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
Carlos Hinojosa, Clemens Grange, Bernard Ghanem
cs.CV · cs.AI · cs.CL · cs.LG · Mar 19, 2026
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives th…
DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering
Yilin Wang, Yuchun Fan, Jiaoyang Li et al.
cs.CL · cs.AI · Mar 19, 2026
Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the…
FinTradeBench: A Financial Reasoning Benchmark for LLMs
Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan et al.
cs.CE · cs.AI · cs.CL · cs.IR · Mar 19, 2026
Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals com…
Parallelograms Strike Back: LLMs Generate Better Analogies than People
Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu et al.
cs.CL · cs.AI · Mar 19, 2026
Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-sim…
D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.
cs.AI · cs.LG · Mar 19, 2026
Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard decoding methods for autoregressive mod…
Box Maze: A Process-Control Architecture for Reliable LLM Reasoning
Zou Qiang
cs.AI · cs.CL · Mar 19, 2026
Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such …
VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models
Chonghan Liu, Yimin Du, Qi An et al.
cs.CL · cs.AI · Mar 19, 2026
Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we pr…
A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes
Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz et al.
cs.CL · Mar 19, 2026
Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of he…
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
Hangeol Chang, Changsun Lee, Seungjoon Rho et al.
cs.CL · cs.AI · cs.LG · Mar 19, 2026
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options…
UGID: Unified Graph Isomorphism for Debiasing Large Language Models
Zikang Ding, Junchi Yao, Junhao Li et al.
cs.CL · cs.AI · Mar 19, 2026
Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases…
Implicit Patterns in LLM-Based Binary Analysis
Qiang Li, XiangRui Zhang, Haining Wang
cs.AI · cs.CR · cs.SE · Mar 19, 2026
Binary vulnerability analysis is increasingly performed by LLM-based agents in an iterative, multi-pass manner, with the model as the core decision-maker. However, how such systems organize exploratio…
What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard et al.
cs.CL · cs.AI · Mar 19, 2026
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, Ger…
How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation
Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.
eess.AS · cs.CL · cs.SD · Mar 19, 2026
Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how thi…
RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
Xiao Feng, Bo Han, Zhanke Zhou et al.
cs.AI · cs.CL · cs.LG · Mar 19, 2026
Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of ter…
Are complicated loss functions necessary for teaching LLMs to reason?
Gabriele Carrino, Andrea Sassella, Nicolo Brunello et al.
cs.LG · cs.AI · cs.CL · Mar 19, 2026
Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has sho…
Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders
Yana Veitsman, Yihong Liu, Hinrich Schütze
cs.CL · Mar 19, 2026
Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques -- despite increasing embedding similarity -- frequently fail to improve …
Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks
Rudra Jadhav, Janhavi Danve, Sonalika Shaw
cs.CL · Mar 19, 2026
As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investiga…
Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
Vedant Pandya
cs.CL · cs.AI · Mar 19, 2026
Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on E…
Online Learning and Equilibrium Computation with Ranking Feedback
Mingyang Liu, Yongshan Chen, Zhiyuan Fan et al.
cs.LG · cs.CL · cs.GT · Mar 19, 2026
Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. …
Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo
Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon et al.
cs.CL · cs.AI · cs.HC · Mar 19, 2026
Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, orderi…
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation
Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim et al.
cs.AI · cs.CL · Mar 19, 2026
The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally s…
Optimal Splitting of Language Models from Mixtures to Specialized Domains
Skyler Seto, Pierre Ablin, Anastasiia Filippova et al.
cs.CL · cs.LG · Mar 19, 2026
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a …
Evaluating Counterfactual Strategic Reasoning in Large Language Models
Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou et al.
cs.CL · Mar 19, 2026
We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canon…
MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models
Chenyang Gu, Jiahao Cheng, Meicong Zhang et al.
cs.CL · Mar 19, 2026
Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoni…