A Reproducibility Study of LLM-Based Query Reformulation
Amin Bigdeli, Radin Hamidi Rad, Hai Son Le et al.
cs.IR, cs.CL · Apr 30, 2026
Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are…
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
Bowen Sun, Chaozhuo Li, Yaodong Yang et al.
cs.CR, cs.CL, cs.LG · Apr 30, 2026
Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collecti…
Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition
Thibault Bañeras-Roux, Mickaël Rouvier, Jane Wottawa et al.
cs.CL · Apr 30, 2026
Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric su…
Characterizing the Consistency of the Emergent Misalignment Persona
Anietta Weckauff, Yuchen Zhang, Maksym Andriushchenko
cs.AI · Apr 30, 2026
Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlatio…
Splitting Argumentation Frameworks with Collective Attacks and Supports
Matti Berthold, Lydia Blümel, Giovanni Buraglio et al.
cs.AI, cs.LO · Apr 30, 2026
This work proposes novel splitting techniques for argumentation formalisms that incorporate supports between defeasible elements. We base our studies on bipolar set-based argumentation frameworks (BSA…
Normativity and Productivism: Ableist Intelligence? A Degrowth Analysis of AI Sign Language Translation Tools for Deaf People
Nina Seron-Abouelfadil, Poppy Fynes
cs.AI, cs.CY, cs.HC · Apr 30, 2026
Sign languages, of any geographical or accentual variation, understandably face continuous scrutiny under the ever present popularity of verbal dictation and audism. Through this, many potential probl…
Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
Prashant Kulkarni
cs.CR, cs.AI · Apr 30, 2026
Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack pa…
To Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems
Shreya Chappidi, Jatinder Singh
cs.CY, cs.AI · Apr 30, 2026
Responsible AI research typically focuses on examining the use and impacts of deployed AI systems. Yet, there is currently limited visibility into the pre-deployment decisions to pursue building such …
SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images
Jialu Shen, Han Lyu, Suyang Zhong et al.
cs.AI · Apr 30, 2026
Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-spec…
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Ivan Bercovich
cs.AI · Apr 30, 2026
Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so doe…
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Yujun Wu, Dongxu Zhang, Xinchen Li et al.
cs.AI · Apr 30, 2026
Existing research infrastructure is fundamentally document-centric, providing citation links between papers but lacking explicit representations of methodological evolution. In particular, it does not…
RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses
Feiyu Wu, Xu Zheng, Zhuocheng Wang et al.
cs.AI · Apr 30, 2026
Large language models (LLMs) make reward design in reinforcement learning substantially more scalable, but generated rewards are not automatically reliable training objectives. Existing work has focus…
Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles
Zainab Rehan, Christian Medeiros Adriano, Sona Ghahremani et al.
cs.LO, cs.AI · Apr 30, 2026
Rule-based systems remain central in safety-critical domains but often struggle with scalability, brittleness, and goal misspecification. These limitations can lead to reward hacking and failures in f…
LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis
Lincan Li, Zheng Chen, Yushun Dong
cs.AI · Apr 30, 2026
Electroencephalogram (EEG) signals are vital for automated seizure detection, but their inherent noise makes robust representation learning challenging. Existing graph construction methods, whether co…
PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
Sudong Wang, Weiquan Huang, Xiaomin Yu et al.
cs.CV, cs.AI, cs.CL · Apr 30, 2026
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). H…
MIFair: A Mutual-Information Framework for Intersectionality and Multiclass Fairness
Jeanne Monnier, Thomas George, Frédéric Guyard et al.
cs.LG, cs.AI, cs.CY, cs.IT · Apr 30, 2026
Fairness in machine learning remains challenging due to its ethical complexity, the absence of a universal definition, and the need for context-specific bias metrics. Existing methods still struggle w…
Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems
Taslim Jamal Arif, Kuldeep Singh
cs.AI · Apr 30, 2026
Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies, whether rule-based SQL matching or sche…
TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions
Ce Chen, Yi Ren, Yuanming Li et al.
cs.CV, cs.AI · Apr 30, 2026
Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this f…
From LLM-Driven Trading Card Generation to Procedural Relatedness: A Pokémon Case Study
Johannes Pfau, Panagiotis Vrettis
cs.AI, cs.HC · Apr 30, 2026
Since the dawn of Trading Card Games, the genre has grown into a multi-billion-dollar industry engaging millions of analog and digital players worldwide. Popular TCGs rely on regular updates, balance …
A Pattern Language for Resilient Visual Agents
Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll
cs.AI, cs.SE · Apr 30, 2026
Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and n…
VibroML: An Automated Toolkit for High-Throughput Vibrational Analysis and Dynamic Instability Remediation of Crystalline Materials Using Machine-Learned Potentials
Rogério Almeida Gouvêa, Gian-Marco Rignanese
cond-mat.mtrl-sci, cs.AI, cs.LG, physics.comp-ph · Apr 30, 2026
While machine-learned interatomic potentials (MLIPs) accelerate phonon dispersion calculations, merely identifying dynamical instabilities in computationally predicted materials is insufficient; autom…
Design Structure Matrix Modularization with Large Language Models
Shuo Jiang, Jianxi Luo
cs.CE, cs.AI · Apr 30, 2026
Design Structure Matrix (DSM) modularization, the task of partitioning system elements into cohesive modules, is a fundamental combinatorial challenge in engineering design. Traditional methods treat …
Heterogeneous Scientific Foundation Model Collaboration
Zihao Li, Jiaru Zou, Feihao Fang et al.
cs.AI, cs.CL, cs.LG · Apr 30, 2026
Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world p…
Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes
Tianyuan Wu, Chaokun Chang, Lunxi Cao et al.
cs.OS, cs.AI · Apr 30, 2026
Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault toleranc…
AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework
Xubin Luo, Yang Cheng
cs.DC, cs.AI · Apr 30, 2026
AI inference is becoming a persistent and geographically distributed source of electricity demand. Unlike many traditional electrical loads, inference workloads can sometimes be executed away from the…
Exploring Interaction Paradigms for LLM Agents in Scientific Visualization
Jackson Vonderhorst, Kuangshi Ai, Haichao Miao et al.
cs.AI, cs.GR, cs.HC · Apr 30, 2026
This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language inst…
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Chenxin Li, Zhengyang Tang, Huangxin Lin et al.
cs.SE, cs.AI · Apr 30, 2026
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and gra…
Rethinking Agentic Reinforcement Learning in Large Language Models
Fangming Cui, Ruixiao Zhu, Cheng Fang et al.
cs.AI, cs.ET · Apr 30, 2026
Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large…
Building Persona-Based Agents On Demand: Tailoring Multi-Agent Workflows to User Needs
Giuseppe Arbore, Andrea Sillano, Luigi De Russis
cs.AI, cs.HC · Apr 30, 2026
Recent advances in agentic AI are shifting automation from discrete tools to proactive multi-agent systems that coordinate multi-specialized capabilities behind unified interfaces. However, today's ag…
Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents
Rahul Ramachandran, Nidhi Jha, Muthukumaran Ramasubramanian
cs.AI · Apr 30, 2026
We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains. Unlike ad-hoc trial-and-error approaches…