
Alignment & Safety

AI alignment, RLHF, harmlessness, robustness, and LLM jailbreak research.

30 papers in the last 30 days
Characterizing the Consistency of the Emergent Misalignment Persona

Anietta Weckauff, Yuchen Zhang, Maksym Andriushchenko

cs.AI · Apr 30, 2026

Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation…

A Pattern Language for Resilient Visual Agents

Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

cs.AI, cs.SE · Apr 30, 2026

Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and…

Heterogeneous Scientific Foundation Model Collaboration

Zihao Li, Jiaru Zou, Feihao Fang et al.

cs.AI, cs.CL, cs.LG · Apr 30, 2026

Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world…

Rethinking Agentic Reinforcement Learning In Large Language Models

Fangming Cui, Ruixiao Zhu, Cheng Fang et al.

cs.AI, cs.ET · Apr 30, 2026

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large…
