
Alignment & Safety

AI alignment, RLHF, harmlessness, robustness, and LLM jailbreak research.

30 papers in the last 30 days
Characterizing the Consistency of the Emergent Misalignment Persona

Anietta Weckauff, Yuchen Zhang, Maksym Andriushchenko

cs.AI · Apr 30, 2026

Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation…

A Pattern Language for Resilient Visual Agents

Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

cs.AI, cs.SE · Apr 30, 2026

Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and…

Heterogeneous Scientific Foundation Model Collaboration

Zihao Li, Jiaru Zou, Feihao Fang et al.

cs.AI, cs.CL, cs.LG · Apr 30, 2026

Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world…

Rethinking Agentic Reinforcement Learning In Large Language Models

Fangming Cui, Ruixiao Zhu, Cheng Fang et al.

cs.AI, cs.ET · Apr 30, 2026

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large…
