Recently Added Papers

Track new papers from arXiv and followed conferences/journals; browse by publication date, or search with results ranked by relevance.

ICML 2026 OpenReview / AI for Software Engineering / Automating SE tasks with LLM and foundation models / ICML 2026

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon
2026/05/12 01:23 (22 hours ago)

Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
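
The cost-constrained search behind pillar (1) can be pictured with a small sketch. The Python below is an illustration only, not the MARS implementation; the node fields, coefficients, and budget handling are assumptions.

```python
# Minimal sketch (not the authors' code) of budget-aware MCTS selection:
# a UCT-style score that discounts a node's value by the execution cost
# it would add against the remaining compute budget.
import math

class Node:
    def __init__(self, est_cost):
        self.est_cost = est_cost   # predicted execution cost (e.g., GPU-hours)
        self.visits = 0
        self.total_value = 0.0     # accumulated evaluation score

def budget_aware_uct(node, parent_visits, remaining_budget,
                     c_explore=1.4, c_cost=0.5):
    """Score a candidate node for selection in cost-constrained MCTS."""
    if node.visits == 0:
        exploit = 0.0
        explore = float("inf")     # always try unvisited children once
    else:
        exploit = node.total_value / node.visits
        explore = c_explore * math.sqrt(math.log(parent_visits) / node.visits)
    # Penalize expansions whose cost would eat a large share of the budget.
    cost_penalty = c_cost * (node.est_cost / max(remaining_budget, 1e-9))
    return exploit + explore - cost_penalty
```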

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / AI for Software Engineering / Collaborative AI for SE / cs.SE

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

Young Jo Chung, Safwat Hassan
2026/05/09 01:06 (4 days ago)

When AI coding agents open branches and submit pull requests (PRs), two questions co-determine oversight design: who starts the work (operational agency) and who authorizes its completion (merge governance). We characterize tools along a Collaborator-Assistant spectrum in how they redistribute initiative, oversight, and endorsement, while merge governance remains predominantly human across five tools (OpenAI, Copilot, Devin, Cursor, Claude Code). We analyze 29,585 PR lifecycles using an Initiator x Approver taxonomy with six interaction scenarios; lifecycle reconstruction supplies the how behind those roles. Collaborator tools (Cursor, Devin, Copilot) concentrate operational initiative in agents that open and carry PR work forward, with humans retaining review and endorsement on the path to merge; Assistant tools (OpenAI, Claude) leave task direction primarily with humans and supply bounded support within human-led workflows. Across the spectrum, agency and governance decouple: Collaborator workflows are >=96% agent initiated, yet terminal merge authority remains almost exclusively human, with agent-classified approvers confined to a small fraction of PRs. Where automation executes a merge, logs record the executor but not the decision-maker, marking a boundary of observation. We contribute the taxonomy, per-tool state machines, and a replication package for research on automation, oversight, and governance in PR workflows.

arXiv AI Coding 3Y / arXiv AI Agents for Software Engineering / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Collaborative AI for SE / cs.AI

Learning CLI Agents with Structured Action Credit under Selective Observation

Haoyang Su, Ying Wen
2026/05/09 01:02 (4 days ago)

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $σ$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.

arXiv cs.SE / Testing and Analysis / Software testing / cs.CL / cs.AI / cs.LG / cs.SE

Tool Calling is Linearly Readable and Steerable in Language Models

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz
2026/05/09 00:47 (4 days ago)

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $τ$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
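
As a rough illustration of the mean-difference steering described here (not the authors' code), the sketch below averages a layer's activations over prompts that select tool A and tool B and adds the difference back into the hidden state; the layer choice, prompt sets, and hook placement are assumptions.

```python
# Illustrative sketch of mean-difference activation steering between two tools.
import torch

def mean_difference_vector(hidden_states_a, hidden_states_b):
    """hidden_states_*: lists of [hidden_dim] activations at one layer,
    collected from prompts where the model picks tool A or tool B."""
    mu_a = torch.stack(hidden_states_a).mean(dim=0)
    mu_b = torch.stack(hidden_states_b).mean(dim=0)
    return mu_b - mu_a   # adding this should steer selection from tool A to B

def steer(hidden, direction, alpha=1.0):
    """Add the steering direction to a layer's residual-stream activation."""
    return hidden + alpha * direction
```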

arXiv cs.CR / Dependability and Security / Dependability and security for embedded and cyber-physical systems / Reliability, availability / cs.CL / cs.CR

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis
2026/05/09 00:44 (4 days ago)

Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classification problem as sequential text generation, a design choice that incurs high latency and scales poorly to multi-aspect evaluation. In this work, we introduce \textbf{GLiGuard}, a 0.3B-parameter schema-conditioned bidirectional encoder adapted from GLiNER2 for LLM content moderation. The key idea is to encode task definitions and label semantics directly into the input sequence as structured token schemas, enabling simultaneous evaluation of prompt safety, response safety, refusal detection, 14 fine-grained harm categories, and 11 jailbreak strategies in a single non-autoregressive forward pass. This schema-conditioned design lets supported task and label blocks be composed directly in the input schema at inference time. Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B--27B decoder-based guards despite being 23--90$\times$ smaller, while delivering up to 16$\times$ higher throughput and 17$\times$ lower latency. These results suggest that compact bidirectional encoders can approach the accuracy of much larger guard models while drastically reducing inference cost. Code and models are available at https://github.com/fastino-ai/GLiGuard.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.LG / cs.CR / cs.NI

Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs

Hanlin Cai, Kai Li, Houtianfu Wang, Haofan Dong, Yichen Li, Falko Dressler, Ozgur B. Akan
2026/05/09 00:24 (4 days ago)

Federated fine-tuning (FFT) has emerged as a privacy-preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT-based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT-based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign-like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance- and similarity-based defense methods.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / Prompt engineering for SE / cs.SE

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization

Golnaz Gharachorlu, Mahsa Panahandeh, Lionel C. Briand, Ruifeng Gao, Ruiyuan Wan
2026/05/09 00:20 (4 days ago)

Software failures remain a major challenge in modern software development, and identifying the code elements responsible for failures is a time-consuming debugging task. While extensive research has focused on fault localization in the system under test (SUT), failures can also originate from faulty system test scripts. This problem, known as Test Code Fault Localization (TCFL), has received significantly less attention despite its importance in continuous integration (CI) environments where large test suites are executed frequently. TCFL is particularly challenging because it typically operates under black-box conditions, relies on limited diagnostic signals such as error messages and partial logs, and involves large system-level test scripts that expand the fault localization search space. In this paper, we propose SPARK, a framework that integrates accumulated debugging knowledge from continuous integration (CI) environments into Large Language Model (LLM)-based TCFL. Given a newly observed failing test case, SPARK retrieves similar fault-labeled test cases from a debugging knowledge corpus and selectively annotates suspicious lines of the failing test based on their similarity to previously observed fault patterns. These annotations guide the LLM's reasoning while maintaining scalability and avoiding the prompt-length explosion common to naive retrieval-augmented approaches. We evaluate SPARK on three industrial datasets containing real-world faulty Python test cases from different software products. The results show that SPARK consistently improves fault localization effectiveness compared to the existing LLM-based TCFL baseline while maintaining comparable inference cost and token usage. In particular, the approach advances the state of the art by identifying more correct faulty locations in complex test cases containing multiple faults.

arXiv cs.SE / Analytics / Data-driven user experience / Software metrics and measurements / cs.SE

Evaluating Design Conformance Through Trace Comparison

Reid Anderson, Hassan Reza
2026/05/08 23:48 (4 days ago)

The design of a system and its implementation are two tasks often carried out by different individuals on a development team, and can occur weeks or months apart. This creates a potential for divergence between real behavior and the designed model that an implementation is intended to match. Particularly as time passes and individuals who were present for the original conception of the design leave, a system can lose coherence and drift from intended design principles. Even with a robust system design, more is needed to ensure that the key implementation details match the design and that adherence to a particular strategy is not lost over time. This paper proposes an approach to address that concern for distributed systems using conformance checking, a methodology borrowed from process mining. Distributed traces produced by instrumented applications are evaluated for conformance by comparison to design traces. The resulting conformance percentage is a quantitative metric that can be tracked over time to determine how closely a concrete implementation corresponds to the key attributes of the expected design model. This analysis is done using the dominant industry standard, OpenTelemetry, and so should apply to a wide range of distributed systems.
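
A toy version of the conformance score can be sketched as follows; this is not the paper's tooling, and flattening OpenTelemetry span trees into ordered operation lists is a simplifying assumption.

```python
# Rough sketch: score an observed distributed trace against a designed
# sequence of operations and report a conformance percentage.
from difflib import SequenceMatcher

def conformance_percentage(design_trace, observed_trace):
    """design_trace / observed_trace: ordered lists of operation names."""
    matcher = SequenceMatcher(None, design_trace, observed_trace)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(design_trace), 1)

# Example: two of three designed steps appear, in order, in the recorded trace.
design = ["checkout", "reserve-stock", "charge-card"]
observed = ["checkout", "charge-card", "send-email"]
print(conformance_percentage(design, observed))  # ~66.67
```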

arXiv cs.CR / Dependability and Security / Vulnerability detection and software security / Confidentiality, integrity / cs.CR

Longitudinal Analyses of SAST Tools: A CodeQL Case Study

Jean-Charles Noirot Ferrand, Kyle Domico, Yohan Beugin, Patrick McDaniel
2026/05/08 23:41 (4 days ago)

Open-source software (OSS) pipelines rely on automated static analysis tools to prevent the introduction of vulnerabilities in code. However, there is limited understanding of the efficacy of these tools across the OSS ecosystem over time. In this paper, we introduce a novel method to evaluate static application security testing (SAST) tools through longitudinal measurements and perform the largest academic study of CodeQL -- the most prevalent static analysis tool from GitHub -- on OSS codebases. We apply our apparatus on 114 versions of CodeQL over time on 3993 CVEs from 1622 repositories to measure key properties of the tool, culminating in more than 20 billion lines of code analyzed. First, we measure its effectiveness, i.e., its ability to detect vulnerabilities before they are fixed. Then, we determine whether these detections were actionable through two measures of the distance between findings and vulnerability location either over the entire codebase or within the vulnerable file. Finally, we study the stability of CodeQL by examining how vulnerability detections hold across versions and the evolution of CodeQL on the accuracy-precision trade-off. We find that CodeQL identifies a total of 171 CVEs, and that for 83 of them, a CodeQL version prior to the fix could detect it. Such detections are in general actionable if findings are triaged across files, as for 50% of the 171 detections, more than 50% of findings in the vulnerable file are located in the vulnerable location. Finally, we show that CVE detections are not monotonic across versions as 21 CVEs were no longer detected following a version change and 17 were never redetected. Our study shows that using SAST tools is a matter of best practice as they prevent numerous vulnerabilities from being introduced, but that developers should be aware of changes that may leave blind spots in detections upon updates of the tool.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.GT / cs.CR

Zero-determinant Strategy for Moving Target Defense: Existence, Performance, and Computation

Zhaoyang Cheng, Guanpu Chen, Yiguang Hong, Ming Cao, Mikael Skoglund
2026/05/08 23:16 (4 days ago)

Moving Target Defense (MTD) is commonly formulated as a repeated security game to mitigate persistent threats. Although the strong Stackelberg equilibrium (SSE) characterizes the defender's optimal strategy in the leader-follower framework, computing the SSE often incurs high computational complexity, which significantly limits its practical deployment in MTD problems with multiple targets. This paper proposes adopting a zero-determinant (ZD) strategy for constructing an MTD strategy that achieves both high defensive performance and substantially low computational complexity. We first derive a necessary and sufficient condition for the existence of ZD strategies and investigate the performance of ZD strategies, which shows their upper-bound performance matches that of the SSE strategy. We then formulate two programs to find the optimal ZD strategy parameters under different conditions. Moreover, we design an algorithm to compute the proposed ZD strategies, along with the computational complexity analysis in comparison with the traditional SSE computation. Finally, we conduct experiments on two practical applications to verify our results.

arXiv cs.SE / arXiv AI Agents for Software Engineering / Dependability and Security / Vulnerability detection and software security / Dependability and security for embedded and cyber-physical systems / cs.SE

Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

Xinyi Hou, Yanjie Zhao, Haoyu Wang
2026/05/08 23:03 (4 days ago)

Model Context Protocol (MCP) servers have quickly become the interface layer between LLM agents and external tools, yet they also introduce unsafe data flows that existing analyzers handle poorly. Vulnerabilities manifest in two directions: requester-controlled arguments may propagate to sensitive operations, while untrusted external or sensitive internal data may surface through MCP-visible outputs and subsequently influence host or model behavior. Accurate detection is complicated by the heterogeneous registration and dispatch patterns MCP servers employ, the need for MCP-specific taint semantics, and the fact that bugs often only materialize along complete tool-scoped execution paths. We present MCP-BiFlow, a bidirectional static analysis framework built around MCP-aware entrypoint recovery, protocol-specific taint modeling, and interprocedural propagation analysis. Against a benchmark of 32 confirmed MCP vulnerability cases, MCP-BiFlow identifies 30 (93.8% recall), substantially outperforming CodeQL, Semgrep, Snyk Code, and MCPScan. Across 15,452 real-world MCP server repositories, MCP-BiFlow surfaces 549 overlap-compressed candidate clusters; manual review confirms 118 vulnerability paths in 87 servers, establishing unsafe propagation as a recurring failure mode that resists detection without protocol-aware recovery of both request-side and return-side flows.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR / cs.AI

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim, Hoki Kim
2026/05/08 22:57 (4 days ago)

Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack-family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.

arXiv cs.SE / arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR / cs.SE

Can I Check What I Designed? Mapping Security Design DSLs to Code Analyzers

Sven Peldszus, Frederik Reiche, Kevin Hermann, Sophie Corallo, Thorsten Berger, Robert Heinrich
2026/05/08 22:46 (4 days ago)

When assessing the potential impact of code-level vulnerabilities, e.g., discovered by automated analyzers, it is essential to consider them in the context of the system's security design. However, this is a challenging task due to the abstraction gap between security design, often specified using security DSLs, and implementation. As we will show, even security experts lack a complete understanding of this relationship. Intrigued by this gap (and the general disconnect between secure design and secure implementation) we present a study of 66 design-level security DSLs and 559 security checks from 36 code-level analyzers. We identify what concepts are common to both and capture them in the SecLan model, which has been validated by 22 security experts. Based on this, we investigate the relationship between DSLs and analyzers quantitatively and explore it qualitatively together with 9 security experts. We learn that there are few commonalities between design-level and implementation-level security; security checks are often described by overly general weaknesses, resulting in many non-obvious potential relationships between security DSLs and analyzers; and even security experts are overwhelmed by this complexity. We provide an empirical basis that helps practitioners and researchers better understand the gap and serves as a first step toward bridging it.

arXiv cs.CR / Dependability and Security / Formal methods and model checking / cs.CR / cs.LG

GRASP -- Graph-Based Anomaly Detection Through Self-Supervised Classification

Robin Buchta, Carsten Kleiner, Felix Heine, Gabi Dreo Rodosek
2026/05/08 22:45 (4 days ago)

Advanced persistent threat (APT) attacks remain difficult to detect due to their stealth, adaptability, and use of legitimate system components. Provenance-based intrusion detection systems (PIDS) offer a promising defense by capturing detailed relationships between system components and actions. However, current PIDS rely on predefined or subset-determined thresholds, which limit detection stability and the ability to detect any anomalous behavior in general. Furthermore, related work often neglects the role of process executables, which describe system activity by interacting through a process with files, network components, and other processes. We introduce GRASP, a PIDS based on masked self-supervised classification. GRASP masks the executable information of processes and learns to infer it from their two-hop provenance graph neighborhood, marking misclassified processes as anomalies. It captures behavior patterns for the learned executables without thresholding, making it robust against interference and unknown activities. Evaluations on the DARPA TC and OpTC datasets demonstrate that GRASP consistently detects anomalous behavior, including known attack-related activities, outperforming existing systems. Our PIDS identifies all documented attacks on datasets where the behavior of executables is learnable. In addition, compared to existing systems, GRASP uncovers potentially malicious anomalous behavior not labeled as an attack in the documentation.

arXiv cs.SE / Testing and Analysis / Software testing / cs.SE

Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching

Junhao Chen, Jingxuan Zhang, Jian He, Yixuan Tang, Weiqin Zou
2026/05/08 22:26 (4 days ago)

The lexical and syntactic disparities among different programming languages (e.g., Java and Python) pose significant challenges for multi-language software engineering tasks such as cross-language code clone detection and code retrieval, since queries or code snippets written in one programming language often fail to match equivalent artifacts in another. To bridge this gap between different programming languages, we proposed a novel approach to construct a multi-language shared semantic space, in which functionally equivalent source code written in different programming languages are close to each other. In this approach, we first map the Abstract Syntax Tree (AST) node labels of the code snippets written in different programming languages into a unified label set, thus compressing high-dimensional language-specific tokens into a common embedding space. Then, we employ a Graph Matching Network (GMN) to encode the paired AST graphs into "semantic vectors" that capture functional equivalence between programming languages in a unified code vector space. In such a way, we can eliminate the differences in syntax between different programming languages. To validate the effectiveness of this approach, we apply it to two downstream tasks, including cross-language clone detection and cross-language code retrieval. Experiments demonstrate that our approach substantially outperforms the state-of-the-art baselines in cross-language clone detection, improving Precision from 95.62% to 99.94%, Recall from 97.72% to 99.92%, and F1 score from 96.94% to 99.93%. In terms of cross-language code retrieval, our approach raises the average Mean Reciprocal Rank (MRR) from 0.4909 to 0.5547, showing an absolute gain of 0.0638 (13% relative improvement), which demonstrates its superior ability to rank correct code snippets high across multiple programming languages.
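
The first step, mapping language-specific AST node labels onto a unified label set, can be pictured with a small sketch; the mapping entries below are invented examples, not the paper's actual label table.

```python
# Toy sketch of the label-unification step: map language-specific AST node
# types onto one shared label set before graph encoding. Entries are
# illustrative assumptions.
UNIFIED_LABELS = {
    # Java-style node types          # Python ast-module node types
    "MethodDeclaration": "FUNC_DEF", "FunctionDef": "FUNC_DEF",
    "IfStatement": "IF",             "If": "IF",
    "ForStatement": "LOOP",          "For": "LOOP",
    "WhileStatement": "LOOP",        "While": "LOOP",
    "ReturnStatement": "RETURN",     "Return": "RETURN",
}

def unify_labels(ast_node_labels):
    """Replace per-language node labels with shared ones; keep unknowns as OTHER."""
    return [UNIFIED_LABELS.get(label, "OTHER") for label in ast_node_labels]

print(unify_labels(["MethodDeclaration", "If", "ForStatement"]))
# ['FUNC_DEF', 'IF', 'LOOP']
```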

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Collaborative AI for SE / Prompt engineering for SE / cs.SE

Coding Agents Don't Know When to Act

Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev
2026/05/08 22:10 (4 days ago)

Coding agents are increasingly deployed to autonomously maintain software, including to resolve user-reported issues: a bug report comes in and the agent creates a patch to address it. However, in any real-world deployment, they will encounter stale bug reports about issues that have already been resolved. Agents should recognize this and abstain from modifying the code to avoid accumulating technical debt. To systematically evaluate whether current agents do so, we introduce FixedBench, a code benchmark with 200 human-verified coding tasks in which no code changes are required, testing five recent models across four agent harnesses. We find that even state-of-the-art models fail, proposing undesirable changes (excluding tests and documentation) in $35$ to $65\%$ of cases. Explicit instructions to reproduce the issue before patching partially address this issue but introduce a new failure mode: when an issue is partially fixed, they abstain even though a patch would still be needed. More broadly, our results indicate that LLMs fall prey to an action bias: they choose to act even if inaction would be appropriate. To break this pattern, inaction needs to be explicitly framed as a path to success, which highlights an overreliance on human guidance implicit in current training objectives.

arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / AI-enabled recommender systems for automated SE / Automating SE tasks with LLM and foundation models / Prompt engineering for SE / cs.SE

Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software

Bowei Ning, Xuejun Zong, Lian Lian, Kan He, Yifei Sun, Yuxiang Lei, Plamen Vasilev
2026/05/08 21:45 (4 days ago)

Automated vulnerability detection in critical-infrastructure software confronts a fundamental barrier: industrial software is routinely deployed as stripped, symbol-free binaries that deprive conventional Software Composition Analysis of the source-level transparency it requires. Existing binary analysis techniques close this Semantic Gap only partially -- graph-based detectors preserve structural syntax but discard behavioral semantics, while large language models supply rich semantic cues at the cost of unstable, hallucination-prone inference. To address this gap, we present a semantic-enhanced neuro-symbolic framework that reconstructs behavioral semantics directly from opaque binaries and performs tractable global risk reasoning. Three tightly coupled mechanisms drive this capability: (1) abstract interpretation combined with a reflexive prompting pipeline that structurally constrains a local LLM agent, effectively suppressing hallucinations; (2) a surjective transformation that compresses raw Code Property Graphs into typed Software Supply Chain Knowledge Graphs amenable to scalable reasoning; and (3) a domain-adapted Graphormer that captures long-range vulnerability propagation, augmented by embedding-space subgraph matching to uncover zero-day and APT-style attack patterns. Evaluated across three benchmarks of increasing domain specificity, the framework consistently outperforms all baselines on detection accuracy, semantic lifting fidelity, and APT fingerprint matching. Deployment on a hybrid virtual-physical testbed incorporating production-grade hardware from five ICS vendors further confirms strong detection coverage of high-impact CVEs while substantially reducing false-positive rates relative to leading commercial tools.

arXiv AI Coding 3Y / Analytics / Software metrics and measurements / cs.CL / cs.AI

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman
2026/05/08 21:36 (4 days ago)

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.

arXiv cs.SE / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Collaborative AI for SE / Prompt engineering for SE / cs.SE / cs.CY

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

Gaston Besanson
2026/05/08 21:34 (4 days ago)

Agentic AI systems increasingly act through tools, sub-agents, and external services, but governance controls are still commonly attached to prompts, dashboards, or post-hoc documentation. This creates a structural mismatch in regulated settings: obligations that must constrain execution are often evaluated only after execution has occurred. We introduce SARC, a runtime governance architecture for tool-using agents that treats constraints as first-class specification objects alongside state, action space, and reward. A SARC specification declares each constraint's source, class, predicate, verification point, response protocol, and operating point, and compiles these into four enforcement sites in the agent loop: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. We formalize the minimal invariants required for specification-trace correspondence, show why finite reward penalties do not generally substitute for hard runtime constraints, and extend the architecture to multi-agent workflows through constraint propagation, authority intersection, and attribution-preserving trace trees. We implement a prototype audit checker and report a reproducible synthetic evaluation over 50 seeds comparing SARC against post-hoc audit, output filtering, workflow rules, and policy-as-code-only baselines on a procurement task. SARC executes zero hard-constraint violations under exact predicates; its declared PAA throttling response reduces soft-window overages by 89.5% relative to policy-as-code-only. Predicate-noise and enforcement-failure sweeps are consistent with the claim that residual hard violations under SARC scale with enforcement-stack error rather than environmental violation opportunity. SARC provides the architectural substrate through which obligations can be made executable, inspectable, and auditable at runtime.
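
The Pre-Action Gate can be pictured with a minimal sketch; the names and constraint format below are assumptions rather than the SARC specification language.

```python
# Minimal sketch of a pre-action gate: evaluate declared constraint predicates
# against a proposed tool call before it executes, and block or escalate on
# violation. Field names and responses are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    predicate: Callable[[dict], bool]   # True means the action is allowed
    response: str                        # e.g., "block" or "escalate"

def pre_action_gate(action: dict, constraints: list[Constraint]) -> str:
    for c in constraints:
        if not c.predicate(action):
            return c.response            # stop before execution
    return "allow"

# Example: forbid purchase actions above a spending limit.
limit = Constraint("spend_limit", lambda a: a.get("amount", 0) <= 1000, "escalate")
print(pre_action_gate({"tool": "purchase", "amount": 2500}, [limit]))  # escalate
```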

arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / cs.DC

A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Ajay Navilarekal Rajgopal, Nikolai Solmsdorf
2026/05/08 21:32 (4 days ago)

Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources, e.g. a one-trillion-parameter GPT-style model requires an estimated 120 million exaflops. This challenge necessitates efficient distributed training strategies on cutting-edge High-Performance Computing (HPC) infrastructure. In this work, we explore the SuperMUC-NG Phase 2 (SMNG-P2) system at the Leibniz Supercomputing Centre (LRZ) in Garching, Germany, equipped with Intel Data Center GPU Max 1550 accelerators to extract the necessary computational power. We enable and investigate a comprehensive recipe of parallel training techniques, including tensor parallelism, pipeline parallelism, and sharded data parallelism, essential for facilitating the training of LLMs up to 175 billion-parameter scale on SMNG-P2. Through empirical assessment and extensive hyperparameter tuning, we analyze the complex interplay among these techniques and determine their impact on GPU computational efficiency. We identify an optimized combined strategy that yields high throughput and enables the efficient training of LLMs of varying sizes. Specifically, for the 175B model, we achieved per-tile throughput of 10% of theoretical peak per-tile bf16 FLOPs, employing an out-of-the-box publicly available software stack, utilizing standard distributions without further modification. This approach ensures broad accessibility, as our methodology can be replicated by any user on SMNG-P2 system without need for porting or specialized software engineering. Furthermore, we achieved 93% weak scaling efficiency and strong scaling efficiency of 82% on 128 nodes. This scalable recipe provides a crucial blueprint for efficiently utilizing advanced exascale systems for next-generation foundational model development.

arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / cs.SE / cs.AI

The AI-Native Large-Scale Agile Software Development Manifesto

Ricardo Britto, Fredrik Palmgren, Nishrith Saini, Marcus Ohlin
2026/05/08 21:20 (4 days ago)

Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain largely human-centric and manual, relying on coordination meetings, artifact synchronization, and role-based handoffs that inhibit real-time adaptation. Meanwhile, rapid advances in AI, particularly large language models, have begun transforming software engineering, yet their potential for organizational-level agility remains underexplored. We present the AI-Native Large-Scale Agile Software Development Manifesto: a set of values and principles that redefine how large-scale software development is organized when AI becomes a first-class participant rather than a peripheral tool. The manifesto is grounded in six principles (parallel processes, intent-driven teams, living knowledge, verification-first assurance, orchestrated agent workforces, and reusable blueprints) that together shift development from a meeting-driven, document-heavy, sequential process to an intelligent, adaptive, continuously learning system.

arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / Prompt engineering for SE / cs.SE

SafeTune: Search-based Harmfulness Minimisation for Large Language Models

Giordano d'Aloisio, David Williams, Giusy Annunziata, Zhiwei Fei, Antinisca Di Marco, Federica Sarro
2026/05/08 21:15 (4 days ago)

The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Prompt engineering for SE / AI-enabled recommender systems for automated SE / cs.SE

Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel

Jiashuo Tian, Dong Wang, Chen Yang, Haichi Wang, Zan Wang, Junjie Chen
2026/05/08 20:48 (4 days ago)

False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel. They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs. In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models (LLMs) for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%. These findings highlight the non-negligible cost of false positive bug reports and show the promise of LLMs for more efficient false positive mitigation in the Linux kernel.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.GT / cs.CR / cs.LG

Differentially Private Auditing Under Strategic Response

Florian A. D. Burnat
2026/05/08 20:44 (4 days ago)

Regulatory audits of AI systems increasingly rely on differential privacy (DP) to protect training data and model internals. We study audit design when the audited developer can strategically respond to the privacy-constrained audit interface. We formalize privacy-constrained auditing as a bilevel Stackelberg game, in which an auditor commits to a query policy and DP budget allocation across harm dimensions, and a strategic developer reallocates mitigation efforts in response. We introduce the welfare-weighted under-detection gap $B_w$, the welfare-weighted true residual harm the audit fails to detect at the developer's strategic best response, and prove that naive DP auditing (uniform or harm-proportional allocation) induces a strictly larger $B_w$ than any non-strategic mitigation baseline whenever effective detectability is heterogeneous, the welfare weights are not comonotone with detectability, and the developer's optimum is interior. We characterize the optimal auditor allocation as a four-factor balance of welfare weight, audit miss-probability, detectability elasticity, and mitigation-cost curvature, and provide a single-level reformulation of the bilevel problem via the developer's KKT system. We propose Strategic Private Audit Design (SPAD), a projected-gradient algorithm with hypergradients computed through the developer's best response.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.GT / cs.CR / cs.LG

Quotient Semivalues for False-Name-Resistant Data Attribution

Florian A. D. Burnat, Brittany I. Davidson
2026/05/08 20:34 (4 days ago)

Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness--Sybil frontier.
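
The quotient construction can be illustrated with a toy exact Shapley computation over clusters instead of raw identities; this is a sketch under simplifying assumptions (exhaustive enumeration, a toy value function), not the paper's mechanism.

```python
# Toy sketch: compute Shapley values over attribution clusters rather than
# pseudonymous contributor identities, so splitting one contributor across
# identities does not change the cluster-level payout.
from itertools import combinations
from math import factorial

def shapley_over_clusters(clusters, value):
    """clusters: list of cluster ids; value: callable on a frozenset of clusters."""
    n = len(clusters)
    phi = {c: 0.0 for c in clusters}
    for c in clusters:
        others = [x for x in clusters if x != c]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[c] += w * (value(frozenset(S) | {c}) - value(frozenset(S)))
    return phi

# Example: a unanimity game where only the grand coalition of clusters has value.
clusters = ["alice", "bob"]
v = lambda S: 1.0 if S == frozenset(clusters) else 0.0
print(shapley_over_clusters(clusters, v))   # {'alice': 0.5, 'bob': 0.5}
```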

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR

CCX: Enabling Unmodified Intel SGX Applications on Arm CCA

Matti Schulze, Thorsten Holz, Felix Freiling
2026/05/08 18:22 (4 days ago)

Novel confidential computing technologies such as Intel TDX, AMD SEV, and Arm CCA have recently emerged. In practice, due to its minimal trust boundaries, Intel SGX still remains widely used for enclave-based applications in cloud environments, including confidential cloud services, privacy-preserving communication, secure payment processing, and privacy-focused advertising. With the growing adoption of Arm CPUs in cloud systems, however, existing SGX applications face a significant portability challenge: they are tightly coupled to SGX-specific APIs and execution semantics. In this paper, we present the design and implementation of CCX, a framework that enables existing SGX applications to run on Arm CCA without source code modification. To this end, CCX redesigns SGX functionality within Arm CCA firmware, adapting SGX abstractions to CCA's architecture design while preserving full compatibility with existing applications originally developed for SGX. We implemented a prototype of CCX on both the QEMU emulator and a Nitrogen8M development board. Our evaluation shows that CCX is capable of executing existing SGX applications without requiring source code changes, while providing security guarantees comparable to Intel SGX and achieving performance improvements in our evaluated settings.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR / cs.LG

GESR: Graph-Based Edge Semantic Reconstruction for Stealthy Communication Detection with Benign-Only Training

Henghui Xu, Yuchen Zhang, Xiaobo Ma
2026/05/08 18:11 (4 days ago)

Detecting stealthy malicious communications from flow logs under benign-only training remains a critical challenge in network security. Malicious communications often camouflage as normal traffic like standard HTTPS flows. Conventional intrusion detectors rely strictly on known labeled attacks. Alternatively, they score flows completely independently. These approaches fail against sparse and context-dependent suspicious activity. To capture this essential context, graph anomaly detectors have been introduced to add valuable relational information to the analysis. However, existing methods fail to test the structural consistency of specific communication edges. To overcome these fundamental limitations, we present GESR, a novel graph-based framework for detecting suspicious communications and anomalous hosts under a benign-only training setting. GESR models complex network activity as attributed communication graphs. It cleverly reconstructs edge semantics entirely from local structural context rather than isolated features. This non-intuitive design forces the framework to predict expected communication patterns from neighborhood topologies. Attackers cannot easily manipulate this deep structural dependency. The model then converts the resulting structural inconsistencies into host-level anomaly scores. It utilizes robust Median Absolute Deviation (MAD) calibration for this final step. We evaluate GESR extensively on CTU-13 and CICIDS2017 datasets. These evaluations strictly impose tight false-positive operating constraints. On CICIDS2017, GESR achieves an outstanding ROC-AUC of 0.9753. It also yields a high TPR of 0.8569 at a strict 5% FPR threshold. GESR consistently outperforms existing methods across both evaluated benchmarks. The results prove that structure-conditioned edge reconstruction is a credible direction for practical intrusion detection.
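
The final MAD calibration step can be sketched as follows; the exact formula and constants GESR uses are not given in the abstract, so this is an assumed textbook variant.

```python
# Sketch: convert per-host reconstruction errors into robust anomaly scores
# using the Median Absolute Deviation (MAD).
import numpy as np

def mad_scores(errors):
    """Robust z-like scores: large values indicate anomalous hosts."""
    errors = np.asarray(errors, dtype=float)
    median = np.median(errors)
    mad = np.median(np.abs(errors - median))
    # 1.4826 makes MAD consistent with the standard deviation under normality.
    return np.abs(errors - median) / (1.4826 * mad + 1e-12)

print(mad_scores([0.10, 0.12, 0.11, 0.09, 0.95]))  # the last host stands out
```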

arXiv cs.CR / Dependability and Security / Reliability, availability, and safety / cs.CR / eess.SY

Resilience of IEC 61850 Sampled Values-Based Protection Systems Under Coordinated False Data Injections

Denys Mishchenko, Irina Oleinikova, Laszlo Erdodi
2026/05/08 18:08 (4 days ago)

This paper assesses the resilience of IEC 61850 digital substations under False Data Injection Attacks (FDIAs) targeting the Sampled Values (SV) protocol. The multicast nature of SV, while enabling time-critical automation, exposes substations to cyber intrusions capable of disrupting protection functions and causing large-scale outages. To evaluate these risks, coordinated attack vectors involving both physical and cyber access at the bay level are experimentally analyzed using an advanced setup based on industrial-grade intelligent electronic devices (IEDs). The proposed attacks simultaneously manipulate multiple electrical parameters in a coordinated and physically consistent manner. Experimental results confirm the feasibility of stealthy multi-vector FDIAs that can trigger false protection actions, conceal real faults, or block protection mechanisms while maintaining realistic signal behavior. The Power Hardware-in-the-Loop (PHIL) testbed enables closed-loop evaluation under strict timing, communication, and protection logic constraints, reflecting real device behavior beyond simulation and controller-level HIL environments. The findings reveal critical vulnerabilities in SV-based protection schemes that directly affect grid reliability, particularly under realistic attacker positioning. To address these challenges, a defense strategy covering deterrence, prevention, detection, mitigation, and resilience is analyzed, with emphasis on bay-level infrastructure. Furthermore, a resilience-oriented method based on trusted independent channels and cross-verification of SV data within the protection logic is outlined as a complementary countermeasure for scenarios where existing standardized security mechanisms are insufficient.

arXiv cs.SE / Testing and Analysis / Software testing / cs.SE

System Test Generation for Virtual Reality Applications using Scenario Models

Gerry Longfils, Maxime Cauz, Arnaud Blouin, Xavier Devroey
2026/05/08 18:07 (4 days ago)

Virtual Reality (VR) applications are increasingly being integrated across a wide range of domains, including surgical training and industrial marketing. However, the long-term adoption and maintenance of VR applications remain limited, particularly due to the lack of effective, systematic, and reproducible software testing approaches tailored to their unique characteristics. To address this issue, we introduce UltraInstinctVR, a novel testing approach for VR applications. Relying on predefined VR models (scenarios), it automates the generation and execution of concrete VR system tests. In our empirical evaluation, we compare UltraInstinctVR with state-of-the-art automated VR testing approaches in terms of coverage and failure detection on 10 open-source VR applications. The results show that UltraInstinctVR outperforms existing automated tools for detecting unique failures and provides valuable insights for identifying real-world bugs in VR applications.

arXiv cs.SE / Testing and Analysis / Software testing / cs.RO / cs.SE

Search-based Robustness Testing of Laptop Refurbishing Robotic Software

Erblin Isaku, Hassan Sartaj, Shaukat Ali, Malaika Din Hashmi, Francois Picard
2026/05/08 18:02 (4 days ago)

The Danish Technological Institute (DTI) focuses on transferring advanced technologies (including robots) to the industry and the public sector. One key application is laptop refurbishment using specialized robots, aimed at promoting reuse, reducing electronic waste, and supporting the European Circular Economy Action Plan. The software of such robots often includes features that use object detection models to detect objects for various purposes, such as identifying screws for laptop disassembly or detecting stickers to remove them. Ensuring the robustness of such models to small input variations remains a critical challenge, and addressing it is important to avoid potential damage to laptops during refurbishment. In this paper, we propose PROBE, a search-based robustness testing approach that leverages multi-objective optimization to identify minimal, localized perturbations that expose failures in object detection models used in the software of laptop refurbishing robots. PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases. Results show that PROBE is 3$\times$ to 7$\times$ more effective than random search in generating failure-inducing perturbations, while requiring smaller perturbation magnitudes, and that the generated perturbations transfer across models. We further show that metamorphic relations provide additional insights into model robustness, enabling the assessment of stability even in non-failing cases.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / AI-enabled recommender systems for automated SE / Automating SE tasks with LLM and foundation models / cs.SE

Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

Luciano Baresi, Domenico Bianculli, Maryse Ernzer, Livia Lestingi, Fabrizio Pastore, Seung Yeob Shin
2026/05/08 17:55 (4 days ago)

Large Language Models (LLMs) show strong capabilities in code generation, motivating their use in automated quantum solver development. However, in quantum computing, successful execution of generated code is not sufficient: correctness depends on numerically accurate results, which are sensitive to non-trivial mappings, hybrid quantum-classical workflows, and algorithm-specific approximations. This work introduces Q-SAGE, an iterative methodology to evaluate LLMs' capability in generating quantum solvers for scientific problems. The methodology adopts an iterative approach by executing the script generated by the LLM, comparing the result with the result of a classical solver, and refining the script until the two results match within a tolerance threshold. We empirically evaluated the methodology with five families of scientific problems of different complexities and five LLMs, both open source and proprietary. The results show that iterative refinement substantially improves success rates, but introduces a significant computational overhead. Moreover, as model capability increases, failure modes shift from execution errors to numerical inaccuracies, highlighting the current limitations of LLM-based quantum software.
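
The refine-until-match loop can be sketched schematically; the function names below are placeholders, not the Q-SAGE interface.

```python
# Schematic loop (not the Q-SAGE implementation): generate a quantum solver
# script with an LLM, run it, compare the numeric result to a classical
# reference, and refine until they agree within a tolerance or the iteration
# budget is exhausted. generate_script, run_script, and classical_reference
# are placeholder callables supplied by the caller.
def iterative_refinement(problem, generate_script, run_script,
                         classical_reference, rtol=1e-2, max_iters=5):
    reference = classical_reference(problem)
    feedback = None
    for _ in range(max_iters):
        script = generate_script(problem, feedback)
        try:
            result = run_script(script)
        except Exception as exc:               # execution error: feed it back
            feedback = f"Execution failed: {exc}"
            continue
        if abs(result - reference) <= rtol * abs(reference):
            return script, result              # numerically acceptable
        feedback = (f"Result {result} deviates from the classical "
                    f"reference {reference}; revise the algorithm.")
    return None, None                          # no acceptable solver found
```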

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR

An Automated Framework for Cybersecurity Policy Compliance Assessment Against Security Control Standards

Bikash Saha, Sandeep Kumar Shukla
2026/05/08 17:45已过 4 天

Organizational cybersecurity policies are often examined to determine whether they adequately comply with standard security controls. This task is difficult because control statements are abstract, whereas policy documents describe governance practices in varied natural language. As a result, policy-based control assessment is time-consuming, difficult to standardize, and often difficult to document in a traceable manner. To address this gap, we present PROPARAG, an audit support approach for autonomously evaluating organizational cybersecurity policies against security controls. For each control, the approach retrieves relevant policy evidence, assesses coverage, identifies missing elements, and generates supporting explanations and recommendations. We evaluate PROPARAG on two real-world organizational policy corpora using 1,007 NIST SP 800-53 controls across both closed-source and open-source large language models (LLMs). The framework achieves F1 scores of 88.54 on OrgA and 82.31 on OrgB. The evaluation also shows that PROPARAG identifies relevant gaps in documented organizational policies and generates grounded recommendations for each identified gap. This research provides a foundation for LLM-powered autonomous control-level assessment of organizational cybersecurity policies.
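
A per-control retrieve-assess-explain step could look roughly like the sketch below; the function names, the field layout, and the coverage labels are illustrative assumptions rather than PROPARAG's actual interfaces.

```python
def assess_control(control, policy_index, retrieve, llm_judge, top_k=5):
    """Assess one security control against an organization's policy corpus.

    `retrieve` returns candidate policy passages for a control statement, and
    `llm_judge` grades coverage and drafts gaps/recommendations from that evidence;
    both are hypothetical stand-ins."""
    evidence = retrieve(policy_index, control["text"], top_k=top_k)
    verdict = llm_judge(
        control=control["text"],
        evidence=[passage["text"] for passage in evidence],
    )
    return {
        "control_id": control["id"],
        "coverage": verdict["coverage"],            # e.g., covered / partial / missing
        "missing_elements": verdict["gaps"],
        "recommendations": verdict["recommendations"],
        "supporting_passages": [passage["source"] for passage in evidence],
    }
```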

arXiv cs.SETesting and AnalysisarXivSoftware testingcs.SEarXiv cs.SE

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

Yang Liu, Hongjiang Feng, Junsong Pu, Zhuangbin Chen
2026/05/08 17:40已过 4 天

Failure attribution in LLM-based multi-agent systems aims to identify the steps that contribute to a failed execution. This task remains difficult because a single execution can contain many agent actions and tool calls, failure evidence can appear many steps after the original mistake, and existing methods often rely on costly agent workflows, replay, or training on synthetic failure logs. To address these challenges, we propose MASPrism, a lightweight framework that performs failure attribution using prefill-stage signals from a small language model (SLM). MASPrism first extracts token-level negative log-likelihood and attention weights during a prefill pass to identify symptom-like steps and earlier candidate sources, without decoding. It then reconstructs a focused diagnostic prompt and performs a second prefill pass to rank failure-source candidates. Using Qwen3-0.6B as the SLM, MASPrism achieves the best performance on three of the four evaluated subsets across Who&When and TRAIL, improving Top-1 accuracy on Who&When-HC by 33.41% over the best baseline. On TRAIL, MASPrism outperforms strong proprietary LLMs, including Gemini-2.5-Pro, with up to 89.50% relative improvement. MASPrism processes each trace in 2.66 seconds on average, achieving a 6.69$\times$ speedup over the single-pass prompting baseline, with zero output tokens. These results show that MASPrism provides an effective and practical framework for failure attribution in long multi-agent execution logs.
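
The prefill-stage signal extraction is the part that lends itself to a short sketch: one forward pass over the trace, no decoding, per-token negative log-likelihood. The model id matches the SLM named in the abstract, but the exact scoring and the attention aggregation used by MASPrism are not shown and the details here are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-0.6B"  # SLM named in the abstract; usage details are assumed
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def prefill_token_nll(trace_text: str) -> torch.Tensor:
    """Per-token negative log-likelihood from a single prefill pass (zero output tokens)."""
    ids = tok(trace_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                    # one forward pass over the trace
    shift_logits = logits[:, :-1, :]
    shift_labels = ids[:, 1:]
    nll = torch.nn.functional.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).reshape(shift_labels.shape)
    return nll                                        # spikes can mark symptom-like steps
```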

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securityConfidentialityintegritycs.CRarXiv cs.CR

Cross-Modal Backdoors in Multimodal Large Language Models

Runhe Wang, Li Bai, Haibo Hu, Songze Li
2026/05/08 17:29已过 4 天

Developers increasingly construct multimodal large language models (MLLMs) by assembling pretrained components, introducing supply-chain attack surfaces. Existing security research primarily focuses on poisoning backbones such as encoders or large language models (LLMs), while the security risks of lightweight connectors remain unexplored. In this work, we propose a novel cross-modal backdoor attack that exploits this overlooked vulnerability. By poisoning only the connector using a single seed sample and several augmented variants from one modality, the adversary can subsequently activate the backdoor using inputs from other modalities. To achieve this, we first poison the connector to associate a compact latent region with a malicious target output. To activate the backdoor from other modalities, we further extract a malicious centroid from the poisoned latent representations and perform input-side optimization to steer inputs toward this latent anchor, without requiring repeated API queries or full-model access. Extensive evaluations on representative connector-based MLLM architectures, including PandaGPT and NExT-GPT, demonstrate both the effectiveness and cross-modal transferability of the proposed attack. The attack achieves up to 99.9% attack success rate (ASR) in same-modality settings, while most cross-modal settings exceed 95.0% ASR under bounded perturbations. Moreover, the attack remains highly stealthy, producing negligible leakage on clean inputs, and maintaining weight-cosine similarity above 0.97 relative to benign connectors. We further show that existing defense strategies fail to effectively mitigate this threat without incurring substantial utility degradation. These findings reveal a fundamental vulnerability in multimodal alignment: a single compromised connector can establish a reusable latent-space backdoor pathway across modalities, highlighting the need for safer modular MLLM design.

arXiv cs.CRArchitecture and DesignarXivArchitecture quality attributescs.CRarXiv cs.CR

Spying Across Chiplets: Side-Channel Attacks in 2.5/3D Integrated Systems

Giorgio Di Natale, Christelle Rabache, Pierre-Louis Hellier, Florence Podevin, Sylvain Bourdel, Romain Siragusa, Paolo Maistri
2026/05/08 17:27已过 4 天

Advanced packaging and chiplet-based integration are increasingly adopted to build complex heterogeneous systems beyond the limits of monolithic scaling. While these architectures offer major benefits in terms of modularity, yield, and performance, they also introduce new physical attack surfaces. In this paper, we show that side-channel attacks can be mounted across chiplets within the same package or stack. Our key idea is that a communication-oriented chiplet, originally intended to interact with the external environment through an antenna, an RFID-like element, or another contactless coupling structure, can be repurposed as an internal observation platform. We formalize this threat through a realistic adversary model, describe the corresponding attack principle, and experimentally assess its feasibility. The obtained results demonstrate that signals captured through such a communication-oriented interface can reveal information correlated with the activity of a neighboring victim chiplet.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.AIarXiv cs.CR

Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs

Jonathan Hong Jin Ng, Anh Tu Ngo, Anupam Chattopadhyay
2026/05/08 17:24已过 4 天

In this paper, we investigate recent state-of-the-art schemes for watermarking large language model (LLM) outputs. These techniques are claimed to be robust, scalable, and production-grade, aimed at promoting responsible usage of LLMs. We analyse the effectiveness of these watermarking techniques against an extensive collection of modified-text attacks, which perform targeted semantic changes without altering the general meaning of the text content. Our approach encompasses multiple attack strategies, including lexical alterations, machine translation, and even neural paraphrasing. The attack efficacy is measured with two target criteria: successful removal of the watermark and preservation of semantic content. We evaluate semantic preservation through BERT scores, text complexity measures, grammatical errors, and Flesch Reading Ease indices. The experimental results reveal varying levels of effectiveness among different watermarking models, with the same underlying result that it is possible to remove the watermark with reasonable effort. This study sheds light on the strengths and weaknesses of existing LLM watermarking systems, suggesting how such schemes should be constructed to improve their security.

arXiv cs.CRAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsCollaborative AI for SEcs.CRcs.AIcs.MAarXiv cs.CR

HBEE: Human Behavioral Entropy Engine -- Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion

Vickson Ferrel
2026/05/08 17:19已过 4 天

Insider threat detection assumes that an adaptive insider leaves behavioral residue distinguishing them from legitimate users. We test this assumption against an LLM-driven adaptive insider in a controlled multi-agent simulator. Our pre-registered five-condition study isolates defender mode (cascade vs. blind UEBA) crossed with adversary type (naive vs. adaptive OPSEC) plus a no-mole control, across 100 runs (95 valid after pre-committed exclusions). The primary finding is a detection inversion: at T_60, the adaptive mole's suspicion in-degree is statistically lower than a randomly selected innocent agent (Cliff's delta = -0.694, 95% BCa CI [-0.855, -0.519], Mann-Whitney p << 0.01). The pre-registered prediction was the opposite direction. A pre-registered equivalence test (H2) shows adaptive OPSEC produces no detectable shift in the mole's UEBA rank under either defender mode. The two detection signals (peer suspicion graph in-degree and per-agent UEBA rank) decouple under adaptive adversary behavior. We bound generalization explicitly: a pre-registered Gini calibration check (H4) returns FAIL, with HBEE pairwise message-exposure Gini (0.213) diverging from the SNAP Enron reference (0.730) by |Delta Gini| = 0.52, exceeding the equivalence bound by 5x. The paper makes a narrow but surprising claim: in a controlled environment where adaptive OPSEC is implementable as an LLM directive, peer-suspicion-cascade detection inverts. We release the simulator, pre-registration document, frozen scenarios, raw telemetry, and analysis pipeline under an open-source license.

arXiv cs.CRDependability and SecurityarXivDependability and security for embedded and cyber-physical systemscs.CRcs.MMarXiv cs.CR

Forensic analysis of video data deletion and recovery in Honeywell surveillance file system

Jinhee Yoon, Sungjae Hwang
2026/05/08 16:31已过 4 天

Real-time video surveillance systems store recorded video using digital video recorders (DVRs) and network video recorders (NVRs). To support continuous high-volume video storage, these devices employ specialized, nonstandard file systems that are often proprietary and undocumented. This lack of documentation significantly increases the time and effort required for forensic analysis. In this study, we analyze an undocumented proprietary file system used by Honeywell video surveillance devices (one that, to the best of our knowledge, has not been examined in prior work), investigate its deletion mechanisms, and demonstrate the feasibility of video recovery after deletion. We perform a file system analysis using a binary diffing technique and evaluate three deletion methods supported by the target device: 1) formatting-based deletion, 2) data expiration, and 3) overwrite. For each method, we investigate changes in file system metadata and on-disk data structures and demonstrate the feasibility of video data recovery. Our findings aim to support more efficient and accurate forensic investigations of Honeywell surveillance products and provide foundational insights into the analysis of proprietary file systems used in video recording devices.

arXiv cs.SE / arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsPrompt engineering for SEAI-enabled recommender systems for automated SEcs.SEcs.AIarXiv cs.SEarXiv LLM4Code / Program Repair

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Moaath Alshaikh, Tasneem Alshaher, Ricardo Vieira, Beatriz Santana, Clelio Xavier, Jose Amancio, Glauco Carneiro, Julio Leite, Savio Freire, Manoel Mendonca
2026/05/08 16:16已过 4 天

Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Delta kappa = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially -- DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD approx. 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of "Sharing Negative Feedback" is identified across all models (bias ratios up to 5.25x), alongside consistent under-prediction of "Expressing Concerns." Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.
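
The agreement metric at the heart of the study design can be sketched directly; the label strings in the commented example are made up, and the per-run structure (ten runs per prompt configuration) is taken from the abstract.

```python
from sklearn.metrics import cohen_kappa_score

def agreement_over_runs(human_codes, model_runs):
    """Cohen's kappa between human closed codes and each independent model run.

    `human_codes` is a list of labels; `model_runs` is a list of label lists,
    one per run of a given prompt configuration."""
    return [cohen_kappa_score(human_codes, run) for run in model_runs]

# Illustrative use with hypothetical code labels:
# kappas = agreement_over_runs(["PS1", "PS2", "PS1"], [["PS1", "PS2", "PS2"]])
```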

arXiv cs.CRDependability and SecurityarXivReliabilityavailabilityand safetycs.MAcs.AIcs.CRarXiv cs.CR

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Jianming Chen, Yawen Wang, Junjie Wang, Zhe Liu, Qing Wang, Fanjiang Xu
2026/05/08 16:06已过 4 天

Tool-calling text-to-image (T2I) agents can plan and execute multi-step tool chains to accomplish complex generation and editing queries. However, this capability introduces a new safety attack surface: harmful outputs may arise from tool orchestration, where individually benign steps combine into unsafe results, making prompt-only jailbreak techniques insufficient. We present OrchJail, an orchestration-guided fuzzing framework for jailbreaking tool-calling T2I agents. Its core idea is to exploit high-risk tool-orchestration patterns: by learning from successful jailbreak tool-calling traces and their causal relationships to prompt wording, OrchJail directly guides the fuzzing search toward prompts that are more likely to trigger unsafe multi-step tool behaviors, rather than relying on surface-level textual perturbations. Extensive experiments demonstrate that OrchJail improves jailbreak effectiveness and efficiency across representative tool-calling T2I agents, achieving higher attack success rates, better image fidelity, and lower query costs, while remaining robust against common jailbreak defenses. Our work highlights tool orchestration as a critical, previously unexplored attack surface and provides a novel framework for uncovering safety risks in T2I agents.

arXiv cs.SE / arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsAI-enabled recommender systems for automated SEPrompt engineering for SEcs.SEarXiv cs.SEarXiv LLM4Code / Program Repair

Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair

Xinyue Liang, Jingxuan Zhang, Lin Li, Jun Zhang, Junhao Chen
2026/05/08 15:58已过 4 天

With the rapid evolution of emerging programming language ecosystems, the demand for code translation to low-resource languages continues to grow. As Cangjie emerges as a new programming language, its ecosystem and development toolchains are rapidly expanding. Automated translation from popular programming languages to Cangjie is therefore valuable for practical development. However, constrained by both insufficient Cangjie knowledge and scarce parallel code corpora, general Large Language Models (LLMs) are prone to syntactic errors and to semantic as well as structural misalignment in code translation. Existing approaches typically rely on fine-tuning with large-scale parallel data, but they cannot reliably improve compilability or semantic consistency for the low-resource Cangjie language. To tackle these challenges, we propose a multi-stage training framework for LLMs that employs an iterative error repair technique to translate Java code into Cangjie code. The framework trains the LLM in stages, gradually integrating knowledge and achieving semantic alignment as well as structure awareness. During code translation, we also combine compiler feedback with error-repair case retrieval to repair incorrect Cangjie code. We construct syntactic knowledge and monolingual instruction datasets to train the LLM. In addition, we build a Cangjie error repair repository to support error repair in our approach. Experimental results show that, with limited parallel data, our approach improves functional equivalence by 6.06\% compared to the state-of-the-art approaches. Meanwhile, ablation studies confirm that each training stage positively contributes to the final performance.
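
The inference-time repair loop sketched below follows the compile-feedback-plus-retrieval idea in spirit; all callables (`translator_llm`, `cangjie_compile`, `retrieve_repair_cases`, `repair_llm`) are hypothetical stand-ins, not the paper's components.

```python
def translate_with_repair(java_code, translator_llm, cangjie_compile,
                          retrieve_repair_cases, repair_llm, max_rounds=3):
    """Translate Java to Cangjie, then iteratively repair the result using
    compiler diagnostics and similar cases from an error-repair repository."""
    cangjie_code = translator_llm(java_code)
    for _ in range(max_rounds):
        ok, diagnostics = cangjie_compile(cangjie_code)
        if ok:
            return cangjie_code
        similar_cases = retrieve_repair_cases(diagnostics)   # repair-repository lookup
        cangjie_code = repair_llm(cangjie_code, diagnostics, similar_cases)
    return cangjie_code
```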

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

From Conceptual Scaffold to Prototype: A Standardized Zonal Architecture for Wi-Fi Security Training

Vyron Kampourakis, Efstratios Chatzoglou, Vasileios Gkioulos, Sokratis Katsikas
2026/05/08 15:55已过 4 天

Wi-Fi is the dominant wireless access technology, but its widespread use also exposes systems to threats such as rogue access points, deauthentication attacks, and other IEEE 802.11-specific vulnerabilities. Although Cyber Ranges (CRs) have become valuable platforms for cybersecurity training and experimentation, existing wireless-oriented solutions mainly target heterogeneous IoT or mobile-network settings, with Wi-Fi typically treated as one among many. As a result, dedicated CR environments for Wi-Fi-specific security experimentation remain limited. This gap is particularly relevant because wireless attacks often require protocol-aware experimentation that is difficult to reproduce in conventional training environments. This paper introduces a conceptual architecture for a Wi-Fi-focused CR tailored to IEEE 802.11 security scenarios and an open-source prototype. The proposed design is grounded in established CR design principles and organized around core infrastructure, learning management and support, monitoring, management, and access-control zones. Structuring the platform into these distinct zones, the architecture supports modularity, scalability, and future extensibility. Part of the design is realized in a prototype publicly available in a GitHub repository that implements the scenario generation, storage, retrieval, and instantiation workflow, offering an initial practical foundation for the proposed architecture. Overall, the paper provides a structured foundation for the future implementation of Wi-Fi-specialized CR platforms for targeted experimentation.

arXiv AI Coding 3YAnalyticsarXivSoftware metrics and measurementscs.LGcs.AIcs.CLarXiv AI Coding 3Y

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

Saloni Garg, Amit Sagtani
2026/05/08 15:49已过 4 天

Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling": queries no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.

arXiv cs.SEHuman and Social AspectsarXivTeamscommunitiesand companiescs.SEcs.LGarXiv cs.SE

Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry

A. Azamnouri, M. Haug, L. Woltmann, M. Fritz, J. Bogner, S. Wagner
2026/05/08 15:43已过 4 天

The integration of machine learning (ML) into complex software systems has increased challenges in collaboration and communication (CoCo) for the teams building these systems. ML engineering (MLE) teams often involve diverse roles, including ML engineers, data scientists, software engineers, and domain experts, each bringing unique goals, experiences, and jargon. These interdisciplinary dynamics can make it challenging to deploy, reproduce, and maintain ML-enabled systems over the long term. Previous studies have uncovered several CoCo challenges and practices, but most have focused on software-centric companies, leaving limited empirical understanding of how these dynamics unfold in hardware-centric contexts. In hardware-centric environments, CoCo challenges are shaped by additional constraints such as strict data governance, long development cycles, and tight coupling with physical processes, which amplify coordination complexity and reduce flexibility. To strengthen empirical understanding in such settings, we present a qualitative investigation of MLE teams within a global semiconductor company, where ML-enabled systems and manufacturing processes introduce additional complexity. We interviewed 12 practitioners regarding CoCo practices, tools, challenges, and approaches. Through analysis, we identified 16 recurring challenges, with unclear roles and responsibilities emerging as the most critical, along with common practices and recommendations that practitioners considered effective in mitigating CoCo problems. While grounded in a single organizational context, our findings align with known issues in interdisciplinary ML-enabled systems development, but also demonstrate how these challenges manifest differently under hardware-driven constraints. Our results highlight directions for future research and tool support to strengthen CoCo in MLE projects and ensure the success of ML-enabled systems.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.GTcs.CRarXiv cs.CR

Game-Theoretic Analysis of Transaction Selection in DAG-Based Distributed Ledgers

Sebastian Müller, Alexandre Reiffers-Masson
2026/05/08 15:42已过 4 天

Transaction selection in parallel or DAG-based distributed ledger technologies (DLTs) is a crucial challenge that directly impacts throughput, fairness, and validator incentives. In these systems, validators independently choose transactions to include in their blocks, often relying on naive heuristics like uniform or proportional selection. This can lead to inefficient outcomes when validators prioritize their own rewards without considering collective impacts. We analyze two fee allocation mechanisms used in practice: Random Fee Allocation (RFA), where transaction fees are randomly assigned to one validator, and Collaborative Fee Sharing (CFS), where fees are distributed equally among all validators. Using a single-shot game-theoretic framework, we derive symmetric Nash equilibria (NE) for selecting transactions for both mechanisms and propose an optimization-based method to compute these equilibria. Numerical simulations demonstrate that the NE of CFS consistently achieves higher throughput and rewards compared to the NE of RFA, particularly under skewed fee distributions. Additionally, we compare these equilibrium strategies to naive benchmarks (uniform and proportional selection), showing that the proportional strategy outperforms the NE of RFA in many situations. These findings may provide actionable insights into the design of transaction selection and incentive mechanisms, enabling more robust and high-performance DAG-based DLTs.
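
A toy worked example helps show why the two fee rules create different incentives. The payoff rules below are a simplified reading of the abstract (RFA gives a transaction's fee to one of its includers uniformly at random; CFS pools all included fees and splits them evenly); the fee values and two-validator setup are purely illustrative.

```python
from itertools import product

FEES = {"t1": 10.0, "t2": 1.0}      # two pending transactions with skewed fees
VALIDATORS = ["v1", "v2"]

def payoffs(profile, mechanism):
    """Expected payoff per validator for one pure strategy profile (validator -> tx)."""
    included = set(profile.values())
    if mechanism == "CFS":
        share = sum(FEES[t] for t in included) / len(VALIDATORS)
        return {v: share for v in VALIDATORS}
    # RFA: expected fee = fee / (number of validators that included the transaction)
    return {v: FEES[t] / sum(1 for c in profile.values() if c == t)
            for v, t in profile.items()}

for choice in product(FEES, repeat=len(VALIDATORS)):
    profile = dict(zip(VALIDATORS, choice))
    print(profile, "RFA:", payoffs(profile, "RFA"), "CFS:", payoffs(profile, "CFS"))
```

In this tiny example, CFS pays both validators more when they cover different transactions (5.5 each versus 5.0 each), while RFA turns the high-fee transaction into a coordination conflict, which is consistent with the higher throughput reported for CFS equilibria.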

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRstat.AParXiv cs.CR

Combating Organized Platform Abuse: Amplifying Weak Risk Signals with Structural Information

Meng He, Jia Long Loh
2026/05/08 15:38已过 4 天

Large-scale online service platforms face severe challenges from organized platform abuse: multiple forms such as credit card fraud and promotion abuse continually emerge, characterized by large numbers of involved accounts, rapid outbreaks, and constantly shifting tactics. Existing mainstream approaches, whether heuristic rules limited in precision, supervised learning with insufficient generalization, or graph models that are engineering-heavy and dependent on seed users, have failed to address such threats effectively. This paper returns to first principles and, starting from the economic constraints of fraudulent behavior, proposes the Fraudster's Trilemma: organized attackers cannot simultaneously achieve scale, low cost, and dispersed cash-out. Building on this theory, we derive a robust structural invariant in organized fraud, namely centralized cash-out, and use a simple statistical method to turn low-precision individual weak signals into high-precision strong decisions. The method requires no labels, is nearly parameter-free, white-box interpretable, has linear complexity O(|E|), avoids cold-start issues, and its detection logic possesses the "open-hand" property: attackers cannot evade it even when fully informed. We validate the approach on two real fraud incidents in backtests. In the promotion abuse case, a single near-zero-cost weak signal (global Precision of only 16%) after structural amplification achieves Precision above 91% and Recall exceeding 99% (z=10.0); at a higher threshold (z=40.0), Precision reaches 93.7%. In the credit card fraud case, an infrastructure-layer weak signal (device spoofing) successfully detects payment-layer attacks without any business-logic linkage, revealing the framework's natural MO-agnostic property: it relies more on the structural invariant than on signal semantics.
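
One way to picture the amplification step is to aggregate a low-precision per-account flag over the shared cash-out destination and test its concentration. The sketch below uses a normal approximation to the binomial and an illustrative z threshold; the paper's exact statistic, thresholds, and signal definitions may differ.

```python
import math
from collections import defaultdict

def amplify_weak_signal(accounts, base_rate, z_threshold=10.0):
    """Flag cash-out destinations where a weak per-account signal is over-concentrated.

    `accounts` is an iterable of dicts with keys `cashout_id` and `weak_flag`
    (a low-precision boolean signal); `base_rate` is the global rate at which
    the weak signal fires on ordinary accounts."""
    totals, hits = defaultdict(int), defaultdict(int)
    for acct in accounts:
        totals[acct["cashout_id"]] += 1
        hits[acct["cashout_id"]] += int(acct["weak_flag"])
    flagged = {}
    for dest, n in totals.items():
        expected = n * base_rate
        std = math.sqrt(n * base_rate * (1.0 - base_rate)) or 1.0
        z = (hits[dest] - expected) / std
        if z >= z_threshold:
            flagged[dest] = z        # destinations whose accounts fire the signal far above chance
    return flagged
```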

arXiv cs.SETesting and AnalysisarXivSoftware testingcs.SEarXiv cs.SE

Low-code and no-code with BESSER to create and deploy smart web applications

Iván Alfonso, Armen Sulejmani, Aaron Conrardy, Jordi Cabot
2026/05/08 15:31已过 4 天

The increasing demand for web applications containing AI agents, seen as smart web applications, has prompted the need for new techniques to facilitate their creation. Low-code has risen as an approach that reduces the amount of handwritten code by focusing on the abstraction of components in the form of models combined with automated generators to produce applications. Existing low-code platforms are mostly commercial, leading to drawbacks such as the risk of vendor lock-in and limited extensibility. We present the open-source BESSER low-code framework, which allows users to design, generate and deploy their application via a freely accessible web-based editor, while guaranteeing transparency and extensibility.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.LGcs.AIcs.SEarXiv AI Coding 3Y

Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

Hugh Xuechen Liu, Kıvanç Tatar
2026/05/08 14:46已过 4 天

Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named `Mage') -- compile success, runtime success, structural fidelity, and mechanism adherence -- applied to 858 generation attempts across four open-weight LLMs (7B--30B), 26~hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C\# generation achieves the highest runtime-pass rate (43\% mean) yet produces structurally vacuous scenes (mechanism $F_1 \approx 0.12$). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ($F_1$ up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar $p = 1.0$), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

A Unified Open-Set Framework for Scalable PUF-Based Authentication of Heterogeneous IoT Devices

Xin Wang, Peichun Hua, Chip Hong Chang, Wenye Liu, Yue Zheng
2026/05/08 14:46已过 4 天

As modern cyber systems scale to include large populations of heterogeneous IoT devices, securing them against impersonation and forgery is a critical cybersecurity challenge. Physical Unclonable Functions (PUFs) offer a lightweight, hardware-rooted trust anchor for IoT security. However, different PUF architectures possess distinct challenge-response spaces and raw response reliabilities, making existing authentication protocols PUF-type specific. To bridge this interoperability bottleneck, this paper proposes a scalable, helper-data-free, open-set PUF authentication framework that leverages an OpenGAN-based classifier to manage heterogeneous fleets of IoT devices. Our method addresses the limitations of traditional database-centric and digital-twin modeling methods by encoding raw responses from diverse PUF types, including strong, weak and hybrid PUFs, into a unified image representation. This enables robust, single-pass classification and impostor rejection. We integrate the classifier into a generic protocol employing hybrid encryption and Bloom filter-based replay detection. Evaluated across four different types of noisy PUF data (Arbiter, SRAM, DRAM, and heterogeneous PUFs), our framework achieves 100% closed-set accuracy and near-zero open-set error rates with up to 45 devices, a significant improvement over the 3 to 5 devices in prior classification-based approaches. Prototyped on a Raspberry Pi, our framework completes one authentication cycle within 0.67 s, approximately 30x faster than the state-of-the-art open-set baselines.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Coding 3YTesting and AnalysisarXivDebugging and fault localizationcs.LGcs.SEarXiv AI Coding 3YarXiv cs.SE

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

Mengran Li, Bo Li, Jiaying Wang, Wenbin Xing, Yixuan Dong, Chengyang Zhang, Hongliang Zhang, Yuzhong Peng, Jinlin Wu, Bob Zhang, Bingo Wing-Kuen Ling, Fuji Yang, Zhen Lei, Jiebo Luo, Zelin Zang
2026/05/08 14:40已过 4 天

Virtual Cell Modeling (VCM) requires models that not only predict perturbation responses, but also support targeted revision when predictions fail. Current LLM-assisted modeling workflows face a refinement-routing problem: prediction discrepancies are observed through executable implementations, but the relevant revision may involve the modeling assumption, representation design, implementation, or task constraint. Without structured feedback propagation across these levels, iterative refinement may repair code while failing to revise the assumption responsible for the discrepancy. We propose CellScientist, a dual-space hierarchical framework that couples a high-level hypothesis space with a low-level executable implementation space. CellScientist represents modeling decisions as structured states, realizes them as admissible programs under task and interface constraints, and routes execution discrepancies back to targeted hypothesis or implementation updates. This enables a closed Hypothesis -> Implementation -> Hypothesis loop where failures become structured signals for model refinement rather than debugging events. Across morphology and transcriptomic benchmarks, with additional single-cell perturbation evaluations, the final executable models selected by CellScientist improve over reference baselines under fixed split and evaluation protocols, while the workflow produces auditable refinement traces.

arXiv cs.CRDependability and SecurityarXivReliabilityavailabilityand safetycs.CLcs.AIcs.CRcs.LG

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Sachin Kumar
2026/05/08 14:30已过 4 天

Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.

ACL 2026 OpenReviewDependability and SecurityACL 2026Reliabilityavailabilityand safetyACL 2026ACL 2026 Findings

LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence

Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou, Rui Mao, Jiapu Wang, Shirui Pan, Erik Cambria
2026/05/08 14:15已过 4 天

Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.

LexGenius: an expert-level benchmark for legal general intelligence in large language models
  • Proposes LexGenius, an expert-level Chinese benchmark for evaluating the legal general intelligence of large language models.
  • Builds a Dimension-Task-Ability framework covering 7 dimensions, 11 task types, and 20 abilities to systematically assess legal understanding, reasoning, and decision-making.
  • Constructs multiple-choice questions from recent legal cases and exam questions, combining manual review with LLM review to reduce the risk of data leakage.
  • Ensures data accuracy and reliability through multiple rounds of checks, and provides a systematic evaluation and in-depth analysis of 12 state-of-the-art large language models.
  • Experiments show significant disparities in legal intelligence abilities across models, with even the best models still lagging behind human legal professionals.

Method: the paper proposes LexGenius as a Chinese benchmark for legal general intelligence and organizes its evaluation content with a Dimension-Task-Ability framework. The data are adapted from recent legal cases and exam questions into multiple-choice questions, reviewed by both humans and large language models, and checked over multiple rounds to improve accuracy and reliability while reducing data-leakage risk.

ACL 2026 OpenReviewHuman and Social AspectsACL 2026Teamscommunitiesand companiesACL 2026ACL 2026 Findings

Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria
2026/05/08 14:11已过 4 天

Recently, various excellent and powerful large language models (LLMs) have been utilized to solve a wide range of human problems. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting their performance. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that utilizes a small-scale LLM (as \textit{agent}) to collaborate with large-scale LLMs (as \textit{environment}), replacing users to interact better. This collaboration is presented as a multi-turn interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A double-constrained reward is designed to optimize correctness and quality of generation. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experimental results on twelve datasets show that Prompt-R1 significantly outperforms baseline LLMs across various tasks. Our code is available at https://github.com/QwenQKing/Prompt-R1.

Prompt-R1: a collaborative automatic prompting framework via end-to-end reinforcement learning
  • Proposes Prompt-R1, an end-to-end reinforcement learning framework in which a small language model acts as the agent and collaborates with large language models to generate prompts and handle the interaction automatically.
  • Models the process as multi-turn collaboration in which the small model thinks and generates prompts while the large model performs complex reasoning, standing in for the user to design prompts more effectively.
  • Designs a double-constrained reward that jointly optimizes answer correctness and generation quality.
  • Provides a plug-and-play framework that supports both inference and training with a variety of large-scale LLMs.
  • Experiments on twelve datasets show that the method significantly outperforms baseline LLMs across a range of tasks.

Method: Prompt-R1 uses the small-scale language model as the agent and the large-scale language model as the environment; through multi-turn interaction the former generates prompts and the latter performs complex reasoning, and the whole pipeline is optimized end to end with reinforcement learning. The core training signal is a double-constrained reward that simultaneously improves answer correctness and generation quality.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

Chaitanya Vilas Garware, Sharif Noor Zisad
2026/05/08 14:03已过 4 天

LLM-based SOC log classifiers are commonly evaluated using regular-expression pipelines that extract structured fields from free-form model output. We demonstrate that this practice introduces a class of silent, systematic evaluation errors, which we term parsing-induced suppression; these errors can cause a fully functional model to appear completely non-functional. Using OpenSOC-AI, a LoRA fine-tuned TinyLlama-1.1B system for security log threat classification, as a reproducible case study, we show that a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy on the same model outputs and the same evaluation set, a gap of 76 percentage points attributable entirely to evaluation methodology. Severity accuracy remained constant at 58% under both parsers, providing a built-in control that isolates field-name format mismatch as the causal mechanism rather than model degradation. For external reference, Claude Sonnet, evaluated zero-shot on the same 50-example set, achieved 88% threat accuracy and 58% severity accuracy under the same fuzzy protocol. Residual errors under fuzzy evaluation concentrate in three categories (reconnaissance, brute force, and credential stuffing), each contributing 4 misclassifications, a pattern that reflects class-boundary difficulty among behaviorally adjacent log types rather than global model failure. We propose SOC-Bench v0, a benchmark framework comprising a standardized 13-category threat taxonomy, minimum statistical power requirements, a fuzzy field extraction specification, and a public scoring script intended to prevent parser-specific accuracy distortion in future SOC LLM research.
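
The failure mode is easy to reproduce in miniature: a strict parser that demands an exact field name scores a correct answer as missing, while a fuzzy parser recovers it. The field names, aliases, and example response below are assumptions for illustration, not the paper's parser or schema.

```python
import re
from difflib import get_close_matches

FIELD_ALIASES = ("threat_type", "threat", "threat type", "attack_type")  # illustrative

def strict_parse(output: str):
    """Mimics a brittle evaluation parser: only an exact field name and layout match."""
    m = re.search(r"^threat_type:\s*(\S+)\s*$", output, re.MULTILINE)
    return m.group(1) if m else None

def fuzzy_parse(output: str):
    """Tolerates field-name and formatting drift in the model's free-form output."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and get_close_matches(key.strip().lower(), FIELD_ALIASES, n=1, cutoff=0.8):
            return value.strip() or None
    return None

# Parsing-induced suppression in one line: the model is right, the strict parser scores it wrong.
response = "Threat Type: brute_force\nSeverity: high"
assert strict_parse(response) is None and fuzzy_parse(response) == "brute_force"
```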

arXiv AI Coding 3Y / arXiv Program Repair Core / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelsPrompt engineering for SEcs.AIarXiv AI Coding 3YarXiv Program Repair Core

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Jia Li, Yuxin Su, Ting Peng, Hailiang Huang, Yuetang Deng, Michael R. Lyu
2026/05/08 13:41已过 4 天

Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO's group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model's zero-shot $0.385$ to $0.535$. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from $0.48$ to $0.53$ and reduces average evaluation steps from $23.50$ to $17.02$. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
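
A minimal sketch of the two reshaped signals that are easiest to isolate: a three-tier outcome reward (fail < compile-only < compile-and-semantic) and the standard group-normalized advantage it feeds into. The tier values are illustrative, and the step-level process scores and rollout governance described in the abstract are omitted here.

```python
import numpy as np

def layered_reward(compiles: bool, semantics_pass: bool) -> float:
    """Three-tier outcome reward; the constants are assumptions, not the paper's values."""
    if not compiles:
        return 0.0
    return 1.0 if semantics_pass else 0.5

def group_normalized_advantages(rewards):
    """Standard GRPO advantage: normalize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four rollouts for one prompt: two fail to compile, one compiles only, one passes semantics.
group = [layered_reward(c, s) for c, s in [(False, False), (False, False), (True, False), (True, True)]]
print(group_normalized_advantages(group))   # the middle tier keeps the group rankable
```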

arXiv AI Coding 3Y / arXiv LLM4Code / Program Repair / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.CLcs.LGarXiv AI Coding 3YarXiv LLM4Code / Program Repair

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

Youngsik Yoon, Sungjae Lee, Seockbean Song, Siwei Wang, Wei Chen, Jungseul Ok
2026/05/08 13:09已过 4 天

Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69\%.
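
The adaptive policy itself is compact enough to sketch; `cheap_model`, `strong_planner`, and `verify` (for example, running unit tests) are placeholders for whatever models and checks a concrete setup uses, and the loop bound is an assumption.

```python
def planning_after_trial(problem, cheap_model, strong_planner, verify, max_rounds=4):
    """Adaptive test-time policy: try first, and only invoke the expensive planner on failure."""
    plan = None
    code = None
    for _ in range(max_rounds):
        code = cheap_model(problem, plan=plan)           # no planning overhead on easy problems
        passed, feedback = verify(code)
        if passed:
            return code
        plan = strong_planner(problem, code, feedback)   # targeted intervention after a failed trial
    return code
```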

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CReess.ASarXiv cs.CR

Asymmetric Phase Coding Audio Watermarking

Guang Yang, Amir Ghasemian, Ninareh Mehrabi, Homa Hosseinmardi
2026/05/08 12:54已过 4 天

The proliferation of deepfake audio challenges voice-based authentication systems; passive forensic detectors are sensitive to evolving generative models and to real-world channel distortions. We propose Asymmetric Phase Coding (APC), a training-free cryptographic signing layer for audio, designed as a compact and auditable provenance primitive that can stand alone or be stacked with learned watermarks. APC combines Ed25519 digital signatures (EdDSA, FIPS 186-5; 64-byte signatures) with Reed-Solomon error correction, pseudo-random STFT phase-bin selection, and a redundant quantization-index-modulation (QIM) code on log-magnitude differences of adjacent bin pairs, yielding a compact, non-repudiable, blind-extractable watermark. We evaluate APC on 1,000 LibriSpeech test-clean clips (10 s each, 44.1 kHz) under eight attack configurations -- identity, 10% end-cropping, 20% end-cropping, 8 kHz low-pass, 16 kHz round-trip resampling, FLAC re-encoding, MP3 at 128 kbps, and OGG-Vorbis at 128 kbps -- and achieve cryptographic verification rates between 97.5% and 98.3% on every condition at mean PESQ=3.02 and tens-of-milliseconds CPU latency. We explicitly compare APC against recent neural baselines (AudioSeal, WavMark, SilentCipher), detail the threat model (forgery resistance vs. erasure), characterize the dataset, define all metrics, quantify an adaptive white-box erasure attack, and release code, keys, and metadata for reproducibility.
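
Only the quantization-index-modulation piece lends itself to a few lines; the sketch below embeds one bit into a single log-magnitude difference with two interleaved lattices. The step size is illustrative, and the signing, Reed-Solomon coding, pseudo-random bin selection, and redundancy that make the full scheme robust are not shown.

```python
import numpy as np

def qim_embed(diff: float, bit: int, step: float = 1.0) -> float:
    """Embed one bit by snapping a log-magnitude difference onto one of two
    quantization lattices (the bit-1 lattice is offset by step/2)."""
    offset = 0.0 if bit == 0 else step / 2.0
    return float(np.round((diff - offset) / step) * step + offset)

def qim_extract(diff: float, step: float = 1.0) -> int:
    """Recover the bit by checking which lattice the observed difference is closer to."""
    d0 = abs(diff - qim_embed(diff, 0, step))
    d1 = abs(diff - qim_embed(diff, 1, step))
    return 0 if d0 <= d1 else 1

# Round-trip with mild channel noise; in the full scheme, redundancy and error
# correction are what let the payload survive cropping and lossy re-encoding.
noisy = qim_embed(3.37, bit=1) + 0.1
assert qim_extract(noisy) == 1
```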

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRstat.MLarXiv cs.CR

Modulated learning for private and distributed regression with just a single sample per client device

Praneeth Vepakomma, Amirhossein Reisizadeh, Samuel Horváth, Munther Dahleh
2026/05/08 12:36已过 4 天

This work focuses on the question of learning from a large number of devices with each device holding only a single sample of data. Several real-world applications fit this one-sample-per-client setup, including learning from fitness trackers, data/app usage aggregators, body-worn sensing devices, and daily event monitors, to name a few. When a client has only one sample, the standard federated learning paradigm breaks down, as a local update based on that single point is far from being useful, especially in the earlier rounds of estimating the model coefficients. This utility is further weakened by the privacy-inducing noise applied at every round. This work addresses this problem, enabling such clients to collaboratively contribute to learning an effective global model without leaking the privacy of their data. The proposed approach injects a single, carefully calibrated noisy perturbation to transform the sample at each client, followed by a post-processed representation which is shared with the server. These representations, aggregated at the server, are processed to obtain an unbiased gradient update that in expectation matches the non-private centralized gradient while preserving data privacy. This approach differs from traditional private federated learning, where the communication payloads involve model coefficients as opposed to privately transformed data samples. This method enables devices with extremely limited data to collaborate and learn accurate, privacy-preserving models without requiring large local datasets or sacrificing individual privacy.

arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.LGarXiv LLM4Code / Program Repair

Coupling Models for One-Step Discrete Generation

Fred Zhangzhi Peng, Avishek Joey Bose, Anru R. Zhang, Alexander Tong
2026/05/08 11:40已过 4 天

Generative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models, a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, a Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically, Coupling Models improve on the strongest one-step baselines in each domain: they reduce LM1B text-generation perplexity by 33% at the lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at https://github.com/pengzhangzhi/Coupling-Models.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

Zifan Qu, Vasileios P. Kemerlis, Giuseppe Ateniese, Evgenios M. Kornaropoulos
2026/05/08 10:46已过 5 天

Training wide neural networks on sensitive data in untrusted cloud environments requires simultaneously achieving computational efficiency and rigorous privacy guarantees. Sparsification techniques, essential for scalable training of wide layers, expose input-dependent memory-access patterns (i.e., leakage) that are visible and can be exploited by a host OS/hypervisor, even when computation is protected by a Trusted Execution Environment. We present TENNOR, a system that resolves this tension by co-designing the neural network training pipeline with doubly oblivious primitives, eliminating access-pattern leakage while also utilizing adaptive sparsification. TENNOR recasts sparse neuron activation as a locality-sensitive hashing (LSH) retrieval problem, reducing secure sparsification to doubly oblivious accesses over an LSH data structure. To eliminate the prohibitive storage cost of ``multi-table'' LSH, we introduce Multi-Probe Winner-Take-All (MP-WTA): the first multi-probe scheme for rank-based LSH, achieving a 50x reduction in (hash table) memory while preserving model accuracy. We evaluate TENNOR on extreme multi-label classification benchmarks with output layers of up to 325K neurons inside an Intel TDX Trusted Domain, achieving speedups of 13x--470x over a Path ORAM baseline and reducing a 208-hour run to about 26 minutes.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

Shenao Wang, Xinyi Hou, Zhao Liu, Yanjie Zhao, Xiao Cheng, Quanchen Zou, Xiangzheng Zhang, Haoyu Wang
2026/05/08 10:13已过 5 天

GitHub Actions is increasingly used to deploy LLM-based agents for repository-centric tasks such as issue triage, pull-request review, code modification, and release assistance. These agentic workflows extend traditional CI/CD automation with agentic capabilities but also create a new injection surface. In this paper, we introduce Agentic Workflow Injection (AWI), a workflow-level injection flaw where untrusted GitHub event context, such as issue bodies, pull-request descriptions, or comments, is incorporated into agent prompts or agent-consumed inputs and converted into attacker-influenced behavior through agent tools or downstream workflow logic. We identify two core AWI patterns: Prompt-to-Agent (P2A), where untrusted content reaches an agent prompt boundary, and Prompt-to-Script (P2S), where attacker influence propagates through model- or agent-derived outputs into later scripts. We present the first systematic study of AWI in GitHub Actions. We characterize 1,033 real-world AI-assisted actions and extract AWI-specific taint specifications, including prompt boundaries, derived outputs, agentic capabilities, and access-control interfaces. Based on these specifications, we design TaintAWI, a taint-analysis tool that tracks flows from untrusted event context to agent prompt inputs and security-sensitive workflow sinks. Applying TaintAWI to 13,392 real-world agentic workflows from 10,792 repositories, we report 519 potential AWI vulnerabilities, of which 496 are confirmed exploitable under our threat model, yielding a precision of 95.6%. Among them, 343 are previously unknown zero-day vulnerabilities. We prioritized disclosure for 187 zero-day cases, received 26 maintainer responses, and 24 cases have been accepted or fixed at the time of writing.
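
A crude lexical approximation of the P2A pattern (untrusted event context flowing into a prompt-like input) can be expressed as a line scan; this is not TaintAWI, which performs proper taint tracking over workflow specifications, and the prompt-like key names are assumptions. The GitHub expression syntax in the example is real.

```python
import re

UNTRUSTED_CONTEXT = re.compile(
    r"\$\{\{\s*github\.event\.(issue|pull_request|comment)\.(body|title)\s*\}\}"
)
PROMPT_LIKE_KEYS = ("prompt", "instructions", "user_input", "message")  # illustrative

def flag_p2a_candidates(workflow_yaml_text: str):
    """Return (line number, line) pairs where untrusted event context appears on a
    line that also looks like a prompt input. A lexical heuristic only: it misses
    indirect flows (P2S) and can flag safe uses."""
    findings = []
    for lineno, line in enumerate(workflow_yaml_text.splitlines(), start=1):
        if UNTRUSTED_CONTEXT.search(line) and any(k in line.lower() for k in PROMPT_LIKE_KEYS):
            findings.append((lineno, line.strip()))
    return findings

example = "        prompt: 'Triage this issue: ${{ github.event.issue.body }}'"
assert flag_p2a_candidates(example)
```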

arXiv AI Coding 3YAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelscs.LGcs.CLarXiv AI Coding 3Y

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

Zhanqi Zhang, Hua-Dong Xiong, Robert C. Wilson, Mikio Aoi, Marcelo G. Mattar, Li Ji-An
2026/05/08 10:04已过 5 天

Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / arXiv AI Agents for Software Engineering / arXiv AI Coding 3YAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsAI-enabled recommender systems for automated SECollaborative AI for SEcs.SEarXiv AI Coding 3YarXiv cs.SEarXiv LLM4Code / Program Repair

RepoZero: Can LLMs Generate a Code Repository from Scratch?

Zhaoxi Zhang, Yiming Xu, Weikang Li, Jiahui Liang, Yunfang Wu
2026/05/08 09:56已过 5 天

Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code-Test Evolution (ACE) framework that performs iterative test generation and error-driven refinement, enabling effective test-time scaling for repository-level synthesis. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30\% - 55\%), exposing a substantial gap between current capabilities and real-world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.
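
The black-box validation idea reduces to comparing outputs of the reference and regenerated repositories on the same probe calls. In the sketch below, the sandboxed runners and the exact-equality notion are simplifying assumptions; the benchmark's actual equivalence checks and probe construction are not reproduced here.

```python
import json

def behaviorally_equivalent(reference_run, candidate_run, api_calls):
    """Black-box check: the regenerated repository must reproduce the reference
    repository's outputs on every probe call. `reference_run` and `candidate_run`
    are hypothetical callables that execute one API call in a sandbox and return
    a JSON-serializable result."""
    for call in api_calls:
        ref = reference_run(call)
        out = candidate_run(call)
        if json.dumps(ref, sort_keys=True) != json.dumps(out, sort_keys=True):
            return False, call          # first diverging call, useful for error-driven refinement
    return True, None
```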

arXiv cs.SEDependability and SecurityarXivConfidentialityintegrityprivacycs.CLcs.SEarXiv cs.SE

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

Zejian Chen, Zhanyuan Liu, Chaozhuo Li, Mengxiang Han, Songyang Liu, Litian Zhang, Feng Gao, Yiming Hei, Xi Zhang
2026/05/08 09:38已过 5 天

Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, file systems, terminals, and tool backends. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement. To address this gap, the article develops an architecture-lifecycle framework for deployment-grounded reliability in CUAs. The architectural view analyzes Perception, Decision, and Execution as coupled layers that transform software observations into authority-bearing actions. The lifecycle view examines Creation, Deployment, Operation, and Maintenance as stages in which priors are learned, tools and permissions are bound, runtime trajectories are stressed, and assurance must be preserved under drift. Using this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study. The conclusion highlights open challenges in controllable grounding, long-horizon constraint preservation, safe authority binding, mixed-trust runtime defense, privacy-preserving memory, and continual assurance.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Membership Inference Attacks on Vision-Language-Action Models

Yuefeng Peng, Mingzhe Li, Kejing Xia, Renhao Zhang, Amir Houmansadr
2026/05/08 09:16已过 5 天

Membership inference attacks (MIAs) have been extensively studied in large language models (LLMs) and vision-language models (VLMs), yet their implications for vision-language-action (VLA) models remain largely unexplored. VLA models differ from standard LLMs and VLMs in several important ways: they are often fine-tuned for many epochs on relatively small embodied datasets, operate over constrained and structured action spaces, and expose action outputs that can be observed as executable behaviors and temporally correlated trajectories. These characteristics suggest a distinct and potentially more informative attack surface for membership inference. In this work, we present the first systematic study of MIAs against VLA systems. We formalize two membership inference settings for VLA models: sample-level inference over individual transition samples and trajectory-level inference over complete embodied demonstrations. We further develop a suite of attack methods under multiple access regimes, including strict black-box access. Our attacks exploit both classic MIA signals, such as token likelihood, and VLA-specific signals, such as observable action errors and temporal motion patterns. Across multiple VLA benchmarks and representative VLA models, these attacks achieve strong inference performance, showing that VLA models are highly vulnerable to membership inference. Notably, black-box attacks based only on generated actions achieve strong performance, highlighting a practical privacy risk for deployed embodied AI systems. Our findings reveal a previously underexplored privacy risk in robotic and embodied AI, and underscore the need for dedicated privacy evaluation and defenses for VLA models.
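A minimal sketch of the simplest black-box signal mentioned above: thresholding the observable action error of a generated trajectory against a reference demonstration. The array shapes, noise scales, and threshold are illustrative assumptions; the paper's attacks also exploit token likelihoods and temporal motion patterns.

```python
# Toy trajectory-level membership test based only on observable action error.
import numpy as np

def trajectory_action_error(pred_actions: np.ndarray, ref_actions: np.ndarray) -> float:
    """Mean L2 error between generated and reference actions over a demonstration."""
    return float(np.linalg.norm(pred_actions - ref_actions, axis=-1).mean())

def infer_membership(error: float, threshold: float) -> bool:
    # Lower error on a demonstration suggests the model was fine-tuned on it.
    return error < threshold

rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 7))                        # 50 steps, 7-DoF actions
member_pred = ref + rng.normal(scale=0.05, size=ref.shape)
nonmember_pred = ref + rng.normal(scale=0.5, size=ref.shape)

for name, pred in [("member-like", member_pred), ("non-member-like", nonmember_pred)]:
    err = trajectory_action_error(pred, ref)
    print(name, "error=%.3f" % err, "member?", infer_membership(err, threshold=0.6))
```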

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRstat.MLarXiv cs.CR

Less Random, More Private: What is the Optimal Subsampling Scheme for DP-SGD?

Andy Dong, Ayfer Özgür
2026/05/08 08:47已过 5 天

Poisson subsampling is the default sampling scheme in differentially private machine learning, largely because its unstructured randomness yields tractable privacy amplification analyses. Yet this same randomness introduces substantial participation variance: each sample appears in very different numbers of training iterations. In this work, we show that this variance is not merely a practical artifact to be tolerated, but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum ($σ\to 0$ and $σ\to \infty$). Our analysis reveals that the privacy-noise tradeoff is governed not by maximizing randomness, but by eliminating participation variance while preserving uniform marginal participation across iterations. To translate this asymptotic theory into finite-noise guarantees, we introduce a practical near-exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition-based PLD analyses. Evaluations across more than 60 practical DP-SGD configurations show that BIS consistently outperforms Poisson subsampling in the low-noise regimes most relevant for high-utility private training, reducing the required noise multiplier by up to $9.6\%$. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP-SGD, structured participation can be both more practical and more private. Our implementation is available at https://github.com/dong-xin-ao-andy/bis-mc-accountant.
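A toy comparison (not the paper's accountant) of per-sample participation variance under Poisson subsampling versus a balanced scheme in which every sample participates in exactly k of the T iterations; N, T, and q are arbitrary illustrative values.

```python
# Contrast participation variance: Poisson subsampling vs. a balanced scheme.
import numpy as np

rng = np.random.default_rng(0)
N, T, q = 1_000, 200, 0.05           # samples, iterations, sampling rate
k = int(T * q)                       # fixed participation count for the balanced scheme

# Poisson subsampling: each sample joins each iteration independently with prob q.
poisson_counts = rng.binomial(T, q, size=N)

# Balanced iteration subsampling (toy version): every sample participates in
# exactly k iterations.
balanced_counts = np.full(N, k)

print("Poisson  participation: mean %.2f, var %.2f" %
      (poisson_counts.mean(), poisson_counts.var()))
print("Balanced participation: mean %.2f, var %.2f" %
      (balanced_counts.mean(), balanced_counts.var()))
# The balanced scheme removes participation variance while keeping the same
# expected number of appearances per sample, the property the abstract
# identifies as the driver of stronger amplification.
```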

arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv Program Repair CoreAI for Software EngineeringarXivTrustworthy AI for SEAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.SEcs.AIarXiv cs.SEarXiv AI Agents for Software Engineering

From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

Marcus Emmanuel Barnes, Taher A. Ghaleb, Safwat Hassan
2026/05/08 08:15已过 5 天

AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human--agent coordination.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.LGarXiv cs.CR

Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE

Riyazuddin Mohammed, Lan Zhang
2026/05/08 07:24已过 5 天

Modern cybersecurity relies heavily on static machine-learning-based malware classifiers. However, transformations such as packing and other non-semantic modifications applied to executable files limit their reliability. Malware classifiers often learn these unnecessary artifacts rather than the true binary behavior because of the high association between maliciousness and packing. Moreover, these malware classifiers are black boxes, making it difficult to understand what they learn. To address this issue, we proposed a two-part framework using the post-hoc interpretability XAI tool TRUSTEE, followed by a manual analysis of the top features. We conducted several controlled experiments by varying the dataset composition ratios to understand their impact on the results. The top-ranked features across all experiments, identified by TRUSTEE, were predominantly packing artifacts, portable executable (PE) metadata, and n-grams at the string level, rather than malicious semantics. These results suggest that these malware classifiers are highly sensitive to dataset composition and can misinterpret packing as malicious behavior. Our proposed framework allows for the reproducible diagnosis of such biases and forms a guideline for building more robust and semantically meaningful malware detection models.

arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.LGarXiv LLM4Code / Program Repair

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu
2026/05/08 07:12已过 5 天

Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.
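An illustrative FIM sample with a hallucinated completion of the "invented API method" type; the field names are generic placeholders, not Delulu's actual schema.

```python
# Toy fill-in-the-middle sample: golden vs. hallucinated middle.
sample = {
    "language": "python",
    "prefix": "import json\n\ndef load_config(path):\n    with open(path) as f:\n        ",
    "suffix": "\n    return cfg\n",
    "golden_middle": "cfg = json.load(f)",
    "hallucinated_middle": "cfg = json.parse_file(f)",   # no such method in the json module
    "hallucination_type": "invented_api_method",
}
print(sample["prefix"] + sample["hallucinated_middle"] + sample["suffix"])
```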

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.OSarXiv cs.CR

Pomegranate: A Lightweight Compartmentalization Architecture using Virtualization Extensions

Shriram Raja, Zhiyuan Ruan, Richard West
2026/05/08 06:44已过 5 天

The monolithic nature of widely used commodity operating systems means that vulnerabilities in one software component potentially compromise the entire kernel. Formally verifying these systems, or redesigning them altogether as microkernels, according to the principle of least privilege, requires significant effort. Researchers have therefore considered compartmentalization techniques that minimize or totally avoid changes to existing systems. However, current approaches use techniques such as Memory Protection Keys (MPKs), necessitating extensive code analysis to ensure security, or use virtualization by instrumenting the kernel with calls to the glue code that switches compartments. In this work, we present Pomegranate, a framework that uses hardware-assisted virtualization to securely compartmentalize an existing system with minimal to no modifications to its source code. Allowed interactions between compartments are defined using an access-control policy and strictly enforced using Extended Page Tables. Using special sentry functions, Pomegranate is able to check all cross-compartment transitions without trapping into the hypervisor. We demonstrate the efficacy of Pomegranate on a compartmentalized Linux network stack using the igc NIC driver. Experiments show the overheads of our approach are negligible at MTU-sized packets when compartment boundaries are carefully established to avoid excessive inter-compartment communication.

arXiv cs.SE / arXiv LLM4Code / Program Repair / arXiv AI Agents for Software EngineeringAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsAI-enabled recommender systems for automated SECollaborative AI for SEcs.SEcs.CLarXiv cs.SEarXiv LLM4Code / Program Repair

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

Ion George Dinu, Marian Cristian Mihăescu, Traian Rebedea
2026/05/08 06:33已过 5 天

Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $κ= 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.

arXiv cs.SEAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelscs.ROcs.SEarXiv cs.SE

Traffic Scenario Orchestration from Language via Constraint Satisfaction

Frieda Rong, Chris Zhang, Kelvin Wong, Raquel Urtasun
2026/05/08 05:34已过 5 天

Autonomous vehicles (AVs) require extensive testing in simulation, but test case generation for driving scenarios is laborious. The desired scenarios are often out-of-distribution and have precise requirements on interactions with the AV policy under test. Manually programming scenarios allows for precise controllability but is difficult to scale. On the other hand, statistical models can leverage compute and data, but struggle with precise controllability when out-of-distribution. We cast scenario orchestration as a constraint-solving problem and present a language-in, simulation-out scenario orchestrator for closed-loop testing AVs. Our approach leverages foundation model reasoning to translate general, natural language descriptions into a set of constraints as a scenario representation. This then allows us to leverage off the shelf solvers to solve for actor behaviors which meet precise testing intentions in closed-loop. Under a benchmark of carefully crafted and diverse scenario descriptions, our approach greatly outperforms our baselines in orchestration success rate. We further show that our closed-loop approach is especially important for scenarios which require ego-reactive specifications.
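A toy sketch of the "language in, constraints out, solver in the loop" idea: a natural-language intent such as "a vehicle cuts in just ahead of the ego car" is assumed to have already been translated into constraints, and an off-the-shelf solver (here z3, as a stand-in) picks concrete actor parameters. All variables, bounds, and the kinematic constraint are hypothetical.

```python
# Constraint-solving sketch for a cut-in scenario (placeholder constraints).
from z3 import Real, Solver, And, sat

ego_speed = 15.0                      # m/s, known from the AV policy under test
cut_in_time = Real("t")
actor_speed = Real("v")
actor_start_gap = Real("gap")         # initial longitudinal offset from the ego (negative = behind)

s = Solver()
s.add(And(cut_in_time > 2, cut_in_time < 6))          # happens mid-scenario
s.add(actor_speed > ego_speed + 2)                     # actor is overtaking
s.add(actor_start_gap + actor_speed * cut_in_time ==
      ego_speed * cut_in_time + 8)                     # ends 8 m ahead of the ego
s.add(actor_start_gap < 0)                             # starts behind the ego

if s.check() == sat:
    print(s.model())                                   # concrete actor behavior to simulate
```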

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRcs.MAarXiv cs.CR

MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security

Sepideh Avizeh, Tushin Mallick, Alina Oprea, Cristina Nita-Rotaru, Reihaneh Safavi-Naini
2026/05/08 04:46已过 5 天

Our computing ecosystem is being transformed by two emerging paradigms: the increased deployment of agentic AI systems and advancements in quantum computing. With respect to agentic AI systems, one of the most critical problems is creating secure governing architectures that ensure agents follow their owners' communication and interaction policies and can be held accountable for the messages they exchange with other agents. With respect to quantum computing, existing systems must be retrofitted and new cryptographic mechanisms must be designed to ensure long-term security and quantum resistance. In fact, NIST recommends that standard public-key cryptographic algorithms, including RSA, Diffie-Hellman (DH), and elliptic-curve constructions (ECC), be deprecated starting in 2030 and disallowed after 2035. In this paper, we present MAGIQ, a framework for policy definition and enforcement in multi-agent AI systems using novel, highly efficient, quantum-resistant cryptographic protocols with proven security guarantees. MAGIQ (i) allows users to define rich communication and access-control policy budgets for agent-to-agent sessions and tasks, including global budgets for one-to-many agent sessions; (ii) enforces such policies using post-quantum cryptographic primitives; (iii) supports session-based enforcement of policies for agent-to-agent and one-to-many agent sessions; and (iv) provides accountability of agents to their users through message attribution. We formally model and prove the correctness and security of the system using the Universal Composability (UC) framework. We evaluate the computation and communication overhead of our framework and compare it with the state-of-the-art agentic AI framework SAGA. MAGIQ is a first step toward post-quantum-secure solutions for agentic AI systems.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.NIarXiv cs.CR

Aquaman: A Transparent Proxy Architecture for Quantum Resilient Key Establishment

Tushin Mallick, Ashish Kundu, Ramana Kompella
2026/05/08 04:45已过 5 天

The harvest-now, decrypt-later (HNDL) threat--adversaries intercepting and archiving ciphertext today for retrospective decryption once quantum computers mature--turns the future quantum threat into a present liability for the public-key primitives (RSA, Diffie-Hellman, ECC) that anchor modern session-key exchange. We present Aquaman, a transparent-proxy architecture for quantum-resilient session-key establishment. A transparent proxy intercepts session-key requests at the edge of a trusted network without requiring client-side configuration, deploying quantum-resistant capability at the network boundary on behalf of clients that may themselves lack post-quantum cryptography (PQC). Aquaman supports four operating modes: PQC offloaded to the proxy for clients without trusted PQC stacks; classical multi-path key fragmentation over heterogeneous media (with an optional anonymous proxy-pool variant); QKD with the SKIP/ETSI GS QKD 014 key-delivery interface; and classical/PQC hybrid handshakes. We implement and evaluate the first two modes; the latter two are well-trodden in the PQC literature and we discuss but do not implement them. The implemented multi-path mode splits the session key into ciphertext fragments distributed across diverse media (Wi-Fi, Bluetooth, NFC, cellular, Ethernet); reconstruction requires all fragments. We formalize the security argument and prove that recovery probability decays as (B/d)^n in the diversity dimension. A 1,000-run prototype evaluation on AWS EC2 shows that latency is dominated by network transmission, not by multi-path overhead.
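A toy sketch of all-or-nothing key fragmentation across n paths using XOR secret sharing, which has the property the abstract relies on (every fragment is required for reconstruction); the fragment encoding, media bindings, and the B/d value are illustrative assumptions, not Aquaman's actual protocol.

```python
# All-or-nothing session-key splitting across n heterogeneous paths (toy version).
import secrets
from functools import reduce

def split_key(key: bytes, n: int) -> list[bytes]:
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    last = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares, key)
    return shares + [last]

def recombine(shares: list[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares)

key = secrets.token_bytes(32)
frags = split_key(key, n=4)            # e.g. Wi-Fi, Bluetooth, cellular, Ethernet
assert recombine(frags) == key         # all fragments present: key recovered
assert recombine(frags[:-1]) != key    # any missing fragment: no recovery

# Recovery-probability scaling quoted in the abstract, (B/d)^n, with placeholder values.
B_over_d, n = 0.3, 4
print("adversary recovery probability ~", B_over_d ** n)
```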

ICML 2026 OpenReviewAI for Software EngineeringICML 2026Automating SE tasks with LLM and foundation modelsTrustworthy AI for SEICML 2026

Escaping Whack-a-Mole: Code Documentation Optimization via Dependency-Guided Bi-level Search

Yutong Cheng, Haifeng Chen, Wenchao Yu, Xujiang Zhao, Peng Gao, Wei Cheng
2026/05/08 04:21已过 5 天

As large language models increasingly serve as autonomous coding agents, code documentation must be optimized for agent comprehension rather than human readability. We frame agent-oriented documentation generation as a black-box optimization problem over the documentation space, where quality is measured solely by downstream code correctness. A central challenge for conventional LLM refinement methods is *output coupling*—program entities are interdependent, and refining the documentation of one entity can invalidate its callers, resulting in a persistent *whack-a-mole* phenomenon during inference-time scaling. We propose DocSearch, a dependency-guided bi-level search framework that systematically exploits test-time feedback. The outer level conducts a priority search over the program-entity dependency DAG, enforcing a callee-before-caller refinement order to prevent downstream interference. The inner level performs a beam search over documentation refinements, using diversified error message sampling from self-generated unit tests to better exploit diagnostic signals and escape local optima. We provide theoretical guarantees of monotonic progress, showing that our worthy condition prevents regression while enabling efficient exploration. On DevEval+, DocSearch achieves a 90.7% solve rate with GPT-4o, outperforming the strongest baseline by 32.6%. Cross-language experiments further demonstrate that optimized documentation transfers effectively to different target programming languages.

Escaping Whack-a-Mole: Code Documentation Optimization via Dependency-Guided Bi-level Search
  • Frames agent-oriented code documentation generation as a black-box optimization problem over the documentation space, with downstream code correctness as the sole quality measure.
  • Proposes the DocSearch framework, which runs a priority search over the program-entity dependency DAG in callee-before-caller order to mitigate the repeated invalidation caused by output coupling.
  • In the inner search, combines beam search over documentation rewrites with diversified sampling of error messages from self-generated unit tests, making fuller use of test-time feedback and helping escape local optima.
  • Provides a theoretical guarantee of monotonic progress, showing that its worthy condition supports efficient exploration while preventing regression.
  • Validates the approach on DevEval+ and in cross-language settings, showing that optimized documentation not only substantially improves solve rates but also transfers to different target programming languages.

Method: The approach casts documentation optimization as a dependency-guided bi-level search: the outer level runs a priority search over the program-entity dependency DAG, refining iteratively in callee-before-caller order to avoid downstream interference; the inner level performs beam search over documentation rewrites and uses diversified sampling of error messages from self-generated unit tests, so that test-time feedback guides the search.
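A minimal sketch, assuming a toy call graph, of the callee-before-caller outer ordering described above; the inner beam search and error-message sampling are reduced to a placeholder refine_docs step, and all entity names are hypothetical.

```python
# Dependency-guided outer ordering: refine callees before their callers.
from graphlib import TopologicalSorter

# edges: caller -> set of callees it depends on
call_graph = {
    "app.handler": {"db.query", "utils.parse"},
    "db.query": {"utils.parse"},
    "utils.parse": set(),
}

def refine_docs(entity: str, current_doc: str) -> str:
    # Placeholder for the inner search: in the real framework this would be a
    # beam search over rewrites scored by self-generated unit tests.
    return current_doc + f"  # refined for {entity}"

docs = {e: f"doc({e})" for e in call_graph}
# TopologicalSorter yields nodes whose listed dependencies (callees) come first,
# which is exactly the callee-before-caller refinement order.
for entity in TopologicalSorter(call_graph).static_order():
    docs[entity] = refine_docs(entity, docs[entity])
    print("refined:", entity)
```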

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

Jaime Morales, Sergio Pastrana, Juan Tapiador
2026/05/08 04:18已过 5 天

Software obfuscation and encryption present persistent challenges for program comprehension and security analysis, particularly when adversaries conceal Indicators of Compromise (IoCs) such as IP addresses within source code. While Large Language Models (LLMs) have recently demonstrated remarkable progress in code reasoning and transformation, their resilience against adversarial concealment techniques remains largely uncharted. This paper introduces a systematic benchmark for secret detection under adversarial code transformations, designed to evaluate the capacity of LLMs to recover IoCs embedded in obfuscated and encrypted JavaScript programs. We construct a dataset of 336 programs, progressively transformed through 12 levels of obfuscation and cryptographic concealment (including XOR and AES-256), to emulate realistic threat scenarios. An automated evaluation framework standardizes LLM queries and responses, enabling reproducible, large-scale testing across diverse models. Our results reveal a dichotomy: while LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, encryption-based concealment severely degrades detection performance. These findings establish encryption as a critical frontier for LLM-driven code analysis and highlight both current limitations and avenues for advancing automated threat intelligence.
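A toy illustration of the kind of lightweight concealment the benchmark probes (Base64) versus key-based concealment (XOR); the snippets and key are made up, and the dataset's twelve transformation levels are not reproduced here.

```python
# Generate two concealed variants of the same IoC (a documentation-range IP).
import base64

ioc = b"203.0.113.7"

b64_variant = base64.b64encode(ioc).decode()
xor_key = 0x5A
xor_variant = bytes(b ^ xor_key for b in ioc).hex()

print("JS with Base64 IoC :", f'var h = atob("{b64_variant}");')
print("JS with XOR IoC    :", f'var h = "{xor_variant}"; // decoding needs key 0x5A')
# The first is invertible without any secret; the second only with the key,
# mirroring the detection gap the abstract reports between encoding and encryption.
```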

arXiv AI Coding 3Y / arXiv LLM4Code / Program Repair / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.CLcs.AIcs.HCcs.MM

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis, Yohan Jo
2026/05/08 03:57已过 5 天

The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.LGarXiv cs.CR

McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware

Md Mahmuduzzaman Kamol, Jesus Lopez, Saeefa Rubaiyet Nowmi, Emilia Rivas, Md Ahsanul Haque, Edward Raff, Aritran Piplai, Mohammad Saidur Rahman
2026/05/08 03:53已过 5 天

Machine learning (ML) in real-world systems must contend with concept drift, adversarial actors, and a spectrum of potential features with varying costs and benefits. Malware naturally exhibits all of these complexities, but for the same reason, it is challenging to curate and organize data to study these factors. We present McNdroid, to our knowledge the largest longitudinal multimodal Android malware benchmark for malware detection and drift analysis. McNdroid spans 2013--2025, excluding 2015, and represents each application with three aligned modalities--static features from manifests and smali code, dynamic behavioral features from sandbox execution, and graph-based features from function-call graphs. Using temporally separated splits, we evaluate standard ML and deep-learning detectors across increasing train--test time gaps. Results show clear temporal degradation, while multimodal fusion outperforms the best single modality across long-term temporal gaps. Cross-modal agreement also declines over time, suggesting that drift affects both individual feature spaces and the consistency among modalities. We further analyze modality-specific drift, malware-family evolution, and temporal changes in model explanations. We publicly release McNdroid, benchmark splits, and code to support reproducible research on temporal generalization and robust multimodal learning in security-critical, non-stationary settings.

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securitycs.CRcs.NIarXiv cs.CR

Zombies in Alternate Realities: The Afterlife of Domain Names in DNS Integrations

Sulyab Thottungal Valapu, John Heidemann, Mattijs Jonker, Raffaele Sommese
2026/05/08 03:30已过 5 天

DNS integrations leverage the discovery, trust, and uniqueness of the global Domain Name System with a linkage to another naming ecosystem, so the DNS name can help identify resources such as a cryptocurrency wallet or software component. While DNS ownership is verified at linkage creation, many ecosystems do not track subsequent DNS changes. The result is zombie linkages, where the DNS ownership has expired or changed, but the mapping to the linked resource persists. We define a threat model for DNS integrations, identifying five classes of attacks that leverage or exploit zombie linkages. We measure zombie occurrence across three DNS integrations -- Web PKI; ENS, a blockchain naming system; and Maven Central, a Java software repository. We show that zombies exist in every ecosystem, but at very different fractions -- zombies make up roughly 3% of TLS certificates for new domains, 24% of ENS on-chain imports, and 15% of Maven Central namespaces. We evaluate how integration design choices affect outcomes, with validate-once integrations (ENS on-chain, Maven Central) accumulating long-lasting zombies, linkages with expiration (Web PKI) limiting damage, while integrations that validate on every use (ENS gasless) are zombie-free by design. We look for specific attacks, finding attacks actively available for exploitation in both Web PKI and Maven Central. Finally, we recommend steps to reduce zombie occurrence.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

The Cost of Quantum Resistance: A Hash-Based Commit-Reveal Alternative for Minimizing Blockchain Infrastructure Overhead

Keir Finlow-Bates, Markus Jakobsson, Hossein Siadati
2026/05/08 02:53已过 5 天

The transition to post-quantum cryptography in blockchain systems such as Bitcoin and Ethereum is often framed as a purely cryptographic problem. In practice, it also presents significant economic and infrastructural challenges: in globally replicated networks, increases in transaction size and verification cost are multiplied across all participating nodes. Existing post-quantum signature schemes, including lattice-based constructions such as CRYSTALS-Dilithium and stateless hash-based schemes such as SPHINCS+, introduce substantial increases in signature size. At blockchain scale, these increases translate into higher storage, bandwidth, and validation requirements, potentially requiring multiple generations of hardware improvement to become operationally routine. Historical experience suggests that even moderate increases in data footprint can be contentious, as illustrated by the Bitcoin block size debates (2015--2017). We propose a hash-based commit--reveal construction that replaces a single signature-bearing transaction with two lightweight transactions, each containing a fixed-size (32-byte) hash output derived from well-established primitives such as SHA-256, BLAKE, or Keccak. This approach achieves post-quantum security under standard hash assumptions while increasing the effective transaction footprint by only approximately 1.5$\times$ to 2$\times$ per authorization event. These results indicate that practical post-quantum migration may benefit from rethinking transaction semantics rather than directly adopting larger signature schemes, and that viable designs for decentralized systems must account for system-wide cost amplification.
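A toy sketch of the general commit-reveal idea with SHA-256 (a 32-byte commitment published first, then a reveal that lets verifiers recompute it); the transaction format, chain integration, and exact reveal contents of the proposed construction are not reproduced here.

```python
# Hash-based commit-reveal authorization under standard hash assumptions (toy version).
import hashlib
import secrets

def commit(spend_request: bytes) -> tuple[bytes, bytes]:
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + spend_request).digest()   # 32-byte commitment
    return digest, nonce

def verify_reveal(digest: bytes, nonce: bytes, spend_request: bytes) -> bool:
    return hashlib.sha256(nonce + spend_request).digest() == digest

request = b"send 1.0 coin to addr_xyz"
c, nonce = commit(request)                 # step 1: publish the 32-byte hash
# ... later, after the commitment is confirmed ...
assert verify_reveal(c, nonce, request)    # step 2: reveal nonce + request for verification
print("commitment verified")
```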

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRcs.AIarXiv cs.CR

Narrow Secret Loyalty Dodges Black-Box Audits

Alfie Lamerton, Fabien Roger
2026/05/08 02:48已过 5 天

Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines. Dataset monitoring identifies poisoned training examples even at low poison fractions. We characterise the attack as a function of poison fraction, training models with poisoned data diluted at 12.5%, 6.25%, and 3.125%. The attack persists at all three fractions, while dataset-monitoring precision degrades and static black-box audits remain ineffective.

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRcs.AIcs.NIarXiv cs.CR

PAMPOS: Causal Transformer-based Trajectory Prediction for Attack-Agnostic Misbehavior Detection in V2X Networks

Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Panos Papadimitratos
2026/05/08 02:38已过 5 天

Misbehavior detection in Vehicle-to-Everything (V2X) networks is a second line of defense against insider falsification attacks that cryptographic mechanisms alone cannot address. Existing learning-based Misbehavior Detection Schemes (MDSs) are supervised, requiring labeled attack samples at training time, thus failing to counter unseen falsification attacks. We present PAMPOS, a causal transformer-decoder trained on benign VeReMi++ trajectories to learn normal mobility patterns. At inference time, misbehavior is identified as a deviation from the model's next-step kinematic predictions using a top-K normalized anomaly scoring mechanism that localizes falsification to specific kinematic features, without requiring attack-labeled training data. We evaluate PAMPOS across all 19 attack types in VeReMi++ under rush-hour and afternoon scenarios, achieving Area Under the Curve (AUC) values of up to 0.98 and F1-scores of up to 0.95 for most attack categories.
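A sketch of the top-K normalized anomaly scoring described above; the causal transformer that produces next-step predictions is not shown, and the feature set, normalization statistics, and K are illustrative assumptions.

```python
# Score one beacon by comparing predicted vs. observed next-step kinematics.
import numpy as np

def topk_anomaly_score(pred: np.ndarray, obs: np.ndarray,
                       feat_std: np.ndarray, k: int = 3):
    """pred/obs: per-feature next-step kinematics; feat_std: benign residual scale."""
    normalized = np.abs(pred - obs) / feat_std        # per-feature deviation
    top_idx = np.argsort(normalized)[-k:]             # features driving the anomaly
    return normalized[top_idx].mean(), top_idx

feat_std = np.array([0.5, 0.5, 0.2, 0.2, 1.0])        # e.g. x, y, vx, vy, heading
pred = np.array([100.0, 50.0, 10.0, 0.0, 90.0])
obs = np.array([130.0, 50.2, 10.1, 0.1, 90.5])        # falsified x position
score, idx = topk_anomaly_score(pred, obs, feat_std)
print("anomaly score %.2f, driven by features %s" % (score, idx))
```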

arXiv cs.SEHuman and Social AspectsarXivTeamscommunitiesand companiescs.SEarXiv cs.SE

Guidelines for Cultivating a Sense of Belonging to Reduce Developer Burnout

Bianca Trinkenreich, Marco Aurelio Gerosa, Anita Sarma, Igor Steinmacher
2026/05/08 02:28已过 5 天

Burnout affects software developers' mental and physical well-being and contributes to turnover, generating strong concerns in the software industry. Prior research has shown that lack of belonging is associated with higher levels of burnout among software developers, while a sense of belonging is linked to resilience, job satisfaction, engagement, and well-being. In this paper, we revisit recent studies on belongingness in software development teams, including proprietary software organizations and open-source software communities, to offer evidence-based guidelines for cultivating belongingness and reducing developer burnout. We summarize characteristics of belongingness, such as trust, acceptance, value recognition, friendship, membership, mutual support, and being known by others, as well as factors associated with belongingness, including recognition, psychological safety, intrinsic motivation, English confidence, tenure, gender, and cultural power distance. Based on these findings, we propose practical guidelines for leaders and communities, including timely and consistent recognition, transparent promotion rules, inclusive benefits and initiatives, intentional connections through collaborative tools, blameless postmortems, optional in-person opportunities, informal newcomer gatherings, and continuous monitoring of belongingness and burnout. These guidelines can help software organizations and open-source communities foster healthier, more inclusive environments that support developer well-being.

arXiv cs.SEEvolutionarXivEvolution and maintenanceRefactoring and program differencingcs.SEarXiv cs.SE

Analyzing the Adoption of Database Management Systems Throughout the History of Open Source Projects

Camila A. Paiva, Raquel Maximino, Frederico Paiva, Rafael Accetta Vieira, Nicole Espanha, João Felipe Pimentel, Igor Wiese, Marco Aurélio Gerosa, Igor Steinmacher, Leonardo Murta, Vanessa Braganholo
2026/05/08 02:17已过 5 天

Database Management Systems (DBMSs) are widely used to store, retrieve, and manage the data handled by modern applications. Although prior work has studied the co-evolution of DBMSs and application source code, less is known about DBMS adoption, co-use, and replacement in real systems. This paper presents a historical study of DBMS usage in 362 popular open-source Java projects hosted on GitHub. We investigated the adoption of the top DBMSs ranked by DB-Engines, covering relational and non-relational systems. Using source-code heuristics, we analyzed DBMS popularity, stability, migration patterns, co-occurrence, and the role of Object-Relational Mappers (ORMs). Our findings show that MySQL and PostgreSQL are the most popular DBMSs in our corpus. Among non-relational DBMSs, Redis and MongoDB are the most frequently used and tend to remain stable after adoption. In contrast, systems such as HyperSQL are more often replaced as projects evolve. We also observed frequent co-use of multiple DBMSs, suggesting patterns of polyglot persistence in which projects combine systems to handle different data needs. Finally, we found that ORM frameworks are commonly used to mediate interactions between applications and DBMSs. Overall, our study provides empirical evidence on how DBMSs are adopted, combined, and replaced over time, offering guidance for developers, architects, educators, and DBMS vendors.

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securityConfidentialityintegritycs.CRcs.AIarXiv cs.CR

Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

Isaac David, Arthur Gervais
2026/05/08 01:22已过 5 天

Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.LGarXiv cs.CR

FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning

Su Zhang, Junfeng Guo, Heng Huang
2026/05/08 01:21已过 5 天

Watermark radioactivity testing type of methods can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing works have proved their effectiveness in centralized LLM fine-tuning. However, this type of method faces several challenges and remains underexplored in federated learning (FL), a widely-applied paradigm for fine-tuning LLMs collaboratively on private data across different users. FL mainly ensures privacy through secure aggregation (SA), which allows the server to aggregate updates while keeping clients' updates private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client's update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer method. We theoretically show that FedAttr produces an unbiased estimator of each client's update with bounded mutual information leakage (i.e., $O(d^*/N)$ per-round update). Moreover, FedAttr empirically achieves 100% TPR and 0% FPR, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
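An illustrative sketch of step (iii) above, combining per-round watermark-detector z-scores for one client with the Stouffer method; the paired-subset differencing and the watermark detector itself are not reproduced, and the scores are hypothetical.

```python
# Combine independent per-round detection z-scores into one decision statistic.
import math
from statistics import NormalDist

def stouffer_combine(z_scores):
    """Stouffer's method: combined z and a one-sided p-value."""
    z = sum(z_scores) / math.sqrt(len(z_scores))
    p = 1.0 - NormalDist().cdf(z)
    return z, p

# Hypothetical per-round watermark scores for a suspected client.
rounds = [1.1, 0.8, 1.6, 0.9, 1.4]
z, p = stouffer_combine(rounds)
print(f"combined z = {z:.2f}, one-sided p = {p:.4f}")
```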

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securitycs.CRarXiv cs.CR

Language Models Can Autonomously Hack and Self-Replicate

Alena Air, Reworr, Nikolaj Kotov, Dmitrii Volkov, John Steidley, Jeffrey Ladish
2026/05/08 01:09已过 5 天

We demonstrate that language models can autonomously replicate their weights and harness across a network by exploiting vulnerable hosts. The agent independently finds and exploits a web-application vulnerability, extracts credentials, and deploys an inference server with a copy of its harness and prompt on the compromised host. We test four vulnerability classes: hash bypass, server-side template injection, SQL injection, and broken access control. Qwen3.5-122B-A10B succeeds in 6-19% of attempts, and the smaller Qwen3.6-27B reaches 33% on a single A100. This already matches the current-generation GPT-5.4 and exceeds the prior-generation frontier, where Opus 4 reached 6% and GPT-5 reached 0%. Replicating Qwen weights, frontier models reach 81% (Opus 4.6) and 33% (GPT-5.4). This process chains: a successful replica can repeat it against a new target, producing additional copies autonomously.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRcs.DCcs.NI

CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification

Iason Ofeidis, Nikos Papadis, Randeep Bhatia, Leandros Tassiulas, TV Lakshman
2026/05/08 01:01已过 5 天

The rapid expansion of the Internet of Things (IoT) and Industrial IoT (IIoT) has created a massive, heterogeneous attack surface that challenges traditional network security mechanisms. While Federated Learning (FL) offers a privacy-preserving alternative to centralized Intrusion Detection Systems (IDS), standard approaches struggle to generalize across diverse device behaviors and typically fail to utilize the vast amounts of unlabeled data present in realistic edge environments. To bridge these gaps, we propose CLAD, a holistic framework that seamlessly incorporates Clustered Federated Learning (CFL) with a novel Dual-Mode Micro-Architecture ($\text{DM}^2\text{A}$). This unified approach simultaneously tackles the two primary bottlenecks of IoT security: device heterogeneity and label scarcity. The $\text{DM}^2\text{A}$ component features a shared encoder followed by two branches, enabling joint unsupervised anomaly detection and supervised attack classification; this allows the framework to harvest intelligence from both labeled and unlabeled clients. Concurrently, the clustering component dynamically groups devices with congruent traffic patterns, preventing global model divergence. By carefully combining these elements, CLAD ensures that no data is discarded and distinct operational patterns are preserved. Extensive evaluations demonstrate that this integrated approach significantly outperforms state-of-the-art baselines, achieving a 30% relative improvement in detection performance in scenarios with 80% unlabeled clients, with only half the communication cost.
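A hedged sketch (requiring PyTorch) of the shared-encoder, two-branch idea behind the Dual-Mode Micro-Architecture: an unsupervised reconstruction branch for anomaly detection and a supervised branch for attack classification. Layer sizes, feature counts, and the loss combination are placeholders, not CLAD's actual design.

```python
# Shared encoder feeding a reconstruction branch and a classification branch.
import torch
import torch.nn as nn

class DualModeMicroArch(nn.Module):
    def __init__(self, n_features: int, n_attack_classes: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)            # anomaly-detection branch
        self.classifier = nn.Linear(hidden, n_attack_classes)   # attack-classification branch

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = DualModeMicroArch(n_features=40, n_attack_classes=10)
x = torch.randn(8, 40)
recon, logits = model(x)
# Unlabeled clients train only the reconstruction loss; labeled clients train both.
loss_unlabeled = nn.functional.mse_loss(recon, x)
loss_labeled = loss_unlabeled + nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
print(loss_unlabeled.item(), loss_labeled.item())
```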

arXiv AI Agents for Software EngineeringAI for Software EngineeringarXivCollaborative AI for SEEfficacy measurement beyond traditional metricsTrustworthy AI for SEcs.GTcs.LGcs.MAstat.ME

Optimizing Social Utility in Sequential Experiments

Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez
2026/05/08 00:28已过 5 天

Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product's efficacy, ultimately stifling the development of `moonshot' products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experimentation where the product developer (the agent) conducts a randomized controlled trial sequentially and the regulator (the principal) partially subsidizes its cost. By modeling the protocol using a belief Markov decision process, we show that the agent's optimal strategy can be found efficiently using dynamic programming. Further, we show that the social utility is a piecewise linear and convex function over the subsidy level the principal selects, and thus the socially optimal subsidy can also be found efficiently using divide-and-conquer. Simulation experiments using publicly available data on antibiotic development and approval demonstrate that our statistical protocol can be used to increase social utility by more than $35$$\%$ relative to standard, non-sequential protocols.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.AIarXiv cs.CR

On the Security of Research Artifacts

Nanda Rani, Christian Rossow
2026/05/08 00:21已过 5 天

Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues and find that many contain insecure code patterns that may introduce potential attack vectors. We propose a taxonomy for context-aware security assessment to enable structured analysis of such risks. We perform static analysis and examine the resulting findings, filtering false positives and identifying real security risks. Our analysis shows that 41.60% of the prevalent findings may pose security concerns under practical usage. To support scalable analysis, we introduce SAFE (Security-Aware Framework for Artifact Evaluation), a first step toward an autonomous framework that analyzes tool-reported findings by considering code semantics, execution context, and practical exploitability. SAFE achieves 84.80% accuracy and 84.63% F1-score in distinguishing security and non-security risks. Overall, our results show that security is also important in AE for promoting safe and responsible research sharing. The source code is available at: https://github.com/nanda-rani/SAFE

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.AIcs.CRarXiv cs.CR

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas
2026/05/08 00:20已过 5 天

We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
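A sketch of the unanimity idea described above: sign-quantize subset-aggregated zeroth-order gradient estimates and release the shared sign only where every candidate subset agrees, substituting an independent fair coin elsewhere (the ZPL variant). The subset construction, shapes, and distributions are illustrative assumptions.

```python
# Release per-coordinate signs that are unanimous across candidate subsets.
import numpy as np

rng = np.random.default_rng(0)

def released_sign(subset_grads: np.ndarray) -> np.ndarray:
    """subset_grads: (num_candidate_subsets, dim) zeroth-order gradient estimates."""
    signs = np.sign(subset_grads)
    unanimous = np.all(signs == signs[0], axis=0)      # per-coordinate agreement
    coin = rng.choice([-1.0, 1.0], size=subset_grads.shape[1])
    # Unanimous coordinates reveal nothing about which subset is the secret;
    # disagreeing coordinates are replaced by an independent fair coin.
    return np.where(unanimous, signs[0], coin)

grads = rng.normal(loc=0.4, scale=0.3, size=(8, 16))    # 8 candidate subsets, dim 16
print(released_sign(grads))
```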

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Privacy by Postprocessing the Discrete Laplace Mechanism

Quentin Hillebrand, Jacob Imola, Rasmus Pagh, Sia Sejer
2026/05/08 00:19已过 5 天

We show that an "old dog", the classical discrete Laplace (aka.~geometric) mechanism, can "perform new tricks": 1. It can be post-processed to yield a simple, unbiased estimator of any subexponential function $f$ of the original data, giving a simple, discrete, multivariate version of the recent unbiasing result for the Laplace mechanism by Calmon et al. (FORC '25). 2. It can be post-processed to output the same distribution as the Laplace mechanism or the Staircase mechanism with identical privacy parameters. Thus, the discrete Laplace mechanism is a versatile mechanism that should be preferred over the Laplace and Staircase mechanisms whenever the data is discrete (or can be made discrete while controlling $\ell_1$-sensitivity). We show bounds on the variance of our estimator, compared to the mean square error of the biased estimator that simply evaluates the $f$ on the output of the mechanism. Though our unbiased estimator has exponential running time for worst-case functions, we show that it can often be computed in linear or polynomial time for some common functions exhibiting structure. We showcase the properties of our methods empirically with several use cases including profile and entropy estimation, as well as distributed/federated data analysis applications in which unbiasedness is key to accuracy.

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRarXiv cs.CR

Autonomous Adversary: Red-Teaming in the age of LLM

Mohammad Mamun, Mohamed Gaber, Scott Buffett, Sherif Saad
2026/05/08 00:07已过 5 天

Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary emulation, and the orchestration of multi-step activity such as lateral movement, a core enabling capability of advanced persistent threat (APT) campaigns. Using frameworks such as MITRE ATT&CK, we analyze where these agents intersect with core offensive functions and assess current strengths and limitations of LMAs with an emphasis on governance and realistic evaluation. We benchmark LMAs across two lateral-movement scenarios in a controlled adversary-emulation environment, where LMAs interact with instrumented cyber agents, observe execution artifacts, and iteratively adapt based on environmental feedback. Each scenario is formalized as an ordered task chain with explicit validation predicates, leveraging an LLM-as-a-Judge paradigm to ensure deterministic outcome verification. We compare three operational modalities: fully autonomous execution, self-scaffolded planning, and expert-defined action plans. Preliminary findings indicate that expert-defined action plans yield higher task-completion rates relative to other operational modes. However, failure remains frequent across all modalities, largely attributable to brittle command invocation, environmental and deployment instability, and recurring errors in credential management and state handling.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3YArchitecture and DesignarXivArchitecture quality attributescs.SEarXiv AI Coding 3YarXiv cs.SEarXiv AI Agents for Software Engineering

ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java

Advait Pavuluri, Bridget McGinn, Ashita Saxena, George Safta, Srikanth Tamilselvam, Raju Pavuluri, Michele Merler, Baishakhi Ray, Rahul Krishna
2026/05/08 00:05已过 5 天

Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured. We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert-written implementation triples across Spring, Jakarta EE, and Quarkus: 34 applications (29 focused single-layer, 5 whole) yielding 102 variants (~151K lines across 1946 source and test files) and 204 directed refactoring tasks. Each task gives an agent a working source application and a target framework; the agent must synthesize a target implementation preserving the source behavior. Correctness is evaluated by an application-specific executable oracle: the candidate must compile, deploy in a containerized target runtime, and pass behavioral tests over the application's observable interface. We evaluate five state-of-the-art coding agents on ScarfBench. The strongest achieves only 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, and only one of the 204 tasks yields a fully behaviorally equivalent target. Difficulty is asymmetric across framework directions and architectural layers: Spring<->Quarkus is the most tractable pair, and Jakarta-targeted migrations are hardest. From LLM-as-a-judge and expert adjudication of failed-task traces, we derive a taxonomy of recurring failure categories spanning build, deploy, and test stages. We release the benchmark, harness, and agent traces at https://scarfbench.info.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelsCollaborative AI for SEcs.SEarXiv AI Coding 3YarXiv cs.SEarXiv AI Agents for Software Engineering

To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

Shota Sawada, Tatsuya Shirai, Yutaro Kashiwa, Ken'ichi Yamaguchi, Hiroshi Iwata, Hajimu Iida
2026/05/07 23:52已过 5 天

LLM-based autonomous coding agents have reshaped software development. While these agents excel at code generation, open questions persist about the long-term maintainability of AI-generated code. This study empirically investigates the maintenance extent, human involvement, and modification types of AI-generated files versus human-authored code. Using the AIDev dataset of AI-generated pull requests and GitHub, we analyzed over 1,000 files and approximately 3,200 changes from 100 popular repositories. Our findings show that: (i) AI-generated files receive less frequent maintenance than human-authored code, with updates affecting only a small fraction of file size; (ii) the most frequent modifications to AI code are feature extensions, whereas human updates focus on bug fixes, and (iii) human developers perform the large majority of this maintenance.

arXiv cs.CR · Dependability and Security · Confidentiality, integrity, and privacy · cs.IT · cs.CR

Cryptographic and Information-theoretic Security Capacities for General Arbitrarily Varying Wiretap Channels

Holger Boche, Ning Cai, Yiqi Chen, Marc Geitz
2026/05/07 23:51 · 5 days ago

We compare the strong secrecy capacities of Arbitrarily Varying Wiretap Channels (AVWCs) and General Arbitrarily Varying Wiretap Channels (GAVWCs) with their capacities under a semantic secrecy constraint and other equivalent cryptographic secrecy constraints. It turns out that the average-error strong-secrecy capacity of an AVWC is always equal to its maximal-error semantic-secrecy capacity. However, this equivalence does not hold for all general communication systems, and we prove this by a counterexample. We also show that, for the GAVWC, semantic security and the other cryptographic security measures considered achieve the same capacity values. Finally, we bound the gap between the strong secrecy capacity and the semantic secrecy capacity for the GAVWC. The gap vanishes if the jammer's set of choices grows sub-double-exponentially with the block length n, which gives a sufficient condition for the strong and semantic secrecy capacities of a GAVWC to be equal.
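
For reference, the two secrecy notions being compared are usually formalized as vanishing information leakage to the eavesdropper. The formulations below use one standard notation and are not taken verbatim from the paper; the AVWC/GAVWC definitions there may include the jamming state explicitly.

```latex
% Standard formulations of the two secrecy constraints (notation assumed).
\begin{align}
  \text{Strong secrecy:}   &\quad \lim_{n \to \infty} I(M; Z^n) = 0
      \quad \text{for uniformly distributed } M, \\
  \text{Semantic secrecy:} &\quad \lim_{n \to \infty} \max_{P_M} I(M; Z^n) = 0
      \quad \text{over all message distributions } P_M,
\end{align}
% so semantic secrecy implies strong secrecy, and the question is when the
% corresponding capacities coincide.
```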

arXiv cs.CR · Dependability and Security · Confidentiality, integrity, and privacy · cs.NI · cs.CR

When to Use Wireless Challenge-Response Physical Layer Authentication: Design of a Measurable Guideline for OFDM

Haiyun Liu, Shangqing Zhao, Yao Liu, Zhuo Lu
2026/05/07 23:50 · 5 days ago

The security of wireless challenge-response Physical Layer Authentication (PLA) based on Orthogonal Frequency Division Multiplexing (OFDM) relies on a sufficiently random fading channel, a condition commonly assumed in existing studies. In practical scenarios, however, this condition is not always guaranteed, and the responses of OFDM subchannels may exhibit correlation. Consequently, ensuring the security of such PLA systems remains an unsolved problem. In this paper, we propose a novel adversary model, called the Maximum Differential Likelihood Generator (MDLG), which exploits the weak correlation present in practical wireless channels to launch effective attacks against PLA. Based on this model, we create a measurable guideline, built on randomness testing, for deciding when PLA can in fact be used under a given practical wireless channel condition. Extensive real-world experiments validate the effectiveness of the MDLG attack and demonstrate how the proposed guideline can help protect the security of PLA.
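
As an illustration of such a guideline, the sketch below applies a simple randomness test to binarized subchannel responses and only recommends enabling PLA when randomness is not rejected. The specific test (a NIST-style monobit test), the binarization rule, and the threshold are assumptions, not the paper's actual procedure.

```python
# Toy "should we use PLA here?" check: test binarized channel responses for randomness.
import math
import random

def monobit_p_value(bits: list[int]) -> float:
    """NIST SP 800-22 style frequency (monobit) test p-value."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2 * n))

def binarize(responses: list[float]) -> list[int]:
    """Toy binarization: compare each subchannel gain to the median gain."""
    med = sorted(responses)[len(responses) // 2]
    return [1 if r > med else 0 for r in responses]

def pla_recommended(responses: list[float], alpha: float = 0.01) -> bool:
    """Recommend challenge-response PLA only if randomness is not rejected."""
    return monobit_p_value(binarize(responses)) >= alpha

# Toy usage with synthetic channel gains (not measurement data):
random.seed(0)
print(pla_recommended([random.gauss(0, 1) for _ in range(256)]))
```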

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / arXiv AI Agents for Software Engineering · AI for Software Engineering · AI-enabled recommender systems for automated SE · Automating SE tasks with LLM and foundation models · Collaborative AI for SE · cs.SE · cs.AI

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

Francesco Dente, Dario Satriani, Paolo Papotti
2026/05/07 23:44 · 5 days ago

Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers. Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline. Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.
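
The dual evaluation can be pictured as combining an end-to-end behavioral score with static constraint checks, as in the sketch below. The specific verifiers here (a textual ORM check and a raw-SQL check) are toy assumptions, not the benchmark's actual static verifiers.

```python
# Sketch of a dual evaluation: behavioral tests plus static structural checks.
import subprocess
from pathlib import Path

def behavioral_pass_rate(app_dir: str) -> float:
    """Run the task's test suite; pytest is assumed here as a stand-in."""
    proc = subprocess.run(["pytest", "-q", app_dir], capture_output=True)
    return 1.0 if proc.returncode == 0 else 0.0   # coarse placeholder score

def static_constraints_satisfied(app_dir: str) -> dict[str, bool]:
    """Toy structural verifiers: purely textual, unlike real AST-based checks."""
    src = "\n".join(p.read_text(errors="ignore") for p in Path(app_dir).rglob("*.py"))
    return {
        "uses_orm_models": "class " in src and "Model" in src,
        "no_raw_sql": "SELECT " not in src.upper(),
    }

def evaluate(app_dir: str) -> dict:
    checks = static_constraints_satisfied(app_dir)
    return {
        "behavioral": behavioral_pass_rate(app_dir),
        "structural": sum(checks.values()) / len(checks),
        "details": checks,
    }
```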

arXiv LLM4Code / Program Repair · AI for Software Engineering · Automating SE tasks with LLM and foundation models · AI-enabled recommender systems for automated SE · Collaborative AI for SE · cs.MA

AgenticPrecoding: LLM-Empowered Multi-Agent System for Precoding Optimization

Zijiu Yang, Zixiang Zhang, Shunpu Tang, Qianqian Yang, Zhiguo Shi
2026/05/07 23:43 · 5 days ago

Precoding is a key technique for interference management and performance improvement in multi-antenna wireless systems. However, existing precoding methods are typically developed for specific system models, objectives, and constraint sets, which limits their adaptability to the heterogeneous and evolving scenarios expected in future 6G networks. To address this limitation, we propose AgenticPrecoding, a universal multi-agent framework that automates end-to-end precoding derivation directly from user-level communication requirements. Specifically, AgenticPrecoding decomposes the derivation process into four coordinated stages: problem formulation, solver selection, prompt upsampling, and code generation, assigning each stage to a specialized agent tailored to its specific reasoning demands. We employ two LoRA-adapted reasoning agents to inject precoding-specific domain knowledge for problem formulation and solver selection, while two general-purpose Large Language Models (LLMs) handle prompt refinement and executable code generation. Furthermore, a feedback-driven refinement mechanism is incorporated to enhance code executability, constraint feasibility, and solution quality. Extensive experiments across 10 representative precoding scenarios demonstrate that AgenticPrecoding achieves superior cross-scenario adaptability compared to conventional optimization-based and LLM-based baselines.
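
A minimal sketch of such a staged pipeline with a feedback-driven retry loop is shown below. The `call_llm` stub, the prompts, and the retry policy are placeholders; they stand in for the paper's specialized agents and LoRA-adapted models, which are not reproduced here.

```python
# Sketch of a four-stage agent pipeline: formulate -> select solver ->
# upsample prompt -> generate code, with feedback on failed executions.
from typing import Callable

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string for demonstration."""
    return f"[{role} output for: {prompt[:40]}...]"

def run_pipeline(requirement: str, execute: Callable[[str], bool], max_rounds: int = 3) -> str:
    problem = call_llm("formulator", f"Formalize a precoding problem for: {requirement}")
    solver = call_llm("solver-selector", f"Pick a solver for: {problem}")
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        prompt = call_llm("upsampler", f"Expand into a coding prompt: {problem} / {solver} / {feedback}")
        code = call_llm("coder", prompt)
        if execute(code):          # executability / feasibility check
            return code
        feedback = "previous attempt failed to execute or violated constraints"
    return code

# Toy usage: an 'executor' that accepts everything on the first try.
print(run_pipeline("maximize sum rate under per-antenna power limits", lambda c: True))
```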

arXiv cs.CR · AI for Software Engineering · Automating SE tasks with LLM and foundation models · Prompt engineering for SE · cs.CR

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

Zeyuan Chen, Yihan Ma, Xinyue Shen, Michael Backes, Yang Zhang
2026/05/07 23:29 · 5 days ago

Large language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance but do not eliminate the risk. Our results highlight persistent privacy vulnerabilities in modern LLMs.
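
The core mechanism can be sketched as follows: hide a span of the target text, offer it among distractors, and treat above-chance recall as evidence of membership. The prompt format, the `query_model` stub, and the scoring threshold are assumptions for illustration, not the paper's exact attack.

```python
# Sketch of a quiz-style membership test: mask a span, ask multiple choice,
# and score how often the model recovers the original span.
import random

def make_quiz(text: str, distractors: list[str]) -> tuple[str, str]:
    words = text.split()
    span = " ".join(words[len(words) // 2 : len(words) // 2 + 5])   # hidden span
    stem = text.replace(span, "_____", 1)
    options = distractors + [span]
    random.shuffle(options)
    letters = "ABCDEFGH"[: len(options)]
    body = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    answer = letters[options.index(span)]
    prompt = f"Fill in the blank:\n{stem}\n{body}\nAnswer with one letter."
    return prompt, answer

def query_model(prompt: str) -> str:
    """Stub for a black-box LLM call; replace with a real API client."""
    return "A"

def membership_score(text: str, distractors: list[str], trials: int = 20) -> float:
    """Fraction of quizzes answered correctly; high values suggest membership."""
    correct = 0
    for _ in range(trials):
        prompt, answer = make_quiz(text, distractors)
        correct += query_model(prompt) == answer
    return correct / trials
```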