Recently Added Papers

Track new papers from arXiv and followed conferences/journals; browse by publication date, or search with results ranked by relevance.

ICML 2026 OpenReview / AI for Software Engineering / Automating SE tasks with LLM and foundation models / ICML 2026

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon
2026/05/12 01:23 (22 hours ago)

Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
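
The cost-constrained search behind pillar (1) can be pictured with a small sketch. The Python below is an illustration only, not the MARS implementation; the node fields, coefficients, and budget handling are assumptions.

```python
# Minimal sketch (not the authors' code) of budget-aware MCTS selection:
# a UCT-style score that discounts a node's value by the execution cost
# it would add against the remaining compute budget.
import math

class Node:
    def __init__(self, est_cost):
        self.est_cost = est_cost   # predicted execution cost (e.g., GPU-hours)
        self.visits = 0
        self.total_value = 0.0     # accumulated evaluation score

def budget_aware_uct(node, parent_visits, remaining_budget,
                     c_explore=1.4, c_cost=0.5):
    """Score a candidate node for selection in cost-constrained MCTS."""
    if node.visits == 0:
        exploit = 0.0
        explore = float("inf")     # always try unvisited children once
    else:
        exploit = node.total_value / node.visits
        explore = c_explore * math.sqrt(math.log(parent_visits) / node.visits)
    # Penalize expansions whose cost would eat a large share of the budget.
    cost_penalty = c_cost * (node.est_cost / max(remaining_budget, 1e-9))
    return exploit + explore - cost_penalty
```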

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / AI for Software Engineering / Collaborative AI for SE / cs.SE

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

Young Jo Chung, Safwat Hassan
2026/05/09 01:06 (4 days ago)

When AI coding agents open branches and submit pull requests (PRs), two questions co-determine oversight design: who starts the work (operational agency) and who authorizes its completion (merge governance). We characterize tools along a Collaborator-Assistant spectrum in how they redistribute initiative, oversight, and endorsement, while merge governance remains predominantly human across five tools (OpenAI, Copilot, Devin, Cursor, Claude Code). We analyze 29,585 PR lifecycles using an Initiator x Approver taxonomy with six interaction scenarios; lifecycle reconstruction supplies the how behind those roles. Collaborator tools (Cursor, Devin, Copilot) concentrate operational initiative in agents that open and carry PR work forward, with humans retaining review and endorsement on the path to merge; Assistant tools (OpenAI, Claude) leave task direction primarily with humans and supply bounded support within human-led workflows. Across the spectrum, agency and governance decouple: Collaborator workflows are >=96% agent initiated, yet terminal merge authority remains almost exclusively human, with agent-classified approvers confined to a small fraction of PRs. Where automation executes a merge, logs record the executor but not the decision-maker, marking a boundary of observation. We contribute the taxonomy, per-tool state machines, and a replication package for research on automation, oversight, and governance in PR workflows.

arXiv AI Coding 3Y / arXiv AI Agents for Software Engineering / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Collaborative AI for SE / cs.AI

Learning CLI Agents with Structured Action Credit under Selective Observation

Haoyang Su, Ying Wen
2026/05/09 01:02 (4 days ago)

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $σ$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.

arXiv cs.SE / Testing and Analysis / Software testing / cs.CL / cs.AI / cs.LG / cs.SE

Tool Calling is Linearly Readable and Steerable in Language Models

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz
2026/05/09 00:47 (4 days ago)

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $τ$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
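
As a rough illustration of the mean-difference steering described here (not the authors' code), the sketch below averages a layer's activations over prompts that select tool A and tool B and adds the difference back into the hidden state; the layer choice, prompt sets, and hook placement are assumptions.

```python
# Illustrative sketch of mean-difference activation steering between two tools.
import torch

def mean_difference_vector(hidden_states_a, hidden_states_b):
    """hidden_states_*: lists of [hidden_dim] activations at one layer,
    collected from prompts where the model picks tool A or tool B."""
    mu_a = torch.stack(hidden_states_a).mean(dim=0)
    mu_b = torch.stack(hidden_states_b).mean(dim=0)
    return mu_b - mu_a   # adding this should steer selection from tool A to B

def steer(hidden, direction, alpha=1.0):
    """Add the steering direction to a layer's residual-stream activation."""
    return hidden + alpha * direction
```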

arXiv cs.CR / Dependability and Security / Dependability and security for embedded and cyber-physical systems / Reliability, availability / cs.CL / cs.CR

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis
2026/05/09 00:44 (4 days ago)

Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classification problem as sequential text generation, a design choice that incurs high latency and scales poorly to multi-aspect evaluation. In this work, we introduce \textbf{GLiGuard}, a 0.3B-parameter schema-conditioned bidirectional encoder adapted from GLiNER2 for LLM content moderation. The key idea is to encode task definitions and label semantics directly into the input sequence as structured token schemas, enabling simultaneous evaluation of prompt safety, response safety, refusal detection, 14 fine-grained harm categories, and 11 jailbreak strategies in a single non-autoregressive forward pass. This schema-conditioned design lets supported task and label blocks be composed directly in the input schema at inference time. Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B--27B decoder-based guards despite being 23--90$\times$ smaller, while delivering up to 16$\times$ higher throughput and 17$\times$ lower latency. These results suggest that compact bidirectional encoders can approach the accuracy of much larger guard models while drastically reducing inference cost. Code and models are available at https://github.com/fastino-ai/GLiGuard.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.LG / cs.CR / cs.NI

Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs

Hanlin Cai, Kai Li, Houtianfu Wang, Haofan Dong, Yichen Li, Falko Dressler, Ozgur B. Akan
2026/05/09 00:24 (4 days ago)

Federated fine-tuning (FFT) has emerged as a privacy-preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT-based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT-based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign-like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance- and similarity-based defense methods.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / Prompt engineering for SE / cs.SE

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization

Golnaz Gharachorlu, Mahsa Panahandeh, Lionel C. Briand, Ruifeng Gao, Ruiyuan Wan
2026/05/09 00:20 (4 days ago)

Software failures remain a major challenge in modern software development, and identifying the code elements responsible for failures is a time-consuming debugging task. While extensive research has focused on fault localization in the system under test (SUT), failures can also originate from faulty system test scripts. This problem, known as Test Code Fault Localization (TCFL), has received significantly less attention despite its importance in continuous integration (CI) environments where large test suites are executed frequently. TCFL is particularly challenging because it typically operates under black-box conditions, relies on limited diagnostic signals such as error messages and partial logs, and involves large system-level test scripts that expand the fault localization search space. In this paper, we propose SPARK, a framework that integrates accumulated debugging knowledge from continuous integration (CI) environments into Large Language Model (LLM)-based TCFL. Given a newly observed failing test case, SPARK retrieves similar fault-labeled test cases from a debugging knowledge corpus and selectively annotates suspicious lines of the failing test based on their similarity to previously observed fault patterns. These annotations guide the LLM's reasoning while maintaining scalability and avoiding the prompt-length explosion common to naive retrieval-augmented approaches. We evaluate SPARK on three industrial datasets containing real-world faulty Python test cases from different software products. The results show that SPARK consistently improves fault localization effectiveness compared to the existing LLM-based TCFL baseline while maintaining comparable inference cost and token usage. In particular, the approach advances the state of the art by identifying more correct faulty locations in complex test cases containing multiple faults.

arXiv cs.SE / Analytics / Data-driven user experience / Software metrics and measurements / cs.SE

Evaluating Design Conformance Through Trace Comparison

Reid Anderson, Hassan Reza
2026/05/08 23:48 (4 days ago)

The design of a system and its implementation are two tasks often carried out by different individuals on a development team, and can occur weeks or months apart. This creates a potential for divergence between real behavior and the designed model that an implementation is intended to match. Particularly as time passes and individuals who were present for the original conception of the design leave, a system can lose coherence and drift from intended design principles. Even with a robust system design, more is needed to ensure that the key implementation details match the design and that adherence to a particular strategy is not lost over time. This paper proposes an approach to address that concern for distributed systems using conformance checking, a methodology borrowed from process mining. Distributed traces produced by instrumented applications are evaluated for conformance by comparison to design traces. The resulting conformance percentage is a quantitative metric that can be tracked over time to determine how closely a concrete implementation corresponds to the key attributes of the expected design model. This analysis is done using the dominant industry standard, OpenTelemetry, and so should apply to a wide range of distributed systems.
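
A toy version of the conformance score can be sketched as follows; this is not the paper's tooling, and flattening OpenTelemetry span trees into ordered operation lists is a simplifying assumption.

```python
# Rough sketch: score an observed distributed trace against a designed
# sequence of operations and report a conformance percentage.
from difflib import SequenceMatcher

def conformance_percentage(design_trace, observed_trace):
    """design_trace / observed_trace: ordered lists of operation names."""
    matcher = SequenceMatcher(None, design_trace, observed_trace)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(design_trace), 1)

# Example: two of three designed steps appear, in order, in the recorded trace.
design = ["checkout", "reserve-stock", "charge-card"]
observed = ["checkout", "charge-card", "send-email"]
print(conformance_percentage(design, observed))  # ~66.67
```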

arXiv cs.CR / Dependability and Security / Vulnerability detection and software security / Confidentiality, integrity / cs.CR

Longitudinal Analyses of SAST Tools: A CodeQL Case Study

Jean-Charles Noirot Ferrand, Kyle Domico, Yohan Beugin, Patrick McDaniel
2026/05/08 23:41 (4 days ago)

Open-source software (OSS) pipelines rely on automated static analysis tools to prevent the introduction of vulnerabilities in code. However, there is limited understanding of the efficacy of these tools across the OSS ecosystem over time. In this paper, we introduce a novel method to evaluate static application security testing (SAST) tools through longitudinal measurements and perform the largest academic study of CodeQL -- the most prevalent static analysis tool from GitHub -- on OSS codebases. We apply our apparatus on 114 versions of CodeQL over time on 3993 CVEs from 1622 repositories to measure key properties of the tool, culminating in more than 20 billion lines of code analyzed. First, we measure its effectiveness, i.e., its ability to detect vulnerabilities before they are fixed. Then, we determine whether these detections were actionable through two measures of the distance between findings and vulnerability location either over the entire codebase or within the vulnerable file. Finally, we study the stability of CodeQL by examining how vulnerability detections hold across versions and the evolution of CodeQL on the accuracy-precision trade-off. We find that CodeQL identifies a total of 171 CVEs, and that for 83 of them, a CodeQL version prior to the fix could detect it. Such detections are in general actionable if findings are triaged across files, as for 50% of the 171 detections, more than 50% of findings in the vulnerable file are located in the vulnerable location. Finally, we show that CVE detections are not monotonic across versions as 21 CVEs were no longer detected following a version change and 17 were never redetected. Our study shows that using SAST tools is a matter of best practice as they prevent numerous vulnerabilities from being introduced, but that developers should be aware of changes that may leave blind spots in detections upon updates of the tool.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.GT / cs.CR

Zero-determinant Strategy for Moving Target Defense: Existence, Performance, and Computation

Zhaoyang Cheng, Guanpu Chen, Yiguang Hong, Ming Cao, Mikael Skoglund
2026/05/08 23:16 (4 days ago)

Moving Target Defense (MTD) is commonly formulated as a repeated security game to mitigate persistent threats. Although the strong Stackelberg equilibrium (SSE) characterizes the defender's optimal strategy in the leader-follower framework, computing the SSE often incurs high computational complexity, which significantly limits its practical deployment in MTD problems with multiple targets. This paper proposes adopting a zero-determinant (ZD) strategy for constructing an MTD strategy that achieves both high defensive performance and substantially low computational complexity. We first derive a necessary and sufficient condition for the existence of ZD strategies and investigate the performance of ZD strategies, which shows their upper-bound performance matches that of the SSE strategy. We then formulate two programs to find the optimal ZD strategy parameters under different conditions. Moreover, we design an algorithm to compute the proposed ZD strategies, along with the computational complexity analysis in comparison with the traditional SSE computation. Finally, we conduct experiments on two practical applications to verify our results.

arXiv cs.SE / arXiv AI Agents for Software Engineering / Dependability and Security / Vulnerability detection and software security / Dependability and security for embedded and cyber-physical systems / cs.SE

Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

Xinyi Hou, Yanjie Zhao, Haoyu Wang
2026/05/08 23:03 (4 days ago)

Model Context Protocol (MCP) servers have quickly become the interface layer between LLM agents and external tools, yet they also introduce unsafe data flows that existing analyzers handle poorly. Vulnerabilities manifest in two directions: requester-controlled arguments may propagate to sensitive operations, while untrusted external or sensitive internal data may surface through MCP-visible outputs and subsequently influence host or model behavior. Accurate detection is complicated by the heterogeneous registration and dispatch patterns MCP servers employ, the need for MCP-specific taint semantics, and the fact that bugs often only materialize along complete tool-scoped execution paths. We present MCP-BiFlow, a bidirectional static analysis framework built around MCP-aware entrypoint recovery, protocol-specific taint modeling, and interprocedural propagation analysis. Against a benchmark of 32 confirmed MCP vulnerability cases, MCP-BiFlow identifies 30 (93.8% recall), substantially outperforming CodeQL, Semgrep, Snyk Code, and MCPScan. Across 15,452 real-world MCP server repositories, MCP-BiFlow surfaces 549 overlap-compressed candidate clusters; manual review confirms 118 vulnerability paths in 87 servers, establishing unsafe propagation as a recurring failure mode that resists detection without protocol-aware recovery of both request-side and return-side flows.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR / cs.AI

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim, Hoki Kim
2026/05/08 22:57 (4 days ago)

Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack-family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.

arXiv cs.SE / arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR / cs.SE

Can I Check What I Designed? Mapping Security Design DSLs to Code Analyzers

Sven Peldszus, Frederik Reiche, Kevin Hermann, Sophie Corallo, Thorsten Berger, Robert Heinrich
2026/05/08 22:46 (4 days ago)

When assessing the potential impact of code-level vulnerabilities, e.g., discovered by automated analyzers, it is essential to consider them in the context of the system's security design. However, this is a challenging task due to the abstraction gap between security design, often specified using security DSLs, and implementation. As we will show, even security experts lack a complete understanding of this relationship. Intrigued by this gap (and the general disconnect between secure design and secure implementation) we present a study of 66 design-level security DSLs and 559 security checks from 36 code-level analyzers. We identify what concepts are common to both and capture them in the SecLan model, which has been validated by 22 security experts. Based on this, we investigate the relationship between DSLs and analyzers quantitatively and explore it qualitatively together with 9 security experts. We learn that there are few commonalities between design-level and implementation-level security; security checks are often described by overly general weaknesses, resulting in many non-obvious potential relationships between security DSLs and analyzers; and even security experts are overwhelmed by this complexity. We provide an empirical basis that helps practitioners and researchers better understand the gap and serves as a first step toward bridging it.

arXiv cs.CR / Dependability and Security / Formal methods and model checking / cs.CR / cs.LG

GRASP -- Graph-Based Anomaly Detection Through Self-Supervised Classification

Robin Buchta, Carsten Kleiner, Felix Heine, Gabi Dreo Rodosek
2026/05/08 22:45 (4 days ago)

Advanced persistent threat (APT) attacks remain difficult to detect due to their stealth, adaptability, and use of legitimate system components. Provenance-based intrusion detection systems (PIDS) offer a promising defense by capturing detailed relationships between system components and actions. However, current PIDS rely on predefined or subset-determined thresholds, which limit detection stability and the ability to detect any anomalous behavior in general. Furthermore, related work often neglects the role of process executables, which describe system activity by interacting through a process with files, network components, and other processes. We introduce GRASP, a PIDS based on masked self-supervised classification. GRASP masks the executable information of processes and learns to infer it from their two-hop provenance graph neighborhood, marking misclassified processes as anomalies. It captures behavior patterns for the learned executables without thresholding, making it robust against interference and unknown activities. Evaluations on the DARPA TC and OpTC datasets demonstrate that GRASP consistently detects anomalous behavior, including known attack-related activities, outperforming existing systems. Our PIDS identifies all documented attacks on datasets where the behavior of executables is learnable. In addition, compared to existing systems, GRASP uncovers potentially malicious anomalous behavior not labeled as an attack in the documentation.

arXiv cs.SE / Testing and Analysis / Software testing / cs.SE

Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching

Junhao Chen, Jingxuan Zhang, Jian He, Yixuan Tang, Weiqin Zou
2026/05/08 22:26 (4 days ago)

The lexical and syntactic disparities among different programming languages (e.g., Java and Python) pose significant challenges for multi-language software engineering tasks such as cross-language code clone detection and code retrieval, since queries or code snippets written in one programming language often fail to match equivalent artifacts in another. To bridge this gap between different programming languages, we proposed a novel approach to construct a multi-language shared semantic space, in which functionally equivalent source code written in different programming languages are close to each other. In this approach, we first map the Abstract Syntax Tree (AST) node labels of the code snippets written in different programming languages into a unified label set, thus compressing high-dimensional language-specific tokens into a common embedding space. Then, we employ a Graph Matching Network (GMN) to encode the paired AST graphs into "semantic vectors" that capture functional equivalence between programming languages in a unified code vector space. In such a way, we can eliminate the differences in syntax between different programming languages. To validate the effectiveness of this approach, we apply it to two downstream tasks, including cross-language clone detection and cross-language code retrieval. Experiments demonstrate that our approach substantially outperforms the state-of-the-art baselines in cross-language clone detection, improving Precision from 95.62% to 99.94%, Recall from 97.72% to 99.92%, and F1 score from 96.94% to 99.93%. In terms of cross-language code retrieval, our approach raises the average Mean Reciprocal Rank (MRR) from 0.4909 to 0.5547, showing an absolute gain of 0.0638 (13% relative improvement), which demonstrates its superior ability to rank correct code snippets high across multiple programming languages.
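
The first step, mapping language-specific AST node labels onto a unified label set, can be pictured with a small sketch; the mapping entries below are invented examples, not the paper's actual label table.

```python
# Toy sketch of the label-unification step: map language-specific AST node
# types onto one shared label set before graph encoding. Entries are
# illustrative assumptions.
UNIFIED_LABELS = {
    # Java-style node types          # Python ast-module node types
    "MethodDeclaration": "FUNC_DEF", "FunctionDef": "FUNC_DEF",
    "IfStatement": "IF",             "If": "IF",
    "ForStatement": "LOOP",          "For": "LOOP",
    "WhileStatement": "LOOP",        "While": "LOOP",
    "ReturnStatement": "RETURN",     "Return": "RETURN",
}

def unify_labels(ast_node_labels):
    """Replace per-language node labels with shared ones; keep unknowns as OTHER."""
    return [UNIFIED_LABELS.get(label, "OTHER") for label in ast_node_labels]

print(unify_labels(["MethodDeclaration", "If", "ForStatement"]))
# ['FUNC_DEF', 'IF', 'LOOP']
```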

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Collaborative AI for SE / Prompt engineering for SE / cs.SE

Coding Agents Don't Know When to Act

Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev
2026/05/08 22:10 (4 days ago)

Coding agents are increasingly deployed to autonomously maintain software, including to resolve user-reported issues: a bug report comes in and the agent creates a patch to address it. However, in any real-world deployment, they will encounter stale bug reports about issues that have already been resolved. Agents should recognize this and abstain from modifying the code to avoid accumulating technical debt. To systematically evaluate whether current agents do so, we introduce FixedBench, a code benchmark with 200 human-verified coding tasks in which no code changes are required, testing five recent models across four agent harnesses. We find that even state-of-the-art models fail, proposing undesirable changes (excluding tests and documentation) in $35$ to $65\%$ of cases. Explicit instructions to reproduce the issue before patching partially address this issue but introduce a new failure mode: when an issue is partially fixed, they abstain even though a patch would still be needed. More broadly, our results indicate that LLMs fall prey to an action bias: they choose to act even if inaction would be appropriate. To break this pattern, inaction needs to be explicitly framed as a path to success, which highlights an overreliance on human guidance implicit in current training objectives.

arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / AI-enabled recommender systems for automated SE / Automating SE tasks with LLM and foundation models / Prompt engineering for SE / cs.SE

Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software

Bowei Ning, Xuejun Zong, Lian Lian, Kan He, Yifei Sun, Yuxiang Lei, Plamen Vasilev
2026/05/08 21:45 (4 days ago)

Automated vulnerability detection in critical-infrastructure software confronts a fundamental barrier: industrial software is routinely deployed as stripped, symbol-free binaries that deprive conventional Software Composition Analysis of the source-level transparency it requires. Existing binary analysis techniques close this Semantic Gap only partially -- graph-based detectors preserve structural syntax but discard behavioral semantics, while large language models supply rich semantic cues at the cost of unstable, hallucination-prone inference. To address this gap, we present a semantic-enhanced neuro-symbolic framework that reconstructs behavioral semantics directly from opaque binaries and performs tractable global risk reasoning. Three tightly coupled mechanisms drive this capability: (1) abstract interpretation combined with a reflexive prompting pipeline that structurally constrains a local LLM agent, effectively suppressing hallucinations; (2) a surjective transformation that compresses raw Code Property Graphs into typed Software Supply Chain Knowledge Graphs amenable to scalable reasoning; and (3) a domain-adapted Graphormer that captures long-range vulnerability propagation, augmented by embedding-space subgraph matching to uncover zero-day and APT-style attack patterns. Evaluated across three benchmarks of increasing domain specificity, the framework consistently outperforms all baselines on detection accuracy, semantic lifting fidelity, and APT fingerprint matching. Deployment on a hybrid virtual-physical testbed incorporating production-grade hardware from five ICS vendors further confirms strong detection coverage of high-impact CVEs while substantially reducing false-positive rates relative to leading commercial tools.

arXiv AI Coding 3Y / Analytics / Software metrics and measurements / cs.CL / cs.AI

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman
2026/05/08 21:36 (4 days ago)

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.

arXiv cs.SE / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Collaborative AI for SE / Prompt engineering for SE / cs.SE / cs.CY

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

Gaston Besanson
2026/05/08 21:34 (4 days ago)

Agentic AI systems increasingly act through tools, sub-agents, and external services, but governance controls are still commonly attached to prompts, dashboards, or post-hoc documentation. This creates a structural mismatch in regulated settings: obligations that must constrain execution are often evaluated only after execution has occurred. We introduce SARC, a runtime governance architecture for tool-using agents that treats constraints as first-class specification objects alongside state, action space, and reward. A SARC specification declares each constraint's source, class, predicate, verification point, response protocol, and operating point, and compiles these into four enforcement sites in the agent loop: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. We formalize the minimal invariants required for specification-trace correspondence, show why finite reward penalties do not generally substitute for hard runtime constraints, and extend the architecture to multi-agent workflows through constraint propagation, authority intersection, and attribution-preserving trace trees. We implement a prototype audit checker and report a reproducible synthetic evaluation over 50 seeds comparing SARC against post-hoc audit, output filtering, workflow rules, and policy-as-code-only baselines on a procurement task. SARC executes zero hard-constraint violations under exact predicates; its declared PAA throttling response reduces soft-window overages by 89.5% relative to policy-as-code-only. Predicate-noise and enforcement-failure sweeps are consistent with the claim that residual hard violations under SARC scale with enforcement-stack error rather than environmental violation opportunity. SARC provides the architectural substrate through which obligations can be made executable, inspectable, and auditable at runtime.
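
The Pre-Action Gate can be pictured with a minimal sketch; the names and constraint format below are assumptions rather than the SARC specification language.

```python
# Minimal sketch of a pre-action gate: evaluate declared constraint predicates
# against a proposed tool call before it executes, and block or escalate on
# violation. Field names and responses are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    predicate: Callable[[dict], bool]   # True means the action is allowed
    response: str                        # e.g., "block" or "escalate"

def pre_action_gate(action: dict, constraints: list[Constraint]) -> str:
    for c in constraints:
        if not c.predicate(action):
            return c.response            # stop before execution
    return "allow"

# Example: forbid purchase actions above a spending limit.
limit = Constraint("spend_limit", lambda a: a.get("amount", 0) <= 1000, "escalate")
print(pre_action_gate({"tool": "purchase", "amount": 2500}, [limit]))  # escalate
```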

arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / cs.DC

A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Ajay Navilarekal Rajgopal, Nikolai Solmsdorf
2026/05/08 21:32 (4 days ago)

Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources, e.g. a one-trillion-parameter GPT-style model requires an estimated 120 million exaflops. This challenge necessitates efficient distributed training strategies on cutting-edge High-Performance Computing (HPC) infrastructure. In this work, we explore the SuperMUC-NG Phase 2 (SMNG-P2) system at the Leibniz Supercomputing Centre (LRZ) in Garching, Germany, equipped with Intel Data Center GPU Max 1550 accelerators to extract the necessary computational power. We enable and investigate a comprehensive recipe of parallel training techniques, including tensor parallelism, pipeline parallelism, and sharded data parallelism, essential for facilitating the training of LLMs up to 175 billion-parameter scale on SMNG-P2. Through empirical assessment and extensive hyperparameter tuning, we analyze the complex interplay among these techniques and determine their impact on GPU computational efficiency. We identify an optimized combined strategy that yields high throughput and enables the efficient training of LLMs of varying sizes. Specifically, for the 175B model, we achieved per-tile throughput of 10% of theoretical peak per-tile bf16 FLOPs, employing an out-of-the-box publicly available software stack, utilizing standard distributions without further modification. This approach ensures broad accessibility, as our methodology can be replicated by any user on SMNG-P2 system without need for porting or specialized software engineering. Furthermore, we achieved 93% weak scaling efficiency and strong scaling efficiency of 82% on 128 nodes. This scalable recipe provides a crucial blueprint for efficiently utilizing advanced exascale systems for next-generation foundational model development.

arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / cs.SE / cs.AI

The AI-Native Large-Scale Agile Software Development Manifesto

Ricardo Britto, Fredrik Palmgren, Nishrith Saini, Marcus Ohlin
2026/05/08 21:20 (4 days ago)

Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain largely human-centric and manual, relying on coordination meetings, artifact synchronization, and role-based handoffs that inhibit real-time adaptation. Meanwhile, rapid advances in AI, particularly large language models, have begun transforming software engineering, yet their potential for organizational-level agility remains underexplored. We present the AI-Native Large-Scale Agile Software Development Manifesto: a set of values and principles that redefine how large-scale software development is organized when AI becomes a first-class participant rather than a peripheral tool. The manifesto is grounded in six principles (parallel processes, intent-driven teams, living knowledge, verification-first assurance, orchestrated agent workforces, and reusable blueprints) that together shift development from a meeting-driven, document-heavy, sequential process to an intelligent, adaptive, continuously learning system.

arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / AI-enabled recommender systems for automated SE / Prompt engineering for SE / cs.SE

SafeTune: Search-based Harmfulness Minimisation for Large Language Models

Giordano d'Aloisio, David Williams, Giusy Annunziata, Zhiwei Fei, Antinisca Di Marco, Federica Sarro
2026/05/08 21:15 (4 days ago)

The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / Automating SE tasks with LLM and foundation models / Prompt engineering for SE / AI-enabled recommender systems for automated SE / cs.SE

Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel

Jiashuo Tian, Dong Wang, Chen Yang, Haichi Wang, Zan Wang, Junjie Chen
2026/05/08 20:48 (4 days ago)

False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel. They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs. In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models (LLMs) for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%. These findings highlight the non-negligible cost of false positive bug reports and show the promise of LLMs for more efficient false positive mitigation in the Linux kernel.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.GT / cs.CR / cs.LG

Differentially Private Auditing Under Strategic Response

Florian A. D. Burnat
2026/05/08 20:44 (4 days ago)

Regulatory audits of AI systems increasingly rely on differential privacy (DP) to protect training data and model internals. We study audit design when the audited developer can strategically respond to the privacy-constrained audit interface. We formalize privacy-constrained auditing as a bilevel Stackelberg game, in which an auditor commits to a query policy and DP budget allocation across harm dimensions, and a strategic developer reallocates mitigation efforts in response. We introduce the welfare-weighted under-detection gap $B_w$, the welfare-weighted true residual harm the audit fails to detect at the developer's strategic best response, and prove that naive DP auditing (uniform or harm-proportional allocation) induces a strictly larger $B_w$ than any non-strategic mitigation baseline whenever effective detectability is heterogeneous, the welfare weights are not comonotone with detectability, and the developer's optimum is interior. We characterize the optimal auditor allocation as a four-factor balance of welfare weight, audit miss-probability, detectability elasticity, and mitigation-cost curvature, and provide a single-level reformulation of the bilevel problem via the developer's KKT system. We propose Strategic Private Audit Design (SPAD), a projected-gradient algorithm with hypergradients computed through the developer's best response.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.GT / cs.CR / cs.LG

Quotient Semivalues for False-Name-Resistant Data Attribution

Florian A. D. Burnat, Brittany I. Davidson
2026/05/08 20:34 (4 days ago)

Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness--Sybil frontier.
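
The quotient construction can be illustrated with a toy exact Shapley computation over clusters instead of raw identities; this is a sketch under simplifying assumptions (exhaustive enumeration, a toy value function), not the paper's mechanism.

```python
# Toy sketch: compute Shapley values over attribution clusters rather than
# pseudonymous contributor identities, so splitting one contributor across
# identities does not change the cluster-level payout.
from itertools import combinations
from math import factorial

def shapley_over_clusters(clusters, value):
    """clusters: list of cluster ids; value: callable on a frozenset of clusters."""
    n = len(clusters)
    phi = {c: 0.0 for c in clusters}
    for c in clusters:
        others = [x for x in clusters if x != c]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[c] += w * (value(frozenset(S) | {c}) - value(frozenset(S)))
    return phi

# Example: a unanimity game where only the grand coalition of clusters has value.
clusters = ["alice", "bob"]
v = lambda S: 1.0 if S == frozenset(clusters) else 0.0
print(shapley_over_clusters(clusters, v))   # {'alice': 0.5, 'bob': 0.5}
```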

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR

CCX: Enabling Unmodified Intel SGX Applications on Arm CCA

Matti Schulze, Thorsten Holz, Felix Freiling
2026/05/08 18:22 (4 days ago)

Novel confidential computing technologies such as Intel TDX, AMD SEV, and Arm CCA have recently emerged. In practice, due to its minimal trust boundaries, Intel SGX still remains widely used for enclave-based applications in cloud environments, including confidential cloud services, privacy-preserving communication, secure payment processing, and privacy-focused advertising. With the growing adoption of Arm CPUs in cloud systems, however, existing SGX applications face a significant portability challenge: they are tightly coupled to SGX-specific APIs and execution semantics. In this paper, we present the design and implementation of CCX, a framework that enables existing SGX applications to run on Arm CCA without source code modification. To this end, CCX redesigns SGX functionality within Arm CCA firmware, adapting SGX abstractions to CCA's architecture design while preserving full compatibility with existing applications originally developed for SGX. We implemented a prototype of CCX on both the QEMU emulator and a Nitrogen8M development board. Our evaluation shows that CCX is capable of executing existing SGX applications without requiring source code changes, while providing security guarantees comparable to Intel SGX and achieving performance improvements in our evaluated settings.

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR / cs.LG

GESR: Graph-Based Edge Semantic Reconstruction for Stealthy Communication Detection with Benign-Only Training

Henghui Xu, Yuchen Zhang, Xiaobo Ma
2026/05/08 18:11 (4 days ago)

Detecting stealthy malicious communications from flow logs under benign-only training remains a critical challenge in network security. Malicious communications often camouflage as normal traffic like standard HTTPS flows. Conventional intrusion detectors rely strictly on known labeled attacks. Alternatively, they score flows completely independently. These approaches fail against sparse and context-dependent suspicious activity. To capture this essential context, graph anomaly detectors have been introduced to add valuable relational information to the analysis. However, existing methods fail to test the structural consistency of specific communication edges. To overcome these fundamental limitations, we present GESR, a novel graph-based framework for detecting suspicious communications and anomalous hosts under a benign-only training setting. GESR models complex network activity as attributed communication graphs. It cleverly reconstructs edge semantics entirely from local structural context rather than isolated features. This non-intuitive design forces the framework to predict expected communication patterns from neighborhood topologies. Attackers cannot easily manipulate this deep structural dependency. The model then converts the resulting structural inconsistencies into host-level anomaly scores. It utilizes robust Median Absolute Deviation (MAD) calibration for this final step. We evaluate GESR extensively on CTU-13 and CICIDS2017 datasets. These evaluations strictly impose tight false-positive operating constraints. On CICIDS2017, GESR achieves an outstanding ROC-AUC of 0.9753. It also yields a high TPR of 0.8569 at a strict 5% FPR threshold. GESR consistently outperforms existing methods across both evaluated benchmarks. The results prove that structure-conditioned edge reconstruction is a credible direction for practical intrusion detection.
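
The final MAD calibration step can be sketched as follows; the exact formula and constants GESR uses are not given in the abstract, so this is an assumed textbook variant.

```python
# Sketch: convert per-host reconstruction errors into robust anomaly scores
# using the Median Absolute Deviation (MAD).
import numpy as np

def mad_scores(errors):
    """Robust z-like scores: large values indicate anomalous hosts."""
    errors = np.asarray(errors, dtype=float)
    median = np.median(errors)
    mad = np.median(np.abs(errors - median))
    # 1.4826 makes MAD consistent with the standard deviation under normality.
    return np.abs(errors - median) / (1.4826 * mad + 1e-12)

print(mad_scores([0.10, 0.12, 0.11, 0.09, 0.95]))  # the last host stands out
```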

arXiv cs.CR / Dependability and Security / Reliability, availability, and safety / cs.CR / eess.SY

Resilience of IEC 61850 Sampled Values-Based Protection Systems Under Coordinated False Data Injections

Denys Mishchenko, Irina Oleinikova, Laszlo Erdodi
2026/05/08 18:08 (4 days ago)

This paper assesses the resilience of IEC 61850 digital substations under False Data Injection Attacks (FDIAs) targeting the Sampled Values (SV) protocol. The multicast nature of SV, while enabling time-critical automation, exposes substations to cyber intrusions capable of disrupting protection functions and causing large-scale outages. To evaluate these risks, coordinated attack vectors involving both physical and cyber access at the bay level are experimentally analyzed using an advanced setup based on industrial-grade intelligent electronic devices (IEDs). The proposed attacks simultaneously manipulate multiple electrical parameters in a coordinated and physically consistent manner. Experimental results confirm the feasibility of stealthy multi-vector FDIAs that can trigger false protection actions, conceal real faults, or block protection mechanisms while maintaining realistic signal behavior. The Power Hardware-in-the-Loop (PHIL) testbed enables closed-loop evaluation under strict timing, communication, and protection logic constraints, reflecting real device behavior beyond simulation and controller-level HIL environments. The findings reveal critical vulnerabilities in SV-based protection schemes that directly affect grid reliability, particularly under realistic attacker positioning. To address these challenges, a defense strategy covering deterrence, prevention, detection, mitigation, and resilience is analyzed, with emphasis on bay-level infrastructure. Furthermore, a resilience-oriented method based on trusted independent channels and cross-verification of SV data within the protection logic is outlined as a complementary countermeasure for scenarios where existing standardized security mechanisms are insufficient.

arXiv cs.SE / Testing and Analysis / Software testing / cs.SE

System Test Generation for Virtual Reality Applications using Scenario Models

Gerry Longfils, Maxime Cauz, Arnaud Blouin, Xavier Devroey
2026/05/08 18:07 (4 days ago)

Virtual Reality (VR) applications are increasingly being integrated across a wide range of domains, including surgical training and industrial marketing. However, the long-term adoption and maintenance of VR applications remain limited, particularly due to the lack of effective, systematic, and reproducible software testing approaches tailored to their unique characteristics. To address this issue, we introduce UltraInstinctVR, a novel testing approach for VR applications. Relying on predefined VR models (scenarios), it automates the generation and execution of concrete VR system tests. In our empirical evaluation, we compare UltraInstinctVR with state-of-the-art automated VR testing approaches in terms of coverage and failure detection on 10 open-source VR applications. The results show that UltraInstinctVR outperforms existing automated tools for detecting unique failures and provides valuable insights for identifying real-world bugs in VR applications.

arXiv cs.SE / Testing and Analysis / Software testing / cs.RO / cs.SE

Search-based Robustness Testing of Laptop Refurbishing Robotic Software

Erblin Isaku, Hassan Sartaj, Shaukat Ali, Malaika Din Hashmi, Francois Picard
2026/05/08 18:02 (4 days ago)

The Danish Technological Institute (DTI) focuses on transferring advanced technologies (including robots) to the industry and the public sector. One key application is laptop refurbishment using specialized robots, aimed at promoting reuse, reducing electronic waste, and supporting the European Circular Economy Action Plan. The software of such robots often includes features that use object detection models to detect objects for various purposes, such as identifying screws for laptop disassembly or detecting stickers to remove them. Ensuring the robustness of such models to small input variations remains a critical challenge, and addressing it is important to avoid potential damage to laptops during refurbishment. In this paper, we propose PROBE, a search-based robustness testing approach that leverages multi-objective optimization to identify minimal, localized perturbations that expose failures in object detection models used in the software of laptop refurbishing robots. PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases. Results show that PROBE is 3$\times$ to 7$\times$ more effective than random search in generating failure-inducing perturbations, while requiring smaller perturbation magnitudes, and that the generated perturbations transfer across models. We further show that metamorphic relations provide additional insights into model robustness, enabling the assessment of stability even in non-failing cases.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / AI for Software Engineering / AI-enabled recommender systems for automated SE / Automating SE tasks with LLM and foundation models / cs.SE

Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

Luciano Baresi, Domenico Bianculli, Maryse Ernzer, Livia Lestingi, Fabrizio Pastore, Seung Yeob Shin
2026/05/08 17:55 (4 days ago)

Large Language Models (LLMs) show strong capabilities in code generation, motivating their use in automated quantum solver development. However, in quantum computing, successful execution of generated code is not sufficient: correctness depends on numerically accurate results, which are sensitive to non-trivial mappings, hybrid quantum-classical workflows, and algorithm-specific approximations. This work introduces Q-SAGE, an iterative methodology to evaluate LLMs' capability in generating quantum solvers for scientific problems. The methodology adopts an iterative approach by executing the script generated by the LLM, comparing the result with the result of a classical solver, and refining the script until the two results match within a tolerance threshold. We empirically evaluated the methodology with five families of scientific problems of different complexities and five LLMs, both open source and proprietary. The results show that iterative refinement substantially improves success rates, but introduces a significant computational overhead. Moreover, as model capability increases, failure modes shift from execution errors to numerical inaccuracies, highlighting the current limitations of LLM-based quantum software.
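
The refine-until-match loop can be sketched schematically; the function names below are placeholders, not the Q-SAGE interface.

```python
# Schematic loop (not the Q-SAGE implementation): generate a quantum solver
# script with an LLM, run it, compare the numeric result to a classical
# reference, and refine until they agree within a tolerance or the iteration
# budget is exhausted. generate_script, run_script, and classical_reference
# are placeholder callables supplied by the caller.
def iterative_refinement(problem, generate_script, run_script,
                         classical_reference, rtol=1e-2, max_iters=5):
    reference = classical_reference(problem)
    feedback = None
    for _ in range(max_iters):
        script = generate_script(problem, feedback)
        try:
            result = run_script(script)
        except Exception as exc:               # execution error: feed it back
            feedback = f"Execution failed: {exc}"
            continue
        if abs(result - reference) <= rtol * abs(reference):
            return script, result              # numerically acceptable
        feedback = (f"Result {result} deviates from the classical "
                    f"reference {reference}; revise the algorithm.")
    return None, None                          # no acceptable solver found
```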

arXiv cs.CR / Dependability and Security / Confidentiality, integrity, privacy / cs.CR

An Automated Framework for Cybersecurity Policy Compliance Assessment Against Security Control Standards

Bikash Saha, Sandeep Kumar Shukla
2026/05/08 17:45已过 4 天

Organizational cybersecurity policies are often examined to determine whether they adequately comply with standard security controls. This task is difficult because control statements are abstract, whereas policy documents describe governance practices in varied natural language. As a result, policy-based control assessment is time-consuming, difficult to standardize, and often difficult to document in a traceable manner. To address this gap, we present PROPARAG, an audit support approach for autonomously evaluating organizational cybersecurity policies against security controls. For each control, the approach retrieves relevant policy evidence, assesses coverage, identifies missing elements, and generates supporting explanations and recommendations. We evaluate PROPARAG on two real-world organizational policy corpora using 1,007 NIST SP 800-53 controls across both closed-source and open-source large language models (LLMs). The framework achieves F1 scores of 88.54 on OrgA and 82.31 on OrgB. The evaluation also shows that PROPARAG identifies relevant gaps in documented organizational policies and generates grounded recommendations for each identified gap. This research provides a foundation for LLM-powered autonomous control-level assessment of organizational cybersecurity policies.
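
A per-control retrieve-assess-explain step could look roughly like the sketch below; the function names, the field layout, and the coverage labels are illustrative assumptions rather than PROPARAG's actual interfaces.

```python
def assess_control(control, policy_index, retrieve, llm_judge, top_k=5):
    """Assess one security control against an organization's policy corpus.

    `retrieve` returns candidate policy passages for a control statement, and
    `llm_judge` grades coverage and drafts gaps/recommendations from that evidence;
    both are hypothetical stand-ins."""
    evidence = retrieve(policy_index, control["text"], top_k=top_k)
    verdict = llm_judge(
        control=control["text"],
        evidence=[passage["text"] for passage in evidence],
    )
    return {
        "control_id": control["id"],
        "coverage": verdict["coverage"],            # e.g., covered / partial / missing
        "missing_elements": verdict["gaps"],
        "recommendations": verdict["recommendations"],
        "supporting_passages": [passage["source"] for passage in evidence],
    }
```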

arXiv cs.SETesting and AnalysisarXivSoftware testingcs.SEarXiv cs.SE

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

Yang Liu, Hongjiang Feng, Junsong Pu, Zhuangbin Chen
2026/05/08 17:40已过 4 天

Failure attribution in LLM-based multi-agent systems aims to identify the steps that contribute to a failed execution. This task remains difficult because a single execution can contain many agent actions and tool calls, failure evidence can appear many steps after the original mistake, and existing methods often rely on costly agent workflows, replay, or training on synthetic failure logs. To address these challenges, we propose MASPrism, a lightweight framework that performs failure attribution using prefill-stage signals from a small language model (SLM). MASPrism first extracts token-level negative log-likelihood and attention weights during a prefill pass to identify symptom-like steps and earlier candidate sources, without decoding. It then reconstructs a focused diagnostic prompt and performs a second prefill pass to rank failure-source candidates. Using Qwen3-0.6B as the SLM, MASPrism achieves the best performance on three of the four evaluated subsets across Who&When and TRAIL, improving Top-1 accuracy on Who&When-HC by 33.41% over the best baseline. On TRAIL, MASPrism outperforms strong proprietary LLMs, including Gemini-2.5-Pro, with up to 89.50% relative improvement. MASPrism processes each trace in 2.66 seconds on average, achieving a 6.69$\times$ speedup over the single-pass prompting baseline, with zero output tokens. These results show that MASPrism provides an effective and practical framework for failure attribution in long multi-agent execution logs.
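
The prefill-stage signal extraction is the part that lends itself to a short sketch: one forward pass over the trace, no decoding, per-token negative log-likelihood. The model id matches the SLM named in the abstract, but the exact scoring and the attention aggregation used by MASPrism are not shown and the details here are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-0.6B"  # SLM named in the abstract; usage details are assumed
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def prefill_token_nll(trace_text: str) -> torch.Tensor:
    """Per-token negative log-likelihood from a single prefill pass (zero output tokens)."""
    ids = tok(trace_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                    # one forward pass over the trace
    shift_logits = logits[:, :-1, :]
    shift_labels = ids[:, 1:]
    nll = torch.nn.functional.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).reshape(shift_labels.shape)
    return nll                                        # spikes can mark symptom-like steps
```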

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securityConfidentialityintegritycs.CRarXiv cs.CR

Cross-Modal Backdoors in Multimodal Large Language Models

Runhe Wang, Li Bai, Haibo Hu, Songze Li
2026/05/08 17:29已过 4 天

Developers increasingly construct multimodal large language models (MLLMs) by assembling pretrained components, introducing supply-chain attack surfaces. Existing security research primarily focuses on poisoning backbones such as encoders or large language models (LLMs), while the security risks of lightweight connectors remain unexplored. In this work, we propose a novel cross-modal backdoor attack that exploits this overlooked vulnerability. By poisoning only the connector using a single seed sample and several augmented variants from one modality, the adversary can subsequently activate the backdoor using inputs from other modalities. To achieve this, we first poison the connector to associate a compact latent region with a malicious target output. To activate the backdoor from other modalities, we further extract a malicious centroid from the poisoned latent representations and perform input-side optimization to steer inputs toward this latent anchor, without requiring repeated API queries or full-model access. Extensive evaluations on representative connector-based MLLM architectures, including PandaGPT and NExT-GPT, demonstrate both the effectiveness and cross-modal transferability of the proposed attack. The attack achieves up to 99.9% attack success rate (ASR) in same-modality settings, while most cross-modal settings exceed 95.0% ASR under bounded perturbations. Moreover, the attack remains highly stealthy, producing negligible leakage on clean inputs, and maintaining weight-cosine similarity above 0.97 relative to benign connectors. We further show that existing defense strategies fail to effectively mitigate this threat without incurring substantial utility degradation. These findings reveal a fundamental vulnerability in multimodal alignment: a single compromised connector can establish a reusable latent-space backdoor pathway across modalities, highlighting the need for safer modular MLLM design.

arXiv cs.CRArchitecture and DesignarXivArchitecture quality attributescs.CRarXiv cs.CR

Spying Across Chiplets: Side-Channel Attacks in 2.5/3D Integrated Systems

Giorgio Di Natale, Christelle Rabache, Pierre-Louis Hellier, Florence Podevin, Sylvain Bourdel, Romain Siragusa, Paolo Maistri
2026/05/08 17:27已过 4 天

Advanced packaging and chiplet-based integration are increasingly adopted to build complex heterogeneous systems beyond the limits of monolithic scaling. While these architectures offer major benefits in terms of modularity, yield, and performance, they also introduce new physical attack surfaces. In this paper, we show that side-channel attacks can be mounted across chiplets within the same package or stack. Our key idea is that a communication-oriented chiplet, originally intended to interact with the external environment through an antenna, an RFID-like element, or another contactless coupling structure, can be repurposed as an internal observation platform. We formalize this threat through a realistic adversary model, describe the corresponding attack principle, and experimentally assess its feasibility. The obtained results demonstrate that signals captured through such a communication-oriented interface can reveal information correlated with the activity of a neighboring victim chiplet.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.AIarXiv cs.CR

Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs

Jonathan Hong Jin Ng, Anh Tu Ngo, Anupam Chattopadhyay
2026/05/08 17:24已过 4 天

In this paper, we investigate recent state-of-the-art schemes for watermarking large language model (LLM) outputs. These techniques are claimed to be robust, scalable, and production-grade, aimed at promoting responsible usage of LLMs. We analyse the effectiveness of these watermarking techniques against an extensive collection of modified-text attacks, which perform targeted semantic changes without altering the general meaning of the text content. Our approach encompasses multiple attack strategies, including lexical alterations, machine translation, and even neural paraphrasing. The attack efficacy is measured with two target criteria: successful removal of the watermark and preservation of semantic content. We evaluate semantic preservation through BERT scores, text complexity measures, grammatical errors, and Flesch Reading Ease indices. The experimental results reveal varying levels of effectiveness among different watermarking models, with the same underlying result that it is possible to remove the watermark with reasonable effort. This study sheds light on the strengths and weaknesses of existing LLM watermarking systems, suggesting how such schemes should be constructed to improve their security.

arXiv cs.CRAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsCollaborative AI for SEcs.CRcs.AIcs.MAarXiv cs.CR

HBEE: Human Behavioral Entropy Engine -- Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion

Vickson Ferrel
2026/05/08 17:19已过 4 天

Insider threat detection assumes that an adaptive insider leaves behavioral residue distinguishing them from legitimate users. We test this assumption against an LLM-driven adaptive insider in a controlled multi-agent simulator. Our pre-registered five-condition study isolates defender mode (cascade vs. blind UEBA) crossed with adversary type (naive vs. adaptive OPSEC) plus a no-mole control, across 100 runs (95 valid after pre-committed exclusions). The primary finding is a detection inversion: at T_60, the adaptive mole's suspicion in-degree is statistically lower than a randomly selected innocent agent (Cliff's delta = -0.694, 95% BCa CI [-0.855, -0.519], Mann-Whitney p << 0.01). The pre-registered prediction was the opposite direction. A pre-registered equivalence test (H2) shows adaptive OPSEC produces no detectable shift in the mole's UEBA rank under either defender mode. The two detection signals (peer suspicion graph in-degree and per-agent UEBA rank) decouple under adaptive adversary behavior. We bound generalization explicitly: a pre-registered Gini calibration check (H4) returns FAIL, with HBEE pairwise message-exposure Gini (0.213) diverging from the SNAP Enron reference (0.730) by |Delta Gini| = 0.52, exceeding the equivalence bound by 5x. The paper makes a narrow but surprising claim: in a controlled environment where adaptive OPSEC is implementable as an LLM directive, peer-suspicion-cascade detection inverts. We release the simulator, pre-registration document, frozen scenarios, raw telemetry, and analysis pipeline under an open-source license.

arXiv cs.CRDependability and SecurityarXivDependability and security for embedded and cyber-physical systemscs.CRcs.MMarXiv cs.CR

Forensic analysis of video data deletion and recovery in Honeywell surveillance file system

Jinhee Yoon, Sungjae Hwang
2026/05/08 16:31已过 4 天

Real-time video surveillance systems store recorded video using digital video recorders (DVRs) and network video recorders (NVRs). To support continuous high-volume video storage, these devices employ specialized, nonstandard file systems that are often proprietary and undocumented. This lack of documentation significantly increases the time and effort required for forensic analysis. In this study, we analyze an undocumented proprietary file system used by Honeywell video surveillance devices (one that, to the best of our knowledge, has not been examined in prior work), investigate its deletion mechanisms, and demonstrate the feasibility of video recovery after deletion. We perform a file system analysis using a binary diffing technique and evaluate three deletion methods supported by the target device: 1) formatting-based deletion, 2) data expiration, and 3) overwrite. For each method, we investigate changes in file system metadata and on-disk data structures and demonstrate the feasibility of video data recovery. Our findings aim to support more efficient and accurate forensic investigations of Honeywell surveillance products and provide foundational insights into the analysis of proprietary file systems used in video recording devices.

arXiv cs.SE / arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsPrompt engineering for SEAI-enabled recommender systems for automated SEcs.SEcs.AIarXiv cs.SEarXiv LLM4Code / Program Repair

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Moaath Alshaikh, Tasneem Alshaher, Ricardo Vieira, Beatriz Santana, Clelio Xavier, Jose Amancio, Glauco Carneiro, Julio Leite, Savio Freire, Manoel Mendonca
2026/05/08 16:16已过 4 天

Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Delta kappa = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially -- DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD approx. 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of "Sharing Negative Feedback" is identified across all models (bias ratios up to 5.25x), alongside consistent under-prediction of "Expressing Concerns." Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.
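
The agreement metric at the heart of the study design can be sketched directly; the label strings in the commented example are made up, and the per-run structure (ten runs per prompt configuration) is taken from the abstract.

```python
from sklearn.metrics import cohen_kappa_score

def agreement_over_runs(human_codes, model_runs):
    """Cohen's kappa between human closed codes and each independent model run.

    `human_codes` is a list of labels; `model_runs` is a list of label lists,
    one per run of a given prompt configuration."""
    return [cohen_kappa_score(human_codes, run) for run in model_runs]

# Illustrative use with hypothetical code labels:
# kappas = agreement_over_runs(["PS1", "PS2", "PS1"], [["PS1", "PS2", "PS2"]])
```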

arXiv cs.CRDependability and SecurityarXivReliabilityavailabilityand safetycs.MAcs.AIcs.CRarXiv cs.CR

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Jianming Chen, Yawen Wang, Junjie Wang, Zhe Liu, Qing Wang, Fanjiang Xu
2026/05/08 16:06已过 4 天

Tool-calling text-to-image (T2I) agents can plan and execute multi-step tool chains to accomplish complex generation and editing queries. However, this capability introduces a new safety attack surface: harmful outputs may arise from tool orchestration, where individually benign steps combine into unsafe results, making prompt-only jailbreak techniques insufficient. We present OrchJail, an orchestration-guided fuzzing framework for jailbreaking tool-calling T2I agents. Its core idea is to exploit high-risk tool-orchestration patterns: by learning from successful jailbreak tool-calling traces and their causal relationships to prompt wording, OrchJail directly guides the fuzzing search toward prompts that are more likely to trigger unsafe multi-step tool behaviors, rather than relying on surface-level textual perturbations. Extensive experiments demonstrate that OrchJail improves jailbreak effectiveness and efficiency across representative tool-calling T2I agents, achieving higher attack success rates, better image fidelity, and lower query costs, while remaining robust against common jailbreak defenses. Our work highlights tool orchestration as a critical, previously unexplored attack surface and provides a novel framework for uncovering safety risks in T2I agents.

arXiv cs.SE / arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsAI-enabled recommender systems for automated SEPrompt engineering for SEcs.SEarXiv cs.SEarXiv LLM4Code / Program Repair

Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair

Xinyue Liang, Jingxuan Zhang, Lin Li, Jun Zhang, Junhao Chen
2026/05/08 15:58已过 4 天

With the rapid evolution of emerging programming language ecosystems, the demand for code translation to low-resource languages continues to grow. As Cangjie emerges as a new programming language, its ecosystem and development toolchains are rapidly expanding. Automated translation from popular programming languages to Cangjie is therefore valuable for practical development. However, constrained by both insufficient Cangjie knowledge and scarce parallel code corpora, general Large Language Models (LLMs) are prone to syntactic errors and to semantic as well as structural misalignment in code translation. Existing approaches typically rely on fine-tuning with large-scale parallel data, but they cannot reliably improve compilability or semantic consistency for the low-resource Cangjie language. To tackle these challenges, we propose a multi-stage training framework for LLMs that employs an iterative error repair technique to translate Java code into Cangjie code. The framework trains the LLM in stages, gradually integrating knowledge and achieving semantic alignment as well as structure awareness. During code translation, we also combine compiler feedback with error-repair case retrieval to repair incorrect Cangjie code. We construct syntactic knowledge and monolingual instruction datasets to train the LLM. In addition, we build a Cangjie error repair repository to support error repair in our approach. Experimental results show that, with limited parallel data, our approach improves functional equivalence by 6.06\% compared to the state-of-the-art approaches. Meanwhile, ablation studies confirm that each training stage positively contributes to the final performance.
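
The inference-time repair loop sketched below follows the compile-feedback-plus-retrieval idea in spirit; all callables (`translator_llm`, `cangjie_compile`, `retrieve_repair_cases`, `repair_llm`) are hypothetical stand-ins, not the paper's components.

```python
def translate_with_repair(java_code, translator_llm, cangjie_compile,
                          retrieve_repair_cases, repair_llm, max_rounds=3):
    """Translate Java to Cangjie, then iteratively repair the result using
    compiler diagnostics and similar cases from an error-repair repository."""
    cangjie_code = translator_llm(java_code)
    for _ in range(max_rounds):
        ok, diagnostics = cangjie_compile(cangjie_code)
        if ok:
            return cangjie_code
        similar_cases = retrieve_repair_cases(diagnostics)   # repair-repository lookup
        cangjie_code = repair_llm(cangjie_code, diagnostics, similar_cases)
    return cangjie_code
```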

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

From Conceptual Scaffold to Prototype: A Standardized Zonal Architecture for Wi-Fi Security Training

Vyron Kampourakis, Efstratios Chatzoglou, Vasileios Gkioulos, Sokratis Katsikas
2026/05/08 15:55已过 4 天

Wi-Fi is the dominant wireless access technology, but its widespread use also exposes systems to threats such as rogue access points, deauthentication attacks, and other IEEE 802.11-specific vulnerabilities. Although Cyber Ranges (CRs) have become valuable platforms for cybersecurity training and experimentation, existing wireless-oriented solutions mainly target heterogeneous IoT or mobile-network settings, with Wi-Fi typically treated as one among many. As a result, dedicated CR environments for Wi-Fi-specific security experimentation remain limited. This gap is particularly relevant because wireless attacks often require protocol-aware experimentation that is difficult to reproduce in conventional training environments. This paper introduces a conceptual architecture for a Wi-Fi-focused CR tailored to IEEE 802.11 security scenarios and an open-source prototype. The proposed design is grounded in established CR design principles and organized around core infrastructure, learning management and support, monitoring, management, and access-control zones. Structuring the platform into these distinct zones, the architecture supports modularity, scalability, and future extensibility. Part of the design is realized in a prototype publicly available in a GitHub repository that implements the scenario generation, storage, retrieval, and instantiation workflow, offering an initial practical foundation for the proposed architecture. Overall, the paper provides a structured foundation for the future implementation of Wi-Fi-specialized CR platforms for targeted experimentation.

arXiv AI Coding 3YAnalyticsarXivSoftware metrics and measurementscs.LGcs.AIcs.CLarXiv AI Coding 3Y

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

Saloni Garg, Amit Sagtani
2026/05/08 15:49已过 4 天

Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling": queries no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.

arXiv cs.SEHuman and Social AspectsarXivTeamscommunitiesand companiescs.SEcs.LGarXiv cs.SE

Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry

A. Azamnouri, M. Haug, L. Woltmann, M. Fritz, J. Bogner, S. Wagner
2026/05/08 15:43已过 4 天

The integration of machine learning (ML) into complex software systems has increased challenges in collaboration and communication (CoCo) for the teams building these systems. ML engineering (MLE) teams often involve diverse roles, including ML engineers, data scientists, software engineers, and domain experts, each bringing unique goals, experiences, and jargon. These interdisciplinary dynamics can make it challenging to deploy, reproduce, and maintain ML-enabled systems over the long term. Previous studies have uncovered several CoCo challenges and practices, but most have focused on software-centric companies, leaving limited empirical understanding of how these dynamics unfold in hardware-centric contexts. In hardware-centric environments, CoCo challenges are shaped by additional constraints such as strict data governance, long development cycles, and tight coupling with physical processes, which amplify coordination complexity and reduce flexibility. To strengthen empirical understanding in such settings, we present a qualitative investigation of MLE teams within a global semiconductor company, where ML-enabled systems and manufacturing processes introduce additional complexity. We interviewed 12 practitioners regarding CoCo practices, tools, challenges, and approaches. Through analysis, we identified 16 recurring challenges, with unclear roles and responsibilities emerging as the most critical, along with common practices and recommendations that practitioners considered effective in mitigating CoCo problems. While grounded in a single organizational context, our findings align with known issues in interdisciplinary ML-enabled systems development, but also demonstrate how these challenges manifest differently under hardware-driven constraints. Our results highlight directions for future research and tool support to strengthen CoCo in MLE projects and ensure the success of ML-enabled systems.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.GTcs.CRarXiv cs.CR

Game-Theoretic Analysis of Transaction Selection in DAG-Based Distributed Ledgers

Sebastian Müller, Alexandre Reiffers-Masson
2026/05/08 15:42已过 4 天

Transaction selection in parallel or DAG-based distributed ledger technologies (DLTs) is a crucial challenge that directly impacts throughput, fairness, and validator incentives. In these systems, validators independently choose transactions to include in their blocks, often relying on naive heuristics like uniform or proportional selection. This can lead to inefficient outcomes when validators prioritize their own rewards without considering collective impacts. We analyze two fee allocation mechanisms used in practice: Random Fee Allocation (RFA), where transaction fees are randomly assigned to one validator, and Collaborative Fee Sharing (CFS), where fees are distributed equally among all validators. Using a single-shot game-theoretic framework, we derive symmetric Nash equilibria (NE) for selecting transactions for both mechanisms and propose an optimization-based method to compute these equilibria. Numerical simulations demonstrate that the NE of CFS consistently achieves higher throughput and rewards compared to the NE of RFA, particularly under skewed fee distributions. Additionally, we compare these equilibrium strategies to naive benchmarks (uniform and proportional selection), showing that the proportional strategy outperforms the NE of RFA in many situations. These findings may provide actionable insights into the design of transaction selection and incentive mechanisms, enabling more robust and high-performance DAG-based DLTs.
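
A toy worked example helps show why the two fee rules create different incentives. The payoff rules below are a simplified reading of the abstract (RFA gives a transaction's fee to one of its includers uniformly at random; CFS pools all included fees and splits them evenly); the fee values and two-validator setup are purely illustrative.

```python
from itertools import product

FEES = {"t1": 10.0, "t2": 1.0}      # two pending transactions with skewed fees
VALIDATORS = ["v1", "v2"]

def payoffs(profile, mechanism):
    """Expected payoff per validator for one pure strategy profile (validator -> tx)."""
    included = set(profile.values())
    if mechanism == "CFS":
        share = sum(FEES[t] for t in included) / len(VALIDATORS)
        return {v: share for v in VALIDATORS}
    # RFA: expected fee = fee / (number of validators that included the transaction)
    return {v: FEES[t] / sum(1 for c in profile.values() if c == t)
            for v, t in profile.items()}

for choice in product(FEES, repeat=len(VALIDATORS)):
    profile = dict(zip(VALIDATORS, choice))
    print(profile, "RFA:", payoffs(profile, "RFA"), "CFS:", payoffs(profile, "CFS"))
```

In this tiny example, CFS pays both validators more when they cover different transactions (5.5 each versus 5.0 each), while RFA turns the high-fee transaction into a coordination conflict, which is consistent with the higher throughput reported for CFS equilibria.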

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRstat.AParXiv cs.CR

Combating Organized Platform Abuse: Amplifying Weak Risk Signals with Structural Information

Meng He, Jia Long Loh
2026/05/08 15:38已过 4 天

Large-scale online service platforms face severe challenges from organized platform abuse: multiple forms such as credit card fraud and promotion abuse continually emerge, characterized by large numbers of involved accounts, rapid outbreaks, and constantly shifting tactics. Existing mainstream approaches, whether heuristic rules limited in precision, supervised learning with insufficient generalization, or graph models that are engineering-heavy and dependent on seed users, have failed to address such threats effectively. This paper returns to first principles and, starting from the economic constraints of fraudulent behavior, proposes the Fraudster's Trilemma: organized attackers cannot simultaneously achieve scale, low cost, and dispersed cash-out. Building on this theory, we derive a robust structural invariant in organized fraud, namely centralized cash-out, and use a simple statistical method to turn low-precision individual weak signals into high-precision strong decisions. The method requires no labels, is nearly parameter-free, white-box interpretable, has linear complexity O(|E|), avoids cold-start issues, and its detection logic possesses the "open-hand" property: attackers cannot evade it even when fully informed. We validate the approach on two real fraud incidents in backtests. In the promotion abuse case, a single near-zero-cost weak signal (global Precision of only 16%) after structural amplification achieves Precision above 91% and Recall exceeding 99% (z=10.0); at a higher threshold (z=40.0), Precision reaches 93.7%. In the credit card fraud case, an infrastructure-layer weak signal (device spoofing) successfully detects payment-layer attacks without any business-logic linkage, revealing the framework's natural MO-agnostic property: it relies more on the structural invariant than on signal semantics.
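
One way to picture the amplification step is to aggregate a low-precision per-account flag over the shared cash-out destination and test its concentration. The sketch below uses a normal approximation to the binomial and an illustrative z threshold; the paper's exact statistic, thresholds, and signal definitions may differ.

```python
import math
from collections import defaultdict

def amplify_weak_signal(accounts, base_rate, z_threshold=10.0):
    """Flag cash-out destinations where a weak per-account signal is over-concentrated.

    `accounts` is an iterable of dicts with keys `cashout_id` and `weak_flag`
    (a low-precision boolean signal); `base_rate` is the global rate at which
    the weak signal fires on ordinary accounts."""
    totals, hits = defaultdict(int), defaultdict(int)
    for acct in accounts:
        totals[acct["cashout_id"]] += 1
        hits[acct["cashout_id"]] += int(acct["weak_flag"])
    flagged = {}
    for dest, n in totals.items():
        expected = n * base_rate
        std = math.sqrt(n * base_rate * (1.0 - base_rate)) or 1.0
        z = (hits[dest] - expected) / std
        if z >= z_threshold:
            flagged[dest] = z        # destinations whose accounts fire the signal far above chance
    return flagged
```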

arXiv cs.SETesting and AnalysisarXivSoftware testingcs.SEarXiv cs.SE

Low-code and no-code with BESSER to create and deploy smart web applications

Iván Alfonso, Armen Sulejmani, Aaron Conrardy, Jordi Cabot
2026/05/08 15:31已过 4 天

The increasing demand for web applications containing AI agents, seen as smart web applications, has prompted the need for new techniques to facilitate their creation. Low-code has risen as an approach that reduces the amount of handwritten code by focusing on the abstraction of components in the form of models combined with automated generators to produce applications. Existing low-code platforms are mostly commercial, leading to drawbacks such as the risk of vendor lock-in and limited extensibility. We present the open-source BESSER low-code framework, which allows users to design, generate and deploy their application via a freely accessible web-based editor, while guaranteeing transparency and extensibility.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.LGcs.AIcs.SEarXiv AI Coding 3Y

Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

Hugh Xuechen Liu, Kıvanç Tatar
2026/05/08 14:46已过 4 天

Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named `Mage') -- compile success, runtime success, structural fidelity, and mechanism adherence -- applied to 858 generation attempts across four open-weight LLMs (7B--30B), 26~hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C\# generation achieves the highest runtime-pass rate (43\% mean) yet produces structurally vacuous scenes (mechanism $F_1 \approx 0.12$). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ($F_1$ up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar $p = 1.0$), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

A Unified Open-Set Framework for Scalable PUF-Based Authentication of Heterogeneous IoT Devices

Xin Wang, Peichun Hua, Chip Hong Chang, Wenye Liu, Yue Zheng
2026/05/08 14:46已过 4 天

As modern cyber systems scale to include large populations of heterogeneous IoT devices, securing them against impersonation and forgery is a critical cybersecurity challenge. Physical Unclonable Functions (PUFs) offer a lightweight, hardware-rooted trust anchor for IoT security. However, different PUF architectures possess distinct challenge-response spaces and raw response reliabilities, making existing authentication protocols PUF-type specific. To bridge this interoperability bottleneck, this paper proposes a scalable, helper-data-free, open-set PUF authentication framework that leverages an OpenGAN-based classifier to manage heterogeneous fleets of IoT devices. Our method addresses the limitations of traditional database-centric and digital-twin modeling methods by encoding raw responses from diverse PUF types, including strong, weak and hybrid PUFs, into a unified image representation. This enables robust, single-pass classification and impostor rejection. We integrate the classifier into a generic protocol employing hybrid encryption and Bloom filter-based replay detection. Evaluated across four different types of noisy PUF data (Arbiter, SRAM, DRAM, and heterogeneous PUFs), our framework achieves 100% closed-set accuracy and near-zero open-set error rates with up to 45 devices, a significant improvement over the 3 to 5 devices in prior classification-based approaches. Prototyped on a Raspberry Pi, our framework completes one authentication cycle within 0.67 s, approximately 30x faster than the state-of-the-art open-set baselines.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Coding 3YTesting and AnalysisarXivDebugging and fault localizationcs.LGcs.SEarXiv AI Coding 3YarXiv cs.SE

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

Mengran Li, Bo Li, Jiaying Wang, Wenbin Xing, Yixuan Dong, Chengyang Zhang, Hongliang Zhang, Yuzhong Peng, Jinlin Wu, Bob Zhang, Bingo Wing-Kuen Ling, Fuji Yang, Zhen Lei, Jiebo Luo, Zelin Zang
2026/05/08 14:40已过 4 天

Virtual Cell Modeling (VCM) requires models that not only predict perturbation responses, but also support targeted revision when predictions fail. Current LLM-assisted modeling workflows face a refinement-routing problem: prediction discrepancies are observed through executable implementations, but the relevant revision may involve the modeling assumption, representation design, implementation, or task constraint. Without structured feedback propagation across these levels, iterative refinement may repair code while failing to revise the assumption responsible for the discrepancy. We propose CellScientist, a dual-space hierarchical framework that couples a high-level hypothesis space with a low-level executable implementation space. CellScientist represents modeling decisions as structured states, realizes them as admissible programs under task and interface constraints, and routes execution discrepancies back to targeted hypothesis or implementation updates. This enables a closed Hypothesis -> Implementation -> Hypothesis loop where failures become structured signals for model refinement rather than debugging events. Across morphology and transcriptomic benchmarks, with additional single-cell perturbation evaluations, the final executable models selected by CellScientist improve over reference baselines under fixed split and evaluation protocols, while the workflow produces auditable refinement traces.

arXiv cs.CRDependability and SecurityarXivReliabilityavailabilityand safetycs.CLcs.AIcs.CRcs.LG

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Sachin Kumar
2026/05/08 14:30已过 4 天

Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.

ACL 2026 OpenReviewDependability and SecurityACL 2026Reliabilityavailabilityand safetyACL 2026ACL 2026 Findings

LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence

Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou, Rui Mao, Jiapu Wang, Shirui Pan, Erik Cambria
2026/05/08 14:15已过 4 天

Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.

LexGenius: an expert-level benchmark for legal general intelligence in large language models
  • Proposes LexGenius, an expert-level Chinese benchmark for evaluating the legal general intelligence of large language models.
  • Builds a Dimension-Task-Ability framework covering 7 dimensions, 11 task types, and 20 abilities to systematically assess legal understanding, reasoning, and decision-making.
  • Constructs multiple-choice questions from recent legal cases and exam questions, combining manual review with LLM review to reduce the risk of data leakage.
  • Ensures data accuracy and reliability through multiple rounds of checks, and provides a systematic evaluation and in-depth analysis of 12 state-of-the-art large language models.
  • Experiments show significant disparities in legal intelligence abilities across models, with even the best models still lagging behind human legal professionals.

Method: the paper proposes LexGenius as a Chinese benchmark for legal general intelligence and organizes its evaluation content with a Dimension-Task-Ability framework. The data are adapted from recent legal cases and exam questions into multiple-choice questions, reviewed by both humans and large language models, and checked over multiple rounds to improve accuracy and reliability while reducing data-leakage risk.

ACL 2026 OpenReviewHuman and Social AspectsACL 2026Teamscommunitiesand companiesACL 2026ACL 2026 Findings

Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria
2026/05/08 14:11已过 4 天

Recently, various excellent and powerful large language models (LLMs) have been utilized to solve a wide range of human problems. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting their performance. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that utilizes a small-scale LLM (as \textit{agent}) to collaborate with large-scale LLMs (as \textit{environment}), replacing users to interact better. This collaboration is presented as a multi-turn interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A double-constrained reward is designed to optimize correctness and quality of generation. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experimental results on twelve datasets show that Prompt-R1 significantly outperforms baseline LLMs across various tasks. Our code is available at https://github.com/QwenQKing/Prompt-R1.

Prompt-R1: a collaborative automatic prompting framework via end-to-end reinforcement learning
  • Proposes Prompt-R1, an end-to-end reinforcement learning framework in which a small language model acts as the agent and collaborates with large language models to generate prompts and handle the interaction automatically.
  • Models the process as multi-turn collaboration in which the small model thinks and generates prompts while the large model performs complex reasoning, standing in for the user to design prompts more effectively.
  • Designs a double-constrained reward that jointly optimizes answer correctness and generation quality.
  • Provides a plug-and-play framework that supports both inference and training with a variety of large-scale LLMs.
  • Experiments on twelve datasets show that the method significantly outperforms baseline LLMs across a range of tasks.

Method: Prompt-R1 uses the small-scale language model as the agent and the large-scale language model as the environment; through multi-turn interaction the former generates prompts and the latter performs complex reasoning, and the whole pipeline is optimized end to end with reinforcement learning. The core training signal is a double-constrained reward that simultaneously improves answer correctness and generation quality.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

Chaitanya Vilas Garware, Sharif Noor Zisad
2026/05/08 14:03已过 4 天

LLM-based SOC log classifiers are commonly evaluated using regular-expression pipelines that extract structured fields from free-form model output. We demonstrate that this practice introduces a class of silent, systematic evaluation errors, which we term parsing-induced suppression; these errors can cause a fully functional model to appear completely non-functional. Using OpenSOC-AI, a LoRA fine-tuned TinyLlama-1.1B system for security log threat classification, as a reproducible case study, we show that a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy on the same model outputs and the same evaluation set, a gap of 76 percentage points attributable entirely to evaluation methodology. Severity accuracy remained constant at 58% under both parsers, providing a built-in control that isolates field-name format mismatch as the causal mechanism rather than model degradation. For external reference, Claude Sonnet, evaluated zero-shot on the same 50-example set, achieved 88% threat accuracy and 58% severity accuracy under the same fuzzy protocol. Residual errors under fuzzy evaluation concentrate in three categories (reconnaissance, brute force, and credential stuffing), each contributing 4 misclassifications, a pattern that reflects class-boundary difficulty among behaviorally adjacent log types rather than global model failure. We propose SOC-Bench v0, a benchmark framework comprising a standardized 13-category threat taxonomy, minimum statistical power requirements, a fuzzy field extraction specification, and a public scoring script intended to prevent parser-specific accuracy distortion in future SOC LLM research.
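
The failure mode is easy to reproduce in miniature: a strict parser that demands an exact field name scores a correct answer as missing, while a fuzzy parser recovers it. The field names, aliases, and example response below are assumptions for illustration, not the paper's parser or schema.

```python
import re
from difflib import get_close_matches

FIELD_ALIASES = ("threat_type", "threat", "threat type", "attack_type")  # illustrative

def strict_parse(output: str):
    """Mimics a brittle evaluation parser: only an exact field name and layout match."""
    m = re.search(r"^threat_type:\s*(\S+)\s*$", output, re.MULTILINE)
    return m.group(1) if m else None

def fuzzy_parse(output: str):
    """Tolerates field-name and formatting drift in the model's free-form output."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and get_close_matches(key.strip().lower(), FIELD_ALIASES, n=1, cutoff=0.8):
            return value.strip() or None
    return None

# Parsing-induced suppression in one line: the model is right, the strict parser scores it wrong.
response = "Threat Type: brute_force\nSeverity: high"
assert strict_parse(response) is None and fuzzy_parse(response) == "brute_force"
```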

arXiv AI Coding 3Y / arXiv Program Repair Core / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelsPrompt engineering for SEcs.AIarXiv AI Coding 3YarXiv Program Repair Core

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Jia Li, Yuxin Su, Ting Peng, Hailiang Huang, Yuetang Deng, Michael R. Lyu
2026/05/08 13:41已过 4 天

Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO's group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model's zero-shot $0.385$ to $0.535$. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from $0.48$ to $0.53$ and reduces average evaluation steps from $23.50$ to $17.02$. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
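
A minimal sketch of the two reshaped signals that are easiest to isolate: a three-tier outcome reward (fail < compile-only < compile-and-semantic) and the standard group-normalized advantage it feeds into. The tier values are illustrative, and the step-level process scores and rollout governance described in the abstract are omitted here.

```python
import numpy as np

def layered_reward(compiles: bool, semantics_pass: bool) -> float:
    """Three-tier outcome reward; the constants are assumptions, not the paper's values."""
    if not compiles:
        return 0.0
    return 1.0 if semantics_pass else 0.5

def group_normalized_advantages(rewards):
    """Standard GRPO advantage: normalize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four rollouts for one prompt: two fail to compile, one compiles only, one passes semantics.
group = [layered_reward(c, s) for c, s in [(False, False), (False, False), (True, False), (True, True)]]
print(group_normalized_advantages(group))   # the middle tier keeps the group rankable
```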

arXiv AI Coding 3Y / arXiv LLM4Code / Program Repair / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.CLcs.LGarXiv AI Coding 3YarXiv LLM4Code / Program Repair

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

Youngsik Yoon, Sungjae Lee, Seockbean Song, Siwei Wang, Wei Chen, Jungseul Ok
2026/05/08 13:09已过 4 天

Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69\%.
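
The adaptive policy itself is compact enough to sketch; `cheap_model`, `strong_planner`, and `verify` (for example, running unit tests) are placeholders for whatever models and checks a concrete setup uses, and the loop bound is an assumption.

```python
def planning_after_trial(problem, cheap_model, strong_planner, verify, max_rounds=4):
    """Adaptive test-time policy: try first, and only invoke the expensive planner on failure."""
    plan = None
    code = None
    for _ in range(max_rounds):
        code = cheap_model(problem, plan=plan)           # no planning overhead on easy problems
        passed, feedback = verify(code)
        if passed:
            return code
        plan = strong_planner(problem, code, feedback)   # targeted intervention after a failed trial
    return code
```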

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CReess.ASarXiv cs.CR

Asymmetric Phase Coding Audio Watermarking

Guang Yang, Amir Ghasemian, Ninareh Mehrabi, Homa Hosseinmardi
2026/05/08 12:54已过 4 天

The proliferation of deepfake audio challenges voice-based authentication systems; passive forensic detectors are sensitive to evolving generative models and to real-world channel distortions. We propose Asymmetric Phase Coding (APC), a training-free cryptographic signing layer for audio, designed as a compact and auditable provenance primitive that can stand alone or be stacked with learned watermarks. APC combines Ed25519 digital signatures (EdDSA, FIPS 186-5; 64-byte signatures) with Reed-Solomon error correction, pseudo-random STFT phase-bin selection, and a redundant quantization-index-modulation (QIM) code on log-magnitude differences of adjacent bin pairs, yielding a compact, non-repudiable, blind-extractable watermark. We evaluate APC on 1,000 LibriSpeech test-clean clips (10 s each, 44.1 kHz) under eight attack configurations -- identity, 10% end-cropping, 20% end-cropping, 8 kHz low-pass, 16 kHz round-trip resampling, FLAC re-encoding, MP3 at 128 kbps, and OGG-Vorbis at 128 kbps -- and achieve cryptographic verification rates between 97.5% and 98.3% on every condition at mean PESQ=3.02 and tens-of-milliseconds CPU latency. We explicitly compare APC against recent neural baselines (AudioSeal, WavMark, SilentCipher), detail the threat model (forgery resistance vs. erasure), characterize the dataset, define all metrics, quantify an adaptive white-box erasure attack, and release code, keys, and metadata for reproducibility.
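
Only the quantization-index-modulation piece lends itself to a few lines; the sketch below embeds one bit into a single log-magnitude difference with two interleaved lattices. The step size is illustrative, and the signing, Reed-Solomon coding, pseudo-random bin selection, and redundancy that make the full scheme robust are not shown.

```python
import numpy as np

def qim_embed(diff: float, bit: int, step: float = 1.0) -> float:
    """Embed one bit by snapping a log-magnitude difference onto one of two
    quantization lattices (the bit-1 lattice is offset by step/2)."""
    offset = 0.0 if bit == 0 else step / 2.0
    return float(np.round((diff - offset) / step) * step + offset)

def qim_extract(diff: float, step: float = 1.0) -> int:
    """Recover the bit by checking which lattice the observed difference is closer to."""
    d0 = abs(diff - qim_embed(diff, 0, step))
    d1 = abs(diff - qim_embed(diff, 1, step))
    return 0 if d0 <= d1 else 1

# Round-trip with mild channel noise; in the full scheme, redundancy and error
# correction are what let the payload survive cropping and lossy re-encoding.
noisy = qim_embed(3.37, bit=1) + 0.1
assert qim_extract(noisy) == 1
```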

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRstat.MLarXiv cs.CR

Modulated learning for private and distributed regression with just a single sample per client device

Praneeth Vepakomma, Amirhossein Reisizadeh, Samuel Horváth, Munther Dahleh
2026/05/08 12:36已过 4 天

This work focuses on the question of learning from a large number of devices with each device holding only a single sample of data. Several real-world applications fit this one-sample-per-client setup, including learning from fitness trackers, data/app usage aggregators, body-worn sensing devices, and daily event monitors, to name a few. When a client has only one sample, the standard federated learning paradigm breaks down, as a local update based on that single point is far from being useful, especially in the earlier rounds of estimating the model coefficients. This utility is further weakened by the privacy-inducing noise applied at every round. This work addresses this problem, enabling such clients to collaboratively contribute to learning an effective global model without leaking the privacy of their data. The proposed approach injects a single, carefully calibrated noisy perturbation to transform the sample at each client, followed by a post-processed representation which is shared with the server. These representations, aggregated at the server, are processed to obtain an unbiased gradient update that in expectation matches the non-private centralized gradient while preserving data privacy. This approach differs from traditional private federated learning, where the communication payloads involve model coefficients as opposed to privately transformed data samples. This method enables devices with extremely limited data to collaborate and learn accurate, privacy-preserving models without requiring large local datasets or sacrificing individual privacy.

arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.LGarXiv LLM4Code / Program Repair

Coupling Models for One-Step Discrete Generation

Fred Zhangzhi Peng, Avishek Joey Bose, Anru R. Zhang, Alexander Tong
2026/05/08 11:40已过 4 天

Generative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models, a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, a Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically, Coupling Models improve on the strongest one-step baselines in each domain: they reduce LM1B text-generation perplexity by 33% at the lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at https://github.com/pengzhangzhi/Coupling-Models.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

Zifan Qu, Vasileios P. Kemerlis, Giuseppe Ateniese, Evgenios M. Kornaropoulos
2026/05/08 10:46已过 5 天

Training wide neural networks on sensitive data in untrusted cloud environments requires simultaneously achieving computational efficiency and rigorous privacy guarantees. Sparsification techniques, essential for scalable training of wide layers, expose input-dependent memory-access patterns (i.e., leakage) that are visible and can be exploited by a host OS/hypervisor, even when computation is protected by a Trusted Execution Environment. We present TENNOR, a system that resolves this tension by co-designing the neural network training pipeline with doubly oblivious primitives, eliminating access-pattern leakage while also utilizing adaptive sparsification. TENNOR recasts sparse neuron activation as a locality-sensitive hashing (LSH) retrieval problem, reducing secure sparsification to doubly oblivious accesses over an LSH data structure. To eliminate the prohibitive storage cost of ``multi-table'' LSH, we introduce Multi-Probe Winner-Take-All (MP-WTA): the first multi-probe scheme for rank-based LSH, achieving a 50x reduction in (hash table) memory while preserving model accuracy. We evaluate TENNOR on extreme multi-label classification benchmarks with output layers of up to 325K neurons inside an Intel TDX Trusted Domain, achieving speedups of 13x--470x over a Path ORAM baseline and reducing a 208-hour run to about 26 minutes.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

Shenao Wang, Xinyi Hou, Zhao Liu, Yanjie Zhao, Xiao Cheng, Quanchen Zou, Xiangzheng Zhang, Haoyu Wang
2026/05/08 10:13已过 5 天

GitHub Actions is increasingly used to deploy LLM-based agents for repository-centric tasks such as issue triage, pull-request review, code modification, and release assistance. These agentic workflows extend traditional CI/CD automation with agentic capabilities but also create a new injection surface. In this paper, we introduce Agentic Workflow Injection (AWI), a workflow-level injection flaw where untrusted GitHub event context, such as issue bodies, pull-request descriptions, or comments, is incorporated into agent prompts or agent-consumed inputs and converted into attacker-influenced behavior through agent tools or downstream workflow logic. We identify two core AWI patterns: Prompt-to-Agent (P2A), where untrusted content reaches an agent prompt boundary, and Prompt-to-Script (P2S), where attacker influence propagates through model- or agent-derived outputs into later scripts. We present the first systematic study of AWI in GitHub Actions. We characterize 1,033 real-world AI-assisted actions and extract AWI-specific taint specifications, including prompt boundaries, derived outputs, agentic capabilities, and access-control interfaces. Based on these specifications, we design TaintAWI, a taint-analysis tool that tracks flows from untrusted event context to agent prompt inputs and security-sensitive workflow sinks. Applying TaintAWI to 13,392 real-world agentic workflows from 10,792 repositories, we report 519 potential AWI vulnerabilities, of which 496 are confirmed exploitable under our threat model, yielding a precision of 95.6%. Among them, 343 are previously unknown zero-day vulnerabilities. We prioritized disclosure for 187 zero-day cases, received 26 maintainer responses, and 24 cases have been accepted or fixed at the time of writing.
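
A crude lexical approximation of the P2A pattern (untrusted event context flowing into a prompt-like input) can be expressed as a line scan; this is not TaintAWI, which performs proper taint tracking over workflow specifications, and the prompt-like key names are assumptions. The GitHub expression syntax in the example is real.

```python
import re

UNTRUSTED_CONTEXT = re.compile(
    r"\$\{\{\s*github\.event\.(issue|pull_request|comment)\.(body|title)\s*\}\}"
)
PROMPT_LIKE_KEYS = ("prompt", "instructions", "user_input", "message")  # illustrative

def flag_p2a_candidates(workflow_yaml_text: str):
    """Return (line number, line) pairs where untrusted event context appears on a
    line that also looks like a prompt input. A lexical heuristic only: it misses
    indirect flows (P2S) and can flag safe uses."""
    findings = []
    for lineno, line in enumerate(workflow_yaml_text.splitlines(), start=1):
        if UNTRUSTED_CONTEXT.search(line) and any(k in line.lower() for k in PROMPT_LIKE_KEYS):
            findings.append((lineno, line.strip()))
    return findings

example = "        prompt: 'Triage this issue: ${{ github.event.issue.body }}'"
assert flag_p2a_candidates(example)
```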

arXiv AI Coding 3YAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelscs.LGcs.CLarXiv AI Coding 3Y

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

Zhanqi Zhang, Hua-Dong Xiong, Robert C. Wilson, Mikio Aoi, Marcelo G. Mattar, Li Ji-An
2026/05/08 10:04已过 5 天

Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / arXiv AI Agents for Software Engineering / arXiv AI Coding 3YAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsAI-enabled recommender systems for automated SECollaborative AI for SEcs.SEarXiv AI Coding 3YarXiv cs.SEarXiv LLM4Code / Program Repair

RepoZero: Can LLMs Generate a Code Repository from Scratch?

Zhaoxi Zhang, Yiming Xu, Weikang Li, Jiahui Liang, Yunfang Wu
2026/05/08 09:56已过 5 天

Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code-Test Evolution (ACE) framework that performs iterative test generation and error-driven refinement, enabling effective test-time scaling for repository-level synthesis. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30\% - 55\%), exposing a substantial gap between current capabilities and real-world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.
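
The black-box validation idea reduces to comparing outputs of the reference and regenerated repositories on the same probe calls. In the sketch below, the sandboxed runners and the exact-equality notion are simplifying assumptions; the benchmark's actual equivalence checks and probe construction are not reproduced here.

```python
import json

def behaviorally_equivalent(reference_run, candidate_run, api_calls):
    """Black-box check: the regenerated repository must reproduce the reference
    repository's outputs on every probe call. `reference_run` and `candidate_run`
    are hypothetical callables that execute one API call in a sandbox and return
    a JSON-serializable result."""
    for call in api_calls:
        ref = reference_run(call)
        out = candidate_run(call)
        if json.dumps(ref, sort_keys=True) != json.dumps(out, sort_keys=True):
            return False, call          # first diverging call, useful for error-driven refinement
    return True, None
```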

arXiv cs.SEDependability and SecurityarXivConfidentialityintegrityprivacycs.CLcs.SEarXiv cs.SE

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

Zejian Chen, Zhanyuan Liu, Chaozhuo Li, Mengxiang Han, Songyang Liu, Litian Zhang, Feng Gao, Yiming Hei, Xi Zhang
2026/05/08 09:38已过 5 天

Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, file systems, terminals, and tool backends. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement. To address this gap, the article develops an architecture-lifecycle framework for deployment-grounded reliability in CUAs. The architectural view analyzes Perception, Decision, and Execution as coupled layers that transform software observations into authority-bearing actions. The lifecycle view examines Creation, Deployment, Operation, and Maintenance as stages in which priors are learned, tools and permissions are bound, runtime trajectories are stressed, and assurance must be preserved under drift. Using this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study. The conclusion highlights open challenges in controllable grounding, long-horizon constraint preservation, safe authority binding, mixed-trust runtime defense, privacy-preserving memory, and continual assurance.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Membership Inference Attacks on Vision-Language-Action Models

Yuefeng Peng, Mingzhe Li, Kejing Xia, Renhao Zhang, Amir Houmansadr
2026/05/08 09:16已过 5 天

Membership inference attacks (MIAs) have been extensively studied in large language models (LLMs) and vision-language models (VLMs), yet their implications for vision-language-action (VLA) models remain largely unexplored. VLA models differ from standard LLMs and VLMs in several important ways: they are often fine-tuned for many epochs on relatively small embodied datasets, operate over constrained and structured action spaces, and expose action outputs that can be observed as executable behaviors and temporally correlated trajectories. These characteristics suggest a distinct and potentially more informative attack surface for membership inference. In this work, we present the first systematic study of MIAs against VLA systems. We formalize two membership inference settings for VLA models: sample-level inference over individual transition samples and trajectory-level inference over complete embodied demonstrations. We further develop a suite of attack methods under multiple access regimes, including strict black-box access. Our attacks exploit both classic MIA signals, such as token likelihood, and VLA-specific signals, such as observable action errors and temporal motion patterns. Across multiple VLA benchmarks and representative VLA models, these attacks achieve strong inference performance, showing that VLA models are highly vulnerable to membership inference. Notably, black-box attacks based only on generated actions achieve strong performance, highlighting a practical privacy risk for deployed embodied AI systems. Our findings reveal a previously underexplored privacy risk in robotic and embodied AI, and underscore the need for dedicated privacy evaluation and defenses for VLA models.
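A minimal sketch of the simplest black-box signal mentioned above: thresholding the observable action error of a generated trajectory against a reference demonstration. The array shapes, noise scales, and threshold are illustrative assumptions; the paper's attacks also exploit token likelihoods and temporal motion patterns.

```python
# Toy trajectory-level membership test based only on observable action error.
import numpy as np

def trajectory_action_error(pred_actions: np.ndarray, ref_actions: np.ndarray) -> float:
    """Mean L2 error between generated and reference actions over a demonstration."""
    return float(np.linalg.norm(pred_actions - ref_actions, axis=-1).mean())

def infer_membership(error: float, threshold: float) -> bool:
    # Lower error on a demonstration suggests the model was fine-tuned on it.
    return error < threshold

rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 7))                        # 50 steps, 7-DoF actions
member_pred = ref + rng.normal(scale=0.05, size=ref.shape)
nonmember_pred = ref + rng.normal(scale=0.5, size=ref.shape)

for name, pred in [("member-like", member_pred), ("non-member-like", nonmember_pred)]:
    err = trajectory_action_error(pred, ref)
    print(name, "error=%.3f" % err, "member?", infer_membership(err, threshold=0.6))
```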

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRstat.MLarXiv cs.CR

Less Random, More Private: What is the Optimal Subsampling Scheme for DP-SGD?

Andy Dong, Ayfer Özgür
2026/05/08 08:47已过 5 天

Poisson subsampling is the default sampling scheme in differentially private machine learning, largely because its unstructured randomness yields tractable privacy amplification analyses. Yet this same randomness introduces substantial participation variance: each sample appears in very different numbers of training iterations. In this work, we show that this variance is not merely a practical artifact to be tolerated, but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum ($σ\to 0$ and $σ\to \infty$). Our analysis reveals that the privacy-noise tradeoff is governed not by maximizing randomness, but by eliminating participation variance while preserving uniform marginal participation across iterations. To translate this asymptotic theory into finite-noise guarantees, we introduce a practical near-exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition-based PLD analyses. Evaluations across more than 60 practical DP-SGD configurations show that BIS consistently outperforms Poisson subsampling in the low-noise regimes most relevant for high-utility private training, reducing the required noise multiplier by up to $9.6\%$. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP-SGD, structured participation can be both more practical and more private. Our implementation is available at https://github.com/dong-xin-ao-andy/bis-mc-accountant.
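A toy comparison (not the paper's accountant) of per-sample participation variance under Poisson subsampling versus a balanced scheme in which every sample participates in exactly k of the T iterations; N, T, and q are arbitrary illustrative values.

```python
# Contrast participation variance: Poisson subsampling vs. a balanced scheme.
import numpy as np

rng = np.random.default_rng(0)
N, T, q = 1_000, 200, 0.05           # samples, iterations, sampling rate
k = int(T * q)                       # fixed participation count for the balanced scheme

# Poisson subsampling: each sample joins each iteration independently with prob q.
poisson_counts = rng.binomial(T, q, size=N)

# Balanced iteration subsampling (toy version): every sample participates in
# exactly k iterations.
balanced_counts = np.full(N, k)

print("Poisson  participation: mean %.2f, var %.2f" %
      (poisson_counts.mean(), poisson_counts.var()))
print("Balanced participation: mean %.2f, var %.2f" %
      (balanced_counts.mean(), balanced_counts.var()))
# The balanced scheme removes participation variance while keeping the same
# expected number of appearances per sample, the property the abstract
# identifies as the driver of stronger amplification.
```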

arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv Program Repair CoreAI for Software EngineeringarXivTrustworthy AI for SEAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.SEcs.AIarXiv cs.SEarXiv AI Agents for Software Engineering

From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

Marcus Emmanuel Barnes, Taher A. Ghaleb, Safwat Hassan
2026/05/08 08:15已过 5 天

AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human--agent coordination.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.LGarXiv cs.CR

Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE

Riyazuddin Mohammed, Lan Zhang
2026/05/08 07:24已过 5 天

Modern cybersecurity relies heavily on static machine-learning-based malware classifiers. However, transformations such as packing and other non-semantic modifications applied to executable files limit their reliability. Malware classifiers often learn these unnecessary artifacts rather than the true binary behavior because of the high association between maliciousness and packing. Moreover, these malware classifiers are black boxes, making it difficult to understand what they learn. To address this issue, we proposed a two-part framework using the post-hoc interpretability XAI tool TRUSTEE, followed by a manual analysis of the top features. We conducted several controlled experiments by varying the dataset composition ratios to understand their impact on the results. The top-ranked features across all experiments, identified by TRUSTEE, were predominantly packing artifacts, portable executable (PE) metadata, and n-grams at the string level, rather than malicious semantics. These results suggest that these malware classifiers are highly sensitive to dataset composition and can misinterpret packing as malicious behavior. Our proposed framework allows for the reproducible diagnosis of such biases and forms a guideline for building more robust and semantically meaningful malware detection models.

arXiv LLM4Code / Program RepairAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.LGarXiv LLM4Code / Program Repair

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu
2026/05/08 07:12已过 5 天

Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.
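An illustrative FIM sample with a hallucinated completion of the "invented API method" type; the field names are generic placeholders, not Delulu's actual schema.

```python
# Toy fill-in-the-middle sample: golden vs. hallucinated middle.
sample = {
    "language": "python",
    "prefix": "import json\n\ndef load_config(path):\n    with open(path) as f:\n        ",
    "suffix": "\n    return cfg\n",
    "golden_middle": "cfg = json.load(f)",
    "hallucinated_middle": "cfg = json.parse_file(f)",   # no such method in the json module
    "hallucination_type": "invented_api_method",
}
print(sample["prefix"] + sample["hallucinated_middle"] + sample["suffix"])
```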

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.OSarXiv cs.CR

Pomegranate: A Lightweight Compartmentalization Architecture using Virtualization Extensions

Shriram Raja, Zhiyuan Ruan, Richard West
2026/05/08 06:44已过 5 天

The monolithic nature of widely used commodity operating systems means that vulnerabilities in one software component potentially compromise the entire kernel. Formally verifying these systems, or redesigning them altogether as microkernels, according to the principle of least privilege, requires significant effort. Researchers have therefore considered compartmentalization techniques that minimize or totally avoid changes to existing systems. However, current approaches use techniques such as Memory Protection Keys (MPKs), necessitating extensive code analysis to ensure security, or use virtualization by instrumenting the kernel with calls to the glue code that switches compartments. In this work, we present Pomegranate, a framework that uses hardware-assisted virtualization to securely compartmentalize an existing system with minimal to no modifications to its source code. Allowed interactions between compartments are defined using an access-control policy and strictly enforced using Extended Page Tables. Using special sentry functions, Pomegranate is able to check all cross-compartment transitions without trapping into the hypervisor. We demonstrate the efficacy of Pomegranate on a compartmentalized Linux network stack using the igc NIC driver. Experiments show the overheads of our approach are negligible at MTU-sized packets when compartment boundaries are carefully established to avoid excessive inter-compartment communication.

arXiv cs.SE / arXiv LLM4Code / Program Repair / arXiv AI Agents for Software EngineeringAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelsAI-enabled recommender systems for automated SECollaborative AI for SEcs.SEcs.CLarXiv cs.SEarXiv LLM4Code / Program Repair

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

Ion George Dinu, Marian Cristian Mihăescu, Traian Rebedea
2026/05/08 06:33已过 5 天

Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $κ= 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.

arXiv cs.SEAI for Software EngineeringarXivAutomating SE tasks with LLM and foundation modelscs.ROcs.SEarXiv cs.SE

Traffic Scenario Orchestration from Language via Constraint Satisfaction

Frieda Rong, Chris Zhang, Kelvin Wong, Raquel Urtasun
2026/05/08 05:34已过 5 天

Autonomous vehicles (AVs) require extensive testing in simulation, but test case generation for driving scenarios is laborious. The desired scenarios are often out-of-distribution and have precise requirements on interactions with the AV policy under test. Manually programming scenarios allows for precise controllability but is difficult to scale. On the other hand, statistical models can leverage compute and data, but struggle with precise controllability when out-of-distribution. We cast scenario orchestration as a constraint-solving problem and present a language-in, simulation-out scenario orchestrator for closed-loop testing AVs. Our approach leverages foundation model reasoning to translate general, natural language descriptions into a set of constraints as a scenario representation. This then allows us to leverage off the shelf solvers to solve for actor behaviors which meet precise testing intentions in closed-loop. Under a benchmark of carefully crafted and diverse scenario descriptions, our approach greatly outperforms our baselines in orchestration success rate. We further show that our closed-loop approach is especially important for scenarios which require ego-reactive specifications.
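A toy sketch of the "language in, constraints out, solver in the loop" idea: a natural-language intent such as "a vehicle cuts in just ahead of the ego car" is assumed to have already been translated into constraints, and an off-the-shelf solver (here z3, as a stand-in) picks concrete actor parameters. All variables, bounds, and the kinematic constraint are hypothetical.

```python
# Constraint-solving sketch for a cut-in scenario (placeholder constraints).
from z3 import Real, Solver, And, sat

ego_speed = 15.0                      # m/s, known from the AV policy under test
cut_in_time = Real("t")
actor_speed = Real("v")
actor_start_gap = Real("gap")         # initial longitudinal offset from the ego (negative = behind)

s = Solver()
s.add(And(cut_in_time > 2, cut_in_time < 6))          # happens mid-scenario
s.add(actor_speed > ego_speed + 2)                     # actor is overtaking
s.add(actor_start_gap + actor_speed * cut_in_time ==
      ego_speed * cut_in_time + 8)                     # ends 8 m ahead of the ego
s.add(actor_start_gap < 0)                             # starts behind the ego

if s.check() == sat:
    print(s.model())                                   # concrete actor behavior to simulate
```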

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRcs.MAarXiv cs.CR

MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security

Sepideh Avizeh, Tushin Mallick, Alina Oprea, Cristina Nita-Rotaru, Reihaneh Safavi-Naini
2026/05/08 04:46已过 5 天

Our computing ecosystem is being transformed by two emerging paradigms: the increased deployment of agentic AI systems and advancements in quantum computing. With respect to agentic AI systems, one of the most critical problems is creating secure governing architectures that ensure agents follow their owners' communication and interaction policies and can be held accountable for the messages they exchange with other agents. With respect to quantum computing, existing systems must be retrofitted and new cryptographic mechanisms must be designed to ensure long-term security and quantum resistance. In fact, NIST recommends that standard public-key cryptographic algorithms, including RSA, Diffie-Hellman (DH), and elliptic-curve constructions (ECC), be deprecated starting in 2030 and disallowed after 2035. In this paper, we present MAGIQ, a framework for policy definition and enforcement in multi-agent AI systems using novel, highly efficient, quantum-resistant cryptographic protocols with proven security guarantees. MAGIQ (i) allows users to define rich communication and access-control policy budgets for agent-to-agent sessions and tasks, including global budgets for one-to-many agent sessions; (ii) enforces such policies using post-quantum cryptographic primitives; (iii) supports session-based enforcement of policies for agent-to-agent and one-to-many agent sessions; and (iv) provides accountability of agents to their users through message attribution. We formally model and prove the correctness and security of the system using the Universal Composability (UC) framework. We evaluate the computation and communication overhead of our framework and compare it with the state-of-the-art agentic AI framework SAGA. MAGIQ is a first step toward post-quantum-secure solutions for agentic AI systems.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.NIarXiv cs.CR

Aquaman: A Transparent Proxy Architecture for Quantum Resilient Key Establishment

Tushin Mallick, Ashish Kundu, Ramana Kompella
2026/05/08 04:45已过 5 天

The harvest-now, decrypt-later (HNDL) threat--adversaries intercepting and archiving ciphertext today for retrospective decryption once quantum computers mature--turns the future quantum threat into a present liability for the public-key primitives (RSA, Diffie-Hellman, ECC) that anchor modern session-key exchange. We present Aquaman, a transparent-proxy architecture for quantum-resilient session-key establishment. A transparent proxy intercepts session-key requests at the edge of a trusted network without requiring client-side configuration, deploying quantum-resistant capability at the network boundary on behalf of clients that may themselves lack post-quantum cryptography (PQC). Aquaman supports four operating modes: PQC offloaded to the proxy for clients without trusted PQC stacks; classical multi-path key fragmentation over heterogeneous media (with an optional anonymous proxy-pool variant); QKD with the SKIP/ETSI GS QKD 014 key-delivery interface; and classical/PQC hybrid handshakes. We implement and evaluate the first two modes; the latter two are well-trodden in the PQC literature and we discuss but do not implement them. The implemented multi-path mode splits the session key into ciphertext fragments distributed across diverse media (Wi-Fi, Bluetooth, NFC, cellular, Ethernet); reconstruction requires all fragments. We formalize the security argument and prove that recovery probability decays as (B/d)^n in the diversity dimension. A 1,000-run prototype evaluation on AWS EC2 shows that latency is dominated by network transmission, not by multi-path overhead.
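A toy sketch of all-or-nothing key fragmentation across n paths using XOR secret sharing, which has the property the abstract relies on (every fragment is required for reconstruction); the fragment encoding, media bindings, and the B/d value are illustrative assumptions, not Aquaman's actual protocol.

```python
# All-or-nothing session-key splitting across n heterogeneous paths (toy version).
import secrets
from functools import reduce

def split_key(key: bytes, n: int) -> list[bytes]:
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    last = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares, key)
    return shares + [last]

def recombine(shares: list[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares)

key = secrets.token_bytes(32)
frags = split_key(key, n=4)            # e.g. Wi-Fi, Bluetooth, cellular, Ethernet
assert recombine(frags) == key         # all fragments present: key recovered
assert recombine(frags[:-1]) != key    # any missing fragment: no recovery

# Recovery-probability scaling quoted in the abstract, (B/d)^n, with placeholder values.
B_over_d, n = 0.3, 4
print("adversary recovery probability ~", B_over_d ** n)
```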

ICML 2026 OpenReviewAI for Software EngineeringICML 2026Automating SE tasks with LLM and foundation modelsTrustworthy AI for SEICML 2026

Escaping Whack-a-Mole: Code Documentation Optimization via Dependency-Guided Bi-level Search

Yutong Cheng, Haifeng Chen, Wenchao Yu, Xujiang Zhao, Peng Gao, Wei Cheng
2026/05/08 04:21已过 5 天

As large language models increasingly serve as autonomous coding agents, code documentation must be optimized for agent comprehension rather than human readability. We frame agent-oriented documentation generation as a black-box optimization problem over the documentation space, where quality is measured solely by downstream code correctness. A central challenge for conventional LLM refinement methods is *output coupling*—program entities are interdependent, and refining the documentation of one entity can invalidate its callers, resulting in a persistent *whack-a-mole* phenomenon during inference-time scaling. We propose DocSearch, a dependency-guided bi-level search framework that systematically exploits test-time feedback. The outer level conducts a priority search over the program-entity dependency DAG, enforcing a callee-before-caller refinement order to prevent downstream interference. The inner level performs a beam search over documentation refinements, using diversified error message sampling from self-generated unit tests to better exploit diagnostic signals and escape local optima. We provide theoretical guarantees of monotonic progress, showing that our worthy condition prevents regression while enabling efficient exploration. On DevEval+, DocSearch achieves a 90.7% solve rate with GPT-4o, outperforming the strongest baseline by 32.6%. Cross-language experiments further demonstrate that optimized documentation transfers effectively to different target programming languages.

Escaping Whack-a-Mole: Code Documentation Optimization via Dependency-Guided Bi-level Search
  • Frames agent-oriented code documentation generation as a black-box optimization problem over the documentation space, with downstream code correctness as the sole quality measure.
  • Proposes the DocSearch framework, which runs a priority search over the program-entity dependency DAG in callee-before-caller order to mitigate the repeated invalidation caused by output coupling.
  • In the inner search, combines beam search over documentation rewrites with diversified sampling of error messages from self-generated unit tests, making fuller use of test-time feedback and helping escape local optima.
  • Provides a theoretical guarantee of monotonic progress, showing that its worthy condition supports efficient exploration while preventing regression.
  • Validates the approach on DevEval+ and in cross-language settings, showing that optimized documentation not only substantially improves solve rates but also transfers to different target programming languages.

Method: The approach casts documentation optimization as a dependency-guided bi-level search: the outer level runs a priority search over the program-entity dependency DAG, refining iteratively in callee-before-caller order to avoid downstream interference; the inner level performs beam search over documentation rewrites and uses diversified sampling of error messages from self-generated unit tests, so that test-time feedback guides the search.
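A minimal sketch, assuming a toy call graph, of the callee-before-caller outer ordering described above; the inner beam search and error-message sampling are reduced to a placeholder refine_docs step, and all entity names are hypothetical.

```python
# Dependency-guided outer ordering: refine callees before their callers.
from graphlib import TopologicalSorter

# edges: caller -> set of callees it depends on
call_graph = {
    "app.handler": {"db.query", "utils.parse"},
    "db.query": {"utils.parse"},
    "utils.parse": set(),
}

def refine_docs(entity: str, current_doc: str) -> str:
    # Placeholder for the inner search: in the real framework this would be a
    # beam search over rewrites scored by self-generated unit tests.
    return current_doc + f"  # refined for {entity}"

docs = {e: f"doc({e})" for e in call_graph}
# TopologicalSorter yields nodes whose listed dependencies (callees) come first,
# which is exactly the callee-before-caller refinement order.
for entity in TopologicalSorter(call_graph).static_order():
    docs[entity] = refine_docs(entity, docs[entity])
    print("refined:", entity)
```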

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

Jaime Morales, Sergio Pastrana, Juan Tapiador
2026/05/08 04:18已过 5 天

Software obfuscation and encryption present persistent challenges for program comprehension and security analysis, particularly when adversaries conceal Indicators of Compromise (IoCs) such as IP addresses within source code. While Large Language Models (LLMs) have recently demonstrated remarkable progress in code reasoning and transformation, their resilience against adversarial concealment techniques remains largely uncharted. This paper introduces a systematic benchmark for secret detection under adversarial code transformations, designed to evaluate the capacity of LLMs to recover IoCs embedded in obfuscated and encrypted JavaScript programs. We construct a dataset of 336 programs, progressively transformed through 12 levels of obfuscation and cryptographic concealment (including XOR and AES-256), to emulate realistic threat scenarios. An automated evaluation framework standardizes LLM queries and responses, enabling reproducible, large-scale testing across diverse models. Our results reveal a dichotomy: while LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, encryption-based concealment severely degrades detection performance. These findings establish encryption as a critical frontier for LLM-driven code analysis and highlight both current limitations and avenues for advancing automated threat intelligence.
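A toy illustration of the kind of lightweight concealment the benchmark probes (Base64) versus key-based concealment (XOR); the snippets and key are made up, and the dataset's twelve transformation levels are not reproduced here.

```python
# Generate two concealed variants of the same IoC (a documentation-range IP).
import base64

ioc = b"203.0.113.7"

b64_variant = base64.b64encode(ioc).decode()
xor_key = 0x5A
xor_variant = bytes(b ^ xor_key for b in ioc).hex()

print("JS with Base64 IoC :", f'var h = atob("{b64_variant}");')
print("JS with XOR IoC    :", f'var h = "{xor_variant}"; // decoding needs key 0x5A')
# The first is invertible without any secret; the second only with the key,
# mirroring the detection gap the abstract reports between encoding and encryption.
```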

arXiv AI Coding 3Y / arXiv LLM4Code / Program Repair / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelscs.CLcs.AIcs.HCcs.MM

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis, Yohan Jo
2026/05/08 03:57已过 5 天

The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.LGarXiv cs.CR

McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware

Md Mahmuduzzaman Kamol, Jesus Lopez, Saeefa Rubaiyet Nowmi, Emilia Rivas, Md Ahsanul Haque, Edward Raff, Aritran Piplai, Mohammad Saidur Rahman
2026/05/08 03:53已过 5 天

Machine learning (ML) in real-world systems must contend with concept drift, adversarial actors, and a spectrum of potential features with varying costs and benefits. Malware naturally exhibits all of these complexities, but for the same reason, it is challenging to curate and organize data to study these factors. We present McNdroid, to our knowledge the largest longitudinal multimodal Android malware benchmark for malware detection and drift analysis. McNdroid spans 2013--2025, excluding 2015, and represents each application with three aligned modalities--static features from manifests and smali code, dynamic behavioral features from sandbox execution, and graph-based features from function-call graphs. Using temporally separated splits, we evaluate standard ML and deep-learning detectors across increasing train--test time gaps. Results show clear temporal degradation, while multimodal fusion outperforms the best single modality across long-term temporal gaps. Cross-modal agreement also declines over time, suggesting that drift affects both individual feature spaces and the consistency among modalities. We further analyze modality-specific drift, malware-family evolution, and temporal changes in model explanations. We publicly release McNdroid, benchmark splits, and code to support reproducible research on temporal generalization and robust multimodal learning in security-critical, non-stationary settings.

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securitycs.CRcs.NIarXiv cs.CR

Zombies in Alternate Realities: The Afterlife of Domain Names in DNS Integrations

Sulyab Thottungal Valapu, John Heidemann, Mattijs Jonker, Raffaele Sommese
2026/05/08 03:30已过 5 天

DNS integrations leverage the discovery, trust, and uniqueness of the global Domain Name System with a linkage to another naming ecosystem, so the DNS name can help identify resources such as a cryptocurrency wallet or software component. While DNS ownership is verified at linkage creation, many ecosystems do not track subsequent DNS changes. The result is zombie linkages, where the DNS ownership has expired or changed, but the mapping to the linked resource persists. We define a threat model for DNS integrations, identifying five classes of attacks that leverage or exploit zombie linkages. We measure zombie occurrence across three DNS integrations -- Web PKI; ENS, a blockchain naming system; and Maven Central, a Java software repository. We show that zombies exist in every ecosystem, but at very different fractions -- zombies make up roughly 3% of TLS certificates for new domains, 24% of ENS on-chain imports, and 15% of Maven Central namespaces. We evaluate how integration design choices affect outcomes, with validate-once integrations (ENS on-chain, Maven Central) accumulating long-lasting zombies, linkages with expiration (Web PKI) limiting damage, while integrations that validate on every use (ENS gasless) are zombie-free by design. We look for specific attacks, finding attacks actively available for exploitation in both Web PKI and Maven Central. Finally, we recommend steps to reduce zombie occurrence.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

The Cost of Quantum Resistance: A Hash-Based Commit-Reveal Alternative for Minimizing Blockchain Infrastructure Overhead

Keir Finlow-Bates, Markus Jakobsson, Hossein Siadati
2026/05/08 02:53已过 5 天

The transition to post-quantum cryptography in blockchain systems such as Bitcoin and Ethereum is often framed as a purely cryptographic problem. In practice, it also presents significant economic and infrastructural challenges: in globally replicated networks, increases in transaction size and verification cost are multiplied across all participating nodes. Existing post-quantum signature schemes, including lattice-based constructions such as CRYSTALS-Dilithium and stateless hash-based schemes such as SPHINCS+, introduce substantial increases in signature size. At blockchain scale, these increases translate into higher storage, bandwidth, and validation requirements, potentially requiring multiple generations of hardware improvement to become operationally routine. Historical experience suggests that even moderate increases in data footprint can be contentious, as illustrated by the Bitcoin block size debates (2015--2017). We propose a hash-based commit--reveal construction that replaces a single signature-bearing transaction with two lightweight transactions, each containing a fixed-size (32-byte) hash output derived from well-established primitives such as SHA-256, BLAKE, or Keccak. This approach achieves post-quantum security under standard hash assumptions while increasing the effective transaction footprint by only approximately 1.5$\times$ to 2$\times$ per authorization event. These results indicate that practical post-quantum migration may benefit from rethinking transaction semantics rather than directly adopting larger signature schemes, and that viable designs for decentralized systems must account for system-wide cost amplification.
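A toy sketch of the general commit-reveal idea with SHA-256 (a 32-byte commitment published first, then a reveal that lets verifiers recompute it); the transaction format, chain integration, and exact reveal contents of the proposed construction are not reproduced here.

```python
# Hash-based commit-reveal authorization under standard hash assumptions (toy version).
import hashlib
import secrets

def commit(spend_request: bytes) -> tuple[bytes, bytes]:
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + spend_request).digest()   # 32-byte commitment
    return digest, nonce

def verify_reveal(digest: bytes, nonce: bytes, spend_request: bytes) -> bool:
    return hashlib.sha256(nonce + spend_request).digest() == digest

request = b"send 1.0 coin to addr_xyz"
c, nonce = commit(request)                 # step 1: publish the 32-byte hash
# ... later, after the commitment is confirmed ...
assert verify_reveal(c, nonce, request)    # step 2: reveal nonce + request for verification
print("commitment verified")
```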

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRcs.AIarXiv cs.CR

Narrow Secret Loyalty Dodges Black-Box Audits

Alfie Lamerton, Fabien Roger
2026/05/08 02:48已过 5 天

Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines. Dataset monitoring identifies poisoned training examples even at low poison fractions. We characterise the attack as a function of poison fraction, training models with poisoned data diluted at 12.5%, 6.25%, and 3.125%. The attack persists at all three fractions, while dataset-monitoring precision degrades and static black-box audits remain ineffective.

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRcs.AIcs.NIarXiv cs.CR

PAMPOS: Causal Transformer-based Trajectory Prediction for Attack-Agnostic Misbehavior Detection in V2X Networks

Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Panos Papadimitratos
2026/05/08 02:38已过 5 天

Misbehavior detection in Vehicle-to-Everything (V2X) networks is a second line of defense against insider falsification attacks that cryptographic mechanisms alone cannot address. Existing learning-based Misbehavior Detection Schemes (MDSs) are supervised, requiring labeled attack samples at training time, thus failing to counter unseen falsification attacks. We present PAMPOS, a causal transformer-decoder trained on benign VeReMi++ trajectories to learn normal mobility patterns. At inference time, misbehavior is identified as a deviation from the model's next-step kinematic predictions using a top-K normalized anomaly scoring mechanism that localizes falsification to specific kinematic features, without requiring attack-labeled training data. We evaluate PAMPOS across all 19 attack types in VeReMi++ under rush-hour and afternoon scenarios, achieving Area Under the Curve (AUC) values of up to 0.98 and F1-scores of up to 0.95 for most attack categories.
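A sketch of the top-K normalized anomaly scoring described above; the causal transformer that produces next-step predictions is not shown, and the feature set, normalization statistics, and K are illustrative assumptions.

```python
# Score one beacon by comparing predicted vs. observed next-step kinematics.
import numpy as np

def topk_anomaly_score(pred: np.ndarray, obs: np.ndarray,
                       feat_std: np.ndarray, k: int = 3):
    """pred/obs: per-feature next-step kinematics; feat_std: benign residual scale."""
    normalized = np.abs(pred - obs) / feat_std        # per-feature deviation
    top_idx = np.argsort(normalized)[-k:]             # features driving the anomaly
    return normalized[top_idx].mean(), top_idx

feat_std = np.array([0.5, 0.5, 0.2, 0.2, 1.0])        # e.g. x, y, vx, vy, heading
pred = np.array([100.0, 50.0, 10.0, 0.0, 90.0])
obs = np.array([130.0, 50.2, 10.1, 0.1, 90.5])        # falsified x position
score, idx = topk_anomaly_score(pred, obs, feat_std)
print("anomaly score %.2f, driven by features %s" % (score, idx))
```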

arXiv cs.SEHuman and Social AspectsarXivTeamscommunitiesand companiescs.SEarXiv cs.SE

Guidelines for Cultivating a Sense of Belonging to Reduce Developer Burnout

Bianca Trinkenreich, Marco Aurelio Gerosa, Anita Sarma, Igor Steinmacher
2026/05/08 02:28已过 5 天

Burnout affects software developers' mental and physical well-being and contributes to turnover, generating strong concerns in the software industry. Prior research has shown that lack of belonging is associated with higher levels of burnout among software developers, while a sense of belonging is linked to resilience, job satisfaction, engagement, and well-being. In this paper, we revisit recent studies on belongingness in software development teams, including proprietary software organizations and open-source software communities, to offer evidence-based guidelines for cultivating belongingness and reducing developer burnout. We summarize characteristics of belongingness, such as trust, acceptance, value recognition, friendship, membership, mutual support, and being known by others, as well as factors associated with belongingness, including recognition, psychological safety, intrinsic motivation, English confidence, tenure, gender, and cultural power distance. Based on these findings, we propose practical guidelines for leaders and communities, including timely and consistent recognition, transparent promotion rules, inclusive benefits and initiatives, intentional connections through collaborative tools, blameless postmortems, optional in-person opportunities, informal newcomer gatherings, and continuous monitoring of belongingness and burnout. These guidelines can help software organizations and open-source communities foster healthier, more inclusive environments that support developer well-being.

arXiv cs.SEEvolutionarXivEvolution and maintenanceRefactoring and program differencingcs.SEarXiv cs.SE

Analyzing the Adoption of Database Management Systems Throughout the History of Open Source Projects

Camila A. Paiva, Raquel Maximino, Frederico Paiva, Rafael Accetta Vieira, Nicole Espanha, João Felipe Pimentel, Igor Wiese, Marco Aurélio Gerosa, Igor Steinmacher, Leonardo Murta, Vanessa Braganholo
2026/05/08 02:17已过 5 天

Database Management Systems (DBMSs) are widely used to store, retrieve, and manage the data handled by modern applications. Although prior work has studied the co-evolution of DBMSs and application source code, less is known about DBMS adoption, co-use, and replacement in real systems. This paper presents a historical study of DBMS usage in 362 popular open-source Java projects hosted on GitHub. We investigated the adoption of the top DBMSs ranked by DB-Engines, covering relational and non-relational systems. Using source-code heuristics, we analyzed DBMS popularity, stability, migration patterns, co-occurrence, and the role of Object-Relational Mappers (ORMs). Our findings show that MySQL and PostgreSQL are the most popular DBMSs in our corpus. Among non-relational DBMSs, Redis and MongoDB are the most frequently used and tend to remain stable after adoption. In contrast, systems such as HyperSQL are more often replaced as projects evolve. We also observed frequent co-use of multiple DBMSs, suggesting patterns of polyglot persistence in which projects combine systems to handle different data needs. Finally, we found that ORM frameworks are commonly used to mediate interactions between applications and DBMSs. Overall, our study provides empirical evidence on how DBMSs are adopted, combined, and replaced over time, offering guidance for developers, architects, educators, and DBMS vendors.

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securityConfidentialityintegritycs.CRcs.AIarXiv cs.CR

Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

Isaac David, Arthur Gervais
2026/05/08 01:22已过 5 天

Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.LGarXiv cs.CR

FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning

Su Zhang, Junfeng Guo, Heng Huang
2026/05/08 01:21已过 5 天

Watermark radioactivity testing type of methods can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing works have proved their effectiveness in centralized LLM fine-tuning. However, this type of method faces several challenges and remains underexplored in federated learning (FL), a widely-applied paradigm for fine-tuning LLMs collaboratively on private data across different users. FL mainly ensures privacy through secure aggregation (SA), which allows the server to aggregate updates while keeping clients' updates private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client's update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer method. We theoretically show that FedAttr produces an unbiased estimator of each client's update with bounded mutual information leakage (i.e., $O(d^*/N)$ per-round update). Moreover, FedAttr empirically achieves 100% TPR and 0% FPR, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
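An illustrative sketch of step (iii) above, combining per-round watermark-detector z-scores for one client with the Stouffer method; the paired-subset differencing and the watermark detector itself are not reproduced, and the scores are hypothetical.

```python
# Combine independent per-round detection z-scores into one decision statistic.
import math
from statistics import NormalDist

def stouffer_combine(z_scores):
    """Stouffer's method: combined z and a one-sided p-value."""
    z = sum(z_scores) / math.sqrt(len(z_scores))
    p = 1.0 - NormalDist().cdf(z)
    return z, p

# Hypothetical per-round watermark scores for a suspected client.
rounds = [1.1, 0.8, 1.6, 0.9, 1.4]
z, p = stouffer_combine(rounds)
print(f"combined z = {z:.2f}, one-sided p = {p:.4f}")
```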

arXiv cs.CRDependability and SecurityarXivVulnerability detection and software securitycs.CRarXiv cs.CR

Language Models Can Autonomously Hack and Self-Replicate

Alena Air, Reworr, Nikolaj Kotov, Dmitrii Volkov, John Steidley, Jeffrey Ladish
2026/05/08 01:09已过 5 天

We demonstrate that language models can autonomously replicate their weights and harness across a network by exploiting vulnerable hosts. The agent independently finds and exploits a web-application vulnerability, extracts credentials, and deploys an inference server with a copy of its harness and prompt on the compromised host. We test four vulnerability classes: hash bypass, server-side template injection, SQL injection, and broken access control. Qwen3.5-122B-A10B succeeds in 6-19% of attempts, and the smaller Qwen3.6-27B reaches 33% on a single A100. This already matches the current-generation GPT-5.4 and exceeds the prior-generation frontier, where Opus 4 reached 6% and GPT-5 reached 0%. Replicating Qwen weights, frontier models reach 81% (Opus 4.6) and 33% (GPT-5.4). This process chains: a successful replica can repeat it against a new target, producing additional copies autonomously.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.CRcs.DCcs.NI

CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification

Iason Ofeidis, Nikos Papadis, Randeep Bhatia, Leandros Tassiulas, TV Lakshman
2026/05/08 01:01已过 5 天

The rapid expansion of the Internet of Things (IoT) and Industrial IoT (IIoT) has created a massive, heterogeneous attack surface that challenges traditional network security mechanisms. While Federated Learning (FL) offers a privacy-preserving alternative to centralized Intrusion Detection Systems (IDS), standard approaches struggle to generalize across diverse device behaviors and typically fail to utilize the vast amounts of unlabeled data present in realistic edge environments. To bridge these gaps, we propose CLAD, a holistic framework that seamlessly incorporates Clustered Federated Learning (CFL) with a novel Dual-Mode Micro-Architecture ($\text{DM}^2\text{A}$). This unified approach simultaneously tackles the two primary bottlenecks of IoT security: device heterogeneity and label scarcity. The $\text{DM}^2\text{A}$ component features a shared encoder followed by two branches, enabling joint unsupervised anomaly detection and supervised attack classification; this allows the framework to harvest intelligence from both labeled and unlabeled clients. Concurrently, the clustering component dynamically groups devices with congruent traffic patterns, preventing global model divergence. By carefully combining these elements, CLAD ensures that no data is discarded and distinct operational patterns are preserved. Extensive evaluations demonstrate that this integrated approach significantly outperforms state-of-the-art baselines, achieving a 30% relative improvement in detection performance in scenarios with 80% unlabeled clients, with only half the communication cost.
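A hedged sketch (requiring PyTorch) of the shared-encoder, two-branch idea behind the Dual-Mode Micro-Architecture: an unsupervised reconstruction branch for anomaly detection and a supervised branch for attack classification. Layer sizes, feature counts, and the loss combination are placeholders, not CLAD's actual design.

```python
# Shared encoder feeding a reconstruction branch and a classification branch.
import torch
import torch.nn as nn

class DualModeMicroArch(nn.Module):
    def __init__(self, n_features: int, n_attack_classes: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)            # anomaly-detection branch
        self.classifier = nn.Linear(hidden, n_attack_classes)   # attack-classification branch

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = DualModeMicroArch(n_features=40, n_attack_classes=10)
x = torch.randn(8, 40)
recon, logits = model(x)
# Unlabeled clients train only the reconstruction loss; labeled clients train both.
loss_unlabeled = nn.functional.mse_loss(recon, x)
loss_labeled = loss_unlabeled + nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
print(loss_unlabeled.item(), loss_labeled.item())
```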

arXiv AI Agents for Software EngineeringAI for Software EngineeringarXivCollaborative AI for SEEfficacy measurement beyond traditional metricsTrustworthy AI for SEcs.GTcs.LGcs.MAstat.ME

Optimizing Social Utility in Sequential Experiments

Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez
2026/05/08 00:28已过 5 天

Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product's efficacy, ultimately stifling the development of `moonshot' products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experimentation where the product developer (the agent) conducts a randomized controlled trial sequentially and the regulator (the principal) partially subsidizes its cost. By modeling the protocol using a belief Markov decision process, we show that the agent's optimal strategy can be found efficiently using dynamic programming. Further, we show that the social utility is a piecewise linear and convex function over the subsidy level the principal selects, and thus the socially optimal subsidy can also be found efficiently using divide-and-conquer. Simulation experiments using publicly available data on antibiotic development and approval demonstrate that our statistical protocol can be used to increase social utility by more than $35$$\%$ relative to standard, non-sequential protocols.

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRcs.AIarXiv cs.CR

On the Security of Research Artifacts

Nanda Rani, Christian Rossow
2026/05/08 00:21已过 5 天

Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues and find that many contain insecure code patterns that may introduce potential attack vectors. We propose a taxonomy for context-aware security assessment to enable structured analysis of such risks. We perform static analysis and examine the resulting findings, filtering false positives and identifying real security risks. Our analysis shows that 41.60% of the prevalent findings may pose security concerns under practical usage. To support scalable analysis, we introduce SAFE (Security-Aware Framework for Artifact Evaluation), a first step toward an autonomous framework that analyzes tool-reported findings by considering code semantics, execution context, and practical exploitability. SAFE achieves 84.80% accuracy and 84.63% F1-score in distinguishing security and non-security risks. Overall, our results show that security is also important in AE for promoting safe and responsible research sharing. The source code is available at: https://github.com/nanda-rani/SAFE

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.LGcs.AIcs.CRarXiv cs.CR

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas
2026/05/08 00:20已过 5 天

We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
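A sketch of the unanimity idea described above: sign-quantize subset-aggregated zeroth-order gradient estimates and release the shared sign only where every candidate subset agrees, substituting an independent fair coin elsewhere (the ZPL variant). The subset construction, shapes, and distributions are illustrative assumptions.

```python
# Release per-coordinate signs that are unanimous across candidate subsets.
import numpy as np

rng = np.random.default_rng(0)

def released_sign(subset_grads: np.ndarray) -> np.ndarray:
    """subset_grads: (num_candidate_subsets, dim) zeroth-order gradient estimates."""
    signs = np.sign(subset_grads)
    unanimous = np.all(signs == signs[0], axis=0)      # per-coordinate agreement
    coin = rng.choice([-1.0, 1.0], size=subset_grads.shape[1])
    # Unanimous coordinates reveal nothing about which subset is the secret;
    # disagreeing coordinates are replaced by an independent fair coin.
    return np.where(unanimous, signs[0], coin)

grads = rng.normal(loc=0.4, scale=0.3, size=(8, 16))    # 8 candidate subsets, dim 16
print(released_sign(grads))
```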

arXiv cs.CRDependability and SecurityarXivConfidentialityintegrityprivacycs.CRarXiv cs.CR

Privacy by Postprocessing the Discrete Laplace Mechanism

Quentin Hillebrand, Jacob Imola, Rasmus Pagh, Sia Sejer
2026/05/08 00:19已过 5 天

We show that an "old dog", the classical discrete Laplace (aka.~geometric) mechanism, can "perform new tricks": 1. It can be post-processed to yield a simple, unbiased estimator of any subexponential function $f$ of the original data, giving a simple, discrete, multivariate version of the recent unbiasing result for the Laplace mechanism by Calmon et al. (FORC '25). 2. It can be post-processed to output the same distribution as the Laplace mechanism or the Staircase mechanism with identical privacy parameters. Thus, the discrete Laplace mechanism is a versatile mechanism that should be preferred over the Laplace and Staircase mechanisms whenever the data is discrete (or can be made discrete while controlling $\ell_1$-sensitivity). We show bounds on the variance of our estimator, compared to the mean square error of the biased estimator that simply evaluates the $f$ on the output of the mechanism. Though our unbiased estimator has exponential running time for worst-case functions, we show that it can often be computed in linear or polynomial time for some common functions exhibiting structure. We showcase the properties of our methods empirically with several use cases including profile and entropy estimation, as well as distributed/federated data analysis applications in which unbiasedness is key to accuracy.

arXiv cs.CRDependability and SecurityarXivFormal methods and model checkingcs.CRarXiv cs.CR

Autonomous Adversary: Red-Teaming in the age of LLM

Mohammad Mamun, Mohamed Gaber, Scott Buffett, Sherif Saad
2026/05/08 00:07已过 5 天

Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary emulation, and the orchestration of multi-step activity such as lateral movement, a core enabling capability of advanced persistent threat (APT) campaigns. Using frameworks such as MITRE ATT&CK, we analyze where these agents intersect with core offensive functions and assess current strengths and limitations of LMAs with an emphasis on governance and realistic evaluation. We benchmark LMAs across two lateral-movement scenarios in a controlled adversary-emulation environment, where LMAs interact with instrumented cyber agents, observe execution artifacts, and iteratively adapt based on environmental feedback. Each scenario is formalized as an ordered task chain with explicit validation predicates, leveraging an LLM-as-a-Judge paradigm to ensure deterministic outcome verification. We compare three operational modalities: fully autonomous execution, self-scaffolded planning, and expert-defined action plans. Preliminary findings indicate that expert-defined action plans yield higher task-completion rates relative to other operational modes. However, failure remains frequent across all modalities, largely attributable to brittle command invocation, environmental and deployment instability, and recurring errors in credential management and state handling.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3YArchitecture and DesignarXivArchitecture quality attributescs.SEarXiv AI Coding 3YarXiv cs.SEarXiv AI Agents for Software Engineering

ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java

Advait Pavuluri, Bridget McGinn, Ashita Saxena, George Safta, Srikanth Tamilselvam, Raju Pavuluri, Michele Merler, Baishakhi Ray, Rahul Krishna
2026/05/08 00:05已过 5 天

Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured. We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert-written implementation triples across Spring, Jakarta EE, and Quarkus: 34 applications (29 focused single-layer, 5 whole) yielding 102 variants (~151K lines across 1946 source and test files) and 204 directed refactoring tasks. Each task gives an agent a working source application and a target framework; the agent must synthesize a target implementation preserving the source behavior. Correctness is evaluated by an application-specific executable oracle: the candidate must compile, deploy in a containerized target runtime, and pass behavioral tests over the application's observable interface. We evaluate five state-of-the-art coding agents on ScarfBench. The strongest achieves only 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, and only one of the 204 tasks yields a fully behaviorally equivalent target. Difficulty is asymmetric across framework directions and architectural layers: Spring<->Quarkus is the most tractable pair, and Jakarta-targeted migrations are hardest. From LLM-as-a-judge and expert adjudication of failed-task traces, we derive a taxonomy of recurring failure categories spanning build, deploy, and test stages. We release the benchmark, harness, and agent traces at https://scarfbench.info.

arXiv AI Coding 3Y / arXiv cs.SE / arXiv AI Agents for Software Engineering / arXiv AI Coding 3YAI for Software EngineeringarXivAI-enabled recommender systems for automated SEAutomating SE tasks with LLM and foundation modelsCollaborative AI for SEcs.SEarXiv AI Coding 3YarXiv cs.SEarXiv AI Agents for Software Engineering

To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

Shota Sawada, Tatsuya Shirai, Yutaro Kashiwa, Ken'ichi Yamaguchi, Hiroshi Iwata, Hajimu Iida
2026/05/07 23:52已过 5 天

LLM-based autonomous coding agents have reshaped software development. While these agents excel at code generation, open questions persist about the long-term maintainability of AI-generated code. This study empirically investigates the maintenance extent, human involvement, and modification types of AI-generated files versus human-authored code. Using the AIDev dataset of AI-generated pull requests and GitHub, we analyzed over 1,000 files and approximately 3,200 changes from 100 popular repositories. Our findings show that: (i) AI-generated files receive less frequent maintenance than human-authored code, with updates affecting only a small fraction of file size; (ii) the most frequent modifications to AI code are feature extensions, whereas human updates focus on bug fixes, and (iii) human developers perform the large majority of this maintenance.

arXiv cs.CR · Dependability and Security · Confidentiality, integrity, and privacy · cs.IT · cs.CR

Cryptographic and Information-theoretic Security Capacities for General Arbitrarily Varying Wiretap Channels

Holger Boche, Ning Cai, Yiqi Chen, Marc Geitz
2026/05/07 23:51 · 5 days ago

We compare the strong secrecy capacities of Arbitrarily Varying Wiretap Channels (AVWCs) and General Arbitrarily Varying Wiretap Channels (GAVWCs) with their capacities under a semantic secrecy constraint and other equivalent cryptographic secrecy constraints. It turns out that the average-error strong-secrecy capacity of an AVWC is always equal to its maximal-error semantic-secrecy capacity. However, this equivalence does not hold for all general communication systems, and we prove this by a counterexample. We also show that, for the GAVWC, semantic security and the other cryptographic security measures considered achieve the same capacity values. Finally, we bound the gap between the strong secrecy capacity and the semantic secrecy capacity for the GAVWC. The gap vanishes if the jammer's set of choices grows sub-double-exponentially with the block length n, which gives a sufficient condition for the strong and semantic secrecy capacities of a GAVWC to be equal.
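
For reference, the two secrecy notions being compared are usually formalized as vanishing information leakage to the eavesdropper. The formulations below use one standard notation and are not taken verbatim from the paper; the AVWC/GAVWC definitions there may include the jamming state explicitly.

```latex
% Standard formulations of the two secrecy constraints (notation assumed).
\begin{align}
  \text{Strong secrecy:}   &\quad \lim_{n \to \infty} I(M; Z^n) = 0
      \quad \text{for uniformly distributed } M, \\
  \text{Semantic secrecy:} &\quad \lim_{n \to \infty} \max_{P_M} I(M; Z^n) = 0
      \quad \text{over all message distributions } P_M,
\end{align}
% so semantic secrecy implies strong secrecy, and the question is when the
% corresponding capacities coincide.
```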

arXiv cs.CR · Dependability and Security · Confidentiality, integrity, and privacy · cs.NI · cs.CR

When to Use Wireless Challenge-Response Physical Layer Authentication: Design of a Measurable Guideline for OFDM

Haiyun Liu, Shangqing Zhao, Yao Liu, Zhuo Lu
2026/05/07 23:50 · 5 days ago

The security of wireless challenge-response Physical Layer Authentication (PLA) based on Orthogonal Frequency Division Multiplexing (OFDM) relies on a sufficiently random fading channel, a condition commonly assumed in existing studies. In practical scenarios, however, this condition is not always guaranteed, and the responses of OFDM subchannels may exhibit correlation. Consequently, ensuring the security of such PLA systems remains an unsolved problem. In this paper, we propose a novel adversary model, called the Maximum Differential Likelihood Generator (MDLG), which exploits the weak correlation present in practical wireless channels to launch effective attacks against PLA. Based on this model, we create a measurable guideline, built on randomness testing, for deciding when PLA can in fact be used under a given practical wireless channel condition. Extensive real-world experiments validate the effectiveness of the MDLG attack and demonstrate how the proposed guideline can help protect the security of PLA.
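
As an illustration of such a guideline, the sketch below applies a simple randomness test to binarized subchannel responses and only recommends enabling PLA when randomness is not rejected. The specific test (a NIST-style monobit test), the binarization rule, and the threshold are assumptions, not the paper's actual procedure.

```python
# Toy "should we use PLA here?" check: test binarized channel responses for randomness.
import math
import random

def monobit_p_value(bits: list[int]) -> float:
    """NIST SP 800-22 style frequency (monobit) test p-value."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2 * n))

def binarize(responses: list[float]) -> list[int]:
    """Toy binarization: compare each subchannel gain to the median gain."""
    med = sorted(responses)[len(responses) // 2]
    return [1 if r > med else 0 for r in responses]

def pla_recommended(responses: list[float], alpha: float = 0.01) -> bool:
    """Recommend challenge-response PLA only if randomness is not rejected."""
    return monobit_p_value(binarize(responses)) >= alpha

# Toy usage with synthetic channel gains (not measurement data):
random.seed(0)
print(pla_recommended([random.gauss(0, 1) for _ in range(256)]))
```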

arXiv AI Coding 3Y / arXiv cs.SE / arXiv LLM4Code / Program Repair / arXiv AI Agents for Software Engineering · AI for Software Engineering · AI-enabled recommender systems for automated SE · Automating SE tasks with LLM and foundation models · Collaborative AI for SE · cs.SE · cs.AI

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

Francesco Dente, Dario Satriani, Paolo Papotti
2026/05/07 23:44 · 5 days ago

Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers. Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline. Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.
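
The dual evaluation can be pictured as combining an end-to-end behavioral score with static constraint checks, as in the sketch below. The specific verifiers here (a textual ORM check and a raw-SQL check) are toy assumptions, not the benchmark's actual static verifiers.

```python
# Sketch of a dual evaluation: behavioral tests plus static structural checks.
import subprocess
from pathlib import Path

def behavioral_pass_rate(app_dir: str) -> float:
    """Run the task's test suite; pytest is assumed here as a stand-in."""
    proc = subprocess.run(["pytest", "-q", app_dir], capture_output=True)
    return 1.0 if proc.returncode == 0 else 0.0   # coarse placeholder score

def static_constraints_satisfied(app_dir: str) -> dict[str, bool]:
    """Toy structural verifiers: purely textual, unlike real AST-based checks."""
    src = "\n".join(p.read_text(errors="ignore") for p in Path(app_dir).rglob("*.py"))
    return {
        "uses_orm_models": "class " in src and "Model" in src,
        "no_raw_sql": "SELECT " not in src.upper(),
    }

def evaluate(app_dir: str) -> dict:
    checks = static_constraints_satisfied(app_dir)
    return {
        "behavioral": behavioral_pass_rate(app_dir),
        "structural": sum(checks.values()) / len(checks),
        "details": checks,
    }
```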

arXiv LLM4Code / Program Repair · AI for Software Engineering · Automating SE tasks with LLM and foundation models · AI-enabled recommender systems for automated SE · Collaborative AI for SE · cs.MA

AgenticPrecoding: LLM-Empowered Multi-Agent System for Precoding Optimization

Zijiu Yang, Zixiang Zhang, Shunpu Tang, Qianqian Yang, Zhiguo Shi
2026/05/07 23:43 · 5 days ago

Precoding is a key technique for interference management and performance improvement in multi-antenna wireless systems. However, existing precoding methods are typically developed for specific system models, objectives, and constraint sets, which limits their adaptability to the heterogeneous and evolving scenarios expected in future 6G networks. To address this limitation, we propose AgenticPrecoding, a universal multi-agent framework that automates end-to-end precoding derivation directly from user-level communication requirements. Specifically, AgenticPrecoding decomposes the derivation process into four coordinated stages: problem formulation, solver selection, prompt upsampling, and code generation, assigning each stage to a specialized agent tailored to its specific reasoning demands. We employ two LoRA-adapted reasoning agents to inject precoding-specific domain knowledge for problem formulation and solver selection, while two general-purpose Large Language Models (LLMs) handle prompt refinement and executable code generation. Furthermore, a feedback-driven refinement mechanism is incorporated to enhance code executability, constraint feasibility, and solution quality. Extensive experiments across 10 representative precoding scenarios demonstrate that AgenticPrecoding achieves superior cross-scenario adaptability compared to conventional optimization-based and LLM-based baselines.
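
A minimal sketch of such a staged pipeline with a feedback-driven retry loop is shown below. The `call_llm` stub, the prompts, and the retry policy are placeholders; they stand in for the paper's specialized agents and LoRA-adapted models, which are not reproduced here.

```python
# Sketch of a four-stage agent pipeline: formulate -> select solver ->
# upsample prompt -> generate code, with feedback on failed executions.
from typing import Callable

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string for demonstration."""
    return f"[{role} output for: {prompt[:40]}...]"

def run_pipeline(requirement: str, execute: Callable[[str], bool], max_rounds: int = 3) -> str:
    problem = call_llm("formulator", f"Formalize a precoding problem for: {requirement}")
    solver = call_llm("solver-selector", f"Pick a solver for: {problem}")
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        prompt = call_llm("upsampler", f"Expand into a coding prompt: {problem} / {solver} / {feedback}")
        code = call_llm("coder", prompt)
        if execute(code):          # executability / feasibility check
            return code
        feedback = "previous attempt failed to execute or violated constraints"
    return code

# Toy usage: an 'executor' that accepts everything on the first try.
print(run_pipeline("maximize sum rate under per-antenna power limits", lambda c: True))
```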

arXiv cs.CR · AI for Software Engineering · Automating SE tasks with LLM and foundation models · Prompt engineering for SE · cs.CR

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

Zeyuan Chen, Yihan Ma, Xinyue Shen, Michael Backes, Yang Zhang
2026/05/07 23:29 · 5 days ago

Large language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance but do not eliminate the risk. Our results highlight persistent privacy vulnerabilities in modern LLMs.
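
The core mechanism can be sketched as follows: hide a span of the target text, offer it among distractors, and treat above-chance recall as evidence of membership. The prompt format, the `query_model` stub, and the scoring threshold are assumptions for illustration, not the paper's exact attack.

```python
# Sketch of a quiz-style membership test: mask a span, ask multiple choice,
# and score how often the model recovers the original span.
import random

def make_quiz(text: str, distractors: list[str]) -> tuple[str, str]:
    words = text.split()
    span = " ".join(words[len(words) // 2 : len(words) // 2 + 5])   # hidden span
    stem = text.replace(span, "_____", 1)
    options = distractors + [span]
    random.shuffle(options)
    letters = "ABCDEFGH"[: len(options)]
    body = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    answer = letters[options.index(span)]
    prompt = f"Fill in the blank:\n{stem}\n{body}\nAnswer with one letter."
    return prompt, answer

def query_model(prompt: str) -> str:
    """Stub for a black-box LLM call; replace with a real API client."""
    return "A"

def membership_score(text: str, distractors: list[str], trials: int = 20) -> float:
    """Fraction of quizzes answered correctly; high values suggest membership."""
    correct = 0
    for _ in range(trials):
        prompt, answer = make_quiz(text, distractors)
        correct += query_model(prompt) == answer
    return correct / trials
```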