Computer Science (arXiv)

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

cs.CL Jun 09, 2026

This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero-shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama-4 shows a substantial, Holm-corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini-3.1-Pro displays an equally large decrease (delta = -0.750, p = .005). DeepSeek-R1 exhibits a similar negative shift (delta = -0.860, p = .006) and provides chain-of-thought evidence consistent with a buffering mechanism. GPT-4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross-lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western-origin LLMs. We identify two distinct buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis-management settings.

Unifying Local Communications and Local Updates for LLM Pretraining

cs.LG Jun 09, 2026

Communication-efficient pre-training of LLMs is increasingly important as training draws on compute distributed across clusters, data centers, and lower-bandwidth links. Many practical methods reduce communication frequency but still rely on synchronous All-Reduce operations that maintain identical model states and tie progress to global collectives. This can become a bottleneck when bandwidth or worker speed is heterogeneous. We introduce GASLoC, a novel decentralized pre-training algorithm that generalizes the notion of communication acceleration to the recently popular "outer optimizer" to allow a practical gossip-based training framework that is compatible with adaptive optimizers, allows for local optimizer steps, and can utilize sparse randomized peer communication. Empirically, on a number of standard LLM training tasks, we demonstrate that GASLoC outperforms state-of-the-art decentralized algorithms in single step per communication setting for a number of topologies and, unlike existing decentralized methods in the LLM setting, it allows to obtain performance competitive with DiLoCo when utilizing multiple local steps. In the heterogeneous bandwidth setting we demonstrate the advantage of GASLoC showing that it can significantly outperform DiLoCo.

Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria

cs.MA Jun 09, 2026

Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an equilibrium, but selecting socially desirable outcomes among many suboptimal Nash equilibria. Standard deep multi-agent reinforcement learning (MARL) methods struggle with this problem, as value-decomposition approaches are constrained by monotonicity assumptions and policy-gradient methods often converge to stable but socially inefficient equilibria. To address this limitation, we propose $Φ$-Actor-Critic ($Φ$-AC), a framework that leverages swap regret minimization to steer learning toward high-welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, $Φ$-AC employs a centralized attention critic that predicts vector-valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. We further introduce a Lagrangian-based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest scenario demonstrate that $Φ$-AC learns efficient and stable coordination strategies across diverse mixed-motive settings while maintaining high collective return and competitive fairness.

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

cs.CL Jun 09, 2026

Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.

A History-Aware Visually Grounded Critic for Computer Use Agents

cs.AI Jun 09, 2026

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

Coset Ensemble Decoder for Quantum Error Correction with Algorithm-Hardware Co-Design

cs.AR Jun 09, 2026

Reliable large-scale quantum computation relies on fault-tolerant architectures, where quantum error correction (QEC) continuously extracts and decodes error syndromes in real time. A critical component in QEC is the decoder, a classical subsystem that must simultaneously deliver high logical accuracy and ultra-low latency. This paper presents a novel algorithm-hardware co-design that improves the accuracy-latency trade-off over existing approaches such as vanilla Minimum-Weight Perfect Matching (MWPM) and Union-Find (UF) decoders. At the algorithmic level, we introduce coset ensemble decoding, which improves UF decoding by explicitly exploiting logically equivalent cosets. Our method performs ensemble forest exploration to generate multiple coset-consistent candidates and aggregates them to approximate coset-level maximum-likelihood decoding. We further reduce computational and memory complexity via reverse-order elimination and lossless graph compression, without sacrificing accuracy. At the hardware level, we design a domain-specific architecture that temporally reuses resources, avoiding the code-distance-proportional resource growth in prior spatial architectures. Several optimizations, such as multi-bank memory hashing and hierarchical ID mapping, are proposed to mitigate pipeline stalls and memory conflicts under highly concurrent access patterns. Under a circuit-level depolarizing noise model, our co-design approach achieves a better accuracy-latency trade-off than prior MWPM- and UF-based decoders, while reducing FPGA LUT consumption by up to 8.2 times compared with reported UF-based decoder resources. The tunable candidate number further exposes a flexible design knob, enabling users to tailor decoding performance to the requirements of different fault-tolerant workloads. Our implementation is publicly available at https://github.com/IMSeonL/coset-ensemble-decoder.

Exploring the Design Space of Reward Backpropagation for Flow Matching

cs.LG Jun 09, 2026

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highlighting a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four choices: the reward-model input, active set, integration weights, and bridge coupling, and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base on preference, quality, and compositional metrics, the three variants improve over direct-gradient baselines on most metrics.

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

cs.CL Jun 09, 2026

With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

cs.CL Jun 09, 2026

Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.

Enumerating Inclusion-Maximal Arithmetic Progressions

cs.DS Jun 09, 2026

We present a simple $\mathcal{O}\left( n^2 \frac{ \log N }{ \log \log N } + N \right)$ enumeration algorithm for solving a problem from mathematical and computational music analysis where, given a strictly increasing integer sequence, $S$, with $n$ entries and maximum value $N$, the task is to enumerate all $m$ $\textit{inclusion-maximal arithmetic progressions (IMAPs)}$ in this sequence. An IMAP is a subsequence, $S' \subseteq S$ with $k>2$ integers, in which (i) the difference between any two consecutive integers is the same number, $d$ (i.e., $S'$ is an $\textit{arithmetic progression}$), (ii) $S'$ cannot be further extended to the left or to the right with any additional integers from $S$ while still remaining an arithmetic progression (i.e., $S'$ is a $\textit{maximal}$ arithmetic progression), and (iii) there is no other maximal arithmetic progression, $S'' \subseteq S$, which $\textit{properly}$ contains $S'$ (i.e., $S'$ is an $\textit{inclusion-maximal}$ arithmetic progression). We further provide proofs for the expected number of IMAPs in random integer sequences, $S$, and a bound on their order of growth. Finally, we provide empirical experiments comparing both (a) the practical running time performance of the proposed algorithm against that of a previously known algorithm which has higher time complexity $\mathcal{O}(N^{2+o(1)}n)$, and (b) the actual enumerated number of IMAPs to that of their mathematically expected number. Notably, the proposed algorithm demonstrates a significant improvement in running time over the previously known algorithm, and in immediate practical applications, will allow for more efficient analysis of large and rhythmically complex musical pieces.

GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

cs.LG Jun 09, 2026

Neural population activity models can recover rich temporal structure from binned spikes, but their read-in and readout layers often remain tied to a fixed set of recorded neurons. This coupling limits reuse in long-term brain-computer interfaces, where recorded neuron identities, counts, and response statistics can change across days. We introduce GRAFT, a Transformer-based neural population activity model that separates reusable temporal dynamics from a recalibratable neuron interface. The neuron interface controls how recorded neurons enter and leave the shared backbone, and auxiliary gain and positional mechanisms support neural activity modeling inside the Transformer. On MC Maze under the standard NLB'21 protocol, GRAFT reaches 0.3866 co-bps as an ensemble, setting a new state of the art on the primary co-bps metric among public and reported NLB'21 results. In a cross-day protocol constructed from the NLB'21 MC Maze dataset series, GRAFT recalibrates from MC Maze to the scaled MC Maze datasets (Large/Medium/Small) by updating only 9.21% of parameters, reaching 0.3749, 0.3112, and 0.3152 co-bps with restricted target-day support sets. These results show that the same interface-backbone separation supports both strong Transformer-based neural population activity modeling and data-efficient cross-day recalibration.

Arithmetic Packing on Wide Integer Datapaths in DSP Primitives of Modern FPGA Devices

cs.AR Jun 09, 2026

Deep Neural Networks increasingly employ low-precision quantization to reduce computational requirements. While FPGAs are well suited for workloads with heterogeneous precisions, their dedicated digital signal processing (DSP) slices only feature fixed-width datapaths that are significantly underutilized by low-bitwidth arithmetic. While previous approaches have already introduced the packing of multiple values onto the same wide DSP datapath, they either only support specific fixed bitwidths or are wasteful regarding the use of additional support logic external to the DSP. This paper proposes an efficient method to dynamically pack multiple (un-)signed inputs with arbitrary bitwidths into a wide multiplier path by leveraging the DSP's internal pre-adder. Building on this, we present two distinct architectures, one optimized for matrix-vector multiplications and the other for convolutions. Our implementations are integrated into AMD's FINN framework. With these optimizations, we reduce the LUT utilization by 21% and increase the FPS/DSP by 36% for the UltraNet model compared to the FINN reference.

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

cs.AI Jun 09, 2026

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the trusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce \textbf{CIAware-Bench}, a benchmark for measuring \textbf{c}ontrol \textbf{i}ntervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark is comprised of a suite of four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena), while varying trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.

Flexible Kernels for Protein Property Prediction

cs.LG Jun 09, 2026

Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore--by learning what are in effect structure-aware substitution matrices--we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

cs.CL Jun 09, 2026

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from $67.2\%$ to $9.4\%$. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections ($W_Q, W_K$) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only $W_Q$ and $W_K$ from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from $65.4\%$ to $76.4\%$ while maintaining strong reasoning performance.

Making Software Meaningful

cs.SE Jun 09, 2026

Adopting a single measure can improve the usability, modularity and accountability of software: a commitment to explicit meaning. This entails constructing and agreeing upon a representation of the behavior of the software, as observed in the domain of application. The phenomena comprising this behavior become a vocabulary that grounds all discourse about the software, among all stakeholders, and for all artifacts and activities. These phenomena are individuals; actions they participate in; and facts that result from actions. They can be organized, by partitioning the set of actions, into concepts, offering larger units of meaning. Examples of exploiting meaning are given in three areas: designing for usability (by aligning user and designer on a single shared meaning); generating modular code with LLMs (by mapping units of meaning to units of code, achieving not only modularity but also legibility); and making agents accountable (by having them adhere to a code of conduct that defines their intended behavior).

LLM-Mediated Demand Response Coordination in Smart Microgrids

cs.MA Jun 09, 2026

Effective demand response in smart microgrids requires prosumers to cooperate voluntarily under strategic self-interest, a coordination problem structurally equivalent to a repeated Prisoner's Dilemma on a social network. This paper presents a multi-agent simulation in which a Large Language Model (LLM) Influence Compiler issues structured demand-response directives to a population of heterogeneous prosumer agents, each governed by a hybrid decision architecture combining game-theoretic base probability (derived from payoff history, neighbour imitation, and exploitation memory) with LLM narrative evaluation of incoming coordination signals. The hybrid architecture resolves a key methodological challenge: LLMs aligned via Reinforcement Learning from Human Feedback (RLHF) exhibit strong cooperation bias when used as direct decision-makers, producing flat dynamics regardless of grid conditions. By separating strategic reasoning from grounded narrative evaluation, the model generates realistic prosumer behaviour across six personality archetypes, with baseline cooperation near 50% and clear differentiation under influence. Compiled structured directives achieve 33.3% demand-curtailment cooperation versus 27.0% for unstructured messaging and 28.0% for a no-intervention baseline ($Δ_\mathrm{comp} = +0.063$), with the advantage preserved across both grounded and idealized agent substrates ($Δ= +0.083$) and across all resistance levels ($R = 0.1$ to $0.7$). Hub-targeted dissemination via high-centrality network nodes outperforms peripheral or random targeting, confirming that grid topology provides mechanistic amplification independent of message content. These results suggest that structured LLM compilation, grounded agent reasoning, and network-aware targeting are complementary design principles for scalable, interpretable demand-response coordination in smart-city energy systems.

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

cs.CL Jun 09, 2026

Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

cs.AI Jun 09, 2026

Reusing a held-out benchmark adaptively should, in principle, invite overfitting. Yet benchmark-driven machine learning (ML) has produced surprisingly little overfitting in practice. An attractive hypothesis is that successful ML strategies are highly compressible. We study this in the setting of LLM-driven research agents, where the hypothesis becomes directly testable via two complementary information bottlenecks. In \emph{output compression}, an exploration agent adaptively searches for high-performance models using a validation set, and we test whether a fresh ``reproducer agent'' can reproduce its performance given only an extremely short prompt and the training data. In \emph{input compression}, the explorer receives only one-bit feedback indicating whether each submitted model improves on the running best. Across 8 datasets spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling, we find that these bottlenecks have little effect on performance: short prompts and compressible feedback are sufficient to reproduce and find high-performance models. The hypothesis is falsifiable: when we deliberately induce validation-set overfitting, the results fail to reproduce with short prompts. Taken together, our results support a description-length explanation for the lack of overfitting in benchmark-driven ML: successful strategies occupy a low-complexity region of strategy space.

Generalized Conformal Predictive Systems Under Distributional Shifts

cs.LG Jun 09, 2026

Conformal predictive systems (CPS) output calibrated bands of CDFs under exchangeability. We extend generalized CPS to non-exchangeable settings by encoding distributional shifts through observation-specific permutation weights. This yields shift-aware predictive systems that remain valid whenever the test point is, conditionally on the unordered sample, a weighted draw from the observed atoms. Since such weights are typically estimated, we introduce weight-uncertainty boxes and construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees. We derive efficient computation for conformity-measure CPS, conformal binning, and conformal isotonic distributional regression. Experiments under covariate shift and feedback-driven biomolecular design show calibrated predictive bands that widen under stronger shifts and tighten as sample size increases.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

cs.AI Jun 09, 2026

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

Internet Quality Barometer (IQB): A preliminary data-driven evaluation of the IQB framework

cs.NI Jun 09, 2026

The Internet Quality Barometer (IQB) framework was designed to transform raw Internet measurement data into actionable insights about Internet quality. Specifically, the framework maps raw speed test measurements to network requirements (e.g., throughput, latency), maps these requirements to representative Internet use cases (such as video streaming or web browsing), and finally aggregates performance across use cases into a single IQB score. The IQB score is a composite index ranging from 0 to 1, intended to capture overall Internet quality in a way that is both interpretable and comparable across locations. We implemented the IQB framework in practice by developing an open-source IQB library and a prototype web application. These tools enabled us to compute IQB scores at scale, including global estimates aggregated at the level of countries, regions, and cities. In this report we conduct a preliminary sensitivity analysis of the IQB framework, investigating how different parameter choices affect the resulting IQB scores, identifying which parameters the framework is most sensitive to, and highlighting cases that may lead to outliers or potentially misleading results.

Generation of Diverse and Functional Robot Designs using Superquadrics Parametrisation and Quality-Diversity

cs.RO Jun 09, 2026

Generative design of robots requires navigating a vast search-space, encompassing physical configurations and behavioural parameters. Evolutionary Algorithms (EAs) have shown promising results, but often converge prematurely to a small set of sub-optimal designs. Most EAs fail to maintain sufficient diversity in the population that would allow the discovery of distinct functional robots. To counter premature convergence, we introduce a superquadrics-based representation (SQs) for robot bodies. SQs are interpretable, compact and computationally efficient mathematical representations of 3D geometrical shapes that can be tuned to specific design-spaces. To encourage morphological diversity, we combine this representation with a quality-diversity (QD) algorithm (MAP-Elites). We compare SQs and Compositional Pattern Producing Networks representations as generators of morphologies, combining them with standard EAs and MAP-Elites. In two test environments, we find that using SQs to generate morphology in conjunction with the MAP-Elites algorithm reaches the highest QD-score across both environments, maximising diversity of design and functionality of generated robots. The findings highlight the benefits of using a compact and interpretable geometric representation for exploring a complex design-space and suggest that combining SQs with an explicit diversity mechanism increases the quality and number of designs generated.

GapFuzz: Cross-Plane Divergence Fuzzing for Distributed SDN Controllers

cs.SE Jun 09, 2026

Distributed Software-Defined Networking (SDN) clusters replicate flow state asynchronously between a master node and its backups, leaving a window during which two backup nodes can each commit a contradictory rule, the master can serialize both into the data plane, and the kernel datapath can latch onto an action that no node believes authoritative. Existing SDN fuzzers miss this fault: they confine their oracle to the control plane, target a single controller, or do not steer concurrency to provoke replication races. We present GapFuzz, a stateful concurrency fuzzer for distributed SDN clusters. GapFuzz injects pairs of contradictory Northbound requests on two non-master nodes with controlled inter-injection delay $Δt$, and reconstructs the global cross-plane state by querying every replica and the kernel-datapath action through ovs-appctl ofproto/trace. A two-phase timing search detects whether a divergence exists, then doubles and bisects on $Δt$ to bound the injection-time window; a lifetime probe labels each verdict transient or persistent and assigns it to one of four cross-plane state classes derived from the ONOS 2.7 source. On a three-node ONOS 2.7 cluster, GapFuzz produces a divergent verdict in 81.7% of attempts ($N=50$, Wilson 95% CI $[77.3, 85.4]$%); every divergence sits between the cluster's authoritative state and the kernel datapath. Phase 2 separates a 5 ms race window for one template from a doubling-cap regime ($Δt_{\max}=10.24$ s) for six others, and 99.4% of divergences persist past 30 s. Replacing the kernel-datapath probe with the OpenFlow user-space probe used by prior fuzzers drops detection by 26.6 percentage points overall and by 46.5 points after excluding canonicalization-forced verdicts.

A Spiking Neural Architecture for Coordinating Arm and Locomotor Control

cs.RO Jun 09, 2026

Spiking Neural Networks (SNNs) coupled with neuromorphic hardware offer energy-efficient solutions for humanoid robot control. However, existing SNN-based motor control systems address bipedal locomotion and arm control in isolation, leaving integrated control of both unaddressed. We present a spiking architecture that coordinates force-based arm control and bipedal locomotion in a simulated humanoid, using the Neural Engineering Framework (NEF) and Semantic Pointer Architecture (SPA). High-level action selection between locomotor and arm control is mediated by a biologically grounded spiking basal ganglia model. We validate the system through co-simulation of Nengo, for the neural control, and Isaac Sim, demonstrating successful target reaching, continuous digit drawing, path-following locomotion, and finally, switching between walking and arm control via basal ganglia disinhibition. To our knowledge, this is the first integrated spiking controller to combine bipedal locomotion and arm control on a full-scale humanoid platform. The full spike-based implementation enables future deployment on low-power neuromorphic hardware.

Computer Science (arXiv)

Cookie Preferences

Essential Cookies

Analytics Cookies