Machine Learning

See recent articles

Showing new listings for Friday, 3 July 2026

Total of 273 entries

Showing up to 2000 entries per page: fewer | more | all

[1] arXiv:2607.01278 [pdf, other]: Title: Multilayer Q-Matrix-Embedded Neural Network for Cognitive Diagnosis (M-QCDNet): Structure-Aware Deep Learning Architecture for Psychometric Interpretability

Yiyao Yang

Comments: 15 pages, 3 tables

Subjects: Machine Learning (cs.LG)

The research proposes a multilayer Q-matrix-embedded neural network for cognitive diagnosis (M-QCDNet), which integrates the structural interpretability of cognitive diagnostic models (CDMs) with the deep learning neural network (NN). M-QCDNet structures the item-skill relationship using the Q-matrix as a structural prior, ensuring latent mastery profiles remain interpretable and consistent with cognitive theory, followed by the proposed loss function with an L2 penalty to penalize skills not aligned with the Q-matrix and to balance predictive performance and structural alignment. Corresponding evaluation matrices, the interpretable alignment-based metrics that quantify the degree to which predicted skill activations correspond to item-level skills, were further developed. M-QCDNet offers practical benefits for classroom practice, enabling early detection of learning difficulties and supporting mastery-based interventions. By embedding diagnostic validity into model design, M-QCDNet bridges psychometric transparency and neural flexibility, advancing interpretable, fair, and actionable AI for cognitive diagnostics.
[2] arXiv:2607.01279 [pdf, html, other]: Title: I\textsuperscript{2}RiMA: Spectral Riemannian Representation with Temporal Attention for Mental Stress Detection based on EEG Signals

Cheng He, Kunyu Peng, Shangen Han, Jinming Ma, Jinhong Ding, Likun Xia

Subjects: Machine Learning (cs.LG)

Cross-subject EEG stress detection remains challenging because discriminative stress-related patterns are both subject-dependent and frequency-specific. Conventional Riemannian methods model spatial covariance mainly in the time domain, overlooking neural oscillations that are critical for high-level cognitive state decoding, while standard temporal tokenization often fragments inter-slice temporal coherence. To address these limitations, we propose \method{}, an Intra-Inter Riemannian Manifold Attention Network for EEG-based stress detection. \method{} constructs spatial covariance matrices independently at each frequency point and maps them to the SPD tangent space, preserving channel-wise geometry together with frequency-specific discriminative cues. It further introduces frequency cluster aggregation to select informative spectral components and reduce redundancy by forming compact, data-driven frequency clusters aligned with EEG rhythms. Finally, an intra-inter slice attention module adaptively integrates local slice-level spectral dynamics and global temporal context across EEG sequences. Experiments on three datasets show that \method{} consistently outperforms five state-of-the-art baselines, achieving up to 82.78\% balanced accuracy while remaining efficient with only 1.60M parameters and 31.95M FLOPs.
[3] arXiv:2607.01280 [pdf, html, other]: Title: Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery

Yuan Si, Jialu Zhang

Subjects: Machine Learning (cs.LG); Programming Languages (cs.PL)

Programming-by-example systems infer programs from a small set of input-output examples. Robust PBE work usually models wrong examples as samples from a stochastic noise process and then minimizes an expected or empirical loss. This paper studies a different failure mode: an adversary who sees the synthesizer and chooses the examples whose corruption most damages the returned program. We formalize fixed-set worst-case corruption for finite PBE version spaces, implement exact-within-bounded-pool and heuristic corruption searches for a string-transformation DSL, and introduce version-space partition aggregation (VPA), a defense that synthesizes on disjoint example groups and votes by semantic signatures. The central claim is deliberately bounded and partly negative: low-margin PBE tasks have an adversarial robustness dimension that random-typo and noisy-PBE evaluations miss, while semantic partition aggregation helps only when the clean semantics keep a partition vote margin, which often fails on realistic tasks. Evidence from curated/generated DSL tasks, accepted public SyGuS PBE_SLIA slices, SYNTRA Playgol v2, and noisy-PBE objective baselines supports that boundary. One curated edit flips all 8 spike tasks while 200-trial typo, DSL-pool, and distance-matched random controls succeed on 10.3%, 11.0%, and 16.7%; generated margin-1 rows flip under budget 1 yet VPA recovers them; on public SyGuS the vote margin is near one, so an adaptive attacker drives VPA accuracy to zero; accepted public SyGuS slices move across exact-within-pool budget boundaries; and Playgol shows positive paired-bootstrap gaps against typo and same-pool random controls on the 141 accepted rows. A small exact-output prompt harness over 20 controlled margin-1 tasks shows the same qualitative clean-to-attacked pattern across local and API models, while it is treated as a scope check, not a broad LLM benchmark.
[4] arXiv:2607.01282 [pdf, html, other]: Title: Domain Knowledge Based Temporal-Spatial Graph Convolution Network for ECG Recognition

Wenting Ma, Zhipeng Zhang, Xiaohang Yuan, Ningwei Xie, Yuxin Xie, Xiaolin Wang, Meng Guo, Xingang Chai, Zhenjie Yao

Comments: 10 pages, 5 figures. Presented at ICONIP 2024, Auckland, New Zealand. Published in LNCS 15290, Springer, 2025

Journal-ref: Neural Information Processing (ICONIP 2024), LNCS 15290, Springer, 2025, pp. 92-106

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

In light of strides in Arti cial Intelligence (AI) and its wide spread application, challenges persist in the interpretability of AI models, particularly within specialized domains like healthcare, such as electro cardiograph (ECG) recognition. Rather than relying solely on end-to-end convolutional neural networks, this paper introduces a novel approach using a domain knowledge-based graph convolution network for ECG recognition. Key landmarks points of PRQST, vital to ECG interpreta tion, are incorporated as domain knowledge. The double-stream directed graph is employed to model both intra and inter ECG cycles. Speci cally, spatial directed graphs capture the positional relationships among key points, while temporal directed graphs delineate temporal dependencies between adjacent cycles in extended ECG sequences. Experimental re sults on the First Chinese ECG Intelligent Competition dataset, which speci cally classify ECG into nine categories, prove the e cacy of the proposed model. The overall average F1 score is 88.1%, the average F1 score of rare categories is 76.3%, both outperform the state-of-the-art models. The introduction of domain knowledge did enhance the detec tion performance, especially for rare categories.
[5] arXiv:2607.01283 [pdf, html, other]: Title: Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions

Matthew J Liu, Wei Hang Zheng, Vidhan Purohit, Siqi Xie, Chieh-En Li, Jerry Li, Noah Flynn

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Grid-based approaches to approximate nearest neighbor (ANN) search have been absent from modern scaling analyses. We present a systematic characterization of a multiprobe grid algorithm with respect to dataset size $N$ and dimensionality $d$. Our experiments reveal a previously unreported $d$-scaling crossover on the GloVe embedding family, in which multiprobe grid search maintains an approximately constant dimensional scaling exponent while other graph-, tree-, and partitioning-based methods exhibit degrading throughput. The advantage comes with near-linear query scaling in $N$, but also with lower indexing cost than competing ANN methods. Our results suggest that grid-based methods such as multiprobe grid may be competitive in rebuild-heavy or high-dimensional settings where indexing cost and dimensional robustness dictate performance. More broadly, recent work has formalized self-attention as an ANN operation. Thus, the $N$- and $d$-scaling properties of ANN algorithms may guide cost analysis of efficient transformer architectures. Code is available at: this https URL.
[6] arXiv:2607.01286 [pdf, html, other]: Title: IonSense-QKG: A Quantum-Readiness Metadata Framework for Lithium-Ion Battery Dataset Discovery

Sakthi Prabhu Gunasekar, Prasanna Kumar Rangarajan

Comments: 7 pages, 1 figure, 4 tables. Code and metadata artifact available at GitHub

Subjects: Machine Learning (cs.LG); Databases (cs.DB)

Public lithium-ion battery datasets are increasingly used for state-of-health estimation, remaining-useful-life prediction, anomaly detection, electrochemical diagnostics, second-life analytics, and battery safety research. However, these datasets vary substantially in chemistry, modality, scale, label quality, sequence structure, access status, and preprocessing complexity. These differences directly affect whether a dataset is feasible for near-term hybrid quantum-classical machine-learning workflows.
This paper presents IonSense-QKG, a quantum-readiness metadata framework for lithium-ion battery dataset discovery. Starting from the EV-Battery-IonSense index, the proposed framework enriches public battery dataset records with quantum-relevant metadata, including task type, sensing modality, chemistry, label availability, sequence type, preprocessing requirements, candidate quantum encodings, estimated qubit range, and NISQ feasibility. A transparent Quantum Readiness Score is introduced to rank datasets as candidate resources for future hybrid quantum-classical battery benchmarks. The score is intended as a dataset-selection heuristic, not as evidence of quantum advantage.
The framework demonstrates query-based discovery over enriched metadata to identify datasets suitable for compact quantum feature maps, quantum time-series workflows, limited-label anomaly detection, and future battery-health benchmarking. The released artifact includes metadata tables, scoring scripts, robustness checks, link-checking utilities, and SQL-style query examples. IonSense-QKG positions dataset selection as a data-management problem and provides a reproducible foundation for data-centric quantum battery analytics.
[7] arXiv:2607.01307 [pdf, html, other]: Title: A Novel Machine Learning Approach for Central Nervous System Tumor Classification from DNA Methylation

Paulo R. Ferreira Jr., Lucas Coutinho Freitas, Laís dos Santos Gonçalves, William Borges Domingues, Lucas Petitemberte de Souza, Mariana B. Michalowski, Vinicius F. Campos

Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN)

NA methylation profiling has become a powerful approach for central nervous system (CNS) tumor classification, yet important challenges remain regarding cross-cohort transferability, methodological correctness, and robust multiclass evaluation. In this work, we propose a novel and methodologically rigorous machine-learning approach for methylation-based CNS tumor classification that combines Sparse Random Projection for dimensionality reduction with multinomial logistic regression for classification. We evaluate the proposed approach in the same general experimental setting established by a widely used reference classifier. On the 2,801-sample reference cohort, our method achieves a mean accuracy of 96\% under stratified 3-fold cross-validation. On the independent 1,104-sample clinical evaluation cohort, it reaches 86\% accuracy at the 91-class level and 93\% when predictions are evaluated at the methylation class family level. These results improve upon the corresponding state-of-the-art reference figures of 82\% class-level concordance and 88\% family-level concordance, yielding absolute gains of approximately 4 and 5 percentage points, respectively. This improvement is clinically relevant: in a diagnostic setting, a 5-point increase in correct tumor classification can directly affect cancer subtype assignment and, in turn, influence treatment selection and downstream clinical decision-making. Our results show that the proposed model, grounded in stronger methodological practice in machine learning, consistently outperforms the previous state of the art across evaluation settings and can materially improve the reliability of CNS tumor classification.
[8] arXiv:2607.01311 [pdf, other]: Title: From Approximation to Emergence: A Theory of Deep Learning

Zhilin Zhao

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Deep learning has outgrown any single mathematical explanation. From Approximation to Emergence develops a unified, proof-oriented account of modern deep learning theory, tracing a path from the classical foundations of approximation, optimization, and generalization to the contemporary mechanisms of overparameterization, robustness, generative modeling, transformers, in-context learning, scaling laws, interpretability, alignment, and emergence. Rather than presenting isolated results, the book organizes a broad literature into a coherent research narrative: each theory is examined through the object it controls, the assumptions that make it valid, and the phenomena it leaves unexplained. Written for researchers, graduate students, and mathematically trained practitioners, this monograph offers a rigorous map of deep learning theory as it stands today: powerful, incomplete, and increasingly centered on the question of how learned mechanisms arise from scale, data, architecture, and training.
[9] arXiv:2607.01313 [pdf, html, other]: Title: Black-Box Inference of LLM Architectural Properties with Restrictive API Access

Christopher Ellis, Shreyas Chaudhari, Mei-Yu Wang, Leighton Barnes, Giulia Fanti, José M. F. Moura

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

In practice, most commercial LLM providers do not publicly release details of underlying LLM architectures. However, prior work has shown that given limited API access to an LLM (namely, top-$k$ logits and/or a logit bias function), one can recover certain architectural details of an LLM, such as the hidden dimension of the feed-forward network. Perhaps in response to these results, most commercial LLM providers have restricted their APIs to expose only the single logit for each decoded token, and they no longer give users the ability to bias logits. We show that even under current restrictive APIs, several architectural parameters are still recoverable. We present NightVision, an attack that uses restrictive black-box API access to estimate the hidden dimension, depth, and parameter count of an LLM. Algorithmically, NightVision relies on a novel common set prompting technique in which multiple prompts expose log probabilities for the same set of output tokens; a spectral analysis of these results is used to infer hidden dimension. NightVision additionally uses end-to-end time to first token (TTFT) measurements and the estimated hidden dimension to estimate depth and parameter count. We empirically evaluate NightVision on 32 open-source LLMs, recovering hidden dimension to within 23% average relative error across all models (9% on MoE models), and depth and parameter count to within 53% for models exceeding three billion parameters. We run extensive ablations to demonstrate how these accuracies scale with token budget and model properties. Overall, our results suggest that current LLM APIs are not sufficiently restricted to fully obfuscate the architectural details of their underlying models.
[10] arXiv:2607.01365 [pdf, html, other]: Title: Multi-modal Rail Crossing Safety Analysis

Paimon Goulart, Chansong Lim, Nícolas Roque dos Santos, Yue Dong, Sheldon Peterson, Jia Chen, Evangelos E. Papalexakis

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Given one or more images of a railway crossing, can we leverage visual cues that allow us to robustly estimate how safe it is? Can we improve our ability to do so by introducing structured data (such as official accident reports) about the accident history of that crossing into our models? In this work, we explore how to best answer those questions towards building an AI system that can ingest multi-modal data for railway crossings and provide safety assessment and scores that align with expert opinion and with safety scoring used by the Federal Railroad Administration (FRA). To that end, we propose a proof-of-concept pipeline that delivers on that goal, while at the same time exploring and tackling a number of critical research challenges that pertain to different parts of the pipeline, from data preparation to different learning paradigms that can allow us to realize such a system. Indicatively, our proposed system identifies HIGH-RISK and LOW-RISK crossings with a macro F1 score of 0.757 and estimates FRA-based safety scores with an RMSE of 0.078 and correlation of 0.492 using a routed fine-tuned compact VLM pipeline, while producing qualitative results that align with domain-expert assessment.
[11] arXiv:2607.01391 [pdf, html, other]: Title: How Should Transformers Encode Numeric Values in Electronic Health Records?

Maria Elkjær Montgomery, Christian Igel, Mikkel Odgaard, Martin Sillesen, Mads Nielsen

Comments: 16 pages, 15 figures, 3 tables, accepted to ICML 2026, to be published in Proceedings of Machine Learning Research

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

How do we encode numeric values in transformer-based sequence processing, particularly in electronic health record (EHR) data? We systematically compare discrete, continuous, and hybrid value encoding strategies using synthetic arithmetic tasks embedded within real-world EHR data, as well as real-world clinical prediction tasks. Our study reveals trade-offs between numeric precision, optimisation stability, and architectural flexibility. We find that approaches that explicitly model value-concept interactions perform best on precision-sensitive arithmetic tasks when architectural constraints permit. Hybrid token-based approaches that retain numeric values but apply binning prior to projection provide a more robust and broadly applicable alternative, with the optimal number of bins following a simple empirically derived power-law in dataset size. Across tasks, models consistently exhibit reliable "good enough" numeric computation rather than exact arithmetic, while clinical gains from incorporating laboratory values are task-dependent. This suggests that robustness and deployability often outweigh maximal numeric precision in practice, motivating hybrid token-based approaches as a practical default.
[12] arXiv:2607.01401 [pdf, other]: Title: NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis

Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang

Comments: 5 figures. 3 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

INTRODUCTION: Accurate MRI-based identification of Alzheimer's disease (AD), mild cognitive impairment (MCI), and related dementias remains challenging because disease-related structural changes are often subtle and heterogeneous. We developed NeuroBridge, a clinically guided multi-task MRI framework for neurodegenerative disease diagnosis. METHODS: NeuroBridge integrates large-scale self-supervised MRI pretraining with hippocampal segmentation, hippocampal atrophy classification, and reconstruction objectives, followed by gated fusion fine-tuning. Performance was evaluated across ADNI and OASIS cohorts, including cross-cohort transfer, probability-based analysis, and opportunistic screening. RESULTS: NeuroBridge achieved the highest performance across evaluated classification tasks, reaching 88.17% accuracy for AD versus cognitively normal controls in ADNI and 82.78% in OASIS. The largest gains occurred in MCI-related and mixed-diagnosis settings. The framework demonstrated strong cross-cohort generalization, systematic associations between predicted-class probability and accuracy, and the feasibility of probability-based opportunistic screening. DISCUSSION: Clinically guided multi-task representation learning improves neurodegenerative MRI diagnosis beyond conventional single-task approaches. NeuroBridge provides a robust and scalable framework for dementia assessment and MRI-based opportunistic screening.
[13] arXiv:2607.01408 [pdf, html, other]: Title: Spin-Weighted Spherical Harmonics Enable Complete and Scalable $\mathrm{E}(3)$-Equivariant Networks

Chenxing Liang, Yuchao Lin, Andrii Kryvenko, Wendi Yu, Chuan Li, Jianwen Xie, Xiaofeng Qian, Shuiwang Ji

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

$\mathrm{E}(3)$-equivariant networks are promising for 3D atomistic system modeling, yet their scalability is limited by the $O(L^6)$ complexity of the Clebsch-Gordan Tensor Product (CGTP). The recently proposed Gaunt Tensor Product (GTP) reduces the complexity but is unable to capture the antisymmetric paths, resulting in incomplete expressivity. In this work, we present SpinGTP, an approach to overcome the GTP incompleteness by generalizing from scalar functions to Spin-Weighted Spherical Harmonics (SWSH). By relying on the algebraic properties of SWSH, SpinGTP recovers the missing antisymmetric interactions while maintaining the asymptotic efficiency of GTP. It also allows for a more expressive equivariant basis that naturally accounts for the parity-odd components of tensor products. We evaluate SpinGTP across diverse benchmarks, including Tetris, 3BPA, SPICE-MACE-OFF, and OC20. Our results show that SpinGTP achieves accuracies comparable to full CGTP. Notably, by explicitly capturing antisymmetric paths, SpinGTP exhibits superior performance in tasks involving chiral materials and non-centrosymmetric geometries. This work provides a complete, scalable, and mathematically rigorous path toward high-order equivariance in large-scale 3D atomistic system simulations.
[14] arXiv:2607.01415 [pdf, html, other]: Title: The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning

Daniel Thi Graviet, Lovre Pesut, Ivan Dagelic, Vedran Jukic, Ivan Burazin

Comments: Preprint. 6 pages, 6 figures, 2 tables. Submitted to ACM SoCC 2026

Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Coding-agent reinforcement learning treats execution infrastructure as a background implementation detail, despite relying on large numbers of interactive software rollouts. This is a missed opportunity: measuring infrastructure overhead can reveal practical efficiency gains for RL post-training, where small per-rollout savings compound at scale. We present a comparative study of four execution substrates: single containers, hosted sandboxes, Kubernetes-orchestrated containers, and cloud virtual machines. We find up to $110\times$ variation in cold-start latency and a $1.8\times$ spread in projected worker-hours for one million 150-step trajectories. Our results suggest that future coding-agent RL systems should optimize execution substrates as part of the training system itself, not merely as deployment plumbing.
[15] arXiv:2607.01417 [pdf, html, other]: Title: Conditional Inference Trees and Forests for Feature Selection

Robert Milletich, Justin Downes, Steve Goley, Newel Hirst

Comments: 38 pages, 9 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Conditional inference trees (CIT) and conditional inference forests (CIF) reduce split-selection bias by testing features before choosing split thresholds, but repeated permutation tests and threshold searches can make these methods computationally expensive. We study CIT and CIF as top-$k$ feature-ranking methods for downstream prediction using real-data benchmarks, runtime ablations, and synthetic feature-recovery experiments. At a fixed node, if the features and permutation budget do not depend on the node responses, Bonferroni-corrected $+1$ Monte Carlo permutation $p$-values control nodewise rejection under the complete permutation null. CIF ranks 4th among 17 classification methods on 22 datasets and 3rd among 18 regression methods on 8 datasets. With Bonferroni correction held fixed, the CIF runtime ablations indicate that adaptive stopping and the number of thresholds searched have the largest measured effect on runtime: turning off adaptive stopping and using exact threshold search increase fitting time by 4.0--8.4$\times$ and 1.9--10.8$\times$, respectively, while downstream score changes are at most 0.011. Sparse high-$p$ simulations indicate that forest feature sampling can leave informative features out of many split decisions. Overall, the results support CIF as a top-$k$ feature-ranking method in the evaluated downstream prediction benchmarks.
[16] arXiv:2607.01444 [pdf, html, other]: Title: On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain

Atsuki Yamaguchi, Szymon Palucha, Léo Bijar, Aline Villavicencio, Nikolaos Aletras

Comments: Under review

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Mixture-of-Experts (MoE) models offer inference speedups via selective activation but impose substantial memory requirements because the whole network must remain loaded. Structured expert pruning is a practical approach for reducing deployment costs in resource-constrained settings. However, prior studies primarily evaluate benchmark utility, leaving the effect of pruning on factual reliability underexplored, particularly in high-stakes domains such as biomedicine. In this paper, we investigate how domain-specific expert pruning affects both utility and reliability. We assess four MoE models, six pruning methods, and multiple pruning ratios across generation and classification tasks under in-domain (biomedical) and cross-domain settings. Results reveal that moderate pruning preserves in-domain utility without immediate reliability decline, although hallucination risks increase at extreme pruning ratios. When shifting to the general domain, both utility and reliability degrade rapidly. These findings indicate that safe compression depends heavily on the task and domain. Evaluating pruned MoE models solely on utility is inadequate for high-stakes deployment without reliability assessment.
[17] arXiv:2607.01449 [pdf, html, other]: Title: Geometry-Aware R-Structured Kolmogorov-Arnold Networks

Sergei Kucherenko, Nilay Shah

Comments: 27 pages, 13 figures

Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)

We propose a novel hybrid neural architecture, the Geometry-aware R-Structured Kolmogorov-Arnold Network (GRS-KAN), which integrates this http URL's R-functions into the Kolmogorov-Arnold Network (KAN) framework. The proposed approach combines two complementary modeling mechanisms: smooth nonlinear structure is learned by KAN branches, while known geometric or logical constraints are encoded analytically using differentiable R-functions. This enables explicit representation of discontinuities, feasible regions, and implicit geometric boundaries within a trainable neural architecture.
The framework implements differentiable logical operations through R-conjunctions and R-disjunctions, allowing complex geometric supports to be represented analytically and incorporated directly into regression models. Several GRS-KAN variants are introduced, including additive, multiplicative, and agnostic branch-weighted architectures.
The method is demonstrated on regression problems involving discontinuities with circular and rectangular supports. Numerical experiments show that explicit geometric encoding substantially improves predictive accuracy and boundary localization compared with standard KANs. In the considered benchmarks, geometry-aware GRS-KAN models reduce test RMSE by up to 67% while simultaneously improving interpretability through explicit analytical representation of the learned geometric structure. The agnostic variant further demonstrates the ability to automatically determine whether geometric priors are beneficial for a given learning task.
[18] arXiv:2607.01455 [pdf, html, other]: Title: Token Geometry

Kathan Shah

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Language models learn continuous programs over discrete symbols, with the embedding table and LM-head acting as the read/write interface between them. We show that this interface has gradient geometry distinct from dense hidden weights which can be exploited to improve the Pareto frontier across supervised finetuning, RL, and pretraining, while only utilizing kilobytes of optimizer state. We introduce Ember, a lightweight optimizer for embedding and LM-head matrices that utilizes O(V + D) VRAM, instead of Adam's O(2VD), and forgoes the need to shard both token table optimizer states. We provide empirical evidence that Ember scales effectively across batch size and parameter count. We show that the optimization trajectory of tokens can be well described by a simple 1D ray, counter to the popular belief that neural net parameters navigate a heavily nonconvex landscape. We provide a principled view on the surprisingly narrow space of optimizers that suffice for Transformer training. Finally, we open-source our distributed Ember implementation that merges cleanly with existing ZeRO/FSDP setups to support further research at this https URL
[19] arXiv:2607.01474 [pdf, html, other]: Title: Class-Grouped Normalized Momentum and Faster Hyperparameter Exploration to Tackle Class Imbalance in Federated Learning

Haemin Park, Diego Klabjan, Martin W. Braun, Xiuqi Li, Balakrishnan Ananthanarayanan

Journal-ref: Proceedings of the 43 rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

Subjects: Machine Learning (cs.LG)

Class imbalance poses a critical challenge in federated learning (FL), where underrepresented classes suffer from poor predictive performance yet cannot be addressed by standard centralized techniques due to privacy and heterogeneity constraints. We propose FedCGNM (Federated Class-Grouped Normalized Momentum), a client-side optimizer in FL that partitions classes into a small number of groups based on minimum within-group variance, maintains a momentum per group, normalizes each group momentum to unit length, and uses the summation of the normalized group momentums as an update direction. This design both equalizes gradient magnitude across majority and minority groups and mitigates the noise inherent in rare-class gradients. We further provide a theoretical convergence analysis explicitly accounting for time-varying resampling-rates. Additionally, to efficiently optimize these rates in small-client regimes, we introduce FedHOO, an X-armed-bandit (XAB) based algorithm that exploits federated parallelism that evaluates many combinations of two candidate rates per client at linear cost. Empirical evaluation on four public long-tailed benchmarks and a proprietary chip-defect dataset demonstrates that FedCGNM consistently outperforms baselines, with FedHOO yielding further gains in small-scale federations.
[20] arXiv:2607.01487 [pdf, html, other]: Title: How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size

Fabian Schaipp

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We propose a scaling law that takes into account model size and training data while explicitly splitting the latter into training steps and batch size (called three-term law). Fitting the proposed law on a large set of training runs, we find that it correctly recovers the scaling of the optimal batch size. Moreover, because it makes use of training runs with suboptimal batch size, our proposed law can be robustly fit with a significantly smaller amount of training runs. We further show that the three-term law can be used to derive scaling laws for suboptimal batch sizes, and that it matches previous empirical findings related to the critical batch size.
[21] arXiv:2607.01490 [pdf, html, other]: Title: Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

Juliette Decugis, Sean O'Brien, Francis Bach, Gabriel Synnaeve, Taco Cohen

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage into its positive and negative gradient mass along two orthogonal axes. On the sign axis, imbalanced updates collapse either entropy or weight geometry. On the difficulty axis, hard-problem focus sharpens signal but costs sample size. Both trade-offs shift during training: exploration favors balance and hard focus; exploitation favors suppression and medium focus. This motivates FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage that reads training dynamics to schedule the gradient weight automatically. FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale and 2k steps earlier at the 32B , while achieving the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
[22] arXiv:2607.01492 [pdf, other]: Title: Unveiling the Non-Monotonic Effect of Privacy on Generalization under Byzantine Robustness

Thomas Boudou, Batiste Le Bars, Nirupam Gupta, Aurélien Bellet

Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Recent work has established a fundamental trilemma between Byzantine robustness, local differential privacy (LDP), and optimization error in distributed learning. We show that this trilemma does not universally extend to generalization error, but instead depends critically on the privacy regime. Specifically, in the high-noise regime (strong privacy), we prove that increasing privacy reduces the generalization error, i.e., there is no tension between robustness and privacy. In the low-noise regime (weaker privacy), however, the tension between robustness and privacy reappears and increasing privacy indeed degrades generalization. Our theory explains this surprising non-monotonic behavior of the generalization error via matching lower and upper bounds on the algorithmic stability of Byzantine-robust distributed learning under LDP constraints. We corroborate and further analyze these theoretical findings with empirical evaluations.
[23] arXiv:2607.01498 [pdf, html, other]: Title: Towards Learning Representations of Policies in Two-Player Zero-Sum Imperfect-Information Games

Kevin Wang, Kevin Yang, Arjun Prakash, Amy Greenwald

Comments: 7 pages, 4 figures, 3 tables

Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

We investigate the problem of learning useful policy representations (embeddings) in two-player zero-sum imperfect-information games. We make three contributions: First, we introduce methods of creating datasets of policies for a given game. Second, we propose methods to learn policy representations. Third, we introduce downstream tasks to evaluate the effectiveness of such representations.
We evaluate each dataset method, embedding method, and downstream task on Kuhn and Leduc Poker. Although our methods are very basic, we demonstrate that useful behavioral representations are present in the learned embeddings. To our knowledge, this work is among the first to systematically compare self-supervised learning techniques for learning policy representations in games. Our code is available at this https URL for others to extend.
[24] arXiv:2607.01520 [pdf, html, other]: Title: The risk of KV cache compression

Lukas Haverbeck, Carmen Amo Alonso, Andres Felipe Posada-Moreno, Sebastian Trimpe, Marco Pavone

Subjects: Machine Learning (cs.LG)

Transformer inference on long sequences is expensive because softmax attention repeatedly reads from a large KV cache. The prevalent approach to this bottleneck is KV cache compression, which replaces the full cache with a compact summary. Despite its practical importance, the design of such summaries is largely driven by empirical experimentation. On the theoretical side, existing results show that KV cache compression can be impossible in the worst case, but offer little systematic guidance for designing algorithms in regimes where accurate compression is possible. We bridge this gap by characterizing the minimax risk of KV cache compression in terms of the intrinsic compressibility of a cache, revealing when and how accurate compression is possible. These results yield novel design principles for KV cache compression under causal masking that map efficiently to prefill and autoregressive decoding while achieving minimax-optimal risk. We instantiate these principles in a practical algorithm and report promising performance on LongBench in targeted experiments. Overall, our results provide a principled avenue for practical KV cache compression with theoretical guarantees.
[25] arXiv:2607.01523 [pdf, html, other]: Title: Multi-Head Recurrent Memory Agents

Jiatong Li, Samuel Yeh, Sharon Li

Comments: 19 pages, 11 figures, 5 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Recurrent memory agents extend LLMs to arbitrarily long contexts by iteratively consolidating input into a fixed-size memory window. Despite their scalability, these agents exhibit a well-documented reliability problem: end-to-end performance degrades systematically as context length grows. We diagnose this failure by decomposing performance into two factors--memory capture and memory retention--and quantitatively confirm that retention is the dominant bottleneck. Retention collapses because existing designs maintain memory as a monolithic text block, forcing every update to risk overwriting previously retained content. Motivated by this diagnosis, we propose Multi-Head Recurrent Memory (MHM), a general, training-free framework that partitions memory into independent heads governed by a stage-wise select-then-update strategy. At each step, exactly one head is selected for update while the remaining heads are structurally shielded from overwriting, shifting the burden of retention from model behavior to architectural design. As a lightweight instantiation, we introduce Least-Recently-Updated MHM (MHM-LRU), which guarantees uniform head utilization with zero additional token overhead. Extensive experiments on long-context benchmarks show that MHM-LRU substantially improves both retention and end-to-end accuracy across the 100K--1M token range, where baselines degrade sharply. On RULER-HQA at 896K tokens, MHM-LRU improves the memory retention rate from less than 30% to 73.96%. These gains generalize across model families, scales, and task types, positioning architectural optimization as a practical and cost-efficient path toward reliable long-context recurrent memory.
[26] arXiv:2607.01528 [pdf, html, other]: Title: Wind-Aware Reinforcement Learning Control of a Small Quadrotor Using Learned Onboard Wind Estimation in Simulated Atmospheric Turbulence

Abdullah Al Tasim, Wei Sun

Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Small multirotor aircraft are increasingly tasked with operations in the atmospheric boundary layer, where turbulent winds comparable to the vehicle's airspeed degrade trajectory tracking and can defeat conventional feedback control. This work illustrates a two-stage learning pipeline that first estimates the local wind from onboard kinematics and dynamics and then exploits that estimate inside a reinforcement learning (RL) flight controller. The wind estimator, an attention-augmented gated recurrent network trained on thousands of simulated flights through von Karman turbulence with power-law shear and veer, recovers the horizontal wind vector with a per-flight root-mean-square error of 0.40 m/s and a direction error of 3.2 degrees on unseen wind regimes, an accuracy near the floor imposed by unresolved turbulence, and generalizes to vertical ascent profiles with a skill score of 0.861 over a constant-wind reference. A proximal policy optimization controller receiving the frozen estimator's output reduces horizontal trajectory tracking error by 48% relative to a wind-blind proportional-derivative baseline across mean winds of 4 m/s to 12 m/s, winning on 100% of evaluation episodes. A three-way ablation decomposes this improvement into a kinematic component, available without wind information, and a wind-perception component; the perception share rises with wind speed, from small in light winds toward roughly half the total benefit in strong winds, consistent with the quadratic scaling of aerodynamic drag. The controller degrades gracefully on out-of-distribution winds of 13 m/s to 15 m/s, where the baseline fails catastrophically.
[27] arXiv:2607.01537 [pdf, html, other]: Title: Certified World Models as Sensing Clocks: Drift-Aware Deadlines for Active Perception

Hongbo Wang

Comments: 15 pages, 3 figures, 6 tables. Preprint

Subjects: Machine Learning (cs.LG)

Certified world models estimate how long their predictions remain valid. We turn this validity horizon into an operational sensing clock: a rule for when an agent should stop coasting and re-sense. Starting from an audited equivariant world model, we derive a deadline for no-sensing intervals and show that deployable deadlines in learned world models must be drift-aware: on-manifold Lyapunov rates alone overestimate coasting validity, while calibrated native rollout-drift envelopes carry the deployed guarantee. On a frozen 3D VN-JEPA model, the resulting clock controls held-out interval-simultaneous certificate violation across seeds and data shards. In a cue-conditioned theorem-bed (a synthetic bench where all schedulers share the exact model, isolating the scheduling rule), the clock remains valid on the deployment distribution and substantially reduces eventful-tail violations relative to exact-mixture expected-belief scheduling at matched sensing budget. We also report limits: in the short-horizon frozen VN-JEPA regime, empirical conformal horizons match the deployed clock on validity and budget, and a partial-reset exploration finds no clean budget-matched advantage for the spectral term. Thus the contribution is a certified sensing-clock primitive and drift-aware deployment method, not a claim that spectral clocks empirically dominate all non-spectral schedulers.
[28] arXiv:2607.01548 [pdf, html, other]: Title: Evolutionary Feature Engineering for Structured Data

Ege Onur Taga, Yilin Zhuang, M. Emrullah Ildiz, Petros Mol, Abhimanyu Das, Karthik Duraisamy, Samet Oymak

Comments: 9 page main content, 41 pages in total

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Large language models are increasingly used as open-ended search operators in evolutionary optimization. We introduce Evolutionary Feature Engineering (EFE), a framework for using LLM-based evolution to discover preprocessing transformations for structured data. EFE represents transformations as Python programs with a standardized fit/transform interface, allowing them to be inserted directly into existing machine learning pipelines. During evolution, candidate programs are refined using dataset context, summary statistics, and downstream performance feedback on validation set. We instantiate EFE in two settings. For time-series forecasting, EFE-Time learns invertible, dataset-specific normalizations that improve off-the-shelf time-series foundation models. It reduces forecasting errors (MASE, WQL, MAE) 3% or more when averaged across datasets and improvements are as much as 19% on the COVID-Deaths dataset. Notably, these improvements occur with recent TSFMs such as Chronos-2. For tabular prediction, EFE-Tab evolves compact feature programs that add useful interpretable features and remove redundant ones, improving or matching existing LLM-based feature-engineering methods. We found EFE-Tab to be particularly effective on classical decision trees, where small sets of evolved features yield competitive accuracy while preserving interpretability. Overall, EFE demonstrates that LLM-based evolution can improve both accuracy and interpretability when automatically tackling structured data.
[29] arXiv:2607.01553 [pdf, html, other]: Title: X-LogSMask: Expand Transformer for Graph-Structured Data

Leyan Li, Rennong Yang, Zhenxing Zhang, Liping Hu

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Transformers have become general-purpose architectures, but their all-to-all self-attention is poorly matched to graph data, whose interactions are sparse, structured and multi-scale. Existing Graph Transformers address this mismatch through structural encodings, hybrid message-passing modules or learned attention constraints, often introducing additional complexity and limited interpretability. Here we introduce X-LogSMask, an explainable multi-head logarithmic structural mask that injects symmetrically normalized graph topology directly into attention logits. The logarithmic transform converts structural connectivity into a topology-aware gating signal, suppressing unsupported node interactions while preserving feature-dependent attention. By assigning different powers of the normalized adjacency matrix to different attention heads, X-LogSMask gives each head a defined structural radius and supports multi-hop information propagation within a single layer. We further show that a standard Transformer encoder can be interpreted as one-step message passing on a complete graph, motivating X-LogSMask as a topology-constrained alternative to unrestricted self-attention. Across 20 node-, edge- and graph-level benchmarks, Transformers equipped with X-LogSMask achieve state-of-the-art performance on 13 datasets and remain competitive in a lightweight one-layer configuration. These results show that simple, interpretable structural masks can make self-attention an effective graph-learning operator without changing the Transformer architecture. The code is available at this https URL.
[30] arXiv:2607.01571 [pdf, html, other]: Title: Geometric Signatures of Reasoning: A Spectral Perspective on Task Hardness

Aria Masoomi, Mahsa Bazzaz, Adel Javanmard, Vahab Mirrokni

Subjects: Machine Learning (cs.LG)

Chain-of-thought (CoT) reasoning enables large language models (LLMs) to solve complex problems by generating intermediate reasoning steps. While much attention has been paid to the length and content of these reasoning chains, far less is known about their internal geometry. We study the \emph{geometry} of CoT trajectories in the hidden state space of transformer models, formalizing each reasoning chain as a discrete curve in $\mathbb{R}^d$ and characterizing it through spectral, positional, and kinematic geometric functionals. We introduce the effective dimension $d_\rho$ as a measure of trajectory complexity and show theoretically that trajectories with flatter eigenvalue spectra correspond to harder tasks, as they explore more of the hidden dimensions. Lastly, we explore how kinematic features of the trajectory, mean position, positional dispersion, initial and current hidden states, mean velocity, mean speed, and speed dispersion, can be used to predict solution correctness before generation is complete, and may inform future early-stopping strategies. Experimentally, on mathematical reasoning problems from the MATH500 dataset, $d_\rho$ achieves $0.93$ AUC in distinguishing easy from hard problems, while kinematic features potentially can predict correctness from only the first $20\%$ of generated tokens. These correctness signatures transfer across questions of varying difficulty, establishing that the shape of a model's internal reasoning trajectory is a principled window into both task hardness and solution quality.
[31] arXiv:2607.01600 [pdf, html, other]: Title: BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems

Zewen Liu

Comments: 18 pages, 3 figures, 2 tables

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

As large language models (LLMs) are deployed as communicating agents, does inter-agent communication cause outputs to converge? We introduce BOUNDARY_SYNC, a protocol measuring representational coupling via the Coupling Amplification Factor (CAF = JSD_cond / JSD_baseline), where CAF < 1 indicates homogenization and CAF > 1 indicates diversification. In controlled GPT-4o experiments (N=30, ~9,900 API calls), we measure coupling in text and image communication. Key findings: (1) text communication causes significant homogenization (CAF=0.803 [0.740, 0.873], d=1.30, p<0.001), confirmed by no-communication ablation and prompt-perturbation controls; (2) image communication also homogenizes under within-modality baselines (CAF=0.834 [0.811, 0.858]), with comparable proportional effect; (3) group size moderates coupling direction -- K=5 produces homogenization while K=3 yields CAF > 1.0 (point estimates 1.14 and 1.06, CI pending), suggesting a directional shift toward diversification; (4) cross-model replication shows extreme variation (CAF 0.034-0.803), with DeepSeek dominated by format artifacts; (5) coupling is stateless -- driven by prompt context rather than cumulative updating, with continuous consensus producing monotonic convergence. These results establish LLM agent coupling as real, measurable, and controllable at the prompt level, with direct implications for multi-agent system design.
[32] arXiv:2607.01609 [pdf, html, other]: Title: SINA: A Fully Automated Circuit Schematic Image to Netlist Generator Using Artificial Intelligence

Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Mohammed Ayman Habib, Finn Murphy, Rishen Cao, Morteza Fayazi

Subjects: Machine Learning (cs.LG)

Recent advances in Artificial Intelligence (AI) have revolutionized Electronic Design Automation (EDA), particularly through Large Language Models (LLMs) for circuit design tasks. However, their application to analog and mixed-signal domains remains limited by the lack of machine-readable representations of existing circuit design knowledge. Circuit schematic images found in research manuscripts, textbooks, and websites constitute a vast repository of validated designs; however, these visual representations cannot be directly processed by EDA tools. Converting them into machine-readable netlists is essential for enabling simulation, verification, and building comprehensive databases for AI-based models. Current conversion methods lack generalization across both Integrated Circuit (IC) and Printed Circuit Board (PCB) level schematics. Moreover, they struggle with component recognition and connectivity inference, and fail to distinguish between connected junctions and crossing wires. In this paper, we propose SINA, an open-source circuit schematic image-to-netlist generator. SINA is a fully automated pipeline that integrates deep learning for robust component detection, connected-component labeling for accurate connectivity inference, Optical Character Recognition (OCR) for component reference designator extraction, and a Vision-Language Model (VLM) for reliable reference designator assignment. SINA handles both IC- and PCB-level schematics and incorporates dedicated crossing-wires detection to differentiate wire intersections from connections. We validate the correctness of the generated netlists using graph isomorphism techniques. Our experiments demonstrate an overall netlist generation accuracy of 96.67%, which is 2.72x higher compared to state-of-the-art approaches.
[33] arXiv:2607.01627 [pdf, html, other]: Title: MKGR: Multimodal Knowledge-Graph Representation Learning for Cold-Start Protein-Protein Interaction Prediction

Wenbo Zhang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Accurate protein-protein interaction (PPI) prediction is central to functional genomics, disease mechanism discovery, and drug development. A difficult setting arises when candidate interactions include proteins that have no observed PPI edges during training, where models relying on network topology alone often lose useful context. This paper presents \method, a multimodal representation framework for cold-start PPI prediction. \method\ combines region-aware protein sequence encoding with four protein-centered biomedical knowledge graphs, including protein-drug, protein-disease, protein-miRNA, and protein-lncRNA associations. The sequence branch extracts contextual representations from structurally informed sequence regions, while graph attention encoders learn modality-specific protein embeddings from sparse biomedical associations. A bridge reconstruction objective regularizes graph learning by recovering shared protein-entity associations, and a pair-level gating module adaptively integrates sequence and graph evidence for each candidate protein pair. Experiments on two benchmark datasets under novel-old and novel-novel cold-start settings show that \method\ consistently outperforms competitive sequence, network, and knowledge-graph baselines across ACC, F1, AUC, AUPR, and MCC.
[34] arXiv:2607.01646 [pdf, html, other]: Title: DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint

Haotian Xie, Junlin Chen, Mingkai Zheng, Lishan Yang, Zhao Zhang

Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

State-of-the-art large language model (LLM) training takes tens of thousands of graphics processing units (GPUs) for months and encounters failures across the software and hardware stack. Existing fault-tolerance mechanisms either impose non-trivial overhead during failure-free execution or suffer from prolonged recovery latency, particularly under scenarios where a small subset of compute nodes experience permanent failures. %The tradeoff between failure-free overhead and recovery latency forms a space forms a Pareto frontier We present DeadPool to simultaneously address both optimization objectives. DeadPool incorporates a fault-tolerance mechanism that restores LLM training via hot-swapping, namely by replacing failed nodes with spare nodes without terminating the complete job. The hot-swapping of DeadPool is enabled by two ideas: First, it exploits an off-critical-path in-memory checkpointing mechanism for spatial redundancy. Second, it introduces a communicator reconstruction protocol that replaces failed nodes with spare nodes at runtime. DeadPool efficiently overlaps the in-memory checkpointing with computation, thus introducing zero overhead during error-free execution. Upon permanent node failures, DeadPool can rebuild memory states with minimal recomputation by leveraging in-memory checkpoints. We evaluate DeadPool across scales (up to 512 NVIDIA A100 GPUs) and LLMs (up to 65B parameters), and observe zero checkpoint overhead with hot-swapping recovery completing in under 40 seconds. These results show that DeadPool simultaneously achieves both zero-overhead error-free execution and extremely low recovery cost.
[35] arXiv:2607.01656 [pdf, html, other]: Title: CALM: Interpretable Cross-Modal Alignment for Biomarker Discovery from Unpaired Data

Jueqi Wang, Zachary Jacokes, John Darrell Van Horn, Kevin A. Pelphrey, Michael C. Schatz, Archana Venkataraman

Comments: Accepted to MICCAI 2026

Subjects: Machine Learning (cs.LG)

The interaction between brain structure and genetic influences is key to understanding neuropsychiatric disorders. However, most large-scale datasets are unimodal, providing either neuroimaging or genetics data. We propose CALM, a framework that learns interpretable associations between brain ROIs and genetic pathways from completely disjoint populations. CALM aligns the two modalities in a shared latent space via linear projections that simultaneously match the class-conditional latent distributions and ensure group separability. These projections provide interpretable pathway--ROI associations. When trained on unimodal imaging and genetics datasets, CALM generalizes to an unseen paired dataset, outperforming several state-of-the-art methods and ablation baselines. We also demonstrate stability of the learned associations against a paired baseline. Our experiments on autism spectrum disorder reveal immune and metabolic pathways linked to specific cortical regions and are consistent with established literature. Thus, CALM opens the door to leveraging large unimodal repositories for studying cross-modal interactions in brain disorders across disparate datasets.
[36] arXiv:2607.01660 [pdf, html, other]: Title: Message Passing Based Two-Timescale Bayesian Learning for Joint Channel and Memory Hardware Impairments Tracking

Wei Xu, An Liu

Subjects: Machine Learning (cs.LG)

Hardware impairments in massive multiple-input multiple-output (MIMO) receivers introduce inter-symbol memory and inter-element coupling, severely degrading channel estimation. This paper employs a residual recurrent gated unit (RGRU) to model the intra-slot memory of the hardware impairments and proposes a message-passing-based two-timescale Bayesian deep learning (MP-TTBDL) framework for joint channel and impairment tracking. Owing to small-scale fading, the wireless channel varies rapidly across slots, whereas hardware impairments drift slowly due to hardware aging and environmental variations. To capture these distinct physical timescales, a fastvarying Markov prior and a slow-varying Gaussian Markov prior are assigned to the sparse channel and the network parameters, respectively. Based on a multi-slot factor graph formulation, a message-passing algorithm is developed. Specifically, the inter-slot messages admit closed-form updates, while the intra-slot factor graph, due to its complex recurrent structure, is partitioned into a channel tracking module and an impairments calibration module. The channel tracking module performs sparse channel estimation via turbo orthogonal approximate message passing (Turbo-OAMP), and the impairments calibration module updates the impairment parameters via a specially designed deep approximate message passing (DAMP) procedure, with the two modules iteratively exchanging extrinsic information through expectation propagation (EP) until convergence. Simulation results show that the proposed framework robustly achieves lower channel estimation error than conventional compensators followed by channel estimation across different online impairment scenarios and signal-to-noise ratio (SNR) conditions.
[37] arXiv:2607.01665 [pdf, html, other]: Title: Revisiting Decentralized Online Convex Optimization with Compressed Communication

Hao Zhou, Xiaoyu Wang, Chang Yao, Mingli Song, Yuanyu Wan

Subjects: Machine Learning (cs.LG)

Decentralized online convex optimization (D-OCO) is a popular framework for distributed applications with streaming data. To tackle the communication bottleneck, previous studies have investigated D-OCO with compressed communication and proposed several algorithms that are variants of online gradient descent (OGD). However, for D-OCO with exact communication, the best existing algorithms are variants of follow-the-regularized-leader (FTRL). In this paper, for the first time, we propose two FTRL-type algorithms for D-OCO with compressed communication. Compared with OGD-type algorithms, our algorithms are more elegant in both algorithmic design and theoretical analysis. The key insight is that the dual update mechanism of FTRL allows us to make a simple application of the technique for average consensus with communication compression. More specifically, our first algorithm considers the full-information setting, and can match the existing regret bounds. Our second algorithm is designed for the bandit setting, and can significantly improve both the regret bounds and communication costs of existing algorithms.
[38] arXiv:2607.01670 [pdf, html, other]: Title: UniWind: Toward Unified Day-Ahead Wind Power Forecasting via Physics-Informed State Routing

Ronghui Xu, Tongxin Wu, Guozhen Zhang, Yihan Li, Chenjuan Guo, Bin Yang, Yong Li

Subjects: Machine Learning (cs.LG)

Day-ahead wind power forecasting is essential for cost-effective power-system operation. It is primarily driven by future meteorological conditions while retaining temporal dependencies in power generation. In practice, observed wind-farm power often entangles physically available power with local environmental effects and latent operational states, such as shutdowns and curtailment. Existing physical models provide useful constraints but adapt poorly across wind farms, whereas data-driven models can capture rich correlations but often conflate meteorological effects with state-induced deviations. In this study, we propose UniWind, a wind power forecasting model based on physics-informed state routing. UniWind first employs a Physical Prior Estimator to construct a site-calibrated physical prior by combining site-conditioned monotonic warping with a shared physical power curve. It further applies a physical upper-bound constraint to shape this prior as a soft envelope of available wind power generation. UniWind then proposes a Latent State Encoder to model operating-state embeddings and transforms the physical prior into final power forecasts through a State-aware Power Corrector, which uses knowledge-guided supervised state routing and bounded, state-specific expert correction. Full-shot and cross-farm zero-shot experiments on more than 20 real-world datasets demonstrate the accuracy and robustness of UniWind.
[39] arXiv:2607.01678 [pdf, html, other]: Title: SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication

Mingkai Zheng, Junlin Chen, Haotian Xie, Zhao Zhang

Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Communication increasingly dominates the cost of Large Language Model (LLM) pre-training, especially under data-parallel and sharded training schemes, where gradient synchronization and parameter reconstruction overhead increase with model size and system scale. Existing communication-reduction methods either sparsify raw gradients, which can be unstable for modern Adam-style optimizers at high sparsity, or quantize communication, whose savings are fundamentally bounded by bit width and often incur additional runtime overhead. We present SCAPE, a communication-efficient distributed optimizer for LLM training that exploits the stability of AdamS's first-moment to enable aggressive sparsification without loss of LLM quality. Instead of constructing masks from raw gradients, SCAPE derives them from first-moment-based statistics, partitions mask generation across workers to align with optimizer sharding, and delays mask usage by one step so that mask synchronization can overlap with computation. SCAPE also reconstructs the quantities required for second-moment updates from a single synchronized sparse buffer, avoiding an additional collective. We implement SCAPE in Megatron-LM and evaluate its convergence by pre-training GPT-345M on OpenWebText and Llama-500M on SlimPajama-6B using 32 NVIDIA GH200 GPUs on TACC Vista. In both models, SCAPE preserves training stability, validation loss, and downstream task accuracy under 90\% and 99\% sparsity. For Llama-500M, SCAPE reduces end-to-end pre-training wall-clock time by up to 43.3\% while maintaining model quality comparable to dense AdamW and AdamS. For Llama-1.8B, SCAPE achieves up to 3.26$\times$ speedup per step compared to dense AdamS.
[40] arXiv:2607.01686 [pdf, html, other]: Title: WARP: Weight-Space Analysis for Recovering Training Data Portfolios

Tzu-Heng Huang, Aditya Goyal, John Cooper, Frederic Sala

Comments: This work appears in the ICML 2026 Workshop on Weight-Space Symmetries (WSS): from Foundations to Practical Applications. Our source code is available at this http URL

Subjects: Machine Learning (cs.LG)

Foundation models are routinely released to the public, yet the data recipes used to train them -- such as domain mixture weights that determine how different sources are sampled -- are rarely disclosed. This creates an access asymmetry: researchers study the resulting models but lack visibility into the training distribution that produces them. Prior works for inferring training data, such as membership inference, detect at the level of individual samples and thus cannot characterize the global composition of the training corpus. We introduce WARP, a framework that recovers a fine-tuned model's training mixtures directly from its released weights. WARP interpolates between the base and fine-tuned models using model merging, generating pseudo-checkpoints that approximate the missing training trajectory and expose a geometric footprint of the training data in the weight space. From these simulated footprints, WARP extracts geometric features and maps them to domain proportions using either a parameter-free softmax readout or an MLP projector trained on synthetic mixtures. In controlled experiments with BERT and GPT-2, WARP recovers domain mixtures with an average MAE as low as 0.046 and 0.104 respectively, outperforming membership inference and a variant with access to the true training trajectory.
[41] arXiv:2607.01689 [pdf, html, other]: Title: Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space

Long Minh Bui, Tuan Anh Le Van, Tung Phi Duc, Phi Le Nguyen, Jana Doppa, Trong Nghia Hoang

Comments: Accepted for Publication at the 42nd Conference on Uncertainty in Artificial Intelligence (UAI), 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Model merging aims to combine existing single-task solutions into a multi-task solution without additional data-driven fine-tuning.~Most existing approaches achieve this using geometric properties of local solution spaces. However, such geometric views provide limited guidance for scoring how statistically useful each task-specific update direction is across tasks during merging. To address this, we formulate model merging from a new perspective of probabilistic inference under a product-of-experts (PoE) scenario where each single-task solution defines an energy-based expert model (EBM) over the merged parameters. We show that several existing model merging methods arise as special cases of our framework under energy designs that impose implicit Gaussian assumptions on directional residuals between merged and task-specific models. Empirically, we find that these residuals are often heavy-tailed which exposes a mismatch with the imposed light-tailed Gaussian structures. We address this with a heavy-tailed PoE design based on Cauchy experts, which better captures the observed residual behavior while admitting a provably convergent inference procedure. Experiments across multiple tasks and architectures show significant improvements over state-of-the-arts baselines. Our code is available at this https URL.
[42] arXiv:2607.01693 [pdf, html, other]: Title: A Mathematical Introduction to Diffusion Models

Jianfeng Lu

Comments: Lecture notes for the John Tukey Summer Graduate School on Mathematics of Generative Models at SLMath (June 22nd, 2026 -- July 2nd, 2026)

Subjects: Machine Learning (cs.LG); Probability (math.PR)

These notes give a proof-oriented introduction to diffusion models from the viewpoint of sampling, tracing a single arc from classical sampling dynamics to modern diffusion samplers, their error analysis, and inference-time control. Throughout, the material is layered into core definitions and identities proved in full, representative estimates proved under simplifying assumptions, and research-level theorems stated with a proof roadmap. The intended audience is beginning graduate students with a background in probability but no prior exposure to stochastic differential equations, stochastic numerics, or diffusion models.
[43] arXiv:2607.01694 [pdf, html, other]: Title: Frequency Shift Physics-Informed Extreme Learning Machine for Solving High-Frequency Partial Differential Equations

Xiong Xiong, Ruonan Zhai, Zheng Zeng, Sheng Zhou, Rongchun Hu, Zichen Deng

Subjects: Machine Learning (cs.LG)

Solving partial differential equations (PDEs) with high-frequency solutions remains a central challenge in physics-informed machine learning due to spectral bias -- the tendency of neural networks to learn low-frequency components preferentially. This paper proposes a Frequency Shift Physics-Informed Extreme Learning Machine (FS-PIELM) framework that addresses this limitation through an additive mechanism for weight initialization. Rather than multiplying random weights by a scaling factor, the method translates the mean of the Gaussian weight distribution while keeping the variance fixed at unity, thereby avoiding the variance amplification inherent in scaling-based methods. Two variants are developed: FS-PIELM-L assigns independent frequency magnitudes to individual neurons, while FS-PIELM-G groups neurons for improved robustness. Theoretical analysis shows that the frequency variance under the proposed framework remains bounded and approaches unity regardless of target frequency, in contrast to the quadratic growth of conventional approaches. The method preserves the computational efficiency of extreme learning machines, requiring only a single linear solve. Experiments on seven benchmark problems spanning six equation types -- Helmholtz, wave, Poisson, Klein-Gordon, heat, and advection-diffusion -- on both regular and complex geometries show that the linear variant achieves the best accuracy in six of seven cases, with improvements of one to nearly five orders of magnitude over existing PIELM variants. The code and data accompanying this manuscript will be made publicly available at this https URL.
[44] arXiv:2607.01736 [pdf, html, other]: Title: Predicting Closed-Loop Performance of Latent World Models: Offline Checkpoint Selection for MPC and Model-Based RL Under Non-Markovian Rewards in LunarLander

Nikolai Smolyanskiy

Comments: Preprint, 19 pages (16 main text + 3 pages appendix), 7 figures, 4 tables. Video: this https URL , Code: this https URL

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

We study how to predict the downstream closed-loop performance of a learned latent world model from validation-time diagnostics alone. Choosing the right checkpoint from a world-model training run is difficult: validation loss and multi-step prediction RMSE keep improving long after closed-loop performance has collapsed. We present a suite of structural validation-time diagnostics drawn from optimal-control theory and apply them to Gymnasium's LunarLander v3, which features shaped rewards. We train an RSSM [5, 4] world model on it and treat per checkpoint CEM-MPC return as the oracle for closed-loop quality. By evaluating 40 metrics against this oracle, we find that the strongest single predictor is the Reward Observability Fraction (ROF), which measures the reward predictor's dependence on the observable subspace. We combine ROF with three structural regularizers into a single-number offline checkpoint selection score, the Composite Reward Observability Fraction (CROF). The CROF-selected world model trains a model-based A2C policy that beats a fairly evaluated model-free A2C baseline by ~24.5 return points while using ~65x fewer real-environment interactions, and the same world model also drives a strong zero-shot CEM-MPC policy. Code and data: this https URL.
[45] arXiv:2607.01746 [pdf, html, other]: Title: Finite-Lag Operator Geometry of Recurrent Representations

Kanishka Reddy

Subjects: Machine Learning (cs.LG)

Recurrent representations are trajectories, but representation geometry is often measured from static snapshots. We develop finite-lag operator geometry for recurrent hidden states from observed source-successor pairs $(X_t,X_{t+\Delta})$. The primitive is the conditional transport law $Q_\Delta(dy\mid x)$, estimated by a dense Gaussian source-smoothing operator. From this directed finite-lag law we derive a source-centered transport tensor $G_\Delta$, which decomposes exactly into conditional spread and coherent displacement, and an antisymmetric coordinate circulation $W_\Delta^\rho$, which summarizes directed lagged flow. We prove affine covariance with explicit metric dependence of scalar summaries, dense estimator stability on bounded trajectory clouds, and a finite-lag separation result showing that source-centered transport detects deterministic recurrent motion not recorded by infinitesimal carre-du-champ geometry. A linear-Gaussian closed form calibrates the quantities in terms of the update $A_\Delta$, source covariance, and innovation covariance. Controlled experiments validate the decomposition, circulation, covariance, and stability predictions. In performance matched repeat-copy networks, the framework reveals architecture dependent differences in total transport scale and coherent displacement trace, while coherent displacement fraction is metric and resolution dependent.
[46] arXiv:2607.01752 [pdf, html, other]: Title: Efficient Temporal Point Processes via Monotone Alternating Splines

Cheng Wan, Quyu Kong, Feng Zhou

Comments: 22 pages

Subjects: Machine Learning (cs.LG)

Temporal point processes (TPPs) have widespread applications across various domains. Compared to modeling the conditional intensity of a TPP, modeling its cumulative conditional intensity function (CCIF) improves computational efficiency and eliminates numerical approximation errors. However, current CCIF parameterizations uniformly rely on Monotone Neural Networks (MNNs), which we identify as suffering from three structural deadlocks--convexity restrictions, saturation limits, and violations of CCIF modeling requirements--that fundamentally restrict their representational capacity for complex temporal dynamics. To resolve these bottlenecks, this paper proposes a novel framework called Monotone Alternating Splines (MAS). By leveraging distinct interpolation and extrapolation components, MAS provides a flexible and efficient framework for modeling CCIFs. Theoretically, MAS's interpolation provides strong fitting accuracy, while its extrapolation supports robust generalization, reducing the irreducible approximation gaps of MNNs. Extensive experiments show that MAS achieves superior performance on both synthetic and real-world datasets.
[47] arXiv:2607.01762 [pdf, html, other]: Title: Role-Aware Neural Convex Divergence Heads for Asymmetric Representation Learning

He Huang, Lu Shen, Yunfeng Huang, Li Qi

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Many representation learning problems involve directed relations, such as lexical entailment, sentence entailment, ontology hierarchy, and citation links. Standard Euclidean, cosine, and Mahalanobis heads are symmetric, while generic neural scorers can model directionality but provide limited geometric structure. This paper proposes a role-aware neural convex divergence head for asymmetric representation learning. The head applies source- and target-role projections before evaluating an input-convex neural Bregman divergence, yielding a nonnegative structured score in the role-projected space. We characterize its projected-space identity, source-role convexity, directional-gap decomposition, and Hessian-based local curvature. Experiments on lexical, sentence, ontology, and directed graph benchmarks compare symmetric distances, unstructured asymmetric scorers, order/hyperbolic baselines, plain ICNN-Bregman heads, and the proposed role-aware variant. Across ten random seeds on the main semantic and ontology benchmarks, role-aware projections consistently improve directional accuracy over plain ICNN-Bregman heads while preserving zero observed negative divergence rate. The results also identify a boundary case: on large fixed-feature citation prediction, specialized symmetric or hyperbolic baselines remain stronger in ranking accuracy. Overall, the proposed head is best understood as a structured and interpretable plug-in distance module for tasks where directional relations matter.
[48] arXiv:2607.01763 [pdf, other]: Title: Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher--student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at this https URL.
[49] arXiv:2607.01775 [pdf, other]: Title: Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding

Marianne Arriola, Volodymyr Kuleshov

Comments: ICML 2026. We provide the code at this https URL

Subjects: Machine Learning (cs.LG)

Discrete diffusion models have steadily improved in quality relative to autoregressive (AR) models. However, these models are normally constrained to fixed-length generation and do not support key-value (KV) caching. Block diffusion partially bridges diffusion and AR by generating token blocks left-to-right, but its fixed-size sequential blocks limit decoding flexibility and parallelism. Here, we present a new class of language models, set diffusion, comprised of (i) a likelihood parameterization that factorizes over flexible-position, flexible-length token sets and (ii) a set-causal diffusion architecture that supports KV cache updates after every inference step. By factorizing over token sets instead of fixed-size blocks, tokens can be decoded in arbitrarily-ordered sets, including sliding-window sets, enabling faster inference and support for any-order decoding. Set diffusion achieves better speed-quality tradeoffs on mathematical reasoning, summarization, and unconditional generation compared to prior diffusion language models while offering stronger infilling performance than block diffusion. We provide the code, along with the model weights and blog post on the project page: this https URL
[50] arXiv:2607.01785 [pdf, html, other]: Title: EHHN: An Event-driven Heterogeneous Hypergraph Network for Object-Centric Next Activity Prediction

Jiaxing Wang, Kaitao Chen, Zhubin Han, Chenyu Hou, Bin Cao, Jing Fan, Ji Zhang

Subjects: Machine Learning (cs.LG)

Next activity prediction helps service-oriented processes anticipate upcoming steps before delays, exceptions, or service-level risks occur. Most existing methods assume classical single-case event logs, whereas real service processes often involve events shared by multiple typed business objects. Object-centric event logs (OCELs) capture such interactions, but current predictors remain limited. Flattening-based approaches lose cross-object context, and native OCEL graph-based approaches encode multi-object events through pairwise relations. Existing models also do not jointly capture event-driven object state changes, inter-event timing, and global execution patterns. We propose EHHN, an Event-driven Heterogeneous Hypergraph Network for object-centric next activity prediction. EHHN represents each prediction prefix as a heterogeneous hypergraph, where event--object hyperedges bind retained co-participating objects and a lifecycle hyperedge groups the primary object's observed lifecycle events. Based on this representation, EHHN uses a dual-stream architecture in which a micro-spatial stream models event-driven object-state evolution and a macro-evolution stream captures temporal dynamics using retrieved global prototypes. The two streams are fused to predict the next activity. Experiments on four public OCEL benchmarks against nine baselines show that EHHN achieves the best accuracy and macro F1-score on all datasets, with improvements of up to 8.1 and 12.4 percentage points over the strongest baselines. Compared with the strongest OCEL-native graph baseline, EHHN also reduces peak GPU memory by up to 24 times. Code is available at this https URL.
[51] arXiv:2607.01789 [pdf, html, other]: Title: EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning

Ahin Lee, Sehyun Yun, Taesik Gong

Comments: 6 pages. Accepted at MobiSys Workshop '26

Journal-ref: In Proceedings of the 24th Annual International Conference on Mobile Systems, Applications and Services Workshops (MobiSys Workshop '26), pp. 93-98, 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Mixture-of-Experts (MoE) models scale efficiently but remain costly to adapt due to redundant experts and uniform parameter allocation. Existing parameter-efficient fine-tuning (PEFT) methods such as LoRA ignore MoE routing dynamics, leading to suboptimal resource use. We propose EPnG, an adaptive prune-and-grow framework that reallocates LoRA capacity based on expert importance derived from router gate probabilities. EPnG prunes under-utilized experts and expands high-importance experts via rank growth with orthogonal initialization, while maintaining a fixed parameter budget. Across OLMoE and Qwen1.5-MoE, EPnG consistently outperforms LoRA under the same budget and achieves performance comparable to full fine-tuning while updating only 0.55%-0.72% of parameters (up to 140x-180x fewer). These results demonstrate that aligning PEFT with MoE routing yields a more effective and scalable fine-tuning strategy.
[52] arXiv:2607.01795 [pdf, html, other]: Title: Single-Channel EEG-Based Cognitive Load Assessment in Online Learning: A Hybrid Deep Learning Approach

Rowan Hussein, Mohamed Ouf

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Monitoring cognitive load during online learning could help instructors identify content that learners find difficult, but remote settings remove the visual cues that support this judgement in a classroom. We study whether a single-channel, consumer-grade EEG device (the NeuroSky MindWave Mobile 2) can distinguish easy from difficult educational-video content, using the publicly available dataset of Wang et al. [24] (ten learners, one excluded for excessive noise, leaving nine). We implement a hybrid CNN+LSTM+Attention model that combines the raw waveform with band-power features. In a within-subject setting, the model reaches up to 78.5% accuracy, compared with 55% for conventional feature-based classifiers; regularization (dropout and L2) closes the large gap between training and validation accuracy that we observe without it, keeping validation accuracy stable at roughly 68-73%. We are deliberately cautious about these numbers: with only nine subjects, within-subject evaluation is optimistic, and we argue that subject-independent evaluation -- in which no learner appears in both training and test data -- should be the standard for this task. To that end we release a reproducible evaluation pipeline. We frame the work as a feasibility study rather than a deployable system, and pair it with an open, notebook-based tool that records EEG, runs inference, and visualizes estimated cognitive load as a heatmap over the video timeline to help educators locate potentially challenging segments.
[53] arXiv:2607.01799 [pdf, html, other]: Title: Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability

Rodrigo Mendoza-Smith

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Sparse autoencoders (SAEs) decompose internal activations of neural networks into sparse linear combinations of learned features by fitting an overcomplete dictionary $\mathbf{W}\in\mathbb{R}^{m\times n}$ with $m<n$, and inferring a sparse code $\mathbf{x}\in\mathbb{R}^n$ from $\mathbf{h}\approx\mathbf{W}\mathbf{x}$. This inference problem closely resembles the canonical setup of compressed sensing, but dense decoders requires $O(mn)$ learned values, which becomes costly at large feature counts. We introduce Expander SAEs: TopK SAEs whose decoder and tied encoder are supported on a left-$d$-regular expander mask with $d\ll m$, learning only $dn$ decoder values while keeping the sparse-coding problem $(m,n,k)$ fixed. The same structure reduces storage and turns the matching-pursuit correlation step $\mathbf{W}^\top \mathbf{r}$ in OMP into an $O(dn)$ gather-and-reduce operation. Our experiments show that across Pythia-70M/160M, Qwen2.5-3B, and Llama-3.2-1B residual-stream activations, varying $d$ traces a consistent storage--fidelity frontier, and that at the most compressed modern-LM setting, Qwen2.5-3B with $d=7$ uses $293\times$ fewer learned decoder values than the full dense decoder while retaining $84$% of dense CE-loss recovered. Control experiments show that the improved storage--fidelity tradeoff is driven by sparse, diverse decoder support structure rather than by fewer learned decoder values, and that when sparse and dense decoders are compared at matched parameter count, part of the remaining gap comes from encoder amortisation. On the theoretical side, we show that expansion and column flatness are sufficient for identifiability of noiseless $k$-sparse codes, and we derive complementary sufficient conditions under which OMP recovers the support exactly.
[54] arXiv:2607.01800 [pdf, html, other]: Title: Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis

Jiatong Li, Weida Wang, Changmeng Zheng, Shufei Zhang, Yatao Bian, Xiao-yong Wei, Qing Li

Comments: 21 pages

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Large Language Models (LLMs) have recently shown promise in molecular discovery, yet a gap remains between their probabilistic nature over discrete sequential tokens and the rigid topological constraints of chemical space. This raises the question of whether molecular LLMs can generalize beyond the local neighborhoods induced by their sequence-based representations. To systematically investigate this question, we introduce a Molecular Perturbation framework that generates syntax-valid structural variants of training molecules under controlled Graph Edit Distance (GED) to probe the manifold regularity of molecular LLMs. Our analysis shows that even a single edit can cause substantial performance drops on common molecular tasks, revealing a narrow local trust region and fragile sensitivity to structural changes. Since similar molecules tend to exhibit similar properties, In-Context Tuning (ICT), which anchors predictions on structurally similar molecules, offers a natural way to mitigate such fragility. Our experiments also examine whether ICT confers robustness under controlled structural perturbations, and the results suggest that it can partially expand the local trust region and offer a promising direction for stabilizing molecular LLMs against structural variation.
[55] arXiv:2607.01824 [pdf, html, other]: Title: Gaming Consensus: Coordinated Manipulation in Crowdsourced Fact-Checking

Nikil Roashan Selvam, Jay Baxter, Sophie Hilgard, Brad Miller, Keith Coleman, Ellen Vitercik, Sanmi Koyejo

Comments: ICML 2026

Subjects: Machine Learning (cs.LG)

Crowdsourced fact-checking systems have been adopted by major social media companies such as X, Meta, TikTok and Google with the aim of combating misleading information at scale without relying on centralized editorial control. These systems have been developed around a common underlying concept: a bridging mechanism that identifies notes flagging misleading information when they receive support from people with different perspectives rather than simple majority support. To our knowledge the only publicly disclosed bridging algorithms deployed for fact-checking are based on matrix factorization, as deployed by both X and Meta, augmented with additional components addressing abuse, targeted manipulation, and contributor brigades. This work examines the core matrix factorization portion of these systems, presenting theoretical and empirical evaluations of the degree to which coordinated users could vote strategically by leveraging the latent representations to fabricate the appearance of synthetic consensus within the bridging mechanism. Using historic production data, we find that up to 10.7% of lower quality notes could be manipulated above consensus thresholds using less than 10 ratings. We complement these findings with a theoretical analysis, revealing counterintuitively that rating a note as "Not Helpful" can increase its helpfulness score, as well as a cost model quantifying manipulation effort. We have developed and deployed mitigations within X's Community Notes algorithm to address synthetic consensus.
[56] arXiv:2607.01830 [pdf, html, other]: Title: Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling

Dazhi Fu, Jiuding Yang, Yiwen Guo, Jicong Fan

Subjects: Machine Learning (cs.LG)

Reliable reward and preference signals are critical for evaluating and optimizing large language models on open-ended tasks. Rubric-based judges offer a transparent way to decompose such judgments into explicit evaluation criteria, but existing annotation-free rubric generators typically rely on a single generic evaluator. As a result, they may overlook important dimensions of human preference, a failure mode we term dimensional blind spots. To address this limitation, we propose Multi-Role Rubric Generation (MRRG), a training-free and reference-free framework that elicits evaluation criteria from multiple complementary roles and consolidates them into an auditable rubric-based scorer. This scorer can be used both to validate pairwise preferences and to provide rewards for GRPO-style Reinforcement Learning with Verifiable Rewards (RLVR). Experiments on preference validation benchmarks show that MRRG consistently outperforms single-role rubric generation baselines across multiple backbone models. Further RLVR experiments demonstrate that MRRG yields a stronger reward signal for improving open-ended generation.
[57] arXiv:2607.01838 [pdf, html, other]: Title: Adaptive Group-Based Counterfactual Explanations for Time-Series Rehabilitation Data

Emmanuel C. Chukwu, Rianne M. Schouten, Monique Tabak, Mykola Pechenizkiy

Comments: To be published at IEEE CBMS 2026

Subjects: Machine Learning (cs.LG)

Counterfactual explanations (CEs) for multivariate time-series classifiers are often difficult to interpret in domains where experts reason in terms of semantic feature groups rather than individual channels. In rehabilitation movement analysis with multi-sensor inertial measurement units (IMUs), clinicians interpret motion through muscle-group and joint-segment abstractions; yet, most existing counterfactual methods operate at the channel level, producing scattered and biomechanically incoherent explanations. We propose a two-stage framework for group-based counterfactual generation in high-dimensional IMU data. We first show that Shapley-Adaptive (SA) group ranking preserves counterfactual validity but fails to enforce group-level sparsity, motivating the need for explicit group selection. We then introduce Learnable Gate (LG) methods, which incorporate trainable per-group relevance gates jointly optimized with perturbation masks. Experiments on the KneE-PAD rehabilitation dataset demonstrate that LG substantially improves modality-group sparsity compared to the channel-level M-CELS baseline while maintaining or improving validity, temporal smoothness, and generation efficiency. Exercise-specific analyses further show that group-structured counterfactuals yield concise, muscle-level corrective guidance aligned with clinical reasoning. Overall, the proposed framework enhances interpretability without sacrificing counterfactual quality, enabling more actionable explanations for rehabilitation movement analysis.
[58] arXiv:2607.01849 [pdf, other]: Title: Decomposer: Learning to Decompile Symbolic Music to Programs

Yewon Kim, Apurva Gandhi, David Chung, Graham Neubig, Chris Donahue

Comments: Project page: this https URL

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)

Musical performance involves executing a set of high-level musical instructions, yet recovering those instructions from the performance is a challenging inverse problem. We present Decomposer, a post-training framework for symbolic music decompilation: the task of recovering executable, editable music programs from symbolic music. We instantiate the task as MIDI-to-Strudel decompilation, where the model takes symbolic MIDI as input and produces a program in Strudel, a music programming language, that reconstructs the input when executed. The task poses two challenges: Strudel is a low-resource language with little naturally paired MIDI-code data, and optimizing faithful reconstruction of MIDI alone can collapse to unreadable note-by-note transliteration. We address these challenges in two stages. First, we construct Strudel-Synth, a synthetic corpus of paired Strudel programs and rendered MIDI, and use it for supervised fine-tuning. Second, we refine the model with reinforcement learning on unpaired MIDI, optimizing rewards for both MIDI reconstruction faithfulness and code readability. Our evaluation across synthetic and real-world MIDI benchmarks shows that Decomposer achieves substantially higher MIDI reconstruction faithfulness than closed-source LLMs while producing more readable and diverse code than the heuristic converter.
[59] arXiv:2607.01880 [pdf, html, other]: Title: Learning the Supports for Categorical Critic in Reinforcement Learning

Jen-Yen Chang, Takayuki Osa, Tatsuya Harada

Comments: Accepted to RLC 2026

Subjects: Machine Learning (cs.LG)

Value functions are an essential component in actor-critic based deep reinforcement learning (RL). Conventionally, these functions are trained as a regression task by minimising the mean squared error (MSE) relative to bootstrapped target values. Meanwhile, in distributional RL, a distribution of returns is modelled based on the distributional Bellman operator. This work investigates the Gaussian Histogram Loss (HL-Gauss), a recent approach that reframes value estimation as classification by encoding each scalar Bellman target as a Gaussian-smoothed categorical target. Despite its potential, applying histogram-based losses to RL presents inherent challenges, most notably the requirement to pre-define a fixed support interval, which is often complicated by the non-stationary and stochastic nature of target values typically found in RL tasks. In this work, we propose an approach that dynamically learns the lower and upper bounds of the support instead of assigning them beforehand. We derive an objective that jointly learns these bounds whilst learning the categorical representation of the scalar values, and we show that this objective forms an upper bound on the mean-squared Bellman error. Our theoretical analysis further shows that this bound is tighter than that of non-learned supports of HL-Gauss. Empirically, the proposed objective enables stable adaptation of the support interval and matches HL-Gauss-based actor-critic algorithms on most continuous-control tasks whilst improving on a subset, without requiring a pre-specified support interval.
[60] arXiv:2607.01895 [pdf, other]: Title: Regularized Variational and Spectral Log-Density-Ratio Estimation in the Gaussian Location Model

Francis Bach (SIERRA)

Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)

We study ridge-regularized log-density-ratio estimation in the Gaussian location model with a common covariance matrix. By affine invariance, the model is written as q $\sim$ N(0, I), p $\sim$ N($\Delta$, I), with linear features, where $\Delta$ is a mean vector. The variational estimator is the empirical Kullback-Leibler (KL) log-normalized fit with a squared L2-penalty on its nonconstant coefficient, and the spectral estimator recently introduced in [1] replaces a single variational problem by a continuum of ridge-regularized least-squares problems. We derive high-dimensional deterministic asymptotic equivalents when the numbers of observations and dimension tend to infinity with fixed ratios. The regularized variational limit is characterized by a scalar entropy minimization problem derived from the convex-Gaussian-min-max theorem (CGMT), while the regularized spectral limit follows from deterministic equivalents for resolvents of weighted sums of two independent Gaussian sample covariance matrices. We use these formulas to compare population risks, with experiments focused on fixed-signal aspect-ratio sweeps and optimized regularization. Our conclusion is that with many observations, under the criteria and asymptotic regimes analyzed here, the well-specified variational estimator has the smaller risk, while with fewer observations, the spectral estimator is favored because its covariance-based construction has lower variance. We also study how a nuclear penalty can be used and partially analyzed to perform feature learning.
[61] arXiv:2607.01897 [pdf, html, other]: Title: Rank-Then-Act: Reward-Free Control from Frame-Order Progress

Yuriy Maksyuta, George Bredis, Ruslan Rakhimov, Daniil Gavrilov

Comments: 20 pages, 15 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

We introduce Rank-Then-Act (RTA), a framework for learning control policies from expert video demonstrations without environment rewards. RTA trains a Vision-Language Model (VLM) offline as a progress-based ordinal scorer, using a Group Relative Policy Optimization (GRPO) objective over shuffled frame sequences, which forces the model to recover temporal ordering from visual semantics rather than trivial time cues. Importantly, instead of using the scorer directly as a scalar reward model, we propose a correlation-based reward function for reinforcement learning: at each interaction window, we compute the Spearman rank correlation between predicted progress rankings and true temporal indices, yielding a bounded, scale-invariant learning signal. This design decouples reward learning from absolute calibration and enables stable transfer across tasks and environments. We evaluate RTA on discrete control benchmarks (PyBoy: Catrap, Kirby) and continuous control tasks (PointMaze, MetaWorld). RTA consistently matches or outperforms prior video-based reward learning methods and rank-based baselines, while demonstrating strong cross-task reuse of a single pretrained progress scorer. Our results suggest that correlation-structured supervision over video-derived ordinal signals is sufficient for policy learning, offering a scalable alternative to explicit reward design.
[62] arXiv:2607.01901 [pdf, html, other]: Title: SABER: A Semantic-Aligned Brain Network Analysis Framework via Multi-scale Hypergraphs

Yidan Xu, Xiangmin Han, Rundong Xue, Huihui Ye

Comments: Accepted to IEEE International Conference on Multimedia and Expo (ICME) 2026;

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Effective brain disease diagnosis requires the synergy of brain connectivity patterns and high-level semantic knowledge. Existing methods, however, largely treat semantics from large language models (LLMs) as auxiliary features or supervision, limiting their direct role in decision-making and constraining classification stability and robustness. To overcome this, we propose a semantic-aligned brain network framework that actively integrates LLM-derived semantics into the prediction process. Specifically, ROI-level semantics are first incorporated via global self-attention to enrich node representations and provide whole-brain context. Multi-scale hypergraphs are then constructed to explicitly model functional subnetworks and multi-ROI interactions, addressing the locality limitations of traditional GNNs and capturing high-order dependencies. Finally, a decision-level semantic alignment mechanism selectively injects patient-specific textual embeddings into graph representations, enabling semantics to directly guide predictions without perturbing the underlying network structure. Experiments on public brain network datasets ABIDE and ADHD-200 demonstrate state-of-the-art performance, enhanced stability, and improved interpretability, particularly in small-sample settings.
[63] arXiv:2607.01907 [pdf, html, other]: Title: Population-Based Multi-Objective Training of Discriminators for Semi-Supervised GANs

Francisco Sedeño, Francisco Chicano, Jamal Toutouh

Comments: The 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Semi-supervised generative adversarial networks (SSL-GANs) can exploit large unlabeled datasets while retaining a classifier in the discriminator, but their training is often unstable. This paper proposes a population-based evolutionary training strategy in which discriminator learning is formulated as a multi-objective optimization problem. Instead of aggregating the supervised and unsupervised components of the SSL objective into a single scalar loss, the method maintains a population of discriminators ranked by Pareto dominance, enabling the exploration of different trade-offs between classification accuracy and real/fake discrimination. This formulation aims to improve both roles of SSL-GANs: learning accurate classifiers and training generators capable of producing realistic samples. We analyze several variants, including an elitist strategy and a mono-objective ablation, to assess the role of multi-objective selection. Experiments on MNIST with limited labels show improved training robustness compared to SSL-GAN and CE-SSL-GAN state-of-the-art baselines, while the elitist variant consistently achieves the highest classification accuracy.
[64] arXiv:2607.01918 [pdf, html, other]: Title: Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis

Yisong Fu, Zezhi Shao, Chengqing Yu, Yujie Li, Yongjun Xu, Xueqi Cheng, Fei Wang

Comments: Accepted by ICML 2026

Subjects: Machine Learning (cs.LG)

We present Zeus, a unified tuning-free Time Series Foundation Model (TSFM) that delivers superior performance across diverse analysis tasks without any task-specific fine-tuning. Unlike prior studies that primarily focus on zero-shot forecasting but require task-specific tuning for other tasks, Zeus bridges this gap by addressing two fundamental challenges in multi-task generalization. First, to reconcile point-level granularity with long-sequence scalability, Zeus incorporates a multi-scale Transformer featuring point-wise tokenization and a U-shaped hierarchy, effectively balancing fine-grained fidelity with computational efficiency. Second, to accommodate varying inductive biases across different tasks, Zeus introduces Multi-Objective Temporal Masking (MOTM), a unified strategy that supports heterogeneous tasks (e.g., extrapolation, interpolation, and global abstraction) within a single framework. Extensive experiments across five representative tasks demonstrate that Zeus consistently achieves competitive results in tuning-free settings, underscoring its potential as a general-purpose TSFM.
[65] arXiv:2607.01940 [pdf, html, other]: Title: Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits

Zhiren Gong, Zihao Zeng, Chau Yuen, Wei Yang Bryan Lim

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Mechanistic interpretability often relies on component-level interventions to discover how a model produces a behavior. This guides attribution, capability knockout, and model pruning downstream to operate by scoring each unit by the effect of ablation in isolation. Such first-order scoring is natural when component importance is additive, but becomes misleading when a transformer self-repairs: after a primary component is removed, a dormant backup can take over, muting the primary's measured effect while the backup itself appears irrelevant on the intact model. We recast this failure as a recovery task, conditional circuit completion, and introduce Conditional Co-Ablation (CoAx), a label-free, output-grounded score that asks how much each remaining unit's ablation effect grows once a primary set has been removed. This conditional growth exposes the second-order interaction that single-unit scores discard. On the GPT-2-small IOI circuit, CoAx raises backup-head recovery from 0.33 to 0.91 ROC-AUC, outperforming all baselines, including self-repair-aware gradient scores (best 0.82); counterfactual patching verifies that the recovered heads causally carry the repair. The same label-free procedure transfers to induction across eight models. Beyond discovery, the recovered backups correct self-repair-masked attribution, identify the components required for capability knockout, and yield repair-aware structured pruning scaling from 124M to 7B. Component importance is therefore not merely an isolated-unit property: in robust circuits, the components that matter can become visible only under the interventions that make them necessary.
[66] arXiv:2607.01943 [pdf, html, other]: Title: Hybrid quantum-classical neural network for sentiment analysis

Giacomo Cappiello, Filippo Caruso, Xing Liang, Dimitrios Makris

Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Quantum machine learning has recently emerged as a promising paradigm that leverages the expressive power of quantum circuits to address complex learning tasks. In this work, we investigate the applicability of hybrid quantum-classical neural networks to sentiment analysis, a central problem in natural language processing. We focus on a dataset of tweets related to COVID-19, where the textual content is vectorized using TF-IDF and fed into both classical feedforward networks and hybrid architectures incorporating parameterized quantum circuits. Our results show that hybrid models can achieve accuracy comparable to the classical baseline, while exhibiting distinct learning dynamics, especially in terms of validation loss and accuracy, that suggest a richer representational capacity. Moreover, when applying transfer learning to an SMS spam classification task, the hybrid models consistently outperform the classical counterpart, achieving an accuracy increase of 15 percentage points (from 66% to 81%) on the spam class, demonstrating enhanced generalization. These findings highlight the feasibility of employing QML for natural language processing and point toward the potential advantages of hybrid models as quantum hardware continues to advance.
[67] arXiv:2607.01958 [pdf, html, other]: Title: A More Accurate Algorithm Comparison through A/B Testing using Offline Evaluation Methods

Koki Konishi, Masataka Ushiku, Yuta Saito

Comments: 12 pages, 8 figures, accepted to KDD 2026

Subjects: Machine Learning (cs.LG)

A/B testing is the gold standard for selecting the better algorithm in online services. While offline evaluation has attracted attention as a safer alternative due to the high experimental costs and the potential risk of degrading user experience and revenue in A/B testing, it is widely recognized that the estimation accuracy of offline evaluation is substantially lower. As a result, final selection decisions are typically made through A/B testing. Contrary to this conventional view, we reveal a counterintuitive phenomenon in which A/B testing can produce a higher algorithm selection error rate than offline evaluation. This occurs because the sample mean estimator used in A/B testing does not induce positive correlation, which is crucial for reducing critical selection errors, namely underestimating the truly superior algorithm and overestimating the truly inferior one. In contrast, offline evaluation methods unintentionally generate this beneficial correlation by relying on shared offline data when estimating and comparing the performance of multiple algorithms. Building on this insight, we propose an estimator that intentionally induces positive correlation to improve algorithm selection in A/B testing. The key idea is to introduce a hypothetical middle algorithm and to estimate the performance difference between algorithms A, M, and B in a stepwise manner using shared data at each step. This approach enables the application of offline evaluation techniques in each step, thereby inducing positive correlation and reducing critical selection errors. Furthermore, we derive the optimal middle algorithm regarding the resulting variance and analyze its advantages over existing methods through bias-variance analysis. Experiments on real-world data demonstrate that our estimator achieves the same selection error rate as existing approaches while using only one half of the A/B testing data.
[68] arXiv:2607.01966 [pdf, html, other]: Title: Probabilistic Low-Voltage Peak Load Forecasting with Time Series Foundation Models Evaluated on Application-Oriented Metrics

Benedikt Kaas, Manuel Treutlein, Hannes Benedikt Gerber, Oliver Neumann, Cheewan Phatthanakhuha, Oliver Resch, Ralf Mikut, Veit Hagenmeyer

Comments: A poster abstract of this publication will be available at the 15th DACH+ Conference on Energy Informatics (2026 in Linz, Austria)

Subjects: Machine Learning (cs.LG)

Low-voltage load forecasting is an important component in current and future energy systems with a high degree of electrification and decentralized generation. However, current forecasting methods require significant manual effort, often lack uncertainty estimation and proper peak prediction, and they are often not adequately evaluated in terms of grid requirements. In the present study, we provide an extensive evaluation of short-term net load forecasts of 200 real-world low-voltage feeders with a focus on the rapidly evolving time series foundation models. Our study compares Chronos-Bolt, Chronos-2 and TabPFN-TS to six baseline models and demonstrates superior performance, in particular for Chronos-2. An ablation study, in which weather covariates are omitted, shows that time series foundation models adapt to increased uncertainty, despite the importance of weather information. A novel application-oriented metric links the model's forecasting capabilities in peak prediction to the trade-off in grid asset planning and operation between cost reduction and minimizing the risk of failure.
[69] arXiv:2607.01984 [pdf, html, other]: Title: Do Newer Lightweight CNNs Perform Better Under Resource Constraints? A Controlled Multigenerational Study of Architecture, Initialization, Training Budget, and Efficiency

Tasnim Shahriar

Comments: 19 pages, 8 figure, 13 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Newer lightweight convolutional neural networks are often presented as improving predictive performance and deployment efficiency, but such claims require controlled evaluation. This study compares nine lightweight CNN model packages across CIFAR-10, CIFAR-100, and Tiny ImageNet under a shared downstream protocol. We report top-1 accuracy, macro F1, top-5 accuracy, parameter count, FP32 storage, GMACs, batch-size-1 latency on an NVIDIA L4 and AMD Ryzen 5 5500U CPU, peak PyTorch CUDA allocated tensor memory, and point estimate Pareto frontiers. EfficientNetV2-S achieves the highest observed top-1 accuracy on CIFAR-10 and CIFAR-100 at 97.57% and 86.98%, while RepViT-M1.0 leads Tiny ImageNet at 79.87%. EfficientNet-B0 remains within 0.22, 0.85, and 1.79 percentage points of the best result on the three datasets while using approximately 79% fewer parameters and 86% fewer GMACs than EfficientNetV2-S. It also appears on every evaluated accuracy and resource Pareto frontier, making it the most consistently competitive intermediate-budget option. MobileNetV3-Small has the lowest GMAC count, is the fastest model under both CPU thread settings, and records higher observed accuracy than MobileNetV4-Conv-S on all three datasets. Under random initialization, it leads MobileNetV4-Conv-S by 2.55, 1.76, and 0.99 points, with paired test-set intervals excluding zero for the fixed trained models. EfficientNet-B0 remains 3.29, 10.10, and 17.54 points below its pretrained counterpart after 100 epochs of scratch training, despite requiring about five times the recorded training time. SqueezeNet1.1 has the fewest parameters and lowest peak CUDA allocation, but substantially weaker accuracy. Latency rankings differ sharply between the L4 and CPU environments, showing that GMACs alone do not predict measured inference performance. Overall, newer designs provide selective rather than universal gains
[70] arXiv:2607.01986 [pdf, html, other]: Title: Liquid Latent State Dynamics for Interpretable Turbofan Degradation Modeling

Weizhi Nie, Weijie Wang, Yuting Su

Comments: Preprint. 37 references, 8 figures

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Multivariate time-series models for prognostics are often evaluated by point prediction accuracy, yet their internal states rarely expose a coherent degradation process. We study liquid neural networks as latent dynamics models for aircraft engine health monitoring on the C-MAPSS benchmark. The proposed model encodes a history window into a latent state, evolves that state with a liquid transition model, and decodes future sensor observations. To separate health evolution from operating-condition variation, the latent state is factorized into degradation and condition components. Remaining useful life, monotonic risk, and latent-consistency losses supervise the degradation component, while condition prediction and decorrelation losses discourage operating-condition leakage. Across FD001--FD004, the full disentangled model improves overall sensor forecasting RMSE from 0.2438 for a GRU baseline to 0.2266, with the largest gains on the multi-condition subsets FD002 and FD004. The learned degradation state also forms a clearer temporal degradation axis, reaching an average state-speed Spearman correlation of 0.5960. Direct remaining-useful-life regression remains stronger for the GRU baseline, indicating that the proposed representation is currently more effective as an interpretable world model for degradation dynamics than as a calibrated lifetime regressor. These results suggest that liquid latent dynamics can bridge predictive maintenance forecasting and inspectable health-state modeling.
[71] arXiv:2607.02046 [pdf, html, other]: Title: Fast and Accurate Anomaly Detection in Time Series

Emanuele Mele, Massimo Cafaro, Angelo Coluccia, Italo Epicoco

Subjects: Machine Learning (cs.LG)

Anomaly detection is a critical and evolving field in Machine Learning, with applications targeting different domains such as cybersecurity, finance, healthcare, manufacturing and IoT (Internet of Things) systems. Traditionally, anomaly detection algorithms have been designed using both supervised and unsupervised learning paradigms. The fundamental challenge in real-world anomaly detection scenarios is related to the inherent class imbalance (anomalies are typically rare) and, for supervised methods, to the scarcity of labelled anomalous data. Indeed, labelling is both expensive and time-consuming. Conversely unsupervised methods do not require labelling, but may suffer from high false positive rates when deployed in safety-critical applications. In this work we introduce a novel unsupervised algorithm for anomaly detection in time series based on the Haar discrete wavelet and a suitably designed $t$-test. We establish the theoretical foundation of the proposed $t$-test and, through extensive experimentation across 343 datasets, demonstrate that our algorithm outperforms state-of-the-art unsupervised and self-supervised benchmarks.
[72] arXiv:2607.02050 [pdf, html, other]: Title: A Memory Efficient Unified Algorithm for Online Learning of Linear Dynamical Systems

Yuval Ran-Milo, Angelos Assos, Elad Hazan

Comments: 34 pages, 1 figure

Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Motivated by the challenge of stabilizing a general unknown linear dynamical system (LDS) from observations, we study the natural prerequisite of online prediction. Our goal is to achieve sublinear regret with a memory footprint that adapts to the intrinsic complexity of the dynamics rather than the full hidden -- state dimension. We focus on the practically central regime of systems with low instability complexity -- eigenvalues outside the real stable interval that do not decay rapidly, together with non-semisimple modes-potentially embedded in an otherwise stable real spectrum of much higher dimension; we write $k$ for this count. This regime is the primary setting in which stabilization is plausible: we show that many systems with high instability complexity cannot be stabilized without exponentially large controls. Thus, prediction is meaningful for stabilization precisely when the instability complexity is small. Within this regime, we introduce a unified online algorithm that handles every LDS (including non-diagonalizable systems with complex or exploding modes) with a learnable parameter count of $\widetilde{O}(k)$. Finally, we prove a lower bound showing that $k$ is a valid complexity measure: any filter-based predictor needs at least $k$ filters. Experiments corroborate our theory: on a high-dimensional system, our predictor sharply outperforms prior methods at an equal parameter budget.
[73] arXiv:2607.02055 [pdf, html, other]: Title: Beyond the Performance Illusion: Structure-Aware Stratified Partitioning and Curriculum Distributionally Robust Optimization for Spatially Correlated Domains

Prathamesh Patil, Arpit Jain, Aswanth Krishnan

Comments: 11 pages, 6 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Performance evaluation in AI systems commonly assumes that random dataset splits produce independent and identically distributed (i.i.d.) subsets. We show that this assumption often breaks down in spatiotemporally correlated domains such as aerial surveillance, precision agriculture, and medical imaging, leading to two systematic failures: data leakage, where correlated samples span training and validation splits and inflate performance estimates, and hidden stratification, where errors on minority subpopulations are obscured by aggregate metrics. To address these issues, we propose a unified evaluation and training framework for spatially correlated data. We introduce Structure-Aware Stratified Partitioning (SASP), which constructs validation splits that reduce spatiotemporal leakage while preserving meaningful class balance, and Curriculum Distributionally Robust Optimization (CDRO), a curriculum-based relaxation of distributionally robust training that stabilizes optimization under these stricter splits. Across multiple benchmarks, this combination yields consistently improved generalization, more reliable confidence calibration, and exposes failure modes that remain hidden under conventional random-split evaluation.
[74] arXiv:2607.02063 [pdf, html, other]: Title: SA-HGNN: Sample-Adaptive Hyperbolic Graph Neural Network for EEG-Based Depression Recognition

Yang Li, Pan Hu, Yan Zhang, Wenfan Yang, Tao Wu, Lianbo Guo

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Graph Neural Networks (GNNs) have been widely used to capture spatial functional connectivity patterns to improve electroencephalography (EEG)-based depression recognition performance. However, the functional connectivity of brain networks in patients with depression exhibits an inherent hierarchical structure, making it difficult to capture accurate connection patterns. To address these issues, this paper proposes a novel model named Sample-Adaptive Hyperbolic Graph Neural Network (SA-HGNN), which aims to accurately extract the authentic hierarchical structure of depression-affected brain networks. Specifically, the proposed model comprises three core modules. First, a Sample-Adaptive Graph Construction module dynamically constructs personalized brain network topologies to capture more complex spatial relationships within the brain network. Second, hyperbolic graph convolution is employed to overcome the representation bottlenecks of Euclidean space, leveraging hyperbolic geometry to precisely capture latent hierarchical relationships within the brain network. Finally, an Attention Pooling module adaptively filters out highly redundant noise channels in EEG signals, effectively mitigating the interference of inherent noise on the authentic hierarchical topology. Extensive experiments on public EEG datasets demonstrate the superior performance of our method across resting-state and task-related paradigms, validating its robustness to noise and efficacy in capturing abnormal functional connectivity patterns in brain networks of patients with depression.
[75] arXiv:2607.02072 [pdf, html, other]: Title: kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail

Mahmoud Abdelfattah, Hamid Nasiri, Peter Garraghan

Comments: 17 pages, 11 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Large language models (LLMs) are increasingly deployed in domains requiring guardrails to detect unsafe, off-topic, or adversarial prompts. Existing guardrails predominately rely on fine-tuning to build classifiers, which often suffer from low generalization and high inference latency. We present kNNGuard, a training-free guardrail that utilizes the activation space of an off-the-shelf LLM. Given a small bank of 50 safe and unsafe prompts, kNNGuard extracts hidden activations and performs multi-layer kNN fusing activation-space and embedding-space scores for classification. Across six domains spanning topical and security prompts, kNNGuard achieves competitive or superior F1 compared to fine-tuned state-of-the-art guardrails while running 2.7x faster than the best comparable guardrail, and 10x faster than a fine-tuned safety classifier without gradient updates or fine-tuning. Domain adaptation requires only updating the labeled bank, which can be constructed in under 10 seconds and several orders of magnitude faster than established guardrails. We also analyze the impact of system prompts, layer selection, and integration into production LLM pipelines as a configurable, low-latency guardrail.
[76] arXiv:2607.02088 [pdf, html, other]: Title: Fourier Neural Operators for Rayleigh-Bénard Convection

Chelsea Maria John, Thibaut Lunet, Sebastian Götschel, Andreas Herten, Stefan Kesselheim, Daniel Ruprecht

Comments: Accepted at Computational Science, ICCS 2026

Journal-ref: ICCS 2026, Lecture Notes in Computer Science, vol 16784. Springer, Cham

Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

We propose an improved Fourier Neural Operator (FNO) for modeling two-dimensional Rayleigh-Bénard convection by predicting time increments instead of full solutions, achieving higher accuracy than a standard FNO baseline. The resulting model is compact (314k parameters, 1.26 MB) and fast (7 ms inference), while maintaining similar accuracy as demonstrated in previous benchmarks. We show that although FNOs generalize to finer meshes, accuracy remains limited by the resolution of the training data.
[77] arXiv:2607.02104 [pdf, html, other]: Title: Ask the Right Comparison:Bias-Aware Bayesian Active Top-$k$ Ranking with LLM Judges

Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

Subjects: Machine Learning (cs.LG)

Large language models (LLMs) are increasingly used as cheap, scalable judges that compare candidate outputs pairwise -- to rank responses, select models, or triage papers. Yet LLM judges are both noisy and systematically biased: they favor verbose or well-formatted answers and exhibit position effects, so simply aggregating their votes recovers a ranking of presentation, not of true quality. We study the practical goal of identifying the \topk{} items under a fixed comparison budget, and make two contributions. First, we cast judging as Bayesian inference over latent quality with explicit, judge-specific bias covariates (verbosity, position), regularized by a shrinkage prior so that the data decide which biases a given judge actually exhibits. Second, we introduce a \topk-aware active acquisition rule that chooses the next comparison to maximally reduce uncertainty about \topk{} \emph{membership}, rather than about the full ranking. On a controlled benchmark with known ground-truth quality, judged by sixteen real LLMs spanning open and proprietary families (Llama, Qwen, Phi-4, GPT-4o-mini/5.1/5.5, Gemini, DeepSeek, and Claude Haiku/Sonnet/Opus), naive aggregation plateaus at a wrong \topk{} on biased judges regardless of budget, while our bias-aware model recovers it; \topk-aware acquisition reaches this ceiling with far fewer comparisons than round-robin or a global-uncertainty (D-optimal) rule. Bias is real but heterogeneous and capability-dependent: cheap and mid-tier judges carry a strong verbosity bias that our model corrects (lifting recall from $\sim$$0.5$--$0.6$ to $0.84$--$1.0$), whereas the frontier judges we tested show little bias and already rank accurately, so bias-aware modeling changes little there.
[78] arXiv:2607.02124 [pdf, other]: Title: Predictive Conformal Slip Monitoring: An Empirical Evaluation of Rolling Split Conformal Prediction for Pre-Incident Traction Loss Detection

Varshith Roy Kotla

Comments: 10 pages, 4 tables. codes and data available at:this https URL

Subjects: Machine Learning (cs.LG); Applications (stat.AP)

Conventional traction control architectures intervene only after the adhesion limit of a tire has already been breached. This paper investigates whether Rolling Split Conformal Prediction , monitoring the volatility of non-conformity residuals from a per-driver Random Forest model of expected slip behavior , can serve as a statistically grounded pre-incident warning signal, ahead of gross traction loss. Unlike an earlier internal draft of this work, the evaluation reported here corrects a confound in the slip proxy (vehicle speed is included as an explicit model feature, not left implicit in the target's denominator), uses every racing lap for each driver rather than only the fastest lap, and is scored against real, timestamped incident labels extracted from FIA Race Control Messages and track-limits lap deletions rather than narrated post-hoc. The result is negative: across 19 drivers and 55,563 test-phase telemetry samples, the rolling-volatility detector achieves a mean precision of essentially 0.0 and mean recall of 0.0 against 14 ground-truth incidents, while flagging on average 15.3% of all samples as anomalous , too high a false-alarm rate for any early-warning use. A static 95th-percentile threshold baseline performs no better in any way that would justify the added complexity of the conformal-volatility formulation. Residual autocorrelation diagnostics show the split-conformal exchangeability assumption is violated for every driver (Ljung-Box p < 0.001, n = 19/19), which is one plausible driver of the high false-alarm rate. We report this as a methodologically rigorous negative finding, diagnose its likely causes, and outline what a genuinely predictive version of this approach would require.
[79] arXiv:2607.02137 [pdf, html, other]: Title: ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning

Yilie Huang, Wenpin Tang, Xun Yu Zhou

Comments: 36 pages, 14 figures, 8 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)

We study timestep allocation for score-based diffusion sampling, where a learned reverse-time dynamics is discretized on a finite grid. Uniform and hand-crafted schedules are standard choices, but they rely on fixed prescriptions and can therefore be suboptimal. To address this limitation, we propose Adaptive Reparameterized Time (ART), a continuous-time control formulation that learns a time change by treating the speed of the sampling clock as the control, so that a uniform grid on the learned clock induces adaptive timesteps in the original diffusion time. Based on a leading-order Euler error surrogate, ART provides a principled objective for allocating timesteps along the sampling trajectory. To solve this deterministic control problem, we introduce ART-RL, an auxiliary randomized formulation with Gaussian policies that turns schedule learning into a continuous-time reinforcement learning problem. We prove that the randomized ART-RL formulation is equivalent to ART at the optimizer level, in the sense that its optimal Gaussian policy recovers the optimal ART time-warping rate through its mean. We further establish policy evaluation and policy improvement characterizations and derive trajectory-based moment identities that yield implementable actor--critic updates for learning the schedule. Across experiments ranging from controlled low-dimensional settings to image generation, ART-RL can be plugged into existing diffusion samplers by changing only the timestep grid, consistently improving sample quality over strong baseline schedules at matched budgets while leaving the rest of the sampling pipeline unchanged. The learned schedules also exhibit broad generalization, transferring without retraining across sampling budgets, datasets, solvers, pipelines, and representation spaces.
[80] arXiv:2607.02140 [pdf, other]: Title: Probing Chemical Language Models: Effects of Pre-training and Fine-tuning

Anna Karnysheva, Dietrich Klakow, Ji-Ung Lee

Subjects: Machine Learning (cs.LG)

Chemical language models (CLMs) are trained with linearized representations such as SMILES, yet it remains unclear which chemically meaningful substructures they encode. To foster a better understanding of CLMs, we conduct a systematic study and probe for 78 molecular substructures across eight pre-trained and six randomly initialized models. We furthermore study how fine-tuning on chemical downstream tasks affects the learned representations of molecular substructures. Our results show that pre-training generally improves molecular structure awareness of CLMs, particularly in the upper layers. Moreover, randomly initialized models already encode ring structures well in the first layer. Our analysis on two chemical downstream tasks further reveals that, interestingly, fine-tuning affects task-relevant molecular substructures more than others, indicating that the changes in the representations follow chemical theory.
[81] arXiv:2607.02142 [pdf, other]: Title: Predicting Early Stages Of Alzheimer's Disease And Identifying Key Biomarkers Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies

Debopriya Ghosh

Comments: Master's

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)

Alzheimers disease (AD) is a brain disorder that develops slowly and mainly affects memory, thinking, language, and daily activities. It is one of the most common causes of dementia and creates many difficulties for patients as well as their families. In the early stage, the symptoms are often mild and may look like normal ageing. For this reason, many people are diagnosed late, when the disease has already progressed. At present, there is no complete cure for AD. Still, early detection can help doctors manage the condition better and take suitable steps at the right time. In this study, a machine learning model is proposed to detect the early stages of Alzheimers disease using clinical details, neuropsychological test scores, and neuroimaging-related measures. The data used in this work is collected from the Alzheimers Disease Neuroimaging Initiative (ADNI). As the dataset has missing values, iterative imputation is applied to fill them. The dataset also has class imbalance, which is handled using Borderline SVM-SMOTE. After that, feature selection is carried out using wrapper-based and embedded methods so that only important features are used for training. The selected features are divided into training and testing sets, and feature scaling is applied. A stacking ensemble model is developed using Logistic Regression, Extra Trees, Bagging KNN, and LightGBM as base classifiers. Along with this, an artificial neural network is also trained on the same dataset. The performance of these models is compared using precision, recall, F1-score, and AUC-ROC. This study aims to find the best classifier and also identify important biomarkers that may help in the early diagnosis of Alzheimers disease.
[82] arXiv:2607.02166 [pdf, html, other]: Title: Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space

Di Wu, Huan Liu, Zhixiang Chi, Yuanhao Yu, Konstantinos N. Plataniotis, Yang Wang

Comments: Published in Transactions on Machine Learning Research (TMLR), 2026. 28 pages, 5 figures

Journal-ref: Transactions on Machine Learning Research, 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

The rapid advancements in using neural networks as implicit data representations have attracted significant interest in developing machine learning methods that analyze and process the weight spaces of other neural networks. However, efficiently handling these highdimensional weight spaces remains challenging. Existing methods often overlook the sequential nature of layer-by-layer processing in neural network inference. In this work, we propose a novel approach using dynamic graphs to represent neural network parameters, capturing the temporal dynamics of inference. Our Dynamic Neural Graph Encoder (DNG-Encoder) processes these graphs, preserving the sequential nature of neural processing. Additionally, we also leverage DNG-Encoder to develop INR2JLS (Implicit Neural Representation to Joint Latent Space) for facilitate downstream applications, such as classifying Implicit Neural Representations (INRs). Our approach demonstrates significant improvements across multiple tasks, surpassing the state-of-the-art INR classification accuracy by approximately 10% on the CIFAR-100-INR.
[83] arXiv:2607.02182 [pdf, html, other]: Title: Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation

Jijie Zhang, Zhe Ren, Quan Zhang, Dandan Guo

Comments: Preprint. 16 pages, 7 figures, 6 tables

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Large language models (LLMs) exhibit remarkable reasoning capabilities, but their task-specific fine-tuning is notoriously plagued by overconfidence, severely hindering trustworthy deployment. We propose Data-Adaptive Lower-Rank Adaptation (DALorRA), a simple and effective variational Bayesian sparse framework that shifts the paradigm of uncertainty quantification from the dense parameter space to the lightweight rank level of low-rank adaptation (LoRA). With the insight that LoRA essentially aggregates multiple rank-one components that may provide superfluous model capacity, DALorRA imposes stochastic masking on rank dimensions, enabling Bayesian regularization of model capacity during training and ensemble-like calibration during inference. Extensive experiments demonstrate DALorRA's excellent calibration of LLMs without compromising reasoning accuracy.
[84] arXiv:2607.02187 [pdf, html, other]: Title: Privacy-Preserving and Verifiable Approximate Distributed Coded Computing

Xavier Martínez-Luaña, Alba Gude-Santos, Manuel Fernández-Veiga, Rebeca P. Díaz-Redondo

Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Distributed machine learning enables collaborative model training without centralizing data, but it also exposes learning processes to privacy leakage and malicious manipulation. Existing defenses typically address these threats in isolation and are often tailored to specific learning paradigms or model architectures, limiting their applicability in realistic deployments. In particular, federated learning and decentralized learning exhibit distinct adversarial surfaces that are rarely addressed within a unified framework. In this paper, we present a model-agnostic framework for adversary-resistant distributed learning that jointly addresses privacy preservation and malicious behavior across both federated and decentralized settings. Our approach combines paradigm-specific defense mechanisms with GPBACC, a privacy-enhancing coded computing technique applicable to arbitrary machine learning models. For federated learning, we integrate robust aggregation strategies to mitigate the impact of malicious participants, while for decentralized learning we employ approximate decode-and-compare and group testing techniques to enable lightweight verification and adversary isolation without relying on a trusted aggregator. Crucially, we evaluate the proposed framework through an explicit, attack-driven analysis. We implement representative privacy attacks and malicious behaviors, and empirically demonstrate that the combination of GPBACC with robust aggregation and verification mechanisms significantly reduces privacy leakage and improves resilience against active adversaries. These results suggest that privacy-enhancing coded computing, when combined with appropriate adversary-resistance strategies, provides a practical and deployable foundation for secure distributed machine learning.
[85] arXiv:2607.02194 [pdf, other]: Title: An Optimisation Framework for the Well-Conditioned Training of Physics-Informed Neural Networks

Joseph Webb, Sadok Jerad, Coralia Cartis

Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)

Physics-informed neural networks (PINNs) have emerged as a promising route to solve partial differential equations, yet they have struggled to reach the precision of classical solvers. The obstacle is increasingly understood to be one of optimisation, owing to the severely ill-conditioned loss landscape. We present $\textbf{DSGNAR}$: Doubly-Sketched Gauss-Newton with Adaptive Ratio, a scalable second-order optimisation framework that confronts this ill-conditioning and, in doing so, obtains unprecedented accuracy and speed. $\textbf{DSGNAR}$ couples a doubly-sketched Gauss-Newton model with a novel strategy that carefully controls both regularisation and step length. Across a suite of problems spanning nonlinear, chaotic, multi-scale, high-dimensional, and Navier-Stokes, the framework greatly improves on the state of the art: able to attain relative $\ell_2$ errors as low as $3\times10^{-16}$ in double precision, improve contemporary results by five orders of magnitude on the canonical Burgers' equation, and as much as eight orders on a high-dimensional Poisson problem, while remaining markedly faster. We further show that, in single precision, solutions at the limit of round-off error can be obtained very quickly: Burgers' equation to $\ell_2^{\text{rel}} = 4.75 \times 10^{-7}$ in under ten seconds. The framework is also robust to the choice of architecture, arithmetic precision, and initial hyperparameters.
The code is available at this https URL
[86] arXiv:2607.02196 [pdf, html, other]: Title: Online Resource Allocation with Continuous Random Consumption: Regret under Degeneracy

Jiawei Zhang

Subjects: Machine Learning (cs.LG)

We study online resource allocation when both rewards and consumption sizes may be continuously distributed. Requests arrive sequentially and must be accepted or rejected irrevocably under fixed resource capacities. Each request belongs to one of finitely many observable types; conditional on an observable request type, both the reward and the scalar size are random, and the realized size scales a fixed type-specific resource-consumption vector. The model allows the deterministic fluid relaxation to be degenerate.
We show that additive regret is governed by the size-weighted mass of requests whose value-to-size ratios lie near the active acceptance cutoffs. We formalize this quantity through an active weighted-mass exponent p. When p > 1, this cutoff mass is thin, and the problem is genuinely hard: every online policy must incur regret of order at least $T^{1/2 - 1/(2p)}$, and this holds for every p > 1. A sample-path marginal policy matches this lower bound up to polylogarithmic factors; and when p = 1, so that the mass grows linearly near the cutoff, it attains $O((\log T)^2)$ regret. For example, if the size and the value-to-size ratio are independent and uniformly distributed, then p = 1; if instead the size and the reward are independent and uniformly distributed, then p = 2. Thus the policy achieves $o(\sqrt{T})$ regret throughout this regularity class without any fluid non-degeneracy assumption, allowing both primal degeneracy and dual non-uniqueness.
[87] arXiv:2607.02203 [pdf, html, other]: Title: Self-explainable Operator Learning for Discovering Spatial Patterns in Functional Data

Mojgan Alishiri, Amirhossein Arzani

Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Operator learning has emerged as a powerful tool for modeling complex physical systems in functional spaces. However, their neural network-based architectures make them opaque models, obscuring the reasoning behind their predictions. In this work, we introduce a self-explainable operator learning framework that overcomes this challenge by reformulating operator learning as a linear combination of generalized functional linear models expressed through integral equations. Exploiting the additive decomposability of these integral equations, we divide the input domain into subdomains and compute localized integrals to evaluate the contribution of each region to the final prediction. This decomposition enables direct interpretability where the model explains both inputs and outputs by linking specific input regions to corresponding output patterns, thereby revealing which spatial features drive predictions. We demonstrate the framework on function-to-scalar and function-to-function mappings in fluid flow problems involving blood flow and unsteady aerodynamics. The results show that the operator most often prioritizes regions with strong feature gradients, providing physically meaningful insight into the model's decision-making process. Comparisons with established post-hoc explainability methods demonstrate qualitative agreement while highlighting the key advantage of the proposed approach: explainability is embedded directly within the operator structure itself and does not require an external tool. Therefore, our framework provides a mathematically transparent and physically interpretable approach to uncover relationships within data, fostering trust in machine learning for scientific applications by enabling more informed data-driven analysis of physical systems.
[88] arXiv:2607.02266 [pdf, html, other]: Title: HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

Ziyun Qiao, Yue Min, Ruining Chen, Yujun Li

Comments: 19 pages, 5 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at one granularity; changing the resolution rebuilds the labels. We argue the bottleneck is the label system, not the mixer, and provide a hierarchical one. HERMES is a data-derived labeling substrate: a Learned Semantic Transform followed by 3-stage residual vector quantization annotates each document once into a coarse-to-fine code whose prefix length controls granularity up to approximately 130k cells. At coarse granularity HERMES sits at a plateau with KMeans-family methods on standard clustering metrics, so the contribution is the substrate, not the clusterer. On 1B-parameter, 25B-token pre-training, the hierarchy exposes an interaction fixed-granularity pipelines cannot test: at one prefix length, a combined Stage-2 rule contrast, equal-subbucket coverage versus size-proportional within-bucket quality top-30%, lifts a 16-task capability macro-average by +0.0253; at the next finer level, the same rule loses its measurable edge as candidate pools contract approximately 5x. HERMES reframes data mixture design from choosing among fixed label sets to navigating a reusable, data-derived granularity hierarchy.
[89] arXiv:2607.02288 [pdf, html, other]: Title: Generalization in offline RL: The structure is more important than the amount of pessimism

Max Weltevrede, Matthijs T.J. Spaan, Wendelin Böhmer

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

While pessimism counteracts overestimation bias in offline reinforcement learning (RL), being overly conservative has been associated with hindering certain forms of generalization. However, in this paper we demonstrate that being overly pessimistic does not inherently prevent optimal generalization in contextual MDPs (CMDPs). Instead, we argue successful generalization depends not on the amount of pessimism, but whether the pessimistic structure respects the underlying symmetries of the optimal solution. We prove that a mildly pessimistic, non-symmetric value function can generalize worse than an overly pessimistic, symmetric one. In offline RL, the structure of the pessimism is determined by the structure of the dataset coverage. As such, enforcing a symmetric value function can be non-trivial, and might require techniques such as data augmentation (DA). Inspired by our theoretical results, we argue that DA can best be applied through a consistency loss during policy extraction, rather than the common practice of (regular) offline training on an augmented dataset. This is empirically validated using IQL and CQL on a rotationally symmetric reacher environment.
[90] arXiv:2607.02291 [pdf, html, other]: Title: Optimizing Visual Generative Models via Distribution-wise Rewards

Ruihang Li, Mengde Xu, Shuyang Gu, Leigang Qu, Fuli Feng, Han Hu, Wenjie Wang

Comments: ICML 2026 Main

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.
[91] arXiv:2607.02292 [pdf, html, other]: Title: One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective

Juan Agustín Duque, Sergio García Heredia, Vinicius Hernandes, Eliška Greplová, Thomas Spriggs, Aaron Courville, Anna Dawid

Comments: 34 pages, 11 figures

Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Quantum Physics (quant-ph)

Neural quantum states (NQS) provide a flexible and scalable framework for approximating quantum many-body wavefunctions. Among NQS parameterizations, autoregressive models are especially attractive because they enable exact, independent sampling from the Born distribution, avoiding the autocorrelation and mixing issues of Markov chain methods. Yet their optimization remains comparatively underexplored: Adam is a scalable method but ignores function space geometry, while stochastic reconfiguration is principled but costly and numerically fragile in large models. To address this gap, we show that variational energy minimization can be viewed as an advantage policy-gradient problem over the Born distribution, motivating trust-region optimization for NQS training. We introduce Proximal Wavefunction Optimization (PWO), a principled trust-region algorithm that clips probability-ratio changes in the amplitude channel and phase increments in the phase channel. PWO avoids explicit matrix inversion, reuses samples across multiple updates, and combines the scalability of first-order optimization with theoretical guarantees. Across Ising and frustrated $J_1$-$J_2$ one- and two-dimensional spin systems, PWO improves stability and wall-clock convergence over Adam, minSR, and SPRING. Finally, we fine-tune a $1.5$B-parameter RWKV-7 model, demonstrating NQS optimization at a scale over three orders of magnitude beyond prior work.
[92] arXiv:2607.02344 [pdf, html, other]: Title: Self-Gating Attention for Efficient Time Series Forecasting

Dezheng Wang, Tong Chen, Wei Yuan, Congyan Chen, Shihua Li, Hongzhi Yin

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Transformer architectures have shown strong potential in time series forecasting, where multi-head self-attention is widely used to capture temporal dependencies across historical timestamps. However, standard self-attention has quadratic time and memory complexity with respect to the look-back length. This cost may limit its use in resource-constrained or high-throughput forecasting systems, where fast and memory-efficient inference is important. Through qualitative and quantitative analyses, we observe that self-attention maps in time series forecasting often contain redundant patterns across different timestamps. This phenomenon can be related to the repeated temporal patterns and relatively stable temporal correlations in many real-world time series. Motivated by this observation, we propose Self-Gating Attention (SGA), a plug-and-play attention mechanism that represents the attention score with a shared learnable matrix and an input-dependent residual component. The shared matrix captures common attention patterns, while the residual component captures input-dependent variations. In this way, SGA avoids the query and key projections used in standard attention score computation, leading to linear time and score-matrix memory complexity with respect to the look-back length. We integrate SGA into several forecasting backbones and compare it with standard self-attention and lightweight attention variants on nine publicly available real-world datasets covering electricity, finance, weather, medical monitoring, human activity, and climate records. The results show that SGA improves inference efficiency on public benchmarks while maintaining competitive forecasting performance against state-of-the-art attention mechanisms. These benchmark results provide deployment-oriented evidence.
[93] arXiv:2607.02390 [pdf, html, other]: Title: DecompRL: Solving Harder Problems by Learning Modular Code Generation

Juliette Decugis, Fabian Gloeckle, Francis Bach, Taco Cohen, Gabriel Synnaeve

Subjects: Machine Learning (cs.LG)

How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attempt accuracy at the expense of sample diversity. Both strategies ultimately fail when the base policy has near-zero probability of producing a correct solution: no amount of sampling or gradient signal can overcome a search space that is simply too large. We take a different approach: rather than sampling harder, we make the task easier by decomposing problems into smaller, independently solvable sub-functions whose implementations can be recombined. Since off-the-shelf models are not trained for this modular generation, we introduce DecompRL, an RL algorithm that explicitly learns to decompose and implement hierarchical code structures. Recombining $k$ implementations of $n$ modules yields up to $k^{n}$ candidate solutions, shifting the bottleneck from GPU inference to cheap CPU evaluation and cutting GPU token cost by $\sim$50$\times$. On LiveCodeBench and CodeContests (Qwen~2.5~7B, Code World Model~32B), DecompRL outperforms standard and diversity-optimized RL baselines beyond $10^5$ tokens per problem, solving problems that standard generation cannot reach.
[94] arXiv:2607.02423 [pdf, html, other]: Title: Neuron-Aware Active Few-Shot Learning for LLMs

Zhuowei Chen, Liwei Chen, Christian Schunn, Raquel Coelho, Xiang Lorraine Li

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Active Few-Shot Learning (AFSL) adapts LLMs to specialized domains by identifying the most valuable unlabeled samples for annotation and use as few-shot demonstrations, effectively reducing human annotation costs while promoting high performance. However, existing methods typically rely on output-level signals for sample identification, such as predictive entropy or semantic similarities with test-time data based on external embeddings, which often overlook models' internal dynamics, which could pinpoint specific knowledge gaps. To bridge this gap, we propose NeuFS, a Neuron-Aware Active Few-Shot Learning framework that shifts the selection paradigm from output-level proxies to models' internal dynamics. NeuFS utilizes neuron activation patterns to represent sample directly, and includes a dual-criteria selection strategy that: (1) ensures few-shot sample diversity with neuron patterns for broader example coverage, while (2) prioritizing on identifying informative and challenging few-shot samples LLMs tend to hallucinate by quantifying neuron consensus. Experiments on three datasets demonstrate that NeuFS excels in both reasoning and text classification tasks, outperforming existing AFSL baselines. Ablation studies further highlight that internal neuron activations provide a more principled and effective selection signal than external embeddings, validating the superiority of the proposed NeuFS.
[95] arXiv:2607.02426 [pdf, html, other]: Title: QFedAgent: Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition

Quoc Bao Phan, Tuy Tan Nguyen

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, making it suitable for privacy-sensitive robotic sensing applications. However, multi-agent systems generate heterogeneous and non-independent and identically distributed (non-IID) multimodal sensor streams that degrade conventional FL algorithms, while classical fusion modules introduce substantial parameter overhead and communication cost. This paper proposes QFedAgent, a hybrid quantum-classical personalized FL framework for multi-agent activity recognition. The approach integrates a variational quantum circuit fusion module that models accelerometer--gyroscope interactions through quantum state encoding and entanglement, requiring only 72 quantum rotation parameters versus 33K in classical multi-layer perceptron-based fusion, achieving approximately 10x total parameter reduction. Experiments on the OPPORTUNITY dataset under subject-based non-IID partitions demonstrate 97.7% mean test accuracy, confirming that parameter-efficient quantum fusion remains competitive with conventional federated baselines.
[96] arXiv:2607.02437 [pdf, html, other]: Title: Extreme Adaptive Transformer for Time Series Forecasting

Sanjeev Shrestha, Hui Liu, Yifan Zhang

Comments: Submitted to Scientific Reports

Subjects: Machine Learning (cs.LG)

Time series forecasting remains challenging when the underlying data contain rare but critical extreme events. This issue is particularly important in hydrologic forecasting, where streamflow distributions are often highly skewed and extreme peaks can have substantial impacts on flood monitoring, water resource management, and early warning systems. Although Transformer-based forecasting models have achieved strong performance by modeling long-range temporal dependencies, they typically treat all time points uniformly and may therefore underrepresent rare extreme patterns. In this paper, we propose the Extreme-Adaptive Transformer (Exformer), a forecasting framework designed to explicitly model temporal dependencies involving both normal and extreme events. Exformer introduces an extreme-adaptive attention mechanism composed of three sparse components: Local, Stride, and Extreme. The Local and Stride components capture short-term and periodic temporal dependencies, respectively, while the Extreme component selectively models event-aware dependencies between normal and extreme streamflow patterns. Experiments on four real-world hydrologic streamflow datasets show that Exformer achieves superior 3-day forecasting performance compared with state-of-the-art baselines. Our findings demonstrate that explicitly incorporating extreme-aware attention improves the forecasting capacity of Transformer models on imbalanced time series with rare but consequential events.
[97] arXiv:2607.02447 [pdf, html, other]: Title: Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

Xuanyu Chen, Nan Yang, Shuai Wang, Dong Yuan

Comments: Accepted at ICLR2026

Subjects: Machine Learning (cs.LG)

Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous data than Contrastive Learning (CL), and that the robustness of decentralized SSL increases with average network connectivity, implying that federated learning (FL) is no less robust than decentralized learning (DecL). These findings provide a solid theoretical foundation for guiding the design of future D-SSL algorithms. To further illustrate the practical implications of our theory, we introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization. Extensive experiments across model architectures and distributed settings validate our theoretical insights, and additionally confirm the effectiveness of MAR loss as an application of our analysis.
[98] arXiv:2607.02460 [pdf, html, other]: Title: Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

Zhuowei Chen, Xiang Lorraine Li

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Post-training large language models (LLMs) without real-world interaction feedback or human-labeled supervision remains challenging, particularly in specialized domains where expert annotations are costly to obtain. Recent annotation-free self-evolution methods address this by using the model's own outputs as supervision signals, constructing a teacher via additional context and aggregating predictions across multiple rollouts through majority voting to produce pseudo-labels. However, these approaches are not without drawbacks: SFT- and GRPO-based variants suffer out-of-domain performance degradation, while reward-based on-policy RL inflates calibration error. In this paper, we propose Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for annotation-free self-distillation that leverages internal neuron activations to guide both training-data selection and teacher context construction. The model is then trained via on-policy distillation from the teacher distribution, requiring no ground-truth labels at any stage. Across specialized-domain benchmarks, Neuron-OPSD improves in-domain task performance while preserving cross-domain generalization and mitigating calibration collapse over prior annotation-free baselines. This framework is particularly relevant to settings where online interaction or external supervision is costly or infeasible, and is conceptually distinct from offline RL approaches that rely on logged, reward-labeled trajectories.
[99] arXiv:2607.02499 [pdf, html, other]: Title: Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials

Gil Harari, Yoel Zimmermann, Ola Tangen Kulseng, Laura Zichi, Chuin Wei Tan, Marc L. Descoteaux, Boris Kozinsky

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)

Machine learning interatomic potentials (MLIPs) have become a hallmark of AI for scientific simulation. While efforts on new architectures and datasets have led to increasingly accurate and general models, the choice of optimizer for training has largely remained unexplored, defaulting to Adam and its variants in the community. Here, we implement and systematically compare a class of recently proposed matrix-structured optimizers, including Muon, SOAP, and the hybrid SOAP-Muon, for training NequIP and Allegro MLIP models. We find that these optimizers can substantially outperform Adam in both convergence speed and final accuracy. SOAP and SOAP-Muon emerge as robust and consistently strong methods, while Muon only provides partial gains relative to Adam. The improvements are particularly pronounced under partial force supervision. Our results indicate that optimizer choice is an overlooked yet impactful design axis for MLIPs.
[100] arXiv:2607.02502 [pdf, html, other]: Title: DemoPSD: Disagreement-Modulated Policy Self-Distillation

Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Shuang Qiu, Linqi Song

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: *privileged information leakage*, where the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce **DemoPSD**, a novel framework that resolves such problems through the idea of *selective adoption of teacher guidance*. Instead of fitting the full teacher distribution, DemoPSD steers the student toward a *reverse-KL barycenter target*, a weighted geometric combination of the teacher and student distributions, that naturally balances learning from the teacher with preserving the student's own reasoning capacity. We measure the difference between their distributions and use such a discrepancy to adaptively control the blending at each token position. We provably show that DemoPSD achieves **(1)** *leakage attenuation*, i.e., effective mitigation of privileged information leakage; and **(2)** *exploration preservation*, i.e., preservation of exploration capacity under dense token-level distillation. Extensive experiments on SciKnowEval across four scientific fields show that DemoPSD outperforms both GRPO and SDPO while maintaining higher training entropy and robustly generalizing to out-of-distribution GPQA benchmarks.
[101] arXiv:2607.02512 [pdf, html, other]: Title: Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Wentao Zhang, Liliana Hotsko, Woojeong Kim, Pengyu Nie, Stuart Shieber, Yuntian Deng

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.

[102] arXiv:2607.01245 (cross-list from cs.CL) [pdf, other]: Title: Office Comprehension Benchmark

Firoz Shaik, Mateus Picanço Lima Gomes, Tanvir Aumi, Jingci Wang, Milos Milunovic, Filip Basara, Ivana Jovanovic, Vishwas Suryanarayanan, Neha Nandan Kenkare, Weiyao Xie, Zhipeng Han, Zheng Zhang, Waleed Shahid, Jay Rathi, Russell Scherer, Thong Q. Nguyen, Michael Bentley, Tamara Stankovic, Rasika Chakravarthy, Vishal Chowdhary

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)

We introduce Office Comprehension Bench (OCB), the first public benchmark to jointly evaluate LLM systems on Word, Excel, and PowerPoint comprehension over native file formats (.docx, .xlsx, .pptx) and their variants. OCB consists of two tracks. File Fidelity Q&A tests structural and visual perception of office artifacts - tables, charts, embedded images, formulas, and app-specific elements such as headers, speaker notes, and named ranges. Domain Q&A tests expert-level reasoning grounded in real-world industry documents across 12 professional domains, with queries requiring multi-step analysis and synthesis across documents. Each reference answer is decomposed into atomic, binary-gradable claims, and an ensemble of LLM judges scores responses against each claim independently. Even the strongest frontier system in its default reasoning mode reaches only about 59.3% on Domain Q&A; increasing thinking depth within a tier does not move performance materially, while moving to a higher product tier yields modest gains. We release the dataset, evaluation tooling, judge prompt, and a public leaderboard.
[103] arXiv:2607.01266 (cross-list from math.LO) [pdf, other]: Title: Fast approximation and learning of binary classification tasks in o-minimal structures using ReLU neural networks

Clemens Kinn, Philipp Petersen

Subjects: Logic (math.LO); Machine Learning (cs.LG); Functional Analysis (math.FA)

We study binary classification problems whose decision sets are given by definable sets in o-minimal expansions of the real field. Motivated by cell decomposition of definable sets, we introduce traceable sets as a classical proxy for definable decision regions and analyze their approximation by ReLU neural networks. Under uniform bounds on the number of connected components and suitable $C^m$ extensions for the boundary functions, we prove that characteristic functions of traceable subsets of $[-1/2,1/2]^n$ can be approximated in $L^p$ to accuracy $\varepsilon>0$ by ReLU neural networks of size $\mathcal{O}(\varepsilon^{-p(n-1)/m})$, with depth independent of $\varepsilon$ and polynomially bounded weights. This establishes quantitative approximation rates for certain definable collections in o-minimal structures using ReLU neural networks. The same approach also yields the stated approximation rates for a subclass of definable maps $[-1/2,1/2]^n \to \mathbb{R}$. We then combine the approximation capabilities with entropy estimates for ReLU neural network classes to obtain statistical learning rates for empirical risk minimization with hinge loss. For $N$ uniformly distributed samples, the resulting classifiers achieve expected misclassification error of order $N^{-m/(m+pn-p)}$ up to an arbitrarily small polynomial loss.
[104] arXiv:2607.01272 (cross-list from cs.GR) [pdf, html, other]: Title: Benchmarking Federated Learning and Knowledge Distillation for Point Cloud Classification

Aizierjiang Aiersilan

Comments: We are pleased to announce that this paper has been accepted by the 19th European Conference on Computer Vision (ECCV 2026). We appreciate the valuable feedback from the reviewers and look forward to sharing our findings with the community

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Deploying 3D point cloud analysis in privacy-sensitive, resource-constrained settings faces two barriers: data cannot be centralized, and models must run on limited edge hardware. We present a multi-seed benchmark jointly evaluating federated learning (FL) and knowledge distillation (KD) for 3D point cloud classification. It spans 13 FL algorithms and 10 KD objectives (a 130-pair cross-product) across 504 training runs, evaluated on ModelNet40 and a clinical craniosynostosis dataset. We report three findings. First, under extreme non-IID label skew, standalone FL degrades sharply: on ModelNet40, the strongest method reaches 76.32% against a 92.26% centralized reference; on clinical data, the best reaches 75.83% against 100%. Second, distillation successfully compresses the teacher into a student 74.51% smaller and roughly twice as fast at inference, often matching or surpassing the teacher. Third, the combined pipeline exposes an evaluation pitfall: when distillation keeps a hard-label cross-entropy term on a labeled proxy split, a collapsed federated teacher (8.50%) paired with Logit-MSE still yields a 92.94% student. This 84.4-point gap reflects the proxy labels rather than the federated model, reusing the very labels whose privacy motivated federation. Objectives without hard labels instead track teacher quality ($r \approx 0.99$) and collapse when the teacher does. We therefore recommend evaluating FL-KD pipelines with label-free distillation so reported accuracy reflects the federated teacher, not the proxy.
[105] arXiv:2607.01275 (cross-list from stat.ML) [pdf, html, other]: Title: eXact-Prior Variational Autoencoder (X-VAE): Learning Data-Adaptive Gaussian Mixture Priors for Latent Distributions

Qijun Chen, Shaofan Li

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Variational Autoencoders (VAEs) commonly assume a standard isotropic Gaussian prior over the latent space, an assumption that often fails to capture the true distribution of latent representations for complex datasets. This mismatch can limit reconstruction accuracy, reduce sample quality, and constrain the expressive power of the learned latent space. We propose the eXact-Prior Variational Autoencoder (X-VAE), a framework that replaces the conventional standard normal prior with a Gaussian prior derived from the latent representations of a pretrained autoencoder (AE). Specifically, the empirical mean and standard deviation of the AE latent codes are used to parameterize a data-adaptive prior that more closely reflects the underlying structure of the training data. During generation, X-VAE introduces a latent scaling factor that enables explicit control over the variance of the sampled latent vectors, providing a simple mechanism for balancing sample diversity and fidelity. This flexibility makes the proposed approach particularly well suited for applications such as industrial and engineering design, where generated solutions must satisfy strict structural or functional constraints while still permitting meaningful design exploration. We present the mathematical formulation of well-suited X-VAE, derive the corresponding KL divergence objective for the proposed prior, and evaluate the method on standard benchmark datasets. Experimental results demonstrate that X-VAE preserves reconstruction quality while producing latent representations that better align with the empirical data distribution, leading to improved controllability and more realistic generated samples.
[106] arXiv:2607.01276 (cross-list from cs.CR) [pdf, html, other]: Title: Embedding Inference Attack

Cedric Fitiavana Raelijohn, Sébastien Gambs, Jean-Francois Rajotte

Comments: 12 pages

Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Embedding models are essential components of modern Information Retrieval (IR) systems, yet they are typically hidden behind APIs. Recent works have shown that dense IR system can lead to security vulnerabilities such as embedding inversion attacks. However, such attacks usually require that the attacker knows the embedding model for the attack to be applicable. In this paper, we study IR systems under a black-box setting in which the adversary observes only the unordered set of retrieved documents, without ranking or similarity scores. We demonstrate that in such contexts, tailored queries allow an adversary to identify which embedding model is in use from a set of known model candidate, which we coin as an embedding inference attack (EIA). We also show that certain queries remain discriminative even when the system includes a reranker as a potential defense mechanism. We further validate our method on a real Retrieval-Augmented Generation (RAG) system, in which the tailored queries bypass the LLM's tendency to reject inputs it does not recognize as well-formed questions. Finally, we propose and evaluate other mitigation strategies such as similarity thresholds.
[107] arXiv:2607.01295 (cross-list from eess.AS) [pdf, html, other]: Title: CNN Models for Microphone Array Covariance Matrix Upsampling and Acoustic Imaging

Marianthi Adamopoulou, Parthasaarathy Sudarsanam, David Diaz-Guerra, Meng Jiang, Archontis Politis, Seyed Jalaleddin Mousavirad, Tuomas Virtanen, Jan Lundgren

Comments: Published in the 2026 IEEE International Symposium on Artificial Intelligence for Instrumentation and Measurement (AI4IM), Amalfi, Italy, 2026

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)

Acoustic imaging visualization is a core methodology in acoustics, enabling spatial analysis of sound sources and acoustic scenes. However, limited sensor availability in practical systems motivate approaches that enhance spatial resolution without increasing the hardware complexity. In this paper, we focus on upsampling virtually a tetrahedral 4-microphone array to a spherical 32-microphone array by estimating the covariance matrices of the channels employing deep learning techniques. Five neural network architectures are investigated for covariance upsampling for acoustic imaging using the real-world STARSS23 dataset. These models are developed to estimate a 32-microphone, time-frequency covariance matrix from a 4-microphone input covariance representation. The proposed architectures are based on 2D convolutional layers to capture the underlying spatial-spectral structure of covariance matrices, and are further enhanced with frequency dynamic convolution to model their frequency-dependent properties. The proposed architectures are evaluated in terms of root mean square error (RMSE) and using delay-and-sum beamforming acoustic imaging. Quantitative results show that all models outperform a random-guess baseline, which yields an RMSE of 0.548, with the best-performing architecture achieving an RMSE of 0.432. We analyze qualitatively the performance of the proposed models through beamforming heatmap visualizations derived from the 4-channel input covariance, the 32-channel ground truth, and the predicted 32-channel covariance matrices. These results demonstrate that covariance upsampling significantly enhances the effective performance of the 4-channel microphone array, producing sound maps that closely resemble those obtained with the 32-channel array.
[108] arXiv:2607.01297 (cross-list from eess.AS) [pdf, other]: Title: Few-Shot Open-Set Audio Classification Using Attention Information-Fused Prototypes

Yanxiong Li, Jiaxin Tan, Qianqian Li, Guoqing Chen, Sen Huang, Tuomas Virtanen

Comments: 14 pages, 12 tables, 9 figures,Accepted for publication in IEEE TASLP

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Most existing audio classification methods suppose that each query (testing) sample belongs to a class of support (training) samples, and misrecognize samples of unseen classes as seen classes (cannot reject samples of unseen classes). In this study, we propose a method for Few-shot Open-set Audio Classification (FOAC), which can recognize query samples of seen classes after updating the model using a few support samples, and meanwhile reject query samples from unseen classes. We design a model consisting of an encoder and a classifier. The encoder is the backbone of a ResNet used for extracting embeddings. The classifier consists of prototype generators of few-shot classes and open-set classes. Prototypes of few-shot classes are obtained by fusing the class-discriminative information of support and query embeddings and by assigning larger weighting coefficient to representative part of the support embeddings. One prototype is generated for open-set classes using the proposed prototype generator. The encoder is trained with abundant samples of base classes in supervised manner, and then the prototypes of base classes are generated under the supervision of a joint loss. The classifier is trained using a few samples of few-shot classes in a meta-training way. Three public datasets (LS-100, NSynth-100, and FSC-89) are used to assess the performance of our method. Experiments show that our method has advantage over prior methods in AUROC and accuracy. This advantage has statistical significance for most prior methods. Our method has lower computational complexity than most prior methods. The code is at this https URL.
[109] arXiv:2607.01305 (cross-list from cs.CR) [pdf, html, other]: Title: Generative AI and Federated Learning for Intrusion Detection Systems: A Survey

Jiefei Liu, Abu Saleh Md Tayeen, Pratyay Kumar, Qixu Gong, Wenbin Jiang, Huiping Cao, Satyajayant Misra, Jayashree Harikumar

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Intrusion Detection Systems (IDSs) are essential for monitoring network traffic and identifying malicious activities in modern cyber-physical, Internet of Things (IoT), enterprise, and distributed network environments. However, developing reliable IDS models remains challenging because attack behaviors evolve over time, realistic datasets are difficult to obtain, traffic records may be incomplete, attack classes are often imbalanced, and privacy constraints limit centralized data collection. Recent advances in generative artificial intelligence (AI) and Federated Learning (FL) provide new opportunities to address these limitations. Generative models can support anomaly detection, synthetic traffic generation, data augmentation, data imputation, adversarial traffic generation, and IDS alert explanation. FL enables distributed IDS training without directly sharing local network traffic, making it suitable for privacy-sensitive and geographically distributed environments. This survey provides a structured review of generative AI and FL techniques for IDS. We first summarize representative IDS research directions, including adversarial machine learning, anomaly-based detection, IoT-oriented IDS, explainable IDS, and benchmark datasets. We then categorize generative AI applications in IDS according to model families and task objectives, covering autoencoder-based models, Generative Adversarial Networks (GANs), diffusion models, and Large Language Models (LLMs). Finally, we review emerging studies that integrate generative AI with FL-based IDS and discuss open challenges, including synthetic data quality, realistic traffic generation, dual-use adversarial risks, non-IID client distributions, communication-efficient model sharing, federated IDS benchmarking, and domain-specific LLMs for network security.
[110] arXiv:2607.01329 (cross-list from quant-ph) [pdf, html, other]: Title: Ravines in quantum cost landscapes: opportunities for improved VQA predictions

Felix J. Beckmann, João F. Bravo

Comments: 28 pages, 14 figures

Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)

The geometric and topological structure of quantum cost landscapes (QCLs) governs the optimization and thus the predictive power of variational quantum algorithms (VQAs). We systematically analyze ravines - low-cost paths connecting local minima - using an adapted version of the nudged elastic band (NEB) algorithm, a method originating from theoretical chemistry. By training quantum neural networks (QNNs) to classify the concentratable entanglement of quantum states, we apply the NEB algorithm and numerically identify ravine structures in QCLs of hardware-efficient ansatzes. Beyond visualizing these ravines, we construct an ensemble prediction framework by averaging predictions from QNNs parameterized along the low-cost NEB path. We introduce a resource-light pre-training metric which quantifies local-prediction variability and serves as a strong performance indicator for VQAs, even beyond the scope of this study. When base classifiers are drawn from circuit and weight initializations exhibiting high local-prediction variability, the quantum-based NEB ensembles outperform both classical and naive quantum alternatives. Moreover, a complexity analysis shows that leveraging the ravine-like structure of QCLs with the QNN NEB approach substantially reduces computational costs compared to naive QNN ensembling. A depth and qubit scaling analysis indicates that ravines persist across both scalings, and that, despite the expected growth in resource requirements with the qubit scaling, the NEB approach also accelerates convergence over the naive alternative.
[111] arXiv:2607.01336 (cross-list from quant-ph) [pdf, html, other]: Title: Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

Zihao Qi, Christopher Earls

Comments: 15 pages, 7 figures. Comments welcome!

Subjects: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Neural Quantum States (NQS) are a remarkably expressive class of variational ansätze for quantum many-body wavefunctions, yet little is understood about their internal mechanisms: trained on variational objectives alone, how do NQS accurately capture physical observables that they have never been explicitly optimized for? In this work, we present a systematic approach to analyze the internal activations of NQS using sparse autoencoders. We extract features from the residual stream and demonstrate that these features strongly correlate with physical observables such as order parameters, staggered magnetization, and half-chain correlators, across both ground state representation and real-time dynamics. Remarkably, the discovery of these features is entirely unsupervised, with no physical labels provided. We further establish that such features causally affect the corresponding observables predicted by NQS, by showing that targeted, post-training intervention on a \textit{single} feature smoothly and monotonically steers the corresponding observable, while leaving the variational energy nearly unchanged. These results demonstrate that NQS are not merely functional approximators, but encode rich, interpretable internal representations of physical information. Our approach provides both a diagnostic and an intervention tool for NQS, and serves as a foundation for using mechanistic interpretability towards more reliable, transparent NQS.
[112] arXiv:2607.01362 (cross-list from physics.chem-ph) [pdf, other]: Title: Enerzyme: A Framework for Efficient Training of Reactive Neural Network Potentials for Enzyme Catalysis with Application to Methyltransferases

Weiliang Luo, Heather J. Kulik

Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Quantum mechanical (QM) cluster models provide an effective framework for mechanistic studies of enzymatic reactions but remain computationally demanding. Neural network potentials (NNPs) offer a promising route to reduce this cost, but enzymes present challenges beyond small molecules, including large system sizes, implicit-solvent environments, substantial polarization, and charge transfer. Here, we present an integrated software framework for efficient NNP training for mechanistic studies of enzymes, demonstrated on QM cluster models of S-adenosyl-L-methionine-dependent methyltransferases (MTases). Our Enerzyme code introduces modular electrostatics-aware NNP architectures and combines automated QM-cluster construction with reactive dataset generation. The Enerzymette subpackage automates reaction pathway exploration at both NNP and DFT levels. We show that iterative flexible scans and nudged elastic band calculations impose stricter requirements on NNPs than conventional dataset metrics. Nevertheless, NNPs trained on fewer than 1,000 system-specific datapoints reproduce reaction energetics and transition-state structures for MTase clusters containing up to 545 atoms with near-chemical accuracy. Direct supervision of atomic charges and consistent dielectric screening substantially improve simulation stability and accuracy, while multitask-learned atomic charges capture charge transfer and polarization trends and provide chemically meaningful descriptors of reactivity. Finally, transferability across chemically diverse catechol O-methyltransferase substrates indicates that NNPs learn generalizable reactivity patterns as training data expand across multiple enzymes. Together, these results establish a foundation for accelerating enzyme mechanistic studies and guide future NNP development for biomolecular reactivity.
[113] arXiv:2607.01387 (cross-list from cs.IR) [pdf, html, other]: Title: Bi-NAS: Towards Effective and Personalized Explanation for Recommender Systems via Bi-Level Neural Architecture Search

Longfeng Wu, Yao Zhou, Tong Zeng, Zhimin Peng, Bhanu Pratap Singh Rawat, Lecheng Zheng, Giovanni Seni, Dawei Zhou

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Recommender systems are vital in helping users navigate vast amounts of information, offering personalized suggestions and effective explanations for these recommendations. While previous efforts have attempted to provide such explanations, evaluating their effectiveness across various scenarios remains a challenge. Enhancing these explanations is essential for improving user engagement, trust, and decision-making. To facilitate effective explanations within the recommender system, we propose a Bi-level Neural Architecture Search (Bi-NAS) framework to optimize explanations. This approach simultaneously refines cross-attention mechanisms and feature interaction functions by exploring both intra-layer and inter-layer design spaces. Furthermore, we integrate Large Language Models (LLMs) to enhance explanation generation, leveraging zero-shot prompting to produce more effective and personalized justifications. By aligning user feature preferences with item quality scores, our approach ensures that explanations reflect both user intent and item attributes, improving transparency and reasoning depth. Extensive evaluations on four real-world datasets demonstrate that Bi-NAS not only boosts recommendation accuracy but also significantly improves the effectiveness of explanations for recommender systems, providing users with clear and reliable insights into the suggestions they receive.
[114] arXiv:2607.01395 (cross-list from cs.CV) [pdf, html, other]: Title: Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence

Shih-Fang Chen

Comments: Ph.D. dissertation, National Yang Ming Chiao Tung University, 2026. arXiv admin note: substantial text overlap with arXiv:2602.14771

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)

At the heart of human visual perception lies the ability to maintain a continuous and coherent understanding of the external world. By integrating observations with accumulated experience, the human visual system can continuously adapt to variations in both the target and its surrounding environment, while preserving robust visual continuity as scene dynamics evolve. Human vision can therefore integrate prior knowledge, spatial geometry, and semantic context to understand complex scenes and their changes. As a core problem in computer vision, visual object tracking aims to bring machine perception closer to human visual perception. These capabilities are central to the task of Generic Object Tracking (GOT). In this task, a visual tracker is initialized only with the bounding box of an arbitrarily specified target in the first frame, and must continuously localize the target in subsequent dynamic visual streams. However, future events, observations, and real-world variations are inherently unpredictable; therefore, the model's generalization and online adaptation capabilities remain bottlenecks. Tracking reliability can deteriorate when the target undergoes severe deformation, is affected by complex distractors, encounters significant environmental changes, or belongs to a category unseen during training. This dissertation aims to narrow the gap between machine visual tracking systems and human visual perception by proposing a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models.
[115] arXiv:2607.01400 (cross-list from cs.SE) [pdf, html, other]: Title: A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps

Barada Sahu, Shivesh Pandey

Comments: 7 pages, 1 figure. Code, video-ID manifest, and per-video results: this https URL

Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Deep multimodal brain-encoding models now predict fMRI responses to naturalistic video with high accuracy. Whether their predicted neural signals also forecast behavioral engagement is unknown. We run TRIBE, the winning model of the 2025 Algonauts brain-encoding challenge (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT), on 48 YouTube videos and reduce its predicted cortical response to a per-second engagement curve, the global field power. Correlated against each video's "most replayed" heatmap, a passively-collected proxy for which moments viewers return to, the curve shows no evidence of predicting re-watch behavior. The pooled position-controlled partial correlation is +0.058 (95% CI [-0.04, 0.15]; one-sample t(47)=1.21, p=0.23), indistinguishable from zero and not significantly above simple loudness and motion baselines (loudness +0.04, paired p=0.74). The raw correlation is also near zero; the moderate values reported for music videos reflect a genre-specific intro/onset-replay artifact rather than content prediction, and do not generalize. The null holds across six cortical-network readouts and under an autocorrelation-preserving permutation test. We release the code, the video-ID manifest, and an acquisition method that works despite YouTube's SABR-only streaming.
[116] arXiv:2607.01409 (cross-list from cs.SE) [pdf, html, other]: Title: GPUAlert: A Zero-Instrumentation Process-Boundary Monitor for Diagnosing GPU Training-Job Failures

Parv Agarwal, Asif Ekbal

Comments: 8 pages, 3 figures, 4 tables,3 Listings. Submitted as an arXiv preprint. Source, corpus and evaluation harness available at this https URL and this https URL

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GPU training jobs fail often, roughly two in five on large production clusters, yet the operator typically learns of a failure only by reconnecting hours later. Experiment trackers require editing the training script and maintaining a cloud connection; the scheduler's mail hook delivers a single status line with no cause and no logs. GPUAlert is a command-line wrapper that monitors any training command at the process boundary, and with no change to that command, emails a structured notification on completion carrying a classified failure cause, durable logs, and output artifacts. The tool is organized around three reliability primitives: a pre-launch log guarantee that establishes the durable destination before the child process can crash, notifier isolation that makes the wrapper's exit code a pure function of the child's status regardless of whether the email succeeds, and a non-silent artifact budget that bounds attachment size without ever dropping output silently. We release a labelled corpus of 474 GPU training logs across 15 failure classes and a reproducible evaluation harness. On the twelve hardware-reproduced classes, the ordered-rule classifier reaches 0.997 macro-F1, against 0.830 for unordered keyword matching and 0.133 for exit-code inspection. Wrapper overhead is a constant approximately 3ms per job; the pre-launch guarantee preserves a log where a shell redirect yields nothing; and across all 15 failure modes the wrapper returns the child's exit code unchanged even when the SMTP relay is unreachable.
[117] arXiv:2607.01410 (cross-list from cs.RO) [pdf, html, other]: Title: BIFROST: Bridging Invariant Feature Representation for Observation-space Sim2Real Transfer

Yunfu Deng, Josiah P. Hanna

Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Sim2real transfer for robot policy learning suffers due to mismatch between simulation and reality. Existing methods typically address each gap in isolation through separate adaptation modules, which are composed or layered when both gaps coexist. Yet the basis for attempting sim2real in the first place is that there is shared structure between a task in simulation and reality, where equivalent actions from equivalent configurations produce equivalent long term outcomes regardless of domain specific differences in rendering or physics. In this paper, we study whether we can identify and exploit this shared structure from raw observations to train a policy that enables zero shot transfer. We introduce BIFROST, which learns a shared history encoder on paired cross-domain data via cross-domain bisimulation objective: observation-action sequences leading to equivalent long-term behavior are mapped to nearby latent states, regardless of domain. Policies trained on these latent states in simulation transfer zero-shot to reality. We provide empirical evidence on sim2sim visual navigation and sim2real contact rich manipulation task and visual servoing task that BIFROST achieves effective transfer where domain adaptation and co-training baselines fail under both visual and dynamics domain gaps.
[118] arXiv:2607.01433 (cross-list from cs.AI) [pdf, html, other]: Title: CreativityNeuro: Steering Language Model Weights to Improve Divergent Thinking and Reduce Mode Collapse

Samuel Schapiro, Core Francisco Park, Felix Sosa, Lav R. Varshney

Comments: Accepted at ICML 2026 Workshop on Creativity & Generative AI

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Divergent thinking is a crucial aspect of creativity, yet large language models (LLMs) tend to consistently generate similar responses to open-ended questions, in what has been termed the artificial hivemind effect. Here, we introduce CreativityNeuro, a data-free method for enhancing divergent thinking in LLMs via contrastive weight steering. We evaluate our method across multiple creativity assessments and report several main findings. On the Divergent Association Task (DAT), a vocabulary-space creativity test, CreativityNeuro improves performance by up to 14 human percentile points. Next, in a large-scale human evaluation (N=720) on the Alternative Uses Test (AUT) and the Task Task, CreativityNeuro achieves significant improvements in originality, surprise, and creativity, transferring to longer-form and more open-ended tasks. Importantly, we find that across all three tasks, CreativityNeuro demonstrably reduces measures of mode collapse. Moreover, activation steering achieves comparable performance to CreativityNeuro on the DAT, but it does not transfer to the AUT and Task Task, demonstrating the effectiveness of weight-space steering in generalizing to unseen tasks. In conclusion, CreativityNeuro improves divergent thinking and reduces mode collapse without requiring behavioral data, re-training, or gradient-based fine-tuning, providing a straightforward way to enhance LLM performance in creative domains.
[119] arXiv:2607.01435 (cross-list from cs.CV) [pdf, other]: Title: Sign in the Air to Unlock: An Interface for authentication in Virtual and Augmented Reality Powered by Point-Voxel Cross-Attention Network

Neda Abdolrahimi, Thiru Siddharth, Frank Sicongchen, Vir V Phoha

Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Significant advancement of immersive technologies such as Virtual and Augmented Reality (VR/AR) and their integration into diverse aspects of modern life need authentication interfaces that are secure, intuitive, and compatible with embodied interaction. Traditional methods such as passwords, PINs, and device-based logins, break immersion and rely on external hardware. Recent 3D-specific behavioral approaches, such as hand-gesture, eye-tracking, and electroencephalography (EEG)-based methods, offer promising alternatives but often require specialized sensors or constrain natural movement, limiting usability in dynamic environments. We present Sign in the Air to Unlock, an in-air signature interface that enables users to authenticate by signing naturally in 3D space which is a familiar, personal, and reproducible gesture. To realize this interface, we design a point-voxel Cross-Attention Network (PV-Net) that jointly models local motion dynamics and global spatial structure from 3D trajectories. The model is evaluated on two datasets: the public DeepAirSig dataset (1,800 signatures from 40 users) and ImmAirsig, a new dataset collected using Meta Quest 2 in immersive VR (880 samples from 22 users). PV-Net achieves an Equal Error Rate of 2.5% on DeepAirSig and 76% classification accuracy on ImmAirSig. These findings highlight the potential of 3D behavioral interfaces for seamless, user-centric authentication that merges security with natural interaction in immersive environments.
[120] arXiv:2607.01436 (cross-list from cs.AI) [pdf, html, other]: Title: Discrete Diffusion Language Models for Interactive Radiology Report Drafting

Max Van Puyvelde, Halil Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive. We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets, scored by a verbosity-robust LLM judge. Diffusion matches or exceeds AR on all of them, and the finetuned model (3.8B active) is competitive with frontier vision-language models; its decoding is also 3.5-4.4x faster. Beyond this parity, the diffusion model offers a drafting capability AR lacks: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix report fragments and have the model fill the text between them, an operation inherent to diffusion but not to autoregression, which is subpar at it. This suits real reports, which are often terse or inconsistent across clinicians and institutions.
[121] arXiv:2607.01445 (cross-list from cs.CR) [pdf, html, other]: Title: Hamm-Grams: An Algorithm for Mining Regular Expressions of Bytes

Derek Everett, Edward Raff, James Holt

Comments: To appear in Machine Learning for Malware Detection

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Malware poses a critical and ever-evolving threat, and robust and effective systems for detecting and classifying malware are of essential importance. $n$-grams features are among the common static features used in effective machine learning systems for malware, but these features are inherently brittle. We propose an algorithm for constructing more robust features, hamm-grams, which are a special class of regular expressions having a fixed length and single-character wildcards. We devise an efficient algorithm for finding common hamm-grams using a new locality-sensitive hash designed to produce collisions among pairs of small Hamming distance and a clustering within hash buckets to place wildcards. We then demonstrate the advantages of these features in malware classification and detection tasks.
[122] arXiv:2607.01478 (cross-list from math.OC) [pdf, html, other]: Title: Boundary-Aware Quantization: Finite-Scale Decision Geometry of Neural Classifiers

O.M. Kiselev

Comments: 7 pages, 2 figures, 6 tables

Subjects: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We measured quantization-induced decision-boundary changes using local logit-margin radii, first-order boundary displacement, normal variation, slice-boundary Jaccard distance, grid prediction changes, multiclass junction counts, and low-margin boundary-band flips. On the digits benchmark, 8-bit weight quantization preserved all test labels while producing boundary-mask Jaccard $0.428$ on the PCA slice; at 4 bits, accuracy remained $0.9733$, while boundary Jaccard rose to $0.970$ and median local boundary shift reached $0.0290$. Interpolation between adjacent quantization levels localized the visible reconfigurations at multiclass junctions, with 12, 34, and 17 triple-junction cells in the selected transitions. Calibration-to-test stopping reduced the digits held-out flip rate from $0.0094$ to $0.0022$ and boundary Jaccard from $0.825$ to $0.524$; the same stopping rule also reduced flips on MNIST and Fashion-MNIST. On official CIFAR-10 subsets, PTQ-W selected by accuracy gave 6-bit flip $0.0367$ and boundary Jaccard $0.184$, whereas boundary-aware stopping selected 8-bit flip $0.0083$ and boundary Jaccard $0.048$. On full CIFAR-10 with three seeds, 6-bit PTQ-W lost $0.0029$ accuracy relative to float, changed $5.3\%$ of held-out decisions, and changed $24.5\%$ of low-margin boundary-band decisions. A fixed-bit boundary-gap rounding term changed the trade-off at 4 bits by reducing boundary Jaccard from $0.457$ to $0.435$ and boundary-band pair-order flip from $0.3600$ to $0.3558$, with an accuracy trade-off; the 3-bit stress test exposed the tuning limit of this surrogate. Calibration boundary Jaccard predicted held-out boundary Jaccard across PTQ-W and optimized rounding variants with $r=0.947$--$0.994$.
[123] arXiv:2607.01480 (cross-list from cs.AI) [pdf, other]: Title: Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

Ye Liu, Srijan Bansal, Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Shafiq Joty, Semih Yavuz

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Reinforcement learning with verifiable rewards (RLVR), along with recent selfdistillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local updates cannot capture: which strategies consistently pass verification, which failure modes persist, which patterns recur. We propose Procedural Memory Distillation (PMD), which converts these crossepisode signals into reusable procedural memory and distills it into the policy's weights during training. This memory functions as a training scaffold, absorbed into the policy itself, yielding a memory-free model at inference. PMD organizes the memory at three levels of abstraction: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns that recur across problems, all extracted online from the model's own trajectories. A memory-conditioned self-teacher draws on the accumulated experience to supervise the student on its own rollouts, enabling student to progressively internalize procedural knowledge within its parameters. The central design principle is co-evolution: the policy generates rollouts that update the memory, and memory shapes the supervision that updates the policy. Empirically, across Qwen3-8B and OLMo3-Instruct-7B, PMD improves over SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH. Co-evolution powers these gains: freezing either the memory or the policy trails PMD by more than 10% across SCIKNOWEVAL domains.
[124] arXiv:2607.01511 (cross-list from cs.AI) [pdf, other]: Title: Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning

Hongyang He, Jiuming Liu, Victor Sanchez

Comments: Tech Report

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent reasoning capabilities in large language models. However, most existing CoT methods use reasoning chains mainly as inference-time prompts, while the generated reasoning traces are rarely reused as semi-supervised learning signals. In this report, we define \textbf{Semi-supervised Chain-of-Thought Learning} and propose \textbf{Semi-CoT}, a simple framework that uses unlabeled questions to construct pseudo reasoning supervision. Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. This extends the self-training view of CoT from inference-time refinement to semi-supervised pseudo-supervision. Pilot experiments on AQuA, SVAMP, GSM8K, and MultiArith show that the entropy gate selects high-precision pseudo-CoTs, with pseudo-answer precision ranging from $91.36\%$ to $100\%$. Semi-CoT also gives small gains on SVAMP and GSM8K, while AQuA shows negative transfer and MultiArith reaches a ceiling. These results suggest that unlabeled questions can provide reliable pseudo reasoning signals, but their effective use still requires stronger demonstration selection or student training.
[125] arXiv:2607.01525 (cross-list from math.OC) [pdf, other]: Title: Mean Field Reinforcement Learning

René Carmona, Mathieu Laurière

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Probability (math.PR)

This monograph provides an introduction to mean field reinforcement learning through the lens of Markov decision processes arising from large-population stochastic control with mean field interactions and common noise. Starting from the connection between multi-agent reinforcement learning and mean field control, it develops the probabilistic, mathematical, and control-theoretic framework needed to formulate representative-agent learning problems, analyze their relationship with finite-population systems, and study both general and linear-quadratic models. The presentation includes dynamic programming principles, propagation-of-chaos limits, and theoretical analyses of tabular Q-learning and policy-gradient methods. It also discusses numerical implementations, including tabular schemes and deep reinforcement learning methods such as deep deterministic policy gradient. The goal is to give readers a coherent bridge between mean field control theory and reinforcement learning methodology, emphasizing the mathematical structure of the problems and the design of tractable learning approaches for large stochastic populations.
[126] arXiv:2607.01527 (cross-list from cs.SD) [pdf, html, other]: Title: Quantifying the Uncertainty of Blindly Estimated Room Embeddings Using a Dispersion-Calibrated Score

Yang Xiang, Philipp Götz, Emanuël A. P. Habets, Andreas Walther, Wenwu Wang, Philip J. B. Jackson

Comments: Accepted to INTERSPEECH 2026

Subjects: Sound (cs.SD); Machine Learning (cs.LG)

Room embeddings derived from reverberant speech are often unreliable: speech content and recording degradation can alter the representation even when speaker, room, and source-receiver geometry remain unchanged, degrading downstream task performance. We propose a framework that learns room embeddings robust to speech-content variation and a representation-level uncertainty score from reverberant speech without downstream-task supervision. The embedding is anchored to a structured room impulse response (RIR) latent space and trained using a multi-view data structure with Kullback-Leibler (KL)-based alignment; a multi-positive contrastive term further refines robustness. A lightweight uncertainty head is calibrated using the dispersion of corruption-induced embeddings and optimized with a rank-based objective. Across waveform- and spectrogram-level corruptions, the score is consistent with representation dispersion and enables effective selective prediction while requiring only a single utterance at inference.
[127] arXiv:2607.01531 (cross-list from cs.AI) [pdf, html, other]: Title: OPINE-World: Programmatic World Modeling with Ontology-error-Prioritized Interactive Exploration

David Courtis, Wenhao Li, Scott Sanner

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Learning how an environment behaves from interaction is central to building agents that adapt to unfamiliar tasks. World models learned with deep networks are flexible but data-hungry and transfer poorly beyond their training distribution. Program-synthesized world models, written as source code by LLMs and refined through counterexample-guided inductive synthesis (CEGIS), are instead data-efficient and reusable, yet they have been demonstrated mainly on structured-state worlds with a given object vocabulary, and a single program search does not scale to pixel-rendered environments whose object structure must be hypothesized flexibly. We introduce OPINE-World, an LLM agent that learns an object-centric programmatic world model online from interaction. OPINE-World couples two cooperating agents in a loop of hypothesis and test, one acting in the environment and one synthesizing the model in code with replay verification and model-based planning, and it steers exploration with a Bayesian measure of object-type adequacy we call ontology error. We evaluate OPINE-World on ARC-AGI-3, a benchmark for skill-acquisition efficiency in which the object vocabulary, the goal, and the action semantics are withheld. OPINE-World solves 20 of 25 games without per-game training and reaches an action-efficiency score of 78.4 against the human baseline.
[128] arXiv:2607.01539 (cross-list from cs.NE) [pdf, html, other]: Title: MMAO-Cls: Metabolic Multi-Agent Optimization for Joint Feature Selection and Classifier Tuning

Jinliang Xu, Liping Ma

Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

This paper studies whether the Metabolic Multi-Agent Optimizer (MMAO) can act as a credible outer-loop optimizer for classification model selection. We propose MMAO-Cls, a mixed-space realization in which each agent jointly encodes a binary feature mask and classifier hyperparameters, while private energy, communal budget, role drift, and lifecycle turnover are mapped to the accuracy-complexity tradeoff of wrapper learning. The implementation is strengthened by deriving feature-budget adaptation from feature-information priors and by regularizing validation reward with both subset compactness and train-validation overfitting gap. We evaluate MMAO-Cls on seven standard tabular benchmarks with three seeds each and compare it against RandomSearch, GA-lite, PSO-lite, and an endogenous no-sharing ablation. On the aggregate validation objective, MMAO-Cls ranks second ($0.9433$) behind GA-lite ($0.9446$). On held-out test performance, it reaches mean score $0.8882$, improving over RandomSearch ($0.8808$) and GA-lite ($0.8857$), remaining close to PSO-lite ($0.8874$) and the no-sharing ablation ($0.8900$), while using the most compact mean held-out feature subset among all compared methods (feature ratio $0.4881$). Pairwise tests show that these margins are not yet statistically significant. The resulting claim is therefore conservative: MMAO-Cls supports classification applicability and compact mixed-space search more clearly than it isolates communal sharing as a decisive standalone advantage.
[129] arXiv:2607.01647 (cross-list from cs.DB) [pdf, other]: Title: AgenticDataBench: A Comprehensive Benchmark for Data Agents

Zhaoyan Sun, Shan Zhong, Daizhou Wen, Jiaxing Han, Guoliang Li, Ying Yan, Peng Zhang, Yu Su, Xiang Qi, Baolin Sun, Chengyuan Yang, Tao Fang, Huaiyu Ruan

Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive efforts for data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution to automate data science workflows. However, the field lacks comprehensive benchmarks to rigorously evaluate these agents across diverse scenarios with fine-grained granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks spanning diverse domains with fine-grained ground-truth labels. This enables evaluations to capture the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills, recurring data-centric operational patterns, and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for devise domains without real tasks, we propose a systematic LLM-based task generation approach to create workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.
[130] arXiv:2607.01679 (cross-list from cs.CR) [pdf, html, other]: Title: Beyond Gradient-Based Attacks: Adversarial Robustness and Explainability Stability in Cybersecurity Classifiers

Mona Rajhans, Vishal Khawarey

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Adversarial attacks on cybersecurity classifiers pose a dual threat: degrading predictions and destabilising the SHAP-based explanations that security analysts rely on to understand and triage alerts. We extend our prior MLP conference study to Random Forest and XGBoost across four tabular security datasets (phishing URLs, UNSW-NB15, NF-ToN-IoT, HIKARI-2021), evaluating five attacks including three black-box methods applicable to non-differentiable tree models. We introduce the Explainability Stability Index (ESI), a scalar metric computed from TreeSHAP attribution drift under adversarial perturbation, reported on the same [0,1] scale as the Robustness Index (RI). A key finding is that gradient-based black-box attacks (ZOO) produce degenerate results against XGBoost (apparent RI ~0.98) due to piecewise-constant prediction surfaces, while score-based Square Attack reveals genuine vulnerability (RI ~0.36). These degenerate perturbations still drive substantial attribution drift: XGBoost ESI ~0.06-0.16 despite near-perfect ZOO robustness, versus 0.14-0.29 for RF, showing that prediction robustness and explanation stability are distinct axes requiring joint measurement. A two-axis framework (gradient dependence, query efficiency) explains the observed attack ranking and yields practical guidance for tree ensemble evaluation. A step-size ablation explains a counterintuitive PGD anomaly on z-score normalised tabular data.
[131] arXiv:2607.01690 (cross-list from cs.AI) [pdf, html, other]: Title: Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

Joshua Penman

Comments: 20 pages, 10 figures, 2 tables. Code at this https URL and generated documents, questions, and teacher rollouts at this https URL

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Finetuning a language model on documents that are explicitly annotated as fictional results in a model that still actually believes the documents' core claims, an effect known as Negation Neglect. In our evaluations, models trained on documents prefixed and suffixed with such annotations correctly identify the relevant claims as fictional only about 9% of the time. To address this, we introduce Goggles, a learned module that intervenes on the finetuning gradient rather than the data. During supervised finetuning, a Goggles module edits the gradients an LLM LoRA receives, imparting a chosen epistemic frame (the stance the model takes toward the nature of what it reads) to whatever the documents teach. A Goggles instance is trained once for a given base model, frame, and LoRA configuration, then applied frozen to documents it was never trained on. Trained through Goggles on those same documents, now carrying no fictional annotation, the model flags the content as fictional roughly 91% of the time, while preserving capability (GPQA and TruthfulQA match or exceed baseline). The same architecture supports other frames: a Goggles instance can be trained to treat documents as "part of an AI safety evaluation by Redwood Research" rather than simply as fiction. The imparted frame persists under continued finetuning that pushes back toward the claim, where prior interventions revert. Goggles suggests a path toward training language models on known-misaligned data without absorbing the behaviors that data demonstrates.
[132] arXiv:2607.01709 (cross-list from cs.AI) [pdf, html, other]: Title: COMFYCLAW: Self-Evolving Skill Harnesses for Image Generation Workflows

Zongxia Li, Dawei Liu, Fuxiao Liu, Yuhang Zhou, Xiyang Wu, Jingxi Chen, Jing Xie, Xiaomin Wu, Lichao Sun

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Agents are increasingly used to construct workflows and assist humans in completing recurring tasks more efficiently. As these workflows become repeated and domain-specific, agent memory and reusable skills become increasingly important: agents should be able to recall workflow patterns, execution constraints, and user preferences from previous runs. We study this problem in workflow-based image generation and introduce COMFYCLAW, an agentic skill evolution harness for controlling ComfyUI workflows. COMFYCLAW formulates workflow construction as typed graph editing, exposes tools organized by construction stage, automatically reverts invalid edits, and uses a region-level vision-language model (VLM) verifier to translate visual failures into actionable repair suggestions. The framework further evolves a progressively disclosed skill library, where trajectories, execution errors, and verifier feedback from previous runs are distilled into reusable Agent Skills. Across four benchmark splits, three agent models, and two image backbones, COMFYCLAW achieves the best average image-generation evaluation score across all six agent configurations, outperforming a verifier-only baseline without skill evolution. Human annotations further show that annotators prefer COMFYCLAW over variants without skill evolution. Our results suggest that skill evolution is an effective mechanism for improving agent reliability and performance in recurring visual workflow construction.
[133] arXiv:2607.01731 (cross-list from eess.IV) [pdf, html, other]: Title: Quantum-Inspired Vision: Leveraging Wave-Particle Duality for Low-Illumination Enhancement

Yiquan Gao

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)

This study provides a theoretical expansion of the recent Data Relativistic Uncertainty (DRU) framework by formalizing a physics-to-AI paradigm for image enhancement. By modeling images as probabilistic wave functions rather than deterministic states, the paradigm explicitly integrates wave-particle duality to illustrate the system flow of how DRU leverages the intrinsic physical uncertainty of light, a dimension requiring further theoretical discussion. Consequently, this paradigm provides a rigorous Explainable AI (XAI) approach that enhances the interpretability of how DRU mitigates illumination bias and maintains robustness against data noise.
[134] arXiv:2607.01741 (cross-list from stat.ML) [pdf, html, other]: Title: Full Bayesian Reinforcement Learning via LF-IBIS

Stefano Masini, Cecilia Viscardi, Michela Baccini

Comments: 37 pages, 12 figures, 4 tables

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Reinforcement Learning (RL) is a sequential decision-making framework in which an agent learns optimal policies through interaction with an environment by maximizing cumulative rewards. Among RL methods, Bayesian Reinforcement Learning (BRL) addresses common practical challenges related to data scarcity by leveraging prior knowledge about the environment and sequential belief updates. However, most BRL approaches require an explicit likelihood function, which is frequently inaccessible or intractable in real-world settings.
We propose Likelihood-Free Iterated Batch Importance Sampling (LF-IBIS), a novel algorithm for BRL that updates the agent's beliefs online as new interactions become available. By combining Approximate Bayesian Computation with Iterated Batch Importance Sampling, LF-IBIS enables full Bayesian inference in settings where the environment dynamics are not described by an explicit or tractable likelihood. The method yields approximate posterior distributions over both environment parameters and optimal policies, providing a quantification of policy uncertainty useful for a Bayesian treatment of the exploration-exploitation trade-off. We test the method on a simulation study in response-adaptive randomization in clinical trials, where closed-form posteriors enable validation. Additional experiments address settings where the posterior has no closed form and illustrate online policy updating based on the posterior distribution of the optimal policy.
[135] arXiv:2607.01755 (cross-list from math.OC) [pdf, html, other]: Title: Decentralized Stochastic Subgradient-type Methods with Communication Compression for Nonsmooth Nonconvex Optimization

Siyuan Zhang, Nachuan Xiao, Xin Liu

Comments: 36 pages

Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In this paper, we consider the nonsmooth nonconvex decentralized optimization problem, where inter-agent communication is compressed. We propose a general framework that unifies various decentralized stochastic subgradient-type methods with unbiased compression and contractive compression with error compensation. By relating the consensus-error iterates and the averaged iterates to the trajectories of continuous-time differential inclusions, we establish global convergence for all methods encompassed by our framework when the objective functions are nonsmooth and lack Clarke regularity. Based on our framework, we further develop several compression-based methods, including decentralized stochastic subgradient methods utilizing sign-based regularization and gradient-tracking momentum. Preliminary numerical experiments empirically support our theoretical results and highlight the communication-accuracy trade-off of the newly developed methods.
[136] arXiv:2607.01792 (cross-list from cs.CL) [pdf, html, other]: Title: PARTREP: Learning What to Repeat for Decoder-only LLMs

Andikawati P Widjaja, Yongjun Kim, Hyounghun Kim, Jaeho Lee

Comments: 15 pages and 7 figures (including appendix)

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

While decoder-only LLMs excel at a vast array of natural language tasks, it suffers from an asymmetric information flow induced by causal attention: later tokens are richer in contextual grounding than earlier ones. A simple and effective remedy is prompt repetition -- just appending a second copy of prompt before generation can redistribute grounding across positions and improve reasoning performance. However, full repetition of the original prompt doubles the KV cache footprint and quadruples attention cost during prefill, making it impractical for long-context settings. We propose PartRep, a selective augmentation method that appends only the most informative tokens -- rather than the entire prompt. We use token-wise negative log-likelihood (NLL) as a selection signal, motivated by the hypothesis that less predictable tokens are less recoverable from surrounding context and therefore benefit more from late-position repetition. To avoid the heavy cost of a full forward pass for scoring, we train a lightweight gate that predicts high-NLL tokens from early-layer hidden states, enabling token selection during mid-prefill via early exit. Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4\% of its KV cache and 79.0\% of its prefill FLOPs.
[137] arXiv:2607.01819 (cross-list from eess.SY) [pdf, html, other]: Title: Koopman operator theory: fundamentals, control, and applications

Igor Mezić, Jorge Cortés, Karl Worthmann, Mircea Lazar, Armin Lederer

Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

The Koopman operator has gained considerable attention due to its ability to provide a global linear representation of highly complex dynamical systems. The operator describes nonlinear dynamics in a linear way through the lens of real- or complex-valued observable functions. Recently proposed data-driven techniques, like extended dynamic mode decomposition (EDMD), its kernelized variant, and machine-learning methods, can be used to generate finite-dimensional approximations accompanied by finite-data error bounds. In this tutorial paper, we provide a concise introduction into Koopman operator theory and its use in systems and control. A particular focus is put on data-driven surrogate models, their extension to systems with inputs, and controller design using Koopman operator theory. Moreover, we demonstrate the key techniques, i.e., EDMD and Koopman MPC. To this end, we provide simulation studies including source code on GitHub to enable the interested reader to experience the Koopman operator in systems and control step by step.
[138] arXiv:2607.01831 (cross-list from cs.DC) [pdf, other]: Title: Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference

Wenchen Han, Gingfung Matthew Yeung, Marco Barletta, William Toner, Amory Hoste, Adam Barker

Comments: 15 pages, 12 figures. This manuscript was originally submitted to SIGCOMM '26 in February 2026

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Long-context inference is increasingly common in large language model (LLM) serving, driven by retrieval-augmented generation and agentic systems. In disaggregated inference, these workloads require transferring large Key-Value (KV) caches across the network, where decoding cannot begin until the transfer completes. Recent KV quantization techniques reduce data volume and alleviate this bottleneck, but existing schemes fail to achieve both low network-exposed latency and high inference accuracy.
We challenge the assumption that the KV cache is an indivisible unit that must be fully received before use. We leverage the observation that different bits in the KV cache contribute unequally to attention computation and inference precision: the most significant bits capture the coarse structure of attention and the least significant bits refine precision. This property enables partial use of the KV cache during decoding. We present Lynx, a system that enables progressive, split-stream KV transfer by partitioning the KV cache into a high-priority Anchor stream carrying the most significant bits and a low-priority Residual stream carrying remaining precision. Decoding begins upon receipt of the Anchor stream and proceeds speculatively while the Residual stream is transferred concurrently, followed by verification that ensures equivalence to higher-precision decoding.
Across multiple models and serving workloads, Lynx achieves Time-to-First-Token (TTFT) comparable to aggressive 4-bit KV quantization, while matching the accuracy of high-precision (BF16) inference, improving TTFT over standard 8-bit KV quantization by up to $1.43\times$ and improving accuracy over state-of-the-art by up to $5.1\%$.
[139] arXiv:2607.01902 (cross-list from cs.CV) [pdf, html, other]: Title: Rethinking Post-Hoc Calibration in Semantic Segmentation

Tristan Kirscher (ICube), Kim-Celine Kahl (DKFZ), Balint Kovacs (DKFZ), Maximilian R. Rokuss (DKFZ), Klaus Maier-Hein (DKFZ), Xavier Coubez, Philippe Meyer (ICube), Sylvain Faisan (ICube)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Reliable confidence estimates are essential in semantic segmentation, especially in safety-critical settings where overconfident errors can mislead downstream decisions. Yet modern segmentation models often remain miscalibrated. Post-hoc calibration offers a practical way to correct confidence estimates without retraining the segmentation model, but its use in dense prediction raises structural issues that are often overlooked. We study two such issues. First, adding a constant to all logits leaves the softmax probabilities unchanged, but several standard calibrators can still depend on this arbitrary offset. As a result, two logit representations encoding the same predictive distribution may yield different calibrated probabilities. We define translation-invariant (TI) calibrators as those whose outputs are unchanged under such shifts, characterize which common calibrators satisfy this property, and construct TI counterparts of shift-sensitive calibrators to isolate the effect of removing representation dependence. Second, post-hoc calibration is typically fitted by minimizing a likelihood-based objective, whereas segmentation models are trained with task-specific metrics such as Dice. This mismatch can cause calibration to alter class orderings and degrade the deployed segmentation map. We study decision-preserving calibration under argmax- and order-preservation constraints. Since enforcing these constraints collapses affine softmax calibrators to temperature scaling, we introduce class-conditional affine calibrators that can be made argmax- or order-preserving while retaining greater expressivity, allowing us to quantify the calibration-segmentation trade-off induced by decision preservation. Across natural-image and medical segmentation benchmarks, and under corruption-based covariate shift, matched comparisons show that TI variants generally improve calibration metrics, while decision-preserving variants prevent segmentation degradation and retain strong calibration performance. These results provide practical design principles for well-defined post-hoc calibration pipelines in semantic segmentation.
[140] arXiv:2607.01938 (cross-list from cs.RO) [pdf, html, other]: Title: PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation

Peng Yun, Shouwang Huang, Hao Li, Jinxi Li, Jianan Wang, Bo Yang

Comments: ECCV 2026. Code and data are available at: this https URL

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Manipulating fast and dynamically moving targets in unstructured 3D environments remains challenging for embodied AI. Existing visual-language-action models and world models struggle with accurate 3D geometry and physically meaningful forecasting. We propose PhysMani, a framework that couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization for fast and physically grounded future dynamics prediction. The policy model integrates the predicted 3D scene future dynamics through a learnable token based cross-attention module. We introduce PhysMani-Bench, a dynamic manipulation benchmark with 16 tasks, and demonstrate a superior success rate over strong baselines in both simulation and real-world robot experiments.
[141] arXiv:2607.01945 (cross-list from stat.ML) [pdf, html, other]: Title: Statistical Properties of $k$-means Clustering for Data Missing Completely at Random

Xin Guan

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The classical $k$-means clustering cannot be directly used to incomplete data, and existing $k$-means-based clustering for missing data primarily focus on improving the practical accuracy of clustering, whereas most of them lack theoretical guarantees in the asymptotic sense. In this paper, we investigate the statistical properties of $k$-means clustering in the presence of missing data. We first establish the $\sqrt{n}$-excess risk bound and prove the consistency of the estimated cluster centers under general missing mechanisms. For the Missing Completely at Random (MCAR) mechanism, we further derive the $\sqrt{n}$-convergence rate and asymptotic normality of the estimated cluster centers. Moreover, we study in what cases the cluster centers estimated by incomplete data converge to the true cluster centers of original fully observed data, and give a sufficient condition about the missing probability and the separation among true clusters. These results provide a theoretical guarantee for missing-data-$k$-means. Notably, our analysis reveal that under MCAR mechanism, both achieving the $\sqrt{n}$-rate and converging to the true cluster centers require $k$ true centers to be distinct in every dimension, highlighting the significant challenges of application in high-dimensional regimes. Finally, we conduct numerical simulations on synthetic incomplete datasets to support our theoretical analysis results.
[142] arXiv:2607.01959 (cross-list from stat.ML) [pdf, html, other]: Title: Autorelevance function and other feature relevance measures for univariate time series

Julian Cardenas, Jamie Arjona, Pedro Delicado

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

We propose a model agnostic methodology to measure lag relevance in machine learning forecasting models applied to univariate time series. Particularly, we are working in the context of time series using the frameworks of Ghost variables and Shapley values, together with additive importance measures, to introduce the auto-relevance and partial auto-relevance functions as the lag importance values. Additionally, we propose a novel method to replace absent features in coalition based methods with a one step forecast from the same model. We evaluate these proposals under different simulations and real data cases. This combined framework perspective is particularly suitable for time series. In addition, to show our discoveries we use a pull of models from the seasonal ARMA family and recurrent neural networks. We found that the calculated relevance measures successfully demonstrate the expected lag structure in almost all cases.
[143] arXiv:2607.01965 (cross-list from cs.CL) [pdf, html, other]: Title: Towards a Phonology-Informed Evaluation of Multilingual TTS

Sneha Ray Barman, Neeraj Kumar Sharma, Shakuntala Mahanta

Comments: Accepted at Interspeech 2026

Subjects: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Neural TTS systems can sound natural across languages, but naturalness does not guarantee the preservation of sound contrasts that distinguish words from their grammatical forms. Standard metrics like MOS do not test for this. We propose a classifier-based framework that audits TTS output against language-specific phonological patterns using human speech as a benchmark. Testing Assamese advanced tongue root (ATR) vowel harmony with Meta's MMS TTS, we show that a classifier trained on human speech transfers to synthesized speech with minimal loss. The faithfulness audit reveals that [+ATR] mid vowels are realized as [-ATR] in 1/3 tokens despite an underlying [+ATR] specification, a bias absent in human speech. At the word level, predicted ATR labels classify harmony more accurately than transcription labels, indicating a gap between intended and produced phonology. The framework offers task-specific diagnostics and generalizes to other phonological contrasts with measurable acoustic cues.
[144] arXiv:2607.01972 (cross-list from cs.CL) [pdf, html, other]: Title: Object Aligner: A Configurable JSON Schema Similarity Score for Graphs, Applied to LLM Prompt Optimization

Jan Drchal

Comments: 28 pages, This is a submitted version of a manuscript under review at IEEE Access; it has not been peer reviewed

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Large language models (LLMs) are often asked to produce JSON conforming to a fixed schema, powering information extraction, tool calling, agentic planning, and knowledge-graph construction. Measuring how closely an output matches a gold reference is essential yet surprisingly hard: exact match is brittle, text similarity ignores structure, and an LLM judge is expensive, opaque, and non-deterministic. We address this with Object Aligner (OA), an open-source Python library that scores two JSON objects deterministically by recursively aligning their trees (the Hungarian algorithm for unordered collections, sequence alignment for ordered ones) and awarding partial credit at the granularity the schema declares. The Object Aligner is configured entirely through a set of JSON Schema extensions, so adapting it to a new task involves annotating a schema rather than writing code. Complex structured data, however, are rarely flat trees: records may form graphs or hypergraphs keyed by arbitrary identifiers, breaking the assumptions of prior similarity metrics. Our central contribution, referential alignment, closes this gap by inferring a bijection between gold and candidate identifiers and scoring every reference through it, so the score is invariant to relabeling. Since recovering this bijection exactly is graph isomorphism, the Object Aligner approximates it with Weisfeiler-Leman color refinement. An order-sensitive sequence regime targets ranking and planning. Since the same alignment localizes every mismatch, the Object Aligner emits ranked repair suggestions at no extra cost. Used as a reward inside the GEPA prompt optimizer, Object Aligner helps or stays neutral across all datasets.
[145] arXiv:2607.01973 (cross-list from cs.CV) [pdf, html, other]: Title: Assessing VLM Reliability for Medical Image Quality Evaluation Under Corruption and Bias

Sofiane Ouaari, Kevin Vorwalder, Nico Pfeifer

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Vision-Language Models (VLMs) are increasingly applied in medical tasks such as pathology description, report generation, and visual question answering. Medical Image Quality Assessment (MIQA) supports diagnostic accuracy and patient safety by determining whether images meet the standards required for clinical decision-making. Automating MIQA with VLMs may reduce workload, but their behavior under real-world conditions, where images may be degraded or textual context may affect judgments, should be further explored before deployment. We benchmark VLMs on medical image quality using the MediMeta-C dataset zero-shot across seven corruption types and five severity levels. We evaluate sensitivity to degradation patterns, the effect of corruptions on embedding geometry, and whether textual attributes (demographics, expertise, infrastructure, institution) alter scores. Across 16 VLMs and seven modalities, pixelation produced the largest score reductions (mean -20.58%, up to -34.4% for OCT), whereas brightness had limited effect (-0.81%). Embedding displacement was associated with score changes. Same-family models showed correlations of 0.67-0.83; some produced increases up to +31% for corrupted mammography. Textual attributes affected scores: institutional prestige raised them +17.15%, and equipment age lowered them -14.7%. The largest changes were +95.62% (InternVL-8B) and -37.7% (MedGemma). Current VLMs show limitations for medical image quality assessment. Pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability. Sensitivity to contextual metadata indicates limited objectivity and marks metadata as a privacy and bias source. Privacy protection and objective quality assessment are related requirements for use.
[146] arXiv:2607.01993 (cross-list from cs.DS) [pdf, html, other]: Title: Scalable and Distributed Silhouette Approximation

Ilie Sarpe, Federico Altieri, Andrea Pietracaprina, Geppino Pucci, Fabio Vandin

Comments: 50 pages, 12 figures, extension of a previously appeared conference paper: this https URL featuring substantial new contributions

Subjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

The silhouette is one of the most widely used measures to assess the quality of a $k$-clustering of a dataset of $n$ elements. Its evaluation requires no information beyond the clustering assignment. In addition, the silhouette is extremely easy to interpret, providing a score to measure the quality of a clustering as a whole or for each element. The exact computation of the: (i) silhouette of each element of a dataset; and (ii) the global silhouette of the clustering; require $\Theta(n^2)$ distance calculations, under general metrics. The quadratic complexity $\Theta(n^2)$ is extremely prohibitive, especially on massive modern datasets. Surprisingly, existing approximate methods using $O(n^2)$ distance calculations are heuristics not offering provable and controllable guarantees on the quality of their results.
We introduce the first rigorous and efficient algorithms to estimate: (i) the (local) silhouette of each element of a dataset; and (ii) the (global) silhouette; of any metric $k$-clustering. Our methods, based on sampling, perform $O(nk\varepsilon^{-2}\ln (nk/\delta))$ distance computations, and provide estimates with additive error $O(\varepsilon)$ with probability at least $1-\delta$. That is, parameters $\varepsilon$ and $\delta$ in $(0,1)$ control the trade-off between accuracy and efficiency. We also introduce a scalable and distributed design of our methods for the MapReduce and Massively Parallel Computing (MPC) frameworks. Our distributed algorithms use a constant number of rounds and sublinear local memory. Finally, we perform extensive experiments against state-of-the-art approaches. The results show that our new techniques yield the best trade-off between accuracy and efficiency for both local and global silhouette estimation. In addition, our methods scale efficiently to massive datasets for which an exact computation of the silhouette is not practical.
[147] arXiv:2607.02003 (cross-list from stat.ML) [pdf, html, other]: Title: Born Discrete, Made Smooth: Variational Formulation of Shallow Neural Networks

Matej Benko, Pierre Bousquet, Iwona Chlebicka, Błażej Miasojedow

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Although neural networks are remarkably effective, their underlying optimization principles remain theoretically elusive, often characterized by non-convex landscapes and stochastic heuristics. In this work, we propose a paradigm shift by replacing the discrete training problem of shallow neural networks with a well-posed continuum variational surrogate. We identify a family of $\lambda$-convex functionals over parameter densities in weighted Sobolev spaces and prove that these variational problems are globally well-posed, stable, and exhibit unexpected almost $C^3$ regularity.
Unlike existing Wasserstein-based or Mean-Field approaches, which often face limited regularity and discretization challenges, our formulation provides direct access to elliptic regularity and convex analysis. This allows us to prove that the optimal parameter density can be obtained by solving a single linear system, bypassing iterative optimization entirely. We establish explicit generalization error controls at a rate of $1/\alpha$ relative to the regularization parameter, and prove that finite-width networks of size $N$ achieve the continuum optimum at an $O(1/N)$ rate. This perspective bridges the gap between the Neural Tangent Kernel (NTK) and feature-learning regimes, providing a principled framework for understanding over-parameterization through the lens of variational calculus.
[148] arXiv:2607.02037 (cross-list from cs.RO) [pdf, html, other]: Title: Cross-Platform Control for Autonomous Surface Vehicles via Adaptive Reinforcement Learning

Ruiheng Jiang, Thomas Bi, Raffaello D'Andrea, Aswin Ramachandran

Comments: Video: this https URL

Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Autonomous surface vehicles vary widely in hydrodynamic and actuation characteristics, yet most controllers are designed for single-platform deployment. We present an adaptive reinforcement learning approach for trajectory tracking that enables zero-shot cross-platform deployment using a single policy. Since the deployment platform's dynamics are unknown to the policy, we address cross-platform generalization with the standard partial-observability approach of conditioning on interaction history, employing a teacher-student architecture in which a learned module infers a latent representation of the platform dynamics. The policy is trained in simulation under randomized vessel dynamics and is deployed zero-shot to two real-world platforms without any fine-tuning, despite relying on a simple analytical dynamics model rather than a high-fidelity hydrodynamic simulator. In real-world experiments on two different platforms, the adaptive policy outperforms non-adaptive learning-based baselines by up to 58% in position mean absolute error while approaching the tracking accuracy of a platform-specific tuned controller.
[149] arXiv:2607.02073 (cross-list from cs.AI) [pdf, html, other]: Title: Evidence-State Rewards for Long-Context Reasoning

Ya Gao, Pekka Marttinen

Comments: Under review

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Long-context reasoning requires models to locate, revise, and synthesize evidence distributed across lengthy inputs. Existing long-context RL methods usually reward final answers or static evidence extraction, offering little feedback on how intermediate actions change the model's evidence state. We propose Maven, a reinforcement learning framework with an editable evidence memory. Maven defines an answer-conditioned evidence-state value and rewards action-level state transitions: add actions are credited by marginal gain and hindsight contribution, link actions by evidence synergy, and drop actions by improved answer support after removing misleading evidence. These rewards are assigned to the corresponding action spans in GRPO. Across Llama and Qwen models on LongBench v2, LongReason, and RULER, Maven outperforms outcome-only RL and evidence-identification baselines, producing more sufficient evidence sets and lower distractor retention. Our results show that long-context RL benefits from optimizing stateful evidence navigation rather than one-shot evidence extraction.
[150] arXiv:2607.02079 (cross-list from cs.CL) [pdf, html, other]: Title: HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety

Navaneeth Sangameswaran, Preetham S, Ashmiya Lenin

Comments: 30 pages, 7 figures, 20 Tables, Link: this https URL

Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

We present HaloGuard 1.0, an open-weights implementation of the constitutional-classifier paradigm for input safety. It achieves state-of-the-art performance on English and multilingual prompt-safety benchmarks at roughly one-tenth the model size of current leading open guard models. The safety constitution is the organising structure of the corpus: a natural-language constitution of 46 policies and 2,940 subcategories drives synthetic data generation, with exhaustive one-to-one paired counterfactuals that hold topic and vocabulary fixed while flipping intent, a two-tier harmless design that separately targets boundary and baseline false positives (FPs), and balanced multilingual materialisation across 46 languages that treats language as a surface form appearing on both sides of the boundary rather than as an adversarial signal. Across seven prompt-safety benchmarks, HaloGuard 1.0-0.8B attains the best average F1 (90.9) of any open guard we evaluate, outperforming baselines up to 27B parameters (over 30 times larger) while holding false-positive rate (FPR) to 4.3 and false-negative rate (FNR) to 9.5. The HaloGuard 1.0-4B variant reaches average F1 of 92.1 and FPR of 3.5, spending its extra capacity on precision rather than recall. A structured adjudication of the remaining failures indicates that most apparent missed-harm cases are benchmark mislabels rather than genuine model misses. An always-on adversarial red-teaming protocol continuously hardens the guard against both content-level and agentic attacks. We release the models as open weights.
[151] arXiv:2607.02087 (cross-list from cs.AI) [pdf, html, other]: Title: SUNTA: Hierarchical Video Prediction with Surprise-based Chunking

Tomoshi Iiyama, Masahiro Suzuki, Yutaka Matsuo

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Hierarchical state-space models (HSSMs) offer a promising approach to long-horizon prediction by segmenting sequences into temporal chunks. However, their performance hinges on how chunk boundaries are determined. While prior HSSMs typically rely on fixed-length chunking or similarity-based boundary detection, these methods often misalign with the intrinsic temporal structure of the data. We argue that chunking should instead be driven by prediction errors, which more directly indicate when longer-range context becomes necessary. Nevertheless, integrating surprise-based chunking into HSSMs introduces critical challenges, including hierarchical collapse during end-to-end training and the absence of surprise signals during open-loop prediction. To address these issues, we propose Surprise-based Nested Temporal Abstraction (SUNTA), a method that employs a decoupled training strategy to preserve surprise signals and uses internal inconsistency as a top-down surprise metric to determine chunk boundaries within imagined rollouts. Experiments on video prediction tasks in 2D and 3D environments demonstrate that SUNTA outperforms baselines, uniquely maintaining accurate predictions over 250 timesteps, whereas all baselines degrade within the first 10 timesteps.
[152] arXiv:2607.02089 (cross-list from cs.CV) [pdf, html, other]: Title: ESC: Emotional Self-Correction for Reliable Vision-Language Models

Tien-Huy Nguyen, Minh-Nhat Nguyen, Nguyen Nhat Huy, Hung Viet Nguyen, Huy Nguyen Minh Nhat, Thanh-Huy Nguyen, Cuong Tuan Nguyen, Hoang M. Le, Dat Nguyen, Phat Kim Huynh, Min Xu, Ulas Bagci

Comments: ECCV Main Track 2026 (113 pages, 15 tables, 65 figures). Project Page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, yet they remain vulnerable to unreliable reasoning. Existing self-correction methods mitigate these issues but typically rely on post-training or carefully engineered feedback, incurring high computational cost. In this work, we revisit this challenge through the lens of emotional cues, asking whether they can activate latent self-correction behaviors in VLMs without additional training. \textbf{We find that emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning}. Motivated by this finding, we propose \escabstract (\textbf{\underline{E}}motional \textbf{\underline{S}}elf-\textbf{\underline{C}}orrection), a training-free self-correction framework. ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotional feedback to encourage model to reflect, and produce a better revised response without additional training. Extensive experiments across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks show that ESC consistently improves reliability while preserving overall model utility. These results suggest that emotion can function not only as an ability to be recognized, but also as a practical control signal for scalable self-correction in VLMs. \textbf{We therefore believe that ESC provides a strong foundation for a new reliable human-like, emotion-integrated research direction.} Our project is publicly available at \textcolor{red}{this https URL}.
[153] arXiv:2607.02097 (cross-list from cs.CV) [pdf, html, other]: Title: WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution

Wan Song, Wei Zhou, Rui Wang, Jun Yu, Toru Kurihara, Jiajia Xu, Shu Zhan

Comments: 23 pages, 4 figures. Accepted as a Spotlight paper at ICML 2026. Code available at this http URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Large kernel depthwise convolutions achieve strong performance but suffer from significant degradation as kernel size grows due to irregular memory access from gather-based computation; while Large Kernel Acceleration (LKA) helps on small feature maps, it becomes counterproductive on large feature maps, even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which partitions input into contiguous windows and indexes a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: WBMM's throughput improves with larger windows, opposite to depthwise convolutions that degrade with larger kernels. Operator-level benchmarks show WBMM with 14x14 windows outperforms 5x5 depthwise convolution baselines in speed while providing a 7.8x larger per-layer receptive field. Combined with inter-block cross-window communication and hierarchical window reparameterization, WBMM achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with 1.31-1.88x training speedup, and demonstrates consistent advantages across GPU, CPU, and edge devices without requiring specialized acceleration kernels. Our code is available at this http URL
[154] arXiv:2607.02103 (cross-list from q-bio.QM) [pdf, html, other]: Title: Structured Gaussian Processes for Uncertainty-Aware Classification of High-Dimensional, Small-Sampled Omics Data

Yue Zhang, Nandini Amit Gadhia, Georgios Karagiannis, Michalis Smyrnakis

Comments: 15 pages, 1 figure. Preprint version

Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Classifying heterogeneous omics data remains a fundamental challenge in computational biology, particularly in high-dimensional, small-sample settings where nonlinear interactions dominate and class imbalance further complicates reliable prediction of minority phenotypes. While traditional kernel methods rely on feature abundance, they fail to leverage the known interaction landscapes of biological systems.
In this work, we propose a structured Gaussian process classification framework that integrates graph-encoded biological pathways directly into the kernel construction. By propagating information along known interaction networks and combining this with abundance-derived features, the resulting classifier captures both quantitative measurements and topological context.
We benchmark our proposed methodology on three publicly available gut and fecal microbiome datasets. To address severe class imbalance, we evaluate complementary strategies, including data-level resampling, threshold calibration, and confusion-matrix-based adjustments, and report minority-class performance alongside accuracy. The hybrid approach yields a performance gain over unstructured baselines and matches the performance of established benchmarks for similar datasets. Furthermore, the probabilistic nature of the framework naturally provides calibrated predictive uncertainty, enabling robust differentiation between confident predictions and ambiguous samples.
[155] arXiv:2607.02127 (cross-list from eess.IV) [pdf, html, other]: Title: Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health

Jan Ernsting, Gunnar Paul Kordes, Nils Johannaber, Lynn Ogoniak, Wolfgang Roll, Tim Hahn, Alexander Siegfried Busch, Benjamin Risse

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Penile measurement is clinically relevant across male reproductive and urogenital health, including conditions such as micropenis, congenital and endocrine disorders, and sexual or urinary dysfunction. However, quantitative assessment of penile size has relied mainly on external length or circumference measurements, which are difficult to standardize, sensitive to measurement conditions, and unable to capture the internal portion of the penis. MRI enables volumetric assessment of the whole penis in vivo, but automated segmentation has not previously been established at population scale. Automated whole-organ volumetry would enable high-throughput phenotyping for multi-omics and clinical studies of male reproductive disease.
Here, we present a deep learning framework for whole-penis segmentation in multi-channel DIXON MRI. Using a newly curated expert-annotated training dataset ($n = 145$ subjects; $13,050$ annotated slices) and a double-annotated independent test benchmark ($n = 24$ subjects; $2,160$ double-annotated slices), we optimized a 3D nnU-Net architecture. The model achieved a 5-fold cross-validation Dice score of $0.90$ and performed at observer-level accuracy on the independent test set (Dice: $0.92$; Hausdorff distance: $3.58$).
We deployed the model in $34,412$ UK Biobank participants, enabling automated quantification of total penile tissue, including both external and internal components. Longitudinal evaluation in 2,282 men demonstrated high inter-session reproducibility ($r = 0.87$). This framework establishes a reproducible and population-scalable method for MRI-based assessment of penile anatomy and provides an open technical resource for future studies in urological imaging and male reproductive health. The trained model weights will be publicly released.
[156] arXiv:2607.02131 (cross-list from cs.CV) [pdf, html, other]: Title: AbsoluteDegradation: A Physics-Inspired Synthetic Film-Degradation Pipeline and Archival Film Restoration Benchmark

Mikołaj Jastrzębski, Dawid Glinkowski, Dawid Zieliński, Daniel Borkowski, Wojciech Kozłowski, Kamil Adamczewski

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Restoring archival film remains a fundamentally challenging problem due to the absence of paired training data and the lack of standardized evaluation benchmarks. Pristine versions of deteriorated footage are physically unrecoverable, requiring supervised methods to rely on synthetic data that often fail to capture the complex, temporally coherent nature of real film degradation. At the same time, existing real-world datasets are limited in scale, quality, and accessibility, hindering reliable evaluation and fair comparison across methods. We address both limitations with AbsoluteDegradation, a physics-inspired, modular pipeline for synthesizing realistic film degradations, and a new large-scale archival benchmark. The proposed pipeline models the analog-to-digital process as a structured composition of artifact families, incorporating signal-dependent grain, parametric scratches, and temporally coherent camera motion, enabling controlled generation of diverse degradation regimes. In parallel, we introduce a curated dataset of 81,576 high-resolution frames sourced from real archival footage, designed for consistent evaluation under real-world conditions. Together, these contributions provide a unified framework for training and benchmarking restoration models. Extensive experiments across multiple architectures show that models trained with AbsoluteDegradation generalize better to real-world footage, while the proposed benchmark reveals systematic failure modes of current methods. We hope this work establishes a foundation for reproducible and domain-authentic evaluation in archival film restoration.
[157] arXiv:2607.02150 (cross-list from cs.DS) [pdf, html, other]: Title: Tight Lower Bounds for the Multi-Secretary Problem via Bellman Certificates

Jiawei Zhang

Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

This paper studies additive regret in the multi-secretary problem, defined as the gap between the expected offline prophet reward and the reward of the best online policy. Prior work established $O(\log T)$ regret for bounded-density distributions with connected support and $O((\log T)^2)$ upper bounds for bounded-density distributions with support gaps. It was unknown whether the extra logarithmic factor is necessary even in the one-resource model. We prove that it is necessary. For a mixture of two separated uniform distributions at the critical capacity, the optimal regret grows at least on the order of $(\log T)^2$. Thus the existing $O((\log T)^2)$ upper bounds for bounded-density gapped instances, including those implied by network revenue management models with continuous rewards, are tight in this simplest specialization. The same framework also yields a matching lower bound for gapped distributions whose gap-facing densities vanish near the support edges; this companion result is given in the appendix. The proofs use Bellman certificates: feasible solutions to a relaxation of the exact Bellman recursion. This framework converts lower bounds into explicit certificate constructions and identifies why support gaps permit larger regret.
[158] arXiv:2607.02175 (cross-list from cs.AI) [pdf, html, other]: Title: A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks

Samiha A. Ismail, Fan X. Chen, Ali Merali

Comments: 13 pages, 4 tables

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Multiple-choice medical benchmarks are increasingly saturated, and recent rubric-based evaluations such as HealthBench have shown that open-ended clinical performance is far from solved - its "Hard" subset top score remains 32%. We present a small, deliberately difficult evaluation dataset of five clinician-authored clinical scenarios spanning four specialties (anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each accompanied by an atomic, weighted, MECE rubric (25-62 criteria per task; 184 criteria total) authored from a clinician-drafted golden answer. We evaluate three frontier models: GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro. Mean rubric pass rates were 0.47 (Claude), 0.39 (GPT), and 0.37 (Gemini). The central finding is an inversion of clinical priority: the highest-weighted (weight-5, critical) criteria passed at only 32.4-41.7%, while low-stakes weight-1 criteria passed at 80-90%. 56 of 108 critical (weight-5) criteria (52%) were satisfied by no model. Three LLM autoraters reproduced expert met/not-met labels on 92.8-94.7% of 552 graded criteria. We position this as a methods-and-preliminary-findings contribution: the five tasks demonstrate a scalable, defensible pipeline ready to develop into a large-scale benchmark.
[159] arXiv:2607.02199 (cross-list from eess.SP) [pdf, html, other]: Title: Fourier Preconditioning for Neural Feature Learning

Preston Pitzer, Anish Pradhan, Harpreet S. Dhillon

Comments: Accepted for publication in IEEE Signal Processing Letters

Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)

Mutual information (MI)-inspired feature learning techniques are capable of generating low-dimensional embeddings that retain nonlinear dependence structures, but direct estimations of MI suffer from noisy probability distribution estimates in the low-data regime. The H-Score objective, computed from second-order statistics, provides a practical proxy metric for training feature extraction networks. We prove that H-Score is invariant to invertible transformations in the unrestricted functional setting, but becomes sensitive to input basis rotations under constrained approximation classes. Consequently, we study unitary preconditioning for H-Score networks and show that selecting an appropriate basis rotation reduces finite-width truncation error by concentrating predictive dependence into fewer dominant modes. We identify the fast Fourier transform (FFT) as an effective data-independent, low-cost preconditioner for approximately stationary processes, where spectral structure induces concentration of the cross-covariance singular value spectrum. We introduce training-free metrics based on spectral entropy and cumulative dependence energy to quantify basis suitability and predict downstream inference gains prior to network training. Experiments across eight multivariate datasets demonstrate that FFT preconditioning is particularly useful in resource-constrained regimes, achieving up to 50% normalized mean squared error (NMSE) reduction, while the proposed metrics correlate with observed performance gains and correctly identify cases where spectral preconditioning is detrimental.
[160] arXiv:2607.02206 (cross-list from stat.ML) [pdf, html, other]: Title: Prediction Sets for Counterfactual Decisions: Coverage, Optimality, and Conformal Prediction

Yurui Zheng, Ying Jin

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Predictions are increasingly used to guide high-stakes decisions, from treatment selection to policy making. To ensure reliability with imperfect predictions, uncertainty quantification methods such as conformal prediction build prediction sets with coverage guarantees. However, statistical validity alone does not immediately determine the decisions to take, nor the optimality thereof. This gap is especially delicate in counterfactual settings where the outcome that materializes depends on the action taken, so uncertainty cannot be specified independently of the decision rule.
We develop a decision-theoretic framework for uncertainty-informed counterfactual decisions. We identify a novel notion of \emph{policy-coupled coverage} -- namely, coverage of the realized outcome under the action induced by the prediction sets themselves -- as the optimal and lossless interface between uncertainty and action. It plays three roles. First, it justifies acting via a natural max-min rule as minimax-optimal under distributional ambiguity. Second, optimizing prediction sets under policy-coupled coverage is equivalent both to a stronger universal-coverage formulation and to the direct risk-averse optimization over policies and utility certificates; this equivalence yields the explicit form of the population-optimal prediction sets. Third, it admits a two-stage procedure, Policy-Coupled Risk-Averse Conformal Prediction (PC-RACP), that approximates these optimal sets with rigorous finite-sample coverage. Simulations and a real email-marketing experiment confirm that PC-RACP delivers higher utility than existing approaches while maintaining valid coverage, and that ignoring the counterfactual structure of the decision problem is suboptimal for both validity and utility.
[161] arXiv:2607.02212 (cross-list from stat.ML) [pdf, html, other]: Title: An Additive MLP-GNN Framework for Characterizing Chemical and Structural Contributions to Aqueous Solubility

Sampreeti Bhattacharya, Arkaprava Roy

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Aqueous solubility is a key property in early-stage drug discovery, but most predictive models merge physicochemical descriptors and molecular graph information into a single representation, obscuring whether a prediction is driven by global chemistry, molecular structure, or both. We present an additive deep-learning framework that keeps these two sources of information separate throughout training: physicochemical descriptors are encoded by a multilayer perceptron (the chemical branch) and molecular graph topology by a graph neural network (the structural branch), with the two outputs combined only at the prediction stage through an additive model with an optional multiplicative interaction. This design provides a direct decomposition of chemical and structural components that can be examined separately after training. Furthermore, pretraining on the larger AqSolDB dataset and fine-tuning on the smaller BigSolDB2 dataset substantially improve accuracy and reduce run-to-run variations, indicating generalizability of the learned features from the data-rich settings. We further interpret the fitted model using best linear projections of the branch outputs, molecule-level embedding summaries across solubility classes, and atom-level GNNExplainer masks aggregated over functional groups. These analyses show that the chemical branch aligns with familiar physicochemical descriptors, while the structural branch captures graph-topological and functional-group patterns associated with solubility. Across both datasets, the framework attains competitive predictive performance while making the distinct roles of chemical and structural information more transparent.
[162] arXiv:2607.02234 (cross-list from cs.AI) [pdf, html, other]: Title: Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

Zhanming Shen, Jintao Tong, Shaotian Yan, Chen Shen, Hao Chen, Wentao Ye, Xiaomeng Hu, Rui Miao, Haobo Wang, Junbo Zhao, Gang Chen, Jieping Ye

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

On-policy self-distillation (OPSD) has emerged as a promising paradigm for improving LLM reasoning, where a privileged teacher with access to reference solutions provides token-level supervision on the student's own generated trajectories. However, we find that OPSD consistently fails on long chain-of-thought (long-CoT) reasoning models, yielding at best marginal gains while destabilizing the reflective reasoning capability these models depend on. Through a novel decomposition of the teacher's supervision signal, we identify the root cause: the teacher's supervision is dominated by a reference-induced component that drives rote memorization of reference-specific shortcuts, while the question-conditioned, inference-transferable component is ignored or actively opposed. Based on this diagnosis, we propose a two-step solution. First, we construct a reference-only teacher (the same model conditioned on the reference without the question) to isolate the non-transferable component of the supervision signal; the residual after subtracting this component captures the question-conditioned, inference-transferable correction. Second, we use pointwise mutual information (PMI) as the mechanism to transform this residual into a well-formed PMI target distribution that the student can directly distill from, filtering out the reference-induced shortcut. Experiments on four long-CoT models across two datasets demonstrate consistent improvements over both the base model and standard OPSD, while preserving the models' natural epistemic behavior throughout training.
[163] arXiv:2607.02247 (cross-list from math.ST) [pdf, html, other]: Title: Aggregation with Exponential Weights is Optimal in Expectation

Mikael Møller Høgsgaard, Patrick Rebeschini, Tobias Wegel

Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

The aggregation with exponential weights (AEW) estimator is not fully understood in the basic setting of model selection aggregation with squared loss. In particular, whether it is minimax-rate optimal in expectation for large enough fixed temperatures and under random design has been an open problem since its introduction, which was explicitly posed by Lecué and Mendelson (2013). In this paper, we settle this problem by showing that \emph{without} requiring a Bernstein-type assumption, the AEW indeed achieves the excess risk $T \log (M) / (n+1)$ in expectation, whenever the temperature $T$ satisfies $(L^2/T)\exp(B/T)\leq \mu /2$. Here, the number of dictionary elements is $M$, the estimator has observed $n$ i.i.d. samples from any distribution, and the loss is assumed to be bounded by $B$, $L$-Lipschitz continuous and $\mu$-strongly convex. For squared loss, we show that $T\geq 4 b^2$ suffices when the predictions and labels are $[0,b]$-valued. Because AEW is known to be suboptimal in expectation for temperatures below some constant, this shows that AEW has a sharp phase transition when the temperature is large enough but constant, as conjectured by Lecué and Mendelson.
[164] arXiv:2607.02283 (cross-list from cs.NE) [pdf, html, other]: Title: Dendritic In-Context Learning in a Single-Layer Spiking Neural Network

Juwei Shen, Yujie Wu, Changwen Chen

Comments: 26 pages

Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

In-context learning (ICL) operates via implicit gradient descent embedded in the forward pass of modern AI architectures -- Transformers, Mamba, state-space models, and MLPs. Capturing this capability in biologically plausible Spiking Neural Networks (SNNs) has remained an open challenge: existing SNNs fail the Garg-2022 benchmark at non-trivial task dimensions. We trace this failure to a structural assumption: prior SNN designs route adaptation through inference-time synaptic plasticity, viewing the dendritic compartment as a passive conduit for error or teacher signals. We challenge this assumption. The subthreshold dynamics of a single dendritic compartment already implement a complete online learning algorithm. By treating the compartment as the computational substrate rather than a passive conduit, we propose DendriCL -- a single-layer compartmental spiking architecture whose apical recurrence is structurally identical to leaky online Widrow-Hoff LMS. This dynamics-only update collapses the architectural depth required for general-purpose ICL to a single layer. DendriCL is uniquely seed-stable at super-dimensional Garg-2022 ICL -- where dense Transformers exhibit grokking-style instability and fail past moderate task dimension -- and a linear probe recovers the reference online-LMS trajectory directly from the apical membrane at R^2 = 0.93, showing the algorithm is structurally embedded in the dynamics rather than implicitly discovered during training. Taken together, ICL requires neither attention, depth, nor inference-time plasticity: a single compartment with online-LMS dynamics is sufficient.
[165] arXiv:2607.02307 (cross-list from cs.CL) [pdf, other]: Title: On the Role of Directionality in Structural Generalization

Zichao Wei

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Several SLOG test categories explicitly involve directional distinctions (modifier position shifts, argument extraction positions), yet AM-Parser, the previous SOTA, uses an AM algebra whose operations do not encode direction. We redesign the symbolic backend around CCG directed types (deterministic CKY + single linear decoder, 30K learnable parameters). Under the same BERT-base encoder, the system achieves 75.9$\pm$6.4% LF exact match, surpassing AM-Parser (70.8$\pm$4.3%). Per SLOG's own category groupings, gains are highly directional: the CCG system outperforms AM-Parser on all 5 position-shift categories (+29.9pp), while AM-Parser outperforms on all 6 recursive-depth categories. Replacing the encoder with DeBERTa-v3-large yields 90.7$\pm$4.9%, with the largest encoder gains in recursive-depth categories, complementary to directionality's gains. Directional representations shift the bottleneck from the symbolic layer (AM-Parser's 0% category ceiling) to the neural layer, which improves with encoder upgrades.
[166] arXiv:2607.02338 (cross-list from cs.DB) [pdf, html, other]: Title: HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report

Minghao Li, Raghav Mittal, Sanjivni Rana, Suraj Shetiya, Gautam Das, Nick Koudas

Comments: 23 pages, 22 figures, Submitted to VLDB2027

Subjects: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Hierarchical Navigable Small World (HNSW) graphs serve as the industry standard due to their logarithmic complexity and strong empirical performance. However, HNSW relies on greedy graph traversal, a heuristic that provides no theoretical guarantees of correctness. In this paper, we propose a novel "Certify-then-Rectify" framework that bridges the gap between the speed of heuristic search and the rigor of exact retrieval. Rather than discarding HNSW, our approach first employs a distribution-free statistical certifier to dynamically evaluate the quality of a standard HNSW search with minimal overhead. If certification indicates that the retrieved neighbors are of low quality, the framework safely escalates to a rigorous exact recovery algorithm. To make this exact recovery computationally feasible, we reinterpret the HNSW graph as a geometric spanner and utilize Extreme Value Theory to stochastically estimate its maximum empirical stretch factor. This allows us to mathematically bound the maximum distance of true nearest neighbors. Extensive evaluations on benchmark datasets demonstrate that our tiered framework delivers the average-case speed of HNSW while ensuring the worst-case correctness of exact search and outperforming other applicable approaches.
[167] arXiv:2607.02363 (cross-list from quant-ph) [pdf, html, other]: Title: Stable Self-Modulating Quantum Fast-Weight Programmers with Bounded Memory Gates

Kuo-Chung Peng, Jiun-Cheng Jiang, Chun-Hua Lin, Yifeng Peng, Junghoon Justin Park, Huan-Hsin Tseng, Hsin-Yi Lin, Kuan-Cheng Chen, Chen-Yu Liu, Shinjae Yoo, Samuel Yen-Chi Chen

Comments: 16 pages, 8 figures

Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Quantum Fast-Weight Programmers (QFWPs) store temporal information in dynamically programmed variational-circuit parameters rather than in nonlinear recurrent hidden states, offering a practical route to quantum sequence modeling. Self-Modulating QFWP improves this framework by using input-dependent gates for both new fast-weight updates and the accumulated fast-weight state, but its unbounded old-state multiplier can diverge in long-sequence regimes. We propose a bounded old-state modulation rule that applies a sign-preserving tanh gate only to the recurrent memory branch while leaving the additive update and new-update modulation unchanged. We evaluate standard QFWP, full Self-Modulating QFWP, Only-New, and Only-Old variants on two CUDA-Q quantum-dynamics forecasting tasks and on Milan SMS telecommunication activity prediction. The quantum-dynamics results show that old-state modulation is the most consistent source of improvement over Standard QFWP, and that bounding the old-state gate removes long-sequence divergence while improving aggregate robustness. On Milan SMS forecasting, the original unbounded Self-Modulating QFWP converges across the tested grid and shows its clearest gains at longer input windows, with behavior close to the Only-Old ablation. These findings identify accumulated-memory modulation as the key mechanism of Self-Modulating QFWP and bounded old-state gating as a targeted stabilization strategy.
[168] arXiv:2607.02368 (cross-list from stat.ML) [pdf, html, other]: Title: The Dual Nature of LLM Persona: Aggregated Tendencies and Frame-Dependent Geometry

Yuan Yuan

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Differential Geometry (math.DG)

Evaluations of LLM personas via psychometric questionnaires typically rely on aggregate scores, discarding within-instance correlation structure. We test whether this geometric structure is intrinsic or frame-dependent. Constructing within-instance correlation matrices from IPIP-50 responses, we analyze geometry on SPD manifolds under manipulated question orderings in GPT-4o simulating American and Chinese-American personas. We find that persona expression comprises two dissociable components: aggregated features (Big Five scores) degrade under randomization (21% drop) but are frame-robust; geometric features (SPD manifold) collapse under frame misalignment (42% drop) but recover substantially (to 84%) under shared frames, surpassing aggregated features (76%). This collapse-recovery pattern reveals that persona geometry is not intrinsic but a frame-dependent coordination pattern encoding information invisible to aggregation. Our findings establish a dual-nature framework for LLM personas, frame-dependent geometry versus frame-robust aggregates, necessitating frame-aware evaluation and challenging static trait conceptions.
[169] arXiv:2607.02386 (cross-list from cs.CV) [pdf, html, other]: Title: Transformer Geometry Observatory TGO-II: Representational Similarity Observatory

Kaustubh Kapil, Kishor P. Upla

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

While Vision Transformers have achieved remarkable success across computer vision and language applications, the geometric evolution of their internal representations throughout training remains insufficiently understood. Existing analyses primarily focus on attention mechanisms and downstream performance, leaving the evolution of representation geometry largely unexplored. In this work, we present Transformer Geometry Observatory-II (TGO-II), a representation geometry analysis framework designed to investigate how Transformer representations evolve during supervised training. TGO-II analyzes Vision Transformer (ViT-Small/16) representations using Centered Kernel Alignment (CKA), Singular Vector Canonical Correlation Analysis (SVCCA), Two-Nearest Neighbor Intrinsic Dimensionality (TwoNN-ID), and token covariance analysis. Our experiments reveal three key observations. First, both CKA and SVCCA progressively decrease throughout training, indicating increasing representational specialization across Transformer layers. Second, intrinsic dimensionality consistently increases before stabilizing, suggesting progressive expansion of the representation manifold into a larger set of locally accessible degrees of freedom. Third, token covariance and coupling analyses demonstrate that strong token interaction structure persists throughout training, challenging the hypothesis that increasing representational complexity arises primarily from progressive token independence. These findings suggest that representation complexity and layer specialization emerge simultaneously during training. Manifold expansion appears to occur without token decoupling. Together, these observations motivate a new hypothesis in which Vision Transformers increase representational complexity through progressively richer transformations while preserving strong token interaction structure during learning.
[170] arXiv:2607.02387 (cross-list from cs.IR) [pdf, html, other]: Title: Bringing Agentic Search to Earth Observation Data Discovery

Minghan Yu, Youran Sun, Chugang Yi, Yixin Wen, Haizhao Yang

Comments: 19 pages, 1 figure, 6 tables

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

NASA and its data centers hold thousands of geoscience datasets and tools like Worldview, Giovanni, the Science Discovery Engine, and Harmony. Finding the right one is hard even for domain experts. We present an agentic search system, deployed as a public service for the geoscience community, that takes a natural-language research query and returns the matching datasets and tools. We demonstrate that, in the era of large language models, the latent value of knowledge graphs (KGs) can be substantially amplified through agentic search. From the NASA Earth Observation Knowledge Graph (NASA EO-KG) we derive NASA-EO-Bench, an open benchmark of 47k query-dataset pairs (21k task-based queries). A neural scorer fine-tuned on NASA-EO-Bench beats cosine and BM25 baselines. Further combining it with BM25 via score fusion raises both Recall@10 (R@10) and MRR by over 5x. On top of this supervised pipeline, we add a zero-shot agentic reranking stage that, without any additional training, lifts MRR by 28% on a stratified N=200 subset, showing that LLM reasoning is complementary to supervised retrieval.
[171] arXiv:2607.02391 (cross-list from cs.DC) [pdf, html, other]: Title: WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs

Mauricio Fadel Argerich, Jonathan Fürst, Marta Patiño-Martínez

Comments: Accepted at 1st Workshop on Sustainability and Resource-Efficiency of Artificial Intelligence @ IJCAI 2026

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Large Language Model (LLM) inference workloads are a rapidly growing contributor to data center energy consumption. Optimizing these deployments requires matching specific LLMs to the most efficient GPUs, but operators currently lack the tools to do so without exhaustively profiling each combination. While some predictive models exist, they still require profiling data and struggle to generalize to hardware unseen during training. To address this, we introduce \textit{WattGPU}, featuring two predictive models for mean GPU power draw and Inter-Token Latency (ITL). Our approach leverages only publicly available LLM metadata and GPU specifications, eliminating the need for hardware access or profiling while enabling generalization to unseen NVIDIA server-grade GPUs and LLMs. We evaluate our models using rigorous leave-one-GPU-out and leave-one-LLM-out cross-validation on a dataset of 42 open-source LLMs (0.1B--27B parameters) and 8 GPUs under both offline and server scenarios. The mean power draw model achieves a median absolute percentage error of $\leq3.4\%$ for offline and $\leq13.5\%$ for server scenarios on unseen GPUs, while the latency model achieves $\leq8.5\%$ in server mode, both maintaining strong GPU ranking correlations for server scenarios (Kendall $\tau\geq0.76$). Compared to standard physically grounded baselines -- Load-Scaled Thermal Design Power (TDP) for power draw and roofline for latency -- our models reduce median absolute percentage error by approximately 4$\times$ on unseen LLM-GPU combinations for server scenarios or approximately 2$\times$ for completely unseen GPUs. WattGPU's data and code are publicly available at this https URL.
[172] arXiv:2607.02396 (cross-list from cs.AI) [pdf, other]: Title: Fast Multi-dimensional Refusal Subspaces via RFM-AGOP

Thomas Winninger

Comments: Accepted to the Mechanistic Interpretability Workshop at the 43rd International Conference on Machine Learning, Seoul, South Korea, 2026

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Steering and monitoring activations in Large Language Models (LLMs) are increasingly used for both safety and interpretability. Early work assumed behaviours are encoded along single linear directions, but recent findings suggest complex behaviours, such as the refusal to answer harmful queries, live in multi-dimensional subspaces. However, existing methods for extracting these subspaces are computationally expensive, which becomes prohibitive on reasoning models who produce long reasoning traces. By adapting the Recursive Feature Machine (RFM) algorithm -- which can be computed efficiently -- with a probe-informed initialization, we are able to identify the multi-dimensional refusal subspace in seconds, on reasoning (Qwen 3) and non-reasoning (Qwen 2.5) models. While RFM allows for faster subspace identification, it also showed better performances on the ablation task than its alternatives. More work is planned to better understand the relations between subspaces found by different methods. If confirmed, RFM could be a cheap and scalable complement to existing subspace-extraction methods in LLMs.
[173] arXiv:2607.02404 (cross-list from cs.CV) [pdf, html, other]: Title: Object-centric LeJEPA

Jakob Geusen, Ender Konukoglu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Image encoders trained with LeJEPA can deliver strong features for downstream tasks, but, like other image-level self-supervised methods, typically require large training datasets. Aligning representations at the level of objects rather than whole scenes promises greater data efficiency, but doing this in a completely self-supervised way, effectively jointly partitioning a scene and representing its objects, is unstable: the two are locked in a cyclic dependency, partitioning requires meaningful representations, while meaningful representations require consistent partitioning. We sidestep this instability by taking object masks as given during training, using cheap, off-the-shelf SAM proposals. We extend LeJEPA - whose distributional anti-collapse objective ports naturally from whole images to variable-sized sets of objects - to align object-centric representations rather than whole images. An additional instance-separating loss, which treats other objects in the same scene as negatives, further boosts downstream performance. Across two model scales and 10-100% of COCO, object-level LeJEPA outperforms image-level LeJEPA on tracking (DAVIS), classification (ImageNet-1k), segmentation (ADE20k), and re-identification (NAVI).
[174] arXiv:2607.02413 (cross-list from cond-mat.quant-gas) [pdf, html, other]: Title: Q-GAIN: A Python Package for Machine Learning and Physically Informed Analysis Applications

M. Doris, S. Guo, S. M. Koh, L. Ritter, A. R. Fritsch, S. Mukherjee, I. B. Spielman, J. P. Zwolak

Comments: Submission to SciPost, 20 pages with 4 figures

Subjects: Quantum Gases (cond-mat.quant-gas); Machine Learning (cs.LG)

Here we describe the quantum gas analysis and inference (Q-GAIN) Python package, which enables rapid deployment of machine learning (ML) and physics-informed analysis techniques for cold-atom experiments. Out of the box, Q-GAIN implements classification, object detection, and physics-informed metrics for feature detection in images of atomic Bose-Einstein condensates (BECs). Q-GAIN encourages a natural, module-based workflow: starting with data loading and preprocessing, followed by ML-based feature identification, and ending with conventional analysis techniques. We demonstrate this modularity by configuring Q-GAIN for three ML tasks. First, we demonstrate the basic workflow of the Q-GAIN framework by implementing the standard task of classifying handwritten digits from the MNIST dataset. Then, we re-implement our earlier soliton detection (SolDet) package in the Q-GAIN framework, enabling the detection and analysis of solitonic excitations in time-of-flight data. Finally, we develop an object-detection tool that identifies quantized vortices in images of ring-shaped BECs.
[175] arXiv:2607.02417 (cross-list from cs.RO) [pdf, html, other]: Title: LIME: Learning Intent-aware Camera Motion from Egocentric Video

Boyang Sun, Jiajie Li, Yung-Hsu Yang, Chenyangguang Zhang, Tim Engelbracht, Sunghwan Hong, Cesar Cadena, Marc Pollefeys, Hermann Blum

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.
[176] arXiv:2607.02444 (cross-list from quant-ph) [pdf, html, other]: Title: Optimal Stabilizer Testing and Learning with Limited Quantum Memory

Srinivasan Arunachalam, Louis Schatzki

Comments: 66 pages, 5 figures

Subjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG)

We study stabilizer state testing and learning with limited coherent quantum memory. Here an algorithm sequentially receives copies of an unknown $n$-qubit state, but may keep only $k$ qubits of coherent quantum memory between measurements. With unrestricted memory, seminal work of Gross, Nezami and Walter showed how to test $n$-qubit stabilizer states using $6$ copies, which is dimension independent, unlike the learning complexity of $\Theta(n)$. We show that this testing-vs-learning separation is lost under memory constraints. More concretely we show that
(1) The sample complexity of testing stabilizer states in the $k$-qubit memory framework is $\Theta(n-k)$. Our upper bound goes via a novel connection to the hidden shift problem and the lower bound is proven using a novel approach to average case bounds on likelihood ratios via combinatorics of the stochastic orthogonal group.
(2) The sample complexity of learning stabilizer states with $k$ qubits of memory, in the non-adaptive framework, is $\Theta(n^2/k)$.
As a further application of our techniques, we prove an exponential lower bound for purity testing even when the memory may be left coherent throughout the protocol. Our main results identify coherent quantum memory as the resource enabling the usual separation between stabilizer testing and learning. In particular, even with $k=0.99n$ qubits of memory, there is no constant-copy stabilizer tester; furthermore for $k=cn$ qubits of memory (for $0< c < 1$), stabilizer testing is as hard as learning, with both requiring $\Theta(n)$ copies.
[177] arXiv:2607.02461 (cross-list from cs.CV) [pdf, html, other]: Title: OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

Donghyun Lee, Jitesh Chavan, Duy Nguyen, Sam Huang, Liming Jiang, Priyadarshini Panda, Timo Mertens, Saurabh Shukla

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.
[178] arXiv:2607.02496 (cross-list from cs.RO) [pdf, html, other]: Title: Controllable Sim Agents with Behavior Latents

Juanwu Lu, Junyu Zhu, Ziran Wang

Comments: 23 pages, 5 tables, 8 figures

Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update, conditioning a rectified-flow trajectory generator trained on a mixed channel-mask curriculum for classifier-free guidance. To tackle scarcity in reward signals, we propose soft eligibility gates that replace hard binary thresholds with smooth exponential decay, preserving the gradient signal for near-threshold agents. On the Waymo Open Motion Dataset, CNeVA attains competitive realism on the benchmark while exposing per-channel controllability that the higher-ranked imitation models lack. Speed- and acceleration-based steering produces monotone responses without stall-induced reward hacking. Safety controllability is monotone and substantial with the introduction of soft eligibility. We manage to achieve steerable map compliance under a context-residual return measure. Furthermore, our experiment demonstrates that steering metrics must be read alongside physical-plausibility guardrails to avoid reward-hacking confounds.
[179] arXiv:2607.02507 (cross-list from cs.AI) [pdf, html, other]: Title: What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah, Shahriar Noroozizadeh

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a $\sim$3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures, such as career risk or sponsorship obligation. The findings suggest that agent evaluation should extend beyond explicit goals and detect emergent objectives. We present a dual-channel evaluation framework and complementary behavioral measures that operationalize this assessment.
[180] arXiv:2607.02510 (cross-list from cs.AI) [pdf, html, other]: Title: Online Safety Monitoring for LLMs

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick

Comments: ICML 2026 Hypothesis Testing Workshop

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
[181] arXiv:2607.02513 (cross-list from cs.CL) [pdf, html, other]: Title: LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach, Verna Dankers

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art(SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.

[182] arXiv:2502.00973 (replaced) [pdf, html, other]: Title: A Wearable Device Dataset for Mental Health Assessment Using Laser Doppler Flowmetry and Fluorescence Spectroscopy Sensors

Minh Ngoc Nguyen, Khai Le-Duc, Tan-Hanh Pham, Trong Nhan Nguyen, Bailey Trang, Ba Kien Tran, Viktor Dremin, Sergei Sokolovsky, Edik Rafailov, Truong-Son Hy

Comments: Communications Medicine 2026

Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Mental health problems such as stress, anxiety, and depression affect millions of people worldwide. These conditions are usually assessed using questionnaires, which rely on how people describe their own feelings. In this study, we explore whether a wearable device can help measure mental health using physical signals from the body. The device records small changes in blood flow and tissue activity from the fingertip. We collected data from 132 adults across 19 countries and compared these signals with mental health questionnaire results. We found that patterns in blood flow and tissue activity are linked to stress-related symptoms. This approach may help develop new tools for simple, non-invasive mental health monitoring in everyday life. Code and datasets are publicly available: this https URL
[183] arXiv:2506.09105 (replaced) [pdf, html, other]: Title: MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning

Javier Lopez-Piqueres, Pranav Deshpande, Archan Ray, Mattia J. Villani, Marco Pistoia, Niraj Kumar

Comments: Accepted version to TMLR

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)

We present MetaTT, a Tensor Train (TT) adapter framework for fine-tuning of pre-trained transformers. MetaTT enables flexible and parameter-efficient model adaptation by using a single shared TT to factorize transformer sub-modules. This factorization indexes key structural dimensions, including layer and matrix type, and can optionally incorporate heads and tasks. This design allows MetaTT's parameter count to scale with the sum, rather than the product, of the modes, resulting in a substantially more compact adapter. Our benchmarks compare MetaTT with LoRA along with recent state-of-the-art matrix and tensor decomposition based fine-tuning methods. We observe that when tested on single-task standard language modeling benchmarks, MetaTT achieves competitive parameter efficiency to accuracy tradeoff. We further demonstrate that MetaTT performs competitively when compared to state-of-the-art methods on multi-task learning. Finally, we leverage the TT decomposition to design a rank adaptive optimizer inspired by the DMRG method from many-body physics. Our results demonstrate that integrating this approach with AdamW enhances optimization performance for a specified target rank.
[184] arXiv:2509.25136 (replaced) [pdf, other]: Title: BALF: Budgeted Activation-Aware Low-Rank Factorization for Fine-Tuning-Free Model Compression

David González-Martínez

Subjects: Machine Learning (cs.LG)

Activation-aware low-rank factorization techniques yield strong compression results but are generally confined to linear layers, while existing whitening-based theory typically makes an implicit full-rank assumption on activations. We introduce a layer representation framework that extends activation-aware factorization beyond linear layers, including standard and grouped convolutions. Within this framework, our whitening-based formulation is more general than prior ones, naturally covering rank-deficient activations, and yields an optimal low-rank projection that attains the reconstruction error of the best low-rank approximation to layer activations. The resulting singular spectrum provides a closed-form per-layer distortion proxy, which we use to allocate per-layer ranks under explicit FLOP or parameter-count budgets via a Lagrangian relaxation with negligible overhead. Together, these components form BALF, an end-to-end pipeline for efficient vision model compression. Across CNNs and vision transformers on CIFAR-10 and ImageNet-1K, BALF generally achieves higher accuracy than SVD-based factorization baselines at matched FLOP or parameter count targets and remains competitive with other fine-tuning-free compression techniques.
[185] arXiv:2510.14331 (replaced) [pdf, html, other]: Title: LLM Priors for ERM over Programs

Shivam Singhal, Priyadarsi Mishra, Eran Malach, Tomer Galanti

Comments: The first two authors contributed equally

Subjects: Machine Learning (cs.LG)

We study program-learning methods that are efficient in both samples and computation. Classical learning theory suggests that when the target admits a short program description, for example a short piece of ``Python code'', it can be learned from few examples by ERM over the program class. However, this approach relies on enumerating candidate programs, which is typically exponential in the description length; gradient-based training avoids this explicit search but, for some families of short programs, can require exponentially many samples to succeed. We propose \textsc{LLM-PV}, a propose-and-verify recipe that enables ERM-style selection over a discrete program class without exhaustive enumeration: a pretrained LLM induces a proposal distribution over candidate programs, each proposal is executed and scored on a held-out validation set, and the best program is selected, with no gradient updates or validation feedback used to adapt the sampling distribution. Across algorithmic tasks including parity variants, pattern matching, and primality testing, \textsc{LLM-PV} often recovers the exact underlying rule from a small labeled set and generalizes far beyond the training sequence lengths, while SGD-trained transformers, fine-tuning, in-context learning, and classical ML baselines can fit the training data yet fail to generalize reliably. Together, these results suggest that pretrained LLM priors can serve as effective search biases for ERM, narrowing the gap between statistical and computational efficiency.
[186] arXiv:2511.02496 (replaced) [pdf, html, other]: Title: Geometry as a Missing Axis of Representation Quality: The Variational Geometric Information Bottleneck under Data Scarcity

Ronald Katende

Comments: 25 pages, 12 tables, 8 Figures

Subjects: Machine Learning (cs.LG)

We study latent geometry as an explicit component of representation quality in data-scarce learning. For an encoder (\phi), we define (Q_{\beta,\gamma}(\phi)=I(\phi(X);Y)-\beta\mathcal C(\phi)-\gamma d_{\mathrm{int}}(\phi)), combining task-relevant information with penalties for curvature and intrinsic latent dimension. Thus geometry becomes part of the bottleneck criterion, not only a post hoc diagnostic.
Under smooth-manifold, loss-transfer, and estimator-concentration assumptions, we derive non-asymptotic low-label generalization bounds where intrinsic dimension and covering complexity enter explicitly. We characterize the information--geometry frontier and prove empirical-surrogate consistency. The analysis links encoder geometry to learning through latent covering numbers, loss-class entropy, and uniform deviation.
We instantiate the theory as \texttt{V-GIB}, adding curvature and dimension penalties to variational bottleneck training. Real low-label benchmarks compare \texttt{V-GIB} with ERM, VIB, and ablations across (1%)--(20%) label fractions. Results show improved performance and reduced geometric complexity in several regimes, especially FashionMNIST and CIFAR-10, while confirming that no fixed regularizer is universally dominant.
[187] arXiv:2511.12398 (replaced) [pdf, html, other]: Title: On the Dimension-Free Approximation of Deep Neural Networks for Symmetric Korobov Functions

Yulong Lu, Tong Mao, Jinchao Xu, Yahong Yang

Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Deep neural networks have been widely used as universal approximators for functions with inherent physical structures, including permutation symmetry. In this paper, we construct symmetric deep neural networks to approximate symmetric Korobov functions and prove that both the convergence rate and the constant prefactor scale at most polynomially with respect to the ambient dimension. This represents a substantial improvement over prior approximation guarantees that suffer from the curse of dimensionality. Building on these approximation bounds, we further derive a generalization-error rate for learning symmetric Korobov functions whose leading factors likewise avoid the curse of dimensionality.
[188] arXiv:2512.02657 (replaced) [pdf, html, other]: Title: Locality-Aware Continual Unlearning for Diffusion Models

Naveen George, Naoki Murata, Yuhta Takida, Konda Reddy Mopuri, Yuki Mitsufuji

Comments: Accepted to ECCV 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Real-world deployment of text-to-image diffusion models requires continual concept removal as new privacy, copyright, or safety obligations arise over time. Existing unlearning methods, however, are designed for single-step deletion and collapse after only 3-5 sequential applications. We trace this instability to two compounding factors: (i) coarse mapping targets that cause degradation to accumulate unnecessarily across steps, and (ii) the absence of local protection for semantically neighboring concepts, whose shared internal representations make them the first to suffer collateral damage. Because this damage is strongest in the local semantic neighborhood of the forget concept, global replay alone cannot prevent it. Building on this analysis, we propose Locality-Aware Continual Unlearning (LACU), a framework with two complementary mechanisms. Locality-Aware Target Selection chooses, for each forget prompt, the context-preserving mapping prompt that the diffusion model itself treats as most similar to the original prompt, measured by score-prediction distance (how differently the model denoises the same noisy image under two text conditions), ensuring each update is as small and targeted as possible. Locality-Aware Replay uses the same metric to identify the retain concepts closest to the forget concept in the model's own representation and replays them as a local functional regularizer, directly shielding the most vulnerable neighborhood. Combined with teacher-student distillation and lightweight $\ell_2$ parameter regularization, LACU maintains stable unlearning over 10 sequential steps, preserving significantly higher related retention ($RR_{\text{acc}}$) and general retention ($GR_{\text{acc}}$) than recent baselines.
[189] arXiv:2512.07843 (replaced) [pdf, other]: Title: ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin

Comments: Accepted as an oral paper at ICML 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but their inherently sequential decoding incurs substantial latency, motivating parallelization of the generation process. However, existing parallel reasoning approaches suffer from performance degradation compared to their sequential counterparts, and often rely on specialized inference engines. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that matches the accuracy of comparably sized sequential reasoning models while significantly reducing inference latency via three key innovations: 1) a two-stage parallel trajectory generator that produces high-quality parallel chain-of-thought data for supervised fine-tuning; 2) a trie-based rollout design that enables parallel reasoning on any off-the-shelf autoregressive inference engine; and 3) a parallelization-aware reinforcement learning framework that trains the model to balance reasoning accuracy with effective parallelization. Across six challenging math reasoning benchmarks, ThreadWeaver trained on top of Qwen3-8B achieves performance on par with cutting-edge sequential reasoning models (79.9% on AIME24 and 71.9% on average) while delivering up to 1.53x speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
[190] arXiv:2601.12704 (replaced) [pdf, html, other]: Title: Adaptively trained Physics-informed Radial Basis Function Neural Networks for Solving Multi-asset Option Pricing Problems

Yan Ma, Yumeng Ren, Elisabeth Larsson

Comments: 30 pages,16 figures

Subjects: Machine Learning (cs.LG)

The present study investigates the numerical solution of Black-Scholes partial differential equation (PDE) for option valuation with multiple underlying assets. We develop a physics-informed (PI) machine learning algorithm based on a radial basis function neural network (RBFNN) that concurrently optimizes the network architecture and predicts the target option price. The physics-informed radial basis function neural network (PIRBFNN) combines the strengths of the traditional radial basis function collocation method and the physics-informed neural network machine learning approach to effectively solve PDE problems in the financial context. By employing a PDE residual-based technique to adaptively refine the distribution of hidden neurons during the training process, the PIRBFNN facilitates accurate and efficient handling of multidimensional option pricing models featuring non-smooth payoff conditions. The validity of the proposed method is demonstrated through a set of experiments encompassing a single-asset European put option, a double-asset exchange option, and a four-asset basket call option.
[191] arXiv:2601.15212 (replaced) [pdf, html, other]: Title: ZENITH: Automated Gradient Norm Informed Stochastic Optimization

Dhrubo Saha

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero-overhead Evolution using Norm-Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in lower wall-clock time than baselines. It also yielded superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R-CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.
[192] arXiv:2601.21718 (replaced) [pdf, html, other]: Title: When Does Predictive Inverse Dynamics Outperform Behavior Cloning?

Lukas Schäfer, Pallavi Choudhury, Abdelhak Lemkhenter, Chris Lovett, Somjit Nath, Luis França, Matheus Ribeiro Furtado de Mendonça, Alex Lamb, Riashat Islam, Siddhartha Sen, John Langford, Katja Hofmann, Sergio Valcarcel Macua

Comments: To be published in proceedings of the International Conference on Machine Learning (ICML), 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDMs) that combine a future-state predictor with an inverse dynamics model. While PIDMs often outperform BC, the reasons behind their benefits remain unclear. In this paper, we provide a theoretical explanation: PIDMs introduce a tradeoff. Conditioning the IDM on the predicted future state can significantly reduce variance, but the prediction itself introduces additional bias and variance. We establish conditions for PIDMs to achieve higher sample efficiency and lower prediction error than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance. Results are also illustrated in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66\% more samples than PIDM.
[193] arXiv:2602.00722 (replaced) [pdf, html, other]: Title: Spectral Imbalance Causes Forgetting in Low-Rank Continual Adaptation

Hao Gu, Mao-Lin Luo, Zi-Hao Zhou, Han-Chen Zhang, Min-Ling Zhang, Tong Wei

Comments: 21 pages, 7 figures

Subjects: Machine Learning (cs.LG)

Parameter-efficient continual learning aims to adapt pre-trained models to sequential tasks without forgetting previously acquired knowledge. Most existing approaches treat continual learning as avoiding interference with past updates, rather than considering what properties make the current task-specific update naturally preserve previously acquired knowledge. From a knowledge-decomposition perspective, we observe that low-rank adaptations exhibit highly imbalanced singular value spectra: a few dominant components absorb most of the adaptation energy, thereby (i) more likely to disrupt previously acquired knowledge and (ii) making the update more vulnerable to interference from subsequent tasks. To enable explicit balance among components, we decouple the magnitude of the task update from its directional structure and formulate it as a constrained optimization problem on a restricted Stiefel manifold. We address this problem using a projected first-order method compatible with standard deep-learning optimizers used in vision-language models. Our method mitigates both backward and forward forgetting, consistently outperforming continual learning baselines. The implementation code is available at this https URL.
[194] arXiv:2602.01564 (replaced) [pdf, html, other]: Title: Local exponential stability of mean-field Langevin descent-ascent and associated particle system

Geuntaek Seo, Minseop Shin, Pierre Monmarché, Beomjun Choi

Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Optimization and Control (math.OC); Probability (math.PR)

We study the mean-field Langevin descent-ascent (MFL-DA), a coupled optimization dynamics on the space of probability measures for entropically regularized two-player zero-sum games, together with its associated interacting particle system. For general nonconvex-nonconcave payoffs, Wang and Chizat (COLT 2024) asked whether the original single-timescale MFL-DA converges to the mixed Nash equilibrium and, if so, at what rate. We prove a local affirmative answer in Wasserstein space: if the initial datum is sufficiently close to the mixed Nash equilibrium, then the mean-field dynamics converges to it exponentially fast at a quantitative rate. We further show that the finite-$N$ particle system inherits this stability up to times exponential in $N$, with an $N$-independent exponential rate modulo a finite-particle error floor. Combined with the recent counterexample of Mourrat and Pillaud-Vivien for MFL-DA, which shows that global convergence cannot hold in general, our theorem completes the positive local counterpart of the Wang-Chizat question: the mixed Nash equilibrium has a robust basin of attraction, stable under both the mean-field flow and its finite-particle approximation.
[195] arXiv:2602.02762 (replaced) [pdf, html, other]: Title: On the Sample Efficiency of Inverse Dynamics Models for Semi-Supervised Imitation Learning

Sacha Morin, Moonsub Byeon, Alexia Jolicoeur-Martineau, Sébastien Lachapelle

Comments: Accepted to ICML 2026

Subjects: Machine Learning (cs.LG)

Semi-supervised imitation learning (SSIL) consists in learning a policy from a small dataset of action-labeled trajectories and a much larger dataset of action-free trajectories. Some SSIL methods learn an inverse dynamics model (IDM) to predict the action from the current state and the next state. An IDM can act as a policy when paired with a video model (VM-IDM) or as a label generator to perform behavior cloning on action-free data (IDM labeling). In this work, we first show that VM-IDM and IDM labeling learn the same policy in a limit case, which we call the IDM-based policy. We then argue that the previously observed advantage of IDM-based policies over behavior cloning is due to the superior sample efficiency of IDM learning, which we attribute to two causes: (i) the ground-truth IDM tends to be contained in a lower complexity hypothesis class relative to the expert policy, and (ii) the ground-truth IDM is often less stochastic than the expert policy. We argue these claims based on insights from statistical learning theory and novel experiments, including a study of IDM-based policies using recent architectures for unified video-action prediction (UVA). Motivated by these insights, we finally propose an improved version of the existing LAPO algorithm for latent action policy learning. We experiment on the Procgen, Push-T and LIBERO benchmarks.
[196] arXiv:2602.03001 (replaced) [pdf, other]: Title: Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent

Hiroki Naganuma, Shagun Gupta, Youssef Briki, Ioannis Mitliagkas, Irina Rish, Parameswaran Raman, Hao-Jun Michael Shi

Comments: 8 pages, 2 figures, 4 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD's Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ($\ell_\infty$) and stochastic spectral descent (specSGD) / Muon ($\mathcal{S}_\infty$). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66\% for Signum and Muon on a 160 million parameter Llama model.
[197] arXiv:2602.05999 (replaced) [pdf, html, other]: Title: On the Role of Computation in Reinforcement Learning

Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach

Comments: ICML 2026, Website: this https URL

Subjects: Machine Learning (cs.LG)

How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed amount of parameters, still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set 31 different tasks spanning online and offline RL, we show that $(1)$ this architecture achieves stronger performance simply by using more compute, and $(2)$ stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual network using up to 5 times more parameters.
[198] arXiv:2602.08638 (replaced) [pdf, html, other]: Title: LEFT: Learnable Fusion of Tri-view Tokens for Unsupervised Time Series Anomaly Detection

Dezheng Wang, Tong Chen, Guansong Pang, Congyan Chen, Shihua Li, Hongzhi Yin

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

As a fundamental data mining task, unsupervised time series anomaly detection (TSAD) aims to build a model for identifying abnormal timestamps without assuming the availability of annotations. A key challenge in unsupervised TSAD is that many anomalies are too subtle to exhibit detectable deviation in any single view (e.g., time domain), and instead manifest as inconsistencies across multiple views like time, frequency, and a mixture of resolutions. However, most cross-view methods rely on feature or score fusion and do not enforce analysis-synthesis consistency, meaning the frequency branch is not required to reconstruct the time signal through an inverse transform, and vice versa. In this paper, we present Learnable Fusion of Tri-view Tokens (LEFT), a unified unsupervised TSAD framework that models anomalies as inconsistencies across complementary representations. LEFT learns feature tokens from three views of the same input time series: frequency domain tokens that embed periodicity information, time domain tokens that capture local dynamics, and multi-scale tokens that learn abnormal patterns at varying time series granularities. By learning a set of adaptive Nyquist-constrained spectral filters, the original time series is rescaled into multiple resolutions and then encoded, allowing these multi-scale tokens to complement the extracted frequency and time domain information. When generating the fused representation, we introduce a novel objective that reconstructs fine-grained targets from coarser multi-scale structure, and put forward an innovative time-frequency cycle consistency constraint to explicitly regularize cross-view agreement. As cross-view agreement is explicitly regularized during training, LEFT can adopt lightweight tri-view encoders while maintaining effective coordination among the three views.
[199] arXiv:2602.10431 (replaced) [pdf, html, other]: Title: QTALE: Quantization-Robust Token-Adaptive Layer Execution for LLMs

Kanghyun Noh, Jinheon Choi, Yulhwa Kim

Comments: 22 pages, 10 figures, 20 tables,

Journal-ref: ICML 2026 Poster

Subjects: Machine Learning (cs.LG)

Large language models (LLMs) demand substantial computational and memory resources, posing challenges for efficient deployment. Two complementary approaches have emerged to address these issues: token-adaptive layer execution, which reduces floating-point operations (FLOPs) by selectively bypassing layers, and quantization, which lowers memory footprint by reducing weight precision. However, naively integrating these techniques leads to additional accuracy degradation due to reduced redundancy in token-adaptive models. We propose QTALE (Quantization-Robust Token-Adaptive Layer Execution for LLMs), a novel framework that enables seamless integration of token-adaptive execution with quantization while preserving accuracy. Conventional token-adaptive methods reduce redundancy in two ways: (1) by limiting the diversity of training paths explored during fine-tuning, and (2) by lowering the number of parameters actively involved in inference. To overcome these limitations, QTALE introduces two key components: (1) a training strategy that ensures diverse execution paths are actively explored during fine-tuning, and (2) a post-training mechanism that allows flexible adjustment of the execution ratio at inference to reintroduce redundancy when needed. Experimental results show that QTALE enables seamless integration of token-adaptive layer execution with quantization, while keeping the accuracy gap to quantization-only models below 0.5% on CommonsenseQA benchmarks. By combining tokenadaptive execution for FLOPs reduction and quantization for memory savings, QTALE provides an effective solution for efficient LLM deployment.
[200] arXiv:2602.10545 (replaced) [pdf, html, other]: Title: $μ$pscaling small models: Principled warm starts and hyperparameter transfer

Yuxin Ma, Nan Chen, Mateo Díaz, Soufiane Hayou, Dmitriy Kunisky, Soledad Villar

Comments: 69 pages, 11 figures, closest to version to be published in ICML 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller ones to accelerate convergence. However, this method can be sensitive to hyperparameters that need to be tuned at the target upscaled model size, which is prohibitively costly to do directly. It remains unclear whether tuning hyperparameters on smaller models and extrapolating via scaling laws is sound in this setting. We address this with principled approaches to width-based upscaling and efficient hyperparameter tuning in this setting. Motivated by $\mu$P and any-dimensional architectures, we introduce a general upscaling method that, like Net2Net, copies and perturbs weights, but uses theoretically grounded, width-dependent scalings for the perturbation noise and optimizer hyperparameters. First, we prove that under zero perturbation, the upscaled model is functionally equivalent to the base model throughout training. Second, we extend the $\mu$P theory to enable infinite-width limit analysis and establish hyperparameter transfer for upscaled models, greatly reducing the tuning cost. We empirically demonstrate that this method is effective on realistic datasets and architectures.
[201] arXiv:2602.18528 (replaced) [pdf, html, other]: Title: Audio-Visual Continual Test-Time Adaptation without Forgetting

Sarthak Kumar Maharana, Akshay Mehra, Bhavya Ramakrishna, Yunhui Guo, Guan-Ming Su

Comments: ECCV 2026 & ICML 2026 Workshop Continual Adaptation at Scale: Towards Sustainable AI

Subjects: Machine Learning (cs.LG); Sound (cs.SD)

Audio-visual continual test-time adaptation involves continually adapting a source audio-visual model at test-time, to unlabeled non-stationary domains, where either or both modalities can be distributionally shifted, which hampers online cross-modal learning and eventually leads to poor accuracy. While previous works have tackled this problem, we find that SOTA methods suffer from catastrophic forgetting where the model's performance drops well below even the source model due to continual parameter updates at test-time. In this work, we first show that adapting only the modality fusion layer to a target domain not only improves performance on that domain but can also enhance performance on subsequent domains. Based on this strong cross-task transferability of the fusion layer's parameters, we propose a method, $\texttt{AVReCAP}$, that improves test-time performance of the models without access to any source data. Our approach works by using a selective parameter retrieval mechanism that dynamically retrieves the best fusion layer parameters from a buffer using only a small batch of test data. These parameters are then integrated into the model, adapted to the current test distribution, and saved back for future use. Extensive experiments on benchmark datasets involving unimodal and bimodal corruptions show our proposed $\texttt{AVReCAP}$ significantly outperforms existing methods while minimizing catastrophic forgetting.
[202] arXiv:2603.02112 (replaced) [pdf, html, other]: Title: Recursive Models for Long-Horizon Reasoning

Chenxiao Yang, Nathan Srebro, Zhiyuan Li

Comments: in ICML 2026

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Modern language models reason within bounded context, an inherent constraint that poses a fundamental barrier to long-horizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in isolated contexts. We prove that any computable problem admits a recursive decomposition of reasoning in which each subtask requires only exponentially smaller active context than standard autoregressive models; this strictly surpasses any context management approach confined to a single sequence, such as summarization. We further generalize our framework to modern agentic systems with arbitrary context processing and control flows, and prove that recursive models can achieve optimal power within this broader class. Experimentally, we test two settings: fine-tuning a pretrained base model for recursive SAT solving, and training a small model from scratch on Go traces generated by exact game-tree search. Both show improved long-horizon accuracy with small active contexts.
[203] arXiv:2603.12676 (replaced) [pdf, html, other]: Title: Disentangled Latent Dynamics Manifold Fusion for Solving Parameterized PDEs

Zhangyong Liang, Huanhuan Gao

Subjects: Machine Learning (cs.LG)

Generalizing neural surrogate models across different PDE parameters remains difficult because changes in PDE coefficients often make learning harder and optimization less stable. The problem becomes even more severe when the model must also predict beyond the training time range. Existing methods usually cannot handle parameter generalization and temporal extrapolation at the same time. Standard parameterized models treat time as just another input and therefore fail to capture intrinsic dynamics, while recent continuous-time latent methods often rely on expensive test-time auto-decoding for each instance, which is inefficient and can disrupt continuity across the parameterized solution space. To address this, we propose Disentangled Latent Dynamics Manifold Fusion (DLDMF), a physics-informed framework that explicitly separates space, time, and parameters. Instead of unstable auto-decoding, DLDMF maps PDE parameters directly to a continuous latent embedding through a feed-forward network. This embedding initializes and conditions a latent state whose evolution is governed by a parameter-conditioned Neural ODE. We further introduce a dynamic manifold fusion mechanism that uses a shared decoder to combine spatial coordinates, parameter embeddings, and time-evolving latent states to reconstruct the corresponding spatiotemporal solution. By modeling prediction as latent dynamic evolution rather than static coordinate fitting, DLDMF reduces interference between parameter variation and temporal evolution while preserving a smooth and coherent solution manifold. As a result, it performs well on unseen parameter settings and in long-term temporal extrapolation. Experiments on several benchmark problems show that DLDMF consistently outperforms state-of-the-art baselines in accuracy, parameter generalization, and extrapolation robustness.
[204] arXiv:2603.14198 (replaced) [pdf, html, other]: Title: Efficient Federated Conformal Prediction with Group-Conditional Guarantee

Haifeng Wen, Osvaldo Simeone, Hong Xing

Comments: 24 pages, 8 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Deploying trustworthy AI systems requires principled uncertainty quantification. Conformal prediction (CP) is a widely used framework for constructing prediction sets with distribution-free coverage guarantees. In many practical settings, including healthcare, finance, and mobile sensing, the calibration data required for CP are distributed across multiple clients, each with its own local data distribution. In this federated setting, data can often be partitioned into, potentially overlapping, groups, which may reflect client-specific strata or cross-cutting attributes such as demographic or semantic categories. We propose group-conditional federated conformal prediction (GC-FCP), a federated extension of conditional conformal calibration for a target mixture over prespecified groups. GC-FCP constructs mergeable, atom-stratified coresets from local calibration scores, enabling compact aggregation at the server when the number of active atoms is moderate. Experiments on synthetic and real-world datasets validate the performance of GC-FCP compared to centralized calibration baselines. The code of our work can be found at this https URL.
[205] arXiv:2603.27631 (replaced) [pdf, other]: Title: On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry

Mohammad Tinati, Stephen Tu

Comments: Conference on Learning Theory, 6197-6309

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.
[206] arXiv:2603.29466 (replaced) [pdf, html, other]: Title: An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms

Nils Grünefeld, Jes Frellsen, Christian Hardmeier

Comments: ProbML 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Existing methods for quantifying predictive uncertainty in neural networks are either computationally intractable for large language models or require access to training data that is typically unavailable. We derive a lightweight alternative through two approximations: a first-order Taylor expansion that expresses uncertainty in terms of the gradient of the prediction and the parameter covariance, and an isotropy assumption on the parameter covariance. Together, these yield epistemic uncertainty as the squared gradient norm and aleatoric uncertainty as the Bernoulli variance of the point prediction, from a single forward-backward pass through an unmodified pretrained model. We justify the isotropy assumption by showing that covariance estimates built from non-training data introduce structured distortions that isotropic covariance avoids, and that theoretical results on the spectral properties of large networks support the approximation at scale. Validation against reference Markov Chain Monte Carlo estimates on synthetic problems shows strong correspondence that improves with model size. We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate achieves the highest mean AUROC on TruthfulQA, where questions involve genuine conflict between plausible answers, but falls to near chance on TriviaQA's factual recall, suggesting that parameter-level uncertainty captures a fundamentally different signal than self-assessment methods.
[207] arXiv:2604.21254 (replaced) [pdf, html, other]: Title: Hyperloop Transformers

Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to perform well compared to depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. This performance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
[208] arXiv:2604.25421 (replaced) [pdf, html, other]: Title: FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

Changyu Li, Shuanghong Huang, Jiashen Liu, Ming Lei, Jidu Xing, Kaishun Wu, Lu Wang, Fei Luo

Comments: 18 pages, 14 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data. However, in mobile deployments, the training wall-clock is often dominated by straggler-limited uplink communication under heterogeneous bandwidth, intermittent participation, and non-IID client data. Although parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA reduce local memory and trainable parameters, repeated transmission of adapter updates remains a major bottleneck. We propose Fed-FSTQ, a semantic-sensitivity-aware communication-control primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ uses a lightweight token-level Fisher proxy to estimate semantic sensitivity, couples token-guided sparsification with mixed-precision adapter-update quantization, and allocates higher communication fidelity to semantically load-bearing evidence while suppressing redundant transmission. The method is drop-in compatible with standard federated PEFT pipelines and requires no change to the server aggregation rule. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46-fold relative to a Fed-LoRA baseline and improves straggler-limited wall-clock time-to-accuracy by 52%. Under the corrected Controlled LTE-20Mbps accounting, Fed-FSTQ reduces per-round time from 414.60s to 67.29s and reduces per-round energy from 839.20J to 146.28J, yielding a 6.16-fold speedup. On NVIDIA Jetson-class edge devices, Fisher-guided token reduction also yields up to a 1.55-fold inference speedup, demonstrating deployability under tight resource constraints.
[209] arXiv:2605.11007 (replaced) [pdf, html, other]: Title: The Transformer as a Polar State Estimator

Peter Racioppo

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

We show that the core components of the Transformer -- attention, residual connections, and normalization -- arise naturally from a single geometric state estimation problem. Modeling the latent state in polar form, with direction constrained to the hypersphere and uncertainty decomposed into radial and tangential components, yields a precision-weighted filtering procedure in which normalization enforces the hyperspherical constraint, attention aggregates directional evidence, and residual connections implement incremental state updates. Under suitable first-order approximations, this estimator reduces to the standard Transformer block with rotary positional encodings, showing that its architecture follows from the underlying estimation problem rather than from independent design choices. Retaining higher-order geometric corrections yields the proposed \textit{Polar Transformer}, which more faithfully approximates the underlying radial-tangential state estimator.
[210] arXiv:2605.11020 (replaced) [pdf, html, other]: Title: Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

Anish Diwan, Davide Tateo, Christopher E. Mower, Haitham Bou-Ammar, Jan Peters, Oleg Arenz

Comments: Accepted as a conference paper at the International Conference on Machine Learning (ICML) 2026. Revised to include review feedback

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL--one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.
[211] arXiv:2605.18224 (replaced) [pdf, html, other]: Title: A Simplex Witness Certificate and Escape Force for Constant Collapse in Variational Autoencoders

Zegu Zhang, Jian Zhang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

We study exact constant collapse in variational autoencoders: the deterministic encoder mean becomes independent of the input. The prior remains the standard Gaussian. Before VAE training, we select a fixed teacher posterior from a GMM-based view of the data and attach a fixed latent-only simplex witness to the encoder mean. This construction yields two linked objects. The first is a certificate: if the witness prediction improves on the best constant predictor of the teacher, the encoder mean cannot be input-independent constant. The second is a local escape direction: on the collapsed manifold, the teacher residual gives a sample-dependent descent direction for the alignment loss. For any full-support teacher posterior, the same geometry also gives a closed-form latent code with zero teacher-witness alignment error. Its scaled versions trace a margin-energy path from the constant predictor to the exact teacher code, which quantifies non-collapse inside the protected witness subspace. We instantiate the method on MNIST, CIFAR-10, and CIFAR-100. With searched unsupervised PCA-GMM teachers, vanilla VAEs fail the teacher-witness certificate in all five seeds on CIFAR-10 and CIFAR-100, while RST variants pass in all five seeds. Under collapse-stress settings with $\beta_{\mathrm{KL}}\in\{2,4,8\}$, vanilla VAE again fails in all seeds, whereas RST-alpha-prefit remains certificate-positive. Escape trajectories on both natural-image datasets increase the witness margin from a low-margin initialization and exhibit nonzero teacher-induced gradient norms. The analysis is confined to exact constant collapse of the encoder mean; generation quality, decoder use, and other collapse modes remain separate questions.
[212] arXiv:2605.22472 (replaced) [pdf, other]: Title: Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning

Julian Gutheil (1), Simon Hitzginger (1), Robert Legenstein (1) ((1) Graz University of Technology)

Comments: We have revised the theorem and its proof. We have also corrected some minor errors

Subjects: Machine Learning (cs.LG)

Winner-take-all (WTA) networks constitute a central circuit motif in cortical networks of the brain. In addition, WTA-like activations are abundant in modern deep learning models in the form of the softmax activation for example in attention layers of transformers. While their role in the extraction of latent factors has been studied for relatively simple generative models, their role in the context of highly non-linearly entangled latent factors has remained elusive. In this article, we show that a WTA bottleneck within a deep neural network can enforce under certain well-defined conditions the extraction of categorical latent factors of the data in a multi-task learning setup. In particular, we prove that the representation that emerges in the WTA bottleneck is highly symbolic, where a single neuron or a population of neurons encodes the presence of a single abstract feature such as a specific object, color, or position. We furthermore show empirically on two datasets, that this also holds for architectures and setups that do not fully comply with the assumptions of our theorem and demonstrate the advantages of the acquired symbolic representation for generalization. Our proposed model provides insights into the generalization capabilities of deep neural networks with WTA-like components and may serve as an interface between symbolic and subsymbolic AI systems.
[213] arXiv:2605.29283 (replaced) [pdf, html, other]: Title: Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen

Comments: 26 pages, 31 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.
[214] arXiv:2606.00342 (replaced) [pdf, html, other]: Title: PE-means: Improved Differentially Private $k$-means Clustering through Private Evolution

Thomas Humphries, Zinan Lin, Sergey Yekhanin

Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Databases (cs.DB)

We study the problem of differentially private (DP) $k$-means clustering in Euclidean space. Previous solutions rely on summing the private data directly, which induces a sensitivity proportional to the domain. We introduce PE-means, an extension of the private evolution (PE) algorithm (an increasingly popular method for synthetic data generation), to the problem of $k$-means clustering. The key advantage of PE is that it only computes a private histogram with constant sensitivity to guide the evolution. Our adaptation of PE includes new evolutionary operators for clustering, as well as other algorithmic improvements of independent interest. Overall, PE-means achieves an average improvement of 26% in clustering loss over state-of-the-art baselines such as Google's LSH-based algorithm and DP-Lloyd variants.
[215] arXiv:2606.03003 (replaced) [pdf, other]: Title: Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group

Hongbo Wang

Comments: 112 pages, 19 figures. v2 adds programme lineage to companion papers (arXiv:2606.13092, 2606.24945, 2606.24946), engages the equivariance-at-scale debate (arXiv:2410.23179), and adds experimental hardening: 5-seed CIs, frame-averaging/canonicalization baselines, a real-robot DROID anchor, a scale-vs-exactness curve. Core claims unchanged. Code: this https URL

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

A latent world model built from an equivariant encoder and predictor inherits a provable symmetry of its training loss: when the dynamics carries a group $G$ acting on latents by an orthogonal representation $\rho(g)$, the one-step prediction relMSE is exactly invariant across the whole group, so fitting a restricted slice of orientations mathematically determines it on the entire orbit. The symmetry survives a real Muon/AdamW+EMA+VICReg run -- composed residual $\sim 10^{-6}$ after training, under any optimiser (intrinsic Vector-Neuron/e3nn parametrisation) -- and one-step error is flat across the group (5-seed medians: equivariant $\times 1.00$ vs a higher-capacity non-equivariant baseline $\times 12.7$ in 2D, $\times 17.2$ in 3D), while that baseline fits the slice but breaks out-of-distribution. The flatness is not a synthetic artefact: on real-robot DROID end-effector trajectories the equivariant model stays flat across the orbit ($\times 1.000$, rotation residual $1.5\times 10^{-16}$) while a $4.5\times$-larger baseline degrades $\times 11$. One caution is load-bearing: flatness is necessary, not sufficient -- the theorem transports the in-distribution error level unchanged but does not lower it (3D relMSE $\approx 0.43$): across-group error is constant, not low. The same isometry lifts to a closed-loop corollary: under a matching equivariant planner the control error is invariant across the group -- float-floor-exact in 2D/SO(2), statistically flat in 3D/SE(3). Stress-tested against Sutton's Bitter Lesson (augmentation, scale, soft-equivariance), each closes at most the across-group task metric, never the float-floor exactness. This is the generalisation-side foundation of a certified-world-models programme (arXiv:2606.13092, 2606.24945, 2606.24946): flatness transports competence, and the trust bounds built on it are downstream products.
[216] arXiv:2606.07571 (replaced) [pdf, html, other]: Title: Enabling KV Caching of Shared Prefix for Diffusion Language Models

Younghun Go, Jaehoon Han, Changyong Shin, Chuck Yoo, Gyeongsik Yang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means that updating any token dynamically alters the entire context and its corresponding KVs. Thus, existing caching techniques developed for LLMs, which assume that KVs remain invariant once computed, corrupt the shared prefix KVs. Our experiments show that applying these techniques to DLMs causes model accuracy to collapse to near zero.
To unlock high-throughput DLM serving, we propose bidirectional prefix caching, bicache, the first KV caching technique for shared prefixes in DLMs. bicache is designed based on key observations from our comprehensive analysis: shared prefix KVs remain stable and reusable in shallow layers, while the depth of shallow layers depends on the fraction of shared prefix tokens in each request. Thus, bicache dynamically identifies a safe layer depth for reusing shared prefix KVs and eliminates redundant computation. Evaluations demonstrate that bicache significantly improves serving throughput by 36.3%-98.3% compared to existing techniques without accuracy collapse (only 0-1.8% difference).
[217] arXiv:2606.07591 (replaced) [pdf, html, other]: Title: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
[218] arXiv:2606.13092 (replaced) [pdf, html, other]: Title: Certified World Models: Predictability Across Configuration, Horizon, and Resolution

Hongbo Wang

Comments: 56 pages. v3: evidence hardening -- pendulum-ring mechanism doubled to n=30 seeds (Fisher p=9.5e-6), 5-task x 7-checkpoint multitask audit (0/35 cells reach the calibration band), certificate start-spread and measured episode-sensitivity analyses; prose pass; conclusions unchanged. Code: this https URL

Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Dynamical Systems (math.DS)

Scale buys interpolation; structure buys certifiable transfer. A world model's average error does not say whether a particular rollout can be trusted, or for how long. For equivariant latent world models we give a predictability certificate: a computable region spanning configuration, horizon, and resolution. Under exact equivariance, rollout error is invariant over the monoid generated by k primitive symmetries and is certified from the k generators (Theorem A); universal orbit-flatness over equivariant targets characterizes equivariance at the function level (Lemma 2), so an unconstrained architecture cannot certify the property by construction. Approximate orbit-transfer defects propagate by the finite-time Lyapunov spectrum (Theorem B): expanding channels give a logarithmic horizon $T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$, neutral channels accumulate recurrent defect linearly, and contracting channels accumulate a bounded nonzero floor. Exact conserved charge values are certified to all horizons only at zero defect; with one-step defect $\eta$, charge-value error grows at most as $T\eta$. Empirically, on a 40-dimensional learned model a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2=0.98$-$0.99$) where dense and recurrent baselines fail. A cone/adapted-metric certificate reads an a-priori horizon off the model's own Jacobian, tight on uniformly hyperbolic dynamics and self-abstaining elsewhere; the resulting horizon improves a budgeted re-observation decision. For public non-equivariant world models the tangent spectrum gives a training-free candidate horizon, paired with a held-out divergence cross-check that abstains or corrects when the learned loop over-promises.
[219] arXiv:2606.19984 (replaced) [pdf, html, other]: Title: Kolmogorov-Arnold Reservoir Computing

Juntian Huang, Jürgen Kurths, Ying Tang

Subjects: Machine Learning (cs.LG)

Reservoir computing offers a lightweight framework for forecasting dynamical systems but may struggle to capture long-range dependencies due to limited representational capacity. Conventional reservoir computing recurrently uses fixed reservoirs with hyperparameter sensitivity, while the next generation reservoir computing removes recurrence at the cost of rapidly growing feature dimensions. Here, we develop Kolmogorov-Arnold Reservoir Computing (KARC), which replaces reservoirs with explicit basis-function expansions inspired by the Kolmogorov-Arnold representation theorem. We rigorously show that KARC is a lightweight design of Kolmogorov-Arnold networks (KANs), preserving the potential expressive capacity of KANs while admitting efficient closed-form training of reservoir computing. At comparable cost, KARC outperforms existing reservoir computing methods on challenging benchmarks including partial differential equations. It can also be integrated with generative diffusion models for facilitating text-to-image generation. This work thus establishes a principled bridge between reservoir computing and KANs, yielding a unified framework for efficient dynamical forecasting and generative modeling.
[220] arXiv:2606.24945 (replaced) [pdf, html, other]: Title: When Do Conservation Laws Survive Learned Representations? Certified Horizons for Latent World Models

Hongbo Wang

Comments: 16 pages, including appendices. v2: second soft-survival system (Duffing double well, pre-registered) with a linear-oscillator anchor; 5-seed and step-size hardening of the state-Kepler result; 8-seed SympNet confirmation of the lift null. Code: this https URL

Subjects: Machine Learning (cs.LG); Robotics (cs.RO)

We ask a representation-learning question about physical world models: when does a conservation law remain certifiable after a model learns a latent representation? A certified horizon bounds -- in advance, from measurable model defects -- how many steps a rollout provably stays on a physical invariant's level set. The key design choice is what is certified: not a learned latent Hamiltonian or a learned scalar witness (a model can conserve either while drifting in true energy), but the decoded physical invariant obtained by decoding the latent state and evaluating the known invariant. Around this object we derive shell-horizon certificates whose budget decomposes into representation, readout, and latent-dynamics defects, with a monotone alignment bridge through which a soft learned witness yields a certified horizon for the decoded invariant, and test them across state, learned-lift, and pixel observations on conservative systems. Conservation certificates can survive learned representation, but not all geometric priors survive equally. Hard canonical symplectic structure yields the longest horizons in known phase coordinates yet does not cross a learned chart, whereas a controlled-Lipschitz-aligned soft invariant survives in the nonlinear learned-representation settings we test -- two lift systems, with the gain growing with nonlinearity, and pixels. Pixel certification is recovered on a readout-stable sub-tube, and the Kepler problem exposes a geometric boundary. The central object is therefore not a latent Hamiltonian, but a decoded physical invariant whose robustness to representation learning can be measured, certified, and falsified.
[221] arXiv:2606.28833 (replaced) [pdf, html, other]: Title: Active Quantum Kernel Acquisition for Gaussian Process Regression

Jian Xu, Artur Miroszewski, Delu Zeng, Qibin Zhao

Subjects: Machine Learning (cs.LG)

Quantum kernel estimation on near-term hardware is shot-budgeted: every entry of the kernel Gram matrix is a Bernoulli expectation that must be sampled with a finite number of circuit executions. Recent work on quantum kernel classification has shown that allocating shots non-uniformly across kernel entries, weighted by their downstream task sensitivity, can reduce the shot budget required to reach a target accuracy. We extend this idea to Gaussian process (GP) regression, a setting whose downstream quantities (full-spectrum posterior variance, log-determinant, marginal likelihood) couple to kernel error more tightly than the sign-only outputs of classification. We derive three closed-form pair-level sensitivities predictive coupling $|\alpha_i\alpha_j|$, leave-one-out residual, and marginal-likelihood gradient and plug them into a Neyman-style minimum-variance allocation rule. To prevent catastrophic over-concentration when the warm-up sensitivity estimate is itself noisy, we add a high uniform coverage floor justified by a Frobenius lower bound on the missing-entry perturbation. On four UCI benchmarks and two synthetic RBF + Bernoulli controlled studies, the resulting allocator delivers $10$--$21\%$ test-RMSE improvement over uniform allocation across the moderate-budget regime. The gain transfers (i) to genuine ZZ and Pauli-Z quantum kernels on quantum-natural data ($-13$--$15\%$ at low budget, $p<0.05$ paired) and (ii) to four downstream tasks (Bayesian quadrature, heteroscedastic regression, hyperparameter learning, multi-output Cokriging). On UCI features embedded into a ZZ kernel the gain disappears, consistent with the exponential-concentration regime where shot allocation has nothing to exploit.
[222] arXiv:2606.31191 (replaced) [pdf, html, other]: Title: ISM:Self-Improving Strategy Memory for Continual Mathematical Reasoning

Prakhar Dixit, Tim Oates

Comments: 3rd AI for Math Workshop at ICML 2026 Forty-Third International Conference on Machine Learning

Subjects: Machine Learning (cs.LG)

We propose Intelligent Schema Memory (ISM), a self-evolving memory-augmented system that improves mathematical reasoning for a frozen LLM under continual learning with hard episodic resets. ISM maintains a compact, self-refined bank of strategy schemas learned from both successful and failed episodes, with symbolic tools that check intermediate steps and certify answers. Without updating model parameters, ISM outperforms passive, retrieval, and reflection baselines on MATH-Hard and OlympiadBench, using 64% and 86% fewer schemas respectively than the strongest passive baseline. These results show that small, actively maintained, and verified strategy memories can support reliable continual mathematical reasoning under strict episodic isolation. The codebase is available at this https URL .
[223] arXiv:2606.31394 (replaced) [pdf, html, other]: Title: Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images

Jisung Park, Seohyeon Kang, Daeun Yoo, Eunsu Lee, Seoin Cho, Wooyeop Choi, Ian Choi, James R. Evan, Daesoo Kim, Sonia Gandhi, Minee L. Choi

Comments: 10 pages, 7 figures (plus 14 in appendix), 1 table, preprint

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Artificial intelligence is transforming our capability to solve biological challenges. In dimensionality bottleneck regimes exacerbated by high-dimensional biological data, neural networks force distinct concepts into the lower dimensions known as superposition. Although this superposition is widely known to hinder interpretability, its impact on corrupting the geometry of latent spaces remains critically overlooked. Here, we utilized sparse autoencoders (SAEs) trained on over 100,000 multiplexed images of patient-derived Parkinson's disease and healthy neurons to resolve superposition. This approach bypasses the mathematical non-uniqueness of feature attribution by shifting to interpretable latent representation analysis. We theoretically and empirically demonstrate that superposition contaminates representational metric spaces, and thereby SAEs successfully recover geometric fidelity. By treating these geometrically purified representations as single-cell state vectors, we adapted single-cell RNA sequencing (scRNA-seq) data analysis methodologies directly to the image domain. Finally, we introduce GW-map, utilizing Gromov-Wasserstein optimal transport to align these image representations with authentic scRNA-seq data de novo. This coupling reconstructs hierarchical neuronal pathology pathways such as Calcium-AIS scaffold, without reference spatial transcriptomics, establishing a scalable foundation for spatial biology. Code is available at this https URL\_superposition
[224] arXiv:2606.31536 (replaced) [pdf, html, other]: Title: Beyond the Expressivity-Trainability Paradox: A Dynamical Lie Algebra Perspective on Navigating Barren Plateaus in Quantum Machine Learning

Kung-Ming Lan, Edward Huang

Comments: 8 pages, 3 figures; added missing co-author, incorporated MBL literature in Section 2.4, and included Author Contributions and AI disclosure

Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)

As Quantum Machine Learning (QML) transitions toward practical implementation, the field faces a critical architectural bottleneck that challenges the fundamental assumptions of classical statistical learning theory. In classical deep learning, increasing model capacity typically risks overfitting. However, this study advances a counter-intuitive paradigm: unstructured contemporary QML architectures suffer from a profound state of quantum underfitting, driven by the "expressivity-trainability paradox." We demonstrate that the vast Hilbert space capacity of Parameterized Quantum Circuits (PQCs)-traditionally chased as the source of quantum advantage is the direct mathematical cause of Barren Plateaus (BPs), where gradient landscapes become exponentially flat. By synthesizing recent breakthroughs in Dynamical Lie Algebras (DLAs) and Geometric QML, we establish a comprehensive framework linking the algebraic dimension of circuit generators to their optimization dynamics. Furthermore, we empirically validate this framework on a non-linear binary classification task, illuminating a uniquely quantum manifestation of the bias-variance tradeoff: while unstructured architectures achieve near-perfect training accuracy via unscalable parameterization (quantum overfitting), embedding group-theoretic geometric priors acts as a structural regularizer. By restricting the DLA growth to a polynomial regime, our symmetry-preserving approach sacrifices raw memorization capacity to guarantee scalable, gradient-rich training landscapes, offering a robust roadmap for "Trainability-by-Design" in scalable quantum neural networks.
[225] arXiv:2607.01083 (replaced) [pdf, html, other]: Title: Staleness-Learning Rate Scaling Laws for Asynchronous RLHF

Jingwei Song, Haofeng Xu, Jie Xiao, Chengke Bao, Jingwei Shi, Pengbin Feng, Weixun Wang, Yuhang Han, Chuan Wu, Linfeng Zhang, Bill Shi

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

High-throughput RLHF systems often decouple rollout generation from policy optimization, leading to the use of stale rollouts during learner updates. In this work, we study the effect of such staleness in asynchronous GRPO. We make the behavior policy explicit in the GRPO surrogate objective and distinguish between the surrogate-gradient mapping used by the learner and the true total derivative of a distribution-dependent population objective. Under assumptions of local boundedness, distributional smoothness, and behavior-policy smoothness, we show that stale rollouts introduce a per-step surrogate-gradient bias of order O(S * eta), where S denotes the maximum rollout lag and eta denotes the learning rate. We further derive a conditional collapse-time scaling law: when within-cycle drift remains below a batch-level clipping radius, collapse is governed primarily by cumulative learner drift T * eta; when the stale-rollout constraint is active, stability instead depends explicitly on S * eta. This yields a two-constraint stability condition eta << min{R_batch / (S * G_upd), R_crit / (T * G_upd)}, explaining why the maximum stable learning rate may appear weakly dependent on staleness in the horizon-limited regime.
[226] arXiv:2607.01197 (replaced) [pdf, html, other]: Title: Quantum vs. Classical Machine Learning: A Unified Empirical Comparison

Chuanming Yu, Jiaming Liu, Zihao Ge, Xiongfei Wu, Lulu Zhu, Pengzhan Zhao, Jianjun Zhao

Comments: This paper has been accepted for a poster presentation at the 5th CCF Quantum Computation Conference (CQCC 2026) on August 3, 2026

Subjects: Machine Learning (cs.LG)

Quantum computing has emerged as a promising computational paradigm for machine learning (ML), with the potential to offer computational advantages over classical approaches. At this stage, the evidence supporting the performance and advantages of quantum machine learning (QML) models relative to classical models is insufficient. To address this gap, this paper presents an empirical study on the performance of QML models and their classical counterparts. We compare seven model pairs spanning supervised learning and reinforcement learning. Our results indicate that the evaluated quantum machine learning models do not yet surpass the classical baselines in overall prediction performance, policy stability, or training time. Nevertheless, QML remains a promising approach for filtering noise and controlling false positives. Our research findings summarize the challenges facing quantum machine learning across hardware environments, training efficiency, and convergence stability, providing a foundation for research into the robustness and parameter optimization of QML. This work is publicly available at this https URL.
[227] arXiv:2607.01232 (replaced) [pdf, html, other]: Title: Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Hongzhou Lin, Mingyi Hong

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
[228] arXiv:2209.04942 (replaced) [pdf, other]: Title: Learning Consumer Preferences from Bundle Sales Data

Ningyuan Chen, Setareh Farajollahzadeh, Qingwei Jin, Fanni Shen, Guan Wang

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Problem definition: This paper studies the problem of estimating consumer preferences from bundle sales data. Product bundling is a widely used pricing strategy in retail markets. To set profitable bundle selection and prices, the seller needs to learn the distribution of consumers' valuations for individual products from the transaction data. When customers purchase bundles or multiple products, classical methods such as discrete choice models cannot be used to estimate consumers' valuations. In this paper, we propose an approach to learn the distribution of consumers' valuations toward the products using bundle sales data. Methodology/results: Our approach is to define a utility model for customer choices and estimate the parameters of a valuation distribution that maximizes the likelihood of observing the transaction data. Our approach reduces this problem to an estimation problem where the samples are censored by polyhedral regions on the valuation space of customers. Using the EM algorithm and Monte Carlo simulation, our approach can recover the distribution of consumers' valuations. We extend the framework to allow for unobserved no-purchases, clustered market segments and to incorporate non-additive bundle utilities with synergy effects. In addition, we provide theoretical results on the identifiability of the probability model and sufficient conditions for local convergence of the EM algorithm. Moreover, the performance of the approach is also demonstrated numerically with synthetic and real datasets. Managerial implications: This study demonstrates the challenge to leverage the transaction data of bundle sales to learn customers' preferences. The proposed algorithm provides a practical guidance for retailers.
[229] arXiv:2311.17633 (replaced) [pdf, other]: Title: Introduction to Transformers: an NLP Perspective

Tong Xiao, Jingbo Zhu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models.
[230] arXiv:2503.07570 (replaced) [pdf, html, other]: Title: Split-n-Chain: Privacy-Preserving Multi-Node Split Learning with Blockchain-Based Auditability

Mukesh Sahani, Binanda Sengupta

Journal-ref: Cluster Computing 29, 418 (2026)

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Deep learning, when integrated with a large amount of training data, has the potential to outperform machine learning in terms of high accuracy. Recently, privacy-preserving deep learning has drawn significant attention of the research community. Different privacy notions in deep learning include privacy of data provided by data-owners and privacy of parameters and/or hyperparameters of the underlying neural network. Federated learning is a popular privacy-preserving execution environment where data-owners participate in learning the parameters collectively without leaking their respective data to other participants. However, federated learning suffers from certain security/privacy issues. In this paper, we propose Split-n-Chain, a variant of split learning where the layers of the network are split among several distributed nodes. Split-n-Chain achieves several privacy properties: data-owners need not share their training data with other nodes, and no nodes have access to the parameters and hyperparameters of the neural network (except that of the respective layers they hold). Moreover, Split-n-Chain uses blockchain to audit the computation done by different nodes. Our experimental results show that: Split-n-Chain is efficient, in terms of time required to execute different phases, and the training loss trend is similar to that for the same neural network when implemented in a monolithic fashion.
[231] arXiv:2503.21796 (replaced) [pdf, html, other]: Title: Meta-Representational Predictive Coding: Neuroscience-Informed Self-Supervised Learning

Alexander Ororbia, Karl Friston, Rajesh P. N. Rao

Comments: Significant updates to modeling framework, saccade planning + LGN units detailed, natural image experiments, new analyses and zero-shot generalization results now included

Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Self-supervised learning has become an increasingly important paradigm in the domain of machine intelligence. Furthermore, evidence for self-supervised adaptation, such as contrastive formulations, has emerged in recent computational neuroscience and brain-inspired research. Nevertheless, current work on self-supervised learning relies on biologically implausible credit assignment -- in the form of backpropagation of errors -- and feedforward inference, typically a forward-locked pass. Predictive coding, in its mechanistic form, offers a biologically plausible means to sidestep these backprop-specific limitations. However, unsupervised predictive coding rests on learning a generative model of raw input (akin to "generative AI" approaches), which entails predicting a potentially high dimensional input; on the other hand, supervised predictive coding, which learns a mapping between inputs to target labels, requires human annotation, and thus incurs the drawbacks of supervised learning. In this work, we present a scheme for self-supervised learning, specifically for an emerging research sub-domain that we label as neuroscience-informed self-supervised learning (NeuroSSL), within a neurobiologically plausible framework that appeals to the free energy principle, constructing a new form of predictive coding that we call meta-representational predictive coding (MPC). MPC sidesteps the need for learning a generative model of sensory input (e.g., pixel-level features) by learning to predict representations of the input across parallel streams, resulting in an encoder-only learning and inference scheme. This formulation notably rests on active inference (in the form of sensory glimpsing) to drive the learning of representations, i.e., the representational dynamics are driven by sequences of decisions made by the model to sample informative portions of its sensorium.
[232] arXiv:2503.24009 (replaced) [pdf, html, other]: Title: Learning 3D-Gaussian Simulators from RGB Videos

Mikel Zhobro, Andreas René Geist, Georg Martius

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Realistic simulation is critical for applications ranging from robotics to animation. Learned simulators have emerged as a possibility to capture real world physics directly from video data, but very often require privileged information such as depth information, particle tracks and hand-engineered features to maintain spatial and temporal consistency. These strong inductive biases or ground truth 3D information help in domains where data is sparse but limit scalability and generalization in data rich regimes. To overcome the key limitations, we propose 3DGSim, a learned 3D simulator that directly learns physical interactions from multi-view RGB videos. 3DGSim unifies 3D scene reconstruction, particle dynamics prediction and video synthesis into an end-to-end trained framework. It adopts MVSplat to learn a latent particle-based representation of 3D scenes, a Point Transformer for particle dynamics, a Temporal Merging module for consistent temporal aggregation and Gaussian Splatting to produce novel view renderings. By jointly training inverse rendering and dynamics forecasting, 3DGSim embeds the physical properties into point-wise latent features. This enables the model to capture diverse physical behaviors, from rigid to elastic, cloth-like dynamics, and boundary conditions (e.g. fixed cloth corner), along with realistic lighting effects that also generalize to unseen multibody interactions and novel scene edits.
[233] arXiv:2504.03711 (replaced) [pdf, html, other]: Title: A Survey of Circuit Foundation Model: Foundation AI Models for VLSI Circuit Design and EDA

Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Yuchao Wu, Yuzhe Ma, Zhiyao Xie

Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Artificial intelligence (AI)-driven electronic design automation (EDA) techniques have been extensively explored for VLSI circuit design applications. Most recently, foundation AI models for circuits have emerged as a new technology trend. Unlike traditional task-specific AI solutions, these new AI models are developed through two stages: 1) self-supervised pre-training on a large amount of unlabeled data to learn intrinsic circuit properties; and 2) efficient fine-tuning for specific downstream applications, such as early-stage design quality evaluation, circuit-related context generation, and functional verification. This new paradigm brings many advantages: model generalization, less reliance on labeled circuit data, efficient adaptation to new tasks, and unprecedented generative capability. In this paper, we propose referring to AI models developed with this new paradigm as circuit foundation models (CFMs). This paper provides a comprehensive survey of the latest progress in circuit foundation models, unprecedentedly covering over 130 relevant works. Over 90% of our introduced works were published in or after 2022, indicating that this emerging research trend has attracted wide attention in a short period. In this survey, we propose to categorize all existing circuit foundation models into two primary types: 1) encoder-based methods performing general circuit representation learning for predictive tasks; and 2) decoder-based methods leveraging large language models (LLMs) for generative tasks. For our introduced works, we cover their input modalities, model architecture, pre-training strategies, domain adaptation techniques, and downstream design applications. In addition, this paper discussed the unique properties of circuits from the data perspective. These circuit properties have motivated many works in this domain and differentiated them from general AI techniques.
[234] arXiv:2506.01623 (replaced) [pdf, html, other]: Title: MAGIK: Mapping to Analogous Goals via Imagination-enabled Knowledge Transfer

Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Humans excel at analogical reasoning - applying knowledge from one task to a related one with minimal relearning. In contrast, reinforcement learning (RL) agents typically require extensive retraining even when new tasks share structural similarities with previously learned ones. In this work, we propose MAGIK, a novel framework that enables RL agents to transfer knowledge to analogous tasks without interacting with the target environment. Our approach leverages an imagination mechanism to map entities in the target task to their analogues in the source domain, allowing the agent to reuse its original policy. Experiments on custom MiniGrid and MuJoCo tasks show that MAGIK achieves effective zero-shot transfer using only a small number of human-labelled examples. We compare our approach to related baselines and highlight how it offers a novel and effective mechanism for knowledge transfer via imagination-based analogy mapping.
[235] arXiv:2506.06323 (replaced) [pdf, html, other]: Title: Composite Reward Design in PPO-Driven Adaptive Filtering

Abdullah Burkan Bereketoglu

Comments: 8 pages, 4 figures, 2 table, 26th International Conference on Computational Science - Workshops (MLDADS-26) ,Keywords: Reinforcement learning, Adaptive filtering, Noise reduction, PPO

Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)

Model-free and reinforcement learning-based adaptive filtering methods are gaining traction for denoising in dynamic, non-stationary environments such as wireless signal channels, biomedical monitoring, and sensor networks. Traditional filters such as LMS, RLS, Wiener, and Kalman are often limited by assumptions of stationarity, the need for exact noise statistics, or fragile parameter tuning. This paper proposes an adaptive filtering framework using Proximal Policy Optimization (PPO), guided by a composite reward that balances SNR improvement, MSE reduction, and residual smoothness. We frame adaptive filtering as a Markov decision process and train a PPO agent to adjust filter coefficients directly in response to changing noise. Experiments on synthetic nonstationary signals with diverse noise types show that the PPO agent generalizes beyond its training distribution. Moreover, real-world analysis is made and evaluated on ECG recordings from the MIT-BIH Noise Stress Test Database corrupted by baseline wander, electrode motion, and muscle artifacts. The learned PPO policy achieves real-time inference and slightly outperforms strong classical baselines on ECG denoising. These results demonstrate the viability of policy-gradient reinforcement learning as a computationally efficient and flexible tool for adaptive filtering in nonlinear, time-varying dynamical systems.
[236] arXiv:2507.02964 (replaced) [pdf, html, other]: Title: Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens

Salahuddin Salahuddin, Ahmed Hussain, Jussi Löppönen, Toni Jutila

Comments: 19 Pages; Updated content and authors list

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

The increasing scale of AI workloads demands High-Performance Computing (HPC) infrastructure and training methodologies that are both scalable and sustainable. While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models often lack the specialized domain knowledge necessary for effective cybersecurity analysis. We investigate Domain-Adaptive Continuous Pretraining (DAP) as a scalable, resource-efficient methodology for enhancing cybersecurity understanding in pretrained LLMs, implemented through a distributed Fully Sharded Data Parallel (FSDP) pipeline across multi-node GPU clusters. We systematically adapted three decoder-based architectures -- Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct -- using a curated 126-million-word cybersecurity corpus from standards, academic literature, and technical documentation. Evaluation across three cybersecurity benchmarks -- CTI-MCQ, CyberMetric, and SecEval -- demonstrates consistent improvements post-adaptation. Notably, our Llama-3.3-70B-Ins-DAP model achieves state-of-the-art performance with accuracies of 0.718, 0.933, and 0.864, respectively, surpassing parameter-efficient baselines and specialized models including Llama-Primus-Base (trained on 2.77 billion tokens) and Foundation-Sec-8B (trained on 5 billion tokens), despite utilizing only 118.8 million tokens -- representing a 23-to-42-fold reduction in training data. Targeted continuous pretraining via scalable HPC infrastructure enables effective cybersecurity domain adaptation with a substantially reduced computational and energy footprint, supporting specialized AI assistants in threat analysis, vulnerability assessment, and security documentation, while advancing sustainable and responsible AI development.
[237] arXiv:2509.11606 (replaced) [pdf, html, other]: Title: Scaling to Multimodal and Multichannel Heart Sound Classification with Synthetic and Augmented Biosignals

Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong

Comments: 35 pages, 37 figures, 19 tables

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Signal Processing (eess.SP)

Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to classify abnormal heart sounds indicative of CVDs using synchronised phonocardiogram (PCG) and electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However, state-of-the-art architectures remain underutilised due to the limited availability of synchronised and multichannel datasets. Augmented datasets and pre-trained models provide a pathway to overcome these limitations, enabling transformer-based architectures to be trained effectively. This work combines traditional signal processing with denoising diffusion models, WaveGrad and DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets. The approach achieves state-of-the-art performance. On the Computing in Cardiology (CinC) 2016 dataset of single channel PCG, accuracy, unweighted average recall (UAR), sensitivity, specificity and Matthew's correlation coefficient (MCC) reach 92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%, 92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR, sensitivity, specificity and MCC, respectively. Using a wearable vest dataset consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR, 86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results demonstrate the effectiveness of transformer-based models for CVD detection when supported by augmented datasets, highlighting their potential to advance multimodal and multichannel heart sound classification.
[238] arXiv:2510.06288 (replaced) [pdf, html, other]: Title: BuilderBench: The Building Blocks of Intelligent Agents

Raj Ghugare, Roger Creus Castanyer, Catherine Ji, Kathryn Wantlin, Jin Schofield, Karthik Narasimhan, Benjamin Eysenbach

Comments: Blogpost: this https URL and Code: this https URL

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Today's AI models learn primarily through mimicry and refining, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills by exploring and learning through experience. Finding a scalable learning mechanism for developing agents that learn through interaction remains a major open problem. In this work, we introduce BuilderBench, a benchmark to accelerate research into agent training that centers open-ended exploration. BuilderBench requires agents to learn how to build any structure using blocks. BuilderBench is equipped with (1) a simulator of a robot interacting with various physical blocks, and (2) a task-suite with over 50 diverse target structures that are carefully curated to test an understanding of physics, mathematics, and long-horizon planning. Agents are provided with a target structure at the start, and can interact with the environment for multiple episodes to experiment and learn various skills for building the structure. Solving these tasks requires \emph{embodied reasoning} in a way that is not reflected in words but rather in actions, experimenting with different strategies and piecing them together. Our experiments with multiple state-of-the-art frontier language model based agents and tabula rasa reinforcement learning algorithms show that these agents cannot solve any of the non-trivial tasks in the BuilderBench. Our analysis throws light on the lack of exploration abilities in these models.
[239] arXiv:2510.16031 (replaced) [pdf, other]: Title: A Storm-Centric 250 m NEXRAD Level-II Dataset for High-Resolution ML Nowcasting

Andy Shi

Comments: Dataset construction yielded insufficient event coverage to support the intended ML use cases

Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Machine learning-based precipitation nowcasting relies on high-fidelity radar reflectivity sequences to model the short-term evolution of convective storms. However, the development of models capable of predicting extreme weather has been constrained by the coarse resolution (1-2 km) of existing public radar datasets, such as SEVIR, HKO-7, and GridRad-Severe, which smooth the fine-scale structures essential for accurate forecasting. To address this gap, we introduce Storm250-L2, a storm-centric radar dataset derived from NEXRAD Level-II and GridRad-Severe data. We algorithmically crop a fixed, high-resolution (250 m) window around GridRad-Severe storm tracks, preserve the native polar geometry, and provide temporally consistent sequences of both per-tilt sweeps and a pseudo-composite reflectivity product. The dataset comprises thousands of storm events across the continental United States, packaged in HDF5 tensors with rich context metadata and reproducible manifests.
[240] arXiv:2510.17426 (replaced) [pdf, html, other]: Title: Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging

Tiancheng Hu, Benjamin Minixhofer, Nigel Collier

Comments: ACL 2026 Findings

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
[241] arXiv:2511.05536 (replaced) [pdf, other]: Title: Gravity-Awareness: Deep Learning Models and LLM Simulation of Human Awareness in Altered Gravity

Bakytzhan Alibekov, Alina Gutoreva, Elisa Raffaella-Ferre

Comments: 60 pages, 5 figures, 2 datasets, 1 protocol

Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Earth s gravity fundamentally shapes human behaviour. The brain encodes this force as an internal model of gravity, enabling the prediction and interpretation of gravitational effects during perception and action. Understanding how this model adapts to altered gravity is critical for predicting human performance in spaceflight. We present a computational framework for modelling neurophysiological adaptation across diverse gravitational environments. The framework has two components trained on open-access data from altered-gravity studies, particularly parabolic flights. The first component (CorticalG) employs a lightweight multilayer perceptron neural network to predict gravity-dependent changes in EEG frequency bands, estimating cortical state under different gravitational loads. The second component (PhysioG) uses independent Gaussian process models to capture broader physiological responses, including heart rate variability, electrodermal activity, and motor control. To complement the quantitative modelling, we simulated subjective experience across gravitational environments using the Large Language Model (LLM) Claude 3.5 Sonnet. Physiological outputs prompted the model to generate narratives describing alertness, bodily awareness, and cognitive state across zero gravity, partial gravity of the Moon and Mars, and hypergravity. This framework provides a novel approach for investigating human adaptation to spaceflight. It offers a predictive tool to assess performance and resilience, supporting the design of future space exploration missions.
[242] arXiv:2512.07541 (replaced) [pdf, html, other]: Title: High-Dimensional Change Point Detection via Graph Spanning Ratio

Katerina Papagiannouli, Yang-wen Sun, Vladimir Spokoiny

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Inspired by graph-based methodologies, we introduce a novel graph-spanning algorithm designed to identify changes in both offline and online data across low to high dimensions. This versatile approach is applicable to Euclidean and graph-structured data with unknown distributions, while maintaining control over error probabilities. Theoretically, we demonstrate that the algorithm achieves high detection power when the magnitude of the change surpasses the lower bound of the minimax separation rate, which scales on the order of $\sqrt{nd}$. Our method outperforms other techniques in terms of accuracy for both Gaussian and non-Gaussian data. Notably, it maintains strong detection power even with small observation windows, making it particularly effective for online environments where timely and precise change detection is critical.
[243] arXiv:2512.10485 (replaced) [pdf, html, other]: Title: From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection

Chaomeng Lu, Bert Lagaisse

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)

Vulnerability detection methods based on deep learning (DL) have shown strong performance on benchmark datasets, yet their real-world effectiveness remains underexplored. Recent work suggests that both graph neural network (GNN)-based and transformer-based models, including large language models (LLMs), yield promising results when evaluated on curated benchmark datasets. These datasets are typically characterized by consistent data distributions and heuristic or partially noisy labels. In this study, we systematically evaluate two representative DL models-ReVeal and LineVul-across four representative datasets: Juliet, Devign, BigVul, and ICVul. Each model is trained independently on each respective dataset, and their code representations are analyzed using t-SNE to uncover vulnerability related patterns. To assess realistic applicability, we deploy these models along with four pretrained LLMs, Claude 3.5 Sonnet, GPT-o3-mini, GPT-4o, and GPT-5 on a curated dataset, VentiVul, comprising 20 recently (May 2025) fixed vulnerabilities from the Linux kernel. Our experiments reveal that current models struggle to distinguish vulnerable from non-vulnerable code in representation space and generalize poorly across datasets with differing distributions. When evaluated on VentiVul, our newly constructed time-wise out-of-distribution dataset, performance drops sharply, with most models failing to detect vulnerabilities reliably. These results expose a persistent gap between academic benchmarks and real-world deployment, emphasizing the value of our deployment-oriented evaluation framework and the need for more robust code representations and higher-quality datasets.
[244] arXiv:2512.22227 (replaced) [pdf, html, other]: Title: Probing Spectrum-Like Organization of States of Mind in Transformer Representation Spaces

Sophie Zhao

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

We investigate whether graded states of mind form spectrum-like structure in transformer representation spaces. To do so, we construct a dataset of 636 short natural-language sentences annotated with both a continuous score from $-5$ to $5$ and one of seven ordered tiers, ranging from collapsed or scarcity-driven expressions to more coherent, reflective, and integrative ones. We evaluate five frozen transformer representations: four sentence-embedding models and one decoder-only residual-stream representation. Across all representations, simple probes reliably recover both the continuous score and the discrete tier labels, and permutation tests show that performance significantly exceeds shuffled-label baselines. Additional analyses reveal a consistent geometric pattern: UMAP projections show low-to-high organization, confusion matrices concentrate errors between neighboring tiers, and directional ablation identifies a prominent score-aligned component. These results suggest that transformer representations contain statistically significant, spectrum-like organization aligned with the annotated state-of-mind structure. The annotations are used only as an operational framework for representation analysis, not as a clinical or diagnostic measure.
[245] arXiv:2601.01095 (replaced) [pdf, html, other]: Title: NarrativeTrack: Evaluating Entity-Centric Reasoning for Narrative Understanding

Hyeonjeong Ha, Jinjin Ge, Bo Feng, Kaixin Ma, Gargi Chakraborty

Comments: Project Page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context yet hallucinate entities' contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.
[246] arXiv:2601.02813 (replaced) [pdf, html, other]: Title: HAL: Inducing Human-likeness in LLMs with Alignment

Masum Hasan, Junjie Zhao, Ehsan Hoque

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Aligning language models to qualitative behavioral traits, such as human-likeness, remains difficult because they are hard to define, measure, and optimize. As a result, improvements in human-like behavior are largely driven by scale or broad supervised training, rather than targeted alignment. We introduce Human Aligning LLMs (HAL), a framework for aligning language models to conversational human-likeness using an interpretable, data-driven reward. HAL derives explicit conversational traits from contrastive dialogue data, combines them into a compact scalar score, and uses this score as a transparent reward signal for alignment with standard preference optimization methods. Using this approach, we align models of varying sizes without affecting their overall performance. In large-scale Chatbot Arena-style human evaluations, a model aligned with HAL is more frequently perceived as human-like in conversation. Because HAL operates over explicit, interpretable traits, it enables inspection of alignment behavior and diagnosis of unintended effects. More broadly, HAL demonstrates how soft, qualitative properties of language--previously outside the scope for alignment--can be made measurable and aligned in an interpretable and explainable way.
[247] arXiv:2601.03946 (replaced) [pdf, html, other]: Title: Provably Finding a Hidden Dense Submatrix among Many Planted Dense Submatrices via Convex Programming

Valentine Olanubi (1), Phineas Agar (1), Brendan Ames (2) ((1) University of Alabama, Department of Mathematics, (2) University of Southampton, School of Mathematical Sciences)

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

We consider the densest submatrix problem, which seeks the submatrix of fixed size of a given binary matrix that contains the most nonzero entries. This problem is a natural generalization of fundamental problems in combinatorial optimization, e.g., the densest subgraph, maximum clique, and maximum edge biclique problems, and has wide application the study of complex networks. Much recent research has focused on the development of sufficient conditions for exact solution of the densest submatrix problem via convex relaxation. The vast majority of these sufficient conditions establish identification of the densest submatrix within a graph containing exactly one large dense submatrix hidden by noise. The assumptions of these underlying models are not observed in real-world networks, where the data may correspond to a matrix containing many dense submatrices of varying sizes.
We extend and generalize these results to the more realistic setting where the input matrix may contain \emph{many} large dense subgraphs. Specifically, we establish sufficient conditions under which we can expect to solve the densest submatrix problem in polynomial time for random input matrices sampled from a generalization of the stochastic block model. Moreover, we also provide sufficient conditions for perfect recovery under a deterministic adversarial. Numerical experiments involving randomly generated problem instances and real-world collaboration and communication networks are used empirically to verify the theoretical phase-transitions to perfect recovery given by these sufficient conditions.
[248] arXiv:2602.07267 (replaced) [pdf, html, other]: Title: BRIDGE: Predicting Human Task Completion Time From Model Performance

Fengyuan Liu, Jay Gala, Nilaksh, Dzmitry Bahdanau, Siva Reddy, Hugo Larochelle

Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns a latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
[249] arXiv:2602.08542 (replaced) [pdf, other]: Title: Incremental (k, z)-Clustering on Graphs

Emilio Cruciani, Sebastian Forster, Antonis Skarlatos

Comments: In the Proceedings of ICALP 2026. Abstract shortened to meet arXiv limits

Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Given a weighted undirected graph, a number of clusters $k$, and an exponent $z$, the goal in the $(k, z)$-clustering problem on graphs is to select $k$ vertices as centers that minimize the sum of the distances raised to the power $z$ of each vertex to its closest center. In the dynamic setting, the graph is subject to adversarial edge updates, and the goal is to maintain explicitly an exact $(k, z)$-clustering solution in the induced shortest-path metric.
While efficient dynamic $k$-center approximation algorithms on graphs exist [Cruciani et al. SODA 2024], to the best of our knowledge, no prior work provides similar results for the dynamic $(k,z)$-clustering problem. As the main result of this paper, we develop a randomized incremental $(k, z)$-clustering algorithm that maintains with high probability a constant-factor approximation in a graph undergoing edge insertions with a total update time of $\tilde O(k m^{1+o(1)}+ k^{1+\frac{1}{\lambda}} m)$, where $\lambda \geq 1$ is an arbitrary fixed constant. Our incremental algorithm consists of two stages. In the first stage, we maintain a constant-factor bicriteria approximate solution of size $\tilde{O}(k)$ with a total update time of $m^{1+o(1)}$ over all adversarial edge insertions. This first stage is an intricate adaptation of the bicriteria approximation algorithm by Mettu and Plaxton [Machine Learning 2004] to incremental graphs. One of our key technical results is that the radii in their algorithm can be assumed to be non-decreasing while the approximation ratio remains constant, a property that may be of independent interest.
In the second stage, we maintain a constant-factor approximate $(k,z)$-clustering solution on a dynamic weighted instance induced by the bicriteria approximate solution. For this subproblem, we employ a dynamic spanner algorithm together with a static $(k,z)$-clustering algorithm.
[250] arXiv:2602.22897 (replaced) [pdf, other]: Title: OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Haoxuan Li, Hao Wang, Shijian Wang, Guanting Dong, Jiajie Jin, Yinuo Wang, Yuan Lu, Ji-Rong Wen, Zhicheng Dou, Zhouchen Lin

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
[251] arXiv:2602.23561 (replaced) [pdf, other]: Title: VaSST: Variational Inference for Symbolic Regression using Soft Symbolic Trees

Somjit Roy, Pritam Dey, Bani K. Mallick

Comments: 55 pages, 9 figures, 54 tables, Accepted at UAI 2026

Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Computation (stat.CO); Machine Learning (stat.ML)

Symbolic regression (SR) has gained recent traction in AI-driven scientific discovery for learning closed-form physical laws. Yet existing methods are dominated by heuristic search or data-intensive approaches that often assume low-noise regimes and lack principled uncertainty quantification, while fully probabilistic SR formulations remain scarce. We introduce a scalable probabilistic framework for SR, VaSST, based on variational inference. VaSST uses soft symbolic trees, a continuous relaxation of symbolic expression trees in which discrete operator and feature assignments are replaced by probability distributions over allowable components. This transforms combinatorial symbolic search through an astronomically large expression space into efficient gradient-based optimization while preserving a coherent probabilistic interpretation. The learned soft representations induce posterior distributions over symbolic structures, enabling uncertainty quantification across plausible symbolic forms through posterior-aware symbolic model selection. On simulated experiments and the Feynman Symbolic Regression Database, VaSST achieves strong structural recovery and predictive accuracy compared to state-of-the-art competing SR methods.
[252] arXiv:2603.02196 (replaced) [pdf, html, other]: Title: Conformal Policy Control

Drew Prinster, Clara Fannjiang, Ji Won Park, Kyunghyun Cho, Anqi Liu, Suchi Saria, Samuel Stanton

Comments: International Conference on Machine Learning (ICML), 2026

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded loss functions, and it introduces a new policy control setting. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
[253] arXiv:2603.09046 (replaced) [pdf, html, other]: Title: FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

Comments: 13 pages, 11 figures

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Operating Systems (cs.OS)

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.
[254] arXiv:2603.17212 (replaced) [pdf, html, other]: Title: Adaptive Contracts for Cost-Effective AI Delegation

Eden Saig, Tamar Garbuz, Ariel D. Procaccia, Inbal Talgam-Cohen, Jamie Tucker-Foltz

Comments: ICML 2026

Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

When organizations delegate text generation tasks to AI providers via pay-for-performance contracts, expected payments rise when evaluation is noisy. As evaluation methods become more elaborate, the economic benefits of decreased noise are often overshadowed by increased evaluation costs. In this work, we introduce adaptive contracts for AI delegation, which allow detailed evaluation to be performed selectively after observing an initial coarse signal in order to conserve resources. We make three sets of contributions: First, we provide efficient algorithms for computing optimal adaptive contracts under natural assumptions or when core problem dimensions are small, and prove hardness of approximation in the general unstructured case. We then formulate alternative models of randomized adaptive contracts and discuss their benefits and limitations. Finally, we empirically demonstrate the benefits of adaptivity over non-adaptive baselines using question-answering and code-generation datasets.
[255] arXiv:2603.24016 (replaced) [pdf, html, other]: Title: COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

Zekun Qian, Wei Feng, Ruize Han, Junhui Hou

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K.
[256] arXiv:2604.02546 (replaced) [pdf, html, other]: Title: RGB-Pointmap Pretraining for Unified 3D Scene Understanding

Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

Comments: 19 Pages, ECCV 2026 Accepted

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Pretraining 3D encoders through alignment with Contrastive Language-Image Pre-training (CLIP) has emerged as a promising direction for learning generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based framework that learns unified 3D scene representations from multi-view RGB-Pointmap inputs by leveraging the priors of a pretrained 2D foundation model. For robust RGB-Pointmap representation learning, we introduce cross-view geometric alignment and grounded view alignment to enforce geometric and semantic consistency across views. Extensive low-shot and task-specific fine-tuning on viewpoint grounding, scene retrieval, scene classification, and 3D visual question answering demonstrates state-of-the-art performance. These results establish UniScene3D as an effective framework for unified 3D scene understanding. Project page: this https URL
[257] arXiv:2604.08580 (replaced) [pdf, html, other]: Title: Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control

Carles Domingo-Enrich, Jiequn Han

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Reward fine-tuning of diffusion and flow models and sampling from tilted or Boltzmann distributions can both be formulated as stochastic optimal control (SOC) problems, where learning an optimal generative dynamics corresponds to optimizing a control under SDE constraints. In this work, we revisit and generalize Adjoint Matching, a recently proposed SOC-based method for learning optimal controls, and place it on a rigorous footing by deriving it from the Stochastic Maximum Principle (SMP). We formulate a general Hamiltonian adjoint matching objective for SOC problems with control-dependent drift and diffusion and convex running costs, and show that its expected value has the same first variation as the original SOC objective. As a consequence, critical points satisfy the Hamilton--Jacobi--Bellman (HJB) stationarity conditions. In the important practical case of state- and control-independent diffusion, we recover the lean adjoint matching loss previously introduced, which avoids second-order terms and whose critical points coincide with the optimal control under mild uniqueness assumptions. Numerical experiments confirm that the extra terms it discards become necessary once the diffusion is state-dependent. Finally, we show that adjoint matching can be precisely interpreted as a continuous-time method of successive approximations induced by the SMP, yielding a practical and implementable alternative to classical SMP-based algorithms, which are obstructed by intractable martingale terms in the stochastic setting. These results are also of independent interest to the stochastic control community, providing new implementable objectives and a viable pathway for SMP-based iterations in stochastic problems.
[258] arXiv:2604.14228 (replaced) [pdf, html, other]: Title: Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen

Comments: Tech report. Code at: this https URL

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its architecture by analyzing the publicly available source code and comparing it with two independent open-source AI agent systems, OpenClaw and Hermes Agent, that answer many of similar or even the same design questions. Our analysis identifies five human values, philosophies, and needs that motivate the architecture: human decision authority, safety, security, and privacy, reliable execution, capability amplification, and contextual adaptability. We then trace them through thirteen design principles to implementation choices. The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation and orchestration mechanism, and append-oriented session storage. Comparisons with OpenClaw and Hermes Agent show that the same design questions produce different answers across three deployment contexts. Claude Code emphasizes per-action safety, OpenClaw emphasizes perimeter-level access, and Hermes renders per-action approvals across many surfaces. At the runtime layer, Claude Code uses a single CLI loop, OpenClaw embeds the runtime within a gateway control plane, and Hermes uses one process whose role is set by its entry point. At the context and extension layer, Claude Code extends the context window, OpenClaw registers gateway-wide capabilities, and Hermes provides pluggable memory and model backends. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.
[259] arXiv:2605.08488 (replaced) [pdf, html, other]: Title: A Unified Lyapunov-IQC Framework for Uniform Stability of Smooth Quadratic First-Order Accelerated Optimizers

Don Li, Dacian Daescu

Comments: Minor revisions, mostly comprised of making notation more consistent across paper and correcting a few details in some proofs

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

We develop a unified Lyapunov-integral quadratic constraint (IQC) framework for establishing uniform stability of first-order accelerated optimization algorithms in the $\beta$-smooth and $\gamma$-strongly convex regime. Classical analyses of uniform stability, such as the work of Hardt, Recht, and Singer for stochastic gradient descent (SGD), rely on direct coupling arguments and case-by-case control of iterate differences under random sampling. Extending such arguments to accelerated methods, such as Nesterov Accelerated Gradient (NAG), is complicated by the presence of higher-order state dynamics induced by momentum. We first extend this classical approach with the use of Lyapunov functions to provide a uniform stability bound for smooth quadratic NAG, and supplement this result with small-scale numerical experiments. We then extend this framework by modeling first-order accelerated optimizers as Lur'e-type feedback interconnections between a linear dynamical system and a (non-linear) gradient operator. $\beta$-Smoothness and $\gamma$-strong convexity are encoded a sector IQC inequality. Under this representation, uniform stability is certified via the existence of a quadratic Lyapunov function satisfying a finite-dimensional linear matrix inequality (LMI) in the form of a feasibility problem, which can be solved via semi-definite programming (SDP). We instantiate this framework for NAG and show how classical uniform stability bounds can be recovered via this framework. These results underscore a structural connection between optimization dynamics and robust control theory, providing a modular methodology for reliable and reproducible numerical certification of uniform stability and generalization behavior of first-order methods via convex optimization tools that is adaptable to increasingly complex optimization algorithms.
[260] arXiv:2605.23965 (replaced) [pdf, html, other]: Title: LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Zenghui Zhou, Man Li, Xiaoke Fang, Xinyi Zhou, Weibin Lin, Zheng Zheng

Comments: Zheng Zheng is the corresponding author

Journal-ref: Knowledge-Based Systems, Volume 348, 2026, 116324, ISSN 0950-7051

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
[261] arXiv:2605.27991 (replaced) [pdf, html, other]: Title: Gradient-Flow Optimization as Dynamic Random-Effects Inference: Testing and Early Stopping with Applications to Deep Learning

Minhao Yao, Ruoyu Wang, Xihong Lin, Lin Liu, Zhonghua Liu

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Gradient-flow optimization is usually viewed as an algorithmic procedure for minimizing empirical loss, with training duration selected by validation or heuristic early stopping rules. We develop a statistical inference framework for gradient-flow training. We show that whenever fitted values evolve through a time-invariant positive semidefinite training operator, the output at each time is equivalent to the best linear unbiased predictor under a corresponding random-effects model. Training time then becomes a variance-component parameter governing variance reallocation from residual noise to structured signal. This turns two training decisions into inferential problems: whether training is needed becomes a variance-component test for signal beyond initialization, and how long to train becomes restricted maximum likelihood (REML) estimation of the training-time variance component. We show that the REML-guided early stopping rule selects the time at which optimized spectral losses become decorrelated from the training-operator eigenvalues. The asymptotic prediction optimality of the REML-guided early stopping time is established for fixed-design in-sample risk and random-design out-of-sample risk. Deep learning models in fixed-kernel gradient regimes provide canonical instantiations for our results. Numerical experiments and a UK Biobank proteomics application show competitive accuracy of the REML-guided early stopping time with reduced reliance on validation splits and repeated checkpoint evaluation.
[262] arXiv:2606.04265 (replaced) [pdf, html, other]: Title: Nonlocal Mean Field Schrödinger Bridge with Learned Interactions

Daisuke Inoue, Dante Kalise, Mathieu Laurière

Comments: 32 pages, 15 figures

Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)

The Schrödinger Bridge Problem connects an initial distribution to a terminal one along a minimum-energy stochastic process. Its mean-field extension, the Mean-Field Schrödinger Bridge, governs interacting populations whose dynamics and costs depend on the collective distribution. When these interactions are nonlocal, their direct evaluation scales quadratically with the population size, making large ensembles intractable within FBSDE-based solvers. We replace these terms with neural surrogates in state and time, trained on empirical interaction values along sampled trajectories and embedded in a four-stage alternating scheme that updates the forward and backward potentials and the surrogates in turn, while preserving forward--backward consistency and the prescribed endpoint marginals. We derive Grönwall-type stability bounds quantifying how surrogate errors propagate to the generated trajectories under a small-gain condition. On crowd-navigation and high-dimensional opinion-dynamics benchmarks, the surrogates reproduce the trajectories obtained with exact evaluation at reduced training cost. The advantage is most significant when the interaction is a nonlinear functional of the measure, such as the normalized bounded-confidence drift, for which random-batch subsampling is biased and unstable whereas the learned surrogate remains accurate.
[263] arXiv:2606.05895 (replaced) [pdf, other]: Title: Representing Research Attention as Contextually Structured Flows

Jessica Rodrigues, Angelo Salatino, Gard Jenset, Scott Hale

Comments: Accepted at STi 2026 - International Conference on Science and Technology Indicators

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Research metrics use attention as evidence of societal impact. Yet attention serves as evidence only once interpreted, and its meaning depends on its contextual structure, not on volume alone. Altmetrics records signals in isolation, keeping a count of the attention an output received, or a sequence of when. We address this with attention flows, representations that situate an output's attention in the social settings where it occurs, the language expressing it, and the time over which it unfolds. To evaluate the flow, we build a benchmark of analogy queries, each testing whether the relationship between two outputs, applied to a third, yields a fourth. The count and sequence baselines fail to recover these relationships, whereas flows learned as dynamic contextualised representations recover them. The recovered structure also survives partial observation and rests on its contexts instead of volume. These findings support attention represented as contextually structured for research evaluation.
[264] arXiv:2606.07931 (replaced) [pdf, html, other]: Title: Pointwise Complexity for Gaussian Fields: Upper Envelopes, Algorithmic Lower Bounds, and Separation

Yunbei Xu

Subjects: Probability (math.PR); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)

We prove a variance-aware pointwise majorizing-measure theorem for centered Gaussian processes. Classical generic chaining characterizes the scalar quantity $\mathbb E\sup_{x\in T}X_x$; the theorem here gives a simultaneous high-probability envelope for the entire field. For an ambient prior $\mu$, the envelope at $x$ is governed by a pointwise Fernique-Talagrand functional \[\Phi_\mu(x):=\int_0^{4\sigma(x)}\sqrt{\log\frac{1}{\mu(B_d(x,\varepsilon))}}\,d\varepsilon,\] together with the corresponding Gaussian tail term. The theorem provides a reusable field-level refinement of classical generic chaining and a Gaussian-process counterpart of pointwise empirical-process bounds for deep neural networks.
We also record a Bayesian algorithmic lower envelope from the interactive Fano/data-processing principle. For a known prior $\pi$, an observation channel, and a concrete estimator $\widehat t(Y)$, the lower bound is expressed through the exact ghost small-ball mass $\mathbb E_{Y\sim Q}\pi(B_d(\widehat t(Y),\Delta))$, rather than a worst-case covering number. In Gaussian location experiments, comparison decoders convert Bayes location error into lower bounds on decision-aligned Gaussian ranges. We then construct an elementary example separating the usual Fano relaxation, the Bayesian algorithmic lower envelope, the pointwise Gaussian envelope, and the full-class minimax risk. Together, these results show that algorithmic lower bounds provide local-geometric validations of pointwise complexity for fixed estimators in overparameterized ambient classes, precisely in regimes where classical minimax theory becomes either too coarse or oracle-dependent.
This separation can also be recast in minimax language as penalty-range information relaxation, highlighting an important question of algorithmic robustness for classical high-dimensional models and regularized algorithms.
[265] arXiv:2606.08236 (replaced) [pdf, html, other]: Title: Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

Hyunjin Cho, Youngji Roh, Jaehyung Kim

Comments: 40 pages; accepted as an ICML 2026 Spotlight; project page: this https URL

Journal-ref: ICML 2026 Spotlight

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.
[266] arXiv:2606.12623 (replaced) [pdf, html, other]: Title: Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation

Lisa Herzog, Oliver Dürr, Pascal Bühler, Hakim Baazaoui, Julian Deseö, Susanne Wegener, Beate Sick

Subjects: Applications (stat.AP); Machine Learning (cs.LG)

Personalized medicine in acute ischemic stroke requires moving beyond average treatment effects (ATE) to individualized treatment effect (ITE) estimates to support treatment decisions. In acute ischemic stroke, mechanical thrombectomy has been shown to be more effective on average than lysis in randomized controlled trials (RCTs), such as the MR CLEAN study. We aim to identify which individual patients benefit most from mechanical thrombectomy compared to lysis. The outcome of interest is the modified Rankin Scale (mRS) at three months, an ordinal measure of functional disability (0: no symptoms, 6: death). We demonstrate that causal transformation models on directed acyclic graphs (TRAM-DAG) can be used for ITE estimation after being fitted on observational MAGIC multi-center stroke patient data. To ensure comparability with the MR CLEAN population, which we use for validation, we train the TRAM-DAG on a MAGIC sub-population with NIHSS at admission >= 6, corresponding to one inclusion criterion of MR CLEAN. The fitted model is then used to estimate ITEs for stroke patients in the MR CLEAN population. While these ITE estimates cannot be confirmed experimentally, we show that their average is consistent with the trial's reported ATE. Furthermore, the ITE estimates correctly rank trial patients by their observed frequency of a good outcome (mRS at three months <= 2). These findings support the use of causal models like TRAM-DAG for personalized decision-making in stroke care and highlight their ability to bridge the gap between observational evidence and clinical trials.
[267] arXiv:2606.13501 (replaced) [pdf, html, other]: Title: GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality.
This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups.
We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01$\times$, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 $\mu$s.
Our code is available at this https URL.
[268] arXiv:2606.21199 (replaced) [pdf, html, other]: Title: Orthogonal Discrepancy Kernels for Learning with Partial Physics

Swapnil Manna, Timothy J. Rogers, Lawrence Bull

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)

We introduce a semi-parametric framework for nonlinear system identification, which decouples discrepancy functions from physics-based components. Orthogonal Gaussian process regression balances sparse parameter selection (the white box) with discrepancy learning (the black box) to produce interpretable models from incomplete physics.
[269] arXiv:2606.23370 (replaced) [pdf, other]: Title: FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

Comments: Repeated paper uploading due to mistakes. See arXiv:2603.09046

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Operating Systems (cs.OS)

Device-side Large Language Models (LLMs) have grown explosively, offering stronger privacy and higher availability than their cloud-side counterparts. During LLM inference, both the model weights and the user data are valuable, and attackers may compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead to both the secure inference and the normal aplications, due to two challenges: the inflexible resource isolation and the inefficient secure resource management.
To address these challenges, this paper presents FlexServe, a fast and secure LLM inference system for mobile devices. The key idea is to decouple the access permission from the management permission of secure resources, so that the normal-world OS cannot access them but can still manage them as usual. First, FlexServe introduces a Recallable Resource Isolation mechanism to construct Recallable Secure Memory (Flex-Mem) and a Recallable Secure NPU (Flex-NPU). They can only be accessed by the secure world, but can be efficiently allocated and reclaimed by the normal-world OS. Based on them, FlexServe further introduces a FlexServe Framework to run secure LLM inference in the secure world. It works together with the normal-world OS to perform cooperative secure memory management. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves average TTFT speedups of 10.05X over the strawman and 2.44X over an optimized strawman.
[270] arXiv:2606.30313 (replaced) [pdf, html, other]: Title: TRACE: A Concept Bottleneck Model for Longitudinal 3D Glioblastoma Response Assessment

Alia Tarek, Hamsa Saberr, Hamza Elghonemy, Youssef Afify, Tamer Basha, Omair Shahzad Bhatti, Abdulrahman M. Selim, Hasan Md Tusfiqur Alam, Daniel Sonntag

Comments: Accept in the EXPLIMED: Explainable Artificial Intelligence for the Medical Domain workshop in IJCAI 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Longitudinal glioblastoma response assessment requires comparing subtle tumor changes across MRI time points using structured clinical criteria such as RANO. However, most deep learning methods predict response labels directly from imaging features, which limits clinical inspection, verification, and correction. We introduce TRACE, a RANO 2.0-aligned concept bottleneck model for interpretable 4-class glioblastoma response classification on longitudinal 3D MRI. TRACE processes paired baseline and follow-up multimodal MRI scans with a shared 3D vision encoder, predicts clinically meaningful tumor measurements as root concepts, computes downstream RANO-derived concepts through deterministic rules, and incorporates scan interval and new-lesion information as passthrough concepts. This design frames response assessment as structured concept reasoning rather than direct image-to-label prediction.
Using 5-fold patient-wise cross-validation on the LUMIERE dataset, TRACE achieves a 4-class macro F1 of 0.4769 and a binary progression-versus-non-progression macro F1 of 0.7085. It improves over a concept bottleneck baseline and remains within the range of published non-interpretable deep learning approaches. Ablation studies show that the expert RANO graph and intervention-consistency training are important for performance, while intervention experiments demonstrate that correcting concepts can improve downstream predictions. These results suggest that structured concept bottlenecks offer a transparent and clinically aligned direction for longitudinal glioblastoma response assessment, while highlighting the need for larger protocol-aligned datasets and external validation.
[271] arXiv:2607.00224 (replaced) [pdf, html, other]: Title: Sample Complexities of Estimating Gumbel--Max Watermark Proportions with and without Reduction to Pivotal Statistics

Shuwen Chai, Qiaosen Wang

Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Watermarking promises statistical traceability of large language model (LLM) uses, but real documents rarely arrive as purely human-written or purely LLM-generated. This motivates a quantitative question beyond detection: what proportion of a document is generated from a pre-specified watermarked LLM? We study this watermark proportion estimation problem under the Gumbel--max watermarking mechanism, treating the next-token prediction distributions as unknown and arbitrary nuisance parameters subject to a non-degeneracy condition. We compare two observation regimes: in the full observation regime, the estimator observes the pseudorandom vector and the selected token at each position; in the more prevalent setting of pivotal reduction, it observes only a scalar pivot, which follows a one-dimensional Uniform--Beta mixture distribution. Under pivotal reduction, we develop a Laguerre-polynomial estimator and establish a matching information-theoretic lower bound for the sample complexity. For full observation, we introduce an event-counting estimator and show a matching lower bound, yielding a substantially smaller sample complexity. As our results imply, although reducing to pivotal statistics is an elegant and prevalent choice, it is not always sample-efficient for estimating the proportion of watermarks.
[272] arXiv:2607.00876 (replaced) [pdf, html, other]: Title: The Binary Tree Mechanism is Optimal for Approximate Differentially Private Continual Counting

Konstantina Bairaktari, Kasper Green Larsen

Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Private continual counting is a fundamental problem in differential privacy: given a binary stream of length $n$, where each $1$ corresponds to the contribution of one individual, the goal is to release all running counts while protecting the privacy of each individual. The standard algorithm is the binary tree mechanism, whose Gaussian-noise variant achieves expected $\ell_\infty$ error proportional to $\log^{3/2} n$ for approximate differential privacy. Whether this dependence on the stream length is necessary has remained a central open problem.
In this work, we resolve the dependence on $n$ by proving that every differentially private mechanism for continual counting must incur expected $\ell_\infty$ error $\Omega(\log^{3/2} n)$. This shows that the binary tree mechanism is asymptotically optimal in the approximate-DP setting.
As a consequence, we also obtain a largest-possible separation between hereditary discrepancy and private $\ell_\infty$ error for linear queries, showing that the known general upper bound in terms of hereditary discrepancy has the optimal dependence on the number of queries.
[273] arXiv:2607.01223 (replaced) [pdf, html, other]: Title: Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

Michael Saldivar, Ben Slivinski

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)

When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p= 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n= 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).

Total of 273 entries

Showing up to 2000 entries per page: fewer | more | all

Machine Learning

Showing new listings for Friday, 3 July 2026

New submissions (showing 101 of 101 entries)

Cross submissions (showing 80 of 80 entries)

Replacement submissions (showing 92 of 92 entries)