CUEBES

Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

arXiv:2601.22984v1. Announce Type: new. Abstract: Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation,...

Policy Neuroscience

arXiv CS Feb 2

dgMARK: Decoding-Guided Watermarking for Diffusion Language Models

arXiv:2601.22985v1. Announce Type: new. Abstract: We propose dgMARK, a decoding-guided watermarking method for discrete diffusion language models (dLLMs). Unlike autoregressive models, dLLMs can generate tokens...

Energy Mathematics

arXiv CS Feb 2

ArabicDialectHub: A Cross-Dialectal Arabic Learning Resource and Platform

arXiv:2601.22987v1. Announce Type: new. Abstract: We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi,...

Software Artificial Intelligence

arXiv CS Feb 2

Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation

arXiv:2601.22988v1. Announce Type: new. Abstract: Real-world robotic manipulation demands visuomotor policies capable of robust spatial scene understanding and strong generalization across diverse camera viewpoints. While...

Policy Biology

arXiv CS Feb 2

Self-Supervised Slice-to-Volume Reconstruction with Gaussian Representations for Fetal MRI

arXiv:2601.22990v1. Announce Type: new. Abstract: Reconstructing 3D fetal MR volumes from motion-corrupted stacks of 2D slices is a crucial and challenging task. Conventional slice-to-volume reconstruction...

Software Business

arXiv CS Feb 2

Value-at-Risk Constrained Policy Optimization

arXiv:2601.22993v1. Announce Type: new. Abstract: We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample efficient and conservative method designed to optimize Value-at-Risk (VaR)...

Policy Environment

arXiv CS Feb 2

Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference

arXiv:2601.22996v1. Announce Type: new. Abstract: Large Language Model (LLM) inference presents a unique scheduling challenge due to the Key-Value (KV) cache, where a job's memory...

Software Policy

arXiv CS Feb 2

TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI

arXiv:2601.22997v1. Announce Type: new. Abstract: Agentic AI systems act through tools and evolve their behavior over long, stochastic interaction traces. This setting complicates assurance, because...

Software Environment

arXiv CS Feb 2

Mano: Restriking Manifold Optimization for LLM Training

arXiv:2601.23000v1. Announce Type: new. Abstract: While large language models (LLMs) have emerged as a significant advancement in artificial intelligence, the hardware and computational costs for...

Artificial Intelligence Technology

arXiv CS Feb 2

Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs

arXiv:2601.23001v1. Announce Type: new. Abstract: Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing...

Artificial Intelligence World News

arXiv CS Feb 2

InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning

arXiv:2601.23006v1. Announce Type: new. Abstract: Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing...

Technology Neuroscience

arXiv CS Feb 2

Leveraging Multi-Rater Annotations to Calibrate Object Detectors in Microscopy Imaging

arXiv:2601.23007v1. Announce Type: new. Abstract: Deep learning-based object detectors have achieved impressive performance in microscopy imaging, yet their confidence estimates often lack calibration, limiting their...

Software Policy

arXiv CS Feb 2

SolAgent: A Specialized Multi-Agent Framework for Solidity Code Generation

arXiv:2601.23009v1. Announce Type: new. Abstract: Smart contracts are the backbone of the decentralized web, yet ensuring their functional correctness and security remains a critical challenge....

Artificial Intelligence Software

arXiv CS Feb 2

Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning

arXiv:2601.23010v1. Announce Type: new. Abstract: Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, where both the constraint form and constraint strength...

Policy Psychology

arXiv CS Feb 2

Leveraging Convolutional Sparse Autoencoders for Robust Movement Classification from Low-Density sEMG

arXiv:2601.23011v1. Announce Type: new. Abstract: Reliable control of myoelectric prostheses is often hindered by high inter-subject variability and the clinical impracticality of high-density sensor arrays....

Engineering Psychology

arXiv CS Feb 2

Mem-T: Densifying Rewards for Long-Horizon Memory Agents

arXiv:2601.23014v1. Announce Type: new. Abstract: Memory agents, which depart from predefined memory-processing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered...

Robotics Policy

arXiv CS Feb 2

Integrating Multi-Label Classification and Generative AI for Scalable Analysis of User Feedback

arXiv:2601.23018v1. Announce Type: new. Abstract: In highly competitive software markets, user experience (UX) evaluation is crucial for ensuring software quality and fostering long-term product success....

Technology Software

arXiv CS Feb 2

Uncovering Hidden Inclusions of Vulnerable Dependencies in Real-World Java Projects

arXiv:2601.23020v1. Announce Type: new. Abstract: Open-source software (OSS) dependencies are a dominant component of modern software code bases. Using proven and well-tested OSS components lets...

Software Policy

arXiv CS Feb 2

Causal Characterization of Measurement and Mechanistic Anomalies

arXiv:2601.23026v1. Announce Type: new. Abstract: Root cause analysis of anomalies aims to identify those features that cause the deviation from the normal process. Existing methods...

Psychology Software

arXiv CS Feb 2

Divide-and-Conquer CoT: RL for Reducing Latency via Parallel Reasoning

arXiv:2601.23027v1. Announce Type: new. Abstract: Long chain-of-thought reasoning (Long CoT) is now fundamental to state-of-the-art LLMs, especially in mathematical reasoning. However, LLM generation is highly...

Artificial Intelligence Psychology

arXiv CS Feb 2

Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning

arXiv:2601.23032v1. Announce Type: new. Abstract: Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches...

Biology Psychology

arXiv CS Feb 2

MOSAIC: Modular Scalable Autonomy for Intelligent Coordination of Heterogeneous Robotic Teams

arXiv:2601.23038v1. Announce Type: new. Abstract: Mobile robots have become indispensable for exploring hostile environments, such as in space or disaster relief scenarios, but often remain...

Robotics Environment

arXiv CS Feb 2

Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference

arXiv:2601.23039v1. Announce Type: new. Abstract: Differentiable matching layers, often implemented via entropy-regularized Optimal Transport, serve as a critical approximate inference mechanism in structural prediction. However,...

Policy Genetics

arXiv CS Feb 2

One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs

arXiv:2601.23041v1. Announce Type: new. Abstract: Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures that persist...

Technology Software