A test-time reinforcement learning framework enables large language models to improve their mathematical reasoning capabilities using only unlabeled test data, demonstrating a 159% improvement in pass@1 performance on AIME 2024 with Qwen2.5-Math-7B without any ground-truth labels or human annotation.
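As a rough illustration of how label-free test-time RL can work, the sketch below rewards sampled answers by agreement with a majority-vote consensus answer; the function names and the toy sampler are placeholders, not the paper's actual API.

```python
# Minimal sketch of test-time RL with majority-vote pseudo-rewards.
# `sample_answers` and the update step are placeholders for illustration only.
from collections import Counter
import random

def sample_answers(question, n=16):
    """Placeholder for sampling n answers from the current policy."""
    return [random.choice(["42", "42", "41"]) for _ in range(n)]

def majority_vote(answers):
    """Use the most frequent answer as a pseudo-label (no ground truth needed)."""
    return Counter(answers).most_common(1)[0][0]

def test_time_rl_step(question):
    answers = sample_answers(question)
    pseudo_label = majority_vote(answers)
    # Reward each sampled answer by agreement with the consensus answer.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    # A real implementation would run a policy-gradient-style update here.
    return answers, rewards

if __name__ == "__main__":
    answers, rewards = test_time_rl_step("What is 6 * 7?")
    print(list(zip(answers, rewards)))
```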
A comprehensive benchmark suite called PHYBench evaluates physical reasoning capabilities in large language models through 500 carefully curated physics problems and introduces a novel Expression Edit Distance (EED) Score metric, revealing significant performance gaps between current LLMs and human physics students across various domains of physics.
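The sketch below is a simplified stand-in for an Expression Edit Distance style score: it gives full credit for symbolic equivalence and otherwise falls back to a normalized edit distance over serialized sympy expression trees. PHYBench's actual EED uses a proper tree edit distance, so this is only meant to convey the idea of graded partial credit.

```python
# Simplified EED-like score: symbolic equivalence check plus a normalized
# edit distance over serialized expression trees as a crude fallback.
import sympy as sp

def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def eed_like_score(pred: str, ref: str) -> float:
    pred_e, ref_e = sp.sympify(pred), sp.sympify(ref)
    if sp.simplify(pred_e - ref_e) == 0:
        return 1.0                                   # symbolically identical answers
    a, b = sp.srepr(pred_e), sp.srepr(ref_e)
    dist = levenshtein(a, b)
    return max(0.0, 1.0 - dist / max(len(a), len(b)))  # graded partial credit

print(eed_like_score("m*g*h", "g*h*m"))      # 1.0 (same expression, reordered)
print(eed_like_score("m*g*h/2", "m*g*h"))    # partial credit < 1.0
```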
NVIDIA and UC Berkeley researchers introduce DAM (Describe Anything Model), a vision-language architecture that generates detailed captions for specific regions in images and videos through a focal prompt mechanism and localized vision backbone, achieving state-of-the-art performance across 7 benchmarks while addressing data scarcity through a semi-supervised learning pipeline.
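The snippet below sketches one plausible way to build a focal prompt: pair the full image and mask with a zoomed-in crop around the target region, expanded by a context margin. The exact crop sizing and how DAM fuses the global and focal streams are details of the paper; this only illustrates the input construction.

```python
# Rough sketch of constructing a "focal prompt": full image plus a zoomed crop
# around the region of interest with some surrounding context.
import numpy as np
from PIL import Image

def focal_prompt(image: Image.Image, mask: np.ndarray, margin: float = 0.5):
    ys, xs = np.nonzero(mask)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    # Expand the region's bounding box to keep surrounding context.
    w, h = x1 - x0, y1 - y0
    x0 = max(0, int(x0 - margin * w)); x1 = min(image.width,  int(x1 + margin * w))
    y0 = max(0, int(y0 - margin * h)); y1 = min(image.height, int(y1 + margin * h))
    focal_crop = image.crop((x0, y0, x1, y1))
    focal_mask = mask[y0:y1, x0:x1]
    # The model would consume (full image, full mask) and (focal crop, focal mask).
    return (image, mask), (focal_crop, focal_mask)

img = Image.new("RGB", (640, 480))
m = np.zeros((480, 640), dtype=np.uint8); m[200:260, 300:380] = 1
(_, _), (crop, crop_mask) = focal_prompt(img, m)
print(crop.size, crop_mask.shape)
```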
A comprehensive analysis reveals that Reinforcement Learning with Verifiable Rewards (RLVR) primarily improves sampling efficiency rather than expanding reasoning capabilities in Large Language Models, with base models outperforming RLVR-trained versions at higher sampling rates (k>256) across multiple benchmarks and model architectures.
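These comparisons rest on the standard unbiased pass@k estimator, pass@k = 1 - C(n-c, k)/C(n, k), computed from c correct answers among n samples per problem. The toy numbers below are made up purely to show how a model that concentrates probability on a few solvable problems can win at k = 1 yet lose coverage at k = 256.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Toy illustration (numbers are invented): the "RLVR-like" model is very
# reliable on two problems but never solves the other two, so it wins at
# k = 1 and loses at k = 256 to the broader-coverage "base-like" model.
n = 1024
rlvr_counts = [600, 550, 0, 0]   # correct samples per problem, out of n
base_counts = [120, 90, 5, 3]
for k in (1, 256):
    print(k, mean_pass_at_k(rlvr_counts, n, k), mean_pass_at_k(base_counts, n, k))
```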
The LUFFY framework enhances large language models' reasoning capabilities by integrating off-policy guidance into zero-RL training, achieving a +7.0-point improvement on math benchmarks and +6.2 points on out-of-distribution tasks compared to previous methods while maintaining exploration through policy shaping and regularized importance sampling.
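The sketch below shows the general shape of folding off-policy (e.g. stronger-model) traces into a clipped on-policy objective with importance weights and a shaping function; the shaping form r/(r + gamma) and all names here are illustrative assumptions, not LUFFY's exact loss.

```python
# Sketch of a mixed on-/off-policy loss with clipped importance weights and a
# simple shaping function applied to off-policy tokens.
import torch

def mixed_policy_loss(logp_new, logp_old, advantages, off_policy_mask,
                      clip_eps=0.2, gamma=0.1):
    ratio = torch.exp(logp_new - logp_old)            # importance weights
    # Shape off-policy ratios so rare (low-probability) expert tokens still
    # receive usable gradient instead of being clipped away.
    shaped = torch.where(off_policy_mask, ratio / (ratio + gamma), ratio)
    unclipped = shaped * advantages
    clipped = torch.clamp(shaped, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# Dummy tensors just to show the call shape (batch of 4 tokens).
logp_new = torch.log(torch.tensor([0.30, 0.05, 0.60, 0.10]))
logp_old = torch.log(torch.tensor([0.25, 0.40, 0.55, 0.15]))
adv = torch.tensor([1.0, 1.0, -0.5, 0.2])
mask = torch.tensor([False, True, False, True])       # True = off-policy token
print(mixed_policy_loss(logp_new, logp_old, adv, mask))
```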
Google DeepMind researchers systematically analyze three key failure modes of LLMs in decision-making tasks (greediness, frequency bias, and knowing-doing gap), demonstrating how reinforcement learning fine-tuning with chain-of-thought reasoning can improve exploration and decision-making capabilities across multi-armed bandits, contextual bandits, and game-playing environments.
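A toy multi-armed bandit makes the "greediness" failure mode concrete: a purely greedy agent locks onto an early lucky arm, while epsilon-greedy keeps exploring, and arm coverage is one simple proxy for exploration. This is only an illustration of the failure mode, not the paper's LLM-agent setup.

```python
# Greedy vs epsilon-greedy on a Bernoulli bandit; coverage = fraction of arms tried.
import random

def run(policy, true_means, steps=500, eps=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    values = [0.0] * len(true_means)
    for _ in range(steps):
        if policy == "eps_greedy" and rng.random() < eps:
            arm = rng.randrange(len(true_means))          # explore
        else:
            arm = max(range(len(true_means)),
                      key=lambda a: values[a] if counts[a] else 0.0)  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # running mean
    coverage = sum(c > 0 for c in counts) / len(counts)
    return coverage, counts

means = [0.2, 0.25, 0.3, 0.35, 0.8]
print("greedy     coverage:", run("greedy", means)[0])
print("eps-greedy coverage:", run("eps_greedy", means)[0])
```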
A compression technique called FramePack enables next-frame prediction models to generate longer videos by using variable kernel sizes to compress input frames based on their temporal importance, while anti-drifting sampling methods reduce error accumulation through bi-directional context and endpoint anchoring.
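The sketch below illustrates the compression idea: recent frames are patchified with small kernels (many tokens) and older frames with progressively larger kernels (few tokens), so the total context stays roughly constant as the video grows. The geometric kernel schedule and the pooling stand-in for patch embedding are illustrative assumptions.

```python
# Sketch of compressing older frames more aggressively so context length is
# roughly fixed regardless of video length.
import torch
import torch.nn.functional as F

def pack_frames(frames, base_patch=16):
    """frames: (T, C, H, W), newest frame last. Returns per-frame token tensors."""
    tokens = []
    for age in range(frames.shape[0]):                 # age 0 = newest frame
        frame = frames[frames.shape[0] - 1 - age]
        patch = base_patch * (2 ** min(age, 3))        # kernel grows with age: 16, 32, 64, 128
        # Average-pool each patch as a cheap stand-in for a patch-embedding layer.
        pooled = F.avg_pool2d(frame.unsqueeze(0), kernel_size=patch)
        tokens.append(pooled.flatten(2).transpose(1, 2))   # (1, n_patches, C)
    return tokens

frames = torch.randn(6, 3, 256, 256)
toks = pack_frames(frames)
print([t.shape[1] for t in toks])   # token counts drop sharply with frame age
```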
A comprehensive benchmark called FG-BMK evaluates large vision-language models' capabilities on fine-grained image analysis through 3.49 million questions across 3.32 million images from 12 established datasets, revealing significant performance gaps in attribute recognition and category reasoning while demonstrating the superiority of contrastive training approaches over generative methods.
A unified visual encoding framework from Meta FAIR demonstrates that intermediate network layers contain the most effective visual embeddings, with a contrastive learning approach achieving state-of-the-art performance across multiple vision tasks through targeted alignment tuning and robust pretraining strategies.
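The finding about intermediate layers is usually established with layer-wise probing; the sketch below shows that generic recipe (collect pooled features after each block, fit a linear probe per layer) on a toy model, not the paper's exact protocol or architecture.

```python
# Generic layer-probing recipe: extract features after every block and train a
# linear probe per layer; the strongest layer gives the lowest probe loss.
import torch
import torch.nn as nn

class ToyViT(nn.Module):
    def __init__(self, dim=64, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(depth)])

    def forward_features(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x.mean(dim=1))        # pooled embedding after each block
        return feats

model = ToyViT()
x = torch.randn(32, 49, 64)                    # (batch, tokens, dim) dummy patch tokens
y = torch.randint(0, 10, (32,))
with torch.no_grad():
    per_layer = model.forward_features(x)
for i, f in enumerate(per_layer):
    probe = nn.Linear(64, 10)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(50):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(f), y)
        loss.backward()
        opt.step()
    print(f"layer {i}: probe loss {loss.item():.3f}")
```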
A Vision-Language-Action model called π0.5 enables mobile robots to perform complex household tasks in entirely new homes through co-training on heterogeneous data sources including multi-robot demonstrations, web data, and verbal instructions, while using a hierarchical inference approach that combines semantic subtask prediction with low-level action generation.
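The hierarchical inference loop can be pictured as two passes per control chunk: predict a semantic subtask in language, then generate low-level actions conditioned on it. The sketch below uses hypothetical stand-ins (`predict_subtask`, `predict_actions`, a dummy environment) just to show the control flow, not the model itself.

```python
# Sketch of hierarchical VLA inference: semantic subtask first, then a chunk of
# low-level actions conditioned on that subtask. All components are placeholders.
import numpy as np

def predict_subtask(observation, task: str) -> str:
    """High-level pass: choose the next semantic step (placeholder logic)."""
    return "pick up the sponge" if task == "clean the kitchen" else "explore"

def predict_actions(observation, subtask: str, horizon: int = 10) -> np.ndarray:
    """Low-level pass: emit a chunk of continuous actions for the subtask."""
    return np.zeros((horizon, 7))              # e.g. 7-DoF end-effector commands

def run_episode(env, task: str, max_steps: int = 200):
    obs = env.reset()
    for _ in range(max_steps // 10):
        subtask = predict_subtask(obs, task)   # semantic level
        actions = predict_actions(obs, subtask)  # motor level
        for a in actions:                      # execute the action chunk
            obs = env.step(a)
    return obs

class DummyEnv:
    def reset(self): return {"image": np.zeros((64, 64, 3))}
    def step(self, action): return {"image": np.zeros((64, 64, 3))}

print(run_episode(DummyEnv(), "clean the kitchen") is not None)
```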