Vol. MMXXVI · Issue 083 · Daily Edition

Artificial
Indifference

Published March 24, 2026
APOD: A Gravity Map of Earth
arXiv: 8 papers filed

A Gravity Map of Earth

Is gravity the same over the surface of the Earth? No: in some places you will feel slightly heavier than in others. The featured Earth map video uses colors and exaggerated highs and lows to show where Earth's gravitational field is relatively strong and where it is weak. A low spot, where you would feel slightly lighter, appears just off the coast of India, in blue, while a relative high occurs in the mountains of Chile in South America. These irregularities do not always track present surface features. Scientists hypothesize that other important factors lie in deep underground s...

2026-03-24 · NASA APOD ↗
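The anomalies in the map are departures from a smooth reference field, but even that reference varies with latitude: Earth's rotation and equatorial bulge make you measurably lighter at the equator than at the poles. A minimal sketch of that baseline, using the standard Somigliana closed-form normal-gravity formula with WGS84 constants (the function name and structure here are illustrative, not from the APOD text):

```python
import math

# WGS84 defining values for the Somigliana normal-gravity formula
GAMMA_E = 9.7803253359      # gravity at the equator, m/s^2
K = 0.00193185265241        # normal gravity constant
E2 = 0.00669437999014       # first eccentricity squared

def normal_gravity(lat_deg: float) -> float:
    """Reference (normal) gravity on the WGS84 ellipsoid at a given latitude.

    This is the smooth baseline; the map's blues and reds are small
    deviations from values like these, caused by mass variations below.
    """
    s2 = math.sin(math.radians(lat_deg)) ** 2
    return GAMMA_E * (1.0 + K * s2) / math.sqrt(1.0 - E2 * s2)

for lat in (0, 45, 90):
    print(f"lat {lat:2d}°: {normal_gravity(lat):.4f} m/s^2")
```

Running this shows roughly 9.780 m/s² at the equator rising to about 9.832 m/s² at the poles, a half-percent swing that dwarfs the local anomalies the video visualizes.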

Research Filed Today

Preprints submitted to arXiv on March 24, 2026. Science before peer review.

01
Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising ste...
Umair Nawaz, Ahmed Heakl, Ufaq Khan et al. (+3)
02
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlook...
Ruoliu Yang, Chu Wu, Caifeng Shan et al. (+2)
03
Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propo...
Shivam Duggal, Xingjian Bai, Zongze Wu et al. (+5)
04
We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or s...
Ziyi Wang, Xinshun Wang, Shuang Chen et al. (+2)
05
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level...
Haichao Zhang, Yijiang Li, Shwai He et al. (+5)
06
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations dema...
Zhide Zhong, Junfeng Li, Junjie He et al. (+10)
07
Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs t...
Haoyu Zhen, Xiaolong Li, Yilin Zhao et al. (+5)
08
Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we...
Kelly Cui, Nikhil Prakash, Ayush Raina et al. (+3)

Source: arXiv.org · Cornell University