Vol. MMXXVI · Issue 078 · Daily Edition

Artificial
Indifference

Published March 19, 2026
APOD: Launch Plume: SpaceX Jellyfish
arXiv: 8 papers filed

Launch Plume: SpaceX Jellyfish

Even if you live with your head in the clouds, you won't find a jellyfish like this one very often. The featured image shows a SpaceX Falcon 9 rocket launch from Cape Canaveral in Florida on March 4. The launch occurred 52 minutes before sunrise, and the second-stage rocket exhaust plume rose high enough in the sky to catch the light of the rising sun while the photographer remained in the dark. This combination of light and shadow, possible only near dawn or dusk, makes the exhaust, mostly water vapor and carbon dioxide, appear as a glowing cloud. It only looks like it's going down, as the rocket fo...

2026-03-19 · © Michael Seeley · NASA APOD ↗

Research Filed Today

Preprints submitted to arXiv on March 19, 2026. Science before peer review.

01
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively fo...
Jianrui Zhang, Yue Yang, Rohun Tripathi et al. (+5)
02
Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature...
Ziyi Wang, Peiming Li, Xinshun Wang et al. (+3)
03
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than expl...
Kevin Qu, Haozhe Qi, Mihai Dusmanu et al. (+3)
04
In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. ...
Kai Zou, Hongbo Liu, Dian Zheng et al. (+3)
05
Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose Agen...
Zhang Zhang, Shuqi Lu, Hongjin Qian et al. (+2)
06
We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embeddi...
Yigit Ekin, Yossi Gandelsman
07
Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remain...
Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero et al. (+4)
08
Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. ...
Huajian Zeng, Abhishek Saroha, Daniel Cremers et al. (+1)

Source: arXiv.org · Cornell University