SeeingEye

Published: October 01, 2025

SeeingEye explores how text-only LLMs can perform multimodal reasoning through agentic information flow rather than direct vision inputs.

Key contributions include:

Proposed an agentic information-flow framework that converts visual observations into structured, tool-mediated textual context.
Designed the perception-to-reasoning pipeline for multimodal tasks while keeping the backbone model text-only.
Studied how tool use, intermediate representations, and memory-like context help text-only LLMs unlock multimodal reasoning behavior.
Released the work as an arXiv preprint, currently under review at EMNLP 2026.
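
The information flow described in the contributions above can be illustrated with a minimal sketch. This is not the SeeingEye implementation: every name here (Observation, TextContext, stub_caption, stub_ocr, stub_reasoner, run_pipeline) is hypothetical, the perception tools are hard-coded stubs rather than real vision models, and the reasoner is a placeholder for a call to a text-only LLM. It only shows the general shape of the idea, in which visual observations reach the reasoner as structured text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Observation:
    """One tool-mediated description of the image, stored as plain text."""
    tool: str
    content: str


@dataclass
class TextContext:
    """Memory-like buffer that accumulates observations for the reasoner."""
    question: str
    observations: List[Observation] = field(default_factory=list)

    def render(self) -> str:
        # Serialize the question and all observations into a text prompt.
        lines = [f"Question: {self.question}", "Observations:"]
        lines += [f"- [{o.tool}] {o.content}" for o in self.observations]
        return "\n".join(lines)


# Hypothetical perception tools. Real ones would wrap a captioner, OCR
# engine, or detector; these stubs return fixed strings for illustration.
def stub_caption(image_id: str) -> str:
    return f"a red street sign ({image_id})"


def stub_ocr(image_id: str) -> str:
    return "STOP"


def stub_reasoner(prompt: str) -> str:
    # Placeholder for a text-only LLM call: it simply echoes the OCR line.
    prefix = "- [ocr] "
    for line in prompt.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return "unknown"


def run_pipeline(
    image_id: str,
    question: str,
    tools: Dict[str, Callable[[str], str]],
    reasoner: Callable[[str], str],
) -> str:
    """Perception step fills the text context; the reasoner sees text only."""
    context = TextContext(question=question)
    for name, tool in tools.items():
        context.observations.append(Observation(name, tool(image_id)))
    return reasoner(context.render())


if __name__ == "__main__":
    tools = {"caption": stub_caption, "ocr": stub_ocr}
    print(run_pipeline("img_001", "What does the sign say?", tools, stub_reasoner))
```

In this sketch the reasoner never receives the image itself, only the rendered text context, which is the property the text-only backbone relies on. Swapping the stubs for real tools and a real LLM call would not change the control flow.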
