SeeingEye

Published:

SeeingEye explores how text-only LLMs can perform multimodal reasoning through agentic information flow rather than direct vision inputs.

Key contributions include:

  • Proposed an agentic information-flow framework that converts visual observations into structured, tool-mediated textual context.
  • Designed a perception-to-reasoning pipeline for multimodal tasks that keeps the backbone model text-only.
  • Studied how tool use, intermediate representations, and memory-like context help text-only LLMs unlock multimodal reasoning behavior.
  • Released the work as an arXiv preprint; it is currently under review at EMNLP 2026.
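
The information flow described in the contributions above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the SeeingEye implementation: the names `Observation`, `Context`, `run_pipeline`, `toy_reasoner`, the `CALL:` / `ANSWER:` message convention, and the stub `ocr` tool are all hypothetical. The sketch only shows the general shape of the idea: perception tools turn an image into text, the text accumulates in a memory-like context, and a text-only reasoner either asks for another tool call or answers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Observation:
    """One tool result about the image, already converted to text."""
    tool: str
    content: str


@dataclass
class Context:
    """Memory-like textual context that grows across reasoning rounds."""
    question: str
    observations: List[Observation] = field(default_factory=list)

    def render(self) -> str:
        # The reasoner only ever sees this string, never the pixels.
        lines = [f"Question: {self.question}"]
        lines += [f"[{o.tool}] {o.content}" for o in self.observations]
        return "\n".join(lines)


# Hypothetical tool signature: image reference in, text description out.
Tool = Callable[[str], str]


def run_pipeline(image_ref: str, question: str, tools: Dict[str, Tool],
                 reasoner: Callable[[str], str], max_rounds: int = 3) -> str:
    """Alternate between text-only reasoning and tool-mediated perception."""
    ctx = Context(question)
    for _ in range(max_rounds):
        decision = reasoner(ctx.render())
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        if not decision.startswith("CALL:"):
            break  # malformed reply under the assumed convention
        tool_name = decision[len("CALL:"):].strip()
        if tool_name not in tools:
            break  # reasoner asked for a tool that does not exist
        ctx.observations.append(Observation(tool_name, tools[tool_name](image_ref)))
    return "unknown"


def toy_reasoner(text: str) -> str:
    """Stand-in for a text-only LLM: request OCR once, then answer from it."""
    if "[ocr]" not in text:
        return "CALL: ocr"
    return "ANSWER: " + text.split("[ocr] ", 1)[1].splitlines()[0]


# Stub perception tool; a real system would call an OCR or captioning model here.
tools = {"ocr": lambda image_ref: "STOP"}
answer = run_pipeline("sign.png", "What does the sign say?", tools, toy_reasoner)
```

In this toy run the reasoner first requests the `ocr` tool, the tool's text output is appended to the context, and the second round answers from that text alone. Swapping `toy_reasoner` for a call to an actual text-only LLM would leave the loop unchanged.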

Direct Link