SeeingEye
Published:
SeeingEye explores how text-only LLMs can perform multimodal reasoning through agentic information flow rather than direct vision inputs.
Key contributions include:
- Proposed an agentic information-flow framework that converts visual observations into structured, tool-mediated textual context.
- Designed a perception-to-reasoning pipeline for multimodal tasks that keeps the backbone model text-only.
- Studied how tool use, intermediate representations, and memory-like context enable multimodal reasoning in text-only LLMs.
- Released the work as an arXiv preprint; it is currently under review at EMNLP 2026.
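
The information flow described above can be illustrated with a minimal sketch. It is not the SeeingEye implementation: every name here (`Observation`, `Context`, `run`, the `CALL`/`ANSWER` action format, and the stub tool and reasoner) is a hypothetical stand-in chosen for illustration. The sketch assumes a perception tool that returns text, a running context that accumulates tool results, and a text-only reasoner that only ever sees the rendered text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Observation:
    """One tool result, stored as plain text a text-only model can read."""
    tool: str
    query: str
    content: str


@dataclass
class Context:
    """Memory-like running context: the question plus all observations so far."""
    question: str
    observations: List[Observation] = field(default_factory=list)

    def render(self) -> str:
        # Serialize everything into text; the reasoner never sees pixels.
        lines = [f"Question: {self.question}"]
        for i, obs in enumerate(self.observations, start=1):
            lines.append(f"[{i}] {obs.tool}({obs.query!r}) -> {obs.content}")
        return "\n".join(lines)


# Hypothetical interfaces (assumptions, not from the paper):
# a tool maps (image reference, query) to text;
# a reasoner maps rendered text to "CALL <tool> <query>" or "ANSWER <text>".
Tool = Callable[[str, str], str]
Reasoner = Callable[[str], str]


def run(image_ref: str, question: str, tools: Dict[str, Tool],
        reasoner: Reasoner, max_steps: int = 4) -> str:
    """Alternate between text-only reasoning and tool-mediated perception."""
    ctx = Context(question)
    for _ in range(max_steps):
        action = reasoner(ctx.render())
        kind, _, rest = action.partition(" ")
        if kind == "ANSWER":
            return rest
        if kind != "CALL":
            break  # malformed action: stop rather than guess
        name, _, query = rest.partition(" ")
        if name in tools:
            content = tools[name](image_ref, query)
        else:
            content = f"unknown tool {name}"
        ctx.observations.append(Observation(name, query, content))
    return "no answer within step budget"


def stub_caption(image_ref: str, query: str) -> str:
    """Stand-in for a real vision tool; returns a fixed description."""
    return "a red stop sign next to a parked bicycle"


def stub_reasoner(text: str) -> str:
    """Stand-in for a text-only LLM: ask for a caption once, then answer."""
    if "caption" not in text:
        return "CALL caption describe the scene"
    return "ANSWER red"


answer = run("image.png", "What color is the sign?",
             {"caption": stub_caption}, stub_reasoner)
```

In this sketch the only path from the image to the reasoner is the text returned by a tool, which is the property the contributions above emphasize. A real system would replace the stubs with an actual perception model and an LLM call.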

