Deep Label

Jan 29, 2026 Update

A month after I vibe-coded Deep Label, Google released Agentic Vision in Gemini 3 Flash, with the ability to “think, act, and observe”, including zooming and inspecting. An example agent thought audit from a visual-description prompt: “I zoomed into the two primary dark shapes to discern their internal patterns. The upper shape contains vertical dotted lines, while the lower shape features wavy horizontal bands. I also identified the subtle vertical background lines on the bark substrate. The small initial file size made fine detail difficult, but the primary geometric and patterned elements are clearly visible through the crops.”

Agentic Vision works very well. However, complex artworks like “The Fall of the Magician” still require a bit of back and forth to get the agent to identify “all” demons and creatures, and even then some are missed. For exhaustive labeling, breaking the problem down with an agentic workflow of first identifying subject types and then labeling each type using agentic vision still seems necessary.
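The two-stage workflow above can be sketched as a pair of focused passes: one call to enumerate subject types, then one labeling call per type. This is an illustrative sketch, not the actual deep-label code; `ask_model` and the prompts are hypothetical stand-ins for whatever agentic-vision API is used.

```python
def exhaustive_label(image, ask_model):
    """Return {subject_type: labels} via two focused passes.

    `ask_model(image, prompt)` is a placeholder for a vision-LLM call.
    """
    # Stage 1: enumerate every distinct subject type in the whole image.
    subject_types = ask_model(image, "List every distinct subject type.")
    # Stage 2: one detection pass per type, so the model only has to
    # find instances of a single category at a time.
    return {
        t: ask_model(image, f"Find and label every instance of: {t}")
        for t in subject_types
    }
```

Narrowing each pass to a single category is what makes “all” tractable: the model is never asked to be exhaustive about everything at once.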

Bounding boxes of “demons and creatures” after multiple rounds with Agentic Vision in Gemini 3 Flash.

https://github.com/derekphilipau/deep-label

Caution: Vibe-coded work in progress.

I've been dreaming of an IIIIF, an "Intelligent IIIF" framework, one which can deeply analyze and describe images at various scales. This Opus 4.5 vibe-coded proof-of-concept is a step in that direction: a generalized algorithm for recursive, multi-scale discovery and verification, with "attention" passing spatially-attributed context down to smaller scales.

One of my sub-goals is to browse all the "cats" in all artworks in a collection. The object "cutouts" generated by deep-label could form the basis of a fun semantic search UI, or even serve as training data for specialized image models.

deep-label is an intelligent, agentic computer vision workflow designed to analyze complex artwork with exhaustive detail.

Unlike standard object detection models that only find the most obvious elements (e.g., "person," "dog"), this system uses an N-level recursive detection approach with per-tile verification to force LLMs to look deeper, identifying specific background details, individual crowd members, and subtle narrative elements.
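The N-level recursive pass can be pictured as quadrant tiling: detect on the full image, split into four tiles, and recurse, running detection and verification on each crop. A minimal sketch of the tiling geometry, assuming quadrant splits (the actual deep-label tiling scheme and depth may differ):

```python
def tile_regions(x, y, w, h, levels):
    """Yield (remaining_levels, x, y, w, h) for a region and,
    recursively, its four quadrants, until `levels` reaches zero."""
    yield (levels, x, y, w, h)
    if levels == 0:
        return
    hw, hh = w // 2, h // 2  # right/bottom quadrants absorb odd pixels
    for qx, qy, qw, qh in (
        (x, y, hw, hh),
        (x + hw, y, w - hw, hh),
        (x, y + hh, hw, h - hh),
        (x + hw, y + hh, w - hw, h - hh),
    ):
        yield from tile_regions(qx, qy, qw, qh, levels - 1)
```

A 2-level pass over a 1200×900 image yields 1 + 4 + 16 = 21 crops; each tile's detections come back in tile coordinates and are offset by the tile's (x, y) before the per-tile verification step compares them against neighbors.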

It then uses this exhaustive data as "Ground Truth" to generate high-quality, hallucination-free accessibility descriptions (Alt Text and Long Descriptions) for museum contexts.

“Hunting near Hartenfels Castle”, Lucas Cranach, 1540

Deep-label segmentation of "Hunting near Hartenfels Castle".

“The Fall of the Magician”, Pieter van der Heyden, 1565

Example cutouts generated for "The Fall of the Magician".
