Seeing
Upstream
Every project I've worked on has taught me the same thing from a different angle: the hard part of perception is not the model. It's everything that happens before the model.
In the Bowden Lab at Vanderbilt, the problem was specular reflections. Bright spots on endoscopy video that washed out the tissue a surgeon needed to see. I built a pipeline called SpecReFlow that used optical flow to pull clean pixels from neighboring frames. The model worked. But the reason it needed to exist was a data problem: the camera was capturing the wrong thing.
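The flow-based replacement step can be sketched in a few lines. This is a minimal illustration, not SpecReFlow itself: the saturation-threshold detection, the `inpaint_specular` name, and the precomputed dense flow field are all simplifying assumptions.

```python
import numpy as np

def inpaint_specular(frame, neighbor, flow, thresh=0.9):
    """Replace saturated (specular) pixels with pixels pulled from a
    neighboring frame, warped by a dense flow field.

    frame, neighbor: (H, W) grayscale images in [0, 1]
    flow: (H, W, 2) per-pixel (dy, dx) displacement into the neighbor
    """
    h, w = frame.shape
    mask = frame > thresh  # crude specular detection by raw intensity
    ys, xs = np.nonzero(mask)
    # Follow the flow into the neighbor frame, clamped to image bounds.
    ny = np.clip(np.round(ys + flow[ys, xs, 0]).astype(int), 0, h - 1)
    nx = np.clip(np.round(xs + flow[ys, xs, 1]).astype(int), 0, w - 1)
    out = frame.copy()
    out[ys, xs] = neighbor[ny, nx]  # pull clean pixels from the neighbor
    return out
```

A real pipeline needs subpixel warping, occlusion handling, and a better specular detector than a fixed threshold; the point here is only the shape of the idea, clean pixels borrowed across time rather than hallucinated in place.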
At Modern Intelligence, the problem was modality laziness. We trained vehicle re-identification models across RGB, near-infrared, and thermal cameras. When you train modalities together with a shared loss, individual sensors stop trying. They coast on the gradient from the dominant modality. We showed, in work we called UniCat, that training each modality alone and concatenating the embeddings at inference produced better representations than any fusion method we tried. The model wasn't the bottleneck. The training setup was.
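The inference-time side of that result fits in a few lines. A minimal sketch, with assumptions labeled: the linear projections stand in for independently trained encoders, and the names `embed_unimodal` and `unicat_embed` are illustrative, not the production code.

```python
import numpy as np

def embed_unimodal(x, W):
    """Project one modality with its own encoder (here just a linear
    map), then L2-normalize so no modality dominates by scale."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def unicat_embed(modalities, encoders):
    """Concatenation-style inference: embed each modality with its
    independently trained encoder, then concatenate. No cross-modal
    fusion layer, so no modality can coast on another's gradient."""
    parts = [embed_unimodal(x, W) for x, W in zip(modalities, encoders)]
    return np.concatenate(parts, axis=-1)

rng = np.random.default_rng(0)
rgb, thermal = rng.normal(size=(4, 32)), rng.normal(size=(4, 16))
W_rgb, W_th = rng.normal(size=(32, 8)), rng.normal(size=(16, 8))
emb = unicat_embed([rgb, thermal], [W_rgb, W_th])
print(emb.shape)  # (4, 16)
```

The design point is that laziness is a training-time failure: because each encoder is optimized alone, its loss can't be satisfied by another sensor's signal.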
At Datology, the problem is curation. Which images and texts belong in a VLM training run, at what ratio, at what quality threshold. I build the infrastructure for those decisions: quality scoring, deduplication, diversity measurement, mixture optimization. The model is downstream of all of it.
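The shape of those decisions can be shown with a toy pass. Everything here is a stand-in: `curate`, the length-based quality score, exact-hash dedup, and the single mixture ratio are placeholders for learned scorers, near-duplicate detection, and full mixture optimization.

```python
import hashlib
import random

def curate(samples, quality_fn, min_quality, mix_ratio, seed=0):
    """Toy curation pass: score, threshold, exact-dedup, then subsample
    to a target mixture ratio. Real pipelines use learned quality
    scorers and near-duplicate detection; this shows only the order
    and shape of the decisions."""
    seen, kept = set(), []
    for text in samples:
        if quality_fn(text) < min_quality:
            continue  # quality gate
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in seen:
            continue  # exact deduplication
        seen.add(h)
        kept.append(text)
    rng = random.Random(seed)
    k = max(1, int(len(kept) * mix_ratio))
    return rng.sample(kept, k)  # mixture ratio applied last

docs = ["a good doc", "a good doc", "ok", "another fine doc"]
subset = curate(docs, quality_fn=len, min_quality=3, mix_ratio=0.5)
print(len(subset))  # 1
```

Even in the toy version, the ordering matters: dedup before sampling, or the mixture ratio gets spent on copies.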
The pattern keeps repeating. The leverage is upstream.
Perception
I think a lot about why biological perception works differently.
Human perception doesn't fuse modalities the way a late-fusion model does. It doesn't even try. Different sensory streams stay partially independent, get integrated at multiple levels, and the system tolerates ambiguity instead of resolving it into a single embedding. That's closer to what UniCat stumbled into than what most fusion architectures try to do.
I don't have a grand theory here. I just notice that the hardest perception problems I've worked on were hard because we imposed the wrong structure on the input, not because we lacked capacity in the model. Biological systems don't seem to make that mistake as often. I want to understand why.
Robotics and embodied AI pull on this thread. Perception with physical consequences is different from perception for retrieval or classification. When your next action depends on what you see, the cost of misperception is immediate. I want to understand where that changes the design.
Aliveness
I picked up viola again after years away. I'm working on Max Bruch's Romanze, Op. 85. It uses a part of my brain that doesn't get exercised writing training configs. Listening to intonation requires a kind of attention that's closer to perception research than I expected.
I cook. Not in a content-creator way. In a "this is how I decompress and think about something other than loss curves" way.
I believe the purpose of life might just be to be alive. I don't say that as a brand statement. I say it because I spent my early twenties optimizing for achievement metrics and I'm trying to hold more than that now. This site is part of the effort. It's not a portfolio. It's where I try to keep the whole picture in one place.