Nature is inherently structured: the entities of the real world are organized in rich relationships. For example, dolphins and sharks, despite their striking visual resemblance in body shape and fins, come from entirely different branches of the animal hierarchy, namely mammals and fishes. This remarkable similarity is a prime example of ‘convergent evolution’, where unrelated species develop similar features because they face similar environmental challenges, and it illustrates how nature’s underlying organization often transcends superficial visual resemblance.

Although humans intuitively grasp and exploit these natural constraints, most AI systems leave them untapped. As a result, trained models tend to align with statistical patterns in the data, such as sampling biases or class imbalance, rather than with the underlying relational consistency. This thesis argues that AI systems must evolve beyond learning “flat” feature representations, which are domain-agnostic and derived purely from data correlations, and instead explicitly model domain-specific structural relationships. A key benefit of encoding relational priors in the learning process is that they inject domain knowledge as an inductive bias, leading to more robust and reliable models. My research incorporates such knowledge through graph-based structural priors that explicitly model relational constraints in visual recognition tasks, spanning three dimensions that progress from coarse, image-level to fine-grained, scene-level understanding.

First, my research highlights a crucial limitation of existing models: they often fail to incorporate real-world constraints, and even powerful pre-trained neural networks can make severe mistakes for lack of domain knowledge. I argue that standard metrics such as top-1 accuracy, precision, and recall are insufficient for evaluating model robustness, and I propose a new metric based on the rank order of the predictions as a better indicator of reliability (a minimal sketch of such a metric appears below). Benchmarks on several large-scale datasets confirm that existing solutions do not sufficiently capture domain knowledge, which is often available as a taxonomy tree, motivating our design of better learning frameworks.

Second, I examine complex visual re-identification (Re-ID) tasks, such as monitoring animals in the wild, where existing foundation models struggle with new species and environments. The challenge is compounded by the high cost of manually annotating data to adapt these systems to new settings. While existing unsupervised learning methods can reduce the need for extensive labeling, they often suffer from under- and over-segmentation errors, which led me to develop more effective active learning strategies (see the second sketch below).

Finally, I address the limitations of the classic Kalman filter, a widely used tool for dynamic systems. The filter makes a flawed assumption that the movement of each object is independent of its dynamic surroundings, which is rarely the case in the real world (the third sketch below makes this assumption explicit). I demonstrate the need for a new filtering mechanism that considers not only an object’s past movements but also its spatial relationships with the other dynamic entities in its environment. Across my analysis, I observed that vision foundation models for all recognition tasks, i.e., classification, detection, and segmentation, lack this domain knowledge.
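To make the rank-order idea concrete, here is a minimal, illustrative sketch of a taxonomy-aware metric: it scores a model’s ranked predictions by their tree distance to the true label, discounting mistakes that appear lower in the ranking. The parent map, the rank weighting, and the function names are my own assumptions for illustration, not the exact formulation proposed in the thesis.

```python
# Illustrative sketch: severity of ranked predictions under a taxonomy tree.
# All names and the weighting scheme are assumptions, not the thesis metric.

def ancestors(node, parent):
    """Return the path from `node` up to the root of the taxonomy."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges between two leaves via their lowest common ancestor."""
    pa, pb = ancestors(a, parent), ancestors(b, parent)
    shared = set(pb)
    for i, n in enumerate(pa):          # first shared ancestor is the LCA
        if n in shared:
            return i + pb.index(n)
    raise ValueError("nodes are not in the same taxonomy")

def rank_order_severity(ranked_preds, label, parent, k=5):
    """Average taxonomy distance of the top-k predictions, weighted so
    that mistakes near the top of the ranking cost more."""
    top = ranked_preds[:k]
    weights = [1.0 / (r + 1) for r in range(len(top))]   # rank-discounted
    dists = [tree_distance(p, label, parent) for p in top]
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)

# Toy taxonomy: a dolphin is a mammal, a shark is a fish.
parent = {"dolphin": "mammal", "shark": "fish",
          "mammal": "animal", "fish": "animal"}
# Ranking "shark" above "dolphin" for a dolphin image is a severe mistake.
print(rank_order_severity(["shark", "dolphin"], "dolphin", parent, k=2))
```

Under such a metric, two models with identical top-1 accuracy can differ sharply: the one that confuses a dolphin with a shark is penalized more than one that confuses it with another mammal, which is exactly the distinction flat metrics miss.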
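The second sketch illustrates one plausible active-learning selection step for Re-ID: image pairs whose match score sits closest to the decision threshold are the most ambiguous, so they are routed to a human annotator first. The cosine-similarity criterion, the threshold, and every name here are hypothetical stand-ins for the strategies the thesis actually develops.

```python
import numpy as np

# Illustrative sketch: uncertainty-based pair selection for Re-ID
# active learning. The selection rule is an assumption for exposition.

def select_pairs_for_annotation(features, threshold=0.5, budget=10):
    """Rank image pairs by how ambiguous their match score is and
    return the `budget` most ambiguous pairs for human labeling."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T                                  # cosine similarities
    n = len(f)
    pairs = [(abs(sims[i, j] - threshold), i, j)    # distance to boundary
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort()                                    # most ambiguous first
    return [(i, j) for _, i, j in pairs[:budget]]

# Toy usage: random embeddings stand in for a Re-ID backbone's features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
print(select_pairs_for_annotation(feats, budget=3))
```

Spending the labeling budget on near-threshold pairs targets precisely the ambiguous identities that drive under- and over-segmentation in unsupervised clustering.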
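To show exactly where the criticized independence assumption lives, the third sketch is a minimal constant-velocity Kalman filter in its standard textbook form: each object owns its own filter, and the predict step uses only that object’s past state, with no term coupling it to other moving entities. The matrix values are illustrative.

```python
import numpy as np

# Standard per-object Kalman filter (constant-velocity model). Note that
# nothing in predict() or update() references any other track: this is
# the independence assumption the thesis argues against.

class KalmanTrack:
    def __init__(self, x0):
        self.x = np.array(x0, dtype=float)        # state: [px, py, vx, vy]
        self.P = np.eye(4)                        # state covariance
        dt = 1.0
        self.F = np.array([[1, 0, dt, 0],         # constant-velocity motion
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]])
        self.H = np.array([[1, 0, 0, 0],          # only position is observed
                           [0, 1, 0, 0]])
        self.Q = 0.01 * np.eye(4)                 # process noise
        self.R = 0.10 * np.eye(2)                 # measurement noise

    def predict(self):
        # Uses only this object's own past state and motion model.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z) - self.H @ self.x                # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

track = KalmanTrack([0, 0, 1, 0])
print(track.predict())      # position advanced by its own velocity alone
track.update([1.1, 0.05])
```

A relation-aware filter would replace the purely self-referential predict step with one that also conditions on the states of nearby dynamic objects, which is the direction the thesis pursues.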
I believe that our learning framework, which was designed specifically for classification, can be adapted to other recognition tasks, and I speculate that a unified learning framework could be designed to make vision foundation models aware of the available taxonomy.
