Introduction

In this work, we address a new problem of Open-Domain Open-Vocabulary (ODOV) object detection, which considers a detection model's adaptability to the real world under both domain and category shifts. For this problem, we first construct a new benchmark, OD-LVIS, which includes 46,949 images covering 18 complex real-world domains and 1,203 categories, providing a comprehensive dataset for evaluating real-world object detection. We further develop a novel baseline method for ODOV detection. The proposed method first leverages large language models to generate domain-agnostic text prompts for category embedding. It then learns a domain embedding from the given image, which, during testing, is integrated with the category embedding to form a customized domain-specific category embedding for each test image. We provide extensive benchmark evaluations for the proposed ODOV detection task and report the results, which verify the rationale of ODOV detection, the usefulness of our benchmark, and the superiority of the proposed method.
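The embedding fusion described above can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the fusion operator (a weighted sum), the `alpha` weight, and the embedding dimensions are all assumptions made for the example.

```python
import numpy as np

def combine_embeddings(category_emb, domain_emb, alpha=0.5):
    """Fuse domain-agnostic category embeddings with an image-specific
    domain embedding (hypothetical fusion; the method's exact operator
    may differ)."""
    # Broadcast the single domain vector over all C category vectors.
    fused = (1 - alpha) * category_emb + alpha * domain_emb[None, :]
    # L2-normalize so the result can be scored by cosine similarity.
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)

# Toy example: 1,203 categories with 512-d embeddings (CLIP-like sizes).
rng = np.random.default_rng(0)
category_emb = rng.normal(size=(1203, 512))  # from LLM-generated prompts
domain_emb = rng.normal(size=512)            # learned from the test image
custom_emb = combine_embeddings(category_emb, domain_emb)
print(custom_emb.shape)  # (1203, 512)
```

At test time, each image would yield its own `domain_emb`, so the resulting category embeddings are customized per image.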

Dataset examples


Characteristics

We then present the characteristics of the proposed OD-LVIS, which contains a series of challenging settings specifically designed to probe the limitations of object detection models in open-world scenarios.

Multi-category and multi-object scenes. Most images in OD-LVIS contain multiple objects from different categories, evaluating the model's robustness in both object localization and classification.

Various object sizes and aspect ratios. OD-LVIS includes objects of the same category at different sizes and shapes, requiring models to recognize and distinguish objects under diverse visual conditions.

Complex backgrounds and overlapping objects. Many images in OD-LVIS contain complex backgrounds and overlapping objects, reflecting the wild environments surrounding objects in real-world scenarios.

Long-tailed category distribution. As in the real world, the object categories in our dataset follow a long-tailed distribution, demanding that models exhibit strong capabilities in localizing and distinguishing rare, tail-class samples.

Cross-domain variability. Since OD-LVIS shares categories with LVIS, it can be combined with LVIS (used for training) to further assess the domain generalization ability of object detection models across 18 diverse scenes.

These challenges are designed to simulate complex real-world conditions, providing a comprehensive benchmark to advance the development of detection technologies in open real-world scenarios.


Domain distribution visualization

The figure illustrates the feature distribution of samples from the apple category. We extract features using ViT-B/16 models pretrained on ImageNet and CLIP, respectively, and visualize them using t-SNE. The samples clearly cluster based on their domain characteristics.
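The visualization above can be sketched as follows. The features here are random stand-ins for the ViT-B/16 features of "apple" samples from two domains (in practice they would come from the pretrained backbone); the t-SNE hyperparameters are illustrative choices.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for ViT-B/16 features (768-d) of "apple" samples
# drawn from two different domains; real features come from the backbone.
rng = np.random.default_rng(0)
feats = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 768)),  # features from domain A
    rng.normal(3.0, 1.0, size=(50, 768)),  # features from domain B
])

# Project the high-dimensional features to 2-D for plotting.
emb_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
print(emb_2d.shape)  # (100, 2)
```

Plotting `emb_2d` colored by domain label would reproduce the per-domain clustering the figure shows.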
