Image Annotation and Segmentation Project
This repository contains Jupyter notebooks for comprehensive image annotation using state-of-the-art vision-language models. The project encompasses image understanding, object segmentation, and annotation format conversion to facilitate computer vision model development.
Models Used
This project utilizes the Moondream series of vision-language models, which are compact yet powerful models designed for image understanding and description. These models combine transformer architectures with vision encoders to provide detailed analysis of image content. Moondream models are particularly efficient for edge deployment while maintaining high accuracy in image comprehension tasks.
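For orientation, the sketch below shows one way to load a Moondream checkpoint through Hugging Face Transformers and request a caption and an answer to a free-form question. The `vikhyatk/moondream2` model ID and the `caption`/`query` methods exposed through `trust_remote_code` reflect the public Moondream 2 model card at the time of writing; treat the exact method names and return types as assumptions that may differ between revisions and from the code in the notebooks.

```python
# Minimal sketch (not the notebooks' exact code): load Moondream 2 and query an image.
# Assumes the "vikhyatk/moondream2" checkpoint and its remote-code caption/query API;
# check the model card of the revision you download for the exact interface.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,                 # Moondream ships its own modeling code
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device).eval()

image = Image.open("sample.jpg")            # placeholder image path
print(model.caption(image))                 # short scene description
print(model.query(image, "What objects are visible in this image?"))
```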
VRAM Requirements
- SAM 3 and Grounding DINO: 2 GB VRAM
- Moondream 2: 3.8 GB VRAM
- Moondream 3 (Quantized INT4): 6 GB VRAM
Notebooks
1. Image_Annotation_Testing_Satyam.ipynb
This notebook provides comprehensive testing and evaluation of image annotation capabilities using Moondream vision-language models. It includes various experiments to assess model performance, accuracy of image understanding, and annotation quality. The notebook tests functionalities such as caption generation, object identification, scene description, and multi-modal reasoning. It also evaluates the model's capability to accurately detect and describe objects, people, and contexts within images.
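To give a flavor of the localization experiments, the hedged sketch below asks Moondream to detect a named object and converts the assumed normalized box coordinates into pixels. The `detect` method and its `objects`/`x_min`-style return fields are assumptions based on the public Moondream 2 remote-code interface, not necessarily what this notebook calls.

```python
# Illustrative sketch: object localization with Moondream (field names are assumptions).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", trust_remote_code=True, torch_dtype=torch.float16,
).to("cuda").eval()

image = Image.open("street_scene.jpg")      # placeholder image path
width, height = image.size

result = model.detect(image, "car")         # assumed API: returns normalized boxes
for obj in result.get("objects", []):
    box = (
        int(obj["x_min"] * width), int(obj["y_min"] * height),
        int(obj["x_max"] * width), int(obj["y_max"] * height),
    )
    print("car at", box)
```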
2. Moondream_Segmentation_Satyam.ipynb
This notebook implements advanced segmentation capabilities with the Moondream vision-language model, focusing on object detection and precise boundary generation. It performs pixel-level segmentation of objects within images, creating accurate masks for the different entities in a scene. The notebook exercises instance segmentation, semantic segmentation, object boundary precision, and the integration of segmentation with textual descriptions for comprehensive image understanding, demonstrating how vision-language models can combine spatial understanding with contextual knowledge.
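The notebook's own approach is described above; as one illustrative pattern for turning a detector box into a pixel-accurate mask, the sketch below prompts SAM with a bounding box using the original segment-anything package. The SAM variant, checkpoint path, and prompting code here are assumptions for illustration, not necessarily what the notebook does.

```python
# Hedged sketch: box-prompted SAM mask generation (segment-anything package API;
# checkpoint path and box coordinates are placeholders).
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # local checkpoint
predictor = SamPredictor(sam)

image_np = np.array(Image.open("street_scene.jpg").convert("RGB"))
predictor.set_image(image_np)

# x_min, y_min, x_max, y_max in pixels, e.g. taken from a Moondream detection
box = np.array([120, 80, 420, 360])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
mask = masks[0]                              # boolean HxW mask for the object
print("mask pixels:", int(mask.sum()))
```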
3. Moondream3_to_COCO_Satyam.ipynb
This notebook handles the conversion of segmentation annotations to the COCO (Common Objects in Context) format, providing compatibility with mainstream computer vision frameworks. It transforms segmented objects into standardized JSON annotations with bounding boxes, segmentation masks, and category labels. The functionality includes conversion validation, format standardization, and preparation of datasets for training object detection and segmentation models. This enables seamless integration with popular frameworks like Detectron2, MMDetection, and other training pipelines.
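As a minimal sketch of the conversion this notebook performs, the snippet below turns a binary mask into a COCO-style annotation (polygon segmentation, bounding box, area) and writes a small COCO JSON file. Category IDs, image metadata, and file names are placeholders; the notebook's actual mapping and validation logic are more complete.

```python
# Minimal mask -> COCO conversion sketch (placeholder IDs, sizes, and file names).
import json
import cv2
import numpy as np

def mask_to_coco_annotation(mask, image_id, category_id, ann_id):
    """Convert a boolean HxW mask into a COCO-style annotation dict."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    segmentation = [c.flatten().tolist() for c in contours if len(c) >= 3]
    ys, xs = np.where(mask)
    x_min, y_min, x_max, y_max = xs.min(), ys.min(), xs.max(), ys.max()
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": segmentation,
        "bbox": [int(x_min), int(y_min), int(x_max - x_min), int(y_max - y_min)],
        "area": int(mask.sum()),
        "iscrowd": 0,
    }

# Placeholder mask standing in for a model-generated one (e.g. from SAM).
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 150:400] = True

coco = {
    "images": [{"id": 1, "file_name": "street_scene.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [mask_to_coco_annotation(mask, image_id=1, category_id=1, ann_id=1)],
}
with open("annotations_coco.json", "w") as f:
    json.dump(coco, f)
```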
Prerequisites
To run these notebooks, you'll need:
- Python 3.8+
- Jupyter Notebook or JupyterLab
- PyTorch >= 1.10
- Transformers
- Pillow
- NumPy
- OpenCV-Python
- Moondream model dependencies
- Matplotlib
- Scikit-image
Setup
- Clone or download this repository
- Install required dependencies:
pip install torch torchvision
pip install transformers pillow numpy opencv-python matplotlib scikit-image
pip install moondream
- Launch Jupyter:
jupyter notebook
- Open any of the notebooks and run the cells
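Before loading any model, it can help to confirm that the local GPU meets the VRAM figures listed above; the small check below uses plain PyTorch and makes no project-specific assumptions.

```python
# Quick sanity check: report GPU availability and total VRAM.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, total VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU detected; models will fall back to CPU and run much slower.")
```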
Usage
Each notebook serves a specific purpose in the image annotation pipeline:
- Start with Image_Annotation_Testing_Satyam.ipynb to understand model capabilities and test basic annotation functions
- Use Moondream_Segmentation_Satyam.ipynb for detailed object segmentation tasks and mask generation
- Apply Moondream3_to_COCO_Satyam.ipynb to standardize your annotations for downstream ML model training
Key Functionalities Tested
- Image captioning and description
- Object detection and localization
- Instance and semantic segmentation
- Multi-modal reasoning
- Annotation format conversion
- Precision of boundary detection
- Integration of visual and linguistic understanding
Dependencies
- Moondream - Efficient vision-language model
- PyTorch - Deep learning framework
- OpenCV - Computer vision operations
- COCO API - Annotation format handling
- Transformers - Hugging Face library for model processing
Notes
- VRAM requirements vary by model: SAM 3 and Grounding DINO (2 GB), Moondream 2 (3.8 GB), Moondream 3 (quantized INT4) (6 GB)
- For optimal performance, ensure your GPU meets or exceeds the VRAM requirements for your selected model
- Models may require internet connectivity for their initial download from the Hugging Face Hub
- Results may vary depending on the complexity and quality of input images
- Preprocessing steps may be necessary for optimal model performance
Author
Satyam Rastogi - Image Annotation and Segmentation Project