
Image Annotation and Segmentation Project

This repository contains Jupyter notebooks for comprehensive image annotation using state-of-the-art vision-language models. The project encompasses image understanding, object segmentation (with Moondream returning normalized SVG path strings and bounding boxes), and conversion of Moondream's native annotation format to the widely used COCO format to facilitate computer vision model development.

Models Used

This project utilizes the Moondream series of vision-language models, which are compact yet powerful models designed for image understanding and description. These models combine transformer architectures with vision encoders to provide detailed analysis of image content. Moondream models are particularly efficient for edge deployment while maintaining high accuracy in image comprehension tasks.

VRAM Requirements

  • SAM 3 and Grounding DINO: 2 GB VRAM
  • Moondream 2: 3.8 GB VRAM
  • Moondream 3 (Quantized INT4): 6 GB VRAM
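
As a quick sanity check before loading a model, the requirements above can be compared against the memory PyTorch reports for your GPU. This is a minimal sketch; the threshold passed in is just an example value taken from the list above.

import torch

# Rough check that the installed GPU covers a model's VRAM requirement.
# required_gib comes from the list above (e.g. 3.8 for Moondream 2).
def has_enough_vram(required_gib: float, device_index: int = 0) -> bool:
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return total_bytes / (1024 ** 3) >= required_gib

print("Moondream 2 fits:", has_enough_vram(3.8))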

Notebooks

1. Image_Annotation_Testing_Satyam.ipynb

This notebook provides comprehensive testing and evaluation of image annotation capabilities using Moondream vision-language models. It includes various experiments to assess model performance, accuracy of image understanding, and annotation quality. The notebook tests functionalities such as caption generation, object identification, scene description, and multi-modal reasoning. It also evaluates the model's capability to accurately detect and describe objects, people, and contexts within images.
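
As a rough illustration of the kinds of calls this notebook exercises, the sketch below uses the moondream Python client installed in the Setup section; the model weights path, image path, and prompts are placeholders, and the exact client API may vary between moondream releases.

import moondream as md
from PIL import Image

# Load a local Moondream model; the weights file path is a placeholder.
model = md.vl(model="./moondream-2b-int8.mf")

image = Image.open("./sample.jpg")

# Caption generation / scene description.
print(model.caption(image)["caption"])

# Multi-modal reasoning via free-form visual question answering.
print(model.query(image, "What objects and people are visible, and what is happening?")["answer"])

# Object identification: bounding boxes for a named object class.
print(model.detect(image, "person")["objects"])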

2. Moondream_Segmentation_Satyam.ipynb

This notebook implements advanced segmentation capabilities using the Moondream vision-language model, focusing on object detection and precise boundary generation. It segments objects within images and generates a normalized SVG path string plus a bounding box for each segmented object. The path encodes the object's mask as an SVG path, with coordinates in the range 0–1 relative to the bounding box rather than the full image. The notebook tests functionality including instance segmentation, semantic segmentation, object boundary precision, and the integration of segmentation with textual descriptions for comprehensive image understanding. It demonstrates how vision-language models can combine spatial understanding with contextual knowledge.
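
As a minimal sketch of how such an output can be interpreted, the normalized path coordinates can be mapped back to image pixels using the bounding box. The field names and sample values below are illustrative rather than the notebook's actual variables, and the helper assumes a simple path made of absolute move/line commands.

import re

# Hypothetical detection in the shape described above: a bounding box in image
# coordinates plus an SVG path normalized (0-1) relative to that box.
detection = {
    "bbox": {"x_min": 120.0, "y_min": 80.0, "x_max": 360.0, "y_max": 300.0},
    "svg_path": "M 0.1 0.2 L 0.9 0.2 L 0.9 0.8 L 0.1 0.8 Z",
}

def path_to_image_coords(svg_path, bbox):
    """Convert bbox-relative, normalized SVG path coordinates to absolute image pixels."""
    w = bbox["x_max"] - bbox["x_min"]
    h = bbox["y_max"] - bbox["y_min"]
    numbers = [float(n) for n in re.findall(r"-?\d*\.?\d+", svg_path)]
    points = []
    for x_norm, y_norm in zip(numbers[0::2], numbers[1::2]):
        points.append((bbox["x_min"] + x_norm * w, bbox["y_min"] + y_norm * h))
    return points

print(path_to_image_coords(detection["svg_path"], detection["bbox"]))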

3. Moondream3_to_COCO_Satyam.ipynb

This notebook handles the conversion of Moondream's segmentation annotations to the COCO (Common Objects in Context) format, providing compatibility with mainstream computer vision frameworks. Since Moondream returns normalized SVG path strings with coordinates in the range 0–1 relative to the bounding box (rather than full image coordinates), this notebook converts these paths into the polygon format required by COCO. It transforms segmented objects into standardized JSON annotations with bounding boxes, segmentation masks, and category labels. The functionality includes conversion validation, format standardization, and preparation of datasets for training object detection and segmentation models. This enables seamless integration with popular frameworks like Detectron2, MMDetection, and other training pipelines, as COCO is a much more widely used format than the native Moondream output.
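
A minimal sketch of the target structure is shown below: polygon points already converted to absolute image coordinates (for example with a helper like the one sketched above) are flattened into COCO's [x1, y1, x2, y2, ...] segmentation list next to an [x, y, width, height] bounding box. The image size, file name, and category below are placeholder values, not outputs of the notebook.

import json

def to_coco_annotation(points, ann_id, image_id, category_id):
    """Build a single COCO-style annotation from absolute polygon points [(x, y), ...]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x, y = min(xs), min(ys)
    w, h = max(xs) - x, max(ys) - y
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": [[coord for point in points for coord in point]],
        "bbox": [x, y, w, h],
        "area": w * h,  # bbox area as a rough stand-in for the true polygon area
        "iscrowd": 0,
    }

coco = {
    "images": [{"id": 1, "file_name": "sample.jpg", "width": 640, "height": 480}],
    "annotations": [to_coco_annotation([(120, 80), (360, 80), (360, 300), (120, 300)], 1, 1, 1)],
    "categories": [{"id": 1, "name": "person"}],
}

with open("annotations_coco.json", "w") as f:
    json.dump(coco, f, indent=2)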

Prerequisites

To run these notebooks, you'll need:

  • Python 3.8+
  • Jupyter Notebook or JupyterLab
  • PyTorch >= 1.10
  • Transformers
  • Pillow
  • NumPy
  • OpenCV-Python
  • Matplotlib
  • Scikit-image
  • Moondream (for Moondream models)
  • SAM (Segment Anything) and Grounding DINO (for segmentation tasks)
  • Hugging Face Accelerate (for optimized inference)
  • Requests
  • json (Python standard library, used for reading and writing annotation files)
  • tqdm (for progress bars)
  • BitsAndBytes (for quantized models)
  • SentencePiece (for tokenization, if needed)

Setup

  1. Clone or download this repository
  2. Install required dependencies:
pip install torch torchvision torchaudio
pip install transformers pillow numpy opencv-python matplotlib scikit-image
pip install moondream jupyter
pip install accelerate bitsandbytes sentencepiece
pip install supervision  # For visualization and annotations
pip install segment-anything  # For SAM model
pip install groundingdino-py  # For Grounding DINO
pip install huggingface_hub
  3. Launch Jupyter:
jupyter notebook
  4. Open any of the notebooks and run the cells

Usage

Each notebook serves a specific purpose in the image annotation pipeline:

  1. Start with Image_Annotation_Testing_Satyam.ipynb to understand model capabilities and test basic annotation functions
  2. Use Moondream_Segmentation_Satyam.ipynb for detailed object segmentation tasks and mask generation
  3. Apply Moondream3_to_COCO_Satyam.ipynb to standardize your annotations for downstream ML model training

Key Functionalities Tested

  • Image captioning and description
  • Object detection and localization
  • Instance and semantic segmentation
  • Multi-modal reasoning
  • Annotation format conversion
  • Precision of boundary detection
  • Integration of visual and linguistic understanding

Dependencies

  • Moondream - Efficient vision-language model
  • Segment Anything (SAM) - Advanced segmentation model
  • Grounding DINO - Open-set object detection
  • PyTorch - Deep learning framework
  • OpenCV - Computer vision operations
  • COCO API - Annotation format handling
  • Transformers - Hugging Face library for model processing
  • Supervision - Utilities for computer vision workflows
  • Accelerate - Optimization library for PyTorch models
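
For the COCO API entry above, a converted annotation file can be sanity-checked by loading it with pycocotools (pip install pycocotools); the file name below matches the placeholder used in the earlier sketch rather than anything prescribed by the notebooks.

from pycocotools.coco import COCO

# Load the converted annotations and report basic counts as a quick validation step.
coco = COCO("annotations_coco.json")
print("images:", len(coco.getImgIds()))
print("categories:", [c["name"] for c in coco.loadCats(coco.getCatIds())])
print("annotations:", len(coco.getAnnIds()))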

Notes

  • VRAM requirements vary by model: SAM 3 and Grounding DINO (2 GB), Moondream 2 (3.8 GB), Moondream 3 (Quantized INT4) (6 GB)
  • For optimal performance, ensure your GPU meets or exceeds the VRAM requirements for your selected model
  • Models may require internet connectivity for initial downloads from the Hugging Face Hub
  • Results may vary depending on the complexity and quality of input images
  • Preprocessing steps may be necessary for optimal model performance

Author

Satyam Rastogi - Image Annotation and Segmentation Project