# Image Annotation and Segmentation Project
This repository contains Jupyter notebooks for comprehensive image annotation using state-of-the-art vision-language models. The project covers image understanding, object segmentation (with Moondream returning normalized SVG path strings and bounding boxes), and conversion of annotations from Moondream's native format to the widely used COCO format to support computer vision model development.
## Models Used
This project utilizes the Moondream series of vision-language models, which are compact yet powerful models designed for image understanding and description. These models combine transformer architectures with vision encoders to provide detailed analysis of image content. Moondream models are particularly efficient for edge deployment while maintaining high accuracy in image comprehension tasks.
### VRAM Requirements
- SAM 3 and Grounding DINO: 2 GB VRAM
- Moondream 2: 3.8 GB VRAM
- Moondream 3 (Quantized INT4): 6 GB VRAM
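
Before loading a model, it can help to confirm how much VRAM the local GPU exposes. The snippet below is a minimal sketch: it reads the total GPU memory reported by PyTorch and shows how a 4-bit quantization config can be passed through the standard transformers/bitsandbytes interface. The `vikhyatk/moondream2` repo id and whether a particular Moondream checkpoint accepts `quantization_config` are assumptions, not something guaranteed by these requirements.

```python
# Sketch: check total GPU memory, then request 4-bit quantized loading to stay within it.
# The repo id and quantization support for a specific Moondream checkpoint are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total VRAM: {total_gib:.1f} GiB")
else:
    print("No CUDA device found; models will run on CPU (much slower).")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load weights in 4-bit to reduce VRAM usage
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",   # assumed Hugging Face repo id; swap in the model you plan to use
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```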
## Notebooks
### 1. Image_Annotation_Testing_Satyam.ipynb
This notebook provides comprehensive testing and evaluation of image annotation capabilities using Moondream vision-language models. It includes various experiments to assess model performance, accuracy of image understanding, and annotation quality. The notebook tests functionalities such as caption generation, object identification, scene description, and multi-modal reasoning. It also evaluates the model's capability to accurately detect and describe objects, people, and contexts within images.
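
The cells in this notebook revolve around a handful of high-level calls. The sketch below is hedged: the `moondream` Python package listed under Prerequisites exposes caption/query/detect style methods, but the exact method names, arguments, and return structures can differ between package versions and between local and hosted models.

```python
# Hedged sketch of basic annotation calls with the `moondream` package.
# The model path is a placeholder, and the exact API surface may differ by version.
import moondream as md
from PIL import Image

model = md.vl(model="path/to/moondream-model-file")  # or md.vl(api_key="...") for the hosted API

image = Image.open("example.jpg")

print(model.caption(image))                                      # scene description
print(model.query(image, "How many people are in this image?"))  # visual question answering
print(model.detect(image, "person"))                             # detection of a named object class
```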
### 2. Moondream_Segmentation_Satyam.ipynb
This notebook implements advanced segmentation using the Moondream vision-language model, with a focus on object detection and precise boundary generation. It segments objects within images and generates a normalized SVG path string plus a bounding box for each segmented object. The path encodes the object's mask as an SVG `<path d="...">`, with coordinates in the range 0–1 relative to the bounding box rather than the full image. The notebook tests instance segmentation, semantic segmentation, object boundary precision, and the integration of segmentation with textual descriptions for comprehensive image understanding. It demonstrates how vision-language models can combine spatial understanding with contextual knowledge.
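
As a rough sketch of how such an output can be turned into pixel coordinates, the helper below parses a simple SVG path (M/L commands only) and denormalizes the 0–1 coordinates against the object's bounding box. The field names (`path`, `bbox`) are illustrative, not Moondream's exact output schema.

```python
# Sketch: convert a normalized SVG path (0-1, relative to the bounding box)
# into absolute pixel coordinates. Field names below are illustrative only.
import re

def svg_path_to_polygon(path_d, bbox):
    """Convert a normalized SVG path string to absolute pixel coordinates.

    bbox is (x_min, y_min, x_max, y_max) in image pixels; path coordinates
    are assumed to lie in 0-1 relative to that box.
    """
    x_min, y_min, x_max, y_max = bbox
    box_w, box_h = x_max - x_min, y_max - y_min
    # Pull out numeric coordinate pairs, ignoring the path command letters (M/L/Z).
    numbers = [float(n) for n in re.findall(r"-?\d*\.?\d+", path_d)]
    points = []
    for nx, ny in zip(numbers[0::2], numbers[1::2]):
        # Denormalize: 0-1 within the box -> absolute image pixels.
        points.append((x_min + nx * box_w, y_min + ny * box_h))
    return points

# Hypothetical example of one segmented object (field names are illustrative):
segment = {"path": "M 0 0 L 1 0 L 1 1 L 0 1 Z", "bbox": (120.0, 80.0, 320.0, 240.0)}
polygon = svg_path_to_polygon(segment["path"], segment["bbox"])
print(polygon)  # [(120.0, 80.0), (320.0, 80.0), (320.0, 240.0), (120.0, 240.0)]
```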
### 3. Moondream3_to_COCO_Satyam.ipynb
This notebook converts Moondream's segmentation annotations to the COCO (Common Objects in Context) format, providing compatibility with mainstream computer vision frameworks. Since Moondream returns normalized SVG path strings with coordinates in the range 0–1 relative to the bounding box (rather than full-image coordinates), this notebook converts those paths into the polygon format required by COCO. It transforms segmented objects into standardized JSON annotations with bounding boxes, segmentation masks, and category labels. The functionality includes conversion validation, format standardization, and preparation of datasets for training object detection and segmentation models. This enables seamless integration with popular frameworks such as Detectron2 and MMDetection, since COCO is far more widely used than Moondream's native output format.
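
A minimal sketch of the core conversion step is shown below: it takes a polygon already denormalized to absolute pixel coordinates (for example via the helper in the previous section) and wraps it in a single COCO-style annotation record. The ids and category values are placeholders; only the field layout (`bbox`, `segmentation`, `area`, `iscrowd`) follows the standard COCO schema.

```python
def polygon_to_coco_annotation(polygon, image_id, category_id, annotation_id):
    """Wrap an absolute-pixel polygon [(x, y), ...] in a COCO-style annotation dict."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    x_min, y_min = min(xs), min(ys)
    width, height = max(xs) - x_min, max(ys) - y_min

    # Polygon area via the shoelace formula (COCO expects an explicit "area" field).
    area = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    area = abs(area) / 2.0

    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x_min, y_min, width, height],                 # COCO bbox: [x, y, width, height]
        "segmentation": [[c for pt in polygon for c in pt]],   # flat [x1, y1, x2, y2, ...]
        "area": area,
        "iscrowd": 0,
    }

# Example with an already-denormalized polygon in image pixels:
polygon = [(120.0, 80.0), (320.0, 80.0), (320.0, 240.0), (120.0, 240.0)]
print(polygon_to_coco_annotation(polygon, image_id=1, category_id=1, annotation_id=1))
```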
## Prerequisites
To run these notebooks, you'll need:
- Python 3.8+
- Jupyter Notebook or JupyterLab
- PyTorch >= 1.10
- Transformers
- Pillow
- NumPy
- OpenCV-Python
- Matplotlib
- Scikit-image
- Moondream (for Moondream models)
- SAM (Segment Anything) and Grounding DINO (for segmentation tasks)
- Hugging Face Accelerate (for optimized inference)
- Requests
- json (Python standard library, used extensively)
- tqdm (for progress bars)
- BitsAndBytes (for quantized models)
- SentencePiece (for tokenization, if needed)
## Setup
1. Clone or download this repository
2. Install required dependencies:
```bash
pip install torch torchvision torchaudio
pip install transformers pillow numpy opencv-python matplotlib scikit-image
pip install moondream jupyter
pip install accelerate bitsandbytes sentencepiece
pip install supervision # For visualization and annotations
pip install segment-anything # For SAM model
pip install groundingdino-py # For Grounding DINO
pip install huggingface_hub
```
3. Launch Jupyter:
```bash
jupyter notebook
```
4. Open any of the notebooks and run the cells
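
After installation, a quick check in a notebook cell confirms that PyTorch can see your GPU:

```python
# Verify the installation and GPU visibility.
import torch
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```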
## Usage
Each notebook serves a specific purpose in the image annotation pipeline:
1. Start with `Image_Annotation_Testing_Satyam.ipynb` to understand model capabilities and test basic annotation functions
2. Use `Moondream_Segmentation_Satyam.ipynb` for detailed object segmentation tasks and mask generation
3. Apply `Moondream3_to_COCO_Satyam.ipynb` to standardize your annotations for downstream ML model training
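
Once the COCO JSON has been exported, it can be sanity-checked with `pycocotools` (the COCO API listed under Dependencies; install it separately with `pip install pycocotools` if needed). The filename below is a placeholder for your exported annotation file.

```python
# Load and inspect a converted COCO annotation file.
from pycocotools.coco import COCO

coco = COCO("annotations/moondream_coco.json")  # placeholder path to the converted file
print("images:", len(coco.getImgIds()))
print("categories:", [c["name"] for c in coco.loadCats(coco.getCatIds())])
print("annotations:", len(coco.getAnnIds()))
```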
## Key Functionalities Tested
- Image captioning and description
- Object detection and localization
- Instance and semantic segmentation
- Multi-modal reasoning
- Annotation format conversion
- Precision of boundary detection
- Integration of visual and linguistic understanding
## Dependencies
- [Moondream](https://github.com/vikhyat/moondream) - Efficient vision-language model
- [Segment Anything (SAM)](https://segment-anything.com/) - Advanced segmentation model
- [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) - Open-set object detection
- PyTorch - Deep learning framework
- OpenCV - Computer vision operations
- COCO API - Annotation format handling
- Transformers - Hugging Face library for model processing
- Supervision - Utilities for computer vision workflows
- Accelerate - Optimization library for PyTorch models
## Notes
- VRAM requirements vary by model: SAM 3 and Grounding DINO (2 GB), Moondream 2 (3.8 GB), Moondream 3 (Quantized INT4) (6 GB)
- For optimal performance, ensure your GPU meets or exceeds the VRAM requirements for your selected model
- Models may require internet connectivity for initial downloads from HuggingFace Hub
- Results may vary depending on the complexity and quality of input images
- Preprocessing steps may be necessary for optimal model performance
## Author
Satyam Rastogi - Image Annotation and Segmentation Project