Image Annotation and Segmentation Project
This repository contains Jupyter notebooks for comprehensive image annotation using state-of-the-art vision-language models. The project encompasses image understanding, object segmentation, and annotation format conversion to facilitate computer vision model development.
Models Used
This project utilizes the Moondream series of vision-language models, which are compact yet powerful models designed for image understanding and description. These models combine transformer architectures with vision encoders to provide detailed analysis of image content. Moondream models are particularly efficient for edge deployment while maintaining high accuracy in image comprehension tasks.
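For orientation, the sketch below shows one way to load a Moondream checkpoint through Hugging Face Transformers and request a caption and an answer to a free-form question. The `vikhyatk/moondream2` model ID and the `caption`/`query` methods exposed through `trust_remote_code` reflect the public Moondream 2 model card at the time of writing; treat the exact method names and return types as assumptions that may differ between revisions and from the code in the notebooks.

```python
# Minimal sketch (not the notebooks' exact code): load Moondream 2 and query an image.
# Assumes the "vikhyatk/moondream2" checkpoint and its remote-code caption/query API;
# check the model card of the revision you download for the exact interface.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,                 # Moondream ships its own modeling code
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device).eval()

image = Image.open("sample.jpg")            # placeholder image path
print(model.caption(image))                 # short scene description
print(model.query(image, "What objects are visible in this image?"))
```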
VRAM Requirements
- SAM 3 and Grounding DINO: 2 GB VRAM
- Moondream 2: 3.8 GB VRAM
- Moondream 3 (Quantized INT4): 6 GB VRAM
Notebooks
1. Image_Annotation_Testing_Satyam.ipynb
This notebook provides comprehensive testing and evaluation of image annotation capabilities using Moondream vision-language models. It includes various experiments to assess model performance, accuracy of image understanding, and annotation quality. The notebook tests functionalities such as caption generation, object identification, scene description, and multi-modal reasoning. It also evaluates the model's capability to accurately detect and describe objects, people, and contexts within images.
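To give a flavor of the localization experiments, the hedged sketch below asks Moondream to detect a named object and converts the assumed normalized box coordinates into pixels. The `detect` method and its `objects`/`x_min`-style return fields are assumptions based on the public Moondream 2 remote-code interface, not necessarily what this notebook calls.

```python
# Illustrative sketch: object localization with Moondream (field names are assumptions).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", trust_remote_code=True, torch_dtype=torch.float16,
).to("cuda").eval()

image = Image.open("street_scene.jpg")      # placeholder image path
width, height = image.size

result = model.detect(image, "car")         # assumed API: returns normalized boxes
for obj in result.get("objects", []):
    box = (
        int(obj["x_min"] * width), int(obj["y_min"] * height),
        int(obj["x_max"] * width), int(obj["y_max"] * height),
    )
    print("car at", box)
```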
2. Moondream_Segmentation_Satyam.ipynb
This notebook implements advanced segmentation capabilities with the Moondream vision-language model, focusing on object detection and precise boundary generation. It performs pixel-level segmentation of objects within images, creating accurate masks for the different entities in a scene. The notebook exercises instance segmentation, semantic segmentation, object boundary precision, and the integration of segmentation with textual descriptions for comprehensive image understanding, demonstrating how vision-language models can combine spatial understanding with contextual knowledge.
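The notebook's own approach is described above; as one illustrative pattern for turning a detector box into a pixel-accurate mask, the sketch below prompts SAM with a bounding box using the original segment-anything package. The SAM variant, checkpoint path, and prompting code here are assumptions for illustration, not necessarily what the notebook does.

```python
# Hedged sketch: box-prompted SAM mask generation (segment-anything package API;
# checkpoint path and box coordinates are placeholders).
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # local checkpoint
predictor = SamPredictor(sam)

image_np = np.array(Image.open("street_scene.jpg").convert("RGB"))
predictor.set_image(image_np)

# x_min, y_min, x_max, y_max in pixels, e.g. taken from a Moondream detection
box = np.array([120, 80, 420, 360])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
mask = masks[0]                              # boolean HxW mask for the object
print("mask pixels:", int(mask.sum()))
```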
3. Moondream3_to_COCO_Satyam.ipynb
This notebook handles the conversion of segmentation annotations to the COCO (Common Objects in Context) format, providing compatibility with mainstream computer vision frameworks. It transforms segmented objects into standardized JSON annotations with bounding boxes, segmentation masks, and category labels. The functionality includes conversion validation, format standardization, and preparation of datasets for training object detection and segmentation models. This enables seamless integration with popular frameworks like Detectron2, MMDetection, and other training pipelines.
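As a minimal sketch of the conversion this notebook performs, the snippet below turns a binary mask into a COCO-style annotation (polygon segmentation, bounding box, area) and writes a small COCO JSON file. Category IDs, image metadata, and file names are placeholders; the notebook's actual mapping and validation logic are more complete.

```python
# Minimal mask -> COCO conversion sketch (placeholder IDs, sizes, and file names).
import json
import cv2
import numpy as np

def mask_to_coco_annotation(mask, image_id, category_id, ann_id):
    """Convert a boolean HxW mask into a COCO-style annotation dict."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    segmentation = [c.flatten().tolist() for c in contours if len(c) >= 3]
    ys, xs = np.where(mask)
    x_min, y_min, x_max, y_max = xs.min(), ys.min(), xs.max(), ys.max()
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": segmentation,
        "bbox": [int(x_min), int(y_min), int(x_max - x_min), int(y_max - y_min)],
        "area": int(mask.sum()),
        "iscrowd": 0,
    }

# Placeholder mask standing in for a model-generated one (e.g. from SAM).
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 150:400] = True

coco = {
    "images": [{"id": 1, "file_name": "street_scene.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [mask_to_coco_annotation(mask, image_id=1, category_id=1, ann_id=1)],
}
with open("annotations_coco.json", "w") as f:
    json.dump(coco, f)
```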
Prerequisites
To run these notebooks, you'll need:
- Python 3.8+
- Jupyter Notebook or JupyterLab
- PyTorch >= 1.10
- Transformers
- Pillow
- NumPy
- OpenCV-Python
- Moondream model dependencies
- Matplotlib
- Scikit-image
Setup
- Clone or download this repository
- Install required dependencies:
pip install torch torchvision
pip install transformers pillow numpy opencv-python matplotlib scikit-image
pip install moondream
- Launch Jupyter:
jupyter notebook
- Open any of the notebooks and run the cells
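Before loading any model, it can help to confirm that the local GPU meets the VRAM figures listed above; the small check below uses plain PyTorch and makes no project-specific assumptions.

```python
# Quick sanity check: report GPU availability and total VRAM.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, total VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU detected; models will fall back to CPU and run much slower.")
```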
Usage
Each notebook serves a specific purpose in the image annotation pipeline:
- Start with Image_Annotation_Testing_Satyam.ipynb to understand model capabilities and test basic annotation functions
- Use Moondream_Segmentation_Satyam.ipynb for detailed object segmentation tasks and mask generation
- Apply Moondream3_to_COCO_Satyam.ipynb to standardize your annotations for downstream ML model training
Key Functionalities Tested
- Image captioning and description
- Object detection and localization
- Instance and semantic segmentation
- Multi-modal reasoning
- Annotation format conversion
- Precision of boundary detection
- Integration of visual and linguistic understanding
Dependencies
- Moondream - Efficient vision-language model
- PyTorch - Deep learning framework
- OpenCV - Computer vision operations
- COCO API - Annotation format handling
- Transformers - Hugging Face library for model processing
Notes
- VRAM requirements vary by model: SAM 3 and Grounding DINO (2 GB), Moondream 2 (3.8 GB), Moondream 3 (quantized INT4) (6 GB)
- For optimal performance, ensure your GPU meets or exceeds the VRAM requirements for your selected model
- Models may require internet connectivity for their initial download from the Hugging Face Hub
- Results may vary depending on the complexity and quality of input images
- Preprocessing steps may be necessary for optimal model performance
Author
Satyam Rastogi - Image Annotation and Segmentation Project