Image embedding systems face different challenges than text: preprocessing requirements, internal patch-based processing, handling large images, and extracting regions of interest. This chapter covers how modern vision models create embeddings, practical preprocessing strategies, approaches for large-scale imagery (satellite, medical), and techniques for multi-object scenes. You’ll learn to prepare images for optimal embedding quality across diverse visual domains.
The previous chapter explored how text documents are chunked into semantic units for embedding. Images present a parallel but distinct challenge: while text chunking is primarily a user decision, image “chunking” often happens inside the model itself. However, image preparation decisions—preprocessing, cropping, tiling, and region extraction—significantly impact embedding quality. Understanding these choices is essential for building effective visual search and multi-modal systems.
25.1 How Image Embedding Models Work
Before diving into preparation strategies, let’s understand what happens inside modern image embedding models.
25.1.1 From Pixels to Vectors
Image embedding models transform raw pixels into dense vector representations:
```
Input:  RGB image (224 × 224 × 3 = 150,528 values)
            ↓
    Image embedding model
            ↓
Output: embedding vector (768 or 1024 dimensions)
```

Compression ratio: ~150x to ~200x
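In practice, producing such a vector is a short call to a pretrained encoder. A minimal sketch, assuming the sentence-transformers package with its CLIP checkpoint `clip-ViT-B-32` (which yields 512-dimensional image embeddings) and a hypothetical `photo.jpg`:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# Any pretrained image encoder works; CLIP via sentence-transformers is one option
model = SentenceTransformer("clip-ViT-B-32")
image = Image.open("photo.jpg")      # hypothetical input file
embedding = model.encode(image)      # 1-D numpy array, 512 dims for this checkpoint
print(embedding.shape)
```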
Unlike text, where chunking is explicit, image models handle spatial “chunking” internally through their architecture. In a Vision Transformer (ViT), for example, the image is split into fixed-size patches, and a special [CLS] token aggregates information from all patches into the final embedding:

```
224×224 image → 196 patches (14×14 grid of 16×16 patches)
        ↓
Each patch → 768-dim token (linear projection)
        ↓
[CLS] + 196 patch tokens + position embeddings
        ↓
Transformer layers (self-attention)
        ↓
[CLS] token output → embedding (768-dim for ViT-Base, 1024-dim for ViT-Large)
```
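A quick sanity check of the patch arithmetic above (ViT-Base/16 at a 224×224 input):

```python
# Patch-count arithmetic for ViT-Base/16 at 224×224 input
image_size, patch_size = 224, 16
grid = image_size // patch_size            # 14 patches per side
num_patches = grid * grid                  # 196 patches
raw_values = image_size * image_size * 3   # 150,528 raw pixel values
print(grid, num_patches, raw_values)       # 14 196 150528
```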
25.1.4 The Key Insight: Internal vs External Chunking
Text vs image chunking comparison

| Aspect | Text Embeddings | Image Embeddings |
|---|---|---|
| User chunking | Required (documents → chunks) | Optional (whole images often work) |
| Model chunking | Tokenization (subwords) | Patches (ViT) or receptive fields (CNN) |
| Semantic units | Sentences, paragraphs | Objects, regions, scenes |
| Boundary decisions | Made during preprocessing | Made by model architecture |
For images, the model handles spatial decomposition. Your preparation decisions focus on: input quality, scale, cropping, and whether to embed whole images or extracted regions.
25.2 Preprocessing for Optimal Embeddings
Image preprocessing significantly impacts embedding quality. Each model expects specific input formats.
25.2.1 Standard Preprocessing Pipeline
```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np
from PIL import Image


@dataclass
class PreprocessConfig:
    """Configuration for image preprocessing."""
    target_size: Tuple[int, int] = (224, 224)
    resize_method: str = "resize"  # 'resize', 'crop', or 'pad'
    normalize: bool = True
    mean: Tuple[float, ...] = (0.485, 0.456, 0.406)   # ImageNet mean
    std: Tuple[float, ...] = (0.229, 0.224, 0.225)    # ImageNet std


class ImagePreprocessor:
    """Standard preprocessing pipeline for image embeddings."""

    def __init__(self, config: PreprocessConfig = None):
        self.config = config or PreprocessConfig()

    def preprocess(self, image) -> np.ndarray:
        """Preprocess a single image."""
        if isinstance(image, np.ndarray):
            image = Image.fromarray(image)

        # Resize according to the configured strategy
        if self.config.resize_method == "resize":
            image = image.resize(self.config.target_size)
        elif self.config.resize_method == "crop":
            image = self._center_crop(image)
        elif self.config.resize_method == "pad":
            image = self._resize_with_pad(image)

        # Convert to numpy and scale to [0, 1]
        img_array = np.array(image, dtype=np.float32)
        if img_array.max() > 1.0:
            img_array = img_array / 255.0

        # Normalize with per-channel mean/std
        if self.config.normalize:
            mean = np.array(self.config.mean)
            std = np.array(self.config.std)
            img_array = (img_array - mean) / std

        return img_array

    def _center_crop(self, image: Image.Image) -> Image.Image:
        """Resize (preserving aspect ratio) then center crop to target size."""
        w, h = image.size
        target_w, target_h = self.config.target_size
        scale = max(target_w / w, target_h / h)
        new_w, new_h = int(w * scale), int(h * scale)
        image = image.resize((new_w, new_h))
        left = (new_w - target_w) // 2
        top = (new_h - target_h) // 2
        return image.crop((left, top, left + target_w, top + target_h))

    def _resize_with_pad(self, image: Image.Image) -> Image.Image:
        """Resize preserving aspect ratio, then pad to target size."""
        w, h = image.size
        target_w, target_h = self.config.target_size
        scale = min(target_w / w, target_h / h)
        new_w, new_h = int(w * scale), int(h * scale)
        image = image.resize((new_w, new_h))
        padded = Image.new("RGB", self.config.target_size, (128, 128, 128))
        left = (target_w - new_w) // 2
        top = (target_h - new_h) // 2
        padded.paste(image, (left, top))
        return padded


# Usage example
preprocessor = ImagePreprocessor()
print("ImagePreprocessor ready with ImageNet normalization")
```
ImagePreprocessor ready with ImageNet normalization
25.2.2 Resolution and Aspect Ratio
Most models expect fixed input sizes (224×224, 384×384, etc.). How you achieve this matters:

- Resize (stretch): fills the target size exactly but distorts the aspect ratio.
- Center crop: preserves the aspect ratio but discards content near the edges.
- Pad: preserves the aspect ratio and all content, at the cost of uninformative border pixels.

The sketch below contrasts the three strategies using the ImagePreprocessor from 25.2.1.
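A quick comparison, assuming the PreprocessConfig and ImagePreprocessor classes from 25.2.1 are in scope (the solid-color image here is only a stand-in for any non-square photo):

```python
from PIL import Image

# Stand-in for a real 640×480 photo; any non-square image shows the differences
image = Image.new("RGB", (640, 480), (200, 120, 80))

for method in ("resize", "crop", "pad"):
    config = PreprocessConfig(resize_method=method)
    processed = ImagePreprocessor(config).preprocess(image)
    print(method, processed.shape)   # all three yield (224, 224, 3)
# resize: stretches to fit; crop: trims the wider dimension; pad: adds gray borders
```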
25.4.1 Detection-Based Regions
Run an object detector (e.g., YOLO or Faster R-CNN), crop each detected object, and embed each crop separately, as in the sketch below.
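A minimal sketch of the detect-then-embed pattern. The detector and encoder interfaces (a detect() method returning objects with bbox and class_name attributes, and an encode() method returning a vector) are assumptions, mirroring the interfaces used later in 25.5:

```python
import numpy as np

def embed_detected_objects(image: np.ndarray, detector, encoder):
    """Detect objects, crop each detection, and embed the crops."""
    results = []
    for obj in detector.detect(image):      # assumed detector API
        x1, y1, x2, y2 = obj.bbox           # pixel coordinates
        crop = image[y1:y2, x1:x2]
        if crop.size == 0:                  # skip degenerate boxes
            continue
        results.append((obj.class_name, encoder.encode(crop)))
    return results
```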
25.4.2 Segmentation-Based Regions
Use semantic or instance segmentation for precise region extraction:
```python
import numpy as np

def embed_segmented_regions(image, segmentation_mask, encoder):
    """Extract embeddings from segmented regions."""
    embeddings = []
    for segment_id in np.unique(segmentation_mask)[1:]:  # Skip background (0)
        mask = (segmentation_mask == segment_id)
        # Get the bounding box of the segment
        rows, cols = np.where(mask)
        if len(rows) == 0:
            continue
        y1, y2, x1, x2 = rows.min(), rows.max(), cols.min(), cols.max()
        # Extract the region; copy so masking doesn't modify the original image
        region = image[y1:y2+1, x1:x2+1].copy()
        region_mask = mask[y1:y2+1, x1:x2+1]
        region[~region_mask] = 255  # White background outside the segment
        # Embed the masked region
        embedding = encoder.encode(region)
        embeddings.append((segment_id, embedding))
    return embeddings

# Usage example
print("Segmentation: semantic/instance seg -> extract regions -> embed with masked background")
```
Segmentation: semantic/instance seg -> extract regions -> embed with masked background
25.4.3 Attention-Guided Regions
Use model attention to identify important regions:
```python
import numpy as np

def extract_attention_regions(image, model, top_k=5):
    """Use model attention to identify important regions."""
    # Get an attention map from the model (e.g., ViT attention, GradCAM).
    # Assumed to be at image resolution; if it is at patch resolution,
    # scale the (y, x) coordinates up to pixel space first.
    attention_map = model.get_attention(image)
    # Find the top-k locations with the highest attention
    flat_idx = np.argsort(attention_map.flatten())[-top_k:]
    regions = []
    for idx in flat_idx:
        y, x = np.unravel_index(idx, attention_map.shape)
        # Extract a 112×112 window around the attention peak
        region = image[max(0, y - 56):y + 56, max(0, x - 56):x + 56]
        regions.append(region)
    return regions

# Usage example
print("Attention-guided: use ViT attention/GradCAM -> extract salient regions -> embed")
```
Attention-guided: use ViT attention/GradCAM -> extract salient regions -> embed
25.5 Multi-Object Scene Handling
Scenes with multiple objects present a choice: one embedding for the whole scene, or separate embeddings per object?
25.5.1 Scene-Level vs Object-Level Embeddings
```python
def scene_level_embedding(image, encoder):
    """Single embedding for the entire scene."""
    return encoder.encode(image)

def object_level_embeddings(image, detector, encoder):
    """Separate embedding for each detected object."""
    objects = detector.detect(image)
    embeddings = []
    for obj in objects:
        x1, y1, x2, y2 = obj.bbox
        cropped = image[y1:y2, x1:x2]
        embeddings.append((obj.class_name, encoder.encode(cropped)))
    return embeddings

# Usage example
print("Scene-level: 1 embedding. Object-level: N embeddings. Hybrid: both")
```
Scene-level: 1 embedding. Object-level: N embeddings. Hybrid: both
25.5.2 Hybrid Approaches
A hybrid pipeline stores both the scene-level embedding and the per-object embeddings:
```python
def hybrid_embedding(image, detector, encoder):
    """Combine scene-level and object-level embeddings."""
    # Scene embedding
    scene_emb = encoder.encode(image)
    # Object embeddings
    objects = detector.detect(image)
    object_embs = []
    for obj in objects:
        x1, y1, x2, y2 = obj.bbox
        cropped = image[y1:y2, x1:x2]
        object_embs.append(encoder.encode(cropped))
    return {"scene": scene_emb, "objects": object_embs, "count": len(object_embs)}

# Usage example
print("Hybrid: store both scene and object embeddings for comprehensive search")
```
Hybrid: store both scene and object embeddings for comprehensive search
Multi-object embedding strategies

| Approach | Storage | Query Types Supported | Best For |
|---|---|---|---|
| Scene-only | 1× | “Show me kitchen scenes” | Scene retrieval |
| Objects-only | N× | “Find red chairs” | Object retrieval |
| Hybrid | (N+1)× | Both scene and object queries | Comprehensive search |
25.6 Augmentation for Training Embeddings
When training or fine-tuning embedding models, augmentation creates diverse views of the same image—essential for contrastive learning.
25.6.1 Standard Augmentation Pipeline
```python
import torchvision.transforms as T

def create_training_augmentation():
    """Standard augmentation for training embedding models."""
    return T.Compose([
        T.RandomResizedCrop(224, scale=(0.8, 1.0)),
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        T.RandomGrayscale(p=0.1),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

def create_test_augmentation():
    """Minimal augmentation for inference/testing."""
    return T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

# Usage example
print("Augmentation: crop, flip, color jitter for training; crop for test")
```
Augmentation: crop, flip, color jitter for training; crop for test
25.6.2 Contrastive Augmentation
Contrastive training (e.g., SimCLR) applies strong augmentations to the same image twice; the two augmented views form a positive pair, as sketched below.
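A sketch of SimCLR-style positive-pair creation, applying one strong augmentation pipeline twice to a single image. The transform names follow torchvision; the exact parameters are illustrative:

```python
import torchvision.transforms as T

# Strong augmentation used for both views (SimCLR-style; parameters illustrative)
contrastive_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def make_positive_pair(image):
    """Two independently augmented views of the same image form a positive pair."""
    return contrastive_transform(image), contrastive_transform(image)
```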
25.6.3 Domain-Specific Augmentation
Different domains tolerate different transformations, so tailor the augmentation pipeline accordingly:
```python
import torchvision.transforms as T

def medical_augmentation():
    """Augmentation for medical images (conservative; preserves anatomy)."""
    return T.Compose([
        T.RandomRotation(15),
        T.RandomAffine(degrees=0, translate=(0.1, 0.1)),
        T.ColorJitter(brightness=0.1, contrast=0.1),
        T.ToTensor(),
    ])

def satellite_augmentation():
    """Augmentation for satellite imagery (orientation is arbitrary)."""
    return T.Compose([
        T.RandomRotation(90),       # Random rotation up to ±90°
        T.RandomHorizontalFlip(),
        T.RandomVerticalFlip(),
        T.ToTensor(),
    ])

# Usage example
print("Domain-specific: tailored augmentations for medical, satellite, etc.")
```
Domain-specific: tailored augmentations for medical, satellite, etc.
25.7 Video Frame Extraction
Videos require selecting which frames to embed:
```python
import cv2

def embed_video_frames(video_path, encoder, sample_rate=30):
    """Extract frames at a fixed interval and embed each."""
    cap = cv2.VideoCapture(video_path)
    frame_embeddings = []
    frame_idx = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_idx % sample_rate == 0:
            # OpenCV decodes frames as BGR; convert to RGB before embedding
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            embedding = encoder.encode(frame_rgb)
            frame_embeddings.append((frame_idx, embedding))
        frame_idx += 1
    cap.release()
    return frame_embeddings

# Usage example
print("Video: sample frames at intervals -> embed each -> temporal indexing")
```
Video: sample frames at intervals -> embed each -> temporal indexing
25.8 Production Image Pipeline
Putting it all together into a production system:
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ImageEmbeddingResult:
    embedding: np.ndarray
    image_id: str
    metadata: dict

class ProductionImagePipeline:
    """Production-ready image embedding pipeline."""

    def __init__(self, encoder, preprocessor, quality_filter=None):
        self.encoder = encoder
        self.preprocessor = preprocessor
        self.quality_filter = quality_filter

    def process(self, image, image_id: str):
        """Process a single image through the full pipeline."""
        # Quality check
        if self.quality_filter and not self.quality_filter.is_valid(image):
            return None
        # Preprocess
        processed = self.preprocessor.preprocess(image)
        # Embed
        embedding = self.encoder.encode(processed)
        # Return with metadata
        return ImageEmbeddingResult(
            embedding=embedding,
            image_id=image_id,
            metadata={"size": image.shape[:2]},
        )

# Usage example
print("Production: quality filter -> preprocess -> embed -> store with metadata")
```
Production: quality filter -> preprocess -> embed -> store with metadata
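The quality_filter above is left abstract. A common heuristic combines a minimum-resolution check with blur detection via the variance of the Laplacian; a sketch assuming OpenCV, a uint8 RGB NumPy image, and illustrative thresholds:

```python
import cv2
import numpy as np

class SimpleQualityFilter:
    """Reject blurry or too-small images before embedding (heuristic thresholds)."""

    def __init__(self, blur_threshold: float = 100.0, min_size: int = 64):
        self.blur_threshold = blur_threshold   # illustrative cutoff
        self.min_size = min_size               # minimum edge length in pixels

    def is_valid(self, image: np.ndarray) -> bool:
        h, w = image.shape[:2]
        if min(h, w) < self.min_size:
            return False
        gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if image.ndim == 3 else image
        # Variance of the Laplacian: low values indicate a blurry image
        return cv2.Laplacian(gray, cv2.CV_64F).var() >= self.blur_threshold
```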
25.9 Quality and Consistency
25.9.1 Embedding Consistency Checks
Before indexing new embeddings, check for near-duplicates and sanity-check the overall distribution:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def check_embedding_consistency(embeddings, threshold=0.95):
    """Check for duplicate or near-duplicate embeddings."""
    similarities = cosine_similarity(embeddings)
    np.fill_diagonal(similarities, 0)   # ignore self-similarity
    duplicates = np.where(similarities > threshold)
    return list(zip(duplicates[0], duplicates[1]))

def validate_embedding_distribution(embeddings):
    """Check whether embeddings have a reasonable distribution."""
    norms = np.linalg.norm(embeddings, axis=1)
    mean_sim = np.mean(cosine_similarity(embeddings))
    return {
        "mean_norm": float(np.mean(norms)),
        "std_norm": float(np.std(norms)),
        "mean_similarity": float(mean_sim),
    }

# Usage example
print("Consistency: check for duplicates, validate distribution, monitor quality")
```
Consistency: check for duplicates, validate distribution, monitor quality
25.9.2 Batch Processing Best Practices
Batch inference amortizes model overhead across many images:
```python
import os
from typing import List

import numpy as np
from PIL import Image

class BatchImageProcessor:
    """Efficient batch processing for large image datasets."""

    def __init__(self, encoder, batch_size=32, num_workers=4):
        self.encoder = encoder
        self.batch_size = batch_size
        self.num_workers = num_workers

    def process_batch(self, images: List):
        """Embed images in batches for efficiency."""
        embeddings = []
        for i in range(0, len(images), self.batch_size):
            batch = images[i:i + self.batch_size]
            batch_embeddings = self.encoder.encode(batch)
            embeddings.append(batch_embeddings)
        return np.vstack(embeddings) if embeddings else np.array([])

    def process_directory(self, image_dir: str):
        """Process all images in a directory, yielding (path, embedding) pairs."""
        images, image_paths = [], []
        for filename in os.listdir(image_dir):
            if filename.endswith(('.jpg', '.png', '.jpeg')):
                path = os.path.join(image_dir, filename)
                images.append(Image.open(path))
                image_paths.append(path)
                if len(images) >= self.batch_size:
                    embeddings = self.process_batch(images)
                    yield list(zip(image_paths, embeddings))
                    images, image_paths = [], []
        if images:
            embeddings = self.process_batch(images)
            yield list(zip(image_paths, embeddings))

# Usage example
print("Batch processing: process images in batches, use DataLoader for efficiency")
```
Batch processing: process images in batches, use DataLoader for efficiency
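For larger datasets, the DataLoader note above applies: wrapping the directory in a PyTorch Dataset lets worker processes load and transform images in parallel while the GPU embeds the previous batch. A sketch, assuming PyTorch and the create_test_augmentation() transform from 25.6.1:

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ImageDirDataset(Dataset):
    """Minimal dataset over a flat directory of images (hypothetical layout)."""

    def __init__(self, image_dir, transform):
        self.paths = [os.path.join(image_dir, f) for f in os.listdir(image_dir)
                      if f.lower().endswith((".jpg", ".jpeg", ".png"))]
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        image = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(image), self.paths[idx]

# loader = DataLoader(ImageDirDataset("images/", create_test_augmentation()),
#                     batch_size=32, num_workers=4)
```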
25.10 Comparing Text and Image Preparation
Text vs image preparation comparison

| Aspect | Text Chunking | Image Preparation |
|---|---|---|
| Primary decision | Chunk boundaries and size | Preprocessing and cropping strategy |
| Model handles | Tokenization | Patch extraction (ViT) or convolution |
| Multi-part content | Split into chunks | Tile large images |
| Object-level | Extract sentences/paragraphs | Detect and crop objects |
| Quality filtering | Language detection, deduplication | Blur detection, resolution checks |
| Metadata | Source, section, page | EXIF, geolocation, timestamp |
| Augmentation use | Rarely for retrieval | Essential for training |
25.11 Key Takeaways
- Image embedding models handle spatial “chunking” internally: Unlike text where you explicitly chunk documents, CNNs use hierarchical convolutions and ViTs use patch extraction—your preparation focuses on input quality and scale
- Preprocessing choices significantly impact embedding quality: Resize strategy (crop vs pad vs stretch), normalization, and color handling should match model expectations and content characteristics
- Large images require tiling with overlap: Satellite imagery, medical scans, and gigapixel images should be split into overlapping tiles, embedded separately, with optional aggregation strategies
- Multi-object scenes offer embedding design choices: Whole-scene embeddings support scene queries, object-level embeddings support object queries, hybrid approaches support both at increased storage cost
- Quality filtering prevents garbage embeddings: Blur detection, resolution checks, and content filtering should precede embedding to avoid polluting your vector database
- Augmentation is essential for training, optional for inference: When training embedding models, augmentation creates diverse views for contrastive learning; for inference, consider multi-crop only for high-value retrieval scenarios
25.12 Looking Ahead
With text and image preparation covered, you’re ready to build complete retrieval systems. The next chapter explores RAG at scale—combining these preparation techniques with efficient retrieval pipelines, context assembly, and LLM integration for production question-answering systems.
25.13 Further Reading
Dosovitskiy, A., et al. (2020). “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv:2010.11929 (ViT)
Radford, A., et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” arXiv:2103.00020 (CLIP)
He, K., et al. (2016). “Deep Residual Learning for Image Recognition.” CVPR (ResNet)
Chen, T., et al. (2020). “A Simple Framework for Contrastive Learning of Visual Representations.” ICML (SimCLR)
Caron, M., et al. (2021). “Emerging Properties in Self-Supervised Vision Transformers.” ICCV (DINO)
Campanella, G., et al. (2019). “Clinical-grade computational pathology using weakly supervised deep learning on whole slide images.” Nature Medicine