Siamese Networks

What Is a Siamese Network?

A Siamese Network is an artificial intelligence model featuring two or more identical sub-networks that share the same weights and architecture. Its primary purpose is not to classify inputs, but to learn a similarity function. By processing two different inputs simultaneously, it determines how similar or different they are.

How Siamese Networks Work

Input A -----> [Identical Network 1] -----> Vector A
                    (Shared Weights)           |
                                            [Distance] --> Similarity Score
                    (Shared Weights)           |
Input B -----> [Identical Network 2] -----> Vector B

Siamese networks function by processing two distinct inputs through identical neural network structures, often called “twin” networks. This architecture is designed to learn the relationship between pairs of data points rather than classifying a single input. The process ensures that similar inputs are mapped to nearby points in a feature space, while dissimilar inputs are mapped far apart.

Input and Twin Networks

The process begins with two input data points, such as two images, text snippets, or signatures. Each input is fed into one of the two identical subnetworks. Crucially, these subnetworks share the exact same architecture, parameters, and weights. This weight-sharing mechanism is fundamental; it guarantees that both inputs are processed in precisely the same manner, generating comparable output vectors, also known as embeddings.
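
A minimal sketch of this weight-sharing idea in Keras (the layer sizes and input dimension below are illustrative, not taken from this article): a single encoder model is created once and then called on both inputs, so every parameter is reused for both.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# One encoder instance: its weights are created once and shared by both branches.
encoder = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(32)  # 32-dimensional embedding (example size)
])

input_a = keras.Input(shape=(100,))
input_b = keras.Input(shape=(100,))

# Calling the same encoder object on both inputs reuses the same weights,
# so both embeddings come from an identical feature extraction process.
embedding_a = encoder(input_a)
embedding_b = encoder(input_b)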

Feature Vector Generation

As each input passes through its respective subnetwork (which could be a Convolutional Neural Network for images or a Recurrent Neural Network for sequences), the network extracts a set of meaningful features. These features are condensed into a fixed-length vector, or “embedding.” This embedding is a numerical representation that captures the essential characteristics of the input. The goal of training is to refine this embedding space.

Similarity Comparison

Once the two embeddings are generated, they are fed into a distance metric function to calculate their similarity. Common distance metrics include Euclidean distance or cosine similarity. This function outputs a score that quantifies how close the two embeddings are. During training, a loss function, such as contrastive loss or triplet loss, is used to adjust the network’s weights. The loss function penalizes the network for placing similar pairs far apart and dissimilar pairs close together, thereby teaching the model to produce effective similarity scores.
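
As an illustration, both common comparison metrics can be computed directly from a pair of embeddings; the vectors below are arbitrary example values.

import numpy as np

emb_a = np.array([0.2, 0.9, 0.4])   # embedding of input A (example values)
emb_b = np.array([0.25, 0.8, 0.5])  # embedding of input B (example values)

# Euclidean distance: smaller values mean the inputs are more similar.
euclidean = np.linalg.norm(emb_a - emb_b)

# Cosine similarity: values near 1 mean the embeddings point in the same direction.
cosine = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))

print(euclidean, cosine)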

Explaining the ASCII Diagram

Inputs (A and B)

These represent the pair of data points being compared.

  • Input A: The first data sample (e.g., a reference image).
  • Input B: The second data sample (e.g., an image to be verified).

Identical Networks & Shared Weights

This is the core of the Siamese architecture.

  • [Identical Network 1] and [Identical Network 2]: These are two neural networks with the exact same layers and configuration.
  • (Shared Weights): This indicates that any weight update during training in one network is mirrored in the other. This ensures that a consistent feature extraction process is applied to both inputs.

Feature Vectors (Vector A and Vector B)

These are the outputs of the twin networks.

  • Vector A / Vector B: Numerical representations (embeddings) that capture the essential features of the original inputs. The network learns to create these vectors so that their distance in the vector space corresponds to their semantic similarity.

Distance and Similarity Score

This is the final comparison stage.

  • [Distance]: This module calculates the distance (e.g., Euclidean) between Vector A and Vector B.
  • Similarity Score: The final output, which is a value indicating how similar the original inputs are. A small distance corresponds to a high similarity score, and a large distance corresponds to a low score.

Core Formulas and Applications

Example 1: Euclidean Distance

This formula calculates the straight-line distance between two embedding vectors in the feature space. It is a fundamental component used within loss functions to determine how close or far apart two inputs are after being processed by the network. It’s widely used in the final comparison step.

d(e₁, e₂) = ||e₁ - e₂||₂

Example 2: Contrastive Loss

This loss function is used to train the network. It encourages the model to produce embeddings that are close for similar pairs (y=0) and far apart for dissimilar pairs (y=1). The ‘margin’ (m) parameter enforces a minimum distance for dissimilar pairs, helping to create a well-structured embedding space.

Loss = (1 - y) * (d(e₁, e₂))² + y * max(0, m - d(e₁, e₂))²
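
A direct Python transcription of this formula for a single pair, using the article's convention that y=0 marks a similar pair and y=1 a dissimilar one (the margin value here is only an example):

def contrastive_loss(distance, y, margin=1.0):
    """Contrastive loss for one pair: y=0 -> similar, y=1 -> dissimilar."""
    similar_term = (1 - y) * distance ** 2
    dissimilar_term = y * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term

# A similar pair at distance 0.3 incurs only a small penalty,
# while a dissimilar pair at the same distance is pushed toward the margin.
print(contrastive_loss(0.3, y=0))  # small loss
print(contrastive_loss(0.3, y=1))  # larger loss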

Example 3: Triplet Loss

Triplet loss improves upon contrastive loss by using three inputs: an anchor (a), a positive example (p), and a negative example (n). It pushes the model to ensure the distance between the anchor and the positive is smaller than the distance between the anchor and the negative by at least a certain margin, leading to more robust embeddings.

Loss = max(d(a, p)² - d(a, n)² + margin, 0)

Practical Use Cases for Businesses Using Siamese Networks

  • Signature Verification: Banks and financial institutions use Siamese Networks to verify the authenticity of handwritten signatures on checks and documents by comparing a new signature against a stored, verified sample.
  • Face Recognition for Access Control: Secure facilities and enterprise applications deploy facial recognition systems powered by Siamese Networks to grant access to authorized personnel by matching a live camera feed to a database of employee images.
  • Duplicate Content Detection: Online platforms and content management systems use this technology to find and flag duplicate or near-duplicate articles, images, or product listings, ensuring content quality and originality.
  • Product Recommendation: E-commerce sites can use Siamese Networks to recommend visually similar products to shoppers. By analyzing product images, the network can identify items with similar styles, patterns, or shapes.
  • Patient Record Matching: In healthcare, Siamese Networks can help identify duplicate patient records across different databases by comparing demographic information and clinical notes, even when there are minor variations in the data.

Example 1: Signature Verification

Input_A: Image of customer's reference signature
Input_B: Image of new signature on a check
Network_Output: Similarity_Score

IF Similarity_Score > Verification_Threshold:
  RETURN "Signature Genuine"
ELSE:
  RETURN "Signature Forged"

A financial institution uses this logic to automate check processing, reducing manual review time and fraud.

Example 2: Duplicate Question Detection

Input_A: Embedding of a new user question
Input_B: Embeddings of existing questions in a forum database
Network_Output: List of [Similarity_Score, Existing_Question_ID]

FOR each score in Network_Output:
  IF score > Duplication_Threshold:
    SUGGEST Existing_Question_ID to user

An online Q&A platform uses this to prevent redundant questions and direct users to existing answers.
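
A hedged sketch of this lookup logic, assuming question embeddings have already been produced by a trained Siamese subnetwork; the embeddings, question IDs, and threshold below are placeholders.

import numpy as np

# Pre-computed embeddings for existing forum questions (placeholder values).
existing = {
    "Q101": np.array([0.9, 0.1, 0.3]),
    "Q102": np.array([0.2, 0.8, 0.5]),
}

new_question = np.array([0.88, 0.12, 0.28])  # embedding of the new question
duplication_threshold = 0.95                  # assumed cosine-similarity threshold

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the new question against every stored question and suggest likely duplicates.
for question_id, emb in existing.items():
    score = cosine_similarity(new_question, emb)
    if score > duplication_threshold:
        print(f"Possible duplicate of {question_id} (similarity {score:.2f})")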

🐍 Python Code Examples

This example shows how to define the core components of a Siamese Network in Python using TensorFlow and Keras. We create a base convolutional network, a distance calculation layer, and then instantiate the Siamese model itself. This structure is foundational for tasks like image similarity.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def create_base_network(input_shape):
    """Creates the base convolutional network shared by both inputs."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), activation='relu')(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, (3, 3), activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)
    return keras.Model(inputs, x)

def euclidean_distance(vects):
    """Calculates the Euclidean distance between two vectors."""
    x, y = vects
    sum_square = tf.reduce_sum(tf.square(x - y), axis=1, keepdims=True)
    return tf.sqrt(tf.maximum(sum_square, tf.keras.backend.epsilon()))

# Define input shapes and create the Siamese network
input_shape = (28, 28, 1)
input_a = layers.Input(shape=input_shape)
input_b = layers.Input(shape=input_shape)

base_network = create_base_network(input_shape)
processed_a = base_network(input_a)
processed_b = base_network(input_b)

distance = layers.Lambda(euclidean_distance)([processed_a, processed_b])
model = keras.Model([input_a, input_b], distance)
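
Once trained, the model above can score a pair of images directly. A rough usage sketch, with random placeholder images and an arbitrary distance threshold, could look like this:

import numpy as np

# Two placeholder 28x28 grayscale images standing in for a real pair.
image_a = np.random.rand(1, 28, 28, 1).astype('float32')
image_b = np.random.rand(1, 28, 28, 1).astype('float32')

# The model outputs the Euclidean distance between the two embeddings.
distance = model.predict([image_a, image_b])[0][0]

# A small distance means a similar pair; the threshold would be tuned on validation data.
threshold = 0.5
print("Similar" if distance < threshold else "Dissimilar")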

Here is an implementation of the triplet loss function. This loss is crucial for training a Siamese Network effectively. It takes the anchor, positive, and negative embeddings and calculates a loss that aims to minimize the anchor-positive distance while maximizing the anchor-negative distance.

class TripletLoss(layers.Layer):
    """Calculates the triplet loss."""
    def __init__(self, margin=0.5, **kwargs):
        super().__init__(**kwargs)
        self.margin = margin

    def call(self, anchor, positive, negative):
        # Squared Euclidean distance between the anchor and the positive example.
        ap_distance = tf.reduce_sum(tf.square(anchor - positive), -1)
        # Squared Euclidean distance between the anchor and the negative example.
        an_distance = tf.reduce_sum(tf.square(anchor - negative), -1)
        # Hinge on the margin: the loss is zero once the positive is closer
        # than the negative by at least the margin.
        loss = ap_distance - an_distance
        loss = tf.maximum(loss + self.margin, 0.0)
        return loss
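
To show how this layer might be wired into a full triplet model, here is a hedged sketch that reuses create_base_network and input_shape from the earlier example; the pass-through loss trick and the margin value are illustrative choices, and exact behavior can vary across Keras versions.

# Three inputs: an anchor, a positive (same class), and a negative (different class).
anchor_in = layers.Input(shape=input_shape)
positive_in = layers.Input(shape=input_shape)
negative_in = layers.Input(shape=input_shape)

# A single shared embedding network processes all three inputs.
embedding_net = create_base_network(input_shape)
emb_anchor = embedding_net(anchor_in)
emb_positive = embedding_net(positive_in)
emb_negative = embedding_net(negative_in)

# The layer's output is the per-sample triplet loss itself.
triplet_out = TripletLoss(margin=0.5)(emb_anchor, emb_positive, emb_negative)
triplet_model = keras.Model([anchor_in, positive_in, negative_in], triplet_out)

# Pass-through loss: the model's "prediction" is already the value to minimize.
triplet_model.compile(optimizer='adam', loss=lambda y_true, y_pred: y_pred)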

Types of Siamese Networks

  • Convolutional Siamese Networks: These networks use convolutional neural networks (CNNs) as their identical subnetworks. They are highly effective for image-based tasks like facial recognition or signature verification, as CNNs excel at extracting hierarchical features from visual data.
  • Triplet Networks: A variation that uses three inputs: an anchor, a positive (similar to the anchor), and a negative (dissimilar). Instead of simple pairwise comparison, it learns by minimizing the distance between the anchor and positive while maximizing the distance to the negative, often leading to more robust embeddings.
  • Pseudo-Siamese Networks: In this architecture, the twin subnetworks do not share weights. This is useful when the inputs are from different modalities or have inherently different structures (e.g., comparing an image to a text description) where identical processing pathways would be ineffective.
  • Masked Siamese Networks: This is an advanced type used for self-supervised learning, particularly with images. It works by masking parts of an input image and training the network to predict the representation of the original, unmasked image, helping it learn robust features without labeled data.

Comparison with Other Algorithms

Small Datasets and One-Shot Learning

Compared to traditional classification algorithms like a standard Convolutional Neural Network (CNN), Siamese Networks excel in scenarios with very little data per class. A traditional CNN requires many examples of each class to learn effectively. In contrast, a Siamese Network can learn to differentiate between classes with just one or a few examples (one-shot learning), making it superior for tasks like face verification where new individuals are frequently added.

Large Datasets and Scalability

When dealing with large, static datasets with a fixed number of classes, a traditional classification model is often more efficient. Siamese Networks require comparing input pairs, which can become computationally expensive as the number of items grows (quadratic complexity). However, for similarity search in large databases, a pre-trained Siamese Network can be very powerful. By pre-computing embeddings for all items in the database, it can find the most similar items to a new query quickly, outperforming methods that require pairwise comparisons at runtime.

Dynamic Updates and Flexibility

Siamese Networks are inherently more flexible than traditional classifiers when new classes are introduced. Adding a new class to a standard CNN requires retraining the entire model, including the final classification layer. With a Siamese Network, a new class can be added without any retraining. The network has learned a general similarity function, so it can compute embeddings for the new class examples and compare them against others immediately.

Real-Time Processing and Memory

For real-time applications, the performance of a Siamese Network depends on the implementation. If embeddings for a gallery of items can be pre-computed and stored, similarity search can be extremely fast. The memory usage is dependent on the dimensionality of the embedding vectors and the number of items stored. In contrast, some algorithms may require loading larger models or more data into memory at inference time, making Siamese networks a good choice for efficient, real-time verification tasks.

⚠️ Limitations & Drawbacks

While powerful for similarity tasks, Siamese Networks are not universally applicable and come with specific limitations. Their performance and efficiency can be a bottleneck in certain scenarios, and they are not designed to provide the same kind of output as traditional classification models.

  • Computationally Intensive Training: Training requires processing pairs or triplets of data, which leads to a number of combinations that can grow quadratically, making training significantly slower and more resource-intensive than standard classification.
  • No Probabilistic Output: The network outputs a distance or similarity score, not a class probability. This makes it less suitable for tasks where confidence scores for multiple predefined classes are needed.
  • Sensitivity to Pair/Triplet Selection: The model’s performance is highly dependent on the strategy used for selecting pairs or triplets during training. Poor sampling can lead to slow convergence or a suboptimal embedding space.
  • Large Dataset Requirement for Generalization: While it excels at one-shot learning after training, the initial training phase requires a large and diverse dataset to learn a robust and generalizable similarity function.
  • Defining the Margin is Tricky: For loss functions like contrastive or triplet loss, setting the margin hyperparameter is a non-trivial task that requires careful tuning to achieve optimal separation in the embedding space.

Given these drawbacks, hybrid strategies or alternative algorithms may be more suitable for standard classification tasks or when computational resources for training are limited.

❓ Frequently Asked Questions

How are Siamese Networks different from traditional CNNs?

A traditional Convolutional Neural Network (CNN) learns to map an input (like an image) to a single class label (e.g., “cat” or “dog”). A Siamese Network, in contrast, uses two identical CNNs to process two different inputs and outputs a similarity score between them. It learns relationships, not categories.

Why is weight sharing so important in a Siamese Network?

Weight sharing is the defining feature of a Siamese Network. It ensures that both inputs are processed through the exact same feature extraction pipeline. If the networks had different weights, they would create different, non-comparable embeddings, making it impossible to meaningfully measure the distance or similarity between them.

What is “one-shot” learning and how do Siamese Networks enable it?

One-shot learning is the ability to correctly identify a new class after seeing only a single example of it. Siamese Networks enable this because they learn a general function for similarity. Once trained, you can present the network with an image from a new, unseen class and it can compare it to other images to find a match, without needing to be retrained on that new class.

What is the difference between contrastive loss and triplet loss?

Contrastive loss works with pairs of inputs (either similar or dissimilar) and aims to pull similar pairs together and push dissimilar pairs apart. Triplet loss is often more effective; it uses three inputs (an anchor, a positive, and a negative) and learns to ensure the anchor-positive distance is smaller than the anchor-negative distance by a set margin, which creates a more structured embedding space.

Can Siamese Networks be used for tasks other than image comparison?

Yes, absolutely. While commonly used for images (face recognition, signature verification), the same architecture can be applied to other data types. For example, they can compare text snippets for semantic similarity, audio clips for speaker verification, or even molecular structures in scientific research. The underlying principle of learning a similarity metric is domain-agnostic.

🧾 Summary

Siamese Networks are a unique neural network architecture designed for learning similarity. Comprising two or more identical subnetworks with shared weights, they process two inputs to produce comparable feature vectors. Rather than classifying inputs, their purpose is to determine how alike or different two items are, making them ideal for verification tasks like facial recognition, signature analysis, and duplicate detection.