Query by Example (QBE)

What is Query by Example (QBE)?

Query by Example (QBE) is a method in artificial intelligence that lets users search a database by providing a sample item instead of a text-based query. The system analyzes the features of the example—such as an image or a document—and retrieves other items with similar characteristics.

How Query by Example (QBE) Works

[User Provides Example] ---> [Feature Extraction Engine] ---> [Vector Representation]
            |                                |                           |
            |                                |                           v
            '--------------------------------'              [Vector Database / Index]
                                                                         |
                                                                         v
                                                        [Similarity Search Algorithm] ---> [Ranked Results] ---> [User]

Query by Example (QBE) works by translating a sample item into a search query to find similar items in a database. Instead of requiring users to formulate complex search commands, QBE allows them to use an example—like an image, audio clip, or document—as the input. The system then identifies and returns items that share similar features or patterns. This approach makes data retrieval more intuitive, especially for non-textual or complex data where describing a query with words would be difficult.

Feature Extraction

The first step in the QBE process is feature extraction. When a user provides an example item, the system uses specialized algorithms, often deep learning models like CNNs for images or transformers for text, to analyze its content and convert its key characteristics into a numerical format. This numerical representation, known as a feature vector or an embedding, captures the essential attributes of the example, such as colors and shapes in an image or semantic meaning in a text.
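
To make this step concrete, the snippet below is a minimal sketch of image feature extraction using a pretrained CNN. It assumes PyTorch and torchvision are installed; the choice of ResNet-18, the 512-dimensional output, and the file name are illustrative, not prescribed by any particular QBE system.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained CNN and replace its classification head with an
# identity layer, so the network outputs a 512-dimensional feature vector
# instead of class labels.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    # Convert an example image into an embedding (feature vector).
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        return model(batch).squeeze(0)      # shape: (512,)

# embedding = extract_features("example.jpg")  # "example.jpg" is a placeholder path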

Indexing and Similarity Search

Once the feature vector is created, it is compared against a database of pre-indexed vectors from other items in the collection. This database, often a specialized vector database, is optimized for high-speed similarity searches. The system employs algorithms to calculate the “distance” between the query vector and all other vectors in the database. The most common methods include measuring Euclidean distance or Cosine Similarity to identify which items are “closest” or most similar to the provided example.
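
As a concrete sketch of this step, the example below builds a small exact (brute-force) index with FAISS, one popular open-source similarity-search library. The random vectors, vector dimension, and result count are invented for illustration.

import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d = 4                                                      # feature vector dimensionality
database = np.random.random((1000, d)).astype("float32")   # pre-extracted item vectors
query = np.random.random((1, d)).astype("float32")         # vector from the user's example

# Build a flat (exact) index using Euclidean (L2) distance and add all vectors.
index = faiss.IndexFlatL2(d)
index.add(database)

# Retrieve the 5 vectors closest to the query, already sorted by distance.
distances, ids = index.search(query, 5)
print("Closest item ids:", ids)
print("Their distances:", distances)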

Result Ranking and Retrieval

Finally, the system ranks the items from the database based on their calculated similarity scores, from most to least similar. The top-ranking results are then presented to the user. This process enables powerful search capabilities, such as finding visually similar products in an e-commerce catalog from a user-uploaded photo or identifying songs based on a short audio sample. The effectiveness of the search depends heavily on the quality of the feature extraction and the efficiency of the similarity search algorithm.
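
The ranking step itself is straightforward once similarity scores exist. Here is a minimal NumPy sketch with made-up scores:

import numpy as np

# Hypothetical similarity scores for five catalog items (higher = more similar).
scores = np.array([0.91, 0.12, 0.78, 0.45, 0.88])
top_n = 3

# argsort sorts ascending, so reverse it and keep the first top_n indices.
ranked_indices = np.argsort(scores)[::-1][:top_n]
print("Top results (item indices):", ranked_indices)  # [0 4 2]
print("Their scores:", scores[ranked_indices])        # [0.91 0.88 0.78]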

Diagram Components Explained

User Provides Example

This is the starting point of the process. The user inputs a piece of data (e.g., an image, a song snippet, a document) that serves as the template for what they want to find.

Feature Extraction Engine

This component is an AI model or algorithm that analyzes the input example. Its job is to identify and quantify the core characteristics of the example and convert them into a machine-readable format, specifically a feature vector.

Vector Database / Index

This is a specialized database that stores the feature vectors for all items in the collection. It is highly optimized to perform rapid searches over these high-dimensional numerical representations.

Similarity Search Algorithm

This algorithm takes the query vector from the example and compares it to all the vectors in the database. It calculates a similarity score between the query and every other item, determining which ones are the closest matches.

Ranked Results

The output of the similarity search is a list of items from the database, ordered by how similar they are to the user’s original example. This ranked list is then presented to the user, completing the query.

Core Formulas and Applications

Example 1: Cosine Similarity

This formula measures the cosine of the angle between two non-zero vectors. In QBE, it captures similarity of orientation rather than magnitude, making it ideal for comparing documents or images based on their content features. A value of 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions.

Similarity(A, B) = (A · B) / (||A|| * ||B||)
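
A direct NumPy translation of this formula, with invented example vectors:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0 — same orientation despite different magnitude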

Example 2: Euclidean Distance

This is the straight-line distance between two points in Euclidean space. In QBE, it is used to find the “closest” items in the feature space. A smaller distance implies a higher degree of similarity. It is commonly used in clustering and nearest-neighbor searches.

Distance(A, B) = sqrt(Σ(A_i - B_i)^2)
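
The same formula in NumPy, again with invented example vectors:

import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean_distance(a, b))  # 5.0 (the classic 3-4-5 right triangle)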

Example 3: k-Nearest Neighbors (k-NN) Pseudocode

This pseudocode represents the logic of the k-NN algorithm, a core method for implementing QBE. It finds the ‘k’ most similar items (neighbors) to a query example from a dataset by calculating the distance to all other points and selecting the closest ones.

FUNCTION find_k_neighbors(query_example, dataset, k):
  distances = []
  FOR item IN dataset:
    dist = calculate_distance(query_example, item)
    distances.append((dist, item))
  
  SORT distances by dist
  
  RETURN first k items from sorted distances
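
A direct, runnable Python translation of this pseudocode, using Euclidean distance as the metric and a small made-up dataset:

import math

def calculate_distance(a, b):
    # Euclidean distance between two equal-length numeric sequences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_k_neighbors(query_example, dataset, k):
    distances = [(calculate_distance(query_example, item), item) for item in dataset]
    distances.sort(key=lambda pair: pair[0])      # closest first
    return [item for _, item in distances[:k]]

dataset = [(1, 1), (2, 3), (8, 9), (0, 1), (5, 5)]
print(find_k_neighbors((1, 2), dataset, k=2))     # [(1, 1), (2, 3)]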

Practical Use Cases for Businesses Using Query by Example (QBE)

  • Reverse Image Search for E-commerce: Customers upload an image of a product to find visually similar items in a store’s catalog. This enhances user experience and boosts sales by making product discovery intuitive and fast, bypassing keyword limitations.
  • Music and Media Identification: Services use audio fingerprinting, a form of QBE, to identify a song, movie, or TV show from a short audio or video clip. This is used in content identification for licensing and in consumer applications like Shazam.
  • Duplicate Document Detection: Enterprises use QBE to find duplicate or near-duplicate documents within their systems. By providing a document as an example, the system can identify redundant files, reducing storage costs and improving data organization.
  • Plagiarism and Copyright Infringement Detection: Educational institutions and content platforms can submit a document or image to find instances of it elsewhere. This helps enforce academic integrity and protect intellectual property rights by finding unauthorized copies.
  • Genomic Sequence Matching: In bioinformatics, researchers can search for similar genetic sequences by providing a sample sequence as a query. This accelerates research by identifying related genes or proteins across vast biological databases.

Example 1

QUERY: {
  "input_media": {
    "type": "image",
    "features": [0.12, 0.98, ..., -0.45]
  },
  "parameters": {
    "search_type": "similar_products",
    "top_n": 10
  }
}

Business Use Case: An e-commerce platform uses this query to power its visual search feature, allowing a user to upload a photo of a dress and receive a list of the 10 most visually similar dresses available in its inventory.

Example 2

QUERY: {
  "input_media": {
    "type": "audio_fingerprint",
    "hash_sequence": ["A4B1", "C9F2", ..., "D5E3"]
  },
  "parameters": {
    "search_type": "song_identification",
    "match_threshold": 0.95
  }
}

Business Use Case: A music identification app captures a 10-second audio clip from a user, converts it to a unique hash sequence, and runs this query to find the matching song in its database with at least 95% confidence.

🐍 Python Code Examples

This example uses scikit-learn to perform a simple Query by Example search. We define a dataset of feature vectors, provide a query “example,” and use the NearestNeighbors algorithm to find the two most similar items in the dataset.

from sklearn.neighbors import NearestNeighbors
import numpy as np

# Sample dataset of feature vectors (e.g., from images or documents)
X = np.array([
    [-1, -1], [-2, -1], [-3, -2],
    [1, 1], [2, 1], [3, 2]   # second cluster; illustrative values
])

# The "example" we want to find neighbors for (kneighbors expects a 2-D array)
query_example = np.array([[0, 0]])

# Initialize the NearestNeighbors model to find the 2 nearest neighbors
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)

# Find the neighbors of the query example
distances, indices = nbrs.kneighbors(query_example)

print("Indices of nearest neighbors:", indices)
print("Distances to nearest neighbors:", distances)
print("Nearest neighbor vectors:", X[indices])

This snippet demonstrates how QBE can be applied to text similarity using feature vectors generated by TF-IDF. After transforming a corpus of documents into vectors, we transform a new query sentence and use cosine similarity to find and rank the most relevant documents, mimicking how a QBE system retrieves similar text.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Corpus of documents
documents = [
    "AI is transforming the world",
    "Machine learning is a subset of AI",
    "Deep learning drives modern AI",
    "The world is changing rapidly"
]

# The "example" query
query_example = ["AI and machine learning applications"]

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform(query_example)

# Calculate cosine similarity between the query and all documents
cosine_similarities = cosine_similarity(query_vec, X).flatten()

# Get the indices of the most similar documents
most_similar_doc_indices = np.argsort(cosine_similarities)[::-1]

print("Ranked document indices (most to least similar):", most_similar_doc_indices)
print("Similarity scores:", np.sort(cosine_similarities)[::-1])
print("Most similar document:", documents[most_similar_doc_indices])

Types of Query by Example (QBE)

  • Content-Based Image Retrieval (CBIR): This type uses an image as the query to find visually similar images in a database. It analyzes features like color, texture, and shape, making it useful for reverse image search engines and finding similar products in e-commerce.
  • Query by Humming (QBH): Users hum or sing a melody, and the system finds the original song. This works by extracting acoustic features like pitch and tempo from the user’s input and matching them against a database of audio fingerprints.
  • Textual Similarity Search: A user provides a sample document or paragraph, and the system retrieves documents with similar semantic meaning or style. This is applied in plagiarism detection, related article recommendation, and finding duplicate records within a database.
  • Genomic and Proteomic Search: In bioinformatics, a specific gene or protein sequence is used as a query to find similar or related sequences in vast biological databases. This helps researchers identify evolutionary relationships and functional similarities between different organisms.
  • Example-Based 3D Model Retrieval: This variation allows users to search for 3D models (e.g., for CAD or 3D printing) by providing a sample 3D model as the query. The system analyzes geometric properties to find structurally similar objects.

Comparison with Other Algorithms

QBE vs. Keyword-Based Search

Query by Example, which relies on vector-based similarity search, fundamentally differs from traditional keyword-based search. Keyword search excels at finding exact textual matches but fails when queries are abstract, non-textual, or require an understanding of context and semantics. QBE thrives in these scenarios, as it can find conceptually similar items even if they don’t share any keywords.

Performance on Small Datasets

On small datasets, a brute-force QBE approach (calculating distance to every item) is feasible and highly accurate. Its performance can be comparable to keyword search in terms of speed, but it uses more memory to store the vector embeddings. Keyword search, relying on an inverted index, is typically faster and more memory-efficient for simple text retrieval tasks.

Performance on Large Datasets

For large datasets, brute-force similarity search becomes computationally prohibitive. QBE systems must use Approximate Nearest Neighbor (ANN) algorithms like LSH or HNSW. These methods trade a small amount of accuracy for a massive gain in speed, making QBE viable at scale. Keyword search scales exceptionally well for text due to the efficiency of inverted indexes, but its inability to handle non-textual or conceptual queries remains a major limitation.
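
As an illustration of the ANN approach, here is a minimal sketch using the hnswlib library, which implements the HNSW index mentioned above. The data, dimension, and parameter values are invented; ef_construction, M, and ef control the accuracy/speed trade-off.

import numpy as np
import hnswlib  # assumes the hnswlib package is installed

dim = 16
data = np.float32(np.random.random((10000, dim)))   # pre-extracted item vectors
query = np.float32(np.random.random((1, dim)))      # vector from the user's example

# Build an HNSW index; higher ef_construction/M improve recall at build cost.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=10000, ef_construction=200, M=16)
index.add_items(data, np.arange(10000))

# ef controls the accuracy/speed trade-off at query time.
index.set_ef(50)
labels, distances = index.knn_query(query, k=10)
print("Approximate nearest neighbors:", labels)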

Dynamic Updates and Real-Time Processing

Adding new items to a keyword search index is generally a fast and efficient process. For QBE systems, adding new items requires generating the vector embedding and then updating the vector index. Updating some ANN indexes can be computationally intensive and may not be ideal for highly dynamic datasets with frequent writes. For real-time processing, QBE latency depends heavily on the efficiency of the ANN index and the complexity of the feature extraction model, while keyword search latency is typically very low.

⚠️ Limitations & Drawbacks

While powerful, Query by Example is not always the best solution and can be inefficient or problematic in certain situations. Its performance depends heavily on the quality of the input example and the underlying data representation, and its computational demands can be significant. Understanding these drawbacks is key to deciding when to use QBE.

  • The Curse of Dimensionality: As data complexity grows, feature vectors become very high-dimensional; distances between points become less discriminative, and more data is required to achieve robust performance.
  • Garbage In, Garbage Out: The quality of search results is entirely dependent on the quality of the query example; a poor or ambiguous example will yield poor and irrelevant results.
  • High Computational Cost: Performing an exact similarity search across a large dataset is computationally expensive, and while approximate methods are faster, they can sacrifice accuracy.
  • Feature Extraction Dependency: The effectiveness of the search is contingent on the feature extraction model’s ability to capture the essential characteristics of the data, and a poorly trained model will lead to poor results.
  • Storage Overhead: Storing high-dimensional vector embeddings for every item in a database requires significantly more storage space than traditional indexes like those used for keyword search.
  • Difficulty with Grouped Constraints: QBE systems often struggle with complex, logical queries that involve nested conditions or combinations of attributes (e.g., finding images with “a dog AND a cat but NOT a person”).

In scenarios requiring complex logical filtering or where query inputs are easily expressed with text, traditional database queries or hybrid strategies may be more suitable.

❓ Frequently Asked Questions

How is Query by Example different from a keyword search?

Query by Example uses a sample item (like an image or document) to find conceptually or structurally similar results, whereas a keyword search finds exact or partial matches of the text you enter. QBE is ideal for non-textual data or when you can’t describe what you’re looking for with words.

What kind of data works best with QBE?

QBE excels with unstructured, high-dimensional data where similarity is subjective or difficult to define with rules. This includes images, audio files, video, and complex documents. It is less effective for simple, structured data where traditional SQL queries are more efficient.

Is Query by Example difficult to implement?

Implementation complexity varies. Using a managed cloud service or an open-source vector database can simplify the process significantly. However, building a custom QBE system from scratch, including training a high-quality feature extraction model, requires significant expertise in machine learning and data engineering.

What are vector databases and why are they important for QBE?

Vector databases are specialized databases designed to store and efficiently search through high-dimensional feature vectors. They are crucial for QBE because they use optimized algorithms (like ANN) to perform similarity searches incredibly fast, making it possible to query millions or even billions of items in real-time.

Can QBE understand the context or semantics of a query?

Yes, this is one of its key strengths. Modern QBE systems use deep learning models to create feature vectors that capture the semantic meaning of data. This allows the system to find results that are conceptually related to the query example, even if they are not visually or structurally identical.

🧾 Summary

Query by Example (QBE) is an AI-driven search technique that allows users to find information by providing a sample item rather than a textual query. The system extracts the core features of the example into a numerical vector and then searches a database for items with the most similar vectors. This method is especially powerful for searching non-textual data like images and audio.