Traditional search engines match keywords. Semantic search understands meaning. When you search for "intelligent ocean animals," a keyword search might miss "Dolphins are smart marine mammals" because the exact words don't match. Semantic search finds it because the meaning is similar.
The breakthrough insight: if we can represent text as numbers, we can measure similarity mathematically. Words with similar meanings should have similar numerical representations. This is the foundation of embeddings: transforming language into geometry, where proximity equals semantic similarity.
An embedding is a transformation from words to vectors: lists of numbers. But not just any numbers. The key is that similar words end up close together in this vector space, while different words stay far apart.
TF-IDF Embeddings: Our implementation uses Term Frequency-Inverse Document Frequency, a classic technique from information retrieval. For each word, we calculate two things:
1. Term Frequency (TF): How often does this word appear in the document? Common words like "the" appear frequently, but so do important topic words like "elephant" in a document about elephants.
2. Inverse Document Frequency (IDF): How rare is this word across all documents? Words that appear in every document (like "the") aren't very informative. Rare words (like "espresso" in a coffee context) tell us a lot about what makes a document unique.
The TF-IDF score multiplies these together: TF × IDF = (word count / total words) × log(total docs / docs with word). This balances local importance (TF) with global uniqueness (IDF).
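As a concrete illustration, here is a minimal sketch of that calculation in plain JavaScript. The function names (`termFrequency`, `inverseDocumentFrequency`, `tfIdf`) and the tiny corpus are illustrative, not part of any particular library:

```javascript
// Term Frequency: how often a word appears in a document, relative to its length.
function termFrequency(word, docTokens) {
  const count = docTokens.filter(t => t === word).length;
  return count / docTokens.length;
}

// Inverse Document Frequency: how rare the word is across the whole corpus.
function inverseDocumentFrequency(word, corpus) {
  const docsWithWord = corpus.filter(doc => doc.includes(word)).length;
  if (docsWithWord === 0) return 0; // word never seen: no evidence either way
  return Math.log(corpus.length / docsWithWord);
}

// TF-IDF: local importance weighted by global uniqueness.
function tfIdf(word, docTokens, corpus) {
  return termFrequency(word, docTokens) * inverseDocumentFrequency(word, corpus);
}

// Example: "elephant" scores high in the elephant document; "the" scores zero
// because it appears in every document (log of 1 is 0).
const corpus = [
  ['the', 'elephant', 'is', 'a', 'large', 'mammal'],
  ['the', 'espresso', 'machine', 'brews', 'the', 'coffee'],
  ['the', 'dolphin', 'is', 'a', 'smart', 'marine', 'mammal'],
];
console.log(tfIdf('elephant', corpus[0], corpus)); // high
console.log(tfIdf('the', corpus[0], corpus));      // 0
```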
Once we have vectors, we need to measure how similar they are. Cosine similarity is perfect for this. Instead of measuring straight-line distance, it measures the angle between vectors.
Why angles instead of distance? Imagine two documents about elephants: one is a short definition, the other a detailed encyclopedia entry. They have very different vector lengths (magnitudes), but they point in the same direction (similar word distributions). Cosine similarity captures this: it returns 1.0 for identical directions, 0.0 for perpendicular (unrelated), and -1.0 for opposite directions.
The formula: cosine(A, B) = (A · B) / (||A|| × ||B||), where A · B is the dot product and ||A|| is the magnitude. In practice, this means: multiply matching dimensions, sum them up, and normalize by vector lengths.
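That translates directly into code. A minimal sketch, assuming both vectors are plain arrays of numbers with the same length:

```javascript
// Cosine similarity: dot product normalized by the two vector magnitudes.
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // multiply matching dimensions and sum
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  if (magA === 0 || magB === 0) return 0; // guard against zero vectors
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Same direction scores 1.0 even with different magnitudes; perpendicular scores 0.0.
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // 1.0
console.log(cosineSimilarity([1, 0], [0, 1]));       // 0.0
```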
Search Engines
Google and other search engines use embeddings to understand query intent. When you search "how to fix leaky faucet," they understand you're looking for repair guides, even if pages use words like "dripping tap" or "plumbing solutions."
Recommendation Systems
Netflix, Spotify, and Amazon embed products, movies, and songs as vectors. Similar embeddings mean similar content. This powers "you might also like" features that find connections humans might miss.
Vector Databases
Modern databases like Pinecone, Weaviate, and Milvus are built specifically for storing and searching embeddings. They enable fast similarity search across millions of vectors for applications like RAG (Retrieval-Augmented Generation) with AI.
Document Clustering
News aggregators group similar articles automatically by clustering their embeddings. No manual categorization needed: the vectors naturally cluster by topic.
Semantic Deduplication
Find duplicate or near-duplicate content even when wording differs. Essential for cleaning datasets, detecting plagiarism, or managing large document repositories.
Question Answering
AI assistants like ChatGPT use embeddings to find relevant context from knowledge bases. Your question is embedded, matched against document embeddings, and the most relevant passages are retrieved to answer your query.
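The retrieval step itself is just a ranking loop. A minimal sketch, assuming the `cosineSimilarity` function above and a hypothetical `embed` function that maps text to a vector (for example, the TF-IDF vectorizer described later):

```javascript
// Rank documents by similarity to the query and return the top k passages.
// `embed` is an assumed text-to-vector function, e.g. a TF-IDF or neural embedder.
function retrieveTopK(query, documents, embed, k = 3) {
  const queryVector = embed(query);
  return documents
    .map(doc => ({ doc, score: cosineSimilarity(queryVector, embed(doc)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
// The top-scoring passages are then handed to the language model as context.
```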
While TF-IDF works well for many tasks, modern systems use neural embeddings that capture deeper semantic relationships:
Word2Vec and GloVe: These models learn word embeddings from massive text corpora. They famously capture analogies: king - man + woman ≈ queen. The vectors encode semantic relationships learned from patterns of word co-occurrence.
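The analogy arithmetic is plain element-wise vector math. A toy sketch with made-up 3-dimensional vectors (real Word2Vec embeddings have hundreds of dimensions and are learned from data, not hand-written), reusing the `cosineSimilarity` function from earlier:

```javascript
// Hand-picked toy vectors purely to illustrate the arithmetic.
const vectors = {
  king:  [0.9, 0.8, 0.1],
  man:   [0.5, 0.1, 0.1],
  woman: [0.5, 0.1, 0.9],
  queen: [0.9, 0.8, 0.9],
};

const subtract = (a, b) => a.map((x, i) => x - b[i]);
const add = (a, b) => a.map((x, i) => x + b[i]);

// king - man + woman lands near queen in the embedding space.
const result = add(subtract(vectors.king, vectors.man), vectors.woman);
console.log(result);                                  // [0.9, 0.8, 0.9]
console.log(cosineSimilarity(result, vectors.queen)); // 1.0 in this toy example
```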
Sentence Transformers: Models like Sentence-BERT, built on BERT and its variants, create embeddings for entire sentences. They understand context: "bank" near "river" gets a different embedding than "bank" near "money."
Multi-modal Embeddings: CLIP and similar models embed both images and text into the same vector space. Search images with text queries, or text with image queries. The same similarity math works across modalities.
The principle remains the same across all these approaches: transform data into vectors where geometric proximity represents semantic similarity. Whether you're using TF-IDF, transformers, or custom neural networks, you're mapping meaning into measurable mathematics.
This demonstration implements semantic search entirely in JavaScript with no external ML libraries. The TF-IDF vectorizer builds a vocabulary from the corpus, calculates term frequencies and inverse document frequencies, and transforms text into sparse vectors.
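The shape of that vectorizer might look roughly like the following simplified sketch; the demonstration's actual code may differ in details such as tokenization and smoothing:

```javascript
// A simplified TF-IDF vectorizer: build a vocabulary from the corpus,
// precompute IDF values, then map any text to a vector over that vocabulary.
class TfIdfVectorizer {
  fit(documents) {
    this.docs = documents.map(d => d.toLowerCase().match(/\w+/g) || []);
    this.vocabulary = [...new Set(this.docs.flat())];
    this.idf = this.vocabulary.map(word => {
      const docsWithWord = this.docs.filter(doc => doc.includes(word)).length;
      return Math.log(this.docs.length / docsWithWord);
    });
    return this;
  }

  transform(text) {
    const tokens = text.toLowerCase().match(/\w+/g) || [];
    // One dimension per vocabulary word; most entries are 0, so the vector is effectively sparse.
    return this.vocabulary.map((word, i) => {
      const tf = tokens.filter(t => t === word).length / (tokens.length || 1);
      return tf * this.idf[i];
    });
  }
}

// Usage: fit on the corpus, embed a query, then rank documents with cosineSimilarity.
const vectorizer = new TfIdfVectorizer().fit([
  'Dolphins are smart marine mammals',
  'Espresso is a concentrated coffee drink',
]);
const queryVector = vectorizer.transform('smart marine animals');
```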
For production systems, you'd typically use:
- Pre-trained models: Sentence-BERT, OpenAI embeddings, or domain-specific models
- Specialized databases: Vector databases optimized for similarity search at scale
- Approximate nearest neighbors: Algorithms like HNSW or IVF for sub-linear search time
- Dimensionality reduction: PCA or UMAP to reduce vector size while preserving relationships
But the core concept, embeddings enabling semantic search through similarity metrics, remains the same regardless of implementation complexity.