Mapping Meaning

Build a semantic search engine from scratch and discover how embeddings transform words into meaning you can measure.

Step 1

From Text to Tokens

Before we can search, we need to break text into pieces computers can understand. This process, called tokenization, splits sentences into words and normalizes them.

Example: Tokenizing a Document

Input: "The elephant is a large mammal with a distinctive trunk"

Notice: words are lowercased and common stop words (like "the", "a", "is") are removed to focus on meaningful terms.
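
In code, a minimal tokenizer along these lines fits in a few lines of JavaScript. The stop-word list below is a small illustrative assumption, not the exact list the demo uses.

// Minimal tokenizer: lowercase, split on non-letters, drop stop words.
// The stop-word list is an illustrative assumption, not the demo's exact list.
const STOP_WORDS = new Set(["the", "a", "an", "is", "are", "with", "of", "and", "to", "in"]);

function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)        // split on anything that isn't a letter or digit
    .filter(token => token.length > 0 && !STOP_WORDS.has(token));
}

tokenize("The elephant is a large mammal with a distinctive trunk");
// => ["elephant", "large", "mammal", "distinctive", "trunk"]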

Step 2

Tokens to Vectors

Each document becomes a vector of numbers using TF-IDF (Term Frequency-Inverse Document Frequency). Words that are common in a document but rare across all documents get higher scores.

TF-IDF Vector Representation

Each dimension represents a word from our vocabulary. The value shows how important that word is to this document.

The Math Behind TF-IDF

TF (Term Frequency): How often a word appears in the document
IDF (Inverse Document Frequency): How unique the word is across all documents
TF-IDF = TF × IDF: Balance between frequency and uniqueness
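
To make the numbers concrete (these figures are made up purely for illustration): if "elephant" appears 3 times in a 100-word document, TF = 3 / 100 = 0.03. If it appears in 2 of 20 documents, IDF = log(20 / 2) ≈ 2.3 using the natural log, so TF-IDF ≈ 0.03 × 2.3 ≈ 0.07. A word like "the" might have a high TF, but its IDF is log(20 / 20) = 0, so its score collapses to zero.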

Step 3

Finding Similarity

To find similar documents, we calculate the angle between vectors using cosine similarity. A score of 1.0 means the vectors point in exactly the same direction; 0.0 means the documents share no terms at all.

Cosine Similarity in Action

Query: "intelligent marine animal" → [0.52, 0.31, 0.76, 0.18, ...]
Document: "Dolphins are intelligent marine mammals" → [0.48, 0.29, 0.81, 0.22, ...]
Similarity: 0.87

High similarity indicates these texts share similar meaning.
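
A cosine similarity function in JavaScript is only a few lines. The usage example below uses just the four visible dimensions from above, so its result differs from the 0.87 score, which comes from the full-length vectors.

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];       // multiply matching dimensions and sum
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Only the four visible dimensions shown above (illustrative, not the full vectors):
cosineSimilarity([0.52, 0.31, 0.76, 0.18], [0.48, 0.29, 0.81, 0.22]);
// => about 0.997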

Step 4

Search in Action

Putting it all together: your query is tokenized, vectorized, and compared against all documents. The closest matches appear at the top.
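
Here is a minimal sketch of that ranking step, assuming each document has already been converted to a vector. The titles and three-dimensional vectors are illustrative stand-ins for the real knowledge base.

// Rank documents by cosine similarity to the query vector, highest first.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const magnitude = a => Math.sqrt(dot(a, a));

function rankDocuments(queryVector, documents) {
  return documents
    .map(doc => ({
      title: doc.title,
      score: dot(queryVector, doc.vector) / (magnitude(queryVector) * magnitude(doc.vector)),
    }))
    .sort((a, b) => b.score - a.score);
}

// Illustrative data; real TF-IDF vectors have one dimension per vocabulary word.
const documents = [
  { title: "Dolphins are intelligent marine mammals", vector: [0.8, 0.1, 0.6] },
  { title: "Espresso is a concentrated coffee drink", vector: [0.0, 0.9, 0.1] },
];
rankDocuments([0.7, 0.2, 0.5], documents);
// The dolphin document ranks first: its vector points in nearly the same direction as the query.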

Try Different Queries

Notice how semantic search finds relevant results even when exact words don't match!

Semantic Search Engine

Type a query to search through our knowledge base. The engine converts your text to a vector and finds the most similar documents using cosine similarity.

Vector Space

The vector space view shows your query, the documents in the knowledge base, and the top match.

How It Works

Each document and query is converted to a TF-IDF vector. The bar chart shows similarity scores ranking the most semantically related documents.

Visualization

📈 Bars: top results ranked by similarity score, with percentages showing the strength of each semantic match

Dataset & Filters

Explore different categories or view all documents in the knowledge base.

Understanding Semantic Search

Traditional search engines match keywords. Semantic search understands meaning. When you search for "intelligent ocean animals," a keyword search might miss "Dolphins are smart marine mammals" because the exact words don't match. Semantic search finds it because the meaning is similar.

The breakthrough insight: if we can represent text as numbers, we can measure similarity mathematically. Words with similar meanings should have similar numerical representations. This is the foundation of embeddings: transforming language into geometry, where proximity equals semantic similarity.

How Embeddings Work

An embedding is a transformation from words to vectors: lists of numbers. But not just any numbers. The key is that similar words end up close together in this vector space, while different words stay far apart.

TF-IDF Embeddings: Our implementation uses Term Frequency-Inverse Document Frequency, a classic technique from information retrieval. For each word, we calculate two things:

1. Term Frequency (TF): How often does this word appear in the document? Common words like "the" appear frequently, but so do important topic words like "elephant" in a document about elephants.

2. Inverse Document Frequency (IDF): How rare is this word across all documents? Words that appear in every document (like "the") aren't very informative. Rare words (like "espresso" in a coffee context) tell us a lot about what makes a document unique.

The TF-IDF score multiplies these together: TF × IDF = (word count / total words) × log(total docs / docs with word). This balances local importance (TF) with global uniqueness (IDF).
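
Translated directly into JavaScript, the formula might look like this sketch; the tiny token lists stand in for a real corpus and are illustrative only.

// TF-IDF for one term in one document:
// (count / total words) × log(total docs / docs containing the term).
function termFrequency(term, docTokens) {
  const count = docTokens.filter(t => t === term).length;
  return count / docTokens.length;
}

function inverseDocumentFrequency(term, corpus) {
  const docsWithTerm = corpus.filter(doc => doc.includes(term)).length;
  return Math.log(corpus.length / docsWithTerm);   // assumes the term appears in at least one document
}

function tfIdf(term, docTokens, corpus) {
  return termFrequency(term, docTokens) * inverseDocumentFrequency(term, corpus);
}

// Tiny illustrative corpus of already-tokenized documents.
const corpus = [
  ["elephant", "large", "mammal", "trunk"],
  ["dolphin", "intelligent", "marine", "mammal"],
  ["espresso", "concentrated", "coffee"],
];
tfIdf("elephant", corpus[0], corpus);  // higher: frequent locally, rare globally
tfIdf("mammal", corpus[0], corpus);    // lower: appears in two of the three documents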

Measuring Similarity

Once we have vectors, we need to measure how similar they are. Cosine similarity is perfect for this. Instead of measuring straight-line distance, it measures the angle between vectors.

Why angles instead of distance? Imagine two documents about elephants: one is a short definition, the other a detailed encyclopedia entry. They have very different vector lengths (magnitudes), but they point in the same direction (similar word distributions). Cosine similarity captures this: it returns 1.0 for identical directions, 0.0 for perpendicular (unrelated), and -1.0 for opposite directions.

The formula: cosine(A, B) = (A · B) / (||A|| × ||B||), where A · B is the dot product and ||A|| is the magnitude. In practice, this means: multiply matching dimensions, sum them up, and normalize by vector lengths.
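
As a worked example with made-up two-dimensional vectors: for A = [1, 2] and B = [2, 3], A · B = 1×2 + 2×3 = 8, ||A|| = √(1² + 2²) ≈ 2.24, and ||B|| = √(2² + 3²) ≈ 3.61, giving cosine(A, B) = 8 / (2.24 × 3.61) ≈ 0.99.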

Real-World Applications

Search Engines

Google and other search engines use embeddings to understand query intent. When you search "how to fix leaky faucet," they understand you're looking for repair guides, even if pages use words like "dripping tap" or "plumbing solutions."

Recommendation Systems

Netflix, Spotify, and Amazon embed products, movies, and songs as vectors. Similar embeddings mean similar content. This powers "you might also like" features that find connections humans might miss.

Vector Databases

Modern databases like Pinecone, Weaviate, and Milvus are built specifically for storing and searching embeddings. They enable fast similarity search across millions of vectors for applications like RAG (Retrieval-Augmented Generation) with AI.

Document Clustering

News aggregators group similar articles automatically by clustering their embeddings. No manual categorization is needed; the vectors naturally cluster by topic.

Semantic Deduplication

Find duplicate or near-duplicate content even when wording differs. Essential for cleaning datasets, detecting plagiarism, or managing large document repositories.

Question Answering

AI assistants like ChatGPT use embeddings to find relevant context from knowledge bases. Your question is embedded, matched against document embeddings, and the most relevant passages are retrieved to answer your query.

Beyond TF-IDF

While TF-IDF works well for many tasks, modern systems use neural embeddings that capture deeper semantic relationships:

Word2Vec and GloVe: These models learn word embeddings from massive text corpora. They famously capture analogies: king - man + woman ≈ queen. The vectors encode semantic relationships learned from patterns of word co-occurrence.

Sentence Transformers: Models like BERT and its variants create embeddings for entire sentences. They understand context: "bank" near "river" gets a different embedding than "bank" near "money."

Multi-modal Embeddings: CLIP and similar models embed both images and text into the same vector space. Search images with text queries, or text with image queries. The same similarity math works across modalities.

The principle remains the same across all these approaches: transform data into vectors where geometric proximity represents semantic similarity. Whether you're using TF-IDF, transformers, or custom neural networks, you're mapping meaning into measurable mathematics.

Implementation Notes

This demonstration implements semantic search entirely in JavaScript with no external ML libraries. The TF-IDF vectorizer builds a vocabulary from the corpus, calculates term frequencies and inverse document frequencies, and transforms text into sparse vectors.
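
A compact sketch of such a vectorizer follows, assuming documents arrive as already-tokenized arrays; the naming is illustrative rather than the demo's exact code, and it builds dense vectors for simplicity where the demo keeps them sparse.

// Builds a vocabulary from a tokenized corpus and turns token lists into TF-IDF vectors.
class TfIdfVectorizer {
  fit(corpus) {
    this.vocabulary = [...new Set(corpus.flat())];          // one dimension per unique word
    this.idf = this.vocabulary.map(term => {
      const docsWithTerm = corpus.filter(doc => doc.includes(term)).length;
      return Math.log(corpus.length / docsWithTerm);
    });
    return this;
  }

  transform(tokens) {
    return this.vocabulary.map((term, i) => {
      const count = tokens.filter(t => t === term).length;
      return (count / tokens.length) * this.idf[i];         // TF × IDF
    });
  }
}

// Usage with a tiny illustrative corpus:
const tokenizedDocs = [
  ["elephant", "large", "mammal", "trunk"],
  ["dolphin", "intelligent", "marine", "mammal"],
];
const vectorizer = new TfIdfVectorizer().fit(tokenizedDocs);
const queryVector = vectorizer.transform(["intelligent", "marine", "animal"]);
// "animal" is not in the vocabulary, so it simply contributes nothing to the query vector.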

For production systems, you'd typically use:

  • Pre-trained models: Sentence-BERT, OpenAI embeddings, or domain-specific models (see the sketch after this list)
  • Specialized databases: Vector databases optimized for similarity search at scale
  • Approximate nearest neighbors: Algorithms like HNSW or IVF for sub-linear search time
  • Dimensionality reduction: PCA or UMAP to reduce vector size while preserving relationships
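
For example, here is a hedged sketch of swapping in a hosted embedding model, assuming the openai Node.js package (v4-style API) and an OPENAI_API_KEY in the environment; the model name is just one of the available options.

import OpenAI from "openai";

const client = new OpenAI();   // reads OPENAI_API_KEY from the environment

// Returns one embedding vector (an array of floats) per input text.
async function embed(texts) {
  const response = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map(item => item.embedding);
}

// The same cosine-similarity ranking from the demo works on these vectors unchanged.
const [queryEmbedding, docEmbedding] = await embed([
  "intelligent ocean animals",
  "Dolphins are smart marine mammals",
]);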

But the core concept, embeddings enabling semantic search through similarity metrics, remains the same regardless of implementation complexity.

References & Further Reading

For the mathematics of TF-IDF: TF-IDF on Wikipedia. For modern embedding techniques: Sentence-BERT and the paper "Attention Is All You Need" by Vaswani et al. For vector databases: Pinecone's Vector Database Guide.