Traditional search engines match keywords. Semantic search understands meaning. When you search for "intelligent ocean animals," a keyword search might miss "Dolphins are smart marine mammals" because the exact words don't match. Semantic search finds it because the meaning is similar.
The breakthrough insight: if we can represent text as numbers, we can measure similarity mathematically. Words with similar meanings should have similar numerical representations. This is the foundation of embeddings: transforming language into geometry, where proximity equals semantic similarity.
An embedding is a transformation from words to vectors: lists of numbers. But not just any numbers. The key is that similar words end up close together in this vector space, while different words stay far apart.
TF-IDF Embeddings: Our implementation uses Term Frequency-Inverse Document Frequency, a classic technique from information retrieval. For each word, we calculate two things:
1. Term Frequency (TF): How often does this word appear in the document? Common words like "the" appear frequently, but so do important topic words like "elephant" in a document about elephants.
2. Inverse Document Frequency (IDF): How rare is this word across all documents? Words that appear in every document (like "the") aren't very informative. Rare words (like "espresso" in a coffee context) tell us a lot about what makes a document unique.
The TF-IDF score multiplies these together: TF × IDF = (word count / total words) × log(total docs / docs with word). This balances local importance (TF) with global uniqueness (IDF).
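As a concrete illustration, here is a minimal sketch of that calculation in plain JavaScript. The function names (`termFrequency`, `inverseDocumentFrequency`, `tfIdf`) and the tiny corpus are illustrative, not part of any particular library:

```javascript
// Term Frequency: how often a word appears in a document, relative to its length.
function termFrequency(word, docTokens) {
  const count = docTokens.filter(t => t === word).length;
  return count / docTokens.length;
}

// Inverse Document Frequency: how rare the word is across the whole corpus.
function inverseDocumentFrequency(word, corpus) {
  const docsWithWord = corpus.filter(doc => doc.includes(word)).length;
  if (docsWithWord === 0) return 0; // word never seen: no evidence either way
  return Math.log(corpus.length / docsWithWord);
}

// TF-IDF: local importance weighted by global uniqueness.
function tfIdf(word, docTokens, corpus) {
  return termFrequency(word, docTokens) * inverseDocumentFrequency(word, corpus);
}

// Example: "elephant" scores high in the elephant document; "the" scores zero
// because it appears in every document (log of 1 is 0).
const corpus = [
  ['the', 'elephant', 'is', 'a', 'large', 'mammal'],
  ['the', 'espresso', 'machine', 'brews', 'the', 'coffee'],
  ['the', 'dolphin', 'is', 'a', 'smart', 'marine', 'mammal'],
];
console.log(tfIdf('elephant', corpus[0], corpus)); // high
console.log(tfIdf('the', corpus[0], corpus));      // 0
```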
Once we have vectors, we need to measure how similar they are. Cosine similarity is perfect for this. Instead of measuring straight-line distance, it measures the angle between vectors.
Why angles instead of distance? Imagine two documents about elephants: one is a short definition, the other a detailed encyclopedia entry. They have very different vector lengths (magnitudes), but they point in the same direction (similar word distributions). Cosine similarity captures this: it returns 1.0 for identical directions, 0.0 for perpendicular (unrelated), and -1.0 for opposite directions.
The formula: cosine(A, B) = (A · B) / (||A|| × ||B||), where A · B is the dot product and ||A|| is the magnitude. In practice, this means: multiply matching dimensions, sum them up, and normalize by vector lengths.
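That translates directly into code. A minimal sketch, assuming both vectors are plain arrays of numbers with the same length:

```javascript
// Cosine similarity: dot product normalized by the two vector magnitudes.
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // multiply matching dimensions and sum
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  if (magA === 0 || magB === 0) return 0; // guard against zero vectors
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Same direction scores 1.0 even with different magnitudes; perpendicular scores 0.0.
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // 1.0
console.log(cosineSimilarity([1, 0], [0, 1]));       // 0.0
```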
Search Engines
Google and other search engines use embeddings to understand query intent. When you search "how to fix leaky faucet," they understand you're looking for repair guides, even if pages use words like "dripping tap" or "plumbing solutions."
Recommendation Systems
Netflix, Spotify, and Amazon embed products, movies, and songs as vectors. Similar embeddings mean similar content. This powers "you might also like" features that find connections humans might miss.
Vector Databases
Modern databases like Pinecone, Weaviate, and Milvus are built specifically for storing and searching embeddings. They enable fast similarity search across millions of vectors for applications like RAG (Retrieval-Augmented Generation) with AI.
Document Clustering
News aggregators group similar articles automatically by clustering their embeddings. No manual categorization needed: the vectors naturally cluster by topic.
Semantic Deduplication
Find duplicate or near-duplicate content even when wording differs. Essential for cleaning datasets, detecting plagiarism, or managing large document repositories.
Question Answering
AI assistants like ChatGPT use embeddings to find relevant context from knowledge bases. Your question is embedded, matched against document embeddings, and the most relevant passages are retrieved to answer your query.
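The retrieval step itself is just a ranking loop. A minimal sketch, assuming the `cosineSimilarity` function above and a hypothetical `embed` function that maps text to a vector (for example, the TF-IDF vectorizer described later):

```javascript
// Rank documents by similarity to the query and return the top k passages.
// `embed` is an assumed text-to-vector function, e.g. a TF-IDF or neural embedder.
function retrieveTopK(query, documents, embed, k = 3) {
  const queryVector = embed(query);
  return documents
    .map(doc => ({ doc, score: cosineSimilarity(queryVector, embed(doc)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
// The top-scoring passages are then handed to the language model as context.
```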
While TF-IDF works well for many tasks, modern systems use neural embeddings that capture deeper semantic relationships:
Word2Vec and GloVe: These models learn word embeddings from massive text corpora. They famously capture analogies: king - man + woman ≈ queen. The vectors encode semantic relationships learned from patterns of word co-occurrence.
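The analogy arithmetic is plain element-wise vector math. A toy sketch with made-up 3-dimensional vectors (real Word2Vec embeddings have hundreds of dimensions and are learned from data, not hand-written), reusing the `cosineSimilarity` function from earlier:

```javascript
// Hand-picked toy vectors purely to illustrate the arithmetic.
const vectors = {
  king:  [0.9, 0.8, 0.1],
  man:   [0.5, 0.1, 0.1],
  woman: [0.5, 0.1, 0.9],
  queen: [0.9, 0.8, 0.9],
};

const subtract = (a, b) => a.map((x, i) => x - b[i]);
const add = (a, b) => a.map((x, i) => x + b[i]);

// king - man + woman lands near queen in the embedding space.
const result = add(subtract(vectors.king, vectors.man), vectors.woman);
console.log(result);                                  // [0.9, 0.8, 0.9]
console.log(cosineSimilarity(result, vectors.queen)); // 1.0 in this toy example
```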
Sentence Transformers: Models like Sentence-BERT, built on BERT and its variants, create embeddings for entire sentences. They understand context: "bank" near "river" gets a different embedding than "bank" near "money."
Multi-modal Embeddings: CLIP and similar models embed both images and text into the same vector space. Search images with text queries, or text with image queries. The same similarity math works across modalities.
The principle remains the same across all these approaches: transform data into vectors where geometric proximity represents semantic similarity. Whether you're using TF-IDF, transformers, or custom neural networks, you're mapping meaning into measurable mathematics.
This demonstration implements semantic search entirely in JavaScript with no external ML libraries. The TF-IDF vectorizer builds a vocabulary from the corpus, calculates term frequencies and inverse document frequencies, and transforms text into sparse vectors.
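The shape of that vectorizer might look roughly like the following simplified sketch; the demonstration's actual code may differ in details such as tokenization and smoothing:

```javascript
// A simplified TF-IDF vectorizer: build a vocabulary from the corpus,
// precompute IDF values, then map any text to a vector over that vocabulary.
class TfIdfVectorizer {
  fit(documents) {
    this.docs = documents.map(d => d.toLowerCase().match(/\w+/g) || []);
    this.vocabulary = [...new Set(this.docs.flat())];
    this.idf = this.vocabulary.map(word => {
      const docsWithWord = this.docs.filter(doc => doc.includes(word)).length;
      return Math.log(this.docs.length / docsWithWord);
    });
    return this;
  }

  transform(text) {
    const tokens = text.toLowerCase().match(/\w+/g) || [];
    // One dimension per vocabulary word; most entries are 0, so the vector is effectively sparse.
    return this.vocabulary.map((word, i) => {
      const tf = tokens.filter(t => t === word).length / (tokens.length || 1);
      return tf * this.idf[i];
    });
  }
}

// Usage: fit on the corpus, embed a query, then rank documents with cosineSimilarity.
const vectorizer = new TfIdfVectorizer().fit([
  'Dolphins are smart marine mammals',
  'Espresso is a concentrated coffee drink',
]);
const queryVector = vectorizer.transform('smart marine animals');
```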
For production systems, you'd typically use:
- Pre-trained models: Sentence-BERT, OpenAI embeddings, or domain-specific models
- Specialized databases: Vector databases optimized for similarity search at scale
- Approximate nearest neighbors: Algorithms like HNSW or IVF for sub-linear search time
- Dimensionality reduction: PCA or UMAP to reduce vector size while preserving relationships
But the core concept, embeddings enabling semantic search through similarity metrics, remains the same regardless of implementation complexity.