OpenAI Embeddings and Vector Databases Crash Course

Published by Fudgy McFarlen on

Summary

A quick run-through of vector databases. He moves through it a bit quickly. The video looks to be about two years old, so the free online tools he used might no longer be available; others may have to be substituted.

  • Demonstrates how words and phrases can be vectorized { how does this process get done? Math please }. He uses OpenAI to generate the embeddings.
  • Demonstrates storing vectorized phrases in a vector database consisting of: Phrase, Vectorized(Phrase) -> SQL columns: Text, Blob.
  • Creates a search phrase -> Vectorize(search phrase) and uses it to search against the stored Vectorized(Phrase) values. The search results are returned in descending order of match. { how does the match get calculated? } A sketch follows this list.
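
Rough sketch of the store-and-search step in Python (my own illustration, not the video's exact code): the embedding is stored as a BLOB next to the phrase TEXT in SQLite, and the match score is assumed to be cosine similarity, the usual metric for embeddings (the video does not say which metric its tool uses). Table and column names are made up.

import math
import sqlite3
import struct

def to_blob(vec):
    # Pack a list of floats into bytes so it can sit in a BLOB column.
    return struct.pack(f"{len(vec)}f", *vec)

def from_blob(blob):
    # 4 bytes per float32.
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine_similarity(a, b):
    # Match score: (a . b) / (|a| * |b|); closer to 1.0 = better match.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

con = sqlite3.connect("phrases.db")
con.execute("CREATE TABLE IF NOT EXISTS phrases (phrase TEXT, embedding BLOB)")

def store(phrase, vec):
    con.execute("INSERT INTO phrases VALUES (?, ?)", (phrase, to_blob(vec)))
    con.commit()

def search(query_vec, top_k=5):
    rows = con.execute("SELECT phrase, embedding FROM phrases").fetchall()
    scored = [(cosine_similarity(query_vec, from_blob(blob)), phrase)
              for phrase, blob in rows]
    # Descending order of match, as in the video.
    return sorted(scored, reverse=True)[:top_k]

A real vector database replaces the linear scan in search() with an approximate nearest-neighbor index, but the scoring idea is the same.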

Tools Used

  • Postman
  • GAI: what is postman software
  • Postman is a widely used API (Application Programming Interface) development platform that allows developers to design, build, test, document, and share APIs. It acts as an API client, enabling users to send various HTTP requests (GET, POST, PUT, DELETE) to web services and analyze responses, supporting REST, SOAP, and GraphQL. 
  • GAI: is there an open source version of postman for ubuntu
  • Yes, there are several open-source, offline-friendly Postman alternatives for Ubuntu that offer similar API testing functionality without requiring cloud accounts. The top recommendations include Bruno, Hoppscotch, and Insomnium, all of which are cross-platform and available on Linux.  ( ….more ) 
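
For reference, the request the video builds in Postman can also be sent from a short Python script. A minimal sketch using the requests library, assuming an OPENAI_API_KEY environment variable and the text-embedding-3-small model (the older video most likely used text-embedding-ada-002):

import os
import requests

# Same call one would assemble in Postman: POST to the OpenAI embeddings endpoint.
resp = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "text-embedding-3-small", "input": "the quick brown fox jumps"},
)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(len(vector))  # length of the returned embedding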

Research Links

Followup Questions

LLMs create word vectors (embeddings) by first converting text into numerical tokens and then passing them through a learned neural network layer that maps each token to a high-dimensional vector, placing similar words close together in a semantic space. These vectors are refined during training to capture context, grammar, and semantic relationships. 

Key Steps in Creating Word Vectors:  

  • Tokenization: Text is broken down into smaller units called tokens (words, sub-words, or characters).
  • Input Embedding Layer: Each token ID is looked up in an embedding matrix—a large, learnable table—which converts the token into a dense vector (e.g., 512 or 768 dimensions).
  • Contextualization (Transformers): The initial embeddings pass through transformer layers using attention mechanisms, allowing the vector for a word to be modified by surrounding words, making it context-aware.
  • Training (Learning Relationships): Models like Word2Vec or transformer-based models are trained on massive datasets to adjust these vectors so that words with similar meanings (e.g., "dog" and "puppy") or relationships (e.g., "king" - "man" + "woman" ≈ "queen", where ≈ means approximately equal) are mathematically close.
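
A toy sketch of the embedding-lookup step, with a made-up four-word vocabulary and a random matrix standing in for the learned one (in a real model these values come from training, so the king/man/woman arithmetic only lands near "queen" after training, not here):

import numpy as np

# Toy vocabulary: token -> token ID (real models use subword tokenizers).
vocab = {"king": 0, "man": 1, "woman": 2, "queen": 3}

# Embedding matrix: one row per token ID, 8 dimensions here instead of 512/768.
# In a real model these values are learned; here they are random placeholders.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 8))

def embed(token):
    # The "input embedding layer" is just a row lookup by token ID.
    return embedding_matrix[vocab[token]]

analogy = embed("king") - embed("man") + embed("woman")
print(analogy.shape)  # (8,) -- same dimensionality as every other token vector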

Summary Table

Step           Action                Example
Input          Raw text              "Unbelievable!"
Normalization  Clean & lowercase     "unbelievable!"
Tokenization   Split into subwords   ["un", "believ", "able", "!"]
Mapping        Convert to IDs        [345, 12903, 762, 0]
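
To see tokenization and ID mapping on real data, OpenAI's tiktoken library can be used (assuming it is installed; the exact subword splits and IDs depend on the encoding, so they will not match the illustrative table above):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Unbelievable!")
print(ids)                              # token IDs (integers)
print([enc.decode([i]) for i in ids])   # the subword piece each ID maps back to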

 

Yes, when embedding phrases using a specific, trained machine learning model, all input phrases—regardless of their length or complexity—are mapped into the same fixed-dimension vector space.  While the content of the vectors (the numerical values) will differ to represent the distinct meaning of each phrase, the dimensionality (length of the vector) remains constant, which is crucial for computational efficiency and semantic comparison. 

Example:

  • "hello" > [0.12, -0.5, …, 0.03]       (768 dimensions)
  • "the quick brown fox jumps"  > [0.88, 0.1, …, -0.2]      (768 dimensions)

 

 
