Artificial intelligence (AI) has transformative potential for deriving value and insights from data. As we progress toward a world where nearly every application will be AI-driven, developers building those applications will need the right tools to create compelling experiences. Tools like vector search are essential for enabling efficient and accurate retrieval of relevant information from massive datasets when working with large language models. By converting text and images into high-dimensional vectors, these techniques allow quick comparisons and searches, even when dealing with millions of files from disparate datasets across the organization.
In this article, we will cover the following topics:
- What is Vector Search?
- Vectors and Embeddings
- Storing Vector Data
- Viewing Vector Data
- Performing Vector Search
1. What is Vector Search?
Vector search is a method of information retrieval in which documents and queries are represented as vectors instead of plain text. Machine learning models generate these vector representations from source inputs that can be text, images, or other content. Having a mathematical representation of content provides a common basis for search scenarios: a query can find a match in vector space even if the original content is in a different medium or language. This method finds the most relevant documents for a given query by converting both the documents and the query into vectors and then computing the cosine similarity between them. The higher the cosine similarity, the more relevant the document.
At the core of vector search lies the concept of vector representation. In this context, a vector is an array of numbers that encapsulates the semantic meaning of a piece of content. Machine learning models such as Word2Vec, GloVe, or BERT are commonly used to generate these vectors. These models are trained on large datasets to learn the relationships and patterns between words, sentences, or entire documents.
The primary task in vector search is to measure the similarity between vectors. Various mathematical techniques can be used for this purpose. However, cosine similarity and dot product are the most common ones. Cosine similarity measures the cosine of the angle between two vectors, providing a value between -1 and 1, where 1 indicates identical directions (high similarity), and -1 indicates opposite directions (low similarity). The dot product, on the other hand, measures the magnitude of the overlap between two vectors.
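To make these two measures concrete, here is a minimal, self-contained Python sketch (plain Python, no external libraries; the toy vectors are made up for illustration) computing the dot product and cosine similarity:

```python
import math

def dot(a, b):
    """Dot product: measures the magnitude of the overlap between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, ranging from -1 to 1."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a
c = [-1.0, -2.0, -3.0]  # opposite direction to a

print(dot(a, b))                # 28.0
print(cosine_similarity(a, b))  # ~1.0 (identical direction)
print(cosine_similarity(a, c))  # ~-1.0 (opposite direction)
```

Note that cosine similarity ignores vector magnitude (b points the same way as a, so their similarity is 1 even though b is twice as long), while the dot product grows with magnitude.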
2. Vectors and Embeddings
Vectors can represent the semantic meaning of language through embeddings. These embeddings are produced by an embedding model, a machine-learning model that maps words to a high-dimensional geometric space. Modern embedding vectors typically range from hundreds to thousands of dimensions. Words with similar semantic meanings occupy nearby positions in this space, while words with different meanings are placed far apart. These spatial positions allow applications to algorithmically determine the similarity between two words or even sentences by performing operations on their embedding vectors.
Embeddings are a specific type of vector representation created by machine learning models that capture the semantic meaning of text or other types of content, e.g., images. Natural language machine learning models are trained on large datasets to identify patterns and relationships between words. During training, they learn to represent any input as a vector of real numbers in an intermediary step called the encoder. After training, these language models can be modified so that the intermediary vector representation becomes the model's output.
In vector search, a user can compare an input vector with vectors stored in a database using operations that determine similarity, e.g., the dot product. When the vectors represent embeddings, vector search enables the algorithmic determination of the most semantically similar pieces of text compared to an input. As a result, vector search is well-suited for tasks involving information retrieval.
Vectors can also be added, subtracted, or multiplied to uncover meanings and relationships. One of the most popular examples is king – man + woman = queen. A model can use this kind of arithmetic to capture relationships such as gender.
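As an illustration, the following sketch performs this arithmetic on tiny, hand-crafted 3-dimensional vectors. The vocabulary and its values are entirely hypothetical, chosen only so that the famous analogy works; real embedding models learn such relationships from data across hundreds of dimensions:

```python
import math

# Toy, hand-crafted "embeddings" (hypothetical values for illustration only)
vocab = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman: element-wise vector arithmetic
target = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]

# The word in our toy vocabulary whose vector is closest to the result
nearest = max(vocab, key=lambda word: cosine(vocab[word], target))
print(nearest)  # queen
```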
3. Storing Vector Data
In vector search, an embedding model transforms unstructured data, e.g., text, into structured data in the form of a vector. Users can then perform operations on those vectors that they could not perform on the unstructured data. The InterSystems IRIS® data platform supports a dedicated VECTOR type for these operations. There are three numeric vector element types: decimal (the most precise), double, and integer (the least precise). Since VECTOR is a standard SQL datatype, we can store vectors alongside other data in a relational table, transparently turning a SQL database into a hybrid vector database. Vector data can be added to a table with INSERT statements or through ObjectScript with a property of type %Library.Vector. IRIS Vector Search comprises the new SQL datatype VECTOR plus the similarity functions VECTOR_DOT_PRODUCT() and VECTOR_COSINE() for finding similar vectors. Users can access this functionality via SQL directly, or via the community-developed integrations for LangChain and LlamaIndex, popular Python frameworks for developing generative AI applications. In this article, we will use SQL directly.
In order to store the vector data, we must create a table or persistent class containing a vector datatype column/property.
The ObjectScript class method below uses embedded SQL to create the desired table:
ClassMethod CreateTable() As %Status
{
    // Create the table with the help of embedded SQL
    &SQL(CREATE TABLE VectorLab (
        description VARCHAR(2000),
        description_vector VECTOR(DOUBLE, 384))
    )
    if (SQLCODE = 0)
    {
        // Create an index on the vector column
        &SQL(CREATE COLUMNAR INDEX IVectorLab ON VectorLab(description_vector))
    }
    return $$$OK
}
Once the table is created, we need to convert text to embeddings with the help of the Python module sentence-transformers. When we have a list of strings that represent embeddings, we can insert them into the table as VECTORs. To do this, either use an INSERT statement or create an object and store the embedding as a property of that object. For each embedding, we execute an INSERT statement that adds it to the desired table; the TO_VECTOR() function converts the string representation of the embedding to a VECTOR.
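As a small illustration of that string conversion, the sketch below formats made-up embedding values (real all-MiniLM-L6-v2 embeddings have 384 dimensions) into the string form that can then be passed to TO_VECTOR() inside an INSERT statement:

```python
# Hypothetical stand-in for a model's output: one embedding per document.
document_embeddings = [[0.12, -0.45, 0.98],
                       [0.33, 0.10, -0.71]]

# Convert each embedding to the string form expected by TO_VECTOR().
embedding_strings = [str(vec) for vec in document_embeddings]
print(embedding_strings[0])  # [0.12, -0.45, 0.98]
```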
Before we can store embeddings as vectors in InterSystems IRIS, we first need to create them from a text source. In general, we can transform a piece of text into an embedding in four steps:
- Import a package that will turn your text into a series of embeddings.
- Pre-process your text to best fit your chosen embedding model’s input specifications.
- Instantiate the model and convert your text to the embeddings, using your chosen package’s workflow.
- Convert each individual embedding to a string. This step is necessary to convert an embedding to a VECTOR at a later point.
The following code samples generate an embedding for a piece of text and insert it into the table with such an INSERT statement.
// Save vector data
ClassMethod SaveData(desc As %String) As %String [ Language = python ]
{
    # Required to call ObjectScript methods from Python
    import iris
    import pandas as pd
    from sentence_transformers import SentenceTransformer
    # Prepare the data
    documents = [desc]
    # Convert to a dataframe for data manipulation
    df = pd.DataFrame(documents)
    # Define the column header
    df.columns = ['description']
    # Load the embedding model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    # Generate embeddings for each document
    document_embeddings = model.encode(documents)
    # Assign the vector data to a new column of the dataframe
    df['description_vector'] = document_embeddings.tolist()
    # Iterate through the dataframe and save each row
    for index, row in df.iterrows():
        # Call the SaveVector method of this class
        iris.cls(__name__).SaveVector(row['description'], str(row['description_vector']))
}
// Function to save vector data
ClassMethod SaveVector(desc As %String, descvec As %String) As %Status
{
    // Insert data into the VectorLab table
    &sql(INSERT INTO SQLUser.VectorLab VALUES (:desc, TO_VECTOR(:descvec)))
    if SQLCODE '= 0 {
        write !, "Insert failed, SQLCODE= ", SQLCODE, !, %msg
        return $$$ERROR($$$GeneralError, "Insert failed with SQLCODE "_SQLCODE)
    }
    return $$$OK
}
4. Viewing Vector Data
We can view vector data by using $vector operations in ObjectScript or from the Management Portal. The ObjectScript function below returns the vector data as a comma-separated string:
// View vector data for a given ID
ClassMethod ViewData(id As %Integer) As %String
{
    &sql(SELECT description_vector INTO :desc FROM SQLUser.VectorLab WHERE ID = :id)
    if SQLCODE < 0 {
        write "SQLCODE error ", SQLCODE, " ", %msg
        return ""
    }
    // Count the number of elements in the vector
    set count = $vectorop("count", desc)
    set vectorStr = ""
    // Iterate over all elements, concatenate them, and return them as a string
    for i = 1:1:count
    {
        if (i = 1) {
            set vectorStr = $vector(desc, i)
        }
        else {
            set vectorStr = vectorStr_", "_$vector(desc, i)
        }
    }
    return vectorStr
}
Vector data can also be viewed from the Management Portal by running a SELECT query against the VectorLab table.
5. Performing Vector Search
Vector search enables us to use one vector to search for other similar vectors stored within the database. In InterSystems SQL, we can perform such a search with a single query. InterSystems SQL supports two functions that determine the similarity between two vectors: VECTOR_DOT_PRODUCT and VECTOR_COSINE. The larger the value of these functions, the more similar the vectors are.
The following example demonstrates how to use SQL to issue a query that employs VECTOR_DOT_PRODUCT to find the most semantically similar descriptions to an input sentence. After converting the input search term to an embedding, use either VECTOR_DOT_PRODUCT or VECTOR_COSINE within an ORDER BY clause to return the most similar pieces of text. Additionally, you can use a TOP clause to select only the closest results. (Note that the example uses "?" as a placeholder for the embedding of the search term, since this value is typically provided as a parameter rather than as a literal.)
ClassMethod VectorSearch(arg As %String) As %String [ Language = python ]
{
    import iris
    import pandas as pd
    from sentence_transformers import SentenceTransformer
    # Convert the search phrase into a normalized vector
    model = SentenceTransformer('all-MiniLM-L6-v2')
    search_vector = str(model.encode(arg, normalize_embeddings=True).tolist())
    # For normalized vectors, VECTOR_DOT_PRODUCT and VECTOR_COSINE produce the same ranking
    stmt = iris.sql.prepare("SELECT TOP 5 id, description, VECTOR_COSINE(description_vector, TO_VECTOR(?)) FROM SQLUser.VectorLab ORDER BY VECTOR_DOT_PRODUCT(description_vector, TO_VECTOR(?)) DESC")
    results = stmt.execute(search_vector, search_vector)
    results_df = pd.DataFrame(results, columns=['id', 'description', 'Cosine_Similarity'])
    print(results_df.head())
}
Thank you for reading!