Hey, community! 👋
We are a team of Stanford students applying technology to make sense of climate action. AI excites us because it lets us quickly analyze huge amounts of text.
As companies publish more sustainability documents, such as responsibility reports and financial statements, it becomes challenging to cut through the noise of aspirations and get to the real action: what are companies actually doing?
That’s why we built a tool to match companies with climate actions scraped from company sustainability reports.
In this post, we’ll show you how to implement this tool as a chatbot interfacing with an InterSystems IRIS vector database.
- Step 1: Setup
- Step 2: Create database table
- Step 3: Embed content and populate database
- Step 4: Use vector search with user input
- Step 5: Find climate actions
Step 1: Setup
Make sure to download Docker, then follow the InterSystems community hackathon start guide. The guide walks you through creating a new database container and connecting to the locally hosted Management Portal. Navigate to http://localhost:52773/csp/sys/UtilHome.csp and log in with the credentials username: demo, password: demo.
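Before moving on, it is worth confirming that Python can reach the database over the SQL port. Below is a minimal smoke test, assuming the demo credentials from the guide and the intersystems-irispython driver (imported as iris):
import iris

# Default connection details from the hackathon start guide
CONNECTION_STRING = "localhost:1972/USER"
conn = iris.connect(CONNECTION_STRING, "demo", "demo")
cursor = conn.cursor()

# Trivial query to confirm the connection is alive
cursor.execute("SELECT 1")
print("Connected:", cursor.fetchone())
conn.close()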
Step 2: Create database table
Now, for the fun part!
First, we use Python to create the database table that will store our text chunks, using the following script:
import pandas as pd
import numpy as np
import iris
import time
import os
### Vector Database Setup
username = "demo"
password = "demo"
hostname = os.getenv("IRIS_HOSTNAME", "localhost")
port = "1972"
namespace = "USER"
CONNECTION_STRING = f"{hostname}:{port}/{namespace}"
print(CONNECTION_STRING)
### Connect to IRIS
conn = iris.connect(CONNECTION_STRING, username, password)
cursor = conn.cursor()
### Create new table. Replace BASLABS with your own prefix.
embedTableName = "BASLABS.ClimateReportsEmbed"
tableDefinition = """
(
source_url VARCHAR(1000),
page INT,
total_pages INT,
content VARCHAR(5000),
content_embedding VECTOR(DOUBLE, 384)
)
"""
### Re-create the table if it already exists
CREATE_TABLE = True  # set to False to keep an existing table
if CREATE_TABLE:
    try:
        cursor.execute(f"DROP TABLE {embedTableName}")
    except Exception:
        pass
    cursor.execute(f"CREATE TABLE {embedTableName} {tableDefinition}")
- We use iris.connect to create a SQL connection instance in Python and connect to the database. We can then use the cursor object to perform create/read/update/delete operations within the database.
- We define the table structure as a string in the variable tableDefinition. Notably, we include a content_embedding field of type VECTOR(DOUBLE, 384). Make sure this number matches the dimension of your text embeddings; in this project, we use 384 to match the “all-MiniLM-L6-v2” embedding model from the SentenceTransformers package.
- Finally, we use cursor.execute to create the table.
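To verify that the table was created, you can query the INFORMATION_SCHEMA views (a quick sanity check using the same cursor; the Management Portal's SQL explorer works just as well):
# Sanity check: list the columns of the newly created table
cursor.execute(
    """
    SELECT COLUMN_NAME, DATA_TYPE
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'BASLABS' AND TABLE_NAME = 'ClimateReportsEmbed'
    """
)
for column_name, data_type in cursor.fetchall():
    print(column_name, data_type)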
Step 3: Embed content and populate database
Next, we will populate the database.
To do so, create a file with a list of PDF links you want to scrape. You can also use our list of links that we used during the Stanford TreeHacks hackathon. Your file should look something like this:
http://website1.xyz/pdf-content-upload1.pdf
http://website2.xyz/sustainability-report-2024.pdf
...
Then, we use LangChain’s PyPDFLoader(url) to load and split the PDF content.
import pandas as pd
from langchain.docstore.document import Document
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_iris import IRISVector
from langchain_community.document_loaders import PyPDFLoader
# Database setup as above...
# Embedding model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Setup Text Splitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", "? ", "! "],
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)
# Utility functions
def get_df(docs, pdf_url):
    # Package each chunk's text and metadata into a dataframe
    data = [
        {
            "page": doc.metadata["page"],
            "total_pages": doc.metadata["total_pages"],
            "content": doc.page_content,
        }
        for doc in docs
    ]
    doc_df = pd.DataFrame(data)
    doc_df["source_url"] = pdf_url
    # Embed each chunk; normalized embeddings make dot product equal cosine similarity
    doc_embeddings = model.encode(
        doc_df["content"].tolist(), normalize_embeddings=True
    ).tolist()
    doc_df["content_embedding"] = doc_embeddings
    return doc_df
def insert_df(doc_df):
    # Five columns, five placeholders; the embedding string is converted by TO_VECTOR
    sql = f"""
        INSERT INTO {embedTableName}
        (source_url, page, total_pages, content, content_embedding)
        VALUES (?, ?, ?, ?, TO_VECTOR(?))
    """
    data = [
        (
            row["source_url"],
            row["page"],
            row["total_pages"],
            row["content"],
            str(row["content_embedding"]),
        )
        for index, row in doc_df.iterrows()
    ]
    cursor.executemany(sql, data)
# Main loop
with open("pdf_links.txt") as f:
    links = [line.strip() for line in f if line.strip()]
for url in links:
    loader = PyPDFLoader(url)
    documents = loader.load()
    docs = text_splitter.split_documents(documents)
    print(f"Found {len(docs)} docs for url: {url}")
    df = get_df(docs, url)
    insert_df(df)
- First, we set up an embedding model, which we import from sentence_transformers.
- Next, we define how to split the text of the reports. We use LangChain’s recursive text splitter, which tries to split on the provided separators in order until the chunks are small enough.
- We then create two utility functions: get_df and insert_df. We use pandas dataframes to package text content and its metadata efficiently before storage.
- Finally, we loop over all links in the text file, loading each PDF with LangChain, splitting it into smaller chunks, embedding their contents, and inserting them into the IRIS vector database.
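After the loop finishes, a quick aggregate query confirms what was ingested (a small sketch reusing the cursor and embedTableName from the setup script):
# Count how many chunks were stored for each source document
cursor.execute(f"""
    SELECT source_url, COUNT(*) AS chunks
    FROM {embedTableName}
    GROUP BY source_url
""")
for source_url, chunks in cursor.fetchall():
    print(f"{chunks:4d} chunks from {source_url}")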
Step 4: Use vector search with user input
Finally, we will create a chatbot wrapper to perform a vector search based on a company climate description. First, we package the database functionality into an IRIS_Database class:
import os

import iris
from sentence_transformers import SentenceTransformer


class IRIS_Database:
    def __init__(self):
        ### Vector Database Setup
        username = "demo"
        password = "demo"
        hostname = os.getenv("IRIS_HOSTNAME", "localhost")
        port = "1972"
        namespace = "USER"
        CONNECTION_STRING = f"{hostname}:{port}/{namespace}"
        self.conn = iris.connect(CONNECTION_STRING, username, password)
        self.cursor = self.conn.cursor()
        # Tables
        self.report_table = "BASLABS.ClimateReportsEmbed"
        # Embeddings
        self.embed = SentenceTransformer("all-MiniLM-L6-v2")

    def get_report_section(self, search_query, n=5):
        # Embed the query using the same embedding model as before
        search_vector = self.embed.encode(
            search_query, normalize_embeddings=True
        ).tolist()
        # SQL query to perform the vector search
        sql = f"""
            SELECT TOP ? source_url, page, content
            FROM {self.report_table}
            ORDER BY VECTOR_DOT_PRODUCT(content_embedding, TO_VECTOR(?)) DESC
        """
        self.cursor.execute(sql, [n, str(search_vector)])
        # Format results
        results = self.cursor.fetchall()
        return [
            dict(
                source_url=row[0],
                page=str(row[1]),
                content=row[2],
            )
            for row in results
        ]
- First, we set up the database connection in the __init__ method, as before. Notice that we also load the embedding model so we can embed search queries later on.
- The get_report_section method does the heavy lifting. First, it embeds the query using the same model we used when populating the database. Then it runs a SQL query to perform the vector search. The magic lies in ORDER BY VECTOR_DOT_PRODUCT(content_embedding, TO_VECTOR(?)) DESC: it scores each stored text embedding against the query embedding, so the most similar chunks come back first (see the quick check after this list).
- Finally, we format the results as a list of Python dictionaries and return them to the user.
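One detail worth spelling out: the dot product only corresponds to cosine similarity because we passed normalize_embeddings=True everywhere. On unit-length vectors the two measures coincide, which a quick NumPy check illustrates (the two sentences are made up for the example):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(
    ["We cut emissions by 30% since 2020", "Our carbon footprint shrank"],
    normalize_embeddings=True,
)

dot = float(np.dot(a, b))
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(dot, cosine)  # identical up to float precision, since ||a|| = ||b|| = 1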
Step 5: Find climate actions
As a final step, we can create a front-facing client application that calls the vector search. Below, we include a basic chat interface without a language model, but feel free to extend it to your needs! See our DevPost for more details on setting up an agentic tool-calling workflow using DAIN! A sketch of how to hand the results to a language model follows the code.
from IRIS_Database import IRIS_Database


def get_report_section(search_query: str) -> list[dict]:
    """
    Perform RAG search for report data.
    """
    db = IRIS_Database()
    result = db.get_report_section(search_query)
    return result


# Ask the user for a company description and show matching report sections
while True:
    print("Please tell me a bit more about your company and your goals. (q to quit)")
    user_input = input("> ")
    if user_input.lower().startswith("q"):
        break
    results = get_report_section(user_input)
    print(f"Found {len(results)} matches!")
    for r in results:
        print(r)
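If you want to plug in a language model, the retrieved chunks just need to be packed into a grounded prompt. Here is a minimal sketch; build_prompt is our own helper, and answer_with_llm is a hypothetical placeholder for whatever model API you use:
def build_prompt(user_input: str, results: list[dict]) -> str:
    """Pack retrieved report chunks into a grounded prompt for an LLM."""
    context = "\n\n".join(
        f"[{r['source_url']} p.{r['page']}]\n{r['content']}" for r in results
    )
    return (
        "Using only the report excerpts below, suggest concrete climate "
        f"actions relevant to this company:\n\n{user_input}\n\nExcerpts:\n{context}"
    )

# prompt = build_prompt(user_input, results)
# answer = answer_with_llm(prompt)  # hypothetical: call your LLM of choice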
Wrapping up
And that’s it! We hope this inspires you to use embedding search to match individuals, companies, and even nations with more ambitious climate actions. Thank you to InterSystems for providing these services.
If you have any questions, please leave them in the comments below.
Happy InterSystems Hacking!
Authors: Alice, Suze, and Bubble