Creating and using a Knowledge Bank#
This tutorial teaches you how to create a Knowledge Bank, store the embeddings of your content in it, and finally use it in your RAG.
Prerequisites#
- Dataiku >= 14.1
- Python >= 3.9
- Permission to use a Python code environment with the RAG and Agents package set installed, plus the pypdf package
- An LLM connection to a model able to embed text content
Creating a Knowledge Bank#
Dataiku provides a way to store and manipulate your embedded documents through the DSSKnowledgeBank class. The first step is to create a Knowledge Bank from your DSSProject.
import dataiku
# Creating your Knowledge Bank
KB_NAME = ""
EMBED_LLM_ID = ""
client = dataiku.api_client()
project = client.get_default_project()
dss_kb = project.create_knowledge_bank(KB_NAME, "CHROMA", EMBED_LLM_ID)
The vectorized content will be stored here in ChromaDB, but several other options are available, as described in the create_knowledge_bank() documentation. The EMBED_LLM_ID parameter defines the model that will be used to vectorize the content in the next step. The code sample below will help you find an LLM with the proper purpose.
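This is a minimal lookup sketch; it assumes that list_llms() accepts the TEXT_EMBEDDING_EXTRACTION purpose filter and that the returned list items expose id and description.
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# List the LLM connections able to embed text
# (assumes the TEXT_EMBEDDING_EXTRACTION purpose)
for llm in project.list_llms(purpose="TEXT_EMBEDDING_EXTRACTION"):
    print(f"- {llm.description} (id: {llm.id})")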
Adding content to the Knowledge Bank#
Now that we have a Knowledge Bank, we must add the content for our use case. It is common practice to split the text into smaller chunks before embedding and indexing it. This tutorial provides more information on the complete process.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
FILE_URL = "https://bit.ly/GEP-Jan-2024" # Update as needed
loader = PyPDFLoader(FILE_URL)
documents = []
# alazy_load() must run in an async context (e.g. a notebook);
# in a plain script, use loader.load() instead
async for page in loader.alazy_load():
    documents.append(page)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 100
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE,
                                 separator='\n',
                                 chunk_overlap=CHUNK_OVERLAP,
                                 length_function=len)
chunked_documents = splitter.split_documents(documents)
kb_core = dss_kb.as_core_knowledge_bank()
with kb_core.get_writer() as writer:
    langchain_vs = writer.as_langchain_vectorstore()
    langchain_vs.add_documents(chunked_documents)
The combination of get_writer() and as_langchain_vectorstore() provides access to the vector store. You can then use the add_documents() method to embed your chunks and add them to the Knowledge Bank.
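As a quick sanity check (a sketch, not part of the original flow), you can immediately query the indexed content; the k parameter sets the number of chunks to retrieve.
# Optional sanity check: retrieve the two closest chunks for a test query
vs = kb_core.as_langchain_vectorstore()
for doc in vs.similarity_search("global inflation outlook", k=2):
    print(doc.page_content[:200])
    print("---")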
Caution
This tutorial uses the World Bank’s Global Economic Prospects (GEP) report. If the referenced publication is no longer available, look for the latest report’s PDF version on this page.
Using the Knowledge Bank#
Once you have a Knowledge Bank with the content you want, you can use it in your RAG. This tutorial shows a complete approach, and the code below is a good reminder of how to run a RAG query.
from langchain.chains.question_answering import load_qa_chain
LLM_ID = "" # Fill with your LLM-Mesh id
vector_store = kb_core.as_langchain_vectorstore()
llm = project.get_llm(llm_id=LLM_ID).as_langchain_chat_model()
# Create the question answering chain ("stuff" inserts all retrieved
# chunks directly into the prompt)
chain = load_qa_chain(llm, chain_type="stuff")
query = "What will inflation in Europe look like and why?"
search_results = vector_store.similarity_search(query)
# ⚡ Get the results ⚡
resp = chain.invoke({"input_documents": search_results, "question": query})
print(resp["output_text"])
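If you want to inspect how relevant the retrieved chunks are before handing them to the LLM, the underlying Chroma vector store also supports retrieval with scores; a minimal sketch, assuming similarity_search_with_score() is exposed by the wrapper.
# Inspect retrieval quality: lower distance means a closer match
for doc, score in vector_store.similarity_search_with_score(query, k=4):
    print(f"{score:.3f} | page {doc.metadata.get('page', '?')} | {doc.page_content[:80]}")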
Wrapping up#
Congratulations! You can now create, enrich, and use a Knowledge Bank, which provides a way to improve the answers of your LLMs or your Agents.
Here is the complete code of all the steps.
Knowledge Bank tutorial complete code
import dataiku
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
# Creating your Knowledge Bank
KB_NAME = ""
EMBED_LLM_ID = ""
client = dataiku.api_client()
project = client.get_default_project()
dss_kb = project.create_knowledge_bank(KB_NAME, "CHROMA", EMBED_LLM_ID)
# Adding embedded content to your Knowledge Bank
FILE_URL = "https://bit.ly/GEP-Jan-2024" # Update as needed
loader = PyPDFLoader(FILE_URL)
documents = []
# alazy_load() must run in an async context (e.g. a notebook);
# in a plain script, use loader.load() instead
async for page in loader.alazy_load():
    documents.append(page)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 100
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE,
                                 separator='\n',
                                 chunk_overlap=CHUNK_OVERLAP,
                                 length_function=len)
chunked_documents = splitter.split_documents(documents)
kb_core = dss_kb.as_core_knowledge_bank()
with kb_core.get_writer() as writer:
    langchain_vs = writer.as_langchain_vectorstore()
    langchain_vs.add_documents(chunked_documents)
# Query a LLM with improved context from your Knowledge Bank
LLM_ID = "" # Fill with your LLM-Mesh id
vector_store = kb_core.as_langchain_vectorstore()
llm = project.get_llm(llm_id=LLM_ID).as_langchain_chat_model()
# Create the question answering chain ("stuff" inserts all retrieved
# chunks directly into the prompt)
chain = load_qa_chain(llm, chain_type="stuff")
query = "What will inflation in Europe look like and why?"
search_results = vector_store.similarity_search(query)
# ⚡ Get the results ⚡
resp = chain.invoke({"input_documents": search_results, "question": query})
print(resp["output_text"])
Reference documentation#
Classes#
DSSClient | Entry point for the DSS API client.
DSSKnowledgeBank | A handle to interact with a DSS-managed knowledge bank.
DSSProject | A handle to interact with a project on the DSS instance.
KnowledgeBank | A handle to interact with a Dataiku Knowledge Bank flow object.
Functions#
api_client() | Get the API client for the current DSS instance.
as_langchain_chat_model() | Create a langchain-compatible chat LLM object for this LLM.
as_langchain_vectorstore() (writer) | Gets this writer as a Langchain Vectorstore object.
as_langchain_vectorstore() (knowledge bank) | Get the current version of this knowledge bank as a Langchain Vectorstore object.
create_knowledge_bank() | Create a new knowledge bank in the project, and return a handle to interact with it.
get_default_project() | Get a handle to the current default project, if available.
get_writer() | Gets a writer on the latest vector store files on disk.