Creating and using a Knowledge Bank#

This tutorial teaches you how to create a Knowledge Bank, store the embeddings of your content in it, and finally use it in a Retrieval-Augmented Generation (RAG) pipeline.

Prerequisites#

  • Dataiku >= 14.1

  • Permission to use a Python code environment with the RAG and Agents package set installed, plus the pypdf package

  • Python >= 3.9

  • An LLM connection to a model capable of embedding text content

Creating a Knowledge Bank#

Dataiku provides a way to store and manipulate your embedded documents through the DSSKnowledgeBank class. The first step is to create a Knowledge Bank from your DSSProject.

Code 1: Creating your Knowledge Bank#
import dataiku


# Creating your Knowledge Bank
KB_NAME = ""
EMBED_LLM_ID = ""

client = dataiku.api_client()
project = client.get_default_project()
dss_kb = project.create_knowledge_bank(KB_NAME, "CHROMA", EMBED_LLM_ID)

Here, the vectorized content is stored in ChromaDB, but several other backends are available, as described in the create_knowledge_bank() documentation. The EMBED_LLM_ID parameter defines the model that will vectorize the content in the next step. This code sample will help you find an LLM with the proper purpose.
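
For instance, here is a minimal sketch for locating a suitable model, assuming list_llms() is filtered with the TEXT_EMBEDDING_EXTRACTION purpose (the purpose used for embedding models in the LLM Mesh):

# List the LLM Mesh models able to embed text, assuming the purpose filter
for llm in project.list_llms(purpose="TEXT_EMBEDDING_EXTRACTION"):
    print(f"- {llm.description} (id: {llm.id})")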

Adding content to the Knowledge Bank#

Now that we have a Knowledge Bank, we can add the content for our use case. It is common practice to split the text into smaller chunks before embedding and indexing it. This tutorial provides more information on the complete process.

Code 2: Adding content to your Knowledge Bank#
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter


FILE_URL = "https://bit.ly/GEP-Jan-2024" # Update as needed

loader = PyPDFLoader(FILE_URL)
documents = []
async for page in loader.alazy_load():
    documents.append(page)

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 100
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE,
                                 separator='\n',
                                 chunk_overlap=CHUNK_OVERLAP,
                                 length_function=len)
chunked_documents = splitter.split_documents(documents)

kb_core = dss_kb.as_core_knowledge_bank()

with kb_core.get_writer() as writer:
    langchain_vs = writer.as_langchain_vectorstore()
    langchain_vs.add_documents(chunked_documents)

The combination of get_writer() and as_langchain_vectorstore() provides write access to the vector store. You can then use the add_documents() method to embed your chunks and index them.
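
To check that the indexing worked, you can immediately query the vector store; a minimal sketch, where the query string is only illustrative:

# Quick sanity check: retrieve the chunks closest to a test query
vs = kb_core.as_langchain_vectorstore()
for doc in vs.similarity_search("global growth outlook", k=3):
    print(doc.page_content[:120])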

Caution

This tutorial uses the World Bank’s Global Economic Prospects (GEP) report. If the referenced publication is no longer available, look for the latest report’s PDF version on this page.

Using the Knowledge Bank#

Once your Knowledge Bank contains the content you want, you can use it in your RAG pipeline. This tutorial shows a complete approach, and Code 3 below recaps how to run a RAG query.

Code 3: Using the Knowledge Bank#
from langchain.chains.question_answering import load_qa_chain


LLM_ID = ""  # Fill with your LLM-Mesh id

vector_store = kb_core.as_langchain_vectorstore()
llm = project.get_llm(llm_id=LLM_ID).as_langchain_chat_model()

# Create the question answering chain
chain = load_qa_chain(llm, chain_type="stuff")
query = "What will inflation in Europe look like and why?"
search_results = vector_store.similarity_search(query)

# ⚡ Get the results ⚡
resp = chain.invoke({"input_documents": search_results, "question": query})
print(resp["output_text"])
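
Optionally, you can check which chunks grounded the answer; a minimal sketch relying on the metadata that PyPDFLoader attaches to each page:

# Inspect the provenance of the retrieved chunks
for doc in search_results:
    page = doc.metadata.get("page")      # page number within the PDF
    source = doc.metadata.get("source")  # original file path or URL
    print(f"- page {page} from {source}")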

Wrapping up#

Congratulations! You can now create, enrich, and use a Knowledge Bank, which is a way to improve the answers of your LLMs and Agents with your own content.

Here is the complete code for all the steps.

Knowledge Bank tutorial complete code
import dataiku
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain


# Creating your Knowledge Bank
KB_NAME = ""
EMBED_LLM_ID = ""

client = dataiku.api_client()
project = client.get_default_project()
dss_kb = project.create_knowledge_bank(KB_NAME, "CHROMA", EMBED_LLM_ID)

# Adding embedded content to your Knowledge Bank
FILE_URL = "https://bit.ly/GEP-Jan-2024"  # Update as needed

loader = PyPDFLoader(FILE_URL)
documents = []
for page in loader.lazy_load():
    documents.append(page)

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 100
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE,
                                 separator='\n',
                                 chunk_overlap=CHUNK_OVERLAP,
                                 length_function=len)
chunked_documents = splitter.split_documents(documents)

kb_core = dss_kb.as_core_knowledge_bank()

with kb_core.get_writer() as writer:
    langchain_vs = writer.as_langchain_vectorstore()
    langchain_vs.add_documents(chunked_documents)

# Query an LLM with improved context from your Knowledge Bank
LLM_ID = ""  # Fill with your LLM Mesh ID

vector_store = kb_core.as_langchain_vectorstore()
llm = project.get_llm(llm_id=LLM_ID).as_langchain_chat_model()

# Create the question answering chain
chain = load_qa_chain(llm, chain_type="stuff")
query = "What will inflation in Europe look like and why?"
search_results = vector_store.similarity_search(query)

# ⚡ Get the results ⚡
resp = chain.invoke({"input_documents": search_results, "question": query})
print(resp["output_text"])

Reference documentation#

Classes#

  • dataikuapi.DSSClient(host[, api_key, ...]): Entry point for the DSS API client.

  • dataikuapi.dss.knowledgebank.DSSKnowledgeBank(...): A handle to interact with a DSS-managed knowledge bank.

  • dataikuapi.dss.project.DSSProject(client, ...): A handle to interact with a project on the DSS instance.

  • dataiku.KnowledgeBank(id[, project_key]): A handle to interact with a Dataiku Knowledge Bank flow object.

Functions#

  • as_core_knowledge_bank(): Get the dataiku.KnowledgeBank object corresponding to this knowledge bank.

  • as_langchain_chat_model(**data): Create a langchain-compatible chat LLM object for this LLM.

  • as_langchain_vectorstore(**vectorstore_kwargs): Get this writer as a Langchain Vectorstore object.

  • as_langchain_vectorstore(**vectorstore_kwargs): Get the current version of this knowledge bank as a Langchain Vectorstore object.

  • create_knowledge_bank(name, ...[, settings]): Create a new knowledge bank in the project and return a handle to interact with it.

  • get_default_project(): Get a handle to the current default project, if available (i.e. when running inside DSS).

  • get_writer(): Get a writer on the latest vector store files on disk.