LLM Mesh Retrieval And Knowledge Banks#

For the overall proposed structure, see LLM Mesh.

Knowledge banks#

class dataikuapi.dss.knowledgebank.DSSKnowledgeBankListItem(client, data)#

An item in a list of knowledge banks

Important

Do not instantiate this class directly, instead use dataikuapi.dss.project.DSSProject.list_knowledge_banks().

to_knowledge_bank()#

Convert the current item.

Returns:

A handle for the knowledge_bank.

Return type:

dataikuapi.dss.knowledgebank.DSSKnowledgeBank

as_core_knowledge_bank()#

Get the dataiku.KnowledgeBank object corresponding to this knowledge bank

Return type:

dataiku.KnowledgeBank

property project_key#
Returns:

The project key

Return type:

string

property id#
Returns:

The id of the knowledge bank.

Return type:

string

property name#
Returns:

The name of the knowledge bank.

Return type:

string
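As a minimal sketch of how these list items might be used, the helper below looks up a knowledge bank by its name and converts the matching item into a full handle. It relies only on the name property and to_knowledge_bank() documented above, and assumes that DSSProject.list_knowledge_banks() returns list items, as the note above implies. The helper name find_knowledge_bank and the stand-in objects are illustrative and not part of dataikuapi.

```python
from types import SimpleNamespace


def find_knowledge_bank(project, name):
    """Return a handle for the first knowledge bank called `name`, or None."""
    for item in project.list_knowledge_banks():
        if item.name == name:
            # Convert the lightweight list item into a full handle.
            return item.to_knowledge_bank()
    return None


# Stand-in objects (not part of dataikuapi) so the sketch runs without DSS.
# With a real instance, `project` would be a DSSProject handle.
fake_items = [
    SimpleNamespace(id="kb1", name="Contracts", to_knowledge_bank=lambda: "handle:kb1"),
    SimpleNamespace(id="kb2", name="Manuals", to_knowledge_bank=lambda: "handle:kb2"),
]
fake_project = SimpleNamespace(list_knowledge_banks=lambda: fake_items)

found = find_knowledge_bank(fake_project, "Manuals")
missing = find_knowledge_bank(fake_project, "Unknown")
```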

class dataikuapi.dss.knowledgebank.DSSKnowledgeBank(client, project_key, id)#

A handle to interact with a DSS-managed knowledge bank.

Important

Do not create this class directly, use dataikuapi.dss.project.DSSProject.get_knowledge_bank() instead.

property id#
as_core_knowledge_bank()#

Get the dataiku.KnowledgeBank object corresponding to this knowledge bank

Return type:

dataiku.KnowledgeBank

as_langchain_retriever(**data)#

Get the current version of this knowledge bank as a Langchain Retriever object.

Parameters:

data (dict) – keyword arguments to pass to the dataikuapi.dss.knowledgebank.DSSKnowledgeBank.search() function

Returns:

a langchain-compatible retriever

Return type:

dataikuapi.dss.langchain.knowledge_bank.DKUKnowledgeBankRetriever
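A sketch of how the returned retriever might be queried. It assumes the retriever follows the standard LangChain retriever interface, that is, an invoke() method returning documents with a page_content attribute, which this page does not spell out. The helper name retrieve_texts and the stand-in objects are illustrative only.

```python
from types import SimpleNamespace


def retrieve_texts(kb, query, max_documents=5):
    """Return the text of the documents retrieved for `query`."""
    # Keyword arguments are forwarded to DSSKnowledgeBank.search().
    retriever = kb.as_langchain_retriever(max_documents=max_documents)
    return [doc.page_content for doc in retriever.invoke(query)]


# Stand-ins (not part of dataikuapi) so the sketch runs without DSS.
def _fake_retriever(**search_kwargs):
    corpus = ["alpha", "beta", "gamma"]
    limit = search_kwargs["max_documents"]
    return SimpleNamespace(
        invoke=lambda query: [SimpleNamespace(page_content=t) for t in corpus[:limit]]
    )


fake_kb = SimpleNamespace(as_langchain_retriever=_fake_retriever)
texts = retrieve_texts(fake_kb, "any question", max_documents=2)
```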

get_settings()#

Get the knowledge bank’s definition

Returns:

a handle on the knowledge bank definition

Return type:

dataikuapi.dss.knowledgebank.DSSKnowledgeBankSettings

delete()#

Delete the knowledge bank

build(job_type='NON_RECURSIVE_FORCED_BUILD', wait=True)#

Start a new job to build this knowledge bank and, by default, wait for it to complete. Raises an exception if the job failed.

job = knowledge_bank.build()
print("Job %s done" % job.id)
Parameters:
  • job_type – the job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD

  • wait (bool) – whether to wait for the job completion before returning the job handle, defaults to True

Returns:

the dataikuapi.dss.job.DSSJob job handle corresponding to the built job

Return type:

dataikuapi.dss.job.DSSJob

search(query, max_documents=10, search_type='SIMILARITY', similarity_threshold=0.5, mmr_documents_count=20, mmr_factor=0.25, hybrid_use_advanced_reranking=False, hybrid_rrf_rank_constant=60, hybrid_rrf_rank_window_size=4, filter=None)#

Search for documents in a knowledge bank

MMR and HYBRID search types are not supported by every vector store.

Parameters:
  • query (str) – what to search for

  • max_documents (int) – the maximum number of documents to return, defaults to 10

  • search_type (str) – the search algorithm to use. One of SIMILARITY, SIMILARITY_THRESHOLD, MMR or HYBRID. Defaults to SIMILARITY

  • similarity_threshold (float) – only return documents with a similarity score above this threshold, typically between 0 and 1, only applied with search_type=SIMILARITY_THRESHOLD, defaults to 0.5

  • mmr_documents_count (int) – number of documents to consider before selecting the documents to retrieve, only applied with search_type=MMR, defaults to 20

  • mmr_factor (float) – weight of diversity vs relevancy, between 0 and 1, where 0 favors maximum diversity and 1 favors maximum relevancy, only applied with search_type=MMR, defaults to 0.25

  • hybrid_use_advanced_reranking (bool) – whether to use proprietary rerankers, valid for Azure AI and ElasticSearch vector stores, defaults to False

  • hybrid_rrf_rank_constant (int) – higher values give more weight to lower-ranked documents, valid for ElasticSearch vector stores, defaults to 60

  • hybrid_rrf_rank_window_size (int) – number of documents to consider from each search type, valid for ElasticSearch vector stores, defaults to 4

  • filter (Union[DSSSimpleFilter, dict], optional) – optional metadata filter as a DSSSimpleFilter or a simple filter dictionary

Returns:

a result object with a list of documents that matched the query

Return type:

dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResult
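For intuition about how max_documents, mmr_documents_count and mmr_factor interact, here is a sketch of the textbook Maximal Marginal Relevance selection loop in plain Python. It is not the DSS or vector store implementation, which may differ in details. It only mirrors the parameter semantics described above, where a factor of 1 favors relevancy and a factor of 0 favors diversity. The function names and toy vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, max_documents=10, mmr_documents_count=20, mmr_factor=0.25):
    """Generic MMR: return indices of the selected documents, in selection order."""
    # Step 1: shortlist the candidates most similar to the query.
    relevance = [cosine(query_vec, v) for v in doc_vecs]
    candidates = sorted(range(len(doc_vecs)), key=lambda i: relevance[i], reverse=True)
    candidates = candidates[:mmr_documents_count]

    # Step 2: greedily pick the candidate with the best relevance/redundancy balance.
    selected = []
    while candidates and len(selected) < max_documents:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return mmr_factor * relevance[i] - (1.0 - mmr_factor) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.0],    # 0: identical to the query
    [0.99, 0.14],  # 1: near-duplicate of document 0
    [0.6, 0.8],    # 2: less relevant, but different from document 0
]

relevance_first = mmr_select(query, docs, max_documents=2, mmr_factor=1.0)
diversity_leaning = mmr_select(query, docs, max_documents=2, mmr_factor=0.25)
```

With mmr_factor=1.0 the near-duplicate is kept because only relevance counts. With the default of 0.25 the second pick moves to the less similar document.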

class dataikuapi.dss.knowledgebank.DSSKnowledgeBankSettings(client, project_key, settings)#

Settings for a knowledge bank

Important

Do not instantiate directly, use dataikuapi.dss.knowledgebank.DSSKnowledgeBank.get_settings() instead

property project_key#

Returns the project key of the knowledge bank

Return type:

str

property id#

Returns the identifier of the knowledge bank

Return type:

str

property vector_store_type#

Returns the type of storage backing the vector store (could be CHROMA, PINECONE, ELASTICSEARCH, AZURE_AI_SEARCH, SNOWFLAKE_CORTEX_SEARCH, VERTEX_AI_GCS_BASED, FAISS, QDRANT_LOCAL)

Return type:

str

set_metadata_schema(schema)#

Sets the schema for metadata fields.

Parameters:

schema (Dict[str, str]) – the schema, as a mapping metadata_field -> type

set_images_folder(managed_folder_id, project_key=None)#

Sets the images folder to use with this knowledge bank.

Parameters:
  • managed_folder_id (str) – The (managed) images folder id.

  • project_key (Optional[str]) – The image folder project key, if different from this knowledge bank project key. Defaults to None.

get_images_folder()#

Returns the images folder of the knowledge bank, if any.

Returns:

the managed folder or None

Return type:

DSSManagedFolder | None

get_raw()#

Returns the raw settings of the knowledge bank

Returns:

the raw settings of the knowledge bank

Return type:

dict

save()#

Saves the settings on the knowledge bank
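The settings methods above suggest a read, modify, save cycle. The sketch below configures an images folder only when none is set, then saves. It uses get_settings(), get_images_folder(), set_images_folder() and save() as documented on this page. The helper name ensure_images_folder, the folder ids and the stand-in classes are hypothetical.

```python
from types import SimpleNamespace


def ensure_images_folder(kb, managed_folder_id):
    """Set the images folder if none is configured; return True if settings changed."""
    settings = kb.get_settings()
    if settings.get_images_folder() is not None:
        return False
    settings.set_images_folder(managed_folder_id)
    settings.save()  # write the modified settings back to the knowledge bank
    return True


class FakeSettings:
    """Stand-in for DSSKnowledgeBankSettings, for offline runs only."""

    def __init__(self):
        self.folder = None
        self.saved = False

    def get_images_folder(self):
        return self.folder

    def set_images_folder(self, managed_folder_id, project_key=None):
        self.folder = managed_folder_id

    def save(self):
        self.saved = True


fake_settings = FakeSettings()
fake_kb = SimpleNamespace(get_settings=lambda: fake_settings)

first_call = ensure_images_folder(fake_kb, "hypothetical_folder_id")
second_call = ensure_images_folder(fake_kb, "other_folder_id")
```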

class dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResult(kb, documents)#

The result of a search in a knowledge bank, contains documents that matched the query

Each document is a dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResultDocument

property documents#

Returns a list of documents that matched a search query

Returns:

a list of result documents

Return type:

list[DSSKnowledgeBankSearchResultDocument]

property managed_folder_id#
class dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResultDocument(result, text, score, metadata)#

A document found by searching a knowledge bank with dataikuapi.dss.knowledgebank.DSSKnowledgeBank.search()

property text#

Returns the text from the knowledge bank for this document

Returns:

the text for this document

Return type:

str

property score#

Returns the match score for this document

Returns:

the score for this document

Return type:

float

property metadata#

Returns metadata from the knowledge bank for this document

Returns:

metadata for this document

Return type:

dict

property images#

Returns images for this document

Returns:

a list of image references or None

Return type:

list[ManagedFolderImageRef] | None

property file_ref#

Returns the file reference for this document

Returns:

a file reference or None

Return type:

ManagedFolderDocumentRef | None
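A sketch of how a search result might be post-processed, using only the documents, text, score, metadata, images and file_ref properties documented above. It assumes that a higher score means a better match. The helper name summarize_hits, the metadata keys and the stand-in objects are illustrative. With a real handle the result would come from DSSKnowledgeBank.search().

```python
from types import SimpleNamespace


def summarize_hits(search_result, snippet_length=80):
    """Turn a search result into plain dictionaries, highest score first."""
    hits = []
    for doc in search_result.documents:
        hits.append({
            "score": doc.score,
            "snippet": doc.text[:snippet_length],
            "metadata_keys": sorted(doc.metadata),
            "has_images": bool(doc.images),          # images may be None
            "has_file_ref": doc.file_ref is not None,
        })
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


# Stand-in result (not part of dataikuapi) so the sketch runs without DSS.
# The metadata keys "page" and "title" are invented for this example.
fake_result = SimpleNamespace(documents=[
    SimpleNamespace(text="Short passage.", score=0.4, metadata={"page": 3},
                    images=None, file_ref=None),
    SimpleNamespace(text="A much longer passage about contract renewals.", score=0.9,
                    metadata={"page": 1, "title": "Renewals"}, images=None, file_ref=None),
])

hits = summarize_hits(fake_result, snippet_length=20)
```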

class dataiku.KnowledgeBank(id, project_key=None, context_project_key=None)#

This is a handle to interact with a Dataiku Knowledge Bank flow object

set_context_project_key(context_project_key)#

Set the context project key to use to report calls to the embedding LLM associated with this knowledge bank.

Parameters:

context_project_key (str) – the context project key

get_current_version(trusted_object: TrustedObject | None = None)#

Gets the current version for this knowledge bank.

Parameters:

trusted_object (Optional["TrustedObject"]) – the optional trusted object using the kb

Return type:

str

get_writer()#

Gets a writer on the latest vector store files on disk. For local vector stores, downloads metadata files as well as data files. For remote vector stores, only downloads metadata files.

The vector store files are automatically uploaded when the context manager is closed.

Note

Each call creates an isolated writer which works on its own folder.

Returns:

dataiku.core.vector_stores.data.writer.VectorStoreWriter
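A sketch of the writer used as a context manager, as the description above suggests. It assumes that the with statement yields the writer itself and that the LangChain vector store exposes the usual add_documents() method. Neither detail is stated on this page. The helper name rebuild_vector_store and the FakeWriter stand-in are hypothetical.

```python
from types import SimpleNamespace


def rebuild_vector_store(kb, documents):
    """Replace the knowledge bank content with `documents` (LangChain Documents)."""
    with kb.get_writer() as writer:
        writer.clear()  # drop the existing vector store data
        formatter = writer.get_metadata_formatter()
        formatted = [formatter.format_metadata(doc) for doc in documents]
        vector_store = writer.as_langchain_vectorstore()
        vector_store.add_documents(formatted)
    # Leaving the `with` block triggers the upload described above.
    return len(formatted)


class FakeWriter:
    """Stand-in for VectorStoreWriter, for offline runs only."""

    def __init__(self):
        self.events = []
        self.stored = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("closed")
        return False

    def clear(self):
        self.events.append("clear")
        self.stored = []

    def get_metadata_formatter(self):
        return SimpleNamespace(format_metadata=lambda doc: doc)

    def as_langchain_vectorstore(self):
        return SimpleNamespace(add_documents=lambda docs: self.stored.extend(docs))


fake_writer = FakeWriter()
fake_kb = SimpleNamespace(get_writer=lambda: fake_writer)
written = rebuild_vector_store(fake_kb, ["doc-1", "doc-2"])
```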

as_langchain_retriever(search_type='similarity', search_kwargs=None, vectorstore_kwargs=None, **retriever_kwargs)#

Get the current version of this knowledge bank as a Langchain Retriever object.

Return type:

langchain_core.vectorstores.VectorStoreRetriever

as_langchain_vectorstore(**vectorstore_kwargs)#

Get the current version of this knowledge bank as a Langchain Vectorstore object.

Return type:

langchain_core.vectorstores.VectorStore

get_multipart_context(docs)#

Convert retrieved documents from the vector store to a multipart context. The multipart context contains the parts that can be added to a completion query

Parameters:

docs (List[Document]) – A list of retrieved documents from the langchain retriever

Raises:

Exception – If the knowledge bank does not contain multimodal content

Returns:

A multipart context object composed of a list of parts containing text or images

Return type:

MultipartContext

class dataiku.core.knowledge_bank.MultipartContext#

A reference to a list of text or image parts that can be added to a completion query

append(part)#
Parameters:

part (MultipartContent) – Part of a completion query

add_to_completion_query(completion, role='user')#

Add the accumulated parts as a new multipart-message to the completion query

Parameters:
  • completion (DSSLLMCompletionsQuerySingleQuery) – the completion query to be edited

  • role (str) – The message role. Use system to set the LLM behavior, assistant to store predefined responses, user to provide requests or comments for the LLM to answer. Defaults to user.

is_text_only()#
Returns:

True if all the accumulated parts are text parts, False otherwise

Return type:

bool

to_text()#
Returns:

the concatenation of accumulated text parts (other parts are skipped)

Return type:

str
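A sketch tying the MultipartContext methods together: build the context from retrieved documents, attach it to a completion query, and report what it contains. It uses get_multipart_context(), add_to_completion_query(), is_text_only() and to_text() as documented above. The helper name attach_context and the stand-in classes are hypothetical.

```python
from types import SimpleNamespace


def attach_context(kb, docs, completion):
    """Add retrieved documents to `completion`; return a short description."""
    context = kb.get_multipart_context(docs)
    # One multipart message is appended, holding the accumulated parts.
    context.add_to_completion_query(completion, role="user")
    kind = "text only" if context.is_text_only() else "text and images"
    return f"{kind}, {len(context.to_text())} characters of text"


class FakeContext:
    """Stand-in for MultipartContext, for offline runs only."""

    def __init__(self, texts):
        self.texts = texts

    def add_to_completion_query(self, completion, role="user"):
        completion.messages.append((role, list(self.texts)))

    def is_text_only(self):
        return True

    def to_text(self):
        return "\n".join(self.texts)


fake_kb = SimpleNamespace(get_multipart_context=lambda docs: FakeContext(docs))
fake_completion = SimpleNamespace(messages=[])
description = attach_context(fake_kb, ["first part", "second part"], fake_completion)
```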

Retrieval-augmented LLMs#

class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMListItem(client, data)#

An item in a list of retrieval-augmented LLMs

Important

Do not instantiate this class directly, instead use dataikuapi.dss.project.DSSProject.list_retrieval_augmented_llms().

property project_key#
Returns:

The project key

Return type:

string

property id#
Returns:

The id of the retrieval-augmented LLM.

Return type:

string

property name#
Returns:

The name of the retrieval-augmented LLM.

Return type:

string

as_llm()#

Returns this retrieval-augmented LLM as a usable dataikuapi.dss.llm.DSSLLM for querying

class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLM(client, project_key, id)#

A handle to interact with a DSS-managed retrieval-augmented LLM.

Important

Do not create this class directly, use dataikuapi.dss.project.DSSProject.get_retrieval_augmented_llm() instead.

property id#
as_llm()#

Returns this retrieval-augmented LLM as a usable dataikuapi.dss.llm.DSSLLM for querying
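A sketch of querying the handle returned by as_llm(). The new_completion(), with_message() and execute() calls and the success and text attributes of the response belong to the dataikuapi.dss.llm module and are not documented on this page, so treat them as assumptions to check against that module's reference. The helper name ask and the stand-in classes are hypothetical.

```python
from types import SimpleNamespace


def ask(rag, question):
    """Send `question` to a retrieval-augmented LLM and return the answer text."""
    llm = rag.as_llm()
    completion = llm.new_completion()
    completion.with_message(question)
    response = completion.execute()
    if not response.success:
        raise RuntimeError("completion query failed")
    return response.text


class FakeCompletion:
    """Stand-in for a completion query, for offline runs only."""

    def __init__(self):
        self.messages = []

    def with_message(self, message, role="user"):
        self.messages.append((role, message))
        return self

    def execute(self):
        last_message = self.messages[-1][1]
        return SimpleNamespace(success=True, text="echo: " + last_message)


fake_llm = SimpleNamespace(new_completion=FakeCompletion)
fake_rag = SimpleNamespace(as_llm=lambda: fake_llm)
answer = ask(fake_rag, "What is the notice period?")
```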

get_settings()#

Get the retrieval-augmented LLM’s definition

Returns:

a handle on the retrieval-augmented LLM definition

Return type:

dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMSettings

delete()#

Delete the retrieval-augmented LLM

class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMSettings(client, settings)#

Settings for a retrieval-augmented LLM

Important

Do not instantiate directly, use dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLM.get_settings() instead

get_version_ids()#
property active_version#

Returns the active version of this retrieval-augmented LLM. May return None if no version is declared as active

get_version_settings(version_id)#
get_raw()#

Returns the raw settings of the retrieval-augmented LLM

Returns:

the raw settings of the retrieval-augmented LLM

Return type:

dict

save()#

Saves the settings on the retrieval-augmented LLM

class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMVersionSettings(version_settings)#
get_raw()#
property llm_id#

Get or set the id of the underlying LLM used by this version

Return type:

str

property interaction_logging_selection#

Get the interaction logging selection for this version.

Before configuring interaction logging on a retrieval-augmented LLM version, create the target dataset on the project:

project = client.get_project("MYPROJECT")
project.create_llm_interaction_logging_dataset(
    "llm_logs",
    connection_id="filesystem_managed",
    time_partitioning="DAY",
)

Example using inherited settings:

rag = project.get_retrieval_augmented_llm("my_rag")
rag_settings = rag.get_settings()
version_settings = rag_settings.get_version_settings("v1")

logging_selection = version_settings.interaction_logging_selection
logging_selection.inherit()

rag_settings.save()

Example using explicit settings:

rag = project.get_retrieval_augmented_llm("my_rag")
rag_settings = rag.get_settings()
version_settings = rag_settings.get_version_settings("v1")

logging_selection = version_settings.interaction_logging_selection
logging_selection.enable(
    "llm_logs",
    settings={
        "flushEveryS": 60,
        "flushEveryBytes": 1_000_000,
        "contentMode": "FULL",
    },
)

rag_settings.save()

Example disabling interaction logging:

rag = project.get_retrieval_augmented_llm("my_rag")
rag_settings = rag.get_settings()
version_settings = rag_settings.get_version_settings("v1")

logging_selection = version_settings.interaction_logging_selection
logging_selection.disable()

rag_settings.save()
Return type:

dataikuapi.dss.agent.DSSLLMInteractionLoggingSelection

Vector store helpers#

class dataiku.core.vector_stores.data.writer.VectorStoreWriter(project_key: str, kb_full_id: str, isolated_folder: VectorStoreIsolatedFolder)#

A helper class to write vector store data to the underlying knowledge bank folder.

Important

Do not create this class directly, use dataiku.KnowledgeBank.get_writer()

property folder_path: str#

The path to the underlying folder on the filesystem.

clear()#

Clears the vector store data stored in the underlying folder.

save()#

Saves the content of the underlying folder as a new knowledge bank version.

Returns:

the created version

Return type:

str

as_langchain_vectorstore(**vectorstore_kwargs) VectorStore#

Gets this writer as a Langchain Vectorstore object

Return type:

langchain_core.vectorstores.VectorStore

get_metadata_formatter() DocumentMetadataFormatter#

Gets the metadata formatter to help writing documents to this vector store.

Return type:

DocumentMetadataFormatter

class dataiku.core.vector_stores.data.metadata.DocumentMetadataFormatter(project_key: str, vector_store_implementation)#

Helper class to format vector store documents metadata for usage within Dataiku.

Important

Do not create this class directly, use VectorStoreWriter.get_metadata_formatter() instead.

with_security_tokens(security_tokens: List[str])#

Adds the security tokens in the metadata.

Parameters:

security_tokens – The security tokens.

with_original_document(folder_id: str, path: str, project_key: str | None = None)#

Adds the original document information in the metadata.

Parameters:
  • folder_id – The id of the managed folder that contains the original document.

  • path – The original document path in the managed folder.

  • project_key – The managed folder project key. Defaults to the project key of the knowledge bank.

with_original_document_ref(document_ref: ManagedFolderDocumentRef, project_key: str | None = None)#

Adds the original document information in the metadata.

Parameters:
  • document_ref – The reference to the original document.

  • project_key – The managed folder project key. Defaults to the project key of the knowledge bank.

with_original_document_page_range(page_start: int, page_end: int)#

Adds the page range in the original document. This metadata is intended to start at index 1.

Parameters:
  • page_start – The original document page where the extract started. Must be positive, and less than or equal to page_end.

  • page_end – The original document page where the extract ended. Must be positive, and greater than or equal to page_start.
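Because page numbers here start at 1, page indices coming from a parser that counts from 0 need shifting before they are passed in. The helper below is a small sketch of that conversion with the constraints above checked up front. The name one_based_page_range is hypothetical and not part of the dataiku package.

```python
def one_based_page_range(first_index, last_index):
    """Convert 0-based, inclusive page indices to a 1-based (page_start, page_end)."""
    if first_index < 0 or last_index < first_index:
        raise ValueError("expected 0 <= first_index <= last_index")
    return first_index + 1, last_index + 1


page_start, page_end = one_based_page_range(0, 2)

# With a real formatter obtained from VectorStoreWriter.get_metadata_formatter():
# formatter.with_original_document_page_range(page_start, page_end)
```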

with_original_document_section_outline(section_outline: List[str])#

Adds a section outline in the metadata. Section outlines can be derived from the document extracted content. For example, it may contain the titles of the sections that contain this part of the original document, from top level headers to lower level headers.

Parameters:

section_outline – The section outline.

static make_captioned_images(caption: str, image_paths: List[str]) Dict[str, str | List[str]]#

Construct a captioned images dictionary from the caption and image paths.

Parameters:
  • caption – The caption.

  • image_paths – The paths to the images, relative to the managed folder that is configured in the knowledge bank.

with_retrieval_content(text: str | None = None, image_paths: List[str] | None = None, captioned_images: Dict[str, str | List[str]] | None = None)#
Adds the retrieval content in the metadata. Accepts either
  • text content

  • image paths relative to the knowledge bank images folder.

  • captioned images

Parameters:
  • text – The text content.

  • image_paths – The paths to the images, relative to the managed folder that is configured in the knowledge bank.

  • captioned_images – Captioned images constructed using make_captioned_images().

format_metadata(document: Document) Document#

Formats the metadata in the provided document, so that it can be used for retrieval in Dataiku.

Parameters:

document – The Langchain document whose metadata must be formatted.

Returns:

The document with updated metadata.