LLM Mesh Retrieval And Knowledge Banks#
For the overall proposed structure, see LLM Mesh.
Knowledge banks#
- class dataikuapi.dss.knowledgebank.DSSKnowledgeBankListItem(client, data)#
An item in a list of knowledge banks
Important
Do not instantiate this class directly, instead use
dataikuapi.dss.project.DSSProject.list_knowledge_banks().
- to_knowledge_bank()#
Convert the current item.
- Returns:
A handle for the knowledge_bank.
- Return type:
dataikuapi.dss.knowledgebank.DSSKnowledgeBank
- as_core_knowledge_bank()#
Get the dataiku.KnowledgeBank object corresponding to this knowledge bank.
- Return type:
dataiku.KnowledgeBank
- property project_key#
- Returns:
The project key
- Return type:
string
- property id#
- Returns:
The id of the knowledge bank.
- Return type:
string
- property name#
- Returns:
The name of the knowledge bank.
- Return type:
string
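As an illustration of how list items might be used, the sketch below builds a small index of the knowledge banks in a project. It assumes `project` is a `DSSProject` handle obtained elsewhere and that `list_knowledge_banks()` called with default arguments returns list items; the helper name `index_knowledge_banks` is hypothetical.

```python
def index_knowledge_banks(project):
    """Map each knowledge bank id to its name, project key and full handle.

    `project` is assumed to be a dataikuapi DSSProject handle.
    """
    index = {}
    for item in project.list_knowledge_banks():
        index[item.id] = {
            "name": item.name,
            "project_key": item.project_key,
            # Convert the lightweight list item into a full handle.
            "handle": item.to_knowledge_bank(),
        }
    return index
```

The returned handles can then be used for calls such as `search()` or `build()`.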
- class dataikuapi.dss.knowledgebank.DSSKnowledgeBank(client, project_key, id)#
A handle to interact with a DSS-managed knowledge bank.
Important
Do not create this class directly, use
dataikuapi.dss.project.DSSProject.get_knowledge_bank() instead.
- property id#
- as_core_knowledge_bank()#
Get the dataiku.KnowledgeBank object corresponding to this knowledge bank.
- Return type:
dataiku.KnowledgeBank
- as_langchain_retriever(**data)#
Get the current version of this knowledge bank as a Langchain Retriever object.
- Parameters:
data (dict) – keyword arguments to pass to the dataikuapi.dss.knowledgebank.DSSKnowledgeBank.search() function
- Returns:
a langchain-compatible retriever
- Return type:
dataikuapi.dss.langchain.knowledge_bank.DKUKnowledgeBankRetriever
- get_settings()#
Get the knowledge bank’s definition
- Returns:
a handle on the knowledge bank definition
- Return type:
dataikuapi.dss.knowledgebank.DSSKnowledgeBankSettings
- delete()#
Delete the knowledge bank
- build(job_type='NON_RECURSIVE_FORCED_BUILD', wait=True)#
Start a new job to build this knowledge bank and, by default, wait for it to complete. Raises if the job failed.
job = knowledge_bank.build()
print("Job %s done" % job.id)
- Parameters:
job_type – the job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD
wait (bool) – whether to wait for the job completion before returning the job handle, defaults to True
- Returns:
the dataikuapi.dss.job.DSSJob job handle corresponding to the built job
- Return type:
dataikuapi.dss.job.DSSJob
- search(query, max_documents=10, search_type='SIMILARITY', similarity_threshold=0.5, mmr_documents_count=20, mmr_factor=0.25, hybrid_use_advanced_reranking=False, hybrid_rrf_rank_constant=60, hybrid_rrf_rank_window_size=4, filter=None)#
Search for documents in a knowledge bank
The MMR and HYBRID search types are not supported by every vector store.
- Parameters:
query (str) – what to search for
max_documents (int) – the maximum number of documents to return, defaults to 10
search_type (str) – the search algorithm to use. One of SIMILARITY, SIMILARITY_THRESHOLD, MMR or HYBRID. Defaults to SIMILARITY
similarity_threshold (float) – only return documents with a similarity score above this threshold, typically between 0 and 1, only applied with search_type=SIMILARITY_THRESHOLD, defaults to 0.5
mmr_documents_count (int) – number of documents to consider before selecting the documents to retrieve, only applied with search_type=MMR, defaults to 20
mmr_factor (float) – weight of diversity vs relevancy, between 0 and 1, where 0 favors maximum diversity and 1 favors maximum relevancy, only applied with search_type=MMR, defaults to 0.25
hybrid_use_advanced_reranking (bool) – whether to use proprietary rerankers, valid for Azure AI and ElasticSearch vector stores, defaults to False
hybrid_rrf_rank_constant (int) – higher values give more weight to lower-ranked documents, valid for ElasticSearch vector stores, defaults to 60
hybrid_rrf_rank_window_size (int) – number of documents to consider from each search type, valid for ElasticSearch vector stores, defaults to 4
filter (Union[DSSSimpleFilter, dict], optional) – optional metadata filter, as a DSSSimpleFilter or a simple filter dictionary
- Returns:
a result object with a list of documents that matched the query
- Return type:
dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResult
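The parameters above can be combined as in the following sketch, which wraps `search()` for two of the documented search types. It assumes `knowledge_bank` is a `DSSKnowledgeBank` handle; the helper names and the chosen values (threshold of 0.7, candidate pool of four times the limit) are illustrative, not recommendations from the API.

```python
def search_above_threshold(knowledge_bank, query, threshold=0.7, limit=5):
    """Return only documents whose similarity score exceeds `threshold`."""
    return knowledge_bank.search(
        query,
        max_documents=limit,
        search_type="SIMILARITY_THRESHOLD",
        similarity_threshold=threshold,
    )


def search_diverse(knowledge_bank, query, limit=5):
    """Use MMR to trade some relevancy for diversity among the results."""
    return knowledge_bank.search(
        query,
        max_documents=limit,
        search_type="MMR",
        # Candidate pool considered before the final selection (illustrative).
        mmr_documents_count=4 * limit,
        # Documented default; 0 favors diversity, 1 favors relevancy.
        mmr_factor=0.25,
    )
```

Since MMR is not supported by every vector store, a caller may want to fall back to `search_above_threshold` when `search_diverse` raises.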
- class dataikuapi.dss.knowledgebank.DSSKnowledgeBankSettings(client, project_key, settings)#
Settings for a knowledge bank
Important
Do not instantiate directly, use dataikuapi.dss.knowledgebank.DSSKnowledgeBank.get_settings() instead.
- property project_key#
Returns the project key of the knowledge bank
- Return type:
str
- property id#
Returns the identifier of the knowledge bank
- Return type:
str
- property vector_store_type#
Returns the type of storage backing the vector store (could be CHROMA, PINECONE, ELASTICSEARCH, AZURE_AI_SEARCH, SNOWFLAKE_CORTEX_SEARCH, VERTEX_AI_GCS_BASED, FAISS, QDRANT_LOCAL)
- Return type:
str
- set_metadata_schema(schema)#
Sets the schema for metadata fields.
- Parameters:
schema (Dict[str, str]) – the schema, as a mapping metadata_field -> type
- set_images_folder(managed_folder_id, project_key=None)#
Sets the images folder to use with this knowledge bank.
- Parameters:
managed_folder_id (str) – The (managed) images folder id.
project_key (Optional[str]) – The images folder project key, if different from this knowledge bank's project key. Defaults to None.
- get_images_folder()#
Returns the images folder of the knowledge bank, if any.
- Returns:
the managed folder or None
- Return type:
DSSManagedFolder | None
- get_raw()#
Returns the raw settings of the knowledge bank
- Returns:
the raw settings of the knowledge bank
- Return type:
dict
- save()#
Saves the settings on the knowledge bank
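A typical settings round trip is get, modify, then save. The sketch below attaches an images folder using only the methods listed above; it assumes `knowledge_bank` is a `DSSKnowledgeBank` handle and that the managed folder already exists. The helper name `attach_images_folder` is hypothetical.

```python
def attach_images_folder(knowledge_bank, managed_folder_id, folder_project_key=None):
    """Point a knowledge bank at a managed folder of images and persist it.

    Returns the vector store type, which can be handy for logging.
    """
    settings = knowledge_bank.get_settings()
    settings.set_images_folder(managed_folder_id, project_key=folder_project_key)
    # Changes only exist on the local settings object until save() is called.
    settings.save()
    return settings.vector_store_type
```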
- class dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResult(kb, documents)#
The result of a search in a knowledge bank, contains documents that matched the query
Each document is a dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResultDocument.
- property documents#
Returns a list of documents that matched a search query
- Returns:
a list of result documents
- Return type:
list[dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResultDocument]
- property managed_folder_id#
- class dataikuapi.dss.knowledgebank.DSSKnowledgeBankSearchResultDocument(result, text, score, metadata)#
A document found by searching a knowledge bank with dataikuapi.dss.knowledgebank.DSSKnowledgeBank.search().
- property text#
Returns the text from the knowledge bank for this document
- Returns:
the text for this document
- Return type:
str
- property score#
Returns the match score for this document
- Returns:
the score for this document
- Return type:
float
- property metadata#
Returns metadata from the knowledge bank for this document
- Returns:
metadata for this document
- Return type:
dict
- property images#
Returns images for this document
- Returns:
a list of image references or None
- Return type:
list[ManagedFolderImageRef] | None
- property file_ref#
Returns the file reference for this document
- Returns:
a file reference or None
- Return type:
ManagedFolderDocumentRef | None
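The properties above can be flattened into plain dictionaries for logging or display, as in this sketch. It assumes `result` is a `DSSKnowledgeBankSearchResult`; the helper name `summarize_result` and the `min_score` and `preview_chars` parameters are hypothetical.

```python
def summarize_result(result, min_score=0.0, preview_chars=80):
    """Turn search result documents into plain dicts, dropping low scores."""
    rows = []
    for doc in result.documents:
        if doc.score < min_score:
            continue
        rows.append({
            "score": doc.score,
            "preview": doc.text[:preview_chars],
            "metadata": dict(doc.metadata),
            # images and file_ref may be None when the document has none.
            "has_images": bool(doc.images),
            "has_file_ref": doc.file_ref is not None,
        })
    return rows
```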
- class dataiku.KnowledgeBank(id, project_key=None, context_project_key=None)#
This is a handle to interact with a Dataiku Knowledge Bank flow object
- set_context_project_key(context_project_key)#
Set the context project key to use to report calls to the embedding LLM associated with this knowledge bank.
- Parameters:
context_project_key (str) – the context project key
- get_current_version(trusted_object: TrustedObject | None = None)#
Gets the current version for this knowledge bank.
- Parameters:
trusted_object (Optional["TrustedObject"]) – the optional trusted object using the kb
- Return type:
str
- get_writer()#
Gets a writer on the latest vector store files on disk. For local vector stores, downloads metadata files as well as data files. For remote vector stores, only downloads metadata files.
The vector store files are automatically uploaded when the context manager is closed.
Note
Each call creates an isolated writer which works on its own folder.
- as_langchain_retriever(search_type='similarity', search_kwargs=None, vectorstore_kwargs=None, **retriever_kwargs)#
Get the current version of this knowledge bank as a Langchain Retriever object.
- Return type:
langchain_core.vectorstores.VectorStoreRetriever
- as_langchain_vectorstore(**vectorstore_kwargs)#
Get the current version of this knowledge bank as a Langchain Vectorstore object.
- Return type:
langchain_core.vectorstores.VectorStore
- get_multipart_context(docs)#
Convert retrieved documents from the vector store to a multipart context. The multipart context contains the parts that can be added to a completion query
- Parameters:
docs (List[Document]) – A list of retrieved documents from the langchain retriever
- Raises:
Exception – If the knowledge bank does not contain multimodal content
- Returns:
A multipart context object composed of a list of parts containing text or images
- Return type:
MultipartContext
- class dataiku.core.knowledge_bank.MultipartContext#
A reference to a list of text or image parts that can be added to a completion query
- append(part)#
- Parameters:
part (MultipartContent) – a part of a completion query
- add_to_completion_query(completion, role='user')#
Add the accumulated parts as a new multipart-message to the completion query
- Parameters:
completion (DSSLLMCompletionsQuerySingleQuery) – the completion query to be edited
role (str) – The message role. Use system to set the LLM behavior, assistant to store predefined responses, user to provide requests or comments for the LLM to answer to. Defaults to user.
- is_text_only()#
- Returns:
True if all the accumulated parts are text parts, False otherwise
- Return type:
bool
- to_text()#
- Returns:
the concatenation of accumulated text parts (other parts are skipped)
- Return type:
str
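The retrieval and multipart methods above might be chained as in the following sketch, which retrieves documents, converts them to a multipart context and adds that context to a completion query. It assumes `kb` is a `dataiku.KnowledgeBank` with multimodal content (otherwise `get_multipart_context()` raises), `llm` is a `DSSLLM` handle, and that the retriever accepts the usual Langchain `search_kwargs={"k": ...}` convention. The helper name `build_grounded_completion` is hypothetical.

```python
def build_grounded_completion(kb, llm, question, top_k=3):
    """Build a completion query whose context comes from the knowledge bank."""
    retriever = kb.as_langchain_retriever(search_kwargs={"k": top_k})
    docs = retriever.invoke(question)

    # Raises if the knowledge bank does not contain multimodal content.
    context = kb.get_multipart_context(docs)

    completion = llm.new_completion()
    if context.is_text_only():
        # No image parts were retrieved: plain text is enough.
        completion.with_message(context.to_text(), role="user")
    else:
        context.add_to_completion_query(completion, role="user")
    completion.with_message(question, role="user")
    return completion
```

Calling `execute()` on the returned query would then send it to the LLM.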
Retrieval-augmented LLMs#
- class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMListItem(client, data)#
An item in a list of retrieval-augmented LLMs
Important
Do not instantiate this class directly, instead use
dataikuapi.dss.project.DSSProject.list_retrieval_augmented_llms().
- property project_key#
- Returns:
The project key
- Return type:
string
- property id#
- Returns:
The id of the retrieval-augmented LLM.
- Return type:
string
- property name#
- Returns:
The name of the retrieval-augmented LLM.
- Return type:
string
- as_llm()#
Returns this retrieval-augmented LLM as a usable dataikuapi.dss.llm.DSSLLM for querying
- class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLM(client, project_key, id)#
A handle to interact with a DSS-managed retrieval-augmented LLM.
Important
Do not create this class directly, use
dataikuapi.dss.project.DSSProject.get_retrieval_augmented_llm() instead.
- property id#
- as_llm()#
Returns this retrieval-augmented LLM as a usable dataikuapi.dss.llm.DSSLLM for querying
- get_settings()#
Get the retrieval-augmented LLM’s definition
- Returns:
a handle on the retrieval-augmented LLM definition
- Return type:
dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMSettings
- delete()#
Delete the retrieval-augmented LLM
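Querying a retrieval-augmented LLM goes through `as_llm()`, as in the sketch below. It assumes `project` is a `DSSProject` handle and that the returned `DSSLLM` exposes the usual completion API (`new_completion()`, `with_message()`, `execute()`, and a response with `success` and `text`). The helper name is hypothetical.

```python
def ask_retrieval_augmented_llm(project, rag_id, question):
    """Send a single question to a retrieval-augmented LLM and return the text."""
    rag = project.get_retrieval_augmented_llm(rag_id)
    llm = rag.as_llm()

    completion = llm.new_completion()
    completion.with_message(question)
    response = completion.execute()

    if not response.success:
        raise RuntimeError("Completion failed for retrieval-augmented LLM %s" % rag_id)
    return response.text
```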
- class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMSettings(client, settings)#
Settings for a retrieval-augmented LLM
Important
Do not instantiate directly, use
dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLM.get_settings() instead.
- get_version_ids()#
- property active_version#
Returns the active version of this retrieval-augmented LLM. May return None if no version is declared as active
- get_version_settings(version_id)#
- get_raw()#
Returns the raw settings of the retrieval-augmented LLM
- Returns:
the raw settings of the retrieval-augmented LLM
- Return type:
dict
- save()#
Saves the settings on the retrieval-augmented LLM
- class dataikuapi.dss.retrieval_augmented_llm.DSSRetrievalAugmentedLLMVersionSettings(version_settings)#
- get_raw()#
- property llm_id#
Get or set the id of the LLM used by this version
- Return type:
str
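The version accessors can be combined to inspect which LLM each version uses, as in this sketch. It assumes `rag` is a `DSSRetrievalAugmentedLLM` handle and that `get_version_ids()` returns an iterable of ids accepted by `get_version_settings()`, which the reference above does not state explicitly. The helper name is hypothetical.

```python
def llm_ids_by_version(rag):
    """Return a {version_id: llm_id} mapping for a retrieval-augmented LLM."""
    settings = rag.get_settings()
    mapping = {}
    for version_id in settings.get_version_ids():
        version_settings = settings.get_version_settings(version_id)
        mapping[version_id] = version_settings.llm_id
    return mapping
```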
- property interaction_logging_selection#
Get the interaction logging selection for this version.
Before configuring interaction logging on a retrieval-augmented LLM version, create the target dataset on the project:
project = client.get_project("MYPROJECT")
project.create_llm_interaction_logging_dataset(
    "llm_logs",
    connection_id="filesystem_managed",
    time_partitioning="DAY",
)
Example using inherited settings:
rag = project.get_retrieval_augmented_llm("my_rag")
rag_settings = rag.get_settings()
version_settings = rag_settings.get_version_settings("v1")
logging_selection = version_settings.interaction_logging_selection
logging_selection.inherit()
rag_settings.save()
Example using explicit settings:
rag = project.get_retrieval_augmented_llm("my_rag")
rag_settings = rag.get_settings()
version_settings = rag_settings.get_version_settings("v1")
logging_selection = version_settings.interaction_logging_selection
logging_selection.enable(
    "llm_logs",
    settings={
        "flushEveryS": 60,
        "flushEveryBytes": 1_000_000,
        "contentMode": "FULL",
    },
)
rag_settings.save()
Example disabling interaction logging:
rag = project.get_retrieval_augmented_llm("my_rag")
rag_settings = rag.get_settings()
version_settings = rag_settings.get_version_settings("v1")
logging_selection = version_settings.interaction_logging_selection
logging_selection.disable()
rag_settings.save()
Vector store helpers#
- class dataiku.core.vector_stores.data.writer.VectorStoreWriter(project_key: str, kb_full_id: str, isolated_folder: VectorStoreIsolatedFolder)#
A helper class to write vector store data to the underlying knowledge bank folder.
Important
Do not create this class directly, use
dataiku.KnowledgeBank.get_writer() instead.
- property folder_path: str#
The path to the underlying folder on the filesystem.
- clear()#
Clears the vector store data stored in the underlying folder.
- save()#
Saves the content of the underlying folder as a new knowledge bank version.
- Returns:
the created version
- Return type:
str
- as_langchain_vectorstore(**vectorstore_kwargs) VectorStore#
Gets this writer as a Langchain Vectorstore object
- Return type:
langchain_core.vectorstores.VectorStore
- get_metadata_formatter() DocumentMetadataFormatter#
Gets the metadata formatter to help writing documents to this vector store.
- Return type:
DocumentMetadataFormatter
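A writer is meant to be used as a context manager, since `dataiku.KnowledgeBank.get_writer()` states that the vector store files are uploaded when the context manager is closed. The sketch below replaces the content of a knowledge bank with a new set of documents. It assumes `kb` is a `dataiku.KnowledgeBank`, `documents` is a list of Langchain `Document` objects, and that the returned vector store supports the standard Langchain `add_documents()` call. The helper name is hypothetical.

```python
def rebuild_vector_store(kb, documents):
    """Clear the knowledge bank's vector store and write `documents` into it.

    Returns the number of documents written.
    """
    documents = list(documents)
    with kb.get_writer() as writer:
        # Start from an empty store instead of appending to the latest version.
        writer.clear()
        vector_store = writer.as_langchain_vectorstore()
        vector_store.add_documents(documents)
    # Leaving the `with` block is what triggers the upload, per get_writer().
    return len(documents)
```

Omitting the `clear()` call would presumably append to the existing content rather than replace it.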
- class dataiku.core.vector_stores.data.metadata.DocumentMetadataFormatter(project_key: str, vector_store_implementation)#
Helper class to format vector store documents metadata for usage within Dataiku.
Important
Do not create this class directly, use
VectorStoreWriter.get_metadata_formatter() instead.
- with_security_tokens(security_tokens: List[str])#
Adds the security tokens in the metadata.
- Parameters:
security_tokens – The security tokens.
- with_original_document(folder_id: str, path: str, project_key: str | None = None)#
Adds the original document information in the metadata.
- Parameters:
folder_id – The id of the managed folder that contains the original document.
path – The original document path in the managed folder.
project_key – The managed folder project key. Defaults to the project key of the knowledge bank.
- with_original_document_ref(document_ref: ManagedFolderDocumentRef, project_key: str | None = None)#
Adds the original document information in the metadata.
- Parameters:
document_ref – The reference to the original document.
project_key – The managed folder project key. Defaults to the project key of the knowledge bank.
- with_original_document_page_range(page_start: int, page_end: int)#
Adds the page range in the original document. This metadata is intended to start at index 1.
- Parameters:
page_start – The original document page where the extract started. Must be positive, and lower than or equal to page_end.
page_end – The original document page where the extract ended. Must be positive, and greater than or equal to page_start.
- with_original_document_section_outline(section_outline: List[str])#
Adds a section outline in the metadata. Section outlines can be derived from the document's extracted content. For example, an outline may contain the titles of the sections that contain this part of the original document, from top-level headers to lower-level headers.
- Parameters:
section_outline – The section outline.
- static make_captioned_images(caption: str, image_paths: List[str]) Dict[str, str | List[str]]#
Constructs a captioned images dictionary from a caption and image paths.
- Parameters:
caption – The caption.
image_paths – The paths to the images, relative to the managed folder that is configured in the knowledge bank.
- with_retrieval_content(text: str | None = None, image_paths: List[str] | None = None, captioned_images: Dict[str, str | List[str]] | None = None)#
Adds the retrieval content in the metadata. Accepts one of:
- text content
- image paths relative to the knowledge bank images folder
- captioned images
- Parameters:
text – The text content.
image_paths – The paths to the images, relative to the managed folder that is configured in the knowledge bank.
captioned_images – Captioned images constructed using make_captioned_images().
- format_metadata(document: Document) Document#
Formats the metadata in the provided document, so that it can be used for retrieval in Dataiku.
- Parameters:
document – The Langchain document whose metadata must be formatted.
- Returns:
The document with updated metadata.
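The formatter methods above might be combined as in the following sketch, which records where a chunk came from before it is written to the vector store. It assumes `formatter` was obtained from `VectorStoreWriter.get_metadata_formatter()` for this chunk; whether a formatter keeps state between documents is not stated above, so the sketch uses one formatter per chunk. The helper name `tag_with_source` is hypothetical, and the page-range check simply mirrors the documented constraints on the client side.

```python
def tag_with_source(formatter, document, folder_id, path,
                    page_start, page_end, outline=None):
    """Attach original-document metadata to a chunk and format it."""
    # Pages are 1-based, and page_start must not exceed page_end.
    if not (1 <= page_start <= page_end):
        raise ValueError(
            "expected 1 <= page_start <= page_end, got %r and %r"
            % (page_start, page_end)
        )

    formatter.with_original_document(folder_id, path)
    formatter.with_original_document_page_range(page_start, page_end)
    if outline:
        formatter.with_original_document_section_outline(list(outline))
    return formatter.format_metadata(document)
```

The returned document can then be passed to the writer's Langchain vector store.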
