LLM Mesh#

For usage information and examples, see LLM Mesh.

class dataikuapi.dss.llm.DSSLLM(client, project_key, llm_id)#

A handle to interact with a DSS-managed LLM.

Important

Do not create this class directly, use dataikuapi.dss.project.DSSProject.get_llm() instead.

new_completion()#

Create a new completion query.

Returns:: A handle on the generated completion query.
Return type:: DSSLLMCompletionQuery

new_completions()#

Create a new multi-completion query.

Returns:: A handle on the generated multi-completion query.
Return type:: DSSLLMCompletionsQuery

new_embeddings(text_overflow_mode='FAIL')#

Create a new embedding query.

Parameters:: text_overflow_mode (str) – How to handle texts longer than the model supports. Either ‘TRUNCATE’ or ‘FAIL’.
Returns:: A handle on the generated embeddings query.
Return type:: DSSLLMEmbeddingsQuery

new_images_generation()#

Create a new image generation query.

Return type:: DSSLLMImageGenerationQuery

as_langchain_llm(**data)#

Create a langchain-compatible LLM object for this LLM.

Returns:: A langchain-compatible LLM object.
Return type:: dataikuapi.dss.langchain.llm.DKULLM

as_langchain_chat_model(**data)#

Create a langchain-compatible chat LLM object for this LLM.

Returns:: A langchain-compatible LLM object.
Return type:: dataikuapi.dss.langchain.llm.DKUChatModel

as_langchain_embeddings(**data)#

Create a langchain-compatible embeddings object for this LLM.

Returns:: A langchain-compatible embeddings object.
Return type:: dataikuapi.dss.langchain.embeddings.DKUEmbeddings

class dataikuapi.dss.llm.DSSLLMListItem(client, project_key, data)#

An item in a list of LLMs

Important

Do not instantiate this class directly, instead use dataikuapi.dss.project.DSSProject.list_llms().

to_llm()#

Convert the current item.

Returns:: A handle for the LLM.
Return type:: dataikuapi.dss.llm.DSSLLM

property id#

Returns:: The id of the LLM.
Return type:: string

property type#

Returns:: The type of the LLM
Return type:: string

property description#

Returns:: The description of the LLM
Return type:: string
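As a minimal sketch of how these list items might be used, the helper below filters the result of DSSProject.list_llms() on the documented type property and returns the matching ids. The function name is illustrative, and project is assumed to be a DSSProject handle obtained elsewhere.

```python
def llm_ids_of_type(project, llm_type):
    """Return the ids of the LLMs in `project` whose `type` equals `llm_type`.

    `project` is assumed to be a dataikuapi DSSProject handle. Only
    list_llms() and the documented `id` / `type` properties are used.
    """
    return [item.id for item in project.list_llms() if item.type == llm_type]
```

Each returned id can then be passed to DSSProject.get_llm(), or the list item itself can be converted with to_llm().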

class dataikuapi.dss.llm.DSSLLMCompletionsQuery(llm)#

A handle to interact with a multi-completion query. Completion queries allow you to send a prompt to a DSS-managed LLM and retrieve its response.

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLM.new_completions() instead.

property settings#

Returns:: The completion query settings.
Return type:: dict

new_completion()#

new_guardrail(type)#: Start adding a guardrail to the request. You need to configure the returned object and call add() to actually add it.

execute()#

Run the completions query and retrieve the LLM response.

Returns:: The LLM response.
Return type:: DSSLLMCompletionsResponse

with_json_output(schema=None, strict=None, compatible=None)#

Request the model to generate a valid JSON response, for models that support it.

Note that some models also require you to explicitly request JSON output in the user or system prompt.

Caution

JSON output support is experimental for locally-running Hugging Face models.

Parameters:

schema (dict) – (optional) If specified, request the model to produce a JSON response that adheres to the provided schema. Support varies across models/providers.
strict (bool) – (optional) If a schema is provided, whether to strictly enforce it. Support varies across models/providers.
compatible (bool) – (optional) Allow DSS to modify the schema in order to increase compatibility, depending on known limitations of the model/provider. Defaults to automatic.

with_structured_output(model_type, strict=None, compatible=None)#

Instruct the model to generate a response as an instance of a specified Pydantic model.

This relies on with_json_output() and requires a model that supports JSON output with a schema.

Caution

Structured output support is experimental for locally-running Hugging Face models.

Parameters:

model_type (pydantic.BaseModel) – A Pydantic model class used for structuring the response.
strict (bool) – (optional) see with_json_output()
compatible (bool) – (optional) see with_json_output()

class dataikuapi.dss.llm.DSSLLMCompletionsQuerySingleQuery#

new_multipart_message(role='user')#

Start adding a multipart-message to the completion query.

Use this to add image parts to the message.

Parameters:: role (str) – The message role. Use system to set the LLM behavior, assistant to store predefined responses, and user to provide requests or comments for the LLM to answer. Defaults to user.
Return type:: DSSLLMCompletionQueryMultipartMessage

with_message(message, role='user')#

Add a message to the completion query.

Parameters:

message (str) – The message text.
role (str) – The message role. Use system to set the LLM behavior, assistant to store predefined responses, and user to provide requests or comments for the LLM to answer. Defaults to user.

with_tool_calls(tool_calls, role='assistant')#

Add tool calls to the completion query.

Caution

Tool calls support is experimental for locally-running Hugging Face models.

Parameters:

tool_calls (list[dict]) – Calls to tools that the LLM requested to use.
role (str) – The message role. Defaults to assistant.

with_tool_output(tool_output, tool_call_id, role='tool')#

Add a tool message to the completion query.

Parameters:

tool_output (str) – The tool output, as a string.
tool_call_id (str) – The tool call id, as provided by the LLM in the conversation messages.
role (str) – The message role. Defaults to tool.

with_context(context)#

class dataikuapi.dss.llm.DSSLLMCompletionsResponse(raw_resp, response_parser=None)#

A handle to interact with a multi-completion response.

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLMCompletionsQuery.execute() instead.

property responses#: The array of responses
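The multi-completion flow can be sketched as follows: one query per prompt is added with new_completion(), the batch is run once with execute(), and the texts are read from responses. This is a sketch under assumptions: it supposes that new_completion() returns a DSSLLMCompletionsQuerySingleQuery and that each entry of responses exposes text like DSSLLMCompletionResponse does. The helper name run_batch is hypothetical.

```python
def run_batch(llm, prompts):
    """Send several prompts in one multi-completion query and return the texts.

    `llm` is assumed to be a DSSLLM handle.
    """
    completions = llm.new_completions()
    for prompt in prompts:
        # Each call adds one single query to the batch.
        single = completions.new_completion()
        single.with_message(prompt)
    response = completions.execute()
    # Assumption: every entry of `responses` has a `text` attribute.
    return [item.text for item in response.responses]
```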

class dataikuapi.dss.llm.DSSLLMCompletionQuery(llm)#

A handle to interact with a completion query. Completion queries allow you to send a prompt to a DSS-managed LLM and retrieve its response.

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLM.new_completion() instead.

property settings#

Returns:: The completion query settings.
Return type:: dict

new_guardrail(type)#: Start adding a guardrail to the request. You need to configure the returned object and call add() to actually add it.

execute()#

Run the completion query and retrieve the LLM response.

Returns:: The LLM response.
Return type:: DSSLLMCompletionResponse

execute_streamed()#

Run the completion query and retrieve the LLM response as streamed chunks.

Returns:: An iterator over the LLM response chunks
Return type:: Iterator[Union[DSSLLMStreamedCompletionChunk, DSSLLMStreamedCompletionFooter]]
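A streamed response might be consumed as in the sketch below, which concatenates the text chunks and keeps the footer. Two points are assumptions not stated in this section: that both chunk classes expose a data dict, and that text chunks carry their content under the "text" key. The helper name collect_stream is hypothetical, and the footer is recognized by class name only so that the sketch needs no import.

```python
def collect_stream(completion):
    """Concatenate streamed text chunks and return (text, footer_data).

    `completion` is assumed to be a DSSLLMCompletionQuery that already
    holds its messages.
    """
    parts = []
    footer = None
    for chunk in completion.execute_streamed():
        # Assumption: chunks and footer both expose a `data` dict.
        if type(chunk).__name__ == "DSSLLMStreamedCompletionFooter":
            footer = chunk.data
        else:
            parts.append(chunk.data.get("text", ""))
    return "".join(parts), footer
```

In real code, an isinstance() check against the two classes named in the return type is the more robust way to tell chunks from the footer.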

new_multipart_message(role='user')#

Start adding a multipart-message to the completion query.

Use this to add image parts to the message.

Parameters:: role (str) – The message role. Use system to set the LLM behavior, assistant to store predefined responses, and user to provide requests or comments for the LLM to answer. Defaults to user.
Return type:: DSSLLMCompletionQueryMultipartMessage

with_context(context)#

with_json_output(schema=None, strict=None, compatible=None)#

Request the model to generate a valid JSON response, for models that support it.

Note that some models also require you to explicitly request JSON output in the user or system prompt.

Caution

JSON output support is experimental for locally-running Hugging Face models.

Parameters:

schema (dict) – (optional) If specified, request the model to produce a JSON response that adheres to the provided schema. Support varies across models/providers.
strict (bool) – (optional) If a schema is provided, whether to strictly enforce it. Support varies across models/providers.
compatible (bool) – (optional) Allow DSS to modify the schema in order to increase compatibility, depending on known limitations of the model/provider. Defaults to automatic.
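To illustrate the schema parameter, here is a minimal sketch that requests JSON output and reads the parsed result from the json property of the response. The schema content, the system prompt wording, and the helper name ask_for_json are illustrative assumptions. Whether a given model honors the schema depends on the provider.

```python
# Hypothetical JSON Schema describing the expected answer.
CITY_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}


def ask_for_json(llm, question, schema):
    """Run a completion that requests JSON output and return the parsed JSON.

    `llm` is assumed to be a DSSLLM handle.
    """
    completion = llm.new_completion()
    # Some models need the JSON request repeated in the prompt itself.
    completion.with_message("Answer with a JSON object only.", role="system")
    completion.with_message(question)
    completion.with_json_output(schema=schema)
    response = completion.execute()
    if not response.success:
        raise RuntimeError("The completion query failed")
    return response.json
```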

with_message(message, role='user')#

Add a message to the completion query.

Parameters:

message (str) – The message text.
role (str) – The message role. Use system to set the LLM behavior, assistant to store predefined responses, and user to provide requests or comments for the LLM to answer. Defaults to user.
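The three roles can be combined to replay a short conversation before asking for the next answer. The sketch below assumes llm is a DSSLLM handle. The helper name run_conversation and the (role, text) tuple layout are illustrative choices, not part of the API.

```python
def run_conversation(llm, turns):
    """Replay `turns` as messages, then return the LLM answer text.

    `turns` is a list of (role, text) tuples, with role being one of
    "system", "assistant" or "user". Returns None if the query failed.
    """
    completion = llm.new_completion()
    for role, text in turns:
        completion.with_message(text, role=role)
    response = completion.execute()
    return response.text if response.success else None
```

For instance, turns could be [("system", "Be concise."), ("user", "What is an embedding?")].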

with_structured_output(model_type, strict=None, compatible=None)#

Instruct the model to generate a response as an instance of a specified Pydantic model.

This relies on with_json_output() and requires a model that supports JSON output with a schema.

Caution

Structured output support is experimental for locally-running Hugging Face models.

Parameters:

model_type (pydantic.BaseModel) – A Pydantic model class used for structuring the response.
strict (bool) – (optional) see with_json_output()
compatible (bool) – (optional) see with_json_output()

with_tool_calls(tool_calls, role='assistant')#

Add tool calls to the completion query.

Caution

Tool calls support is experimental for locally-running Hugging Face models.

Parameters:

tool_calls (list[dict]) – Calls to tools that the LLM requested to use.
role (str) – The message role. Defaults to assistant.

with_tool_output(tool_output, tool_call_id, role='tool')#

Add a tool message to the completion query.

Parameters:

tool_output (str) – The tool output, as a string.
tool_call_id (str) – The tool call id, as provided by the LLM in the conversation messages.
role (str) – The message role. Defaults to tool.
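A tool round-trip could look like the sketch below: the tool calls requested by the LLM are echoed back with with_tool_calls(), each call is run locally, and its result is attached with with_tool_output(). The shape of each tool call dict (an id plus a function entry holding name and JSON-encoded arguments) is an assumption modelled on the common OpenAI-style layout and is not specified in this section. The handlers mapping and the helper name are hypothetical.

```python
import json


def append_tool_results(completion, tool_calls, handlers):
    """Record the LLM's tool calls and the matching tool outputs.

    `handlers` maps a tool name to a local Python callable.
    Assumption: each tool call looks like
    {"id": ..., "function": {"name": ..., "arguments": "<json string>"}}.
    """
    completion.with_tool_calls(tool_calls)
    for call in tool_calls:
        function = call["function"]
        arguments = json.loads(function.get("arguments") or "{}")
        result = handlers[function["name"]](**arguments)
        # Tool outputs are passed as strings, so the result is serialized.
        completion.with_tool_output(json.dumps(result), tool_call_id=call["id"])
    return completion
```

After this step the completion query can be executed again so that the LLM can use the tool outputs in its answer.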

class dataikuapi.dss.llm.DSSLLMCompletionQueryMultipartMessage(q, role)#

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLMCompletionQuery.new_multipart_message() or dataikuapi.dss.llm.DSSLLMCompletionsQuerySingleQuery.new_multipart_message().

with_text(text)#: Add a text part to the multipart message

with_inline_image(image, mime_type=None)#

Add an image part to the multipart message

Parameters:

image (Union[str, bytes]) – The image
mime_type (str) – The MIME type of the image. Use None for the default.

with_image_url(image)#

Add an image URL part to the multipart message

Parameters:: image (str) – The image URL.

add()#: Add this message to the completion query
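Building a multipart message might look like the following sketch, which adds a text part and an inline image part and then calls add() to attach the message to the query. The helper name is hypothetical, and the calls are deliberately not chained because this section does not document the return values of with_text() and with_inline_image().

```python
def add_image_question(completion, question, image_bytes, mime_type=None):
    """Attach a user message made of a text part and an inline image part.

    `completion` is assumed to be a completion query exposing
    new_multipart_message().
    """
    message = completion.new_multipart_message(role="user")
    message.with_text(question)
    message.with_inline_image(image_bytes, mime_type=mime_type)
    # Nothing is attached to the query until add() is called.
    message.add()
    return completion
```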

class dataikuapi.dss.llm.DSSLLMCompletionResponse(raw_resp=None, text=None, finish_reason=None, response_parser=None, trace=None)#

Response to a completion

property json#

Returns:: LLM response parsed as a JSON object

property parsed#

property success#

Returns:: The outcome of the completion query.
Return type:: bool

property text#

Returns:: The raw text of the LLM response.
Return type:: Union[str, None]

property tool_calls#

Returns:: The tool calls of the LLM response.
Return type:: Union[list, None]

property log_probs#

Returns:: The log probs of the LLM response.
Return type:: Union[list, None]

property trace#

class dataikuapi.dss.llm.DSSLLMEmbeddingsQuery(llm, text_overflow_mode)#

A handle to interact with an embedding query. Embedding queries allow you to transform text into embedding vectors using a DSS-managed model.

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLM.new_embeddings() instead.

add_text(text)#

Add text to the embedding query.

Parameters:: text (str) – Text to add to the query.

add_image(image, text=None)#

Add an image to the embedding query.

Parameters:

image – Image content as bytes or str (base64)
text – Optional text (requires a multimodal model)

new_guardrail(type)#: Start adding a guardrail to the request. You need to configure the returned object and call add() to actually add it.

execute()#

Run the embedding query.

Returns:: The results of the embedding query.
Return type:: DSSLLMEmbeddingsResponse

class dataikuapi.dss.llm.DSSLLMEmbeddingsResponse(raw_resp)#

A handle to interact with an embedding query result.

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLMEmbeddingsQuery.execute() instead.

get_embeddings()#

Retrieve vectors resulting from the embeddings query.

Returns:: A list of lists containing all embedding vectors.
Return type:: list
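The embedding flow can be sketched as: create the query, add one text per input, execute, and read the vectors. The cosine similarity helper is plain Python added for illustration and is not part of the API. Both function names are hypothetical, and llm is assumed to be a DSSLLM handle for an embedding model.

```python
import math


def embed_texts(llm, texts, text_overflow_mode="FAIL"):
    """Return one embedding vector (list of floats) per input text."""
    query = llm.new_embeddings(text_overflow_mode=text_overflow_mode)
    for text in texts:
        query.add_text(text)
    return query.execute().get_embeddings()


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Passing text_overflow_mode="TRUNCATE" would shorten over-long texts instead of failing the query.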

class dataikuapi.dss.llm.DSSLLMImageGenerationQuery(llm)#

A handle to interact with an image generation query.

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLM.new_images_generation() instead.

with_prompt(prompt, weight=None)#

Add a prompt to the image generation query.

Parameters:

prompt (str) – The prompt text.
weight (float) – Optional weight between 0 and 1 for the prompt.

with_negative_prompt(prompt, weight=None)#

Add a negative prompt to the image generation query.

Parameters:

prompt (str) – The prompt text.
weight (float) – Optional weight between 0 and 1 for the negative prompt.

with_original_image(image, mode=None, weight=None)#

Add an image to the generation query.

To edit specific pixels of the original image, use the INPAINTING mode. A mask can be applied by calling with_mask():

>>> query.with_original_image(image, mode="INPAINTING") # replace the pixels using a mask

To edit an image:

>>> query.with_original_image(image, mode="MASK_FREE") # edit the original image according to the prompt

>>> query.with_original_image(image, mode="VARY") # generates a variation of the original image

Parameters:

image (Union[str, bytes]) – The original image as str in base 64 or bytes.
mode (str) – The editing mode. Mode support varies across models/providers.
weight (float) – The original image weight between 0 and 1.

with_mask(mode, image=None)#

Add a mask for editing to the generation query. Call this method alongside with_original_image().

To edit parts of the image using a black mask (replace the black pixels):

>>> query.with_mask("MASK_IMAGE_BLACK", image=black_mask)

To edit parts of the image that are transparent (replace the transparent pixels):

>>> query.with_mask("ORIGINAL_IMAGE_ALPHA")

Parameters:

mode (str) – The mask mode. Mode support varies across models/providers.
image (Union[str, bytes]) – The mask image to apply when editing the image, as a base64 str or as bytes.
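Putting with_original_image() and with_mask() together, an inpainting request could be sketched as below. The mode names are the ones shown in the examples above. The helper name inpaint is hypothetical, and whether the targeted model supports these modes is not guaranteed.

```python
def inpaint(query, image, mask, prompt):
    """Ask the model to repaint the black pixels of `mask` in `image`.

    `query` is assumed to be a DSSLLMImageGenerationQuery. `image` and
    `mask` are bytes or base64 strings.
    """
    query.with_prompt(prompt)
    query.with_original_image(image, mode="INPAINTING")
    query.with_mask("MASK_IMAGE_BLACK", image=mask)
    return query.execute()
```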

new_guardrail(type)#: Start adding a guardrail to the request. You need to configure the returned object and call add() to actually add it.

property height#

Returns:: The generated image height in pixels.
Return type:: Optional[int]

property width#

Returns:: The generated image width in pixels.
Return type:: Optional[int]

property fidelity#

Returns:: From 0.0 to 1.0, how strongly to adhere to prompt.
Return type:: Optional[float]

property quality#

Returns:: Quality of the image to generate. Valid values depend on the targeted model.
Return type:: Optional[str]

property seed#

Returns:: Seed of the image to generate, gives deterministic results when set.
Return type:: Optional[int]

property style#

Returns:: Style of the image to generate. Valid values depend on the targeted model.
Return type:: Optional[str]

property images_to_generate#

Returns:: Number of images to generate per query. Valid values depend on the targeted model.
Return type:: Optional[int]

property aspect_ratio#

Returns:: The width/height aspect ratio or None if either is not set.
Return type:: Optional[float]

execute()#

Executes the image generation

Return type:: DSSLLMImageGenerationResponse

class dataikuapi.dss.llm.DSSLLMImageGenerationResponse(raw_resp)#

A handle to interact with an image generation response.

Important

Do not create this class directly, use dataikuapi.dss.llm.DSSLLMImageGenerationQuery.execute() instead.

property success#

Returns:: The outcome of the image generation query.
Return type:: bool

first_image(as_type='bytes')#

Parameters:: as_type (str) – The type of image to return, ‘bytes’ for bytes otherwise ‘str’ for base 64 str.
Returns:: The first generated image as bytes or str depending on the as_type parameter.
Return type:: Union[bytes,str]

get_images(as_type='bytes')#

Parameters:: as_type (str) – The type of images to return, ‘bytes’ for bytes otherwise ‘str’ for base 64 str.
Returns:: The generated images as bytes or str depending on the as_type parameter.
Return type:: Union[List[bytes], List[str]]

property images#

Returns:: The generated images in bytes format.
Return type:: List[bytes]
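An end-to-end generation might be sketched as follows: build the query from a prompt, execute it, check success, and write the first image to disk. The helper name and the output path are illustrative assumptions. The image file format depends on the model and is not stated in this section.

```python
def save_first_image(llm, prompt, output_path, negative_prompt=None):
    """Generate an image from `prompt` and write its bytes to `output_path`.

    `llm` is assumed to be a DSSLLM handle for an image generation model.
    """
    query = llm.new_images_generation()
    query.with_prompt(prompt)
    if negative_prompt:
        query.with_negative_prompt(negative_prompt)
    response = query.execute()
    if not response.success:
        raise RuntimeError("The image generation query failed")
    with open(output_path, "wb") as f:
        f.write(response.first_image(as_type="bytes"))
    return output_path
```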

class dataikuapi.dss.knowledgebank.DSSKnowledgeBankListItem(client, data)#

An item in a list of knowledge banks

Important

Do not instantiate this class directly, instead use dataikuapi.dss.project.DSSProject.list_knowledge_banks().

to_knowledge_bank()#

Convert the current item.

Returns:: A handle for the knowledge_bank.
Return type:: dataikuapi.dss.knowledgebank.DSSKnowledgeBank

as_core_knowledge_bank()#

Get the dataiku.KnowledgeBank object corresponding to this knowledge bank

Return type:: dataiku.KnowledgeBank

property project_key#

Returns:: The project key of the knowledge bank.
Return type:: string

property id#

Returns:: The id of the knowledge bank.
Return type:: string

property name#

Returns:: The name of the knowledge bank.
Return type:: string

class dataikuapi.dss.knowledgebank.DSSKnowledgeBank(client, project_key, id)#

A handle to interact with a DSS-managed knowledge bank.

Important

Do not create this class directly, use dataikuapi.dss.project.DSSProject.get_knowledge_bank() instead.

property id#

as_core_knowledge_bank()#

Get the dataiku.KnowledgeBank object corresponding to this knowledge bank

Return type:: dataiku.KnowledgeBank

get_settings()#

Get the knowledge bank’s definition

Returns:: a handle on the knowledge bank definition
Return type:: dataikuapi.dss.knowledgebank.DSSKnowledgeBankSettings

delete()#: Delete the knowledge bank

build(job_type='NON_RECURSIVE_FORCED_BUILD', wait=True)#

Start a new job to build this knowledge bank and, by default, wait for it to complete. Raises if the job failed.

job = knowledge_bank.build()
print("Job %s done" % job.id)

Parameters:

job_type – the job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD
wait (bool) – whether to wait for the job completion before returning the job handle, defaults to True

Returns:

the dataikuapi.dss.job.DSSJob job handle corresponding to the built job

Return type:

dataikuapi.dss.job.DSSJob

class dataikuapi.dss.knowledgebank.DSSKnowledgeBankSettings(client, settings)#

Settings for a knowledge bank

Important

Do not instantiate directly, use dataikuapi.dss.knowledgebank.DSSKnowledgeBank.get_settings() instead

property project_key#

Returns the project key of the knowledge bank

Return type:: str

property id#

Returns the identifier of the knowledge bank

Return type:: str

property vector_store_type#

Returns the type of storage backing the vector store (could be CHROMA, PINECONE, ELASTICSEARCH, AZURE_AI_SEARCH, VERTEX_AI_GCS_BASED, FAISS, QDRANT_LOCAL)

Return type:: str

get_raw()#

Returns the raw settings of the knowledge bank

Returns:: the raw settings of the knowledge bank
Return type:: dict

save()#: Saves the settings on the knowledge bank
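As a small read-only sketch, the documented settings properties can be gathered into a dict for logging or inspection. The helper name is hypothetical and kb is assumed to be a DSSKnowledgeBank handle. The keys of get_raw() are not listed in this section, so they are left untouched here.

```python
def describe_knowledge_bank(kb):
    """Summarize a knowledge bank using its documented settings properties."""
    settings = kb.get_settings()
    return {
        "project_key": settings.project_key,
        "id": settings.id,
        "vector_store_type": settings.vector_store_type,
    }
```

To change settings, the dict returned by get_raw() would be modified in place and then persisted with save().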

class dataiku.KnowledgeBank(id, project_key=None)#

This is a handle to interact with a Dataiku Knowledge Bank flow object

get_current_version()#

Gets the current version for this knowledge bank.

Return type:: str

as_langchain_retriever(search_type='similarity', search_kwargs=None, vectorstore_kwargs=None, **retriever_kwargs)#

Get the current version of this knowledge bank as a Langchain Retriever object.

Return type:: langchain_core.vectorstores.VectorStoreRetriever

as_langchain_vectorstore(**vectorstore_kwargs)#

Get the current version of this knowledge bank as a Langchain Vectorstore object.

Return type:: langchain_core.vectorstores.VectorStore

get_multipart_context(docs)#

Convert retrieved documents from the vector store to a multipart context. The multipart context contains the parts that can be added to a completion query

Parameters:: docs (List[Document]) – A list of retrieved documents from the langchain retriever
Raises:: Exception – If the knowledge bank does not contain multimodal content
Returns:: A multipart context object composed of a list of parts containing text or images
Return type:: MultipartContext

class dataiku.core.knowledge_bank.MultipartContext#

A reference to a list of text or image parts that can be added to a completion query

append(part)#

Parameters:: part (MultipartContent) – Part of a completion query

add_to_completion_query(completion, role='user')#

Add the accumulated parts as a new multipart-message to the completion query

Parameters:

completion (DSSLLMCompletionsQuerySingleQuery) – the completion query to be edited
role (str) – The message role. Use system to set the LLM behavior, assistant to store predefined responses, and user to provide requests or comments for the LLM to answer. Defaults to user.

is_text_only()#

Returns:: True if all the accumulated parts are text parts, False otherwise
Return type:: bool

to_text()#

Returns:: the concatenation of accumulated text parts (other parts are skipped)
Return type:: str
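A retrieval-augmented query could combine the retriever with the multipart context as in the sketch below. It assumes kb is a dataiku.KnowledgeBank, that completion is a query accepted by add_to_completion_query(), and that the retriever follows the usual LangChain interface (invoke() returning documents with a page_content attribute). The helper name and the fallback to plain text when get_multipart_context() raises are illustrative choices.

```python
def add_retrieved_context(kb, completion, question, k=3):
    """Retrieve documents for `question` and add them, then the question,
    to `completion`."""
    retriever = kb.as_langchain_retriever(search_kwargs={"k": k})
    docs = retriever.invoke(question)
    try:
        context = kb.get_multipart_context(docs)
    except Exception:
        # Documented behavior: raised when the knowledge bank has no
        # multimodal content. Fall back to the documents' plain text.
        completion.with_message("\n\n".join(d.page_content for d in docs))
    else:
        context.add_to_completion_query(completion, role="user")
    completion.with_message(question)
    return completion
```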

class dataikuapi.dss.langchain.DKULLM(*args: Any, **kwargs: Any)#

Langchain-compatible wrapper around Dataiku-mediated LLMs

Note

Direct instantiation of this class is possible from within DSS, though it’s recommended to instead use dataikuapi.dss.llm.DSSLLM.as_langchain_llm().

Example:

llm = dkullm.as_langchain_llm()

# single prompt
print(llm.invoke("tell me a joke"))

# multiple prompts with batching
for response in llm.batch(["tell me a joke in English", "tell me a joke in French"]):
    print(response)

# streaming, with stop sequence
for chunk in llm.stream("Explain photosynthesis in a few words in English then French", stop=["dioxyde de"]):
    print(chunk, end="", flush=True)

llm_id: str#: LLM identifier to use

max_tokens: int = None#: Denotes the number of tokens to predict per generation. Deprecated: use key “maxOutputTokens” in field “completion_settings”.

temperature: float = None#: A non-negative float that tunes the degree of randomness in generation. Deprecated: use key “temperature” in field “completion_settings”.

top_k: int = None#: Number of tokens to pick from when sampling. Deprecated: use key “topK” in field “completion_settings”.

top_p: float = None#: Sample from the top tokens whose probabilities add up to p. Deprecated: use key “topP” in field “completion_settings”.

completion_settings: dict = {}#: Settings applied to completion queries, all keys are optional and can include: maxOutputTokens, temperature, topK, topP, frequencyPenalty, presencePenalty, logitBias, logProbs and topLogProbs.
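Since the four top-level fields are deprecated in favor of completion_settings, a small migration helper might look like the sketch below. The key mapping comes from the deprecation notes above. The helper name is hypothetical.

```python
# Deprecated field name -> key to use in `completion_settings`.
_DEPRECATED_TO_SETTINGS_KEY = {
    "max_tokens": "maxOutputTokens",
    "temperature": "temperature",
    "top_k": "topK",
    "top_p": "topP",
}


def to_completion_settings(**deprecated):
    """Build a `completion_settings` dict from deprecated keyword names.

    Values left to None are skipped. Unknown names raise KeyError.
    """
    settings = {}
    for name, value in deprecated.items():
        key = _DEPRECATED_TO_SETTINGS_KEY[name]
        if value is not None:
            settings[key] = value
    return settings
```

Assuming that as_langchain_llm() forwards its keyword arguments to this class, the result could be passed as llm.as_langchain_llm(completion_settings=to_completion_settings(max_tokens=256, temperature=0.2)).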

class dataikuapi.dss.langchain.DKUChatModel(*args: Any, **kwargs: Any)#

Langchain-compatible wrapper around Dataiku-mediated chat LLMs

Note

Direct instantiation of this class is possible from within DSS, though it’s recommended to instead use dataikuapi.dss.llm.DSSLLM.as_langchain_chat_model().

Example:

from langchain_core.prompts import ChatPromptTemplate

llm = dkullm.as_langchain_chat_model()
prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")
chain = prompt | llm
for chunk in chain.stream({"topic": "parrot"}):
    print(chunk.content, end="", flush=True)

llm_id: str#: LLM identifier to use

max_tokens: int = None#: Denotes the number of tokens to predict per generation. Deprecated: use key “maxOutputTokens” in field “completion_settings”.

temperature: float = None#: A non-negative float that tunes the degree of randomness in generation. Deprecated: use key “temperature” in field “completion_settings”.

top_k: int = None#: Number of tokens to pick from when sampling. Deprecated: use key “topK” in field “completion_settings”.

top_p: float = None#: Sample from the top tokens whose probabilities add up to p. Deprecated: use key “topP” in field “completion_settings”.

completion_settings: dict = {}#: Settings applied to completion queries, all keys are optional and can include: maxOutputTokens, temperature, topK, topP, frequencyPenalty, presencePenalty, logitBias, logProbs and topLogProbs.

Bind tool-like objects to this chat model.

Args:

tools: A list of tool definitions to bind to this chat model.

Can be a dictionary, pydantic model, callable, or BaseTool. Pydantic models, callables, and BaseTools will be automatically converted to their schema dictionary representation.

tool_choice: Which tool to request the model to call.

Options are:

name of the tool (str): call the corresponding tool;
“auto”: automatically select a tool (or no tool);
“none”: do not call a tool;
“any” or “required”: force at least one tool call;
True: call the one given tool (requires tools to be of length 1);
a dict of the form: {“type”: “tool_name”, “name”: “<<tool_name>>”}, or {“type”: “required”}, or {“type”: “any”} or {“type”: “none”}, or {“type”: “auto”};

strict: If specified, request the model to produce a JSON tool call that adheres to the provided schema. Support varies across models/providers.

compatible: Allow DSS to modify the schema in order to increase compatibility, depending on known limitations of the model/provider. Defaults to automatic.

kwargs: Any additional parameters to bind.
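The arguments described above might be used as in the following sketch, which builds a tool definition as a dictionary and binds it with tool_choice=True so that the single given tool is called. It assumes the binding method is named bind_tools, as in LangChain chat models, and that the dictionary follows the OpenAI-style function layout that LangChain accepts. Both helper names are hypothetical.

```python
def make_function_tool(name, description, properties, required=()):
    """Build a tool definition dict (assumed OpenAI-style function layout)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(required),
            },
        },
    }


def bind_single_tool(chat_model, tool):
    """Bind exactly one tool and request that the model calls it."""
    # tool_choice=True requires the tools list to be of length 1.
    return chat_model.bind_tools([tool], tool_choice=True)
```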

class dataikuapi.dss.langchain.DKUEmbeddings(*args: Any, **kwargs: Any)#

Langchain-compatible wrapper around Dataiku-mediated embedding LLMs

Note

Direct instantiation of this class is possible from within DSS, though it’s recommended to instead use dataikuapi.dss.llm.DSSLLM.as_langchain_embeddings().

llm_id: str#: LLM identifier to use

embed_documents(texts: List[str]) → List[List[float]]#

Call out to Dataiku-mediated LLM

Args:: texts: The list of texts to embed.
Returns:: List of embeddings, one for each text.

async aembed_documents(texts: List[str]) → List[List[float]]#

embed_query(text: str) → List[float]#

async aembed_query(text: str) → List[float]#

class dataikuapi.dss.document_extractor.DocumentExtractor(client, project_key)#

A handle to interact with a DSS-managed Document Extractor.

vlm_extract(images, llm_id, llm_prompt=None, window_size=1, window_overlap=0)#

Extract text content from images using a vision LLM: for each group of ‘window_size’ consecutive images, prompt the given vision LLM to summarize in plain text.

Parameters:

images (iterable(InlineImageRef) | iterable(ManagedFolderImageRef)) – iterable over the images to be described by the vision LLM
llm_id (str) – the identifier of a vision LLM
llm_prompt (str) – Custom prompt to extract text from the images
window_size (int) – Number of consecutive images to represent in a single output. Use -1 for all images.
window_overlap (int) – Number of overlapping images between two windows of images. Must be less than window_size.

Returns:

Extracted text content per group of images

Return type:

VlmExtractorResponse
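To make window_size and window_overlap concrete, the sketch below shows one plausible way consecutive images could be grouped. It is an illustration of the parameter meanings only: the actual grouping is done by DSS and may differ, in particular for the last window. The function name is hypothetical.

```python
def sketch_windows(items, window_size=1, window_overlap=0):
    """Group `items` into consecutive windows (illustration only)."""
    items = list(items)
    if window_size == -1:
        # -1 means a single window holding all the images.
        return [items] if items else []
    if window_size < 1:
        raise ValueError("window_size must be -1 or a positive integer")
    if not 0 <= window_overlap < window_size:
        raise ValueError("window_overlap must be >= 0 and less than window_size")
    step = window_size - window_overlap
    windows = []
    start = 0
    while start < len(items):
        windows.append(items[start:start + window_size])
        if start + window_size >= len(items):
            break  # the last image is already covered
        start += step
    return windows
```

With five pages, window_size=2 and window_overlap=1, this yields four windows in which each page after the first also appears at the start of the next window.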

structured_extract(document, max_section_depth=6)#

Splits a document (txt/md) into a structured hierarchy of sections and texts

Parameters:

document (DocumentRef) – document to split
max_section_depth (int) – Maximum depth of sections to extract - consider deeper sections as plain text. If set to 0, extract the whole document as one single section.

Returns:

Structured content of the document

Return type:

StructuredExtractorResponse

generate_pages_screenshots(document, output_managed_folder=None, offset=0, fetch_size=10, keep_fetched=True)#

Generate per-page screenshots of a document, returning an iterable over the screenshots. In most cases, a screenshot corresponds to a single page of a document.

Usage example:

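from dataikuapi.dss.document_extractor import DocumentExtractor, ManagedFolderDocumentRef

# `client` (a DSS API client) and `folder_id` (a managed folder id) are assumed to be defined
doc_extractor = DocumentExtractor(client, "project_key")
document_ref = ManagedFolderDocumentRef('path_in_folder/document.pdf', folder_id)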

for image in doc_extractor.generate_pages_screenshots(document_ref):
    print(image.get_raw())

Parameters:

document (DocumentRef) – input document (txt | md | docx | pdf).
output_managed_folder (str) – id of a managed folder to store the generated screenshots as png. When unspecified, return inline images in the response.
offset (int) – start extraction from offset screenshots.
fetch_size (int) – number of screenshots to fetch in each request, iterating on the next result automatically sends a new request for another fetch_size screenshots
keep_fetched (boolean) – whether to keep previous screenshots requests within this response object when fetching next ones.

Returns:

An iterable over the result screenshots

Return type:

ScreenshotterResponse

class dataikuapi.dss.document_extractor.ScreenshotterResponse(client, project_key, screenshotter_request, keep_fetched)#

A handle to interact with a screenshotter result. Iterable over the ImageRef screenshots.

Important

Do not create this class directly, use generate_pages_screenshots() instead.

get_raw()#

fetch_screenshot(screenshot_index)#

property success#

Returns:: The outcome of the extractor request / latest fetch.
Return type:: bool

property has_next#

Returns:: Whether there are more screenshots to extract after this response
Return type:: bool

property total_count#

Returns:: Total number of screenshots that can be extracted from the document. In most cases corresponds to the number of pages of the document.
Return type:: int

property document#

Returns:: The reference to the screenshotted document.
Return type:: DocumentRef

class dataikuapi.dss.document_extractor.StructuredExtractorResponse(data)#

A handle to interact with a document structured extractor result.

Important

Do not create this class directly, use structured_extract() instead.

get_raw()#

property success#

Returns:: The outcome of the structured extractor request.
Return type:: bool

property content#

Returns:: The structure of the document as a dictionary
Return type:: dict

property text_chunks#

Returns:: A flattened text-only view of the document, along with its outline.
Return type:: list[dict]

class dataikuapi.dss.document_extractor.VlmExtractorResponse(data)#

A handle to interact with a VLM extractor result.

Important

Do not create this class directly, use vlm_extract() instead.

get_raw()#

property success#

Returns:: The outcome of the extractor request.
Return type:: bool

property chunks#

Content extracted from the original document, split into chunks

Returns:: extracted text content per chunk.
Return type:: list[str]
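The screenshot and VLM extraction steps can be chained as in the sketch below: the pages of a document are rendered to images, then described by a vision LLM in overlapping windows of two pages. The helper name, the window values, and the error handling are illustrative assumptions. doc_extractor is assumed to be a DocumentExtractor and document_ref a DocumentRef implementation.

```python
def describe_document(doc_extractor, document_ref, llm_id, prompt=None):
    """Return the text chunks a vision LLM extracts from a document's pages."""
    # Iterating the response fetches the screenshots batch by batch.
    images = list(doc_extractor.generate_pages_screenshots(document_ref))
    response = doc_extractor.vlm_extract(
        images, llm_id, llm_prompt=prompt, window_size=2, window_overlap=1)
    if not response.success:
        raise RuntimeError("The VLM extraction request failed")
    return response.chunks
```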

class dataikuapi.dss.document_extractor.DocumentRef#

A reference to a document file.

Important

Do not create this class directly, use one of its implementations:

LocalFileDocumentRef for a local file to be uploaded
ManagedFolderDocumentRef for a file inside a DSS-managed folder

as_json()#

class dataikuapi.dss.document_extractor.LocalFileDocumentRef(fp)#

A reference to a client-local file.

Usage example:

with open("/Users/mdupont/document.pdf", "rb") as f:
    file_ref = LocalFileDocumentRef(f)

    # upload the document & generate images of the document's pages:
    images = list(doc_ex.generate_pages_screenshots(file_ref))

as_json()#

class dataikuapi.dss.document_extractor.ManagedFolderDocumentRef(file_path, managed_folder_id)#

A reference to a file in a DSS-managed folder.

Usage example:

file_ref = ManagedFolderDocumentRef('path_in_folder/document.pdf', folder_id)

# generate images of the document's pages:
resp = doc_ex.generate_pages_screenshots(file_ref)

as_json()#

class dataikuapi.dss.document_extractor.ImageRef#

A reference to a single image

Important

Do not create this class directly, use one of its implementations:

InlineImageRef for an inline (bytes / base64 string) image
ManagedFolderImageRef for an image stored in a DSS-managed folder

as_json()#

class dataikuapi.dss.document_extractor.InlineImageRef(image, mime_type=None)#

A reference to an inline image.

Usage example:

with open("/Users/mdupont/image.jpg", "rb") as f:
    image_ref = InlineImageRef(f.read())

# Extract a text summary from the image using a vision LLM:
resp = doc_ex.vlm_extract([image_ref], 'llm_id')

as_json()#

class dataikuapi.dss.document_extractor.ManagedFolderImageRef(managed_folder_id, image_path)#

A reference to an image stored in a DSS-managed folder.

Usage example:

managed_img = ManagedFolderImageRef('managed_folder_id', 'path_in_folder/image.png')

# Extract a text summary from the image using a vision LLM:
resp = doc_ex.vlm_extract([managed_img], 'llm_id')

as_json()#