Document Extraction For LLM Workflows#

This page groups document and image extraction helpers that are adjacent to LLM workflows.

Extractor#

class dataikuapi.dss.document_extractor.DocumentExtractor(client, project_key)#

A handle to interact with a DSS-managed Document Extractor.

vlm_extract(images, llm_id, llm_prompt=None, window_size=1, window_overlap=0)#

Extract text content from images using a vision LLM: for each group of ‘window_size’ consecutive images, prompt the given vision LLM to summarize in plain text.

Parameters:
  • images (iterable(InlineImageRef) | iterable(ManagedFolderImageRef)) – iterable over the images to be described by the vision LLM

  • llm_id (str) – the identifier of a vision LLM

  • llm_prompt (str) – Custom prompt to extract text from the images

  • window_size (int) – Number of consecutive images to represent in a single output. Use -1 for all images.

  • window_overlap (int) – Number of overlapping images between two windows of images. Must be less than window_size.

Returns:

Extracted text content per group of images

Return type:

VlmExtractorResponse
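
To make the interaction between window_size and window_overlap concrete, here is a minimal pure-Python sketch of how images could be grouped into windows. The helper window_indices is hypothetical and not part of dataikuapi, and it assumes consecutive windows advance by window_size minus window_overlap images; the grouping DSS actually applies, in particular at the end of the image list, may differ.

```python
def window_indices(n_images, window_size=1, window_overlap=0):
    """Return the image indices covered by each window (illustrative only)."""
    if window_size == -1:
        # -1 means a single window covering every image.
        return [list(range(n_images))] if n_images else []
    if window_size < 1:
        raise ValueError("window_size must be -1 or a positive integer")
    if not 0 <= window_overlap < window_size:
        raise ValueError("window_overlap must be >= 0 and less than window_size")

    step = window_size - window_overlap  # assumed stride between two windows
    windows = []
    start = 0
    while start < n_images:
        windows.append(list(range(start, min(start + window_size, n_images))))
        if start + window_size >= n_images:
            break  # the last image has been covered
        start += step
    return windows


# Five images, three per window, one image shared between neighbouring windows.
print(window_indices(5, window_size=3, window_overlap=1))  # [[0, 1, 2], [2, 3, 4]]
```

Under this assumption, each returned window would correspond to one entry in the chunks of the VlmExtractorResponse.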

vlm_extract_fields(images, schema=None, llm_id=None, llm_prompt=None, from_recipe=None, strict=None, compatible=None)#

Extract specific fields (structured data) from images (typically screenshots of a document’s pages) using a vision LLM. Describe the expected fields in schema, or specify an Extract Fields recipe with from_recipe to reuse its settings.

Parameters:
  • images (iterable(InlineImageRef) | iterable(ManagedFolderImageRef)) – screenshots of the document’s pages from which to extract the fields

  • schema (str | dict | pydantic.BaseModel class | Python type hint) – a JSON schema, a Pydantic model class, or a Python type hint describing the fields to extract. JSON schema definitions or Pydantic models referencing other models are unsupported.

  • llm_id (str) – Identifier of a vision LLM

  • llm_prompt (str) – Custom prompt to extract fields from the images

  • from_recipe (str) – Name of a recipe from which to read the other arguments. Arguments provided explicitly take precedence.

  • strict (bool) – Whether to strictly enforce the schema. Support varies across models/providers.

  • compatible (bool) – Allow DSS to modify the schema in order to increase compatibility, depending on known limitations of the model/provider. Defaults to automatic.

Returns:

Extracted fields from images

Return type:

FieldsVlmExtractorResponse
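
Since schemas that reference other definitions are unsupported, it can help to check a JSON schema before sending it. The sketch below shows a flat, hypothetical invoice schema and a hypothetical helper, uses_references, that looks for "$ref" and "$defs" keys. Treating those two keys as the unsupported constructs is an interpretation of the note on the schema parameter, not a documented rule.

```python
import json

# Hypothetical invoice schema: nested objects are written inline,
# with no "$ref" or "$defs" entries.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number"],
}


def uses_references(schema):
    """Return True if a JSON schema (dict or JSON string) contains "$ref" or "$defs"."""
    root = json.loads(schema) if isinstance(schema, str) else schema

    def walk(node):
        if isinstance(node, dict):
            if "$ref" in node or "$defs" in node:
                return True
            return any(walk(value) for value in node.values())
        if isinstance(node, list):
            return any(walk(item) for item in node)
        return False

    return walk(root)


print(uses_references(INVOICE_SCHEMA))  # False
```

A schema that passes this check could then be passed as the schema argument of vlm_extract_fields(), together with the images and the identifier of a vision LLM.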

structured_extract(document, max_section_depth=6, image_handling_mode='IGNORE', ocr_engine='AUTO', languages='en', llm_id=None, llm_prompt=None, output_managed_folder=None, image_validation=True)#

Split a document (txt, md, pdf, docx, pptx, html, png, jpg, jpeg) into a structured hierarchy of sections and texts.

Parameters:
  • document (DocumentRef) – document to split

  • max_section_depth (int) – Maximum depth of sections to extract - consider deeper sections as plain text. If set to 0, extract the whole document as one single section.

  • image_handling_mode (str) – How to handle images in the document. Can be one of: ‘IGNORE’, ‘OCR’, ‘VLM_ANNOTATE’.

  • ocr_engine (str) – Engine to perform the OCR, either ‘AUTO’, ‘EASYOCR’ or ‘TESSERACT’. ‘AUTO’ uses Tesseract if available, otherwise EasyOCR.

  • languages (str | list) – OCR languages to use for recognition, given as ISO 639 language codes in either a comma-separated string or a list of strings.

  • llm_id (str) – ID of the (vision-capable) LLM to use for annotating images when image_handling_mode is ‘VLM_ANNOTATE’

  • llm_prompt (str) – Custom prompt to extract text from the images

  • output_managed_folder (str) – id of a managed folder in which to store the images found in the document. When unspecified, return inline images in the response.

  • image_validation (boolean) – Whether to validate images before processing. If True, images classified as barcodes, icons, logos, QR codes, signatures, or stamps are skipped.

Returns:

Structured content of the document

Return type:

StructuredExtractorResponse
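
As an illustration of how the image handling options fit together, the following sketch wraps structured_extract() in a small helper. The helper name extract_text_chunks is hypothetical. It assumes doc_extractor is a DocumentExtractor and document_ref a DocumentRef, and it only switches to ‘VLM_ANNOTATE’ when a vision LLM id is supplied.

```python
def extract_text_chunks(doc_extractor, document_ref, llm_id=None, max_section_depth=6):
    """Run structured_extract() and return its flattened text chunks.

    Hypothetical helper: `doc_extractor` is expected to be a DocumentExtractor
    and `document_ref` a DocumentRef.
    """
    # Annotate images with a vision LLM only when one is supplied;
    # otherwise images in the document are ignored.
    mode = "VLM_ANNOTATE" if llm_id else "IGNORE"
    response = doc_extractor.structured_extract(
        document_ref,
        max_section_depth=max_section_depth,
        image_handling_mode=mode,
        llm_id=llm_id,
    )
    if not response.success:
        raise RuntimeError("structured_extract did not succeed")
    return response.text_chunks
```

Raising on an unsuccessful response is a design choice of this sketch; callers that prefer to inspect the failure could return the response object instead and read get_raw().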

text_extract(document, image_handling_mode='IGNORE', ocr_engine='AUTO', languages='en')#

Extract raw text from a document (txt, md, pdf, docx, pptx, html, png, jpg, jpeg).

Some documents, like PDF or PowerPoint files, have an inherent structure (pages, bookmarks or slides); for those documents, the returned results contain this structure. Otherwise, the document’s structure is not inferred, resulting in one or more text items.

PDF files are converted to images and processed using OCR if image_handling_mode is set to ‘OCR’, recommended for scanned PDFs. Otherwise, their text content is extracted.

Parameters:
  • document (DocumentRef) – document to extract text from

  • image_handling_mode (str) – How to handle images in the document, either ‘IGNORE’ or ‘OCR’.

  • ocr_engine (str) – Engine to perform the OCR, either ‘AUTO’, ‘EASYOCR’ or ‘TESSERACT’. ‘AUTO’ uses Tesseract if available, otherwise EasyOCR.

  • languages (str | list) – OCR languages to use for recognition, given as ISO 639 language codes in either a comma-separated string or a list of strings.

Returns:

Text content of the document

Return type:

TextExtractorResponse
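
The sketch below shows one way to call text_extract() for scanned and non-scanned documents. Both helpers, normalize_languages and extract_text, are hypothetical. The sketch assumes doc_extractor is a DocumentExtractor and document_ref a DocumentRef, and it turns the two accepted forms of languages into a list before the call.

```python
def normalize_languages(languages):
    """Return a list of language codes from a comma-separated string or a list."""
    if isinstance(languages, str):
        languages = languages.split(",")
    # Drop surrounding whitespace and empty entries such as a trailing comma.
    return [code.strip() for code in languages if code.strip()]


def extract_text(doc_extractor, document_ref, scanned=False, languages="en"):
    """Run text_extract() and return the document text as a single string.

    Hypothetical helper: `doc_extractor` is expected to be a DocumentExtractor
    and `document_ref` a DocumentRef.
    """
    # OCR is the recommended mode for scanned PDFs; otherwise images are ignored.
    mode = "OCR" if scanned else "IGNORE"
    response = doc_extractor.text_extract(
        document_ref,
        image_handling_mode=mode,
        languages=normalize_languages(languages),
    )
    if not response.success:
        raise RuntimeError("text_extract did not succeed")
    return response.text_content


print(normalize_languages("en, fr"))  # ['en', 'fr']
```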

generate_pages_screenshots(document, output_managed_folder=None, offset=0, fetch_size=10, keep_fetched=True, page_dpi=None, max_memory_per_document=None)#

Generate per-page screenshots of a document, returning an iterable over the screenshots.

Usage example:

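import dataiku
from dataikuapi.dss.document_extractor import DocumentExtractor, ManagedFolderDocumentRef

client = dataiku.api_client()  # assumes the code runs inside DSS
folder_id = "folder_id"  # placeholder: id of the managed folder holding the document
doc_extractor = DocumentExtractor(client, "project_key")
document_ref = ManagedFolderDocumentRef('path_in_folder/document.pdf', folder_id)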

fetch_size = 10
response = doc_extractor.generate_pages_screenshots(document_ref, fetch_size=fetch_size)
# The first 10 screenshots (fetch_size) are computed & retrieved immediately within the response.

first_screenshot = response.fetch_screenshot(0)  # InlineImageRef or ManagedFolderImageRef

# Iterating through the first 10 items is instantaneous as they are already fetched.
# Iterating from the 11th item triggers new backend requests (processing pages 11-20, fetch screenshots).
for idx, screenshot in enumerate(response):
    if (idx % fetch_size == 0) and idx != 0:
        print(f"Computing the next {fetch_size} screenshots")
    print(f"Screenshot #{idx}: {screenshot.as_dict()}")

# Alternatively, response being an iterable, you can compute & fetch all screenshots at once:
response = doc_extractor.generate_pages_screenshots(document_ref)
screenshots = list(response)  # list of InlineImageRef or ManagedFolderImageRef objects

Parameters:
  • document (DocumentRef) – input document (txt | pdf | docx | doc | odt | pptx | ppt | odp | xlsx | xls | xlsm | xlsb | ods | png | jpg | jpeg).

  • output_managed_folder (str) – id of a managed folder to store the generated screenshots as png. When unspecified, return inline images in the response.

  • offset (int) – number of screenshots to skip before starting the extraction.

  • fetch_size (int) – number of screenshots to fetch in each request. Iterating past the fetched results automatically sends a new request for another fetch_size screenshots.

  • keep_fetched (boolean) – whether to keep previously fetched screenshots within this response object when fetching the next ones.

  • page_dpi (int) – DPI used to render pages if memory allows.

  • max_memory_per_document (int) – maximum memory budget in MB used while rendering a document. The effective DPI may be reduced to fit this limit depending on page dimensions.

Returns:

An iterable over the result screenshots

Return type:

ScreenshotterResponse
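
Because the screenshots are InlineImageRef or ManagedFolderImageRef objects, they can be passed directly to vlm_extract(). The following sketch chains the two calls. The helper name describe_document_pages and its pages_per_chunk argument are hypothetical, and doc_extractor is assumed to be a DocumentExtractor.

```python
def describe_document_pages(doc_extractor, document_ref, llm_id, pages_per_chunk=1):
    """Screenshot every page of a document, then describe the pages with a vision LLM.

    Hypothetical helper: `doc_extractor` is expected to be a DocumentExtractor,
    `document_ref` a DocumentRef and `llm_id` the identifier of a vision LLM.
    """
    # Iterating over the response computes and fetches all screenshots.
    screenshots = list(doc_extractor.generate_pages_screenshots(document_ref))

    # Each group of `pages_per_chunk` consecutive screenshots yields one text chunk.
    response = doc_extractor.vlm_extract(
        screenshots, llm_id, window_size=pages_per_chunk
    )
    if not response.success:
        raise RuntimeError("vlm_extract did not succeed")
    return response.chunks
```

Materializing all screenshots in a list keeps the sketch short. For long documents, iterating over the response in batches may be preferable to limit memory use.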

convert_to_pdf(document, output_managed_folder=None, path_in_output_folder=None)#

Convert a document to PDF format.

Parameters:
  • document (DocumentRef) – input document (docx | doc | odt | pptx | ppt | odp | xlsx | xls | xlsm | xlsb | ods | png | jpg | jpeg).

  • output_managed_folder (str) – id of an optional managed folder to store the generated PDF document. If unspecified, the document is not stored and should be downloaded from the returned PDFConversionResponse.

  • path_in_output_folder (str) – optional path of the generated PDF document in the output managed folder. If unspecified and the input document is in a managed folder, defaults to the input document path (with a .pdf extension).

Returns:

A PDFConversionResponse, to reference & download the resulting PDF.

Return type:

PDFConversionResponse

Responses#

class dataikuapi.dss.document_extractor.PDFConversionResponse(client, project_key, document, output_managed_folder, path_in_output_folder=None)#

A handle to interact with a document PDF conversion result.

Important

Do not create this class directly, use convert_to_pdf() instead.

get_raw()#
stream()#

Download the converted PDF as a binary stream.

Returns:

The converted PDF file as a binary stream.

Return type:

requests.Response

download_to_file(path)#

Download the converted PDF to a local file.

Parameters:

path (str) – the path where to download the PDF file

Returns:

None

property document#
Returns:

The reference to the stored PDF if applicable, otherwise None

Return type:

ManagedFolderDocumentRef

property success#
Returns:

The outcome of the PDF conversion request.

Return type:

bool
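
When no output managed folder is given, the converted PDF has to be downloaded from the response. The sketch below shows that flow with a hypothetical helper, convert_and_download, assuming doc_extractor is a DocumentExtractor and document_ref a DocumentRef.

```python
def convert_and_download(doc_extractor, document_ref, local_path):
    """Convert a document to PDF and save the result to a local file.

    Hypothetical helper: `doc_extractor` is expected to be a DocumentExtractor
    and `document_ref` a DocumentRef.
    """
    # No output managed folder is given, so the PDF is not stored in DSS.
    response = doc_extractor.convert_to_pdf(document_ref)
    if not response.success:
        raise RuntimeError("PDF conversion did not succeed")

    # Write the converted PDF to the given local path.
    response.download_to_file(local_path)
    return local_path
```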

class dataikuapi.dss.document_extractor.ScreenshotterResponse(client, project_key, screenshotter_request, keep_fetched)#

A handle to interact with a screenshotter result. Iterable over the ImageRef screenshots.

Important

Do not create this class directly, use generate_pages_screenshots() instead.

get_raw()#
fetch_screenshot(screenshot_index)#
property success#
Returns:

The outcome of the extractor request / latest fetch.

Return type:

bool

property has_next#
Returns:

Whether there are more screenshots to extract after this response

Return type:

bool

property total_count#
Returns:

Total number of screenshots that can be extracted from the document. In most cases corresponds to the number of pages of the document.

Return type:

int

property document#
Returns:

The reference to the screenshotted document.

Return type:

DocumentRef

class dataikuapi.dss.document_extractor.TextExtractorResponse(data)#

A handle to interact with a document text extractor result.

Important

Do not create this class directly, use text_extract() instead.

get_raw()#
property success#
Returns:

The outcome of the text extraction request.

Return type:

bool

property content#

The content of the document as extracted by text_extract() can contain some structure inherent to the document. For instance, PDF documents are extracted page by page, and PowerPoint documents slide by slide. Some PDF documents contain bookmarks that can be used to separate them into sections. Other documents are returned as a single section with one or more text items.

This property returns a dict that represents this structure.

Returns:

The structure of the document as a dictionary

Return type:

dict

property text_content#
Returns:

The textual content of the document as a string.

Return type:

str

class dataikuapi.dss.document_extractor.StructuredExtractorResponse(data)#

A handle to interact with a document structured extractor result.

Important

Do not create this class directly, use structured_extract() instead.

get_raw()#
property success#
Returns:

The outcome of the structured extractor request.

Return type:

bool

property content#
Returns:

The structure of the document as a dictionary

Return type:

dict

property text_chunks#
Returns:

A flattened text-only view of the documents, along with their outline.

Return type:

list[dict]

class dataikuapi.dss.document_extractor.VlmExtractorResponse(data)#

A handle to interact with a VLM extractor result.

Important

Do not create this class directly, use vlm_extract() instead.

get_raw()#
property success#
Return type:

bool

property chunks#

Content extracted from the original document, split into chunks

Returns:

extracted text content per chunk.

Return type:

list[str]

class dataikuapi.dss.document_extractor.FieldsVlmExtractorResponse(data, response_parser=None)#

A handle to interact with a VLM fields extraction result.

Important

Do not create this class directly, use vlm_extract_fields() instead.

get_raw()#
property success#
Return type:

bool

property fields#

Fields extracted from the original document. Follows the structure of the extraction schema and contains only the fields that conform to it.

Returns:

extracted fields.

Return type:

dict

property parsed_fields#

Fields extracted from the original document. Follows the structure of the extraction schema and contains only the fields that conform to it. Only available when the extraction schema is given as a Pydantic model or a Python type hint.

Returns:

extracted fields deserialized into a Pydantic model instance.

Return type:

pydantic.BaseModel

property invalid_fields#

Fields in the extraction schema that the Vision LLM could not extract. Follows the structure/hierarchy of the extraction schema, but has only the incorrect or missing fields.

Return type:

dict
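
Since invalid_fields mirrors the hierarchy of the extraction schema, flattening it into dotted paths can make missing or incorrect fields easier to log. The sketch below is a minimal illustration. The helpers flatten_paths and report_invalid_fields are hypothetical, and the sketch assumes invalid_fields is a plain nested dict whose leaf values can be ignored.

```python
def flatten_paths(nested, prefix=""):
    """Return dotted paths to every leaf of a nested dict."""
    paths = []
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            # Descend into non-empty sub-dicts to reach the leaves.
            paths.extend(flatten_paths(value, path))
        else:
            paths.append(path)
    return paths


def report_invalid_fields(response):
    """List the schema fields a FieldsVlmExtractorResponse marks as invalid."""
    if not response.success:
        raise RuntimeError("vlm_extract_fields did not succeed")
    return flatten_paths(response.invalid_fields or {})


print(flatten_paths({"customer": {"address": None}, "total_amount": None}))
# ['customer.address', 'total_amount']
```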

Input references#

class dataikuapi.dss.document_extractor.DocumentRef(mime_type=None)#

A reference to a document file.

Important

Do not create this class directly, use one of its implementations: LocalFileDocumentRef or ManagedFolderDocumentRef.
as_dict()#
class dataikuapi.dss.document_extractor.LocalFileDocumentRef(fp, mime_type=None)#

A reference to a client-local file.

Usage example:

with open("/Users/mdupont/document.pdf", "rb") as f:
    file_ref = LocalFileDocumentRef(f)

    # upload the document & generate images of the document's pages:
    images = list(doc_ex.generate_pages_screenshots(file_ref))
as_json()#

Get a dictionary representation.

Caution

Deprecated, use as_dict() instead

Return type:

dict

as_dict()#
class dataikuapi.dss.document_extractor.ManagedFolderDocumentRef(file_path, managed_folder_id, mime_type=None)#

A reference to a file in a DSS-managed folder.

Usage example:

file_ref = ManagedFolderDocumentRef('path_in_folder/document.pdf', folder_id)

# generate images of the document's pages:
resp = doc_ex.generate_pages_screenshots(file_ref)
property managed_folder_id#
as_json()#

Get a dictionary representation.

Caution

Deprecated, use as_dict() instead

Return type:

dict

as_dict()#

Get a dictionary representation.

Return type:

dict

class dataikuapi.dss.document_extractor.ImageRef#

A reference to a single image.

Important

Do not create this class directly, use one of its implementations: InlineImageRef or ManagedFolderImageRef.
as_dict()#
class dataikuapi.dss.document_extractor.InlineImageRef(image, mime_type=None)#

A reference to an inline image.

Usage example:

with open("/Users/mdupont/image.jpg", "rb") as f:
    image_ref = InlineImageRef(f.read())

# Extract a text summary from the image using a vision LLM:
resp = doc_ex.vlm_extract([image_ref], 'llm_id')
as_json()#

Get a dictionary representation.

Caution

Deprecated, use as_dict() instead

Return type:

dict

as_dict()#

Get a dictionary representation.

Return type:

dict

class dataikuapi.dss.document_extractor.ManagedFolderImageRef(managed_folder_ref, image_path)#

A reference to an image stored in a DSS-managed folder.

Usage example:

managed_img = ManagedFolderImageRef('managed_folder_ref', 'path_in_folder/image.png')

# Extract a text summary from the image using a vision LLM:
resp = doc_ex.vlm_extract([managed_img], 'llm_id')
property managed_folder_id#
as_json()#

Get a dictionary representation.

Caution

Deprecated, use as_dict() instead

Return type:

dict

as_dict()#

Get a dictionary representation.

Return type:

dict