Document Extraction For LLM Workflows#
This page groups document and image extraction helpers that are adjacent to LLM workflows.
Extractor#
- class dataikuapi.dss.document_extractor.DocumentExtractor(client, project_key)#
A handle to interact with a DSS-managed Document Extractor.
- vlm_extract(images, llm_id, llm_prompt=None, window_size=1, window_overlap=0)#
Extract text content from images using a vision LLM: for each group of ‘window_size’ consecutive images, prompt the given vision LLM to summarize in plain text.
- Parameters:
images (iterable(InlineImageRef) | iterable(ManagedFolderImageRef)) – iterable over the images to be described by the vision LLM
llm_id (str) – the identifier of a vision LLM
llm_prompt (str) – Custom prompt to extract text from the images
window_size (int) – Number of consecutive images to represent in a single output. Use -1 for all images.
window_overlap (int) – Number of overlapping images between two windows of images. Must be less than window_size.
- Returns:
Extracted text content per group of images
- Return type:
VlmExtractorResponse
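To make the interaction between window_size and window_overlap concrete, the sketch below groups page indices into windows. It assumes that consecutive windows advance by window_size - window_overlap images and that the last window may be shorter; the actual grouping is done by DSS and may differ in edge cases. The helper name `sketch_windows` is hypothetical and not part of dataikuapi.

```python
def sketch_windows(n_images, window_size=1, window_overlap=0):
    """Group image indices into windows (illustrative assumption, not the DSS implementation)."""
    if window_size == -1:
        # -1 means "all images in a single window"
        return [list(range(n_images))] if n_images else []
    if window_size < 1:
        raise ValueError("window_size must be -1 or a positive integer")
    if not 0 <= window_overlap < window_size:
        raise ValueError("window_overlap must be >= 0 and less than window_size")
    step = window_size - window_overlap  # assumed stride between two windows
    windows = []
    start = 0
    while start < n_images:
        windows.append(list(range(start, min(start + window_size, n_images))))
        if start + window_size >= n_images:
            break  # the last image has been covered
        start += step
    return windows


print(sketch_windows(5, window_size=2, window_overlap=1))
print(sketch_windows(5, window_size=-1))
```

Under these assumptions, a 5-page document with window_size=2 and window_overlap=1 yields four windows, each sharing one page with its neighbour.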
- vlm_extract_fields(images, schema=None, llm_id=None, llm_prompt=None, from_recipe=None, strict=None, compatible=None)#
Extract specific fields (structured data) from images (typically screenshots of a document’s pages) using a vision LLM. Describe the expected fields in schema, or specify an Extract Fields recipe to use its settings.
- Parameters:
images (iterable(InlineImageRef) | iterable(ManagedFolderImageRef)) – screenshots of the document’s pages from which to extract the fields
schema (str | dict | pydantic.BaseModel class | Python type hint) – a JSON schema, a Pydantic model class, or a Python type hint describing the fields to extract. JSON schema definitions or Pydantic models referencing other models are unsupported.
llm_id (str) – Identifier of a vision LLM
llm_prompt (str) – Custom prompt to extract fields from the images
from_recipe (str) – Name of a recipe from which to read the other arguments. Arguments provided explicitly take precedence.
strict (bool) – Whether to strictly enforce the schema. Support varies across models/providers.
compatible (bool) – Allow DSS to modify the schema in order to increase compatibility, depending on known limitations of the model/provider. Defaults to automatic.
- Returns:
Extracted fields from images
- Return type:
FieldsVlmExtractorResponse
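As an illustration of the schema argument, the sketch below passes a flat JSON schema as a dict and returns both the valid and the invalid fields. The schema contents, the constant `INVOICE_SCHEMA`, and the helper `extract_invoice_fields` are hypothetical; only the `vlm_extract_fields()` arguments and the `success`, `fields`, and `invalid_fields` properties come from this page.

```python
# Hypothetical flat JSON schema: no references to other definitions.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}


def extract_invoice_fields(doc_ex, images, llm_id):
    """Call vlm_extract_fields and return (fields, invalid_fields).

    doc_ex is assumed to be a DocumentExtractor; images is an iterable of
    InlineImageRef or ManagedFolderImageRef objects.
    """
    resp = doc_ex.vlm_extract_fields(images, schema=INVOICE_SCHEMA, llm_id=llm_id)
    if not resp.success:
        raise RuntimeError("Field extraction failed")
    return resp.fields, resp.invalid_fields
```

Returning invalid_fields alongside fields lets the caller decide whether a partially extracted document is acceptable.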
- structured_extract(document, max_section_depth=6, image_handling_mode='IGNORE', ocr_engine='AUTO', languages='en', llm_id=None, llm_prompt=None, output_managed_folder=None, image_validation=True)#
Split a document (txt, md, pdf, docx, pptx, html, png, jpg, jpeg) into a structured hierarchy of sections and texts.
- Parameters:
document (DocumentRef) – document to split
max_section_depth (int) – Maximum depth of sections to extract; deeper sections are treated as plain text. If set to 0, extract the whole document as one single section.
image_handling_mode (str) – How to handle images in the document. Can be one of: ‘IGNORE’, ‘OCR’, ‘VLM_ANNOTATE’.
ocr_engine (str) – Engine to perform the OCR, either ‘AUTO’, ‘EASYOCR’ or ‘TESSERACT’. Auto uses tesseract if available, otherwise easyOCR.
languages (str | list) – OCR languages to use for recognition, given as ISO 639 language codes in either a comma-separated string or a list of strings.
llm_id (str) – ID of the (vision-capable) LLM to use for annotating images when image_handling_mode is ‘VLM_ANNOTATE’
llm_prompt (str) – Custom prompt to extract text from the images
output_managed_folder (str) – id of a managed folder in which to store the images found in the document. When unspecified, return inline images in the response.
image_validation (boolean) – Whether to validate images before processing. If True, images classified as barcodes, icons, logos, QR codes, signatures, or stamps are skipped.
- Returns:
Structured content of the document
- Return type:
StructuredExtractorResponse
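The parameters above interact: llm_id is only used when image_handling_mode is ‘VLM_ANNOTATE’. The sketch below checks this client-side before calling `structured_extract()`. The wrapper name `structured_extract_checked` is hypothetical, and treating a missing llm_id as an error is an assumption made here, not documented DSS behaviour.

```python
IMAGE_HANDLING_MODES = ("IGNORE", "OCR", "VLM_ANNOTATE")


def structured_extract_checked(doc_ex, document, image_handling_mode="IGNORE",
                               llm_id=None, **kwargs):
    """Validate arguments, call structured_extract, and return the text chunks.

    doc_ex is assumed to be a DocumentExtractor and document a DocumentRef.
    """
    if image_handling_mode not in IMAGE_HANDLING_MODES:
        raise ValueError(f"Unknown image_handling_mode: {image_handling_mode!r}")
    if image_handling_mode == "VLM_ANNOTATE" and llm_id is None:
        # Assumption: annotating images requires a vision-capable LLM id.
        raise ValueError("llm_id is required when image_handling_mode is 'VLM_ANNOTATE'")
    resp = doc_ex.structured_extract(
        document,
        image_handling_mode=image_handling_mode,
        llm_id=llm_id,
        **kwargs,
    )
    if not resp.success:
        raise RuntimeError("Structured extraction failed")
    return resp.text_chunks
```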
- text_extract(document, image_handling_mode='IGNORE', ocr_engine='AUTO', languages='en')#
Extract raw text from a document (txt, md, pdf, docx, pptx, html, png, jpg, jpeg).
Some documents like PDF or PowerPoint have an inherent structure (page, bookmarks or slides); for those documents, the returned results contain this structure. Otherwise, the document’s structure is not inferred, resulting in one or more text item(s).
PDF files are converted to images and processed using OCR if image_handling_mode is set to ‘OCR’, recommended for scanned PDFs. Otherwise, their text content is extracted.
- Parameters:
document (DocumentRef) – document from which to extract text
image_handling_mode (str) – How to handle images in the document, either ‘IGNORE’ or ‘OCR’.
ocr_engine (str) – Engine to perform the OCR, either ‘AUTO’, ‘EASYOCR’ or ‘TESSERACT’. Auto uses tesseract if available, otherwise easyOCR.
languages (str | list) – OCR languages to use for recognition, given as ISO 639 language codes in either a comma-separated string or a list of strings.
- Returns:
Text content of the document
- Return type:
TextExtractorResponse
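Since ‘OCR’ is recommended for scanned PDFs, a caller may want to switch modes per document. The sketch below does so with a hypothetical `extract_text` wrapper; the `scanned` flag is an assumption of this example (DSS does not detect scans for you here), while the `text_extract()` arguments and the `success` and `text_content` properties are taken from this page.

```python
def extract_text(doc_ex, document, scanned=False, languages=("en",)):
    """Return the document's text, using OCR when the document is a scan.

    doc_ex is assumed to be a DocumentExtractor and document a DocumentRef.
    """
    mode = "OCR" if scanned else "IGNORE"
    resp = doc_ex.text_extract(
        document,
        image_handling_mode=mode,
        ocr_engine="AUTO",
        languages=list(languages),  # a list of ISO 639 codes is accepted
    )
    if not resp.success:
        raise RuntimeError("Text extraction failed")
    return resp.text_content
```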
- generate_pages_screenshots(document, output_managed_folder=None, offset=0, fetch_size=10, keep_fetched=True, page_dpi=None, max_memory_per_document=None)#
Generate per-page screenshots of a document, returning an iterable over the screenshots.
Usage example:
    from dataikuapi.dss.document_extractor import DocumentExtractor, ManagedFolderDocumentRef

    # client and folder_id are assumed to be defined beforehand
    doc_extractor = DocumentExtractor(client, "project_key")
    document_ref = ManagedFolderDocumentRef('path_in_folder/document.pdf', folder_id)

    fetch_size = 10
    response = doc_extractor.generate_pages_screenshots(document_ref, fetch_size=fetch_size)

    # The first 10 screenshots (fetch_size) are computed & retrieved immediately within the response.
    first_screenshot = response.fetch_screenshot(0)  # InlineImageRef or ManagedFolderImageRef

    # Iterating through the first 10 items is instantaneous as they are already fetched.
    # Iterating from the 11th item triggers new backend requests (processing pages 11-20, fetch screenshots).
    for idx, screenshot in enumerate(response):
        if (idx % fetch_size == 0) and idx != 0:
            print(f"Computing the next {fetch_size} screenshots")
        print(f"Screenshot #{idx}: {screenshot.as_dict()}")

    # Alternatively, response being an iterable, you can compute & fetch all screenshots at once:
    response = doc_extractor.generate_pages_screenshots(document_ref)
    screenshots = list(response)  # list of InlineImageRef or ManagedFolderImageRef objects
- Parameters:
document (DocumentRef) – input document (txt | pdf | docx | doc | odt | pptx | ppt | odp | xlsx | xls | xlsm | xlsb | ods | png | jpg | jpeg).
output_managed_folder (str) – id of a managed folder to store the generated screenshots as png. When unspecified, return inline images in the response.
offset (int) – number of screenshots to skip before starting the extraction.
fetch_size (int) – number of screenshots to fetch in each request. Iterating past the already fetched results automatically sends a new request for another fetch_size screenshots.
keep_fetched (boolean) – whether to keep previous screenshots requests within this response object when fetching next ones.
page_dpi (int) – DPI used to render pages if memory allows.
max_memory_per_document (int) – maximum memory budget in MB used while rendering a document. The effective DPI may be reduced to fit this limit depending on page dimensions.
- Returns:
An iterable over the result screenshots
- Return type:
ScreenshotterResponse
- convert_to_pdf(document, output_managed_folder=None, path_in_output_folder=None)#
Convert a document to PDF format.
- Parameters:
document (DocumentRef) – input document (docx | doc | odt | pptx | ppt | odp | xlsx | xls | xlsm | xlsb | ods | png | jpg | jpeg).
output_managed_folder (str) – id of an optional managed folder to store the generated PDF document. If unspecified, the document is not stored and should be downloaded from the returned PDFConversionResponse.
path_in_output_folder (str) – optional path of the generated PDF document in the output managed folder. If unspecified and the input document is in a managed folder, defaults to the input document path (with a .pdf extension).
- Returns:
A PDFConversionResponse, to reference & download the resulting PDF.
- Return type:
PDFConversionResponse
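When no output managed folder is given, the converted PDF has to be downloaded from the response. The sketch below shows that flow with a hypothetical `convert_and_download` helper; it relies only on `convert_to_pdf()` and on the `success` property and `download_to_file()` method of PDFConversionResponse described on this page.

```python
def convert_and_download(doc_ex, document, local_path):
    """Convert a document to PDF and save the result to local_path.

    doc_ex is assumed to be a DocumentExtractor and document a DocumentRef.
    """
    # No output_managed_folder: the PDF is not stored in DSS.
    resp = doc_ex.convert_to_pdf(document)
    if not resp.success:
        raise RuntimeError("PDF conversion failed")
    resp.download_to_file(local_path)
    return local_path
```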
Responses#
- class dataikuapi.dss.document_extractor.PDFConversionResponse(client, project_key, document, output_managed_folder, path_in_output_folder=None)#
A handle to interact with a document PDF conversion result.
Important
Do not create this class directly, use convert_to_pdf() instead.
- get_raw()#
- stream()#
Download the converted PDF as a binary stream.
- Returns:
The converted PDF file as a binary stream.
- Return type:
requests.Response
- download_to_file(path)#
Download the converted PDF to a local file.
- Parameters:
path (str) – the path where to download the PDF file
- Returns:
None
- property document#
- Returns:
The reference to the stored PDF if applicable, otherwise None
- Return type:
- property success#
- Returns:
The outcome of the PDF conversion request.
- Return type:
bool
- class dataikuapi.dss.document_extractor.ScreenshotterResponse(client, project_key, screenshotter_request, keep_fetched)#
A handle to interact with a screenshotter result. Iterable over the ImageRef screenshots.
Important
Do not create this class directly, use generate_pages_screenshots() instead.
- get_raw()#
- fetch_screenshot(screenshot_index)#
- property success#
- Returns:
The outcome of the extractor request / latest fetch.
- Return type:
bool
- property has_next#
- Returns:
Whether there are more screenshots to extract after this response
- Return type:
bool
- property total_count#
- Returns:
Total number of screenshots that can be extracted from the document. In most cases corresponds to the number of pages of the document.
- Return type:
int
- property document#
- Returns:
The reference to the screenshotted document.
- Return type:
- class dataikuapi.dss.document_extractor.TextExtractorResponse(data)#
A handle to interact with a document text extractor result.
Important
Do not create this class directly, use text_extract() instead.
- get_raw()#
- property success#
- Returns:
The outcome of the text extraction request.
- Return type:
bool
- property content#
The content of the document as extracted by text_extract() can contain some structure inherent to the document. For instance, PDF documents are extracted page by page, and PowerPoint documents slide by slide. Some PDF documents contain bookmarks that can be used to separate them into sections. Other documents are returned as a single section with one or more text item(s).
This property returns a dict that represents this structure.
- Returns:
The structure of the document as a dictionary
- Return type:
dict
- property text_content#
- Returns:
The textual content of the document as a string.
- Return type:
str
- class dataikuapi.dss.document_extractor.StructuredExtractorResponse(data)#
A handle to interact with a document structured extractor result.
Important
Do not create this class directly, use structured_extract() instead.
- get_raw()#
- property success#
- Returns:
The outcome of the structured extractor request.
- Return type:
bool
- property content#
- Returns:
The structure of the document as a dictionary
- Return type:
dict
- property text_chunks#
- Returns:
A flattened text-only view of the documents, along with their outline.
- Return type:
list[dict]
- class dataikuapi.dss.document_extractor.VlmExtractorResponse(data)#
A handle to interact with a VLM extractor result.
Important
Do not create this class directly, use vlm_extract() instead.
- get_raw()#
- property success#
- Return type:
bool
- property chunks#
Content extracted from the original document, split into chunks
- Returns:
extracted text content per chunk.
- Return type:
list[str]
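The chunks property is typically consumed at the end of a screenshot-then-describe pipeline. The sketch below chains `generate_pages_screenshots()` and `vlm_extract()`; the helper name `describe_document` is hypothetical, and the assumption that each chunk corresponds to one window of images follows from the description of `vlm_extract()` rather than from a documented guarantee.

```python
def describe_document(doc_ex, document, llm_id, window_size=1, window_overlap=0):
    """Screenshot every page of a document, then summarize the pages with a vision LLM.

    doc_ex is assumed to be a DocumentExtractor and document a DocumentRef.
    """
    # Iterating over the response fetches all screenshots, batch by batch.
    images = list(doc_ex.generate_pages_screenshots(document))
    resp = doc_ex.vlm_extract(
        images,
        llm_id,
        window_size=window_size,
        window_overlap=window_overlap,
    )
    if not resp.success:
        raise RuntimeError("VLM extraction failed")
    return resp.chunks  # list[str], assumed to hold one entry per window of images
```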
- class dataikuapi.dss.document_extractor.FieldsVlmExtractorResponse(data, response_parser=None)#
A handle to interact with a VLM fields extraction result.
Important
Do not create this class directly, use vlm_extract_fields() instead.
- get_raw()#
- property success#
- Return type:
bool
- property fields#
Fields extracted from the original document. Follows the structure of the extraction schema, has only the fields that abide by it.
- Returns:
extracted fields.
- Return type:
dict
- property parsed_fields#
Fields extracted from the original document. Follows the structure of the extraction schema, has only the fields that abide by it. Only available for extraction schema given as a Pydantic model or using Python type hint.
- Returns:
extracted fields deserialized into a Pydantic model instance.
- Return type:
pydantic.BaseModel
- property invalid_fields#
Fields in the extraction schema that the Vision LLM could not extract. Follows the structure/hierarchy of the extraction schema, but has only the incorrect or missing fields.
- Return type:
dict
Input references#
- class dataikuapi.dss.document_extractor.DocumentRef(mime_type=None)#
A reference to a document file.
Important
- Do not create this class directly, use one of its implementations:
  - LocalFileDocumentRef for a local file to be uploaded
  - ManagedFolderDocumentRef for a file inside a DSS-managed folder
- as_dict()#
- class dataikuapi.dss.document_extractor.LocalFileDocumentRef(fp, mime_type=None)#
A reference to a client-local file.
Usage example:
    with open("/Users/mdupont/document.pdf", "rb") as f:
        file_ref = LocalFileDocumentRef(f)

        # upload the document & generate images of the document's pages:
        images = list(doc_ex.generate_pages_screenshots(file_ref))
- as_json()#
Get a dictionary representation.
Caution
Deprecated, use as_dict() instead.
- Return type:
dict
- as_dict()#
- class dataikuapi.dss.document_extractor.ManagedFolderDocumentRef(file_path, managed_folder_id, mime_type=None)#
A reference to a file in a DSS-managed folder.
Usage example:
    file_ref = ManagedFolderDocumentRef('path_in_folder/document.pdf', folder_id)

    # generate images of the document's pages:
    resp = doc_ex.generate_pages_screenshots(file_ref)
- property managed_folder_id#
- as_json()#
Get a dictionary representation.
Caution
Deprecated, use as_dict() instead.
- Return type:
dict
- as_dict()#
Get a dictionary representation.
- Return type:
dict
- class dataikuapi.dss.document_extractor.ImageRef#
A reference to a single image
Important
- Do not create this class directly, use one of its implementations:
  - InlineImageRef for an inline (bytes / base64 string) image
  - ManagedFolderImageRef for an image stored in a DSS-managed folder
- as_dict()#
- class dataikuapi.dss.document_extractor.InlineImageRef(image, mime_type=None)#
A reference to an inline image.
Usage example:
    with open("/Users/mdupont/image.jpg", "rb") as f:
        image_ref = InlineImageRef(f.read())

    # Extract a text summary from the image using a vision LLM:
    resp = doc_ex.vlm_extract([image_ref], 'llm_id')
- as_json()#
Get a dictionary representation.
Caution
Deprecated, use as_dict() instead.
- Return type:
dict
- as_dict()#
Get a dictionary representation.
- Return type:
dict
- class dataikuapi.dss.document_extractor.ManagedFolderImageRef(managed_folder_ref, image_path)#
A reference to an image stored in a DSS-managed folder.
Usage example:
    managed_img = ManagedFolderImageRef('managed_folder_ref', 'path_in_folder/image.png')

    # Extract a text summary from the image using a vision LLM:
    resp = doc_ex.vlm_extract([managed_img], 'llm_id')
- property managed_folder_id#
- as_json()#
Get a dictionary representation.
Caution
Deprecated, use as_dict() instead.
- Return type:
dict
- as_dict()#
Get a dictionary representation.
- Return type:
dict
