Performing SQL, Hive and Impala queries#

For usage information and examples, see Performing SQL, Hive and Impala queries

class dataiku.SQLExecutor2(connection=None, dataset=None)#

This is a handle to execute SQL statements on a given SQL connection.

The connection is derived from either the connection parameter or the dataset parameter.

Parameters:
  • connection (string) – name of the SQL connection

  • dataset – name of a dataset or a dataiku.Dataset object.

static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)#

Executes a SQL query and store the results to the output_dataset after dropping its underlying table.

Parameters:
  • output_dataset (object) – dataiku.Dataset output dataset where to write the result of the query.

  • query (str) – SQL main query

  • pre_queries (list) – list of queries to be executed before the main query

  • post_queries (list) – list of queries to be executed after the main query

  • overwrite_output_schema (bool) – if True, generates the output schema from the query results. If False, maintains the existing output schema

  • drop_partitioned_on_schema_mismatch (bool) – for partitioned output datasets. If True, drops all partitions whose schema is inconsistent with that of the dataset. Only relevant when overwrite_output_schema=True

Returns:

None

query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#

This function returns the result of the main query as a pandas dataframe.

Parameters:
  • query (str) – SQL main query

  • pre_queries (list) – list of queries to be executed before the main query

  • post_queries (list) – list of queries to be executed after the main query

  • extra_conf – do not use

  • infer_from_schema (bool) – if True, the resulting pandas dataframe types are set per the SQL query datatypes rather than being inferred by pandas

  • parse_dates (bool) – if True, SQL datetime columns are set as datetime dtypes in the resulting pandas dataframe. The infer_from_schema must be True for this param to be relevant

  • bool_as_str (bool) – whether to cast boolean values as string

  • dtypes (dict) – with key= column name and value=`numpy.dtype()`

  • script_steps – do not use

  • script_input_schema – do not use

  • script_output_schema – do not use

Returns:

a pandas dataframe with the result of the query.

query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#

This function returns a QueryReader to iterate on the rows.

Parameters:
  • query (str) – the main query

  • pre_queries (list) – list of queries to be executed before the main query

  • post_queries (list) – list of queries to be executed after the main query

  • script_steps – do not use

  • script_input_schema – do not use

  • script_output_schema – do not use

Returns:

a QueryReader to iterate on the rows.

class dataiku.HiveExecutor(dataset=None, database=None, connection=None)#

Execute Hive queries against the Hive server configured in DSS

Configure the executor with exactly one of dataset, database, or connection.

Parameters:
  • dataset – Name of a dataset or a dataiku.Dataset object (defaults to None).

  • database (str) – Hive database name (defaults to None).

  • connection (str) – DSS Hive connection name (defaults to None).

static exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)#

Execute a Hive query fragment in the current Python recipe.

Parameters:
  • query (str) – Hive main query.

  • pre_queries (list) – List of queries to execute before the main query (defaults to []).

  • post_queries (list) – List of queries to execute after the main query (defaults to []).

  • overwrite_output_schema (bool) – Whether to overwrite the output schema (defaults to True).

  • drop_partitioned_on_schema_mismatch (bool) – Whether to drop incompatible partitions (defaults to False).

  • metastore_handling – Metastore synchronization mode (defaults to None).

  • extra_conf (dict) – Additional Hive configuration (defaults to {}).

  • add_dku_udf (bool) – Whether to add Dataiku UDFs to the execution context (defaults to False).

Returns:

None

query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#

Return the result of a Hive query as a pandas dataframe.

Parameters:
  • query (str) – Hive main query

  • pre_queries (list) – list of queries to execute before the main query. Default: None

  • post_queries (list) – list of queries to execute after the main query. Default: None

  • extra_conf (dict) – additional Hive configuration. Default: {}

  • infer_from_schema (bool) – whether to enforce pandas dtypes from SQL schema. Default: False

  • parse_dates (bool) – whether to parse date and datetime columns. Default: True

  • bool_as_str (bool) – whether to cast boolean values as strings. Default: False

  • dtypes (dict) – explicit pandas dtypes by column name. Default: None

  • script_steps – do not use. Default: None

  • script_input_schema – do not use. Default: None

  • script_output_schema – do not use. Default: None

Returns:

a pandas dataframe with the result of the query

Raises:

ValueError – If the executor is not configured with a dataset, database, or connection.

query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#

Return a QueryReader to iterate over Hive query rows.

Parameters:
  • query (str) – Hive main query.

  • pre_queries (list) – List of queries to execute before the main query (defaults to None).

  • post_queries (list) – List of queries to execute after the main query (defaults to None).

  • extra_conf (dict) – Additional Hive configuration (defaults to {}).

  • script_steps – Do not use (defaults to None).

  • script_input_schema – Do not use (defaults to None).

  • script_output_schema – Do not use (defaults to None).

Returns:

a QueryReader to iterate on the rows.

Raises:

ValueError – If the executor is not configured with a dataset, database, or connection.

class dataiku.ImpalaExecutor(dataset=None, database=None, connection=None)#

Execute Impala queries against the Impala server configured in DSS

Configure the executor with exactly one of dataset, database, or connection.

Parameters:
  • dataset – Name of a dataset or a dataiku.Dataset object (defaults to None).

  • database (str) – Impala database name (defaults to None).

  • connection (str) – Impala connection name (defaults to None).

static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, use_stream_mode=True)#

Execute an Impala query fragment in the current recipe activity.

Parameters:
  • output_dataset (object) – dataiku.Dataset output dataset.

  • query (str) – Impala main query.

  • pre_queries (list) – List of queries to execute before the main query (defaults to []).

  • post_queries (list) – List of queries to execute after the main query (defaults to []).

  • overwrite_output_schema (bool) – Whether to overwrite the output schema (defaults to True).

  • use_stream_mode (bool) – Whether to use stream mode for execution (defaults to True).

Returns:

None

query_to_df(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#

Return the result of an Impala query as a pandas dataframe.

Parameters:
  • query (str) – Impala main query.

  • pre_queries (list) – List of queries to execute before the main query (defaults to None).

  • post_queries (list) – List of queries to execute after the main query (defaults to None).

  • connection (str) – Deprecated, do not use (defaults to None).

  • extra_conf (dict) – Additional Impala configuration (defaults to {}).

  • infer_from_schema (bool) – Whether to enforce pandas dtypes from SQL schema (defaults to False).

  • parse_dates (bool) – Whether to parse date and datetime columns (defaults to True).

  • bool_as_str (bool) – Whether to cast boolean values as strings (defaults to False).

  • dtypes (dict) – Explicit pandas dtypes by column name (defaults to None).

  • script_steps – Do not use (defaults to None).

  • script_input_schema – Do not use (defaults to None).

  • script_output_schema – Do not use (defaults to None).

Returns:

A pandas dataframe with the result of the query.

Raises:

ValueError – If the executor is not configured with a dataset, database, or connection.

query_to_iter(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#

Return a QueryReader to iterate over Impala query rows.

Parameters:
  • query (str) – Impala main query.

  • pre_queries (list) – List of queries to execute before the main query (defaults to None).

  • post_queries (list) – List of queries to execute after the main query (defaults to None).

  • connection (str) – Deprecated, do not use (defaults to None).

  • extra_conf (dict) – Additional Impala configuration (defaults to {}).

  • script_steps – Do not use (defaults to None).

  • script_input_schema – Do not use (defaults to None).

  • script_output_schema – Do not use (defaults to None).

Returns:

a QueryReader to iterate on the rows.

Raises:

ValueError – If the executor is not configured with a dataset, database, or connection.

class dataikuapi.dss.sqlquery.DSSSQLQuery(client, query, connection, database, dataset_full_name, pre_queries, post_queries, type, extra_conf, script_steps, script_input_schema, script_output_schema, script_report_location, read_timestamp_without_timezone_as_string, read_date_as_string, datetimenotz_read_mode, dateonly_read_mode, project_key)

A connection to a database or database-like on which queries can be run through DSS.

Important

Do not create this class directly, instead use dataikuapi.DSSClient.sql_query()

Usage example:

# run some query on a connection
query = client.sql_query('select * from "public"."SOME_TABLE"', connection='some_postgres_connection')
n = 0
for row in query.iter_rows():
    n += 1
    if n < 10:
        print("row %s : %s" % (n, row))
query.verify()
print("Returned %s rows" % n)
get_schema()

Get the query’s result set’s schema.

The schema made of DSS column types, and built from mapping database types to DSS types. The actual type in the database can be found in the originalType field (originalSQLType in BigQuery)

Returns:

a schema, as a dict with a columns array, in which each element is a column, itself as a dict of

  • name : the column name

  • type : the column type (smallint, int, bigint, float, double, boolean, date, string)

  • length : the string length

  • comment : the column name

  • originalType : type of the column in the database

Return type:

dict

iter_rows()

Get an iterator on the query’s results.

Returns:

an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by get_schema(). The values are cast to python types according to the types in get_schema()

Return type:

iterator[list]

verify()

Verify that reading results completed successfully.

When using the iter_rows() method, and the iterator stops returning rows, there is no way to tell whether there are no more rows because the query didn’t return more rows, or because an error in the query, or in the fetching of its results, happened. You should thus call verify() after the iterator is done, because it will raise an Exception if an error happened.

Raises:

Exception