Performing SQL, Hive and Impala queries#

For usage information and examples, see Performing SQL, Hive and Impala queries

class dataiku.SQLExecutor2(connection=None, dataset=None)#

This is a handle to execute SQL statements on a given SQL connection.

The connection is derived from either the connection parameter or the dataset parameter.

Parameters:
  • connection (string) – name of the SQL connection

  • dataset – name of a dataset or a dataiku.Dataset object.
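Usage example (a minimal sketch; the connection and dataset names are placeholders):

import dataiku
from dataiku import SQLExecutor2

# Handle built from a named SQL connection
executor = SQLExecutor2(connection="my_sql_connection")

# Or handle built from a dataset, whose underlying SQL connection is used
executor = SQLExecutor2(dataset=dataiku.Dataset("mydataset"))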

static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)#

Executes a SQL query and stores the results in the output_dataset, after dropping its underlying table.

Parameters:
  • output_dataset (object) – dataiku.Dataset output dataset in which to write the result of the query.

  • query (str) – SQL main query

  • pre_queries (list) – list of queries to be executed before the main query

  • post_queries (list) – list of queries to be executed after the main query

  • overwrite_output_schema (bool) – if True, generates the output schema from the query results. If False, maintains the existing output schema

  • drop_partitioned_on_schema_mismatch (bool) – for partitioned output datasets. If True, drops all partitions whose schema is inconsistent with that of the dataset. Only relevant when overwrite_output_schema=True

Returns:

None
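Usage example (a minimal sketch of a recipe body; dataset, table and column names are placeholders):

import dataiku
from dataiku import SQLExecutor2

output_dataset = dataiku.Dataset("transactions_by_customer")

# Drops the table underlying the output dataset, then writes the
# result of the main query into it
SQLExecutor2.exec_recipe_fragment(
    output_dataset,
    query="""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM transactions
        GROUP BY customer_id
    """)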

query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)#

This function returns the result of the main query as a pandas dataframe.

Parameters:
  • query (str) – SQL main query

  • pre_queries (list) – list of queries to be executed before the main query

  • post_queries (list) – list of queries to be executed after the main query

  • extra_conf – do not use

  • infer_from_schema (bool) – if True, the resulting pandas dataframe types are set per the SQL query datatypes rather than being inferred by pandas

  • parse_dates (bool) – if True, SQL datetime columns are set as datetime dtypes in the resulting pandas dataframe. infer_from_schema must be True for this parameter to take effect

  • bool_as_str (bool) – whether to cast boolean values as strings

  • dtypes (dict) – dict with column names as keys and numpy.dtype as values, used to force the dtypes of the corresponding columns in the resulting dataframe

  • script_steps – do not use

  • script_input_schema – do not use

  • script_output_schema – do not use

Returns:

a pandas dataframe with the result of the query.
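Usage example (a minimal sketch; the connection and table names are placeholders):

from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="my_sql_connection")

# Let the SQL column types drive the dataframe dtypes instead of
# letting pandas infer them
df = executor.query_to_df("SELECT * FROM my_table", infer_from_schema=True)
print(df.dtypes)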

query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)#

This function returns a QueryReader to iterate over the rows.

Parameters:
  • query (str) – the main query

  • pre_queries (list) – list of queries to be executed before the main query

  • post_queries (list) – list of queries to be executed after the main query

  • script_steps – do not use

  • script_input_schema – do not use

  • script_output_schema – do not use

Returns:

a QueryReader to iterate over the rows.
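Usage example (a sketch, assuming the returned QueryReader is consumed through its iter_tuples() iterator; the connection and table names are placeholders):

from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="my_sql_connection")

# Stream the rows instead of materializing the whole result set
reader = executor.query_to_iter("SELECT id, amount FROM my_large_table")
for (row_id, amount) in reader.iter_tuples():
    print(row_id, amount)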

class dataiku.HiveExecutor(dataset=None, database=None, connection=None)#

static exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)#

query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)#

query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)#
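HiveExecutor mirrors the SQLExecutor2 interface, the Hive database being derived from the dataset, database or connection parameter. Usage example (a minimal sketch; names are placeholders):

import dataiku
from dataiku import HiveExecutor

executor = HiveExecutor(dataset=dataiku.Dataset("my_hive_dataset"))

df = executor.query_to_df("SELECT COUNT(*) AS cnt FROM my_hive_table")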
class dataiku.ImpalaExecutor(dataset=None, database=None, connection=None)#

static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, use_stream_mode=True)#

query_to_df(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)#

query_to_iter(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)#
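ImpalaExecutor follows the same pattern. Usage example (a minimal sketch; names are placeholders):

from dataiku import ImpalaExecutor

executor = ImpalaExecutor(database="my_database")

df = executor.query_to_df("SELECT * FROM my_table LIMIT 1000")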
class dataikuapi.dss.sqlquery.DSSSQLQuery(client, query, connection, database, dataset_full_name, pre_queries, post_queries, type, extra_conf, script_steps, script_input_schema, script_output_schema, script_report_location, read_timestamp_without_timezone_as_string, read_date_as_string, project_key)

A connection to a database or database-like system on which queries can be run through DSS.

Important

Do not create this class directly; instead, use dataikuapi.DSSClient.sql_query()

Usage example:

# run some query on a connection
query = client.sql_query('select * from "public"."SOME_TABLE"', connection='some_postgres_connection')
n = 0
for row in query.iter_rows():
    n += 1
    if n < 10:
        print("row %s : %s" % (n, row))
query.verify()
print("Returned %s rows" % n)
get_schema()

Get the schema of the query’s result set.

The schema is made of DSS column types, built by mapping database types to DSS types. The actual type in the database can be found in the originalType field (originalSQLType in BigQuery)

Returns:

a schema, as a dict with a columns array, in which each element is a column, itself a dict with the following fields:

  • name : the column name

  • type : the column type (smallint, int, bigint, float, double, boolean, date, string)

  • length : the string length

  • comment : the column comment

  • originalType : type of the column in the database

Return type:

dict
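For example, reusing the query object from the usage example above, the result set’s columns can be listed along with their database types:

for column in query.get_schema()["columns"]:
    print("%s: %s (database type: %s)" % (
        column["name"], column["type"], column.get("originalType")))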

iter_rows()

Get an iterator on the query’s results.

Returns:

an iterator over the rows, each row being a list of values. The order of values in each row is the same as the order of columns in the schema returned by get_schema(). The values are cast to Python types according to the types in get_schema()

Return type:

iterator[list]

verify()

Verify that reading results completed successfully.

When using the iter_rows() method, once the iterator stops returning rows, there is no way to tell whether the query simply had no more rows to return, or whether an error happened in the query or in the fetching of its results. You should thus call verify() once the iterator is done: it raises an Exception if an error happened.

Raises:

Exception
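Usage example (a sketch of the recommended pattern: consume the iterator fully, then verify; the connection name is a placeholder):

query = client.sql_query('select * from "public"."SOME_TABLE"',
                         connection='some_postgres_connection')
rows = list(query.iter_rows())
try:
    # Raises if the query, or the streaming of its results, failed
    query.verify()
except Exception as e:
    print("Query failed: %s" % e)
else:
    print("Fetched %s rows" % len(rows))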