Performing SQL, Hive and Impala queries

For usage information and examples, see Performing SQL, Hive and Impala queries

class dataiku.SQLExecutor2(connection=None, dataset=None)

This is a handle to execute SQL statements on a given SQL connection.

The connection is derived from either the connection parameter or the dataset parameter.

Parameters:

connection (string) – name of the SQL connection
dataset – name of a dataset or a dataiku.Dataset object.
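As a minimal sketch of the two ways to obtain an executor, the helper below accepts either a connection name or a dataset. The helper name `build_executor` and its `executor_cls` parameter are hypothetical and exist only for illustration; the `dataiku` import is deferred because the package is only available inside a DSS environment.

```python
def build_executor(connection=None, dataset=None, executor_cls=None):
    """Create an executor from a connection name or a dataset (hypothetical helper)."""
    if connection is None and dataset is None:
        # This check is a choice of the sketch, not documented library behavior.
        raise ValueError("pass either connection or dataset")
    if executor_cls is None:
        # Imported lazily: dataiku is only importable inside DSS.
        import dataiku
        executor_cls = dataiku.SQLExecutor2
    return executor_cls(connection=connection, dataset=dataset)
```

Inside DSS, `build_executor(connection="my_postgres")` or `build_executor(dataset="orders")` would then return a SQLExecutor2; both names are placeholders.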

static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)

Executes a SQL query and stores the results in the output_dataset after dropping its underlying table.

Parameters:

output_dataset (object) – dataiku.Dataset output dataset where to write the result of the query.
query (str) – SQL main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
overwrite_output_schema (bool) – if True, generates the output schema from the query results. If False, maintains the existing output schema
drop_partitioned_on_schema_mismatch (bool) – for partitioned output datasets. If True, drops all partitions whose schema is inconsistent with that of the dataset. Only relevant when overwrite_output_schema=True

Returns:

None
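A possible way to call this static method from a recipe is sketched below. The helper `run_recipe_fragment`, its `keep_schema` flag and the `executor_cls` parameter are hypothetical names introduced for illustration; only `exec_recipe_fragment` and its `overwrite_output_schema` argument come from the signature above.

```python
def run_recipe_fragment(output_dataset, query, keep_schema=False, executor_cls=None):
    """Run `query` and store its result in `output_dataset` (hypothetical helper)."""
    if executor_cls is None:
        # Imported lazily: dataiku is only importable inside DSS.
        import dataiku
        executor_cls = dataiku.SQLExecutor2
    # exec_recipe_fragment is static, so it is called on the class itself.
    executor_cls.exec_recipe_fragment(
        output_dataset,
        query,
        # False maintains the existing output schema, True regenerates it.
        overwrite_output_schema=not keep_schema,
    )
```

In a recipe, `output_dataset` would typically be a `dataiku.Dataset` object for the recipe's output.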

query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)

This function returns the result of the main query as a pandas dataframe.

Parameters:

query (str) – SQL main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
extra_conf – do not use
infer_from_schema (bool) – if True, the resulting pandas dataframe types are set per the SQL query datatypes rather than being inferred by pandas
parse_dates (bool) – if True, SQL datetime columns are set as datetime dtypes in the resulting pandas dataframe. Only relevant when infer_from_schema is True
bool_as_str (bool) – whether to cast boolean values as string
dtypes (dict) – dict mapping each column name to a numpy.dtype
script_steps – do not use
script_input_schema – do not use
script_output_schema – do not use

Returns:

a pandas dataframe with the result of the query.
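The interplay between `infer_from_schema` and `parse_dates` can be illustrated with a small wrapper. This is a sketch only: `load_dataframe` is a hypothetical helper, and `executor` is assumed to be any object exposing the `query_to_df` method documented above.

```python
def load_dataframe(executor, query, pre_queries=None, dtypes=None):
    """Return the result of `query` as a dataframe (hypothetical helper)."""
    return executor.query_to_df(
        query,
        pre_queries=pre_queries,   # statements executed before the main query
        infer_from_schema=True,    # use the SQL result types instead of pandas inference
        parse_dates=True,          # relevant because infer_from_schema is True
        dtypes=dtypes,             # optional {column name: numpy dtype} mapping
    )
```

With a real executor, the call would look like `load_dataframe(executor, 'SELECT * FROM "orders"')`, where the table name is a placeholder.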

query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)

This function returns a QueryReader to iterate on the rows.

Parameters:

query (str) – the main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
script_steps – do not use
script_input_schema – do not use
script_output_schema – do not use

Returns:

a QueryReader to iterate on the rows.
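Iterating avoids loading the whole result set in memory. The sketch below reads only the first rows of a result. It assumes that the returned QueryReader exposes an `iter_tuples()` method, which is not described in this section; `first_rows` is a hypothetical helper.

```python
from itertools import islice


def first_rows(executor, query, n=5):
    """Return at most `n` rows of the query result (hypothetical helper)."""
    reader = executor.query_to_iter(query)
    # Assumption: the QueryReader yields one tuple per row through iter_tuples().
    return list(islice(reader.iter_tuples(), n))
```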

class dataiku.HiveExecutor(dataset=None, database=None, connection=None)

static exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)

query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)

query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)

class dataiku.ImpalaExecutor(dataset=None, database=None, connection=None)

static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, use_stream_mode=True)

query_to_df(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)

query_to_iter(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)

class dataikuapi.dss.sqlquery.DSSSQLQuery(client, query, connection, database, dataset_full_name, pre_queries, post_queries, type, extra_conf, script_steps, script_input_schema, script_output_schema, script_report_location, read_timestamp_without_timezone_as_string, read_date_as_string, datetimenotz_read_mode, dateonly_read_mode, project_key)

A connection to a database, or a database-like system, on which queries can be run through DSS.

Important

Do not instantiate this class directly; use dataikuapi.DSSClient.sql_query() instead.

Usage example:

# run a query on a connection (client is a dataikuapi.DSSClient)
query = client.sql_query('select * from "public"."SOME_TABLE"', connection='some_postgres_connection')
n = 0
for row in query.iter_rows():
    n += 1
    # print only the first few rows
    if n < 10:
        print("row %s : %s" % (n, row))
# raises if the iteration stopped because of an error
query.verify()
print("Returned %s rows" % n)

get_schema()

Get the schema of the query's result set.

The schema is made of DSS column types, built by mapping database types to DSS types. The actual type in the database can be found in the originalType field (originalSQLType in BigQuery).

Returns:

a schema, as a dict with a columns array in which each element is a column, itself a dict of:

name : the column name
type : the column type (smallint, int, bigint, float, double, boolean, date, string)
length : the string length
comment : the column comment
originalType : type of the column in the database

Return type:

dict
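The schema layout described above can be flattened into a lookup table. The following sketch maps each column name to its DSS type and its original database type; `column_types` is a hypothetical helper, and it falls back to `originalSQLType` as mentioned for BigQuery.

```python
def column_types(schema):
    """Map column name -> (DSS type, original database type or None)."""
    return {
        col["name"]: (
            col["type"],
            # originalType on most databases, originalSQLType on BigQuery.
            col.get("originalType", col.get("originalSQLType")),
        )
        for col in schema["columns"]
    }
```

It would typically be called as `column_types(query.get_schema())`.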

iter_rows()

Get an iterator on the query’s results.

Returns:

an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by get_schema(). The values are cast to python types according to the types in get_schema().

Return type:

iterator[list]

verify()

Verify that reading results completed successfully.

When the iterator returned by iter_rows() stops returning rows, there is no way to tell whether the query simply has no more rows or whether an error occurred in the query or while fetching its results. Call verify() once the iterator is exhausted: it raises an Exception if an error happened.

Raises:

Exception
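One way to make sure verify() is never forgotten is to wrap the read in a helper. The sketch below collects every row as a dict keyed by column name and verifies the stream before returning; `fetch_all` is a hypothetical helper, and it assumes that get_schema() can be called before iterating.

```python
def fetch_all(query):
    """Read all rows as dicts, then check the read completed (hypothetical helper)."""
    names = [col["name"] for col in query.get_schema()["columns"]]
    # Values in each row follow the column order of the schema.
    rows = [dict(zip(names, row)) for row in query.iter_rows()]
    # Raises if the iterator stopped because of an error rather than end of results.
    query.verify()
    return rows
```

Because verify() runs before the return statement, a failed read raises instead of returning a partial list.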