Performing SQL, Hive and Impala queries#
For usage information and examples, see Performing SQL, Hive and Impala queries
- class dataiku.SQLExecutor2(connection=None, dataset=None)#
This is a handle to execute SQL statements on a given SQL connection.
The connection is derived from either the connection parameter or the dataset parameter.
- Parameters:
connection (string) – name of the SQL connection
dataset – name of a dataset or a
dataiku.Datasetobject.
- static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)#
Executes a SQL query and store the results to the output_dataset after dropping its underlying table.
- Parameters:
output_dataset (object) –
dataiku.Datasetoutput dataset where to write the result of the query.query (str) – SQL main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
overwrite_output_schema (bool) – if True, generates the output schema from the query results. If False, maintains the existing output schema
drop_partitioned_on_schema_mismatch (bool) – for partitioned output datasets. If True, drops all partitions whose schema is inconsistent with that of the dataset. Only relevant when overwrite_output_schema=True
- Returns:
None
- query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#
This function returns the result of the main query as a pandas dataframe.
- Parameters:
query (str) – SQL main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
extra_conf – do not use
infer_from_schema (bool) – if True, the resulting pandas dataframe types are set per the SQL query datatypes rather than being inferred by pandas
parse_dates (bool) – if True, SQL datetime columns are set as datetime dtypes in the resulting pandas dataframe. The infer_from_schema must be True for this param to be relevant
bool_as_str (bool) – whether to cast boolean values as string
dtypes (dict) – with key= column name and value=`numpy.dtype()`
script_steps – do not use
script_input_schema – do not use
script_output_schema – do not use
- Returns:
a pandas dataframe with the result of the query.
- query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#
This function returns a
QueryReaderto iterate on the rows.- Parameters:
query (str) – the main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
script_steps – do not use
script_input_schema – do not use
script_output_schema – do not use
- Returns:
a
QueryReaderto iterate on the rows.
- class dataiku.HiveExecutor(dataset=None, database=None, connection=None)#
Execute Hive queries against the Hive server configured in DSS
Configure the executor with exactly one of dataset, database, or connection.
- Parameters:
dataset – Name of a dataset or a
dataiku.Datasetobject (defaults to None).database (str) – Hive database name (defaults to None).
connection (str) – DSS Hive connection name (defaults to None).
- static exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)#
Execute a Hive query fragment in the current Python recipe.
- Parameters:
query (str) – Hive main query.
pre_queries (list) – List of queries to execute before the main query (defaults to []).
post_queries (list) – List of queries to execute after the main query (defaults to []).
overwrite_output_schema (bool) – Whether to overwrite the output schema (defaults to True).
drop_partitioned_on_schema_mismatch (bool) – Whether to drop incompatible partitions (defaults to False).
metastore_handling – Metastore synchronization mode (defaults to None).
extra_conf (dict) – Additional Hive configuration (defaults to {}).
add_dku_udf (bool) – Whether to add Dataiku UDFs to the execution context (defaults to False).
- Returns:
None
- query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#
Return the result of a Hive query as a pandas dataframe.
- Parameters:
query (str) – Hive main query
pre_queries (list) – list of queries to execute before the main query. Default:
Nonepost_queries (list) – list of queries to execute after the main query. Default:
Noneextra_conf (dict) – additional Hive configuration. Default:
{}infer_from_schema (bool) – whether to enforce pandas dtypes from SQL schema. Default:
Falseparse_dates (bool) – whether to parse date and datetime columns. Default:
Truebool_as_str (bool) – whether to cast boolean values as strings. Default:
Falsedtypes (dict) – explicit pandas dtypes by column name. Default:
Nonescript_steps – do not use. Default:
Nonescript_input_schema – do not use. Default:
Nonescript_output_schema – do not use. Default:
None
- Returns:
a pandas dataframe with the result of the query
- Raises:
ValueError – If the executor is not configured with a dataset, database, or connection.
- query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#
Return a
QueryReaderto iterate over Hive query rows.- Parameters:
query (str) – Hive main query.
pre_queries (list) – List of queries to execute before the main query (defaults to None).
post_queries (list) – List of queries to execute after the main query (defaults to None).
extra_conf (dict) – Additional Hive configuration (defaults to {}).
script_steps – Do not use (defaults to None).
script_input_schema – Do not use (defaults to None).
script_output_schema – Do not use (defaults to None).
- Returns:
a
QueryReaderto iterate on the rows.- Raises:
ValueError – If the executor is not configured with a dataset, database, or connection.
- class dataiku.ImpalaExecutor(dataset=None, database=None, connection=None)#
Execute Impala queries against the Impala server configured in DSS
Configure the executor with exactly one of dataset, database, or connection.
- Parameters:
dataset – Name of a dataset or a
dataiku.Datasetobject (defaults to None).database (str) – Impala database name (defaults to None).
connection (str) – Impala connection name (defaults to None).
- static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, use_stream_mode=True)#
Execute an Impala query fragment in the current recipe activity.
- Parameters:
output_dataset (object) –
dataiku.Datasetoutput dataset.query (str) – Impala main query.
pre_queries (list) – List of queries to execute before the main query (defaults to []).
post_queries (list) – List of queries to execute after the main query (defaults to []).
overwrite_output_schema (bool) – Whether to overwrite the output schema (defaults to True).
use_stream_mode (bool) – Whether to use stream mode for execution (defaults to True).
- Returns:
None
- query_to_df(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#
Return the result of an Impala query as a pandas dataframe.
- Parameters:
query (str) – Impala main query.
pre_queries (list) – List of queries to execute before the main query (defaults to None).
post_queries (list) – List of queries to execute after the main query (defaults to None).
connection (str) – Deprecated, do not use (defaults to None).
extra_conf (dict) – Additional Impala configuration (defaults to {}).
infer_from_schema (bool) – Whether to enforce pandas dtypes from SQL schema (defaults to False).
parse_dates (bool) – Whether to parse date and datetime columns (defaults to True).
bool_as_str (bool) – Whether to cast boolean values as strings (defaults to False).
dtypes (dict) – Explicit pandas dtypes by column name (defaults to None).
script_steps – Do not use (defaults to None).
script_input_schema – Do not use (defaults to None).
script_output_schema – Do not use (defaults to None).
- Returns:
A pandas dataframe with the result of the query.
- Raises:
ValueError – If the executor is not configured with a dataset, database, or connection.
- query_to_iter(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None, **kwargs)#
Return a
QueryReaderto iterate over Impala query rows.- Parameters:
query (str) – Impala main query.
pre_queries (list) – List of queries to execute before the main query (defaults to None).
post_queries (list) – List of queries to execute after the main query (defaults to None).
connection (str) – Deprecated, do not use (defaults to None).
extra_conf (dict) – Additional Impala configuration (defaults to {}).
script_steps – Do not use (defaults to None).
script_input_schema – Do not use (defaults to None).
script_output_schema – Do not use (defaults to None).
- Returns:
a
QueryReaderto iterate on the rows.- Raises:
ValueError – If the executor is not configured with a dataset, database, or connection.
- class dataikuapi.dss.sqlquery.DSSSQLQuery(client, query, connection, database, dataset_full_name, pre_queries, post_queries, type, extra_conf, script_steps, script_input_schema, script_output_schema, script_report_location, read_timestamp_without_timezone_as_string, read_date_as_string, datetimenotz_read_mode, dateonly_read_mode, project_key)
A connection to a database or database-like on which queries can be run through DSS.
Important
Do not create this class directly, instead use
dataikuapi.DSSClient.sql_query()Usage example:
# run some query on a connection query = client.sql_query('select * from "public"."SOME_TABLE"', connection='some_postgres_connection') n = 0 for row in query.iter_rows(): n += 1 if n < 10: print("row %s : %s" % (n, row)) query.verify() print("Returned %s rows" % n)
- get_schema()
Get the query’s result set’s schema.
The schema made of DSS column types, and built from mapping database types to DSS types. The actual type in the database can be found in the originalType field (originalSQLType in BigQuery)
- Returns:
a schema, as a dict with a columns array, in which each element is a column, itself as a dict of
name : the column name
type : the column type (smallint, int, bigint, float, double, boolean, date, string)
length : the string length
comment : the column name
originalType : type of the column in the database
- Return type:
dict
- iter_rows()
Get an iterator on the query’s results.
- Returns:
an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by
get_schema(). The values are cast to python types according to the types inget_schema()- Return type:
iterator[list]
- verify()
Verify that reading results completed successfully.
When using the
iter_rows()method, and the iterator stops returning rows, there is no way to tell whether there are no more rows because the query didn’t return more rows, or because an error in the query, or in the fetching of its results, happened. You should thus callverify()after the iterator is done, because it will raise an Exception if an error happened.- Raises:
Exception
