Performing SQL, Hive and Impala queries#
For usage information and examples, see Performing SQL, Hive and Impala queries
- class dataiku.SQLExecutor2(connection=None, dataset=None)#
This is a handle to execute SQL statements on a given SQL connection.
The connection is derived from either the connection parameter or the dataset parameter.
- Parameters:
connection (string) – name of the SQL connection
dataset – name of a dataset or a
dataiku.Dataset
object.
- static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)#
Executes a SQL query and store the results to the output_dataset after dropping its underlying table.
- Parameters:
output_dataset (object) –
dataiku.Dataset
output dataset where to write the result of the query.query (str) – SQL main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
overwrite_output_schema (bool) – if True, generates the output schema from the query results. If False, maintains the existing output schema
drop_partitioned_on_schema_mismatch (bool) – for partitioned output datasets. If True, drops all partitions whose schema is inconsistent with that of the dataset. Only relevant when overwrite_output_schema=True
- Returns:
None
- query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)#
This function returns the result of the main query as a pandas dataframe.
- Parameters:
query (str) – SQL main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
extra_conf – do not use
infer_from_schema (bool) – if True, the resulting pandas dataframe types are set per the SQL query datatypes rather than being inferred by pandas
parse_dates (bool) – if True, SQL datetime columns are set as datetime dtypes in the resulting pandas dataframe. The infer_from_schema must be True for this param to be relevant
bool_as_str (bool) – whether to cast boolean values as string
dtypes (dict) – with key= column name and value=`numpy.dtype()`
script_steps – do not use
script_input_schema – do not use
script_output_schema – do not use
- Returns:
a pandas dataframe with the result of the query.
- query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)#
This function returns a
QueryReader
to iterate on the rows.- Parameters:
query (str) – the main query
pre_queries (list) – list of queries to be executed before the main query
post_queries (list) – list of queries to be executed after the main query
script_steps – do not use
script_input_schema – do not use
script_output_schema – do not use
- Returns:
a
QueryReader
to iterate on the rows.
- class dataiku.HiveExecutor(dataset=None, database=None, connection=None)#
- static exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)#
- query_to_df(query, pre_queries=None, post_queries=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)#
- query_to_iter(query, pre_queries=None, post_queries=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)#
- class dataiku.ImpalaExecutor(dataset=None, database=None, connection=None)#
- static exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, use_stream_mode=True)#
- query_to_df(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, infer_from_schema=False, parse_dates=True, bool_as_str=False, dtypes=None, script_steps=None, script_input_schema=None, script_output_schema=None)#
- query_to_iter(query, pre_queries=None, post_queries=None, connection=None, extra_conf={}, script_steps=None, script_input_schema=None, script_output_schema=None)#
- class dataikuapi.dss.sqlquery.DSSSQLQuery(client, query, connection, database, dataset_full_name, pre_queries, post_queries, type, extra_conf, script_steps, script_input_schema, script_output_schema, script_report_location, read_timestamp_without_timezone_as_string, read_date_as_string, project_key)
A connection to a database or database-like on which queries can be run through DSS.
Important
Do not create this class directly, instead use
dataikuapi.DSSClient.sql_query()
Usage example:
# run some query on a connection query = client.sql_query('select * from "public"."SOME_TABLE"', connection='some_postgres_connection') n = 0 for row in query.iter_rows(): n += 1 if n < 10: print("row %s : %s" % (n, row)) query.verify() print("Returned %s rows" % n)
- get_schema()
Get the query’s result set’s schema.
The schema made of DSS column types, and built from mapping database types to DSS types. The actual type in the database can be found in the originalType field (originalSQLType in BigQuery)
- Returns:
a schema, as a dict with a columns array, in which each element is a column, itself as a dict of
name : the column name
type : the column type (smallint, int, bigint, float, double, boolean, date, string)
length : the string length
comment : the column name
originalType : type of the column in the database
- Return type:
dict
- iter_rows()
Get an iterator on the query’s results.
- Returns:
an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by
get_schema()
. The values are cast to python types according to the types inget_schema()
- Return type:
iterator[list]
- verify()
Verify that reading results completed successfully.
When using the
iter_rows()
method, and the iterator stops returning rows, there is no way to tell whether there are no more rows because the query didn’t return more rows, or because an error in the query, or in the fetching of its results, happened. You should thus callverify()
after the iterator is done, because it will raise an Expcetion if an error happened.- Raises:
Exception