mlqa.checkers

This module includes individual QA functions of mlqa.

checkers.qa_outliers(data, std, logger=None, log_level=30)

QA check for outliers; a wrapper around qa_outliers_1d.

If any value in the data falls outside the [mean - std, mean + std] range, returns False; otherwise returns True. If a pd.DataFrame is given, each column is checked individually.

Parameters
Returns

True if the QA check passed, False otherwise

Return type

bool

Example

Check for 1d:

>>> qa_outliers([1, 2, 3, 4], std=0.1)
False
>>> qa_outliers([1, 2, 3, 4], std=3)
True

Check for pd.DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(123)
>>> df = pd.DataFrame({
...     'col1':np.random.normal(0, 0.1, 100),
...     'col2':np.random.normal(0, 1.0, 100)})
>>> qa_outliers(df, std=0.5)
False

See also

qa_outliers_1d: the same check for 1D input only

checkers.qa_outliers_1d(array, std, logger=None, log_level=30, name=None)

QA check for outliers in a 1D iterable.

If any value in the array falls outside the [mean - std, mean + std] range, returns False; otherwise returns True.

Parameters
Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> qa_outliers_1d([1, 2, 3, 4], std=0.1)
False
>>> qa_outliers_1d([1, 2, 3, 4], std=3)
True

See also

qa_outliers: wrapper that also accepts a pd.DataFrame
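
The range rule described above can be sketched in plain Python. This is a minimal illustration and not the library's code: the helper name outliers_ok is hypothetical, and it assumes std is used directly as the half-width of the band, exactly as the description is worded. A reading in which std multiplies the data's standard deviation would give the same results on these two inputs.

```python
from statistics import mean


def outliers_ok(values, std):
    """Return True when every value lies in [mean - std, mean + std].

    Assumption: `std` is taken literally as the half-width of the band.
    """
    center = mean(values)
    lower, upper = center - std, center + std
    return all(lower <= v <= upper for v in values)


print(outliers_ok([1, 2, 3, 4], std=0.1))  # False: the band is [2.4, 2.6]
print(outliers_ok([1, 2, 3, 4], std=3))    # True: the band is [-0.5, 5.5]
```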

checkers.qa_missing_values(data, n=None, frac=None, threshold=0.1, limit=(False, True), logger=None, log_level=30)

QA check for missing values; a wrapper around qa_missing_values_1d that also accepts a pd.DataFrame.

If the NA count satisfies the given condition, returns True; otherwise returns False. If a pd.DataFrame is given, each column is checked individually.

Parameters
Returns

True if the QA check passed, False otherwise

Return type

bool

See also

qa_missing_values_1d: the same check for 1D input only

checkers.qa_missing_values_1d(array, n=None, frac=None, threshold=0.1, limit=(False, True), logger=None, log_level=30, name=None)

QA check for missing values in a 1D array.

If the array's NA count satisfies the given condition, returns True; otherwise returns False.

Parameters
Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> qa_missing_values_1d([1, 2, None, None], n=1)
False
>>> qa_missing_values_1d([1, 2, None, None], n=2)
True
>>> qa_missing_values_1d([1, None, None, None], n=2, threshold=0.5)
True

See also

qa_missing_values: wrapper that also accepts a pd.DataFrame
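
The parameter descriptions are not reproduced here, so the exact condition is not stated. One reading that agrees with the three examples above is an upper bound of n * (1 + threshold) on the NA count. The sketch below encodes only that assumption; the helper name missing_values_ok is hypothetical, and frac and limit are deliberately left out.

```python
def missing_values_ok(values, n, threshold=0.1):
    """Hypothetical upper-bound check on the number of missing entries.

    Assumption (inferred from the examples, not from the library source):
    the check passes when the NA count does not exceed n * (1 + threshold).
    Only None is counted as missing in this sketch.
    """
    na_count = sum(1 for v in values if v is None)
    return na_count <= n * (1 + threshold)


print(missing_values_ok([1, 2, None, None], n=1))                    # False
print(missing_values_ok([1, 2, None, None], n=2))                    # True
print(missing_values_ok([1, None, None, None], n=2, threshold=0.5))  # True
```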

checkers.qa_df_set(dfs, threshold=0.1, ignore_min=None, ignore_max=None, stats_to_exclude=None, columns_to_exclude=None, error_columns=None, logger=None, name=None)

Wrapper that applies qa_df_pair() to the 2-length subsequences of dfs.

QA check on datasets' statistics using the describe() method of pd.DataFrame. Non-numeric columns are ignored.

Parameters
  • dfs (iter) – set of pd.DataFrame

  • threshold (float) – percentage threshold for absolute percentage error between statistics

  • ignore_min (None or float) – ignore statistics less than or equal to this value, to avoid division errors or extreme values

  • ignore_max (None or float) – ignore statistics greater than or equal to this value, to avoid extreme values

  • stats_to_exclude (None or list) – statistics to exclude, as a list of strings, e.g. ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

  • columns_to_exclude (None or list) – columns to exclude, as a list of strings

  • error_columns (None or list) – columns whose failures count as errors. If given, test results for all other columns are ignored, and only these columns are logged with level 40.

  • logger (logging.Logger or None) – Python logging object https://docs.python.org/3/library/logging.html#logging.Logger

  • name (str) – optional array name for logger

Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> df1 = pd.DataFrame({'col1':[1, 2]*10, 'col2':[0, 4]*10})
>>> df2 = pd.DataFrame({'col1':[1, 9]*10, 'col2':[0, -4]*10})
>>> qa_df_set([df1, df2])
False

See also

qa_df_pair: the same check for exactly two pd.DataFrame objects
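
How the 2-length subsequences are formed is not spelled out above. The sketch below takes one possible reading, every 2-element combination, and uses plain lists with a hypothetical pair check (close_means) standing in for qa_df_pair. Both helper names are illustrative only.

```python
from itertools import combinations


def check_set(items, pair_check):
    """Apply `pair_check` to every 2-element combination; all must pass.

    Assumption: pairs are formed with itertools.combinations.
    """
    return all(pair_check(a, b) for a, b in combinations(items, 2))


def close_means(a, b, threshold=0.1):
    """Hypothetical pair check: means differ by at most 10% of the first mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return abs(mean_a - mean_b) <= threshold * abs(mean_a)


print(check_set([[1, 2] * 10, [1, 9] * 10], close_means))  # False: means 1.5 vs 5.0
print(check_set([[1, 2], [1, 2], [2, 1]], close_means))    # True: all means are 1.5
```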

checkers.qa_df_pair(df1, df2, threshold=0.1, ignore_min=None, ignore_max=None, stats_to_exclude=None, columns_to_exclude=None, error_columns=None, logger=None, name=None)

QA check on two datasets' statistics using the describe() method of pd.DataFrame. Non-numeric columns are ignored.

Parameters
  • df1 (pd.DataFrame) – test dataframe

  • df2 (pd.DataFrame) – test dataframe

  • threshold (float) – percentage threshold for absolute percentage error between statistics

  • ignore_min (None or float) – ignore statistics less than or equal to this value, to avoid division errors or extreme values

  • ignore_max (None or float) – ignore statistics greater than or equal to this value, to avoid extreme values

  • stats_to_exclude (None or list) – statistics to exclude, as a list of strings, e.g. ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

  • columns_to_exclude (None or list) – columns to exclude, as a list of strings

  • error_columns (None or list) – columns whose failures count as errors. If given, test results for all other columns are ignored, and only these columns are logged with level 40.

  • logger (logging.Logger or None) – Python logging object https://docs.python.org/3/library/logging.html#logging.Logger

  • name (str) – optional array name for logger

Returns

True if the QA check passed, False otherwise

Return type

bool

See also

qa_df_set: wrapper for more than two pd.DataFrame objects
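
The comparison of describe()-style statistics can be illustrated for a single column using only the standard library. This is a rough sketch under stated assumptions, not the library's implementation: the helpers describe and stats_match are hypothetical, the error is assumed to be |s1 - s2| / |s1| with the first column as reference, and ignore_min is assumed to apply to the reference statistic.

```python
from statistics import mean, stdev


def describe(column):
    """A few of the statistics reported by pd.DataFrame.describe()."""
    return {
        "count": len(column),
        "mean": mean(column),
        "std": stdev(column),  # sample standard deviation
        "min": min(column),
        "max": max(column),
    }


def stats_match(col1, col2, threshold=0.1, ignore_min=None, stats_to_exclude=()):
    """Return True when every compared statistic is within `threshold`.

    Assumption: absolute percentage error is |s1 - s2| / |s1|.
    """
    stats1, stats2 = describe(col1), describe(col2)
    for stat, s1 in stats1.items():
        if stat in stats_to_exclude:
            continue
        if ignore_min is not None and s1 <= ignore_min:
            continue  # skip small reference values, e.g. to avoid dividing by zero
        if abs(s1 - stats2[stat]) / abs(s1) > threshold:
            return False
    return True


print(stats_match([1, 2] * 10, [1, 9] * 10))  # False: mean 1.5 vs 5.0
print(stats_match([1, 2] * 10, [2, 1] * 10))  # True: identical statistics
```

Excluding the statistics that differ, for example stats_to_exclude=("mean", "std", "max"), makes the first comparison pass in this sketch, since count and min agree.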

checkers.qa_preds(preds, warn_range, error_range=None, logger=None, name=None)

Wrapper around qa_array_statistics that checks only the min and max statistics.

Its main purpose is to log QA steps and prediction statistics. Use qa_array_statistics for detailed QA on a prediction array.

Parameters
  • preds – array, shape (n_samples, 1)

  • warn_range (iter) – 2-element iterable, e.g. [min, max] to warn

  • error_range (iter or None) – 2-element iterable or None, e.g. [min, max] for error; should contain warn_range. If not None, the QA result from warn_range is ignored.

  • logger (logging.Logger or None) – Python logging object (https://docs.python.org/3/library/logging.html#logging.Logger). If None is given, this function has no practical use; use qa_array_statistics instead.

  • name (str) – optional array name for logger

Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> qa_preds([1, 2, 3, 4], warn_range=[1.5, 5])
False
>>> qa_preds([1, 2, 3, 4], warn_range=[1.5, 5], error_range=[0, 5.5])
True
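
The interplay of warn_range and error_range in the example above can be sketched as follows. The helpers in_range and preds_ok are hypothetical, bounds are assumed to be inclusive, and the logging that the real function exists for is omitted.

```python
def in_range(value, bounds):
    """Inclusive range check; None means that side is unbounded (assumption)."""
    low, high = bounds
    return (low is None or value >= low) and (high is None or value <= high)


def preds_ok(preds, warn_range, error_range=None):
    """Check min and max of `preds`; error_range, if given, decides the result."""
    decisive = warn_range if error_range is None else error_range
    return in_range(min(preds), decisive) and in_range(max(preds), decisive)


print(preds_ok([1, 2, 3, 4], warn_range=[1.5, 5]))                         # False: min 1 < 1.5
print(preds_ok([1, 2, 3, 4], warn_range=[1.5, 5], error_range=[0, 5.5]))  # True
```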

checkers.qa_category_distribution_on_value(df, category_column_name, distribution, value_column_name, threshold=0.1, logger=None, log_level=30)

QA check for the distribution of category-value pairs in a pd.DataFrame.

Parameters
  • df (pd.DataFrame) – input data

  • category_column_name (str) – column name for the category (e.g. 'Gender')

  • distribution (dict) – expected value distribution of the category (e.g. {'Male': .05, 'Female': .14, 'Undefined': .81})

  • value_column_name (str) – numeric column name to check distribution (e.g. 'Weight')

  • threshold (float) – percentage threshold for absolute percentage error

  • logger (logging.Logger or None) – Python logging object https://docs.python.org/3/library/logging.html#logging.Logger

  • log_level (int) – https://docs.python.org/3/library/logging.html#logging-levels

Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> df1 = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female'],
...                     'Weight': [200, 250, 100, 125]})
>>> qa_category_distribution_on_value(df1,
...                                   'Gender',
...                                   {'Male':.66, 'Female':.33},
...                                   'Weight',
...                                   0.1)
True
>>> qa_category_distribution_on_value(df1,
...                                   'Gender',
...                                   {'Male':.5, 'Female':.5},
...                                   'Weight',
...                                   0.1)
False
>>> qa_category_distribution_on_value(df1,
...                                   'Gender',
...                                   {'Male':.5, 'Female':.5},
...                                   'Weight',
...                                   0.5)
True
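
The three results above are consistent with comparing each category's share of the value column's total against the expected share. The sketch below reproduces that reading on (category, value) pairs. The helper name is hypothetical, and measuring the error relative to the expected share is an assumption.

```python
from collections import defaultdict


def category_distribution_ok(rows, distribution, threshold=0.1):
    """rows: iterable of (category, value) pairs.

    Assumption: each category's share of the total value is compared with
    the expected share, using |actual - expected| / expected as the error.
    """
    totals = defaultdict(float)
    for category, value in rows:
        totals[category] += value
    grand_total = sum(totals.values())
    for category, expected in distribution.items():
        actual = totals[category] / grand_total
        if abs(actual - expected) / expected > threshold:
            return False
    return True


rows = [("Male", 200), ("Male", 250), ("Female", 100), ("Female", 125)]
print(category_distribution_ok(rows, {"Male": .66, "Female": .33}, 0.1))  # True
print(category_distribution_ok(rows, {"Male": .5, "Female": .5}, 0.1))    # False
print(category_distribution_ok(rows, {"Male": .5, "Female": .5}, 0.5))    # True
```

Here Male accounts for 450 of 675 (about 0.667) and Female for 225 of 675 (about 0.333), so the error against an expected 0.5 is about 0.33 for both categories.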

checkers.qa_preds_by_metric(y_true, y_pred, metric, check_range, logger=None, log_level=30)

QA check for a model's predictions using a selected metric (e.g. R2, AUC).

Parameters
Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> y_true = pd.Series([1, 2, 3, 4])
>>> y_pred = pd.Series([1, 3, 3, 3])
>>> mae = lambda x, y: abs(x - y).mean()
>>> qa_preds_by_metric(y_true, y_pred, mae, [None, 0.6])
True
>>> qa_preds_by_metric(y_true, y_pred, mae, [0.4, 0.6])
True
>>> qa_preds_by_metric(y_true, y_pred, mae, [0.6, None])
False
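
In the example above the MAE is 0.5, which is why [None, 0.6] and [0.4, 0.6] pass and [0.6, None] fails. A minimal sketch of the same idea on plain lists, with hypothetical helper names and assuming inclusive bounds where None means unbounded:

```python
def in_range(value, bounds):
    """Inclusive range check; None means that side is unbounded (assumption)."""
    low, high = bounds
    return (low is None or value >= low) and (high is None or value <= high)


def preds_metric_ok(y_true, y_pred, metric, check_range):
    """Compute the metric and test it against check_range."""
    return in_range(metric(y_true, y_pred), check_range)


def mae(y_true, y_pred):
    """Mean absolute error for two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true, y_pred = [1, 2, 3, 4], [1, 3, 3, 3]
print(mae(y_true, y_pred))                                  # 0.5
print(preds_metric_ok(y_true, y_pred, mae, [None, 0.6]))    # True
print(preds_metric_ok(y_true, y_pred, mae, [0.6, None]))    # False
```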

checkers.qa_array_statistics(array, stats, logger=None, log_level=30, name=None)

QA check for 1D array statistics such as mean, count.

Parameters
Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> qa_array_statistics([1, 2, 3, 4], {'count':[3, 5], 'min':[None, 1.5]})
True
>>> qa_array_statistics([1, 2, 3, 4], {'count':[3, 5], 'max':[None, 1.5]})
False
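
The stats argument in the example maps a statistic name to a [low, high] pair. The sketch below shows one way such a mapping could be evaluated. It is illustrative only: the STAT_FUNCS table and the helper name are hypothetical, only four statistics are covered, and bounds are assumed to be inclusive with None meaning unbounded.

```python
from statistics import mean

# Hypothetical lookup of supported statistics for this sketch
STAT_FUNCS = {"count": len, "mean": mean, "min": min, "max": max}


def array_stats_ok(array, stats):
    """Return True when every requested statistic lies within its range."""
    for name, (low, high) in stats.items():
        value = STAT_FUNCS[name](array)
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


print(array_stats_ok([1, 2, 3, 4], {"count": [3, 5], "min": [None, 1.5]}))  # True
print(array_stats_ok([1, 2, 3, 4], {"count": [3, 5], "max": [None, 1.5]}))  # False: max is 4
```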

checkers.is_value_in_range(value, check_range, logger=None, log_level=None, log_msg=None)

Checks whether a value lies within the given check_range.

Parameters
Returns

True if the QA check passed, False otherwise

Return type

bool

Example

>>> is_value_in_range(5.0, [3, 10])
True
>>> is_value_in_range(5.0, [None, 1])
False
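
The second example shows that None can stand in for a bound. A minimal sketch of such a check, assuming None means unbounded on that side and that both bounds are inclusive (the source does not say whether they are); the helper name value_in_range is hypothetical:

```python
def value_in_range(value, check_range):
    """Range check where None leaves that side open (bounds assumed inclusive)."""
    low, high = check_range
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


print(value_in_range(5.0, [3, 10]))    # True
print(value_in_range(5.0, [None, 1]))  # False: 5.0 exceeds the upper bound
print(value_in_range(5.0, [3, None]))  # True: no upper bound
```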

checkers.na_rate(array)

Aggregate function that calculates the NA rate of a pd.Series.

Parameters

array (pd.Series) – input array

Returns

na count / array length

Return type

float

Example

>>> na_rate(pd.Series([1, None, 2, 3]))
0.25
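
The ratio "na count / array length" can be reproduced without pandas. The sketch below is a plain-Python illustration with hypothetical helper names; treating both None and float NaN as missing is an assumption chosen to mirror how pandas usually counts missing values.

```python
import math


def is_na(x):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def na_rate_sketch(values):
    """Number of missing entries divided by the total number of entries."""
    values = list(values)
    return sum(is_na(v) for v in values) / len(values)


print(na_rate_sketch([1, None, 2, 3]))             # 0.25
print(na_rate_sketch([float("nan"), None, 2, 3]))  # 0.5
```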