This module includes individual QA functions of mlqa.
checkers.
qa_outliers
QA check for outliers, implemented as a wrapper of qa_outliers_1d.
If any value in the data lies outside the [mean - std, mean + std] range, returns False, otherwise True. If a pd.DataFrame is given, each column is checked individually.
data (pd.DataFrame or iter) – data to check
std (list or float) – distance from mean for outliers; can be a 2-element iterable for different lower and upper bounds
logger (logging.Logger or None) – Python logging object https://docs.python.org/3/library/logging.html#logging.Logger
log_level (int) – https://docs.python.org/3/library/logging.html#logging-levels
True if the QA check passed, False otherwise
bool
Example
Check for 1d:
>>> qa_outliers([1, 2, 3, 4], std=0.1)
False
>>> qa_outliers([1, 2, 3, 4], std=3)
True
Check for pd.DataFrame:
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(123)
>>> df = pd.DataFrame({
...     'col1':np.random.normal(0, 0.1, 100),
...     'col2':np.random.normal(0, 1.0, 100)})
>>> qa_outliers(df, std=0.5)
False
See also
qa_outliers_1d: same but only for 1d
qa_outliers_1d
QA check for outliers for 1d iterable.
If any value in the array lies outside the [mean - std, mean + std] range, returns False, otherwise True.
array (iter) – 1d array to check
name (str) – optional array name for logger
>>> qa_outliers_1d([1, 2, 3, 4], std=0.1)
False
>>> qa_outliers_1d([1, 2, 3, 4], std=3)
True
qa_outliers: wrapper to be used in pd.DataFrame
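The outlier rule described for qa_outliers_1d can be illustrated with a minimal, standard-library sketch. It is not the mlqa implementation. The function name outliers_check_sketch is hypothetical, and the sketch assumes that std is a multiple of the sample standard deviation; the description could also be read as an absolute distance from the mean, and both readings agree with the examples above.

```python
import statistics


def outliers_check_sketch(values, std):
    """Return True when no value falls outside the allowed band around the mean.

    Assumption: `std` is a multiple of the sample standard deviation. It may be
    a single number or a 2-element iterable (lower multiple, upper multiple).
    """
    values = [float(v) for v in values]
    # Accept either one multiplier or a (lower, upper) pair.
    try:
        lower_k, upper_k = std
    except TypeError:
        lower_k = upper_k = std
    mean = statistics.fmean(values)
    spread = statistics.stdev(values)  # sample standard deviation
    lower = mean - lower_k * spread
    upper = mean + upper_k * spread
    return all(lower <= v <= upper for v in values)


print(outliers_check_sketch([1, 2, 3, 4], std=0.1))      # False
print(outliers_check_sketch([1, 2, 3, 4], std=3))        # True
print(outliers_check_sketch([1, 2, 3, 40], std=[3, 1]))  # False
```

The last call uses a wide lower bound and a narrow upper bound, so only the large value 40 causes the check to fail.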
qa_missing_values
QA check for missing values, implemented as a wrapper of qa_missing_values_1d so that it can also be used on a pd.DataFrame.
If the array's na count satisfies the given condition, returns True, otherwise False. If a pd.DataFrame is given, each column is checked individually.
n (int or None) – expected missing value count
frac (float or None) – expected missing value percentage
threshold (float) – percentage threshold for upper or lower limit
limit (tuple) – limit direction, which side of na limit to check
qa_missing_values_1d: same but only for 1d
qa_missing_values_1d
QA check for missing values of 1D array.
If the array's na count satisfies the given condition, returns True, otherwise False.
>>> qa_missing_values_1d([1, 2, None, None], n=1)
False
>>> qa_missing_values_1d([1, 2, None, None], n=2)
True
>>> qa_missing_values_1d([1, None, None, None], n=2, threshold=0.5)
True
qa_missing_values: wrapper to be used in pd.DataFrame
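One way to read the n, frac and threshold parameters is sketched below with the standard library only. This is an illustration under assumptions, not the mlqa code: the name missing_values_check_sketch and the default threshold of 0.0 are hypothetical, the threshold is assumed to widen the expected count symmetrically, and the limit parameter is not modelled at all.

```python
import math


def missing_values_check_sketch(values, n=None, frac=None, threshold=0.0):
    """Return True when the missing-value count is close enough to the expectation.

    Assumptions: None and float NaN count as missing, and `threshold` widens
    the expected count to [expected*(1-threshold), expected*(1+threshold)].
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    values = list(values)
    na_count = sum(is_missing(v) for v in values)
    if n is not None:
        expected = n
    elif frac is not None:
        expected = frac * len(values)  # expected share of missing values
    else:
        raise ValueError("either n or frac must be given")
    lower = expected * (1 - threshold)
    upper = expected * (1 + threshold)
    return lower <= na_count <= upper


print(missing_values_check_sketch([1, 2, None, None], n=1))                    # False
print(missing_values_check_sketch([1, 2, None, None], n=2))                    # True
print(missing_values_check_sketch([1, None, None, None], n=2, threshold=0.5))  # True
print(missing_values_check_sketch([1, 2, None, None], frac=0.5))               # True
```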
qa_df_set
Wrapper for qa_df_pair() that applies it to length-2 subsequences of dfs.
Checks the datasets' statistics using the describe() method of pd.DataFrame. Non-numeric columns are ignored.
dfs (iter) – set of pd.DataFrame
threshold (float) – percentage threshold for absolute percentage error between statistics
ignore_min (None or float) – ignore statistics less than or equal to this value, to avoid division errors or extreme values
ignore_max (None or float) – ignore statistics greater than or equal to this value, to avoid extreme values
stats_to_exclude (None or list) – statistics to exclude as a list of strings, e.g. ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
columns_to_exclude (None or list) – columns to exclude as list of strings
error_columns (None or list) – columns whose failures are treated as errors; if given, test results for all other columns are ignored. Only these columns are logged with level 40 (logging.ERROR).
>>> df1 = pd.DataFrame({'col1':[1, 2]*10, 'col2':[0, 4]*10})
>>> df2 = pd.DataFrame({'col1':[1, 9]*10, 'col2':[0, -4]*10})
>>> qa_df_set([df1, df2])
False
qa_df_pair: same but only for 2 pd.DataFrame
qa_df_pair
Checks two datasets' statistics using the describe() method of pd.DataFrame. Non-numeric columns are ignored.
df1 (pd.DataFrame) – test dataframe
df2 (pd.DataFrame) – test dataframe
qa_df_set: wrapper to use more than 2 pd.DataFrame
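The comparison of describe() statistics can be pictured with the short pandas sketch below. It is a simplified illustration, not the mlqa implementation: the name df_pair_check_sketch and the default threshold of 0.1 are hypothetical, the error is assumed to be measured relative to the first dataframe, and statistics equal to 0 in the first dataframe are simply skipped instead of being handled through ignore_min and ignore_max.

```python
import pandas as pd


def df_pair_check_sketch(df1, df2, threshold=0.1, stats_to_exclude=None):
    """Return True when all compared describe() statistics are within threshold.

    Assumption: the error is |stat1 - stat2| / |stat1|, and statistics whose
    reference value is 0 are skipped to avoid dividing by zero.
    """
    # For frames with numeric columns, describe() summarises only those columns.
    stats1 = df1.describe()
    stats2 = df2.describe()
    if stats_to_exclude:
        stats1 = stats1.drop(index=stats_to_exclude)
        stats2 = stats2.drop(index=stats_to_exclude)
    passed = True
    for col in stats1.columns.intersection(stats2.columns):
        for stat in stats1.index:
            ref, other = stats1.loc[stat, col], stats2.loc[stat, col]
            if ref == 0:
                continue
            if abs(ref - other) / abs(ref) > threshold:
                passed = False
    return passed


df1 = pd.DataFrame({'col1': [1, 2] * 10, 'col2': [0, 4] * 10})
df2 = pd.DataFrame({'col1': [1, 9] * 10, 'col2': [0, -4] * 10})
print(df_pair_check_sketch(df1, df2))         # False
print(df_pair_check_sketch(df1, df1.copy()))  # True
```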
qa_preds
Wrapper for qa_array_statistics that checks only the min and max statistics.
It is intended mainly for logging QA steps and prediction statistics. Use qa_array_statistics for detailed QA on a prediction array.
preds – array, shape (n_samples, 1)
warn_range (iter) – 2-element iterable, e.g. [min, max], that triggers a warning
error_range (iter or None) – 2-element iterable or None, e.g. [min, max], that triggers an error; it should contain warn_range. If not None, the QA result from warn_range is ignored.
logger (logging.Logger or None) – Python logging object https://docs.python.org/3/library/logging.html#logging.Logger If None is given, this function has no practical use; use qa_array_statistics instead.
>>> qa_preds([1, 2, 3, 4], warn_range=[1.5, 5])
False
>>> qa_preds([1, 2, 3, 4], warn_range=[1.5, 5], error_range=[0, 5.5])
True
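The interplay of warn_range and error_range can be sketched as follows. This is a hedged illustration rather than the library's code: the name preds_range_check_sketch and the log message wording are hypothetical, and it assumes that warn_range still produces a warning entry even when error_range decides the returned value.

```python
import logging


def preds_range_check_sketch(preds, warn_range, error_range=None, logger=None):
    """Check that min(preds) and max(preds) stay inside the given range(s).

    Assumption: when error_range is given, only it decides the returned value,
    while warn_range still produces a warning log entry.
    """
    logger = logger or logging.getLogger(__name__)
    lowest, highest = min(preds), max(preds)

    def inside(bounds):
        return bounds[0] <= lowest and highest <= bounds[1]

    warn_ok = inside(warn_range)
    if not warn_ok:
        logger.warning("predictions span [%s, %s], outside warn_range %s",
                       lowest, highest, list(warn_range))
    if error_range is None:
        return warn_ok
    error_ok = inside(error_range)
    if not error_ok:
        logger.error("predictions span [%s, %s], outside error_range %s",
                     lowest, highest, list(error_range))
    return error_ok


print(preds_range_check_sketch([1, 2, 3, 4], warn_range=[1.5, 5]))                        # False
print(preds_range_check_sketch([1, 2, 3, 4], warn_range=[1.5, 5], error_range=[0, 5.5]))  # True
```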
qa_category_distribution_on_value
QA check for the distribution of category-value pairs in a pd.DataFrame.
df (pd.DataFrame) – input data
category_column_name (str) – column name for the category (e.g. 'Gender')
distribution (dict) – expected value distribution of the category (e.g. {'Male':.05, 'Female':.14, 'Undefined':.81})
value_column_name (str) – numeric column name to check distribution (e.g. 'Weight')
threshold (float) – percentage threshold for absolute percentage error
>>> df1 = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female'],
...                     'Weight': [200, 250, 100, 125]})
>>> qa_category_distribution_on_value(df1,
...                                   'Gender',
...                                   {'Male':.66, 'Female':.33},
...                                   'Weight',
...                                   0.1)
True
>>> qa_category_distribution_on_value(df1,
...                                   'Gender',
...                                   {'Male':.5, 'Female':.5},
...                                   'Weight',
...                                   0.1)
False
>>> qa_category_distribution_on_value(df1,
...                                   'Gender',
...                                   {'Male':.5, 'Female':.5},
...                                   'Weight',
...                                   0.5)
True
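The arithmetic behind the examples above can be reproduced with a small pandas sketch. In the sample data, 'Male' holds 450 of the 675 total weight (about 0.667) and 'Female' holds 225 (about 0.333). The function name category_share_check_sketch is hypothetical, and the sketch assumes the absolute percentage error is taken relative to the expected share, which must be non-zero.

```python
import pandas as pd


def category_share_check_sketch(df, category_column_name, distribution,
                                value_column_name, threshold):
    """Compare each category's share of the summed value column to its expectation.

    Assumption: error = |actual share - expected share| / expected share.
    """
    totals = df.groupby(category_column_name)[value_column_name].sum()
    shares = totals / totals.sum()
    for category, expected in distribution.items():
        actual = shares.get(category, 0.0)  # a category absent from df has share 0
        if abs(actual - expected) / expected > threshold:
            return False
    return True


df1 = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female'],
                    'Weight': [200, 250, 100, 125]})
print(category_share_check_sketch(df1, 'Gender', {'Male': .66, 'Female': .33}, 'Weight', 0.1))  # True
print(category_share_check_sketch(df1, 'Gender', {'Male': .5, 'Female': .5}, 'Weight', 0.1))    # False
print(category_share_check_sketch(df1, 'Gender', {'Male': .5, 'Female': .5}, 'Weight', 0.5))    # True
```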
qa_preds_by_metric
QA check for model’s predictions by selected metric (e.g. R2, AUC).
y_true (iter) – shape (n_samples, 1)
y_pred (iter) – shape (n_samples, 1)
metric (func) – sklearn like metric function. https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
check_range (list) – list of 2 floats, i.e. [lower_limit, upper_limit]; either element can be None if no limit is set in that direction.
>>> y_true = pd.Series([1, 2, 3, 4])
>>> y_pred = pd.Series([1, 3, 3, 3])
>>> mae = lambda x, y: abs(x - y).mean()
>>> qa_preds_by_metric(y_true, y_pred, mae, [None, 0.6])
True
>>> qa_preds_by_metric(y_true, y_pred, mae, [0.4, 0.6])
True
>>> qa_preds_by_metric(y_true, y_pred, mae, [0.6, None])
False
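The metric check amounts to computing one score and testing it against optional bounds. The sketch below shows that idea with plain lists; preds_metric_check_sketch and mean_absolute_error are hypothetical names, and treating the bounds as inclusive is an assumption. For the sample data the mean absolute error is 0.5.

```python
def mean_absolute_error(y_true, y_pred):
    """Plain-Python MAE, standing in for an sklearn-like metric(y_true, y_pred)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def preds_metric_check_sketch(y_true, y_pred, metric, check_range):
    """Return True when metric(y_true, y_pred) lies within check_range.

    Assumption: bounds are inclusive, and a None bound means "no limit".
    """
    lower, upper = check_range
    score = metric(y_true, y_pred)
    if lower is not None and score < lower:
        return False
    if upper is not None and score > upper:
        return False
    return True


y_true = [1, 2, 3, 4]
y_pred = [1, 3, 3, 3]
print(preds_metric_check_sketch(y_true, y_pred, mean_absolute_error, [None, 0.6]))  # True
print(preds_metric_check_sketch(y_true, y_pred, mean_absolute_error, [0.4, 0.6]))   # True
print(preds_metric_check_sketch(y_true, y_pred, mean_absolute_error, [0.6, None]))  # False
```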
qa_array_statistics
QA check for 1D array statistics such as mean, count.
array (iter) – shape (n_samples, 1)
stats (dict) – statistics to check with their ranges (e.g. {'mean':[0.1, 0.99], 'count':[100, None]}). Options for keys are ['mean', 'min', 'max', 'sum', 'count', 'std'] or a function such as np.mean.
>>> qa_array_statistics([1, 2, 3, 4], {'count':[3, 5], 'min':[None, 1.5]})
True
>>> qa_array_statistics([1, 2, 3, 4], {'count':[3, 5], 'max':[None, 1.5]})
False
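A stats dictionary that maps each statistic to a [lower, upper] range can be evaluated as in the sketch below. It uses only the standard library and is not the mlqa implementation: array_statistics_check_sketch and STAT_FUNCS are hypothetical names, 'std' is assumed to mean the sample standard deviation, and bounds are assumed to be inclusive.

```python
import statistics

# Assumed mapping from the string keys listed above to plain-Python functions.
STAT_FUNCS = {
    "mean": statistics.fmean,
    "min": min,
    "max": max,
    "sum": sum,
    "count": len,
    "std": statistics.stdev,  # sample standard deviation
}


def array_statistics_check_sketch(array, stats):
    """Return True when every requested statistic lies within its range."""
    values = list(array)
    for key, (lower, upper) in stats.items():
        # A key is either a known statistic name or a callable such as a median.
        func = STAT_FUNCS[key] if isinstance(key, str) else key
        value = func(values)
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
    return True


print(array_statistics_check_sketch([1, 2, 3, 4], {'count': [3, 5], 'min': [None, 1.5]}))  # True
print(array_statistics_check_sketch([1, 2, 3, 4], {'count': [3, 5], 'max': [None, 1.5]}))  # False
print(array_statistics_check_sketch([1, 2, 3, 4], {statistics.median: [2, 3]}))           # True
```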
is_value_in_range
Checks if a value is in given check_range.
value (float) – value to check
check_range (list) – acceptable lower and upper bounds for value
log_msg (str or None) – custom log message for logger
>>> is_value_in_range(5.0, [3, 10])
True
>>> is_value_in_range(5.0, [None, 1])
False
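A range check with optional bounds and an optional log entry might look like the sketch below. The name value_in_range_sketch, the default log message and the choice to log only on failure are assumptions made for illustration; the bounds are treated as inclusive, which the examples above neither confirm nor rule out.

```python
import logging


def value_in_range_sketch(value, check_range, logger=None,
                          log_level=logging.WARNING, log_msg=None):
    """Return True when value lies within check_range; None means "no limit".

    Assumptions: bounds are inclusive, and a message is logged only on failure.
    """
    lower, upper = check_range
    in_range = (lower is None or value >= lower) and (upper is None or value <= upper)
    if not in_range and logger is not None:
        logger.log(log_level, log_msg or f"value {value} is outside {list(check_range)}")
    return in_range


print(value_in_range_sketch(5.0, [3, 10]))       # True
print(value_in_range_sketch(5.0, [None, 1]))     # False
print(value_in_range_sketch(5.0, [None, None]))  # True
```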
na_rate
Aggregate function to calculate na rate in pd.Series.
array (pd.Series) – input array
na count / array length
float
>>> na_rate(pd.Series([1, None, 2, 3]))
0.25
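The ratio of na count to array length can be computed in pandas as the mean of the isna() mask. The sketch below shows this and applies the function column by column; na_rate_sketch is a hypothetical name and the sample dataframe is made up for illustration.

```python
import pandas as pd


def na_rate_sketch(array):
    """Share of missing entries in a pd.Series: na count / array length."""
    # isna() gives a boolean mask; its mean is the fraction of True values.
    return float(array.isna().mean())


print(na_rate_sketch(pd.Series([1, None, 2, 3])))  # 0.25

# Applied per column, the same function summarises a whole dataframe.
df = pd.DataFrame({'a': [1, None], 'b': [1, 2]})
rates = df.apply(na_rate_sketch)
print(rates.to_dict())  # {'a': 0.5, 'b': 0.0}
```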