mlqa.identifiers

This module provides the DiffChecker class.

class identifiers.DiffChecker(qa_level='loose', logger=None, qa_log_level=None, log_info=False)

Bases: object

Performs integrated QA on a pd.DataFrame, with logging support.

It only works on numerical columns.

Parameters
  • qa_level (str) – quick setting for the QA level; one of ['loose', 'mid', 'strict']

  • logger (str or logging.Logger) – 'print' to print only; any other str is used as the name of a log file to create. Passing an external logging.Logger object is highly recommended, i.e. logger=<mylogger>.

  • qa_log_level (int) – logging level for QA messages

  • log_info (bool) – True if method calls and their arguments should also be logged

Notes

Although DiffChecker can create a Logger object from just a file name (i.e. logger='mylog.log'), creating the Logger object externally and then passing it in (i.e. logger=<mylogger>) is highly recommended.
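As an illustration of that recommendation, the following is a minimal sketch of building a Logger object externally with the standard logging module. The helper name build_qa_logger, the logger name and the message format are all hypothetical choices, not part of mlqa. The final DiffChecker call is left commented out so the sketch runs without mlqa installed.

```python
import logging


def build_qa_logger(name="mlqa_example", level=logging.INFO):
    """Create (or reuse) a logger with one stream handler. Hypothetical helper."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


mylogger = build_qa_logger()

# Passing it on, as described in the note above:
# dc = DiffChecker(logger=mylogger)
```

Configuring the logger outside the class keeps handlers, formats and levels under the caller's control, which is presumably the reason for the recommendation.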

Example

Basic usage:

>>> import numpy as np
>>> import pandas as pd
>>> from mlqa.identifiers import DiffChecker
>>> dc = DiffChecker()
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False

Quick set for qa_level:

>>> dc = DiffChecker()
>>> dc.threshold
0.5
>>> dc = DiffChecker(qa_level='mid')
>>> dc.threshold
0.2
>>> dc = DiffChecker(qa_level='strict')
>>> dc.threshold
0.1

A logger can also be created from a file name:

>>> dc = DiffChecker(logger='mylog.log')
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[1, 1.5]*50, 'na_col':[None]*70+[1]*30}))
False

stats = []

threshold = 0.0

threshold_df = Empty DataFrame Columns: [] Index: []

df_fit_stats = Empty DataFrame Columns: [] Index: []

set_stats(funcs)

Sets the list of statistic functions to check by.

Parameters

funcs (list) – list of functions and/or function names, e.g. [np.sum, 'mean']

See also

add_stat: to add a single function

add_stat(func)

Appends a statistic function to the existing list (i.e. stats).

Parameters

func (function or str) – function or function name (e.g. np.sum or 'mean')

See also

set_stats: to replace the whole list

set_threshold(threshold)

Sets threshold for statistic-column pairs.

Parameters

threshold (float or dict) – a float sets the threshold for all statistic-column pairs; a dict sets it per column or per column-statistic pair.

Example

>>> dc = DiffChecker()
>>> dc.set_stats(['mean', 'max'])
>>> dc.set_threshold(0.1)  # resets all thresholds
>>> print(dc.threshold)
0.1
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> dc.set_threshold({'col1':0.2, 'col2':0.1})  # sets thresholds at column level
>>> print(dc.threshold_df)
      col1  col2
mean   0.2   0.1
max    0.2   0.1
>>> dc.set_threshold({'col1':{'mean':0.3}})  # sets thresholds at column-stat level
>>> print(dc.threshold_df)
      col1  col2
mean   0.3   0.1
max    0.2   0.1
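
The three forms of the threshold argument shown above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation: the function apply_threshold and the nested-dict table (standing in for threshold_df) are hypothetical.

```python
def apply_threshold(table, threshold):
    """Update a {stat: {column: value}} table in place and return it.

    Hypothetical stand-in for how a float or dict threshold could be
    spread over statistic-column pairs.
    """
    for stat, row in table.items():
        for column in row:
            if not isinstance(threshold, dict):
                row[column] = threshold           # one value for every pair
            elif column in threshold:
                spec = threshold[column]
                if isinstance(spec, dict):
                    if stat in spec:
                        row[column] = spec[stat]  # a single column-stat pair
                else:
                    row[column] = spec            # every stat of one column
    return table


# Replays the calls from the example above on a plain dict.
table = {stat: {"col1": 0.5, "col2": 0.5} for stat in ("mean", "max")}
apply_threshold(table, 0.1)
apply_threshold(table, {"col1": 0.2, "col2": 0.1})
apply_threshold(table, {"col1": {"mean": 0.3}})
print(table)
```

Pairs that the dict does not mention keep their previous value, which matches the last table in the example, where only the mean of col1 changes.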

fit(df)

Fits the given df.

Based on the given df and the stats attribute, this method constructs the df_fit_stats attribute to store column statistics, which the check method uses later. Only works on numerical columns.

Parameters

df (pd.DataFrame) – data to be fit

Example

>>> dc = DiffChecker()
>>> dc.set_stats(['mean', 'max'])
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> print(dc.df_fit_stats)
      col1  col2
mean   2.5   0.0
max    4.0   0.0
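
To make the shape of the fitted statistics concrete, here is a rough plain-Python sketch that reproduces the table above without pandas. The names STAT_FUNCS and fit_stats are hypothetical, and a nested dict stands in for df_fit_stats; the real class works on a pd.DataFrame.

```python
from statistics import mean

# Hypothetical mapping from statistic names to functions.
STAT_FUNCS = {"mean": mean, "max": max}


def fit_stats(columns, stats):
    """Return {stat: {column: value}} for the given column data."""
    return {
        stat: {
            name: float(STAT_FUNCS[stat](values))
            for name, values in columns.items()
        }
        for stat in stats
    }


fitted = fit_stats({"col1": [1, 2, 3, 4], "col2": [0] * 4}, ["mean", "max"])
print(fitted)
```

Each statistic becomes a row and each column a key within it, mirroring the layout printed by dc.df_fit_stats.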

check(df_to_check, columns=None, columns_to_exclude=None)

Checks the given df_to_check against the fitted df stats.

For each column-stat pair, it checks whether the stat is within the given threshold by using qa_array_statistics. Returns False if any stat QA fails, True otherwise.

Parameters
  • df_to_check (pd.DataFrame) – data to check

  • columns (None or list) – if given, only these columns will be considered for qa

  • columns_to_exclude (None or list) – columns to exclude from qa

Returns

whether QA passed

Return type

bool

Example

>>> dc = DiffChecker()
>>> dc.set_threshold(0.2)
>>> dc.set_stats(['mean', 'max', np.sum])
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[1]*4}))
>>> dc.check(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
False
>>> dc.check(pd.DataFrame({'col1':[1, 2.1, 3.2, 4.2], 'col2':[1.1]*4}))
True
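
The outcome of the example above is consistent with treating the threshold as a relative tolerance around each fitted statistic. The sketch below works under that assumption; it is not confirmed by this page and does not use qa_array_statistics. The names within_threshold and check_stats are hypothetical, and the statistics are precomputed by hand from the example data.

```python
def within_threshold(fitted, observed, threshold):
    """Assumed rule: the observed stat may deviate by threshold * |fitted|."""
    return abs(observed - fitted) <= threshold * abs(fitted)


def check_stats(fitted_stats, observed_stats, threshold):
    """Return True only if every stat-column pair is within the threshold."""
    return all(
        within_threshold(fitted_stats[stat][col], observed_stats[stat][col], threshold)
        for stat in fitted_stats
        for col in fitted_stats[stat]
    )


# Statistics of the fitted frame from the example above.
fitted = {
    "mean": {"col1": 2.5, "col2": 1.0},
    "max": {"col1": 4.0, "col2": 1.0},
    "sum": {"col1": 10.0, "col2": 4.0},
}
# col2 dropped from 1 to 0: every col2 statistic is far outside 20%.
zeroed = {
    "mean": {"col1": 2.5, "col2": 0.0},
    "max": {"col1": 4.0, "col2": 0.0},
    "sum": {"col1": 10.0, "col2": 0.0},
}
# All statistics moved by 5 to 10 percent.
shifted = {
    "mean": {"col1": 2.625, "col2": 1.1},
    "max": {"col1": 4.2, "col2": 1.1},
    "sum": {"col1": 10.5, "col2": 4.4},
}

first_result = check_stats(fitted, zeroed, 0.2)
second_result = check_stats(fitted, shifted, 0.2)
```

Under this rule a fitted statistic of exactly zero only accepts zero, so how the library treats that case should be verified against its source.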

to_pickle(path='DiffChecker.pkl')

Pickle (serialize) object to a file.

Parameters

path (str) – file path where the pickled object will be stored

Example

To save a *.pkl file:

>>> dc1 = DiffChecker()
>>> dc1.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> dc1.to_pickle(path='DiffChecker.pkl')

To load the same object later:

>>> import os
>>> import pickle
>>> with open('DiffChecker.pkl', 'rb') as pkl_file:
...     dc2 = pickle.load(pkl_file)
>>> os.remove('DiffChecker.pkl')  # clean up the example file

_method_init_logger(args, exclude=['self'])

Logs method initiation with given arguments.

Parameters
  • args (dict) – local arguments, i.e. locals()

  • exclude (list) – arguments to exclude, e.g. self
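
The idea of logging a call from locals() while skipping some names can be sketched as follows. This is a hypothetical standalone helper, not the private method itself: the name log_method_call, the explicit method_name argument and the message format are all assumptions made for illustration.

```python
import logging


def log_method_call(logger, method_name, args, exclude=("self",)):
    """Log a method call with its arguments, skipping excluded names."""
    shown = {name: value for name, value in args.items() if name not in exclude}
    message = "%s called with %s" % (method_name, shown)
    logger.info(message)
    return message


class Demo:
    def fit(self, df, verbose=False):
        # At the top of a method, locals() holds exactly its arguments.
        return log_method_call(logging.getLogger("demo"), "fit", locals())


message = Demo().fit([1, 2, 3])
```

The sketch uses a tuple as the default for exclude so that the default value cannot be mutated between calls.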