This module provides the DiffChecker class.
DiffChecker
Bases: object
Integrated QA performer on pd.DataFrame with logging functionality.
It only works on numerical columns.
qa_level (str) – quick setting for the QA level; one of ['loose', 'mid', 'strict']
logger (str or logging.Logger) – 'print' to print only; any other str is used as the name of a log file to create. Passing an external logging.Logger object is highly recommended, i.e. logger=<mylogger>.
qa_log_level (int) – logging level for QA messages
log_info (bool) – True if method calls and their arguments should also be logged
Notes
Although DiffChecker can create a Logger object from just a file name (i.e. logger='mylog.log'), creating the Logger object externally and passing it in (i.e. logger=<mylogger>) is highly recommended.
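The recommendation above can be sketched with the standard logging module alone. This is a minimal sketch: the logger name "diffchecker_qa" and the message format are assumptions chosen for illustration, and the DiffChecker call is left as a comment because it follows the logger=<mylogger> form described in the note.

```python
import logging

# Build the logger outside DiffChecker so that handlers, level and format
# stay under the caller's control.
my_logger = logging.getLogger("diffchecker_qa")  # hypothetical logger name
my_logger.setLevel(logging.INFO)

# Attach a single stream handler; the guard avoids duplicate handlers
# if this setup code runs more than once.
if not my_logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    my_logger.addHandler(handler)

# The logger would then be passed in as documented:
# dc = DiffChecker(logger=my_logger)
```

Configuring the logger externally keeps file paths, rotation and formatting decisions in one place instead of spreading them across each checker instance.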
Example
Basic usage:
>>> dc = DiffChecker()
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False
Quick set for qa_level:
>>> dc = DiffChecker()
>>> dc.threshold
0.5
>>> dc = DiffChecker(qa_level='mid')
>>> dc.threshold
0.2
>>> dc = DiffChecker(qa_level='strict')
>>> dc.threshold
0.1
A logger can also be created from a file name:
>>> dc = DiffChecker(logger='mylog.log')
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[1, 1.5]*50, 'na_col':[None]*70+[1]*30}))
False
stats
threshold
threshold_df
df_fit_stats
set_stats
Sets the list of statistic functions to check against.
funcs (list) – list of functions and/or function names, e.g. [np.sum, 'mean']
See also
add_stat: to add a single function
add_stat
Appends a statistic function to the existing list (i.e. stats).
func (func) – function or function name (e.g. np.sum or 'mean')
See also
set_stats: to reset all
set_threshold
Sets threshold for statistic-column pairs.
threshold (float or dict) – a float sets a single threshold for all column-statistic pairs; a dict sets thresholds per column or per column-statistic pair.
>>> dc = DiffChecker()
>>> dc.set_stats(['mean', 'max'])
>>> dc.set_threshold(0.1) # to reset all thresholds
>>> print(dc.threshold)
0.1
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> dc.set_threshold({'col1':0.2, 'col2':0.1}) # to set in column level
>>> print(dc.threshold_df)
      col1  col2
mean   0.2   0.1
max    0.2   0.1
>>> dc.set_threshold({'col1':{'mean':0.3}}) # to set in column-stat level
>>> print(dc.threshold_df)
      col1  col2
mean   0.3   0.1
max    0.2   0.1
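To make the two dict forms concrete, the following sketch reproduces the threshold table from the example using plain pandas. It is an illustration of the documented behaviour, not DiffChecker's actual implementation; the helper name apply_thresholds is hypothetical.

```python
import pandas as pd


def apply_thresholds(table, overrides):
    """Hypothetical helper: return a copy of a stat-by-column threshold table
    with column-level or column-stat-level overrides applied."""
    result = table.copy()
    for col, value in overrides.items():
        if isinstance(value, dict):
            # Column-stat level: {'col1': {'mean': 0.3}}
            for stat, thr in value.items():
                result.loc[stat, col] = thr
        else:
            # Column level: {'col1': 0.2} applies to every statistic
            result[col] = value
    return result


# Start from a single global threshold, as set_threshold(0.1) would imply.
table = pd.DataFrame(0.1, index=['mean', 'max'], columns=['col1', 'col2'])
table = apply_thresholds(table, {'col1': 0.2, 'col2': 0.1})
table = apply_thresholds(table, {'col1': {'mean': 0.3}})
print(table)
```

The nested form only touches the named statistic, which is why 'max' for col1 keeps the earlier column-level value.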
fit
Fits given df.
Based on the given df and the stats attribute, this method builds the df_fit_stats attribute, which stores the column statistics that the check method uses later. Only works on numerical columns.
df (pd.DataFrame) – data to be fit
>>> dc = DiffChecker()
>>> dc.set_stats(['mean', 'max'])
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> print(dc.df_fit_stats)
      col1  col2
mean   2.5   0.0
max    4.0   0.0
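A table shaped like df_fit_stats can be produced with plain pandas, which may help when inspecting what fit stores. This is a sketch under the assumption that non-numerical columns are simply skipped; the extra 'label' column is hypothetical and only there to show the filtering.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [0] * 4,
    'label': list('abcd'),  # hypothetical non-numerical column
})

# Keep numerical columns only, then compute one row per statistic.
numeric = df.select_dtypes(include='number')
fit_stats = numeric.agg(['mean', 'max'])
print(fit_stats)
```

The result has statistics as the index and columns as columns, the same layout as the printed df_fit_stats above.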
check
Checks given df_to_check based on fitted df stats.
For each column-statistic pair, it uses qa_array_statistics to check whether the statistic stays within the given threshold. Returns False if any check fails, True otherwise.
df_to_check (pd.DataFrame) – data to check
columns (None or list) – if given, only these columns will be considered for qa
columns_to_exclude (None or list) – columns to exclude from qa
Returns: True if the QA passed, False otherwise
Return type: bool
>>> dc = DiffChecker()
>>> dc.set_threshold(0.2)
>>> dc.set_stats(['mean', 'max', np.sum])
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[1]*4}))
>>> dc.check(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
False
>>> dc.check(pd.DataFrame({'col1':[1, 2.1, 3.2, 4.2], 'col2':[1.1]*4}))
True
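The outcome of the example is consistent with a relative-change comparison, sketched below in plain pandas. This is an assumption made for illustration: the source does not state how qa_array_statistics compares values, and both the helper name relative_check and the handling of a zero baseline are hypothetical.

```python
import pandas as pd


def relative_check(fit_stats, new_stats, threshold):
    """Hypothetical check: True when every statistic changed by at most
    `threshold`, measured relative to the fitted value."""
    for col in fit_stats.columns:
        for stat in fit_stats.index:
            old = fit_stats.loc[stat, col]
            new = new_stats.loc[stat, col]
            if old == 0:
                # Assumption: a zero baseline only matches an unchanged zero.
                ok = new == 0
            else:
                ok = abs(new - old) / abs(old) <= threshold
            if not ok:
                return False
    return True


stats = ['mean', 'max', 'sum']
fitted = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [1] * 4}).agg(stats)
changed = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [0] * 4}).agg(stats)
similar = pd.DataFrame({'col1': [1, 2.1, 3.2, 4.2], 'col2': [1.1] * 4}).agg(stats)

print(relative_check(fitted, changed, 0.2))
print(relative_check(fitted, similar, 0.2))
```

Under this assumption the first data set fails because the mean of col2 drops from 1 to 0, while in the second every statistic moves by at most about 10 percent.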
to_pickle
Pickle (serialize) object to a file.
path (str) – file path where the pickled object will be stored
To save a *.pkl file:
>>> dc1 = DiffChecker()
>>> dc1.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> dc1.to_pickle(path='DiffChecker.pkl')
To load the same object later:
>>> import os
>>> import pickle
>>> with open('DiffChecker.pkl', 'rb') as pkl_file:
...     dc2 = pickle.load(pkl_file)
>>> os.remove('DiffChecker.pkl')  # clean up the file created above
_method_init_logger
Logs method initiation with given arguments.
args (dict) – local arguments, i.e. locals()
exclude (list) – arguments to exclude, e.g. self
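The idea of logging a call from locals() while leaving out self can be sketched as follows. This is not the library's code: the function log_method_call, the Demo class and the message format are all hypothetical and only illustrate the args and exclude parameters described above.

```python
import logging


def log_method_call(logger, method_name, args, exclude=("self",)):
    """Hypothetical helper: log a method name with its arguments,
    skipping any argument names listed in `exclude`."""
    shown = {name: value for name, value in args.items() if name not in exclude}
    message = f"{method_name} called with {shown}"
    logger.info(message)
    return message


class Demo:
    """Hypothetical class used only to produce a locals() dict."""

    def run(self, threshold, columns=None):
        # locals() here holds self, threshold and columns.
        return log_method_call(logging.getLogger("demo"), "run", locals())


message = Demo().run(0.1, columns=["col1"])
print(message)
```

Excluding self keeps the log free of object representations, which are rarely useful and can be long.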