Here are some quick examples of how to use the package. For more details, refer to the API Reference.
DiffChecker is designed to perform QA on data flows for ML. You can easily save statistics from the origin data, such as missing value rate, mean, min/max, percentiles, and outliers, then compare them against new data. This is especially important if you want to keep the prediction data under the same assumptions as the training data.
Below is a quick example of how it works: just initialize the object and fit it on the input data to save its statistics.
>>> from mlqa.identifiers import DiffChecker
>>> import pandas as pd
>>> dc = DiffChecker()
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
Then, you can check whether new data is okay for the given criteria. Below, you can see some data that is very similar in column mean_col but has an increased NA count in column na_col. The default threshold is 0.5, which means the check passes as long as the NA rate is at most 50% higher than in the origin data. The NA rate is 50% in the origin data, so up to 75% (i.e. 50%*(1+0.5)) is acceptable. The NA rate is 70% in the new data and, as expected, the QA passes.
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True
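The threshold arithmetic described above can be sketched directly (this is plain arithmetic mirroring the example, not an mlqa call):

```python
# Origin NA rate and the default 'loose' threshold from the example above
origin_na_rate = 0.5
threshold = 0.5

# Upper bound the new data's NA rate may reach before the QA fails
upper_bound = origin_na_rate * (1 + threshold)  # 0.5 * 1.5 = 0.75

# The new data has 70 NAs out of 100 rows
new_na_rate = 0.7
qa_passes = new_na_rate <= upper_bound
print(upper_bound, qa_passes)  # 0.75 True
```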
If you think the threshold is too loose, you can adjust it with the set_threshold method. Now the same check returns False, indicating that the QA has failed.
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False
By default, DiffChecker is initialized with qa_level='loose'. Other values can also be given.
>>> from mlqa.identifiers import DiffChecker
>>> dc = DiffChecker()
>>> dc.threshold
0.5
>>> dc = DiffChecker(qa_level='mid')
>>> dc.threshold
0.2
>>> dc = DiffChecker(qa_level='strict')
>>> dc.threshold
0.1
To be more precise, you can set both the threshold and the statistics individually.
>>> import pandas as pd
>>> import numpy as np
>>> dc = DiffChecker()
>>> dc.set_threshold(0.2)
>>> dc.set_stats(['mean', 'max', np.sum])
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[1]*4}))
>>> dc.check(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
False
>>> dc.check(pd.DataFrame({'col1':[1, 2.1, 3.2, 4.2], 'col2':[1.1]*4}))
True
You can be even more granular with set_threshold.
>>> dc = DiffChecker()
>>> dc.set_stats(['mean', 'max'])
>>> dc.set_threshold(0.1) # to reset all thresholds
>>> print(dc.threshold)
0.1
>>> dc.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> dc.set_threshold({'col1':0.2, 'col2':0.1}) # to set in column level
>>> print(dc.threshold_df)
      col1  col2
mean   0.2   0.1
max    0.2   0.1
>>> dc.set_threshold({'col1':{'mean':0.1}}) # to set in column-stat level
>>> print(dc.threshold_df)
      col1  col2
mean   0.1   0.1
max    0.2   0.1
You can also pickle the object for later use with the to_pickle method.
>>> dc1 = DiffChecker()
>>> dc1.fit(pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[0]*4}))
>>> dc1.to_pickle(path='DiffChecker.pkl')
Later, you can load the same object as follows.
>>> import pickle
>>> pkl_file = open('DiffChecker.pkl', 'rb')
>>> dc2 = pickle.load(pkl_file)
>>> pkl_file.close()
If you enable the logging functionality, you can get a detailed description of which column failed for which statistic and why. You can even log the DiffChecker steps themselves.
Just initialize the class with the logger='<your-logger-name>.log' argument.
>>> from mlqa.identifiers import DiffChecker
>>> import pandas as pd
>>> dc = DiffChecker(logger='mylog.log')
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[1, 1.5]*50, 'na_col':[None]*70+[1]*30}))
False
If you open mylog.log, you’ll see something like below.
WARNING|2020-05-31 15:56:48,146|mean value (i.e. 1.25) is not in the range of [1.35, 1.65] for mean_col
WARNING|2020-05-31 15:56:48,147|na_rate value (i.e. 0.7) is not in the range of [0.45, 0.55] for na_col
If you also initialize the class with the log_info=True argument, the other class steps (e.g. set_threshold, check) are logged, too.
Note
Although DiffChecker is able to create a Logger object from just a file name (i.e. logger='mylog.log'), creating the Logger object externally and then passing it in (i.e. logger=<mylogger>) is highly recommended.
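A minimal sketch of the recommended approach, using Python's standard logging module (the logger name and format below are arbitrary choices to match the log excerpts in this guide, not mlqa requirements):

```python
import logging

# Configure a Logger externally instead of letting DiffChecker create one
mylogger = logging.getLogger('mlqa_demo')
mylogger.setLevel(logging.INFO)

# delay=True postpones opening the file until the first record is emitted
handler = logging.FileHandler('mylog.log', delay=True)

# Mimic the LEVEL|timestamp|message layout seen in the log excerpts
handler.setFormatter(logging.Formatter('%(levelname)s|%(asctime)s|%(message)s'))
mylogger.addHandler(handler)
```

You can then pass it in as DiffChecker(logger=mylogger).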
There are also checkers providing other kinds of QA functionality, such as outlier detection, pd.DataFrame comparison, or QA on categorical values. You can use these individually or combine them with DiffChecker's logger.
Let's say you initialized DiffChecker with some logger already.
>>> from mlqa.identifiers import DiffChecker
>>> dc = DiffChecker(logger='mylog.log')
Then, you can simply pass the logger attribute of the object when calling the checkers. Here is an example of qa_outliers.
>>> import mlqa.checkers as ch
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(123)
>>> df = pd.DataFrame({
...     'col1':np.random.normal(0, 0.1, 100),
...     'col2':np.random.normal(0, 1.0, 100)})
>>> ch.qa_outliers(df, std=0.5, logger=dc.logger)
False
This should log something like below.
WARNING|2020-05-31 17:54:13,426|70 outliers detected within inlier range (i.e. [-0.053985309527773806, 0.059407124225845764]) for col1
WARNING|2020-05-31 17:54:13,428|53 outliers detected within inlier range (i.e. [-0.5070058315486367, 0.46793470772834406]) for col2
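The inlier ranges in the log can be roughly reproduced with a mean ± std-coefficient·std rule. This is an assumption about how the range is computed, made only to illustrate the idea; check the qa_outliers documentation for the exact formula.

```python
import numpy as np

np.random.seed(123)
col1 = np.random.normal(0, 0.1, 100)

# Assumed inlier rule: mean +/- std_coefficient * standard deviation
std_coefficient = 0.5
lo = col1.mean() - std_coefficient * col1.std()
hi = col1.mean() + std_coefficient * col1.std()
n_outliers = int(((col1 < lo) | (col1 > hi)).sum())
print(lo, hi, n_outliers)
```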
You can also compare multiple datasets from the same population with qa_df_set.
>>> df1 = pd.DataFrame({'col1':[1, 2]*10, 'col2':[0, 4]*10})
>>> df2 = pd.DataFrame({'col1':[1, 9]*10, 'col2':[0, -4]*10})
>>> ch.qa_df_set([df1, df2], logger=dc.logger)
False
INFO|2020-05-31 18:09:47,581|df sets QA initiated with threshold 0.1
WARNING|2020-05-31 18:09:47,598|mean of col1 not passed. Values are 1.5 and 5.0
WARNING|2020-05-31 18:09:47,599|mean of col2 not passed. Values are 2.0 and -2.0
WARNING|2020-05-31 18:09:47,599|std of col1 not passed. Values are 0.51299 and 4.10391
WARNING|2020-05-31 18:09:47,599|min of col2 not passed. Values are 0.0 and -4.0
WARNING|2020-05-31 18:09:47,599|25% of col2 not passed. Values are 0.0 and -4.0
WARNING|2020-05-31 18:09:47,599|50% of col1 not passed. Values are 1.5 and 5.0
WARNING|2020-05-31 18:09:47,600|50% of col2 not passed. Values are 2.0 and -2.0
WARNING|2020-05-31 18:09:47,600|75% of col1 not passed. Values are 2.0 and 9.0
WARNING|2020-05-31 18:09:47,600|75% of col2 not passed. Values are 4.0 and 0.0
WARNING|2020-05-31 18:09:47,600|max of col1 not passed. Values are 2.0 and 9.0
WARNING|2020-05-31 18:09:47,600|max of col2 not passed. Values are 4.0 and 0.0
INFO|2020-05-31 18:09:47,600|df sets QA done with threshold 0.1
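These failures can be approximated by comparing each describe() statistic of the two datasets with a relative-difference rule. This is a sketch of the idea, not mlqa's internal code; the library's exact comparison may differ.

```python
import pandas as pd

df1 = pd.DataFrame({'col1':[1, 2]*10, 'col2':[0, 4]*10})
df2 = pd.DataFrame({'col1':[1, 9]*10, 'col2':[0, -4]*10})

threshold = 0.1
s1, s2 = df1.describe(), df2.describe()
failed = []
for col in s1.columns:
    for stat in ['mean', 'std', 'min', '25%', '50%', '75%', 'max']:
        v1, v2 = s1.loc[stat, col], s2.loc[stat, col]
        # Flag the stat when the relative difference exceeds the threshold
        # (any change away from zero is flagged, since no ratio exists)
        if v1 != v2 and (v1 == 0 or abs(v2 - v1) / abs(v1) > threshold):
            failed.append((stat, col))
print(failed)
```

For this data, the sketch flags the same eleven stat/column pairs that appear as warnings in the log.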
For categorical values, you can check their distribution on a numeric column with qa_category_distribution_on_value.
>>> df1 = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female'],
...                     'Weight': [200, 250, 100, 125]})
>>> ch.qa_category_distribution_on_value(df1,
...     'Gender',
...     {'Male':.5, 'Female':.5},
...     'Weight',
...     logger=dc.logger)
False
WARNING|2020-05-31 18:21:20,019|Gender distribution looks wrong, check Weight for Gender=Male. Expected=0.5, Actual=0.6666666666666666
WARNING|2020-05-31 18:21:20,019|Gender distribution looks wrong, check Weight for Gender=Female. Expected=0.5, Actual=0.3333333333333333
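The actual shares in the log can be reproduced with a groupby sum. This sketches what is being checked, not mlqa's internal code:

```python
import pandas as pd

df1 = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female'],
                    'Weight': [200, 250, 100, 125]})

# Share of total Weight per Gender, to compare against the expected
# {'Male': .5, 'Female': .5} distribution
actual = df1.groupby('Gender')['Weight'].sum() / df1['Weight'].sum()
print(actual.to_dict())  # Male: 450/675 = 0.666..., Female: 225/675 = 0.333...
```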