model_tests.FEAT.DataShift

DataShift Objects

@dataclass
class DataShift(ModelTest)

Tests whether the distribution of protected attributes has shifted between two datasets, based on the specified method and threshold. A data shift may indicate that the model needs to be retrained. Currently, the test is intended for categorical attributes.

The test outputs a dataframe detailing which of the specified attributes passed the data shift test.

If ratio is used, the test passes when the ratio of the subgroup distributions between the two datasets does not exceed the threshold.

If diff is used, the test passes when the difference between the subgroup distributions in the two datasets does not exceed the threshold.
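The ratio and diff criteria can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name `passes_shift` is hypothetical, and it assumes the ratio is taken in whichever direction is larger (so it is always at least 1) and that the worst subgroup decides the outcome. The text above does not specify either detail.

```python
import pandas as pd


def passes_shift(x_train: pd.DataFrame, x_test: pd.DataFrame,
                 col: str, method: str, threshold: float) -> bool:
    """Hypothetical helper: compare subgroup proportions of `col` in two datasets."""
    p_train = x_train[col].value_counts(normalize=True)
    p_test = x_test[col].value_counts(normalize=True)
    # Align on the union of subgroups; a subgroup missing from one dataset gets 0.
    p_train, p_test = p_train.align(p_test, fill_value=0.0)

    if method == "diff":
        worst = (p_train - p_test).abs().max()
    elif method == "ratio":
        ratios = p_train / p_test
        # Assumption: use the larger of the two directions for each subgroup.
        worst = max(ratios.max(), (1 / ratios).max())
    else:
        raise ValueError("method must be 'ratio' or 'diff' in this sketch")
    return bool(worst <= threshold)


train = pd.DataFrame({"gender": ["M"] * 50 + ["F"] * 50})
test = pd.DataFrame({"gender": ["M"] * 60 + ["F"] * 40})

print(passes_shift(train, test, "gender", "diff", 0.15))   # True: largest gap is about 0.10
print(passes_shift(train, test, "gender", "ratio", 1.1))   # False: largest ratio is about 1.25
```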

If chi2 is used, the test passes when the p-value from a chi-square test of independence between the datasets is greater than the significance level specified by the threshold.
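The chi2 criterion can be sketched with `scipy.stats.chi2_contingency`, applied to a contingency table of subgroup counts per dataset. The helper name `chi2_shift_pvalue` is hypothetical, and whether the library builds its table or applies a continuity correction in the same way is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_shift_pvalue(x_train: pd.DataFrame, x_test: pd.DataFrame, col: str) -> float:
    """Hypothetical helper: p-value of a chi-square test on subgroup counts."""
    # Rows are subgroups, columns are the two datasets; missing subgroups count as 0.
    counts = pd.concat(
        [x_train[col].value_counts(), x_test[col].value_counts()],
        axis=1,
        keys=["train", "test"],
    ).fillna(0)
    result = chi2_contingency(counts.to_numpy())
    return float(result[1])  # index 1 is the p-value


train = pd.DataFrame({"gender": ["M"] * 50 + ["F"] * 50})
mild = pd.DataFrame({"gender": ["M"] * 60 + ["F"] * 40})
strong = pd.DataFrame({"gender": ["M"] * 90 + ["F"] * 10})

threshold = 0.05  # significance level
print(chi2_shift_pvalue(train, mild, "gender") > threshold)    # True: no significant shift
print(chi2_shift_pvalue(train, strong, "gender") > threshold)  # False: shift detected
```

A higher p-value means less evidence of a shift, which is why this method passes when the value is above the threshold, whereas ratio and diff pass when the value is below it.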

Arguments:

  • protected_attr - List of protected attributes.
  • method - Type of method for the test, choose from 'chi2', 'ratio' or 'diff'.
  • threshold - Threshold on the ratio or difference of the subgroup distributions, or the significance level of the chi-square test.
  • test_name - Name of the test, default is 'Data Shift Test'.
  • test_desc - Description of the test. If none is provided, an automatic description will be generated based on the rest of the arguments passed in.

get_df_distribution_by_pa

@staticmethod
def get_df_distribution_by_pa(df: pd.DataFrame, col: str, freq: bool = False) -> pd.DataFrame

Gets the probability distribution of a specified column's values in a given dataframe.

Arguments:

  • df - Dataframe.
  • col - Column to compute the distribution over.
  • freq - True to compute the frequency distribution, or False to compute the relative proportion.

Returns:

Dataframe of distribution.
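As a rough illustration of what such a helper computes, the following sketch uses `value_counts`. The function name `distribution_by_column` and the layout of the returned dataframe are assumptions; the actual method may name or arrange its columns differently.

```python
import pandas as pd


def distribution_by_column(df: pd.DataFrame, col: str, freq: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in: counts (freq=True) or proportions (freq=False) of `col`."""
    values = df[col].value_counts(normalize=not freq)
    # Column name chosen for this sketch only.
    return values.rename("frequency" if freq else "proportion").to_frame()


df = pd.DataFrame({"race": ["A", "A", "B", "C"]})
print(distribution_by_column(df, "race"))             # proportions sum to 1
print(distribution_by_column(df, "race", freq=True))  # raw counts
```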

get_result

def get_result(x_train: pd.DataFrame, x_test: pd.DataFrame) -> pd.DataFrame

Calculates test result.

Arguments:

  • x_train - Training data features, protected features should not be encoded.
  • x_test - Data to be evaluated on, protected features should not be encoded.

Returns:

Dataframe of results.

plot

def plot(alpha: float = 0.05, save_plots: bool = True)

Plots the probability distribution of the subgroups of each protected attribute for the training and evaluation data, together with their confidence interval bands.

Arguments:

  • alpha - Significance level for confidence interval.
  • save_plots - If True, saves the plots to the class instance.
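To show how alpha relates to the width of a confidence band, here is a minimal sketch of a confidence interval for a single subgroup proportion. It uses the normal approximation, which is an assumption: the documentation does not state which interval the plot uses. The function name `proportion_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(count: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion (sketch only)."""
    p = count / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    half_width = z * sqrt(p * (1 - p) / n)
    # Clip to the valid range for a probability.
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = proportion_ci(count=60, n=100, alpha=0.05)
print(f"{low:.3f} to {high:.3f}")  # roughly 0.504 to 0.696
```

A smaller alpha gives a wider band, so overlapping bands between the training and evaluation data become more likely.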

run

def run(x_train: pd.DataFrame, x_test: pd.DataFrame) -> bool

Runs the test by calculating the result (or retrieving the cached property) and evaluating whether it passes the defined condition.

Arguments:

  • x_train - Training data features, protected features should not be encoded.
  • x_test - Data to be evaluated on, protected features should not be encoded.

Returns:

Test result: True if the test passes, False otherwise.