Modules
Submodules
klib.describe module
Functions for descriptive analytics.
author: Andreas Kanz
klib.describe.cat_plot(data: pandas.core.frame.DataFrame, figsize: Tuple = (18, 18), top: int = 3, bottom: int = 3, bar_color_top: str = '#5ab4ac', bar_color_bottom: str = '#d8b365')
Two-dimensional visualization of the number and frequency of categorical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- figsize : Tuple, optional
Use to control the figure size, by default (18, 18)
- top : int, optional
Show the “top” most frequent values in a column, by default 3
- bottom : int, optional
Show the “bottom” most frequent values in a column, by default 3
- bar_color_top : str, optional
Use to control the color of the bars indicating the most common values, by default “#5ab4ac”
- bar_color_bottom : str, optional
Use to control the color of the bars indicating the least common values, by default “#d8b365”
Returns: - GridSpec
gs: Figure with array of Axes objects
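The top/bottom counting that cat_plot visualizes can be sketched with plain pandas. The DataFrame below is an invented toy example; the klib call in the comment is the documented entry point:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "red", "blue", "green", "red", "blue"]})

top, bottom = 2, 1
counts = df["color"].value_counts()  # frequency per category, descending
top_vals = counts.head(top)          # the "top" most frequent values
bottom_vals = counts.tail(bottom)    # the "bottom" least frequent values

# klib.cat_plot(df, top=2, bottom=1) draws these counts as colored bars.
```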
klib.describe.corr_mat(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray, str, None] = None, method: str = 'pearson', colored: bool = True) → Union[pandas.core.frame.DataFrame, Any]
Returns a color-encoded correlation matrix.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
Type of split to be performed, by default None {None, “pos”, “neg”, “high”, “low”}
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.DataFrame, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.
- colored : bool, optional
If True the negative values in the correlation matrix are colored in red, by default True
Returns: - Union[pd.DataFrame, pd.Styler]
- If colored = True: corr is a Pandas Styler object
- If colored = False: corr is a Pandas DataFrame
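Underneath the color coding, corr_mat builds on pandas' own correlation matrix; the split/threshold filtering can be sketched like this (toy data, with the klib call noted in a comment):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})

# klib.corr_mat(df) wraps this matrix and adds red/blue coloring.
corr = df.corr(method="pearson")

# split="high" keeps only coefficients with abs(corr) > threshold.
threshold = 0.9
high = corr.where(corr.abs() > threshold)
```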
klib.describe.corr_plot(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.series.Series, str, None] = None, method: str = 'pearson', cmap: str = 'BrBG', figsize: Tuple = (12, 10), annot: bool = True, dev: bool = False, **kwargs)
Two-dimensional visualization of the correlation between feature-columns, excluding NA values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
- Type of split to be performed {None, “pos”, “neg”, “high”, “low”}, by default None
- None: visualize all correlations between the feature-columns
- pos: visualize all positive correlations between the feature-columns above the threshold
- neg: visualize all negative correlations between the feature-columns below the threshold
- high: visualize all correlations between the feature-columns for which abs(corr) > threshold is True
- low: visualize all correlations between the feature-columns for which abs(corr) < threshold is True
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.Series, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
- method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.
- cmap : str, optional
The mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default “BrBG”
- figsize : Tuple, optional
Use to control the figure size, by default (12, 10)
- annot : bool, optional
Use to show or hide annotations, by default True
- dev : bool, optional
Display figure settings in the plot by setting dev = True. If False, the settings are not displayed, by default False
- Keyword Arguments : optional
Additional elements to control the visualization of the plot, e.g.:
- mask: bool, default True
- If set to False the entire correlation matrix, including the upper triangle is shown. Set dev = False in this case to avoid overlap.
- vmax: float, default is calculated from the given correlation coefficients.
- Value between -1 or vmin <= vmax <= 1, limits the range of the cbar.
- vmin: float, default is calculated from the given correlation coefficients.
- Value between -1 <= vmin <= 1 or vmax, limits the range of the cbar.
- linewidths: float, default 0.5
- Controls the line width between the squares.
- annot_kws: dict, default {“size” : 10}
- Controls the font size of the annotations. Only available when annot = True.
- cbar_kws: dict, default {“shrink”: .95, “aspect”: 30}
- Controls the size of the colorbar.
- Many more kwargs are available, e.g. “alpha” to control blending, or options to adjust labels, ticks …
Kwargs can be supplied through a dictionary of key-value pairs (see above).
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
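The mask kwarg described above hides the redundant upper triangle of the correlation matrix. A numpy sketch of that masking, on invented data (the klib call would be klib.corr_plot(df, mask=True)):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 3, 2, 4], "z": [4, 3, 2, 1]})
corr = df.corr()

# mask=True (the default) blanks the diagonal and upper triangle,
# so each coefficient is annotated only once.
mask = np.triu(np.ones_like(corr, dtype=bool))
lower = corr.mask(mask)  # what remains to be annotated
```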
klib.describe.dist_plot(data: pandas.core.frame.DataFrame, mean_color: str = 'orange', size: int = 2.5, fill_range: Tuple = (0.025, 0.975), showall: bool = False, kde_kws: Dict[str, Any] = None, rug_kws: Dict[str, Any] = None, fill_kws: Dict[str, Any] = None, font_kws: Dict[str, Any] = None)
Two-dimensional visualization of the distribution of non-binary numerical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- mean_color : str, optional
Color of the vertical line indicating the mean of the data, by default “orange”
- size : int, optional
Controls the plot size, by default 2.5
- fill_range : Tuple, optional
Set the quantiles for shading. Default spans 95% of the data, which is about two std. deviations above and below the mean, by default (0.025, 0.975)
- showall : bool, optional
Set to True to remove the output limit of 20 plots, by default False
- kde_kws : Dict[str, Any], optional
Keyword arguments for kdeplot(), by default {“color”: “k”, “alpha”: 0.75, “linewidth”: 1.5, “bw_adjust”: 0.8}
- rug_kws : Dict[str, Any], optional
Keyword arguments for rugplot(), by default {“color”: “#ff3333”, “alpha”: 0.15, “lw”: 3, “height”: 0.075}
- fill_kws : Dict[str, Any], optional
Keyword arguments to control the fill, by default {“color”: “#80d4ff”, “alpha”: 0.2}
- font_kws : Dict[str, Any], optional
Keyword arguments to control the font, by default {“color”: “#111111”, “weight”: “normal”, “size”: 11}
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
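The fill_range shading and the mean line can be sketched numerically with pandas quantiles (invented data; the plot itself comes from klib.dist_plot(df)):

```python
import pandas as pd

s = pd.Series(range(1, 101))  # toy numerical feature: 1..100

# fill_range=(0.025, 0.975) shades the central 95% of the distribution,
# and mean_color marks a vertical line at the mean.
lo, hi = s.quantile(0.025), s.quantile(0.975)
mean = s.mean()
```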
klib.describe.missingval_plot(data: pandas.core.frame.DataFrame, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE')
Two-dimensional visualization of the missing values in a dataset.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- cmap : str, optional
Any valid colormap can be used. E.g. “Greys”, “RdPu”. More information can be found in the matplotlib documentation, by default “PuBuGn”
- figsize : Tuple, optional
Use to control the figure size, by default (20, 20)
- sort : bool, optional
Sort columns based on missing values in descending order and drop columns without any missing values, by default False
- spine_color : str, optional
Set to “None” to hide the spines on all plots or use any valid matplotlib color argument, by default “#EEEEEE”
Returns: - GridSpec
gs: Figure with array of Axes objects
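The sort option's column ordering can be sketched with pandas alone (toy data; klib.missingval_plot(df, sort=True) produces the actual figure):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [1, 2, 3],
    "c": [np.nan, 2, 3],
})

# sort=True orders columns by missing count (descending) and drops
# columns without any missing values before plotting.
mv = df.isna().sum()
order = mv[mv > 0].sort_values(ascending=False).index.tolist()
```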
klib.clean module
Functions for data cleaning.
author: Andreas Kanz
klib.clean.clean_column_names(data: pandas.core.frame.DataFrame, hints: bool = True) → pandas.core.frame.DataFrame
Cleans the column names of the provided Pandas DataFrame and optionally provides hints on duplicate and long column names.
Parameters: - data : pd.DataFrame
Original DataFrame with columns to be cleaned
- hints : bool, optional
Print out hints on column name duplication and column name length, by default True
Returns: - pd.DataFrame
Pandas DataFrame with cleaned column names
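A minimal sketch of this kind of cleaning, on made-up column names. The rules below (strip, lowercase, replace non-alphanumeric runs with underscores) are illustrative assumptions, not klib's exact rule set; the documented call is simply klib.clean_column_names(df):

```python
import re
import pandas as pd

df = pd.DataFrame(columns=["First Name", " Zip-Code ", "AGE"])

def clean(name: str) -> str:
    # Illustrative rules: strip, lowercase, collapse non-alphanumerics to "_".
    name = name.strip().lower()
    return re.sub(r"[^0-9a-z]+", "_", name).strip("_")

df.columns = [clean(c) for c in df.columns]
```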
klib.clean.convert_datatypes(data: pandas.core.frame.DataFrame, category: bool = True, cat_threshold: float = 0.05, cat_exclude: Optional[List[Union[str, int]]] = None) → pandas.core.frame.DataFrame
Converts columns to the best possible dtypes using dtypes supporting pd.NA. Temporarily not converting to integers due to an issue in pandas. This is expected to be fixed in pandas 1.1. See https://github.com/pandas-dev/pandas/issues/33803
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- category : bool, optional
Change dtypes of columns with dtype “object” to “category”. Set threshold using cat_threshold or exclude columns using cat_exclude, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.05
- cat_exclude : Optional[List[Union[str, int]]], optional
List of columns to exclude from categorical conversion, by default None
Returns: - pd.DataFrame
Pandas DataFrame with converted Datatypes
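The building block is pandas' own DataFrame.convert_dtypes(); the category inference on top of it can be sketched as a unique-value-ratio check (toy data, and a deliberately loose cat_threshold for the demo):

```python
import pandas as pd

df = pd.DataFrame({
    "n": [1.0, 2.0, None],
    "city": ["NY", "NY", "LA"],
})

# klib.convert_datatypes(df) first moves columns to pd.NA-aware dtypes ...
out = df.convert_dtypes()

# ... then changes object/string columns to "category" when the ratio of
# unique values falls below cat_threshold.
cat_threshold = 0.7
ratio = out["city"].nunique() / len(out)
if ratio < cat_threshold:
    out["city"] = out["city"].astype("category")
```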
klib.clean.data_cleaning(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 0.9, drop_threshold_rows: float = 0.9, drop_duplicates: bool = True, convert_dtypes: bool = True, col_exclude: Optional[List[str]] = None, category: bool = True, cat_threshold: float = 0.03, cat_exclude: Optional[List[Union[str, int]]] = None, clean_col_names: bool = True, show: str = 'changes') → pandas.core.frame.DataFrame
Perform initial data cleaning tasks on a dataset, such as dropping single-valued and empty rows and empty columns, as well as optimizing the datatypes.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_duplicates : bool, optional
Drop duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values, by default True
- convert_dtypes : bool, optional
Convert dtypes using pd.convert_dtypes(), by default True
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping, by default None
- category : bool, optional
Enable changing dtypes of “object” columns to “category”. Set threshold using cat_threshold. Requires convert_dtypes=True, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.03
- cat_exclude : Optional[List[str]], optional
List of columns to exclude from categorical conversion, by default None
- clean_col_names : bool, optional
Cleans the column names and provides hints on duplicate and long names, by default True
- show : str, optional
{“all”, “changes”, None}, by default “changes” Specify verbosity of the output:
- “all”: Print information about the data before and after cleaning as well as information about changes and memory usage (deep). Please be aware that this can slow down the function considerably.
- “changes”: Print out differences in the data before and after cleaning.
- None: No information about the data and the data cleaning is printed.
Returns: - pd.DataFrame
Cleaned Pandas DataFrame
See also
convert_datatypes
- Convert columns to best possible dtypes.
drop_missing
- Flexibly drop columns and rows.
_memory_usage
- Gives the total memory usage in megabytes.
_missing_vals
- Metrics about missing values in the dataset.
Notes
The category dtype is not grouped in the summary, unless it contains exactly the same categories.
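A rough pandas-only sketch of the cleaning steps listed above, on invented data; klib.data_cleaning(df) bundles them (plus category inference, column-name cleaning, and the change report):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "empty": [np.nan, np.nan, np.nan, np.nan],  # NA-ratio 1.0 -> dropped
    "x": [1, 2, 3, 3],
    "val": ["a", "b", "c", "c"],
})

thresh_cols = 0.9
na_ratio = df.isna().mean()
cleaned = df.loc[:, na_ratio < thresh_cols]  # drop near-empty columns
cleaned = cleaned.drop_duplicates()          # keep first occurrence
cleaned = cleaned.convert_dtypes()           # pd.NA-aware dtypes
```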
klib.clean.drop_missing(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 1, drop_threshold_rows: float = 1, col_exclude: Optional[List[str]] = None) → pandas.core.frame.DataFrame
Drops completely empty columns and rows by default, and optionally provides the flexibility to loosen these restrictions and drop additional non-empty columns and rows based on the fraction of NA-values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 1
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 1
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping. The excluded columns do not affect the drop thresholds, by default None
Returns: - pd.DataFrame
Pandas DataFrame without any empty columns or rows
Notes
Columns are dropped first.
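The threshold rule (drop where the NA-ratio reaches the threshold, columns before rows) can be sketched directly in pandas on toy data; the klib call is klib.drop_missing(df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, np.nan],
    "b": [np.nan] * 4,            # completely empty -> dropped
    "c": [1, 2, 3, np.nan],
})

# Drop where NA-ratio >= threshold; columns first, then rows (see Notes).
drop_threshold_cols = drop_threshold_rows = 1.0
out = df.loc[:, df.isna().mean() < drop_threshold_cols]
out = out.loc[out.isna().mean(axis=1) < drop_threshold_rows]
```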
klib.clean.mv_col_handling(data: pandas.core.frame.DataFrame, target: Union[str, pandas.core.series.Series, List[T], None] = None, mv_threshold: float = 0.1, corr_thresh_features: float = 0.5, corr_thresh_target: float = 0.3, return_details: bool = False) → pandas.core.frame.DataFrame
Converts columns with a high ratio of missing values into binary features and eventually drops them based on their correlation with other features and the target variable. This function follows a three-step process:
- 1) Identify features with a high ratio of missing values (above “mv_threshold”).
- 2) Identify high correlations of these features among themselves and with other features in the dataset (above “corr_thresh_features”).
- 3) Features with a high ratio of missing values and high correlation among each other are dropped unless they correlate reasonably well with the target variable (above “corr_thresh_target”).
Note: If no target is provided, the process exits after step two and drops columns identified up to this point.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- target : Optional[Union[str, pd.Series, List]], optional
Specify target for correlation. I.e. label column to generate only the correlations between each feature and the label, by default None
- mv_threshold : float, optional
Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates for dropping and undergo further analysis, by default 0.1
- corr_thresh_features : float, optional
Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature (with a high mv-ratio) is allowed to have with another feature. If this threshold is exceeded, the feature undergoes further analysis, by default 0.5
- corr_thresh_target : float, optional
Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met the feature is ultimately dropped, by default 0.3
- return_details : bool, optional
Provides flexibility to return intermediary results, by default False
Returns: - pd.DataFrame
Updated Pandas DataFrame
- optional:
- cols_mv: Columns with missing values included in the analysis
- drop_cols: List of dropped columns
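Step 1 of the process above (flag high-missing columns and binarize them into "value present?" indicators) can be sketched like this on invented data; steps 2 and 3 would then compare correlations of the binary columns against corr_thresh_features and corr_thresh_target:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sparse": [1.0, np.nan, np.nan, np.nan, 2.0, np.nan],
    "full": [1, 2, 3, 4, 5, 6],
})

# Step 1: columns whose missing-value ratio exceeds mv_threshold become
# candidates and are converted to binary presence indicators.
mv_threshold = 0.1
mv_ratio = df.isna().mean()
cols_mv = mv_ratio[mv_ratio > mv_threshold].index.tolist()
binary = df[cols_mv].notna().astype(int)
```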
klib.preprocess module
Functions for data preprocessing.
author: Andreas Kanz
klib.preprocess.feature_selection_pipe(var_thresh=VarianceThreshold(threshold=0.1), select_from_model=SelectFromModel(estimator=LassoCV(cv=4, random_state=408), threshold='0.1*median'), select_from_model_info=PipeInfo(name='after select_from_model'), select_percentile=SelectPercentile(percentile=95), var_thresh_info=PipeInfo(name='after var_thresh'), select_percentile_info=PipeInfo(name='after select_percentile'))
Preprocessing operations for feature selection.
Parameters: - var_thresh: default, VarianceThreshold(threshold=0.1)
Specify a threshold to drop low variance features.
- select_from_model: default, SelectFromModel(LassoCV(cv=4, random_state=408), threshold=”0.1 * median”)
Specify an estimator which is used for selecting features based on importance weights.
- select_percentile: default, SelectPercentile(f_classif, percentile=95)
Specify a score-function and a percentile value of features to keep.
- var_thresh_info, select_from_model_info, select_percentile_info
Prints the shape of the dataset after applying the respective function. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
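The first pipeline step drops low-variance features. A numpy-only sketch of that rule on invented data (sklearn's VarianceThreshold(threshold=0.1) applies the same "variance must exceed the threshold" test):

```python
import numpy as np

X = np.array([
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.1],
    [3.0, 0.0, 4.9],
])

# Keep only columns whose variance exceeds the threshold;
# constant and near-constant features carry little information.
threshold = 0.1
variances = X.var(axis=0)
keep = variances > threshold
X_reduced = X[:, keep]
```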
klib.preprocess.num_pipe(imputer=IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408), scaler=RobustScaler())
Standard preprocessing operations on numerical data.
Parameters: - imputer: default, IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408)
- scaler: default, RobustScaler()
Returns: - Pipeline
klib.preprocess.cat_pipe(imputer=SimpleImputer(strategy='most_frequent'), encoder=OneHotEncoder(handle_unknown='ignore'), scaler=MaxAbsScaler(), encoder_info=PipeInfo(name='after encoding categorical data'))
Standard preprocessing operations on categorical data.
Parameters: - imputer: default, SimpleImputer(strategy=’most_frequent’)
- encoder: default, OneHotEncoder(handle_unknown=’ignore’)
Encode categorical features as a one-hot numeric array.
- scaler: default, MaxAbsScaler()
Scale each feature by its maximum absolute value. MaxAbsScaler() does not shift/center the data, and thus does not destroy any sparsity. It is recommended to check for outliers before applying MaxAbsScaler().
- encoder_info:
Prints the shape of the dataset at the end of ‘cat_pipe’. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
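The cat_pipe stages (most-frequent imputation, one-hot encoding, max-abs scaling) can be sketched with pandas alone on a toy column; the real pipeline uses the sklearn transformers named above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", None, "red"]})

# Impute the most frequent value, one-hot encode, then max-abs scale
# (a no-op on 0/1 dummies, but it preserves sparsity in the general case).
mode = df["color"].mode()[0]
filled = df["color"].fillna(mode)
dummies = pd.get_dummies(filled).astype(float)
scaled = dummies / dummies.abs().max()
```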
klib.preprocess.train_dev_test_split(data, target, dev_size=0.1, test_size=0.1, stratify=None, random_state=408)
Split a dataset and a label column into train, dev and test sets.
Parameters: - data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots.
- target: string, list, np.array or pd.Series, default None
Specify the label column (or label data) that is split alongside the feature data.
- dev_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the dev split.
- test_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
- stratify: target column, default None
If not None, data is split in a stratified fashion, using the input as the class labels.
- random_state: integer, default 408
Seed used by the random number generator.
Returns: - tuple: Tuple containing train-dev-test split of inputs.
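With the default sizes this yields an 80/10/10 split. One way to realize such a three-way split, sketched here as a seeded shuffle over row indices (an illustration of the proportions, not klib's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(408)
n = 100
dev_size, test_size = 0.1, 0.1

# Shuffle once, then carve off test and dev; the remainder is train.
idx = rng.permutation(n)
n_test = int(n * test_size)
n_dev = int(n * dev_size)
test_idx = idx[:n_test]
dev_idx = idx[n_test:n_test + n_dev]
train_idx = idx[n_test + n_dev:]
```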
Module contents
Data Science Module for Python
klib is an easy-to-use Python library of customized functions for cleaning and analyzing data.
klib.clean_column_names(data: pandas.core.frame.DataFrame, hints: bool = True) → pandas.core.frame.DataFrame
Cleans the column names of the provided Pandas DataFrame and optionally provides hints on duplicate and long column names.
Parameters: - data : pd.DataFrame
Original Dataframe with columns to be cleaned
- hints : bool, optional
Print out hints on column name duplication and column name length, by default True
Returns: - pd.DataFrame
Pandas DataFrame with cleaned column names
klib.convert_datatypes(data: pandas.core.frame.DataFrame, category: bool = True, cat_threshold: float = 0.05, cat_exclude: Optional[List[Union[str, int]]] = None) → pandas.core.frame.DataFrame
Converts columns to the best possible dtypes using dtypes supporting pd.NA. Temporarily not converting to integers due to an issue in pandas. This is expected to be fixed in pandas 1.1. See https://github.com/pandas-dev/pandas/issues/33803
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- category : bool, optional
Change dtypes of columns with dtype “object” to “category”. Set threshold using cat_threshold or exclude columns using cat_exclude, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.05
- cat_exclude : Optional[List[Union[str, int]]], optional
List of columns to exclude from categorical conversion, by default None
Returns: - pd.DataFrame
Pandas DataFrame with converted Datatypes
klib.data_cleaning(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 0.9, drop_threshold_rows: float = 0.9, drop_duplicates: bool = True, convert_dtypes: bool = True, col_exclude: Optional[List[str]] = None, category: bool = True, cat_threshold: float = 0.03, cat_exclude: Optional[List[Union[str, int]]] = None, clean_col_names: bool = True, show: str = 'changes') → pandas.core.frame.DataFrame
Perform initial data cleaning tasks on a dataset, such as dropping single-valued and empty rows and empty columns, as well as optimizing the datatypes.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_duplicates : bool, optional
Drop duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values, by default True
- convert_dtypes : bool, optional
Convert dtypes using pd.convert_dtypes(), by default True
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping, by default None
- category : bool, optional
Enable changing dtypes of “object” columns to “category”. Set threshold using cat_threshold. Requires convert_dtypes=True, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.03
- cat_exclude : Optional[List[str]], optional
List of columns to exclude from categorical conversion, by default None
- clean_col_names : bool, optional
Cleans the column names and provides hints on duplicate and long names, by default True
- show : str, optional
{“all”, “changes”, None}, by default “changes” Specify verbosity of the output:
- “all”: Print information about the data before and after cleaning as well as information about changes and memory usage (deep). Please be aware that this can slow down the function considerably.
- “changes”: Print out differences in the data before and after cleaning.
- None: No information about the data and the data cleaning is printed.
Returns: - pd.DataFrame
Cleaned Pandas DataFrame
See also
convert_datatypes
- Convert columns to best possible dtypes.
drop_missing
- Flexibly drop columns and rows.
_memory_usage
- Gives the total memory usage in megabytes.
_missing_vals
- Metrics about missing values in the dataset.
Notes
The category dtype is not grouped in the summary, unless it contains exactly the same categories.
klib.drop_missing(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 1, drop_threshold_rows: float = 1, col_exclude: Optional[List[str]] = None) → pandas.core.frame.DataFrame
Drops completely empty columns and rows by default, and optionally provides the flexibility to loosen these restrictions and drop additional non-empty columns and rows based on the fraction of NA-values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 1
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 1
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping. The excluded columns do not affect the drop thresholds, by default None
Returns: - pd.DataFrame
Pandas DataFrame without any empty columns or rows
Notes
Columns are dropped first.
klib.mv_col_handling(data: pandas.core.frame.DataFrame, target: Union[str, pandas.core.series.Series, List[T], None] = None, mv_threshold: float = 0.1, corr_thresh_features: float = 0.5, corr_thresh_target: float = 0.3, return_details: bool = False) → pandas.core.frame.DataFrame
Converts columns with a high ratio of missing values into binary features and eventually drops them based on their correlation with other features and the target variable. This function follows a three-step process:
- 1) Identify features with a high ratio of missing values (above “mv_threshold”).
- 2) Identify high correlations of these features among themselves and with other features in the dataset (above “corr_thresh_features”).
- 3) Features with a high ratio of missing values and high correlation among each other are dropped unless they correlate reasonably well with the target variable (above “corr_thresh_target”).
Note: If no target is provided, the process exits after step two and drops columns identified up to this point.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- target : Optional[Union[str, pd.Series, List]], optional
Specify target for correlation. I.e. label column to generate only the correlations between each feature and the label, by default None
- mv_threshold : float, optional
Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates for dropping and undergo further analysis, by default 0.1
- corr_thresh_features : float, optional
Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature (with a high mv-ratio) is allowed to have with another feature. If this threshold is exceeded, the feature undergoes further analysis, by default 0.5
- corr_thresh_target : float, optional
Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met the feature is ultimately dropped, by default 0.3
- return_details : bool, optional
Provides flexibility to return intermediary results, by default False
Returns: - pd.DataFrame
Updated Pandas DataFrame
- optional:
- cols_mv: Columns with missing values included in the analysis
- drop_cols: List of dropped columns
klib.pool_duplicate_subsets(data: pandas.core.frame.DataFrame, col_dupl_thresh: float = 0.2, subset_thresh: float = 0.2, min_col_pool: int = 3, exclude: Optional[List[str]] = None, return_details=False) → pandas.core.frame.DataFrame
Checks for duplicates in subsets of columns and pools them. This can reduce the number of columns in the data without losing much information. Suitable columns are combined to subsets and tested for duplicates. In case sufficient duplicates can be found, the respective columns are aggregated into a “pooled_var” column. Identical numbers in the “pooled_var” column indicate identical information in the respective rows.
Note: It is advised to exclude features that provide sufficient informational content by themselves, as well as the target column, by using the “exclude” setting.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- col_dupl_thresh : float, optional
Columns with a ratio of duplicates higher than “col_dupl_thresh” are considered in the further analysis. Columns with a lower ratio are not considered for pooling, by default 0.2
- subset_thresh : float, optional
The first subset with a duplicate threshold higher than “subset_thresh” is chosen and aggregated. If no subset reaches the threshold, the algorithm continues with continuously smaller subsets until “min_col_pool” is reached, by default 0.2
- min_col_pool : int, optional
Minimum number of columns to pool. The algorithm attempts to combine as many columns as possible to suitable subsets and stops when “min_col_pool” is reached, by default 3
- exclude : Optional[List[str]], optional
List of column names to be excluded from the analysis. These columns are passed through without modification, by default None
- return_details : bool, optional
Provides flexibility to return intermediary results, by default False
Returns: - pd.DataFrame
DataFrame with low cardinality columns pooled
- optional:
- subset_cols: List of columns used as subset
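The core test (how many duplicate rows does a column subset have, and if enough, collapse the subset into one pooled column) can be sketched with pandas on invented data. The group-numbering via ngroup() is an illustrative stand-in for the aggregation step:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 2],
    "b": ["x", "x", "x", "y"],
    "unique": [10, 20, 30, 40],
})

# Ratio of duplicate rows within the candidate subset.
subset = ["a", "b"]
dupl_ratio = df.duplicated(subset=subset).mean()

# If the ratio clears subset_thresh, replace the subset with a single
# "pooled_var" column whose codes identify identical (a, b) combinations.
subset_thresh = 0.2
if dupl_ratio > subset_thresh:
    pooled = df.groupby(subset, sort=False).ngroup()
    df = df.drop(columns=subset).assign(pooled_var=pooled)
```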
klib.cat_plot(data: pandas.core.frame.DataFrame, figsize: Tuple = (18, 18), top: int = 3, bottom: int = 3, bar_color_top: str = '#5ab4ac', bar_color_bottom: str = '#d8b365')
Two-dimensional visualization of the number and frequency of categorical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- figsize : Tuple, optional
Use to control the figure size, by default (18, 18)
- top : int, optional
Show the “top” most frequent values in a column, by default 3
- bottom : int, optional
Show the “bottom” most frequent values in a column, by default 3
- bar_color_top : str, optional
Use to control the color of the bars indicating the most common values, by default “#5ab4ac”
- bar_color_bottom : str, optional
Use to control the color of the bars indicating the least common values, by default “#d8b365”
Returns: - GridSpec
gs: Figure with array of Axes objects
klib.corr_mat(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray, str, None] = None, method: str = 'pearson', colored: bool = True) → Union[pandas.core.frame.DataFrame, Any]
Returns a color-encoded correlation matrix.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
Type of split to be performed, by default None {None, “pos”, “neg”, “high”, “low”}
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.DataFrame, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
- method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust on smaller datasets than “spearman”.
- colored : bool, optional
If True the negative values in the correlation matrix are colored in red, by default True
Returns: - Union[pd.DataFrame, pd.Styler]
If colored = True: corr - Pandas Styler object
If colored = False: corr - Pandas DataFrame
- klib.corr_plot(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.series.Series, str, None] = None, method: str = 'pearson', cmap: str = 'BrBG', figsize: Tuple = (12, 10), annot: bool = True, dev: bool = False, **kwargs)[source]¶
Two-dimensional visualization of the correlation between feature-columns excluding NA values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
- Type of split to be performed {None, “pos”, “neg”, “high”, “low”}, by default None
- None: visualize all correlations between the feature-columns
- pos: visualize all positive correlations between the feature-columns above the threshold
- neg: visualize all negative correlations between the feature-columns below the threshold
- high: visualize all correlations between the feature-columns for which abs (corr) > threshold is True
- low: visualize all correlations between the feature-columns for which abs(corr) < threshold is True
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.Series, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
- method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust on smaller datasets than “spearman”.
- cmap : str, optional
The mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default “BrBG”
- figsize : Tuple, optional
Use to control the figure size, by default (12, 10)
- annot : bool, optional
Use to show or hide annotations, by default True
- dev : bool, optional
Display figure settings in the plot by setting dev = True. If False, the settings are not displayed, by default False
- Keyword Arguments : optional
Additional elements to control the visualization of the plot, e.g.:
- mask: bool, default True
- If set to False, the entire correlation matrix, including the upper triangle, is shown. Set dev = False in this case to avoid overlap.
- vmax: float, default is calculated from the given correlation coefficients.
- Value in the range vmin <= vmax <= 1; limits the upper end of the cbar range.
- vmin: float, default is calculated from the given correlation coefficients.
- Value in the range -1 <= vmin <= vmax; limits the lower end of the cbar range.
- linewidths: float, default 0.5
- Controls the line-width between the squares.
- annot_kws: dict, default {“size” : 10}
- Controls the font size of the annotations. Only available when annot = True.
- cbar_kws: dict, default {“shrink”: .95, “aspect”: 30}
- Controls the size of the colorbar.
- Many more kwargs are available, e.g. “alpha” to control blending, or options to adjust labels, ticks …
Kwargs can be supplied through a dictionary of key-value pairs (see above).
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
- klib.dist_plot(data: pandas.core.frame.DataFrame, mean_color: str = 'orange', size: int = 2.5, fill_range: Tuple = (0.025, 0.975), showall: bool = False, kde_kws: Dict[str, Any] = None, rug_kws: Dict[str, Any] = None, fill_kws: Dict[str, Any] = None, font_kws: Dict[str, Any] = None)[source]¶
Two-dimensional visualization of the distribution of non-binary numerical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- mean_color : str, optional
Color of the vertical line indicating the mean of the data, by default “orange”
- size : int, optional
Controls the plot size, by default 2.5
- fill_range : Tuple, optional
Set the quantiles for shading, by default (0.025, 0.975). The default spans 95% of the data, which corresponds to roughly two standard deviations above and below the mean for normally distributed data
- showall : bool, optional
Set to True to remove the output limit of 20 plots, by default False
- kde_kws : Dict[str, Any], optional
Keyword arguments for kdeplot(), by default {“color”: “k”, “alpha”: 0.75, “linewidth”: 1.5, “bw_adjust”: 0.8}
- rug_kws : Dict[str, Any], optional
Keyword arguments for rugplot(), by default {“color”: “#ff3333”, “alpha”: 0.15, “lw”: 3, “height”: 0.075}
- fill_kws : Dict[str, Any], optional
Keyword arguments to control the fill, by default {“color”: “#80d4ff”, “alpha”: 0.2}
- font_kws : Dict[str, Any], optional
Keyword arguments to control the font, by default {“color”: “#111111”, “weight”: “normal”, “size”: 11}
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
- klib.missingval_plot(data: pandas.core.frame.DataFrame, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE')[source]¶
Two-dimensional visualization of the missing values in a dataset.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- cmap : str, optional
Any valid colormap can be used. E.g. “Greys”, “RdPu”. More information can be found in the matplotlib documentation, by default “PuBuGn”
- figsize : Tuple, optional
Use to control the figure size, by default (20, 20)
- sort : bool, optional
Sort columns based on missing values in descending order and drop columns without any missing values, by default False
- spine_color : str, optional
Set to “None” to hide the spines on all plots or use any valid matplotlib color argument, by default “#EEEEEE”
Returns: - GridSpec
gs: Figure with array of Axes objects
- klib.feature_selection_pipe(var_thresh=VarianceThreshold(threshold=0.1), select_from_model=SelectFromModel(estimator=LassoCV(cv=4, random_state=408), threshold='0.1*median'), select_percentile=SelectPercentile(percentile=95), var_thresh_info=PipeInfo(name='after var_thresh'), select_from_model_info=PipeInfo(name='after select_from_model'), select_percentile_info=PipeInfo(name='after select_percentile'))[source]¶
Preprocessing operations for feature selection.
Parameters: - var_thresh: default, VarianceThreshold(threshold=0.1)
Specify a threshold to drop low variance features.
- select_from_model: default, SelectFromModel(LassoCV(cv=4, random_state=408), threshold=”0.1 * median”)
Specify an estimator which is used for selecting features based on importance weights.
- select_percentile: default, SelectPercentile(f_classif, percentile=95)
Specify a score-function and a percentile value of features to keep.
- var_thresh_info, select_from_model_info, select_percentile_info
Prints the shape of the dataset after applying the respective function. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
- klib.num_pipe(imputer=IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408), scaler=RobustScaler())[source]¶
Standard preprocessing operations on numerical data.
Parameters: - imputer: default, IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408)
- scaler: default, RobustScaler()
Returns: - Pipeline
- klib.cat_pipe(imputer=SimpleImputer(strategy='most_frequent'), encoder=OneHotEncoder(handle_unknown='ignore'), scaler=MaxAbsScaler(), encoder_info=PipeInfo(name='after encoding categorical data'))[source]¶
Standard preprocessing operations on categorical data.
Parameters: - imputer: default, SimpleImputer(strategy=’most_frequent’)
- encoder: default, OneHotEncoder(handle_unknown=’ignore’)
Encode categorical features as a one-hot numeric array.
- scaler: default, MaxAbsScaler()
Scale each feature by its maximum absolute value. MaxAbsScaler() does not shift/center the data, and thus does not destroy any sparsity. It is recommended to check for outliers before applying MaxAbsScaler().
- encoder_info:
Prints the shape of the dataset at the end of ‘cat_pipe’. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
- klib.train_dev_test_split(data, target, dev_size=0.1, test_size=0.1, stratify=None, random_state=408)[source]¶
Split a dataset and a label column into train, dev and test sets.
Parameters: - data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots.
- target: string, list, np.array or pd.Series, default None
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label.
- dev_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the dev split.
- test_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
- stratify: target column, default None
If not None, data is split in a stratified fashion, using the input as the class labels.
- random_state: integer, default 408
Random_state is the seed used by the random number generator.
Returns: - tuple: Tuple containing train-dev-test split of inputs.