Modules
Submodules
klib.describe module
Functions for descriptive analytics.
author: Andreas Kanz
klib.describe.cat_plot(data: pandas.core.frame.DataFrame, figsize: Tuple = (18, 18), top: int = 3, bottom: int = 3, bar_color_top: str = '#5ab4ac', bar_color_bottom: str = '#d8b365')
Two-dimensional visualization of the number and frequency of categorical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- figsize : Tuple, optional
Use to control the figure size, by default (18, 18)
- top : int, optional
Show the “top” most frequent values in a column, by default 3
- bottom : int, optional
Show the “bottom” most frequent values in a column, by default 3
- bar_color_top : str, optional
Use to control the color of the bars indicating the most common values, by default “#5ab4ac”
- bar_color_bottom : str, optional
Use to control the color of the bars indicating the least common values, by default “#d8b365”
Returns: - GridSpec
gs: Figure with array of Axes objects
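The top/bottom counting that cat_plot visualizes can be sketched with plain pandas. The DataFrame below is an invented toy example; the klib call in the comment is the documented entry point:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "red", "blue", "green", "red", "blue"]})

top, bottom = 2, 1
counts = df["color"].value_counts()  # frequency per category, descending
top_vals = counts.head(top)          # the "top" most frequent values
bottom_vals = counts.tail(bottom)    # the "bottom" least frequent values

# klib.cat_plot(df, top=2, bottom=1) draws these counts as colored bars.
```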
klib.describe.corr_mat(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray, str, None] = None, method: str = 'pearson', colored: bool = True) → Union[pandas.core.frame.DataFrame, Any]
Returns a color-encoded correlation matrix.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
Type of split to be performed, by default None {None, “pos”, “neg”, “high”, “low”}
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.DataFrame, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.
- colored : bool, optional
If True the negative values in the correlation matrix are colored in red, by default True
Returns: - Union[pd.DataFrame, pd.Styler]
- If colored = True: corr is a Pandas Styler object
- If colored = False: corr is a Pandas DataFrame
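Underneath the color coding, corr_mat builds on pandas' own correlation matrix; the split/threshold filtering can be sketched like this (toy data, with the klib call noted in a comment):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})

# klib.corr_mat(df) wraps this matrix and adds red/blue coloring.
corr = df.corr(method="pearson")

# split="high" keeps only coefficients with abs(corr) > threshold.
threshold = 0.9
high = corr.where(corr.abs() > threshold)
```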
klib.describe.corr_plot(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.series.Series, str, None] = None, method: str = 'pearson', cmap: str = 'BrBG', figsize: Tuple = (12, 10), annot: bool = True, dev: bool = False, **kwargs)
Two-dimensional visualization of the correlation between feature-columns, excluding NA values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
- Type of split to be performed {None, “pos”, “neg”, “high”, “low”}, by default None
- None: visualize all correlations between the feature-columns
- pos: visualize all positive correlations between the feature-columns above the threshold
- neg: visualize all negative correlations between the feature-columns below the threshold
- high: visualize all correlations between the feature-columns for which abs(corr) > threshold is True
- low: visualize all correlations between the feature-columns for which abs(corr) < threshold is True
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.Series, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
- method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.
- cmap : str, optional
The mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default “BrBG”
- figsize : Tuple, optional
Use to control the figure size, by default (12, 10)
- annot : bool, optional
Use to show or hide annotations, by default True
- dev : bool, optional
Display figure settings in the plot by setting dev = True. If False, the settings are not displayed, by default False
- Keyword Arguments : optional
Additional elements to control the visualization of the plot, e.g.:
- mask: bool, default True
- If set to False the entire correlation matrix, including the upper triangle is shown. Set dev = False in this case to avoid overlap.
- vmax: float, default is calculated from the given correlation coefficients.
- Value between -1 or vmin <= vmax <= 1, limits the range of the cbar.
- vmin: float, default is calculated from the given correlation coefficients.
- Value between -1 <= vmin <= 1 or vmax, limits the range of the cbar.
- linewidths: float, default 0.5
- Controls the line width between the squares.
- annot_kws: dict, default {“size” : 10}
- Controls the font size of the annotations. Only available when annot = True.
- cbar_kws: dict, default {“shrink”: .95, “aspect”: 30}
- Controls the size of the colorbar.
- Many more kwargs are available, e.g. “alpha” to control blending, or options to adjust labels, ticks …
Kwargs can be supplied through a dictionary of key-value pairs (see above).
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
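The mask kwarg described above hides the redundant upper triangle of the correlation matrix. A numpy sketch of that masking, on invented data (the klib call would be klib.corr_plot(df, mask=True)):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 3, 2, 4], "z": [4, 3, 2, 1]})
corr = df.corr()

# mask=True (the default) blanks the diagonal and upper triangle,
# so each coefficient is annotated only once.
mask = np.triu(np.ones_like(corr, dtype=bool))
lower = corr.mask(mask)  # what remains to be annotated
```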
klib.describe.dist_plot(data: pandas.core.frame.DataFrame, mean_color: str = 'orange', size: int = 2.5, fill_range: Tuple = (0.025, 0.975), showall: bool = False, kde_kws: Dict[str, Any] = None, rug_kws: Dict[str, Any] = None, fill_kws: Dict[str, Any] = None, font_kws: Dict[str, Any] = None)
Two-dimensional visualization of the distribution of non-binary numerical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- mean_color : str, optional
Color of the vertical line indicating the mean of the data, by default “orange”
- size : int, optional
Controls the plot size, by default 2.5
- fill_range : Tuple, optional
Set the quantiles for shading. Default spans 95% of the data, which is about two std. deviations above and below the mean, by default (0.025, 0.975)
- showall : bool, optional
Set to True to remove the output limit of 20 plots, by default False
- kde_kws : Dict[str, Any], optional
Keyword arguments for kdeplot(), by default {“color”: “k”, “alpha”: 0.75, “linewidth”: 1.5, “bw_adjust”: 0.8}
- rug_kws : Dict[str, Any], optional
Keyword arguments for rugplot(), by default {“color”: “#ff3333”, “alpha”: 0.15, “lw”: 3, “height”: 0.075}
- fill_kws : Dict[str, Any], optional
Keyword arguments to control the fill, by default {“color”: “#80d4ff”, “alpha”: 0.2}
- font_kws : Dict[str, Any], optional
Keyword arguments to control the font, by default {“color”: “#111111”, “weight”: “normal”, “size”: 11}
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
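The fill_range shading and the mean line can be sketched numerically with pandas quantiles (invented data; the plot itself comes from klib.dist_plot(df)):

```python
import pandas as pd

s = pd.Series(range(1, 101))  # toy numerical feature: 1..100

# fill_range=(0.025, 0.975) shades the central 95% of the distribution,
# and mean_color marks a vertical line at the mean.
lo, hi = s.quantile(0.025), s.quantile(0.975)
mean = s.mean()
```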
klib.describe.missingval_plot(data: pandas.core.frame.DataFrame, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE')
Two-dimensional visualization of the missing values in a dataset.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- cmap : str, optional
Any valid colormap can be used. E.g. “Greys”, “RdPu”. More information can be found in the matplotlib documentation, by default “PuBuGn”
- figsize : Tuple, optional
Use to control the figure size, by default (20, 20)
- sort : bool, optional
Sort columns based on missing values in descending order and drop columns without any missing values, by default False
- spine_color : str, optional
Set to “None” to hide the spines on all plots or use any valid matplotlib color argument, by default “#EEEEEE”
Returns: - GridSpec
gs: Figure with array of Axes objects
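The sort option's column ordering can be sketched with pandas alone (toy data; klib.missingval_plot(df, sort=True) produces the actual figure):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [1, 2, 3],
    "c": [np.nan, 2, 3],
})

# sort=True orders columns by missing count (descending) and drops
# columns without any missing values before plotting.
mv = df.isna().sum()
order = mv[mv > 0].sort_values(ascending=False).index.tolist()
```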
klib.clean module
Functions for data cleaning.
author: Andreas Kanz
klib.clean.clean_column_names(data: pandas.core.frame.DataFrame, hints: bool = True) → pandas.core.frame.DataFrame
Cleans the column names of the provided Pandas DataFrame and optionally provides hints on duplicate and long column names.
Parameters: - data : pd.DataFrame
Original DataFrame with columns to be cleaned
- hints : bool, optional
Print out hints on column name duplication and column name length, by default True
Returns: - pd.DataFrame
Pandas DataFrame with cleaned column names
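A minimal sketch of this kind of cleaning, on made-up column names. The rules below (strip, lowercase, replace non-alphanumeric runs with underscores) are illustrative assumptions, not klib's exact rule set; the documented call is simply klib.clean_column_names(df):

```python
import re
import pandas as pd

df = pd.DataFrame(columns=["First Name", " Zip-Code ", "AGE"])

def clean(name: str) -> str:
    # Illustrative rules: strip, lowercase, collapse non-alphanumerics to "_".
    name = name.strip().lower()
    return re.sub(r"[^0-9a-z]+", "_", name).strip("_")

df.columns = [clean(c) for c in df.columns]
```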
klib.clean.convert_datatypes(data: pandas.core.frame.DataFrame, category: bool = True, cat_threshold: float = 0.05, cat_exclude: Optional[List[Union[str, int]]] = None) → pandas.core.frame.DataFrame
Converts columns to the best possible dtypes using dtypes supporting pd.NA. Temporarily not converting to integers due to an issue in pandas. This is expected to be fixed in pandas 1.1. See https://github.com/pandas-dev/pandas/issues/33803
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- category : bool, optional
Change dtypes of columns with dtype “object” to “category”. Set threshold using cat_threshold or exclude columns using cat_exclude, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.05
- cat_exclude : Optional[List[Union[str, int]]], optional
List of columns to exclude from categorical conversion, by default None
Returns: - pd.DataFrame
Pandas DataFrame with converted Datatypes
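The building block is pandas' own DataFrame.convert_dtypes(); the category inference on top of it can be sketched as a unique-value-ratio check (toy data, and a deliberately loose cat_threshold for the demo):

```python
import pandas as pd

df = pd.DataFrame({
    "n": [1.0, 2.0, None],
    "city": ["NY", "NY", "LA"],
})

# klib.convert_datatypes(df) first moves columns to pd.NA-aware dtypes ...
out = df.convert_dtypes()

# ... then changes object/string columns to "category" when the ratio of
# unique values falls below cat_threshold.
cat_threshold = 0.7
ratio = out["city"].nunique() / len(out)
if ratio < cat_threshold:
    out["city"] = out["city"].astype("category")
```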
klib.clean.data_cleaning(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 0.9, drop_threshold_rows: float = 0.9, drop_duplicates: bool = True, convert_dtypes: bool = True, col_exclude: Optional[List[str]] = None, category: bool = True, cat_threshold: float = 0.03, cat_exclude: Optional[List[Union[str, int]]] = None, clean_col_names: bool = True, show: str = 'changes') → pandas.core.frame.DataFrame
Perform initial data cleaning tasks on a dataset, such as dropping single-valued and empty rows and empty columns, as well as optimizing the datatypes.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_duplicates : bool, optional
Drop duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values, by default True
- convert_dtypes : bool, optional
Convert dtypes using pd.convert_dtypes(), by default True
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping, by default None
- category : bool, optional
Enable changing dtypes of “object” columns to “category”. Set threshold using cat_threshold. Requires convert_dtypes=True, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.03
- cat_exclude : Optional[List[str]], optional
List of columns to exclude from categorical conversion, by default None
- clean_col_names : bool, optional
Cleans the column names and provides hints on duplicate and long names, by default True
- show : str, optional
{“all”, “changes”, None}, by default “changes” Specify verbosity of the output:
- “all”: Print information about the data before and after cleaning as well as information about changes and memory usage (deep). Please be aware that this can slow down the function considerably.
- “changes”: Print out differences in the data before and after cleaning.
- None: No information about the data and the data cleaning is printed.
Returns: - pd.DataFrame
Cleaned Pandas DataFrame
See also
convert_datatypes
- Convert columns to best possible dtypes.
drop_missing
- Flexibly drop columns and rows.
_memory_usage
- Gives the total memory usage in megabytes.
_missing_vals
- Metrics about missing values in the dataset.
Notes
The category dtype is not grouped in the summary, unless it contains exactly the same categories.
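A rough pandas-only sketch of the cleaning steps listed above, on invented data; klib.data_cleaning(df) bundles them (plus category inference, column-name cleaning, and the change report):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "empty": [np.nan, np.nan, np.nan, np.nan],  # NA-ratio 1.0 -> dropped
    "x": [1, 2, 3, 3],
    "val": ["a", "b", "c", "c"],
})

thresh_cols = 0.9
na_ratio = df.isna().mean()
cleaned = df.loc[:, na_ratio < thresh_cols]  # drop near-empty columns
cleaned = cleaned.drop_duplicates()          # keep first occurrence
cleaned = cleaned.convert_dtypes()           # pd.NA-aware dtypes
```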
klib.clean.drop_missing(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 1, drop_threshold_rows: float = 1, col_exclude: Optional[List[str]] = None) → pandas.core.frame.DataFrame
Drops completely empty columns and rows by default, and optionally provides the flexibility to loosen these restrictions and drop additional non-empty columns and rows based on the fraction of NA-values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 1
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 1
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping. The excluded columns do not affect the drop thresholds, by default None
Returns: - pd.DataFrame
Pandas DataFrame without any empty columns or rows
Notes
Columns are dropped first.
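The threshold rule (drop where the NA-ratio reaches the threshold, columns before rows) can be sketched directly in pandas on toy data; the klib call is klib.drop_missing(df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, np.nan],
    "b": [np.nan] * 4,            # completely empty -> dropped
    "c": [1, 2, 3, np.nan],
})

# Drop where NA-ratio >= threshold; columns first, then rows (see Notes).
drop_threshold_cols = drop_threshold_rows = 1.0
out = df.loc[:, df.isna().mean() < drop_threshold_cols]
out = out.loc[out.isna().mean(axis=1) < drop_threshold_rows]
```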
klib.clean.mv_col_handling(data: pandas.core.frame.DataFrame, target: Union[str, pandas.core.series.Series, List[T], None] = None, mv_threshold: float = 0.1, corr_thresh_features: float = 0.5, corr_thresh_target: float = 0.3, return_details: bool = False) → pandas.core.frame.DataFrame
Converts columns with a high ratio of missing values into binary features and eventually drops them based on their correlation with other features and the target variable. This function follows a three-step process:
- 1) Identify features with a high ratio of missing values (above “mv_threshold”).
- 2) Identify high correlations of these features among themselves and with other features in the dataset (above “corr_thresh_features”).
- 3) Features with a high ratio of missing values and high correlation among each other are dropped unless they correlate reasonably well with the target variable (above “corr_thresh_target”).
Note: If no target is provided, the process exits after step two and drops columns identified up to this point.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- target : Optional[Union[str, pd.Series, List]], optional
Specify target for correlation. I.e. label column to generate only the correlations between each feature and the label, by default None
- mv_threshold : float, optional
Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates for dropping and undergo further analysis, by default 0.1
- corr_thresh_features : float, optional
Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature (with a high mv-ratio) is allowed to have with another feature. If this threshold is exceeded, the feature undergoes further analysis, by default 0.5
- corr_thresh_target : float, optional
Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met the feature is ultimately dropped, by default 0.3
- return_details : bool, optional
Provides flexibility to return intermediary results, by default False
Returns: - pd.DataFrame
Updated Pandas DataFrame
- optional:
- cols_mv: Columns with missing values included in the analysis
- drop_cols: List of dropped columns
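Step 1 of the process above (flag high-missing columns and binarize them into "value present?" indicators) can be sketched like this on invented data; steps 2 and 3 would then compare correlations of the binary columns against corr_thresh_features and corr_thresh_target:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sparse": [1.0, np.nan, np.nan, np.nan, 2.0, np.nan],
    "full": [1, 2, 3, 4, 5, 6],
})

# Step 1: columns whose missing-value ratio exceeds mv_threshold become
# candidates and are converted to binary presence indicators.
mv_threshold = 0.1
mv_ratio = df.isna().mean()
cols_mv = mv_ratio[mv_ratio > mv_threshold].index.tolist()
binary = df[cols_mv].notna().astype(int)
```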
klib.preprocess module
Functions for data preprocessing.
author: Andreas Kanz
klib.preprocess.feature_selection_pipe(var_thresh=VarianceThreshold(threshold=0.1), select_from_model=SelectFromModel(estimator=LassoCV(cv=4, random_state=408), threshold='0.1*median'), select_from_model_info=PipeInfo(name='after select_from_model'), select_percentile=SelectPercentile(percentile=95), var_thresh_info=PipeInfo(name='after var_thresh'), select_percentile_info=PipeInfo(name='after select_percentile'))
Preprocessing operations for feature selection.
Parameters: - var_thresh: default, VarianceThreshold(threshold=0.1)
Specify a threshold to drop low variance features.
- select_from_model: default, SelectFromModel(LassoCV(cv=4, random_state=408), threshold=”0.1 * median”)
Specify an estimator which is used for selecting features based on importance weights.
- select_percentile: default, SelectPercentile(f_classif, percentile=95)
Specify a score-function and a percentile value of features to keep.
- var_thresh_info, select_from_model_info, select_percentile_info
Prints the shape of the dataset after applying the respective function. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
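The first pipeline step drops low-variance features. A numpy-only sketch of that rule on invented data (sklearn's VarianceThreshold(threshold=0.1) applies the same "variance must exceed the threshold" test):

```python
import numpy as np

X = np.array([
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.1],
    [3.0, 0.0, 4.9],
])

# Keep only columns whose variance exceeds the threshold;
# constant and near-constant features carry little information.
threshold = 0.1
variances = X.var(axis=0)
keep = variances > threshold
X_reduced = X[:, keep]
```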
klib.preprocess.num_pipe(imputer=IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408), scaler=RobustScaler())
Standard preprocessing operations on numerical data.
Parameters: - imputer: default, IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408)
- scaler: default, RobustScaler()
Returns: - Pipeline
klib.preprocess.cat_pipe(imputer=SimpleImputer(strategy='most_frequent'), encoder=OneHotEncoder(handle_unknown='ignore'), scaler=MaxAbsScaler(), encoder_info=PipeInfo(name='after encoding categorical data'))
Standard preprocessing operations on categorical data.
Parameters: - imputer: default, SimpleImputer(strategy=’most_frequent’)
- encoder: default, OneHotEncoder(handle_unknown=’ignore’)
Encode categorical features as a one-hot numeric array.
- scaler: default, MaxAbsScaler()
Scale each feature by its maximum absolute value. MaxAbsScaler() does not shift/center the data, and thus does not destroy any sparsity. It is recommended to check for outliers before applying MaxAbsScaler().
- encoder_info:
Prints the shape of the dataset at the end of ‘cat_pipe’. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
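The cat_pipe stages (most-frequent imputation, one-hot encoding, max-abs scaling) can be sketched with pandas alone on a toy column; the real pipeline uses the sklearn transformers named above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", None, "red"]})

# Impute the most frequent value, one-hot encode, then max-abs scale
# (a no-op on 0/1 dummies, but it preserves sparsity in the general case).
mode = df["color"].mode()[0]
filled = df["color"].fillna(mode)
dummies = pd.get_dummies(filled).astype(float)
scaled = dummies / dummies.abs().max()
```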
klib.preprocess.train_dev_test_split(data, target, dev_size=0.1, test_size=0.1, stratify=None, random_state=408)
Split a dataset and a label column into train, dev and test sets.
Parameters: - data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots.
- target: string, list, np.array or pd.Series, default None
Specify the label column (or label data) that is split alongside the feature data.
- dev_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the dev split.
- test_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
- stratify: target column, default None
If not None, data is split in a stratified fashion, using the input as the class labels.
- random_state: integer, default 408
Seed used by the random number generator.
Returns: - tuple: Tuple containing train-dev-test split of inputs.
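With the default sizes this yields an 80/10/10 split. One way to realize such a three-way split, sketched here as a seeded shuffle over row indices (an illustration of the proportions, not klib's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(408)
n = 100
dev_size, test_size = 0.1, 0.1

# Shuffle once, then carve off test and dev; the remainder is train.
idx = rng.permutation(n)
n_test = int(n * test_size)
n_dev = int(n * dev_size)
test_idx = idx[:n_test]
dev_idx = idx[n_test:n_test + n_dev]
train_idx = idx[n_test + n_dev:]
```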
Module contents
Data Science Module for Python
klib is an easy-to-use Python library of customized functions for cleaning and analyzing data.
klib.clean_column_names(data: pandas.core.frame.DataFrame, hints: bool = True) → pandas.core.frame.DataFrame
Cleans the column names of the provided Pandas DataFrame and optionally provides hints on duplicate and long column names.
Parameters: - data : pd.DataFrame
Original Dataframe with columns to be cleaned
- hints : bool, optional
Print out hints on column name duplication and column name length, by default True
Returns: - pd.DataFrame
Pandas DataFrame with cleaned column names
klib.convert_datatypes(data: pandas.core.frame.DataFrame, category: bool = True, cat_threshold: float = 0.05, cat_exclude: Optional[List[Union[str, int]]] = None) → pandas.core.frame.DataFrame
Converts columns to the best possible dtypes using dtypes supporting pd.NA. Temporarily not converting to integers due to an issue in pandas. This is expected to be fixed in pandas 1.1. See https://github.com/pandas-dev/pandas/issues/33803
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- category : bool, optional
Change dtypes of columns with dtype “object” to “category”. Set threshold using cat_threshold or exclude columns using cat_exclude, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.05
- cat_exclude : Optional[List[Union[str, int]]], optional
List of columns to exclude from categorical conversion, by default None
Returns: - pd.DataFrame
Pandas DataFrame with converted Datatypes
klib.data_cleaning(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 0.9, drop_threshold_rows: float = 0.9, drop_duplicates: bool = True, convert_dtypes: bool = True, col_exclude: Optional[List[str]] = None, category: bool = True, cat_threshold: float = 0.03, cat_exclude: Optional[List[Union[str, int]]] = None, clean_col_names: bool = True, show: str = 'changes') → pandas.core.frame.DataFrame
Perform initial data cleaning tasks on a dataset, such as dropping single-valued and empty rows and empty columns, as well as optimizing the datatypes.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 0.9
- drop_duplicates : bool, optional
Drop duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values, by default True
- convert_dtypes : bool, optional
Convert dtypes using pd.convert_dtypes(), by default True
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping, by default None
- category : bool, optional
Enable changing dtypes of “object” columns to “category”. Set threshold using cat_threshold. Requires convert_dtypes=True, by default True
- cat_threshold : float, optional
Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.03
- cat_exclude : Optional[List[str]], optional
List of columns to exclude from categorical conversion, by default None
- clean_col_names : bool, optional
Cleans the column names and provides hints on duplicate and long names, by default True
- show : str, optional
{“all”, “changes”, None}, by default “changes” Specify verbosity of the output:
- “all”: Print information about the data before and after cleaning as well as information about changes and memory usage (deep). Please be aware that this can slow down the function considerably.
- “changes”: Print out differences in the data before and after cleaning.
- None: No information about the data and the data cleaning is printed.
Returns: - pd.DataFrame
Cleaned Pandas DataFrame
See also
convert_datatypes
- Convert columns to best possible dtypes.
drop_missing
- Flexibly drop columns and rows.
_memory_usage
- Gives the total memory usage in megabytes.
_missing_vals
- Metrics about missing values in the dataset.
Notes
The category dtype is not grouped in the summary, unless it contains exactly the same categories.
klib.drop_missing(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 1, drop_threshold_rows: float = 1, col_exclude: Optional[List[str]] = None) → pandas.core.frame.DataFrame
Drops completely empty columns and rows by default, and optionally provides the flexibility to loosen these restrictions and drop additional non-empty columns and rows based on the fraction of NA-values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- drop_threshold_cols : float, optional
Drop columns with NA-ratio equal to or above the specified threshold, by default 1
- drop_threshold_rows : float, optional
Drop rows with NA-ratio equal to or above the specified threshold, by default 1
- col_exclude : Optional[List[str]], optional
Specify a list of columns to exclude from dropping. The excluded columns do not affect the drop thresholds, by default None
Returns: - pd.DataFrame
Pandas DataFrame without any empty columns or rows
Notes
Columns are dropped first.
klib.mv_col_handling(data: pandas.core.frame.DataFrame, target: Union[str, pandas.core.series.Series, List[T], None] = None, mv_threshold: float = 0.1, corr_thresh_features: float = 0.5, corr_thresh_target: float = 0.3, return_details: bool = False) → pandas.core.frame.DataFrame
Converts columns with a high ratio of missing values into binary features and eventually drops them based on their correlation with other features and the target variable. This function follows a three-step process:
- 1) Identify features with a high ratio of missing values (above “mv_threshold”).
- 2) Identify high correlations of these features among themselves and with other features in the dataset (above “corr_thresh_features”).
- 3) Features with a high ratio of missing values and high correlation among each other are dropped unless they correlate reasonably well with the target variable (above “corr_thresh_target”).
Note: If no target is provided, the process exits after step two and drops columns identified up to this point.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- target : Optional[Union[str, pd.Series, List]], optional
Specify target for correlation. I.e. label column to generate only the correlations between each feature and the label, by default None
- mv_threshold : float, optional
Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates for dropping and undergo further analysis, by default 0.1
- corr_thresh_features : float, optional
Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature (with a high mv-ratio) is allowed to have with another feature. If this threshold is exceeded, the feature undergoes further analysis, by default 0.5
- corr_thresh_target : float, optional
Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met the feature is ultimately dropped, by default 0.3
- return_details : bool, optional
Provides flexibility to return intermediary results, by default False
Returns: - pd.DataFrame
Updated Pandas DataFrame
- optional:
- cols_mv: Columns with missing values included in the analysis
- drop_cols: List of dropped columns
klib.pool_duplicate_subsets(data: pandas.core.frame.DataFrame, col_dupl_thresh: float = 0.2, subset_thresh: float = 0.2, min_col_pool: int = 3, exclude: Optional[List[str]] = None, return_details=False) → pandas.core.frame.DataFrame
Checks for duplicates in subsets of columns and pools them. This can reduce the number of columns in the data without losing much information. Suitable columns are combined to subsets and tested for duplicates. In case sufficient duplicates can be found, the respective columns are aggregated into a “pooled_var” column. Identical numbers in the “pooled_var” column indicate identical information in the respective rows.
Note: It is advised to exclude features that provide sufficient informational content by themselves, as well as the target column, by using the “exclude” setting.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame
- col_dupl_thresh : float, optional
Columns with a ratio of duplicates higher than “col_dupl_thresh” are considered in the further analysis. Columns with a lower ratio are not considered for pooling, by default 0.2
- subset_thresh : float, optional
The first subset with a duplicate threshold higher than “subset_thresh” is chosen and aggregated. If no subset reaches the threshold, the algorithm continues with continuously smaller subsets until “min_col_pool” is reached, by default 0.2
- min_col_pool : int, optional
Minimum number of columns to pool. The algorithm attempts to combine as many columns as possible to suitable subsets and stops when “min_col_pool” is reached, by default 3
- exclude : Optional[List[str]], optional
List of column names to be excluded from the analysis. These columns are passed through without modification, by default None
- return_details : bool, optional
Provides flexibility to return intermediary results, by default False
Returns: - pd.DataFrame
DataFrame with low cardinality columns pooled
- optional:
- subset_cols: List of columns used as subset
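The core test (how many duplicate rows does a column subset have, and if enough, collapse the subset into one pooled column) can be sketched with pandas on invented data. The group-numbering via ngroup() is an illustrative stand-in for the aggregation step:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 2],
    "b": ["x", "x", "x", "y"],
    "unique": [10, 20, 30, 40],
})

# Ratio of duplicate rows within the candidate subset.
subset = ["a", "b"]
dupl_ratio = df.duplicated(subset=subset).mean()

# If the ratio clears subset_thresh, replace the subset with a single
# "pooled_var" column whose codes identify identical (a, b) combinations.
subset_thresh = 0.2
if dupl_ratio > subset_thresh:
    pooled = df.groupby(subset, sort=False).ngroup()
    df = df.drop(columns=subset).assign(pooled_var=pooled)
```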
klib.cat_plot(data: pandas.core.frame.DataFrame, figsize: Tuple = (18, 18), top: int = 3, bottom: int = 3, bar_color_top: str = '#5ab4ac', bar_color_bottom: str = '#d8b365')
Two-dimensional visualization of the number and frequency of categorical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- figsize : Tuple, optional
Use to control the figure size, by default (18, 18)
- top : int, optional
Show the “top” most frequent values in a column, by default 3
- bottom : int, optional
Show the “bottom” most frequent values in a column, by default 3
- bar_color_top : str, optional
Use to control the color of the bars indicating the most common values, by default “#5ab4ac”
- bar_color_bottom : str, optional
Use to control the color of the bars indicating the least common values, by default “#d8b365”
Returns: - GridSpec
gs: Figure with array of Axes objects
klib.corr_mat(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray, str, None] = None, method: str = 'pearson', colored: bool = True) → Union[pandas.core.frame.DataFrame, Any]
Returns a color-encoded correlation matrix.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
Type of split to be performed, by default None {None, “pos”, “neg”, “high”, “low”}
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.DataFrame, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
- method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust on smaller datasets than “spearman”.
- colored : bool, optional
If True the negative values in the correlation matrix are colored in red, by default True
Returns: - Union[pd.DataFrame, pd.Styler]
If colored = True: corr - Pandas Styler object
If colored = False: corr - Pandas DataFrame
- klib.corr_plot(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.series.Series, str, None] = None, method: str = 'pearson', cmap: str = 'BrBG', figsize: Tuple = (12, 10), annot: bool = True, dev: bool = False, **kwargs)[source]¶
Two-dimensional visualization of the correlation between feature-columns excluding NA values.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- split : Optional[str], optional
- Type of split to be performed {None, “pos”, “neg”, “high”, “low”}, by default None
- None: visualize all correlations between the feature-columns
- pos: visualize all positive correlations between the feature-columns above the threshold
- neg: visualize all negative correlations between the feature-columns below the threshold
- high: visualize all correlations between the feature-columns for which abs (corr) > threshold is True
- low: visualize all correlations between the feature-columns for which abs(corr) < threshold is True
- threshold : float, optional
Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3
- target : Optional[Union[pd.Series, str]], optional
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None
- method : str, optional
- method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
- pearson: measures linear relationships and requires normally distributed and homoscedastic data.
- spearman: ranked/ordinal correlation, measures monotonic relationships.
- kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust on smaller datasets than “spearman”.
- cmap : str, optional
The mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default “BrBG”
- figsize : Tuple, optional
Use to control the figure size, by default (12, 10)
- annot : bool, optional
Use to show or hide annotations, by default True
- dev : bool, optional
Display figure settings in the plot by setting dev = True. If False, the settings are not displayed, by default False
- Keyword Arguments : optional
Additional elements to control the visualization of the plot, e.g.:
- mask: bool, default True
- If set to False, the entire correlation matrix, including the upper triangle, is shown. Set dev = False in this case to avoid overlap.
- vmax: float, default is calculated from the given correlation coefficients.
- Value in the range vmin <= vmax <= 1; limits the upper end of the cbar range.
- vmin: float, default is calculated from the given correlation coefficients.
- Value in the range -1 <= vmin <= vmax; limits the lower end of the cbar range.
- linewidths: float, default 0.5
- Controls the line-width between the squares.
- annot_kws: dict, default {“size” : 10}
- Controls the font size of the annotations. Only available when annot = True.
- cbar_kws: dict, default {“shrink”: .95, “aspect”: 30}
- Controls the size of the colorbar.
- Many more kwargs are available, e.g. “alpha” to control blending, or options to adjust labels, ticks …
Kwargs can be supplied through a dictionary of key-value pairs (see above).
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
- klib.dist_plot(data: pandas.core.frame.DataFrame, mean_color: str = 'orange', size: int = 2.5, fill_range: Tuple = (0.025, 0.975), showall: bool = False, kde_kws: Dict[str, Any] = None, rug_kws: Dict[str, Any] = None, fill_kws: Dict[str, Any] = None, font_kws: Dict[str, Any] = None)[source]¶
Two-dimensional visualization of the distribution of non-binary numerical features.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- mean_color : str, optional
Color of the vertical line indicating the mean of the data, by default “orange”
- size : int, optional
Controls the plot size, by default 2.5
- fill_range : Tuple, optional
Set the quantiles for shading, by default (0.025, 0.975). The default spans 95% of the data, which corresponds to roughly two standard deviations above and below the mean for normally distributed data
- showall : bool, optional
Set to True to remove the output limit of 20 plots, by default False
- kde_kws : Dict[str, Any], optional
Keyword arguments for kdeplot(), by default {“color”: “k”, “alpha”: 0.75, “linewidth”: 1.5, “bw_adjust”: 0.8}
- rug_kws : Dict[str, Any], optional
Keyword arguments for rugplot(), by default {“color”: “#ff3333”, “alpha”: 0.15, “lw”: 3, “height”: 0.075}
- fill_kws : Dict[str, Any], optional
Keyword arguments to control the fill, by default {“color”: “#80d4ff”, “alpha”: 0.2}
- font_kws : Dict[str, Any], optional
Keyword arguments to control the font, by default {“color”: “#111111”, “weight”: “normal”, “size”: 11}
Returns: - ax: matplotlib Axes
Returns the Axes object with the plot for further tweaking.
- klib.missingval_plot(data: pandas.core.frame.DataFrame, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE')[source]¶
Two-dimensional visualization of the missing values in a dataset.
Parameters: - data : pd.DataFrame
2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
- cmap : str, optional
Any valid colormap can be used. E.g. “Greys”, “RdPu”. More information can be found in the matplotlib documentation, by default “PuBuGn”
- figsize : Tuple, optional
Use to control the figure size, by default (20, 20)
- sort : bool, optional
Sort columns based on missing values in descending order and drop columns without any missing values, by default False
- spine_color : str, optional
Set to “None” to hide the spines on all plots or use any valid matplotlib color argument, by default “#EEEEEE”
Returns: - GridSpec
gs: Figure with array of Axes objects
- klib.feature_selection_pipe(var_thresh=VarianceThreshold(threshold=0.1), select_from_model=SelectFromModel(estimator=LassoCV(cv=4, random_state=408), threshold='0.1*median'), select_percentile=SelectPercentile(percentile=95), var_thresh_info=PipeInfo(name='after var_thresh'), select_from_model_info=PipeInfo(name='after select_from_model'), select_percentile_info=PipeInfo(name='after select_percentile'))[source]¶
Preprocessing operations for feature selection.
Parameters: - var_thresh: default, VarianceThreshold(threshold=0.1)
Specify a threshold to drop low variance features.
- select_from_model: default, SelectFromModel(LassoCV(cv=4, random_state=408), threshold=”0.1 * median”)
Specify an estimator which is used for selecting features based on importance weights.
- select_percentile: default, SelectPercentile(f_classif, percentile=95)
Specify a score-function and a percentile value of features to keep.
- var_thresh_info, select_from_model_info, select_percentile_info
Prints the shape of the dataset after applying the respective function. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
- klib.num_pipe(imputer=IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408), scaler=RobustScaler())[source]¶
Standard preprocessing operations on numerical data.
Parameters: - imputer: default, IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408)
- scaler: default, RobustScaler()
Returns: - Pipeline
- klib.cat_pipe(imputer=SimpleImputer(strategy='most_frequent'), encoder=OneHotEncoder(handle_unknown='ignore'), scaler=MaxAbsScaler(), encoder_info=PipeInfo(name='after encoding categorical data'))[source]¶
Standard preprocessing operations on categorical data.
Parameters: - imputer: default, SimpleImputer(strategy=’most_frequent’)
- encoder: default, OneHotEncoder(handle_unknown=’ignore’)
Encode categorical features as a one-hot numeric array.
- scaler: default, MaxAbsScaler()
Scale each feature by its maximum absolute value. MaxAbsScaler() does not shift/center the data, and thus does not destroy any sparsity. It is recommended to check for outliers before applying MaxAbsScaler().
- encoder_info:
Prints the shape of the dataset at the end of ‘cat_pipe’. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].
Returns: - Pipeline
- klib.train_dev_test_split(data, target, dev_size=0.1, test_size=0.1, stratify=None, random_state=408)[source]¶
Split a dataset and a label column into train, dev and test sets.
Parameters: - data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots.
- target: string, list, np.array or pd.Series, default None
Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label.
- dev_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the dev split.
- test_size: float, default 0.1
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
- stratify: target column, default None
If not None, data is split in a stratified fashion, using the input as the class labels.
- random_state: integer, default 408
Random_state is the seed used by the random number generator.
Returns: - tuple: Tuple containing train-dev-test split of inputs.