dataframe_utils
utils.dataframe_utils
Data quality diagnostics utilities.
This module provides functions for diagnosing data quality issues in pandas DataFrames, including detecting duplicates, missing values, and data type mismatches.
Functions
combine_dataframes
utils.dataframe_utils.combine_dataframes(
    data_frames,
    join_column=None,
    how='left',
)
Combine multiple DataFrames into a single DataFrame.
Parameters
data_frames (Dict[str, pd.DataFrame], required): Dictionary of DataFrames to combine.
join_column (Optional[str], default None): Column to join on. If None, the function tries to find common columns.
how (str, default 'left'): Join method.
Returns
pd.DataFrame
Combined DataFrame or empty DataFrame if no data.
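The behavior described above can be approximated with plain pandas. The following is a minimal sketch, not the module's implementation: combine_dataframes_sketch is a hypothetical stand-in, and both the fallback to columns shared by every frame and the skipping of empty frames are assumptions about what "try to find common columns" and "no data" mean.

```python
from functools import reduce
from typing import Dict, Optional

import pandas as pd


def combine_dataframes_sketch(
    data_frames: Dict[str, pd.DataFrame],
    join_column: Optional[str] = None,
    how: str = "left",
) -> pd.DataFrame:
    """Hypothetical stand-in for combine_dataframes (assumed behavior)."""
    # Assumption: empty or missing frames are skipped.
    frames = [df for df in data_frames.values() if df is not None and not df.empty]
    if not frames:
        return pd.DataFrame()

    if join_column is None:
        # Assumption: fall back to the columns present in every frame.
        common = set(frames[0].columns)
        for df in frames[1:]:
            common &= set(df.columns)
        if not common:
            return pd.DataFrame()
        on = sorted(common)
    else:
        on = [join_column]

    # Merge the frames pairwise, left to right, in dictionary order.
    return reduce(lambda left, right: left.merge(right, on=on, how=how), frames)


frames = {
    "users": pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]}),
    "scores": pd.DataFrame({"id": [1, 2], "score": [10, 20]}),
}
combined = combine_dataframes_sketch(frames, join_column="id")
```

With how='left', every row of the first frame is kept, so the row with id 3 has a missing score.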
compare_dataframes
utils.dataframe_utils.compare_dataframes(
    df1,
    df2,
    key_column,
    compare_columns=None,
    console=None,
)
Compare two DataFrames and identify differences.
Parameters
df1 (pd.DataFrame, required): First DataFrame.
df2 (pd.DataFrame, required): Second DataFrame.
key_column (str, required): Column to use as the key for matching rows.
compare_columns (List[str], default None): Columns to compare. If None, all common columns are compared.
console (Console, default None): Rich Console instance for logging. If None, print statements are used.
Returns
Dict[str, Any]
Dictionary with comparison results
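The structure of the returned dictionary is not documented here. As an illustration of key-based comparison only, the sketch below uses a hypothetical compare_dataframes_sketch whose result keys (only_in_df1, only_in_df2, differences) are assumptions, and it assumes the key column is unique in both frames.

```python
from typing import Any, Dict, List, Optional

import pandas as pd


def compare_dataframes_sketch(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    key_column: str,
    compare_columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for compare_dataframes (assumed result keys)."""
    if compare_columns is None:
        # Compare every non-key column present in both frames.
        compare_columns = [
            c for c in df1.columns if c in df2.columns and c != key_column
        ]

    keys1, keys2 = set(df1[key_column]), set(df2[key_column])

    # Inner merge keeps only keys present in both frames. Columns shared by
    # both frames receive the suffixes, so compare_columns must exist in both.
    merged = df1.merge(df2, on=key_column, suffixes=("_df1", "_df2"))

    differences = {}
    for col in compare_columns:
        left, right = merged[f"{col}_df1"], merged[f"{col}_df2"]
        # Two missing values count as equal.
        mismatch = ~((left == right) | (left.isna() & right.isna()))
        differences[col] = merged.loc[mismatch, key_column].tolist()

    return {
        "only_in_df1": sorted(keys1 - keys2),
        "only_in_df2": sorted(keys2 - keys1),
        "differences": differences,
    }


old = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
new = pd.DataFrame({"id": [2, 3, 4], "value": [20, 31, 40]})
result = compare_dataframes_sketch(old, new, key_column="id")
```

Here id 1 appears only in the first frame, id 4 only in the second, and id 3 is the one shared key whose value differs.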
create_summary_dataframe
utils.dataframe_utils.create_summary_dataframe(data_frames)
Create a summary DataFrame from a dictionary of DataFrames.
Parameters
data_frames (Dict[str, pd.DataFrame], required): Dictionary of DataFrames to summarize.
Returns
pd.DataFrame
Summary DataFrame with information about each dataset.
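Which fields the summary contains is not specified above. A minimal sketch of the idea, assuming one row per dataset with hypothetical columns name, rows, columns, and missing_values, might look like this:

```python
from typing import Dict

import pandas as pd


def create_summary_sketch(data_frames: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Hypothetical stand-in for create_summary_dataframe (assumed columns)."""
    rows = [
        {
            "name": name,
            "rows": len(df),
            "columns": len(df.columns),
            # Total count of missing cells across the whole frame.
            "missing_values": int(df.isna().sum().sum()),
        }
        for name, df in data_frames.items()
    ]
    # Passing explicit columns keeps the layout stable for an empty input.
    return pd.DataFrame(rows, columns=["name", "rows", "columns", "missing_values"])


summary = create_summary_sketch(
    {
        "users": pd.DataFrame({"id": [1, 2, 3], "name": ["a", None, "c"]}),
        "scores": pd.DataFrame({"id": [1, 2], "score": [10, 20]}),
    }
)
```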
diagnose_dataframe
utils.dataframe_utils.diagnose_dataframe(
    df,
    key_columns=None,
    id_columns=None,
    console=None,
)
Diagnose issues in a DataFrame.
Parameters
df (pd.DataFrame, required): DataFrame to diagnose.
key_columns (List[str], default None): Columns that should form unique keys.
id_columns (List[str], default None): Columns that should contain IDs (e.g., system IDs, account IDs).
console (Console, default None): Rich Console instance for logging. If None, print statements are used.
Returns
Dict[str, Any]
Dictionary of diagnostic results
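The checks this function performs and the keys it returns are not listed here. The sketch below shows the kind of diagnostics the module description mentions (duplicates, missing values, data types) using plain pandas; diagnose_sketch and its result keys are hypothetical.

```python
from typing import Any, Dict, List, Optional

import pandas as pd


def diagnose_sketch(
    df: pd.DataFrame,
    key_columns: Optional[List[str]] = None,
    id_columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for diagnose_dataframe (assumed result keys)."""
    results: Dict[str, Any] = {
        "row_count": len(df),
        # Only columns that actually contain missing values are reported.
        "missing_values": {
            col: int(n) for col, n in df.isna().sum().items() if n > 0
        },
    }
    if key_columns:
        # keep=False flags every row that shares its key with another row.
        results["duplicate_key_rows"] = int(
            df.duplicated(subset=key_columns, keep=False).sum()
        )
    if id_columns:
        # Report dtypes so that IDs stored as floats or objects stand out.
        results["id_dtypes"] = {col: str(df[col].dtype) for col in id_columns}
    return results


accounts = pd.DataFrame({"account_id": [1, 1, 2], "amount": [5.0, None, 7.0]})
report = diagnose_sketch(
    accounts, key_columns=["account_id"], id_columns=["account_id"]
)
```

In this example the two rows sharing account_id 1 are both counted as duplicates, and amount is reported with one missing value.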
get_attached_frames
utils.dataframe_utils.get_attached_frames(df)
Get attached data frames from a DataFrame.
Parameters
df (pd.DataFrame, required): DataFrame to get attached frames from.
Returns
Dict[str, pd.DataFrame]
Dictionary of attached frames or empty dict if none found.
has_attached_frames
utils.dataframe_utils.has_attached_frames(df)
Check if a DataFrame has attached data frames.
Parameters
df (pd.DataFrame, required): DataFrame to check.
Returns
bool
True if the DataFrame has attached frames, False otherwise.
safe_to_dataframe
utils.dataframe_utils.safe_to_dataframe(data)
Safely convert various input types to a DataFrame.
Parameters
data (Any, required): Data to convert to a DataFrame.
Returns
pd.DataFrame
Converted DataFrame or empty DataFrame if conversion fails.
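The fallback to an empty DataFrame can be pictured as a guarded call to the pd.DataFrame constructor. This is a sketch under assumptions: safe_to_dataframe_sketch is hypothetical, and which input types the real function accepts or which exceptions it handles is not stated in this reference.

```python
from typing import Any

import pandas as pd


def safe_to_dataframe_sketch(data: Any) -> pd.DataFrame:
    """Hypothetical stand-in for safe_to_dataframe (assumed behavior)."""
    if isinstance(data, pd.DataFrame):
        return data
    if data is None:
        return pd.DataFrame()
    try:
        # Accepts dicts of columns, lists of records, arrays, and similar.
        return pd.DataFrame(data)
    except (ValueError, TypeError):
        # Assumption: any failed conversion yields an empty DataFrame.
        return pd.DataFrame()


records = safe_to_dataframe_sketch([{"a": 1}, {"a": 2}])
scalar = safe_to_dataframe_sketch(42)
ragged = safe_to_dataframe_sketch({"a": [1, 2], "b": [1]})
```

The list of records converts normally, while the bare scalar and the dictionary with columns of unequal length both fall through to the empty DataFrame.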
validate_dataframe
utils.dataframe_utils.validate_dataframe(
    df,
    required_columns=None,
    column_types=None,
    non_null_columns=None,
    unique_columns=None,
    console=None,
)
Validate a DataFrame against requirements.
Parameters
df (pd.DataFrame, required): DataFrame to validate.
required_columns (List[str], default None): Required column names.
column_types (Dict[str, str], default None): Dictionary mapping column names to required types.
non_null_columns (List[str], default None): Columns that should not contain null values.
unique_columns (List[str], default None): Columns that should have unique values.
console (Console, default None): Rich Console instance for logging. If None, print statements are used.
Returns
Dict[str, Any]
Dictionary with validation results
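Each of the four optional arguments corresponds to a simple pandas check. The sketch below is illustrative only: validate_sketch is hypothetical, the valid and errors result keys are assumptions, and it assumes the type strings in column_types are pandas dtype names such as 'float64'.

```python
from typing import Any, Dict, List, Optional

import pandas as pd


def validate_sketch(
    df: pd.DataFrame,
    required_columns: Optional[List[str]] = None,
    column_types: Optional[Dict[str, str]] = None,
    non_null_columns: Optional[List[str]] = None,
    unique_columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for validate_dataframe (assumed result keys)."""
    errors: List[str] = []

    for col in required_columns or []:
        if col not in df.columns:
            errors.append(f"missing required column: {col}")

    for col, expected in (column_types or {}).items():
        # Assumption: types are compared against the dtype's string name.
        if col in df.columns and str(df[col].dtype) != expected:
            errors.append(f"column {col} has dtype {df[col].dtype}, expected {expected}")

    for col in non_null_columns or []:
        if col in df.columns and df[col].isna().any():
            errors.append(f"null values in column: {col}")

    for col in unique_columns or []:
        if col in df.columns and df[col].duplicated().any():
            errors.append(f"duplicate values in column: {col}")

    return {"valid": not errors, "errors": errors}


orders = pd.DataFrame(
    {"id": [1, 2, 2], "name": ["a", None, "c"], "amount": [1.0, 2.0, 3.0]}
)
outcome = validate_sketch(
    orders,
    required_columns=["id", "email"],
    column_types={"amount": "float64"},
    non_null_columns=["name"],
    unique_columns=["id"],
)
```

This input fails three checks (the missing email column, the null in name, and the repeated id), while the amount dtype check passes.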