lipinet package

Submodules#

lipinet.build_lipinet module#

lipinet.build_lipinet

Builds the combined LipiNet graph (nodes & edges) from parsed sources (SwissLipids, Rhea), including cross-source linking (e.g., ChEBI).

Provides:

build_lipinet_data(verbose=False, use_cache=False, force_download=False)
CLI entrypoint: python -m lipinet.build_lipinet [--use-cache] [--force-download] [--quiet] [--save]

Cache semantics:

If use_cache is True, force_download is False, and a cache for ‘lipinet’ exists, load it and return.
Otherwise build fresh; if use_cache is True, write the cache at the end.

Output (when --save given):

.data/processed/lipinet_nodes.parquet
.data/processed/lipinet_edges.parquet

lipinet.build_lipinet.build_lipinet_data(verbose=False, use_cache=False, force_download=False)#

Build combined LipiNet nodes & edges from SwissLipids and Rhea.

Return type:

dict[str, DataFrame]

Parameters:

verbose (bool)
use_cache (bool)
force_download (bool)

Cache behavior mirrors parse_* modules:

If use_cache and not force_download and cache_exists(‘lipinet’): load & return.
Else build; if use_cache, save to cache before returning.

lipinet.build_lipinet.main()#

lipinet.databases module#

lipinet.databases.clean(df, name_of_resource, verbose=False, filename=None)#

Dispatch per-resource specialized cleaning. Returns a cleaned copy; original df is not mutated.

Return type:

DataFrame

Parameters:

df (DataFrame)
name_of_resource (str)
verbose (bool)
filename (str)

lipinet.databases.download_and_load_data(filename, url, file_format='csv', compressed=False, sep=',', encoding='utf-8', header='infer', verbose=False, force_download=False)#

Checks if the specified file exists locally. If not, downloads it from the provided URL. Supports loading compressed files and handling different formats.

Parameters:

filename (str) – The name of the file to be saved within the data directory.
url (str) – The URL to download the file from if it is not found locally.
file_format (str) – The format of the file (‘json’ or ‘csv’). Defaults to ‘csv’.
compressed (bool) – If True, expects the downloaded file to be in gzip format. Defaults to False.
sep (str) – Separator to use if loading CSV/TSV data. Defaults to ‘,’.
encoding (str) – Encoding to use for reading files. Defaults to ‘utf-8’.
verbose (bool) – If True, prints additional information during the process. Defaults to False.
force_download (bool) – If True, download even if the file exists locally. Defaults to False.
header – Passed to pandas.read_csv. Use None for no header, 0 or ‘infer’ for a first-row header.

Returns:

data (DataFrame, dict, or list) – The loaded data from the file, in the format specified.
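The "download only if missing, then load" flow can be sketched with the standard library alone. The helper names fetch_if_missing and load_json, and the injectable downloader argument, are assumptions made for this illustration; they are not part of lipinet.databases, and the real function also handles CSV loading through pandas.

```python
import gzip
import json
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def fetch_if_missing(
    path: Path,
    url: str,
    force_download: bool = False,
    downloader: Callable[[str, str], object] = urlretrieve,
) -> Path:
    """Download `url` to `path` unless the file already exists locally."""
    if force_download or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        downloader(url, str(path))
    return path


def load_json(path: Path, compressed: bool = False, encoding: str = "utf-8"):
    """Read a JSON file, opening it through gzip when `compressed` is True."""
    opener = gzip.open if compressed else open
    with opener(path, "rt", encoding=encoding) as handle:
        return json.load(handle)
```

Passing the downloader as an argument is a design choice for this sketch only: it lets the existence check be exercised without network access.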

lipinet.databases.get_prior_knowledge(name_of_resource, verbose=False, force_download=False, squeeze=True)#

lipinet.parse_swisslipids module#

A standalone module that loads and processes SwissLipids data into node and edge DataFrames (df_nodes, df_edges) using lipinet.

This module provides a helper function parse_swisslipids_data that can be imported into notebooks or other scripts. A thin wrapper in the main() function allows command-line execution.

lipinet.parse_swisslipids.main()#

Thin wrapper for command-line execution.

lipinet.parse_swisslipids.parse_swisslipids_data(verbose=False, force_download=False, use_cache=False)#

Core function to process SwissLipids data and return nodes and edges dataframes.

Parameters:

verbose (bool) – If True, prints detailed output. Defaults to False.
force_download (bool) – If True, re-fetch raw data, skipping any cache.
use_cache (bool) – If True, load/save the parsed nodes & edges after first run.

Returns:

A dictionary with keys ‘df_nodes’ and ‘df_edges’.

Return type:

dict

lipinet.parse_rhea module#

lipinet.parse_rhea

A standalone module that loads and processes Rhea data into node and edge DataFrames for LipiNet. Provides a helper function parse_rhea_data and a CLI entrypoint.

lipinet.parse_rhea.build_rhea_ec_edges_and_nodes(df_ec)#

Given a DataFrame with EC hierarchy columns:

Main_Class, Subclass, Subsubclass, EC_number,

this function creates:

A DataFrame of edges linking each hierarchical level.
A DataFrame of unique nodes with an ‘ec_level’ column indicating the node’s level in the hierarchy.

Parameters:

df_ec (DataFrame)
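To illustrate the idea of linking hierarchical EC levels, here is a minimal sketch that works on plain dotted EC strings instead of a DataFrame. The function name ec_hierarchy, the use of dotted prefixes as node identifiers, and integer levels 1 to 4 are all assumptions for illustration; the identifiers and ‘ec_level’ values actually produced by build_rhea_ec_edges_and_nodes are not specified here.

```python
from typing import Dict, List, Set, Tuple


def ec_hierarchy(ec_numbers: List[str]) -> Tuple[Set[Tuple[str, str]], Dict[str, int]]:
    """Return (edges, node_levels) for dotted EC numbers such as '1.1.1.1'."""
    edges: Set[Tuple[str, str]] = set()
    levels: Dict[str, int] = {}
    for ec in ec_numbers:
        parts = ec.split(".")
        # Prefixes '1', '1.1', '1.1.1', '1.1.1.1' stand for the hierarchy levels.
        prefixes = [".".join(parts[: i + 1]) for i in range(len(parts))]
        for depth, node in enumerate(prefixes, start=1):
            levels[node] = depth
        # Link each level to the next, more specific one.
        edges.update(zip(prefixes, prefixes[1:]))
    return edges, levels
```

Using sets and a dict keeps shared ancestors (for example the main class of two different enzymes) from being duplicated, which mirrors the "unique nodes" requirement above.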

lipinet.parse_rhea.explode_columns(df, columns, delimiter=';')#

Split and explode the specified columns of a DataFrame.

Parameters:

df (pd.DataFrame) – Input DataFrame.
columns (list of str) – List of column names to split by the delimiter.
delimiter (str) – The delimiter to use when splitting the column values.

Returns:

A new DataFrame with the specified columns exploded.

Return type:

pd.DataFrame

Note

Each row in the specified columns must produce lists of the same length.
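A minimal sketch of the split-and-explode pattern, assuming pandas 1.3 or later (which supports exploding several columns at once). The function name explode_columns_sketch and the sample column names are made up for this example; the actual implementation in lipinet.parse_rhea may differ.

```python
import pandas as pd


def explode_columns_sketch(df: pd.DataFrame, columns: list, delimiter: str = ";") -> pd.DataFrame:
    """Split each listed column on `delimiter`, then explode them together."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        out[col] = out[col].str.split(delimiter)
    # Multi-column explode requires the lists in each row to have equal length.
    return out.explode(columns, ignore_index=True)


df = pd.DataFrame(
    {
        "id": ["r1", "r2"],
        "ec": ["1.1.1.1;2.2.2.2", "3.3.3.3"],
        "name": ["a;b", "c"],
    }
)
exploded = explode_columns_sketch(df, ["ec", "name"])
```

Here the first row yields two output rows and the second row yields one. If a row had, say, two values in one column and one in the other, pandas would raise a ValueError, which is the constraint described in the note above.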

lipinet.parse_rhea.main()#

lipinet.parse_rhea.parse_rhea_data(verbose=False, use_cache=False, force_download=False)#

Core function to load and process Rhea data.

Parameters:

verbose (bool) – If True, prints detailed status.
use_cache (bool) – If True, load/save processed nodes & edges.
force_download (bool) – If True, refetch raw Rhea and rebuild (ignore cache).

Returns:

{‘df_edges’: DataFrame, ‘df_nodes’: DataFrame}

Return type:

dict

lipinet.parse_rhea.process_ec_numbers(df)#

Process the ‘EC number’ column of the input DataFrame.

Parameters:

df (pd.DataFrame) – A DataFrame containing an ‘EC number’ column.

Returns:

A new DataFrame with the following columns:

‘EC_number’: The reassembled EC number in the format ‘EC:Main_Class.Subclass.Subsubclass.Serial_Number’
‘Main_Class’: The first part of the EC number.
‘Subclass’: The second part of the EC number.
‘Subsubclass’: The third part of the EC number.
‘Serial_Number’: The fourth part of the EC number.

Return type:

pd.DataFrame
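The column layout described above can be sketched as follows. This assumes that each value in ‘EC number’ is a single four-part EC number with an optional ‘EC:’ prefix; the function name split_ec_numbers is hypothetical, and the real process_ec_numbers may handle other input shapes.

```python
import pandas as pd

PARTS = ["Main_Class", "Subclass", "Subsubclass", "Serial_Number"]


def split_ec_numbers(df: pd.DataFrame) -> pd.DataFrame:
    """Split 'EC number' values into their four dotted components."""
    # Drop an optional 'EC:' prefix so every value is a bare dotted number.
    raw = df["EC number"].str.replace("EC:", "", regex=False)
    # Split on dots into at most four parts; reindex guarantees four columns.
    parts = raw.str.split(".", n=3, expand=True).reindex(columns=range(4))
    parts.columns = PARTS
    # Re-attach the prefix to form the normalized identifier column.
    parts.insert(0, "EC_number", "EC:" + raw)
    return parts


example = split_ec_numbers(pd.DataFrame({"EC number": ["EC:1.1.1.1", "2.7.1.40"]}))
```

The parts are kept as strings in this sketch, since some EC entries use non-numeric placeholders in trailing positions.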

lipinet.utils module#

lipinet.utils.cache_exists(source)#

True if both cache files exist for this source.

Return type:

bool

Parameters:

source (str)

lipinet.utils.check_for_split_characters(df, delimiter='|')#

lipinet.utils.clean_columns(df, cols=None, strip_chars=None, trim_substrings=None, lowercase=False, uppercase=False, collapse_whitespace=False, unicode_normalize=False, verbose=False, ignore_missing=True)#

Clean specified string columns in a dataframe.

Steps applied to each column:

Preserve missing values.
Optionally strip characters from both ends (defaults to whitespace).
Optionally remove any of the given trim_substrings from start or end.
Optionally lowercase or uppercase.
Optionally collapse internal multiple whitespace to single space.
(Future) Optionally normalize unicode.

Parameters:

df (DataFrame) – pandas DataFrame to clean (not modified in-place; a copy is returned).
cols (Optional[Iterable[str]]) – columns to clean; if None or empty, all columns are considered.
strip_chars (Optional[str]) – characters to strip from ends (None means default whitespace).
trim_substrings (Optional[Iterable[str]]) – substrings to strip from start/end (literal, case-sensitive).
lowercase (bool) – whether to lowercase the result.
uppercase (bool) – whether to uppercase the result.
collapse_whitespace (bool) – collapse internal runs of whitespace to a single space.
unicode_normalize (bool) – if True, apply Unicode normalization (NFC).
verbose (bool) – print before/after for samples.
ignore_missing (bool) – if False, raise if a listed column is missing; if True, skip it.

Return type:

DataFrame

Returns:

A cleaned copy of the dataframe.
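The per-value steps listed above can be sketched for a single string using only the standard library. The function name clean_value is hypothetical, and two behaviors are assumptions of this sketch: only None is treated as missing, and lowercase takes precedence if both case flags are set. The real clean_columns applies its logic column-wise to a DataFrame.

```python
import re
import unicodedata
from typing import Iterable, Optional


def clean_value(
    value: Optional[str],
    strip_chars: Optional[str] = None,
    trim_substrings: Optional[Iterable[str]] = None,
    lowercase: bool = False,
    uppercase: bool = False,
    collapse_whitespace: bool = False,
    unicode_normalize: bool = False,
) -> Optional[str]:
    # 1. Preserve missing values (only None counts as missing in this sketch).
    if value is None:
        return None
    text = str(value)
    # 2. Strip characters from both ends (None means whitespace).
    text = text.strip(strip_chars)
    # 3. Remove literal, case-sensitive substrings from the start or end.
    for sub in trim_substrings or ():
        if sub and text.startswith(sub):
            text = text[len(sub):]
        if sub and text.endswith(sub):
            text = text[: -len(sub)]
    # 4. Optional case change (assumed precedence: lowercase wins).
    if lowercase:
        text = text.lower()
    elif uppercase:
        text = text.upper()
    # 5. Collapse internal runs of whitespace to a single space.
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text)
    # 6. Optional Unicode normalization to NFC.
    if unicode_normalize:
        text = unicodedata.normalize("NFC", text)
    return text
```

For example, calling clean_value("  [Foo   Bar]  ", trim_substrings=["[", "]"], lowercase=True, collapse_whitespace=True) strips the outer whitespace, removes the brackets, lowercases, and collapses the inner spaces.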

lipinet.utils.clean_missing_strings(df, cols=None, string_fraction_threshold=0.9)#

Strip whitespace from stringy values and normalize common placeholder “missing” strings into real pandas NA. Operates on specified columns or all object/string columns by default.

Parameters:

df (DataFrame) – input DataFrame (modified in-place).
cols – list of columns to process; if None, uses all object/string dtype columns.
string_fraction_threshold – for object dtype columns, if >= this fraction of non-null values are str, coerce to StringDtype and vectorize the strip; otherwise do per-element.

Return type:

DataFrame
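A minimal sketch of the vectorized path for one column, assuming pandas with the nullable string dtype. The MISSING list is a hypothetical set of placeholders chosen for illustration; the strings the package actually treats as missing are not listed in this reference, and the per-element fallback for mixed-type columns is not shown.

```python
import pandas as pd

# Hypothetical placeholder strings; the package's own list may differ.
MISSING = ["", "na", "n/a", "nan", "none", "null", "-"]


def normalize_missing(series: pd.Series) -> pd.Series:
    """Strip whitespace and turn placeholder strings into pandas NA."""
    # Coerce to StringDtype so the strip is vectorized and None/NaN become NA.
    s = series.astype("string").str.strip()
    # Mask (set to NA) every value whose lowercase form is a known placeholder.
    return s.mask(s.str.lower().isin(MISSING))


cleaned = normalize_missing(pd.Series(["  lipid ", "N/A", "", None, "nan", "PC(16:0)"]))
```

Unlike the documented function, this sketch returns a new Series instead of modifying its input in place.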

lipinet.utils.create_nodedf_from_edgedf(edge_df, props=['layer', 'id'], cols=['layer', 'node_id'])#

lipinet.utils.load_cache(source)#

Load pickled nodes & edges; KeyError if missing.

Return type:

Dict[str, DataFrame]

Parameters:

source (str)

lipinet.utils.save_cache(source, df_nodes, df_edges)#

Pickle out nodes & edges DataFrames for this source.

Return type:

None

Parameters:

source (str)
df_nodes (DataFrame)
df_edges (DataFrame)
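The three cache helpers fit together roughly as sketched below. Everything specific here is an assumption for illustration: the *_sketch names, the explicit cache_dir argument, and the "{source}_nodes.pkl" / "{source}_edges.pkl" naming scheme are not taken from lipinet.utils, which does not document where or under what names the pickles are stored.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


def _paths(cache_dir: Path, source: str) -> Dict[str, Path]:
    # Hypothetical naming scheme: one pickle file per table.
    return {kind: cache_dir / f"{source}_{kind}.pkl" for kind in ("nodes", "edges")}


def cache_exists_sketch(cache_dir: Path, source: str) -> bool:
    """True only if both the nodes and the edges file are present."""
    return all(path.exists() for path in _paths(cache_dir, source).values())


def save_cache_sketch(cache_dir: Path, source: str, df_nodes: Any, df_edges: Any) -> None:
    """Pickle the two tables into the cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = _paths(cache_dir, source)
    for path, obj in ((paths["nodes"], df_nodes), (paths["edges"], df_edges)):
        with open(path, "wb") as handle:
            pickle.dump(obj, handle)


def load_cache_sketch(cache_dir: Path, source: str) -> Dict[str, Any]:
    """Load both tables; raise KeyError if either file is missing."""
    if not cache_exists_sketch(cache_dir, source):
        raise KeyError(source)
    loaded = {}
    for kind, path in _paths(cache_dir, source).items():
        with open(path, "rb") as handle:
            loaded[f"df_{kind}"] = pickle.load(handle)
    return loaded
```

Requiring both files in the existence check avoids returning a half-written cache if a previous save was interrupted after the first file.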

lipinet.utils.split_and_expand_large(df, split_col, delimiter, expand_cols)#

Splits a column by a delimiter and expands specified columns for large DataFrames, handling None/NaN values.

Parameters:

df (pd.DataFrame) – The original DataFrame.
split_col (str) – The name of the column to split.
delimiter (str) – The delimiter to split the column by.
expand_cols (list) – List of column names to be expanded with the split column.

Returns:

A new DataFrame with the split and expanded rows.

Return type:

pd.DataFrame
