Build LipiNet

LipiNet Build Walkthrough (SwissLipids + Rhea)

This notebook shows, step by step, how lipinet.build_lipinet constructs the combined LipiNet node and edge tables, then walks through the pieces behind it: loading and parsing each source, linking ChEBI IDs across sources, and merging the node and edge tables.

Prerequisites: the lipinet package (your local checkout of the repo) must be importable in this environment.
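A quick way to confirm this before running anything else is to ask Python whether it can locate the package. The sketch below uses only the standard library; is_importable is a helper name introduced here for illustration and is not part of lipinet.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module, without importing it."""
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


# 'lipinet' is the package this notebook needs; the result depends on your environment
print("lipinet importable:", is_importable("lipinet"))
```

If this prints False, add the repo root to PYTHONPATH (or install the package into the environment) before continuing.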

# --- Runtime toggles ---
USE_CACHE = True          # load/save processed caches
FORCE_DOWNLOAD = False    # force fresh raw downloads & rebuild
VERBOSE = True            # print debug info

# Tip: run once with USE_CACHE=True to populate the caches, then set VERBOSE=False for quiet reruns.

import sys
from pathlib import Path
import pandas as pd

print("Python:", sys.version.split()[0])

# Try to import lipinet and the key modules used in this walkthrough
try:
    import lipinet
    from lipinet.build_lipinet import build_lipinet_data, _link_chebi_edges, _join_node_dfs, _join_edge_dfs
    from lipinet.parse_swisslipids import parse_swisslipids_data
    from lipinet.parse_rhea import parse_rhea_data
    from lipinet.utils import cache_exists, load_cache, save_cache
    print("Imported lipinet OK.")
except Exception:
    # Report the likely cause, then re-raise so the original traceback is kept
    print("❗ Could not import 'lipinet'. Make sure your repo is on PYTHONPATH.")
    raise

Python: 3.12.4
Imported lipinet OK.

1) Cache status

sources = ["swisslipids", "rhea", "lipinet"]
for s in sources:
    try:
        print(f"{s:12s} cache:", "present" if cache_exists(s) else "missing")
    except Exception as e:
        print(f"{s:12s} cache: (could not check) ->", e)

swisslipids  cache: present
rhea         cache: present
lipinet      cache: present
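The USE_CACHE and FORCE_DOWNLOAD toggles interact: a cache is only read when caching is enabled and no fresh download is forced. The following is a minimal sketch of that decision logic, assuming a simple pickle file per source. load_or_build, its parameters, and the file layout are hypothetical and are not lipinet's actual cache implementation.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(name: str, builder: Callable[[], Any], cache_dir: Path,
                  use_cache: bool = True, force_download: bool = False) -> Any:
    """Return cached data when allowed; otherwise rebuild and (optionally) re-cache."""
    cache_path = cache_dir / f"{name}.pkl"  # hypothetical cache file layout
    if use_cache and not force_download and cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    data = builder()  # stands in for "download raw files and parse them"
    if use_cache:
        cache_dir.mkdir(parents=True, exist_ok=True)
        with cache_path.open("wb") as fh:
            pickle.dump(data, fh)
    return data


build_calls = []


def toy_builder():
    """Toy stand-in for a parser; records how often a rebuild happens."""
    build_calls.append(1)
    return {"df_nodes": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    first = load_or_build("toy", toy_builder, cache_dir)    # cache missing: builds and saves
    second = load_or_build("toy", toy_builder, cache_dir)   # cache present: loads from disk
    third = load_or_build("toy", toy_builder, cache_dir, force_download=True)  # rebuilds
    n_builds = len(build_calls)

print("builder ran", n_builds, "times")
```

In this sketch force_download wins over an existing cache, which matches the comment on the FORCE_DOWNLOAD toggle above ("force fresh raw downloads & rebuild").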

2) Build the combined LipiNet (public API)

lipinet_data = build_lipinet_data(verbose=VERBOSE, use_cache=USE_CACHE, force_download=FORCE_DOWNLOAD)
df_nodes = lipinet_data["df_nodes"].copy()
df_edges = lipinet_data["df_edges"].copy()

print("Nodes:", df_nodes.shape, "Edges:", df_edges.shape)
display(df_nodes.head(3))
display(df_edges.head(3))

# Quick layer counts
print("\nNode counts by layer:")
display(df_nodes["layer"].value_counts().to_frame("count"))
print("\nInterlayer edge count:", int(df_edges["interlayer"].sum()) if "interlayer" in df_edges else "N/A")
print("Edge counts by (source_layer, target_layer):")
display(df_edges.groupby(["source_layer","target_layer"]).size().sort_values(ascending=False).to_frame("edges"))

↪ Loading LipiNet (combined) from cache
Nodes: (2817072, 41) Edges: (7002161, 8)

	node_id	layer	origin_vertex	rhea_Equation	rhea_ChEBI identifier	rhea_chebi_name	rhea_EC number	rhea_Enzymes	rhea_Gene Ontology	rhea_Cross-reference (Reactome)	...	sl_Exact m/z of [M+NH4]+	sl_Exact m/z of [M-H]-	sl_Exact m/z of [M+Cl]-	sl_Exact m/z of [M+OAc]-	sl_CHEBI	sl_LIPID MAPS	sl_HMDB	sl_MetaNetX	sl_PMID	sl_Components_parsed
0	RHEA:21252	rhea_reactionid	rhea	(S)-2-hydroxyglutarate + A = 2-oxoglutarate + AH2	CHEBI:16782;CHEBI:13193;CHEBI:16810;CHEBI:17499	(S)-2-hydroxyglutarate;A;2-oxoglutarate;AH2	EC:1.1.99.2	4258.0	GO:0047545 2-hydroxyglutarate dehydrogenase ac...	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	RHEA:21256	rhea_reactionid	rhea	3-phosphoshikimate + phosphoenolpyruvate = 5-O...	CHEBI:145989;CHEBI:58702;CHEBI:57701;CHEBI:43474	3-phosphoshikimate;phosphoenolpyruvate;5-O-(1-...	EC:2.5.1.19	44340.0	GO:0003866 3-phosphoshikimate 1-carboxyvinyltr...	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	RHEA:21260	rhea_reactionid	rhea	[thioredoxin]-disulfide + L-methionine + H2O =...	CHEBI:50058;CHEBI:57844;CHEBI:15377;CHEBI:5877...	L-cystine residue;L-methionine;H2O;L-methionin...	EC:1.8.4.14	3112.0	GO:0033745 L-methionine-(R)-S-oxide reductase ...	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

3 rows × 41 columns

	source_layer	source_id	target_layer	target_id	interlayer	edge_type	origin_edge	ec_level
0	rhea_ec	EC:1	rhea_ec	EC:1.1	False	ec_hierarchy	rhea	main_class->subclass
1	rhea_ec	EC:2	rhea_ec	EC:2.5	False	ec_hierarchy	rhea	main_class->subclass
2	rhea_ec	EC:1	rhea_ec	EC:1.8	False	ec_hierarchy	rhea	main_class->subclass

Node counts by layer:

	count
layer
swisslipids	779312
sl_abbreviation	736949
sl_synonyms	534781
sl_metanetx	504880
sl_parent	184620
rhea_reactionid	17783
sl_hmdb	17232
rhea_chebiid	13723
sl_lipidmaps	12112
rhea_ec	6489
sl_chebi	4277
sl_components	1708
sl_components_parsed	1677
sl_pmid	1529

Interlayer edge count: 6114475
Edge counts by (source_layer, target_layer):

		edges
source_layer	target_layer
swisslipids	sl_components	1852844
	sl_components_parsed	1852844
	sl_abbreviation	786750
	swisslipids	779247
	sl_synonyms	568257
	sl_metanetx	505003
	sl_parent	493491
rhea_reactionid	rhea_chebiid	83885
swisslipids	sl_hmdb	26026
rhea_reactionid	rhea_ec	18072
swisslipids	sl_lipidmaps	12117
swisslipids	sl_pmid	10109
rhea_ec	rhea_ec	6482
swisslipids	sl_chebi	4278
sl_chebi	rhea_chebiid	2756
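Adding up the table above gives a useful cross-check. The rows whose source and target layers differ sum to 6,216,432 edges, while the reported interlayer count is 6,114,475. The difference, 101,957, is exactly the two Rhea rows rhea_reactionid → rhea_chebiid (83,885) and rhea_reactionid → rhea_ec (18,072), which suggests those edges are not flagged as interlayer. Whether that is intended is not stated here. The sketch below shows one way to surface such rows by cross-tabulating the flag against a layer comparison; it runs on a small toy frame with made-up values, not on the real tables.

```python
import pandas as pd

# Toy edges using the same column names as df_edges; values are illustrative only
toy_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids", "rhea_reactionid", "rhea_ec"],
    "target_layer": ["sl_chebi", "swisslipids", "rhea_chebiid", "rhea_ec"],
    "interlayer": [True, None, False, False],
})

# Treat anything that is not literally True (including missing values) as "not flagged"
flagged = toy_edges["interlayer"].eq(True)
layers_differ = toy_edges["source_layer"] != toy_edges["target_layer"]

# 2x2 table: rows = flag value, columns = whether the two layers actually differ
table = pd.crosstab(flagged.rename("flagged_interlayer"),
                    layers_differ.rename("layers_differ"))
print(table)

# Rows where the flag and the layer comparison disagree
mismatch = toy_edges[flagged != layers_differ]
print(mismatch)
```

On the real data, the same two lines can be applied to df_edges directly to list the layer pairs where the flag and the layer comparison disagree.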

3) Behind the scenes: parse each source

sl = parse_swisslipids_data(verbose=VERBOSE, use_cache=USE_CACHE, force_download=FORCE_DOWNLOAD)
rhea = parse_rhea_data(verbose=VERBOSE, use_cache=USE_CACHE, force_download=FORCE_DOWNLOAD)

df_sl_nodes = sl["df_nodes"].copy()
df_sl_edges = sl["df_edges"].copy()
df_rhea_nodes = rhea["df_nodes"].copy()
df_rhea_edges = rhea["df_edges"].copy()

print("SwissLipids:", df_sl_nodes.shape, "nodes /", df_sl_edges.shape, "edges")
print("Rhea:       ", df_rhea_nodes.shape, "nodes /", df_rhea_edges.shape, "edges")

↪ Loading SwissLipids cache
↪ Loading Rhea (processed) from cache
SwissLipids: (2779077, 32) nodes / (6890966, 5) edges
Rhea:        (37995, 10) nodes / (108439, 7) edges

3.1) Layers present

print("SwissLipids layers:")
display(df_sl_nodes["layer"].value_counts().to_frame("count"))

print("\nRhea layers:")
display(df_rhea_nodes["layer"].value_counts().to_frame("count"))

SwissLipids layers:

	count
layer
swisslipids	779312
sl_abbreviation	736949
sl_synonyms	534781
sl_metanetx	504880
sl_parent	184620
sl_hmdb	17232
sl_lipidmaps	12112
sl_chebi	4277
sl_components	1708
sl_components_parsed	1677
sl_pmid	1529

Rhea layers:

	count
layer
rhea_reactionid	17783
rhea_chebiid	13723
rhea_ec	6489
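The rhea_ec layer holds EC numbers at several depths, and the combined edge table in section 2 contains ec_hierarchy edges such as EC:1 → EC:1.1. One way such parent-to-child pairs could be derived from a full EC number is to trim the number one field at a time. This is a sketch under that assumption: ec_parent_child_pairs is a name introduced here, partial EC numbers containing "-" are not handled, and the Rhea parser in lipinet may build the hierarchy differently.

```python
def ec_parent_child_pairs(ec_number: str) -> list[tuple[str, str]]:
    """Split e.g. 'EC:1.1.99.2' into consecutive (parent, child) pairs."""
    prefix, _, digits = ec_number.partition(":")
    fields = digits.split(".")
    pairs = []
    for depth in range(1, len(fields)):
        parent = f"{prefix}:" + ".".join(fields[:depth])
        child = f"{prefix}:" + ".".join(fields[:depth + 1])
        pairs.append((parent, child))
    return pairs


pairs = ec_parent_child_pairs("EC:1.1.99.2")
print(pairs)
# [('EC:1', 'EC:1.1'), ('EC:1.1', 'EC:1.1.99'), ('EC:1.1.99', 'EC:1.1.99.2')]
```

The first pair corresponds to the main_class->subclass level shown in the ec_level column.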

4) Cross-source linking: ChEBI (SwissLipids ↔ Rhea)

df_chebi_linked = _link_chebi_edges(df_sl_nodes, df_rhea_nodes, verbose=VERBOSE)
print("Linked ChEBI interlayer edges:", df_chebi_linked.shape)
display(df_chebi_linked.head(10))

# Sanity: a few random examples
print("\nSample mapped CHEBI IDs (SwissLipids → Rhea):")
display(df_chebi_linked.sample(min(10, len(df_chebi_linked)), random_state=0) if len(df_chebi_linked) else df_chebi_linked)

Linked ChEBI edges: 2756
Linked ChEBI interlayer edges: (2756, 7)

	source_layer	source_id	target_layer	target_id	interlayer	edge_type	origin_edge
0	sl_chebi	10036	rhea_chebiid	CHEBI:10036	True	same_id_chebi	lipinet
1	sl_chebi	10362	rhea_chebiid	CHEBI:10362	True	same_id_chebi	lipinet
2	sl_chebi	11152	rhea_chebiid	CHEBI:11152	True	same_id_chebi	lipinet
3	sl_chebi	1156	rhea_chebiid	CHEBI:1156	True	same_id_chebi	lipinet
4	sl_chebi	116314	rhea_chebiid	CHEBI:116314	True	same_id_chebi	lipinet
5	sl_chebi	11641	rhea_chebiid	CHEBI:11641	True	same_id_chebi	lipinet
6	sl_chebi	1178	rhea_chebiid	CHEBI:1178	True	same_id_chebi	lipinet
7	sl_chebi	11867	rhea_chebiid	CHEBI:11867	True	same_id_chebi	lipinet
8	sl_chebi	1189	rhea_chebiid	CHEBI:1189	True	same_id_chebi	lipinet
9	sl_chebi	11893	rhea_chebiid	CHEBI:11893	True	same_id_chebi	lipinet

Sample mapped CHEBI IDs (SwissLipids → Rhea):

	source_layer	source_id	target_layer	target_id	interlayer	edge_type	origin_edge
352	sl_chebi	138100	rhea_chebiid	CHEBI:138100	True	same_id_chebi	lipinet
855	sl_chebi	48946	rhea_chebiid	CHEBI:48946	True	same_id_chebi	lipinet
883	sl_chebi	52639	rhea_chebiid	CHEBI:52639	True	same_id_chebi	lipinet
1801	sl_chebi	76591	rhea_chebiid	CHEBI:76591	True	same_id_chebi	lipinet
1774	sl_chebi	76475	rhea_chebiid	CHEBI:76475	True	same_id_chebi	lipinet
1695	sl_chebi	76291	rhea_chebiid	CHEBI:76291	True	same_id_chebi	lipinet
1604	sl_chebi	75587	rhea_chebiid	CHEBI:75587	True	same_id_chebi	lipinet
1087	sl_chebi	62243	rhea_chebiid	CHEBI:62243	True	same_id_chebi	lipinet
396	sl_chebi	138569	rhea_chebiid	CHEBI:138569	True	same_id_chebi	lipinet
1255	sl_chebi	71567	rhea_chebiid	CHEBI:71567	True	same_id_chebi	lipinet
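In the output above, the sl_chebi IDs are bare numbers (10036) while the rhea_chebiid IDs carry a CHEBI: prefix (CHEBI:10036), so the link amounts to normalizing one side and matching on equality. The sketch below illustrates that idea on toy node tables. It is an assumption about the approach, not the actual body of _link_chebi_edges, and link_chebi_sketch and the toy IDs are made up for the example.

```python
import pandas as pd

# Toy node tables with the same node_id/layer columns as the real ones
sl_nodes = pd.DataFrame({
    "node_id": ["10036", "99999", "SLM:000000001"],
    "layer": ["sl_chebi", "sl_chebi", "swisslipids"],
})
rhea_nodes = pd.DataFrame({
    "node_id": ["CHEBI:10036", "CHEBI:15377", "RHEA:21252"],
    "layer": ["rhea_chebiid", "rhea_chebiid", "rhea_reactionid"],
})


def link_chebi_sketch(sl_nodes: pd.DataFrame, rhea_nodes: pd.DataFrame) -> pd.DataFrame:
    """Match bare SwissLipids ChEBI numbers to prefixed Rhea ChEBI IDs."""
    sl_ids = sl_nodes.loc[sl_nodes["layer"] == "sl_chebi", "node_id"].astype(str)
    left = pd.DataFrame({"source_id": sl_ids, "target_id": "CHEBI:" + sl_ids})
    rhea_ids = (rhea_nodes.loc[rhea_nodes["layer"] == "rhea_chebiid", ["node_id"]]
                .drop_duplicates())
    # Inner merge keeps only IDs present on both sides
    linked = left.merge(rhea_ids, left_on="target_id", right_on="node_id", how="inner")
    return pd.DataFrame({
        "source_layer": "sl_chebi",
        "source_id": linked["source_id"],
        "target_layer": "rhea_chebiid",
        "target_id": linked["target_id"],
        "interlayer": True,
        "edge_type": "same_id_chebi",
        "origin_edge": "lipinet",
    })


toy_links = link_chebi_sketch(sl_nodes, rhea_nodes)
print(toy_links)
```

Only 10036 appears on both sides of the toy data, so a single edge is produced; 99999 has no Rhea counterpart and is dropped by the inner merge.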

5) Merging node frames (origin tagging & prefixing unique columns)

df_nodes_joined = _join_node_dfs(df_sl_nodes, df_rhea_nodes)
print("Joined nodes:", df_nodes_joined.shape)
display(df_nodes_joined.head(3))
print("Column order (shared first, then prefixed):")
print(list(df_nodes_joined.columns)[:15], "...")

Joined nodes: (2817072, 41)

	node_id	layer	origin_vertex	rhea_Equation	rhea_ChEBI identifier	rhea_chebi_name	rhea_EC number	rhea_Enzymes	rhea_Gene Ontology	rhea_Cross-reference (Reactome)	...	sl_Exact m/z of [M+NH4]+	sl_Exact m/z of [M-H]-	sl_Exact m/z of [M+Cl]-	sl_Exact m/z of [M+OAc]-	sl_CHEBI	sl_LIPID MAPS	sl_HMDB	sl_MetaNetX	sl_PMID	sl_Components_parsed
0	RHEA:21252	rhea_reactionid	rhea	(S)-2-hydroxyglutarate + A = 2-oxoglutarate + AH2	CHEBI:16782;CHEBI:13193;CHEBI:16810;CHEBI:17499	(S)-2-hydroxyglutarate;A;2-oxoglutarate;AH2	EC:1.1.99.2	4258.0	GO:0047545 2-hydroxyglutarate dehydrogenase ac...	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	RHEA:21256	rhea_reactionid	rhea	3-phosphoshikimate + phosphoenolpyruvate = 5-O...	CHEBI:145989;CHEBI:58702;CHEBI:57701;CHEBI:43474	3-phosphoshikimate;phosphoenolpyruvate;5-O-(1-...	EC:2.5.1.19	44340.0	GO:0003866 3-phosphoshikimate 1-carboxyvinyltr...	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	RHEA:21260	rhea_reactionid	rhea	[thioredoxin]-disulfide + L-methionine + H2O =...	CHEBI:50058;CHEBI:57844;CHEBI:15377;CHEBI:5877...	L-cystine residue;L-methionine;H2O;L-methionin...	EC:1.8.4.14	3112.0	GO:0033745 L-methionine-(R)-S-oxide reductase ...	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

3 rows × 41 columns

Column order (shared first, then prefixed):
['node_id', 'layer', 'origin_vertex', 'rhea_Equation', 'rhea_ChEBI identifier', 'rhea_chebi_name', 'rhea_EC number', 'rhea_Enzymes', 'rhea_Gene Ontology', 'rhea_Cross-reference (Reactome)', 'rhea_ec_level', 'sl_Lipid ID', 'sl_Level', 'sl_Name', 'sl_Abbreviation*'] ...
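The column list shows the pattern: three shared columns come first, followed by source-specific columns carrying an rhea_ or sl_ prefix. A minimal sketch of how such a join could be written with pandas follows. join_nodes_sketch, the toy frames, and the origin_vertex value used for SwissLipids are assumptions for illustration; _join_node_dfs itself may differ.

```python
import pandas as pd

SHARED = ["node_id", "layer", "origin_vertex"]  # columns common to every source


def join_nodes_sketch(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Prefix source-specific columns, stack the frames, and put shared columns first."""
    tagged = []
    for prefix, df in frames.items():
        rename_map = {c: f"{prefix}_{c}" for c in df.columns if c not in SHARED}
        tagged.append(df.rename(columns=rename_map))
    joined = pd.concat(tagged, ignore_index=True, sort=False)
    rest = [c for c in joined.columns if c not in SHARED]
    return joined[SHARED + rest]


# Toy inputs; values are illustrative only
toy_rhea = pd.DataFrame({
    "node_id": ["RHEA:21252"], "layer": ["rhea_reactionid"],
    "origin_vertex": ["rhea"], "Equation": ["A = B"],
})
toy_sl = pd.DataFrame({
    "node_id": ["SLM:000000001"], "layer": ["swisslipids"],
    "origin_vertex": ["swisslipids"], "Name": ["toy lipid"],
})

toy_joined = join_nodes_sketch({"rhea": toy_rhea, "sl": toy_sl})
print(list(toy_joined.columns))
print(toy_joined)
```

Columns that exist in only one source are filled with NaN for rows from the other source, which is why the head of the real joined table shows NaN in every sl_ column for Rhea rows.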

6) Merging edge frames (source edges + interlayer links)

df_edges_joined = _join_edge_dfs(df_sl_edges, df_rhea_edges, df_chebi_linked)
print("Joined edges:", df_edges_joined.shape)
display(df_edges_joined.head(5))

print("\nInterlayer edges in merged set:", int(df_edges_joined["interlayer"].sum()) if "interlayer" in df_edges_joined else "N/A")

Joined edges: (7002161, 8)

	source_layer	source_id	target_layer	target_id	interlayer	edge_type	origin_edge	ec_level
0	rhea_ec	EC:1	rhea_ec	EC:1.1	False	ec_hierarchy	rhea	main_class->subclass
1	rhea_ec	EC:2	rhea_ec	EC:2.5	False	ec_hierarchy	rhea	main_class->subclass
2	rhea_ec	EC:1	rhea_ec	EC:1.8	False	ec_hierarchy	rhea	main_class->subclass
3	rhea_ec	EC:1	rhea_ec	EC:1.5	False	ec_hierarchy	rhea	main_class->subclass
4	rhea_ec	EC:6	rhea_ec	EC:6.3	False	ec_hierarchy	rhea	main_class->subclass

Interlayer edges in merged set: 6114475
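The shapes printed so far are consistent with both joins being row-wise concatenations with no rows dropped or deduplicated: the node counts of the two sources add up to the joined node count, and the two edge tables plus the ChEBI links add up to the joined edge count. This is an inference from the reported numbers, not a statement about how the join functions are written. The arithmetic, using the row counts copied from the outputs above:

```python
# Row counts copied from the printed shapes in sections 3 to 6
reported = {
    "sl_nodes": 2_779_077,
    "rhea_nodes": 37_995,
    "joined_nodes": 2_817_072,
    "sl_edges": 6_890_966,
    "rhea_edges": 108_439,
    "chebi_links": 2_756,
    "joined_edges": 7_002_161,
}

nodes_ok = reported["sl_nodes"] + reported["rhea_nodes"] == reported["joined_nodes"]
edges_ok = (reported["sl_edges"] + reported["rhea_edges"] + reported["chebi_links"]
            == reported["joined_edges"])

print("node counts reconcile:", nodes_ok)
print("edge counts reconcile:", edges_ok)
```

In the notebook itself the literals can be replaced with len(df_sl_nodes), len(df_rhea_nodes), and so on, turning this into a cheap regression check after each rebuild.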

7) Quick QC: orphans & coverage by layer

def qc_coverage(df_nodes, df_edges, layer_name):
    n = df_nodes.query("layer == @layer_name")["node_id"].nunique()
    E = df_edges[(df_edges.source_layer==layer_name) | (df_edges.target_layer==layer_name)]
    touched = pd.unique(pd.concat([
        E.loc[E.source_layer==layer_name, "source_id"],
        E.loc[E.target_layer==layer_name, "target_id"]
    ]))
    touched = set(map(str, touched))
    return {"nodes": n, "touched": len(touched), "orphans": n - len(touched)}

layers_to_check = ["sl_chebi", "rhea_chebiid", "rhea_reactionid", "rhea_ec"]
qc = {L: qc_coverage(df_nodes, df_edges, L) for L in layers_to_check if L in df_nodes["layer"].unique()}
pd.DataFrame(qc).T

	nodes	touched	orphans
sl_chebi	4277	4277	0
rhea_chebiid	13723	13723	0
rhea_reactionid	17783	17783	0
rhea_ec	6489	6490	-1

df_edges[df_edges["target_layer"]=="rhea_ec"]["target_id"].isna().sum()
# and/or
# df_edges.query("target_layer=='rhea_ec' and target_id.isna()").head()

np.int64(10214)

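Note that the orphan count of -1 for rhea_ec is most likely caused by NaNs not being filtered out: the 10,214 edges into rhea_ec with a missing target_id collapse into a single extra "nan" entry in the touched set, so touched (6490) exceeds nodes (6489) by one. We should later correct _join_edge_dfs to handle this by sanitizing the IDs.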
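Until that is fixed upstream, the QC itself can be made robust. The variant below drops missing IDs before counting and compares ID sets directly instead of subtracting set sizes, so the orphan count cannot go negative. qc_coverage_nan_safe and the extra unknown_endpoints field are additions for illustration, shown here on a toy frame.

```python
import pandas as pd


def qc_coverage_nan_safe(df_nodes, df_edges, layer_name):
    """Count nodes in a layer, how many appear in any edge, and how many do not."""
    node_ids = set(
        df_nodes.loc[df_nodes["layer"] == layer_name, "node_id"].dropna().astype(str)
    )
    endpoints = pd.concat([
        df_edges.loc[df_edges["source_layer"] == layer_name, "source_id"],
        df_edges.loc[df_edges["target_layer"] == layer_name, "target_id"],
    ]).dropna().astype(str)  # drop NaN IDs before they turn into the string "nan"
    touched = set(endpoints)
    return {
        "nodes": len(node_ids),
        "touched": len(touched & node_ids),
        "orphans": len(node_ids - touched),
        "unknown_endpoints": len(touched - node_ids),  # edge IDs with no matching node
    }


# Toy data: EC:2 has no edges, and one edge into rhea_ec has a missing target_id
toy_nodes = pd.DataFrame({
    "node_id": ["EC:1", "EC:1.1", "EC:2"],
    "layer": ["rhea_ec", "rhea_ec", "rhea_ec"],
})
toy_edges = pd.DataFrame({
    "source_layer": ["rhea_ec", "rhea_reactionid"],
    "source_id": ["EC:1", "RHEA:1"],
    "target_layer": ["rhea_ec", "rhea_ec"],
    "target_id": ["EC:1.1", None],
})

toy_qc = qc_coverage_nan_safe(toy_nodes, toy_edges, "rhea_ec")
print(toy_qc)
```

With the missing target_id dropped, the toy result reports one genuine orphan (EC:2) and no unknown endpoints.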

8) Optional: Write outputs to `.data/processed/`

# from pathlib import Path
# DATA_PROCESSED = Path(__import__("lipinet").__file__).parent / ".data" / "processed"
# DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
# nodes_path = DATA_PROCESSED / "lipinet_nodes.parquet"
# edges_path = DATA_PROCESSED / "lipinet_edges.parquet"

# df_nodes.to_parquet(nodes_path, index=False)
# df_edges.to_parquet(edges_path, index=False)
# print("Wrote:", nodes_path)
# print("Wrote:", edges_path)

Where to go next

Check out the explore_lipinet.ipynb notebook to get an idea of how you can analyse the resulting LipiNet in practice.