Build LipiNet

LipiNet Build Walkthrough (SwissLipids + Rhea)

This notebook walks through, step by step, how build_lipinet_data (in lipinet.build_lipinet) constructs the combined LipiNet node and edge tables. It also shows the pieces behind that call: loading and parsing each source, linking ChEBI IDs across sources, and merging the node and edge tables.

Prerequisite: the lipinet package (your clone of the repo) must be importable in this environment.

# --- Runtime toggles ---
USE_CACHE = True          # load/save processed caches
FORCE_DOWNLOAD = False    # force fresh raw downloads & rebuild
VERBOSE = True            # print debug info

# Tip: run once with USE_CACHE=True to populate the caches, then set VERBOSE=False for quiet re-runs.
import sys
import pandas as pd

print("Python:", sys.version.split()[0])

# Try to import lipinet and the key modules used in this walkthrough
try:
    import lipinet
    from lipinet.build_lipinet import build_lipinet_data, _link_chebi_edges, _join_node_dfs, _join_edge_dfs
    from lipinet.parse_swisslipids import parse_swisslipids_data
    from lipinet.parse_rhea import parse_rhea_data
    from lipinet.utils import cache_exists, load_cache, save_cache
    print("Imported lipinet OK.")
except Exception:
    # Report the likely cause, then re-raise so the full traceback is still shown
    print("❗ Could not import 'lipinet'. Make sure your repo is on PYTHONPATH.")
    raise
Python: 3.12.4
Imported lipinet OK.

1) Cache status

sources = ["swisslipids", "rhea", "lipinet"]
for s in sources:
    try:
        print(f"{s:12s} cache:", "present" if cache_exists(s) else "missing")
    except Exception as e:
        print(f"{s:12s} cache: (could not check) ->", e)
swisslipids  cache: present
rhea         cache: present
lipinet      cache: present
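The USE_CACHE and FORCE_DOWNLOAD toggles interact, so it can help to see one plausible decision order spelled out. The sketch below is a guess at that order, not lipinet's actual implementation: load_or_build, its in-memory cache dict, and the build callback are hypothetical stand-ins for cache_exists, load_cache, save_cache and the real parsers.

```python
from typing import Any, Callable, Dict


def load_or_build(
    name: str,
    build: Callable[[], Any],
    cache: Dict[str, Any],
    use_cache: bool = True,
    force_download: bool = False,
) -> Any:
    """Return data for `name`, preferring the cache unless a rebuild is forced.

    Hypothetical helper: `cache` is a plain dict standing in for on-disk caches.
    """
    # A cached copy is only used when caching is on and no rebuild is forced.
    if use_cache and not force_download and name in cache:
        return cache[name]

    # Otherwise rebuild from the raw source.
    data = build()

    # Store the result so that the next run can skip the rebuild.
    if use_cache:
        cache[name] = data
    return data


calls = []


def fake_build():
    """Stand-in for an expensive download-and-parse step."""
    calls.append("build")
    return {"df_nodes": [], "df_edges": []}


cache = {}
load_or_build("rhea", fake_build, cache)                       # builds, then caches
load_or_build("rhea", fake_build, cache)                       # served from the cache
load_or_build("rhea", fake_build, cache, force_download=True)  # forced rebuild
print("build calls:", len(calls))
```

Under this reading, a "present" status above only matters when USE_CACHE is True and FORCE_DOWNLOAD is False.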

2) Build the combined LipiNet (public API)

lipinet_data = build_lipinet_data(verbose=VERBOSE, use_cache=USE_CACHE, force_download=FORCE_DOWNLOAD)
df_nodes = lipinet_data["df_nodes"].copy()
df_edges = lipinet_data["df_edges"].copy()

print("Nodes:", df_nodes.shape, "Edges:", df_edges.shape)
display(df_nodes.head(3))
display(df_edges.head(3))

# Quick layer counts
print("\nNode counts by layer:")
display(df_nodes["layer"].value_counts().to_frame("count"))
print("\nInterlayer edge count:", int(df_edges["interlayer"].sum()) if "interlayer" in df_edges else "N/A")
print("Edge counts by (source_layer, target_layer):")
display(df_edges.groupby(["source_layer","target_layer"]).size().sort_values(ascending=False).to_frame("edges"))
↪ Loading LipiNet (combined) from cache
Nodes: (2817072, 41) Edges: (7002161, 8)
node_id layer origin_vertex rhea_Equation rhea_ChEBI identifier rhea_chebi_name rhea_EC number rhea_Enzymes rhea_Gene Ontology rhea_Cross-reference (Reactome) ... sl_Exact m/z of [M+NH4]+ sl_Exact m/z of [M-H]- sl_Exact m/z of [M+Cl]- sl_Exact m/z of [M+OAc]- sl_CHEBI sl_LIPID MAPS sl_HMDB sl_MetaNetX sl_PMID sl_Components_parsed
0 RHEA:21252 rhea_reactionid rhea (S)-2-hydroxyglutarate + A = 2-oxoglutarate + AH2 CHEBI:16782;CHEBI:13193;CHEBI:16810;CHEBI:17499 (S)-2-hydroxyglutarate;A;2-oxoglutarate;AH2 EC:1.1.99.2 4258.0 GO:0047545 2-hydroxyglutarate dehydrogenase ac... NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 RHEA:21256 rhea_reactionid rhea 3-phosphoshikimate + phosphoenolpyruvate = 5-O... CHEBI:145989;CHEBI:58702;CHEBI:57701;CHEBI:43474 3-phosphoshikimate;phosphoenolpyruvate;5-O-(1-... EC:2.5.1.19 44340.0 GO:0003866 3-phosphoshikimate 1-carboxyvinyltr... NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 RHEA:21260 rhea_reactionid rhea [thioredoxin]-disulfide + L-methionine + H2O =... CHEBI:50058;CHEBI:57844;CHEBI:15377;CHEBI:5877... L-cystine residue;L-methionine;H2O;L-methionin... EC:1.8.4.14 3112.0 GO:0033745 L-methionine-(R)-S-oxide reductase ... NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 41 columns

   source_layer  source_id  target_layer  target_id  interlayer  edge_type     origin_edge  ec_level
0  rhea_ec       EC:1       rhea_ec       EC:1.1     False       ec_hierarchy  rhea         main_class->subclass
1  rhea_ec       EC:2       rhea_ec       EC:2.5     False       ec_hierarchy  rhea         main_class->subclass
2  rhea_ec       EC:1       rhea_ec       EC:1.8     False       ec_hierarchy  rhea         main_class->subclass
Node counts by layer:
layer                 count
swisslipids           779312
sl_abbreviation       736949
sl_synonyms           534781
sl_metanetx           504880
sl_parent             184620
rhea_reactionid       17783
sl_hmdb               17232
rhea_chebiid          13723
sl_lipidmaps          12112
rhea_ec               6489
sl_chebi              4277
sl_components         1708
sl_components_parsed  1677
sl_pmid               1529
Interlayer edge count: 6114475
Edge counts by (source_layer, target_layer):
source_layer     target_layer          edges
swisslipids      sl_components         1852844
swisslipids      sl_components_parsed  1852844
swisslipids      sl_abbreviation       786750
swisslipids      swisslipids           779247
swisslipids      sl_synonyms           568257
swisslipids      sl_metanetx           505003
swisslipids      sl_parent             493491
rhea_reactionid  rhea_chebiid          83885
swisslipids      sl_hmdb               26026
rhea_reactionid  rhea_ec               18072
swisslipids      sl_lipidmaps          12117
swisslipids      sl_pmid               10109
rhea_ec          rhea_ec               6482
swisslipids      sl_chebi              4278
sl_chebi         rhea_chebiid          2756
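The interlayer count and the layer-pair counts describe the same edges from two angles, and in the output above they do not quite agree. Edges whose two layers differ number 7002161 - 779247 - 6482 = 6216432, while the interlayer column sums to 6114475. The gap of 101957 equals the two Rhea cross-layer pairs (83885 + 18072), which suggests that Rhea's own edges are not flagged as interlayer. The sketch below shows one way to surface such rows on a toy table. The definition "interlayer means the layers differ" is an assumption, and the node ID lipid_A is made up.

```python
import pandas as pd

# Toy edge table using the column names shown in the edge preview.
toy_edges = pd.DataFrame(
    {
        "source_layer": ["rhea_ec", "swisslipids", "sl_chebi", "rhea_reactionid"],
        "source_id": ["EC:1", "lipid_A", "10036", "RHEA:21252"],
        "target_layer": ["rhea_ec", "sl_chebi", "rhea_chebiid", "rhea_ec"],
        "target_id": ["EC:1.1", "10036", "CHEBI:10036", "EC:1.1.99.2"],
        # The last flag mimics a cross-layer Rhea edge stored as False.
        "interlayer": [False, True, True, False],
    }
)

# Assumed definition: an edge is interlayer when its two layers differ.
derived = toy_edges["source_layer"] != toy_edges["target_layer"]

# Rows where the stored flag disagrees with the derived one deserve a look.
mismatches = toy_edges[derived != toy_edges["interlayer"]]

print("stored:", int(toy_edges["interlayer"].sum()))
print("derived:", int(derived.sum()))
print(mismatches.groupby(["source_layer", "target_layer"]).size())
```

Running the same comparison on the real df_edges would show directly which layer pairs account for the difference.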

3) Behind the scenes: parse each source

sl = parse_swisslipids_data(verbose=VERBOSE, use_cache=USE_CACHE, force_download=FORCE_DOWNLOAD)
rhea = parse_rhea_data(verbose=VERBOSE, use_cache=USE_CACHE, force_download=FORCE_DOWNLOAD)

df_sl_nodes = sl["df_nodes"].copy()
df_sl_edges = sl["df_edges"].copy()
df_rhea_nodes = rhea["df_nodes"].copy()
df_rhea_edges = rhea["df_edges"].copy()

print("SwissLipids:", df_sl_nodes.shape, "nodes /", df_sl_edges.shape, "edges")
print("Rhea:       ", df_rhea_nodes.shape, "nodes /", df_rhea_edges.shape, "edges")
↪ Loading SwissLipids cache
↪ Loading Rhea (processed) from cache
SwissLipids: (2779077, 32) nodes / (6890966, 5) edges
Rhea:        (37995, 10) nodes / (108439, 7) edges

3.1) Layers present

print("SwissLipids layers:")
display(df_sl_nodes["layer"].value_counts().to_frame("count"))

print("\nRhea layers:")
display(df_rhea_nodes["layer"].value_counts().to_frame("count"))
SwissLipids layers:
layer                 count
swisslipids           779312
sl_abbreviation       736949
sl_synonyms           534781
sl_metanetx           504880
sl_parent             184620
sl_hmdb               17232
sl_lipidmaps          12112
sl_chebi              4277
sl_components         1708
sl_components_parsed  1677
sl_pmid               1529
Rhea layers:
layer                 count
rhea_reactionid       17783
rhea_chebiid          13723
rhea_ec               6489
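Several layers hold values that appear in the source tables as delimited strings. The Rhea reaction rows in the section 2 node preview, for example, list their participants as a semicolon-separated ChEBI string. One plausible way to turn such a column into rhea_reactionid → rhea_chebiid edges is to split and explode it, as sketched below. Whether parse_rhea_data works this way is an assumption, and the unprefixed column name "ChEBI identifier" is inferred from the prefixed name in the preview.

```python
import pandas as pd

# One reaction row copied from the node preview in section 2.
reactions = pd.DataFrame(
    {
        "node_id": ["RHEA:21252"],
        "ChEBI identifier": ["CHEBI:16782;CHEBI:13193;CHEBI:16810;CHEBI:17499"],
    }
)

edges = (
    reactions
    # Turn the delimited string into a list of participant IDs.
    .assign(target_id=reactions["ChEBI identifier"].str.split(";"))
    # One row per (reaction, participant) pair.
    .explode("target_id")
    .rename(columns={"node_id": "source_id"})
    .assign(source_layer="rhea_reactionid", target_layer="rhea_chebiid")
    [["source_layer", "source_id", "target_layer", "target_id"]]
    .reset_index(drop=True)
)
print(edges)
```

Each distinct target_id would then also become a node in the rhea_chebiid layer.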

4) Cross-source linking: ChEBI (SwissLipids ↔ Rhea)

df_chebi_linked = _link_chebi_edges(df_sl_nodes, df_rhea_nodes, verbose=VERBOSE)
print("Linked ChEBI interlayer edges:", df_chebi_linked.shape)
display(df_chebi_linked.head(10))

# Sanity: a few random examples
print("\nSample mapped CHEBI IDs (SwissLipids → Rhea):")
display(df_chebi_linked.sample(min(10, len(df_chebi_linked)), random_state=0) if len(df_chebi_linked) else df_chebi_linked)
Linked ChEBI edges: 2756
Linked ChEBI interlayer edges: (2756, 7)
   source_layer  source_id  target_layer  target_id     interlayer  edge_type      origin_edge
0  sl_chebi      10036      rhea_chebiid  CHEBI:10036   True        same_id_chebi  lipinet
1  sl_chebi      10362      rhea_chebiid  CHEBI:10362   True        same_id_chebi  lipinet
2  sl_chebi      11152      rhea_chebiid  CHEBI:11152   True        same_id_chebi  lipinet
3  sl_chebi      1156       rhea_chebiid  CHEBI:1156    True        same_id_chebi  lipinet
4  sl_chebi      116314     rhea_chebiid  CHEBI:116314  True        same_id_chebi  lipinet
5  sl_chebi      11641      rhea_chebiid  CHEBI:11641   True        same_id_chebi  lipinet
6  sl_chebi      1178       rhea_chebiid  CHEBI:1178    True        same_id_chebi  lipinet
7  sl_chebi      11867      rhea_chebiid  CHEBI:11867   True        same_id_chebi  lipinet
8  sl_chebi      1189       rhea_chebiid  CHEBI:1189    True        same_id_chebi  lipinet
9  sl_chebi      11893      rhea_chebiid  CHEBI:11893   True        same_id_chebi  lipinet
Sample mapped CHEBI IDs (SwissLipids → Rhea):
      source_layer  source_id  target_layer  target_id     interlayer  edge_type      origin_edge
352   sl_chebi      138100     rhea_chebiid  CHEBI:138100  True        same_id_chebi  lipinet
855   sl_chebi      48946      rhea_chebiid  CHEBI:48946   True        same_id_chebi  lipinet
883   sl_chebi      52639      rhea_chebiid  CHEBI:52639   True        same_id_chebi  lipinet
1801  sl_chebi      76591      rhea_chebiid  CHEBI:76591   True        same_id_chebi  lipinet
1774  sl_chebi      76475      rhea_chebiid  CHEBI:76475   True        same_id_chebi  lipinet
1695  sl_chebi      76291      rhea_chebiid  CHEBI:76291   True        same_id_chebi  lipinet
1604  sl_chebi      75587      rhea_chebiid  CHEBI:75587   True        same_id_chebi  lipinet
1087  sl_chebi      62243      rhea_chebiid  CHEBI:62243   True        same_id_chebi  lipinet
396   sl_chebi      138569     rhea_chebiid  CHEBI:138569  True        same_id_chebi  lipinet
1255  sl_chebi      71567      rhea_chebiid  CHEBI:71567   True        same_id_chebi  lipinet
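The linked edges show that the two sources write the same identifier differently: SwissLipids stores the bare number (10036) while Rhea uses the prefixed form (CHEBI:10036). Matching therefore needs a normalization step. The sketch below illustrates the idea in plain Python. normalize_chebi is a hypothetical helper, not the logic inside _link_chebi_edges, and the ID 99999999 is a made-up non-match.

```python
def normalize_chebi(raw) -> str:
    """Return a ChEBI ID in prefixed form, e.g. '10036' -> 'CHEBI:10036'."""
    text = str(raw).strip()
    # Drop an existing prefix (in any letter case) before re-adding it uniformly.
    if text.upper().startswith("CHEBI:"):
        text = text[len("CHEBI:"):]
    return f"CHEBI:{text}"


sl_chebi_ids = ["10036", "10362", "99999999"]  # the last one is a made-up non-match
rhea_chebi_ids = {"CHEBI:10036", "CHEBI:10362", "CHEBI:16782"}

# Keep only SwissLipids IDs whose normalized form also exists in Rhea,
# using the same columns as the linked-edge table shown above.
linked = [
    {
        "source_layer": "sl_chebi",
        "source_id": sl_id,
        "target_layer": "rhea_chebiid",
        "target_id": normalize_chebi(sl_id),
        "interlayer": True,
        "edge_type": "same_id_chebi",
        "origin_edge": "lipinet",
    }
    for sl_id in sl_chebi_ids
    if normalize_chebi(sl_id) in rhea_chebi_ids
]

for row in linked:
    print(row["source_id"], "->", row["target_id"])
```

In the real data, 2756 of the 4277 sl_chebi nodes find a partner this way, so roughly a third of the SwissLipids ChEBI IDs have no counterpart among the Rhea participants.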

5) Merging node frames (origin tagging & prefixing unique columns)

df_nodes_joined = _join_node_dfs(df_sl_nodes, df_rhea_nodes)
print("Joined nodes:", df_nodes_joined.shape)
display(df_nodes_joined.head(3))
print("Column order (shared first, then prefixed):")
print(list(df_nodes_joined.columns)[:15], "...")
Joined nodes: (2817072, 41)
node_id layer origin_vertex rhea_Equation rhea_ChEBI identifier rhea_chebi_name rhea_EC number rhea_Enzymes rhea_Gene Ontology rhea_Cross-reference (Reactome) ... sl_Exact m/z of [M+NH4]+ sl_Exact m/z of [M-H]- sl_Exact m/z of [M+Cl]- sl_Exact m/z of [M+OAc]- sl_CHEBI sl_LIPID MAPS sl_HMDB sl_MetaNetX sl_PMID sl_Components_parsed
0 RHEA:21252 rhea_reactionid rhea (S)-2-hydroxyglutarate + A = 2-oxoglutarate + AH2 CHEBI:16782;CHEBI:13193;CHEBI:16810;CHEBI:17499 (S)-2-hydroxyglutarate;A;2-oxoglutarate;AH2 EC:1.1.99.2 4258.0 GO:0047545 2-hydroxyglutarate dehydrogenase ac... NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 RHEA:21256 rhea_reactionid rhea 3-phosphoshikimate + phosphoenolpyruvate = 5-O... CHEBI:145989;CHEBI:58702;CHEBI:57701;CHEBI:43474 3-phosphoshikimate;phosphoenolpyruvate;5-O-(1-... EC:2.5.1.19 44340.0 GO:0003866 3-phosphoshikimate 1-carboxyvinyltr... NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 RHEA:21260 rhea_reactionid rhea [thioredoxin]-disulfide + L-methionine + H2O =... CHEBI:50058;CHEBI:57844;CHEBI:15377;CHEBI:5877... L-cystine residue;L-methionine;H2O;L-methionin... EC:1.8.4.14 3112.0 GO:0033745 L-methionine-(R)-S-oxide reductase ... NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 41 columns

Column order (shared first, then prefixed):
['node_id', 'layer', 'origin_vertex', 'rhea_Equation', 'rhea_ChEBI identifier', 'rhea_chebi_name', 'rhea_EC number', 'rhea_Enzymes', 'rhea_Gene Ontology', 'rhea_Cross-reference (Reactome)', 'rhea_ec_level', 'sl_Lipid ID', 'sl_Level', 'sl_Name', 'sl_Abbreviation*'] ...
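The joined shape is consistent with a simple stacking scheme: 2779077 + 37995 rows give 2817072, and 32 + 10 columns minus two shared ones plus origin_vertex give 41. The sketch below reproduces that scheme on toy data. toy_join_nodes is a hypothetical helper, not the body of _join_node_dfs. The shared column set and the origin label "swisslipids" are assumptions; only "rhea" is visible in the output above.

```python
import pandas as pd

SHARED = ["node_id", "layer"]  # columns both sources are assumed to share


def toy_join_nodes(df_sl: pd.DataFrame, df_rhea: pd.DataFrame) -> pd.DataFrame:
    """Stack two node tables, prefixing source-specific columns (hypothetical)."""
    parts = []
    for df, origin, prefix in [(df_rhea, "rhea", "rhea_"), (df_sl, "swisslipids", "sl_")]:
        unique_cols = [c for c in df.columns if c not in SHARED]
        renamed = df.rename(columns={c: prefix + c for c in unique_cols})
        # Shared columns first, then this source's prefixed columns.
        ordered = renamed[SHARED + [prefix + c for c in unique_cols]].copy()
        # Tag every row with the source it came from.
        ordered.insert(len(SHARED), "origin_vertex", origin)
        parts.append(ordered)
    # Columns that exist in only one source are filled with NaN for the other.
    return pd.concat(parts, ignore_index=True, sort=False)


df_sl = pd.DataFrame(
    {"node_id": ["10036"], "layer": ["sl_chebi"], "Name": ["toy lipid"]}
)
df_rhea = pd.DataFrame(
    {
        "node_id": ["RHEA:21252"],
        "layer": ["rhea_reactionid"],
        "Equation": ["(S)-2-hydroxyglutarate + A = 2-oxoglutarate + AH2"],
    }
)

joined = toy_join_nodes(df_sl, df_rhea)
print(list(joined.columns))
print(joined.shape)
```

Prefixing keeps columns apart when both sources use the same name for different things, at the cost of a wide, mostly empty table, as the many NaN cells in the preview show.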

7) Quick QC: orphans & coverage by layer

def qc_coverage(df_nodes, df_edges, layer_name):
    n = df_nodes.query("layer == @layer_name")["node_id"].nunique()
    E = df_edges[(df_edges.source_layer==layer_name) | (df_edges.target_layer==layer_name)]
    touched = pd.unique(pd.concat([
        E.loc[E.source_layer==layer_name, "source_id"],
        E.loc[E.target_layer==layer_name, "target_id"]
    ]))
    touched = set(map(str, touched))
    return {"nodes": n, "touched": len(touched), "orphans": n - len(touched)}

layers_to_check = ["sl_chebi", "rhea_chebiid", "rhea_reactionid", "rhea_ec"]
qc = {L: qc_coverage(df_nodes, df_edges, L) for L in layers_to_check if L in df_nodes["layer"].unique()}
pd.DataFrame(qc).T
                 nodes  touched  orphans
sl_chebi         4277   4277     0
rhea_chebiid     13723  13723    0
rhea_reactionid  17783  17783    0
rhea_ec          6489   6490     -1
df_edges[df_edges["target_layer"]=="rhea_ec"]["target_id"].isna().sum()
# and/or
# df_edges.query("target_layer=='rhea_ec' and target_id.isna()").head()
np.int64(10214)

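Note that the orphan count of -1 for rhea_ec means one more ID is counted as touched (6490) than exists as a node (6489). The extra ID is most likely NaN: 10214 edges into rhea_ec have a missing target_id, and qc_coverage stringifies NaN to "nan" and counts it as a touched ID. We should later correct _join_edge_dfs to sanitize these missing IDs.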
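Until the edge table itself is cleaned, the QC helper can be made robust to missing IDs. The variant below is a sketch, not part of lipinet. It drops NaN endpoints before stringifying and compares ID sets instead of subtracting counts, so IDs that appear in edges but have no node row are reported separately rather than producing a negative orphan count. The name qc_coverage_nan_safe and the reaction ID RHEA:0 are made up for illustration.

```python
import numpy as np
import pandas as pd


def qc_coverage_nan_safe(df_nodes, df_edges, layer_name):
    """Like qc_coverage, but ignores missing endpoint IDs and compares ID sets."""
    node_ids = set(
        df_nodes.loc[df_nodes["layer"] == layer_name, "node_id"].dropna().astype(str)
    )
    endpoints = pd.concat([
        df_edges.loc[df_edges["source_layer"] == layer_name, "source_id"],
        df_edges.loc[df_edges["target_layer"] == layer_name, "target_id"],
    ])
    # Drop NaN first so it never becomes the string "nan".
    touched = set(endpoints.dropna().astype(str))
    return {
        "nodes": len(node_ids),
        "touched": len(touched & node_ids),
        "orphans": len(node_ids - touched),
        "unknown_ids": len(touched - node_ids),  # edge endpoints with no node row
    }


toy_nodes = pd.DataFrame(
    {"node_id": ["EC:1", "EC:1.1"], "layer": ["rhea_ec", "rhea_ec"]}
)
toy_edges = pd.DataFrame(
    {
        "source_layer": ["rhea_ec", "rhea_reactionid"],
        "source_id": ["EC:1", "RHEA:0"],
        "target_layer": ["rhea_ec", "rhea_ec"],
        "target_id": ["EC:1.1", np.nan],  # a missing EC target, as in the real table
    }
)

result = qc_coverage_nan_safe(toy_nodes, toy_edges, "rhea_ec")
print(result)
```

The NaN row no longer inflates the touched count, and a non-zero unknown_ids would point at edges that reference IDs missing from the node table.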

8) Optional: Write outputs to .data/processed/

# from pathlib import Path
# DATA_PROCESSED = Path(__import__("lipinet").__file__).parent / ".data" / "processed"
# DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
# nodes_path = DATA_PROCESSED / "lipinet_nodes.parquet"
# edges_path = DATA_PROCESSED / "lipinet_edges.parquet"

# df_nodes.to_parquet(nodes_path, index=False)
# df_edges.to_parquet(edges_path, index=False)
# print("Wrote:", nodes_path)
# print("Wrote:", edges_path)

Where to go next

  • Check out the explore_lipinet.ipynb notebook to see how to analyse the resulting LipiNet in practice.