Parsing SwissLipids into a network for LipiNet

import lipinet.databases  # Import the module
import importlib

# Reload the module to ensure changes are picked up
importlib.reload(lipinet)

# Now can use the functions after reloading the module
from lipinet.databases import get_prior_knowledge
from lipinet.utils import split_and_expand_large, create_nodedf_from_edgedf, check_for_split_characters

import pandas as pd

Parsing the manual way

LipiNet offers conventient functions to parse prior knowledge resources straight into networks. But to show what is happening behind the scenes, this notebook goes through the data and each of the steps. Which may also be particularly helpful to you if you need to customise the networks in a way that is not yet supported by LipiNet directly.

df_swisslipids = get_prior_knowledge('swisslipids', verbose=True)
df_swisslipids
File found locally at /Users/agjanyunlu/Documents/Metabolomics/lipinet/lipinet/.data/downloaded/swisslipids_lipids.tsv. Loading data...
Before cleaning, number of values in lipid class column with trailing space: Lipid class*
False    779171
True         76
Name: count, dtype: int64
After cleaning, number of values in lipid class column with trailing space: Lipid class*
False    779247
Name: count, dtype: int64
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
0 SLM:000000002 Class Ceramide (iso-d17:1(4E)) Cer(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine SLM:000399814 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O InChI=none ... NaN NaN NaN NaN NaN 70846 NaN NaN MNXM97012 | 11443131 | 14685263 | 18390550 | 21325339 |...
1 SLM:000000003 Isomeric subspecies 15-methylhexadecasphing-4-enine NaN NaN SLM:000390097 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C17H35NO2/c1-15(2)12-10-8-6-4-3-5-7-9... ... 292.282235 303.300605 284.259503 320.236181 344.280632 70771 NaN NaN MNXM57784 19372430
2 SLM:000000006 Isomeric subspecies 15-methylhexadecasphinganine NaN NaN SLM:000390097 NaN NaN CC(C)CCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C17H37NO2/c1-15(2)12-10-8-6-4-3-5-7-9... ... 294.297885 305.316255 286.275153 322.251831 346.296282 70829 NaN NaN MNXM97029 19372430
3 SLM:000000007 Class Sphingomyelin (iso-d17:1(4E)) SM(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine-1-phosp... SLM:000001000 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O... InChI=none ... NaN NaN NaN NaN NaN 70775 NaN NaN MNXM97113 14685263 | 21926990 | 9603947
4 SLM:000000035 Isomeric subspecies sphinganine NaN NaN SLM:000390097 NaN NaN CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 308.313535 319.331905 300.290803 336.267481 360.311932 57817 LMSP01020001 HMDB00269 MNXM302 10652340 | 10702247 | 10751414 | 10802064 | 10...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
779244 SLM:000782324 NaN apo carotenoid NaN NaN SLM:000508864 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 53183 NaN NaN NaN NaN
779245 SLM:000782325 NaN terpenoid NaN NaN SLM:000508864 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 26873 NaN NaN NaN NaN
779246 SLM:000782326 NaN C-45 isoprenoid NaN NaN SLM:000508864 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 87168 NaN NaN NaN NaN
779247 SLM:000782327 NaN gamma-lactone NaN NaN SLM:000782238 NaN NaN O1C(C(C(C1=O)*)*)* NaN ... NaN NaN NaN NaN NaN 37581 NaN NaN NaN NaN
779248 SLM:000782328 NaN oxidized 2-acylglycerol NaN NaN SLM:000000355 NaN NaN OCC(CO)OC(=O)* NaN ... NaN NaN NaN NaN NaN 167117 NaN NaN NaN NaN

779249 rows × 29 columns

If we take a closer look into the data, especially the Lipid class* column, we will see that some of the values have multiple entries. For example Ceramide phosphoinositol is a Class level entry that itself belongs to both the SLM:000000834 and SLM:000399815 classes.

df_swisslipids.dropna(subset='Lipid class*')[df_swisslipids['Lipid class*'].dropna().str.contains('|', regex=False)]
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
142 SLM:000000392 Class Ceramide phosphoinositol IPC Inositol-1-phosphoceramide SLM:000000834 | SLM:000399815 NaN NaN O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[... InChI=none ... NaN NaN NaN NaN NaN 64916 NaN NaN NaN 10888667 | 20727985
234 SLM:000000509 Isomeric subspecies All-trans-retinyl hexadecanoate NaN all-trans-retinyl palmitate SLM:000000982 | SLM:000508854 NaN NaN CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C... InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1... ... NaN NaN NaN NaN NaN 17616 NaN HMDB03648 NaN 10769148 | 10819989 | 12230550 | 15550674 | 15...
315 SLM:000000612 NaN tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74146 NaN NaN NaN 18541923 | 20110363 | 20937905
317 SLM:000000614 NaN hexacosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74161 NaN NaN NaN 18165233
319 SLM:000000621 NaN 2-hydroxy-tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74215 NaN NaN NaN 18541923
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
755324 SLM:000758294 Class Globoside Globo Globo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 61360 NaN NaN NaN NaN
755325 SLM:000758295 Class Isogloboside Isoglobo Isoglobo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 78257 NaN NaN NaN NaN
779141 SLM:000782221 NaN Resolvin E RvE NaN SLM:000501332 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN NaN LMFA0314 NaN NaN NaN
779142 SLM:000782222 NaN Resolvin D RvD NaN SLM:000501331 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN NaN LMFA0403 NaN NaN NaN
779157 SLM:000782237 NaN an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-... NaN NaN SLM:000000413 | SLM:000782274 NaN NaN [C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(... NaN ... NaN NaN NaN NaN NaN 157662 NaN NaN NaN NaN

119 rows × 29 columns

What about other IDs?

cols_with_split_chars = check_for_split_characters(df_swisslipids, delimiter='|')
Checking split characters (|) in Lipid ID
No rows found

Checking split characters (|) in Level
No rows found

Checking split characters (|) in Name
No rows found

Checking split characters (|) in Abbreviation*
Found 9768 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
56 SLM:000000262 Class 1,2-diacyl-sn-glycerol 1,2-sn-DAG | DAG | DG Diacylglycerol SLM:000000423 NaN NaN OC[C@@H](COC([*])=O)OC([*])=O InChI=none ... NaN NaN NaN NaN NaN 17815 NaN NaN MNXM59 10336610 | 10685032 | 10888667 | 10931938 | 11...
114 SLM:000000341 Class 1-acyl-sn-glycerol MAG | MG Monoacylglycerol SLM:000117130 NaN NaN OC[C@H](O)COC([*])=O InChI=none ... NaN NaN NaN NaN NaN 64683 NaN NaN MNXM2963 10685032 | 15939762 | 18037386 | 8663293 | 960...
122 SLM:000000355 Class 2-acylglycerol MAG | MG Monoacylglycerol SLM:000000403 NaN NaN OCC(CO)OC([*])=O InChI=none ... NaN NaN NaN NaN NaN 17389 NaN NaN MNXM335 NaN
146 SLM:000000400 Class Triacylglycerol TAG | TG NaN SLM:000117141 NaN NaN [*]C(=O)OCC(COC([*])=O)OC([*])=O InChI=none ... NaN NaN NaN NaN NaN 17855 NaN NaN MNXM248 12682047 | 16135509 | 16150821 | 21704635 | 27...
147 SLM:000000401 Class Diacylglycerol DAG | DG NaN SLM:000117140 NaN NaN [*]OCC(CO[*])O[*] InChI=none ... NaN NaN NaN NaN NaN 18035 NaN NaN MNXM59 12682047 | 16135509 | 16150821 | 27247428 | 29...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
505694 SLM:000508489 Molecular subspecies Phosphatidylglycerol (O-17:1_0:0) LPG(O-17:1_0:0) | PG(O-17:1_0:0) Lysophosphatidylglycerol (O-17:1_0:0) SLM:000508807 SLM:000508779 SLM:000001333 (sn1 or sn2 or sn3) OCC(O)COP([O-])(=O)OCC(CO[*])O[*] InChI=none ... 489.316311 500.334681 481.293579 517.270257 541.314708 NaN NaN NaN MNXM629334 NaN
505695 SLM:000508490 Molecular subspecies Phosphatidylglycerol (O-15:1_0:0) LPG(O-15:1_0:0) | PG(O-15:1_0:0) Lysophosphatidylglycerol (O-15:1_0:0) SLM:000508807 SLM:000508775 SLM:000001331 (sn1 or sn2 or sn3) OCC(O)COP([O-])(=O)OCC(CO[*])O[*] InChI=none ... 461.285011 472.303381 453.262279 489.238957 513.283408 NaN NaN NaN MNXM628940 NaN
505696 SLM:000508491 Molecular subspecies Phosphatidylglycerol (O-13:1_0:0) LPG(O-13:1_0:0) | PG(O-13:1_0:0) Lysophosphatidylglycerol (O-13:1_0:0) SLM:000508807 SLM:000508771 SLM:000001329 (sn1 or sn2 or sn3) OCC(O)COP([O-])(=O)OCC(CO[*])O[*] InChI=none ... 433.253711 444.272081 425.230979 461.207657 485.252108 NaN NaN NaN MNXM628548 NaN
595061 SLM:000597889 Isomeric subspecies 7-oxoresolvin D2 7-oxo-RvD2| 7-keto-RvD2 (16R,17S)-dihydroxy-7-oxo-(4Z,8E,10Z,12E,14E,1... SLM:000508853 | SLM:000782222 NaN NaN C(C/C=C\CC(/C=C/C=C\C=C\C=C\[C@H]([C@H](C/C=C\... InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1... ... 381.224780 392.243150 373.202048 409.178725 433.223177 137497 NaN NaN NaN 22844113
595062 SLM:000597890 Isomeric subspecies 16-oxoresolvin D2 16-oxo-RvD2| 16-keto-RvD2 (7S,17S)-dihydroxy-16-oxo-(4Z,8E,10Z,12E,14E,1... SLM:000508853 | SLM:000782222 NaN NaN C(C/C=C\C[C@@H](\C=C\C=C/C=C/C=C/C([C@H](C/C=C... InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1... ... 381.224780 392.243150 373.202048 409.178725 433.223177 137498 NaN NaN NaN 22844113

9768 rows × 29 columns

Checking split characters (|) in Synonyms*
Found 19853 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
11 SLM:000000101 Class 1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero... PA 1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero... SLM:000477285 NaN NaN O[C@@H](COP([O-])([O-])=O)COP([O-])(=O)OC[C@@H... InChI=none ... NaN NaN NaN NaN NaN 60110 NaN NaN MNXM871 20485265 | 9880566
17 SLM:000000147 Isomeric subspecies N-(9Z-octadecenoyl)-ethanolamine NAE (18:1(9Z)) (9Z-octadecenoyl)-ethanolamide | N-(9Z-octadec... SLM:000000378 NaN NaN CCCCCCCC\C=C/CCCCCCCC(=O)NCCO InChI=1S/C20H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 332.313535 343.331905 324.290803 360.267481 384.311932 71466 NaN HMDB02088 MNXM107386 14634025 | 16527816 | 17015445 | 17626977 | 17...
18 SLM:000000149 Isomeric subspecies N-hexadecanoyl-ethanolamine NAE (16:0) hexadecanoyl-ethanolamide | N-hexadecanoyl eth... SLM:000000378 NaN NaN CCCCCCCCCCCCCCCC(=O)NCCO InChI=1S/C18H37NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 306.297885 317.316255 298.275153 334.251831 358.296282 71464 NaN HMDB02100 MNXM107548 12824167 | 14634025 | 15655246 | 15760304 | 16...
19 SLM:000000178 Isomeric subspecies N-(docosanoyl)-15-methylhexadecasphing-4-enine Cer(iso-d17:1(4E)/22:0) Ceramide (iso-d17:1(4E)/22:0) | N-docosanoyl-1... SLM:000000002 SLM:000392021 SLM:000000827 (n-acyl) CCCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\... InChI=1S/C39H77NO3/c1-4-5-6-7-8-9-10-11-12-13-... ... 614.605801 625.624171 606.583069 642.559747 666.604198 71377 NaN NaN MNXM107026 19372430
20 SLM:000000179 Isomeric subspecies N-(heneicosanoyl)-15-methylhexadecasphing-4-enine Cer(iso-d17:1(4E)/21:0) Ceramide (iso-d17:1(4E)/21:0) | N-henicosanoyl... SLM:000000002 SLM:000392020 SLM:000001207 (n-acyl) CCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\C... InChI=1S/C38H75NO3/c1-4-5-6-7-8-9-10-11-12-13-... ... 600.590151 611.608521 592.567419 628.544097 652.588548 71375 NaN NaN MNXM107036 19372430
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
745092 SLM:000747954 Isomeric subspecies CDP-1,2-di-(13-methyltetradecanoyl)-sn-glycerol CDP-DAG (iso15:0/iso15:0) 1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c... SLM:000000084 NaN SLM:000000047 (sn1 or sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-7-5... ... 932.498448 943.516818 924.475716 960.452394 984.496846 NaN NaN HMDB0116214 NaN NaN
745093 SLM:000747955 Isomeric subspecies CDP-1-(13-methyltetradecanoyl)-2-(15-methylhex... CDP-DAG (iso15:0/iso17:0) 1-(13-methyltetradecanoyl)-2-(15-methylhexadec... SLM:000000084 NaN SLM:000000047 (sn1) / SLM:000000048 (sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6... ... 960.529748 971.548118 952.507016 988.483694 1012.528146 NaN NaN HMDB0116216 NaN NaN
745175 SLM:000748037 Isomeric subspecies CDP-1-(15-methylhexadecanoyl)-2-(11-methyldode... CDP-DAG (iso17:0/iso13:0) 1-(15-methylhexadecanoyl)-2-(11-methyldodecano... SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001197 (sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-8-6... ... 932.498448 943.516818 924.475716 960.452394 984.496846 NaN NaN HMDB0116248 NaN NaN
745176 SLM:000748038 Isomeric subspecies CDP-1-(15-methylhexadecanoyl)-2-(13-methyltetr... CDP-DAG (iso17:0/iso15:0) 1-(15-methylhexadecanoyl)-2-(13-methyltetradec... SLM:000000084 NaN SLM:000000047 (sn2) / SLM:000000048 (sn1) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6... ... 960.529748 971.548118 952.507016 988.483694 1012.528146 NaN NaN HMDB0116250 NaN NaN
745177 SLM:000748039 Isomeric subspecies CDP-1,2-di-(15-methylhexadecanoyl)-sn-glycerol CDP-DAG (iso17:0/iso17:0) 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy... SLM:000000084 NaN SLM:000000048 (sn1 or sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C46H85N3O15P2/c1-36(2)27-23-19-15-11-... ... 988.561049 999.579419 980.538317 1016.514994 1040.559446 NaN NaN HMDB0116252 NaN NaN

19853 rows × 29 columns

Checking split characters (|) in Lipid class*
Found 119 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
142 SLM:000000392 Class Ceramide phosphoinositol IPC Inositol-1-phosphoceramide SLM:000000834 | SLM:000399815 NaN NaN O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[... InChI=none ... NaN NaN NaN NaN NaN 64916 NaN NaN NaN 10888667 | 20727985
234 SLM:000000509 Isomeric subspecies All-trans-retinyl hexadecanoate NaN all-trans-retinyl palmitate SLM:000000982 | SLM:000508854 NaN NaN CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C... InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1... ... NaN NaN NaN NaN NaN 17616 NaN HMDB03648 NaN 10769148 | 10819989 | 12230550 | 15550674 | 15...
315 SLM:000000612 NaN tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74146 NaN NaN NaN 18541923 | 20110363 | 20937905
317 SLM:000000614 NaN hexacosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74161 NaN NaN NaN 18165233
319 SLM:000000621 NaN 2-hydroxy-tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74215 NaN NaN NaN 18541923
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
755324 SLM:000758294 Class Globoside Globo Globo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 61360 NaN NaN NaN NaN
755325 SLM:000758295 Class Isogloboside Isoglobo Isoglobo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 78257 NaN NaN NaN NaN
779141 SLM:000782221 NaN Resolvin E RvE NaN SLM:000501332 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN NaN LMFA0314 NaN NaN NaN
779142 SLM:000782222 NaN Resolvin D RvD NaN SLM:000501331 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN NaN LMFA0403 NaN NaN NaN
779157 SLM:000782237 NaN an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-... NaN NaN SLM:000000413 | SLM:000782274 NaN NaN [C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(... NaN ... NaN NaN NaN NaN NaN 157662 NaN NaN NaN NaN

119 rows × 29 columns

Checking split characters (|) in Parent
No rows found

Checking split characters (|) in Components*
No rows found

Checking split characters (|) in SMILES (pH7.3)
No rows found

Checking split characters (|) in InChI (pH7.3)
No rows found

Checking split characters (|) in InChI key (pH7.3)
No rows found

Checking split characters (|) in Formula (pH7.3)
No rows found

Checking split characters (|) in Charge (pH7.3)
Not a string column

Checking split characters (|) in Mass (pH7.3)
Not a string column

Checking split characters (|) in Exact Mass (neutral form)
Not a string column

Checking split characters (|) in Exact m/z of [M.]+
Not a string column

Checking split characters (|) in Exact m/z of [M+H]+
Not a string column

Checking split characters (|) in Exact m/z of [M+K]+ 
Not a string column

Checking split characters (|) in Exact m/z of [M+Na]+
Not a string column

Checking split characters (|) in Exact m/z of [M+Li]+
Not a string column

Checking split characters (|) in Exact m/z of [M+NH4]+
Not a string column

Checking split characters (|) in Exact m/z of [M-H]-
Not a string column

Checking split characters (|) in Exact m/z of [M+Cl]-
Not a string column

Checking split characters (|) in Exact m/z of [M+OAc]- 
Not a string column

Checking split characters (|) in CHEBI
Found 3 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
465 SLM:000000784 Isomeric subspecies 1,2-di-(9Z-octadecenoyl)-sn-glycero-3-phosphate PA(18:1(9Z)/18:1(9Z)) Phosphatidate (18:1(9Z)/18:1(9Z)) SLM:000000329 SLM:000082169 SLM:000000418 (sn1 or sn2) CCCCCCCC\C=C/CCCCCCCC(=O)OC[C@H](COP([O-])([O-... InChI=1S/C39H73O8P/c1-3-5-7-9-11-13-15-17-19-2... ... 707.519775 718.538147 699.497009 735.473694 759.518188 74546 | 82922 LMGP10010962 HMDB07865 MNXM51075 11309392 | 14634025 | 14665624 | 15164764 | 15...
387185 SLM:000389154 NaN (14Z,17Z,20Z,23Z,26Z)-dotriacontapentaenoate NaN Fatty acid 32:5(14Z,17Z,20Z,23Z,26Z) SLM:000389801 NaN NaN CCCCC\C=C/C\C=C/C\C=C/C\C=C/C\C=C/CCCCCCCCCCCC... InChI=1S/C32H54O2/c1-2-3-4-5-6-7-8-9-10-11-12-... ... 477.427836 488.446207 469.405105 505.381782 529.426234 82731 | CHEBI:82731 LMFA01030848 NaN NaN NaN
595221 SLM:000598072 NaN all-trans-retinol--[retinol-binding protein] NaN NaN SLM:000000982 NaN NaN [*][C@H](N-*)C(-*)=O InChI=none ... NaN NaN NaN NaN NaN 17336 | 83228 NaN NaN NaN 20628054 | 28758396

3 rows × 29 columns

Checking split characters (|) in LIPID MAPS
No rows found

Checking split characters (|) in HMDB
No rows found

Checking split characters (|) in MetaNetX
No rows found

Checking split characters (|) in PMID
Found 1318 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
0 SLM:000000002 Class Ceramide (iso-d17:1(4E)) Cer(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine SLM:000399814 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O InChI=none ... NaN NaN NaN NaN NaN 70846 NaN NaN MNXM97012 | 11443131 | 14685263 | 18390550 | 21325339 |...
3 SLM:000000007 Class Sphingomyelin (iso-d17:1(4E)) SM(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine-1-phosp... SLM:000001000 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O... InChI=none ... NaN NaN NaN NaN NaN 70775 NaN NaN MNXM97113 14685263 | 21926990 | 9603947
4 SLM:000000035 Isomeric subspecies sphinganine NaN NaN SLM:000390097 NaN NaN CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 308.313535 319.331905 300.290803 336.267481 360.311932 57817 LMSP01020001 HMDB00269 MNXM302 10652340 | 10702247 | 10751414 | 10802064 | 10...
5 SLM:000000042 Isomeric subspecies cholesta-5,7-dien-3beta-ol NaN NaN SLM:000501263 NaN NaN [H][C@@]1(CC[C@@]2([H])C3=CC=C4C[C@@H](O)CC[C@... InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2... ... 391.354671 402.373042 383.331940 419.308617 443.353069 17759 LMST01010069 HMDB00032 MNXM710 10329655 | 10344195 | 10786622 | 11230174 | 16...
6 SLM:000000043 Isomeric subspecies lathosterone NaN NaN SLM:000501263 NaN NaN [H][C@@]12CC=C3[C@]4([H])CC[C@]([H])([C@H](C)C... InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2... ... 391.354671 402.373042 383.331940 419.308617 443.353069 71550 NaN NaN MNXM97065 19531354 | 22505847
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
595221 SLM:000598072 NaN all-trans-retinol--[retinol-binding protein] NaN NaN SLM:000000982 NaN NaN [*][C@H](N-*)C(-*)=O InChI=none ... NaN NaN NaN NaN NaN 17336 | 83228 NaN NaN NaN 20628054 | 28758396
595222 SLM:000598073 NaN all-trans-retinyl heptanoate NaN NaN SLM:000000982 NaN NaN C1(C)(C)C(\C=C\C(=C\C=C\C(=C\COC(CCCCCC)=O)\C)... InChI=1S/C27H42O2/c1-7-8-9-10-16-26(28)29-21-1... ... NaN NaN NaN NaN NaN 138724 NaN NaN NaN 20628054 | 28758396
595223 SLM:000598074 NaN 2-heptanoyl-sn-glycero-3-phosphocholine NaN NaN SLM:000000724 NaN NaN P(OC[C@@H](CO)OC(=O)CCCCCC)(=O)(OCC[N+](C)(C)C... InChI=1S/C15H32NO7P/c1-5-6-7-8-9-15(18)23-14(1... ... NaN NaN NaN NaN NaN 138266 NaN NaN NaN 20628054 | 22605381 | 28758396
595230 SLM:000598083 NaN 12-hydroxy-(9Z)-octadecenoyl-CoA NaN NaN SLM:000389958 | SLM:000390051 NaN NaN S(C(CCCCCCC/C=C\C[C@@H](CCCCCC)O)=O)CCNC(CCNC(... InChI=1S/C39H68N7O18P3S/c1-4-5-6-13-16-27(47)1... ... NaN NaN NaN NaN NaN 139559 NaN NaN NaN 17084870 | 27758859
595245 SLM:000598101 NaN a mannosylinositol-1-phospho-N-(2-hydroxyacyl)... NaN NaN SLM:000000835 NaN NaN OC[C@H]1OC(O[C@@H]2[C@@H](O)[C@H](O)[C@@H](O)[... InChI=none ... NaN NaN NaN NaN NaN 74994 NaN NaN NaN 12954640 | 9368028

1318 rows × 29 columns

Okay wow! So these are all the columns we have found with split characters…

cols_with_split_chars
['Abbreviation*', 'Synonyms*', 'Lipid class*', 'CHEBI', 'PMID']

We can also check for different types of characters if we know that they will be present. For instance SL uses the / character for Components*, but this is also used by another of columns like the lipid names themselves or smiles and inchi.

check_for_split_characters(df_swisslipids.drop(columns=['Name','Abbreviation*','Synonyms*','SMILES (pH7.3)','InChI (pH7.3)']), delimiter='/')
Checking split characters (/) in Lipid ID
No rows found

Checking split characters (/) in Level
No rows found

Checking split characters (/) in Lipid class*
No rows found

Checking split characters (/) in Parent
No rows found

Checking split characters (/) in Components*
Found 708725 rows with split characters
Lipid ID Level Lipid class* Parent Components* InChI key (pH7.3) Formula (pH7.3) Charge (pH7.3) Mass (pH7.3) Exact Mass (neutral form) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
164 SLM:000000422 Isomeric subspecies SLM:000000329 SLM:000081844 SLM:000000418 (sn2) / SLM:000000510 (sn1) InChIKey=OPVZUEPSMJNLOM-QEJMHMKOSA-L C37H69O8P -2.0 672.913818 674.488647 ... 681.504089 692.522461 673.481384 709.458069 733.502502 64839 LMGP10010032 HMDB07859 MNXM66476 10359651 | 11788596 | 12963729 | 16620771 | 17...
229 SLM:000000498 Isomeric subspecies SLM:000000324 SLM:000105249 SLM:000000296 (sn2) / SLM:000000826 (sn1) InChIKey=KRTOMQDUKGRFDJ-ZAHDIIMDSA-M C47H82O13P -1.0 886.120483 886.557129 ... 893.572571 904.590942 885.549866 921.526550 945.570984 133606 LMGP06010010 HMDB09815 MNXM75683 22942276 | 23097495 | 23472195 | 8300559
269 SLM:000000557 Isomeric subspecies SLM:000000261 SLM:000088147 SLM:000000510 (sn1) / SLM:000000826 (sn2) InChIKey=PZNPLUBHRSSFHT-RRHRGVEJSA-N C42H84NO8P 0.0 762.091980 761.593445 ... 768.608887 779.627258 NaN 796.562866 820.607300 73000 LMGP01010573 HMDB07970 MNXM69304 18195019 | 19416660 | 22923616 | 27399000
332 SLM:000000636 Isomeric subspecies SLM:000000329 SLM:000082164 SLM:000000418 (sn1) / SLM:000000510 (sn2) InChIKey=ZSXHMDPHNCOWSV-QEJMHMKOSA-L C37H69O8P -2.0 672.913818 674.488647 ... 681.504089 692.522461 673.481384 709.458069 733.502502 74551 LMGP10010964 NaN MNXM66662 16620771 | 18606822 | 19318427 | 19801371 | 20...
333 SLM:000000637 Isomeric subspecies SLM:000000329 SLM:000082168 SLM:000000418 (sn1) / SLM:000000826 (sn2) InChIKey=XIERONXOJKEALF-PXYGFXEISA-L C39H73O8P -2.0 700.966980 702.519958 ... 709.535400 720.553772 701.512695 737.489380 761.533813 74552 LMGP10010963 NaN MNXM66667 16620771 | 18606822 | 19318427 | 19801371 | 21...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
745172 SLM:000748034 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001195 (sn2) InChIKey=LJSBNBPNSBKZCI-JNOBRDIFSA-L C33H57N3O15P2 -2.0 NaN 799.342142 ... 806.357598 817.375968 798.334866 834.311543 858.355995 NaN NaN NaN NaN NaN
745173 SLM:000748035 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001196 (sn2) InChIKey=ODNYDZLXLRZPCJ-GPTQCAHZSA-L C35H61N3O15P2 -2.0 NaN 827.373442 ... 834.388898 845.407268 826.366166 862.342844 886.387295 NaN NaN NaN NaN NaN
745174 SLM:000748036 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000000853 (sn2) InChIKey=FJIBTCUXUBRYKG-QOTCTSOZSA-L C37H65N3O15P2 -2.0 NaN 855.404743 ... 862.420198 873.438568 854.397466 890.374144 914.418595 NaN NaN NaN NaN NaN
745175 SLM:000748037 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001197 (sn2) InChIKey=AIBKQADSQWEVSS-HUKRWTLJSA-L C42H75N3O15P2 -2.0 NaN 925.482993 ... 932.498448 943.516818 924.475716 960.452394 984.496846 NaN NaN HMDB0116248 NaN NaN
745176 SLM:000748038 Isomeric subspecies SLM:000000084 NaN SLM:000000047 (sn2) / SLM:000000048 (sn1) InChIKey=PIZFKSVTEGNINS-BQUKFSKHSA-L C44H79N3O15P2 -2.0 NaN 953.514293 ... 960.529748 971.548118 952.507016 988.483694 1012.528146 NaN NaN HMDB0116250 NaN NaN

708725 rows × 24 columns

Checking split characters (/) in InChI key (pH7.3)
No rows found

Checking split characters (/) in Formula (pH7.3)
No rows found

Checking split characters (/) in Charge (pH7.3)
Not a string column

Checking split characters (/) in Mass (pH7.3)
Not a string column

Checking split characters (/) in Exact Mass (neutral form)
Not a string column

Checking split characters (/) in Exact m/z of [M.]+
Not a string column

Checking split characters (/) in Exact m/z of [M+H]+
Not a string column

Checking split characters (/) in Exact m/z of [M+K]+ 
Not a string column

Checking split characters (/) in Exact m/z of [M+Na]+
Not a string column

Checking split characters (/) in Exact m/z of [M+Li]+
Not a string column

Checking split characters (/) in Exact m/z of [M+NH4]+
Not a string column

Checking split characters (/) in Exact m/z of [M-H]-
Not a string column

Checking split characters (/) in Exact m/z of [M+Cl]-
Not a string column

Checking split characters (/) in Exact m/z of [M+OAc]- 
Not a string column

Checking split characters (/) in CHEBI
No rows found

Checking split characters (/) in LIPID MAPS
No rows found

Checking split characters (/) in HMDB
No rows found

Checking split characters (/) in MetaNetX
No rows found

Checking split characters (/) in PMID
No rows found
['Components*']

These double entries for the classes will be important to take into account for our class hierarchy, because if we don’t many of these Class level entries will become disjointed in the ontology.

To help us handle this connection we will split it into two using the split_and_expand_large utility function, but we will come back to this a bit later…

For now we will also add another column for components, so that later we can have both the actual component with location (e.g. sn) and a parsed version where we just have the SL

df_swisslipids['Components_parsed'] = df_swisslipids['Components*']

Now we can melt to start creating the edges df

Building the edges df

# # Split the 'Lipid class*' column into multiple rows
# df_swisslipids_splitexp = split_and_expand_large(
#     df_swisslipids, #.assign(from_layer_col='swisslipids')
#     split_col='Lipid class*', 
#     expand_cols=['Lipid ID', 'Level', 'Name', 'Abbreviation*',
#                     'CHEBI', 'LIPID MAPS', 'HMDB', 'MetaNetX', 'PMID','Synonyms*','Parent','Components*','Components_parsed'], #'from_layer_col'
#     delimiter='|'
# )
df_swisslipids_edges = pd.melt(df_swisslipids,  #df_swisslipids_splitexp
                id_vars=['Lipid ID'], 
                value_vars=['CHEBI','LIPID MAPS','HMDB','MetaNetX','PMID','Lipid class*','Abbreviation*','Synonyms*','Parent','Components*','Components_parsed'], 
                var_name='melted_column', value_name='value')
df_swisslipids_edges
Lipid ID melted_column value
0 SLM:000000002 CHEBI 70846
1 SLM:000000003 CHEBI 70771
2 SLM:000000006 CHEBI 70829
3 SLM:000000007 CHEBI 70775
4 SLM:000000035 CHEBI 57817
... ... ... ...
8571734 SLM:000782324 Components_parsed NaN
8571735 SLM:000782325 Components_parsed NaN
8571736 SLM:000782326 Components_parsed NaN
8571737 SLM:000782327 Components_parsed NaN
8571738 SLM:000782328 Components_parsed NaN

8571739 rows × 3 columns

Because this melt operation also resulted in a large number of null values, which probably mean nothing to us in this case, we will drop instances where the value is null

df_swisslipids_edges = df_swisslipids_edges.dropna(subset='value')
df_swisslipids_edges
Lipid ID melted_column value
0 SLM:000000002 CHEBI 70846
1 SLM:000000003 CHEBI 70771
2 SLM:000000006 CHEBI 70829
3 SLM:000000007 CHEBI 70775
4 SLM:000000035 CHEBI 57817
... ... ... ...
8571494 SLM:000781997 Components_parsed SLM:000000856 (n-acyl)
8571495 SLM:000781998 Components_parsed SLM:000389154 (n-acyl)
8571496 SLM:000781999 Components_parsed SLM:000485643 (n-acyl)
8571497 SLM:000782000 Components_parsed SLM:000485644 (n-acyl)
8571498 SLM:000782001 Components_parsed SLM:000485645 (n-acyl)

4678499 rows × 3 columns

There are still some things we need to tidy up so that it is in a suitable format for OnionNet

df_swisslipids_edges = df_swisslipids_edges.copy()
df_swisslipids_edges['source_layer'] = 'swisslipids'
df_swisslipids_edges.rename(columns={'Lipid ID':'source_id', 'melted_column':'target_layer', 'value':'target_id'}, inplace=True)
df_swisslipids_edges = df_swisslipids_edges[['source_layer','source_id','target_layer','target_id']]
df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: 'swisslipids' if x=='Lipid class*' else f"sl_{str(x).replace(' ','').strip('*').lower()}")
#df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: )
df_swisslipids_edges
source_layer source_id target_layer target_id
0 swisslipids SLM:000000002 sl_chebi 70846
1 swisslipids SLM:000000003 sl_chebi 70771
2 swisslipids SLM:000000006 sl_chebi 70829
3 swisslipids SLM:000000007 sl_chebi 70775
4 swisslipids SLM:000000035 sl_chebi 57817
... ... ... ... ...
8571494 swisslipids SLM:000781997 sl_components_parsed SLM:000000856 (n-acyl)
8571495 swisslipids SLM:000781998 sl_components_parsed SLM:000389154 (n-acyl)
8571496 swisslipids SLM:000781999 sl_components_parsed SLM:000485643 (n-acyl)
8571497 swisslipids SLM:000782000 sl_components_parsed SLM:000485644 (n-acyl)
8571498 swisslipids SLM:000782001 sl_components_parsed SLM:000485645 (n-acyl)

4678499 rows × 4 columns

For rows where it is swisslipids to swisslipids, we actually want to correct this from target_layer to source_layer, because currently the target_layer in this case is actually the parent class, and ideally it would be better to have the parent point towards the children, so that way the root node should be the one with multiple outgoing edges and no incoming edges…

Be sure to only run this once, otherwise it will switch back again…

# Identify rows where both source_layer and target_layer are 'swisslipids'
condition = (df_swisslipids_edges["source_layer"] == "swisslipids") & (df_swisslipids_edges["target_layer"] == "swisslipids")

# Swap the columns for rows satisfying the condition
df_swisslipids_edges.loc[condition, ["source_layer", "source_id", "target_layer", "target_id"]] = df_swisslipids_edges.loc[condition, ["target_layer", "target_id", "source_layer", "source_id"]].values

# Output the modified DataFrame
df_swisslipids_edges
source_layer source_id target_layer target_id
0 swisslipids SLM:000000002 sl_chebi 70846
1 swisslipids SLM:000000003 sl_chebi 70771
2 swisslipids SLM:000000006 sl_chebi 70829
3 swisslipids SLM:000000007 sl_chebi 70775
4 swisslipids SLM:000000035 sl_chebi 57817
... ... ... ... ...
8571494 swisslipids SLM:000781997 sl_components_parsed SLM:000000856 (n-acyl)
8571495 swisslipids SLM:000781998 sl_components_parsed SLM:000389154 (n-acyl)
8571496 swisslipids SLM:000781999 sl_components_parsed SLM:000485643 (n-acyl)
8571497 swisslipids SLM:000782000 sl_components_parsed SLM:000485644 (n-acyl)
8571498 swisslipids SLM:000782001 sl_components_parsed SLM:000485645 (n-acyl)

4678499 rows × 4 columns

df_swisslipids_edges['target_layer'].value_counts()
target_layer
swisslipids             779247
sl_abbreviation         776464
sl_components           765323
sl_components_parsed    765323
sl_synonyms             548163
sl_metanetx             505003
sl_parent               493491
sl_hmdb                  26026
sl_lipidmaps             12117
sl_chebi                  4276
sl_pmid                   3066
Name: count, dtype: int64

Now let’s return to two items on our todo list:

  1. splitting values that have multi-identifiers

  2. trimming/parsing the components col

edges_with_multilinks = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)]
edges_with_multilinks
source_layer source_id target_layer target_id
465 swisslipids SLM:000000784 sl_chebi 74546 | 82922
387185 swisslipids SLM:000389154 sl_chebi 82731 | CHEBI:82731
595221 swisslipids SLM:000598072 sl_chebi 17336 | 83228
3116996 swisslipids SLM:000000002 sl_pmid | 11443131 | 14685263 | 18390550 | 21325339 |...
3116999 swisslipids SLM:000000007 sl_pmid 14685263 | 21926990 | 9603947
... ... ... ... ...
6199835 swisslipids SLM:000747954 sl_synonyms 1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c...
6199836 swisslipids SLM:000747955 sl_synonyms 1-(13-methyltetradecanoyl)-2-(15-methylhexadec...
6199918 swisslipids SLM:000748037 sl_synonyms 1-(15-methylhexadecanoyl)-2-(11-methyldodecano...
6199919 swisslipids SLM:000748038 sl_synonyms 1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
6199920 swisslipids SLM:000748039 sl_synonyms 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...

30942 rows × 4 columns

edges_with_multilinks.value_counts('target_layer')
target_layer
sl_synonyms        19853
sl_abbreviation     9768
sl_pmid             1318
sl_chebi               3
Name: count, dtype: int64
edges_with_multilinks_split = split_and_expand_large(edges_with_multilinks, 
                       split_col='target_id', 
                       expand_cols=['source_layer','source_id','target_layer'],
                       delimiter='|').drop_duplicates()
edges_with_multilinks_split
source_layer source_id target_layer target_id
0 swisslipids SLM:000000784 sl_chebi 74546
1 swisslipids SLM:000000784 sl_chebi 82922
2 swisslipids SLM:000389154 sl_chebi 82731
3 swisslipids SLM:000389154 sl_chebi CHEBI:82731
4 swisslipids SLM:000598072 sl_chebi 17336
... ... ... ... ...
68383 swisslipids SLM:000748037 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(11Z))
68384 swisslipids SLM:000748038 sl_synonyms 1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
68385 swisslipids SLM:000748038 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(9Z))
68386 swisslipids SLM:000748039 sl_synonyms 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...
68387 swisslipids SLM:000748039 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:2(9Z,12Z))

68380 rows × 4 columns

This is good, but we also need to remember the separators in the components column

edges_with_multilinks2 = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
                     df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)]
edges_with_multilinks2
source_layer source_id target_layer target_id
7013405 swisslipids SLM:000000422 sl_components SLM:000000418 (sn2) / SLM:000000510 (sn1)
7013470 swisslipids SLM:000000498 sl_components SLM:000000296 (sn2) / SLM:000000826 (sn1)
7013510 swisslipids SLM:000000557 sl_components SLM:000000510 (sn1) / SLM:000000826 (sn2)
7013573 swisslipids SLM:000000636 sl_components SLM:000000418 (sn1) / SLM:000000510 (sn2)
7013574 swisslipids SLM:000000637 sl_components SLM:000000418 (sn1) / SLM:000000826 (sn2)
... ... ... ... ...
8537662 swisslipids SLM:000748034 sl_components_parsed SLM:000000048 (sn1) / SLM:000001195 (sn2)
8537663 swisslipids SLM:000748035 sl_components_parsed SLM:000000048 (sn1) / SLM:000001196 (sn2)
8537664 swisslipids SLM:000748036 sl_components_parsed SLM:000000048 (sn1) / SLM:000000853 (sn2)
8537665 swisslipids SLM:000748037 sl_components_parsed SLM:000000048 (sn1) / SLM:000001197 (sn2)
8537666 swisslipids SLM:000748038 sl_components_parsed SLM:000000047 (sn2) / SLM:000000048 (sn1)

1417450 rows × 4 columns

edges_with_multilinks2_split = split_and_expand_large(edges_with_multilinks2, 
                       split_col='target_id', 
                       expand_cols=['source_layer','source_id','target_layer'],
                       delimiter='/').drop_duplicates()
edges_with_multilinks2_split
source_layer source_id target_layer target_id
0 swisslipids SLM:000000422 sl_components SLM:000000418 (sn2)
1 swisslipids SLM:000000422 sl_components SLM:000000510 (sn1)
2 swisslipids SLM:000000498 sl_components SLM:000000296 (sn2)
3 swisslipids SLM:000000498 sl_components SLM:000000826 (sn1)
4 swisslipids SLM:000000557 sl_components SLM:000000510 (sn1)
... ... ... ... ...
3592487 swisslipids SLM:000748036 sl_components_parsed SLM:000000853 (sn2)
3592488 swisslipids SLM:000748037 sl_components_parsed SLM:000000048 (sn1)
3592489 swisslipids SLM:000748037 sl_components_parsed SLM:000001197 (sn2)
3592490 swisslipids SLM:000748038 sl_components_parsed SLM:000000047 (sn2)
3592491 swisslipids SLM:000748038 sl_components_parsed SLM:000000048 (sn1)

3592492 rows × 4 columns

Now let’s also parse the brackets from the parsed components so that these can be linked directly to the other SLMs if needed

# Apply transformation only for rows where target_layer equals 'sl_components_parsed'
mask = edges_with_multilinks2_split['target_layer'] == 'sl_components_parsed'
edges_with_multilinks2_split.loc[mask, 'target_id'] = edges_with_multilinks2_split.loc[mask, 'target_id'].str.split('(').str[0].str.strip()
edges_with_multilinks2_split
source_layer source_id target_layer target_id
0 swisslipids SLM:000000422 sl_components SLM:000000418 (sn2)
1 swisslipids SLM:000000422 sl_components SLM:000000510 (sn1)
2 swisslipids SLM:000000498 sl_components SLM:000000296 (sn2)
3 swisslipids SLM:000000498 sl_components SLM:000000826 (sn1)
4 swisslipids SLM:000000557 sl_components SLM:000000510 (sn1)
... ... ... ... ...
3592487 swisslipids SLM:000748036 sl_components_parsed SLM:000000853
3592488 swisslipids SLM:000748037 sl_components_parsed SLM:000000048
3592489 swisslipids SLM:000748037 sl_components_parsed SLM:000001197
3592490 swisslipids SLM:000748038 sl_components_parsed SLM:000000047
3592491 swisslipids SLM:000748038 sl_components_parsed SLM:000000048

3592492 rows × 4 columns

Now we need a way to change these original rows where they had multilinks and add back the corrected ones.

# Identify rows with multilinks (either '|' or '/' with the specific target_layer condition)
mask_pipe = df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)
mask_slash = (
    df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
    df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)
)
mask_problem = mask_pipe | mask_slash

# Remove these rows from the original df
df_clean = df_swisslipids_edges[~mask_problem].copy()

# Now, combine the cleaned df with the corrected edges dataframes.
# These corrected dataframes are assumed to be: 
#   - edges_with_multilinks_split
#   - edges_with_multilinks2_split
df_swisslipids_edges = pd.concat([df_clean, edges_with_multilinks_split, edges_with_multilinks2_split], ignore_index=True)

# (Optional) Drop any duplicate rows that might arise
df_swisslipids_edges = df_swisslipids_edges.drop_duplicates()

# df_final now contains the original "good" rows plus the corrected edges.
df_swisslipids_edges
source_layer source_id target_layer target_id
0 swisslipids SLM:000000002 sl_chebi 70846
1 swisslipids SLM:000000003 sl_chebi 70771
2 swisslipids SLM:000000006 sl_chebi 70829
3 swisslipids SLM:000000007 sl_chebi 70775
4 swisslipids SLM:000000035 sl_chebi 57817
... ... ... ... ...
6890974 swisslipids SLM:000748036 sl_components_parsed SLM:000000853
6890975 swisslipids SLM:000748037 sl_components_parsed SLM:000000048
6890976 swisslipids SLM:000748037 sl_components_parsed SLM:000001197
6890977 swisslipids SLM:000748038 sl_components_parsed SLM:000000047
6890978 swisslipids SLM:000748038 sl_components_parsed SLM:000000048

6890979 rows × 4 columns

Now we will determine whether the edge is within the same layer (intralayer) or between different layers (interlayer)

def assess_edge_layertype(df):
    interlayer = df['source_layer']!=df['target_layer']
    df['interlayer'] = interlayer
    return df 

df_swisslipids_edges = assess_edge_layertype(df_swisslipids_edges)
df_swisslipids_edges
source_layer source_id target_layer target_id interlayer
0 swisslipids SLM:000000002 sl_chebi 70846 True
1 swisslipids SLM:000000003 sl_chebi 70771 True
2 swisslipids SLM:000000006 sl_chebi 70829 True
3 swisslipids SLM:000000007 sl_chebi 70775 True
4 swisslipids SLM:000000035 sl_chebi 57817 True
... ... ... ... ... ...
6890974 swisslipids SLM:000748036 sl_components_parsed SLM:000000853 True
6890975 swisslipids SLM:000748037 sl_components_parsed SLM:000000048 True
6890976 swisslipids SLM:000748037 sl_components_parsed SLM:000001197 True
6890977 swisslipids SLM:000748038 sl_components_parsed SLM:000000047 True
6890978 swisslipids SLM:000748038 sl_components_parsed SLM:000000048 True

6890979 rows × 5 columns

Now we will build the node df

Building the node df

df_swisslipids_nodes = create_nodedf_from_edgedf(edge_df=df_swisslipids_edges, props=['layer', 'id'], cols=['layer', 'node_id'])
df_swisslipids_nodes
layer node_id
0 swisslipids SLM:000000002
1 swisslipids SLM:000000003
2 swisslipids SLM:000000006
3 swisslipids SLM:000000007
4 swisslipids SLM:000000035
... ... ...
13781953 sl_components_parsed SLM:000000853
13781954 sl_components_parsed SLM:000000048
13781955 sl_components_parsed SLM:000001197
13781956 sl_components_parsed SLM:000000047
13781957 sl_components_parsed SLM:000000048

13781958 rows × 2 columns

Let’s also see how many are duplicates

df_swisslipids_nodes.value_counts(dropna=True)
layer        node_id      
swisslipids  SLM:000000353    132660
             SLM:000000377     98800
             SLM:000000102     80218
             SLM:000117148     46826
             SLM:000000400     38525
                               ...  
sl_metanetx  MNXM311776            1
             MNXM311777            1
             MNXM311778            1
             MNXM311779            1
swisslipids  SLM:000782332         1
Name: count, Length: 2783345, dtype: int64

Now let’s merge the nodes with the information from earlier to create richer node attributes

df_swisslipids_nodes = pd.merge(df_swisslipids_nodes, df_swisslipids.assign(from_layer_col='swisslipids'),
                                left_on=['layer','node_id'], right_on=['from_layer_col','Lipid ID'],
                                how='outer')
df_swisslipids_nodes
layer node_id Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* ... Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID Components_parsed from_layer_col
0 sl_abbreviation (5S)-HpHEPE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 sl_abbreviation 15-KETE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 sl_abbreviation (10,11S,12R)-TriHETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 sl_abbreviation (10R)-H-(11S,12S)-EpETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 sl_abbreviation (10R)-H-(8S,9S)-EpETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13781953 swisslipids SLM:000782330 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13781954 swisslipids SLM:000782331 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13781955 swisslipids SLM:000782331 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13781956 swisslipids SLM:000782331 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13781957 swisslipids SLM:000782332 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

13781958 rows × 33 columns

This has a lot of duplicates in it, so lets remove them, along with the from_layer_col which means nothing in this case and is just a relic of our join back with the initial df we used to create the edges (which could probably be tidied up)

df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes = df_swisslipids_nodes.drop(columns='from_layer_col')
df_swisslipids_nodes
layer node_id Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* ... Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID Components_parsed
0 sl_abbreviation (5S)-HpHEPE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 sl_abbreviation 15-KETE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 sl_abbreviation (10,11S,12R)-TriHETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 sl_abbreviation (10R)-H-(11S,12S)-EpETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 sl_abbreviation (10R)-H-(8S,9S)-EpETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13781947 swisslipids SLM:000782328 SLM:000782328 NaN oxidized 2-acylglycerol NaN NaN SLM:000000355 NaN NaN ... NaN NaN NaN NaN 167117 NaN NaN NaN NaN NaN
13781950 swisslipids SLM:000782329 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13781953 swisslipids SLM:000782330 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13781954 swisslipids SLM:000782331 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13781957 swisslipids SLM:000782332 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2783345 rows × 32 columns

Now we have the nodes and edges dfs for swisslipids and understand how we have arrived at them. In reality you don’t have to go through this process every time, LipiNet offers a convenient function to do just this if you are interested in this same network setup.

Using the LipiNet parse_swisslipids function

The LipiNet parse_swisslipids function automatically runs through all of the same steps as we have just covered.

from lipinet.parse_swisslipids import parse_swisslipids_data

sl_results = parse_swisslipids_data(verbose=False)
df_sl_nodes = sl_results['df_nodes']
df_sl_edges = sl_results['df_edges']

We can also check to make sure these are equal here for an individual entry

df_swisslipids_nodes.iloc[0]
layer                        sl_abbreviation
node_id                          (5S)-HpHEPE
Lipid ID                                 NaN
Level                                    NaN
Name                                     NaN
Abbreviation*                            NaN
Synonyms*                                NaN
Lipid class*                             NaN
Parent                                   NaN
Components*                              NaN
SMILES (pH7.3)                           NaN
InChI (pH7.3)                            NaN
InChI key (pH7.3)                        NaN
Formula (pH7.3)                          NaN
Charge (pH7.3)                           NaN
Mass (pH7.3)                             NaN
Exact Mass (neutral form)                NaN
Exact m/z of [M.]+                       NaN
Exact m/z of [M+H]+                      NaN
Exact m/z of [M+K]+                      NaN
Exact m/z of [M+Na]+                     NaN
Exact m/z of [M+Li]+                     NaN
Exact m/z of [M+NH4]+                    NaN
Exact m/z of [M-H]-                      NaN
Exact m/z of [M+Cl]-                     NaN
Exact m/z of [M+OAc]-                    NaN
CHEBI                                    NaN
LIPID MAPS                               NaN
HMDB                                     NaN
MetaNetX                                 NaN
PMID                                     NaN
Components_parsed                        NaN
Name: 0, dtype: object
df_sl_nodes.iloc[0]
layer                        sl_abbreviation
node_id                          (5S)-HpHEPE
Lipid ID                                 NaN
Level                                    NaN
Name                                     NaN
Abbreviation*                            NaN
Synonyms*                                NaN
Lipid class*                             NaN
Parent                                   NaN
Components*                              NaN
SMILES (pH7.3)                           NaN
InChI (pH7.3)                            NaN
InChI key (pH7.3)                        NaN
Formula (pH7.3)                          NaN
Charge (pH7.3)                           NaN
Mass (pH7.3)                             NaN
Exact Mass (neutral form)                NaN
Exact m/z of [M.]+                       NaN
Exact m/z of [M+H]+                      NaN
Exact m/z of [M+K]+                      NaN
Exact m/z of [M+Na]+                     NaN
Exact m/z of [M+Li]+                     NaN
Exact m/z of [M+NH4]+                    NaN
Exact m/z of [M-H]-                      NaN
Exact m/z of [M+Cl]-                     NaN
Exact m/z of [M+OAc]-                    NaN
CHEBI                                    NaN
LIPID MAPS                               NaN
HMDB                                     NaN
MetaNetX                                 NaN
PMID                                     NaN
Components_parsed                        NaN
Name: 0, dtype: object

For the first entry it looks good, what about for the entire df? We can use the pd.testing.assert_frame_equal function to do this.

First we will use a null test to test equality between df_swisslipids_nodes and df_swisslipids_edges, which should obviously be False.

try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_swisslipids_edges)
    print('DataFrames are equal')
except AssertionError as e:
    print(e)
DataFrame are different

DataFrame shape mismatch
[left]:  (2783345, 32)
[right]: (6890979, 5)

Now let’s test between df_swisslipids_nodes and df_sl_nodes, which should hopefully be True and not throw an error. We will also test the edges df while we’re at it too.

try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_sl_nodes)
    print('DataFrames for nodes are equal')
except AssertionError as e:
    print(e)
DataFrames for nodes are equal
try:
    pd.testing.assert_frame_equal(df_swisslipids_edges, df_sl_edges)
    print('DataFrames for edges are equal')
except AssertionError as e:
    print(e)
DataFrames for edges are equal

Great! It looks like both approaches achieve the same df. We will use these dfs in other parts of the package.

If they are different, we can inspect the exact rows here

diff = df_sl_edges.merge(df_swisslipids_edges, how='outer', indicator=True)
diff_rows_edges = diff[diff['_merge'] != 'both']
diff_rows_edges
source_layer source_id target_layer target_id interlayer _merge
diff = df_sl_nodes.merge(df_swisslipids_nodes, how='outer', indicator=True)
diff_rows_nodes = diff[diff['_merge'] != 'both']
diff_rows_nodes
layer node_id Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* ... Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID Components_parsed _merge

0 rows × 33 columns

These should also be the same

df_sl_edges[df_sl_edges['source_id']=='SLM:000389145']
source_layer source_id target_layer target_id interlayer
1640 swisslipids SLM:000389145 sl_chebi 18059 True
429400 swisslipids SLM:000389145 sl_metanetx MNXM12117 True
549344 swisslipids SLM:000389145 swisslipids SLM:000000436 False
549407 swisslipids SLM:000389145 swisslipids SLM:000000525 False
549887 swisslipids SLM:000389145 swisslipids SLM:000001193 False
665828 swisslipids SLM:000389145 swisslipids SLM:000117142 False
936914 swisslipids SLM:000389145 swisslipids SLM:000390054 False
1046948 swisslipids SLM:000389145 swisslipids SLM:000500463 False
1055230 swisslipids SLM:000389145 swisslipids SLM:000508860 False
1328368 swisslipids SLM:000389145 swisslipids SLM:000782283 False
df_swisslipids_edges[df_swisslipids_edges['source_id']=='SLM:000389145']
source_layer source_id target_layer target_id interlayer
1640 swisslipids SLM:000389145 sl_chebi 18059 True
429400 swisslipids SLM:000389145 sl_metanetx MNXM12117 True
549344 swisslipids SLM:000389145 swisslipids SLM:000000436 False
549407 swisslipids SLM:000389145 swisslipids SLM:000000525 False
549887 swisslipids SLM:000389145 swisslipids SLM:000001193 False
665828 swisslipids SLM:000389145 swisslipids SLM:000117142 False
936914 swisslipids SLM:000389145 swisslipids SLM:000390054 False
1046948 swisslipids SLM:000389145 swisslipids SLM:000500463 False
1055230 swisslipids SLM:000389145 swisslipids SLM:000508860 False
1328368 swisslipids SLM:000389145 swisslipids SLM:000782283 False