Parse SwissLipids#

Parsing SwissLipids data into a network for LipiNet.

LipiNet offers convenient functions to parse prior knowledge resources straight into networks. For instance, parsing SwissLipids data into a network is as simple as calling parse_swisslipids_data().

However, to show what happens behind the scenes, this notebook also walks through the data and each step that this function performs in the background. This may be particularly helpful for users who need to customise the networks in ways that LipiNet does not yet support directly.

Using parse_swisslipids_data()#

As mentioned above, the LipiNet parse_swisslipids_data() function automatically parses SwissLipids into a network. This is what LipiNet uses as input to its overall combined network, and for most users this function will probably suffice if they wish to build sub-networks from SwissLipids data alone.

import importlib
import lipinet.parse_swisslipids
importlib.reload(lipinet.parse_swisslipids)  # reload the module after edits
from lipinet.parse_swisslipids import parse_swisslipids_data 

sl_results = parse_swisslipids_data(verbose=False, use_cache=True)
df_sl_nodes = sl_results['df_nodes']
df_sl_edges = sl_results['df_edges']

To avoid repeatedly downloading the SwissLipids data (and overloading their server with requests), set use_cache=True. If no cache exists yet, the download is automatically saved to the cache. If a cache already exists, it is used instead.

To override the cache, set force_download=True. This is only recommended every few months, when you want to refresh the source data in case it has changed.
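The caching behaviour described above can be sketched in a few lines. This is only an illustration of the general pattern, not LipiNet's implementation: load_with_cache, cache_path and download are hypothetical names, and the assumption that a forced download also refreshes the cached file is ours.

```python
from pathlib import Path
from typing import Callable


def load_with_cache(
    cache_path: Path,
    download: Callable[[], str],
    use_cache: bool = True,
    force_download: bool = False,
) -> str:
    """Return the resource text, downloading only when necessary.

    Hypothetical helper for illustration; not part of LipiNet.
    """
    if use_cache and not force_download and cache_path.exists():
        # A cached copy exists and no refresh was requested: reuse it.
        return cache_path.read_text(encoding="utf-8")

    # No usable cache (or a refresh was forced): fetch a fresh copy.
    text = download()
    if use_cache:
        # Assumption: a fresh download also overwrites the cached file,
        # so the next call can skip the download.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(text, encoding="utf-8")
    return text
```

With this pattern, the first call pays for the download, later calls read the local file, and force_download=True triggers exactly one new request.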

Where to from here?#

Manual parsing#

For users who want to better understand the steps performed by the parse_swisslipids_data() function, we recreate them here.

import lipinet.databases  # Import the module

# Reload the module to ensure changes are picked up
importlib.reload(lipinet)
importlib.reload(lipinet.databases)

# Now can use the functions after reloading the module
from lipinet.databases import get_prior_knowledge
from lipinet.utils import split_and_expand_large, create_nodedf_from_edgedf, check_for_split_characters, clean_missing_strings

import pandas as pd
df_swisslipids = get_prior_knowledge('swisslipids', verbose=True, force_download=False) #Previously set to True
df_swisslipids
File found locally at /opt/anaconda3/envs/graphtool/lib/python3.12/site-packages/lipinet/.data/downloaded/swisslipids_lipids.tsv. Loading data...
Before cleaning, trailing-space counts in 'Lipid class*': {False: 779171, True: 76, nan: 2}

>> Cleaning column 'Lipid class*':
   sample before: ['SLM:000399814', 'SLM:000390097', 'SLM:000390097', 'SLM:000001000', 'SLM:000390097']
   sample after:  ['SLM:000399814', 'SLM:000390097', 'SLM:000390097', 'SLM:000001000', 'SLM:000390097']

>> Cleaning column 'CHEBI':
   sample before: ['70846', '70771', '70829', '70775', '57817']
   sample after:  ['70846', '70771', '70829', '70775', '57817']
After cleaning, trailing-space counts in 'Lipid class*': {False: 779247, <NA>: 2}
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
0 SLM:000000002 Class Ceramide (iso-d17:1(4E)) Cer(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine SLM:000399814 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O InChI=none ... NaN NaN NaN NaN NaN 70846 NaN NaN MNXM97012 | 11443131 | 14685263 | 18390550 | 21325339 |...
1 SLM:000000003 Isomeric subspecies 15-methylhexadecasphing-4-enine NaN NaN SLM:000390097 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C17H35NO2/c1-15(2)12-10-8-6-4-3-5-7-9... ... 292.282235 303.300605 284.259503 320.236181 344.280632 70771 NaN NaN MNXM57784 19372430
2 SLM:000000006 Isomeric subspecies 15-methylhexadecasphinganine NaN NaN SLM:000390097 NaN NaN CC(C)CCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C17H37NO2/c1-15(2)12-10-8-6-4-3-5-7-9... ... 294.297885 305.316255 286.275153 322.251831 346.296282 70829 NaN NaN MNXM97029 19372430
3 SLM:000000007 Class Sphingomyelin (iso-d17:1(4E)) SM(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine-1-phosp... SLM:000001000 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O... InChI=none ... NaN NaN NaN NaN NaN 70775 NaN NaN MNXM97113 14685263 | 21926990 | 9603947
4 SLM:000000035 Isomeric subspecies sphinganine NaN NaN SLM:000390097 NaN NaN CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 308.313535 319.331905 300.290803 336.267481 360.311932 57817 LMSP01020001 HMDB00269 MNXM302 10652340 | 10702247 | 10751414 | 10802064 | 10...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
779244 SLM:000782324 NaN apo carotenoid NaN NaN SLM:000508864 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 53183 NaN NaN NaN NaN
779245 SLM:000782325 NaN terpenoid NaN NaN SLM:000508864 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 26873 NaN NaN NaN NaN
779246 SLM:000782326 NaN C-45 isoprenoid NaN NaN SLM:000508864 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 87168 NaN NaN NaN NaN
779247 SLM:000782327 NaN gamma-lactone NaN NaN SLM:000782238 NaN NaN O1C(C(C(C1=O)*)*)* NaN ... NaN NaN NaN NaN NaN 37581 NaN NaN NaN NaN
779248 SLM:000782328 NaN oxidized 2-acylglycerol NaN NaN SLM:000000355 NaN NaN OCC(CO)OC(=O)* NaN ... NaN NaN NaN NaN NaN 167117 NaN NaN NaN NaN

779249 rows × 29 columns

To be safe, we will start by removing leading and trailing whitespace from all object and string columns. First, we check whether any abbreviations start with a space:

df_swisslipids[df_swisslipids['Abbreviation*'].str.startswith(' ', na=False)]
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
116651 SLM:000117132 Class 1(3)-O-(alk-1-enyl)-glycerol MG(P-) | MAG(P-) Monoacylglycerol (P-) SLM:000117137 NaN NaN OCC(O)COC=C[*] InChI=none ... NaN NaN NaN NaN NaN 77998 NaN NaN MNXM149874 NaN
116663 SLM:000117144 Class 1-O-(alk-1Z-enyl)-sn-glycerol MG(P-) | MAG(P-) Monoacylglycerol (P-) SLM:000117132 NaN NaN OC[C@H](O)COC=C/[*] InChI=none ... NaN NaN NaN NaN NaN 77297 NaN NaN MNXM413498 NaN

2 rows × 29 columns

df_swisslipids = clean_missing_strings(df_swisslipids)
df_swisslipids[df_swisslipids['Abbreviation*'].str.startswith(' ', na=False)].shape
(0, 29)
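For readers who want to apply a similar clean-up to their own tables, here is a minimal sketch in plain pandas. It is not the implementation of clean_missing_strings; strip_string_columns and the toy data are hypothetical and only show the whitespace-stripping idea.

```python
import pandas as pd


def strip_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip leading/trailing whitespace from text columns (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include=["object", "string"]).columns:
        # The .str methods skip missing values, so NaN / <NA> stay missing.
        out[col] = out[col].str.strip()
    return out


# Toy data: one abbreviation with a leading space, one clean, one missing.
toy = pd.DataFrame({
    "Abbreviation*": [" MG(P-) | MAG(P-)", "PA", None],
    "toy_numeric": [1.0, 2.0, None],  # numeric columns are left untouched
})
cleaned = strip_string_columns(toy)
```

Repeating the startswith(' ', na=False) check on cleaned should then return no rows, mirroring the (0, 29) result above.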

If we take a closer look at the data, especially the Lipid class* column, we see that some values contain multiple entries. For example, Ceramide phosphoinositol is a Class level entry that itself belongs to both the SLM:000000834 and SLM:000399815 classes.

# na=False treats missing classes as "no match", so no separate dropna() is needed
df_swisslipids[df_swisslipids['Lipid class*'].str.contains('|', regex=False, na=False)]
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
142 SLM:000000392 Class Ceramide phosphoinositol IPC Inositol-1-phosphoceramide SLM:000000834 | SLM:000399815 NaN NaN O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[... InChI=none ... NaN NaN NaN NaN NaN 64916 NaN NaN NaN 10888667 | 20727985
234 SLM:000000509 Isomeric subspecies All-trans-retinyl hexadecanoate NaN all-trans-retinyl palmitate SLM:000000982 | SLM:000508854 NaN NaN CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C... InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1... ... NaN NaN NaN NaN NaN 17616 NaN HMDB03648 NaN 10769148 | 10819989 | 12230550 | 15550674 | 15...
315 SLM:000000612 NaN tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74146 NaN NaN NaN 18541923 | 20110363 | 20937905
317 SLM:000000614 NaN hexacosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74161 NaN NaN NaN 18165233
319 SLM:000000621 NaN 2-hydroxy-tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74215 NaN NaN NaN 18541923
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
755324 SLM:000758294 Class Globoside Globo Globo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 61360 NaN NaN NaN NaN
755325 SLM:000758295 Class Isogloboside Isoglobo Isoglobo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 78257 NaN NaN NaN NaN
779141 SLM:000782221 NaN Resolvin E RvE NaN SLM:000501332 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN <NA> LMFA0314 NaN NaN NaN
779142 SLM:000782222 NaN Resolvin D RvD NaN SLM:000501331 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN <NA> LMFA0403 NaN NaN NaN
779157 SLM:000782237 NaN an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-... NaN NaN SLM:000000413 | SLM:000782274 NaN NaN [C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(... NaN ... NaN NaN NaN NaN NaN 157662 NaN NaN NaN NaN

119 rows × 29 columns

What about other IDs?

cols_with_split_chars = check_for_split_characters(df_swisslipids, delimiter='|')
Checking split characters (|) in Lipid ID
No rows found

Checking split characters (|) in Level
No rows found

Checking split characters (|) in Name
No rows found

Checking split characters (|) in Abbreviation*
Found 9768 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
56 SLM:000000262 Class 1,2-diacyl-sn-glycerol 1,2-sn-DAG | DAG | DG Diacylglycerol SLM:000000423 NaN NaN OC[C@@H](COC([*])=O)OC([*])=O InChI=none ... NaN NaN NaN NaN NaN 17815 NaN NaN MNXM59 10336610 | 10685032 | 10888667 | 10931938 | 11...
114 SLM:000000341 Class 1-acyl-sn-glycerol MAG | MG Monoacylglycerol SLM:000117130 NaN NaN OC[C@H](O)COC([*])=O InChI=none ... NaN NaN NaN NaN NaN 64683 NaN NaN MNXM2963 10685032 | 15939762 | 18037386 | 8663293 | 960...
122 SLM:000000355 Class 2-acylglycerol MAG | MG Monoacylglycerol SLM:000000403 NaN NaN OCC(CO)OC([*])=O InChI=none ... NaN NaN NaN NaN NaN 17389 NaN NaN MNXM335 NaN
146 SLM:000000400 Class Triacylglycerol TAG | TG NaN SLM:000117141 NaN NaN [*]C(=O)OCC(COC([*])=O)OC([*])=O InChI=none ... NaN NaN NaN NaN NaN 17855 NaN NaN MNXM248 12682047 | 16135509 | 16150821 | 21704635 | 27...
147 SLM:000000401 Class Diacylglycerol DAG | DG NaN SLM:000117140 NaN NaN [*]OCC(CO[*])O[*] InChI=none ... NaN NaN NaN NaN NaN 18035 NaN NaN MNXM59 12682047 | 16135509 | 16150821 | 27247428 | 29...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
505694 SLM:000508489 Molecular subspecies Phosphatidylglycerol (O-17:1_0:0) LPG(O-17:1_0:0) | PG(O-17:1_0:0) Lysophosphatidylglycerol (O-17:1_0:0) SLM:000508807 SLM:000508779 SLM:000001333 (sn1 or sn2 or sn3) OCC(O)COP([O-])(=O)OCC(CO[*])O[*] InChI=none ... 489.316311 500.334681 481.293579 517.270257 541.314708 <NA> NaN NaN MNXM629334 NaN
505695 SLM:000508490 Molecular subspecies Phosphatidylglycerol (O-15:1_0:0) LPG(O-15:1_0:0) | PG(O-15:1_0:0) Lysophosphatidylglycerol (O-15:1_0:0) SLM:000508807 SLM:000508775 SLM:000001331 (sn1 or sn2 or sn3) OCC(O)COP([O-])(=O)OCC(CO[*])O[*] InChI=none ... 461.285011 472.303381 453.262279 489.238957 513.283408 <NA> NaN NaN MNXM628940 NaN
505696 SLM:000508491 Molecular subspecies Phosphatidylglycerol (O-13:1_0:0) LPG(O-13:1_0:0) | PG(O-13:1_0:0) Lysophosphatidylglycerol (O-13:1_0:0) SLM:000508807 SLM:000508771 SLM:000001329 (sn1 or sn2 or sn3) OCC(O)COP([O-])(=O)OCC(CO[*])O[*] InChI=none ... 433.253711 444.272081 425.230979 461.207657 485.252108 <NA> NaN NaN MNXM628548 NaN
595061 SLM:000597889 Isomeric subspecies 7-oxoresolvin D2 7-oxo-RvD2| 7-keto-RvD2 (16R,17S)-dihydroxy-7-oxo-(4Z,8E,10Z,12E,14E,1... SLM:000508853 | SLM:000782222 NaN NaN C(C/C=C\CC(/C=C/C=C\C=C\C=C\[C@H]([C@H](C/C=C\... InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1... ... 381.224780 392.243150 373.202048 409.178725 433.223177 137497 NaN NaN NaN 22844113
595062 SLM:000597890 Isomeric subspecies 16-oxoresolvin D2 16-oxo-RvD2| 16-keto-RvD2 (7S,17S)-dihydroxy-16-oxo-(4Z,8E,10Z,12E,14E,1... SLM:000508853 | SLM:000782222 NaN NaN C(C/C=C\C[C@@H](\C=C\C=C/C=C/C=C/C([C@H](C/C=C... InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1... ... 381.224780 392.243150 373.202048 409.178725 433.223177 137498 NaN NaN NaN 22844113

9768 rows × 29 columns

Checking split characters (|) in Synonyms*
Found 19853 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
11 SLM:000000101 Class 1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero... PA 1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero... SLM:000477285 NaN NaN O[C@@H](COP([O-])([O-])=O)COP([O-])(=O)OC[C@@H... InChI=none ... NaN NaN NaN NaN NaN 60110 NaN NaN MNXM871 20485265 | 9880566
17 SLM:000000147 Isomeric subspecies N-(9Z-octadecenoyl)-ethanolamine NAE (18:1(9Z)) (9Z-octadecenoyl)-ethanolamide | N-(9Z-octadec... SLM:000000378 NaN NaN CCCCCCCC\C=C/CCCCCCCC(=O)NCCO InChI=1S/C20H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 332.313535 343.331905 324.290803 360.267481 384.311932 71466 NaN HMDB02088 MNXM107386 14634025 | 16527816 | 17015445 | 17626977 | 17...
18 SLM:000000149 Isomeric subspecies N-hexadecanoyl-ethanolamine NAE (16:0) hexadecanoyl-ethanolamide | N-hexadecanoyl eth... SLM:000000378 NaN NaN CCCCCCCCCCCCCCCC(=O)NCCO InChI=1S/C18H37NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 306.297885 317.316255 298.275153 334.251831 358.296282 71464 NaN HMDB02100 MNXM107548 12824167 | 14634025 | 15655246 | 15760304 | 16...
19 SLM:000000178 Isomeric subspecies N-(docosanoyl)-15-methylhexadecasphing-4-enine Cer(iso-d17:1(4E)/22:0) Ceramide (iso-d17:1(4E)/22:0) | N-docosanoyl-1... SLM:000000002 SLM:000392021 SLM:000000827 (n-acyl) CCCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\... InChI=1S/C39H77NO3/c1-4-5-6-7-8-9-10-11-12-13-... ... 614.605801 625.624171 606.583069 642.559747 666.604198 71377 NaN NaN MNXM107026 19372430
20 SLM:000000179 Isomeric subspecies N-(heneicosanoyl)-15-methylhexadecasphing-4-enine Cer(iso-d17:1(4E)/21:0) Ceramide (iso-d17:1(4E)/21:0) | N-henicosanoyl... SLM:000000002 SLM:000392020 SLM:000001207 (n-acyl) CCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\C... InChI=1S/C38H75NO3/c1-4-5-6-7-8-9-10-11-12-13-... ... 600.590151 611.608521 592.567419 628.544097 652.588548 71375 NaN NaN MNXM107036 19372430
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
745092 SLM:000747954 Isomeric subspecies CDP-1,2-di-(13-methyltetradecanoyl)-sn-glycerol CDP-DAG (iso15:0/iso15:0) 1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c... SLM:000000084 NaN SLM:000000047 (sn1 or sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-7-5... ... 932.498448 943.516818 924.475716 960.452394 984.496846 <NA> NaN HMDB0116214 NaN NaN
745093 SLM:000747955 Isomeric subspecies CDP-1-(13-methyltetradecanoyl)-2-(15-methylhex... CDP-DAG (iso15:0/iso17:0) 1-(13-methyltetradecanoyl)-2-(15-methylhexadec... SLM:000000084 NaN SLM:000000047 (sn1) / SLM:000000048 (sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6... ... 960.529748 971.548118 952.507016 988.483694 1012.528146 <NA> NaN HMDB0116216 NaN NaN
745175 SLM:000748037 Isomeric subspecies CDP-1-(15-methylhexadecanoyl)-2-(11-methyldode... CDP-DAG (iso17:0/iso13:0) 1-(15-methylhexadecanoyl)-2-(11-methyldodecano... SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001197 (sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-8-6... ... 932.498448 943.516818 924.475716 960.452394 984.496846 <NA> NaN HMDB0116248 NaN NaN
745176 SLM:000748038 Isomeric subspecies CDP-1-(15-methylhexadecanoyl)-2-(13-methyltetr... CDP-DAG (iso17:0/iso15:0) 1-(15-methylhexadecanoyl)-2-(13-methyltetradec... SLM:000000084 NaN SLM:000000047 (sn2) / SLM:000000048 (sn1) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6... ... 960.529748 971.548118 952.507016 988.483694 1012.528146 <NA> NaN HMDB0116250 NaN NaN
745177 SLM:000748039 Isomeric subspecies CDP-1,2-di-(15-methylhexadecanoyl)-sn-glycerol CDP-DAG (iso17:0/iso17:0) 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy... SLM:000000084 NaN SLM:000000048 (sn1 or sn2) [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... InChI=1S/C46H85N3O15P2/c1-36(2)27-23-19-15-11-... ... 988.561049 999.579419 980.538317 1016.514994 1040.559446 <NA> NaN HMDB0116252 NaN NaN

19853 rows × 29 columns

Checking split characters (|) in Lipid class*
Found 119 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
142 SLM:000000392 Class Ceramide phosphoinositol IPC Inositol-1-phosphoceramide SLM:000000834 | SLM:000399815 NaN NaN O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[... InChI=none ... NaN NaN NaN NaN NaN 64916 NaN NaN NaN 10888667 | 20727985
234 SLM:000000509 Isomeric subspecies All-trans-retinyl hexadecanoate NaN all-trans-retinyl palmitate SLM:000000982 | SLM:000508854 NaN NaN CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C... InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1... ... NaN NaN NaN NaN NaN 17616 NaN HMDB03648 NaN 10769148 | 10819989 | 12230550 | 15550674 | 15...
315 SLM:000000612 NaN tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74146 NaN NaN NaN 18541923 | 20110363 | 20937905
317 SLM:000000614 NaN hexacosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74161 NaN NaN NaN 18165233
319 SLM:000000621 NaN 2-hydroxy-tetracosenoyl-CoA NaN NaN SLM:000390051 | SLM:000782334 NaN NaN CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... NaN ... NaN NaN NaN NaN NaN 74215 NaN NaN NaN 18541923
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
755324 SLM:000758294 Class Globoside Globo Globo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 61360 NaN NaN NaN NaN
755325 SLM:000758295 Class Isogloboside Isoglobo Isoglobo-series SLM:000000834 | SLM:000399813 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 78257 NaN NaN NaN NaN
779141 SLM:000782221 NaN Resolvin E RvE NaN SLM:000501332 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN <NA> LMFA0314 NaN NaN NaN
779142 SLM:000782222 NaN Resolvin D RvD NaN SLM:000501331 | SLM:000508853 NaN NaN NaN InChI=none ... NaN NaN NaN NaN NaN <NA> LMFA0403 NaN NaN NaN
779157 SLM:000782237 NaN an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-... NaN NaN SLM:000000413 | SLM:000782274 NaN NaN [C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(... NaN ... NaN NaN NaN NaN NaN 157662 NaN NaN NaN NaN

119 rows × 29 columns

Checking split characters (|) in Parent
No rows found

Checking split characters (|) in Components*
No rows found

Checking split characters (|) in SMILES (pH7.3)
No rows found

Checking split characters (|) in InChI (pH7.3)
No rows found

Checking split characters (|) in InChI key (pH7.3)
No rows found

Checking split characters (|) in Formula (pH7.3)
No rows found

Checking split characters (|) in Charge (pH7.3)
Not a string column

Checking split characters (|) in Mass (pH7.3)
Not a string column

Checking split characters (|) in Exact Mass (neutral form)
Not a string column

Checking split characters (|) in Exact m/z of [M.]+
Not a string column

Checking split characters (|) in Exact m/z of [M+H]+
Not a string column

Checking split characters (|) in Exact m/z of [M+K]+ 
Not a string column

Checking split characters (|) in Exact m/z of [M+Na]+
Not a string column

Checking split characters (|) in Exact m/z of [M+Li]+
Not a string column

Checking split characters (|) in Exact m/z of [M+NH4]+
Not a string column

Checking split characters (|) in Exact m/z of [M-H]-
Not a string column

Checking split characters (|) in Exact m/z of [M+Cl]-
Not a string column

Checking split characters (|) in Exact m/z of [M+OAc]- 
Not a string column

Checking split characters (|) in CHEBI
Found 3 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
465 SLM:000000784 Isomeric subspecies 1,2-di-(9Z-octadecenoyl)-sn-glycero-3-phosphate PA(18:1(9Z)/18:1(9Z)) Phosphatidate (18:1(9Z)/18:1(9Z)) SLM:000000329 SLM:000082169 SLM:000000418 (sn1 or sn2) CCCCCCCC\C=C/CCCCCCCC(=O)OC[C@H](COP([O-])([O-... InChI=1S/C39H73O8P/c1-3-5-7-9-11-13-15-17-19-2... ... 707.519775 718.538147 699.497009 735.473694 759.518188 74546|82922 LMGP10010962 HMDB07865 MNXM51075 11309392 | 14634025 | 14665624 | 15164764 | 15...
387185 SLM:000389154 NaN (14Z,17Z,20Z,23Z,26Z)-dotriacontapentaenoate NaN Fatty acid 32:5(14Z,17Z,20Z,23Z,26Z) SLM:000389801 NaN NaN CCCCC\C=C/C\C=C/C\C=C/C\C=C/C\C=C/CCCCCCCCCCCC... InChI=1S/C32H54O2/c1-2-3-4-5-6-7-8-9-10-11-12-... ... 477.427836 488.446207 469.405105 505.381782 529.426234 82731|82731 LMFA01030848 NaN NaN NaN
595221 SLM:000598072 NaN all-trans-retinol--[retinol-binding protein] NaN NaN SLM:000000982 NaN NaN [*][C@H](N-*)C(-*)=O InChI=none ... NaN NaN NaN NaN NaN 17336|83228 NaN NaN NaN 20628054 | 28758396

3 rows × 29 columns

Checking split characters (|) in LIPID MAPS
No rows found

Checking split characters (|) in HMDB
No rows found

Checking split characters (|) in MetaNetX
No rows found

Checking split characters (|) in PMID
Found 1318 rows with split characters
Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* SMILES (pH7.3) InChI (pH7.3) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
0 SLM:000000002 Class Ceramide (iso-d17:1(4E)) Cer(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine SLM:000399814 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O InChI=none ... NaN NaN NaN NaN NaN 70846 NaN NaN MNXM97012 | 11443131 | 14685263 | 18390550 | 21325339 | ...
3 SLM:000000007 Class Sphingomyelin (iso-d17:1(4E)) SM(iso-d17:1(4E)) N-acyl-15-methylhexadecasphing-4-enine-1-phosp... SLM:000001000 NaN NaN CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O... InChI=none ... NaN NaN NaN NaN NaN 70775 NaN NaN MNXM97113 14685263 | 21926990 | 9603947
4 SLM:000000035 Isomeric subspecies sphinganine NaN NaN SLM:000390097 NaN NaN CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... ... 308.313535 319.331905 300.290803 336.267481 360.311932 57817 LMSP01020001 HMDB00269 MNXM302 10652340 | 10702247 | 10751414 | 10802064 | 10...
5 SLM:000000042 Isomeric subspecies cholesta-5,7-dien-3beta-ol NaN NaN SLM:000501263 NaN NaN [H][C@@]1(CC[C@@]2([H])C3=CC=C4C[C@@H](O)CC[C@... InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2... ... 391.354671 402.373042 383.331940 419.308617 443.353069 17759 LMST01010069 HMDB00032 MNXM710 10329655 | 10344195 | 10786622 | 11230174 | 16...
6 SLM:000000043 Isomeric subspecies lathosterone NaN NaN SLM:000501263 NaN NaN [H][C@@]12CC=C3[C@]4([H])CC[C@]([H])([C@H](C)C... InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2... ... 391.354671 402.373042 383.331940 419.308617 443.353069 71550 NaN NaN MNXM97065 19531354 | 22505847
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
595221 SLM:000598072 NaN all-trans-retinol--[retinol-binding protein] NaN NaN SLM:000000982 NaN NaN [*][C@H](N-*)C(-*)=O InChI=none ... NaN NaN NaN NaN NaN 17336|83228 NaN NaN NaN 20628054 | 28758396
595222 SLM:000598073 NaN all-trans-retinyl heptanoate NaN NaN SLM:000000982 NaN NaN C1(C)(C)C(\C=C\C(=C\C=C\C(=C\COC(CCCCCC)=O)\C)... InChI=1S/C27H42O2/c1-7-8-9-10-16-26(28)29-21-1... ... NaN NaN NaN NaN NaN 138724 NaN NaN NaN 20628054 | 28758396
595223 SLM:000598074 NaN 2-heptanoyl-sn-glycero-3-phosphocholine NaN NaN SLM:000000724 NaN NaN P(OC[C@@H](CO)OC(=O)CCCCCC)(=O)(OCC[N+](C)(C)C... InChI=1S/C15H32NO7P/c1-5-6-7-8-9-15(18)23-14(1... ... NaN NaN NaN NaN NaN 138266 NaN NaN NaN 20628054 | 22605381 | 28758396
595230 SLM:000598083 NaN 12-hydroxy-(9Z)-octadecenoyl-CoA NaN NaN SLM:000389958 | SLM:000390051 NaN NaN S(C(CCCCCCC/C=C\C[C@@H](CCCCCC)O)=O)CCNC(CCNC(... InChI=1S/C39H68N7O18P3S/c1-4-5-6-13-16-27(47)1... ... NaN NaN NaN NaN NaN 139559 NaN NaN NaN 17084870 | 27758859
595245 SLM:000598101 NaN a mannosylinositol-1-phospho-N-(2-hydroxyacyl)... NaN NaN SLM:000000835 NaN NaN OC[C@H]1OC(O[C@@H]2[C@@H](O)[C@H](O)[C@@H](O)[... InChI=none ... NaN NaN NaN NaN NaN 74994 NaN NaN NaN 12954640 | 9368028

1318 rows × 29 columns

Okay wow! So these are all the columns we have found with split characters…

cols_with_split_chars
['Abbreviation*', 'Synonyms*', 'Lipid class*', 'CHEBI', 'PMID']
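As a rough idea of how such a check can be written in plain pandas, consider the sketch below. It is not the source of check_for_split_characters; columns_containing is a hypothetical helper, and the toy rows reuse two entries shown above plus an invented numeric column.

```python
import pandas as pd


def columns_containing(df: pd.DataFrame, delimiter: str) -> list[str]:
    """List text columns where at least one value contains `delimiter`."""
    hits = []
    for col in df.select_dtypes(include=["object", "string"]).columns:
        # regex=False treats '|' literally rather than as regex alternation;
        # na=False counts missing values as "does not contain".
        mask = df[col].str.contains(delimiter, regex=False, na=False)
        if mask.any():
            hits.append(col)
    return hits


toy = pd.DataFrame({
    "Lipid ID": ["SLM:000000392", "SLM:000000400"],
    "Lipid class*": ["SLM:000000834 | SLM:000399815", "SLM:000117141"],
    "Abbreviation*": ["IPC", "TAG | TG"],
    "toy_numeric": [1.0, 2.0],  # skipped: not a string column
})
found = columns_containing(toy, "|")
```

Passing regex=False matters here: as a regular expression, a bare | matches the empty string and would flag every non-missing value.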

We can also check for other delimiter characters if we know they will be present. For instance, SwissLipids uses the / character in Components*, but this character also appears in a number of other columns, such as the lipid names, SMILES and InChI, so we drop those columns before checking.

check_for_split_characters(df_swisslipids.drop(columns=['Name','Abbreviation*','Synonyms*','SMILES (pH7.3)','InChI (pH7.3)']), delimiter='/')
Checking split characters (/) in Lipid ID
No rows found

Checking split characters (/) in Level
No rows found

Checking split characters (/) in Lipid class*
No rows found

Checking split characters (/) in Parent
No rows found

Checking split characters (/) in Components*
Found 708725 rows with split characters
Lipid ID Level Lipid class* Parent Components* InChI key (pH7.3) Formula (pH7.3) Charge (pH7.3) Mass (pH7.3) Exact Mass (neutral form) ... Exact m/z of [M+Li]+ Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID
164 SLM:000000422 Isomeric subspecies SLM:000000329 SLM:000081844 SLM:000000418 (sn2) / SLM:000000510 (sn1) InChIKey=OPVZUEPSMJNLOM-QEJMHMKOSA-L C37H69O8P -2.0 672.913818 674.488647 ... 681.504089 692.522461 673.481384 709.458069 733.502502 64839 LMGP10010032 HMDB07859 MNXM66476 10359651 | 11788596 | 12963729 | 16620771 | 17...
229 SLM:000000498 Isomeric subspecies SLM:000000324 SLM:000105249 SLM:000000296 (sn2) / SLM:000000826 (sn1) InChIKey=KRTOMQDUKGRFDJ-ZAHDIIMDSA-M C47H82O13P -1.0 886.120483 886.557129 ... 893.572571 904.590942 885.549866 921.526550 945.570984 133606 LMGP06010010 HMDB09815 MNXM75683 22942276 | 23097495 | 23472195 | 8300559
269 SLM:000000557 Isomeric subspecies SLM:000000261 SLM:000088147 SLM:000000510 (sn1) / SLM:000000826 (sn2) InChIKey=PZNPLUBHRSSFHT-RRHRGVEJSA-N C42H84NO8P 0.0 762.091980 761.593445 ... 768.608887 779.627258 NaN 796.562866 820.607300 73000 LMGP01010573 HMDB07970 MNXM69304 18195019 | 19416660 | 22923616 | 27399000
332 SLM:000000636 Isomeric subspecies SLM:000000329 SLM:000082164 SLM:000000418 (sn1) / SLM:000000510 (sn2) InChIKey=ZSXHMDPHNCOWSV-QEJMHMKOSA-L C37H69O8P -2.0 672.913818 674.488647 ... 681.504089 692.522461 673.481384 709.458069 733.502502 74551 LMGP10010964 NaN MNXM66662 16620771 | 18606822 | 19318427 | 19801371 | 20...
333 SLM:000000637 Isomeric subspecies SLM:000000329 SLM:000082168 SLM:000000418 (sn1) / SLM:000000826 (sn2) InChIKey=XIERONXOJKEALF-PXYGFXEISA-L C39H73O8P -2.0 700.966980 702.519958 ... 709.535400 720.553772 701.512695 737.489380 761.533813 74552 LMGP10010963 NaN MNXM66667 16620771 | 18606822 | 19318427 | 19801371 | 21...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
745172 SLM:000748034 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001195 (sn2) InChIKey=LJSBNBPNSBKZCI-JNOBRDIFSA-L C33H57N3O15P2 -2.0 NaN 799.342142 ... 806.357598 817.375968 798.334866 834.311543 858.355995 <NA> NaN NaN NaN NaN
745173 SLM:000748035 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001196 (sn2) InChIKey=ODNYDZLXLRZPCJ-GPTQCAHZSA-L C35H61N3O15P2 -2.0 NaN 827.373442 ... 834.388898 845.407268 826.366166 862.342844 886.387295 <NA> NaN NaN NaN NaN
745174 SLM:000748036 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000000853 (sn2) InChIKey=FJIBTCUXUBRYKG-QOTCTSOZSA-L C37H65N3O15P2 -2.0 NaN 855.404743 ... 862.420198 873.438568 854.397466 890.374144 914.418595 <NA> NaN NaN NaN NaN
745175 SLM:000748037 Isomeric subspecies SLM:000000084 NaN SLM:000000048 (sn1) / SLM:000001197 (sn2) InChIKey=AIBKQADSQWEVSS-HUKRWTLJSA-L C42H75N3O15P2 -2.0 NaN 925.482993 ... 932.498448 943.516818 924.475716 960.452394 984.496846 <NA> NaN HMDB0116248 NaN NaN
745176 SLM:000748038 Isomeric subspecies SLM:000000084 NaN SLM:000000047 (sn2) / SLM:000000048 (sn1) InChIKey=PIZFKSVTEGNINS-BQUKFSKHSA-L C44H79N3O15P2 -2.0 NaN 953.514293 ... 960.529748 971.548118 952.507016 988.483694 1012.528146 <NA> NaN HMDB0116250 NaN NaN

708725 rows × 24 columns

Checking split characters (/) in InChI key (pH7.3)
No rows found

Checking split characters (/) in Formula (pH7.3)
No rows found

Checking split characters (/) in Charge (pH7.3)
Not a string column

Checking split characters (/) in Mass (pH7.3)
Not a string column

Checking split characters (/) in Exact Mass (neutral form)
Not a string column

Checking split characters (/) in Exact m/z of [M.]+
Not a string column

Checking split characters (/) in Exact m/z of [M+H]+
Not a string column

Checking split characters (/) in Exact m/z of [M+K]+ 
Not a string column

Checking split characters (/) in Exact m/z of [M+Na]+
Not a string column

Checking split characters (/) in Exact m/z of [M+Li]+
Not a string column

Checking split characters (/) in Exact m/z of [M+NH4]+
Not a string column

Checking split characters (/) in Exact m/z of [M-H]-
Not a string column

Checking split characters (/) in Exact m/z of [M+Cl]-
Not a string column

Checking split characters (/) in Exact m/z of [M+OAc]- 
Not a string column

Checking split characters (/) in CHEBI
No rows found

Checking split characters (/) in LIPID MAPS
No rows found

Checking split characters (/) in HMDB
No rows found

Checking split characters (/) in MetaNetX
No rows found

Checking split characters (/) in PMID
No rows found
['Components*']

These multiple class entries must be taken into account for our class hierarchy, because if we don't, many of these Class level entries will become disconnected in the ontology.

To handle this, we will split such entries into separate rows using the split_and_expand_large utility function, but we will come back to this a bit later…
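To make the idea of splitting a multi-valued class entry concrete, here is a small sketch using only pandas str.split and explode. It illustrates the general technique and is not the implementation of split_and_expand_large, which may behave differently; the toy rows are taken from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Lipid ID": ["SLM:000000392", "SLM:000000400"],
    "Lipid class*": ["SLM:000000834 | SLM:000399815", "SLM:000117141"],
})

# Split on the literal '|' (a single-character pattern is treated as a
# literal string), then give each list element its own row.
exploded = (
    toy.assign(**{"Lipid class*": toy["Lipid class*"].str.split("|")})
       .explode("Lipid class*", ignore_index=True)
)

# Remove the spaces that surrounded the delimiter.
exploded["Lipid class*"] = exploded["Lipid class*"].str.strip()
```

After this step, SLM:000000392 appears on two rows, one per parent class, so both class links can become separate edges.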

For now, we will also add another column for the components, so that later we can have both the original component with its location (e.g. sn) and a parsed version that holds just the SwissLipids ID.

df_swisslipids['Components_parsed'] = df_swisslipids['Components*']
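At this point the new column is only a copy. As a hedged illustration of what parsing it could eventually involve, the sketch below pulls the bare SLM identifiers out of a component string with a regular expression. The helper name and the pattern are assumptions based on the values shown above, not LipiNet code.

```python
import re

# Assumption: component IDs always look like 'SLM:' followed by digits.
SLM_ID = re.compile(r"SLM:\d+")


def extract_component_ids(components: str) -> list[str]:
    """Return the SLM IDs in a Components* string, dropping positions such as (sn1)."""
    return SLM_ID.findall(components)


example = "SLM:000000418 (sn2) / SLM:000000510 (sn1)"
ids = extract_component_ids(example)
```

For this example string, ids holds the two identifiers without their sn positions, while the untouched Components* column keeps the positional information.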

Now we can melt the dataframe to start creating the edges df.

Building the edges df#

# # Split the 'Lipid class*' column into multiple rows
# df_swisslipids_splitexp = split_and_expand_large(
#     df_swisslipids, #.assign(from_layer_col='swisslipids')
#     split_col='Lipid class*', 
#     expand_cols=['Lipid ID', 'Level', 'Name', 'Abbreviation*',
#                     'CHEBI', 'LIPID MAPS', 'HMDB', 'MetaNetX', 'PMID','Synonyms*','Parent','Components*','Components_parsed'], #'from_layer_col'
#     delimiter='|'
# )
df_swisslipids_edges = pd.melt(df_swisslipids,  #df_swisslipids_splitexp
                id_vars=['Lipid ID'], 
                value_vars=['CHEBI','LIPID MAPS','HMDB','MetaNetX','PMID','Lipid class*','Abbreviation*','Synonyms*','Parent','Components*','Components_parsed'], 
                var_name='melted_column', value_name='value')
df_swisslipids_edges
Lipid ID melted_column value
0 SLM:000000002 CHEBI 70846
1 SLM:000000003 CHEBI 70771
2 SLM:000000006 CHEBI 70829
3 SLM:000000007 CHEBI 70775
4 SLM:000000035 CHEBI 57817
... ... ... ...
8571734 SLM:000782324 Components_parsed NaN
8571735 SLM:000782325 Components_parsed NaN
8571736 SLM:000782326 Components_parsed NaN
8571737 SLM:000782327 Components_parsed NaN
8571738 SLM:000782328 Components_parsed NaN

8571739 rows × 3 columns
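To make the reshaping concrete, here is a minimal sketch of the same melt on a made-up two-row table. The identifiers and values are placeholders, not real SwissLipids rows.

```python
import pandas as pd

# Toy stand-in for df_swisslipids: two lipids, two cross-reference columns
toy = pd.DataFrame({
    "Lipid ID": ["SLM:1", "SLM:2"],
    "CHEBI": ["c1", None],
    "HMDB": [None, "h2"],
})

# Wide -> long: one row per (lipid, column) pair, including the empty cells
toy_edges = pd.melt(
    toy,
    id_vars=["Lipid ID"],
    value_vars=["CHEBI", "HMDB"],
    var_name="melted_column",
    value_name="value",
)
print(toy_edges)
```

Every lipid gets one row per melted column whether or not the cell was filled, which is why the melted df contains so many missing values.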

df_swisslipids_edges['value'].value_counts()
value
SLM:000000353                                                 132652
SLM:000000377                                                  98788
SLM:000000102                                                  80209
SLM:000117148                                                  46820
SLM:000000400                                                  38514
                                                               ...  
TG(30:0/26:0/22:0)                                                 1
TG(30:0/24:0/24:0)                                                 1
TG(30:0/22:0/26:0)                                                 1
TG(30:0/20:0/28:0)                                                 1
NAPE (15:0-13me/34:5(16Z,19Z,22Z,25Z,28Z)/18:3(6Z,9Z,12Z))         1
Name: count, Length: 2342278, dtype: int64

Because we have so many missing values, we should mark them explicitly as nulls (pd.NA) rather than leaving any of them as 'nan' strings.

df_swisslipids_edges = df_swisslipids_edges.replace(['nan'], pd.NA).copy() # added here to drop 'nan' strings, could also use .dropna(). directly instead of next step
df_swisslipids_edges
Lipid ID melted_column value
0 SLM:000000002 CHEBI 70846
1 SLM:000000003 CHEBI 70771
2 SLM:000000006 CHEBI 70829
3 SLM:000000007 CHEBI 70775
4 SLM:000000035 CHEBI 57817
... ... ... ...
8571734 SLM:000782324 Components_parsed NaN
8571735 SLM:000782325 Components_parsed NaN
8571736 SLM:000782326 Components_parsed NaN
8571737 SLM:000782327 Components_parsed NaN
8571738 SLM:000782328 Components_parsed NaN

8571739 rows × 3 columns

The melt operation also produced a large number of null values, which carry no information for us here, so we will drop the rows where the value is null.

df_swisslipids_edges = df_swisslipids_edges.dropna(subset='value')
df_swisslipids_edges
Lipid ID melted_column value
0 SLM:000000002 CHEBI 70846
1 SLM:000000003 CHEBI 70771
2 SLM:000000006 CHEBI 70829
3 SLM:000000007 CHEBI 70775
4 SLM:000000035 CHEBI 57817
... ... ... ...
8571494 SLM:000781997 Components_parsed SLM:000000856 (n-acyl)
8571495 SLM:000781998 Components_parsed SLM:000389154 (n-acyl)
8571496 SLM:000781999 Components_parsed SLM:000485643 (n-acyl)
8571497 SLM:000782000 Components_parsed SLM:000485644 (n-acyl)
8571498 SLM:000782001 Components_parsed SLM:000485645 (n-acyl)

4678499 rows × 3 columns
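The difference between a 'nan' string and a real null is easy to miss, so here is a small sketch on made-up values showing why the replace step comes before dropna.

```python
import pandas as pd

toy = pd.DataFrame({
    "Lipid ID": ["SLM:1", "SLM:2", "SLM:3"],
    "value": ["c1", "nan", None],
})

# The literal string 'nan' is not counted as missing until it is replaced
n_missing_before = toy["value"].isna().sum()
toy_clean = toy.replace("nan", pd.NA)
n_missing_after = toy_clean["value"].isna().sum()

# Only now does dropna remove both kinds of empty row
toy_clean = toy_clean.dropna(subset=["value"])
print(n_missing_before, n_missing_after, len(toy_clean))
```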

There are still some things we need to tidy up so that it is in a suitable format for OnionNet

df_swisslipids_edges = df_swisslipids_edges.copy()
df_swisslipids_edges['source_layer'] = 'swisslipids'
df_swisslipids_edges.rename(columns={'Lipid ID':'source_id', 'melted_column':'target_layer', 'value':'target_id'}, inplace=True)
df_swisslipids_edges = df_swisslipids_edges[['source_layer','source_id','target_layer','target_id']]
df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: 'swisslipids' if x=='Lipid class*' else f"sl_{str(x).replace(' ','').strip('*').lower()}")
#df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: )
df_swisslipids_edges
source_layer source_id target_layer target_id
0 swisslipids SLM:000000002 sl_chebi 70846
1 swisslipids SLM:000000003 sl_chebi 70771
2 swisslipids SLM:000000006 sl_chebi 70829
3 swisslipids SLM:000000007 sl_chebi 70775
4 swisslipids SLM:000000035 sl_chebi 57817
... ... ... ... ...
8571494 swisslipids SLM:000781997 sl_components_parsed SLM:000000856 (n-acyl)
8571495 swisslipids SLM:000781998 sl_components_parsed SLM:000389154 (n-acyl)
8571496 swisslipids SLM:000781999 sl_components_parsed SLM:000485643 (n-acyl)
8571497 swisslipids SLM:000782000 sl_components_parsed SLM:000485644 (n-acyl)
8571498 swisslipids SLM:000782001 sl_components_parsed SLM:000485645 (n-acyl)

4678499 rows × 4 columns
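The lambda that builds the layer names is fairly dense, so the same rule can be written as a small named function. The name to_layer_name is only illustrative and is not part of LipiNet.

```python
def to_layer_name(column: str) -> str:
    """Map a SwissLipids column name to a layer label, using the same rule as the lambda."""
    if column == "Lipid class*":
        # The class column points at other SwissLipids entries, so it stays in the same layer
        return "swisslipids"
    # Otherwise: remove spaces, strip the trailing '*', lowercase, and prefix with 'sl_'
    return f"sl_{column.replace(' ', '').strip('*').lower()}"

examples = {c: to_layer_name(c) for c in ["CHEBI", "LIPID MAPS", "Abbreviation*", "Lipid class*"]}
print(examples)
```

It could then be applied with df_swisslipids_edges['target_layer'].map(to_layer_name).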

For rows that go from swisslipids to swisslipids, we want to swap the source and target columns. Currently the target in these rows is the parent class, but it is better to have the parent point towards its children, so that the root node is the one with multiple outgoing edges and no incoming edges.

Be sure to run this only once: running the cell a second time swaps the columns back again.

# Identify rows where both source_layer and target_layer are 'swisslipids'
condition = (df_swisslipids_edges["source_layer"] == "swisslipids") & (df_swisslipids_edges["target_layer"] == "swisslipids")

# Swap the columns for rows satisfying the condition
df_swisslipids_edges.loc[condition, ["source_layer", "source_id", "target_layer", "target_id"]] = df_swisslipids_edges.loc[condition, ["target_layer", "target_id", "source_layer", "source_id"]].values

# Output the modified DataFrame
df_swisslipids_edges
source_layer source_id target_layer target_id
0 swisslipids SLM:000000002 sl_chebi 70846
1 swisslipids SLM:000000003 sl_chebi 70771
2 swisslipids SLM:000000006 sl_chebi 70829
3 swisslipids SLM:000000007 sl_chebi 70775
4 swisslipids SLM:000000035 sl_chebi 57817
... ... ... ... ...
8571494 swisslipids SLM:000781997 sl_components_parsed SLM:000000856 (n-acyl)
8571495 swisslipids SLM:000781998 sl_components_parsed SLM:000389154 (n-acyl)
8571496 swisslipids SLM:000781999 sl_components_parsed SLM:000485643 (n-acyl)
8571497 swisslipids SLM:000782000 sl_components_parsed SLM:000485644 (n-acyl)
8571498 swisslipids SLM:000782001 sl_components_parsed SLM:000485645 (n-acyl)

4678499 rows × 4 columns
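One way to make the swap safer to re-run is to wrap it in a function that returns a new df and leaves its input untouched. This is only a sketch on toy data; orient_parent_to_child is a hypothetical helper, not a LipiNet function.

```python
import pandas as pd

def orient_parent_to_child(edges: pd.DataFrame, layer: str = "swisslipids") -> pd.DataFrame:
    """Return a copy in which intralayer edges of `layer` have source and target swapped."""
    out = edges.copy()
    same = (out["source_layer"] == layer) & (out["target_layer"] == layer)
    cols = ["source_layer", "source_id", "target_layer", "target_id"]
    swapped = ["target_layer", "target_id", "source_layer", "source_id"]
    out.loc[same, cols] = out.loc[same, swapped].values
    return out

toy = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["child", "child"],
    "target_layer": ["swisslipids", "sl_chebi"],
    "target_id": ["parent", "123"],
})
oriented = orient_parent_to_child(toy)
print(oriented)
```

Because the result is stored under a new name, re-running the cell always starts from the unswapped input. Applying the function to its own output would still undo the swap, just like the in-place version.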

df_swisslipids_edges['target_layer'].value_counts()
target_layer
swisslipids             779247
sl_abbreviation         776464
sl_components           765323
sl_components_parsed    765323
sl_synonyms             548163
sl_metanetx             505003
sl_parent               493491
sl_hmdb                  26026
sl_lipidmaps             12117
sl_chebi                  4276
sl_pmid                   3066
Name: count, dtype: int64

Now let’s return to two items on our todo list:

  1. splitting values that have multi-identifiers

  2. trimming/parsing the components col

edges_with_multilinks = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)]
edges_with_multilinks
source_layer source_id target_layer target_id
465 swisslipids SLM:000000784 sl_chebi 74546|82922
387185 swisslipids SLM:000389154 sl_chebi 82731|82731
595221 swisslipids SLM:000598072 sl_chebi 17336|83228
3116996 swisslipids SLM:000000002 sl_pmid | 11443131 | 14685263 | 18390550 | 21325339 | ...
3116999 swisslipids SLM:000000007 sl_pmid 14685263 | 21926990 | 9603947
... ... ... ... ...
6199835 swisslipids SLM:000747954 sl_synonyms 1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c...
6199836 swisslipids SLM:000747955 sl_synonyms 1-(13-methyltetradecanoyl)-2-(15-methylhexadec...
6199918 swisslipids SLM:000748037 sl_synonyms 1-(15-methylhexadecanoyl)-2-(11-methyldodecano...
6199919 swisslipids SLM:000748038 sl_synonyms 1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
6199920 swisslipids SLM:000748039 sl_synonyms 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...

30942 rows × 4 columns

edges_with_multilinks.value_counts('target_layer')
target_layer
sl_synonyms        19853
sl_abbreviation     9768
sl_pmid             1318
sl_chebi               3
Name: count, dtype: int64
edges_with_multilinks_split = split_and_expand_large(edges_with_multilinks, 
                       split_col='target_id', 
                       expand_cols=['source_layer','source_id','target_layer'],
                       delimiter='|').drop_duplicates()
edges_with_multilinks_split
source_layer source_id target_layer target_id
0 swisslipids SLM:000000784 sl_chebi 74546
1 swisslipids SLM:000000784 sl_chebi 82922
2 swisslipids SLM:000389154 sl_chebi 82731
4 swisslipids SLM:000598072 sl_chebi 17336
5 swisslipids SLM:000598072 sl_chebi 83228
... ... ... ... ...
68383 swisslipids SLM:000748037 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(11Z))
68384 swisslipids SLM:000748038 sl_synonyms 1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
68385 swisslipids SLM:000748038 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(9Z))
68386 swisslipids SLM:000748039 sl_synonyms 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...
68387 swisslipids SLM:000748039 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:2(9Z,12Z))

68379 rows × 4 columns
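The internals of split_and_expand_large are not shown in this notebook. As a rough idea of what such a step does, the same kind of result can be sketched with plain pandas using str.split followed by explode. This is an assumption about the behaviour, not the LipiNet implementation, and the PMID-like values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "source_id": ["SLM:1", "SLM:2"],
    "target_layer": ["sl_chebi", "sl_pmid"],
    "target_id": ["74546|82922", " | 111 | 222"],
})

# Split on '|' into lists (a single-character pattern is treated literally),
# then give every list element its own row
toy_split = (
    toy.assign(target_id=toy["target_id"].str.split("|"))
       .explode("target_id", ignore_index=True)
)
print(toy_split)
```

Note that the pieces keep their surrounding spaces, and a leading delimiter leaves a whitespace-only piece behind, so a clean-up step is still needed afterwards.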

Now we also want to clean up the results: strip any leading and trailing spaces that were left around the split characters, and turn empty strings and similar placeholders into NaNs.

edges_with_multilinks_split = clean_missing_strings(edges_with_multilinks_split)
edges_with_multilinks_split
source_layer source_id target_layer target_id
0 swisslipids SLM:000000784 sl_chebi 74546
1 swisslipids SLM:000000784 sl_chebi 82922
2 swisslipids SLM:000389154 sl_chebi 82731
4 swisslipids SLM:000598072 sl_chebi 17336
5 swisslipids SLM:000598072 sl_chebi 83228
... ... ... ... ...
68383 swisslipids SLM:000748037 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(11Z))
68384 swisslipids SLM:000748038 sl_synonyms 1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
68385 swisslipids SLM:000748038 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(9Z))
68386 swisslipids SLM:000748039 sl_synonyms 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...
68387 swisslipids SLM:000748039 sl_synonyms CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:2(9Z,12Z))

68379 rows × 4 columns
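The clean_missing_strings helper is used here without its source. A minimal stand-in for this kind of clean-up might look like the sketch below, where clean_strings_sketch is a hypothetical name and the exact rules of the real helper may differ.

```python
import pandas as pd

def clean_strings_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: trim whitespace in string cells, then turn empty strings into NA."""
    out = df.copy()
    for col in out.columns:
        # Leave non-string cells (numbers, existing nulls) untouched
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out.replace("", pd.NA)

toy = pd.DataFrame({"target_id": [" ", " 111 ", "222"]})
cleaned = clean_strings_sketch(toy)
print(cleaned)
```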

edges_with_multilinks_split['target_id'].value_counts(dropna=False)
target_id
18390550                                                                      87
23670529                                                                      86
20431113                                                                      77
19603071                                                                      70
24068966                                                                      53
                                                                              ..
Phosphatidylcholine (O-18:1(11Z)/16:2(9Z,12Z))                                 1
1-(11Z-octadecenyl)-2-(9Z,12Z-octadecadienoyl)-sn-glycero-3-phosphocholine     1
Phosphatidylcholine (O-18:1(11Z)/18:2(9Z,12Z))                                 1
1-(11Z-octadecenyl)-2-(9Z-hexadecenoyl)-sn-glycero-3-phosphocholine            1
1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cytidine-5'-diphosphate            1
Name: count, Length: 59378, dtype: int64
edges_with_multilinks_split[edges_with_multilinks_split['target_id'].isna()]
source_layer source_id target_layer target_id
6 swisslipids SLM:000000002 sl_pmid <NA>
278 swisslipids SLM:000000272 sl_pmid <NA>
4533 swisslipids SLM:000001020 sl_pmid <NA>
4546 swisslipids SLM:000001022 sl_pmid <NA>
4550 swisslipids SLM:000001023 sl_pmid <NA>
4553 swisslipids SLM:000001024 sl_pmid <NA>
4586 swisslipids SLM:000001032 sl_pmid <NA>
4646 swisslipids SLM:000001036 sl_pmid <NA>

Note that there are only 8 instances where the target_id is missing. It is probably fine to handle these downstream.

# edges_with_multilinks_split = edges_with_multilinks_split[~edges_with_multilinks_split['target_id'].isna()].copy()
edges_with_multilinks_split.shape
(68379, 4)

What about source_id? It looks like none of the source_ids are missing.

edges_with_multilinks_split[edges_with_multilinks_split['source_id'].isna()].shape
(0, 4)

This is good, but we also need to handle the / separators in the components columns.

edges_with_multilinks2 = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
                     df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)]
edges_with_multilinks2
source_layer source_id target_layer target_id
7013405 swisslipids SLM:000000422 sl_components SLM:000000418 (sn2) / SLM:000000510 (sn1)
7013470 swisslipids SLM:000000498 sl_components SLM:000000296 (sn2) / SLM:000000826 (sn1)
7013510 swisslipids SLM:000000557 sl_components SLM:000000510 (sn1) / SLM:000000826 (sn2)
7013573 swisslipids SLM:000000636 sl_components SLM:000000418 (sn1) / SLM:000000510 (sn2)
7013574 swisslipids SLM:000000637 sl_components SLM:000000418 (sn1) / SLM:000000826 (sn2)
... ... ... ... ...
8537662 swisslipids SLM:000748034 sl_components_parsed SLM:000000048 (sn1) / SLM:000001195 (sn2)
8537663 swisslipids SLM:000748035 sl_components_parsed SLM:000000048 (sn1) / SLM:000001196 (sn2)
8537664 swisslipids SLM:000748036 sl_components_parsed SLM:000000048 (sn1) / SLM:000000853 (sn2)
8537665 swisslipids SLM:000748037 sl_components_parsed SLM:000000048 (sn1) / SLM:000001197 (sn2)
8537666 swisslipids SLM:000748038 sl_components_parsed SLM:000000047 (sn2) / SLM:000000048 (sn1)

1417450 rows × 4 columns

edges_with_multilinks2_split = split_and_expand_large(edges_with_multilinks2, 
                       split_col='target_id', 
                       expand_cols=['source_layer','source_id','target_layer'],
                       delimiter='/').drop_duplicates()
edges_with_multilinks2_split
source_layer source_id target_layer target_id
0 swisslipids SLM:000000422 sl_components SLM:000000418 (sn2)
1 swisslipids SLM:000000422 sl_components SLM:000000510 (sn1)
2 swisslipids SLM:000000498 sl_components SLM:000000296 (sn2)
3 swisslipids SLM:000000498 sl_components SLM:000000826 (sn1)
4 swisslipids SLM:000000557 sl_components SLM:000000510 (sn1)
... ... ... ... ...
3592487 swisslipids SLM:000748036 sl_components_parsed SLM:000000853 (sn2)
3592488 swisslipids SLM:000748037 sl_components_parsed SLM:000000048 (sn1)
3592489 swisslipids SLM:000748037 sl_components_parsed SLM:000001197 (sn2)
3592490 swisslipids SLM:000748038 sl_components_parsed SLM:000000047 (sn2)
3592491 swisslipids SLM:000748038 sl_components_parsed SLM:000000048 (sn1)

3592492 rows × 4 columns

Now let’s also clean this up in case we have whitespace or empty strings etc.

edges_with_multilinks2_split = clean_missing_strings(edges_with_multilinks2_split)

Now let’s also strip the bracketed position labels, e.g. (sn1), from the parsed components so that these can be linked directly to the other SLM entries if needed.

# Apply transformation only for rows where target_layer equals 'sl_components_parsed'
mask = edges_with_multilinks2_split['target_layer'] == 'sl_components_parsed'
edges_with_multilinks2_split.loc[mask, 'target_id'] = edges_with_multilinks2_split.loc[mask, 'target_id'].str.split('(').str[0].str.strip()
edges_with_multilinks2_split
source_layer source_id target_layer target_id
0 swisslipids SLM:000000422 sl_components SLM:000000418 (sn2)
1 swisslipids SLM:000000422 sl_components SLM:000000510 (sn1)
2 swisslipids SLM:000000498 sl_components SLM:000000296 (sn2)
3 swisslipids SLM:000000498 sl_components SLM:000000826 (sn1)
4 swisslipids SLM:000000557 sl_components SLM:000000510 (sn1)
... ... ... ... ...
3592487 swisslipids SLM:000748036 sl_components_parsed SLM:000000853
3592488 swisslipids SLM:000748037 sl_components_parsed SLM:000000048
3592489 swisslipids SLM:000748037 sl_components_parsed SLM:000001197
3592490 swisslipids SLM:000748038 sl_components_parsed SLM:000000047
3592491 swisslipids SLM:000748038 sl_components_parsed SLM:000000048

3592492 rows × 4 columns
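As a small illustration of the bracket handling, the sketch below applies the same split to three component strings taken from the tables in this notebook. It also shows an alternative using str.extract that keeps the position label in its own column. The column names slm_id and position are made up for this example.

```python
import pandas as pd

components = pd.Series([
    "SLM:000000418 (sn2)",
    "SLM:000000510 (sn1)",
    "SLM:000000856 (n-acyl)",
])

# Keep everything before the first '(' and trim the trailing space
ids_only = components.str.split("(").str[0].str.strip()

# Alternative: pull the identifier and the position label into separate columns
parts = components.str.extract(r"^(?P<slm_id>SLM:\d+)\s*(?:\((?P<position>[^)]*)\))?$")
print(parts)
```

Keeping the position in a separate column means the location information is not lost when the identifier is used to link back to other SLM entries.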

Now we need to remove the original rows that had multilinks and add back the corrected ones.

# Identify rows with multilinks (either '|' or '/' with the specific target_layer condition)
mask_pipe = df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)
mask_slash = (
    df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
    df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)
)
mask_problem = mask_pipe | mask_slash

# Remove these rows from the original df
df_clean = df_swisslipids_edges[~mask_problem].copy()

# Now, combine the cleaned df with the corrected edges dataframes.
# These corrected dataframes are assumed to be: 
#   - edges_with_multilinks_split
#   - edges_with_multilinks2_split
df_swisslipids_edges = pd.concat([df_clean, edges_with_multilinks_split, edges_with_multilinks2_split], ignore_index=True)

# Clean up empty strings again or leading/trailing spaces
df_swisslipids_edges = clean_missing_strings(df_swisslipids_edges)

# (Optional) Drop any duplicate rows that might arise
df_swisslipids_edges = df_swisslipids_edges.drop_duplicates()

# df_swisslipids_edges now contains the original "good" rows plus the corrected edges.
df_swisslipids_edges
source_layer source_id target_layer target_id
0 swisslipids SLM:000000002 sl_chebi 70846
1 swisslipids SLM:000000003 sl_chebi 70771
2 swisslipids SLM:000000006 sl_chebi 70829
3 swisslipids SLM:000000007 sl_chebi 70775
4 swisslipids SLM:000000035 sl_chebi 57817
... ... ... ... ...
6890973 swisslipids SLM:000748036 sl_components_parsed SLM:000000853
6890974 swisslipids SLM:000748037 sl_components_parsed SLM:000000048
6890975 swisslipids SLM:000748037 sl_components_parsed SLM:000001197
6890976 swisslipids SLM:000748038 sl_components_parsed SLM:000000047
6890977 swisslipids SLM:000748038 sl_components_parsed SLM:000000048

6890966 rows × 4 columns

Now we will determine whether the edge is within the same layer (intralayer) or between different layers (interlayer)

def assess_edge_layertype(df):
    interlayer = df['source_layer']!=df['target_layer']
    df['interlayer'] = interlayer
    return df 

df_swisslipids_edges = assess_edge_layertype(df_swisslipids_edges)
df_swisslipids_edges
source_layer source_id target_layer target_id interlayer
0 swisslipids SLM:000000002 sl_chebi 70846 True
1 swisslipids SLM:000000003 sl_chebi 70771 True
2 swisslipids SLM:000000006 sl_chebi 70829 True
3 swisslipids SLM:000000007 sl_chebi 70775 True
4 swisslipids SLM:000000035 sl_chebi 57817 True
... ... ... ... ... ...
6890973 swisslipids SLM:000748036 sl_components_parsed SLM:000000853 True
6890974 swisslipids SLM:000748037 sl_components_parsed SLM:000000048 True
6890975 swisslipids SLM:000748037 sl_components_parsed SLM:000001197 True
6890976 swisslipids SLM:000748038 sl_components_parsed SLM:000000047 True
6890977 swisslipids SLM:000748038 sl_components_parsed SLM:000000048 True

6890966 rows × 5 columns
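Note that assess_edge_layertype adds the column to the df it is given. If you would rather leave the input unchanged, a non-mutating variant can be sketched with assign. The name with_interlayer_flag is only illustrative.

```python
import pandas as pd

def with_interlayer_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a boolean 'interlayer' column; the input df is not modified."""
    return df.assign(interlayer=df["source_layer"] != df["target_layer"])

toy = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "target_layer": ["sl_chebi", "swisslipids"],
})
flagged = with_interlayer_flag(toy)
print(flagged)
```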

Now we will build the node df

Building the node df#

df_swisslipids_nodes = create_nodedf_from_edgedf(edge_df=df_swisslipids_edges, props=['layer', 'id'], cols=['layer', 'node_id'])
df_swisslipids_nodes
layer node_id
0 swisslipids SLM:000000002
1 swisslipids SLM:000000003
2 swisslipids SLM:000000006
3 swisslipids SLM:000000007
4 swisslipids SLM:000000035
... ... ...
13781927 sl_components_parsed SLM:000000853
13781928 sl_components_parsed SLM:000000048
13781929 sl_components_parsed SLM:000001197
13781930 sl_components_parsed SLM:000000047
13781931 sl_components_parsed SLM:000000048

13781932 rows × 2 columns
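The node df has exactly twice as many rows as the edges df (13781932 versus 6890966), which is consistent with every edge contributing its two endpoints. The source of create_nodedf_from_edgedf is not shown here, so the sketch below is only a guess at the idea, with a hypothetical function name and toy data.

```python
import pandas as pd

def nodes_from_edges_sketch(edge_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: stack source and target endpoints into (layer, node_id) rows."""
    sources = edge_df[["source_layer", "source_id"]].rename(
        columns={"source_layer": "layer", "source_id": "node_id"})
    targets = edge_df[["target_layer", "target_id"]].rename(
        columns={"target_layer": "layer", "target_id": "node_id"})
    return pd.concat([sources, targets], ignore_index=True)

toy_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:1", "SLM:1"],
    "target_layer": ["sl_chebi", "sl_hmdb"],
    "target_id": ["c1", "h1"],
})
toy_nodes = nodes_from_edges_sketch(toy_edges)
unique_nodes = toy_nodes.drop_duplicates()
print(len(toy_nodes), len(unique_nodes))
```

A node that takes part in many edges appears once per edge, so duplicates are expected until they are dropped.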

Let’s also see how many are duplicates

df_swisslipids_nodes.value_counts(dropna=False)
layer        node_id      
swisslipids  SLM:000000353    132660
             SLM:000000377     98800
             SLM:000000102     80218
             SLM:000117148     46826
             SLM:000000400     38525
                               ...  
sl_metanetx  MNXM312433            1
             MNXM312434            1
             MNXM312435            1
             MNXM312436            1
swisslipids  SLM:000782332         1
Name: count, Length: 2779078, dtype: int64
# Pre-emptively dropping duplicates before the merge
df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes.shape
(2779078, 2)

Now let’s merge the nodes with the information from earlier to create richer node attributes

df_swisslipids_nodes = pd.merge(df_swisslipids_nodes, df_swisslipids.assign(from_layer_col='swisslipids'),
                                left_on=['layer','node_id'], right_on=['from_layer_col','Lipid ID'],
                                how='outer')
df_swisslipids_nodes
layer node_id Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* ... Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID Components_parsed from_layer_col
0 sl_abbreviation (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 sl_abbreviation (10,11S,12R)-TriHETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 sl_abbreviation (10R)-H-(11S,12S)-Ep-(5Z,8Z,14Z)-ETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 sl_abbreviation (10R)-H-(11S,12S)-EpETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 sl_abbreviation (10R)-H-(8S,9S)-Ep-(5Z,11Z,14Z)-ETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2779073 swisslipids SLM:000782328 SLM:000782328 NaN oxidized 2-acylglycerol NaN NaN SLM:000000355 NaN NaN ... NaN NaN NaN 167117 NaN NaN NaN NaN NaN swisslipids
2779074 swisslipids SLM:000782329 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2779075 swisslipids SLM:000782330 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2779076 swisslipids SLM:000782331 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2779077 swisslipids SLM:000782332 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2779078 rows × 33 columns

If the merge introduced any further duplicates, let’s remove them, along with from_layer_col, which was only needed as a join key against the initial df and carries no information of its own (this step could probably be tidied up).

df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes = df_swisslipids_nodes.drop(columns='from_layer_col')
df_swisslipids_nodes
layer node_id Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* ... Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID Components_parsed
0 sl_abbreviation (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 sl_abbreviation (10,11S,12R)-TriHETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 sl_abbreviation (10R)-H-(11S,12S)-Ep-(5Z,8Z,14Z)-ETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 sl_abbreviation (10R)-H-(11S,12S)-EpETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 sl_abbreviation (10R)-H-(8S,9S)-Ep-(5Z,11Z,14Z)-ETrE NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2779073 swisslipids SLM:000782328 SLM:000782328 NaN oxidized 2-acylglycerol NaN NaN SLM:000000355 NaN NaN ... NaN NaN NaN NaN 167117 NaN NaN NaN NaN NaN
2779074 swisslipids SLM:000782329 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2779075 swisslipids SLM:000782330 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2779076 swisslipids SLM:000782331 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2779077 swisslipids SLM:000782332 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2779078 rows × 32 columns
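To see what the helper column from_layer_col does in the merge, here is a toy sketch in which the same identifier string appears in two layers. It uses how='left' rather than 'outer' to keep the example short, and all values are made up.

```python
import pandas as pd

nodes = pd.DataFrame({
    "layer": ["swisslipids", "sl_chebi"],
    "node_id": ["SLM:1", "SLM:1"],  # same id string, different layers
})
attrs = pd.DataFrame({"Lipid ID": ["SLM:1"], "Name": ["toy lipid"]})

# The constant helper column restricts the match to nodes in the 'swisslipids' layer
enriched = pd.merge(
    nodes,
    attrs.assign(from_layer_col="swisslipids"),
    left_on=["layer", "node_id"],
    right_on=["from_layer_col", "Lipid ID"],
    how="left",
).drop(columns="from_layer_col")
print(enriched)
```

Only the swisslipids row picks up the Name attribute. Nodes in the other layers keep NaN in the attribute columns, as in the merged table in this notebook.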

df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()]
layer node_id Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* ... Exact m/z of [M+NH4]+ Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID Components_parsed
1464984 sl_pmid <NA> NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 32 columns

print(df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()].shape)
print(df_swisslipids_nodes[df_swisslipids_nodes['layer'].isna()].shape)
(1, 32)
(0, 32)

Even though OnionNet can handle NaN values and remove them downstream, it is safer to drop the cases where either the node_id or the layer is missing now. They serve us no purpose anyway!

df_swisslipids_nodes = df_swisslipids_nodes.dropna(subset=['layer','node_id'])
print(df_swisslipids_nodes.shape)

print(df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()].shape)
print(df_swisslipids_nodes[df_swisslipids_nodes['layer'].isna()].shape)
(2779077, 32)
(0, 32)
(0, 32)
df_swisslipids_nodes['layer'].value_counts()
layer
swisslipids             779312
sl_abbreviation         736949
sl_synonyms             534781
sl_metanetx             504880
sl_parent               184620
sl_hmdb                  17232
sl_lipidmaps             12112
sl_chebi                  4277
sl_components             1708
sl_components_parsed      1677
sl_pmid                   1529
Name: count, dtype: int64
df_swisslipids_nodes['Level'].value_counts()
Level
Isomeric subspecies      592413
Structural subspecies    111867
Molecular subspecies      62516
Species                   10347
Class                       806
Category                      7
Name: count, dtype: int64

Now we have the nodes and edges dfs for SwissLipids and understand how we arrived at them. In practice you don’t have to go through this process every time: LipiNet’s parse_swisslipids_data() function does exactly this if you are interested in the same network setup.

Ensuring equivalency#

We can also check that the output of the automatic parse_swisslipids_data() function and our manually processed data are equivalent.

We start by checking this for a single entry of the dataframe.

df_swisslipids_nodes.iloc[0]
layer                                           sl_abbreviation
node_id                      (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE
Lipid ID                                                    NaN
Level                                                       NaN
Name                                                        NaN
Abbreviation*                                               NaN
Synonyms*                                                   NaN
Lipid class*                                                NaN
Parent                                                      NaN
Components*                                                 NaN
SMILES (pH7.3)                                              NaN
InChI (pH7.3)                                               NaN
InChI key (pH7.3)                                           NaN
Formula (pH7.3)                                             NaN
Charge (pH7.3)                                              NaN
Mass (pH7.3)                                                NaN
Exact Mass (neutral form)                                   NaN
Exact m/z of [M.]+                                          NaN
Exact m/z of [M+H]+                                         NaN
Exact m/z of [M+K]+                                         NaN
Exact m/z of [M+Na]+                                        NaN
Exact m/z of [M+Li]+                                        NaN
Exact m/z of [M+NH4]+                                       NaN
Exact m/z of [M-H]-                                         NaN
Exact m/z of [M+Cl]-                                        NaN
Exact m/z of [M+OAc]-                                       NaN
CHEBI                                                       NaN
LIPID MAPS                                                  NaN
HMDB                                                        NaN
MetaNetX                                                    NaN
PMID                                                        NaN
Components_parsed                                           NaN
Name: 0, dtype: object
df_sl_nodes.iloc[0]
layer                                           sl_abbreviation
node_id                      (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE
Lipid ID                                                    NaN
Level                                                       NaN
Name                                                        NaN
Abbreviation*                                               NaN
Synonyms*                                                   NaN
Lipid class*                                                NaN
Parent                                                      NaN
Components*                                                 NaN
SMILES (pH7.3)                                              NaN
InChI (pH7.3)                                               NaN
InChI key (pH7.3)                                           NaN
Formula (pH7.3)                                             NaN
Charge (pH7.3)                                              NaN
Mass (pH7.3)                                                NaN
Exact Mass (neutral form)                                   NaN
Exact m/z of [M.]+                                          NaN
Exact m/z of [M+H]+                                         NaN
Exact m/z of [M+K]+                                         NaN
Exact m/z of [M+Na]+                                        NaN
Exact m/z of [M+Li]+                                        NaN
Exact m/z of [M+NH4]+                                       NaN
Exact m/z of [M-H]-                                         NaN
Exact m/z of [M+Cl]-                                        NaN
Exact m/z of [M+OAc]-                                       NaN
CHEBI                                                       NaN
LIPID MAPS                                                  NaN
HMDB                                                        NaN
MetaNetX                                                    NaN
PMID                                                        NaN
Components_parsed                                           NaN
Name: 0, dtype: object

For the first entry it looks good, but what about the entire df? We can use the pd.testing.assert_frame_equal function to check this.

First, as a negative control, we will compare df_swisslipids_nodes with df_swisslipids_edges, which should obviously fail.

try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_swisslipids_edges)
    print('DataFrames are equal')
except AssertionError as e:
    print(e)
DataFrame are different

DataFrame shape mismatch
[left]:  (2779077, 32)
[right]: (6890966, 5)

Now let’s compare df_swisslipids_nodes with df_sl_nodes, which should hopefully pass without raising an error. We will also compare the edges dfs while we’re at it.

try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_sl_nodes)
    print('DataFrames for nodes are equal')
except AssertionError as e:
    print(e)
DataFrames for nodes are equal
try:
    pd.testing.assert_frame_equal(df_swisslipids_edges, df_sl_edges)
    print('DataFrames for edges are equal')
except AssertionError as e:
    print(e)
DataFrames for edges are equal
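Since the same try/except pattern is repeated for each comparison, it can be wrapped in a small helper. This is just a convenience sketch; frames_equal is not a pandas or LipiNet function.

```python
import pandas as pd

def frames_equal(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True if assert_frame_equal passes; otherwise print the reason and return False."""
    try:
        pd.testing.assert_frame_equal(left, right)
        return True
    except AssertionError as err:
        print(err)
        return False

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [1, 3]})
print(frames_equal(a, a.copy()))
print(frames_equal(a, b))
```

Keep in mind that assert_frame_equal is strict by default: it also compares the index and the dtypes, not just the values.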

Great! It looks like both approaches produce the same dfs. We will use these dfs in other parts of the package.

If they were different, we could inspect the exact rows that differ like this:

diff = df_sl_edges.merge(df_swisslipids_edges, how='outer', indicator=True)
diff_rows_edges = diff[diff['_merge'] != 'both']
diff_rows_edges
source_layer source_id target_layer target_id interlayer _merge
diff_rows_edges['_merge'].value_counts()
_merge
left_only     0
right_only    0
both          0
Name: count, dtype: int64
diff = df_sl_nodes.merge(df_swisslipids_nodes, how='outer', indicator=True)
diff_rows_nodes = diff[diff['_merge'] != 'both']
diff_rows_nodes
layer node_id Lipid ID Level Name Abbreviation* Synonyms* Lipid class* Parent Components* ... Exact m/z of [M-H]- Exact m/z of [M+Cl]- Exact m/z of [M+OAc]- CHEBI LIPID MAPS HMDB MetaNetX PMID Components_parsed _merge

0 rows × 33 columns
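Because both comparisons above come back empty, it may help to see what the same merge pattern returns when the inputs really do differ. The following is a minimal sketch on made-up edge tables; the values are invented for illustration and are not SwissLipids data.

```python
import pandas as pd

# Toy edge tables that share two rows and differ in one row each
left = pd.DataFrame({"source_id": ["A", "A", "B"],
                     "target_id": ["x", "y", "z"]})
right = pd.DataFrame({"source_id": ["A", "B", "C"],
                      "target_id": ["x", "z", "w"]})

# With no `on` argument, merge joins on all shared columns;
# indicator=True adds a `_merge` column recording where each row was found
diff = left.merge(right, how="outer", indicator=True)

# Keep only the rows that appear in exactly one of the two tables
only_one_side = diff[diff["_merge"] != "both"]
print(only_one_side)

n_left_only = int((only_one_side["_merge"] == "left_only").sum())
n_right_only = int((only_one_side["_merge"] == "right_only").sum())
```

Rows tagged left_only exist only in the first DataFrame and rows tagged right_only exist only in the second, so a non-empty result points directly at the rows to investigate.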

As a final spot check, the edges for a single lipid (SLM:000389145) should also be identical in both DataFrames:

df_sl_edges[df_sl_edges['source_id']=='SLM:000389145']
source_layer source_id target_layer target_id interlayer
1640 swisslipids SLM:000389145 sl_chebi 18059 True
429400 swisslipids SLM:000389145 sl_metanetx MNXM12117 True
549344 swisslipids SLM:000389145 swisslipids SLM:000000436 False
549407 swisslipids SLM:000389145 swisslipids SLM:000000525 False
549887 swisslipids SLM:000389145 swisslipids SLM:000001193 False
665828 swisslipids SLM:000389145 swisslipids SLM:000117142 False
936914 swisslipids SLM:000389145 swisslipids SLM:000390054 False
1046948 swisslipids SLM:000389145 swisslipids SLM:000500463 False
1055230 swisslipids SLM:000389145 swisslipids SLM:000508860 False
1328368 swisslipids SLM:000389145 swisslipids SLM:000782283 False
df_swisslipids_edges[df_swisslipids_edges['source_id']=='SLM:000389145']
source_layer source_id target_layer target_id interlayer
1640 swisslipids SLM:000389145 sl_chebi 18059 True
429400 swisslipids SLM:000389145 sl_metanetx MNXM12117 True
549344 swisslipids SLM:000389145 swisslipids SLM:000000436 False
549407 swisslipids SLM:000389145 swisslipids SLM:000000525 False
549887 swisslipids SLM:000389145 swisslipids SLM:000001193 False
665828 swisslipids SLM:000389145 swisslipids SLM:000117142 False
936914 swisslipids SLM:000389145 swisslipids SLM:000390054 False
1046948 swisslipids SLM:000389145 swisslipids SLM:000500463 False
1055230 swisslipids SLM:000389145 swisslipids SLM:000508860 False
1328368 swisslipids SLM:000389145 swisslipids SLM:000782283 False
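Rather than comparing the two printed tables by eye, the same spot check can be done programmatically. The sketch below uses toy data and a hypothetical helper, edges_match_for, which is not part of LipiNet; it assumes both DataFrames have a source_id column and list their rows in the same order.

```python
import pandas as pd


def edges_match_for(df_a, df_b, source_id):
    """Hypothetical helper: compare the edges of one source node in two tables."""
    sub_a = df_a[df_a["source_id"] == source_id].reset_index(drop=True)
    sub_b = df_b[df_b["source_id"] == source_id].reset_index(drop=True)
    # DataFrame.equals checks shape, values and dtypes
    return sub_a.equals(sub_b)


# Toy stand-ins for df_sl_edges and df_swisslipids_edges
edges_a = pd.DataFrame({
    "source_id": ["SLM:1", "SLM:1", "SLM:2"],
    "target_id": ["SLM:9", "CHEBI:1", "SLM:9"],
    "interlayer": [False, True, False],
})
edges_b = edges_a.copy()

# A deliberately altered copy to show a failing comparison
edges_c = edges_a.copy()
edges_c.loc[1, "target_id"] = "CHEBI:2"

print(edges_match_for(edges_a, edges_b, "SLM:1"))
print(edges_match_for(edges_a, edges_c, "SLM:1"))
```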