Parse SwissLipids

Parsing SwissLipids data into a network for LipiNet.

LipiNet offers convenient functions to parse prior knowledge resources straight into networks. For instance, parsing SwissLipids data into a network is as easy as running: parse_swisslipids_data()

However, to show what happens behind the scenes, this notebook also walks through the data and each step that this function performs in the background. This may be particularly helpful to users who need to customise the networks in a way that LipiNet does not yet support directly.

Using `parse_swisslipids_data()`#

As mentioned above, the LipiNet parse_swisslipids_data() function automatically parses SwissLipids into a network. LipiNet uses this as input to its overall combined network, and for the majority of users this function will probably suffice if they wish to build sub-networks with just SwissLipids data.

import importlib
import lipinet.parse_swisslipids
importlib.reload(lipinet.parse_swisslipids)  # reload the module after edits
from lipinet.parse_swisslipids import parse_swisslipids_data 

sl_results = parse_swisslipids_data(verbose=False, use_cache=True)
df_sl_nodes = sl_results['df_nodes']
df_sl_edges = sl_results['df_edges']

To avoid repeatedly downloading the SwissLipids data (and putting unnecessary load on their servers), set use_cache=True. If no cache exists yet, the download is saved to the cache automatically. If a cache already exists, it is used instead of downloading again.

To override the cache you can set force_download=True, but this is only recommended every few months when you want to update the source data in case of changes.
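The caching behaviour described above can be sketched as follows. This is only a minimal illustration of the use-cache / force-download logic, not LipiNet's actual implementation: the function name `load_with_cache`, its `fetch` argument and the file name are hypothetical, and a fake fetch function stands in for the real download.

```python
from pathlib import Path
import tempfile


def load_with_cache(cache_path, fetch, use_cache=True, force_download=False):
    """Hypothetical helper: return text from cache_path, calling fetch() only when needed."""
    cache_path = Path(cache_path)
    if use_cache and cache_path.exists() and not force_download:
        return cache_path.read_text()  # cache hit: no download
    text = fetch()  # cache miss or forced refresh
    if use_cache:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(text)  # save for next time
    return text


calls = []


def fake_fetch():
    """Stand-in for the real download; records how often it is called."""
    calls.append(1)
    return "Lipid ID\tName\nSLM:000000035\tsphinganine\n"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "swisslipids.tsv"
    first = load_with_cache(path, fake_fetch)   # downloads and caches
    second = load_with_cache(path, fake_fetch)  # served from cache
    third = load_with_cache(path, fake_fetch, force_download=True)  # refreshes

n_downloads = len(calls)
```

Here the second call never touches the fetch function, and only the forced call triggers a new download.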

Where to from here?#

Now to quickly start exploring SwissLipids, go to the Explore SwissLipids notebook.
To see how the combined LipiNet network uses SwissLipids, go to the Explore LipiNet notebook.
Or to see how the parse_swisslipids_data() function works behind the scenes, continue to the Manual parsing section below.

Manual parsing#

For users wanting to better understand all the steps being undertaken behind the parse_swisslipids_data() function, we will recreate the steps here.

import lipinet.databases  # Import the module

# Reload the module to ensure changes are picked up
importlib.reload(lipinet)
importlib.reload(lipinet.databases)

# Now can use the functions after reloading the module
from lipinet.databases import get_prior_knowledge
from lipinet.utils import split_and_expand_large, create_nodedf_from_edgedf, check_for_split_characters, clean_missing_strings

import pandas as pd

df_swisslipids = get_prior_knowledge('swisslipids', verbose=True, force_download=False) #Previously set to True
df_swisslipids

File found locally at /opt/anaconda3/envs/graphtool/lib/python3.12/site-packages/lipinet/.data/downloaded/swisslipids_lipids.tsv. Loading data...
Before cleaning, trailing-space counts in 'Lipid class*': {False: 779171, True: 76, nan: 2}

>> Cleaning column 'Lipid class*':
   sample before: ['SLM:000399814', 'SLM:000390097', 'SLM:000390097', 'SLM:000001000', 'SLM:000390097']
   sample after:  ['SLM:000399814', 'SLM:000390097', 'SLM:000390097', 'SLM:000001000', 'SLM:000390097']

>> Cleaning column 'CHEBI':
   sample before: ['70846', '70771', '70829', '70775', '57817']
   sample after:  ['70846', '70771', '70829', '70775', '57817']
After cleaning, trailing-space counts in 'Lipid class*': {False: 779247, <NA>: 2}

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
0	SLM:000000002	Class	Ceramide (iso-d17:1(4E))	Cer(iso-d17:1(4E))	N-acyl-15-methylhexadecasphing-4-enine	SLM:000399814	NaN	NaN	CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	70846	NaN	NaN	MNXM97012	\| 11443131 \| 14685263 \| 18390550 \| 21325339 \|...
1	SLM:000000003	Isomeric subspecies	15-methylhexadecasphing-4-enine	NaN	NaN	SLM:000390097	NaN	NaN	CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@@H]([NH3+])CO	InChI=1S/C17H35NO2/c1-15(2)12-10-8-6-4-3-5-7-9...	...	292.282235	303.300605	284.259503	320.236181	344.280632	70771	NaN	NaN	MNXM57784	19372430
2	SLM:000000006	Isomeric subspecies	15-methylhexadecasphinganine	NaN	NaN	SLM:000390097	NaN	NaN	CC(C)CCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO	InChI=1S/C17H37NO2/c1-15(2)12-10-8-6-4-3-5-7-9...	...	294.297885	305.316255	286.275153	322.251831	346.296282	70829	NaN	NaN	MNXM97029	19372430
3	SLM:000000007	Class	Sphingomyelin (iso-d17:1(4E))	SM(iso-d17:1(4E))	N-acyl-15-methylhexadecasphing-4-enine-1-phosp...	SLM:000001000	NaN	NaN	CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O...	InChI=none	...	NaN	NaN	NaN	NaN	NaN	70775	NaN	NaN	MNXM97113	14685263 \| 21926990 \| 9603947
4	SLM:000000035	Isomeric subspecies	sphinganine	NaN	NaN	SLM:000390097	NaN	NaN	CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO	InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12...	...	308.313535	319.331905	300.290803	336.267481	360.311932	57817	LMSP01020001	HMDB00269	MNXM302	10652340 \| 10702247 \| 10751414 \| 10802064 \| 10...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
779244	SLM:000782324	NaN	apo carotenoid	NaN	NaN	SLM:000508864	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	53183	NaN	NaN	NaN	NaN
779245	SLM:000782325	NaN	terpenoid	NaN	NaN	SLM:000508864	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	26873	NaN	NaN	NaN	NaN
779246	SLM:000782326	NaN	C-45 isoprenoid	NaN	NaN	SLM:000508864	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	87168	NaN	NaN	NaN	NaN
779247	SLM:000782327	NaN	gamma-lactone	NaN	NaN	SLM:000782238	NaN	NaN	O1C(C(C(C1=O)))*	NaN	...	NaN	NaN	NaN	NaN	NaN	37581	NaN	NaN	NaN	NaN
779248	SLM:000782328	NaN	oxidized 2-acylglycerol	NaN	NaN	SLM:000000355	NaN	NaN	OCC(CO)OC(=O)*	NaN	...	NaN	NaN	NaN	NaN	NaN	167117	NaN	NaN	NaN	NaN

779249 rows × 29 columns

To be safe, we start by removing leading and trailing whitespace from all object and string columns. First, let's see which rows have a leading space in Abbreviation* before cleaning:

df_swisslipids[df_swisslipids['Abbreviation*'].str.startswith(' ', na=False)]

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
116651	SLM:000117132	Class	1(3)-O-(alk-1-enyl)-glycerol	MG(P-) \| MAG(P-)	Monoacylglycerol (P-)	SLM:000117137	NaN	NaN	OCC(O)COC=C[*]	InChI=none	...	NaN	NaN	NaN	NaN	NaN	77998	NaN	NaN	MNXM149874	NaN
116663	SLM:000117144	Class	1-O-(alk-1Z-enyl)-sn-glycerol	MG(P-) \| MAG(P-)	Monoacylglycerol (P-)	SLM:000117132	NaN	NaN	OC[C@H](O)COC=C/[*]	InChI=none	...	NaN	NaN	NaN	NaN	NaN	77297	NaN	NaN	MNXM413498	NaN

2 rows × 29 columns

df_swisslipids = clean_missing_strings(df_swisslipids)

df_swisslipids[df_swisslipids['Abbreviation*'].str.startswith(' ', na=False)].shape

(0, 29)
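For readers who want to do this kind of whitespace cleaning on their own tables, a minimal pandas sketch is shown here. It is an assumption about what such a step could look like, not the actual body of clean_missing_strings; the helper name `strip_string_columns` and the toy dataframe are hypothetical.

```python
import pandas as pd


def strip_string_columns(df):
    """Hypothetical helper: strip leading/trailing whitespace in text columns only."""
    out = df.copy()
    for col in out.columns:
        is_text = (pd.api.types.is_object_dtype(out[col])
                   or pd.api.types.is_string_dtype(out[col]))
        if is_text:
            # Only touch real strings; missing values pass through unchanged
            out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


# Toy data: one abbreviation with a leading space, one missing value
demo = pd.DataFrame({
    "Abbreviation*": [" MG(P-) | MAG(P-)", "PA", None],
    "numeric_col": [1.0, 2.0, 3.0],  # illustrative values only
})

cleaned = strip_string_columns(demo)
n_leading = int(cleaned["Abbreviation*"].str.startswith(" ", na=False).sum())
```

Numeric columns are skipped, which mirrors the "Not a string column" behaviour seen elsewhere in this notebook.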

If we take a closer look into the data, especially the Lipid class* column, we will see that some of the values have multiple entries. For example Ceramide phosphoinositol is a Class level entry that itself belongs to both the SLM:000000834 and SLM:000399815 classes.

df_swisslipids[df_swisslipids['Lipid class*'].str.contains('|', regex=False, na=False)]

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
142	SLM:000000392	Class	Ceramide phosphoinositol	IPC	Inositol-1-phosphoceramide	SLM:000000834 \| SLM:000399815	NaN	NaN	O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[...	InChI=none	...	NaN	NaN	NaN	NaN	NaN	64916	NaN	NaN	NaN	10888667 \| 20727985
234	SLM:000000509	Isomeric subspecies	All-trans-retinyl hexadecanoate	NaN	all-trans-retinyl palmitate	SLM:000000982 \| SLM:000508854	NaN	NaN	CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C...	InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1...	...	NaN	NaN	NaN	NaN	NaN	17616	NaN	HMDB03648	NaN	10769148 \| 10819989 \| 12230550 \| 15550674 \| 15...
315	SLM:000000612	NaN	tetracosenoyl-CoA	NaN	NaN	SLM:000390051 \| SLM:000782334	NaN	NaN	CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...	NaN	...	NaN	NaN	NaN	NaN	NaN	74146	NaN	NaN	NaN	18541923 \| 20110363 \| 20937905
317	SLM:000000614	NaN	hexacosenoyl-CoA	NaN	NaN	SLM:000390051 \| SLM:000782334	NaN	NaN	CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...	NaN	...	NaN	NaN	NaN	NaN	NaN	74161	NaN	NaN	NaN	18165233
319	SLM:000000621	NaN	2-hydroxy-tetracosenoyl-CoA	NaN	NaN	SLM:000390051 \| SLM:000782334	NaN	NaN	CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...	NaN	...	NaN	NaN	NaN	NaN	NaN	74215	NaN	NaN	NaN	18541923
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
755324	SLM:000758294	Class	Globoside	Globo	Globo-series	SLM:000000834 \| SLM:000399813	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	61360	NaN	NaN	NaN	NaN
755325	SLM:000758295	Class	Isogloboside	Isoglobo	Isoglobo-series	SLM:000000834 \| SLM:000399813	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	78257	NaN	NaN	NaN	NaN
779141	SLM:000782221	NaN	Resolvin E	RvE	NaN	SLM:000501332 \| SLM:000508853	NaN	NaN	NaN	InChI=none	...	NaN	NaN	NaN	NaN	NaN	<NA>	LMFA0314	NaN	NaN	NaN
779142	SLM:000782222	NaN	Resolvin D	RvD	NaN	SLM:000501331 \| SLM:000508853	NaN	NaN	NaN	InChI=none	...	NaN	NaN	NaN	NaN	NaN	<NA>	LMFA0403	NaN	NaN	NaN
779157	SLM:000782237	NaN	an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-...	NaN	NaN	SLM:000000413 \| SLM:000782274	NaN	NaN	[C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(...	NaN	...	NaN	NaN	NaN	NaN	NaN	157662	NaN	NaN	NaN	NaN

119 rows × 29 columns

What about the other ID columns? We can check every column for the | delimiter:

cols_with_split_chars = check_for_split_characters(df_swisslipids, delimiter='|')

Checking split characters (|) in Lipid ID
No rows found

Checking split characters (|) in Level
No rows found

Checking split characters (|) in Name
No rows found

Checking split characters (|) in Abbreviation*
Found 9768 rows with split characters

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
56	SLM:000000262	Class	1,2-diacyl-sn-glycerol	1,2-sn-DAG \| DAG \| DG	Diacylglycerol	SLM:000000423	NaN	NaN	OC[C@@H](COC([])=O)OC([])=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	17815	NaN	NaN	MNXM59	10336610 \| 10685032 \| 10888667 \| 10931938 \| 11...
114	SLM:000000341	Class	1-acyl-sn-glycerol	MAG \| MG	Monoacylglycerol	SLM:000117130	NaN	NaN	OC[C@H](O)COC([*])=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	64683	NaN	NaN	MNXM2963	10685032 \| 15939762 \| 18037386 \| 8663293 \| 960...
122	SLM:000000355	Class	2-acylglycerol	MAG \| MG	Monoacylglycerol	SLM:000000403	NaN	NaN	OCC(CO)OC([*])=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	17389	NaN	NaN	MNXM335	NaN
146	SLM:000000400	Class	Triacylglycerol	TAG \| TG	NaN	SLM:000117141	NaN	NaN	[]C(=O)OCC(COC([])=O)OC([*])=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	17855	NaN	NaN	MNXM248	12682047 \| 16135509 \| 16150821 \| 21704635 \| 27...
147	SLM:000000401	Class	Diacylglycerol	DAG \| DG	NaN	SLM:000117140	NaN	NaN	[]OCC(CO[])O[*]	InChI=none	...	NaN	NaN	NaN	NaN	NaN	18035	NaN	NaN	MNXM59	12682047 \| 16135509 \| 16150821 \| 27247428 \| 29...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
505694	SLM:000508489	Molecular subspecies	Phosphatidylglycerol (O-17:1_0:0)	LPG(O-17:1_0:0) \| PG(O-17:1_0:0)	Lysophosphatidylglycerol (O-17:1_0:0)	SLM:000508807	SLM:000508779	SLM:000001333 (sn1 or sn2 or sn3)	OCC(O)COP([O-])(=O)OCC(CO[])O[]	InChI=none	...	489.316311	500.334681	481.293579	517.270257	541.314708	<NA>	NaN	NaN	MNXM629334	NaN
505695	SLM:000508490	Molecular subspecies	Phosphatidylglycerol (O-15:1_0:0)	LPG(O-15:1_0:0) \| PG(O-15:1_0:0)	Lysophosphatidylglycerol (O-15:1_0:0)	SLM:000508807	SLM:000508775	SLM:000001331 (sn1 or sn2 or sn3)	OCC(O)COP([O-])(=O)OCC(CO[])O[]	InChI=none	...	461.285011	472.303381	453.262279	489.238957	513.283408	<NA>	NaN	NaN	MNXM628940	NaN
505696	SLM:000508491	Molecular subspecies	Phosphatidylglycerol (O-13:1_0:0)	LPG(O-13:1_0:0) \| PG(O-13:1_0:0)	Lysophosphatidylglycerol (O-13:1_0:0)	SLM:000508807	SLM:000508771	SLM:000001329 (sn1 or sn2 or sn3)	OCC(O)COP([O-])(=O)OCC(CO[])O[]	InChI=none	...	433.253711	444.272081	425.230979	461.207657	485.252108	<NA>	NaN	NaN	MNXM628548	NaN
595061	SLM:000597889	Isomeric subspecies	7-oxoresolvin D2	7-oxo-RvD2\| 7-keto-RvD2	(16R,17S)-dihydroxy-7-oxo-(4Z,8E,10Z,12E,14E,1...	SLM:000508853 \| SLM:000782222	NaN	NaN	C(C/C=C\CC(/C=C/C=C\C=C\C=C\[C@H]([C@H](C/C=C\...	InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1...	...	381.224780	392.243150	373.202048	409.178725	433.223177	137497	NaN	NaN	NaN	22844113
595062	SLM:000597890	Isomeric subspecies	16-oxoresolvin D2	16-oxo-RvD2\| 16-keto-RvD2	(7S,17S)-dihydroxy-16-oxo-(4Z,8E,10Z,12E,14E,1...	SLM:000508853 \| SLM:000782222	NaN	NaN	C(C/C=C\C[C@@H](\C=C\C=C/C=C/C=C/C([C@H](C/C=C...	InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1...	...	381.224780	392.243150	373.202048	409.178725	433.223177	137498	NaN	NaN	NaN	22844113

9768 rows × 29 columns

Checking split characters (|) in Synonyms*
Found 19853 rows with split characters

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
11	SLM:000000101	Class	1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero...	PA	1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero...	SLM:000477285	NaN	NaN	O[C@@H](COP([O-])([O-])=O)COP([O-])(=O)OC[C@@H...	InChI=none	...	NaN	NaN	NaN	NaN	NaN	60110	NaN	NaN	MNXM871	20485265 \| 9880566
17	SLM:000000147	Isomeric subspecies	N-(9Z-octadecenoyl)-ethanolamine	NAE (18:1(9Z))	(9Z-octadecenoyl)-ethanolamide \| N-(9Z-octadec...	SLM:000000378	NaN	NaN	CCCCCCCC\C=C/CCCCCCCC(=O)NCCO	InChI=1S/C20H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12...	...	332.313535	343.331905	324.290803	360.267481	384.311932	71466	NaN	HMDB02088	MNXM107386	14634025 \| 16527816 \| 17015445 \| 17626977 \| 17...
18	SLM:000000149	Isomeric subspecies	N-hexadecanoyl-ethanolamine	NAE (16:0)	hexadecanoyl-ethanolamide \| N-hexadecanoyl eth...	SLM:000000378	NaN	NaN	CCCCCCCCCCCCCCCC(=O)NCCO	InChI=1S/C18H37NO2/c1-2-3-4-5-6-7-8-9-10-11-12...	...	306.297885	317.316255	298.275153	334.251831	358.296282	71464	NaN	HMDB02100	MNXM107548	12824167 \| 14634025 \| 15655246 \| 15760304 \| 16...
19	SLM:000000178	Isomeric subspecies	N-(docosanoyl)-15-methylhexadecasphing-4-enine	Cer(iso-d17:1(4E)/22:0)	Ceramide (iso-d17:1(4E)/22:0) \| N-docosanoyl-1...	SLM:000000002	SLM:000392021	SLM:000000827 (n-acyl)	CCCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\...	InChI=1S/C39H77NO3/c1-4-5-6-7-8-9-10-11-12-13-...	...	614.605801	625.624171	606.583069	642.559747	666.604198	71377	NaN	NaN	MNXM107026	19372430
20	SLM:000000179	Isomeric subspecies	N-(heneicosanoyl)-15-methylhexadecasphing-4-enine	Cer(iso-d17:1(4E)/21:0)	Ceramide (iso-d17:1(4E)/21:0) \| N-henicosanoyl...	SLM:000000002	SLM:000392020	SLM:000001207 (n-acyl)	CCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\C...	InChI=1S/C38H75NO3/c1-4-5-6-7-8-9-10-11-12-13-...	...	600.590151	611.608521	592.567419	628.544097	652.588548	71375	NaN	NaN	MNXM107036	19372430
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
745092	SLM:000747954	Isomeric subspecies	CDP-1,2-di-(13-methyltetradecanoyl)-sn-glycerol	CDP-DAG (iso15:0/iso15:0)	1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c...	SLM:000000084	NaN	SLM:000000047 (sn1 or sn2)	[H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(...	InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-7-5...	...	932.498448	943.516818	924.475716	960.452394	984.496846	<NA>	NaN	HMDB0116214	NaN	NaN
745093	SLM:000747955	Isomeric subspecies	CDP-1-(13-methyltetradecanoyl)-2-(15-methylhex...	CDP-DAG (iso15:0/iso17:0)	1-(13-methyltetradecanoyl)-2-(15-methylhexadec...	SLM:000000084	NaN	SLM:000000047 (sn1) / SLM:000000048 (sn2)	[H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(...	InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6...	...	960.529748	971.548118	952.507016	988.483694	1012.528146	<NA>	NaN	HMDB0116216	NaN	NaN
745175	SLM:000748037	Isomeric subspecies	CDP-1-(15-methylhexadecanoyl)-2-(11-methyldode...	CDP-DAG (iso17:0/iso13:0)	1-(15-methylhexadecanoyl)-2-(11-methyldodecano...	SLM:000000084	NaN	SLM:000000048 (sn1) / SLM:000001197 (sn2)	[H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(...	InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-8-6...	...	932.498448	943.516818	924.475716	960.452394	984.496846	<NA>	NaN	HMDB0116248	NaN	NaN
745176	SLM:000748038	Isomeric subspecies	CDP-1-(15-methylhexadecanoyl)-2-(13-methyltetr...	CDP-DAG (iso17:0/iso15:0)	1-(15-methylhexadecanoyl)-2-(13-methyltetradec...	SLM:000000084	NaN	SLM:000000047 (sn2) / SLM:000000048 (sn1)	[H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(...	InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6...	...	960.529748	971.548118	952.507016	988.483694	1012.528146	<NA>	NaN	HMDB0116250	NaN	NaN
745177	SLM:000748039	Isomeric subspecies	CDP-1,2-di-(15-methylhexadecanoyl)-sn-glycerol	CDP-DAG (iso17:0/iso17:0)	1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...	SLM:000000084	NaN	SLM:000000048 (sn1 or sn2)	[H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(...	InChI=1S/C46H85N3O15P2/c1-36(2)27-23-19-15-11-...	...	988.561049	999.579419	980.538317	1016.514994	1040.559446	<NA>	NaN	HMDB0116252	NaN	NaN

19853 rows × 29 columns

Checking split characters (|) in Lipid class*
Found 119 rows with split characters

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
142	SLM:000000392	Class	Ceramide phosphoinositol	IPC	Inositol-1-phosphoceramide	SLM:000000834 \| SLM:000399815	NaN	NaN	O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[...	InChI=none	...	NaN	NaN	NaN	NaN	NaN	64916	NaN	NaN	NaN	10888667 \| 20727985
234	SLM:000000509	Isomeric subspecies	All-trans-retinyl hexadecanoate	NaN	all-trans-retinyl palmitate	SLM:000000982 \| SLM:000508854	NaN	NaN	CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C...	InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1...	...	NaN	NaN	NaN	NaN	NaN	17616	NaN	HMDB03648	NaN	10769148 \| 10819989 \| 12230550 \| 15550674 \| 15...
315	SLM:000000612	NaN	tetracosenoyl-CoA	NaN	NaN	SLM:000390051 \| SLM:000782334	NaN	NaN	CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...	NaN	...	NaN	NaN	NaN	NaN	NaN	74146	NaN	NaN	NaN	18541923 \| 20110363 \| 20937905
317	SLM:000000614	NaN	hexacosenoyl-CoA	NaN	NaN	SLM:000390051 \| SLM:000782334	NaN	NaN	CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...	NaN	...	NaN	NaN	NaN	NaN	NaN	74161	NaN	NaN	NaN	18165233
319	SLM:000000621	NaN	2-hydroxy-tetracosenoyl-CoA	NaN	NaN	SLM:000390051 \| SLM:000782334	NaN	NaN	CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](...	NaN	...	NaN	NaN	NaN	NaN	NaN	74215	NaN	NaN	NaN	18541923
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
755324	SLM:000758294	Class	Globoside	Globo	Globo-series	SLM:000000834 \| SLM:000399813	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	61360	NaN	NaN	NaN	NaN
755325	SLM:000758295	Class	Isogloboside	Isoglobo	Isoglobo-series	SLM:000000834 \| SLM:000399813	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	78257	NaN	NaN	NaN	NaN
779141	SLM:000782221	NaN	Resolvin E	RvE	NaN	SLM:000501332 \| SLM:000508853	NaN	NaN	NaN	InChI=none	...	NaN	NaN	NaN	NaN	NaN	<NA>	LMFA0314	NaN	NaN	NaN
779142	SLM:000782222	NaN	Resolvin D	RvD	NaN	SLM:000501331 \| SLM:000508853	NaN	NaN	NaN	InChI=none	...	NaN	NaN	NaN	NaN	NaN	<NA>	LMFA0403	NaN	NaN	NaN
779157	SLM:000782237	NaN	an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-...	NaN	NaN	SLM:000000413 \| SLM:000782274	NaN	NaN	[C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(...	NaN	...	NaN	NaN	NaN	NaN	NaN	157662	NaN	NaN	NaN	NaN

119 rows × 29 columns

Checking split characters (|) in Parent
No rows found

Checking split characters (|) in Components*
No rows found

Checking split characters (|) in SMILES (pH7.3)
No rows found

Checking split characters (|) in InChI (pH7.3)
No rows found

Checking split characters (|) in InChI key (pH7.3)
No rows found

Checking split characters (|) in Formula (pH7.3)
No rows found

Checking split characters (|) in Charge (pH7.3)
Not a string column

Checking split characters (|) in Mass (pH7.3)
Not a string column

Checking split characters (|) in Exact Mass (neutral form)
Not a string column

Checking split characters (|) in Exact m/z of [M.]+
Not a string column

Checking split characters (|) in Exact m/z of [M+H]+
Not a string column

Checking split characters (|) in Exact m/z of [M+K]+ 
Not a string column

Checking split characters (|) in Exact m/z of [M+Na]+
Not a string column

Checking split characters (|) in Exact m/z of [M+Li]+
Not a string column

Checking split characters (|) in Exact m/z of [M+NH4]+
Not a string column

Checking split characters (|) in Exact m/z of [M-H]-
Not a string column

Checking split characters (|) in Exact m/z of [M+Cl]-
Not a string column

Checking split characters (|) in Exact m/z of [M+OAc]- 
Not a string column

Checking split characters (|) in CHEBI
Found 3 rows with split characters

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
465	SLM:000000784	Isomeric subspecies	1,2-di-(9Z-octadecenoyl)-sn-glycero-3-phosphate	PA(18:1(9Z)/18:1(9Z))	Phosphatidate (18:1(9Z)/18:1(9Z))	SLM:000000329	SLM:000082169	SLM:000000418 (sn1 or sn2)	CCCCCCCC\C=C/CCCCCCCC(=O)OC[C@H](COP([O-])([O-...	InChI=1S/C39H73O8P/c1-3-5-7-9-11-13-15-17-19-2...	...	707.519775	718.538147	699.497009	735.473694	759.518188	74546\|82922	LMGP10010962	HMDB07865	MNXM51075	11309392 \| 14634025 \| 14665624 \| 15164764 \| 15...
387185	SLM:000389154	NaN	(14Z,17Z,20Z,23Z,26Z)-dotriacontapentaenoate	NaN	Fatty acid 32:5(14Z,17Z,20Z,23Z,26Z)	SLM:000389801	NaN	NaN	CCCCC\C=C/C\C=C/C\C=C/C\C=C/C\C=C/CCCCCCCCCCCC...	InChI=1S/C32H54O2/c1-2-3-4-5-6-7-8-9-10-11-12-...	...	477.427836	488.446207	469.405105	505.381782	529.426234	82731\|82731	LMFA01030848	NaN	NaN	NaN
595221	SLM:000598072	NaN	all-trans-retinol--[retinol-binding protein]	NaN	NaN	SLM:000000982	NaN	NaN	[][C@H](N-)C(-*)=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	17336\|83228	NaN	NaN	NaN	20628054 \| 28758396

3 rows × 29 columns

Checking split characters (|) in LIPID MAPS
No rows found

Checking split characters (|) in HMDB
No rows found

Checking split characters (|) in MetaNetX
No rows found

Checking split characters (|) in PMID
Found 1318 rows with split characters

	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	SMILES (pH7.3)	InChI (pH7.3)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
0	SLM:000000002	Class	Ceramide (iso-d17:1(4E))	Cer(iso-d17:1(4E))	N-acyl-15-methylhexadecasphing-4-enine	SLM:000399814	NaN	NaN	CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	70846	NaN	NaN	MNXM97012	\| 11443131 \| 14685263 \| 18390550 \| 21325339 \| ...
3	SLM:000000007	Class	Sphingomyelin (iso-d17:1(4E))	SM(iso-d17:1(4E))	N-acyl-15-methylhexadecasphing-4-enine-1-phosp...	SLM:000001000	NaN	NaN	CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O...	InChI=none	...	NaN	NaN	NaN	NaN	NaN	70775	NaN	NaN	MNXM97113	14685263 \| 21926990 \| 9603947
4	SLM:000000035	Isomeric subspecies	sphinganine	NaN	NaN	SLM:000390097	NaN	NaN	CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO	InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12...	...	308.313535	319.331905	300.290803	336.267481	360.311932	57817	LMSP01020001	HMDB00269	MNXM302	10652340 \| 10702247 \| 10751414 \| 10802064 \| 10...
5	SLM:000000042	Isomeric subspecies	cholesta-5,7-dien-3beta-ol	NaN	NaN	SLM:000501263	NaN	NaN	[H][C@@]1(CC[C@@]2([H])C3=CC=C4C[C@@H](O)CC[C@...	InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...	...	391.354671	402.373042	383.331940	419.308617	443.353069	17759	LMST01010069	HMDB00032	MNXM710	10329655 \| 10344195 \| 10786622 \| 11230174 \| 16...
6	SLM:000000043	Isomeric subspecies	lathosterone	NaN	NaN	SLM:000501263	NaN	NaN	[H][C@@]12CC=C3[C@]4([H])CC[C@]([H])([C@H](C)C...	InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2...	...	391.354671	402.373042	383.331940	419.308617	443.353069	71550	NaN	NaN	MNXM97065	19531354 \| 22505847
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
595221	SLM:000598072	NaN	all-trans-retinol--[retinol-binding protein]	NaN	NaN	SLM:000000982	NaN	NaN	[][C@H](N-)C(-*)=O	InChI=none	...	NaN	NaN	NaN	NaN	NaN	17336\|83228	NaN	NaN	NaN	20628054 \| 28758396
595222	SLM:000598073	NaN	all-trans-retinyl heptanoate	NaN	NaN	SLM:000000982	NaN	NaN	C1(C)(C)C(\C=C\C(=C\C=C\C(=C\COC(CCCCCC)=O)\C)...	InChI=1S/C27H42O2/c1-7-8-9-10-16-26(28)29-21-1...	...	NaN	NaN	NaN	NaN	NaN	138724	NaN	NaN	NaN	20628054 \| 28758396
595223	SLM:000598074	NaN	2-heptanoyl-sn-glycero-3-phosphocholine	NaN	NaN	SLM:000000724	NaN	NaN	P(OC[C@@H](CO)OC(=O)CCCCCC)(=O)(OCC[N+](C)(C)C...	InChI=1S/C15H32NO7P/c1-5-6-7-8-9-15(18)23-14(1...	...	NaN	NaN	NaN	NaN	NaN	138266	NaN	NaN	NaN	20628054 \| 22605381 \| 28758396
595230	SLM:000598083	NaN	12-hydroxy-(9Z)-octadecenoyl-CoA	NaN	NaN	SLM:000389958 \| SLM:000390051	NaN	NaN	S(C(CCCCCCC/C=C\C[C@@H](CCCCCC)O)=O)CCNC(CCNC(...	InChI=1S/C39H68N7O18P3S/c1-4-5-6-13-16-27(47)1...	...	NaN	NaN	NaN	NaN	NaN	139559	NaN	NaN	NaN	17084870 \| 27758859
595245	SLM:000598101	NaN	a mannosylinositol-1-phospho-N-(2-hydroxyacyl)...	NaN	NaN	SLM:000000835	NaN	NaN	OC[C@H]1OC(O[C@@H]2[C@@H](O)[C@H](O)[C@@H](O)[...	InChI=none	...	NaN	NaN	NaN	NaN	NaN	74994	NaN	NaN	NaN	12954640 \| 9368028

1318 rows × 29 columns

Okay wow! So these are all the columns we have found with split characters…

cols_with_split_chars

['Abbreviation*', 'Synonyms*', 'Lipid class*', 'CHEBI', 'PMID']
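The kind of check performed here can be approximated with plain pandas. The sketch below is a hedged re-creation under the assumption that the utility simply scans text columns for the delimiter; `columns_containing` is a hypothetical name and is not the LipiNet function, and the toy dataframe only borrows a few values from the output above.

```python
import pandas as pd


def columns_containing(df, delimiter="|"):
    """Hypothetical helper: names of text columns where any value contains the delimiter."""
    found = []
    for col in df.columns:
        is_text = (pd.api.types.is_object_dtype(df[col])
                   or pd.api.types.is_string_dtype(df[col]))
        if not is_text:
            continue  # numeric columns cannot hold delimited strings
        has_delim = df[col].map(lambda v: isinstance(v, str) and delimiter in v)
        if has_delim.any():
            found.append(col)
    return found


demo = pd.DataFrame({
    "Lipid ID": ["SLM:000000392", "SLM:000000784"],
    "Lipid class*": ["SLM:000000834 | SLM:000399815", "SLM:000000329"],
    "CHEBI": ["64916", "74546|82922"],
    "numeric_col": [1.0, 2.0],  # illustrative values only
})

multi_valued = columns_containing(demo, delimiter="|")
```

Using a plain substring test avoids treating `|` as a regular-expression alternation, which would match every string.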

We can also check for other delimiter characters if we know they will be present. For instance, SwissLipids uses the / character in Components*, but it also appears in a number of other columns, such as the lipid names, SMILES and InChI, so we drop those columns before checking.

check_for_split_characters(df_swisslipids.drop(columns=['Name','Abbreviation*','Synonyms*','SMILES (pH7.3)','InChI (pH7.3)']), delimiter='/')

Checking split characters (/) in Lipid ID
No rows found

Checking split characters (/) in Level
No rows found

Checking split characters (/) in Lipid class*
No rows found

Checking split characters (/) in Parent
No rows found

Checking split characters (/) in Components*
Found 708725 rows with split characters

	Lipid ID	Level	Lipid class*	Parent	Components*	InChI key (pH7.3)	Formula (pH7.3)	Charge (pH7.3)	Mass (pH7.3)	Exact Mass (neutral form)	...	Exact m/z of [M+Li]+	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID
164	SLM:000000422	Isomeric subspecies	SLM:000000329	SLM:000081844	SLM:000000418 (sn2) / SLM:000000510 (sn1)	InChIKey=OPVZUEPSMJNLOM-QEJMHMKOSA-L	C37H69O8P	-2.0	672.913818	674.488647	...	681.504089	692.522461	673.481384	709.458069	733.502502	64839	LMGP10010032	HMDB07859	MNXM66476	10359651 \| 11788596 \| 12963729 \| 16620771 \| 17...
229	SLM:000000498	Isomeric subspecies	SLM:000000324	SLM:000105249	SLM:000000296 (sn2) / SLM:000000826 (sn1)	InChIKey=KRTOMQDUKGRFDJ-ZAHDIIMDSA-M	C47H82O13P	-1.0	886.120483	886.557129	...	893.572571	904.590942	885.549866	921.526550	945.570984	133606	LMGP06010010	HMDB09815	MNXM75683	22942276 \| 23097495 \| 23472195 \| 8300559
269	SLM:000000557	Isomeric subspecies	SLM:000000261	SLM:000088147	SLM:000000510 (sn1) / SLM:000000826 (sn2)	InChIKey=PZNPLUBHRSSFHT-RRHRGVEJSA-N	C42H84NO8P	0.0	762.091980	761.593445	...	768.608887	779.627258	NaN	796.562866	820.607300	73000	LMGP01010573	HMDB07970	MNXM69304	18195019 \| 19416660 \| 22923616 \| 27399000
332	SLM:000000636	Isomeric subspecies	SLM:000000329	SLM:000082164	SLM:000000418 (sn1) / SLM:000000510 (sn2)	InChIKey=ZSXHMDPHNCOWSV-QEJMHMKOSA-L	C37H69O8P	-2.0	672.913818	674.488647	...	681.504089	692.522461	673.481384	709.458069	733.502502	74551	LMGP10010964	NaN	MNXM66662	16620771 \| 18606822 \| 19318427 \| 19801371 \| 20...
333	SLM:000000637	Isomeric subspecies	SLM:000000329	SLM:000082168	SLM:000000418 (sn1) / SLM:000000826 (sn2)	InChIKey=XIERONXOJKEALF-PXYGFXEISA-L	C39H73O8P	-2.0	700.966980	702.519958	...	709.535400	720.553772	701.512695	737.489380	761.533813	74552	LMGP10010963	NaN	MNXM66667	16620771 \| 18606822 \| 19318427 \| 19801371 \| 21...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
745172	SLM:000748034	Isomeric subspecies	SLM:000000084	NaN	SLM:000000048 (sn1) / SLM:000001195 (sn2)	InChIKey=LJSBNBPNSBKZCI-JNOBRDIFSA-L	C33H57N3O15P2	-2.0	NaN	799.342142	...	806.357598	817.375968	798.334866	834.311543	858.355995	<NA>	NaN	NaN	NaN	NaN
745173	SLM:000748035	Isomeric subspecies	SLM:000000084	NaN	SLM:000000048 (sn1) / SLM:000001196 (sn2)	InChIKey=ODNYDZLXLRZPCJ-GPTQCAHZSA-L	C35H61N3O15P2	-2.0	NaN	827.373442	...	834.388898	845.407268	826.366166	862.342844	886.387295	<NA>	NaN	NaN	NaN	NaN
745174	SLM:000748036	Isomeric subspecies	SLM:000000084	NaN	SLM:000000048 (sn1) / SLM:000000853 (sn2)	InChIKey=FJIBTCUXUBRYKG-QOTCTSOZSA-L	C37H65N3O15P2	-2.0	NaN	855.404743	...	862.420198	873.438568	854.397466	890.374144	914.418595	<NA>	NaN	NaN	NaN	NaN
745175	SLM:000748037	Isomeric subspecies	SLM:000000084	NaN	SLM:000000048 (sn1) / SLM:000001197 (sn2)	InChIKey=AIBKQADSQWEVSS-HUKRWTLJSA-L	C42H75N3O15P2	-2.0	NaN	925.482993	...	932.498448	943.516818	924.475716	960.452394	984.496846	<NA>	NaN	HMDB0116248	NaN	NaN
745176	SLM:000748038	Isomeric subspecies	SLM:000000084	NaN	SLM:000000047 (sn2) / SLM:000000048 (sn1)	InChIKey=PIZFKSVTEGNINS-BQUKFSKHSA-L	C44H79N3O15P2	-2.0	NaN	953.514293	...	960.529748	971.548118	952.507016	988.483694	1012.528146	<NA>	NaN	HMDB0116250	NaN	NaN

708725 rows × 24 columns

Checking split characters (/) in InChI key (pH7.3)
No rows found

Checking split characters (/) in Formula (pH7.3)
No rows found

Checking split characters (/) in Charge (pH7.3)
Not a string column

Checking split characters (/) in Mass (pH7.3)
Not a string column

Checking split characters (/) in Exact Mass (neutral form)
Not a string column

Checking split characters (/) in Exact m/z of [M.]+
Not a string column

Checking split characters (/) in Exact m/z of [M+H]+
Not a string column

Checking split characters (/) in Exact m/z of [M+K]+ 
Not a string column

Checking split characters (/) in Exact m/z of [M+Na]+
Not a string column

Checking split characters (/) in Exact m/z of [M+Li]+
Not a string column

Checking split characters (/) in Exact m/z of [M+NH4]+
Not a string column

Checking split characters (/) in Exact m/z of [M-H]-
Not a string column

Checking split characters (/) in Exact m/z of [M+Cl]-
Not a string column

Checking split characters (/) in Exact m/z of [M+OAc]- 
Not a string column

Checking split characters (/) in CHEBI
No rows found

Checking split characters (/) in LIPID MAPS
No rows found

Checking split characters (/) in HMDB
No rows found

Checking split characters (/) in MetaNetX
No rows found

Checking split characters (/) in PMID
No rows found

['Components*']

These double entries for the classes are important to take into account for our class hierarchy, because if we don't, many of these Class level entries will become disconnected in the ontology.

To help us handle this connection we will split it into two using the split_and_expand_large utility function, but we will come back to this a bit later…
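To give an idea of what splitting a multi-valued class entry into separate rows looks like, here is a small sketch using pandas `str.split` and `explode`. It is not the implementation of split_and_expand_large (which may differ, for example to cope with the size of the full table); the two rows are taken from the output shown earlier.

```python
import pandas as pd

demo = pd.DataFrame({
    "Lipid ID": ["SLM:000000392", "SLM:000000002"],
    "Lipid class*": ["SLM:000000834 | SLM:000399815", "SLM:000399814"],
})

# Turn each delimited string into a list, then give every list element its own row
expanded = (
    demo.assign(**{"Lipid class*": demo["Lipid class*"].str.split("|")})
        .explode("Lipid class*")
        .reset_index(drop=True)
)

# Remove the spaces that surrounded the delimiter
expanded["Lipid class*"] = expanded["Lipid class*"].str.strip()
```

After this step, SLM:000000392 appears on two rows, one per parent class, so both class links can become edges.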

For now, we will also add a second components column, so that later we can keep both the original component with its location (e.g. sn) and a parsed version containing just the SwissLipids ID.

df_swisslipids['Components_parsed'] = df_swisslipids['Components*']
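At this point Components_parsed is simply a copy of Components*. As a rough idea of how the IDs could later be pulled out of such strings, the sketch below uses a regular expression on a few example values from the tables above. The pattern and variable names are assumptions for illustration and may not match what LipiNet does internally.

```python
import re
import pandas as pd

# Assumed pattern: SwissLipids IDs look like "SLM:" followed by digits
SLM_ID = re.compile(r"SLM:\d+")

components = pd.Series([
    "SLM:000000418 (sn2) / SLM:000000510 (sn1)",
    "SLM:000000827 (n-acyl)",
    None,  # many rows have no components
])

# Keep only the IDs, dropping the position annotations such as (sn1)
parsed = components.map(lambda v: SLM_ID.findall(v) if isinstance(v, str) else v)
```

Keeping the original column alongside the parsed one means the positional information (sn1, sn2, n-acyl) is not lost.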

Now we can melt to start creating the edges df

Building the edges df#

# # Split the 'Lipid class*' column into multiple rows
# df_swisslipids_splitexp = split_and_expand_large(
#     df_swisslipids, #.assign(from_layer_col='swisslipids')
#     split_col='Lipid class*', 
#     expand_cols=['Lipid ID', 'Level', 'Name', 'Abbreviation*',
#                     'CHEBI', 'LIPID MAPS', 'HMDB', 'MetaNetX', 'PMID','Synonyms*','Parent','Components*','Components_parsed'], #'from_layer_col'
#     delimiter='|'
# )

df_swisslipids_edges = pd.melt(df_swisslipids,  #df_swisslipids_splitexp
                id_vars=['Lipid ID'], 
                value_vars=['CHEBI','LIPID MAPS','HMDB','MetaNetX','PMID','Lipid class*','Abbreviation*','Synonyms*','Parent','Components*','Components_parsed'], 
                var_name='melted_column', value_name='value')
df_swisslipids_edges

	Lipid ID	melted_column	value
0	SLM:000000002	CHEBI	70846
1	SLM:000000003	CHEBI	70771
2	SLM:000000006	CHEBI	70829
3	SLM:000000007	CHEBI	70775
4	SLM:000000035	CHEBI	57817
...	...	...	...
8571734	SLM:000782324	Components_parsed	NaN
8571735	SLM:000782325	Components_parsed	NaN
8571736	SLM:000782326	Components_parsed	NaN
8571737	SLM:000782327	Components_parsed	NaN
8571738	SLM:000782328	Components_parsed	NaN

8571739 rows × 3 columns
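
To make the melt step easier to picture, here is a minimal sketch on a toy table. The toy DataFrame and its values are made up for illustration; only the pd.melt call mirrors the cell above. Each row of the input yields one long-format row per value column, which is consistent with the 8,571,739 rows above (11 value columns × 779,249 lipids).

```python
import pandas as pd

# Toy stand-in for df_swisslipids with two cross-reference columns (made-up values)
toy = pd.DataFrame({
    "Lipid ID": ["SLM:1", "SLM:2"],
    "CHEBI": ["70846", None],
    "HMDB": [None, "HMDB0000001"],
})

# Every (lipid, value column) pair becomes one row in long format
toy_long = pd.melt(
    toy,
    id_vars=["Lipid ID"],
    value_vars=["CHEBI", "HMDB"],
    var_name="melted_column",
    value_name="value",
)
print(toy_long)
```

Missing cells survive the melt as null values, which is why so many rows need to be dropped afterwards.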

df_swisslipids_edges['value'].value_counts()

value
SLM:000000353                                                 132652
SLM:000000377                                                  98788
SLM:000000102                                                  80209
SLM:000117148                                                  46820
SLM:000000400                                                  38514
                                                               ...  
TG(30:0/26:0/22:0)                                                 1
TG(30:0/24:0/24:0)                                                 1
TG(30:0/22:0/26:0)                                                 1
TG(30:0/20:0/28:0)                                                 1
NAPE (15:0-13me/34:5(16Z,19Z,22Z,25Z,28Z)/18:3(6Z,9Z,12Z))         1
Name: count, Length: 2342278, dtype: int64

Because there are so many missing values, we should mark them explicitly as nulls rather than leaving them as ‘nan’ strings.

df_swisslipids_edges = df_swisslipids_edges.replace(['nan'], pd.NA).copy() # turn literal 'nan' strings into real nulls; .dropna() could also be chained here instead of the next step
df_swisslipids_edges

	Lipid ID	melted_column	value
0	SLM:000000002	CHEBI	70846
1	SLM:000000003	CHEBI	70771
2	SLM:000000006	CHEBI	70829
3	SLM:000000007	CHEBI	70775
4	SLM:000000035	CHEBI	57817
...	...	...	...
8571734	SLM:000782324	Components_parsed	NaN
8571735	SLM:000782325	Components_parsed	NaN
8571736	SLM:000782326	Components_parsed	NaN
8571737	SLM:000782327	Components_parsed	NaN
8571738	SLM:000782328	Components_parsed	NaN

8571739 rows × 3 columns
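
The reason this matters is that pandas does not treat the literal string 'nan' as missing. A small sketch on a made-up Series shows the difference:

```python
import pandas as pd

s = pd.Series(["70846", "nan", None])

# The literal string 'nan' is not recognised as missing...
print(s.isna().tolist())

# ...until it is replaced by a real missing-value marker
s_fixed = s.replace(["nan"], pd.NA)
print(s_fixed.isna().tolist())
```

Without the replacement, a later dropna() would silently keep the 'nan' strings as if they were valid identifiers.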

The melt also produced a large number of null values, which carry no information for the network, so we drop every row where the value is null.

df_swisslipids_edges = df_swisslipids_edges.dropna(subset='value')
df_swisslipids_edges

	Lipid ID	melted_column	value
0	SLM:000000002	CHEBI	70846
1	SLM:000000003	CHEBI	70771
2	SLM:000000006	CHEBI	70829
3	SLM:000000007	CHEBI	70775
4	SLM:000000035	CHEBI	57817
...	...	...	...
8571494	SLM:000781997	Components_parsed	SLM:000000856 (n-acyl)
8571495	SLM:000781998	Components_parsed	SLM:000389154 (n-acyl)
8571496	SLM:000781999	Components_parsed	SLM:000485643 (n-acyl)
8571497	SLM:000782000	Components_parsed	SLM:000485644 (n-acyl)
8571498	SLM:000782001	Components_parsed	SLM:000485645 (n-acyl)

4678499 rows × 3 columns

There are still a few things to tidy up so that the edges are in a suitable format for OnionNet: we add a source_layer column, rename the melted columns, and derive a layer name from each original column name.

df_swisslipids_edges = df_swisslipids_edges.copy()
df_swisslipids_edges['source_layer'] = 'swisslipids'
df_swisslipids_edges.rename(columns={'Lipid ID':'source_id', 'melted_column':'target_layer', 'value':'target_id'}, inplace=True)
df_swisslipids_edges = df_swisslipids_edges[['source_layer','source_id','target_layer','target_id']]
# 'Lipid class*' points to other SwissLipids entries; every other column becomes its own 'sl_<name>' layer
df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: 'swisslipids' if x=='Lipid class*' else f"sl_{str(x).replace(' ','').strip('*').lower()}")
df_swisslipids_edges

	source_layer	source_id	target_layer	target_id
0	swisslipids	SLM:000000002	sl_chebi	70846
1	swisslipids	SLM:000000003	sl_chebi	70771
2	swisslipids	SLM:000000006	sl_chebi	70829
3	swisslipids	SLM:000000007	sl_chebi	70775
4	swisslipids	SLM:000000035	sl_chebi	57817
...	...	...	...	...
8571494	swisslipids	SLM:000781997	sl_components_parsed	SLM:000000856 (n-acyl)
8571495	swisslipids	SLM:000781998	sl_components_parsed	SLM:000389154 (n-acyl)
8571496	swisslipids	SLM:000781999	sl_components_parsed	SLM:000485643 (n-acyl)
8571497	swisslipids	SLM:000782000	sl_components_parsed	SLM:000485644 (n-acyl)
8571498	swisslipids	SLM:000782001	sl_components_parsed	SLM:000485645 (n-acyl)

4678499 rows × 4 columns

For rows where both layers are swisslipids, the edge currently runs from child to parent, because the target is the parent class taken from Lipid class*. We would rather have each parent point towards its children, so that the root node has only outgoing edges and no incoming ones. To achieve this we swap the source and target columns for these rows.

Be sure to run the following cell only once, otherwise the edges will be swapped back again.

# Identify rows where both source_layer and target_layer are 'swisslipids'
condition = (df_swisslipids_edges["source_layer"] == "swisslipids") & (df_swisslipids_edges["target_layer"] == "swisslipids")

# Swap the columns for rows satisfying the condition
df_swisslipids_edges.loc[condition, ["source_layer", "source_id", "target_layer", "target_id"]] = df_swisslipids_edges.loc[condition, ["target_layer", "target_id", "source_layer", "source_id"]].values

# Output the modified DataFrame
df_swisslipids_edges

	source_layer	source_id	target_layer	target_id
0	swisslipids	SLM:000000002	sl_chebi	70846
1	swisslipids	SLM:000000003	sl_chebi	70771
2	swisslipids	SLM:000000006	sl_chebi	70829
3	swisslipids	SLM:000000007	sl_chebi	70775
4	swisslipids	SLM:000000035	sl_chebi	57817
...	...	...	...	...
8571494	swisslipids	SLM:000781997	sl_components_parsed	SLM:000000856 (n-acyl)
8571495	swisslipids	SLM:000781998	sl_components_parsed	SLM:000389154 (n-acyl)
8571496	swisslipids	SLM:000781999	sl_components_parsed	SLM:000485643 (n-acyl)
8571497	swisslipids	SLM:000782000	sl_components_parsed	SLM:000485644 (n-acyl)
8571498	swisslipids	SLM:000782001	sl_components_parsed	SLM:000485645 (n-acyl)

4678499 rows × 4 columns
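
One way to avoid the run-only-once problem is to wrap the swap in a function that returns a new DataFrame and always call it on the unswapped input. The sketch below uses made-up toy edges, and swap_intralayer is a hypothetical helper name, not part of LipiNet:

```python
import pandas as pd

# Toy edges: one child-to-parent edge inside swisslipids, one cross-layer edge
edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:child", "SLM:child"],
    "target_layer": ["swisslipids", "sl_chebi"],
    "target_id": ["SLM:parent", "70846"],
})

def swap_intralayer(df, layer="swisslipids"):
    """Return a copy with source and target swapped for edges inside `layer`."""
    out = df.copy()
    cond = (out["source_layer"] == layer) & (out["target_layer"] == layer)
    cols = ["source_layer", "source_id", "target_layer", "target_id"]
    swapped = ["target_layer", "target_id", "source_layer", "source_id"]
    out.loc[cond, cols] = out.loc[cond, swapped].values
    return out

# Derived from the untouched input, so re-running this line gives the same result
edges_swapped = swap_intralayer(edges)
print(edges_swapped)
```

Because the original edges frame is never modified, repeating the call cannot flip the direction back.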

df_swisslipids_edges['target_layer'].value_counts()

target_layer
swisslipids             779247
sl_abbreviation         776464
sl_components           765323
sl_components_parsed    765323
sl_synonyms             548163
sl_metanetx             505003
sl_parent               493491
sl_hmdb                  26026
sl_lipidmaps             12117
sl_chebi                  4276
sl_pmid                   3066
Name: count, dtype: int64

Now let’s return to two items on our todo list:

splitting values that have multi-identifiers
trimming/parsing the components col

edges_with_multilinks = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)]
edges_with_multilinks

	source_layer	source_id	target_layer	target_id
465	swisslipids	SLM:000000784	sl_chebi	74546|82922
387185	swisslipids	SLM:000389154	sl_chebi	82731|82731
595221	swisslipids	SLM:000598072	sl_chebi	17336|83228
3116996	swisslipids	SLM:000000002	sl_pmid	| 11443131 | 14685263 | 18390550 | 21325339 | ...
3116999	swisslipids	SLM:000000007	sl_pmid	14685263 | 21926990 | 9603947
...	...	...	...	...
6199835	swisslipids	SLM:000747954	sl_synonyms	1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c...
6199836	swisslipids	SLM:000747955	sl_synonyms	1-(13-methyltetradecanoyl)-2-(15-methylhexadec...
6199918	swisslipids	SLM:000748037	sl_synonyms	1-(15-methylhexadecanoyl)-2-(11-methyldodecano...
6199919	swisslipids	SLM:000748038	sl_synonyms	1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
6199920	swisslipids	SLM:000748039	sl_synonyms	1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...

30942 rows × 4 columns

edges_with_multilinks.value_counts('target_layer')

target_layer
sl_synonyms        19853
sl_abbreviation     9768
sl_pmid             1318
sl_chebi               3
Name: count, dtype: int64

edges_with_multilinks_split = split_and_expand_large(edges_with_multilinks, 
                       split_col='target_id', 
                       expand_cols=['source_layer','source_id','target_layer'],
                       delimiter='|').drop_duplicates()
edges_with_multilinks_split

	source_layer	source_id	target_layer	target_id
0	swisslipids	SLM:000000784	sl_chebi	74546
1	swisslipids	SLM:000000784	sl_chebi	82922
2	swisslipids	SLM:000389154	sl_chebi	82731
4	swisslipids	SLM:000598072	sl_chebi	17336
5	swisslipids	SLM:000598072	sl_chebi	83228
...	...	...	...	...
68383	swisslipids	SLM:000748037	sl_synonyms	CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(11Z))
68384	swisslipids	SLM:000748038	sl_synonyms	1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
68385	swisslipids	SLM:000748038	sl_synonyms	CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(9Z))
68386	swisslipids	SLM:000748039	sl_synonyms	1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...
68387	swisslipids	SLM:000748039	sl_synonyms	CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:2(9Z,12Z))

68379 rows × 4 columns
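
The implementation of split_and_expand_large is not shown in this notebook, so the following is only a plain-pandas approximation of the same idea on a few rows taken from the table above: split the delimited string into a list, then explode the list into one row per identifier.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:000000784", "SLM:000389154"],
    "target_layer": ["sl_chebi", "sl_chebi"],
    "target_id": ["74546|82922", "82731|82731"],
})

toy_split = (
    toy_edges
    # A single-character pattern is treated as a literal string, not a regex
    .assign(target_id=toy_edges["target_id"].str.split("|"))
    # One output row per list element; the other columns are repeated
    .explode("target_id", ignore_index=True)
    .drop_duplicates()
)
print(toy_split)
```

The duplicated identifier for SLM:000389154 collapses to a single row after drop_duplicates(), which leaves a gap in the index just like in the output above.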

Now we also want to clean up the results: turn empty strings and similar placeholders into NaN, and strip any leading or trailing spaces that were left around the split characters.

edges_with_multilinks_split = clean_missing_strings(edges_with_multilinks_split)
edges_with_multilinks_split

	source_layer	source_id	target_layer	target_id
0	swisslipids	SLM:000000784	sl_chebi	74546
1	swisslipids	SLM:000000784	sl_chebi	82922
2	swisslipids	SLM:000389154	sl_chebi	82731
4	swisslipids	SLM:000598072	sl_chebi	17336
5	swisslipids	SLM:000598072	sl_chebi	83228
...	...	...	...	...
68383	swisslipids	SLM:000748037	sl_synonyms	CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(11Z))
68384	swisslipids	SLM:000748038	sl_synonyms	1-(15-methylhexadecanoyl)-2-(13-methyltetradec...
68385	swisslipids	SLM:000748038	sl_synonyms	CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(9Z))
68386	swisslipids	SLM:000748039	sl_synonyms	1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy...
68387	swisslipids	SLM:000748039	sl_synonyms	CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:2(9Z,12Z))

68379 rows × 4 columns
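
For readers who want to reproduce this kind of clean-up without LipiNet, here is a rough sketch of what such a helper could look like. It is an assumption about the behaviour described above (strip whitespace, convert empty strings to nulls), not the actual clean_missing_strings implementation, and the function name is hypothetical.

```python
import pandas as pd

def clean_missing_strings_sketch(df):
    """Strip whitespace from string cells and turn empty strings into pd.NA."""
    def clean_cell(value):
        if isinstance(value, str):
            value = value.strip()
            return pd.NA if value == "" else value
        return value  # leave non-string cells untouched

    out = df.copy()
    for col in out.columns:
        out[col] = out[col].map(clean_cell)
    return out

# Pieces as they might look after splitting '| 11443131 | 14685263' on '|'
toy = pd.DataFrame({"target_id": ["", " 11443131 ", " 14685263"]})
cleaned = clean_missing_strings_sketch(toy)
print(cleaned)
```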

edges_with_multilinks_split['target_id'].value_counts(dropna=False)

target_id
18390550                                                                      87
23670529                                                                      86
20431113                                                                      77
19603071                                                                      70
24068966                                                                      53
                                                                              ..
Phosphatidylcholine (O-18:1(11Z)/16:2(9Z,12Z))                                 1
1-(11Z-octadecenyl)-2-(9Z,12Z-octadecadienoyl)-sn-glycero-3-phosphocholine     1
Phosphatidylcholine (O-18:1(11Z)/18:2(9Z,12Z))                                 1
1-(11Z-octadecenyl)-2-(9Z-hexadecenoyl)-sn-glycero-3-phosphocholine            1
1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cytidine-5'-diphosphate            1
Name: count, Length: 59378, dtype: int64

edges_with_multilinks_split[edges_with_multilinks_split['target_id'].isna()]

	source_layer	source_id	target_layer	target_id
6	swisslipids	SLM:000000002	sl_pmid	<NA>
278	swisslipids	SLM:000000272	sl_pmid	<NA>
4533	swisslipids	SLM:000001020	sl_pmid	<NA>
4546	swisslipids	SLM:000001022	sl_pmid	<NA>
4550	swisslipids	SLM:000001023	sl_pmid	<NA>
4553	swisslipids	SLM:000001024	sl_pmid	<NA>
4586	swisslipids	SLM:000001032	sl_pmid	<NA>
4646	swisslipids	SLM:000001036	sl_pmid	<NA>

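Note that there are only 8 rows where the target_id is missing, all of them in sl_pmid. At least one of them (SLM:000000002) comes from a PMID value that starts with the | delimiter, which leaves an empty first element after splitting. This is probably fine to handle downstream.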

# edges_with_multilinks_split = edges_with_multilinks_split[~edges_with_multilinks_split['target_id'].isna()].copy()
edges_with_multilinks_split.shape

(68379, 4)

What about source_id? The check below confirms that no source_ids are missing.

edges_with_multilinks_split[edges_with_multilinks_split['source_id'].isna()].shape

(0, 4)

This is good, but we also need to handle the / separators in the components columns.

edges_with_multilinks2 = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
                     df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)]
edges_with_multilinks2

	source_layer	source_id	target_layer	target_id
7013405	swisslipids	SLM:000000422	sl_components	SLM:000000418 (sn2) / SLM:000000510 (sn1)
7013470	swisslipids	SLM:000000498	sl_components	SLM:000000296 (sn2) / SLM:000000826 (sn1)
7013510	swisslipids	SLM:000000557	sl_components	SLM:000000510 (sn1) / SLM:000000826 (sn2)
7013573	swisslipids	SLM:000000636	sl_components	SLM:000000418 (sn1) / SLM:000000510 (sn2)
7013574	swisslipids	SLM:000000637	sl_components	SLM:000000418 (sn1) / SLM:000000826 (sn2)
...	...	...	...	...
8537662	swisslipids	SLM:000748034	sl_components_parsed	SLM:000000048 (sn1) / SLM:000001195 (sn2)
8537663	swisslipids	SLM:000748035	sl_components_parsed	SLM:000000048 (sn1) / SLM:000001196 (sn2)
8537664	swisslipids	SLM:000748036	sl_components_parsed	SLM:000000048 (sn1) / SLM:000000853 (sn2)
8537665	swisslipids	SLM:000748037	sl_components_parsed	SLM:000000048 (sn1) / SLM:000001197 (sn2)
8537666	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000047 (sn2) / SLM:000000048 (sn1)

1417450 rows × 4 columns

edges_with_multilinks2_split = split_and_expand_large(edges_with_multilinks2, 
                       split_col='target_id', 
                       expand_cols=['source_layer','source_id','target_layer'],
                       delimiter='/').drop_duplicates()
edges_with_multilinks2_split

	source_layer	source_id	target_layer	target_id
0	swisslipids	SLM:000000422	sl_components	SLM:000000418 (sn2)
1	swisslipids	SLM:000000422	sl_components	SLM:000000510 (sn1)
2	swisslipids	SLM:000000498	sl_components	SLM:000000296 (sn2)
3	swisslipids	SLM:000000498	sl_components	SLM:000000826 (sn1)
4	swisslipids	SLM:000000557	sl_components	SLM:000000510 (sn1)
...	...	...	...	...
3592487	swisslipids	SLM:000748036	sl_components_parsed	SLM:000000853 (sn2)
3592488	swisslipids	SLM:000748037	sl_components_parsed	SLM:000000048 (sn1)
3592489	swisslipids	SLM:000748037	sl_components_parsed	SLM:000001197 (sn2)
3592490	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000047 (sn2)
3592491	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000048 (sn1)

3592492 rows × 4 columns

Now let’s also clean this up in case we have whitespace or empty strings etc.

edges_with_multilinks2_split = clean_missing_strings(edges_with_multilinks2_split)

Now let’s also strip the bracketed position annotation (e.g. (sn1)) from the parsed components, so that they can be linked directly to the other SLM entries if needed.

# Apply transformation only for rows where target_layer equals 'sl_components_parsed'
mask = edges_with_multilinks2_split['target_layer'] == 'sl_components_parsed'
edges_with_multilinks2_split.loc[mask, 'target_id'] = edges_with_multilinks2_split.loc[mask, 'target_id'].str.split('(').str[0].str.strip()
edges_with_multilinks2_split

	source_layer	source_id	target_layer	target_id
0	swisslipids	SLM:000000422	sl_components	SLM:000000418 (sn2)
1	swisslipids	SLM:000000422	sl_components	SLM:000000510 (sn1)
2	swisslipids	SLM:000000498	sl_components	SLM:000000296 (sn2)
3	swisslipids	SLM:000000498	sl_components	SLM:000000826 (sn1)
4	swisslipids	SLM:000000557	sl_components	SLM:000000510 (sn1)
...	...	...	...	...
3592487	swisslipids	SLM:000748036	sl_components_parsed	SLM:000000853
3592488	swisslipids	SLM:000748037	sl_components_parsed	SLM:000000048
3592489	swisslipids	SLM:000748037	sl_components_parsed	SLM:000001197
3592490	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000047
3592491	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000048

3592492 rows × 4 columns
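
As a small illustration of this parsing step, the sketch below applies the same split-on-bracket approach to a few component strings, alongside a regex alternative that extracts the identifier explicitly. The regex variant is our own suggestion, not something LipiNet is known to use.

```python
import pandas as pd

components = pd.Series([
    "SLM:000000418 (sn2)",
    "SLM:000000856 (n-acyl)",
    "SLM:000000510",
])

# Same approach as above: keep everything before the first '(' and trim spaces
ids_split = components.str.split("(").str[0].str.strip()

# Alternative: pull out the SLM identifier directly; non-matching values become NaN
ids_regex = components.str.extract(r"(SLM:\d+)", expand=False)

print(ids_split.tolist())
print(ids_regex.tolist())
```

Note that the cell above only touches rows that went through the / split. Single-component values such as SLM:000000856 (n-acyl) contain no /, so they never enter edges_with_multilinks2_split and appear to keep their annotation in the final edges df.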

Now we need to remove the original rows that contained multilinks and add back the corrected, split rows in their place.

# Identify rows with multilinks (either '|' or '/' with the specific target_layer condition)
mask_pipe = df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)
mask_slash = (
    df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
    df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)
)
mask_problem = mask_pipe | mask_slash

# Remove these rows from the original df
df_clean = df_swisslipids_edges[~mask_problem].copy()

# Now, combine the cleaned df with the corrected edges dataframes.
# These corrected dataframes are assumed to be: 
#   - edges_with_multilinks_split
#   - edges_with_multilinks2_split
df_swisslipids_edges = pd.concat([df_clean, edges_with_multilinks_split, edges_with_multilinks2_split], ignore_index=True)

# Clean up empty strings again or leading/trailing spaces
df_swisslipids_edges = clean_missing_strings(df_swisslipids_edges)

# (Optional) Drop any duplicate rows that might arise
df_swisslipids_edges = df_swisslipids_edges.drop_duplicates()

# df_swisslipids_edges now contains the original "good" rows plus the corrected edges.
df_swisslipids_edges

	source_layer	source_id	target_layer	target_id
0	swisslipids	SLM:000000002	sl_chebi	70846
1	swisslipids	SLM:000000003	sl_chebi	70771
2	swisslipids	SLM:000000006	sl_chebi	70829
3	swisslipids	SLM:000000007	sl_chebi	70775
4	swisslipids	SLM:000000035	sl_chebi	57817
...	...	...	...	...
6890973	swisslipids	SLM:000748036	sl_components_parsed	SLM:000000853
6890974	swisslipids	SLM:000748037	sl_components_parsed	SLM:000000048
6890975	swisslipids	SLM:000748037	sl_components_parsed	SLM:000001197
6890976	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000047
6890977	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000048

6890966 rows × 4 columns
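
A quick sanity check after recombining is to confirm that no unsplit separators remain. The sketch below reuses the two masks from the cell above inside a hypothetical helper (remaining_multilinks is not a LipiNet function) and runs it on made-up before and after tables.

```python
import pandas as pd

def remaining_multilinks(df):
    """Return rows that still contain an unsplit '|' or a component '/' separator."""
    has_pipe = df["target_id"].str.contains("|", regex=False, na=False)
    has_slash = (
        df["target_id"].str.contains("/", regex=False, na=False)
        & df["target_layer"].str.contains("sl_components", regex=False, na=False)
    )
    return df[has_pipe | has_slash]

before = pd.DataFrame({
    "target_layer": ["sl_chebi", "sl_components", "sl_synonyms"],
    "target_id": ["74546|82922", "SLM:000000418 (sn2) / SLM:000000510 (sn1)", "TG(30:0/26:0/22:0)"],
})
after = pd.DataFrame({
    "target_layer": ["sl_chebi", "sl_chebi", "sl_components", "sl_components", "sl_synonyms"],
    "target_id": ["74546", "82922", "SLM:000000418 (sn2)", "SLM:000000510 (sn1)", "TG(30:0/26:0/22:0)"],
})

print(len(remaining_multilinks(before)), len(remaining_multilinks(after)))
```

The lipid name containing / is deliberately not flagged, because the slash mask is restricted to the components layers. That restriction is what keeps names such as TG(30:0/26:0/22:0) from being split apart.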

Now we will determine whether the edge is within the same layer (intralayer) or between different layers (interlayer)

def assess_edge_layertype(df):
    # Flag edges whose endpoints sit in different layers (note: the column is added to df in place)
    interlayer = df['source_layer']!=df['target_layer']
    df['interlayer'] = interlayer
    return df

df_swisslipids_edges = assess_edge_layertype(df_swisslipids_edges)
df_swisslipids_edges

	source_layer	source_id	target_layer	target_id	interlayer
0	swisslipids	SLM:000000002	sl_chebi	70846	True
1	swisslipids	SLM:000000003	sl_chebi	70771	True
2	swisslipids	SLM:000000006	sl_chebi	70829	True
3	swisslipids	SLM:000000007	sl_chebi	70775	True
4	swisslipids	SLM:000000035	sl_chebi	57817	True
...	...	...	...	...	...
6890973	swisslipids	SLM:000748036	sl_components_parsed	SLM:000000853	True
6890974	swisslipids	SLM:000748037	sl_components_parsed	SLM:000000048	True
6890975	swisslipids	SLM:000748037	sl_components_parsed	SLM:000001197	True
6890976	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000047	True
6890977	swisslipids	SLM:000748038	sl_components_parsed	SLM:000000048	True

6890966 rows × 5 columns

Now we will build the node df

Building the node df#

df_swisslipids_nodes = create_nodedf_from_edgedf(edge_df=df_swisslipids_edges, props=['layer', 'id'], cols=['layer', 'node_id'])
df_swisslipids_nodes

	layer	node_id
0	swisslipids	SLM:000000002
1	swisslipids	SLM:000000003
2	swisslipids	SLM:000000006
3	swisslipids	SLM:000000007
4	swisslipids	SLM:000000035
...	...	...
13781927	sl_components_parsed	SLM:000000853
13781928	sl_components_parsed	SLM:000000048
13781929	sl_components_parsed	SLM:000001197
13781930	sl_components_parsed	SLM:000000047
13781931	sl_components_parsed	SLM:000000048

13781932 rows × 2 columns

Let’s also see how many are duplicates

df_swisslipids_nodes.value_counts(dropna=False)

layer        node_id      
swisslipids  SLM:000000353    132660
             SLM:000000377     98800
             SLM:000000102     80218
             SLM:000117148     46826
             SLM:000000400     38525
                               ...  
sl_metanetx  MNXM312433            1
             MNXM312434            1
             MNXM312435            1
             MNXM312436            1
swisslipids  SLM:000782332         1
Name: count, Length: 2779078, dtype: int64

# Pre-emptively dropping duplicates before the merge
df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes.shape

(2779078, 2)
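
The node table before deduplication has exactly twice as many rows as the edge table (13,781,932 = 2 × 6,890,966), which is consistent with simply stacking the source and target endpoints of every edge. The sketch below shows that idea on toy edges. It is an assumption about what create_nodedf_from_edgedf does, not its actual implementation, and the function name is hypothetical.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:1", "SLM:1"],
    "target_layer": ["sl_chebi", "swisslipids"],
    "target_id": ["70846", "SLM:2"],
})

def nodes_from_edges_sketch(edge_df):
    """Stack the (layer, id) pairs of both edge endpoints into one node table."""
    parts = []
    for side in ("source", "target"):
        part = edge_df[[f"{side}_layer", f"{side}_id"]].copy()
        part.columns = ["layer", "node_id"]
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

toy_nodes = nodes_from_edges_sketch(toy_edges)
toy_nodes_unique = toy_nodes.drop_duplicates()
print(len(toy_nodes), len(toy_nodes_unique))
```

A node that takes part in many edges appears once per edge in the stacked table, which is why dropping duplicates shrinks it so much.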

Now let’s merge the nodes with the information from earlier to create richer node attributes

df_swisslipids_nodes = pd.merge(df_swisslipids_nodes, df_swisslipids.assign(from_layer_col='swisslipids'),
                                left_on=['layer','node_id'], right_on=['from_layer_col','Lipid ID'],
                                how='outer')
df_swisslipids_nodes

	layer	node_id	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	...	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID	Components_parsed	from_layer_col
0	sl_abbreviation	(10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	sl_abbreviation	(10,11S,12R)-TriHETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	sl_abbreviation	(10R)-H-(11S,12S)-Ep-(5Z,8Z,14Z)-ETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	sl_abbreviation	(10R)-H-(11S,12S)-EpETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	sl_abbreviation	(10R)-H-(8S,9S)-Ep-(5Z,11Z,14Z)-ETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2779073	swisslipids	SLM:000782328	SLM:000782328	NaN	oxidized 2-acylglycerol	NaN	NaN	SLM:000000355	NaN	NaN	...	NaN	NaN	NaN	167117	NaN	NaN	NaN	NaN	NaN	swisslipids
2779074	swisslipids	SLM:000782329	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2779075	swisslipids	SLM:000782330	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2779076	swisslipids	SLM:000782331	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2779077	swisslipids	SLM:000782332	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

2779078 rows × 33 columns
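
The extra from_layer_col key is what restricts the attribute join to nodes in the swisslipids layer. The toy sketch below (made-up nodes and attributes) shows that a node with the same identifier in another layer does not pick up the lipid attributes.

```python
import pandas as pd

nodes = pd.DataFrame({
    "layer": ["swisslipids", "sl_components_parsed", "sl_chebi"],
    "node_id": ["SLM:1", "SLM:1", "70846"],
})
attrs = pd.DataFrame({"Lipid ID": ["SLM:1"], "Name": ["toy lipid"]})

merged = pd.merge(
    nodes,
    # Constant helper column so that the layer is part of the join key
    attrs.assign(from_layer_col="swisslipids"),
    left_on=["layer", "node_id"],
    right_on=["from_layer_col", "Lipid ID"],
    how="outer",
)
print(merged[["layer", "node_id", "Name"]])
```

Joining on node_id alone would instead attach the attributes to both SLM:1 rows, including the one in sl_components_parsed.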

If any duplicates remain for some reason, let’s remove them, along with from_layer_col, which carries no information here: it is just a relic of the join back to the initial df that we used to create the edges (and could probably be tidied up).

df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes = df_swisslipids_nodes.drop(columns='from_layer_col')
df_swisslipids_nodes

	layer	node_id	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	...	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID	Components_parsed
0	sl_abbreviation	(10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	sl_abbreviation	(10,11S,12R)-TriHETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	sl_abbreviation	(10R)-H-(11S,12S)-Ep-(5Z,8Z,14Z)-ETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	sl_abbreviation	(10R)-H-(11S,12S)-EpETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	sl_abbreviation	(10R)-H-(8S,9S)-Ep-(5Z,11Z,14Z)-ETrE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2779073	swisslipids	SLM:000782328	SLM:000782328	NaN	oxidized 2-acylglycerol	NaN	NaN	SLM:000000355	NaN	NaN	...	NaN	NaN	NaN	NaN	167117	NaN	NaN	NaN	NaN	NaN
2779074	swisslipids	SLM:000782329	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2779075	swisslipids	SLM:000782330	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2779076	swisslipids	SLM:000782331	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2779077	swisslipids	SLM:000782332	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

2779078 rows × 32 columns

df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()]

	layer	node_id	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	...	Exact m/z of [M+NH4]+	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID	Components_parsed
1464984	sl_pmid	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1 rows × 32 columns

print(df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()].shape)
print(df_swisslipids_nodes[df_swisslipids_nodes['layer'].isna()].shape)

(1, 32)
(0, 32)

Even though OnionNet can handle NaN values and remove them downstream, it is safer to drop the rows where either the node_id or the layer is missing now, since they serve no purpose anyway.

df_swisslipids_nodes = df_swisslipids_nodes.dropna(subset=['layer','node_id'])
print(df_swisslipids_nodes.shape)

print(df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()].shape)
print(df_swisslipids_nodes[df_swisslipids_nodes['layer'].isna()].shape)

(2779077, 32)
(0, 32)
(0, 32)

df_swisslipids_nodes['layer'].value_counts()

layer
swisslipids             779312
sl_abbreviation         736949
sl_synonyms             534781
sl_metanetx             504880
sl_parent               184620
sl_hmdb                  17232
sl_lipidmaps             12112
sl_chebi                  4277
sl_components             1708
sl_components_parsed      1677
sl_pmid                   1529
Name: count, dtype: int64

df_swisslipids_nodes['Level'].value_counts()

Level
Isomeric subspecies      592413
Structural subspecies    111867
Molecular subspecies      62516
Species                   10347
Class                       806
Category                      7
Name: count, dtype: int64

Now we have the nodes and edges dfs for SwissLipids and understand how we arrived at them. In practice you don’t have to go through this process every time: LipiNet offers a convenient function that does exactly this, if you are interested in the same network setup.

Ensuring equivalency#

We can also check that the output of the automatic parse_swisslipids_data() function and our manually processed data are equivalent.

We start by checking this for a single entry of the dataframe.

df_swisslipids_nodes.iloc[0]

layer                                           sl_abbreviation
node_id                      (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE
Lipid ID                                                    NaN
Level                                                       NaN
Name                                                        NaN
Abbreviation*                                               NaN
Synonyms*                                                   NaN
Lipid class*                                                NaN
Parent                                                      NaN
Components*                                                 NaN
SMILES (pH7.3)                                              NaN
InChI (pH7.3)                                               NaN
InChI key (pH7.3)                                           NaN
Formula (pH7.3)                                             NaN
Charge (pH7.3)                                              NaN
Mass (pH7.3)                                                NaN
Exact Mass (neutral form)                                   NaN
Exact m/z of [M.]+                                          NaN
Exact m/z of [M+H]+                                         NaN
Exact m/z of [M+K]+                                         NaN
Exact m/z of [M+Na]+                                        NaN
Exact m/z of [M+Li]+                                        NaN
Exact m/z of [M+NH4]+                                       NaN
Exact m/z of [M-H]-                                         NaN
Exact m/z of [M+Cl]-                                        NaN
Exact m/z of [M+OAc]-                                       NaN
CHEBI                                                       NaN
LIPID MAPS                                                  NaN
HMDB                                                        NaN
MetaNetX                                                    NaN
PMID                                                        NaN
Components_parsed                                           NaN
Name: 0, dtype: object

df_sl_nodes.iloc[0]

layer                                           sl_abbreviation
node_id                      (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE
Lipid ID                                                    NaN
Level                                                       NaN
Name                                                        NaN
Abbreviation*                                               NaN
Synonyms*                                                   NaN
Lipid class*                                                NaN
Parent                                                      NaN
Components*                                                 NaN
SMILES (pH7.3)                                              NaN
InChI (pH7.3)                                               NaN
InChI key (pH7.3)                                           NaN
Formula (pH7.3)                                             NaN
Charge (pH7.3)                                              NaN
Mass (pH7.3)                                                NaN
Exact Mass (neutral form)                                   NaN
Exact m/z of [M.]+                                          NaN
Exact m/z of [M+H]+                                         NaN
Exact m/z of [M+K]+                                         NaN
Exact m/z of [M+Na]+                                        NaN
Exact m/z of [M+Li]+                                        NaN
Exact m/z of [M+NH4]+                                       NaN
Exact m/z of [M-H]-                                         NaN
Exact m/z of [M+Cl]-                                        NaN
Exact m/z of [M+OAc]-                                       NaN
CHEBI                                                       NaN
LIPID MAPS                                                  NaN
HMDB                                                        NaN
MetaNetX                                                    NaN
PMID                                                        NaN
Components_parsed                                           NaN
Name: 0, dtype: object

For the first entry it looks good, but what about the entire df? We can use the pd.testing.assert_frame_equal function to check this.

First, as a negative control, we compare df_swisslipids_nodes with df_swisslipids_edges, which should obviously fail.

try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_swisslipids_edges)
    print('DataFrames are equal')
except AssertionError as e:
    print(e)

DataFrame are different

DataFrame shape mismatch
[left]:  (2779077, 32)
[right]: (6890966, 5)

Now let’s compare df_swisslipids_nodes with df_sl_nodes, which should hopefully pass without raising an error. We will also test the edges dfs while we’re at it.

try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_sl_nodes)
    print('DataFrames for nodes are equal')
except AssertionError as e:
    print(e)

DataFrames for nodes are equal

try:
    pd.testing.assert_frame_equal(df_swisslipids_edges, df_sl_edges)
    print('DataFrames for edges are equal')
except AssertionError as e:
    print(e)

DataFrames for edges are equal

Great! Both approaches produce the same DataFrames. We will use these DataFrames in other parts of the package.

If they had differed, we could find the exact rows that differ with an outer merge. Here both results are empty, as expected.

diff = df_sl_edges.merge(df_swisslipids_edges, how='outer', indicator=True)
diff_rows_edges = diff[diff['_merge'] != 'both']
diff_rows_edges

	source_layer	source_id	target_layer	target_id	interlayer	_merge

diff_rows_edges['_merge'].value_counts()

_merge
left_only     0
right_only    0
both          0
Name: count, dtype: int64

diff = df_sl_nodes.merge(df_swisslipids_nodes, how='outer', indicator=True)
diff_rows_nodes = diff[diff['_merge'] != 'both']
diff_rows_nodes

	layer	node_id	Lipid ID	Level	Name	Abbreviation*	Synonyms*	Lipid class*	Parent	Components*	...	Exact m/z of [M-H]-	Exact m/z of [M+Cl]-	Exact m/z of [M+OAc]-	CHEBI	LIPID MAPS	HMDB	MetaNetX	PMID	Components_parsed	_merge

0 rows × 33 columns
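Since the merges above return no rows, it may help to see what the same pattern looks like when the DataFrames do differ. The following is a small sketch on made-up edge tables (the IDs are invented for illustration). Note that the _merge column is categorical, so value_counts() lists all three categories even when a count is zero, which is why the output above shows three zeros.

```python
import pandas as pd

# Toy edge tables, not taken from SwissLipids
left = pd.DataFrame({"source_id": ["A", "B", "C"], "target_id": ["x", "y", "z"]})
right = pd.DataFrame({"source_id": ["A", "B", "D"], "target_id": ["x", "y", "w"]})

# With no `on` argument, merge joins on all columns the two frames share,
# so a row only counts as 'both' if every shared column matches.
diff = left.merge(right, how="outer", indicator=True)

# Keep only the rows that appear in exactly one of the two frames
diff_rows = diff[diff["_merge"] != "both"]
counts = diff_rows["_merge"].value_counts()

print(diff_rows)
print(counts)
```

Rows tagged left_only exist only in the first DataFrame and rows tagged right_only exist only in the second. One caveat of this approach is that duplicated rows can multiply in the merged result, so it is best treated as a diagnostic rather than a strict equality test.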

As a final spot check, the edges for a single lipid (SLM:000389145) should also be identical in both DataFrames.

df_sl_edges[df_sl_edges['source_id']=='SLM:000389145']

	source_layer	source_id	target_layer	target_id	interlayer
1640	swisslipids	SLM:000389145	sl_chebi	18059	True
429400	swisslipids	SLM:000389145	sl_metanetx	MNXM12117	True
549344	swisslipids	SLM:000389145	swisslipids	SLM:000000436	False
549407	swisslipids	SLM:000389145	swisslipids	SLM:000000525	False
549887	swisslipids	SLM:000389145	swisslipids	SLM:000001193	False
665828	swisslipids	SLM:000389145	swisslipids	SLM:000117142	False
936914	swisslipids	SLM:000389145	swisslipids	SLM:000390054	False
1046948	swisslipids	SLM:000389145	swisslipids	SLM:000500463	False
1055230	swisslipids	SLM:000389145	swisslipids	SLM:000508860	False
1328368	swisslipids	SLM:000389145	swisslipids	SLM:000782283	False

df_swisslipids_edges[df_swisslipids_edges['source_id']=='SLM:000389145']

	source_layer	source_id	target_layer	target_id	interlayer
1640	swisslipids	SLM:000389145	sl_chebi	18059	True
429400	swisslipids	SLM:000389145	sl_metanetx	MNXM12117	True
549344	swisslipids	SLM:000389145	swisslipids	SLM:000000436	False
549407	swisslipids	SLM:000389145	swisslipids	SLM:000000525	False
549887	swisslipids	SLM:000389145	swisslipids	SLM:000001193	False
665828	swisslipids	SLM:000389145	swisslipids	SLM:000117142	False
936914	swisslipids	SLM:000389145	swisslipids	SLM:000390054	False
1046948	swisslipids	SLM:000389145	swisslipids	SLM:000500463	False
1055230	swisslipids	SLM:000389145	swisslipids	SLM:000508860	False
1328368	swisslipids	SLM:000389145	swisslipids	SLM:000782283	False
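Comparing the two tables by eye works for a handful of rows, but the same spot check can also be done programmatically. Below is a minimal sketch, assuming toy edge tables with invented IDs rather than the real SwissLipids data. It uses DataFrame.equals for a yes/no answer and DataFrame.compare to list any differing cells.

```python
import pandas as pd

# Toy edge tables with invented IDs, standing in for the two edge DataFrames
edges_a = pd.DataFrame(
    {
        "source_id": ["SLM:1", "SLM:1", "SLM:2"],
        "target_id": ["t1", "t2", "t3"],
        "interlayer": [True, False, False],
    }
)
edges_b = edges_a.copy()

# Filter both tables to the same source node
sub_a = edges_a[edges_a["source_id"] == "SLM:1"]
sub_b = edges_b[edges_b["source_id"] == "SLM:1"]

# equals() returns a single boolean for the whole subset
subsets_match = sub_a.equals(sub_b)

# compare() returns only the cells that differ (empty if there are none);
# it requires both frames to have identical row and column labels
cell_diffs = sub_a.compare(sub_b)

print(subsets_match)
print(cell_diffs.empty)
```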