Parse SwissLipids#
Parsing SwissLipids data into a network for LipiNet.
LipiNet offers convenient functions to parse prior knowledge resources straight into networks. For instance, LipiNet can parse SwissLipids data into a network with a single call: parse_swisslipids_data()
However, to show what happens behind the scenes, this notebook also walks through the data and each step that this function performs in the background. This may be particularly helpful for users who need to customise the networks in a way that LipiNet does not yet support directly.
Using parse_swisslipids_data()#
As mentioned above, the LipiNet parse_swisslipids_data() function automatically parses SwissLipids into a network. LipiNet uses this as input to its overall combined network, and for most users this function will probably suffice if they wish to build sub-networks from SwissLipids data alone.
import importlib
import lipinet.parse_swisslipids
importlib.reload(lipinet.parse_swisslipids) # reload the module after edits
from lipinet.parse_swisslipids import parse_swisslipids_data
sl_results = parse_swisslipids_data(verbose=False, use_cache=True)
df_sl_nodes = sl_results['df_nodes']
df_sl_edges = sl_results['df_edges']
To avoid repeatedly downloading the SwissLipids data (and putting unnecessary load on their servers), set use_cache=True. If no cache exists yet, the download is saved to the cache automatically. If a cache already exists, it is used instead of downloading again.
To override the cache, set force_download=True. This is only recommended every few months, when you want to refresh the source data in case it has changed.
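The caching behaviour described above follows a common pattern, sketched below. This is not LipiNet's actual implementation: `load_with_cache`, `fake_fetch`, and the temporary file path are hypothetical names used purely for illustration, and the real function may differ in its details.

```python
from pathlib import Path
import tempfile

def load_with_cache(cache_path, fetch, use_cache=True, force_download=False):
    """Return cached text when allowed, otherwise call fetch() and cache the result."""
    cache_path = Path(cache_path)
    # Reuse the cached copy unless a fresh download is explicitly requested.
    if use_cache and cache_path.exists() and not force_download:
        return cache_path.read_text()
    data = fetch()  # hypothetical downloader returning the file contents as a string
    if use_cache:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(data)
    return data

# Demo with a fake downloader that records how often it is called.
calls = []
def fake_fetch():
    calls.append(1)
    return "Lipid ID\tName\nSLM:000000035\tsphinganine\n"

cache_file = Path(tempfile.mkdtemp()) / "swisslipids_lipids.tsv"
first = load_with_cache(cache_file, fake_fetch)    # downloads and writes the cache
second = load_with_cache(cache_file, fake_fetch)   # served from the cache
third = load_with_cache(cache_file, fake_fetch, force_download=True)  # refreshes the cache
n_downloads = len(calls)
```

In this sketch only the first and the forced call reach the downloader, which is the behaviour you would want in order to keep requests to the SwissLipids server to a minimum.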
Where to from here?#
To start exploring SwissLipids right away, go to the Explore SwissLipids notebook.
To see how the combined LipiNet network uses SwissLipids, go to the Explore LipiNet notebook.
Or, to see how the parse_swisslipids_data() function works behind the scenes, continue to the Manual parsing section below.
Manual parsing#
For users who want to understand the steps that parse_swisslipids_data() performs in the background, we recreate them here one by one.
import lipinet.databases # Import the module
# Reload the module to ensure changes are picked up
importlib.reload(lipinet)
importlib.reload(lipinet.databases)
# Now can use the functions after reloading the module
from lipinet.databases import get_prior_knowledge
from lipinet.utils import split_and_expand_large, create_nodedf_from_edgedf, check_for_split_characters, clean_missing_strings
import pandas as pd
df_swisslipids = get_prior_knowledge('swisslipids', verbose=True, force_download=False) #Previously set to True
df_swisslipids
File found locally at /opt/anaconda3/envs/graphtool/lib/python3.12/site-packages/lipinet/.data/downloaded/swisslipids_lipids.tsv. Loading data...
Before cleaning, trailing-space counts in 'Lipid class*': {False: 779171, True: 76, nan: 2}
>> Cleaning column 'Lipid class*':
sample before: ['SLM:000399814', 'SLM:000390097', 'SLM:000390097', 'SLM:000001000', 'SLM:000390097']
sample after: ['SLM:000399814', 'SLM:000390097', 'SLM:000390097', 'SLM:000001000', 'SLM:000390097']
>> Cleaning column 'CHEBI':
sample before: ['70846', '70771', '70829', '70775', '57817']
sample after: ['70846', '70771', '70829', '70775', '57817']
After cleaning, trailing-space counts in 'Lipid class*': {False: 779247, <NA>: 2}
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SLM:000000002 | Class | Ceramide (iso-d17:1(4E)) | Cer(iso-d17:1(4E)) | N-acyl-15-methylhexadecasphing-4-enine | SLM:000399814 | NaN | NaN | CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 70846 | NaN | NaN | MNXM97012 | | 11443131 | 14685263 | 18390550 | 21325339 |... |
| 1 | SLM:000000003 | Isomeric subspecies | 15-methylhexadecasphing-4-enine | NaN | NaN | SLM:000390097 | NaN | NaN | CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@@H]([NH3+])CO | InChI=1S/C17H35NO2/c1-15(2)12-10-8-6-4-3-5-7-9... | ... | 292.282235 | 303.300605 | 284.259503 | 320.236181 | 344.280632 | 70771 | NaN | NaN | MNXM57784 | 19372430 |
| 2 | SLM:000000006 | Isomeric subspecies | 15-methylhexadecasphinganine | NaN | NaN | SLM:000390097 | NaN | NaN | CC(C)CCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO | InChI=1S/C17H37NO2/c1-15(2)12-10-8-6-4-3-5-7-9... | ... | 294.297885 | 305.316255 | 286.275153 | 322.251831 | 346.296282 | 70829 | NaN | NaN | MNXM97029 | 19372430 |
| 3 | SLM:000000007 | Class | Sphingomyelin (iso-d17:1(4E)) | SM(iso-d17:1(4E)) | N-acyl-15-methylhexadecasphing-4-enine-1-phosp... | SLM:000001000 | NaN | NaN | CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O... | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 70775 | NaN | NaN | MNXM97113 | 14685263 | 21926990 | 9603947 |
| 4 | SLM:000000035 | Isomeric subspecies | sphinganine | NaN | NaN | SLM:000390097 | NaN | NaN | CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO | InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... | ... | 308.313535 | 319.331905 | 300.290803 | 336.267481 | 360.311932 | 57817 | LMSP01020001 | HMDB00269 | MNXM302 | 10652340 | 10702247 | 10751414 | 10802064 | 10... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 779244 | SLM:000782324 | NaN | apo carotenoid | NaN | NaN | SLM:000508864 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 53183 | NaN | NaN | NaN | NaN |
| 779245 | SLM:000782325 | NaN | terpenoid | NaN | NaN | SLM:000508864 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 26873 | NaN | NaN | NaN | NaN |
| 779246 | SLM:000782326 | NaN | C-45 isoprenoid | NaN | NaN | SLM:000508864 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 87168 | NaN | NaN | NaN | NaN |
| 779247 | SLM:000782327 | NaN | gamma-lactone | NaN | NaN | SLM:000782238 | NaN | NaN | O1C(C(C(C1=O)*)*)* | NaN | ... | NaN | NaN | NaN | NaN | NaN | 37581 | NaN | NaN | NaN | NaN |
| 779248 | SLM:000782328 | NaN | oxidized 2-acylglycerol | NaN | NaN | SLM:000000355 | NaN | NaN | OCC(CO)OC(=O)* | NaN | ... | NaN | NaN | NaN | NaN | NaN | 167117 | NaN | NaN | NaN | NaN |
779249 rows × 29 columns
To be safe, we start by removing leading and trailing whitespace from all object and string columns. Before cleaning, some values do start with a space, for example in Abbreviation*:
df_swisslipids[df_swisslipids['Abbreviation*'].str.startswith(' ', na=False)]
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 116651 | SLM:000117132 | Class | 1(3)-O-(alk-1-enyl)-glycerol | MG(P-) | MAG(P-) | Monoacylglycerol (P-) | SLM:000117137 | NaN | NaN | OCC(O)COC=C[*] | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 77998 | NaN | NaN | MNXM149874 | NaN |
| 116663 | SLM:000117144 | Class | 1-O-(alk-1Z-enyl)-sn-glycerol | MG(P-) | MAG(P-) | Monoacylglycerol (P-) | SLM:000117132 | NaN | NaN | OC[C@H](O)COC=C/[*] | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 77297 | NaN | NaN | MNXM413498 | NaN |
2 rows × 29 columns
df_swisslipids = clean_missing_strings(df_swisslipids)
df_swisslipids[df_swisslipids['Abbreviation*'].str.startswith(' ', na=False)].shape
(0, 29)
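A cleaning step like this could be written roughly as follows. This is a minimal sketch, not the actual clean_missing_strings implementation, which may well do more than strip whitespace; `strip_string_columns` and the toy data are hypothetical.

```python
import pandas as pd

def strip_string_columns(df):
    """Return a copy with leading/trailing whitespace removed from text columns."""
    out = df.copy()
    for col in out.select_dtypes(include=["object", "string"]).columns:
        # .str.strip() leaves missing values untouched
        out[col] = out[col].str.strip()
    return out

# Toy frame mimicking one text column and one numeric column.
toy = pd.DataFrame({
    "Abbreviation*": [" MG(P-) | MAG(P-)", "PA", None],
    "Charge (pH7.3)": [0.0, -2.0, None],
})
cleaned = strip_string_columns(toy)

# Same check as in the notebook: count values that still start with a space.
n_leading = int(cleaned["Abbreviation*"].str.startswith(" ", na=False).sum())
```

Numeric columns are skipped entirely, so only text values are modified.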
A closer look at the data, especially the Lipid class* column, shows that some cells hold multiple entries separated by the | character. For example, Ceramide phosphoinositol is a Class-level entry that itself belongs to both the SLM:000000834 and SLM:000399815 classes.
df_swisslipids[df_swisslipids['Lipid class*'].str.contains('|', regex=False, na=False)]  # na=False treats missing values as non-matches
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | SLM:000000392 | Class | Ceramide phosphoinositol | IPC | Inositol-1-phosphoceramide | SLM:000000834 | SLM:000399815 | NaN | NaN | O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[... | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 64916 | NaN | NaN | NaN | 10888667 | 20727985 |
| 234 | SLM:000000509 | Isomeric subspecies | All-trans-retinyl hexadecanoate | NaN | all-trans-retinyl palmitate | SLM:000000982 | SLM:000508854 | NaN | NaN | CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C... | InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1... | ... | NaN | NaN | NaN | NaN | NaN | 17616 | NaN | HMDB03648 | NaN | 10769148 | 10819989 | 12230550 | 15550674 | 15... |
| 315 | SLM:000000612 | NaN | tetracosenoyl-CoA | NaN | NaN | SLM:000390051 | SLM:000782334 | NaN | NaN | CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 74146 | NaN | NaN | NaN | 18541923 | 20110363 | 20937905 |
| 317 | SLM:000000614 | NaN | hexacosenoyl-CoA | NaN | NaN | SLM:000390051 | SLM:000782334 | NaN | NaN | CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 74161 | NaN | NaN | NaN | 18165233 |
| 319 | SLM:000000621 | NaN | 2-hydroxy-tetracosenoyl-CoA | NaN | NaN | SLM:000390051 | SLM:000782334 | NaN | NaN | CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 74215 | NaN | NaN | NaN | 18541923 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 755324 | SLM:000758294 | Class | Globoside | Globo | Globo-series | SLM:000000834 | SLM:000399813 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 61360 | NaN | NaN | NaN | NaN |
| 755325 | SLM:000758295 | Class | Isogloboside | Isoglobo | Isoglobo-series | SLM:000000834 | SLM:000399813 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 78257 | NaN | NaN | NaN | NaN |
| 779141 | SLM:000782221 | NaN | Resolvin E | RvE | NaN | SLM:000501332 | SLM:000508853 | NaN | NaN | NaN | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | <NA> | LMFA0314 | NaN | NaN | NaN |
| 779142 | SLM:000782222 | NaN | Resolvin D | RvD | NaN | SLM:000501331 | SLM:000508853 | NaN | NaN | NaN | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | <NA> | LMFA0403 | NaN | NaN | NaN |
| 779157 | SLM:000782237 | NaN | an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-... | NaN | NaN | SLM:000000413 | SLM:000782274 | NaN | NaN | [C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 157662 | NaN | NaN | NaN | NaN |
119 rows × 29 columns
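For a network, each of these memberships eventually needs its own row, so that there is one edge per lipid and class pair. One way to achieve that with plain pandas is to split on | and explode the resulting lists. The snippet below is only a sketch on toy data: the split_and_expand_large helper imported above may work differently, and the exact spacing around | in the raw file is an assumption.

```python
import pandas as pd

# Toy rows: one lipid with two classes, one lipid with a single class.
toy = pd.DataFrame({
    "Lipid ID": ["SLM:000000392", "SLM:000000035"],
    "Lipid class*": ["SLM:000000834 | SLM:000399815", "SLM:000390097"],
})

edges = (
    toy.assign(**{"Lipid class*": toy["Lipid class*"].str.split("|")})
       .explode("Lipid class*")      # one row per list element
       .reset_index(drop=True)
)
# Remove the spaces that surrounded the delimiter.
edges["Lipid class*"] = edges["Lipid class*"].str.strip()
```

The lipid with two classes now occupies two rows, each pairing its Lipid ID with a single class ID.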
What about the other columns?
cols_with_split_chars = check_for_split_characters(df_swisslipids, delimiter='|')
Checking split characters (|) in Lipid ID
No rows found
Checking split characters (|) in Level
No rows found
Checking split characters (|) in Name
No rows found
Checking split characters (|) in Abbreviation*
Found 9768 rows with split characters
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | SLM:000000262 | Class | 1,2-diacyl-sn-glycerol | 1,2-sn-DAG | DAG | DG | Diacylglycerol | SLM:000000423 | NaN | NaN | OC[C@@H](COC([*])=O)OC([*])=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 17815 | NaN | NaN | MNXM59 | 10336610 | 10685032 | 10888667 | 10931938 | 11... |
| 114 | SLM:000000341 | Class | 1-acyl-sn-glycerol | MAG | MG | Monoacylglycerol | SLM:000117130 | NaN | NaN | OC[C@H](O)COC([*])=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 64683 | NaN | NaN | MNXM2963 | 10685032 | 15939762 | 18037386 | 8663293 | 960... |
| 122 | SLM:000000355 | Class | 2-acylglycerol | MAG | MG | Monoacylglycerol | SLM:000000403 | NaN | NaN | OCC(CO)OC([*])=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 17389 | NaN | NaN | MNXM335 | NaN |
| 146 | SLM:000000400 | Class | Triacylglycerol | TAG | TG | NaN | SLM:000117141 | NaN | NaN | [*]C(=O)OCC(COC([*])=O)OC([*])=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 17855 | NaN | NaN | MNXM248 | 12682047 | 16135509 | 16150821 | 21704635 | 27... |
| 147 | SLM:000000401 | Class | Diacylglycerol | DAG | DG | NaN | SLM:000117140 | NaN | NaN | [*]OCC(CO[*])O[*] | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 18035 | NaN | NaN | MNXM59 | 12682047 | 16135509 | 16150821 | 27247428 | 29... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 505694 | SLM:000508489 | Molecular subspecies | Phosphatidylglycerol (O-17:1_0:0) | LPG(O-17:1_0:0) | PG(O-17:1_0:0) | Lysophosphatidylglycerol (O-17:1_0:0) | SLM:000508807 | SLM:000508779 | SLM:000001333 (sn1 or sn2 or sn3) | OCC(O)COP([O-])(=O)OCC(CO[*])O[*] | InChI=none | ... | 489.316311 | 500.334681 | 481.293579 | 517.270257 | 541.314708 | <NA> | NaN | NaN | MNXM629334 | NaN |
| 505695 | SLM:000508490 | Molecular subspecies | Phosphatidylglycerol (O-15:1_0:0) | LPG(O-15:1_0:0) | PG(O-15:1_0:0) | Lysophosphatidylglycerol (O-15:1_0:0) | SLM:000508807 | SLM:000508775 | SLM:000001331 (sn1 or sn2 or sn3) | OCC(O)COP([O-])(=O)OCC(CO[*])O[*] | InChI=none | ... | 461.285011 | 472.303381 | 453.262279 | 489.238957 | 513.283408 | <NA> | NaN | NaN | MNXM628940 | NaN |
| 505696 | SLM:000508491 | Molecular subspecies | Phosphatidylglycerol (O-13:1_0:0) | LPG(O-13:1_0:0) | PG(O-13:1_0:0) | Lysophosphatidylglycerol (O-13:1_0:0) | SLM:000508807 | SLM:000508771 | SLM:000001329 (sn1 or sn2 or sn3) | OCC(O)COP([O-])(=O)OCC(CO[*])O[*] | InChI=none | ... | 433.253711 | 444.272081 | 425.230979 | 461.207657 | 485.252108 | <NA> | NaN | NaN | MNXM628548 | NaN |
| 595061 | SLM:000597889 | Isomeric subspecies | 7-oxoresolvin D2 | 7-oxo-RvD2| 7-keto-RvD2 | (16R,17S)-dihydroxy-7-oxo-(4Z,8E,10Z,12E,14E,1... | SLM:000508853 | SLM:000782222 | NaN | NaN | C(C/C=C\CC(/C=C/C=C\C=C\C=C\[C@H]([C@H](C/C=C\... | InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1... | ... | 381.224780 | 392.243150 | 373.202048 | 409.178725 | 433.223177 | 137497 | NaN | NaN | NaN | 22844113 |
| 595062 | SLM:000597890 | Isomeric subspecies | 16-oxoresolvin D2 | 16-oxo-RvD2| 16-keto-RvD2 | (7S,17S)-dihydroxy-16-oxo-(4Z,8E,10Z,12E,14E,1... | SLM:000508853 | SLM:000782222 | NaN | NaN | C(C/C=C\C[C@@H](\C=C\C=C/C=C/C=C/C([C@H](C/C=C... | InChI=1S/C22H30O5/c1-2-3-9-16-20(24)21(25)17-1... | ... | 381.224780 | 392.243150 | 373.202048 | 409.178725 | 433.223177 | 137498 | NaN | NaN | NaN | 22844113 |
9768 rows × 29 columns
Checking split characters (|) in Synonyms*
Found 19853 rows with split characters
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | SLM:000000101 | Class | 1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero... | PA | 1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero... | SLM:000477285 | NaN | NaN | O[C@@H](COP([O-])([O-])=O)COP([O-])(=O)OC[C@@H... | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 60110 | NaN | NaN | MNXM871 | 20485265 | 9880566 |
| 17 | SLM:000000147 | Isomeric subspecies | N-(9Z-octadecenoyl)-ethanolamine | NAE (18:1(9Z)) | (9Z-octadecenoyl)-ethanolamide | N-(9Z-octadec... | SLM:000000378 | NaN | NaN | CCCCCCCC\C=C/CCCCCCCC(=O)NCCO | InChI=1S/C20H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... | ... | 332.313535 | 343.331905 | 324.290803 | 360.267481 | 384.311932 | 71466 | NaN | HMDB02088 | MNXM107386 | 14634025 | 16527816 | 17015445 | 17626977 | 17... |
| 18 | SLM:000000149 | Isomeric subspecies | N-hexadecanoyl-ethanolamine | NAE (16:0) | hexadecanoyl-ethanolamide | N-hexadecanoyl eth... | SLM:000000378 | NaN | NaN | CCCCCCCCCCCCCCCC(=O)NCCO | InChI=1S/C18H37NO2/c1-2-3-4-5-6-7-8-9-10-11-12... | ... | 306.297885 | 317.316255 | 298.275153 | 334.251831 | 358.296282 | 71464 | NaN | HMDB02100 | MNXM107548 | 12824167 | 14634025 | 15655246 | 15760304 | 16... |
| 19 | SLM:000000178 | Isomeric subspecies | N-(docosanoyl)-15-methylhexadecasphing-4-enine | Cer(iso-d17:1(4E)/22:0) | Ceramide (iso-d17:1(4E)/22:0) | N-docosanoyl-1... | SLM:000000002 | SLM:000392021 | SLM:000000827 (n-acyl) | CCCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\... | InChI=1S/C39H77NO3/c1-4-5-6-7-8-9-10-11-12-13-... | ... | 614.605801 | 625.624171 | 606.583069 | 642.559747 | 666.604198 | 71377 | NaN | NaN | MNXM107026 | 19372430 |
| 20 | SLM:000000179 | Isomeric subspecies | N-(heneicosanoyl)-15-methylhexadecasphing-4-enine | Cer(iso-d17:1(4E)/21:0) | Ceramide (iso-d17:1(4E)/21:0) | N-henicosanoyl... | SLM:000000002 | SLM:000392020 | SLM:000001207 (n-acyl) | CCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO)[C@H](O)\C... | InChI=1S/C38H75NO3/c1-4-5-6-7-8-9-10-11-12-13-... | ... | 600.590151 | 611.608521 | 592.567419 | 628.544097 | 652.588548 | 71375 | NaN | NaN | MNXM107036 | 19372430 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 745092 | SLM:000747954 | Isomeric subspecies | CDP-1,2-di-(13-methyltetradecanoyl)-sn-glycerol | CDP-DAG (iso15:0/iso15:0) | 1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c... | SLM:000000084 | NaN | SLM:000000047 (sn1 or sn2) | [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... | InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-7-5... | ... | 932.498448 | 943.516818 | 924.475716 | 960.452394 | 984.496846 | <NA> | NaN | HMDB0116214 | NaN | NaN |
| 745093 | SLM:000747955 | Isomeric subspecies | CDP-1-(13-methyltetradecanoyl)-2-(15-methylhex... | CDP-DAG (iso15:0/iso17:0) | 1-(13-methyltetradecanoyl)-2-(15-methylhexadec... | SLM:000000084 | NaN | SLM:000000047 (sn1) / SLM:000000048 (sn2) | [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... | InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6... | ... | 960.529748 | 971.548118 | 952.507016 | 988.483694 | 1012.528146 | <NA> | NaN | HMDB0116216 | NaN | NaN |
| 745175 | SLM:000748037 | Isomeric subspecies | CDP-1-(15-methylhexadecanoyl)-2-(11-methyldode... | CDP-DAG (iso17:0/iso13:0) | 1-(15-methylhexadecanoyl)-2-(11-methyldodecano... | SLM:000000084 | NaN | SLM:000000048 (sn1) / SLM:000001197 (sn2) | [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... | InChI=1S/C42H77N3O15P2/c1-32(2)23-19-15-11-8-6... | ... | 932.498448 | 943.516818 | 924.475716 | 960.452394 | 984.496846 | <NA> | NaN | HMDB0116248 | NaN | NaN |
| 745176 | SLM:000748038 | Isomeric subspecies | CDP-1-(15-methylhexadecanoyl)-2-(13-methyltetr... | CDP-DAG (iso17:0/iso15:0) | 1-(15-methylhexadecanoyl)-2-(13-methyltetradec... | SLM:000000084 | NaN | SLM:000000047 (sn2) / SLM:000000048 (sn1) | [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... | InChI=1S/C44H81N3O15P2/c1-34(2)25-21-17-13-9-6... | ... | 960.529748 | 971.548118 | 952.507016 | 988.483694 | 1012.528146 | <NA> | NaN | HMDB0116250 | NaN | NaN |
| 745177 | SLM:000748039 | Isomeric subspecies | CDP-1,2-di-(15-methylhexadecanoyl)-sn-glycerol | CDP-DAG (iso17:0/iso17:0) | 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy... | SLM:000000084 | NaN | SLM:000000048 (sn1 or sn2) | [H]Nc1ccn([C@@H]2O[C@H](COP([O-])(=O)OP([O-])(... | InChI=1S/C46H85N3O15P2/c1-36(2)27-23-19-15-11-... | ... | 988.561049 | 999.579419 | 980.538317 | 1016.514994 | 1040.559446 | <NA> | NaN | HMDB0116252 | NaN | NaN |
19853 rows × 29 columns
Checking split characters (|) in Lipid class*
Found 119 rows with split characters
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | SLM:000000392 | Class | Ceramide phosphoinositol | IPC | Inositol-1-phosphoceramide | SLM:000000834 | SLM:000399815 | NaN | NaN | O[C@H]([*])[C@H](COP([O-])(=O)O[C@H]1[C@H](O)[... | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 64916 | NaN | NaN | NaN | 10888667 | 20727985 |
| 234 | SLM:000000509 | Isomeric subspecies | All-trans-retinyl hexadecanoate | NaN | all-trans-retinyl palmitate | SLM:000000982 | SLM:000508854 | NaN | NaN | CCCCCCCCCCCCCCCC(=O)OCC=C(C)C=CC=C(C)C=CC1=C(C... | InChI=1S/C36H60O2/c1-7-8-9-10-11-12-13-14-15-1... | ... | NaN | NaN | NaN | NaN | NaN | 17616 | NaN | HMDB03648 | NaN | 10769148 | 10819989 | 12230550 | 15550674 | 15... |
| 315 | SLM:000000612 | NaN | tetracosenoyl-CoA | NaN | NaN | SLM:000390051 | SLM:000782334 | NaN | NaN | CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 74146 | NaN | NaN | NaN | 18541923 | 20110363 | 20937905 |
| 317 | SLM:000000614 | NaN | hexacosenoyl-CoA | NaN | NaN | SLM:000390051 | SLM:000782334 | NaN | NaN | CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 74161 | NaN | NaN | NaN | 18165233 |
| 319 | SLM:000000621 | NaN | 2-hydroxy-tetracosenoyl-CoA | NaN | NaN | SLM:000390051 | SLM:000782334 | NaN | NaN | CC(C)(COP([O-])(=O)OP([O-])(=O)OC[C@H]1O[C@H](... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 74215 | NaN | NaN | NaN | 18541923 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 755324 | SLM:000758294 | Class | Globoside | Globo | Globo-series | SLM:000000834 | SLM:000399813 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 61360 | NaN | NaN | NaN | NaN |
| 755325 | SLM:000758295 | Class | Isogloboside | Isoglobo | Isoglobo-series | SLM:000000834 | SLM:000399813 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 78257 | NaN | NaN | NaN | NaN |
| 779141 | SLM:000782221 | NaN | Resolvin E | RvE | NaN | SLM:000501332 | SLM:000508853 | NaN | NaN | NaN | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | <NA> | LMFA0314 | NaN | NaN | NaN |
| 779142 | SLM:000782222 | NaN | Resolvin D | RvD | NaN | SLM:000501331 | SLM:000508853 | NaN | NaN | NaN | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | <NA> | LMFA0403 | NaN | NaN | NaN |
| 779157 | SLM:000782237 | NaN | an N-(omega-(9Z,12Z-octadecadienoyloxy)-ultra-... | NaN | NaN | SLM:000000413 | SLM:000782274 | NaN | NaN | [C@H]([C@@H](/C=C/CCCCCCCCCCCCC)O)(NC(=O)*COC(... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 157662 | NaN | NaN | NaN | NaN |
119 rows × 29 columns
Checking split characters (|) in Parent
No rows found
Checking split characters (|) in Components*
No rows found
Checking split characters (|) in SMILES (pH7.3)
No rows found
Checking split characters (|) in InChI (pH7.3)
No rows found
Checking split characters (|) in InChI key (pH7.3)
No rows found
Checking split characters (|) in Formula (pH7.3)
No rows found
Checking split characters (|) in Charge (pH7.3)
Not a string column
Checking split characters (|) in Mass (pH7.3)
Not a string column
Checking split characters (|) in Exact Mass (neutral form)
Not a string column
Checking split characters (|) in Exact m/z of [M.]+
Not a string column
Checking split characters (|) in Exact m/z of [M+H]+
Not a string column
Checking split characters (|) in Exact m/z of [M+K]+
Not a string column
Checking split characters (|) in Exact m/z of [M+Na]+
Not a string column
Checking split characters (|) in Exact m/z of [M+Li]+
Not a string column
Checking split characters (|) in Exact m/z of [M+NH4]+
Not a string column
Checking split characters (|) in Exact m/z of [M-H]-
Not a string column
Checking split characters (|) in Exact m/z of [M+Cl]-
Not a string column
Checking split characters (|) in Exact m/z of [M+OAc]-
Not a string column
Checking split characters (|) in CHEBI
Found 3 rows with split characters
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 465 | SLM:000000784 | Isomeric subspecies | 1,2-di-(9Z-octadecenoyl)-sn-glycero-3-phosphate | PA(18:1(9Z)/18:1(9Z)) | Phosphatidate (18:1(9Z)/18:1(9Z)) | SLM:000000329 | SLM:000082169 | SLM:000000418 (sn1 or sn2) | CCCCCCCC\C=C/CCCCCCCC(=O)OC[C@H](COP([O-])([O-... | InChI=1S/C39H73O8P/c1-3-5-7-9-11-13-15-17-19-2... | ... | 707.519775 | 718.538147 | 699.497009 | 735.473694 | 759.518188 | 74546|82922 | LMGP10010962 | HMDB07865 | MNXM51075 | 11309392 | 14634025 | 14665624 | 15164764 | 15... |
| 387185 | SLM:000389154 | NaN | (14Z,17Z,20Z,23Z,26Z)-dotriacontapentaenoate | NaN | Fatty acid 32:5(14Z,17Z,20Z,23Z,26Z) | SLM:000389801 | NaN | NaN | CCCCC\C=C/C\C=C/C\C=C/C\C=C/C\C=C/CCCCCCCCCCCC... | InChI=1S/C32H54O2/c1-2-3-4-5-6-7-8-9-10-11-12-... | ... | 477.427836 | 488.446207 | 469.405105 | 505.381782 | 529.426234 | 82731|82731 | LMFA01030848 | NaN | NaN | NaN |
| 595221 | SLM:000598072 | NaN | all-trans-retinol--[retinol-binding protein] | NaN | NaN | SLM:000000982 | NaN | NaN | [*][C@H](N-*)C(-*)=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 17336|83228 | NaN | NaN | NaN | 20628054 | 28758396 |
3 rows × 29 columns
Checking split characters (|) in LIPID MAPS
No rows found
Checking split characters (|) in HMDB
No rows found
Checking split characters (|) in MetaNetX
No rows found
Checking split characters (|) in PMID
Found 1318 rows with split characters
| | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | SMILES (pH7.3) | InChI (pH7.3) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SLM:000000002 | Class | Ceramide (iso-d17:1(4E)) | Cer(iso-d17:1(4E)) | N-acyl-15-methylhexadecasphing-4-enine | SLM:000399814 | NaN | NaN | CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](CO)NC([*])=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 70846 | NaN | NaN | MNXM97012 | | 11443131 | 14685263 | 18390550 | 21325339 | ... |
| 3 | SLM:000000007 | Class | Sphingomyelin (iso-d17:1(4E)) | SM(iso-d17:1(4E)) | N-acyl-15-methylhexadecasphing-4-enine-1-phosp... | SLM:000001000 | NaN | NaN | CC(C)CCCCCCCCC\C=C\[C@@H](O)[C@H](COP([O-])(=O... | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 70775 | NaN | NaN | MNXM97113 | 14685263 | 21926990 | 9603947 |
| 4 | SLM:000000035 | Isomeric subspecies | sphinganine | NaN | NaN | SLM:000390097 | NaN | NaN | CCCCCCCCCCCCCCC[C@@H](O)[C@@H]([NH3+])CO | InChI=1S/C18H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12... | ... | 308.313535 | 319.331905 | 300.290803 | 336.267481 | 360.311932 | 57817 | LMSP01020001 | HMDB00269 | MNXM302 | 10652340 | 10702247 | 10751414 | 10802064 | 10... |
| 5 | SLM:000000042 | Isomeric subspecies | cholesta-5,7-dien-3beta-ol | NaN | NaN | SLM:000501263 | NaN | NaN | [H][C@@]1(CC[C@@]2([H])C3=CC=C4C[C@@H](O)CC[C@... | InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2... | ... | 391.354671 | 402.373042 | 383.331940 | 419.308617 | 443.353069 | 17759 | LMST01010069 | HMDB00032 | MNXM710 | 10329655 | 10344195 | 10786622 | 11230174 | 16... |
| 6 | SLM:000000043 | Isomeric subspecies | lathosterone | NaN | NaN | SLM:000501263 | NaN | NaN | [H][C@@]12CC=C3[C@]4([H])CC[C@]([H])([C@H](C)C... | InChI=1S/C27H44O/c1-18(2)7-6-8-19(3)23-11-12-2... | ... | 391.354671 | 402.373042 | 383.331940 | 419.308617 | 443.353069 | 71550 | NaN | NaN | MNXM97065 | 19531354 | 22505847 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 595221 | SLM:000598072 | NaN | all-trans-retinol--[retinol-binding protein] | NaN | NaN | SLM:000000982 | NaN | NaN | [*][C@H](N-*)C(-*)=O | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 17336|83228 | NaN | NaN | NaN | 20628054 | 28758396 |
| 595222 | SLM:000598073 | NaN | all-trans-retinyl heptanoate | NaN | NaN | SLM:000000982 | NaN | NaN | C1(C)(C)C(\C=C\C(=C\C=C\C(=C\COC(CCCCCC)=O)\C)... | InChI=1S/C27H42O2/c1-7-8-9-10-16-26(28)29-21-1... | ... | NaN | NaN | NaN | NaN | NaN | 138724 | NaN | NaN | NaN | 20628054 | 28758396 |
| 595223 | SLM:000598074 | NaN | 2-heptanoyl-sn-glycero-3-phosphocholine | NaN | NaN | SLM:000000724 | NaN | NaN | P(OC[C@@H](CO)OC(=O)CCCCCC)(=O)(OCC[N+](C)(C)C... | InChI=1S/C15H32NO7P/c1-5-6-7-8-9-15(18)23-14(1... | ... | NaN | NaN | NaN | NaN | NaN | 138266 | NaN | NaN | NaN | 20628054 | 22605381 | 28758396 |
| 595230 | SLM:000598083 | NaN | 12-hydroxy-(9Z)-octadecenoyl-CoA | NaN | NaN | SLM:000389958 | SLM:000390051 | NaN | NaN | S(C(CCCCCCC/C=C\C[C@@H](CCCCCC)O)=O)CCNC(CCNC(... | InChI=1S/C39H68N7O18P3S/c1-4-5-6-13-16-27(47)1... | ... | NaN | NaN | NaN | NaN | NaN | 139559 | NaN | NaN | NaN | 17084870 | 27758859 |
| 595245 | SLM:000598101 | NaN | a mannosylinositol-1-phospho-N-(2-hydroxyacyl)... | NaN | NaN | SLM:000000835 | NaN | NaN | OC[C@H]1OC(O[C@@H]2[C@@H](O)[C@H](O)[C@@H](O)[... | InChI=none | ... | NaN | NaN | NaN | NaN | NaN | 74994 | NaN | NaN | NaN | 12954640 | 9368028 |
1318 rows × 29 columns
That is a lot of output! In summary, these are all the columns in which split characters were found:
cols_with_split_chars
['Abbreviation*', 'Synonyms*', 'Lipid class*', 'CHEBI', 'PMID']
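Conceptually, a check like this only needs to loop over the text columns and test each one for the delimiter as a literal string. The sketch below is an assumption about how such a check might look, not the source of check_for_split_characters; `find_delimited_columns` and the toy data are hypothetical.

```python
import pandas as pd

def find_delimited_columns(df, delimiter="|"):
    """Return names of text columns where at least one value contains the delimiter."""
    hits = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns cannot hold delimited strings
        # regex=False matches the delimiter literally; missing values count as no match
        mask = df[col].astype("string").str.contains(delimiter, regex=False, na=False)
        if mask.any():
            hits.append(col)
    return hits

# Toy frame with values modelled on the output above.
toy = pd.DataFrame({
    "Lipid ID": ["SLM:000000784", "SLM:000000035"],
    "CHEBI": ["74546|82922", "57817"],
    "Charge (pH7.3)": [-2.0, 0.0],
    "PMID": ["20628054 | 28758396", None],
})
found = find_delimited_columns(toy)  # ['CHEBI', 'PMID']
```

Passing regex=False matters here: interpreted as a regular expression, a bare | is an alternation between two empty patterns and would match every string.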
We can also check for other delimiters if we know they are present. For instance, SwissLipids uses the / character in Components*, but / also appears in a number of other columns, such as the lipid names, SMILES, and InChI, so we drop those columns before checking.
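To illustrate what the / convention in Components* encodes, the following sketch parses a Components* string into pairs of component ID and position label. The string format is inferred from the values shown in this notebook, `parse_components` is a hypothetical helper, and LipiNet itself may handle this column differently.

```python
import re

# One component looks like "SLM:000000418 (sn1)" or "SLM:000000048 (sn1 or sn2)".
COMPONENT_PATTERN = re.compile(r"\s*(SLM:\d+)\s*\(([^)]*)\)\s*")

def parse_components(value):
    """Split a Components* string on '/' and return (lipid_id, position) tuples."""
    parts = []
    for chunk in value.split("/"):
        match = COMPONENT_PATTERN.fullmatch(chunk)
        if match:  # silently skip chunks that do not fit the assumed format
            parts.append((match.group(1), match.group(2)))
    return parts

two_parts = parse_components("SLM:000000418 (sn2) / SLM:000000510 (sn1)")
one_part = parse_components("SLM:000000048 (sn1 or sn2)")
```

Before relying on a parser like this, it is worth confirming which columns actually contain the / character, which is what the check below does.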
check_for_split_characters(df_swisslipids.drop(columns=['Name','Abbreviation*','Synonyms*','SMILES (pH7.3)','InChI (pH7.3)']), delimiter='/')
Checking split characters (/) in Lipid ID
No rows found
Checking split characters (/) in Level
No rows found
Checking split characters (/) in Lipid class*
No rows found
Checking split characters (/) in Parent
No rows found
Checking split characters (/) in Components*
Found 708725 rows with split characters
| | Lipid ID | Level | Lipid class* | Parent | Components* | InChI key (pH7.3) | Formula (pH7.3) | Charge (pH7.3) | Mass (pH7.3) | Exact Mass (neutral form) | ... | Exact m/z of [M+Li]+ | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | SLM:000000422 | Isomeric subspecies | SLM:000000329 | SLM:000081844 | SLM:000000418 (sn2) / SLM:000000510 (sn1) | InChIKey=OPVZUEPSMJNLOM-QEJMHMKOSA-L | C37H69O8P | -2.0 | 672.913818 | 674.488647 | ... | 681.504089 | 692.522461 | 673.481384 | 709.458069 | 733.502502 | 64839 | LMGP10010032 | HMDB07859 | MNXM66476 | 10359651 | 11788596 | 12963729 | 16620771 | 17... |
| 229 | SLM:000000498 | Isomeric subspecies | SLM:000000324 | SLM:000105249 | SLM:000000296 (sn2) / SLM:000000826 (sn1) | InChIKey=KRTOMQDUKGRFDJ-ZAHDIIMDSA-M | C47H82O13P | -1.0 | 886.120483 | 886.557129 | ... | 893.572571 | 904.590942 | 885.549866 | 921.526550 | 945.570984 | 133606 | LMGP06010010 | HMDB09815 | MNXM75683 | 22942276 | 23097495 | 23472195 | 8300559 |
| 269 | SLM:000000557 | Isomeric subspecies | SLM:000000261 | SLM:000088147 | SLM:000000510 (sn1) / SLM:000000826 (sn2) | InChIKey=PZNPLUBHRSSFHT-RRHRGVEJSA-N | C42H84NO8P | 0.0 | 762.091980 | 761.593445 | ... | 768.608887 | 779.627258 | NaN | 796.562866 | 820.607300 | 73000 | LMGP01010573 | HMDB07970 | MNXM69304 | 18195019 | 19416660 | 22923616 | 27399000 |
| 332 | SLM:000000636 | Isomeric subspecies | SLM:000000329 | SLM:000082164 | SLM:000000418 (sn1) / SLM:000000510 (sn2) | InChIKey=ZSXHMDPHNCOWSV-QEJMHMKOSA-L | C37H69O8P | -2.0 | 672.913818 | 674.488647 | ... | 681.504089 | 692.522461 | 673.481384 | 709.458069 | 733.502502 | 74551 | LMGP10010964 | NaN | MNXM66662 | 16620771 | 18606822 | 19318427 | 19801371 | 20... |
| 333 | SLM:000000637 | Isomeric subspecies | SLM:000000329 | SLM:000082168 | SLM:000000418 (sn1) / SLM:000000826 (sn2) | InChIKey=XIERONXOJKEALF-PXYGFXEISA-L | C39H73O8P | -2.0 | 700.966980 | 702.519958 | ... | 709.535400 | 720.553772 | 701.512695 | 737.489380 | 761.533813 | 74552 | LMGP10010963 | NaN | MNXM66667 | 16620771 | 18606822 | 19318427 | 19801371 | 21... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 745172 | SLM:000748034 | Isomeric subspecies | SLM:000000084 | NaN | SLM:000000048 (sn1) / SLM:000001195 (sn2) | InChIKey=LJSBNBPNSBKZCI-JNOBRDIFSA-L | C33H57N3O15P2 | -2.0 | NaN | 799.342142 | ... | 806.357598 | 817.375968 | 798.334866 | 834.311543 | 858.355995 | <NA> | NaN | NaN | NaN | NaN |
| 745173 | SLM:000748035 | Isomeric subspecies | SLM:000000084 | NaN | SLM:000000048 (sn1) / SLM:000001196 (sn2) | InChIKey=ODNYDZLXLRZPCJ-GPTQCAHZSA-L | C35H61N3O15P2 | -2.0 | NaN | 827.373442 | ... | 834.388898 | 845.407268 | 826.366166 | 862.342844 | 886.387295 | <NA> | NaN | NaN | NaN | NaN |
| 745174 | SLM:000748036 | Isomeric subspecies | SLM:000000084 | NaN | SLM:000000048 (sn1) / SLM:000000853 (sn2) | InChIKey=FJIBTCUXUBRYKG-QOTCTSOZSA-L | C37H65N3O15P2 | -2.0 | NaN | 855.404743 | ... | 862.420198 | 873.438568 | 854.397466 | 890.374144 | 914.418595 | <NA> | NaN | NaN | NaN | NaN |
| 745175 | SLM:000748037 | Isomeric subspecies | SLM:000000084 | NaN | SLM:000000048 (sn1) / SLM:000001197 (sn2) | InChIKey=AIBKQADSQWEVSS-HUKRWTLJSA-L | C42H75N3O15P2 | -2.0 | NaN | 925.482993 | ... | 932.498448 | 943.516818 | 924.475716 | 960.452394 | 984.496846 | <NA> | NaN | HMDB0116248 | NaN | NaN |
| 745176 | SLM:000748038 | Isomeric subspecies | SLM:000000084 | NaN | SLM:000000047 (sn2) / SLM:000000048 (sn1) | InChIKey=PIZFKSVTEGNINS-BQUKFSKHSA-L | C44H79N3O15P2 | -2.0 | NaN | 953.514293 | ... | 960.529748 | 971.548118 | 952.507016 | 988.483694 | 1012.528146 | <NA> | NaN | HMDB0116250 | NaN | NaN |
708725 rows × 24 columns
Checking split characters (/) in InChI key (pH7.3)
No rows found
Checking split characters (/) in Formula (pH7.3)
No rows found
Checking split characters (/) in Charge (pH7.3)
Not a string column
Checking split characters (/) in Mass (pH7.3)
Not a string column
Checking split characters (/) in Exact Mass (neutral form)
Not a string column
Checking split characters (/) in Exact m/z of [M.]+
Not a string column
Checking split characters (/) in Exact m/z of [M+H]+
Not a string column
Checking split characters (/) in Exact m/z of [M+K]+
Not a string column
Checking split characters (/) in Exact m/z of [M+Na]+
Not a string column
Checking split characters (/) in Exact m/z of [M+Li]+
Not a string column
Checking split characters (/) in Exact m/z of [M+NH4]+
Not a string column
Checking split characters (/) in Exact m/z of [M-H]-
Not a string column
Checking split characters (/) in Exact m/z of [M+Cl]-
Not a string column
Checking split characters (/) in Exact m/z of [M+OAc]-
Not a string column
Checking split characters (/) in CHEBI
No rows found
Checking split characters (/) in LIPID MAPS
No rows found
Checking split characters (/) in HMDB
No rows found
Checking split characters (/) in MetaNetX
No rows found
Checking split characters (/) in PMID
No rows found
['Components*']
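As a rough illustration of what a check like this involves, the sketch below scans each string column of a toy dataframe for a delimiter. The function name find_cols_with_delimiter and the toy data are hypothetical. This is not LipiNet's check_for_split_characters implementation, just one way such a scan could be written with pandas.

```python
import pandas as pd


def find_cols_with_delimiter(df, delimiter="|"):
    """Return the names of string columns in which any value contains `delimiter`."""
    hits = []
    for col in df.columns:
        is_text = (pd.api.types.is_object_dtype(df[col])
                   or pd.api.types.is_string_dtype(df[col]))
        if not is_text:
            # Numeric columns cannot hold delimiters, so they are skipped
            continue
        # regex=False matches '|' literally; na=False counts missing values as "no match"
        if df[col].str.contains(delimiter, regex=False, na=False).any():
            hits.append(col)
    return hits


# Toy data (hypothetical values, loosely modelled on the SwissLipids columns)
toy = pd.DataFrame({
    "Lipid ID": ["SLM:1", "SLM:2"],
    "CHEBI": ["74546|82922", "70846"],
    "Components*": ["SLM:10 (sn1) / SLM:11 (sn2)", None],
    "Charge": [-2.0, 0.0],
})

pipe_cols = find_cols_with_delimiter(toy, "|")
slash_cols = find_cols_with_delimiter(toy, "/")
print(pipe_cols, slash_cols)
```

Passing regex=False matters for the | character, because in a regular expression | means "or" and would otherwise match every string.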
The multiple entries in Lipid class* (separated by |, as found above) are important to take into account for our class hierarchy, because if we don’t, many of the Class-level entries will become disconnected in the ontology.
To help us handle this we will split such entries into separate rows using the split_and_expand_large utility function, but we will come back to this a bit later…
For now we will also add another column for components, so that later we can have both the actual component with its location (e.g. sn1) and a parsed version where we just have the SwissLipids (SLM) ID.
df_swisslipids['Components_parsed'] = df_swisslipids['Components*']
Now we can melt to start creating the edges df
Building the edges df#
# # Split the 'Lipid class*' column into multiple rows
# df_swisslipids_splitexp = split_and_expand_large(
# df_swisslipids, #.assign(from_layer_col='swisslipids')
# split_col='Lipid class*',
# expand_cols=['Lipid ID', 'Level', 'Name', 'Abbreviation*',
# 'CHEBI', 'LIPID MAPS', 'HMDB', 'MetaNetX', 'PMID','Synonyms*','Parent','Components*','Components_parsed'], #'from_layer_col'
# delimiter='|'
# )
df_swisslipids_edges = pd.melt(df_swisslipids, #df_swisslipids_splitexp
id_vars=['Lipid ID'],
value_vars=['CHEBI','LIPID MAPS','HMDB','MetaNetX','PMID','Lipid class*','Abbreviation*','Synonyms*','Parent','Components*','Components_parsed'],
var_name='melted_column', value_name='value')
df_swisslipids_edges
| | Lipid ID | melted_column | value |
|---|---|---|---|
| 0 | SLM:000000002 | CHEBI | 70846 |
| 1 | SLM:000000003 | CHEBI | 70771 |
| 2 | SLM:000000006 | CHEBI | 70829 |
| 3 | SLM:000000007 | CHEBI | 70775 |
| 4 | SLM:000000035 | CHEBI | 57817 |
| ... | ... | ... | ... |
| 8571734 | SLM:000782324 | Components_parsed | NaN |
| 8571735 | SLM:000782325 | Components_parsed | NaN |
| 8571736 | SLM:000782326 | Components_parsed | NaN |
| 8571737 | SLM:000782327 | Components_parsed | NaN |
| 8571738 | SLM:000782328 | Components_parsed | NaN |
8571739 rows × 3 columns
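To make the reshaping concrete, here is the same kind of pd.melt call on a tiny toy table. The identifiers and values are made up for illustration, and only two of the value columns are included.

```python
import pandas as pd

# Toy wide table: one row per lipid, one column per cross-reference
toy = pd.DataFrame({
    "Lipid ID": ["SLM:1", "SLM:2"],
    "CHEBI": ["70846", None],
    "HMDB": [None, "HMDB07859"],
})

# Melt to long format: one row per (lipid, column) pair
toy_edges = pd.melt(
    toy,
    id_vars=["Lipid ID"],
    value_vars=["CHEBI", "HMDB"],
    var_name="melted_column",
    value_name="value",
)
print(toy_edges)
```

Every lipid gets one row per melted column, whether or not it has a value there. This is why the melted dataframe is so much longer than the input and contains so many missing values.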
df_swisslipids_edges['value'].value_counts()
value
SLM:000000353 132652
SLM:000000377 98788
SLM:000000102 80209
SLM:000117148 46820
SLM:000000400 38514
...
TG(30:0/26:0/22:0) 1
TG(30:0/24:0/24:0) 1
TG(30:0/22:0/26:0) 1
TG(30:0/20:0/28:0) 1
NAPE (15:0-13me/34:5(16Z,19Z,22Z,25Z,28Z)/18:3(6Z,9Z,12Z)) 1
Name: count, Length: 2342278, dtype: int64
Because we have so many missing values, we should mark them explicitly as nulls rather than leaving any of them as 'nan' strings.
df_swisslipids_edges = df_swisslipids_edges.replace(['nan'], pd.NA).copy() # convert 'nan' strings to real nulls so that dropna() in the next step catches them
df_swisslipids_edges
| | Lipid ID | melted_column | value |
|---|---|---|---|
| 0 | SLM:000000002 | CHEBI | 70846 |
| 1 | SLM:000000003 | CHEBI | 70771 |
| 2 | SLM:000000006 | CHEBI | 70829 |
| 3 | SLM:000000007 | CHEBI | 70775 |
| 4 | SLM:000000035 | CHEBI | 57817 |
| ... | ... | ... | ... |
| 8571734 | SLM:000782324 | Components_parsed | NaN |
| 8571735 | SLM:000782325 | Components_parsed | NaN |
| 8571736 | SLM:000782326 | Components_parsed | NaN |
| 8571737 | SLM:000782327 | Components_parsed | NaN |
| 8571738 | SLM:000782328 | Components_parsed | NaN |
8571739 rows × 3 columns
Because this melt operation also resulted in a large number of null values, which probably mean nothing to us in this case, we will drop instances where the value is null
df_swisslipids_edges = df_swisslipids_edges.dropna(subset='value')
df_swisslipids_edges
| | Lipid ID | melted_column | value |
|---|---|---|---|
| 0 | SLM:000000002 | CHEBI | 70846 |
| 1 | SLM:000000003 | CHEBI | 70771 |
| 2 | SLM:000000006 | CHEBI | 70829 |
| 3 | SLM:000000007 | CHEBI | 70775 |
| 4 | SLM:000000035 | CHEBI | 57817 |
| ... | ... | ... | ... |
| 8571494 | SLM:000781997 | Components_parsed | SLM:000000856 (n-acyl) |
| 8571495 | SLM:000781998 | Components_parsed | SLM:000389154 (n-acyl) |
| 8571496 | SLM:000781999 | Components_parsed | SLM:000485643 (n-acyl) |
| 8571497 | SLM:000782000 | Components_parsed | SLM:000485644 (n-acyl) |
| 8571498 | SLM:000782001 | Components_parsed | SLM:000485645 (n-acyl) |
4678499 rows × 3 columns
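The difference between a 'nan' string and a real null is easy to miss, so here is a small sketch on toy data (the values are hypothetical). It shows that dropna() only removes real nulls, which is why the replace step comes first.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "Lipid ID": ["SLM:1", "SLM:2", "SLM:3"],
    "melted_column": ["CHEBI", "CHEBI", "CHEBI"],
    "value": ["70846", "nan", None],  # one real value, one 'nan' string, one real null
})

# Only the real null is detected; the 'nan' string looks like ordinary text
n_null_before = int(toy_edges["value"].isna().sum())

# Convert the 'nan' string to a proper null, then drop rows without a value
toy_clean = toy_edges.replace("nan", pd.NA).dropna(subset=["value"])
print(n_null_before, len(toy_clean))
```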
There are still some things we need to tidy up so that it is in a suitable format for OnionNet
df_swisslipids_edges = df_swisslipids_edges.copy()
df_swisslipids_edges['source_layer'] = 'swisslipids'
df_swisslipids_edges.rename(columns={'Lipid ID':'source_id', 'melted_column':'target_layer', 'value':'target_id'}, inplace=True)
df_swisslipids_edges = df_swisslipids_edges[['source_layer','source_id','target_layer','target_id']]
df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: 'swisslipids' if x=='Lipid class*' else f"sl_{str(x).replace(' ','').strip('*').lower()}")
#df_swisslipids_edges['target_layer'] = df_swisslipids_edges['target_layer'].map(lambda x: )
df_swisslipids_edges
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | swisslipids | SLM:000000002 | sl_chebi | 70846 |
| 1 | swisslipids | SLM:000000003 | sl_chebi | 70771 |
| 2 | swisslipids | SLM:000000006 | sl_chebi | 70829 |
| 3 | swisslipids | SLM:000000007 | sl_chebi | 70775 |
| 4 | swisslipids | SLM:000000035 | sl_chebi | 57817 |
| ... | ... | ... | ... | ... |
| 8571494 | swisslipids | SLM:000781997 | sl_components_parsed | SLM:000000856 (n-acyl) |
| 8571495 | swisslipids | SLM:000781998 | sl_components_parsed | SLM:000389154 (n-acyl) |
| 8571496 | swisslipids | SLM:000781999 | sl_components_parsed | SLM:000485643 (n-acyl) |
| 8571497 | swisslipids | SLM:000782000 | sl_components_parsed | SLM:000485644 (n-acyl) |
| 8571498 | swisslipids | SLM:000782001 | sl_components_parsed | SLM:000485645 (n-acyl) |
4678499 rows × 4 columns
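The lambda used for target_layer packs several steps into one line. Written out as a named function (to_layer_name is a hypothetical name used only here), the same logic can be tried on a few column names:

```python
def to_layer_name(column_name):
    """Map an original SwissLipids column name to a layer name."""
    # 'Lipid class*' points at other SwissLipids entries, so it stays in the main layer
    if column_name == "Lipid class*":
        return "swisslipids"
    # Otherwise: remove spaces, strip the '*' marker, lower-case, and prefix with 'sl_'
    cleaned = str(column_name).replace(" ", "").strip("*").lower()
    return f"sl_{cleaned}"


examples = {
    name: to_layer_name(name)
    for name in ["Lipid class*", "LIPID MAPS", "Abbreviation*", "Components_parsed"]
}
print(examples)
```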
For rows that go from swisslipids to swisslipids, we want to swap the source and the target. Currently the target in these rows is the parent class, but it is better to have the parent point towards its children. That way the root node is the one with multiple outgoing edges and no incoming edges.
Be sure to only run this once, otherwise it will switch back again…
# Identify rows where both source_layer and target_layer are 'swisslipids'
condition = (df_swisslipids_edges["source_layer"] == "swisslipids") & (df_swisslipids_edges["target_layer"] == "swisslipids")
# Swap the columns for rows satisfying the condition
df_swisslipids_edges.loc[condition, ["source_layer", "source_id", "target_layer", "target_id"]] = df_swisslipids_edges.loc[condition, ["target_layer", "target_id", "source_layer", "source_id"]].values
# Output the modified DataFrame
df_swisslipids_edges
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | swisslipids | SLM:000000002 | sl_chebi | 70846 |
| 1 | swisslipids | SLM:000000003 | sl_chebi | 70771 |
| 2 | swisslipids | SLM:000000006 | sl_chebi | 70829 |
| 3 | swisslipids | SLM:000000007 | sl_chebi | 70775 |
| 4 | swisslipids | SLM:000000035 | sl_chebi | 57817 |
| ... | ... | ... | ... | ... |
| 8571494 | swisslipids | SLM:000781997 | sl_components_parsed | SLM:000000856 (n-acyl) |
| 8571495 | swisslipids | SLM:000781998 | sl_components_parsed | SLM:000389154 (n-acyl) |
| 8571496 | swisslipids | SLM:000781999 | sl_components_parsed | SLM:000485643 (n-acyl) |
| 8571497 | swisslipids | SLM:000782000 | sl_components_parsed | SLM:000485644 (n-acyl) |
| 8571498 | swisslipids | SLM:000782001 | sl_components_parsed | SLM:000485645 (n-acyl) |
4678499 rows × 4 columns
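Because the in-place swap flips back when the cell is run a second time, one option is to wrap it in a function that always starts from the unswapped edges and returns a new dataframe. The sketch below shows this on toy data. The function name flip_intralayer_edges and the identifiers are hypothetical, and this is an alternative pattern rather than what LipiNet does internally.

```python
import pandas as pd


def flip_intralayer_edges(edges, layer="swisslipids"):
    """Return a copy in which source and target ids are swapped for edges inside `layer`."""
    flipped = edges.copy()
    same_layer = (edges["source_layer"] == layer) & (edges["target_layer"] == layer)
    # Both layers are identical for these rows, so only the ids need swapping.
    # .to_numpy() drops the column labels so values are assigned by position.
    flipped.loc[same_layer, ["source_id", "target_id"]] = (
        edges.loc[same_layer, ["target_id", "source_id"]].to_numpy()
    )
    return flipped


raw_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:child", "SLM:child"],
    "target_layer": ["swisslipids", "sl_chebi"],
    "target_id": ["SLM:parent", "70846"],
})

flipped_once = flip_intralayer_edges(raw_edges)
flipped_again = flip_intralayer_edges(raw_edges)  # re-running gives the same result
print(flipped_once)
```

Since raw_edges is never modified, the result is the same no matter how many times the cell is executed.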
df_swisslipids_edges['target_layer'].value_counts()
target_layer
swisslipids 779247
sl_abbreviation 776464
sl_components 765323
sl_components_parsed 765323
sl_synonyms 548163
sl_metanetx 505003
sl_parent 493491
sl_hmdb 26026
sl_lipidmaps 12117
sl_chebi 4276
sl_pmid 3066
Name: count, dtype: int64
Now let’s return to two items on our todo list:
splitting values that have multi-identifiers
trimming/parsing the components col
edges_with_multilinks = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)]
edges_with_multilinks
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 465 | swisslipids | SLM:000000784 | sl_chebi | 74546\|82922 |
| 387185 | swisslipids | SLM:000389154 | sl_chebi | 82731\|82731 |
| 595221 | swisslipids | SLM:000598072 | sl_chebi | 17336\|83228 |
| 3116996 | swisslipids | SLM:000000002 | sl_pmid | \| 11443131 \| 14685263 \| 18390550 \| 21325339 \| ... |
| 3116999 | swisslipids | SLM:000000007 | sl_pmid | 14685263 \| 21926990 \| 9603947 |
| ... | ... | ... | ... | ... |
| 6199835 | swisslipids | SLM:000747954 | sl_synonyms | 1,2-di-(13-methyltetradecanoyl)-sn-glycero-3-c... |
| 6199836 | swisslipids | SLM:000747955 | sl_synonyms | 1-(13-methyltetradecanoyl)-2-(15-methylhexadec... |
| 6199918 | swisslipids | SLM:000748037 | sl_synonyms | 1-(15-methylhexadecanoyl)-2-(11-methyldodecano... |
| 6199919 | swisslipids | SLM:000748038 | sl_synonyms | 1-(15-methylhexadecanoyl)-2-(13-methyltetradec... |
| 6199920 | swisslipids | SLM:000748039 | sl_synonyms | 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy... |
30942 rows × 4 columns
edges_with_multilinks.value_counts('target_layer')
target_layer
sl_synonyms 19853
sl_abbreviation 9768
sl_pmid 1318
sl_chebi 3
Name: count, dtype: int64
edges_with_multilinks_split = split_and_expand_large(edges_with_multilinks,
split_col='target_id',
expand_cols=['source_layer','source_id','target_layer'],
delimiter='|').drop_duplicates()
edges_with_multilinks_split
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | swisslipids | SLM:000000784 | sl_chebi | 74546 |
| 1 | swisslipids | SLM:000000784 | sl_chebi | 82922 |
| 2 | swisslipids | SLM:000389154 | sl_chebi | 82731 |
| 4 | swisslipids | SLM:000598072 | sl_chebi | 17336 |
| 5 | swisslipids | SLM:000598072 | sl_chebi | 83228 |
| ... | ... | ... | ... | ... |
| 68383 | swisslipids | SLM:000748037 | sl_synonyms | CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(11Z)) |
| 68384 | swisslipids | SLM:000748038 | sl_synonyms | 1-(15-methylhexadecanoyl)-2-(13-methyltetradec... |
| 68385 | swisslipids | SLM:000748038 | sl_synonyms | CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(9Z)) |
| 68386 | swisslipids | SLM:000748039 | sl_synonyms | 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy... |
| 68387 | swisslipids | SLM:000748039 | sl_synonyms | CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:2(9Z,12Z)) |
68379 rows × 4 columns
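For readers without access to split_and_expand_large, the same kind of result can be approximated on small data with plain pandas, using str.split followed by explode. This is a stand-in sketch on toy rows taken from the output above, not the LipiNet utility itself, which is presumably written to cope with much larger inputs.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:000000784", "SLM:000389154"],
    "target_layer": ["sl_chebi", "sl_chebi"],
    "target_id": ["74546|82922", "82731|82731"],
})

toy_split = (
    toy_edges
    # A single-character pattern is treated as a literal string by str.split
    .assign(target_id=toy_edges["target_id"].str.split("|"))
    # One row per list element; the other columns are repeated
    .explode("target_id")
    # '82731|82731' yields the same edge twice, so duplicates are dropped
    .drop_duplicates()
    .reset_index(drop=True)
)
print(toy_split)
```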
Next we clean up the results: empty or whitespace-only strings are turned into NaN, and leading and trailing spaces that sat around the split characters are stripped.
edges_with_multilinks_split = clean_missing_strings(edges_with_multilinks_split)
edges_with_multilinks_split
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | swisslipids | SLM:000000784 | sl_chebi | 74546 |
| 1 | swisslipids | SLM:000000784 | sl_chebi | 82922 |
| 2 | swisslipids | SLM:000389154 | sl_chebi | 82731 |
| 4 | swisslipids | SLM:000598072 | sl_chebi | 17336 |
| 5 | swisslipids | SLM:000598072 | sl_chebi | 83228 |
| ... | ... | ... | ... | ... |
| 68383 | swisslipids | SLM:000748037 | sl_synonyms | CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(11Z)) |
| 68384 | swisslipids | SLM:000748038 | sl_synonyms | 1-(15-methylhexadecanoyl)-2-(13-methyltetradec... |
| 68385 | swisslipids | SLM:000748038 | sl_synonyms | CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:1(9Z)) |
| 68386 | swisslipids | SLM:000748039 | sl_synonyms | 1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cy... |
| 68387 | swisslipids | SLM:000748039 | sl_synonyms | CDP-DG(22:6(4Z,7Z,10Z,13Z,16Z,19Z)/18:2(9Z,12Z)) |
68379 rows × 4 columns
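To show the kind of clean-up described above, here is a minimal pandas sketch that strips whitespace and turns empty strings into nulls. The helper name clean_strings_sketch is hypothetical and this is not the clean_missing_strings implementation. Empty pieces can arise, for instance, when a value starts with a delimiter, which leaves an empty first piece after splitting.

```python
import pandas as pd


def clean_strings_sketch(df, cols):
    """Strip surrounding whitespace in `cols` and replace empty strings with nulls."""
    out = df.copy()
    for col in cols:
        stripped = out[col].str.strip()
        # Where nothing is left after stripping, store a proper null instead
        out[col] = stripped.mask(stripped == "", pd.NA)
    return out


toy = pd.DataFrame({"target_id": [" 11443131 ", "", "   ", "14685263"]})
toy_clean = clean_strings_sketch(toy, ["target_id"])
print(toy_clean)
```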
edges_with_multilinks_split['target_id'].value_counts(dropna=False)
target_id
18390550 87
23670529 86
20431113 77
19603071 70
24068966 53
..
Phosphatidylcholine (O-18:1(11Z)/16:2(9Z,12Z)) 1
1-(11Z-octadecenyl)-2-(9Z,12Z-octadecadienoyl)-sn-glycero-3-phosphocholine 1
Phosphatidylcholine (O-18:1(11Z)/18:2(9Z,12Z)) 1
1-(11Z-octadecenyl)-2-(9Z-hexadecenoyl)-sn-glycero-3-phosphocholine 1
1,2-di-(15-methylhexadecanoyl)-sn-glycero-3-cytidine-5'-diphosphate 1
Name: count, Length: 59378, dtype: int64
edges_with_multilinks_split[edges_with_multilinks_split['target_id'].isna()]
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 6 | swisslipids | SLM:000000002 | sl_pmid | <NA> |
| 278 | swisslipids | SLM:000000272 | sl_pmid | <NA> |
| 4533 | swisslipids | SLM:000001020 | sl_pmid | <NA> |
| 4546 | swisslipids | SLM:000001022 | sl_pmid | <NA> |
| 4550 | swisslipids | SLM:000001023 | sl_pmid | <NA> |
| 4553 | swisslipids | SLM:000001024 | sl_pmid | <NA> |
| 4586 | swisslipids | SLM:000001032 | sl_pmid | <NA> |
| 4646 | swisslipids | SLM:000001036 | sl_pmid | <NA> |
Note that there are only 8 rows where the target_id is missing, so it is probably fine to handle these downstream.
# edges_with_multilinks_split = edges_with_multilinks_split[~edges_with_multilinks_split['target_id'].isna()].copy()
edges_with_multilinks_split.shape
(68379, 4)
What about source_id? Looks like it has no missing source_ids
edges_with_multilinks_split[edges_with_multilinks_split['source_id'].isna()].shape
(0, 4)
This is good, but we also need to remember the separators in the components column
edges_with_multilinks2 = df_swisslipids_edges[df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)]
edges_with_multilinks2
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 7013405 | swisslipids | SLM:000000422 | sl_components | SLM:000000418 (sn2) / SLM:000000510 (sn1) |
| 7013470 | swisslipids | SLM:000000498 | sl_components | SLM:000000296 (sn2) / SLM:000000826 (sn1) |
| 7013510 | swisslipids | SLM:000000557 | sl_components | SLM:000000510 (sn1) / SLM:000000826 (sn2) |
| 7013573 | swisslipids | SLM:000000636 | sl_components | SLM:000000418 (sn1) / SLM:000000510 (sn2) |
| 7013574 | swisslipids | SLM:000000637 | sl_components | SLM:000000418 (sn1) / SLM:000000826 (sn2) |
| ... | ... | ... | ... | ... |
| 8537662 | swisslipids | SLM:000748034 | sl_components_parsed | SLM:000000048 (sn1) / SLM:000001195 (sn2) |
| 8537663 | swisslipids | SLM:000748035 | sl_components_parsed | SLM:000000048 (sn1) / SLM:000001196 (sn2) |
| 8537664 | swisslipids | SLM:000748036 | sl_components_parsed | SLM:000000048 (sn1) / SLM:000000853 (sn2) |
| 8537665 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000000048 (sn1) / SLM:000001197 (sn2) |
| 8537666 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000047 (sn2) / SLM:000000048 (sn1) |
1417450 rows × 4 columns
edges_with_multilinks2_split = split_and_expand_large(edges_with_multilinks2,
split_col='target_id',
expand_cols=['source_layer','source_id','target_layer'],
delimiter='/').drop_duplicates()
edges_with_multilinks2_split
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | swisslipids | SLM:000000422 | sl_components | SLM:000000418 (sn2) |
| 1 | swisslipids | SLM:000000422 | sl_components | SLM:000000510 (sn1) |
| 2 | swisslipids | SLM:000000498 | sl_components | SLM:000000296 (sn2) |
| 3 | swisslipids | SLM:000000498 | sl_components | SLM:000000826 (sn1) |
| 4 | swisslipids | SLM:000000557 | sl_components | SLM:000000510 (sn1) |
| ... | ... | ... | ... | ... |
| 3592487 | swisslipids | SLM:000748036 | sl_components_parsed | SLM:000000853 (sn2) |
| 3592488 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000000048 (sn1) |
| 3592489 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000001197 (sn2) |
| 3592490 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000047 (sn2) |
| 3592491 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000048 (sn1) |
3592492 rows × 4 columns
Now let’s also clean this up in case we have whitespace or empty strings etc.
edges_with_multilinks2_split = clean_missing_strings(edges_with_multilinks2_split)
Now let’s also strip the bracketed position annotation, e.g. (sn1), from the parsed components so that these can be linked directly to the other SLM IDs if needed
# Apply transformation only for rows where target_layer equals 'sl_components_parsed'
mask = edges_with_multilinks2_split['target_layer'] == 'sl_components_parsed'
edges_with_multilinks2_split.loc[mask, 'target_id'] = edges_with_multilinks2_split.loc[mask, 'target_id'].str.split('(').str[0].str.strip()
edges_with_multilinks2_split
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | swisslipids | SLM:000000422 | sl_components | SLM:000000418 (sn2) |
| 1 | swisslipids | SLM:000000422 | sl_components | SLM:000000510 (sn1) |
| 2 | swisslipids | SLM:000000498 | sl_components | SLM:000000296 (sn2) |
| 3 | swisslipids | SLM:000000498 | sl_components | SLM:000000826 (sn1) |
| 4 | swisslipids | SLM:000000557 | sl_components | SLM:000000510 (sn1) |
| ... | ... | ... | ... | ... |
| 3592487 | swisslipids | SLM:000748036 | sl_components_parsed | SLM:000000853 |
| 3592488 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000000048 |
| 3592489 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000001197 |
| 3592490 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000047 |
| 3592491 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000048 |
3592492 rows × 4 columns
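The string handling in this step can be tried in isolation on a few component values. The sketch below strips the annotation as above and, as an optional extra that is not part of the notebook's pipeline, also extracts the annotation itself into a separate series.

```python
import pandas as pd

components = pd.Series([
    "SLM:000000418 (sn2)",
    "SLM:000000856 (n-acyl)",
    "SLM:000000048",  # no annotation at all
])

# Keep everything before the first '(' and trim the trailing space
slm_ids = components.str.split("(").str[0].str.strip()

# Optional: capture the text between the brackets (missing if there are none)
positions = components.str.extract(r"\(([^)]*)\)", expand=False)

print(slm_ids.tolist())
print(positions.tolist())
```

One thing worth checking in your own run: the mask above is applied to edges_with_multilinks2_split, which only holds rows that contained a /, so single-component values such as SLM:000000856 (n-acyl) would keep their annotation unless they are stripped elsewhere.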
Now we need a way to change these original rows where they had multilinks and add back the corrected ones.
# Identify rows with multilinks (either '|' or '/' with the specific target_layer condition)
mask_pipe = df_swisslipids_edges['target_id'].str.contains('|', regex=False, na=False)
mask_slash = (
df_swisslipids_edges['target_id'].str.contains('/', regex=False, na=False) &
df_swisslipids_edges['target_layer'].str.contains('sl_components', regex=False, na=False)
)
mask_problem = mask_pipe | mask_slash
# Remove these rows from the original df
df_clean = df_swisslipids_edges[~mask_problem].copy()
# Now, combine the cleaned df with the corrected edges dataframes.
# These corrected dataframes are assumed to be:
# - edges_with_multilinks_split
# - edges_with_multilinks2_split
df_swisslipids_edges = pd.concat([df_clean, edges_with_multilinks_split, edges_with_multilinks2_split], ignore_index=True)
# Clean up empty strings again or leading/trailing spaces
df_swisslipids_edges = clean_missing_strings(df_swisslipids_edges)
# (Optional) Drop any duplicate rows that might arise
df_swisslipids_edges = df_swisslipids_edges.drop_duplicates()
# df_swisslipids_edges now contains the original "good" rows plus the corrected edges.
df_swisslipids_edges
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | swisslipids | SLM:000000002 | sl_chebi | 70846 |
| 1 | swisslipids | SLM:000000003 | sl_chebi | 70771 |
| 2 | swisslipids | SLM:000000006 | sl_chebi | 70829 |
| 3 | swisslipids | SLM:000000007 | sl_chebi | 70775 |
| 4 | swisslipids | SLM:000000035 | sl_chebi | 57817 |
| ... | ... | ... | ... | ... |
| 6890973 | swisslipids | SLM:000748036 | sl_components_parsed | SLM:000000853 |
| 6890974 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000000048 |
| 6890975 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000001197 |
| 6890976 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000047 |
| 6890977 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000048 |
6890966 rows × 4 columns
Now we will determine whether the edge is within the same layer (intralayer) or between different layers (interlayer)
def assess_edge_layertype(df):
    # True where the edge connects two different layers, False for intralayer edges
    interlayer = df['source_layer']!=df['target_layer']
    df['interlayer'] = interlayer
    return df
df_swisslipids_edges = assess_edge_layertype(df_swisslipids_edges)
df_swisslipids_edges
| | source_layer | source_id | target_layer | target_id | interlayer |
|---|---|---|---|---|---|
| 0 | swisslipids | SLM:000000002 | sl_chebi | 70846 | True |
| 1 | swisslipids | SLM:000000003 | sl_chebi | 70771 | True |
| 2 | swisslipids | SLM:000000006 | sl_chebi | 70829 | True |
| 3 | swisslipids | SLM:000000007 | sl_chebi | 70775 | True |
| 4 | swisslipids | SLM:000000035 | sl_chebi | 57817 | True |
| ... | ... | ... | ... | ... | ... |
| 6890973 | swisslipids | SLM:000748036 | sl_components_parsed | SLM:000000853 | True |
| 6890974 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000000048 | True |
| 6890975 | swisslipids | SLM:000748037 | sl_components_parsed | SLM:000001197 | True |
| 6890976 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000047 | True |
| 6890977 | swisslipids | SLM:000748038 | sl_components_parsed | SLM:000000048 | True |
6890966 rows × 5 columns
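Note that assess_edge_layertype adds the column to the dataframe it is given, so the input is modified as well as returned. If you prefer to leave the input untouched, a non-mutating variant with assign is sketched below on toy data (the identifiers are made up).

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:parent", "SLM:child"],
    "target_layer": ["swisslipids", "sl_chebi"],
    "target_id": ["SLM:child", "70846"],
})

# assign returns a new dataframe; toy_edges itself keeps its four columns
toy_flagged = toy_edges.assign(
    interlayer=toy_edges["source_layer"] != toy_edges["target_layer"]
)
print(toy_flagged)
```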
Now we will build the node df
Building the node df#
df_swisslipids_nodes = create_nodedf_from_edgedf(edge_df=df_swisslipids_edges, props=['layer', 'id'], cols=['layer', 'node_id'])
df_swisslipids_nodes
| | layer | node_id |
|---|---|---|
| 0 | swisslipids | SLM:000000002 |
| 1 | swisslipids | SLM:000000003 |
| 2 | swisslipids | SLM:000000006 |
| 3 | swisslipids | SLM:000000007 |
| 4 | swisslipids | SLM:000000035 |
| ... | ... | ... |
| 13781927 | sl_components_parsed | SLM:000000853 |
| 13781928 | sl_components_parsed | SLM:000000048 |
| 13781929 | sl_components_parsed | SLM:000001197 |
| 13781930 | sl_components_parsed | SLM:000000047 |
| 13781931 | sl_components_parsed | SLM:000000048 |
13781932 rows × 2 columns
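The node table above has exactly twice as many rows as the edge table, which is consistent with stacking the source (layer, id) pairs on top of the target (layer, id) pairs. A minimal sketch of that idea is shown below. The function nodes_from_edges_sketch is hypothetical and is an assumption about the general approach, not the code of create_nodedf_from_edgedf.

```python
import pandas as pd


def nodes_from_edges_sketch(edge_df):
    """Stack source and target (layer, id) pairs into one node table."""
    sources = edge_df[["source_layer", "source_id"]].rename(
        columns={"source_layer": "layer", "source_id": "node_id"}
    )
    targets = edge_df[["target_layer", "target_id"]].rename(
        columns={"target_layer": "layer", "target_id": "node_id"}
    )
    return pd.concat([sources, targets], ignore_index=True)


toy_edges = pd.DataFrame({
    "source_layer": ["swisslipids", "swisslipids"],
    "source_id": ["SLM:1", "SLM:1"],
    "target_layer": ["sl_chebi", "sl_hmdb"],
    "target_id": ["70846", "HMDB07859"],
})

all_nodes = nodes_from_edges_sketch(toy_edges)
unique_nodes = all_nodes.drop_duplicates()
print(len(all_nodes), len(unique_nodes))
```

A node that takes part in many edges appears once per edge in the stacked table, which is why dropping duplicates is needed afterwards.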
Let’s also see how many are duplicates
df_swisslipids_nodes.value_counts(dropna=False)
layer node_id
swisslipids SLM:000000353 132660
SLM:000000377 98800
SLM:000000102 80218
SLM:000117148 46826
SLM:000000400 38525
...
sl_metanetx MNXM312433 1
MNXM312434 1
MNXM312435 1
MNXM312436 1
swisslipids SLM:000782332 1
Name: count, Length: 2779078, dtype: int64
# Pre-emptively dropping duplicates before the merge
df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes.shape
(2779078, 2)
Now let’s merge the nodes with the information from earlier to create richer node attributes
df_swisslipids_nodes = pd.merge(df_swisslipids_nodes, df_swisslipids.assign(from_layer_col='swisslipids'),
left_on=['layer','node_id'], right_on=['from_layer_col','Lipid ID'],
how='outer')
df_swisslipids_nodes
| | layer | node_id | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | ... | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID | Components_parsed | from_layer_col |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sl_abbreviation | (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | sl_abbreviation | (10,11S,12R)-TriHETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | sl_abbreviation | (10R)-H-(11S,12S)-Ep-(5Z,8Z,14Z)-ETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | sl_abbreviation | (10R)-H-(11S,12S)-EpETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | sl_abbreviation | (10R)-H-(8S,9S)-Ep-(5Z,11Z,14Z)-ETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2779073 | swisslipids | SLM:000782328 | SLM:000782328 | NaN | oxidized 2-acylglycerol | NaN | NaN | SLM:000000355 | NaN | NaN | ... | NaN | NaN | NaN | 167117 | NaN | NaN | NaN | NaN | NaN | swisslipids |
| 2779074 | swisslipids | SLM:000782329 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2779075 | swisslipids | SLM:000782330 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2779076 | swisslipids | SLM:000782331 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2779077 | swisslipids | SLM:000782332 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2779078 rows × 33 columns
If this has any more duplicates in it for some reason, let’s remove them. We also drop from_layer_col, which carries no meaning here: it is just a relic of joining back to the initial df that we used to create the edges (this could probably be tidied up).
df_swisslipids_nodes = df_swisslipids_nodes.drop_duplicates()
df_swisslipids_nodes = df_swisslipids_nodes.drop(columns='from_layer_col')
df_swisslipids_nodes
| | layer | node_id | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | ... | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID | Components_parsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sl_abbreviation | (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | sl_abbreviation | (10,11S,12R)-TriHETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | sl_abbreviation | (10R)-H-(11S,12S)-Ep-(5Z,8Z,14Z)-ETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | sl_abbreviation | (10R)-H-(11S,12S)-EpETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | sl_abbreviation | (10R)-H-(8S,9S)-Ep-(5Z,11Z,14Z)-ETrE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2779073 | swisslipids | SLM:000782328 | SLM:000782328 | NaN | oxidized 2-acylglycerol | NaN | NaN | SLM:000000355 | NaN | NaN | ... | NaN | NaN | NaN | NaN | 167117 | NaN | NaN | NaN | NaN | NaN |
| 2779074 | swisslipids | SLM:000782329 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2779075 | swisslipids | SLM:000782330 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2779076 | swisslipids | SLM:000782331 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2779077 | swisslipids | SLM:000782332 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2779078 rows × 32 columns
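With an outer merge it can be useful to see which rows matched on both sides. The sketch below repeats the same merge pattern on toy data with pandas' indicator=True option, which adds a _merge column. This is an optional diagnostic and not part of the pipeline above, and the names and values are made up.

```python
import pandas as pd

nodes = pd.DataFrame({
    "layer": ["swisslipids", "swisslipids", "sl_chebi"],
    "node_id": ["SLM:1", "SLM:9", "70846"],
})
attrs = pd.DataFrame({
    "Lipid ID": ["SLM:1", "SLM:2"],
    "Name": ["lipid one", "lipid two"],
})

merged = pd.merge(
    nodes,
    attrs.assign(from_layer_col="swisslipids"),
    left_on=["layer", "node_id"],
    right_on=["from_layer_col", "Lipid ID"],
    how="outer",
    indicator=True,  # adds a '_merge' column: 'both', 'left_only' or 'right_only'
)
counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked left_only are nodes without attributes (all non-swisslipids layers, plus any SLM IDs missing from the table), while right_only rows would be table entries that never appear in an edge and therefore have no layer or node_id.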
df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()]
| | layer | node_id | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | ... | Exact m/z of [M+NH4]+ | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID | Components_parsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1464984 | sl_pmid | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 32 columns
print(df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()].shape)
print(df_swisslipids_nodes[df_swisslipids_nodes['layer'].isna()].shape)
(1, 32)
(0, 32)
Even though OnionNet can handle NaN values and remove them downstream, it is safer to drop the cases where either the node_id or the layer is missing now. They serve us no purpose anyway!
df_swisslipids_nodes = df_swisslipids_nodes.dropna(subset=['layer','node_id'])
print(df_swisslipids_nodes.shape)
print(df_swisslipids_nodes[df_swisslipids_nodes['node_id'].isna()].shape)
print(df_swisslipids_nodes[df_swisslipids_nodes['layer'].isna()].shape)
(2779077, 32)
(0, 32)
(0, 32)
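The check-then-drop steps above can be wrapped into one small helper. This is only a sketch on toy data: `drop_incomplete_nodes` and the toy values are hypothetical and are not part of LipiNet or onionnet.

```python
import pandas as pd

def drop_incomplete_nodes(df, key_cols=("layer", "node_id")):
    """Drop rows missing any key column and report how many were removed.

    Hypothetical helper for illustration; not a LipiNet function.
    """
    key_cols = list(key_cols)
    # A row is incomplete if at least one of the key columns is missing
    n_missing = int(df[key_cols].isna().any(axis=1).sum())
    cleaned = df.dropna(subset=key_cols)
    print(f"Dropped {n_missing} of {len(df)} rows with missing {key_cols}")
    return cleaned

# Toy stand-in for df_swisslipids_nodes (made-up identifiers)
toy_nodes = pd.DataFrame({
    "layer": ["swisslipids", "sl_pmid", "sl_chebi"],
    "node_id": ["SLM:000000001", pd.NA, "12345"],
})
toy_clean = drop_incomplete_nodes(toy_nodes)
```

Reporting the count before dropping makes it easy to notice if a future SwissLipids release introduces many more incomplete rows than expected.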
df_swisslipids_nodes['layer'].value_counts()
layer
swisslipids 779312
sl_abbreviation 736949
sl_synonyms 534781
sl_metanetx 504880
sl_parent 184620
sl_hmdb 17232
sl_lipidmaps 12112
sl_chebi 4277
sl_components 1708
sl_components_parsed 1677
sl_pmid 1529
Name: count, dtype: int64
df_swisslipids_nodes['Level'].value_counts()
Level
Isomeric subspecies 592413
Structural subspecies 111867
Molecular subspecies 62516
Species 10347
Class 806
Category 7
Name: count, dtype: int64
We now have the nodes and edges DataFrames for SwissLipids and understand how we arrived at them. In practice you don’t have to go through this process every time: parse_swisslipids_data() does exactly this if you want the same network setup.
Ensuring equivalency#
We can also check that the output of the automatic parse_swisslipids_data() function and our manually processed data are equivalent.
We start by checking this for a single entry of the dataframe.
df_swisslipids_nodes.iloc[0]
layer sl_abbreviation
node_id (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE
Lipid ID NaN
Level NaN
Name NaN
Abbreviation* NaN
Synonyms* NaN
Lipid class* NaN
Parent NaN
Components* NaN
SMILES (pH7.3) NaN
InChI (pH7.3) NaN
InChI key (pH7.3) NaN
Formula (pH7.3) NaN
Charge (pH7.3) NaN
Mass (pH7.3) NaN
Exact Mass (neutral form) NaN
Exact m/z of [M.]+ NaN
Exact m/z of [M+H]+ NaN
Exact m/z of [M+K]+ NaN
Exact m/z of [M+Na]+ NaN
Exact m/z of [M+Li]+ NaN
Exact m/z of [M+NH4]+ NaN
Exact m/z of [M-H]- NaN
Exact m/z of [M+Cl]- NaN
Exact m/z of [M+OAc]- NaN
CHEBI NaN
LIPID MAPS NaN
HMDB NaN
MetaNetX NaN
PMID NaN
Components_parsed NaN
Name: 0, dtype: object
df_sl_nodes.iloc[0]
layer sl_abbreviation
node_id (10,11S,12R)-TriH-(5Z,8Z,14Z)-ETrE
Lipid ID NaN
Level NaN
Name NaN
Abbreviation* NaN
Synonyms* NaN
Lipid class* NaN
Parent NaN
Components* NaN
SMILES (pH7.3) NaN
InChI (pH7.3) NaN
InChI key (pH7.3) NaN
Formula (pH7.3) NaN
Charge (pH7.3) NaN
Mass (pH7.3) NaN
Exact Mass (neutral form) NaN
Exact m/z of [M.]+ NaN
Exact m/z of [M+H]+ NaN
Exact m/z of [M+K]+ NaN
Exact m/z of [M+Na]+ NaN
Exact m/z of [M+Li]+ NaN
Exact m/z of [M+NH4]+ NaN
Exact m/z of [M-H]- NaN
Exact m/z of [M+Cl]- NaN
Exact m/z of [M+OAc]- NaN
CHEBI NaN
LIPID MAPS NaN
HMDB NaN
MetaNetX NaN
PMID NaN
Components_parsed NaN
Name: 0, dtype: object
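The first entry looks good, but what about the entire DataFrame? We can use the pd.testing.assert_frame_equal function to check this. It returns nothing when the two DataFrames match and raises an AssertionError describing the first difference when they do not.
First, as a negative control, we compare df_swisslipids_nodes with df_swisslipids_edges. These are clearly different, so the assertion should fail and print the reason.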
try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_swisslipids_edges)
    print('DataFrames are equal')
except AssertionError as e:
    print(e)
DataFrame are different
DataFrame shape mismatch
[left]: (2779077, 32)
[right]: (6890966, 5)
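The try/except pattern used here can be folded into a small helper that returns a boolean. The sketch below uses toy DataFrames, and `frames_match` is a hypothetical name rather than a LipiNet function. It also illustrates two behaviours of pd.testing.assert_frame_equal that matter for this comparison: NaN values in matching positions count as equal, and row order matters unless check_like=True is passed.

```python
import pandas as pd

def frames_match(left, right, **kwargs):
    """Return True if assert_frame_equal passes; otherwise print why and return False.

    Hypothetical convenience wrapper for illustration only.
    """
    try:
        pd.testing.assert_frame_equal(left, right, **kwargs)
        return True
    except AssertionError as err:
        print(err)
        return False

# Toy frames (made-up values)
a = pd.DataFrame({"layer": ["swisslipids", "sl_chebi"], "node_id": ["A", "B"]})
b = a.copy()
c = a.iloc[::-1]  # same rows in reversed order, so the index is [1, 0]

identical = frames_match(a, b)                        # same content and order
reordered = frames_match(a, c)                        # fails: index order differs
reordered_ok = frames_match(a, c, check_like=True)    # ignores index/column order

# NaN in the same position on both sides is treated as equal
with_nan = pd.DataFrame({"x": [1.0, float("nan")]})
nan_equal = frames_match(with_nan, with_nan.copy())
```

Because the SwissLipids node table is mostly NaN, the NaN-aware comparison is what makes this check usable where an element-wise `==` would not be.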
Now let’s compare df_swisslipids_nodes with df_sl_nodes. If they are identical, no AssertionError is raised and the success message is printed. We will test the edges DataFrames in the same way.
try:
    pd.testing.assert_frame_equal(df_swisslipids_nodes, df_sl_nodes)
    print('DataFrames for nodes are equal')
except AssertionError as e:
    print(e)
DataFrames for nodes are equal
try:
    pd.testing.assert_frame_equal(df_swisslipids_edges, df_sl_edges)
    print('DataFrames for edges are equal')
except AssertionError as e:
    print(e)
DataFrames for edges are equal
Great! Both approaches produce the same DataFrames, which we will use in other parts of the package.
If they had differed, we could inspect the exact rows that differ with an outer merge and indicator=True. Here both results are empty, as expected.
diff = df_sl_edges.merge(df_swisslipids_edges, how='outer', indicator=True)
diff_rows_edges = diff[diff['_merge'] != 'both']
diff_rows_edges
| source_layer | source_id | target_layer | target_id | interlayer | _merge |
|---|---|---|---|---|---|
diff_rows_edges['_merge'].value_counts()
_merge
left_only 0
right_only 0
both 0
Name: count, dtype: int64
diff = df_sl_nodes.merge(df_swisslipids_nodes, how='outer', indicator=True)
diff_rows_nodes = diff[diff['_merge'] != 'both']
diff_rows_nodes
| layer | node_id | Lipid ID | Level | Name | Abbreviation* | Synonyms* | Lipid class* | Parent | Components* | ... | Exact m/z of [M-H]- | Exact m/z of [M+Cl]- | Exact m/z of [M+OAc]- | CHEBI | LIPID MAPS | HMDB | MetaNetX | PMID | Components_parsed | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 33 columns
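Since the diffs above are empty, it may help to see what the outer merge with indicator=True reports when two tables really do differ. The following is a toy sketch with made-up edge values, not SwissLipids data.

```python
import pandas as pd

# Two small edge tables that share some rows but not all (made-up values)
left = pd.DataFrame({
    "source_id": ["A", "A", "B"],
    "target_id": ["X", "Y", "Z"],
})
right = pd.DataFrame({
    "source_id": ["A", "B", "C"],
    "target_id": ["X", "Z", "W"],
})

# With no `on` argument, merge joins on all shared columns, so each row is
# matched as a whole. The `_merge` column records where each row was found.
merged = left.merge(right, how="outer", indicator=True)

# Keep only rows that are present in exactly one of the two tables
only_one_side = merged[merged["_merge"] != "both"]
counts = only_one_side["_merge"].value_counts()
print(only_one_side)
```

Here (A, Y) is flagged as left_only and (C, W) as right_only. Note that this approach compares row contents only: it ignores the index and row order, and a row duplicated a different number of times in each table would still be marked as both, so it complements assert_frame_equal rather than replacing it.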
As a final spot check, the edges for a single lipid (here SLM:000389145) should also be identical in both DataFrames.
df_sl_edges[df_sl_edges['source_id']=='SLM:000389145']
| | source_layer | source_id | target_layer | target_id | interlayer |
|---|---|---|---|---|---|
| 1640 | swisslipids | SLM:000389145 | sl_chebi | 18059 | True |
| 429400 | swisslipids | SLM:000389145 | sl_metanetx | MNXM12117 | True |
| 549344 | swisslipids | SLM:000389145 | swisslipids | SLM:000000436 | False |
| 549407 | swisslipids | SLM:000389145 | swisslipids | SLM:000000525 | False |
| 549887 | swisslipids | SLM:000389145 | swisslipids | SLM:000001193 | False |
| 665828 | swisslipids | SLM:000389145 | swisslipids | SLM:000117142 | False |
| 936914 | swisslipids | SLM:000389145 | swisslipids | SLM:000390054 | False |
| 1046948 | swisslipids | SLM:000389145 | swisslipids | SLM:000500463 | False |
| 1055230 | swisslipids | SLM:000389145 | swisslipids | SLM:000508860 | False |
| 1328368 | swisslipids | SLM:000389145 | swisslipids | SLM:000782283 | False |
df_swisslipids_edges[df_swisslipids_edges['source_id']=='SLM:000389145']
| | source_layer | source_id | target_layer | target_id | interlayer |
|---|---|---|---|---|---|
| 1640 | swisslipids | SLM:000389145 | sl_chebi | 18059 | True |
| 429400 | swisslipids | SLM:000389145 | sl_metanetx | MNXM12117 | True |
| 549344 | swisslipids | SLM:000389145 | swisslipids | SLM:000000436 | False |
| 549407 | swisslipids | SLM:000389145 | swisslipids | SLM:000000525 | False |
| 549887 | swisslipids | SLM:000389145 | swisslipids | SLM:000001193 | False |
| 665828 | swisslipids | SLM:000389145 | swisslipids | SLM:000117142 | False |
| 936914 | swisslipids | SLM:000389145 | swisslipids | SLM:000390054 | False |
| 1046948 | swisslipids | SLM:000389145 | swisslipids | SLM:000500463 | False |
| 1055230 | swisslipids | SLM:000389145 | swisslipids | SLM:000508860 | False |
| 1328368 | swisslipids | SLM:000389145 | swisslipids | SLM:000782283 | False |
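Rather than comparing the two filtered tables by eye, the same spot check can be done programmatically. The sketch below uses toy edge tables and a hypothetical helper, `same_rows_for`, which is not part of LipiNet. It relies on DataFrame.equals, which compares values, dtypes and the index, and treats NaN values in the same position as equal.

```python
import pandas as pd

def same_rows_for(df_a, df_b, column, value):
    """Return True if both frames hold identical rows where `column == value`.

    Hypothetical helper for illustration only.
    """
    sub_a = df_a[df_a[column] == value]
    sub_b = df_b[df_b[column] == value]
    return sub_a.equals(sub_b)

# Toy edge tables (made-up identifiers)
edges_a = pd.DataFrame({
    "source_id": ["S1", "S1", "S2"],
    "target_id": ["T1", "T2", "T3"],
    "interlayer": [True, False, False],
})
edges_b = edges_a.copy()
edges_c = edges_a.copy()
edges_c.loc[1, "target_id"] = "T9"  # introduce a single differing edge for S1

match_ab_s1 = same_rows_for(edges_a, edges_b, "source_id", "S1")
match_ac_s1 = same_rows_for(edges_a, edges_c, "source_id", "S1")
match_ac_s2 = same_rows_for(edges_a, edges_c, "source_id", "S2")
```

With the real data, the equivalent call would presumably be `same_rows_for(df_sl_edges, df_swisslipids_edges, 'source_id', 'SLM:000389145')`.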