Parse Reactome#
Parsing Reactome data into a network for LipiNet.
LipiNet offers convenient functions to parse prior knowledge resources straight into networks. For instance, LipiNet can parse Reactome data into a network as easily as running: parse_reactome_data()
However, to show what is happening behind the scenes, this notebook also walks through the data and each of the steps that this function performs in the background. This may be particularly helpful for users who need to customise the networks in a way that LipiNet does not yet support directly.
Using parse_reactome_data()#
The LipiNet parse_reactome_data() function automatically parses Reactome data into a network. This is what LipiNet uses as input to its overall combined network, and for most users who wish to build sub-networks from Reactome data alone, this function will probably suffice.
from lipinet.parse_reactome import parse_reactome_data
reactome_results = parse_reactome_data(verbose=True, use_cache=True)
df_reactome_nodes = reactome_results['df_nodes']
df_reactome_edges = reactome_results['df_edges']
↪ loading Reactome (processed) from cache: reactome_human_nb
To avoid repeatedly downloading the Reactome data (and putting unnecessary load on the Reactome servers), set use_cache=True. If no cache exists yet, the download is saved to the cache automatically; if a cache already exists, it is used instead.
To override the cache, set force_download=True. This is only recommended every few months, when you want to pick up changes in the source data.
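The caching behaviour described above can be sketched in a few lines. This is only an illustration of the general pattern, not LipiNet's implementation: `load_with_cache`, `fake_fetch` and the pickle file name are hypothetical, and the real function may store its cache differently.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_cache(fetch, cache_path, use_cache=True, force_download=False):
    """Return cached data when allowed, otherwise call fetch() and cache the result.

    Hypothetical helper for illustration; not part of LipiNet.
    """
    cache_path = Path(cache_path)
    if use_cache and not force_download and cache_path.exists():
        with cache_path.open('rb') as handle:
            return pickle.load(handle)
    data = fetch()  # stands in for the real download
    if use_cache:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        with cache_path.open('wb') as handle:
            pickle.dump(data, handle)
    return data


calls = []


def fake_fetch():
    """Pretend download that records how often it was called."""
    calls.append(1)
    return {'df_nodes': ['n1', 'n2'], 'df_edges': [('n1', 'n2')]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'reactome_demo.pkl'
    first = load_with_cache(fake_fetch, cache_file)   # no cache yet: fetch and save
    second = load_with_cache(fake_fetch, cache_file)  # cache exists: no fetch
    third = load_with_cache(fake_fetch, cache_file, force_download=True)  # refresh
    n_fetches = len(calls)
```

In this sketch only the first and the forced call reach `fake_fetch`; the second call is served from the cache file.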
Where to from here?#
Now to quickly start exploring Reactome, go to the Explore Reactome notebook.
To see how the combined LipiNet network uses Reactome, go to the Explore LipiNet notebook.
Or to see how the parse_reactome_data() function works behind the scenes, continue to the Manual parsing section below.
Manual parsing#
For users wanting to better understand all the steps being undertaken behind the parse_reactome_data() function, we will recreate the steps here.
Download#
import importlib
import pandas as pd
import lipinet
from lipinet.databases import get_prior_knowledge
# from lipinet.utils import split_and_expand_large, create_nodedf_from_edgedf, check_for_split_characters
# Reload the modules so that local edits to lipinet are picked up (only needed during development).
# Note: names imported with `from ... import` before a reload still point at the old objects,
# so re-run the import afterwards if the function itself has changed.
importlib.reload(lipinet)
importlib.reload(lipinet.databases)
# from lipinet.databases import get_prior_knowledge
<module 'lipinet.databases' from '/Users/macsbook/Code/lipinet/lipinet/databases.py'>
import lipinet, lipinet.databases as db
print(lipinet.__file__)
print(db.__file__)
None
/Users/macsbook/Code/lipinet/lipinet/databases.py
reactome_dfs = get_prior_knowledge('reactome', verbose=True)
Fetching ChEBI2Reactome_PE_All_Levels.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_All_Levels.tsv. Loading data...
Fetching ChEBI2Reactome_PE_Reactions.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_Reactions.tsv. Loading data...
Fetching ReactomePathways.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathways.tsv. Loading data...
Fetching ReactomePathwaysRelation.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathwaysRelation.tsv. Loading data...
Returning ['ChEBI2Reactome_PE_All_Levels.tsv', 'ChEBI2Reactome_PE_Reactions.tsv', 'ReactomePathways.tsv', 'ReactomePathwaysRelation.tsv'] as a dict of dfs
reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']
| | source_db_identifier | reactome_pe_stableid | reactome_pe_name | reactome_pathway_stableid | url | event_name_pathway_or_reaction | evidence_code | species |
|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-1430728 | https://reactome.org/PathwayBrowser/#/R-BTA-14... | Metabolism | IEA | Bos taurus |
| 1 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-196854 | https://reactome.org/PathwayBrowser/#/R-BTA-19... | Metabolism of vitamins and cofactors | IEA | Bos taurus |
| 2 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-6806664 | https://reactome.org/PathwayBrowser/#/R-BTA-68... | Metabolism of vitamin K | IEA | Bos taurus |
| 3 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-6806667 | https://reactome.org/PathwayBrowser/#/R-BTA-68... | Metabolism of fat-soluble vitamins | IEA | Bos taurus |
| 4 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-CFA-1430728 | https://reactome.org/PathwayBrowser/#/R-CFA-14... | Metabolism | IEA | Canis familiaris |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 391091 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-211859 | https://reactome.org/PathwayBrowser/#/R-XTR-21... | Biological oxidations | IEA | Xenopus tropicalis |
| 391092 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-211945 | https://reactome.org/PathwayBrowser/#/R-XTR-21... | Phase I - Functionalization of compounds | IEA | Xenopus tropicalis |
| 391093 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-397014 | https://reactome.org/PathwayBrowser/#/R-XTR-39... | Muscle contraction | IEA | Xenopus tropicalis |
| 391094 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-5576891 | https://reactome.org/PathwayBrowser/#/R-XTR-55... | Cardiac conduction | IEA | Xenopus tropicalis |
| 391095 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-5576893 | https://reactome.org/PathwayBrowser/#/R-XTR-55... | Phase 2 - plateau phase | IEA | Xenopus tropicalis |
391096 rows × 8 columns
reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv']
| | source_db_identifier | reactome_pe_stableid | reactome_pe_name | reactome_pathway_stableid | url | event_name_pathway_or_reaction | evidence_code | species |
|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-159790 | https://reactome.org/PathwayBrowser/#/R-BTA-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Bos taurus |
| 1 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-9026967 | https://reactome.org/PathwayBrowser/#/R-BTA-90... | VKORC1 inhibitors binds VKORC1 dimer | IEA | Bos taurus |
| 2 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-CFA-159790 | https://reactome.org/PathwayBrowser/#/R-CFA-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Canis familiaris |
| 3 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-CFA-9026967 | https://reactome.org/PathwayBrowser/#/R-CFA-90... | VKORC1 inhibitors binds VKORC1 dimer | IEA | Canis familiaris |
| 4 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-DME-159790 | https://reactome.org/PathwayBrowser/#/R-DME-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Drosophila melanogaster |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 234736 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-SSC-9614031 | https://reactome.org/PathwayBrowser/#/R-SSC-96... | Class IV antihypertensives bind LTCC multimer | IEA | Sus scrofa |
| 234737 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-SSC-9659680 | https://reactome.org/PathwayBrowser/#/R-SSC-96... | ABCB1 transports xenobiotics out of the cell | IEA | Sus scrofa |
| 234738 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-SSC-9678766 | https://reactome.org/PathwayBrowser/#/R-SSC-96... | CYP3A4 binds CYP3A4 inhibitors | IEA | Sus scrofa |
| 234739 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-9614031 | https://reactome.org/PathwayBrowser/#/R-XTR-96... | Class IV antihypertensives bind LTCC multimer | IEA | Xenopus tropicalis |
| 234740 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-9678766 | https://reactome.org/PathwayBrowser/#/R-XTR-96... | CYP3A4 binds CYP3A4 inhibitors | IEA | Xenopus tropicalis |
234741 rows × 8 columns
reactome_dfs['ReactomePathways.tsv']
| | reactome_pathway_stableid | reactome_pathway_name | species |
|---|---|---|---|
| 0 | R-BTA-73843 | 5-Phosphoribose 1-diphosphate biosynthesis | Bos taurus |
| 1 | R-BTA-1369062 | ABC transporters in lipid homeostasis | Bos taurus |
| 2 | R-BTA-382556 | ABC-family proteins mediated transport | Bos taurus |
| 3 | R-BTA-9033807 | ABO blood group biosynthesis | Bos taurus |
| 4 | R-BTA-418592 | ADP signalling through P2Y purinoceptor 1 | Bos taurus |
| ... | ... | ... | ... |
| 23152 | R-XTR-379724 | tRNA Aminoacylation | Xenopus tropicalis |
| 23153 | R-XTR-6787450 | tRNA modification in the mitochondrion | Xenopus tropicalis |
| 23154 | R-XTR-6782315 | tRNA modification in the nucleus and cytosol | Xenopus tropicalis |
| 23155 | R-XTR-72306 | tRNA processing | Xenopus tropicalis |
| 23156 | R-XTR-199992 | trans-Golgi Network Vesicle Budding | Xenopus tropicalis |
23157 rows × 3 columns
reactome_dfs['ReactomePathwaysRelation.tsv']
| | parent_stableid | child_stableid |
|---|---|---|
| 0 | R-BTA-109581 | R-BTA-109606 |
| 1 | R-BTA-109581 | R-BTA-169911 |
| 2 | R-BTA-109581 | R-BTA-5357769 |
| 3 | R-BTA-109581 | R-BTA-75153 |
| 4 | R-BTA-109582 | R-BTA-140877 |
| ... | ... | ... |
| 23254 | R-XTR-9958790 | R-XTR-427652 |
| 23255 | R-XTR-9958790 | R-XTR-433137 |
| 23256 | R-XTR-9958863 | R-XTR-352230 |
| 23257 | R-XTR-9958863 | R-XTR-428559 |
| 23258 | R-XTR-9959399 | R-XTR-427975 |
23259 rows × 2 columns
df_pe_allpathway_split = reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'].copy()
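# Split 'name [location]' into separate name and location columns.
# str.rstrip takes a set of characters (not a regex), and rsplit with n=1 splits only on the last ' ['.
df_pe_allpathway_split[['pe_name','pe_location']] = df_pe_allpathway_split['reactome_pe_name'].str.rstrip(']').str.rsplit(' [', n=1, expand=True)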
df_pe_allpathway_split
| | source_db_identifier | reactome_pe_stableid | reactome_pe_name | reactome_pathway_stableid | url | event_name_pathway_or_reaction | evidence_code | species | pe_name | pe_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-1430728 | https://reactome.org/PathwayBrowser/#/R-BTA-14... | Metabolism | IEA | Bos taurus | warfarin | cytosol |
| 1 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-196854 | https://reactome.org/PathwayBrowser/#/R-BTA-19... | Metabolism of vitamins and cofactors | IEA | Bos taurus | warfarin | cytosol |
| 2 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-6806664 | https://reactome.org/PathwayBrowser/#/R-BTA-68... | Metabolism of vitamin K | IEA | Bos taurus | warfarin | cytosol |
| 3 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-6806667 | https://reactome.org/PathwayBrowser/#/R-BTA-68... | Metabolism of fat-soluble vitamins | IEA | Bos taurus | warfarin | cytosol |
| 4 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-CFA-1430728 | https://reactome.org/PathwayBrowser/#/R-CFA-14... | Metabolism | IEA | Canis familiaris | warfarin | cytosol |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 391091 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-211859 | https://reactome.org/PathwayBrowser/#/R-XTR-21... | Biological oxidations | IEA | Xenopus tropicalis | verapamil | cytosol |
| 391092 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-211945 | https://reactome.org/PathwayBrowser/#/R-XTR-21... | Phase I - Functionalization of compounds | IEA | Xenopus tropicalis | verapamil | cytosol |
| 391093 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-397014 | https://reactome.org/PathwayBrowser/#/R-XTR-39... | Muscle contraction | IEA | Xenopus tropicalis | verapamil | extracellular region |
| 391094 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-5576891 | https://reactome.org/PathwayBrowser/#/R-XTR-55... | Cardiac conduction | IEA | Xenopus tropicalis | verapamil | extracellular region |
| 391095 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-5576893 | https://reactome.org/PathwayBrowser/#/R-XTR-55... | Phase 2 - plateau phase | IEA | Xenopus tropicalis | verapamil | extracellular region |
391096 rows × 10 columns
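As an alternative way to do the same split, a single regular expression can capture the name and the bracketed location together, which also makes it easy to spot rows that do not follow the `name [location]` pattern. This is a minimal sketch on a few names copied from the tables in this notebook; the pattern itself is an assumption about the naming scheme, not something Reactome documents here.

```python
import pandas as pd

toy = pd.DataFrame({'reactome_pe_name': [
    'warfarin [cytosol]',
    'verapamil [extracellular region]',
    '(4Fe-4S)(2+) [cytosol]',
]})

# Everything before the final ' [' is the name; the bracket content is the location.
pattern = r'^(?P<pe_name>.*) \[(?P<pe_location>[^\[\]]*)\]$'
parts = toy['reactome_pe_name'].str.extract(pattern)
toy = toy.join(parts)

# Rows that do not match the pattern would have NaN in both new columns.
n_unparsed = int(toy['pe_name'].isna().sum())
```

Checking `n_unparsed` after the split is a cheap way to confirm that every entity name was parsed.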
Checks#
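In this section we run some quick checks and exploratory data analysis (EDA) on the Reactome dataframes.
Which pathway stable IDs are shared between reactome_dfs['ReactomePathways.tsv'] and df_pe_allpathway_split (our copy of ChEBI2Reactome_PE_All_Levels.tsv), and which are not?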
reactome_dfs['ReactomePathways.tsv'][reactome_dfs['ReactomePathways.tsv']['reactome_pathway_stableid'].isin(df_pe_allpathway_split['reactome_pathway_stableid'])]
| | reactome_pathway_stableid | reactome_pathway_name | species |
|---|---|---|---|
| 0 | R-BTA-73843 | 5-Phosphoribose 1-diphosphate biosynthesis | Bos taurus |
| 1 | R-BTA-1369062 | ABC transporters in lipid homeostasis | Bos taurus |
| 2 | R-BTA-382556 | ABC-family proteins mediated transport | Bos taurus |
| 3 | R-BTA-9033807 | ABO blood group biosynthesis | Bos taurus |
| 4 | R-BTA-418592 | ADP signalling through P2Y purinoceptor 1 | Bos taurus |
| ... | ... | ... | ... |
| 23152 | R-XTR-379724 | tRNA Aminoacylation | Xenopus tropicalis |
| 23153 | R-XTR-6787450 | tRNA modification in the mitochondrion | Xenopus tropicalis |
| 23154 | R-XTR-6782315 | tRNA modification in the nucleus and cytosol | Xenopus tropicalis |
| 23155 | R-XTR-72306 | tRNA processing | Xenopus tropicalis |
| 23156 | R-XTR-199992 | trans-Golgi Network Vesicle Budding | Xenopus tropicalis |
20140 rows × 3 columns
So most pathways overlap (20,140 of 23,157). Which don’t?
reactome_dfs['ReactomePathways.tsv'][~reactome_dfs['ReactomePathways.tsv']['reactome_pathway_stableid'].isin(df_pe_allpathway_split['reactome_pathway_stableid'])]
| | reactome_pathway_stableid | reactome_pathway_name | species |
|---|---|---|---|
| 6 | R-BTA-211163 | AKT-mediated inactivation of FOXO1A | Bos taurus |
| 9 | R-BTA-179409 | APC-Cdc20 mediated degradation of Nek2A | Bos taurus |
| 11 | R-BTA-174048 | APC/C:Cdc20 mediated degradation of Cyclin B | Bos taurus |
| 12 | R-BTA-174154 | APC/C:Cdc20 mediated degradation of Securin | Bos taurus |
| 13 | R-BTA-176409 | APC/C:Cdc20 mediated degradation of mitotic pr... | Bos taurus |
| ... | ... | ... | ... |
| 23118 | R-XTR-2032785 | YAP1- and WWTR1 (TAZ)-stimulated gene expression | Xenopus tropicalis |
| 23146 | R-XTR-209543 | p75NTR recruits signalling complexes | Xenopus tropicalis |
| 23148 | R-XTR-193639 | p75NTR signals via NF-kB | Xenopus tropicalis |
| 23150 | R-XTR-72312 | rRNA processing | Xenopus tropicalis |
| 23151 | R-XTR-8868773 | rRNA processing in the nucleus and cytosol | Xenopus tropicalis |
3017 rows × 3 columns
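A minority (3,017) don’t overlap. Why? One plausible explanation is that the ChEBI2Reactome files only list pathways that contain at least one ChEBI-annotated physical entity, so pathways without any would be missing from them.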
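The two `isin` filters above can also be combined into a single step with a left join and pandas' `indicator=True`, which labels each pathway as found in both tables or only in the left one. The sketch below uses small made-up tables (the IDs are placeholders, not real Reactome IDs) with the same column names as the real ones.

```python
import pandas as pd

# Toy stand-ins for ReactomePathways.tsv and ChEBI2Reactome_PE_All_Levels.tsv
pathways = pd.DataFrame({
    'reactome_pathway_stableid': ['R-BTA-1', 'R-BTA-2', 'R-XTR-3'],
    'reactome_pathway_name': ['Metabolism', 'rRNA processing', 'Metabolism'],
})
pe_to_pathway = pd.DataFrame({
    'reactome_pathway_stableid': ['R-BTA-1', 'R-BTA-1', 'R-XTR-3'],
    'reactome_pe_stableid': ['R-ALL-10', 'R-ALL-11', 'R-ALL-10'],
})

# Left-join against the unique pathway IDs; '_merge' records where each row was found.
linked_ids = pe_to_pathway[['reactome_pathway_stableid']].drop_duplicates()
checked = pathways.merge(linked_ids, on='reactome_pathway_stableid',
                         how='left', indicator=True)

overlap_counts = checked['_merge'].value_counts()
unlinked = checked.loc[checked['_merge'] == 'left_only',
                       'reactome_pathway_stableid'].tolist()
```

Deduplicating the right-hand IDs before merging keeps the result at one row per pathway, so the counts are directly comparable with the row counts shown above.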
We based this off the pathway stable ID… but what about the pathway names? Are they different?
reactome_dfs['ReactomePathways.tsv'][reactome_dfs['ReactomePathways.tsv']['reactome_pathway_name'].isin(reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']['event_name_pathway_or_reaction'])]
| | reactome_pathway_stableid | reactome_pathway_name | species |
|---|---|---|---|
| 0 | R-BTA-73843 | 5-Phosphoribose 1-diphosphate biosynthesis | Bos taurus |
| 1 | R-BTA-1369062 | ABC transporters in lipid homeostasis | Bos taurus |
| 2 | R-BTA-382556 | ABC-family proteins mediated transport | Bos taurus |
| 3 | R-BTA-9033807 | ABO blood group biosynthesis | Bos taurus |
| 4 | R-BTA-418592 | ADP signalling through P2Y purinoceptor 1 | Bos taurus |
| ... | ... | ... | ... |
| 23152 | R-XTR-379724 | tRNA Aminoacylation | Xenopus tropicalis |
| 23153 | R-XTR-6787450 | tRNA modification in the mitochondrion | Xenopus tropicalis |
| 23154 | R-XTR-6782315 | tRNA modification in the nucleus and cytosol | Xenopus tropicalis |
| 23155 | R-XTR-72306 | tRNA processing | Xenopus tropicalis |
| 23156 | R-XTR-199992 | trans-Golgi Network Vesicle Budding | Xenopus tropicalis |
21463 rows × 3 columns
reactome_dfs['ReactomePathways.tsv'][~reactome_dfs['ReactomePathways.tsv']['reactome_pathway_name'].isin(reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']['event_name_pathway_or_reaction'])]
| | reactome_pathway_stableid | reactome_pathway_name | species |
|---|---|---|---|
| 9 | R-BTA-179409 | APC-Cdc20 mediated degradation of Nek2A | Bos taurus |
| 11 | R-BTA-174048 | APC/C:Cdc20 mediated degradation of Cyclin B | Bos taurus |
| 12 | R-BTA-174154 | APC/C:Cdc20 mediated degradation of Securin | Bos taurus |
| 13 | R-BTA-176409 | APC/C:Cdc20 mediated degradation of mitotic pr... | Bos taurus |
| 14 | R-BTA-174178 | APC/C:Cdh1 mediated degradation of Cdc20 and o... | Bos taurus |
| ... | ... | ... | ... |
| 23092 | R-XTR-195399 | VEGF binds to VEGFR leading to receptor dimeri... | Xenopus tropicalis |
| 23093 | R-XTR-194313 | VEGF ligand-receptor interactions | Xenopus tropicalis |
| 23097 | R-XTR-8866427 | VLDLR internalisation and degradation | Xenopus tropicalis |
| 23113 | R-XTR-5140745 | WNT5A-dependent internalization of FZD2, FZD5 ... | Xenopus tropicalis |
| 23118 | R-XTR-2032785 | YAP1- and WWTR1 (TAZ)-stimulated gene expression | Xenopus tropicalis |
1694 rows × 3 columns
So more pathway names than stable IDs are shared between the two dataframes (21,463 versus 20,140). This is probably because either:
some pathway names are shared by two or more pathways with different stable IDs (for example, "Metabolism" appears as both R-BTA-1430728 and R-CFA-1430728), so a name can match even when that particular stable ID is absent
the stable IDs were updated while the names stayed the same (hopefully less likely, given that they are meant to be stable IDs)
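The first explanation is easy to test: count how many distinct stable IDs each pathway name maps to. The sketch below does this on three rows taken from the tables in this notebook; on the full dataframe the same `groupby` would show how widespread shared names are.

```python
import pandas as pd

pathways = pd.DataFrame({
    'reactome_pathway_stableid': ['R-BTA-1430728', 'R-CFA-1430728', 'R-XTR-72312'],
    'reactome_pathway_name': ['Metabolism', 'Metabolism', 'rRNA processing'],
    'species': ['Bos taurus', 'Canis familiaris', 'Xenopus tropicalis'],
})

# Number of distinct stable IDs behind each pathway name
ids_per_name = pathways.groupby('reactome_pathway_name')['reactome_pathway_stableid'].nunique()
shared_names = ids_per_name[ids_per_name > 1]

# Matching on name plus species is stricter than matching on name alone.
name_species_dupes = int(pathways.duplicated(['reactome_pathway_name', 'species']).sum())
```

If `name_species_dupes` is zero on the real data, name and species together would identify a pathway, which would support the first explanation.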
Maybe the best linkage key between the pathways (reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']) and the reactions (reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv']) is the reactome_pe_stableid key?
Each pathway in reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'] is linked to multiple physical entity stable IDs (reactome_pe_stableid). Each of these stable IDs also has a name and a cellular location (is it best to have these as nodes or as attributes?). These reactome_pe_stableid values should then link to multiple reactions (ChEBI2Reactome_PE_Reactions.tsv).
# Collect the unique physical entity stable IDs from each file (this also avoids nesting quotes inside f-strings, which fails before Python 3.12)
pe_ids_pathways = set(reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']['reactome_pe_stableid'])
pe_ids_reactions = set(reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv']['reactome_pe_stableid'])
print(f'len of ChEBI2Reactome_PE_All_Levels: {len(pe_ids_pathways)}')
print(f'len of ChEBI2Reactome_PE_Reactions: {len(pe_ids_reactions)}')
print(f'len of overlap between pathways and reactions using reactome_pe_stableid: {len(pe_ids_pathways & pe_ids_reactions)}')
len of ChEBI2Reactome_PE_All_Levels: 6001
len of ChEBI2Reactome_PE_Reactions: 6177
len of overlap between pathways and reactions using reactome_pe_stableid: 6001
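So this approach should work: all 6,001 physical entity stable IDs in the pathways file also appear in the reactions file. The reverse does not hold, since 176 stable IDs (6,177 − 6,001) appear only in the reactions file. It would be interesting to see which ones these are, and we should keep this in mind for later joins too.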
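Finding the IDs that appear in only one of the two files is a plain set difference. The sketch below shows the idea on toy tables with placeholder IDs; the column name matches the real dataframes.

```python
import pandas as pd

# Toy stand-ins for the All_Levels (pathways) and Reactions files
pe_pathways = pd.DataFrame({'reactome_pe_stableid': ['R-ALL-1', 'R-ALL-2', 'R-ALL-2']})
pe_reactions = pd.DataFrame({'reactome_pe_stableid': ['R-ALL-1', 'R-ALL-2', 'R-ALL-3']})

ids_pathways = set(pe_pathways['reactome_pe_stableid'])
ids_reactions = set(pe_reactions['reactome_pe_stableid'])

only_in_reactions = sorted(ids_reactions - ids_pathways)
only_in_pathways = sorted(ids_pathways - ids_reactions)

# Rows of the reactions table whose physical entity has no pathway annotation
orphan_rows = pe_reactions[pe_reactions['reactome_pe_stableid'].isin(only_in_reactions)]
```

Inspecting `orphan_rows` on the real data would show which entities take part in reactions without being mapped to any pathway.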
Nonetheless, we can now inspect this further. We should also check whether rolling everything up by stable ID would lose name or location information when we join the tables, or whether each name and location stays unique to its stable ID.
reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'].value_counts(subset=['reactome_pe_stableid'])
reactome_pe_stableid
R-ALL-113592 8540
R-ALL-29370 7836
R-ALL-29356 6144
R-ALL-29372 4253
R-ALL-29438 4192
...
R-ALL-30495 2
R-ALL-9750013 2
R-ALL-9749584 2
R-HSA-8943989 2
R-ALL-9635420 2
Name: count, Length: 6001, dtype: int64
reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'].value_counts(subset=['reactome_pe_name'])
reactome_pe_name
ATP [cytosol] 8540
ADP [cytosol] 7836
H2O [cytosol] 6144
Pi [cytosol] 4253
GTP [cytosol] 4192
...
CSN polymer [extracellular region] 2
sulfo-Cipro [cytosol] 2
sulfo-Cipro [extracellular region] 2
4-5 nt RNA [mitochondrial matrix] 2
SKM [cytosol] 2
Name: count, Length: 5922, dtype: int64
It’s interesting that these two lengths differ slightly (6,001 stable IDs versus 5,922 names), so there is not a direct 1:1 association between reactome_pe_stableid and reactome_pe_name.
df_pe_allpathway_split.value_counts(subset='species')
species
Homo sapiens 57728
Mus musculus 33310
Rattus norvegicus 32852
Bos taurus 32728
Sus scrofa 32653
Canis familiaris 32054
Gallus gallus 31029
Xenopus tropicalis 25691
Drosophila melanogaster 23091
Danio rerio 23017
Caenorhabditis elegans 21197
Dictyostelium discoideum 15461
Saccharomyces cerevisiae 11025
Schizosaccharomyces pombe 10961
Plasmodium falciparum 8003
Mycobacterium tuberculosis 296
Name: count, dtype: int64
df_pe_allpathway_split[['source_db_identifier','reactome_pe_stableid','species']].drop_duplicates().value_counts(subset='species')
species
Homo sapiens 6277
Mus musculus 4214
Bos taurus 4157
Sus scrofa 4149
Rattus norvegicus 4133
Canis familiaris 4090
Gallus gallus 3897
Xenopus tropicalis 3426
Danio rerio 3149
Drosophila melanogaster 2932
Caenorhabditis elegans 2767
Dictyostelium discoideum 2123
Saccharomyces cerevisiae 1489
Schizosaccharomyces pombe 1481
Plasmodium falciparum 1227
Mycobacterium tuberculosis 82
Name: count, dtype: int64
Above we see the number of unique ChEBI ID and physical entity pairs for each species. For our default LipiNet layout, keeping every species-specific duplicate would mean linking to multiple IDs when making interlayer links (e.g. Rhea to Reactome). This is probably undesirable: it would be better to keep a single interlayer connection between nodes with the same ID, both to reduce redundancy and to minimise problems with downstream algorithms such as node traversal.
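Two simple ways to avoid species-level duplication are sketched below on three rows taken from the tables above: restricting to a single species, or dropping the species-specific columns and deduplicating the ID pairs. Which option (if either) LipiNet uses internally is not shown in this notebook, so treat both as assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'source_db_identifier': [10033, 10033, 9948],
    'reactome_pe_stableid': ['R-ALL-9014945', 'R-ALL-9014945', 'R-ALL-9660998'],
    'reactome_pathway_stableid': ['R-BTA-1430728', 'R-CFA-1430728', 'R-XTR-211859'],
    'species': ['Bos taurus', 'Canis familiaris', 'Xenopus tropicalis'],
})

# Option A: keep a single species, so each pathway is represented once.
one_species = toy[toy['species'] == 'Bos taurus']

# Option B: ignore species and keep one ChEBI-to-physical-entity link per ID pair.
chebi_to_pe = (
    toy[['source_db_identifier', 'reactome_pe_stableid']]
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Option B works for the ChEBI-to-entity links because the R-ALL physical entity IDs are shared across species, whereas pathway IDs carry a species prefix and would still need Option A or a similar filter.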
temp_df1 = pd.DataFrame(df_pe_allpathway_split.value_counts(subset=['reactome_pe_stableid','reactome_pe_name'])).reset_index()
temp_df2 = pd.DataFrame(df_pe_allpathway_split.value_counts(subset=['reactome_pe_name'])).reset_index()
print(len(temp_df1))
print(len(temp_df2))
# len(set(temp_df2['reactome_pe_name'].to_list()) - set(temp_df1['reactome_pe_name'].to_list()))
# how many unique stable IDs per name
count_per_name = df_pe_allpathway_split.groupby('reactome_pe_name')['reactome_pe_stableid'].nunique()
# names that map to multiple stable IDs
ambiguous_names = count_per_name[count_per_name > 1]
print(len(ambiguous_names))
ambiguous_names #.head()
6001
5922
72
reactome_pe_name
(4Fe-4S)(2+) [cytosol] 2
(GlcNAc)2 (Man)8b [endoplasmic reticulum lumen] 2
(GlcNAc)2 (Man)8c [endoplasmic reticulum lumen] 2
1-acyl LPA [cytosol] 2
11cRAL [cytosol] 2
..
tRNA(Asn) containing A-37 [mitochondrial matrix] 2
tobramycin [cytosol] 2
tobramycin [periplasmic space] 2
unknown NAT [cytosol] 2
unknown kinase [cytosol] 2
Name: reactome_pe_stableid, Length: 72, dtype: int64
temp_df1[temp_df1['reactome_pe_name']==ambiguous_names.index[0]]
| | reactome_pe_stableid | reactome_pe_name | count |
|---|---|---|---|
| 810 | R-ALL-937126 | (4Fe-4S)(2+) [cytosol] | 83 |
| 5502 | R-ALL-937265 | (4Fe-4S)(2+) [cytosol] | 4 |
So each combined name and location (reactome_pe_name) is NOT quite unique to a stable ID: there are slightly fewer unique names (5,922) than stable IDs (6,001), and 72 names map to more than one stable ID.
This could be because Reactome considers them slightly different molecules in some contexts, such as a free ion, part of a complex, or bound to another molecule. Or it could be a bug in Reactome.
Apart from these rare cases the two map onto each other closely. But how much would the node count change if we split reactome_pe_name into name and location separately?
df_pe_allpathway_split.value_counts(subset=['pe_name'], dropna=False)
pe_name
ATP 13750
ADP 12271
H2O 12250
H+ 8892
Pi 7177
...
Islet amyloid polypeptide fibril 2
VWF multimer 2
Cleaved fibronectin matrix Ala(271)/Val(272) 2
Cleaved fibrillin-3 2
ribonucleotide 2
Name: count, Length: 4084, dtype: int64
df_pe_allpathway_split.value_counts(subset=['pe_location'], dropna=False)
pe_location
cytosol 163062
extracellular region 64709
nucleoplasm 37155
mitochondrial matrix 33962
endoplasmic reticulum lumen 15053
...
endosome 3
host cell cytosol 3
plastid stroma 2
nucleolus 2
Golgi-associated vesicle lumen 2
Name: count, Length: 85, dtype: int64
So here we see that there are 4,084 unique physical entity names across 85 different cellular locations.
The most common physical entities are ATP, ADP and H2O, which we would expect.
The most common cellular locations for the physical entities are the cytosol, extracellular region, and nucleoplasm, which also makes sense.
Going back to our earlier problematic example, we can see that only a few rows use the second stable ID for the same name. We can list the URLs for each stable ID and inspect them in the Reactome pathway browser:
df_pe_allpathway_split[df_pe_allpathway_split['reactome_pe_name']=='(4Fe-4S)(2+) [cytosol]']['reactome_pe_stableid'].value_counts()
reactome_pe_stableid
R-ALL-937126 83
R-ALL-937265 4
Name: count, dtype: int64
df_pe_allpathway_split[df_pe_allpathway_split['reactome_pe_stableid']=='R-ALL-937126']['url']
189142 https://reactome.org/PathwayBrowser/#/R-BTA-14...
189144 https://reactome.org/PathwayBrowser/#/R-BTA-19...
189145 https://reactome.org/PathwayBrowser/#/R-BTA-19...
189147 https://reactome.org/PathwayBrowser/#/R-BTA-94...
189148 https://reactome.org/PathwayBrowser/#/R-BTA-96...
...
189279 https://reactome.org/PathwayBrowser/#/R-XTR-14...
189281 https://reactome.org/PathwayBrowser/#/R-XTR-19...
189282 https://reactome.org/PathwayBrowser/#/R-XTR-19...
189284 https://reactome.org/PathwayBrowser/#/R-XTR-94...
189285 https://reactome.org/PathwayBrowser/#/R-XTR-96...
Name: url, Length: 83, dtype: object
df_pe_allpathway_split[df_pe_allpathway_split['reactome_pe_stableid']=='R-ALL-937265']['url']
189248 https://reactome.org/PathwayBrowser/#/R-MTU-87...
189249 https://reactome.org/PathwayBrowser/#/R-MTU-93...
189250 https://reactome.org/PathwayBrowser/#/R-MTU-93...
189251 https://reactome.org/PathwayBrowser/#/R-MTU-93...
Name: url, dtype: object
Zooming out to the overall question of how we link the dataframes together: the best way to join all this up is probably to use the physical entity stable ID to link to both the reactions and the pathways. The stable ID then links to the combined name and location, which in turn splits into the common name and the location as separate nodes.
That way, in later applications, users can search by name or by location and get the degree of the name or location nodes associated with each query. This also avoids cluttering the maps and keeps it easy to filter layers.
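The linking scheme described above can be sketched on two rows from the tables in this notebook. The helper `edges_between` is hypothetical and only illustrates the idea; the edge directions follow the dataframe names listed in the Edge creation section, but the actual LipiNet code may build them differently.

```python
import pandas as pd

toy = pd.DataFrame({
    'reactome_pe_stableid': ['R-ALL-9660998', 'R-ALL-9614135'],
    'reactome_pe_name': ['verapamil [cytosol]', 'verapamil [extracellular region]'],
    'pe_name': ['verapamil', 'verapamil'],
    'pe_location': ['cytosol', 'extracellular region'],
})


def edges_between(df, source_col, target_col):
    """Unique (source, target) pairs from two columns; hypothetical helper."""
    edges = df[[source_col, target_col]].drop_duplicates()
    edges.columns = ['source_id', 'target_id']
    return edges.reset_index(drop=True)


# name+location -> stable ID, name -> name+location, location -> name+location
nameloc_to_pe = edges_between(toy, 'reactome_pe_name', 'reactome_pe_stableid')
name_to_nameloc = edges_between(toy, 'pe_name', 'reactome_pe_name')
loc_to_nameloc = edges_between(toy, 'pe_location', 'reactome_pe_name')

# Degree of each name node = number of name+location nodes it links to
name_degree = name_to_nameloc['source_id'].value_counts()
```

Here the single name node "verapamil" has degree 2, one edge per cellular location, which is the kind of query the separate name and location nodes are meant to support.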
Edge creation#
We now start creating the Reactome edge dataframes, namely:
df_edges_ontpathway_to_ontpathway (and df_nodes_ontpathway, which isn’t for edges but should be created now)
df_edges_pathwayid_to_ontpathway
df_edges_chebi_to_physicalent
df_edges_phyiscalent_to_pathwayid
df_edges_phyiscalent_to_reactionid
df_edges_penameloc_to_phyiscalent
df_edges_pename_to_penameloc
df_edges_peloc_to_penameloc
First, we will create the ontology for the Reactome pathways.
df_edges_ontpathway_to_ontpathway = reactome_dfs['ReactomePathwaysRelation.tsv'].copy()
df_edges_ontpathway_to_ontpathway.columns = ['source_id', 'target_id']
df_edges_ontpathway_to_ontpathway['source_layer'] = 'reactome_pathway_ontology'
df_edges_ontpathway_to_ontpathway['target_layer'] = 'reactome_pathway_ontology'
df_edges_ontpathway_to_ontpathway['interlayer'] = False
df_edges_ontpathway_to_ontpathway
| | source_id | target_id | source_layer | target_layer | interlayer |
|---|---|---|---|---|---|
| 0 | R-BTA-109581 | R-BTA-109606 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 1 | R-BTA-109581 | R-BTA-169911 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 2 | R-BTA-109581 | R-BTA-5357769 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 3 | R-BTA-109581 | R-BTA-75153 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 4 | R-BTA-109582 | R-BTA-140877 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| ... | ... | ... | ... | ... | ... |
| 23254 | R-XTR-9958790 | R-XTR-427652 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 23255 | R-XTR-9958790 | R-XTR-433137 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 23256 | R-XTR-9958863 | R-XTR-352230 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 23257 | R-XTR-9958863 | R-XTR-428559 | reactome_pathway_ontology | reactome_pathway_ontology | False |
| 23258 | R-XTR-9959399 | R-XTR-427975 | reactome_pathway_ontology | reactome_pathway_ontology | False |
23259 rows × 5 columns
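The rows above suggest that `source_id` is the parent pathway and `target_id` the child (for example, R-BTA-109581, named "Apoptosis" elsewhere in this notebook, points to R-BTA-109606, "Intrinsic Pathway for Apoptosis"). Assuming that reading is right, a rough sketch of collecting every descendant of a pathway could look like the following. The `descendants` helper and the `P-…` identifiers are hypothetical and are not part of LipiNet.

```python
import pandas as pd

# Toy parent -> child edges in the same shape as df_edges_ontpathway_to_ontpathway.
# The IDs are made up for illustration.
df_toy_ont = pd.DataFrame({
    "source_id": ["P-1", "P-1", "P-2", "P-4"],
    "target_id": ["P-2", "P-3", "P-4", "P-5"],
})

def descendants(df_edges, root):
    """Return all IDs reachable from `root` by following source -> target edges."""
    # Map each parent to the list of its direct children.
    children = df_edges.groupby("source_id")["target_id"].apply(list).to_dict()
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:  # guard against revisiting shared sub-pathways
                seen.add(child)
                stack.append(child)
    return seen

toy_descendants = descendants(df_toy_ont, "P-1")
print(sorted(toy_descendants))
```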
We also need to define the nodes for the pathway ontology before we can link the pathways to it.
df_nodes_ontpathway = reactome_dfs['ReactomePathways.tsv'].copy()
df_nodes_ontpathway.columns = ['node_id', 'name', 'species']
df_nodes_ontpathway = df_nodes_ontpathway.assign(layer='reactome_pathway_ontology').copy()
df_nodes_ontpathway = df_nodes_ontpathway[['layer','node_id','name','species']]
df_nodes_ontpathway
| | layer | node_id | name | species |
|---|---|---|---|---|
| 0 | reactome_pathway_ontology | R-BTA-73843 | 5-Phosphoribose 1-diphosphate biosynthesis | Bos taurus |
| 1 | reactome_pathway_ontology | R-BTA-1369062 | ABC transporters in lipid homeostasis | Bos taurus |
| 2 | reactome_pathway_ontology | R-BTA-382556 | ABC-family proteins mediated transport | Bos taurus |
| 3 | reactome_pathway_ontology | R-BTA-9033807 | ABO blood group biosynthesis | Bos taurus |
| 4 | reactome_pathway_ontology | R-BTA-418592 | ADP signalling through P2Y purinoceptor 1 | Bos taurus |
| ... | ... | ... | ... | ... |
| 23152 | reactome_pathway_ontology | R-XTR-379724 | tRNA Aminoacylation | Xenopus tropicalis |
| 23153 | reactome_pathway_ontology | R-XTR-6787450 | tRNA modification in the mitochondrion | Xenopus tropicalis |
| 23154 | reactome_pathway_ontology | R-XTR-6782315 | tRNA modification in the nucleus and cytosol | Xenopus tropicalis |
| 23155 | reactome_pathway_ontology | R-XTR-72306 | tRNA processing | Xenopus tropicalis |
| 23156 | reactome_pathway_ontology | R-XTR-199992 | trans-Golgi Network Vesicle Budding | Xenopus tropicalis |
23157 rows × 4 columns
# EDGES: pathway_stid (from allpathway_split) -> pathway_stid (from pathway_ont)
# Note: more of the extra columns could be dropped here if needed; for now they are kept as edge attributes to allow more advanced filtering.
# Note: target_layer here is 'reactome_ontology_pathways', whereas df_nodes_ontpathway uses the layer name 'reactome_pathway_ontology'.
df_edges_pathwayid_to_ontpathway = pd.merge(
    df_nodes_ontpathway.drop(columns=['layer','species']).assign(target_layer='reactome_ontology_pathways'),
    df_pe_allpathway_split.assign(source_layer='reactome_pathway'),
    left_on='node_id', right_on='reactome_pathway_stableid',
    how='outer'
).rename(
    columns={'node_id':'target_id', 'reactome_pathway_stableid':'source_id'}
).drop(
    # These columns describe individual physical entities, not the pathway-to-ontology relationship.
    # A pathway often has several physical entities, so keeping only one of them would be misleading.
    columns=['reactome_pe_stableid','reactome_pe_name','pe_name','pe_location','source_db_identifier']
).drop_duplicates()
df_edges_pathwayid_to_ontpathway
| | target_id | name | target_layer | source_id | url | event_name_pathway_or_reaction | evidence_code | species | source_layer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | R-BTA-1059683 | Interleukin-6 signaling | reactome_ontology_pathways | R-BTA-1059683 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Interleukin-6 signaling | IEA | Bos taurus | reactome_pathway |
| 2 | R-BTA-109581 | Apoptosis | reactome_ontology_pathways | R-BTA-109581 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Apoptosis | IEA | Bos taurus | reactome_pathway |
| 14 | R-BTA-109582 | Hemostasis | reactome_ontology_pathways | R-BTA-109582 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Hemostasis | IEA | Bos taurus | reactome_pathway |
| 126 | R-BTA-109606 | Intrinsic Pathway for Apoptosis | reactome_ontology_pathways | R-BTA-109606 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Intrinsic Pathway for Apoptosis | IEA | Bos taurus | reactome_pathway |
| 134 | R-BTA-109703 | PKB-mediated events | reactome_ontology_pathways | R-BTA-109703 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | PKB-mediated events | IEA | Bos taurus | reactome_pathway |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393990 | R-XTR-9958517 | SLC-mediated bile acid transport | reactome_ontology_pathways | R-XTR-9958517 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated bile acid transport | IEA | Xenopus tropicalis | reactome_pathway |
| 393998 | R-XTR-9958790 | SLC-mediated transport of inorganic anions | reactome_ontology_pathways | R-XTR-9958790 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated transport of inorganic anions | IEA | Xenopus tropicalis | reactome_pathway |
| 394026 | R-XTR-9958863 | SLC-mediated transport of amino acids | reactome_ontology_pathways | R-XTR-9958863 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated transport of amino acids | IEA | Xenopus tropicalis | reactome_pathway |
| 394104 | R-XTR-9959399 | SLC-mediated transport of oligopeptides | reactome_ontology_pathways | R-XTR-9959399 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated transport of oligopeptides | IEA | Xenopus tropicalis | reactome_pathway |
| 394110 | R-XTR-997272 | Inhibition of voltage gated Ca2+ channels via... | reactome_ontology_pathways | R-XTR-997272 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | Inhibition of voltage gated Ca2+ channels via... | IEA | Xenopus tropicalis | reactome_pathway |
23878 rows × 9 columns
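Because the merge above uses `how='outer'`, rows from either side that have no partner are kept, with missing values on the unmatched side. One way to audit this, sketched below on toy data, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The frame names and `P-…` identifiers here are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: ontology nodes on the left, pathway IDs seen with physical entities on the right.
df_toy_ont_nodes = pd.DataFrame({"node_id": ["P-1", "P-2", "P-3"]})
df_toy_pe_pathways = pd.DataFrame({"reactome_pathway_stableid": ["P-2", "P-3", "P-9"]})

audit = pd.merge(
    df_toy_ont_nodes,
    df_toy_pe_pathways,
    left_on="node_id",
    right_on="reactome_pathway_stableid",
    how="outer",
    indicator=True,  # adds a '_merge' column: 'left_only', 'right_only' or 'both'
)

# Count how many rows matched on both sides versus only one side.
merge_counts = audit["_merge"].value_counts().to_dict()
print(merge_counts)
```

Rows marked `left_only` would correspond to ontology pathways with no physical entity attached, and `right_only` to pathway IDs missing from the ontology node table.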
df_edges_chebi_to_physicalent = df_pe_allpathway_split[['source_db_identifier','reactome_pe_stableid']].drop_duplicates()
df_edges_chebi_to_physicalent = df_edges_chebi_to_physicalent.rename(
columns={'source_db_identifier':'source_id', 'reactome_pe_stableid':'target_id'}
).assign(source_layer='reactome_chebi', target_layer='reactome_physicalent')
df_edges_chebi_to_physicalent
| | source_id | target_id | source_layer | target_layer |
|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | reactome_chebi | reactome_physicalent |
| 32 | 10036 | R-ALL-5696412 | reactome_chebi | reactome_physicalent |
| 110 | 10055 | R-ALL-9611688 | reactome_chebi | reactome_physicalent |
| 164 | 10093 | R-ALL-9648287 | reactome_chebi | reactome_physicalent |
| 165 | 10093 | R-ALL-3296452 | reactome_chebi | reactome_physicalent |
| ... | ... | ... | ... | ... |
| 390918 | 9884 | R-ALL-9713792 | reactome_chebi | reactome_physicalent |
| 390921 | 9927 | R-ALL-9615299 | reactome_chebi | reactome_physicalent |
| 390971 | 9943 | R-ALL-9714401 | reactome_chebi | reactome_physicalent |
| 391019 | 9948 | R-ALL-9660998 | reactome_chebi | reactome_physicalent |
| 391022 | 9948 | R-ALL-9614135 | reactome_chebi | reactome_physicalent |
6353 rows × 4 columns
# EDGES: PE_stid -> pathway_stid (BOTH from _allpathway_split_)
df_edges_phyiscalent_to_pathwayid = df_pe_allpathway_split.assign(
source_layer='reactome_physicalent',
target_layer='reactome_pathway'
).rename(columns={
'reactome_pe_stableid':'source_id',
'reactome_pathway_stableid':'target_id'
})
df_edges_phyiscalent_to_pathwayid['human'] = df_edges_phyiscalent_to_pathwayid['species']=='Homo sapiens'
df_edges_phyiscalent_to_pathwayid
| | source_db_identifier | source_id | reactome_pe_name | target_id | url | event_name_pathway_or_reaction | evidence_code | species | pe_name | pe_location | source_layer | target_layer | human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-1430728 | https://reactome.org/PathwayBrowser/#/R-BTA-14... | Metabolism | IEA | Bos taurus | warfarin | cytosol | reactome_physicalent | reactome_pathway | False |
| 1 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-196854 | https://reactome.org/PathwayBrowser/#/R-BTA-19... | Metabolism of vitamins and cofactors | IEA | Bos taurus | warfarin | cytosol | reactome_physicalent | reactome_pathway | False |
| 2 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-6806664 | https://reactome.org/PathwayBrowser/#/R-BTA-68... | Metabolism of vitamin K | IEA | Bos taurus | warfarin | cytosol | reactome_physicalent | reactome_pathway | False |
| 3 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-6806667 | https://reactome.org/PathwayBrowser/#/R-BTA-68... | Metabolism of fat-soluble vitamins | IEA | Bos taurus | warfarin | cytosol | reactome_physicalent | reactome_pathway | False |
| 4 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-CFA-1430728 | https://reactome.org/PathwayBrowser/#/R-CFA-14... | Metabolism | IEA | Canis familiaris | warfarin | cytosol | reactome_physicalent | reactome_pathway | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 391091 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-211859 | https://reactome.org/PathwayBrowser/#/R-XTR-21... | Biological oxidations | IEA | Xenopus tropicalis | verapamil | cytosol | reactome_physicalent | reactome_pathway | False |
| 391092 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-211945 | https://reactome.org/PathwayBrowser/#/R-XTR-21... | Phase I - Functionalization of compounds | IEA | Xenopus tropicalis | verapamil | cytosol | reactome_physicalent | reactome_pathway | False |
| 391093 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-397014 | https://reactome.org/PathwayBrowser/#/R-XTR-39... | Muscle contraction | IEA | Xenopus tropicalis | verapamil | extracellular region | reactome_physicalent | reactome_pathway | False |
| 391094 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-5576891 | https://reactome.org/PathwayBrowser/#/R-XTR-55... | Cardiac conduction | IEA | Xenopus tropicalis | verapamil | extracellular region | reactome_physicalent | reactome_pathway | False |
| 391095 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-5576893 | https://reactome.org/PathwayBrowser/#/R-XTR-55... | Phase 2 - plateau phase | IEA | Xenopus tropicalis | verapamil | extracellular region | reactome_physicalent | reactome_pathway | False |
391096 rows × 13 columns
# EDGES: PE_stid -> reaction_stid (BOTH from _allreactions_)
df_edges_phyiscalent_to_reactionid = reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv'].assign(
source_layer='reactome_physicalent',
target_layer='reactome_reactions'
).rename(columns={
'reactome_pe_stableid':'source_id',
'reactome_pathway_stableid':'target_id'
})
df_edges_phyiscalent_to_reactionid['human'] = df_edges_phyiscalent_to_reactionid['species']=='Homo sapiens'
df_edges_phyiscalent_to_reactionid
| | source_db_identifier | source_id | reactome_pe_name | target_id | url | event_name_pathway_or_reaction | evidence_code | species | source_layer | target_layer | human |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-159790 | https://reactome.org/PathwayBrowser/#/R-BTA-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Bos taurus | reactome_physicalent | reactome_reactions | False |
| 1 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-BTA-9026967 | https://reactome.org/PathwayBrowser/#/R-BTA-90... | VKORC1 inhibitors binds VKORC1 dimer | IEA | Bos taurus | reactome_physicalent | reactome_reactions | False |
| 2 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-CFA-159790 | https://reactome.org/PathwayBrowser/#/R-CFA-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Canis familiaris | reactome_physicalent | reactome_reactions | False |
| 3 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-CFA-9026967 | https://reactome.org/PathwayBrowser/#/R-CFA-90... | VKORC1 inhibitors binds VKORC1 dimer | IEA | Canis familiaris | reactome_physicalent | reactome_reactions | False |
| 4 | 10033 | R-ALL-9014945 | warfarin [cytosol] | R-DME-159790 | https://reactome.org/PathwayBrowser/#/R-DME-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Drosophila melanogaster | reactome_physicalent | reactome_reactions | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 234736 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-SSC-9614031 | https://reactome.org/PathwayBrowser/#/R-SSC-96... | Class IV antihypertensives bind LTCC multimer | IEA | Sus scrofa | reactome_physicalent | reactome_reactions | False |
| 234737 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-SSC-9659680 | https://reactome.org/PathwayBrowser/#/R-SSC-96... | ABCB1 transports xenobiotics out of the cell | IEA | Sus scrofa | reactome_physicalent | reactome_reactions | False |
| 234738 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-SSC-9678766 | https://reactome.org/PathwayBrowser/#/R-SSC-96... | CYP3A4 binds CYP3A4 inhibitors | IEA | Sus scrofa | reactome_physicalent | reactome_reactions | False |
| 234739 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | R-XTR-9614031 | https://reactome.org/PathwayBrowser/#/R-XTR-96... | Class IV antihypertensives bind LTCC multimer | IEA | Xenopus tropicalis | reactome_physicalent | reactome_reactions | False |
| 234740 | 9948 | R-ALL-9660998 | verapamil [cytosol] | R-XTR-9678766 | https://reactome.org/PathwayBrowser/#/R-XTR-96... | CYP3A4 binds CYP3A4 inhibitors | IEA | Xenopus tropicalis | reactome_physicalent | reactome_reactions | False |
234741 rows × 11 columns
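The boolean `human` column added above makes it straightforward to restrict an edge table to a single species. A small sketch on toy data follows; the `RX-…` reaction identifiers are invented for illustration, and whether to filter at this stage or later is left open.

```python
import pandas as pd

# Toy edge table with the same 'species' column as the Reactome edge dfs.
df_toy_edges = pd.DataFrame({
    "source_id": ["R-ALL-9014945"] * 3,
    "target_id": ["RX-1", "RX-2", "RX-3"],  # made-up reaction IDs
    "species": ["Bos taurus", "Homo sapiens", "Canis familiaris"],
})

# Same flag as in the notebook: True only for human rows.
df_toy_edges["human"] = df_toy_edges["species"] == "Homo sapiens"

# Boolean indexing keeps only the human edges.
df_toy_human = df_toy_edges[df_toy_edges["human"]].reset_index(drop=True)
print(df_toy_human)
```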
# EDGES: PE_stid (from allpathway_split) <- PE_name_location (from allpathway_split)
df_edges_penameloc_to_phyiscalent = df_pe_allpathway_split.assign(
source_layer='reactome_physicalent_nameloc',
target_layer='reactome_physicalent'
).rename(columns={
'reactome_pe_name':'source_id',
'reactome_pe_stableid':'target_id'
})[
['source_layer','source_id','target_layer','target_id','pe_name','pe_location']
].drop_duplicates()
df_edges_penameloc_to_phyiscalent
| | source_layer | source_id | target_layer | target_id | pe_name | pe_location |
|---|---|---|---|---|---|---|
| 0 | reactome_physicalent_nameloc | warfarin [cytosol] | reactome_physicalent | R-ALL-9014945 | warfarin | cytosol |
| 32 | reactome_physicalent_nameloc | arachidyl ester [endoplasmic reticulum lumen] | reactome_physicalent | R-ALL-5696412 | arachidyl ester | endoplasmic reticulum lumen |
| 110 | reactome_physicalent_nameloc | xamoterol [extracellular region] | reactome_physicalent | R-ALL-9611688 | xamoterol | extracellular region |
| 164 | reactome_physicalent_nameloc | yohimbine [extracellular region] | reactome_physicalent | R-ALL-9648287 | yohimbine | extracellular region |
| 165 | reactome_physicalent_nameloc | Yohimbine [extracellular region] | reactome_physicalent | R-ALL-3296452 | Yohimbine | extracellular region |
| ... | ... | ... | ... | ... | ... | ... |
| 390837 | reactome_physicalent_nameloc | troglitazone [nucleoplasm] | reactome_physicalent | R-ALL-9732670 | troglitazone | nucleoplasm |
| 390864 | reactome_physicalent_nameloc | tropicamide [extracellular region] | reactome_physicalent | R-ALL-9704271 | tropicamide | extracellular region |
| 390918 | reactome_physicalent_nameloc | uramustine [nucleoplasm] | reactome_physicalent | R-ALL-9713792 | uramustine | nucleoplasm |
| 390921 | reactome_physicalent_nameloc | valsartan [extracellular region] | reactome_physicalent | R-ALL-9615299 | valsartan | extracellular region |
| 390971 | reactome_physicalent_nameloc | venlafaxine [extracellular region] | reactome_physicalent | R-ALL-9714401 | venlafaxine | extracellular region |
6001 rows × 6 columns
# EDGES: PE_name (from allpathway_split) -> PE_name_location (from allpathway_split)
df_edges_pename_to_penameloc = df_pe_allpathway_split.assign(
source_layer='reactome_physicalent_name',
target_layer='reactome_physicalent_nameloc'
).rename(columns={
'pe_name':'source_id',
'reactome_pe_name':'target_id'
})[
['source_layer','source_id','target_layer','target_id']
].drop_duplicates()
df_edges_pename_to_penameloc
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | reactome_physicalent_name | warfarin | reactome_physicalent_nameloc | warfarin [cytosol] |
| 32 | reactome_physicalent_name | arachidyl ester | reactome_physicalent_nameloc | arachidyl ester [endoplasmic reticulum lumen] |
| 110 | reactome_physicalent_name | xamoterol | reactome_physicalent_nameloc | xamoterol [extracellular region] |
| 164 | reactome_physicalent_name | yohimbine | reactome_physicalent_nameloc | yohimbine [extracellular region] |
| 165 | reactome_physicalent_name | Yohimbine | reactome_physicalent_nameloc | Yohimbine [extracellular region] |
| ... | ... | ... | ... | ... |
| 390837 | reactome_physicalent_name | troglitazone | reactome_physicalent_nameloc | troglitazone [nucleoplasm] |
| 390864 | reactome_physicalent_name | tropicamide | reactome_physicalent_nameloc | tropicamide [extracellular region] |
| 390918 | reactome_physicalent_name | uramustine | reactome_physicalent_nameloc | uramustine [nucleoplasm] |
| 390921 | reactome_physicalent_name | valsartan | reactome_physicalent_nameloc | valsartan [extracellular region] |
| 390971 | reactome_physicalent_name | venlafaxine | reactome_physicalent_nameloc | venlafaxine [extracellular region] |
5922 rows × 4 columns
# EDGES: PE_location (from allpathway_split) -> PE_name_location (from allpathway_split)
df_edges_peloc_to_penameloc = df_pe_allpathway_split.assign(
source_layer='reactome_physicalent_loc',
target_layer='reactome_physicalent_nameloc'
).rename(columns={
'pe_location':'source_id',
'reactome_pe_name':'target_id'
})[
['source_layer','source_id','target_layer','target_id']
].drop_duplicates()
df_edges_peloc_to_penameloc
| | source_layer | source_id | target_layer | target_id |
|---|---|---|---|---|
| 0 | reactome_physicalent_loc | cytosol | reactome_physicalent_nameloc | warfarin [cytosol] |
| 32 | reactome_physicalent_loc | endoplasmic reticulum lumen | reactome_physicalent_nameloc | arachidyl ester [endoplasmic reticulum lumen] |
| 110 | reactome_physicalent_loc | extracellular region | reactome_physicalent_nameloc | xamoterol [extracellular region] |
| 164 | reactome_physicalent_loc | extracellular region | reactome_physicalent_nameloc | yohimbine [extracellular region] |
| 165 | reactome_physicalent_loc | extracellular region | reactome_physicalent_nameloc | Yohimbine [extracellular region] |
| ... | ... | ... | ... | ... |
| 390837 | reactome_physicalent_loc | nucleoplasm | reactome_physicalent_nameloc | troglitazone [nucleoplasm] |
| 390864 | reactome_physicalent_loc | extracellular region | reactome_physicalent_nameloc | tropicamide [extracellular region] |
| 390918 | reactome_physicalent_loc | nucleoplasm | reactome_physicalent_nameloc | uramustine [nucleoplasm] |
| 390921 | reactome_physicalent_loc | extracellular region | reactome_physicalent_nameloc | valsartan [extracellular region] |
| 390971 | reactome_physicalent_loc | extracellular region | reactome_physicalent_nameloc | venlafaxine [extracellular region] |
5922 rows × 4 columns
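With location kept as its own layer, the degree of each location node is simply the number of distinct name-and-location nodes it connects to. The sketch below computes this on a few toy rows shaped like `df_edges_peloc_to_penameloc`; it is one possible way to do the count, not necessarily how LipiNet reports degrees.

```python
import pandas as pd

# Toy edges: location -> "name [location]", using a few rows from the table above.
df_toy_loc_edges = pd.DataFrame({
    "source_layer": "reactome_physicalent_loc",
    "source_id": ["cytosol", "extracellular region", "extracellular region", "nucleoplasm"],
    "target_layer": "reactome_physicalent_nameloc",
    "target_id": [
        "warfarin [cytosol]",
        "xamoterol [extracellular region]",
        "valsartan [extracellular region]",
        "troglitazone [nucleoplasm]",
    ],
})

# Degree of each location node = number of distinct targets it points to.
loc_degree = (
    df_toy_loc_edges
    .groupby("source_id")["target_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(loc_degree)
```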
Node creation#
We now create the following node dfs, using the edge dfs:
df_nodes_ontpathway (already created prior)
df_nodes_pathwayid (using df_edges_pathwayid_to_ontpathway)
df_nodes_chebi (using df_edges_chebi_to_physicalent)
df_nodes_physicalent (using df_edges_phyiscalent_to_pathwayid)
df_nodes_reactionid (using df_edges_phyiscalent_to_reactionid)
df_nodes_penameloc (using df_edges_penameloc_to_phyiscalent)
df_nodes_pename (using df_edges_pename_to_penameloc)
df_nodes_peloc (using df_edges_peloc_to_penameloc)
Note that df_nodes_ontpathway was already created in the edge creation step above.
# create nodes chebi df
# from df_edges_chebi_to_physicalent
df_nodes_chebi = df_edges_chebi_to_physicalent.drop(
columns=['target_layer','target_id']
).rename(
columns={'source_id':'node_id','source_layer':'layer'}).drop_duplicates()
df_nodes_chebi
| | node_id | layer |
|---|---|---|
| 0 | 10033 | reactome_chebi |
| 32 | 10036 | reactome_chebi |
| 110 | 10055 | reactome_chebi |
| 164 | 10093 | reactome_chebi |
| 305 | 10100 | reactome_chebi |
| ... | ... | ... |
| 390864 | 9757 | reactome_chebi |
| 390918 | 9884 | reactome_chebi |
| 390921 | 9927 | reactome_chebi |
| 390971 | 9943 | reactome_chebi |
| 391019 | 9948 | reactome_chebi |
3068 rows × 2 columns
# create nodes pathwayid df
# from df_edges_pathwayid_to_ontpathway
df_nodes_pathwayid = df_edges_pathwayid_to_ontpathway.drop(
columns=['target_layer','target_id']
).rename(
columns={'source_id':'node_id','source_layer':'layer'}).drop_duplicates()
df_nodes_pathwayid['human'] = df_nodes_pathwayid['species']=='Homo sapiens'
df_nodes_pathwayid
| | name | node_id | url | event_name_pathway_or_reaction | evidence_code | species | layer | human |
|---|---|---|---|---|---|---|---|---|
| 0 | Interleukin-6 signaling | R-BTA-1059683 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Interleukin-6 signaling | IEA | Bos taurus | reactome_pathway | False |
| 2 | Apoptosis | R-BTA-109581 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Apoptosis | IEA | Bos taurus | reactome_pathway | False |
| 14 | Hemostasis | R-BTA-109582 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Hemostasis | IEA | Bos taurus | reactome_pathway | False |
| 126 | Intrinsic Pathway for Apoptosis | R-BTA-109606 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | Intrinsic Pathway for Apoptosis | IEA | Bos taurus | reactome_pathway | False |
| 134 | PKB-mediated events | R-BTA-109703 | https://reactome.org/PathwayBrowser/#/R-BTA-10... | PKB-mediated events | IEA | Bos taurus | reactome_pathway | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393990 | SLC-mediated bile acid transport | R-XTR-9958517 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated bile acid transport | IEA | Xenopus tropicalis | reactome_pathway | False |
| 393998 | SLC-mediated transport of inorganic anions | R-XTR-9958790 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated transport of inorganic anions | IEA | Xenopus tropicalis | reactome_pathway | False |
| 394026 | SLC-mediated transport of amino acids | R-XTR-9958863 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated transport of amino acids | IEA | Xenopus tropicalis | reactome_pathway | False |
| 394104 | SLC-mediated transport of oligopeptides | R-XTR-9959399 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | SLC-mediated transport of oligopeptides | IEA | Xenopus tropicalis | reactome_pathway | False |
| 394110 | Inhibition of voltage gated Ca2+ channels via... | R-XTR-997272 | https://reactome.org/PathwayBrowser/#/R-XTR-99... | Inhibition of voltage gated Ca2+ channels via... | IEA | Xenopus tropicalis | reactome_pathway | False |
21606 rows × 8 columns
# create nodes physicalent df
# from df_edges_phyiscalent_to_pathwayid
df_nodes_physicalent = df_edges_phyiscalent_to_pathwayid.drop(
columns=['target_layer','target_id']
).rename(
columns={'source_id':'node_id','source_layer':'layer'}
).drop(columns=['event_name_pathway_or_reaction','url','evidence_code']).drop_duplicates()
# NOTE: we keep species for now, but may want to revisit this later if the resulting duplicate node_ids are a problem for node creation or for unique searching.
df_nodes_physicalent['human'] = df_nodes_physicalent['species']=='Homo sapiens'
df_nodes_physicalent
| | source_db_identifier | node_id | reactome_pe_name | species | pe_name | pe_location | layer | human |
|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | warfarin [cytosol] | Bos taurus | warfarin | cytosol | reactome_physicalent | False |
| 4 | 10033 | R-ALL-9014945 | warfarin [cytosol] | Canis familiaris | warfarin | cytosol | reactome_physicalent | False |
| 8 | 10033 | R-ALL-9014945 | warfarin [cytosol] | Drosophila melanogaster | warfarin | cytosol | reactome_physicalent | False |
| 12 | 10033 | R-ALL-9014945 | warfarin [cytosol] | Homo sapiens | warfarin | cytosol | reactome_physicalent | True |
| 16 | 10033 | R-ALL-9014945 | warfarin [cytosol] | Mus musculus | warfarin | cytosol | reactome_physicalent | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 391080 | 9948 | R-ALL-9660998 | verapamil [cytosol] | Schizosaccharomyces pombe | verapamil | cytosol | reactome_physicalent | False |
| 391082 | 9948 | R-ALL-9660998 | verapamil [cytosol] | Sus scrofa | verapamil | cytosol | reactome_physicalent | False |
| 391085 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | Sus scrofa | verapamil | extracellular region | reactome_physicalent | False |
| 391090 | 9948 | R-ALL-9660998 | verapamil [cytosol] | Xenopus tropicalis | verapamil | cytosol | reactome_physicalent | False |
| 391093 | 9948 | R-ALL-9614135 | verapamil [extracellular region] | Xenopus tropicalis | verapamil | extracellular region | reactome_physicalent | False |
49593 rows × 8 columns
# create nodes reactionid df
# from df_edges_phyiscalent_to_reactionid
df_nodes_reactionid = df_edges_phyiscalent_to_reactionid.drop(
columns=['source_layer','source_id']
).rename(
columns={'target_id':'node_id','target_layer':'layer'}
).drop(columns=['source_db_identifier','reactome_pe_name']).drop_duplicates()
df_nodes_reactionid
| | node_id | url | event_name_pathway_or_reaction | evidence_code | species | layer | human |
|---|---|---|---|---|---|---|---|
| 0 | R-BTA-159790 | https://reactome.org/PathwayBrowser/#/R-BTA-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Bos taurus | reactome_reactions | False |
| 1 | R-BTA-9026967 | https://reactome.org/PathwayBrowser/#/R-BTA-90... | VKORC1 inhibitors binds VKORC1 dimer | IEA | Bos taurus | reactome_reactions | False |
| 2 | R-CFA-159790 | https://reactome.org/PathwayBrowser/#/R-CFA-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Canis familiaris | reactome_reactions | False |
| 3 | R-CFA-9026967 | https://reactome.org/PathwayBrowser/#/R-CFA-90... | VKORC1 inhibitors binds VKORC1 dimer | IEA | Canis familiaris | reactome_reactions | False |
| 4 | R-DME-159790 | https://reactome.org/PathwayBrowser/#/R-DME-15... | VKORC1 reduces vitamin K epoxide to MK4 (vitam... | IEA | Drosophila melanogaster | reactome_reactions | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 233705 | R-RNO-9679044 | https://reactome.org/PathwayBrowser/#/R-RNO-96... | FKBP1A binds sirolimus | IEA | Rattus norvegicus | reactome_reactions | False |
| 233706 | R-SCE-9679044 | https://reactome.org/PathwayBrowser/#/R-SCE-96... | FKBP1A binds sirolimus | IEA | Saccharomyces cerevisiae | reactome_reactions | False |
| 233707 | R-SPO-9679044 | https://reactome.org/PathwayBrowser/#/R-SPO-96... | FKBP1A binds sirolimus | IEA | Schizosaccharomyces pombe | reactome_reactions | False |
| 233710 | R-SSC-9679044 | https://reactome.org/PathwayBrowser/#/R-SSC-96... | FKBP1A binds sirolimus | IEA | Sus scrofa | reactome_reactions | False |
| 233711 | R-XTR-9679044 | https://reactome.org/PathwayBrowser/#/R-XTR-96... | FKBP1A binds sirolimus | IEA | Xenopus tropicalis | reactome_reactions | False |
61847 rows × 7 columns
# create nodes penameloc df
# from df_edges_penameloc_to_phyiscalent
df_nodes_penameloc = df_edges_penameloc_to_phyiscalent.drop(
columns=['target_layer','target_id']
).rename(
columns={'source_id':'node_id','source_layer':'layer'}
).drop_duplicates()
df_nodes_penameloc
| | layer | node_id | pe_name | pe_location |
|---|---|---|---|---|
| 0 | reactome_physicalent_nameloc | warfarin [cytosol] | warfarin | cytosol |
| 32 | reactome_physicalent_nameloc | arachidyl ester [endoplasmic reticulum lumen] | arachidyl ester | endoplasmic reticulum lumen |
| 110 | reactome_physicalent_nameloc | xamoterol [extracellular region] | xamoterol | extracellular region |
| 164 | reactome_physicalent_nameloc | yohimbine [extracellular region] | yohimbine | extracellular region |
| 165 | reactome_physicalent_nameloc | Yohimbine [extracellular region] | Yohimbine | extracellular region |
| ... | ... | ... | ... | ... |
| 390837 | reactome_physicalent_nameloc | troglitazone [nucleoplasm] | troglitazone | nucleoplasm |
| 390864 | reactome_physicalent_nameloc | tropicamide [extracellular region] | tropicamide | extracellular region |
| 390918 | reactome_physicalent_nameloc | uramustine [nucleoplasm] | uramustine | nucleoplasm |
| 390921 | reactome_physicalent_nameloc | valsartan [extracellular region] | valsartan | extracellular region |
| 390971 | reactome_physicalent_nameloc | venlafaxine [extracellular region] | venlafaxine | extracellular region |
5922 rows × 4 columns
# create nodes pename df
# from df_edges_pename_to_penameloc
df_nodes_pename = df_edges_pename_to_penameloc.drop(
columns=['target_layer','target_id']
).rename(
columns={'source_id':'node_id','source_layer':'layer'}
).drop_duplicates()
df_nodes_pename
| | layer | node_id |
|---|---|---|
| 0 | reactome_physicalent_name | warfarin |
| 32 | reactome_physicalent_name | arachidyl ester |
| 110 | reactome_physicalent_name | xamoterol |
| 164 | reactome_physicalent_name | yohimbine |
| 165 | reactome_physicalent_name | Yohimbine |
| ... | ... | ... |
| 390837 | reactome_physicalent_name | troglitazone |
| 390864 | reactome_physicalent_name | tropicamide |
| 390918 | reactome_physicalent_name | uramustine |
| 390921 | reactome_physicalent_name | valsartan |
| 390971 | reactome_physicalent_name | venlafaxine |
4084 rows × 2 columns
# create nodes peloc df
# from df_edges_peloc_to_penameloc
df_nodes_peloc = df_edges_peloc_to_penameloc.drop(
columns=['target_layer','target_id']
).rename(
columns={'source_id':'node_id','source_layer':'layer'}
).drop_duplicates()
df_nodes_peloc
| | layer | node_id |
|---|---|---|
| 0 | reactome_physicalent_loc | cytosol |
| 32 | reactome_physicalent_loc | endoplasmic reticulum lumen |
| 110 | reactome_physicalent_loc | extracellular region |
| 706 | reactome_physicalent_loc | endocytic vesicle lumen |
| 1083 | reactome_physicalent_loc | nucleoplasm |
| ... | ... | ... |
| 270753 | reactome_physicalent_loc | autophagosome membrane |
| 272323 | reactome_physicalent_loc | lamellar body |
| 272330 | reactome_physicalent_loc | lamellar body membrane |
| 327144 | reactome_physicalent_loc | clathrin-sculpted gamma-aminobutyric acid tran... |
| 336088 | reactome_physicalent_loc | endosome |
85 rows × 2 columns
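The node dfs above all follow the same pattern: drop the columns for the other side of the edge, rename the remaining ID and layer columns, and de-duplicate. As a sketch only, that pattern could be wrapped in a small helper. The `nodes_from_edges` function and its parameters are hypothetical and are not part of LipiNet.

```python
import pandas as pd

def nodes_from_edges(df_edges, side="source", drop_extra=()):
    """Derive a node table from one side ('source' or 'target') of an edge table."""
    other = "target" if side == "source" else "source"
    return (
        df_edges
        # Remove the opposite endpoint plus any edge-only attribute columns.
        .drop(columns=[f"{other}_layer", f"{other}_id", *drop_extra])
        # Rename the kept endpoint into the node schema used in this notebook.
        .rename(columns={f"{side}_id": "node_id", f"{side}_layer": "layer"})
        .drop_duplicates()
    )

# Toy edges shaped like df_edges_pename_to_penameloc.
df_toy = pd.DataFrame({
    "source_layer": "reactome_physicalent_name",
    "source_id": ["warfarin", "verapamil", "verapamil"],
    "target_layer": "reactome_physicalent_nameloc",
    "target_id": [
        "warfarin [cytosol]",
        "verapamil [cytosol]",
        "verapamil [extracellular region]",
    ],
})

df_toy_nodes = nodes_from_edges(df_toy, side="source")
print(df_toy_nodes)
```

Passing `side="target"` would cover the reaction case, where the nodes come from the target side of the edge table, and `drop_extra` stands in for the additional `.drop(columns=[...])` calls used for some layers.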
Concatenation#
df_edges = pd.concat([
df_edges_chebi_to_physicalent,
df_edges_phyiscalent_to_reactionid,
df_edges_phyiscalent_to_pathwayid,
df_edges_pathwayid_to_ontpathway,
df_edges_penameloc_to_phyiscalent,
df_edges_pename_to_penameloc,
df_edges_peloc_to_penameloc,
df_edges_ontpathway_to_ontpathway,
])
df_edges
| | source_id | target_id | source_layer | target_layer | source_db_identifier | reactome_pe_name | url | event_name_pathway_or_reaction | evidence_code | species | human | pe_name | pe_location | name | interlayer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | 10036 | R-ALL-5696412 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 110 | 10055 | R-ALL-9611688 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 164 | 10093 | R-ALL-9648287 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 165 | 10093 | R-ALL-3296452 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23254 | R-XTR-9958790 | R-XTR-427652 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23255 | R-XTR-9958790 | R-XTR-433137 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23256 | R-XTR-9958863 | R-XTR-352230 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23257 | R-XTR-9958863 | R-XTR-428559 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23258 | R-XTR-9959399 | R-XTR-427975 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
697172 rows × 15 columns
df_nodes = pd.concat([
    df_nodes_chebi,
    df_nodes_physicalent,
    df_nodes_reactionid,
    df_nodes_pathwayid,
    df_nodes_ontpathway,
    df_nodes_penameloc,
    df_nodes_pename,
    df_nodes_peloc,
])
df_nodes
| | node_id | layer | source_db_identifier | reactome_pe_name | species | pe_name | pe_location | human | url | event_name_pathway_or_reaction | evidence_code | name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | 10036 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 110 | 10055 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 164 | 10093 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 305 | 10100 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 270753 | autophagosome membrane | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 272323 | lamellar body | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 272330 | lamellar body membrane | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 327144 | clathrin-sculpted gamma-aminobutyric acid tran... | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 336088 | endosome | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
169362 rows × 12 columns
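The many NaN values and the repeated index labels in the two concatenated tables follow from how pd.concat combines frames with different columns. A minimal sketch with toy stand-ins (the frames and values below are hypothetical, not taken from Reactome):

```python
import pandas as pd

# Toy stand-ins for two per-layer node tables (hypothetical values).
df_a = pd.DataFrame({"node_id": ["10033"], "layer": ["reactome_chebi"]}, index=[0])
df_b = pd.DataFrame(
    {"node_id": ["R-HSA-1"], "layer": ["reactome_pathway"], "human": [True]},
    index=[0],
)

combined = pd.concat([df_a, df_b])

# The result has the union of the input columns; values missing from an
# input table are filled with NaN.
print(list(combined.columns))
print(combined["human"].isna().tolist())

# Original index labels are kept, so they are no longer unique.
print(combined.index.is_unique)
```

If unique row labels are needed downstream, passing ignore_index=True to pd.concat renumbers the rows from zero.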
Filtering to human reaction/pathway nodes only#
Note that this still keeps nodes that have no defined species (such as ChEBI IDs and Reactome physical entities).
def filter_reactome(df, human_only=True):
    if human_only:
        # ne(False) keeps rows where 'human' is True or missing (NaN)
        filtered = df[df['human'].ne(False)]
        print(f'Dropped rows from non-human species (dropped: {len(df)-len(filtered)}; remaining: {len(filtered)})')
        return filtered
    return df
df_nodes_human = filter_reactome(df_nodes)
df_nodes_human
Dropped rows from non-human species (dropped: 114171; remaining: 55191)
| | node_id | layer | source_db_identifier | reactome_pe_name | species | pe_name | pe_location | human | url | event_name_pathway_or_reaction | evidence_code | name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | 10036 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 110 | 10055 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 164 | 10093 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 305 | 10100 | reactome_chebi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 270753 | autophagosome membrane | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 272323 | lamellar body | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 272330 | lamellar body membrane | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 327144 | clathrin-sculpted gamma-aminobutyric acid tran... | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 336088 | endosome | reactome_physicalent_loc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
55191 rows × 12 columns
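The reason rows without a species survive the filter is that a missing value is not equal to False, so .ne(False) evaluates to True for it. A small sketch on toy data (the IDs below are made up for illustration) contrasts this with a stricter .eq(True) filter, which would also drop the rows with no species:

```python
import numpy as np
import pandas as pd

# Toy node table: one human row, one non-human row, one row without species info.
toy = pd.DataFrame({
    "node_id": ["R-HSA-1", "R-XTR-1", "10033"],
    "human": [True, False, np.nan],
})

# Same condition as in filter_reactome: keep everything that is not explicitly False.
keep_not_false = toy[toy["human"].ne(False)]

# Stricter alternative: keep only rows explicitly flagged as human.
keep_only_true = toy[toy["human"].eq(True)]

print(keep_not_false["node_id"].tolist())
print(keep_only_true["node_id"].tolist())
```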
df_edges_human = filter_reactome(df_edges)
df_edges_human
Dropped rows from non-human species (dropped: 532405; remaining: 164767)
| | source_id | target_id | source_layer | target_layer | source_db_identifier | reactome_pe_name | url | event_name_pathway_or_reaction | evidence_code | species | human | pe_name | pe_location | name | interlayer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10033 | R-ALL-9014945 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | 10036 | R-ALL-5696412 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 110 | 10055 | R-ALL-9611688 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 164 | 10093 | R-ALL-9648287 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 165 | 10093 | R-ALL-3296452 | reactome_chebi | reactome_physicalent | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23254 | R-XTR-9958790 | R-XTR-427652 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23255 | R-XTR-9958790 | R-XTR-433137 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23256 | R-XTR-9958863 | R-XTR-352230 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23257 | R-XTR-9958863 | R-XTR-428559 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 23258 | R-XTR-9959399 | R-XTR-427975 | reactome_pathway_ontology | reactome_pathway_ontology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
164767 rows × 15 columns
Ensuring equivalency#
This section briefly checks that the manual steps outlined above produce the same result as the parse_reactome_data() function.
reactome_results = parse_reactome_data(verbose=True, use_cache=True, force_download=False)
df_reactome_nodes = reactome_results['df_nodes']
df_reactome_edges = reactome_results['df_edges']
⏬ loading Reactome raw tables …
Fetching ChEBI2Reactome_PE_All_Levels.tsv
Override requested; re-downloading ChEBI2Reactome_PE_All_Levels.tsv from https://reactome.org/download/current/ChEBI2Reactome_PE_All_Levels.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_All_Levels.tsv.
Fetching ChEBI2Reactome_PE_Reactions.tsv
Override requested; re-downloading ChEBI2Reactome_PE_Reactions.tsv from https://reactome.org/download/current/ChEBI2Reactome_PE_Reactions.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_Reactions.tsv.
Fetching ReactomePathways.tsv
Override requested; re-downloading ReactomePathways.tsv from https://reactome.org/download/current/ReactomePathways.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathways.tsv.
Fetching ReactomePathwaysRelation.tsv
Override requested; re-downloading ReactomePathwaysRelation.tsv from https://reactome.org/download/current/ReactomePathwaysRelation.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathwaysRelation.tsv.
Returning ['ChEBI2Reactome_PE_All_Levels.tsv', 'ChEBI2Reactome_PE_Reactions.tsv', 'ReactomePathways.tsv', 'ReactomePathwaysRelation.tsv'] as a dict of dfs
[reactome] edges: (164767, 15)
[reactome] nodes: (55191, 12)
print(df_reactome_edges.shape)
print(df_reactome_nodes.shape)
print(df_reactome_nodes.columns)
(164767, 15)
(55191, 12)
Index(['node_id', 'layer', 'source_db_identifier', 'reactome_pe_name',
'species', 'pe_name', 'pe_location', 'human', 'url',
'event_name_pathway_or_reaction', 'evidence_code', 'name'],
dtype='object')
print(df_edges_human.shape)
print(df_nodes_human.shape)
print(df_nodes_human.columns)
(164767, 15)
(55191, 12)
Index(['node_id', 'layer', 'source_db_identifier', 'reactome_pe_name',
'species', 'pe_name', 'pe_location', 'human', 'url',
'event_name_pathway_or_reaction', 'evidence_code', 'name'],
dtype='object')
def nodes_topology(df):
    return df[['layer','node_id']].drop_duplicates()
# auto vs manual topology counts
auto_top = nodes_topology(df_reactome_nodes)
man_top = nodes_topology(df_nodes_human)
print("AUTO unique nodes:", len(auto_top))
print("MAN unique nodes:", len(man_top))
# by-layer comparison
by_layer_auto = auto_top.groupby('layer')['node_id'].nunique().sort_values(ascending=False)
by_layer_man = man_top.groupby('layer')['node_id'].nunique().sort_values(ascending=False)
print("\nBy-layer differences (AUTO - MAN):")
diff = (by_layer_auto.to_frame('auto')
        .join(by_layer_man.to_frame('man'), how='outer').fillna(0).astype(int))
diff['delta'] = diff['auto'] - diff['man']
diff.sort_values('delta', ascending=False)
# where attributes are causing duplication (same id, multiple rows)
dup_counts = (df_reactome_nodes
              .groupby(['layer','node_id']).size()
              .reset_index(name='rows_per_id')
              .query('rows_per_id > 1')
              .sort_values('rows_per_id', ascending=False))
dup_counts.head(20)
AUTO unique nodes: 54174
MAN unique nodes: 54174
By-layer differences (AUTO - MAN):
| | layer | node_id | rows_per_id |
|---|---|---|---|
| 33795 | reactome_physicalent | R-COV-9694285 | 4 |
| 33800 | reactome_physicalent | R-COV-9694354 | 4 |
| 34372 | reactome_physicalent | R-HSA-6790604 | 4 |
| 33767 | reactome_physicalent | R-COV-9685518 | 4 |
| 33785 | reactome_physicalent | R-COV-9685914 | 4 |
| 33786 | reactome_physicalent | R-COV-9685915 | 4 |
| 33787 | reactome_physicalent | R-COV-9685916 | 4 |
| 33788 | reactome_physicalent | R-COV-9685917 | 4 |
| 33789 | reactome_physicalent | R-COV-9685918 | 4 |
| 33790 | reactome_physicalent | R-COV-9685920 | 4 |
| 33792 | reactome_physicalent | R-COV-9685922 | 4 |
| 33822 | reactome_physicalent | R-COV-9697431 | 4 |
| 33797 | reactome_physicalent | R-COV-9694325 | 4 |
| 33791 | reactome_physicalent | R-COV-9685921 | 4 |
| 33802 | reactome_physicalent | R-COV-9694393 | 4 |
| 33803 | reactome_physicalent | R-COV-9694456 | 4 |
| 33813 | reactome_physicalent | R-COV-9694622 | 4 |
| 33805 | reactome_physicalent | R-COV-9694496 | 4 |
| 33806 | reactome_physicalent | R-COV-9694503 | 4 |
| 33807 | reactome_physicalent | R-COV-9694513 | 4 |
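To find out which attribute columns make the same node ID appear on several rows, one option is to count distinct values per column within each (layer, node_id) group. This is only a sketch on toy data; the values below are hypothetical and do not indicate which columns actually vary in the Reactome tables:

```python
import pandas as pd

# Toy node table in which one ID appears twice with different attributes
# (hypothetical values).
toy_nodes = pd.DataFrame({
    "layer": ["reactome_physicalent"] * 3,
    "node_id": ["R-HSA-1", "R-HSA-1", "R-HSA-2"],
    "pe_location": ["cytosol", "nucleoplasm", "cytosol"],
    "species": ["Homo sapiens"] * 3,
})

# Number of distinct values per attribute within each (layer, node_id) group.
per_id = toy_nodes.groupby(["layer", "node_id"]).nunique()

# Attributes that take more than one value for at least one ID.
varies = (per_id > 1).any()
varying_columns = varies[varies].index.tolist()
print(varying_columns)
```

The same two steps could be applied to df_reactome_nodes to see which attributes explain the rows_per_id values above.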
# Compare per-layer topology after the two patches:
auto_top = reactome_results['df_nodes'][['layer','node_id']].drop_duplicates()
man_top = df_nodes_human[['layer','node_id']].drop_duplicates()
(by_layer_auto := auto_top.groupby('layer').size().sort_values(ascending=False)).head(10)
(by_layer_man := man_top.groupby('layer').size().sort_values(ascending=False)).head(10)
# See which layers still differ and by how much
diff = (by_layer_auto.to_frame('auto')
        .join(by_layer_man.to_frame('man'), how='outer').fillna(0).astype(int))
diff['delta'] = diff['auto'] - diff['man']
diff.sort_values('delta', ascending=False)
| layer | auto | man | delta |
|---|---|---|---|
| reactome_chebi | 3068 | 3068 | 0 |
| reactome_pathway | 2477 | 2477 | 0 |
| reactome_pathway_ontology | 23157 | 23157 | 0 |
| reactome_physicalent | 5925 | 5925 | 0 |
| reactome_physicalent_loc | 85 | 85 | 0 |
| reactome_physicalent_name | 4084 | 4084 | 0 |
| reactome_physicalent_nameloc | 5922 | 5922 | 0 |
| reactome_reactions | 9456 | 9456 | 0 |
Great, the two approaches look equivalent: the node and edge tables have the same shapes, the node columns match, and the per-layer unique node counts are identical.
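Matching counts do not strictly guarantee that both tables contain the same IDs. A stricter check could compare the key columns directly with an outer merge and its indicator column. The helper name compare_keys and the toy tables below are assumptions made for illustration; they are not part of LipiNet:

```python
import pandas as pd

def compare_keys(df_left, df_right, keys):
    """Count key combinations found in both tables or in only one of them."""
    left = df_left[keys].drop_duplicates()
    right = df_right[keys].drop_duplicates()
    # indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
    merged = left.merge(right, on=keys, how="outer", indicator=True)
    return merged["_merge"].value_counts().to_dict()

# Toy tables (hypothetical values): same number of nodes, but one ID differs.
auto = pd.DataFrame({"layer": ["a", "a"], "node_id": ["1", "2"]})
manual = pd.DataFrame({"layer": ["a", "a"], "node_id": ["1", "3"]})

result = compare_keys(auto, manual, ["layer", "node_id"])
print(result)
```

On the notebook's tables this might be called as compare_keys(df_reactome_nodes, df_nodes_human, ['layer', 'node_id']), and with ['source_id', 'target_id'] for the edge tables, assuming the key columns have the same dtypes in both tables. Zero left_only and right_only counts would confirm that the same IDs are present in both.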