Parse Reactome#

Parsing Reactome data into a network for LipiNet.

LipiNet offers convenient functions to parse prior knowledge resources straight into networks. For instance, LipiNet can parse Reactome data into a network as easily as running: parse_reactome_data()

However, to show what happens behind the scenes, this notebook also walks through the data and each step that the function performs in the background. This may be particularly helpful for users who need to customise the networks in ways that LipiNet does not yet support directly.

Using parse_reactome_data()#

The LipiNet parse_reactome_data() function automatically parses Reactome data into a network. LipiNet uses its output as input to the overall combined network, and for most users who want to build sub-networks from Reactome data alone this function will probably suffice.

from lipinet.parse_reactome import parse_reactome_data 

reactome_results = parse_reactome_data(verbose=True, use_cache=True)
df_reactome_nodes = reactome_results['df_nodes']
df_reactome_edges = reactome_results['df_edges']
↪ loading Reactome (processed) from cache: reactome_human_nb

To avoid repeatedly downloading the Reactome data (and overloading their servers with requests), set use_cache=True. If no cache exists yet, the download is automatically saved to the cache. If a cache already exists, it is used instead.

To override the cache you can set force_download=True, but this is only recommended every few months, when you want to refresh the source data in case of changes.
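
The caching behaviour described above follows a common "use the cache unless told otherwise" pattern. The sketch below illustrates that pattern only; load_with_cache and fake_download are hypothetical names made up for this example and are not part of LipiNet, whose internal implementation may differ.

```python
import json
import tempfile
from pathlib import Path


def load_with_cache(fetch, cache_path, force_download=False):
    """Return cached data if present, otherwise call fetch() and cache the result.

    Hypothetical helper for illustration; not part of LipiNet.
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not force_download:
        return json.loads(cache_path.read_text())
    data = fetch()  # the expensive step, e.g. a download
    cache_path.write_text(json.dumps(data))
    return data


calls = []


def fake_download():
    # Stand-in for a real download; records each call so we can count them
    calls.append(1)
    return {"rows": 3}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "reactome_cache.json"
    first = load_with_cache(fake_download, cache_file)   # no cache yet: downloads and saves
    second = load_with_cache(fake_download, cache_file)  # cache exists: no download
    third = load_with_cache(fake_download, cache_file, force_download=True)  # forced refresh

print(len(calls))  # number of times the download actually ran
```

Only the first and the forced call trigger the download, which is why force_download is best reserved for occasional refreshes.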

Where to from here?#

Manual parsing#

For users wanting to better understand all the steps being undertaken behind the parse_reactome_data() function, we will recreate the steps here.

Download#

import importlib

import pandas as pd

import lipinet
from lipinet.databases import get_prior_knowledge
# from lipinet.utils import split_and_expand_large, create_nodedf_from_edgedf, check_for_split_characters

# Reload the modules so that local edits to lipinet are picked up.
# Names imported with `from ... import ...` before a reload keep pointing at
# the old objects, so re-run the import above after reloading if needed.
importlib.reload(lipinet)
importlib.reload(lipinet.databases)
<module 'lipinet.databases' from '/Users/macsbook/Code/lipinet/lipinet/databases.py'>
import lipinet, lipinet.databases as db
print(lipinet.__file__)
print(db.__file__)
None
/Users/macsbook/Code/lipinet/lipinet/databases.py
reactome_dfs = get_prior_knowledge('reactome', verbose=True)
Fetching ChEBI2Reactome_PE_All_Levels.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_All_Levels.tsv. Loading data...
Fetching ChEBI2Reactome_PE_Reactions.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_Reactions.tsv. Loading data...
Fetching ReactomePathways.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathways.tsv. Loading data...
Fetching ReactomePathwaysRelation.tsv
File found locally at /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathwaysRelation.tsv. Loading data...
Returning ['ChEBI2Reactome_PE_All_Levels.tsv', 'ChEBI2Reactome_PE_Reactions.tsv', 'ReactomePathways.tsv', 'ReactomePathwaysRelation.tsv'] as a dict of dfs
reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']
source_db_identifier reactome_pe_stableid reactome_pe_name reactome_pathway_stableid url event_name_pathway_or_reaction evidence_code species
0 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-1430728 https://reactome.org/PathwayBrowser/#/R-BTA-14... Metabolism IEA Bos taurus
1 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-196854 https://reactome.org/PathwayBrowser/#/R-BTA-19... Metabolism of vitamins and cofactors IEA Bos taurus
2 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-6806664 https://reactome.org/PathwayBrowser/#/R-BTA-68... Metabolism of vitamin K IEA Bos taurus
3 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-6806667 https://reactome.org/PathwayBrowser/#/R-BTA-68... Metabolism of fat-soluble vitamins IEA Bos taurus
4 10033 R-ALL-9014945 warfarin [cytosol] R-CFA-1430728 https://reactome.org/PathwayBrowser/#/R-CFA-14... Metabolism IEA Canis familiaris
... ... ... ... ... ... ... ... ...
391091 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-211859 https://reactome.org/PathwayBrowser/#/R-XTR-21... Biological oxidations IEA Xenopus tropicalis
391092 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-211945 https://reactome.org/PathwayBrowser/#/R-XTR-21... Phase I - Functionalization of compounds IEA Xenopus tropicalis
391093 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-397014 https://reactome.org/PathwayBrowser/#/R-XTR-39... Muscle contraction IEA Xenopus tropicalis
391094 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-5576891 https://reactome.org/PathwayBrowser/#/R-XTR-55... Cardiac conduction IEA Xenopus tropicalis
391095 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-5576893 https://reactome.org/PathwayBrowser/#/R-XTR-55... Phase 2 - plateau phase IEA Xenopus tropicalis

391096 rows × 8 columns

reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv']
source_db_identifier reactome_pe_stableid reactome_pe_name reactome_pathway_stableid url event_name_pathway_or_reaction evidence_code species
0 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-159790 https://reactome.org/PathwayBrowser/#/R-BTA-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Bos taurus
1 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-9026967 https://reactome.org/PathwayBrowser/#/R-BTA-90... VKORC1 inhibitors binds VKORC1 dimer IEA Bos taurus
2 10033 R-ALL-9014945 warfarin [cytosol] R-CFA-159790 https://reactome.org/PathwayBrowser/#/R-CFA-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Canis familiaris
3 10033 R-ALL-9014945 warfarin [cytosol] R-CFA-9026967 https://reactome.org/PathwayBrowser/#/R-CFA-90... VKORC1 inhibitors binds VKORC1 dimer IEA Canis familiaris
4 10033 R-ALL-9014945 warfarin [cytosol] R-DME-159790 https://reactome.org/PathwayBrowser/#/R-DME-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Drosophila melanogaster
... ... ... ... ... ... ... ... ...
234736 9948 R-ALL-9614135 verapamil [extracellular region] R-SSC-9614031 https://reactome.org/PathwayBrowser/#/R-SSC-96... Class IV antihypertensives bind LTCC multimer IEA Sus scrofa
234737 9948 R-ALL-9660998 verapamil [cytosol] R-SSC-9659680 https://reactome.org/PathwayBrowser/#/R-SSC-96... ABCB1 transports xenobiotics out of the cell IEA Sus scrofa
234738 9948 R-ALL-9660998 verapamil [cytosol] R-SSC-9678766 https://reactome.org/PathwayBrowser/#/R-SSC-96... CYP3A4 binds CYP3A4 inhibitors IEA Sus scrofa
234739 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-9614031 https://reactome.org/PathwayBrowser/#/R-XTR-96... Class IV antihypertensives bind LTCC multimer IEA Xenopus tropicalis
234740 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-9678766 https://reactome.org/PathwayBrowser/#/R-XTR-96... CYP3A4 binds CYP3A4 inhibitors IEA Xenopus tropicalis

234741 rows × 8 columns

reactome_dfs['ReactomePathways.tsv']
reactome_pathway_stableid reactome_pathway_name species
0 R-BTA-73843 5-Phosphoribose 1-diphosphate biosynthesis Bos taurus
1 R-BTA-1369062 ABC transporters in lipid homeostasis Bos taurus
2 R-BTA-382556 ABC-family proteins mediated transport Bos taurus
3 R-BTA-9033807 ABO blood group biosynthesis Bos taurus
4 R-BTA-418592 ADP signalling through P2Y purinoceptor 1 Bos taurus
... ... ... ...
23152 R-XTR-379724 tRNA Aminoacylation Xenopus tropicalis
23153 R-XTR-6787450 tRNA modification in the mitochondrion Xenopus tropicalis
23154 R-XTR-6782315 tRNA modification in the nucleus and cytosol Xenopus tropicalis
23155 R-XTR-72306 tRNA processing Xenopus tropicalis
23156 R-XTR-199992 trans-Golgi Network Vesicle Budding Xenopus tropicalis

23157 rows × 3 columns

reactome_dfs['ReactomePathwaysRelation.tsv']
parent_stableid child_stableid
0 R-BTA-109581 R-BTA-109606
1 R-BTA-109581 R-BTA-169911
2 R-BTA-109581 R-BTA-5357769
3 R-BTA-109581 R-BTA-75153
4 R-BTA-109582 R-BTA-140877
... ... ...
23254 R-XTR-9958790 R-XTR-427652
23255 R-XTR-9958790 R-XTR-433137
23256 R-XTR-9958863 R-XTR-352230
23257 R-XTR-9958863 R-XTR-428559
23258 R-XTR-9959399 R-XTR-427975

23259 rows × 2 columns

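df_pe_allpathway_split = reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'].copy()
# Split "name [location]" into separate name and location columns:
# drop the trailing "]" and then split on the " [" separator
df_pe_allpathway_split[['pe_name','pe_location']] = df_pe_allpathway_split['reactome_pe_name'].str.rstrip(']').str.split(r' \[', expand=True)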
df_pe_allpathway_split
source_db_identifier reactome_pe_stableid reactome_pe_name reactome_pathway_stableid url event_name_pathway_or_reaction evidence_code species pe_name pe_location
0 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-1430728 https://reactome.org/PathwayBrowser/#/R-BTA-14... Metabolism IEA Bos taurus warfarin cytosol
1 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-196854 https://reactome.org/PathwayBrowser/#/R-BTA-19... Metabolism of vitamins and cofactors IEA Bos taurus warfarin cytosol
2 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-6806664 https://reactome.org/PathwayBrowser/#/R-BTA-68... Metabolism of vitamin K IEA Bos taurus warfarin cytosol
3 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-6806667 https://reactome.org/PathwayBrowser/#/R-BTA-68... Metabolism of fat-soluble vitamins IEA Bos taurus warfarin cytosol
4 10033 R-ALL-9014945 warfarin [cytosol] R-CFA-1430728 https://reactome.org/PathwayBrowser/#/R-CFA-14... Metabolism IEA Canis familiaris warfarin cytosol
... ... ... ... ... ... ... ... ... ... ...
391091 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-211859 https://reactome.org/PathwayBrowser/#/R-XTR-21... Biological oxidations IEA Xenopus tropicalis verapamil cytosol
391092 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-211945 https://reactome.org/PathwayBrowser/#/R-XTR-21... Phase I - Functionalization of compounds IEA Xenopus tropicalis verapamil cytosol
391093 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-397014 https://reactome.org/PathwayBrowser/#/R-XTR-39... Muscle contraction IEA Xenopus tropicalis verapamil extracellular region
391094 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-5576891 https://reactome.org/PathwayBrowser/#/R-XTR-55... Cardiac conduction IEA Xenopus tropicalis verapamil extracellular region
391095 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-5576893 https://reactome.org/PathwayBrowser/#/R-XTR-55... Phase 2 - plateau phase IEA Xenopus tropicalis verapamil extracellular region

391096 rows × 10 columns
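
The name and location split above assumes every reactome_pe_name ends in a single bracketed location. As a hedged alternative, the sketch below uses str.extract on a toy frame so that rows without a bracketed location do not break the column assignment. The third toy row is invented for illustration; it is not known whether such rows occur in the real Reactome file.

```python
import pandas as pd

toy = pd.DataFrame({
    "reactome_pe_name": [
        "warfarin [cytosol]",
        "verapamil [extracellular region]",
        "unlabelled entity",  # hypothetical row with no bracketed location
    ]
})

# Capture the text before the final " [...]" as the name
# and the bracket contents as the location
parts = toy["reactome_pe_name"].str.extract(
    r"^(?P<pe_name>.*) \[(?P<pe_location>[^\]]+)\]$"
)

# Rows that do not match come back as NaN; keep the full string as the name
parts["pe_name"] = parts["pe_name"].fillna(toy["reactome_pe_name"])

toy = toy.join(parts)
print(toy)
```

With this approach an unmatched row keeps its full name and gets a missing location, rather than changing the number of columns that the split returns.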

Checks#

In this section we run some quick checks and exploratory data analysis (EDA) on the Reactome dfs.

Which pathway stable IDs are shared between reactome_dfs['ReactomePathways.tsv'] and df_pe_allpathway_split, and which are not?

reactome_dfs['ReactomePathways.tsv'][reactome_dfs['ReactomePathways.tsv']['reactome_pathway_stableid'].isin(df_pe_allpathway_split['reactome_pathway_stableid'])]
reactome_pathway_stableid reactome_pathway_name species
0 R-BTA-73843 5-Phosphoribose 1-diphosphate biosynthesis Bos taurus
1 R-BTA-1369062 ABC transporters in lipid homeostasis Bos taurus
2 R-BTA-382556 ABC-family proteins mediated transport Bos taurus
3 R-BTA-9033807 ABO blood group biosynthesis Bos taurus
4 R-BTA-418592 ADP signalling through P2Y purinoceptor 1 Bos taurus
... ... ... ...
23152 R-XTR-379724 tRNA Aminoacylation Xenopus tropicalis
23153 R-XTR-6787450 tRNA modification in the mitochondrion Xenopus tropicalis
23154 R-XTR-6782315 tRNA modification in the nucleus and cytosol Xenopus tropicalis
23155 R-XTR-72306 tRNA processing Xenopus tropicalis
23156 R-XTR-199992 trans-Golgi Network Vesicle Budding Xenopus tropicalis

20140 rows × 3 columns

So most pathways overlap (20,140 of 23,157). Which don’t?

reactome_dfs['ReactomePathways.tsv'][~reactome_dfs['ReactomePathways.tsv']['reactome_pathway_stableid'].isin(df_pe_allpathway_split['reactome_pathway_stableid'])]
reactome_pathway_stableid reactome_pathway_name species
6 R-BTA-211163 AKT-mediated inactivation of FOXO1A Bos taurus
9 R-BTA-179409 APC-Cdc20 mediated degradation of Nek2A Bos taurus
11 R-BTA-174048 APC/C:Cdc20 mediated degradation of Cyclin B Bos taurus
12 R-BTA-174154 APC/C:Cdc20 mediated degradation of Securin Bos taurus
13 R-BTA-176409 APC/C:Cdc20 mediated degradation of mitotic pr... Bos taurus
... ... ... ...
23118 R-XTR-2032785 YAP1- and WWTR1 (TAZ)-stimulated gene expression Xenopus tropicalis
23146 R-XTR-209543 p75NTR recruits signalling complexes Xenopus tropicalis
23148 R-XTR-193639 p75NTR signals via NF-kB Xenopus tropicalis
23150 R-XTR-72312 rRNA processing Xenopus tropicalis
23151 R-XTR-8868773 rRNA processing in the nucleus and cytosol Xenopus tropicalis

3017 rows × 3 columns

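A minority (3,017) don’t overlap. One plausible reason is that these pathways have no ChEBI-mapped physical entities and so never appear in the ChEBI2Reactome files, but this is not confirmed here.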
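
The two isin filters above can also be expressed as a single left merge with indicator=True, which labels each pathway as shared or missing in one pass. This is a minimal sketch on toy data; the three stable IDs are taken from the tables above, but the frames themselves are made up for illustration.

```python
import pandas as pd

# Toy stand-in for ReactomePathways.tsv
pathways = pd.DataFrame({
    "reactome_pathway_stableid": ["R-BTA-73843", "R-BTA-211163", "R-BTA-1369062"],
})
# Toy stand-in for the pathway column of ChEBI2Reactome_PE_All_Levels.tsv
pe_pathways = pd.DataFrame({
    "reactome_pathway_stableid": ["R-BTA-73843", "R-BTA-73843", "R-BTA-1369062"],
})

# Deduplicate the right-hand side so the merge cannot multiply rows
right = pe_pathways[["reactome_pathway_stableid"]].drop_duplicates()
merged = pathways.merge(
    right, on="reactome_pathway_stableid", how="left", indicator=True
)

# "both" = pathway also present in the entity table, "left_only" = no entity rows
print(merged["_merge"].value_counts())
missing = merged.loc[merged["_merge"] == "left_only", "reactome_pathway_stableid"].tolist()
print(missing)
```

Keeping the _merge column around makes it easy to carry the shared or missing label into later filtering steps.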

We based this on the pathway stable ID, but what about the pathway names? Do they overlap differently?

reactome_dfs['ReactomePathways.tsv'][reactome_dfs['ReactomePathways.tsv']['reactome_pathway_name'].isin(reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']['event_name_pathway_or_reaction'])]
reactome_pathway_stableid reactome_pathway_name species
0 R-BTA-73843 5-Phosphoribose 1-diphosphate biosynthesis Bos taurus
1 R-BTA-1369062 ABC transporters in lipid homeostasis Bos taurus
2 R-BTA-382556 ABC-family proteins mediated transport Bos taurus
3 R-BTA-9033807 ABO blood group biosynthesis Bos taurus
4 R-BTA-418592 ADP signalling through P2Y purinoceptor 1 Bos taurus
... ... ... ...
23152 R-XTR-379724 tRNA Aminoacylation Xenopus tropicalis
23153 R-XTR-6787450 tRNA modification in the mitochondrion Xenopus tropicalis
23154 R-XTR-6782315 tRNA modification in the nucleus and cytosol Xenopus tropicalis
23155 R-XTR-72306 tRNA processing Xenopus tropicalis
23156 R-XTR-199992 trans-Golgi Network Vesicle Budding Xenopus tropicalis

21463 rows × 3 columns

reactome_dfs['ReactomePathways.tsv'][~reactome_dfs['ReactomePathways.tsv']['reactome_pathway_name'].isin(reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']['event_name_pathway_or_reaction'])]
reactome_pathway_stableid reactome_pathway_name species
9 R-BTA-179409 APC-Cdc20 mediated degradation of Nek2A Bos taurus
11 R-BTA-174048 APC/C:Cdc20 mediated degradation of Cyclin B Bos taurus
12 R-BTA-174154 APC/C:Cdc20 mediated degradation of Securin Bos taurus
13 R-BTA-176409 APC/C:Cdc20 mediated degradation of mitotic pr... Bos taurus
14 R-BTA-174178 APC/C:Cdh1 mediated degradation of Cdc20 and o... Bos taurus
... ... ... ...
23092 R-XTR-195399 VEGF binds to VEGFR leading to receptor dimeri... Xenopus tropicalis
23093 R-XTR-194313 VEGF ligand-receptor interactions Xenopus tropicalis
23097 R-XTR-8866427 VLDLR internalisation and degradation Xenopus tropicalis
23113 R-XTR-5140745 WNT5A-dependent internalization of FZD2, FZD5 ... Xenopus tropicalis
23118 R-XTR-2032785 YAP1- and WWTR1 (TAZ)-stimulated gene expression Xenopus tropicalis

1694 rows × 3 columns

So more pathway names than stable IDs are shared between the two dfs (21,463 versus 20,140). This is probably because either:

  1. some pathway names are reused by several stable IDs, for example the same name appearing once per species (Metabolism is both R-BTA-1430728 and R-CFA-1430728 in the table above), or

  2. the stable IDs were updated while the names stayed the same (hopefully less likely, given that they are meant to be stable IDs).
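
The first explanation can be checked directly by counting how many distinct stable IDs share each pathway name. The sketch below does this on a toy frame whose rows are modelled on the tables above; on the real data the same groupby would be run on reactome_dfs['ReactomePathways.tsv'].

```python
import pandas as pd

# Toy rows modelled on the Reactome tables shown above
pathways = pd.DataFrame({
    "reactome_pathway_stableid": ["R-BTA-1430728", "R-CFA-1430728", "R-BTA-196854"],
    "reactome_pathway_name": [
        "Metabolism",
        "Metabolism",
        "Metabolism of vitamins and cofactors",
    ],
    "species": ["Bos taurus", "Canis familiaris", "Bos taurus"],
})

# Number of distinct stable IDs behind each pathway name
ids_per_name = pathways.groupby("reactome_pathway_name")["reactome_pathway_stableid"].nunique()

# Names used by more than one stable ID
shared_names = ids_per_name[ids_per_name > 1]
print(shared_names)
```

If most names turn out to be shared across species-specific stable IDs, matching on names will always look more permissive than matching on IDs.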

The best linkage key between the pathways (reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']) and the reactions (reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv']) may be reactome_pe_stableid.

  • Each pathway in reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'] is linked to multiple physical entity stable IDs (reactome_pe_stableid). Each of these stable IDs also has a name and a cellular location (whether these are best modelled as nodes or as attributes is an open question).

  • These reactome_pe_stableid values should in turn be linked to multiple reactions (ChEBI2Reactome_PE_Reactions.tsv).

pe_ids_pathways = set(reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv']['reactome_pe_stableid'])
pe_ids_reactions = set(reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv']['reactome_pe_stableid'])

# Count the unique physical entity stable IDs in each file, then their overlap
print(f'len of ChEBI2Reactome_PE_All_Levels: {len(pe_ids_pathways)}')
print(f'len of ChEBI2Reactome_PE_Reactions: {len(pe_ids_reactions)}')
print(f'len of overlap between pathways and reactions using reactome_pe_stableid: {len(pe_ids_pathways & pe_ids_reactions)}')
len of ChEBI2Reactome_PE_All_Levels: 6001
len of ChEBI2Reactome_PE_Reactions: 6177
len of overlap between pathways and reactions using reactome_pe_stableid: 6001

So this approach looks workable: all 6,001 physical entity stable IDs in the pathways file also appear in the reactions file. A further 176 appear only in the reactions file. It would be interesting to see which these are, and we should keep them in mind for later joins.
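
To see which physical entities occur in the reactions table but not in the pathways table, an isin mask can be inverted. The sketch below uses toy frames; the ID R-ALL-0000001 and the reaction names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two ChEBI2Reactome tables; the last reaction ID is made up
pe_pathways = pd.DataFrame({
    "reactome_pe_stableid": ["R-ALL-9014945", "R-ALL-9660998"],
})
pe_reactions = pd.DataFrame({
    "reactome_pe_stableid": ["R-ALL-9014945", "R-ALL-9660998", "R-ALL-0000001"],
    "event_name_pathway_or_reaction": ["reaction A", "reaction B", "reaction C"],
})

# True where the reaction row's physical entity also occurs in the pathways table
in_pathways = pe_reactions["reactome_pe_stableid"].isin(pe_pathways["reactome_pe_stableid"])

# Rows whose physical entity is only known from reactions
reactions_only = pe_reactions[~in_pathways]
print(reactions_only)
```

On the real data the same mask would be built from the two reactome_dfs tables, and the resulting rows would show which entities need special handling in later joins.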

Before going further, we should check whether rolling up by stable ID would lose the name or compartment information, or whether each name stays unique to its stable ID.

reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'].value_counts(subset=['reactome_pe_stableid'])
reactome_pe_stableid
R-ALL-113592            8540
R-ALL-29370             7836
R-ALL-29356             6144
R-ALL-29372             4253
R-ALL-29438             4192
                        ... 
R-ALL-30495                2
R-ALL-9750013              2
R-ALL-9749584              2
R-HSA-8943989              2
R-ALL-9635420              2
Name: count, Length: 6001, dtype: int64
reactome_dfs['ChEBI2Reactome_PE_All_Levels.tsv'].value_counts(subset=['reactome_pe_name'])
reactome_pe_name                  
ATP [cytosol]                         8540
ADP [cytosol]                         7836
H2O [cytosol]                         6144
Pi [cytosol]                          4253
GTP [cytosol]                         4192
                                      ... 
CSN polymer [extracellular region]       2
sulfo-Cipro [cytosol]                    2
sulfo-Cipro [extracellular region]       2
4-5 nt RNA [mitochondrial matrix]        2
SKM [cytosol]                            2
Name: count, Length: 5922, dtype: int64

Interestingly, the two lengths differ slightly (6,001 stable IDs versus 5,922 names), so there is not a direct 1:1 association between reactome_pe_stableid and reactome_pe_name.

df_pe_allpathway_split.value_counts(subset='species')
species
Homo sapiens                  57728
Mus musculus                  33310
Rattus norvegicus             32852
Bos taurus                    32728
Sus scrofa                    32653
Canis familiaris              32054
Gallus gallus                 31029
Xenopus tropicalis            25691
Drosophila melanogaster       23091
Danio rerio                   23017
Caenorhabditis elegans        21197
Dictyostelium discoideum      15461
Saccharomyces cerevisiae      11025
Schizosaccharomyces pombe     10961
Plasmodium falciparum          8003
Mycobacterium tuberculosis      296
Name: count, dtype: int64
df_pe_allpathway_split[['source_db_identifier','reactome_pe_stableid','species']].drop_duplicates().value_counts(subset='species')
species
Homo sapiens                  6277
Mus musculus                  4214
Bos taurus                    4157
Sus scrofa                    4149
Rattus norvegicus             4133
Canis familiaris              4090
Gallus gallus                 3897
Xenopus tropicalis            3426
Danio rerio                   3149
Drosophila melanogaster       2932
Caenorhabditis elegans        2767
Dictyostelium discoideum      2123
Saccharomyces cerevisiae      1489
Schizosaccharomyces pombe     1481
Plasmodium falciparum         1227
Mycobacterium tuberculosis      82
Name: count, dtype: int64

Above we see the row counts per species, followed by the number of unique ChEBI ID and physical entity pairs per species. For the default LipiNet layout, keeping each species duplicated would mean linking to multiple IDs when making interlayer links (e.g. Rhea to Reactome). This is probably undesirable: it would be better to keep single interlayer connections between nodes of the same ID, both to reduce redundancy and to minimise problems with downstream algorithms such as node traversal.
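
Two simple ways to avoid per-species duplication are to restrict the table to one species, or to ignore species and keep one row per ChEBI ID and physical entity pair. The sketch below shows both on toy data; which option (if either) parse_reactome_data() uses internally is not stated here, so treat this as an illustration only.

```python
import pandas as pd

# Toy rows modelled on the ChEBI2Reactome table above
toy = pd.DataFrame({
    "source_db_identifier": ["10033", "10033", "9948", "9948"],
    "reactome_pe_stableid": [
        "R-ALL-9014945", "R-ALL-9014945", "R-ALL-9660998", "R-ALL-9660998",
    ],
    "species": ["Bos taurus", "Homo sapiens", "Homo sapiens", "Sus scrofa"],
})

# Option 1: restrict to a single species before building edges
human_only = toy[toy["species"] == "Homo sapiens"]

# Option 2: ignore species and keep one row per (ChEBI ID, physical entity) pair
chebi_to_pe = toy[["source_db_identifier", "reactome_pe_stableid"]].drop_duplicates()

print(len(toy), len(human_only), len(chebi_to_pe))
```

Either way, each ChEBI ID then connects to a given physical entity through a single interlayer edge.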

temp_df1 = pd.DataFrame(df_pe_allpathway_split.value_counts(subset=['reactome_pe_stableid','reactome_pe_name'])).reset_index()
temp_df2 = pd.DataFrame(df_pe_allpathway_split.value_counts(subset=['reactome_pe_name'])).reset_index()

print(len(temp_df1))
print(len(temp_df2))

# len(set(temp_df2['reactome_pe_name'].to_list()) - set(temp_df1['reactome_pe_name'].to_list()))

# how many unique stable IDs per name
count_per_name = df_pe_allpathway_split.groupby('reactome_pe_name')['reactome_pe_stableid'].nunique()

# names that map to multiple stable IDs
ambiguous_names = count_per_name[count_per_name > 1]
print(len(ambiguous_names))
ambiguous_names #.head()
6001
5922
72
reactome_pe_name
(4Fe-4S)(2+) [cytosol]                              2
(GlcNAc)2 (Man)8b [endoplasmic reticulum lumen]     2
(GlcNAc)2 (Man)8c [endoplasmic reticulum lumen]     2
1-acyl LPA [cytosol]                                2
11cRAL [cytosol]                                    2
                                                   ..
tRNA(Asn) containing A-37 [mitochondrial matrix]    2
tobramycin [cytosol]                                2
tobramycin [periplasmic space]                      2
unknown NAT [cytosol]                               2
unknown kinase [cytosol]                            2
Name: reactome_pe_stableid, Length: 72, dtype: int64
temp_df1[temp_df1['reactome_pe_name']==ambiguous_names.index[0]]
reactome_pe_stableid reactome_pe_name count
810 R-ALL-937126 (4Fe-4S)(2+) [cytosol] 83
5502 R-ALL-937265 (4Fe-4S)(2+) [cytosol] 4

So each combined name and location (reactome_pe_name) is NOT quite unique to a stable ID: there are 5,922 unique names but 6,001 stable IDs, and 72 names map to more than one stable ID.

This could be because Reactome treats them as slightly different molecules in some contexts, such as a free ion, part of a complex, or bound to another molecule. Or it could be an inconsistency in the Reactome data.

Apart from these rare cases the two map closely. But how much would the set change if we split reactome_pe_name into name and location separately?

df_pe_allpathway_split.value_counts(subset=['pe_name'], dropna=False)
pe_name                                     
ATP                                             13750
ADP                                             12271
H2O                                             12250
H+                                               8892
Pi                                               7177
                                                ...  
Islet amyloid polypeptide fibril                    2
VWF multimer                                        2
Cleaved fibronectin matrix Ala(271)/Val(272)        2
Cleaved fibrillin-3                                 2
ribonucleotide                                      2
Name: count, Length: 4084, dtype: int64
df_pe_allpathway_split.value_counts(subset=['pe_location'], dropna=False)
pe_location                   
cytosol                           163062
extracellular region               64709
nucleoplasm                        37155
mitochondrial matrix               33962
endoplasmic reticulum lumen        15053
                                   ...  
endosome                               3
host cell cytosol                      3
plastid stroma                         2
nucleolus                              2
Golgi-associated vesicle lumen         2
Name: count, Length: 85, dtype: int64

So here we see that there are just over 4,000 unique physical entity names (4,084) across 85 cellular locations.

The most common physical entities are ATP, ADP and H2O, which we would expect.

The most common cellular locations for the physical entities are the cytosol, extracellular region, and nucleoplasm, which also makes sense.

Going back to our earlier problematic example, we can see that only a few rows use the second stable ID for the same name, and we can inspect both on the Reactome browser via their URLs.

df_pe_allpathway_split[df_pe_allpathway_split['reactome_pe_name']=='(4Fe-4S)(2+) [cytosol]']['reactome_pe_stableid'].value_counts()
reactome_pe_stableid
R-ALL-937126    83
R-ALL-937265     4
Name: count, dtype: int64
df_pe_allpathway_split[df_pe_allpathway_split['reactome_pe_stableid']=='R-ALL-937126']['url']
189142    https://reactome.org/PathwayBrowser/#/R-BTA-14...
189144    https://reactome.org/PathwayBrowser/#/R-BTA-19...
189145    https://reactome.org/PathwayBrowser/#/R-BTA-19...
189147    https://reactome.org/PathwayBrowser/#/R-BTA-94...
189148    https://reactome.org/PathwayBrowser/#/R-BTA-96...
                                ...                        
189279    https://reactome.org/PathwayBrowser/#/R-XTR-14...
189281    https://reactome.org/PathwayBrowser/#/R-XTR-19...
189282    https://reactome.org/PathwayBrowser/#/R-XTR-19...
189284    https://reactome.org/PathwayBrowser/#/R-XTR-94...
189285    https://reactome.org/PathwayBrowser/#/R-XTR-96...
Name: url, Length: 83, dtype: object
df_pe_allpathway_split[df_pe_allpathway_split['reactome_pe_stableid']=='R-ALL-937265']['url']
189248    https://reactome.org/PathwayBrowser/#/R-MTU-87...
189249    https://reactome.org/PathwayBrowser/#/R-MTU-93...
189250    https://reactome.org/PathwayBrowser/#/R-MTU-93...
189251    https://reactome.org/PathwayBrowser/#/R-MTU-93...
Name: url, dtype: object

Zooming out to the overall question of how to link the dfs together: the best approach is probably to use the physical entity stable ID to link to both the reactions and the pathways. The stable ID can then link to the combined name, and that name can split into the common name and the location as separate nodes.

That way, in later applications, users can search by name or location and get the degree of the name or location nodes associated with each query. It also avoids cluttering the maps while still making it easy to filter layers.
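
The linking scheme described above can be sketched as three small edge tables: combined name to stable ID, common name to combined name, and location to combined name. The make_edges helper and the layer names below are hypothetical placeholders, and the rule that an edge is interlayer whenever its two layers differ is an assumption made for this sketch.

```python
import pandas as pd


def make_edges(df, source_col, target_col, source_layer, target_layer):
    """Build a deduplicated edge table in a source/target layout (illustrative helper)."""
    edges = df[[source_col, target_col]].dropna().drop_duplicates()
    edges.columns = ["source_id", "target_id"]
    edges["source_layer"] = source_layer
    edges["target_layer"] = target_layer
    # Assumption: an edge is interlayer whenever its two layers differ
    edges["interlayer"] = source_layer != target_layer
    return edges.reset_index(drop=True)


# Toy rows modelled on df_pe_allpathway_split
toy = pd.DataFrame({
    "reactome_pe_stableid": ["R-ALL-9660998", "R-ALL-9660998", "R-ALL-9614135"],
    "reactome_pe_name": [
        "verapamil [cytosol]",
        "verapamil [cytosol]",
        "verapamil [extracellular region]",
    ],
    "pe_name": ["verapamil", "verapamil", "verapamil"],
    "pe_location": ["cytosol", "cytosol", "extracellular region"],
})

# Layer names below are placeholders, not LipiNet's actual layer names
nameloc_to_pe = make_edges(toy, "reactome_pe_name", "reactome_pe_stableid",
                           "pe_nameloc", "physical_entity")
name_to_nameloc = make_edges(toy, "pe_name", "reactome_pe_name",
                             "pe_name", "pe_nameloc")
loc_to_nameloc = make_edges(toy, "pe_location", "reactome_pe_name",
                            "pe_location", "pe_nameloc")

print(name_to_nameloc)
```

In this toy example the single name node verapamil ends up with degree two, one edge per compartment, which is the kind of query the name and location nodes are meant to support.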

Edge creation#

We now create the Reactome edge dfs, namely:

  • df_edges_ontpathway_to_ontpathway (and df_nodes_ontpathway, which isn’t for edges but should be created now)

  • df_edges_pathwayid_to_ontpathway

  • df_edges_chebi_to_physicalent

  • df_edges_phyiscalent_to_pathwayid

  • df_edges_phyiscalent_to_reactionid

  • df_edges_penameloc_to_phyiscalent

  • df_edges_pename_to_penameloc

  • df_edges_peloc_to_penameloc

First, we will create the ontology for the Reactome pathways.

df_edges_ontpathway_to_ontpathway = reactome_dfs['ReactomePathwaysRelation.tsv'].copy()
df_edges_ontpathway_to_ontpathway.columns = ['source_id', 'target_id']
df_edges_ontpathway_to_ontpathway['source_layer'] = 'reactome_pathway_ontology'
df_edges_ontpathway_to_ontpathway['target_layer'] = 'reactome_pathway_ontology'
df_edges_ontpathway_to_ontpathway['interlayer'] = False

df_edges_ontpathway_to_ontpathway
source_id target_id source_layer target_layer interlayer
0 R-BTA-109581 R-BTA-109606 reactome_pathway_ontology reactome_pathway_ontology False
1 R-BTA-109581 R-BTA-169911 reactome_pathway_ontology reactome_pathway_ontology False
2 R-BTA-109581 R-BTA-5357769 reactome_pathway_ontology reactome_pathway_ontology False
3 R-BTA-109581 R-BTA-75153 reactome_pathway_ontology reactome_pathway_ontology False
4 R-BTA-109582 R-BTA-140877 reactome_pathway_ontology reactome_pathway_ontology False
... ... ... ... ... ...
23254 R-XTR-9958790 R-XTR-427652 reactome_pathway_ontology reactome_pathway_ontology False
23255 R-XTR-9958790 R-XTR-433137 reactome_pathway_ontology reactome_pathway_ontology False
23256 R-XTR-9958863 R-XTR-352230 reactome_pathway_ontology reactome_pathway_ontology False
23257 R-XTR-9958863 R-XTR-428559 reactome_pathway_ontology reactome_pathway_ontology False
23258 R-XTR-9959399 R-XTR-427975 reactome_pathway_ontology reactome_pathway_ontology False

23259 rows × 5 columns
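A quick way to sanity-check a hierarchy table like this one is to count the children of each pathway and to list the pathways that never appear as a target. The sketch below does so on the first five rows shown above, and it assumes that the first column of ReactomePathwaysRelation.tsv (here source_id) holds the parent pathway and the second (target_id) the child.

```python
import pandas as pd

# The first five rows of the ontology edge table shown above
edges = pd.DataFrame({
    'source_id': ['R-BTA-109581', 'R-BTA-109581', 'R-BTA-109581',
                  'R-BTA-109581', 'R-BTA-109582'],
    'target_id': ['R-BTA-109606', 'R-BTA-169911', 'R-BTA-5357769',
                  'R-BTA-75153', 'R-BTA-140877'],
})

# Number of distinct children per (assumed) parent pathway
n_children = edges.groupby('source_id')['target_id'].nunique()

# Pathways that are never a target are top-level within this toy table
roots = sorted(set(edges['source_id']) - set(edges['target_id']))

print(n_children)
print(roots)
```

On the full table the same two lines would apply to df_edges_ontpathway_to_ontpathway directly.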

We also need to define the nodes for the pathway ontology before we can link the pathways to it.

df_nodes_ontpathway = reactome_dfs['ReactomePathways.tsv'].copy()
df_nodes_ontpathway.columns = ['node_id', 'name', 'species']

df_nodes_ontpathway = df_nodes_ontpathway.assign(layer='reactome_pathway_ontology').copy()
df_nodes_ontpathway = df_nodes_ontpathway[['layer','node_id','name','species']]
df_nodes_ontpathway
layer node_id name species
0 reactome_pathway_ontology R-BTA-73843 5-Phosphoribose 1-diphosphate biosynthesis Bos taurus
1 reactome_pathway_ontology R-BTA-1369062 ABC transporters in lipid homeostasis Bos taurus
2 reactome_pathway_ontology R-BTA-382556 ABC-family proteins mediated transport Bos taurus
3 reactome_pathway_ontology R-BTA-9033807 ABO blood group biosynthesis Bos taurus
4 reactome_pathway_ontology R-BTA-418592 ADP signalling through P2Y purinoceptor 1 Bos taurus
... ... ... ... ...
23152 reactome_pathway_ontology R-XTR-379724 tRNA Aminoacylation Xenopus tropicalis
23153 reactome_pathway_ontology R-XTR-6787450 tRNA modification in the mitochondrion Xenopus tropicalis
23154 reactome_pathway_ontology R-XTR-6782315 tRNA modification in the nucleus and cytosol Xenopus tropicalis
23155 reactome_pathway_ontology R-XTR-72306 tRNA processing Xenopus tropicalis
23156 reactome_pathway_ontology R-XTR-199992 trans-Golgi Network Vesicle Budding Xenopus tropicalis

23157 rows × 4 columns
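Once both the ontology nodes and the ontology edges exist, it can be useful to confirm that every edge endpoint has a matching node. The following is a minimal sketch of such a check on toy data, in which one child ID is deliberately left out of the node table; it is not a step that parse_reactome_data() is known to perform.

```python
import pandas as pd

# Toy node and edge tables; 'R-BTA-169911' is deliberately missing from the nodes
nodes = pd.DataFrame({
    'layer': 'reactome_pathway_ontology',
    'node_id': ['R-BTA-109581', 'R-BTA-109606'],
})
edges = pd.DataFrame({
    'source_id': ['R-BTA-109581', 'R-BTA-109581'],
    'target_id': ['R-BTA-109606', 'R-BTA-169911'],
})

# Collect every ID used as a source or a target, then compare with the node IDs
endpoints = pd.concat([edges['source_id'], edges['target_id']]).unique()
missing = sorted(set(endpoints) - set(nodes['node_id']))

print(f'edge endpoints without a node: {missing}')
```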

# EDGES: pathway_stid (from allpathway_split) -> pathway_stid (from pathway_ont)
# note - could also delete more of these columns if needed, or just keep them as edge attributes for more advanced filtering options?
# note - target_layer is set to 'reactome_ontology_pathways' here, whereas df_nodes_ontpathway uses the layer name 'reactome_pathway_ontology'

df_edges_pathwayid_to_ontpathway = pd.merge(
        df_nodes_ontpathway.drop(columns=['layer','species']).assign(target_layer='reactome_ontology_pathways'),
        df_pe_allpathway_split.assign(source_layer='reactome_pathway'),
        left_on='node_id', right_on='reactome_pathway_stableid',
        how='outer'
    ).rename(
        columns={'node_id':'target_id', 'reactome_pathway_stableid':'source_id'}
    ).drop(
        # these columns are dropped because they describe individual physical entities rather than the
        # pathway-to-ontology relationship; there are often multiple physicalent per pathway anyway,
        # so it would be bad practice to represent only one of them
        columns=['reactome_pe_stableid','reactome_pe_name','pe_name','pe_location','source_db_identifier']
    ).drop_duplicates()
df_edges_pathwayid_to_ontpathway
target_id name target_layer source_id url event_name_pathway_or_reaction evidence_code species source_layer
0 R-BTA-1059683 Interleukin-6 signaling reactome_ontology_pathways R-BTA-1059683 https://reactome.org/PathwayBrowser/#/R-BTA-10... Interleukin-6 signaling IEA Bos taurus reactome_pathway
2 R-BTA-109581 Apoptosis reactome_ontology_pathways R-BTA-109581 https://reactome.org/PathwayBrowser/#/R-BTA-10... Apoptosis IEA Bos taurus reactome_pathway
14 R-BTA-109582 Hemostasis reactome_ontology_pathways R-BTA-109582 https://reactome.org/PathwayBrowser/#/R-BTA-10... Hemostasis IEA Bos taurus reactome_pathway
126 R-BTA-109606 Intrinsic Pathway for Apoptosis reactome_ontology_pathways R-BTA-109606 https://reactome.org/PathwayBrowser/#/R-BTA-10... Intrinsic Pathway for Apoptosis IEA Bos taurus reactome_pathway
134 R-BTA-109703 PKB-mediated events reactome_ontology_pathways R-BTA-109703 https://reactome.org/PathwayBrowser/#/R-BTA-10... PKB-mediated events IEA Bos taurus reactome_pathway
... ... ... ... ... ... ... ... ... ...
393990 R-XTR-9958517 SLC-mediated bile acid transport reactome_ontology_pathways R-XTR-9958517 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated bile acid transport IEA Xenopus tropicalis reactome_pathway
393998 R-XTR-9958790 SLC-mediated transport of inorganic anions reactome_ontology_pathways R-XTR-9958790 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated transport of inorganic anions IEA Xenopus tropicalis reactome_pathway
394026 R-XTR-9958863 SLC-mediated transport of amino acids reactome_ontology_pathways R-XTR-9958863 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated transport of amino acids IEA Xenopus tropicalis reactome_pathway
394104 R-XTR-9959399 SLC-mediated transport of oligopeptides reactome_ontology_pathways R-XTR-9959399 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated transport of oligopeptides IEA Xenopus tropicalis reactome_pathway
394110 R-XTR-997272 Inhibition of voltage gated Ca2+ channels via... reactome_ontology_pathways R-XTR-997272 https://reactome.org/PathwayBrowser/#/R-XTR-99... Inhibition of voltage gated Ca2+ channels via... IEA Xenopus tropicalis reactome_pathway

23878 rows × 9 columns
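Because the merge above uses how='outer', rows that exist on only one side are kept, with missing values filling the columns from the other side. A hedged sketch of how to see this on toy data is given below, using the indicator option of pd.merge; the pathway IDs 'P1' to 'P4' are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical ontology nodes and pathway annotations
ont = pd.DataFrame({'node_id': ['P1', 'P2', 'P3']})
pathways = pd.DataFrame({'reactome_pathway_stableid': ['P1', 'P1', 'P4']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(
    ont, pathways,
    left_on='node_id', right_on='reactome_pathway_stableid',
    how='outer', indicator=True,
).drop_duplicates()

counts = merged['_merge'].value_counts()
print(merged)
print(counts)
```

Rows marked left_only have no source_id after the rename and rows marked right_only have no target_id, so they may need to be dropped (or how='inner' used instead) if only complete edges are wanted.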

df_edges_chebi_to_physicalent = df_pe_allpathway_split[['source_db_identifier','reactome_pe_stableid']].drop_duplicates()
df_edges_chebi_to_physicalent = df_edges_chebi_to_physicalent.rename(
    columns={'source_db_identifier':'source_id', 'reactome_pe_stableid':'target_id'}
    ).assign(source_layer='reactome_chebi', target_layer='reactome_physicalent')
df_edges_chebi_to_physicalent
source_id target_id source_layer target_layer
0 10033 R-ALL-9014945 reactome_chebi reactome_physicalent
32 10036 R-ALL-5696412 reactome_chebi reactome_physicalent
110 10055 R-ALL-9611688 reactome_chebi reactome_physicalent
164 10093 R-ALL-9648287 reactome_chebi reactome_physicalent
165 10093 R-ALL-3296452 reactome_chebi reactome_physicalent
... ... ... ... ...
390918 9884 R-ALL-9713792 reactome_chebi reactome_physicalent
390921 9927 R-ALL-9615299 reactome_chebi reactome_physicalent
390971 9943 R-ALL-9714401 reactome_chebi reactome_physicalent
391019 9948 R-ALL-9660998 reactome_chebi reactome_physicalent
391022 9948 R-ALL-9614135 reactome_chebi reactome_physicalent

6353 rows × 4 columns

# EDGES: PE_stid -> pathway_stid (BOTH from _allpathway_split_)

df_edges_phyiscalent_to_pathwayid = df_pe_allpathway_split.assign(
        source_layer='reactome_physicalent', 
        target_layer='reactome_pathway'
    ).rename(columns={
        'reactome_pe_stableid':'source_id',
        'reactome_pathway_stableid':'target_id'
    })
df_edges_phyiscalent_to_pathwayid['human'] = df_edges_phyiscalent_to_pathwayid['species']=='Homo sapiens'
df_edges_phyiscalent_to_pathwayid
source_db_identifier source_id reactome_pe_name target_id url event_name_pathway_or_reaction evidence_code species pe_name pe_location source_layer target_layer human
0 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-1430728 https://reactome.org/PathwayBrowser/#/R-BTA-14... Metabolism IEA Bos taurus warfarin cytosol reactome_physicalent reactome_pathway False
1 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-196854 https://reactome.org/PathwayBrowser/#/R-BTA-19... Metabolism of vitamins and cofactors IEA Bos taurus warfarin cytosol reactome_physicalent reactome_pathway False
2 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-6806664 https://reactome.org/PathwayBrowser/#/R-BTA-68... Metabolism of vitamin K IEA Bos taurus warfarin cytosol reactome_physicalent reactome_pathway False
3 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-6806667 https://reactome.org/PathwayBrowser/#/R-BTA-68... Metabolism of fat-soluble vitamins IEA Bos taurus warfarin cytosol reactome_physicalent reactome_pathway False
4 10033 R-ALL-9014945 warfarin [cytosol] R-CFA-1430728 https://reactome.org/PathwayBrowser/#/R-CFA-14... Metabolism IEA Canis familiaris warfarin cytosol reactome_physicalent reactome_pathway False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
391091 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-211859 https://reactome.org/PathwayBrowser/#/R-XTR-21... Biological oxidations IEA Xenopus tropicalis verapamil cytosol reactome_physicalent reactome_pathway False
391092 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-211945 https://reactome.org/PathwayBrowser/#/R-XTR-21... Phase I - Functionalization of compounds IEA Xenopus tropicalis verapamil cytosol reactome_physicalent reactome_pathway False
391093 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-397014 https://reactome.org/PathwayBrowser/#/R-XTR-39... Muscle contraction IEA Xenopus tropicalis verapamil extracellular region reactome_physicalent reactome_pathway False
391094 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-5576891 https://reactome.org/PathwayBrowser/#/R-XTR-55... Cardiac conduction IEA Xenopus tropicalis verapamil extracellular region reactome_physicalent reactome_pathway False
391095 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-5576893 https://reactome.org/PathwayBrowser/#/R-XTR-55... Phase 2 - plateau phase IEA Xenopus tropicalis verapamil extracellular region reactome_physicalent reactome_pathway False

391096 rows × 13 columns

# EDGES: PE_stid -> reaction_stid (BOTH from _allreactions_)

df_edges_phyiscalent_to_reactionid = reactome_dfs['ChEBI2Reactome_PE_Reactions.tsv'].assign(
        source_layer='reactome_physicalent', 
        target_layer='reactome_reactions'
    ).rename(columns={
        'reactome_pe_stableid':'source_id',
        'reactome_pathway_stableid':'target_id'
    })
df_edges_phyiscalent_to_reactionid['human'] = df_edges_phyiscalent_to_reactionid['species']=='Homo sapiens'
df_edges_phyiscalent_to_reactionid
source_db_identifier source_id reactome_pe_name target_id url event_name_pathway_or_reaction evidence_code species source_layer target_layer human
0 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-159790 https://reactome.org/PathwayBrowser/#/R-BTA-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Bos taurus reactome_physicalent reactome_reactions False
1 10033 R-ALL-9014945 warfarin [cytosol] R-BTA-9026967 https://reactome.org/PathwayBrowser/#/R-BTA-90... VKORC1 inhibitors binds VKORC1 dimer IEA Bos taurus reactome_physicalent reactome_reactions False
2 10033 R-ALL-9014945 warfarin [cytosol] R-CFA-159790 https://reactome.org/PathwayBrowser/#/R-CFA-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Canis familiaris reactome_physicalent reactome_reactions False
3 10033 R-ALL-9014945 warfarin [cytosol] R-CFA-9026967 https://reactome.org/PathwayBrowser/#/R-CFA-90... VKORC1 inhibitors binds VKORC1 dimer IEA Canis familiaris reactome_physicalent reactome_reactions False
4 10033 R-ALL-9014945 warfarin [cytosol] R-DME-159790 https://reactome.org/PathwayBrowser/#/R-DME-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Drosophila melanogaster reactome_physicalent reactome_reactions False
... ... ... ... ... ... ... ... ... ... ... ...
234736 9948 R-ALL-9614135 verapamil [extracellular region] R-SSC-9614031 https://reactome.org/PathwayBrowser/#/R-SSC-96... Class IV antihypertensives bind LTCC multimer IEA Sus scrofa reactome_physicalent reactome_reactions False
234737 9948 R-ALL-9660998 verapamil [cytosol] R-SSC-9659680 https://reactome.org/PathwayBrowser/#/R-SSC-96... ABCB1 transports xenobiotics out of the cell IEA Sus scrofa reactome_physicalent reactome_reactions False
234738 9948 R-ALL-9660998 verapamil [cytosol] R-SSC-9678766 https://reactome.org/PathwayBrowser/#/R-SSC-96... CYP3A4 binds CYP3A4 inhibitors IEA Sus scrofa reactome_physicalent reactome_reactions False
234739 9948 R-ALL-9614135 verapamil [extracellular region] R-XTR-9614031 https://reactome.org/PathwayBrowser/#/R-XTR-96... Class IV antihypertensives bind LTCC multimer IEA Xenopus tropicalis reactome_physicalent reactome_reactions False
234740 9948 R-ALL-9660998 verapamil [cytosol] R-XTR-9678766 https://reactome.org/PathwayBrowser/#/R-XTR-96... CYP3A4 binds CYP3A4 inhibitors IEA Xenopus tropicalis reactome_physicalent reactome_reactions False

234741 rows × 11 columns

# EDGES: PE_stid (from allpathway_split) <- PE_name_location (from allpathway_split)

df_edges_penameloc_to_phyiscalent = df_pe_allpathway_split.assign(
        source_layer='reactome_physicalent_nameloc', 
        target_layer='reactome_physicalent'
    ).rename(columns={
        'reactome_pe_name':'source_id',
        'reactome_pe_stableid':'target_id'
    })[
        ['source_layer','source_id','target_layer','target_id','pe_name','pe_location']
    ].drop_duplicates()
df_edges_penameloc_to_phyiscalent
source_layer source_id target_layer target_id pe_name pe_location
0 reactome_physicalent_nameloc warfarin [cytosol] reactome_physicalent R-ALL-9014945 warfarin cytosol
32 reactome_physicalent_nameloc arachidyl ester [endoplasmic reticulum lumen] reactome_physicalent R-ALL-5696412 arachidyl ester endoplasmic reticulum lumen
110 reactome_physicalent_nameloc xamoterol [extracellular region] reactome_physicalent R-ALL-9611688 xamoterol extracellular region
164 reactome_physicalent_nameloc yohimbine [extracellular region] reactome_physicalent R-ALL-9648287 yohimbine extracellular region
165 reactome_physicalent_nameloc Yohimbine [extracellular region] reactome_physicalent R-ALL-3296452 Yohimbine extracellular region
... ... ... ... ... ... ...
390837 reactome_physicalent_nameloc troglitazone [nucleoplasm] reactome_physicalent R-ALL-9732670 troglitazone nucleoplasm
390864 reactome_physicalent_nameloc tropicamide [extracellular region] reactome_physicalent R-ALL-9704271 tropicamide extracellular region
390918 reactome_physicalent_nameloc uramustine [nucleoplasm] reactome_physicalent R-ALL-9713792 uramustine nucleoplasm
390921 reactome_physicalent_nameloc valsartan [extracellular region] reactome_physicalent R-ALL-9615299 valsartan extracellular region
390971 reactome_physicalent_nameloc venlafaxine [extracellular region] reactome_physicalent R-ALL-9714401 venlafaxine extracellular region

6001 rows × 6 columns

# EDGES: PE_name (from allpathway_split) -> PE_name_location (from allpathway_split)

df_edges_pename_to_penameloc = df_pe_allpathway_split.assign(
        source_layer='reactome_physicalent_name', 
        target_layer='reactome_physicalent_nameloc'
    ).rename(columns={
        'pe_name':'source_id',
        'reactome_pe_name':'target_id'
    })[
        ['source_layer','source_id','target_layer','target_id']
    ].drop_duplicates()
df_edges_pename_to_penameloc
source_layer source_id target_layer target_id
0 reactome_physicalent_name warfarin reactome_physicalent_nameloc warfarin [cytosol]
32 reactome_physicalent_name arachidyl ester reactome_physicalent_nameloc arachidyl ester [endoplasmic reticulum lumen]
110 reactome_physicalent_name xamoterol reactome_physicalent_nameloc xamoterol [extracellular region]
164 reactome_physicalent_name yohimbine reactome_physicalent_nameloc yohimbine [extracellular region]
165 reactome_physicalent_name Yohimbine reactome_physicalent_nameloc Yohimbine [extracellular region]
... ... ... ... ...
390837 reactome_physicalent_name troglitazone reactome_physicalent_nameloc troglitazone [nucleoplasm]
390864 reactome_physicalent_name tropicamide reactome_physicalent_nameloc tropicamide [extracellular region]
390918 reactome_physicalent_name uramustine reactome_physicalent_nameloc uramustine [nucleoplasm]
390921 reactome_physicalent_name valsartan reactome_physicalent_nameloc valsartan [extracellular region]
390971 reactome_physicalent_name venlafaxine reactome_physicalent_nameloc venlafaxine [extracellular region]

5922 rows × 4 columns

# EDGES: PE_location (from allpathway_split) -> PE_name_location (from allpathway_split)

df_edges_peloc_to_penameloc = df_pe_allpathway_split.assign(
        source_layer='reactome_physicalent_loc', 
        target_layer='reactome_physicalent_nameloc'
    ).rename(columns={
        'pe_location':'source_id',
        'reactome_pe_name':'target_id'
    })[
        ['source_layer','source_id','target_layer','target_id']
    ].drop_duplicates()
df_edges_peloc_to_penameloc
source_layer source_id target_layer target_id
0 reactome_physicalent_loc cytosol reactome_physicalent_nameloc warfarin [cytosol]
32 reactome_physicalent_loc endoplasmic reticulum lumen reactome_physicalent_nameloc arachidyl ester [endoplasmic reticulum lumen]
110 reactome_physicalent_loc extracellular region reactome_physicalent_nameloc xamoterol [extracellular region]
164 reactome_physicalent_loc extracellular region reactome_physicalent_nameloc yohimbine [extracellular region]
165 reactome_physicalent_loc extracellular region reactome_physicalent_nameloc Yohimbine [extracellular region]
... ... ... ... ...
390837 reactome_physicalent_loc nucleoplasm reactome_physicalent_nameloc troglitazone [nucleoplasm]
390864 reactome_physicalent_loc extracellular region reactome_physicalent_nameloc tropicamide [extracellular region]
390918 reactome_physicalent_loc nucleoplasm reactome_physicalent_nameloc uramustine [nucleoplasm]
390921 reactome_physicalent_loc extracellular region reactome_physicalent_nameloc valsartan [extracellular region]
390971 reactome_physicalent_loc extracellular region reactome_physicalent_nameloc venlafaxine [extracellular region]

5922 rows × 4 columns
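Most of the edge tables above follow the same pattern: tag the two layers, rename two columns to source_id and target_id, keep a fixed set of columns and drop duplicates. The sketch below wraps that pattern in a helper; build_edge_table is a hypothetical name introduced here for illustration and is not part of LipiNet.

```python
import pandas as pd

def build_edge_table(df, source_col, target_col, source_layer, target_layer, keep=()):
    """Hypothetical helper: build a deduplicated edge table from two columns of df."""
    cols = ['source_layer', 'source_id', 'target_layer', 'target_id', *keep]
    return (
        df.assign(source_layer=source_layer, target_layer=target_layer)
          .rename(columns={source_col: 'source_id', target_col: 'target_id'})[cols]
          .drop_duplicates()
    )

# Toy stand-in for df_pe_allpathway_split (the first row is repeated on purpose)
df_toy = pd.DataFrame({
    'reactome_pe_name': ['warfarin [cytosol]', 'warfarin [cytosol]',
                         'verapamil [cytosol]', 'verapamil [extracellular region]'],
    'pe_name': ['warfarin', 'warfarin', 'verapamil', 'verapamil'],
    'pe_location': ['cytosol', 'cytosol', 'cytosol', 'extracellular region'],
})

edges_name = build_edge_table(df_toy, 'pe_name', 'reactome_pe_name',
                              'reactome_physicalent_name', 'reactome_physicalent_nameloc')
edges_loc = build_edge_table(df_toy, 'pe_location', 'reactome_pe_name',
                             'reactome_physicalent_loc', 'reactome_physicalent_nameloc')
print(edges_name)
print(edges_loc)
```

The keep argument allows extra attribute columns to be carried along, as is done with pe_name and pe_location in df_edges_penameloc_to_phyiscalent.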

Node creation#

We now create the following node dfs, using the edge dfs:

  • df_nodes_ontpathway (already created prior)

  • df_nodes_pathwayid (using df_edges_pathwayid_to_ontpathway)

  • df_nodes_chebi (using df_edges_chebi_to_physicalent)

  • df_nodes_physicalent (using df_edges_phyiscalent_to_pathwayid)

  • df_nodes_reactionid (using df_edges_phyiscalent_to_reactionid)

  • df_nodes_penameloc (using df_edges_penameloc_to_phyiscalent)

  • df_nodes_pename (using df_edges_pename_to_penameloc)

  • df_nodes_peloc (using df_edges_peloc_to_penameloc)

Note that df_nodes_ontpathway was already created earlier, alongside the ontology edges.

# create nodes chebi df 
# from df_edges_chebi_to_physicalent
df_nodes_chebi = df_edges_chebi_to_physicalent.drop(
    columns=['target_layer','target_id']
    ).rename(
    columns={'source_id':'node_id','source_layer':'layer'}).drop_duplicates()
df_nodes_chebi
node_id layer
0 10033 reactome_chebi
32 10036 reactome_chebi
110 10055 reactome_chebi
164 10093 reactome_chebi
305 10100 reactome_chebi
... ... ...
390864 9757 reactome_chebi
390918 9884 reactome_chebi
390921 9927 reactome_chebi
390971 9943 reactome_chebi
391019 9948 reactome_chebi

3068 rows × 2 columns

# create nodes pathwayid df 
# from df_edges_pathwayid_to_ontpathway
df_nodes_pathwayid = df_edges_pathwayid_to_ontpathway.drop(
    columns=['target_layer','target_id']
    ).rename(
    columns={'source_id':'node_id','source_layer':'layer'}).drop_duplicates()
df_nodes_pathwayid['human'] = df_nodes_pathwayid['species']=='Homo sapiens'
df_nodes_pathwayid
name node_id url event_name_pathway_or_reaction evidence_code species layer human
0 Interleukin-6 signaling R-BTA-1059683 https://reactome.org/PathwayBrowser/#/R-BTA-10... Interleukin-6 signaling IEA Bos taurus reactome_pathway False
2 Apoptosis R-BTA-109581 https://reactome.org/PathwayBrowser/#/R-BTA-10... Apoptosis IEA Bos taurus reactome_pathway False
14 Hemostasis R-BTA-109582 https://reactome.org/PathwayBrowser/#/R-BTA-10... Hemostasis IEA Bos taurus reactome_pathway False
126 Intrinsic Pathway for Apoptosis R-BTA-109606 https://reactome.org/PathwayBrowser/#/R-BTA-10... Intrinsic Pathway for Apoptosis IEA Bos taurus reactome_pathway False
134 PKB-mediated events R-BTA-109703 https://reactome.org/PathwayBrowser/#/R-BTA-10... PKB-mediated events IEA Bos taurus reactome_pathway False
... ... ... ... ... ... ... ... ...
393990 SLC-mediated bile acid transport R-XTR-9958517 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated bile acid transport IEA Xenopus tropicalis reactome_pathway False
393998 SLC-mediated transport of inorganic anions R-XTR-9958790 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated transport of inorganic anions IEA Xenopus tropicalis reactome_pathway False
394026 SLC-mediated transport of amino acids R-XTR-9958863 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated transport of amino acids IEA Xenopus tropicalis reactome_pathway False
394104 SLC-mediated transport of oligopeptides R-XTR-9959399 https://reactome.org/PathwayBrowser/#/R-XTR-99... SLC-mediated transport of oligopeptides IEA Xenopus tropicalis reactome_pathway False
394110 Inhibition of voltage gated Ca2+ channels via... R-XTR-997272 https://reactome.org/PathwayBrowser/#/R-XTR-99... Inhibition of voltage gated Ca2+ channels via... IEA Xenopus tropicalis reactome_pathway False

21606 rows × 8 columns

# create nodes physicalent df 
# from df_edges_phyiscalent_to_pathwayid
df_nodes_physicalent = df_edges_phyiscalent_to_pathwayid.drop(
    columns=['target_layer','target_id']
    ).rename(
    columns={'source_id':'node_id','source_layer':'layer'}
    ).drop(columns=['event_name_pathway_or_reaction','url','evidence_code']).drop_duplicates() 
# NOTE: we keep species for now, but might want to revisit later if the duplicates are a problem for creation or for searching uniquely...
df_nodes_physicalent['human'] = df_nodes_physicalent['species']=='Homo sapiens'
df_nodes_physicalent
source_db_identifier node_id reactome_pe_name species pe_name pe_location layer human
0 10033 R-ALL-9014945 warfarin [cytosol] Bos taurus warfarin cytosol reactome_physicalent False
4 10033 R-ALL-9014945 warfarin [cytosol] Canis familiaris warfarin cytosol reactome_physicalent False
8 10033 R-ALL-9014945 warfarin [cytosol] Drosophila melanogaster warfarin cytosol reactome_physicalent False
12 10033 R-ALL-9014945 warfarin [cytosol] Homo sapiens warfarin cytosol reactome_physicalent True
16 10033 R-ALL-9014945 warfarin [cytosol] Mus musculus warfarin cytosol reactome_physicalent False
... ... ... ... ... ... ... ... ...
391080 9948 R-ALL-9660998 verapamil [cytosol] Schizosaccharomyces pombe verapamil cytosol reactome_physicalent False
391082 9948 R-ALL-9660998 verapamil [cytosol] Sus scrofa verapamil cytosol reactome_physicalent False
391085 9948 R-ALL-9614135 verapamil [extracellular region] Sus scrofa verapamil extracellular region reactome_physicalent False
391090 9948 R-ALL-9660998 verapamil [cytosol] Xenopus tropicalis verapamil cytosol reactome_physicalent False
391093 9948 R-ALL-9614135 verapamil [extracellular region] Xenopus tropicalis verapamil extracellular region reactome_physicalent False

49593 rows × 8 columns

# create nodes reactionid df 
# from df_edges_phyiscalent_to_reactionid
df_nodes_reactionid = df_edges_phyiscalent_to_reactionid.drop(
    columns=['source_layer','source_id']
    ).rename(
    columns={'target_id':'node_id','target_layer':'layer'}
    ).drop(columns=['source_db_identifier','reactome_pe_name']).drop_duplicates() 
df_nodes_reactionid
node_id url event_name_pathway_or_reaction evidence_code species layer human
0 R-BTA-159790 https://reactome.org/PathwayBrowser/#/R-BTA-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Bos taurus reactome_reactions False
1 R-BTA-9026967 https://reactome.org/PathwayBrowser/#/R-BTA-90... VKORC1 inhibitors binds VKORC1 dimer IEA Bos taurus reactome_reactions False
2 R-CFA-159790 https://reactome.org/PathwayBrowser/#/R-CFA-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Canis familiaris reactome_reactions False
3 R-CFA-9026967 https://reactome.org/PathwayBrowser/#/R-CFA-90... VKORC1 inhibitors binds VKORC1 dimer IEA Canis familiaris reactome_reactions False
4 R-DME-159790 https://reactome.org/PathwayBrowser/#/R-DME-15... VKORC1 reduces vitamin K epoxide to MK4 (vitam... IEA Drosophila melanogaster reactome_reactions False
... ... ... ... ... ... ... ...
233705 R-RNO-9679044 https://reactome.org/PathwayBrowser/#/R-RNO-96... FKBP1A binds sirolimus IEA Rattus norvegicus reactome_reactions False
233706 R-SCE-9679044 https://reactome.org/PathwayBrowser/#/R-SCE-96... FKBP1A binds sirolimus IEA Saccharomyces cerevisiae reactome_reactions False
233707 R-SPO-9679044 https://reactome.org/PathwayBrowser/#/R-SPO-96... FKBP1A binds sirolimus IEA Schizosaccharomyces pombe reactome_reactions False
233710 R-SSC-9679044 https://reactome.org/PathwayBrowser/#/R-SSC-96... FKBP1A binds sirolimus IEA Sus scrofa reactome_reactions False
233711 R-XTR-9679044 https://reactome.org/PathwayBrowser/#/R-XTR-96... FKBP1A binds sirolimus IEA Xenopus tropicalis reactome_reactions False

61847 rows × 7 columns

# create nodes penameloc df 
# from df_edges_penameloc_to_phyiscalent
df_nodes_penameloc = df_edges_penameloc_to_phyiscalent.drop(
    columns=['target_layer','target_id']
    ).rename(
    columns={'source_id':'node_id','source_layer':'layer'}
    ).drop_duplicates()
df_nodes_penameloc
layer node_id pe_name pe_location
0 reactome_physicalent_nameloc warfarin [cytosol] warfarin cytosol
32 reactome_physicalent_nameloc arachidyl ester [endoplasmic reticulum lumen] arachidyl ester endoplasmic reticulum lumen
110 reactome_physicalent_nameloc xamoterol [extracellular region] xamoterol extracellular region
164 reactome_physicalent_nameloc yohimbine [extracellular region] yohimbine extracellular region
165 reactome_physicalent_nameloc Yohimbine [extracellular region] Yohimbine extracellular region
... ... ... ... ...
390837 reactome_physicalent_nameloc troglitazone [nucleoplasm] troglitazone nucleoplasm
390864 reactome_physicalent_nameloc tropicamide [extracellular region] tropicamide extracellular region
390918 reactome_physicalent_nameloc uramustine [nucleoplasm] uramustine nucleoplasm
390921 reactome_physicalent_nameloc valsartan [extracellular region] valsartan extracellular region
390971 reactome_physicalent_nameloc venlafaxine [extracellular region] venlafaxine extracellular region

5922 rows × 4 columns

# create nodes pename df 
# from df_edges_pename_to_penameloc
df_nodes_pename = df_edges_pename_to_penameloc.drop(
    columns=['target_layer','target_id']
    ).rename(
    columns={'source_id':'node_id','source_layer':'layer'}
    ).drop_duplicates()
df_nodes_pename
layer node_id
0 reactome_physicalent_name warfarin
32 reactome_physicalent_name arachidyl ester
110 reactome_physicalent_name xamoterol
164 reactome_physicalent_name yohimbine
165 reactome_physicalent_name Yohimbine
... ... ...
390837 reactome_physicalent_name troglitazone
390864 reactome_physicalent_name tropicamide
390918 reactome_physicalent_name uramustine
390921 reactome_physicalent_name valsartan
390971 reactome_physicalent_name venlafaxine

4084 rows × 2 columns

# create nodes peloc df 
# from df_edges_peloc_to_penameloc
df_nodes_peloc = df_edges_peloc_to_penameloc.drop(
    columns=['target_layer','target_id']
    ).rename(
    columns={'source_id':'node_id','source_layer':'layer'}
    ).drop_duplicates()
df_nodes_peloc
layer node_id
0 reactome_physicalent_loc cytosol
32 reactome_physicalent_loc endoplasmic reticulum lumen
110 reactome_physicalent_loc extracellular region
706 reactome_physicalent_loc endocytic vesicle lumen
1083 reactome_physicalent_loc nucleoplasm
... ... ...
270753 reactome_physicalent_loc autophagosome membrane
272323 reactome_physicalent_loc lamellar body
272330 reactome_physicalent_loc lamellar body membrane
327144 reactome_physicalent_loc clathrin-sculpted gamma-aminobutyric acid tran...
336088 reactome_physicalent_loc endosome

85 rows × 2 columns
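The node tables above are all derived in the same way: drop the columns for the opposite end of the edge, rename the remaining id and layer columns, and drop duplicates. A minimal sketch of that pattern as a reusable function follows; nodes_from_edges is a hypothetical name used here for illustration and its behaviour is not claimed to match any LipiNet utility.

```python
import pandas as pd

def nodes_from_edges(df_edges, side='source'):
    """Hypothetical helper: derive a node table from one end of an edge table."""
    if side not in ('source', 'target'):
        raise ValueError("side must be 'source' or 'target'")
    other = 'target' if side == 'source' else 'source'
    return (
        df_edges.drop(columns=[f'{other}_layer', f'{other}_id'])
                .rename(columns={f'{side}_id': 'node_id', f'{side}_layer': 'layer'})
                .drop_duplicates()
    )

# Toy edges taken from the ChEBI -> physical entity table shown earlier
edges = pd.DataFrame({
    'source_id': ['10033', '10093', '10093'],
    'target_id': ['R-ALL-9014945', 'R-ALL-9648287', 'R-ALL-3296452'],
    'source_layer': 'reactome_chebi',
    'target_layer': 'reactome_physicalent',
})

nodes_chebi = nodes_from_edges(edges, side='source')
nodes_pe = nodes_from_edges(edges, side='target')
print(nodes_chebi)
print(nodes_pe)
```

Any extra attribute columns on the edge table are carried over to the node table, which is why some of the cells above drop further columns before deduplicating.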

Concatenation#

df_edges = pd.concat([
    df_edges_chebi_to_physicalent, 
    df_edges_phyiscalent_to_reactionid,
    df_edges_phyiscalent_to_pathwayid,
    df_edges_pathwayid_to_ontpathway,
    df_edges_penameloc_to_phyiscalent,
    df_edges_pename_to_penameloc,
    df_edges_peloc_to_penameloc,
    df_edges_ontpathway_to_ontpathway,
    ])
df_edges
source_id target_id source_layer target_layer source_db_identifier reactome_pe_name url event_name_pathway_or_reaction evidence_code species human pe_name pe_location name interlayer
0 10033 R-ALL-9014945 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 10036 R-ALL-5696412 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
110 10055 R-ALL-9611688 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
164 10093 R-ALL-9648287 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
165 10093 R-ALL-3296452 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23254 R-XTR-9958790 R-XTR-427652 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23255 R-XTR-9958790 R-XTR-433137 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23256 R-XTR-9958863 R-XTR-352230 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23257 R-XTR-9958863 R-XTR-428559 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23258 R-XTR-9959399 R-XTR-427975 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False

697172 rows × 15 columns

df_nodes = pd.concat([
    df_nodes_chebi, 
    df_nodes_physicalent, 
    df_nodes_reactionid,
    df_nodes_pathwayid,
    df_nodes_ontpathway,
    df_nodes_penameloc,
    df_nodes_pename,
    df_nodes_peloc,
    ])
df_nodes
node_id layer source_db_identifier reactome_pe_name species pe_name pe_location human url event_name_pathway_or_reaction evidence_code name
0 10033 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 10036 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
110 10055 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
164 10093 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
305 10100 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
270753 autophagosome membrane reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272323 lamellar body reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272330 lamellar body membrane reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
327144 clathrin-sculpted gamma-aminobutyric acid tran... reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
336088 endosome reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

169362 rows × 12 columns
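After concatenation, the row labels of the individual tables are carried over (note the repeated index values such as 0 in the output above), and columns missing from a given table are filled with NaN. The sketch below, on toy data, shows two optional checks that could be run at this point: a per-layer row count and a count of repeated (layer, node_id) pairs. Using ignore_index=True is a choice made for this sketch and is not what the cell above does.

```python
import pandas as pd

# Toy node tables with different columns and overlapping row labels
df_a = pd.DataFrame({'node_id': ['10033', '10036'],
                     'layer': 'reactome_chebi'}, index=[0, 32])
df_b = pd.DataFrame({'node_id': ['R-ALL-9014945', 'R-ALL-9014945'],
                     'layer': 'reactome_physicalent',
                     'species': ['Bos taurus', 'Homo sapiens']}, index=[0, 12])

# ignore_index=True renumbers the rows 0..n-1 instead of keeping the old labels
combined = pd.concat([df_a, df_b], ignore_index=True)

# Rows per layer, and how many rows repeat an earlier (layer, node_id) pair
per_layer = combined['layer'].value_counts()
n_repeated = int(combined.duplicated(subset=['layer', 'node_id']).sum())

print(per_layer)
print(f'repeated (layer, node_id) pairs: {n_repeated}')
```

In this toy example the physical entity appears once per species, which mirrors the duplicates noted in the comment on df_nodes_physicalent.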

Filtering to human reaction/pathway nodes only#

Note that this still keeps rows that have no defined species, i.e. where human is NaN (such as ChEBI IDs and the physical entity name and location nodes).

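def filter_reactome(df, human_only=True):
    """Optionally drop rows flagged as non-human (human == False); rows where human is NaN are kept."""
    if human_only:
        # .ne(False) is True for both True and NaN, so rows without a species flag survive
        filtered = df[df['human'].ne(False)]
        print(f'Dropped rows from non-human species (dropped: {len(df)-len(filtered)}; remaining: {len(filtered)})')
        return filtered
    return df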
df_nodes_human = filter_reactome(df_nodes)
df_nodes_human
Dropped rows from non-human species (dropped: 114171; remaining: 55191)
node_id layer source_db_identifier reactome_pe_name species pe_name pe_location human url event_name_pathway_or_reaction evidence_code name
0 10033 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 10036 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
110 10055 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
164 10093 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
305 10100 reactome_chebi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
270753 autophagosome membrane reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272323 lamellar body reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272330 lamellar body membrane reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
327144 clathrin-sculpted gamma-aminobutyric acid tran... reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
336088 endosome reactome_physicalent_loc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

55191 rows × 12 columns

df_edges_human = filter_reactome(df_edges)
df_edges_human
Dropped rows from non-human species (dropped: 532405; remaining: 164767)
source_id target_id source_layer target_layer source_db_identifier reactome_pe_name url event_name_pathway_or_reaction evidence_code species human pe_name pe_location name interlayer
0 10033 R-ALL-9014945 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 10036 R-ALL-5696412 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
110 10055 R-ALL-9611688 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
164 10093 R-ALL-9648287 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
165 10093 R-ALL-3296452 reactome_chebi reactome_physicalent NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23254 R-XTR-9958790 R-XTR-427652 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23255 R-XTR-9958790 R-XTR-433137 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23256 R-XTR-9958863 R-XTR-352230 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23257 R-XTR-9958863 R-XTR-428559 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False
23258 R-XTR-9959399 R-XTR-427975 reactome_pathway_ontology reactome_pathway_ontology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False

164767 rows × 15 columns

Ensuring equivalency#

To confirm that the manual steps above and the parse_reactome_data() function produce the same result, we briefly compare their outputs in this section.

reactome_results = parse_reactome_data(verbose=True, use_cache=True, force_download=False)

df_reactome_nodes = reactome_results['df_nodes']
df_reactome_edges = reactome_results['df_edges']
⏬ loading Reactome raw tables …
Fetching ChEBI2Reactome_PE_All_Levels.tsv
Override requested; re-downloading ChEBI2Reactome_PE_All_Levels.tsv from https://reactome.org/download/current/ChEBI2Reactome_PE_All_Levels.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_All_Levels.tsv.
Fetching ChEBI2Reactome_PE_Reactions.tsv
Override requested; re-downloading ChEBI2Reactome_PE_Reactions.tsv from https://reactome.org/download/current/ChEBI2Reactome_PE_Reactions.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ChEBI2Reactome_PE_Reactions.tsv.
Fetching ReactomePathways.tsv
Override requested; re-downloading ReactomePathways.tsv from https://reactome.org/download/current/ReactomePathways.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathways.tsv.
Fetching ReactomePathwaysRelation.tsv
Override requested; re-downloading ReactomePathwaysRelation.tsv from https://reactome.org/download/current/ReactomePathwaysRelation.txt even though it exists locally.
Data downloaded and saved to /Users/macsbook/Code/lipinet/lipinet/.data/downloaded/ReactomePathwaysRelation.tsv.
Returning ['ChEBI2Reactome_PE_All_Levels.tsv', 'ChEBI2Reactome_PE_Reactions.tsv', 'ReactomePathways.tsv', 'ReactomePathwaysRelation.tsv'] as a dict of dfs
[reactome] edges: (164767, 15)
[reactome] nodes: (55191, 12)
print(df_reactome_edges.shape)
print(df_reactome_nodes.shape)
print(df_reactome_nodes.columns)
(164767, 15)
(55191, 12)
Index(['node_id', 'layer', 'source_db_identifier', 'reactome_pe_name',
       'species', 'pe_name', 'pe_location', 'human', 'url',
       'event_name_pathway_or_reaction', 'evidence_code', 'name'],
      dtype='object')
print(df_edges_human.shape)
print(df_nodes_human.shape)
print(df_nodes_human.columns)
(164767, 15)
(55191, 12)
Index(['node_id', 'layer', 'source_db_identifier', 'reactome_pe_name',
       'species', 'pe_name', 'pe_location', 'human', 'url',
       'event_name_pathway_or_reaction', 'evidence_code', 'name'],
      dtype='object')
def nodes_topology(df):
    return df[['layer','node_id']].drop_duplicates()

# auto vs manual topology counts
auto_top = nodes_topology(df_reactome_nodes)
man_top  = nodes_topology(df_nodes_human)

print("AUTO unique nodes:", len(auto_top))
print("MAN  unique nodes:", len(man_top))

# by-layer comparison
by_layer_auto = auto_top.groupby('layer')['node_id'].nunique().sort_values(ascending=False)
by_layer_man  = man_top.groupby('layer')['node_id'].nunique().sort_values(ascending=False)

print("\nBy-layer differences (AUTO - MAN):")
diff = (by_layer_auto.to_frame('auto')
          .join(by_layer_man.to_frame('man'), how='outer').fillna(0).astype(int))
diff['delta'] = diff['auto'] - diff['man']
diff.sort_values('delta', ascending=False)

# where attributes are causing duplication (same id, multiple rows)
dup_counts = (df_reactome_nodes
              .groupby(['layer','node_id']).size()
              .reset_index(name='rows_per_id')
              .query('rows_per_id > 1')
              .sort_values('rows_per_id', ascending=False))
dup_counts.head(20)
AUTO unique nodes: 54174
MAN  unique nodes: 54174

By-layer differences (AUTO - MAN):
layer node_id rows_per_id
33795 reactome_physicalent R-COV-9694285 4
33800 reactome_physicalent R-COV-9694354 4
34372 reactome_physicalent R-HSA-6790604 4
33767 reactome_physicalent R-COV-9685518 4
33785 reactome_physicalent R-COV-9685914 4
33786 reactome_physicalent R-COV-9685915 4
33787 reactome_physicalent R-COV-9685916 4
33788 reactome_physicalent R-COV-9685917 4
33789 reactome_physicalent R-COV-9685918 4
33790 reactome_physicalent R-COV-9685920 4
33792 reactome_physicalent R-COV-9685922 4
33822 reactome_physicalent R-COV-9697431 4
33797 reactome_physicalent R-COV-9694325 4
33791 reactome_physicalent R-COV-9685921 4
33802 reactome_physicalent R-COV-9694393 4
33803 reactome_physicalent R-COV-9694456 4
33813 reactome_physicalent R-COV-9694622 4
33805 reactome_physicalent R-COV-9694496 4
33806 reactome_physicalent R-COV-9694503 4
33807 reactome_physicalent R-COV-9694513 4
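
The table above shows that some node IDs occupy several rows, but not which attributes differ between those rows. One possible way to find out is to count, per column, how many `(layer, node_id)` groups hold more than one distinct value. The helper name `varying_columns` and the toy data are assumptions for illustration and are not part of LipiNet:

```python
import pandas as pd

def varying_columns(df, keys=("layer", "node_id")):
    """Per column, count the key groups that contain more than one distinct value."""
    keys = list(keys)
    # dropna=False treats NaN as a value, so NaN vs. non-NaN counts as a difference
    per_group = df.groupby(keys).nunique(dropna=False)
    return per_group.gt(1).sum().sort_values(ascending=False)

# Hypothetical toy nodes: entity "A" appears twice with different locations.
nodes = pd.DataFrame({
    "layer": ["pe", "pe", "pe", "chebi"],
    "node_id": ["A", "A", "B", "1"],
    "pe_location": ["cytosol", "nucleus", "cytosol", None],
    "species": ["Homo sapiens", "Homo sapiens", "Homo sapiens", None],
})

varying = varying_columns(nodes)
print(varying)
```

Applied to `df_reactome_nodes`, the columns with non-zero counts would be the ones responsible for the repeated rows.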
# Compare per-layer topology after the two patches:
auto_top = reactome_results['df_nodes'][['layer','node_id']].drop_duplicates()
man_top  = df_nodes_human[['layer','node_id']].drop_duplicates()

(by_layer_auto := auto_top.groupby('layer').size().sort_values(ascending=False)).head(10)
(by_layer_man  := man_top.groupby('layer').size().sort_values(ascending=False)).head(10)

# See which layers still differ and by how much
diff = (by_layer_auto.to_frame('auto')
          .join(by_layer_man.to_frame('man'), how='outer').fillna(0).astype(int))
diff['delta'] = diff['auto'] - diff['man']
diff.sort_values('delta', ascending=False)
auto man delta
layer
reactome_chebi 3068 3068 0
reactome_pathway 2477 2477 0
reactome_pathway_ontology 23157 23157 0
reactome_physicalent 5925 5925 0
reactome_physicalent_loc 85 85 0
reactome_physicalent_name 4084 4084 0
reactome_physicalent_nameloc 5922 5922 0
reactome_reactions 9456 9456 0

Great, the two approaches give identical node, edge, and column counts, including the per-layer node counts.
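
Matching counts do not strictly guarantee that the same IDs are present on both sides. A stricter check, sketched here under the assumption that `(layer, node_id)` identifies a node, is an outer merge with `indicator=True` that lists any key found in only one table. The helper name `key_mismatches` and the toy frames are hypothetical:

```python
import pandas as pd

def key_mismatches(left, right, keys=("layer", "node_id")):
    """Return the keys that occur in only one of the two frames."""
    keys = list(keys)
    merged = left[keys].drop_duplicates().merge(
        right[keys].drop_duplicates(), on=keys, how="outer", indicator=True
    )
    # "_merge" is "both" for shared keys, else "left_only" or "right_only"
    return merged[merged["_merge"] != "both"]

# Hypothetical toy tables with equal sizes but one differing ID each.
auto = pd.DataFrame({"layer": ["a", "a", "b"], "node_id": ["1", "2", "3"]})
manual = pd.DataFrame({"layer": ["a", "a", "b"], "node_id": ["1", "2", "4"]})

mismatches = key_mismatches(auto, manual)
print(mismatches)
```

With the frames from this notebook the call would be `key_mismatches(df_reactome_nodes, df_nodes_human)`; an empty result would mean the two node sets agree by ID and not only by count.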