Binding Site Distance#
Binding site distance can be calculated using two different approaches. The first one (Boltzmann sampling) is the suggested use if you dont have prior knowledge about the kind of strand binding behavior of your molecule. The second one is the recommended use if you know that your molecules tend to bind exterior regions on the RNA structure. It is recommended to use the clote-ponty algorithm then, since it actually calculates the expected distance instead of an approximation
Without Prior Information#
You can approximate the expected distance between two binding sites via sampling.
The fastest way to get the distance between two binding sites in such a case is to use sample_distance_ij()
.
Please have a look into the sampling chapter of the Tutorial if you want to use a CLI for sampling.
Exterior Strand Binders#
We showed that the clote-ponty algorithm is capable of calculating expected distances of exterior nucleotides. Since interaction of mRNA with molecules such as RNA binding proteins or regulatory RNAs prevents the binding-site from forming intramolecular hydrogen bonds, the binding-sites might be forced to be in unstructured regions hence exterior nucleotides. In such cases the mentioned algorithm will have no error since no structures span the interval between two binding sites. in the following paragraphs we will show how to calculate expected distances between such binding sites using either the Python API or the Command Line Interface
Python API#
Here we will shortly describe how to calulate the expected distance of two binding-sites using the RNAdist Python API. Therefore we picked binding-sites of PUM2 a famous RNA binding protein.
Note
The binding sites were extracted from an publically available e-CLIP dataset and processed to find the most likely transcript variant. They originated from using the tool Peakhood. Also we shrank the binding site to mainly contain a learned interaction motif of PUM2 using RNAProt.
BindingSites
ENST00000261313.3 715 721
ENST00000261313.3 931 937
First we need to set up a ViennaRNA fold compound and generate a list with those to binding sites.
>>> import RNA
>>> seq = "AGUGUGCUGAGCUCUCCGCGUCGCCUCUGUCGCCCGCGCCUGGCCUACCGCGGCACUCCCGGCUGCACGCUCUGCUUGGCCUCGCCAUGCCGGUGGACCUCAGCAAGUGGUCCGGGCCCUUGAGCCUGCAAGAAGUGGACGAGCAGCCGCAGCACCCGCUGCAUGUCACCUACGCCGGGGCGGCGGUGGACGAGCUGGGCAAAGUGCUGACGCCCACCCAGGUUAAGAAUAGACCCACCAGCAUUUCGUGGGAUGGUCUUGAUUCAGGGAAGCUCUACACCUUGGUCCUGACAGACCCGGAUGCUCCCAGCAGGAAGGAUCCCAAAUACAGAGAAUGGCAUCAUUUCCUGGUGGUCAACAUGAAGGGCAAUGACAUCAGCAGUGGCACAGUCCUCUCCGAUUAUGUGGGCUCGGGGCCUCCCAAGGGCACAGGCCUCCACCGCUAUGUCUGGCUGGUUUACGAGCAGGACAGGCCGCUAAAGUGUGACGAGCCCAUCCUCAGCAACCGAUCUGGAGACCACCGUGGCAAAUUCAAGGUGGCGUCCUUCCGUAAAAAGUAUGAGCUCAGGGCCCCGGUGGCUGGCACGUGUUACCAGGCCGAGUGGGAUGACUAUGUGCCCAAACUGUACGAGCAGCUGUCUGGGAAGUAGGGGGUUAGCUUGGGGACCUGAACUGUCCUGGAGGCCCCAAGCCAUGUUCCCCAGUUCAGUGUUGCAUGUAUAAUAGAUUUCUCCUCUUCCUGCCCCCCUUGGCAUGGGUGAGACCUGACCAGUCAGAUGGUAGUUGAGGGUGACUUUUCCUGCUGCCUGGCCUUUAUAAUUUUACUCACUCACUCUGAUUUAUGUUUUGAUCAAAUUUGAACUUCAUUUUGGGGGGUAUUUUGGUACUGUGAUGGGGUCAUCAAAUUAUUAAUCUGAAAAUAGCAACCCAGAAUGUAAAAAAGAAAAAACUGGGGGGAAAAAGACCAGGUCUACAGUGAUAGAGCAAAGCAUCAAAGAAUCUUUAAGGGAGGUUUAAAAAAAAAAAAAAAAAAAAAGAUUGGUUGCCUCUGCCUUUGUGAUCCUGAGUCCAGAAUGGUACACAAUGUGAUUUUAUGGUGAUGUCACUCACCUAGACAACCAGAGGCUGGCAUUGAGGCUAACCUCCAACACAGUGCAUCUCAGAUGCCUCAGUAGGCAUCAGUAUGUCACUCUGGUCCCUUUAAAGAGCAAUCCUGGAAGAAGCAGGAGGGAGGGUGGCUUUGCUGUUGUUGGGACAUGGCAAUCUAGACCGGUAGCAGCGCUCGCUGACAGCUUGGGAGGAAACCUGAGAUCUGUGUUUUUUAAAUUGAUCGUUCUUCAUGGGGGUAAGAAAAGCUGGUCUGGAGUUGCUGAAUGUUGCAUUAAUUGUGCUGUUUGCUUGUAGUUGAAUAAAAAUAGAAACCUGAAUGAAGGAA"
>>> fc = RNA.fold_compound(seq)
>>> binding_sites = [(715, 721), (931, 937)]
Now that we set up the ViennaRNA fold compound object and the binding sites list we can use the helper function
binding_site_distance()
>>> from RNAdist.dp.cpedistance import binding_site_distance
>>> interesting_distance = binding_site_distance(fc, binding_sites)
We finally got the expected distance between nucleotides 720
and 937
.
>>> interesting_distance
26.263341302657103
>>> backbone_distance = 931 - 720
>>> backbone_distance
211
As you can see the expected distance between the two binding sites is much different from the backbone distance.
You might also wonder how different it is from the MFE structure. Therefore we use the ViennRNA API again to get the string representation of the MFE. Further we import a helper function that can calculate pairwise distances from a secondary structure.
>>> from RNAdist.sampling.ed_sampling import shortest_paths_from_struct
>>> mfe_structure, _ = fc.mfe()
>>> ed_matrix = shortest_paths_from_struct(mfe_structure)
Now we just use the binding site indices to get the mfe distance
>>> ed_matrix[720][931]
135.0
You can see the mfe distance on this RNA is much different from the expected distance.
Command Line Interface#
The command line interface uses FASTA as well as BED files to calculate distances between binding sites. We will consider the same example as the one used in the Tutorial for the Python API. Thus you must have a BED file which looks the following
BED
ENST00000261313.3 715 721
ENST00000261313.3 931 937
As well as a FASTA file that has got sequences sharing its description with entries in the first column of the bed file.
FASTA
>ENST00000261313.3
AGUGUGCUGAGCUCUCCGCGUCGCCUCUGUCGCCCGCGCCUGGCCUACCGCGGCACUCCCGGCUGCACGCUCUGCUUGGCCUCGCCAUGCCGGUGGACCUCAGCAAGUGGUCCGGGCCCUUGAGCCUGCAAGAAGUGGACGAGCAGCCGCAGCACCCGCUGCAUGUCACCUACGCCGGGGCGGCGGUGGACGAGCUGGGCAAAGUGCUGACGCCCACCCAGGUUAAGAAUAGACCCACCAGCAUUUCGUGGGAUGGUCUUGAUUCAGGGAAGCUCUACACCUUGGUCCUGACAGACCCGGAUGCUCCCAGCAGGAAGGAUCCCAAAUACAGAGAAUGGCAUCAUUUCCUGGUGGUCAACAUGAAGGGCAAUGACAUCAGCAGUGGCACAGUCCUCUCCGAUUAUGUGGGCUCGGGGCCUCCCAAGGGCACAGGCCUCCACCGCUAUGUCUGGCUGGUUUACGAGCAGGACAGGCCGCUAAAGUGUGACGAGCCCAUCCUCAGCAACCGAUCUGGAGACCACCGUGGCAAAUUCAAGGUGGCGUCCUUCCGUAAAAAGUAUGAGCUCAGGGCCCCGGUGGCUGGCACGUGUUACCAGGCCGAGUGGGAUGACUAUGUGCCCAAACUGUACGAGCAGCUGUCUGGGAAGUAGGGGGUUAGCUUGGGGACCUGAACUGUCCUGGAGGCCCCAAGCCAUGUUCCCCAGUUCAGUGUUGCAUGUAUAAUAGAUUUCUCCUCUUCCUGCCCCCCUUGGCAUGGGUGAGACCUGACCAGUCAGAUGGUAGUUGAGGGUGACUUUUCCUGCUGCCUGGCCUUUAUAAUUUUACUCACUCACUCUGAUUUAUGUUUUGAUCAAAUUUGAACUUCAUUUUGGGGGGUAUUUUGGUACUGUGAUGGGGUCAUCAAAUUAUUAAUCUGAAAAUAGCAACCCAGAAUGUAAAAAAGAAAAAACUGGGGGGAAAAAGACCAGGUCUACAGUGAUAGAGCAAAGCAUCAAAGAAUCUUUAAGGGAGGUUUAAAAAAAAAAAAAAAAAAAAAGAUUGGUUGCCUCUGCCUUUGUGAUCCUGAGUCCAGAAUGGUACACAAUGUGAUUUUAUGGUGAUGUCACUCACCUAGACAACCAGAGGCUGGCAUUGAGGCUAACCUCCAACACAGUGCAUCUCAGAUGCCUCAGUAGGCAUCAGUAUGUCACUCUGGUCCCUUUAAAGAGCAAUCCUGGAAGAAGCAGGAGGGAGGGUGGCUUUGCUGUUGUUGGGACAUGGCAAUCUAGACCGGUAGCAGCGCUCGCUGACAGCUUGGGAGGAAACCUGAGAUCUGUGUUUUUUAAAUUGAUCGUUCUUCAUGGGGGUAAGAAAAGCUGGUCUGGAGUUGCUGAAUGUUGCAUUAAUUGUGCUGUUUGCUUGUAGUUGAAUAAAAAUAGAAACCUGAAUGAAGGAA
Note
It is possible to have multiple different sequence identifiers in the BED and FASTA files. However, the FASTA file must contain all the entries in the first column of the bed file.
The following command line call will produce a TSV with the expected distance between those binding sites
RNAdist binding-site --input my.fasta --bed_files my.bed --ouput my_output.tsv --num_threads 5
Note
You can use as many bed files as you want by separating them using whitespaces.
Warning
RNAdist will automatically filter overlapping sites since the expected distance will be zero
Output#
The output TSV will have the following structure
TSV
sequence_name name1 start1 stop1 name2 start2 stop2 expected_distance
ENST00000261313.3 foo.bed 715 721 foo.bed 931 937 26.263341302657103
Where sequence_name
is the fasta header as well as the bed first column. The following columns originate from the
BED files and are the names of the files (or custom names specified via --names
flag) as well as the intervals.