Binding Site Distance#

Binding site distance can be calculated using two different approaches. The first one (Boltzmann sampling) is the suggested use if you dont have prior knowledge about the kind of strand binding behavior of your molecule. The second one is the recommended use if you know that your molecules tend to bind exterior regions on the RNA structure. It is recommended to use the clote-ponty algorithm then, since it actually calculates the expected distance instead of an approximation

Without Prior Information#

You can approximate the expected distance between two binding sites via sampling. The fastest way to get the distance between two binding sites in such a case is to use sample_distance_ij(). Please have a look into the sampling chapter of the Tutorial if you want to use a CLI for sampling.

Exterior Strand Binders#

We showed that the clote-ponty algorithm is capable of calculating expected distances of exterior nucleotides. Since interaction of mRNA with molecules such as RNA binding proteins or regulatory RNAs prevents the binding-site from forming intramolecular hydrogen bonds, the binding-sites might be forced to be in unstructured regions hence exterior nucleotides. In such cases the mentioned algorithm will have no error since no structures span the interval between two binding sites. in the following paragraphs we will show how to calculate expected distances between such binding sites using either the Python API or the Command Line Interface

Python API#

Here we will shortly describe how to calulate the expected distance of two binding-sites using the RNAdist Python API. Therefore we picked binding-sites of PUM2 a famous RNA binding protein.

Note

The binding sites were extracted from an publically available e-CLIP dataset and processed to find the most likely transcript variant. They originated from using the tool Peakhood. Also we shrank the binding site to mainly contain a learned interaction motif of PUM2 using RNAProt.

BindingSites

ENST00000261313.3       715 721
ENST00000261313.3       931 937

First we need to set up a ViennaRNA fold compound and generate a list with those to binding sites.

>>> import RNA
>>> seq = "AGUGUGCUGAGCUCUCCGCGUCGCCUCUGUCGCCCGCGCCUGGCCUACCGCGGCACUCCCGGCUGCACGCUCUGCUUGGCCUCGCCAUGCCGGUGGACCUCAGCAAGUGGUCCGGGCCCUUGAGCCUGCAAGAAGUGGACGAGCAGCCGCAGCACCCGCUGCAUGUCACCUACGCCGGGGCGGCGGUGGACGAGCUGGGCAAAGUGCUGACGCCCACCCAGGUUAAGAAUAGACCCACCAGCAUUUCGUGGGAUGGUCUUGAUUCAGGGAAGCUCUACACCUUGGUCCUGACAGACCCGGAUGCUCCCAGCAGGAAGGAUCCCAAAUACAGAGAAUGGCAUCAUUUCCUGGUGGUCAACAUGAAGGGCAAUGACAUCAGCAGUGGCACAGUCCUCUCCGAUUAUGUGGGCUCGGGGCCUCCCAAGGGCACAGGCCUCCACCGCUAUGUCUGGCUGGUUUACGAGCAGGACAGGCCGCUAAAGUGUGACGAGCCCAUCCUCAGCAACCGAUCUGGAGACCACCGUGGCAAAUUCAAGGUGGCGUCCUUCCGUAAAAAGUAUGAGCUCAGGGCCCCGGUGGCUGGCACGUGUUACCAGGCCGAGUGGGAUGACUAUGUGCCCAAACUGUACGAGCAGCUGUCUGGGAAGUAGGGGGUUAGCUUGGGGACCUGAACUGUCCUGGAGGCCCCAAGCCAUGUUCCCCAGUUCAGUGUUGCAUGUAUAAUAGAUUUCUCCUCUUCCUGCCCCCCUUGGCAUGGGUGAGACCUGACCAGUCAGAUGGUAGUUGAGGGUGACUUUUCCUGCUGCCUGGCCUUUAUAAUUUUACUCACUCACUCUGAUUUAUGUUUUGAUCAAAUUUGAACUUCAUUUUGGGGGGUAUUUUGGUACUGUGAUGGGGUCAUCAAAUUAUUAAUCUGAAAAUAGCAACCCAGAAUGUAAAAAAGAAAAAACUGGGGGGAAAAAGACCAGGUCUACAGUGAUAGAGCAAAGCAUCAAAGAAUCUUUAAGGGAGGUUUAAAAAAAAAAAAAAAAAAAAAGAUUGGUUGCCUCUGCCUUUGUGAUCCUGAGUCCAGAAUGGUACACAAUGUGAUUUUAUGGUGAUGUCACUCACCUAGACAACCAGAGGCUGGCAUUGAGGCUAACCUCCAACACAGUGCAUCUCAGAUGCCUCAGUAGGCAUCAGUAUGUCACUCUGGUCCCUUUAAAGAGCAAUCCUGGAAGAAGCAGGAGGGAGGGUGGCUUUGCUGUUGUUGGGACAUGGCAAUCUAGACCGGUAGCAGCGCUCGCUGACAGCUUGGGAGGAAACCUGAGAUCUGUGUUUUUUAAAUUGAUCGUUCUUCAUGGGGGUAAGAAAAGCUGGUCUGGAGUUGCUGAAUGUUGCAUUAAUUGUGCUGUUUGCUUGUAGUUGAAUAAAAAUAGAAACCUGAAUGAAGGAA"
>>> fc = RNA.fold_compound(seq)
>>> binding_sites = [(715, 721), (931, 937)]

Now that we set up the ViennaRNA fold compound object and the binding sites list we can use the helper function binding_site_distance()

>>> from RNAdist.dp.cpedistance import binding_site_distance
>>> interesting_distance = binding_site_distance(fc, binding_sites)

We finally got the expected distance between nucleotides 720 and 937.

>>> interesting_distance
26.263341302657103
>>> backbone_distance = 931 - 720
>>> backbone_distance
211

As you can see the expected distance between the two binding sites is much different from the backbone distance.

You might also wonder how different it is from the MFE structure. Therefore we use the ViennRNA API again to get the string representation of the MFE. Further we import a helper function that can calculate pairwise distances from a secondary structure.

>>> from RNAdist.sampling.ed_sampling import shortest_paths_from_struct
>>> mfe_structure, _ = fc.mfe()
>>> ed_matrix = shortest_paths_from_struct(mfe_structure)

Now we just use the binding site indices to get the mfe distance

>>> ed_matrix[720][931]
135.0

You can see the mfe distance on this RNA is much different from the expected distance.

Command Line Interface#

The command line interface uses FASTA as well as BED files to calculate distances between binding sites. We will consider the same example as the one used in the Tutorial for the Python API. Thus you must have a BED file which looks the following

BED

ENST00000261313.3       715 721
ENST00000261313.3       931 937

As well as a FASTA file that has got sequences sharing its description with entries in the first column of the bed file.

FASTA

>ENST00000261313.3
AGUGUGCUGAGCUCUCCGCGUCGCCUCUGUCGCCCGCGCCUGGCCUACCGCGGCACUCCCGGCUGCACGCUCUGCUUGGCCUCGCCAUGCCGGUGGACCUCAGCAAGUGGUCCGGGCCCUUGAGCCUGCAAGAAGUGGACGAGCAGCCGCAGCACCCGCUGCAUGUCACCUACGCCGGGGCGGCGGUGGACGAGCUGGGCAAAGUGCUGACGCCCACCCAGGUUAAGAAUAGACCCACCAGCAUUUCGUGGGAUGGUCUUGAUUCAGGGAAGCUCUACACCUUGGUCCUGACAGACCCGGAUGCUCCCAGCAGGAAGGAUCCCAAAUACAGAGAAUGGCAUCAUUUCCUGGUGGUCAACAUGAAGGGCAAUGACAUCAGCAGUGGCACAGUCCUCUCCGAUUAUGUGGGCUCGGGGCCUCCCAAGGGCACAGGCCUCCACCGCUAUGUCUGGCUGGUUUACGAGCAGGACAGGCCGCUAAAGUGUGACGAGCCCAUCCUCAGCAACCGAUCUGGAGACCACCGUGGCAAAUUCAAGGUGGCGUCCUUCCGUAAAAAGUAUGAGCUCAGGGCCCCGGUGGCUGGCACGUGUUACCAGGCCGAGUGGGAUGACUAUGUGCCCAAACUGUACGAGCAGCUGUCUGGGAAGUAGGGGGUUAGCUUGGGGACCUGAACUGUCCUGGAGGCCCCAAGCCAUGUUCCCCAGUUCAGUGUUGCAUGUAUAAUAGAUUUCUCCUCUUCCUGCCCCCCUUGGCAUGGGUGAGACCUGACCAGUCAGAUGGUAGUUGAGGGUGACUUUUCCUGCUGCCUGGCCUUUAUAAUUUUACUCACUCACUCUGAUUUAUGUUUUGAUCAAAUUUGAACUUCAUUUUGGGGGGUAUUUUGGUACUGUGAUGGGGUCAUCAAAUUAUUAAUCUGAAAAUAGCAACCCAGAAUGUAAAAAAGAAAAAACUGGGGGGAAAAAGACCAGGUCUACAGUGAUAGAGCAAAGCAUCAAAGAAUCUUUAAGGGAGGUUUAAAAAAAAAAAAAAAAAAAAAGAUUGGUUGCCUCUGCCUUUGUGAUCCUGAGUCCAGAAUGGUACACAAUGUGAUUUUAUGGUGAUGUCACUCACCUAGACAACCAGAGGCUGGCAUUGAGGCUAACCUCCAACACAGUGCAUCUCAGAUGCCUCAGUAGGCAUCAGUAUGUCACUCUGGUCCCUUUAAAGAGCAAUCCUGGAAGAAGCAGGAGGGAGGGUGGCUUUGCUGUUGUUGGGACAUGGCAAUCUAGACCGGUAGCAGCGCUCGCUGACAGCUUGGGAGGAAACCUGAGAUCUGUGUUUUUUAAAUUGAUCGUUCUUCAUGGGGGUAAGAAAAGCUGGUCUGGAGUUGCUGAAUGUUGCAUUAAUUGUGCUGUUUGCUUGUAGUUGAAUAAAAAUAGAAACCUGAAUGAAGGAA

Note

It is possible to have multiple different sequence identifiers in the BED and FASTA files. However, the FASTA file must contain all the entries in the first column of the bed file.

The following command line call will produce a TSV with the expected distance between those binding sites

RNAdist binding-site --input my.fasta --bed_files my.bed --ouput my_output.tsv --num_threads 5

Note

You can use as many bed files as you want by separating them using whitespaces.

Warning

RNAdist will automatically filter overlapping sites since the expected distance will be zero

Output#

The output TSV will have the following structure

TSV

sequence_name   name1   start1  stop1   name2   start2  stop2   expected_distance
ENST00000261313.3       foo.bed 715     721     foo.bed 931     937     26.263341302657103

Where sequence_name is the fasta header as well as the bed first column. The following columns originate from the BED files and are the names of the files (or custom names specified via --names flag) as well as the intervals.