Supplementary material
Tree edit distance algorithms for computing RSS similarities use three operations on the tree nodes: insertion, deletion and substitution [13, 15]. An analogy can be made between these operations and events that occur on the primary sequence level, i.e. mutations that cause deletion of nucleotides, insertion of nucleotides or substitution of nucleotides. These mutations induce secondary structure changes that can be picked up by the OLT comparison model, e.g. a deletion of consecutive nucleotides might delete a loop and/or stem from the RSS. Another event that occurs on the primary structure level is inversion. We hypothesize that the flip enabled model can capture this type of event.
We were unable to find experimentally verified secondary structures in the literature that contain such inversions for testing the model.
Therefore, in order to test whether the flip enabled model can in principle detect changes in secondary structure induced by inversions on the sequence level we performed the following of simulation: the first 747 nucleotides from the human coxsackie virus IRES 5 (M16572) were extracted [30]. An artificial inversion of the nucleotides between positions 243 to 439 was made. The two sequences, the original and the one containing the inversion were folded into their predicted RSS using minimum free energy folding algorithm [23] (see Figure S2). The two structures were given as input to the SimTree method and the similarity score between them was computed using one flip. Notice that the model does not have any pre-knowledge regarding the two structures; it does not know if an inversion occurred or where it occurred. It flips a substructure only if the flip at that location yields the maximal similarity score. The indices of the nucleotides that reside in the flipped substructure are then returned. In this instance, nucleotides 273 to 411 were returned. This means that the flipped secondary substructure overlapped and was entirely contained in the inverted primary sequence region, indicating that inversions on the sequence level induce secondary structure changes that can be picked up by the flip enabled model.
We repeated the experiment 10 times, each time creating an inversion at different locations on different viral IRESes. In all instances the inverted subsequence and the predicted flip in the secondary substructure overlapped. Furthermore, their boundaries were closer than then 40 nucleotides in all cases.
Overall, by computing the flipping location that yields maximal RSS similarity, the method was able to accurately trace back sequence level inversions based solely on secondary structure information.
Two secondary structures corresponding to two inverted sequences. The structures were generated using the minimum fold energy model [22]. It can be seen that the secondary structures are highly similar mirror reflections of one another.
An
artificial inversion of the subsequence between nucleotide positions 243 to 439
in the human coxsackie virus IRES 5 was introduced. The predicted secondary
structure of the original sequence and the one containing the inverted
sub-sequence are given in (a) and (b) respectively. The inversion on the
primary sequence caused a secondary substructure to change into its approximate
mirror reflection. By enabling a flip SimTree was able to detect this
occurrence and map corresponding regions a1 to b1, a2 to b2 and a3 to b3.
A list of the
164 organisms in the RNase P RNA secondary structure dataset.
|
Archaea |
|
|
|
|
|
Crenarchaea |
|
|
1 |
|
|
A.ambivalens |
|
2 |
|
|
A.brierleyi |
|
3 |
|
|
A.pernix |
|
4 |
|
|
M.sedula |
|
5 |
|
|
S.acidocaldarius |
|
6 |
|
|
S.shibatae |
|
7 |
|
|
S.solfataricus |
|
|
|
Euryarchaea |
|
|
8 |
|
|
A.fulgidus |
|
9 |
|
|
H.cutirubrum |
|
10 |
|
|
H.morrhuae |
|
11 |
|
|
H.trapanicum |
|
12 |
|
|
H.volcanii |
|
13 |
|
|
M.barkeri |
|
14 |
|
|
M.fervidus |
|
15 |
|
|
M.formicicum |
|
16 |
|
|
M.jannaschii |
|
17 |
|
|
M.maripaludus |
|
18 |
|
|
M.thermoautotrophicum-DH |
|
19 |
|
|
M.thermolithotrophicus |
|
20 |
|
|
M.vannielli |
|
21 |
|
|
N.gregoryi |
|
22 |
|
|
P.abyssi |
|
23 |
|
|
P.horikoshi |
|
24 |
|
|
SL2D |
|
25 |
|
|
T.celerAL1 |
|
|
Bacteria |
|
|
|
|
|
Gram-positive
Bacteria |
|
|
|
|
|
High GC |
|
26 |
|
|
|
A.erythreum |
27 |
|
|
|
C.diphtheriae |
28 |
|
|
|
F.antarctica |
29 |
|
|
|
I.calvum |
30 |
|
|
|
L.japonicus-IFO15385 |
31 |
|
|
|
M.avium |
32 |
|
|
|
M.bovis |
33 |
|
|
|
M.leprae |
34 |
|
|
|
M.luteus |
35 |
|
|
|
M.phosphovorus-JCM9380 |
36 |
|
|
|
M.tuberculosis-g |
37 |
|
|
|
M.tuberculosis |
38 |
|
|
|
N.flavus |
39 |
|
|
|
N.jensenii |
40 |
|
|
|
N.luteus |
41 |
|
|
|
Nocardiodes-ATCC39419 |
42 |
|
|
|
N.plantarum |
43 |
|
|
|
N.pyridinolyticus |
44 |
|
|
|
N.simplex |
45 |
|
|
|
P.innocua |
46 |
|
|
|
S.bikiniensis |
47 |
|
|
|
S.caesia-K76T |
48 |
|
|
|
S.cyanea-K168T |
49 |
|
|
|
S.glauca-L169T |
50 |
|
|
|
S.lividans |
51 |
|
|
|
S.polymorpha |
52 |
|
|
|
S.viridis-K73T |
|
|
|
Low GC |
|
53 |
|
|
|
A.laidlawii |
54 |
|
|
|
B.anthracis |
55 |
|
|
|
B.brevis |
56 |
|
|
|
B.halodurans |
57 |
|
|
|
B.megaterium |
58 |
|
|
|
B.stearothermophilus |
59 |
|
|
|
B.subtilis |
60 |
|
|
|
C.acetobutylicum |
61 |
|
|
|
C.difficile |
62 |
|
|
|
C.hydrogenoformans |
63 |
|
|
|
C.innocuum |
64 |
|
|
|
C.sporogenes |
65 |
|
|
|
E.faecalis |
66 |
|
|
|
E.rhusiopathiae |
67 |
|
|
|
E.thermomarinus |
68 |
|
|
|
H.chlorum |
69 |
|
|
|
M.capricolum |
70 |
|
|
|
M.fermentans |
71 |
|
|
|
M.flocculare |
72 |
|
|
|
M.genitalium |
73 |
|
|
|
M.hyopneumoniae |
74 |
|
|
|
S.bovis |
75 |
|
|
|
S.epidermidis |
76 |
|
|
|
S.gordonii |
77 |
|
|
|
U.urealyticum |
|
|
Purple
Bacteria |
|
|
|
|
|
Alpha |
|
78 |
|
|
|
A.tumefaciens |
79 |
|
|
|
ESH212C |
80 |
|
|
|
LGB41 |
81 |
|
|
|
LGW113 |
82 |
|
|
|
PS26 |
83 |
|
|
|
PS2 |
84 |
|
|
|
R.capsulatus |
85 |
|
|
|
R.palustris |
86 |
|
|
|
R.rubrum |
87 |
|
|
|
Wolbachia-sp |
|
|
|
Beta |
|
88 |
|
|
|
A.eutrophus |
89 |
|
|
|
C.testosteroni |
90 |
|
|
|
ESH167F |
91 |
|
|
|
ESH183D |
92 |
|
|
|
ESH26-4 |
93 |
|
|
|
Mxa1 |
94 |
|
|
|
N.meningitidis-Z2491 |
95 |
|
|
|
T.thioparus |
96 |
|
|
|
vB11 |
|
|
|
Gamma |
|
97 |
|
|
|
A.ferrooxidans |
98 |
|
|
|
Buchnera-APS |
99 |
|
|
|
C.vinosum |
100 |
|
|
|
E.agglomerulans |
101 |
|
|
|
E.coli |
102 |
|
|
|
H.influenza |
103 |
|
|
|
K.pneumoniae |
104 |
|
|
|
L.adecarboxylata |
105 |
|
|
|
P.alcalifaciens |
106 |
|
|
|
V.cholera |
|
|
|
Delta |
|
107 |
|
|
|
D.desulfuricans |
108 |
|
|
|
D.vulgaris |
109 |
|
|
|
G.sulfurreducens |
110 |
|
|
|
M.xanthus |
|
|
|
Epsilon |
|
111 |
|
|
|
C.jejuni |
112 |
|
|
|
H.pylori-26695 |
|
|
Bacteroides
and Relatives |
|
|
113 |
|
|
B.thetaiotaomicron |
|
114 |
|
|
ESH30-3 |
|
115 |
|
|
F.yabuuchiae |
|
116 |
|
|
LGB27 |
|
117 |
|
|
L.weilii |
|
118 |
|
|
P.gingivalis |
|
119 |
|
|
T.pallidum |
|
|
|
Chlamydia |
|
|
120 |
|
|
C.abortus-B577 |
|
121 |
|
|
C.caviae-G |
|
122 |
|
|
C.felis-FP |
|
123 |
|
|
C.muridarum-MoPn |
|
124 |
|
|
C.pecorum-H3 |
|
125 |
|
|
C.pneumoniae |
|
126 |
|
|
C.psittaci |
|
127 |
|
|
C.suis-S45 |
|
128 |
|
|
C.trachomatis |
|
129 |
|
|
P.acanthamoebae |
|
130 |
|
|
S.negevensis |
|
|
|
thermolithotrophicus |
|
|
131 |
|
|
Anabaena-ATCC29413 |
|
132 |
|
|
Anabaena-PCC7120 |
|
133 |
|
|
A.nidulans |
|
134 |
|
|
Calothrix-PCC7601 |
|
135 |
|
|
Dermocarpa-PCC7437 |
|
136 |
|
|
Fischerella-UTEX1829 |
|
137 |
|
|
Nostoc-PCC7107 |
|
138 |
|
|
Oscillatoria-PCC7515 |
|
139 |
|
|
P.hollandica |
|
140 |
|
|
P.marinus |
|
141 |
|
|
Prochlorococcus-TATL2 |
|
142 |
|
|
Pseudoanabaena-PCC6903 |
|
143 |
|
|
Synechococcus-PCC7001 |
|
|
|
Spirochaete |
|
|
144 |
|
|
B.burgdorferi |
|
145 |
|
|
B.hermsii |
|
146 |
|
|
L.borgpetersenii |
|
|
Eukaryotes |
|
|
|
|
|
mitochondrion |
|
|
147 |
|
|
R.americana-mito |
|
|
|
nucleus |
|
|
148 |
|
|
A.obstetricans |
|
149 |
|
|
A.telluris |
|
150 |
|
|
A.truei |
|
151 |
|
|
C.lusitaniae |
|
152 |
|
|
E.eschsholtzii |
|
153 |
|
|
G.gorilla |
|
154 |
|
|
K.polysporus |
|
155 |
|
|
L.muta-1 |
|
156 |
|
|
M.musculus |
|
157 |
|
|
P.pygmaeus |
|
158 |
|
|
S.carlsbergensis |
|
159 |
|
|
S.cerevisiae |
|
160 |
|
|
T.syrichta |
|
161 |
|
|
X.laevis |
|
|
|
plastid |
|
|
162 |
|
|
Cadoxa-cyanelle |
|
163 |
|
|
N.olivacea-chloroplast |
|
164 |
|
|
P.purpurea-chloroplast |
The SimTree method first receives an RNA secondary
structure in parenthesis notation and transforms it into a labeled tree
representation.
Input: a tree in a parenthesis representation.
Output: a tree object.
Time complexity: O(n) where n is the length of the input sequence.
Assumptions and notations:
1.The parenthesis representation is correct i.e. it represents a viable RNA secondary structure.
2. Each node v in the tree contains the following data:
a. The type of the node, where type(v) є {multi loop, stem}
b. The node complexity which is defined as:
n_bp is number of base pairs in the stem. (n_bp, n_fork) are number of nucleotides in a multi-loop and number of stems branching out of a multi-loop. Other high level structural subunits can now be composed, for instance:
A loop: Type(v) = multi-loop and Complexity(v) = (n_bp,1).
A bulge: Type(v) = multi-loop and Complexity(v) = ({1 or 2},2).
3. All trees are rooted at a unique node, notated as Root, regardless of their topology.
Algorithm description:
The algorithm has three major components:
First, the input parenthesis sequence is traversed upon and
transformed into a list of triples where each triple is composed of the
following fields: [shape, s, index]. Shape , s represents the number of sequential instances of
the shape and an index is the position of the shape in the sequence. For
example see figure 2.
Figure 2
The time complexity of the above component is O(n) as it performs a single traverse over the input parenthesis sequence.
The second component receives the parenthesis sequence transformation from the first component and returns a stack containing the trees node in a pre-order traversal.
This component employs two stacks:
1.
SOP - contains the open parenthesis
objects.
2.
SD - contains the dot objects.
The algorithm determines the nature of a node by counting the number of branches from each node and thus, determining the kind of multi-loop (stems are being determined by complementary parenthesis). The method exploits the observation that when it finds a stem, a multi-loop must precede it. Hence, it pops and counts the number of dot objects in the dots stack and determines the number of branches the multi-loop has. A pseudo dot object is needed in order to count correctly multi-loops with a fork where there are no separating bases between the two branches. A schematic explanation is given in figure 3.
This method has a time complexity of O(n) since in worse case it performs n operations on both stacks.
Figure
3
When a ')' is
encountered all objects in the '.' stack that have greater index than the
object on the top of the '(' stack are popped. The number of popped objects
determines the degree of the Multi-Loop.
The third method takes the stack that contains the nodes in a pre-order from the second component and builds the corresponding tree.
It performs one scan of the stack and hence has a time complexity of O(n).