|Abstrakt:||The goal of gene finding is to mark important places in DNA sequence -- genes. Programs solving this task are using information about DNA structure from already annotated sequences and external information. We study existing gene finding programs and find their limitations in using external information. Typically they can process only simple hints consisting of a single interval, because these are relatively easy to incorporate to standard algorithms, but cannot cope with complex hint consisting of multiple intervals. This is unwanted information loss, and therefore we developed an algorithm able to process complex information. It is based on Hidden Markov models and the Viterbi algorithm as previous gene-finders, but uses a different approach of adding external information into the model. Our algorithm finds the sequence of state labels of the HMM with the best score, where a sequence gets a bonus for each respected hint. Hints are modeled as alternative paths in a graph representation of the problem.
We proved that this approach increases the accuracy of gene prediction, especially at the gene level. Time complexity of the algorithm is linear in sequence length and in the total length of all hints, while quadratic in the number of hints and HMM states. This is comparable to time complexity of existing gene finders.