Authors: Ornella, L.; Esteban, L.; Serra, E.; Tapia, E.
Title: A classification approach for the analysis of coding vs. non-coding sequences in the Trypanosoma cruzi genome.
As part of a gene-finding project in highly divergent genomes, we evaluate the performance of five well-known supervised learning classifiers for detecting coding vs. non-coding sequences in the Trypanosoma cruzi genome. From 781 sequences available in GenBank (374 coding and 407 non-coding) we obtained several datasets (Sw), each containing the frequencies of all possible words of a given nucleotide length (w-mers) and a binary class label (cod or no-cod). Additional datasets were generated by filtering Sw with signal-to-noise (S2N) ratio feature selection. Classifier performance was evaluated by the Monte Carlo method (100 runs) and 10-fold cross-validation. The best results were obtained with Support Vector Machine classifiers with a radial basis function kernel (C = 1, γ = 1): a 6.0% mean error rate on the dataset combining the frequencies of w = 2 and w = 3 mers without filtering, and a 6.1% mean error rate on the dataset with w = 3, also without filtering. Even though the analyses are preliminary and statistical tests are still needed to establish the significance of the differences in classifier performance, the results approach the accuracy of the Glimmer tool and other algorithms reported in the literature.
Journal: Actas de la Academia Nacional de Ciencias.
Publisher: Pugliese Siena SH.
Place of publication: Córdoba.
Reference type: Peer-reviewed.
Published: Yes
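
Illustrative note: the feature construction described in the abstract (w-mer frequency vectors, optionally filtered by an S2N score) might be sketched as below. This is a minimal sketch and not the authors' code. The function names, the use of overlapping windows, the skipping of windows with ambiguous bases, and the S2N formula (difference of class means divided by the sum of class standard deviations, a common definition) are all assumptions, since the abstract does not specify them.

```python
from itertools import product
from statistics import mean, pstdev

ALPHABET = "ACGT"


def wmer_frequencies(seq, w):
    """Relative frequency of each of the 4**w words of length w in seq.

    Assumption: words are counted over overlapping windows, and windows
    containing characters outside ACGT (for example N) are skipped.
    """
    words = ["".join(p) for p in product(ALPHABET, repeat=w)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - w + 1):
        word = seq[i:i + w]
        if word in counts:
            counts[word] += 1
            total += 1
    # Words are returned in lexicographic order: AA, AC, AG, AT, CA, ...
    return [counts[word] / total if total else 0.0 for word in words]


def s2n_score(values_cod, values_nocod):
    """Signal-to-noise score of one feature across the two classes.

    Assumption: S2N = (mean_cod - mean_nocod) / (sd_cod + sd_nocod).
    """
    spread = pstdev(values_cod) + pstdev(values_nocod)
    if spread == 0:
        return 0.0
    return (mean(values_cod) - mean(values_nocod)) / spread
```

Under these assumptions, the combined w = 2, 3 dataset mentioned in the abstract would correspond to concatenating `wmer_frequencies(seq, 2)` and `wmer_frequencies(seq, 3)`, giving 16 + 64 = 80 features per sequence. Features could then be ranked by the absolute value of `s2n_score` before training a classifier.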