ARTICLE
TITLE

Entropy of DNA sequences and leukemia patients mortality

SUMMARY

Introduction. Deoxyribonucleic acid (DNA) is not a random sequence of combinations of four nucleotides: comprehensive reviews [1, 2] persuasively show long- and short-range correlations in DNA, periodic properties, and a correlation structure of sequences. Information theory methods, such as entropy, quantify the amount of information contained in sequences. The relationship between entropy and patient survival has been widely examined in several branches of medicine and medical research: cardiology, neurology, surgery, and trauma. It therefore appears necessary to apply the advantages of information theory methods to explore the relationship between the mortality of a given category of patients and the entropy of their DNA sequences.

Aim of the research. The goal of this paper is to provide a reliable formula for accurately calculating entropy for short DNA sequences and to show how existing entropy analysis can be used to examine the mortality of leukemia patients.

Materials and Methods. We used the University of Barcelona (UB) leukemia patient database (DB), which contains 117 anonymized records, each consisting of the date of diagnosis, the date of death, the leukemia diagnosis, and the patient's DNA sequence. The average time from diagnosis to death was 99 ± 77 months. The formal characteristics of the DNA sequences in the UB leukemia patient DB are: average number of bases N = 496 ± 69; min(N) = 297 bases; max(N) = 745 bases. A generalized form of the Robust Entropy Estimator (EnRE) for short DNA sequences was proposed, and the key features of EnRE were shown. Survival analysis was performed in the statistical package IBM SPSS 27 using Kaplan-Meier survival analysis and Cox regression survival modelling.

Results. The accuracy of the proposed EnRE for calculating entropy was verified for various lengths of time series and various types of random distributions. In all cases for N = 500, the relative error with respect to the precise value of entropy does not exceed 1 %, while the correlation is no lower than 0.995. To obtain the minimum EnRE standard deviation and coefficient of variation, the initial alphabet code of each DNA sequence was converted into an integer code of bases using an optimization rule that yields a single minimal numerical coding around zero. EnRE was calculated for leukemia patients under two groupings: two groups divided by the median EnRE = 1.47, and two groups formed from patients belonging to the 1st (EnRE = 1.448) and 4th (EnRE = 1.490) quartiles. The results of the Kaplan-Meier survival analysis and Cox regression survival modelling are statistically significant: p < 0.05 for the median groups and p < 0.005 for the groups formed from the 1st and 4th quartiles. The death hazard for a patient with EnRE below the median is 1.556 times that of a patient with EnRE above the median, and the death hazard for a patient in the 1st entropy quartile (lowest EnRE) is 2.143 times that of a patient in the 4th entropy quartile (highest EnRE).

Conclusions. The transition from wider (median) to narrower (quartile) patient groups with greater EnRE differentiation confirmed the unique significance of the entropy of DNA sequences for the mortality of leukemia patients. This significance is supported statistically by the increased hazard and the decreased average time from diagnosis to death for leukemia patients with lower entropy of DNA sequences.
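
The workflow described in the summary (compute an entropy per DNA sequence, split patients at the median, compare survival between the groups) can be illustrated with a minimal Python sketch. The summary does not reproduce the EnRE formula, so the sketch substitutes the standard plug-in Shannon entropy, which is a different estimator and will not give the EnRE values quoted here. The function names, the toy sequences, and the survival times are all hypothetical and are not taken from the UB database. The Kaplan-Meier step is a plain product-limit estimate; the Cox regression run in SPSS is not reproduced.

```python
import math
from collections import Counter
from statistics import median


def shannon_entropy(seq: str) -> float:
    """Plug-in Shannon entropy (bits per base) of a nucleotide string.

    Assumption: this stands in for EnRE, whose formula is not given here.
    """
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def split_by_median(values):
    """Return (low, high) index lists: strictly below vs. at or above the median."""
    m = median(values)
    low = [i for i, v in enumerate(values) if v < m]
    high = [i for i, v in enumerate(values) if v >= m]
    return low, high


def kaplan_meier(durations, events):
    """Kaplan-Meier survival curve as a list of (time, S(time)) steps.

    durations: follow-up time per patient (e.g. months from diagnosis)
    events:    True if death was observed, False if the record is censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored records both leave the risk set
    return curve


# Hypothetical toy data, not taken from the UB database.
sequences = ["ACGTACGTACGT", "AAAAAAAACCCC", "AACCGGTTACGA", "AAAAAAAAAAAT"]
months = [120, 40, 95, 20]
died = [True, True, True, True]

entropies = [shannon_entropy(s) for s in sequences]
low, high = split_by_median(entropies)
km_low = kaplan_meier([months[i] for i in low], [died[i] for i in low])
km_high = kaplan_meier([months[i] for i in high], [died[i] for i in high])
```

On real data, the two curves would then be compared with a log-rank test, and the hazard ratio between groups estimated with a Cox model, as the summary reports.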
