
THE USAGE OF NATURAL LANGUAGE PROCESSING METHODS TO DETECT THE SYMPTOMS OF MENTAL ILLNESS

SUMMARY

Background. Detecting the symptoms of mental illness is a complicated task that requires an appropriately qualified specialist. One part of the diagnostics of such diseases is the analysis of the patient's speech. Alogia (poverty of speech), the lack of a persistent focus on a topic, incoherent speech, and the constant use of metaphors can indicate the presence of corresponding symptoms. It is therefore necessary to apply automated methods of estimating the patient's speech in order to detect deviations from defined statistical data. Such methods fall into the category of natural language processing. Given the lack of a unified structure and the presence of ambiguous terms, natural language processing tasks cannot be solved with fixed algorithms. The search for regularities and the detection of connections between a text's elements are performed using machine learning methods: regression models, decision trees, and deep learning (multilayer neural networks). Thus, it is advisable to review state-of-the-art methods, based on different machine learning techniques, for detecting the symptoms of mental illness by analyzing the patient's speech. The purpose of the work is twofold: to perform a comparative analysis of different state-of-the-art methods for detecting the symptoms of mental illness based on natural language processing; and to experimentally examine the effectiveness of the proposed method based on the analysis of the connectivity of a text's elements.

Materials and methods.

Results. According to the analysis of state-of-the-art methods, semantic coherence is the main feature of a text for predicting mental illness. Two different models based on the estimation of semantic coherence are considered: the tangentiality model and the incoherence model.
The main idea of the tangentiality model is to detect a persistent deviation of the topic of an answer from the question. A text is divided into windows, i.e. sets of words of a fixed length. Each window and the question are represented as vectors using a pre-trained semantic embedding model (LSA). The similarity between a window and the question is calculated as the cosine distance between the corresponding vectors. A linear regression is then fitted over the set of calculated distances. A steeper slope of the line indicates that the speaker's thoughts deviate from the overall topic of the conversation.

In contrast to the tangentiality model, the incoherence model processes a text at the level of sentences. Each sentence is represented as the average of the vector representations of its words; each word is represented as a vector using a pre-trained semantic embedding model. Three features are then calculated to form a feature vector: minimum first-order coherence (the minimum similarity between two consecutive sentences, estimated as the cosine distance between the corresponding vectors), maximum sentence length, and the frequency of use of additional uninformative words. This dataset is used to build a convex hull classifier that separates the interviews of healthy and ill people.

The key disadvantage of both models is that they neglect repeated phrases within a text; such repeats can complicate the classification process. To address this, various combinations of state-of-the-art semantic embedding models (Word2Vec, Sent2Vec, GloVe) with frequency-based algorithms (TF-IDF, SIF) can be used. The disadvantage of this approach is its dependency on an additional corpus to calculate statistical data about word-usage frequencies. As for the effectiveness of each model for different languages, it depends on the collected dataset and the unique features of the particular language.
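The two coherence measures above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy two-dimensional vectors stand in for real pre-trained embeddings (LSA, Word2Vec), and the window construction is assumed rather than taken from the article.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tangentiality_slope(question_vec, window_vecs):
    # Similarity of each window to the question, fitted against the window
    # index with a linear regression. A steeper (more negative) slope
    # suggests the answer drifts away from the topic of the question.
    sims = [cosine_similarity(question_vec, w) for w in window_vecs]
    slope, _intercept = np.polyfit(np.arange(len(sims)), sims, 1)
    return float(slope)

def min_first_order_coherence(sentence_vecs):
    # Minimum similarity between consecutive sentence vectors
    # (the "minimum first-order coherence" feature of the incoherence model).
    pairs = zip(sentence_vecs, sentence_vecs[1:])
    return min(cosine_similarity(a, b) for a, b in pairs)

# Toy unit vectors standing in for embedded windows that gradually
# rotate away from the question vector.
question = np.array([1.0, 0.0])
windows = [np.array([1.0, 0.0]), np.array([0.8, 0.6]),
           np.array([0.6, 0.8]), np.array([0.0, 1.0])]

print(tangentiality_slope(question, windows))   # negative: drifting answer
print(min_first_order_coherence(windows))       # lowest consecutive similarity
```

In a real pipeline the same two functions would be fed vectors produced by the embedding model, and the slope and minimum coherence would enter the feature vector alongside maximum sentence length and the frequency of uninformative words.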
Besides semantic coherence, other linguistic characteristics can be taken into account to form a feature vector: lexical complexity, lexical density, and syntactic complexity. Each of these characteristics can be represented by a corresponding set of metrics. Moreover, the frequent use of ambiguous pronouns may also be taken into account, because it can indicate disorganization of the speaker's thoughts.

The proposed method, based on a graph of the consistency of phrases, allows estimating the connectivity of a text (its cohesion). It takes into account the presence of coreferent objects and common terms within a text. The effectiveness of the suggested method was compared with other text features using pre-trained classification models. The results obtained indicate that the proposed method may be used to calculate a connectivity feature for a model that predicts mental illness.

Conclusions. Semantic coherence is used as the main criterion for distinguishing the texts of healthy and ill persons. The estimation of semantic coherence is performed by two models: the tangentiality model and the incoherence model. It is advisable to perform the semantic representation of the text's elements (sentences for the incoherence model and windows for the tangentiality model) using a combination of different semantic embedding models with statistical algorithms (TF-IDF, SIF) in order to take into account repeated phrases. As for the effectiveness of the mentioned models for different languages, it depends on the semantic embedding model and the properties of the particular language. To increase the accuracy of the classification model, other linguistic features should be taken into account: lexical density, lexical and syntactic complexity, and connectivity. A method based on a graph of the consistency of phrases has been proposed to take the connectivity of a text into account.
An experimental examination of the effectiveness of the proposed method in comparison with other features has been performed. The results obtained indicate the expediency of using the proposed method to increase the accuracy of a prediction model.
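The summary does not detail how the graph of the consistency of phrases is built, so the following is only a plausible sketch under stated assumptions: each sentence is a node, an edge connects two sentences that share at least one content term (a crude stand-in for coreferent objects and common terms, with no real lemmatization or coreference resolution), and cohesion is the share of connected sentence pairs. The sample sentences and the stopword list are invented for illustration.

```python
from itertools import combinations

# Tiny illustrative stopword list; a real system would use a full one.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "it"}

def content_terms(sentence):
    # Crude tokenization: lowercase, strip punctuation, drop stopwords.
    return {w.strip(".,;!?").lower() for w in sentence.split()} - STOPWORDS

def cohesion(sentences):
    # Graph: one node per sentence, an edge when two sentences share at
    # least one content term. Cohesion = connected pairs / all pairs.
    terms = [content_terms(s) for s in sentences]
    pairs = list(combinations(range(len(sentences)), 2))
    edges = sum(1 for i, j in pairs if terms[i] & terms[j])
    return edges / len(pairs) if pairs else 0.0

coherent = [
    "The patient described his dog.",
    "The dog barks at strangers.",
    "Strangers frighten the patient.",
]
scattered = [
    "The patient described his dog.",
    "Rain fell yesterday.",
    "Numbers grow quickly.",
]
print(cohesion(coherent))   # higher: every pair of sentences shares a term
print(cohesion(scattered))  # lower: no shared content terms
```

A connected, on-topic narrative yields a dense graph and a cohesion value near 1, while disjointed speech yields a sparse graph and a value near 0; this scalar could then serve as the connectivity feature in the prediction model.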
