Identifying Themes in Fiction: A Centroid-Based Lexical Clustering Approach

Abdulfattah Omar

Abstract


In recent years, numerous computational methods have been developed and widely adopted in the humanities and literary studies. Although such methods offer workable solutions to inherent problems of research in these domains, including selectivity, objectivity, and replicability, very little empirical work has applied them to thematic studies in literature. Such studies are almost entirely undertaken through traditional methods, in which individual researchers read the texts and intuitively abstract generalizations from their reading. This undermines objectivity and replicability. Furthermore, traditional methods struggle to deal effectively with the hundreds of thousands of new novels published every year. To address these problems, this study proposes an integrated computational model for the thematic classification of literary texts based on lexical clustering methods. The study is based on a corpus comprising Thomas Hardy’s novels and short stories, and it employs computational semantic analysis based on a vector space model (VSM) representation of the lexical content of the texts. The results indicate that the selected texts can be grouped thematically on the basis of their semantic content. This provides evidence that text clustering approaches, which have long been used in computational theory and data mining applications, can be usefully applied in literary studies.
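The abstract does not spell out the pipeline, so the following is only a minimal sketch of what a VSM representation combined with centroid-based clustering might look like. It assumes TF-IDF term weighting, cosine similarity, and a k-means-style update with hand-picked seed documents. The four toy "texts", the function names, and the seed choice are all hypothetical and are not drawn from the study's corpus or method.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (list of floats) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents in which each word occurs.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption; other weighting schemes are possible).
        vec = [tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
               for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroid_cluster(vectors, seed_indices, max_iter=20):
    """Assign each vector to its most similar centroid, then recompute the
    centroids as normalised cluster means, until assignments stop changing."""
    centroids = [list(vectors[i]) for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: dot(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            mean = [sum(col) / len(members) for col in zip(*members)]
            norm = math.sqrt(sum(x * x for x in mean)) or 1.0
            centroids[k] = [x / norm for x in mean]
    return labels

# Hypothetical toy texts standing in for much longer literary works.
docs = [
    "love marriage wife husband love",
    "marriage husband wife wedding",
    "farm sheep harvest field farm",
    "harvest field sheep shepherd",
]
labels = centroid_cluster(tfidf_vectors(docs), seed_indices=[0, 2])
print(labels)
```

In this toy setting the first two texts share only marriage-related vocabulary and the last two only farming-related vocabulary, so they fall into separate clusters. A real corpus would additionally need tokenization, stop-word handling, and feature selection, none of which are shown here.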


Keywords


computational models; computational semantics; lexical clustering; lexical content; philological methods; Thomas Hardy; Vector Space Model (VSM)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Journal of Language and Linguistic Studies
ISSN 1305-578X (Online)
Copyright © 2005-2022 by Journal of Language and Linguistic Studies