Clustering of Texts From a Digital Literary Work

 

The amount of textual information in the world is enormous and continues to grow every day. Clustering is a technique for grouping similar objects and separating dissimilar ones. This technique is therefore useful for structuring and interpreting large volumes of texts. Moreover, clustering can be seen as a necessary step before classification when texts lack labels.

In this project, I clustered the texts from the digital literary work particles, which the author Andreas Louis Seyerlein began in 2007 and which comprised 3,000 texts at the time of my analysis in 2021.

I applied a classical approach from Natural Language Processing (NLP) that treats texts as bags of words. According to this approach, texts are transformed into high-dimensional vectors, with each dimension corresponding to a specific word or token from the vocabulary of the given corpus. The vector values are derived from term occurrence, for example as raw or weighted term frequencies. The advantage is that the dimensions of text vectors are easily interpretable from a human perspective. On the other hand, linguistic phenomena such as synonymy or homonymy are not taken into account. Furthermore, several preprocessing steps are necessary to reduce the dimensionality of the text vectors. The graphic below visualizes the steps I performed.

Processing Steps

– In the text preprocessing step, I focused on reducing the vocabulary size. The greatest decrease was achieved through pruning (26,380 tokens fewer), as most words in human and, especially, poetic language are rare. The other two methods that contributed substantially to the vocabulary reduction were removing numbers (9,512 tokens fewer) and lemmatizing (8,215 tokens fewer) with the spaCy library. Overall, the vocabulary shrank from 52,229 to 2,238 tokens during this step. (Illustrative code sketches for this and the following steps appear after the list.)

– Based on Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF), the reduced texts were transformed into sparse vectors with the TfidfVectorizer from the scikit-learn library.

– The data preprocessing step involved correlation analysis to remove redundant tokens and dimensionality reduction using Singular Value Decomposition (SVD; see TruncatedSVD from scikit-learn). SVD replaces many tokens with fewer concepts, which can still be interpreted based on the tokens with the highest weights within each concept.

– I used the prototype-based K-means algorithm for clustering via the KMeansClusterer from the NLTK library. This algorithm is suitable for sparse, high-dimensional text vectors and has linear complexity (Tan et al., 2014, pp. 505, 570). However, it struggles with outliers and clusters of different sizes (Tan et al., 2014, pp. 506, 570). Moreover, the K-means algorithm assigns each object to exactly one cluster (Tan et al., 2014, p. 497). The challenge is to determine an appropriate value for k, or the number of clusters, when few or no labels are available. I used the Dask library to generate multiple clusterings with different numbers of clusters in parallel.

– Since only 20% of the texts were assigned to user-defined categories, only this portion could be evaluated with supervised validation metrics such as precision, recall, purity, entropy, and Normalized Mutual Information (NMI; Manning et al., 2008, pp. 356-360). However, all clusters were assessed through unsupervised validation. Plotting the Sum of Squared Errors (SSE) and the average silhouette coefficient against the number of clusters with Matplotlib and Plotly, and identifying knees in these curves, helped determine appropriate values for k (Tan et al., 2014, pp. 546-547). I also visualized similarity matrices, with clusters ordered by the average similarity of their text vectors in descending order (Tan et al., 2014, pp. 543-544). Additionally, I listed the tokens with the highest weights for each cluster centroid to identify cluster keywords, which was possible due to the previously mentioned interpretability of dimensions in bag-of-words-based text vectors. (A sketch of the unsupervised validation follows the figures below.)
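The sketches below illustrate how the individual steps could be implemented. The library calls are real, but the model names, parameter values, and toy data are assumptions for illustration rather than the project's exact settings. First, a minimal preprocessing sketch with spaCy: lemmatizing, dropping numbers, and pruning rare tokens.

```python
# Minimal preprocessing sketch: lemmatize, drop numbers, and prune rare tokens.
# The model name, the toy corpus, and the pruning threshold are assumptions.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: pick the model matching the corpus language

raw_texts = [
    "Two particles are drifting in the water on July 3.",
    "A particle glows in the dark water of the sea.",
]

def lemmatize(text):
    """Return lowercase lemmas, skipping numbers, punctuation, spaces, and stop words."""
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if not (token.like_num or token.is_punct or token.is_space or token.is_stop)
    ]

def prune(tokenized_texts, min_count=2):
    """Drop tokens that occur fewer than min_count times in the whole corpus."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in tokenized_texts]

tokenized = [lemmatize(text) for text in raw_texts]
pruned = prune(tokenized)
```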
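Continuing the preprocessing sketch, the vectorization step could look as follows; whether plain term frequencies or TF-IDF weights are produced is controlled by the use_idf flag, and the parameter values shown are assumptions.

```python
# Sketch of the vectorization step with scikit-learn's TfidfVectorizer.
# Parameter values are assumptions, not the project's exact settings.
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessed texts from the previous sketch, rejoined into strings of lemmas.
documents = [" ".join(tokens) for tokens in pruned]

vectorizer = TfidfVectorizer(
    use_idf=True,       # False yields plain term-frequency (TF) weighting
    sublinear_tf=True,  # dampen the influence of very frequent tokens
)
X = vectorizer.fit_transform(documents)          # sparse matrix: texts x vocabulary
vocabulary = vectorizer.get_feature_names_out()  # one token per dimension
```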
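For the data preprocessing step, a self-contained sketch of TruncatedSVD and of reading off concept keywords; the toy corpus and the number of components are assumptions.

```python
# Sketch of dimensionality reduction with TruncatedSVD and of interpreting each
# concept through its highest-weighted tokens. The toy corpus and the component
# count are assumptions chosen only for illustration.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "a particle drifts in the water",
    "the sea and the water are glowing",
    "a bird sings in the garden",
    "the garden blooms in summer",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
tokens = vectorizer.get_feature_names_out()

svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)  # texts x concepts

# The dimensions of the reduced space remain interpretable via their top tokens.
for i, component in enumerate(svd.components_):
    top = np.argsort(component)[::-1][:3]
    print(f"concept {i}: {', '.join(tokens[top])}")
```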
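For the clustering step, a sketch of running NLTK's KMeansClusterer for several values of k in parallel with Dask; the random stand-in data, the range of k, and the number of restarts are assumptions.

```python
# Sketch of running K-means for several values of k in parallel with Dask.
# NLTK's KMeansClusterer expects dense vectors, so the SVD-reduced matrix is used;
# the stand-in data, the k range, and the repeat count are assumptions.
import dask
import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance

rng = np.random.default_rng(0)
X_reduced = rng.random((100, 10))  # stand-in for the SVD-reduced text vectors

def cluster_with_k(vectors, k):
    """Cluster the vectors into k groups and return labels and centroids."""
    clusterer = KMeansClusterer(
        k,
        distance=cosine_distance,
        repeats=5,                  # restart several times and keep the best run
        avoid_empty_clusters=True,
    )
    labels = clusterer.cluster(vectors, assign_clusters=True)
    return k, labels, clusterer.means()

# One delayed task per candidate k, computed in parallel by Dask.
tasks = [dask.delayed(cluster_with_k)(list(X_reduced), k) for k in range(2, 11)]
results = dask.compute(*tasks)
```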

SSE Versus Number of Clusters

Similarity Matrix
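Continuing the clustering sketch above, curves like the ones in the figures could be produced roughly as follows; X_reduced and results are taken from the previous sketch, and Plotly could be used in place of Matplotlib.

```python
# Sketch of the unsupervised validation: SSE and average silhouette coefficient
# plotted against the number of clusters. X_reduced and results come from the
# clustering sketch above.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score

def sse(vectors, labels, centroids):
    """Sum of squared distances between each vector and its cluster centroid."""
    return sum(
        np.sum((vector - centroids[label]) ** 2)
        for vector, label in zip(vectors, labels)
    )

ks = [k for k, _, _ in results]
sse_values = [sse(X_reduced, labels, centroids) for _, labels, centroids in results]
silhouettes = [silhouette_score(X_reduced, labels, metric="cosine") for _, labels, _ in results]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, sse_values, marker="o")
ax1.set(xlabel="number of clusters k", ylabel="SSE")
ax2.plot(ks, silhouettes, marker="o")
ax2.set(xlabel="number of clusters k", ylabel="average silhouette coefficient")
plt.show()
```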

In summary, both the supervised and the unsupervised validation indicated that higher values of k in K-means resulted in purer, more cohesive clusters but also split thematically related texts across different clusters. In contrast, lower values of k led to higher recall and prevented the separation of thematically similar texts, but produced clusters containing texts on different topics.

From today’s perspective, I would combine the bag-of-words-based approach applied in this project with dense contextual representations obtained from encoders or decoders (compare the tutorial Combining Semantic & Keyword Search by Google).
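A minimal sketch of this hybrid idea, assuming a sentence-transformers encoder; the model name and the equal weighting of both scores are illustrative choices, not part of the original project.

```python
# Sketch of combining a keyword-based (TF-IDF) score with a dense embedding score.
# The model name, the toy data, and the 0.5/0.5 weighting are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["a particle drifts in the water", "a bird sings in the garden"]
query = ["water particles"]

# Sparse, keyword-based similarity
tfidf = TfidfVectorizer().fit(documents)
sparse_scores = cosine_similarity(tfidf.transform(query), tfidf.transform(documents))[0]

# Dense, contextual similarity from a pretrained encoder
encoder = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = cosine_similarity(encoder.encode(query), encoder.encode(documents))[0]

# Weighted combination of both signals
hybrid_scores = 0.5 * sparse_scores + 0.5 * dense_scores
```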

 

References

 

– Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
– Tan, P.-N., Steinbach, M., & Kumar, V. (2014). Introduction to Data Mining. Pearson.

 
