Forming Dataset of The Undergraduate Thesis using Simple Clustering Methods

  • Chinta 'Aliyyah Candramaya, Universitas Jember
  • Vandha Pradwiyasma Widharta, Pukyong National University
Keywords: Document Clustering, Text Mining, Relevant Term, Information Retrieval, Topic Identification


Each university collects a large amount of undergraduate thesis data but has yet to process it in a way that makes it easy for students to find the references they need. This study aims to group thesis documents and to compare the grouping made by experts with the grouping produced by simple clustering methods. Experts established the ground truth using OR Boolean retrieval and keyword generation. The clustering experiments used the K-Means, K-Medoids, and DBSCAN methods with the Euclidean, Manhattan, City Block, and Cosine Similarity metrics. The clustering with the best Silhouette Score was then compared with the ground truth to measure how accurately each document was categorized. The K-Means method with the Cosine Similarity metric gave the best result, with a Silhouette Score of 0.105534. The comparison between the ground truth and this best clustering shows an accuracy of 33.42%. The results indicate that simple clustering methods cannot handle data with negative skewness and leptokurtic kurtosis.
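
To make the evaluation criterion concrete, the following is a minimal sketch of how a Silhouette Score under a cosine distance can be computed. It is not the authors' pipeline: the toy vectors in `docs`, the label lists, and the helper names `cosine_distance` and `silhouette_score` are assumptions introduced for illustration only. The score ranges from -1 to 1, and a value near 0, such as the 0.105534 reported above, suggests clusters that overlap heavily.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def silhouette_score(vectors, labels, distance=cosine_distance):
    """Mean silhouette coefficient over all points (at least two clusters)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy term-weight vectors (assumption, not data from the study).
docs = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]

good = silhouette_score(docs, [0, 0, 1, 1])  # similar documents grouped together
bad = silhouette_score(docs, [0, 1, 0, 1])   # similar documents split apart
print(f"well-separated: {good:.3f}, poorly separated: {bad:.3f}")
```

On this toy data the grouping that keeps similar vectors together scores close to 1, while the grouping that splits them scores below 0, which is the sense in which a higher Silhouette Score identifies the better clustering.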


How to Cite
Dharmawan, T., Candramaya, C., & Widharta, V. (2023, January 31). Forming Dataset of The Undergraduate Thesis using Simple Clustering Methods. International Journal of Innovation in Enterprise System, 7(01), 31-40.
Information and Computational Engineering