Evaluation of TF-IDF Algorithm Weighting Scheme in The Qur'an Translation Clustering with K-Means Algorithm

The Qur'an translation index issued by the Ministry of Religion can be used in text mining to search for similar patterns in the translation of the Qur'an. This study groups sentences using the K-Means Clustering algorithm and three weighting scheme models of the TF-IDF algorithm to determine which model performs best. Of the three models, the traditional TF-IDF weighting scheme gave the highest percentage, 62.16%, with an average percentage of 36.12% and a standard deviation of 12.77%. The lowest result came from the TF-IDF normalization 1 weighting scheme, 48.65%, with an average percentage of 25.65% and a standard deviation of 10.16%. The smallest standard deviation, 8.27%, came from the TF-IDF normalization 2 weighting scheme, with an average percentage of 28.15% and a highest percentage of 48.65%, the same as for TF-IDF normalization 1.


Introduction
The Qur'an is the holy book of Islam and serves as a guide to life for Muslims. It has a unique structure that can be divided into two parts, three parts, five parts, seven parts, and so on. The Qur'an consists of 114 surahs, each made up of several verses, for a total of 6236 verses. It is also divided into 30 sections called juz, and each juz is divided into several ruku. The number of ruku in a surah varies with the number of verses in the surah and the length of each verse [1]. These groupings aim to make the Qur'an easier to memorize, learn, and study. A single surah can contain various themes, and a given theme can appear in several surahs. To learn and understand the Qur'an more easily, readers can use a translation in a language they understand.
The Indonesian translation of the Qur'an issued by the Ministry of Religion of the Republic of Indonesia is the main reference in Indonesia, although there are several versions of the Indonesian translation of the Qur'an carried out by various social organizations. The translation of the Qur'an in Indonesian is an interesting object for computer scientists to demonstrate the knowledge, wisdom and law of the verses of the Qur'an in a computer system. Understanding the meaning of the verses of the Qur'an can be done by reading interpretations written by interpreters. However, this has not been enough to give a complete picture of the meaning contained in the Qur'an. For us to get a complete picture of the various themes in the Qur'an, we must read and understand all parts of the Qur'an.
In the field of computing, the unique structures of the Qur'an are very interesting to study. One way to study them is with text mining. Various text mining methods can be used to group data, one of which is clustering. Text clustering is an important part of text mining. It divides a collection of texts into subsets called clusters, such that the texts in one cluster are more similar to each other than to the texts in other clusters. Clustering is particularly useful for organizing documents to improve information retrieval and to support browsing [2].
The quality of clustering depends heavily on removing noise from the patterns used in the clustering process. Pre-processing is therefore needed, such as separating documents into words (tokenization), removing words that appear often but are not relevant (stopword removal), and reducing words to their base forms (stemming). Each word is then represented by a weight based on how often it appears, namely TF-IDF. TF-IDF works well for weighting but has many limitations, and many researchers have proposed modifications to TF-IDF to improve its performance [3][4][5].

Related Work
The study entitled Application of the Cosine Similarity Algorithm in Text Mining Translated the Qur'an Based on Topic Linkages [6] searched for similarities between texts in the translation of the Qur'an. The similarity groups formed were then compared with the Qur'an index compiled by the Ministry of Religion of the Republic of Indonesia. The comparison showed that the groups generated with Cosine Similarity matched the Ministry of Religion's Qur'an index at a rate of 46.42%. This result was considered less than optimal, so the present research is expected to provide better results. The Qur'an index of the Ministry of Religion was compiled by expert commentators and is institutionally recognized in Indonesia as a valid Qur'an index. Chapters in the index are arranged based on similarities between verses of the Qur'an, so the index can serve as a reference for testing text similarity results produced by text mining. One way to improve the performance of a text mining algorithm is normalization, which can be done by choosing the right weighting scheme to increase effectiveness [3]. The TF-IDF normalization in the study entitled "Modified TF-IDF Term Weighting Strategies for Text Categorization" by Rajendra Kumar Roul et al. succeeded in processing text well, but the processing was still too simple, so many details of words that were actually more meaningful were neglected [7]. TF-IDF can also be optimized with the maximum TF-IDF method, a normalization in which each frequency value is divided by the count of the most frequent word. This was done so that the K-Nearest Neighbor algorithm gives the best results [8].
S. Albitar et al. proposed a new measure for assessing semantic similarity between texts based on TF/IDF, with a new function that aggregates semantic similarities between the concepts representing the compared text documents pair-to-pair using a semantic similarity matrix. Their experimental results show that the measure outperforms other semantic and classical measures, with significant improvements in the concept space [4].
Calho, H. proposed using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms given a reduced bootstrapping set of examples labeled for each region [5]. These studies show that TF-IDF normalization can improve the quality of results. Therefore, this research normalizes TF-IDF in the hope of obtaining better results.

Research Method
The method used in this research is a literature study. In a literature study, the literature search serves not only as an initial step in preparing the research framework but also as a way of using library resources to obtain research data [9]. In this type of research, researchers do not have to go into the field and meet respondents, because the data needed can be obtained from library sources or documents. The data source used in this study is a digital translation of the Qur'an compiled into a dataset in the required format.
The data that will be used in this study are the text of the translation of the Qur'an in Indonesian and the index of the Qur'an issued by the Ministry of Religion of the Republic of Indonesia contained in the Al-Fatih Manuscripts. The following are the steps for the research:

Preprocessing
Preprocessing in this study consists of lemmatizing only; stop words are not removed. Lemmatizing returns a word to its root form. Stop words were left in place because removing them would change the actual meaning of the translation of the Qur'an, which would run contrary to the truth value of a verse of the Qur'an.
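A minimal sketch of this preprocessing step is given below. It assumes a small lookup table of lemmas (`LEMMAS`, hypothetical and written in English for readability) in place of a real Indonesian lemmatizer, which the text does not name. The point is only that every token, stop words included, survives the step.

```python
import re

# Hypothetical lemma table; a real system would use a full Indonesian lemmatizer.
LEMMAS = {"believers": "believer", "gave": "give", "prayers": "prayer"}

def preprocess(text):
    """Lowercase and tokenize the text, then map each token to its lemma.

    Stop words such as "the" and "and" are deliberately kept.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens]

tokens = preprocess("The believers gave charity and prayers.")
print(tokens)
```

Tokens that are missing from the table pass through unchanged, so the sketch never drops a word.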

Similarity between the translation of verses of the Qur'an
The similarity data between verses used in this study come from the data processing carried out in previous research [6], which produced 6136 similarity groups. For this research, we take the 15 similarity groups with the most verses, which represent the number of chapters in the Qur'an index, and evaluate them against the Qur'an index. The similarity groups of the verses are shown in the corresponding figure. The next step is to process each similarity group with the three TF-IDF models, followed by clustering with the K-Means Clustering algorithm. Three clusters are formed for each similarity group.

Normalization TF-IDF
In information retrieval, TF-IDF (Term Frequency-Inverse Document Frequency) is the product of two statistics, Term Frequency (TF) and Inverse Document Frequency (IDF), and shows how important a word is to a document in a corpus [10]. There are many ways to determine the two statistics. The TF-IDF value is often used as a weighting factor in information retrieval, text mining, and user modeling. It increases in proportion to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word. TF-IDF is one of the most popular term weighting schemes today. Around 83% of text-based recommendation systems in digital libraries use TF-IDF [11]. Search engines often use variants of the TF-IDF weighting scheme as one of their methods for scoring and ranking the relevance of documents requested by users. TF-IDF can also be used to filter stop words in various subject areas, including text summarization and classification. In this study, three TF-IDF models are used to find the best clustering performance. The three TF-IDF models are:
1. Traditional TF-IDF. Traditional TF-IDF is the commonly used form of TF-IDF, with default settings and no additional parameters. This model works quite well but ignores many details when processing documents, such as document length and frequency distribution. The Python command used in this study to call traditional TF-IDF is:
tfidf = TfidfVectorizer()
2. TF-IDF Normalization 1. In normalization 1, the TF-IDF algorithm is given additional parameters that eliminate words appearing in more than 80% of documents (the max_df parameter) and in less than 20% of documents (the min_df parameter). These words are removed on the assumption that words appearing in more than 80% or less than 20% of documents are not important and carry no meaning in the documents being processed [12]. The Python command used in this study to call TF-IDF normalization 1 is:
tfidf = TfidfVectorizer(max_df=0.8, min_df=0.2, use_idf=True, ngram_range=(1,2), sublinear_tf=True, norm='max')
3. TF-IDF Normalization 2. TF-IDF Normalization 2 uses the same max_df and min_df parameters as normalization 1, but with different values: max_df = 0.5 and min_df = 0.1. These values are adopted from the research by Singhal, A., entitled Pivoted document length normalization [13]. The Python command to call TF-IDF normalization 2 is:
tfidf = TfidfVectorizer(max_df=0.5, min_df=0.1)
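The effect of the max_df and min_df thresholds in the three models can be illustrated with a small standard-library sketch. It is not the scikit-learn implementation: the function name `tfidf_with_df_pruning`, the toy corpus, and the textbook weight tf * log(N / df) are assumptions chosen for illustration, and the extra options of normalization 1 (n-grams, sublinear TF, norm) are left out.

```python
import math
from collections import Counter

def tfidf_with_df_pruning(docs, min_df=0.0, max_df=1.0):
    """Weight tokenized documents with textbook TF-IDF after pruning by document frequency.

    min_df and max_df are proportions of documents. A term is kept when the
    share of documents containing it lies inside [min_df, max_df].
    """
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting for this sketch: raw count times log(N / df).
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in vocab if t in tf})
    return vocab, weights

# Hypothetical toy corpus (not taken from the study's data).
docs = [
    "and the believers give charity".split(),
    "and the believers establish prayer".split(),
    "and those who give charity are rewarded".split(),
    "and the patient are rewarded".split(),
    "and the truthful establish prayer".split(),
]

vocab_traditional, weights_traditional = tfidf_with_df_pruning(docs)
vocab_norm1, _ = tfidf_with_df_pruning(docs, min_df=0.2, max_df=0.8)
vocab_norm2, _ = tfidf_with_df_pruning(docs, min_df=0.1, max_df=0.5)
print(len(vocab_traditional), len(vocab_norm1), len(vocab_norm2))
```

In this toy corpus "and" occurs in every document, so the 80% ceiling removes it, while "the" (four of five documents) is removed only by the stricter 50% ceiling. Without pruning, "and" stays in the vocabulary but receives a weight of zero, because log(N / df) is zero for a term found in every document.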

K-MEANS Clustering
Clustering is the grouping of data items into a number of groups [14]. The two main approaches are partitioning and hierarchical clustering. Partitioning clustering groups data by sorting the data items into partitions. Hierarchical clustering groups data by building a hierarchy, usually drawn as a dendrogram, in which similar data are placed in adjacent branches. One clustering algorithm is K-Means. The K-Means algorithm, first introduced by MacQueen JB in 1967, is a data mining method that models data without supervision (unsupervised) and is one of the methods for grouping data by partitioning. K-Means divides data into groups such that the data in one group share the same characteristics and differ from the data in other groups [15]. The clusters produced by K-Means depend on the initial cluster centers: different initial values can produce different groups [16]. After weighting with the three TF-IDF models, the next step is to cluster the weighted results into 3 clusters using the K-Means Clustering algorithm. This clustering is carried out for all three TF-IDF models. The Python command for clustering with the K-Means Clustering algorithm is:
clusterer = KMeans(n_clusters=3, max_iter=300, tol=0.0001)
Three clusters are used because this study uses three TF-IDF models. The max_iter value is 300, which is the maximum number of iterations of the k-means algorithm for a single run [17].

Conversion of The Qur'an Index Data
This study uses the Qur'an index of the Ministry of Religion of the Republic of Indonesia. The index data are included in the Al Fatih Manuscripts, an edition of the Qur'an with fairly complete supplementary content [18], one part of which is the Qur'an index. The editor of Al Fatih was willing to provide a soft copy of the Qur'an index data in text format, as shown in Figure 3. The Qur'an index dataset was put into the desired format and then processed to eliminate duplicate verses within the same chapter. This duplication occurs because each chapter of the Qur'an index has several sub-chapters, so a verse is quite likely to belong to more than one sub-chapter of the same chapter. Table 2 shows the number of verses in the Qur'an index, both with and without duplicates:
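The duplicate-removal step described above can be sketched as follows. The nested layout (chapter, then sub-chapter, then verse ids), the function name, and all sample values are hypothetical placeholders, since the actual file format of the index is not given in the text.

```python
def unique_verses_per_chapter(index):
    """Collapse each chapter's sub-chapters into one duplicate-free verse list.

    index maps chapter name -> sub-chapter name -> list of verse ids.
    A verse is removed only when it repeats inside the same chapter.
    """
    result = {}
    for chapter, sub_chapters in index.items():
        seen = set()
        unique = []
        for verses in sub_chapters.values():
            for verse in verses:
                if verse not in seen:
                    seen.add(verse)
                    unique.append(verse)  # first occurrence keeps its position
        result[chapter] = unique
    return result

# Placeholder ids in "surah:verse" form; they do not reflect the real index.
index = {
    "Chapter A": {"A.1": ["1:1", "1:2", "2:5"], "A.2": ["2:5", "3:7"]},
    "Chapter B": {"B.1": ["2:5"]},
}
unique = unique_verses_per_chapter(index)
with_duplicates = {c: sum(len(v) for v in subs.values()) for c, subs in index.items()}
print(with_duplicates, {c: len(v) for c, v in unique.items()})
```

A verse that appears in two different chapters is kept in both, because the de-duplication is applied per chapter.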

Experiment Description
Documents can be grouped in various ways so that they are easier to understand and study. In this study, the translation of the Qur'an is grouped with several methods to find the best performance. In previous research, the Indonesian translation of the Qur'an was grouped based on the degree of similarity between verses [6], using the Cosine Similarity algorithm. The resulting groups of translated verses were then tested for agreement with the Qur'an index. The Qur'an index is a way of presenting information about where a particular theme or verse occurs, and it is intended to make the Qur'an easier to understand [19]. That research found that, with a similarity between verses of 20%, the match level between the similarity groups and the index is 46.42%. The match level decreases as the similarity level is raised, because the number of similar verses falls as the required similarity between the translations of the verses increases.
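As an illustration of the cosine-similarity grouping used in the earlier study, the sketch below compares verses as bags of words and collects those that reach a 20% threshold. The use of raw word counts, the function names, and the sample sentences are assumptions; the text does not state how the earlier study built its vectors.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_group(query_id, verses, threshold=0.20):
    """Ids of all other verses whose similarity to the query meets the threshold."""
    query = verses[query_id]
    return [
        verse_id
        for verse_id, tokens in verses.items()
        if verse_id != query_id and cosine_similarity(query, tokens) >= threshold
    ]

# Hypothetical tokenized sentences keyed by placeholder ids.
verses = {
    "v1": "establish prayer and give charity".split(),
    "v2": "establish prayer and be patient".split(),
    "v3": "the ships sail upon the sea".split(),
}
score = cosine_similarity(verses["v1"], verses["v2"])
group = similar_group("v1", verses)
print(round(score, 2), group)
```

Raising the threshold can only shrink each group, which is consistent with the observation that fewer similar verses remain at higher similarity levels.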
In this study, the similarity groups between verses produced in the preliminary research are processed further with other methods to compare their performance. The K-Means clustering parameters in the program code given earlier are taken from the scikit-learn library. The parameter max_iter = 300 is the maximum number of iterations of the k-means algorithm for a single run. The parameter tol = 0.0001 is the relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence [12].
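To make the roles of max_iter and the tolerance concrete, here is a plain-Python sketch of the basic k-means loop. It is a simplification and not the scikit-learn code: the tolerance is applied as an absolute threshold on the squared movement of the centers, and the initial centers are passed in explicitly (an assumption made so the run is reproducible).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, init_centers, max_iter=300, tol=1e-4):
    """Basic k-means: alternate assignment and update steps until the centers settle."""
    centers = [list(c) for c in init_centers]
    k = len(centers)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        shift = sum(sq_dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # centers barely moved: treat as converged
            break
    return labels, centers

# Hypothetical 2-D points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
labels, centers = kmeans(points, init_centers=[points[0], points[3], points[6]])
print(labels)
```

Because the result depends on the starting centers, a different choice of `init_centers` can produce a different partition of the same points.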

Traditional TF-IDF and Clustering process
The traditional TF-IDF used in the first model gave quite good results, although it ignores many details when processing documents, such as the frequency with which a term appears and the frequency distribution. The program fragment repeats this process 15 times, once for each similarity group that is a candidate for a chapter in the Qur'an index. Each similarity group produces a set of weights stored as a sparse matrix. The results of the traditional TF-IDF clustering show that the number of verses in each cluster tends to be uneven for every similarity group: some clusters contain very many verses and others very few.
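The loop described above can be sketched with scikit-learn as follows. This is an illustration rather than the study's program: the sample sentences are hypothetical, only two groups are used instead of 15, and `n_init=10` and `random_state=0` are assumptions added so the run is repeatable.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_groups(groups, n_clusters=3):
    """Weight each similarity group with default TF-IDF, then cluster it with K-Means."""
    results = []
    for verses in groups:  # one pass per similarity group
        tfidf = TfidfVectorizer()
        weights = tfidf.fit_transform(verses)  # SciPy sparse matrix, one row per verse
        clusterer = KMeans(n_clusters=n_clusters, max_iter=300, tol=0.0001,
                           n_init=10, random_state=0)  # n_init, random_state: assumed
        labels = clusterer.fit_predict(weights)
        sizes = Counter(labels.tolist())  # number of verses per cluster
        results.append((weights, labels, sizes))
    return results

# Hypothetical similarity groups (the study processes 15 of them).
groups = [
    ["give charity to the poor", "establish prayer and give charity",
     "be patient in hardship", "the patient are rewarded"],
    ["the ships sail upon the sea", "rain falls from the sky",
     "the earth brings forth fruit"],
]
results = cluster_groups(groups)
for weights, labels, sizes in results:
    print(weights.shape, dict(sizes))
```

Inspecting `sizes` for each group is one simple way to see how evenly the verses are spread across the three clusters.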

TF-IDF Normalization 1 and Clustering process
In normalization 1, the TF-IDF algorithm is given additional parameters that eliminate words appearing in more than 80% of documents (the max_df parameter) and in less than 20% of documents (the min_df parameter). These words are removed on the assumption that they are not important and carry no meaning in the documents being processed [12]. The program command for TF-IDF normalization 1 replaces the traditional TF-IDF command. The results of normalization 1 TF-IDF clustering show that the number of verses in each cluster tends to be more balanced for every similarity group than with traditional TF-IDF. Some clusters still contain very many verses and others very few, but the distribution is relatively more balanced.

TF-IDF Normalization 2 and Clustering process
TF-IDF Normalization 2 uses the same max_df and min_df parameters as normalization 1, but with different values: max_df = 0.5 and min_df = 0.1. These values are adopted from the research by Singhal, A., entitled Pivoted document length normalization [20]. The program command for TF-IDF normalization 2 replaces the traditional TF-IDF command. The results of TF-IDF normalization 2 clustering show that the number of verses in each cluster tends to be more balanced and evenly distributed for every similarity group than with traditional TF-IDF. When the two normalizations are compared, however, TF-IDF Normalization 1 gives better results.

Testing of Clustering Results on Chapter Al Qur'an
At this stage, the clustering results from the previous stage are tested against each chapter in the Qur'an index. Each cluster in each similarity group generated by each TF-IDF model is tested against each chapter of the index. The test results show the percentage of the number of verses for each TF-IDF model. The percentage results for traditional TF-IDF are shown in Table 3. The results of cluster testing for the normalization 1 TF-IDF algorithm, presented in Table 4, show that the largest percentage of cluster verses found in a chapter of the Qur'an index is 48.65%, in chapter 12 on Country and Society. The smallest percentage is produced in chapter 4 on branches of science. These results are identical to the results of traditional TF-IDF processing. Of the 15 similarity groups, 6 give the best results, with the highest number of verses in a Qur'an index chapter, namely similarity groups 2, 3, 6, 7, 9, and 14. The average percentage is 25.65% with a standard deviation of 10.16%. The next step is the clustering test for TF-IDF normalization 2, as shown in Table 5.
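The testing step can be sketched as below. The exact definition of the percentage is not spelled out in the text, so the sketch assumes it is the share of a cluster's verses that also appear in an index chapter. The use of the sample standard deviation is likewise an assumption, and all names and values are hypothetical.

```python
from statistics import mean, stdev

def match_percentage(cluster_verses, chapter_verses):
    """Percentage of a cluster's verses that also occur in one index chapter (assumed definition)."""
    cluster = set(cluster_verses)
    if not cluster:
        return 0.0
    return 100.0 * len(cluster & set(chapter_verses)) / len(cluster)

def best_match(clusters, index):
    """Highest match percentage over all (cluster, chapter) pairs, with its chapter name."""
    return max(
        (match_percentage(cluster, verses), chapter)
        for cluster in clusters
        for chapter, verses in index.items()
    )

# Placeholder verse ids and chapters for one similarity group.
clusters = [["a", "b", "c", "d"], ["e", "f"], ["g"]]
index = {"Chapter A": ["a", "b", "c"], "Chapter B": ["e", "y"]}
best = best_match(clusters, index)

# Hypothetical best percentages from three similarity groups, summarized.
best_per_group = [75.0, 50.0, 25.0]
average = mean(best_per_group)
deviation = stdev(best_per_group)
print(best, average, deviation)
```

Under this definition a very small cluster can reach a high percentage with only a few matching verses, which is worth keeping in mind when comparing models whose cluster sizes differ.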

Summary and Conclusions
Based on the research carried out, it can be concluded that the evaluation of TF-IDF weighting schemes in clustering the translation of the Qur'an with the K-Means Clustering algorithm was completed successfully. The groups of verses formed by similarity level in the previous study were processed again with the three TF-IDF weighting scheme models and then clustered with the K-Means Clustering algorithm. The results of clustering the 15 similarity groups show that the number of verses in each cluster and similarity group is relatively uneven: some clusters tend to have a relatively large number of verses, while others tend to have few. However, the clusters produced with TF-IDF normalization 1 are relatively more evenly distributed in their number of verses. The clustering results were then matched with the Qur'an index to find which group of verses best fits a theme in the index. Of the three TF-IDF weighting scheme models, the traditional TF-IDF weighting scheme gave the highest percentage, 62.16%, with an average percentage of 36.12% and a standard deviation of 12.77%. The time needed to process this model is 98.09 seconds.
The lowest result came from the normalization 1 TF-IDF weighting scheme, 48.65%, with an average percentage of 25.65% and a standard deviation of 10.16%. The time needed to process this model is 20.69 seconds. The smallest standard deviation, 8.27%, came from the normalization 2 TF-IDF weighting scheme, with an average percentage of 28.15% and a highest percentage of 48.65%, the same as for normalization 1. The time needed to process this model is 14.30 seconds. A more complete comparison of these results is presented in Table 6 and Figure 3 below: