Building Related Words in Indonesian and English Translation of Al-Qur’an Vocabulary Based on Distributional Similarity

Abstract: The Qur'an is the Muslim holy book and the primary source of knowledge and guidance. It consists of 114 surahs and 30 juz, and contains approximately 6200 verses. Finding and summarizing connections or similarities between words in the Qur'an manually takes a long time. There is therefore a need for a dictionary, encyclopedia, or thesaurus of Qur'anic vocabulary in which each word entry is linked to related words. This study examines the interrelations and semantic similarities between words in the Qur'an, with the aim of supporting searches for related words. The approach taken is distributional similarity, which is an important part of word embedding. Word relatedness is measured by semantic similarity, one of the tasks studied in Natural Language Processing (NLP). Semantic similarity measures the closeness of word vectors using cosine similarity. Words are converted to vectors with the FastText algorithm, which is a development of the Word2vec algorithm. The dataset used is the translation of the Qur'an in English and Indonesian. A word is given as input to the system, which then produces a score representing the relatedness between words. The system output is evaluated using the Pearson correlation against a gold standard. The FastText algorithm yields a correlation of 0.3398 for the Indonesian translation corpus and 0.2326 for the English translation corpus.


INTRODUCTION
The Qur'an is the holy book of Islam, which serves as the primary source of knowledge, law, wisdom, and guidance for Muslims. The Qur'an consists of 114 surahs, 30 juz, and 6217 verses according to the narration of Abl Medina, 6210 verses according to al-Dani's narration, or 6214 verses according to Warsy's narration [1]. The Qur'an contains a great deal of information, and words with related meanings are scattered throughout it. One way to understand the Qur'an is to explain the content of its verses from various aspects, paying attention to the sequence of the verses as stated in it [2]. Finding similarities and links between words also helps to explain the contents of the Qur'anic verses.
Semantic relatedness between words belongs to one of the areas of Natural Language Processing (NLP), namely semantic similarity. This field deals with measuring the similarity of two words as represented by the similarity between the concepts they denote. The idea of semantic similarity is to identify concepts that have the same 'characteristics'. Semantic similarity is understood as the level of taxonomic closeness between concepts (or terms, words). In other words, semantic similarity states how taxonomically close two concepts (or terms, words) are, because they share several aspects of their meaning. Technically, similarity measures assign numerical scores that quantify this closeness as a function of the semantic evidence observed in one or several sources of knowledge [3]. For example, given "paradise" as the first input word and "hereafter" as the second, the system should produce a high similarity value. A human reader would interpret both words as referring to a place of life after worldly life. Research on semantic similarity continues to be carried out with various methods, some of which are Word2vec, Global Vector, and Support Vector Machine (SVM).
A previous study on distributional similarity measured the relatedness of words in Arabic using a vector-based approach. The system built in that study produces a set of words related to a given word using the Word2vec model. The evaluation was carried out by calculating precision based on corrections made by linguists to the system output [4]. Word2vec is known to ignore morphology, so it cannot create word vectors for new words that do not appear in the training data. Because the morphological features of words are ignored, vectors for new words cannot be obtained by comparing them with morphologically similar words [5].
In this study, a system was built to calculate the semantic similarity value of two input words. We use the distributional similarity approach to capture the semantic similarity of words and to form groups of similar words. This research uses the Qur'an corpus in English and Indonesian translations, as a complement to previous research. The construction of this system requires data in the form of the words contained in the Qur'an. The model used in this study is FastText, which is a development of the Word2vec model. Each word in FastText is modeled by several vectors, each of which represents an n-gram. This approach is considered very useful for rare words and can handle out-of-vocabulary (OOV) words [6]. FastText is widely used in many studies, including text classification, sentiment analysis, and semantic similarity. These factors are the reason this approach was chosen for this research. The system is expected to perform well as measured by the calculated correlation value. The system is evaluated by calculating the correlation of its output with WordNet, which is a combination and expansion of dictionaries and thesauri. We then analyze the factors that can influence the correlation between the FastText results and the gold standard.

DISTRIBUTIONAL SIMILARITY
The distributional similarity approach is used to model languages and represent naturally occurring texts. It is a statistical model that uses the distribution of words together with their contexts to determine the level of semantic similarity between terms. This model describes words with context vectors built on the distributional hypothesis, which states that similar words appear in the same contexts. The method uses semantic distributions to construct word-context matrices that represent the distribution of words across contexts, and to transform the text into vector space model (VSM) representations based on semantic word similarities. Distributional similarity measures are used to capture the semantic similarities of words and to form groups of similar words [7].

SEMANTIC SIMILARITY
Semantic similarity is a method for measuring and expressing the similarity between words based on their meaning. It is used to estimate or calculate semantic proximity, or the relationship between various constructs in language and concepts, based on numerical descriptions. In general, the semantic similarity between two words can be assessed using ontologies, which are used to define relationships between terms [8]. A semantic similarity measure simply takes two terms (concepts or words) as input and returns a numerical score that quantifies how similar they are. Semantic similarity considers all types of semantic relations between terms [9].

WORD EMBEDDINGS
Word Embeddings are vector space models (VSM) that represent words as vectors in continuous space capturing many syntactic and semantic relationships between terms [10]. Word embeddings recently gained great popularity for modeling words in various Natural Language Processing (NLP) tasks including measurement of semantic similarities.
Word embeddings represent a new branch of corpus-based distributional semantic models that use neural networks to model the contexts in which a word is expected to appear. Because of their ability to capture syntactic and semantic information, word embeddings have been successfully applied to various NLP tasks, such as Word Sense Disambiguation, Machine Translation, Similarity of Relationships, Semantic Relatedness, and Knowledge Representation [11].

FASTTEXT
FastText is a new model for word embeddings that can capture word senses, sub-word structure, and information uncertainty. FastText models words with several vectors, where vectors represent n-grams. FastText produces accurate representations of rare words, misspellings, even unknown words [6]. This model is a well-known algorithm that creates word vectors for out-of-vocabulary (OOV) words. FastText learns morphological features using subwords, and a word vector can be produced even for words that do not exist in the dictionary [5].
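The subword idea described above can be illustrated with a short Python sketch. It only shows how a word is decomposed into character n-grams; it does not train any vectors. The boundary markers `<` and `>` and the n-gram range of 3 to 6 follow common FastText defaults and are assumptions here, since the study does not report them. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, FastText-style.

    The word is wrapped in '<' and '>' boundary markers first, so that
    prefixes and suffixes can be told apart from word-internal n-grams.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Two morphologically related words share many subwords, which is why an
# unseen word can still receive a vector built from known n-grams.
shared = set(char_ngrams("verse")) & set(char_ngrams("verses"))
print(sorted(shared))
```

Because a word vector is composed from the vectors of its n-grams, a word that never occurred in training can still be represented as long as some of its n-grams did.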

COSINE SIMILARITY
Cosine similarity is a measure of how closely a given pair of sentences is related, with a score determined by the words that overlap between the sentences [12]. Cosine similarity can also be defined as the cosine of the angle between two vectors. This allows documents with the same composition to be treated identically, which makes it the most popular measure for text documents. When the vectors have unit length, the cosine of the angle between two words is simply the dot product of the two vectors. The cosine similarity is calculated with the equation below [13].
$$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$
Here $\vec{a}$ and $\vec{b}$ denote the two vectors being compared and $\theta$ is the angle between them.
To calculate the cosine similarity between two sentences, each sentence is first converted into terms (words). The words are then transformed into a vector, where each word in the text defines a dimension in Euclidean space and the frequency of each word is the value in that dimension [12].
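The cosine computation described above can be sketched in plain Python as follows. The vectors used are hypothetical three-dimensional values chosen for illustration, not output of the trained model, and the function name `cosine_similarity` is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional word vectors, for illustration only.
paradise = [0.8, 0.1, 0.3]
hereafter = [0.7, 0.2, 0.4]
print(round(cosine_similarity(paradise, hereafter), 4))
```

The score depends only on the direction of the vectors, not on their length, so two vectors that point the same way score 1 regardless of magnitude.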

GOLD STANDARD
A gold standard is a reference used to evaluate the performance of a computerized system. It serves as a point of comparison for other outputs of the same kind, and the comparison can be made by calculating the correlation between the two. A gold standard is often described as a high-quality data set annotated by humans [14]. In this study, WordNet was used as the gold standard reference. WordNet is a lexical database and is considered the most extensive electronic dictionary. WordNet was developed by expert lexicographers whose results were compiled into a lexical database. Because WordNet is created manually, it requires many resources, such as language experts and time, and is therefore of high quality [15].

PEARSON CORRELATION
The Pearson correlation is a general measure of the linear relationship between two continuous variables. It is defined as the ratio of the covariance of the two variables to the product of their standard deviations. Pearson correlation coefficients range from -1 to +1 [16]. The correlation is calculated using the equation below.
$$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left(N \sum X^2 - \left(\sum X\right)^2\right)\left(N \sum Y^2 - \left(\sum Y\right)^2\right)}}$$
In this study, N is the number of word pairs, X is the system score, and Y is the gold standard score.
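As a minimal sketch, the Pearson correlation between system scores and gold standard scores can be computed in Python as shown here. The function name `pearson` and the sample scores are hypothetical and are not results from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between system scores x and gold scores y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0.0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator


# Made-up scores for three word pairs, for illustration only.
system_scores = [0.91, 0.40, 0.15]
gold_scores = [0.80, 0.55, 0.10]
print(round(pearson(system_scores, gold_scores), 4))
```

A value near +1 means the system ranks word pairs much as the gold standard does, a value near 0 means there is no linear relationship, and a value near -1 means the rankings are reversed.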

A. System Overview
The system built in this study is a system that can calculate the semantic similarity of the pair of words in the corpus of the English and Indonesian translation of the Qur'an. Semantic similarity values are obtained based on the implementation of the FastText method. Evaluation of the results of the system is done by calculating the correlation as a benchmark of similarity to the gold standard value. In general, the system is illustrated in the picture below.

B. Corpus Dataset
The corpus dataset used in this study is the corpus of the English and Indonesian translation of the Qur'an, which is obtained from an online site, qurandatabase.org. This site provides translations of the Qur'an in various languages. The dataset used is still in the form of a complete narrative, so it needs to be further processed.

C. Preprocessing
Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary step in data mining, data preprocessing transforms the data into a format that can be processed more efficiently and effectively for the user. The order of preprocessing can be seen in the image below. The initial step is case folding, which converts the entire text of the document into a standard form, namely lowercase.

• Tokenizing
Tokenizing is a step that splits longer strings of text into smaller pieces or tokens. Larger chunks of text can be tokenized into sentences. Sentences can be tokenized into words.

• Filtering
The third step extracts the important words from the token results. This stage uses a stoplist to remove words that are not descriptive, such as "which", "and", "in", "from", etc.

• Stemming
The final step of preprocessing is stemming, which is the process of removing affixes such as prefixes and suffixes. Stemming is not carried out in this study, because the system output showed it to be unsuitable for the corpus used.
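The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the stoplist shown is a small made-up sample (the study's actual stoplist is not given), the regular expressions are one possible way to tokenize Latin-script text, and the name `preprocess` is hypothetical.

```python
import re

# Illustrative stoplist only; the study's actual stoplist is not given.
STOPLIST = {"which", "and", "in", "from", "the", "of", "to", "is"}


def preprocess(text, stoplist=STOPLIST):
    """Case-fold, tokenize into sentences and words, then drop stoplist words."""
    folded = text.lower()                     # case folding
    sentences = re.split(r"[.!?]+", folded)   # sentence tokenizing
    result = []
    for sentence in sentences:
        # Word tokenizing; keeps internal hyphens and apostrophes,
        # e.g. "ayat-ayat" or "qur'an".
        tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence)
        kept = [t for t in tokens if t not in stoplist]  # filtering
        if kept:
            result.append(kept)
    return result  # stemming is skipped, as in the study


print(preprocess("This is the Book. Guidance from the Lord!"))
```

The output is a list of token lists, one per sentence, which is the input shape that word embedding libraries such as gensim typically expect.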

D. FastText
The FastText training stage aims to produce a model of the corpus in vector form. This stage takes the preprocessed data from the previous stage and represents it as vectors using FastText. In this study, we use the gensim library to run FastText. Gensim is billed as a Natural Language Processing package that does "Topic Modeling for Humans", but in practice it is much more than that: it is a leading package for processing texts, working with word vector models (such as Word2Vec and FastText), and building topic models. The model built in the training process has parameters that affect the resulting semantic similarity between words. Below are the parameters used in this study.
• Embedding size: this parameter is used to determine the dimensionality of a vector, where the dimension of the word vector must be an integer.
• Window size: this parameter determines the maximum distance between the target word and the words around the target word.
• Minimum frequency: this parameter determines the minimum frequency a word must have in the corpus to be included in training.
• Down sampling: this parameter sets the threshold at which frequently occurring words are randomly down-sampled.
• Iteration: the number of training passes over the corpus. The number of iterations used is 100.
The training produces a vector for each word in the corpus data. FastText can associate more than one vector with a word, because a word appears in different contexts across sentences. An example of the system output for the input "verse" can be seen in Figure 3, and for the Indonesian translation corpus with the input word "ayat" in Figure 4. For example, with an embedding size of 60, the system displays a vector of 60 values for the word, matching the embedding size specified for the training model.
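The parameters listed above map onto gensim keyword arguments roughly as sketched here, assuming gensim 4.x (older releases use `size` and `iter` instead of `vector_size` and `epochs`). The embedding size, window size, and iteration count follow the values reported in this study. The minimum frequency and down-sampling values are not reported, so the ones shown are assumptions. The helper `train_fasttext` is hypothetical.

```python
# Hyperparameters expressed with gensim 4.x keyword names.
# vector_size, window and epochs follow the values reported in the study;
# min_count and sample are NOT reported, so the values here are assumptions.
FASTTEXT_PARAMS = {
    "vector_size": 100,  # embedding size
    "window": 5,         # window size (the study also tests 7 and 10)
    "min_count": 1,      # minimum frequency (assumed)
    "sample": 1e-3,      # down-sampling threshold (assumed, gensim default)
    "epochs": 100,       # iterations
}


def train_fasttext(sentences, **overrides):
    """Train a FastText model on tokenized sentences (lists of str)."""
    from gensim.models import FastText  # imported lazily; requires gensim

    params = {**FASTTEXT_PARAMS, **overrides}
    return FastText(sentences=sentences, **params)
```

With a tokenized corpus in hand, a call such as `model = train_fasttext(corpus, window=7)` would train one configuration. The related words and pairwise scores can then be queried with `model.wv.most_similar("verse", topn=10)` and `model.wv.similarity("paradise", "hereafter")`.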

E. Semantic Similarity Calculation
After the vector for each word in the translation of the verses of the Qur'an has been calculated, semantic similarity is computed using cosine similarity. The system can produce several words that are judged to be the most similar to the input word. Examples of system output can be seen in Figure 5 and Figure 6. The system can also measure the similarity value for a given pair of words in order to test their relatedness. The higher the similarity produced by the system, the more similar the pair of input words is judged to be. The testing can be seen in Table 1. Figures 7 and 8 visualize the words related to the input words, represented in two dimensions. The collections of dots in Figures 7 and 8 represent the distribution of word relatedness in the Qur'an based on the vector values generated by FastText.
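Retrieving the words most similar to an input word can be sketched as a ranking over the vocabulary. The sketch below normalizes every vector to unit length so that the cosine similarity reduces to a dot product. The two-dimensional vectors are hypothetical, the names `normalize` and `most_similar` are assumptions, and all vectors are assumed to be non-zero.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def most_similar(query, vectors, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    q = unit[query]
    scores = [
        (word, sum(a * b for a, b in zip(q, vec)))  # dot product of unit vectors
        for word, vec in unit.items()
        if word != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical 2-dimensional vectors, not taken from the trained model.
toy_vectors = {
    "verse": [1.0, 0.0],
    "sign": [0.9, 0.1],
    "book": [0.6, 0.8],
    "fire": [0.0, 1.0],
}
print([word for word, _ in most_similar("verse", toy_vectors, topn=3)])
```

With `topn=10`, this returns the ten related words per input word that the evaluation compares against the gold standard.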

EVALUATION
System testing in this study was conducted to assess the system output against the gold standard. System performance is measured using the Pearson correlation. The correlation is calculated by comparing the ten related words that the system outputs for each input word with the corresponding gold standard values.

A. Testing Scenario
The tests analyze the semantic similarity relationship between input word pairs. The output of the system is then compared with the synsets of the gold standard. Each word contained in the corpus of English and Indonesian translations is entered into the system, and for the ten related words in the system output, the correlation with the gold standard is calculated. The parameters used in this test are an embedding size of 100 with window sizes 5, 7, and 10 for each corpus. Examples of testing for a single input word can be seen in Figures 9 and 10, and the resulting correlation can be seen in Table 2. The primary test calculates the correlation for each word contained in the corpus of the Qur'an in English and Indonesian translations, filtered to the terms listed in the gold standard (WordNet). The registered words number 4505 for the English translation corpus and 4821 for the Indonesian translation corpus. For the final test, Pearson's correlation with the gold standard is calculated over all words.

B. Testing Result
The tests analyze the semantic similarity relationship between word pairs with FastText based on window size. Three window sizes are used in this test, namely 5, 7, and 10, and the corpus used is the English and Indonesian translation of the Qur'an. The system output for each window size is compared with the gold standard values using the Pearson correlation. The test results can be seen in Figure 11. They show that the Indonesian translation corpus yields higher correlation values than the English translation corpus. On the Indonesian translation corpus, the correlation is 0.3398 for window size 5, 0.3396 for window size 7, and 0.3327 for window size 10, which is the smallest value.
Correlation results describe the level of accuracy of the system results. We did not compare the accuracy results with previous studies, for two reasons. First, the datasets are different: previous studies used the Qur'anic corpus in Arabic language and script, whereas our research, which continues and complements that work, uses English and Indonesian translation corpora. Second, the evaluation method is different: previous studies used precision in the range 0% to 100%, based on corrections from linguists, whereas this research uses the Pearson correlation in the range -1 to +1, based on an electronic dictionary/thesaurus, namely WordNet. The previous research reported a precision of 98%, obtained by choosing only 10 sample input words whose output was then corrected by a linguist. In this research, the test was conducted for all words contained in the corpus used.
The test results in Tables 3 and 4 are the output of the correlation calculation between the ten related words in the system output and the gold standard, after which the level of correlation is determined based on the criteria. Based on Tables 3 and 4, the Indonesian translation corpus produces more words with strong correlation, while the English translation corpus produces more words with low correlation. Table 4 shows that, across window sizes, a larger window does not increase the number of words in the strong correlation category, while the number in the low correlation category always increases. No correlation is reported for the words in Table 5 because these words are not available in the gold standard. The words concerned are still in Arabic, so no correlation is calculated for them.

C. Analysis of Testing Results
The semantic similarity between words is calculated with FastText on corpora of different sizes. After preprocessing, the number of words processed in FastText is 96971 in the Indonesian translation corpus and 53804 in the English translation corpus. Tests are also carried out using a different window size for each corpus. The aim is to find out which factors influence the results obtained when FastText is used to calculate the semantic similarity between words.
The test results show that several factors can influence the performance of the system. The following factors affect the semantic similarity values calculated using FastText:
• Use of parameter values. The FastText correlation results show that window size 5 gives the highest correlation value, while window size 10 gives the lowest. The window size determines how many words can be paired with a given word. Thus, the similarity values produced by FastText can be improved by choosing an appropriate window size.
• Size of the corpus. The size of the corpus strongly influences the performance of the system output. The larger the corpus, the more vocabulary it contains, and the better the semantic similarity values produced by the system.
• The vocabulary of the corpus greatly influences the semantic similarity output for words. This is because the training process captures pairs of words that often appear together in the corpus used.
• Performance values depend on WordNet as the gold standard. WordNet cannot yet recognize entity names such as "mushrik" and "mudharat", or several other words from an Islamic context.

Conclusion
Based on the tests and analyses that have been carried out, the correlation score produced by the system is relatively low: 0.3398 for the Indonesian translation corpus and 0.2326 for the English translation corpus. In this study, we analyzed the use of parameters in FastText in order to find the best correlation results produced by the system. The best correlation is obtained using a window size of 5 with a vector dimension of 100. The window size must be adjusted to the size of the corpus. A large corpus combined with an adjusted window size affects the suitability of the word similarity values produced in the training process.
Another factor that affects the semantic similarity between words computed using FastText is the number of words appearing in the corpus, which influences the word pairs that are generated. This is based on the corpora used, where after preprocessing the number of words processed in FastText is 96971 in the Indonesian translation corpus and 53804 in the English translation corpus.
The gold standard used as the evaluation reference is also a factor that influences the correlation results. WordNet, as used in this study, cannot serve as the main reference for testing a system built on a translation corpus of the Qur'an, because it has not been able to capture several words related to the Islamic context.

Suggestion
The following recommendations might be useful for extending this study, or simply for avoiding errors when conducting similar research:
• Test with other approaches, so that better performance values can be obtained.
• Further tune FastText by exploring combinations of parameters such as window size, embedding size, minimum frequency, down sampling, and the training model.