Classification of Physical Soil Condition for Plants using Nearest Neighbor Algorithm with Dimensionality Reduction of Color and Moisture Information

Determining soil quality is an important task, especially on newly opened agricultural land, since it can significantly affect plant growth. One way to determine physical soil quality is to visually observe the color of the soil and measure its moisture. This paper presents an embedded system that classifies soil condition for plants from dimensionality-reduced color and moisture information using the k-NN algorithm. The dimension of the attribute information was reduced using correlation analysis to achieve lower computation time and lower memory usage on the embedded system. In this study, 39 soil samples from various locations were collected and categorized by a soil expert using visual observation. In accuracy testing, the system that used four attributes achieved 100% accuracy with a 60:40 training-testing ratio and 7 nearest neighbors, while the system that used only two attributes achieved 100% accuracy with the same 60:40 ratio and 5 nearest neighbors. Resource usage testing showed that the reduced attribute dimension lowered resource usage by 188 bytes of program storage and 192 bytes of global variables. Moreover, the average computation time of the system using the reduced attribute dimension was 5.4 ms, compared to 6.2 ms for the system that used all attributes.


Introduction
As an agricultural country, Indonesia has many types of agricultural land, such as wetlands, dry fields, and shifting cultivation land. According to BPS-Statistics Indonesia, in 2015 Indonesia had 8,092,906.80 ha of wetlands, 11,861,675.90 ha of dry fields, and about 5,190,378.40 ha of shifting cultivation land. In addition, there was about 12,340,270.20 ha of temporarily unused land [1]. Geographical conditions may also cause soil conditions to vary throughout the land [2].
The quality and fertility of the soil affect the growth of the plants grown in it [3]. Thus, determining soil quality is an important task, especially on newly opened agricultural land. Generally, the parameters of soil fertility fall into three areas: physical, biological, and chemical [4].
The quality of soil can be examined visually based on its appearance. Darker soil contains more organic matter than lighter soil [5]. A large amount of nutrients and water makes fertile soil appear darker. Traditionally, the task is done by visually observing the color of the soil and comparing it against a standardized soil color chart [6]. The physical appearance of soil can thus be observed without additional complex procedures such as those using biological or chemical materials. However, the method is relatively time-consuming, and the decision is affected by lighting conditions and the individual's color perception [7].
Another parameter that defines soil quality is its moisture, which represents the quantity of water contained in the soil. In the dry season, soil tends to hold less water than in the rainy season. Soil located far from a water source also tends to have a lower water content. The ability of soil to hold water is an important factor for the growth of the plants on it [8].
In 2017, Prasetyo et al. designed a low-budget system to detect soil fertility, specifically in Cihaur village. It used only moisture data from a soil hygrometer sensor and a simple threshold comparison, which achieved 75% accuracy [9]. In our previous research, naïve Bayes was used as the classification algorithm for color and moisture information, with four dimensions of data as the input attributes. It achieved 100% accuracy but took relatively longer to compute. It also required offline pre-processing before the training dataset could be loaded [10].
Based on the importance of determining soil quality for plants, this study proposes a system that classifies soil condition for plants according to color and moisture information from the soil. The proposed system uses color and moisture sensors to extract this information. The dimension of the information is then reduced to achieve lower computation time and lower memory usage on the embedded system.

Methodology
The system uses a TCS3200 sensor to sense color and an FC-28 sensor to measure soil moisture. Both sensors were connected to an Arduino Uno, which classifies the data into a fertile or non-fertile category. The category is then displayed on a 16x2 LCD as the output. The block diagram of the system is shown in Figure 1.
The FC-28 moisture sensor was chosen because it is designed specifically to measure soil moisture. The sensor has two probes that are inserted into the soil during data acquisition, and it does not require specific conditions for acquiring data. The TCS3200 color sensor, in contrast, is sensitive to illumination and therefore requires a controlled setup during data acquisition, since different ambient lighting could produce different color values for the same soil sample. Hence, the soil was placed inside a jar whose bottom was inserted into a fully covered black case, so that the only lighting inside the case came from the sensor's own light source. Figure 2 shows the hardware implementation with the moisture and color sensors. The same hardware was implemented and used in the previous research with the naïve Bayes system [10].
The color sensor extracts three color components: the Red, Green, and Blue (RGB) values. Hence, the classification used four features, three from the color sensor and one from the moisture sensor. All four features are numeric, so k-NN (k-Nearest Neighbors) is suitable for the classification. k-NN is also popular because it does not require the dimensionality enlargement that SVM uses to increase accuracy. Because the classification was embedded in an Arduino microcontroller with a small memory, keeping the dimension as low as possible is important. k-NN simply measures the distance between a new data point and every training data point and sorts the distances in ascending order. It then takes the k nearest training points and assigns the new data point to the majority class among them.

Data Acquisition
In this study, 39 soil samples from various locations were collected. Each sample was categorized using visual observation by an expert in a Soil Science laboratory. The RGB color and moisture values were used to classify the soil condition. The TCS3200 color sensor has an 8-bit digital output, giving values between 0 and 255. The FC-28 moisture sensor has a 10-bit digital output, giving values between 0 and 1023. The output was separated into two classes: high organic level, which corresponds to "good" for plants, and low organic level, which corresponds to "bad" for plants. The number of samples in each class is shown in Table 1.

Data Normalization
k-NN uses the distance between data points to define nearest neighbors. Distance is sensitive to data range: when features have different ranges, the classification can end up depending only on the feature with the larger range, because distances along the smaller-range feature become insignificant when added to distances along the larger-range feature. This study therefore normalized the data set by scaling each feature to the range 0 to 1.
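The scaling step can be illustrated with a minimal sketch. It assumes min-max scaling over the observed values of each feature; the paper does not state whether the observed minimum and maximum or the full sensor range (0 to 255, 0 to 1023) was used. The function name and the sample readings are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` (a list of equal-length lists) to [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Hypothetical readings in the order [R, G, B, moisture]; not data from the study.
samples = [
    [120, 95, 60, 310],
    [200, 180, 150, 820],
    [160, 140, 100, 560],
]
normalized = min_max_normalize(samples)
for row in normalized:
    print([round(v, 3) for v in row])
```

After scaling, a difference of 100 in the moisture reading no longer outweighs a difference of 100 in a color channel simply because the moisture range is about four times wider.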

Dimensionality Reduction Using Correlation Between Feature Pairs
As the classification was embedded in an Arduino with limited memory, reducing the dimension of the data set is beneficial. A simple dimensionality reduction method is to analyze the correlation between feature pairs. Using only one or two of the four features would reduce computation time in the embedded system. Correlation analysis is used in feature selection to determine the relation between two different features [11]. Since the data are numeric, Pearson's correlation is applied, which uses a linear approximation between two variables. The coefficient is calculated using Equation 1,
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \quad (1)$$
where x_i and y_i are the values of the two features for sample i, \bar{x} and \bar{y} are their means, and n is the number of samples. The coefficient takes a value from -1 to +1, where a positive sign means the two features are directly proportional and a negative sign means they are inversely proportional. The magnitude shows how strong the correlation between the two features is, where 1 means strong and 0 means weak. The coefficients r of each pair of features are shown as a matrix in Table 2.
A high magnitude of r means that the feature pair is highly similar: both features carry similar information about each sample and fluctuate together, so eliminating one of them would not affect the classification. Conversely, a low r value means that the pair carries diverse information, so both should be kept as features. Feature selection usually keeps one feature from each pair with a high r value. In this study, the two selected features were simply the pair with the lowest r value, B and moisture (r = -0.70). The data plots of R and G (r = 0.98) and of B and moisture (r = -0.70) are shown in Figure 3(a) and 3(b), respectively. Visually, R and G follow each other much more closely than B and moisture do.
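The pair selection described above can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the feature values are hypothetical, the function names are invented for the example, and "lowest r value" is read here as the smallest correlation magnitude.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def least_correlated_pair(features):
    """Return ((name_a, name_b), r) for the pair with the smallest |r|."""
    scored = {
        (a, b): pearson(features[a], features[b])
        for a, b in combinations(features, 2)
    }
    best = min(scored, key=lambda pair: abs(scored[pair]))
    return best, scored[best]


# Hypothetical normalized feature columns; not data from the study.
features = {
    "R": [0.10, 0.30, 0.50, 0.70, 0.90],
    "G": [0.12, 0.28, 0.52, 0.69, 0.88],
    "B": [0.20, 0.45, 0.40, 0.75, 0.70],
    "Moisture": [0.80, 0.35, 0.70, 0.20, 0.50],
}
pair, r = least_correlated_pair(features)
print(pair, round(r, 2))
```

Selecting by magnitude rather than by signed value is a design choice made for this sketch: a strongly negative coefficient indicates as much redundancy as a strongly positive one.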

K-Nearest Neighbors (k-NN)
K-nearest neighbors is a classification method that categorizes a data point based on the classes of its k nearest neighbors. Neighbors are the training data with the smallest distance to that point. Euclidean distance, one of the most common distance measures, is based on distance measurement in vector space. The data point is assigned to the majority class among its k neighbors. The pseudocode for k-NN is shown in Figure 4.
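The procedure described in this section can be written as a short sketch. It is not a transcription of the pseudocode in Figure 4, and the function name and training values are hypothetical; the two columns stand in for normalized B and moisture readings.

```python
from collections import Counter
from math import dist


def knn_classify(query, train_x, train_y, k):
    """Assign `query` to the majority class among its k nearest training points."""
    # Euclidean distance from the query to every training point, with its label.
    distances = [(dist(query, x), label) for x, label in zip(train_x, train_y)]
    # Ascending sort puts the nearest neighbors first.
    distances.sort(key=lambda item: item[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest labels.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical normalized training data in the order [B, moisture].
train_x = [
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
    [0.80, 0.20], [0.90, 0.10], [0.85, 0.30],
]
train_y = ["good", "good", "good", "bad", "bad", "bad"]

print(knn_classify([0.20, 0.70], train_x, train_y, k=3))
print(knn_classify([0.80, 0.25], train_x, train_y, k=3))
```

With two classes, an odd k such as the 5 and 7 used in this study guarantees that the vote cannot end in a tie.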

Testing and Analysis
Two tests were conducted, on classification accuracy and on computational time. Both tests are needed to determine whether data reduction improves computation time while still maintaining high accuracy.
a. Classification accuracy
Testing of the k-NN classification was conducted using all four features (R, G, B, and moisture) and using two features (B and moisture). The test varied the ratio of training to testing data and the number of neighbors, k. The results are shown in Table 3. As the table shows, using the two features with the lowest correlation coefficient gave accuracy similar to using all four features. With four features, 100% accuracy was obtained at a 60:40 ratio with 7 neighbors. With two features, 100% accuracy was obtained at a 60:40 ratio with 5 nearest neighbors. The two-feature system also has an advantage in data size: using 60% of the data for training means that only a 23×2 training matrix is stored. This reduces the arithmetic in each distance calculation between a testing sample and the training data. Using 5 neighbors instead of 7 also reduces the number of votes to be counted.
Based on the accuracy results, the system was tested in two scenarios: first, using four attributes as input parameters with k = 7; second, using two attributes (only B and moisture) with k = 5. With the 60:40 ratio, the dataset was divided into 23 training samples and 16 testing samples. In the first scenario, shown in Figure 5, the system using four attributes consumed 7674 bytes of program storage and 763 bytes of global variables. On the other hand, the system with dimensionality reduction (using only B and moisture information) consumed 7486 bytes of program storage and 571 bytes of global variables, as shown in Figure 6. That is, the reduced attribute dimension lowered resource usage by 188 bytes of program storage and 192 bytes of global variables, which matters on an embedded system.
In the k-NN algorithm, the number of attributes has a significant impact on computation time, because the Euclidean distance computation involves more terms in higher dimensions. In this test, both scenarios were run on an Arduino Uno board.
Timing started and stopped exactly at the beginning and end of the k-NN computation. The comparison of both scenarios is shown in Table 4. It can be inferred that the average computation time of the system using attribute dimensionality reduction (5.4 ms) was shorter than that of the system using all attributes (6.2 ms).
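The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this paper, plus one labeled assumption: that each stored feature value is a 4-byte float, which the paper does not state. The helper name is hypothetical.

```python
def distance_terms_per_query(n_train, n_features):
    """Squared-difference terms needed to compare one query with all training samples."""
    return n_train * n_features


N_SAMPLES = 39
n_train = int(N_SAMPLES * 0.6)   # 60:40 split, 23.4 truncated to 23
n_test = N_SAMPLES - n_train     # remaining 16 samples

# Work per classified sample in each scenario.
terms_four = distance_terms_per_query(n_train, 4)
terms_two = distance_terms_per_query(n_train, 2)

# Savings reported for the reduced-dimension system.
flash_saved = 7674 - 7486        # bytes of program storage
ram_saved = 763 - 571            # bytes of global variables
time_saved_ms = 6.2 - 5.4        # average computation time
time_saved_pct = 100 * time_saved_ms / 6.2

# ASSUMPTION: feature values stored as 4-byte floats (not stated in the paper).
BYTES_PER_VALUE = 4
training_bytes_saved = n_train * (4 - 2) * BYTES_PER_VALUE

print(n_train, n_test, terms_four, terms_two)
print(flash_saved, ram_saved, round(time_saved_ms, 1), round(time_saved_pct, 1))
print(training_bytes_saved)
```

Under the float-storage assumption, dropping two features from the 23 training samples alone would account for 184 bytes, which is of the same order as the 192-byte reduction in global variables. Halving the distance terms from 92 to 46 gives only about a 13% time saving, which suggests that other steps, such as sorting and voting, also take a substantial share of the computation time.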