Comparison of Bagging Ensemble Combination Rules for Imbalanced Text Sentiment Analysis

The wealth of opinions expressed by users on micro-blogging sites can be beneficial for product manufacturers or service providers, as they can gain insights about certain aspects of their products or services. The most common approach for analyzing text opinions is machine learning. However, opinion data are often imbalanced; e.g., the number of positive sentiments heavily outnumbers the negative sentiments. Ensemble techniques, which combine multiple classification algorithms to make decisions, can tackle imbalanced data by learning from multiple balanced datasets. The decision of an ensemble is obtained by combining the decisions of individual classifiers using a certain rule, so rule selection is an important factor in ensemble design. This research aims to investigate the best decision combination rule for imbalanced text data. Multinomial Naïve Bayes, Complement Naïve Bayes, Support Vector Machine, and Softmax Regression are used as base classifiers, and the max, min, product, sum, vote, and meta-classifier rules are considered for decision combination. The experiments are done on several Twitter datasets. From the experimental results, it is found that the Softmax Regression ensemble with the meta-classifier combination rule performs the best on all but one dataset. However, it is also found that training the Softmax Regression ensemble requires intensive computational resources.

Keywords: ensemble, sum rule, Softmax Regression, classifier, dataset


Introduction
It is common to see relatives or friends express their stance on social phenomena or on certain products and services in their social media timelines. Beyond social media, personal blogs have also become a way for people to articulate their ideas. Other than personal platforms like social media and blogs, commercial websites also provide a place for their users to write reviews about products or services they use. Internet forums likewise accommodate like-minded people discussing their topics of interest. Indeed, the ease of communication provided by Internet technologies has allowed people to freely express their opinion regarding a certain topic [1].
The abundant amount of opinions provided on these platforms has many advantages for various parties. For customers, opinions can provide additional information for financial decisions. For example, users who want to buy a certain product may refer to existing user reviews to see if the product fulfills their needs. For manufacturers or service providers, these opinions can be used to gauge public reception of their products or services. The rest of this paper is organized as follows: related studies are described in Section 2. The methodology used in this research is explained in Section 3. The results are discussed in Section 4, and conclusions can be found in Section 5.

Literature Review
In this section, several studies in the fields of ensembles, text processing, sentiment analysis, and imbalanced problems are reviewed.
An ensemble method is used alongside the evolutionary search algorithm Particle Swarm Optimization (PSO) to perform sentiment analysis on two imbalanced multiclass datasets [1]. PSO is used to select text features, which include words, part-of-speech tags, named entities, etc. Each particle in the swarm represents a selected feature space, which is then used to train a classifier. The top-n classifiers are used to construct the ensemble. PSO is then used again to selectively drop classifiers from the ensemble. The method shows improvement over existing methods.
Another text processing method based on feature weighting and Naïve Bayes (NB) is proposed in [21]. Feature weighting is used to alleviate the weakness of the feature independence assumption of NB. The method uses correlation-based feature selection (CFS) to select features, and selected features receive increased weight. The weight is used to modify the decision function and probability estimation of NB. The modified NB models Multinomial NB (MNB), Complement NB (CNB), and One-versus-all NB (a combination of MNB and CNB) are used with the modified weights. The experiments are conducted on both text and non-text datasets. The modified weights have been shown to improve classification results.
An evolutionary algorithm is also used to modify the NB method for text classification [16]. The method converts the probability estimation problem of NB into an optimization problem. Differential evolution (DE) is used to find the optimal estimated conditional probabilities. DE is further enhanced with multi-parent and crossover methods. The proposed method is shown to perform better than classical classifiers on multiple text datasets.
Another evolutionary algorithm, based on cuckoo search (CS), is also used for Twitter sentiment analysis [22]. Twitter datasets are pre-processed into features such as the total number of words, negative emoji, and positive emoji. CS is used to optimize cluster centroids based on maximization of inter-class variance. After the best cluster centroids are found, classification is carried out using the k-means algorithm. Experiments on four Twitter datasets show improvement compared to evolutionary algorithms such as PSO and DE.
Deep learning models are also suitable for text classification [20]. Deep learning models construct new feature representations from sparse, high-dimensional text representations such as TF-IDF. The study uses a deep belief network (DBN) to learn the new feature space. Softmax Regression (SR) is used to classify the learned feature representation from the DBN. Initially, the DBN and SR are trained independently. Then, both models are combined and trained together to obtain a better model. The experiment compares gradient descent and L-BFGS methods for training the network; it is found that L-BFGS performs better than gradient descent.
An ensemble method is used to classify imbalanced datasets [9]. The method chooses an appropriate ensemble, feature selection, classifier, and ensemble rule for different datasets. The included ensemble methods are oversampling-based bagging, undersampling-based bagging, and AdaBoost M1. The included feature selection methods are wrapper-based PSO and filter-based fast correlation-based feature selection (FCBF). The classifiers used are C4.5, SVM, radial basis function neural network (RBF-NN), data gravitation classifier (DGC), and k-nearest neighbor (KNN). The combination of methods is then selected based on the number of samples, number of classes, number of features, and the imbalance rate. The proposed method shows better performance than existing ensemble methods on a small database.
An ensemble method with modified rules has also been introduced [14]. The method splits the dataset using two schemes, i.e., splitbal and clusterbal. Splitbal creates multiple balanced datasets by splitting data randomly, while clusterbal uses clustering. The modified ensemble rules incorporate the distance of the test sample to the training samples, assigning higher weight to classifiers with nearer samples. Experiments on non-text datasets show that splitbal performs better more often than clusterbal and other ensemble methods, and that the modified max rule is often the best combination rule for the ensembles.

JITeCS Volume 6, Number 1, April 2021, pp 33-49 p-ISSN: 2540-9433; e-ISSN: 2540-9824

Bagging Ensemble
Breiman's Bagging (bootstrap aggregating) [23] is the simplest ensemble model and very easy to implement. Each model in the bagging ensemble is trained using a fraction of the data and/or a feature subset of the dataset. As every model is independent of the others, the models can also be trained in parallel [4]. The pseudocode of bagging can be viewed in Procedure 1.
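The idea behind Procedure 1 can be sketched in Python as follows. This is a minimal illustration, not the paper's exact procedure: the classifier factory `make_classifier` and the subset ratio are placeholders, and sampling is done without replacement for simplicity, whereas classic bagging samples with replacement.

```python
import random

def train_bagging_ensemble(data, n_models, make_classifier, subset_ratio=0.8):
    """Train n_models independent classifiers, each on a random data subset."""
    ensemble = []
    for _ in range(n_models):
        # D_i = random subset of the training data; classic bagging draws
        # with replacement, this sketch draws without replacement
        subset = random.sample(data, int(subset_ratio * len(data)))
        model = make_classifier()
        model.fit(subset)
        ensemble.append(model)
    return ensemble
```

Because each model sees a different subset, the models make different errors, which is what makes the later decision combination worthwhile.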
The classification of a new test sample by the ensemble is determined by combining the decisions of the classifiers (i.e., the scores of the new sample belonging to each class). The possible combination rules are the max rule, min rule, product rule, sum rule, and majority vote [14], or a meta-classifier [24]. The formulations of the combination rules are listed in Table 1, where s_ic refers to the normalized score for class c produced by classifier i, and S_c refers to the ensemble score for class c. The indicator operator 1(condition) has a value of 1 if the specified condition is fulfilled, and 0 if not. The final decision is the class with the highest S_c.
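The rules in Table 1 can be sketched as operations over a score matrix with one row per classifier. This is a hedged illustration: it assumes each row has already been normalized to sum to 1, as described later in the methodology.

```python
import numpy as np

def combine(scores, rule):
    """Combine per-classifier class scores into ensemble scores S_c.
    scores: array of shape (n_classifiers, n_classes), rows summing to 1."""
    if rule == "max":
        return scores.max(axis=0)
    if rule == "min":
        return scores.min(axis=0)
    if rule == "product":
        return scores.prod(axis=0)
    if rule == "sum":
        return scores.sum(axis=0)
    if rule == "vote":
        # each classifier votes for its top class; the ensemble counts votes
        votes = scores.argmax(axis=1)
        return np.bincount(votes, minlength=scores.shape[1]).astype(float)
    raise ValueError(f"unknown rule: {rule}")

def decide(scores, rule):
    """Final decision: the class with the highest combined score."""
    return int(np.argmax(combine(scores, rule)))
```

The meta-classifier ('clf') rule is not shown here, as it requires training an additional model on the concatenated member outputs rather than applying a fixed formula.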

Multinomial Naïve Bayes
Multinomial Naïve Bayes (MNB) is one of the state-of-the-art Naïve Bayesian models for text classification, the other being Complement Naïve Bayes (CNB) [17]. MNB is an NB model that works on a term frequency vector representation. MNB classifies a test document d = [f_1, f_2, …, f_V] using the decision function written in Equation (1), where C is the number of classes in the dataset, f_i is the term frequency of the i-th word in document d, V is the size of the dictionary, and log P(w_i | c) is the log conditional probability of term w_i given class c. The conditional probability is computed in Equation (2), where N is the total number of documents. Eq. (2) calculates the sum of the frequencies of term w_i in all documents belonging to class c, divided by the sum of the frequencies of all terms in all documents belonging to class c (smoothing is applied to avoid division by zero).
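As an illustration (not the paper's own code), sklearn's MultinomialNB implements this decision function on term-frequency vectors; alpha=1.0 corresponds to the smoothing mentioned for Eq. (2). The toy matrix below is invented for the example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy term-frequency matrix: rows are documents, columns are dictionary terms
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])        # class labels of the four documents

# alpha=1.0 applies the smoothing mentioned for Eq. (2)
mnb = MultinomialNB(alpha=1.0).fit(X, y)
pred = mnb.predict(np.array([[2, 0, 1]]))   # classify a new TF vector
```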

Complement Naïve Bayes
CNB is another state-of-the-art NB model that can be used for text classification [25]. Like MNB, CNB uses term frequencies to calculate probabilities. CNB, however, calculates conditional probabilities using the complement of class c (denoted c̄), i.e., all classes other than c. The decision function of CNB is written in Equation (3). It should be noted that the decision function of CNB differs from that of MNB, as CNB chooses the class with the smallest complement score (hence the minus sign). P(w_i | c̄) is the conditional probability of term w_i given all classes other than c; it is calculated using Equation (4).
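A corresponding sketch with sklearn's ComplementNB, again a library illustration on an invented toy matrix rather than the paper's code:

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# same toy term-frequency matrix as in the MNB illustration
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])

# CNB estimates term probabilities from the complement of each class,
# which makes it more robust on skewed class distributions
cnb = ComplementNB(alpha=1.0).fit(X, y)
pred = cnb.predict(np.array([[2, 0, 1]]))
```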

Linear Support Vector Machine
Stochastic Gradient Descent (SGD)-based methods are suitable classification models for large-scale learning tasks such as text classification [26]. An SGD classifier has a decision function f(x) that is parameterized by weights w. The weights are chosen so that they minimize a loss function L(w; x, y) over a given dataset D = [(x_1, y_1), …, (x_n, y_n)]. The value of w is gradually updated based on the gradient of the average loss over D. Gradient descent becomes stochastic when only a small fraction of the dataset is used to estimate the gradient. The model of a linear SGD classifier can be viewed in Fig. 2. The inputs x_1, x_2, …, x_m are the features of the test data x. The weights w are combined with the input to obtain the classifier outputs o_1 through o_C, where C represents the number of classes. A linear Support Vector Machine (SVM) can be formulated as an SGD classifier using the multiclass hinge loss as the loss function. Additionally, SVM adds L2 regularization to the loss function to prevent overfitting of the model.
Training the SVM model requires updating the weight values. Initially, the weight values are set randomly. Then, a subset of the dataset (a batch) is selected randomly and is used to calculate the value of the loss function. The value of the loss function is used to obtain the gradient for the currently trained batch. The weights are then updated based on the gradient and the learning rate. The weight update process is repeated for a certain number of iterations. After training is complete, a test data sample can be classified based on the class with the highest score (o_1, o_2, …, o_C in Fig. 2).
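The training loop described above is what sklearn's SGDClassifier performs internally; with loss="hinge" and an L2 penalty it corresponds to the linear SVM described here. The toy data and the hyper-parameter values below are placeholders, not the paper's grid-search settings.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# two noisy clusters standing in for TF feature vectors of two classes
X = np.vstack([rng.normal(0.0, 0.3, (20, 5)) + [1, 0, 0, 0, 0],
               rng.normal(0.0, 0.3, (20, 5)) + [0, 0, 0, 0, 1]])
y = np.array([0] * 20 + [1] * 20)

# hinge loss + L2 penalty trained by SGD = a linear SVM
svm = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                    learning_rate="constant", eta0=0.1,
                    max_iter=500, random_state=0).fit(X, y)
raw_score = svm.decision_function(X[:1])   # raw, unnormalized class score
```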

Softmax Regression
Another model for large-scale text classification is the Softmax Regression (SR) model [20], [27]. Softmax Regression is the multi-class generalization of the Logistic Regression model. Like SVM, it is parameterized by a weight matrix W, and can be optimized using the SGD procedure to minimize its loss function. Unlike SVM, which uses the hinge loss, SR uses the cross-entropy (CE) loss with L2 regularization. The CE loss causes the outputs of the classifier to be probability estimates of a data sample belonging to each class c. In other words, the output values in Fig. 2 can be interpreted directly as normalized class probabilities.
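A minimal sketch of the softmax output stage follows; the cross-entropy training itself is omitted. In sklearn terms, SGDClassifier with loss="log_loss" would play the role of SR, though that mapping is our assumption, not stated by the paper.

```python
import numpy as np

def softmax(z):
    """Map raw class scores to outputs in [0, 1] that sum to 1."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax([2.0, 1.0, 0.1])   # e.g. raw scores for three classes
```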

Methodology
In this section, the methodology of the research is explained. The overall classification process is illustrated in Fig. 3. Initially, the imbalanced dataset is partitioned into multiple balanced subsets. Then, each balanced subset is used to train one classifier, creating the models inside the ensemble. A new test sample is classified using all models in the ensemble, each producing a classification result. All results are combined using a certain combination rule to obtain the final classifier decision. The rest of this section explains the dataset, the classification process, performance evaluation, and implementation details.

Dataset Description
To perform the comparison of combination rules, experiments are carried out on several publicly available datasets. Each dataset differs in the number of classes/sentiments and the imbalance ratio (IR). IR is the ratio between the sizes of the majority and minority classes. The data distribution and the IR of each dataset can be seen in Table 2.
A brief explanation of each dataset is as follows:
1. Twitter-sanders-apple2: a binary (positive and negative) Twitter dataset.
2. Twitter-sanders-apple3: the first dataset augmented with a neutral class.
3. Testdata-manual: a three-class dataset with fairly balanced classes.
4. Airline: a three-class dataset with a large amount of data.
5. Self-driving: an extremely imbalanced five-class dataset. Tweets labeled 'not relevant' in the original dataset are not used in the experiment.
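The IR reported in Table 2 can be computed directly from the label counts; the counts below are invented for illustration:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

ir = imbalance_ratio(['pos'] * 180 + ['neg'] * 60)   # toy counts
```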

Text Preprocessing
For the tweets to be used in classification with the ensemble method, they need to be cleaned using a series of preprocessing steps. The preprocessing steps aim to standardize the tweet text. After the tweets are preprocessed, the numerical features can be extracted and the dataset is ready to be used in classification.
In this research, the performed preprocessing steps are:
1. The tweets are transformed into lower-case.
2. Contractions are expanded, e.g. 'they're' to 'they are'.
3. Twitter usernames beginning with '@' are removed.
4. URLs (starting with https://) are replaced with the string 'url' to prevent links from being converted into terms.
5. Hashtags are removed.
6. Non-alphabetic characters and extra whitespace are removed from the tweets.
7. Words in tweets are lemmatized into their base form, e.g. 'driving' to 'drive'.
8. Stop-words are removed.

After preprocessing, the numerical features can be extracted using term frequency (TF) weighting. The extracted TF features are then normalized such that the maximum absolute value of each word/feature is 1.0. This normalization scheme preserves the sparsity of the data, and a sparse data format is crucial to minimize the running time of the classifiers.
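Steps 1-6 and 8 can be sketched with plain regular expressions as below. This is a simplified stand-in: the contraction list and stop-word set are tiny samples, and lemmatization (step 7, for which the paper uses nltk) is omitted.

```python
import re

STOPWORDS = {"the", "a", "is", "are", "to", "of"}          # tiny sample set
CONTRACTIONS = {"they're": "they are", "don't": "do not"}  # sample entries

def preprocess(tweet):
    t = tweet.lower()                               # step 1: lower-case
    for short, full in CONTRACTIONS.items():        # step 2: expand contractions
        t = t.replace(short, full)
    t = re.sub(r"@\w+", "", t)                      # step 3: drop @usernames
    t = re.sub(r"https?://\S+", "url", t)           # step 4: links -> 'url'
    t = re.sub(r"#\w+", "", t)                      # step 5: drop hashtags
    t = re.sub(r"[^a-z\s]", " ", t)                 # step 6: non-alphabetic chars
    tokens = [w for w in t.split() if w not in STOPWORDS]  # step 8: stop-words
    return " ".join(tokens)
```

For the TF extraction and max-absolute-value normalization, sklearn's CountVectorizer followed by MaxAbsScaler would be a natural fit, since max-abs scaling leaves zero entries at zero and thus keeps the matrix sparse.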

Classification with Bagging Ensemble
The dataset with TF features is split into training and testing data using a 4-fold cross-validation method. The training data are used to create the ensemble model and consist of 75% of the instances of the dataset. The testing data are used to validate the performance of the trained ensemble model and consist of the remaining 25% of the instances.
The training data are then further partitioned to create the multiple classifier models that make up the ensemble. The SplitBal [14] method is used to create multiple class-balanced partitions. The original SplitBal method was intended for binary classification problems. To extend the method to a multi-class classification problem, the minority class is first determined as the class with the fewest instances, e.g. the very negative class. Then, the other classes are split into bins that have the same number of instances as the minority class. The number of created partitions equals the ratio between the number of instances in the majority class and the minority class. In other words, the number of classifiers in the ensemble equals the imbalance ratio (IR) of the dataset, i.e. the ratio between the amount of data in the class with the most samples and the class with the fewest samples. Hence, according to Table 2, the ensemble for apple2 uses 2 classifiers, the ensemble for apple3 uses 4 classifiers, testdata uses 2 classifiers, and self-driving uses 39 classifiers (IR values are rounded up). The ensemble is constructed using the same classifier type throughout, and the meta-classifier also uses the same classifier type as the ensemble members.
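A simplified sketch of this multi-class SplitBal partitioning, based on our reading of the description above: every bin keeps all minority-class samples plus a minority-sized slice of each other class. The original method assigns samples randomly; this sketch slices them sequentially for clarity.

```python
import math
from collections import Counter
import numpy as np

def splitbal_partitions(y):
    """Return IR-many (rounded up) class-balanced index partitions."""
    y = np.asarray(y)
    counts = Counter(y.tolist())
    minority, n_min = min(counts.items(), key=lambda kv: kv[1])
    n_bins = math.ceil(max(counts.values()) / n_min)
    idx_by_class = {c: np.flatnonzero(y == c) for c in counts}
    partitions = []
    for b in range(n_bins):
        idx = list(idx_by_class[minority])          # all minority samples
        for c, ids in idx_by_class.items():
            if c == minority:
                continue
            # take the b-th minority-sized slice of class c, wrapping around
            start = (b * n_min) % len(ids)
            idx.extend(np.take(ids, range(start, start + n_min), mode="wrap"))
        partitions.append(np.array(idx))
    return partitions
```

Each returned partition is a set of row indices with an equal number of samples per class, ready to train one ensemble member.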
The created partitions are used to train the individual classifiers. In this research, the ensemble is trained with classifiers of the same type, and four types of classifiers are considered: MNB, CNB, SVM, and SR. The MNB and CNB models have no user-defined hyper-parameters. SVM and SR have several, i.e., the learning rate (step size), the number of iterations, and the regularization parameter. Grid search is used to find the best parameters for each rule. For the grid search, SVM and SR use a batch size of 256. Additionally, data scaling is also treated as a parameter. The values of the parameters explored in the grid search are presented in Table 3.
Additionally, the classifier scores for MNB, CNB, and SVM need to be adjusted. The combination rules presented in Table 1 require each classifier to produce a score for each class given test data x. The scores need to be in the range [0, 1], and all scores must sum to 1. However, only SR inherently produces normalized scores for each class, while the other classifiers only produce raw scores. Therefore, an additional step is taken to normalize the outputs of the MNB, CNB, and SVM classifiers. The softmax function is a common method to normalize raw scores [13]. The raw and normalized scores for the NB classifiers are described in Table 4, and the normalization process of the SVM classifier is illustrated in Fig. 4. It should be noted that the normalized scores for NB and SVM are not probability estimates, as NB and SVM are known to be uncalibrated classifiers [28].
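This normalization can be applied row-wise to the ensemble's output matrix, e.g. with scipy's softmax; the raw margins below are invented for illustration:

```python
import numpy as np
from scipy.special import softmax

# raw per-class scores from three ensemble members (invented SVM margins)
raw = np.array([[ 1.2, -0.4],
                [ 0.3,  0.1],
                [-2.0,  0.5]])

# row-wise softmax: each classifier's scores now lie in [0, 1] and sum to 1
norm = softmax(raw, axis=1)
```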

Performance Evaluation
To evaluate the classification models, the F1 measure is used. F1 can be applied to a multi-class classification problem using macro-averaging [29]. The F1 score for a binary problem is formulated in (5), and the macro-averaging of F1 is shown in (6). Macro-averaging is calculated by treating one class as the positive class and all other classes as negative. The process is repeated with each class as the positive class, and the F1-macro score is the average of all the individual F1 scores.
The stratified 4-fold cross-validation method is used in this research. Four partitions with similar class distributions are created. Three partitions are used as training data to create the ensemble model. The remaining partition is used as testing data to validate the performance of the ensemble using the F1-macro score. The process is repeated so that every partition is used as testing data, yielding F1-macro scores for four folds. The overall performance of the ensemble is the average F1-macro over the 4 folds.
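Both pieces, macro-averaged F1 and stratified folding, are available in sklearn, which the paper uses for cross-validation and evaluation; the toy labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# macro-F1: per-class F1 scores (Eq. (5)) averaged over the classes (Eq. (6))
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])
macro = f1_score(y_true, y_pred, average="macro")

# stratified 4-fold CV: every fold keeps the class ratio of the full dataset
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((12, 1))                     # features are irrelevant for splitting
folds = [test for _, test in StratifiedKFold(n_splits=4).split(X, y)]
```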

Implementation Details
The experiments are executed in a Python 3.6.5 environment on top of a Windows 10 system with an Intel® Core™ i5-7200U 2.5 GHz processor. The packages numpy and scipy.sparse are used for dense and sparse matrix computations, respectively. nltk is used for lemmatization, and sklearn is used for 4-fold CV and evaluation of the models.

Classification Results
The results of the experiments are presented in this section. The summary of the classification results is presented in Table 5. For each dataset, the results of the best-performing parameters for each classifier-rule combination are presented. Additionally, the base, non-ensembled classifier is also presented (the model with a '-' value in the rule column) to determine whether the ensemble methods improve on the original classifier.
The results from Twitter-sanders-apple2 dataset are presented in Table 6. It is divided into groups for each classifier. The italicized entry represents the base classifier, while the bold-faced entry represents the best performing classifier in its group.
The results show that for the Naïve Bayesian methods, the ensemble models improve the recognition ability of the base classifier, regardless of the rule. The best performing rule for MNB is the 'clf' (meta-classifier) rule. MNB models are known to be unsuitable for imbalanced class problems [25]; the ensemble model alleviates this problem by dividing the dataset into partitions with an equal distribution between classes [14]. For CNB, all ensemble rules also improve the performance of the base classifier. For the SGD-based algorithms (SVM/SR), only the 'vote' and 'clf' rules improve the performance, while the other rules degrade it. Of the two, 'clf' is the better rule for both SVM and SR. The 'clf' rule uses a meta-classifier trained on the outputs of the ensemble members [24]. Therefore, the 'clf' rule can better map the outputs of the classifiers and learn which ensemble members can correctly classify new data [13].
Among all the models tested, the best performing model on this dataset is the SR ensemble with the 'clf' rule. The SR model has a learning rate of 1, which means that each SGD step is noisier. The ensemble method can improve the performance of such an unstable method: the decision boundary of an unstable model can vary greatly, which the ensemble can exploit to select a better boundary for the overall system [13].
The results for Twitter-sanders-apple3 are presented in Table 7. This dataset is similar to the previous one, with a neutral class added. The addition of the class turns the binary classification into multi-class classification, which can significantly increase classification difficulty [4]. A performance degradation of all models can be seen when comparing Table 6 and Table 7. As on the previous dataset, all four classifiers still gain performance improvements from ensembling, and the ensemble with the 'clf' rule is again the best overall performer. It also still holds that the SGD models outperform the NB models, with SR being the best overall classifier. MNB and CNB, however, still gain more from ensemble formation than the SGD models: MNB gained 3% performance while CNB gained 2%. Unlike on the Twitter-sanders-apple2 dataset, however, all other rules degrade the performance of the base classifier, including the 'vote' rule that was able to improve the base model on Twitter-sanders-apple2. The improvement of the best model is also lower: the SR ensemble gained 3% over the base model on Twitter-sanders-apple2, but only 0.7% on Twitter-sanders-apple3.
The third experiment uses the Testdata-manual dataset, with results presented in Table 8. This is a fairly balanced dataset, with an IR of only 1.30 (Table 2). The results show that ensemble models generally degrade the performance of the base classifier, regardless of the rule. The only exception is the SR model. The SplitBal scheme splits the majority class into multiple partitions to balance the majority and minority classes [14]. However, when the dataset is already fairly balanced, as Testdata-manual is, this leads to fewer majority-class samples for each classifier, and as the number of samples decreases, generalization ability also degrades. In the case of the Testdata-manual dataset, the majority class 'positive' is divided from 180 samples into 90 samples per classifier. The trends from the Twitter-sanders-apple2 and Twitter-sanders-apple3 datasets still apply here: NB models perform worse than SGD models, and SR is once again the best model. The parameters of SR are also the same, i.e. a learning rate of 1, 500 iterations, and a regularization parameter of 0.0001. The ensembled SR is even able to improve on the base SR, although the improvement is only 1%.

For the Airline dataset, the results are presented in Table 9. The Airline dataset has an IR of 3.89, which means that the majority class samples are adequately distributed among the ensemble members. Even though the Airline dataset has approximately 15 times more data, the results show a similar trend to the Twitter-sanders-apple3 dataset, which has an IR of 3.14.
On the Airline dataset, all base methods are improved by the ensemble with the 'clf' rule, although the other rules decrease the performance. The preferred model is still the SR model, followed by SVM, and then by the two NB models. The improvements are also smaller: MNB and SVM gained 2%, while the CNB and SR models gained 1% or less.
The last experiment is on the Self-driving dataset, which has a highly imbalanced distribution with an IR of 38.59. The results are presented in Table 10. The minority class has only 110 samples, compared to 4245 samples in the majority class. This dataset has the lowest F1-macro of all the datasets. Also, almost all base classifier models have higher performance than the ensembled models. The only exception is the ensembled MNB model, which performs better than the base MNB model; however, the ensembled MNB model is still comparatively worse than the other base classifier models. As Self-driving is highly imbalanced, the number of ensemble members is also large [14]. This can lead to a suboptimal combination of decisions, as the probability of a worse-performing classifier affecting the overall decision is higher.

Based on the classification results from the five datasets, the ensemble of the SR model is the overall best classifier. Ensembled SR with the 'clf' rule scores the highest on four of the five datasets, while the remaining dataset is best classified with the base SR model. All SR models also have a learning rate of 1, with varying numbers of iterations and regularization parameters.

Running Time
After comparing classification results, the training times are compared. The average training time for each model is presented in Fig. 5. The training time of an ensemble depends on the number of classifiers and the type of classifier itself. NB models only require one pass over the dataset to train, and therefore have training times in the order of milliseconds. Even when training multiple classifiers, the NB ensembles are much faster to train than the SVM/SR ensembles.
For the SGD models (SVM/SR), the training time primarily depends on the number of iterations. As can be seen in Fig. 5, more iterations lead to longer training time. The training time also depends on the number of classifiers in the ensemble; the number of classifiers in the SplitBal method equals the imbalance ratio (IR) of the dataset [14], e.g., a dataset with an IR of 4 will be trained using an ensemble of 4 classifiers. The training time, however, depends less on the size of the dataset, since the use of a sparse data format makes training time less dependent on the number of training samples. For example, the Airline dataset has 29 times more data than Twitter-sanders-apple2, yet a 2000-iteration SR ensemble with twice as many classifiers only takes 3 times longer to train. Although training takes no more than 10 seconds on most datasets, the cost of training the SVM/SR ensembles is most noticeable on Self-driving because of its IR of 39: it takes almost 60 seconds to train the whole ensemble, which means that 4-fold cross-validation lasts 4 minutes for a single parameter combination. This leads to a very long time to find optimal parameters. Additionally, the best ensemble model does not outperform the base model on this dataset, and performing a grid search on the base model is much faster.

Testing time is not reported in this research. Testing unknown data samples with the ensemble only takes time in the order of milliseconds. Unlike the training process, which is iterative (for SVM/SR), testing involves a single non-iterative pass, and, as mentioned, the use of a sparse data format ensures efficiency. This also means that the testing times of the individual rules are not compared, as the ensemble can use all rules without re-training.

Conclusions
In this research, a comparison of ensemble models is performed to find the most suitable method for classifying various datasets. The compared combination rules are 'max', 'min', 'product', 'vote', 'sum', and 'clf'. Four base classifiers, Multinomial NB, Complement NB, Support Vector Machine, and Softmax Regression, are compared to see their effect on the resulting ensemble model. The ensemble models are then tested on five datasets with different characteristics in the amount of data and the imbalance ratio.

Fig. 5 Training Time Comparison for Each Dataset
From the experimental results, it is found that the 'clf' rule (using a meta-classifier to combine the results of the ensemble members) is the most consistent in improving the base classifier. The 'max' and 'vote' rules can improve the base classifier on some datasets; however, the other rules degrade the base classifier in most cases. As the classifier of choice, Softmax Regression is the best performing classifier on all datasets, and the ensembled Softmax classifier is the best model on all but one dataset. From this, it can be concluded that the ensemble model with the 'clf' rule can improve the performance of the base classifier.
For future research, datasets with highly imbalanced data such as Self-driving can be studied further. Additionally, ensemble methods that assign a weight to each classifier can be explored; there are several ways to perform such weighting, e.g. evolutionary-algorithm-based weighting. There is also a need to build efficient ensembles, as training ensembles of certain classifiers such as Support Vector Machine and Softmax Regression can be very time-consuming.