Maintenance of Web-Based Applications Using Feature Location

Maintaining web applications is a complex effort. FilkomApps is the web application used by the Faculty of Computer Science of Universitas Brawijaya to manage academic affairs, student theses, faculty assignments, inventory, attendance, and honoraria. It comprises about 28K files (HTML, PHP, JS, CSS). Feature location can support the maintenance of such web applications by locating the files that implement a specific feature. The process comprises preprocessing (tokenizing, web language syntax removal, splitting, stop word removal, and stemming), indexing (VSM Lucene), and evaluation (precision and recall). The experiments were conducted by querying keywords that originate from previous maintenance modifications and from features of the system. The resulting precision was 86% and the recall was 47%. The precision was about 3.74 times (374%) that of the conventional method (using the IDE search feature).
Keywords: Feature, PHP, Event Programmers


Introduction
Software maintenance is the process of keeping software operational and performing smoothly. Software may need to be fixed when a bug is found, and it sometimes has to evolve to cope with changing requirements. Many tasks relate to changing software; fixing a bug, adding functionality, and modifying data are among the most common. To fix a bug, add a feature, or modify existing behavior, programmers and analysts must know which area of the code corresponds to the specific feature or bug. The requirements documentation and the software or database design can be used as a basis for locating the area where a bug arises. Unfortunately, many projects have very limited documentation for various reasons. Feature location (FL) is the name given to the set of activities for locating a specific feature in software code.
Much research has been done to address the FL problem. Some studies used a fragment model to locate features in model-driven development [1]. Another FL study used crowd-based screencasts to help programmers find code samples from video tutorials [2]. Many others used information retrieval (IR) to locate features within source code [3][4][5][6][7]. Most of them used Java code as the data set for their experiments, and their results were very impressive. Unfortunately, there has been very little FL research using PHP code as the data set. PHP allows a free choice of development approach, such as a Structural Approach (SA), an Object-Oriented (OO) approach, or even a hybrid (SA+OO). PHP is usually combined with JavaScript and CSS to improve the appearance and usability of the web application.
PHP is the most popular server-side programming language in the world [8]. The faculties of Universitas Brawijaya use web-based open source technology to create their web applications. Based on our interviews, most of them use PHP as the base language. Their web-based applications are usually managed by a unit called TIK. The TIK consists of people who work as programmers, DevOps engineers, and network administrators. They work together to create, maintain, and change the web-based applications. For large systems, they sometimes do not build the applications themselves but outsource them to a third party. TIK personnel are not always permanent; some move on in their careers or resign. New personnel may find it difficult to maintain the web applications, fix bugs, or add features. As a result, the systems may become buggy, slow, or stop working altogether. The simple but expensive way to handle this is to commission a new application with more complete features from a different third party.
FilkomApps [9] is the application used by the Faculty of Computer Science of Universitas Brawijaya to manage academic affairs, student theses, faculty assignments, inventory, attendance, and honoraria. It is developed in PHP combined with a large amount of JavaScript and CSS. It has about 6K files (HTML, PHP, JS, CSS). New TIK personnel sometimes have difficulty managing changes to the application and fixing its bugs.
Razzaq et al. investigated which FL techniques best determine feature locations [7]. They used open source Java projects to measure which technique performs best. The techniques they compared were VSM_Lucene, LSI, and LDA. They concluded that VSM_Lucene has the best performance among them.
This research addresses how the findings of Razzaq et al. can be applied to feature location in PHP-based projects. PHP has different characteristics from Java. PHP is interpreted, whereas Java is compiled. PHP is much more flexible, which tends to make PHP code less structured than Java code. PHP code also usually depends on other languages such as HTML, JavaScript, and CSS.

Research Method
The methods used in this study to accomplish these goals are illustrated in figure 1. The research consists of several processes, which are explained in sections 2.1 to 2.5.

Dataset Selections
Dataset selection was the process of defining which files were chosen as data. The FilkomApps source code was read and saved into a table in a MySQL database. FilkomApps contains about 6K files (HTML, PHP, JS, CSS). Some of the files were excluded from the dataset: settings files, configuration files, fonts, icons, CSS, JS, and so on. The data were then saved into the table with three attributes: the name of the file, the full path of the file, and the code. After the removal process, about 2K files remained.
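The selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the kept suffixes, the excluded directory names, and the function `select_dataset` are assumptions, and the records are returned in memory rather than written to a MySQL table.

```python
from pathlib import Path

# Hypothetical selection rules; the text only says that settings,
# configuration, font, icon, CSS, and JS files were excluded.
KEPT_SUFFIXES = {".php", ".html"}
EXCLUDED_DIR_NAMES = {"config", "fonts", "icons"}


def select_dataset(root):
    """Return (file name, full path, code) records for the files kept as data."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in KEPT_SUFFIXES:
            continue
        # Skip files that live under an excluded directory.
        parent_dirs = path.relative_to(root).parts[:-1]
        if any(part.lower() in EXCLUDED_DIR_NAMES for part in parent_dirs):
            continue
        code = path.read_text(encoding="utf-8", errors="replace")
        records.append((path.name, str(path), code))
    return records
```

Each record carries the same three attributes as the table described above, so it could be inserted into a database row by row.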

Data Preprocessing
Data preprocessing was the stage that converted the code into a stemmed version of the code. The purpose of this stage was to split identifiers, remove non-PHP content, and reduce the identifiers to simple (stemmed) words. The details of the preprocessing are described in the sections below.

Strip Tag Removal
Strip tag removal was the process of removing the symbols that remained after the previous process. It was implemented with a regular expression that matches specific symbols such as "<", ">", "/", "-", "(", and ")". The results were saved into the database in a field named nonstrigtags.
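A minimal sketch of this symbol removal is shown below. The character class covers only the symbols listed in the text, and the function name `strip_symbols` is an assumption; the exact pattern used in the study is not given.

```python
import re

# Symbols named in the text: < > / - ( ). The exact pattern used in the
# study is not stated, so this character class is an assumption.
SYMBOLS = re.compile(r"[<>/\-()]")


def strip_symbols(text):
    """Replace each listed symbol with a space, then collapse whitespace."""
    without_symbols = SYMBOLS.sub(" ", text)
    return " ".join(without_symbols.split())
```

Replacing symbols with a space instead of deleting them keeps neighbouring words apart, so `getUser(id)` yields two tokens rather than one fused string.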

Web Language Removal
Web language removal was the process of removing the keywords of PHP, HTML, JavaScript, CSS, and SQL from the documents. The PHP keywords were collected from the PHP manual and other resources; there were about 11K of them. The keywords of JavaScript, HTML, and CSS, about 200 in total, were also eliminated from the documents. The SQL keywords, based on MySQL, were removed as well; there were about 825 of them. After the web language removal, the documents contained only the identifiers originally created by the programmers. This approach was intended to minimize the redundancy of keywords.
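The filtering idea can be sketched as a set lookup. The keyword sets below are tiny illustrative samples, not the lists used in the study, and `remove_language_keywords` is a hypothetical helper name.

```python
# Tiny illustrative keyword sets; the lists in the study were far larger
# (about 11K PHP, 200 JavaScript/HTML/CSS, and 825 SQL keywords).
PHP_KEYWORDS = {"function", "echo", "return", "foreach", "array", "isset"}
WEB_KEYWORDS = {"div", "span", "var", "document", "color"}
SQL_KEYWORDS = {"select", "from", "where", "insert", "update"}
LANGUAGE_KEYWORDS = PHP_KEYWORDS | WEB_KEYWORDS | SQL_KEYWORDS


def remove_language_keywords(tokens):
    """Keep only the tokens that are not keywords of the web languages."""
    return [token for token in tokens if token.lower() not in LANGUAGE_KEYWORDS]
```

The comparison is case-insensitive because SQL keywords in particular are often written in upper case.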

Stop Word Removal
Stop word removal was the step that removed common words in English and Bahasa Indonesia, as well as common terms used in the FilkomApps domain. The Bahasa Indonesia stop words were removed because programmers might use Bahasa Indonesia in identifier names. The common FilkomApps terms were also removed to reduce the number of occurrences of those terms.

Stemming
Stemming was the step that reduced words to their base form, for example "Articles" to "articl". This ensured that all variants of a word share the same form. The process was done using the Porter stemmer. Its results were used in the subsequent processes.

Indexing
Indexing was the primary process after the preprocessing phase. It was done with VSM Lucene, as recommended by Razzaq et al. [7]. According to Razzaq et al., VSM Lucene performed much better than other methods such as LDA [10] and LSI [11]. VSM Lucene was run using Apache Solr [12]. The documents/files were extracted word by word to form the index.

Query Phrase/Keyword
The query was the process of finding a feature location using specific words. The words given by the user were preprocessed using the same steps as in the previous phase (HTML tag removal, strip tag removal, web language removal, identifier splitting, stop word removal, and stemming). This was intended to ensure that the words given by the user received the same treatment as the dataset.
The queries were given by the user based on the features of the dataset (FilkomApps) or on changes that the programmer had made to modify those features. The number of queries was limited because they were based on previous maintenance efforts. The words used as queries are listed in table 1.
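The identifier splitting step named in the preprocessing list can be sketched with a single regular expression. This is an assumption about how the splitting might work, since the text does not describe the rule; the function `split_identifier` is hypothetical.

```python
import re

# Matches an acronym run ("HTML" in "HTMLParser"), a capitalised or
# lower-case word, or a run of digits. Underscores are skipped.
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(identifier):
    """Split camelCase, PascalCase, and snake_case names into lower-case words."""
    return [word.lower() for word in WORD.findall(identifier)]
```

Applying the same function to both the indexed identifiers and the query words means that a query such as "student thesis" can match an identifier such as `getStudentThesis`.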

Ranking
To find the relevant documents/files that contain a specific feature, ranking was done with VSM Lucene. The ranking is obtained by scoring the query against the index using the score in equation 1 [12].
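The general idea of vector space ranking can be illustrated with a plain TF-IDF cosine similarity. This sketch is a simplification and is not Lucene's scoring formula, which includes additional normalisation factors; the smoothed IDF variant and the function `rank` are assumptions made for illustration.

```python
import math
from collections import Counter


def rank(query_tokens, documents):
    """Rank documents (name -> token list) by TF-IDF cosine similarity."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    def idf(term):
        # Smoothed inverse document frequency; always positive.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    def vector(tokens):
        return {term: count * idf(term) for term, count in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vector(query_tokens)
    scores = {name: cosine(query_vec, vector(tokens))
              for name, tokens in documents.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Documents that share no term with the query are not recommended.
    return [(name, score) for name, score in ranked if score > 0.0]
```

Documents and queries are expected to be token lists that have already passed through the same preprocessing, so that stemmed query terms match stemmed index terms.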

Evaluation
The final stage was the evaluation. To measure the quality of the feature location recommendations, we used precision and recall, the general evaluation measures of information retrieval (see equations II and III) [13]. The relevant items were defined with the help of an expert (the programmer of FilkomApps). The evaluation was done by comparing the recommended ranking results against the results given by the expert.
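For reference, the standard information retrieval definitions of the two measures are given below. The set names are chosen here for illustration: Ret is the set of files retrieved for a query and Rel is the set of files the expert marked as relevant.

```latex
\mathrm{Precision} = \frac{|\mathrm{Rel} \cap \mathrm{Ret}|}{|\mathrm{Ret}|},
\qquad
\mathrm{Recall} = \frac{|\mathrm{Rel} \cap \mathrm{Ret}|}{|\mathrm{Rel}|}
```

Precision therefore drops when many irrelevant files are retrieved, while recall drops when relevant files are missed.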

Results and Analysis
A series of experiments was conducted to assess how successful the method was. The challenge in the experiments was choosing the right keywords. When too few keywords are given, the answers have low precision and high recall. After the environment had been set up, the experiments were run using several keywords. The keywords were taken from previous maintenance efforts done by a programmer of FilkomApps and from some of the features provided by FilkomApps. The experiments are shown in table 2.
As a comparison, we also measured the same queries using a conventional method, namely entering the same keywords into the search menu of an IDE (Visual Studio Code). This method was chosen because it is a very common activity among programmers. The search results were measured using precision and recall. The results of the comparison experiment are shown in table 3.
In table 2, the field "# Files retrieved" is the number of files displayed as a result of searching with the keyword given in the field "Keyword". The field "# Relevant retrieved files" is the number of files that were displayed and relevant (correct files). The field "#Relevant files" is the number of files that were relevant/correct for the keyword according to the expert (the programmer of FilkomApps).
According to the experiment results, our technique found 53 relevant items/files among 66 retrieved items/files. It achieved a precision score of 86%. Our technique also found 53 of the 111 relevant files. The recall score was 46%.
We also conducted the experiments using the conventional method mentioned previously. The queries were the same as in the previous experiment. The comparison experiment is shown in table 3. The conventional method found 42 relevant items/files among 182 retrieved items/files, and 42 of the 111 relevant files. It gave a precision of 23% and a recall of 37%.
According to the results of the experiments, our method achieves much better precision than the conventional method. The precision ratio was about 374% (86%:23%); that is, our precision was roughly 3.74 times that of the conventional method. The conventional method achieved its lower value because the IDE search feature does not split compound identifiers such as camelCase names into separate words.
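The pooled counts reported for the conventional method and the precision ratio can be rechecked with a few lines of arithmetic. The helper names are assumptions; the numbers are the ones reported above.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of the retrieved files that are relevant."""
    return relevant_retrieved / retrieved if retrieved else 0.0


def recall(relevant_retrieved, relevant):
    """Fraction of the relevant files that were retrieved."""
    return relevant_retrieved / relevant if relevant else 0.0


# Conventional method: 42 relevant files among 182 retrieved, 111 relevant overall.
conventional_precision = precision(42, 182)
conventional_recall = recall(42, 111)

# Ratio of the reported precision scores (86% against 23%).
precision_ratio = 0.86 / 0.23
```

The pooled values come out at roughly 0.23 and 0.38, in line with the reported 23% and 37% up to rounding, and the ratio of the reported precision scores is about 3.74.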
The recall measurements show a much smaller difference. The recall of the conventional method was about 20% lower than that of our technique (46%:37%), so our technique also performed better on recall. This happened because the conventional method did not recommend any documents for some queries (see table 3, experiments number 4, 7, and 8).