Using a New Algorithm in Machine Learning Approaches to Estimate Level-of-Service in Hourly Traffic Flow Data in Vehicular Ad Hoc Networks

The primary goals of transportation agencies and researchers studying traffic operations are to ease traffic and increase road safety through the use of vehicular ad hoc networks. Agencies cannot achieve these goals without reliable and consistent data on the current traffic situation. The Level-of-Service (LOS) index is a helpful measure of freeway traffic operations. Conventional fixed-location cameras and sensors are impractical and expensive for gathering reliable traffic density data on every road in large networks. Flow data is a new, low-cost option that has the potential to improve safety and operations. This


1-Introduction
In a vehicular ad hoc network (VANET), traffic conditions cannot be accurately assessed without the help of intelligent transportation systems (ITS). ITS traffic measurements are used for road work planning, traffic operations, congestion management, and the assessment of traffic queues, among other purposes. To estimate traffic performance and state, the Highway Capacity Manual (HCM) defines six levels of service (LOS) and offers formulas for calculating LOS from traffic volume and road conditions [1]. The speed, flow, and density of traffic are central to LOS evaluation [1,2,3]. Transportation agencies often require hourly data on the traffic situation and LOS for various stretches of freeway, either in real time or historically. Fixed-location sensors such as remote traffic microwave sensors (RTMS), loop detectors, laser sensors, magnetic sensors, license plate recognition (LPR), and video image systems [4,5] have long been used to collect traffic data (travel time, speed, density, and flow). Data collection techniques that rely on stationary nodes are costly and space-consuming. Recently, data-driven ITS has produced multisource, high-performance solutions for transportation systems [6]. The use of "probe vehicles" and "floating cars" for data collection has received considerable attention. These strategies collect traffic information through technologies such as connected vehicles (CVs), Wi-Fi, Bluetooth sensors, cellular networks, and smartphones [7,8]. These tools not only open up new possibilities for collecting crowdsourced data, but also produce valuable information for a variety of transportation analyses, including those concerned with traffic safety [9,10,11,12], public transit [13,14], and energy consumption and emissions [15,16]. Big data is being used in the transportation sector to propose ideas and solutions that have not been explored before [17].
Predicted traffic flows are a key input into LOS calculations for highways, and as a result they can help drivers and passengers make more informed decisions about which routes to take. Knowing when and where congestion will occur is helpful for transportation planning because it allows experts to allocate resources to the roads at risk, which can reduce congestion over time. To that end, traffic flow prediction [18] [19] [20] has become a hot topic in recent years as a means to estimate LOS because of its substantial advantages over other devices. Since VANET uses traffic flow data, many city governments and departments of transportation (DOTs) have entered into agreements with data providers such as MIDAS. Flow data has been used in many ways, for example to measure performance and to detect problems. The focus of this paper is on MIDAS. In the UK, the MIDAS system consists of a network of traffic sensors, mostly inductive loops, that send traffic volumes and average speeds to a regional control center (RCC). The RCC can then set variable message signs and advisory speed limits automatically. The availability of flow data creates an opportunity to develop a new way to measure LOS based on the features and characteristics of the data. This study proposes a new method for measuring LOS on freeways in VANET that uses technical indicators derived from flow data. With this method, new tools for LOS assessment and hourly traffic status data on freeways can be developed without fixed traffic volume sensors. The proposed method can be regarded as a complement to the traditional HCM LOS calculation method, which is based on the volume and speed of traffic. The rest of this paper is organized as follows. The next section, "Methodology," presents the traditional and the proposed ways of determining LOS, along with the data mining methods used. The data used in this study are then described, followed by the results of applying the methodology. The paper closes with suggestions for further research.

2-Related Works
This part reviews the literature most relevant to this study, summarizes traffic status and LOS assessment methods, and discusses the research gaps. Studies have typically relied on single or multiple parameters, such as traffic flow [21], traffic speed [22], and traffic density [23], to explain traffic status and LOS. Previous research has relied on a wide variety of methods and data sets, including sensor readings [24], probe vehicles [25], camera videos and images [26], CVs [2], and simulation environments [2] [23]. Regarding approach, statistical modeling [23], artificial neural networks [24,25], Kalman filters (KF) [25], image processing [27], and machine learning (ML) [21,26] have all seen extensive use. Table 1 presents the most relevant studies that have proposed alternative methods for LOS assessment. As noted above, the past literature paid most attention to HCM density-based LOS. Some studies also used travel time and speed changes to determine LOS. No data on traffic flow has been used to determine LOS. This study fills that gap by integrating flow data into LOS assessment, with technical indicators as features. The results of this study can help agencies determine LOS for different segments without having to install new fixed-location equipment.
To meet the goals of the study and estimate hourly LOS based on flow data, the following machine learning classification methods were used:

1-Random Forest (RF):
RF is a classification technique that uses a collection of random decision trees to make a more accurate prediction than any single tree alone. Each tree is constructed separately from the others. The data is then classified using a majority vote across all trees, with Gini impurity serving as the function that measures the quality of the split at each node [32]. The Gini impurity at a given node N is defined as:
$$G(N) = 1 - \sum_{i} P_i^2$$
where $P_i$ is the proportion of the samples at node N with class label i.
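As a minimal sketch of the two ingredients named above, the following Python fragment computes the Gini impurity G = 1 - sum of P_i squared for the labels at one node, scores a candidate split by the size-weighted impurity of its children, and combines per-tree predictions by majority vote. The function names and the toy LOS labels are illustrative assumptions and are not taken from the study's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity G = 1 - sum_i P_i^2 for the class labels at one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted impurity of the two children of a split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def majority_vote(tree_predictions):
    """Combine the predictions of individual trees into one forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Toy labels (made up for illustration): a perfectly mixed two-class node.
node = ["A", "A", "B", "B"]
print(gini_impurity(node))                      # impurity of the mixed node
print(split_impurity(["A", "A"], ["B", "B"]))   # a split that separates the classes
print(majority_vote(["C", "B", "C"]))           # vote of three hypothetical trees
```

A pure node has impurity 0, so a tree-growing procedure would prefer the split that separates the two classes completely.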

2-Support Vector Machines (SVM):
Support vector machines are a well-known margin-based classification technique. The SVM algorithm determines the separating boundary (hyperplane) that provides the greatest distance, or margin, between a class and the other classes. The algorithm delineates boundaries and assigns classes to data by computing the optimal support vectors [33].
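For reference, the margin idea can be written as the standard hard-margin optimization problem for two linearly separable classes. This is the textbook formulation and is given here only as an illustration; the study does not state which kernel, soft-margin penalty, or multiclass scheme (for example one-vs-one or one-vs-rest across the six LOS classes) was used. The symbols w, b, x_i, y_i, and m are introduced here for illustration, with x_i a feature vector and y_i in {-1, +1} its class.

```latex
\min_{w,\,b} \; \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, m

% Width of the margin between the two classes:
\text{margin} = \frac{2}{\lVert w \rVert}
```

Minimizing the norm of w is therefore equivalent to maximizing the margin, and the training points for which the constraint holds with equality are the support vectors.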

3-K-Nearest Neighbors (KNN):
KNN is a non-parametric technique that is commonly used for classification. In this approach, the entire set of training data is mapped onto a feature space with n dimensions (where n is the number of input features). For each new observation, the algorithm computes the Euclidean distance to the training points and finds the k closest neighbors. It then assigns the label that appears most often among those neighbors [34].
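The procedure described above can be sketched in a few lines of Python. The function name, the default k = 3, and the two-dimensional toy feature vectors are assumptions made for illustration only; in practice a library implementation would normally be used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training observation.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Most frequent label among the k neighbors; on a tie, the label that
    # was encountered first (that is, the nearer neighbor) wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set (made-up feature values, two features per observation).
train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["A", "A", "C", "C"]

print(knn_predict(train_X, train_y, (0.2, 0.4)))
print(knn_predict(train_X, train_y, (5.5, 5.0)))
```

Because the method relies on distances, features on very different scales would usually be standardized before use.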

3-Materials and Methods
In this paper, a method based on traffic flow data is used to determine the hourly level-of-service-based traffic status. This approach takes the volume of traffic on a given stretch of road and derives from it the technical indicators that characterize the state of traffic along that route. The study, detailed below, relies heavily on MIDAS traffic flow data. This section elaborates on the algorithm proposed in this research. The stages of the proposed method are depicted in the research framework (Figure 1).

Data collection
Massive amounts of data are continuously recorded by MIDAS. Storing MIDAS travel time and traffic data is the first step in conducting such an investigation. Data on traffic volumes and travel times were recorded at 15-minute intervals using a Python script. Raw real-world data always carries the risk of problems such as noise and missing values, so the data was cleaned and checked for errors before being used. As far as possible, missing values and outliers were identified or removed. The next step was to gather MIDAS traffic data in order to determine the hourly flow and the ground truth for the level of service. To evaluate the efficacy of the algorithm [35], MIDAS data was collected for the section of the M25, the United Kingdom's busiest highway, between Junctions 13 and 14, as shown in Figure 2.
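The step from 15-minute records to hourly flow can be illustrated with the short Python sketch below. It is not the script used in the study: the function name, the (timestamp, count) record layout, the example dates and counts, and the rule of discarding any hour that lacks four valid 15-minute slices are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(records):
    """Sum 15-minute vehicle counts into hourly totals.

    `records` is an iterable of (timestamp, count) pairs, where count may be
    None for a missing slice. Hours with fewer than four valid slices are
    dropped rather than imputed (an assumption of this sketch).
    """
    totals = defaultdict(int)
    valid_slices = defaultdict(int)
    for timestamp, count in records:
        if count is None or count < 0:  # treat missing or negative counts as invalid
            continue
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += count
        valid_slices[hour] += 1
    return {hour: totals[hour] for hour in sorted(totals) if valid_slices[hour] == 4}


# Made-up example: a complete 08:00 hour and a 09:00 hour with one missing slice.
records = [
    (datetime(2021, 3, 1, 8, 0), 100),
    (datetime(2021, 3, 1, 8, 15), 120),
    (datetime(2021, 3, 1, 8, 30), 130),
    (datetime(2021, 3, 1, 8, 45), 110),
    (datetime(2021, 3, 1, 9, 0), 90),
    (datetime(2021, 3, 1, 9, 15), None),
    (datetime(2021, 3, 1, 9, 30), 95),
    (datetime(2021, 3, 1, 9, 45), 99),
]
print(aggregate_hourly(records))
```

Dropping incomplete hours keeps the hourly totals comparable; an alternative would be to scale the partial sum, at the cost of introducing estimated values.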

Fig. 2: Part of the M25 highway chosen for the study from Open Street Map

3.2 Exponential Smoothing
With exponential smoothing, more weight is placed on recent observations, while older observations receive weights that decrease at an exponential rate. The exponentially smoothed statistic of a series Y is computed recursively as:
$$S_0 = Y_0, \qquad S_t = \alpha Y_t + (1 - \alpha) S_{t-1} \quad \text{for } t > 0$$
where $\alpha$ ($0 < \alpha \le 1$) is the smoothing factor. Larger values of $\alpha$ reduce the degree of smoothing; when $\alpha = 1$, the smoothed statistic is identical to the raw data. The smoothed statistic $S_t$ can be computed as soon as consecutive observations are available. Through this smoothing, the model is better able to detect the long-term trend in the behavior of traffic flows, because the effect of random variation or noise in the underlying data is removed. After the exponential smoothing of the time series, a feature matrix is constructed from which the technical indicators are derived.
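A minimal Python sketch of this recursion is given below, assuming the common convention that the first smoothed value equals the first observation (S_0 = Y_0) and that each later value is alpha * Y_t + (1 - alpha) * S_{t-1}. The function name and the example series are illustrative; the smoothing factor used in the study is not stated in this section.

```python
def exponential_smoothing(series, alpha):
    """Return the exponentially smoothed version of `series`.

    S_0 = Y_0 and S_t = alpha * Y_t + (1 - alpha) * S_{t-1} for t > 0.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in the interval (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]  # assumed initialization: S_0 = Y_0
    for y in series[1:]:
        smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Made-up hourly flows. With alpha = 1 the series is returned unchanged;
# smaller alpha values damp the hour-to-hour variation.
flows = [10, 20, 30]
print(exponential_smoothing(flows, 1.0))
print(exponential_smoothing(flows, 0.5))
```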

Feature extraction from data
The only variables considered are vehicle travel time and traffic flow over the course of several days. The input data can therefore be represented in the format (date, traffic flow). The following technical indicators are derived from the data:
• Average True Range (ATR): The ATR measures the deviation from the average over a given time period [36] and the size of the range over that time period. It is formulated as per Eq. (2), where TR denotes the true range.
• Relative Strength Index (RSI): The RSI is an oscillator bounded between 0 and 100. The name of this oscillator is deceptive because it does not make comparisons between instruments; rather, it depicts the current flow in terms of how it compares to values observed within the chosen lookback window [32]. The RSI is given by Eq. (6).
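To make the notion of a technical indicator concrete, the sketch below computes five of the indicators named in this paper (SMA, EMA, MOM, ROC, and RSI) for a single flow series. The study's own equations are not reproduced in this section, so the code uses common textbook definitions as an assumption: the EMA weight 2/(n+1), an RSI based on simple averages of gains and losses, and a value of 50 for a completely flat window. ATR is omitted because the definition of the true range for a single flow series is not recoverable from the text. All function names and sample values are illustrative.

```python
def sma(values, n):
    """Simple moving average of the last n values (None until n values exist)."""
    return [
        None if i + 1 < n else sum(values[i + 1 - n : i + 1]) / n
        for i in range(len(values))
    ]


def ema(values, n):
    """Exponential moving average with the common weight alpha = 2 / (n + 1)."""
    alpha = 2.0 / (n + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


def momentum(values, n):
    """MOM: current value minus the value n steps earlier."""
    return [None if i < n else values[i] - values[i - n] for i in range(len(values))]


def rate_of_change(values, n):
    """ROC: percentage change relative to the value n steps earlier."""
    return [
        None if i < n or values[i - n] == 0
        else 100.0 * (values[i] - values[i - n]) / values[i - n]
        for i in range(len(values))
    ]


def rsi(values, n):
    """RSI in [0, 100] from simple average gains and losses over n changes."""
    out = [None] * len(values)
    for i in range(n, len(values)):
        changes = [values[j] - values[j - 1] for j in range(i - n + 1, i + 1)]
        gain = sum(c for c in changes if c > 0) / n
        loss = sum(-c for c in changes if c < 0) / n
        if gain == 0 and loss == 0:
            out[i] = 50.0  # flat window: neutral value (assumption of this sketch)
        elif loss == 0:
            out[i] = 100.0
        else:
            out[i] = 100.0 - 100.0 / (1.0 + gain / loss)
    return out


# Made-up hourly flows; each row of the feature matrix would hold the
# indicator values for one hour.
flow = [100, 110, 105, 120, 130]
features = list(zip(sma(flow, 3), ema(flow, 3), momentum(flow, 2),
                    rate_of_change(flow, 2), rsi(flow, 4)))
print(features[-1])
```

Rows containing None correspond to hours for which the lookback window is not yet full and would normally be discarded before training.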

Ground Truth LOS:
Level of service (LOS) is a popular metric for gauging how well a given stretch of road is performing. Based on flow and road characteristics, the HCM classifies freeways and highways into six LOS groups. For highway sections, HCM uses traffic volume as the primary LOS metric [2]. The flow corresponding to each LOS is detailed in Table 2 [1]. In this investigation, the LOS was determined hourly from the traffic volume collected by MIDAS sensors, using the volume ranges in Table 2. The computed LOS was used as the standard of comparison. The LOS model presented below uses hourly input data labeled with these ground truth values. Because several machine learning methods were used in this study, they had to be compared to find the best one. The preferred model and features were chosen based on classification accuracy, recall, precision, f-score, and support. In this study, accuracy is the ratio of correctly predicted LOS labels to the total number of ground truth labels.
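The labeling step can be sketched as a simple threshold lookup. The flow boundaries in the code below are HYPOTHETICAL placeholders, since the values of Table 2 are not reproduced in this text; they must be replaced with the actual HCM thresholds before any real use. The function and variable names are likewise illustrative.

```python
from bisect import bisect_left

LOS_CLASSES = ["A", "B", "C", "D", "E", "F"]

# HYPOTHETICAL upper flow bounds (vehicles per hour) for LOS A to E.
# These are placeholders, not the values of Table 2.
PLACEHOLDER_UPPER_BOUNDS = [1000, 2000, 3000, 4000, 5000]


def los_from_flow(flow_veh_per_hour, upper_bounds=PLACEHOLDER_UPPER_BOUNDS):
    """Return the LOS letter whose flow band contains the given hourly flow.

    A flow equal to an upper bound is assigned to the lower (better) class;
    anything above the last bound is labeled F.
    """
    return LOS_CLASSES[bisect_left(upper_bounds, flow_veh_per_hour)]


# Made-up hourly flows labeled with the placeholder thresholds.
hourly_flows = [850, 2400, 4500, 5200]
print([los_from_flow(f) for f in hourly_flows])
```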

4-Experiments and Results
As a first step, this section supplies summary statistics for all of the data sources used. Next, the findings of the ML models are shown. All of the analyses and visuals in this section were created using the Python programming language. Missing values accounted for no more than one percent of any dataset. When determining traffic flow data for the M25, factors such as profile diversity, profile reputation, and profile geometry validity were considered. The efficiency of the model allows for a range of values, which were taken into account while simulating traffic on the busiest highway. The model's robustness should also be assessed. Naturally, it is easier to predict the free flow or breakdown of traffic that is relatively stable than of traffic that is relatively noisy. From an engineering perspective, less variation in the data corresponds to greater stability and means that ML classifiers can make more accurate predictions. Accuracy, recall (also known as sensitivity), precision, and f-score are the performance metrics used to assess the stability of a multiclass classifier.
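For clarity, the metrics named above can be computed per class as in the following Python sketch. The function name and the toy label lists are illustrative assumptions; the per-class definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), f-score as their harmonic mean, and support as the number of true instances of the class) are the standard ones.

```python
def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, f1, and support."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, report


# Made-up ground truth and predictions for three LOS classes.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]
accuracy, report = per_class_metrics(y_true, y_pred)
print(accuracy)
for label, scores in report.items():
    print(label, scores)
```

Reporting support alongside the other metrics matters when some LOS classes are rare, because a high overall accuracy can hide poor recall on the under-represented classes.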

Using Machine Learning for LOS Classification
This research used three machine learning models to classify LOS. KNN, SVM, and RF achieved the highest accuracy rates of all the methods tried. Exponential smoothing was applied to remove unexpected local variation. Figure 3 shows that the results from RF, KNN, and SVM are superior to those of the previously used classifiers. To choose the best model, the hyperparameters of each technique were tuned by searching over a grid of candidate values, as described in this section. Table 3 displays the results of LOS classification using data from the M25 highway, together with the performance metrics used to assess the reliability of a multiclass classifier.
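The idea of tuning over a grid of hyperparameter values can be sketched generically as follows. This is an illustration, not the study's tuning code: the function name, the parameter names "k" and "weight", and the toy scoring function are hypothetical stand-ins for training a model and measuring its validation accuracy. With scikit-learn, the same idea is typically expressed with `GridSearchCV` and cross-validation.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one.

    `param_grid` maps parameter names to lists of candidate values and
    `score_fn` maps a parameter dict to a score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy scoring function (hypothetical): in practice this would fit a
# classifier with `params` and return its validation accuracy.
def toy_score(params):
    return -abs(params["k"] - 3) + params["weight"]


grid = {"k": [1, 3, 5], "weight": [0.5, 1.0]}
print(grid_search(grid, toy_score))
```

Exhaustive search is practical only for small grids, since the number of combinations grows multiplicatively with each added hyperparameter.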

Sensitivity Analysis
Finally, the importance of each technical indicator was investigated via a sensitivity analysis applied to the hourly random forest model. A sensitivity analysis identifies the input parameters that most affect robustness and model performance [40].
In this analysis, each technical indicator was removed from the model input in turn, and the accuracy of the model was then checked. Since the factors are swapped out and the model is reevaluated after each iteration, this method is known as a parametric bootstrap [41]. The results for each eliminated technical indicator are summarized in Table 4. With no indicator removed, the model reached a very high accuracy (93.16%). The SMA was the most important technical indicator, with a drastic drop in model accuracy (to 87.28%) after its removal.
When ATR was removed from the model, the accuracy stayed very close to the original (93.11%), making it the least important parameter. Although accuracy increased when MOM was removed, the overall profile of the results was not quite as good as with all the technical indicators, so MOM also has low significance. Based on these results, the SMA appears to be a reliable technical indicator for LOS estimation.
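The leave-one-indicator-out procedure described above can be sketched as follows. The helper names are illustrative, and the `evaluate` callback is a hypothetical stand-in for retraining the random forest on the retained indicators and returning its accuracy. The toy contribution values in the example are made up and are NOT the results reported in Table 4.

```python
BASELINE_KEY = "(none removed)"


def ablation_study(features, evaluate):
    """Accuracy with all features, then with each feature left out once."""
    results = {BASELINE_KEY: evaluate(list(features))}
    for removed in features:
        kept = [f for f in features if f != removed]
        results[removed] = evaluate(kept)
    return results


def rank_by_accuracy_drop(results):
    """Order features from largest to smallest accuracy drop on removal."""
    baseline = results[BASELINE_KEY]
    drops = {f: baseline - acc for f, acc in results.items() if f != BASELINE_KEY}
    return sorted(drops, key=drops.get, reverse=True)


# Toy stand-in for model retraining: each indicator adds a made-up amount
# to a made-up base accuracy. Real use would train and score a classifier.
toy_contribution = {"SMA": 5, "EMA": 4, "ATR": 1}


def toy_evaluate(kept_features):
    return 80 + sum(toy_contribution[f] for f in kept_features)


results = ablation_study(["ATR", "SMA", "EMA"], toy_evaluate)
print(results)
print(rank_by_accuracy_drop(results))
```

An indicator whose removal causes the largest drop is the one the model depends on most; a drop near zero suggests the indicator is redundant given the others.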

Feature importance
It is hypothesized in this research that LOS classification accuracy can be enhanced by using data from technical indicators. The chosen RF model was therefore used as the basis for an importance analysis of the variables involved (Fig. 4). The RF model makes it possible to calculate the mean decrease in the Gini index attributable to each variable; a higher value of this index indicates a more important variable. The SMA and EMA of traffic flow carry significantly more weight when calculating LOS. Figure 4 shows that ROC, MOM, and ATR are all less significant in the classification of LOS. To better predict LOS, this research proposes a new method that combines traffic flow data and machine learning algorithms. However, this research was not without its limitations. The sensitivity of the methodology to factors such as speed and weather conditions was not considered in this investigation. In addition, spatial flow variation was disregarded.
Spatial variation, which can be gathered through further study, is potentially useful for assessing LOS in the future. The difference in flow between the upstream and downstream sections could be taken into account when estimating LOS. Deep learning and other sophisticated approaches can be used for this purpose. In the future, researchers may be able to capture spatial and temporal variation in the same study by employing deep neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). By analyzing whether including technical indicator (TI) characteristics improved traffic flow forecasting accuracy, this article assessed the explanatory power of TIs. In general, it has been shown that TIs may capture the effects of behavioral biases in traffic flow, resulting in significantly lower prediction errors when using ML models. Our research showed that ATR, SMA, EMA, RSI, ROC, and MOM are the most effective TIs for predicting traffic volumes. In particular, our findings recommend including these TIs in the proposed ML models. TI performance was found to vary by model; however, including EMA and SMA raised the accuracy of the ML models from 88.88% and 87.28%, respectively, to 93.16%.

5-Conclusion
Rapidly expanding quantities of data on traffic flows in VANET are now available, and machine learning provides a means of analyzing them. This research provided a fresh approach to using flow information in LOS evaluation. The UK's M25 freeway between junctions 13 and 14 was used for the experiment. Six input metrics (ATR, SMA, EMA, RSI, ROC, and MOM) were generated from the acquired MIDAS traffic data. LOS was classified hourly using machine learning algorithms. The traffic density and LOS ground truth were estimated using HCM density criteria, with both calculated using data received from fixed-position loop sensors. Using a combination of machine learning and data on traffic flows, this study shows that level of service in VANET can be estimated with some degree of accuracy. The outcomes demonstrated that incorporating technical indicators as input can considerably raise the accuracy of the model. Compared with the other classification approaches on the training datasets, RF performed best (accuracy = 93.16 percent).
It is hoped that this study will encourage others to investigate the possibilities of technical feature engineering, because this is the first study to apply technical indicator features to level-of-service predictions in vehicular networks. Although some of the most common and basic technical analysis indicators have been shown to explain the phenomena studied here, more advanced technical analysis indicators may be better at making predictions, and this is an area on which future studies may focus.

Appendix A (Hyperparameters)
The hyperparameters of the optimal model are listed in Table 4 below. The machine learning models were implemented with scikit-learn [42].

The important data fields in the MIDAS traffic flow dataset [43] are shown in Table 5 below. Files are produced monthly for each site. Since Highways England is in charge of all the major highways, junctions, and motorways, each file only has flow, speed, and day logs from those places.

Table 5: Important data fields in the MIDAS traffic flow dataset
Total Flow vehicles less than 5.2m: Number of vehicles less than 5.2m detected on any lane within the 15-minute time slice.
Total Flow vehicles 5.21m - 6.6m: Number of vehicles between 5.21m and 6.6m detected on any lane within the 15-minute time slice.
Total Flow vehicles 6.61m - 11.6m: Number of vehicles between 6.61m and 11.6m detected on any lane within the 15-minute time slice.
Total Flow vehicles above 11.6m: Number of vehicles above 11.6m detected on any lane within the 15-minute time slice.
Speed Value: Average speed in km/h of all vehicles for all lanes measured by the site over the 15-minute period.
Quality Index: Indication of the quality of the data provided, given as the number of valid one-minute records reported and used to generate the Total Traffic Flow and speed. A quality index of 0 indicates no valid records.