Improved Methodology to Support FUSL-AJEM Validation
by Thomas Johnson, Craig Andres, Russell Dibelka, Lindsey Butler, David Grimm, and Juliana Ivancik
As the U.S. military’s test and evaluation community increasingly relies on modeling and simulation (M&S) to complement the live testing of a wide range of ground, air, and sea systems, M&S validation has likewise become increasingly critical for ensuring system evaluations are credible. For the system-level evaluation of armored fighting vehicles (AFVs), the Army’s Advanced Joint Effectiveness Model (AJEM) is used in combination with full-up system-level (FUSL) live fire (LF) testing to evaluate vehicle vulnerability. This article proposes improvements to a method—the Component Data Vector (CDV) method—that is used to validate AJEM. The method compares damage to components (e.g., driveshaft, engine, suspension, wiring, driver displays, weapons) sustained during FUSL testing to the predictions of that damage in AJEM. Though the vehicle application used herein is an AFV, the CDV methodology could also be extended to analogous vulnerability assessment tools used by the Navy and Air Force for other ground, air, and sea systems.
THE CDV METHOD
FUSL LF testing is an essential part of the vulnerability assessment of an AFV, wherein testers “attack” a fully loaded, combat-ready AFV with a wide variety of explosive mechanisms, including small- and mid-caliber munitions, shaped charge jets, artillery munitions, and underbody mines. To complement FUSL testing such as this, the Army, Navy, and Air Force have individually developed and maintained their own models—AJEM, the Advanced Survivability Assessment Program (ASAP), and the Computation of Vulnerable Area Tool (COVART), respectively—for assessing system-level vulnerability and lethality (V/L) for various combat systems. Despite the separate development of these models, they possess many similarities. They share libraries and modules, have similar inputs and outputs, and are composed of a collection of computationally inexpensive empirical models that have been calibrated to decades worth of test data and subject-matter expertise.
To determine the credibility of these and other models, a formal process of verification, validation, and accreditation (VV&A) is used. Validation, in particular, is the process of determining the degree to which a model or simulation and its associated outputs are an accurate representation of the real world from the perspective of the intended uses of the model [1]. The CDV method is a type of validation analysis.
AJEM and its predecessors have undergone extensive VV&A in the past. The Army has produced VV&A reports on 27 separate live fire test and evaluation (LFT&E) programs since 1998 [2]. These include major programs such as the Stryker, the M109 Family of Vehicles, and the Joint Light Tactical Vehicle. The Army conducts validation analyses at every layer of AJEM using data from tests of varying complexity, including tests conducted at the component, subsystem, and system levels. The CDV method focuses on validation using data from the highest level of testing that corresponds to the highest layer of AJEM, the FUSL LF test.
The method relies on component kill data because these data are rich, observable in FUSL testing, and directly comparable to AJEM output. Each component on the AFV (e.g., driveshaft, engine, suspension, wiring, driver displays, weapons) has an associated pair of data, with one value corresponding to a binary outcome indicating whether the component was observed in the FUSL event as killed or not and the second value corresponding to AJEM’s prediction of the probability that the component was killed. Given that an AFV has many components, a FUSL event results in a vector of pairs, with each pair in the vector corresponding to a single component.
The CDV method was first applied to AJEM validation in 1998 as part of the Bradley Fighting Vehicle program and was applied most recently (in 2022) as part of the Armored Multi-Purpose Vehicle program. In the former application, the CDV method led to the discovery that AJEM underpredicts the number of components damaged from certain threats. In the latter, the CDV method underscored another known limitation of AJEM—an inability to predict component damage caused by certain secondary threat effects, such as ricochet. The CDV method thus helps identify, quantify, and prioritize problems within AJEM so that follow-on development efforts can improve the underlying algorithms.
In the past, the method has used a variety of different analysis techniques and result presentations. Many techniques focused on low-level results by providing extensive detail on each component damaged in each FUSL event [3–5]. However, the unique aspect of the improvements proposed herein, which complement past efforts, is that they focus on high-level results. This focus addresses two main goals: (1) providing a concise validation assessment for an entire FUSL test series, and (2) revealing overarching trends across the test series.
CDV DATA
As with system-level testing in other fields, FUSL test data are typically in short supply because of the high cost of producing such data. The most common data that FUSL testers collect pertain to the state of the crew, exterior armor, and critical components. The CDV method focuses exclusively on critical component data.
A critical component is defined as any component that, when killed, results in a loss of function (LoF) for any of the relevant metrics (mobility, firepower, etc.) for the vehicle. A specialized group within the integrated product team is responsible for identifying the critical components on the vehicle; it is staffed by soldiers, maintainers, and engineers who have specific knowledge of how damage affects vehicle operation and experience repairing damaged vehicles. This group identifies critical components by considering important design features, the layout of the reliability fault tree, corporate experience with similar test programs, battlefield observations, and the importance of a given component relative to various missions.
A straightforward approach to quantifying the effects of a threat engagement on the vehicle is to assess the kill status of each critical component. After each FUSL event, the test team inspects the critical components on the vehicle, attempts to power up the vehicle to diagnose the components’ residual functionality, and holds a damage assessment meeting to review notes and video footage. The team’s evaluation culminates with an assessment of each critical component on a binary scale: killed or not killed. A critical component is defined as killed if the component was physically harmed and suffered a loss of functionality, necessitating repair or replacement to restore functionality prior to the next FUSL event.
In AJEM, component kill is defined as a probability. AJEM is a stochastic simulation that incorporates numerous sources of uncertainty that produce a nondeterministic output. Modelers typically conduct 1,000 AJEM iterations per shot; and from iteration to iteration, the components that AJEM predicts to be killed vary. In each iteration, the component kill data are binary outcomes, but they become probabilities when averaged across the 1,000 iterations, yielding a single probability of kill for each component on the vehicle.
Given these definitions, we denote component kill as follows. Let yi denote the component kill outcome observed in FUSL testing for the ith component, where yi equals 1 or 0 if the component was killed or not killed, respectively; i = 1, 2, …, n; and n is the number of critical components on the AFV. Additionally, let pi denote AJEM’s predicted probability that the ith component was killed. Given a FUSL event, one may collect the following vector of data: (y1, p1), (y2, p2), …, (yn, pn). CDV analysis synthesizes this vector of paired data into metrics that summarize the discrepancy between the FUSL observations and AJEM predictions.
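To make this structure concrete, the following Python sketch builds the vector of paired data for a single notional FUSL event. The array of AJEM iteration results, the kill probabilities, and the observed outcomes are all notional and for illustration only; they do not reflect AJEM’s actual file formats or outputs.

```python
import numpy as np

# Notional AJEM output for one FUSL event: 1,000 Monte Carlo iterations by
# n critical components, where each entry is 1 if the component was predicted
# killed in that iteration and 0 otherwise.
rng = np.random.default_rng(seed=1)
n_iterations, n_components = 1000, 5
ajem_iterations = rng.binomial(n=1, p=[0.90, 0.40, 0.05, 0.02, 0.70],
                               size=(n_iterations, n_components))

# Averaging across iterations yields p_i, AJEM's predicted probability of kill
# for each critical component.
p = ajem_iterations.mean(axis=0)

# Notional FUSL-observed outcomes y_i for the same components
# (1 = killed, 0 = not killed).
y = np.array([1, 0, 0, 0, 1])

# The CDV for this event is the vector of (y_i, p_i) pairs.
cdv = list(zip(y, p))
print(cdv)
```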
To further illustrate the format of CDV, consider the following simple example. Figure 1 shows an AFV with five critical components. The left portion of the figure displays AJEM predictions, while the right portion shows FUSL test observations. Results from this simple example are then tabulated in Table 1.
Table 1. Example of CDV Data for a Single FUSL Event
In practice, however, the AFV will have many more critical components, and the CDV data will typically include data from numerous FUSL events as part of a comprehensive test series. This results in the generalized data format shown in Table 2. Additional columns, not shown in the table, may also provide information pertaining to factors and their levels, component descriptions, test notes, and AJEM settings. Data of this format serve as the input to the forthcoming CDV analysis.
Table 2. Example of CDV Data for a FUSL Test Series
PROPOSED IMPROVEMENTS TO CDV ANALYSIS
CDV analysis quantifies the discrepancy between AJEM-predicted and FUSL-observed component damage. A defining feature of this analysis is the underlying data being compared—a vector of data pairs of predicted probabilities and binary test outcomes. A comparison involving this type of data is not uncommon in machine learning and statistics.
For instance, an appropriate framework for conducting this comparison is called Predicted Probabilities Validation (PPV) [6, 7]. PPV is most commonly used to validate logistic regression models, but it applies to more complex models too. PPV synthesizes a vector of paired data into one or more validation metrics that describe the discrepancy between the predicted probabilities and binary test outcomes.
We recommend the computation of numerous metrics and organize them into three groups: Counts, Averages, and Validation metrics. The Counts, which appear in Table 3, address details related to FUSL test results, the scope of FUSL testing, and the size of the vector of paired component data. The Averages, which appear in Table 4, summarize the conditional distribution of AJEM predictions (conditioned on whether the components were observed as damaged or not). The Validation metrics, which appear in Table 5, originate from PPV literature and summarize the discrepancy between AJEM predictions and FUSL test outcomes.
Table 3. Counts
Table 4. Averages
Table 5. Validation Metrics
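As a concrete illustration of how such metrics can be computed from a single event’s CDV, the Python sketch below implements a few of the quantities referenced in the examples that follow: the counts of killed and not-killed components, the conditional averages of AJEM predictions, the Brier score [8], Somers’ D [10], and one common form of Spiegelhalter’s calibration statistic [9]. The exact metric definitions adopted in Tables 3–5 may differ; this sketch is illustrative only.

```python
import numpy as np

def cdv_metrics(y, p):
    """Illustrative Counts, Averages, and Validation metrics for one FUSL event.

    y : FUSL-observed outcomes (1 = killed, 0 = not killed)
    p : AJEM-predicted kill probabilities for the same critical components
    """
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)

    # Counts: vector length and number of components observed killed / not killed.
    counts = {"n_components": y.size,
              "n_killed": int(y.sum()),
              "n_not_killed": int((1 - y).sum())}

    # Averages: mean AJEM prediction conditioned on the observed outcome.
    averages = {"p_bar_y1": p[y == 1].mean() if (y == 1).any() else np.nan,
                "p_bar_y0": p[y == 0].mean() if (y == 0).any() else np.nan}

    # Brier score: mean squared difference between predictions and outcomes.
    brier = np.mean((p - y) ** 2)

    # Somers' D: concordance over all (killed, not-killed) component pairs;
    # pairs with tied predictions count as neither concordant nor discordant.
    p1, p0 = p[y == 1], p[y == 0]
    if p1.size and p0.size:
        diff = p1[:, None] - p0[None, :]
        somers_d = (np.sum(diff > 0) - np.sum(diff < 0)) / diff.size
    else:
        somers_d = np.nan

    # One common form of Spiegelhalter's calibration statistic.
    num = np.sum((y - p) * (1 - 2 * p))
    den = np.sqrt(np.sum((1 - 2 * p) ** 2 * p * (1 - p)))
    spiegelhalter = num / den if den > 0 else np.nan

    validation = {"brier": brier, "somers_d": somers_d,
                  "spiegelhalter": spiegelhalter}
    return counts, averages, validation
```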
EXAMPLES
To illustrate CDV analysis, we present two different examples involving notional data pertaining to a generic AFV. The first example illustrates a basic application of CDV analysis to a notional FUSL test series. The second example augments the first by using factors and levels to reveal high-level trends in the notional data.
The analytical approach in these examples aligns with exploratory data analysis. Each metric is computed once for each FUSL event in the test series. These metrics are then presented as boxplots, depicting the spread in the metrics across FUSL events. The left and right hinges of the boxplot correspond to the 25th and 75th percentiles of the computed metrics; the line in the middle of the box is the median; and the whiskers extend to the farthest computed metrics but no farther than 1.5 times the interquartile range. Metrics beyond the whiskers are considered outliers and are plotted as dots. Given the small sample sizes and rare-event nature of CDV data, we did not pursue parametric modeling or statistical inference. However, the exploratory data analysis presented in these examples may provide the insight to motivate such pursuits in the future.
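The short sketch below illustrates this presentation style: it computes one metric per notional FUSL event and draws a horizontal boxplot across events using matplotlib, whose default whisker setting matches the 1.5 times interquartile range convention described above. The event data are generated at random purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Notional CDV data for a test series: for each FUSL event, a vector of AJEM
# predictions and a vector of observed outcomes for 165 critical components.
rng = np.random.default_rng(seed=2)
events = []
for _ in range(23):
    p = rng.beta(0.2, 2.0, size=165)   # notional AJEM predictions
    y = rng.binomial(1, p)             # notional FUSL observations
    events.append((y, p))

# One metric per FUSL event (the Brier score here; other metrics work the same way).
briers = [np.mean((p - y) ** 2) for y, p in events]

# Horizontal boxplot of the per-event metrics; whis=1.5 is the 1.5 x IQR rule.
fig, ax = plt.subplots(figsize=(6, 2))
ax.boxplot(briers, vert=False, whis=1.5)
ax.set_xlabel("Brier score per FUSL event")
ax.set_yticks([])
plt.tight_layout()
plt.show()
```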
Example 1
Here, we provide CDV results pertaining to a notional FUSL test series. The Counts that appear in Figure 2 indicate that the FUSL test series comprises 23 FUSL events, that 79 components were observed as killed throughout the test series, and that the average length of the component kill vector (the number of critical components on the AFV) per FUSL event was 165.
The Counts also show boxplots of the number of components that were observed as damaged or killed per FUSL event. The maximum number of components observed as damaged or killed in a FUSL event was 16. Likewise, the minimum number of components observed as not killed in a FUSL event was 149.
The Averages in Figure 2 indicate that the median of the average AJEM-predicted probability of component kill per FUSL event, among components that were observed as killed (denoted as py1), was .16. An ideal outcome for AJEM is a value of py1 close to 1. Meanwhile, the median of the average AJEM-predicted probability of component kill, for components that were observed as not killed (denoted as py0), was approximately .001. An ideal outcome for AJEM is a value of py0 near 0.
The Validation metrics in Figure 2 indicate that the median Brier score was .004 and that the median Somers’ index was .990. By traditional rules of thumb, these results indicate good agreement between AJEM predictions and FUSL outcomes.
Example 2
Assessing the metrics in an absolute sense, as in Example 1, can leave much to be desired, given that threshold requirements for these metrics are not set in practice. This issue can be alleviated to some degree by assessing the metrics in a relative sense, by grouping the computed metrics using factors and levels, as is common in Design of Experiments [12].
In Example 2, the metrics computed per FUSL event are grouped by the type of threat that was used in each event of the notional FUSL test series. Here, we assume that all 23 FUSL events involved a Direct Fire or Indirect Fire weapon engagement. Threat Type is referred to as an independent variable or factor, which has two categorical levels (Indirect Fire and Direct Fire). The purpose of this grouping strategy is to discover whether AJEM predictions were better for one level compared to the other.
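A sketch of this grouping strategy follows, assuming a notional assignment of each of the 23 events to one of the two threat-type levels and notional per-event metric values; the pandas column names are illustrative and not drawn from any actual test database.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Notional per-event results: one row per FUSL event, with the event's
# threat-type level and a computed validation metric (e.g., a Brier score).
rng = np.random.default_rng(seed=3)
df = pd.DataFrame({
    "threat_type": ["Indirect Fire"] * 11 + ["Direct Fire"] * 12,
    "brier": rng.uniform(0.001, 0.010, size=23),
})

# Side-by-side boxplots of the metric, grouped by the factor's two levels, to show
# whether AJEM predictions differ between Indirect Fire and Direct Fire events.
fig, ax = plt.subplots(figsize=(6, 3))
df.boxplot(column="brier", by="threat_type", ax=ax)
ax.set_ylabel("Brier score per FUSL event")
ax.set_title("")
plt.suptitle("")
plt.tight_layout()
plt.show()
```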
The Counts in Figure 3 show that, among the 23 FUSL events in the FUSL test series, 11 events involved an Indirect Fire threat, while 12 events involved a Direct Fire threat. Figure 3 also shows that Direct Fire caused more damaged and killed components than Indirect Fire.
The results appear to indicate that AJEM performed better for the Indirect Fire threats. This is evident in py1, which is higher for Indirect Fire, suggesting AJEM was better at predicting components that were observed as killed in events involving Indirect Fire threats. It is also evident in the Brier Score and Spiegelhalter statistic, which are lower for Indirect Fire with nonoverlapping interquartile ranges. In practice, this result could motivate follow-on work to improve AJEM predictions relative to Direct Fire threats.
CONCLUSIONS
As shown, improving the CDV method has the potential to improve vulnerability evaluation M&S, which in turn can improve the survivability and combat effectiveness of Warfighters and their weapon systems. The application of the CDV method presented herein leveraged statistical theory and exploratory data analysis to augment existing AJEM-FUSL validation techniques by focusing on high-level results. Example 1 illustrates a detailed yet concise validation assessment for a FUSL test series, while Example 2 illustrates trends in results across the threat type factor.
Future applications of this work could use other factors instead of threat type. For instance, FUSL events could be grouped in many ways, including by threat mechanism, vehicle variant type, or engagement geometry. In addition, as mentioned previously, while the focus of this work was on AJEM and AFVs, future endeavors could extend the CDV methodology to analogous vulnerability assessment tools, such as the Navy’s ASAP and the Air Force’s COVART.
About the Authors
Mr. Thomas Johnson is a data scientist at the Institute for Defense Analyses, specializing in test design, statistical modeling, and M&S validation. He currently supports live fire and operational testing of armored vehicles and rotorcraft, as well as numerous JASP programs and initiatives.
Mr. Craig Andres works at the U.S. Army Combat Capabilities Development Command (DEVCOM) Analysis Center, currently serving as the Vulnerability Methodology Team Lead for AJEM and the End Game Model.
Mr. Russell Dibelka is an operations research analyst at the DEVCOM Data and Analysis Center. He was the first AJEM model manager and currently leads the AJEM Methodology Development Team, supporting major Army acquisition programs and the Joint Technical Coordinating Group for Munitions Effectiveness in varied VV&A efforts.
Ms. Lindsey Butler is an engineer at the Institute for Defense Analyses, supporting the LFT&E of armored vehicles and body armor, with a specialized focus on advancing methodologies for planning live fire testing and assessing injuries to personnel.
Mr. David Grimm is the Project Leader for DoD Ground Combat System LFT&E programs at the Institute for Defense Analyses. He previously served as an Army infantry and operations research/systems analysis officer, as well as a former Deputy Army T&E Executive and Acting Director of the Army T&E Office.
Dr. Juliana Ivancik is a Senior Military Evaluator at DOT&E in the Office of the Secretary of Defense, overseeing survivability and lethality testing for major ground combat systems. She previously worked in advancing soldier protection at the Army Research Laboratory and Army Test and Evaluation Command and is widely published in the areas of fracture mechanics and mechanical characterization of materials.
References
- Headquarters, U.S. Department of the Army. “Verification, Validation, and Accreditation of Army Models and Simulations.” Department of the Army Pamphlet 5-11, 30 September 1999.
- Dunn, J. “Baseline Accreditation Report for the Advanced Joint Effectiveness Model Version 2.54.” July 2022.
- Baker, W., R. Saucier, T. Muehl, and R. Grote. “Comparison of MUVES-SQuASH with Bradley Fighting Vehicle Live-Fire Test Results.” ARL-TR-1846, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD, November 1998.
- Deitz, R., R. Saucier, and W. Baker. “Developments in Modeling Ballistic Live-Fire Events.” Paper presented at the 16th International Symposium on Ballistics, San Francisco, CA, 23–28 September 1996.
- Tonnessen, L., A. Fries, L. Starkley, and A. Stein. “Live Fire Testing in the Evaluation of the Vulnerability of Armored Vehicles and Other Exposed Land-Based Systems.” Appendix A, IDA Paper P-2205.
- Harrell, F. Jr., K. Lee, and D. Mark. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Statistics in Medicine 15.4, pp. 361–387, 1996.
- Harrell, F. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Vol. 3, New York: Springer, 2015.
- Brier, G. “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review, vol. 78, issue 1, 1950.
- Spiegelhalter, D. “Probabilistic Prediction in Patient Management and Clinical Trials.” Statistics in Medicine, vol. 5, issue 5, pp. 421–433, 1986.
- Somers, R. “A New Asymmetric Measure of Association for Ordinal Variables.” American Sociological Review, pp. 799–811, 1962.
- Jaccard, P. “The Distribution of the Flora in the Alpine Zone.” The New Phytologist, vol. XI, no. 2, pp. 37–50, February 1912.
- Johnson, T., J. Haman, H. Wojton, and M. Couch. “Design of Experiments (DOE) in Survivability Testing.” Aircraft Survivability, summer 2019.