A Deeper Look at the MACRO Score (Part 2)
Part two of our plunge into the intricacies of the MACRO Score. Spectral's data science team describes model evaluation criteria, credit scoring, interpreting MACRO Scores, model monitoring, and future work.
This is the second installment of a two-part, in-depth look at the intricacies of the MACRO Score. The first part focused on defining the problem, gathering data, engineering features, and constructing a model. This part explores model evaluation criteria, computation of the MACRO Score, interpreting MACRO Scores, and ongoing model monitoring. Part one can be found here.
Model Evaluation Criteria
Several model evaluation metrics are calculated for each of the shortlisted candidate models on the hold-out test data. Specific weights are assigned to the metrics, allowing for a comparative model evaluation based on the weighted average for each model. Metrics evaluated include the following.
Area Under the Receiver Operating Characteristic Curve (AUROC/AUC)
A Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (i.e., Recall) of a binary classifier against its False Positive Rate at every possible probability threshold. The Area Under the Curve (AUC) summarizes the ROC curve in a single number that measures the classifier’s ability to distinguish between classes (e.g., good or bad). In other words, the AUC reflects a model’s discriminatory power: how adept it is at correctly distinguishing between good and bad borrowers given their respective credit risk profiles.
Spectral’s AUC of 87 is considered very good and is higher than the AUC of traditional credit scores, where anything above 70 is generally deemed to be good.
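As a rough illustration of what the AUC measures, the sketch below computes it in its pairwise form: the probability that a randomly chosen positive (here assumed to be a liquidation, labeled 1) receives a higher predicted score than a randomly chosen negative. The function name and the toy data are hypothetical and are not taken from Spectral's pipeline.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    y_true  : list of 0/1 labels (1 = positive class, assumed to be liquidation)
    y_score : list of predicted probabilities for the positive class
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical): three liquidations, three non-liquidations.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(round(roc_auc(labels, scores), 3))
```

The result lies between 0 and 1, so a reported AUC of 87 presumably corresponds to 0.87 on this scale. The pairwise loop is quadratic and only suitable for small samples; a sort-based formulation is the usual choice at scale.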
Area Under the Precision-Recall Curve (PR-AUC)
A Precision-Recall (PR) curve plots a classifier’s Precision against its Recall at all the possible probability thresholds. A high area under the PR curve (PR-AUC) represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. Spectral's PR-AUC of 89 is substantially higher than that of traditional credit scores.
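One common way to estimate the area under the PR curve is average precision: the mean of the Precision values observed at each rank where a true positive is found. The sketch below assumes this estimator (the article does not say which one Spectral uses) and ignores tied scores for simplicity. All names are hypothetical.

```python
def average_precision(y_true, y_score):
    """Average precision, a step-wise estimate of the area under the PR curve.

    Observations are ranked by descending score; each time a true positive
    is encountered, the precision at that rank is accumulated.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision needs at least one positive")

    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    true_positives = 0
    accumulated = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            accumulated += true_positives / rank  # precision at this rank
    return accumulated / total_pos


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(round(average_precision(labels, scores), 3))
```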
Kolmogorov–Smirnov (KS) Statistic
KS is a standard validation metric in credit scoring that measures the discriminatory power of a model as the maximum separation between the cumulative score distributions of good and bad borrowers. Spectral’s KS of 58 is considered very good in the credit scoring domain.
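A minimal sketch of the two-sample KS Statistic follows, assuming the inputs are binary labels and predicted probabilities. It evaluates the empirical cumulative distribution of scores for each class at every observed score and returns the largest gap. The function name and toy data are hypothetical.

```python
def ks_statistic(y_true, y_score):
    """Maximum gap between the empirical score CDFs of the two classes."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("KS needs observations from both classes")

    largest_gap = 0.0
    for threshold in sorted(set(y_score)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(round(ks_statistic(labels, scores), 3))
```

The value lies between 0 (identical distributions) and 1 (perfect separation), so a reported KS of 58 presumably corresponds to 0.58.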
Brier Score
The Brier Score evaluates the accuracy of a model’s probabilistic predictions. For example, consider two models that predicted the same outcome correctly, one with a predicted probability of 60% and the other with 90%. The latter model appears to be better as it predicted the outcome more confidently. This is where the Brier Score comes in: it evaluates the “distance” (the mean squared difference) between the predicted probability and the outcome. The lower the Brier Score, the more accurate a model’s predictions. Spectral’s Brier Score is 15.
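The 60% versus 90% example can be worked through with a short sketch. The Brier Score is the mean squared difference between each predicted probability and the observed 0/1 outcome. The function name is hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Both models predict the event that actually occurred (outcome = 1),
# but with different confidence.
print(round(brier_score([1], [0.6]), 2))  # less confident model
print(round(brier_score([1], [0.9]), 2))  # more confident model
```

The more confident correct prediction yields the smaller value, which matches the intuition in the paragraph above. On this 0 to 1 scale, a reported Brier Score of 15 presumably corresponds to 0.15.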
Recall
Recall (also known as Sensitivity or the True Positive Rate) is the ratio of correctly predicted positives (True Positives) to all actual positives, where a positive refers to a liquidation outcome. In other words, Recall measures the percentage of the actual liquidations that our model correctly predicts. A high Recall means few False Negatives, which are more costly than False Positives in the credit risk domain. Spectral’s Recall is 85.
F1 Score
The F1 Score combines a model’s Precision (which rewards minimizing False Positives) and Recall into a single metric by taking their harmonic mean. Useful for imbalanced classification problems, a high F1 Score implies both high Precision and high Recall. Spectral’s F1 Score is 80.
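Precision, Recall, and the F1 Score can all be derived from the counts of True Positives, False Positives, and False Negatives at a chosen decision threshold. The sketch below assumes hard 0/1 predictions with 1 denoting a predicted liquidation. The names and data are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(actual, predicted))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot achieve a high F1 Score by excelling at only one of Precision or Recall.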
MACRO Score Distribution
We assess the distribution of each model’s predicted MACRO Scores to ensure that it is not heavily concentrated within a single band and remains representative of the entire credit risk spectrum, ranging from 300 (extremely high credit risk) to 850 (extremely low credit risk).
Computation of the MACRO Score
A logit transformation is applied to the model’s predicted probability of non-liquidation (the higher the probability of non-liquidation, the higher the MACRO Score). The effect of the logit transformation is primarily to “stretch” the model’s predicted probabilities, which can be densely concentrated within a narrow range.
Logit-transformed probabilities are then linearly scaled into the preliminary MACRO Score ranging from 300 to 850.
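The logit-then-scale idea can be sketched as follows. The article does not disclose the actual scaling constants, so the clipping bounds `logit_min` and `logit_max` below are purely illustrative assumptions, as are the function names.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, with p clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def preliminary_score(p_non_liquidation, logit_min=-6.0, logit_max=6.0,
                      score_min=300, score_max=850):
    """Map a probability of non-liquidation onto the 300-850 range.

    logit_min / logit_max are hypothetical bounds: log-odds outside this
    interval are clipped before the linear rescaling.
    """
    z = logit(p_non_liquidation)
    z = min(max(z, logit_min), logit_max)
    fraction = (z - logit_min) / (logit_max - logit_min)
    return score_min + fraction * (score_max - score_min)


for p in (0.5, 0.9, 0.99):
    print(p, round(preliminary_score(p)))
```

Probabilities close to 1 are compressed on the raw scale but spread out in log-odds, which is the "stretching" effect: 0.9 and 0.99 differ by only 0.09 in probability yet land noticeably apart after the transformation.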
The preliminary MACRO Score is adjusted by a rug-pull penalty, if applicable, to arrive at the final MACRO Score. The rug-pull penalty penalizes historical interactions with smart contracts that were rug-pulled, since such interactions could indicate risky user behavior.
Some of the factors considered when determining the rug-pull penalty include:
- Time since last interaction with a rug-pulled smart contract
- Number of unique rug-pulled smart contracts interacted with
- Number of transactions with rug-pulled smart contracts
- Total nominal value (in ETH) of all the transactions with rug-pulled smart contracts
Interpreting MACRO Scores
We wanted our MACRO Scores to be easily understandable and interpretable by the general public, so that users can understand the underlying reason(s) for their Score. This, in turn, allows a user to potentially work towards improving their Score in the future.
To enable this model interpretability, we make use of two approaches:
SHAP Values, based on game theory, quantify the contribution of each feature to the prediction made by a given model. They provide local interpretability, i.e., explanations at the level of each individual observation. Therefore, each model prediction can be explained by the SHAP Values of the involved features. The higher the feature’s SHAP Value, the higher the contribution it has towards the predicted probability of liquidation, and accordingly, the lower the MACRO Score. Hence, sorting the SHAP Values for a given MACRO Score observation allows us to identify features that are positively supporting or negatively impacting the MACRO Score.
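To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a single observation by averaging each feature's marginal contribution over every ordering in which features can be "switched on". It is not the SHAP library and is only feasible for a handful of features. Treating an absent feature as equal to a baseline value is a simplifying assumption, and the toy model is hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values for one observation.

    model    : callable taking a list of feature values, returning a number
    x        : feature values of the observation being explained
    baseline : reference values used for features not yet "switched on"
    """
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from baseline to actual
            output = model(current)
            contributions[i] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


# Hypothetical toy model of liquidation risk with one interaction term.
def toy_model(f):
    return 0.1 + 0.5 * f[0] - 0.2 * f[1] + 0.3 * f[0] * f[2]


values = shapley_values(toy_model, x=[1, 1, 1], baseline=[0, 0, 0])
print([round(v, 2) for v in values])
```

The values sum to the difference between the model output for the observation and for the baseline, and the interaction term is split evenly between the two features involved. A positive value pushes the predicted liquidation probability up, and a negative value pushes it down.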
Partial Dependence Plots (PDPs) show the marginal effect a feature has on the predicted outcome of a model. A PDP can show whether the relationship between the target and a feature is linear, monotonic, or more complex. An upward-sloping PDP implies that higher feature values increase the predicted probability of liquidation and thereby lower the MACRO Score.
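The values behind a PDP can be computed with a simple loop: for each candidate value of the feature of interest, overwrite that feature in every row of a dataset, run the model, and average the predictions. The sketch below assumes a model exposed as a plain callable. The names and data are hypothetical.

```python
def partial_dependence(model, rows, feature_idx, grid):
    """Average model output when one feature is forced to each grid value.

    model       : callable taking a list of feature values
    rows        : list of observations (each a list of feature values)
    feature_idx : index of the feature being examined
    grid        : values of that feature at which to evaluate the curve
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_idx] = value  # override the feature of interest
            total += model(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical toy model: risk rises with feature 0.
def toy_model(f):
    return 0.2 * f[0] + 0.1 * f[1]


data = [[0, 1], [0, 3]]
print([round(v, 2) for v in partial_dependence(toy_model, data, 0, [0, 1, 2])])
```

An increasing sequence corresponds to an upward-sloping PDP: under the convention in the paragraph above, higher values of that feature raise the predicted liquidation probability.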
We provide Score Ingredients, derived through a combination of SHAP Values and PDPs, for all MACRO Scores. Score Ingredients represent the top two factors that positively support your MACRO Score and the top two that negatively impact your Score, along with practical suggestions to potentially increase your Score.
The process flow for determining the Score Ingredients is:
- For any given user’s MACRO Score, identify the two features with the highest and the two with the lowest SHAP Values.
- For each of these four features, use the respective PDP to identify the range of feature values representative of a low or high MACRO Score.
- For the two features with the lowest SHAP Values:
  - If the actual feature value falls within the high Score range as identified by the PDP, then it confirms that this feature is really supporting the MACRO Score and is so conveyed to the user.
  - The feature is excluded from the Score Ingredients in case the actual feature value does not fall within the high Score range as identified by the PDP.
- A similar process is applied for the two features with the highest SHAP Values.
In short, we deem a Score Ingredient valid only if both the SHAP Values and PDP frameworks are in agreement. The above enables model interpretability so that the MACRO Score does not come across as a black box. The Score Ingredients allow users to understand which specific features have had the most effect on their MACRO Score.
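The agreement check between the two frameworks can be sketched as a small filter. Everything here is a hypothetical illustration: the function name, the dictionary layout, and the representation of PDP-derived ranges as simple `(low, high)` intervals are assumptions, not Spectral's implementation.

```python
def score_ingredients(shap_values, feature_values,
                      high_score_ranges, low_score_ranges, k=2):
    """Return (supporting, hurting) feature names that pass both checks.

    shap_values       : {feature: SHAP value toward liquidation probability}
    feature_values    : {feature: actual value for this user}
    high_score_ranges : {feature: (lo, hi)} PDP range tied to a high Score
    low_score_ranges  : {feature: (lo, hi)} PDP range tied to a low Score
    """
    ranked = sorted(shap_values, key=shap_values.get)  # ascending SHAP value
    lowest = ranked[:k]           # candidates that support the Score
    highest = ranked[-k:][::-1]   # candidates that hurt the Score

    def within(value, bounds):
        lo, hi = bounds
        return lo <= value <= hi

    supporting = [f for f in lowest
                  if within(feature_values[f], high_score_ranges[f])]
    hurting = [f for f in highest
               if within(feature_values[f], low_score_ranges[f])]
    return supporting, hurting


shap = {"a": -0.3, "b": -0.1, "c": 0.05, "d": 0.2, "e": 0.4}
actual = {"a": 10, "b": 1, "c": 5, "d": 7, "e": 2}
high_ranges = {"a": (8, 20), "b": (5, 10)}
low_ranges = {"e": (0, 3), "d": (0, 5)}
print(score_ingredients(shap, actual, high_ranges, low_ranges))
```

In this toy case, features `b` and `d` are dropped because their actual values fall outside the ranges that the PDP associates with the direction suggested by their SHAP Values.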
Ongoing Model Monitoring
Ongoing model monitoring is critical for catching performance degradation early and for detecting warning signals, such as data and prediction drift, that could warrant either a model recalibration or a redevelopment from scratch. Some of the regular model monitoring reports that we utilize include:
- Population Stability Report: Compares the Score distribution at the model development time with a more recent date. Together with the Population Stability Index (PSI), it assists in evaluating whether there has been a shift in the model's performance or the assessed population in general.
- Performance Report: In addition to the Score distribution, this report also identifies the number of good and bad observations for each bucket, together with the overall KS Statistic.
- Vintage Analysis: Monitors the evolution of cohorts over fixed time intervals in terms of metrics such as bad rate and data distributions, where each cohort includes the new borrowers during the pre-defined time interval.
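The Population Stability Index mentioned in the first report is commonly computed by bucketing Scores, comparing the share of observations per bucket at development time with the share at a more recent date, and summing the weighted log-ratios. The sketch below assumes the Scores have already been counted into matching buckets. The function name and the counts are hypothetical.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a development-time and a recent Score distribution.

    expected_counts : observations per Score bucket at model development
    actual_counts   : observations per the same buckets at a recent date
    eps             : floor for bucket shares, to avoid log(0)
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same buckets")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)

    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / expected_total, eps)
        a_share = max(a / actual_total, eps)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi


print(round(population_stability_index([50, 50], [60, 40]), 4))
```

A PSI of zero means the two distributions are identical, and larger values indicate a larger shift. Commonly cited rules of thumb treat values below roughly 0.1 as stable and values above roughly 0.25 as a significant shift, though the appropriate thresholds are a judgment call for each portfolio.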
Future Work
This is just the beginning. Our roadmap includes:
- Integrating additional DeFi lending protocols, including those on other L2 and EVM compatible chains, across our data and ML pipelines
- Productionizing alternative on-chain data for credit risk analysis, including transactions on decentralized exchanges (e.g., Uniswap), NFT holdings, ENS, etc.
- Evaluating the inclusion of users that have no borrowing history as part of the training data
- Experimenting with ways to include rug-pull-related data points at the feature level rather than as a penalty adjustment
- Applying more graph-based machine learning and unsupervised learning approaches to augment our analysis
- Conducting Cost of Attack analysis to understand what level of time, cost, and effort is required for a malicious user to attack and game the Score