
Risk Analytics: Monitoring, Tracking and Alerting

Your risk team needs to keep track of many things at once. A strong, proactive risk team starts with building up your monitoring, tracking, and alerting capabilities. In this article we share how we like to think about monitoring and alerting for risk teams.

A key factor in successful risk analytics is the ability to identify anomalies - but the challenge is that there is no universal standard for what counts as one. Every business has different patterns, norms, and expected behaviors. The way to tell whether something requires mitigation or further investigation is to spot when something happens that doesn’t usually happen.

Anomalies might occur within technical systems, user activity patterns, user characteristics, or business performance metrics. They have a wide range of causes and can be short-term or long-term. They may be indicative of a problem (something broken in your system, a fraud attack, a flaw in your customer processes) or highlight an evolving trend that your business can benefit from understanding. They may also be irrelevant - a temporary blip that means nothing.

Whatever the case with a specific anomaly, what matters is that you have robust mechanisms in place to track normal behaviors, identify things that don’t fit expected patterns, and alert your teams when something is worth investigating. In this way, you’ll be aware of any events that require your more immediate attention, and be able to react quickly and appropriately.

Let's explore the 5 major categories of alerts you want to keep track of as a risk leader.

Bugs: Things That Should Never Happen

Bugs are events that should never happen in the system, yet somehow have a strong tendency to happen eventually anyway. To prepare for such unfortunate occurrences, you should set up monitoring and alerts aimed at catching business-critical events that, by definition, indicate something in the system is broken.

The alerts associated with bugs are largely created through two types of exercises.

The first exercise is to identify safety nets for hypothetical cases of very high risk - imagine a transaction being approved for a deactivated customer, for example. It’s important to be clear that this is distinct from alerts for generally high-risk events. These are events that should never happen, so if such an alert fires it is an immediate indication that something meaningful has broken.

If you don’t have alerts in place for this type of bug, it’s worth making a list of such hypothetical events and checking whether you know how to systematically identify when they happen. This will allow you to make sure that you have measures in place to prevent them from ever occurring - and, if something does break, to alert you at once.
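To make this concrete, here is a minimal sketch of what such a safety-net check could look like, using the deactivated-customer example above. The record fields and the customer lookup are hypothetical placeholders for whatever your own data model exposes.

```python
# Minimal sketch of a "should never happen" safety-net check, assuming
# hypothetical transaction records and a customer-status lookup.
from dataclasses import dataclass

@dataclass
class Transaction:
    transaction_id: str
    customer_id: str
    status: str  # e.g. "approved", "declined"

def find_impossible_approvals(transactions, customer_is_active):
    """Return approved transactions belonging to deactivated customers.

    Any result here indicates a broken control rather than ordinary risk,
    so it should page someone instead of joining an investigation queue.
    """
    return [
        t for t in transactions
        if t.status == "approved" and not customer_is_active(t.customer_id)
    ]

# Example usage with in-memory data
txns = [Transaction("t1", "c1", "approved"), Transaction("t2", "c2", "approved")]
active = {"c1": True, "c2": False}
for t in find_impossible_approvals(txns, lambda cid: active.get(cid, False)):
    print(f"ALERT: approved transaction {t.transaction_id} for deactivated customer {t.customer_id}")
```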

The second exercise is around reactions to specific bugs. For instance, in the case of a financial institution, imagine that the system mistakenly allowed users to withdraw money from ATMs without taking their daily limits or current balance into account. After such an event, the risk team would put in place an alert for cases where a user withdraws more than their daily limit, and another for cases where a user withdraws more than their current balance, to prevent a recurrence of the problem.
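As an illustration, the two reactive alerts from this ATM example might look something like the sketch below; the field names and figures are assumptions rather than a real schema.

```python
# Illustrative sketch of the two reactive ATM alerts described above.
def atm_withdrawal_alerts(withdrawal_amount, withdrawn_today, daily_limit, current_balance):
    """Return a list of alert reasons for a single ATM withdrawal."""
    reasons = []
    if withdrawn_today + withdrawal_amount > daily_limit:
        reasons.append("withdrawal exceeds daily limit")
    if withdrawal_amount > current_balance:
        reasons.append("withdrawal exceeds current balance")
    return reasons

# A withdrawal of 600 with 500 already withdrawn against a 1,000 limit
# and a 550 balance should trigger both alerts.
print(atm_withdrawal_alerts(600, 500, 1000, 550))
```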

This type of exercise is reactive, instituted in response to a specific event that was problematic, to prevent it happening again. It can be tempting for teams to beat themselves up for not realizing this kind of vulnerability in advance, but it’s important to recognize that some events of this nature are inevitable; it is simply not possible to imagine every single edge case in advance. 

What matters is how you react, both at the time and afterwards to ensure that it doesn’t repeat. It’s a type of failing forward, and is an important part of how systems and teams grow and improve over time.

An added benefit of this alert type is that when a bug does occur, the alerts provide a great starting point in scoping which customers were impacted. 

Suspicious Activity: Customer High Risk Events

Suspicious activities are events performed by a user that should be investigated by someone. They might turn out to be entirely innocuous upon further investigation, or they might be signs that an attack or other problematic behavior is in progress.

Suspicious activities can be as simple as transactions over a certain dollar value, or more complex, such as transactions that are over 80% of the customer’s current balance and over 200% of the customer’s balance 5 days ago (a rule aimed at catching a money-in -> money-out pattern).
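Here is a minimal sketch of that money-in -> money-out rule, assuming you can look up both the customer’s current balance and their balance five days ago; both lookups stand in for your own ledger.

```python
# A minimal sketch of the money-in -> money-out rule above.
# Balance lookups are placeholders for a real ledger query.
def is_money_in_money_out(txn_amount, current_balance, balance_5_days_ago):
    """Flag transactions over 80% of the current balance that are also
    over 200% of the balance five days ago."""
    return (
        txn_amount > 0.8 * current_balance
        and txn_amount > 2.0 * balance_5_days_ago
    )

# $900 out of a $1,000 balance that was only $300 five days ago -> suspicious
print(is_money_in_money_out(900, 1000, 300))  # True
```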

There is a balance that teams need to strike with suspicious activities. On the one hand, you want to protect your business from potentially costly events - whether from failing to prevent risk or from offending a good customer. That is why investigation is so valuable. On the other hand, if you set your bar too low for these sorts of events, you’ll have too many false positives to investigate, overwhelming your team and making the real issues harder to spot.

Finding the right balance with this is a constant, delicate dance. If you’ve got it right, you’ll not miss any major suspicious activities, but you’ll still have some suspicious activity alerts that turn out to be nothing. It’s important that teams appreciate the value of the work they do in these cases, even though the end result is deciding that everything is fine. Those cases are a necessary element of enabling the wider effective protection of the system.

A major element of this delicate dance is to identify suspicious activity alerts that do not require investigation anymore, either because they are so correlated with bad activity that they can be moved to an automated reaction, or because they so often turn out to be legitimate activity that they can be removed from your list of suspicious activities.
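One way to ground this review in data - sketched below under the assumption that you record an outcome for each investigated alert - is to measure per-rule precision and let it suggest whether a rule should be automated, retired, or kept as-is. The thresholds are purely illustrative.

```python
# Rough sketch of a per-rule review based on investigation outcomes.
# Outcome labels and thresholds are illustrative assumptions.
from collections import defaultdict

def summarize_alert_rules(investigations, automate_above=0.95, retire_below=0.02):
    """investigations: iterable of (rule_name, confirmed_bad: bool)."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for rule, confirmed in investigations:
        totals[rule] += 1
        bad[rule] += int(confirmed)

    suggestions = {}
    for rule, n in totals.items():
        precision = bad[rule] / n
        if precision >= automate_above:
            suggestions[rule] = f"automate ({precision:.0%} confirmed bad)"
        elif precision <= retire_below:
            suggestions[rule] = f"consider retiring ({precision:.0%} confirmed bad)"
        else:
            suggestions[rule] = f"keep investigating ({precision:.0%} confirmed bad)"
    return suggestions

history = [("large_withdrawal", True)] * 98 + [("large_withdrawal", False)] * 2 \
        + [("new_device_login", False)] * 99 + [("new_device_login", True)] * 1
print(summarize_alert_rules(history))
```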

A second element of refining your suspicious activity alerts is to improve their sophistication over time. If you start with a simple logic of “transactions over $1,000”, you give bad actors room to identify your arbitrary threshold and find ways around it. If you improve the logic to “transactions in the last 24 hours totalling over $1,000”, you make those workarounds more difficult. If you later improve it to “transactions in the last 24 hours totalling more than the 98th percentile of legitimate 24-hour transaction volume from the past 6 months”, your alert flexes to match your users’ actual activity.
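A rough sketch of that percentile-based threshold, assuming you can pull roughly six months of legitimate 24-hour transaction totals (synthetic data stands in for real history here):

```python
# Sketch of a percentile-based dynamic threshold for 24-hour transaction volume.
import numpy as np

def dynamic_volume_threshold(legit_daily_totals, percentile=98):
    """Return the alert threshold as the given percentile of historical
    legitimate 24-hour transaction volume."""
    return float(np.percentile(legit_daily_totals, percentile))

def should_alert(last_24h_total, legit_daily_totals):
    return last_24h_total > dynamic_volume_threshold(legit_daily_totals)

rng = np.random.default_rng(0)
history = rng.gamma(shape=2.0, scale=300.0, size=180)  # ~6 months of daily totals
print(round(dynamic_volume_threshold(history), 2))
print(should_alert(5000, history))
```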

System Spikes: Drastic Changes in System Signals

System spikes are alerts for various forms of data outages, SLA changes, or other indicators that the system is experiencing difficulty providing its standard level of defense.
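As one illustration, a monitor for a data outage might track the share of risk decisions made without a third-party enrichment signal and compare it to a recent baseline. The parameters and the 3x-baseline trigger below are assumptions, not a prescribed standard.

```python
# Illustrative monitor for one kind of system spike: decisions made
# without an enrichment signal, compared to a recent baseline rate.
def enrichment_outage_alert(missing_count, total_count, baseline_missing_rate, factor=3.0):
    """Alert when the missing-data rate is well above its recent baseline."""
    if total_count == 0:
        return False
    current_rate = missing_count / total_count
    return current_rate > factor * baseline_missing_rate

# 12% of decisions missing enrichment vs. a 2% baseline -> alert
print(enrichment_outage_alert(missing_count=120, total_count=1000, baseline_missing_rate=0.02))
```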

Problems of this nature may well not fall to the risk team to manage or fix; the issue is often primarily technical in nature and requires assistance from engineering teams or third party customer support. However, it’s vital that risk teams nonetheless take responsibility for tracking these metrics, ensuring that alerts are in place to flag issues, and monitoring the impact on user activity and risk decisions when something happens. 

In a sense, although risk departments may not be the hands-on experts needed to solve the problem, they do need to take high level responsibility for system spikes, because the negative impact that these can have very much falls under the umbrella of risk management responsibility. There are a few stages that are part of this kind of high level responsibility:

  • Thinking through system spike issues that might arise, and ensuring that appropriate monitoring and alerting is in place for them.
  • Ensuring that there are processes in place to handle them, should they arise.
  • Making sure specific team members have responsibility for handling such issues and know what to do about them. 
  • When something does happen, make sure that 1) the relevant technical team knows about and understands the problem, 2) the impact on customers is minimized and handled appropriately, 3) risk is mitigated as much as possible, and 4) you are in close communication with the technical team to ensure speedy, effective resolution.

Decision Spikes: Drastic Changes in Decisioning Distribution

Decision spikes are abnormal changes in the proportion of different decision types or logics. As an example of a decision-type spike: if the proportion of benign customers sent to 2FA while logging in jumped from 2% to 17% in one day, that would be a decision spike.

Decision spikes may also come from anomalies in specific decision logics. An example would be if a decline rule for US customers using an IP from Uganda and an iPhone emulator device type dropped from hitting 200 times per day to only 5 times in a single day.
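Both kinds of decision spike can be caught with simple ratio checks against a baseline, roughly as sketched below; the trigger ratios are placeholders for whatever your own history supports.

```python
# Minimal sketches of the two decision-spike checks described above.
def rate_spike(today_rate, baseline_rate, max_ratio=3.0):
    """Flag when a decision rate (e.g. share of logins sent to 2FA)
    is far above its baseline - 2% jumping to 17% trips this easily."""
    return baseline_rate > 0 and today_rate / baseline_rate > max_ratio

def rule_hit_drop(today_hits, baseline_daily_hits, min_ratio=0.2):
    """Flag when a decision rule's daily hit count collapses,
    e.g. a decline rule falling from ~200 hits per day to 5."""
    return baseline_daily_hits > 0 and today_hits / baseline_daily_hits < min_ratio

print(rate_spike(today_rate=0.17, baseline_rate=0.02))       # True
print(rule_hit_drop(today_hits=5, baseline_daily_hits=200))  # True
```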

Sometimes these spikes come from ordinary, legitimate causes, such as a local event like a holiday or festival, a large event such as a concert or sporting event, or even a significant sale. Equally, these spikes can represent a problem with your own systems, or an attack. 

Since changes in decisioning distribution impact customers directly, it’s important to prioritize these quickly and ensure the investigation is effective and speedy, so that any negative impact on good customers is minimal. 

Business Target Spikes: Drastic Changes in Business Metrics

Business target spikes occur when various forms of business targets show abnormal changes. For instance, if the number of applications in the last hour is suddenly 40X higher than in the same hour a week ago - that would be a business target spike. The same would be true if the number of declined transactions in the past 3 days came in 15% above the monthly target.
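Both of these business-target checks reduce to comparing a current figure against a reference, as in the rough sketch below; the counts, targets, and tolerances are hypothetical.

```python
# Sketches of the two business-target checks mentioned above.
def week_over_week_spike(current_hour_count, same_hour_last_week, max_ratio=5.0):
    """Flag an hourly metric far above the same hour one week ago;
    a 40x jump in applications would trip this immediately."""
    return same_hour_last_week > 0 and current_hour_count / same_hour_last_week > max_ratio

def above_target(actual, target, tolerance=0.10):
    """Flag a metric (e.g. declined transactions) more than `tolerance`
    above its target - 15% above a monthly target would alert here."""
    return target > 0 and (actual - target) / target > tolerance

print(week_over_week_spike(current_hour_count=4000, same_hour_last_week=100))  # True
print(above_target(actual=1150, target=1000))                                  # True
```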

This category shows the importance of seeing risk management as an intrinsic part of the entire business. As we noted in our introduction to risk analytics, every part of the customer journey is a part of the relevant big picture - and this must be reflected in your monitoring, tracking and alerting infrastructure as well. Otherwise important parts of the business will not be protected. 

The Impact of Monitoring, Tracking and Alerts

Ensuring that your monitoring, tracking and alerting systems are tailored to your organization’s needs and unique risk profile is a worthwhile investment. It means that your risk management department can easily and effectively identify developing issues and react to them very quickly. 

The detail of the alerts themselves should provide a strong starting point in identifying the impacted population, and the issues that they are facing. This in turn means the team can start the investigation in a more direct manner. 

In some cases, of course, there should be automatic processes in place to protect the business from risk while the investigation occurs, such as temporarily freezing accounts or transactions, or preventing further changes being made to the system. 

These are temporary measures, however. Investigation by the risk team will enable you to gain a deeper understanding which can guide a more precise, appropriate, and effective response. With investment in monitoring, tracking and alerting, you will be able to reach that point faster, and more smoothly, than would otherwise be possible. 

If you’d like some help thinking through the monitoring, tracking and alerting that protects your systems and organization - contact us.

We’d love to chat. We’d love to help.