



Vulnerability Management Metrics

In my previous job (at the time of writing), I led the Vulnerability Management Team at Facebook (nowadays Meta).

When I joined the Facebook Security team, the vulnerability management effort was noticeably fragmented, there was a lack of structure in many areas, and the metrics in use were only loosely connected with the main purpose of the security program. In close collaboration with several teams working in security and operations, we immediately set out to establish key performance indicators for the vulnerability management program and for the main pillars on which it was implemented.

In this document, I generalize and formalize some of the metrics we started using at Facebook. My goal is not to provide implementation details, or to suggest the perfect metric for the job. Instead, I hope to give the reader useful cues to start making informed decisions on investments in vulnerability management infrastructure and operations.

Although the vulnerability management program I was part of was focused primarily on third-party assets, the metrics I will describe here can also be used to cover software developed in-house. In this case, the company would probably need to use homemade scanners, static code analysers, intelligence data, and the output of simulated attacks like penetration tests and red team exercises.

Finally, most of the concepts illustrated in this document would probably make more sense in the context of vulnerability management at scale. Be warned!

What is Vulnerability Management?

In this document I am considering the definition of Vulnerability Management adopted by the vast majority of large companies. Specifically, I am assuming that:

Vulnerability Management is the set of processes and mechanisms used to manage the risk posed to the company by known software vulnerabilities.

In the above statements, there are two important concepts that determine the scope and goals of this very complicated world: risk management and known vulnerabilities.

Wikipedia’s definition of Vulnerability Management is somewhat more detailed, although it doesn’t convey the importance of making trade-offs. It defines Vulnerability Management as the identification, classification, prioritization, remediation, and mitigation of software vulnerabilities.

We shall see in detail what all of that means, providing some ideas on how to measure the effectiveness of a vulnerability management program.

Note: in this document, the term asset indicates a software product instance identified by its vendor, product name, version, and the system on which it is deployed. I also use the term system to indicate a bare-metal server, a virtual machine, a container, or a serverless computing instance.

Executive metrics

Running a vulnerability management program at scale can be very expensive. We want to make sure that what we are doing is efficient, so we need effective metrics.

Even though a plethora of performance indicators can be used, we want to be able to tell at a glance where we are and whether we are making progress. At the same time we definitely don’t want to drown in an ocean of graphs and numbers, or present our leadership with the details about how we run things.

I personally like to use only two first order, or executive, metrics: Risk and Coverage. And these are the first performance indicators we are going to talk about.

Later we will explore second order, or operational, metrics. We can also call them leading indicators for either Risk or Coverage. These can be used to identify bottlenecks and prioritize the work needed to bring Risk down or improve Coverage.

Risk

When managing vulnerabilities at scale, it is crucial to evaluate the realistic risk posed by known vulnerabilities.

When measuring risk, we want an idea of how easy it would be for an attacker to leverage vulnerabilities in our software assets to damage the company. We also need to keep in mind that the spectrum of possible damages a company can suffer from security incidents is broad and diverse: theft of trade secrets, compromise of user data, and damage to the company’s image, which can be caused simply by exposing poor management of security practices.

We want to make sure we consider as many known vulnerabilities as possible, along with the likelihood and the impact of them being exploited: it is impractical for the typical attacker to enumerate all possible vulnerabilities to find the “perfect one”, whereas a widespread vulnerability can be spotted much more easily.

Another important reason to adopt a risk metric is the prioritization of remediation. When dealing with security at scale, it’s always a matter of where and how you make trade-offs: you can rarely fix everything, and the influx of vulnerabilities never stops. You never have infinite resources to invest in vulnerability remediation, not least because that is not the core business of your company.

So, having an effective way to prioritize security remediation is paramount.

For instance, remediating a severe vulnerability affecting a single asset should not take priority over remediating a different vulnerability, with a similar severity score, that affects many assets.

In the rest of this chapter, we will see three risk metrics, starting with a simple definition and ending with a refined one. As you shall see, perfection is unattainable; still, it is worth pursuing the best metric we can afford.

Simple Risk

As the intuition suggests, risk is additive: every vulnerability contributes to it in the measure of the risk it carries to our infrastructure.

So, in its simplest form, risk can be quantified, for each vulnerability, as the number of affected assets multiplied by its Common Vulnerability Scoring System score (CVSS, as published by NIST), summed over all known vulnerabilities.

The number of assets affected by a vulnerability represents the potential attack surface for it. The score represents the risk posed by a vulnerability on a single software instance.

More formally, we can define the simple Risk ($SRisk$) for a given vulnerability $v$ in the set of known vulnerabilities $V$ as:

\[SRisk(v) = |A_v| \cdot cvss(v), v \in V\]

$A_v$ is the set of assets affected by the vulnerability $v$, and $cvss(v)$ is the CVSS Score of $v$.

Or, if we want to calculate the aggregated risk on all the software we own:

\[SRisk = \sum\limits_{v \in V} SRisk(v)\]

Note that, in practice, the set of known vulnerabilities is given by the vulnerability scanners used.
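
As a minimal sketch, assuming scanner findings can be reduced to (CVE id, CVSS score, set of affected asset ids) tuples — a hypothetical shape, not the output format of any particular scanner — $SRisk$ could be computed as follows:

```python
# Hypothetical scanner findings: (cve_id, cvss_base_score, affected asset ids).
findings = [
    ("CVE-2021-0001", 9.8, {"host-1", "host-2", "host-3"}),
    ("CVE-2021-0002", 5.3, {"host-2"}),
]

def simple_risk(cvss, affected_assets):
    # SRisk(v) = |A_v| * cvss(v)
    return len(affected_assets) * cvss

# Aggregated risk over all known vulnerabilities: SRisk = sum over v of SRisk(v)
total_srisk = sum(simple_risk(cvss, assets) for _, cvss, assets in findings)

for cve, cvss, assets in findings:
    print(cve, simple_risk(cvss, assets))
print("SRisk =", total_srisk)
```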

Triaged Risk

In large companies, we don’t want to send urgent tickets to our fellow engineers to resolve non-existent vulnerabilities. If we do so, people will rapidly start ignoring these requests and lose trust in the vulnerability management infrastructure and processes.

Often, the risk posed by a software vulnerability can be better evaluated by considering how that particular software instance is compiled, configured, and used. In other words, triaging vulnerabilities can improve our understanding of the security risk and help modulate remediation efforts and investments.

The definition of Risk we gave before does not account for automatic or manual triage. We almost always want to use a risk score enriched with information on the actual usage of software and potential mitigations that are in place.

Automatic triage

Automatic triage is what many vulnerability scanners do: they often try to give a better risk score, and reduce the number of false positives, by considering the context, how the software is used, and the mitigations in place (e.g., AppArmor or SELinux).

For instance, Red Hat provides an adjusted score for RHEL (Red Hat Enterprise Linux) packages that is used by some agent-based scanners. As you can read in their documentation:

[…] NVD may rate a flaw in a particular service as having High Impact on the CVSS CIA Triad (Confidentiality, Integrity, Availability) where the service in question is typically run as the root user with full privileges on a system. However, in a Red Hat product, the service may be specifically configured to run as a dedicated non-privileged user running entirely in a SELinux sandbox, greatly reducing the immediate impact from compromise, resulting in Low impact.

Other vendors sell vulnerability feeds with additional and refined information about vulnerabilities. Snyk, Accenture iDefense, Flexera, and FireEye are some examples.

In addition to vendor solutions, we can implement our own automated triage pipeline, using readily available, and sometimes open source, tools like nvdtool.

Manual triage

In environments where there is a high level of “standardization” and customization (i.e., most hyper-scale companies), human triage can also be used. Security analysts review vulnerabilities and manually adjust (up or down) the CVSS score, typically by tweaking the CVSS vector string according to the context.

To explain this better, consider a real-life scenario: CVE-2020-12284 affects FFmpeg 4.1 and 4.2.2, but we know our software is not compiled with anything using, directly or indirectly, cbs_jpeg_split_fragment in libavcodec/cbs_jpeg.c. The problem is still there, so it would be nice to upgrade, but upgrading FFmpeg is less urgent than mitigating another exploitable vulnerability, so we can “temporarily downgrade” the score.

Vice versa, there could be situations in which we might want to bump up the risk score because a given piece of software is configured in a way that makes the impact of an exploit even more dangerous.

A very common mistake is to consider manual triage a one-off activity. If we dismiss a vulnerability found by our scanner, it could remain untreated for a long time, even if the conditions at the time of triage change. While manual triage can be an effective way to boost confidence in our risk measure, any re-scoring that lowers the remediation priority of a vulnerability should be periodically reviewed, until the vulnerability is eliminated.

Finally, not all detected vulnerabilities can be manually examined. This is a necessary trade-off which, in practice, shouldn’t change much, as long as the most severe vulnerabilities detected by the scanners are triaged by security analysts.
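
To make the point about periodic reviews concrete, here is a small sketch of how a manual re-score could be recorded together with a review deadline, so that downgrades don’t silently become permanent. The record shape, field names, and scores are illustrative, not taken from any specific tool:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class TriageDecision:
    """Illustrative record of a manual re-score; not taken from any real tool."""
    cve_id: str
    original_score: float      # score before triage (e.g., CVSS base)
    adjusted_score: float      # score after manual triage
    rationale: str
    decided_on: date
    review_every: timedelta = timedelta(days=90)

    def needs_review(self, today: date) -> bool:
        # Downgrades must be re-validated periodically until the vulnerability
        # is eliminated; score increases can stand as they are.
        downgraded = self.adjusted_score < self.original_score
        return downgraded and today >= self.decided_on + self.review_every

decision = TriageDecision(
    cve_id="CVE-2020-12284",
    original_score=9.8,        # placeholder value, not necessarily the NVD score
    adjusted_score=4.0,        # placeholder downgrade
    rationale="cbs_jpeg_split_fragment not reachable in our FFmpeg builds",
    decided_on=date(2021, 1, 15),
)
print(decision.needs_review(date.today()))  # True once the review window has passed
```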

Triaged Risk calculation

We can define the triaged Risk, $TRisk$, for a vulnerability $v \in V$, as

\[TRisk(v) = |A_v| \cdot tscore(v)\]

Where $tscore(v)$ is the triage score for $v$: a function of the CVSS score and of any other automatic or manual assessment regarding $v$. $A_v$ is the set of deployed asset instances affected by $v$.

It is also useful to consider the aggregated triaged Risk on all the assets. This can be calculated as follows:

\[TRisk = \sum\limits_{v \in V} TRisk(v)\]

In a realistic scenario, $V$ is the set of vulnerabilities discovered by the scanners, and $tscore(v)$ is the score assigned by the scanner to $v$.
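
A possible sketch of the calculation, assuming a simple precedence order in which a manual re-score wins over a vendor-adjusted score, which in turn wins over the plain CVSS base score (the precedence order and all values below are assumptions for illustration):

```python
# Each finding: (cve_id, cvss_base_score, set of affected asset ids).
findings = [
    ("CVE-2021-0001", 9.8, {"host-1", "host-2", "host-3"}),
    ("CVE-2021-0002", 5.3, {"host-2"}),
]

# Hypothetical triage inputs.
manual_adjustments = {"CVE-2021-0002": 2.0}   # analyst re-scores
vendor_scores = {"CVE-2021-0001": 7.1}        # e.g., distro-adjusted scores

def tscore(cve_id, cvss):
    # Manual re-score > vendor-adjusted score > plain CVSS base score.
    return manual_adjustments.get(cve_id, vendor_scores.get(cve_id, cvss))

# TRisk(v) = |A_v| * tscore(v);  TRisk = sum over v of TRisk(v)
trisk = sum(len(assets) * tscore(cve, cvss) for cve, cvss, assets in findings)
print("TRisk =", trisk)   # 3 * 7.1 + 1 * 2.0 = 23.3
```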

Enhanced Risk

At scale, and in more mature vulnerability management programs, the risk metric should consider even more details.

For instance, we may want to evaluate:

  • Temporal factors - how long has the vulnerability been around?
    • The exploitation of an ancient vulnerability, even with minimal quantifiable damage, can give our users the perception that we don’t care about security. This can lead to even greater damage to our image.
  • Intelligence data
    • is this vulnerability being actively exploited by APTs (Advanced Persistent Threats) against similar companies (Operation Aurora comes to mind)?
    • is our intelligence/red-team aware of exploits?
  • Blast radius - is this vulnerability affecting highly sensitive environments?

We can enrich the previous risk score definition with any additional information we have, in an automated or manual way. We call this new risk definition Enhanced Risk ($ERisk$), and we can compute it as:

\[ERisk(v) = escore(v) \cdot \sum\limits_{a \in A_v} br(a) \cdot ex(a)\]

Where $br(a)$ is the blast radius of $a$, and $ex(a)$ is the exposure of the asset $a$ to threats.

The range of values $br()$ and $ex()$ can take should be devised so that we can effectively “bump up” the remediation priority of vulnerabilities in a sensible way: if $ERisk(v') > ERisk(v'')$, remediation of $v'$ should take priority over remediation of $v''$.

Just as an example, we could say that:

\[\forall a \in A_v \quad br(a) \in [1,1.5] \ \textrm{and} \ ex(a) \in [1,1.5]\]

In particular, $br(a) = 1 \ \text{and} \ ex(a) = 1$ if we don’t have any reason to bump up or down the “importance” of vulnerabilities affecting the asset $a$.

Finally, $escore(v)$ is the enhanced score for $v$. It should consider aspects that can affect the CVSS vector string, such as:

  • Manual triage;
  • adjusted CVSS score provided by vendors (e.g., RH adjusted score);
  • intelligence information such as known exploitation in the wild;
  • age of the vulnerability (time elapsed from the time of public or intelligence disclosure).

We can now calculate the aggregated enhanced risk as:

\[ERisk = \sum\limits_{v \in V} ERisk(v)\]
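
As an illustration, assuming the per-asset blast radius and exposure multipliers are kept in simple lookup tables defaulting to 1 (all names and values below are made up), $ERisk$ could be computed as:

```python
# Hypothetical per-asset multipliers in [1, 1.5], defaulting to 1.0.
blast_radius = {"db-primary": 1.5}   # br(a): sensitivity of the environment
exposure = {"edge-proxy": 1.4}       # ex(a): how exposed the asset is

# Each finding: (cve_id, escore, set of affected asset ids); escore already
# folds in triage, vendor data, intelligence, and age.
findings = [
    ("CVE-2021-0001", 9.8, {"db-primary", "edge-proxy"}),
    ("CVE-2021-0002", 3.1, {"build-host"}),
]

def enhanced_risk(escore, assets):
    # ERisk(v) = escore(v) * sum over a in A_v of br(a) * ex(a)
    return escore * sum(
        blast_radius.get(a, 1.0) * exposure.get(a, 1.0) for a in assets
    )

erisk = sum(enhanced_risk(escore, assets) for _, escore, assets in findings)
print("ERisk =", round(erisk, 2))
```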

Normalizing the Risk

The number of assets grows and shrinks frequently over time, and so does the risk, if the new assets are affected by any vulnerability. This intuitively makes sense, as our vulnerable surface depends on the number of assets we have, but it makes it hard to track the progress of our vulnerability management program over time.

Often, we want to have a definition of risk based on the relative vulnerable surface (i.e., fraction of affected assets). So, our risk metric needs to be normalized on the number of assets considered in the reference period, $\vert A\vert $.

We can define the normalized, triaged, risk as:

\[nTRisk = \frac{TRisk}{|A|}\]

Where $A$ is the set of all asset instances on which we are running the vulnerability management program, and $TRisk$ is the triaged risk.

In the same way, we can calculate the normalized, enhanced, risk as:

\[nERisk = \frac{ERisk}{|A|}\]

Effectively, we are now calculating the score as follows:

\[nTRisk = \frac{\sum\limits_{v \in V} |A_v| \cdot tscore(v) }{|A|} = \sum\limits_{v \in V} \frac{|A_v|}{|A|} \cdot tscore(v)\]

Where $\frac{\vert A_v\vert }{\vert A\vert }$ is the fraction of assets affected by a vulnerability $v$.

Note that, with this definition, our risk still doesn’t have an upper bound: each asset can be affected by an arbitrary number of vulnerabilities, each contributing a positive term to the sum.

As it doesn’t depend on the absolute number of assets, the normalized risk can be used to compare the security posture of different parts of the company (e.g., your Corporate, Cloud, and on-premises environments, which in large companies are managed by different orgs).
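
For example, with made-up aggregated figures for two environments of different size, the normalized metric is what makes the comparison meaningful:

```python
# Made-up aggregated figures for two environments of different size.
environments = {
    "corporate": {"trisk": 12_400.0, "assets": 3_000},
    "cloud":     {"trisk": 48_500.0, "assets": 20_000},
}

for name, env in environments.items():
    ntrisk = env["trisk"] / env["assets"]   # nTRisk = TRisk / |A|
    print(f"{name}: nTRisk = {ntrisk:.2f}")

# The smaller corporate fleet carries more risk per asset (~4.13) than the
# much larger cloud fleet (~2.43), even though its absolute TRisk is lower.
```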

Defining what good looks like for Risk

Given the definition of normalized risk above, we can track progress by comparing, for instance, $nTRisk^{t1}$ and $nTRisk^{t2}$.

If $t2 > t1$ and $nTRisk^{t1} > nTRisk^{t2}$, we can say that we are making progress in the right direction.

The granularity in time depends on how frequently the vulnerability scanners are run. For massive domains, we may only be able to track progress every couple of days, or every week.

We can set a target for $nTRisk$, considering anything below it acceptable residual risk. Or, we can define risk bands (low, medium, high, and critical) to provide a very high-level idea of the risk our company is subject to.

As the reader has probably realized at this point, the quality and freshness of the asset inventory is critical to a meaningful risk score. Inventories are a typical problem for large companies, and we should invest substantial resources in getting the inventory right before taking any risk score too seriously.

Coverage

Risk alone can’t give the complete picture about a company’s security posture. When it comes to vulnerability management, it is important to have the risk calculated on as many assets as possible, ideally on all of them.

Not only that: the quality of coverage needs to be good enough to guarantee that at least the most critical vulnerabilities are surfaced in a timely manner.

Inventory Coverage

The Inventory Coverage can easily be calculated as the percentage of assets scanned by vulnerability management tools over the total number of assets in our inventory.

Again, a good understanding of the company’s inventory is a prerequisite for an effective management of vulnerabilities.

In other words, while the coverage of known assets is one of the main responsibilities of a vulnerability management program, completeness of the inventory (typically) is not. The inventory of the company’s assets is generally used for multiple purposes in diverse contexts, vulnerability management being only one of them.

Using multiple scanners

In large companies, it’s not uncommon to have multiple vulnerability scanners working together, typically assessing different environments, sometimes with some overlap.

Nowadays, many security companies try to provide tools for as many environments as possible, but it’s hard to find a single brand properly covering everything: hosted resources, cloud resources, containers, and network devices, just to mention some. Hyper-scale companies may also want to write their own custom tools, and, if they write software, also add language vulnerability management and static code analysis to the mix.

Having multiple sources of data for vulnerability management implies some sort of de-duplication or aggregation of the detected vulnerabilities. The reader should keep that in mind for the following metric definitions.

Calculate Inventory Coverage

We can calculate the asset coverage as the percentage of the assets we cover with at least one type of scanner. We can call this metric Inventory Coverage:

\[ICoverage = \frac{|A|}{|I|} \quad \text{with} \ A \subseteq I\]

$A$ is the set of assets covered by our vulnerability management program, while $I$ is the inventory of company’s Assets.
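
A trivial sketch with a hypothetical inventory and set of scanned assets:

```python
# Hypothetical inventory and the assets reached by at least one scanner.
inventory = {"host-1", "host-2", "host-3", "host-4", "vm-10", "vm-11"}  # I
scanned = {"host-1", "host-2", "host-3", "vm-10"}                       # A

# ICoverage = |A| / |I|, counting only scanned assets that are actually in
# the inventory (scanners may also see assets the inventory misses, which is
# an inventory gap rather than extra coverage).
icoverage = len(scanned & inventory) / len(inventory)
print(f"ICoverage = {icoverage:.0%}")   # 67%
```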

Coverage Confidence

When the vulnerability management landscape is large and complex, and custom tooling is used, we need to understand how well we are covering each domain. Assuming that we are fine as long as at least one security scanner analyses an asset creates a dangerous false sense of security.

To refine our understanding of our posture, we can introduce a measure of confidence in the coverage, that we can call Coverage Confidence ($CConfidence$).

The Inventory Coverage metric we saw before is still needed, for instance for compliance purposes, but we can introduce the Coverage Confidence formula to account for the quality of scanning, for each asset type.

To calculate the Coverage Confidence, we multiply the number of unique assets scanned in a given category by a confidence factor given by how much we trust the information collected. This confidence value reflects the capabilities, and the known limitations, of the specific security assessment tool used for each type of asset.

In practice, one way to do that is to consider how the scan is performed, relative to the type of asset at hand. For instance, we may have excellent confidence that our agent-based solution works great for Windows Servers, whereas a network scanner used on our Linux fleet likely won’t surface many vulnerabilities in installed packages that don’t run exposed services.

The Coverage Confidence metric can be calculated as:

\[CConfidence = \frac{\sum\limits_{c \in C} \theta_{c} \cdot |A_{c}|}{|I|}\]

Where $\theta_c \in (0,1]$ is the confidence we have in our scanning capability for the asset/scan category $c$.

$C$ is the set of scan categories we can identify in our environment. For instance, if we have two scanners, a network and an agent based, $C$ will contain these two elements. $A_{c}$ is the set of assets covered with scanning methodology $c$.

Note that we are assuming that different scanning methodologies cover different assets. If there is overlap for some assets, we can pick $C$ to consider the combination of two or more scanning methodologies. E.g.:

\[C = \{ \text{network and agent}, \text{only agent}, \text{only network} \}\]

In other words, for $CConfidence$ to make sense, $C$ should be taken so that $A_c$, for $c \in C$, is a partition of $A$.
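
A small sketch with made-up categories and confidence factors, assuming the categories partition the covered assets as described above:

```python
# Made-up scan categories, chosen so that they partition the covered assets,
# each with a confidence factor theta in (0, 1].
categories = {
    "agent only":        {"theta": 0.90, "assets": 6_000},
    "network only":      {"theta": 0.40, "assets": 2_500},
    "agent and network": {"theta": 0.95, "assets": 1_000},
}
inventory_size = 12_000   # |I|

# CConfidence = (sum over c of theta_c * |A_c|) / |I|
cconfidence = sum(c["theta"] * c["assets"] for c in categories.values()) / inventory_size

icoverage = sum(c["assets"] for c in categories.values()) / inventory_size
print(f"ICoverage = {icoverage:.2f}, CConfidence = {cconfidence:.2f}")  # 0.79 vs 0.61
```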

Defining what good looks like for Coverage

For the Coverage metric, the goal is to bring the value of $ICoverage$ and/or $CConfidence$ as close as possible to 1, by improving the asset coverage and the quality of scans.

We can measure progress on Coverage by looking at the difference between the values of $ICoverage$ and/or $CConfidence$ registered at different moments in time: for $t2 > t1$, if $CConfidence^{t2} > CConfidence^{t1}$, we are improving the coverage of our inventory.

Operational Metrics

Up to this point, we have only been talking about Risk and Coverage, framing them as Executive or first order metrics.

Executive metrics give us a synthetic view of how the vulnerability management program is going, but won’t tell us clearly how and where we can improve it.

On the other hand, Operational Metrics are Key Performance Indicators useful to understand where we should invest to improve the company’s security posture. Establishing operational metrics, and Service-Level Objectives (SLOs) on them, is an effective way to influence how the executive metrics should be driven down.

Without any claim to be exhaustive, in this section I will mention some of the metrics that have been used successfully on large-scale vulnerability management.

Before seeing a few examples of metrics that can be used in this context, we need to define what we mean by “management of a vulnerability”. We can say that a vulnerability is managed when it has been, either:

  • Eliminated, through asset upgrades or decommissions;
  • reviewed, and its risk accepted, possibly by changing its applicability with the effect of changing its Risk score;
  • deemed as a false positive.

Remediation Timeliness

One important metric to measure the quality of a vulnerability management program is the average time elapsed between the disclosure of a new vulnerability and its management. There are several metrics we can define for this, all pertaining to what we can call Remediation Timeliness.

Why is Remediation Timeliness an operational metric?

You might be wondering why I didn’t consider Remediation Timeliness as a first order metric. There are at least a couple of reasons for that.

Firstly, Risk should already be a function of remediation timeliness: we typically give higher priority to older vulnerabilities, everything else being equal.

Secondly, it’s (again) important to note that vulnerability management is an ongoing process that can’t realistically eliminate all known vulnerabilities. So, we shouldn’t optimize only for quick remediation, but rather for quickly bringing Risk down (and Coverage up).

All that being said, having Remediation Timeliness as an operational metric doesn’t make it less important: if the remediation time is too high in some company environment, we need to improve the operations related to vulnerability management in that domain. That will, in turn, bring the Risk down.

Time To Manage

Intuitively, Time To Manage ($TTManage$) is some sort of average time taken to manage a vulnerability from the moment it is published on some channel.

It should be noted that $TTManage$ is a “biased” metric, as it is calculated only on vulnerabilities that have already been managed.

$TTManage$ doesn’t consider the throughput of remediation, and if we optimize for it, we risk incentivizing the resolution of low-hanging fruit only.

Calculating and inspecting Time To Manage

For a given vulnerability $v$, $TTManage(v)$ is calculated by subtracting the time a vulnerability is disclosed, $t^d_v$, from the time it is managed, $t^m_v$.

\[TTManage(v) = t^m_v - t^d_v\]

One of the main purposes of operational metrics is to identify bottlenecks and gaps. We can further split Time To Manage into three stages: the time taken to 1) identify a vulnerability, 2) (optionally) triage it, and 3) remove it.

So, for every managed vulnerability $v$:

\[TTManage(v) = TTIdentify(v) + TTTriage(v) + TTRemove(v)\]

$TTIdentify(v)$ is the time between the moment a vulnerability (that applies to any of our assets) has been disclosed, and the moment it has been added in the triage or remediation queue.

Calculating $TTIdentify(v)$ is not straightforward: typically, we are only aware of vulnerabilities when they are disclosed by vendors, and often the matching rules are not immediately available. In practice, we can approximate $TTIdentify(v)$ with the time taken by the scanning software to surface a vulnerability $v$ from the moment it is disclosed to the public (i.e., the publication date of the CVE for $v$). This approximation may introduce errors, since vulnerable assets can be introduced in the environment after the corresponding vulnerability has been disclosed. We can accept this error as a motivating factor to introduce better screening of assets before their deployment (see Assets Freshness below).

$TTTriage(v)$ is the time needed to manually or automatically triage the vulnerability $v$. If a vulnerability management program does not have triage in place, this metric can be ignored; in this case, $TTRemove(v)$ is likely to be higher than it would be with manual or automatic triage in place, due to the additional noise in the remediation stage.

Finally, $TTRemove(v)$ is the time needed to eliminate $v$. Elimination can happen through workarounds, mitigations, upgrades, or simply risk acceptance.

In general, we expect the value of $TTRemove(v)$ to be “low” on average when proper deployment automation is in place. For instance, the presence of CI/CD pipelines can guarantee a low latency between the moment a new version of a 1st or 3rd party product becomes available and the moment it is deployed. Moreover, urgent upgrades, mandated by the discovery of new critical vulnerabilities, can be performed effortlessly and with little overhead for engineers.

Aggregated Time To Manage

One possible aggregation for the Time To Manage is the average over the set of vulnerabilities that have been managed in a reference period of time.

\[avgTTManage = \frac{\sum\limits_{v \in M} TTManage(v) }{|M|}\]

Where $M$ is the set of managed vulnerabilities over the reference period.

Another effective way to aggregate the time to manage is the Percentile function.

$P_{x}TTManage$ is then the $\left(\frac{x}{100}(\vert M\vert + 1)\right)^{th}$ term of the ordered set $\{ TTManage(v) : v \in M \}$.

Using the percentile function, we eliminate corner cases that produce spikes. For instance, we can pick $x = 90$ to filter out the top 10% of values.
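
Putting the pieces together, here is a sketch that computes $TTManage$ from made-up disclosure and management dates, then aggregates it with both the average and a nearest-rank reading of the percentile formula above:

```python
from datetime import date
from math import ceil

# Made-up managed vulnerabilities: (cve_id, disclosure date, management date).
managed = [
    ("CVE-2021-0001", date(2021, 3, 1), date(2021, 3, 12)),
    ("CVE-2021-0002", date(2021, 3, 5), date(2021, 4, 20)),
    ("CVE-2021-0003", date(2021, 2, 10), date(2021, 3, 2)),
]

# TTManage(v) = t_m - t_d, expressed in days.
tt_manage = sorted((t_m - t_d).days for _, t_d, t_m in managed)

avg_tt_manage = sum(tt_manage) / len(tt_manage)

def percentile(sorted_values, x):
    # Nearest-rank reading of the (x/100 * (|M| + 1))-th term, clamped to
    # the bounds of the set.
    rank = ceil(x / 100 * (len(sorted_values) + 1))
    rank = min(max(rank, 1), len(sorted_values))
    return sorted_values[rank - 1]

print(f"avgTTManage = {avg_tt_manage:.1f} days")           # 25.7 days
print(f"P90 TTManage = {percentile(tt_manage, 90)} days")  # 46 days
```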

A word about Scanning Frequency

Generally, by Scanning Frequency, we indicate how often we run our scanners on the company’s assets.

In vulnerability management literature, we can find Scanning Frequency as a top-level metric, but in this document you won’t find it. I do indirectly consider the frequency of scans in the timeliness metric, and in fact you could use $TTIdentify(v)$ as a separate operational metric, but there is also an important aspect we need to consider.

I have already mentioned how large companies frequently use multiple, and sometimes custom, scanners for a variety of systems. They could also use vulnerability and intelligence feeds provided by 3rd parties.

In this context, we might want to scan for vulnerabilities whenever something changes, in addition to a regular, periodic, scan.

For instance, we may want to scan virtual machines and Docker containers as they are being deployed, gating the deployment on the result of our analysis. When a security feed adds a particularly severe vulnerability, we may want to re-trigger our scanners, perhaps targeting the particular type of asset affected. This is also useful given the ephemeral nature of those virtual resources.

Having a good Scanning Frequency, without considering the cases above, can be suboptimal in some contexts, especially when scanning the entire company fleet is a very demanding operation.

Defining what good looks like for Remediation Timeliness

We can compare the aggregated Time To Manage with the rate at which, on average, new vulnerabilities are discovered on our assets. Since the goal is to bring Risk down, we need to manage vulnerabilities, on average, at least as fast as new ones appear.

Time To Manage is an effective metric for assessing this because, as we have seen, we can use its addends, individually, to understand where the bottlenecks are, and address them.

Assets Freshness

A radically different way to think about vulnerability management would be to forget about the Time To Manage and focus on how up to date our assets are.

As a matter of fact, most vulnerabilities are resolved by upgrading the affected resources, but this particular wording implies a reactive stance with respect to vulnerabilities affecting assets we are actually monitoring with our scanners.

If we reframe our goal as keeping assets up to date, then we both consider the asset coverage and, indirectly, the risk due to vulnerabilities.

Note that keeping assets fresh won’t solve everything: it’s important to understand the risk posed by known vulnerabilities, as upgrades might not always be readily available. In that case, we want to implement mitigations or workarounds until the problem can be resolved.

One big advantage of using asset freshness as a metric, and setting a goal on it, is that we are forcing the company to adopt ways to keep assets up to date.

We can define the freshness of an asset $a$ as the distance between its current version and the most up-to-date version available from the asset vendor.

\[freshness(a) = diff(version(a), latestVersion(a))\]

Calculating this metric is not as straightforward as it may seem.

For instance, Microsoft and Apple maintain a few “major” versions of their operating systems (Windows, iOS, and macOS), providing security upgrades for all of them. In many cases, we don’t want to blindly upgrade to the latest version: there could be a licensing problem, or a compatibility problem with our infrastructure, for instance.

We need to “relax” our conditions, and take the latest version available in the supported channel our assets are in.

Having an effective asset freshness metric implies collecting data for each product in use at the company, and for all the versions available. The silver lining is that, with this information, we can easily run an end-of-life campaign, which is tremendously important for the company’s security posture: end-of-life products won’t receive upgrades or security fixes, and sometimes vendors don’t even bother filing new CVEs.

Aggregating Asset Freshness

Similarly to what we have seen with the Time To Manage, two possible ways to produce an aggregated metric are to use an average or a percentile function over our inventory.

For instance, given our inventory $A$:

\[avgFreshness = \frac{\sum\limits_{a \in A} freshness(a)}{|A|}\]
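
As a sketch, using a deliberately naive diff() that counts how many minor releases an asset is behind within its support channel (made-up versions; a real implementation needs vendor-specific version data):

```python
# Made-up inventory entries: (asset id, deployed version, latest version
# available in the support channel the asset follows).
inventory = [
    ("web-1", (10, 2), (10, 5)),
    ("web-2", (10, 5), (10, 5)),
    ("db-1",  (9, 8),  (9, 12)),
]

def freshness(deployed, latest):
    # Naive diff(): number of minor releases behind within the same channel.
    # A channel/major mismatch is treated as the worst case here.
    if deployed[0] != latest[0]:
        return float("inf")
    return latest[1] - deployed[1]

values = [freshness(dep, latest) for _, dep, latest in inventory]
avg_freshness = sum(values) / len(values)     # avgFreshness = sum / |A|
print(f"avgFreshness = {avg_freshness:.2f}")  # (3 + 0 + 4) / 3 ≈ 2.33
```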

Defining what good looks like for Assets Freshness

The ideal situation is to have an average asset freshness as close as possible to 0.

Besides the absolute value of a metric like $avgFreshness$, we should also consider its behaviour as a function of time, and the velocity at which published upgrades for the assets we own are deployed in the company infrastructure.

A high $avgFreshness$ and a high latency in catching up with new versions indicate that the company should invest in streamlining the validation of new software, deployment, and configuration management.

Conclusions

If there is one take-away I would like the reader to get from this note, it is the importance of having a detailed and up-to-date inventory of company assets. Without a good inventory, our measure of Risk is incomplete and can create a false sense of security; moreover, any investment put into improving our vulnerability management program would probably be suboptimal.

The second piece of advice I can give to people leading large-scale vulnerability management is to use synthetic executive metrics to evaluate the status and progress of their efforts. We can always “drill down” or use operational metrics when needed, but when operating at scale in a fragmented environment, it is important to provide only a few KPIs, shared across all the domains covered by the program.

Then, we should use operational metrics to influence how we want to bring security risk down. An operational metric measures a specific activity we would like to happen, and suggests how that activity should be carried out. Asset freshness is a great example: keeping assets up to date is always a good idea, and using that metric to evaluate the work of our Corporate (IT) teams in the context of security can boost efforts in that direction.


