Tired of hearing/reading only about generative AI models? This post explores how Artificial Intelligence and Machine Learning can help with a very real cybersecurity problem.
Table of Contents
- Predicting CVSS Vectors with text embeddings and random forests
Predicting CVSS Vectors with text embeddings and random forests
The problem
If you work in information security, you are probably aware of the delays in enriching or classifying CVE impacts that we are experiencing in 2024.
One of the value-added by the NIST CVE analysis is the CVSS vector. In short, a CVSS vector is a concise string that summarises the characteristics and severity of a software vulnerability based on the Common Vulnerability Scoring System (CVSS). It consists of several metrics like Attack Vector, Attack Complexity, and Impact scores, each with a specific value. This vector allows for a standardised way to assess and communicate the risk associated with a vulnerability, enabling security professionals to prioritise their response efforts.
According to their website, the National Vulnerability Database (NVD) is facing a growing backlog of vulnerabilities due to increased software and changes in interagency support.
While NIST is prioritising the analysis of significant vulnerabilities, and while they are exploring long-term solutions like a consortium of stakeholders to collaborate on improving the NVD. As of today, companies involved in vulnerability management have to come up with alternative solutions to estimate the risk they are facing and prioritise vulnerability remediation.
AI is a big, hot topic today, and many think it will steal our jobs and be applied to virtually everything, in the long term.
While that sounds quite inaccurate to me, I also believe that the value of (text) embeddings is something that didn’t have the right attention so far 🙂.
So, let’s see how we can do something useful with AI.
What are text embeddings?
Text embeddings are a way to represent words or phrases as dense vectors (lists of numbers) in a high-dimensional space.
In simple terms, these vectors capture the semantic meaning of the text, with respect to the training data, positioning words or phrases with similar meanings closer to each other in this space. This allows computers to perform various natural language processing tasks by measuring the distance or similarity between these vectors.
Embeddings are basically calculated as the activation of one of the latest layers of a trained neural network. Currently, the most popular way to generate an embedding model is by utilising pre-trained Large Language Models (LLMs) like BERT or GPT variants. These models have revolutionised natural language processing due to their ability to capture contextual relationships between words and generate high-quality embeddings.
It should be noted that LLMs are trained on a gargantuan amount of data to be generally useful. That means that embedding models derived from LLMs are, typically, capable of finding and encoding meaning for many domains.
As you probably know, or as you can imagine, more specialised (or fine-tuned) LLMs can perform much better in the domain they have been optimised for.
In this post, I will use a general purpose text embedding model, namely text-embedding-gecko from Google VertexAI. More precisely, I will use text-embedding-004
, which is a mono-modal embedder trained on data written in English language.
What are random forests?
A forest is made up of trees, right? Let’s start from there:
A decision tree is a flowchart-like structure used for classification or regression. It learns from data by splitting it into subsets based on feature values, creating branches and leaves. The leaf nodes represent final predictions. Training involves finding the best splits to maximize information gain (involving entropy) or, in other words, minimize impurity (i.e., number of different classes in a node).
A random forest classifies data by constructing multiple decision trees. Each tree provides a classification. The final classification is determined by a majority vote among all trees. This ensemble approach improves accuracy and robustness compared to a single decision tree.
Random forests are trained by training individual trees on a random sample (with repetition) of input data and on a random subset of features.
Okay, cool, but what does all this have to do with CVEs?
To assign a CVSS vector to a CVE, a security analyst would carefully read and understand the vulnerability description. Then, using their security knowledge, they would create a meaningful vector, allowing the attribution of a base risk score to the vulnerability.
Can AI do that? It certainly can, to some extent.
You probably cannot just feed the description to a large language model and hope to get a super accurate CVSS vector. At least in 2024.
One reason is that LLMs produce sequences of tokens based on previously seen (or generated) tokens, so the initial part of the CVSS vector could influence the following parts. Of course, that heavily depends on the data the specific model in use has been trained on.
But fear not, AI is not just LLMs and sharks.
Getting text and embeddings is a great way to extract meaning from words. That meaning, encoded with a vector in a highly dimensional space, is a perfect candidate for classifying machine learning models.
Let’s see how to combine these two things.
The juicy part
Step 1. Getting the data
This is the easiest part. NIST allows downloading CVE feeds straight from their site.
The data has been incrementally improving over the years, but it’s still not great as historically there isn’t much structure on some CVE attributes and each CNA has its own way of filing CVEs…
Anyway, NIST has been doing a decent job at assigning and improving the CVSS vector during vulnerabilities lifecycle. At least until 2024.
For our little experiment, we will download CVE data for 2022 and 2023, saving them in a handy Google BigQuery table.
feed_format = 'https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-{}.json.gz'
years = (2022, 2023)
[...]
skipped = 0
processed = 0
for year in years:
year_cves = []
nvd_data = nvd_feed_url_to_json(feed_format.format(year))
print(f"Processing data for year {year}: "
f"{len(nvd_data['CVE_Items'])} total CVEs")
for item in nvd_data['CVE_Items']:
relevant_data = extract_cve_from_item(item)
if relevant_data is None:
skipped += 1
else:
year_cves.append(relevant_data)
processed += 1
print("Saving to bigquery")
save_to_bigquery(
project_id=PROJECT_ID, dataset_id=DATASET_ID, table_id=TABLE_ID,
schema=BIGQUERY_CVES_SCHEMA, data_list=year_cves)
print(f"Processed CVEs: {processed}\nSkipped CVEs {skipped}")
extract_cve_from_item
is a simple function extracting from “raw” CVE data the information we need to train our model. We only handle CVEs with baseMetricV3
as we are only interested in CVSS v3.
def extract_cve_from_item(item: Dict[str, Any]) -> Optional[Dict[str, Any]]:
if 'baseMetricV3' not in item['impact']:
return None
en_text = next(
(item['value'] for item in item['cve']['description']['description_data']
if item['lang'] == 'en'), None)
if en_text is None:
return None
return {
'cve_id': item['cve']['CVE_data_meta']['ID'],
'assigner': item['cve']['CVE_data_meta']['ASSIGNER'],
'cvssv3_vector_string': item['impact']['baseMetricV3']['cvssV3']['vectorString'],
'cvssv3_attack_vector': item['impact']['baseMetricV3']['cvssV3']['attackVector'],
'cvssv3_attack_complexity': item['impact']['baseMetricV3']['cvssV3']['attackComplexity'],
'cvssv3_privileges_required': item['impact']['baseMetricV3']['cvssV3']['privilegesRequired'],
'cvssv3_user_interaction': item['impact']['baseMetricV3']['cvssV3']['userInteraction'],
'cvssv3_scope': item['impact']['baseMetricV3']['cvssV3']['scope'],
'cvssv3_confidentiality_impact': item['impact']['baseMetricV3']['cvssV3']['confidentialityImpact'],
'cvssv3_integrity_impact': item['impact']['baseMetricV3']['cvssV3']['integrityImpact'],
'cvssv3_availability_impact': item['impact']['baseMetricV3']['cvssV3']['availabilityImpact'],
'cvssv3_base_score': item['impact']['baseMetricV3']['cvssV3']['baseScore'],
'cvssv3_base_severity': item['impact']['baseMetricV3']['cvssV3']['baseSeverity'],
'english_description': en_text,
}
Sample output for the above snippet:
Processing data for year 2022: 24961 total CVEs
Saving to bigquery
Loaded 23746 rows to BigQuery table cves
Processing data for year 2023: 28462 total CVEs
Saving to bigquery
Loaded 25378 rows to BigQuery table cves
Processed CVEs: 70406
Skipped CVEs 5895
Step 2. Getting the embeddings
Now that we have a BigQuery table with all the relevant pieces of information, we need to translate english_description
into features we can use to classify the CVE into multiple classes.
To do so, we will be using the Google Vertex AI python API. Note that to deal with the rate limiting on that API, we are using a helper function, splitting the data into batches and making sure we don’t submit more than 100 requests per second.
Here, I load the data set from BigQuery into a Pandas DataFrame:
from google.cloud import bigquery
import pandas as pd
def run_bq_query(sql: str) -> pd.DataFrame:
# Create BQ client
bq_client = bigquery.Client(project=PROJECT_ID)
# Try dry run before executing query to catch any errors
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
# This will raise an exception if there's a query error
bq_client.query(sql, job_config=job_config)
# If dry run succeeds without errors, proceed to run query
job_config = bigquery.QueryJobConfig()
client_result = bq_client.query(sql, job_config=job_config)
job_id = client_result.job_id
# Wait for query/job to finish running, then get & return data frame
df = client_result.result().to_arrow().to_pandas()
print(f"Finished job_id: {job_id}")
return df
cve_df = run_bq_query("SELECT * FROM nvd.cves")
Then I calculate the embeddings for the whole table. For the model used here, each embedding is a 768-dimensional vector.
from vertexai.language_models import TextEmbeddingModel
# Initialize Vertex AI SDK
[...]
# Load the TextEmbeddingModel
model = TextEmbeddingModel.from_pretrained("text-embedding-004")
embeddings = encode_text_to_embedding_batched(
sentences=new_cve_df["english_description"].tolist(),
model=model,
api_calls_per_second = 95/60,
batch_size = 20)
And finally, add the new column with the embeddings to the DataFrame.
cve_df.insert(
loc=len(cve_df.columns), column='description_embeddings',
value=embeddings.tolist())
Step 3. Training the model
Now that we have numeric vectors describing our CVEs, we can try to classify them.
We could use a single classifier, trained to predict the entire CVSS Vector, but that would be inaccurate given the possible combinations of 8 different dimensions of that vector.
So, we will be training 8 different classifiers, one per each dimension of the CVSS Vector.
# Dimensions we are interested in predicting
categories = [
'cvssv3_attack_vector',
'cvssv3_attack_complexity',
'cvssv3_privileges_required',
'cvssv3_user_interaction',
'cvssv3_scope',
'cvssv3_confidentiality_impact',
'cvssv3_integrity_impact',
'cvssv3_availability_impact',
]
For each Random Forest classifier, we will split the data into training and test sets, also calculating the accuracy:
estimators = 300
X = new_cve_df['description_embeddings'].to_list()
classifiers = {}
for category in categories:
print(f"Training random forest classifier for {category}...")
y = new_cve_df[category].values
# Define training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.2, random_state = 2)
# Train the classifier
clf = RandomForestClassifier(n_estimators=estimators,n_jobs=-1)
clf.fit(X_train, y_train)
classifiers[category] = {'clf': clf}
# Calculate the classifier accuracy
y_pred = clf.predict(X_test)
classifiers[category]['acc'] = accuracy_score(y_test, y_pred)
Let’s see how the classifiers performed on the test set:
for k, v in classifiers.items():
print(f"{k:30} - accuracy: {v['acc']:>6.3f}")
Which produced:
cvssv3_attack_vector - accuracy: 0.901
cvssv3_attack_complexity - accuracy: 0.964
cvssv3_privileges_required - accuracy: 0.753
cvssv3_user_interaction - accuracy: 0.924
cvssv3_scope - accuracy: 0.958
cvssv3_confidentiality_impact - accuracy: 0.831
cvssv3_integrity_impact - accuracy: 0.833
cvssv3_availability_impact - accuracy: 0.868
Not bad for a quick and dirty model!
To summarise, “just” looking at the CVE description we are able to predict
- the attack vector, attack complexity, need for user interaction, scope with an accuracy of over 90%
- the impacts with an accuracy of over 83%
- the need of privileges with an accuracy of 75%
‘nuff test data, how do we perform on 2024 data?
Just for fun, let’s try out a few unclassified CVEs.
To test the classifier, I will use the following code:
# Test classifiers
en_desc = '<will put description here>'
# Get the embedding of this desc and predict its category
desc_embedding = model.get_embeddings([en_desc])[0].values
# Predict each dimension of the CVSS Vector
for category in categories:
pred = classifiers[category]['clf'].predict([desc_embedding])
print(f"Predicted {category:>30}: {pred[0]}")
CVE-2024-39243
For this CVE, we do have a CVSS 3.1 vector on the NIST website:
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
The CVE description recites:
An issue discovered in skycaiji 2.8 allows attackers to run arbitrary code
via crafted POST request to /index.php?s=/admin/develop/editor_save.
Let’s see what our classifier predicts:
Predicted cvssv3_attack_vector: NETWORK
Predicted cvssv3_attack_complexity: LOW
Predicted cvssv3_privileges_required: NONE
Predicted cvssv3_user_interaction: NONE
Predicted cvssv3_scope: UNCHANGED
Predicted cvssv3_confidentiality_impact: HIGH
Predicted cvssv3_integrity_impact: HIGH
Predicted cvssv3_availability_impact: HIGH
bullseye!
CVE-2024-27981
This CVE does not have a CVSS 3.1 vector on the NIST website.
The CVE description says:
A Command Injection vulnerability found in a Self-Hosted UniFi Network Servers
(Linux) with UniFi Network Application (Version 8.0.28 and earlier) allows a
malicious actor with UniFi Network Application Administrator credentials to
escalate privileges to root on the host device. Affected Products: UniFi
Network Application (Version 8.0.28 and earlier).
Mitigation: Update UniFi Network Application to Version 8.1.113 or later.
Using the same code as before, we get:
Predicted cvssv3_attack_vector: NETWORK
Predicted cvssv3_attack_complexity: LOW
Predicted cvssv3_privileges_required: LOW
Predicted cvssv3_user_interaction: NONE
Predicted cvssv3_scope: UNCHANGED
Predicted cvssv3_confidentiality_impact: HIGH
Predicted cvssv3_integrity_impact: HIGH
Predicted cvssv3_availability_impact: HIGH
This vector roughly makes sense to me. Maybe with some caveats.
Impact dimensions seem accurate, and so does the attack vector. User interaction is not required indeed, as long as we have some (limited?) privileges: i.e., admin on the appliance.
Things that make less sense to me are:
- Scope: the description doesn’t help much, as I don’t know what I can do with a UniFi Network Application Administrator access.
- Attack complexity is not clear. Unless we know how to exploit this, we can’t really tell…
CVE-2024-4764
We have a vector for this one:
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
Let’s see what our prediction says, just looking at the following description:
Multiple WebRTC threads could have claimed a newly connected audio input
leading to use-after-free. This vulnerability affects Firefox < 126.
Predicted cvssv3_attack_vector: NETWORK
Predicted cvssv3_attack_complexity: LOW
Predicted cvssv3_privileges_required: NONE
Predicted cvssv3_user_interaction: NONE
Predicted cvssv3_scope: UNCHANGED
Predicted cvssv3_confidentiality_impact: HIGH
Predicted cvssv3_integrity_impact: HIGH
Predicted cvssv3_availability_impact: HIGH
Wow, another one bites the dust :)
How can we improve?
75%, for the worst case, might not be enough in some scenarios. And, in general, we might want to improve that 83%+ too, to avoid a false sense of security (or insecurity) while calculating the overall vulnerability management risk for your domain.
There are several ways to improve the model. To name some:
- Add more unstructured/structured data from the CVE when calculating the embeddings.
- Add some information about the vulnerable software or intelligence data we have from different security feeds.
- Test different classifiers or experiment with classifiers parameters.
- Fine-tune the text embedding model for vulnerability management and cybersecurity.
- Train a model from scratch on cybersecurity data.
Conclusion (written by AI!)
In conclusion, this exploration into leveraging text embeddings and machine learning for CVSS vector prediction has showcased the potential of AI to address the critical challenge of timely vulnerability assessment. The ability to automatically estimate CVSS vectors based on vulnerability descriptions, even before official assignments, empowers organisations to proactively manage their security risks. While the initial results demonstrate promise, there remains room for further refinement and enhancement through the integration of domain-specific knowledge and the exploration of more advanced language models and classification algorithms.
As the cybersecurity landscape continues to evolve, the integration of AI-driven solutions like this will become increasingly vital in ensuring the resilience of our digital infrastructure.
By automating and accelerating the process of CVSS vector prediction, we can help organisations stay ahead of emerging threats and safeguard their critical assets against exploitation. The path forward involves continued research and collaboration to develop even more sophisticated models capable of accurately capturing the nuances and complexities of vulnerabilities, ultimately contributing to a more secure and resilient digital future.