Working with NVD JSON inside MongoDB

In the last entry we added an NVD file to MongoDB with an insert command from PyMongo. At this point we have a single yearly NVD file (2002) loaded into MongoDB, and we want to find out whether this is a useful way of managing the NVD data. Was this an optimal approach to managing a store of CVE data? Let’s review by querying what exists in the new MongoDB nvdcve collection.

The query can be done in Python, but a convenient way to quickly check the collection is either a graphical MongoDB query tool or the command-line mongo client that was installed with MongoDB. In this case, I use the mongo client:

db.nvdcve.find({},{id:1})
{ "_id" : ObjectId("5f0bd7b9f9cdf5e2f0183be6") }
db.nvdcve.find({},{id:1}).count()
1

If you expected each CVE# to appear as its own document, that was wishful thinking: the structure of the downloaded file does not provide it. Instead, there is one large document for the year containing all of the year’s CVEs as sub-documents under a single document ID.

This can be easily verified in Jupyter with Python code as well: using the find_one() method we see that there are 6748 CVEs listed under one document.

6748 CVEs inside one master document–likely not an optimal organization
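A minimal sketch of that check, assuming the document returned by find_one() keeps its CVEs in a top-level CVE_Items list (the same key the flattening code in this post normalizes on). The helper name count_cve_items and the sample IDs are placeholders for illustration, not real CVEs:

```python
def count_cve_items(doc):
    """Return the number of CVE entries nested in one yearly feed document."""
    # Assumption: the individual CVEs live in a top-level "CVE_Items" list.
    return len(doc.get("CVE_Items", []))


# Tiny stand-in for the single document that find_one() would return.
sample_doc = {
    "CVE_Items": [
        {"cve": {"CVE_data_meta": {"ID": "CVE-0000-0001"}}},  # placeholder ID
        {"cve": {"CVE_data_meta": {"ID": "CVE-0000-0002"}}},  # placeholder ID
    ],
}

print(count_cve_items(sample_doc))
```

Against the real collection, the same function could be called on the result of find_one() to reproduce the count reported above.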

This format likely isn’t ideal. As an alternative, it’s simple to flatten the nested JSON so that we can insert one document per CVE, keyed by CVE#, since that is how almost everyone searches and uses the NVD database. Recall the earlier posts about flattening JSON; they will be useful for this task.

Below is a code example that flattens the file, cleans up some key names and inserts the records into the collection.

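import json
from pandas import json_normalize  # pandas >= 1.0

#read the json file
with open('nvdcve-1.1-2002.json') as data_file:
    data = json.load(data_file)
#normalize out the CVEs: one row per entry in CVE_Items
df = json_normalize(data, 'CVE_Items')
df.rename({'cve.CVE_data_meta.ID': 'CVE_ID'}, axis=1, inplace=True)
#change the awkward index name (rename_axis returns a new frame, so assign it)
df = df.rename_axis('CVE')
#MongoDB doesn't like '.' in key names; regex=False treats '.' as a literal dot
df.columns = df.columns.str.replace(".", "_", regex=False)
#insert records into MongoDB (mycol is the target PyMongo collection)
result = mycol.insert_many(df.to_dict('records'))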

This produces the document count below, which is likely more amenable to working with CVEs.

Re-checking the document count

As demonstrated below, this structure is more intuitive to work with and query against.

The new structure allows for simple querying of CVEs
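As a rough sketch of what such a lookup could look like from Python, assuming the flattened documents carry the CVE_ID key created by the rename step earlier: find_cve is a hypothetical helper, and InMemoryCollection is only a stand-in so the example runs without a server. With PyMongo, a real collection object would be passed instead, since it exposes the same find_one() call.

```python
def find_cve(collection, cve_id):
    """Look up one flattened CVE document by its CVE_ID field."""
    return collection.find_one({"CVE_ID": cve_id})


class InMemoryCollection:
    """Stand-in with a find_one() that does exact-match filtering only."""

    def __init__(self, docs):
        self._docs = list(docs)

    def find_one(self, query):
        # Return the first document whose fields equal every query value.
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None


# Placeholder records, not real NVD data.
collection = InMemoryCollection([
    {"CVE_ID": "CVE-0000-0001", "publishedDate": "2002-01-01T00:00Z"},
    {"CVE_ID": "CVE-0000-0002", "publishedDate": "2002-01-02T00:00Z"},
])

print(find_cve(collection, "CVE-0000-0002"))
```

If lookups by CVE_ID become frequent, an index on that field (for example through PyMongo's create_index) is worth considering.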

NVD JSON in MongoDB

Most organizations that mirror CVEs do so because they are adding Temporal and/or Environmental components to the Base CVSS score. It is likely efficient to store scores from a feed in a traditional RDBMS. However, if an organization needs to combine unstructured data such as news articles, email chains, etc. with their CVE data, then a database that optimizes unstructured storage may be worth considering.

Adding overall CVE information (e.g. CVSS Base/Temporal/Environmental scoring) together with TI and asset investigations to MongoDB is one solution for managing all the information in a hybrid structured/unstructured format. The result will be easily searchable and indexable. Starting with a simple routine to insert a file, I will demonstrate how to achieve this and ultimately build a simple scoring platform.

Let’s start with a simple piece of code to store the credentials on the file system rather than in the code.

def get_params(filename):
    '''Read key=value pairs from a file and return them as a dictionary.
        Place your MongoDB connection information in this file, one pair per line:
        server=<yourservername or IP>
        username=<username>
        password=<mongopassword>
        authSource=<db you are authenticating against>
    '''
    myparams = {}
    with open(filename) as myfile:
        for line in myfile:
            if not line.strip():
                continue  # ignore blank lines
            # split on the first "=" only, so values may themselves contain "="
            key, value = line.partition("=")[::2]
            myparams[key.strip()] = value.strip()
    return myparams
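To make the expected file format concrete, here is a small sketch that applies the same partition("=") logic to a few in-memory lines instead of a file, with the addition of skipping blank lines. The helper name parse_params and all of the values are made up for illustration:

```python
def parse_params(lines):
    """Parse key=value lines from any iterable of strings into a dictionary."""
    params = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # partition() splits on the first "=" only
        key, value = line.partition("=")[::2]
        params[key.strip()] = value.strip()
    return params


# Hypothetical contents of a credentials file such as mongo.env.
example_lines = [
    "server=localhost\n",
    "username=nvd_reader\n",
    "password=not=a=real=secret\n",
    "\n",
    "authSource=admin\n",
]

params = parse_params(example_lines)
print(params["password"])
```

Because only the first "=" is treated as the separator, a password that itself contains "=" comes through intact.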

Now add a bit of PyMongo connection code; plenty of PyMongo documentation and tutorials are available online:

import json
from pymongo import MongoClient
def connect_mongo(params):
    '''Connect to a mongo instance using params from input
        params - a dictionary with server, username, password and authSource
    '''
    # the keys match the names written in the credentials file
    # SCRAM-SHA-1 assumes the user was created with SCRAM-SHA-1 credentials
    client = MongoClient(params.get("server"),
                         username=params.get("username"),
                         password=params.get("password"),
                         authSource=params.get("authSource"),
                         authMechanism='SCRAM-SHA-1')
    return client

Finally, add some code to actually insert the file:

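def insert_json_file(client, db, collection, file):
    '''Load a JSON file and insert its contents into client[db][collection].
        The client is closed once the insert has finished.
    '''
    mydb = client[db]
    mycol = mydb[collection]

    with open(file) as jsonf:
        file_data = json.load(jsonf)
        # a top-level list becomes one document per element,
        # anything else is stored as a single document
        if isinstance(file_data, list):
            mycol.insert_many(file_data)
        else:
            mycol.insert_one(file_data)

    client.close()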

Most of the above code is covered in various MongoDB-Python tutorials for those who want to understand it in more depth. Calling the code and actually inserting a document is simple:

par = get_params("mongo.env")
cli = connect_mongo(par)
insert_json_file(cli, "nvd", "nvdcve", "nvdcve-1.1-2002.json")

This inserts the 2002 JSON file into Mongo. Note that the db I used requires authentication (many MongoDB-Python demonstrations on the web omit authentication).