Category Archives: JSON

JQ for JSON Files and Feeds

The open-source jq utility is an enormously handy tool for quick-and-dirty parsing of JSON files or feeds. The tool was created by Stephen Dolan in 2012 as a lightweight C program, and it's available for free download on various platforms including Windows, Mac, and Linux. jq is particularly well suited to the lightweight, quick ETL-type work frequently undertaken by security specialists working in data-intensive environments.

The tool's usage and style will be familiar to those accustomed to working in Linux or other Unix-like systems: it belongs to the sed/awk/grep genre of parsing utilities. Feeding JSON into jq with few or no options at all 'prettifies' the text into nicely formatted, colorized output that is much easier to read and that gives a quick sense of the document's organization. Investing time in learning its more advanced functions will be worthwhile for anyone who routinely deals with JSON feeds.

There are various YouTube and web tutorials to aid in learning the tool. In addition, there is a wide community of support: if you run into an issue, chances are you will find it asked and answered on a forum somewhere. The purpose here is to introduce some examples showing how it can be useful to those of us in a security niche who may not have run into it previously.

Some Examples

Looking back at the JSON CVE feed files from NVD discussed in the last few blog articles, it's simple to find useful places where one can apply jq. In our first example we use jq with no options (except colorize) to format an NVD JSON file:

rzupancic@tiserver:~/CVE$ cat nvdcve-1.1-2009.json | jq -C
{
  "CVE_data_type": "CVE",
  "CVE_data_format": "MITRE",
  "CVE_data_version": "4.0",
  "CVE_data_numberOfCVEs": "5001",
  "CVE_data_timestamp": "2020-11-24T08:56Z",
  "CVE_Items": [
    {
      "cve": {
        "data_type": "CVE",
        "data_format": "MITRE",
        "data_version": "4.0",
        "CVE_data_meta": {
          "ID": "CVE-2009-0001",
          "ASSIGNER": "cve@mitre.org"

Later on we will use one of the fields shown above, CVE_data_numberOfCVEs, in an advanced example.

Compare the above output to the raw output of cat’ing the file and you will see an immediate benefit of jq. The formatting is nice for getting an overview of the data–a quick view of the data’s keys and nesting. Harnessing the real power of the utility requires acquiring some basic proficiency in navigating the data’s keys and arrays together with the utility’s filtering options and built-in functions.

In the next example, also using the NVD JSON files, suppose one wanted to parse out and view all of the CVE IDs. We introduce the use of a filter:

rzupancic@tiserver:~/CVE$ cat nvdcve-1.1-2009.json | jq -C .CVE_Items[].cve.CVE_data_meta.ID
"CVE-2009-0001"
"CVE-2009-0002"
"CVE-2009-0003"
"CVE-2009-0004"
"CVE-2009-0005"
"CVE-2009-0006"
"CVE-2009-0007"
"CVE-2009-0008"
"CVE-2009-0009"
"CVE-2009-0010"
"CVE-2009-0011"
"CVE-2009-0012"
"CVE-2009-0013"

Other keys can be added to the query filter to selectively add complexity and delve deeper into an investigation of the CVEs–all within a single command line.
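For comparison, the same extraction can be sketched in Python with the standard json module. This is a minimal sketch against a hypothetical two-entry stand-in for the NVD file, not the real feed:

```python
import json

# A hypothetical two-entry stand-in for nvdcve-1.1-2009.json
sample = json.loads("""
{"CVE_Items": [
  {"cve": {"CVE_data_meta": {"ID": "CVE-2009-0001"}}},
  {"cve": {"CVE_data_meta": {"ID": "CVE-2009-0002"}}}
]}
""")

# Equivalent of the jq filter .CVE_Items[].cve.CVE_data_meta.ID
ids = [item["cve"]["CVE_data_meta"]["ID"] for item in sample["CVE_Items"]]
print(ids)  # ['CVE-2009-0001', 'CVE-2009-0002']
```

The list comprehension plays the role of jq's `[]` iterator, walking each element of the CVE_Items array and dereferencing the nested keys.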

JQ with Web Services

To demonstrate how to grab data directly from a web service, here is an example of using the curl utility together with jq to parse an API call against the SecurityTrails whois API, in this case to capture the nameservers of a domain. Note that this example, like the last, takes into account JSON's use of arrays to organize data as ordered sequences: brackets dereference arrays, while periods dereference keys (as in key/value pairs).

rzupancic@tiserver:~/CVE$ curl --request GET --url https://api.securitytrails.com/v1/history/grumblesoft.com/whois --header 'apikey: <here>' | jq .result.items[0].nameServers
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6321  100  6321    0     0   3967      0  0:00:01  0:00:01 --:--:--  3967
[
  "NS41.DOMAINCONTROL.COM",
  "NS42.DOMAINCONTROL.COM"
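The same dereferencing works in Python once the response body is parsed. Here is a minimal sketch against a hypothetical stand-in for the response body; in practice the JSON would come from the curl call above or a requests call:

```python
import json

# Hypothetical stand-in for the SecurityTrails whois response body
body = json.loads("""
{"result": {"items": [
  {"nameServers": ["NS41.DOMAINCONTROL.COM", "NS42.DOMAINCONTROL.COM"]}
]}}
""")

# Equivalent of the jq filter .result.items[0].nameServers
nameservers = body["result"]["items"][0]["nameServers"]
print(nameservers)
```

Note the mix of key lookups and the `[0]` array index, mirroring the periods and brackets in the jq filter.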

An Advanced Example

Finally, to see something slightly more advanced, suppose you are wondering what the total CVE count is across the combined set of NVD JSON files. One of the fields from the NVD JSON file listed above, CVE_data_numberOfCVEs, can be used to add the total CVEs in all files together. Here's an example in the shell:

Demonstration of jq with pipes in the Bash Shell

This calculates the total CVE count from all the yearly JSON files: 153,671 as of a point in time (the number changes as the feeds are updated). It uses the tonumber conversion function and a summing reduce. The line of code in the shell capture above is:

cat *json | jq '.CVE_data_numberOfCVEs|tonumber' | jq -n '[inputs | . ] | reduce .[] as $num (0; .+$num)'
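A pure-Python equivalent of the same reduction, sketched against in-memory stand-ins rather than files on disk (the counts below are illustrative, not the real totals):

```python
# Stand-ins for the per-year NVD documents; note NVD stores the count as a
# string, which is why the jq version needs tonumber
docs = [
    {"CVE_data_numberOfCVEs": "5001"},
    {"CVE_data_numberOfCVEs": "4500"},
]

# Equivalent of tonumber plus the reduce: convert each count, then sum
total = sum(int(d["CVE_data_numberOfCVEs"]) for d in docs)
print(total)  # 9501
```

The generator expression inside sum() plays the same role as jq's `reduce .[] as $num (0; .+$num)`.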

Working with NVD JSON inside of MongoDB

In the last entry we added an NVD file to MongoDB with an insert command from PyMongo. Hence, at this point we have a single yearly NVD file (2002) posted into MongoDB, and we want to investigate whether this is a useful way of managing the NVD data. Was this an optimal approach to managing a store of CVE data? Let's review by querying what exists in the new MongoDB nvdcve collection.

The query can be done in Python, but a very convenient way to quickly check the collection is through either a graphical MongoDB query tool or simply the command-line mongo client that was installed with MongoDB. In this case, I simply use the mongo client to quickly query the collection:

db.nvdcve.find({},{id:1})
{ "_id" : ObjectId("5f0bd7b9f9cdf5e2f0183be6") }
db.nvdcve.find({},{id:1}).count()
1

If you expected each CVE to appear as its own document, that was wishful thinking: it is not how the downloaded file is structured. Instead, there is one large document for the year containing all of that year's CVEs as sub-documents under one umbrella document ID.

This can be easily verified in Jupyter with Python code as well: using the find_one() method we see that there are 6748 CVEs listed under one document.

6748 CVEs inside one master document–likely not an optimal organization

This format likely isn't ideal. As an alternative, it's simple to flatten out the nested JSON so that we can insert documents keyed by CVE number, given that this is the way almost everyone searches and uses the NVD database. Recall earlier posts about flattening JSON; they will be useful for this task.

Below is a code example to flatten, clean up some key names, and insert into the collection.

import json
import pandas as pd

# read the json file
with open('nvdcve-1.1-2002.json') as data_file:
    data = json.load(data_file)
# normalize out the CVEs
df = pd.json_normalize(data, 'CVE_Items')
df.rename({'cve.CVE_data_meta.ID': 'CVE_ID'}, axis=1, inplace=True)
# change the awkward index name (rename_axis returns a copy)
df = df.rename_axis('CVE')
# MongoDB doesn't like '.' in key names
df.columns = df.columns.str.replace('.', '_', regex=False)
# insert records into MongoDB
# (mycol is the PyMongo collection object set up in the last entry)
result = mycol.insert_many(df.to_dict('records'))

This produces the document count shown below, a structure likely more amenable to working with CVEs.

Re-checking the document count

As demonstrated below, this structure is more intuitive to work with and query against.

The new structure allows for simple querying of CVEs

Viewing the NVD’s JSON CVE Datafeed as a Pandas Dataframe

The National Vulnerability Database publishes complete CVE (Common Vulnerabilities and Exposures) information as a JSON-formatted data-feed that can be downloaded into a local repository such as MongoDB and/or processed via scripting. Here is a demonstration of how to easily manipulate, format and query this JSON with Python. Begin by downloading the JSON data feed file from the NVD site here.

You will note that the file is deeply nested and challenging to work with without a bit of manipulation. Creating a Pandas Dataframe is perfect for this. I suggest using a Jupyter Notebook to explore the data structure and understand how the nesting might need to be flattened or otherwise organized for your purposes. The Pandas and JSON modules will be very useful.

From a Python perspective, the JSON nesting consists of nested dictionaries. If you simply ‘cat’ or ‘more’ the data file on a command line it will look a bit tangled, but the JSON module helps import it in such a way as to facilitate flattening.

import json
import pandas as pd
# this is the downloaded JSON file from NVD 
# from a Python perspective, it's a list of 
# nested dictionaries and challenging to 
# make sense of  without some manipulation of the nesting
your_path = 'd:/projects/scratch/'
nvd_file = 'nvdcve-1.1-modified.json'
with open(your_path + nvd_file) as json_file:
    data = json.load(json_file)

# get rid of the first few lines of 
# meta information and create a list 
# of dictionaries keyed by cve information
cves = data['CVE_Items']
cves[0]
we still see nested dictionaries at this point, but the data is becoming readable
# notice the actual CVE # is buried 
# in nested dictionary as CVE_data_meta:ID
print(cves[0]['cve']['CVE_data_meta']['ID'])


CVE-1999-0143
# extract all the numbers to build 
# as a column in a dataframe
# simplest is to use a list comprehension for this
cve_number_list = [ sub['cve']['CVE_data_meta']['ID'] for sub in cves ] 
df = pd.DataFrame(cves,index=cve_number_list)
# we would like a column that is CVE numbers rather than an index
df = df.rename_axis('cve_number').reset_index()
df
a nicely formatted dataframe, though there is still nested data buried inside

Now we have a nicely formatted Dataframe that allows for searching: here is an example of searching for CVE-2020-7980:

df.loc[df['cve_number'] == 'CVE-2020-7980']
it is simple to query for CVE numbers

Next time we will look at scoring via the vector that is nested in the 'impact' column, as well as demonstrate how to do lookups on descriptions and scores.

JSON for Security Specialists

JavaScript Object Notation began as a means to transport data from server to browser in JavaScript, but has since become an independent standard and a predominant method used by RESTful Web Services such as the VirusTotal API to deliver payload data. When you query a RESTful API you are likely to get back either JSON or XML, and of the two JSON is simpler and lighter, and thus more common.

Most IT systems and security specialists have heard of JSON and have probably seen examples, but many have never actively played with it programmatically. It's a useful tool in the InfoSec tool bag and an essential one for extracting, manipulating, and presenting data gleaned from APIs.

Referencing the HashChecker object from the 4th Python/Splunk series blog entry, we see that VirusTotal returns JSON. It’s easy to interrogate the JSON object and experiment with processing it using Jupyter. First, here is an example of how to check that the transaction was successful:

Successful ‘200’ ok response code

While it’s straightforward to retrieve and view all of the nested JSON from the query, it can be a bit of a jumble to organize and make sense of:


JSON object with no formatting

Using Python’s json module, it’s simple to improve the readability of the output and determine the logical structure of the data. The json.dumps method will dump out a string that can be indented for readability with the indent parameter:

Nicer output with json module’s ‘dump’ method
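The screenshot above shows the idea; here is a minimal, self-contained sketch of the same technique, using a hypothetical stand-in for the VirusTotal response dictionary:

```python
import json

# Hypothetical stand-in for the parsed VirusTotal response
jdata_dict = {
    "response_code": 1,
    "positives": 0,
    "scans": {"AVG": {"detected": False, "result": None}},
}

# json.dumps returns a string; the indent parameter exposes the nesting
pretty = json.dumps(jdata_dict, indent=4)
print(pretty)
```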

With this it's possible to instantly see the logical structure of the response and spot a way to quickly check the results for positive hits. As noted, checking the type of the object that json.dumps() produces shows it to be a string; generically, JSON is a string. When you ingest JSON via the Requests object's json() method, however, you get back a dictionary.

print(type(jdata))
print(type(hash_results.get_results().json()))
Output of this:
<class 'str'>
<class 'dict'>

Since JSON is actually a string composed of curly braces, commas, and brackets (similar in appearance to a Python dictionary), it's easy to see how Python can ingest a nested JSON object as a dictionary object. Understanding the type makes it obvious how to interrogate the object. Returning to the positive-hits question, let's write the code to look at it:

Checking for a positive result on the hash submitted to VirusTotal

For the sake of demonstration, suppose we want to know whether the AVG service detected a positive result for the hash. Here is a one-liner to achieve that by querying for the 'AVG' key in the nested dictionary under the 'scans' level:
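A sketch of such a one-liner, using a hypothetical 'scans' dictionary in place of the live response (in practice the lookup would run against the dictionary returned by the VirusTotal query):

```python
# Hypothetical stand-in for the 'scans' level of the VirusTotal response
scans = {
    "AVG": {"detected": False, "result": None},
    "ClamAV": {"detected": False, "result": None},
}

# One-liner: did the AVG engine flag the submitted hash?
avg_detected = scans["AVG"]["detected"]
print(avg_detected)  # False
```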

Understanding JSON more deeply, and recognizing that the task of ingesting VT data into a DataFrame is essentially ingesting a dictionary into a DataFrame, we can formulate more elegant code for the set_dataframe method. Recall that this was previously an inelegant 'slash and burn' exercise in forcing a glob of data into the desired row/column format.

With improved comprehension of the logical structures involved, it's now possible to rewrite the set_dataframe method in the HashChecker class without using a 'for' loop or transpositions. This is achieved in only a couple of lines; the new DataFrame is called jdf:

jdf = pd.DataFrame.from_dict(hash_results.get_results().json()['scans'], orient='index').reset_index()
jdf.rename(columns={'index':'service'},inplace=True)

The reset_index and rename methods are used so as not to end up with the service name as the index. The resulting DataFrame is now logically organized and ready to use.
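Here is a self-contained sketch of those same two lines run against a hypothetical stand-in for the 'scans' dictionary (the engine names and fields are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for hash_results.get_results().json()['scans']
scans = {
    "AVG": {"detected": False, "result": None},
    "ClamAV": {"detected": False, "result": None},
}

# orient='index' makes each scanning service one row; reset_index keeps the
# service name as a regular column rather than leaving it as the index
jdf = pd.DataFrame.from_dict(scans, orient="index").reset_index()
jdf.rename(columns={"index": "service"}, inplace=True)
print(jdf)
```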

Printing out the new DataFrame

The Jupyter Notebook used in this entry is available for download on github.

About Ray Zupancic