Python for Cybersecurity Data Analysis

Buckle up, because we’re about to take a wild ride through the land of logs, anomalies, and suspicious behavior.
To set the stage: what is Python for cybersecurity data analysis? Well, it’s basically using Python to analyze all that juicy data you collect during your security investigations. You know, the stuff that makes your eyes glaze over and your brain hurt. But with Python, we can turn those logs into actionable insights!
So how do we get started? Let’s say you have a comma-separated log file from your firewall. A script like this one reads it and prints each entry in a readable form:

# This script reads a firewall log file and prints each entry in a readable form.

# First, we import the datetime library for working with date and time data.
import datetime

# Next, we define a function to convert the timestamp in the log file to a human-readable format.
def convert_timestamp(timestamp):
    # We use the datetime library to convert the timestamp to a datetime object.
    date_time = datetime.datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S.%fZ')
    # Then, we format it with the datetime object's own strftime method.
    # (time.strftime expects a struct_time and raises a TypeError when given a datetime.)
    formatted_time = date_time.strftime('%m/%d/%Y, %H:%M:%S')
    # Finally, we return the formatted time string.
    return formatted_time

# Now, we open the log file and read each line.
with open('firewall_log.txt', 'r') as log_file:
    for line in log_file:
        # We strip the trailing newline (so it does not end up in the domain) and skip blank lines.
        line = line.strip()
        if not line:
            continue
        # We split each line by the comma delimiter and assign the values to variables.
        # This assumes exactly eight fields per line; any other count raises a ValueError.
        timestamp, source_ip, destination_ip, protocol, port, action, record_type, domain = line.split(',')
        # We call the convert_timestamp function to convert the timestamp to a human-readable format.
        formatted_time = convert_timestamp(timestamp)
        # Then, we print out the relevant information in a user-friendly format.
        print(f'At {formatted_time}, the firewall detected a {protocol} connection from {source_ip} to {destination_ip} on port {port}.')
        print(f'The action taken was {action} and the record type was {record_type}.')
        print(f'The domain queried was {domain}.')
        print('---') # This is just for visual separation between each log entry.

Suppose one entry in the log tells us that at 3:37 PM on June 14th, a device with the IP address 10.0.0.1 sent a DNS query for example.com to a server at 192.168.1.100. Now, let’s say we want to analyze the whole log file and see whether any suspicious queries are being made. We can use Python to read in the log file, parse out the relevant information, and then perform some analysis on it!
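To make the parsing step concrete, here is a minimal sketch that turns a single entry into a dictionary. The sample line is hypothetical: it follows the eight-column layout of the script above, the protocol, action, and record type values are made up for illustration, and the year is arbitrary. The names `SAMPLE`, `FIELDS`, and `parse_entry` are likewise just illustrative.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical log entry, not taken from a real firewall; the year is arbitrary.
SAMPLE = "2023-06-14T15:37:00.000000Z,10.0.0.1,192.168.1.100,UDP,53,ALLOW,A,example.com\n"

# Assumed column order, matching the script above.
FIELDS = ["timestamp", "source_ip", "destination_ip", "protocol",
          "port", "action", "record_type", "domain"]

def parse_entry(line):
    """Turn one comma-separated log line into a dict with typed fields."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    entry = dict(zip(FIELDS, (value.strip() for value in row)))
    # The trailing 'Z' marks UTC, so attach the UTC timezone explicitly.
    entry["timestamp"] = datetime.strptime(
        entry["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    entry["port"] = int(entry["port"])
    return entry

entry = parse_entry(SAMPLE)
print(f"{entry['timestamp']:%Y-%m-%d %H:%M} UTC: {entry['source_ip']} -> "
      f"{entry['destination_ip']} asked for {entry['domain']}")
```

Working with a dictionary instead of eight loose variables makes it easier to filter or group entries later, and the length check turns a malformed line into a clear error rather than a confusing unpacking failure.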
Here’s an example script that reads in a CSV-formatted log file (you can convert your logs into CSV format using tools like `awk` or `sed`) and prints out any DNS queries to suspicious domains:

# Import necessary libraries
import csv # Import csv library to read in the log file
from datetime import datetime, timezone # Import datetime library to handle timestamps

# Define the list of suspicious domains
suspicious_domains = ['example.com', 'evil-domain.org']

# Set up a dictionary to store our results, keyed by source IP address
results = {}

# Read in the log file and parse out the relevant information
with open('logfile.csv', newline='') as f: # Open the log file; newline='' is what the csv module recommends
    reader = csv.reader(f) # Create a csv reader object to read the file
    for row in reader: # Loop through each row in the file
        # Assumed column layout: timestamp, source IP, destination IP, protocol, action, query type, domain
        # Skip malformed rows and anything that is not a DNS query
        if len(row) != 7 or row[4] != 'DNS_QUERY': # The 5th element in the row holds the action
            continue
        ts, ip1, ip2, protocol, action, qtype, domain = row # Assign each element in the row to a variable
        # Parse the timestamp; this assumes the same format as the firewall log above, where the trailing Z means UTC
        ts = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S.%fZ').replace(tzinfo=timezone.utc)
        domain = domain.strip().lower() # Domain names are case-insensitive
        # Check if the domain is a listed domain or a subdomain of one
        if any(domain == d or domain.endswith('.' + d) for d in suspicious_domains):
            # Add the timestamp and domain to the results dictionary for the corresponding IP address
            results.setdefault(ip1, []).append(f'{ts:%Y-%m-%d %H:%M:%S} {domain}')

# Print out the results for each IP address that made a suspicious query
for ip, queries in results.items(): # Loop through each key-value pair in the results dictionary
    print(f"IP: {ip}\nQueries:") # Print the IP address
    for q in queries: # Loop through each query for the IP address
        print(q) # Print the query

This script reads in our log file (assuming it’s named `logfile.csv`) and parses out the relevant information using a CSV reader. For each row, it checks whether the entry is a DNS query, parses the timestamp, and adds any query for a suspicious domain to a dictionary of results keyed by source IP address. Finally, it prints out the results for each IP that made a suspicious query!
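The domain check is the part most worth getting right, because a plain substring comparison can flag unrelated names (or miss subdomains entirely). Here is one possible way to write it as a small helper. The function name `is_suspicious` and the test domains are hypothetical, and exact-or-subdomain matching is an assumption about what "suspicious" should cover, not a rule from any particular tool.

```python
def is_suspicious(domain, suspicious_domains):
    """Return True if domain equals a listed domain or is a subdomain of one."""
    # DNS names are case-insensitive and may carry a trailing dot.
    name = domain.strip().lower().rstrip(".")
    return any(name == bad or name.endswith("." + bad) for bad in suspicious_domains)

suspicious_domains = ["example.com", "evil-domain.org"]

# Hypothetical names: the first two should match, the last two should not.
for candidate in ["example.com", "cdn.evil-domain.org",
                  "notexample.com", "example.com.safe.net"]:
    print(candidate, is_suspicious(candidate, suspicious_domains))
```

Requiring a leading dot before the listed domain is what keeps `notexample.com` from matching `example.com`, and anchoring the comparison at the end of the name is what keeps `example.com.safe.net` out.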
With just a few lines of code, we can turn our logs into insights and identify potential security threats. And best of all, we don’t need to be data scientists or machine learning experts to do it!
