MathFeature: Extracting Mathematical Features from DNA, RNA, and Protein Sequences
In this tutorial, we’ll introduce you to MathFeature, a Python package that allows us to extract mathematical features from biological sequences such as DNA, RNA, and protein. These features can help us identify patterns and trends within the data that traditional methods may overlook. Let’s dive in!
First, why do we need math in biology? While it might seem like an unusual combination at first glance, many mathematical tools can be applied to biological sequences. For example:
– Fourier transform (FT) is a tool used to analyze periodic signals in time-series data. Because FT operates on numbers, each DNA sequence is first mapped to a numerical representation. By applying FT to that representation, we can detect repeating motifs or patterns that could be indicative of regulatory elements or functional domains.
# Import necessary libraries
import numpy as np # Used to stack the per-sequence vectors into a 2D array
import pandas as pd # Used to read the input file into a DataFrame
import mathfeature as mf # Interface assumed by this tutorial; verify mf.AAC and mf.FourierTransform against your MathFeature installation
from sklearn.preprocessing import StandardScaler # Import StandardScaler for scaling numerical vectors
# Load data
df = pd.read_csv('dna_sequences.txt', sep='\t') # Read the tab-separated DNA sequences into a DataFrame called 'df' (expects a 'seq' column)
# Preprocess data (optional)
df['seq'] = df['seq'].apply(lambda x: list(x)) # Convert sequences to lists of nucleotides for easier manipulation
df['num_vector'] = mf.AAC(df['seq']) # Extract one fixed-length numerical vector per sequence. AAC stands for Amino Acid Composition, a protein descriptor; for DNA, use a nucleotide-based descriptor instead
scaler = StandardScaler() # Create an instance of the StandardScaler class
features = np.vstack(df['num_vector'].to_numpy()) # StandardScaler expects a 2D array of shape (n_sequences, n_features)
df['std_num_vector'] = list(scaler.fit_transform(features)) # Scale each feature to zero mean and unit variance, then store one row per sequence
# Apply Fourier transform (FT)
ft_results = mf.FourierTransform(n=1024, window='hann')(df['std_num_vector']) # Apply the FT to the standardized vectors using a window size of 1024 and a Hann window function
That’s it! The results of your FT analysis for each DNA sequence in your dataset are now stored in 'ft_results'.
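If you want to see what the Fourier step does without depending on any particular MathFeature interface, here is a minimal NumPy sketch. It assumes a simple binary-indicator mapping (one 0/1 signal per nucleotide, often called the Voss representation) and sums the four power spectra. The function name voss_power_spectrum and the toy sequence are made up for illustration and are not part of MathFeature.

```python
import numpy as np


def voss_power_spectrum(seq):
    """Sum the power spectra of the four binary indicator signals of a DNA string.

    Hypothetical helper for illustration; not a MathFeature function.
    """
    seq = seq.upper()
    total = np.zeros(len(seq) // 2 + 1)
    for base in "ACGT":
        # Indicator signal: 1.0 where the base occurs, 0.0 elsewhere
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        # rfft returns the non-negative frequency bins of a real signal
        spectrum = np.fft.rfft(indicator)
        total += np.abs(spectrum) ** 2
    return total


# Toy input with a strict period of 3 nucleotides
toy = "ATG" * 40
power = voss_power_spectrum(toy)

# Skip bin 0 (the mean of the signal) when searching for the dominant frequency
peak_bin = int(np.argmax(power[1:])) + 1
period = len(toy) / peak_bin
```

In this toy case the dominant bin corresponds to a period of 3, which simply reflects the repeating 'ATG'. The same period-3 peak is the signature people commonly look for in protein-coding DNA.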
MathFeature also allows you to extract features based on other mathematical tools, such as chaos game representation (CGR) and entropy, among others. Let’s look at how these techniques work:
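– Chaos game representation (CGR): This technique converts a sequence into a set of points by applying a simple iterative rule borrowed from chaos theory. In the classic formulation for DNA, each nucleotide is assigned a corner of a square, and each new point is placed halfway between the previous point and the corner of the current nucleotide; variants extend the idea to protein sequences. The density of the resulting point cloud reflects how often particular subsequences occur, so unusual regions of the plot may point to functional or structural importance. Here’s an example code snippet: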
# Import the numpy library for creating arrays and performing mathematical operations
import numpy as np

# Load protein sequence data (in FASTA format) from a file called 'proteins.fa'
sequences = []
current = []
with open('proteins.fa', 'r') as f:
    # Loop through each line in the file
    for line in f:
        line = line.strip()
        if line.startswith('>'):
            # A header line starts a new record, so store the previous sequence (if any)
            if current:
                sequences.append(''.join(current))
            current = []
        elif line:
            # A sequence may span several lines, so collect the pieces
            current.append(line.upper())
# Store the last record in the file
if current:
    sequences.append(''.join(current))

# Convert each 3-residue window of every protein sequence into a point
# Note: this is a simplified window encoding (letter counts per window), not the iterative CGR construction
points = []
# Loop through each sequence in the list
for seq in sequences:
    # Loop through every start position that leaves room for a full 3-letter window
    for j in range(len(seq) - 2):
        # Get the 3-letter code starting at the current position
        code = seq[j:j+3]
        # Create a 128-dimensional vector filled with zeros
        point = np.zeros(128, dtype=int)
        # Loop through the 3 characters in the window
        for residue in code:
            # Convert the letter to an index (A=0, B=1, ...)
            index = ord(residue) - 65
            # Skip characters that fall outside the vector (e.g. '*' or '-')
            if 0 <= index < 128:
                # Add 1 to the corresponding index in the point vector
                point[index] += 1
        # Add the point to the list of points
        points.append(point)
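For comparison, here is a minimal sketch of the classic two-dimensional CGR for DNA, where each point is the midpoint between the previous point and the corner assigned to the current nucleotide. The corner assignment below is one common convention and is an assumption here, as is the helper name cgr_points; neither is taken from MathFeature.

```python
import numpy as np

# Assumed corner convention for the unit square (other layouts are also used)
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq):
    """Return an (n, 2) array of CGR coordinates for a DNA string.

    Hypothetical helper for illustration; not a MathFeature function.
    """
    point = np.array([0.5, 0.5])  # Start in the center of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # Skip ambiguous symbols such as 'N'
        # Move halfway toward the corner of the current nucleotide
        point = (point + np.array(CORNERS[base])) / 2.0
        points.append(point)
    return np.array(points).reshape(-1, 2)


pts = cgr_points("ACGT")
```

Plotting the points for a long sequence (for example with a scatter plot) gives the familiar CGR image, in which each sub-square corresponds to sequences ending in a particular short subsequence.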
– Entropy: This measure of randomness or uncertainty can be used to identify regions in DNA sequences that are more variable in composition than others. Low values mark repetitive or low-complexity stretches, while high values mark compositionally diverse ones, which may indicate regulatory elements or evolutionary adaptation. Here’s how you could calculate the entropy for a given sequence using MathFeature:
# Import the Entropy function from the MathFeature library
# (interface assumed by this tutorial: a callable that takes a sequence string and returns one value)
from mathfeature import Entropy

# Load DNA sequence data (in FASTA format)
sequences = []
current = []
with open('dna_sequences.fa', 'r') as f:
    # Loop through each line in the file
    for line in f:
        line = line.strip()
        if line.startswith('>'):
            # A header line starts a new record, so store the previous sequence (if any)
            if current:
                sequences.append(''.join(current))
            current = []
        elif line:
            # A sequence may span several lines, so collect the pieces
            current.append(line.upper())
# Store the last record in the file
if current:
    sequences.append(''.join(current))

# Create an empty list to store the entropy values for each window
entropies = []
# Loop through each sequence in the list
for seq in sequences:
    # Slide a 50 bp window along the sequence in steps of 25 bp (25 bp overlap)
    for start in range(0, len(seq) - 49, 25):
        # Calculate the entropy for the current window using the Entropy function
        entropy = Entropy(seq[start:start+50])
        # Add the entropy value to the list
        entropies.append(entropy)
# The final list 'entropies' will contain the entropy values for each window in each sequence.
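To make the calculation itself explicit, here is a small standard-library sketch of Shannon entropy over sliding windows. It computes H = -sum(p * log2(p)) over the symbol frequencies p in each window. The helper names shannon_entropy and windowed_entropy are illustrative and not part of MathFeature, and the window size and step simply mirror the values used above.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the symbol frequencies in a string."""
    counts = Counter(window)
    total = len(window)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def windowed_entropy(seq, size=50, step=25):
    """Entropy of each full window of length `size`, moving `step` symbols at a time."""
    return [
        shannon_entropy(seq[start:start + size])
        for start in range(0, len(seq) - size + 1, step)
    ]


# Toy input: a homopolymer run followed by a compositionally balanced region
toy = "A" * 50 + "ACGT" * 25
values = windowed_entropy(toy, size=50, step=25)
```

For a four-letter DNA alphabet the entropy of a window ranges from 0 bits (a single repeated nucleotide) to 2 bits (all four nucleotides equally frequent), so the toy input produces low values at the start and values close to 2 at the end.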
These are just a few examples of how you can use MathFeature to extract mathematical features from biological sequences. The package provides many more tools and techniques for analyzing DNA, RNA, and protein data using mathematical frameworks such as Fourier analysis, chaos theory, and information theory, among others.