In this tutorial, we’ll discuss how MathFeature, a Python package for extracting numerical information from biological sequences, can help predict promoter regions using machine learning techniques. The tool provides 37 mathematical descriptors for analyzing the structure, composition, or behavior of DNA, RNA, and protein sequences. Let’s take a closer look at the package and see how it works for our specific use case.
First, you need your sequence data in FASTA format, the standard plain-text format for bioinformatics tools. If it is new to you, plenty of online resources explain how to convert sequences into it. Once your sequences are ready, open a Python script or notebook (we recommend Jupyter Notebook for its ease of use) and import MathFeature using the following code:
# Import the mathfeature module and alias it as mf
# (the module name and mf.convert are used as given in this tutorial;
# verify them against your installed version)
import mathfeature as mf


def convert_sequences(sequences):
    """Convert each sequence with mf.convert and return the results as a list."""
    return [mf.convert(sequence) for sequence in sequences]


# Placeholder DNA sequences to be converted (replace with your own data)
sequences = ["ATGCGTAC", "TTGACAGC", "TATAATGC"]
# Convert the sequences and print the result
converted_sequences = convert_sequences(sequences)
print(converted_sequences)
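Since everything starts from FASTA input, it may help to see how little is needed to read that format. The following is a minimal sketch using only the standard library, not part of MathFeature; the function name `read_fasta` and the two example records are hypothetical.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; collect and join them
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example held in memory instead of a file
example = StringIO(">seq1 promoter\nATGCGT\nACGT\n>seq2 non-promoter\nTTGACA\n")
records = list(read_fasta(example))
print(records)
```

To read a real file, pass an open file object instead of the `StringIO` buffer, for example `open("sequences.fasta")` (the file name is a placeholder).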
Now that the package is loaded, let’s look at an example of how to use it. For this tutorial, we’ll work with the promoter-prediction dataset provided by [69], which contains 741 positive samples (promoters) and 1400 negative samples (non-promoters).
To start, let’s load the data into our Python script:
# Import necessary libraries
import pandas as pd  # data manipulation
from sklearn.model_selection import train_test_split  # train/test splitting
# Load the tab-separated dataset into a pandas DataFrame
df = pd.read_csv('dataset.txt', sep='\t')
X = df['sequence'].values  # input sequences
y = df['label'].values  # target labels (promoter / non-promoter)
# Split the data into 80% training and 20% testing sets;
# stratify=y keeps the promoter/non-promoter ratio the same in both sets,
# and random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
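The loading step above expects a tab-separated file with `sequence` and `label` columns. If your data only exists as lists of sequences, a file in that layout could be produced roughly as follows. This is a sketch under assumptions: the example sequences are made up, and the coding 1 = promoter, 0 = non-promoter is a choice made here, not something the dataset prescribes.

```python
import csv
import io


def write_dataset(rows, handle):
    """Write (sequence, label) pairs as tab-separated text with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sequence", "label"])
    writer.writerows(rows)


# Hypothetical sequences; 1 = promoter, 0 = non-promoter (assumed label coding)
promoters = ["TATAATGCGT", "TTGACAGCTA"]
non_promoters = ["GGCCGGCCAA"]
rows = [(s, 1) for s in promoters] + [(s, 0) for s in non_promoters]

# An in-memory buffer keeps the example self-contained; to write a real file,
# use open("dataset.txt", "w", newline="") instead
buffer = io.StringIO()
write_dataset(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```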
Next, let’s extract the numerical features using MathFeature:
# Import the CGR function from the mathfeature module
# (the name and return type are as given in this tutorial; verify them against your installed version)
from mathfeature import CGR
# Keep X_train and X_test as 1D arrays of sequence strings so that each x passed to CGR is a single sequence
# Extract the CGR feature for the training set and store it in a list
CGR_train = [CGR(x)[0] for x in X_train]  # CGR is assumed to return a tuple; [0] takes its first element
# Extract the CGR feature for the testing set in the same way
CGR_test = [CGR(x)[0] for x in X_test]
CGR stands for Chaos Game Representation, a descriptor that maps a nucleotide sequence onto points in the unit square: each nucleotide is assigned a corner, and every base in the sequence moves the current point halfway toward its corner. The resulting point pattern reflects both the composition of the sequence and the order of its bases, and numerical features derived from it can be fed to a classifier.
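In the sequence-analysis literature, CGR usually refers to the Chaos Game Representation, in which each base moves a point halfway toward the corner of the unit square assigned to that base. The sketch below illustrates the idea in plain Python. It is not MathFeature’s implementation: the corner assignment is one common convention chosen here as an assumption, and `cgr_points` is a hypothetical helper.

```python
# Corner assignment: one common convention, used here as an assumption
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the chaos-game trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


points = cgr_points("ACGT")
print(points)
# → [(0.25, 0.25), (0.125, 0.625), (0.5625, 0.8125), (0.78125, 0.40625)]
```

Because each point depends on all preceding bases, the trajectory encodes the order of the sequence as well as its composition. Turning the points into a fixed-length feature vector (for example, by counting points per grid cell) is a separate step not shown here.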
Now that we have our features extracted, let’s train a random forest classifier using scikit-learn:
# Import the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# Train a Random Forest with 100 estimators on the extracted CGR features
# (this assumes each entry of CGR_train is a fixed-length numeric vector)
model = RandomForestClassifier(n_estimators=100)
model.fit(CGR_train, y_train)
# Use the trained model to predict the labels for the testing data
y_pred = model.predict(CGR_test)
# Calculate the accuracy of the model on the testing data
accuracy = model.score(CGR_test, y_test)
# Print the accuracy of the model
print("Accuracy: ", accuracy)
In this example, we use a random forest classifier with 100 trees to predict whether a given sequence is a promoter, based on the CGR feature extracted with MathFeature. We train the model on the training data and evaluate it on the testing data. The test accuracy indicates how well the model generalizes to new, unseen sequences. Because the dataset has 1400 negative and 741 positive samples, a model that always predicts non-promoter would already score about 65%, so accuracy should be read against that baseline.
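Accuracy alone can be misleading when the classes are unbalanced, as they are here (741 promoters versus 1400 non-promoters). As a rough illustration, the sketch below computes the majority-class baseline together with precision, recall, and F1 for the promoter class in plain Python. The label lists are made-up toy data, the coding 1 = promoter is an assumption, and `binary_metrics` is a hypothetical helper rather than a scikit-learn or MathFeature function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class.

    Expects two non-empty label sequences of equal length.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Accuracy of always predicting the majority class (non-promoter)
baseline = 1400 / (741 + 1400)
print(f"Majority-class baseline: {baseline:.3f}")

# Toy labels for illustration only; 1 = promoter, 0 = non-promoter (assumed coding)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In the toy data the accuracy is 0.625, yet only one of the three promoters is recovered, which is the kind of gap that precision and recall make visible.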