Random Forest Prediction of Molecule Ground State Energies

Hey there! Today, I’ll be going over a quick project to help jump-start your Machine Learning journey! We’ll mainly be using scikit-learn, a powerful Python library for predictive data analysis.

First off, let’s take a look at the dataset we’ll be using. In Machine Learning, by far the most important part of any model is data engineering: the process of gathering and preprocessing high-quality, quantitative data. I found this dataset here, on Kaggle, a popular, top-tier source for datasets to work with.

It’s great practice to read the description of a dataset to find relevant information about it. Applied to this dataset, we learn that it contains the Ground State Energies of 16,242 molecules, with each molecule described by 1277 values that factor into its Ground State Energy. What does this mean for our project goal? It means that we are given 16,242 unique molecules, and by training on the 1277 attributes of each molecule, we can generate a model that predicts the Ground State Energy of other molecules given their molecular features.

Let’s write some code, shall we?

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import numpy
import pandas as pd
import pickle

# read in dataset
df = pd.read_csv(r"C:\Users\josmo\Downloads\GroundStateEnergyDataset\roboBohr.csv")

This snippet of code is quite self-explanatory: we import the necessary libraries and methods, then read the dataset into a pandas DataFrame, a data structure consisting of rows and columns, much like a spreadsheet.
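As a quick sanity check (optional, and not part of the original snippet), you can peek at the DataFrame’s shape and first few rows to confirm that the row and column counts match the dataset description:

# optional: confirm what we just loaded
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows of the DataFrame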

df.pop("pubchem_id")
df.pop("Unnamed: 0")

Two columns are removed from the DataFrame: pubchem_id, which gives the ID of each molecule in PubChem, the world’s leading molecule database, and Unnamed: 0, a leftover index column from when the CSV was written, which carries no chemical information.
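If you prefer a single call, dropping both columns at once with DataFrame.drop is equivalent (purely a stylistic alternative to the two pop() calls):

# equivalent to the two pop() calls above
df = df.drop(columns=["pubchem_id", "Unnamed: 0"])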

# initialize feature scaler
scaler = MinMaxScaler(feature_range=(-1, 1))

# scale each column to the range [-1, 1]
for column in df.columns:
    scaler.fit(df[column].values.reshape(-1, 1))
    df[column] = scaler.transform(df[column].values.reshape(-1, 1))

In Machine Learning, a common preprocessing step is “feature scaling”: rescaling each feature so that all of them share a comparable range, which keeps features with large raw magnitudes from dominating the model simply because of their units. -1 → 1 is a common range. (Strictly speaking, tree-based models like Random Forests are largely insensitive to feature scaling, but it’s a good habit and matters a lot for distance- and gradient-based models.) We fit the scaler on every value in a column, then reassign those values in the column, repeating this process for each column in the DataFrame.
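The same result can be achieved in one shot, since MinMaxScaler scales each column independently when handed the whole DataFrame; a minimal sketch of that alternative:

# column-wise scaling in a single call (same effect as the loop above)
df[df.columns] = MinMaxScaler(feature_range=(-1, 1)).fit_transform(df)

In a stricter workflow you would fit the scaler on the training split only, to avoid leaking test-set statistics, but for this walkthrough we follow the simpler approach above.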

# shuffle dataset
df = df.sample(frac=1)

y = df.pop('Eat')

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)

The dataset is shuffled using .sample(), a pandas DataFrame method that returns a random sample of rows; the parameter frac=1 means we want all of the rows back, just in random order. Following this, we pop off the ‘Eat’ column and assign it to a variable named y. This serves as the target column; in other words, the output we want to predict from our input data. We then split the data into training features, testing features, training targets, and testing targets, with 80% of the data assigned to training and 20% to testing (usually, 20% is a great baseline).
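One optional tweak, not in the original code: passing a fixed random_state to both the shuffle and the split makes the run reproducible, which helps when you want to compare model tweaks fairly:

# reproducible shuffle and split (42 is an arbitrary seed)
df = df.sample(frac=1, random_state=42)
y = df.pop('Eat')
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)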

model = RandomForestRegressor(n_estimators=25)
model.fit(X_train, y_train)

# for a regressor, .score() returns the R^2 coefficient of determination
r2 = model.score(X_test, y_test)
print(f'R^2 on test set: {r2}')

with open('model.pkl', 'wb') as files:
    pickle.dump(model, files)

We define our model as a Random Forest with 25 estimators^, then train it on our training set. We then score it on the test set; for a regressor, .score() reports the R² coefficient of determination (how much of the variance in the targets the model explains) rather than a classification accuracy. Finally, we save the model to a .pkl file (the pickle file extension) so that we can load it again in the future if need be.
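Loading the saved model back later is just the reverse operation; a minimal sketch:

# reload the trained model and predict on already-scaled feature rows
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

predictions = loaded_model.predict(X_test)   # re-using the test features here as an example
print(predictions[:5])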

^Estimators are decision trees used in the Random Forest algorithm. I will write an article on the Random Forest algorithm soon, so stay posted 😛
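If you’re curious in the meantime, scikit-learn exposes the fitted trees on the model itself:

# each element of estimators_ is one fitted DecisionTreeRegressor
print(len(model.estimators_))            # 25, matching n_estimators
print(model.estimators_[0].get_depth())  # depth of the first tree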