Maybe we should not predict the years of Abalone — Part 1

Linear Regression, plus Classification with Logistic Regression and a Decision Tree

Onur İnan Pektaş
5 min read · Feb 5, 2022
Source: https://archive.ics.uci.edu/ml/datasets/abalone

We start by downloading the ‘Abalone Data Set’ from Kaggle in CSV format. The original data set, in plain-text format, and its description can be found in the UCI (University of California, Irvine) Machine Learning Repository. Now, we can code.

Here are the libraries that we use.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

First, we load the data with pandas to process it.

pd.set_option("display.max_column", 10)
pd.set_option("display.width", 1000)
# Your local address that holds the data
df = pd.read_excel("C:/.../abalone.xlsx")
# I converted the csv formatted data into xlsx format to actually see the data first, before coding.
# You can also type...
# df = pd.read_csv("C:/.../abalone.csv")
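
If you prefer to work from the original UCI file (abalone.data), note that it ships without a header row, so you have to supply the column names yourself. A minimal sketch, with the names taken from the UCI description page:

# The original UCI file has no header row, so we name the columns ourselves
column_names = ["Sex", "Length", "Diameter", "Height", "Whole weight",
                "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
df = pd.read_csv("C:/.../abalone.data", names=column_names)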

Let’s explore the data…

# First 5 rows with the titles of columns
print(df.head())
print()
# Number of rows, columns
print(df.shape)
print()
# Column properties (name, non-null count, dtype); df.info() prints its report
# itself, so wrapping it in print() would just add a trailing 'None'
df.info()
print()
# Some statistics about the columns
print(df.describe())
print()
# Columns with number of unique values
print(df.nunique())
print()

And the output looks like this. In short, the data set has 4,177 rows and 9 columns: Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight, and Rings.

We check the correlation matrix to see how the variables relate to each other.

plt.figure(figsize=(10,8))
# numeric_only=True skips the categorical 'Sex' column
# (required on pandas >= 2.0; drop the argument on older versions)
sns.heatmap(df.corr(numeric_only=True), cmap='Greens', annot=True)
plt.title("Correlation between the Variables")
plt.show()
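
Since we mostly care about how each feature relates to the target, it can also help to print those correlations directly; a short sketch:

# Correlation of each numeric feature with 'Rings', strongest first
print(df.corr(numeric_only=True)["Rings"].sort_values(ascending=False))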

Now comes the preprocessing we need to do. In this part, we convert the ‘Sex’ column, which is categorical, into discrete numeric values, the type that a machine can understand. For details, you can check the video.

# One-hot encode 'Sex' into F / I / M indicator columns, then drop the original
sex = pd.get_dummies(df["Sex"])
df = pd.concat([df, sex], axis=1)
df = df.drop("Sex", axis=1)
print(df.head())

And the new data looks like this…
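
One optional refinement, since we feed these dummies into a linear model later: the three columns F, I, and M always sum to one, so they are perfectly collinear (the “dummy variable trap”). If you want to avoid that, you can replace the encoding line above with:

# Drop the first dummy column so the remaining two are not perfectly collinear
sex = pd.get_dummies(df["Sex"], drop_first=True)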

As you can guess from the title of this article, we predict the age of abalone. In this data set, the ‘Rings’ column, which indicates the age (much like the rings of a tree), becomes the target variable (the objective field, the dependent variable y). Then, we plot ‘Rings’.

# Split the objective field y (dependent target variable) from the predicting (independent) variables Xs
y = df["Rings"].copy()
X = df.drop("Rings", axis=1).copy()

# Plot y
sns.displot(y)
plt.show()

The distribution does not look balanced, but let’s move on.
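
To put numbers on that imbalance, we can count how many samples each ring value has:

# Count the samples per ring value to quantify the imbalance
print(y.value_counts().sort_index())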

We split the data set into 70% training and 30% test for the modelling. Then, we scale the independent variables so that no feature dominates just because of its units. We choose ‘StandardScaler’ as the scaler, and turn the scaled arrays back into data frames.

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)

# Scale Xs
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)
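
A quick sanity check: after standardization, each training column should have a mean of about 0 and a standard deviation of about 1 (the test set will only come close, since the scaler was fitted on the training set):

# Means ~0 and standard deviations ~1 confirm that the scaling worked
print(X_train.mean().round(2))
print(X_train.std().round(2))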

First we build a linear regression model to predict ‘Rings’.

# 1) Linear Regression Model
model_1 = LinearRegression()
model_1.fit(X_train, y_train)
print("R2 of the linear regression model is: ", model_1.score(X_test, y_test))

The output is…

The model is not very good, because the R2 is not high enough.
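
R2 on its own is a bit abstract. Since ‘Rings’ is measured in whole rings, the mean absolute error tells us how many rings we are off by on average; a small addition:

from sklearn.metrics import mean_absolute_error

# Average prediction error, expressed in rings
prediction_1 = model_1.predict(X_test)
print("MAE of the linear regression model is:", mean_absolute_error(y_test, prediction_1))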

Now, we jump into classification for the ‘Rings’ variable. First, we build a logistic regression model. After the confusion matrix, we check the classification report, which gives us the precision, recall, f1-score, support, and the accuracy as well. To understand these metrics, you can check the video.

# 2) Logistic Regression Model
# max_iter is raised from the default of 100 so the solver can converge
# with this many target classes
model_2 = LogisticRegression(max_iter=1000)
model_2.fit(X_train, y_train)
prediction_2 = model_2.predict(X_test)

cm = confusion_matrix(y_test, prediction_2)
fig = plt.figure(figsize=(10,8))
ax = sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.title("Confusion Matrix coming from the Logistic Regression Model")
plt.show()

print(classification_report(y_test, prediction_2))

The scores do not look good.

Secondly, we try a decision tree model and check the same outputs as we did for the logistic regression model.

# 3) Decision Tree Model
model_3 = DecisionTreeClassifier(max_depth=4, random_state=42)
model_3.fit(X_train, y_train)
prediction_3 = model_3.predict(X_test)

cm = confusion_matrix(y_test, prediction_3)
fig = plt.figure(figsize=(10,8))
ax = sns.heatmap(cm, square=True, annot=True, cmap='Blues', cbar=False)
plt.title("Confusion Matrix coming from the Decision Tree Model")
plt.show()

print(classification_report(y_test, prediction_3))

Unfortunately, the decision tree model failed, too.
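
Before blaming the models, one useful sanity check is the majority-class baseline: the accuracy we would get by always predicting the most common ring value. If a classifier barely beats this number, it has learned very little:

# Accuracy of always predicting the most frequent ring value in the test set
baseline = y_test.value_counts(normalize=True).max()
print("Majority-class baseline accuracy:", round(baseline, 3))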

In short, we applied 3 models to predict the age of abalone, and all of them failed. Why? There can be several reasons. For one, ‘Rings’ takes 28 distinct values in this data set, so treating it as a classification problem leaves very few examples per class. It is also possible that these independent variables are simply not informative enough to predict the ‘Rings’ variable.

However, we won’t give up! :) Since we know the target variable is imbalanced, we will apply undersampling and oversampling techniques to obtain a balanced dependent variable. See you in Part 2.
