# Maybe we should not predict the years of Abalone — Part 1

## Linear Regression, and Classification with Logistic Regression and a Decision Tree

We start by downloading the ‘Abalone Data Set’ from Kaggle in *csv* format. You can also find the original data set in txt format, together with its description, in the UCI (University of California, Irvine) Machine Learning Repository. Now, we can code.

Here are the libraries that we use.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.tree import DecisionTreeClassifier

First, we load the data with pandas so that we can process it.

    # Show up to 10 columns and use a wide console when printing data frames
    pd.set_option("display.max_columns", 10)
    pd.set_option("display.width", 1000)

    # Your local path that holds the data
    # (read_excel needs an Excel engine such as openpyxl to be installed)
    df = pd.read_excel("C:/.../abalone.xlsx")

    # I converted the csv-formatted data into xlsx format to actually see the data first, before coding.
    # You can also read the csv file directly:
    # df = pd.read_csv("C:/.../abalone.csv")
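If you start from the original UCI text file instead of the Kaggle csv, keep in mind that the raw file has no header row, so the column names must be supplied by hand. The sketch below is only an illustration under stated assumptions: the column names follow the attribute list in the UCI description (the Kaggle file may label them differently), and the three rows inside `raw` are stand-in sample rows in the same layout, not the full data set.

```python
import io

import pandas as pd

# Column names taken from the UCI attribute list (an assumption for this sketch;
# the raw file itself does not contain a header row).
COLUMNS = ["Sex", "Length", "Diameter", "Height", "Whole weight",
           "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

# Stand-in for the raw file: a few comma-separated rows in the same layout.
# Replace `raw` with the path to your own local copy of the file.
raw = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
    "I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7\n"
)

# header=None tells pandas that the first line is data, not column titles
df_raw = pd.read_csv(raw, header=None, names=COLUMNS)

print(df_raw.shape)
print(df_raw.dtypes)
```

With `header=None`, the first line is read as data rather than being consumed as column titles, which is the usual pitfall with headerless files.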

Let’s explore the data…

    # First 5 rows with the titles of columns
    print(df.head())
    print()

    # Number of rows, columns
    print(df.shape)
    print()

    # Properties of the columns (index, name, number of values that are not missing, type)
    # df.info() prints its report itself and returns None, so it is not wrapped in print()
    df.info()
    print()

    # Some statistics about the columns
    print(df.describe())
    print()

    # Columns with number of unique values
    print(df.nunique())
    print()

The output looks like this.

We check the correlation matrix to see the correlations between the variables.

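    plt.figure(figsize=(10, 8))

    # 'Sex' is still a text column at this point, so we restrict the correlation
    # to the numeric columns (newer pandas versions raise an error otherwise)
    sns.heatmap(df.corr(numeric_only=True), cmap='Greens', annot=True)
    plt.title("Correlation between the Variables")
    plt.show()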
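Besides the heatmap, it can help to list the correlations with the target alone, sorted from strongest to weakest. The following is a minimal sketch on a small made-up frame (the values in `toy` are invented for illustration and are not taken from the data set); with the real data you would use `df` instead.

```python
import pandas as pd

# Small made-up frame with the same kind of columns as the abalone data
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M", "F", "I"],
    "Length": [0.45, 0.53, 0.33, 0.44, 0.52, 0.30],
    "Shell weight": [0.15, 0.21, 0.055, 0.155, 0.20, 0.05],
    "Rings": [15, 9, 7, 10, 11, 6],
})

# Correlation of every numeric column with 'Rings', without the trivial
# self-correlation, sorted from the highest to the lowest value
corr_with_rings = (
    toy.corr(numeric_only=True)["Rings"]
    .drop("Rings")
    .sort_values(ascending=False)
)

print(corr_with_rings)
```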

Next comes the preprocessing. The ‘Sex’ column is categorical, so we convert it into numeric indicator (dummy) columns, a form that the models can work with. For details, you can check the video.

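    # One indicator column per category of 'Sex'
    sex = pd.get_dummies(df["Sex"])

    # Append the indicator columns and drop the original text column
    df = pd.concat([df, sex], axis=1)
    df = df.drop("Sex", axis=1)

    print(df.head())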

And the new data look like…
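To make the dummy encoding concrete, here is a minimal sketch on a four-row made-up frame (the frame `toy` and its values are hypothetical). Each category of ‘Sex’ becomes its own column, and every row has a 1 in exactly one of them. The `dtype=int` argument is my addition so that the columns show 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical mini data set: three categories in 'Sex' (M, F, I)
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"], "Rings": [15, 9, 7, 10]})

# One 0/1 column per category; columns come out in sorted order: F, I, M
dummies = pd.get_dummies(toy["Sex"], dtype=int)

# Replace the text column with the indicator columns
encoded = pd.concat([toy.drop("Sex", axis=1), dummies], axis=1)

print(encoded)
```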

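As you can guess from the title of this article, we predict the age of abalone. In this data set, the ‘Rings’ column stands for the age (according to the UCI description, the age in years is the number of rings plus 1.5), so it becomes the target variable (the objective field, the dependent variable *y*). We separate it from the other columns and plot its distribution.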

    # Split the objective field y (dependent target variable) from the predicting (independent) variables Xs
    y = df["Rings"].copy()
    X = df.drop("Rings", axis=1).copy()

    # Plot y
    sns.displot(y)
    plt.show()

The distribution of ‘Rings’ does not look balanced, but let’s move on.
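The imbalance can also be checked in numbers rather than by eye. This is a minimal sketch on a made-up series (`y_toy` is hypothetical, and the threshold of 2 for calling a class "rare" is an arbitrary choice for this example); with the real data you would apply the same calls to `y`.

```python
import pandas as pd

# Made-up target values standing in for the 'Rings' column
y_toy = pd.Series([9, 9, 9, 10, 10, 8, 8, 7, 11, 15, 21], name="Rings")

# How many samples fall into each ring count, ordered by ring count
counts = y_toy.value_counts().sort_index()

# The same information as a share of all samples
share = (counts / counts.sum()).round(3)

summary = pd.DataFrame({"count": counts, "share": share})
print(summary)

# Ring counts with fewer than 2 samples (arbitrary threshold for this sketch)
rare = counts[counts < 2].index.tolist()
print("Rare classes:", rare)
```

Classes with only a handful of samples are the ones a classifier will struggle with most, because it sees almost no examples of them during training.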

We split the data set into a 70% training set and a 30% test set for the modelling. Then we scale the independent variables so that they are all on a comparable scale. We use ‘StandardScaler’ as the scaler, fit it on the training set only, and convert the scaled arrays back into data frames.

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)

    # Scale Xs: fit on the training set only, then apply the same transformation to both sets
    scaler = StandardScaler()
    scaler.fit(X_train)

    X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)
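To see what the scaling step does, the sketch below repeats it on synthetic data (the frame `X_toy`, the series `y_toy`, and the random values are all made up). After scaling, each training column has mean 0 and standard deviation 1, while the test columns are only approximately standardized because the scaler never saw them during fitting. Passing `index=` is an optional addition of mine that keeps the row labels aligned with the target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the abalone features and target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Length": rng.normal(0.5, 0.1, 200),
    "Whole weight": rng.normal(0.8, 0.5, 200),
})
y_toy = pd.Series(rng.integers(1, 30, 200), name="Rings")

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.7, random_state=1)

# Fit on the training set only, so no information from the test set leaks in
scaler = StandardScaler().fit(X_tr)

X_tr_s = pd.DataFrame(scaler.transform(X_tr), columns=X_toy.columns, index=X_tr.index)
X_te_s = pd.DataFrame(scaler.transform(X_te), columns=X_toy.columns, index=X_te.index)

print(X_tr_s.mean().round(3))       # close to 0 for every column
print(X_tr_s.std(ddof=0).round(3))  # close to 1 for every column
print(X_te_s.mean().round(3))       # near 0, but not exactly
```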

First we build a linear regression model to predict ‘Rings’.

    # 1) Linear Regression Model
    model_1 = LinearRegression()
    model_1.fit(X_train, y_train)

    # score() returns the coefficient of determination (R2) on the given data
    print("R2 of the linear regression model is: ", model_1.score(X_test, y_test))

The output is…

The model is not very good: its R² on the test set is too low for reliable predictions of ‘Rings’.
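As a reminder of what this number means: R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, so it tells us what share of the variance in the target the model explains. The sketch below computes it by hand on synthetic data (the arrays `X_toy` and `y_toy` are made up) and compares it with what scikit-learn reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=2.0, size=100)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_toy - pred) ** 2)          # what the model fails to explain
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print("manual :", r2_manual)
print("score():", model.score(X_toy, y_toy))
print("r2_score:", r2_score(y_toy, pred))
```

All three values agree, which confirms that `score()` on a regressor is simply this ratio.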

Now, we move on to classification, where each value of the ‘Rings’ variable is treated as a separate class. First, we build a logistic regression model. After the confusion matrix, we check the classification report, which gives us the precision, recall, f1-score, and support for each class, as well as the overall accuracy. To understand the metrics, you can check the video.

    # 2) Logistic Regression Model
    model_2 = LogisticRegression()
    model_2.fit(X_train, y_train)

    prediction_2 = model_2.predict(X_test)

    # Rows are the true classes, columns are the predicted classes
    cm = confusion_matrix(y_test, prediction_2)

    fig = plt.figure(figsize=(10, 8))
    ax = sns.heatmap(cm, square=True, annot=True, cbar=False)
    plt.title("Confusion Matrix coming from the Logistic Regression Model")
    plt.show()

    print(classification_report(y_test, prediction_2))

The scores do not look good.
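To read these scores, it helps to see how they come out of the confusion matrix. The sketch below uses a tiny made-up example with three ring classes (the lists `y_true` and `y_pred` are invented): precision divides the diagonal by the column sums (of everything predicted as a class, how much was right), and recall divides it by the row sums (of everything that truly belongs to a class, how much was found).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted ring classes
y_true = [8, 8, 9, 9, 9, 10, 10, 10]
y_pred = [8, 9, 9, 9, 10, 10, 10, 9]
labels = [8, 9, 10]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

correct = cm.diagonal()
precision = correct / cm.sum(axis=0)   # per predicted class (columns)
recall = correct / cm.sum(axis=1)      # per true class (rows)
accuracy = correct.sum() / cm.sum()    # share of all samples on the diagonal

print("precision:", precision.round(3))
print("recall   :", recall.round(3))
print("accuracy :", accuracy)

# The same per-class values as computed by scikit-learn
print(precision_score(y_true, y_pred, labels=labels, average=None).round(3))
print(recall_score(y_true, y_pred, labels=labels, average=None).round(3))
```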

Second, we try a decision tree model and check the same things as we did for the logistic regression model.

    # 3) Decision Tree Model
    model_3 = DecisionTreeClassifier(max_depth=4, random_state=42)
    model_3.fit(X_train, y_train)

    prediction_3 = model_3.predict(X_test)

    # Rows are the true classes, columns are the predicted classes
    cm = confusion_matrix(y_test, prediction_3)

    fig = plt.figure(figsize=(10, 8))
    ax = sns.heatmap(cm, square=True, annot=True, cmap='Blues', cbar=False)
    plt.title("Confusion Matrix coming from the Decision Tree Model")
    plt.show()

    print(classification_report(y_test, prediction_3))

Unfortunately, the decision tree model does not perform well either.

Frankly, we applied three models to predict the age of an abalone, and all of them failed. But why? There can be several reasons. Maybe these independent variables are simply not qualified to predict the ‘Rings’ variable.

However, we won’t give up! :) Since we know the target variable is unbalanced, we will apply undersampling and oversampling techniques to obtain a balanced dependent variable. See you in Part 2.