# Maybe we should not predict the years of Abalone — Part 1

## Linear Regression, and Classification by Logistic Regression and Decision Tree


We can start by downloading the ‘Abalone Data Set’ from Kaggle in csv format. You can also find the original data set in txt format, along with its description, in the UCI (University of California, Irvine) Machine Learning Repository. Now, we can code.

Here are the libraries that we use.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
```

First, we load the data with pandas so that we can process it.

```python
pd.set_option("display.max_columns", 10)
pd.set_option("display.width", 1000)

# Your local address that holds the data
df = pd.read_excel("C:/.../abalone.xlsx")

# I converted the csv formatted data into xlsx format to actually see the data first, before coding.
# You can also type...
# df = pd.read_csv("C:/.../abalone.csv")
```
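If you start from the original UCI text file instead, note that it has no header row, so the column names have to be supplied by hand. Here is a minimal sketch, assuming a comma-separated file and the attribute names listed in the UCI description. The two rows are illustrative lines in the file's layout, embedded as a string so that the snippet runs without a local file, and `df_raw` is just a name chosen for this example.

```python
import io
import pandas as pd

# Column names as listed in the UCI description (the raw file has no header).
columns = ["Sex", "Length", "Diameter", "Height", "Whole weight",
           "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

# Two illustrative rows in the raw file's layout; pass your own file path instead.
raw = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

# header=None tells pandas that the first line is data, not column titles.
df_raw = pd.read_csv(raw, header=None, names=columns)
print(df_raw.dtypes)
```

The Kaggle csv may use slightly different column titles, so check `df.columns` before relying on these names.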

Let’s explore the data…

```python
# First 5 rows with the titles of columns
print(df.head())
print()

# Number of rows, columns
print(df.shape)
print()

# Properties of the columns (index, name, number of values that are not missing, type)
# df.info() prints its report itself and returns None, so it is not wrapped in print()
df.info()
print()

# Some statistics about the columns
print(df.describe())
print()

# Columns with number of unique values
print(df.nunique())
print()
```

The output looks like this.

We check the correlation matrix to see the correlations between the variables.

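```python
plt.figure(figsize=(10, 8))
# 'Sex' is still a text column at this point, so restrict the correlation to numeric columns
# (recent pandas versions raise an error otherwise)
sns.heatmap(df.corr(numeric_only=True), cmap='Greens', annot=True)
plt.title("Correlation between the Variables")
plt.show()
```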
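A heatmap can be hard to scan when you only care about one variable. As a rough sketch, you could also pull out the correlations with ‘Rings’ and sort them. The small data frame below is made up purely for illustration (its numbers are not from the abalone data), and `corr_with_rings` is a name chosen for this example.

```python
import pandas as pd

# Made-up toy data in the same spirit as the abalone columns (not real measurements).
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M", "F", "I"],
    "Length": [0.45, 0.53, 0.33, 0.44, 0.56, 0.30],
    "Height": [0.10, 0.14, 0.08, 0.12, 0.13, 0.07],
    "Rings": [10, 12, 7, 10, 13, 6],
})

# Correlation of every numeric column with 'Rings', strongest first.
corr_with_rings = (
    toy.corr(numeric_only=True)["Rings"]
    .drop("Rings")                # the correlation of 'Rings' with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_rings)
```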

Next comes the preprocessing. The ‘Sex’ column is categorical, so we convert it into numeric indicator (dummy) columns, a form that the models can work with. For details, you can check the video.

```python
# One indicator column per category of 'Sex'
sex = pd.get_dummies(df["Sex"])
df = pd.concat([df, sex], axis=1)
df = df.drop("Sex", axis=1)
print(df.head())
```
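To see what this encoding does, here is a small sketch on a made-up frame. It assumes the three categories M, F and I given in the UCI description. The `prefix` and `dtype` arguments are optional extras that are not used in the article's code; they only make the new columns easier to read.

```python
import pandas as pd

# Made-up rows; only the 'Sex' categories mirror the real data set.
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"], "Rings": [15, 9, 7, 10]})

# One 0/1 column per category; prefix and dtype are optional readability choices.
dummies = pd.get_dummies(toy["Sex"], prefix="Sex", dtype=int)
encoded = pd.concat([toy.drop("Sex", axis=1), dummies], axis=1)
print(encoded)
```

Each row has exactly one of the three indicator columns set to 1, so no ordering between the categories is implied.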

And the new data look like…

As you can guess from the title of this article, we predict the age of abalone. In this data set, the ‘Rings’ column, which stands for the age in years, is the target variable (objective field, the dependent variable y). We then plot the distribution of ‘Rings’.

```python
# Split the objective field y (dependent target variable) from the predicting (independent) variables Xs
y = df["Rings"].copy()
X = df.drop("Rings", axis=1).copy()

# Plot y
sns.displot(y)
plt.show()
```

It does not seem balanced, but let’s move on.
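Besides the plot, a quick count per value makes the imbalance concrete. The sketch below uses a made-up series rather than the real ‘Rings’ column, and the threshold for a "rare" class is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up ring counts, only to show the mechanics.
y_toy = pd.Series([9, 9, 9, 10, 10, 8, 8, 11, 3, 22], name="Rings")

# Number of samples per ring value, ordered by ring value.
counts = y_toy.value_counts().sort_index()
share = (counts / counts.sum()).round(2)
print(pd.DataFrame({"count": counts, "share": share}))

# Values that occur fewer than 2 times (arbitrary threshold for this example).
rare = counts[counts < 2].index.tolist()
print("Rare classes:", rare)
```

On the real data you would call the same methods on `y`.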

We split the data set into 70% training data and 30% test data for the modelling. Then we scale the independent variables so that they are all on a comparable scale. We use ‘StandardScaler’ as the scaler, fit it on the training set only, and convert the scaled arrays back into data frames.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)

# Scale Xs
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)
```
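To check what the scaler does, here is a small sketch on synthetic data (random numbers, not the abalone measurements). The scaler is fitted on the training part only, so the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized. The names `X_toy`, `X_tr_s` and `X_te_s` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features, generated only for this demonstration.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Length": rng.normal(0.5, 0.1, 200),
    "Weight": rng.normal(0.8, 0.4, 200),
})

X_tr, X_te = train_test_split(X_toy, train_size=0.7, random_state=1)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)

# Passing index=... keeps the original row labels on the scaled frames.
X_tr_s = pd.DataFrame(scaler.transform(X_tr), columns=X_toy.columns, index=X_tr.index)
X_te_s = pd.DataFrame(scaler.transform(X_te), columns=X_toy.columns, index=X_te.index)

print(X_tr_s.mean().round(3))        # close to 0 by construction
print(X_tr_s.std(ddof=0).round(3))   # close to 1 by construction
print(X_te_s.mean().round(3))        # near 0, but not exactly
```

Keeping the index is optional: scikit-learn matches rows by position, but aligned labels help if you later join predictions back to `y`.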

First we build a linear regression model to predict ‘Rings’.

```python
# 1) Linear Regression Model
model_1 = LinearRegression()
model_1.fit(X_train, y_train)
print("R2 of the linear regression model is: ", model_1.score(X_test, y_test))
```

The output is…

The model is not very good: its R2 on the test set is not high enough, which means a large part of the variation in ‘Rings’ is left unexplained.
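For reference, the R2 returned by `score` is one minus the ratio of the residual sum of squares to the total sum of squares. A small worked sketch with made-up true and predicted values (not results from the abalone model) shows the calculation next to scikit-learn's `r2_score`, plus the mean absolute error, which is expressed in rings and is often easier to read.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values, only to illustrate the formula.
y_true = np.array([9, 10, 7, 15, 8, 11])
y_pred = np.array([9.5, 9.0, 8.0, 12.0, 8.5, 10.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print("R2 (manual): ", r2_manual)
print("R2 (sklearn):", r2_score(y_true, y_pred))
print("MAE in rings:", mean_absolute_error(y_true, y_pred))
```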

Now we turn to classification, treating each value of ‘Rings’ as a separate class. First, we build a logistic regression model. After plotting the confusion matrix, we print the classification report, which gives us the precision, recall, f1-score and support for each class, as well as the overall accuracy. To understand the metrics, you can check the video.

```python
# 2) Logistic Regression Model
model_2 = LogisticRegression()
model_2.fit(X_train, y_train)
prediction_2 = model_2.predict(X_test)

cm = confusion_matrix(y_test, prediction_2)
fig = plt.figure(figsize=(10, 8))
ax = sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.title("Confusion Matrix coming from the Logistic Regression Model")
plt.show()

print(classification_report(y_test, prediction_2))
```

The scores do not look good.
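If the report is hard to read, it can help to see how precision and recall come out of the confusion matrix. In the sketch below the labels are made up (three ring values only), rows of the matrix are true classes and columns are predicted classes, which is scikit-learn's convention.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted ring values for eight samples.
y_true = [8, 8, 9, 9, 9, 10, 10, 10]
y_pred = [8, 9, 9, 9, 10, 10, 9, 10]
labels = [8, 9, 10]

cm = confusion_matrix(y_true, y_pred, labels=labels)
tp = np.diag(cm)                   # correctly predicted samples per class

precision = tp / cm.sum(axis=0)    # column sums: everything predicted as that class
recall = tp / cm.sum(axis=1)       # row sums: everything that truly is that class

print(cm)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
```

With many classes and few samples in some of them, a column sum can be zero, in which case precision is undefined for that class and scikit-learn emits a warning.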

Secondly, we try a decision tree model, and check the same things as we did in the logistic regression model.

```python
# 3) Decision Tree Model
model_3 = DecisionTreeClassifier(max_depth=4, random_state=42)
model_3.fit(X_train, y_train)
prediction_3 = model_3.predict(X_test)

cm = confusion_matrix(y_test, prediction_3)
fig = plt.figure(figsize=(10, 8))
ax = sns.heatmap(cm, square=True, annot=True, cmap='Blues', cbar=False)
plt.title("Confusion Matrix coming from the Decision Tree Model")
plt.show()

print(classification_report(y_test, prediction_3))
```

Unfortunately, the decision tree model fails, too.

Frankly, we applied 3 models to predict the age of an abalone. All failed, but why? There can be several reasons. Maybe these independent variables are not qualified to predict the ‘Rings’ variable.
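One way to put such scores in context is to compare them with a trivial baseline that always predicts the most frequent class. The sketch below does this on synthetic data, not on the abalone set: the features and the ring-like target are generated at random under assumptions made up for this example, so the numbers it prints say nothing about the real models.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: a many-class target that is only loosely related to the features.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 3))
noise = rng.normal(scale=2, size=500)
y_syn = np.clip(np.round(9 + 2 * X_syn[:, 0] + noise), 3, 15).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, train_size=0.7, random_state=1)

# Baseline: ignores the features and always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
tree_acc = tree.score(X_te, y_te)
print("Baseline accuracy:     ", round(baseline_acc, 3))
print("Decision tree accuracy:", round(tree_acc, 3))
```

If a model barely beats this baseline, the problem may lie with the features or the class distribution rather than with the choice of algorithm.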

However, we won’t give up! :) Since we know the target variable is unbalanced, we will apply undersampling and oversampling techniques to obtain a balanced dependent variable. See you in Part 2.