In any kind of business, we need to segment our customers. Customer segmentation is done so that we can plan dedicated treatment for each segment of customers. To give an example, when we started our business, we received work from Ministries in Saudi Arabia and Bhutan, and from private companies in Saudi Arabia, Egypt, Dubai, Australia and Bangladesh. In a very short period of time, we had to attend to more than 20 customers. Being a start-up, we had limited resources. Nevertheless, we were determined that none of our customers should ever face any difficulty with our services.

In order to service our customers effectively, we decided to assign specific people to a set of customers. Having planned this, we needed to create these sets of customers. So, we segmented our customers so that customers with similar needs could be put in one group and assigned to one team of engineers. Our segmentation was based primarily on the type of business the customer was doing and the types of systems used to develop that customer's projects.

When we talk about large business houses, we know that they have thousands or even millions of customers. These customers need to be segmented so that specific products can be pushed to them, and for many other requirements such as promotions. Segmenting these customers is not easy, since we are dealing with a very large number of them. So, here we need the machine to help us out. However, before the machine can help us, we need to establish a basis on which it will process the data to produce the segmentation.

One such basis, used by many companies, is to evaluate customers on Recency, Frequency and Monetary Value. This analysis is called the **RFM analysis**.

**Recency** refers to how recently the customer has been involved in business with the company. Generally, customers who have done business recently may be targeted with more products to gain their loyalty.

**Frequency** refers to how frequently the customer conducts business with the company. Customers who do business with a company more frequently should be looked after with more focused plans, so that their loyalty towards the company is kept intact.

**Monetary Value** refers to how much the customer spent in interactions with the company over a period of time. The company would definitely like to keep high-value customers happy and look to increase business from relatively low-value customers.
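To make the three measures concrete, here is a minimal sketch (using made-up transactions and hypothetical column names, not our actual dataset) of how Recency, Frequency and Monetary Value could be computed per customer with pandas:

```python
import pandas as pd

# Hypothetical transactions: customer, invoice date and invoice amount
tx = pd.DataFrame({
    'CustomerID': ['A', 'A', 'B', 'B', 'B', 'C'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-11-15', '2011-12-08',
                                   '2011-12-05', '2011-10-01', '2011-06-20']),
    'Amount': [100.0, 50.0, 20.0, 30.0, 10.0, 500.0],
})

today = tx['InvoiceDate'].max()          # pretend "today" is the last invoice date
rfm = tx.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda d: (today - d.max()).days),  # days since last purchase
    Frequency=('InvoiceDate', 'count'),                          # number of transactions
    MonetaryValue=('Amount', 'sum'),                             # total spend
)
print(rfm)
```

We will compute exactly these three quantities for our real dataset later in this article.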

Now that we have established a basis on which to segment our customers, let us look at the data from which we will extract this information about our customers.

## Dataset

The dataset is the Online Retail dataset from the UCI Machine Learning Repository. It is a transnational dataset which contains all the transactions occurring between 01-Dec-2010 and 09-Dec-2011 for a UK-based and registered non-store online retailer.

The dataset contains 541,909 records, and each record is made up of 8 fields.

The dataset comes as 2 files with the same set of columns. We will use the file Online_Retail_Train.csv to create clusters of customers. Once we have the customers segmented, we will feed this data (the transaction information along with each customer's segment) to the machine and train it to determine the segment of a customer. Once we have a model for segmenting customers, we will apply it to new data. The test data for checking our customer segmentation model is Online_Retail_Test.csv.

## Clustering

**Clustering** is the task of grouping a set of objects so that objects in the same cluster are more similar to each other than to objects in other clusters. **Similarity** is a measure that reflects the strength of the relationship between two data objects.
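One common way to quantify this (dis)similarity is Euclidean distance: the smaller the distance between two points, the more similar they are. A minimal sketch:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The smaller the distance, the more similar the two points
print(euclidean((1, 4), (5, 6)))   # sqrt(4^2 + 2^2) = sqrt(20)
```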

For clustering, we will use the **K-Means algorithm**.

Let us understand the K-means algorithm through an example. We will take an arbitrary example in 2-dimensional space so that it is easy to visualise.

Suppose that we have 10 points in a 2-dimensional space as given below.

(1,4), (5,6), (-8,3), (6,-5), (10,2), (-1,2), (7,3), (-6,4), (9,2), (-4,4)

We can treat these points as 10 records with 2 features, x1 and x2. We create a dataframe of these points and see what it looks like.

```
import warnings
warnings.filterwarnings('ignore')
```

```
import pandas as pd
df = pd.DataFrame([(1,4), (5,6), (-8,3), (6,-5), (10,2), (-1,2), (7,3), (-6,4), (9,2), (-4,4)],
                  columns = ['x1', 'x2'])
df
```

Let us create a Scatter Plot to visualise this data.

```
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df.x1, df.x2)
plt.show()
```

We can see that these points can be grouped into 2 or more groups. To start (as we are still learning the concept), let us try to create 2 clusters from this data. **One point to note here is that to apply the K-means algorithm, we need to specify in advance how many clusters we require. Later we will see how to determine the appropriate number of clusters.**

So, we want to create 2 clusters from this data. Next, we select 1 point for each of the 2 clusters to act as the **centroid** of that cluster. Let us select the points (2,2) and (5,5) as our 2 centroids and plot them along with the data points. The centroids are marked in red with an 'x'.

```
centroidsX = [2, 5]
centroidsY = [2, 5]
plt.scatter(df.x1, df.x2)
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```

We chose these centroids as starting points; we will try to form a cluster around each of them.

Now, we find the distance from each data point in our dataset to each of the centroids we have selected. Each data point is assigned to the cluster whose centroid is nearest to it.

Each data point can be defined as (x, y). So, our first data point can be defined as (x1, y1) and the centroids can be defined as (cx1, cy1) and (cx2, cy2).

So, the distance between (x1, y1) and (cx1, cy1) = sqrt((cx1 - x1)^2 + (cy1 - y1)^2).

Here, **sqrt** is the square root and **^** is the power of.

Let us work it out manually.

```
df['Round1-Distance-C1'] = (((centroidsX[0] - df['x1'])**2) + ((centroidsY[0] - df['x2'])**2))**(1/2)
df['Round1-Distance-C2'] = (((centroidsX[1] - df['x1'])**2) + ((centroidsY[1] - df['x2'])**2))**(1/2)
df['Round1-Cluster'] = [0 if df['Round1-Distance-C1'][i] < df['Round1-Distance-C2'][i] else 1
                        for i in range(len(df['Round1-Distance-C1']))]
df
```

So, we can see that each data point has been assigned to a cluster.

The next step is to calculate the new centroids. For this, we take the data points in each cluster separately and find the mean (in this case, the mean of the x-coordinate and the mean of the y-coordinate).

```
centroidsX[0] = df[df['Round1-Cluster'] == 0]['x1'].mean()
centroidsY[0] = df[df['Round1-Cluster'] == 0]['x2'].mean()
centroidsX[1] = df[df['Round1-Cluster'] == 1]['x1'].mean()
centroidsY[1] = df[df['Round1-Cluster'] == 1]['x2'].mean()
print('New Centroids X-coordinates', centroidsX)
print('New Centroids Y-coordinates', centroidsY)
```

Let us plot the new centroids along with the data points.

```
plt.scatter(df.x1, df.x2, c = df['Round1-Cluster'])
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```

Now, we repeat the process to find new cluster assignment and new centroids.

```
df['Round2-Distance-C1'] = (((centroidsX[0] - df['x1'])**2) + ((centroidsY[0] - df['x2'])**2))**(1/2)
df['Round2-Distance-C2'] = (((centroidsX[1] - df['x1'])**2) + ((centroidsY[1] - df['x2'])**2))**(1/2)
df['Round2-Cluster'] = [0 if df['Round2-Distance-C1'][i] < df['Round2-Distance-C2'][i] else 1
                        for i in range(len(df['Round2-Distance-C1']))]
centroidsX[0] = df[df['Round2-Cluster'] == 0]['x1'].mean()
centroidsY[0] = df[df['Round2-Cluster'] == 0]['x2'].mean()
centroidsX[1] = df[df['Round2-Cluster'] == 1]['x1'].mean()
centroidsY[1] = df[df['Round2-Cluster'] == 1]['x2'].mean()
plt.scatter(df.x1, df.x2, c = df['Round2-Cluster'])
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```

We will repeat this process one last time for illustration. In practice, we stop iterating when the centroids have stabilised (i.e., they no longer move appreciably).

```
df['Round3-Distance-C1'] = (((centroidsX[0] - df['x1'])**2) + ((centroidsY[0] - df['x2'])**2))**(1/2)
df['Round3-Distance-C2'] = (((centroidsX[1] - df['x1'])**2) + ((centroidsY[1] - df['x2'])**2))**(1/2)
df['Round3-Cluster'] = [0 if df['Round3-Distance-C1'][i] < df['Round3-Distance-C2'][i] else 1
                        for i in range(len(df['Round3-Distance-C1']))]
centroidsX[0] = df[df['Round3-Cluster'] == 0]['x1'].mean()
centroidsY[0] = df[df['Round3-Cluster'] == 0]['x2'].mean()
centroidsX[1] = df[df['Round3-Cluster'] == 1]['x1'].mean()
centroidsY[1] = df[df['Round3-Cluster'] == 1]['x2'].mean()
plt.scatter(df.x1, df.x2, c = df['Round3-Cluster'])
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```
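For understanding, the three rounds above can be folded into a single loop that alternates the assignment and centroid-update steps until the centroids stop moving. Here is a compact NumPy sketch of that idea (it assumes no cluster ever becomes empty), run on the same 10 points with the same starting centroids:

```python
import numpy as np

def kmeans_simple(points, centroids, tol=1e-6, max_iter=100):
    """Repeat assign/update until the centroids move less than tol."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

pts = [(1,4), (5,6), (-8,3), (6,-5), (10,2), (-1,2), (7,3), (-6,4), (9,2), (-4,4)]
labels, cents = kmeans_simple(pts, [(2, 2), (5, 5)])
print(labels)
print(cents)
```

Running this reproduces the manual result: the centroids converge by the third round.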

So, we have divided the data points into 2 clusters. Now, let us use a metric to judge how good this clustering is. We will use **inertia** as our point of reference.

**To calculate the inertia, we compute the distance of each data point in each cluster from the corresponding centroid. Then we sum the squares of all these distances to obtain the value of inertia.**

```
distanceCluster = [0, 0]
for i in range(len(df)):
    for j in range(len(df['Round3-Cluster'].unique())):
        if j == df.loc[i, 'Round3-Cluster']:
            # Sum of squared distances (no square root), per the definition of inertia
            distanceCluster[j] += ((centroidsX[j] - df.loc[i, 'x1'])**2) + ((centroidsY[j] - df.loc[i, 'x2'])**2)
inertia = sum(distanceCluster)
print('Inertia', inertia)
```

Lastly, to round off our understanding of K-means clustering, let us create 3 clusters from the same data and check the value of inertia. We will compare the inertia from the 2-cluster exercise with that from the 3-cluster exercise to see which is better. **Remember that a lower value of inertia indicates a better (tighter) clustering.**

```
centroidsX = [2, 5, 6]
centroidsY = [2, 5, 6]
plt.scatter(df.x1, df.x2)
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```

Now, let us apply our first iteration.

```
df['Round1-3c-Distance-C1'] = (((centroidsX[0] - df['x1'])**2) + ((centroidsY[0] - df['x2'])**2))**(1/2)
df['Round1-3c-Distance-C2'] = (((centroidsX[1] - df['x1'])**2) + ((centroidsY[1] - df['x2'])**2))**(1/2)
df['Round1-3c-Distance-C3'] = (((centroidsX[2] - df['x1'])**2) + ((centroidsY[2] - df['x2'])**2))**(1/2)
df['Round1-3c-Cluster'] = [0 if (df['Round1-3c-Distance-C1'][i] < df['Round1-3c-Distance-C2'][i] and
                                 df['Round1-3c-Distance-C1'][i] < df['Round1-3c-Distance-C3'][i])
                           else 1 if (df['Round1-3c-Distance-C2'][i] < df['Round1-3c-Distance-C1'][i] and
                                      df['Round1-3c-Distance-C2'][i] < df['Round1-3c-Distance-C3'][i])
                           else 2
                           for i in range(len(df['Round1-3c-Distance-C1']))]
for i in range(3):
    centroidsX[i] = df[df['Round1-3c-Cluster'] == i]['x1'].mean()
    centroidsY[i] = df[df['Round1-3c-Cluster'] == i]['x2'].mean()
plt.scatter(df.x1, df.x2, c = df['Round1-3c-Cluster'])
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```

We can see that 3 clusters have formed. We do one more iteration.

```
df['Round2-3c-Distance-C1'] = (((centroidsX[0] - df['x1'])**2) + ((centroidsY[0] - df['x2'])**2))**(1/2)
df['Round2-3c-Distance-C2'] = (((centroidsX[1] - df['x1'])**2) + ((centroidsY[1] - df['x2'])**2))**(1/2)
df['Round2-3c-Distance-C3'] = (((centroidsX[2] - df['x1'])**2) + ((centroidsY[2] - df['x2'])**2))**(1/2)
df['Round2-3c-Cluster'] = [0 if (df['Round2-3c-Distance-C1'][i] < df['Round2-3c-Distance-C2'][i] and
                                 df['Round2-3c-Distance-C1'][i] < df['Round2-3c-Distance-C3'][i])
                           else 1 if (df['Round2-3c-Distance-C2'][i] < df['Round2-3c-Distance-C1'][i] and
                                      df['Round2-3c-Distance-C2'][i] < df['Round2-3c-Distance-C3'][i])
                           else 2
                           for i in range(len(df['Round2-3c-Distance-C1']))]
for i in range(3):
    centroidsX[i] = df[df['Round2-3c-Cluster'] == i]['x1'].mean()
    centroidsY[i] = df[df['Round2-3c-Cluster'] == i]['x2'].mean()
plt.scatter(df.x1, df.x2, c = df['Round2-3c-Cluster'])
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```

We conduct one last iteration.

```
df['Round3-3c-Distance-C1'] = (((centroidsX[0] - df['x1'])**2) + ((centroidsY[0] - df['x2'])**2))**(1/2)
df['Round3-3c-Distance-C2'] = (((centroidsX[1] - df['x1'])**2) + ((centroidsY[1] - df['x2'])**2))**(1/2)
df['Round3-3c-Distance-C3'] = (((centroidsX[2] - df['x1'])**2) + ((centroidsY[2] - df['x2'])**2))**(1/2)
df['Round3-3c-Cluster'] = [0 if (df['Round3-3c-Distance-C1'][i] < df['Round3-3c-Distance-C2'][i] and
                                 df['Round3-3c-Distance-C1'][i] < df['Round3-3c-Distance-C3'][i])
                           else 1 if (df['Round3-3c-Distance-C2'][i] < df['Round3-3c-Distance-C1'][i] and
                                      df['Round3-3c-Distance-C2'][i] < df['Round3-3c-Distance-C3'][i])
                           else 2
                           for i in range(len(df['Round3-3c-Distance-C1']))]
for i in range(3):
    centroidsX[i] = df[df['Round3-3c-Cluster'] == i]['x1'].mean()
    centroidsY[i] = df[df['Round3-3c-Cluster'] == i]['x2'].mean()
plt.scatter(df.x1, df.x2, c = df['Round3-3c-Cluster'])
plt.scatter(centroidsX, centroidsY, marker = 'x', c = 'r')
plt.show()
```

We see that the Centroids have stabilised. So, we calculate the inertia.

```
distanceCluster = [0, 0, 0]
for i in range(len(df)):
    for j in range(len(df['Round3-3c-Cluster'].unique())):
        if j == df.loc[i, 'Round3-3c-Cluster']:
            # Sum of squared distances (no square root), per the definition of inertia
            distanceCluster[j] += ((centroidsX[j] - df.loc[i, 'x1'])**2) + ((centroidsY[j] - df.loc[i, 'x2'])**2)
inertia = sum(distanceCluster)
print('Inertia', inertia)
```

We see that the inertia when we create 3 clusters is less than the inertia when we create 2 clusters. Bear in mind that inertia always decreases as the number of clusters grows, so a lower value alone does not settle the question. Instead, we keep increasing the number of clusters and checking the inertia; when the inertia stops decreasing significantly, we stop and use that number of clusters for our final solution. This method is called the **Elbow Method**.
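As a cross-check on our manual calculations, the same comparison can be sketched with scikit-learn's `KMeans`, whose `inertia_` attribute is the sum of squared distances of the points to their nearest centroid:

```python
import pandas as pd
from sklearn.cluster import KMeans

points = pd.DataFrame([(1,4), (5,6), (-8,3), (6,-5), (10,2), (-1,2),
                       (7,3), (-6,4), (9,2), (-4,4)], columns = ['x1', 'x2'])

inertias = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters = k, n_init = 10, random_state = 42).fit(points)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 2))  # inertia shrinks as k grows
```

We will use this same class on our customer data shortly.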

## Performing Customer Segmentation based on RFM Analysis

Now that we have discussed how K-means clustering works, let us apply it to cluster the customers from our dataset based on RFM analysis.

### Load the data

```
dfMainTrain = pd.read_csv('Online_Retail_Train.csv')
dfMainTest = pd.read_csv('Online_Retail_Test.csv')
dfTrain = dfMainTrain.copy()
dfTest = dfMainTest.copy()
print('Training Data Shape:', dfTrain.shape)
print('Test Data Shape:', dfTest.shape)
```

```
dfTrain.columns
```

```
dfTrain.head(20)
```

### Pre-process Data

First we check if duplicate data is available in the dataset.

```
# Find how many duplicate rows exist
dfTrain.duplicated(subset = None, keep = 'first').sum()
```

There are 4754 duplicate rows out of 514813. So, we will eliminate the duplicate rows.

```
print('Before', dfTrain.shape)
dfTrain.drop_duplicates(keep= 'first', inplace = True)
print('After', dfTrain.shape)
```

Let us find the unique types of Invoices. We do this by examining the first character of each Invoice Number.

```
pd.DataFrame([inv[0] for inv in dfTrain.InvoiceNo], columns = ['InvStart'])['InvStart'].unique()
```

We see that there are 3 types of Invoice Numbers.

- Invoice Numbers which are entirely numbers. We will call these normal invoices.
- Invoice Numbers starting with “C”. These are Cancelled Invoices.
- Invoice Numbers starting with “A”. These are Adjusted Invoices.

```
# Classify the Invoices as Normal, Cancelled and Adjusted
dfTrain['InvoiceType'] = dfTrain['InvoiceNo'].astype(str).str[0]
dfTrain['InvoiceType'].replace('5', 'N', inplace = True)
dfTrain.InvoiceType.unique()
```

We find out how many invoices of each type are present in the dataset.

```
print('\n\nTypes of Invoices\n', dfTrain.InvoiceType.value_counts())
```

We see that the numbers of Cancelled and Adjustment Invoices are relatively small. So, we drop these rows.

```
print('Before', dfTrain.shape)
dfTrain = dfTrain[dfTrain.InvoiceType == 'N']
print('After', dfTrain.shape)
```

Now all the remaining invoices represent the sale of an item. We check if any invoice has a negative quantity.

```
print('Number of Invoices with negative quantity', len(dfTrain[dfTrain.Quantity < 0]))
```

We will drop the invoices with negative quantity.

```
print('Before', dfTrain.shape)
dfTrain = dfTrain[dfTrain.Quantity > 0]
print('After', dfTrain.shape)
```

Let us find the Stock Codes which are not items.

```
dfTrain['StockCodeType'] = ['Remove' if stockCode in ['POST', 'PADS', 'M', 'DOT', 'C2', 'BANK CHARGES'] else 'Keep' for stockCode in dfTrain.StockCode ]
dfTrain.StockCodeType.value_counts()
```

The records with Stock Codes that are not items are relatively few. So, we remove them.

```
print('Before', dfTrain.shape)
dfTrain = dfTrain[dfTrain.StockCodeType == 'Keep']
print('After', dfTrain.shape)
```

We create an extra column to record the Week Day of the Invoice Date.

```
dfTrain['InvWeekDay'] = pd.to_datetime(dfTrain.InvoiceDate).dt.dayofweek
```

We create a column to store the value of each invoice. The value of each invoice can be found by multiplying the Quantity and the Unit Price.

```
dfTrain['TotalAmount'] = dfTrain.Quantity * dfTrain.UnitPrice
```

We put together all the above actions we have performed into a single function so that we can apply this function on the test data.

**Function to preprocess the data**

```
def preProcessData(dataFrame):
    dataFrameCopy = dataFrame.copy()
    print('Removing duplicates', dataFrameCopy.shape)
    returnDF = dataFrameCopy.drop_duplicates(keep = 'first', inplace = False)
    # Classify the Invoices as Normal, Cancelled and Adjusted
    # Delete all the rows except for the Normal Invoices
    print('Keep only Normal Invoices', returnDF.shape)
    returnDF['InvoiceType1'] = returnDF['InvoiceNo'].astype(str).str[0]
    dfTemp = returnDF.copy()
    returnDF['InvoiceType'] = [invoiceType if invoiceType in ['C', 'A'] else 'N' for invoiceType in dfTemp.InvoiceType1]
    returnDF = returnDF[returnDF.InvoiceType == 'N']
    returnDF.drop(['InvoiceType', 'InvoiceType1'], axis = 1, inplace = True)
    # Delete the rows where Quantity is less than or equal to 0
    print('Remove rows with negative quantity', returnDF.shape)
    returnDF = returnDF[returnDF.Quantity > 0]
    # Classify the Invoices based on the Stock Code
    # Remove the rows with invalid Stock Codes
    print('Keep only valid Stock Code', returnDF.shape)
    dfTemp = returnDF.copy()
    returnDF['StockCodeType'] = ['Remove' if stockCode in ['POST', 'PADS', 'M', 'DOT', 'C2', 'BANK CHARGES'] else 'Keep'
                                 for stockCode in dfTemp.StockCode]
    returnDF = returnDF[returnDF.StockCodeType == 'Keep']
    returnDF.drop('StockCodeType', axis = 1, inplace = True)
    # Create a column to store the week day of the invoice date
    print('Add column to store week day of invoice date', returnDF.shape)
    returnDF['InvWeekDay'] = pd.to_datetime(returnDF.InvoiceDate).dt.dayofweek
    # Create a column to store the invoice amount
    print('Add column to store invoice amount', returnDF.shape)
    returnDF['TotalAmount'] = returnDF.Quantity * returnDF.UnitPrice
    # Check if we have clean data
    if returnDF.isnull().sum().sum() == 0:
        print('\nData is clean', returnDF.shape)
    else:
        print('\nMissing Data exists', returnDF.shape)
        print(returnDF.isnull().sum(), '\n')
    return returnDF
```

**Pre-process the Data**

```
dfTrainToUse = preProcessData(dfMainTrain)
print('Original Data', dfMainTrain.shape)
print('Processed Data', dfTrainToUse.shape)
```

The data we have is from 2010 and 2011. However, at the time of writing this article, it is 2022. So, we will treat the date of the last invoice in this dataset as "today". We therefore find the last invoice date in the dataset and store it.

```
lastInvoiceDate = dfTrainToUse.InvoiceDate.max()
lastInvoiceDate
```

### Feature Engineering and Transformation

From the above data, we extract the 3 features we are interested in, i.e., Recency, Frequency and Monetary Value. We need this data at the Customer level, as we will be clustering customers. So, we group the above data by Customer.

For the Monetary Value of each Customer, we sum the Invoice Amounts across all the Invoices of that Customer.

For the Frequency, we can count the number of Invoices per Customer.

However, to find the Recency, we will capture the date of the last Invoice of the Customer. Then we will subtract this date from the **lastInvoiceDate**, which is the date of the last invoice in this dataset.

```
gb = dfTrainToUse.groupby('CustomerID')
counts = gb.size().to_frame(name = 'Frequency')
dfCustomerDF = \
(counts
.join(gb.agg({'TotalAmount': 'sum'}).rename(columns={'TotalAmount': 'MonetaryValue'}))
.join(gb.agg({'InvoiceDate': 'max'}).rename(columns={'InvoiceDate': 'MostRecentPurchase'}))
.reset_index()
)
from datetime import datetime, date
dfCustomerDF['MostRecentPurchaseDate'] = pd.to_datetime(dfCustomerDF['MostRecentPurchase']).dt.date
dfCustomerDF['Recency'] = (datetime.strptime(lastInvoiceDate, "%Y-%m-%d %H:%M:%S").date() -
                           dfCustomerDF['MostRecentPurchaseDate']).dt.days
dfCustomerDF.drop(['MostRecentPurchaseDate', 'MostRecentPurchase'], axis = 1, inplace = True)
print('Shape of Customer Data:', dfCustomerDF.shape)
print('\n')
print(dfCustomerDF.head())
```

We turn this into a function so that we can apply it to any other dataset that needs the same transformation.

**Function for transforming the data**

```
def transformData(dataFrame):
    gb = dataFrame.groupby('CustomerID')
    counts = gb.size().to_frame(name = 'Frequency')
    dfCustomerDF = (counts
                    .join(gb.agg({'TotalAmount': 'sum'}).rename(columns = {'TotalAmount': 'MonetaryValue'}))
                    .join(gb.agg({'InvoiceDate': 'max'}).rename(columns = {'InvoiceDate': 'MostRecentPurchase'}))
                    .reset_index())
    from datetime import datetime, date
    dfCustomerDF['MostRecentPurchaseDate'] = pd.to_datetime(dfCustomerDF['MostRecentPurchase']).dt.date
    # Measure Recency from the last invoice date of the dataset passed in,
    # not from the global training dataframe
    dfCustomerDF['Recency'] = (datetime.strptime(dataFrame.InvoiceDate.max(), "%Y-%m-%d %H:%M:%S").date() -
                               dfCustomerDF['MostRecentPurchaseDate']).dt.days
    dfCustomerDF.drop(['MostRecentPurchaseDate', 'MostRecentPurchase'], axis = 1, inplace = True)
    return dfCustomerDF
```

```
dfCustomerTrain = transformData(dfTrainToUse)
print('Shape of the Customer data:', dfCustomerTrain.shape)
print('\n')
print(dfCustomerTrain.head())
```

### Scale the data

During the discussion on K-means clustering, you would have noticed that we calculate the distance of the data points from the centroids. Now, if a feature in the dataset has very high values compared to the other features, the distance calculations will get skewed towards that variable. So, before applying K-means clustering, we need to scale the data so that all the features have data in similar ranges.
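Here is a quick illustration of that skew, with three hypothetical customers: before scaling, the distance between the first two is driven almost entirely by the large MonetaryValue column; after standardisation, all three features contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers; columns: Recency, Frequency, MonetaryValue
raw = np.array([[10.0, 5.0, 9000.0],
                [12.0, 6.0, 1000.0],
                [300.0, 1.0, 1100.0]])

# Unscaled: the distance between customers 0 and 1 is dominated by spend
d_raw = np.linalg.norm(raw[0] - raw[1])
print(d_raw)

# After standardising each column, the distance is no longer dominated by one feature
scaled = StandardScaler().fit_transform(raw)
d_scaled = np.linalg.norm(scaled[0] - scaled[1])
print(d_scaled)
```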

So, we apply `StandardScaler` on the features.

```
dfCustomerTrainScaled = dfCustomerTrain.copy()
X = dfCustomerTrainScaled.drop('CustomerID', axis = 1)
y = dfCustomerTrainScaled['CustomerID']
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
XTrain = sc.fit_transform(X.to_numpy())
XTrain = pd.DataFrame(XTrain, columns=['Frequency', 'MonetaryValue', 'Recency'])
XTrain.head()
```

### Apply K-means algorithm to identify a specific number of clusters

Now we apply K-means clustering and try to determine the best number of clusters to form. To do this, we will create between 1 and 14 clusters. Then, we will use the Elbow Method to establish the best number of clusters.

Apart from inertia, we will compute the distortion, which can make the decision easier.

```
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
clusters = np.arange(1, 15)
inertia = []
distortions = []
for k in clusters:
    kmeans = KMeans(n_clusters = k, random_state = 42)
    kmeans.fit(XTrain)
    inertia.append(kmeans.inertia_)
    # Distortion: mean distance of each (scaled) point to its nearest centroid
    distortions.append(sum(np.min(cdist(XTrain, kmeans.cluster_centers_, 'euclidean'), axis=1)) / XTrain.shape[0])
# Plot the elbow
plt.plot(clusters, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
plt.plot(clusters, inertia, marker= '.')
plt.title('Inertia Plot')
plt.xlabel("$k$")
plt.ylabel("Inertia")
plt.show()
```

**From the above charts, we can conclude that creating 12 clusters could be optimal.**

So, we create 12 clusters and store the centroids.

```
kmeans = KMeans(n_clusters = 12, random_state = 42)
kmeans.fit(XTrain)
KMeansCentroids = kmeans.cluster_centers_
KMeansCentroids
```

### Visualising the clusters

Now, we try to visualise the clusters. This is possible in our case as we have only 3 independent variables and we have the ability to create 3D charts.

```
# Fixing random state for reproducibility
np.random.seed(19680801)
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(projection = '3d')
xs = XTrain.Recency
ys = XTrain.Frequency
zs = XTrain.MonetaryValue
ax.scatter(xs, ys, zs, marker = 'o', c = kmeans.labels_)
xs1 = KMeansCentroids.T[2]   # Recency is the 3rd column of XTrain
ys1 = KMeansCentroids.T[0]   # Frequency is the 1st column
zs1 = KMeansCentroids.T[1]   # MonetaryValue is the 2nd column
ax.scatter(xs1, ys1, zs1, marker = 'x', c = 'r')
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary Value')
plt.show()
```

## Train a Supervised Learning algorithm on segmented data

Now that we have clustered the training data, we can use this information to train a Supervised Learning model so that we can segment other, similar data. In our case, we have a test dataset which we can segment using the model we develop now.

```
dfLabelled = XTrain.copy()
dfLabelled['Segment'] = kmeans.labels_
print(dfLabelled.head())
print('\n')
print(dfLabelled.Segment.value_counts())
```

### Create and test the model

```
X = dfLabelled.drop('Segment', axis = 1)
y = dfLabelled['Segment']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
yPredTrain = rf.predict(X_train)
print('Training Accuracy:', accuracy_score(y_train, yPredTrain))
yPredTest = rf.predict(X_test)
print('Test Accuracy:', accuracy_score(y_test, yPredTest))
```

### Evaluation of Test Data

Now that we have our model for Customer Segmentation, we will apply our model on the test data.

```
dfMainTest.head()
```

We first pre-process the test data.

```
dfTestToUse = preProcessData(dfMainTest)
print('Original Data', dfMainTest.shape)
print('Processed Data', dfTestToUse.shape)
```

Then we extract the features from the test data.

```
dfCustomerTest = transformData(dfTestToUse)
X = dfCustomerTest.drop(['CustomerID'], axis = 1)
XTest = sc.transform(X.to_numpy())
XTest = pd.DataFrame(XTest, columns=['Frequency','MonetaryValue', 'Recency'])
XTest.head()
```

And lastly, we apply our model to segment the customers in the test data.

```
yPredTestData = rf.predict(XTest)
pd.DataFrame(yPredTestData).value_counts()
```

```
# Fixing random state for reproducibility
np.random.seed(19680801)
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(projection = '3d')
xs = XTest.Recency
ys = XTest.Frequency
zs = XTest.MonetaryValue
ax.scatter(xs, ys, zs, marker = 'o', c = yPredTestData)
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary Value')
plt.show()
```