When Magento meets Python (episode Anomalies Detection)
Second post about applying some Python magic to deal with real use cases with Magento.
If you missed the first episode, check it out here.
Anomalies detection is a way to detect strange data behaviour (not necessary an error), finding the outliers, observations that are “far” from the rest, in terms of proximity or statistically, depending from the approach.
Let’s make an example: in an e-commerce scenario, most of the orders are ok, following the same steps and being in a consistent state.
Now let’s suppose that an external factor — of course all the internal code is 100% bug free :), is performing some unwanted changes, updating occasionally orders status in a bad way.
Is it possible to have something to assess the past situation and alert when something is wrong (and make customer care team your best friends)? Let’s find out.
In this PoC, we’re using fictional data orders with just three features: qty invoiced, order state and order status, fields present in Magento.
Let’s take a look at the data.
import pandas as pd df = pd.read_csv(filepath_or_buffer='anomalies_example.csv', sep=';')df.head()
A very simple table…
Let’s do some very basic EDA (Exploratory Data Analysis)…something to always perform when dealing with a new data set.
print(df.info())RangeIndex: 5200 entries, 0 to 5199 Data columns (total 3 columns): state 5200 non-null object status 5200 non-null object qty_invoiced 5200 non-null int64 dtypes: int64(1), object(2) memory usage: 122.0+ KB
We have 5200 entries, with state and status as strings (there are categorical values) and qty_invoiced as integer
print(df.describe())qty_invoiced count 5200.000000 mean 0.882885 std 0.325748 min 0.000000 25% 1.000000 50% 1.000000 75% 1.000000 max 3.000000
Interestingly, the max qty invoiced for an item is 3 and at least the 75% of records have just one item invoiced.
Let’s see the numbers
df.groupby('qty_invoiced').state.count()qty_invoiced 0 615 1 4580 2 4 3 1 Name: state, dtype: int64
Only one order with 3 items, 4 with 2 and the majority with 1..let’s keep in mind this.
Before using some algorithm, let’s try to assess the situation visually, using Seaborn, a powerful graphical library
import seaborn as sns_ = sns.stripplot(x="qty_invoiced", y="state", data=df)
Whoa, definitely something wrong here. For sure qty_invoiced positive with “pending payment” as status is not ok!
About “hold” and “processing”, well it depends from the implemented BL if is ok or not. But we can make an assumption from the data, using the algorithm as input/output processor, considering the BL simply as a black box.
The main characteristic of data set for anomalies is that the “wrong” examples usually are very, very few and often you don’t know them in advance (especially for continuous features), so it’s difficult to train a supervised algorithm to teach what is right and what is wrong.
So better “throw” all the data to an unsupervised learning algorithm and see what it can find.
As said earlier, there are different approaches.We’ll use an algorithm called “k-nearest neighbors” as unsupervised classifier, meaning that the output will be 1 if an order item is consider an anomaly or 0 if not.
Let’s start but, before, we have to transform the categorical features (state and status values) in something more understandable to an algorithm.
from sklearn import preprocessing df_dummy = pd.get_dummies(df) df_dummy
We transformed the data set, placing a 0/1 for every possible value for state and status. Now, we deal with a matrix of 5200 rows × 17 columns.
Let’s proceed with KNN.
from pyod.models.knn import KNN clf_name = 'KNN' clf = KNN() clf.fit(df_dummy) y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers) anomalies = [i for i, x in enumerate(y_train_pred) if x == 1] print(anomalies)[975, 1071, 1262, 1573, 1946, 1947, 1948, 1949, 2866, 3308, 3309, 3896, 5120, 5121]
Ok, we found some…let’s take a look at one
df_dummy.iloc[975,:]qty_invoiced 2 state_canceled 0 state_closed 0 state_complete 1 state_holded 0 state_new 0 state_payment_review 0 state_pending_payment 0 state_processing 0 status_canceled 0 status_closed 0 status_complete 1 status_holded 0 status_payment_review 0 status_pending 0 status_pending_payment 0 status_processing 0
It seems ok as order configuration, but is considered an anomaly….and KNN is right, because it is! Remember most of the order items have one item?
So we have to change a little bit the data to help the algorithm, doing what is called “feature engineering”.
To deal with the imbalance of order items, we can remove the qty_invoiced number and replace with a new feature telling just if qty_invoiced is positive. Let’s try it and let’s run the KNN again
df_dummy['invoiced_positive'] = df.apply(lambda row: row.qty_invoiced > 0, axis=1) df_dummy.drop(['qty_invoiced'], axis = 1, inplace=True) clf.fit(df_dummy) y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers) anomalies = [i for i, x in enumerate(y_train_pred) if x == 1] print(anomalies) [1071, 1262, 1946, 1947, 1948, 1949, 3308, 3309, 5120, 5121]
The 975 value is gone! Let’s take a look to 1071
It’s an anomaly and it’s wrong! Let’s see all of them, using the original data set for a better readability.
The red ones are wrong, so it’s ok..the others seem ok, there are just few cases in the data set.
Lastly, we have a tool to check new data too (imagine exposed as an API), using the trained classifier. Let’s check two new observations, one ok and one not.
new_data = pd.DataFrame([[False,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0], [True,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0]]) pred = clf.predict(new_data) anomalies = [i for i, x in enumerate(pred) if x == 1] print(anomalies)
The classifier found an anomaly, the second record (in Python array starts from 0).
This is just a little example, but it shows how these tools can be used to gain valuable info from data.
See you next time!