The data is formed from:

- Item_Identifier: Unique product ID
- Item_Weight: Weight of product
- Item_Fat_Content: Whether the product is low fat or not
- Item_Visibility: The % of total display area of all products in a store allocated to the particular product
- Item_Type: The category to which the product belongs
- Item_MRP: Maximum Retail Price (list price) of the product
- Outlet_Identifier: Unique store ID
- Outlet_Establishment_Year: The year in which the store was established
- Outlet_Size: The size of the store in terms of ground area covered
- Outlet_Location_Type: The type of city in which the store is located
- Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket
- Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
```

In [27]:

```
# merge the training data and test data in one data frame in order to operate on both simultaneously
train = pd.read_csv('/Users/s7c/Documents/Train.csv')
test = pd.read_csv('/Users/s7c/Documents/Test.csv')
df = pd.concat([train, test] , ignore_index = True, sort = True)
```
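The split back into train and test later in the notebook relies on the fact that the test file has no `Item_Outlet_Sales` column, so those rows carry `NaN` after the concat. A minimal sketch on toy stand-in frames (hypothetical values, not the real CSVs):

```python
import pandas as pd

# Toy frames standing in for Train.csv / Test.csv: the test frame has no
# Item_Outlet_Sales column, so after concat those rows hold NaN there and
# can be recovered later with notnull()/isnull().
toy_train = pd.DataFrame({'Item_MRP': [100.0, 150.0],
                          'Item_Outlet_Sales': [2500.0, 3100.0]})
toy_test = pd.DataFrame({'Item_MRP': [120.0]})

toy_df = pd.concat([toy_train, toy_test], ignore_index=True, sort=True)
back_train = toy_df[toy_df['Item_Outlet_Sales'].notnull()]
back_test = toy_df[toy_df['Item_Outlet_Sales'].isnull()]
print(len(back_train), len(back_test))  # → 2 1
```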

In [4]:

```
# get to know our data
df.info()
df.head()
```

Out[4]:

In [5]:

```
# how many unique values each variable has is interesting to know
df.apply(lambda x : len(x.unique()))
```

Out[5]:

What are the relationships between the individual features, and between the features and the dependent variable?

In [6]:

```
corr = df.corr(numeric_only=True)  # restrict to numeric columns
# Heatmaps show the strength of the relationships between the data features.
sns.heatmap(corr, annot=True)
plt.xticks(rotation=45)
```

Out[6]:

We saw previously that two columns have missing values; let's find the percentages:

In [7]:

```
percent_missing = df[['Item_Weight', 'Outlet_Size']].isnull().sum() * 100 / len(df)
percent_missing
```

Out[7]:

In [28]:

```
# fill the missing values in Item_Weight with the mean weight of the same Item_Type
df["Item_Weight"] = df.groupby("Item_Type")['Item_Weight'].transform(lambda x: x.fillna(x.mean()))
# and fill the missing values in Outlet_Size with the most common size for the same Outlet_Type
df["Outlet_Size"] = df.groupby("Outlet_Type")['Outlet_Size'].transform(lambda x: x.fillna(x.mode().iloc[0]))
```
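A minimal sketch of the group-wise imputation idea on toy data (hypothetical values): each missing weight is replaced by the mean weight of its own `Item_Type`, using `transform` so the result aligns with the original index.

```python
import numpy as np
import pandas as pd

# Toy column with one gap: the Dairy NaN gets the Dairy mean (10.0),
# the Meat rows are untouched.
toy = pd.DataFrame({'Item_Type': ['Dairy', 'Dairy', 'Meat', 'Meat'],
                    'Item_Weight': [10.0, np.nan, 6.0, 8.0]})
toy['Item_Weight'] = (toy.groupby('Item_Type')['Item_Weight']
                         .transform(lambda x: x.fillna(x.mean())))
print(toy['Item_Weight'].tolist())  # → [10.0, 10.0, 6.0, 8.0]
```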

Let's draw some plots and see what we find:

In [10]:

```
plt.hist(df.Item_Weight, bins=50)
plt.show()
# the most common weights fall between 12.50 and 13.50
plt.hist(df['Item_Visibility'], bins=30)
plt.show()
# visibility is expressed as a fraction of display area; most items sit around 3%
```

One of the most useful plots is the boxplot. The central box of the boxplot represents the middle 50% of the observations, the central bar is the median, and the bars at the end of the dotted lines (whiskers) encapsulate the great majority of the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers.

One of the most useful features of a boxplot is the ability to make side-by-side boxplots. A side-by-side boxplot takes a numeric variable and splits it based on some categorical variable, drawing a different boxplot for each level of the categorical variable.

In [11]:

```
df.boxplot(column="Item_Outlet_Sales",   # column to plot
           by="Outlet_Location_Type",    # column to split upon
           figsize=(8, 8))
# Tier 3 has a wider range of sales and higher outliers, but that's about it.
df.boxplot(column="Item_Outlet_Sales",
           by="Outlet_Size",
           figsize=(8, 8))
# Curiously, there is not much discrepancy between outlet sizes; even stranger,
# the medium-size outlets stand out rather than the large ones.
df.boxplot(column="Item_Outlet_Sales",
           by="Outlet_Type",
           figsize=(8, 8))
# Sales are by far the highest at Supermarket Type 3, but what's interesting
# is that Type 1 has a lot of outlier sales.
```

Out[11]:

In [18]:

```
df.Item_Fat_Content.value_counts()
```

Out[18]:

In [29]:

```
# we have 5 labels but only 2 distinct meanings; let's fix that:
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat',
                                                         'low fat': 'Low Fat',
                                                         'reg': 'Regular'})
print(df['Item_Fat_Content'].value_counts())
```

Furthermore, we can see that Item_Visibility has values of 0. How is it possible that a product is not visible at all, considering that we are talking about products that are on the shelf?

In [30]:

```
# we can replace those zero values with the column average
def replace(x):
    if x == 0:
        return df.Item_Visibility.mean()
    else:
        return x

df.Item_Visibility = df.Item_Visibility.apply(replace)
```

If we take a look at Item_Type, there are far too many categories, which makes things harder for a prediction algorithm. That needs to change:

In [31]:

```
def change(x):
    if x in ['Fruits and Vegetables', 'Snack Food', 'Frozen Food', 'Dairy', 'Canned',
             'Baking Goods', 'Meat', 'Breads', 'Starchy Foods', 'Breakfast', 'Seafood']:
        return 'Food'
    elif x in ['Household', 'Health and Hygiene', 'Others']:  # fixed: `x == [...]` was never True
        return 'Non-Consumable'
    else:
        return 'Drinks'

df.Item_Type = df.Item_Type.apply(change)
```

In [21]:

```
df.Outlet_Establishment_Year.value_counts()
```

Out[21]:

In [32]:

```
# drop the item and outlet identifiers; they don't help the algorithm
df = df.drop(['Item_Identifier', 'Outlet_Identifier'], axis = 1)
```

We have arrived at the prediction part of the problem. This is a supervised machine learning problem. Many machine learning algorithms require numerical input, so categorical features must be transformed into numerical features before we can use any of these algorithms. Following that idea, we need to create dummy variables for the categorical columns.

In [33]:

```
catg = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Item_Type','Outlet_Type']
```

In [34]:

```
le = LabelEncoder()
for i in catg:
    df[i] = le.fit_transform(df[i])
```

One-hot encoding refers to creating dummy variables, one for each category of a categorical variable. For example, after cleaning, Item_Fat_Content has 2 categories – ‘Low Fat’ and ‘Regular’. One-hot encoding removes this variable and generates 2 new binary variables, each holding 0 (if the category is not present) or 1 (if the category is present). This can be done using the ‘get_dummies’ function of Pandas.

In [35]:

```
df = pd.get_dummies(df, columns=['Item_Fat_Content', 'Outlet_Location_Type', 'Outlet_Size',
                                 'Outlet_Type', 'Item_Type'])
```

In [36]:

```
# regroup
train = df[df['Item_Outlet_Sales'].notnull()]
test = df[df['Item_Outlet_Sales'].isnull()]
```

Feature scaling is a method used to normalize the range of independent variables or features of data:

In [37]:

```
test = test.drop('Item_Outlet_Sales', axis=1)
y = train.Item_Outlet_Sales
train = train.drop('Item_Outlet_Sales', axis=1)
train_list = list(train.columns)
slc = StandardScaler()
train_sc = slc.fit_transform(train)
test_sc = slc.transform(test)  # reuse the scaler fitted on train; do not refit on test
```

Principal Component Analysis (PCA) is a tool that helps produce better representations of high-dimensional data. We look at how much variance each direction in the data carries and drop the directions with low variance compared to the others, on the reasoning that, as mentioned above, low-variance directions contribute little to the target variable.

The first principal component is the linear combination of the original predictor variables that captures the maximum variance in the data set; it determines the direction of highest variability in the data. The larger the variability captured by the first component, the more information it carries, and no other component can have higher variability than the first. Geometrically, the first principal component gives the line closest to the data, i.e. the one that minimizes the sum of squared distances between the data points and the line.

In [38]:

```
pca = PCA(n_components=None)
train = pca.fit_transform(train_sc)
test = pca.transform(test_sc)  # reuse the components fitted on train; do not refit on test
# Plotting the cumulative sum of the explained variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Dataset Explained Variance')
plt.show()
```

In [39]:

```
# the graph tells us that with 10 components we preserve around 90% of the variance
pca = PCA(n_components=10)
train = pca.fit_transform(train_sc)
test = pca.transform(test_sc)
```

First we need to train our model so that it learns the relationships between the features and the target. The next step is figuring out how good the model is. To do this we make predictions on held-out features (the model is never allowed to see the test answers) and compare the predictions to the known answers. When performing regression we use the absolute error, because some predictions will be too low and some too high; taking the absolute value tells us how far, on average, our predictions are from the actual values.
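Since the real test labels are unknown here, one hedged way to apply this idea is cross-validated mean absolute error on the training data. A minimal sketch on synthetic stand-in arrays (hypothetical data, just to show the mechanics):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared train / y arrays.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 10)
y_demo = X_demo @ rng.rand(10) * 1000 + rng.randn(200) * 50

# scikit-learn reports negated errors, so flip the sign for reporting.
mae_scores = -cross_val_score(LinearRegression(), X_demo, y_demo,
                              scoring='neg_mean_absolute_error', cv=5)
print('MAE per fold:', mae_scores)
print('Average MAE:', mae_scores.mean())
```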

Multiple linear regression model: linear regression with multiple predictor variables.

In [40]:

```
ln = LinearRegression()
ln.fit(train, y)
# the coefficients and intercept are the parameters of the fitted line
print('Coefficients: ', ln.coef_)
# predict
y_predicted = ln.predict(test)
```

Polynomial regression. The PolynomialFeatures() function in the scikit-learn library derives a new feature set from the original features. Polynomial regression is considered a special case of traditional multiple linear regression, so you can use the same mechanism as linear regression to solve such problems.

In [41]:

```
# PolynomialFeatures and cross_val_score were already imported above
poly = PolynomialFeatures(degree=4)  # convert the X matrix into a polynomial feature matrix
train_poly = poly.fit_transform(train)
regressor_poly = LinearRegression()
regressor_poly.fit(train_poly, y)
```

Out[41]:

In [42]:

```
# Let us check the fit: for a regressor, cross_val_score reports R^2 by default.
# Note that the polynomial model must be scored on the polynomial features.
accuracy = cross_val_score(estimator=regressor_poly, X=train_poly, y=y, cv=10)
print(f"The mean R^2 of the Polynomial Regression Model is \t {accuracy.mean()}")
print(f"The deviation across folds is \t {accuracy.std()}")
```

Random forest. A random forest is an ensemble learning method that builds a large number of randomized decision trees and averages their predictions. This usually improves accuracy over a single tree and helps the model cope with complex data.

In [47]:

```
# Instantiate the model with 100 decision trees (RandomForestRegressor was imported above)
rf = RandomForestRegressor(n_estimators=100, random_state=21)
# Train the model on the training data
rf.fit(train, y)
predictions = rf.predict(test)
# Get numerical feature importances; note that after PCA the features are
# principal components, not the original columns in train_list
importances = list(rf.feature_importances_)
feature_names = ['PC{}'.format(i + 1) for i in range(len(importances))]
# List of tuples with component and importance, sorted by most important first
feature_importances = sorted(zip(feature_names, [round(imp, 2) for imp in importances]),
                             key=lambda x: x[1], reverse=True)
# Print out the components and importances
for pair in feature_importances:
    print('Component: {:10} Importance: {}'.format(*pair))
```
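As a closing sanity check, the same cross-validation idea gives a comparable error metric for the forest, putting the imported `mean_squared_error` notion to use as a cross-validated RMSE. A minimal sketch on synthetic stand-in data (hypothetical, since the real test labels are unknown):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared train / y arrays.
rng = np.random.RandomState(21)
X_demo = rng.rand(150, 10)
y_demo = 500 * X_demo[:, 0] + 200 * X_demo[:, 1] + rng.randn(150) * 10

# Cross-validated RMSE: negated MSE scores come back from scikit-learn,
# so flip the sign before taking the square root.
rf_demo = RandomForestRegressor(n_estimators=50, random_state=21)
rmse = np.sqrt(-cross_val_score(rf_demo, X_demo, y_demo,
                                scoring='neg_mean_squared_error', cv=5))
print('RMSE per fold:', rmse)
print('Average RMSE:', rmse.mean())
```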