Applying standard data mining techniques to predictive modeling in machine learning.

Challenge: Predict survival on the Titanic and get familiar with ML basics


Machine Learning

Machine learning is a subfield of computer science that uses data and algorithms to imitate the way humans learn, think and improve. Because the models learn from examples, large amounts of data are a vital part of machine learning.


Making a Predictive Model

  • Acquire the data and remove outliers.
  • Inspect the data and decide between parametric and nonparametric predictive models.
  • Format the acquired data appropriately for the predictive model.
  • Train the model on a subset of the data chosen by the data scientist.
  • Calibrate the parameters, using the same subset.
  • Test the model's performance to measure its efficacy.
  • Validate the model on a newer subset of the data.
  • Once every test is passed, the model is ready.
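
The training, testing and validation subsets mentioned in these steps can be produced with a simple shuffled split. The following is a minimal sketch in plain Python rather than the notebook's actual code; the function name split_dataset, the 60/20/20 proportions and the stand-in rows are assumptions made for illustration.

```python
import random


def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and cut them into train / validation / test subsets.

    split_dataset and its default fractions are illustrative assumptions.
    """
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # whatever remains is held out for testing
    return train, valid, test


rows = list(range(100))                      # stand-in for 100 passenger records
train, valid, test = split_dataset(rows)
print(len(train), len(valid), len(test))
```

Every record lands in exactly one subset, so the model is never scored on rows it was trained on.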

Initialization/Loading Initial Libraries:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization

Reading data/Importing Dataset:

def read_data():
    """Load the Kaggle Titanic train and test CSV files as DataFrames."""
    train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
    test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
    print("train and test data imported")
    print("-" * 100)
    return train_data, test_data

Combining train and test data:

train_data, test_data = read_data()
# Keep both frames in one list so the same transformation can be applied to each in a loop.
combine = [train_data, test_data]

Feature engineering: visualizing the data to analyze complex patterns in the dataset:

# Render plots inside the notebook (matplotlib and seaborn are already imported above).
%matplotlib inline


Evaluation of Predictive Model

  • The model is evaluated with cross-validation: the data is split into smaller datasets for training, testing and validating the predictive model.
  • The model is first trained on the training dataset and then given the testing dataset to measure its performance. Lastly, the validation dataset is used to obtain an unbiased estimate of its accuracy.
  • Evaluation is a cycle that repeats whenever newer data becomes available. Every dataset is eventually used for testing: historical datasets become training data, while the current data becomes the validation data. Repeating this cycle on historical data increases the efficiency and accuracy of the predictive model.
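
To make the splitting concrete, here is a rough sketch of how k-fold cross-validation divides the row indices so that each row is used for testing exactly once. It is written in plain Python for illustration only; the helper name k_fold_indices and the sample count of 25 are assumptions, and the sketch is not scikit-learn's implementation.

```python
import random


def k_fold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once.

    k_fold_indices is a hypothetical helper written for this illustration.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=25, n_splits=10))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

The model is trained once per fold on train_idx and scored on test_idx, and the per-fold accuracies are then averaged.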

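Cross-validating models

I compared 5 popular predictive algorithms and evaluated the mean accuracy of each with a 10-fold cross-validation procedure. Note that the code below uses scikit-learn's shuffled KFold; a stratified split would use StratifiedKFold instead.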

  • KNN
  • Decision Tree
  • Random Forest
  • GaussianNB
  • SVM
  • K-Nearest Neighbors (KNN) is a machine-learning algorithm that can be used for both regression and classification tasks. It examines the labels of a chosen number of data points nearest to a target data point in order to predict the class that the target falls into.
  • A decision tree is a tree-like structure that models decisions and their probable outcomes. Its branches represent decision-making steps (conditional tests on the features), and following them from the root leads to a leaf that holds the prediction.
  • Random Forest combines many decision trees, each built independently of the others, and aggregates their outputs. It can be used for both classification and regression on large amounts of data.
  • Gaussian Naïve Bayes is a variant of naïve Bayes that assumes each feature follows a Gaussian (normal) distribution within each class. Other distributions can be used, but the Gaussian is the simplest to implement, as it only requires the mean and standard deviation of each feature in the training data.
  • Support Vector Machine (SVM) is a supervised learning technique, meaning it is trained on a labeled sample dataset. Under the hood, SVM searches for the hyperplane, possibly in a higher-dimensional space, that separates the classes with the widest margin; the training points closest to that hyperplane are the support vectors.
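
As an illustration of the first algorithm in the list, the sketch below implements the nearest-neighbor majority vote in plain Python. It is not the scikit-learn classifier used in this notebook; the function name knn_predict and the six toy passengers are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy data: (Pclass, Sex) pairs with Sex encoded as female=1, male=0.
train_points = [(1, 1), (2, 1), (1, 1), (3, 0), (3, 0), (2, 0)]
train_labels = [1, 1, 1, 0, 0, 0]    # 1 = survived, 0 = did not survive

first_class_woman = knn_predict(train_points, train_labels, query=(1, 1))
third_class_man = knn_predict(train_points, train_labels, query=(3, 0))
print(first_class_woman, third_class_man)
```

A larger k smooths the vote over more neighbors, which is the effect of the n_neighbors parameter in KNeighborsClassifier.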


from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
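
The KFold splitter above shuffles the rows without looking at the target. A stratified split, by contrast, keeps the ratio of survivors to non-survivors roughly equal in every fold. Below is a minimal plain-Python sketch of that idea, not scikit-learn's implementation; the function name stratified_fold_ids and the toy labels are assumptions for illustration, and shuffling is omitted to keep the example short.

```python
from collections import defaultdict


def stratified_fold_ids(labels, n_splits=5):
    """Assign a fold id to each sample so every fold keeps roughly the class ratio.

    stratified_fold_ids is a hypothetical helper written for this illustration.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    next_fold = 0
    for label in sorted(by_class):
        # Deal this class's samples out round-robin, continuing from where
        # the previous class stopped so fold sizes stay balanced.
        for idx in by_class[label]:
            fold_of[idx] = next_fold
            next_fold = (next_fold + 1) % n_splits
    return fold_of


labels = [0] * 12 + [1] * 8          # toy target: 12 non-survivors, 8 survivors
fold_of = stratified_fold_ids(labels, n_splits=4)
for fold in range(4):
    members = [labels[i] for i, f in enumerate(fold_of) if f == fold]
    print("fold", fold, "size:", len(members), "survivors:", sum(members))
```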


# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

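# Assumes train_data now holds only the prepared feature columns and
# target holds the Survived labels (neither step is shown here).
knn = KNeighborsClassifier(n_neighbors=15)
score = cross_val_score(knn, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')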

[0.82022472 0.81818182 0.77272727 0.71590909 0.80681818 0.78409091
 0.81818182 0.78409091 0.85227273 0.84090909]
__________________________________________________
 Mean Score of KNeighborsClassifier:  80.13

dtc = DecisionTreeClassifier()
score = cross_val_score(dtc, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')

[0.82022472 0.88636364 0.69318182 0.77272727 0.875 0.81818182
 0.80681818 0.76136364 0.78409091 0.80681818]
__________________________________________________
 Mean Score of DecisionTreeClassifier:  80.25

rfc = RandomForestClassifier(n_estimators=500, max_depth=6, random_state=1)
score = cross_val_score(rfc, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')

[0.84269663 0.875      0.75       0.71590909 0.85227273 0.78409091
 0.82954545 0.81818182 0.82954545 0.85227273]
__________________________________________________
 Mean Score of RandomForestClassifier:  81.5

gnb = GaussianNB()
score = cross_val_score(gnb, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')

[0.69662921 0.77272727 0.68181818 0.65909091 0.75       0.67045455
 0.72727273 0.52272727 0.82954545 0.79545455]
__________________________________________________
 Mean Score of GaussianNB:  71.06

svc = SVC()
score = cross_val_score(svc, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')

[0.78651685 0.79545455 0.76136364 0.64772727 0.82954545 0.76136364
 0.78409091 0.81818182 0.84090909 0.81818182]
__________________________________________________
 Mean Score of SVC:  78.43
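
The mean scores reported above can be recomputed directly from the listed fold accuracies. The short sketch below does this in plain Python using only the numbers printed in this section; the variable names are chosen for illustration.

```python
# Fold accuracies copied from the cross_val_score outputs listed above.
fold_scores = {
    "KNeighborsClassifier": [0.82022472, 0.81818182, 0.77272727, 0.71590909, 0.80681818,
                             0.78409091, 0.81818182, 0.78409091, 0.85227273, 0.84090909],
    "DecisionTreeClassifier": [0.82022472, 0.88636364, 0.69318182, 0.77272727, 0.875,
                               0.81818182, 0.80681818, 0.76136364, 0.78409091, 0.80681818],
    "RandomForestClassifier": [0.84269663, 0.875, 0.75, 0.71590909, 0.85227273,
                               0.78409091, 0.82954545, 0.81818182, 0.82954545, 0.85227273],
    "GaussianNB": [0.69662921, 0.77272727, 0.68181818, 0.65909091, 0.75,
                   0.67045455, 0.72727273, 0.52272727, 0.82954545, 0.79545455],
    "SVC": [0.78651685, 0.79545455, 0.76136364, 0.64772727, 0.82954545,
            0.76136364, 0.78409091, 0.81818182, 0.84090909, 0.81818182],
}

# Mean accuracy per model, expressed as a percentage rounded to two decimals.
mean_scores = {
    name: round(sum(scores) / len(scores) * 100, 2)
    for name, scores in fold_scores.items()
}

# Print the models from best to worst mean accuracy.
for name, mean in sorted(mean_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {mean}")

best_model = max(mean_scores, key=mean_scores.get)
```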

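Plotting the cross-validation scores:

# results is assumed to be a list of per-model records in which [0] is the
# model name, [2] the mean score and [3] the standard deviation (not built here).
algo, means, std = [], [], []
for result in results:
    algo.append(result[0])
    means.append(result[2])
    std.append(result[3])

cv_res = pd.DataFrame({"CrossValMeans": means, "CrossValerrors": std, "Algorithm": algo})

# Recent seaborn versions require x and y to be passed as keyword arguments.
model_plotlearning_curve = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, xerr=std)
model_plotlearning_curve.set_xlabel("Mean Accuracy")
model_plotlearning_curve.set_title("Cross validation scores")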

RandomForestClassifier has the highest mean score (81.5%), so this classifier is used for the final predictions.


Fitting and predicting:

# Fit the best-scoring model on the full training data, then predict on the test set.
rfc = RandomForestClassifier(n_estimators=500, max_depth=6, random_state=1)
rfc.fit(train_data, target)
# PassengerId is an identifier, not a feature, so it is dropped before predicting.
test_data1 = test_data.drop("PassengerId", axis=1).copy()
prediction = rfc.predict(test_data1)
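
The Titanic competition is scored on a two-column file with the headers PassengerId and Survived. As a rough sketch of that last step, the predictions could be written out as below. The helper name write_submission, the file name submission.csv and the three sample rows are assumptions for illustration; in the notebook the values would come from test_data["PassengerId"] and prediction.

```python
import csv
import io


def write_submission(passenger_ids, predictions, out_file):
    """Write PassengerId/Survived rows in the two-column submission layout.

    write_submission is a hypothetical helper written for this illustration.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in zip(passenger_ids, predictions):
        writer.writerow([passenger_id, int(survived)])


# Illustrative values only, not real model output.
sample_ids = [892, 893, 894]
sample_predictions = [0, 1, 0]

# An in-memory buffer stands in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
write_submission(sample_ids, sample_predictions, buffer)
print(buffer.getvalue())
```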


Contribution:

  • Initial version (Public_Score=0.77511)
    • Features = Pclass, Sex, SibSp, Parch
    • Passengers categorised by sex to compare survival rates
    • Survival rate of men: 0.18890814558058924
    • Survival rate of women: 0.7420382165605095
    • modelUsed = RandomForestClassifier
  • Final version (Public_Score=0.78468, Rank=1974/14244)
    • modelUsed = RandomForestClassifier
    • Used feature engineering to improve the score (contribution):
      • Name/Title: titles mapped (title_mapping) to 5 sub-categories: Mr, Miss, Mrs, Master, Rare
      • Pclass: split into [1st class, 2nd class, 3rd class]
      • Sex: astype(int) {'female': 1, 'male': 0}
      • Age: split into AgeBand with 5 sub-categories {<=16, >16, >32, >48, >64}
      • Fare: split into FareBand with 4 sub-categories {<=7.91, >7.91, >14.45, >31}; fillna using Pclass and the median Fare
      • Embarked: mapped to 3 sub-categories {'S': 0, 'C': 1, 'Q': 2}; fillna with max()
      • FamilySize = SibSp + Parch + 1; dropped
      • IsAlone: 0 by default, 1 when FamilySize == 1 (contribution)
      • Parch: dropped
      • SibSp: dropped
      • Ticket: dropped
      • Cabin: dropped
      • Outlier detection: 10 outliers identified (contribution)
      • final_data: all columns hold 0/1 values before modeling
      • n_estimators: increased from 100 to 500
      • Used various visualization techniques to identify direct and indirect relationships between features
      • Reduced the number of parameters
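
Several of the feature-engineering steps listed above can be sketched for a single passenger record. The code below is a plain-Python illustration, not the notebook's pandas code. The numeric title codes, the helper names and the choice to treat each band's upper edge as inclusive are assumptions; IsAlone is assumed to mean a family size of exactly 1; missing-value handling is left out.

```python
import re

TITLE_CODES = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}   # codes are assumed
SEX_CODES = {"female": 1, "male": 0}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}


def extract_title(name):
    """Return the title found between the comma and the first period of a name."""
    match = re.search(r",\s*([^.]+)\.", name)
    title = match.group(1).strip() if match else "Rare"
    return title if title in TITLE_CODES else "Rare"   # anything unusual becomes Rare


def band(value, upper_edges):
    """Return the index of the first band whose upper edge is >= value."""
    for index, edge in enumerate(upper_edges):
        if value <= edge:
            return index
    return len(upper_edges)                            # above the last edge


def engineer(passenger):
    """Turn one raw passenger record into the engineered features."""
    family_size = passenger["SibSp"] + passenger["Parch"] + 1
    return {
        "Title": TITLE_CODES[extract_title(passenger["Name"])],
        "Pclass": passenger["Pclass"],
        "Sex": SEX_CODES[passenger["Sex"]],
        "AgeBand": band(passenger["Age"], [16, 32, 48, 64]),
        "FareBand": band(passenger["Fare"], [7.91, 14.45, 31]),
        "Embarked": EMBARKED_CODES[passenger["Embarked"]],
        "IsAlone": 1 if family_size == 1 else 0,
    }


# Sample record used only to exercise the functions.
sample = {
    "Name": "Braund, Mr. Owen Harris", "Pclass": 3, "Sex": "male", "Age": 22,
    "SibSp": 1, "Parch": 0, "Fare": 7.25, "Embarked": "S",
}
features = engineer(sample)
print(features)
```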

Titanic - Machine Learning from Disaster,
Kaggle competition Jupyter Notebook: (Public_Score=0.78468, Rank=1974/14244)