An industry-standard walkthrough of data mining techniques for predictive modeling in machine learning.
Challenge: Predict survival on the Titanic and get familiar with ML basics
Machine Learning
Machine learning is a subfield of computer science. It uses data and algorithms to imitate the way humans learn, think, and improve, which is why big data is a vital part of machine learning.
Making a Predictive Model
- Acquire the data and remove outliers.
- Examine the data and decide between a parametric or a nonparametric predictive model.
- Format the acquired data appropriately for the predictive model.
- Train the model; the data scientist chooses a subset of the data for this.
- Calibrate the parameters, using the same selected subset.
- Test the model's efficacy with predictive-performance testing.
- Validate the model on a newer, held-out slice of the data.
- Once every test is cleared, the model is ready. (A minimal sketch of this workflow is shown below.)
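A minimal sketch of the workflow above, assuming a generic tabular dataset with a numeric column "feature" and a target column "label" (both hypothetical), using scikit-learn:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                                   # acquire the data
low, high = df["feature"].quantile([0.01, 0.99])               # outlier bounds
df = df[df["feature"].between(low, high)]                      # remove outliers
X, y = df.drop("label", axis=1), df["label"]                   # format features and target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)                       # choose a training subset
model = RandomForestClassifier(n_estimators=100, max_depth=5)  # calibrated parameters
model.fit(X_train, y_train)                                    # train the model
print(accuracy_score(y_test, model.predict(X_test)))           # test efficacy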
Initialization/Loading Initial Libraries:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
Reading data/Importing Dataset:
def read_data():
    train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
    test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
    print("train and test data imported")
    print("-"*100)
    return train_data, test_data
Combining train and test data:
train_data,test_data=read_data()
combine=[train_data,test_data]
Feature engineering lets us visualize the data and analyze complex patterns in the dataset; a minimal sketch follows.
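One illustrative feature engineering step on the combined data, assuming the standard Titanic columns (Name, Sex, Survived); the Title feature here is an example, not the author's full pipeline:
# iterate over the combined list so both train and test are transformed
for dataset in combine:
    # extract the honorific (Mr, Mrs, Miss, ...) from the Name column
    dataset["Title"] = dataset["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    # encode Sex as a numeric feature
    dataset["Sex"] = dataset["Sex"].map({"male": 0, "female": 1})

# visualize survival rate by title on the training data
sns.barplot(x="Title", y="Survived", data=train_data)
plt.show()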


Evaluation of Predictive Model
- Cross-validation methods are used to evaluate the model. The data is split into smaller datasets for training, testing, and validating the predictive model.
- The model is trained on the training dataset, then its performance is measured on the testing dataset. Finally, the validation dataset gives an unbiased estimate of accuracy.
- Evaluation is a cycle that repeats whenever newer data becomes available: historical datasets become training data, while current data becomes validation data. Repeating this cycle increases the efficiency and accuracy of the predictive model. A minimal sketch of such a three-way split follows this list.
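A minimal sketch of the train/test/validation split described above, using synthetic data in place of the preprocessed Titanic features (the 60/20/20 proportions are an assumption):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in dataset
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)
# train on X_train, measure performance on X_test,
# and keep X_val for a final, unbiased accuracy estimate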
Cross-validate models
I compared 5 popular predictive algorithms and evaluated the mean accuracy of each with a k-fold cross-validation procedure:
- KNN
- Decision Tree
- Random Forest
- GaussianNB
- SVM
- K-Nearest Neighbors is a machine learning algorithm that can be used for both regression and classification tasks. It examines the labels of a chosen number of data points surrounding a target data point in order to predict the class that point falls into.
- A decision tree is a support tool with a tree-like structure that models probable outcomes, resource costs, utilities, and possible consequences. Decision trees provide a way to present algorithms with conditional control statements; their branches represent decision-making steps that can lead to a favorable result.
- Random Forest: this algorithm combines many unrelated decision trees and can use both classification and regression to handle vast amounts of data.
- Gaussian Naïve Bayes is an extension of naïve Bayes. While other distributions can be used to estimate the data, the Gaussian (normal) distribution is the simplest to implement, since you only need the mean and standard deviation of the training data (see the sketch after this list).
- Support Vector Machines (SVM) is a supervised learning technique, as it is trained on a sample dataset. Under the hood, SVM finds the hyperplane, positioned by the support vectors, across which to divide the data into different classes.
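A minimal by-hand sketch of the Gaussian Naïve Bayes idea for a single feature, as described above; the numbers are illustrative, not drawn from the Titanic data:
import numpy as np

x_class0 = np.array([1.0, 1.2, 0.9])   # training values observed for class 0
x_class1 = np.array([3.0, 2.8, 3.3])   # training values observed for class 1

def gaussian_pdf(x, mu, sigma):
    # likelihood of x under a normal distribution with mean mu and std sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_new = 2.9
for label, sample in [(0, x_class0), (1, x_class1)]:
    mu, sigma = sample.mean(), sample.std()
    # pick the class with the higher likelihood for x_new
    print(label, gaussian_pdf(x_new, mu, sigma))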
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# target is assumed to be the Survived column, with train_data holding the
# preprocessed (all-numeric) features after Survived is dropped
target = train_data["Survived"]
train_data = train_data.drop("Survived", axis=1)
knn = KNeighborsClassifier(n_neighbors=15)
score = cross_val_score(knn, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')
print(score)
print("_"*50)
print("Mean Score of KNeighborsClassifier:", round(np.mean(score)*100, 2))
[0.82022472 0.81818182 0.77272727 0.71590909 0.80681818 0.78409091
0.81818182 0.78409091 0.85227273 0.84090909]
__________________________________________________
Mean Score of KNeighborsClassifier: 80.13
dtc = DecisionTreeClassifier()
score = cross_val_score(dtc, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')
print(score)
print("_"*50)
print("Mean Score of DecisionTreeClassifier:", round(np.mean(score)*100, 2))
[0.82022472 0.88636364 0.69318182 0.77272727 0.875 0.81818182
0.80681818 0.76136364 0.78409091 0.80681818]
__________________________________________________
Mean Score of DecisionTreeClassifier: 80.25
rfc = RandomForestClassifier(n_estimators=500, max_depth=6, random_state=1)
score = cross_val_score(rfc, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')
print(score)
print("_"*50)
print("Mean Score of RandomForestClassifier:", round(np.mean(score)*100, 2))
[0.84269663 0.875 0.75 0.71590909 0.85227273 0.78409091
0.82954545 0.81818182 0.82954545 0.85227273]
__________________________________________________
Mean Score of RandomForestClassifier: 81.5
gnb = GaussianNB()
score = cross_val_score(gnb, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')
print(score)
print("_"*50)
print("Mean Score of GaussianNB:", round(np.mean(score)*100, 2))
[0.69662921 0.77272727 0.68181818 0.65909091 0.75 0.67045455
0.72727273 0.52272727 0.82954545 0.79545455]
__________________________________________________
Mean Score of GaussianNB: 71.06
svc = SVC()
score = cross_val_score(svc, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')
print(score)
print("_"*50)
print("Mean Score of SVC:", round(np.mean(score)*100, 2))
[0.78651685 0.79545455 0.76136364 0.64772727 0.82954545 0.76136364
0.78409091 0.81818182 0.84090909 0.81818182]
__________________________________________________
Mean Score of SVC: 78.43
Plot cross-validation scores:
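The plotting code below reads from a results list holding, for each algorithm, its name, its fold scores, their mean, and their standard deviation; a minimal sketch of assembling it (the tuple layout is inferred from the indexing below):
classifiers = [("KNN", knn), ("DecisionTree", dtc), ("RandomForest", rfc),
               ("GaussianNB", gnb), ("SVC", svc)]
results = []
for name, clf in classifiers:
    scores = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring='accuracy')
    results.append((name, scores, scores.mean(), scores.std()))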
algo, means, std = [], [], []
for i in range(len(results)):
    algo.append(results[i][0])
    means.append(results[i][2])
    std.append(results[i][3])
cv_res = pd.DataFrame({"CrossValMeans": means, "CrossValerrors": std, "Algorithm": algo})
model_plotlearning_curve = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, **{'xerr': std})
model_plotlearning_curve.set_xlabel("Mean Accuracy")
model_plotlearning_curve.set_title("Cross validation scores")
plt.show()

We see that RandomForestClassifier has the highest mean score (>80%), so we will use this classifier for prediction.
Fitting the model and predicting:
rfc = RandomForestClassifier(n_estimators=500, max_depth=6, random_state=1)
rfc.fit(train_data, target)
# test_data is assumed to have gone through the same preprocessing as train_data
test_data1 = test_data.drop("PassengerId", axis=1).copy()
prediction = rfc.predict(test_data1)
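As a usage note, the predictions can then be packaged in Kaggle's standard PassengerId/Survived submission format (a minimal sketch; the output filename is an assumption):
submission = pd.DataFrame({"PassengerId": test_data["PassengerId"],
                           "Survived": prediction})
submission.to_csv("submission.csv", index=False)  # file to upload to Kaggle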