RamyaKoyaC/MLWorkshop1

A short tutorial to get hands-on with basic Machine Learning Algorithms

Agenda

  • What is Machine Learning?
  • Why do we need it?
  • How to implement it?
  • Getting Hands-On!

What is Machine Learning?

  • A science that helps computers learn much like humans do. We learn from past experiences; a machine learns from data, which are records of past experiences in an application domain.
  • We feed data into the machine, apply algorithms, and train the machine on that data before trying it on test data.
  • The training data helps the machine learn, and the test data assesses the learning capability of the model.
  • Once the machine produces good results, machine learning is finally achieved!

Types of Machine Learning

  • Supervised
  • Unsupervised
  • Reinforcement
knitr::include_graphics("/storage/scratch2/rk0349/workshop1.png")

Supervised ML

  • Here, a function maps inputs to outputs based on example input-output pairs.
  • Be aware of the target function.
  • Choose proper predictors that can define the target function.

A few examples:

  • Regression: to predict a continuous-valued output (stock prices, for example, vary over a continuous range)
  • Classification: to predict a discrete-valued output (an HIV test is either positive or negative)

Our Hands-On Exercise has Linear Regression, Decision Trees, and Random Forests.

Unsupervised ML

In this type of machine learning, we are given datasets with unlabeled responses and have to draw insights from this bundle of data.

  • It is vital to explore this kind of data and try to uncover hidden patterns, groups, and structure.
  • One of the most widely used unsupervised machine learning techniques is cluster analysis, sketched below.
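Clustering is not part of the hands-on exercises below, but as a minimal illustration, k-means on R's built-in iris measurements groups similar flowers without ever seeing their labels:

data(iris)
km <- kmeans(iris[, 1:4], centers = 3)  # ask for 3 clusters from the 4 numeric measurements
table(km$cluster, iris$Species)         # compare the found clusters to the known species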

Reinforcement ML

  • Here, an agent interacts with its environment and receives a certain reward for each action performed.
  • state-action-reward
  • The goal is to maximize the expected reward.
  • The hand-heat-reward example: touching something hot earns a negative reward, so the agent learns not to repeat the action. A toy sketch follows this list.
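As a toy sketch (not one of the workshop exercises), here is a two-action agent in R that estimates each action's expected reward from experience and learns to prefer the better one; the reward probabilities are made-up values for illustration:

set.seed(1)
true_reward <- c(a1 = 0.3, a2 = 0.7)  # hidden expected reward of each action (assumed)
estimate <- c(a1 = 0, a2 = 0)         # the agent's running reward estimates
count <- c(a1 = 0, a2 = 0)            # how often each action was tried
for (step in 1:500) {
  # mostly exploit the best-looking action, occasionally explore at random
  act <- if (runif(1) < 0.1) sample(1:2, 1) else which.max(estimate)
  reward <- rbinom(1, 1, true_reward[act])  # environment returns a reward
  count[act] <- count[act] + 1
  estimate[act] <- estimate[act] + (reward - estimate[act]) / count[act]
}
estimate  # should approach the true expected rewards, so the agent prefers a2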

Getting Hands-On

Linear Regression

A modelling method used to forecast a target value from independent predictors, which helps in quantifying the relationship between the target and the predictors. If that relationship is linear, the model is called a linear regression.

In this example, we consider a dataset of temperatures recorded during World War 2.
Target - Mean Temperature
Predictor - Min Temperature

Here we are fitting a linear regression model:

temperature <- read.csv("/storage/scratch2/rk0349/Summary of Weather.csv")
fit <- lm(MeanTemp ~ MinTemp, data = temperature)  # mean temperature as a linear function of min temperature

Target = (Slope * Predictor) + Intercept

summary(fit)
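The slope and intercept reported in the summary can also be pulled out programmatically from the fit object defined above:

coef(fit)  # named vector holding the intercept and the slope for MinTemp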

Now we make predictions for the target variable using the fitted linear regression equation:

predictions <- predict(fit,temperature)
head(predictions)

The mean squared error:

mse <- mean((temperature$MeanTemp - predictions)^2)
print(mse)
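Since the MSE is in squared units, taking its square root gives an error in the same units as the temperature itself; a one-line follow-up using the mse value above:

rmse <- sqrt(mse)  # root mean squared error, in the target's own units
print(rmse)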

Plotting the predictor vs. the target

plot(temperature$MinTemp, temperature$MeanTemp)

Plotting the obtained predictions against the target

plot(predictions, temperature$MeanTemp)
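A 45-degree reference line makes it easier to judge how close the predictions are to the actual values; a small optional addition to the plot above:

abline(a = 0, b = 1, col = "red")  # points on this line are perfectly predicted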

Decision Trees

  • One of the most popular machine learning techniques used for decision making is the decision tree.
  • Decisions are made by checking conditions, and a branch is drawn for each outcome.
  • The tree is upside down, as the root is the beginning of the tree-like structure.

knitr::include_graphics("/storage/scratch2/rk0349/decision1.png")

This breast cancer dataset has 1 million instances and 15 columns of information.

# Decision Tree 
cancer_data <- read.csv("/storage/scratch2/share/DSA/clean_BayesianNetworkGenerator_breast-cancer_small.csv")
cancer_data <- cancer_data[1:100,] # selecting only the first 100 rows 
cancer_data[] <- lapply(cancer_data, factor) # converting every column to categorical data
sapply(cancer_data, is.factor) # confirm that each column is now a factor

To see the overall summary of the dataset

summary(cancer_data)# printing out the summary
head(cancer_data)

Splitting the data into train and test sets

set.seed(100) 
train <- sample(nrow(cancer_data), 0.75*nrow(cancer_data)) # splitting the data into train and test sets
train_set <- cancer_data[train,]
test_set <- cancer_data[-train,]

Train set

summary(train_set)

Test set

summary(test_set) # viewing the model's test set
library(partykit) # for plotting the tree
library(rpart) # for building the tree
cols <- c('Class', 'age', 'tumor.size')
cancer_data[cols] <- lapply(cancer_data[cols], as.factor) # ensure the key columns are factors
set.seed(1)
train <- sample(1:nrow(cancer_data), 0.75 * nrow(cancer_data)) # re-draw the 75% training index

Building a decision tree model with age as the target. The training data is 75% of the rows and the test data is the remaining 25%.

cancerTree <- rpart(age ~ ., data = cancer_data[train, ], method = 'class')

Decision Tree

plot(as.party(cancerTree))
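The workshop stops at visualizing the tree, but as an optional check, here is a sketch of scoring the held-out 25% with the objects defined above:

tree_pred <- predict(cancerTree, cancer_data[-train, ], type = "class")
mean(tree_pred == cancer_data[-train, "age"]) # share of test rows classified correctly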

Random Forests

Loading the randomForest library

library(randomForest)

cancer_data <- read.csv("/storage/scratch2/share/DSA/clean_BayesianNetworkGenerator_breast-cancer_small.csv")
cancer_data <- cancer_data[1:500,] # selecting only the first 500 rows 
cancer_data[] <- lapply(cancer_data, factor) # converting every column to categorical data
sapply(cancer_data, is.factor) # confirm that each column is now a factor

Splitting data and building model

set.seed(100) 
train <- sample(nrow(cancer_data), 0.7*nrow(cancer_data)) # splitting the data into train and test sets
train_set <- cancer_data[train,]
test_set <- cancer_data[-train,]
model1 <- randomForest(Class ~ ., data = train_set, ntree = 500, mtry = 6, importance = TRUE) # building the model
model1 # viewing the built model, including its out-of-bag confusion matrix
plot(model1) # plotting the number of trees vs. error
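The curves in that plot come from the error rates stored in the model; the final out-of-bag (OOB) error after all 500 trees can also be read off numerically:

tail(model1$err.rate, 1) # last row: final OOB error plus per-class error rates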

Working with the predicted values

pred_train <- predict(model1, train_set, type = "class") 
mean(pred_train == train_set$Class) # accuracy of model1 on the train set
table(pred_train, train_set$Class) # confusion matrix for the train set
pred_test <- predict(model1, test_set, type = "class")
mean(pred_test == test_set$Class) # accuracy of model1 on the test set
table(pred_test, test_set$Class) # confusion matrix for the test set

To find the variable importance

importance(model1) # variable importance for model1

Variable Importance Plot

varImpPlot(model1) # variable importance plot for model1

To find the test accuracy for different values of mtry

a <- c()
for (i in 3:8) {
  model3 <- randomForest(Class ~ ., data = train_set, ntree = 500, mtry = i, importance = TRUE)
  pred_test <- predict(model3, test_set, type = "class")
  a[i-2] <- mean(pred_test == test_set$Class) # test accuracy for this mtry
}
a

Plotting accuracy vs. mtry

plot(3:8,a)
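To read off which mtry value did best on the test set, a one-line follow-up using the accuracy vector a from the loop above:

(3:8)[which.max(a)] # mtry value with the highest test accuracy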

IML (Interpretable Machine Learning)

About the dataset: Housing Values in Suburbs of Boston. The Boston data frame has 506 rows and 14 columns:

  • crim: per capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox: nitrogen oxides concentration (parts per 10 million)
  • rm: average number of rooms per dwelling
  • age: proportion of owner-occupied units built prior to 1940
  • dis: weighted mean of distances to five Boston employment centres
  • rad: index of accessibility to radial highways
  • tax: full-value property-tax rate per $10,000
  • ptratio: pupil-teacher ratio by town
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat: lower status of the population (percent)
  • medv: median value of owner-occupied homes in $1000s


library("mlr")
data("Boston", package  = "MASS")

# create an mlr task and model
tsk = makeRegrTask(data = Boston, target = "medv")
lrn = makeLearner("regr.randomForest", ntree = 100)
mod = train(lrn, tsk)
library("iml")
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(mod, data = X, y = Boston$medv)
imp = FeatureImp$new(predictor, loss = "mae")

Variable Importance Plot

plot(imp)

Plotting the partial dependence of the predictions on the most important feature, lstat

pdp.obj = Partial$new(predictor, feature = "lstat")
plot(pdp.obj)

Growing a surrogate decision tree of depth 2 that approximates the model's predictions

tree = TreeSurrogate$new(predictor, maxdepth = 2)
plot(tree)
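Beyond global importance, partial dependence, and surrogate trees, the iml package can also explain single predictions with Shapley values; a short optional extension using the predictor object from above:

shapley = Shapley$new(predictor, x.interest = X[1, ]) # explain the prediction for the first house
plot(shapley) # each feature's contribution to that one prediction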

Thank you for your time!

  • Reach out to us at: hpc-admin@unt.edu
  • Find the code on GitHub
