- What is Machine Learning?
- Why do we need it?
- How to implement it?
- Getting Hands-On!
- A science that helps computers learn the way humans do.
- We learn from past experiences; a machine learns from data, which are records of past experiences in an application domain.
- We feed data to the machine, apply algorithms, train the machine on that data, and then evaluate it on test data.
- The training data helps the machine learn, and the test data measures how well the model has learned.
- Once the machine produces good results on the test data, machine learning has been achieved!
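The train-then-test workflow described above can be sketched in a few lines of R. This is only an illustration: it assumes the built-in `mtcars` data, a 75/25 split, and a simple linear model, none of which come from the workshop material.

```r
# Minimal train/test workflow on the built-in mtcars data (illustrative only).
set.seed(42)                                    # make the random split reproducible
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.75 * n))  # 75% of the rows for training
demo_train <- mtcars[train_idx, ]
demo_test  <- mtcars[-train_idx, ]

# Train: learn mpg from weight using only the training rows
demo_model <- lm(mpg ~ wt, data = demo_train)

# Test: predict on rows the model has not seen and measure the error
demo_pred <- predict(demo_model, newdata = demo_test)
demo_mse  <- mean((demo_test$mpg - demo_pred)^2)
print(demo_mse)
```

A low error on `demo_test` suggests the model generalises beyond the rows it was trained on.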
- Supervised
- Unsupervised
- Reinforcement
knitr::include_graphics("/storage/scratch2/rk0349/workshop1.png")
- In this, a function maps its input to the output based on example input-output pairs.
- Be aware of the target function
- Consider proper predictors that can define the target function
A few examples are:
- Regression: finds a continuous-valued output (stock prices vary over continuous values)
- Classification: finds a discrete-valued output (an HIV test is either positive or negative)
Our Hands-On Exercise has Linear Regression, Decision Trees, and Random Forests.
In this type of machine learning, we are given datasets with unlabeled responses and have to draw insights from the data.
- It is vital to explore this type of data and look for hidden patterns, groups, or sets in it.
- One well-known unsupervised machine learning technique is cluster analysis.
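As a small illustration of cluster analysis, the sketch below runs k-means on the built-in `iris` measurements while ignoring the species label. The choice of dataset, of k-means as the algorithm, and of three clusters are assumptions made for the example.

```r
# Clustering sketch on the built-in iris data, ignoring the Species label.
iris_features <- scale(iris[, 1:4])   # standardise so no column dominates the distance
set.seed(123)                         # k-means starts from random centres
km <- kmeans(iris_features, centers = 3, nstart = 20)

table(km$cluster)                     # size of each discovered cluster

# Compare with the held-back labels only to inspect what was discovered
table(cluster = km$cluster, species = iris$Species)
```

The species column is used only afterwards, to see whether the discovered groups line up with a known structure.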
- In this, an agent interacts with its environment and receives a reward for each action it performs.
- state-action-reward
- The goal is to maximize the expected reward
- The hand-heat-reward example
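The action-reward loop can be sketched with a very small agent. The example below is a single-state simplification (a two-action "bandit") using an epsilon-greedy rule. The action names, reward probabilities, and `epsilon` value are all hypothetical and chosen only for illustration.

```r
# Epsilon-greedy agent on a hypothetical two-action environment.
set.seed(7)
true_reward <- c(left = 0.2, right = 0.8)  # assumed reward probabilities, unknown to the agent
q       <- c(left = 0, right = 0)          # agent's running estimate of each action's value
n_taken <- c(left = 0, right = 0)          # how often each action has been tried
epsilon <- 0.1                             # fraction of steps spent exploring

for (step in 1:1000) {
  # Explore with probability epsilon, otherwise exploit the best current estimate
  if (runif(1) < epsilon) {
    action <- sample(names(q), 1)
  } else {
    action <- names(q)[which.max(q)]
  }
  reward <- rbinom(1, 1, true_reward[[action]])  # environment returns a reward of 0 or 1
  n_taken[[action]] <- n_taken[[action]] + 1
  # Incremental mean: move the estimate toward the observed reward
  q[[action]] <- q[[action]] + (reward - q[[action]]) / n_taken[[action]]
}

print(round(q, 2))   # estimated value of each action
print(n_taken)       # how often each action was chosen
```

Over many steps the agent's estimates in `q` guide it toward the action with the higher expected reward, which is the goal stated above.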
Regression is a modelling method used to forecast a target value from independent predictors. It helps quantify the relationship between the target and the predictors. If that relationship is linear, the method is called linear regression.
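For a single predictor, the fitted line can be computed by hand: the slope is the covariance of predictor and target divided by the variance of the predictor, and the intercept makes the line pass through the two means. The sketch below checks this against `lm()` on the built-in `cars` data, which is an assumption of the example rather than the workshop dataset.

```r
# Least-squares line for one predictor, computed by hand on the built-in cars data.
x <- cars$speed   # predictor
y <- cars$dist    # target

slope     <- cov(x, y) / var(x)          # change in y per unit change in x
intercept <- mean(y) - slope * mean(x)   # the line passes through (mean(x), mean(y))

# lm() should recover the same coefficients
fit_demo <- lm(dist ~ speed, data = cars)
print(c(intercept = intercept, slope = slope))
print(coef(fit_demo))
```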
In this example, we have considered the dataset of temperatures during World War 2.
Target - Mean Temperature
Predictor - Min Temperature
Here we are fitting a linear regression model:
temperature <- read.csv("/storage/scratch2/rk0349/Summary of Weather.csv")
fit <- lm(MeanTemp ~ MinTemp, data = temperature) # use data = so that predict() can be given new data later
summary(fit)
Now we use the fitted regression equation to predict values of the target variable:
predictions <- predict(fit,temperature)
head(predictions)
mse <- mean((temperature$MeanTemp - predictions)^2)
print(mse)
plot(temperature$MinTemp, temperature$MeanTemp)
plot(predictions, temperature$MeanTemp)
- Decision trees are a well-known machine learning technique for decision making.
- At each node a condition is evaluated, and a branch is drawn for each outcome.
- The tree is drawn upside down: the root, at the top, is where the structure begins.
knitr::include_graphics("/storage/scratch2/rk0349/decision1.png")
In this breast cancer dataset we have 1 million instances and 15 columns that give information.
# Decision Tree
cancer_data <- read.csv("/storage/scratch2/share/DSA/clean_BayesianNetworkGenerator_breast-cancer_small.csv")
cancer_data <- cancer_data[1:100, ] # keep only the first 100 rows
cancer_data$age <- factor(cancer_data$age) # convert each column to a categorical factor
is.factor(cancer_data[,1])
cancer_data$menopause <- factor(cancer_data$menopause)
is.factor(cancer_data[,2])
cancer_data$tumor.size <- factor(cancer_data$tumor.size)
is.factor(cancer_data[,3])
cancer_data$inv.nodes <- factor(cancer_data$inv.nodes)
is.factor(cancer_data[,4])
cancer_data$node.caps <- factor(cancer_data$node.caps)
is.factor(cancer_data[,5])
cancer_data$deg.malig <- factor(cancer_data$deg.malig)
is.factor(cancer_data[,6])
cancer_data$breast <- factor(cancer_data$breast)
is.factor(cancer_data[,7])
cancer_data$breast.quad <- factor(cancer_data$breast.quad)
is.factor(cancer_data[,8])
cancer_data$irradiat <- factor(cancer_data$irradiat)
is.factor(cancer_data[,9])
cancer_data$Class <- factor(cancer_data$Class)
is.factor(cancer_data[,10])
summary(cancer_data) # print a summary of the data
head(cancer_data)
set.seed(100)
train <- sample(nrow(cancer_data), 0.75*nrow(cancer_data)) # row indices for the training set (75%); the rest form the test set
train_set <- cancer_data[train,]
test_set <- cancer_data[-train,]
summary(train_set)
summary(test_set) # view the test set
library(partykit)
library(rpart)
cols <- c('Class', 'age', 'tumor.size')
cancer_data[cols] <- lapply(cancer_data[cols], as.factor)
set.seed(1)
train <- sample(1:nrow(cancer_data), 0.75 * nrow(cancer_data))
Build a decision tree model with age as the target. 75% of the rows are used for training and the remaining 25% are held out for testing.
cancerTree <- rpart(age ~ ., data = cancer_data[train, ], method = 'class')
plot(as.party(cancerTree))
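The held-out 25% is what tells us how well a tree generalises. Since the cancer file is not available outside the cluster, the sketch below shows the same train, predict, and evaluate steps on the built-in `iris` data. The dataset and the object names (`iris_tree`, `confusion`, `accuracy`) are assumptions for illustration.

```r
library(rpart)  # rpart ships with standard R installations

set.seed(1)
train_idx <- sample(nrow(iris), floor(0.75 * nrow(iris)))  # 75% train, 25% test

# Fit a classification tree on the training rows only
iris_tree <- rpart(Species ~ ., data = iris[train_idx, ], method = "class")

# Predict the class of each held-out row
iris_pred <- predict(iris_tree, newdata = iris[-train_idx, ], type = "class")

# Confusion matrix: rows are predictions, columns are the true classes
confusion <- table(predicted = iris_pred, actual = iris$Species[-train_idx])
print(confusion)

# Accuracy: share of held-out rows on the diagonal of the confusion matrix
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

The same pattern should apply to `cancerTree` by predicting on `cancer_data[-train, ]` and tabulating against the true `age` values.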
## Random Forests
library(randomForest)
cancer_data <- read.csv("/storage/scratch2/share/DSA/clean_BayesianNetworkGenerator_breast-cancer_small.csv")
cancer_data <- cancer_data[1:500, ] # keep only the first 500 rows
cancer_data$age <- factor(cancer_data$age)
is.factor(cancer_data[,1])
cancer_data$menopause <- factor(cancer_data$menopause)
is.factor(cancer_data[,2])
cancer_data$tumor.size <- factor(cancer_data$tumor.size)
is.factor(cancer_data[,3])
cancer_data$inv.nodes <- factor(cancer_data$inv.nodes)
is.factor(cancer_data[,4])
cancer_data$node.caps <- factor(cancer_data$node.caps)
is.factor(cancer_data[,5])
cancer_data$deg.malig <- factor(cancer_data$deg.malig)
is.factor(cancer_data[,6])
cancer_data$breast <- factor(cancer_data$breast)
is.factor(cancer_data[,7])
cancer_data$breast.quad <- factor(cancer_data$breast.quad)
is.factor(cancer_data[,8])
cancer_data$irradiat <- factor(cancer_data$irradiat)
is.factor(cancer_data[,9])
cancer_data$Class <- factor(cancer_data$Class)
is.factor(cancer_data[,10])
set.seed(100)
train <- sample(nrow(cancer_data), 0.7*nrow(cancer_data)) # row indices for the training set (70%); the rest form the test set
train_set <- cancer_data[train,]
test_set <- cancer_data[-train,]
model1 <- randomForest(Class ~ ., data = train_set, ntree = 500, mtry = 6, importance = TRUE) # build model 1; printing it shows the out-of-bag confusion matrix
model1 # view the fitted model
plot(model1) # error rate versus number of trees
pred_train <- predict(model1, train_set, type = "class") # predict the training set with model 1
mean(pred_train == train_set$Class) # training accuracy
table(pred_train, train_set$Class) # training confusion matrix
pred_test <- predict(model1, test_set, type = "class") # predict the test set with model 1
mean(pred_test == test_set$Class) # test accuracy
table(pred_test, test_set$Class) # test confusion matrix
importance(model1) # variable importances for model 1
varImpPlot(model1) # variable importance plot for model 1
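a <- numeric(6) # test accuracy for each candidate mtry value
for (i in 3:8) {
  model3 <- randomForest(Class ~ ., data = train_set, ntree = 500, mtry = i, importance = TRUE)
  pred_test <- predict(model3, test_set, type = "class")
  a[i - 2] <- mean(pred_test == test_set$Class) # store the accuracy for mtry = i
}
a
plot(3:8, a, xlab = "mtry", ylab = "Test accuracy")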
IML
About the dataset: Housing Values in Suburbs of Boston. The Boston data frame has 506 rows and 14 columns:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
library("mlr")
data("Boston", package = "MASS")
# create an mlr task and model
tsk = makeRegrTask(data = Boston, target = "medv")
lrn = makeLearner("regr.randomForest", ntree = 100)
mod = train(lrn, tsk)
library("iml")
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(mod, data = X, y = Boston$medv)
imp = FeatureImp$new(predictor, loss = "mae")
plot(imp)
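pdp.obj = FeatureEffect$new(predictor, feature = "lstat", method = "pdp") # partial dependence of medv on lstat; recent iml releases use FeatureEffect in place of Partial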
plot(pdp.obj)
tree = TreeSurrogate$new(predictor, maxdepth = 2)
plot(tree)
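The feature importance plotted above is based on permutation: shuffle one feature, and see how much the prediction error grows. The following base-R sketch illustrates that idea only. It is not iml's implementation, and the `mtcars` data, the linear model, and the helper `mae()` are assumptions made for the example.

```r
# Rough sketch of permutation feature importance using only base R.
set.seed(2024)
pfi_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Mean absolute error, matching the "mae" loss used above
mae <- function(actual, predicted) mean(abs(actual - predicted))
baseline_error <- mae(mtcars$mpg, predict(pfi_model, mtcars))

pfi_features <- c("wt", "hp", "qsec")
importance_ratio <- sapply(pfi_features, function(f) {
  shuffled <- mtcars
  shuffled[[f]] <- sample(shuffled[[f]])  # break the link between feature f and mpg
  mae(mtcars$mpg, predict(pfi_model, shuffled)) / baseline_error
})

# A ratio well above 1 means the model relied on that feature
print(sort(importance_ratio, decreasing = TRUE))
```

In practice the shuffle is usually repeated several times and averaged, because a single permutation can be noisy.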
- Reach out to us by email: hpc-admin@unt.edu
- Find the code on: GitHub