class: center, middle, inverse, title-slide # Machine Learning ## BSTA 504 ### Ted Laderas ### 2021-03-10 --- # Learning Objectives - **Learn** about different types of machine learning - **Learn** how to split data using `{rsample}` - **Learn** about standard approaches for preprocessing data using `{recipes}` - **Learn** about PCA as a visualization and exploration technique - **Learn** the basic phases of supervised machine learning and the `tidymodels` functions associated with them - **Learn how to evaluate** the predictive power of a model/learner --- # What is Machine Learning? - "The study of computer algorithms that improve automatically through experience" - Using algorithms to find predictive patterns in the data --- # Types of Machine Learning - **Unsupervised** - discovering groups in data without labels - Dimension reduction, clustering - Goal is discovery and exploration, not prediction - **Supervised** - learning how to predict labels using *features*/*covariates* - Labels: - **Reinforcement** - guided machine learning --- # What's the Difference? .pull-left[ ## Statistical Modeling - Understand and quantify relationships between covariates and outcome is primary goal - Prediction is secondary - Sample sizes tend to be smaller ] .pull-right[ ## (Supervised) Machine Learning - Prediction is primary goal - Understanding relationships between variables is secondary - Sample sizes tend to be very large ] --- # `tidymodels` General framework for machine learning, allows you access to many different machine learning packages, such as TensorFlow Learn one workflow, use many different algorithms! --- # The different parts of `tidymodels` The different sections of `tidymodels` are designed to be useful in a `tidy` workflow and roughly map to the different steps and requirements of a machine learning workflow. --- ## Let's run through a basic `tidymodels` workflow These are the major packages where `tidymodels` is used in machine learning. - `{rsample}` - use these functions to specify a test/training set, or to build a cross-validation set, or for bootstrap sampling - `{recipes}` - use these functions to normalize variables and process them for use in machine learning, also known as **feature engineering**. - `{parsnip}` - use these functions to specify and train your model - `{workflows}` - use a model and recipe together (allows you to switch out models and use them reproducibly) - `{yardstick}` - use these functions to evaluate your model (accuracy on test data) --- # Workflow <img src="image/week9/workflow.JPG" width = 700> --- # Workflow with Packages <img src="image/week9/workflow_packages.png" width = 700> --- # Starting Data <img src="image/week9/feature_matrix.JPG" width = 700> --- <img src="image/week9/labels.JPG" width = 700> --- <img src="image/week9/test_train.JPG" width = 700> --- # `rsample::initial_split()` The function `initial_split()` from `rsample` package in `tidymodels` handles splitting data into test/train sets. ```r all_features_split <- initial_split(all_features, prop = 3/4) all_features_train <- training(all_features_split) all_features_test <- testing(all_features_split) ``` --- # `recipes` The `recipes` package in `tidymodels` outlines approaches to transform data, using `step_` functions https://www.tidymodels.org/start/recipes/ ``` recipe(species ~., data = penguins) %>% update_role(species, island, new_role="id") %>% step_normalize(all_numeric()) #<< ``` --- # Data Types <img src="image/week9/data_types1.png" width = 700> --- # Data Types <img src="image/week9/data_types.png" width = 700> --- # Building a `recipe` - Many standardized steps for processing data in machine learning --- <table> <thead> <tr> <th style="text-align:left;"> Step </th> <th style="text-align:left;"> Function </th> <th style="text-align:left;"> Data_type </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Imputing missing data </td> <td style="text-align:left;"> step_knnimpute() </td> <td style="text-align:left;"> all_numeric(), all_nominal() </td> </tr> <tr> <td style="text-align:left;"> Transform variables for skewness </td> <td style="text-align:left;"> step_BoxCox(), step_log() </td> <td style="text-align:left;"> all_numeric() </td> </tr> <tr> <td style="text-align:left;"> Discretize continuous variables </td> <td style="text-align:left;"> step_cut(), step_discretize() </td> <td style="text-align:left;"> all_numeric() </td> </tr> <tr> <td style="text-align:left;"> Normalize data (center, scale, etc) </td> <td style="text-align:left;"> step_normalize() </td> <td style="text-align:left;"> all_numeric() </td> </tr> <tr> <td style="text-align:left;"> Create Dummy Variables </td> <td style="text-align:left;"> step_dummy() </td> <td style="text-align:left;"> all_nominal() </td> </tr> <tr> <td style="text-align:left;"> Identify highly correlated variables </td> <td style="text-align:left;"> step_corr() </td> <td style="text-align:left;"> all_numeric() </td> </tr> <tr> <td style="text-align:left;"> Create Interactions </td> <td style="text-align:left;"> step_interact() </td> <td style="text-align:left;"> all_numeric(), all_nominal() </td> </tr> <tr> <td style="text-align:left;"> Multivariate transformation </td> <td style="text-align:left;"> step_pca(), step_ica() </td> <td style="text-align:left;"> all_numeric() </td> </tr> </tbody> </table> --- # Build a recipe once, use it on different data - usually apply it to our training set first - then apply it to the test set --- # PCA: Principal Components Analysis - Visualization method - Summarize many covariates into a smaller number of "Prinicipal Components" - Principal components "squish" multiple covariaes into linear combinations - Values of linear combinations chosen to maximize variability --- # Clustering: Examining groupings in the data <img src="image/week9/kmeans.gif" width =700> --- <img src="image/week9/workflow.JPG" width = 700> --- # Training Data Workflow <img src="image/week9/train_data.JPG" width = 700> --- # Test Data Workflow <img src="image/week9/test_data.JPG" width = 700> --- # Fitting the Model <img src="image/week9/training_model.JPG" width = 700> --- # {yardstick} for evaluating on test set `collect_metrics()` from `tidymodels` allows you to calculate metrics on predictions including: - Accuracy - Balanced Accuracy - Sensitivity (requires an "event_level", such as "depressed") - Specificity (requires an "event_level", such as "depressed") - Area under the Reciever Operating Curve (ROC) --- <img src="image/week9/testing_model.JPG" width = 700> --- # Logistic Regression Talk more about this in RStudio cloud notebook. --- # K-nearest Neighbor <img src="image/week9/knn1.jpg" width = 700> --- # K-nearest Neighbor <img src="image/week9/knn2.jpg" width = 700> --- # K-nearest Neighbor <img src="image/week9/knn3.jpg" width = 700>