Applying Machine Learning (ML) bioinformatic best practices to the publicly available Wisconsin Diagnostic Breast Cancer (WDBC) data set to build a diagnostic model in Python.

Steven Smiley
10 min read · Jan 7, 2020

I am going to take you on an adventure through my recent application of Dr. Olson’s bioinformatic best-practice recommendations for “applying machine learning to supervised classification problems” to the well-known Wisconsin Diagnostic Breast Cancer (WDBC) data set, paired with my ascent of Long’s Peak in the middle of the night!

The thorough documentation is in the Jupyter Notebook. I will try to parallel it at a higher level here so it is more of a fun read!

The Jupyter Notebook can be viewed at the bottom of this story.

1. Background

Background on WDBC dataset

The WDBC dataset, created by Dr. Wolberg et al., contains 30 features for each of 569 unique instances, computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The total attribute count comes to 32 because each instance also includes a unique ID and the diagnosis. The features describe the characteristics of the cell nuclei present in the image.

Background on applications to WDBC dataset

People have been tackling this problem for a long time now, since the data was donated in 1995. Heck, if you just go to the site in the link, there are about seven relevant papers listed on it.

Furthermore, I’ve had college professors use versions of this data for ML homework in class. In fact, Kaggle also hosts it for people to tackle with ML. It really is a great data science problem to work on because getting it right can literally save lives!

Background on what is new today

So what is new today? And how might we apply these recent discoveries to our old problems? Well…

A recent publication by Randal S. Olson, et al. in 2017 provides insightful best practice advice for solving bioinformatic problems with machine learning, “Data-driven Advice for Applying Machine Learning to Bioinformatics Problems”.

Looking at the abstract briefly, they analyzed

“13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers.”

Wow! That is very nice of them. But wait! Even better is that one of those 165 publicly available classification problems comes from this WDBC study (see Table 2 in their paper).

From their research, they were able to provide a “recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.” Those five recommended algorithms, with their suggested hyperparameters, are the ones I implement below.

Therefore, I thought it would be a great learning experience for me as an ambitious and aspiring data scientist to code these same 5 recommendations with scikit-learn and implement them by tuning the hyperparameters.

Because let’s be honest! When it comes to diagnosing breast cancer, we want to make sure we don’t have too many false positives (you don’t have cancer, but are told you do and go on treatment) or false negatives (you have cancer, but are told you don’t and don’t get treatment). I like to think of it this way: every bit of accuracy gained is potentially a life saved!

And we started the climb! In the dark…it was around 3:30am here.

2. Abstract

My Approach:

The data was split into 80% training (455 people) and 20% testing (114 people).

Several different models were evaluated through k-fold cross-validation (k = 10) using GridSearchCV, which iterates over each algorithm’s hyperparameters:

  1. Gradient Tree Boosting (GradientBoostingClassifier) ‘In top 5 from Olson’
  2. Random Forest (RandomForestClassifier) ‘In top 5 from Olson’
  3. Support Vector Machine (SVC) ‘In top 5 from Olson’
  4. Extra Random Forest (ExtraTreesClassifier) ‘In top 5 from Olson’
  5. Logistic Regression (LogisticRegression) ‘In top 5 from Olson’
  6. Multilayer Perceptron (MLPClassifier) ‘NOT analyzed by Olson’

The reason I wanted to try out a few new parameters is that every data set is different, and the recommendations will not always give the optimal result. Therefore, I wanted to try a few new learning rates (learning_rate), tree depths (max_depth), and estimator counts (n_estimators), as well as a completely new algorithm (MLPClassifier, which is a neural network), to see what the WDBC data set would require, as sketched below.
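
To give a flavor of what I mean by trying new values, here is a minimal sketch of such hyperparameter grids. The numbers are illustrative placeholders, not the exact grids from my notebook.

```python
# Illustrative hyperparameter grids (placeholder values, not the exact
# grids from the notebook) for a few of the algorithms listed above.
param_grids = {
    "GradientBoostingClassifier": {
        "learning_rate": [0.01, 0.1, 1.0],  # learning rates beyond the single recommended value
        "max_depth": [3, 5, 7],             # tree depths to try
        "n_estimators": [100, 500],         # number of boosting stages
    },
    "RandomForestClassifier": {
        "max_depth": [None, 5, 10],
        "n_estimators": [100, 500],
    },
    "MLPClassifier": {                      # the extra neural network not analyzed by Olson
        "hidden_layer_sizes": [(50,), (100,)],
        "alpha": [1e-4, 1e-3],
    },
}
```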

Glad my headlamp didn’t go out yet! It is around 5:45am here!

SO… let’s make this approach happen!

3. Import Libraries

These are the libraries I needed for running this code in Python.
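
Since the notebook carries the full list, here is a minimal sketch of the core imports this kind of analysis relies on (the exact set in the notebook may differ slightly):

```python
# Core libraries for data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn: preprocessing, model selection, algorithms, and metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc
```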

4. Import and View Data

We always want to make sure we look at our data and understand it before feeding it into any algorithm to make a model. I want to make sure there aren’t any missing values or outliers that would throw everything off.

4.1 Import and View Data: Check for Missing Values

As the background stated, no missing values should be present. The following verifies that. The last column doesn’t hold any information and should be removed. In addition, the diagnosis should be changed to a binary classification of 0 = benign and 1 = malignant.
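
As a minimal sketch (assuming the Kaggle version of the CSV, where the file is data.csv, the ID column is “id”, and the empty trailing column is “Unnamed: 32”):

```python
# Load the WDBC data (column names here follow the Kaggle CSV)
df = pd.read_csv("data.csv")

# Check for missing values in every column
print(df.isnull().sum())

# Drop the columns that carry no predictive information:
# the instance ID and the empty trailing column
df = df.drop(columns=["id", "Unnamed: 32"])

# Encode the diagnosis as binary: 0 = benign (B), 1 = malignant (M)
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})
```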

4.2 Heatmap with Pearson Correlation Coefficient for Features

A strong correlation is indicated by a Pearson Correlation Coefficient value near 1. Therefore, when looking at the Heatmap, we want to see what correlates most with the first column, “diagnosis.” It appears that the feature “concave points_worst” [0.79] has the strongest correlation with “diagnosis”.
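
A minimal sketch of how a heatmap like Figure 1 can be produced with seaborn (using the DataFrame from above):

```python
# Pearson correlation matrix of the diagnosis and all 30 features
corr = df.corr(method="pearson")

# Heatmap of the correlations; the "diagnosis" row/column shows how
# strongly each feature correlates with the target
plt.figure(figsize=(18, 14))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Pearson Correlation Heatmap")
plt.show()
```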

5. Split Data for Training

Since we are feeding this data into several different Machine Learning (ML) algorithms, we want to make sure the data is standardized so that features with larger numerical ranges do not dominate the fit. In addition, we want to split the data into a training set and a testing set so that we do not have data leakage. We do not want a model that is 100% accurate only on the data it was trained on; we want to test it on data the model has not seen to verify it is working properly. A general rule of thumb in ML is to hold out 20 to 30 percent of the data for testing and use the rest for training.

Standardize and Split the Data
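
Here is a minimal sketch of that step (the random_state and the use of stratification are my assumptions for illustration; the notebook has the exact call):

```python
# Separate the features from the target
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

# 80% training / 20% testing split, stratified so both sets keep the
# benign/malignant ratio (random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# Standardize the features: fit the scaler on the training data only,
# then apply the same transform to the test data to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```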

Verify the Split

We want to verify the data was split correctly before proceeding.
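
For example, a quick sanity check on the resulting shapes and class balance:

```python
# Expect roughly 455 training rows and 114 test rows out of 569 total
print("Training set:", X_train.shape, " Test set:", X_test.shape)
print("Training class counts:\n", y_train.value_counts())
print("Test class counts:\n", y_test.value_counts())
```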

Around 6 am, what a view! The sun is finally coming out! My batteries were running low!

6. Machine Learning:

In order to find a good model, the algorithms described above (e.g. Logistic Regression, Support Vector Machine, Random Forest) need to be tested with the training dataset we just made.

A sensitivity study over the different hyperparameters described in the background and abstract is run with GridSearchCV in order to optimize each model through cross-validation (CV).

The full code is thoroughly documented in the Jupyter notebook at the bottom of this story, so I do not want to bore you with all of it here; a minimal sketch of the pattern follows.
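
This is the kind of loop the notebook runs for each algorithm, shown here with one illustrative model and a placeholder grid rather than my exact settings:

```python
# Illustrative GridSearchCV run for one model; the notebook repeats this
# pattern for each of the six algorithms with its own hyperparameter grid
grid = GridSearchCV(
    estimator=SVC(probability=True),  # probability=True so ROC curves can be drawn later
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01], "kernel": ["rbf"]},
    cv=10,                            # 10-fold cross-validation
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("Best CV accuracy:", grid.best_score_)
print("Best hyperparameters:", grid.best_params_)
best_svc = grid.best_estimator_       # already refit on the full training set
```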

Okay, so now that the sun is out… let’s get to this steep gradient ascent, haha! Not to be confused with the solver SGD (Stochastic Gradient Descent). Time is still around 6am.

7. Evaluate Models

After creating all of our models on the training data with GridSearchCV, we need to evaluate them with the test data.

Looking at the Confusion Matrix Plots

Figures 2.A, 3.A, 4.A, 5.A, 6.A, 7.A, 8.A, 9.A, 10.A, 11.A, 12.A, and 13.A

As noted earlier, we want to make sure we don’t have too many false positives or false negatives. Therefore, the model with the highest overall accuracy is chosen (the accuracy is the sum of the diagonal of the confusion matrix divided by the total). You can see that every model produced fewer than a handful of false positives and false negatives across the 114 observations in the test set, meaning they were all extremely accurate.
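
A minimal sketch of how each confusion matrix and its overall accuracy can be computed, using the illustrative best_svc model fit above:

```python
# Predict on the held-out test set and build the confusion matrix
y_pred = best_svc.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Overall accuracy = sum of the diagonal (correct predictions) / total
accuracy = np.trace(cm) / np.sum(cm)
print("Confusion matrix:\n", cm)
print("Test accuracy: {:.3f}".format(accuracy))

# Plot the confusion matrix as a labeled heatmap
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["benign", "malignant"],
            yticklabels=["benign", "malignant"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```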

Looking at the Variable Importance Plots

Figures 2.B, 3.B, 4.B, 5.B, 8.B, and 9.B

Most of the tree-ensemble models expose an attribute called feature_importances_, which ranks the features by how much they contribute to the model’s decisions (based on the impurity criterion used for splitting, i.e. the Gini index or entropy). The top 5 variables that appear most important for helping the model make a prediction are:

  • concave points_worst
  • concave points_mean
  • radius_worst
  • area_worst
  • perimeter_worst

It is not a surprise that these same variables also show a high correlation with diagnosis in the Heatmap from Figure 1, so it makes sense that they turned out to be so important.
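
A minimal sketch of pulling and plotting those importances from one of the tree ensembles (illustrative settings; the notebook uses the tuned models instead):

```python
# Fit a tree ensemble and rank its features by impurity-based importance
forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(X_train, y_train)

# Pair each importance with its feature name and keep the top five
importances = pd.Series(forest.feature_importances_, index=X.columns)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)

# Horizontal bar plot of the five most important features
top5.sort_values().plot(kind="barh")
plt.xlabel("Feature importance")
plt.show()
```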

Looking at Receiver Operating Characteristic (ROC) Curves

Figure 14

All of the models’ Receiver Operating Characteristic (ROC) curves had excellent Area Under the Curve (AUC) values, all greater than 0.90 (and close to 0.99 for every model), meaning they would all serve as excellent diagnostics.
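
A minimal sketch of how one of those ROC curves and its AUC can be computed, again using the illustrative best_svc model:

```python
# Predicted probabilities for the malignant class on the test set
y_scores = best_svc.predict_proba(X_test)[:, 1]

# ROC curve: false positive rate vs. true positive rate at every threshold
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label="SVC (AUC = {:.3f})".format(roc_auc))
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```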

8. Results & Conclusions

Reflecting on the Recommended Hyperparameters by Olson

All of the models performed well after fine-tuning their hyperparameters, but the best model is the one with the highest overall accuracy.

In this analysis, the model built with Olson’s recommendations for the Support Vector Classifier (SVC), SVM_Olson, won the battle at nearly 98.2% accuracy. This is not to say that it is the best model in all cases; it simply performed best on this particular test set. Of the 20% of the data withheld for testing (114 random individuals), each model misdiagnosed only a handful. No model is perfect, but I am happy to see how well the recommendations from Olson worked on this data set. If, on average, fewer than a handful of people out of 114 are misdiagnosed, that is a good start for a model.

We made it to the top of Long’s Peak! Thank you for following me on this journey!

9. References

  1. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  2. Olson, Randal S., et al. “Data-driven advice for applying machine learning to bioinformatics problems.” Pacific Symposium on Biocomputing 23 (2017): 192–203.
  3. SciPy. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. (2019) SciPy 1.0–Fundamental Algorithms for Scientific Computing in Python. preprint arXiv:1907.10121
  4. Python. a) Travis E. Oliphant. Python for Scientific Computing, Computing in Science & Engineering, 9, 10–20 (2007) b) K. Jarrod Millman and Michael Aivazis. Python for Scientists and Engineers, Computing in Science & Engineering, 13, 9–12 (2011)
  5. NumPy. a) Travis E. Oliphant. A guide to NumPy, USA: Trelgol Publishing, (2006). b) Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation, Computing in Science & Engineering, 13, 22–30 (2011)
  6. IPython. a) Fernando Pérez and Brian E. Granger. IPython: A System for Interactive Scientific Computing, Computing in Science & Engineering, 9, 21–29 (2007)
  7. Matplotlib. J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
  8. Pandas. Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51–56 (2010)
  9. Scikit-Learn. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825–2830 (2011)
  10. Scikit-Image. Stéfan van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu and the scikit-image contributors. scikit-image: Image processing in Python, PeerJ 2:e453 (2014) (publisher link)
  11. Kaggle. UCI Machine Learning (2016, September). “Breast Cancer Wisconsin (Diagnostic) Data Set Predict whether the cancer is benign or malignant,” Version 2. Retrieved December, 2019 from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data.

