Implementing Various Machine Learning Algorithms and Hyperparameter Tuning On The Indian Liver Patient Records Dataset

For this project, I implemented a number of machine learning algorithms on the Indian Liver Patient Records dataset, which contains patient records collected from the North East of Andhra Pradesh, India. The goal was to better predict whether an individual is likely to develop liver disease given the following features: age, gender, total bilirubin, direct bilirubin, alkaline phosphatase, alanine aminotransferase, aspartate aminotransferase, total proteins, albumin, and the albumin-to-globulin ratio.

The data mining and exploration step yielded some interesting insights. I came across some compelling countplots and underlying correlations. I won't list them all, but most of the people tested were male, and the dataset description gives no background as to why. Most of the patients who tested positive were also male. The discrepancies were astonishing. Adults aged 24-63 were also significantly more affected by liver disease than young or elderly patients. There were also high correlations between certain pairs of features: direct bilirubin and total bilirubin, alanine aminotransferase and aspartate aminotransferase, total proteins and albumin, and albumin and the albumin-to-globulin ratio.
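The correlation check described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual notebook code: the column names, the generated values, and the 0.7 threshold are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data; column names are hypothetical, not the dataset's real headers.
total_bilirubin = rng.lognormal(mean=0.5, sigma=0.8, size=n)
df = pd.DataFrame({
    "total_bilirubin": total_bilirubin,
    # Direct bilirubin is generated as a noisy fraction of total bilirubin.
    "direct_bilirubin": 0.45 * total_bilirubin + rng.normal(0, 0.05, n),
    "albumin": rng.normal(3.2, 0.6, n),
    "age": rng.integers(4, 90, n),
})

def strongly_correlated_pairs(frame, threshold=0.7):
    """Return (feature_a, feature_b, r) for each pair with |Pearson r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

pairs = strongly_correlated_pairs(df)
for a, b, r in pairs:
    print(f"{a} vs. {b}: r = {r:.2f}")
```

Listing only the pairs above a threshold keeps the output readable when the full correlation heatmap has many features.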

The dataset was manipulated and cleaned using an assortment of libraries that included Pandas and Matplotlib. There was a fair amount of feature engineering to do: renaming certain columns, dealing with null values, handling outliers, and log-transforming and min-max scaling the continuous features. The age column was also binned to make the results easier to interpret and visualize. One-hot encoding was performed on the discrete features.
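A rough sketch of those preparation steps is shown below. It is not the project's actual code: the tiny data frame, the column names, the median imputation, and the age cut points are assumptions chosen for illustration, and outlier handling is left out.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame; column names and values are hypothetical.
raw = pd.DataFrame({
    "age": [17, 32, 45, 58, 66, 72],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "total_bilirubin": [0.7, 10.9, 7.3, 1.0, 3.9, 0.9],
    "ag_ratio": [0.90, 0.74, np.nan, 1.00, 0.40, 1.30],
})

def prepare(frame):
    out = frame.copy()
    # Impute missing values with the column median (one of several reasonable choices).
    out["ag_ratio"] = out["ag_ratio"].fillna(out["ag_ratio"].median())
    # log1p tames right-skewed lab values; min-max then rescales them to [0, 1].
    for col in ["total_bilirubin", "ag_ratio"]:
        logged = np.log1p(out[col])
        out[col] = (logged - logged.min()) / (logged.max() - logged.min())
    # Bin age into coarse groups; the cut points here are illustrative.
    out["age_group"] = pd.cut(
        out["age"], bins=[0, 23, 63, 120], labels=["young", "adult", "elderly"]
    )
    out = out.drop(columns="age")
    # One-hot encode the discrete columns.
    return pd.get_dummies(out, columns=["gender", "age_group"])

features = prepare(raw)
print(features.head())
```

Applying the log transform before min-max scaling matters for skewed lab values: scaling alone would squeeze most patients into a narrow band near zero.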

After creating the machine-learning-ready dataset, I applied a set of classification algorithms: Support Vector Machines (SVM), K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machines (GBM), eXtreme Gradient Boosting (XGBoost), and AdaBoost. These produced varying test metrics and AUC measures. The highest-ranking algorithm was Random Forest, with Accuracy Score: 0.7678571428571429 and AUC: 0.64140625. The AUCs were used to create an ROC/AUC curve plot comparing all the algorithms. Having concluded that Random Forest was the best performer of the bunch, I ran hyperparameter tuning on it using RandomizedSearchCV. Another ROC/AUC curve plot was generated to include the Random Forest + Optimization AUC. The Random Forest + Optimization algorithm had Accuracy Score: 0.7440476190476191 and AUC: 0.55703125. It did slightly worse on accuracy and noticeably worse on AUC than the Random Forest with default parameters.
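The comparison between a default Random Forest and a RandomizedSearchCV-tuned one might look something like the sketch below. It runs on synthetic data from `make_classification`, so its scores will not match the numbers above. The search space, the split size, the `roc_auc` scoring choice, and computing AUC from predicted probabilities are all assumptions, since the README does not state what the project used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the prepared liver dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           weights=[0.3, 0.7], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

def evaluate(model):
    """Return (accuracy, AUC) on the held-out split."""
    proba = model.predict_proba(X_test)[:, 1]
    return accuracy_score(y_test, model.predict(X_test)), roc_auc_score(y_test, proba)

# Baseline: Random Forest with default hyperparameters.
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Hypothetical search space; the ranges the project actually searched are not stated.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 4, 8, 16],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5, cv=3, scoring="roc_auc", random_state=42,
)
search.fit(X_train, y_train)

results = {
    "default": evaluate(baseline),
    "tuned": evaluate(search.best_estimator_),
}
for name, (acc, auc) in results.items():
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```

Because the search picks hyperparameters by cross-validation on the training split, the tuned model is not guaranteed to beat the default on a small held-out test set, which is consistent with the result reported above.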

Key takeaways:

1. There is evidence of correlations between certain features. This tells us that an elevated value for one feature can indicate that another is elevated as well. An example is the high positive correlation between direct bilirubin and total bilirubin. If a patient were to come in with high levels of direct bilirubin, it would be reasonable to assume that their total bilirubin is likely high too. The healthcare practitioner could choose to administer only certain tests and not others, which could potentially save both the practitioner and the patient vast amounts of time and resources.

2. Males made up the large majority of the dataset, and many more males than females were affected by liver disease. We are not told how the data was acquired, but the disparity between genders is astonishing. The measures for all features associated with the disease were much greater in males than in females. These types of discoveries can lead to targeted preventive care for male patients when they come in for routine check-ups or health issues. Healthcare facilities and other organizations can take a step toward addressing the issue at hand by making the public aware of the consequences associated with liver disease.

3. Adults between the ages of 24 and 63 seem to be the most in danger. This may be because these adults are tested for liver disease more often than young people and the elderly. We may need to focus more on testing these other demographics, since there is a strong possibility that we are missing the bigger picture here. If we were to focus on testing the youth and improving their dietary intake, decreasing alcohol consumption, addressing pollution, decontaminating food, and eradicating drug use, we may arrive at a stage where we can prevent our youth from developing liver disease later in their lives.

Liver Disease Machine Learning Project