Development of a machine learning-based screening method for thyroid nodule classification by solving the imbalance challenge in thyroid nodule data

Sajad Khodabandelu, Naser Ghaemian, Soraya Khafri^*, Mehdi Ezoji, Sara Khaleghi

*Corresponding author. Email: khafri@yahoo.com

Abstract

Background:

The main objective of this study is to identify and address the challenges inherent in thyroid nodule data when used in machine learning models, and to construct a machine learning-based model for preoperative thyroid nodule screening.

Methods:

In a retrospective study, ultrasonography (US) features were extracted for each nodule from patients' medical records. Because thyroid nodules are commonly benign, the benign and malignant classes are expected to be imbalanced, which can bias learning models toward the majority class. To address this, we used a hybrid resampling method, SMOTE-Tomek, to create a balanced dataset. We trained a Support Vector Classification algorithm on the balanced and unbalanced datasets (Models 2 and 3, respectively) in Python and compared their performance with that of a logistic regression model fitted in the traditional way (Model 1).
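The hybrid resampling idea can be illustrated with a minimal pure-Python sketch. It is not the study's code: the toy two-feature data, the function names (`smote`, `remove_tomek_links`, `smote_tomek`), the neighbour count `k=3`, and the choice to drop both members of each Tomek link are all assumptions made for illustration. The first step oversamples the minority class by interpolating between neighbouring minority points; the second step removes pairs of opposite-class points that are each other's nearest neighbour.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.
    Assumes the minority class has at least two samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: euclidean(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link: a pair of opposite-class
    points that are each other's nearest neighbour."""
    n = len(X)
    nn = [min((j for j in range(n) if j != i),
              key=lambda j: euclidean(X[i], X[j])) for i in range(n)]
    drop = {i for i in range(n) if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(n) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class up to the majority size, then clean
    the class boundary by removing Tomek links."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_majority = len(X) - len(minority)
    synthetic = smote(minority, n_majority - len(minority), k, seed)
    X_bal = list(X) + synthetic
    y_bal = list(y) + [minority_label] * len(synthetic)
    return remove_tomek_links(X_bal, y_bal)


# Hypothetical toy data: 40 "benign" (0) and 8 "malignant" (1) points.
rng = random.Random(1)
benign = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
malignant = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(8)]
X = benign + malignant
y = [0] * 40 + [1] * 8

X_bal, y_bal = smote_tomek(X, y)
print("before:", y.count(0), y.count(1))
print("after: ", y_bal.count(0), y_bal.count(1))
```

Because every Tomek link contains exactly one point from each class, removing both members keeps the two classes equal in size after oversampling. In practice a maintained implementation such as the `SMOTETomek` class in the imbalanced-learn package would normally be preferred over a hand-written version like this one.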

Results:

The findings showed that different performance metrics produce radically different results. According to accuracy and AUC, Model 1 performed best and Model 3 performed worst, whereas the geometric mean index showed the reverse.
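How accuracy and the geometric mean can rank the same classifiers in opposite order is easiest to see with a small worked example. The sketch below assumes the usual definition of the geometric mean as the square root of sensitivity times specificity. The confusion-matrix counts are hypothetical and are not results from this study.

```python
import math


def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)


def geometric_mean(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)  # recall on the malignant (minority) class
    specificity = tn / (tn + fp)  # recall on the benign (majority) class
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts for 10 malignant and 90 benign nodules.
# A classifier biased toward the majority (benign) class:
majority_biased = dict(tp=2, fn=8, tn=89, fp=1)
# A classifier that treats both classes more evenly:
balanced = dict(tp=8, fn=2, tn=72, fp=18)

for name, cm in [("majority-biased", majority_biased), ("balanced", balanced)]:
    print(f"{name}: accuracy={accuracy(**cm):.2f}, "
          f"g-mean={geometric_mean(**cm):.2f}")
```

With these made-up counts, the majority-biased classifier has the higher accuracy (0.91 versus 0.80) yet the lower geometric mean (about 0.44 versus 0.80), because accuracy is dominated by the benign class while the geometric mean penalizes poor recall on either class.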

Conclusions:

Attention to data challenges such as class imbalance can improve the performance of machine learning models applied to medical problems. In addition, the results of this study identified the presence of microcalcification, a taller-than-wide shape, and the absence of iso- and hyperechogenicity as the most informative features of malignancy, which experts in this field may take into consideration.