Abstract
Background:
The main objective of this study is to identify and address the challenges inherent in thyroid nodule data when they are used in machine learning models, and to construct a machine learning-based model for preoperative screening of thyroid nodules.
Methods:
In a retrospective study, ultrasonography (US) features were extracted for each nodule from patients' medical records. Because thyroid nodules are commonly benign, the benign and malignant classes were expected to be imbalanced, which can bias learning models toward the majority class. To address this, we used a hybrid resampling method, SMOTE-Tomek, to create a balanced dataset. We trained a Support Vector Classification algorithm on the balanced and unbalanced datasets (Models 2 and 3, respectively) in Python and compared their performance with that of a logistic regression model fitted in the traditional way (Model 1).
Results:
The findings showed that different measures of model performance produce radically different outcomes. According to accuracy and AUC, Model 1 performed best and Model 3 performed worst, whereas the geometric mean index showed the reverse.
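The disagreement between accuracy and the geometric mean can be illustrated with a minimal sketch. The confusion-matrix counts below are hypothetical and are not taken from this study; the geometric mean is assumed here to be the square root of sensitivity times specificity, which is its usual definition for two-class problems.

```python
import math

def accuracy(tp, fn, tn, fp):
    """Share of all nodules classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)

def geometric_mean(tp, fn, tn, fp):
    """Square root of sensitivity * specificity (assumed definition)."""
    sensitivity = tp / (tp + fn)  # recall on the malignant (minority) class
    specificity = tn / (tn + fp)  # recall on the benign (majority) class
    return math.sqrt(sensitivity * specificity)

# Hypothetical test set: 90 benign and 10 malignant nodules.
# A classifier biased toward the majority class finds 1 of 10 malignant nodules.
majority_biased = dict(tp=1, fn=9, tn=90, fp=0)
# A classifier with even per-class performance finds 8 of 10 malignant nodules.
even_handed = dict(tp=8, fn=2, tn=72, fp=18)

for name, counts in [("majority-biased", majority_biased),
                     ("even-handed", even_handed)]:
    print(f"{name}: accuracy={accuracy(**counts):.2f}, "
          f"g-mean={geometric_mean(**counts):.2f}")
```

In this illustration the majority-biased classifier has the higher accuracy but the lower geometric mean, because accuracy is dominated by the benign class while the geometric mean penalizes poor recall on either class.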
Conclusions:
Paying attention to data challenges, such as class imbalance, can improve the performance of machine learning models used to assist with medical problems. In addition, according to the results of this study, the presence of microcalcification, a taller-than-wide shape, and the absence of isoechogenicity and hyperechogenicity were identified as the most influential features of malignancy, which can be considered by experts in this field.