Hybrid CNN and Vision Transformer for Multi-Class Skin Cancer Detection

Surendren D
Dr. R Rajesh

Abstract

Skin cancer is among the most common cancers and can be life-threatening. Early, accurate identification is essential to improve treatment outcomes and lower mortality. Deep learning techniques have shown significant potential for automating skin lesion classification, but individual models often struggle to capture both local texture features and global contextual information. To address this, we propose a hybrid deep learning framework that combines Convolutional Neural Networks (CNNs) for fine-grained spatial feature extraction with Vision Transformers (ViTs) for capturing long-range dependencies and contextual relationships. This dual-architecture approach aims to improve multi-class skin lesion classification. Experiments were conducted on the HAM10000 dataset, which contains dermatoscopic images spanning multiple classes of skin lesions. The proposed hybrid CNN-ViT model achieved a classification accuracy of 91.5% and an AUC-ROC of 95.8%, outperforming the standalone CNN and ViT models. It also recorded a precision of 0.92, a recall of 0.91, and an F1-score of 0.92, indicating balanced performance across categories. These results indicate that integrating a CNN with a ViT improves feature representation and classification reliability. The proposed model improves predictive accuracy while keeping computational demands practical. This study advances AI-driven dermatological diagnostics by addressing limitations in both local and global feature learning. Future directions include optimizing the model for real-time use and incorporating more diverse datasets to improve generalizability.
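
The abstract reports accuracy, precision, recall, and F1-score for a multi-class classifier but does not say how the per-class values were averaged. As an illustration only, the sketch below computes these metrics with macro averaging (an assumption, not a detail from the article) on a small invented three-class example. The function name `macro_metrics` and the toy labels are hypothetical and are not taken from the authors' code or data.

```python
from typing import Dict, List


def macro_metrics(y_true: List[int], y_pred: List[int], num_classes: int) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging scheme)."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in range(num_classes):
        # One-vs-rest counts for class c.
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {
        "accuracy": accuracy,
        # Macro averaging: unweighted mean over classes, so rare classes count equally.
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# Invented toy labels for three classes (not HAM10000 data).
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
result = macro_metrics(y_true, y_pred, num_classes=3)
print(result)
```

Under macro averaging, the F1-score is the mean of the per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall. On an imbalanced dataset, weighted or micro averaging would give different numbers, which is why the averaging scheme matters when comparing reported results.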
