Skin and subcutaneous diseases rank as the fourth leading cause of nonfatal disease burden worldwide, affecting a considerable proportion of individuals, with a prevalence ranging from 30 to 70% across all ages and regions. However, dermatologists are consistently in short supply, particularly in rural areas, and consultation costs are on the rise. As a result, the responsibility of diagnosis often falls on non-specialists such as primary care physicians, nurse practitioners, and physician assistants, who may have limited dermatological knowledge and training and consequently lower diagnostic accuracy. Store-and-forward teledermatology, in which digital images of the affected skin area (usually taken with a digital camera or smartphone) and other relevant medical information are transmitted from users to dermatologists, has therefore become increasingly popular as a way to expand the range of services available to medical professionals. The dermatologist then reviews the case remotely and advises on diagnosis, workup, treatment, and follow-up recommendations. Nonetheless, dermatology diagnosis faces three significant hurdles. First, there is a shortage of dermatologists available to diagnose patients, particularly in rural regions. Second, accurately interpreting skin disease images poses a considerable challenge. Lastly, generating patient-friendly diagnostic reports is usually a time-consuming and labor-intensive task for dermatologists.
Advancements in technology have led to the development of various tools and techniques to aid dermatologists in their diagnoses. For example, recent advances in deep learning (DL) have enabled artificial intelligence (AI) tools that aid in diagnosing skin disorders from images, including skin cancer classification, dermatopathology, prediction of novel risk factors or epidemiology, identification of onychomycosis, quantification of alopecia areata, and classification of skin lesions from mpox virus infection. Most of these studies have concentrated on identifying skin lesions in dermoscopic images. However, dermatoscopy is often not readily available outside of dermatology clinics. Some studies have explored the use of clinical photographs of skin cancer, onychomycosis, and skin lesions from educational websites. Nevertheless, these methods are tailored to particular diagnostic objectives framed as classification tasks, and they still require further analysis by dermatologists to issue reports and make clinical decisions. They cannot automatically generate detailed reports in natural language or support interactive dialogue with patients. At present, no diagnostic system allows users to self-diagnose skin conditions by submitting images and then automatically and interactively analyzes them and generates easy-to-understand text reports.
Over the past few months, the field of large language models (LLMs) has seen significant advancements, offering remarkable language comprehension abilities and the potential to perform complex linguistic tasks. One of the most anticipated models is GPT-4, a large-scale multimodal model that has demonstrated exceptional capabilities, such as generating accurate and detailed image descriptions, explaining atypical visual occurrences, constructing websites from handwritten textual descriptions, and even acting as a family doctor. Despite these remarkable advancements, GPT-4 is closed-source, some of its features are still not accessible to the public, and users must pay to access certain features through an API. As an accessible alternative, ChatGPT, also developed by OpenAI, has demonstrated the potential to assist in disease diagnosis through conversation with patients. By leveraging its advanced natural language processing capabilities, ChatGPT can interpret symptoms and medical history provided by patients and suggest potential diagnoses or referrals to appropriate dermatological specialists. However, it is important to note that most LLMs are currently limited to text-only interaction. The development of multimodal large language models for medical diagnosis is still in its early stages, even though image-based data are pervasive in medical diagnosis; dermatological diagnosis, in particular, is an important task that lacks research on diagnosis enhanced by multimodal large language models.
The idea of providing skin images directly for automatic dermatological diagnosis and generating text reports could greatly help solve the three aforementioned challenges in dermatology diagnosis. However, no existing method accomplishes this. In related areas, ChatCAD is one of the most advanced approaches; it designs various networks to analyze X-ray, CT, and MRI images and generate diverse outputs, which are then transformed into text descriptions. These descriptions are combined as inputs to ChatGPT to generate a condensed report and offer interactive explanations and medical recommendations based on the given image. However, its vision-text models are limited to specific tasks. Moreover, ChatCAD requires users to upload text descriptions through ChatGPT's API, which could raise data privacy issues because both the medical images and the text descriptions contain patients' private information. To address these issues, MiniGPT-4 is an open-source method that users can deploy locally to interface images with state-of-the-art LLMs and interact in natural language; it requires training only a small alignment layer, without fine-tuning either of the pre-trained large models. MiniGPT-4 aims to combine the power of a large language model with visual information obtained from a pre-trained vision encoder. To achieve this, the model uses Vicuna as its language decoder, which is built on top of LLaMA and is capable of performing complex linguistic tasks. To process visual information, it employs the same visual encoder used in BLIP-2, consisting of a ViT backbone combined with a pre-trained Q-Former. Both the language and vision models are open-source. To bridge the gap between the visual encoder and the language model, MiniGPT-4 uses a linear projection layer. However, MiniGPT-4 is trained on the combined Conceptual Captions, SBU, and LAION datasets, which are irrelevant to medical images, especially dermatological images. Therefore, it remains challenging to apply MiniGPT-4 directly to specific domains such as formal dermatology diagnosis. Moreover, owing to the licensing limitations of Vicuna, MiniGPT-4 cannot support commercial use, which could be improved by incorporating other state-of-the-art large language models.
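As a rough illustration of this bridging design, the minimal sketch below shows how a single trainable linear layer could map frozen Q-Former outputs into the embedding space of a frozen language decoder. The dimensions used here (32 query tokens of width 768 and an LLM hidden size of 5120 for a 13B LLaMA-family model) are assumptions for illustration, not values taken from the MiniGPT-4 or SkinGPT-4 implementations.

```python
import torch
import torch.nn as nn

# Illustrative assumptions: the BLIP-2 Q-Former emits 32 query tokens of width 768,
# and a 13B LLaMA-family decoder uses a hidden size of 5120.
NUM_QUERY_TOKENS = 32
QFORMER_DIM = 768
LLM_HIDDEN_DIM = 5120


class VisionLanguageProjector(nn.Module):
    """Trainable linear layer bridging a frozen visual encoder and a frozen LLM."""

    def __init__(self, qformer_dim: int = QFORMER_DIM, llm_dim: int = LLM_HIDDEN_DIM):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, qformer_tokens: torch.Tensor) -> torch.Tensor:
        # qformer_tokens: (batch, num_query_tokens, qformer_dim), produced by the
        # frozen ViT + Q-Former. The projected output can be prepended to the LLM's
        # text embeddings as "soft" visual prompt tokens.
        return self.proj(qformer_tokens)


if __name__ == "__main__":
    projector = VisionLanguageProjector()
    dummy_visual_features = torch.randn(1, NUM_QUERY_TOKENS, QFORMER_DIM)
    visual_prompt = projector(dummy_visual_features)
    print(visual_prompt.shape)  # torch.Size([1, 32, 5120])
```

Because only this small projection layer is trained, the alignment can be learned with far less compute and data than full fine-tuning of either pre-trained model would require.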
Inspired by current state-of-the-art multimodal large language models, we present SkinGPT-4, an interactive dermatology diagnostic system based on multimodal large language models (Fig. 1). SkinGPT-4 brings innovation on two fronts. First, SkinGPT-4 is a multimodal large language model aligned with Llama-2-13b-chat. Second, SkinGPT-4 is a multimodal large language model designed specifically for dermatological diagnosis. To implement SkinGPT-4, we designed a framework that aligns a pre-trained vision transformer with a pre-trained large language model, Llama-2-13b-chat. To train SkinGPT-4, we collected an extensive set of skin disease images (comprising 52,929 publicly available and proprietary images) along with clinical concepts and doctors' notes (Table 1). We designed a two-step training process to develop SkinGPT-4, as shown in Fig. 2. In the first step, SkinGPT-4 aligns visual and textual clinical concepts, enabling it to recognize medical features within skin disease images and express those features in natural language. In the second step, SkinGPT-4 learns to accurately diagnose the specific types of skin diseases. This comprehensive training methodology ensures the system's proficiency in analyzing and classifying various skin conditions; a toy sketch of such a two-step alignment procedure is given below. With SkinGPT-4, users can upload their own skin photos for diagnosis. The system autonomously evaluates the images, identifies the characteristics and categories of the skin conditions, performs in-depth analysis, and provides interactive treatment recommendations (Fig. 3). Meanwhile, SkinGPT-4's local deployment capability and commitment to user privacy also render it an appealing choice for patients in search of a dependable and precise diagnosis of their skin ailments. To demonstrate the robustness of SkinGPT-4, we conducted quantitative evaluations on 150 real-life cases, which were independently reviewed by board-certified dermatologists (Fig. 4 and Supplementary information). The results showed that SkinGPT-4 consistently provided accurate diagnoses of skin diseases. Although SkinGPT-4 is not a substitute for doctors, it greatly enhances users' understanding of their medical conditions, facilitates improved communication between patients and doctors, expedites the diagnostic process for dermatologists, facilitates triage, and has the potential to advance human-centered care and healthcare equity, particularly in underserved regions. In summary, SkinGPT-4 represents a significant leap forward in dermatology diagnosis in the era of large language models and a valuable exploration of multimodal large language models in medical diagnosis.
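The toy sketch below illustrates the overall shape of such a two-step alignment training, assuming that only the projection layer is updated while the vision encoder and Llama-2-13b-chat remain frozen. Every module, dimension, and data batch in it is a simplified stand-in for illustration, not SkinGPT-4's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins (assumed dimensions, not SkinGPT-4's real components): a frozen
# visual encoder, a trainable projection layer, and a frozen LLM output head.
QFORMER_DIM, LLM_DIM, VOCAB = 768, 5120, 32000

vision_encoder = nn.Linear(QFORMER_DIM, QFORMER_DIM)  # stand-in for frozen ViT + Q-Former
projector = nn.Linear(QFORMER_DIM, LLM_DIM)           # the only trainable component
llm_head = nn.Linear(LLM_DIM, VOCAB)                  # stand-in for the frozen LLM

for module in (vision_encoder, llm_head):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()


def run_stage(loader):
    """One alignment stage: image features in, target caption/diagnosis tokens out."""
    for visual_feats, target_tokens in loader:
        with torch.no_grad():
            feats = vision_encoder(visual_feats)       # frozen forward pass
        logits = llm_head(projector(feats))            # only the projector gets gradients
        loss = criterion(logits.flatten(0, 1), target_tokens.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Stage 1 pairs images with clinical-concept text; Stage 2 pairs images with
# diagnosis text. Here both stages reuse one random toy batch for demonstration.
toy_batch = [(torch.randn(2, 32, QFORMER_DIM), torch.randint(0, VOCAB, (2, 32)))]
run_stage(toy_batch)  # step 1: concept alignment
run_stage(toy_batch)  # step 2: diagnosis fine-tuning
```

In the actual system, the two stages differ only in the paired text (clinical concepts versus diagnoses and notes), so the same training loop structure can serve both steps.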