Medical Hypothesis Generation Using the r AskDocs Dataset - minalee-research/cs257-students GitHub Wiki
#MedicalNLP #LLMEvaluation #HealthcareAI #Finetuning
Yusen Wu, Xiaopeng Zhang, Chenghong Hu
Abstract:
Health is an important concern, yet visiting a doctor for every minor discomfort is impractical. Access to reliable preliminary medical guidance could therefore be very beneficial, and large language models (LLMs) have the potential to bridge this gap by generating medical hypotheses and offering general guidance based on symptom descriptions. In this project, we investigate the feasibility of using LLMs for this purpose by fine-tuning Llama-3.3-70B-Instruct-Turbo on r/AskDocs posts and responses from Reddit and comparing the instruction-tuned responses with both out-of-the-box and original responses. We use a combination of quantitative metrics and LLM-based evaluations to analyze the responses. Our results show that instruction-tuned responses successfully learn some features of the original responses on r/AskDocs, but our current evaluation methods cannot accurately assess response quality. In future work, better evaluation approaches that incorporate human ratings and judgments will therefore be required.
Here is the link to our project's GitHub repository: https://github.com/xpenguchi/Medical_Hypothesis_Generation_Using_the_r-AskDocs_Dataset.
What this project is about:
The goal of this project is to evaluate the ability of LLMs to generate medical hypotheses and general guidance based on symptom descriptions. Rather than providing formal diagnoses, the models aim to offer preliminary next-step advice, such as identifying potential conditions, assessing whether medical attention is necessary, and highlighting symptoms that may require urgent care. If effective, LLMs could serve as the foundation for AI-driven chatbots, improving healthcare accessibility—especially in regions with limited medical resources.
To achieve this, we leverage high-quality responses from the r/AskDocs community on Reddit, where users describe their symptoms and receive advice from others, including medical professionals. Responses that receive many upvotes are generally well-informed and serve as valuable training data.
Specifically, we load Llama-3.3-70B-Instruct-Turbo, a variant of Meta's Llama 3.3 70B model fine-tuned for instruction-following tasks, and fine-tune it on a dataset extracted from r/AskDocs. In this setup, the input consists of post titles and user descriptions, while the target output corresponds to high-quality responses. To evaluate model performance, we use a combination of numerical metrics (e.g., cosine similarity and ROUGE) and LLM-based assessments to compare generated responses with expert-written comments.
Ultimately, this project aspires to bridge the gap between medical expertise and public accessibility with AI tools. Through the thoughtful integration of high-quality community insights, our approach has the potential to significantly enhance healthcare accessibility and support early symptom assessment, particularly in underserved regions.
Approach
Main Approach:
Our approach focuses on evaluating the effectiveness of language models in generating preliminary medical guidance based on patient-described symptoms. Specifically, we fine-tune models using instruction tuning to improve response quality. We selected Llama-3.3-70B-Instruct-Turbo as our primary model and fine-tuned it on a dataset curated from the r/AskDocs community on Reddit.
To assess the performance of these models, we compare three types of responses: (1) out-of-the-box responses from the pretrained language models, (2) instruction-tuned responses trained on highly-rated expert replies from r/AskDocs, and (3) original responses provided by experienced patients or medical professionals on the platform. For instruction tuning, we formatted the dataset by combining the patient’s post title, symptom description, and the corresponding expert response into structured input-output pairs. This approach ensures that our models learn to interpret medical symptoms expressed in layman's terms while generating responses that align with the informative and accessible style of r/AskDocs experts.
For example, a typical input might look like:
Prompt: You are a supportive community member on a health related forum like Reddit’s AskDoc. Your goal is to provide concise, accurate, and empathetic replies in a single paragraph, reflective of both experienced patient insights or professional medical perspectives. When crafting your response, 1) Offer well-researched, evidence-based information whenever possible, 2) Include practical suggestions or remedies, and recommend seeing a healthcare provider when necessary 3) Use clear, accessible language while being mindful not to overstep your level of expertise. 4) Your reply must be exactly one paragraph, containing 50-250 words in total.
- Title: "Persistent cough for three weeks"
- Question: "I've had a dry cough for almost a month now. No fever, but I do feel a bit fatigued. Should I see a doctor or try over-the-counter meds first?"
- Expert Response: "A cough lasting more than three weeks could be a sign of an underlying issue, such as a post-viral cough, allergies, or even acid reflux. It’s advisable to consult a healthcare professional if it persists, especially if you experience shortness of breath, chest pain, or blood in your sputum. In the meantime, staying hydrated and trying an over-the-counter cough suppressant might help."
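A pair like the one above can be assembled programmatically. Below is a minimal sketch; the function name, the shortened `SYSTEM_PROMPT` constant, and the field names are our own illustration (the full prompt used in training is shown above), not the dataset's actual schema:

```python
# Assemble an instruction-tuning example from a single r/AskDocs post.
# SYSTEM_PROMPT here is an abridged stand-in for the full prompt above.

SYSTEM_PROMPT = (
    "You are a supportive community member on a health related forum "
    "like Reddit's AskDoc. Provide a concise, accurate, and empathetic "
    "reply in a single paragraph."
)

def format_example(title: str, question: str, expert_response: str) -> dict:
    """Combine the prompt, post title, and symptom description into a
    structured input-output pair for instruction tuning."""
    user_input = (
        f'{SYSTEM_PROMPT}\n- Title: "{title}"\n- Question: "{question}"'
    )
    return {"input": user_input, "output": expert_response}

example = format_example(
    "Persistent cough for three weeks",
    "I've had a dry cough for almost a month now. Should I see a doctor?",
    "A cough lasting more than three weeks could be a sign of an underlying issue...",
)
```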
Baseline:
Since our task is the first of its kind, there is no direct baseline for comparison. However, to evaluate the effectiveness of our fine-tuning approach, we compare instruction-tuned responses with both out-of-the-box responses from the pretrained model and the original replies on the platform, as described above. This comparison allows us to assess whether fine-tuning with expert-curated data improves response quality beyond what can be achieved through prompt engineering alone.
Novelty:
The novelty of our approach lies in leveraging the r/AskDocs dataset as a high-quality, community-driven source of medical advice. Unlike traditional medical datasets like PubMed, which often consist of formalized and clinical texts, r/AskDocs provides a unique blend of layperson symptom descriptions with responses from medical professionals and knowledgeable community members. This setup allows our fine-tuned models to bridge the gap between professional medical guidance and approachable, understandable advice for everyday users. Our approach offers a fresh perspective on utilizing LLMs for preliminary medical guidance while maintaining a practical and empathetic tone in the generated responses.
Experiments
Data:
Our dataset is sourced from an open dataset on Hugging Face that compiles posts from r/AskDocs (https://huggingface.co/datasets/ju-resplande/askD). It consists of 24.3K rows, each containing the post title, symptom description (used as training input), high-quality responses ranked by upvotes (used as training output), and the upvote counts themselves. This spares us the time needed for web scraping and initial cleaning.
To further refine the dataset, we apply the following additional data cleaning steps:
- Remove short responses: We eliminate answers with fewer than 25 tokens to ensure sufficient content for training.
- Filter out unwanted text patterns: Responses containing URLs, HTML tags, or other extraneous elements are discarded.
- Quality-based filtering: We select the top 19,000 responses based on upvote scores to prioritize high-quality expert replies.
- Train-test split: The final dataset is divided into approximately 15,000 training samples and 4,000 test samples to ensure robust model evaluation.
By leveraging this preprocessed dataset and applying additional filtering, we ensure that our fine-tuning data consists of high-quality, well-structured input-output pairs, preventing the "garbage in, garbage out" issue and enhancing model reliability.
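The filtering steps above can be sketched in a few lines. The column names, whitespace tokenization for the 25-token threshold, and the tiny sample rows are simplifying assumptions for illustration:

```python
import re

def clean_responses(rows, top_k=19_000, min_tokens=25):
    """Apply the cleaning steps: drop short replies, drop replies containing
    URLs or HTML tags, then keep the top-k responses by upvote score."""
    url_or_html = re.compile(r"https?://|<[^>]+>")
    kept = [
        r for r in rows
        if len(r["response"].split()) >= min_tokens
        and not url_or_html.search(r["response"])
    ]
    kept.sort(key=lambda r: r["upvotes"], reverse=True)
    return kept[:top_k]

# Tiny illustrative sample (not real data): the first row is too short,
# the second contains a URL, so only the last two survive.
sample = clean_responses([
    {"response": "too short", "upvotes": 99},
    {"response": "see https://example.com for details " * 10, "upvotes": 50},
    {"response": "plausible detailed advice " * 10, "upvotes": 10},
    {"response": "another plausible detailed reply " * 10, "upvotes": 20},
])
```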
Experimental Details:
We fine-tune Meta's Llama-3.3-70B-Instruct-Turbo model using full-model tuning, with training conducted via the Together AI API. The model is trained for two epochs using the AdamW optimizer, with batch size chosen based on memory constraints. During inference, we set the temperature to 0.7 to control response randomness, Top-P (nucleus sampling) to 0.7, and Top-K to 50. We observed overfitting, as shown by the growing gap between training and evaluation loss after the first epoch. For future fine-tuning, Low-Rank Adaptation (LoRA) could be a viable alternative to reduce computational costs while maintaining model performance.
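The sampling configuration above maps directly onto an OpenAI-compatible chat request payload. A minimal sketch follows; the model identifier string is an assumption, and no network call is made here:

```python
def build_inference_request(model: str, prompt: str) -> dict:
    """Build a chat request payload with the sampling parameters from our
    setup: temperature 0.7, top-p 0.7, top-k 50."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50,
    }

# Hypothetical model identifier, for illustration only:
payload = build_inference_request(
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "Title: Persistent cough for three weeks\nQuestion: ...",
)
```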
Evaluation method:
We use a mix of quantitative and qualitative methods to evaluate the model outputs. Quantitative metrics include cosine similarity and ROUGE scores, which provide insight into the alignment of generated responses with both expert answers and the prompted input itself.
- ROUGE Score evaluates text overlap using ROUGE-1, ROUGE-2, and ROUGE-L, assessing unigram, bigram, and longest common subsequence matches. Higher scores reflect better structural and informational alignment with expert responses.
- Cosine Similarity measures the alignment between generated and reference responses by calculating the cosine of the angle between their text embeddings. We use OpenAI’s text-embedding-3-large model to generate high-quality text embeddings for response similarity evaluation. This model is a state-of-the-art embedding system designed for capturing nuanced semantic relationships in text.
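Both metrics are easy to sketch. Below, cosine similarity over precomputed embedding vectors (of whatever dimension the embedding model returns) and a simplified ROUGE-1 recall over whitespace tokens; real evaluations typically use a dedicated library such as rouge-score:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rouge_1_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 recall: clipped unigram overlap divided by
    the number of reference unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[t], ref[t]) for t in ref)
    return overlap / max(sum(ref.values()), 1)
```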
While numerical metrics provide quantitative insights into model performance, these metrics have inherent limitations when evaluating medical guidance outputs. Numerical scores primarily measure surface-level similarities, such as token overlap or embedding proximity, but may fail to capture deeper semantic understanding, clinical relevance, or the nuanced quality of advice provided.
To address these limitations, we also incorporate LLM-based evaluation techniques. By utilizing OpenAI’s GPT-4o-mini model, we designed a set of evaluation prompts to assess relevance, accuracy, clarity, usefulness, and confidence of the generated medical advice. This approach leverages the broader contextual understanding of LLMs and their ability to simulate human-like judgment in evaluating the quality of output from our model, offering a more holistic and reliable assessment of model performance. Below is an example prompt used to evaluate relevance, with similar structures applied to other evaluation criteria.
Prompt: Given the patient's self-description and the provided advice, rate the relevance of the advice on a scale from 1 to 10, where 1 means the advice is completely unrelated to the patient's description, and 10 means the advice is perfectly aligned with the patient's concerns and condition.
- Patient's Self-Description: {patient_description}
- Provided Advice: {generated_advice}
- Provide your response in the following format: {{one sentence short explanation for reason}}\n{{numerical score from 1 to 10}}
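Because the evaluator is instructed to reply with a one-sentence explanation followed by a numerical score, the score can be recovered by parsing the final line of the reply. A sketch of this parsing (the regex-based convention is our own, not part of any library):

```python
import re

def parse_eval_reply(reply: str):
    """Split the evaluator's reply into (explanation, score 1-10).
    Expects the numerical score on the final non-empty line, per the
    format requested in the prompt."""
    lines = [ln.strip() for ln in reply.strip().splitlines() if ln.strip()]
    match = re.search(r"\b(10|[1-9])\b", lines[-1])
    if match is None:
        raise ValueError("no score found in evaluator reply")
    return " ".join(lines[:-1]), int(match.group(1))

# Illustrative evaluator reply:
explanation, score = parse_eval_reply(
    "The advice addresses the cough directly.\n9"
)
```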
Results of Numerical Metrics
We first compared (1) the output length, (2) the number of unique tokens, and (3) the average sentence length of the original responses, the outputs of the out-of-the-box model, and our fine-tuned model.
Comparison of Output Length: The output length comparison shows that the original responses are mostly short, the out-of-the-box model's outputs are overly long and inefficient, while the fine-tuned model strikes a balance between the two.
Comparison of Numbers of Unique Tokens: The comparison of unique tokens shows that the original responses have low vocabulary diversity, the out-of-the-box model produces the highest number of unique tokens, and the fine-tuned model produces fewer unique tokens, possibly due to repetition, temperature settings, or overfitting, all of which are common in language models.
Comparison of Average Sentence Length: The comparison of average sentence length shows that the out-of-the-box model produces the longest sentences, while the fine-tuned model's sentence lengths closely match the original responses, indicating successful fine-tuning.
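The three surface statistics compared above can be computed in a few lines; whitespace tokenization and a naive sentence split on '.', '!' and '?' are simplifying assumptions:

```python
import re

def surface_stats(text: str) -> dict:
    """Output length, unique-token count, and average sentence length
    for a single response, using whitespace tokens and a naive
    punctuation-based sentence split."""
    tokens = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "length": len(tokens),
        "unique_tokens": len({t.lower() for t in tokens}),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
    }

stats = surface_stats("See a doctor. Stay hydrated.")
```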
We then drew boxplots of their ROUGE-L scores and cosine similarities, shown as follows:
Comparison of ROUGE-L Score: The comparison of ROUGE-L scores shows that the out-of-the-box model has fewer outliers, indicating more stable text quality, while the fine-tuned model's results suggest that ROUGE-L may not be the best metric for this task, even though it performs better on a specific subset of the data.
Comparison of Cosine Similarity with Prompt: The comparison of cosine similarity with the prompt shows that the original replies often diverge significantly from the question, because humans have many ways of expressing answers to the same question; the fine-tuned model maintains moderate relevance, and the out-of-the-box model has the highest relevance, indicating that it adheres most strictly to the prompt.
Results of LLM-based Evaluation
The plots above display the LLM-based evaluation results for three response generation methods: (1) the fine-tuned model, (2) the out-of-the-box model, and (3) the original responses. The distributions appear highly similar across all four evaluation criteria—accuracy, usefulness, confidence, and relevance—with only slight variations.
Both the fine-tuned model and the original expert responses exhibit a wide distribution of scores, indicating variability in the quality and style of the advice provided. In contrast, the out-of-the-box model consistently receives high but tightly clustered scores, suggesting that its responses are more stable but potentially less diverse.
The similarity in distributions across different evaluation criteria raises concerns about the effectiveness of LLM-based evaluation. The model does not appear to differentiate well between different quality dimensions (e.g., accuracy vs. usefulness). To assess the validity of the LLM-based evaluation scores, we tested their predictive power by using them to estimate the upvotes of the original posts, which serve as a proxy for response quality. Ideally, a meaningful evaluation metric should correlate with human judgment to some extent, as reflected in upvote counts. However, regression analysis revealed that all LLM-generated scores were statistically insignificant predictors. This suggests that the evaluation method may lack the granularity needed to provide meaningful distinctions, highlighting the need for human validation or improved evaluation prompts to enhance the reliability of scoring.
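The predictive-power check can be sketched as a univariate least-squares regression of upvotes on an evaluation score, judging significance by the slope's t-statistic. This is a pure-NumPy sketch on synthetic numbers (in practice a package such as statsmodels would report p-values directly):

```python
import numpy as np

def slope_t_stat(scores, upvotes):
    """OLS of upvotes on a single evaluation score; returns
    (slope, t-statistic of the slope). A |t| well below ~2 suggests the
    score has no predictive power, as we observed for the LLM scores."""
    x = np.asarray(scores, dtype=float)
    y = np.asarray(upvotes, dtype=float)
    n = len(x)
    X = np.column_stack([np.ones(n), x])       # intercept + score
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - 2)   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

# Synthetic example with a strong, nearly linear relationship:
xs = list(range(10))
ys = [3 * s + (0.1 if s % 2 == 0 else -0.1) for s in xs]
slope, t = slope_t_stat(xs, ys)
```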
Example Output
Title | Description | Out-of-the-box | Fine-tuned | Original |
---|---|---|---|---|
Why did I develop an aversion to strong alcohol? | I am male, 24 years old, 6 ft 0 inch tall. I am not taking any medications. I am not a smoker. About two years ago I had a bad day, and decided foolishly to drown my sorrows by getting wasted. I have downed about 375 mL of 40% alcohol in a very short amount of time. I blacked out and later remember puking everything I drank, getting alcohol poisoning. I eventually recovered, but ever since then when I try to drink any alcohol that is 40% ethanol content, I always have a strong aversion off it when I ingest it, primarily from the smell of it. By aversion I mean the smell of it when drinking it makes me shudder involuntarily, and it's very unpleasant to swallow. I can still handle drinking the weaker alcohols like wine and beer without that issue. Can anyone explain what neurological issue I may have? (or I guess this may be a blessing since now I can avoid the evils of alcohol haha). Thank you in advance. | NAD, but I'm assuming it's just because of the association with the bad experience. I once had a bad experience with a certain type of wine (too much, too fast, on an empty stomach, and I was already stressed out), and ever since then I can't stand the smell of it. | It's not uncommon for individuals to develop an aversion to certain substances, including strong alcohol, after a traumatic experience like alcohol poisoning. This phenomenon can be attributed to classical conditioning, a concept introduced by Ivan Pavlov, where the brain associates the smell or taste of a particular substance with the unpleasant memories or physical sensations experienced during the traumatic event. In your case, the smell of 40% ethanol content alcohol may have become linked to the nausea, vomiting, and discomfort you experienced during the alcohol poisoning episode. This conditioned response can lead to a strong aversion, causing an involuntary shudder and discomfort when attempting to consume it.<br>While it's great that you can still tolerate weaker alcohols like wine and beer, it's essential to consult with a healthcare provider to rule out any underlying neurological issues. However, it's likely that your aversion is a psychological response rather than a neurological disorder, and it may indeed be a blessing in disguise, helping you avoid excessive alcohol consumption and potential health risks associated with it. | Not a neurological issue. As other replies have stated, it’s a type of conditioning you’ve performed on yourself. If you become ill from undercooked chicken or oysters, you’d have a very difficult time stomaching those for awhile because your brain is telling you “poison!”. It’s rather difficult to undo and takes a long time typically, but nothing to be concerned about :) Perfectly normal. |
How can I make sure my mild scoliosis doesn't get worse? | Hi there, 18M here. I had an x-ray a few months ago and I have mild thoracic scoliosis, less that 8 degrees bent to the right. The right side of my back is visibly larger than my left side due to this bend. It doesn't cause many problems, but I am anxious about it. When I'm standing I often feel like I can feel a tension within my right leg (the tendon behind the knee on the inner leg), and my right foot is also naturally pointing more outwards than my left. I try to keep a good posture sitting and standing, but it can become a little straining for my lower right back muscles keeping it so straight for so long. I think I may have developed a bit of a "donald duck" posture in order to try and correct it. What can I do to make sure this doesn't get worse and become a problem in the future? I'm free for the whole summer to work on this physically; I plan on doing much more exercise and workouts. I'm worried about back problems as they are so common. Thanks very much! | I am not a doctor but I have scoliosis. I'm 23 and it was diagnosed when I was 12. I had it checked every 6 months for a few years and then every year for a few years and now I just go every 2 years. My curve is about 22 degrees and it hasn't changed in years. I don't have any pain from it but I do have pain from the way I stand/sit to compensate for the curve.I would say the most important thing is to stay active and do core exercises. I'm not sure what kind of workouts you're planning on doing but I would recommend yoga. It has really helped me with my posture and back pain. It also helps with flexibility and balance. I'm not sure if it would help with the curve itself but it can't hurt. | As someone with mild thoracic scoliosis, it's great that you're taking proactive steps to manage your condition, and with a curvature of less than 8 degrees, you're on the right track.<br>To prevent progression, maintaining good posture, engaging in regular exercise, and incorporating scoliosis-specific stretches and strengthening exercises can be beneficial. Consider consulting with a physical therapist or orthopedic specialist to develop a personalized exercise plan that addresses your specific needs, including strengthening your core and back muscles to improve posture and reduce strain on your lower right back. Additionally, exercises like yoga or Pilates can help improve flexibility and balance, while also promoting good posture. It's also essential to be mindful of your body positioning during daily activities, such as sitting and standing, to avoid putting unnecessary strain on your spine. Given your concerns about back problems, it's a good idea to schedule a follow-up appointment with your healthcare provider to discuss your exercise plans and any potential modifications to ensure you're not exacerbating your condition, and to monitor your curvature and address any concerns or changes you may experience. | Not a doctor, but I had/have minor scoliosis.What helped me cure my scoliosis was competitive swimming and water polo. I used to slouch forward and bring my shoulders together. After swimming for many years, my posture was great. I always had a straight back even without trying. After many years pursuing higher education, the long lectures are hurting my posture once again.So maybe you can try swimming to improve posture. If you're a beginner start with 5-10 minutes of consecutive swimming with little to no breaks. Try to go up to 15-30 min and even longer if you can. Start with freestyle strokes with this modification: when you raise your arm above water, make sure to touch your armpit with your thumb before your hand reaches the peak of its motion. This forces your body to learn the proper form of the freestyle stroke. Once you get that stroke down you can start learning other strokes. Goodluck!TL;DR - swim |
Caught a virus after receiving a vaccine - will my body have a harder time forming antibodies to my vaccine? | 28 Caucasian female, 5’3” and 110lbs. No medications, no medical history, non-smoker.Length of complaint: none. Received vaccines 3 weeks ago.I got travel vaccines (hepatitis A and Typhoid) and 2 days later I caught an URI. I know these are unrelated.My question is, did my body have a harder time forming antibodies to the vaccine due to it being stressed from a virus? Is it possible I could have less immunity coverage? | I don't think you should be concerned. I don't think there's any evidence to show that the immune system can't handle more than one pathogen at a time. It's pretty common for us to have a number of infections at once without even realizing it, and our immune system is more than capable of dealing with it.I don't think you'll have less immunity coverage. | It's understandable to have concerns about the effectiveness of your vaccines after catching a virus, but fortunately, the immune system is capable of handling multiple challenges simultaneously. Research suggests that concurrent viral infections, like a URI, are unlikely to significantly impact the immune response to vaccines, including travel vaccines like hepatitis A and Typhoid. However, it's possible that the immune response might be slightly delayed or weakened. To put your mind at ease, it's recommended to consult with your healthcare provider or the doctor who administered the vaccines to discuss your specific situation and determine if any additional measures, such as booster shots, are necessary. In general, it's still likely that your body has formed adequate antibodies to provide immunity, but a healthcare professional can provide personalized guidance and reassurance. | No.<br>The cells which respond to one microbe by making antibodies are not the cells which respond to another microbe, unless these microbes are closely related (each lymphocyte has its own set of antigens that it responds to). So unless you are critically ill (in a hospital on a ventilator etc), one infection does not affect the immunogenicity of another (for the most part, unless you are a researcher evaluating subtle differences in immune response). |
10+ doctors have told me they don't know what's wrong with me. Let's discuss. | So I've been to see a lot of doctors and specialists and no one can help me. At this point I feel like the best thing I can do is ask the general public and throw ideas at the doctors. I don't care how many tests they run, so long as someone finds the issue.Let's start with the basics. I'm a 24 year old female, 5'4, 160lb. I don't smoke, and I drink less than a glass a month. I'm allergic to most anti-psychotics and I have the mthfr gene mutation. I'm currently taking a b vitamin + folic acid mixture and .5mg lorazepam as needed, (up to three a day,) for anxiety and seizures. I've been diagnosed with schizophrenia, social anxiety disorder, and hypersomnia. I've had problems for as long as I can remember, and the problems have gotten much worse with age. By problems, I mean the following symptoms: hallucinations, lack of coordination in legs, (stumbling, falling, walking into doorways,) occasional slurred speech and stuttering, (gets worse before and after seizures or panic attacks,) occasional seizures and panic attacks with no known cause, severe pain in legs, (both calves and thighs, usually in nerves or muscles or both,) heavy sleeping for long periods of time, (24+ hours of uninterrupted sleep, and fatigue if awoken before 12 hours is reached,) depression, forgetfulness, and probably more I can't remember at the moment.I've worked very hard to get over these symptoms, with no good coming from it. I'm tired of people telling me I need to exercise more, eat better, or simply 'work harder.' I exercise as much as I can every week, I'm pescetarian with a preference for raw vegetables, and I've pushed myself to the point of ending up in a hospital too many times to count. So let's discuss this. I'd like to hear some real advice for once. | It is possible that you have a mitochondrial disorder. There are many different kinds, and they can be very difficult to diagnose.<br>They often present with a wide range of symptoms, including fatigue, muscle weakness, seizures, and a variety of other neurological symptoms. They can be very difficult to diagnose, and often require genetic testing, muscle biopsies, and other invasive tests. Some of the more common mitochondrial disorders include MELAS, MERRF, and Kearns-Sayre syndrome. | I can sense the frustration and desperation in your post, and I'm here to offer a supportive and non-judgmental space to discuss your concerns. Given your complex medical history and multiple diagnoses, it's essential to approach your situation with a holistic perspective. The MTHFR gene mutation and schizophrenia diagnosis may be contributing factors to your symptoms, and it's crucial to work with a healthcare provider to optimize your treatment plan. I recommend seeking a second opinion from a specialist, such as a neurologist or a geneticist, to further investigate the underlying causes of your seizures, heavy sleeping, and severe pain. In the meantime, consider keeping a symptom journal to track your episodes and potential triggers, which may help identify patterns and correlations. Additionally, exploring alternative therapies like cognitive-behavioral therapy, mindfulness, or physical therapy may help alleviate some of your symptoms. While it's understandable to feel concerned about your prognosis, it's essential to focus on managing your symptoms and improving your quality of life rather than speculating about potential outcomes. I encourage you to continue advocating for yourself and seeking medical attention until you find a provider who can offer you the guidance and support you deserve. | NAD, but isn’t this all your conditions (anxiety, schizophrenia, hypersonmia) combined together? Along with epilepsy, and yes, you should be on medication for that,which other people have mentioned. I don’t mean to sound unsympathetic, i just don’t get what you’re asking. I may be reading the post wrongly though.<br>But I’m genuinely sorry you have to go through this, i hope you find help from this post |
Subjective Analysis: We can see that LLMs actually do a decent job of replying to the posts and questions, though this is not reflected in the evaluation metrics. The out-of-the-box model often provides a personal, anecdote-based response with limited medical authority, while the fine-tuned model delivers a comprehensive, educational explanation using relatively accurate psychological concepts. Interestingly, in the last post, the fine-tuned output demonstrates warm sympathy towards the user's situation compared to the out-of-the-box output.
Future Work
A major challenge in our project is the lack of reliable evaluation metrics for assessing the quality of model-generated medical responses. Both numerical metrics and LLM-based evaluation failed to provide meaningful distinctions. This makes it difficult to accurately determine the true impact of fine-tuning.
To address this limitation, future work should focus on developing more robust evaluation methodologies. One approach is to integrate human expert annotations, where medical professionals assess responses based on key quality criteria such as accuracy, clarity, and usefulness. These expert ratings could serve as a gold standard for evaluating AI-generated medical advice.
Another promising direction is to develop a machine learning-based evaluation model, trained on human-labeled responses. By collecting a large dataset of expert-annotated medical responses, we could fine-tune an evaluation model that predicts response quality with greater reliability than general-purpose LLMs. Such a model could capture subtle variations in medical relevance and accuracy that standard LLM-based scoring overlooks. With that said, both of these approaches require some level of human involvement, which is reasonable given that the model provides medical advice to patients. Ensuring that evaluation methods align with human judgment is essential for maintaining trust and reliability in AI-assisted medical guidance.