Medical Hypothesis Generation using the Reddit AskDocs Dataset
#application
Siya Kalra, Dhirpal Shah, Anika Bansal
Abstract
With more people turning to online forums like Reddit's r/AskDocs for medical advice, it is crucial to understand whether AI can provide useful preliminary health assessments. This project explores the effectiveness of natural language processing in identifying potential medical conditions from user-generated posts. We scraped and processed AskDocs data and used a large language model to generate two types of responses: a list of possible diagnoses with test recommendations and a Reddit-style physician reply. To evaluate these responses, we compared them against the actual top-voted physician comments using several similarity metrics. Our results indicate that real physician responses differed substantially from the AI-generated predictions. The highest similarity score, measured with BLEURT, was 0.55 on a 0-to-1 scale, suggesting some alignment but highlighting notable gaps between AI and human expertise. These findings point to the limitations of AI in replicating expert medical judgment and emphasize that while AI may serve as a preliminary tool, it cannot yet replace professional consultation for accurate diagnoses and medical guidance.
What This Project Is About
Many people seek medical advice online before consulting a doctor, often turning to forums like Reddit’s r/AskDocs, where verified physicians respond to health-related questions. Our project explores whether NLP can help analyze these discussions and generate responses that align with expert medical opinions. Specifically, we aim to determine if a commonly used large language model can provide reliable preliminary diagnoses and recommendations based on user-generated symptom descriptions.
To achieve this, we scraped and processed posts from r/AskDocs, focusing only on those with at least one verified physician response. We designed two AI-generated responses using prompt engineering: one that lists possible diagnoses and recommended tests, and another that mimics a Reddit-style physician reply. We then compare these AI-generated responses to the top physician comment on each post using a similarity metric, helping us assess how closely the model’s recommendations align with real medical advice.
Our methodology involves leveraging the TogetherAI API to prompt an LLM (Llama-3.3-70B-Instruct-Turbo) with user posts describing symptoms. The first prompt instructs the model to generate a list of potential diagnoses with justifications and recommended tests. The second prompt asks the model to write a response in the style of a verified physician on Reddit. We then use text similarity metrics, including cosine similarity, Jaccard similarity, BLEU, ROUGE-L, and BLEURT, to compare these responses to the top physician comment, providing insight into the AI's ability to produce medically relevant and human-like answers.
By evaluating AI’s effectiveness in this context, we aim to understand whether NLP can play a role in improving healthcare accessibility, especially for those with limited access to professional medical advice. While our project is ongoing, this research contributes to discussions on AI’s reliability in medical contexts and the challenges of extracting meaningful insights from unstructured online discussions.
Methods
Since our original proposal, we have made significant progress in data collection, preprocessing, prompt engineering, and evaluation metric development. Our work has focused on building an AI system capable of analyzing discussions from r/AskDocs to generate medically relevant responses. Below, we outline our approach and the experiments conducted.
Approach
For our project, we designed two AI-generated response formats. The first, a diagnosis-based prompt, asks the model to list possible medical conditions based on symptoms and suggest diagnostic tests. The second, a Reddit-style physician reply, generates responses that mimic how a verified physician would reply to a post on the platform. These responses were generated through TogetherAI's API with the Llama-3.3-70B-Instruct-Turbo model.
Preprocessing
A major challenge we encountered was preparing a usable dataset. The r/AskDocs dataset was originally in a deeply nested and inconsistent JSON format, making it difficult to extract structured data. To address this, we considered using an alternative Hugging Face dataset designed for fine-tuning medical advice models. This dataset was curated for AI training but lacked real-world, conversational medical discussions. Ultimately, we committed to parsing and structuring the r/AskDocs dataset, as it provided authentic user inquiries and physician responses, making it a better fit for our study. However, this process required extensive data cleaning, formatting, and standardization, significantly extending our timeline.
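For illustration, below is a minimal sketch of the flattening step. The field names (`comments`, `author_flair_text`, `selftext`) are assumptions about the dump's schema rather than the exact structure of our files, and the real parsing code handles additional nesting and edge cases.

```python
import json
import pandas as pd

def flatten_posts(path):
    """Flatten a nested r/AskDocs dump into one row per post with its top physician comment."""
    with open(path) as f:
        posts = json.load(f)
    rows = []
    for post in posts:
        # Keep only posts with at least one verified-physician reply
        physician_comments = [
            c for c in post.get("comments", [])
            if "physician" in (c.get("author_flair_text") or "").lower()
        ]
        if not physician_comments:
            continue
        # Use the highest-scoring physician comment as the baseline reply
        top = max(physician_comments, key=lambda c: c.get("score", 0))
        rows.append({
            "post_id": post.get("id"),
            "title": post.get("title", ""),
            "body": post.get("selftext", ""),
            "physician_reply": top.get("body", ""),
        })
    return pd.DataFrame(rows)
```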
Response Scraping
To assess AI-generated responses, we established a baseline comparison against real physician comments. Specifically, we only analyzed posts where a verified physician responded, selecting the top comment as our baseline. This allows us to measure how well our AI-generated responses align with expert medical advice. We started by scraping 100 Reddit posts and then scaled up to 1,000 posts with a verified physician response. We evaluated the LLM's responses to two different prompts. The first is a general prompt that asks the LLM to act like a doctor evaluating symptoms and to recommend diagnostic tests or evaluations. The second aims to produce a "Reddit-style" response that mimics a comment one might actually find in the thread: we ask the LLM to act as a verified physician who is a Reddit user, telling it specifically that it is replying with a comment in the r/AskDocs thread. The two prompts we used were as follows (a sketch of how we send them to the model appears after the list):
- "Based on the following symptoms, list possible medical conditions and recommend the next diagnostic tests or evaluations a doctor might order. Symptoms: {body}. Include brief justifications for each recommendation."
- "You're replying to a Reddit post in r/AskDocs. The post asks: {body}. Write a comment like a real Reddit user who is a verified physician."
Our project is novel in its application of AI-driven medical response generation within a real-world social media context. While prior research has examined NLP models for processing structured medical texts, our work extends this by exploring AI’s effectiveness in informal, user-generated discussions. Unlike traditional medical chatbots, which rely on controlled inputs, our model must adapt to diverse phrasing, incomplete symptom descriptions, and varying levels of medical literacy among Reddit users. This makes our study particularly valuable in assessing whether AI can serve as a scalable and accessible tool for preliminary health guidance in online forums.
Experiments
Data Description
The dataset utilized for this analysis comprises medical-related posts from an online platform, with responses generated by two different automated systems and actual physician comments. Each post ID is associated with three types of responses: responses generated from the naive automated system (N), responses generated using our Reddit-prompted model (R), and the actual physician responses we scraped (P). This dataset is particularly suited for the task as it allows for the comparison of the quality and relevance of automated responses against actual expert responses, which is crucial in medical informatics where accurate information dissemination is vital.
The primary task associated with this dataset is to evaluate and compare the similarity of automated responses to those of actual physicians, aiming to assess how closely automated systems can mimic or replicate expert-level advice in a medical context. This comparison is critical to determine the viability of deploying automated systems for providing preliminary medical advice online.
Evaluation Method
For evaluation, several metrics were utilized:
- Cosine Similarity: Measures the cosine of the angle between two non-zero vectors of an inner product space, helping to determine the similarity irrespective of their size.
- Jaccard Similarity: Evaluates the similarity between finite sample sets, defined as the size of the intersection divided by the size of the union of the sample sets.
- BLEU Score: Originally designed to assess machine-translated text, it scores the precision of matched n-grams in the candidate text against the reference text.
- ROUGE-L Score: Focuses on the longest common subsequence between a candidate translation and a reference translation, useful for evaluating the fluency and intent preservation in generated text.
After reading through the peer feedback we received, we decided to include BLEURT as a second measure of semantic similarity.
- BLEURT Score: Fine-tuned on human judgment data, it captures semantic similarity by leveraging contextual embeddings and provides human-like judgments on fluency and adequacy.
These metrics collectively help in comprehensively understanding the text similarity and quality from different aspects, providing a rounded evaluation of the response systems.
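To make the lexical metrics concrete, here is a minimal sketch of how a single candidate/reference pair can be scored, assuming scikit-learn, NLTK, and the `rouge_score` package. BLEURT is not shown because it requires downloading a trained checkpoint from the google-research `bleurt` repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def similarity_scores(candidate, reference):
    """Score one generated response against one physician reply with four lexical metrics."""
    # Cosine similarity over TF-IDF vectors
    tfidf = TfidfVectorizer().fit_transform([candidate, reference])
    cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

    # Jaccard similarity over word sets
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    jaccard = len(cand & ref) / len(cand | ref)

    # BLEU with smoothing, since exact n-gram overlap is rare in free-form text
    bleu = sentence_bleu([reference.lower().split()], candidate.lower().split(),
                         smoothing_function=SmoothingFunction().method1)

    # ROUGE-L: F-measure of the longest common subsequence
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

    return {"cosine": cosine, "jaccard": jaccard, "bleu": bleu, "rouge_l": rouge_l}
```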
Experimental Details
The experiment involved computing the aforementioned similarity scores between the different response types for each post. This setup provided a straightforward, quantitative comparison of how the generated responses stack up against actual physician comments. The computational tasks were handled using Python libraries such as Pandas for data manipulation, scikit-learn for vectorization and similarity computation, NLTK for text processing, and Matplotlib for visualization.
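A sketch of how these scores can be aggregated across the dataset and plotted as boxplots is shown below. The CSV filename and column names (`naive_response`, `reddit_response`, `physician_reply`) are hypothetical placeholders, and `similarity_scores` refers to the helper sketched above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical table: one row per post with the naive (N), Reddit-prompted (R),
# and physician (P) responses in separate columns.
df = pd.read_csv("askdocs_responses.csv")

pairs = {
    "N vs P": ("naive_response", "physician_reply"),
    "R vs P": ("reddit_response", "physician_reply"),
    "N vs R": ("naive_response", "reddit_response"),
}

# Collect one metric (cosine here) per post for each comparison
scores = {
    label: [similarity_scores(row[a], row[b])["cosine"] for _, row in df.iterrows()]
    for label, (a, b) in pairs.items()
}

plt.boxplot(list(scores.values()), labels=list(scores.keys()))
plt.ylabel("Cosine similarity")
plt.title("Similarity between response types")
plt.show()
```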
Results
To evaluate our results, we looked at both qualitative and quantitative measurements.
For example, here is one post from r/AskDocs that we analyzed. It discusses the symptoms of a man struggling to breathe and asks for advice on his condition. As you skim the paragraph, you can see a fair number of spelling and grammatical errors, which highlights the raw, unedited nature of Reddit posts.
Image 1: Sample post from r/AskDocs
Image 2: Generated responses and real Reddit response
Above, we show the responses generated by the LLM as well as the top-rated comment in the subreddit thread. Skimming through the results, we can see that the response from the general prompt is much more formal in word choice and sentence structure, while the Reddit-style response is much more informal and empathetic. The top-rated comment is much shorter in length but also informal and empathetic, using phrases like "i'm sorry that you have to experience it". All three responses mention COPD as one of the most likely conditions, which gives us confidence in the precision of the LLM results.
The quantitative results were visualized using boxplots to represent the distribution of similarity scores across different metrics and response comparisons. Initial findings suggest varying levels of similarity:
Similarity Scoring
Figure 1: Boxplots with different similarity scores across all three response types
Our analysis revealed significant insights from the similarity metrics used to compare automated medical responses to those written by physicians. One notable outcome concerns the BLEURT metric, which was integrated following peer feedback. BLEURT consistently produced the highest similarity scores across all comparisons, indicating that it is better suited than the surface-overlap metrics to capturing the semantic nuances and deeper contextual relationships in medical dialogue.
Interestingly, the BLEU score, traditionally valued in translation tasks, registered nearly zero across all tests. This result underscores the challenges of using BLEU for assessing responses in specialized domains like medicine, where exact n-gram matches are less likely due to the variability and specificity of medical terminology.
Among the generated responses, those from the Naive and Reddit-Prompted prompts showed the highest similarity to each other. This is not surprising: both are produced by the same underlying model with different prompts, so they retain a level of linguistic and stylistic coherence.
However, when comparing both Naive and Reddit-Prompted responses to the actual Physician responses, there is a significant drop in similarity scores. This was observed across all metrics but was especially pronounced in the more traditional measures such as Jaccard and Cosine similarities. These findings highlight the gap between current automated systems and the complex, nuanced communication typically employed by medical professionals.
The distinctiveness of the physician responses emphasizes the need for further enhancements in NLP models used for generating medical advice. The significant difference points to potential improvements from training models that more accurately mirror the depth and specificity of professional medical knowledge.
Response Length
Figure 2: Comparison of word count as a histogram
Figure 3: Comparison of word count as a boxplot
In Figures 2 and 3, which compare response length, we analyze the word and character counts for three types of responses: LLM-generated responses from Prompt 1 (diagnostic recommendations), LLM-generated responses from Prompt 2 (Reddit-style physician comments), and actual physician comments. By visualizing these distributions through histograms and boxplots, we assess whether the LLM produces longer, shorter, or comparable responses to real physician comments, providing insights into verbosity and completeness. From the graphs, we can see that on average, the true comments on Reddit were much shorter than the LLM-generated ones. As expected, LLM responses from the Reddit-style prompt were shorter than those from the diagnostic recommendation prompt, aligning with the goal of mimicking real Reddit comments more closely.
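A minimal sketch of the length comparison is below, reusing the hypothetical table and column names from the earlier sketch.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("askdocs_responses.csv")  # same hypothetical table as above

columns = {
    "Diagnostic prompt": "naive_response",
    "Reddit-style prompt": "reddit_response",
    "Physician comment": "physician_reply",
}
word_counts = {label: df[col].str.split().str.len() for label, col in columns.items()}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for label, counts in word_counts.items():
    ax1.hist(counts, bins=30, alpha=0.5, label=label)  # histogram view (Figure 2)
ax1.set_xlabel("Word count")
ax1.legend()
ax2.boxplot(list(word_counts.values()), labels=list(word_counts.keys()))  # boxplot view (Figure 3)
ax2.set_ylabel("Word count")
plt.tight_layout()
plt.show()
```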
Confidence Score
Figure 4: Confidence Score as a histogram
Figure 5: Confidence Score as a boxplot
In Figures 4 and 5, we evaluate how the LLM expresses certainty vs. uncertainty in its medical responses. Using a predefined set of certainty words (e.g., "definitely," "certainly") and uncertainty words (e.g., "might," "could"), we compute a confidence score for each response type. A positive score indicates more confident phrasing, while a negative score suggests a tendency toward hedging language. By comparing confidence distributions across physician and LLM responses, we determine whether the LLM exhibits excessive caution, overconfidence, or an appropriate balance. The results help assess how closely the LLM aligns with human medical communication patterns. Comparing confidence distributions across LLM-generated and physician responses, we observe that all three response types have average confidence scores close to zero or slightly negative, indicating a neutral or mildly cautious tone. However, physician comments exhibit a much wider spread in confidence scores, reflecting greater variation in how doctors communicate certainty or uncertainty depending on the medical context.
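The confidence score can be sketched as follows. The word lists shown are illustrative rather than our exact lexicon, and normalizing by response length is one reasonable choice of scaling rather than the definitive formula we used.

```python
import pandas as pd

df = pd.read_csv("askdocs_responses.csv")  # same hypothetical table as above

# Illustrative word lists; the actual lexicon is larger
CERTAINTY_WORDS = {"definitely", "certainly", "clearly", "undoubtedly"}
UNCERTAINTY_WORDS = {"might", "could", "possibly", "perhaps", "may", "unsure"}

def confidence_score(text):
    """Certainty-word count minus uncertainty-word count, normalized by response length."""
    tokens = [tok.strip(".,!?") for tok in text.lower().split()]
    if not tokens:
        return 0.0
    certain = sum(tok in CERTAINTY_WORDS for tok in tokens)
    uncertain = sum(tok in UNCERTAINTY_WORDS for tok in tokens)
    return (certain - uncertain) / len(tokens)

# Positive mean = more confident phrasing, negative mean = more hedging
for col in ["naive_response", "reddit_response", "physician_reply"]:
    print(col, df[col].apply(confidence_score).mean())
```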
Future Steps
As we continue to refine our automated medical response systems, our efforts will be concentrated on three pivotal areas of development to enhance the efficacy and relevance of our models.
Firstly, we will undertake the fine-tuning of large language models with a specific focus on the medical domain. This initiative is designed to significantly improve how our models comprehend and generate responses that are not only accurate but also highly relevant to medical inquiries. By adapting our models to better understand the nuances of medical dialogue, we aim to bridge the gap between general language processing and specialized medical knowledge.
Secondly, the project will advance the application of semantic analysis techniques. By harnessing state-of-the-art technologies such as transformers and sophisticated deep learning models, our goal is to elevate the depth and precision of our semantic evaluations. This step is crucial in ensuring that our systems go beyond mere keyword matching to fully grasp the underlying contexts within medical dialogues.
Lastly, recognizing the critical role of user-centric design, we will establish a robust feedback loop with real users, including healthcare professionals and patients. This continuous feedback mechanism will allow us to refine our system iteratively, making adjustments based on direct usage insights and feedback. This process will ensure that our solutions are not only effective but also tailored to meet the real-world demands of healthcare applications.
Through these focused efforts, we anticipate our systems will become more adept at handling the complexities of medical consultations, thereby supporting healthcare providers and enhancing patient care.