1MEIC04T03: Speech2Req - FEUP-MEIC-DS-2024-25/ai4sd GitHub Wiki
The main goal of our product assistant is to automatically generate project requirements by processing the audio recording of a meeting between stakeholders. The assistant transcribes the conversation and groups the information into relevant themes and epics, streamlining the requirements engineering process and enhancing clarity in communication.
Vision
The vision for the MP3 Transcription and Summarization App is to revolutionize how professionals and individuals interact with audio data by providing a streamlined, AI-driven solution that turns complex audio content into organized, accessible text insights. This product exists to eliminate the time-consuming task of manual transcription and make audio content readily available for deeper understanding and decision-making.
Our mission is to create an intuitive, accessible, and efficient tool that serves as an “invisible assistant” to users, handling the transcription and summarization of audio files with ease. This tool empowers users—from researchers and journalists to business professionals—to quickly access and reference key points from conversations, meetings, or interviews. By transforming audio into text automatically, it enables users to focus on higher-level tasks without the distractions of note-taking or manual transcription.
Research
While researching existing AI tools that could serve as a reference for our work, we examined the following:
Airfocus
What it is
Airfocus is a prioritization and roadmapping tool designed to help product teams streamline their project management processes. It enables users to import tasks, features, or user stories from various sources, apply customizable scoring frameworks, and create visual roadmaps that communicate project timelines and priorities effectively. Airfocus supports collaboration among teams by providing a centralized platform for decision-making and planning.
Pros
- Transformation of Ideas into Requirements: Airfocus excels at turning high-level ideas into structured requirements, including epics, themes, and user stories. This capability ensures that project elements are clearly defined and organized, facilitating better planning and execution.
- Effective Grouping and Prioritization: The tool allows teams to categorize their project items intuitively, enabling them to make informed prioritization decisions based on their specific criteria and frameworks. This structured approach helps ensure that the most critical features are developed first, aligning with the overall product vision.
- Visualization of Roadmaps: Airfocus offers a visually appealing interface that aids in the creation of clear and engaging roadmaps, making it easier to communicate project status and future plans to stakeholders.
Cons
- Need for Preprocessing: Despite its advantages, Airfocus requires preprocessing of text transcripts generated by tools like Whisper before they can be effectively imported and utilized. This additional step may complicate the workflow, as teams need to invest time and effort in structuring the data appropriately for Airfocus to function optimally.
- Closed API: Airfocus's API is not open source, which may limit the flexibility for teams wanting to customize integrations or automate certain aspects of their workflows. This can hinder the tool's adaptability to specific project needs, particularly for teams with unique requirements.
In conclusion, while Airfocus provides valuable tools for prioritization and project management, its reliance on preprocessing and the limitations of its closed API make it a less-than-ideal fit for projects that require seamless integration and quick turnaround in transforming meeting transcripts into actionable requirements. Teams seeking a more adaptable solution might explore alternatives that better align with their specific workflows and requirements.
Domain Analysis
This system is designed to simplify the process of transcribing audio files and converting the transcribed text into actionable insights in the form of user stories, themes, and epics. This functionality is essential for users who need to extract structured requirements or summaries from raw audio, making it especially useful in domains like requirements engineering and meeting documentation.
- User Interaction: A user uploads an audio file via the frontend.
- Audio Transcription: The backend forwards the file to the Whisper API, which returns a transcription of the audio.
- Structured Output Generation: The backend sends this transcription to the Gemini LLM, prompting it to analyze and categorize the content into user stories, themes, and epics.
- Response Delivery: The frontend displays the structured output to the user.
The class diagram illustrates these high-level interactions between components and demonstrates the sequence of actions from audio upload to the final presentation of structured content. Additional sequence and activity diagrams further clarify this process by breaking down the individual steps taken to process each audio file into actionable insights.
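The four steps above can be sketched as a single backend function. This is a minimal illustration and not the project's actual code: `process_audio`, `PipelineResult`, and the two injected callables are hypothetical names, and the real Whisper and Gemini calls are deliberately left behind the `transcribe` and `structure` parameters so that the flow can be read (and run) without network access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    """What the backend hands back to the frontend (hypothetical shape)."""
    transcript: str
    requirements: str


def process_audio(
    audio_bytes: bytes,
    transcribe: Callable[[bytes], str],
    structure: Callable[[str], str],
) -> PipelineResult:
    """Run the upload -> transcription -> structuring flow.

    `transcribe` stands in for the Whisper call and `structure` for the
    Gemini call; both are assumptions made for this sketch.
    """
    if not audio_bytes:
        raise ValueError("audio file is empty")

    # Step 2: audio transcription.
    transcript = transcribe(audio_bytes).strip()
    if not transcript:
        raise ValueError("transcription returned no text")

    # Step 3: structured output generation (user stories, themes, epics).
    requirements = structure(transcript)

    # Step 4: the frontend displays both pieces.
    return PipelineResult(transcript=transcript, requirements=requirements)
```

Passing the two services in as functions keeps the orchestration independent of any particular client library, which also makes it straightforward to substitute stubs in tests.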
Architecture and design
The architecture consists of three primary components, each fulfilling a specific role within the application:
- Application Layer:
  - Frontend: The frontend interface, built with a Streamlit-based application, allows users to upload audio files and receive transcriptions structured as user stories, themes, or epics. It provides a user-friendly means for uploading and viewing outputs while handling the initial request to transcribe audio.
  - Backend: The backend orchestrates interactions between the frontend and external APIs (Whisper API and Gemini LLM). When an audio file is uploaded, the backend receives the request, forwards it to the Whisper API for transcription, and then sends the transcribed text to the Gemini LLM for processing into structured requirements.
- Whisper API: This API serves as the audio transcription service. Upon receiving an audio file from the backend, Whisper processes the file and returns a text transcription. This transcription forms the basis for further analysis and structuring.
- LLM - Gemini: Gemini is responsible for analyzing the transcription and converting it into structured output in the form of user stories, themes, and epics. This step transforms raw text into high-level insights that align with requirements engineering concepts, making it easier for stakeholders to understand and act upon the audio content.
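To make the Gemini step more concrete, the sketch below shows one possible way to build the prompt and to parse the model's reply into themes, epics, and user stories. The prompt wording, the JSON reply format, and the names `Requirements`, `build_prompt`, and `parse_requirements` are all assumptions made for illustration; the project's real prompt and output format may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Requirements:
    """Structured output of the LLM step (hypothetical shape)."""
    themes: List[str] = field(default_factory=list)
    epics: List[str] = field(default_factory=list)
    user_stories: List[str] = field(default_factory=list)


def build_prompt(transcript: str) -> str:
    """Ask the model for a JSON object so the reply can be parsed reliably."""
    return (
        "You are a requirements engineer. Read the meeting transcript below "
        "and reply with a JSON object only, using the keys \"themes\", "
        "\"epics\" and \"user_stories\", each holding a list of strings.\n\n"
        "Transcript:\n" + transcript.strip()
    )


def parse_requirements(reply: str) -> Requirements:
    """Extract the JSON object from the reply, ignoring any text around it."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("reply does not contain a JSON object")
    data = json.loads(reply[start:end + 1])

    def as_list(key: str) -> List[str]:
        value = data.get(key, [])  # a missing key becomes an empty list
        if not isinstance(value, list):
            raise ValueError(f"'{key}' must be a list")
        return [str(item) for item in value]

    return Requirements(
        themes=as_list("themes"),
        epics=as_list("epics"),
        user_stories=as_list("user_stories"),
    )
```

Requesting JSON rather than free text is a design choice of this sketch: it lets the frontend render each category separately instead of displaying one block of prose.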
Technologies
Our project leverages a variety of modern tools, languages, and frameworks to deliver an intuitive and efficient solution for transforming audio meeting content into structured user stories, themes, and epics. Below is a breakdown of the technologies used, categorized by client-imposed restrictions and team-selected tools.
Client-imposed Restrictions
- Large Language Model (LLM) - Google Gemini: The client required the use of Google Gemini for analyzing transcriptions and extracting user stories, themes, and epics. Gemini was chosen due to its advanced NLP capabilities and high accuracy in interpreting contextual data, which is crucial for breaking down complex, unstructured text into meaningful requirements.
- Whisper API for Transcription: The client mandated the use of Whisper for transcribing audio into text due to its robust performance and reliability in accurately capturing spoken content. Whisper’s ability to handle varied accents and background noise was a significant factor in this restriction, as it ensures that transcription quality remains high across diverse inputs.
Team Selected Technologies
- Streamlit (Frontend): We chose Streamlit for the frontend interface due to its simplicity in creating data-centric web applications. It allows rapid prototyping and is well-suited for projects where real-time data updates and interactions are key. Streamlit’s integration with Python also allows for streamlined communication with our backend services.
- Python (Backend): Python was chosen for the backend as it offers a wide range of libraries and frameworks for AI and data processing. The ease of integrating APIs and libraries like Whisper and Streamlit made Python an ideal choice. Its readability and popularity in machine learning and data processing were additional motivating factors.
Development guide
Explain what a new developer to the project should know in order to develop the system, including how to build, run and test it in a development environment.
Document any APIs, formats and protocols needed for development (but don't forget that public APIs should also be accessible from the "How to use" section below).
Describe coding conventions and other guidelines adopted by the team(s).
Security concerns
In developing this system, we have proactively identified several classes of security vulnerabilities and implemented measures to mitigate them. Below are the main security risks identified and the decisions made to reduce exposure.
Data Privacy and Confidentiality
The audio files and transcriptions may contain sensitive information, including confidential meeting content or proprietary business information. Sending this data to external APIs (e.g., Whisper and Google Gemini) could expose it to third-party risk if not handled securely.
To mitigate these risks, we ensure that all audio file uploads and resulting transcriptions are encrypted during transit using HTTPS to prevent interception by unauthorized parties. Also, both Whisper and Google Gemini APIs are used in accordance with their privacy policies, and we follow data retention guidelines to minimize exposure.
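One way to back the HTTPS requirement with code is to check every outbound endpoint before it is used. The helper below is a hedged sketch: `require_https` is a hypothetical name, and the example URL in the usage note is a placeholder, not an endpoint taken from the project.

```python
from urllib.parse import urlparse


def require_https(url: str) -> str:
    """Return the URL unchanged if it uses HTTPS, otherwise raise.

    Intended to be called once at startup for each external endpoint,
    so that a misconfigured plain-HTTP URL fails fast.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"Refusing non-HTTPS endpoint: {url!r}")
    return url
```

For example, `require_https("https://api.example.com/v1")` returns the URL as given, while the same address with an `http://` scheme raises `ValueError`.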
Data Exposure
The application’s reliance on third-party APIs for core functionalities may expose it to risks related to API misuse or data leakage.
To mitigate these risks, the API keys for Whisper and Google Gemini are stored securely and are not hardcoded in the codebase. We use environment variables to manage secrets, reducing the risk of accidental exposure.
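Reading secrets from environment variables can look like the sketch below. The function name `load_api_key` and the variable names used in the usage note (`OPENAI_API_KEY`, `GEMINI_API_KEY`) are assumptions for illustration; the names actually used by the project are not stated here.

```python
import os
from typing import Mapping, Optional


def load_api_key(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Fetch a secret from the environment and fail loudly if it is absent.

    `environ` defaults to the process environment; passing a plain dict
    makes the function easy to exercise in tests.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        # Only the variable name is reported, never a secret value.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

At startup the backend could then call, for instance, `load_api_key("GEMINI_API_KEY")` and `load_api_key("OPENAI_API_KEY")`, so that a missing key is reported immediately rather than on the first API request.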
Quality assurance
Describe which tools are used for quality assurance and link to relevant resources. Namely, provide access to reports for coverage and mutation analysis, static analysis, and other tools that may be used for QA.
How to use
Prerequisites
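- Install the required libraries by running:
  pip install streamlit openai-whisper openai
  (OpenAI's Whisper is published on PyPI as openai-whisper; the package named whisper is an unrelated project.)
- Ensure ffmpeg is installed for audio processing:
  sudo apt update && sudo apt install ffmpeg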
Running the App
- Launch the Streamlit app:
  streamlit run app.py
- Open the provided URL in your web browser (usually http://localhost:8501).
Using the App
- Upload Audio: Click on "Upload an MP3 file" and select an audio file (.mp3).
- Transcription: The app will automatically transcribe the audio, displaying the text in the "Transcription" section.
- Download Results: Copy or save the transcribed text as needed.
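For orientation, the three usage steps above roughly correspond to the Streamlit sketch below. It is not the project's `app.py`: the function `render`, its parameters, the page title, and the download button are assumptions made for illustration. The Streamlit module and the transcription function are passed in as arguments so that the layout logic stays separate from the transcription backend.

```python
from typing import Callable, Optional


def render(st, transcribe: Callable[[bytes], str]) -> Optional[str]:
    """Draw the upload / transcription page.

    `st` is the imported `streamlit` module; `transcribe` is whatever
    function turns raw MP3 bytes into text (hypothetical, supplied by
    the backend). Returns the transcript, or None if nothing is uploaded.
    """
    st.title("Speech2Req")

    # Step 1: upload audio (restricted to .mp3 files).
    uploaded = st.file_uploader("Upload an MP3 file", type=["mp3"])
    if uploaded is None:
        return None

    # Step 2: transcribe and show the text in the "Transcription" section.
    text = transcribe(uploaded.getvalue())
    st.subheader("Transcription")
    st.text_area("Transcription", text, height=300)

    # Step 3: let the user save the result (an assumption of this sketch).
    st.download_button(
        "Download transcription", data=text, file_name="transcription.txt"
    )
    return text
```

In an `app.py` this would be wired up with `import streamlit as st` followed by `render(st, my_transcribe)`, where `my_transcribe` is a placeholder for the project's own transcription function.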
Sprint 1 Retrospective
Keep Doing
- Continue the strong collaboration and teamwork that enabled the completion of all planned user stories.
- Maintain focus and efficiency to ensure timely delivery of all tasks.
- Preserve the organized approach that kept everyone aligned on sprint goals and expectations.
Do Differently
- Experiment with new methods: Explore techniques like pair programming to encourage knowledge sharing and foster teamwork.
- Refine task assignment: Introduce a more balanced allocation method for user stories to prevent overloading specific team members. This could include assigning tasks more dynamically rather than waiting for members to finish their current workload.
- Enhance task ownership: Encourage proactive task completion and equitable workload distribution to ensure consistent progress across the team.
General Insights
- The sprint's success underlines the effectiveness of the team's collaboration and planning.
- Exploring innovative methods and refining existing processes will sustain the team’s momentum and prepare them for more complex challenges in upcoming sprints.
- The involvement of both the development team and the Product Owner in deciding priorities and features has proven to be a key factor in aligning the sprint outcomes with the overall product vision.
Sprint 2 Retrospective
Keep Doing
- Continue delivering all planned sprint items successfully, demonstrating strong commitment to sprint goals.
- Maintain the focus and determination that ensures tasks are completed on time.
- Preserve the proactive approach to identifying and addressing challenges as they arise.
Do Differently
- Enhance communication: Establish regular check-ins or daily stand-ups to ensure better synchronization across the team and avoid miscommunication.
- Improve integration understanding: Schedule focused sessions to enhance the team’s knowledge and skills related to the integration process, possibly with guidance from subject matter experts or technical leads.
- Document learning: Create detailed documentation for the integration process to serve as a reference and reduce confusion during future tasks.
General Insights
- Despite the challenges, the team showcased resilience by completing all sprint items.
- Improved communication practices and a better grasp of integration complexities will not only boost efficiency but also reduce bottlenecks in upcoming sprints.
- A focus on continuous learning and knowledge sharing will strengthen the team’s ability to tackle more complex processes moving forward.
Sprint planning 3
Sprint 3 Retrospective
Keep Doing
- Delivering with quality focus: The team maintained a high standard of delivery, ensuring all features added significant value to the final product.
- Organized Sprint Backlog: Clear prioritization and alignment with user needs allowed for the successful implementation of essential functionalities.
- Effective collaboration: Strong communication between the development team and the PO ensured the most relevant and impactful features were delivered.
Do Differently
- Time and effort estimation: Complex tasks, such as application integration and technical features, could have been better anticipated to reduce challenges and avoid delays.
- Greater focus on final validation: Allocate more time for testing and validating completed features to ensure quality and alignment with the project’s overall goals.
- Comprehensive documentation: Finalize detailed documentation for all developed features to facilitate future maintenance or integrations.
General Insights
- Value Delivered: The team successfully achieved the sprint goals, delivering critical functionalities. Features such as application integration, audio download capabilities, and organized documentation provide tangible benefits to end users.
- Resilience and Adaptability: Despite technical challenges, the team demonstrated focus, collaboration, and maturity, showcasing a strong commitment to delivering a successful final product.
- Future Readiness: The successful conclusion of this final sprint reflects the team's ability to deliver a robust product. The finalized documentation and continuous process improvements lay a solid foundation for future development or project expansion.
Final Conclusion
The project was completed successfully, meeting all proposed objectives and delivering a valuable, functional product. The team's collaborative spirit, adaptability, and user-focused delivery were key factors in achieving success in this final sprint.
How to contribute
Explain what a new developer should know in order to develop the tool, including how to build, run and test it in a development environment.
Defer technical details to the technical documentation below, which should include information and decisions on architectural, design and technical aspects of the tool.
Contributions
Link to the factsheets of each team and of each team-member. For example:
- Team 3
- Hugo Silva (PO)
- Miguel Lima (SM)
- José Felizberto
- Pedro Romão