Community Meetings - vmware/versatile-data-kit GitHub Wiki
28th of February 16:00 CET - How to build a RAG pipeline
Community meetings are held to keep everyone interested in the project up to date. Meetings are recorded and available to the public. Please don't hesitate to start a conversation, ask questions in the chat, or raise your hand. If you are comfortable with it, we'd appreciate it if you kept your camera on.
Agenda
- quick intro of VDK, the team, and people who have been working on or using VDK (Agi)
- Latest Release (Dilyan)
- RAG pipeline demo (Yoan)
recording
Data Sources - Guided Workshop w/ Versatile Data Kit
Agenda
- quick intro of VDK, the team, and people who have been working on or using VDK (Agi)
- Latest Release? (Antoni)
- Workshop (Antoni)
recording
VDK Machine Learning Roadmap
Agenda
- quick intro of VDK, the team, and people who have been working on or using VDK
- Latest Release? (Stanley)
- VDK Roadmap for ML Projects
In this community meeting, Paul Murphy will present how VDK can help with all aspects of ML workflows. We'll discuss the benefits of running your data creation and model training on the platform.
recording
Streamlining Dataset Creation and Debugging in AI/Data Models
Agenda
- Latest Release (Antoni)
- Streamlining Dataset Creation and Debugging in AI/Data Models (Antoni)
- Data Makers Fest (Agi)
- Thank yous, ⭐ and next meeting - 29th of Nov.
Streamlining Dataset Creation and Debugging in AI/Data Models
In an era where data is vital for machine learning models, efficient dataset creation and debugging mechanisms are essential. While platforms like HuggingFace offer a plethora of pre-existing datasets, they lack tools for easy dataset creation and management by end users. Our presentation focuses on solving these challenges by extending the Versatile Data Kit (VDK), an open-source framework for developing and managing data pipelines.
Key Challenges:
- Dataset Creation: Existing platforms offer limited user-driven options for generating and managing datasets from diverse sources.
- Data Integrity: Ensuring the quality and integrity of mutable datasets is a significant concern.
- Traceability: Lack of transparency from data origin to consumption in data models complicates debugging.
Proposed Solution: We propose an integrated VDK-based solution encompassing:
- Source Plugins: To simplify dataset creation from diverse sources like databases, streams, or APIs.
- Metrics Abstraction: A layer in source plugins that calculates standard metrics for datasets.
- Automated Update Mechanism: Fetches the latest data modifications and computes metrics automatically.
- Report Generation: Produces detailed reports highlighting anomalies and metrics, streamlining debugging.
- Centralized Repository: Data and reports are stored centrally for easy access and examination.
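The metrics abstraction and report generation ideas above can be illustrated with a small sketch. This is not VDK code: `DatasetReport`, `compute_report`, and the `max_null_fraction` threshold are hypothetical names chosen for illustration, and the in-memory rows stand in for whatever a source plugin would fetch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class DatasetReport:
    """Hypothetical report: standard metrics plus a list of flagged anomalies."""
    row_count: int
    null_fraction: Dict[str, float]
    numeric_mean: Dict[str, float]
    anomalies: List[str]


def compute_report(rows: Iterable[dict], max_null_fraction: float = 0.2) -> DatasetReport:
    """Compute per-column metrics and flag columns with too many missing values."""
    rows = list(rows)
    columns = sorted({key for row in rows for key in row})
    null_fraction: Dict[str, float] = {}
    numeric_mean: Dict[str, float] = {}
    anomalies: List[str] = []
    for col in columns:
        values = [row.get(col) for row in rows]
        missing = sum(v is None for v in values)
        null_fraction[col] = missing / len(rows) if rows else 0.0
        # Only real numbers contribute to the mean (bool is excluded on purpose).
        numbers = [v for v in values
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if numbers:
            numeric_mean[col] = mean(numbers)
        if null_fraction[col] > max_null_fraction:
            anomalies.append(f"{col}: {null_fraction[col]:.0%} missing values")
    return DatasetReport(len(rows), null_fraction, numeric_mean, anomalies)


# Rows as a source plugin might return them (made-up example data).
rows = [
    {"id": 1, "price": 10.0, "label": "a"},
    {"id": 2, "price": None, "label": "b"},
    {"id": 3, "price": 20.0, "label": None},
    {"id": 4, "price": None, "label": "c"},
]
report = compute_report(rows)
for line in report.anomalies:
    print(line)
```

Re-running `compute_report` after each data refresh and storing the result next to the data would correspond to the automated update and centralized repository items in the list.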
Multiple Python Versions Support - recording
Agenda
- quick intro of VDK, the team, and people who have been working on or using VDK
- Latest Release (Antoni)
- Multiple Python Versions Support (Andy)
- VDK Run logs: Simplified and readable. Quick overview (Dilyan)
Next meeting: 25th of October, 16:00 CET
Support us by ⭐
Productionizing Jupyter Notebooks - recording
Agenda
- Release (Antoni)
- Productionizing Jupyter Notebooks (Duygu + Antoni)
Next meeting: 27th of September, 16:00 CET
Support us by ⭐
Using VDK and Huggingface to train LLMs - recording
Agenda
- quick intro of VDK, the team, and people who have been working on or using VDK
- Latest release (Antoni)
- Hugging Face + VDK to train and use LLMs (Paul). He will show how running Hugging Face on VDK will augment its functionality.
Workflows:
- Finetuning an LLM
- Creating a dataset
- Catching regressions in LLMs ahead of time
- Q&A
- Roadmap (Antoni)
- Q&A
Watch Recording
28th of June - Practical Kimball Patterns - Dimensional modeling 101
Agenda:
- VDK quick intro and latest release (Antoni)
- VDK team meeting/workshop (Agi)
- Dimensional modeling 101 - Practical Kimball data patterns (Antoni)
Watch Recording
31st of May - Generative Data Packs and DevOps for Data
Agenda:
- VDK’s latest release (Antoni)
- Generative Data Packs (Iva)
- DevOps for Data (Agi)
Watch Recording
26th of April - VDK UI demo
Agenda:
- Intro to the agenda and people - what are you working on lately?
- README updates and VDK intro (Agi)
- VDK’s latest release (Antoni)
- VDK UI demo (Paul)
- Next meeting topic. Date: 31st of May (Agi)
Watch recording
22nd of February - Jupyter Integration
Shoutout to the recent VDK contributors and their work!
Agenda:
- VDK’s latest release
- VDK Jupyter integration
- FOSDEM experience and conclusions
- Next meeting date
Watch recording
11th of January
Agenda:
- VDK’s latest release - Stanislav
- Introduction to Versatile Data Kit Control Service - Paul
- Demo of the current installation process - Iva
- Discussion on a proposal to implement the “Three Click Rule” to make the installation faster and easier for users. - Iva
- Decide on a date for the next community meeting (provisionally 15th of Feb) - Paul
Watch Recording
30th of November
Agenda
- Welcome - Agi
- Release - Antoni
- Recent industry database adoption statistics suggest that PostgreSQL has been gaining traction. We recently introduced embedded PostgreSQL support, so the database type deployed with the Control Service is now a configuration option (used when no external data source is set). It can be either CockroachDB (the default) or PostgreSQL - Iva
- We have just returned from Data Science Conference Europe 2022, and we’ll talk about our experience there - Vic, Antoni, Dimira
Discussion:
- templates for community meetings - do we need one?
- next community meeting x-mas/NY themed - 21st of Dec
- YT live community meetings
Watch recording
26th of October
Agenda
- Welcome and intro - Agi
- Release - Antoni
- Hackathon. We've applied for the Borathon, and we'll demo what we did there! - Antoni
- Demo of a new feature that allows skipping the remaining steps of a data job execution via the job input object - Momchil
- Latest articles about VDK
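The step-skipping feature from Momchil's demo item above can be pictured with a self-contained sketch. Everything here is a stand-in rather than the VDK implementation: `SketchJobInput`, `run_job`, and the step functions are hypothetical, and the method name `skip_remaining_steps` is an assumption about how such a call on the job input object might look.

```python
from typing import Callable, List


class SketchJobInput:
    """Hypothetical stand-in for the job input object passed to every step."""

    def __init__(self) -> None:
        self.skip_requested = False

    def skip_remaining_steps(self) -> None:
        # A step calls this to signal that later steps should not run.
        self.skip_requested = True


def run_job(steps: List[Callable[[SketchJobInput], None]]) -> List[str]:
    """Run steps in order and stop early once a step requests a skip."""
    job_input = SketchJobInput()
    executed: List[str] = []
    for step in steps:
        step(job_input)
        executed.append(step.__name__)
        if job_input.skip_requested:
            break
    return executed


def check_for_new_data(job_input: SketchJobInput) -> None:
    new_rows: list = []  # pretend the source returned nothing new
    if not new_rows:
        job_input.skip_remaining_steps()


def transform(job_input: SketchJobInput) -> None:
    pass  # would reshape the new rows


def load(job_input: SketchJobInput) -> None:
    pass  # would write the rows to the target


executed = run_job([check_for_new_data, transform, load])
print(executed)
```

In this sketch the skip is treated as a normal early exit rather than a failure, which is the typical reason to skip (for example, no new data to process).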
Watch recording
28th of September: Creating First Data Job
Agenda:
- Welcome and intro. If you are new to VDK, I encourage you to say hi :)
- Quick intro to the project (Agi)
- Release announcement (Antoni)
- GitHub Star History example demo (Agi)
- Discussion topics:
- Two PR reviewers
- VDK catchphrase
VDK catchphrase (also anchor text):
- unique value
- clear
- short and sweet
Examples:
- A high-performance observability data pipeline.
- Declarative continuous deployment for Kubernetes.
- The easiest way to coordinate your dataflow
- A cloud-native Pipeline resource.
- Always know what to expect from your data.
- Data-Centric Pipelines and Data Versioning
- An orchestration platform for the development, production, and observation of data assets.
- Build powerful pipelines in any programming language.
- Build data pipelines, the easy way
- Machine Learning Pipelines for Kubeflow
I have included data pipelines and other tools that have more than 2000 stars. Only Airbyte has a longer message:
- Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
I think the difference is felt in the readability of the message. So, for this catchphrase, it would be nice to come up with something that is #uniquevalue
Ideas:
- Building and Managing Data Pipelines with SQL or Python
- Data Pipelines covering full DataOps lifecycle
- Building and managing your data pipelines with Python or SQL on the cloud (or Kubernetes)
- Build, run and manage your data jobs
- Build, run and manage your data pipelines
- Develop, run and manage your data pipelines on the cloud
- Automate and abstract the Data and DevOps cycle
- Automate and abstract the Data Journey and the DevOps cycle
- Orchestrate
More abstract and less clear ideas:
- Efficient data engineering
- Enable everyone to focus on work that requires their core skills
(because SQL or Python may not be our unique value proposition)
Questions:
- Add cloud or Kubernetes?
- Data Pipelines OR DataOps pipelines?
Helpful questions:
- What do you think is the unique value of VDK?
- How would you google to find this framework? (if you don't know it exists)
"VDK I think rather has a lot of possibilities in the “T” part - templates (kimball or generic), managed connection plugins enable quality, lineage (when implemented). And also in the abstracting DevOps part - though we need to do more around testing."
Action items: create a form where we can rank the catchphrase ideas
https://youtu.be/HIRt4bX4ddk
24th of August: VDK Templates
Attendees:
- Agita Jaunzeme aka Agi (VDK Community Manager)
- Momchil Zhivkov
- Duygu Hasan
- Antoni Ivanov
Agenda:
- Welcome and team (Agi)
- Intro to the project (Agi)
- Momchil Zhivkov about templates: Templates are reusable code in the context of data jobs. They are intended to solve a common use case among different users. A template is executed through a data job. An example of a common use case is loading data into a data warehouse.
This presentation will demo:
- what is a template
- how does it look
- the purpose of templates
- using and developing templates
- our already existing templates that can be reused
- Duygu - csv-export: a new feature was added to the existing CSV plugin that allows people to export the result of a SQL query to a CSV file.
- Toni - VDK release v0.6
- Open discussion
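Momchil's description of templates in the agenda above (reusable code, executed through a data job, for a common use case such as loading data into a data warehouse) can be sketched as follows. This is an illustration under stated assumptions, not VDK's template API: the `TEMPLATES` registry, the `template` decorator, `execute_template`, the `append_load` template, and the dictionary used as a warehouse are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a data warehouse: table name -> list of rows.
Warehouse = Dict[str, List[dict]]

# Registry of reusable templates, keyed by name (illustrative only).
TEMPLATES: Dict[str, Callable[[Warehouse, dict], None]] = {}


def template(name: str):
    """Register a function as a reusable template under the given name."""
    def decorator(func):
        TEMPLATES[name] = func
        return func
    return decorator


@template("append_load")
def append_load(warehouse: Warehouse, args: dict) -> None:
    """Common use case: append all rows of a source table to a target table."""
    source_rows = warehouse[args["source_table"]]
    target_rows = warehouse.setdefault(args["target_table"], [])
    target_rows.extend(dict(row) for row in source_rows)


def execute_template(warehouse: Warehouse, name: str, args: dict) -> None:
    """What a data job step would call to run a template with its own arguments."""
    TEMPLATES[name](warehouse, args)


warehouse: Warehouse = {
    "staging_orders": [{"id": 1}, {"id": 2}],
    "staging_users": [{"id": 10}],
}

# Two different "data jobs" reuse the same template with different arguments.
execute_template(warehouse, "append_load",
                 {"source_table": "staging_orders", "target_table": "dw_orders"})
execute_template(warehouse, "append_load",
                 {"source_table": "staging_users", "target_table": "dw_users"})
print(warehouse["dw_orders"], warehouse["dw_users"])
```

The point of the design is that the loading logic lives in one place, while each job only supplies arguments.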
https://youtu.be/wmdx7ngocr4
20th of July: How to promote an open-source project
15:00 (GMT+01:00)
Attendees:
- Agita Jaunzeme aka Agi (VDK Community Manager)
- Michael Gasch
Agenda:
- Welcome and team (Agi)
- Michael Gasch about how to promote an open-source project, tips, and questions
https://youtu.be/c3j1aOALjVU
22nd of June: Airflow integration
11:00 (GMT+01:00)
Attendees:
- Agita Jaunzeme aka Agi (VDK Community Manager)
- Gabriel Georgiev
- Antoni Ivanov
- Dimira Petrova
Agenda:
- Welcome and team (Agi)
- Intro to the project (Agi)
- Announcement of recent changes (Antoni)
- Airflow Provider Demo by Gabriel
- Discussion:
- VDK community update (Agi)
- how to find community meeting links
- Community and Resources page
- ODSC Europe conference: volunteering, speakers (Jacob Tomlinson, Guglielmo Iozzia, Carl Osipov, Shawn Kyzer on Data Mesh)
- Invitation to be DataOps community lead for Techies of Baltics (devops.lv); also James Craig from CDK
- Next meeting: possibly someone will join to tell us their story of growing an OSS community (govmomi or RAPIDS)
- Next meeting date
- Next meeting time (Let’s hold the next community meeting at a US-friendly time)
Useful links:
- Slack https://cloud-native.slack.com/archives/C033PSLKCPR
- Youtube https://www.youtube.com/channel/UCasf2Q7X8nF7S4VEmcTHJ0Q
- Release notes https://github.com/vmware/versatile-data-kit/releases/tag/v0.3
- Twitter https://twitter.com/vdkproject
- Articles https://towardsdatascience.com/an-overview-of-versatile-data-kit-a812cfb26de7
https://youtu.be/w0teqOw9qjc
May 25, 2022: KubeCon
Attendees:
- Agita Jaunzeme aka Agi (VDK Community Manager)
Agenda:
- Welcome and team (Agi) / intro to the project
- VDK community update (Agi):
- VDK public calendar https://calendar.google.com/calendar/u/0?cid=dmluMjRrcTZ0MTZ1cjZ2YTVrc29oMm1hNXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
- List of good first issues https://github.com/vmware/versatile-data-kit/labels/good%20first%20issue
- Blog post https://towardsdatascience.com/how-to-build-a-web-app-with-data-ingested-through-versatile-data-kit-ddae43b5f62d
- Latest release
- Roadmap (Dako)
- Apache Airflow integration
- Security Improvements
- Provide users with better notifications/information about non-gracefully failed data job execution
- Open questions about KubeCon – discussion
- Conclusion and relevant links (Twitter / Slack / YT / blogs etc. )
Discussion Topics:
- Community track / Documentation
- Do we want to donate the project to CNCF? Costs vs. benefits - why?
- CNCF requirements: https://github.com/cncf/toc/blob/main/process/README.md https://github.com/cncf/toc/blob/main/process/graduation_criteria.md#incubation-stage
- Conclusion: CNCF is full of potential people who want to contribute BUT
Useful links:
- Slack https://cloud-native.slack.com/archives/C033PSLKCPR
- Youtube https://www.youtube.com/channel/UCasf2Q7X8nF7S4VEmcTHJ0Q
- Release notes https://github.com/vmware/versatile-data-kit/releases/tag/v0.3
- Twitter https://twitter.com/vdkproject
- Articles https://towardsdatascience.com/an-overview-of-versatile-data-kit-a812cfb26de7
https://youtu.be/VHJjrNyZjhg
April 20, 2022
Attendees:
- Agita Jaunzeme
- Dimira Petrova
- Dako Dakov
- Antoni Ivanov
- Gabriel Georgiev
Agenda:
- Welcome (Agi)
- Intro of the team (all)
- Intro of the project (Agi)
- What are we doing lately (Antoni)
- What are we planning to do in the near future (Dako)
- Discussion
- Conclusion (Agi)
Discussion Topics:
- Kubernetes / ..
- Meeting frequency / next meeting - the week of 23rd of May
- Agenda for the next meeting to be more specific
Useful links:
- Repo https://github.com/vmware/versatile-data-kit/
- Wiki https://github.com/vmware/versatile-data-kit/wiki/Community-Meeting-and-Open-Discussion-Notes
- Slack https://cloud-native.slack.com/archives/C033PSLKCPR
- Youtube https://www.youtube.com/channel/UCasf2Q7X8nF7S4VEmcTHJ0Q
- Release notes https://github.com/vmware/versatile-data-kit/releases/tag/v0.3
- Twitter https://twitter.com/vdkproject
- Articles https://towardsdatascience.com/an-overview-of-versatile-data-kit-a812cfb26de7