Community Meetings - vmware/versatile-data-kit GitHub Wiki

28th of February 16:00 CET - How to build a RAG pipeline


Community meetings are held to keep everyone interested in the project up to date. Meetings are recorded and available to the public. Please don't hesitate to start a conversation, ask questions in the chat, or raise your hand. If you don't feel shy, we'd appreciate it if you had your camera on.

Agenda

Quick intro of VDK, the team, and people who have been working on or using VDK (Agi)
Latest Release (Dilyan)
RAG pipeline demo (Yoan)

Data Sources - Guided Workshop w/ Versatile Data Kit - recording

Agenda

Quick intro of VDK, the team, and people who have been working on or using VDK (Agi)
Latest Release? (Antoni)
Workshop (Antoni)

VDK Machine Learning Roadmap - recording

Agenda

Quick intro of VDK, the team, and people who have been working on or using VDK
Latest Release? (Stanley)
VDK Roadmap for ML Projects

In this community meeting, Paul Murphy will present how VDK can help with all aspects of ML workflows. We'll discuss the benefits of running your data creation and model training on the platform.

Streamlining Dataset Creation and Debugging in AI/Data Models - recording

Agenda

Latest Release (Antoni)
Streamlining Dataset Creation and Debugging in AI/Data Models (Antoni)
Data Makers Fest (Agi)
Thank yous, ⭐ and next meeting - 29th of Nov.

Streamlining Dataset Creation and Debugging in AI/Data Models

In an era where data is vital for machine learning models, efficient dataset creation and debugging mechanisms are essential. While platforms like Hugging Face offer a plethora of pre-existing datasets, they lack tools that let end users easily create and manage datasets. Our presentation focuses on solving these challenges by extending the Versatile Data Kit (VDK), an open-source framework for developing and managing data pipelines.

Key Challenges:

Dataset Creation: Existing platforms offer limited user-driven options for generating and managing datasets from diverse sources.
Data Integrity: Ensuring the quality and integrity of mutable datasets is a significant concern.
Traceability: Lack of transparency from data origin to consumption in data models complicates debugging.

Proposed Solution: We propose an integrated VDK-based solution encompassing:

Source Plugins: To simplify dataset creation from diverse sources like databases, streams, or APIs.
Metrics Abstraction: A layer in source plugins that calculates standard metrics for datasets.
Automated Update Mechanism: Fetches the latest data modifications and computes metrics automatically.
Report Generation: Produces detailed reports highlighting anomalies and metrics, streamlining debugging.
Centralized Repository: Data and reports are stored centrally for easy access and examination.
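
To make the metrics and report ideas above more concrete, here is a minimal sketch in plain Python. It is not the implementation presented in the meeting: the function names (`compute_metrics`, `build_report`), the chosen metrics, and the `max_row_drop` threshold are all assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def compute_metrics(rows: list[Row]) -> dict[str, Any]:
    """Compute a few standard metrics: row count, per-column null and distinct counts.

    Hypothetical helper; column values are assumed to be hashable.
    """
    columns = sorted({col for row in rows for col in row})
    null_count = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    distinct_count = {
        c: len({r.get(c) for r in rows if r.get(c) is not None}) for c in columns
    }
    return {
        "row_count": len(rows),
        "null_count": null_count,
        "distinct_count": distinct_count,
    }


def build_report(
    previous: dict[str, Any], current: dict[str, Any], max_row_drop: float = 0.5
) -> list[str]:
    """Compare two metric snapshots and list anomalies (hypothetical rules)."""
    anomalies = []
    prev_rows = previous["row_count"]
    # Flag a sharp drop in row count relative to the previous snapshot.
    if prev_rows and current["row_count"] < prev_rows * (1 - max_row_drop):
        anomalies.append(
            f"row count dropped from {prev_rows} to {current['row_count']}"
        )
    # Flag columns whose number of missing values increased.
    for col, nulls in current["null_count"].items():
        before = previous["null_count"].get(col, 0)
        if nulls > before:
            anomalies.append(f"column '{col}' null count rose from {before} to {nulls}")
    return anomalies
```

An automated update mechanism could call `compute_metrics` after each fetch and store the snapshot together with the output of `build_report`, so that a regression in the data can be traced to the update that introduced it.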

Multiple Python Versions Support - recording

Agenda

Quick intro of VDK, the team, and people who have been working on or using VDK

Latest Release (Antoni)
Multiple Python Versions Support (Andy)
VDK Run logs: Simplified and readable. Quick overview (Dilyan)

Next meeting: 25th of October, 16:00 CET. Support us with a ⭐

Productionizing Jupyter Notebooks - recording

Agenda

Release (Antoni)
Productionizing Jupyter Notebooks (Duygu + Antoni)

Next meeting: 27th of September, 16:00 CET. Support us with a ⭐

Using VDK and Huggingface to train LLMs - recording

Agenda

Quick intro of VDK, the team, and people who have been working on or using VDK

Latest release (Antoni)
Hugging Face + VDK to train and use LLMs (Paul). He will show how running Hugging Face on VDK will augment its functionality.

Workflows:

Finetuning an LLM
Creating a dataset
Catching regressions in LLMs ahead of time
Q&A

Roadmap (Antonio)

28th of June - Practical Kimball Patterns - Dimensional modeling 101 - Watch Recording

Agenda:

VDK quick intro and latest release (Antoni)
VDK team meeting/workshop (Agi)
Dimensional modeling 101 - Practical Kimball data patterns (Antoni)

31st of May - Generative Data Packs and DevOps for Data - Watch Recording

Agenda:

VDK’s latest release (Antoni)
Generative Data Packs (Iva)
DevOps for Data (Agi)

26th of April - VDK UI demo - Watch Recording

Agenda:

Intro to the agenda and people - what are you working on lately?
README updates and VDK intro (Agi)
VDK’s latest release (Antoni)
VDK UI demo (Paul)
Next meeting topic. Date: 31st of May (Agi)

22nd of February - Jupyter Integration - Watch recording

Shoutout to the recent VDK contributors and their work!

Agenda:

VDK’s latest release
VDK Jupyter integration
FOSDEM experience and conclusions
Next meeting date

11th of January - Watch recording

Agenda:

VDK’s latest release - Stanislav
Introduction to Versatile Data Kit Control Service - Paul
Demo of the current installation process - Iva
Discussion on a proposal to implement the “Three Click Rule” to make the installation faster and easier for users - Iva
Decide on a date for the next community meeting (provisionally 15th of Feb) - Paul

30th of November - Watch Recording

Agenda

Welcome - Agi
Release - Antoni
Recent industry database adoption statistics suggest that PostgreSQL has been gaining traction lately. We have recently introduced embedded PostgreSQL support, so the type of database deployed with the Control Service (when no external data source is set) is now a configuration option: either CockroachDB (the default) or PostgreSQL - Iva
We have just returned from Data Science Conference Europe 2022, and we’ll talk about our experience there - Vic, Antoni, Dimira

Discussion:

templates for community meetings - do we need one?
next community meeting x-mas/NY themed - 21st of Dec
YT live community meetings

26th of October - Watch recording

Agenda

Welcome and intro - Agi
Release - Antoni
Hackathon. We've applied for the Borathon, and we'll demo what we did there! - Antoni
Demo of a new feature that allows skipping the remaining steps of a data job execution via the job input object - Momchil
Latest articles about VDK
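
As a rough illustration of the step-skipping feature from the agenda above, the sketch below shows a data job step that ends the job early when there is nothing to process. It assumes the job input object exposes `get_arguments()` and `skip_remaining_steps()`; the `rows` argument and the return value are hypothetical and only serve the example.

```python
def run(job_input):
    """Data job step that skips the rest of the job when no input rows are given.

    job_input is supplied by VDK at run time; the "rows" argument is a
    hypothetical job argument used only for this sketch.
    """
    rows = job_input.get_arguments().get("rows", [])
    if not rows:
        # Nothing to process: ask VDK not to run the remaining steps.
        job_input.skip_remaining_steps()
        return 0
    # Placeholder for real processing; returning the count is illustrative only.
    return len(rows)
```

Skipping this way is meant for the case where an early step discovers there is no work to do and the later steps would otherwise run on empty input.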

28th of September: Creating First Data Job - Watch recording

Agenda:

Welcome and intro. If you are new to VDK, I encourage you to say hi :)
Quick intro to the project (Agi)
Release announcement (Antoni)
GitHub Star History example demo (Agi)
Discussion topics:

Two PR reviewers
VDK catchphrase

VDK catchphrase (also anchor text):

unique value
clear
short and sweet

Examples:

A high-performance observability data pipeline.
Declarative continuous deployment for Kubernetes.
The easiest way to coordinate your dataflow
A cloud-native Pipeline resource.
Always know what to expect from your data.
Data-Centric Pipelines and Data Versioning
An orchestration platform for the development, production, and observation of data assets.
Build powerful pipelines in any programming language.
Build data pipelines, the easy way
Machine Learning Pipelines for Kubeflow

I have included data pipelines and other tools that have more than 2000 stars. Only Airbyte has a longer message:

Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

I think the difference is felt in the readability of the message. So, for this catchphrase, it would be nice to come up with something that is #uniquevalue

Ideas:

Building and Managing Data Pipelines with SQL or Python
Data Pipelines covering full DataOps lifecycle
Building and managing your data pipelines with Python or SQL on the cloud (or Kubernetes)
Build, run and manage your data jobs
Build, run and manage your data pipelines
Develop, run and manage your data pipelines on the cloud
Automate and abstract the Data and DevOps cycle
Automate and abstract the Data Journey and the DevOps cycle
Orchestrate

Somewhat more abstract and less clear ideas:

Efficient data engineering
Enable everyone to focus on work that requires their core skills

(because SQL or Python is maybe not our unique value prop)

Questions:

Add cloud or Kubernetes?
Data Pipelines OR DataOps pipelines?

Helpful questions:

What do you think is the unique value of VDK?
How would you google to find this framework? (if you don't know it exists)

"VDK I think rather has a lot of possibilities in the “T” part - templates (kimball or generic), managed connection plugins enable quality, lineage (when implemented). And also in the abstracting DevOps part - though we need to do more around testing."

Action items: create a form where we can rank the catchphrase ideas

24th of August: VDK Templates https://youtu.be/HIRt4bX4ddk

Attendees:

Agita Jaunzeme aka Agi (VDK Community Manager)
Momchil Zhivkov
Duygu Hasan
Antoni Ivanov

Agenda:

Welcome and team (Agi)
Intro to the project (Agi)
Momchil Zhivkov about templates: Templates are reusable code in the context of data jobs. They are intended to solve a common use case among different users. A template is executed through a data job. An example of a common use case is loading data into a data warehouse.

This presentation will demo:

what is a template
how does it look
the purpose of templates
using and developing templates
our already existing templates that can be reused
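
As a small sketch of how a template is executed through a data job, the step below delegates a warehouse load to a template via the job input object. It assumes the job input exposes `execute_template(template_name, template_args)`; the template name `scd1`, the argument keys, and the schema and table names are illustrative, since the templates actually available depend on the installed plugins.

```python
def run(job_input):
    """Data job step that loads a dimension table by reusing a template.

    job_input is supplied by VDK at run time. The template name and the
    argument keys below are assumptions made for illustration.
    """
    template_args = {
        "source_schema": "staging",        # hypothetical schema holding new data
        "source_view": "vw_dim_customer",  # hypothetical view with the rows to load
        "target_schema": "warehouse",      # hypothetical warehouse schema
        "target_table": "dim_customer",    # hypothetical dimension table
    }
    # The template carries the reusable load logic; the job only supplies arguments.
    return job_input.execute_template(
        template_name="scd1", template_args=template_args
    )
```

The design point is that the loading logic lives in one reusable place, so individual data jobs stay short and differ only in the arguments they pass.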

Duygu - csv-export: a new feature was added to the existing CSV plugin that allows people to export the result of a SQL query to a CSV file.
Toni - VDK release v0.6
Open discussion

20th of July: How to promote an open-source project https://youtu.be/wmdx7ngocr4

15:00 (GMT+01:00)

Attendees:

Agita Jaunzeme aka Agi (VDK Community Manager)
Michael Gasch

Agenda:

Welcome and team (Agi)
Michael Gasch about how to promote an open-source project, tips, and questions

22nd of June: Airflow integration https://youtu.be/c3j1aOALjVU

11:00 (GMT+01:00)

Attendees:

Agita Jaunzeme aka Agi (VDK Community Manager)
Gabriel Georgiev
Antoni Ivanov
Dimira Petrova

Agenda:

Welcome and team (Agi)
Intro to the project (Agi)
Announcement of recent changes (Antoni)
Airflow Provider Demo by Gabriel
Discussion:

VDK community update (Agi)
How to find community meeting links
Community and Resources page
ODSC Europe conference: volunteering and speakers (Jacob Tomlinson, Guglielmo Iozzia, Carl Osipov, Shawn Kyzer on Data Mesh)
Invitation to be DataOps community lead for Techies of Baltics (devops.lv); also James Craig from CDK
Next meeting: possibly someone will join to tell us their story of growing an OSS community (govmomi or RAPIDS)
Next meeting date
Next meeting time (let’s hold the next community meeting at a US-friendly time)