Vaex - BKJackson/BKJackson_Wiki GitHub Wiki
Vaex links
vaex.io
docs.vaex.io
github.com/vaexio/vaex
github.com/vaexio/vaex-talks/
What is vaex
- A very fast and memory-efficient DataFrame library (e.g., 4 GB for a 100 GB file)
- e.g., 1 billion rows - 1 TB data on a laptop
- Concept of data + state (virtual columns, filters)
- Expression system allows jitting (Numba, Pythran, CUDA)
- State 'remembers' the 'pipeline'; it's an artifact you get for free, which makes deployment easy.
- S3 support and remote dataframes
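The "data + state" idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not how vaex is implemented: the `ToyFrame` class and its methods are hypothetical, and expressions are evaluated with a restricted `eval` purely for brevity. The point is that virtual columns and filters are stored as small expressions, so the whole pipeline can be serialized and applied to a different dataset.

```python
import json
import math


class ToyFrame:
    """Toy 'data + state' container (hypothetical; not vaex's implementation)."""

    def __init__(self, columns):
        self.columns = columns  # real data: column name -> list of values
        # The state holds only expressions, never computed values.
        self.state = {"virtual_columns": {}, "filters": []}

    def add_virtual_column(self, name, expression):
        # Store the expression string; nothing is computed or copied here.
        self.state["virtual_columns"][name] = expression

    def add_filter(self, expression):
        self.state["filters"].append(expression)

    def _row_scope(self, i):
        # Build the names visible to expressions for row i.
        scope = {name: values[i] for name, values in self.columns.items()}
        scope["sqrt"] = math.sqrt
        # Virtual columns are evaluated on demand, in insertion order.
        for name, expression in self.state["virtual_columns"].items():
            scope[name] = eval(expression, {"__builtins__": {}}, scope)
        return scope

    def evaluate(self, name):
        # Return the values of a (real or virtual) column for rows passing all filters.
        n_rows = len(next(iter(self.columns.values())))
        result = []
        for i in range(n_rows):
            scope = self._row_scope(i)
            keep = True
            for filter_expression in self.state["filters"]:
                if not eval(filter_expression, {"__builtins__": {}}, scope):
                    keep = False
                    break
            if keep:
                result.append(scope[name])
        return result


train = ToyFrame({"x": [3.0, -1.0, 6.0], "y": [4.0, 2.0, 8.0]})
train.add_virtual_column("r", "sqrt(x**2 + y**2)")
train.add_filter("x > 0")

# The 'pipeline' is just a small JSON document.
state_json = json.dumps(train.state)

# Apply the same pipeline to different data by loading the state.
test = ToyFrame({"x": [5.0, -2.0], "y": [12.0, 1.0]})
test.state = json.loads(state_json)

r_train = train.evaluate("r")  # [5.0, 10.0]
r_test = test.evaluate("r")    # [13.0]
```

The column `r` never exists as stored data in either frame; only its expression travels with the state.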
Vaex remote dataframe
- sometimes we don't need the data, we only care about the state
- data at server
- state changes at client
- server is stateless
- but does some caching for optimization
- can ask remotely to give a plot
import vaex
token = open('token-STSci.txt').read().strip()
df = vaex.open(f'ws://ec2-18-222-183-211.us-east-2.compute.amazonaws.com:9000/gaia_ps1_nochunk?token_trusted={token}')
df.plot('ra', 'dec', f='log')
- and can do computations remotely (returns head and tail)
import numpy as np
np.deg2rad(df.ra)
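The client/server split described above (data at the server, state at the client, a stateless server with some caching) can be sketched in plain Python. This is a toy illustration under stated assumptions, not vaex's actual protocol: `SERVER_DATA`, `handle_request`, and the request fields `expression` and `preview_rows` are all hypothetical names.

```python
import json
import math
from functools import lru_cache

# Data lives only on the "server"; a hypothetical stand-in for a large table.
SERVER_DATA = {"ra": [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]}


@lru_cache(maxsize=128)
def handle_request(request_json):
    """Stateless handler: everything it needs arrives inside the request.

    The cache is only an optimization; identical requests give identical replies.
    """
    request = json.loads(request_json)
    n_rows = len(SERVER_DATA["ra"])
    values = []
    for i in range(n_rows):
        # Names visible to the expression for row i.
        scope = {name: column[i] for name, column in SERVER_DATA.items()}
        scope["deg2rad"] = math.radians
        values.append(eval(request["expression"], {"__builtins__": {}}, scope))
    n = request["preview_rows"]
    # Only a small preview travels back, never the full column.
    return json.dumps({"length": n_rows, "head": values[:n], "tail": values[-n:]})


# Client side: the expression (the "state") is kept and sent by the client.
request = json.dumps({"expression": "deg2rad(ra)", "preview_rows": 2})
reply = json.loads(handle_request(request))
print(reply["length"], reply["head"], reply["tail"])
```

Because the handler keeps no per-client state, any server replica holding the same data could answer the same request.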
Working with vaex and sklearn
See notebook from PyData London 2019 talk.
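import vaex.ml.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
# features_linear and target are defined earlier in the notebook.
# Pass a model instance, not the class. Newer vaex-ml releases may name this wrapper vaex.ml.sklearn.Predictor.
linear_model = vaex.ml.sklearn.SKLearnPredictor(model=LinearRegression(), features=features_linear)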
df_train_mini = df_train[:1_000_000]
# Fit model to training data
linear_model.fit(df_train_mini, target=target)
# Return predictions
pred_linear = linear_model.predict(df_train_mini)
display(pred_linear)
# Create a virtual column housing the predictions
df_train = linear_model.transform(df_train)
df_train.head(5)
# Calculate error between the true trip durations and the predictions
mae_train_score = mean_absolute_error(df_train_mini.trip_duration.values, pred_linear)
mse_train_score = mean_squared_error(df_train_mini.trip_duration.values, pred_linear)
# Save pipeline to disk with vaex state
df_train.state_write('./taxi_ml_state.json')
# Load state into memory
df_test.state_load('./taxi_ml_state.json')
# See state
df_test.state_get()
# Visualize final prediction
df_test.final_prediction._graphviz()
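The two error metrics used above each compare the true targets with the predictions. As a minimal pure-Python sketch of what they compute (the helper names and the sample numbers are made up for illustration; in practice the scikit-learn functions are the ones to use):

```python
def mean_abs_error(y_true, y_pred):
    # Average of |true - predicted| over all samples.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_sq_error(y_true, y_pred):
    # Average of (true - predicted)^2 over all samples.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical trip durations and predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

mae_score = mean_abs_error(y_true, y_pred)  # (2 + 2 + 3 + 1) / 4 = 2.0
mse_score = mean_sq_error(y_true, y_pred)   # (4 + 4 + 9 + 1) / 4 = 4.5
```

Both functions need two arguments; passing only the true values, as a single column, gives nothing to compare against.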
See the number of columns in a vaex DataFrame
df_test.column_count()
Vaex: Out of Core Dataframes for Python and Fast Visualization - Blog post, Dec. 13, 2018