Vaex - BKJackson/BKJackson_Wiki GitHub Wiki

Vaex links

vaex.io
docs.vaex.io
github.com/vaexio/vaex
github.com/vaexio/vaex-talks/

What is vaex

  • A very fast and memory-efficient DataFrame library (e.g., 4 GB for a 100 GB file)
  • e.g., 1 billion rows - 1 TB of data on a laptop
  • Concept of data + state (virtual columns, filters)
  • Expression system allows JIT compilation (Numba, Pythran, CUDA)
  • The state 'remembers' the 'pipeline'; it is an artifact you get for free, which makes deployment easy.
  • S3 support and remote dataframes

Vaex remote dataframe

  • sometimes we don't need the data, we only care about the state
  • data at server
  • state changes at client
  • server is stateless
    • but does some caching for optimization
  • can ask remotely to give a plot
import numpy as np
import vaex

# Read the access token from a local file
token = open('token-STSci.txt').read().strip()
df = vaex.open(f'ws://ec2-18-222-183-211.us-east-2.compute.amazonaws.com:9000/gaia_ps1_nochunk?token_trusted={token}')

df.plot('ra', 'dec', f='log')
  • and can do computations remotely (returns head and tail)
np.deg2rad(df.ra)

Working with vaex and sklearn

See notebook from PyData London 2019 talk.

import vaex.ml.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# features_linear (list of feature column names) and target are defined in the notebook
linear_model = vaex.ml.sklearn.SKLearnPredictor(model=LinearRegression(), features=features_linear)

df_train_mini = df_train[:1_000_000]

# Fit model to training data
linear_model.fit(df_train_mini, target=target)  

# Return predictions
pred_linear = linear_model.predict(df_train_mini)

display(pred_linear)

# Create a virtual column housing the predictions
df_train = linear_model.transform(df_train)

df_train.head(5)  

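# Calculate error between the true values and the predictions on the training subset
mae_train_score = mean_absolute_error(df_train_mini.trip_duration.values, pred_linear)
mse_train_score = mean_squared_error(df_train_mini.trip_duration.values, pred_linear)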

# Save pipeline to disk with vaex state
df_train.state_write('./taxi_ml_state.json')

# Apply the saved state to the test dataframe
df_test.state_load('./taxi_ml_state.json')

# See state
df_test.state_get()

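# Visualize the expression graph behind the final prediction (final_prediction is assumed to be a column created in the notebook)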
df_test.final_prediction._graphviz()  
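
The error-calculation step above compares true values against predictions. As a rough illustration of what the two metrics compute, here is a NumPy-only sketch; the numbers are made up and are not taxi data.

```python
import numpy as np

# Made-up true values and predictions, for illustration only.
y_true = np.array([300.0, 600.0, 900.0])
y_pred = np.array([330.0, 540.0, 900.0])

errors = y_true - y_pred

# Mean absolute error: average size of the errors, in the target's units.
mae = np.mean(np.abs(errors))

# Mean squared error: average of squared errors, which penalizes large misses more.
mse = np.mean(errors ** 2)

print(mae, mse)
```

Both sklearn functions take the true values first and the predictions second.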

See number of columns in vaex data frame

df_test.column_count()  

Vaex: Out of Core Dataframes for Python and Fast Visualization - Blog post, Dec. 13, 2018