Remote machine learning

Recent ML models (whether LLMs or not) often require more resources than are available on a laptop.

For experiments and research, it would be very useful to serve ML models on machines with adequate computing resources (RAM/CPU/GPU) and run remote inference against them through gRPC/HTTP.

Note: I didn't include SkyPilot, but it also looks promising.

Specs

  • support for batch inference
  • open source
  • actively maintained
  • async and sync APIs
  • K8s compatible
  • easy to deploy a new model
  • support for arbitrary ML models
  • gRPC/HTTP APIs

Candidates

vLLM

Pros:
  • trivial to deploy and use (see the sketch below)

Cons:
  • only supports recent LLM architectures, not arbitrary ML models
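
To illustrate the "trivial to use" point, here is a minimal sketch of vLLM's offline Python API; the model name is only an example, and any Hugging Face causal LM supported by vLLM would do.

```python
# Minimal vLLM usage sketch; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # downloads and loads the weights
params = SamplingParams(temperature=0.8, max_tokens=64)

# Batch inference: pass several prompts at once and vLLM batches them.
outputs = llm.generate(["Explain remote inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```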

Kubeflow + KServe

Pros:
  • tailored for K8s and model serving
  • Kubeflow pipelines for training

Cons:
  • KServe is not framework agnostic: an inference runtime must be implemented for each framework before it can be served (many runtimes are already available, but that implies a delay whenever a new framework/library pops up; a custom runtime can be written with the Python SDK, as sketched below)
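
For a sense of what implementing a runtime involves, here is a minimal sketch of a custom predictor built on the kserve Python SDK; the class name and the toy predictor are hypothetical.

```python
# Minimal sketch of a custom KServe predictor; the model logic is a stand-in.
from kserve import Model, ModelServer

class CustomModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.load()

    def load(self):
        # Replace with real model loading; this stand-in doubles its inputs.
        self.predictor = lambda xs: [x * 2 for x in xs]
        self.ready = True

    def predict(self, payload: dict, headers=None) -> dict:
        # KServe's v1 protocol sends {"instances": [...]}.
        return {"predictions": self.predictor(payload["instances"])}

if __name__ == "__main__":
    ModelServer().start([CustomModel("custom-model")])
```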

BentoML

Pros:
  • framework agnostic/flexible (see the sketch below)

Cons:
  • only a shallow integration with K8s
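
As a sketch of that flexibility, here is a minimal service using BentoML's 1.x API; the service name and prediction logic are illustrative.

```python
# Minimal BentoML 1.x service sketch; the prediction logic is a stand-in.
import bentoml
from bentoml.io import JSON

svc = bentoml.Service("toy_service")

@svc.api(input=JSON(), output=JSON())
def predict(payload: dict) -> dict:
    # Replace with a real model call.
    return {"prediction": sum(payload.get("values", []))}
```

The service can then be exposed locally over HTTP with the `bentoml serve` CLI.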

NVIDIA Triton

Cons:
  • only for GPU/NVIDIA-backed models; no support for traditional ML models

TorchServe

Cons:
  • limited maintenance
  • only for PyTorch models, not traditional ML

Ray + Ray Serve

Pros:
  • fits very well with K8s (from a user standpoint at least); makes it easy to elastically deploy ML models (a single model) and apps (more complex ML workflows)
  • inference framework agnostic
  • vLLM support
  • seems to be the most popular/active project at the moment
  • supports training and generic data processing: tasks and DAGs of tasks (see the sketch after this list), which makes it very well suited to ML experiments/research
  • tooling to monitor inference, with metrics exposed for Grafana
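
A minimal sketch of what tasks and DAGs of tasks look like; the functions are toy examples.

```python
# Minimal sketch of Ray tasks composed into a small DAG; toy functions.
import ray

ray.init()

@ray.remote
def preprocess(x: int) -> int:
    return x + 1

@ray.remote
def infer(x: int) -> int:
    return x * 2

# Passing a future from one task into another creates a dependency edge;
# Ray schedules the resulting DAG across the cluster automatically.
result = ray.get(infer.remote(preprocess.remote(3)))
print(result)  # 8
```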

Cons:
  • a Ray head node is needed to manage the worker nodes (could KEDA or something similar shut it down when not needed?)
  • Ray's flexibility/agnosticism comes at the cost of some minor boilerplate code, e.g. to expose an HTTP service, as sketched below
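
To give an idea of the boilerplate in question, here is a minimal sketch of exposing a model over HTTP with Ray Serve; the deployment name and the stand-in model are hypothetical.

```python
# Minimal Ray Serve HTTP boilerplate sketch; the model is a stand-in.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1)
class ToyModel:
    def __init__(self):
        # Replace with real model loading.
        self.model = lambda x: x * 2

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"prediction": self.model(payload["input"])}

app = ToyModel.bind()
serve.run(app)  # serves HTTP on http://127.0.0.1:8000/ by default
```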

Conclusion

Ray comes in first place, followed by KServe.