Remote machine learning

Recent ML models (whether LLMs or not) often require more resources than are available on a laptop.

For experiments and research, it would be very useful to serve ML models on machines with adequate computing resources (RAM/CPU/GPU) and run remote inference against them through gRPC/HTTP.

Note: I didn't include SkyPilot, but it also looks promising.

Specs

  • support for batch inference
  • open source
  • actively maintained
  • async and sync APIs
  • K8s compatible
  • easy to deploy a new model
  • support for arbitrary ML models
  • gRPC/HTTP APIs

Candidates

vLLM

Pros:
  • trivial to deploy and use (see the sketch below)

Cons:
  • only supports recent LLM architectures, not arbitrary ML models
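
To illustrate the "trivial to use" point, here is a minimal sketch of vLLM's offline Python API; the model name is only an example, and any Hugging Face causal LM supported by vLLM would do.

```python
# Minimal vLLM usage sketch; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # downloads and loads the weights
params = SamplingParams(temperature=0.8, max_tokens=64)

# Batch inference: pass several prompts at once and vLLM batches them.
outputs = llm.generate(["Explain remote inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```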

Kubeflow + KServe

Pros:
  • tailored for K8s and model serving
  • Kubeflow pipelines for training

Cons:
  • KServe is not framework agnostic: an inference runtime must be implemented for each framework before it can be served (many runtimes are already available, but that implies a delay whenever a new framework/library pops up; a custom runtime can be written with the Python SDK, as sketched below)
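
For a sense of what implementing a runtime involves, here is a minimal sketch of a custom predictor built on the kserve Python SDK; the class name and the toy predictor are hypothetical.

```python
# Minimal sketch of a custom KServe predictor; the model logic is a stand-in.
from kserve import Model, ModelServer

class CustomModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.load()

    def load(self):
        # Replace with real model loading; this stand-in doubles its inputs.
        self.predictor = lambda xs: [x * 2 for x in xs]
        self.ready = True

    def predict(self, payload: dict, headers=None) -> dict:
        # KServe's v1 protocol sends {"instances": [...]}.
        return {"predictions": self.predictor(payload["instances"])}

if __name__ == "__main__":
    ModelServer().start([CustomModel("custom-model")])
```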

BentoML

Pros:
  • framework agnostic/flexible (see the sketch below)

Cons:
  • only a shallow integration with K8s
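
As a sketch of that flexibility, here is a minimal service using BentoML's 1.x API; the service name and prediction logic are illustrative.

```python
# Minimal BentoML 1.x service sketch; the prediction logic is a stand-in.
import bentoml
from bentoml.io import JSON

svc = bentoml.Service("toy_service")

@svc.api(input=JSON(), output=JSON())
def predict(payload: dict) -> dict:
    # Replace with a real model call.
    return {"prediction": sum(payload.get("values", []))}
```

The service can then be exposed locally over HTTP with the `bentoml serve` CLI.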

NVIDIA Triton

Cons:
  • only for GPU/NVIDIA-backed models; no support for traditional ML models

TorchServe

Cons:
  • limited maintenance
  • only for PyTorch models, not traditional ML

Ray + Ray Serve

Pros:
  • fits very well with K8s (from a user standpoint at least); makes it easy to elastically deploy ML models (a single model) and apps (more complex ML workflows)
  • inference framework agnostic
  • vLLM support
  • seems to be the most popular/active project at the moment
  • supports training and generic data processing: tasks and DAGs of tasks (see the sketch after this list), which makes it very well suited to ML experiments/research
  • tooling to monitor inference, with metrics exposed for Grafana
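
A minimal sketch of what tasks and DAGs of tasks look like; the functions are toy examples.

```python
# Minimal sketch of Ray tasks composed into a small DAG; toy functions.
import ray

ray.init()

@ray.remote
def preprocess(x: int) -> int:
    return x + 1

@ray.remote
def infer(x: int) -> int:
    return x * 2

# Passing a future from one task into another creates a dependency edge;
# Ray schedules the resulting DAG across the cluster automatically.
result = ray.get(infer.remote(preprocess.remote(3)))
print(result)  # 8
```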

Cons:
  • a Ray head node is needed to manage the worker nodes (could KEDA or something similar shut it down when not needed?)
  • Ray's flexibility/agnosticism comes at the cost of some minor boilerplate code, e.g. to expose an HTTP service, as sketched below
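
To give an idea of the boilerplate in question, here is a minimal sketch of exposing a model over HTTP with Ray Serve; the deployment name and the stand-in model are hypothetical.

```python
# Minimal Ray Serve HTTP boilerplate sketch; the model is a stand-in.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1)
class ToyModel:
    def __init__(self):
        # Replace with real model loading.
        self.model = lambda x: x * 2

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"prediction": self.model(payload["input"])}

app = ToyModel.bind()
serve.run(app)  # serves HTTP on http://127.0.0.1:8000/ by default
```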

Conclusion

Ray comes in first place, followed by KServe.