GPU

A GPU, or Graphics Processing Unit, is a specialized electronic circuit initially designed to accelerate computer graphics and image processing (either on a discrete video card or embedded in motherboards, mobile phones, personal computers, workstations, and game consoles).

For years I've wanted to buy a graphics card, but I've been stuck on the problem that I don't have a desktop. I have a Lenovo X280 laptop, used for both work and personal stuff, whose integrated card has so far let me play old games such as King Arthur's Gold or Age of Empires II, but it has a hard time with "newer" games such as It Takes Two. Last year I also bought a NAS with awesome hardware, so it makes no sense to buy a desktop just for gaming.

Now that I host Jellyfin on the NAS, and with machine learning on the hype and a lot of interesting solutions that can be self-hosted (Whisper, ChatGPT-like solutions...), it starts to make sense to add a GPU to the server. What made me take the step is that you can also self-host a gaming server and stream it to any device! It makes so much sense to have all the big guns inside the NAS and stream the content to the less powerful devices.

That way, if you host services, you make the most of the hardware.

Install cuda

CUDA is a parallel computing platform and programming model invented by NVIDIA®. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). If you're not using Debian 11, follow these instructions.

Base Installer

Installation Instructions:

wget https://developer.download.nvidia.com/compute/cuda/12.5.1/local_installers/cuda-repo-debian11-12-5-local_12.5.1-555.42.06-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-12-5-local_12.5.1-555.42.06-1_amd64.deb
sudo cp /var/cuda-repo-debian11-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-5

Additional installation options are detailed here.

Driver Installer

To install the open kernel module flavor:

sudo apt-get install -y nvidia-kernel-open-dkms
sudo apt-get install -y cuda-drivers

Install cuda

sudo apt-get install cuda
sudo reboot
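
Once the machine is back up, a quick way to confirm the toolkit is usable is to ask nvcc for its version. This is just a sanity check and assumes the default install location of /usr/local/cuda (adjust the path if your install differs or if nvcc is already on your PATH):

/usr/local/cuda/bin/nvcc --version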

Install nvidia card

Check whether your card is supported by the driver releases available for your OS.

If it's supported

If it's not supported

Ensure the GPUs are Installed

Install pciutils

Ensure that the lspci command is installed (which lists the PCI devices connected to the server):

sudo apt-get -y install pciutils

Check Installed Nvidia Cards

Perform a quick check to determine what Nvidia cards have been installed:

lspci | grep VGA

The output of the lspci command above should be something similar to:

00:02.0 VGA compatible controller: Intel Corporation 4th Gen ...
01:00.0 VGA compatible controller: Nvidia Corporation ...

If you do not see a line that includes Nvidia, then the GPU is not properly installed. Otherwise, you should see the make and model of the GPU devices that are installed.

Disable Nouveau

Blacklist Nouveau in Modprobe

The nouveau driver is an open source alternative to the Nvidia drivers and is often installed on the server by default. It does not work with CUDA and must be disabled. The first step is to create the file /etc/modprobe.d/blacklist-nouveau.conf.

Create the file with the following content:

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF

Then, run the following commands:

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u

Update Grub to Blacklist Nouveau

Backup your grub config template:

sudo cp /etc/default/grub /etc/default/grub.bak

Then, update your grub config template at /etc/default/grub. Add rd.driver.blacklist=nouveau and rcutree.rcu_idle_gp_delay=1 to the GRUB_CMDLINE_LINUX variable. For example, change:

GRUB_CMDLINE_LINUX="quiet"

to:

GRUB_CMDLINE_LINUX="quiet rd.driver.blacklist=nouveau rcutree.rcu_idle_gp_delay=1"

Then, rebuild your grub config (the command is named grub-mkconfig on Debian and grub2-mkconfig on some other distributions):

sudo grub-mkconfig -o /boot/grub/grub.cfg
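
After the next reboot you can double-check that the blacklist took effect; if nouveau is properly disabled, the following command should print nothing:

lsmod | grep nouveau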

Install prerequisites

The following prerequisites should be installed before installing the Nvidia drivers:

sudo apt-get -y install linux-headers-$(uname -r) make gcc-4.8
sudo apt-get -y install acpid dkms

Close X Server

Before running the install, you should exit out of any X environment, such as Gnome, KDE, or XFCE. To exit the X session, switch to a TTY console using Ctrl-Alt-F1 and then determine whether you are running lightdm, gdm, or kdm by running:

sudo ps aux | grep -E "lightdm|gdm|kdm"

Depending on which one is running, stop the service by running the following commands (substitute gdm or kdm for lightdm as appropriate):

sudo service lightdm stop
sudo init 3

Install Drivers Only

Important

To accommodate GL-accelerated rendering, OpenGL and GL Vendor Neutral Dispatch (GLVND) are now required and should be installed with the Nvidia drivers. OpenGL is an installation option in the *.run type of drivers; in other driver packaging formats, OpenGL is enabled by default in most modern versions (2016 and later). GLVND can be installed using the installer menus or via the --glvnd-glx-client command line flag.

This section deals with installing the drivers via the *.run executables provided by Nvidia.

To download only the drivers, navigate to http://www.nvidia.com/object/unix.html and click the Latest Long Lived Branch version under the appropriate CPU architecture. On the ensuing page, click Download and then click Agree and Download on the page that follows.

Note

The Unix drivers found in the link above are also compatible with all Nvidia Tesla models.

If you'd prefer to download the full driver repository, Nvidia provides a tool to recommend the most recent available driver for your graphics card at http://www.Nvidia.com/Download/index.aspx?lang=en-us.

If you are unsure which Nvidia devices are installed, the lspci command should give you that information:

lspci | grep -i "nvidia"

Download the recommended driver executable. Change the file permissions to allow execution:

chmod +x ./NVIDIA-Linux-$(uname -m)-*.run

Run the install.
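
For example, assuming the downloaded file matches the pattern used in the chmod command above, the installer can be launched with something like the following (it will prompt interactively for the OpenGL and GLVND options mentioned earlier):

sudo ./NVIDIA-Linux-$(uname -m)-*.run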

Check that it's installed

To check that the GPU is well installed and functioning properly, you can use the nvidia-smi command. This command provides detailed information about the installed Nvidia GPUs, including their status, utilization, and driver version.

First, ensure the Nvidia drivers are installed. Then, run:

nvidia-smi

If the GPU is properly installed, you should see an output that includes information about the GPU, such as its model, memory usage, and driver version. The output will look something like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If you encounter any errors or the GPU is not listed, there may be an issue with the installation or configuration of the GPU drivers.

Measure usage

For Nvidia GPUs, the nvidia-smi tool can show memory usage, GPU utilization, and the temperature of the GPU.
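
If you prefer a lighter view than the full nvidia-smi table, you can query specific fields and refresh them periodically. This is just a sketch using nvidia-smi's query interface (the field selection and 5-second interval are arbitrary choices):

nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw --format=csv -l 5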

Load test the gpu

First make sure you have CUDA installed, then build the gpu_burn tool:

git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make

To run a test for 60 seconds run:

./gpu_burn 60
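
While the burn test runs, it helps to keep an eye on temperature and utilization from another terminal, for example:

watch -n 2 nvidia-smi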

Monitor it with Prometheus

NVIDIA DCGM is a set of tools for managing and monitoring NVIDIA GPUs in large-scale, Linux-based cluster environments. It’s a low overhead tool that can perform a variety of functions including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration, and accounting. For more information, see the DCGM User Guide.

You can use DCGM to expose GPU metrics to Prometheus using dcgm-exporter.

Install NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows users to build and run GPU accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs.

sudo apt-get install -y nvidia-container-toolkit

Configure the container runtime by using the nvidia-ctk command:

sudo nvidia-ctk runtime configure --runtime=docker

Restart the Docker daemon:

sudo systemctl restart docker
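
To check that containers can actually see the GPU, a quick smoke test is to run nvidia-smi inside a CUDA container. The image tag below is only an example; use whichever CUDA base image matches your setup:

sudo docker run --rm --gpus all nvidia/cuda:12.5.1-base-ubuntu22.04 nvidia-smi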

Install NVIDIA DCGM

Follow the Getting Started Guide.

Determine the distribution name:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')

Download the meta-package to set up the CUDA network repository:

wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb

Install the repository meta-data and the CUDA GPG key:

sudo dpkg -i cuda-keyring_1.1-1_all.deb

Update the Apt repository cache:

sudo apt-get update

Now, install DCGM:

sudo apt-get install -y datacenter-gpu-manager

Enable the DCGM systemd service (so it starts on reboot) and start it now:

sudo systemctl --now enable nvidia-dcgm

You should see output similar to this:

● dcgm.service - DCGM service
  Loaded: loaded (/usr/lib/systemd/system/dcgm.service; disabled; vendor preset: enabled)
  Active: active (running) since Mon 2020-10-12 12:18:57 PDT; 14s ago
Main PID: 32847 (nv-hostengine)
    Tasks: 7 (limit: 39321)
  CGroup: /system.slice/dcgm.service
          └─32847 /usr/bin/nv-hostengine -n

Oct 12 12:18:57 ubuntu1804 systemd[1]: Started DCGM service.
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: DCGM initialized
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: Host Engine Listener Started

To verify installation, use dcgmi to query the system. You should see a listing of all supported GPUs (and any NVSwitches) found in the system:

dcgmi discovery -l

Output:

8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-1d82f4df-3cf9-150d-088b-52f18f8654e1                |
+--------+----------------------------------------------------------------------+
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-94168100-c5d5-1c05-9005-26953dd598e7                |
+--------+----------------------------------------------------------------------+
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:47:00.0                                         |
|        | Device UUID: GPU-9387e4b3-3640-0064-6b80-5ace1ee535f6                |
+--------+----------------------------------------------------------------------+
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:4E:00.0                                         |
|        | Device UUID: GPU-cefd0e59-c486-c12f-418c-84ccd7a12bb2                |
+--------+----------------------------------------------------------------------+
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:87:00.0                                         |
|        | Device UUID: GPU-1501b26d-f3e4-8501-421d-5a444b17eda8                |
+--------+----------------------------------------------------------------------+
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:00.0                                         |
|        | Device UUID: GPU-f4180a63-1978-6c56-9903-ca5aac8af020                |
+--------+----------------------------------------------------------------------+
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:B7:00.0                                         |
|        | Device UUID: GPU-8b354e3e-0145-6cfc-aec6-db2c28dae134                |
+--------+----------------------------------------------------------------------+
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:BD:00.0                                         |
|        | Device UUID: GPU-a16e3b98-8be2-6a0c-7fac-9cb024dbc2df                |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 11        |
| 10        |
| 13        |
| 9         |
| 12        |
| 8         |
+-----------+

Install the dcgm-exporter

As it doesn't need any persistence, I've added it to the Prometheus docker compose:

  dcgm-exporter:
    # latest didn't work
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    restart: unless-stopped
    container_name: dcgm-exporter

And added the following scrape config to prometheus.yml:

  - job_name: dcgm-exporter
    metrics_path: /metrics
    static_configs:
    - targets:
      - dcgm-exporter:9400
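
Before relying on the alerts in the next section, it's worth checking that the exporter is actually serving metrics. It listens on port 9400 inside the compose network, so if you publish that port (or run the check from a container on the same network) something like this should return DCGM samples:

curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP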

Adding alerts

Tweak the following alerts for your use case.

---
groups:
- name: dcgm-alerts
  rules:
  - alert: GPUHighTemperature
    expr: DCGM_FI_DEV_GPU_TEMP > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU High Temperature (instance {{ $labels.instance }})"
      description: "The GPU temperature is above 80°C for more than 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: GPUMemoryUtilizationHigh
    expr: DCGM_FI_DEV_MEM_COPY_UTIL > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU Memory Utilization High (instance {{ $labels.instance }})"
      description: "The GPU memory utilization is above 90% for more than 10 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: GPUComputeUtilizationHigh
    expr: DCGM_FI_DEV_GPU_UTIL > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU Compute Utilization High (instance {{ $labels.instance }})"
      description: "The GPU compute utilization is above 90% for more than 10 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: GPUPowerUsageHigh
    expr: DCGM_FI_DEV_POWER_USAGE > 160
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU Power Usage High (instance {{ $labels.instance }})"
      description: "The GPU power usage is above 160W for more than 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: GPUUnavailable
    expr: up{job="dcgm-exporter"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU Unavailable (instance {{ $labels.instance }})"
      description: "The DCGM Exporter instance is down or unreachable for more than 5 minutes.\n  LABELS: {{ $labels }}"

Adding a dashboard

I've tweaked this dashboard to simplify it:

{
  "annotations": {
    "list": [
      {
        "$$hashKey": "object:192",
        "builtIn": 1,
        "datasource": {
          "type": "datasource",
          "uid": "grafana"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "description": "This dashboard is to display the metrics from DCGM Exporter on a Kubernetes (1.13+) cluster",
  "editable": true,
  "fiscalYearStartMonth": 0,
  "gnetId": 12239,
  "graphTooltip": 0,
  "links": [],
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "PBFA97CFB590B2093"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "celsius"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 18,
        "x": 0,
        "y": 0
      },
      "id": 12,
      "options": {
        "legend": {
          "calcs": [
            "mean",
            "lastNotNull",
            "max"
          ],
          "displayMode": "table",
          "placement": "right",
          "showLegend": true
        },
        "tooltip": {
          "maxHeight": 600,
          "mode": "multi",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "PBFA97CFB590B2093"
          },
          "editorMode": "code",
          "expr": "DCGM_FI_DEV_GPU_TEMP",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "GPU Temperature",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "PBFA97CFB590B2093"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "max": 100,
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "#EAB839",
                "value": 83
              },
              {
                "color": "red",
                "value": 87
              }
            ]
          },
          "unit": "celsius"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 6,
        "x": 18,
        "y": 0
      },
      "id": 14,
      "options": {
        "minVizHeight": 75,
        "minVizWidth": 75,
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true,
        "sizing": "auto"
      },
      "pluginVersion": "11.0.1",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "PBFA97CFB590B2093"
          },
          "editorMode": "code",
          "expr": "avg(DCGM_FI_DEV_GPU_TEMP)",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "GPU Avg. Temp",
      "type": "gauge"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "PBFA97CFB590B2093"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "watt"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 18,
        "x": 0,
        "y": 8
      },
      "id": 10,
      "options": {
        "legend": {
          "calcs": [
            "mean",
            "lastNotNull",
            "max"
          ],
          "displayMode": "table",
          "placement": "right",
          "showLegend": true
        },
        "tooltip": {
          "maxHeight": 600,
          "mode": "multi",
          "sort": "none"
        }
      },
      "pluginVersion": "6.5.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "PBFA97CFB590B2093"
          },
          "editorMode": "code",
          "expr": "DCGM_FI_DEV_POWER_USAGE",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "GPU Power Usage",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "PBFA97CFB590B2093"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "max": 2400,
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "#EAB839",
                "value": 1800
              },
              {
                "color": "red",
                "value": 2200
              }
            ]
          },
          "unit": "watt"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 6,
        "x": 18,
        "y": 8
      },
      "id": 16,
      "options": {
        "minVizHeight": 75,
        "minVizWidth": 75,
        "orientation": "horizontal",
        "reduceOptions": {
          "calcs": [
            "sum"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true,
        "sizing": "auto"
      },
      "pluginVersion": "11.0.1",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "PBFA97CFB590B2093"
          },
          "editorMode": "code",
          "expr": "sum(DCGM_FI_DEV_POWER_USAGE)",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "GPU Power Total",
      "type": "gauge"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "PBFA97CFB590B2093"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "hertz"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 16
      },
      "id": 2,
      "interval": "",
      "options": {
        "legend": {
          "calcs": [
            "mean",
            "lastNotNull",
            "max"
          ],
          "displayMode": "table",
          "placement": "right",
          "showLegend": true
        },
        "tooltip": {
          "maxHeight": 600,
          "mode": "multi",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "PBFA97CFB590B2093"
          },
          "editorMode": "code",
          "expr": "DCGM_FI_DEV_SM_CLOCK * 1000000",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "GPU SM Clocks",
      "type": "timeseries"
    },
    {
      "aliasColors": {},
      "autoMigrateFrom": "graph",
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": {
        "uid": "${DS_PROMETHEUS}"
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 24
      },
      "hiddenSeries": false,
      "id": 6,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "max": true,
        "min": false,
        "rightSide": true,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "datasource": {
            "uid": "${DS_PROMETHEUS}"
          },
          "expr": "DCGM_FI_DEV_GPU_UTIL{instance=~\"${instance}\", gpu=~\"${gpu}\"}",
          "interval": "",
          "legendFormat": "GPU {{gpu}}",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeRegions": [],
      "title": "GPU Utilization",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "cumulative"
      },
      "type": "timeseries",
      "xaxis": {
        "mode": "time",
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "percent",
          "logBase": 1,
          "max": "100",
          "min": "0",
          "show": true
        },
        {
          "format": "short",
          "logBase": 1,
          "show": true
        }
      ],
      "yaxis": {
        "align": false
      }
    },
    {
      "aliasColors": {},
      "autoMigrateFrom": "graph",
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": {
        "uid": "${DS_PROMETHEUS}"
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 32
      },
      "hiddenSeries": false,
      "id": 4,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "max": true,
        "min": false,
        "rightSide": true,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "datasource": {
            "uid": "${DS_PROMETHEUS}"
          },
          "expr": "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{instance=~\"${instance}\", gpu=~\"${gpu}\"}",
          "interval": "",
          "legendFormat": "GPU {{gpu}}",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeRegions": [],
      "title": "Tensor Core Utilization",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "cumulative"
      },
      "type": "timeseries",
      "xaxis": {
        "mode": "time",
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "percentunit",
          "logBase": 1,
          "max": "1",
          "min": "0",
          "show": true
        },
        {
          "format": "short",
          "logBase": 1,
          "show": true
        }
      ],
      "yaxis": {
        "align": false
      }
    },
    {
      "aliasColors": {},
      "autoMigrateFrom": "graph",
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": {
        "uid": "${DS_PROMETHEUS}"
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 40
      },
      "hiddenSeries": false,
      "id": 18,
      "legend": {
        "avg": true,
        "current": false,
        "max": true,
        "min": false,
        "rightSide": true,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "nullPointMode": "null",
      "options": {
        "dataLinks": []
      },
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "datasource": {
            "uid": "${DS_PROMETHEUS}"
          },
          "expr": "DCGM_FI_DEV_FB_USED{instance=~\"${instance}\", gpu=~\"${gpu}\"}",
          "interval": "",
          "legendFormat": "GPU {{gpu}}",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeRegions": [],
      "title": "GPU Framebuffer Mem Used",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "timeseries",
      "xaxis": {
        "mode": "time",
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "decmbytes",
          "logBase": 1,
          "show": true
        },
        {
          "format": "short",
          "logBase": 1,
          "show": true
        }
      ],
      "yaxis": {
        "align": false
      }
    }
  ],
  "refresh": false,
  "schemaVersion": 39,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timeRangeUpdatedDuringEditOrView": false,
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ]
  },
  "timezone": "",
  "title": "NVIDIA DCGM Exporter Dashboard",
  "uid": "Oxed_c6Wz",
  "version": 1,
  "weekStart": ""
}

Market analysis

Done in January 2024, taking into account the following requirements:

  • Price under $600
  • Able to perform the following actions for at least 5 years:
    • Jellyfin transcoding
    • Video game streaming from a headless server
    • Machine learning operations

Using these sources:

Overview of the market analysis

The best graphics card is objectively Nvidia's RTX 4090, but it's too expensive. Then there's the RTX 4080, which is a bit too pricey for us, and the RTX 4070 Ti. The RTX 4070 Ti also costs a heap more cash than we'd like, but at least it's more reasonable than Nvidia's finest for a perfectly 4K-capable card.

On the other end of the market, Nvidia has a rather uninspired upgrade in the RTX 4060. We also met the release of AMD's RX 7600 with a shrug, but at least it's cheap enough now to feel more competitive. And Intel still has a dog in the budget game: the Arc A750. When this card drops down to around $200, it's a steal, though the drivers aren't always up to the standard we'd like to see. That leaves AMD's RX 7600 as the best budget graphics card today, mostly for being a boringly safe pick.