Course Outline

EXO Infrastructure as Code

  • Overview of EXO deployment patterns: single-node, multi-node, and RDMA clusters.
  • Automating dependency installation (Xcode, uv, Node.js, Rust) with configuration management.
  • Using Nix flakes for reproducible EXO builds and developer environments.
  • Writing Ansible playbooks or shell scripts for unattended cluster provisioning.
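
A provisioning script usually starts with a preflight check for the dependencies listed above. The following is a minimal sketch, not part of EXO itself: the executable names (xcodebuild, uv, node, cargo) are assumed to be the ones that correspond to Xcode, uv, Node.js, and Rust, and the function name missing_tools is hypothetical.

```python
import shutil

# Assumed mapping from the dependencies named in the outline to the
# executables expected on PATH. Adjust to match the actual toolchain.
REQUIRED_TOOLS = {
    "Xcode": "xcodebuild",
    "uv": "uv",
    "Node.js": "node",
    "Rust": "cargo",
}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the sorted names of dependencies whose executable is not on PATH.

    `which` is injectable so the check can be tested without touching the host.
    """
    return sorted(name for name, exe in required.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        raise SystemExit("Missing dependencies: " + ", ".join(missing))
    print("All dependencies found.")
```

An Ansible playbook or shell wrapper could run a check like this first and install only what is reported missing.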

Reproducible Builds and CI Integration

  • Pinning dependencies and building the dashboard within CI pipelines.
  • Running EXO smoke tests in GitHub Actions or GitLab CI runners.
  • Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs.
  • Versioning custom model cards alongside application code.

Cluster Discovery and Networking Automation

  • Configuring mDNS and static DNS for reliable libp2p node discovery.
  • Automating network profile creation and Thunderbolt bridge management on macOS.
  • Using custom namespaces (EXO_LIBP2P_NAMESPACE) to separate dev, staging, and prod clusters.
  • Firewall rules and network segmentation for multi-tenant environments.
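
Namespace separation can be automated by deriving the process environment from the target environment name. This sketch assumes a naming scheme of the form "exo-<environment>", which is purely illustrative; only the EXO_LIBP2P_NAMESPACE variable name comes from the outline, and the helper exo_environment is hypothetical.

```python
import os

# Environments named in the outline; the "exo-<name>" scheme is an assumption.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")


def exo_environment(env_name, base_env=None, prefix="exo"):
    """Return a copy of base_env with EXO_LIBP2P_NAMESPACE set for env_name.

    The original mapping is left untouched so the result can be passed
    directly to subprocess.run(..., env=...).
    """
    if env_name not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env_name!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["EXO_LIBP2P_NAMESPACE"] = f"{prefix}-{env_name}"
    return env
```

Rejecting unknown names early avoids a node silently joining a mistyped namespace and forming its own one-node cluster.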

Storage and Model Lifecycle Management

  • Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies.
  • Mounting NFS or SAN shares as read-only model repositories for rapid provisioning.
  • Garbage collection of stale caches and versioned weight retention policies.
  • Automating model pre-downloads and health checks prior to rolling updates.
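
A retention policy for versioned weights can be expressed as a small pure function before any files are deleted. The sketch below assumes that each cached version has a label and a last-used timestamp; how those are collected from the model directories is left out, and the name select_for_removal is hypothetical.

```python
def select_for_removal(versions, keep_latest=2, pinned=()):
    """Choose which cached model versions to delete.

    versions:    dict mapping a version label to its last-used time (seconds).
    keep_latest: number of most recently used versions to retain.
    pinned:      labels that must never be removed.

    Returns the labels to delete, least recently used first.
    """
    if keep_latest < 0:
        raise ValueError("keep_latest must be non-negative")
    by_recency = sorted(versions, key=versions.get, reverse=True)
    keep = set(by_recency[:keep_latest]) | set(pinned)
    return [label for label in reversed(by_recency) if label not in keep]
```

Separating the decision from the deletion makes the policy easy to dry-run and review before it touches shared storage.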

Monitoring and Alerting

  • Shipping EXO logs to centralized logging systems (ELK, Loki, or Splunk).
  • Building Grafana dashboards from EXO_TRACING_ENABLED output.
  • Alerting on cluster membership changes, OOM events, and inference latency spikes.
  • Correlating macmon hardware telemetry with model performance regressions.
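
Alerting on inference latency spikes can be prototyped with a rolling baseline before it is encoded as an alert rule in a monitoring system. This is a minimal sketch under assumed inputs (a list of latency samples in milliseconds); the window size, threshold factor, and function name are illustrative, not values from EXO.

```python
from statistics import median


def latency_spikes(samples_ms, window=5, factor=3.0):
    """Return the indices of samples that exceed factor x the rolling median.

    The baseline for sample i is the median of the `window` samples that
    precede it, so the first `window` samples are never flagged.
    """
    spikes = []
    for i in range(window, len(samples_ms)):
        baseline = median(samples_ms[i - window:i])
        if baseline > 0 and samples_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

A median baseline is used here because a single outlier inside the window barely shifts it, so one spike does not mask or trigger the next.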

Update, Rollback, and Disaster Recovery

  • Staging EXO binary updates on a canary node before fleet-wide rollout.
  • Model-level rollback: switching between quantized versions without re-downloading.
  • Backing up and restoring cluster state, custom namespaces, and cached weights.
  • Documenting recovery runbooks for total cluster rebuild scenarios.
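
The canary-first rollout described above can be planned as an ordered list of batches. The sketch below only computes the order; the node names are hypothetical, and the actual update and health-check steps between batches are outside its scope.

```python
def rollout_plan(nodes, canary, batch_size=2):
    """Return update batches: the canary alone first, then the rest in order.

    Each batch should be updated and verified before the next one starts.
    """
    if canary not in nodes:
        raise ValueError(f"canary {canary!r} is not in the node list")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rest = [node for node in nodes if node != canary]
    batches = [[canary]]
    batches += [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return batches
```

For example, rollout_plan(["mac-1", "mac-2", "mac-3", "mac-4"], canary="mac-3") yields the canary batch followed by two batches covering the remaining three nodes.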

Security Hardening and Compliance

  • Applying TLS at the reverse proxy layer (nginx, Traefik) for the dashboard and API.
  • Implementing API rate limiting and IP whitelisting for EXO endpoints.
  • Isolating clusters with VLANs and zero-trust network policies.
  • Auditing access and maintaining an inventory of deployed models and versions.
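
IP-based access control is normally enforced at the reverse proxy or firewall, but the matching logic is simple enough to illustrate with the Python standard library. This sketch assumes the allowed ranges are given as CIDR strings; the helper names are hypothetical and not part of EXO.

```python
import ipaddress


def build_allowlist(cidrs):
    """Parse CIDR strings into network objects, rejecting malformed entries."""
    return [ipaddress.ip_network(cidr, strict=True) for cidr in cidrs]


def is_allowed(client_ip, allowlist):
    """Return True if client_ip falls inside any network in the allowlist."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in allowlist)
```

Parsing with strict=True makes an entry such as 10.0.0.5/24 (host bits set) fail loudly instead of silently widening the allowed range.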

Requirements

  • Experience with DevOps practices (CI/CD, IaC, container orchestration)
  • Familiarity with macOS or Linux system administration and package management
  • Understanding of networking, DNS, and storage concepts

Audience

  • DevOps engineers
  • Infrastructure architects
  • SREs responsible for on-premises AI workloads

Duration: 21 hours
