Kickoff: Publishing Technical Notes

A checklist for drafting and shipping your first technical post.

This sampler walks through the Markdown patterns you can use to document technical work. It mirrors the structure of a real runbook or design note, so you can copy and paste sections as needed.

Summary

  • Goal: Document the new background job scheduler rollout. :rocket:
  • Scope: api and worker apps, deployments in staging and prod-us.
  • Owners: @ops-platform (primary), @edge-infra (review).

Context

We are migrating from the legacy cron runner to a Rust-based scheduler. The new scheduler reduces dispatch latency by 15% and unlocks per-tenant throttling. The code lives in services/scheduler; see ADR-042 for historical context.

Decision Log

| Date       | Decision                                           | Status                   |
|------------|----------------------------------------------------|--------------------------|
| 2025-09-12 | Adopt Rust scheduler with gRPC bridge              | :white_check_mark:       |
| 2025-09-18 | Keep cron in read-only mode for 2 weeks            | :white_check_mark:       |
| 2025-09-25 | Move metrics from StatsD to OpenTelemetry exporter | :hourglass_flowing_sand: |

Default table styling aligns with the blog’s dark theme; use emojis or status pills sparingly and explain abbreviations inline.

Rollout Plan

  1. Cut a release: git tag scheduler-v1.0.0 && git push origin scheduler-v1.0.0.
  2. Deploy to staging:
    kubectl --context staging-us apply -f deploy/scheduler.yaml
    kubectl --context staging-us rollout status deploy/scheduler
    
  3. Smoke test: run bin/tasks smoke --env staging and confirm every job fires within ±1 minute of its schedule (a verification sketch follows the checklist).
  4. Enable shadow mode in prod:
    helm upgrade scheduler charts/scheduler \
      --set scheduler.shadow.enabled=true \
      --set scheduler.shadow.percent=25
    
  5. Promote to active after 24 hours without alerts. :tada:

Keep the checklist chronological and annotate each step with the expected owner (for example, Platform SRE vs. Application Team).
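
For the timing check in step 3, here is a minimal sketch of the ±1 minute assertion. It assumes the smoke run's results are available as (job, scheduled, actual) timestamps; the report format and job names below are placeholders.

# smoke_check.py: verify each job fired within ±1 minute of its schedule.
# How the (job, scheduled, actual) tuples are produced is left open; the
# sample data and job names below are placeholders.
from datetime import datetime, timedelta

TOLERANCE = timedelta(minutes=1)

def check_fire_times(results: list[tuple[str, datetime, datetime]]) -> list[str]:
    """Return the names of jobs that fired outside the ±1 minute window."""
    return [job for job, scheduled, actual in results if abs(actual - scheduled) > TOLERANCE]

if __name__ == "__main__":
    sample = [
        ("invoices.sync", datetime(2025, 9, 29, 12, 0), datetime(2025, 9, 29, 12, 0, 20)),
        ("reports.nightly", datetime(2025, 9, 29, 12, 5), datetime(2025, 9, 29, 12, 7)),
    ]
    print("late jobs:", check_fire_times(sample) or "none")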

Rollback tip

If the scheduler fails in active mode, run helm rollback scheduler and scale the cron deployment back to 2 replicas. Notify #oncall-platform before flipping the traffic switch.
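
If you script the rollback, a minimal sketch of the two commands above: the prod-us context comes from the summary, while the cron-runner deployment name is a placeholder.

# rollback.py: revert the scheduler release and bring the cron runner back.
# The deployment name "cron-runner" is a placeholder; adjust to your cluster.
import subprocess

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

def rollback() -> None:
    # Revert the Helm release to its previous revision.
    run(["helm", "rollback", "scheduler"])
    # Scale the legacy cron deployment back to 2 replicas so jobs keep firing.
    run(["kubectl", "--context", "prod-us", "scale", "deploy/cron-runner", "--replicas=2"])

if __name__ == "__main__":
    rollback()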

Metrics & Alerting

  • Requests: sum(rate(scheduler_dispatch_total[5m])) by (queue)
  • Failures: sum(rate(scheduler_dispatch_failed_total[5m]))
  • p95 latency: histogram_quantile(0.95, sum(rate(scheduler_dispatch_latency_seconds_bucket[5m])) by (le))
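
These queries can also be run ad hoc against the Prometheus HTTP API; a minimal sketch using httpx, with the Prometheus URL as a placeholder:

# metrics_check.py: run the dispatch queries above as instant queries.
# The Prometheus URL is a placeholder; point it at your instance.
import httpx

PROMETHEUS_URL = "http://prometheus.internal:9090"

QUERIES = {
    "requests": "sum(rate(scheduler_dispatch_total[5m])) by (queue)",
    "failures": "sum(rate(scheduler_dispatch_failed_total[5m]))",
    "p95_latency": "histogram_quantile(0.95, sum(rate(scheduler_dispatch_latency_seconds_bucket[5m])) by (le))",
}

def instant_query(expr: str) -> list[dict]:
    # GET /api/v1/query returns {"data": {"result": [{"metric": ..., "value": [ts, val]}]}}
    resp = httpx.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for name, expr in QUERIES.items():
        for series in instant_query(expr):
            print(name, series["metric"], series["value"][1])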

Add dashboards under grafana/dashboards/scheduler.json. Apply an alert-fatigue check: no more than two pages in any 30-minute window.
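
A minimal sketch of that alert-fatigue check; where the page timestamps come from (Alertmanager export, paging-tool API) is left open:

# alert_fatigue.py: flag any 30-minute window containing more than two pages.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)
MAX_PAGES = 2

def fatigue_violations(page_times: list[datetime]) -> list[datetime]:
    """Return window start times where the page count exceeds MAX_PAGES."""
    times = sorted(page_times)
    return [
        start
        for i, start in enumerate(times)
        if sum(1 for t in times[i:] if t - start <= WINDOW) > MAX_PAGES
    ]

if __name__ == "__main__":
    pages = [datetime(2025, 9, 29, 9, m) for m in (0, 10, 25)]  # placeholder timestamps
    print(fatigue_violations(pages))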

Sample Alert

groups:
  - name: scheduler.alerts
    rules:
      - alert: SchedulerDispatchBacklog
        expr: sum(scheduler_enqueued_total - scheduler_processed_total) > 500
        for: 10m
        labels:
          severity: page
        annotations:
          summary: Scheduler backlog climbing above threshold
          runbook: https://runbooks.internal/scheduler-backlog

Client Integrations

# services/scheduler/client.rb
require "singleton"

module Scheduler
  class Client
    include Singleton

    def enqueue(job_class, payload, run_at: Time.now)
      Http.post(
        "/internal/jobs",
        json: { job_class:, payload:, run_at: run_at.iso8601 }
      )
    rescue Http::Error => e
      Rails.logger.error("Scheduler enqueue failed: #{e.message}")
      raise
    end
  end
end

# sdk/scheduler.py
import httpx

class SchedulerClient:
    def __init__(self, base_url: str, token: str):
        self._client = httpx.Client(base_url=base_url, headers={"Authorization": f"Bearer {token}"})

    def enqueue(self, job_type: str, payload: dict, run_at=None):
        data = {"job_type": job_type, "payload": payload}
        if run_at:
            data["run_at"] = run_at.isoformat()
        response = self._client.post("/internal/jobs", json=data, timeout=5)
        response.raise_for_status()
        return response.json()
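
A usage sketch for the Python SDK above; the base URL, token, and job fields are placeholders:

# Assuming sdk/scheduler.py (above) is importable; all values are placeholders.
from datetime import datetime, timedelta, timezone

client = SchedulerClient(base_url="https://scheduler.internal", token="example-token")

# Enqueue a job to run five minutes from now.
job = client.enqueue(
    job_type="reports.nightly",
    payload={"tenant_id": 42},
    run_at=datetime.now(timezone.utc) + timedelta(minutes=5),
)
print(job)
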
<!-- docs/snippets/job-status.html -->
<details class="job-status">
  <summary>Scheduler Status (preview)</summary>
  <table class="job-status__table">
    <thead>
      <tr>
        <th>Queue</th>
        <th>Lag</th>
        <th>Failures (24h)</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>critical</td>
        <td>0.5s</td>
        <td>1</td>
      </tr>
      <tr>
        <td>bulk</td>
        <td>4.3s</td>
        <td>0</td>
      </tr>
    </tbody>
  </table>
</details>

# config/scheduler.yaml
enabled: true
queues:
  - name: critical
    concurrency: 10
    retry_policy:
      max_attempts: 5
      backoff: exponential
  - name: bulk
    concurrency: 4
    retry_policy:
      max_attempts: 3
      backoff: fixed
shadow_mode:
  enabled: true
  sampling_percent: 25
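
A small sketch that loads config/scheduler.yaml and sanity-checks it before a deploy; the field names follow the sample above, and PyYAML is assumed to be available:

# validate_config.py: sanity-check config/scheduler.yaml before rollout.
import yaml  # PyYAML, assumed available

REQUIRED_QUEUE_KEYS = {"name", "concurrency", "retry_policy"}

def validate(path: str = "config/scheduler.yaml") -> None:
    with open(path) as f:
        config = yaml.safe_load(f)

    for queue in config.get("queues", []):
        missing = REQUIRED_QUEUE_KEYS - queue.keys()
        if missing:
            raise ValueError(f"queue {queue.get('name', '?')} is missing keys: {missing}")

    percent = config.get("shadow_mode", {}).get("sampling_percent", 0)
    if not 0 <= percent <= 100:
        raise ValueError(f"sampling_percent out of range: {percent}")

    print(f"{len(config['queues'])} queues OK, shadow sampling at {percent}%")

if __name__ == "__main__":
    validate()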

Troubleshooting

  • Queues not draining: Check Kafka lag using bin/observability kafka-lag --queue scheduler.
  • High CPU: Profile with perf top --pid $(pgrep scheduler).
  • Job stuck in retrying: Inspect the jobs table for the retry_after column and adjust via the admin API (a query sketch follows this list).
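
A query sketch for the stuck-retry case, assuming the jobs table lives in Postgres; the connection string and the id/job_type columns are placeholders, only retry_after comes from the bullet above:

# stuck_retries.py: list jobs whose retry_after sits suspiciously far out.
# Assumes a Postgres-backed jobs table; the DSN and the id/job_type columns
# are placeholders.
import psycopg2

QUERY = """
    SELECT id, job_type, retry_after
    FROM jobs
    WHERE retry_after > now() + interval '1 hour'
    ORDER BY retry_after DESC
    LIMIT 20
"""

def stuck_jobs(dsn: str = "postgresql://localhost/scheduler"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

if __name__ == "__main__":
    for job_id, job_type, retry_after in stuck_jobs():
        print(job_id, job_type, retry_after)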

Common Pitfalls

  1. Forgetting to disable the cron runner leads to double dispatching.
  2. Skipping Grafana dashboard updates hides latency regressions.
  3. Not bumping library versions in the monorepo causes CI lint failures.

References

  • ADR-042: historical context for the move off the legacy cron runner.
  • Scheduler backlog runbook: https://runbooks.internal/scheduler-backlog

Need more examples? Duplicate this post and trim the sections you do not need. Every article in this space should ship with clear context, reproducible commands, and verified rollout steps.