Kickoff: Publishing Technical Notes
A checklist for drafting and shipping your first technical post.
This sampler walks through the Markdown patterns you can use to document technical work. It mirrors the actual structure of a runbook or design note so you can copy-and-paste sections as needed.
Summary
- Goal: Document the new background job scheduler rollout.
- Scope: `api` and `worker` apps, deployments in `staging` and `prod-us`.
- Owners: @ops-platform (primary), @edge-infra (review).
Context
We are migrating from the legacy cron runner to a Rust-based scheduler. The scheduler improves latency by 15% and unlocks per-tenant throttling. The code lives in `services/scheduler`. See ADR-042 for historical context.
Decision Log
| Date | Decision | Status |
|---|---|---|
| 2025-09-12 | Adopt Rust scheduler with gRPC bridge | |
| 2025-09-18 | Keep cron in read-only mode for 2 weeks | |
| 2025-09-25 | Move metrics from StatsD to OpenTelemetry exporter | |
Default table styling aligns with the blog’s dark theme; use emojis or status pills sparingly and explain abbreviations inline.
Rollout Plan
- Cut a release: `git tag scheduler-v1.0.0 && git push origin scheduler-v1.0.0`.
- Deploy to staging:

  ```sh
  kubectl --context staging-us apply -f deploy/scheduler.yaml
  kubectl --context staging-us rollout status deploy/scheduler
  ```

- Smoke test: run `bin/tasks smoke --env staging` and confirm every job fires within ±1 minute of schedule (see the verification sketch after this list).
- Enable shadow mode in prod:

  ```sh
  helm upgrade scheduler charts/scheduler \
    --set scheduler.shadow.enabled=true \
    --set scheduler.shadow.percent=25
  ```

- Promote to active after 24 hours without alerts.
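The ±1 minute check from the smoke-test step is easy to script. A minimal sketch, assuming you can extract (scheduled, fired) timestamps per job from the smoke-test output; the job names and data below are illustrative only:

```python
from datetime import datetime, timedelta

TOLERANCE = timedelta(minutes=1)


def late_jobs(runs: dict[str, tuple[datetime, datetime]]) -> list[str]:
    """Return the jobs whose fire time drifted more than ±1 minute from schedule."""
    return [
        name
        for name, (scheduled_at, fired_at) in runs.items()
        if abs(fired_at - scheduled_at) > TOLERANCE
    ]


# Illustrative data, not real smoke-test output.
print(late_jobs({
    "billing-sync": (datetime(2025, 9, 18, 12, 0), datetime(2025, 9, 18, 12, 0, 20)),
    "digest-email": (datetime(2025, 9, 18, 12, 5), datetime(2025, 9, 18, 12, 7)),
}))  # -> ['digest-email']
```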
Keep the checklist chronological and annotate each step with the expected owner (for example, Platform SRE vs. Application Team).
Rollback tip
If the scheduler fails in active mode, run `helm rollback scheduler` and scale the cron deployment back to 2 replicas. Notify #oncall-platform before flipping the traffic switch.
Metrics & Alerting
- Requests: `sum(rate(scheduler_dispatch_total[5m])) by (queue)`
- Failures: `sum(rate(scheduler_dispatch_failed_total[5m]))`
- p95 latency: `histogram_quantile(0.95, sum(rate(scheduler_dispatch_latency_seconds_bucket[5m])) by (le))`
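To spot-check any of these expressions outside Grafana, you can query the Prometheus HTTP API directly. A minimal sketch; the Prometheus URL below is a placeholder, not the real endpoint:

```python
import httpx

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder; point at your Prometheus
P95_QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(scheduler_dispatch_latency_seconds_bucket[5m])) by (le))"
)


def p95_dispatch_latency() -> float | None:
    """Evaluate the p95 latency expression and return it in seconds."""
    response = httpx.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": P95_QUERY}, timeout=5
    )
    response.raise_for_status()
    result = response.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None
```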
Add dashboards under `grafana/dashboards/scheduler.json`. Apply an alert-fatigue check: no more than two pages in any 30-minute window.
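That fatigue budget can be enforced with a small review script. A minimal sketch, assuming you can export page timestamps from your alert history:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)
MAX_PAGES = 2


def exceeds_page_budget(page_times: list[datetime]) -> bool:
    """True if any 30-minute window contains more than MAX_PAGES pages."""
    ordered = sorted(page_times)
    for i, start in enumerate(ordered):
        pages_in_window = sum(1 for t in ordered[i:] if t - start <= WINDOW)
        if pages_in_window > MAX_PAGES:
            return True
    return False
```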
Sample Alert
```yaml
groups:
  - name: scheduler.alerts
    rules:
      - alert: SchedulerDispatchBacklog
        expr: sum(scheduler_enqueued_total - scheduler_processed_total) > 500
        for: 10m
        labels:
          severity: page
        annotations:
          summary: Scheduler backlog climbing above threshold
          runbook: https://runbooks.internal/scheduler-backlog
```
Client Integrations
```ruby
# services/scheduler/client.rb
require "singleton"

module Scheduler
  class Client
    include Singleton

    # Enqueue a job via the scheduler's internal HTTP API.
    # Call sites use Scheduler::Client.instance.enqueue(MyJob, payload).
    def enqueue(job_class, payload, run_at: Time.now)
      Http.post(
        "/internal/jobs",
        json: { job_class:, payload:, run_at: run_at.iso8601 }
      )
    rescue Http::Error => e
      Rails.logger.error("Scheduler enqueue failed: #{e.message}")
      raise
    end
  end
end
```
```python
# sdk/scheduler.py
import httpx


class SchedulerClient:
    """Thin HTTP client for the scheduler's internal job API."""

    def __init__(self, base_url: str, token: str):
        self._client = httpx.Client(
            base_url=base_url,
            headers={"Authorization": f"Bearer {token}"},
        )

    def enqueue(self, job_type: str, payload: dict, run_at=None):
        data = {"job_type": job_type, "payload": payload}
        if run_at:
            data["run_at"] = run_at.isoformat()
        response = self._client.post("/internal/jobs", json=data, timeout=5)
        response.raise_for_status()
        return response.json()
```
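A quick usage sketch for the Python client; the base URL, token, and job type below are placeholders rather than values from the real deployment:

```python
from datetime import datetime, timedelta, timezone

client = SchedulerClient("https://scheduler.internal", token="REDACTED")  # placeholder values
job = client.enqueue(
    "reports.weekly_digest",  # illustrative job type
    {"tenant_id": 42},
    run_at=datetime.now(timezone.utc) + timedelta(minutes=5),
)
print(job)  # response shape depends on the API; inspect it before relying on fields
```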
```html
<!-- docs/snippets/job-status.html -->
<details class="job-status">
  <summary>Scheduler Status (preview)</summary>
  <table class="job-status__table">
    <thead>
      <tr>
        <th>Queue</th>
        <th>Lag</th>
        <th>Failures (24h)</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>critical</td>
        <td>0.5s</td>
        <td>1</td>
      </tr>
      <tr>
        <td>bulk</td>
        <td>4.3s</td>
        <td>0</td>
      </tr>
    </tbody>
  </table>
</details>
```
```yaml
# config/scheduler.yaml
enabled: true
queues:
  - name: critical
    concurrency: 10
    retry_policy:
      max_attempts: 5
      backoff: exponential
  - name: bulk
    concurrency: 4
    retry_policy:
      max_attempts: 3
      backoff: fixed
shadow_mode:
  enabled: true
  sampling_percent: 25
```
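To catch config typos before deploy, a small validation step in CI helps. A minimal sketch; it assumes PyYAML is available and only checks the fields shown above:

```python
import yaml  # PyYAML, assumed available in the tooling environment


def load_scheduler_config(path: str = "config/scheduler.yaml") -> dict:
    """Load the scheduler config and apply a few cheap sanity checks."""
    with open(path) as f:
        config = yaml.safe_load(f)

    percent = config.get("shadow_mode", {}).get("sampling_percent", 0)
    if not 0 <= percent <= 100:
        raise ValueError(f"sampling_percent must be between 0 and 100, got {percent}")

    for queue in config.get("queues", []):
        if queue["retry_policy"]["backoff"] not in {"exponential", "fixed"}:
            raise ValueError(f"unknown backoff policy for queue {queue['name']}")

    return config
```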
Troubleshooting
- Queues not draining: check Kafka lag with `bin/observability kafka-lag --queue scheduler`.
- High CPU: profile with `perf top --pid $(pgrep scheduler)`.
- Job stuck in `retrying`: inspect the `jobs` table for the `retry_after` column and adjust via the admin API (see the sketch after this list).
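For the stuck-job case, something along these lines can serve as a starting point. The admin endpoint, route, and payload here are hypothetical; replace them with the real admin API contract:

```python
import httpx

ADMIN_URL = "https://scheduler-admin.internal"  # hypothetical admin endpoint


def clear_retry_after(job_id: str, token: str) -> dict:
    """Clear retry_after so a stuck job becomes eligible for dispatch again.

    The /internal/jobs/{id} route and payload shape are illustrative only.
    """
    response = httpx.patch(
        f"{ADMIN_URL}/internal/jobs/{job_id}",
        json={"retry_after": None},
        headers={"Authorization": f"Bearer {token}"},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()
```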
Common Pitfalls
- Forgetting to disable the cron runner leads to double dispatching.
- Skipping Grafana dashboard updates hides latency regressions.
- Not bumping library versions in the monorepo causes CI lint failures.
References
Need more examples? Duplicate this post and trim sections you do not need. Every article in this space should ship with clear context, reproducible commands, and verified rollout steps.