
Gazai

Infrastructure Engineer

Company: Gazai
Location: Tokyo (Shibuya) / Taipei · Hybrid
Type: Full-time

Salary: Not listed publicly. We'll share the details once we've confirmed your profile is a genuine fit for the role.

Posted Jul 1, 2026

Focus: Cloud Infrastructure · IaC · CI/CD · Observability · Reliability
Languages: Mandarin (day-to-day); business-level English (technical reading/writing)

About the role

We are looking for an Infrastructure Engineer who is excited about building reliable, scalable systems that support fast-moving AI products. You will own the infrastructure behind Anini and gazai.ai, focusing on reliability, cost efficiency, and developer velocity, while working closely with Tokyo HQ and keeping up with AI product trends.

Responsibilities

  • Own cloud infrastructure and IaC: Terraform, environments, and day-to-day operations across our stack (AWS/GCP, Kubernetes/ArgoCD, databases)
  • Improve developer velocity and delivery: CI/CD optimization (caching, concurrency, reusable workflows), release reliability, and cloud cost efficiency
  • Build and maintain production observability (monitoring, alerting, logs, tracing) and use it to drive performance and scalability work (latency, worker capacity, DB connection limits, load testing)
  • Improve reliability and correctness of production systems: root-cause analysis and pragmatic data verification checks such as quota and audit consistency
  • Own production incident response for infrastructure-related issues: rapid triage, mitigation, and postmortems with clear follow-ups
  • Improve developer experience for local and staging environments (e.g. Docker Compose-based local env, environment parity, safe debug tooling)
  • Maintain internal runbooks and setup documentation so the team can operate confidently across access, dashboards, and common debugging workflows

Qualifications

  • 5+ years of experience in infrastructure, DevOps, SRE, or platform engineering roles
  • Hands-on experience shipping changes safely to production
  • Strong fundamentals in Linux, networking, and cloud systems (AWS and/or GCP)
  • Experience operating containerized workloads (Kubernetes preferred) and continuous delivery tooling (e.g. ArgoCD)
  • Experience with monitoring and observability tools (Grafana, Sentry, logs/traces) and building actionable alerts
  • Comfortable debugging production issues end-to-end — from user symptoms to infrastructure root causes
  • Experience with CI/CD systems (GitHub Actions preferred) and optimizing pipelines for speed and reliability
  • Good engineering hygiene: documentation, runbooks, and operational checklists
  • Mandarin proficiency for day-to-day work; able to read and write technical English

Bonus — you will stand out if…

  • You have experience with data pipelines or operational analytics (e.g. Pub/Sub, BigQuery, scheduled verification scripts) and treat data quality as part of system reliability
  • You have experience supporting AI model serving infrastructure: GPU/CPU workload scheduling, inference pipeline optimization, and reliability of model endpoints under variable, bursty load
  • You are familiar with security and compliance practices: secrets management (e.g. Vault, AWS Secrets Manager), least-privilege access controls, and container/dependency vulnerability scanning
  • You have experience owning cloud cost visibility: spend dashboards, budget alerting, and instance rightsizing
  • You are comfortable with on-call responsibilities and uptime ownership, and have contributed to SLA/SLO frameworks
  • You have experience managing staging and sandbox environments for AI model testing, enabling safe validation of new model versions before production rollout
Tags: infrastructure · devops · sre · kubernetes