Infrastructure Engineer at Gazai — Hello World Japan

Focus: Cloud Infrastructure · IaC · CI/CD · Observability · Reliability
Languages: Japanese required (business level); business-level English or Mandarin also required

About the role

We are looking for an Infrastructure Engineer who is excited about building reliable, scalable systems that support fast-moving AI products. You will own the infrastructure behind Anini and gazai.ai, with a focus on reliability, cost efficiency, and developer velocity — working closely with Tokyo HQ and staying at the forefront of AI product trends.

Responsibilities

Own cloud infrastructure and IaC: Terraform, environments, and day-to-day operations across our stack (AWS/GCP, Kubernetes/ArgoCD, databases)
Improve developer velocity and delivery: CI/CD optimization (caching, concurrency, reusable workflows), release reliability, and cloud cost efficiency
Build and maintain production observability (monitoring, alerting, logs, tracing) and use it to drive performance and scalability work (latency, worker capacity, DB connection limits, load testing)
Improve reliability and correctness of production systems through root-cause analysis and pragmatic data verification checks such as quota and audit consistency
Own production incident response for infrastructure-related issues: rapid triage, mitigation, and postmortems with clear follow-ups
Improve developer experience for local and staging environments (e.g. Docker Compose-based local env, environment parity, safe debug tooling)
Maintain internal runbooks and setup documentation so the team can operate confidently across access, dashboards, and common debugging workflows

Qualifications

5+ years of experience in infrastructure, DevOps, SRE, or platform engineering roles
Hands-on experience shipping changes safely to production
Strong fundamentals in Linux, networking, and cloud systems (AWS and/or GCP)
Experience operating containerized workloads (Kubernetes preferred) and continuous delivery tooling (e.g. ArgoCD)
Experience with monitoring and observability tools (Grafana, Sentry, logs/traces) and building actionable alerts
Comfortable debugging production issues end-to-end — from user symptoms to infrastructure root causes
Experience with CI/CD systems (GitHub Actions preferred) and optimizing pipelines for speed and reliability
Good engineering hygiene: documentation, runbooks, and operational checklists
Business-level Japanese; business-level English or Mandarin also required

Bonus — you will stand out if…

You have experience with data pipelines or operational analytics (e.g. Pub/Sub, BigQuery, scheduled verification scripts) and treat data quality as part of system reliability
You have experience supporting AI model serving infrastructure: GPU/CPU workload scheduling, inference pipeline optimization, and reliability of model endpoints under variable, bursty load
You are familiar with security and compliance practices: secrets management (e.g. Vault, AWS Secrets Manager), least-privilege access controls, and container/dependency vulnerability scanning
You have experience owning cloud cost visibility: spend dashboards, budget alerting, and instance rightsizing
You are comfortable with on-call responsibilities and uptime ownership, and have contributed to SLA/SLO frameworks
You have experience managing staging and sandbox environments for AI model testing, enabling safe validation of new model versions before production rollout
