Infrastructure Engineer
Salary: We do not publish the salary. We will share it with you individually once we have confirmed that your profile is a genuine match for this role.
Posted: July 1, 2026
Focus: Cloud Infrastructure · IaC · CI/CD · Observability · Reliability
Languages: Mandarin (day-to-day); business-level English (technical reading/writing)
About the role
We are looking for an Infrastructure Engineer who is excited about building reliable, scalable systems that support fast-moving AI products. You will own the infrastructure behind Anini and gazai.ai, with a focus on reliability, cost efficiency, and developer velocity — working closely with Tokyo HQ and staying at the forefront of AI product trends.
Responsibilities
- Own cloud infrastructure and IaC: Terraform, environments, and day-to-day operations across our stack (AWS/GCP, Kubernetes/ArgoCD, databases)
- Improve developer velocity and delivery: CI/CD optimization (caching, concurrency, reusable workflows), release reliability, and cloud cost efficiency
- Build and maintain production observability (monitoring, alerting, logs, tracing) and use it to drive performance and scalability work (latency, worker capacity, DB connection limits, load testing)
- Improve reliability and correctness of production systems through root-cause analysis and pragmatic data verification checks, such as quota and audit consistency
- Own production incident response for infrastructure-related issues: rapid triage, mitigation, and postmortems with clear follow-ups
- Improve developer experience for local and staging environments (e.g. Docker Compose-based local env, environment parity, safe debug tooling)
- Maintain internal runbooks and setup documentation so the team can operate confidently across access, dashboards, and common debugging workflows
Qualifications
- 5+ years of experience in infrastructure, DevOps, SRE, or platform engineering roles
- Hands-on experience shipping changes safely to production
- Strong fundamentals in Linux, networking, and cloud systems (AWS and/or GCP)
- Experience operating containerized workloads (Kubernetes preferred) and continuous delivery tooling (e.g. ArgoCD)
- Experience with monitoring and observability tools (Grafana, Sentry, logs/traces) and building actionable alerts
- Comfortable debugging production issues end-to-end — from user symptoms to infrastructure root causes
- Experience with CI/CD systems (GitHub Actions preferred) and optimizing pipelines for speed and reliability
- Good engineering hygiene: documentation, runbooks, and operational checklists
- Mandarin proficiency for day-to-day work; able to read and write technical English
Bonus — you will stand out if…
- You have experience with data pipelines or operational analytics (e.g. Pub/Sub, BigQuery, scheduled verification scripts) and treat data quality as part of system reliability
- You have experience supporting AI model serving infrastructure: GPU/CPU workload scheduling, inference pipeline optimization, and reliability of model endpoints under variable, bursty load
- You have familiarity with security and compliance practices: secrets management (e.g. Vault, AWS Secrets Manager), least-privilege access controls, and container/dependency vulnerability scanning
- You have experience owning cloud cost visibility: spend dashboards, budget alerting, and instance rightsizing
- You are comfortable with on-call responsibilities and uptime ownership, and have contributed to SLA/SLO frameworks
- You have experience managing staging and sandbox environments for AI model testing, enabling safe validation of new model versions before production rollout