About the Job
Join Impala AI, an innovative startup building a fully managed LLM-inference platform that enables data-heavy enterprises to run any AI task at any scale, without limits.
We're looking for an experienced Performance Researcher to join our founding team. You'll be responsible for building and optimizing scalable cloud infrastructure tailored to AI workloads. This role offers a unique opportunity to directly shape our infrastructure strategy, improve system reliability and performance, and help establish Impala AI as a leader in adaptive AI compute management.
Join us to tackle the magic that makes AI tick under the hood and help build the backbone powering the AI revolution.
What You’ll Do
- Design and build high-performance distributed inference pipelines for LLMs, focused on large-batch, non-real-time scenarios.
- Optimize GPU memory usage, kernel execution, and communication across nodes (NCCL, MPI, etc.).
- Own CUDA kernels, compiler-level tricks, and multi-GPU scheduling logic.
- Lead profiling and performance tuning for throughput and cost, down to the kernel level.
- Collaborate with infra, product, and research teams to define SLAs, resource allocation logic, and runtime behaviors.
- Help build the core infrastructure that will run LLM workloads across hybrid GPU environments (cloud/on-prem/self-hosted).
What You’ll Bring
- Deep experience with CUDA programming, GPU architecture, and low-level performance engineering.
- Fluency in Python and C++, and mastery of profiling tools such as Nsight, nvprof, and perf.
- Experience building systems for large-scale distributed training or inference (PyTorch, DeepSpeed, Ray, Horovod, etc.).
- Hands-on familiarity with cluster and container orchestration tools (Kubernetes, Slurm, Docker).
- Self-motivated and able to operate independently in a fast-moving startup environment.