Job Description
Tenstorrent is leading the industry in cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. The company is seeking a TT-Distributed Software Engineer to develop and optimize distributed software systems for AI and HPC clusters, with a focus on distributed programming and scalable architectures.
Responsibilities
- Architect, implement, and optimize distributed software systems that coordinate computation and communication across clusters of AI accelerators and CPUs
- Design and build distributed APIs enabling data-parallel and tensor-parallel AI workloads
- Leverage MPI-based technologies and related frameworks to scale programming models across multiple hosts and compute nodes
- Develop robust systems using IPC, inter-node sockets, and distributed communication primitives to ensure reliability and high performance
- Build and maintain testing, debugging, profiling, and monitoring tools for large-scale distributed workloads, and collaborate with model and systems teams on cluster bring-up
Skills
- Strong C or C++ engineer with solid foundations in systems programming, operating systems, and distributed systems principles
- Enthusiastic about distributed computing, including IPC, socket programming, and cluster resource coordination
- Comfortable reasoning about scalability, fault tolerance, and performance across multi-node environments
- Curious, first-principles thinker who challenges conventional approaches to distributed system design
- Motivated to grow into a deep technical expert in large-scale distributed AI infrastructure
Benefits
- Highly competitive compensation package and benefits