Software Engineer, TT-Distributed

🌍 Remote, USA 🎯 Full-time 🕐 Posted Recently

Job Description

Tenstorrent is leading the industry in cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. The company is seeking a TT-Distributed Software Engineer to develop and optimize distributed software systems for AI and HPC clusters, with a focus on distributed programming and scalable architectures.


Responsibilities

  • Architect, implement, and optimize distributed software systems that coordinate computation and communication across clusters of AI accelerators and CPUs
  • Design and build distributed APIs enabling data-parallel and tensor-parallel AI workloads
  • Leverage MPI-based technologies and related frameworks to scale programming models across multiple hosts and compute nodes
  • Develop robust systems using IPC, inter-node sockets, and distributed communication primitives to ensure reliability and high performance
  • Build and maintain testing, debugging, profiling, and monitoring tools for large-scale distributed workloads and collaborate with model and systems teams on cluster bring-up

Skills

  • Strong C or C++ engineer with solid foundations in systems programming, operating systems, and distributed systems principles
  • Enthusiastic about distributed computing, including IPC, socket programming, and cluster resource coordination
  • Comfortable reasoning about scalability, fault tolerance, and performance across multi-node environments
  • Curious and first-principles thinker who challenges conventional approaches to distributed system design
  • Motivated to grow into a deep technical expert in large-scale distributed AI infrastructure

Benefits

  • Highly competitive compensation package and benefits

Company Overview

  • Tenstorrent develops AI hardware and software solutions for data processing and machine learning applications. It was founded in 2016 and is headquartered in Toronto, Ontario, Canada, with a workforce of 501-1000 employees. Its website is http://tenstorrent.com.
