Job Description
Hello,
We are launching a language technology project for Chimini, a low-resource Bantu language, and are seeking an ML/NLP engineer to help us design and implement the foundational phase of the project.
Long-Term Vision
Our long-term goal is to build:
A structured Chimini text + audio corpus
A scalable API layer for integration into our own applications
Eventually, speech-to-text and text-to-speech capability in Chimini
Chimini is historically related to Swahili, but we do not yet know how structurally similar the two languages are. Pronunciation may differ significantly, which could limit how well speech models transfer from Swahili.
We currently have:
Written texts
Audio recordings
Access to native speakers for transcription and validation
Phase 1 (3–6 Months)
The objective of Phase 1 is to build a strong ML-ready foundation, including:
Designing a scalable database structure for text and audio
Preparing and structuring data for NLP workflows
Building a clean corpus pipeline (segmentation, transcription storage, metadata)
Advising on whether a Chimini–Swahili linguistic comparison should be conducted before we rely on transfer learning
Evaluating potential approaches:
Fine-tuning multilingual models
Embedding-based retrieval systems
LLM + RAG architectures
Longer-term speech model strategy
We want the system designed from the beginning to support future ML training and experimentation.
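To make the database and corpus pipeline items above more concrete, here is a minimal sketch of what a starting schema could look like. It is only an illustration, not a required design: the table names, column names, and helper functions are all hypothetical, and SQLite is used purely so the example runs without any setup. Candidates are welcome to propose a different structure.

```python
import sqlite3

# Hypothetical schema: one row per audio recording, one row per segment.
# A segment holds a piece of text and may optionally point at a time span
# inside a recording, so text-only and audio-aligned data share one table.
SCHEMA = """
CREATE TABLE recording (
    id          INTEGER PRIMARY KEY,
    audio_path  TEXT NOT NULL,
    speaker_id  TEXT,
    duration_s  REAL
);
CREATE TABLE segment (
    id            INTEGER PRIMARY KEY,
    recording_id  INTEGER REFERENCES recording(id),  -- NULL for text-only segments
    start_s       REAL,
    end_s         REAL,
    text          TEXT NOT NULL,
    source        TEXT,                               -- e.g. book, interview
    validated     INTEGER NOT NULL DEFAULT 0          -- set to 1 after native-speaker review
);
"""


def build_corpus(conn):
    """Create the (assumed) corpus tables on an open connection."""
    conn.executescript(SCHEMA)


def add_recording(conn, audio_path, speaker_id=None, duration_s=None):
    cur = conn.execute(
        "INSERT INTO recording (audio_path, speaker_id, duration_s) VALUES (?, ?, ?)",
        (audio_path, speaker_id, duration_s),
    )
    return cur.lastrowid


def add_segment(conn, text, recording_id=None, start_s=None, end_s=None, source=None):
    cur = conn.execute(
        "INSERT INTO segment (recording_id, start_s, end_s, text, source) "
        "VALUES (?, ?, ?, ?, ?)",
        (recording_id, start_s, end_s, text, source),
    )
    return cur.lastrowid


def mark_validated(conn, segment_id):
    """Record that a native speaker has reviewed the transcription."""
    conn.execute("UPDATE segment SET validated = 1 WHERE id = ?", (segment_id,))


def training_pairs(conn):
    """Return (audio_path, start_s, end_s, text) for validated, audio-aligned segments."""
    return conn.execute(
        "SELECT r.audio_path, s.start_s, s.end_s, s.text "
        "FROM segment s JOIN recording r ON r.id = s.recording_id "
        "WHERE s.validated = 1 ORDER BY s.id"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_corpus(conn)
    rec = add_recording(conn, "audio/rec_0001.wav", speaker_id="spk01", duration_s=12.5)
    seg = add_segment(conn, "placeholder transcript", rec, 0.0, 3.2, source="interview")
    add_segment(conn, "placeholder text-only sentence", source="book")
    mark_validated(conn, seg)
    print(training_pairs(conn))
```

Keeping a validation flag on each segment means future speech and text training sets can be built with a single query over reviewed data, while unreviewed material stays in the same store.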
Responsibilities
Define ML/NLP strategy for a low-resource language
Recommend architecture for scalable corpus and training workflows
Implement foundational data pipelines
Advise on transfer learning feasibility from Swahili or multilingual models
Provide a phased roadmap (short-term vs long-term)
Ideal Experience
NLP for low-resource languages or multilingual settings
Speech systems (ASR/TTS)
Fine-tuning transformer models
Embeddings and vector databases
Designing ML pipelines for scalable experimentation
We will handle data collection, transcription, and language validation.
Please include:
Relevant ML/NLP experience
Proposed high-level technical approach
Estimated timeline for Phase 1
Availability
We are looking for someone who can help architect this correctly from the start, with long-term ML scalability in mind.
Best regards,