Kathleen Siminyu – A Kiswahili Machine Learning Fellow At Mozilla On The Need For Internet Access For All

What does it take to be an AI researcher and machine learning fellow at the Mozilla Foundation? We spoke to Kathleen Siminyu, and this is what she had to say.

Your name, please, and what do you do at Mozilla Foundation?

My name is Kathleen Siminyu and I’m a machine learning fellow at Mozilla Foundation. In terms of profession, I’m an NLP researcher, NLP being Natural Language Processing. The particular project that I’m working on at Mozilla is Common Voice, a data set platform that enables language communities to build language data sets. One of the languages on Common Voice is Kiswahili, which is what I particularly work on.

What does it take to be an AI researcher in terms of education, interests and additional skills?

For my undergrad I studied math and computer science at JKUAT. Towards the end of my degree, around fourth year, I realized that I wanted to venture into a field that would use math as much as it used computer science. Through my research I encountered data science, and in fourth year I did a project in this field. After school, when putting together my portfolio, I included that project, and I started doing online data science courses on edX and Coursera.

I’ve done a lot of learning on the job. My first job was at a company called Africa’s Talking (AT). In the beginning I don’t think my role was that of a data scientist at all, but over time I automated myself out of the low-level roles. I did a lot of data engineering while there because that’s what the business needed. I also realized that there was a need for African language tooling and resources, and that AT was not the place for me to pursue those interests. So I ventured back into academia and found research communities that were building NLP for African languages. A lot of that is what has continued to nurture my learning journey.

You will notice that I didn’t mention going back to school at any point, and that’s because I haven’t. The textbook journey into AI involves an undergrad, then a master’s and then potentially a PhD, because it is a very academia-driven career. But in Africa it looks different because not many institutions are able to offer data science, machine learning or AI courses. There are, however, grassroots communities that are nurturing that talent, and I’m a product of those as opposed to being a product of the academic route.

What is the Kiswahili Common Voice data set and why is it important to Mozilla and its users?

It is a data set for speech recognition, also known as speech to text. This is a task which involves turning audio or speech into a piece of text, and you’ll see it in captioning like that on TV. You will see it on some conferencing platforms as well. I know Zoom has an add-on where, when someone is speaking, you can see the transcript, or even get a transcript of a recorded meeting afterwards. Common Voice data sets enable that for various languages.

A data set for speech recognition is essentially a piece of text accompanied by an audio recording of what is in the text. That is the data that you would feed into your machine learning algorithm for it to start learning how to transcribe speech into Kiswahili text. It is then able to map words to their respective sounds, broken down into smaller parts of speech.
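To make the "text accompanied by audio" idea concrete, here is a minimal Python sketch of how such pairs might be loaded before training. It is an illustration, not Mozilla's own tooling: the tab-separated layout, the column names `path` and `sentence`, the clip file names and the two sample sentences are all assumptions made up for this example.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# Made-up rows for illustration only. The column names `path` and
# `sentence` are assumptions about how clips and transcripts are listed.
SAMPLE_TSV = (
    "path\tsentence\n"
    "clip_0001.mp3\tHabari ya asubuhi.\n"
    "clip_0002.mp3\tKaribu nyumbani.\n"
)


@dataclass
class Example:
    """One training example: an audio clip and the text spoken in it."""
    audio_path: str
    transcript: str


def load_examples(tsv_text: str) -> List[Example]:
    """Parse tab-separated rows into (audio, transcript) pairs."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [Example(row["path"], row["sentence"].strip()) for row in reader]


examples = load_examples(SAMPLE_TSV)
for ex in examples:
    print(ex.audio_path, "->", ex.transcript)
```

Each `Example` is one unit the learning algorithm would see: the audio is the input and the transcript is the target it learns to produce.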

This technology is not novel. It exists in so many devices and there are so many use cases. The data set itself begins with us collecting texts, which are then broken up at sentence level and served to people. We crowdsource the audio aspect as well, which means that if you go to our platform and sign up as a contributor for Kiswahili, you will start receiving text sentence by sentence and you can record yourself saying those sentences out loud.

Mozilla is doing this for the public good, and this is one of my favourite things about this organisation. A lot of big tech companies build proprietary software for their own benefit, to drive profit. At Mozilla Foundation, which is a non-profit, we want this to be a public resource that anyone can use to build applications for their context. We would count it a success if developers were aware that these data sets exist and started to play around with them and, once the models become available, started using the APIs that are made available to build end-user products.

It would be great to one day have Kiswahili transcription on Zoom or on Google Meet, and that could be built if people knew about these resources and went the extra mile to actually use them to solve problems.

Why is language diversification important to Mozilla?

It is an internet health issue, because why do I have to get online and put on my Western persona to get resources? It goes back to the fact that there are so many digital tools and so much information available in English on the internet, but would the access be the same if you only spoke Kiswahili? If you go on Wikipedia and compare the number of articles available in English vs Kiswahili, there is a huge difference, and yet these days we celebrate that Kiswahili has 200 million speakers and we have a day to celebrate it.

Kiswahili is growing. But now think of your mother tongue. It is an unfair reality that someone who is only conversant in their mother tongue is literally shut out of all these digital assets which are freely available. We like to say that a lot of stuff on the internet is free, but it is only free to a certain extent because there are barriers to access. That’s one of the things that we would like to see change.
