Computer science professor is using AI to help endangered languages be heard

There are more than 6,000 languages spoken in the world, and almost half of them are endangered. George Mason University researcher Antonios Anastasopoulos is working to keep those endangered languages alive and has built a Natural Language Processing (NLP) group at the university devoted to this work. 

The recent recipient of a $599,956 CAREER Award from the National Science Foundation and a $300,000 Small Business Innovation Research (SBIR) Award from Barron Associates, Anastasopoulos is building automatic translation tools for under-served populations, including speakers of certain Indo-Pacific languages who don’t have access to language technologies.

He is also a collaborating senior researcher at Greece’s Archimedes AI Research Center, a hub connecting the global AI and data science research community.  

Anastasopoulos began his work with languages when he visited local communities to record endangered dialects as an undergraduate student in his native Greece. “What makes a language endangered is when it stops getting passed down across the generations,” said Anastasopoulos, an assistant professor in George Mason’s Department of Computer Science. 

Antonios Anastasopoulos (left), Indigenous Huilliche activist Marite Perez, and CU Boulder Professor Alexis Palmer in Chile. Photo provided.

“I created databases for a small Greek dialect that’s spoken in South Italy called Griko. I built that small tool for this one community, but there are thousands of similar communities all over the world that speak languages that are completely neglected. They have no institutional support, so my work feels very meaningful,” said Anastasopoulos.   

Every language lies on a continuum, and no matter how small the language is, you will find variations, explained Anastasopoulos. If a system has only seen data from one variety of a language, it will perform worse on varieties it has never seen.

“I just kept building on this concept, which is how I ended up developing this whole research program here at George Mason. The real motivation behind my work is that it's just something that simply needs to get done,” he said.  

Endangered languages are very common in places such as Latin America, explained Anastasopoulos, where the government and the socially powerful classes often speak Spanish or Portuguese, to the exclusion of hundreds of Indigenous language communities. These Indigenous peoples still have to participate in society, which pushes them toward the dominant language, so their own languages eventually become moribund.

“For example, I had some folks from an Indigenous community in Chile contact me and inquire about my work. The Mapuche had, historically, resisted Spanish conquest, and they reached out asking for guidance in building AI tools to help with the instruction and preservation of their language, Mapuzugun,” said Anastasopoulos.  

Fahim Faisal, a PhD candidate in the Department of Computer Science who works with Anastasopoulos as part of George Mason’s NLP group, has experienced firsthand how modern technology can fall short for speakers of regional dialects.

Fahim Faisal. Photo provided.

“I'm from Bangladesh and my primary language is Bengali, so when I try to interact with everyday technology, for example, [Amazon’s] Alexa, it can't always understand what I’m saying because of my dialectal variety and my accent differences,” said Faisal.  

“There are lots of variations when people speak in terms of accent and dialect, so we’re trying to implement that cultural variety into language modeling, so it’s still accessible when you go from one part of the world to another,” he said.   

Earlier in 2024, Faisal’s paper “DialectBench: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages” received one of the Best Social Impact Paper Awards at the annual meeting of the Association for Computational Linguistics (ACL), the premier NLP conference, for showing that large language models (LLMs) cannot handle dialects as well as standard language varieties.

Computer science PhD candidate Milind Agarwal applied to George Mason specifically to work alongside Anastasopoulos.  

Agarwal works on documented digitization of archival data, as well as large-scale data extraction and language identification.

“I’m trying to assist new artificial intelligence technologies to learn better so that it’s accessible to the smaller language communities and to ensure that there aren't huge swaths of people who are completely left out of this tech revolution essentially,” said Agarwal, who has published six papers on the topic.

One of Agarwal’s papers, “Script-Agnostic Language Identification,” won the 2024 Best Student Impact Paper Award at a regional conference.

“We're working with language community members directly, because they have a stake in revitalization and keeping their languages alive,” Agarwal said. “We continuously share our results with the community members to get feedback and make sure that the work we're doing is in line with the people who will use it rather than being research that's disconnected from the ground. That has been invaluable because it has grounded our work and made sure that it actually has an impact.”