Another reason not to allow multiple languages:
It makes using audio to detect duplicate submissions much harder. An attacker could make submissions in 2 different languages, and both humans and AI detection tools would have a much harder time figuring out that it's the same person speaking (text-dependent speaker recognition, where the speaker says the same phrase, is much easier than text-independent recognition).
Some people may say that audio detection tools are not used yet. But when attacks come, they will need to be used. And if we don't have comparable audio data from the people already registered, we won't be able to use them.
Also note that neither humans nor AI systems need to limit themselves to image, video, or audio alone to recognize whether a submission is a duplicate. They can use all of that information simultaneously and then make a holistic decision (which is generally superior to relying on only one type of cue).
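
To illustrate what combining cues could look like, here is a minimal sketch of score-level fusion. Everything in it is an assumption made for illustration: the function name `fuse_scores`, the similarity scores, the equal weights, and the threshold are all hypothetical, and no existing detection tool is implied to work this way.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-modality similarity scores, each in [0, 1]."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight


# Hypothetical similarity between a new submission and a registered one.
scores = {"image": 0.55, "video": 0.60, "audio": 0.90}
# Hypothetical weights; equal here, but they could reflect how reliable each cue is.
weights = {"image": 1.0, "video": 1.0, "audio": 1.0}
# Hypothetical decision threshold.
THRESHOLD = 0.65

fused = fuse_scores(scores, weights)
is_duplicate = fused >= THRESHOLD

print(f"fused score: {fused:.2f}, flagged as duplicate: {is_duplicate}")
```

With these made-up numbers, the image score alone (0.55) would stay under the threshold, but the strong audio match pulls the fused score over it. If the two submissions were in different languages, the audio score would likely be less reliable and the combined decision would lose that signal.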