It would be ideal to have two types of media in a prompt. For example, the text could say "Talk about what you see in the image," accompanied by the corresponding image and an audio recording that says the same thing, for our students who cannot read. The question is meant to assess their speaking ability, and hearing the audio would allow them to fully understand what they are expected to do.
This is already an option in Extempore. Please see the bottom section of this help article for instructions on including Text, Audio, and an Image in a single question.
https://help.extemporeapp.com/en/articles/4808937-question-media