Designing Multimodal Generative AI Applications: Input Strategies and Output Formats
Susannah Greenwood

I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.

9 Comments

  1. Mbuyiselwa Cindi
    March 14, 2026 at 20:40

    Just tried GPT-4o with a voice note plus a screenshot of my car’s dashboard light. It didn’t just say ‘check engine’; it told me the likely culprit was the O2 sensor, linked it to recent fuel quality issues in my area, and even gave me a step-by-step video guide to reset it. Mind blown. This isn’t future tech; it’s already saving people hours of guesswork.
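
    For the curious: a minimal sketch of how a request like this can be wired together, assuming the OpenAI Python SDK. The voice note is transcribed with Whisper first, then the transcript and the photo travel to GPT-4o in one message; the file paths are illustrative.

    ```python
    # Sketch: voice note + dashboard photo -> one multimodal question.
    # Assumes `pip install openai` and OPENAI_API_KEY in the environment.
    import base64

    from openai import OpenAI

    client = OpenAI()

    # 1. Transcribe the voice note (audio modality -> text).
    with open("voice_note.m4a", "rb") as audio:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio
        )

    # 2. Encode the dashboard photo as a data URL (image modality).
    with open("dashboard.jpg", "rb") as img:
        photo_b64 = base64.b64encode(img.read()).decode()

    # 3. Send both modalities in a single user message.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": transcript.text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{photo_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)
    ```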

  2. Krzysztof Lasocki
    March 14, 2026 at 21:40

    So let me get this straight: you’re telling me AI can now understand sarcasm in a voice note and still not know why my cat is staring at the wall? 🤔 I’m not mad… I’m just disappointed. Also, why does every ‘multimodal’ app feel like it’s trying to sell me a Tesla?

  3. Henry Kelley
    March 15, 2026 at 16:05

    Yo, I tried the Gemini thing with a photo of my receipt and a voice note saying ‘what’s the tax rate here?’ and it actually got it right. Like… I didn’t even spell tax right. It still knew. Also, I think it sensed I was tired. Gave me a chill summary instead of a lecture. Feels like having a smart friend who doesn’t judge.

  4. Victoria Kingsbury
    March 16, 2026 at 07:34

    Let’s be real: most multimodal apps still treat audio like an afterthought. ‘Oh, here’s a 300-word text dump and a 10-second audio clip that’s just the first sentence repeated.’ No. We need parity. If you’re ingesting 4K video, your output should be spatial audio plus visual annotations, not a static image with a robotic voiceover. Also, latency is still killing UX. Edge processing isn’t a buzzword; it’s a requirement.
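
    Her parity point translates directly into an output contract. A hypothetical sketch, with every name below illustrative rather than any product’s real API: make each modality a first-class field and check parity instead of assuming it.

    ```python
    # Hypothetical "modality parity" envelope: every output channel is a
    # first-class field, and parity is checked, not assumed.
    from dataclasses import dataclass, field


    @dataclass
    class VisualAnnotation:
        label: str                       # e.g. "warning light cluster"
        bbox: tuple[int, int, int, int]  # x, y, width, height in pixels


    @dataclass
    class MultimodalResponse:
        text: str              # the full written answer
        audio_transcript: str  # what the generated audio actually says
        annotations: list[VisualAnnotation] = field(default_factory=list)

        def has_parity(self) -> bool:
            # Crude heuristic: the spoken track should cover roughly as
            # much ground as the text, not just echo its opening line.
            return len(self.audio_transcript) >= 0.8 * len(self.text)
    ```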

  5. Tonya Trottman
    March 16, 2026 at 10:53

    ‘Multimodal AI understands gestures’? Lol. It thinks a thumbs-up is ‘positive sentiment’ and a head tilt is ‘confusion.’ What about the Nigerian ‘tongue click’? Or the Japanese ‘bow + slight glance’? No. It’s just pattern-matching on Western-centric data. And you call this ‘intuitive’? You’re training bots on TikTok clips and calling it cross-modal reasoning. Pathetic.

  6. Rocky Wyatt
    March 16, 2026 at 22:49

    You think this is groundbreaking? I’ve been using voice-to-text for years. This just adds more layers of over-engineered nonsense. Your ‘intuitive’ app still asks me to ‘confirm intent’ like I’m a toddler. Meanwhile, my grandma just wants to know why her bill doubled. Give her a simple answer. Not a 5-step multimodal journey.

  7. Santhosh Santhosh
    March 18, 2026 at 09:14

    I work in a small village clinic in Odisha, and we just started using a multimodal system where patients record symptoms via video and upload photos of rashes. The AI cross-references with local climate data, common pathogens, and even dietary patterns from our database. Last week, it flagged a dengue case before the patient even had a fever. We didn’t have a doctor for three days. The AI didn’t replace us; it gave us time. I’m not tech-savvy. But this? This feels like hope.
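
    The cross-referencing described here can start as simply as a weighted rule table over per-modality findings. A hypothetical sketch; every input name, weight, and threshold is illustrative, not clinical guidance.

    ```python
    # Hypothetical triage scorer: fuse per-modality findings with local
    # context. Weights and thresholds are illustrative, not medical advice.

    def dengue_risk_score(
        rash_detected: bool,        # from the rash photo classifier
        joint_pain_reported: bool,  # from the symptom video transcript
        monsoon_season: bool,       # from local climate data
        recent_local_cases: int,    # from the clinic's own records
    ) -> float:
        score = 0.0
        if rash_detected:
            score += 0.35
        if joint_pain_reported:
            score += 0.25
        if monsoon_season:
            score += 0.15
        score += 0.05 * min(recent_local_cases, 5)  # cap the outbreak signal
        return score


    # Anything over an agreed threshold goes to a human for review, which
    # is how a case can surface before the fever does.
    if dengue_risk_score(True, True, True, recent_local_cases=3) > 0.6:
        print("Flag for clinician review")
    ```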

  8. Veera Mavalwala
    March 19, 2026 at 06:29

    Oh honey, you’re talking about multimodal AI like it’s a magic wand. You think your fancy GPT-4o is gonna understand that my aunt in Kerala says ‘it’s hot’ but means ‘I’m having chest pain’? Nah. It sees ‘heat’ and ‘red skin’ and says ‘sunburn.’ Meanwhile, she’s having a silent heart attack. No one told you: context isn’t data; it’s culture. And your models? They’re still stuck in Silicon Valley’s echo chamber, sipping oat milk lattes while the world burns.

  9. Mbuyiselwa Cindi
    March 20, 2026 at 23:09

    Santhosh, you just nailed it. We built a version of this for rural health workers in South Africa. We trained it on local dialects, common symptoms, and even traditional remedies. The AI doesn’t ‘correct’ them; it augments them. A nurse says ‘the baby’s eyes are yellow and he won’t feed’, and the system pulls up jaundice patterns, suggests a bilirubin test, and plays a 15-second audio cue in Zulu: ‘Go to clinic now.’ No app. No website. Just voice. It’s not about fancy models. It’s about listening.
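
    The loop described here fits in a few lines. A hypothetical sketch: the two stubs stand in for whatever speech-to-text and audio playback a deployment actually uses, and every phrase and file path is illustrative.

    ```python
    # Hypothetical voice-only loop: listen, match symptom phrases, play a
    # pre-recorded cue back. No app, no screen; just audio in and out.

    PATTERNS = {
        "suspected_jaundice": {"yellow eyes", "won't feed"},
    }

    CUES = {
        # Short clips recorded and reviewed by local health workers.
        "suspected_jaundice": "cues/zu/go_to_clinic_now.ogg",
    }


    def transcribe(audio_path: str, language: str) -> str:
        """Stub: swap in any speech-to-text engine with isiZulu support."""
        raise NotImplementedError


    def play_cue(clip_path: str) -> None:
        """Stub: swap in the device's audio playback."""
        raise NotImplementedError


    def handle_voice_report(audio_path: str) -> None:
        text = transcribe(audio_path, language="zu").lower()
        for finding, phrases in PATTERNS.items():
            if any(phrase in text for phrase in phrases):
                play_cue(CUES[finding])  # augment the worker, never block
    ```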
