Also uses neural networks: an encoder reads the source text and a decoder generates the target text (both one word at a time)
Amazon Comprehend performs automated language detection using neural networks. It recognizes key phrases, words, language, sentiment, and syntax. It uses deep learning, offers both synchronous and asynchronous processing, integrates with other AWS services, and supports customization and clustering (https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html)
Polly converts text to "life-like" speech. Multiple voice options, low latency, pay only for the text you convert to speech, logging available. Only available in 3 regions, and subject to throttle limits (https://docs.aws.amazon.com/polly/latest/dg/what-is.html)
Pricing model is simple ("pay for what you need"), but costs add up when multiple services are combined
Currently, none of these platforms supports bilingual video conferencing
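To make the Comprehend and Polly notes above more concrete, here is a minimal Python sketch of how language detection and speech synthesis might be called. It assumes boto3-style clients are passed in. The helper names (detect_language, synthesize), the default voice "Joanna", and the MP3 output format are illustrative assumptions, not something these notes specify.

```python
def detect_language(comprehend_client, text):
    """Return (language_code, score) for the most likely language of `text`.

    Assumes `comprehend_client` behaves like boto3.client("comprehend").
    """
    response = comprehend_client.detect_dominant_language(Text=text)
    languages = response["Languages"]
    if not languages:
        raise ValueError("no language detected")
    # Comprehend returns candidate languages with confidence scores;
    # keep the highest-scoring one.
    best = max(languages, key=lambda lang: lang["Score"])
    return best["LanguageCode"], best["Score"]


def synthesize(polly_client, text, voice_id="Joanna"):
    """Return MP3 audio bytes for `text`.

    Assumes `polly_client` behaves like boto3.client("polly").
    The voice and output format are illustrative defaults.
    """
    response = polly_client.synthesize_speech(
        Text=text, OutputFormat="mp3", VoiceId=voice_id
    )
    # AudioStream is a file-like object; read it fully into memory.
    return response["AudioStream"].read()
```

With real credentials, the clients would typically come from boto3.client("comprehend") and boto3.client("polly"). Passing them in as arguments keeps the helpers easy to try out with stand-in objects.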
General Notes
The basic flow of real-time bilingual video conferencing: input conditioning, language identification, automatic speech recognition (speech to text), text "cleanup", natural language processing, text to speech
Basic system for a bilingual video conferencing application: Frontend display -> video API/Lambda -> speech to text (ASR) -> translation API (neural network) -> text to speech -> output to service -> output to user
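The flow above can be sketched as a chain of pluggable stages. This is a rough Python illustration only: the Pipeline class, its field names, and the stub stages in demo_pipeline are all hypothetical, and input conditioning and the frontend/video API layers are left out. In a real system each stage would be backed by a speech recognition, translation, or text-to-speech service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Hypothetical speech-to-speech translation chain; each stage is injectable."""
    identify_language: Callable[[bytes], str]
    speech_to_text: Callable[[bytes, str], str]
    clean_text: Callable[[str], str]
    translate: Callable[[str, str, str], str]
    text_to_speech: Callable[[str, str], bytes]

    def run(self, audio: bytes, target_language: str) -> bytes:
        source_language = self.identify_language(audio)
        transcript = self.speech_to_text(audio, source_language)
        cleaned = self.clean_text(transcript)
        # Skip translation when speaker and listener share a language.
        if source_language == target_language:
            translated = cleaned
        else:
            translated = self.translate(cleaned, source_language, target_language)
        return self.text_to_speech(translated, target_language)


def demo_pipeline() -> Pipeline:
    """Build a pipeline from stub stages (stand-ins, not real services)."""
    phrases = {("hello world", "en", "es"): "hola mundo"}
    return Pipeline(
        identify_language=lambda audio: "en",
        speech_to_text=lambda audio, lang: audio.decode("utf-8"),
        clean_text=lambda text: " ".join(text.lower().split()),
        translate=lambda text, src, dst: phrases.get((text, src, dst), text),
        text_to_speech=lambda text, lang: text.encode("utf-8"),
    )
```

Keeping each stage behind a plain callable means a provider can be swapped per stage without touching the rest of the chain, which matters given that costs accumulate across services.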