
      Meta’s “massively multilingual” AI model translates up to 100 languages, speech or text

      news.movim.eu / ArsTechnica · Tuesday, 22 August, 2023 - 19:57 · 1 minute

    An illustration of a person holding up a megaphone to a head silhouette. (credit: Getty Images)

    On Tuesday, Meta announced SeamlessM4T, a multimodal AI model for speech and text translations. As a neural network that can process both text and audio, it can perform text-to-speech, speech-to-text, speech-to-speech, and text-to-text translations for "up to 100 languages," according to Meta. Its goal is to help people who speak different languages communicate with each other more effectively.

    Continuing its relatively open approach to AI, Meta is releasing SeamlessM4T under a research license (CC BY-NC 4.0) that allows developers to build on the work. The company is also releasing SeamlessAlign, which it calls "the biggest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments." That dataset will likely kick-start the training of future translation AI models from other researchers.

    Among the features of SeamlessM4T touted on Meta's promotional blog, the company says the model can perform speech recognition (give it audio of speech and it converts it to text), speech-to-text translation (it translates spoken audio into text in a different language), speech-to-speech translation (feed it speech audio and it outputs translated speech audio), text-to-text translation (similar to how Google Translate works), and text-to-speech translation (feed it text and it will translate and speak it in another language). Each of the text translation functions supports nearly 100 languages, while the speech output functions support about 36 output languages.
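    As a rough illustration of how this multi-task interface can be driven from code, here is a minimal sketch using the Hugging Face transformers integration of SeamlessM4T. The checkpoint name and API calls are assumptions based on that library's documentation rather than anything in Meta's announcement; Meta's own seamless_communication package exposes a different interface.

```python
# Minimal sketch (not from Meta's announcement): text-to-text and
# text-to-speech translation with SeamlessM4T via Hugging Face transformers.
# The "facebook/hf-seamless-m4t-medium" checkpoint name and the generate()
# arguments are assumptions taken from that library's documentation.
from scipy.io import wavfile
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

text_inputs = processor(text="Where is the train station?",
                        src_lang="eng", return_tensors="pt")

# Text-to-text translation (English -> French): ask only for tokens, no audio.
tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))

# Text-to-speech translation (English text -> Spanish speech): the model
# returns a waveform that can be written straight to a WAV file.
audio = model.generate(**text_inputs, tgt_lang="spa")[0].cpu().numpy().squeeze()
wavfile.write("translated_spa.wav", rate=model.config.sampling_rate, data=audio)
```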


      Google’s PaLM-E is a generalist robot brain that takes commands

      news.movim.eu / ArsTechnica · Tuesday, 7 March, 2023 - 23:06

    A robotic arm controlled by PaLM-E reaches for a bag of chips in a demonstration video. (credit: Google Research)

    On Monday, a group of AI researchers from Google and the Technical University of Berlin unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that integrates vision and language for robotic control. They claim it is the largest VLM ever developed and that it can perform a variety of tasks without the need for retraining.

    According to Google, when given a high-level command, such as "bring me the rice chips from the drawer," PaLM-E can generate a plan of action for a mobile robot platform with an arm (developed by Google Robotics) and execute the actions by itself.

    PaLM-E does this by analyzing data from the robot's camera without needing a pre-processed scene representation. This eliminates the need for a human to pre-process or annotate the data and allows for more autonomous robotic control.
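    PaLM-E itself has not been publicly released, so the following is purely illustrative pseudocode of the pattern described above: a single model takes raw camera data plus a high-level natural-language command and emits a step-by-step plan that the robot then executes. Every class, method, and skill name here is hypothetical.

```python
# Purely illustrative: PaLM-E is not publicly available, so every name below
# is hypothetical. The sketch only mirrors the pattern described above: raw
# camera pixels and a high-level command go in, a sequence of robot steps comes out.
from dataclasses import dataclass
from typing import List


@dataclass
class RobotStep:
    skill: str   # e.g. "navigate_to", "open", "pick", "hand_over"
    target: str  # e.g. "drawer", "rice chips"


class EmbodiedVLM:
    """Hypothetical stand-in for a PaLM-E-style visual-language model."""

    def plan(self, camera_image: bytes, command: str) -> List[RobotStep]:
        # The real model interleaves image embeddings with text tokens and
        # decodes a textual plan; here we return a canned plan for the
        # "bring me the rice chips from the drawer" example.
        return [
            RobotStep("navigate_to", "drawer"),
            RobotStep("open", "drawer"),
            RobotStep("pick", "rice chips"),
            RobotStep("hand_over", "user"),
        ]


class FakeRobot:
    """Hypothetical robot interface, included only to make the sketch runnable."""

    def capture_image(self) -> bytes:
        return b""  # stand-in for a camera frame

    def execute(self, step: RobotStep) -> None:
        print(f"executing: {step.skill}({step.target})")


def control_loop(model: EmbodiedVLM, robot: FakeRobot, command: str) -> None:
    # Plan directly from raw camera data, with no pre-processed scene
    # representation, mirroring the point the article makes above.
    image = robot.capture_image()
    for step in model.plan(image, command):
        robot.execute(step)


if __name__ == "__main__":
    control_loop(EmbodiedVLM(), FakeRobot(),
                 "bring me the rice chips from the drawer")
```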
