    Meta’s “massively multilingual” AI model translates up to 100 languages, speech or text / ArsTechnica · Tuesday, 22 August - 19:57 · 1 minute

On Tuesday, Meta announced SeamlessM4T, a multimodal AI model for speech and text translations. As a neural network that can process both text and audio, it can perform text-to-speech, speech-to-text, speech-to-speech, and text-to-text translations for "up to 100 languages," according to Meta. Its goal is to help people who speak different languages communicate with each other more effectively.

Continuing its relatively open approach to AI, Meta is releasing SeamlessM4T under a research license (CC BY-NC 4.0) that allows developers to build on the work. The company is also releasing SeamlessAlign, which Meta calls "the biggest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments." The dataset will likely kick-start the training of future translation AI models from other researchers.

Among the features of SeamlessM4T touted on Meta's promotional blog, the company says that the model can perform speech recognition (you give it audio of speech, and it converts it to text), speech-to-text translation (it translates spoken audio to a different language in text), speech-to-speech translation (you feed it speech audio, and it outputs translated speech audio), text-to-text translation (similar to how Google Translate functions), and text-to-speech translation (feed it text and it will translate and speak it out in another language). Each of the text translation functions supports nearly 100 languages, and the speech output functions support about 36 output languages.

    Banks fined $549M after senior execs found secretly texting on Signal, WhatsApp / ArsTechnica · Tuesday, 8 August - 19:22

Banks with employees covertly texting about official business on apps like Signal, WhatsApp, and iMessage have been caught red-handed. Now federal agencies are charging banks with violating laws requiring recordkeeping on all business matters.

Today, the SEC and the Commodity Futures Trading Commission (CFTC) fined 11 firms a combined $549 million for what the SEC described as "widespread and longstanding failures by the firms and their employees to maintain and preserve electronic communications."

Wells Fargo was hit with the biggest fines, agreeing to pay the SEC a $125 million penalty and the CFTC another $75 million. Fines for other firms—including Bank of Montreal, BMO Capital Markets Corp., BNP Paribas, Houlihan Lokey Capital, Inc., Mizuho Securities USA, Moelis & Company LLC, SMBC Nikko Securities America, Inc., Société Générale, and Wedbush Securities Inc.—ranged between $9 million and $75 million.

    Most of the 100 million people who signed up for Threads stopped using it / ArsTechnica · Friday, 28 July - 18:07

Meta's new Twitter competitor, Threads, is looking for ways to keep users interested after more than half of the people who signed up for the text-based platform stopped actively using the app, Meta CEO Mark Zuckerberg reportedly told employees in a company town hall yesterday. Threads launched on July 5 and signed up over 100 million users in less than five days, buoyed by user frustration with Elon Musk-owned Twitter.

"Obviously, if you have more than 100 million people sign up, ideally it would be awesome if all of them or even half of them stuck around. We're not there yet," Zuckerberg told employees yesterday, according to Reuters, which listened to audio of the event.

Third-party data suggests that Threads may have lost many more than half of its active users. Daily active users for Threads on Android dropped from 49 million on July 7 to 23.6 million on July 14, and then to 12.6 million on July 23, web analytics company SimilarWeb reported.
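
To put the "many more than half" figure in concrete terms, the decline can be worked out directly from the SimilarWeb numbers quoted above. The following is a minimal sketch; the variable and function names are illustrative and not part of any SimilarWeb or Meta tooling.

```python
# Daily active users on Android, in millions, as reported by SimilarWeb
# (figures taken from the paragraph above).
dau_millions = {"July 7": 49.0, "July 14": 23.6, "July 23": 12.6}


def percent_drop(start: float, end: float) -> float:
    """Return the percentage decline from start to end."""
    return (start - end) / start * 100


# Compare every reported day against the July 7 figure.
peak = dau_millions["July 7"]
drops = {
    day: round(percent_drop(peak, value), 1)
    for day, value in dau_millions.items()
}

for day, drop in drops.items():
    print(f"{day}: {dau_millions[day]} million ({drop}% below July 7)")
```

By this measure, Android daily active users were roughly 52 percent below the July 7 figure by July 14 and roughly 74 percent below it by July 23.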

    Tax preparers that shared private data with Meta, Google could be fined billions / ArsTechnica · Wednesday, 12 July - 20:36

Yesterday, Congress members revealed the results of a seven-month investigation into tax-filing companies. Lawmakers found that H&R Block, TaxAct, and TaxSlayer "recklessly shared" potentially hundreds of millions of taxpayers' sensitive personal and financial data with Google and Meta "for years" in apparent violation of laws prohibiting tax preparers from sharing tax return information without customers' consent.

In a press release provided to Ars from the office of Senator Elizabeth Warren (D-Mass.), lawmakers alleged a "massive, likely illegal breach of taxpayer privacy." Insisting upon urgent redress, lawmakers are now calling upon the Department of Justice, the Internal Revenue Service (IRS), the Federal Trade Commission, and the Treasury Inspector General for Tax Administration to "fully investigate this matter and prosecute any company or individuals who violated the law."

The Congress members' report said that "any tax return preparer who 'knowingly or recklessly discloses'" tax return information "is subject to a fine up to $1,000 per violation, and a prison term of up to one year."

    Google’s head of AR software quits, citing “unstable commitment and vision” / ArsTechnica · Tuesday, 11 July - 20:16 · 1 minute

Google's head of operating system and software platforms for augmented and mixed reality devices, Mark Lucovsky, has left the company after months of turmoil for the company's mixed reality projects and staff. He publicly announced his departure in a tweet on Monday:

I have decided to step away from my role at Google, where I was Senior Director of Engineering, responsible for OS and Software Platform for AR and XR devices. The recent changes in AR leadership and Google’s unstable commitment and vision have weighed heavily on my decision.

It's unclear exactly which leadership changes he's referring to, but it seems possible or even likely that he's talking about the recent departure of Clay Bavor, who had led Google's XR work since 2015. Bavor left the company in March of this year.

Google was one of the pioneers of mass-market AR when it piloted Google Glass with developers in 2013, but things have been rocky of late. The company killed Glass, brought it back as an enterprise-only product, then killed it again. Rumors swirled that the tech giant was working on a new AR product called Project Iris, but it was reportedly canceled this year amidst a wave of company layoffs.

    Twitter is “tanking” amid Threads’ surging popularity, analysts say / ArsTechnica · Tuesday, 11 July - 19:22

Data shows that a sudden spike in interest in Meta's Threads—which surpassed 100 million signups in five days, Mark Zuckerberg boasted yesterday—has likely already put a tiny dent in Twitter's traffic, The Wall Street Journal reported.

The news comes after a tweet from Cloudflare CEO Matthew Prince went viral. In it, Prince shared a Cloudflare chart showing that since January, Twitter traffic—compared to other popular websites—has been "tanking."

During Threads' first two days online, Twitter traffic dropped by 5 percent compared to the same two days in the prior week, web analytics firm SimilarWeb reported. Measured year over year, Twitter's traffic dropped by 11 percent.

    Sarah Silverman sues OpenAI, Meta for being “industrial-strength plagiarists” / ArsTechnica · Monday, 10 July - 19:42 · 6 minutes

On Friday, the Joseph Saveri Law Firm filed US federal class-action lawsuits on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.

Other authors represented include Christopher Golden and Richard Kadrey, and an earlier class-action lawsuit filed by the same firm on June 28 included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act, unfair competition laws, and negligence.

The Joseph Saveri Law Firm is no stranger to press-friendly legal action against generative AI. In November 2022, the same firm filed suit over GitHub Copilot for alleged copyright violations. In January 2023, the same legal group repeated that formula with a class-action lawsuit against Stability AI, Midjourney, and DeviantArt over AI image generators. The GitHub lawsuit is currently on path to trial, according to lawyer Matthew Butterick. Procedural maneuvering in the Stable Diffusion lawsuit is still underway with no clear outcome yet.

In a press release last month, the law firm described ChatGPT and LLaMA as "industrial-strength plagiarists that violate the rights of book authors." Authors and publishers have been reaching out to the law firm since March 2023, lawyers Joseph Saveri and Butterick wrote, because authors "are concerned" about these AI tools' "uncanny ability to generate text similar to that found in copyrighted textual materials, including thousands of books."

The most recent lawsuits from Silverman, Golden, and Kadrey were filed in a US district court in San Francisco. Authors have demanded jury trials in each case and are seeking permanent injunctive relief that could force Meta and OpenAI to make changes to their AI tools.

Meta declined Ars' request to comment. OpenAI did not immediately respond to Ars' request to comment.

A spokesperson for the Saveri Law Firm sent Ars a statement, saying, "If this alleged behavior is allowed to continue, these models will eventually replace the authors whose stolen works power these AI products with whom they are competing. This novel suit represents a larger fight for preserving ownership rights for all artists and other creators."

Accused of using “flagrantly illegal” data sets

Neither Meta nor OpenAI has fully disclosed what's in the data sets used to train LLaMA and ChatGPT. But lawyers for authors suing say they have deduced the likely data sources from clues in statements and papers released by the companies or related researchers. Authors have accused both OpenAI and Meta of using training data sets that contained copyrighted materials distributed without authors' or publishers' consent, including by downloading works from some of the largest e-book pirate sites.

In the OpenAI lawsuit, authors alleged that based on OpenAI disclosures, ChatGPT appeared to have been trained on 294,000 books allegedly downloaded from "notorious 'shadow library' websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik." Meta has disclosed that LLaMA was trained on part of a data set called ThePile, which the other lawsuit alleged includes “all of Bibliotik,” and amounts to 196,640 books.

On top of allegedly accessing copyrighted works through shadow libraries, OpenAI is also accused of using a "controversial data set" called BookCorpus.

BookCorpus, the OpenAI lawsuit said, "was assembled in 2015 by a team of AI researchers for the purpose of training language models." This research team allegedly "copied the books from a website called Smashwords that hosts self-published novels, that are available to readers at no cost." These novels, however, are still under copyright and allegedly "were copied into the BookCorpus data set without consent, credit, or compensation to the authors."

Ars could not immediately reach the BookCorpus researchers or Smashwords for comment. [Update: Dan Wood, COO of Draft2Digital—which acquired Smashwords in March 2022—told Ars that the Smashwords "store site lists close to 800,000 titles for sale," with "about 100,000" currently priced at free.

"Typically, the free book will be the first of a series," Wood said. "Some authors will keep these titles free indefinitely, and some will run limited promotions where they offer the book for free. From what we understand of the BookCorpus data set, approximately 7,185 unique titles that were priced free at the time were scraped without the knowledge or permission of Smashwords or its authors." It wasn't until March 2023 that Draft2Digital "first became aware of the scraped books being used for commercial purposes and redistributed, which is a clear violation of Smashwords’ terms of service," Wood said.

"Every author, whether they have an internationally recognizable name or have just published their first book, deserve to have their copyright protected," Wood told Ars. "They also should have the confidence that the publishing service they entrust their work with will protect it. To that end, we are working diligently with our lawyers to fully understand the issues—including who took the data and where it was distributed—and to devise a strategy to ensure our authors’ rights are enforced. We are watching the current cases being brought against OpenAI and Meta very closely."]

“Numerous questions of law” raised

Authors claim that by using "flagrantly illegal" data sets, OpenAI allegedly infringed copyrights of Silverman's book The Bedwetter, Golden’s Ararat, and Kadrey’s Sandman Slim. And Meta allegedly infringed copyrights of the same three books, as well as "several" other titles from Golden and Kadrey.

It seems obvious to authors that their books were used to train ChatGPT and LLaMA because the tools "can accurately summarize a certain copyrighted book." Although sometimes ChatGPT gets some details wrong, its summaries are otherwise very accurate, and this suggests that "ChatGPT retains knowledge of particular works in the training data set and is able to output similar textual content," the authors alleged.

It also seems obvious to authors that OpenAI and Meta knew that their models were "ingesting" copyrighted materials because all the copyright-management information (CMI) appears to have been "intentionally removed," authors alleged. That means that ChatGPT never responds to a request for a summary by citing who has the copyright, allowing OpenAI to "unfairly profit from and take credit for developing a commercial product based on unattributed reproductions of those stolen writing and ideas."

"OpenAI knew or had reasonable grounds to know that this removal of CMI would facilitate copyright infringement by concealing the fact that every output from the OpenAI Language Models is an infringing derivative work, synthesized entirely from expressive information found in the training data," the OpenAI complaint said.

Among "numerous questions of law" raised in these complaints was a particularly prickly question: Is ChatGPT or LLaMA itself an infringing derivative work based on perhaps thousands of authors' works?

Authors are already upset that companies seem to be unfairly profiting off their copyrighted materials, and the Meta lawsuit noted that any unfair profits currently gained could further balloon, as "Meta plans to make the next version of LLaMA commercially available." In addition to other damages, the authors are asking for restitution of alleged profits lost.

"Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by plaintiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation," Saveri and Butterick wrote in their press release.
