Apple Reminds Us That Behind The AI Revolution Lies An Army Of Humans Seeing Our Data

Apple’s concession last week that it relies on an army of human contractors to review a small portion of the recordings from Siri activations reminds us just how much the AI revolution is being built upon human workers consuming our data. The research papers and popular press pouring forth from Silicon Valley’s AI leaders paint a portrait of almost human-like machines learning about their world independently, like silicon children. Beneath that pristine veneer, however, is a far bleaker reality: industrial assembly lines of human workers laboring to produce the raw training and testing material fed into primitive correlative software that is often little more insightful than a glorified Excel spreadsheet of Pearson correlations. In this world, even the most privacy-conscious companies must turn to massive teams of humans to keep their AI engines fed.

The AI revolution is often portrayed with almost science fiction-like wonderment, in which algorithms emerge as silicon lifeforms out of test tubes made from software, consuming the world around them when they aren’t busy playing war games against clones of themselves. Piles of code are anthropomorphized into living creatures that actively search out information and learn what they need from it.

Missing from this romanticized fairy tale is the grim reality that today’s correlative engines require unimaginable quantities of intricately curated data to achieve even the extraordinarily brittle and easily fooled models that define today’s state of the art.

Like modern day coal miners, behind closed doors an army of contractors toils in silence and obscurity to produce the steady stream of training and testing data that makes this AI revolution possible.

In turn, powering the work of those contractors is often our own data, harvested from the open Web, purchased from third parties or collected from companies’ own services.

In Apple’s case, it turns out that the company’s very public privacy posturing ultimately collided with the reality that, as an increasingly services-oriented enterprise built on AI, it has little choice but to join its peers in making use of human reviewers to examine recordings from Siri activations.

Creating today’s AI systems requires incredible amounts of real-world data. When it comes to voice transcription, content understanding, task execution, question answering or any other modern AI application, the algorithms that power our modern world are, at the end of the day, built largely upon our own data.

Every voice invocation of a smart assistant, every social media photo upload, every question posed to a chatbot joins the firehose of our most intimate data being curated and selectively fed into today’s AI systems to make them better.

No company, no matter how much it enshrines privacy, can escape the simple fact that to build robust consumer-facing AI today requires using real customer data to train and test that system. Even Apple has not been able to find a way around this fact.

Still, there is perhaps some hope on the horizon. Advances in federated learning are making it increasingly possible to build large-scale global models across vast numbers of users, with each user’s content training the model in situ: the data never leaves the device, and only correction factors are transmitted back to the shared model.

In other words, instead of today’s training workflow, in which all of the data used to build a model is deposited in one centralized place, federated learning lets many independent data stores each train on the small amount of data they hold locally and send only their resulting model updates back to be combined into a central model. That shared model is in turn sent back out to each remote store for further training and testing, with fresh corrections returned to the central server, and so on in a continuing cycle. The end result is a largely privacy-preserving training process (especially when combined with additional safeguards like differential privacy) in which companies can create large global models without ever needing to see users’ data.
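To make that loop concrete, here is a minimal sketch of the federated-averaging idea in Python. Everything in it is an illustrative assumption rather than a description of any production system: the tiny linear model, the local_update helper, the three simulated devices, and the Gaussian noise added to each update as a crude stand-in for differential privacy.

```python
import numpy as np

# Minimal federated-averaging sketch (illustrative only, not any
# company's production system). Each "device" holds private data that
# never leaves it; only model updates (the "correction factors"
# described above) are transmitted to the server.

rng = np.random.default_rng(0)

def local_update(global_weights, local_x, local_y, lr=0.1, epochs=5):
    """Train a tiny linear model on one device's private data and
    return only the weight delta, never the data itself."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = local_x.T @ (local_x @ w - local_y) / len(local_y)
        w -= lr * grad
    return w - global_weights  # the only thing that leaves the device

# Simulate private datasets on three devices, all drawn from the same
# underlying relationship (true weights w* = [2, -1]).
true_w = np.array([2.0, -1.0])
devices = []
for _ in range(3):
    x = rng.normal(size=(50, 2))
    y = x @ true_w + rng.normal(scale=0.1, size=50)
    devices.append((x, y))

w_global = np.zeros(2)
for round_num in range(20):
    # Each device computes an update locally; a little Gaussian noise
    # is added as a crude stand-in for differential privacy.
    updates = [
        local_update(w_global, x, y) + rng.normal(scale=0.01, size=2)
        for x, y in devices
    ]
    # The server averages the updates into the shared model, then
    # redistributes it for the next round of local training.
    w_global += np.mean(updates, axis=0)

print("learned:", w_global, "target:", true_w)
```

Real deployments layer on secure aggregation so the server cannot inspect any individual device’s update, and calibrate the noise to a formal privacy budget; this sketch only shows the shape of the data flow, in which the raw data stays put and only deltas travel.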

Of course, federated learning doesn’t eliminate the need for the vast teams of human annotators required to build systems for tasks like image categorization, though interactive, gamified training systems might eventually allow users to volunteer to perform annotation tasks in return for some kind of reward, whether special features, services or perhaps even monetary compensation.

Perhaps someday, instead of the recordings from our smart speakers and digital assistants being sent to total strangers on the other side of the world, we’ll get a message from the company asking us to listen to our own recording and type up an annotation, rate the machine’s annotation, or comment on how well it performed the desired behavior, with a fee or reward offered in return, or at least the option to annotate our own content rather than have it outsourced to others.

Putting this all together, it is important to remember that today’s AI revolution is built upon the backs of an army of very human contractors who generate the training and testing annotations that make these models possible. In turn, the data they annotate is typically our own, creating a strange contradiction: we turn to our digital assistants in the privacy of our own homes to ask them intimate questions we could not dream of letting a stranger overhear, yet all of that privacy is for naught when those recordings are shared with a contractor on the other side of the world to listen to and annotate. Advances like federated learning may eventually replace these armies of professional annotators, but for now they are the lifeblood of our AI advances.

In the end, the voracious appetite of today’s AI systems is ushering in the end of the last remnants of our privacy.