Researchers at Ohio State University have developed a method for building a machine learning algorithm from data gathered from a variety of connected devices. Two things about their model are worth noting: it is distributed, and it can keep data private.
The researchers call their model Crowd-ML, and the idea is pretty basic. Each device runs a version of the necessary app, much as one might run SETI@home or another distributed computing application, and grabs samples of data to send to a central server. The server can tell when enough of the right data has been gathered to “teach” the computer and grabs only the data it needs, which ensures a degree of privacy.
What sets the Crowd-ML effort apart is that it learns with a variant of stochastic (sub)gradient descent rather than batch processing. Stochastic gradient descent is the basis for a lot of machine learning and deep learning efforts. It is iterative: each update adjusts the model using only a small sample of data and builds on the result of the previous update, as opposed to processing the whole data set at once.
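To make the iterative idea concrete, here is a minimal sketch of plain stochastic gradient descent fitting a line to toy data, one sample at a time. It is not the Crowd-ML variant from the paper; the function names (`sgd_step`, `train_sgd`), the learning rate and the toy data are all assumptions chosen for illustration.

```python
import random

def sgd_step(w, b, x, y, lr):
    """One stochastic gradient step on a single (x, y) sample."""
    err = (w * x + b) - y  # prediction error on this one sample
    # Gradient of 0.5 * err**2 with respect to w is err * x, and to b is err.
    return w - lr * err * x, b - lr * err

def train_sgd(samples, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by visiting samples one at a time in random order."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for x, y in order:
            # Each update starts from the result of the previous one.
            w, b = sgd_step(w, b, x, y, lr)
    return w, b

# Hypothetical toy data drawn from y = 2x + 1, with no noise.
samples = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = train_sgd(samples)
print(round(w, 3), round(b, 3))
```

A batch method would compute the error over all 21 samples before changing `w` and `b` once; here the model changes after every sample, which is what lets the learning be spread across many small contributions.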
The paper goes on to describe how one can tune the Crowd-ML model for more or less privacy, and to process information faster or in greater amounts. It aims for a happy medium between protecting privacy and gathering enough data to form a decent sample for training the machine learning algorithm.
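The trade-off between privacy and usable data can be illustrated with a generic technique: each device adds random noise to the value it reports, and the server averages many reports. The sketch below uses the standard Laplace mechanism from differential privacy as a stand-in. It is an assumption for illustration and not necessarily the exact mechanism in the paper; `epsilon`, `bound`, `privatize` and `mean_abs_error` are hypothetical names.

```python
import random

def clip(value, bound):
    """Limit a reported value to the range [-bound, bound]."""
    return max(-bound, min(bound, value))

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(gradient, epsilon, bound, rng):
    """Clip a scalar gradient and add noise; smaller epsilon means more noise."""
    sensitivity = 2.0 * bound  # the most a clipped value can change
    return clip(gradient, bound) + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(epsilon, n_devices, true_gradient=0.3, bound=1.0,
                   trials=200, seed=0):
    """Average error of the server's estimate over repeated simulated rounds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        reports = [privatize(true_gradient, epsilon, bound, rng)
                   for _ in range(n_devices)]
        total += abs(sum(reports) / n_devices - true_gradient)
    return total / trials

# Stronger privacy (small epsilon) gives a noisier estimate from 100 devices.
strict = mean_abs_error(epsilon=0.1, n_devices=100)
loose = mean_abs_error(epsilon=10.0, n_devices=100)
print(strict > loose)
```

Under these assumptions, tightening privacy makes each report noisier, and the server can compensate only by averaging over more devices, which is the kind of balance the paper describes tuning.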
Crowd-ML is worth checking out because, as we bring more connected devices into our homes, we also bring the promise that thousands of connected sensors and smart objects can help us use resources more efficiently, manage our health and even direct our traffic once we aggregate and analyze the data they hold. With that promise comes the risk of losing our privacy. There is also the risk that the promise falls flat because sharing the information costs too much time or energy to be practical.
Having a distributed method of machine learning could be one step toward solving some of those issues. Already, we’re seeing plenty of research into alternative networks and architectures for data traversing the internet of things, so what’s one more option to mull as we rethink how we want to build a computing model for billions of always-on connected devices that aren’t managed by a human?