Reproducibility Issues Hinder Machine Learning Progress

Reproducibility, the property that lets other scientists obtain an experiment's results by following the same procedure, is largely absent from machine learning and artificial intelligence development, and given the scale and scope of the field, that absence is becoming a real problem. One of the biggest pieces of the puzzle is recording and accounting for small changes, such as a GPU driver update landing mid-job, or an outside party altering the data set during a training run. A huge number of factors can affect an AI research project between conception and fruition, and without a way to capture all of them, AI researchers are essentially unable to reproduce one another's work. That hurts collaboration and building on prior results, two of the more basic tenets of publishable scientific research, in a field that stands to benefit enormously from both.

To paint as simple a picture as possible, imagine a data scientist who wants to build a simple AI program that finds and sorts images of blue jays in nature photos. They write an algorithm that detects relatively large blue shapes, then compares them against the surrounding environment in the photo to estimate their size, position, and current activity, and to verify that they really are what the machine thinks they are. This in-development model is trained on a data set on the researcher's machine, which receives a GPU driver update in the middle of a run. Afterward, somebody on the network modifies or deletes a few files in the training data set for one reason or another, also in the middle of a run.
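The mid-run file tampering described above is at least detectable. A minimal sketch, using Python's standard library and an illustrative directory layout (nothing here comes from a specific project), would fingerprint the training set before and after a run and compare the digests:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file in the training set into a single digest.

    Comparing the digest recorded before a run against one computed
    afterward reveals whether any file was modified or deleted
    mid-run. File names are included so renames are caught too.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

Publishing this digest alongside results would let another lab verify it trained on byte-identical data, one small piece of the reproducibility puzzle.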

Small changes like these can have an outsized effect on the end product, especially when a machine learning system is left to train itself largely unsupervised on vast data sets. Even if the researcher logged every tiny codebase change from start to finish, which is mostly impractical, others would still struggle to reproduce the results by following the same procedure with the same algorithm, machine, and data set. These principles apply across most of AI research, and they represent a growing problem that will need to be solved before any real industry-wide collaboration can occur. Given that AI research demands vast data sets and heavy training across many machines for its grander data processing tasks, industry-wide collaboration across national borders is exactly what the field needs to make its next growth breakthrough.
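Logging only codebase changes misses the environmental factors the article highlights, like that mid-job GPU driver update. One hedged sketch of a broader record, assuming an NVIDIA machine where the real `nvidia-smi` tool is on the path (the manifest field names are illustrative, not any standard schema), would snapshot the environment alongside the code and data versions:

```python
import json
import platform
import subprocess
import sys

def run_manifest(dataset_digest: str, code_version: str) -> str:
    """Serialize the facts a collaborator needs to rerun this job.

    The GPU driver version is read via nvidia-smi when available,
    so a driver update between runs shows up as a changed manifest
    rather than an invisible source of divergence.
    """
    try:
        driver = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        driver = "unknown"  # no NVIDIA GPU or tool not installed
    manifest = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "gpu_driver": driver,
        "dataset_sha256": dataset_digest,
        "code_version": code_version,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Writing such a manifest at the start and end of every training run, and shipping it with published results, is the kind of bookkeeping that would let a second lab spot exactly which factor diverged when a replication attempt fails.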

About the Author

Daniel Fuller

Senior Staff Writer
Daniel has been writing for Android Headlines since 2015, and is one of the site's Senior Staff Writers. He's been living the Android life since 2010, and has been interested in technology of all sorts since childhood. His personal, educational and professional backgrounds in computer science, gaming, literature, and music leave him uniquely equipped to handle a wide range of news topics for the site. These include the likes of machine learning, voice assistants, AI technology development, and hot gaming news in the Android world. Contact him at [email protected]