As part of its effort to find better ways to develop and train “safe artificial general intelligence,” OpenAI has been releasing its own implementations of reinforcement learning algorithms. The collection, called OpenAI Baselines, recently gained two additions designed to make reinforcement learning more efficient.
The first is a baseline implementation called Actor Critic using Kronecker-Factored Trust Region (ACKTR). Developed by researchers from the University of Toronto (UofT) and New York University (NYU), ACKTR improves how policies are trained in deep reinforcement learning, where an agent learns purely by trial and error from raw observations. In a paper published online, the UofT and NYU researchers used simulated robots and Atari games to test how well ACKTR learns control policies.
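The "Kronecker-factored" part of the name refers to how ACKTR approximates the curvature of the policy's loss surface so that a natural-gradient (trust-region) step becomes affordable. The sketch below illustrates that factorization idea only; the matrix sizes, damping value, and random data are illustrative stand-ins, not values from the paper.

```python
import numpy as np

# For a fully connected layer with n inputs and m outputs, the curvature
# (Fisher information) block is an (n*m) x (n*m) matrix. The Kronecker-
# factored trick approximates it as a product of two small factors: one
# built from the layer's input activations (A, n x n) and one from the
# back-propagated output gradients (S, m x m).
rng = np.random.default_rng(0)
n, m = 64, 32

acts = rng.standard_normal((1000, n))    # input activations over a batch
grads = rng.standard_normal((1000, m))   # output gradients over the batch

A = acts.T @ acts / 1000                 # n x n activation covariance
S = grads.T @ grads / 1000               # m x m gradient covariance

# Inverting the full (n*m) x (n*m) matrix would cost O((n*m)^3); inverting
# the two small factors costs only O(n^3 + m^3), because the inverse of a
# Kronecker product is the Kronecker product of the inverses.
damping = 1e-2                           # illustrative damping term
A_inv = np.linalg.inv(A + damping * np.eye(n))
S_inv = np.linalg.inv(S + damping * np.eye(m))

# Preconditioned (natural-gradient-style) step for this layer: apply the
# factored inverse to the plain gradient instead of the full inverse.
g = rng.standard_normal((n, m))          # plain policy gradient (illustrative)
nat_g = A_inv @ g @ S_inv
```

This is why the per-update cost of ACKTR stays close to a plain gradient method even though it uses second-order curvature information.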
“For machine learning algorithms, two costs are important to consider: sample complexity and computational complexity,” according to an OpenAI Research blog. “Sample complexity refers to the number of timesteps of interaction between the agent and its environment, and computational complexity refers to the amount of numerical operations that must be performed.” ACKTR speeds up deep reinforcement learning by improving on both fronts: it needs fewer interactions with the environment and fewer numerical operations per update.
Usually, machine learning algorithms are taught by feeding them large amounts of labeled data. In deep reinforcement learning, by contrast, a policy is trained directly on raw inputs and adjusts on its own through trial and error in pursuit of rewards. Using ACKTR and another baseline called A2C, the researchers at OpenAI improved how this kind of learning is done.
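The trial-and-error loop described above can be shown in miniature. The toy two-armed bandit below is not from OpenAI's code; it is a minimal sketch in which the agent sees only rewards, never a model of the environment, and its value estimates improve purely through interaction.

```python
import random

# Toy trial-and-error learning: a two-armed bandit whose hidden payout
# probabilities the agent must discover from rewards alone.
random.seed(0)
payout = [0.3, 0.7]          # hidden reward probability of each arm

values = [0.0, 0.0]          # the agent's running value estimate per arm
counts = [0, 0]

for step in range(2000):
    # Explore occasionally; otherwise exploit the current best estimate.
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = 0 if values[0] > values[1] else 1
    reward = 1.0 if random.random() < payout[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

# After enough interaction, the estimates approach the true payout rates,
# with no labeled data and no model of the environment involved.
```

Sample complexity, in this picture, is simply how many of those 2,000 interactions the agent needs before its estimates are good enough to act on.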
Agents trained with ACKTR attain higher scores in a shorter amount of time than those trained with other algorithms, such as A2C.
Whereas ACKTR focuses on reducing the number of timesteps an agent needs to interact with its environment, A2C improves how efficiently processors are used by running reinforcement learning on batches of agents in parallel. “One advantage of this method is that it can more effectively use … GPUs, which perform best with large batch sizes. This algorithm is naturally called A2C, short for advantage actor critic,” they wrote. “This A2C implementation is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies.”
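The batching idea behind A2C can be sketched as follows. Several environment copies step in lockstep so that the learner always receives one large batch, and the "advantage" measures how much better each action did than the critic expected. The environment count, rollout length, and random data below are illustrative assumptions, not OpenAI's actual configuration.

```python
import numpy as np

# Sketch of synchronous batching in advantage actor critic (A2C):
# 8 environment copies each contribute 5 steps of experience per update.
rng = np.random.default_rng(1)
n_envs, n_steps = 8, 5

rewards = rng.random((n_envs, n_steps))  # rewards observed per env, per step
values = rng.random((n_envs, n_steps))   # the critic's value estimates
gamma = 0.99                             # discount factor (illustrative)

# Discounted returns, computed backwards over each environment's rollout.
returns = np.zeros_like(rewards)
running = np.zeros(n_envs)
for t in reversed(range(n_steps)):
    running = rewards[:, t] + gamma * running
    returns[:, t] = running

# The "advantage": how much better each action did than the critic expected.
# The actor's gradient is weighted by this quantity, and the whole batch of
# n_envs * n_steps samples is sent to the GPU in a single update.
advantages = returns - values
batch = advantages.reshape(-1)           # 40 samples per update, not 1
```

Feeding the GPU 40 samples at a time instead of 1 is exactly the "large batch sizes" advantage the OpenAI blog describes, and it is what makes the synchronous A2C cheaper than asynchronous A3C on single-GPU machines.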
These are the latest additions to OpenAI’s work on developing AI policies and agents that learn better. One of its recent successes was an AI that could play the video game Dota 2. Like DeepMind’s AlphaGo, OpenAI’s Dota-playing agent was able to defeat its human opponents, and in a game considerably more complex than Go, the ancient Chinese board game.
These achievements notwithstanding, OpenAI continues to work in keeping with how its co-founder Elon Musk views AI: with great caution. Musk has long advocated developing safe AI and has even called for sound policies to regulate it. OpenAI is his way of contributing directly to that effort.