“Reinforcement Learning with Augmented Data” (link) by Michael Laskin et al. introduces the idea of using data augmentation in reinforcement learning. Data augmentation is already a popular technique in supervised learning, where it has radically improved data efficiency and generalization. (p. 1) So it seems natural to attempt the same trick for reinforcement learning. In fact, the potential gains in data efficiency are, if anything, greater in reinforcement learning. RL requires enormous amounts of data to succeed. This is one of the main reasons reinforcement learning has shown lackluster results in most areas and has shined primarily in game-playing domains like Atari or AlphaGo, where it is possible to generate as much training data as desired. But even in these cases, Laskin points out, it takes months to collect the needed data and requires enormous amounts of computation that isn’t available to most people. (p. 2) As a result, only research groups with deep pockets have been able to make progress in this field. (p. 9) So even in cases where we can generate any amount of data, finding ways to increase data efficiency is still warranted.
Given this huge appetite for data, RL could clearly benefit from data augmentation. As I mentioned in this post, biological organisms are incredibly data-efficient compared to our state-of-the-art RL techniques. A dog or a cat can learn to play Jenga from a handful of trials, whereas such a feat is currently beyond even our best RL techniques, which would unreasonably need to practice millions of times in a physical space to learn Jenga. Laskin’s hope is that any gains in data efficiency will lead us toward our ultimate goal of training complicated agents (such as robots for nursing homes) using a reasonable amount of data, closer to what biological organisms need. However, Laskin admits that data augmentation is insufficient on its own to get us to this goal. (p. 9)
Laskin utilized several standard data augmentation techniques: Crop, Translate, Window, Grayscale, Cutout, Rotate, Flip, Random Convolution, and Color Jitter. To these he adds two novel ones for state-based inputs: Random Amplitude Scaling and Gaussian Noise. The results mirror supervised learning: data augmentation both improves data efficiency (i.e. less data is needed to get good results) and improves generalization. (p. 2)
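To make the idea concrete, here is a minimal NumPy sketch of what a few of these augmentations might look like on a batch of observations. This is not the paper’s actual implementation; the channel-last image layout and the scaling range for Random Amplitude Scaling are assumptions for illustration.

```python
import numpy as np

def random_crop(batch, out_size):
    """Randomly crop each image in a batch of (N, H, W, C) observations."""
    n, h, w, _ = batch.shape
    tops = np.random.randint(0, h - out_size + 1, size=n)
    lefts = np.random.randint(0, w - out_size + 1, size=n)
    return np.stack([img[t:t + out_size, l:l + out_size]
                     for img, t, l in zip(batch, tops, lefts)])

def grayscale(batch):
    """Collapse RGB channels to luminance, keeping the channel axis."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = batch @ weights                      # (N, H, W)
    return np.repeat(gray[..., None], 3, axis=-1)

def random_amplitude_scaling(states, low=0.5, high=1.5):
    """Scale each state vector by a random uniform amplitude (per sample)."""
    scale = np.random.uniform(low, high, size=(states.shape[0], 1))
    return states * scale
```

The key point is that each function returns a perturbed view of the same underlying observation, so the agent sees “new” data without any extra environment interaction.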
However, I was disappointed that these techniques are basically a direct import from the supervised world, meaning they will only work with RL agents that use pixel images as percepts. This seems like a major limitation. I was hoping for a more general form of data augmentation that would work with any RL algorithm.
Why is Random Crop So Efficient?
Of the various data augmentation techniques, Random Crop proved the most effective by far. (p. 6) A random crop can really be thought of as a combination of two sub-techniques: Random Window (masking all but a specific window) and Random Translate (placing the full image at a random position within a field of zeros). Of these two sub-techniques, Random Translate produced strong results whereas Random Window did not. (p. 6)
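The distinction between the two sub-techniques is easier to see in code. Below is a minimal NumPy sketch (my own illustration, not the paper’s code; channel-last layout assumed): Random Translate keeps the whole image but moves it, while Random Window keeps only a patch in place and zeros out the rest.

```python
import numpy as np

def random_translate(batch, out_size):
    """Place each full image at a random offset inside a larger zero field."""
    n, h, w, c = batch.shape
    out = np.zeros((n, out_size, out_size, c), dtype=batch.dtype)
    tops = np.random.randint(0, out_size - h + 1, size=n)
    lefts = np.random.randint(0, out_size - w + 1, size=n)
    for i in range(n):
        out[i, tops[i]:tops[i] + h, lefts[i]:lefts[i] + w] = batch[i]
    return out

def random_window(batch, win_size):
    """Keep only a random window of each image; zero out everything else."""
    n, h, w, c = batch.shape
    out = np.zeros_like(batch)
    tops = np.random.randint(0, h - win_size + 1, size=n)
    lefts = np.random.randint(0, w - win_size + 1, size=n)
    for i in range(n):
        t, l = tops[i], lefts[i]
        out[i, t:t + win_size, l:l + win_size] = \
            batch[i, t:t + win_size, l:l + win_size]
    return out
```

Note that `random_translate` preserves every pixel (only the position changes), while `random_window` discards information, which may be part of why it underperforms.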
So why is translation so important to data augmentation? One possible hypothesis is that this is a natural consequence of how RL actually works. Imagine a ‘robot’ (perhaps a software robot) that learns from images. It can turn or move forward, backward, or side-to-side, and each of these movements shifts what appears in its field of view. So doing data augmentation via a translation essentially mimics new training data. Compare this to, say, random color jitter or grayscaling, which do not mimic what true data would look like. I conjecture that this is why translations are the most effective form of data augmentation.
Another possibility that comes to mind is that translation makes sense as a data augmentation because where things are in a field of vision should not in general make any difference in outcomes. This is the whole basis for why Convolutional Neural Networks (CNNs) work so well. A feature that identifies an object should do so no matter where in the image that object is found. So by training on a convolution rather than the whole image we learn features that are translation independent. Augmenting data in this way may create an inductive bias that does the same sort of thing.
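This translation-independence of convolutional feature detectors can be demonstrated in a few lines. Below is a tiny NumPy sketch of my own (a 1-D “valid” correlation standing in for a convolutional layer): when the pattern moves, the detector’s response moves with it rather than changing shape.

```python
import numpy as np

def correlate(signal, feature):
    """Slide a feature detector across a 1-D signal (valid correlation)."""
    k = len(feature)
    return np.array([signal[i:i + k] @ feature
                     for i in range(len(signal) - k + 1)])

feature = np.array([1.0, 2.0, 1.0])   # a simple pattern detector
signal = np.zeros(20)
signal[5:8] = feature                  # embed the pattern at position 5
shifted = np.roll(signal, 7)           # the same pattern, now at position 12

# The detector fires wherever the pattern sits: the response simply shifts.
assert np.argmax(correlate(signal, feature)) == 5
assert np.argmax(correlate(shifted, feature)) == 12
```

Training on randomly translated copies plausibly encourages the rest of the network to inherit this same indifference to position.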
RAD vs CURL
Reinforcement Learning with Augmented Data (RAD) works better than the previous best data augmentation technique, Contrastive Unsupervised Representations for Reinforcement Learning (CURL). As the paper explains, “While the focus in CURL was to make use of data augmentations jointly through contrastive and reinforcement learning losses, RAD attempts to directly use data augmentations for reinforcement learning without any auxiliary loss.” (p. 13) The end result is that RAD outperforms CURL, presumably because “…it only optimizes for what we care about, which is the task reward. CURL, on the other hand, jointly optimizes the reinforcement and contrastive learning objectives.” (p. 19) In other words, RAD gives better results because it is the more direct approach: it optimizes the very metric we use to measure success. Therefore, “…a method that purely focuses on reward optimization is expected to be better as long as it implicitly ensures similarity consistencies on the augmented views (in this case, just by training the RL objective on different augmentations directly).” (p. 19)
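The structural difference between the two approaches can be sketched in a few lines. This is purely a schematic of my own, not either paper’s implementation: `augment`, `rl_loss`, and `contrastive_loss` are stand-in functions chosen only to show where the auxiliary loss does (CURL) and does not (RAD) appear.

```python
import numpy as np

def augment(obs):
    """Stand-in augmentation: random amplitude scaling of observations."""
    return obs * np.random.uniform(0.5, 1.5, size=(obs.shape[0], 1))

def rl_loss(obs):
    """Stand-in for the ordinary task-reward RL objective."""
    return float(np.mean(obs ** 2))

def contrastive_loss(view_a, view_b):
    """Stand-in for a CURL-style similarity loss between two augmented views."""
    return float(np.mean((view_a - view_b) ** 2))

obs = np.random.rand(8, 4)

# RAD: feed augmented observations straight into the RL objective, nothing else.
rad_objective = rl_loss(augment(obs))

# CURL: jointly optimize the RL objective plus an auxiliary contrastive term
# computed between two independently augmented views of the same batch.
curl_objective = rl_loss(augment(obs)) + contrastive_loss(augment(obs), augment(obs))
```

RAD’s entire mechanism is the first line: augmentation happens in the data pipeline, and the learning objective is untouched.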
However, Laskin warns against taking this as a sign that CURL provides no value. CURL can optimize its objective absent any reward at all and is thus, arguably, the more general technique. (p. 19) Therefore Laskin believes both techniques have a place in future reinforcement learning research.