Behind Rosetta, Facebook's In-Image Text Recognition AI

Facebook has revealed some details about the inner workings of Rosetta, a machine learning system it uses to read text inside images and interpret that text in the context of the image. The system lets Facebook offer more relevant image search results and make image-based content more accessible to visually impaired users, among other use cases. Rosetta is not a single program but a set of programs that work together to extract the text from an image, work out how it relates to the image, and draw relevant insights where applicable.

Rosetta first uses a convolutional neural network to determine whether text is present in an image, be it part of the scene itself or an overlay of the kind commonly seen in image macros, motivational posters, and memes. This step relies on what's called a region proposal network, which proposes regions of the image likely to contain text; those regions are then checked for known text patterns. Once text is found, optical character recognition works out what the text says. The recognizer combines sequence prediction with trained language and context processing so that it can recognize words and phrases that may not have appeared in training. Finally, the recognized text is passed to a separate program that is trained on the context of in-image text blurbs.
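To make the detect-then-recognize flow described above more concrete, here is a minimal Python sketch. It is not Facebook's code: the names Region, filter_regions, greedy_decode, and read_text are hypothetical, the 0.5 score threshold is arbitrary, and the two stand-in functions at the bottom merely imitate a region proposal step and a character recognizer. The decoding step assumes a common form of sequence prediction in which the recognizer emits one label per horizontal slice of the region, with repeated labels merged and a blank symbol dropped; the source does not say which decoding scheme Rosetta uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

BLANK = "_"  # hypothetical "no character here" symbol emitted by the recognizer


@dataclass(frozen=True)
class Region:
    """A proposed rectangle that may contain text, with a confidence score."""
    x: int
    y: int
    width: int
    height: int
    score: float


def filter_regions(regions: Sequence[Region], min_score: float = 0.5) -> List[Region]:
    """Keep only the proposals the detector is reasonably confident about."""
    return [r for r in regions if r.score >= min_score]


def greedy_decode(frame_labels: Sequence[str]) -> str:
    """Collapse per-slice labels into a string.

    Consecutive repeats are merged and blanks are dropped, so the sequence
    W W _ i i _ d _ e becomes "Wide". A blank between two identical labels
    keeps genuine double letters apart.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


def read_text(
    image: object,
    propose: Callable[[object], Sequence[Region]],
    recognize: Callable[[object, Region], Sequence[str]],
    min_score: float = 0.5,
) -> List[Tuple[Region, str]]:
    """Run detection, then recognition, on each confident region."""
    results = []
    for region in filter_regions(propose(image), min_score):
        results.append((region, greedy_decode(recognize(image, region))))
    return results


# Stand-ins for the two models, for illustration only.
def fake_propose(image: object) -> List[Region]:
    return [Region(0, 0, 40, 10, 0.9), Region(5, 20, 30, 8, 0.2)]


def fake_recognize(image: object, region: Region) -> List[str]:
    return list("WW_ii_dd_ee")


found = read_text(None, fake_propose, fake_recognize)
```

With the stand-in models, found holds a single entry pairing the high-scoring region with the string "Wide"; the low-scoring proposal is discarded before recognition. In a real system the final stage described in the article would then take these strings, together with the image, as input to a separately trained model.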

The whole point of Rosetta is to look at text in images, in all its forms, and work out the relationship between the text and the image, if there is one. Take an image macro of an overweight cat sitting on the bumper of a truck with the warning label "Wide Load": the AI will be able to understand the association and make the joke easier to find through image search. Likewise, given an image macro that says "When you're feeling down" and depicts somebody playing Katamari Damacy, the AI can recognize the link between the two, namely the implication that somebody who's feeling down is using the cheerful, nonsensical game as a way to cope.
