Given only a few frames of a video, humans can usually surmise what is happening and will happen on screen. If we see an early frame of stacked cans, a middle frame with a finger at the stack’s base, and a late frame showing the cans toppled over, we can guess that the finger knocked down the cans. Computers, however, struggle with this concept.

In a paper being presented at this week’s European Conference on Computer Vision, MIT researchers describe an add-on module that helps artificial intelligence systems called convolutional neural networks, or CNNs, fill in the gaps between video frames, greatly improving a network’s activity recognition.

The researchers’ module, called the Temporal Relation Network (TRN), learns how objects change in a video over time. It does so by analyzing a few key frames depicting an activity at different stages of the video — such as stacked objects that are then knocked down. Using the same process, it can then recognize the same type of activity in a new video.

In experiments, the module outperformed existing models by a large margin in recognizing hundreds of basic activities, such as poking objects to make them fall, tossing something in the air, and giving a thumbs-up. It also more accurately predicted what will happen next in a video — showing, for example, two hands making a small tear in a sheet of paper — given only a small number of early frames.

One day, the module could be used to help robots better understand what’s going on around them.

“We built an artificial intelligence system to recognize the transformation of objects, rather than the appearance of objects,” says Bolei Zhou, a former PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) who is now an assistant professor of computer science at the Chinese University of Hong Kong. “The system doesn’t go through all the frames — it picks up key frames and, using the temporal relation of frames, recognizes what’s going on. That improves the efficiency of the system and makes it run accurately in real time.”

Co-authors on the paper are CSAIL principal investigator Antonio Torralba, who is also a professor in the Department of Electrical Engineering and Computer Science; CSAIL Principal Research Scientist Aude Oliva; and CSAIL Research Assistant Alex Andonian.

Picking up key frames

Two common types of CNN models used for activity recognition today suffer from a tradeoff between efficiency and accuracy. One model is accurate but must analyze each video frame before making a prediction, which is computationally expensive and slow. The other type, called a two-stream network, is less accurate but more efficient. It uses one stream to extract features of one video frame, and then merges the results with “optical flows,” a stream of extracted information about the movement of each pixel. Optical flows are also computationally expensive to extract, so the model still isn’t that efficient.

“We wanted something that works in between those two models — getting efficiency and accuracy,” Zhou says.

The researchers trained and tested their module on three crowdsourced datasets of short videos of various performed activities. The first dataset, called Something-Something, built by the company TwentyBN, has more than 200,000 videos in 174 action categories, such as poking an object so it falls over or lifting an object. The second dataset, Jester, contains nearly 150,000 videos with 27 different hand gestures, such as giving a thumbs-up or swiping left. The third, Charades, built by Carnegie Mellon University researchers, has nearly 10,000 videos of 157 categorized activities, such as carrying a bike or playing basketball.

When given a video file, the researchers’ module simultaneously processes ordered frames — in groups of two, three, and four — spaced some time apart. Then it quickly assigns a probability that the object’s transformation across those frames matches a specific activity class. For instance, if it processes two frames, where the later frame shows an object at the bottom of the screen and the earlier shows the object at the top, it will assign a high probability to the activity class, “moving object down.” If a third frame shows the object in the middle of the screen, that probability increases even more, and so on. From this, it learns object-transformation features in frames that most represent a certain class of activity.
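To make the multi-frame idea concrete, the snippet below sketches a multi-scale temporal relation head in the spirit of TRN, written in PyTorch. It is not the authors’ released code: the feature dimension, frame count, hidden size, and class count are illustrative assumptions, and the published model subsamples frame groups rather than scoring every ordered combination the way this toy version does.

```python
import itertools
import torch
import torch.nn as nn

class TemporalRelationSketch(nn.Module):
    """Minimal sketch of a multi-scale temporal relation head.

    Assumes per-frame features have already been extracted by a CNN;
    the sizes below are illustrative, not the paper's settings.
    """
    def __init__(self, feat_dim=256, num_frames=8, num_classes=174,
                 scales=(2, 3, 4), hidden=256):
        super().__init__()
        self.scales = scales
        self.num_frames = num_frames
        # One small MLP per scale: it scores an ordered group of k frames.
        self.relation_mlps = nn.ModuleDict({
            str(k): nn.Sequential(
                nn.Linear(k * feat_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, num_classes),
            )
            for k in scales
        })

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim), frames in temporal order
        batch = frame_feats.size(0)
        logits = 0
        for k in self.scales:
            # Every ordered k-frame group (temporal order is preserved).
            for idx in itertools.combinations(range(self.num_frames), k):
                group = frame_feats[:, list(idx), :].reshape(batch, -1)
                logits = logits + self.relation_mlps[str(k)](group)
        return logits  # class scores summed over groups and scales

# Toy usage: 2 videos, 8 sampled frames each, 256-d features per frame.
feats = torch.randn(2, 8, 256)
print(TemporalRelationSketch()(feats).shape)  # torch.Size([2, 174])
```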

Recognizing and forecasting activities

In testing, a CNN equipped with the new module accurately recognized many activities using just two frames, and its accuracy increased as it sampled more frames. For Jester, the module achieved a top accuracy of 95 percent in activity recognition, beating out several existing models.

It even guessed right on ambiguous classifications: Something-Something, for instance, included actions such as “pretending to open a book” versus “opening a book.” To discern between the two, the module simply sampled a few more key frames, which revealed, for instance, a hand near a book in an early frame, then on the book, then moving away from the book in a later frame.

Some other activity-recognition models also process key frames but don’t consider the temporal relationships among those frames, which reduces their accuracy. The researchers report that their TRN module nearly doubles the accuracy of those key-frame models in certain tests.

The module also outperformed existing models at forecasting an activity, given limited frames. After processing the first 25 percent of frames, the module achieved accuracy several percentage points higher than a baseline model. With 50 percent of the frames, it achieved 10 to 40 percent higher accuracy. Examples include determining that a paper would be torn just a little, based on how two hands are positioned on the paper in early frames, and predicting that a raised hand, shown facing forward, would swipe down.

“That’s important for robotics applications,” Zhou says. “You want [a robot] to anticipate and forecast what will happen early on, when you do a specific action.”

Next, the researchers aim to improve the module’s sophistication. The first step is implementing object recognition together with activity recognition. Then, they hope to add in “intuitive physics,” meaning helping it understand real-world physical properties of objects. “Because we know a lot of the physics inside these videos, we can train the module to learn such physics laws and use those in recognizing new videos,” Zhou says. “We also open-source all the code and models. Activity understanding is an exciting area of artificial intelligence right now.”

Source – Author: MIT – mit.edu
Date/time: 14th September 2018, 21:13

A child is presented with a picture of various shapes and is asked to find the big red circle. To come to the answer, she goes through a few steps of reasoning: First, find all the big things; next, find the big things that are red; and finally, pick out the big red thing that’s a circle.

We learn through reason how to interpret the world. So, too, do neural networks. Now a team of researchers from MIT Lincoln Laboratory’s Intelligence and Decision Technologies Group has developed a neural network that performs human-like reasoning steps to answer questions about the contents of images. Named the Transparency by Design Network (TbD-net), the model visually renders its thought process as it solves problems, allowing human analysts to interpret its decision-making process. The model performs better than today’s best visual-reasoning neural networks.  

Understanding how a neural network comes to its decisions has been a long-standing challenge for artificial intelligence (AI) researchers. As the neural part of their name suggests, neural networks are brain-inspired AI systems intended to replicate the way that humans learn. They consist of input and output layers, and layers in between that transform the input into the correct output. Some deep neural networks have grown so complex that it’s practically impossible to follow this transformation process. That’s why they are referred to as “black box” systems, with their exact goings-on inside opaque even to the engineers who build them.

With TbD-net, the developers aim to make these inner workings transparent. Transparency is important because it allows humans to interpret an AI’s results.

It is important to know, for example, what exactly a neural network used in self-driving cars thinks the difference is between a pedestrian and a stop sign, and at what point along its chain of reasoning it sees that difference. These insights allow researchers to teach the neural network to correct any incorrect assumptions. But the TbD-net developers say the best neural networks today lack an effective mechanism for enabling humans to understand their reasoning process.

“Progress on improving performance in visual reasoning has come at the cost of interpretability,” says Ryan Soklaski, who built TbD-net with fellow researchers Arjun Majumdar, David Mascharka, and Philip Tran.

The Lincoln Laboratory group was able to close the gap between performance and interpretability with TbD-net. One key to their system is a collection of “modules,” small neural networks that are specialized to perform specific subtasks. When TbD-net is asked a visual reasoning question about an image, it breaks down the question into subtasks and assigns the appropriate module to fulfill each part. Like workers along an assembly line, each module builds off what the module before it has figured out to eventually produce the final, correct answer. As a whole, TbD-net uses one AI technique that interprets human language questions and breaks those sentences into subtasks, followed by multiple computer vision AI techniques that interpret the imagery.

Majumdar says: “Breaking a complex chain of reasoning into a series of smaller subproblems, each of which can be solved independently and composed, is a powerful and intuitive means for reasoning.”

Each module’s output is depicted visually in what the group calls an “attention mask.” The attention mask shows heat-map blobs over objects in the image that the module is identifying as its answer. These visualizations let the human analyst see how a module is interpreting the image.   

Take, for example, the following question posed to TbD-net: “In this image, what color is the large metal cube?” To answer the question, the first module locates large objects only, producing an attention mask with those large objects highlighted. The next module takes this output and finds which of those objects identified as large by the previous module are also metal. That module’s output is sent to the next module, which identifies which of those large, metal objects is also a cube. At last, this output is sent to a module that can determine the color of objects. TbD-net’s final output is “red,” the correct answer to the question. 
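That chain of modules can be illustrated with a deliberately simplified sketch. The real TbD-net modules are small neural networks that operate on image feature maps and produce heat-map attention masks; here each module is a plain Python function over a made-up symbolic scene, which keeps the composition idea visible without any trained weights. The scene, attribute names, and helper functions are all assumptions for the example.

```python
# Toy scene standing in for an image of several objects.
scene = [
    {"size": "large", "material": "metal",  "shape": "cube",   "color": "red"},
    {"size": "large", "material": "rubber", "shape": "sphere", "color": "blue"},
    {"size": "small", "material": "metal",  "shape": "cube",   "color": "green"},
]

def attend(attribute, value):
    """Return a module that keeps only objects matching one attribute.

    The 'attention mask' is a list of 0/1 weights, one per object,
    mirroring the heat-map masks the neural modules produce."""
    def module(mask):
        return [m * (1 if obj[attribute] == value else 0)
                for m, obj in zip(mask, scene)]
    return module

def query_color(mask):
    """Read out the color of the single attended object."""
    attended = [obj for m, obj in zip(mask, scene) if m == 1]
    assert len(attended) == 1, "the question should isolate one object"
    return attended[0]["color"]

# "What color is the large metal cube?" compiled into a chain of subtasks.
program = [attend("size", "large"),
           attend("material", "metal"),
           attend("shape", "cube")]

mask = [1] * len(scene)      # start by attending to everything
for module in program:
    mask = module(mask)      # each module refines the previous mask
print(query_color(mask))     # -> red
```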

When tested, TbD-net achieved results that surpass the best-performing visual reasoning models. The researchers evaluated the model using a visual question-answering dataset consisting of 70,000 training images and 700,000 questions, along with test and validation sets of 15,000 images and 150,000 questions. The initial model achieved 98.7 percent test accuracy on the dataset, which, according to the researchers, far outperforms other neural module network–based approaches.

Importantly, the researchers were able to then improve these results because of their model’s key advantage — transparency. By looking at the attention masks produced by the modules, they could see where things went wrong and refine the model. The end result was a state-of-the-art performance of 99.1 percent accuracy.

“Our model provides straightforward, interpretable outputs at every stage of the visual reasoning process,” Mascharka says.

Interpretability is especially valuable if deep learning algorithms are to be deployed alongside humans to help tackle complex real-world tasks. To build trust in these systems, users will need the ability to inspect the reasoning process so that they can understand why and how a model could make wrong predictions. 

Paul Metzger, leader of the Intelligence and Decision Technologies Group, says the research “is part of Lincoln Laboratory’s work toward becoming a world leader in applied machine learning research and artificial intelligence that fosters human-machine collaboration.”

The details of this work are described in the paper, “Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning,” which was presented at the Conference on Computer Vision and Pattern Recognition (CVPR) this summer.

Source – Author: MIT – mit.edu
Date/time: 12th September 2018, 09:08

Artists may soon have at their disposal a new MIT-developed tool that could help them create digital characters, logos, and other graphics more quickly and easily. 

Many digital artists rely on image vectorization, a technique that converts a pixel-based image into an image comprising groupings of clearly defined shapes. In this technique, points in the image are connected by lines or curves to construct the shapes. Among other perks, vectorized images maintain the same resolution when either enlarged or shrunk down.

To vectorize an image, artists often have to hand-trace each stroke using specialized software, such as Adobe Illustrator, which is laborious. Another option is using automated vectorization tools in those software packages. Often, however, these tools lead to numerous tracing errors that take more time to rectify by hand. The main culprit: mismatches at intersections where curves and lines meet.

In a paper being published in the journal ACM Transactions on Graphics, MIT researchers detail a new automated vectorization algorithm that traces intersections without error, greatly reducing the need for manual revision. Powering the tool is a modified version of a new mathematical technique in the computer-graphics community, called “frame fields,” used to guide tracing of paths around curves, sharp corners, and messy parts of drawings where many lines intersect.

The tool could save digital artists significant time and frustration. “A rough estimate is that it could save 20 to 30 minutes over automated tools, which is substantial when you think about animators who work with multiple sketches,” says first author Mikhail Bessmeltsev, a former Computer Science and Artificial Intelligence Laboratory (CSAIL) postdoctoral associate who is now an assistant professor at the University of Montreal. “The hope is to make automated vectorization tools more practical for artists who care about the quality of their work.”

Co-author on the paper is Justin Solomon, an assistant professor in CSAIL and in the Department of Electrical Engineering and Computer Science, and a principal investigator in the Geometric Data Processing Group.

Guiding the lines

Many modern tools used to model 3-D shapes directly from artist sketches, including Bessmeltsev’s previous research projects, require vectorizing the drawings first. Automated vectorization “never worked for me, so I got frustrated,” he says. Those tools, he says, are fine for rough alignments but aren’t designed for precision: “Imagine you’re an animator and you drew a couple frames of animation. They’re pretty clean sketches, and you want to edit or color them on a computer. For that, you really care how well your vectorization aligns with your pencil drawing.”

Many errors, he noted, come from misalignment between the original and vectorized image at junctions where two curves meet — in a type of “X” junction — and where one line ends at another — in a “T” junction. Previous research and software used models incapable of aligning the curves at those junctions, so Bessmeltsev and Solomon took on the task.

The key innovation came from using frame fields to guide tracing. Frame fields assign two directions to each point of a 2-D or 3-D shape. These directions overlay a basic structure, or topology, that can guide geometric tasks in computer graphics. Frame fields have been used, for instance, to restore destroyed historical documents and to convert triangle meshes — networks of triangles covering a 3-D shape — into quadrangle meshes — grids of four-sided shapes. Quad meshes are commonly used to create computer-generated characters in movies and video games, and for computer-aided design (CAD) for better real-world design and simulation.

Bessmeltsev, for the first time, applied frame fields to image vectorization. His frame fields assign two directions to every dark pixel on an image. These keep track of the tangent directions of nearby drawn curves — the directions in which the curves are heading at a given point. That means, at every intersection of a drawing, the two directions of the frame field align with the directions of the intersecting curves. This drastically reduces the roughness, or noise, surrounding intersections, which usually makes them difficult to trace.

“At a junction, all you have to do is follow one direction of the frame field and you get a smooth curve. You do that for every junction, and all junctions will then be aligned properly,” Bessmeltsev says.

Cleaner vectorization

When given a raster (pixel-based) 2-D drawing with one color per pixel, the tool assigns each dark pixel a cross that indicates two directions. Starting at some pixel, it first chooses a direction to trace. Then, it traces the vector path along the pixels, following those directions. After tracing, the tool creates a graph capturing connections between the solid strokes in the drawn image. Using this graph, the tool matches the necessary lines and curves to those strokes and automatically vectorizes the image.
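The tracing step itself can be sketched in a few lines. The toy example below assumes the frame field has already been computed (the optimization that produces it is the heart of the paper and is omitted here), and a simple greedy tracer stands in for the actual path-extraction algorithm; the stroke, field values, and function names are illustrative only.

```python
import numpy as np

# A 20x20 toy image containing one horizontal stroke of dark pixels.
H, W = 20, 20
mask = np.zeros((H, W), dtype=bool)
mask[10, 2:18] = True

# Hypothetical frame field: at every dark pixel, one direction runs along
# the stroke and the other runs across it.
field = np.zeros((H, W, 2, 2))
field[mask] = np.array([[1.0, 0.0],   # along-stroke direction (dx, dy)
                        [0.0, 1.0]])  # cross-stroke direction

def trace(start, heading, steps=40):
    """Greedily follow the frame field from a starting dark pixel."""
    y, x = start
    path = [(y, x)]
    for _ in range(steps):
        # Four candidates: both frame-field directions and their opposites.
        dirs = np.concatenate([field[y, x], -field[y, x]])
        d = dirs[np.argmax(dirs @ heading)]   # stay closest to current heading
        ny, nx = int(round(y + d[1])), int(round(x + d[0]))
        if not (0 <= ny < H and 0 <= nx < W and mask[ny, nx]):
            break                              # ran off the stroke
        heading = d
        y, x = ny, nx
        path.append((y, x))
    return path

# Trace the stroke from its left end, heading to the right.
print(trace(start=(10, 2), heading=np.array([1.0, 0.0])))
```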

In their paper, the researchers demonstrated their tool on various sketches, such as cartoon animals, people, and plants. The tool cleanly vectorized all intersections that were traced incorrectly using traditional tools. With traditional tools, for instance, lines around facial features, such as eyes and teeth, didn’t stop where the original lines did or ran through other lines.

One example in the paper shows pixels making up two slightly curved lines leading to the tip of a hat worn by a cartoon elephant. There’s a sharp corner where the two lines meet. Each dark pixel contains a cross that’s straight or slightly slanted, depending on the curvature of the line. Using those cross directions, the tracer could easily follow the line as it swooped around the sharp turn.

“Many artists still enjoy and prefer to work with real media (for example, pen, pencil, and paper). … The problem is that the scanning of such content into the computer often results in a severe loss of information,” says Nathan Carr, a principal researcher in computer graphics at Adobe Systems Inc., who was not involved in the research. “[The MIT] work relies on a mathematical construct known as ‘frame fields,’ to clean up and disambiguate scanned sketches to gain back this loss of information. It’s a great application of using mathematics to facilitate the artistic workflow in a clean well-formed manner. In summary, this work is important, as it aids in the ability for artists to transition between the physical and digital realms.”

Next, the researchers plan to augment the tool with a temporal-coherence technique, which extracts key information from adjacent animation frames. The idea would be to vectorize the frames simultaneously, using information from one to adjust the line tracing on the next, and vice versa. “Knowing the sketches don’t change much between the frames, the tool could improve the vectorization by looking at both at the same time,” Bessmeltsev says.

Source – Author: MIT – mit.edu
Date/time: 11th September 2018, 21:03

Humans have long been masters of dexterity, a skill that can largely be credited to the help of our eyes. Robots, meanwhile, are still catching up.

Certainly there’s been some progress: For decades, robots in controlled environments like assembly lines have been able to pick up the same object over and over again. More recently, breakthroughs in computer vision have enabled robots to make basic distinctions between objects. Even then, though, the systems don’t truly understand objects’ shapes, so there’s little the robots can do after a quick pick-up.  

In a new paper, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) say they’ve made a key development in this area of work: a system that lets robots inspect random objects, and visually understand them enough to accomplish specific tasks without ever having seen them before.

The system, called Dense Object Nets (DON), looks at objects as collections of points that serve as a sort of visual roadmap. This approach lets robots better understand and manipulate items and, most importantly, allows them to pick out a specific object from a clutter of similar ones — a valuable skill for the kinds of machines that companies like Amazon and Walmart use in their warehouses.

For example, someone might use DON to get a robot to grab onto a specific spot on an object, say, the tongue of a shoe. From that, it can look at a shoe it has never seen before, and successfully grab its tongue.

“Many approaches to manipulation can’t identify specific parts of an object across the many orientations that object may encounter,” says PhD student Lucas Manuelli, who wrote a new paper about the system with lead author and fellow PhD student Pete Florence, alongside MIT Professor Russ Tedrake. “For example, existing algorithms would be unable to grasp a mug by its handle, especially if the mug could be in multiple orientations, like upright, or on its side.”

The team views potential applications not just in manufacturing settings, but also in homes. Imagine giving the system an image of a tidy house, and letting it clean while you’re at work, or using an image of dishes so that the system puts your plates away while you’re on vacation.

What’s also noteworthy is that none of the data was actually labeled by humans. Instead, the system is what the team calls “self-supervised,” not requiring any human annotations.

Two common approaches to robot grasping involve either task-specific learning, or creating a general grasping algorithm. These techniques both have obstacles: Task-specific methods are difficult to generalize to other tasks, and general grasping doesn’t get specific enough to deal with the nuances of particular tasks, like putting objects in specific spots.

The DON system, however, essentially creates a series of coordinates on a given object, which serve as a kind of visual roadmap, to give the robot a better understanding of what it needs to grasp, and where.

The team trained the system to look at objects as a series of points that make up a larger coordinate system. It can then map different points together to visualize an object’s 3-D shape, similar to how panoramic photos are stitched together from multiple photos. After training, if a person specifies a point on an object, the robot can take a photo of that object, and identify and match points in order to pick up the object at that specified point.
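The lookup in that last step can be sketched briefly. The example below assumes the descriptor images have already been produced by a trained DON-style network, so random arrays stand in for them, and the image size, descriptor dimension, and pixel coordinates are made-up values; it illustrates nearest-descriptor matching rather than the team’s actual codebase.

```python
import numpy as np

# Stand-ins for the network's output: a D-dimensional descriptor per pixel.
H, W, D = 480, 640, 16                      # illustrative sizes
ref_descriptors = np.random.rand(H, W, D)   # reference image of the object
new_descriptors = np.random.rand(H, W, D)   # same object, new image and pose

def find_correspondence(ref_desc, new_desc, ref_pixel):
    """Find the pixel in the new image whose descriptor best matches
    the descriptor at ref_pixel in the reference image."""
    target = ref_desc[ref_pixel]                         # (D,) vector
    dists = np.linalg.norm(new_desc - target, axis=-1)   # (H, W) distances
    return np.unravel_index(np.argmin(dists), dists.shape)

ref_pixel = (200, 320)   # e.g. a user's click on the shoe's tongue
grasp_pixel = find_correspondence(ref_descriptors, new_descriptors, ref_pixel)
print(grasp_pixel)       # where the robot would aim its grasp in the new image
```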

This is different from systems like UC Berkeley’s DexNet, which can grasp many different items but can’t satisfy a specific request. Imagine an 18-month-old child, who doesn’t understand which toy you want it to play with but can still grab lots of items, versus a four-year-old who can respond to “go grab your truck by the red end of it.”

In one set of tests done on a soft caterpillar toy, a Kuka robotic arm powered by DON could grasp the toy’s right ear from a range of different configurations. This showed that, among other things, the system has the ability to distinguish left from right on symmetrical objects.

When testing on a bin of different baseball hats, DON could pick out a specific target hat despite all of the hats having very similar designs — and having never seen pictures of the hats in training data before.

“In factories robots often need complex part feeders to work reliably,” says Florence. “But a system like this that can understand objects’ orientations could just take a picture and be able to grasp and adjust the object accordingly.”

In the future, the team hopes to improve the system to a place where it can perform specific tasks with a deeper understanding of the corresponding objects, like learning how to grasp an object and move it with the ultimate goal of, say, cleaning a desk.

The team will present their paper on the system next month at the Conference on Robot Learning in Zürich, Switzerland.

Source – Author: MIT – mit.edu
Date/time: 10th September 2018, 21:03

Each year MIT professors invite thousands of undergraduates into their labs to work on cutting-edge research through MIT’s Undergraduate Research Opportunities Program (UROP). Starting this fall, the MIT Quest for Intelligence will add a suite of new projects to the mix, allowing students to explore the latest ideas and applications in human and machine learning.

Through the generosity of several sponsors — including former Alphabet executive chairman Eric Schmidt and his wife, Wendy; the MIT-IBM Watson AI Lab; and the MIT-SenseTime Alliance on Artificial Intelligence — The Quest will fund up to 100 students to participate in Quest-themed UROP projects each semester.

“We’re going to advance the frontiers of brain science and artificial intelligence by harnessing the brain power of our students,” says The Quest’s director, Antonio Torralba, a professor of electrical engineering and computer science who also heads the MIT-IBM Watson AI Lab. “We thank our partners for the funding that has made these new student research positions possible.”

MIT President L. Rafael Reif launched The Quest in February, framing its mission in a pair of questions: “How does human intelligence work, in engineering terms? And how can we use that deep grasp of human intelligence to build wiser and more useful machines to benefit society?”

To answer those questions, The Quest brings together more than 250 MIT researchers in artificial intelligence, cognitive science, neuroscience, social sciences, and ethics. Organized to ensure that breakthroughs in the lab are matched by the creation of useful tools for everyday people, The Quest’s advances might include new insights into how humans and machines learn, or new technologies for diagnosing and treating disease, discovering new drugs and materials, and designing safer automated systems.

The Quest has been met with enthusiasm by the MIT community and prominent technology leaders, many of whom spoke at a kick-off event in Kresge Auditorium in March. “I think MIT is uniquely positioned to do this,” said Eric Schmidt, a founding advisor to The Quest and a current MIT Innovation Fellow. “I think you can turn Cambridge into a genuine AI center.”

Now in its 49th year, UROP allows undergraduates to work closely with faculty, graduate students and other classmates on original research. More than 91 percent of graduating seniors participate in at least one UROP project in their time at MIT, with about 2,600 students participating each year. Students gain experience in their major or exposure to a new field, and practice writing proposals and communicating their results.

Source – Author: MIT – mit.edu
Date/time: 8th September 2018, 09:08