For the past several years, GumGum has participated as an industry sponsor in the RIPS program, hosted by the Institute for Pure and Applied Mathematics (IPAM) at UCLA. The Research in Industrial Projects for Students (RIPS) program gives undergraduate students an opportunity to work on a real-world research project. This year, the proposed project was Markerless Augmented Advertising for Sports Video. A video of the full pipeline in action can be found here: https://www.youtube.com/watch?v=ugZ-08c6IWY
The aim of this project is to leverage regions of the video frame and repurpose them for augmented reality. Consider a sporting event like the one shown below: the focus is usually on the field, where the action is taking place, rather than on the crowd watching the game. For this very reason, the crowd region in such a frame is highly desirable for overlaying advertisements, providing an uninterrupted advertising experience.
The challenge involves a) identifying an actionable region, b) placing the advertisement in a natural perspective in accordance with the scene, c) not requiring the presence of a known calibration object, and lastly d) performing all of the above in an automatic way.
The output would then be as follows:
The conceptual diagram below shows our proposed end-to-end pipeline for the overlay of advertisements in a non-intrusive and engaging fashion.
For any given input video, the first step is to isolate the frame that contains the texture of interest and determine whether that texture is appropriate for overlaying an advertisement. In the sporting event example above, the texture of interest is the crowd region. We use semantic segmentation to decompose a scene into its various components with pixel-level accuracy. We turned to a CNN-based segmentation technique, the Pyramid Scene Parsing Network (PSPNet), which has been shown to be competitive in various scene-parsing challenges. The model we used was trained on the ADE20K dataset, which covers 150 segmentation classes; out of the supported classes, we created a subset from “person” and “grandstand”, chosen to be most representative of crowd-like textures.
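As a sketch of this step, a binary crowd mask can be built from the dense label map that a segmentation network predicts. The class indices below are placeholders (the real ones depend on the ADE20K label map of the trained model), and the tiny label array stands in for an actual PSPNet prediction:

```python
import numpy as np

# Hypothetical ADE20K class indices for the crowd-like subset; the real
# indices depend on the label map used by the trained PSPNet model.
CROWD_CLASSES = [12, 51]  # stand-ins for "person" and "grandstand"

def crowd_mask(label_map: np.ndarray) -> np.ndarray:
    """Binary mask of pixels whose predicted class is crowd-like."""
    return np.isin(label_map, CROWD_CLASSES)

# Toy 4x4 label map standing in for a PSPNet prediction.
labels = np.array([[0, 12, 12, 0],
                   [0, 12, 51, 51],
                   [3,  3, 51, 51],
                   [3,  3,  3,  0]])
mask = crowd_mask(labels)
coverage = mask.mean()  # fraction of the frame covered by crowd-like pixels
```

The coverage fraction is one cheap signal for deciding whether a frame is worth segmenting further.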
Once the seed frame is identified, the suitability of the segmentation is assessed by computing a Segmentation Quality Score (SQS). Ideally, the segmented region should be a) fully connected, b) free of holes, and c) compact. These criteria are measured by the component, completeness, and shape scores respectively, and the overall SQS is the product of all three. Low SQS values indicate good segmentations, while large values indicate poor ones; the minimum possible SQS is 1.
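The exact score definitions are the project's own; as a hedged illustration, a product score with the stated properties (three terms, each equal to 1 in the ideal case and larger otherwise) could be assembled like this, using a connected-component count, a hole ratio, and an isoperimetric compactness term:

```python
import math

def segmentation_quality_score(num_components: int,
                               area: float,
                               filled_area: float,
                               perimeter: float) -> float:
    """Illustrative SQS: product of three scores, each of which equals 1
    in the ideal case and grows as the segmentation degrades."""
    component = float(num_components)                     # 1 iff fully connected
    completeness = filled_area / area                     # 1 iff no holes
    shape = perimeter ** 2 / (4 * math.pi * filled_area)  # 1 for a disc
    return component * completeness * shape

# A single hole-free disc of radius r scores the minimum value of 1:
r = 50.0
disc = segmentation_quality_score(1, math.pi * r ** 2, math.pi * r ** 2,
                                  2 * math.pi * r)
```

A fragmented region (two components) with holes scores strictly higher, so thresholding this product filters out poor segmentations.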
3D Scene Reconstruction
In order to place the asset on the segmented texture of interest in a natural perspective with respect to the original image frame, we obtain a depth map of that region relative to the camera. This lets us identify the dominant plane coinciding with the crowd region for asset placement. Given the relative depth map, we can extrapolate a 3D point cloud representing the crowd region in three-dimensional space. Since we are trying to do all of this programmatically, we estimate the focal length of the camera based on vanishing point detection.
We use MegaDepth, a convolutional neural network that predicts a dense relative depth map from a single monocular RGB image. We normalize the relative depth values to lie between 0 and 1. Next, to create a point cloud for planar surface estimation, we plot each of the output pixels from the depth map along the z-axis at the same (x, y) location. To avoid unnecessary calculations, we filter the depth map with the segmentation mask provided by PSPNet for the relevant classes. Then, we use RANSAC to detect actionable planes in this filtered depth map (corresponding only to our segment of interest) on which to place our desired asset.
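A minimal numpy sketch of this stage, with synthetic data standing in for the MegaDepth and PSPNet outputs: each masked pixel is plotted at its (x, y) location with its depth value along the z-axis, and a basic RANSAC loop recovers the dominant plane. Thresholds and iteration counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def depth_to_points(depth, mask):
    """Place each masked pixel at (x, y, z) = (u, v, depth[v, u]), i.e.
    plot the depth value along the z-axis at the pixel's location."""
    v, u = np.nonzero(mask)
    return np.column_stack([u, v, depth[v, u]]).astype(float)

def ransac_plane(points, iters=200, tol=1e-6):
    """Fit the dominant plane n.p = d with a basic RANSAC loop."""
    best_inliers = np.zeros(len(points), dtype=bool)
    best_n, best_d = np.array([0.0, 0.0, 1.0]), 0.0
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / norm
        d = n @ sample[0]
        inliers = np.abs(points @ n - d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_n, best_d = inliers, n, d
    return best_n, best_d, best_inliers

# Synthetic stand-in for a MegaDepth output: depth grows linearly toward
# the bottom rows, so the masked pixels form a single slanted plane.
h, w = 40, 60
vv, uu = np.mgrid[0:h, 0:w]
depth = 0.3 + 0.01 * vv
mask = np.ones((h, w), dtype=bool)  # stand-in for the PSPNet crowd mask
pts = depth_to_points(depth, mask)
n, d, inliers = ransac_plane(pts)
```

On real depth maps the inlier tolerance would be tuned to the noise level, and a library implementation (e.g. Open3D's plane segmentation) would replace the hand-rolled loop.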
With regard to asset placement, we would like the asset to appear aligned with the intersection of the crowd plane and the ground (field) plane, so that it looks “natural”. In the figure shown below, both asset placements are technically correct, but only one of them (the image on the right) makes for a harmonious viewing experience. It is with this consideration in mind that we align the asset along the dividing line between the ground and crowd planes.
To specify the orientation of the asset, we use a vector parallel to the surface of the crowd plane as the bottom edge of a rectangular asset. After some trial and error in computing the intersection of the crowd and ground planes, we settled on a combination of edge detection (using Canny) and 2D RANSAC to identify the alignment line. This 2D line y = ax + b lives in image coordinates; we then determine the 3D plane of points that projects onto this line in the image plane. We compute the equation of this “alignment-line plane” using 2D-to-3D perspective projection, and find a vector v that lies on both this plane and the crowd plane fit by RANSAC.
To find a second vector on the surface of the crowd plane (v being the first), we take the cross product of v and the plane normal n, yielding a new vector u that is perpendicular to both v and n and therefore parallel to the plane. We use u and v as the edges of the rectangular asset, scaled to the asset's desired length and width.
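The vector algebra of this step is compact enough to sketch directly. The normal, line direction, anchor point, and dimensions below are illustrative values, not outputs of the real pipeline:

```python
import numpy as np

# Crowd-plane normal (from the RANSAC fit) and the direction v of the
# crowd/ground alignment line lying on the plane (illustrative values).
n = np.array([0.0, 0.0, 1.0])  # plane normal
v = np.array([1.0, 0.0, 0.0])  # alignment-line direction on the plane

# Second in-plane edge direction: perpendicular to both v and n.
u = np.cross(v, n)
u = u / np.linalg.norm(u)

# Scale the two directions to the asset's width and height, anchored
# at a corner point p0 on the plane, to get the four asset corners.
p0 = np.array([0.0, 0.0, 5.0])
width, height = 4.0, 1.0
corners = np.array([p0,
                    p0 + width * v,
                    p0 + width * v + height * u,
                    p0 + height * u])
```

Since u and v are both orthogonal to the normal, all four corners stay on the plane by construction.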
To ensure the asset covers as large an area of the relevant class pixels as possible (here, “person” or “grandstand”), we fit a convex hull to the inliers of the crowd plane while making sure all four corners of the asset lie within it. We keep the aspect ratio of the overlaid asset equal to that of the original asset, and use a homography transformation to map the coordinate system of the asset onto that of the target crowd plane.
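Given the four corner correspondences, the homography can be solved with a standard direct linear transform (OpenCV's `getPerspectiveTransform` does the same job); the corner coordinates below are made up for illustration:

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct linear transform: solve for the 3x3 H mapping src[i] -> dst[i]
    from four 2-D point correspondences (H is defined up to scale)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    H = Vt[-1].reshape(3, 3)   # null vector of A, reshaped
    return H / H[2, 2]

def apply_h(H, pt):
    """Apply H to a 2-D point using homogeneous coordinates."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]

# Asset corners in its own pixel coordinate system ...
asset = [(0, 0), (400, 0), (400, 100), (0, 100)]
# ... and the projected quadrilateral on the crowd plane in the frame
# (hypothetical coordinates).
target = [(120, 300), (520, 320), (500, 400), (140, 390)]
H = homography_from_points(asset, target)
```

Warping the full asset image with this H (e.g. via `cv2.warpPerspective`) then pastes it into the frame in the target quadrilateral.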
To place an asset in a video, the asset must meet several requirements: its placement should remain perspective-correct across subsequent frames, and it should exhibit minimal drift or distortion. We implemented an optical-flow-based tracking algorithm inspired by Lucas-Kanade, which tracks the four anchor corners of a quadrilateral around the asset. Empirically, this was found to perform much better than tracking the points within the asset boundary themselves. Key points were determined through feature extraction, using a combination of SIFT, SURF, and KAZE feature descriptors.
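In practice this stage would use a pyramidal tracker such as OpenCV's `calcOpticalFlowPyrLK`; as a self-contained illustration of the underlying idea, here is a single-window Lucas-Kanade step that recovers a small translation from image gradients, on a synthetic image pair:

```python
import numpy as np

def lk_translation(i1, i2):
    """One Lucas-Kanade step: estimate the translation (dx, dy) moving
    image i1 onto i2, via least squares on the brightness-constancy
    equation  Ix*dx + Iy*dy = -It."""
    gy, gx = np.gradient(i1)       # d/drow, d/dcol of the first image
    it = i2 - i1                   # temporal difference
    A = np.column_stack([gx.ravel(), gy.ravel()])
    b = -it.ravel()
    (dx, dy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx, dy

def pattern(x, y):
    """Smooth synthetic texture, evaluable at sub-pixel coordinates."""
    return np.sin(0.2 * x) + np.cos(0.25 * y)

yy, xx = np.mgrid[0:64, 0:64].astype(float)
shift = (0.4, 0.2)                          # ground-truth (dx, dy)
i1 = pattern(xx, yy)
i2 = pattern(xx - shift[0], yy - shift[1])  # same scene shifted by `shift`
dx, dy = lk_translation(i1, i2)
```

The single-step estimate is only valid for sub-pixel motion; pyramidal, iterative variants (as used in the actual tracker) handle larger displacements.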
Since we will be overlaying assets on crowd regions in videos of sporting events, we recognize that these events are commonly filmed using multiple cameras, from various vantage points. The shot changes between different cameras can occur rapidly and frequently throughout.
We also account for shot changes. When a shot change is detected, our algorithm considers the number of features that differ between subsequent frames and decides whether it is necessary to restart the entire system, i.e., re-initialize and identify a new plausible region of interest for asset placement. On a similar note, the tracking algorithm itself can trigger a restart: for example, if one of the corners of the quadrilateral can no longer be matched with the previous frame, tracking may be suspended. To handle this edge case, we predict the position of the corner that goes out of frame from the positions of the remaining corners and prior knowledge about the quadrilateral's corner positions, using a Kalman filter.
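A minimal constant-velocity Kalman filter for a single corner might look like the sketch below; the motion model and noise covariances are illustrative assumptions, not the project's actual parameters:

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal Kalman filter for one 2-D corner: state = [x, y, vx, vy].
    predict() is called every frame; update() only when the tracker
    actually observes the corner."""
    def __init__(self, x, y):
        self.s = np.array([x, y, 0.0, 0.0])   # initial state, zero velocity
        self.P = np.eye(4) * 10.0             # initial state covariance
        self.F = np.array([[1, 0, 1, 0],      # constant-velocity transition
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],      # we observe position only
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 1e-3             # process noise (assumed)
        self.R = np.eye(2) * 1.0              # measurement noise (assumed)

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.s   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

# Track a corner drifting right at 2 px/frame, then lose it for 3 frames.
kf = ConstantVelocityKF(100.0, 50.0)
for t in range(1, 11):                  # observed frames
    kf.predict()
    kf.update((100.0 + 2.0 * t, 50.0))
for _ in range(3):                      # corner out of frame: predict only
    pred = kf.predict()
```

After the observed frames the filter has locked onto the corner's velocity, so the predict-only steps extrapolate its off-screen position, which can then stand in for the lost corner when re-fitting the quadrilateral.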
At the end of the 10-week period assigned to the program, we had two pipelines in place: one for overlaying an asset on an image, and one for overlaying it on a video. The video pipeline can even overlay a video asset, provided the frame rates of the asset and target videos match so that the result remains aesthetically pleasing.
During the program, the students worked hard to meet project deadlines and milestones with our GumGum Mentors (@Divyaa @Iris @Cambron). They came up with their own algorithms to tackle various obstacles and provided new insights to solve numerous technical issues. At the end of the program, they presented their work to our CTO and the rest of the Computer Vision team.
The progress that the team achieved resulted in a conference paper accepted at the 1st International Workshop on Advanced Machine Vision for Real-life and Industrially Relevant Applications, held as part of the Asian Conference on Computer Vision 2018 in Perth, Australia.
We were also selected for a poster presentation at the same workshop. GumGum sponsored two of the students to present their poster in Perth in December during the conference, and they won the Best Poster Award!
The proceedings of the conference will be published in a special edition of Springer’s Machine Vision and Applications Journal later in 2019.
Congratulations to Iuliana Tabian (Imperial College London), Emmanuel Antonio Cuevas (Universidad de Guanajuato), Osman Akar (UCLA), Hallee Wong (Williams College) and the CV team!