Selected Projects

SVAM: Saliency-guided Visual Attention Modeling

ArXiv   Bibliography   GitHub   USOD Test Dataset   Video Demo

Where to look? This is an intriguing problem in computer vision that deals with finding interesting or salient pixels in an image or video. As seen in this GANGNAM video, the problem of Salient Object Detection (SOD) aims at identifying the most important or distinct objects in a scene. It is a successor to the human fixation prediction problem, which aims to highlight the pixels that human viewers would focus on at first glance.

For visually-guided robots, the SOD capability enables them to model spatial attention and eventually make important navigation decisions. In this project, we present a holistic approach to saliency-guided visual attention modeling (SVAM) for use by autonomous underwater robots. Our proposed model, named SVAM-Net, integrates deep visual features at various scales and semantics for effective SOD in natural underwater images. The SVAM-Net architecture is configured to jointly accommodate bottom-up and top-down learning within two separate branches of the network while sharing the same encoding layers. We design dedicated spatial attention modules (SAMs) along these learning pathways to exploit the coarse-level and top-level semantic features for SOD at four stages of abstraction.

Specifically, the bottom-up pipeline extracts semantically rich features from early encoding layers, which facilitates an abstract yet accurate saliency prediction at a fast rate; we denote this decoupled bottom-up pipeline as SVAM-NetLight. In addition, a residual refinement module (RRM) ensures fine-grained saliency estimation through the deeper top-down pipeline. Check out this repository for the detailed SVAM-Net model architecture and its holistic learning pipeline.

In the implementation, we incorporate comprehensive end-to-end supervision of SVAM-Net by large-scale diverse training data consisting of both terrestrial and underwater imagery. Subsequently, we validate the effectiveness of its learning components and various loss functions by extensive ablation experiments. In addition to using existing datasets, we release a new challenging test set named USOD for the benchmark evaluation of SVAM-Net and other underwater SOD models. By a series of qualitative and quantitative analyses, we show that SVAM-Net provides SOTA performance for SOD on underwater imagery, exhibits significantly better generalization performance on challenging test cases than existing solutions, and achieves fast end-to-end inference on single-board devices. Moreover, we demonstrate that a delicate balance between robust performance and computational efficiency makes SVAM-NetLight suitable for real-time use by visually-guided underwater robots. Please refer to the paper for detailed results; a video demonstration can be seen here.


Deep SESR: Simultaneous Enhancement and Super-Resolution

Paper (RSS 2020)   ArXiv   Bibliography   GitHub   UFO-120 Dataset   Spotlight Talk

In this project, we introduce the SESR (simultaneous enhancement and super-resolution) problem and provide an efficient solution for underwater imagery. Specifically, we present Deep SESR, a residual-in-residual network-based model that learns to restore perceptual image qualities for up to 4x higher spatial resolution.

To supervise the large-scale training, we formulate a multi-modal objective function that addresses the chrominance-specific underwater color degradation, lack of image sharpness, and loss in high-level feature representation. The model is also supervised to learn salient foreground regions in the image, which in turn guides the network to learn global contrast enhancement. Our Spotlight Talk has more details!

Moreover, we present UFO-120, the first dataset to facilitate large-scale SESR learning; it contains over 1500 training samples and a benchmark test set of 120 samples. By thorough experimental evaluation on UFO-120 and several other standard datasets, we demonstrate that Deep SESR outperforms the existing solutions for underwater image enhancement and super-resolution. We also validate its generalization performance on several test cases that include underwater images with diverse spectral and spatial degradation levels, and also terrestrial images with unseen natural objects. See this video for some demonstrations and refer to the paper for detailed results.


FUnIE-GAN: Fast Underwater Image Enhancement for Improved Perception

Paper (RA-L)   ArXiv   Bibliography   GitHub   EUVP Dataset

In this project, we design a fully-convolutional conditional GAN-based model for fast underwater image enhancement: FUnIE-GAN. We supervise its adversarial training by formulating an objective function that evaluates the perceptual image quality based on its global content, color, local texture, and style information. We also present EUVP, a large-scale dataset comprising paired and unpaired collections of underwater images (of poor and good quality) that were captured using seven different cameras over various visibility conditions during oceanic explorations and human-robot collaborative experiments. The dataset and relevant information can be found here.
    Latest: FUnIE-GAN is now running on our Aqua-8 MinneBot
    ...and soon to be ported on the LoCO AUV

Through thorough experiments, we demonstrate that FUnIE-GAN can learn to enhance perceptual image quality from both paired and unpaired training. More importantly, the enhanced images significantly boost the performance of several underwater visual perception tasks such as object detection, human pose estimation, and saliency prediction. In addition to providing state-of-the-art image enhancement performance, FUnIE-GAN offers an inference rate of 148 FPS on an Nvidia GTX 1080, 48 FPS on a Jetson AGX Xavier, and over 25 FPS on a Jetson TX2. Such fast run-times, particularly on the single-board platforms, make it ideal for real-time use in robotic applications. Check out this repository for the FUnIE-GAN models and associated training pipelines.
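To put these rates in per-frame terms, the sketch below converts the FPS figures quoted above into latency budgets and shows one simple way to time a model. The `latency_ms` and `measure_fps` helpers and the dummy workload are illustrative assumptions; they are not part of the FUnIE-GAN code.

```python
import time

# Inference rates quoted above, in frames per second.
# The Jetson TX2 figure is a lower bound ("over 25 FPS").
REPORTED_FPS = {"GTX 1080": 148, "Jetson AGX Xavier": 48, "Jetson TX2": 25}


def latency_ms(fps):
    """Per-frame time budget in milliseconds for a given frame rate."""
    return 1000.0 / fps


def measure_fps(infer, frames, warmup=2):
    """Time a callable over a list of frames (hypothetical helper)."""
    for frame in frames[:warmup]:  # warm-up calls are excluded from timing
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    for device, fps in REPORTED_FPS.items():
        print(f"{device}: {latency_ms(fps):.2f} ms per frame")
    # Stand-in workload; swap in a real model's forward pass to time it.
    frames = [list(range(1000))] * 50
    print(f"dummy workload: {measure_fps(sum, frames):.0f} FPS")
```

A 25 FPS rate leaves a 40 ms budget per frame, which is a common reference point when judging whether an enhancement step fits in front of other perception modules on the same board.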


R2R-OpenPose: Robot-to-robot Relative Pose Estimation from Human Body-Pose

Paper (AURO)   ArXiv   Bibliography

In this project, we propose a method to determine the 3D relative pose of pairs of communicating robots by using human pose-based key-points as correspondences. We adopt a leader-follower framework, where at first, the leader robot visually detects and triangulates the key-points using the state-of-the-art pose detector named OpenPose. Afterward, the follower robots match the corresponding 2D projections on their respective calibrated cameras and find their relative poses by solving the perspective-n-point (PnP) problem. In the proposed method, we design an efficient person re-identification technique for associating the mutually visible humans in the scene. Additionally, we present an iterative optimization algorithm to refine the associated key-points based on their local structural properties in the image space.
    Update: accepted for publication in the Autonomous Robots (AuRo) journal

We demonstrate that the proposed refinement processes are essential to establish accurate key-point correspondences across viewpoints. Furthermore, we evaluate the performance of the end-to-end pose estimation system through several experiments conducted in terrestrial and underwater environments.

Finally, we discuss the relevant operational challenges of this approach and analyze its feasibility for multi-robot cooperative systems in human-dominated social settings and feature-deprived environments such as underwater. Please refer to the paper for further details.
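The PnP step mentioned above can be pictured through the quantity it minimizes: the reprojection error between the leader's triangulated 3D key-points and the follower's 2D detections. The sketch below is not a PnP solver and is not the paper's implementation; the camera intrinsics, the yaw-only rotation, and the key-point coordinates are all made-up values chosen for illustration.

```python
import math


def project(point, yaw, t, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D point into a follower camera.

    The candidate pose is simplified to a rotation by `yaw` about the
    camera y-axis followed by a translation `t` (an assumption made to
    keep the example short; a full pose has three rotation angles).
    """
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    xc = c * x + s * z + t[0]
    yc = y + t[1]
    zc = -s * x + c * z + t[2]
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def reprojection_error(points_3d, pixels, yaw, t):
    """Mean pixel distance between projected and observed key-points."""
    total = 0.0
    for p, (u, v) in zip(points_3d, pixels):
        pu, pv = project(p, yaw, t)
        total += math.hypot(pu - u, pv - v)
    return total / len(points_3d)


# Hypothetical body key-points triangulated by the leader (meters).
KEYPOINTS_3D = [(0.0, -0.5, 3.0), (0.2, -0.2, 3.1), (-0.2, -0.2, 3.0), (0.1, 0.5, 3.2)]
TRUE_YAW, TRUE_T = 0.1, (0.3, 0.0, 0.5)

# Simulated follower detections: the same key-points seen from the true pose.
OBSERVED = [project(p, TRUE_YAW, TRUE_T) for p in KEYPOINTS_3D]

if __name__ == "__main__":
    print("true pose :", reprojection_error(KEYPOINTS_3D, OBSERVED, TRUE_YAW, TRUE_T))
    print("wrong pose:", reprojection_error(KEYPOINTS_3D, OBSERVED, 0.0, (0.0, 0.0, 0.0)))
```

A PnP solver searches for the pose that drives this error toward zero, which is why accurate key-point correspondences across viewpoints matter so much for the final estimate.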


DDD: Balancing Robustness and Efficiency in Deep Diver Detection

Paper (RA-L)   ArXiv   Code   Bibliography

In this project, we explore the design and development of a class of robust diver-following algorithms for autonomous underwater robots. By considering the operational challenges for underwater visual tracking in diverse real-world settings, we formulate a set of desired features of a generic diver-following algorithm. We attempt to accommodate these features and maximize general tracking performance by exploiting the SOTA deep object detection models: Faster R-CNN (Inception V2), YOLO V2, Tiny YOLO, and SSD (MobileNet V2).

Subsequently, we design an architecturally simple CNN-based diver-detection model that is much faster than the SOTA deep models yet provides comparable detection performance. Each building block of the proposed model was fine-tuned in order to balance the trade-off between robustness and efficiency for a single-board setting under real-time constraints. We also validated its tracking performance and general applicability through numerous field experiments in pools and oceans. This detection model is used in the diver-following module of our Aqua-8 MinneBot.


Robo-Chat-Gest: Hand Gesture-based Robot Control and Reconfiguration

Paper-1 (ICRA 2018)   Bibliography-1   Paper-2 (JFR)   Bibliography-2

Robo-Chat-Gest introduces a real-time robot programming and parameter reconfiguration method for autonomous underwater robots using a set of intuitive and meaningful hand gestures. It is a syntactically simple framework and computationally more efficient than other existing grammar-based approaches. Find the Robo-Chat-Gest language rules in this paper!

The major components of Robo-Chat-Gest are: (i) a simple set of hand gesture-to-instruction mapping rules, (ii) a region selection mechanism to detect prospective hand gestures in the image-space, (iii) a CNN-based model for robust hand gesture recognition, and (iv) a Finite-State Machine (FSM) to efficiently decode complete instructions from the sequence of gestures. The key aspect of this framework is that it can be easily adopted by divers for communicating simple instructions to underwater robots without using artificial tags such as fiducial markers or requiring them to memorize a potentially complex set of language rules. We thoroughly analyze and discuss its practical utility and usability benefits in this paper.
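As a rough illustration of component (iv), the sketch below decodes a stream of recognized gesture tokens with a small finite-state machine. The token names, the command set, and the start/command/digits/end grammar are hypothetical placeholders; the actual Robo-Chat-Gest language rules are defined in the paper.

```python
# Hypothetical command vocabulary, for illustration only.
COMMANDS = {"follow", "stop", "take_picture"}


def decode(gestures):
    """Decode complete instructions from a sequence of gesture tokens.

    Assumed grammar: "start", one command, optional digit tokens, "end".
    Any unexpected token resets the machine to the idle state.
    """
    state = "idle"
    command, digits, decoded = None, [], []
    for g in gestures:
        if state == "idle":
            if g == "start":
                state = "command"
        elif state == "command":
            if g in COMMANDS:
                command, digits, state = g, [], "params"
            else:
                state = "idle"
        elif state == "params":
            if g.isdigit():
                digits.append(g)
            elif g == "end":
                value = int("".join(digits)) if digits else None
                decoded.append((command, value))
                state = "idle"
            else:
                state = "idle"
    return decoded


if __name__ == "__main__":
    stream = ["start", "follow", "1", "0", "end", "wave", "start", "stop", "end"]
    print(decode(stream))
```

Because each token triggers at most one state transition, decoding runs in time linear in the number of gestures, and a misrecognized gesture only discards the instruction in progress.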


Review and Dataset Papers

Semantic Segmentation of Underwater Imagery: Dataset and Benchmark

Paper (IROS 2020)   ArXiv   Bibliography   GitHub   USR-248 Dataset

Semantic segmentation of underwater scenes and pixel-level detection of salient objects are critically important features for visually-guided AUVs. The existing solutions are either too application-specific or outdated, despite the rapid advancements of relevant literature in the terrestrial domain. In this project, we attempt to address these limitations by presenting the first large-scale dataset for semantic Segmentation of Underwater IMagery (SUIM) for general-purpose robotic applications. In the proposed SUIM dataset, we consider eight object categories: fish (and other vertebrates), coral reefs (and other invertebrates), aquatic plants, wrecks/ruins, human divers, robots/instruments, sea-floor, and the background waterbody. It contains over 1500 natural underwater images and their ground truth semantic labels (human-annotated); it also includes a test set of 110 samples for benchmark evaluation. These images have been selected from large-scale datasets named EUVP, USR-248, and UFO-120, which we previously released for underwater image enhancement and super-resolution tasks. The dataset and relevant resources are available here.
    Update: accepted for publication at IROS 2020

We also present a comprehensive benchmark evaluation of several state-of-the-art semantic segmentation approaches named DeepLab, PSPNet, UNet, SegNet, and FCN on the SUIM dataset. We configured several variants of these models, then evaluated and compared their performances based on standard metrics. Additionally, we present a fully-convolutional encoder-decoder model named SUIM-Net, which offers a considerably faster run-time than the SOTA approaches while achieving competitive semantic segmentation performance. Check out this repository for more information and refer to the paper for detailed results and performance analysis.
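The text above does not name the standard metrics used. As an assumption, the sketch below shows intersection-over-union (IoU), a common choice for comparing a predicted label map against its ground truth. The tiny label maps and class ids are made up for illustration.

```python
def class_iou(pred, truth, cls):
    """IoU of one class between two label maps given as nested lists."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == cls, t == cls
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else float("nan")


def mean_iou(pred, truth, classes):
    """Average the per-class IoU over the given class ids."""
    scores = [class_iou(pred, truth, c) for c in classes]
    return sum(scores) / len(scores)


# Toy 2 x 3 label maps with three hypothetical class ids (0, 1, 2).
PRED = [[1, 1, 0],
        [0, 2, 2]]
TRUTH = [[1, 0, 0],
         [0, 2, 2]]

if __name__ == "__main__":
    for c in (0, 1, 2):
        print(f"class {c}: IoU = {class_iou(PRED, TRUTH, c):.3f}")
    print(f"mean IoU = {mean_iou(PRED, TRUTH, (0, 1, 2)):.3f}")
```

Averaging per class rather than per pixel keeps small categories, such as a distant diver, from being swamped by large regions like the sea-floor.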


Underwater Image Super-Resolution using Deep Residual Multipliers

Paper (ICRA 2020)   ArXiv   Bibliography   GitHub   USR-248 Dataset

Single Image Super-Resolution (SISR) allows zooming in on interesting image regions for detailed visual perception. In this project, we design a deep residual network-based generative model for 2x, 4x, and 8x SISR of underwater imagery. We provide a generative and an adversarial training pipeline for the model, which we refer to as SRDRM and SRDRM-GAN, respectively.
    Update: accepted for publication at ICRA 2020

In our implementation, both SRDRM and SRDRM-GAN learn to generate 640 x 480 images from respective inputs of size 320 x 240, 160 x 120, or 80 x 60. For the supervised training, we use natural underwater images that we collected during several oceanic explorations and field trials. We also included images from publicly available online media resources such as YouTube and Flickr. We have released this compiled dataset, named USR-248, for academic research purposes here.
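The input sizes listed above follow directly from the 2x, 4x, and 8x scale factors. The short sketch below works through that arithmetic; the `lr_size` helper is a hypothetical name used only for this example.

```python
HR_SIZE = (640, 480)  # output resolution stated above (width, height)


def lr_size(scale, hr_size=HR_SIZE):
    """Low-resolution input size for a given integer scale factor."""
    width, height = hr_size
    if width % scale or height % scale:
        raise ValueError("scale must divide the output size evenly")
    return (width // scale, height // scale)


if __name__ == "__main__":
    for scale in (2, 4, 8):
        w, h = lr_size(scale)
        # Pixel count shrinks with the square of the scale factor.
        print(f"{scale}x: {w} x {h} input, {scale * scale} times fewer pixels")
```

At 8x, the model has to infer 64 output pixels for every input pixel, which gives a sense of why the larger scale factors are the harder cases.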

We validate the effectiveness of SRDRM and SRDRM-GAN through qualitative and quantitative experiments and compare the results with several state-of-the-art models. We also analyze their practical feasibility for applications such as scene understanding and attention modeling in noisy visual conditions. See this Virtual Talk for more information and refer to the paper for detailed results and performance analysis.


Person Following by Autonomous Robots: A Categorical Overview

Paper (IJRR)   ArXiv   Bibliography

A wide range of human-robot collaborative applications in diverse domains such as manufacturing, health care, the entertainment industry, and social interactions require an autonomous robot to follow its human companion. Different working environments and applications pose diverse challenges by adding constraints on the choice of sensors, the degree of autonomy, and the dynamics of a person-following robot. Researchers have addressed these challenges in many ways and contributed to the development of a large body of literature. This paper provides a comprehensive overview of the literature by categorizing different aspects of person-following by autonomous robots. Also, the corresponding operational challenges are identified based on various design choices for ground, underwater, and aerial scenarios. In addition, state-of-the-art methods for perception, planning, control, and interaction are elaborately discussed and their applicability in varied operational scenarios is presented. Then, some of the prominent methods are qualitatively compared, corresponding practicalities are illustrated, and their feasibility is analyzed for various use cases. Furthermore, several prospective application areas are identified, and open problems are highlighted for future research.


Collaborations

UGAN: Underwater Image Enhancement by Generative Adversarial Networks (GANs)

I worked on this project led by our former colleague Cameron Fabbri, who designed a GAN-based model (named UGAN) to improve the quality of underwater imagery. The objective was to enhance noisy input images to boost the performance of various vision-based tasks such as detection and tracking; I was involved in the empirical analysis of these performance margins.

The UGAN paper (ICRA 2018) is one of the most cited papers on underwater image enhancement in recent times; check out the UGAN models and associated training pipelines in this repository.


Underwater Multi-Robot Convoying using Visual Tracking by Detection

I was involved in this project of the Mobile Robotics Lab @McGill University, led by Florian Shkurti (now an Assistant Professor @UToronto). He introduced a robust multi-robot convoying approach relying on visual detection of the leading agent, thus enabling target following in unstructured 3D environments. The solution was tested on extensive footage of an underwater swimming robot in ocean settings. An empirical comparison of multiple tracker variants was presented, which included several CNN-based models as well as frequency-based model-free trackers.

I took part in the evaluation of our frequency-based MDPM tracker for the empirical analysis. A video demonstration of the multi-robot convoying can be seen here; please refer to the paper (IROS 2017) for more information.


Other Collaborators

I regularly collaborate with the current Ph.D. students at the IRVLAB. In particular, I have interacted heavily with Chelsey Edge, Sadman Sakib Enan, Michael Fulton, and Jungseok Hong on several projects in the last few years. I am currently working with Jiawei Mo and Karin de Langis on separate projects. Also, Ruobing Wang is working with me on the SVAM project.

UG Students Whom I Mentored and Worked With

    - Peigen Luo (on SESR and SRDRM project); he is now a graduate student at UIUC
    - Youya Xia (on FUnIE-GAN project); she is now a PhD student at Cornell
    - Yuyang Xiao (on SUIM project); he is now a graduate student at UIUC
    - Marc Ho (on Robo-Chat-Gest project); he is now working at Optum
    - Muntaqim Mehtaz and Christopher Morse (on SUIM project); they are currently
      students at UMN