Research Content

Deep Video Prediction
SME-Net: Sparse Motion Estimation for Parametric Video Prediction through Reinforcement Learning
Yung-Han Ho, Chuan-Yuan Cho, Wen-Hsiao Peng, Guo-Lun Jin
IEEE International Conference on Computer Vision (ICCV), Oct. 2019.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction scheme built on a sparse motion field composed of a few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves state-of-the-art performance on the CaltechPed, UCF101 and CIF datasets in one-step and multi-step prediction tests. It shows good generalization and is able to learn well from small training data.
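To make the prediction mechanism concrete, below is a minimal sketch of POBMC-style prediction on a sparse motion field, not the authors' implementation: each pixel's motion is interpolated from the motion vectors of nearby critical pixels with distance-based weights, and the previous frame is warped accordingly. The function name, the inverse-distance weighting, and the nearest-neighbor warping are our own illustrative choices.

import numpy as np

def pobmc_predict(prev_frame, critical_xy, critical_mv, eps=1e-6):
    """Predict the next frame from a sparse motion field (illustrative only).

    prev_frame  : (H, W) previous frame
    critical_xy : (K, 2) coordinates (row, col) of the K critical pixels
    critical_mv : (K, 2) motion vectors (dy, dx) of those pixels
    """
    H, W = prev_frame.shape
    rows, cols = np.mgrid[0:H, 0:W]
    grid = np.stack([rows, cols], axis=-1).astype(float)               # (H, W, 2)

    # Distance-based weights: pixels near a critical pixel follow its motion.
    diff = grid[:, :, None, :] - critical_xy[None, None, :, :]         # (H, W, K, 2)
    dist2 = (diff ** 2).sum(-1) + eps
    weights = 1.0 / dist2
    weights /= weights.sum(-1, keepdims=True)

    # Dense motion field as a weighted combination of the sparse motion vectors.
    dense_mv = (weights[..., None] * critical_mv[None, None]).sum(2)   # (H, W, 2)

    # Backward warping: sample the previous frame at the displaced positions.
    src = np.clip(grid - dense_mv, [0, 0], [H - 1, W - 1]).round().astype(int)
    return prev_frame[src[..., 0], src[..., 1]]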
Deep Video Prediction Through Sparse Motion Regularization
Yung-Han Ho, Chih Chun Chan, Wen-Hsiao Peng
IEEE International Conference on Image Processing (ICIP), Oct. 2020.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction scheme built on a sparse motion field composed of a few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves state-of-the-art performance on the CaltechPed, UCF101 and CIF datasets in one-step and multi-step prediction tests. It shows good generalization and is able to learn well from small training data.

Learning-based Video Compression
P-frame Coding Proposal by NCTU: Parametric Video Prediction through Backprop-based Motion Estimation
Yung-Han Ho, Chih-Chun Chan, David Alexandre, Wen-Hsiao Peng, Chih-Peng Chang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2020.
This paper presents a parametric video prediction scheme with backprop-based motion estimation, in response to the CLIC challenge on P-frame compression. Recognizing that most learning-based video codecs rely on optical flow-based temporal prediction and suffer from having to signal a large amount of motion information, we propose to perform parametric overlapped block motion compensation on a sparse motion field. In forming this sparse motion field, we apply steepest descent to a loss function to identify critical pixels, whose motion vectors are signaled to the decoder. Moreover, we introduce a critical pixel dropout mechanism to strike a good balance between motion overhead and prediction quality. Compression results with HEVC-based residual coding on the CLIC validation sequences show that our parametric video prediction achieves higher PSNR and MS-SSIM than optical flow-based warping. In addition, our critical pixel dropout mechanism proves beneficial in terms of rate-distortion performance. Our scheme offers the potential for working with learned residual coding.
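As a rough illustration of the backprop-based motion estimation idea, here is a sketch under our own assumptions, not the submitted codec: the sparse motion vectors are treated as free parameters and refined by gradient descent on a mean-squared-error loss between the motion-compensated frame and the target frame. The compensator warp_fn is an assumed differentiable callable, e.g. a POBMC-style predictor.

import torch

def refine_sparse_motion(warp_fn, prev_frame, target, mv_init, steps=100, lr=0.1):
    """Gradient-descent refinement of sparse motion vectors (illustrative).

    warp_fn    : differentiable callable (prev_frame, mv) -> predicted frame (assumed interface)
    prev_frame : (1, 1, H, W) reference frame tensor
    target     : (1, 1, H, W) frame to be predicted
    mv_init    : (K, 2) initial motion vectors of the K critical pixels
    """
    mv = mv_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([mv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = warp_fn(prev_frame, mv)
        loss = torch.mean((pred - target) ** 2)   # steepest descent on the prediction loss
        loss.backward()
        opt.step()
    return mv.detach()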

Learning-based Image Compression
Learned Image Compression With Soft Bit-based Rate-distortion Optimization
David Alexandre, Chih-Peng Chang, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE International Conference on Image Processing (ICIP), Oct. 2019.
This paper introduces the notion of soft bits to address rate-distortion optimization for learning-based image compression. Recent methods for such compression train an autoencoder end-to-end with an objective that strikes a balance between distortion and rate. They face the zero-gradient issue caused by quantization and the difficulty of estimating the rate accurately. Inspired by soft quantization, we represent the quantization indices of feature maps with differentiable soft bits. This allows us to couple the rate estimation tightly with context-adaptive binary arithmetic coding. It also provides a differentiable distortion objective function. Experimental results show that our approach achieves state-of-the-art compression performance among learning-based schemes in terms of MS-SSIM and PSNR.
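A minimal sketch of the soft-quantization idea behind soft bits follows; it is our own simplification, not the paper's exact formulation. Hard quantization is replaced during training by a softmax-weighted combination of the quantization levels, which keeps the mapping differentiable so that distortion and a rate estimate can both be backpropagated, while a straight-through pass keeps the hard values in the forward direction.

import torch

def soft_quantize(x, levels, temperature=1.0):
    """Differentiable surrogate for scalar quantization (illustrative).

    x      : tensor of feature-map values, any shape
    levels : 1-D tensor of quantization levels, e.g. torch.arange(-2., 3.)
    """
    dist = (x.unsqueeze(-1) - levels) ** 2             # squared distance to each level
    weights = torch.softmax(-dist / temperature, dim=-1)
    soft = (weights * levels).sum(-1)                  # soft value, used for gradients
    hard = levels[dist.argmin(-1)]                     # hard value, used for coding
    # Straight-through estimator: forward pass uses hard values, backward uses soft gradients.
    return soft + (hard - soft).detach()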
An Autoencoder-based Image Compressor with Principal Component Analysis and Soft-Bit Rate Estimation
Chih-Peng Chang, David Alexandre, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2019.
We propose a lossy image compression system using a deep-learning autoencoder structure to participate in the Challenge on Learned Image Compression (CLIC) 2018. Our autoencoder uses residual blocks with skip connections to reduce the correlation among image pixels and condense the input image into a set of feature maps, a compact representation of the original image. Bit allocation and bitrate control are implemented using importance maps and a quantizer. The importance maps are generated by a separate neural net in the encoder. The autoencoder and the importance net are trained jointly to minimize a weighted sum of mean squared error, MS-SSIM, and a rate estimate. Our aim is to produce reconstructed images with good subjective quality subject to the 0.15 bits-per-pixel constraint.
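The role of the importance map in bit allocation can be illustrated with a small sketch; the masking rule and tensor shapes below are assumptions of ours, not the CLIC submission itself. A separate network predicts a per-location importance value that determines how many feature channels survive quantization at each spatial position, so simple regions consume fewer bits.

import torch

def apply_importance_mask(features, importance):
    """Keep only the leading channels selected by an importance map (illustrative).

    features   : (N, C, H, W) feature maps from the encoder
    importance : (N, 1, H, W) values in [0, 1] produced by the importance net
    """
    n, c, h, w = features.shape
    keep = torch.ceil(importance * c)                               # channels kept per location
    channel_idx = torch.arange(c, device=features.device).view(1, c, 1, 1)
    mask = (channel_idx < keep).float()                             # 1 for kept channels, 0 otherwise
    return features * mask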

Reinforcement Learning for Video Encoder Control
Reinforcement Learning for HEVC/H.265 Frame-level Bit Allocation
Lian-Ching Chen, Jun-Hao Hu, Wen-Hsiao Peng
IEEE International Conference on Digital Signal Processing (DSP), China, Nov. 2018.
Frame-level bit allocation is crucial to video rate control. The problem is often cast as minimizing the distortions of a group of video frames subject to a rate constraint. When these video frames are related through inter-frame prediction, the bit allocation for different frames exhibits dependency. To address such dependency, this paper introduces reinforcement learning. We first consider frame-level texture complexity and bit balance as a state signal, define the bit allocation for each frame as an action, and compute the negative frame-level distortion as an immediate reward signal. We then train a neural network to be our agent, which observes the state to allocate bits to each frame so as to maximize the cumulative reward. Compared with the rate control scheme in HM-16.15, our method shows better PSNR performance while producing smaller bit rate fluctuations.
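The formulation above maps naturally onto a standard episode loop. The outline below is illustrative; agent.select_action, encode_frame, and complexity_fn are assumed interfaces rather than the paper's code.

def allocate_bits(frames, total_budget, agent, encode_frame, complexity_fn):
    """One episode of frame-level bit allocation (illustrative outline).

    agent.select_action(state) -> bits, encode_frame(frame, bits) -> (distortion, bits_used),
    and complexity_fn(frame) -> float are assumed interfaces.
    """
    remaining = total_budget
    transitions = []
    for frame in frames:
        state = (complexity_fn(frame), remaining)   # state: texture complexity + bit balance
        bits = agent.select_action(state)           # action: bit budget for this frame
        distortion, bits_used = encode_frame(frame, bits)
        reward = -distortion                        # reward: negative frame-level distortion
        remaining -= bits_used
        transitions.append((state, bits, reward))   # later replayed to update the agent
    return transitions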
Reinforcement Learning for HEVC/H.265 Intra-Frame Rate Control
Jun-Hao Hu, Wen-Hsiao Peng, Chia-Hua Chung
IEEE International Symposium on Circuits and Systems (ISCAS), Italy, May 2018.
Reinforcement learning has proven effective for solving decision making problems. However, its application to modern video codecs has yet to be seen. This paper presents an early attempt to introduce reinforcement learning to HEVC/H.265 intra-frame rate control. The task is to determine a quantization parameter value for every coding tree unit in a frame, with the objective being to minimize the frame-level distortion subject to a rate constraint. We draw an analogy between the rate control problem and the reinforcement learning problem, by considering the texture complexity of coding tree units and bit balance as the environment state, the quantization parameter value as an action that an agent needs to take, and the negative distortion of the coding tree unit as an immediate reward. We train a neural network based on Q-learning to be our agent, which observes the state to evaluate the reward for each possible action. When trained on only limited sequences, the proposed model can already perform comparably with the rate control algorithm in HM-16.15.
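A small sketch of how such a Q-network could pick a quantization parameter for a coding tree unit follows; the layer sizes and the two-dimensional state encoding are our own simplifications of the formulation described above. The network scores every candidate QP for the current state and the agent greedily takes the action with the highest predicted value.

import torch
import torch.nn as nn

class QPValueNet(nn.Module):
    """Toy Q-network: state (texture complexity, bit balance) -> value of each candidate QP."""
    def __init__(self, num_qp=52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(),
            nn.Linear(64, num_qp),
        )

    def forward(self, state):              # state: (N, 2)
        return self.net(state)             # (N, num_qp) Q-values

def select_qp(qnet, complexity, bit_balance):
    state = torch.tensor([[complexity, bit_balance]], dtype=torch.float32)
    with torch.no_grad():
        q_values = qnet(state)
    return int(q_values.argmax(dim=1))     # greedy QP choice for this coding tree unit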
HEVC/H.265 Coding Unit Split Decision Using Deep Reinforcement Learning
Chia-Hua Chung, Wen-Hsiao Peng, Jun-Hao Hu
IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Xiamen, Nov. 2017.
The video coding community has long been seeking more effective rate-distortion optimization techniques than the widely adopted greedy approach. The difficulty arises when we need to predict how the coding mode decision made in one stage affects subsequent decisions and thus the overall coding performance. Taking a data-driven approach, we introduce in this paper deep reinforcement learning (RL) as a mechanism for the coding unit (CU) split decision in HEVC/H.265. We propose to regard the luminance samples of a CU together with the quantization parameter as its state, the split decision as an action, and the reduction in rate-distortion cost relative to keeping the current CU intact as the immediate reward. Based on the Q-learning algorithm, we learn a convolutional neural network to approximate the rate-distortion cost reduction of each possible state-action pair. The proposed scheme performs comparably with the current full rate-distortion optimization scheme in HM-16.15, incurring a 2.5% average BD-rate loss. While also performing similarly to a conventional scheme that treats the split decision as a binary classification problem, our scheme can additionally quantify the rate-distortion cost reduction, enabling more applications.
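To make the state-action mapping concrete, here is a hedged sketch of a convolutional Q-network; the layer sizes and names are our own, not the network used in the paper. It maps the luminance samples of a CU plus the QP to two Q-values, one for keeping the CU intact and one for splitting it, and the decision is simply the action with the larger predicted rate-distortion cost reduction.

import torch
import torch.nn as nn

class CUSplitQNet(nn.Module):
    """Toy Q-network for the CU split decision (illustrative, not HM code)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 4 * 4 + 1, 64), nn.ReLU(),
            nn.Linear(64, 2),              # Q-values: [keep CU intact, split CU]
        )

    def forward(self, luma, qp):
        # luma: (N, 1, S, S) luminance samples of the CU; qp: (N, 1) quantization parameter
        feat = self.conv(luma).flatten(1)
        return self.head(torch.cat([feat, qp], dim=1))

# Greedy decision: split the CU if the predicted cost reduction of splitting is larger.
# split = CUSplitQNet()(luma, qp).argmax(dim=1) == 1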

Domain Adaptation for Semantic Segmentation
All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation
Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, Wei-Chen Chiu
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
In this paper we tackle the problem of unsupervised domain adaptation for semantic segmentation, where we attempt to transfer knowledge learned on synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor for semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which further enables image translation across domains and label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of the proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.
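At a high level, the disentanglement and label transfer can be pictured with the short sketch below; the encoder/decoder callables are assumed interfaces, and the adversarial and consistency losses that actually train DISE are omitted.

def translate_and_reuse_labels(image_src, labels_src, image_tgt, struct_enc, texture_enc, decoder):
    """Cross-domain translation by code swapping (illustrative DISE-style sketch).

    struct_enc, texture_enc, decoder are assumed callables, not the released DISE modules.
    """
    structure = struct_enc(image_src)            # domain-invariant structure code
    texture = texture_enc(image_tgt)             # domain-specific texture code
    translated = decoder(structure, texture)     # source content rendered in target style
    return translated, labels_src                # pixel labels of image_src still apply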

Video Semantic Segmentation
Semantic Segmentation on Compressed Video Using Block Motion Compensation and Guided Inpainting
Stefanie Tanujaya, Tieh Chu, Jia-Hao Liu, Wen-Hsiao Peng
IEEE International Symposium on Circuits and Systems (ISCAS), Spain, Oct. 2020.
This paper addresses the problem of fast semantic segmentation on compressed video. Unlike most prior works for video segmentation, which perform feature propagation based on optical flow estimates or sophisticated warping techniques, ours takes advantage of block motion vectors in the compressed bitstream to propagate the segmentation of a keyframe to subsequent non-keyframes. This approach, however, needs to respect the inter-frame prediction structure, which often suggests recursive, multi-step prediction with error propagation and accumulation in the temporal dimension. To tackle the issue, we refine the motion-compensated segmentation using inpainting. Our inpainting network incorporates guided non-local attention for long-range reference and pixel-adaptive convolution for ensuring the local coherence of the segmentation. A fusion step then follows to combine both the motion-compensated and inpainted segmentations. Experimental results show that our method outperforms the state-of-the-art baselines in terms of segmentation accuracy. Moreover, it introduces the least amount of network parameters and multiply-add operations for non-keyframe segmentation.
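A minimal sketch of the motion-vector-based propagation step follows; the block size, data layout, and nearest-block copying are simplifying assumptions of ours. The keyframe segmentation is copied block by block along the motion vectors parsed from the compressed bitstream, and the inpainting and fusion stages described above then clean up the result.

import numpy as np

def propagate_segmentation(seg_ref, motion_vectors, block=16):
    """Propagate a keyframe segmentation using block motion vectors (illustrative).

    seg_ref        : (H, W) segmentation of the reference (key)frame
    motion_vectors : (H // block, W // block, 2) per-block (dy, dx) vectors
    """
    H, W = seg_ref.shape
    seg_out = np.zeros_like(seg_ref)
    for by in range(H // block):
        for bx in range(W // block):
            dy, dx = motion_vectors[by, bx]
            y0, x0 = by * block, bx * block
            # Source block position in the reference frame, clipped to the image.
            sy = int(np.clip(y0 + dy, 0, H - block))
            sx = int(np.clip(x0 + dx, 0, W - block))
            seg_out[y0:y0 + block, x0:x0 + block] = seg_ref[sy:sy + block, sx:sx + block]
    return seg_out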

Visual Question Answering
Learning Goal-oriented Visual Dialogue: Imitating and Surpassing Analytic Experts
Yen-Wei Chang, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2019.
This paper tackles the problem of learning a questioner for the goal-oriented visual dialog task. Several previous works adopt model-free reinforcement learning, and most pre-train the model on a finite set of human-generated data. We argue that using limited demonstrations to kick-start the questioner is insufficient due to the large policy search space. Inspired by a recently proposed information-theoretic approach, we develop two analytic experts to serve as a source of high-quality demonstrations for imitation learning. We then take advantage of reinforcement learning to refine the model towards the goal-oriented objective. Experimental results on the GuessWhat?! dataset show that our method combines the merits of imitation and reinforcement learning, achieving state-of-the-art performance.
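The two-stage training described above can be outlined roughly as follows; this is an illustrative sketch with assumed interfaces (policy.log_prob, env.rollout), not the authors' pipeline.

import torch

def train_questioner(policy, expert_demos, env, optimizer, imitation_epochs=5, rl_episodes=1000):
    """Imitation pre-training followed by RL fine-tuning (illustrative outline)."""
    # Stage 1: imitation learning from demonstrations produced by the analytic experts.
    for _ in range(imitation_epochs):
        for state, action in expert_demos:
            loss = -policy.log_prob(state, action)           # behavior cloning
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Stage 2: REINFORCE-style refinement toward the goal-oriented reward.
    for _ in range(rl_episodes):
        log_probs, reward = env.rollout(policy)              # one dialogue episode
        loss = -(torch.stack(log_probs).sum() * reward)
        optimizer.zero_grad(); loss.backward(); optimizer.step()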

Deep Generative Model
Learning Priors for Adversarial Autoencoders
Hui-Po Wang, Wen-Hsiao Peng, Wei-Jan Ko
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), USA, Nov. 2018.
Most deep latent factor models choose simple priors for simplicity or tractability, or because it is unclear what prior to use. Recent studies show that the choice of prior may have a profound effect on the expressiveness of the model, especially when its generative network has limited capacity. In this paper, we propose to learn a proper prior from data for adversarial autoencoders (AAEs). We introduce the notion of code generators to transform manually selected simple priors into ones that better characterize the data distribution. Experimental results show that the proposed model generates images of better quality and learns better disentangled representations than AAEs in both supervised and unsupervised settings. Lastly, we demonstrate its ability to perform cross-domain translation in a text-to-image synthesis task.
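The code-generator idea can be shown in a few lines; the layer sizes below are our own, not those of the released model. A small network transforms samples from a simple prior, here a standard Gaussian, into latent codes, and the adversarial autoencoder's discriminator then matches the encoder's output distribution to this learned prior instead of the simple one.

import torch
import torch.nn as nn

class CodeGenerator(nn.Module):
    """Maps a simple prior to a learned prior for an adversarial autoencoder (illustrative)."""
    def __init__(self, noise_dim=64, code_dim=8):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )

    def forward(self, n_samples):
        z = torch.randn(n_samples, self.noise_dim)   # samples from the simple Gaussian prior
        return self.net(z)                           # samples from the learned prior

# During adversarial training, the discriminator sees CodeGenerator outputs as prior samples
# and pushes the encoder's latent codes to match this learned prior distribution.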

Incremental Learning
Class-incremental Learning with Rectified Feature-Graph Preservation
Cheng-Hsun Lei*, Yi-Hsin Chen*, Wen-Hsiao Peng, Wei-Chen Chiu
Asian Conference on Computer Vision (ACCV), Japan, Nov. 2020.
In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of recognizing seen classes with only limited memory for preserving seen data samples. Many regularization strategies have been proposed to mitigate the phenomenon of catastrophic forgetting. To understand better the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental results on both CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes.
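A brief sketch of a weighted-Euclidean distillation term of the kind discussed above closes this entry; the exact weighting used in the paper may differ, so treat this as an assumption-laden illustration. Features of the current and the previous model are compared with a per-dimension weighted squared distance, so dimensions judged important to the old classes are preserved more strongly.

import torch

def weighted_euclidean_distillation(feat_new, feat_old, weights):
    """Weighted-Euclidean regularization for old-knowledge preservation (illustrative).

    feat_new, feat_old : (N, D) features from the current and the previous model
    weights            : (D,) non-negative importance assigned to each feature dimension
    """
    return (weights * (feat_new - feat_old) ** 2).sum(dim=1).mean()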