UNIVERSITY OF CALIFORNIA
SANTA CRUZ

LEARNING-BASED APPROACH FOR VISION PROBLEMS

A dissertation submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY
in
COMPUTER ENGINEERING

by

Dan Kong

December 2006

The Dissertation of Dan Kong is approved:

Professor Hai Tao, Chair
Professor Roberto Manduchi
Professor James Davis

Lisa C. Sloan
Vice Provost and Dean of Graduate Studies

UMI Number: 3241208

UMI Microform 3241208. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved.
Copyright © by Dan Kong 2006

Table of Contents

List of Figures
List of Tables
Abstract
Dedication
Acknowledgments

I Learning-based Stereo

1 Introduction
  1.1 The Problem
  1.2 Foundations of Stereo

2 Related Work and Motivation
    Window-based Matching
    MRF-based Methods
    Segmentation-based Methods

3 The Approach
  3.1 Representing and Learning Matching Behaviors
    3.1.1 Representing matching behaviors
    3.1.2 Learning the distribution
    3.1.3 Adaptive bin selection
  3.2 A Probabilistic Stereo Model
    3.2.1 Stereo as a MAP-MRF problem

4 Results and Discussion

II Video Super-resolution

5 Introduction and Related Work
  5.2 Reconstruction-based Methods
  5.3 Learning-based Methods

6 Primal Sketch Priors
  6.1 Example-based Priors
  6.2 Primal Sketch Priors
      Learning primal sketch priors
      Why primal sketch

7 Video Super-resolution
  7.1 Overview of the Approach
  7.3 Scene-specific Priors
      Introduction to CRF

8 Implementation and Results
  8.3 Quantitative Results

9 Discussion
  9.3 Primitive co-occurrence

III Human Detection and Counting

10 Introduction and Related Work

11 Human Counting: A Regression Approach
  11.3 Results and Discussion

12 Human Counting: A Detection-based Approach
  12.1 Convolutional Neural Network (CNN)
  12.3 Multi-scale Detection

13 Conclusions and Future Work

Bibliography

List of Figures

1.1 The geometry of nonverged stereo.
1.2 Rectification makes the scanlines parallel and reduces the epipolar constraint to 1D.
2.1 (a) Tsukuba left image. (b) Synthetic alteration of the Tsukuba right image by increasing the intensity. (c) Depth map computed using multi-scale belief propagation.
(d) Depth computed using a 9 x 9 correlation window.
2.2 (c) Depth computed using graph cut. (d) Depth computed using a 9 x 9 correlation window.
2.3 All the experts used in the algorithm. The black dot marks the center of the matching window.
2.4 For depth discontinuity regions, the accuracy of depth estimates depends on the matching position of the correlation window. In this example, window A is better than B.
2.5 Tsukuba color image (a) and the depth map computed using 7 x 7 NCC (b). Typical errors in NCC-based stereo matching.
2.6 (a) Probability of depth error as a function of distance to the nearest foreground object. (b) Probability of depth error as a function of texture strength. (c) Probability of estimating true depth for 36 experts on the textured and textureless foreground. (d) Probability of estimating true depth for 36 experts at depth discontinuity regions.
3.1 Disparity map using a 9 x 9 correlation window and pixels A, B, C from three typical regions.
3.2 The depth map for a scene of two objects: foreground (white rectangle) and background (gray rectangle). The shaded rectangle A is a background region close to the foreground. (a) Depth map with fattening effect, where A has the foreground depth. (b) True map for the left view, where A has the background depth.
3.3 The texture and structure attributes around a pixel.
3.4 (a) The marginal likelihood density of the 3 x 3 scale texture strength evaluated on Middlebury stereo data. The vertical axis is the probability density and the horizontal axis is the texture strength. The vertical dashed lines indicate the positions of the adaptively chosen bin boundaries. (b) The posterior probability distribution based on the adaptively chosen bins.
3.5 The expectation of the posterior entropy rapidly reaches an asymptotic value as a function of the number of bins.
3.6 Graphical model for stereo. (a): Traditional MRF model. (b): MRF model in this paper.
3.7 Illustration of how to compute the proposal probability ratio for one
3.8 (a): Color segmentation using mean-shift.
(b): Depth segmentation based on median filtered SSD depth map. (c): Joint color and depth segmentation.
3.9 Computation of likelihood and smoothness change for one super-pixel in the segmentation-based approach.
4.1 The learned matching behavior for a 7 x 7 correlation window.
4.2 Dense disparity maps for the "Tsukuba", "Sawtooth", "Venus" and "Map" images.
4.3 Intermediate results on Tsukuba data at different iterations.
4.4 Comparisons of the disparity maps for the "Tsukuba", "Sawtooth", "Venus" and "Map" images using 7 x 7 NCC matching cost as the likelihood.
4.5 Dense disparity maps for the "Teddy" and "Cones" images.
4.6 Comparisons of the disparity maps for the "face" stereo pair. (a) Left image. (b) Right image. (c) Initial depth from 7 x 7 correlation window. (d) Belief propagation result. (e) Graph cut result.
4.7 Energy of estimated depth map and ground truth.
6.1 The filter bank used for primitive extraction (a) and typical primitives extracted (b).
7.1 Overview of our video super-resolution approach.
7.2 The ROC curves of primitive training data (a) and component training data (b) at different sizes. X-axis is match error and Y-axis is hit-rate.
7.3 The prediction ROC curves of primitive training data (a) and component training data (b) at different sizes. X-axis is match error and Y-axis is hit-rate.
7.4 The ROC curves for the scene-specific dictionary and the general dictionary, measuring sufficiency (a) and predictability (b). The scene-specific dictionary outperforms the general dictionary.
7.5 Graphical model for super-resolution. (b) Video super-resolution.
7.6 Comparison of video super-resolution results. Top: the original adjacent low resolution frames. (a)(b) Independent super-resolution of each frame.
(c)(d) Super-resolution with temporal smoothing.
8.1 Training phase of the algorithm.
8.2 Selecting the training frame using the relative blurriness measure.
8.3 Super-resolution results for frames 8 and 87 from the plant video sequence. The input video has resolution 240x160. Top: bicubic interpolation results (720x480). Bottom: results using customized dictionary plus temporal constraint (720x480).
8.4 Super-resolution results for frames 12 and 78 from the face video sequence. The input video has resolution 240x160. Top: bicubic interpolation results (720x480). Bottom: results using scene-specific dictionary plus temporal constraint (720x480).
8.5 Super-resolution results for frames 9 and 121 from the keyboard video sequence. The input video has resolution 160x120. Top: bicubic interpolation results (640x480). Bottom: results using customized dictionary plus temporal constraint (640x480).
8.6 Super-resolution results for frames 56 and 73 from the MPEG-4 encoded video sequence. The input video has resolution 352x288. (a)(b): Low resolution frame, 117 x 96. (c)(d): Bicubic interpolation to 352 x 288. (e)(f): Super-resolution using our approach.
8.7 RMS errors for the first 20 frames of the testing video sequences: (a) "plant", (b) "face", (c) "keyboard".
11.1 Features for crowd counting: (a) one frame from the videos, (b) foreground mask image, (c) edge map, (d) the edge map after the AND operation between (b) and (c).
11.2 The same person has a different projected height in the image when translating on the ground plane.
11.3 (a) Density estimation using homography. (b) ROI in the image.
11.4 Three layer neural network architecture. The input is the normalized blob and edge orientation histograms.
The output is the crowdedness measure.
11.5 Model selection: the cross validation errors for different number of hidden layers.
11.6 Crowd counting results from site A.
11.7 Crowd counting results for sequence from site B.
12.1 Architecture of our Convolutional Neural Networks for human detection.
12.2 Multi-scale detector.
12.3 Some example images from MIT database.
12.4 Some example images from INRIA database.
12.5 CNN performance on MIT database with different scale. (b) False alarm rate.
12.6 CNN performance on INRIA database with different scale. (a) Detection rate. (b) False alarm rate.
12.7 Crowd counting results for video sequence from Beijing, China. (a)(b): Initial detection results for two frames. (c)(d): Results after bootstrapping the CNN using 'hard examples'.
12.8 Crowd counting results for Bookstore, UCSC. (a): Initial detection results for one frame.
(b): Results after bootstrapping the CNN using 'hard examples'.

List of Tables

4.1 Performance comparisons using NCC matching cost
4.2 Performance of the proposed method for the new testbed images
12.1 Confusion matrix for MIT testing data of size 16 x 32
12.2 Confusion matrix for INRIA testing data of size 16 x 32

Abstract

Learning-based Approach for Vision Problems

by

Dan Kong

Learning-based techniques have seen increasingly successful application in computer vision. "Learning for vision" is viewed as the next challenging frontier for computer vision. Technical challenges in applying learning-based methods in vision include choosing an appropriate representation, model generalization, and complexity.
This dissertation investigates several vision problems and proposes learning algorithms for them. In particular, three vision problems are studied, from low level to high level: stereo, super-resolution and human detection.

In the first part, we present a learning-based approach [73, 74] to address the visual correspondence problem when the stereo images have different intensity levels. The algorithm first learns the matching behaviors of multiple local-window methods (called experts) using a simple histogram-based method. The learned behaviors are then integrated into a MAP-MRF depth estimation framework, and the Metropolis-Hastings algorithm is used to find the MAP solution. Segmentation is also used to accelerate the computation and improve the performance. Qualitative and quantitative experimental results demonstrate that, for stereo image pairs with different intensity levels, the proposed algorithm significantly outperforms the state-of-the-art methods.

Using prior knowledge can significantly improve the performance of low-level image processing and vision problems.
In the second part, we propose a learning-based approach [72, 71] for video super-resolution. The approach extends the previous primal sketch image hallucination method by learning scene-specific priors from examples. This is achieved by constructing training examples from high resolution images captured by a still camera and using them to enhance the low resolution videos. As a result, information from cameras with different spatio-temporal resolutions is combined in our framework. In addition, we use a conditional random field (CRF) to enforce a smoothness constraint between adjacent super-resolved frames, and video super-resolution is posed as finding the high resolution video that maximizes the conditional probability.