Learning Full-Body Motions from Monocular Vision: Dynamic Imitation in a Humanoid Robot

Jeffrey B. Cole (1), David B. Grimes (2), Rajesh P. N. Rao (2)

(1) Department of Electrical Engineering, University of Washington, Box 352500, Seattle, WA 98195 USA, jeffcole@ee.washington.edu
(2) Department of Computer Science and Engineering, University of Washington, Box 352350, Seattle, WA 98195 USA, {grimes, rao}@cs.washington.edu

Abstract

In an effort to ease the burden of programming motor commands for humanoid robots, a computer vision technique is developed for converting a monocular video sequence of human poses into stabilized robot motor commands for a humanoid robot. The human teacher wears a multi-colored body suit while performing a desired set of actions. Leveraging the colors of the body suit, the system detects the most probable locations of the different body parts and joints in the image. Then, by exploiting the known dimensions of the body suit, a user-specified number of candidate 3D poses are generated for each frame. Using human-to-robot joint correspondences, the estimated 3D poses for each frame are then mapped to corresponding robot motor commands. An initial set of kinematically valid motor commands is generated using an approximate best-path search through the pose candidates for each frame. Finally, a learning-based probabilistic dynamic balance model obtains a dynamically stable imitative sequence of motor commands. We demonstrate the viability of the approach by presenting results showing full-body imitation of human actions by a Fujitsu HOAP-2 humanoid robot.

I. INTRODUCTION

Teaching complex motor behavior to a robot can be extremely tedious and time consuming. Often, a programmer will have to spend days deciding on exact motor control sequences for every joint in the robot for a pose sequence that only lasts a few seconds.

A much more intuitive approach would be to teach a robot how to generate its own motor commands for gestures by simply watching an instructor perform the desired task. In other words, the robot should learn to translate the perceived pose of its instructor into appropriate motor commands for itself. This imitation learning paradigm is intuitive because it is exactly how we humans learn to control our bodies [1]. Even at very young ages, we learn to control our bodies and perform tasks by watching others perform those tasks. But the first hurdle in this imitation learning task is one of image processing. The challenge is to develop accurate methods for extracting 3D human poses from monocular image sequences.

Imitation learning in humanoid and other robots has been studied in depth by a wide array of researchers. Early work such as [2], [3] demonstrated the benefit of programming a robot via demonstration. Since then researchers have addressed building large corpora of useful skills [4], [5], [6], handling dynamics [7], [8], studied biological connections [9], or addressed goal-directed imitation [10].

Typically a marker-based motion capture system is used to estimate human poses as input for training robots to perform complex motions. This requires a full motion capture rig to extract the exact locations of special markers in a restricted 3D space. An instructor is typically required to wear a special suit with careful marker placement. The motion capture system then records the 3D position of each marker and recovers degree-of-freedom (DOF) estimates relative to a skeletal model using various inverse kinematic techniques. Due to careful calibration of the cameras, highly accurate pose estimates can be extracted using multi-view triangulation techniques.

The biggest downside to using a motion capture rig in our imitation learning scenario is that training can only be performed in a rigid (and expensive) environment. Also, the motion capture system is unsatisfying because it does not allow the robot to behave autonomously. In this paper we demonstrate initial steps in allowing the robot to use its own vision system to extract the 3D pose of its instructor. This would allow us to "close the loop" for the learning process. Using only its own eyes, a robot should be able to watch an instructor, convert what it sees into a 3D pose, and then translate that sequence into appropriate motor commands.

A large body of work has studied the problem of performing pose estimation from vision. Early computational approaches [11], [12] to analyzing images and video of people adopted the use of kinematic models such as the kinematic tree model. Since these earliest papers many systems have been proposed for pose estimation and tracking (for examples see [13], [14], [15], [16]), yet none have significantly supplanted marker-based motion capture for a broad array of applications. The biggest limitation of many of these vision-based pose estimation techniques is that they require multiple, distant and often carefully calibrated cameras to be placed in a ring around the instructor. While more portable and less costly than a commercial motion capture rig, this is still not desirable for autonomous robotic imitation learning. Thus in this paper we propose a method which relies solely on the robot's own commodity monocular camera. We note that our work on monocular pose estimation builds on previous techniques for solving the human and limb tracking problem using learned image statistics [17], [18], [19], [20], [21].

II. POSE ESTIMATION USING MONOCULAR VIDEO

As an alternative to expensive and cumbersome motion capture systems, we have developed a new approach to estimating human poses using only a single, uncalibrated camera and a multi-colored body suit. The method uses a nonparametric probabilistic framework for localizing human body parts and joints in 2D images, converting those joints into possible 3D locations, extracting the most likely 3D pose, and then converting that pose into the equivalent motor commands for our HOAP-2 humanoid robot. As a final step, the motor commands are automatically refined to assure stability when the imitative action is finally performed by the humanoid robot. The overall flow of the data processing is shown in Figure 1.

Fig. 1. General overview of the proposed approach to pose estimation. Arrows indicate what previous and future information is used to generate the data in each step of converting a raw video sequence into a stable set of motor commands for the humanoid to perform.

A. Detecting Body Parts

The first step of the process is to detect where the different body parts are most likely located in each frame of the video sequence. Since we have granted ourselves the concession of using clothing with known colors, body part detection is done by training a classifier in RGB color space.

During the training phase, the user labels example regions for each of the body parts using a simple GUI. The RGB values of the pixels in each region are then fit with Gaussian distributions and the curve-fit parameters are saved to a file. An example of hand-selected upper body parts and their RGB color clusters is shown in Figure 2.

Fig. 2. RGB training for body part detection. The top image shows hand-selected body part regions and the bottom plot shows each body part's color clusters.

Once the colors have been learned for each body part, it is relatively fast and easy to detect the probable body part locations in any other frame from the sequence. For example, Figure 3 shows the probability of each pixel being part of the person's torso, where intensity of the image encodes the relative likelihood. Part location probability maps can thus be generated for each body part in each frame of the video sequence.

Fig. 3. Probability map for the location of each upper body part in the given frame. The value assigned to each pixel in the map is found by evaluating the pixel's RGB values using the previously trained Gaussian distributions. Thus, intensity of the image on the right indicates the relative likelihood of a pixel being a body part.
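The color model described above can be written down in a few lines. The sketch below is a minimal illustration rather than the authors' implementation: it assumes a single full-covariance Gaussian per body part in RGB space (the paper does not say whether per-channel or full-covariance fits were used) and evaluates every pixel of a frame against that model to produce a part-location map like the one in Figure 3.

```python
import numpy as np

def fit_color_model(labeled_pixels):
    """Fit a Gaussian in RGB space to the hand-labeled pixels of one body part.

    labeled_pixels: (N, 3) array of RGB values taken from the user-selected
    training regions (cf. Fig. 2)."""
    mean = labeled_pixels.mean(axis=0)
    cov = np.cov(labeled_pixels, rowvar=False) + 1e-6 * np.eye(3)  # regularize
    return mean, cov

def part_probability_map(frame, mean, cov):
    """Evaluate the Gaussian likelihood of every pixel in an (H, W, 3) RGB frame,
    giving a relative-likelihood map like the one shown in Fig. 3."""
    h, w, _ = frame.shape
    diff = frame.reshape(-1, 3).astype(np.float64) - mean
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of each pixel's color from the part's color model.
    mahal = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    norm = 1.0 / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(cov))
    return (norm * np.exp(-0.5 * mahal)).reshape(h, w)
```

One such map would be computed per body part per frame; these maps are the input to the joint-localization step described next.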

B. Converting Body Parts into 2D Joint Location Probability Maps

Once probability maps have been generated for each body part, the system uses that information to generate probability maps for each of the person's joints. For every pair of body parts that are connected by a joint, the system performs two steps to generate the joint location probability map. First, each body part probability map is spatially blurred with a Gaussian kernel with a variance of 1 pixel. To speed up processing, this blurring is performed in the frequency domain using FFTs. Then, for every pair of body parts that are connected by a joint, the spatially blurred body part maps are multiplied together and the resulting map is normalized so it is a valid probability distribution function (PDF) for the current joint. The resulting maps show the most likely locations for each of the instructor's joints in the current 2D video frame. An example of a 2D joint location probability map is shown in Figure 4.

Fig. 4. Example of a probability map for the 2D locations of each joint for the video frame shown on the left. Joint maps are found by multiplying together blurred versions of each of the body part maps.
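As a concrete sketch of this joint-map construction, the following uses scipy.ndimage.gaussian_filter for the spatial blur instead of the FFT-based convolution mentioned above (the result is equivalent apart from boundary handling and speed); the body-part pairing in the usage comment is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def joint_probability_map(part_map_a, part_map_b, sigma=1.0):
    """Combine two body-part probability maps into a joint-location PDF.

    Each map is blurred with a Gaussian kernel (variance of about 1 pixel, as in
    the text), the blurred maps are multiplied, and the product is normalized so
    it sums to one."""
    blurred_a = gaussian_filter(part_map_a, sigma)
    blurred_b = gaussian_filter(part_map_b, sigma)
    joint = blurred_a * blurred_b
    total = joint.sum()
    return joint / total if total > 0 else joint

# Example: the left elbow joint connects the left upper-arm and lower-arm maps.
# elbow_map = joint_probability_map(upper_arm_map, lower_arm_map)
```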

For the work described herein, the lower body joint localization was done directly through color detection, unlike the upper body where full parts were detected first and then converted into joint locations. The differences in processing of the lower body and upper body are meant to illustrate two varying methods for joint localization. Detecting the joints directly from color is much faster but is more likely to result in joint locations being lost due to self-occlusions throughout the video sequence. The technique used on the upper body is more robust to occlusions, as there is a larger region of color to detect and the likelihood of full occlusion of a body part is much lower than occlusion of a joint. However, the processing time required is considerably higher when body part locations need to be converted into joint locations.

C. Sampling 2D Poses from the Joint Maps

The next step the system takes is to randomly sample N different 2D poses from the joint location distributions. The sampling is done with replacement using the PDF of each joint to control the sampling. The poses thus generated are a collection of the most likely poses estimated from a single frame. Figure 5 shows an example of fifty 2D poses sampled from the joint distributions.

Fig. 5. Example of 50 2D poses sampled from the joint distribution maps. Red dots indicate sampled joint locations and the green lines show which joints are connected in each sample pose.
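One way to implement this sampling step, treating each normalized joint map as a categorical distribution over pixel locations, is sketched below; the joint naming and the value of N are illustrative.

```python
import numpy as np

def sample_poses(joint_maps, n_samples=50, rng=None):
    """Draw n_samples candidate 2D poses from per-joint location PDFs.

    joint_maps: dict mapping joint name -> (H, W) probability map (sums to 1).
    Returns a list of n_samples poses, each a dict of joint name -> (row, col)."""
    rng = np.random.default_rng() if rng is None else rng
    poses = [dict() for _ in range(n_samples)]
    for name, pdf in joint_maps.items():
        h, w = pdf.shape
        p = pdf.ravel()
        p = p / p.sum()  # guard against tiny normalization error
        # Sample flat pixel indices with replacement, weighted by the joint's PDF.
        flat_idx = rng.choice(h * w, size=n_samples, replace=True, p=p)
        rows, cols = np.unravel_index(flat_idx, (h, w))
        for pose, r, c in zip(poses, rows, cols):
            pose[name] = (int(r), int(c))
    return poses
```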

D. Converting 2D Poses into 3D Poses

Converting the 2D poses into poses in 3D space is done by detecting foreshortening and requires that we exploit the approximate known dimensions of the human body. In this system, all body part lengths are measured with respect to the length of the torso. This helps make the system more robust and allows the trainer to be any distance from the camera. In our coarse model of the human body, the shoulder line is 0.6 times the length of the torso, the upper arms are 0.4 times the length of the torso, and the lower arms are 0.35 times the length of the torso. However, this model could be extended to the case of multiple human instructors by learning probability distributions over the lengths rather than a single proportional length.

The limitation of using foreshortening to generate candidate 3D poses is that the user cannot bend forward at the waist during the video sequence or the normalization factor will be thrown off. The user can, however, move in any other manner desired. The user can freely move any distance from the camera. Also, if the user is not facing the camera (or even with his back to the camera) the system will detect the foreshortened shoulder width and still be able to generate 3D poses.

Converting a given 2D pose into 3D is thus a matter of figuring out how far forward or backwards each joint needs to move in order to make each body part the correct length in 3D space. For example, if the upper left arm is measured to be length D_measured in the current 2D pose and the upper left arm is supposed to be length D_true in 3D space, then the left elbow could either be forward or backwards the distance D_offset, where

    D_offset = sqrt(D_true^2 - D_measured^2).    (1)

A top-down view of how a single 2D upper body pose can be converted into 8 possible 3D poses is shown in Figure 6. Figure 7 shows a frontal view of the results of 2D to 3D conversion using the above-described method.

Fig. 6. Top-down view of how a single 2D upper body pose would be converted into 8 different possible 3D poses. Grey lines indicate the measured length of the 2D pose body parts and the red lines indicate the possible poses in 3D.

Fig. 7. Results for a single 2D pose (left) converted into all possible 3D poses (right).
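Equation (1) and the enumeration of candidate poses can be sketched as follows. The depth_offset function is Equation (1) with a clamp for noisy measurements; the way the ambiguous limbs are listed (and hence how three binary choices yield the 2^3 = 8 upper-body candidates of Figure 6) is our reading of the figure, not a statement of the authors' exact bookkeeping.

```python
import itertools
import numpy as np

def depth_offset(d_true, d_measured):
    """Equation (1): out-of-plane distance a joint must move so the limb has its
    true (torso-relative) length in 3D.  Clamps to zero if the 2D measurement is
    slightly longer than the model length due to noise."""
    return np.sqrt(max(d_true ** 2 - d_measured ** 2, 0.0))

def candidate_3d_poses(joints_2d, limbs):
    """Enumerate candidate 3D poses from a single 2D pose.

    joints_2d: dict joint name -> (x, y) in torso-normalized units.
    limbs: list of (parent, child, true_length) tuples with an ambiguous
    forward/backward sign, listed parent-before-child so a child's depth builds
    on its parent's.  Three such limbs give the 8 candidates of Fig. 6."""
    poses = []
    for signs in itertools.product((+1.0, -1.0), repeat=len(limbs)):
        pose = {name: np.array([x, y, 0.0]) for name, (x, y) in joints_2d.items()}
        for sign, (parent, child, true_len) in zip(signs, limbs):
            measured = np.linalg.norm(pose[child][:2] - pose[parent][:2])
            # Child joint inherits the parent's depth plus/minus the offset.
            pose[child][2] = pose[parent][2] + sign * depth_offset(true_len, measured)
        poses.append(pose)
    return poses
```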

E. Converting 3D Human Poses into Robot Angles

The robot's upper body has 8 degrees of freedom (3 for each shoulder and one for each elbow) and the lower body has 12 degrees of freedom (3 for each hip, 1 for each knee, and 2 for each ankle). Each degree of freedom is controlled by a servo motor. We use position-based control, so motor commands are simply joint angles from an initial "rest" state. Converting each of the 3D poses into the corresponding angles for the robot joints is performed differently for the upper body and lower body.

The upper body angles are found directly. Starting with the upper left arm, the system detects the amount of forward/backward rotation in degrees, saves that angle, and then rotates all of the left arm joints about the shoulder using the negative of the found angle. This procedure is carried out for each of the degrees of freedom until all of the joints have been rotated back to their initial state. Thus, after finding all the angles required to get the 3D pose to its zero state, we have all the motor commands the robot needs to perform to get to the current 3D pose.
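This serial "measure one angle, then undo it" procedure can be sketched for a single degree of freedom as follows; the rest pose (arm hanging straight down along -y), the axis convention, and the joint names are illustrative assumptions, not the HOAP-2's actual frame definitions.

```python
import numpy as np

def rot_x(angle):
    """Rotation matrix about the robot's lateral (x) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

def isolate_shoulder_pitch(shoulder, elbow, wrist):
    """Measure the shoulder's forward/backward rotation and rotate the arm back.

    The angle found is one motor command; the arm joints are returned with that
    rotation removed, so the next degree of freedom can be isolated the same way."""
    v = elbow - shoulder
    pitch = np.arctan2(-v[2], -v[1])   # rotation about x that produced the swing
    undo = rot_x(-pitch)               # "negative of the found angle"
    elbow_out = shoulder + undo @ (elbow - shoulder)
    wrist_out = shoulder + undo @ (wrist - shoulder)
    return pitch, elbow_out, wrist_out
```

Repeating analogous steps for the remaining shoulder and elbow degrees of freedom walks the arm back to its rest state while accumulating the motor commands.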

Unlike the upper body, the lower body angles are solved using inverse kinematics and an iterative optimization. To find the angles that generate each of the desired 3D leg positions for a given pose, the degrees of freedom are adjusted iteratively using the Newton-Raphson method until the ankle locations converge to the desired 3D points. The discrepancy between the upper and lower body processing techniques is due to the different motor configurations for arms and legs on the HOAP-2 humanoid. Ambiguities that arise from the motor configurations in the robot hip made it impossible to isolate the hip angles serially as was done with the upper body angles. The direct technique used on the upper body is much faster than the iterative technique used on the lower body.
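An iterative leg solve of this kind can be sketched with a numerically estimated Jacobian and a pseudo-inverse Newton-Raphson update, as below. The HOAP-2's leg forward kinematics are not reproduced here, so forward_kinematics is a placeholder the reader would supply; the tolerances and iteration counts are illustrative.

```python
import numpy as np

def solve_leg_ik(forward_kinematics, target_ankle, q0,
                 tol=1e-4, max_iters=100, eps=1e-6):
    """Iteratively adjust leg joint angles until the ankle reaches a 3D target.

    forward_kinematics: callable mapping a joint-angle vector q (6 DOF per leg:
        3 hip, 1 knee, 2 ankle) to the ankle's 3D position -- supplied by the user.
    target_ankle: desired (x, y, z) ankle location from the candidate 3D pose.
    q0: initial joint angles (e.g. the rest state)."""
    q = np.asarray(q0, dtype=float).copy()
    target = np.asarray(target_ankle, dtype=float)
    for _ in range(max_iters):
        error = target - forward_kinematics(q)
        if np.linalg.norm(error) < tol:
            break
        # Finite-difference Jacobian of ankle position w.r.t. each joint angle.
        jac = np.zeros((3, q.size))
        for i in range(q.size):
            dq = np.zeros_like(q)
            dq[i] = eps
            jac[:, i] = (forward_kinematics(q + dq) - forward_kinematics(q)) / eps
        # Newton-Raphson step via the Jacobian pseudo-inverse (least squares).
        q += np.linalg.pinv(jac) @ error
    return q
```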

Throughout the process of converting each of the 3D poses into robot angles, any poses generated that require motor commands that are outside the limits imposed by the robot's physical structure are removed from the list of possible poses. This both saves processing time and greatly reduces the number of 3D poses that are generated for the given frame.
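Pruning the kinematically invalid candidates amounts to a simple filter over the converted joint-angle vectors, sketched below; the limit arrays are placeholders rather than HOAP-2 specifications.

```python
import numpy as np

def within_joint_limits(angles, lower, upper):
    """True if every commanded joint angle lies inside the robot's physical limits."""
    angles = np.asarray(angles)
    return bool(np.all(angles >= lower) and np.all(angles <= upper))

# Keep only the candidate motor-command vectors the robot can actually reach.
# lower_limits / upper_limits would hold the per-joint limits (placeholders here).
# valid_candidates = [q for q in candidate_angles
#                     if within_joint_limits(q, lower_limits, upper_limits)]
```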

F. Finding the Smoothest Path Through the Frames

After performing all the steps listed above, the system is inevitably left with a fair number of possible poses (motor commands) it could send to the robot for any given frame in the sequence. Initially, we tried to use a tree search to look forward a few frames
