JP7660387B2

JP7660387B2 - Object-to-robot pose estimation from a single RGB image

Info

Publication number: JP7660387B2
Application number: JP2021018845A
Authority: JP
Inventors: トレンブレイジョナサン; ウォルタータイリースティーブン; トーマスバーチフィールドスタンリー
Original assignee: エヌビディアコーポレーション
Priority date: 2020-06-15
Filing date: 2021-02-09
Publication date: 2025-04-11
Anticipated expiration: 2041-02-09
Also published as: JP2021197151A

Description

This application is a continuation-in-part of U.S. patent application Ser. No. 16/405,662 (Reference: 741851/18-RE-0161-US02), entitled "DETECTING AND ESTIMATING THE POSE OF AN OBJECT USING A NEURAL NETWORK MODEL," filed on May 7, 2019, which claims priority to U.S. Provisional Patent Application No. 62/672,767 (Reference: 18-RE-0161US01), entitled "DETECTION AND POSE ESTIMATION OF HOUSEHOLD OBJECTS FOR HUMAN-ROBOT INTERACTION," filed on May 17, 2018, both of which are incorporated herein by reference in their entirety.

This application is also a continuation-in-part of U.S. patent application Ser. No. 16/657,220 (Reference: 1R2674.006901/19-SE-0341US01), entitled "POSE DETERMINATION USING ONE OR MORE NEURAL NETWORKS," filed on October 18, 2019, which is incorporated herein by reference in its entirety.

The present disclosure relates to systems and methods for pose estimation.

Pose estimation generally refers to a computer vision technique that determines the Euclidean position and orientation of an object, usually relative to a particular camera. Pose estimation has many applications, but is particularly useful in the context of robotic manipulation systems. Until now, robotic manipulation systems have required a camera mounted on the robot itself (i.e., an in-hand camera) to capture images of an object, and/or a camera external to the robot to capture images of the object. In both cases, the camera must be calibrated to the robot before the pose of the captured object relative to the robot can be estimated.

Calibration of an in-hand camera needs to be performed only once to determine the camera's position relative to the robot. However, because the camera is rigidly mounted on the robot, its field of view is limited, which prevents it from seeing the surrounding context and from being easily adjusted when an object moves out of its field of view. An external camera can have a wider field of view and can optionally supplement the in-hand camera, but it must be recalibrated every time it is moved, even slightly. Calibration is typically a tedious, delicate, and error-prone offline process. Note that while these issues are described with particular reference to robotic manipulation systems, the same limitations apply to any pose estimation system operable to estimate the pose of one object relative to another object (which may or may not be moving).

There is a need to address these and/or other problems associated with the prior art.

A method, computer-readable medium, and system are disclosed for object-to-object pose estimation from an image. In use, an image of a first object and a target object is identified, the image having been captured by a camera external to the first object and the target object. Additionally, the image is processed using a first neural network to estimate a first pose of the target object relative to the camera. Furthermore, the image is processed using a second neural network to estimate a second pose of the first object relative to the camera. Still further, a third pose of the first object relative to the target object is calculated using the first pose and the second pose.

FIG. 1 is a diagram of a method for object-to-object pose estimation from an image, according to one embodiment.
FIG. 2 is a diagram of a system for object-to-object pose estimation from an image, according to one embodiment.
FIG. 3 is a block diagram associated with the first neural network of FIG. 2, according to one embodiment.
FIG. 4 is a block diagram associated with the second neural network of FIG. 2, according to one embodiment.
A diagram of a robotic grasping system controlled using a robot-to-object pose estimation system, according to one embodiment.
A diagram illustrating inference and/or training logic, according to at least one embodiment.
A diagram illustrating inference and/or training logic, according to at least one embodiment.
A diagram illustrating training and deployment of a neural network, according to at least one embodiment.
A diagram illustrating an exemplary data center system, according to at least one embodiment.

FIG. 1 illustrates a method 100 for object-to-object pose estimation from an image, according to one embodiment. The method 100 may be performed by any computing system, which may include one or more computing devices, one or more computer processors, non-transitory memory, circuitry, etc. In one embodiment, the non-transitory memory may store instructions executable by the one or more computing devices and/or the one or more computer processors to perform the method 100. In another embodiment, circuitry may be configured to perform the method 100.

As shown in operation 102, an image of a first object and a target object is identified, the image having been captured by a camera external to the first object and the target object. In the context of the present description, the target object and the first object are physically separate objects (e.g., in some proximity to each other). In one embodiment, the first object may be a robotic grasping system having a robotic arm for grasping objects. In furtherance of this embodiment, the target object may be a known object to be grasped by the robotic grasping system. In another embodiment, the first object may be another autonomous object, such as an autonomous vehicle, that interacts with (or makes decisions related to) a target object such as another car. Of course, however, the first object and the target object may be any objects for which the pose between the two is desired (e.g., for a computer vision application).

As described above, a camera external to the first object and the target object captures the image of the first object and the target object. For purposes of the present description, the camera is external to the first object and the target object by being positioned independently of them. For example, the camera may be mounted on neither the first object nor the target object. In one embodiment, the camera may be mounted on a tripod or other rigid surface to capture the image of the first object and the target object.

The image of the target object may be a red-green-blue (RGB) image in one embodiment, or a grayscale image in another embodiment. The image may or may not include depth. In another embodiment, the image may be a single image. In a further embodiment, the image may be a combination of images obtained from various types of sensors, such as a six-channel image that may include red, green, blue, infrared, ultraviolet, and radar channels. Furthermore, the camera may be a monocular RGB camera. Of course, however, the camera may capture any number of channels at any wavelength of light (visible or invisible to humans), or even at non-light wavelengths. For example, the camera may utilize infrared, ultraviolet, microwave, radar, sonar, or other technologies to capture the image.

Additionally, as shown in operation 104, the image is processed using a first neural network to estimate a first pose of the target object relative to the camera. The first neural network may be trained to output two-dimensional (2D) image locations (x, y coordinates) of key points of the target object. These 2D image locations may then be used, along with the 3D coordinates of a model of the target object, to estimate the pose of the target object relative to the camera. In one embodiment, a perspective-n-point (PnP) algorithm may be used to calculate the pose of the target object relative to the camera. In another embodiment, the first pose of the target object relative to the camera may include a three-dimensional (3D) rotation and translation of the target object relative to the camera.

By way of example, the first neural network may be the one disclosed in U.S. patent application Ser. No. 16/405,662 (Reference: 741851/18-RE-0161-US02), entitled "DETECTING AND ESTIMATING THE POSE OF AN OBJECT USING A NEURAL NETWORK MODEL," filed on May 7, 2019, which is incorporated herein by reference in its entirety. Further details relating to an embodiment of the first neural network are described below with reference to FIG. 3.

Furthermore, as shown in operation 106, the image is processed using a second neural network to estimate a second pose of the first object relative to the camera. The second neural network may be trained to output 2D image locations (x, y coordinates) of key points of the first object. These 2D image locations may then be used, along with the 3D coordinates of a model of the first object, to estimate the pose of the first object relative to the camera. In one embodiment, a PnP algorithm may be used to calculate the pose of the first object relative to the camera. In another embodiment, the second pose of the first object relative to the camera may include a 3D rotation and translation of the first object relative to the camera.

As an option, the second neural network may also perform online calibration of the camera. This online calibration allows the camera to be moved at runtime, including during operation of, for example, a robotic grasping system or other autonomous system.

By way of example, the second neural network may be the one disclosed in U.S. patent application Ser. No. 16/657,220 (Reference: 1R2674.006901/19-SE-0341US01), entitled "POSE DETERMINATION USING ONE OR MORE NEURAL NETWORKS," filed on October 18, 2019, which is incorporated herein by reference in its entirety. Further details relating to an embodiment of the second neural network are described below with reference to FIG. 4.

Still further, as shown in operation 108, a third pose of the first object relative to the target object is calculated using the first pose and the second pose. Thus, a pose estimate from the first object to the target object may be calculated. In one embodiment, the third pose may be the pose of the target object relative to the coordinate frame of the first object. In another embodiment, the third pose may be calculated by multiplying the inverse of the output of one of the neural networks by the output of the other neural network, for example the inverse of the first pose multiplied by the second pose, or the inverse of the second pose multiplied by the first pose.
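The composition described above can be sketched with 4x4 homogeneous transforms. This is a minimal illustration, not the patented implementation: the function names, the NumPy representation, and the example poses are all assumptions made here for clarity.

```python
import numpy as np


def make_pose(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def invert_pose(pose):
    """Invert a rigid transform in closed form: (R, t)^-1 = (R^T, -R^T t)."""
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse


def relative_pose(target_in_camera, first_in_camera):
    """Pose of the target object expressed in the first object's frame
    (inverse of the second pose multiplied by the first pose)."""
    return invert_pose(first_in_camera) @ target_in_camera


def rot_z(angle):
    """Rotation about the z axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical example poses (not taken from the source).
first_in_camera = make_pose(rot_z(np.pi / 2), [1.0, 0.0, 2.0])   # second pose
target_in_camera = make_pose(np.eye(3), [1.0, 0.5, 2.0])         # first pose
target_in_first = relative_pose(target_in_camera, first_in_camera)
```

Because both inputs are expressed in the same camera frame, the camera cancels out of the product, which is why the camera itself never needs a separate offline calibration against the first object.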

In the context where the first object is a robotic grasping system, the robotic grasping system may further be caused to grasp the target object (i.e., the known object) using the third pose. For example, the third pose may be provided to the robotic grasping system to enable it to locate and grasp the target object. Of course, in other embodiments, the third pose may be used for other purposes, including, for example, controlling an autonomous vehicle or causing an autonomous vehicle to make decisions based on the relative location of the target object.

As an option, the first pose and/or the second pose may be refined prior to calculating the third pose. For example, the first pose may be refined by iteratively matching the image with a synthetic projection of the model under the first pose and then adjusting parameters of the first pose based on the results of the iterative matching. Similarly, the second pose may be refined by iteratively matching the image with a synthetic projection of the model under the second pose and then adjusting parameters of the second pose based on the results of the iterative matching.

To this end, the pose between the first object and the target object may be estimated using two neural networks. In particular, one neural network may estimate a first pose of the target object relative to the camera, and a second neural network may estimate a second pose of the first object relative to the camera. The outputs of the neural networks (i.e., the first pose and the second pose) may then be used as a basis for determining the pose between the first object and the target object.

More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.

FIG. 2 illustrates a system 200 for object-to-object pose estimation from an image, according to one embodiment. For example, the system 200 may implement the method 100 of FIG. 1. In one embodiment, the system 200 may be located in the cloud and thus remotely with respect to the objects. In another embodiment, the system 200 may be located on a computing device local to the objects.

As shown, the system 200 includes a first module 201 that includes a first neural network 202, a second module 203 that includes a second neural network 204, and a processor 206. An image is input to the first neural network 202 and the second neural network 204. The image is of a first object and a target object, and is captured by a camera external to both the first object and the target object. The camera may be external to the system 200. However, the image may be provided to the system 200 by the camera via a network, shared memory, or in any other manner.

The first neural network 202 receives the image as input and processes it to estimate a first pose of the target object relative to the camera (i.e., the target-object-to-camera pose). The first neural network 202 may be a deep neural network that estimates the 6-DoF pose (e.g., 3D rotation and translation) of a known object relative to the camera. This network 202 may consist of a multi-stage convolutional network that converts the input RGB image into a set of belief maps, one per key point. In one embodiment, n=9 key points are used to represent the vertices of a bounding cuboid together with its centroid. In addition to the belief maps, the network 202 may output n-1 affinity maps, one for each non-centroid key point. Each affinity map is a 2D field of unit vectors pointing toward the nearest object centroid. These maps are used by a post-processing step to individuate objects, allowing the system to handle multiple instances of each object. The pose may be determined by the first module 201, which applies a PnP algorithm to the key points detected as peaks in the belief maps.
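As a rough illustration of "key points detected as peaks in the belief maps," the sketch below takes the single strongest response in each map. It assumes one object instance per image, so it ignores the affinity maps entirely; the function name, the array layout (one 2D map per key point), and the threshold value are assumptions made here, not details from the source.

```python
import numpy as np


def find_keypoints(belief_maps, threshold=0.1):
    """Return one (x, y) pixel location per belief map, or None when the
    strongest response is below `threshold` (key point not detected)."""
    keypoints = []
    for belief in belief_maps:
        # argmax works on the flattened map; unravel back to (row, col).
        row, col = np.unravel_index(np.argmax(belief), belief.shape)
        if belief[row, col] < threshold:
            keypoints.append(None)
        else:
            keypoints.append((int(col), int(row)))  # x = column, y = row
    return keypoints


# Hypothetical input: two 50x50 maps, the first with a clear peak.
maps = np.zeros((2, 50, 50))
maps[0, 10, 20] = 1.0
keypoints = find_keypoints(maps)
```

With several instances of the same object in view, each map would contain several peaks, and the affinity maps would be needed to assign each vertex peak to its centroid before running PnP per instance.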

In one embodiment, the input to the network 202 may consist of a 533x400 image processed by a VGG-based feature extractor, resulting in a 50x50x512 feature block. These features may be processed by a series of six stages, each containing seven convolutional layers, that output and refine the belief maps described above.

The second neural network 204 receives the image as input and processes it to estimate a second pose of the first object relative to the camera (i.e., the first-object-to-camera pose). In one embodiment, the second neural network 204 may be a deep neural network that estimates the 6-DoF pose of a robot relative to the camera. This network 204 may consist of an encoder-decoder that converts the input RGB image into a set of belief maps, one per key point. Because a given scene contains only one robot, affinity fields may not be needed.

In one embodiment, key points located at the joints of the robot may be defined so as to achieve pose stability when the arm is mostly outside the camera's field of view, which occurs when the camera views the scene from a close distance. In one embodiment, the input to the network 204 is a 400x400 image, downsampled and center-cropped from the original 640x480. The layers of the network 204 may be as follows: the encoder follows the same layer structure as VGG, while the decoder is built from four upsampling layers followed by two convolutional layers that produce the key point belief maps. The second module 203 applies a PnP algorithm to the key points detected as peaks in the belief maps.
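One way the 640x480 to 400x400 preprocessing might be done is sketched below. The source states only that the image is downsampled and center-cropped; the order (scale the short side to 400, then crop the width), the nearest-neighbor resampling, and the function names are assumptions made for this illustration.

```python
import numpy as np


def resize_nearest(image, out_height, out_width):
    """Nearest-neighbor resize of an (H, W, C) image using index lookup."""
    in_height, in_width = image.shape[:2]
    rows = np.arange(out_height) * in_height // out_height
    cols = np.arange(out_width) * in_width // out_width
    return image[rows[:, None], cols]


def center_crop(image, size):
    """Crop the central size x size window from an (H, W, C) image."""
    height, width = image.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def preprocess(image, size=400):
    """Scale the shorter side to `size`, then center-crop to size x size."""
    in_height, in_width = image.shape[:2]
    scale = size / min(in_height, in_width)
    scaled = resize_nearest(
        image, round(in_height * scale), round(in_width * scale)
    )
    return center_crop(scaled, size)


# Hypothetical 640x480 RGB frame (height 480, width 640).
frame = np.zeros((480, 640, 3), dtype=np.uint8)
network_input = preprocess(frame)
```

Under these assumptions the intermediate image is 533x400, so the crop discards roughly 66 columns on each side. Any crop or rescale also changes the effective camera intrinsics, which would need to be adjusted accordingly before applying PnP to key points found in the cropped image.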

As an option, the first pose and/or the second pose may be refined to reduce estimation errors caused by limited training data and/or network capacity. This refinement may be performed by adjusting the pose parameters through iterative matching of the input image against a synthetic projection of the model under the current pose.

Because the first neural network 202 and the second neural network 204 do not require each other's outputs, they may, but need not, operate in parallel. The respective outputs of the first neural network 202 and the second neural network 204 are provided to the processor 206. The processor 206 uses the first pose and the second pose to calculate a third pose of the first object relative to the target object. According to one embodiment, the third pose may be calculated using Equation 1:

$$ {}^{R}_{O}T = \left( {}^{C}_{R}T \right)^{-1} \, {}^{C}_{O}T \qquad (1) $$

where ${}^{R}_{O}T$ is the pose of the object in the robot frame, ${}^{C}_{R}T$ is the pose of the robot in the camera frame (computed by the second network 204), and ${}^{C}_{O}T$ is the pose of the object in the camera frame (computed by the first network 202).

In an exemplary embodiment where the first object is a robotic grasping system and the target object is a known object, the processor 206 may then use the third pose to cause the robotic grasping system to grasp the target object. The processor 206 may communicate with the robotic grasping system over a network, such as the Internet (e.g., when the system 200 is located remotely from the objects) or Ethernet (registered trademark) (e.g., when the system 200 is located locally to the objects).

Because the first neural network 202 only estimates the pose of the target object relative to the camera, only the first neural network 202 (and not necessarily the second neural network 204) may be trained on the target object. In this manner, the target object may be "known" to the first neural network 202, or in other words may be a "known object." Of course, the first neural network 202 may also be trained on other categories or types of target objects.

Similarly, because the second neural network 204 only estimates the pose of the first object relative to the camera, only the second neural network 204 (and not necessarily the first neural network 202) may be trained on the first object. In other words, the first object may be "known" to the second neural network 204. Note, however, that the second neural network 204 may also be trained on other objects (e.g., other robots or other autonomous objects).

To this end, in contrast to end-to-end learning, the modular approach of the present system 200 allows the system to be reused without retraining every network. For example, to apply the system 200 to a new robot, only the second network 204 needs to be retrained. Similarly, to apply the system 200 to a new object, only the first network 202 needs to be trained on that object, independently of the robot and of other objects. Moreover, this modular approach facilitates testing and refinement of the individual components to ensure accuracy and reliability.

FIG. 3 is a block diagram associated with the first neural network 202 of FIG. 2, according to one embodiment. Of course, the block diagram is described as one possible embodiment associated with the first neural network 202 of FIG. 2. This embodiment is described in more detail in U.S. patent application Ser. No. 16/405,662 (Reference: 741851/18-RE-0161-US02), entitled "DETECTING AND ESTIMATING THE POSE OF AN OBJECT USING A NEURAL NETWORK MODEL," filed on May 7, 2019, which is incorporated herein by reference in its entirety.

In the illustrated embodiment, the pose estimation system 300 includes a key point module 310, a set of multi-stage modules 305, and a pose unit 320. Although the pose estimation system 300 is described in the context of processing units, one or more of the key point module 310, the set of multi-stage modules 305, and the pose unit 320 may be implemented by a program, custom circuitry, or a combination of custom circuitry and a program. For example, the key point module 310 may be implemented by a GPU, a central processing unit (CPU), or any processor capable of processing an image to generate key point data.

The pose estimation system 300 receives an image captured by a single camera. The image may include one or more objects for detection. In one embodiment, the image includes per-pixel color data and no depth data. The pose estimation system 300 first detects key points associated with an object, and then estimates the 2D projections of the vertices that define a bounding volume enclosing the object associated with the key points. The key points may include the centroid of the object and the vertices of the bounding volume enclosing the object. The key points are not explicitly visible in the image, but are instead inferred by the pose estimation system 300. In other words, the object of interest is visible in the image, except for any portions that may be occluded, whereas the key points associated with the object of interest are not explicitly visible in the image. The 2D locations of the key points are estimated by the pose estimation system 300 using only the image data. The pose unit 320 recovers the 3D pose of the object using the estimated 2D locations, the camera intrinsic parameters, and the dimensions of the object.
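The relationship among the object dimensions, the camera intrinsic parameters, and the 2D key point locations can be sketched with the forward pinhole projection, which is the mapping a PnP solver inverts to recover pose. This is an illustration under assumptions: the vertex ordering, the placement of the centroid last, the function names, and the example intrinsics are all chosen here and are not specified by the source.

```python
import numpy as np


def cuboid_keypoints(width, height, depth):
    """3D model points in the object frame: the eight vertices of a bounding
    cuboid centered on the origin, followed by the centroid (9 x 3)."""
    half = np.array([width, height, depth]) / 2.0
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    return np.vstack([signs * half, np.zeros((1, 3))])


def project_points(points_object, rotation, translation, camera_matrix):
    """Project object-frame 3D points to pixel coordinates (N x 2) given the
    object pose in the camera frame and a 3x3 intrinsic matrix."""
    points_camera = points_object @ rotation.T + translation
    pixels_h = points_camera @ camera_matrix.T
    return pixels_h[:, :2] / pixels_h[:, 2:3]  # perspective divide


# Hypothetical intrinsics and pose: a 20 cm cube, 2 m in front of the camera.
camera_matrix = np.array(
    [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
)
model_points = cuboid_keypoints(0.2, 0.2, 0.2)
pixels = project_points(
    model_points, np.eye(3), np.array([0.0, 0.0, 2.0]), camera_matrix
)
```

Given the nine 3D model points and their detected 2D locations, a PnP solver searches for the rotation and translation under which this projection best reproduces the detections.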

The key point module 310 receives an image containing an object and outputs image features. In one embodiment, the key point module 310 includes a multi-layer convolutional neural network (i.e., part of the first network 202). In one embodiment, the key point module 310 includes the first ten layers of a Visual Geometry Group (VGG-19) neural network pre-trained on the ImageNet training database, followed by two 3x3 convolutional layers that reduce the feature dimension from 512 to 256, and then from 256 to 128. The key point module 310 outputs features for three channels, one per channel (e.g., RGB).

画像特徴は、多段モジュール３０５のセットへ入力される。一実施例では、多段モジュール３０５のセットは、物体の重心を検出するように構成された第１の多段モジュール３０５、及び物体を取り囲むバウンディング・ボリュームの頂点を検出するように構成された追加的な多段モジュール３０５を並列に含む。一実施例では、多段モジュール３０５のセットは、画像特徴を複数の経路で処理して、順次バウンディング・ボリュームの重心及び頂点を検出するために使用される単一の多段モジュール３０５を含む。一実施例では、多段モジュール３０５は、重心を検出することなく頂点を検出するように構成される。 The image features are input to a set of multi-stage modules 305. In one embodiment, the set of multi-stage modules 305 includes a first multi-stage module 305 configured to detect the center of gravity of the object, and additional multi-stage modules 305 configured in parallel to detect vertices of a bounding volume surrounding the object. In one embodiment, the set of multi-stage modules 305 includes a single multi-stage module 305 that is used to process the image features in multiple passes to sequentially detect the center of gravity and vertices of the bounding volume. In one embodiment, the multi-stage module 305 is configured to detect vertices without detecting the center of gravity.

各多段モジュール３０５は、Ｔ段のビリーフ・マップ・ユニット３１５を含む。一実施例では、段数は６に等しい（たとえば、Ｔ＝６）。ビリーフ・マップ・ユニット３１５－１は、第１の段であり、ビリーフ・マップ・ユニット３１５－２は、第２の段である、などである。キー点モジュール３１０によって抽出された画像特徴は、多段モジュール３０５内のビリーフ・マップ・ユニット３１５のそれぞれに渡される。一実施例では、キー点モジュール３１０及び多段モジュール３０５は、入力としてサイズがｗｘｈｘ３のＲＧＢ画像を受け取り、たとえば、ビリーフ・マップなどの２つの異なる出力を生成するよう分岐するフィードフォワードのニューラル・ネットワーク（すなわち、第１のニューラル・ネットワーク２０２）を含む。一実施例では、ｗ＝６４０且つｈ＝４８０である。ビリーフ・マップ・ユニット３１５の段は、順次動作し、このとき各段（ビリーフ・マップ・ユニット３１５）は、画像特徴だけでなく、直前の段の出力も考慮している。 Each multi-stage module 305 includes T stages of belief map units 315. In one embodiment, the number of stages is equal to six (e.g., T=6). Belief map unit 315-1 is the first stage, belief map unit 315-2 is the second stage, and so on. Image features extracted by the key point module 310 are passed to each of the belief map units 315 in the multi-stage module 305. In one embodiment, the key point module 310 and the multi-stage module 305 include a feedforward neural network (i.e., the first neural network 202) that receives an RGB image of size wxhx3 as input and branches to generate two different outputs, such as belief maps. In one embodiment, w=640 and h=480. The stages of the Belief Map Unit 315 operate sequentially, with each stage (Belief Map Unit 315) taking into account not only image features but also the output of the previous stage.

各多段モジュール３０５内のビリーフ・マップ・ユニット３１５の段は、画像内の物体に関連付けられる単一の２Ｄロケーションの推定のためのビリーフ・マップを生成する。第１のビリーフ・マップは、物体の重心についての確率値を含み、追加的なビリーフ・マップは、物体を取り囲むバウンディング・ボリュームの頂点についての確率値を含む。 The stages of the belief map unit 315 in each multi-stage module 305 generate belief maps for the estimation of a single 2D location associated with an object in an image. A first belief map contains probability values for the object's centroid, and an additional belief map contains probability values for the vertices of a bounding volume that encloses the object.

一実施例では、検出された頂点の２Ｄロケーションは、それぞれが物体を取り囲み、そのシーンでの画像空間に投影された３Ｄバウンディング頂点の２Ｄ座標である。３Ｄのバウンディング・ボックスによって各物体を表現することにより、姿勢推定に十分であるが、なお物体の形状の詳細とは無関係な、各物体の抽象的な表現が定義される。バウンディング・ボリュームが３Ｄのバウンディング・ボックスである場合、９つの多段モジュール３０５を使用して、重心及び８つの頂点についてのビリーフ・マップを並列に生成することができる。姿勢ユニット３２５は、画像空間に投影された３Ｄバウンディング・ボックス頂点の２Ｄ座標を推定し、次いで従来的なコンピュータ・ビジョンのアルゴリズム又は別のニューラル・ネットワークのいずれかを使用して、ｐｅｒｓｐｅｃｔｉｖｅ－ｎ－ｐｏｉｎｔ（ＰｎＰ）から３Ｄ空間での物体ロケーション及び姿勢を推論する。ＰｎＰは、３Ｄ空間のｎロケーションのセット及び画像空間のｎロケーションの投影を使用して物体の姿勢を推定する。一実施例では、姿勢推定システム３００は、リアルタイムで、単一のＲＧＢ画像からクラッタ中にある既知の物体の３Ｄ姿勢を推定する。 In one embodiment, the 2D locations of the detected vertices are the 2D coordinates of the vertices of a 3D bounding volume that encloses the object, projected into image space for the scene. Representing each object by a 3D bounding box defines an abstract representation of each object that is sufficient for pose estimation, yet is independent of the details of the object's shape. When the bounding volume is a 3D bounding box, nine multi-stage modules 305 can be used to generate belief maps for the centroid and the eight vertices in parallel. The pose unit 325 estimates the 2D coordinates of the 3D bounding box vertices projected into image space, and then infers the object location and pose in 3D space using perspective-n-point (PnP), implemented with either a traditional computer vision algorithm or another neural network. PnP estimates the pose of an object from a set of n locations in 3D space and the projections of those n locations in image space. In one embodiment, the pose estimation system 300 estimates the 3D pose of a known object in clutter from a single RGB image in real time.
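
As an illustration only, the 3D-to-2D correspondences that PnP consumes can be sketched with a pinhole camera model. The sketch below shows the forward projection of a centroid and eight bounding box vertices, not the PnP solver itself. The function names, the box dimensions, the pose, and the intrinsic values (fx, fy, cx, cy) are hypothetical and do not appear in the disclosure.

```python
from itertools import product

def cuboid_keypoints(dims):
    """Return the centroid followed by the 8 vertices of a box centered at the origin."""
    dx, dy, dz = (d / 2.0 for d in dims)
    vertices = [(sx * dx, sy * dy, sz * dz) for sx, sy, sz in product((-1, 1), repeat=3)]
    return [(0.0, 0.0, 0.0)] + vertices

def project(points, rotation, translation, fx, fy, cx, cy):
    """Project object-frame 3D points to pixel coordinates (pinhole model, no distortion)."""
    pixels = []
    for p in points:
        # Rigid transform into the camera frame: X_cam = R * p + t
        xc, yc, zc = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels

# Hypothetical object (20 x 10 x 10 cm) placed 1 m in front of a 640x480 camera.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
keypoints_3d = cuboid_keypoints((0.2, 0.1, 0.1))
keypoints_2d = project(keypoints_3d, IDENTITY, (0.0, 0.0, 1.0),
                       fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

A PnP solver works in the opposite direction: given `keypoints_3d`, the estimated `keypoints_2d`, and the intrinsics, it searches for the rotation and translation that best reproduce the observed pixels.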

一実施例では、ビリーフ・マップ・ユニット３１５の段は、それぞれコンボリューショナル・ニューラル・ネットワーク（ＣＮＮ：ＣｏｎｖｏｌｕｔｉｏｎａｌＮｅｕｒａｌＮｅｔｗｏｒｋ）の段である。各段がＣＮＮである場合、各段は、データがニューラル・ネットワークを通過するにつれ、益々増大した有効受容野を活用する。この性質は、後方の段で益々増大した量のコンテキストを組み込むことにより、ビリーフ・マップ・ユニット３１５の段が曖昧さを解決できるようにしている。 In one embodiment, each stage of the Belief Map Unit 315 is a stage of a Convolutional Neural Network (CNN). When each stage is a CNN, each stage exploits an increasingly larger effective receptive field as data passes through the neural network. This property allows the stages of the Belief Map Unit 315 to resolve ambiguities by incorporating an increasingly larger amount of context in later stages.

一実施例では、ビリーフ・マップ・ユニット３１５の段は、キー点モジュール３１０によって抽出された１２８次元の特徴を受信する。一実施例では、ビリーフ・マップ・ユニット３１５－１は３つの３×３×１２８層及び１つの１×１×５１２層を含む。一実施例では、ビリーフ・マップ・ユニット３１５－２は１×１×９層である。一実施例では、ビリーフ・マップ・ユニット３１５－３から３１５－Ｔは、それぞれが１５３次元の入力（１２８＋１６＋９＝１５３）を受信し、１×１×１２８層又は１×１×１６層の前に５つの７×７×１２８層及び１つの１×１×１２８層を含むことを除いて第１の段と同一である。一実施例では、ビリーフ・マップ・ユニット３１５のそれぞれは、ｗ／８及びｈ／８のサイズであり、正規化線形ユニット（ＲｅＬＵ）活性化関数が全体的に交互配置されている。 In one embodiment, the stages of belief map units 315 receive the 128-dimensional features extracted by the key point module 310. In one embodiment, belief map unit 315-1 includes three 3×3×128 layers and one 1×1×512 layer. In one embodiment, belief map unit 315-2 is a 1×1×9 layer. In one embodiment, belief map units 315-3 through 315-T are identical to the first stage, except that each receives a 153-dimensional input (128+16+9=153) and includes five 7×7×128 layers and one 1×1×128 layer before the 1×1×128 or 1×1×16 layer. In one embodiment, each of the belief map units 315 is of size w/8 and h/8, with rectified linear unit (ReLU) activation functions interleaved throughout.
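
The channel and size arithmetic in the preceding paragraph can be checked with a short sketch. The constant and function names are hypothetical. Reading the 9 channels as the centroid plus eight vertices, and the 16 channels as the second output branch, is an assumption drawn from the surrounding description rather than an explicit statement in the text.

```python
FEATURE_CHANNELS = 128  # features produced by the key point module
BELIEF_CHANNELS = 9     # assumed: one belief map each for the centroid and 8 vertices
SECOND_BRANCH_CHANNELS = 16  # assumed: the other output branch implied by 128 + 16 + 9

def later_stage_input_channels():
    """Channels seen by stages after the first: image features plus both prior outputs."""
    return FEATURE_CHANNELS + SECOND_BRANCH_CHANNELS + BELIEF_CHANNELS

def belief_map_size(w, h, downsample=8):
    """Spatial size of a belief map for a w x h input image."""
    return w // downsample, h // downsample
```

For the 640×480 input mentioned earlier, `belief_map_size(640, 480)` gives an 80×60 map.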

図４は、一実施例による、図２の第２のニューラル・ネットワーク２０４の使用に関連付けられるブロック図である。もちろん、ブロック図は、図２の第２のニューラル・ネットワーク２０４に関連付けられる１つの可能な実施例として説明される。第２のニューラル・ネットワーク２０４に関するこの実施例は、２０１９年１０月１８日に出願した「ＰＯＳＥＤＥＴＥＲＭＩＮＡＴＩＯＮＵＳＩＮＧＯＮＥＯＲＭＯＲＥＮＥＵＲＡＬＮＥＴＷＯＲＫＳ」と題する米国特許出願第１６／６５７，２２０（参考：１Ｒ２６７４．００６９０１／１９－ＳＥ－０３４１ＵＳ０１）により詳細に説明されており、その全体が参照によって本明細書に組み込まれる。 FIG. 4 is a block diagram associated with the use of the second neural network 204 of FIG. 2, according to one embodiment. Of course, the block diagram is described as one possible embodiment associated with the second neural network 204 of FIG. 2. This embodiment of the second neural network 204 is described in more detail in U.S. Patent Application No. 16/657,220, entitled "POSE DETERMINATION USING ONE OR MORE NEURAL NETWORKS," filed on October 18, 2019 (Reference: 1R2674.006901/19-SE-0341US01), which is incorporated herein by reference in its entirety.

示されるように、ロボットの捕捉された画像４０２は、入力として訓練済みニューラル・ネットワーク２０４に与えられる。ロボットが本実施例のコンテキストで説明されるが、本明細書において説明されるブロック図は他の物体（たとえば、他の自律システム）に等しく適用できることに留意されたい。一選択肢として、処理の前に、解像度、色深度、又はコントラストを調節するためなど、この画像の何らかの事前処理又は拡張が実施されてもよい。少なくとも一実施例において、異なるロボットは異なる形状、サイズ、構成、運動学、及び特徴を有する可能性があるため、ネットワーク２０４は、特にロボット２０４のタイプについて訓練することができる。 As shown, a captured image 402 of a robot is provided as an input to the trained neural network 204. It should be noted that while a robot is described in the context of this example, the block diagrams described herein are equally applicable to other objects (e.g., other autonomous systems). As an option, some pre-processing or enhancement of the image may be performed prior to processing, such as to adjust resolution, color depth, or contrast. In at least one example, the network 204 may be trained specifically for the type of robot 204, since different robots may have different shapes, sizes, configurations, kinematics, and characteristics.

使用の際、ニューラル・ネットワーク２０４は、入力画像４０２を分析して、推論のセットとしてビリーフ・マップ４０６のセットを出力することができる。一選択肢として、特徴ポイントを位置特定するために他の次元決定推論を生成することができる。たとえば、ニューラル・ネットワーク２０４は、識別されるロボット特徴ごとに１つのビリーフ・マップ４０６を推論することができる。 In use, the neural network 204 may analyze the input images 402 and output a set of belief maps 406 as a set of inferences. As an option, other dimension determining inferences may be generated to locate feature points. For example, the neural network 204 may infer one belief map 406 for each robot feature identified.

少なくとも一実施例において、訓練に使用されるロボットのモデルは、追跡される具体的な特徴を識別することができる。これらの特徴は、訓練プロセスを通じて学習され得る。別の実施例では、特徴は、ロボットの姿勢をこれらの特徴から決定することができるように、そのロボットの様々な可動部分又はコンポーネントに位置特定され得る。さらなる実施例では、特徴は、ロボットのそれぞれの姿勢が１つ且つ１つだけの特徴の設定に対応するように、また特徴のそれぞれの設定が１つ且つ１つだけのロボット姿勢に対応するように、選択され得る。この一意性は、カメラからロボットへの姿勢が、捕捉された画像データに表現されるように特徴の一意な配向に基づいて決定され得るようにできる。 In at least one embodiment, a model of the robot used for training can identify specific features to be tracked. These features can be learned through a training process. In another embodiment, the features can be located on various moving parts or components of the robot such that a pose of the robot can be determined from the features. In a further embodiment, the features can be selected such that each pose of the robot corresponds to one and only one setting of the features, and each setting of the features corresponds to one and only one robot pose. This uniqueness can allow the camera-to-robot pose to be determined based on the unique orientation of the features as represented in the captured image data.

ネットワーク２０４に関して自動エンコーダ・ネットワークは、キー点を検出することができる。少なくとも一実施例において、ニューラル・ネットワーク２０４は入力として、サイズがｗｘｈｘ３のＲＧＢ画像を受け取り、形態ｗｘｈｘｎを有するｎのビリーフ・マップ４０６を出力する。少なくとも一実施例において、ＲＧＢＤ又は立体視画像を、入力として同じように受け取ることができる。任意選択で、ｗ＝６４０且つｈ＝４８０である。少なくとも一実施例において、キー点ごとの出力は、２Ｄのビリーフ・マップであり、ピクセル値はキー点がそのピクセルに投影される尤度を表現する。 Regarding network 204, an autoencoder network can detect key points. In at least one embodiment, neural network 204 receives as input an RGB image of size wxhx3 and outputs n belief maps 406 of the form wxhxn. In at least one embodiment, RGBD or stereoscopic images can be received as input as well. Optionally, w=640 and h=480. In at least one embodiment, the output for each key point is a 2D belief map, with pixel values representing the likelihood that the key point is projected onto that pixel.

一実施例では、ネットワーク２０４のエンコーダは、ＩｍａｇｅＮｅｔで事前訓練済みのＶＧＧ－１９のコンボリューショナル層を含む。別の実施例では、ＲｅｓＮｅｔベースのエンコーダを使用することができる。さらなる実施例では、ネットワーク２０４のデコーダ又はアップサンプリング・コンポーネントは、４つの２Ｄ転置コンボリューショナル層で構成され、各層には通常の３×３コンボリューショナル層及びＲｅＬＵ活性化関数が続く。なおさらに、出力された頭部は、それぞれ６４、３２、及びｎチャネルでＲｅＬＵ活性化を伴う３つのコンボリューショナル層（３×３、ストライド＝１、パディング＝１）で構成され得る。少なくとも一実施例において、最後のコンボリューショナル層の後に活性化層がない場合がある。別の実施例では、エンコーダ・ネットワークは、出力されたビリーフ・マップを正解（ｇｒｏｕｎｄｔｒｕｔｈ）ビリーフ・マップと比較するＬ２損失関数を使用して訓練することができ、ここで正解のビリーフ・マップはピークを生成するためにσ＝２ピクセルを使用して生成される。少なくとも一実施例において、立体視画像対の使用によって、これらの画像よって推定された姿勢を融合することを可能にできる、又は点群が計算され得、Ｐｒｏｃｒｕｓｔｅｓａｎａｌｙｓｉｓ又はｉｔｅｒａｔｉｖｅｃｌｏｓｅｓｔｐｏｉｎｔ（ＩＣＰ）などの処理を使用して決定された姿勢。 In one embodiment, the encoder of network 204 includes the convolutional layers of VGG-19 pre-trained on ImageNet. In another embodiment, a ResNet-based encoder can be used. In a further embodiment, the decoder or upsampling component of network 204 is composed of four 2D transposed convolutional layers, each followed by a regular 3×3 convolutional layer and a ReLU activation function. Still further, the output head can be composed of three convolutional layers (3×3, stride=1, padding=1) with ReLU activations and with 64, 32, and n channels, respectively. In at least one embodiment, there may be no activation layer after the last convolutional layer. In another embodiment, the encoder network can be trained using an L2 loss function that compares the output belief maps to ground truth belief maps, where the ground truth belief maps are generated using σ=2 pixels to generate the peaks. In at least one embodiment, the use of stereoscopic image pairs can allow the poses estimated from these images to be fused, or a point cloud can be computed and the pose determined using a process such as Procrustes analysis or iterative closest point (ICP).
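
A minimal sketch of the ground truth belief map and L2 loss described above follows. It assumes an unnormalized 2D Gaussian with a peak value of 1 centered on the key point, and a loss averaged over pixels. Neither detail is specified in the text, and the function names are hypothetical.

```python
import math

def gaussian_belief_map(width, height, keypoint, sigma=2.0):
    """Build a height x width map with a Gaussian peak (sigma in pixels) at the key point."""
    kx, ky = keypoint
    return [
        [math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def l2_loss(predicted, target):
    """Mean squared difference between two maps of equal size."""
    total = sum(
        (p - t) ** 2
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    )
    return total / (len(target) * len(target[0]))

# Small hypothetical example: a 16x12 map with the key point at pixel (5, 7).
target = gaussian_belief_map(16, 12, keypoint=(5, 7))
blank = [[0.0] * 16 for _ in range(12)]
```

Here `l2_loss(blank, target)` is positive and shrinks toward zero as a predicted map approaches the target.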

少なくとも一実施例において、ビリーフ・マップ４０６は、入力として、関連性のあるロボット特徴の位置を表現する二次元での座標のセットを決定することができるピーク抽出コンポーネント４０８、又はサービスに与えられ得る。一選択肢として、キー点座標は、個々のビリーフ・マップにおいて、まずガウス平滑化をこれらのビリーフ・マップに適用してノイズ効果を低減した後、閾値とされるピーク付近の値の重み付けされた平均として計算することができる。この重み付けされた平均は、サブピクセルの精度を可能にすることができる。少なくとも一実施例において、これらの二次元座標（又はピクセル・ロケーション）は、入力としてｐｅｒｓｐｅｃｔｉｖｅ－ｎ－ｐｏｉｎｔ（ＰｎＰ）モジュール４１４などの姿勢決定モジュールに提供することができる。この姿勢決定モジュールは、入力として、レンズ非対称性、焦点距離、主点、又は他のそのようなファクタに起因する画像アーチファクトを説明するために使用され得るカメラ用の校正情報など、カメラ固有データ４１０を受け入れることができる。 In at least one embodiment, the belief maps 406 may be provided as input to a peak extraction component 408, or service, that may determine a set of coordinates in two dimensions that represent the locations of relevant robot features. As an option, key point coordinates may be calculated in each belief map as a weighted average of values around thresholded peaks, after first applying Gaussian smoothing to these belief maps to reduce noise effects. This weighted average may allow for sub-pixel accuracy. In at least one embodiment, these two-dimensional coordinates (or pixel locations) may be provided as input to a pose determination module, such as a perspective-n-point (PnP) module 414. This pose determination module may accept as input camera-specific data 410, such as calibration information for the camera, that may be used to account for image artifacts due to lens asymmetry, focal length, principal point, or other such factors.
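
The weighted-average peak extraction described above can be sketched as follows. The Gaussian smoothing step is omitted for brevity, and the threshold, the window size, and the function name are hypothetical choices rather than values from the disclosure.

```python
def extract_keypoint(belief_map, threshold=0.1, window=2):
    """Return a sub-pixel (x, y) estimate near the strongest peak, or None if too weak."""
    height, width = len(belief_map), len(belief_map[0])
    # Locate the strongest response in the map.
    peak_value, px, py = max(
        (belief_map[y][x], x, y) for y in range(height) for x in range(width)
    )
    if peak_value < threshold:
        return None
    # Weighted average of pixel coordinates in a small window around the peak.
    weight_sum = x_sum = y_sum = 0.0
    for y in range(max(0, py - window), min(height, py + window + 1)):
        for x in range(max(0, px - window), min(width, px + window + 1)):
            w = belief_map[y][x]
            weight_sum += w
            x_sum += w * x
            y_sum += w * y
    return x_sum / weight_sum, y_sum / weight_sum

# Two equally strong neighboring pixels yield a location halfway between them.
demo_map = [[0.0] * 5 for _ in range(5)]
demo_map[2][2] = 1.0
demo_map[2][3] = 1.0
demo_keypoint = extract_keypoint(demo_map)
```

Because the result is a weighted mean rather than an argmax, it can fall between pixel centers, which is the source of the sub-pixel accuracy mentioned above.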

少なくとも一実施例において、この姿勢決定モジュールは、入力として、可能な姿勢を決定するための、このタイプのロボットの順運動学（ｆｏｒｗａｒｄｋｉｎｅｍａｔｉｃｓ）４１２についての情報を受信することもできる。運動学（ｋｉｎｅｍａｔｉｃｓ）は、このタイプのロボットの物理的な構成又は制限のために、ある特徴ロケーションだけが可能である探索空間を狭めるために使用することができる。この情報は、ＰｎＰアルゴリズムを使用して分析して、決定されたカメラからロボットへの姿勢を出力することができる。少なくとも一実施例において、ｐｅｒｓｐｅｃｔｉｖｅ－ｎ－ｐｏｉｎｔは、このロボット・マニピュレータの関節構成が既知であると仮定して、カメラの外的要因を復元するために使用される。カメラ空間又はカメラ座標系において、このロボットのベース座標又は他の特徴が正確に特定され得るため、この姿勢情報を使用して、カメラとロボットの間の相対的な距離及び配向を決定することができる。この目的のために、ニューラル・ネットワーク２０４は、ビリーフ・マップ４０６のセットを推論することなどにより、これらの特徴の位置を推論できるように訓練され得る。 In at least one embodiment, the pose determination module can also receive as input information about the forward kinematics 412 of this type of robot to determine possible poses. The kinematics can be used to narrow the search space to only those feature locations that are possible given the physical configuration or limitations of this type of robot. This information can be analyzed using a PnP algorithm to output a determined camera-to-robot pose. In at least one embodiment, perspective-n-point is used to recover the camera extrinsics, assuming that the joint configuration of the robot manipulator is known. Because the base coordinates or other features of the robot can be accurately located in camera space, or the camera coordinate system, this pose information can be used to determine the relative distance and orientation between the camera and the robot. To this end, the neural network 204 can be trained to infer the locations of these features, such as by inferring a set of belief maps 406.
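
To illustrate how a known joint configuration yields the 3D feature locations that PnP needs, the sketch below computes forward kinematics for a hypothetical planar arm with revolute joints. The arm geometry, the function name, and the choice of one key point per joint are assumptions for illustration, not the robot model of the disclosure.

```python
import math

def planar_arm_keypoints(joint_angles, link_lengths):
    """Return 3D key points (base, then the end of each link) in the robot base frame."""
    x = y = 0.0
    heading = 0.0
    keypoints = [(x, y, 0.0)]
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle                  # joint angles accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        keypoints.append((x, y, 0.0))
    return keypoints

# Two 1 m links; the second joint is bent by 90 degrees.
arm_points = planar_arm_keypoints([0.0, math.pi / 2], [1.0, 1.0])
```

Pairing these robot-frame points with the 2D key points extracted from the belief maps gives the correspondences from which the camera-to-robot transform can be estimated.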

少なくとも一実施例において、相対的な位置及び配向情報を使用して、カメラの視点からのカメラ座標空間が、ロボットのロボット座標空間と、次元及びアラインメントの両方について、位置合わせされたことを保証することができる。少なくとも一実施例において、これらの座標はカメラ座標系から正しいがロボットの座標系では正しくない場合があるため、ロボット５０４に対するカメラ５０２の不正確な配向又は位置により、ロボット５０４があるアクションを実行する誤った座標が与えられる可能性がある。この目的のために、カメラ５０２の相対的な位置及び配向は、ロボット５０４に対して決定され得る。少なくとも一実施例において、カメラ５０２の相対的な位置は十分である場合がある一方で、配向情報はカメラ固有などの要因に依存して有用な可能性があり、この場合、非対称な画像の性質は、適切に考慮しないと正確さに影響を及ぼす可能性がある。この相対的な位置／配向は、（たとえば、ロボットのランタイム中に）オンラインで校正することができる。 In at least one embodiment, the relative position and orientation information can be used to ensure that the camera coordinate space, as seen from the camera's viewpoint, is registered with the robot coordinate space of the robot in terms of both dimensions and alignment. In at least one embodiment, an inaccurate orientation or position of the camera 502 with respect to the robot 504 may give the robot 504 incorrect coordinates at which to perform an action, because those coordinates may be correct in the camera coordinate system but incorrect in the robot coordinate system. To this end, the relative position and orientation of the camera 502 can be determined with respect to the robot 504. In at least one embodiment, the relative position of the camera 502 may be sufficient, while orientation information may be useful depending on factors such as the camera intrinsics, where asymmetric image properties may affect accuracy if not properly accounted for. This relative position/orientation can be calibrated online (e.g., during the robot's runtime).

図５は、一実施例による、ロボットから物体への姿勢推定システムを使用して制御されるロボティック把持システム５００を図示している。ロボティック把持システム５００は、前述の実施例のコンテキストで実装することができる。 Figure 5 illustrates a robotic gripping system 500 controlled using a robot-to-object pose estimation system, according to one embodiment. The robotic gripping system 500 can be implemented in the context of the previously described embodiments.

示されるように、カメラ５０２を使用して、可能性としては、ロボット５０４などの自律物体の動画フレームの形態で、画像を捕捉することができる。カメラ５０２は、ロボット５０４がカメラ５０２の視野５１０内に入るように、また完全なビュー表現ではない場合はロボット５０４の少なくとも部分的な表現を含む場合がある画像を、カメラ５０２が捕捉できるように、位置付けることができる、又は外部に搭載することができる。捕捉された画像を使用して、特定のタスクを実行するためにロボット５０４に命令を与えるよう支援することができる。 As shown, the camera 502 can be used to capture images, potentially in the form of video frames, of an autonomous object such as a robot 504. The camera 502 can be positioned or externally mounted such that the robot 504 is within the field of view 510 of the camera 502 and such that the camera 502 can capture images that may include at least a partial representation of the robot 504 if not a full view representation. The captured images can be used to assist in providing instructions to the robot 504 to perform a particular task.

少なくとも一実施例において、捕捉された画像は、ロボット５０４に対する物体５１２のロケーションを決定するために分析することができ、そのロケーションを用いてロボット５０４は何らかの方法でこの物体５１２を持ち上げる又は変更を加えるなどの相互作用をする。特に、図２のシステム２００は、ロボット５０４又はロボット５０４用の制御システムに正確な命令を与える目的のために、ロボットから物体への姿勢を決定するために利用され得る。ロボットから物体への姿勢は、ロボット５０４をナビゲートすること、又はロボット５０４の状態について現在の情報を提供することを支援するなど、他の目的にも使用することができる。少なくとも一実施例において、ロボット５０４と物体５１２との間の正確な位置及び配向データは、構造化されていない動的な環境でロボット５０４が、物体把持及び操作、人間とロボットの対話、並びに衝突検出及び回避などのタスクを実行しつつ、ロバストに動作することを可能にする。したがって、カメラ５０２に対するロボット５０４の位置又は配向、及びカメラ５０２に対する物体５１２の位置又は配向のうちの少なくとも１つを決定して、次いでロボットから物体への姿勢を決定することが望ましい。 In at least one embodiment, the captured images can be analyzed to determine the location of the object 512 relative to the robot 504, which the robot 504 uses to interact with, such as lifting or modifying the object 512 in some way. In particular, the system 200 of FIG. 2 can be utilized to determine a robot-to-object pose for the purpose of providing precise instructions to the robot 504 or a control system for the robot 504. The robot-to-object pose can also be used for other purposes, such as to help navigate the robot 504 or provide current information about the state of the robot 504. In at least one embodiment, precise position and orientation data between the robot 504 and the object 512 allows the robot 504 to operate robustly in unstructured and dynamic environments while performing tasks such as object grasping and manipulation, human-robot interaction, and collision detection and avoidance. It is therefore desirable to determine at least one of the position or orientation of the robot 504 relative to the camera 502 and the position or orientation of the object 512 relative to the camera 502, and then determine the robot-to-object pose.
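
Once both camera-relative poses are available, the robot-to-object pose follows by composing rigid transforms. The sketch below assumes 4×4 homogeneous matrices in which `cam_T_robot` maps robot-frame points into the camera frame and `cam_T_object` maps object-frame points into the camera frame. This naming convention, the helper functions, and the numeric values are hypothetical.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def robot_to_object(cam_T_robot, cam_T_object):
    """robot_T_object = inverse(cam_T_robot) * cam_T_object."""
    return matmul(invert_rigid(cam_T_robot), cam_T_object)

def translation(x, y, z):
    """Pure translation, used here only to keep the example easy to verify by hand."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

cam_T_robot = translation(0.5, 0.0, 2.0)
cam_T_object = translation(0.7, 0.1, 1.5)
robot_T_object = robot_to_object(cam_T_robot, cam_T_object)
```

With these values the object sits at approximately (0.2, 0.1, -0.5) in the robot frame. Because the camera frame cancels out of the composition, the result no longer depends on where the camera is placed.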

上述のように、カメラ５０２に対するロボット５０４の現在の配向及びカメラ５０２に対する物体５１２の現在の配向を示す画像がカメラ５０２によって捕捉され得る。ロボット５０４の配向に関して、ロボット５０４が様々な構成又は「姿勢」をとなるように、ロボット５０４は様々な関節接合された四肢５０８又はコンポーネントを有することができる。少なくとも一実施例において、ロボット５０４の様々な姿勢は、カメラ５０２によって捕捉される画像における様々な表現をもたらし得る。少なくとも一実施例において、カメラ５０２によって捕捉された単一の画像は、ロボット５０４の姿勢を決定するために使用可能なロボット５０４の特徴を決定するために分析することができる。特徴は、ロボットが動かすことができる又は位置若しくは配向を調節することができる関節又はロケーションに対応することができる。さらには、ロボット５０４の次元及び運動学は既知であるため、カメラ５０２の視点からロボット５０４の姿勢を決定することは、カメラからロボットへの距離及び配向の決定を正確にすることを可能にしている。 As described above, images may be captured by the camera 502 showing the current orientation of the robot 504 relative to the camera 502 and the current orientation of the object 512 relative to the camera 502. With respect to the orientation of the robot 504, the robot 504 may have various articulated limbs 508 or components, such that the robot 504 can assume various configurations or "poses." In at least one embodiment, different poses of the robot 504 may result in different representations in the images captured by the camera 502. In at least one embodiment, a single image captured by the camera 502 may be analyzed to determine features of the robot 504 that can be used to determine the pose of the robot 504. The features may correspond to joints or locations that the robot can move or whose position or orientation it can adjust. Furthermore, because the dimensions and kinematics of the robot 504 are known, determining the pose of the robot 504 from the viewpoint of the camera 502 allows the camera-to-robot distance and orientation to be determined accurately.

機械学習 Machine Learning
プロセッサで発達した、モデルを深層学習することを含むディープ・ニューラル・ネットワーク（ＤＮＮ：ＤｅｅｐＮｅｕｒａｌＮｅｔｗｏｒｋ）は、自動運転車から迅速な創薬まで、オンライン画像データベースにおける自動画像キャプション付けから動画チャット・アプリケーションにおけるスマートなリアルタイム言語翻訳まで、多様な使用事例に用いられている。深層学習は、人間の脳の神経学習プロセスをモデル化する技法であり、継続的に学習し、継続的に賢くなり、より正確な結果をより速い時間で提供する。子供は初め、大人によって教えられて、様々な形状を正しく識別及び分類し、最終的にはいかなるコーチングもなしに形状を識別できるようになる。同様に、ディープ・ラーニング又はニューラル・ラーニング・システムは、コンテキストを物体に割り当てしつつも、基本的な物体、遮蔽された物体などを、よりスマート且つ効率的に識別するために、物体認識及び分類について訓練される必要がある。 Deep neural networks (DNNs), including deep learning models, developed on processors are used in diverse use cases, from self-driving cars to rapid drug discovery, and from automatic image captioning in online image databases to smart real-time language translation in video chat applications. Deep learning is a technique that models the neural learning process of the human brain, continually learning, continually getting smarter, and delivering more accurate results more quickly over time. A child is initially taught by an adult to correctly identify and classify various shapes, and is eventually able to identify shapes without any coaching. Similarly, a deep learning or neural learning system needs to be trained in object recognition and classification so that it becomes smarter and more efficient at identifying basic objects, occluded objects, and so on, while also assigning context to the objects.

最も簡単なレベルでは、人間の脳のニューロンは、受信した様々な入力を見て、重要度のレベルをこれらの入力のそれぞれに割り当て、出力は他のニューロンに渡されて作用する。人工ニューロン又はパーセプトロンは、最も基本的なニューラル・ネットワークのモデルである。一実例では、パーセプトロンは、パーセプトロンが認識して分類するよう訓練される物体の様々な特徴を表現する１つ又は複数の入力を受信することができ、これらの特徴のそれぞれは、物体の形状を定義することにおいて、その特徴の重要度に基づいて一定の重みを割り当てられる。 At the simplest level, neurons in the human brain look at the various inputs they receive and assign a level of importance to each of these inputs, with the output being passed on to other neurons to act upon. An artificial neuron, or perceptron, is the most basic model of a neural network. In one example, a perceptron can receive one or more inputs that represent various features of an object that the perceptron is trained to recognize and classify, and each of these features is assigned a certain weight based on the importance of that feature in defining the object's shape.
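
The weighted-input behavior of a perceptron can be sketched in a few lines. The step activation, the weights, and the bias below are illustrative assumptions rather than values from the disclosure.

```python
def perceptron(inputs, weights, bias):
    """Weigh each input feature by its importance and fire if the total exceeds zero."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if activation > 0 else 0

# The first feature carries more weight, so it alone is enough to fire the unit.
fires = perceptron([1.0, 0.0], weights=[0.6, 0.4], bias=-0.5)
stays_off = perceptron([0.0, 1.0], weights=[0.6, 0.4], bias=-0.5)
```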

ディープ・ニューラル・ネットワーク（ＤＮＮ）モデルは、膨大な量の入力データで訓練することができる多くの接続されたノード（たとえば、パーセプトロン、ボルツマン・マシン、半径ベースの関数、コンボリューショナル層など）の複数の層を含み、複雑な問題を高い精度で迅速に解決する。一実例では、ＤＮＮモデルの第１の層は、自動車の入力画像を様々なセクションに分解し、線や角度などの基本的なパターンを探す。第２の層は、線を組み立て、ホイール、フロントガラス、及びミラーなどの、より高次のパターンを探す。次の層は、車両のタイプを識別し、最後のいくつかの層は、入力画像用のラベルを生成し、具体的な自動車ブランドのモデルを特定する。 A deep neural network (DNN) model includes multiple layers of many connected nodes (e.g., perceptrons, Boltzmann machines, radial basis functions, convolutional layers, etc.) that can be trained with vast amounts of input data to quickly solve complex problems with high accuracy. In one example, the first layer of a DNN model breaks down an input image of a car into various sections and looks for basic patterns such as lines and angles. The second layer assembles the lines to look for higher-level patterns such as wheels, windshields, and mirrors. The next layer identifies the type of vehicle, and the final few layers generate a label for the input image, identifying the model of a specific car brand.

いったん、ＤＮＮが訓練されると、ＤＮＮを展開して、推論として知られるプロセスにおいて、物体又はパターンを識別及び分類するために使用することができる。推論（ＤＮＮが所与の入力から有用な情報を抽出するプロセス）の例としては、ＡＴＭ機に預け入れられた小切手の手書きの数字を識別すること、写真の友人の画像を識別すること、５０００万以上のユーザに映画のおすすめを提供すること、無人自動車において様々なタイプの自動車、歩行者、及び道路危険物を識別して分類すること、又はリアルタイムに人間によるスピーチを翻訳することが挙げられる。 Once a DNN is trained, it can be deployed and used to identify and classify objects or patterns in a process known as inference. Examples of inference (the process by which a DNN extracts useful information from given inputs) include identifying handwritten digits on a check deposited in an ATM machine, identifying images of friends in a photograph, providing movie recommendations to over 50 million users, identifying and classifying different types of vehicles, pedestrians, and road hazards in driverless cars, or translating human speech in real time.

訓練の間、データは、入力に対応するラベルを示す予測が生成されるまで、ＤＮＮを順伝播フェーズで通過する。ニューラル・ネットワークが入力を正確にラベル付けしない場合、正しいラベルと予測されたラベルとの間の誤差が分析され、逆伝播フェーズの間、ＤＮＮがその入力及び訓練データ・セット中の他の入力を正確にラベル付けするまで、特徴ごとに重みが調節される。複雑なニューラル・ネットワークを訓練することは、浮動小数点の乗算及び加算を含む、大量の並列コンピューティング・パフォーマンスを必要とする。推論することは、訓練することよりは計算集約的ではなく、訓練済みのニューラル・ネットワークが、画像の分類、スピーチの翻訳、及び新しい情報の一般的な推論のための、以前に見たことがない新しい入力に適用される、レイテンシに敏感なプロセスである。 During training, data is passed through the DNN in a forward propagation phase until a prediction is generated that indicates the label that corresponds to the input. If the neural network does not label the input correctly, the error between the correct label and the predicted label is analyzed, and during the backpropagation phase, the weights for each feature are adjusted until the DNN correctly labels that input and other inputs in the training data set. Training a complex neural network requires a large amount of parallel computing performance, including floating-point multiplications and additions. Inferring is a less computationally intensive, latency-sensitive process in which a trained neural network is applied to new, previously unseen inputs for image classification, speech translation, and general inference about new information.
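
The forward propagation and backpropagation phases described above can be illustrated with a single linear neuron trained by gradient descent on a squared error. The model, the learning rate, the epoch count, and the toy data (samples of y = 2x + 1) are assumptions chosen so that the result is easy to check.

```python
def train_linear_neuron(samples, learning_rate=0.1, epochs=200):
    """Fit prediction = weight * x + bias to (x, target) pairs."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            prediction = weight * x + bias        # forward propagation
            error = prediction - target           # compare with the correct label
            # Backpropagation: gradient of 0.5 * error**2 with respect to each parameter.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias

weight, bias = train_linear_neuron([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

After training, `weight` and `bias` are close to 2 and 1. A deep network repeats the same idea across many layers and parameters, which is why training demands far more parallel arithmetic than inference.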

推論及び訓練の論理 Inference and Training Logic
上述のように、深層学習又はニューラル学習システムは、入力データから推論を生成するために訓練される必要がある。深層学習又はニューラル学習システムについての推論及び／又は訓練論理６１５に関する詳細を、図６Ａ及び／又は図６Ｂと併せて、以下に与える。 As mentioned above, a deep learning or neural learning system needs to be trained to generate inferences from input data. Details regarding the inference and/or training logic 615 for a deep learning or neural learning system are provided below in conjunction with FIGS. 6A and/or 6B.

少なくとも一実施例では、推論及び／又は訓練論理６１５は、１つ又は複数の実施例の態様において推論するように訓練及び／又は使用されるニューラル・ネットワークのニューロン又は層に対応した、順伝播及び／又は出力の重み、及び／又は入力／出力データを記憶するためのデータ・ストレージ６０１を、限定することなく含んでもよい。少なくとも一実施例では、データ・ストレージ６０１は、１つ又は複数の実施例の態様を使用した訓練及び／又は推論中に、入力／出力データ及び／又は重みパラメータを順伝播する間に１つ又は複数の実施例と併せて訓練又は使用されるニューラル・ネットワークの各層の重みパラメータ及び／又は入力／出力データを記憶する。少なくとも一実施例では、データ・ストレージ６０１の任意の部分は、プロセッサのＬ１、Ｌ２、又はＬ３のキャッシュ、若しくはシステム・メモリを含む他のオン・チップ又はオフ・チップのデータ・ストレージとともに含められてもよい。 In at least one embodiment, the inference and/or training logic 615 may include, without limitation, data storage 601 for storing forward propagation and/or output weights and/or input/output data corresponding to neurons or layers of a neural network trained and/or used to infer in one or more aspects of the embodiment. In at least one embodiment, the data storage 601 stores weight parameters and/or input/output data for each layer of a neural network trained or used in conjunction with one or more embodiments during forward propagation of input/output data and/or weight parameters during training and/or inference using aspects of one or more embodiments. In at least one embodiment, any portion of the data storage 601 may be included with other on-chip or off-chip data storage, including L1, L2, or L3 caches of a processor, or system memory.

少なくとも一実施例では、データ・ストレージ６０１の任意の部分は、１つ若しくは複数のプロセッサ、又は他のハードウェア論理デバイス若しくは回路の内部にあっても外部にあってもよい。少なくとも一実施例では、データ・ストレージ６０１は、キャッシュ・メモリ、ダイナミック・ランダム・アドレス可能メモリ（「ＤＲＡＭ」：ｄｙｎａｍｉｃｒａｎｄｏｍｌｙａｄｄｒｅｓｓａｂｌｅｍｅｍｏｒｙ）、スタティック・ランダム・アドレス可能メモリ（「ＳＲＡＭ」：ｓｔａｔｉｃｒａｎｄｏｍｌｙａｄｄｒｅｓｓａｂｌｅｍｅｍｏｒｙ）、不揮発性メモリ（たとえば、フラッシュ・メモリ）、又は他のストレージであってもよい。少なくとも一実施例では、データ・ストレージ６０１が、たとえばプロセッサの内部にあるか外部にあるかの選択、又はＤＲＡＭ、ＳＲＡＭ、フラッシュ、若しくは何らか他のタイプのストレージから構成されるかの選択は、オン・チップ対オフ・チップで利用可能なストレージ、実行される訓練及び／又は推論の機能のレイテンシ要件、ニューラル・ネットワークの推論及び／又は訓練で使用されるデータのバッチ・サイズ、又はこれらの要因の何からの組合せに応じて決められてもよい。 In at least one embodiment, any portion of data storage 601 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, data storage 601 may be cache memory, dynamic randomly addressable memory ("DRAM"), static randomly addressable memory ("SRAM"), non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether data storage 601 is internal or external to a processor, for example, or whether it is comprised of DRAM, SRAM, flash, or some other type of storage, may depend on the storage available on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of the data used in inference and/or training of the neural network, or some combination of these factors.

少なくとも一実施例では、推論及び／又は訓練論理６１５は、１つ又は複数の実施例の態様において推論するように訓練及び／又は使用されるニューラル・ネットワークのニューロン又は層に対応した、逆伝播及び／又は出力の重み、及び／又は入力／出力データを記憶するためのデータ・ストレージ６０５を、限定することなく含んでもよい。少なくとも一実施例では、データ・ストレージ６０５は、１つ又は複数の実施例の態様を使用した訓練及び／又は推論中に、入力／出力データ及び／又は重みパラメータを逆伝播する間に１つ又は複数の実施例と併せて訓練又は使用されるニューラル・ネットワークの各層の重みパラメータ及び／又は入力／出力データを記憶する。少なくとも一実施例では、データ・ストレージ６０５の任意の部分は、プロセッサのＬ１、Ｌ２、又はＬ３のキャッシュ、若しくはシステム・メモリを含む他のオン・チップ又はオフ・チップのデータ・ストレージとともに含められてもよい。少なくとも一実施例では、データ・ストレージ６０５の任意の部分は、１つ又は複数のプロセッサ、又は他のハードウェア論理デバイス若しくは回路の内部にあっても外部にあってもよい。少なくとも一実施例では、データ・ストレージ６０５は、キャッシュ・メモリ、ＤＲＡＭ、ＳＲＡＭ、不揮発性メモリ（たとえば、フラッシュ・メモリ）、又は他のストレージであってもよい。少なくとも一実施例では、データ・ストレージ６０５が、たとえばプロセッサの内部にあるか外部にあるかの選択、又はＤＲＡＭ、ＳＲＡＭ、フラッシュ、若しくは何らか他のタイプのストレージから構成されるかの選択は、オン・チップ対オフ・チップで利用可能なストレージ、実行される訓練及び／又は推論の機能のレイテンシ要件、ニューラル・ネットワークの推論及び／又は訓練で使用されるデータのバッチ・サイズ、又はこれらの要因の何からの組合せに応じて決められてもよい。 In at least one embodiment, the inference and/or training logic 615 may include, without limitation, data storage 605 for storing backpropagated and/or output weights and/or input/output data corresponding to neurons or layers of a neural network trained and/or used to infer in one or more aspects of the embodiment. In at least one embodiment, the data storage 605 stores weight parameters and/or input/output data for each layer of a neural network trained or used in conjunction with one or more embodiments while backpropagating input/output data and/or weight parameters during training and/or inference using aspects of one or more embodiments. In at least one embodiment, any portion of the data storage 605 may be included with other on-chip or off-chip data storage, including L1, L2, or L3 caches of a processor, or system memory. In at least one embodiment, any portion of the data storage 605 may be internal or external to one or more processors, or other hardware logic devices or circuits. In at least one embodiment, data storage 605 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. 
In at least one embodiment, the choice of whether data storage 605 is internal or external to the processor, for example, or whether it is comprised of DRAM, SRAM, flash, or some other type of storage, may depend on the storage available on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of data used in the inference and/or training of the neural network, or any combination of these factors.

少なくとも一実施例では、データ・ストレージ６０１とデータ・ストレージ６０５は、別々のストレージ構造であってもよい。少なくとも一実施例では、データ・ストレージ６０１とデータ・ストレージ６０５は、同じストレージ構造であってもよい。少なくとも一実施例では、データ・ストレージ６０１とデータ・ストレージ６０５は、部分的に同じストレージ構造で、部分的に別々のストレージ構造であってもよい。少なくとも一実施例では、データ・ストレージ６０１とデータ・ストレージ６０５との任意の部分は、プロセッサのＬ１、Ｌ２、又はＬ３のキャッシュ、若しくはシステム・メモリを含む他のオン・チップ又はオフ・チップのデータ・ストレージとともに含められてもよい。 In at least one embodiment, data storage 601 and data storage 605 may be separate storage structures. In at least one embodiment, data storage 601 and data storage 605 may be the same storage structure. In at least one embodiment, data storage 601 and data storage 605 may be partially the same storage structure and partially separate storage structures. In at least one embodiment, any portion of data storage 601 and data storage 605 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache, or system memory.

In at least one embodiment, inference and/or training logic 615 may include, without limitation, one or more arithmetic logic units ("ALUs") 610 for performing logical and/or mathematical operations based at least in part on, or indicated by, training and/or inference code. The results of these operations may be activations (e.g., output values from layers or neurons within a neural network) stored in activation storage 620 that are functions of input/output and/or weight parameter data stored in data storage 601 and/or data storage 605. In at least one embodiment, the activations stored in activation storage 620 are generated according to linear algebraic and/or matrix-based mathematics performed by ALUs 610 in response to executing instructions or other code, where the weight values stored in data storage 605 and/or data storage 601 are used as operands along with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which may be stored in data storage 605 or data storage 601, or in another storage on-chip or off-chip. In at least one embodiment, ALUs 610 are included within one or more processors or other hardware logic devices or circuits, whereas in another embodiment, ALUs 610 may be external to the processor or other hardware logic device or circuit that uses them (e.g., a co-processor). In at least one embodiment, ALUs 610 may be included within a processor's execution units, or otherwise within a bank of ALUs accessible by a processor's execution units, either within the same processor or distributed among different processors of different types (e.g., central processing units, graphics processing units, fixed function units, etc.). In at least one embodiment, data storage 601, data storage 605, and activation storage 620 may be on the same processor or other hardware logic device or circuit, whereas in another embodiment, they may be in different processors or other hardware logic devices or circuits, or in some combination of the same and different processors or other hardware logic devices or circuits. In at least one embodiment, any portion of activation storage 620 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory. Furthermore, inference and/or training code may be stored with other code accessible to a processor or other hardware logic or circuit, and may be fetched and/or processed using the processor's fetch, decode, scheduling, execution, retirement, and/or other logic circuits.

In at least one embodiment, activation storage 620 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, activation storage 620 may be completely or partially within or external to one or more processors or other logic circuits. In at least one embodiment, the choice of whether activation storage 620 is internal or external to a processor, for example, or whether it consists of DRAM, SRAM, flash, or some other type of storage, may depend on the storage available on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of data used in inference and/or training of the neural network, or some combination of these factors. In at least one embodiment, the inference and/or training logic 615 shown in FIG. 6A may be used in conjunction with an application-specific integrated circuit ("ASIC"), such as a Tensorflow® processing unit from Google, an inference processing unit ("IPU") from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, the inference and/or training logic 615 shown in FIG. 6A may be used in conjunction with central processing unit ("CPU") hardware, graphics processing unit ("GPU") hardware, or other hardware, such as field programmable gate arrays ("FPGAs").

FIG. 6B illustrates inference and/or training logic 615 according to at least one embodiment. In at least one embodiment, inference and/or training logic 615 may include, without limitation, hardware logic in which computational resources are dedicated to, or otherwise used exclusively in conjunction with, weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, the inference and/or training logic 615 shown in FIG. 6B may be used in conjunction with an application-specific integrated circuit (ASIC), such as a Tensorflow® processing unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corporation. In at least one embodiment, the inference and/or training logic 615 shown in FIG. 6B may be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware, such as field programmable gate arrays (FPGAs). In at least one embodiment, inference and/or training logic 615 includes, without limitation, data storage 601 and data storage 605, which may be used to store weight values and/or other information, including bias values, gradient information, momentum values, and/or other parameter or hyperparameter information. In at least one embodiment shown in FIG. 6B, data storage 601 and data storage 605 are each associated with a dedicated computational resource, namely computation hardware 602 and computation hardware 606, respectively. In at least one embodiment, each of computation hardware 602 and computation hardware 606 includes one or more ALUs that perform mathematical functions, such as linear algebraic functions, only on the information stored in data storage 601 and data storage 605, respectively, and the results are stored in activation storage 620.

In at least one embodiment, each of data storages 601 and 605 and the corresponding computation hardware 602 and 606 corresponds to a different layer of the neural network, such that the activation resulting from one "storage/computation pair 601/602" of data storage 601 and computation hardware 602 is provided as an input to the next "storage/computation pair 605/606" of data storage 605 and computation hardware 606, mirroring the conceptual organization of the neural network. In at least one embodiment, each of storage/computation pairs 601/602 and 605/606 may correspond to more than one layer of the neural network. In at least one embodiment, additional storage/computation pairs (not shown) may be included in inference and/or training logic 615 after, or in parallel with, storage/computation pairs 601/602 and 605/606.
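The chaining of storage/computation pairs described above can be illustrated with a minimal sketch. It is not taken from the disclosure: the class name `StorageComputePair`, the `tanh` activation, and the example weights are all assumptions chosen for illustration. Each pair holds its own parameters and operates only on them, and the activations of one pair are passed as inputs to the next.

```python
import math


def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class StorageComputePair:
    """Hypothetical stand-in for one storage/computation pair (e.g., 601/602)."""

    def __init__(self, weights, biases):
        # "Data storage": parameters owned by this pair only.
        self.weights = weights
        self.biases = biases

    def compute(self, inputs):
        # "Computation hardware": linear algebra on this pair's stored data only,
        # followed by an assumed tanh nonlinearity.
        pre_activation = matvec(self.weights, inputs)
        return [math.tanh(p + b) for p, b in zip(pre_activation, self.biases)]


def run_pipeline(pairs, inputs):
    """Feed each pair's activations to the next pair, mirroring layer order."""
    activations = inputs
    for pair in pairs:
        activations = pair.compute(activations)
    return activations


# Illustrative parameters only; they are not values from the disclosure.
pair_601_602 = StorageComputePair([[0.5, -0.25], [0.1, 0.3]], [0.0, 0.1])
pair_605_606 = StorageComputePair([[1.0, -1.0]], [0.05])
output = run_pipeline([pair_601_602, pair_605_606], [1.0, 2.0])
```

Additional pairs placed after these two would simply be appended to the list passed to `run_pipeline`.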

Training and Deployment of a Neural Network
FIG. 7 illustrates another embodiment for training and deployment of a deep neural network. In at least one embodiment, an untrained neural network 706 is trained using a training data set 702. In at least one embodiment, training framework 704 is the PyTorch framework, whereas in other embodiments, training framework 704 is Tensorflow, Boost, Caffe, Microsoft Cognitive Toolkit/CNTK, MXNet, Chainer, Keras, Deeplearning4j, or another training framework. In at least one embodiment, training framework 704 trains untrained neural network 706 using the processing resources described herein to generate a trained neural network 708. In at least one embodiment, the initial weights may be chosen randomly or by pre-training using a deep belief network. In at least one embodiment, training may be performed in a supervised, semi-supervised, or unsupervised manner.

In at least one embodiment, untrained neural network 706 is trained using supervised learning, where training data set 702 includes inputs paired with the desired outputs for those inputs, or where training data set 702 includes inputs having known outputs and the outputs of neural network 706 are manually graded. In at least one embodiment, untrained neural network 706 is trained in a supervised manner by processing inputs from training data set 702 and comparing the resulting outputs against a set of expected or desired outputs. In at least one embodiment, errors are then propagated back through untrained neural network 706. In at least one embodiment, training framework 704 adjusts the weights that control untrained neural network 706. In at least one embodiment, training framework 704 includes tools to monitor how well untrained neural network 706 is converging toward a model, such as trained neural network 708, that is suitable for generating correct answers, such as in results 714, based on known input data, such as new data 712. In at least one embodiment, training framework 704 trains untrained neural network 706 repeatedly while adjusting the weights to refine the output of untrained neural network 706 using a loss function and an adjustment algorithm, such as stochastic gradient descent. In at least one embodiment, training framework 704 trains untrained neural network 706 until it achieves a desired accuracy. In at least one embodiment, trained neural network 708 can then be deployed to implement any number of machine learning operations.
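The supervised loop described above (process inputs, compare against desired outputs, propagate the error, adjust weights, stop at a desired accuracy) can be sketched as follows. This is an illustrative sketch only, assuming a single linear neuron, a squared-error loss, and plain stochastic gradient descent; the function name `train_sgd`, the hyperparameters, and the toy data set are hypothetical and do not come from the disclosure.

```python
import random


def train_sgd(samples, learning_rate=0.05, epochs=200, target_loss=1e-6, seed=0):
    """Fit y = weight * x + bias with per-sample stochastic gradient descent."""
    rng = random.Random(seed)
    # Weights start at random values, as the text allows.
    weight, bias = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data = list(samples)
    loss = float("inf")
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y_true in data:
            error = (weight * x + bias) - y_true  # compare output to desired output
            # Gradients of 0.5 * error**2 with respect to weight and bias.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
        # Monitor convergence with the mean squared error over the data set.
        loss = sum((weight * x + bias - y) ** 2 for x, y in data) / len(data)
        if loss < target_loss:  # stop once the desired accuracy is reached
            break
    return weight, bias, loss


# Toy training set: inputs paired with desired outputs generated by y = 2x + 1.
training_set = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
weight, bias, final_loss = train_sgd(training_set)
```

A real training framework would apply the same pattern to many layers and use automatic differentiation rather than the hand-derived gradients used here.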

In at least one embodiment, untrained neural network 706 is trained using unsupervised learning, where untrained neural network 706 attempts to train itself using unlabeled data. In at least one embodiment, training data set 702 for unsupervised learning includes input data without any associated output data or "ground truth" data. In at least one embodiment, untrained neural network 706 can learn groupings within training data set 702 and can determine how individual inputs relate to training data set 702. In at least one embodiment, unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 708 capable of performing operations useful for reducing the dimensionality of new data 712. In at least one embodiment, unsupervised training can also be used to perform anomaly detection, which allows identification of data points in new data set 712 that deviate from the normal patterns of new data set 712.
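As a simple illustration of anomaly detection from unlabeled data, the sketch below learns a "normal pattern" (mean and standard deviation) from unlabeled values and flags new values that deviate from it. Note that it substitutes a plain z-score rule for a neural network, purely to keep the example short; the function names, the threshold of three standard deviations, and the sample values are assumptions.

```python
import statistics


def fit_normal_pattern(unlabeled_values):
    """Summarize unlabeled data by its mean and population standard deviation."""
    return statistics.fmean(unlabeled_values), statistics.pstdev(unlabeled_values)


def find_anomalies(new_values, mean, std, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    return [v for v in new_values if abs(v - mean) > threshold * std]


# Hypothetical unlabeled measurements; no ground-truth labels are used.
training_values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mean, std = fit_normal_pattern(training_values)
anomalies = find_anomalies([10.1, 9.9, 14.5, 10.0, 5.2], mean, std)
```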

In at least one embodiment, semi-supervised learning may be used, which is a technique in which training data set 702 includes a mix of labeled and unlabeled data. In at least one embodiment, training framework 704 may be used to perform incremental learning, such as through transfer learning techniques. In at least one embodiment, incremental learning enables trained neural network 708 to adapt to new data 712 without forgetting the knowledge instilled in the network during initial training.

Data Center
FIG. 8 illustrates an exemplary data center 800 in which at least one embodiment may be used. In at least one embodiment, data center 800 includes a data center infrastructure layer 810, a framework layer 820, a software layer 830, and an application layer 840.

In at least one embodiment, as shown in FIG. 8, data center infrastructure layer 810 may include a resource orchestrator 812, grouped computing resources 814, and node computing resources ("node C.R.s") 816(1)-816(N), where "N" represents any positive integer. In at least one embodiment, node C.R.s 816(1)-816(N) may include, but are not limited to, any number of central processing units ("CPUs") or other processors (including accelerators, field programmable gate arrays (FPGAs), graphics processors, etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid-state drives or disk drives), network input/output ("NW I/O") devices, network switches, virtual machines ("VMs"), power modules, and cooling modules. In at least one embodiment, one or more of node C.R.s 816(1)-816(N) may be a server having one or more of the computing resources described above.

In at least one embodiment, grouped computing resources 814 may include separate groupings of node C.R.s housed within one or more racks (not shown), or many racks housed in data centers at various geographical locations (also not shown). The separate groupings of node C.R.s within grouped computing resources 814 may include grouped compute, network, memory, or storage resources that may be configured or allocated to support one or more workloads. In at least one embodiment, several node C.R.s including CPUs or processors may be grouped within one or more racks to provide compute resources to support one or more workloads. In at least one embodiment, the one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.

In at least one embodiment, resource orchestrator 812 may configure or otherwise control one or more node C.R.s 816(1)-816(N) and/or grouped computing resources 814. In at least one embodiment, resource orchestrator 812 may include a software design infrastructure ("SDI") management entity for data center 800. In at least one embodiment, the resource orchestrator may include hardware, software, or some combination thereof.

In at least one embodiment shown in FIG. 8, framework layer 820 includes a job scheduler 832, a configuration manager 834, a resource manager 836, and a distributed file system 838. In at least one embodiment, framework layer 820 may include a framework to support software 832 of software layer 830 and/or one or more applications 842 of application layer 840. In at least one embodiment, software 832 or applications 842 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 820 may be, but is not limited to, a type of free and open-source software web application framework, such as Apache Spark® (hereinafter "Spark"), that can use distributed file system 838 for large-scale data processing (e.g., "big data"). In at least one embodiment, job scheduler 832 may include a Spark driver to facilitate scheduling of workloads supported by the various layers of data center 800. In at least one embodiment, configuration manager 834 may be capable of configuring different layers, such as software layer 830 and framework layer 820, which includes Spark and distributed file system 838 for supporting large-scale data processing. In at least one embodiment, resource manager 836 may be capable of managing clustered or grouped computing resources that are mapped or allocated to support distributed file system 838 and job scheduler 832. In at least one embodiment, the clustered or grouped computing resources may include grouped computing resources 814 at data center infrastructure layer 810. In at least one embodiment, resource manager 836 may coordinate with resource orchestrator 812 to manage these mapped or allocated computing resources.

In at least one embodiment, software 832 included in software layer 830 may include software used by at least portions of node C.R.s 816(1)-816(N), grouped computing resources 814, and/or distributed file system 838 of framework layer 820. The one or more types of software may include, but are not limited to, Internet web page search software, e-mail virus scanning software, database software, and streaming video content software.

In at least one embodiment, applications 842 included in application layer 840 may include one or more types of applications used by at least portions of node C.R.s 816(1)-816(N), grouped computing resources 814, and/or distributed file system 838 of framework layer 820. The one or more types of applications may include, but are not limited to, any number of genomics applications, cognitive compute applications, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, Tensorflow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.

In at least one embodiment, any of configuration manager 834, resource manager 836, and resource orchestrator 812 may implement any number and type of self-correcting actions based on any amount and type of data acquired in any technically feasible fashion. In at least one embodiment, the self-correcting actions may help an operator of data center 800 avoid potentially bad configuration decisions and eliminate underutilized and/or poorly performing portions of the data center.

In at least one embodiment, data center 800 may include tools, services, software, or other resources for training one or more machine learning models, or for predicting or inferring information using one or more machine learning models, according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model may be trained by calculating weight parameters according to a neural network architecture using the software and computing resources described above with respect to data center 800. In at least one embodiment, trained machine learning models corresponding to one or more neural networks may be used to infer or predict information using the resources described above with respect to data center 800, by using weight parameters calculated through one or more of the training techniques described herein.

In at least one embodiment, the data center may use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, or other hardware to perform training and/or inference using the resources described above. Moreover, one or more of the software and/or hardware resources described above may be configured as a service that allows users to train models or perform inference on information, such as image recognition, speech recognition, or other artificial intelligence services.

Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 615 may be used in the system of FIG. 8 for inference or prediction operations based at least in part on weight parameters calculated using the neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

As described herein, a method, computer-readable medium, and system are disclosed for estimating object-to-object pose using externally captured images of the object. In accordance with FIGS. 1-4, an embodiment may provide a neural network usable for performing inference operations and for providing inferred data, where the neural network is stored (partially or wholly) in one or both of data storages 601 and 605 in inference and/or training logic 615, as depicted in FIGS. 6A and 6B. Training and deployment of the neural network may be performed as depicted in FIG. 7 and described herein. Distribution of the neural network may be performed using one or more servers of data center 800, as depicted in FIG. 8 and described herein.
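The final step of the disclosed approach combines two camera-relative poses into a pose between the robot and the object. A minimal sketch of that composition follows, assuming each pose is represented as a 4x4 homogeneous transform that maps points from the named frame into the camera frame. The representation, the function names, and the numeric example are assumptions for illustration, not the disclosed implementation.

```python
import math


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = [list(rotation[i]) + [translation[i]] for i in range(3)]
    pose.append([0.0, 0.0, 0.0, 1.0])
    return pose


def rotation_z(angle_rad):
    """3x3 rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def invert_pose(pose):
    """Invert a rigid transform: rotation R^T and translation -R^T t."""
    rot_t = [[pose[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rot_t[i][j] * pose[j][3] for j in range(3)) for i in range(3)]
    return make_pose(rot_t, trans)


def compose(a, b):
    """Matrix product a @ b of two 4x4 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def object_in_robot_frame(camera_T_object, camera_T_robot):
    """Third pose: the object expressed in the robot's coordinate frame."""
    return compose(invert_pose(camera_T_robot), camera_T_object)


# Hypothetical estimates, e.g., as the two neural networks might output them.
camera_T_object = make_pose(rotation_z(0.0), [1.0, 1.0, 2.0])
camera_T_robot = make_pose(rotation_z(math.pi / 2), [1.0, 0.0, 2.0])
robot_T_object = object_in_robot_frame(camera_T_object, camera_T_robot)
```

Because both input poses are expressed relative to the same camera, the camera frame cancels out of the product, so in this sketch no separate camera-to-robot calibration value is supplied.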

Claims

identifying an image of a robotic gripping system including a plurality of joints and a target object to be gripped by the robotic gripping system, the image being captured by a camera not attached to the robotic gripping system and the target object, but external to the robotic gripping system and the target object;
processing images of the identified target object using a first neural network to estimate a first pose of the target object relative to the camera;
estimating a second pose of the robotic gripping system relative to the camera from the inferred positions of the joints using a second neural network operating in parallel with the first neural network, the second neural network generating a plurality of belief maps that infer, from images of the identified robotic gripping system, positions of each of the plurality of joints that operate to grasp the target object;
and calculating a third pose of the robotic gripping system relative to the target object using the first pose and the second pose.

The method of claim 1, wherein the image captured by the camera is a red-green-blue (RGB) image or a grayscale image.

The method of claim 1, wherein the camera captures one of optical or non-optical wavelengths.

The method of claim 1, wherein the target object is a known object to be grasped by the robotic grasping system.

The method of claim 4, further comprising causing the robotic grasping system to grasp the known object using the third pose.

The method of claim 1, wherein the first pose of the target object relative to the camera includes a three-dimensional (3D) rotation and translation of the target object relative to the camera.

The method of claim 1, wherein the second pose of the robotic gripping system relative to the camera includes a 3D rotation and translation of the robotic gripping system relative to the camera.

The method of claim 1, wherein the second neural network performs an online calibration of the camera.

The method of claim 1, wherein the third pose is a pose of the target object relative to a coordinate frame of the robotic gripping system.

The method of claim 1, wherein only the first neural network is trained on the target object.

The method of claim 1, wherein only the second neural network is trained for the robotic grasping system.

refining the first pose;
and refining the second pose.

The method of claim 12, wherein refining the first pose is performed by iteratively matching images captured by the camera with a synthetic projection of a model from the first pose, and adjusting parameters of the first pose based on results of the iterative matching.

The method of claim 12, wherein refining the second pose is performed by iteratively matching images captured by the camera with a synthetic projection of a model at the second pose, and adjusting parameters of the second pose based on results of the iterative matching.

A non-transitory computer-readable medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform a method, the method comprising:
identifying an image of a robotic gripping system including a plurality of joints and a target object to be gripped by the robotic gripping system, the image being captured by a camera not attached to the robotic gripping system and the target object, but external to the robotic gripping system and the target object;
processing images of the identified target object using a first neural network to estimate a first pose of the target object relative to the camera;
estimating a second pose of the robotic gripping system relative to the camera using a second neural network operating in parallel with the first neural network, the second neural network generating, from images of the identified robotic gripping system, a plurality of belief maps that infer positions of each of the plurality of joints that operate to grasp the target object, the second pose being estimated from the inferred positions of the plurality of joints;
and calculating a third pose of the robotic gripping system relative to the target object using the first pose and the second pose.

A system comprising:
a first neural network that receives, as input, images identifying a target object from images of a robotic gripping system and the target object, the robotic gripping system including a plurality of joints and the target object being an object to be gripped by the robotic gripping system, the images being captured by a camera that is not attached to, and is external to, the robotic gripping system and the target object, the first neural network processing the images identifying the target object to estimate a first pose of the target object relative to the camera;
a second neural network operating in parallel with the first neural network, the second neural network receiving as input an image identifying the robotic gripping system from the images of the robotic gripping system and the target object, generating from the image identifying the robotic gripping system a plurality of belief maps that infer positions of each of the plurality of joints that operate to grasp the target object, and estimating a second pose of the robotic gripping system relative to the camera from the inferred positions of the plurality of joints;
and a processor that uses the first pose and the second pose to calculate a third pose of the robotic gripping system relative to the target object.
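To illustrate how belief maps can yield joint positions, the sketch below treats each belief map as a 2D array of per-pixel confidence and takes the location of its peak as the inferred 2D position of the corresponding joint. The array layout `(num_joints, height, width)`, the confidence threshold, and the function name `joint_positions_from_belief_maps` are assumptions for illustration; the network that produces the maps is not shown.

```python
import numpy as np


def joint_positions_from_belief_maps(belief_maps, min_confidence=0.1):
    """Return one (x, y) pixel position per joint, or None if no clear peak.

    belief_maps: array of shape (num_joints, height, width), assumed to hold
    one confidence map per joint. min_confidence is a hypothetical threshold
    below which a joint is treated as not visible.
    """
    positions = []
    for belief in np.asarray(belief_maps, dtype=float):
        # Locate the strongest response in this joint's map.
        row, col = np.unravel_index(np.argmax(belief), belief.shape)
        if belief[row, col] < min_confidence:
            positions.append(None)
        else:
            positions.append((int(col), int(row)))
    return positions


# Example: three 32x32 maps; the third has no response (joint not detected).
example_maps = np.zeros((3, 32, 32))
example_maps[0, 5, 12] = 1.0
example_maps[1, 20, 3] = 0.8
example_positions = joint_positions_from_belief_maps(example_maps)
```

The resulting 2D positions, paired with the known 3D positions of the same joints in the robot's own frame, are the kind of input from which the second pose relative to the camera could then be estimated.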

The system of claim 16, wherein the camera is external to the system.

The system of claim 16, wherein the target object is a known object.

The system of claim 18, wherein the processor causes the robotic gripping system to grasp the target object using the third pose.