JP7789798B2

JP7789798B2 - Multi-view Neural Human Prediction with an Implicit Differentiable Renderer for Facial Expression, Body Pose Shape, and Clothing Performance Capture

Info

Publication number: JP7789798B2
Application number: JP2023556536A
Authority: JP
Inventors: チンジャン; ハンユェンシャオ
Original assignee: Sony Corp; Sony Group Corp
Current assignee: Sony Corp; Sony Group Corp
Priority date: 2021-03-31
Filing date: 2022-03-31
Publication date: 2025-12-22
Anticipated expiration: 2042-03-31
Also published as: JP2024510230A; WO2022208440A1; EP4292059A1; CN116134491A; KR20230150867A

Description

〔関連出願との相互参照〕
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority under 35 U.S.C. §119 of U.S. Provisional Patent Application Serial No. 63/279,916, filed November 16, 2021, entitled "Multiview Neural Human Prediction Using Implicit Differentiable Renderer for Facial Expression, Body Pose Shape, and Clothing Performance Capture," and U.S. Provisional Patent Application Serial No. 63/168,467, filed March 31, 2021, entitled "Multiview Neural Human Prediction Using Implicit Differentiable Renderer for Facial Expression, Body Pose Shape, and Clothing Displacement," both of which are incorporated herein by reference in their entireties for all purposes.

The present invention relates to three-dimensional computer vision and graphics for the entertainment industry. More specifically, the present invention relates to acquiring and processing three-dimensional computer vision and graphics for film, TV, music and game content creation.

Conventional systems, such as Facebook FrankMocap, predict only the shape and pose of the naked body from a single image. Such systems cannot predict the clothing surface. Such systems are 2D image translation methods and cannot handle multi-view input.

Implicit Part Network predicts both the body and the clothing from a scanned or reconstructed point cloud, but it requires 3D scans and cannot handle RGB images as input, nor facial expression and appearance. In addition, Implicit Part Network only predicts labels that identify voxels as body or clothing and then explicitly fits a human prior model, which is slow. Neural Body and Animatable NeRF use a Neural Radiance Field (NeRF) to predict clothed human bodies without facial expression. However, they require the creation of a dense latent code volume, which is limited to low resolution and therefore results in coarse body geometry. Furthermore, they can only recover volumetric human models without mesh vertex correspondences.

Multiview neural human prediction includes predicting a 3D human model, including the skeleton, the body shape, and the clothing displacement and appearance, from a set of multi-view images given the camera calibration.

In one aspect, a neural network takes an input image set, which can be a single image or multiple images from different views, and predicts a layered 3D human model. The image set comprises a 4D tensor of size N×w×h×c, where N is the number of views, w is the image width, h is the image height, and c is the number of image channels. The camera information for the image set is known. The output model contains three layers, from inside to outside: a skeleton in the predicted pose; a naked 3D body of the predicted shape including facial expression (e.g., an SMPL-X model parameterized by blendshapes and joint rotations); and a 3D field of clothing displacement and appearance RGB color inferred from the input images. A clothed body mesh is obtained by deforming the naked 3D body mesh according to the clothes displacement field.

In another aspect, the neural network is composed of three sub-networks: a multi-view stereo 3D convolutional neural network (MVS-3DCNN), which encodes the input image set into features; a human mesh recovery multilayer perceptron (HMR MLP), which regresses the features to human parameters; and a neural radiance field multilayer perceptron (NeRF MLP), which fine-tunes the MVS-3DCNN and decodes a query 3D ray (3D position and direction) into an RGB color and a clothes-to-body displacement.

In another aspect, in test/inference mode, prediction of the layered 3D human model is device-agnostic, fully automatic, and real-time for a small input set, without any explicit numerical optimization, within the view range of the cameras in the training data. When predicting with the trained neural network, the MVS-3DCNN takes the multi-view image set as input, selects the frontal view as the reference view, and extracts features. The HMR MLP regresses all the features to human pose, shape and facial expression parameters. The SMPL-X model generates a naked human body mesh according to the parameters. The naked body mesh is then converted into an occupancy field within its bounding box. For any 3D point near the body mesh, associated with the ray directions from the center of each view, the trained NeRF MLP generates an RGB color and a 3D displacement vector pointing to the surface of the naked body. By querying all rays cast from all pixels of a camera view (either the same as an input view or any novel view), the appearance of the clothed human body can be rendered as an RGB image. By deforming the naked body using the 3D displacement vectors from the sampled points, a clothed body mesh, such as SMPL-X+D, can be obtained with the same vertex correspondence as the SMPL-X model.

In another aspect, training of the neural network includes two cases: supervised and self-supervised. In the supervised case, a labeled dataset with known human parameters, such as the H36M dataset, is given. The ground truth (GT) parameters and shape are compared with the CNN-regressed parameters and shape, and the difference is computed as a shape loss. Meanwhile, rays are cast from sampled pixels in the input image set, and the NeRF MLP renders the rays, regressing the parameters to color and density, where the density is a function of the naked body density and the 3D clothing displacement. A color loss is computed by summing the differences between the sampled pixel colors and the rendered colors. For existing datasets in which the GT human parameters are unknown, such as motion capture datasets, self-supervised/self-improving training is used instead. In each training iteration, after the parameters are regressed from the MVS-3DCNN, they are sent to an optimization-based human prediction algorithm such as SMPLifyX and optimized by explicit numerical optimization approaches. The optimized parameters are compared with the CNN-regressed parameters to form the shape loss. The remaining steps are the same as in supervised training, but self-improving training takes more epochs and a longer time than the supervised case. Training of the overall neural network is performed by a parallel optimization algorithm such as Adam, which minimizes both the shape loss and the color loss and outputs the optimized network weights.

FIG. 1 illustrates a flowchart of neural human prediction according to some embodiments.
FIG. 2 illustrates a workflow of forward prediction represented in tensor notation, in which the weights of all networks (MVS-3DCNN, HMR MLP and NeRF MLP) are known, according to some embodiments.
FIG. 3 illustrates a workflow for training the network using supervision according to some embodiments.
FIG. 4 illustrates a workflow for training the network with a self-improving strategy according to some embodiments.
FIG. 5 illustrates the alignment of the MVS-3DCNN of each view to the NeRF MLP according to some embodiments.

Neural human prediction includes predicting a 3D human model, including the skeleton pose, the body shape, and the clothing displacement and appearance, from an image set (a single image or multi-view images). Embodiments of neural human prediction describe how the neural network is used. Multiview neural human prediction outperforms single-image-based motion capture (mocap) and human lifting in quality and robustness; simplifies the architecture of body-clothing prediction networks such as Implicit Part Network, which take sparse point clouds with a high memory cost as input and run slowly; and avoids the resolution limitation of latent-code-based networks such as Neural Body, which encode the entire 3D volume.

FIG. 1 is a flowchart of neural human prediction according to some embodiments. In step 100, an input image set I, which is a single image or multi-view images such as a set of photographs taken around a subject, is acquired as input. The input I is represented as a 4D tensor of size N×w×h×c, where N is the number of views and w, h and c are the image width, the image height and the number of image channels, respectively. The cameras have already been calibrated, so all camera information (e.g., camera parameters) is known. As image preprocessing, the bounding box and the foreground mask of the subject are extracted using existing approaches such as Detectron2 and image Grab-Cut. Each image is cropped by the bounding box and zoomed to a size of w×h with the same aspect ratio. The image borders are filled with black.
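The crop, zoom and black-border steps can be illustrated with a minimal NumPy sketch. The function name `crop_and_letterbox`, the nearest-neighbour resampling and the centring of the crop are assumptions made for illustration; the specification does not state how the zoom is resampled or where the black borders are placed. NumPy stores images as rows × columns, so each view below has shape (h, w, c).

```python
import numpy as np

def crop_and_letterbox(image, bbox, out_w, out_h):
    """Crop `image` (H x W x C) to bbox = (x0, y0, x1, y1), scale the crop to fit
    inside out_w x out_h without changing its aspect ratio, and pad with black."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    ch, cw = crop.shape[:2]
    scale = min(out_w / cw, out_h / ch)          # uniform scale keeps the aspect ratio
    new_w = max(1, int(round(cw * scale)))
    new_h = max(1, int(round(ch * scale)))
    # Nearest-neighbour resampling: map every output pixel back to a source pixel.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), ch - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), cw - 1)
    resized = crop[rows][:, cols]
    canvas = np.zeros((out_h, out_w) + image.shape[2:], dtype=image.dtype)  # black
    top, left = (out_h - new_h) // 2, (out_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

# Four hypothetical all-white views, each with the same hypothetical bounding box.
views = [np.full((480, 640, 3), 255, dtype=np.uint8) for _ in range(4)]
I = np.stack([crop_and_letterbox(v, (100, 50, 300, 450), 128, 128) for v in views])
```

With a 200×400 crop placed in a 128×128 target, the subject occupies a 64-pixel-wide band in the middle of each view and the remaining columns stay black.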

The neural network (MVS-PERF) 102 is composed of three components: a multi-view stereo 3D convolutional neural network (MVS-3DCNN) 104, which encodes the input image set into features; a human mesh recovery multilayer perceptron (HMR MLP) 106, which regresses the features to human parameters; and a neural radiance field multilayer perceptron (NeRF MLP) 108, which fine-tunes the MVS-3DCNN and decodes a query 3D ray (3D position and direction) into an RGB color and a clothes-to-body displacement.

In step 104, a deep 2D CNN extracts image features from each view. Each convolutional layer except the last is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU). Two downsampling layers are also placed. The output of the 2D CNN is a feature map of size w/4×h/4×32.

Then, one view is selected as the reference view, and its view frustum is set to cover the entire working space of the subject according to the perspective projection and the near and far planes. The frustum is sampled from near to far by d depth planes parallel to both the near plane and the far plane. All feature maps are warped to each depth plane and blended. For any view i, where i = 1, 2, ..., N, the 3×3 homography image warping matrix relative to the reference view (index 1) is given by the following formula:

Here, K and [R, t] denote the intrinsic and extrinsic parameters of the camera, z is the distance from the depth plane to the camera center of the reference view, and n is the normal direction of the depth plane.
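The formula itself is not reproduced in this text. As a minimal sketch, assuming the standard plane-induced homography of plane-sweep stereo and the camera convention x_cam = R·x_world + t, the warp from reference-view pixels to view-i pixels for a depth plane n·X = z (expressed in the reference camera frame) could be computed as below. The function name, the argument layout and the direction of the mapping are illustrative choices, not taken from the specification.

```python
import numpy as np

def plane_homography(K_ref, R_ref, t_ref, K_i, R_i, t_i, n, z):
    """3x3 homography taking reference-view pixels to view-i pixels for points on
    the plane n^T X = z, where X is expressed in the reference camera frame."""
    R_rel = R_i @ R_ref.T                 # rotation: reference camera -> camera i
    t_rel = t_i - R_rel @ t_ref           # translation: reference camera -> camera i
    return K_i @ (R_rel + np.outer(t_rel, n) / z) @ np.linalg.inv(K_ref)

# Hypothetical two-camera setup; the reference camera sits at the world origin.
K = np.array([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
R_ref, t_ref = np.eye(3), np.zeros(3)
a = 0.1
R_i = np.array([[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]])
t_i = np.array([0.2, 0.0, 0.05])
n, z = np.array([0.0, 0.0, 1.0]), 2.0

H = plane_homography(K, R_ref, t_ref, K, R_i, t_i, n, z)

X_ref = np.array([0.3, -0.1, 2.0])        # a 3D point lying on the depth plane
p_ref = K @ X_ref                          # homogeneous pixel in the reference view
p_i = K @ (R_i @ X_ref + t_i)              # homogeneous pixel in view i
p_warp = H @ p_ref                         # reference pixel warped by the homography
```

For a point on the plane, `p_warp` and `p_i` agree after division by their third component, which is the property that lets the feature map of view i be resampled onto each depth plane of the reference frustum.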

After all images are warped to the depth planes, the cost at coordinates (u, v, z) is determined by the variance of all features across the views, taken about the average feature value of all views. The size of the cost volume is d×w/4×h/4.
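A minimal sketch of a variance-based cost volume follows. The array layout and the averaging over the 32 feature channels are assumptions made so that the result has the stated d×w/4×h/4 size; the specification does not say how the channels are reduced.

```python
import numpy as np

def variance_cost_volume(warped):
    """warped: (N, d, h4, w4, C) features of N views warped onto d depth planes.
    Returns a (d, h4, w4) cost volume."""
    mean = warped.mean(axis=0, keepdims=True)      # average feature over all views
    var = ((warped - mean) ** 2).mean(axis=0)      # per-channel variance, (d, h4, w4, C)
    return var.mean(axis=-1)                       # assumption: average over channels

rng = np.random.default_rng(0)
warped = rng.normal(size=(3, 8, 16, 16, 32))       # hypothetical: 3 views, 8 depth planes
cost = variance_cost_volume(warped)

# When every view sees identical features the cost vanishes, which is what makes
# low variance a cue for the correct depth plane.
identical = np.repeat(warped[:1], 3, axis=0)
cost_identical = variance_cost_volume(identical)
```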

In step 106, the human mesh recovery multilayer perceptron (HMR MLP) includes three layers of linear regression separated by flatten and dropout layers. The HMR MLP regresses the features from the MVS-3DCNN to the human body parameters θ_reg 114.

The human body parameters θ_reg can drive a human body parametric model, such as SMPL-X, into a 3D naked body mesh 202. Typically, the SMPL-X representation θ_reg includes the skeleton pose (the 3D rotation angles of each joint), body blendshape parameters that control the body shape, such as height and weight, and facial blendshape parameters that control the facial expression. A T-pose mesh is built from the blendshape parameters and is then deformed into the posed mesh by the skeleton pose through a linear skinning model.

Meanwhile, in step 108, the cost volume is sent to a differentiable rendering MLP such as a neural radiance field (NeRF). The NeRF MLP is formulated as a function M that maps a query ray, represented by a 3D position x and a direction φ, to a four-channel color RGBσ: c(x, φ) = M(x, φ, f; Γ). Here, f is the feature map from the cost volume of the frustum MVS-3DCNN 104 to the NeRF volume, Γ denotes the weights of the NeRF MLP network, and σ denotes the occupancy density, i.e., the probability that the 3D point lies inside the mesh. The occupancy density field σ_b of the naked body can be obtained directly by converting the mesh 202 (FIG. 2) in the frustum 104. The density field σ of the clothed body can be expressed as a function of the 3D displacement vector field D and the feature map f: σ(D, f). The 3D displacement vector field D 116 represents how points on the clothed body surface 204 relate to points on the naked body surface. When the NeRF MLP is trained, the displacement vector field D is optimized as well.
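To show how per-sample RGBσ values along one query ray could become a single pixel color, the following minimal sketch assumes the usual NeRF quadrature, in which each sample contributes in proportion to its opacity and to the transmittance accumulated in front of it. The specification does not spell out the compositing rule, so this is an assumption, and `composite_ray` is a hypothetical name.

```python
import numpy as np

def composite_ray(rgb, sigma, t_vals):
    """rgb: (S, 3) colors, sigma: (S,) densities, t_vals: (S,) increasing sample depths.
    Returns the composited color and the per-sample weights."""
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)   # spacing; last interval is open
    alpha = 1.0 - np.exp(-sigma * deltas)                 # opacity of each interval
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # light reaching sample
    weights = trans * alpha
    return weights @ rgb, weights

# Hypothetical ray: empty space, then an opaque red surface hiding a green one.
rgb = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sigma = np.array([0.0, 0.0, 1e6, 1e6])
t_vals = np.array([1.0, 2.0, 3.0, 4.0])
color, weights = composite_ray(rgb, sigma, t_vals)
```

Because the first opaque sample absorbs all of the transmittance, the composited color is that of the red surface and the occluded green sample receives zero weight.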

FIG. 2 is a workflow of forward prediction represented in tensor notation, in which the weights of all networks (MVS-3DCNN, HMR MLP and NeRF MLP) have been trained and fixed, according to some embodiments. The appearance image 112 is rendered by querying all rays 200 of the pixels of a perspective projection image. In some embodiments, 3D human prediction 110 is implemented. The displacement field D 116 is obtained by querying sampled points near the human body. For human performance capture tasks in which the clothed output mesh has the same topology as the template, the naked body mesh V_b 202 can be deformed into the clothed body mesh V_c 204 by adding an interpolated displacement vector to each vertex.
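The per-vertex deformation can be sketched as follows. Inverse-distance weighting over the k nearest sampled points is an assumed interpolation scheme, since the specification only says that an interpolated displacement vector is added to each vertex; `deform_mesh` and its parameters are hypothetical.

```python
import numpy as np

def deform_mesh(V_b, sample_pts, sample_disp, k=4, eps=1e-8):
    """V_c = V_b + D(V_b), where D is interpolated at each vertex from the k nearest
    sampled points by inverse-distance weighting.
    V_b: (V, 3) vertices, sample_pts: (S, 3), sample_disp: (S, 3)."""
    d2 = ((V_b[:, None, :] - sample_pts[None, :, :]) ** 2).sum(-1)   # (V, S) squared distances
    idx = np.argsort(d2, axis=1)[:, :k]                               # k nearest samples
    near = np.take_along_axis(d2, idx, axis=1)
    w = 1.0 / (np.sqrt(near) + eps)
    w /= w.sum(axis=1, keepdims=True)                                 # weights sum to 1
    disp = (w[:, :, None] * sample_disp[idx]).sum(axis=1)             # (V, 3)
    return V_b + disp

rng = np.random.default_rng(1)
V_b = rng.uniform(-1.0, 1.0, size=(10, 3))            # hypothetical naked-body vertices
sample_pts = rng.uniform(-1.0, 1.0, size=(50, 3))     # points queried near the body
sample_disp = np.tile([0.0, 0.0, 0.02], (50, 1))      # constant 2 cm offset field
V_c = deform_mesh(V_b, sample_pts, sample_disp)
```

The vertex count and ordering are unchanged, so the clothed mesh keeps the vertex correspondence of the template; only the positions move.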

FIG. 3 is a workflow for training the network using supervision according to some embodiments. A supervised training dataset, such as Human3.6M, contains not only the image input I 100 but also the ground truth human parameters θ_gt 300 and the naked body mesh V_b,gt 302, which are typically acquired by sensors or existing approaches. In this case, the shape loss 304 is obtained directly by summing the differences between the predicted naked body and the ground truth.

Here, J denotes the joints of the naked body, and Π denotes the perspective projection of the 3D points for each camera view. To train the network effectively, in each training step all views are selected in turn as the reference view of the MVS-3DCNN.

Meanwhile, rays 306 are sampled from the input image set 100 using a non-uniform sampling strategy that is typically proportional to image saliency. More rays are sampled in high-saliency regions, and fewer rays are sampled from flat or background regions. These rays are sent, together with the feature map from the MVS-3DCNN 104, to the NeRF MLP 108, which renders the appearance RGBσ colors 308 of the samples. The color loss 310 is computed by summing all the differences between the sampled colors in the input images and the rendered colors 308.
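A minimal sketch of saliency-proportional ray sampling and of the color loss follows. The saliency map is assumed to be given, sampling with replacement is an implementation choice, and the squared difference is an assumed form of "difference"; none of these details are fixed by the specification.

```python
import numpy as np

def sample_pixels(saliency, n_rays, rng):
    """Draw (row, col) pixel coordinates with probability proportional to saliency."""
    p = saliency.ravel().astype(np.float64)
    p = p / p.sum()
    flat = rng.choice(p.size, size=n_rays, p=p)        # sampled with replacement
    return np.stack(np.unravel_index(flat, saliency.shape), axis=1)

def color_loss(image, pixels, rendered):
    """Sum of squared differences between sampled pixel colors and rendered colors."""
    target = image[pixels[:, 0], pixels[:, 1]]
    return float(((target - rendered) ** 2).sum())

rng = np.random.default_rng(0)
saliency = np.full((32, 32), 0.01)                      # flat background
saliency[8:16, 8:16] = 1.0                              # hypothetical salient region
pixels = sample_pixels(saliency, 50, rng)

image = rng.uniform(size=(32, 32, 3))
perfect = image[pixels[:, 0], pixels[:, 1]]             # stand-in for a perfect render
loss_perfect = color_loss(image, pixels, perfect)
loss_black = color_loss(image, pixels, np.zeros_like(perfect))
in_salient = int(((pixels >= 8) & (pixels < 16)).all(axis=1).sum())
```

In this setup the salient block holds about 87% of the probability mass, so most of the 50 rays land inside it even though it covers only 64 of the 1024 pixels.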

A parallelized stochastic optimization algorithm such as Adam is applied to train the weights of all networks (MVS-3DCNN, HMR MLP and NeRF MLP) by minimizing both the shape loss and the color loss.

FIG. 4 is a workflow for training the network with a self-improving strategy according to some embodiments. In this case, the training dataset provides only human images, without any annotations or ground truth human parameters. For each image in the input set 100, an optimization-based prediction 400, such as the SMPLifyX algorithm, is applied, with the regressed parameters θ_reg 114 taken as the initial guess. The optimization-based prediction first detects the 2D keypoints of the human in each image and then applies nonlinear optimization to fit the 3D human mesh V_b,opt 404 (parameterized by θ_opt 402) to these 2D keypoints.

Here, K denotes the detected 2D locations of the keypoints, and the summation is taken over all corresponding keypoints and all views.

Nonlinear least-squares optimization is numerically slow, and its fitting accuracy depends on the initial guess θ_reg, but it is reliable. After sufficient fitting iterations, θ_opt comes close to the ground truth. Therefore, the self-improving training workflow can efficiently improve θ_opt toward the ground truth, as summarized below.
Self-improving training workflow:
Do:
  Compute θ_reg from the MVS-3DCNN and the HMR MLP with input I
  Compute θ_opt from SMPLifyX, with θ_reg as the initial guess and I as input
  Sample rays from I and compute the sampled colors c from the NeRF MLP
  Compute ShapeLoss and ColorLoss
  Update the network weights of the MVS-3DCNN, HMR MLP and NeRF MLP by minimizing ShapeLoss and ColorLoss
Repeat for all training data until the weights converge
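The control flow of the workflow above can be sketched in plain Python. Every callable passed in (`regress`, `optimize_fit`, `render_loss`, `update`) is a hypothetical stand-in for the corresponding network or algorithm, the squared-difference shape loss is an assumption, and the loss threshold replaces the weight-convergence test purely for brevity.

```python
def self_improving_training(batches, regress, optimize_fit, render_loss, update,
                            max_epochs=100, tol=1e-6):
    """Skeleton of the self-improving loop; returns (epochs run, last epoch loss)."""
    total = 0.0
    for epoch in range(max_epochs):
        total = 0.0
        for images in batches:
            theta_reg = regress(images)                    # MVS-3DCNN + HMR MLP
            theta_opt = optimize_fit(theta_reg, images)    # optimization-based fit
            shape_loss = sum((a - b) ** 2 for a, b in zip(theta_reg, theta_opt))
            loss = shape_loss + render_loss(images)        # ShapeLoss + ColorLoss
            update(theta_reg, theta_opt)                   # one optimizer step
            total += loss
        if total < tol:                                    # assumed stopping rule
            break
    return epoch + 1, total

# Toy stand-ins: one scalar "weight" that should approach the true value 2.0.
state = {"w": 0.0}
TRUE_THETA = 2.0

def regress(images):
    return [state["w"]]

def optimize_fit(theta_reg, images):
    # The fit moves the initial guess halfway toward the true parameter.
    return [theta_reg[0] + 0.5 * (TRUE_THETA - theta_reg[0])]

def render_loss(images):
    return 0.0                                             # color term omitted in the toy

def update(theta_reg, theta_opt):
    # Gradient step on (w - theta_opt)^2 with theta_opt held fixed, learning rate 0.25.
    state["w"] -= 0.25 * 2.0 * (theta_reg[0] - theta_opt[0])

epochs, final_loss = self_improving_training([None], regress, optimize_fit,
                                             render_loss, update)
```

In the toy, each pass shrinks the gap between the regressed and fitted parameter, which mirrors the idea that a better θ_reg gives the optimizer a better initial guess, and the improved θ_opt in turn gives the regressor a better target.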

FIG. 5 illustrates the alignment of the MVS-3DCNN of each view to the NeRF MLP according to some embodiments.

In operation, neural human prediction can be applied directly in commercial and/or personal markerless performance capture applications, such as markerless motion capture in a game studio or human 3D surface reconstruction with an RGB camera setup. As a real-time backbone technology that can be combined with any extension, other applications of embodiments of multiview neural human prediction include, for example, combining it with depth-sensing input, 3D modeling, or using the output to create novel animation. Multiview neural human prediction can also be applied in gaming applications, VR/AR applications, and any real-time human interaction applications. Depending on the hardware used (e.g., the speed of the GPU processor and the size of the GPU memory), multiview neural human prediction runs in real time when processing a small number of views for prediction, and near-real-time processing and prediction can be implemented for a larger number of views (e.g., 20).

The methods described herein can be implemented on any computing device. Examples of suitable computing devices include a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, a smartphone, a portable music player, a tablet computer, a mobile device, a video player, a video disc writer/player (e.g., a DVD writer/player, a high-definition disc writer/player, or an ultra-high-definition disc writer/player), a television, a home entertainment system, an augmented reality device, a virtual reality device, smart jewelry (e.g., a smart watch), a vehicle (e.g., a self-driving vehicle), or any other suitable computing device.

Some embodiments of multiview neural human prediction using an implicit differentiable renderer for facial expression, body pose shape, and clothing performance capture
1. A method programmed in a non-transitory memory of a device, comprising:
acquiring an image set as input; and
processing the image set using a neural network, wherein the processing comprises:
encoding the image set into one or more features;
regressing the features to human parameters;
fine-tuning the neural network; and
decoding a query 3D ray into RGB color and clothes-to-body displacement based on the image set.

2. The method of clause 1, wherein the image set comprises a 4D tensor of size N×w×h×c, where N is the number of views, w is the image width, h is the image height, and c is the number of image channels.

3. The method of clause 1, wherein the neural network selects a frontal view from the image set as a reference view and extracts features.

4. The method of clause 3, wherein the neural network regresses all the features to human pose, shape and facial expression parameters.

5. The method of clause 4, wherein the neural network generates a naked human body mesh according to the parameters.

6. The method of clause 5, wherein the naked body mesh is converted into an occupancy field within a bounding box.

7. The method of clause 6, wherein, for any 3D point near the body mesh associated with ray directions from the center of each view, the neural network generates an RGB color and a 3D displacement vector pointing to the surface of the naked body.

8. The method of clause 7, wherein the appearance of the clothed human body is rendered as an RGB image by querying all rays cast from all pixels of a camera view, and a clothed body mesh is obtained by deforming the naked body using the 3D displacement vectors from the sampled points.

9. The method of clause 1, wherein the neural network is implemented in a supervised mode or a self-supervised mode.

10. An apparatus comprising:
a non-transitory memory configured to store an application; and
a processor configured to process the application,
wherein the application is configured to:
acquire an image set as input; and
process the image set using a neural network, wherein the processing comprises:
encoding the image set into one or more features;
regressing the features to human parameters;
fine-tuning the neural network; and
decoding a query 3D ray into RGB color and clothes-to-body displacement based on the image set.

１１．画像セットは、サイズＮ×ｗ×ｈ×ｃの４Ｄテンソルを含み、ここで、Ｎはビューの数、ｗは画像の幅、ｈは画像の高さ、ｃは画像のチャネルである、条項１０の装置。 11. The apparatus of clause 10, wherein the image set comprises a 4D tensor of size Nxwxhxc, where N is the number of views, w is the image width, h is the image height, and c is the image channel.

１２．ニューラルネットワークは、画像セットから正面ビューを基準ビューとして選択し、特徴量を抽出する、条項１０の装置。 12. The apparatus of clause 10, wherein the neural network selects a frontal view from the image set as a reference view and extracts features.

１３．ニューラルネットワークは、全ての特徴量を人間のポーズ、形状、表情パラメータに回帰させる、条項１２の装置。 13. The neural network is the device of clause 12 that regresses all features onto human pose, shape, and facial expression parameters.

１４．ニューラルネットワークは、パラメータに従って人間の裸体メッシュを生成する、条項１３の装置。 14. The apparatus of clause 13, wherein the neural network generates a nude human body mesh according to the parameters.

１５．裸体メッシュは、バウンディングボックス内の占有フィールドに変換される、条項１４の装置。 15. The apparatus of clause 14, wherein the nude mesh is converted into an occupancy field within a bounding box.

１６．ニューラルネットワークは、ビューの各中心からの光線方向に関連する身体メッシュの近くのいずれかの３Ｄ点について、ＲＧＢカラーと、裸体の表面を示す３Ｄ変位ベクトルとを生成する、条項１５の装置。 16. The apparatus of clause 15, wherein the neural network generates an RGB color and a 3D displacement vector representing the surface of the nude body for any 3D point near the body mesh associated with a ray direction from each center of view.

１７．カメラビューの全ての画素から放たれる全ての光線を問い合わせることにより、着衣姿の人体の外観がＲＧＢ画像としてレンダリングされ、サンプリングされた点から３Ｄ変位ベクトルを使用して裸体を変形させることにより、着衣姿の身体メッシュが取得される、条項１６の装置。 17. The apparatus of clause 16, wherein the appearance of the clothed human body is rendered as an RGB image by querying all rays cast from all pixels in the camera view, and a clothed body mesh is obtained by deforming the nude body using 3D displacement vectors from the sampled points.

18. The apparatus of clause 10, wherein the neural network is implemented in a supervised mode or a self-supervised mode.

19. An apparatus comprising:
a non-transitory memory configured to store an application; and
a processor configured to process the application,
wherein the application includes:
a multi-view stereo 3D convolutional neural network (MVS-3DCNN) configured to encode an input image into features;
a human mesh recovery multilayer perceptron (HMR MLP) configured to regress the features to human parameters; and
a neural radiance field multilayer perceptron (NeRF MLP) configured to fine-tune the MVS-3DCNN and to decode a query 3D ray (3D position and direction) into an RGB color and a clothes-body displacement.
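The data flow among the three components of clause 19 can be traced with a shape-only sketch. Random linear maps stand in for the trained networks, the feature and parameter sizes are hypothetical, and conditioning the ray decoder on the encoded features is an assumption. Fine-tuning is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_views(image_set, feature_dim=32):
    """Stand-in for the MVS-3DCNN: one feature vector from all views."""
    flat = image_set.reshape(image_set.shape[0], -1)      # (N, w*h*c)
    weights = rng.standard_normal((flat.shape[1], feature_dim)) * 0.01
    return np.tanh(flat @ weights).mean(axis=0)           # (feature_dim,)

def regress_parameters(features, num_params=10):
    """Stand-in for the HMR MLP: features -> human parameters."""
    weights = rng.standard_normal((features.shape[0], num_params)) * 0.1
    return features @ weights

def decode_ray(features, position, direction):
    """Stand-in for the NeRF MLP: query ray -> (RGB, clothes-body displacement)."""
    x = np.concatenate([features, position, direction])
    weights = rng.standard_normal((x.shape[0], 6)) * 0.1
    out = x @ weights
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))                  # squash to [0, 1]
    displacement = out[3:]                                # unbounded 3D vector
    return rgb, displacement

image_set = rng.random((4, 16, 12, 3))                    # N x w x h x c
features = encode_views(image_set)
params = regress_parameters(features)
rgb, displacement = decode_ray(features, np.zeros(3), np.array([0.0, 0.0, -1.0]))
```

The sketch only shows which output feeds which input: the image set yields features, the features yield human parameters, and a query ray yields a color and a displacement.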

20. The apparatus of clause 19, wherein the image set comprises a 4D tensor of size N×w×h×c, where N is the number of views, w is the image width, h is the image height, and c is the number of image channels.

21. The apparatus of clause 20, wherein the MVS-3DCNN selects a frontal view from the image set as a reference view and extracts features.

22. The apparatus of clause 21, wherein the HMR MLP regresses all features to human pose, shape, and facial expression parameters.

23. The apparatus of clause 22, further comprising a model configured to generate a nude human body mesh according to the parameters.

24. The apparatus of clause 23, wherein the nude mesh is converted into an occupancy field within a bounding box.

25. The apparatus of clause 24, wherein the NeRF MLP generates, for any 3D point near the body mesh associated with a ray direction from each center of view, an RGB color and a 3D displacement vector indicating the surface of the nude body.

26. The apparatus of clause 25, wherein the appearance of the clothed human body is rendered as an RGB image by querying all rays cast from all pixels of the camera view, and a clothed body mesh is obtained by deforming the nude body using the 3D displacement vectors from the sampled points.
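The mesh half of clause 26, deforming the nude body with displacement vectors taken at sampled points, can be sketched as below. The clause does not say how samples are matched to mesh vertices, so nearest-neighbor lookup is an assumption of this sketch. The displacement is taken as the clothed position minus the unclothed position, following the wording of claim 1. All names and values are hypothetical.

```python
import numpy as np

def deform_nude_mesh(nude_vertices, sample_points, sample_displacements):
    """Move each nude-body vertex by the displacement of its nearest sample.

    Nearest-neighbor lookup is an assumption of this sketch; the displacement
    is read as (clothed position - unclothed position), so it is added.
    """
    # Pairwise squared distances, shape (num_vertices, num_samples).
    diff = nude_vertices[:, None, :] - sample_points[None, :, :]
    nearest = np.argmin(np.einsum("vsk,vsk->vs", diff, diff), axis=1)
    return nude_vertices + sample_displacements[nearest]

nude = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
samples = np.array([[0.1, 0.0, 0.0], [0.9, 0.0, 0.0]])
displacements = np.array([[0.0, 0.0, 0.05], [0.0, 0.02, 0.0]])
clothed = deform_nude_mesh(nude, samples, displacements)
```

Because only vertex positions change, the clothed mesh keeps the face connectivity of the nude mesh.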

The present invention has been described in terms of specific embodiments incorporating details to facilitate an understanding of the principles of construction and operation of the invention. Reference herein to such specific embodiments and their details is not intended to limit the scope of the claims appended hereto. It will be readily apparent to those skilled in the art that various other modifications can be made to the embodiments chosen for illustration without departing from the spirit and scope of the invention as defined by the claims.

100 Image input I
102 Neural network (MVS-PERF)
104 Multi-view stereo 3D convolutional neural network (MVS-3DCNN)
106 Human mesh recovery multilayer perceptron (HMR MLP)
108 Neural radiance field multilayer perceptron (NeRF MLP)
110 3D human prediction
112 Appearance image
114 Human body parameters θ_reg
116 3D displacement vector field D

Claims

1. A method programmed in a non-transitory device, the method comprising:
acquiring a set of images as input; and
processing the set of images using a neural network,
wherein the processing comprises:
encoding the set of images into one or more features by inputting the set into a multi-view stereo 3D convolutional neural network (MVS-3DCNN);
inputting the features into a human mesh recovery multilayer perceptron (HMR MLP) to regress human body parameters, including parameters controlling the shape of the human body;
fine-tuning the neural network; and
inputting a query 3D ray, represented by a 3D position and direction, into a neural radiance field multilayer perceptron (NeRF MLP) to decode the 3D ray into an RGB color based on the image set and a clothes-body displacement indicating the displacement of a point on the body surface after the clothes are worn relative to that point on the body surface before the clothes are worn.

2. The method of claim 1, wherein the image set comprises a 4D tensor of size N×w×h×c, where N is the number of views, w is the image width, h is the image height, and c is the number of image channels.

3. The method of claim 1, wherein the neural network selects a frontal view from the image set as a reference view and extracts features.

4. The method of claim 3, wherein the neural network regresses all features to human pose, shape, and facial expression parameters.

5. The method of claim 4, wherein the neural network generates a nude human body mesh according to the human body parameters.

6. The method of claim 5, wherein the nude mesh is converted into an occupancy field within a bounding box.

7. The method of claim 6, wherein the neural network generates, for any 3D point near the body mesh associated with a ray direction from each center of view, the RGB color and a 3D displacement vector indicating the surface of the nude body.

8. The method of claim 7, wherein the appearance of the clothed human body is rendered as an RGB image by querying all rays cast from all pixels of the camera view, and the clothed body mesh is obtained by deforming the nude body using the 3D displacement vectors from the sampled points.

9. The method of claim 1, wherein the neural network is implemented in a supervised mode or a self-supervised mode.

10. An apparatus comprising:
a non-transitory memory configured to store an application; and
a processor configured to process the application,
wherein the application comprises:
a multi-view stereo 3D convolutional neural network (MVS-3DCNN) configured to encode an input image into features;
a human mesh recovery multilayer perceptron (HMR MLP) configured to regress the features to human body parameters, including parameters controlling the shape of the human body; and
a neural radiance field multilayer perceptron (NeRF MLP) configured to fine-tune the MVS-3DCNN and to decode a query 3D ray, represented by a 3D position and direction, into an RGB color and a clothes-body displacement indicating the displacement of a point on the body surface after the clothes are worn relative to that point on the body surface before the clothes are worn.

11. The apparatus of claim 10, wherein the input image comprises a 4D tensor of size N×w×h×c, where N is the number of views, w is the image width, h is the image height, and c is the number of image channels.

12. The apparatus of claim 11, wherein the MVS-3DCNN selects a frontal view from the input image as a reference view and extracts features.

13. The apparatus of claim 12, wherein the HMR MLP regresses all features to human pose, shape, and facial expression parameters.

14. The apparatus of claim 13, further comprising a model configured to generate a nude human body mesh according to the parameters.

15. The apparatus of claim 14, wherein the nude mesh is converted into an occupancy field within a bounding box.

16. The apparatus of claim 15, wherein the NeRF MLP generates, for any 3D point near the body mesh associated with a ray direction from each center of view, the RGB color and a 3D displacement vector indicating the surface of the nude body.

17. The apparatus of claim 16, wherein the appearance of the clothed human body is rendered as an RGB image by querying all rays cast from all pixels of the camera view, and the clothed body mesh is obtained by deforming the nude body using the 3D displacement vectors from the sampled points.