JP7740333B2

JP7740333B2 - Information processing device and information processing method

Info

Publication number: JP7740333B2
Application number: JP2023523964A
Authority: JP
Inventors: 卓己津留; 俊也浜田
Original assignee: Sony Corp; Sony Group Corp
Current assignee: Sony Corp; Sony Group Corp
Priority date: 2021-05-27
Filing date: 2022-01-17
Publication date: 2025-09-17
Anticipated expiration: 2042-01-17
Also published as: JPWO2022249536A1; WO2022249536A1

Description

本技術は、ＶＲ（Virtual Reality：仮想現実）映像の配信等に適用可能な情報処理装置、及び情報処理方法に関する。 This technology relates to an information processing device and information processing method that can be applied to the distribution of VR (Virtual Reality) images, etc.

近年、全天周カメラ等により撮影された、全方位を見回すことが可能な全天周映像が、ＶＲ映像として配信されるようになってきている。さらに最近では、視聴者（ユーザ）が、全方位見回し（視線方向を自由に選択）することができ、３次元空間中を自由に移動することができる（視点位置を自由に選択することができる）６ＤｏＦ（Degree of Freedom）映像（６ＤｏＦコンテンツとも称する）を配信する技術の開発が進んでいる。
このような６ＤｏＦコンテンツは、時刻毎に、視聴者の視点位置、視線方向及び視野角（視野範囲）に応じて、１つもしくは複数の３次元オブジェクトで３次元空間を動的に再現するものである。
このような映像配信においては、視聴者の視野範囲に応じて、視聴者に提示する映像データを動的に調整（レンダリング）することが求められる。例えば、このような技術の一例としては、特許文献１に開示の技術を挙げることができる。 In recent years, panoramic images that are captured by panoramic cameras and the like and allow viewing in all directions have begun to be distributed as VR images. More recently, development has been progressing on technology for distributing 6DoF (Degree of Freedom) images (also referred to as 6DoF content), in which viewers (users) can look around in all directions (freely select the viewing direction) and move freely in three-dimensional space (freely select the viewpoint position).
Such 6DoF content dynamically reproduces a three-dimensional space using one or more three-dimensional objects according to the viewer's viewpoint position, line of sight direction, and viewing angle (viewing range) at each time.
In such video distribution, it is necessary to dynamically adjust (render) the video data presented to the viewer according to the viewer's field of view. For example, a technique disclosed in Patent Literature 1 can be cited as an example of such a technique.

また非特許文献１には、全天球画像に対する顕著性マップの推定処理について記載されている。
この推定処理では、全天球画像から様々なカメラ方向の平面画像が抽出され、平面画像用の顕著性マップ推定モデルにより、各平面画像に対する顕著性マップが推定される。各平面画像に対する顕著性マップが統合され、また画像中央の水平線方向に水平線バイアスがかけられて、全天球画像の顕著性マップが推定される。 Non-Patent Document 1 describes a process for estimating a saliency map for a spherical image.
In this estimation process, planar images from various camera directions are extracted from a spherical image, and a saliency map for each planar image is estimated using a saliency map estimation model for planar images. The saliency maps for each planar image are integrated, and a horizon bias is applied to the horizontal line direction at the center of the image, to estimate a saliency map for the spherical image.

Patent Literature 1: 特表２００７－５２０９２５号公報 (Japanese Translation of PCT Application Publication No. 2007-520925)

Non-Patent Document 1: 山中高夫、「画像中の目立つ場所を推定する技術：深層学習を用いた顕著性マップ推定」、［online］、２０２０年９月１５日、令和２年度新技術説明会、インターネット＜URL：https://shingi.jst.go.jp/var/rev0/0001/1222/02_sophia_yamanaka.pdf＞ Takao Yamanaka, "Technology for Estimating Prominent Locations in Images: Saliency Map Estimation Using Deep Learning," [online], September 15, 2020, FY2020 New Technology Briefing, Internet <URL: https://shingi.jst.go.jp/var/rev0/0001/1222/02_sophia_yamanaka.pdf>

ＶＲ映像等の仮想的な映像（仮想映像）の配信は普及していくと考えられ、高品質な仮想映像の配信を可能とする技術が求められている。 The distribution of virtual images (virtual videos) such as VR videos is expected to become more widespread, and there is a demand for technology that enables the distribution of high-quality virtual videos.

以上のような事情に鑑み、本技術の目的は、高品質な仮想映像の配信を実現することが可能な情報処理装置、及び情報処理方法を提供することにある。 In light of the above circumstances, the purpose of this technology is to provide an information processing device and information processing method that are capable of delivering high-quality virtual images.

上記目的を達成するため、本技術の一形態に係る情報処理装置は、レンダリング部と、推定部と、生成部とを具備する。
前記レンダリング部は、ユーザの視野に関する視野情報に基づいて、仮想空間を構成する３次元空間データに対してレンダリング処理を実行することにより、前記ユーザの視野に応じた２次元映像データを生成する。
前記推定部は、前記仮想空間の前記ユーザの視野に含まれない視野外領域における、前記ユーザが認識している認識対象オブジェクトの前記ユーザが認識している認識位置を推定する。
前記生成部は、推定された前記視野外領域における前記認識対象オブジェクトの前記認識位置に基づいて、前記視野外領域における顕著性を表す顕著性マップを生成する。 In order to achieve the above object, an information processing device according to an embodiment of the present technology includes a rendering unit, an estimation unit, and a generation unit.
The rendering unit generates two-dimensional video data according to the user's field of view by performing a rendering process on three-dimensional space data that constitutes a virtual space based on field of view information relating to the user's field of view.
The estimation unit estimates a recognition position, recognized by the user, of a recognition target object recognized by the user in an out-of-field area of the virtual space that is not included in the user's field of view.
The generation unit generates a saliency map representing saliency in the out-of-field area based on the estimated recognition position of the recognition target object in the out-of-field area.

この情報処理装置では、視野外領域における認識対象オブジェクトの認識位置が推定される。推定された認識位置に基づいて、視野外領域における顕著性マップが生成される。これにより、視野外領域における高精度の顕著性マップを生成することが可能となり、顕著性マップを用いて高品質な仮想映像の配信を実現することが可能となる。 In this information processing device, the recognition position of the object to be recognized in the out-of-field area is estimated. A saliency map for the out-of-field area is generated based on the estimated recognition position. This makes it possible to generate a highly accurate saliency map for the out-of-field area, and to use the saliency map to deliver high-quality virtual video.

前記推定部は、現在時刻までにレンダリング対象となったことがあるオブジェクトを、前記認識対象オブジェクトとして設定してもよい。 The estimation unit may set an object that has been the subject of rendering up to the current time as the object to be recognized.

前記２次元映像データは、時系列に連続する複数のフレーム画像により構成されてもよい。この場合、前記推定部は、現在時刻のフレーム画像に含まれない前記認識対象オブジェクトについて、前記認識対象オブジェクトが含まれる過去の直近のフレーム画像内の前記認識対象オブジェクトの位置に対応する前記仮想空間内の位置に基づいて、前記認識位置を推定してもよい。 The two-dimensional video data may be composed of a plurality of frame images that are consecutive in time series. In this case, the estimation unit may estimate the recognition position of the recognition target object that is not included in the frame image at the current time based on a position in the virtual space that corresponds to the position of the recognition target object in the most recent past frame image that includes the recognition target object.

前記推定部は、前記直近のフレーム画像内の前記認識対象オブジェクトの位置に対応する前記仮想空間内の位置を、前記認識位置として推定してもよい。 The estimation unit may estimate as the recognition position a position in the virtual space corresponding to the position of the object to be recognized in the most recent frame image.

前記推定部は、前記直近のフレーム画像内の前記認識対象オブジェクトの位置に対応する前記仮想空間内の位置から前記認識対象オブジェクトの移動方向に沿ってシフトした位置を、前記認識位置として推定してもよい。 The estimation unit may estimate, as the recognition position, a position shifted along the movement direction of the object to be recognized from a position in the virtual space corresponding to the position of the object to be recognized in the most recent frame image.
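The two variants described above (using the last rendered position as-is, or shifting it along the movement direction) can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the `LastSeen` record, the per-second `velocity` field, and the cap `max_shift_seconds` are hypothetical, and the source does not specify how far the position is shifted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class LastSeen:
    """Hypothetical record of the most recent frame in which an object was rendered."""
    position: Vec3  # position in the virtual space (world coordinates) at that frame
    velocity: Vec3  # movement per second observed at that frame (assumption)
    time: float     # timestamp of that frame, in seconds


def estimate_recognition_position(
    last_seen: Optional[LastSeen],
    now: float,
    shift_along_motion: bool = False,
    max_shift_seconds: float = 1.0,
) -> Optional[Vec3]:
    """Estimate where the user believes an out-of-view object currently is."""
    if last_seen is None:
        # Never rendered so far: not treated as a recognition target object.
        return None
    if not shift_along_motion:
        # Variant 1: the position in the most recent frame is the recognition position.
        return last_seen.position
    # Variant 2: shift along the movement direction. The shift amount
    # (velocity x elapsed time, capped) is an assumption made for this sketch.
    dt = min(max(now - last_seen.time, 0.0), max_shift_seconds)
    x, y, z = last_seen.position
    vx, vy, vz = last_seen.velocity
    return (x + vx * dt, y + vy * dt, z + vz * dt)
```

Capping the shift keeps the estimate close to what the user actually saw; an object that left the view long ago is not extrapolated indefinitely.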

前記推定部は、現在時刻のフレーム画像に含まれない前記認識対象オブジェクトについて、前記認識対象オブジェクトが発する音を前記ユーザが認識したと判定した場合に、前記音の前記仮想空間内における発生位置を、前記認識位置として推定してもよい。 When the estimation unit determines that the user has recognized a sound emitted by a recognition target object that is not included in the frame image at the current time, the estimation unit may estimate the position where the sound originated in the virtual space as the recognition position.

前記３次元空間データは、前記仮想空間の構成を定義する３次元空間記述データと、前記仮想空間における３次元オブジェクトを定義する３次元オブジェクトデータとを含んでもよい。この場合、前記３次元空間記述データは、前記認識対象オブジェクトの役割を表す役割情報、及び前記役割に関連する定位置を表す定位置情報を含んでもよい。また前記推定部は、現在時刻のフレーム画像に含まれない所定の役割情報が設定された前記認識対象オブジェクトについて、現在時刻までに、同じ役割情報が設定された前記認識対象オブジェクトがレンダリングされたことがある場合に、前記役割に関連する前記定位置を前記認識位置として推定してもよい。 The three-dimensional space data may include three-dimensional space description data that defines the configuration of the virtual space and three-dimensional object data that defines three-dimensional objects in the virtual space. In this case, the three-dimensional space description data may include role information that represents the role of the recognition target object and fixed position information that represents a fixed position associated with the role. Furthermore, for a recognition target object that has specified role information set and is not included in the frame image at the current time, if a recognition target object that has the same role information has been rendered up to the current time, the estimation unit may estimate the fixed position associated with the role as the recognition position.

前記推定部は、現在時刻までに同じ役割情報が設定された前記認識対象オブジェクトが前記役割に関連する前記定位置にいる状態がレンダリングされたことがある場合に、前記役割に関連する前記定位置を前記認識位置として推定してもよい。 The estimation unit may estimate the fixed position associated with the role as the recognition position if the object to be recognized, to which the same role information has been set, has been rendered in the fixed position associated with the role up to the current time.

前記３次元空間データは、前記仮想空間の構成を定義する３次元空間記述データと、前記仮想空間における３次元オブジェクトを定義する３次元オブジェクトデータとを含んでもよい。この場合、前記３次元空間記述データは、前記認識対象オブジェクトの役割を表す役割情報、及び前記役割に関連する定位置を表す定位置情報を含んでもよい。また前記推定部は、現在時刻のフレーム画像に含まれない所定の役割情報が設定された前記認識対象オブジェクトについて、現在時刻までに、同じ役割情報が設定された前記認識対象オブジェクトが発する音を前記ユーザが認識したと判定した場合に、前記役割に関連する前記定位置を前記認識位置として推定してもよい。 The three-dimensional space data may include three-dimensional space description data that defines the configuration of the virtual space and three-dimensional object data that defines three-dimensional objects in the virtual space. In this case, the three-dimensional space description data may include role information that represents the role of the recognition target object and fixed position information that represents a fixed position associated with the role. Furthermore, for a recognition target object that has predetermined role information set and is not included in the frame image at the current time, when the estimation unit determines that, up to the current time, the user has recognized a sound emitted by a recognition target object to which the same role information is set, the estimation unit may estimate the fixed position associated with the role as the recognition position.

前記推定部は、現在時刻までに同じ役割情報が設定された前記認識対象オブジェクトが前記役割に関連する前記定位置にいる状態で発した音を前記ユーザが認識したと判定した場合に、前記役割に関連する前記定位置を前記認識位置として推定してもよい。 The estimation unit may estimate the fixed position associated with the role as the recognition position when it determines that, up to the current time, the user has recognized a sound emitted by a recognition target object to which the same role information is set while that object was at the fixed position associated with the role.
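The role-based estimation above can be pictured with a small sketch. It assumes the scene description supplies a role-to-fixed-position table and that the system records which roles the user has already seen rendered or heard; the class name `RoleHistory`, its methods, and the role string used in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class RoleHistory:
    """Hypothetical per-session record of roles the user has already perceived."""
    fixed_positions: Dict[str, Vec3]                      # role -> fixed position (from the scene description)
    roles_seen: Set[str] = field(default_factory=set)     # roles rendered up to the current time
    roles_heard: Set[str] = field(default_factory=set)    # roles whose sound the user is judged to have recognized

    def note_rendered(self, role: str) -> None:
        self.roles_seen.add(role)

    def note_heard(self, role: str) -> None:
        self.roles_heard.add(role)

    def estimate(self, role: str) -> Optional[Vec3]:
        """Recognition position for an out-of-view object that has this role."""
        if role in self.roles_seen or role in self.roles_heard:
            # An object with the same role was perceived before, so the user is
            # assumed to expect this object at the fixed position for the role.
            return self.fixed_positions.get(role)
        return None
```

For example, with `RoleHistory({"goalkeeper": (0.0, 0.0, -50.0)})`, `estimate("goalkeeper")` returns `None` until `note_rendered("goalkeeper")` or `note_heard("goalkeeper")` has been called, and the fixed position afterwards. The stricter variants, in which the object must have been at the fixed position when it was seen or heard, would add that check before recording the role.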

前記推定部は、前記認識対象オブジェクトがレンダリングされている前記２次元映像データ内の前記認識対象オブジェクトの位置に基づいて、前記認識位置を推定してもよい。 The estimation unit may estimate the recognition position based on the position of the object to be recognized within the two-dimensional video data in which the object to be recognized is rendered.

前記生成部は、前記視野外領域におけるボトムアップ注意に基づく顕著性がゼロとなる前記顕著性マップを生成してもよい。 The generation unit may generate the saliency map in which the saliency based on bottom-up attention in the out-of-field area is zero.

前記生成部は、前記視野外領域における前記認識対象オブジェクトの前記認識位置に基づいて、前記視野外領域におけるトップダウン注意に基づく顕著性を表す前記顕著性マップを生成してもよい。 The generation unit may generate the saliency map representing saliency based on top-down attention in the out-of-field area based on the recognition position of the object to be recognized in the out-of-field area.

前記生成部は、前記視野外領域における前記顕著性マップと、前記２次元映像データの顕著性を表す顕著性マップとを生成してもよい。 The generation unit may generate the saliency map for the out-of-field area and a saliency map representing the saliency of the two-dimensional video data.
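As a rough sketch of the out-of-field saliency map described above, the following builds a coarse yaw/pitch grid in which bottom-up saliency is zero everywhere outside the field of view and top-down saliency is placed around each recognition position. The grid resolution, the Gaussian falloff with width `sigma`, and the yaw-only test for "inside the field of view" are simplifying assumptions for illustration; the saliency map for the two-dimensional video data itself is not computed here.

```python
import math
from typing import List, Sequence, Tuple


def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def out_of_field_saliency(
    recognition_dirs: Sequence[Tuple[float, float]],  # (yaw, pitch) of each recognition position, degrees
    view_yaw: float,                                  # yaw of the current line of sight, degrees
    view_fov: float,                                  # horizontal field of view, degrees
    width: int = 36,
    height: int = 18,
    sigma: float = 20.0,
) -> List[List[float]]:
    grid = [[0.0] * width for _ in range(height)]
    for row in range(height):
        pitch = 90.0 - (row + 0.5) * 180.0 / height
        for col in range(width):
            yaw = -180.0 + (col + 0.5) * 360.0 / width
            if abs(wrap_deg(yaw - view_yaw)) <= view_fov / 2.0:
                # Inside the field of view: left at zero in this map, since it is
                # covered by the saliency map of the rendered 2D video.
                continue
            # Outside the field of view: no bottom-up term, only top-down
            # attention drawn by objects the user remembers.
            value = 0.0
            for obj_yaw, obj_pitch in recognition_dirs:
                d_yaw = wrap_deg(yaw - obj_yaw)
                d_pitch = pitch - obj_pitch
                value += math.exp(-(d_yaw ** 2 + d_pitch ** 2) / (2.0 * sigma ** 2))
            grid[row][col] = value
    return grid
```

With no recognition target objects the out-of-field map is all zeros, which matches the idea that a user cannot be attracted by visual features that are not being seen.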

前記情報処理装置は、さらに、前記顕著性マップに基づいて、未来の前記視野情報を予測視野情報として生成する予測部を具備してもよい。この場合、前記レンダリング部は、前記予測視野情報に基づいて、前記２次元映像データを生成してもよい。 The information processing device may further include a prediction unit that generates future field of view information as predicted field of view information based on the saliency map. In this case, the rendering unit may generate the two-dimensional video data based on the predicted field of view information.

前記視野情報は、視点の位置、視線方向、視線の回転角度、前記ユーザの頭の位置、又は前記ユーザの頭の回転角度の少なくとも１つを含んでもよい。 The field of view information may include at least one of the position of the viewpoint, the direction of gaze, the rotation angle of the gaze, the position of the user's head, or the rotation angle of the user's head.

前記視野情報は、前記ユーザの頭の回転角度を含んでもよい。この場合、前記予測部は、前記顕著性マップに基づいて、未来の前記ユーザの頭の回転角度を予測してもよい。 The field of view information may include the user's head rotation angle. In this case, the prediction unit may predict the user's future head rotation angle based on the saliency map.

前記２次元映像データは、時系列に連続する複数のフレーム画像により構成されてもよい。この場合、前記レンダリング部は、前記予測視野情報に基づいてフレーム画像を生成し、予測フレーム画像として出力してもよい。 The two-dimensional video data may be composed of a plurality of frame images that are consecutive in time series. In this case, the rendering unit may generate frame images based on the predicted field of view information and output them as predicted frame images.
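The source does not state how the prediction unit derives a future head rotation angle from the saliency map. Purely as an illustration, one simple heuristic is to turn the current yaw toward the most salient direction, limited by an assumed maximum head speed. The function name, the dictionary input, and `max_speed_deg_s` are all hypothetical.

```python
from typing import Dict


def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def predict_head_yaw(
    current_yaw: float,
    saliency_by_yaw: Dict[float, float],  # yaw in degrees -> saliency at that direction
    horizon_s: float,                     # how far ahead to predict, in seconds
    max_speed_deg_s: float = 120.0,       # assumed upper bound on head rotation speed
) -> float:
    if not saliency_by_yaw or max(saliency_by_yaw.values()) <= 0.0:
        # Nothing attracts attention: predict that the head stays where it is.
        return current_yaw
    target = max(saliency_by_yaw, key=saliency_by_yaw.get)
    delta = wrap_deg(target - current_yaw)     # shortest signed turn toward the peak
    step = max_speed_deg_s * horizon_s         # largest turn possible within the horizon
    delta = max(-step, min(step, delta))
    return wrap_deg(current_yaw + delta)
```

A rendering unit could then generate a predicted frame image for the returned yaw instead of the last measured one, which is the role the predicted field of view information plays in the text above.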

本技術の一形態に係る情報処理方法は、コンピュータシステムが実行する情報処理方法であって、ユーザの視野に関する視野情報に基づいて、仮想空間を構成する３次元空間データに対してレンダリング処理を実行することにより、前記ユーザの視野に応じた２次元映像データを生成することを含む。
前記仮想空間の前記ユーザの視野に含まれない視野外領域における、前記ユーザが認識している認識対象オブジェクトの前記ユーザが認識している認識位置が推定される。
推定された前記視野外領域における前記認識対象オブジェクトの前記認識位置に基づいて、前記視野外領域における顕著性を表す顕著性マップが生成される。 An information processing method according to one embodiment of the present technology is an information processing method executed by a computer system, and includes generating two-dimensional video data corresponding to a user's field of view by performing a rendering process on three-dimensional spatial data that constitutes a virtual space based on field of view information related to the user's field of view.
A recognition position of a recognition target object recognized by the user in an out-of-field area of the virtual space that is not included in the user's field of view is estimated.
A saliency map representing saliency in the out-of-field area is generated based on the estimated recognition position of the object to be recognized in the out-of-field area.

サーバサイドレンダリングシステムの基本的な構成例を示す模式図である。 FIG. 1 is a schematic diagram illustrating an example of the basic configuration of a server-side rendering system.
ユーザが視聴可能な仮想映像の一例を説明するための模式図である。 FIG. 2 is a schematic diagram for explaining an example of a virtual video that can be viewed by a user.
レンダリング処理を説明するための模式図である。 FIG. 3 is a schematic diagram for explaining a rendering process.
サーバサイドレンダリングシステムの構成例を示す模式図である。 FIG. 4 is a schematic diagram illustrating an example of the configuration of a server-side rendering system.
認識対象オブジェクト及び認識位置を説明するための模式図である。 FIG. 5 is a schematic diagram for explaining a recognition target object and a recognition position.
レンダリング映像の生成の一例を示すフローチャートである。 FIG. 6 is a flowchart illustrating an example of generating a rendered video.
図６に示すフローチャートを説明するための図であり、各情報の取得及び生成のタイミングを示す模式図である。 FIG. 7 is a diagram for explaining the flowchart shown in FIG. 6, and is a schematic diagram showing the timing of acquiring and generating each piece of information.
ボトムアップ注意に基づく顕著性マップの生成例を示す模式図である。 FIG. 8 is a schematic diagram illustrating an example of generating a saliency map based on bottom-up attention.
比較例の全天周顕著性マップの問題を説明するための模式図である。 FIG. 9 is a schematic diagram for explaining a problem with the panoramic saliency map of a comparative example.
実施例２にてシーン記述情報として用いられるシーン記述ファイルで記述される情報の一例を示す模式図である。 FIG. 10 is a schematic diagram showing an example of information described in a scene description file used as scene description information in Example 2.
実施例３にてシーン記述情報として用いられるシーン記述ファイルで記述される情報の一例を示す模式図である。 FIG. 11 is a schematic diagram showing an example of information described in a scene description file used as scene description information in Example 3.
実施例３にてシーン記述情報として用いられるシーン記述ファイルで記述される情報の一例を示す模式図である。 FIG. 12 is a schematic diagram showing an example of information described in a scene description file used as scene description information in Example 3.
実施例３における認識対象オブジェクトの認識位置の推定について説明するための模式図である。 FIG. 13 is a schematic diagram for explaining estimation of a recognition position of a recognition target object in Example 3.
実施例３における認識対象オブジェクトの認識位置の推定について説明するための模式図である。 FIG. 14 is a schematic diagram for explaining estimation of a recognition position of a recognition target object in Example 3.
認識対象オブジェクトの認識位置の推定例を示すフローチャートである。 FIG. 15 is a flowchart illustrating an example of estimating the recognition position of a recognition target object.
全天周顕著性マップの生成例を示すフローチャートである。 FIG. 16 is a flowchart illustrating an example of generating a panoramic saliency map.
全天周顕著性マップの一例を示す模式的な図である。 FIG. 17 is a schematic diagram illustrating an example of a panoramic saliency map.
サーバ装置及びクライアント装置を実現可能なコンピュータ（情報処理装置）のハードウェア構成例を示すブロック図である。 FIG. 18 is a block diagram illustrating an example of the hardware configuration of a computer (information processing device) that can realize the server device and the client device.

以下、本技術に係る実施形態を、図面を参照しながら説明する。 Below, an embodiment of the present technology is described with reference to the drawings.

［サーバサイドレンダリングシステム］
本技術に係る一実施形態として、サーバサイドレンダリングシステムを構成する。まず図１～図３を参照して、サーバサイドレンダリングシステムの基本的な構成例及び基本的な動作例について説明する。
図１は、サーバサイドレンダリングシステムの基本的な構成例を示す模式図である。
図２は、ユーザが視聴可能な仮想映像の一例を説明するための模式図である。
図３は、レンダリング処理を説明するための模式図である。
なお、サーバサイドレンダリングシステムを、サーバレンダリング型のメディア配信システムと呼ぶことも可能である。 [Server-side rendering system]
As an embodiment of the present technology, a server-side rendering system is configured. First, an example of the basic configuration and basic operation of the server-side rendering system will be described with reference to FIGS. 1 to 3.
FIG. 1 is a schematic diagram showing an example of the basic configuration of a server-side rendering system.
FIG. 2 is a schematic diagram for explaining an example of a virtual video that can be viewed by a user.
FIG. 3 is a schematic diagram for explaining the rendering process.
The server-side rendering system can also be called a server-rendering type media distribution system.

図１に示すように、サーバサイドレンダリングシステム１は、ＨＭＤ（Head Mounted Display）２と、クライアント装置３と、サーバ装置４とを含む。
ＨＭＤ２は、ユーザ５に仮想映像を表示するために用いられるデバイスである。ＨＭＤ２は、ユーザ５の頭部に装着されて使用される。
例えば、仮想映像としてＶＲ映像が配信される場合には、ユーザ５の視野を覆うように構成された没入型のＨＭＤ２が用いられる。
仮想映像として、ＡＲ（Augmented Reality：拡張現実）映像が配信される場合には、ＡＲグラス等が、ＨＭＤ２として用いられる。
ユーザ５に仮想映像を提供するためのデバイスとして、ＨＭＤ２以外のデバイスが用いられてもよい。例えば、テレビ、スマートフォン、タブレット端末、及びＰＣ（Personal Computer）等に備えられたディスプレイにより、仮想映像が表示されてもよい。 As shown in FIG. 1, the server-side rendering system 1 includes an HMD (Head Mounted Display) 2, a client device 3, and a server device 4.
The HMD 2 is a device used to display a virtual image to the user 5. The HMD 2 is worn on the head of the user 5 when in use.
For example, when VR video is distributed as virtual video, an immersive HMD 2 configured to cover the field of view of the user 5 is used.
When an AR (Augmented Reality) image is distributed as the virtual image, AR glasses or the like are used as the HMD 2.
A device other than the HMD 2 may be used as a device for providing a virtual image to the user 5. For example, the virtual image may be displayed on a display provided on a television, a smartphone, a tablet terminal, a PC (Personal Computer), or the like.

図２に示すように、本実施形態では、没入型のＨＭＤ２を装着したユーザ５に対して、６ＤｏＦ映像がＶＲ映像として提供される。
ユーザ５は、３次元空間からなる仮想空間Ｓ内において、前後、左右、及び上下の全周囲３６０°の範囲で映像を視聴することが可能となる。例えばユーザ５は、仮想空間Ｓ内にて、視点の位置や視線方向等を自由に動かし、自分の視野（視野範囲）７を自由に変更させる。このユーザ５の視野７の変更に応じて、ユーザ５に表示される映像８が切替えられる。ユーザ５は、顔の向きを変える、顔を傾ける、振り返るといった動作をすることで、現実世界と同じような感覚で、仮想空間Ｓ内にて周囲を視聴することが可能となる。
このように、本実施形態に係るサーバサイドレンダリングシステム１では、フォトリアルな自由視点映像を配信することが可能となり、自由な視点位置での視聴体験を提供することが可能となる。 As shown in FIG. 2, in this embodiment, 6DoF images are provided as VR images to a user 5 wearing an immersive HMD 2.
Within the virtual space S, which is a three-dimensional space, the user 5 can view images over the full 360° surroundings: front and back, left and right, and up and down. For example, within the virtual space S, the user 5 freely moves the position of the viewpoint, the direction of the line of sight, and so on, and thereby freely changes his or her field of view (viewing range) 7. In response to this change in the field of view 7 of the user 5, the image 8 displayed to the user 5 is switched. By turning the face, tilting the face, or looking back, the user 5 can view the surroundings within the virtual space S with a sensation similar to that of the real world.
In this way, the server-side rendering system 1 according to this embodiment makes it possible to deliver photorealistic free viewpoint video, and to provide a viewing experience from any viewpoint position.

本実施形態では、ＨＭＤ２により、視野情報が取得される。
視野情報は、ユーザ５の視野７に関する情報である。具体的には、視野情報は、仮想空間Ｓ内におけるユーザ５の視野７を特定することが可能な任意の情報を含む。
例えば、視野情報として、視点の位置、視線方向、視線の回転角度等が挙げられる。また視野情報として、ユーザ５の頭の位置、ユーザ５の頭の回転角度等が挙げられる。
視線の回転角度は、例えば、視線方向に延在する軸を回転軸とする回転角度により規定することが可能である。またユーザ５の頭の回転角度は、頭に対して設定される互いに直交する３つの軸をロール軸、ピッチ軸、ヨー軸とした場合の、ロール角度、ピッチ角度、ヨー角度により規定することが可能である。
例えば、顔の正面方向に延在する軸をロール軸とする。ユーザ５の顔を正面から見た場合に左右方向に延在する軸をピッチ軸とし、上下方向に延在する軸をヨー軸とする。これらロール軸、ピッチ軸、ヨー軸に対する、ロール角度、ピッチ角度、ヨー角度が、頭の回転角度として算出される。なお、ロール軸の方向を、視線方向として用いることも可能である。
その他、ユーザ５の視野を特定可能な任意の情報が用いられてよい。視野情報として、上記で例示した情報が１つ用いられてもよいし、複数の情報が組み合わされて用いられてもよい。 In this embodiment, the HMD 2 acquires visual field information.
The visual field information is information relating to the visual field 7 of the user 5. Specifically, the visual field information includes any information that can identify the visual field 7 of the user 5 within the virtual space S.
For example, the visual field information may include the position of the viewpoint, the direction of the line of sight, the rotation angle of the line of sight, etc. Furthermore, the visual field information may include the position of the head of the user 5, the rotation angle of the head of the user 5, etc.
The rotation angle of the line of sight can be defined by, for example, a rotation angle about an axis extending in the line of sight direction, and the rotation angle of the head of the user 5 can be defined by a roll angle, a pitch angle, and a yaw angle when three mutually perpendicular axes set on the head are defined as a roll axis, a pitch axis, and a yaw axis.
For example, the axis extending in the frontal direction of the face is defined as the roll axis. When the face of the user 5 is viewed from the front, the axis extending in the left-right direction is defined as the pitch axis, and the axis extending in the up-down direction is defined as the yaw axis. The roll angle, pitch angle, and yaw angle about these roll, pitch, and yaw axes are calculated as the rotation angle of the head. Note that the direction of the roll axis can also be used as the line of sight direction.
Any other information may be used that can identify the field of view of the user 5. As the field of view information, one of the above-exemplified pieces of information may be used, or a combination of a plurality of pieces of information may be used.
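To make the relationship between head rotation angles and the line of sight direction concrete, the sketch below converts yaw and pitch into a unit vector along the roll axis. The world frame (X to the right, Y up, Z forward at zero rotation) and the sign conventions are assumptions chosen for illustration; the source leaves the coordinate conventions open.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ViewInfo:
    """Hypothetical container for field of view information sent by the HMD."""
    head_position: Vec3
    yaw_deg: float    # rotation about the up-down (yaw) axis
    pitch_deg: float  # rotation about the left-right (pitch) axis
    roll_deg: float   # rotation about the frontal (roll) axis

    def gaze_direction(self) -> Vec3:
        """Unit vector along the roll axis, usable as the line of sight direction."""
        yaw = math.radians(self.yaw_deg)
        pitch = math.radians(self.pitch_deg)
        # Roll rotates the view around the line of sight, so it changes the
        # orientation of the image but not the direction being looked at.
        return (
            math.sin(yaw) * math.cos(pitch),
            math.sin(pitch),
            math.cos(yaw) * math.cos(pitch),
        )
```

With these conventions, a view with zero yaw and pitch looks along +Z, and a yaw of 90 degrees looks along +X.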

視野情報を取得する方法は限定されない。例えば、ＨＭＤ２に備えられたセンサ装置（カメラを含む）による検出結果（センシング結果）に基づいて、視野情報を取得することが可能である。
例えば、ＨＭＤ２に、ユーザ５の周囲を検出範囲とするカメラや測距センサ、ユーザ５の左右の目を撮像可能な内向きカメラ等が設けられる。また、ＨＭＤ２に、ＩＭＵ（Inertial Measurement Unit）センサやＧＰＳが設けられる。
例えば、ＧＰＳにより取得されるＨＭＤ２の位置情報を、ユーザ５の視点位置や、ユーザ５の頭の位置として用いることが可能である。もちろん、ユーザ５の左右の目の位置等がさらに詳しく算出されてもよい。
また、ユーザ５の左右の目の撮像画像から、視線方向を検出することも可能である。
また、ＩＭＵの検出結果から、視線の回転角度や、ユーザ５の頭の回転角度を検出することも可能である。 The method for acquiring the visual field information is not limited. For example, the visual field information can be acquired based on a detection result (sensing result) by a sensor device (including a camera) provided in the HMD 2.
For example, the HMD 2 is provided with a camera or distance measurement sensor that detects the area around the user 5, an inward-facing camera that can capture images of the left and right eyes of the user 5, etc. The HMD 2 is also provided with an IMU (Inertial Measurement Unit) sensor and a GPS.
For example, the position information of the HMD 2 acquired by GPS can be used as the viewpoint position of the user 5 or the head position of the user 5. Of course, the positions of the left and right eyes of the user 5 may be calculated in more detail.
It is also possible to detect the direction of the line of sight from captured images of the left and right eyes of the user 5.
It is also possible to detect the rotation angle of the line of sight and the rotation angle of the head of the user 5 from the detection results of the IMU.

また、ＨＭＤ２に備えらえたセンサ装置による検出結果に基づいて、ユーザ５（ＨＭＤ２）の自己位置推定が実行されてもよい。例えば、自己位置推定により、ＨＭＤ２の位置情報、及びＨＭＤ２がどの方向を向いているか等の姿勢情報を算出することが可能である。当該位置情報や姿勢情報から、視野情報を取得することが可能である。
ＨＭＤ２の自己位置を推定するためのアルゴリズムも限定されず、ＳＬＡＭ（Simultaneous Localization and Mapping）等の任意のアルゴリズムが用いられてもよい。
また、ユーザ５の頭の動きを検出するヘッドトラッキングや、ユーザ５の左右の視線の動きを検出するアイトラッキングが実行されてもよい。 Furthermore, the self-position of the user 5 (HMD 2) may be estimated based on the detection results of a sensor device provided in the HMD 2. For example, the self-position estimation can calculate position information of the HMD 2 and posture information such as the direction in which the HMD 2 is facing. From the position information and posture information, it is possible to acquire field of view information.
The algorithm for estimating the self-position of the HMD 2 is not limited, and any algorithm such as SLAM (Simultaneous Localization and Mapping) may be used.
In addition, head tracking for detecting the movement of the head of the user 5 and eye tracking for detecting the movement of the lines of sight of the left and right eyes of the user 5 may be performed.

その他、視野情報を取得するために、任意のデバイスや任意のアルゴリズムが用いられてもよい。例えば、ユーザ５に対して仮想映像を表示するデバイスとして、スマートフォン等が用いられる場合等では、ユーザ５の顔（頭）等が撮像され、その撮像画像に基づいて視野情報が取得されてもよい。
あるいは、ユーザ５の頭や目の周辺に、カメラやＩＭＵ等を備えるデバイスが装着されてもよい。
視野情報を生成するために、例えばＤＮＮ（Deep Neural Network：深層ニューラルネットワーク）等を用いた任意の機械学習アルゴリズムが用いられてもよい。例えばディープラーニング（深層学習）を行うＡＩ（人工知能）等を用いることで、視野情報の生成精度を向上させることが可能となる。
なお機械学習アルゴリズムの適用は、本開示内の任意の処理に対して実行されてよい。 Any other device or algorithm may be used to acquire the visual field information. For example, when a smartphone or the like is used as a device that displays a virtual image to the user 5, an image of the face (head) of the user 5 may be captured, and the visual field information may be acquired based on the captured image.
Alternatively, a device equipped with a camera, an IMU, etc. may be worn around the head or eyes of the user 5.
To generate the visual field information, any machine learning algorithm using, for example, a deep neural network (DNN) may be used. For example, by using an AI (artificial intelligence) that performs deep learning, it is possible to improve the accuracy of generating the visual field information.
It should be noted that the application of machine learning algorithms may be performed on any process within the present disclosure.

ＨＭＤ２と、クライアント装置３とは、互いに通信可能に接続されている。両デバイスを通信可能に接続するための通信形態は限定されず、任意の通信技術が用いられてよい。例えば、ＷｉＦｉ等の無線ネットワーク通信や、Bluetooth（登録商標）等の近距離無線通信等を用いることが可能である。
ＨＭＤ２は、視野情報を、クライアント装置３に送信する。
なお、ＨＭＤ２とクライアント装置３とが一体的構成されてもよい。すなわちＨＭＤ２に、クライアント装置３の機能が搭載されてもよい。 The HMD 2 and the client device 3 are connected to each other so that they can communicate with each other. The communication method for connecting the two devices to each other so that they can communicate with each other is not limited, and any communication technology may be used. For example, wireless network communication such as Wi-Fi, short-range wireless communication such as Bluetooth (registered trademark), etc. may be used.
The HMD 2 transmits the visual field information to the client device 3.
The HMD 2 and the client device 3 may be integrated. That is, the HMD 2 may be equipped with the functions of the client device 3.

クライアント装置３、及びサーバ装置４は、例えばＣＰＵ、ＲＯＭ、ＲＡＭ、及びＨＤＤ等のコンピュータの構成に必要なハードウェアを有する（図１８参照）。ＣＰＵがＲＯＭ等に予め記録されている本技術に係るプログラムをＲＡＭにロードして実行することにより、本技術に係る情報処理方法が実行される。
例えばＰＣ（Personal Computer）等の任意のコンピュータにより、クライアント装置３、及びサーバ装置４を実現することが可能である。もちろんＦＰＧＡ、ＡＳＩＣ等のハードウェアが用いられてもよい。
もちろん、クライアント装置３とサーバ装置４とが互いに同じ構成を有する場合に限定される訳ではない。 The client device 3 and the server device 4 each have hardware necessary for configuring a computer, such as a CPU, a ROM, a RAM, and an HDD (see FIG. 18). The CPU loads a program according to the present technology, which is pre-recorded in the ROM or the like, into the RAM and executes the program, thereby executing the information processing method according to the present technology.
For example, the client device 3 and the server device 4 can be realized by any computer such as a PC (Personal Computer). Of course, hardware such as an FPGA or an ASIC may also be used.
Of course, the client device 3 and the server device 4 are not limited to having the same configuration.

クライアント装置３とサーバ装置４とは、ネットワーク９を介して、通信可能に接続されている。
ネットワーク９は、例えばインターネットや広域通信回線網等により構築される。その他、任意のＷＡＮ（Wide Area Network）やＬＡＮ（Local Area Network）等が用いられてよく、ネットワーク９を構築するためのプロトコルは限定されない。 The client device 3 and the server device 4 are connected to each other via a network 9 so as to be able to communicate with each other.
The network 9 is constructed using, for example, the Internet or a wide area communication network. Alternatively, any wide area network (WAN) or local area network (LAN) may be used, and the protocol for constructing the network 9 is not limited.

クライアント装置３は、ＨＭＤ２から送信された視野情報を受信する。またクライアント装置３は、視野情報を、ネットワーク９を介して、サーバ装置４に送信する。 The client device 3 receives the visual field information transmitted from the HMD 2. The client device 3 also transmits the visual field information to the server device 4 via the network 9.

サーバ装置４は、クライアント装置３から送信された視野情報を受信する。またサーバ装置４は、視野情報に基づいて、仮想空間Ｓを構成する３次元空間データに対してレンダリング処理を実行することにより、ユーザ５の視野７に応じた２次元映像データ（レンダリング映像）を生成する。
サーバ装置４は、本技術に係る情報処理装置の一実施形態に相当する。サーバ装置４により、本技術に係る情報処理方法の一実施形態が実行される。 The server device 4 receives the visual field information transmitted from the client device 3. Based on the visual field information, the server device 4 performs a rendering process on the three-dimensional space data that constitutes the virtual space S, thereby generating two-dimensional video data (rendered video) corresponding to the visual field 7 of the user 5.
The server device 4 corresponds to an embodiment of an information processing device according to the present technology. The server device 4 executes an embodiment of an information processing method according to the present technology.

図３に示すように、３次元空間データは、シーン記述情報と、３次元オブジェクトデータとを含む。
シーン記述情報は、仮想空間Ｓ（３次元空間）の構成を定義する３次元空間記述データに相当する。シーン記述情報は、６ＤｏＦコンテンツの各シーンを再現するための種々のメタデータを含む。
３次元オブジェクトデータは、仮想空間Ｓ（３次元空間）における３次元オブジェクトを定義するデータである。すなわち６ＤｏＦコンテンツの各シーンを構成する各オブジェクトのデータとなる。
例えば、人物や動物等の３次元オブジェクトのデータや、建物や木等の３次元オブジェクトのデータが格納される。あるいは、背景等を構成する空や海等の３次元オブジェクトのデータが格納される。複数の種類の物体がまとめて１つの３次元オブジェクトとして構成され、そのデータが格納されてもよい。
３次元オブジェクトデータは、例えば、多面体の形状データとして表すことのできるメッシュデータとその面に張り付けるデータであるテクスチャデータとにより構成される。あるいは、複数の点の集合（点群）で構成される（Point Cloud）。 As shown in FIG. 3, the three-dimensional space data includes scene description information and three-dimensional object data.
The scene description information corresponds to three-dimensional space description data that defines the configuration of the virtual space S (three-dimensional space). The scene description information includes various metadata for reproducing each scene of the 6DoF content.
The three-dimensional object data is data that defines a three-dimensional object in the virtual space S (three-dimensional space), that is, data of each object that constitutes each scene of the 6DoF content.
For example, data of three-dimensional objects such as people and animals, and data of three-dimensional objects such as buildings and trees, is stored. Alternatively, data of three-dimensional objects such as the sky and sea that make up the background is stored. Multiple types of objects may be collectively configured as a single three-dimensional object, and the data for that object may be stored.
The three-dimensional object data may be composed of mesh data that can be expressed as shape data of a polyhedron and texture data that is data to be applied to the surfaces of the mesh data, or may be composed of a set of multiple points (point cloud).
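The split between scene description information and three-dimensional object data can be pictured with the following data-structure sketch. All class and field names are hypothetical, and a real scene description file would carry far more metadata than an object reference and a position.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Vec3 = Tuple[float, float, float]


@dataclass
class MeshData:
    """Polyhedral shape plus the texture applied to its faces."""
    vertices: List[Vec3]
    faces: List[Tuple[int, int, int]]  # vertex indices of each triangle
    texture_uri: str


@dataclass
class PointCloudData:
    """Object represented as a set of colored points."""
    points: List[Vec3]
    colors: List[Tuple[int, int, int]]


@dataclass
class SceneNode:
    """One scene description entry: which object goes where."""
    object_id: str
    position: Vec3  # world coordinates in the virtual space


@dataclass
class SpaceData3D:
    scene_description: List[SceneNode]                      # 3D space description data
    objects: Dict[str, Union[MeshData, PointCloudData]]     # 3D object data

    def placed_objects(self) -> List[Tuple[Vec3, Union[MeshData, PointCloudData]]]:
        """Resolve each scene node to its object data, paired with its position."""
        return [(node.position, self.objects[node.object_id]) for node in self.scene_description]
```

Keeping placement in the scene description and geometry in the object data lets the same object data be reused at several positions or across scenes.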

図３に示すように、サーバ装置４は、シーン記述情報に基づいて、３次元空間に３次元オブジェクトを配置することにより、各シーンを構成する仮想空間Ｓを再現する。
図３に示すように、仮想空間ＳにはＸＹＺ座標系が設定されおり、座標値により規定される位置に、３次元オブジェクトが配置される。座標値は、仮想空間Ｓ上における位置情報に相当し、ワールド座標ともいえる。仮想空間ＳにＸＹＺ座標系を設定する方法は限定されず、任意の設定方法が採用されてよい。
再現された仮想空間Ｓを基準として、ユーザ５から見た映像を切り出すことにより（レンダリング処理）、ユーザ５が視聴する２次元映像であるレンダリング映像を生成する。
サーバ装置４は、生成したレンダリング映像をエンコードし、ネットワーク９を介してクライアント装置３に送信する。
なお、ユーザの視野７に応じたレンダリング映像は、ユーザの視野７に応じたビューポート（表示領域）の映像ともいえる。 As shown in FIG. 3, the server device 4 reproduces a virtual space S that constitutes each scene by arranging three-dimensional objects in the three-dimensional space based on the scene description information.
As shown in FIG. 3, an XYZ coordinate system is set in the virtual space S, and a three-dimensional object is placed at a position defined by coordinate values. The coordinate values correspond to position information in the virtual space S and can also be called world coordinates. There are no limitations on the method for setting the XYZ coordinate system in the virtual space S, and any setting method may be adopted.
Using the reproduced virtual space S as a reference, the image as seen by the user 5 is extracted (rendering process), and a rendered image, which is a two-dimensional image viewed by the user 5, is generated.
The server device 4 encodes the generated rendered image and transmits it to the client device 3 via the network 9.
The rendered image according to the user's field of view 7 can also be said to be an image of a viewport (display area) according to the user's field of view 7.

クライアント装置３は、サーバ装置４から送信された、エンコードされたレンダリング映像をデコードする。また、クライアント装置３は、デコードしたレンダリング映像を、ＨＭＤ２に送信する。
図２に示すように、ＨＭＤ２により、レンダリング映像が再生され、ユーザ５に対して表示される。以下、ＨＭＤ２によりユーザ５に対して表示される映像８を、レンダリング映像８と記載する場合がある。 The client device 3 decodes the encoded rendering video transmitted from the server device 4. The client device 3 also transmits the decoded rendering video to the HMD 2.
As shown in FIG. 2, the rendered image is reproduced by the HMD 2 and displayed to the user 5. Hereinafter, the image 8 displayed to the user 5 by the HMD 2 may be referred to as the rendered image 8.

［サーバサイドレンダリングシステムの利点］
図２に例示するような６ＤｏＦ映像の他の配信システムとして、クライアントサイドレンダリングシステムが挙げられる。
クライアントサイドレンダリングシステムでは、クライアント装置３により、視野情報に基づいて３次元空間データに対してレンダリング処理が実行され、２次元映像データ（レンダリング映像８）が生成される。クライアントサイドレンダリングシステムを、クライアントレンダリング型のメディア配信システムと呼ぶことも可能である。
クライアントサイドレンダリングシステムでは、サーバ装置４からクライアント装置３に、３次元空間データ（３次元空間記述データ及び３次元オブジェクトデータ）を配信する必要がある。
３次元オブジェクトデータは、メッシュデータにより構成されたり、点群データ（Point Cloud）により構成される。従ってサーバ装置４からクライアント装置３への配信データ量は、膨大になってしまう。また、レンダリング処理を実行するために、クライアント装置３には、かなり高い処理能力が求められる。 [Advantages of a server-side rendering system]
Another system for delivering 6DoF video such as that illustrated in FIG. 2 is a client-side rendering system.
In the client-side rendering system, the client device 3 performs rendering processing on the three-dimensional spatial data based on the field of view information to generate two-dimensional video data (rendered video 8). The client-side rendering system can also be called a client-rendering type media distribution system.
In a client-side rendering system, it is necessary to distribute three-dimensional space data (three-dimensional space description data and three-dimensional object data) from the server device 4 to the client device 3.
The three-dimensional object data is composed of mesh data or point cloud data. Therefore, the amount of data to be distributed from the server device 4 to the client device 3 becomes enormous. Furthermore, the client device 3 is required to have a fairly high processing capacity in order to execute the rendering process.

これに対して、本実施形態に係るサーバサイドレンダリングシステム１では、レンダリング後のレンダリング映像８がクライアント装置３に配信される。これにより、配信データ量を十分に抑えることが可能となる。すなわち少ない配信データ量にて、ユーザ５に対して、膨大な３次元オブジェクトデータから構成される大空間の６ＤｏＦ映像を、体験させることが可能となる。
また、クライアント装置３側の処理負荷を、サーバ装置４側にオフロードすることが可能となり、処理能力が低いクライアント装置３が用いられる場合でも、ユーザ５に対して６ＤｏＦ映像を体験させることが可能となる。 In contrast, in the server-side rendering system 1 according to this embodiment, the rendered image 8 after rendering is delivered to the client device 3. This makes it possible to sufficiently reduce the amount of data delivered. In other words, with a small amount of data delivered, it becomes possible to allow the user 5 to experience a 6DoF image of a large space composed of a huge amount of three-dimensional object data.
In addition, it becomes possible to offload the processing load on the client device 3 to the server device 4, making it possible for the user 5 to experience 6DoF video even when a client device 3 with low processing capabilities is used.

［応答遅延の問題］
サーバサイドレンダリングシステム１では、ユーザ５の視野情報やレンダリング後のレンダリング映像８が、ネットワーク９を介して送受信される。従って、視点の移動等に応じたレンダリング映像８の表示に関して、応答遅延が発生する可能性がある。
例えば、ユーザ５が、頭を動かすといった動作により、視野７を変更させる。ＨＭＤ２により視野情報が取得され、クライアント装置３に送信される。クライアント装置３は、受信した視野情報を、ネットワーク９を介して、サーバ装置４に送信する。
サーバ装置４は、受信したユーザ５の視野情報に基づいて、３次元空間データに対してレンダリング処理を実行し、レンダリング映像８を生成する。生成されたレンダリング映像８はエンコードされて、ネットワーク９を介してクライアント装置３に送信される。
クライアント装置３は、受信したレンダリング映像８をデコードし、ＨＭＤ２に送信する。ＨＭＤ２は、受信したレンダリング映像８を、ユーザ５に対して表示する。
このような処理フローを、ユーザ５の視野の変更に応じてリアルタイムで実行するように、サーバサイドレンダリングシステム１が構築される。この場合、ユーザ５が視野を変更させてから、それがＨＭＤ２の映像として反映されるまでの遅延が、応答遅延として発生してしまう可能性がある。
なお、この応答遅延を、（Motion-to-Photon Latency：T_m2p）と表現することも可能である。この応答遅延の遅延時間は、人間の知覚限界とされる２０ｍｓｅｃ以下に収めることが望ましいとされている。 [Response delay issue]
In the server-side rendering system 1, the field of view information of the user 5 and the rendered image 8 are transmitted and received via a network 9. Therefore, there is a possibility that a response delay may occur in the display of the rendered image 8 in response to a movement of the viewpoint, etc.
For example, the user 5 changes the field of view 7 by moving his/her head. Field of view information is acquired by the HMD 2 and transmitted to the client device 3. The client device 3 transmits the received field of view information to the server device 4 via the network 9.
The server device 4 performs rendering processing on the three-dimensional spatial data based on the received visual field information of the user 5, and generates a rendered image 8. The generated rendered image 8 is encoded and transmitted to the client device 3 via the network 9.
The client device 3 decodes the received rendered image 8 and transmits it to the HMD 2. The HMD 2 displays the received rendered image 8 to the user 5.
The server-side rendering system 1 is configured to execute such a processing flow in real time in response to changes in the field of view of the user 5. In this case, a delay may occur between when the user 5 changes the field of view and when that change is reflected in the image on the HMD 2; this delay is the response delay.
This response delay can also be expressed as motion-to-photon latency (T_m2p). It is considered desirable to keep this delay time at 20 msec or less, which is regarded as the limit of human perception.
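The reason the 20 msec target is demanding can be illustrated by summing the stages of the processing flow described above into a single motion-to-photon budget. The following is a minimal sketch for accounting purposes only; the stage names and every millisecond value are hypothetical and are not taken from this description.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Per-stage delays in milliseconds (all values are illustrative assumptions)."""
    sensing_ms: float    # HMD 2 acquires the field of view information
    uplink_ms: float     # client device 3 -> server device 4 over network 9
    render_ms: float     # rendering on the server device 4
    encode_ms: float     # encoding on the server device 4
    downlink_ms: float   # server device 4 -> client device 3 over network 9
    decode_ms: float     # decoding on the client device 3
    display_ms: float    # display on the HMD 2

    def motion_to_photon_ms(self) -> float:
        # T_m2p is modeled here as the plain sum of all stages.
        return sum(vars(self).values())


def exceeds_target(budget: LatencyBudget, target_ms: float = 20.0) -> bool:
    """Return True when the summed delay is above the perceptual target."""
    return budget.motion_to_photon_ms() > target_ms


# Hypothetical numbers: even modest network delays push the total past 20 msec.
example = LatencyBudget(1.0, 10.0, 8.0, 4.0, 10.0, 3.0, 5.0)
print(example.motion_to_photon_ms(), exceeds_target(example))
```

With these assumed values the round trip alone accounts for half of the total, which is why the description turns to predicting the future field of view rather than only shortening individual stages.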

本技術は、上記の応答遅延の問題を解決するために非常に有効な技術となる。以下、本技術が適用されたサーバサイドレンダリングシステム１の実施形態について詳しく説明する。
以下の実施形態では、ユーザ５の視野情報として、Head Motion情報が用いられる場合を例に挙げる。
Head Motion情報は、ユーザ５の頭の位置移動を表現するPosition情報（X、Y、Z）と、ユーザ５の頭の回転移動の動きを表現するOrientation情報（yaw、pitch、roll）とを含む。
Position情報（X、Y、Z）は、仮想空間Ｓ上における位置情報に相当し、仮想空間Ｓに設定されたＸＹＺ座標系の座標値（ワールド座標）により規定される。
Orientation情報（yaw、pitch、roll）は、ユーザ５の頭に設定された互いに直交するロール軸、ピッチ軸、ヨー軸に関するロール角度、ピッチ角度、ヨー角度により規定される。
もちろん、本技術の適用が、ユーザ５の視野情報としてHead Motion情報（X、Y、Z、yaw、pitch、roll）が用いられる場合に限定される訳ではない。視野情報として、他の情報が用いられる場合でも、本技術は適用可能である。 This technology is extremely effective in solving the above-mentioned problem of response delay. An embodiment of a server-side rendering system 1 to which this technology is applied will be described in detail below.
In the following embodiment, a case where head motion information is used as visual field information of the user 5 will be exemplified.
The Head Motion information includes Position information (X, Y, Z) that represents the positional movement of the head of the user 5, and Orientation information (yaw, pitch, roll) that represents the rotational movement of the head of the user 5.
The position information (X, Y, Z) corresponds to position information in the virtual space S, and is defined by coordinate values (world coordinates) of an XYZ coordinate system set in the virtual space S.
The orientation information (yaw, pitch, roll) is defined by the roll angle, pitch angle, and yaw angle relative to a roll axis, pitch axis, and yaw axis that are set on the head of the user 5 and are perpendicular to each other.
Of course, application of the present technology is not limited to cases where head motion information (X, Y, Z, yaw, pitch, roll) is used as visual field information of the user 5. The present technology can also be applied to cases where other information is used as visual field information.
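The Head Motion information described above can be pictured as a small record with six values plus an acquisition time. The following is a minimal sketch; the class name, the field names, the timestamp field, and the choice of degrees and seconds as units are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HeadMotion:
    """One Head Motion sample (field names and units are assumptions)."""
    timestamp: float  # acquisition time ("current time"), in seconds
    x: float          # Position in world coordinates of the virtual space S
    y: float
    z: float
    yaw: float        # Orientation about axes fixed to the user's head, in degrees
    pitch: float
    roll: float

    @property
    def position(self) -> Tuple[float, float, float]:
        return (self.x, self.y, self.z)

    @property
    def orientation(self) -> Tuple[float, float, float]:
        return (self.yaw, self.pitch, self.roll)


sample = HeadMotion(timestamp=0.0, x=1.0, y=2.0, z=3.0, yaw=10.0, pitch=20.0, roll=30.0)
print(sample.position, sample.orientation)
```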

また、以下の実施形態では、サーバサイドレンダリングシステム１により、ユーザ５の視野情報がリアルタイムで取得され、ユーザ５に対してレンダリング映像が表示される。
サーバサイドレンダリングシステム１により、ユーザ５の視野情報が取得される時刻を、「現在時刻」として説明を行う。すなわち、ＨＭＤ２によりユーザ５の視野情報が取得される時刻を「現在時刻」として説明を行う。
上記したように、「現在時刻」に取得された視野情報がサーバ装置４まで送信され、レンダリング映像８が生成されて、ＨＭＤ２により表示されるまでに、応答遅延（T_m2p時間分）が発生する可能性がある。
本技術を適用することで、「現在時刻」からの応答遅延の問題を十分に抑制することが可能となり、高品質な仮想映像の配信が実現される。 In the following embodiment, the server-side rendering system 1 acquires the field of view information of the user 5 in real time, and displays the rendered image to the user 5 .
In the following description, the time when the visual field information of the user 5 is acquired by the server-side rendering system 1 is referred to as the "current time." In other words, the time when the visual field information of the user 5 is acquired by the HMD 2 is referred to as the "current time."
As described above, a response delay (of duration T_m2p) may occur from when the visual field information acquired at the "current time" is transmitted to the server device 4, through the generation of the rendered image 8, until that image is displayed by the HMD 2.
By applying this technology, it is possible to sufficiently mitigate the problem of response delays from the "current time," enabling the delivery of high-quality virtual video.

図４は、本技術の一実施形態に係るサーバサイドレンダリングシステム１の構成例を示す模式図である。
図４に示すサーバサイドレンダリングシステム１は、ＨＭＤ２と、クライアント装置３と、サーバ装置４とを含む。
ＨＭＤ２は、ユーザ５の視野情報（Head Motion情報）をリアルタイムで取得することが可能である。上記したように、ＨＭＤ２によりHead Motion情報が取得される時刻が、現在時刻となる。
ＨＭＤ２は、所定のフレームレートで、Head Motion情報を取得し、クライアント装置３に送信する。従って、クライアント装置３には、所定のフレームレートで、「現在時刻のHead Motion情報」が、繰り返し送信されることになる。
同様に、クライアント装置３からサーバ装置４にも、所定のフレームレートで「現在時刻のHead Motion情報」が、繰り返し送信される。 FIG. 4 is a schematic diagram showing an example of the configuration of a server-side rendering system 1 according to an embodiment of the present technology.
The server-side rendering system 1 shown in FIG. 4 includes an HMD 2, a client device 3, and a server device 4.
The HMD 2 is capable of acquiring in real time visual field information (head motion information) of the user 5. As described above, the time when the head motion information is acquired by the HMD 2 is the current time.
The HMD 2 acquires head motion information at a predetermined frame rate and transmits it to the client device 3. Therefore, "head motion information at the current time" is repeatedly transmitted to the client device 3 at the predetermined frame rate.
Similarly, the client device 3 also repeatedly transmits "head motion information at the current time" to the server device 4 at a predetermined frame rate.

Head Motion情報取得のフレームレート（Head Motion情報の取得回数／秒）は、例えば、レンダリング映像８のフレームレートに同期するように設定される。
例えば、レンダリング映像８は、時系列に連続する複数のフレーム画像により構成される。各フレーム画像は、所定のフレームレートで生成される。このレンダリング映像８のフレームレートと同期するように、Head Motion情報取得のフレームレートが設定される。もちろんこれに限定される訳ではない。
また上記したように、ユーザ５に対して、仮想映像を表示するデバイスとして、ＡＲグラスやディスプレイが用いられてもよい。 The frame rate for acquiring head motion information (number of times head motion information is acquired per second) is set to be synchronized with the frame rate of the rendered video 8, for example.
For example, the rendered video 8 is composed of a plurality of frame images that are successive in time series. Each frame image is generated at a predetermined frame rate. The frame rate for acquiring head motion information is set so as to be synchronized with the frame rate of the rendered video 8. Of course, this is not a limitation.
As described above, AR glasses or a display may be used as a device for displaying a virtual image to the user 5 .

サーバ装置４は、データ入力部１１と、Head Motion情報記録部１２と、予測部１３と、レンダリング部１４と、エンコード部１５と、通信部１６とを有する。またサーバ装置４は、顕著性マップ生成部１７と、顕著性マップ記録部１８と、認識位置推定部１９とを有する。
これらの機能ブロックは、例えばＣＰＵが本技術に係るプログラムを実行することで実現され、本実施形態に係る情報処理方法が実行される。なお各機能ブロックを実現するために、ＩＣ（集積回路）等の専用のハードウェアが適宜用いられてもよい。 The server device 4 includes a data input unit 11, a head motion information recording unit 12, a prediction unit 13, a rendering unit 14, an encoding unit 15, and a communication unit 16. The server device 4 also includes a saliency map generation unit 17, a saliency map recording unit 18, and a recognition position estimation unit 19.
These functional blocks are realized by, for example, a CPU executing a program according to the present technology, and the information processing method according to the present embodiment is executed. Note that dedicated hardware such as an IC (integrated circuit) may be used as appropriate to realize each functional block.

データ入力部１１は、３次元空間データ（シーン記述情報、及び３次元オブジェクトデータ）を読み出し、レンダリング部１４に出力する。
なお、３次元空間データは、例えば、サーバ装置４内の記憶部６８（図１８参照）に格納されている。あるいは、サーバ装置４と通信可能に接続されたコンテンツサーバ等により、３次元空間データが管理されてもよい。この場合、データ入力部１１は、コンテンツサーバにアクセスすることで、３次元空間データを取得する。 The data input unit 11 reads out three-dimensional space data (scene description information and three-dimensional object data) and outputs it to the rendering unit 14 .
The three-dimensional spatial data is stored, for example, in a storage unit 68 (see FIG. 18) in the server device 4. Alternatively, the three-dimensional spatial data may be managed by a content server or the like that is communicably connected to the server device 4. In this case, the data input unit 11 acquires the three-dimensional spatial data by accessing the content server.

通信部１６は、他のデバイスとの間で、ネットワーク通信や近距離無線通信等を実行するためのモジュールである。例えばＷｉＦｉ等の無線ＬＡＮモジュールや、Bluetooth（登録商標）等の通信モジュールが設けられる。
本実施形態では、通信部１６により、ネットワーク９を介したクライアント装置３との通信が実現される。 The communication unit 16 is a module for performing network communication, short-range wireless communication, etc. with other devices. For example, a wireless LAN module such as WiFi, or a communication module such as Bluetooth (registered trademark) is provided.
In this embodiment, the communication unit 16 realizes communication with the client device 3 via the network 9 .

Head Motion情報記録部１２は、通信部１６を介してクライアント装置３から受信した視野情報（Head Motion情報）を記憶部６８（図１８参照）に記録する。例えば、視野情報（Head Motion情報）を記録するためのバッファ等が構成されてもよい。
所定のフレームレートで送信される「現在時刻のHead Motion情報」が、記憶部６８に蓄積されて保持される。 The Head Motion information recording unit 12 records the visual field information (Head Motion information) received from the client device 3 via the communication unit 16 in the storage unit 68 (see FIG. 18 ). For example, a buffer or the like for recording the visual field information (Head Motion information) may be configured.
The "head motion information at the current time" transmitted at a predetermined frame rate is accumulated and held in the storage unit 68.
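The buffer mentioned above could, for example, be a fixed-capacity ring buffer that keeps only the most recent samples. The following is a minimal sketch under that assumption; the class name, the capacity, and the tuple layout of a sample are hypothetical, and the description does not specify how the storage unit 68 organizes the data.

```python
from collections import deque


class HeadMotionRecorder:
    """Keeps the most recent Head Motion samples (capacity is an assumption)."""

    def __init__(self, capacity: int = 120):
        # deque with maxlen drops the oldest sample automatically when full.
        self._samples = deque(maxlen=capacity)

    def record(self, sample) -> None:
        """Store one sample received from the client device."""
        self._samples.append(sample)

    def latest(self, count: int = 1) -> list:
        """Return up to `count` most recent samples, oldest first."""
        if count <= 0:
            return []
        return list(self._samples)[-count:]

    def __len__(self) -> int:
        return len(self._samples)


recorder = HeadMotionRecorder(capacity=3)
for frame in range(5):
    # Hypothetical sample layout: (frame, x, y, z, yaw, pitch, roll)
    recorder.record((frame, 0.0, 0.0, 0.0, float(frame), 0.0, 0.0))
print(len(recorder), recorder.latest(2))
```

A bounded buffer of this kind keeps memory use constant while still giving a prediction step access to the recent history of head motion.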

予測部１３は、全天周の顕著性マップ（全天周顕著性マップ）に基づいて、未来の視野情報を予測視野情報として生成する。本実施形態では、ユーザ５の未来のHead Motion情報が予測され、予測Head Motion情報として生成される。
予測Head Motion情報は、未来のPosition情報（X、Y、Z）と、未来のOrientation情報（yaw、pitch、roll）とを含む。すなわち本実施形態では、全天周顕著性マップに基づいて、頭の位置、及び頭の回転角度が予測される。 The prediction unit 13 generates future visual field information as predicted visual field information based on a saliency map covering the entire sky (panoramic saliency map). In this embodiment, future head motion information of the user 5 is predicted and generated as predicted head motion information.
The predicted head motion information includes future position information (X, Y, Z) and future orientation information (yaw, pitch, roll). That is, in this embodiment, the head position and head rotation angle are predicted based on the panoramic saliency map.

レンダリング部１４は、図３に例示するレンダリング処理を実行する。すなわち、ユーザ５の視野に関する視野情報に基づいて、３次元空間データに対してレンダリング処理を実行することにより、ユーザ５の視野７に応じたレンダリング映像８を生成する。
本実施形態では、レンダリング部１４は、予測部１３により生成された予測視野情報（予測Head Motion情報）に基づいて、レンダリング映像８を構成するフレーム画像が生成される。以下、予測Head Motion情報に基づいて生成されるフレーム画像を、予測フレーム画像２０と記載する。
レンダリング部１４は、例えば、仮想空間Ｓを再現する再現部、レンダラ、レンダリングパラメータを設定するパラメータ設定部等により構成される。レンダリングパラメータとしては、領域ごとの解像度を示す解像度マップ等が挙げられる。
その他、レンダリング部１４として、任意の構成が採用されてよい。 The rendering unit 14 executes the rendering process exemplified in Fig. 3. That is, by executing the rendering process on the three-dimensional space data based on the visual field information relating to the visual field of the user 5, a rendered image 8 corresponding to the visual field 7 of the user 5 is generated.
In this embodiment, the rendering unit 14 generates frame images that constitute the rendered video 8 based on the predicted field of view information (predicted head motion information) generated by the prediction unit 13. Hereinafter, the frame images generated based on the predicted head motion information will be referred to as predicted frame images 20.
The rendering unit 14 is configured by, for example, a reproduction unit that reproduces the virtual space S, a renderer, a parameter setting unit that sets rendering parameters, etc. The rendering parameters include a resolution map that indicates the resolution for each region.
Any other configuration may be adopted as the rendering unit 14 .

エンコード部１５は、レンダリング映像８（予測フレーム画像２０）に対してエンコード処理（圧縮符号化）を実行し、配信データを生成する。配信データは、通信部１６を介して、クライアント装置３に送信される。
例えば、エンコード処理は、ＱＰマップ（量子化パラメータ）に基づき、レンダリング映像８（予測フレーム画像２０）の各領域に対してリアルタイムに実行される。
より具体的には、本実施形態においては、エンコード部１５は、予測フレーム画像２０内で量子化精度（ＱＰ：Quantization Parameter）を領域ごとに切り替えることにより、予測フレーム画像２０内の着目点や重要領域の圧縮による画質劣化を抑えることができる。
このようにすることで、ユーザ５にとって重要な領域については十分な映像の品質を維持しつつ、配信データや処理の負荷を増加させることを抑えることができる。なお、ここでＱＰ値とは、非可逆圧縮効率の際の量子化の刻みを示す値であり、ＱＰ値が高いと符号化量が小さくなって、圧縮効率が高くなり、圧縮による画質劣化が進み、一方、ＱＰ値が低いと符号化量が大きくなり、圧縮効率が低くなり、圧縮による画質劣化を抑えることができる。
その他、任意の圧縮符号化技術が用いられてよい。
エンコード部１５は、例えば、エンコーダ、エンコードパラメータを設定するパラメータ設定部等により構成される。エンコードパラメータとしては、上記したＱＰマップ等が挙げられる。
例えば、レンダリング部１４のパラメータ設定部により設定された解像度マップに基づいて、ＱＰマップが生成される。その他、エンコード部１５として、任意の構成が採用されてよい。 The encoding unit 15 performs encoding (compression coding) on the rendering video 8 (predicted frame image 20) to generate distribution data. The distribution data is transmitted to the client device 3 via the communication unit 16.
For example, the encoding process is performed in real time for each region of the rendered video 8 (predicted frame image 20) based on a QP map (quantization parameter).
More specifically, in this embodiment, the encoding unit 15 can suppress image quality degradation due to compression of points of interest and important areas within the predicted frame image 20 by switching the quantization precision (QP: Quantization Parameter) for each area within the predicted frame image 20.
In this way, it is possible to suppress an increase in the amount of distribution data and in the processing load while maintaining sufficient video quality for areas that are important to the user 5. Note that the QP value here is a value that indicates the quantization step size used in lossy compression. A high QP value reduces the amount of coded data and raises compression efficiency, but increases image quality degradation due to compression. A low QP value increases the amount of coded data and lowers compression efficiency, but suppresses image quality degradation due to compression.
Any other compression encoding technology may be used.
The encoding unit 15 is configured by, for example, an encoder, a parameter setting unit that sets encoding parameters, etc. The encoding parameters include the above-mentioned QP map, etc.
For example, the QP map is generated based on the resolution map set by the parameter setting unit of the rendering unit 14. Alternatively, the encoding unit 15 may have any other configuration.
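One way to picture the generation of a QP map from a resolution map is a per-region linear mapping in which regions rendered at higher resolution receive a lower QP value. The following is a minimal sketch; the function name, the normalized resolution scale in [0, 1], the linear mapping, and the QP bounds of 22 and 40 are all assumptions, and the description does not state how the mapping is actually performed.

```python
def qp_map_from_resolution(resolution_map, qp_min=22, qp_max=40):
    """Map per-region resolution in [0, 1] to a per-region QP value.

    1.0 means full resolution (important region) and yields qp_min;
    0.0 means lowest resolution and yields qp_max. Both bounds are
    illustrative assumptions, not values from the description.
    """
    qp_map = []
    for row in resolution_map:
        qp_row = []
        for resolution in row:
            clamped = min(max(resolution, 0.0), 1.0)
            # Higher resolution -> lower QP -> less degradation from compression.
            qp_row.append(round(qp_max - clamped * (qp_max - qp_min)))
        qp_map.append(qp_row)
    return qp_map


# A 1 x 3 map: important region, intermediate region, peripheral region.
print(qp_map_from_resolution([[1.0, 0.5, 0.0]]))
```

The resulting map keeps quantization fine where the renderer already spent resolution and coarse elsewhere, which matches the stated aim of preserving quality in important areas while limiting the amount of distribution data.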

顕著性マップ生成部１７は、全天周顕著性マップを生成する。
顕著性マップ生成部１７により、ユーザ５が視聴するレンダリング映像（２次元映像データ）８の顕著性を表す視野分の顕著性マップのみならず、ユーザ５の視野７に含まれない視野外領域における顕著性マップも生成される。
視野分の顕著性マップは、人間の視覚的注意の仕組みから、レンダリング映像８の各ピクセルがどれだけ注視を集めやすいかを推定し、定量的に表した情報となる。
視野外領域における顕著性マップは、ユーザ５の視野外領域を２Ｄ画像にて表現した場合の各ピクセルの顕著性が算出されたマップとして生成することが可能である。
全天周顕著性マップの生成については、後に詳しく説明する。なお顕著性マップは、サリエンシマップ（Saliency Map）とも呼ばれる。また全天周は、全天球ともいえる。 The saliency map generator 17 generates a panoramic saliency map.
The saliency map generation unit 17 generates not only a saliency map for the field of view that represents the saliency of the rendering image (two-dimensional image data) 8 viewed by the user 5, but also a saliency map for the out-of-field area that is not included in the field of view 7 of the user 5.
The visual field saliency map is information that quantitatively represents how likely each pixel in the rendered image 8 is to attract attention, based on the mechanism of human visual attention.
The saliency map for the out-of-field area can be generated as a map in which the saliency of each pixel is calculated when the out-of-field area of the user 5 is represented as a 2D image.
The generation of the panoramic saliency map will be described in detail later. The term "saliency map" is used here in its usual sense. The term "panoramic" (covering the entire sky) can also be read as "omnidirectional" (covering the entire celestial sphere).

顕著性マップ記録部１８は、顕著性マップ生成部１７により生成された全天周顕著性マップを、記憶部６８（図１８参照）に記録する。例えば、全天周顕著性マップを記録するためのバッファ等が構成されてもよい。 The saliency map recording unit 18 records the panoramic saliency map generated by the saliency map generation unit 17 in the memory unit 68 (see Figure 18). For example, a buffer or the like may be configured for recording the panoramic saliency map.

認識位置推定部１９は、仮想空間Ｓのユーザ５の視野７に含まれない視野外領域における、ユーザ５が認識している認識対象オブジェクトのユーザ５が認識している認識位置を推定する。
認識対象オブジェクトは、ユーザ５が視聴している６ＤｏＦコンテンツ内のオブジェクトのうち、ユーザ５が認識しているだろうと想定されるオブジェクトのことである。
例えば、現在時刻までにユーザ５が視聴したことのあるオブジェクトは、認識対象オブジェクトとして設定される。すなわち、現在時刻までにレンダリング対象となったことがあるオブジェクトは、認識対象オブジェクトとして設定される。
その他、現在時刻までに視聴はされていないが、オブジェクトが発する音をユーザが認識したオブジェクトを、認識対象オブジェクトとして設定してもよい。例えば、オブジェクトが発する音の音量が所定の基準値（閾値）よりも大きい場合には、当該音を発したオブジェクトはユーザ５に認識されたとして、認識対象オブジェクトとして設定されてもよい。この場合、音の種類や内容等が、認識対象オブジェクトとして設定されるか否かの条件として用いられてもよい。 The recognition position estimation unit 19 estimates the recognition position of the recognition target object recognized by the user 5 in the out-of-field area of the virtual space S that is not included in the field of view 7 of the user 5 .
A recognition target object is an object, among the objects in the 6DoF content that the user 5 is viewing, that the user 5 is assumed to have recognized.
For example, an object that the user 5 has viewed up to the current time is set as a recognition target object. In other words, an object that has been a rendering target up to the current time is set as a recognition target object.
Alternatively, an object that has not been viewed up to the current time but whose sound has been recognized by the user may be set as a recognition target object. For example, if the volume of the sound made by the object is greater than a predetermined reference value (threshold value), the object that made the sound may be regarded as having been recognized by the user 5 and set as a recognition target object. In this case, the type or content of the sound may be used as a condition for whether or not to set the object as a recognition target object.

認識位置は、図３に示す仮想空間Ｓ上のＸＹＺ座標値（ワールド座標）により規定される。ユーザ５が脳内で、認識対象オブジェクトが今ここにいると認識しているであろう位置が、認識位置として推定される。
従って、認識位置は、必ずしも６ＤｏＦコンテンツ内で認識対象オブジェクトが実際に存在する位置と一致するとは限らない。
認識位置は、ユーザ５が把握している把握位置ともいえる。 The recognition position is defined by XYZ coordinate values (world coordinates) in the virtual space S shown in FIG. 3. The position at which the user 5 presumably believes, in his or her mind, the recognition target object to be located right now is estimated as the recognition position.
Therefore, the recognition position does not necessarily coincide with the position where the object to be recognized actually exists in the 6DoF content.
The recognized position can also be said to be a grasped position that the user 5 grasps.

図５は、認識対象オブジェクト及び認識位置を説明するための模式図である。
図５では、説明を分かりやすくするために、横方向に長い画像にて、仮想空間Ｓの全天周分の画像ＳＰが表現されている。実際には、全天周画像ＳＰは、正距円筒画像として保存等される場合が多い。
また、以下の説明では、予測されるユーザの視野７を、単にユーザの視野７として説明を行う。また予測フレーム画像２０を、フレーム画像２０として説明を行う。 FIG. 5 is a schematic diagram for explaining a recognition target object and a recognition position.
In FIG. 5, for ease of understanding, the panoramic image SP of the virtual space S is expressed as an image that is long in the horizontal direction. In practice, the panoramic image SP is often saved as an equirectangular image.
In the following description, the predicted user's field of view 7 will be simply referred to as the user's field of view 7. The predicted frame image 20 will be described as the frame image 20.
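Since the panoramic image SP is often stored as an equirectangular image, a viewing direction can be related to a pixel position in that image by a linear mapping of the two angles. The following is a minimal sketch; the function name and the convention used (yaw 0 and pitch 0 at the image center, yaw increasing to the right, pitch +90 degrees at the top edge) are assumptions, and the description does not specify a particular convention.

```python
def direction_to_equirect_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to continuous (u, v) pixel coordinates.

    Assumed convention: yaw in degrees wraps every 360 and is 0 at the
    horizontal center; pitch in degrees is +90 at the top edge and -90
    at the bottom edge. The result is not rounded to integer pixels.
    """
    # Longitude maps linearly to the horizontal axis, with wrap-around.
    u = ((yaw_deg + 180.0) % 360.0) / 360.0 * width
    # Latitude maps linearly to the vertical axis; clamp to the valid range.
    clamped_pitch = max(-90.0, min(90.0, pitch_deg))
    v = (90.0 - clamped_pitch) / 180.0 * height
    return u, v


# Looking straight ahead lands at the center of a 3600 x 1800 image.
print(direction_to_equirect_pixel(0.0, 0.0, 3600, 1800))
```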

図５に示すシーンでは、仮想空間Ｓ内に、３人の人物Ｐ１～Ｐ３と、点滅している照明装置Ｌの各オブジェクトが配置される。その他、木、草、道路、及び建物の各オブジェクトも配置される。
図５Ａに示すタイミングにおいて、まずユーザ５の視野７が、右側の人物Ｐ１の方に向けられるとする。従って、人物Ｐ１を含むフレーム画像２０がレンダリングされる。この時点で、人物Ｐ１が認識対象オブジェクトとして設定される。そして、人物Ｐ１が配置されている仮想空間Ｓ上の位置（ワールド座標）が、認識位置として推定される。
このように、認識対象オブジェクトがレンダリングされている２次元映像データ内の認識対象オブジェクトの位置に基づいて、認識位置を推定することが可能である。 In the scene shown in FIG. 5, three people P1 to P3 and a blinking lighting device L are placed as objects in the virtual space S. Other objects such as trees, grass, roads, and buildings are also placed.
At the timing shown in FIG. 5A, the field of view 7 of the user 5 is first directed toward the person P1 on the right. Therefore, a frame image 20 including the person P1 is rendered. At this point, the person P1 is set as a recognition target object. Then, the position (world coordinates) in the virtual space S where the person P1 is located is estimated as the recognition position.
In this way, it is possible to estimate the recognition position based on the position of the object to be recognized within the two-dimensional video data in which the object to be recognized is rendered.

図５Ａに示すタイミングでは、左側の人物Ｐ２及びＰ３、及び右側の点滅している照明装置Ｌのオブジェクトは、ユーザ５の視野７に含まれない視野外領域２１に配置される。従って、人物Ｐ２及びＰ３、及び照明装置Ｌはレンダリングされない。この結果、人物Ｐ２及びＰ３、及び照明装置Ｌは、仮想空間Ｓ内に配置はされているが、ユーザ５には認識されていないと判定され、認識対象オブジェクトとしては設定されない。 At the timing shown in Figure 5A, the persons P2 and P3 on the left and the blinking lighting device L object on the right are located in the out-of-field area 21 that is not included in the field of view 7 of the user 5. Therefore, the persons P2 and P3 and the lighting device L are not rendered. As a result, although the persons P2 and P3 and the lighting device L are located in the virtual space S, they are determined to be not recognized by the user 5 and are not set as objects to be recognized.

図５Ｂに示すタイミングにおいて、ユーザ５の視野７が左側に向けられたとする。人物Ｐ２及びＰ３がユーザ５の視野７に含まれレンダリングされる。これにより、人物Ｐ２及びＰ３が、認識対象オブジェクトとして設定される。そして、人物Ｐ２及びＰ３が配置されている仮想空間Ｓ上の位置（ワールド座標）が、認識位置として推定される。
人物Ｐ１は、ユーザ５の視野７から外れて視野外領域２１に位置することになるが、過去にレンダリングされたことがあるオブジェクトであるので、認識対象オブジェクトとしての設定は維持される。
照明装置Ｌはレンダリングされないので、認識対象オブジェクトとしては設定されない。 At the timing shown in FIG. 5B, it is assumed that the field of view 7 of the user 5 is turned to the left. The people P2 and P3 are included in the field of view 7 of the user 5 and are rendered. As a result, the people P2 and P3 are set as recognition target objects. Then, the positions (world coordinates) in the virtual space S where the people P2 and P3 are located are estimated as recognition positions.
Although the person P1 is outside the field of view 7 of the user 5 and is positioned in the out-of-field area 21, the setting as an object to be recognized is maintained because the person P1 is an object that has been rendered in the past.
The lighting device L is not rendered and is therefore not set as a recognition target object.

図５Ｃに示すタイミングにおいて、視野外領域２１において、人物Ｐ１が照明装置Ｌの方へ移動したとする。この場合、仮想空間Ｓ上では、人物Ｐ１の位置が移動する。
ここでユーザ５は、人物Ｐ１の視野外領域２１での移動を認識していないとする。この場合、ユーザ５の脳内では人物Ｐ１の認識位置は、実際の人物Ｐ１の位置とは一致しない。
認識位置推定部１９では、図５Ｃに示すような認識対象オブジェクトの視野外領域２１での移動等に対して、ユーザ５の脳内での認識位置を高精度に推定することが可能である。 At the timing shown in FIG. 5C, it is assumed that the person P1 moves in the out-of-field area 21 toward the lighting device L. In this case, the position of the person P1 moves in the virtual space S.
Here, it is assumed that the user 5 is not aware of the movement of the person P1 in the out-of-field area 21. In this case, the recognized position of the person P1 in the user 5's mind does not match the actual position of the person P1.
The recognition position estimation unit 19 can estimate with high accuracy the recognition position in the mind of the user 5 when a recognition target object moves in the out-of-field area 21 as shown in FIG. 5C.
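The sequence of FIG. 5A to FIG. 5C can be traced with a small bookkeeping structure that records which objects have become recognition target objects and where the user last had reason to place them. The following is a minimal sketch that uses the last rendered (or heard) position as the recognition position; this "last known position" rule, the class name, the volume threshold, and the coordinates are all simplifying assumptions and are not the estimation method of the recognition position estimation unit 19.

```python
class RecognitionTracker:
    """Tracks recognition target objects and their assumed recognition positions."""

    def __init__(self, volume_threshold=0.8):
        # Hypothetical threshold above which a sound is taken to be noticed.
        self.volume_threshold = volume_threshold
        self.recognized = {}  # object id -> recognition position (world coordinates)

    def update(self, actual_positions, rendered_ids, volumes=None):
        """Update recognition positions for objects rendered or heard this frame."""
        volumes = volumes or {}
        for obj_id, position in actual_positions.items():
            seen = obj_id in rendered_ids
            heard = volumes.get(obj_id, 0.0) > self.volume_threshold
            if seen or heard:
                self.recognized[obj_id] = position
        # Objects neither seen nor heard keep their previous recognition position.
        return dict(self.recognized)


positions = {"P1": (5, 0, 0), "P2": (-5, 0, 0), "P3": (-6, 0, 0), "L": (8, 0, 0)}
tracker = RecognitionTracker()
tracker.update(positions, {"P1"})            # FIG. 5A: only P1 is rendered
tracker.update(positions, {"P2", "P3"})      # FIG. 5B: P2 and P3 are rendered
positions["P1"] = (7, 0, 0)                  # FIG. 5C: P1 moves out of view
state = tracker.update(positions, {"P2", "P3"})
print(state["P1"], "L" in state)
```

In this trace the recognition position of P1 stays at its earlier location after P1 moves in the out-of-field area, and the lighting device L never becomes a recognition target object, which mirrors the situation described for FIG. 5C.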

本実施形態において、レンダリング部１４は、本技術に係るレンダリング部の一実施形態として機能する。
認識位置推定部１９は、本技術に係る推定部の一実施形態として機能する。
顕著性マップ生成部１７は、本技術に係る生成部の一実施形態として機能する。
予測部１３は、本技術に係る予測部の一実施形態として機能する。 In this embodiment, the rendering unit 14 functions as an embodiment of a rendering unit according to the present technology.
The recognition position estimation unit 19 functions as an embodiment of an estimation unit according to the present technology.
The saliency map generation unit 17 functions as an embodiment of a generation unit according to the present technology.
The prediction unit 13 functions as an embodiment of a prediction unit according to the present technology.

クライアント装置３は、通信部２３と、デコード部２４と、レンダリング部２５とを有する。
これらの機能ブロックは、例えばＣＰＵが本技術に係るプログラムを実行することで実現され、本実施形態に係る情報処理方法が実行される。なお各機能ブロックを実現するために、ＩＣ（集積回路）等の専用のハードウェアが適宜用いられてもよい。 The client device 3 includes a communication unit 23 , a decoding unit 24 , and a rendering unit 25 .
These functional blocks are realized by, for example, a CPU executing a program according to the present technology, and the information processing method according to the present embodiment is executed. Note that dedicated hardware such as an IC (integrated circuit) may be used as appropriate to realize each functional block.

通信部２３は、他のデバイスとの間で、ネットワーク通信や近距離無線通信等を実行するためのモジュールである。例えばＷｉＦｉ等の無線ＬＡＮモジュールや、Bluetooth（登録商標）等の通信モジュールが設けられる。
デコード部２４は、配信データに対してデコード処理を実行する。これにより、エンコードされたレンダリング映像８（予測フレーム画像２０）がデコードされる。
レンダリング部２５は、デコードされたレンダリング映像８（予測フレーム画像２０）がＨＭＤ２により表示可能なように、レンダリング処理を実行する。 The communication unit 23 is a module for performing network communication, short-range wireless communication, etc. with other devices. For example, a wireless LAN module such as WiFi, or a communication module such as Bluetooth (registered trademark) is provided.
The decoding unit 24 executes a decoding process on the distribution data, thereby decoding the encoded rendered video 8 (predicted frame image 20).
The rendering unit 25 performs rendering processing so that the decoded rendered video 8 (predicted frame image 20 ) can be displayed on the HMD 2 .

［Head Motion情報の予測精度］
例えば、「現在時刻のHead Motion情報」を受信したサーバ装置４により、応答遅延（T_m2p時間）分未来の予測Head Motion情報が生成される。そして、予測Head Motion情報に基づいて予測フレーム画像２０が生成され、ＨＭＤ２によりユーザ５に対して表示される。
非常に高い精度で予測Head Motion情報を生成できれば、「現在時刻」から応答遅延（T_m2p時間）分未来のユーザ５の視野７に応じたレンダリング映像８を表示することが可能となり、応答遅延の問題は十分に抑制可能である。 [Head Motion Information Prediction Accuracy]
For example, the server device 4 receives the "head motion information at the current time" and generates predicted head motion information for the future corresponding to the response delay (T_m2p time). Then, a predicted frame image 20 is generated based on the predicted head motion information and is displayed to the user 5 by the HMD 2.
If predicted head motion information can be generated with extremely high accuracy, it will be possible to display a rendered image 8 according to the field of view 7 of the user 5 at a time in the future corresponding to the response delay (T_m2p time) from the "current time," and the problem of response delay can be sufficiently suppressed.
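To make concrete what predicting "T_m2p into the future" means, the following sketch extrapolates the orientation linearly from the two most recent samples. This is a naive constant-velocity baseline swapped in purely for illustration; it is not the saliency-based prediction of the prediction unit 13, and the function names, the use of degrees, and the sample values are assumptions.

```python
def wrap_deg(angle):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def extrapolate_orientation(prev, curr, dt, horizon):
    """Constant-velocity extrapolation of (yaw, pitch, roll) in degrees.

    prev, curr: two consecutive orientation samples.
    dt: time between the two samples, in seconds.
    horizon: how far ahead to predict (for example T_m2p), in seconds.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    predicted = []
    for p, c in zip(prev, curr):
        # Use the shortest angular difference so that wrap-around at
        # +/-180 degrees does not produce a spurious large velocity.
        velocity = wrap_deg(c - p) / dt
        predicted.append(wrap_deg(c + velocity * horizon))
    return tuple(predicted)


# Yaw moves 5 degrees in 10 ms; predicting 20 ms ahead crosses the 180 degree seam.
print(extrapolate_orientation((170.0, 0.0, 0.0), (175.0, 0.0, 0.0), 0.01, 0.02))
```

A baseline of this kind degrades quickly when the head changes direction abruptly, which is consistent with the observation in this description that rotational movements are the harder case to predict.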

本発明者は、予測Head Motion情報の精度を向上させるために、Head Motion予測について考察を重ねた。
まず、Head Motion予測の予測誤差は、頭の動き信号（センサリング結果）の周波数の増加に伴って増大するという傾向が見受けられる。
人間の体の特性上、回転方向への動きは素早い動きの変化（高周波となる動き）が可能だが、前後、上下、左右といった位置移動においては、急な変化を有する高周波な動きはしにくい傾向にある。
そのため、これら２種類の動きのうち、位置移動への動き（X、Y、Z）に対する予測誤差は低く、視聴上の影響は非常に少ない。一方で、回転方向への動き（yaw、pitch、roll）に対する予測誤差が大きくなる傾向にあり、視聴に影響をきたしやすい。すなわち、回転方向の動き（yaw、pitch、roll）に対する予測精度の向上が非常に重要となる。 The present inventors have conducted extensive research into head motion prediction in order to improve the accuracy of predicted head motion information.
First, the prediction error of head motion prediction tends to increase as the frequency of the head movement signal (sensing result) increases.
Due to the characteristics of the human body, rapid changes in rotational movement (high-frequency movement) are possible, but high-frequency movement with sudden changes in position such as forward/backward, up/down, and left/right tends to be difficult.
Therefore, of these two types of motion, the prediction error for positional movements (X, Y, Z) is low and has very little impact on viewing. On the other hand, the prediction error for rotational movements (yaw, pitch, roll) tends to be large and can easily affect viewing. In other words, improving the prediction accuracy for rotational movements (yaw, pitch, roll) is extremely important.

The inventors have focused on the panoramic saliency map to improve the accuracy of head motion prediction, especially for rotational movements (yaw, pitch, roll).
By generating a high-accuracy panoramic saliency map and using it for head motion prediction, it becomes possible to perform highly accurate prediction of rotational movements (yaw, pitch, roll).

[Generation of 2D video data (rendered video)]
An example of how the server device 4 generates a rendered video using the panoramic saliency map will be described.
FIG. 6 is a flowchart showing an example of generating a rendered video.
FIG. 7 is a diagram for explaining the flowchart shown in FIG. 6, and is a schematic diagram showing the timing of acquiring head motion information, generating predicted head motion information, generating a predicted frame image 20, and generating a panoramic saliency map.
In this embodiment, for ease of explanation, it is assumed that visual field information is acquired from the client device 3 at a predetermined frame rate, and that predicted head motion information, a predicted frame image 20, and a panoramic saliency map are each generated at the same frame rate. Of course, the processing is not limited to this.
The numbered boxes in FIG. 7 indicate the frames of each process. FIG. 7 schematically shows the 1st frame, at which the process starts, through the 25th frame.
In each frame, a square indicates that the data named on the left was acquired or generated in that frame. The number inside the square indicates which frame the data corresponds to.

First, how far into the future from the "current time" the predicted head motion information is to be generated is set.
In this embodiment, the communication unit 16 measures the network delay with the client device 3 and identifies the target prediction time (step 101). That is, the response delay (T_m2p time) is measured, and the T_m2p time is used as the prediction time.
In this embodiment, head motion information is predicted for a frame that is a predetermined number of frames later than the frame corresponding to the "current time," and is generated as predicted head motion information.
The predetermined number of frames is set to the number of frames equivalent to the prediction time, i.e., the T_m2p time.
For example, in this embodiment, head motion information five frames ahead is predicted. If "head motion information at the current time" is acquired in the 10th frame, head motion information for the 15th frame, which is five frames ahead, is predicted and generated as predicted head motion information. Of course, the specific number of frames is not limited and may be set arbitrarily.
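The conversion from the measured T_m2p time to a number of frames can be illustrated with a minimal sketch. The function names, the use of seconds and hertz as units, and the choice to round up to a whole frame are assumptions made for illustration; the description only states that the frame count is equivalent to the T_m2p time.

```python
import math


def prediction_horizon_frames(t_m2p_seconds: float, frame_rate_hz: float) -> int:
    """Return how many frames ahead to predict for a given response delay.

    Rounding up (so the horizon never undershoots the delay) is an assumption.
    """
    if t_m2p_seconds < 0 or frame_rate_hz <= 0:
        raise ValueError("delay must be >= 0 and frame rate must be > 0")
    # Small tolerance so that an exact multiple of the frame period is not
    # pushed up by floating-point rounding error.
    return math.ceil(t_m2p_seconds * frame_rate_hz - 1e-9)


def target_frame(current_frame: int, horizon_frames: int) -> int:
    """Frame index for which predicted head motion information is generated."""
    return current_frame + horizon_frames


# Hypothetical numbers: an 80 ms delay at 60 frames per second.
horizon = prediction_horizon_frames(0.080, 60.0)
frame_to_predict = target_frame(10, horizon)
```

With these hypothetical numbers the horizon is five frames, so head motion information acquired in the 10th frame leads to a prediction for the 15th frame, matching the example in the text.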

The communication unit 16 acquires head motion information from the client device 3 (step 102). As shown in FIG. 7, head motion information is acquired at a predetermined frame rate starting from the first frame. The head motion information acquired in each frame is used as is as the data corresponding to that frame.

The prediction unit 13 determines whether enough head motion information has been accumulated to predict head motion information (step 103).
In this embodiment, it is assumed that 10 frames of head motion information are required for the prediction. Of course, the specific number of frames is not limited and may be set arbitrarily.
For example, from frame 1 to frame 9, the required amount of head motion information has not yet been accumulated, so the answer in step 103 is No and the process returns to step 102. Therefore, generation of the rendered video 8 (predicted frame image 20) is not executed until the 10th frame.
When the head motion information for the 10th frame is acquired, it is determined that the required amount of head motion information has been accumulated, the answer in step 103 is Yes, and the process proceeds to step 104.
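The accumulation check of step 103 can be sketched as a fixed-size history buffer. The class name `HeadMotionHistory`, the tuple layout of a sample, and the use of a sliding window that keeps only the most recent 10 samples are assumptions for illustration; the description only requires that 10 frames of head motion information be available before prediction starts.

```python
from collections import deque

REQUIRED_FRAMES = 10  # value used in this embodiment; may be set arbitrarily


class HeadMotionHistory:
    """Holds the most recent head motion samples (hypothetical helper)."""

    def __init__(self, required: int = REQUIRED_FRAMES) -> None:
        # A deque with maxlen silently drops the oldest sample when full.
        self._samples = deque(maxlen=required)

    def add(self, sample: tuple) -> None:
        """Store one sample, e.g. (x, y, z, yaw, pitch, roll)."""
        self._samples.append(sample)

    def is_ready(self) -> bool:
        """Step 103: has enough history been accumulated for prediction?"""
        return len(self._samples) == self._samples.maxlen

    def samples(self) -> list:
        return list(self._samples)


history = HeadMotionHistory()
ready_per_frame = []
for frame in range(1, 12):
    history.add((0.0, 0.0, 0.0, float(frame), 0.0, 0.0))
    ready_per_frame.append(history.is_ready())
```

In this sketch `is_ready()` is false for frames 1 to 9 and true from frame 10 onward, which corresponds to the No and Yes branches of step 103.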

In step 104, the prediction unit 13 determines whether or not a panoramic saliency map corresponding to the "head motion information at the current time" acquired in step 102 has already been generated.
In this embodiment, predicted visual field information (predicted head motion information) is generated using as input historical information of visual field information (head motion information) up to the current time and a panoramic saliency map corresponding to the current time.
The panoramic saliency map corresponding to the current time is a panoramic saliency map generated in the past. Specifically, the panoramic saliency map includes a saliency map for the visual field that indicates the saliency of the predicted frame image 20 generated based on predicted visual field information (predicted head motion information) predicted in the past, and a saliency map for an out-of-visual field area that is not included in the visual field 7 (predicted visual field) of the user 5 based on predicted visual field information (predicted head motion information) predicted in the past.

In the example shown in FIG. 7, the panoramic saliency map corresponding to the "head motion information at the current time" means the panoramic saliency map corresponding to the frame in which the "head motion information at the current time" is acquired.
That is, a square indicating head motion information and a square indicating a panoramic saliency map that contain the same number form a pair of mutually corresponding "head motion information at the current time" and panoramic saliency map.

For example, when head motion information for the 10th frame is acquired, the frame corresponding to the current time is frame 10. In step 104, it is determined whether a panoramic saliency map corresponding to frame 10 (a panoramic saliency map represented by a square with the number 10 written inside) has been generated.
As shown in FIG. 7, up to the 10th frame, no predicted head motion information has been generated, and no predicted frame image 20 has been generated either. Therefore, no panoramic saliency map has been generated, so the answer in step 104 is No and the process proceeds to step 105.

In step 105, the prediction unit 13 generates predicted visual field information (predicted head motion information) based on the history information of visual field information (head motion information) up to the current time.
In this way, when a panoramic saliency map for a frame corresponding to the current time has not been generated, predicted head motion information may be generated based only on the history information of head motion information up to the current time.
In this embodiment, at frame 10, predicted head motion information for the future, that is, five frames ahead, is generated based on the history information of head motion information from frame 1 to frame 10. Therefore, as shown in Fig. 7, at the 10th frame, predicted head motion information corresponding to frame 15, five frames ahead, is generated (predicted head motion information represented by a square with the number 15 written inside).
There are no particular limitations on the specific algorithm for generating predicted head motion information based on the history information of head motion information up to the current time, and any algorithm may be used. For example, any machine learning algorithm may be used.
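Since the algorithm for step 105 is left open, the following is only one possible stand-in: a constant-velocity extrapolation from the last two samples of the history. It is not the method of the embodiment. The sample layout `(x, y, z, yaw, pitch, roll)`, the use of degrees, and the function names are assumptions for illustration.

```python
def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def extrapolate_head_motion(history: list, frames_ahead: int) -> tuple:
    """Predict a future sample by linear extrapolation (illustrative only).

    history: list of (x, y, z, yaw, pitch, roll) samples, oldest first.
    """
    if len(history) < 2:
        raise ValueError("at least two samples are needed")
    prev, last = history[-2], history[-1]
    # Positions: plain per-frame difference extended over the horizon.
    position = tuple(
        l + (l - p) * frames_ahead for p, l in zip(prev[:3], last[:3])
    )
    # Rotations: take the shortest angular difference, then wrap the result.
    rotation = tuple(
        wrap_deg(l + wrap_deg(l - p) * frames_ahead)
        for p, l in zip(prev[3:], last[3:])
    )
    return position + rotation


samples = [
    (0.0, 0.0, 0.0, 170.0, 0.0, 0.0),
    (0.1, 0.0, 0.0, 174.0, 0.0, 0.0),
]
predicted = extrapolate_head_motion(samples, frames_ahead=5)
```

A simple extrapolation like this tends to degrade for fast rotational changes, which is consistent with the observation above that prediction error grows for high-frequency yaw, pitch, and roll movements.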

The rendering unit 14 performs the rendering process illustrated in FIG. 3 based on the predicted head motion information, and generates a rendered video 8 (predicted frame image 20) (step 106). In this embodiment, a predicted frame image 20 corresponding to frame 15 is generated based on the predicted head motion information for five frames ahead.

The recognition position estimation unit 19 estimates the recognition position of the recognition target object in the out-of-field area (step 107). In this embodiment, in the out-of-field area that is not included in the field of view 7 (predicted field of view) of the user 5 based on the predicted visual field information (predicted head motion information), the recognition position, as recognized by the user 5, of a recognition target object that the user 5 is aware of is estimated.

The saliency map generator 17 generates a panoramic saliency map corresponding to frame 15 based on the predicted frame image 20 and the estimated recognition position (step 108).
The generated panoramic saliency map is recorded and held by the saliency map recording unit 18. As illustrated in FIG. 7, in the 10th frame, the panoramic saliency map corresponding to frame 15 is recorded.

As described above, in this embodiment, as shown in FIG. 7, in the frame corresponding to the "current time," a frame image five frames ahead is generated as the predicted frame image 20. Likewise, in the frame corresponding to the "current time," a panoramic saliency map for five frames ahead is generated.
In the present disclosure, a frame image generated at the "current time," i.e., a frame image generated in a frame corresponding to the "current time," is referred to as the "frame image at the current time." Therefore, in this embodiment, the future predicted frame image 20 generated in a frame corresponding to the "current time" corresponds to the "frame image at the current time."
On the other hand, the "frame image corresponding to the current time (predicted frame image)" corresponds to a frame image (predicted frame image) generated five frames in the past.

The encoding unit 15 encodes the predicted frame image 20. The communication unit 16 transmits the encoded predicted frame image 20 to the client device 3 (step 109).
The predicted frame image 20 generated in the 10th frame is transmitted to the HMD 2 via the client device 3 as the first frame of the 6DoF video content, and is displayed to the user 5. This starts the distribution of virtual video in which the effect of the response delay is sufficiently suppressed.
The rendering unit 14 determines whether or not processing has been completed for all frame images (step 110). Here, it is assumed that processing is executed up to frame 25, as illustrated in FIG. 7.
Therefore, the answer in step 110 is No, and the process returns to step 102.

From frame 11 to frame 14 shown in FIG. 7, the answer in step 104 is No, and the processing flow that proceeds from step 105 to step 106 is executed.
At frame 15, the panoramic saliency map corresponding to frame 15, which was generated earlier at frame 10, exists as the panoramic saliency map corresponding to the acquired "head motion information at the current time." Therefore, the answer in step 104 is Yes, and the process proceeds to step 111.
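The branching described for FIG. 6 and FIG. 7 can be traced with a small simulation. It models only which branch is taken in each frame and which frame each recorded map corresponds to. The function name, the string labels, and the use of a set in place of the saliency map recording unit 18 are assumptions for illustration.

```python
def simulate_schedule(total_frames: int = 25,
                      required_history: int = 10,
                      horizon: int = 5) -> dict:
    """Return, per frame, which prediction branch of the flowchart runs."""
    recorded_maps = set()  # frame numbers for which a saliency map is held
    schedule = {}
    accumulated = 0
    for frame in range(1, total_frames + 1):
        accumulated += 1                      # step 102: acquire head motion
        if accumulated < required_history:    # step 103: not enough history
            schedule[frame] = "wait"
            continue
        if frame in recorded_maps:            # step 104: map for this frame?
            schedule[frame] = "step 111"      # history + saliency map
        else:
            schedule[frame] = "step 105"      # history only
        # Steps 106 to 108: render the predicted frame and record the
        # panoramic saliency map for the frame `horizon` frames ahead.
        recorded_maps.add(frame + horizon)
    return schedule


schedule = simulate_schedule()
```

Under the values used in this embodiment, the sketch yields "wait" for frames 1 to 9, "step 105" for frames 10 to 14, and "step 111" from frame 15 onward.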

In step 111, future head motion information is predicted using, as input, the history information of visual field information (head motion information) up to the current time and the panoramic saliency map corresponding to the current time, and is generated as predicted head motion information.
There are no particular limitations on the specific algorithm for generating predicted head motion information from the history information of head motion information and the panoramic saliency map, and any algorithm may be used. For example, any machine learning algorithm may be used.
From then on, up to frame 25, the answer in step 104 is Yes, and the panoramic saliency map is used to generate highly accurate predicted head motion information.
When processing has been completed for all frame images, the answer in step 110 is Yes, and the video generation and distribution processing ends.

[Considerations on generating panoramic saliency maps]
The inventors have given extensive consideration to the generation of a panoramic saliency map.
Human visual attention can be divided into two types: exogenous attention (bottom-up attention) triggered by visual stimuli before object recognition, and endogenous attention (top-down attention) triggered by interest in or concern for an object after object recognition. The keyword "saliency" is used in both bottom-up and top-down attention.

As an example of generating a saliency map based on bottom-up attention, features that attract exogenous attention (bottom-up attention) through visual stimuli before a person recognizes an object, such as luminance, color, orientation, motion direction, and depth, are extracted from the input video (2D image). A feature map is calculated for each feature so that regions whose feature value differs greatly from their surroundings are assigned high saliency, and the feature maps are integrated to generate the final saliency map.

FIG. 8 is a schematic diagram illustrating an example of generating a saliency map based on bottom-up attention that can be performed by the present system.
In the example shown in FIG. 8, a predicted frame image 20 is input as an input frame.
A feature extraction process is performed on the predicted frame image 20, and the luminance, color, and orientation features that attract bottom-up attention are extracted. Note that the predicted frame image 20 of the previous frame, etc., may be used for feature extraction.
For each of the luminance, color, and orientation features, a feature image in which the feature value is converted into luminance is generated, and a Gaussian pyramid of the feature image is generated.
Furthermore, a depth map image and a motion vector map image are acquired, as parameters related to the rendering process (rendering information), from the renderer constituting the rendering unit 14. The depth map image is data including distance information (depth information) to the objects to be rendered. The motion vector map image is data including motion information of the objects to be rendered.
The depth map image is used as the depth feature image to generate a Gaussian pyramid, and the motion vector map image is used as the motion direction feature image to generate a Gaussian pyramid.
Center-surround difference processing is performed on the Gaussian pyramid of each feature. As a result, feature maps are generated for the luminance, color, orientation, motion direction, and depth features. These feature maps are then integrated to generate a saliency map 22 based on bottom-up attention.
There are no particular limitations on the specific algorithms for the feature extraction process, the Gaussian pyramid generation process, the center-surround difference process, and the feature map integration process. For example, each process can be realized using well-known techniques.
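The pyramid, center-surround, and integration stages can be illustrated for a single feature image with a small pure-Python sketch. It is a simplification made under several assumptions: a 2x2 box average stands in for Gaussian filtering, the coarse level is upsampled by nearest-neighbor lookup, and integration is a plain average of max-normalized maps. All function names are hypothetical, and image sides are assumed to be divisible by 2 for each pyramid level.

```python
def downsample(img: list) -> list:
    """Halve resolution by averaging 2x2 blocks (stand-in for Gaussian blur)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def build_pyramid(img: list, levels: int) -> list:
    """Level 0 is the input; each further level has half the resolution."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def center_surround(pyramid: list, center: int, surround: int) -> list:
    """Absolute difference between a fine level and an upsampled coarse level."""
    fine, coarse = pyramid[center], pyramid[surround]
    scale = 2 ** (surround - center)
    return [
        [abs(fine[y][x] - coarse[y // scale][x // scale])
         for x in range(len(fine[0]))]
        for y in range(len(fine))
    ]


def integrate(feature_maps: list) -> list:
    """Average the feature maps after scaling each to a maximum of 1."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    total = [[0.0] * w for _ in range(h)]
    for fmap in feature_maps:
        peak = max(max(row) for row in fmap)
        for y in range(h):
            for x in range(w):
                total[y][x] += (fmap[y][x] / peak) if peak > 0 else 0.0
    return [[v / len(feature_maps) for v in row] for row in total]


# Hypothetical 8x8 luminance feature image with one bright pixel.
luminance = [[0.0] * 8 for _ in range(8)]
luminance[3][3] = 1.0
feature_map = center_surround(build_pyramid(luminance, levels=3), 0, 2)
saliency = integrate([feature_map])
```

In this toy input the bright pixel differs most from its surroundings and receives the highest saliency. In the system described above, the depth and motion direction feature images would come from the renderer rather than from image analysis, and each would go through the same pyramid and center-surround stages.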

The depth map image acquired from the renderer contains not depth values estimated by performing 2D image analysis or the like on the predicted frame image 20, but accurate values obtained in the rendering process. Therefore, by receiving this depth map image directly from the renderer and using it as the "depth" feature information in generating the saliency map 22, a more accurate saliency map 22 can be generated.
Likewise, the motion vector map image acquired from the renderer contains not values estimated by performing 2D image analysis or the like on the predicted frame image 20, but accurate values obtained in the rendering process. Therefore, by receiving this motion vector map image directly from the renderer and using it as the "motion direction" feature information in generating the saliency map 22, a more accurate saliency map 22 can be generated.
Note that other features such as "luminance" and "color" can also be calculated in the rendering process and used as rendering information.
The algorithm for generating a saliency map based on bottom-up attention is not limited, and any other algorithm may be used. For example, any machine learning algorithm may be used.
For example, "depth" and "motion direction" feature information may instead be acquired by performing 2D image analysis or the like on the predicted frame image 20, and that information may be used.

Top-down attention is directed at an object based on its meaning after the object has been recognized, so saliency is assigned to objects.
For example, an object that generally attracts people's interest, such as a human face, is detected from an image and assigned saliency. In addition, saliency is assigned to the display area of an object based on the type of object (idol, baseball player, vehicle, etc.), the importance of the object in the 6DoF content, the user's preference for the object (degree of interest or liking, etc.), and so on.
The algorithm for generating a saliency map based on top-down attention is not limited, and any algorithm may be used. For example, any machine learning algorithm may be used.
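As a rough illustration of assigning top-down saliency to the display area of each object, the sketch below fills each object's bounding box with a score. The `VisibleObject` fields, the rectangular display area, combining the three factors by multiplication, and resolving overlaps by taking the maximum are all assumptions; the description does not specify how these factors are combined.

```python
from dataclasses import dataclass


@dataclass
class VisibleObject:
    """Hypothetical description of an object visible in the frame image."""
    box: tuple            # (x0, y0, x1, y1) in pixels, end-exclusive
    type_weight: float    # weight for the object type (e.g. face, vehicle)
    importance: float     # importance of the object in the 6DoF content
    preference: float     # the user's degree of interest in the object


def top_down_saliency(width: int, height: int, objects: list) -> list:
    """Build a top-down saliency map by scoring each object's display area."""
    saliency = [[0.0] * width for _ in range(height)]
    for obj in objects:
        score = obj.type_weight * obj.importance * obj.preference
        x0, y0, x1, y1 = obj.box
        # Clip the box to the image so partially visible objects are handled.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                saliency[y][x] = max(saliency[y][x], score)
    return saliency


person = VisibleObject(box=(1, 1, 3, 2), type_weight=1.0,
                       importance=0.5, preference=0.5)
top_down_map = top_down_saliency(width=4, height=3, objects=[person])
```

Pixels inside the object's box receive the combined score and all other pixels stay at zero. A learned model, as mentioned in the text, could replace this hand-weighted scoring.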

It is also possible to generate, all at once, a saliency map that includes both saliency based on bottom-up attention and saliency based on top-down attention. For example, by capturing actual human gaze data, creating training data from it, and training a model on that data, it is possible to build a machine learning model that can generate a saliency map covering both bottom-up and top-down attention.

[Method for generating a panoramic saliency map given as a comparative example]
Here, a method for generating a panoramic saliency map will be described as a comparative example.
A 2D image is rendered for each viewport while the viewport is changed at regular intervals over the entire virtual space S (hereinafter, each rendered 2D image is referred to as a viewport image). Then, for each viewport image, a saliency map based on bottom-up attention and a saliency map based on top-down attention as described above are generated, and these are integrated.
The panoramic saliency map of this comparative example is information representing, for the user 5, the saliency of all viewport images. In other words, it is information representing the saliency that would apply if every area in the virtual space S were within the field of view of the user 5.
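The regular-interval viewport sweep of the comparative example can be sketched as a grid of viewing directions. Expressing a viewport as a (yaw, pitch) pair in degrees, the step size, and the function name are assumptions for illustration; rendering and per-viewport saliency computation are outside this sketch.

```python
def viewport_directions(step_deg: int) -> list:
    """Enumerate (yaw, pitch) viewing directions at regular angular intervals.

    Yaw covers [-180, 180) and pitch covers [-90, 90], both in degrees.
    """
    if step_deg <= 0:
        raise ValueError("step_deg must be positive")
    return [
        (yaw, pitch)
        for pitch in range(-90, 91, step_deg)
        for yaw in range(-180, 180, step_deg)
    ]


# Each direction would be rendered into a viewport image, a saliency map
# would be computed per image, and the results would be integrated.
directions = viewport_directions(step_deg=90)
```

Because every direction is rendered and scored regardless of where the user is actually looking, the resulting map treats the whole space as if it were visible, which is the premise behind the problems discussed next.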

Therefore, when the panoramic saliency map of this comparative example is used, the following problems are likely to occur.
FIG. 9 is a schematic diagram for explaining the problems with the panoramic saliency map 30 of the comparative example. In FIG. 9, the saliency based on top-down attention produced by the persons P1 to P3 and the saliency based on bottom-up attention produced by the lighting device L are schematically illustrated as white regions.

(1) Saliency resulting from bottom-up attention is essentially zero outside the field of view of the user 5, because what lies there is not seen and provides no visual stimulus. The panoramic saliency map 30 of the comparative example is created on the assumption that all areas in the virtual space S are within the field of view. Therefore, if the panoramic saliency map 30 of the comparative example is used as is to predict visual field information, saliency that lies outside the field of view and cannot be perceived by the user 5 adversely affects the prediction of visual field information.
For example, at the timing shown in FIG. 5A, the blinking lighting device L on the right side is not perceived by the user 5. In the panoramic saliency map 30 of the comparative example, as shown in FIG. 9A, saliency based on bottom-up attention is nevertheless assigned to the pixel region of the blinking lighting device L. As a result, saliency that does not exist in the mind of the user 5 is generated, which adversely affects the prediction of visual field information.

(2) Regarding saliency resulting from top-down attention as well, the saliency of an object that is outside the field of view of the user 5 and has not yet been captured in the viewport (has not been seen) is essentially zero, because the user 5 is not even aware that the object exists. When the panoramic saliency map 30 of the comparative example is used, saliency that lies outside the field of view and cannot be perceived by the user 5 adversely affects the prediction of visual field information.
For example, at the timing shown in FIG. 5A, the persons P2 and P3 on the left side are not perceived by the user 5. In the panoramic saliency map 30 of the comparative example, as shown in FIG. 9A, saliency based on top-down attention is nevertheless assigned to the persons P2 and P3. As a result, saliency that does not exist in the mind of the user 5 is generated, which adversely affects the prediction of visual field information.

(3) Furthermore, regarding saliency resulting from top-down attention, an object that was viewed in the past may move within the out-of-field area 21. In this case, the user 5 cannot notice the movement of the object. Therefore, the user will presumably assume that the object is at a position consistent with the situation the user grasped when last viewing the object.
For example, if the object was stationary, the user is likely to assume that it is still at the position where it was last seen. If the object was moving when last seen, the user is likely to assume that it is at a position some distance along the direction of movement from where it was last seen.
The panoramic saliency map 30 of the comparative example assumes that moving objects and objects that have finished moving are also within the field of view, and therefore assigns saliency based on top-down attention to such objects.
For example, if the object was stationary when last seen, that is, if the user is not aware that the object has since moved, saliency is generated at a position completely different from the recognition position (grasped position) in the user's mind.
If the object was moving when last seen, there is no problem when the recognition position (grasped position) at which the user assumes the object to be matches the actual position of the object. However, the position the user expects and the actual position of the object are not necessarily likely to match. Therefore, saliency is likely to be generated at a position different from the recognition position (grasped position) in the user's mind.
For example, at the timing shown in FIG. 5C, the user 5 is not aware of the movement of the person P1. In the panoramic saliency map 30 of the comparative example, as shown in FIG. 9B, saliency based on top-down attention is assigned to the person P1 near the lighting device L1. As a result, saliency that does not exist in the mind of the user 5 is generated, which adversely affects the prediction of visual field information.
As described above, the panoramic saliency map 30 of the comparative example does not take into account how saliency behaves outside the viewport (outside the field of view). Therefore, if it is used as is for head orientation prediction, that is, for predicting where the viewport will be directed next from the current viewport, the prediction accuracy may actually decrease or the map may be of no help, and the objective of improving prediction accuracy is not achieved.
To achieve high prediction accuracy, it is important to accurately reflect the state of attention that the user 5 actually has toward the inside and outside of the field of view.

［認識対象オブジェクトの認識位置の推定］
本実施形態では、認識位置推定部１９により、認識対象オブジェクトの設定、及び認識対象オブジェクトの認識位置の推定が実行される。これらの処理は、フレームごとに実行される。
以下、認識対象オブジェクトの認識位置の推定の実施例をいくつか説明する。
（実施例１）
レンダリング部１４によりレンダリングの対象となったオブジェクトが認識対象オブジェクトとして設定される。
例えば、図５Ａに示すタイミングでは、人物Ｐ１が認識対象オブジェクトとして設定される。図５Ｂに示すタイミングでは、人物Ｐ１及びＰ２が認識対象オブジェクトとして設定される。 [Estimation of the recognition position of the object to be recognized]
In this embodiment, the recognition position estimation unit 19 sets the recognition target objects and estimates the recognition position of each recognition target object. These processes are executed for each frame.
Hereinafter, several examples of estimating the recognition position of a recognition target object will be described.
Example 1
The object that is the subject of rendering by the rendering unit 14 is set as the object to be recognized.
For example, at the timing shown in Fig. 5A, a person P1 is set as the object to be recognized, and at the timing shown in Fig. 5B, persons P1 and P2 are set as the objects to be recognized.

各フレームにおいて、「現在時刻のフレーム画像」（現在時刻に生成される未来の予測フレーム画像２０）が生成されるたびに、設定された各認識対象オブジェクトの認識位置が推定される。
「現在時刻のフレーム画像」に含まれる認識対象オブジェクトについては、フレーム画像内の認識対象オブジェクトの位置に対応する仮想空間Ｓ内の位置に基づいて、認識位置が推定される。
例えば、図５Ａに示すタイミングでは、フレーム画像（予測フレーム画像）２０内に人物Ｐ１が含まれるので、フレーム画像２０内の人物Ｐ１の位置に対応する仮想空間Ｓ内の位置に基づいて、人物Ｐ１の認識位置が推定される。 In each frame, each time a "current time frame image" (a future predicted frame image 20 generated at the current time) is generated, the recognition position of each set object to be recognized is estimated.
For a recognition target object included in the "frame image at the current time", the recognition position is estimated based on the position in the virtual space S corresponding to the position of the recognition target object in the frame image.
For example, at the timing shown in Figure 5A, person P1 is included in frame image (predicted frame image) 20, so the recognized position of person P1 is estimated based on the position in virtual space S corresponding to the position of person P1 in frame image 20.

フレーム画像２０内の人物Ｐ１の位置に対応する仮想空間Ｓ内の位置は、フレーム画像２０をレンダリングする際に、仮想空間Ｓ内に配置されている人物Ｐ１の位置となる。当該人物Ｐ１の位置を、ユーザ５が認識している認識位置として推定する。
もし人物Ｐ１が移動中である状態がレンダリングされていた場合、すなわちユーザ５が人物Ｐ１が移動していることを認識した場合には、フレーム画像２０内の人物Ｐ１の位置に対応する仮想空間Ｓ内の位置、すなわち仮想空間Ｓ内に配置されている人物Ｐ１の位置から、移動方向に沿ってシフトした位置が、ユーザ５が認識している認識位置として推定されてもよい。シフト量は、ユーザ５が移動先を予測するだろうと考えられる量が適宜設定されればよい。
ここでは、フレーム画像２０内の人物Ｐ１の位置に対応する仮想空間Ｓ内の位置がそのまま認識位置として推定されるものとする。 The position in the virtual space S corresponding to the position of the person P1 in the frame image 20 becomes the position of the person P1 placed in the virtual space S when rendering the frame image 20. The position of the person P1 is estimated as the recognized position recognized by the user 5.
If a state in which the person P1 is moving has been rendered, that is, if the user 5 has recognized that the person P1 is moving, the recognized position may instead be estimated as a position shifted along the direction of movement from the position in the virtual space S corresponding to the position of the person P1 in the frame image 20, that is, from the position of the person P1 placed in the virtual space S. The amount of shift may be set as appropriate to the distance by which the user 5 is expected to anticipate the destination.
Here, it is assumed that the position in the virtual space S corresponding to the position of the person P1 in the frame image 20 is directly estimated as the recognized position.
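
The optional shift along the direction of movement described above can be sketched as follows. This is only an illustrative sketch: the function name, the use of a velocity vector to represent the direction of movement, and the parameter `shift_amount` are assumptions introduced here and do not appear in the original description.

```python
import math


def shifted_recognition_position(position, velocity, shift_amount):
    """Return `position` moved by `shift_amount` along the direction of `velocity`.

    position, velocity: (x, y, z) tuples in the virtual space S (hypothetical layout).
    shift_amount: distance by which the user is assumed to anticipate the destination.
    """
    speed = math.sqrt(sum(v * v for v in velocity))
    if speed == 0.0:
        # A stationary object is recognized where it was rendered.
        return tuple(position)
    # Move along the unit direction of movement.
    return tuple(p + shift_amount * v / speed for p, v in zip(position, velocity))
```

With `shift_amount` set to zero, the sketch reduces to the case assumed in the text, in which the rendered position is used directly as the recognized position.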

例えば、図５Ｂに示すタイミングでは、フレーム画像２０内の人物Ｐ２及びＰ３の位置に対応する仮想空間Ｓ内の位置が、人物Ｐ２及びＰ３に対して、ユーザ５が認識している認識位置として推定される。 For example, at the timing shown in Figure 5B, the position in virtual space S corresponding to the positions of people P2 and P3 in frame image 20 is estimated as the recognized position recognized by user 5 for people P2 and P3.

「現在時刻のフレーム画像」に含まれない認識対象オブジェクトについては、当該認識対象オブジェクトが含まれる過去の直近のフレーム画像２０内の認識対象オブジェクトの位置に対応する仮想空間Ｓ内の位置に基づいて、認識位置が推定される。本実施形態では、直近のフレーム画像２０内の認識対象オブジェクトの位置に対応する仮想空間Ｓ内の位置が、そのまま認識位置として推定される。
なお、過去の直近のフレーム画像は、最後にレンダリングされたフレーム画像ともいえる。 For a recognition target object that is not included in the "frame image at the current time," the recognition position is estimated based on the position in the virtual space S that corresponds to the position of the recognition target object in the most recent past frame image 20 that includes the recognition target object. In this embodiment, the position in the virtual space S that corresponds to the position of the recognition target object in the most recent frame image 20 is directly estimated as the recognition position.
The most recent past frame image can also be said to be the last frame image rendered.

例えば、図５Ｂに示すタイミングでは、フレーム画像２０に人物Ｐ１が含まれていない。フレーム画像２０に含まれない人物Ｐ１については、図５Ａに示すタイミングにおけるフレーム画像２０の人物Ｐ１の位置に対応する仮想空間Ｓ内の位置がそのまま認識位置として推定される。
図５Ｃに示すタイミングでは、視野外領域２１にて、人物Ｐ１が照明装置Ｌの方へ移動している。本実施形態では、直近のフレーム画像２０、すなわち図５Ａに示すタイミングでのフレーム画像２０の人物Ｐ１の位置に対応する仮想空間Ｓ内の位置が認識位置として維持される。
これにより、ユーザ５の脳内にはない顕著性が発生してしまうことを抑制することが可能となり、高い精度で視野情報を予測することが可能となる。 For example, at the timing shown in Fig. 5B, person P1 is not included in frame image 20. For person P1 not included in frame image 20, the position in virtual space S corresponding to the position of person P1 in frame image 20 at the timing shown in Fig. 5A is directly estimated as the recognized position.
At the timing shown in Fig. 5C, in the out-of-field area 21, person P1 is moving toward the lighting device L. In this embodiment, the position in the virtual space S corresponding to the position of person P1 in the most recent frame image 20, i.e., the frame image 20 at the timing shown in Fig. 5A, is maintained as the recognized position.
This makes it possible to prevent the occurrence of saliency that does not exist in the brain of the user 5, and makes it possible to predict visual field information with high accuracy.

例えば、認識対象オブジェクトがレンダリングされるたびに、レンダリングされたフレーム画像２０内の認識対象オブジェクトの位置に対応する仮想空間Ｓ内の位置を、認識位置として更新して記憶する。これにより、現在時刻のフレーム画像２０に含まれない認識対象オブジェクトについては、直近のフレーム画像２０内の認識対象オブジェクトの位置に対応する仮想空間Ｓ内の位置が認識位置として保持される。このようにフレームごとに認識位置の更新が実行されてもよい。 For example, each time a recognition target object is rendered, the position in virtual space S corresponding to the position of the recognition target object in the rendered frame image 20 is updated and stored as the recognition position. As a result, for recognition target objects that are not included in the frame image 20 at the current time, the position in virtual space S corresponding to the position of the recognition target object in the most recent frame image 20 is held as the recognition position. In this way, the recognition position may be updated for each frame.
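
A minimal sketch of this per-frame update is shown below. The dictionary layout and the function name are assumptions made for illustration only; the description does not specify how the recognition positions are stored.

```python
def update_recognition_positions(recognized, rendered_positions):
    """Update the stored recognition positions for one frame.

    recognized: dict mapping object name -> last recognition position
        (modified in place and returned).
    rendered_positions: dict mapping object name -> position in the virtual
        space S, only for objects rendered in the current frame image.
    """
    for name, position in rendered_positions.items():
        # Rendered objects are newly set as recognition targets or updated.
        recognized[name] = position
    # Objects absent from rendered_positions keep their previous entry.
    return recognized


# Timing of Fig. 5A: only person P1 is rendered (coordinates are hypothetical).
state = update_recognition_positions({}, {"P1": (0.0, 0.0, 5.0)})
# Timing of Fig. 5B: P1 has left the frame image, P2 is rendered.
state = update_recognition_positions(state, {"P2": (3.0, 0.0, 4.0)})
```

After the second call, the entry for P1 still holds the position from the timing of Fig. 5A, which corresponds to the recognition position being maintained even while P1 moves in the out-of-field area.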

（実施例２）
ユーザ５は、オブジェクトを視界から外しても、足音や声といったそのオブジェクトから発せられる音を聞きとり、そこからオブジェクトの位置を把握することがある。本実施例２は、そのような場合を想定して考案されている
図１０は、本実施例にてシーン記述情報として用いられるシーン記述ファイルで記述される情報の一例を示す模式図である。
本実施例では、６ＤｏＦコンテンツを生成する際に、シーン記述ファイルで記述されている各オブジェクト情報に、オブジェクトから発せられる足音や声などのオーディオデータが紐づけて生成される。
レンダラはこの情報を元に、各オブジェクト位置から、紐づけられたオーディオデータが鳴るようにレンダリングするものとする。 Example 2
Even if an object leaves the field of view of the user 5, the user 5 may hear sounds emitted by the object, such as footsteps or a voice, and grasp the object's position from them. Example 2 is designed with such a case in mind.
Fig. 10 is a schematic diagram showing an example of information described in a scene description file used as scene description information in this example.
In this example, when 6DoF content is generated, audio data such as footsteps and voices emitted from an object is generated in association with each piece of object information described in the scene description file.
Based on this information, the renderer renders the associated audio data so that it is heard from the position of each object.

図１０に示す例では、オブジェクト情報として、以下の情報が格納される。
Ｎａｍｅ…オブジェクトの名前
Ｐｏｓｉｔｉｏｎ…オブジェクトの位置
Ｕｒｌ…３次元オブジェクトデータのアドレス
Ａｕｄｉｏ…オブジェクトから発せられる音のオーディオデータの名前
また図１０に示す例では、オブジェクト情報に紐づけられるオーディオデータ情報として、以下の情報が格納される。
Ｎａｍｅ…オーディオデータの名前
Ｕｒｌ…オーディオデータのアドレス In the example shown in FIG. 10, the following information is stored as object information:
Name...name of the object
Position...position of the object
Url...address of the three-dimensional object data
Audio...name of the audio data of the sound emitted from the object
In the example shown in Fig. 10, the following information is also stored as audio data information linked to the object information:
Name...name of the audio data
Url...address of the audio data
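
The object information and audio data information listed above could be held in memory roughly as follows. This is a sketch under assumptions: the class names, the lookup helper, the coordinates, and the placeholder address strings are all hypothetical, and the actual file format of the scene description is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class AudioData:
    name: str  # Name: name of the audio data
    url: str   # Url: address of the audio data


@dataclass
class SceneObject:
    name: str                             # Name: name of the object
    position: Tuple[float, float, float]  # Position: position of the object
    url: str                              # Url: address of the 3D object data
    audio: Optional[str] = None           # Audio: name of the linked audio data


def linked_audio(obj: SceneObject, audio_table: Dict[str, AudioData]) -> Optional[AudioData]:
    """Resolve the audio data linked to an object, or None if there is none."""
    if obj.audio is None:
        return None
    return audio_table.get(obj.audio)


# Hypothetical entries modeled on the hide-and-seek scene of Fig. 10.
audio_table = {
    "Voice and Footsteps 1": AudioData("Voice and Footsteps 1", "<address of audio data>"),
}
hider1 = SceneObject("Hiding Person 1", (2.0, 0.0, -1.0),
                     "<address of 3D object data>", "Voice and Footsteps 1")
```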

図１０に示す例では、かくれんぼのシーンにおいて、「隠れる人１」「隠れる人２」の映像オブジェクト情報に、「声や足音１」「声や足音２」のオーディオデータ情報が紐づけられている。「声や足音１」「声や足音２」のオーディオデータは、例えば鬼の「もういいかい？」という呼びかけに対する「まあだだよ」や「もういいよ」といった返答や、鬼から逃げ回る足音等が挙げられる。 In the example shown in Fig. 10, in a hide-and-seek scene, audio data information for "Voice and Footsteps 1" and "Voice and Footsteps 2" is linked to video object information for "Hiding Person 1" and "Hiding Person 2." Examples of audio data for "Voice and Footsteps 1" and "Voice and Footsteps 2" include replies such as "Not yet" or "Ready" to the seeker's call "Are you ready?", as well as the footsteps of someone running away from the seeker.

認識位置推定部１９は、レンダラから、各オブジェクトに対して紐づけられたオーディオデータの有無情報、及び現在の音量情報等を受け取る。そして、ユーザ５が聞きとることが可能であり認識できると判断するための基準とする音量レベルである基準値（閾値）を超えているかどうかによって、ユーザ５が現在、オブジェクトから発せられる音を聞いて認識しているかどうかを判断する。なお基準の音量レベルは、任意に設定されてよい。
紐づけられたオーディオデータがあり、かつ基準とする音量レベルを超えていたとする。この場合、そのオブジェクトの仮想空間Ｓ内における現在の位置、すなわち音の仮想空間Ｓ内における発生位置が、ユーザ５が認識している認識位置として推定される。
すなわち、本実施例では、「現在時刻のフレーム画像」に含まれない認識対象オブジェクトについて、認識対象オブジェクトが発する音をユーザ５が認識したと判定した場合に、音の仮想空間Ｓ内における発生位置が、認識位置として推定される。 The recognition position estimation unit 19 receives from the renderer information on the presence or absence of audio data associated with each object, current volume information, etc. Then, it determines whether the user 5 is currently hearing and recognizing the sound emitted from the object based on whether the volume level exceeds a reference value (threshold) that is a volume level used as a reference for determining whether the user 5 can hear and recognize the sound. Note that the reference volume level may be set arbitrarily.
If there is associated audio data and the volume level exceeds a reference level, the current position of the object in the virtual space S, i.e., the position where the sound is generated in the virtual space S, is estimated as the recognized position recognized by the user 5.
That is, in this embodiment, when it is determined that the user 5 has recognized the sound emitted by a recognition target object that is not included in the “frame image at the current time”, the position where the sound originated within the virtual space S is estimated as the recognition position.

例えば、図１０に示すシーン記述ファイルに基づいて構成されるシーンでは、かくれんぼをしている人たちの声に基づいて、隠れる人たちの認識位置が推定される。例えば、フレーム画像に含まれない「隠れる人１」から「まあだだよ」という声が発せられたとする。その声の音量が基準値を超えている場合には、ユーザ５によりその声が聞こえたと判断され、「隠れる人１」の仮想空間Ｓ内における現在の位置が認識位置として推定される。
これにより、ユーザ５の脳内にはない顕著性が発生してしまうことを抑制することが可能となり、高い精度で視野情報を予測することが可能となる。
なお、認識対象オブジェクトから発せられる音が途切れ、聞こえなくなった場合は、最後に聞いた位置、すなわち過去の直近に聞いた位置が、認識位置として維持される。 For example, in a scene configured based on the scene description file shown in Fig. 10, the recognized positions of the people hiding are estimated based on the voices of the people playing hide-and-seek. For example, suppose the reply "Not yet" is uttered by "Hiding Person 1," who is not included in the frame image. If the volume of the voice exceeds the reference value, it is determined that the user 5 has heard the voice, and the current position of "Hiding Person 1" in the virtual space S is estimated as the recognized position.
This makes it possible to prevent the occurrence of saliency that does not exist in the brain of the user 5, and makes it possible to predict visual field information with high accuracy.
When the sound emitted from the object to be recognized is interrupted and can no longer be heard, the position where the sound was last heard, that is, the position where the sound was most recently heard in the past, is maintained as the recognition position.
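
The sound-based estimation of Example 2 can be sketched as follows. The function signature and argument names are assumptions introduced for illustration; the reference volume level itself may be set arbitrarily, as stated above.

```python
def estimate_by_sound(last_recognized, has_audio, volume, threshold, current_position):
    """Estimate the recognition position of an object outside the frame image.

    last_recognized: previously held recognition position (or None).
    has_audio: whether audio data is associated with the object.
    volume: current volume reported by the renderer.
    threshold: reference volume level above which the user is assumed to hear the sound.
    current_position: current position of the object (the sound source) in the virtual space S.
    """
    if has_audio and volume > threshold:
        # The user is judged to have heard the sound, so the position of
        # the sound source becomes the recognition position.
        return current_position
    # No audible sound: keep the position last seen or last heard.
    return last_recognized
```

Because the previously held position is returned when no sound is audible, the position where the sound was last heard is maintained once the sound stops.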

（実施例３）
ユーザ５は現在視聴しているシーンと同様のシーンを過去に視聴したことがある場合、その時の記憶から連想してオブジェクトの位置を把握することがある。本実施例３は、そのような場合を想定して考案されており、現在と同様シーンにおける過去の視聴情報を元に、ユーザ５の脳内の認識位置が推定される。
図１１及び図１２は、本実施例にてシーン記述情報として用いられるシーン記述ファイルで記述される情報の一例を示す模式図である。
本実施例では、６ＤｏＦコンテンツを生成する際に、シーン記述ファイルで記述されている各オブジェクト情報に、各オブジェクトの現在のシーンにおける役割情報と、その役割時の定位置情報（ワールド座標）が格納される。
すなわち、シーン記述ファイルに、認識対象オブジェクトの役割を表す役割情報と、役割に関する定位置（ワールド座標）が格納される。 Example 3
If the user 5 has previously viewed a scene similar to the scene currently being viewed, the user may grasp the position of an object by association with his or her memory of that scene. Example 3 is designed with such a case in mind, and the recognized position in the brain of the user 5 is estimated based on past viewing information for a scene similar to the current scene.
Figs. 11 and 12 are schematic diagrams showing an example of information described in a scene description file used as scene description information in this example.
In this embodiment, when generating 6DoF content, the object information described in the scene description file stores the role information of each object in the current scene and its fixed position information (world coordinates) when playing that role.
That is, the scene description file stores role information that indicates the role of the object to be recognized, and a fixed position (world coordinates) related to the role.

図１１及び図１２に示す例では、オブジェクト情報として、以下の情報が格納される。
Ｎａｍｅ…オブジェクトの名前
Ｐｏｓｉｔｉｏｎ…オブジェクトの位置
Ｕｒｌ…３次元オブジェクトデータのアドレス
Ｒｏｌｅ…役割情報
ＦｉｘｅｄＰｏｓ…役割に関する定位置情報 In the examples shown in FIGS. 11 and 12, the following information is stored as object information:
Name...name of the object
Position...position of the object
Url...address of the three-dimensional object data
Role...role information
FixedPos...fixed position information related to the role

図１１及び図１２に示す例では、野球のシーンにおいて、「Ａ田Ａ夫」選手、及び「Ｂ川Ｂ助」選手の攻撃時と、守備時とにおける役割情報と定位置情報とが格納されている。図１１は攻撃時おけるシーン記述ファイルであり、図１２は守備時におけるシーン記述ファイルである。
攻守が交代するたびに、攻撃時のシーン記述ファイルと、守備時のシーン記述ファイルが、互いに更新される。また攻撃時や守備時において、オブジェクトの役割が変わる場合等には、攻撃時及び守備時の各シーン記述ファイルが更新される。
例えば、攻守交替に応じて図１１から図１２へシーン記述ファイルが更新される場合には、「Ａ田Ａ夫」選手は、「次打者」の役割から「一塁手」の役割に変わる。「Ｂ川Ｂ助」選手は、「打者」の役割から「投手」の役割に変わる。
なお、「打者」「次打者」「投手」「一塁手」のＦｉｘｅｄＰｏｓは、以下の位置に関するワールド座標での位置であるとする。
「打者」…バッターボックス
「次打者」…ネクストバッター
「投手」…ピッチャーマウンド
「一塁手」…一塁の位置
これらＦｉｘｅｄＰｏｓは、その役割における一般的な位置を示し、実際のオブジェクトの位置を示すＰｏｓｉｔｉｏｎとは異なる。
認識位置推定部１９は、これらの情報をもとに、ユーザ５が現在と同様のシーンを過去にみたことがあるかどうかの判断を行い、認識位置を推定する。 In the examples shown in Figs. 11 and 12, in a baseball scene, role information and fixed position information are stored for players "A-da A-o" and "B-gawa B-suke" both when on offense and when on defense. Fig. 11 shows the scene description file when on offense, and Fig. 12 shows the scene description file when on defense.
Every time the offense and defense switch, the scene description file is updated from the offense version to the defense version or vice versa. Also, if the role of an object changes during offense or during defense, the scene description files for offense and defense are each updated.
For example, when the scene description file is updated from Fig. 11 to Fig. 12 in response to a change of offense and defense, player "A-da A-o" changes from the role of "next batter" to the role of "first baseman," and player "B-gawa B-suke" changes from the role of "batter" to the role of "pitcher."
The FixedPos values of the "batter," "next batter," "pitcher," and "first baseman" are assumed to be positions in world coordinates of the following locations.
"Batter"...batter's box
"Next batter"...next batter's circle
"Pitcher"...pitcher's mound
"First baseman"...position of first base
These FixedPos values indicate the general position for each role, and are different from Position, which indicates the actual position of the object.
Based on this information, the recognition position estimation unit 19 determines whether the user 5 has previously seen a scene similar to the current scene, and estimates the recognition position.
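
The relationship between Role, FixedPos, and Position, and the role change caused by a scene update, can be illustrated with the sketch below. All coordinates, the dictionary layout, the placeholder address string, and the helper names are hypothetical; the player names are romanized placeholders for the names shown in Figs. 11 and 12.

```python
# Hypothetical world coordinates of the fixed position for each role.
FIXED_POS = {
    "batter": (0.0, 0.0, 0.0),            # batter's box
    "next batter": (4.0, 0.0, -3.0),      # next batter's circle
    "pitcher": (0.0, 0.0, 18.0),          # pitcher's mound
    "first baseman": (19.0, 0.0, 19.0),   # position of first base
}


def make_object(name, position, role):
    """Build one object entry with the fields Name, Position, Url, Role, FixedPos."""
    return {
        "Name": name,
        "Position": position,                  # actual position of the object
        "Url": "<address of 3D object data>",  # placeholder, not a real address
        "Role": role,
        "FixedPos": FIXED_POS[role],           # general position for the role
    }


def changed_roles(old_scene, new_scene):
    """Return {name: (old_role, new_role)} for objects whose Role differs."""
    old_roles = {obj["Name"]: obj["Role"] for obj in old_scene}
    return {
        obj["Name"]: (old_roles[obj["Name"]], obj["Role"])
        for obj in new_scene
        if obj["Name"] in old_roles and old_roles[obj["Name"]] != obj["Role"]
    }


# Scene description when on offense (Fig. 11) and on defense (Fig. 12).
offense = [
    make_object("A-da A-o", (4.5, 0.0, -2.5), "next batter"),
    make_object("B-gawa B-suke", (0.2, 0.0, 0.1), "batter"),
]
defense = [
    make_object("A-da A-o", (18.0, 0.0, 20.0), "first baseman"),
    make_object("B-gawa B-suke", (0.0, 0.0, 18.0), "pitcher"),
]
```

Calling `changed_roles(offense, defense)` in this sketch reports both players, which corresponds to the role changes described for the update from Fig. 11 to Fig. 12.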

図１３及び１４は、本実施例３における、認識対象オブジェクトの認識位置の推定について説明するための模式図である。
図１３及び１４では、仮想空間Ｓとして野球のスタジアムが構成され、「Ａ田Ａ夫」選手３２、及び「Ｂ川Ｂ助」選手３３も、仮想空間Ｓ内に配置される。
図中の目の視線の先が、ユーザ５の視野（予測された視野）７に対応し、その視野７の領域のフレーム画像（予測フレーム画像）２０が生成される。 Figs. 13 and 14 are schematic diagrams for explaining estimation of the recognition position of the recognition target object in Example 3.
In Figs. 13 and 14, a baseball stadium is configured as the virtual space S, and player "A-da A-o" 32 and player "B-gawa B-suke" 33 are also placed within the virtual space S.
The direction of the line of sight of the eyes in the figure corresponds to the field of view (predicted field of view) 7 of the user 5, and a frame image (predicted frame image) 20 of the area of the field of view 7 is generated.

図１３Ａにて、まずユーザ５は、「Ａ田Ａ夫」選手３２が一塁手をしているシーンを視聴している。つまり、ユーザ５はこの「Ａ田Ａ夫」選手３２が守備のシーンでは、一塁にいるということを把握したことになる。もちろん、「Ａ田Ａ夫」選手３２は、認識対象オブジェクトとして設定される。
その後、攻守交替により、「Ａ田Ａ夫」選手３２が「次打者」となり、ネクストバッターサークルに移動する。図１３Ｂに示すように、ユーザ５は、ネクストバッターサークルに移動する「Ａ田Ａ夫」選手３２を目で追って視聴する。
その後、図１４Ａに示すように、ユーザ５は、「打者」となった「Ｂ川Ｂ助」選手３３の方へ視野７を向け、バッティングを観戦する。この際に、認識対象オブジェクトである「Ａ田Ａ夫」選手３２は、視野外領域に存在することになり、フレーム画像２０には含まれなくなる。
その後、攻守が交代し、「Ａ田Ａ夫」選手３２は「一塁手」として一塁の位置に移動する。一方で、図１４Ｂに示すように、ユーザ５は、観客席に視野７を向けて、応援席での観客の様子を視聴している。「Ａ田Ａ夫」選手３２は、ユーザ５の視野外領域において一塁に移動し、ユーザ５はその移動を視聴していない。
ユーザ５は、攻守交替により、「Ａ田Ａ夫」選手３２が現在守備に回ったことは把握できる。そして、過去の図１３Ａにおける視聴の経験により、「Ａ田Ａ夫」選手３２が守備の時には、一塁の位置にいることを知っている。従って、ユーザ５の脳内では「Ａ田Ａ夫」選手３２は、図１３Ａにおける視聴と同様に一塁の位置にいるであろうと連想し、脳内位置を更新するということが想定可能である。
認識位置推定部１９は、この脳内の更新に合わせて、「Ａ田Ａ夫」選手３２の認識位置を、「一塁手」の役割に関連する定位置である「一塁の位置」に推定する。すなわち、図１３Ａにおける視聴にて現在と同様に「Ａ田Ａ夫」選手３２の役割が「一塁手」であるシーンを視聴していること（「Ａ田Ａ夫」選手３２をフレーム画像２０内にレンダリングしたこと）、及び図１４Ａにおける視聴から図１４Ｄにおける視聴へのシーンアップデートにより、「Ａ田Ａ夫」選手３２の役割が再び「一塁手」になったことに基づいて、「Ａ田Ａ夫」選手３２のユーザ５の脳内の認識位置を「一塁の位置」に推定する。
このように本実施例３では、認識対象オブジェクトの過去の視聴時(レンダリング時)のシーンでの役割情報が保持され、現在その認識対象オブジェクトが視野外にある場合、シーンアップデートでそのオブジェクトの役割が更新され、かつその役割の時の視聴経験がある場合、その役割の定位置に認識位置を推定する。 In Fig. 13A, the user 5 first watches a scene in which player "A-da A-o" 32 plays first base. That is, the user 5 has grasped that player "A-da A-o" 32 is at first base in defensive scenes. Of course, player "A-da A-o" 32 is set as a recognition target object.
After that, the offense and defense switch, and player "A-da A-o" 32 becomes the "next batter" and moves to the next batter's circle. As shown in Fig. 13B, the user 5 follows player "A-da A-o" 32 with his or her eyes as the player moves to the next batter's circle.
After that, as shown in Fig. 14A, the user 5 turns the field of view 7 toward player "B-gawa B-suke" 33, who is now the "batter," and watches the batting. At this time, player "A-da A-o" 32, who is a recognition target object, is in the out-of-field area and is no longer included in the frame image 20.
After that, the offense and defense switch again, and player "A-da A-o" 32 moves to the position of first base as the "first baseman." Meanwhile, as shown in Fig. 14B, the user 5 turns the field of view 7 toward the spectator seats and watches the spectators in the cheering section. Player "A-da A-o" 32 moves to first base in the area outside the field of view of the user 5, and the user 5 does not see this movement.
From the change of offense and defense, the user 5 can understand that player "A-da A-o" 32 is now playing defense. Furthermore, from the past viewing experience of Fig. 13A, the user 5 knows that player "A-da A-o" 32 is at first base when playing defense. It can therefore be assumed that the user 5 associates player "A-da A-o" 32 with the position of first base, just as in the viewing of Fig. 13A, and updates the position held in his or her mind.
In accordance with this update in the brain, the recognition position estimation unit 19 estimates the recognized position of player "A-da A-o" 32 to be the "position of first base," which is the fixed position associated with the role of "first baseman." That is, the estimate is based on two facts: in the viewing in Fig. 13A, a scene in which the role of player "A-da A-o" 32 is "first baseman," as it is now, was viewed (player "A-da A-o" 32 was rendered in the frame image 20); and the scene update from the viewing in Fig. 14A to the viewing in Fig. 14B has made the role of player "A-da A-o" 32 "first baseman" again.
In this way, in Example 3, the role information of the recognition target object in scenes viewed (rendered) in the past is retained. When the recognition target object is currently outside the field of view, its role has been updated by a scene update, and the user has viewing experience of the object in that role, the recognition position is estimated to be the fixed position of that role.

図１４Ｂにおける視聴において、「Ａ田Ａ夫」選手３２は、「現在時刻のフレーム画像」に含まれない所定の役割情報（「一塁手」）が設定された認識対象オブジェクトに相当する。
そして、現在時刻までに、同じ役割情報（「一塁手」）が設定された「Ａ田Ａ夫」選手３２がレンダリングされたことがある場合に、役割に関連する定位置（「一塁の位置」）が認識位置として推定される。
これにより、ユーザの脳内にはない顕著性が発生してしまうことを抑制することが可能となり、高い精度で視野情報を予測することが可能となる。 In the viewing in Fig. 14B, player "A-da A-o" 32 corresponds to a recognition target object that is not included in the "frame image at the current time" and for which predetermined role information ("first baseman") is set.
Then, if player "A-da A-o" 32 with the same role information ("first baseman") set has been rendered at some point up to the current time, the fixed position associated with the role ("position of first base") is estimated as the recognized position.
This makes it possible to prevent the occurrence of saliency that does not exist in the user's brain, and makes it possible to predict visual field information with high accuracy.
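
The role-based estimation of Example 3 can be sketched as follows. Keeping the viewing history as a set of (object name, role) pairs is an assumption made for this sketch, as are the function name, the dictionary keys, and the coordinates.

```python
def estimate_by_role(obj, in_current_frame, viewed_roles, last_recognized):
    """Estimate the recognition position from role information.

    obj: dict with keys "Name", "Role", and "FixedPos".
    in_current_frame: whether the object is rendered in the current frame image.
    viewed_roles: set of (name, role) pairs that have been rendered so far.
    last_recognized: previously held recognition position.
    """
    key = (obj["Name"], obj["Role"])
    if not in_current_frame and key in viewed_roles:
        # The user has already watched this object in this role, so the
        # fixed position of the role is assumed to come to mind.
        return obj["FixedPos"]
    return last_recognized


# Viewing of Fig. 13A: the player was rendered as "first baseman".
viewed = {("A-da A-o", "first baseman")}
player = {"Name": "A-da A-o", "Role": "first baseman", "FixedPos": (19.0, 0.0, 19.0)}
# Viewing of Fig. 14B: the player is outside the field of view.
estimated = estimate_by_role(player, False, viewed, (4.0, 0.0, -3.0))
```

In this sketch, `estimated` is the fixed position of first base rather than the next batter's circle where the player was last seen.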

なお、「Ａ田Ａ夫」選手３２の役割が「一塁手」であるシーンを視聴しているのが、過去の他の野球の試合での視聴でもよい。すなわち、現在観戦している試合のみならず、過去に観戦した他の試合で、「Ａ田Ａ夫」選手３２の役割が「一塁手」であるシーンが視聴された場合でも、「一塁の位置」が認識位置として推定されてもよい。
すなわち、現在時刻までに、同じ役割情報が設定された認識対象オブジェクトがレンダリングさえされていれば、役割に関連する定位置を認識位置として推定することが可能である。 The viewing of a scene in which the role of player "A-da A-o" 32 is "first baseman" may have taken place in another baseball game in the past. That is, the "position of first base" may be estimated as the recognized position not only when such a scene was viewed in the game currently being watched but also when it was viewed in another game watched in the past.
That is, as long as a recognition target object to which the same role information is set has been rendered up to the current time, it is possible to estimate a fixed position related to the role as the recognition position.

過去に、「Ａ田Ａ夫」選手３２が定位置である「一塁の位置」にいる状態がレンダリングされた場合に、「一塁の位置」が認識位置として推定可能であってもよい。
すなわち、現在時刻までに同じ役割情報（「一塁手」）が設定された認識対象オブジェクト（「Ａ田Ａ夫」選手３２）が役割に関連する定位置（「一塁の位置」）にいる状態がレンダリングされたことがある場合に、役割に関連する定位置（「一塁の位置」）が認識位置として推定されてもよい。
これにより、守備時には「Ａ田Ａ夫」選手３２は「一塁の位置」にいるということがユーザにとって確実に把握されている場合に、認識位置の推定が可能となる。一方で、役割情報が設定された認識対象オブジェクトは、ほとんどの場合「定位置」にいることが多いので、「定位置」にいる状態がレンダリングされる可能性が高い。 The "position of first base" may also be made estimable as the recognized position on the condition that a state in which player "A-da A-o" 32 was at his fixed position, the "position of first base," has been rendered in the past.
That is, if a state in which a recognition target object (player "A-da A-o" 32) with the same role information ("first baseman") set was at the fixed position related to the role ("position of first base") has been rendered at some point up to the current time, the fixed position related to the role ("position of first base") may be estimated as the recognition position.
This makes it possible to estimate the recognition position in cases where the user has reliably grasped that player "A-da A-o" 32 is at the "position of first base" when on defense. On the other hand, a recognition target object for which role information is set is at its "fixed position" in most cases, so it is highly likely that a state in which it is at the "fixed position" will be rendered.

（実施例１）～（実施例３）の処理を統合して、認識対象オブジェクトの認識位置の推定を実行することも可能である。例えば（実施例１）（実施例２）（実施例３）の順番で優先順位をつけて実行する。
まず一番優先度が高い情報は目からの視覚情報によるものとし、（実施例１）を実行する。すなわち、ユーザ５が認識対象オブジェクトを目で認識したと思われる位置を、認識位置として推定する。
ユーザ５が認識対象オブジェクトを視認している場合は、その目で確認した位置がその認識対象オブジェクトの認識位置となる。認識位置推定部１９は、ユーザ５が認識対象オブジェクトを視認しているかどうかの判断を、認識対象オブジェクトをフレーム画像（予測フレーム画像）２０内にレンダリングしたかどうかで行い、そのレンダリングした時の認識対象オンブジェクトの仮想空間Ｓ上の位置を認識位置として推定する。
この場合は、認識位置の推定に他の情報は不要であるため、（実施例２）で使用する音情報や、（実施例３）で使用する役割情報及び定位置情報の取得は行われない。認識対象オブジェクトが視野７から外れた（すなわちレンダリングされなくなった）場合は、（実施例２）で使用する音情報や、（実施例３）で使用する役割情報及び定位置情報が取得される。これらの情報がない場合は、最後に視聴した位置（最後にレンダリングした時の位置）が認識位置として維持される。 It is also possible to integrate the processes of (Example 1) to (Example 3) to estimate the recognition position of the recognition target object. For example, the processes are executed with priority in the order of (Example 1), (Example 2), and (Example 3).
First, the information with the highest priority is determined to be visual information from the eyes, and (Example 1) is executed. That is, the position where the user 5 is thought to have recognized the recognition target object with his/her eyes is estimated as the recognition position.
When the user 5 is visually recognizing the recognition target object, the position confirmed with the eyes becomes the recognition position of the recognition target object. The recognition position estimation unit 19 determines whether the user 5 is visually recognizing the recognition target object based on whether the recognition target object has been rendered in a frame image (predicted frame image) 20, and estimates the position of the recognition target object in the virtual space S at the time of rendering as the recognition position.
In this case, since no other information is required to estimate the recognition position, the sound information used in (Example 2) and the role information and fixed position information used in (Example 3) are not acquired. When the object to be recognized moves out of the field of view 7 (i.e., is no longer rendered), the sound information used in (Example 2) and the role information and fixed position information used in (Example 3) are acquired. When this information is not available, the last viewing position (the position at the time of last rendering) is maintained as the recognition position.

次に優先度が高い情報は、その認識対象オブジェクトから発せられる音情報によるものとし、（実施例２）を実行する。すなわち、認識対象オブジェクトに紐づくオーディオデータの有無と、そのオーディオデータの現在の発生状況情報（発生位置や音量情報等）に基づいて、認識位置が推定される。
認識対象オブジェクトが視野から外れ視覚情報がない場合に、オブジェクトを音で認識したと思われる位置が、認識位置として推定される。 The information with the next highest priority is the sound information emitted from the recognition target object, and (Example 2) is executed. That is, the recognition position is estimated based on the presence or absence of audio data linked to the recognition target object and the current generation status information of the audio data (generation position, volume information, etc.).
When the object to be recognized is out of the field of view and there is no visual information, the position where the object is thought to have been recognized by sound is estimated as the recognized position.

次に優先度が高い情報として、役割情報及び定位置情報が取得され、（実施例３）が実行される。すなわち、現在と同様のシーンにおけるユーザ５の過去の視聴経験の有無情報と、そのシーンにおける認識対象オブジェクトの定位置情報に基づいて、認識位置が推定される。
視覚情報及び音情報がともに取得されない場合に、現在と同様のシーンの過去の視聴経験から、位置を連想したと思われる位置が、認識位置として推定される。
視覚情報、音情報、役割情報及び定位置情報のいずれの情報もない場合は、ユーザ５はそのオブジェクトの存在に気づいていないため、そのオブジェクトへの注意はゼロとなる（認識対象オブジェクトの設定はなく、認識位置もなし）。 As the information with the next highest priority, the role information and the fixed position information are acquired, and (Example 3) is executed. That is, the recognition position is estimated based on the information on whether the user 5 has had a past viewing experience in a scene similar to the current scene and the fixed position information of the object to be recognized in that scene.
When neither visual information nor sound information is acquired, the position that the user is thought to have inferred by association from past viewing experience of a scene similar to the current one is estimated as the recognized position.
If there is no visual information, sound information, role information, or fixed position information, the user 5 is unaware of the existence of the object, and therefore pays zero attention to the object (there is no setting of an object to be recognized, and there is no recognition position).
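
The priority order described above (visual information, then sound information, then role information, and otherwise the last held position) can be sketched as a simple fallback chain. Representing "no information" as `None` and the function name are assumptions made for this sketch.

```python
def integrate_estimates(visual_pos, sound_pos, role_pos, last_pos):
    """Pick the recognition position by priority.

    visual_pos: position from (Example 1), or None if the object is not rendered.
    sound_pos: position from (Example 2), or None if no audible sound.
    role_pos: fixed position from (Example 3), or None if not applicable.
    last_pos: previously held recognition position, or None if the object
        has never been recognized.
    """
    for candidate in (visual_pos, sound_pos, role_pos):
        if candidate is not None:
            return candidate
    # No new information: keep the last position. If this is also None,
    # the object is not a recognition target and has no recognition position.
    return last_pos
```

A practical implementation would likely evaluate the lower-priority inputs lazily, since the text states that sound and role information are not acquired while the object is being rendered; the sketch takes all candidates as arguments only to keep the priority order visible.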

図１５は、認識対象オブジェクトの認識位置の推定例を示すフローチャートである。図１５に示す処理は、（実施例１）～（実施例３）の処理を統合した処理例ともいえる。
シーン内の全ての認識対象オブジェクトに対するレンダリング情報及びシーン情報が取得される（ステップ２０１）。
レンダリング情報は、認識対象オブジェクトのレンダリングに関する任意の情報を含む。ここでは、レンダリング情報として、現在時刻までの認識対象オブジェクトのレンダリングの履歴情報やフレーム画像２０内の認識対象オブジェクトの位置情報等が含まれる。
シーン情報は、認識対象オブジェクトに関するシーン記述情報を含む。例えば、現在時刻までのシーン記述情報の履歴情報等が含まれる。
なおステップ３０１では、「現在時刻のフレーム画像」に初めてレンダリングされるオブジェクトも認識対象オブジェクトとして設定され、レンダリング情報及びフレーム情報が取得される。 Fig. 15 is a flowchart showing an example of estimating the recognition position of a recognition target object. The process shown in Fig. 15 can be said to be an example of a process that integrates the processes of (Example 1) to (Example 3).
Rendering information and scene information for all objects to be recognized in the scene are obtained (step 201).
The rendering information includes any information related to the rendering of the recognition target object. Here, the rendering information includes rendering history information of the recognition target object up to the current time, position information of the recognition target object within the frame image 20, etc.
The scene information includes scene description information relating to the object to be recognized, such as history information of the scene description information up to the current time.
In step 201, an object that is rendered for the first time in the "frame image at the current time" is also set as an object to be recognized, and rendering information and frame information are obtained.

未処理の認識対象オブジェクトがあるか否か判定される（ステップ２０２）。
未処理の認識対象オブジェクトがある場合（ステップ２０２のＹｅｓ）、未処理の認識対象オブジェクトが１つ選択され、ステップ２０３以下の処理が実行される。 It is determined whether there are any unprocessed objects to be recognized (step 202).
If there are any unprocessed objects to be recognized (Yes in step 202), one of the unprocessed objects to be recognized is selected, and the processing from step 203 onwards is executed.

選択された認識対象オブジェクトは「現在時刻のフレーム画像」内に含まれるか否か（すなわちレンダリングされているか否か）が判定される（ステップ２０３）。認識対象オブジェクトが「現在時刻のフレーム画像」に含まれる場合（ステップ２０３のＹｅｓ）、その認識対象オブジェクトの現在の役割情報が、役割リストに追加される（ステップ２０４）。
役割リストは、現在時刻までに役割情報が設定された認識対象オブジェクトを視聴したことがある（すなわちレンダリング済みである）場合に、その役割情報が入力されるリストである。認識対象オブジェクトに役割情報が設定されていない場合は、役割リストへの追加は実行されない。役割リストは、役割視聴済リストともいえる。
認識位置が現在の認識対象オブジェクトの位置に推定される（ステップ２０５）。ここでは、現在の認識対象オブジェクトの位置は、「現在時刻のフレーム画像」の認識対象オブジェクトの位置に対応する仮想空間Ｓ内の位置に相当する。「現在時刻のフレーム画像」に初めてレンダリングされるオブジェクトは、このステップ２０５にて、最初の認識位置が推定される。
現在時刻までに認識位置が推定されている認識対象オブジェクトは、認識位置が更新される。もちろん、過去に推定された認識位置と同じ結果となる場合もあり得る。
この認識対象オブジェクトの認識位置の推定は完了し、ステップ２０２に戻る。 It is determined whether the selected object to be recognized is included in the "frame image at the current time" (i.e., whether it has been rendered) (step 203). If the object to be recognized is included in the "frame image at the current time" (Yes in step 203), the current role information of the object to be recognized is added to the role list (step 204).
The role list is a list into which role information is input when a recognition target object with role information set thereto has been viewed (i.e., has been rendered) up to the current time. If role information has not been set for the recognition target object, it is not added to the role list. The role list can also be called a role viewed list.
The recognition position is estimated to be the current position of the object to be recognized (step 205). Here, the current position of the object to be recognized corresponds to the position in the virtual space S that corresponds to the position of the object to be recognized in the "frame image at the current time". In step 205, the initial recognition position of an object that is rendered for the first time in the "frame image at the current time" is estimated.
For an object to be recognized whose recognition position has already been estimated by the current time, the recognition position is updated. Of course, the result may be the same as the previously estimated recognition position.
The estimation of the recognition position of this object to be recognized is now complete, and the process returns to step 202.

認識対象オブジェクトが「現在時刻のフレーム画像」内に含まれない場合（ステップ２０３のＮｏ）、その認識対象オブジェクトにオーディオデータが紐づいていて、かつ現在のレンダリング時の音量は基準値を超えているか否かが判定される（ステップ２０６）。
ステップ２０６が肯定の場合（ステップ２０６のＹｅｓ）、その認識対象オブジェクトの現在の役割（役割情報）が、役割リストに追加される（ステップ２０４）。
このように本実施形態では、現在時刻までに、同じ役割情報が設定された認識対象オブジェクトが発する音をユーザ５が認識したと判定した場合でも、同じ役割情報が設定された認識対象オブジェクトがレンダリングされた場合と同様に、役割リストへ役割の追加が実行される。
認識位置が現在の認識対象オブジェクトの位置に推定される（ステップ２０５）。ここでは、現在の認識対象オブジェクトの位置は、認識対象オブジェクトから発せられる音の仮想空間Ｓ内における発生位置に相当する。現在時刻までに推定位置が推定されている認識対象オブジェクトは、認識位置が更新される。
この認識対象オブジェクトの認識位置の推定は完了し、ステップ２０２に戻る。 If the object to be recognized is not included in the "frame image at the current time" (No in step 203), it is determined whether audio data is associated with the object to be recognized and whether the volume at the time of current rendering exceeds the reference value (step 206).
If step 206 is positive (Yes in step 206), the current role (role information) of the object to be recognized is added to the role list (step 204).
In this way, in this embodiment, the role is also added to the role list when it is determined that, up to the current time, the user 5 has recognized a sound emitted by a recognition target object to which the same role information is set, just as when a recognition target object to which the same role information is set has been rendered.
The recognition position is estimated to be the current position of the recognition target object (step 205). Here, the current position of the recognition target object corresponds to the position in the virtual space S at which the sound emitted from the recognition target object is generated. For a recognition target object whose recognition position has already been estimated up to the current time, the recognition position is updated.
The estimation of the recognition position of this object to be recognized is now complete, and the process returns to step 202.

ステップ２０６が否定の場合（ステップ２０６のＮｏ）、認識対象オブジェクトの現在の役割は、最後に認識位置を更新した時から変わったか否か判定される（ステップ２０７）。
現在の役割が、最後に認識位置を更新したときから変わっていない場合（ステップ２０７のＮｏ）、認識位置は更新されない（すなわち認識位置は変更なし）。そして、ステップ２０２に戻る。 If the result of step 206 is negative (No in step 206), it is determined whether the current role of the object to be recognized has changed since the last time the recognition position was updated (step 207).
If the current role has not changed since the last time the recognized position was updated (No in step 207), the recognized position is not updated (i.e., the recognized position remains unchanged), and the process returns to step 202.

現在の役割が、最後に認識位置を更新したときから変わっている場合（ステップ２０７のＹｅｓ）、認識対象オブジェクトの現在の役割のシーンを過去にユーザ５が視聴しているか否かが判定される（ステップ２０８）。
ステップ２０８の判定は、役割リストに、認識対象オブジェクトの現在の役割情報が入力されているかを参照することで実行される。役割リストに認識対象オブジェクトの現在の役割情報が入力されている場合、ステップ２０８は肯定となる。役割リストに認識対象オブジェクトの現在の役割情報が入力されていない場合、ステップ２０８は否定となる。 If the current role has changed since the last time the recognition position was updated (Yes in step 207), it is determined whether user 5 has previously viewed a scene in which the object to be recognized plays the current role (step 208).
The determination in step 208 is made by referring to whether the current role information of the object to be recognized is entered in the role list. If the current role information of the object to be recognized is entered in the role list, the result in step 208 is affirmative. If the current role information of the object to be recognized is not entered in the role list, the result in step 208 is negative.

認識対象オブジェクトの現在の役割のシーンを過去にユーザ５が視聴していない場合（ステップ２０８のＮｏ）、認識位置は更新されない（すなわち認識位置は変更なし）。そして、ステップ２０２に戻る。
認識対象オブジェクトの現在の役割のシーンを過去にユーザ５が視聴している場合（ステップ２０８のＹｅｓ）、認識位置が現在の認識対象オブジェクトの位置に推定される（ステップ２０９）。
ここでは、現在の認識対象オブジェクトの位置は、役割に関連する定位置に相当する。現在時刻までに過去のフレームにて、推定位置が推定されている認識対象オブジェクトは、認識位置が更新される。
この認識対象オブジェクトの認識位置の推定は完了し、ステップ２０２に戻る。 If the user 5 has not previously viewed a scene in which the object to be recognized plays the current role (No in step 208), the recognition position is not updated (i.e., the recognition position remains unchanged), and the process returns to step 202.
If the user 5 has previously viewed a scene in which the object to be recognized plays the current role (Yes in step 208), the recognition position is estimated to be the current position of the object to be recognized (step 209).
Here, the current position of the object to be recognized corresponds to the fixed position related to the role. For an object to be recognized whose recognition position has already been estimated in a past frame up to the current time, the recognition position is updated.
The estimation of the recognition position of this object to be recognized is now complete, and the process returns to step 202.

未処理の認識対象オブジェクトがない場合（ステップ２０２のＮｏ）、全ての認識対象オブジェクトの認識位置の推定処理は終了する（ステップ２１０）。
図１５に示すように、視覚情報、現在の音情報、過去の視聴情報を用いることで、高い精度で、認識対象オブジェクトの認識位置を推定することが可能となる。 If there are no unprocessed recognition target objects (No in step 202), the process of estimating the recognition positions of all recognition target objects ends (step 210).
As shown in FIG. 15, by using visual information, current sound information, and past viewing information, it is possible to estimate the recognition position of the recognition target object with high accuracy.
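The flow of FIG. 15 (steps 202 to 210) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class `TargetObject`, its field names, the `fixed_positions` mapping, and the `volume_threshold` parameter are all hypothetical names introduced here. The sketch also assumes that the sound generation position equals the object's current position.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class TargetObject:
    """Hypothetical per-object state (field names are illustrative only)."""
    name: str
    position: Vec3                       # current position in the virtual space S
    role: Optional[str] = None           # current role information, if any
    in_frame: bool = False               # rendered in the frame image at the current time
    has_audio: bool = False              # audio data is associated with the object
    volume: float = 0.0                  # volume at the time of current rendering
    recognized_position: Optional[Vec3] = None
    role_at_last_update: Optional[str] = None


def estimate_recognition_positions(
    objects: Iterable[TargetObject],
    role_list: Set[str],
    fixed_positions: Dict[str, Vec3],
    volume_threshold: float,
) -> None:
    """Update recognized_position of every recognition target object in place."""
    for obj in objects:                                            # step 202
        seen = obj.in_frame                                        # step 203
        heard = obj.has_audio and obj.volume > volume_threshold    # step 206
        if seen or heard:
            if obj.role is not None:
                role_list.add(obj.role)                            # step 204
            # Assumption: the sound source position equals obj.position.
            obj.recognized_position = obj.position                 # step 205
            obj.role_at_last_update = obj.role
        elif obj.role != obj.role_at_last_update:                  # step 207: role changed
            # Step 208: has a scene with this role been viewed or heard before?
            fixed = fixed_positions.get(obj.role) if obj.role in role_list else None
            if fixed is not None:
                obj.recognized_position = fixed                    # step 209
                obj.role_at_last_update = obj.role
        # Otherwise the recognition position remains unchanged.
```

Calling the function once per frame with a persistent `role_list` set reproduces the behavior described above: an object that is neither seen nor heard keeps its previous recognition position unless its role changes to one that the user has already viewed.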

なお、図１５に示す推定例では、「現在時刻のフレーム画像」に含まれない所定の役割情報が設定された認識対象オブジェクトについて、現在時刻までに、同じ役割情報が設定された認識対象オブジェクトが発する音をユーザ５が認識したと判定した場合に、役割に関連する定位置が認識位置として推定された。
これに代えて、現在時刻までに同じ役割情報が設定された認識対象オブジェクトが役割に関連する定位置にいる状態で発した音をユーザ５が認識したと判定した場合に、役割に関連する定位置が認識位置として推定されてもよい。 In the estimation example shown in FIG. 15 , for a recognition target object to which predetermined role information that is not included in the “frame image at the current time” is set, if it is determined that the user 5 has recognized a sound emitted by a recognition target object to which the same role information is set by the current time, a fixed position related to the role is estimated as the recognition position.
Alternatively, the fixed position related to the role may be estimated as the recognition position when it is determined that, up to the current time, the user 5 has recognized a sound emitted by a recognition target object to which the same role information is set while that object was at the fixed position related to the role.

[全天周顕著性マップの生成]
顕著性マップ生成部１７による全天周顕著性マップの生成について説明する。
図１６は、全天周顕著性マップの生成例を示すフローチャートである。
まずステップ３０１では、「現在時刻のフレーム画像」に基づいて、視野分の顕著性マップが生成される。本実施形態では、視野分の顕著性マップとして、ボトムアップ注意に基づく顕著性、及びトップダウン注意に基づく顕著性の両方を含む顕著性マップが生成される。
例えば、ボトムアップ注意に基づく顕著性及びトップダウン注意に基づく顕著性の各々が別に検出され、これらを足し合わせることで、最終的なビューポート画像に対応した視野分の顕著性マップが生成される。
また同じステップ３０１にて、空の全天周顕著性マップ（本実施形態では、全ピクセルが値ゼロとなる正距円筒画像）が用意され、その中のビューポートに対応する箇所に、視野分の顕著性マップが貼り付けられる。
これにより、視野外領域において、ボトムアップ注意に基づく顕著性を発生することがないように全天周顕著性マップを生成することが可能となる。すなわち視野外領域においてボトムアップ注意に基づく顕著性がゼロとなる全天周顕著性マップを生成することが可能となる。この結果、不要な顕著性を発生させないようにすることが可能となり、上記した課題ポイント（１）を解決することが可能となる。
なお、視野外領域において、ボトムアップ注意に基づく顕著性の発生を回避する方法として、他の任意の方法が用いられてよい。例えば、一度全天周分の顕著性マップ（ボトムアップ注意に基づく顕著性及びトップダウン注意に基づく顕著性の両方を含む）を生成した後に、視野外領域の部分をマスクするやり方が採用されてもよい。
一方で、本実施形態のように、空の全天周顕著性マップの視野領域に視野分の顕著性マップを貼り付ける方法によれば、処理負荷を軽減することが可能であり、処理時間の短縮も図ることが可能となる。 [Generating a panoramic saliency map]
The generation of the panoramic saliency map by the saliency map generation unit 17 will be described.
FIG. 16 is a flowchart illustrating an example of generating a panoramic saliency map.
First, in step 301, a saliency map for the visual field is generated based on the “frame image at the current time.” In this embodiment, a saliency map for the visual field that includes both saliency based on bottom-up attention and saliency based on top-down attention is generated.
For example, saliency based on bottom-up attention and saliency based on top-down attention are detected separately, and then these are added together to generate a saliency map for the field of view corresponding to the final viewport image.
Also in step 301, an empty panoramic saliency map (in this embodiment, an equirectangular image in which all pixels have a value of zero) is prepared, and the saliency map for the field of view is pasted at the location corresponding to the viewport.
This makes it possible to generate a panoramic saliency map so that saliency based on bottom-up attention is not generated in the out-of-field area. That is, it is possible to generate a panoramic saliency map in which saliency based on bottom-up attention is zero in the out-of-field area. As a result, it is possible to prevent unnecessary saliency from being generated, and it is possible to solve the above-mentioned problem point (1).
Note that any other method may be used to avoid generating saliency based on bottom-up attention in the out-of-field region, such as masking out-of-field regions after generating a full-sphere saliency map (including both saliency based on bottom-up attention and saliency based on top-down attention).
On the other hand, the method of this embodiment, in which the saliency map for the field of view is pasted into the field-of-view area of an empty panoramic saliency map, makes it possible to reduce the processing load and shorten the processing time.
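The processing of step 301 can be sketched in Python as follows. The function names `combine_saliency` and `make_panoramic_map` and the parameters `left` and `top` are hypothetical. The sketch also makes a simplifying assumption: the viewport is treated as an axis-aligned rectangle in the equirectangular image, whereas an actual perspective viewport would generally map to a distorted region.

```python
from typing import List

Grid = List[List[float]]


def combine_saliency(bottom_up: Grid, top_down: Grid) -> Grid:
    """Add bottom-up and top-down saliency pixel by pixel (same-sized grids)."""
    return [
        [b + t for b, t in zip(row_b, row_t)]
        for row_b, row_t in zip(bottom_up, top_down)
    ]


def make_panoramic_map(width: int, height: int, viewport_map: Grid,
                       left: int, top: int) -> Grid:
    """Create an all-zero equirectangular map and paste the viewport saliency.

    (left, top) is the assumed pixel position of the viewport's top-left corner.
    Columns wrap around horizontally because longitude is periodic.
    """
    pano = [[0.0] * width for _ in range(height)]
    for dy, row in enumerate(viewport_map):
        y = top + dy
        if not 0 <= y < height:
            continue  # rows above or below the map are dropped
        for dx, value in enumerate(row):
            pano[y][(left + dx) % width] = value
    return pano
```

Because every pixel outside the pasted region stays at zero, no bottom-up saliency is produced in the out-of-field area, and saliency detection only has to run on the viewport image.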

認識位置推定部１９により推定された視野外領域の全ての認識対象オブジェクトの認識位置が取得される（ステップ３０２）。
未処理の視野外領域における認識対象オブジェクト（以下、視野外オブジェクトと記載する）があるか否か判定される（ステップ３０３）。
未処理の視野外オブジェクトがある場合（ステップ３０３のＹｅｓ）、未処理の視野外オブジェクトが１つ選択され、ステップ３０４の処理が実行される。 The recognition positions of all objects to be recognized in the out-of-view area estimated by the recognition position estimation unit 19 are acquired (step 302).
It is determined whether or not there is an unprocessed object to be recognized in the out-of-field area (hereinafter referred to as an out-of-field object) (step 303).
If there are any unprocessed out-of-field objects (Yes in step 303), one of the unprocessed out-of-field objects is selected, and the process of step 304 is executed.

ステップ３０４では、視野外オブジェクトの仮想空間Ｓ上の認識位置に基づいて、視野外オブジェクトの全天周顕著性マップ内の位置（２Ｄマップ上の位置）が算出される。算出された位置に、視野外オブジェクトにより発生されるトップダウンに基づく顕著性が配置される。
視野外オブジェクトは、過去にレンダリングされたことがあるオブジェクトである。従って、視野外オブジェクトは、過去にステップ３０１にて、トップダウン注意に基づく顕著性が検出されている。例えば、オブジェクトの形状に沿った各ピクセルの顕著性の値が検出される。
本実施形態では、ステップ３０１で検出された認識対象オブジェクトに対するトップダウン注意に基づく顕著性が保持される。そして、ステップ３０４にて、保持されているトップダウン注意に基づく顕著性（形状と値）が再利用され、全天周顕著性マップ上に配置される。
この視野外オブジェクトのトップダウン注意に基づく顕著性の、認識位置に基づく配置は完了し、ステップ３０３に戻る。 In step 304, the position of the out-of-field object in the omnidirectional saliency map (position on the 2D map) is calculated based on the recognized position of the out-of-field object in the virtual space S. The top-down based saliency generated by the out-of-field object is placed at the calculated position.
An out-of-view object is an object that has been rendered in the past, and therefore has previously had its saliency detected based on top-down attention in step 301. For example, the saliency value of each pixel along the shape of the object is detected.
In this embodiment, the saliency based on the top-down attention for the object to be recognized detected in step 301 is retained. Then, in step 304, the retained saliency (shape and value) based on the top-down attention is reused and arranged on the omnidirectional saliency map.
The placement of the top-down-attention-based saliency of this out-of-view object according to its recognition position is now complete, and the process returns to step 303.
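Step 304 can be sketched in Python as follows. This is an illustration under assumptions, not the method of the embodiment: the coordinate convention (+z forward, +y up, longitude zero at the image center), the function names `position_to_equirect` and `place_saliency`, and the use of `max` when a patch overlaps existing values are all choices made here for the example.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]
Vec3 = Tuple[float, float, float]


def position_to_equirect(position: Vec3, viewpoint: Vec3,
                         width: int, height: int) -> Tuple[int, int]:
    """Map a position in the virtual space to a pixel of an equirectangular map."""
    dx = position[0] - viewpoint[0]
    dy = position[1] - viewpoint[1]
    dz = position[2] - viewpoint[2]
    lon = math.atan2(dx, dz)                   # yaw in (-pi, pi], 0 = straight ahead
    lat = math.atan2(dy, math.hypot(dx, dz))   # pitch in [-pi/2, pi/2]
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return int(u) % width, min(int(v), height - 1)


def place_saliency(pano: Grid, patch: Grid, center: Tuple[int, int]) -> None:
    """Stamp a retained top-down saliency patch onto the map, centered at `center`."""
    height, width = len(pano), len(pano[0])
    patch_h, patch_w = len(patch), len(patch[0])
    cx, cy = center
    for j, row in enumerate(patch):
        y = cy - patch_h // 2 + j
        if not 0 <= y < height:
            continue
        for i, value in enumerate(row):
            x = (cx - patch_w // 2 + i) % width   # wrap around in longitude
            pano[y][x] = max(pano[y][x], value)   # assumption: keep the larger value
```

A retained patch (shape and values) for an out-of-view object would be placed by first converting its recognition position with `position_to_equirect` and then calling `place_saliency` with the resulting pixel.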

なお、ステップ３０４にて、一度全天周全体の顕著性マップ（トップダウン注意）を生成した後に、視野外領域の顕著性の発生位置を、推定された認識位置に合わせて調整する方法が採用されてもよい。
一方で、本実施形態のように、ステップ１０１で検出されたトップダウン注意に基づく顕著性を再利用することで、レンダリング処理がビューポートのみで済み、処理負荷の低減を図ることが可能となる。また処理時間の短縮も図れる。 In addition, in step 304, a method may be adopted in which, after generating a saliency map (top-down attention) for the entire sky, the position at which saliency occurs in the out-of-field area is adjusted to match the estimated recognition position.
On the other hand, by reusing the saliency based on top-down attention detected in step 301, as in this embodiment, rendering only needs to be performed for the viewport, which reduces both the processing load and the processing time.

未処理の視野外オブジェクトがない場合（ステップ３０３のＮｏ）、視野外オブジェクトのトップダウン注意に基づく顕著性を、認識位置に基づいて発生させるようにした全天周顕著性マップの生成処理は終了する。 If there are no unprocessed out-of-field objects (No in step 303), the process of generating the panoramic saliency map, in which saliency based on top-down attention for out-of-field objects is generated according to their recognition positions, ends.

ユーザ５が脳内で認識している認識位置にて、トップダウン注意に基づく顕著性を発生させる。すなわち、視野外領域おける認識対象オブジェクトの認識位置に基づいて、視野外領域におけるトップダウン注意に基づく顕著性を表す顕著性マップを生成する。
これにより、脳内の認識位置とは異なる位置からの不要な顕著性の発生を防止することが可能となり、上記の課題ポイント（２）及び（３）を解決することが可能となる。 Salience based on top-down attention is generated at the recognition position recognized in the brain of the user 5. That is, a saliency map representing saliency based on top-down attention in the out-of-field area is generated based on the recognition position of the object to be recognized in the out-of-field area.
This makes it possible to prevent the generation of unnecessary saliency from locations different from the recognition location in the brain, thereby solving the above-mentioned problems (2) and (3).

図１７は、本実施形態により生成される全天周顕著性マップの一例を示す模式的な図である。図１７Ａは、図５Ａに示すタイミングにおいて生成される全天周顕著性マップ３５の一例である。図１７Ｂは、図５Ｃに示すタイミングにおいて生成される全天周顕著性マップ３５の一例である。
図１７Ａに示すように、図５Ａに示すタイミングにおいて生成される全天周顕著性マップ３５では、ユーザ５が認識している人物Ｐ１のみのトップダウン注意に基づく顕著性が発生している。ユーザ５に認識されていない照明装置Ｌのボトムアップ注意に基づく顕著性は発生しない。またユーザ５に認識されていない人物Ｐ１及びＰ２のトップダウン注意に基づく顕著性も発生しない。
また、図１７Ｂに示すように、人物Ｐ１の移動を認識していないユーザ５に対して、移動前の人物Ｐ１の位置に、人物Ｐ１のトップダウン注意に基づく顕著性が発生している。またユーザ５に認識されていない照明装置Ｌのボトムアップ注意に基づく顕著性は発生しない。
このように、本実施形態により生成される全天周顕著性マップ３５では、ユーザ５の脳内にない顕著性が発生することが回避されており、高精度の全天周顕著性マップとなっている。 Fig. 17 is a schematic diagram showing an example of a panoramic saliency map generated by this embodiment. Fig. 17A is an example of a panoramic saliency map 35 generated at the timing shown in Fig. 5A. Fig. 17B is an example of a panoramic saliency map 35 generated at the timing shown in Fig. 5C.
As shown in FIG. 17A, in the panoramic saliency map 35 generated at the timing shown in FIG. 5A, saliency based on top-down attention is generated only for the person P1, who is recognized by the user 5. No saliency based on bottom-up attention is generated for the lighting device L, which is not recognized by the user 5. Furthermore, no saliency based on top-down attention is generated for the persons P1 and P2 who are not recognized by the user 5.
As shown in FIG. 17B, for the user 5, who is not aware that the person P1 has moved, saliency based on top-down attention for the person P1 is generated at the position of the person P1 before the movement. Furthermore, no saliency based on bottom-up attention is generated for the lighting device L, which is not recognized by the user 5.
In this way, the panoramic saliency map 35 generated by this embodiment avoids the occurrence of saliency that does not exist in the brain of the user 5, resulting in a highly accurate panoramic saliency map.

以上、本実施形態に係るサーバサイドレンダリングシステム１では、視野外領域２１における認識対象オブジェクトの認識位置が推定される。推定された認識位置に基づいて、視野外領域２１における顕著性マップを含む、全天周顕著性マップ３５が生成される。
これにより、その時々の視聴状況に応じた、ユーザ５の視野外に対する注意を全天周顕著性マップ３５に反映することが可能となる。この結果、高精度の全天周顕著性マップ３５を生成することが可能となる。
高精度で的確な全天周顕著性マップ３５が生成されるので、非常に高い精度で予測Head Motion情報（特にOrientation情報）を生成することが可能となり、応答遅延（T_m2p時間）の問題を十分に抑制することが可能となる。すなわち、全天周顕著性マップ３５を用いて高品質な仮想映像の配信を実現することが可能となる。
なお、本実施形態にて生成される高精度の全天周顕著性マップ３５を、他の用途に用いることも可能である。 As described above, the server-side rendering system 1 according to this embodiment estimates the recognition position of the recognition target object in the out-of-field area 21. Based on the estimated recognition position, a panoramic saliency map 35 including a saliency map in the out-of-field area 21 is generated.
This makes it possible to reflect the attention of the user 5 to areas outside the field of view, depending on the viewing situation at that time, in the omnidirectional saliency map 35. As a result, it becomes possible to generate a highly accurate omnidirectional saliency map 35.
Since the accurate and precise omnidirectional saliency map 35 is generated, it is possible to generate predicted head motion information (particularly orientation information) with extremely high accuracy, and it is possible to sufficiently suppress the problem of response delay (T_m2p time). In other words, it is possible to realize the delivery of high-quality virtual video using the omnidirectional saliency map 35.
The high-precision panoramic saliency map 35 generated in this embodiment can also be used for other purposes.
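As one very simple illustration of how a panoramic saliency map could feed orientation prediction, the sketch below takes the most salient pixel of an equirectangular map and converts it to a yaw and pitch angle. This section does not specify how the predicted Head Motion information is actually computed, so the peak-picking approach, the function name `peak_orientation`, and the angle convention (yaw zero at the image center, pitch positive upward) are assumptions made only for this example.

```python
from typing import List, Tuple

Grid = List[List[float]]


def peak_orientation(pano: Grid) -> Tuple[float, float]:
    """Return (yaw, pitch) in degrees of the center of the most salient pixel."""
    height, width = len(pano), len(pano[0])
    # Find the pixel with the largest saliency value (first one on ties).
    _, x, y = max(
        ((value, x, y) for y, row in enumerate(pano) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    yaw = ((x + 0.5) / width - 0.5) * 360.0      # -180 to 180 degrees
    pitch = (0.5 - (y + 0.5) / height) * 180.0   # -90 to 90 degrees
    return yaw, pitch
```

A practical predictor would more likely combine the map with the recent head motion history rather than jump to a single peak; the sketch only shows the mapping from map coordinates to an orientation.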

＜その他の実施形態＞
本技術は、以上説明した実施形態に限定されず、他の種々の実施形態を実現することができる。 <Other embodiments>
The present technology is not limited to the above-described embodiments, and various other embodiments can be realized.

上記では、仮想画像として、６ＤｏＦ映像が配信される場合を例に挙げた。これに限定されず、３ＤｏＦ映像や２Ｄ映像等が配信される場合にも、本技術は適用可能である。また仮想画像として、ＶＲ映像ではなく、ＡＲ映像等が配信されてもよい。
また、３Ｄ映像を視聴するためのステレオ映像（例えば右目画像及び左目画像等）についても、本技術は適用可能である。
本技術は、視野外領域が発生し得る任意の仮想空間を表示するコンテンツに対して適用可能である。また視野外領域の顕著性マップとして、仮想空間全体の領域の顕著性マップに限定されず、視野外領域となる仮想空間の一部の領域の顕著性マップが生成されてもよい。 In the above, an example is given in which a 6DoF video is distributed as a virtual image. However, the present technology is not limited to this and can also be applied to cases in which a 3DoF video, a 2D video, or the like is distributed. Furthermore, instead of a VR video, an AR video or the like may be distributed as a virtual image.
The present technology can also be applied to stereo images (for example, right-eye images and left-eye images) for viewing 3D images.
The present technology can be applied to content that displays any virtual space where an out-of-field area may occur. Furthermore, the saliency map of the out-of-field area is not limited to the saliency map of the entire virtual space, but may also be generated for a portion of the virtual space that becomes an out-of-field area.

図１８は、サーバ装置４及びクライアント装置３を実現可能なコンピュータ（情報処理装置）６０のハードウェア構成例を示すブロック図である。
コンピュータ６０は、ＣＰＵ６１、ＲＯＭ（Read Only Memory）６２、ＲＡＭ６３、入出力インタフェース６５、及びこれらを互いに接続するバス６４を備える。入出力インタフェース６５には、表示部６６、入力部６７、記憶部６８、通信部６９、及びドライブ部７０等が接続される。
表示部６６は、例えば液晶、ＥＬ等を用いた表示デバイスである。入力部６７は、例えばキーボード、ポインティングデバイス、タッチパネル、その他の操作装置である。入力部６７がタッチパネルを含む場合、そのタッチパネルは表示部６６と一体となり得る。
記憶部６８は、不揮発性の記憶デバイスであり、例えばＨＤＤ、フラッシュメモリ、その他の固体メモリである。ドライブ部７０は、例えば光学記録媒体、磁気記録テープ等、リムーバブルの記録媒体７１を駆動することが可能なデバイスである。
通信部６９は、ＬＡＮ、ＷＡＮ等に接続可能な、他のデバイスと通信するためのモデム、ルータ、その他の通信機器である。通信部６９は、有線及び無線のどちらを利用して通信するものであってもよい。通信部６９は、コンピュータ６０とは別体で使用される場合が多い。
上記のようなハードウェア構成を有するコンピュータ６０による情報処理は、記憶部６８またはＲＯＭ６２等に記憶されたソフトウェアと、コンピュータ６０のハードウェア資源との協働により実現される。具体的には、ＲＯＭ６２等に記憶された、ソフトウェアを構成するプログラムをＲＡＭ６３にロードして実行することにより、本技術に係る情報処理方法が実現される。
プログラムは、例えば記録媒体６１を介してコンピュータ６０にインストールされる。あるいは、グローバルネットワーク等を介してプログラムがコンピュータ６０にインストールされてもよい。その他、コンピュータ読み取り可能な非一過性の任意の記憶媒体が用いられてよい。 FIG. 18 is a block diagram showing an example of the hardware configuration of a computer (information processing device) 60 that can realize the server device 4 and the client device 3.
The computer 60 includes a CPU 61, a ROM (Read Only Memory) 62, a RAM 63, an input/output interface 65, and a bus 64 that interconnects these components. The input/output interface 65 is connected to a display unit 66, an input unit 67, a storage unit 68, a communication unit 69, a drive unit 70, and the like.
The display unit 66 is a display device using, for example, a liquid crystal display, an electroluminescence display, etc. The input unit 67 is, for example, a keyboard, a pointing device, a touch panel, or other operating device. When the input unit 67 includes a touch panel, the touch panel can be integrated with the display unit 66.
The storage unit 68 is a non-volatile storage device such as a HDD, flash memory, or other solid-state memory. The drive unit 70 is a device capable of driving a removable recording medium 71 such as an optical recording medium or magnetic recording tape.
The communication unit 69 is a modem, a router, or other communication equipment that can be connected to a LAN, a WAN, or the like in order to communicate with other devices. The communication unit 69 may communicate over either a wired or a wireless connection. The communication unit 69 is often used as a unit separate from the computer 60.
Information processing by the computer 60 having the above-described hardware configuration is realized by cooperation between software stored in the storage unit 68 or the ROM 62, etc., and the hardware resources of the computer 60. Specifically, the information processing method according to the present technology is realized by loading a program constituting the software stored in the ROM 62, etc., into the RAM 63 and executing it.
The program is installed in the computer 60 via, for example, the recording medium 71. Alternatively, the program may be installed in the computer 60 via a global network or the like. In addition, any other computer-readable non-transitory storage medium may be used.

ネットワーク等を介して通信可能に接続された複数のコンピュータが協働することで、本技術に係る情報処理方法及びプログラムが実行され、本技術に係る情報処理装置が構築されてもよい。
すなわち本技術に係る情報処理方法、及びプログラムは、単体のコンピュータにより構成されたコンピュータシステムのみならず、複数のコンピュータが連動して動作するコンピュータシステムにおいても実行可能である。
なお本開示において、システムとは、複数の構成要素（装置、モジュール（部品）等）の集合を意味し、すべての構成要素が同一筐体中にあるか否かは問わない。したがって、別個の筐体に収納され、ネットワークを介して接続されている複数の装置、及び、１つの筐体の中に複数のモジュールが収納されている１つの装置は、いずれもシステムである。
コンピュータシステムによる本技術に係る情報処理方法、及びプログラムの実行は、例えば視野情報の取得、レンダリング処理の実行、認識対象オブジェクトの設定、認識位置の推定、全天周顕著性マップの生成等が、単体のコンピュータにより実行される場合、及び各処理が異なるコンピュータにより実行される場合の両方を含む。また所定のコンピュータによる各処理の実行は、当該処理の一部または全部を他のコンピュータに実行させその結果を取得することを含む。
すなわち本技術に係る情報処理方法及びプログラムは、１つの機能をネットワークを介して複数の装置で分担、共同して処理するクラウドコンピューティングの構成にも適用することが可能である。 An information processing device according to the present technology may be constructed by a plurality of computers connected to each other so as to be able to communicate via a network or the like working together to execute an information processing method and program according to the present technology.
That is, the information processing method and program according to the present technology can be executed not only in a computer system configured by a single computer, but also in a computer system in which multiple computers operate in conjunction with each other.
In this disclosure, a system refers to a collection of multiple components (devices, modules (parts), etc.), regardless of whether all of the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in a single housing, are both systems.
The information processing method and program execution according to the present technology by a computer system include both cases where, for example, obtaining visual field information, performing rendering processing, setting a recognition target object, estimating a recognition position, generating a panoramic saliency map, etc. are performed by a single computer, and cases where each process is performed by a different computer. Furthermore, the execution of each process by a specific computer includes having another computer execute part or all of the process and obtaining the results.
In other words, the information processing method and program according to the present technology can also be applied to a cloud computing configuration in which a single function is shared and processed jointly by multiple devices via a network.

各図面を参照して説明したサーバサイドレンダリングシステム、ＨＭＤ、サーバ装置、クライアント装置等の各構成、各処理フロー等はあくまで一実施形態であり、本技術の趣旨を逸脱しない範囲で、任意に変形可能である。すなわち本技術を実施するための他の任意の構成やアルゴリズム等が採用されてよい。 The configurations and processing flows of the server-side rendering system, HMD, server device, client device, etc. described with reference to the drawings are merely one embodiment and may be modified as desired without departing from the spirit of this technology. In other words, any other configurations, algorithms, etc. may be adopted to implement this technology.

本開示において、説明の理解を容易とするために、「略」「ほぼ」「おおよそ」等の文言が適宜使用されている。一方で、これら「略」「ほぼ」「おおよそ」等の文言を使用する場合と使用しない場合とで、明確な差異が規定されるわけではない。
すなわち、本開示において、「中心」「中央」「均一」「等しい」「同じ」「直交」「平行」「対称」「延在」「軸方向」「円柱形状」「円筒形状」「リング形状」「円環形状」等の、形状、サイズ、位置関係、状態等を規定する概念は、「実質的に中心」「実質的に中央」「実質的に均一」「実質的に等しい」「実質的に同じ」「実質的に直交」「実質的に平行」「実質的に対称」「実質的に延在」「実質的に軸方向」「実質的に円柱形状」「実質的に円筒形状」「実質的にリング形状」「実質的に円環形状」等を含む概念とする。
例えば「完全に中心」「完全に中央」「完全に均一」「完全に等しい」「完全に同じ」「完全に直交」「完全に平行」「完全に対称」「完全に延在」「完全に軸方向」「完全に円柱形状」「完全に円筒形状」「完全にリング形状」「完全に円環形状」等を基準とした所定の範囲（例えば±１０％の範囲）に含まれる状態も含まれる。
従って、「略」「ほぼ」「おおよそ」等の文言が付加されていない場合でも、いわゆる「略」「ほぼ」「おおよそ」等を付加して表現され得る概念が含まれ得る。反対に、「略」「ほぼ」「おおよそ」等を付加して表現された状態について、完全な状態が必ず排除されるというわけではない。 In this disclosure, to facilitate understanding of the explanation, words such as "approximately," "almost," and "roughly" are used as appropriate. However, no clear difference is defined between cases where words such as "approximately," "almost," and "roughly" are used and cases where they are not.
That is, in the present disclosure, concepts that define shape, size, positional relationship, state, etc., such as "center," "central," "uniform," "equal," "same," "orthogonal," "parallel," "symmetrical," "extended," "axial direction," "columnar," "cylindrical," "ring-shaped," and "annular," are concepts that include "substantially center," "substantially central," "substantially uniform," "substantially equal," "substantially the same," "substantially orthogonal," "substantially parallel," "substantially symmetrical," "substantially extended," "substantially axial direction," "substantially columnar," "substantially cylindrical," "substantially ring-shaped," "substantially annular," and the like.
For example, this also includes states that fall within a predetermined range (e.g., a range of ±10%) based on criteria such as "perfectly centered," "perfectly central," "perfectly uniform," "perfectly equal," "perfectly the same," "perfectly orthogonal," "perfectly parallel," "perfectly symmetrical," "perfectly extended," "perfectly axial," "perfectly columnar," "perfectly cylindrical," "perfectly ring-shaped," and "perfectly annular."
Therefore, even where words such as "approximately," "almost," and "roughly" are not added, concepts that could be expressed by adding "approximately," "almost," "roughly," etc. may be included. Conversely, a state expressed by adding "approximately," "almost," "roughly," etc. does not necessarily exclude the perfect state.

本開示において、「Ａより大きい」「Ａより小さい」といった「より」を使った表現は、Ａと同等である場合を含む概念と、Ａと同等である場合を含まない概念の両方を包括的に含む表現である。例えば「Ａより大きい」は、Ａと同等は含まない場合に限定されず、「Ａ以上」も含む。また「Ａより小さい」は、「Ａ未満」に限定されず、「Ａ以下」も含む。
本技術を実施する際には、上記で説明した効果が発揮されるように、「Ａより大きい」及び「Ａより小さい」に含まれる概念から、具体的な設定等を適宜採用すればよい。 In the present disclosure, expressions using "than", such as "greater than A" and "smaller than A", are expressions that comprehensively include both concepts that include the case where something is equivalent to A and concepts that do not include the case where something is equivalent to A. For example, "greater than A" is not limited to cases that do not include equivalent to A, but also includes "A or greater". Furthermore, "smaller than A" is not limited to "less than A" but also includes "A or less".
When implementing the present technology, specific settings and the like may be appropriately adopted from the concepts included in "greater than A" and "smaller than A" so as to achieve the effects described above.

以上説明した本技術に係る特徴部分のうち、少なくとも２つの特徴部分を組み合わせることも可能である。すなわち各実施形態で説明した種々の特徴部分は、各実施形態の区別なく、任意に組み合わされてもよい。また上記で記載した種々の効果は、あくまで例示であって限定されるものではなく、また他の効果が発揮されてもよい。 It is also possible to combine at least two of the features of the present technology described above. In other words, the various features described in each embodiment may be combined in any way, regardless of the embodiment. Furthermore, the various effects described above are merely examples and are not limiting, and other effects may also be achieved.

なお、本技術は以下のような構成も採ることができる。
（１）
ユーザの視野に関する視野情報に基づいて、仮想空間を構成する３次元空間データに対してレンダリング処理を実行することにより、前記ユーザの視野に応じた２次元映像データを生成するレンダリング部と、
前記仮想空間の前記ユーザの視野に含まれない視野外領域における、前記ユーザが認識している認識対象オブジェクトの前記ユーザが認識している認識位置を推定する推定部と、
推定された前記視野外領域における前記認識対象オブジェクトの前記認識位置に基づいて、前記視野外領域における顕著性を表す顕著性マップを生成する生成部と
を具備する情報処理装置。
（２）（１）に記載の情報処理装置であって、
前記推定部は、現在時刻までにレンダリング対象となったことがあるオブジェクトを、前記認識対象オブジェクトとして設定する
情報処理装置。
（３）（１）又は（２）に記載の情報処理装置であって、
前記２次元映像データは、時系列に連続する複数のフレーム画像により構成され、
前記推定部は、現在時刻のフレーム画像に含まれない前記認識対象オブジェクトについて、前記認識対象オブジェクトが含まれる過去の直近のフレーム画像内の前記認識対象オブジェクトの位置に対応する前記仮想空間内の位置に基づいて、前記認識位置を推定する
情報処理装置。
（４）（３）に記載の情報処理装置であって、
前記推定部は、前記直近のフレーム画像内の前記認識対象オブジェクトの位置に対応する前記仮想空間内の位置を、前記認識位置として推定する
情報処理装置。
（５）（３）又は（４）に記載の情報処理装置であって、
前記推定部は、前記直近のフレーム画像内の前記認識対象オブジェクトの位置に対応する前記仮想空間内の位置から前記認識対象オブジェクトの移動方向に沿ってシフトした位置を、前記認識位置として推定する
情報処理装置。
（６）（３）から（５）のうちいずれか１つに記載の情報処理装置であって、
前記推定部は、現在時刻のフレーム画像に含まれない前記認識対象オブジェクトについて、前記認識対象オブジェクトが発する音を前記ユーザが認識したと判定した場合に、前記音の前記仮想空間内における発生位置を、前記認識位置として推定する
情報処理装置。
（７）（３）から（６）のうちいずれか１つに記載の情報処理装置であって、
前記３次元空間データは、前記仮想空間の構成を定義する３次元空間記述データと、前記仮想空間における３次元オブジェクトを定義する３次元オブジェクトデータとを含み、
前記３次元空間記述データは、前記認識対象オブジェクトの役割を表す役割情報、及び前記役割に関連する定位置を表す定位置情報を含み、
前記推定部は、現在時刻のフレーム画像に含まれない所定の役割情報が設定された前記認識対象オブジェクトについて、現在時刻までに、同じ役割情報が設定された前記認識対象オブジェクトがレンダリングされたことがある場合に、前記役割に関連する前記定位置を前記認識位置として推定する
情報処理装置。
（８）（７）に記載の情報処理装置であって、
前記推定部は、現在時刻までに同じ役割情報が設定された前記認識対象オブジェクトが前記役割に関連する前記定位置にいる状態がレンダリングされたことがある場合に、前記役割に関連する前記定位置を前記認識位置として推定する
情報処理装置。
（９）（３）から（８）のうちいずれか１つに記載の情報処理装置であって、
前記３次元空間データは、前記仮想空間の構成を定義する３次元空間記述データと、前記仮想空間における３次元オブジェクトを定義する３次元オブジェクトデータとを含み、
前記３次元空間記述データは、前記認識対象オブジェクトの役割を表す役割情報、及び前記役割に関連する定位置を表す定位置情報を含み、
前記推定部は、現在時刻のフレーム画像に含まれない所定の役割情報が設定された前記認識対象オブジェクトについて、現在時刻までに、同じ役割情報が設定された前記認識対象オブジェクトが発する音を前記ユーザが認識したと判定した場合に、前記役割に関連する前記定位置を前記認識位置として推定する
情報処理装置。
（１０）（９）に記載の情報処理装置であって、
前記推定部は、現在時刻までに同じ役割情報が設定された前記認識対象オブジェクトが前記役割に関連する前記定位置にいる状態で発した音を前記ユーザが認識したと判定した場合に、前記役割に関連する前記定位置を前記認識位置として推定する
情報処理装置。
（１１）（１）から（１０）のうちいずれか１つに記載の情報処理装置であって、
前記推定部は、前記認識対象オブジェクトがレンダリングされている前記２次元映像データ内の前記認識対象オブジェクトの位置に基づいて、前記認識位置を推定する
情報処理装置。
（１２）（１）から（１１）のうちいずれか１つに記載の情報処理装置であって、
前記生成部は、前記視野外領域におけるボトムアップ注意に基づく顕著性がゼロとなる前記顕著性マップを生成する
情報処理装置。
（１３）（１）から（１２）のうちいずれか１つに記載の情報処理装置であって、
前記生成部は、前記視野外領域における前記認識対象オブジェクトの前記認識位置に基づいて、前記視野外領域におけるトップダウン注意に基づく顕著性を表す前記顕著性マップを生成する
情報処理装置。
（１４）（１）から（１３）のうちいずれか１つに記載の情報処理装置であって、
前記生成部は、前記視野外領域における前記顕著性マップと、前記２次元映像データの顕著性を表す顕著性マップとを生成する
情報処理装置。
（１５）（１）から（１４）のうちいずれか１つに記載の情報処理装置であって、さらに、
前記顕著性マップに基づいて、未来の前記視野情報を予測視野情報として生成する予測部を具備し、
前記レンダリング部は、前記予測視野情報に基づいて、前記２次元映像データを生成する
情報処理装置。
（１６）（１５）に記載の情報処理装置であって、
前記視野情報は、視点の位置、視線方向、視線の回転角度、前記ユーザの頭の位置、又は前記ユーザの頭の回転角度の少なくとも１つを含む
情報処理装置。
（１７）（１６）に記載の情報処理装置であって、
前記視野情報は、前記ユーザの頭の回転角度を含み、
前記予測部は、前記顕著性マップに基づいて、未来の前記ユーザの頭の回転角度を予測する
情報処理装置。
（１８）（１５）から（１７）のうちいずれか１つに記載の情報処理装置であって、
前記２次元映像データは、時系列に連続する複数のフレーム画像により構成され、
前記レンダリング部は、前記予測視野情報に基づいてフレーム画像を生成し、予測フレーム画像として出力する
情報処理装置。
（１９）
ユーザの視野に関する視野情報に基づいて、仮想空間を構成する３次元空間データに対してレンダリング処理を実行することにより、前記ユーザの視野に応じた２次元映像データを生成し、
前記仮想空間の前記ユーザの視野に含まれない視野外領域における、前記ユーザが認識している認識対象オブジェクトの前記ユーザが認識している認識位置を推定し、
推定された前記視野外領域における前記認識対象オブジェクトの前記認識位置に基づいて、前記視野外領域における顕著性を表す顕著性マップを生成する
ことをコンピュータシステムが実行する情報処理方法。 The present technology can also be configured as follows.
(1) An information processing device comprising:
a rendering unit that generates two-dimensional video data according to the user's field of view by performing a rendering process on three-dimensional space data that constitutes a virtual space based on field of view information regarding the user's field of view;
an estimation unit that estimates a recognition position of a recognition target object recognized by the user in an out-of-field area of the virtual space that is not included in the user's field of view;
and a generation unit that generates a saliency map that represents saliency in the out-of-field area based on the estimated recognition position of the recognition target object in the out-of-field area.
(2) The information processing device according to (1), wherein
the estimation unit sets, as the recognition target object, an object that has been a rendering target up to the current time.
(3) The information processing device according to (1) or (2), wherein
the two-dimensional video data is composed of a plurality of frame images that are successive in time series, and
the estimation unit estimates the recognition position of the recognition target object that is not included in the frame image at the current time, based on a position in the virtual space that corresponds to the position of the recognition target object in the most recent past frame image that includes the recognition target object.
(4) The information processing device according to (3), wherein
the estimation unit estimates, as the recognition position, the position in the virtual space corresponding to the position of the recognition target object in the most recent past frame image.
(5) The information processing device according to (3) or (4), wherein
the estimation unit estimates, as the recognition position, a position shifted along a movement direction of the recognition target object from the position in the virtual space corresponding to the position of the recognition target object in the most recent past frame image.
(6) The information processing device according to any one of (3) to (5), wherein,
when it is determined that the user has recognized a sound emitted by the recognition target object that is not included in the frame image at the current time, the estimation unit estimates, as the recognition position, the position in the virtual space at which the sound is generated.
(7) The information processing device according to any one of (3) to (6), wherein
the three-dimensional space data includes three-dimensional space description data that defines a configuration of the virtual space and three-dimensional object data that defines three-dimensional objects in the virtual space,
the three-dimensional space description data includes role information representing a role of the recognition target object, and fixed position information representing a fixed position associated with the role, and,
for the recognition target object that is not included in the frame image at the current time and to which predetermined role information is set, the estimation unit estimates the fixed position related to the role as the recognition position when a recognition target object to which the same role information is set has been rendered up to the current time.
(8) The information processing device according to (7), wherein
the estimation unit estimates the fixed position related to the role as the recognition position when the recognition target object to which the same role information is set has been rendered, up to the current time, while located at the fixed position related to the role.
(9) The information processing device according to any one of (3) to (8), wherein
the three-dimensional space data includes three-dimensional space description data that defines a configuration of the virtual space and three-dimensional object data that defines three-dimensional objects in the virtual space,
the three-dimensional space description data includes role information representing a role of the recognition target object, and fixed position information representing a fixed position associated with the role, and,
for the recognition target object that is not included in the frame image at the current time and to which predetermined role information is set, the estimation unit estimates the fixed position related to the role as the recognition position when it is determined that the user has recognized, by the current time, a sound emitted by the recognition target object.
(10) The information processing device according to (9), wherein
the estimation unit estimates the fixed position related to the role as the recognition position when it is determined that the user has recognized, up to the current time, a sound emitted by the recognition target object to which the same role information is set while that object was located at the fixed position related to the role.
(11) The information processing device according to any one of (1) to (10), wherein
the estimation unit estimates the recognition position based on a position of the recognition target object in the two-dimensional video data in which the recognition target object is rendered.
(12) The information processing device according to any one of (1) to (11), wherein
the generation unit generates the saliency map in which saliency based on bottom-up attention in the out-of-field area is zero.
(13) The information processing device according to any one of (1) to (12), wherein
the generation unit generates the saliency map representing saliency based on top-down attention in the out-of-field area, based on the recognition position of the recognition target object in the out-of-field area.
(14) The information processing device according to any one of (1) to (13), wherein
the generation unit generates the saliency map for the out-of-field area and a saliency map representing saliency of the two-dimensional video data.
(15) The information processing device according to any one of (1) to (14), further comprising
a prediction unit that generates future field of view information as predicted field of view information based on the saliency map, wherein
the rendering unit generates the two-dimensional video data based on the predicted field of view information.
(16) The information processing device according to (15), wherein
the field of view information includes at least one of a viewpoint position, a line of sight direction, a line of sight rotation angle, a head position of the user, or a head rotation angle of the user.
(17) The information processing device according to (16), wherein
the field of view information includes the head rotation angle of the user, and
the prediction unit predicts a future head rotation angle of the user based on the saliency map.
(18) The information processing device according to any one of (15) to (17), wherein
the two-dimensional video data is composed of a plurality of frame images that are successive in time series, and
the rendering unit generates a frame image based on the predicted field of view information and outputs the generated frame image as a predicted frame image.
(19)
An information processing method executed by a computer system, the method comprising:
generating two-dimensional video data according to the user's field of view by performing a rendering process on three-dimensional space data constituting a virtual space based on field of view information relating to the user's field of view;
estimating a recognition position of a recognition target object recognized by the user in an out-of-field area of the virtual space that is not included in the user's field of view; and
generating a saliency map representing saliency in the out-of-field area based on the estimated recognition position of the recognition target object in the out-of-field area.
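By way of illustration only, the estimation of the recognition position in configurations (3) to (5) and the generation of the out-of-field saliency map in configurations (12) and (13) can be sketched as follows. This is a minimal sketch under simplifying assumptions that are not part of the present disclosure: directions are reduced to a single yaw angle in degrees, top-down saliency is modeled as a Gaussian bump around each estimated recognition position, and all names (`SeenObject`, `estimate_recognition_yaw`, `out_of_field_saliency`, `sigma_deg`, `step_deg`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SeenObject:
    """Hypothetical record of an object rendered in a past frame image."""
    yaw_deg: float         # direction in the most recent past frame containing it
    yaw_rate_deg_s: float  # apparent movement at that time (0.0 if static)
    last_seen_s: float     # timestamp of that frame


def wrap_deg(angle):
    """Wrap an angle to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def estimate_recognition_yaw(obj, now_s, extrapolate=False):
    """Last seen direction, optionally shifted along the movement direction."""
    if not extrapolate:
        return obj.yaw_deg
    elapsed = now_s - obj.last_seen_s
    return wrap_deg(obj.yaw_deg + obj.yaw_rate_deg_s * elapsed)


def in_view(yaw, view_yaw_deg, fov_deg):
    """True if the direction lies inside the current field of view."""
    return abs(wrap_deg(yaw - view_yaw_deg)) <= fov_deg / 2.0


def out_of_field_saliency(objs, now_s, view_yaw_deg, fov_deg,
                          step_deg=10, sigma_deg=15.0, extrapolate=False):
    """Return {yaw: saliency} for directions outside the field of view."""
    centers = [estimate_recognition_yaw(o, now_s, extrapolate) for o in objs]
    # Assumption: only recognition positions outside the view contribute.
    centers = [c for c in centers if not in_view(c, view_yaw_deg, fov_deg)]
    saliency = {}
    for yaw in range(-180, 180, step_deg):
        if in_view(yaw, view_yaw_deg, fov_deg):
            continue  # in-field saliency would come from the rendered frame
        # Bottom-up saliency is taken as zero here; only top-down bumps remain.
        saliency[yaw] = sum(
            math.exp(-0.5 * (wrap_deg(yaw - c) / sigma_deg) ** 2)
            for c in centers
        )
    return saliency


if __name__ == "__main__":
    obj = SeenObject(yaw_deg=90.0, yaw_rate_deg_s=10.0, last_seen_s=0.0)
    sal = out_of_field_saliency([obj], now_s=3.0, view_yaw_deg=0.0, fov_deg=90.0)
    print(max(sal, key=sal.get))
```

In this sketch, setting `extrapolate=True` corresponds to shifting the position along the movement direction as in configuration (5), while the default corresponds to reusing the last rendered position as in configuration (4). A prediction unit as in configuration (15) could, for example, bias the predicted head rotation toward the direction of highest saliency, although the sketch does not model that step.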

Ｓ…仮想空間
１…サーバサイドレンダリングシステム
２…ＨＭＤ
３…クライアント装置
４…サーバ装置
５…ユーザ
７…ユーザの視野
８…レンダリング映像
１３…予測部
１４…レンダリング部
１７…顕著性マップ生成部
１８…顕著性マップ記録部
１９…認識位置推定部
２０…予測フレーム画像（フレーム画像）
３５…全天周顕著性マップ
６０…コンピュータ
DESCRIPTION OF SYMBOLS
S: Virtual space
1: Server-side rendering system
2: HMD
3: Client device
4: Server device
5: User
7: User's field of view
8: Rendered image
13: Prediction unit
14: Rendering unit
17: Saliency map generation unit
18: Saliency map recording unit
19: Recognition position estimation unit
20: Predicted frame image (frame image)
35: Full-dome saliency map
60: Computer

Claims

1. An information processing device comprising:
a rendering unit that generates two-dimensional video data according to the user's field of view by performing a rendering process on three-dimensional space data that constitutes a virtual space based on field of view information regarding the user's field of view;
an estimation unit that estimates a recognition position, in an out-of-field area of the virtual space that is not included in the user's field of view, of a recognition target object that, among objects existing in the virtual space, is assumed to be recognized by the user as existing in the virtual space; and
a generation unit that generates a saliency map that represents saliency in the out-of-field area based on the estimated recognition position of the recognition target object in the out-of-field area.

2. The information processing device according to claim 1, wherein
the estimation unit sets, as the recognition target object, an object that has been a rendering target up to the current time.

3. The information processing device according to claim 1, wherein
the two-dimensional video data is composed of a plurality of frame images that are successive in time series, and
the estimation unit estimates the recognition position of the recognition target object that is not included in the frame image at the current time, based on a position in the virtual space that corresponds to the position of the recognition target object in the most recent past frame image that includes the recognition target object.

4. The information processing device according to claim 3, wherein
the estimation unit estimates, as the recognition position, the position in the virtual space corresponding to the position of the recognition target object in the most recent past frame image.

5. The information processing device according to claim 3, wherein
the estimation unit estimates, as the recognition position, a position shifted along a movement direction of the recognition target object from the position in the virtual space corresponding to the position of the recognition target object in the most recent past frame image.

6. The information processing device according to claim 3, wherein,
when it is determined that the user has recognized a sound emitted by the recognition target object that is not included in the frame image at the current time, the estimation unit estimates, as the recognition position, the position in the virtual space at which the sound is generated.

7. The information processing device according to claim 3, wherein
the three-dimensional space data includes three-dimensional space description data that defines a configuration of the virtual space and three-dimensional object data that defines three-dimensional objects in the virtual space,
the three-dimensional space description data includes role information representing a role of the recognition target object, and fixed position information representing a fixed position associated with the role, and,
when the recognition target object that is not included in the frame image at the current time and to which predetermined role information is set has been rendered up to the current time, the estimation unit estimates the fixed position related to the role as the recognition position.

8. The information processing device according to claim 7, wherein
the estimation unit estimates the fixed position related to the role as the recognition position when the recognition target object to which the same role information is set has been rendered, up to the current time, while located at the fixed position related to the role.

9. The information processing device according to claim 3, wherein
the three-dimensional space data includes three-dimensional space description data that defines a configuration of the virtual space and three-dimensional object data that defines three-dimensional objects in the virtual space,
the three-dimensional space description data includes role information representing a role of the recognition target object, and fixed position information representing a fixed position associated with the role, and,
for the recognition target object that is not included in the frame image at the current time and to which predetermined role information is set, the estimation unit estimates the fixed position related to the role as the recognition position when it is determined that the user has recognized, by the current time, a sound emitted by the recognition target object.

10. The information processing device according to claim 9, wherein
the estimation unit estimates the fixed position related to the role as the recognition position when it is determined that the user has recognized, up to the current time, a sound emitted by the recognition target object to which the same role information is set while that object was located at the fixed position related to the role.

11. The information processing device according to claim 1, wherein
the estimation unit estimates the recognition position based on a position of the recognition target object in the two-dimensional video data in which the recognition target object is rendered.

12. The information processing device according to claim 1, wherein
the generation unit generates the saliency map in which saliency based on bottom-up attention in the out-of-field area is zero.

13. The information processing device according to claim 1, wherein
the generation unit generates the saliency map representing saliency based on top-down attention in the out-of-field area, based on the recognition position of the recognition target object in the out-of-field area.

14. The information processing device according to claim 1, wherein
the generation unit generates the saliency map for the out-of-field area and a saliency map representing saliency of the two-dimensional video data.

15. The information processing device according to claim 1, further comprising
a prediction unit that generates future field of view information as predicted field of view information based on the saliency map, wherein
the rendering unit generates the two-dimensional video data based on the predicted field of view information.

16. The information processing device according to claim 15, wherein
the field of view information includes at least one of a viewpoint position, a line of sight direction, a line of sight rotation angle, a head position of the user, or a head rotation angle of the user.

17. The information processing device according to claim 16, wherein
the field of view information includes the head rotation angle of the user, and
the prediction unit predicts a future head rotation angle of the user based on the saliency map.

18. The information processing device according to claim 15, wherein
the two-dimensional video data is composed of a plurality of frame images that are successive in time series, and
the rendering unit generates a frame image based on the predicted field of view information and outputs the generated frame image as a predicted frame image.

19. An information processing method executed by a computer system, the method comprising:
generating two-dimensional video data according to the user's field of view by performing a rendering process on three-dimensional space data constituting a virtual space based on field of view information relating to the user's field of view;
estimating a recognition position, in an out-of-field area of the virtual space that is not included in the user's field of view, of a recognition target object that, among objects existing in the virtual space, is assumed to be recognized by the user at the current time as existing in the virtual space; and
generating a saliency map representing saliency in the out-of-field area based on the estimated recognition position of the recognition target object in the out-of-field area.