JP7789622B2

JP7789622B2 - Image Processing System

Info

Publication number: JP7789622B2
Application number: JP2022084849A
Authority: JP
Inventors: 博之中村; 朗新井; 太郎守岡
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2025-12-22
Anticipated expiration: 2042-05-24
Also published as: JP2023172783A

Description

The present invention relates to VR (Virtual Reality)/AR (Augmented Reality) technology, and in particular to technology that is effective when applied to an image processing system that realizes image- and video-based VR/AR.

Mechanisms that use VR/AR technology to give users a simulated experience of being in another place or time are being studied and developed. Examples include technologies that let users experience a trip to a distant location (another place) without leaving home, and technologies that let travelers experience, on the spot and with a sense of presence, how the scene in front of them looked in the past (another time).

As a technology related to the latter, U.S. Patent No. 10,127,730 (Patent Document 1), for example, describes a mechanism that, based on information on the current location of a user terminal, uses VR/AR technology to display on the user terminal content simulating an attraction that existed at that location in the past. This technology makes it possible to view past footage of a location superimposed on the scene currently being viewed.

As a technology for displaying other content superimposed on the current scene, Japanese Patent No. 6420605 (Patent Document 2), for example, describes an image processing device for image-recognition-based AR that tracks recognition targets of arbitrary shape and, in particular, enables robust tracking with a small database size and processing load even when the viewpoint changes greatly, for example in shooting angle or distance.

Patent Document 1: U.S. Patent No. 10,127,730
Patent Document 2: Japanese Patent No. 6420605

Using conventional technology such as that described in Patent Document 1, it is possible to realize, for example, a VR/AR system that displays a past scene of a location superimposed on video of the current scene.

However, a mechanism that acquires content on the past scene of a location from location information alone, as in the technology described in Patent Document 1, has a problem: depending on the means used to acquire the location information, it is often impossible to select and determine the appropriate past scene corresponding to the current scene. For example, accurate location information may be unobtainable indoors or underground, errors of several tens of centimeters to several meters may occur, and even at the same position the scene may differ completely depending on the direction the user is facing.

Furthermore, when a VR/AR system displays another image superimposed on video of the current scene, the objects in the superimposed image are often handled as three-dimensional models so that they can follow the user's movement and changes in facing direction, as in the conventional technology described in Patent Document 2. With three-dimensional models, however, objects are rendered by CG in a stylized form, so the scene loses realism, and the data processing load also becomes heavy.

An object of the present invention is therefore to provide an image processing system that, when a past scene is displayed superimposed on video of the current scene by VR/AR, selects a past scene suitable for superimposition effectively and efficiently and reduces the load of the superimposition.

The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

A representative one of the inventions disclosed in this application is briefly summarized as follows.

An image processing system according to a representative embodiment of the present invention superimposes, by means of an image processing device, a past video capturing a corresponding past scene on a current video of a current scene captured by an imaging device, and displays the result on a display device. The image processing device includes: a preprocessing unit that extracts one or more key frames from each of a plurality of past videos and records them, in association with the past video from which they were extracted, as preprocessed past videos; an image search unit that identifies, by image search, a first key frame similar to the current video from among the preprocessed past videos and identifies a first past video corresponding to the first key frame; and an image synthesis unit that transforms the first past video so that the displacement between feature points in the two videos is minimized, superimposes it on the current video, and displays the result on the display device.

The effects obtained by a representative one of the inventions disclosed in this application are briefly described as follows.

That is, according to a representative embodiment of the present invention, when a past scene is displayed superimposed on video of the current scene by VR/AR, a past scene suitable for superimposition can be selected effectively and efficiently. The load of the superimposition can also be reduced.

FIG. 1 is a diagram outlining an example of the configuration of an image processing system according to an embodiment of the present invention.
FIG. 2 is a diagram outlining an example of image processing according to an embodiment of the present invention.
FIG. 3 is a diagram outlining an example of a data structure according to an embodiment of the present invention.
FIG. 4 is a flowchart outlining an example of the flow of preprocessing according to an embodiment of the present invention.
FIG. 5 is a flowchart outlining an example of the flow of image search and synthesis processing according to an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. In all drawings used to describe the embodiments, identical parts are in principle given the same reference numerals, and repeated descriptions are omitted. Parts described with reference numerals in one drawing may, however, be referred to by the same reference numerals in the description of other drawings without being illustrated again.

<Overview>
Mechanisms that use VR/AR technology to let people experience a simulated trip without leaving home were studied and developed further, partly in response to the stay-at-home restrictions brought about by the spread of COVID-19. From the standpoint of regional support in the with-COVID/post-COVID era and of regional revitalization through digitalization, however, it is necessary to aim for VR/AR mechanisms that focus on encouraging people to travel to the actual locations.

That is, as an experience before a trip, for example, letting users view the local scenery in real time or take part in events without leaving home can foster a desire to actually visit the location next time. As an experience before a trip or during the trip itself, showing past scenes (videos) of the location that cannot be viewed directly in real time at that moment, superimposed with a sense of presence on video of the current scene at the location, can foster a desire to come again on another occasion.

The past scenes (videos) to be superimposed may be of many kinds that existed or occurred at some point in the past but do not appear in the current scene: scenes that no longer exist, such as abolished railways or demolished buildings; events held on a specific past date, such as legendary live events or famous sports matches; and commentary recorded on site by reporters, guides, or well-known commentators. They may also be scenes that still exist but were in a different state from the present, such as the best view for sightseeing in a particular season, weather, or time of day: a row of cherry trees in full bloom, a landscape under clear skies, or a distinctive view seen only at dusk or at high or low tide.

The image processing system according to an embodiment of the present invention described below enables users to have the experiences described above by using VR/AR to display a past scene of a location superimposed on (composited with) captured video of the current scene.

<System Configuration>
FIG. 1 is a diagram outlining an example of the configuration of an image processing system according to an embodiment of the present invention. The image processing system 1 includes, for example, an image processing device 10 that performs image processing for VR/AR, an imaging device 20 that captures a scene at the user's instruction and acquires video of it, and a display device 30 that displays the video captured by the imaging device 20 and other images and presents them to the user. The imaging device 20 and the display device 30 are connected to the image processing device 10 by wired or wireless communication or connection means (not shown).

The imaging device 20 can be, for example, a digital video camera, webcam, action camera, dashcam, or any other device with a video capture function. If the display device 30 described below has a camera function, that function may be used and the imaging device 20 may be integrated with the display device 30. The display device 30 can be, for example, a monitor, a PC (Personal Computer) display, or any other device with a video display function. It may also be a device such as a smartphone, tablet, or VR/AR goggles that has, in addition to a display function, a camera function that can serve as the imaging device 20.

The image processing device 10 is, for example, a server machine, a virtual server built on a cloud computing service, or a PC. A CPU (Central Processing Unit, not shown) executes middleware such as an OS (Operating System), a DBMS (Database Management System), and a web server program, loaded into memory from a storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), together with the software running on that middleware, thereby realizing the various functions for VR/AR image processing.

The image processing device 10 has units implemented as software, such as a preprocessing unit 11, an image search unit 12, and an image synthesis unit 13. It also holds image data, such as a past video group 14 and a preprocessed past video group 15, recorded and stored as databases, files, or the like.

The preprocessing unit 11 takes the past videos recorded in the past video group 14, which capture past scenes at various locations, and performs preprocessing on them in advance so that they can be used for VR/AR processing. This preprocessing includes foreground/background separation and scene division, as described later, and the resulting image data is recorded as the preprocessed past video group 15. The image search unit 12 obtains, by image search, a similar past video from the preprocessed past video group 15 based on the video of the current scene captured by the imaging device 20. The image synthesis unit 13 superimposes (composites) the past video obtained by the image search unit 12 on the video of the current scene captured by the imaging device 20 and displays the result on the display device 30.

All or part of the image search unit 12 and the image synthesis unit 13 may be implemented not on the image processing device 10 but on the display device 30, for example a smartphone or VR/AR goggles. The details of the preprocessing, the structure of the preprocessed past video group 15, the search for past videos, and the superimposition (compositing) with the current video are described later.

<Example of Image Processing>
FIG. 2 is a diagram outlining an example of image processing according to an embodiment of the present invention. The upper left of FIG. 2 shows an example of video of the current scene at a location captured by the imaging device 20 (hereinafter sometimes called the "current video"), and the upper right of FIG. 2 shows an example of video of a past scene at the same location (hereinafter sometimes called the "past video"). For convenience, each is represented by one of the consecutive still images (frames) in the video. The past video shows the same "castle" as the current video but not the "cherry blossom branch" that appears in the current video, and it shows "two horses" that existed in the past and are absent from the current video.

In this embodiment, as shown in the lower left of FIG. 2, the past video is transformed, superimposed on the current video, composited, and then played back. Using a two-dimensional live-action video rather than CG as the past video reduces the processing load of selecting and superimposing the past video, speeds up processing, and yields a realistic image. For the superimposition, corresponding feature points (for example, points on objects in the image, such as the vertices of the "castle" in the example of FIG. 2) are extracted from the current video and from the past video to be superimposed, and a two-dimensional transformation matrix (affine transformation matrix) that brings the corresponding feature points into coincidence by rotating and translating the image is calculated. Applying this matrix to the past video transforms (deforms) it so that its feature points overlay the corresponding feature points of the current video in the composite. The affine transformation matrix can be calculated and applied with known methods that express the source and destination as matrices and solve by least squares; libraries such as OpenCV are available for this.
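The least-squares fit mentioned above can be illustrated with a minimal sketch. It assumes the feature-point correspondences are already available, and it restricts the affine matrix to rotation, uniform scale, and translation (a similarity transform), which admits a closed-form solution. The function names `estimate_similarity` and `apply_affine` are hypothetical and are not taken from the specification. In practice, OpenCV functions such as `cv2.estimateAffinePartial2D` or `cv2.estimateAffine2D` could be used instead.

```python
def estimate_similarity(src, dst):
    """Least-squares 2D rotation + uniform scale + translation mapping src -> dst.

    src, dst: equal-length lists of (x, y) corresponding feature points.
    Returns a 2x3 affine matrix [[a, -b, tx], [b, a, ty]].
    """
    n = len(src)
    if n < 2 or n != len(dst):
        raise ValueError("need at least two point correspondences")
    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    num_a = num_b = den = 0.0
    for (x, y), (u, v) in zip(src, dst):
        xc, yc, uc, vc = x - sx, y - sy, u - dx, v - dy
        num_a += xc * uc + yc * vc   # normal-equation term for a
        num_b += xc * vc - yc * uc   # normal-equation term for b
        den += xc * xc + yc * yc
    if den == 0.0:
        raise ValueError("source points are degenerate")
    a, b = num_a / den, num_b / den
    # Translation maps the source centroid onto the destination centroid.
    tx = dx - (a * sx - b * sy)
    ty = dy - (b * sx + a * sy)
    return [[a, -b, tx], [b, a, ty]]


def apply_affine(m, point):
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

Here `src` would hold feature points from the past video's key frame and `dst` the matching points in the current video. Warping every pixel of the past frames with the resulting matrix is left to an image library.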

Instead of superimposing the entire past video on the current video, only the objects appearing as foreground in the past video may be cut out and composited, as shown in the lower right of FIG. 2. That figure shows an example in which only the "two horses" in the past video are cut out and composited. The system may allow switching between superimposing the entire past video and superimposing only its foreground objects. The switch may be made, for example, at the user's instruction or according to the type of display device 30 the user is using (for example, a smartphone or VR/AR goggles).
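The two superimposition modes can be sketched as follows. This is an illustration only: it assumes the past frame has already been aligned to the current frame, represents frames as 2D lists of grayscale values, and uses simple alpha blending. Neither the blending rule nor the names `composite`, `mask`, and `alpha` come from the specification.

```python
def composite(current, past, mask=None, alpha=0.5):
    """Blend an already aligned past frame onto the current frame.

    current, past: 2D lists of grayscale pixel values with the same shape.
    mask: optional 2D list of 0/1 flags marking foreground pixels of the
          past frame; when None, the whole past frame is blended in.
    alpha: opacity of the past frame wherever it is applied (assumed value).
    """
    out = []
    for r, (cur_row, past_row) in enumerate(zip(current, past)):
        row = []
        for c, (cur, old) in enumerate(zip(cur_row, past_row)):
            use_past = mask is None or mask[r][c]
            row.append((1 - alpha) * cur + alpha * old if use_past else cur)
        out.append(row)
    return out
```

Passing `mask=None` corresponds to overlaying the whole past video. Passing a foreground mask corresponds to overlaying only the cut-out foreground objects.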

<Data Structure>
FIG. 3 is a diagram outlining an example of a data structure according to an embodiment of the present invention. In this embodiment, superimposing video of a past scene (past video 14a) on video of the current scene (current video 21) requires identifying a past video 14a that captures the same subject as the current video 21. The past video 14a is identified by a similar-image search using the current video 21. This increases the number of cases in which the past video 14a to be superimposed can be identified even when location information for the place where the current video 21 was captured, or for the place of its subject, cannot be obtained adequately.

In this embodiment, the similar-image search targets not the past video 14a itself but its key frames 15c, which are one or more characteristic still images (frames) that represent the past video 14a. This makes the similar-image search efficient. The key frames 15c are extracted from the past video 14a by the preprocessing unit 11 and are recorded and held, in association with the past video 14a, as one item of data in the preprocessed past video group 15.

To extract the key frames 15c, the preprocessing unit 11 first separates the foreground objects from the background in the past video 14a, obtaining a foreground video 15b containing only the foreground objects and a background video 15a containing only the background. These are also recorded and held, in association with the past video 14a, as items of data in the preprocessed past video group 15. A foreground object in the past video 14a is typically an object that existed in the past but does not appear in the current video 21, so separating it before performing the similar-image search against the current video 21 allows the past video 14a to be identified more efficiently and effectively.

Characteristic frames (frames to be used for superimposition) are then extracted as key frames 15c from the background video 15a, from which the foreground objects have been removed. In this embodiment, the background video 15a is divided into scenes at the points where the scene changes substantially, and the first frame of each scene is extracted as a key frame 15c, but the invention is not limited to this. A frame at the end or in the middle of a scene may also serve as a key frame 15c, provided it can be treated as a characteristic frame representing that scene.

A scene is divided, for example, at the point where the amount of change between frame images in the six degrees of freedom (6DoF: three translational degrees of freedom for movement along the axes of a three-dimensional Cartesian coordinate system and three rotational degrees of freedom about those axes) exceeds a predetermined threshold, although the criterion is not limited to this.

In the similar-image search using the current video 21, the search is performed over the key frames 15c extracted from each of the past videos 14a recorded in the past video group 14, and the past video 14a corresponding to the matching key frame 15c is identified. The identified past video 14a, or the foreground video 15b separated and extracted from it, is then transformed by the two-dimensional affine transformation matrix described above and superimposed on and composited with the current video 21.
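The associations described in this section can be pictured with a small data-structure sketch. The class and field names below are hypothetical; the specification states only that the background video 15a, foreground video 15b, and key frames 15c are recorded in association with the originating past video 14a.

```python
from dataclasses import dataclass, field


@dataclass
class PreprocessedPastVideo:
    """One record of the preprocessed past video group 15 (illustrative)."""
    past_video_id: str       # identifies the originating past video 14a
    background_video: str    # path or ID of background video 15a
    foreground_video: str    # path or ID of foreground video 15b
    key_frames: list = field(default_factory=list)  # IDs of key frames 15c


def build_key_frame_index(records):
    """Map each key frame ID to the ID of the past video it was extracted from."""
    index = {}
    for rec in records:
        for kf in rec.key_frames:
            index[kf] = rec.past_video_id
    return index
```

With such an index, a key frame matched by the similar-image search leads directly back to its past video and, through the record, to the foreground video to overlay.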

<Processing Flow>
FIG. 4 is a flowchart outlining an example of the flow of preprocessing according to an embodiment of the present invention. In this preprocessing, the preprocessing unit 11 extracts in advance, from each past video 14a held in the past video group 14, the key frames 15c used for the similar-image search and records them as items of data in the preprocessed past video group 15.

First, the preprocessing unit 11 separates the foreground and background of the past video 14a to be processed (S01). That is, it performs image matting: the foreground objects are cut out of the past video 14a and extracted as the foreground video 15b, and removing them from the past video 14a yields the background video 15a. Many existing technologies, methods, and libraries in practical use are available for this processing, for example AI-based image processing using a CNN (Convolutional Neural Network), and they can be used as appropriate for the implementation. The separated background video 15a and foreground video 15b are recorded in the preprocessed past video group 15 in association with the past video 14a. The subsequent steps are performed on the separated background video 15a.

Once the background video 15a has been obtained, its type is determined (S02), and, depending on the type, the six-degree-of-freedom information for each frame is obtained and the amount of change between frames is calculated. If the background video 15a already carries six-degree-of-freedom information, the amount of change in the six degrees of freedom between frames is obtained directly (S05). If the background video 15a is a stereo video captured with a stereo camera, the six degrees of freedom are estimated, for example, by a known triangulation method (S03), and the amount of change between frames is obtained (S05). If the background video 15a is a monocular video captured with a monocular camera, the six degrees of freedom are estimated by tracking feature points in the video, for example with a known V-SLAM (Visual Simultaneous Localization and Mapping) method (S04), and the amount of change between frames is obtained (S05).

When the amount of change in the six degrees of freedom between frames obtained in step S05 exceeds a predetermined threshold, the scene is divided between those frames (S06). The amount of change may be the simple sum of the changes in each of the six degrees of freedom, or a sum in which the changes in one or more specific degrees of freedom are weighted. For example, of the six degrees of freedom (moving forward/backward, left/right, and up/down; tilting forward/backward (pitch); turning the head left/right (yaw); and tilting the head left/right (roll)), yaw is often the dominant movement of a typical user viewing VR/AR images, so the change in yaw may be given a large weight. The degrees of freedom to be weighted and the weight values may be changed according to the content of the video. Instead of using the sum over the six degrees of freedom, the scene may be divided when the change in one or more specific degrees of freedom exceeds a predetermined threshold.
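The weighted threshold test of step S06 could look roughly like the sketch below. It assumes per-frame poses are already available as `(x, y, z, pitch, yaw, roll)` tuples in consistent units and ignores angle wrap-around. The function name, the tuple layout, and the example weights and threshold are illustrative assumptions, not values from the specification.

```python
def split_scenes(poses, threshold, weights=(1, 1, 1, 1, 1, 1)):
    """Return the indices of frames that start a new scene.

    poses: per-frame 6DoF tuples (x, y, z, pitch, yaw, roll).
    A new scene starts at frame i when the weighted sum of the absolute
    per-axis changes between frames i-1 and i exceeds threshold.
    """
    starts = [0] if poses else []
    for i in range(1, len(poses)):
        change = sum(w * abs(b - a)
                     for w, a, b in zip(weights, poses[i - 1], poses[i]))
        if change > threshold:
            starts.append(i)
    return starts
```

Because this embodiment takes the first frame of each scene as its key frame, the returned indices can double as key frame positions. Emphasizing yaw amounts to passing, for instance, `weights=(1, 1, 1, 1, 2, 1)`.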

As a simpler method, instead of the amount of change in the six degrees of freedom between frames, a simple amount of change (difference) between the image data of successive frames may be obtained, for example, and the scene divided between those frames when that amount exceeds a predetermined threshold.
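One way to realize this simpler criterion is a mean absolute pixel difference, sketched below. The difference measure, the grayscale 2D-list frame representation, and the function names are assumptions made for illustration; the specification does not prescribe a particular difference metric.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two same-sized grayscale frames."""
    total = count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def split_by_frame_diff(frames, threshold):
    """Return indices of frames whose difference from the previous frame
    exceeds threshold; frame 0 always starts the first scene."""
    starts = [0] if frames else []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            starts.append(i)
    return starts
```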

Once the background video 15a has been divided into scenes, the first frame of each scene is extracted as a key frame 15c (S07). The data of the key frame group consisting of the one or more extracted key frames 15c is then recorded in the preprocessed past video group 15 in association with information on the past video 14a from which it was extracted (S08). The preprocessing ends when this series of steps has been performed on every past video 14a to be processed.

FIG. 5 is a flowchart outlining an example of the flow of image search and synthesis processing according to an embodiment of the present invention. In this processing, the image search unit 12 and the image synthesis unit 13 superimpose and composite the corresponding past video 14a on the video of the current scene captured by the imaging device 20 and display the result on the display device 30.

First, the image search unit 12 acquires location information for the site (S11). Here, the site means the place where the imaging device 20 capturing the current video 21 is located, the place where the subject of the current video 21 is located, or a place in the vicinity of either. If the imaging device 20 is an information processing terminal such as a smartphone that has a GPS (Global Positioning System) function as well as a camera function, the imaging device 20 acquires the location as latitude and longitude using the GPS function and sends it to the image processing device 10 via communication means (not shown), where the image search unit 12 receives it. Alternatively, the location information may be acquired from a GPS-equipped information processing terminal at the site other than the imaging device 20, or the photographer or user may enter the name of a facility or landmark at the site and the corresponding location information may be obtained via the Internet or the like.

Next, based on the acquired location information for the site, the image search unit 12 acquires one or more candidate past videos 14a from the preprocessed past video group 15 (S12). For example, past videos 14a shot within a predetermined distance of the site location are acquired as candidates. The key frames 15c associated with each acquired past video 14a are then acquired from the preprocessed past video group 15 (S13). This narrows down in advance the key frames 15c that will be subject to the similar image search described below. To make this narrowing possible, each past video 14a is assumed to be recorded in association with location information for its shooting location or its subject. If no candidate past video 14a corresponds to the current location information, the processing ends and the image of the current video 21 being captured by the imaging device 20 is displayed as is on the display device 30.
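As a non-limiting illustration of steps S12 and S13, the narrowing by distance can be sketched as follows. The record layout `PastVideo`, the function names, the use of the haversine great-circle distance, and the 200 m default radius are all assumptions made for this sketch; the embodiment only requires that past videos shot within a predetermined distance be selected.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


@dataclass
class PastVideo:
    """Hypothetical record: one past video 14a with its location and key frames 15c."""
    video_id: str
    lat: float       # latitude of the shooting location or subject, in degrees
    lon: float       # longitude, in degrees
    keyframes: list  # key frames 15c associated with this video during preprocessing


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def candidate_videos(videos, lat, lon, radius_m=200.0):
    """S12: keep the past videos recorded within radius_m of the site."""
    return [v for v in videos if haversine_m(lat, lon, v.lat, v.lon) <= radius_m]


def candidate_keyframes(videos, lat, lon, radius_m=200.0):
    """S13: collect (video_id, key frame) pairs for the candidate videos only."""
    return [
        (v.video_id, kf)
        for v in candidate_videos(videos, lat, lon, radius_m)
        for kf in v.keyframes
    ]
```

An empty result from `candidate_videos` corresponds to the case in which no candidate exists and the current video 21 is displayed unchanged.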

Once the narrowed-down key frames 15c have been acquired, the key frame 15c that is similar to the current video 21 being captured by the imaging device 20 is identified from among them by a similar image search (S14), and the past video 14a corresponding to the identified key frame 15c is selected from the candidates acquired in step S12 (S15). The similar image search method is not particularly limited, and a known library or the like that uses AI technology can be used as appropriate.
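Because the search method is left open, the following is only one possible sketch of steps S14 and S15. It assumes that each key frame 15c and the current frame have already been reduced to numeric feature vectors by some means not specified here, and it ranks key frames by cosine similarity. The tuple layout, the function names, and the `min_score` cut-off are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (na * nb)


def find_similar_keyframe(query_vec, keyframes, min_score=0.8):
    """S14/S15: return (video_id, keyframe_index, score) for the best match.

    keyframes is a list of (video_id, keyframe_index, feature_vector) tuples
    already narrowed down by location. Returns None when no key frame reaches
    min_score (an assumed cut-off, not taken from the embodiment).
    """
    best = None
    for video_id, kf_index, vec in keyframes:
        score = cosine_similarity(query_vec, vec)
        if score >= min_score and (best is None or score > best[2]):
            best = (video_id, kf_index, score)
    return best
```

The returned `video_id` identifies the past video 14a to select, and `keyframe_index` identifies the key frame 15c used in the subsequent alignment.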

Once the past video 14a has been selected, the image synthesis unit 13 extracts corresponding feature points from the key frame 15c identified in step S14 and from the image of the current video 21, and calculates a two-dimensional transformation matrix for image synthesis (an affine transformation matrix) that matches the corresponding feature points by rotation and translation (S16). The affine transformation matrix calculated in step S16 is then applied to the past video 14a selected in step S15 to transform (deform) it, and the transformed video is superimposed on the current video 21, synthesized, and played back (S17). The image played back with the past video 14a superimposed on the moving image of the current video 21 is displayed on the display device 30.
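As a non-limiting illustration of step S16, the sketch below estimates a 2x3 matrix from matched feature points by a least-squares fit restricted to rotation and translation, the two operations named above. This closed-form fit is an assumption chosen for brevity; the embodiment does not prescribe the estimation method, and a general affine fit would also allow scaling and shear. Feature extraction and matching are assumed to have been done already, and the function names are hypothetical.

```python
import math


def estimate_rigid_transform(src_pts, dst_pts):
    """Least-squares rotation + translation mapping src_pts onto dst_pts.

    src_pts: feature points (x, y) in the key frame 15c.
    dst_pts: corresponding feature points in the current video 21.
    Returns a 2x3 matrix [[c, -s, tx], [s, c, ty]].
    """
    n = len(src_pts)
    if n < 2 or n != len(dst_pts):
        raise ValueError("need at least two matched point pairs")
    # Centroids of both point sets.
    sx = sum(p[0] for p in src_pts) / n
    sy = sum(p[1] for p in src_pts) / n
    dx = sum(q[0] for q in dst_pts) / n
    dy = sum(q[1] for q in dst_pts) / n
    # Accumulate dot and cross terms of the centred points.
    dot = cross = 0.0
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        px, py = x - sx, y - sy
        qx, qy = u - dx, v - dy
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    theta = math.atan2(cross, dot)  # rotation angle minimising squared error
    c, s = math.cos(theta), math.sin(theta)
    # Translation that carries the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return [[c, -s, tx], [s, c, ty]]


def apply_transform(matrix, pt):
    """Apply a 2x3 matrix to a single (x, y) point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = pt
    return (a * x + b * y + tx, c * x + d * y + ty)
```

In step S17 the same matrix would be applied to every frame of the selected past video 14a before it is blended onto the current video 21.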

When the past video 14a is played back, the entire past video 14a may be played from its beginning, or playback may start from the point in the past video 14a that corresponds to the key frame 15c identified in step S14. The point corresponding to the key frame 15c may be, for example, the beginning of the scene that contains the key frame 15c, or the key frame 15c itself.

Thereafter, it is determined whether the amount of deviation in the composite produced in step S17 by superimposing the past video 14a on the current video 21 (that is, the deviation of each feature point after the affine transformation matrix is applied) exceeds a predetermined threshold (S18). The amount of deviation compared with the threshold may be the sum of the values at the individual feature points, or the value for any single feature point. The determination may also use both. If the amount of deviation is at or below the threshold (No in step S18), the contents of the current video 21 and the past video 14a are considered to still match, so the processing returns to step S17 and playback with the past video 14a composited onto the current video 21 continues. If the amount of deviation exceeds the threshold (Yes in step S18), the current video 21 and the past video 14a are considered to no longer match, and the processing of compositing the past video 14a ends.
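The determination in step S18 can be sketched as follows, purely as an illustration. The sketch uses both criteria mentioned above: the sum of the per-point deviations, and the deviation of a single point, interpreted here as the largest one. The Euclidean distance as the deviation measure, the function names, and the default limits are assumptions, not values from the embodiment.

```python
import math


def residuals(matrix, src_pts, dst_pts):
    """Distance between each transformed key-frame point and its current-video match.

    matrix is a 2x3 affine matrix [[a, b, tx], [c, d, ty]].
    """
    (a, b, tx), (c, d, ty) = matrix
    out = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        out.append(math.hypot(a * x + b * y + tx - u, c * x + d * y + ty - v))
    return out


def overlay_still_matches(matrix, src_pts, dst_pts, sum_limit=30.0, point_limit=10.0):
    """S18: True while the deviation stays at or below both assumed limits.

    True corresponds to "No" in step S18 (keep compositing); False corresponds
    to "Yes" (stop compositing and reselect a past video from step S11).
    """
    r = residuals(matrix, src_pts, dst_pts)
    if not r:
        return False  # no matched points left, so the overlay cannot be verified
    return sum(r) <= sum_limit and max(r) <= point_limit
```

A caller would evaluate `overlay_still_matches` on each newly captured frame and leave the compositing loop as soon as it returns `False`.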

After that, the current video 21 is displayed as is on the display device 30, and the series of processes from step S11 is repeated to reselect a past video 14a that corresponds to the current video 21. Through this series of processes, an appropriate past video 14a can be selected and superimposed as needed while display of the current video 21 continues.

As described above, in a mechanism that uses VR/AR to superimpose (composite) a past scene of a location on captured video of the current scene and display the result, the image processing system 1 according to one embodiment of the present invention can select a past scene suitable for superimposition effectively and efficiently, and can also reduce the load of the superimposition.

That is, the image processing system 1 according to one embodiment of the present invention uses two-dimensional, live-action past videos 14a, rather than CG, as the past scenes to be superimposed on the current video 21. For each past video 14a, key frames 15c are extracted in advance based on the amount of change in three-dimensional position and orientation (six degrees of freedom) and are associated with that video. The candidate past videos 14a are then narrowed down from the past video group 14 using location information for the current video 21, the key frame 15c similar to the current video 21 is extracted from the key frames 15c of the narrowed-down past videos 14a by a similar image search between two-dimensional images, and the corresponding past video 14a is identified. A two-dimensional affine transformation matrix is then generated so that the feature points of the extracted key frame 15c and of the current video 21 coincide (that is, so that the deviation is minimized), and the identified past video 14a is transformed (deformed) by this matrix and superimposed on the current video 21.

These techniques reduce the processing load of selecting the past video 14a to be superimposed on the current video 21 and of performing the superimposition, which speeds up processing and yields realistic VR/AR video.

The invention made by the present inventors has been described above specifically on the basis of an embodiment, but it goes without saying that the present invention is not limited to that embodiment and can be modified in various ways without departing from its gist. The embodiment has been described in detail in order to explain the present invention clearly, and the invention is not necessarily limited to one that includes all of the described configurations. Other configurations may also be added to, deleted from, or substituted for part of the configuration of the embodiment.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be realized in hardware, for example by designing them as integrated circuits. The configurations, functions, and the like described above may also be realized in software, with a processor interpreting and executing programs that implement the respective functions. Information such as the programs, tables, and files that implement each function can be stored in a recording device such as a memory, a hard disk, or an SSD, or on a recording medium such as an IC card, an SD card, or a DVD.

In the figures described above, the control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of an actual implementation are necessarily shown. In practice, almost all of the components may be considered to be interconnected.

The present invention is applicable to image processing systems that realize image- and video-based VR/AR.

1... image processing system,
10... image processing device, 11... preprocessing unit, 12... image search unit, 13... image synthesis unit, 14... past video group, 14a... past video, 15... preprocessed past video group, 15a... background video, 15b... foreground video, 15c... key frame,
20... imaging device, 21... current video,
30... display device

Claims

1. An image processing system in which an image processing device superimposes, on a current video of a current scene captured by an imaging device, a past video in which the corresponding past scene was captured, and displays the result on a display device,
wherein the image processing device comprises:
a preprocessing unit that extracts one or more key frames from each of a plurality of past videos, associates the extracted key frames with the respective past videos, and records them as preprocessed past videos;
an image search unit that identifies, by image search, a first key frame similar to the current video from the preprocessed past videos, and identifies a first past video corresponding to the first key frame; and
an image synthesis unit that transforms the first past video and superimposes it on the current video so as to minimize the deviation between feature points in the respective images, plays back the first past video from a position corresponding to the first key frame, and displays it on the display device.

2. The image processing system according to claim 1,
wherein the preprocessing unit acquires, for each frame image of each of the past videos, values of six degrees of freedom consisting of three translational degrees of freedom and three rotational degrees of freedom, and, when the amount of change in the values of the six degrees of freedom between frame images exceeds a predetermined threshold, divides each of the past videos into scenes between those frame images and extracts a frame image representing each scene as one of the key frames.

3. The image processing system according to claim 2,
wherein the preprocessing unit calculates the amount of change in the six degrees of freedom between frame images of each of the past videos by weighting the amount of change in yaw, among the three rotational degrees of freedom, more heavily than the other degrees of freedom.

4. The image processing system according to claim 1,
wherein the image search unit identifies the first key frame by image search from among the plurality of key frames recorded as the preprocessed past videos, after narrowing down the key frames based on location information relating to the current video.