JP7640248B2

JP7640248B2 - Image generating device and image generating method

Info

Publication number: JP7640248B2
Application number: JP2020187607A
Authority: JP
Inventors: 修輿水
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2025-03-05
Anticipated expiration: 2040-11-10
Also published as: JP2022076940A

Description

The present invention relates to an image generating device and an image generating method.

Devices that create subtitles from video data are known. In such a device, a face detection means detects facial features and face positions from the video data, and a voice identification means detects voice features from the video data. The detected features are sent to a speaker identification means and compared with the speaker features registered in a voice/face correspondence data storage means, which identifies the position of the speaker. The identified speaker's voice is converted to text by a voice recognition means. A speech bubble creation means creates a speech bubble from the speaker's position and the text data, and a video creation means combines the video data, audio data, and speech bubble data into new video data (see, for example, Patent Document 1).

Patent Document 1: JP 2007-027990 A

However, the conventional technique described above has the following problem. If the facial features or voice features detected from the video data are not registered in the voice/face correspondence data storage means, the speaker and the voice may be linked incorrectly. As a result, the voice of a different speaker may be displayed in the speech bubble, and the viewer may misjudge which speaker the voice belongs to.

An object of one aspect of the present invention is to enable viewers to properly recognize the association between objects and text data.

To solve the above problem, an image generating device according to one aspect of the present invention includes: an acquisition unit that acquires image data representing an image and audio data representing audio accompanying the image; an identification processing execution unit that executes an identification process of identifying, among one or more objects included in the image, the object that is generating the audio, by inputting the image data and audio data acquired by the acquisition unit into a trained model; and a superimposition unit that superimposes text data corresponding to the audio data at a position within the image that corresponds to the result of the identification process.

To solve the above problem, an image generating device according to another aspect of the present invention includes: an acquisition unit that acquires image data representing an image and audio data representing audio accompanying the image; an identification processing execution unit that executes an identification process of identifying, among one or more objects included in the image, the object that is generating the audio, by inputting the image data and audio data acquired by the acquisition unit into a trained model; and a superimposition unit that generates an image on which information identifying the object is superimposed, the information having a display form corresponding to accuracy information indicating the likelihood that the identification process is correct.

To solve the above problem, an image generating method according to one aspect of the present invention includes: an acquisition step of acquiring image data representing an image and audio data representing audio accompanying the image; an identification processing execution step of executing an identification process of identifying, among one or more objects included in the image, the object that is generating the audio, by inputting the image data and audio data acquired in the acquisition step into a trained model; and a superimposition step of superimposing text data corresponding to the audio data at a position within the image that corresponds to the result of the identification process.

To solve the above problem, an image generating method according to another aspect of the present invention includes: an acquisition step of acquiring image data representing an image and audio data representing audio accompanying the image; an identification processing execution step of executing an identification process of identifying, among one or more objects included in the image, the object that is generating the audio, by inputting the image data and audio data acquired in the acquisition step into a trained model; and a superimposition step of generating an image on which information identifying the object is superimposed, the information having a display form corresponding to accuracy information indicating the likelihood that the identification process is correct.

According to one aspect of the present invention, viewers can properly recognize the association between objects and text data.

FIG. 1 is a perspective view of a display device according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the main configuration of the image generating device included in the display device shown in FIG. 1.
FIG. 3 is a diagram showing one display example displayed on the display unit of the display device shown in FIG. 1.
FIG. 4 is a diagram showing another display example displayed on the display unit of the display device shown in FIG. 1.
FIG. 5 is a diagram showing another display example displayed on the display unit of the display device shown in FIG. 1.
FIG. 6 is a diagram showing another display example displayed on the display unit of the display device shown in FIG. 1.
FIG. 7 is a flowchart showing an image generating method executed by the image generating device included in the display device shown in FIG. 1.
FIG. 8 is a block diagram showing the main configuration of an image generating device according to Embodiment 2 of the present invention.
FIG. 9 is a diagram showing one display example displayed on the display unit shown in FIG. 8.
FIG. 10 is a diagram showing another display example displayed on the display unit shown in FIG. 8.
FIG. 11 is a flowchart showing an image generating method executed by the image generating device shown in FIG. 8.

[Embodiment 1]
An embodiment of the present invention is described below with reference to FIGS. 1 to 7.

FIG. 1 is a perspective view of the display device 1 according to this embodiment, seen from the front. In this embodiment, the display device 1 is implemented as a television receiver. As shown in FIG. 1, the display device 1 includes at least an image generating device 10 and a display unit 40.

The configuration of the image generating device 10 included in the display device 1 shown in FIG. 1 is described in detail below with reference to FIG. 2. FIG. 2 is a block diagram showing the main configuration of the image generating device 10.

As shown in FIG. 2, the image generating device 10 includes an acquisition unit 11, an audio/image recognition unit 12, a text data generation unit 13, a subtitle display position determination unit 14, and a superimposition unit 15. The acquisition unit 11 is an example of the "acquisition unit" in the claims. The audio/image recognition unit 12 and the subtitle display position determination unit 14 together are an example of the "identification processing execution unit" in the claims. The superimposition unit 15 is an example of the "superimposition unit" in the claims. In FIG. 2, reference numeral 20 denotes an audio recognition DB, reference numeral 30 denotes an image recognition DB, and reference numeral 40 denotes the display unit. The audio recognition DB 20 and the image recognition DB 30 may be (1) provided inside the display device 1, or (2) provided outside the display device 1 and configured to communicate with it.

The acquisition unit 11 acquires image data representing an image and audio data representing audio accompanying the image. As one example, the acquisition unit 11 separates and extracts the image data and audio data from data obtained by demodulating broadcast waves. As another example, the acquisition unit 11 separates and extracts the image data and audio data from data obtained by decoding content data. The image represented by the image data may be a still image or a moving image (video). As shown in FIG. 2, the image data acquired by the acquisition unit 11 is output to both the image recognition unit of the audio/image recognition unit 12 and the superimposition unit 15. The audio data acquired by the acquisition unit 11 is output to both the text data generation unit 13 and the audio recognition unit of the audio/image recognition unit 12.

The audio/image recognition unit 12 (1) identifies the object that emitted the sound represented by the audio data acquired by the acquisition unit 11, (2a) identifies each object included as a subject in the image represented by the image data acquired by the acquisition unit 11, and (2b) determines the position of each object on the screen. The audio/image recognition unit 12 outputs the following to the subtitle display position determination unit 14: (1) identification information of the object that emitted the sound represented by the audio data, (2a) identification information of each object included as a subject in the image represented by the image data, and (2b) position information of each of those objects.

The audio/image recognition unit 12 can consist of an audio recognition unit and an image recognition unit, for example, as shown in FIG. 2.

The audio recognition unit identifies the object that emitted the sound represented by the audio data acquired by the acquisition unit 11. To perform this identification (an example of the "second identification process" in the claims), the audio recognition unit uses, for example, a trained model (an example of the "second trained model" in the claims) stored in the audio recognition DB 20. This trained model (for example, a neural network) takes audio data as input and outputs identification information of the object that emitted the sound represented by that audio data. Here, the identification information of an object is information that distinguishes the object from other objects, for example, an identifier assigned to the object. This trained model can be built, for example, by machine learning that uses pairs of audio data and the identification information of the object that emitted the corresponding sound as training data.

For example, when audio data representing the voice of Prime Minister A is input to this trained model, the model outputs identification information for Prime Minister A. Likewise, audio data representing the voice of President T yields identification information for President T, audio data representing a dog barking yields identification information for the dog, and audio data representing an ambulance siren yields identification information for the ambulance.

The image recognition unit identifies each object included as a subject in the image represented by the image data acquired by the acquisition unit 11, and determines the position of each object on the screen. To perform this identification and position determination (an example of the "first identification process" in the claims), the image recognition unit uses, for example, a trained model (an example of the "first trained model" in the claims) stored in the image recognition DB 30. This trained model (for example, a neural network) takes image data as input and outputs identification information and position information of each object included as a subject in the image represented by that image data. Here, the position information of an object is information that represents the position of the object on the screen, for example, the coordinates of a representative point of the object. This trained model can be built, for example, by machine learning that uses combinations of image data and the identification information and position information of the objects included as subjects in the corresponding image as training data.

For example, when image data representing an image that includes Prime Minister A and President T as subjects is input to this trained model, the model outputs identification information and position information for Prime Minister A, and identification information and position information for President T. Likewise, when image data representing an image that includes a dog and an ambulance as subjects is input, the model outputs identification information and position information for the dog, and identification information and position information for the ambulance.

The text data generation unit 13 converts the sound represented by the audio data acquired by the acquisition unit 11 into text data. Methods for converting sound represented by audio data into text data are well known, so their description is omitted here. The text data generation unit 13 outputs the generated text data to the superimposition unit 15.

Based on the identification information and position information generated by the audio/image recognition unit 12, the subtitle display position determination unit 14 identifies, among the objects included as subjects in the image represented by the image data acquired by the acquisition unit 11, the object that is emitting the sound represented by the audio data acquired by the acquisition unit 11. The object identified by the subtitle display position determination unit 14 is the object whose identification information obtained by the image recognition unit matches the identification information obtained by the audio recognition unit. Then, based on the position information of the identified object, the subtitle display position determination unit 14 determines the position at which the text generated by the text data generation unit 13 is displayed as a subtitle. For example, the subtitle display position determination unit 14 sets that position near the identified object. Alternatively, it sets that position inside a speech bubble image displayed near the identified object.
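The matching step described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the patent: the `DetectedObject` type, the functions `find_speaking_object` and `subtitle_position`, and the fixed pixel offset are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DetectedObject:
    """One object reported by the image recognition unit (hypothetical type)."""
    object_id: str             # identification information
    position: Tuple[int, int]  # representative point (x, y) in pixels


def find_speaking_object(
    speaker_id: str, detected: Sequence[DetectedObject]
) -> Optional[DetectedObject]:
    """Return the detected object whose ID matches the ID from audio recognition."""
    for obj in detected:
        if obj.object_id == speaker_id:
            return obj
    return None  # the sounding object is not among the detected subjects


def subtitle_position(
    speaker: DetectedObject, offset: Tuple[int, int] = (0, -40)
) -> Tuple[int, int]:
    """Place the subtitle near the speaker; the offset is an arbitrary assumption."""
    x, y = speaker.position
    dx, dy = offset
    return (x + dx, y + dy)


detected = [
    DetectedObject("OB1", (200, 300)),
    DetectedObject("OB2", (600, 320)),
]
speaker = find_speaking_object("OB2", detected)
if speaker is not None:
    print(subtitle_position(speaker))  # (600, 280)
```

Returning `None` when no ID matches leaves the caller free to fall back to a fixed subtitle area, which is the behavior the text describes for cases where the sounding object cannot be identified.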

The superimposition unit 15 superimposes the text data generated by the text data generation unit 13, which corresponds to the audio data acquired by the acquisition unit 11, on the image represented by the image data acquired by the acquisition unit 11. The superimposition unit 15 superimposes the text data at the position determined by the subtitle display position determination unit 14. The subtitled image generated by the superimposition unit 15 is displayed on the display unit 40.

The display device 1 makes it possible to display subtitles obtained by converting audio into text at a location that corresponds to the position of the object that emitted the audio. Therefore, a viewer watching the display device 1, even a viewer with a hearing impairment, can properly recognize the association between the object and the subtitles.

The superimposition unit 15 may determine the position at which to superimpose the text data by referring to accuracy information indicating the likelihood that the identification process is correct. Here, the identification process refers to processing that includes the object identification in the audio recognition unit, the object identification in the image recognition unit, and the object identification in the subtitle display position determination unit 14.

For example, if the likelihood indicated by the accuracy information is higher than a predetermined level, the superimposition unit 15 superimposes the text data as a subtitle near the object generating the sound. Otherwise, the superimposition unit 15 superimposes the text data as a subtitle in a predetermined area that does not depend on the object generating the sound, such as the bottom, top, right, or left edge of the screen.

Alternatively, if the likelihood indicated by the accuracy information is higher than a predetermined level, the superimposition unit 15 superimposes, near the object generating the sound, a speech bubble image indicating that the sound comes from that object, and superimposes the text data as a subtitle inside the speech bubble image. Otherwise, the superimposition unit 15 superimposes the text data as a subtitle in a predetermined area that does not depend on the object generating the sound, such as the bottom, top, right, or left edge of the screen.

With the above configuration, the superimposition unit 15 can appropriately determine the position at which to superimpose the text data.

The "predetermined level" referred to here is not particularly limited and may be set as appropriate.
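The threshold rule described above might look like the following sketch. The function name `choose_subtitle_position`, the default threshold of 0.8, and the fallback coordinates are assumptions made for illustration; the text deliberately leaves the predetermined level open.

```python
from typing import Optional, Tuple

Point = Tuple[int, int]


def choose_subtitle_position(
    speaker_position: Optional[Point],
    confidence: float,
    threshold: float = 0.8,                 # assumed value; the text leaves it open
    fallback_position: Point = (640, 680),  # assumed bottom-of-screen anchor
) -> Tuple[Point, bool]:
    """Return (position, use_speech_bubble) for the subtitle.

    The subtitle goes near the speaker only when the speaker is known and the
    confidence is strictly higher than the threshold; otherwise it goes to a
    fixed area that does not depend on the speaker.
    """
    if speaker_position is not None and confidence > threshold:
        return speaker_position, True
    return fallback_position, False


print(choose_subtitle_position((200, 300), 0.93))  # ((200, 300), True)
print(choose_subtitle_position((200, 300), 0.41))  # ((640, 680), False)
print(choose_subtitle_position(None, 0.99))        # ((640, 680), False)
```

The comparison is strict because the text says "higher than" the predetermined level; a confidence exactly equal to the threshold falls back to the fixed area.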

(Display examples shown on the display unit)
FIG. 3 is a diagram showing one display example displayed on the display unit 40 of the display device 1 shown in FIG. 1, and FIGS. 4 to 6 are diagrams showing other display examples.

As shown in FIG. 3, when the identification process described above identifies the object generating the sound as OB1, a human, the superimposition unit 15 can superimpose the text data TD1 corresponding to that sound ("Hello" in this example) on a speech bubble image near OB1.

This allows a viewer watching the display unit 40, even a viewer with a hearing impairment, to properly recognize the association between the object (OB1 in this example) and the subtitle ("Hello" in this example).

On the other hand, as shown in FIG. 4, when the identification process described above identifies the object generating the sound as OB2, a human, the superimposition unit 15 can superimpose the text data TD2 corresponding to that sound ("Welcome" in this example) on a speech bubble image near OB2. This achieves the same effect as the example shown in FIG. 3.

If the identification process fails to identify the object generating the sound, the superimposition unit 15 may superimpose the text data corresponding to the audio data as a subtitle in a predetermined area that does not depend on the object generating the sound, such as the bottom, top, right, or left edge of the screen.

(Modification of Embodiment 1)
In this embodiment, the object has been described as a human, but the object is not limited to a human and may be anything that emits sound. For example, the object may be a barking dog or an ambulance sounding its siren.

As shown in FIG. 5, when the identification process described above identifies the object generating the sound as OB3, a dog, the superimposition unit 15 can superimpose the text data TD3 corresponding to that sound ("woof woof" in this example) on a speech bubble image near OB3. This achieves the same effect as the example shown in FIG. 3.

On the other hand, as shown in FIG. 6, when the identification process described above identifies the object generating the sound as OB4, an ambulance, the superimposition unit 15 can superimpose the text data TD4 corresponding to that sound ("pee-poh pee-poh" in this example) on a speech bubble image near OB4. This achieves the same effect as the example shown in FIG. 3.

This embodiment has described a configuration in which only the text data representing the sound emitted by an object is superimposed near that object, but the present invention is not limited to this. For example, the following configuration is also possible.

That is, the superimposition unit 15 may display, near the object, identifying information of the object in addition to the text data representing the sound emitted by the object. Here, the identifying information of an object is information for identifying that object, for example, the name of the object. As with the text data, the identifying information may be superimposed only when the likelihood indicated by the accuracy information described above is higher than a predetermined level.

In this case, although not illustrated, the identifying information may be displayed on the display unit 40 at positions corresponding to the text data TD1 to TD4 in FIGS. 3 to 6, for example.

With the above configuration, a viewer watching the display unit 40, even a viewer with a hearing impairment, can more readily recognize the association between objects and their names.
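As a rough sketch of this modification, the content drawn near the object could be assembled as below. The function `bubble_lines` and the threshold value are hypothetical; the text only states that the object's name may be added when the likelihood exceeds a predetermined level.

```python
from typing import List


def bubble_lines(
    text: str,
    object_name: str,
    confidence: float,
    threshold: float = 0.8,  # assumed value; the text does not fix one
) -> List[str]:
    """Return the lines to draw near the object.

    The object's name is prepended only when the confidence is strictly
    higher than the threshold; otherwise only the subtitle text is returned.
    """
    if confidence > threshold:
        return [object_name, text]
    return [text]


print(bubble_lines("Hello", "OB1", 0.95))  # ['OB1', 'Hello']
print(bubble_lines("Hello", "OB1", 0.50))  # ['Hello']
```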

The display method of this embodiment is described below with reference to FIG. 7. FIG. 7 is a flowchart showing the image generating method executed by the image generating device 10 included in the display device 1 shown in FIG. 1.

As shown in FIG. 7, the image generating method includes an acquisition step S10, an identification processing execution step S11, and a superimposition step S12.

In the acquisition step S10, the image generating device 10 acquires image data representing an externally input image and audio data representing audio accompanying the image.

Next, in the identification processing execution step S11, the image generating device 10 inputs the image data and audio data acquired in the acquisition step S10 into a trained model to identify, among the one or more objects included in the image, the object that is generating the audio.

Finally, in the superimposition step S12, the image generating device 10 superimposes the text data corresponding to the audio data at a position within the image that corresponds to the result of the identification process.
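The three steps S10 to S12 can be strung together as in the sketch below. Everything here is illustrative: the trained models and the speech-to-text method are passed in as plain callables (`identify_speaker`, `detect_objects`, `transcribe`), the `Overlay` type and fallback position are assumptions, and no real recognition library is used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class Overlay:
    """Text plus the position at which it should be superimposed (hypothetical type)."""
    text: str
    position: Point


def generate_overlay(
    image: bytes,
    audio: bytes,
    identify_speaker: Callable[[bytes], str],             # stand-in for the audio model
    detect_objects: Callable[[bytes], Dict[str, Point]],  # stand-in for the image model
    transcribe: Callable[[bytes], str],                   # stand-in for speech-to-text
    fallback_position: Point = (640, 680),                # assumed fixed subtitle area
) -> Overlay:
    # S11: identify which object in the image is generating the audio.
    speaker_id = identify_speaker(audio)
    objects = detect_objects(image)
    position: Optional[Point] = objects.get(speaker_id)

    # S12: decide where the text corresponding to the audio is superimposed.
    text = transcribe(audio)
    if position is None:
        position = fallback_position
    return Overlay(text, position)


# S10 is represented here by the already-separated image and audio arguments.
overlay = generate_overlay(
    image=b"",
    audio=b"",
    identify_speaker=lambda a: "OB1",
    detect_objects=lambda i: {"OB1": (200, 300), "OB2": (600, 320)},
    transcribe=lambda a: "Hello",
)
print(overlay)  # Overlay(text='Hello', position=(200, 300))
```

Passing the models in as callables keeps the sketch independent of any particular model format, which the text does not specify beyond "for example, a neural network".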

[Embodiment 2]
Another embodiment of the present invention is described below. For convenience, members that have the same functions as members described in the above embodiment are given the same reference numerals, and their description is not repeated.

FIG. 8 is a block diagram showing the main configuration of the image generating device 10a according to this embodiment. The only difference from Embodiment 1 is that the image generating device 10a according to this embodiment does not include a text data generation unit. The description below focuses on this difference.

In this embodiment, the image generating device 10a has no text data generation unit and therefore does not generate text data. Instead, the superimposition unit 15 generates an image on which information identifying the object is superimposed, the information having a display form corresponding to accuracy information indicating the likelihood that the identification process described above is correct.

(Display examples shown on the display unit)
FIG. 9 is a diagram showing one display example displayed on the display unit 40 shown in FIG. 8, and FIG. 10 is a diagram showing another display example.

In the example shown in FIG. 9, the image generating device 10a has identified, through the identification process described above, two objects that are generating audio. The image generating device 10a then generates an image on which the names OB1 and OB2 of these two objects are superimposed in a display form indicating that the certainty of the identification process is high. Here, the display form indicating high certainty is a rectangular speech bubble that has a protrusion on the side nearest the object and is displayed together with the object name OB1 or OB2.

As a result, a viewer looking at the display unit 40, including a viewer with a hearing impairment, can properly recognize the objects, OB1 and OB2 in this example.

In the example shown in FIG. 10, on the other hand, the image generating device 10a has likewise identified, through the identification process described above, two objects that are generating audio. The image generating device 10a then generates an image on which the names OB1 and OB2 of these two objects are superimposed in a display form indicating that the certainty of the identification process is low. Here, the display form indicating low certainty is a rectangular speech bubble that has no protrusion on the side nearest the object and is displayed together with the object name OB1 or OB2.

As a result, a viewer looking at the display unit 40, including a viewer with a hearing impairment, can recognize that the certainty of the identification of the objects, OB1 and OB2 in this example, is low.
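The two display forms of FIG. 9 and FIG. 10 can be sketched as a simple selection on the certainty value. The `Label` type, the `label_for` function, the numeric certainty scale, and the `threshold` default are assumptions introduced for illustration; the description above only distinguishes "high" from "low" certainty and does not state how that boundary is set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    object_name: str  # e.g. "OB1"
    shape: str        # both display forms use a rectangular speech bubble
    has_tail: bool    # protrusion on the side nearest the object


def label_for(object_name: str, certainty: float, threshold: float = 0.5) -> Label:
    """Choose the display form for an identified object.

    High certainty -> rectangle with a protrusion (as in FIG. 9).
    Low certainty  -> rectangle without a protrusion (as in FIG. 10).
    The threshold value is a placeholder assumption.
    """
    return Label(object_name=object_name, shape="rectangle",
                 has_tail=certainty > threshold)
```

Encoding the certainty in the bubble shape rather than in the text keeps the object name itself unchanged between the two forms, which matches the two display examples.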

The display method of this embodiment is described below with reference to FIG. 11. FIG. 11 is a flowchart showing the image generation method executed by the image generating device 10a shown in FIG. 8.

As shown in FIG. 11, the image generation method includes an acquisition step S20, an identification process execution step S21, and a superimposition step S22.

As shown in FIG. 11, the processing in steps S20 and S21 of this embodiment is the same as the processing in steps S10 and S11 of Embodiment 1, respectively, and its description is therefore omitted.

Then, in the superimposition step S22, the image generating device 10a generates an image on which information identifying the object is superimposed, the information having a display form that depends on accuracy information indicating the certainty of the identification process.

[Summary]
An image generating device (10) according to aspect 1 of the present invention includes: an acquisition unit (11) that acquires image data representing an image and audio data representing audio accompanying the image; an identification process execution unit (14) that executes an identification process of identifying, among one or more objects contained in the image, the object that is generating the audio by inputting the image data and audio data acquired by the acquisition unit into a trained model; and a superimposition unit (15) that superimposes text data corresponding to the audio data at a position within the image that corresponds to the result of the identification process.

With the above configuration, a viewer watching the display device (1), including a viewer with a hearing impairment, can properly recognize the association between an object and text data (subtitles).

In an image generating device (10) according to aspect 2 of the present invention, in aspect 1 above, the superimposition unit may determine the position at which to superimpose the text data by referring to accuracy information indicating the certainty of the identification process.

With the above configuration, the superimposition unit (15) can appropriately determine the position at which to superimpose the text data.

In an image generating device (10) according to aspect 3 of the present invention, in aspect 2 above, the identification process execution unit may execute a first identification process of acquiring the positions within the image and the identification information of one or more objects contained in the image by inputting the image data into a first trained model, execute a second identification process of acquiring identification information of the audio data by inputting the audio data into a second trained model, and identify the object generating the audio by referring to the result of the first identification process and the result of the second identification process.

With the above configuration, the identification process execution unit (14) can reliably identify the object that is generating the audio.
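One way to picture how the results of the first and second identification processes might be combined is the following Python sketch. It assumes, purely for illustration, that both trained models emit comparable identification strings and that the sound source is the detected object whose identification information equals that of the audio. This passage does not specify the matching rule, and `identify_sound_source` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Position = Tuple[int, int]
Detection = Tuple[str, Position]  # (identification information, position in image)


def identify_sound_source(detections: List[Detection],
                          audio_id: str) -> Optional[Detection]:
    """Combine the two identification results.

    detections: result of the first identification process (image side).
    audio_id:   result of the second identification process (audio side).
    Returns the first detection whose identification information matches
    the audio, or None when no detected object matches. Exact string
    equality is an assumed matching rule.
    """
    for identifier, position in detections:
        if identifier == audio_id:
            return identifier, position
    return None
```

Returning `None` rather than a best guess leaves it to the caller to decide how to display audio that matches no detected object.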

In an image generating device (10) according to aspect 4 of the present invention, in aspect 2 or 3 above, the superimposition unit may superimpose the text data near the object generating the audio when the certainty indicated by the accuracy information is higher than a predetermined level.

In an image generating device (10) according to aspect 5 of the present invention, in aspect 4 above, when the certainty indicated by the accuracy information is higher than a predetermined level, the superimposition unit may superimpose a speech bubble image, which indicates that the audio was emitted from the object, near the object generating the audio, and may superimpose the text data on the speech bubble image.

With the above configuration, the superimposition unit (15) can superimpose the text data at a more appropriate position.
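The threshold rule of aspects 4 and 5 can be sketched as follows. The function name `place_text`, the numeric certainty scale, the `threshold`, `offset`, and `fallback_position` parameters, and the behavior below the threshold are all assumptions for illustration; the aspects above only state what may happen when the certainty exceeds the predetermined level.

```python
from typing import Tuple

Position = Tuple[int, int]


def place_text(text: str, object_position: Position, certainty: float,
               threshold: float = 0.8,
               offset: Position = (0, -40),
               fallback_position: Position = (0, 0)) -> dict:
    """Decide where to superimpose the text data.

    Above the threshold: a speech bubble near the object carries the text.
    Otherwise: the text goes to fallback_position without a bubble
    (this fallback is an assumption of the sketch).
    """
    if certainty > threshold:
        x, y = object_position
        return {"text": text,
                "position": (x + offset[0], y + offset[1]),
                "speech_bubble": True}
    return {"text": text, "position": fallback_position, "speech_bubble": False}
```

For example, `place_text("Hi", (100, 200), 0.9)` places a speech bubble 40 pixels above the object under the default offset, while a certainty of 0.5 falls back to the default position.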

In an image generating device (10) according to aspect 6 of the present invention, in any one of aspects 2 to 5 above, the superimposition unit may generate an image in which information identifying the object is superimposed near the object when the certainty indicated by the accuracy information is higher than a predetermined level.

With the above configuration, a viewer looking at the display unit (40), including a viewer with a hearing impairment, can more properly recognize the association between an object and text data (subtitles).

An image generating device (10a) according to aspect 7 of the present invention includes: an acquisition unit (11) that acquires image data representing an image and audio data representing audio accompanying the image; an identification process execution unit (14) that executes an identification process of identifying, among one or more objects contained in the image, the object that is generating the audio by inputting the image data and audio data acquired by the acquisition unit into a trained model; and a superimposition unit (15) that generates an image on which information identifying the object is superimposed, the information having a display form that depends on accuracy information indicating the certainty of the identification process.

The above configuration provides the same effects as aspect 1 above.

An image generating method according to aspect 8 of the present invention includes: an acquisition step of acquiring image data representing an image and audio data representing audio accompanying the image; an identification process execution step of executing an identification process of identifying, among one or more objects contained in the image, the object that is generating the audio by inputting the image data and audio data acquired in the acquisition step into a trained model; and a superimposition step of superimposing text data corresponding to the audio data at a position within the image that corresponds to the result of the identification process.

An image generating method according to aspect 9 of the present invention includes: an acquisition step of acquiring image data representing an image and audio data representing audio accompanying the image; an identification process execution step of executing an identification process of identifying, among one or more objects contained in the image, the object that is generating the audio by inputting the image data and audio data acquired in the acquisition step into a trained model; and a superimposition step of generating an image on which information identifying the object is superimposed, the information having a display form that depends on accuracy information indicating the certainty of the identification process.

The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. Embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.

REFERENCE SIGNS LIST
1 Display device
10, 10a Image generating device
11 Acquisition unit
12 Audio/image recognition unit
13 Text data generating unit
14 Identification process execution unit
15 Superimposition unit
20 Audio recognition DB
30 Image recognition DB
40 Display unit
OB1 to OB4 Objects 1 to 4
TD1 to TD4 Text data 1 to 4

Claims

An image generating device comprising:
an acquisition unit that acquires image data representing an image and audio data representing audio accompanying the image;
an identification process execution unit that executes an identification process of identifying, among one or more objects included in the image, the object generating the audio by inputting the image data and the audio data acquired by the acquisition unit into a trained model; and
a superimposition unit that superimposes text data corresponding to the audio data at a position within the image according to a result of the identification process,
wherein the superimposition unit determines the position at which to superimpose the text data by referring to accuracy information indicating the certainty of the identification process.

The image generating device according to claim 1, wherein the identification process execution unit:
executes a first identification process of acquiring positions within the image and identification information of one or more objects included in the image by inputting the image data into a first trained model;
executes a second identification process of acquiring identification information of the audio data by inputting the audio data into a second trained model; and
identifies the object generating the audio by referring to a result of the first identification process and a result of the second identification process.

The image generating device according to claim 1, wherein the superimposition unit superimposes the text data near the object generating the audio when the certainty indicated by the accuracy information is higher than a predetermined level.

The image generating device according to claim 3, wherein the superimposition unit,
when the certainty indicated by the accuracy information is higher than a predetermined level, superimposes a speech bubble image, which indicates that the audio is emitted from the object, near the object generating the audio, and
superimposes the text data on the speech bubble image.

The image generating device according to any one of claims 1 to 4, wherein the superimposition unit generates an image in which information identifying the object is superimposed near the object when the certainty indicated by the accuracy information is higher than a predetermined level.

An image generating device comprising:
an acquisition unit that acquires image data representing an image and audio data representing audio accompanying the image;
an identification process execution unit that executes an identification process of identifying, among one or more objects included in the image, the object generating the audio by inputting the image data and the audio data acquired by the acquisition unit into a trained model; and
a superimposition unit that generates an image on which information identifying the object is superimposed, the information having a display form that depends on accuracy information indicating the certainty of the identification process.

An image generating method comprising:
an acquisition step of acquiring image data representing an image and audio data representing audio accompanying the image;
an identification process execution step of executing an identification process of identifying, among one or more objects included in the image, the object generating the audio by inputting the image data and the audio data acquired in the acquisition step into a trained model; and
a superimposition step of superimposing text data corresponding to the audio data at a position within the image according to a result of the identification process,
wherein, in the superimposition step, the position at which to superimpose the text data is determined by referring to accuracy information indicating the certainty of the identification process.

An image generating method comprising:
an acquisition step of acquiring image data representing an image and audio data representing audio accompanying the image;
an identification process execution step of executing an identification process of identifying, among one or more objects included in the image, the object generating the audio by inputting the image data and the audio data acquired in the acquisition step into a trained model; and
a superimposition step of generating an image on which information identifying the object is superimposed, the information having a display form that depends on accuracy information indicating the certainty of the identification process.