JP7710909B2

JP7710909B2 - Imaging device, control method, and program

Info

Publication number: JP7710909B2
Application number: JP2021112964A
Authority: JP
Inventors: 宏樹太田; 修原田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-07-07
Filing date: 2021-07-07
Publication date: 2025-07-22
Anticipated expiration: 2041-07-07
Also published as: JP2023009567A

Description

The present invention relates to an audio processing device that performs audio processing on a person's voice.

When shooting a moving image with an imaging device, it is important to record the scene as the photographer perceived it, and this applies to the audio as well as to the video.

Patent Document 1 discloses extracting the voice of each subject and individually adjusting the extracted audio signal according to the subject's position, thereby realizing an acoustic space with a sense of presence and a stereo effect.

Patent Document 1: JP 2012-138930 A

However, when people listen to a conversation, an accurately reproduced acoustic space does not necessarily match their impression of it. For example, even when many people are chatting separately, a listener can naturally pick out the conversation of a person of interest, or the sound of their own name. People are also said to use visual as well as auditory information: by watching the speaker, they supplement what they hear with cues from the speaker's mouth movements and gestures. It is therefore also important that the audio recorded in a moving image matches the conversational audio that remains in a person's memory (impression).

However, because Patent Document 1 aims to accurately reproduce the acoustic space of voices based on the positional relationship of the people (sound sources), the resulting moving image may differ from the photographer's impression.

Therefore, an object of the present invention is to record video and audio that match the photographer's impression.

The imaging device of the present invention comprises: detection means for detecting subjects from a moving image; selection means for selecting a main subject from the subjects detected from the moving image; determination means for determining the voices of subjects from the moving image; association means for associating each subject detected by the detection means with a voice extracted by the determination means; judgment means for judging which subjects are related to the main subject selected by the selection means; and audio processing means for making the audio processing applied to the voice associated with the main subject and to the voices of subjects judged by the judgment means to be related to the main subject different from the audio processing applied to the voices of subjects not judged by the judgment means to be related to the main subject.

According to the present invention, video and audio that match the photographer's impression can be recorded.

FIG. 1 is a block diagram of the imaging device according to the first embodiment.
FIG. 2 is a block diagram of the imaging processing unit and the audio processing unit according to the first embodiment (during recording).
FIG. 3 is a block diagram of the imaging processing unit and the audio processing unit according to the first embodiment (during post-processing).
FIG. 4 is a diagram showing the main target selection method according to the first embodiment.
FIG. 5 is a diagram showing the operation flow of the moving-image recording sequence according to the first embodiment.
FIG. 6 is a diagram illustrating an assumed scene in the first embodiment.
FIG. 7 is a diagram illustrating the content of the audio processing according to the first embodiment.
FIG. 8 is a block diagram of the imaging processing unit and the audio processing unit according to the second embodiment.
FIG. 9 is a diagram showing the operation flow of the video recording sequence according to the second embodiment.
FIG. 10 is a diagram for explaining the problem addressed by the second embodiment.
FIG. 11 is a diagram illustrating a scene that poses the problem addressed by the second embodiment.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[First embodiment]
In this embodiment, an audio processing device included in an imaging device is described with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram showing the configuration of an imaging device 100 according to the first embodiment.

The imaging unit 101 converts an optical image of a subject, captured through the imaging optical lens, into an image signal with an imaging element; the image processing unit 102 then performs analog-to-digital conversion, image adjustment processing and the like to generate image data. The imaging optical lens may be built in or detachable. The imaging element may be any photoelectric conversion element, typified by a CCD or CMOS sensor. The audio input unit 103 picks up sound around the imaging device 100 with a microphone that is built in or connected via an audio terminal, and converts it from analog to digital; the audio processing unit 104 then applies various audio processing to generate audio data. The microphone may be directional or omnidirectional. The memory 105 temporarily stores the image data obtained by the imaging unit 101 and the image processing unit 102, and the audio data obtained by the audio input unit 103 and the audio processing unit 104. The display control unit 106 displays video based on the image data obtained by the image processing unit 102, as well as the operation screens and menu screens of the imaging device 100, on the display unit 107 or, via a video terminal (not shown), on an external display. The display unit 107 has a touch panel function, which the photographer can operate to select menu items, subjects and so on.

The encoding processing unit 108 reads the image data and audio data temporarily stored in the memory 105 and applies a predetermined encoding to generate compressed image data, compressed audio data and the like. The audio data may also be left uncompressed. The compressed image data may be compressed with any compression method, for example MPEG2 or H.264/MPEG4-AVC. Likewise, the compressed audio data may be compressed with a method such as AC3, AAC, ATRAC, or ADPCM. The recording and playback unit 109 records the compressed image data, the compressed audio data or audio data, and various other data generated by the encoding processing unit 108 on the recording medium 110, and reads them from the recording medium 110. The recording medium 110 may be of any type, such as a magnetic disk, an optical disk, or a semiconductor memory, as long as it can record image data, audio data and the like.

The control unit 111 controls each block of the imaging device 100 by sending control signals to the blocks of the imaging device 100 and the imaging unit 101, and consists of a CPU, memory and the like for executing various controls. The memory 105 used by the control unit 111 comprises a ROM that stores various control programs, a RAM for arithmetic processing and the like, and includes memory external to the control unit 111. The operation unit 112 consists of buttons, dials and the like, and sends instruction signals to the control unit 111 in response to user operations. In the imaging device of this embodiment, the operation unit 112 includes a shooting button for instructing the start and end of moving-image recording, a zoom lever for instructing an optical or electronic zoom operation on the image, a cross key for various adjustments, an enter key, and so on. The audio output unit 113 outputs audio data or compressed audio data played back by the recording and playback unit 109, or audio data output by the control unit 111, to the speaker 114, an audio terminal or the like. The external output unit 115 outputs compressed video data, compressed audio data, audio data and the like played back by the recording and playback unit 109 to an external device. The data bus 116 supplies various data, such as audio data and image data, and various control signals to each block of the imaging device 100.

The normal operation of the imaging device 100 of this embodiment is described next.

When the user operates the operation unit 112 to issue a power-on instruction, the imaging device 100 of this embodiment supplies power from a power supply unit (not shown) to each block of the imaging device.

Once power is supplied, the control unit 111 checks, from the instruction signal sent by the operation unit 112, which mode the mode switch of the operation unit 112 is set to, for example shooting mode or playback mode. In moving-image recording mode, the image data (video data) obtained by the imaging unit 101 and the image processing unit 102 and the audio data obtained by the audio input unit 103 and the audio processing unit 104 are saved as a moving-image file. In playback mode, the compressed image data recorded on the recording medium 110 is played back by the recording and playback unit 109 and displayed on the display unit 107.

In moving-image recording mode, the control unit 111 first sends control signals to each block of the imaging device 100 to put it into the shooting standby state, causing the following operations. The imaging unit 101 converts the optical image of the subject, captured through the imaging optical lens, into an image signal with the imaging element, and the image processing unit 102 performs image adjustment processing and the like to generate image data. The resulting image data is sent to the display control unit 106 and displayed on the display unit 107. The user prepares for shooting while watching the screen displayed in this way.

The audio input unit 103 converts the analog audio signals obtained by multiple microphones into digital form and processes the resulting digital audio signals to generate multi-channel audio data. The resulting audio data is sent to the audio output unit 113 and output as sound from the connected speaker 114 or from earphones (not shown). While listening to this output, the user can also adjust the manual volume control that sets the recording level.

Next, when the user operates the record button of the operation unit 112 and a shooting-start instruction signal is sent to the control unit 111, the control unit 111 sends a shooting-start instruction signal to each block of the imaging device 100, causing the following operations.

The imaging unit 101 converts the optical image of the subject, captured through the imaging optical lens, into an image signal with the imaging element, and the image processing unit 102 performs image adjustment processing and the like to generate image data. The resulting image data is sent to the display control unit 106 and displayed on the display unit 107. It is also sent to the memory 105.

The audio input unit 103 converts the analog audio signals obtained by multiple microphones into digital form, and the audio processing unit 104 processes the resulting digital audio signals to generate multi-channel audio data. The resulting audio data is sent to the memory 105. When there is only one microphone, the analog audio signal it provides is converted into digital form to generate audio data, which is sent to the memory 105.

The encoding processing unit 108 reads the image data and audio data temporarily stored in the memory 105 and applies a predetermined encoding to generate compressed image data, compressed audio data and the like.

The control unit 111 then combines the compressed image data and compressed audio data into a data stream and outputs it to the recording and playback unit 109. When the audio data is not compressed, the control unit 111 combines the audio data stored in the memory 105 with the compressed image data into a data stream and outputs it to the recording and playback unit 109. Under the management of a file system such as UDF or FAT, the recording and playback unit 109 writes the data stream to the recording medium 110 as a single moving-image file. These operations continue throughout shooting.

When the user then operates the record button of the operation unit 112 and a shooting-end instruction signal is sent to the control unit 111, the control unit 111 sends a shooting-end instruction signal to each block of the imaging device 100, causing the following operations.

The imaging unit 101, the image processing unit 102, the audio input unit 103, and the audio processing unit 104 stop generating image data and audio data. The encoding processing unit 108 reads the remaining image data and audio data stored in the memory, applies the predetermined encoding, and stops operating once it has finished generating the compressed image data, compressed audio data and the like. When the audio data is not compressed, it stops operating once it has finished generating the compressed image data.

The control unit 111 then combines this final compressed image data with the compressed audio data or audio data into a data stream and outputs it to the recording and playback unit 109. Under the management of a file system such as UDF or FAT, the recording and playback unit 109 writes the data stream to the recording medium 110 as a single moving-image file. When the supply of the data stream stops, it finalizes the moving-image file and stops the recording operation. Once the recording operation has stopped, the control unit 111 sends control signals to each block of the imaging device 100 to return it to the shooting standby state.

In playback mode, the control unit 111 sends control signals to each block of the imaging device 100 to put it into the playback state, causing the following operations. The recording and playback unit 109 reads a moving-image file consisting of compressed image data and compressed audio data recorded on the recording medium 110, and sends the compressed image data and compressed audio data it has read to the encoding processing unit 108.

The encoding processing unit 108 decodes the compressed image data and compressed audio data and sends them to the display control unit 106 and the audio output unit 113, respectively. The display control unit 106 displays the decoded image data on the display unit 107. The audio output unit 113 outputs the decoded audio data from the built-in speaker or an attached external speaker.

In this way, the imaging device 100 of this embodiment can record and play back images and audio.

In this embodiment, when obtaining an audio signal, the audio input unit 103 and the audio processing unit 104 apply processing such as level adjustment to the audio signal obtained by the microphone. This processing may run continuously from the time the device starts up, from the time a shooting mode is selected, or from the time a mode related to audio recording is selected. Alternatively, in a mode related to audio recording, the processing may be started when audio recording begins. In this embodiment, the processing is assumed to start at the moment moving-image shooting begins.

FIG. 2 is a block diagram showing an example of the detailed configuration of the imaging unit 101, the image processing unit 102, the audio input unit 103, and the audio processing unit 104 of the imaging device 100 of this embodiment.

The imaging unit 101 has an optical system, including an optical lens 201 that captures an optical image of a subject, and an imaging element 202 that converts the optical image captured by the optical lens 201 into an electrical signal (image signal). It also has an optical lens control unit 203 with a known drive mechanism, such as a position sensor and a motor, for moving the optical lens 201. Although this embodiment describes the optical lens 201 and the optical lens control unit 203 as built into the imaging unit 101, they may instead form a detachable interchangeable lens. For example, when the user operates the operation unit 112 to input an instruction such as a zoom operation or focus adjustment, the control unit 111 sends the optical lens control unit 203 a control signal (drive signal) to move the optical lens. In response to this control signal, the optical lens control unit 203 checks the position of the optical lens 201 with the position sensor and moves the optical lens 201 with the motor or the like.

In the image processing unit 102, the image adjustment unit 221 applies various image quality adjustments to the image signal converted by the imaging element 202 to form image data, which is sent to the memory 105 via the data bus 116. Based on this image data, the control unit 111 performs various adjustments such as focus adjustment and light amount adjustment.

In this embodiment, the image processing unit 102 also has various detection functions. The person detection unit 222 extracts feature points of a person's face, such as the eyes, nose, and mouth, from the image data formed by the image adjustment unit 221, and from them detects the person's position in the image data, the size of the face and so on. By storing the feature point information in the memory 105, it can also recognize individual subject persons based on that information. The person detection unit 222 further has a person movement detection unit 223, which detects movements of the lips and head, and a person speech detection unit 224, which uses those movements to determine whether the person is speaking. The image processing unit 102 also has a main target selection unit 225, which selects which of the persons detected by the person detection unit 222 is to be the primary subject of the audio processing (hereinafter also called the main subject or the main target). The main target selection unit 225 selects the main target based on conditions set by the control unit 111. These selection conditions are described later.

The image processing unit 102 further has a conversation group detection unit 226. From among the persons detected by the person detection unit 222, the conversation group detection unit 226 detects the persons who are conversing with the person selected by the main target selection unit 225. This detection is based on the positional relationship between the persons, the orientation of their faces, their movements and so on. For example, the conversation group detection unit 226 judges the subject closest to the main target to be a person conversing with the main target (a related person). As another example, it judges a subject located in the direction in which the main target's body, face, or line of sight is oriented to be a person conversing with the main target. In addition, when the main target is moving, it judges a subject located ahead in the direction of movement to be a person conversing with the main target, because such a subject is expected to converse with the main target in the near future.
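The three cues described above (closest subject, subject in the facing direction, subject ahead of the movement) can be illustrated with a minimal Python sketch. The `Subject` fields, the 2D coordinates, and the `cos_threshold` value are assumptions made for illustration only; the patent does not specify how positions or orientations are represented or how the cues are combined.

```python
import math
from dataclasses import dataclass


@dataclass
class Subject:
    # Hypothetical representation of a detected person in the frame.
    name: str
    x: float
    y: float
    facing: tuple = (0.0, 0.0)    # direction of body / face / gaze
    velocity: tuple = (0.0, 0.0)  # movement per frame


def _cos_angle(v, w):
    """Cosine of the angle between two 2D vectors (0.0 if either is zero)."""
    nv, nw = math.hypot(*v), math.hypot(*w)
    if nv == 0 or nw == 0:
        return 0.0
    return (v[0] * w[0] + v[1] * w[1]) / (nv * nw)


def conversation_partners(main, others, cos_threshold=0.9):
    """Return the names of subjects judged to be conversing with `main`."""
    partners = set()
    if not others:
        return partners
    # Cue 1: the subject closest to the main target.
    nearest = min(others, key=lambda s: math.hypot(s.x - main.x, s.y - main.y))
    partners.add(nearest.name)
    for s in others:
        to_s = (s.x - main.x, s.y - main.y)
        # Cue 2: the subject lies in the direction the main target faces.
        if _cos_angle(main.facing, to_s) >= cos_threshold:
            partners.add(s.name)
        # Cue 3: the subject lies ahead of the main target's movement.
        if _cos_angle(main.velocity, to_s) >= cos_threshold:
            partners.add(s.name)
    return partners
```

Here the cues are simply combined with a logical OR; a real detector might weight them or require several to agree.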

When the conversation group detection unit 226 judges that a person who has been conversing with the main target has not conversed with the main target for longer than a predetermined time, it treats that person as not conversing with (not related to) the main target. In other words, a person who has been conversing with the main target continues to be treated as conversing with the main target, even if judged not to be conversing, as long as the predetermined time has not elapsed.
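This hold-time behaviour can be sketched as a small tracker that remembers when each person was last judged to be conversing. The class name, the `hold_seconds` parameter, and the default of 3 seconds are hypothetical; the patent only speaks of a "predetermined time".

```python
class PartnerTracker:
    """Keeps a person in the conversation group until the hold time is exceeded."""

    def __init__(self, hold_seconds=3.0):
        self.hold_seconds = hold_seconds
        self._last_seen = {}  # person id -> time last judged as conversing

    def update(self, now, conversing_ids):
        """Record the current judgment and return the current group members."""
        for pid in conversing_ids:
            self._last_seen[pid] = now
        # Drop only those who have not conversed for LONGER than the hold time.
        self._last_seen = {
            pid: t
            for pid, t in self._last_seen.items()
            if now - t <= self.hold_seconds
        }
        return set(self._last_seen)
```

A short pause in the conversation therefore does not change the audio processing applied to that person's voice.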

Next, the audio input unit 103 and the audio processing unit 104 are described. The audio input unit 103 has a microphone 211, which converts sound vibrations into an electrical signal and outputs it as an audio signal. In this embodiment the microphone 211 is a stereo microphone with two channels, left (Lch) and right (Rch), but it may be a one-channel monaural microphone or a configuration of multiple microphones providing two or more channels. The A/D conversion unit 212 converts the analog audio signal obtained by the microphone 211 into a digital audio signal.

The audio processing unit 104 is a block that applies various audio processing to the audio signal converted by the audio input unit 103. In this embodiment, the audio processing unit 104 has an audio extraction unit 213, an audio adjustment unit 215, and an audio synthesis unit 217. The audio extraction unit 213 can separate (determine) human voices and other sounds (hereinafter "non-human sound"). Furthermore, the human voice extraction unit 214 can separate the human voices into the voice of each individual, based on information from the person detection unit 222. For example, the human voice extraction unit 214 separates the individual voices based on the frequency, loudness, and intonation of the voice. In the first embodiment, the control unit 111 can also associate subjects with voices based on the voices extracted by the human voice extraction unit 214 and the movements of the subjects detected by the image processing unit 102. Examples of such movements are the frequency of speech, the timing of utterances, and the movement of the mouth.
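One way to picture the association between extracted voices and detected subjects is to compare, frame by frame, when each voice is active with when each subject's mouth is moving. The following is a minimal sketch under that assumption; the per-frame 0/1 flags and the function name are hypothetical, and the patent does not state the actual matching criterion.

```python
def associate_voices(voice_activity, mouth_activity):
    """Map each voice id to the subject whose mouth movement best matches it.

    Both arguments map an id to a list of 0/1 flags, one per video frame:
    1 means "voice active" or "mouth moving" in that frame.
    """
    mapping = {}
    for voice_id, voiced in voice_activity.items():

        def agreement(subject_id):
            # Number of frames where voice activity and mouth movement agree.
            moving = mouth_activity[subject_id]
            return sum(v == m for v, m in zip(voiced, moving))

        mapping[voice_id] = max(mouth_activity, key=agreement)
    return mapping
```

This sketch does not force a one-to-one assignment, so two voices could map to the same subject; a fuller implementation would need to resolve such conflicts.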

The audio adjustment unit 215 can individually apply level adjustment and per-frequency-band processing, for example with an equalizer, to the sounds extracted by the audio extraction unit 213. In particular, the conversation audio adjustment unit 216 makes adjustments based on information from the conversation group detection unit 226, either emphasizing an extracted voice so that it is easier to hear or attenuating it so that it is less prominent. The details of these adjustments are described later. The audio synthesis unit 217 then combines the sounds individually adjusted by the audio adjustment unit 215 back into a single audio signal. The amplitude of the combined audio signal is adjusted to a predetermined level by an auto level controller (hereinafter, ALC 219). With this configuration, the audio processing unit 104 applies predetermined processing to the audio signal, forms audio data, and sends it to the memory 105.
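The adjust, combine, and level-control chain can be sketched as follows. The gain values, the `target_peak`, and the use of a single whole-signal peak normalization in place of a real time-varying ALC are all simplifying assumptions for illustration; the patent does not give numeric values at this point.

```python
def mix_with_emphasis(voices, group_ids, ambient,
                      boost=2.0, cut=0.5, target_peak=0.8):
    """Emphasize conversation-group voices, attenuate the rest, then mix.

    voices:    dict of voice id -> list of samples (same length as ambient)
    group_ids: ids of the main target and the persons conversing with it
    ambient:   list of non-human sound samples
    """
    mixed = list(ambient)
    for voice_id, samples in voices.items():
        # Conversation-group voices are boosted, all other voices are cut.
        gain = boost if voice_id in group_ids else cut
        for i in range(len(mixed)):
            mixed[i] += gain * samples[i]
    # Crude stand-in for the ALC: scale so the peak reaches target_peak.
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 0:
        scale = target_peak / peak
        mixed = [s * scale for s in mixed]
    return mixed
```

Per-band equalization, which the audio adjustment unit 215 can also apply, is omitted here to keep the sketch short.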

FIG. 3 is a block diagram showing an example of another configuration of the image processing unit 102 and the audio processing unit 104 of the imaging device 100 of this embodiment. FIG. 3 differs from FIG. 2 in the input source of the image data and audio data. In FIG. 2, the image signal comes from the imaging unit 101 and the audio signal from the audio input unit 103. In FIG. 3, by contrast, the image and audio inputs are data stored in the memory 105. Using data that has once been stored (held) in the memory 105 makes it possible to apply the proposed method not only during shooting but also as post-processing after recording. It also allows the main target selection unit 225 to select the target person of the audio processing from a whole sequence of moving-image data.

Examples of how the main target selection unit 225 selects the main target are now described with reference to FIG. 4. In this embodiment, the main target is taken to be the person the photographer is likely to pay attention to. In FIG. 4(a), for example, the focus mark 402 indicates the object on which the imaging device 100 is focused. Because the main target 401 and the focus mark 402 coincide in FIG. 4(a), the imaging device 100 recognizes the main target 401 as the main subject and is focusing on it. The main target selection unit 225 judges this main target 401 to be the main target. In this way, the person recognized as the main subject can be selected as the main target.

また、図４（ｂ）では登録された顔画像を用いる方法を示している。登録顔画像４０３はメモリ１０５に事前に登録された被写体の画像である。主対象選定部２２５はその画像の顔と一致すると判断された人物を主対象と選定する。 Figure 4(b) also shows a method of using a registered face image. The registered face image 403 is an image of a subject that has been registered in advance in the memory 105. The main subject selection unit 225 selects a person whose face is determined to match that of the image as the main subject.

また、図４（ｃ）では撮影者の意思によって主対象を決める方法を示す。表示部１０７に表示されている人物に対して、撮影者が表示部１０７のタッチパネルに対してタッチすることで主対象となる被写体を選択する。主対象選定部２２５は、撮影者によって選択された被写体を主対象として判断する。 Figure 4(c) shows a method for determining the main subject at the photographer's discretion. The photographer selects the main subject from among the people displayed on the display unit 107 by touching the touch panel of the display unit 107. The main subject selection unit 225 determines that the subject selected by the photographer is the main subject.

また、図４（ｄ）では記録済みの動画データを用いる方法を示している。例えば、記録済みの動画データ４０４がメモリ１０５に記録されている場合、主対象選定部２２５は、動画データ４０４の中で最も登場頻度の高い人物４０５を主対象として判断する。ほかにも、例えば、主対象選定部２２５は、フォーカス合焦頻度の高い人物を選択してもよい。 Also, FIG. 4(d) shows a method of using recorded video data. For example, when recorded video data 404 is stored in memory 105, main subject selection unit 225 determines person 405 who appears most frequently in video data 404 as the main subject. Alternatively, for example, main subject selection unit 225 may select a person who is frequently in focus.
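
The appearance-frequency selection of FIG. 4(d) can be sketched as follows. This is a minimal illustration only: it assumes that person detection has already produced a list of person IDs for every frame of the recorded video, and the function name `select_by_appearance` and the ID strings are hypothetical, not part of the embodiment.

```python
from collections import Counter


def select_by_appearance(frames):
    """Return the person ID that appears in the most frames, or None.

    frames: iterable of per-frame collections of person IDs
            (assumed output of a person detector; hypothetical format).
    """
    counts = Counter()
    for people in frames:
        # Count each person at most once per frame.
        counts.update(set(people))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    recorded = [["p405", "p1"], ["p405"], ["p2", "p405"], ["p1"]]
    print(select_by_appearance(recorded))
```

The same counting could be applied to a per-frame "in focus" ID instead of the detected IDs to obtain the focus-frequency variant mentioned above.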

なお、主対象選定部２２５は、例えばフォーカスが合わせられている被写体を主対象とする場合、その主対象に対するフォーカスが外れても、所定時間内にその被写体にフォーカスが戻れば主対象として維持する。言い換えれば、主対象選定部２２５は、主対象からフォーカスが所定時間より長く外れた場合、新たに主対象となる被写体を選定する。 Note that when the main object selection unit 225 treats the subject in focus as the main object, it keeps that subject as the main object even if focus is lost, provided that focus returns to the subject within a predetermined time. In other words, if focus stays off the main object for longer than the predetermined time, the main object selection unit 225 selects a new subject as the main object.
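
The hold behavior described above might look like the following sketch. It assumes the camera reports the ID of the subject currently in focus together with a timestamp in seconds. The class name `MainSubjectHold` and the parameter `hold_seconds` (standing in for the "predetermined time") are hypothetical.

```python
class MainSubjectHold:
    """Keep the main subject until focus has been away longer than hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds  # the "predetermined time" (assumed value)
        self.main = None                  # current main subject ID
        self.lost_since = None            # time at which focus left the main subject

    def update(self, focused_id, now):
        """Feed the ID currently in focus; return the main subject ID."""
        if self.main is None:
            # No main subject yet: adopt whatever is in focus.
            self.main = focused_id
            self.lost_since = None
        elif focused_id == self.main:
            # Focus is on (or has returned to) the main subject.
            self.lost_since = None
        else:
            if self.lost_since is None:
                self.lost_since = now
            if now - self.lost_since > self.hold_seconds:
                # Focus stayed away too long: select a new main subject.
                self.main = focused_id
                self.lost_since = None
        return self.main
```

With `hold_seconds=2.0`, a brief focus shift to another person leaves the main subject unchanged, while a shift lasting more than two seconds replaces it.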

続いて、本実施形態の撮像装置１００の動作について図５～図７を用いて説明する。 Next, the operation of the imaging device 100 of this embodiment will be described with reference to Figures 5 to 7.

図５は撮像装置１００の一連の録画記録シーケンスの一例を示すフローチャートである。この撮像装置１００の処理は、ＲＯＭ（不図示）に記録されたソフトウェアをメモリ１０５に展開してＣＰＵが実行することで実現する。また、本フローチャートの処理は、撮像装置１００が電源オンされたことをトリガに開始される。 Figure 5 is a flowchart showing an example of a series of video recording sequences of the imaging device 100. The processing of the imaging device 100 is realized by software recorded in a ROM (not shown) being loaded into the memory 105 and executed by the CPU. The processing of this flowchart is triggered by the imaging device 100 being powered on.

ステップＳ５０１では、制御部１１１は、ユーザによる操作部１１２の操作により動画記録を開始するための指示を受け付ける。 In step S501, the control unit 111 accepts an instruction to start video recording by a user operating the operation unit 112.

ステップＳ５０２では、制御部１１１は、音声録音するための音声のパスを接続する。 In step S502, the control unit 111 connects an audio path for recording audio.

ステップＳ５０３では、制御部１１１は、音声パスが確立した後、本実施形態で説明する制御を含めた信号処理の初期設定をおこない、動画記録のための信号処理を開始する。以降、録音シーケンスについて焦点を当てて説明する。動画記録のための信号処理が終了するまで、制御部１１１は動画に記録される映像を記録している。 In step S503, after the audio path is established, the control unit 111 performs initial setting of signal processing including the control described in this embodiment, and starts signal processing for video recording. The following description focuses on the recording sequence. Until the signal processing for video recording is completed, the control unit 111 records the image to be recorded in the video.

ステップＳ５０４では、画像処理部１０２の人物検出部２２２は被写体を検出する。 In step S504, the person detection unit 222 of the image processing unit 102 detects the subject.

ステップＳ５０５では、画像処理部１０２の主対象選定部２２５は、ステップＳ５０４において検出された被写体から、主対象を選定（判断）する。 In step S505, the main object selection unit 225 of the image processing unit 102 selects (determines) a main object from the subjects detected in step S504.

ステップＳ５０６では、画像処理部１０２の会話グループ検出部２２６は、ステップＳ５０５において選定された主対象と会話している人物（被写体）を判断する。 In step S506, the conversation group detection unit 226 of the image processing unit 102 determines the person (subject) who is conversing with the main subject selected in step S505.

ステップＳ５０７では、音声処理部１０４の音声抽出部２１３は、人物音声の抽出を行う。 In step S507, the voice extraction unit 213 of the voice processing unit 104 extracts the person's voice.

音声処理部１０４の音声調整部２１５は、ステップＳ５０７において抽出された音声に対して調整処理を行う。ステップＳ５０７において抽出された音声の被写体（人物）が主対象の会話グループに属する被写体（人物）か否かで音声調整処理の内容を異ならせる。音声調整処理の詳細については、図６、図７を用いて後述するが、本フローチャートでは簡易的に説明する。 The audio adjustment unit 215 of the audio processing unit 104 performs an adjustment process on the audio extracted in step S507. The content of the audio adjustment process differs depending on whether the subject (person) of the audio extracted in step S507 is a subject (person) that belongs to the main conversation group. Details of the audio adjustment process will be described later using Figures 6 and 7, but will be explained simply in this flowchart.

ステップＳ５０８では、音声処理部１０４の音声調整部２１５は、ステップＳ５０７において抽出された音声の人物が主対象の会話グループに属する被写体か否かを判断する。抽出された音声の人物が主対象の会話グループに属する被写体である場合、ステップＳ５０９の処理が実行される。抽出された音声の人物が主対象の会話グループに属する被写体ではない場合、ステップＳ５１０の処理が実行される。 In step S508, the audio adjustment unit 215 of the audio processing unit 104 determines whether the person whose voice was extracted in step S507 is a subject belonging to the main conversation group. If the person whose voice was extracted is a subject belonging to the main conversation group, the process of step S509 is executed. If the person whose voice was extracted is not a subject belonging to the main conversation group, the process of step S510 is executed.

ステップＳ５０９では、音声処理部１０４の音声調整部２１５は、抽出された音声の音量が大きくなるようにレベル調整する。 In step S509, the audio adjustment unit 215 of the audio processing unit 104 adjusts the level of the extracted audio so that the volume is increased.

ステップＳ５１０では、音声処理部１０４の音声調整部２１５は、抽出された音声の音量が小さくなるようにレベル調整する。ステップＳ５１１では、音声処理部１０４の音声調整部２１５は、抽出された音声に対して、音量以外の調整処理を行う。 In step S510, the audio adjustment unit 215 of the audio processing unit 104 adjusts the level of the extracted audio so that the volume of the audio is reduced. In step S511, the audio adjustment unit 215 of the audio processing unit 104 performs adjustment processing other than the volume on the extracted audio.

ステップＳ５１２では、音声処理部１０４の音声合成部２１７は、個別に音声調整された抽出音声を合成し、ひとつの音声データを生成する。 In step S512, the voice synthesis unit 217 of the voice processing unit 104 synthesizes the extracted voices that have been individually adjusted to generate a single voice data.

ステップＳ５１３では、制御部１１１は、動画記録を終了するか否かを判断する。例えば、制御部１１１は、ユーザによる操作部１１２の操作によって動画記録の終了を指示された場合や、記録媒体１１０の残り容量が少ないと判断された場合に、動画記録を終了すると判断する。動画記録を終了しないと判断された場合、ステップＳ５０４の処理に戻り、録音シーケンス処理が継続される。動画記録を終了すると判断された場合、ステップＳ５１４の処理が実行される。 In step S513, the control unit 111 determines whether to end the video recording. For example, the control unit 111 determines to end the video recording when the user operates the operation unit 112 to instruct the end of the video recording, or when it is determined that the remaining capacity of the recording medium 110 is low. If it is determined not to end the video recording, the process returns to step S504, and the sound recording sequence process continues. If it is determined to end the video recording, the process of step S514 is executed.

ここで、動画記録を終了しないと判断された場合、ステップＳ５０４の処理に戻る。すなわち、動画記録中は、繰り返し主対象および、主対象と会話している人物が判断される。これにより、例えば、主対象である被写体が画角外に消えた場合やフォーカスが外れた場合でも、制御部１１１は別の被写体を主対象として決定できる。また、主対象と会話している人物の人数が増減した場合でも、制御部１１１はそれに合わせて主対象と会話している人物を決定することができる。 If it is determined that the video recording should not end, the process returns to step S504. That is, during video recording, the main subject and the people who are conversing with the main subject are repeatedly determined. This allows the control unit 111 to determine a different subject as the main subject, for example, even if the main subject disappears outside the angle of view or goes out of focus. Also, even if the number of people conversing with the main subject increases or decreases, the control unit 111 can determine the people who are conversing with the main subject accordingly.

ステップＳ５１４では、制御部１１１は、音声パスを切断し、信号処理を終了する。 In step S514, the control unit 111 disconnects the audio path and ends signal processing.

ここで、図６および図７を用いて、音声調整処理について説明する。 Here, we will explain the audio adjustment process using Figures 6 and 7.

図６は音声調整処理の想定シーンを示す図である。いま、人物６０２～人物６０５の４人の被写体（人物）が画角６０１の中に存在し、人物６０２は人物６０３と、人物６０４は人物６０５とそれぞれ会話（発声）をしているものとする。このとき、主対象選定部２２５が選定する、音声処理の主対象が人物６０２であった場合、人物６０２と人物６０３とは、画像データから会話グループ検出部２２６によって会話グループ６１０として検出される。この場合、人物６０２、人物６０３の音声は注目すべき音声として強調するように音声調整され、人物６０４と人物６０５の音声は強調対象ではない不要な音声として音声調整される。 Figure 6 shows an assumed scene for audio adjustment processing. Four subjects (people), person 602 to person 605, are present within the angle of view 601, and person 602 is conversing (speaking) with person 603, and person 604 is conversing (speaking) with person 605. In this case, if person 602 is selected as the main target of audio processing by main target selection unit 225, people 602 and 603 are detected as a conversation group 610 from the image data by conversation group detection unit 226. In this case, the voices of people 602 and 603 are adjusted to be emphasized as noteworthy voices, and the voices of people 604 and 605 are adjusted to be unnecessary voices that are not to be emphasized.

図７（ａ）～（ｃ）は音声調整処理を示す図である。図７では、図６における人物６０２、人物６０３、人物６０４をそれぞれ人物Ａ、Ｂ、Ｃとして表記している（人物６０５は不図示）。 Figures 7(a) to (c) are diagrams showing the audio adjustment process. In Figure 7, person 602, person 603, and person 604 in Figure 6 are represented as person A, person B, and person C, respectively (person 605 is not shown).

図７（ａ）は人物音声抽出部２１４にて抽出された、人物Ａ～Ｃのそれぞれの音声信号を示している。つまり、信号７０１は人物Ａ、信号７０２は人物Ｂ、信号７０３は人物Ｃのそれぞれ抽出された音声信号を示している。そして、それぞれの信号において、振幅の大きな区間は、それぞれの人物が発話（発声）している期間（有声タイミング）を示しており、振幅の小さな区間は発話していない期間（無声タイミング）を示している。例えば、信号７０４と信号７０５とを比較してみると、人物Ａと人物Ｂとは会話しているため、有声タイミングと無声タイミングとがほぼ交互に現れている。一方、人物Ｃは人物ＡおよびＢの会話の相手ではないため、信号７０６は信号７０４と信号７０５とは有声タイミングと無声タイミングが交互に現れることは少ない。 Figure 7(a) shows the voice signals of persons A to C extracted by the person voice extraction unit 214. That is, signal 701 is the extracted voice signal of person A, signal 702 is that of person B, and signal 703 is that of person C. In each signal, the sections with large amplitude indicate periods in which the person is speaking (voiced timing), and the sections with small amplitude indicate periods in which the person is not speaking (silent timing). For example, comparing signals 704 and 705, the voiced timing and silent timing appear almost alternately, because persons A and B are conversing with each other. On the other hand, since person C is not a conversation partner of persons A and B, the voiced and silent timings of signal 706 rarely alternate with those of signals 704 and 705.
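
One way to quantify the "almost alternating" voiced and silent timings is sketched below. This is a hypothetical heuristic, not the method of the embodiment: each extracted voice is reduced to a per-frame amplitude envelope, a fixed threshold marks voiced frames, and the score is the fraction of active frames in which exactly one of the two people is speaking. The names `voiced_flags`, `alternation_score` and the threshold value are assumptions.

```python
def voiced_flags(envelope, threshold):
    """Mark each frame as voiced (True) when its amplitude reaches the threshold."""
    return [a >= threshold for a in envelope]


def alternation_score(flags_a, flags_b):
    """Fraction of active frames in which exactly one of the two speakers talks.

    A value near 1.0 suggests turn-taking; lower values mean the two
    speakers often talk at the same time.
    """
    pairs = list(zip(flags_a, flags_b))
    either = sum(1 for a, b in pairs if a or b)
    if either == 0:
        return 0.0
    exclusive = sum(1 for a, b in pairs if a != b)
    return exclusive / either


if __name__ == "__main__":
    env_a = [0.9, 0.8, 0.1, 0.1, 0.9, 0.1]  # person A
    env_b = [0.1, 0.1, 0.7, 0.8, 0.1, 0.9]  # person B, takes turns with A
    env_c = [0.9, 0.1, 0.1, 0.9, 0.9, 0.1]  # person C, unrelated timing
    fa, fb, fc = (voiced_flags(e, 0.5) for e in (env_a, env_b, env_c))
    print(alternation_score(fa, fb), alternation_score(fa, fc))
```

A score of this kind would only be one cue among several. In the first embodiment the conversation group is detected from the image data.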

図７（ｂ）は、それぞれの人物に対しての音声の補正係数を示している。本実施形態においては補正係数が１．０のときはレベル調整（ゲイン調整）が行われないことを示す。また、補正係数が１．０よりも大きい場合の処理は、その音声を強調して聞き取りやすくする（より大きい音量にする）ための音声調整処理であり、係数が１．０よりも小さい場合の処理は、音声を聞こえにくくする（より小さい音量にする）ための処理である。 Figure 7 (b) shows the audio correction coefficient for each person. In this embodiment, a correction coefficient of 1.0 indicates that no level adjustment (gain adjustment) is performed. Furthermore, when the correction coefficient is greater than 1.0, processing is an audio adjustment process for emphasizing the audio to make it easier to hear (at a higher volume), and when the coefficient is less than 1.0, processing is a process for making the audio harder to hear (at a lower volume).

例えば、会話グループ検出部２２６によって、期間７１０の間は人物Ａと人物Ｂが会話していると判定された場合を例に説明する。この場合、人物Ａは主対象であることから、会話音声調整部２１６は、人物Ａと人物Ｂのそれぞれの音声を強調する対象として認識し、それぞれの音声に対する補正係数を大きい値にする（係数７１４、係数７１５）。本実施形態では、人物Ａと人物Ｂとの音声に対する補正係数を同じ値にする。これは、撮影者であるユーザはどちらの音声も等しく聞いていることが想定されるからである。一方、会話音声調整部２１６は、人物Ａと会話していないと判断された人物Ｃの音声に対する補正係数を小さく設定し、人物Ｃの音声を比較的聞き取りにくくなるようにする（係数７１６）。このように、会話音声調整部２１６は、主対象の人物Ａおよびその会話相手である人物Ｂの音声が強調し、それ以外の音声が小さくする。例えば、会話音声調整部２１６は、主対象の人物Ａおよびその会話相手である人物Ｂの音声に対するゲインやレベルを、それ以外の音声に対するものより大きくする。これにより、映像および音声が撮影者であるユーザのイメージに沿った動画データとなる。 For example, a case will be described in which the conversation group detection unit 226 has determined that person A and person B are conversing during the period 710. In this case, since person A is the main target, the conversation sound adjustment unit 216 recognizes the voices of person A and person B as targets to be emphasized, and sets the correction coefficients for each voice to large values (coefficients 714 and 715). In this embodiment, the correction coefficients for the voices of person A and person B are set to the same value. This is because it is assumed that the user who is the photographer hears both voices equally. On the other hand, the conversation sound adjustment unit 216 sets the correction coefficient for the voice of person C, who is determined not to be conversing with person A, to a small value, so that the voice of person C is relatively difficult to hear (coefficient 716). In this way, the conversation sound adjustment unit 216 emphasizes the voices of the main target person A and person B, who is his conversation partner, and reduces the other voices. For example, the conversation sound adjustment unit 216 increases the gain and level for the voices of the main target person A and person B, who is his conversation partner, more than those for the other voices. This results in video data whose images and audio match the imagination of the user who filmed it.

そして、図７（ｃ）は、前述の図７（ｂ）の補正係数に基づいて調整処理された音声信号を示している。例えば、会話音声調整部２１６による音声調整をゲイン調整によって実現した場合、期間７１０の間は、会話判定された人物Ａと人物Ｂの音声（信号７２４、信号７２５）は補正係数が１．０よりも大きいため、音量が大きくなりユーザにとって聞こえやすくなる。また、会話判定されなかった人物Ｃの音声（信号７２６）は、補正係数が１．０よりも小さいため、音量が小さくなり聞こえづらくなる。このように個別調整された抽出音声が音声合成部２１７にて合成されることで、結果として注目対象として判定された会話のみが聞き取りやすい音声データとして生成される。 Figure 7(c) shows the audio signals after adjustment based on the correction coefficients of Figure 7(b). For example, if the adjustment by the conversation audio adjustment unit 216 is realized by gain adjustment, then during the period 710 the voices of persons A and B, who have been determined to be conversing (signals 724 and 725), have correction coefficients greater than 1.0, so they become louder and easier for the user to hear. Meanwhile, the voice of person C, who has not been determined to be part of the conversation (signal 726), has a correction coefficient smaller than 1.0, so it becomes quieter and harder to hear. The individually adjusted extracted voices are then combined by the audio synthesis unit 217, so that the resulting audio data makes only the conversation determined to be of interest easy to hear.

なお、本実施形態では、主被写体に関する音声を強調（大きくなるよう補正）し、主被写体と関係のない音声を聞こえにくくした（小さくなるように補正した）が、どちらか一方にだけ調整を適用しても構わない。すなわち、主対象となる被写体（人物）およびその会話対象である被写体（人物）の補正係数が、その他の被写体の補正係数よりも大きければよい。 Note that in this embodiment, the sound related to the main subject is emphasized (corrected to be louder) and the sound unrelated to the main subject is made hard to hear (corrected to be quieter), but adjustments can be applied to only one of them. In other words, it is sufficient that the correction coefficients for the main subject (person) and the subjects (people) with whom it is speaking are larger than the correction coefficients for the other subjects.
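
The gain adjustment and synthesis of steps S508 to S512 can be illustrated with the sketch below. It assumes that each extracted voice is a list of float samples keyed by a person ID. The function names and the coefficient values `boost` and `cut` are hypothetical; the text only requires that the coefficient for the conversation group be larger than that for the other voices.

```python
def correction_coefficient(person, group, boost=2.0, cut=0.5):
    """Return a gain above 1.0 for conversation-group members, below 1.0 otherwise."""
    return boost if person in group else cut


def adjust_and_mix(voices, group, boost=2.0, cut=0.5):
    """Scale each extracted voice by its coefficient and sum them into one signal.

    voices: dict mapping person ID -> list of float samples (assumed format).
    group:  set of person IDs in the main subject's conversation group.
    """
    length = max((len(s) for s in voices.values()), default=0)
    mixed = [0.0] * length
    for person, samples in voices.items():
        k = correction_coefficient(person, group, boost, cut)
        for i, sample in enumerate(samples):
            mixed[i] += k * sample
    return mixed


if __name__ == "__main__":
    voices = {"A": [0.2, 0.0], "B": [0.0, 0.4], "C": [0.2, 0.2]}
    print(adjust_and_mix(voices, {"A", "B"}))
```

Setting `boost=1.0` or `cut=1.0` gives the one-sided variants mentioned above, in which only emphasis or only suppression is applied.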

また、会話音声調整部２１６による強調手法も、前述のようなゲイン全体の調整に限らず、イコライザなどにより人物音声の周波数帯域において周波数別に調整しても構わない。 The emphasis method used by the conversation voice adjustment unit 216 is also not limited to adjusting the overall gain as described above, but may be adjusted by frequency in the frequency band of human voices using an equalizer or the like.

［第二の実施形態］ [Second embodiment]
第一の実施形態では、主対象を選定後、主対象と会話している人物を主対象との位置関係や人物の動作により会話グループを検出し、会話グループの音声を強調し、もしくは不要である他の音声は抑え、注目すべき会話が聞き取りやすい音声データを取得している。 In the first embodiment, after selecting a main target, a conversation group of people who are conversing with the main target is detected based on their positional relationship with the main target and their movements, and the audio of the conversation group is emphasized or other unnecessary audio is suppressed, thereby obtaining audio data that makes the noteworthy conversation easy to hear.

第一の実施形態では、会話グループの検出方法は、人物検出部２２２において検出された人物のうちから、主対象選定部２２５にて選定された人物と会話している人物同士の、位置関係や、顔の向き、動作などによって判断されている。このように、第一の実施形態では、会話グループ検出部２２６の検出は、撮像装置１００の画角６０１内に存在する人物によって行われている。 In the first embodiment, the method of detecting a conversation group is based on the positional relationship, facial orientation, and movement of people who are conversing with the person selected by the main subject selection unit 225 from among the people detected by the person detection unit 222. In this way, in the first embodiment, the detection by the conversation group detection unit 226 is performed based on people present within the angle of view 601 of the imaging device 100.

いま、図１０（ａ）のように主対象６０２である人物Ａと、画角６０１内の人物Ｂ、人物Ｄ（６０３、６０６）が会話グループとして検出されたとする。撮影者によるズーム操作やパンニング操作により人物Ｂが画角からはずれてしまった場合、人物Ａ、人物Ｂ、人物Ｄの会話は継続されていても、次の会話グループの検出では人物Ｂは図１０（ｂ）のように会話グループから外れてしまう。その結果、人物Ｂが会話に参加していても、会話グループ検出部２２６は、人物Ｂを会話グループと判断しないため、人物Ｂの音声だけが強調されず聞き取りづらい会話となってしまうおそれがある。 Now, assume that person A, who is the main subject 602, and person B and person D (603, 606) within the angle of view 601 are detected as a conversation group as shown in Figure 10(a). If person B moves out of the angle of view due to a zoom or panning operation by the photographer, even if the conversation between person A, person B, and person D continues, person B will be removed from the conversation group when the next conversation group is detected, as shown in Figure 10(b). As a result, even if person B is participating in the conversation, the conversation group detection unit 226 does not determine that person B is part of the conversation group, and there is a risk that only person B's voice will not be emphasized, resulting in a conversation that is difficult to hear.

第二の実施形態は、画角内にいた会話グループの少なくとも１人が画角からはずれても、画角からずれた人の会話が継続していると判断した時には、会話グループを画角からはずれる前の状態で維持し、聞き取りやすい音声を取得し続けることを目的とする。 The second embodiment aims to maintain the conversation group in the state it was in before it moved out of the field of view, and continue to capture easy-to-listen audio, when it is determined that the conversation of the person who moved out of the field of view is continuing, even if at least one person in the conversation group who was within the field of view moves out of the field of view.

以下、第二の実施形態について、添付の図面に基づいて詳細に説明する。尚、図１の撮像装置１００の構成は、第一の実施形態と同じため説明を省略する。 The second embodiment will be described in detail below with reference to the accompanying drawings. Note that the configuration of the imaging device 100 in FIG. 1 is the same as that of the first embodiment, so a description thereof will be omitted.

図８は本実施形態の撮像装置１００の撮像部１０１、画像処理部１０２、音声入力部１０３、音声処理部１０４の詳細な構成を示すブロック図である。尚、図２と同じ機能を持つブロックは同じ番号を割付し、説明を省略する。 Figure 8 is a block diagram showing the detailed configuration of the imaging unit 101, image processing unit 102, audio input unit 103, and audio processing unit 104 of the imaging device 100 of this embodiment. Note that blocks having the same functions as those in Figure 2 are assigned the same numbers, and descriptions are omitted.

特徴抽出部８０１は、人物音声抽出部２１４より抽出された音声とその音声に対応する人物とを関連付ける。例えば、特徴抽出部８０１は、音声の特徴と画角内の被写体の動作とに基づいて、抽出された音声と対応する人物とを関連付ける。例えば、上記音声の特徴は、周波数、大きさ、および抑揚である。例えば、被写体の動作は、発話の頻度、発声のタイミング、口の動きである。このような関連付けにより、話者の特定を行うための確度を向上させることができる。これにより、制御部１１１は、会話グループの人物が画角から外れても音声から話者を特定できる。 The feature extraction unit 801 associates the voice extracted by the person voice extraction unit 214 with the person corresponding to that voice. For example, the feature extraction unit 801 associates the extracted voice with the corresponding person based on the voice characteristics and the movement of the subject within the field of view. For example, the voice characteristics are frequency, volume, and intonation. For example, the movement of the subject is the frequency of speech, the timing of speech, and the movement of the mouth. This association can improve the accuracy of identifying the speaker. This allows the control unit 111 to identify the speaker from the voice even if the person in the conversation group is out of the field of view.

会話グループ修正部８０２は、特徴抽出部８０１で取得した人物と関連付けされた音声の特徴から、画角からはずれた人物が会話を継続しているかを判断する。制御部１１１は、この結果と会話グループ検出部２２６の検出結果から画角から外れた人物を考慮した会話グループになるよう修正する。 The conversation group correction unit 802 determines whether a person outside the field of view is continuing a conversation based on the voice characteristics associated with the person acquired by the characteristics extraction unit 801. The control unit 111 corrects the conversation group based on this result and the detection result of the conversation group detection unit 226 to take into account the person outside the field of view.

なお、第二の実施形態では特徴抽出部８０１、会話グループ修正部８０２を図２に示すブロック図に追加した形態で説明したが、会話グループ修正部８０２を図３に示すブロック図に追加した形態でも動作内容は同じである。 In the second embodiment, the feature extraction unit 801 and the conversation group correction unit 802 are added to the block diagram shown in FIG. 2, but the operation is the same when the conversation group correction unit 802 is added to the block diagram shown in FIG. 3.

次に第二の実施形態の撮像装置１００の動作について図９、１１を用いて説明する。 Next, the operation of the imaging device 100 of the second embodiment will be described with reference to Figures 9 and 11.

図９は撮像装置１００の一連の記録動作を説明したフローチャートである。図９では、図５と同じ動作をするブロックには図５と同じステップ番号を付与している。ここで、先に図９の動作での想定シーン例を図１１を用いて説明する。 Figure 9 is a flow chart explaining a series of recording operations of the imaging device 100. In Figure 9, blocks that perform the same operations as in Figure 5 are given the same step numbers as in Figure 5. Here, an example of an assumed scene for the operation in Figure 9 will be explained first with reference to Figure 11.

図１１（ａ）では、撮影者により記録釦が押下された時点における場面が示されている。図１１（ａ）に示す場面（以降、初期撮影シーンという）では、画角内に人物Ａ、人物Ｂ、および人物Ｄ（６０２、６０３、６０６）が存在する。主対象を人物Ａとし、主対象を含む会話グループは、人物Ａ、人物Ｂ、人物Ｄの３名が検出される。そして、会話グループの少なくとも１人が画角から外れた場合のシーンを説明する。 Figure 11(a) shows the scene at the time when the photographer presses the record button. In the scene shown in Figure 11(a) (hereinafter referred to as the initial shooting scene), Person A, Person B, and Person D (602, 603, 606) are present within the angle of view. Person A is the main subject, and a conversation group containing the main subject is detected as consisting of three people: Person A, Person B, and Person D. Next, we will explain the scene when at least one person in the conversation group is out of the angle of view.

会話グループに含まれる人物が画角から外れた場合のシーンの例を、図１１（ｂ）～（ｅ）に示す。図１１（ｂ）～（ｄ）は人物Ｂ（６０３）が画角６０１から外れた場合のシーンである。図１１（ｅ）は撮影者による撮像装置１００のパンニング動作により会話グループの全員が画角から外れた場合のシーンである。また各図中の人物の口付近に表記されている横向きの「ハ」の字は、それぞれの人物からの発声状態を表しており、その線の太さで声量や会話への参加頻度の程度を表現している。また図１１の各シーンを、図１１（ａ）は初期撮影シーン、図１１（ｂ）はシーンｂ、図１１（ｃ）はシーンｃ、図１１（ｄ）はシーンｄ、図１１（ｅ）はシーンｅと記述する。また、各図に登場する人物６０２を人物Ａ、人物６０３を人物Ｂ、人物６０６を人物Ｄと記述する。また、各シーンの主対象を人物Ａとする。また、各図の画角６０１を撮像装置１００の撮影画角、会話グループ６１０は会話グループを示す。 Examples of scenes in which a person in a conversation group is out of the field of view are shown in Figs. 11(b) to (e). Figs. 11(b) to (d) show scenes in which person B (603) is out of the field of view 601. Fig. 11(e) shows a scene in which all members of the conversation group are out of the field of view due to the cameraman panning the imaging device 100. The horizontal "V" characters written near the mouths of the people in each figure represent the vocalization state of each person, and the thickness of the line represents the volume of the voice and the frequency of participation in the conversation. Each scene in Fig. 11 is described as Fig. 11(a) as the initial shooting scene, Fig. 11(b) as scene b, Fig. 11(c) as scene c, Fig. 11(d) as scene d, and Fig. 11(e) as scene e. The person 602 appearing in each figure is described as person A, the person 603 as person B, and the person 606 as person D. The main subject of each scene is person A. In addition, the angle of view 601 in each diagram is the shooting angle of the imaging device 100, and the conversation group 610 is the conversation group.

図１１（ａ）～（ｅ）各シーンの想定は、以下のとおりである。 The assumptions for each scene in Figures 11 (a) to (e) are as follows:

シーンｂでは、初期撮影シーンに対し、人物Ｂが画角からは外れているが、画角内にいるときと同様に会話を継続しているシーンが示されている。 In scene b, person B is no longer in the field of view as in the initial shot, but continues to talk as if he were in the field of view.

シーンｃでは、初期撮影シーンに対し、人物Ｂが画角から外れており、かつ会話をしていないシーンが示されている。なお、シーンｃでは、人物Ａ、人物Ｄともに人物Ｂの方を向いていない状態である。 Scene c shows a scene in which person B is out of the field of view and is not engaged in a conversation, as compared to the initial shooting scene. Note that in scene c, neither person A nor person D is facing the direction of person B.

シーンｄでは、シーンｂのシーンに対し、人物Ｂが遠方へ移動しているが会話は継続しているシーンが示されている。なお、シーンｄでは、人物Ｂの音声は撮像装置１００に入力されている。また、画角内にいる人物Ａの顔の向きが、人物Ｂのいる方向を向いており、発声量が大きくなっている。 In scene d, compared to scene b, person B has moved into the distance, but the conversation continues. In scene d, the voice of person B is input to the imaging device 100. Also, person A, who is within the angle of view, is facing the direction of person B, and is speaking louder.

シーンｅでは、初期撮影シーンに対し、人物Ａ、人物Ｂ、および人物Ｄが画角から外れたシーンが示されている。なお、シーンｅでは、人物Ａ、人物Ｂ、および人物Ｄは会話を継続している。 Scene e shows a scene in which Person A, Person B, and Person D are out of the field of view compared to the initial shot scene. Note that in scene e, Person A, Person B, and Person D are continuing a conversation.

以上、図９の動作での想定シーン例を図１１を用いて説明した。以降、図９のフローチャートを用いて撮像装置１００の動作を説明する。本実施形態の説明では、主にステップＳ９０１～ステップＳ９０４について行う。 An example of an assumed scene for the operation in FIG. 9 has been described above with reference to FIG. 11. Hereinafter, the operation of the imaging device 100 will be described with reference to the flowchart in FIG. 9. In the description of this embodiment, steps S901 to S904 will be mainly described.

まず、ステップＳ５０１からステップＳ５０７までの処理によって、画角内の人物検出、主対象の特定、主対象と会話している人物の検出、および音声の抽出が実施される。 First, steps S501 to S507 are processed to detect people within the field of view, identify the main subject, detect people conversing with the main subject, and extract audio.

ステップＳ９０１では、制御部１１１は、ステップＳ５０６検出された主対象と会話している人数と、特徴抽出部８０１および会話グループ修正部８０２によって関連付けられた会話グループの人数と一致するか否かを判断する。例えば、制御部１１１は、ステップＳ５０６で検出された会話グループの人数に対する現時点の会話グループの人数との差分をとることで判断する。人数が減少したと判断された場合、特徴抽出部８０１および会話グループ修正部８０２によって関連付けられた会話グループの人物のうち、画角から外れた人物が存在することになる。人数が一致すると判断された場合、ステップＳ９０４の処理が実行される。人数が一致しないと判断された場合、ステップＳ９０２の処理が実行される。 In step S901, control unit 111 determines whether the number of people conversing with the main target detected in step S506 matches the number of people in the conversation group associated by feature extraction unit 801 and conversation group correction unit 802. For example, control unit 111 makes this determination by taking the difference between the number of people in the conversation group detected in step S506 and the number of people in the current conversation group. If it is determined that the number of people has decreased, it means that there is a person in the conversation group associated by feature extraction unit 801 and conversation group correction unit 802 who is outside the field of view. If it is determined that the numbers match, processing of step S904 is executed. If it is determined that the numbers do not match, processing of step S902 is executed.

ステップＳ９０２では、会話グループ修正部８０２は、画角から外れている人物と、画角内の人物との会話が継続しているか否かを判断する。画角から外れている人物と画角内の人物との会話が継続してないと判断された場合、制御部１１１は、現在の会話グループはステップＳ５０６での検出結果として、ステップＳ９０４以降の処理を行う。会話が継続していると判断された場合、ステップＳ９０３の処理が実行される。 In step S902, conversation group modification unit 802 determines whether or not a conversation is ongoing between a person outside the field of view and a person within the field of view. If it is determined that a conversation is not ongoing between a person outside the field of view and a person within the field of view, control unit 111 performs processing from step S904 onward, treating the current conversation group as the detection result in step S506. If it is determined that a conversation is ongoing, processing in step S903 is executed.

ステップＳ９０３では、制御部１１１は、画角から外れた人物が画角内の会話グループに含まれるように、ステップＳ５０６での検出された主対象と会話している人物（被写体）を修正する。 In step S903, the control unit 111 modifies the person (subject) who is conversing with the main subject detected in step S506 so that the person outside the angle of view is included in the conversation group within the angle of view.

ステップＳ９０４では、特徴抽出部８０１は、人物音声抽出部２１４より抽出された被写体（人物）毎に抽出された音声に基づいて、音声とその音声に対応する人物との関連付けを行う。 In step S904, the feature extraction unit 801 associates the voice with the person corresponding to that voice based on the voice extracted for each subject (person) extracted by the person voice extraction unit 214.
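
The branch structure of steps S901 to S903 can be summarized in the following sketch. It assumes the conversation group is held as a set of person IDs and that the continuation judgment of step S902 is supplied as a callback. `correct_conversation_group` and `conversation_continues` are hypothetical names.

```python
def correct_conversation_group(detected_group, previous_group, conversation_continues):
    """Return the conversation group after taking out-of-view members into account.

    detected_group:  set of person IDs found in step S506 (within the angle of view).
    previous_group:  set of person IDs previously associated with the group.
    conversation_continues: callable(person_id) -> bool, the S902 judgment.
    """
    # S901: members that were in the group before but are no longer detected.
    missing = set(previous_group) - set(detected_group)
    if not missing:
        return set(detected_group)
    # S902: keep only those whose conversation is judged to be continuing.
    still_talking = {p for p in missing if conversation_continues(p)}
    # S903: add them back to the group detected within the angle of view.
    return set(detected_group) | still_talking


if __name__ == "__main__":
    group = correct_conversation_group({"A", "D"}, {"A", "B", "D"}, lambda p: p == "B")
    print(sorted(group))
```

Step S904, the association of each voice with a person, would then update `previous_group` for the next pass through the loop.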

ここで、上述のシーンを用いて、ステップＳ９０２における、人物Ｂが人物Ａ、人物Ｄとの会話を継続しているか否かの判断の一例を説明する。 Here, using the above-mentioned scene, an example of the determination in step S902 of whether person B is continuing a conversation with person A and person D will be described.

シーンｂでは、図９のステップＳ５０５およびステップＳ５０６で、主対象である人物Ａと会話している人物として人物Ｄが特定される。しかし、図９のステップＳ９０１で、初期撮影シーンでは会話グループに属していた人物Ｂが、画角から外れたことがわかる。そして、図９のステップＳ９０２で、会話グループ修正部８０２によって人物Ｂが人物Ａ、および人物Ｄとの会話を継続していることが判断される。そのため、図９のステップＳ９０３で、制御部１１１は、主対象である人物Ａと会話している被写体（人物）に人物Ｂを追加する。すなわち、シーンｂでは、初期撮影シーンと同様の会話グループを維持することになる。 In scene b, in steps S505 and S506 of FIG. 9, person D is identified as a person having a conversation with person A, who is the main subject. However, in step S901 of FIG. 9, it is found that person B, who belonged to the conversation group in the initial shooting scene, has moved out of the angle of view. Then, in step S902 of FIG. 9, the conversation group correction unit 802 determines that person B is continuing the conversation with person A and person D. Therefore, in step S903 of FIG. 9, the control unit 111 adds person B to the subjects (people) having a conversation with person A, who is the main subject. That is, in scene b, the same conversation group as in the initial shooting scene is maintained.

ここで、人物Ｂが人物Ａ、Ｄとの会話が続いているか否かの判断の一例を説明する。会話グループ修正部８０２は、特徴抽出部８０１の情報より、人物Ｂの声の大きさや抑揚に変化がなく、人物Ａ、Ｄとの会話時の発話タイミングが合っているような場合、会話が継続していると判断する。この場合、制御部１１１は、主対象である人物Ａと会話している人物（被写体）に人物Ｂを追加する。また、会話グループ修正部８０２は、画像処理部１０２が被写体の画角から外れた方向と被写体の顔の向きとが判断できる場合、さらに画角内の人物Ａまたは人物Ｄの顔の向きと人物Ｂの画角から外れた方向とに基づいて会話が継続しているか否かを判断する。すなわち、上述の声の大きさや抑揚、発話タイミングで会話が継続していると判断しても、画角内の人物Ａまたは人物Ｄの顔の向きが人物Ｂが画角から外れた方向と一致していない場合、会話グループ修正部８０２は、会話が継続していないと判断する。 Here, an example of determining whether person B is continuing the conversation with persons A and D will be described. When the information from the feature extraction unit 801 indicates that the volume and intonation of person B's voice have not changed and that person B's speech timing is still in step with the conversation with persons A and D, the conversation group correction unit 802 determines that the conversation is continuing. In this case, the control unit 111 adds person B to the people (subjects) conversing with person A, who is the main target. In addition, when the image processing unit 102 can determine both the direction in which a subject left the field of view and the face orientation of the subjects, the conversation group correction unit 802 further bases its determination on the face orientation of person A or person D within the field of view and the direction in which person B left the field of view. In other words, even if the volume, intonation, and speech timing described above suggest that the conversation is continuing, the conversation group correction unit 802 determines that the conversation is not continuing if the face orientation of person A or person D within the field of view does not match the direction in which person B left the field of view.
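
The judgment in this example could be expressed roughly as follows. Everything concrete here is an assumption made for illustration: the `VoiceFeatures` fields, the relative tolerance `tol`, the timing threshold `timing_min`, and the representation of directions as simple labels are not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    volume: float        # loudness of the person's voice (arbitrary units)
    intonation: float    # a scalar intonation measure (hypothetical)
    timing_match: float  # 0..1, how well speech timing fits the conversation


def conversation_continues(before, after, exit_direction=None,
                           face_directions=None, tol=0.2, timing_min=0.7):
    """Judge whether an out-of-view person is still part of the conversation.

    before / after:  VoiceFeatures while in view and after leaving the view.
    exit_direction:  label of the side on which the person left the view, if known.
    face_directions: face-orientation labels of the people still in view, if known.
    """
    volume_stable = abs(after.volume - before.volume) <= tol * before.volume
    intonation_stable = abs(after.intonation - before.intonation) <= tol * before.intonation
    if not (volume_stable and intonation_stable and after.timing_match >= timing_min):
        return False
    # When image information is available, at least one in-view person
    # must face the direction in which the person left the view.
    if exit_direction is not None and face_directions:
        return exit_direction in face_directions
    return True
```

This sketch covers only the example given in this paragraph. A case such as scene d, where person B's voice becomes quieter while the speech timing still matches, would need a looser volume condition.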

シーンｃでは、シーンｂに対し、人物Ｂの音声が検出されていない場合である。このような場合、人物Ｂは人物Ａおよび人物Ｄの会話に参加していないと判断され、制御部１１１は、主対象である人物Ａと会話している被写体を人物Ｄのまま、修正は行わない。 Scene c differs from scene b in that the voice of person B is not detected. In such a case, it is determined that person B is not participating in the conversation between person A and person D, and the control unit 111 leaves person D as the only subject conversing with person A, who is the main subject, without making any correction.

In scene d, person B has moved from the position in scene b and is moving away from persons A and D, but the conversation continues. In this scene, person B's voice has become quieter, but person B's speech timing still matches that of the conversation with persons A and D. Moreover, while person B's voice has become quieter, person A's voice has conversely become louder. From this information, the conversation group correction unit 802 determines that person A and person B are conversing. Accordingly, the control unit 111 adds person B as a subject conversing with person A, the main subject.
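The cues used in scene d can be written as a simple predicate. This is only a sketch under assumptions: the function name, the use of decibel-valued volume changes, and the sign convention (negative means quieter) are hypothetical, and the source does not say how the changes are measured or thresholded.

```python
def conversing_while_moving_away(
    mover_volume_change_db: float,
    partner_volume_change_db: float,
    timing_matches: bool,
) -> bool:
    """Scene d cue: the person moving away gets quieter, the partner gets
    louder, and their speech timing still lines up."""
    mover_quieter = mover_volume_change_db < 0.0
    partner_louder = partner_volume_change_db > 0.0
    return timing_matches and mover_quieter and partner_louder
```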

In scene e, the photographer has changed the imaging direction from persons A, B, and D toward the fireworks being launched. That is, persons A, B, and D continue their conversation, but person A, the main subject, has also left the angle of view. However, the information from the feature extraction unit 801 shows that the acquired audio still contains person A's voice, so in this case the control unit 111 determines that person A remains the main subject. In addition, the information from the feature extraction unit 801 shows that the voices of person B and person D also continue to be detected, so the control unit 111 adds person B and person D as subjects conversing with person A, the main subject. Thus, in a scene such as scene e, no person is within the angle of view, yet the voices of the conversation group are emphasized and recorded. Note that the control unit 111 may convert the content of the conversation into text and display it in a form such as a speech bubble, as shown in FIG. 11(f).
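The voice-only bookkeeping of scene e might look like the following sketch. The function name and the set-of-names representation are hypothetical, and the branch for a missing main-subject voice is an assumption, since the text does not describe that case.

```python
from typing import Iterable, Optional, Set, Tuple


def group_from_voices(
    main_subject: str,
    group_members: Iterable[str],
    detected_voices: Set[str],
) -> Tuple[Optional[str], Set[str]]:
    """Rebuild the conversation group from detected voices alone,
    for the case where nobody is within the angle of view."""
    if main_subject not in detected_voices:
        # Assumption: the source does not say what happens here.
        return None, set()
    partners = {
        p for p in group_members
        if p != main_subject and p in detected_voices
    }
    return main_subject, partners
```

With persons A, B, and D all audible, `group_from_voices("A", ["B", "D"], {"A", "B", "D"})` keeps A as the main subject and returns B and D as conversation partners, matching the outcome described for scene e.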

The operation of the imaging device 100 in the second embodiment has been described above.

Note that the number of persons conversing with the main subject detected in step S506 is based on the person detection within the angle of view in step S504. It is therefore two (person B and person D) in the initial shooting scene (FIG. 11(a)) and one (person D) in FIG. 11(b).

In the second embodiment, audio extraction is performed based on the detection result of the person detection unit 222 until a predetermined time has elapsed from the start of video recording, and thereafter based on the result of person detection 223 and the information from the feature extraction unit 801.
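The time-based switch of inputs can be illustrated as below. The function name, the string labels, and the `switch_after_s` parameter are hypothetical; the source only says "a predetermined time" and gives no value.

```python
from typing import Tuple


def extraction_inputs(elapsed_s: float, switch_after_s: float) -> Tuple[str, ...]:
    """Return which inputs drive audio extraction at a given recording time."""
    if elapsed_s < switch_after_s:
        # Early in the recording: person detection result only.
        return ("person_detection",)
    # Afterwards: person detection result plus extracted voice features.
    return ("person_detection", "voice_features")
```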

As described above, according to the second embodiment, even if a person belonging to a conversation group leaves the angle of view, the conversation group can be corrected to an appropriate one as long as the conversation is continuing.

The determination of conversation continuation in FIGS. 11(b), (c), and (d) of the second embodiment has been described without considering why a person belonging to the conversation group left the angle of view, but the cause may also be taken into account. For example, if a person belonging to the conversation group leaves the angle of view because the photographer operated the zoom of the lens 201, that person left the angle of view regardless of his or her own intention, so the control unit 111 may determine that the conversation is continuing without using the information from the feature extraction unit 801.
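One way to picture this variation is a shortcut that skips the audio-based check when the exit was caused by zooming. The enum, the function name, and the callback shape are hypothetical and serve only to illustrate the idea.

```python
from enum import Enum, auto
from typing import Callable


class ExitCause(Enum):
    """Hypothetical reasons a group member left the angle of view."""
    ZOOM_BY_PHOTOGRAPHER = auto()
    OTHER = auto()


def conversation_continues_after_exit(
    cause: ExitCause,
    audio_based_check: Callable[[], bool],
) -> bool:
    """Treat a zoom-induced exit as a continuing conversation; otherwise
    defer to the check that uses the extracted voice features."""
    if cause is ExitCause.ZOOM_BY_PHOTOGRAPHER:
        # The voice features are not consulted in this branch.
        return True
    return audio_based_check()
```

Passing the audio-based check as a callable makes it explicit that the feature information is not evaluated at all in the zoom case.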

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these embodiments, and various modifications and changes are possible within the scope of its gist.

[Other embodiments]
The present invention can also be realized by a process in which a program that implements one or more functions of the above-described embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of the system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of the functions.

The present invention is not limited to the above embodiments as they are, and at the implementation stage the components can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the components disclosed in the above embodiments. For example, some components may be removed from the full set of components shown in an embodiment. Furthermore, components from different embodiments may be combined as appropriate.

Claims

An audio processing device comprising:
a detection means for detecting a subject from a video;
a selection means for selecting a main subject from the subjects detected from the video;
a determination means for determining a subject's voice from the video;
an association means for associating the subject detected by the detection means with the audio extracted by the determination means;
a determining means for determining a subject related to the main subject selected by the selection means; and
an audio processing means for differentiating audio processing for audio associated with the main subject and audio of a subject determined by the determination means to be related to the main subject from audio processing for audio of a subject not determined by the determination means to be related to the main subject.

The audio processing device according to claim 1, characterized in that the audio processing means adjusts the level of the audio associated with the main subject and the audio of the subject determined by the determination means to be related to the main subject differently from the level of the audio of the subject not determined by the determination means to be related to the main subject.

The audio processing device according to claim 1 or 2, characterized in that the audio processing means sets a correction coefficient for the audio associated with the main subject and the audio of a subject determined by the determination means to be related to the main subject to be larger than a correction coefficient for the audio of a subject not determined by the determination means to be related to the main subject.

The audio processing device according to claim 1 or 2, characterized in that the audio processing means increases the gain for the audio associated with the main subject and the audio of a subject determined by the determination means to be related to the main subject, more than the gain for the audio of a subject not determined by the determination means to be related to the main subject.

The audio processing device according to any one of claims 1 to 4, characterized in that the selection means selects a subject that is in focus in the video as the main subject.

The audio processing device according to any one of claims 1 to 4, characterized in that the selection means selects a main subject from the video based on an image recorded as the main subject.

The audio processing device according to any one of claims 1 to 4, characterized in that the selection means selects, as the main subject, a subject that appears most frequently among the subjects captured in the video.

The audio processing device according to any one of claims 1 to 7, characterized in that the determination means determines a subject related to the main subject based on the distance from the main subject.

The audio processing device according to any one of claims 1 to 8, characterized in that the determining means determines that the subject closest to the main subject is the subject related to the main subject.

The audio processing device according to any one of claims 1 to 7, characterized in that the determination means determines that a subject facing the main subject is a subject related to the main subject.

The audio processing device according to any one of claims 1 to 7, characterized in that the determination means determines a subject related to the main subject based on the movement of the main subject.

The audio processing device according to claim 1, further comprising image processing means, wherein the associating means associates the subject detected by the detection means with the audio extracted by the determination means based on the audio extracted by the determination means and the movement of the subject detected by the image processing means.

The audio processing device according to claim 12, characterized in that the image processing means detects the frequency of speech, the timing of speech, or the movement of the mouth of the subject.

The audio processing device according to any one of claims 1 to 13, characterized in that the determination means extracts the subject's audio based on the frequency, volume, and intonation of the audio.

The audio processing device according to any one of claims 1 to 14, further comprising an imaging means for capturing the video.

A method for controlling an audio processing device, comprising:
a detection step of detecting a subject from a video;
a selection step of selecting a main subject from the subjects detected from the video;
an extraction step of extracting a subject's audio from the video;
an associating step of associating the subject detected in the detection step with the audio extracted in the extraction step;
a determination step of determining a subject related to the main subject selected in the selection step; and
an audio processing step of differentiating audio processing for the audio associated with the main subject and the audio of a subject determined in the determination step to be related to the main subject from audio processing for the audio of a subject not determined in the determination step to be related to the main subject.

A computer-readable program for causing a computer to function as each of the means of the audio processing device according to any one of claims 1 to 15.