JP4488091B2

JP4488091B2 - Electronic device, video content editing method and program

Info

Publication number: JP4488091B2
Application number: JP2008164652A
Authority: JP
Inventors: 昇村林
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2008-06-24
Filing date: 2008-06-24
Publication date: 2010-06-23
Anticipated expiration: 2028-06-24
Also published as: KR20100002090A; US8494338B2; US20100008641A1; KR101590186B1; JP2010010780A; CN101615389B; CN101615389A

Description

The present invention relates to an electronic device capable of editing video content, a video content editing method for the electronic device, and a program therefor.

Conventionally, editing work has been performed to add background music (BGM), sound effects, and the like to video content shot with a camcorder or similar device. For example, Patent Document 1 below discloses a video signal editing apparatus that extracts features (recording time and number of images) of an editing target video, automatically generates music best suited to the editing target video based on predetermined instructions given by a user, and adds that music to the editing target video.
Patent Document 1: JP 2001-202082 A (paragraphs [0024], [0031], FIG. 2, etc.)

However, in the technique described in Patent Document 1, adding music to the editing target video erases the original audio signal recorded in that video. Depending on the scene, leaving the original audio signal in place rather than adding music may make the editing target video more impressive, but this is not possible with the technique of Patent Document 1, which reduces user convenience. In general, a user could also manually select and edit which sections of the editing target video receive music and which sections keep the original audio signal, but that work is very cumbersome and tedious.

In view of the circumstances described above, an object of the present invention is to provide an electronic device, a video content editing method, and a program capable of adding another audio signal while effectively retaining the audio signal in the original video content, depending on the scene.

To solve the above problem, an electronic device according to one aspect of the present invention includes first input means, second input means, first calculation means, second calculation means, setting means, and generation means.
The first input means inputs an image signal and a first audio signal that constitute first video content.
The second input means inputs a second audio signal different from the first audio signal.
The first calculation means detects, from the input image signal, a face image area in which a person's face appears, and calculates a face evaluation value that evaluates the likelihood of the detected face image area.
The second calculation means detects the person's voice from the input first audio signal and calculates a voice evaluation value that evaluates the loudness of the detected voice.
The setting means sets, for each image signal and based on the calculated face evaluation value and voice evaluation value, a first weighting coefficient indicating the weight of the first audio signal and a second weighting coefficient indicating the weight of the second audio signal.
The generation means generates a third audio signal by mixing the first and second audio signals based on the set first and second weighting coefficients, and generates second video content composed of the third audio signal and the image signal.
Here, the electronic device is, for example, an electrical appliance such as a PC (Personal Computer), a recording/playback device using a recording medium such as an HDD (Hard Disk Drive), DVD, or BD (Blu-ray Disc), a digital video camera, a portable AV device, a mobile phone, or a game machine. The first video content is, for example, video content recorded by a device such as a camcorder or video content received via a network. The second audio signal is, for example, an audio signal for BGM or a sound effect.
With this configuration, the electronic device can vary the weights of the first and second audio signals based on the face image and voice contained in the first video content, and generate the second video content from the first video content. Therefore, compared with simply inserting another sound into the first video content, the editing effect is enhanced by keeping the person's voice as it is or inserting another sound depending on the scene, so that more impressive second video content can be generated.
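The generation step described above can be illustrated with a minimal sketch. It assumes both audio signals are plain lists of samples at the same rate and that the weights apply per section; the function names `mix_section` and `mix_content` are hypothetical and do not come from the specification.

```python
def mix_section(original, bgm, k, m):
    """Mix one section: k weights the first (original) audio signal,
    m weights the second (BGM) audio signal. Names are illustrative."""
    n = min(len(original), len(bgm))  # mix only the overlapping samples
    return [k * original[i] + m * bgm[i] for i in range(n)]


def mix_content(sections):
    """Build the third audio signal from (original, bgm, (k, m)) tuples,
    one tuple per section of the first video content."""
    mixed = []
    for original, bgm, (k, m) in sections:
        mixed.extend(mix_section(original, bgm, k, m))
    return mixed
```

With `k = 1` and `m = 0` the section keeps only the original audio, and with `k = 0` and `m = 1` it carries only the BGM; intermediate values blend the two.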

When the face evaluation value is equal to or greater than a first threshold and the voice evaluation value is equal to or greater than a second threshold, the setting means may set the first weighting coefficient to a first value that is larger than the second weighting coefficient.
When both the face evaluation value and the voice evaluation value are large, it is highly likely that a person appearing in the first video content is speaking. In such a case, making the first weighting coefficient as much larger than the second weighting coefficient as possible emphasizes the person's voice and gives a stronger impression of that person. Here, the first value may be set to 1.

When the face evaluation value is less than the first threshold and the voice evaluation value is less than the second threshold, the setting means may set the first weighting coefficient to a second value that is smaller than the second weighting coefficient.
When both the face evaluation value and the voice evaluation value are small, it is highly likely that no person appears in the first video content. In such a case, making the first weighting coefficient as much smaller than the second weighting coefficient as possible emphasizes the second audio signal, so that an ordinary scene of the first video content can be edited into something more attractive. Here, the second value may be set to 0.

When the face evaluation value is equal to or greater than the first threshold and the voice evaluation value is less than the second threshold, the setting means may set the first weighting coefficient larger than the second weighting coefficient in accordance with the face evaluation value and the voice evaluation value.
When the face evaluation value is large and the voice evaluation value is small, a person's face appears in the first video content, so the person is likely to be uttering something even if the voice is quiet. In such a case, adding the second audio signal while giving the first audio signal the larger weight emphasizes the first audio signal and still adds the effect of the second audio signal.

When the face evaluation value is less than the first threshold and the voice evaluation value is equal to or greater than the second threshold, the setting means may set the first weighting coefficient smaller than the second weighting coefficient in accordance with the face evaluation value and the voice evaluation value.
When the face evaluation value is small and the voice evaluation value is large, almost no person is shown in the first video content, so even if a person's voice is included, it is likely to be the voice of someone who has little to do with the image. In such a case, keeping the first audio signal while giving the second audio signal the larger weight preserves the effect of the first audio signal and strengthens the effect of the second audio signal.
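The four threshold cases above can be summarized in a short sketch. Several points are assumptions made only for illustration and are not stated in the text: both evaluation values are taken to lie in the range 0 to 100, the second weight is taken to be `m = 1 - k`, the default thresholds of 50 are arbitrary, and the linear formulas used in the two mixed cases are hypothetical. The text only requires `k > m` or `k < m` in those cases.

```python
def set_weights(tf, tv, tf_threshold=50.0, tv_threshold=50.0):
    """Return (k, m): weights of the first (original) and second (BGM) audio.

    tf: face evaluation value, tv: voice evaluation value (assumed 0..100).
    Assumption: m = 1 - k, so the two weights always sum to 1.
    """
    face = tf >= tf_threshold
    voice = tv >= tv_threshold
    if face and voice:
        k = 1.0                                # person likely speaking: keep original audio
    elif not face and not voice:
        k = 0.0                                # likely no person: BGM only
    elif face:
        k = 0.5 + 0.5 * (tf + tv) / 200.0      # hypothetical rule giving k > m
    else:
        k = 0.5 * (tf + tv) / 200.0            # hypothetical rule giving k < m
    return k, 1.0 - k
```

For example, `set_weights(80, 20)` gives the original audio the larger weight, while `set_weights(20, 80)` favors the BGM.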

The electronic device may further include storage means that stores face feature data indicating the facial features of a specific person.
In this case, the first calculation means may be able to detect a face image area in which the face of the specific person appears, based on the stored face feature data.
Thus, even when the faces of a plurality of persons appear in the video content, the face of the specific person can be detected separately from the faces of other persons. Therefore, the weighting coefficient setting process for the first and second audio signals can be executed more effectively by focusing on the specific person.

The electronic device may further include storage means that stores voice feature data indicating the vocal features of a specific person.
In this case, the second calculation means may be able to detect the voice of the specific person based on the stored voice feature data.
Thus, even when the voices of a plurality of persons are included in the video content, the voice of the specific person can be detected separately from the voices of other persons. Therefore, the weighting coefficient setting process for the first and second audio signals can be executed more effectively by focusing on the specific person.

A video content editing method according to another aspect of the present invention includes inputting an image signal and a first audio signal that constitute first video content, and inputting a second audio signal different from the first audio signal.
From the input image signal, a face image area in which a person's face appears is detected, and a face evaluation value that evaluates the likelihood of the detected face image area is calculated.
From the input first audio signal, the person's voice is detected, and a voice evaluation value that evaluates the loudness of the detected voice is calculated.
Based on the calculated face evaluation value and voice evaluation value, a first weighting coefficient indicating the weight of the first audio signal and a second weighting coefficient indicating the weight of the second audio signal are set for each image signal.
Based on the set first and second weighting coefficients, a third audio signal in which the first and second audio signals are mixed is generated, and second video content composed of the third audio signal and the image signal is generated.
With this configuration, compared with simply inserting another sound into the first video content, the editing effect is enhanced by keeping the person's voice as it is or inserting another sound depending on the scene, so that more impressive second video content can be generated.

A program according to still another aspect of the present invention causes an electronic device to execute a first input step, a second input step, a first calculation step, a second calculation step, a setting step, and a generation step.
In the first input step, an image signal and a first audio signal that constitute first video content are input.
In the second input step, a second audio signal different from the first audio signal is input.
In the first calculation step, a face image area in which a person's face appears is detected from the input image signal, and a face evaluation value that evaluates the likelihood of the detected face image area is calculated.
In the second calculation step, the person's voice is detected from the input first audio signal, and a voice evaluation value that evaluates the loudness of the detected voice is calculated.
In the setting step, a first weighting coefficient indicating the weight of the first audio signal and a second weighting coefficient indicating the weight of the second audio signal are set for each image signal, based on the calculated face evaluation value and voice evaluation value.
In the generation step, a third audio signal is generated by mixing the first and second audio signals based on the set first and second weighting coefficients, and second video content composed of the third audio signal and the image signal is generated.

As described above, according to the present invention, another audio signal can be added while the audio signal in the original video content is effectively retained, depending on the scene.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a recording/reproducing apparatus according to an embodiment of the present invention.
As shown in the figure, the recording/reproducing apparatus 100 according to the present embodiment includes image signal input units 1 and 3, audio signal input units 2 and 4, an input image processing unit 5, an input audio processing unit 6, an image feature detection unit 7, an audio feature detection unit 8, a recording unit 9, and a recording medium 10. The recording/reproducing apparatus 100 also includes a reproducing unit 11, an output image processing unit 12, an output audio processing unit 13, a user interface unit 14, a CPU (Central Processing Unit) 15, and a RAM (Random Access Memory) 16.

The image signal input units 1 and 3 are various wired communication terminals or wireless communication units. Examples of the wired communication terminals include an S terminal, an RCA terminal, a DVI (Digital Visual Interface) terminal, an HDMI (High-Definition Multimedia Interface) terminal, an Ethernet (registered trademark) terminal, a USB (Universal Serial Bus) terminal, and an IEEE 1394 terminal. Examples of the wireless communication units include wireless LAN, Bluetooth (registered trademark), wireless USB, and wireless HDMI units. However, the wired communication terminals and wireless communication units are not limited to these. The image signal input units 1 and 3 input the image signal of video content into the recording/reproducing apparatus 100 via various cables or a wireless network, and supply it to the input image processing unit 5. Here, the video content is, for example, content shot with a camcorder or the like, or content on the Internet.

The audio signal input units 2 and 4 are likewise various wired communication terminals or wireless communication units, and are substantially the same as the terminals and units listed above, except for the S terminal and the DVI terminal. The audio signal input units 2 and 4 input the audio signal of the video content into the recording/reproducing apparatus 100 via the various cables or the wireless network, and supply it to the input audio processing unit 6.

The image signal input units 1 and 3 and the audio signal input units 2 and 4 may also be an antenna input terminal, a tuner, and the like that input the image signal and audio signal contained in a digital broadcast signal into the recording/reproducing apparatus 100 via an antenna (not shown).

The input image processing unit 5 performs various kinds of signal processing, such as digital conversion and encoding, on the input image signal, and outputs the result as a digital image signal to the image feature detection unit 7 and the recording unit 9.
The input audio processing unit 6 performs various kinds of signal processing, such as digital conversion and encoding, on the input audio signal, and outputs the result as a digital audio signal to the audio feature detection unit 8 and the recording unit 9.

The image feature detection unit 7 detects a face image (a face image area) in which a person's face appears from the image signal supplied from the input image processing unit 5, and calculates a face evaluation value that evaluates the likelihood of the face image area.
The audio feature detection unit 8 detects a human voice from the audio signal supplied from the input audio processing unit 6, and calculates a voice evaluation value that evaluates the loudness of the detected voice.

The recording unit 9 multiplexes the image signal supplied from the input image processing unit 5 and the audio signal supplied from the input audio processing unit 6, and records them on the recording medium 10.

Examples of the recording medium 10 include built-in recording media such as an HDD and a flash memory, and portable recording media such as an optical disc and a memory card. Examples of the optical disc include BD, DVD, and CD. The recording medium 10 stores various video content, programs, data, and the like. When the recording medium 10 is a built-in recording medium, it stores the OS and the various programs and data for executing the face image detection process, the voice detection process, the learning process for those detection processes, the audio editing process for video content, and so on. When the recording medium 10 is a portable recording medium, the recording/reproducing apparatus 100 is separately provided with a built-in recording medium (not shown) for recording those programs and data.

The reproducing unit 11 reads the multiplexed image signal and audio signal recorded on the recording medium 10, separates them, decodes the separated image signal and audio signal, and supplies the image signal to the output image processing unit 12 and the audio signal to the output audio processing unit 13. Examples of compression formats for the video signal and audio signal include MPEG (Moving Picture Experts Group)-2 and MPEG-4.

The output image processing unit 12 performs various kinds of signal processing, such as analog conversion and OSD (On Screen Display) processing, and outputs the image signal to an external device such as a liquid crystal display connected to the recording/reproducing apparatus 100, or to a liquid crystal display built into the recording/reproducing apparatus 100.
The output audio processing unit 13 performs various kinds of signal processing, such as analog conversion, and outputs the audio signal to the external device or the built-in liquid crystal display.

The user interface unit 14 is, for example, an infrared signal receiver for a remote controller, operation buttons, switches, a mouse, or a keyboard. It receives various commands from user operations and outputs them to the CPU 15.

The CPU 15 accesses the RAM 16 and other components as necessary, and controls all blocks of the recording/reproducing apparatus 100 in an integrated manner. The RAM 16 is used as a work area for the CPU 15 and temporarily holds the OS (Operating System), programs, processing data, and the like.

The external audio source 17 is an external device such as a PC or any of various AV devices. It stores a BGM (or sound effect) audio signal to be inserted into video content (hereinafter referred to as BGM audio), and inputs that audio signal to the CPU 15 via various interfaces. However, the external audio source 17 may instead be a recording medium built into or attached to the recording/reproducing apparatus 100, such as the recording medium 10.

Next, the operation of the recording/reproducing apparatus 100 configured as described above will be described.

In the present embodiment, the recording/reproducing apparatus 100 can edit video content and insert the BGM audio stored in the external audio source 17 into that video content. When inserting the BGM audio, the recording/reproducing apparatus 100 detects a face image from the image signal of the video content and a voice from the audio signal, as described above, and decides accordingly whether inserting the BGM audio is appropriate. Of these, for the detection of face images, the recording/reproducing apparatus 100 executes a learning process as preprocessing. This learning process is described below.

FIG. 2 is a diagram conceptually showing the learning process for face image detection.
As shown in the figure, the recording medium 10 of the recording/reproducing apparatus 100 stores, as learning data organized into databases, face image sample data representing samples of the face images of various persons and non-face image sample data representing samples of non-face images.

The image feature detection unit 7 of the recording/reproducing apparatus 100 applies a feature filter to each item of sample image data stored in the face image sample database and the non-face image sample database, extracts the individual facial features, and detects a feature vector (feature data).

As shown in the figure, the feature filter is, for example, a filter that detects certain parts of a rectangle in the image and masks other parts. With this feature filter, the positional relationships among the eyes, eyebrows, nose, cheeks, and so on are detected as facial features from the face image sample data, and the shapes of objects other than faces, the positional relationships among the components of those objects, and so on are detected as non-facial features from the non-face image sample data. Besides a rectangular filter, the feature filter may be, for example, a separability filter that detects circular features or a Gabor filter that detects the positional relationships among the parts of the face from edges in specific orientations. In addition to feature filters, luminance distribution information, skin color information, and the like may also be used to detect facial features.
Here, the image feature detection unit 7 cannot recognize the size and position of the face region from the sample image data. Therefore, the image feature detection unit 7 applies the feature filter while varying the size of its frame, and treats the filter size that yields the most probable detection value as the size of the face region when extracting facial features. Likewise, the image feature detection unit 7 scans all regions of the sample image data with the feature filter, and treats the filter position that yields the most probable detection value as the position of the face region when extracting facial features.
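The search over filter sizes and positions described above can be sketched as a multi-scale sliding-window scan. This is only an illustration under stated assumptions: the image is a list of rows of brightness values, the window is square, and the scoring function is supplied by the caller. The names `best_window` and `mean_brightness` are hypothetical, and `mean_brightness` is merely a stand-in for a real feature filter response.

```python
def mean_brightness(patch):
    """Toy stand-in for a feature filter response (not a real face filter)."""
    return sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))


def best_window(image, score, sizes):
    """Try every square window of every size and keep the highest score.

    Returns (best_score, top, left, size), i.e. the detection value together
    with the position and size that are then treated as the face region.
    """
    height, width = len(image), len(image[0])
    best = None
    for size in sizes:
        for top in range(height - size + 1):
            for left in range(width - size + 1):
                patch = [row[left:left + size] for row in image[top:top + size]]
                s = score(patch)
                if best is None or s > best[0]:
                    best = (s, top, left, size)
    return best
```

An exhaustive scan like this is simple but costly; a practical detector would typically limit the set of sizes and the step between positions.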

The image feature detection unit 7 generates multidimensional feature vectors from the features extracted from the face image sample data and the non-face image sample data. The image feature detection unit 7 then represents these feature vectors in a multidimensional vector space and generates a discriminant function by statistical machine learning. The generated discriminant function is stored, for example, on the recording medium 10 and is used when detecting face images in the video content to be edited.
Instead of discriminant analysis using a discriminant function, discriminant analysis using a machine learning technique such as a support vector machine (SVM), AdaBoost, or a neural network may be executed. In this case, a processing module that executes the discrimination process is incorporated into the recording/reproducing apparatus 100 in place of the discriminant function. The same applies to the processing involving the discriminant function in the following description.
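As a rough illustration of learning a discriminant function from face and non-face feature vectors, the sketch below fits a linear function whose sign separates the two class means. The specification does not state which statistical method is used, so this nearest-centroid linear discriminant is an assumption chosen for brevity, and the names `mean_vector` and `learn_discriminant` are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def learn_discriminant(face_vectors, nonface_vectors):
    """Return f(x) = w.x + b, positive on the face side of the boundary.

    Assumption: w points from the non-face mean to the face mean and the
    boundary passes through the midpoint between the two means.
    """
    mu_face = mean_vector(face_vectors)
    mu_nonface = mean_vector(nonface_vectors)
    w = [a - b for a, b in zip(mu_face, mu_nonface)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_face, mu_nonface)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))

    def discriminant(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + bias

    return discriminant
```

The returned function can be stored and later applied to feature vectors from the content being edited; an SVM or AdaBoost classifier would replace it with a trained model that exposes a comparable score.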

Next, the process in which the recording/reproducing apparatus 100 of the present embodiment edits video content and inserts BGM data into it will be described.

FIG. 3 is a flowchart showing the flow of the process by which the recording/reproducing apparatus 100 inserts BGM into video content.
As shown in the figure, the video content to be edited is first read from the recording medium 10, or input from the image signal input unit 1 or 3 and the audio signal input unit 2 or 4. The CPU 15 then extracts the image signal and audio signal of a predetermined section (a predetermined number of consecutive frames) from the video content (step 31). The extracted image signal of the predetermined section is supplied to the image feature detection unit 7, and the audio signal of the predetermined section is supplied to the audio feature detection unit 8.
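Step 31 can be pictured as cutting the content into fixed-length sections and pairing each run of frames with the audio samples that accompany it. The sketch below assumes a constant number of audio samples per frame, which the text does not specify; `extract_sections` and its parameters are hypothetical names.

```python
def extract_sections(frames, samples, frames_per_section, samples_per_frame):
    """Split content into sections of consecutive frames plus matching audio.

    Returns a list of (frame_slice, sample_slice) pairs. The last section
    may be shorter when the frame count is not a multiple of the section size.
    """
    sections = []
    for start in range(0, len(frames), frames_per_section):
        end = start + frames_per_section
        audio = samples[start * samples_per_frame:end * samples_per_frame]
        sections.append((frames[start:end], audio))
    return sections
```

Each pair can then be handed to the face detection and voice detection stages independently.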

続いて、画像特徴検出部７は、上記判別関数を用いて、上記所定区間の画像信号から、顔画像領域を検出する（ステップ３２）。図４は、顔画像領域の検出処理について概念的に示した図である。同図に示すように、画像特徴検出部７は、所定区間の画像信号を上記特徴フィルターにかけ、顔特徴を抽出して、多次元の特徴ベクトルを生成する。そして、画像特徴検出部７は、当該特徴ベクトルの各次元の値を判別関数の各次元の変数に導入して、判別関数の出力が正負のいずれであるかにより、当該画像信号に顔画像領域が含まれるか否かを判定する。 Subsequently, the image feature detection unit 7 detects a face image region from the image signal of the predetermined section using the discriminant function (step 32). FIG. 4 is a diagram conceptually showing the face image region detection process. As shown in the figure, the image feature detection unit 7 applies the feature filter to the image signal of the predetermined section, extracts face features, and generates a multidimensional feature vector. The image feature detection unit 7 then substitutes the value of each dimension of the feature vector into the corresponding variable of the discriminant function and determines, from whether the output of the discriminant function is positive or negative, whether the image signal contains a face image region.

そして、画像特徴検出部７は、この判別関数の出力値を基に、顔画像の検出の確からしさを評価する顔評価値Ｔｆを算出する（ステップ３３）。この顔評価値は、例えば、所定の明確な顔画像データを基に特徴ベクトルを生成してこれを判別関数に入力した場合における、判別関数の出力値を百分率で表した値とされる。 Then, based on the output value of the discriminant function, the image feature detection unit 7 calculates a face evaluation value Tf that evaluates the likelihood of the face image detection (step 33). The face evaluation value is, for example, the output value of the discriminant function expressed as a percentage, for the case where a feature vector generated from predetermined, clearly recognizable face image data is input to the discriminant function.
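The face detection and face evaluation steps above can be sketched as follows. This is only an illustration under stated assumptions: the discriminant function is taken to be linear (hypothetical weights `w` and bias `b`), the percentage is computed relative to the output obtained for a reference clear-face sample (`reference_output`, assumed positive), and the result is clamped to 0 to 100. None of these choices is specified in the description.

```python
from typing import Sequence


def discriminant(features: Sequence[float], w: Sequence[float], b: float) -> float:
    """Hypothetical linear discriminant function; a positive output means 'face'."""
    return sum(wi * xi for wi, xi in zip(w, features)) + b


def face_evaluation_value(features: Sequence[float], w: Sequence[float],
                          b: float, reference_output: float) -> float:
    """Face evaluation value Tf as a percentage (assumed normalization).

    reference_output is the discriminant output for a clear-face sample
    and is assumed to be positive.
    """
    out = discriminant(features, w, b)
    if out <= 0:
        # Non-positive output is treated as "no face" (assumption for out == 0).
        return 0.0
    return min(100.0, 100.0 * out / reference_output)
```

For example, with `w = [1.0, 2.0]`, `b = -1.0` and `reference_output = 4.0`, the feature vector `[1.0, 1.0]` gives a discriminant output of 2.0 and therefore Tf = 50.0.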

続いて、音声特徴検出部８は、所定区間の音声信号から、人の声が含まれる区間を検出する（ステップ３４）。図５は、声の検出処理について概念的に示した図である。同図においては、上記所定区間の音声信号のパワーが示されている。同図の波形Ａは、人の声を示しており、同図の波形Ｂは、人の声以外の音声を示している。 Subsequently, the voice feature detection unit 8 detects a section containing a human voice from the audio signal of the predetermined section (step 34). FIG. 5 is a diagram conceptually showing the voice detection process. The figure shows the power of the audio signal in the predetermined section. Waveform A in the figure represents a human voice, and waveform B represents sound other than a human voice.

同図に示すように、音声特徴検出部８はまず、ノイズの影響を除去するために、音声パワーに関する閾値Ａｔｈを設定する。そして、音声特徴検出部８は、所定区間における平均パワーがＡｔｈよりも大きい場合には、その区間は音声区間であると判定し、Ａｔｈよりも小さい場合には、その区間は非音声区間であると判定する。すなわち、同図においては、波形Ａ及びＢ以外の音声信号は非音声区間とされる。 As shown in the figure, the voice feature detection unit 8 first sets a threshold Ath on audio power in order to remove the influence of noise. When the average power in the predetermined section is larger than Ath, the voice feature detection unit 8 determines that the section is an audio section; when it is smaller than Ath, it determines that the section is a non-audio section. That is, in the figure, the audio signal outside waveforms A and B is treated as non-audio sections.

音声区間のうち、人の声には、子音、母音、息継ぎ等が含まれるため、音楽等の声以外の音声と比べて、所定パワー以上の継続区間が短いという特徴がある。この特徴を利用して、音声特徴検出部８は、時間に関する閾値Ｔｔｈを設定し、所定パワー以上の平均継続時間長がＴｔｈよりも小さい場合には、その区間は声区間とし、Ｔｔｈよりも大きい場合には、その区間は非声区間であると判定する。 Within audio sections, a human voice contains consonants, vowels, breaths, and the like, so its runs at or above a predetermined power are shorter than those of non-voice sound such as music. Using this property, the voice feature detection unit 8 sets a threshold Tth on time. When the average duration of runs at or above the predetermined power is smaller than Tth, the section is determined to be a voice section; when it is larger than Tth, the section is determined to be a non-voice section.
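The two-stage decision described above (power threshold Ath, then duration threshold Tth) might look like the sketch below. The function names, the `level` parameter standing in for the "predetermined power", and the handling of the boundary cases (power exactly equal to Ath, duration exactly equal to Tth, no run above `level`) are assumptions; the description does not define them.

```python
from typing import List, Sequence


def run_lengths_above(power: Sequence[float], level: float) -> List[int]:
    """Lengths (in samples) of consecutive runs where power >= level."""
    runs, current = [], 0
    for p in power:
        if p >= level:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def classify_section(power: Sequence[float], a_th: float,
                     level: float, t_th: float) -> str:
    """Return 'non-audio', 'voice', or 'non-voice' for one section."""
    # Stage 1: average power must exceed Ath to count as an audio section.
    if not power or sum(power) / len(power) <= a_th:
        return "non-audio"
    # Stage 2: short average runs above `level` suggest a human voice.
    runs = run_lengths_above(power, level)
    if not runs:
        return "non-voice"  # assumption: no run above `level` is not a voice
    mean_run = sum(runs) / len(runs)
    return "voice" if mean_run < t_th else "non-voice"
```

A speech-like power sequence with frequent short gaps, such as `[1, 1, 0, 1, 1, 0, 1, 1, 0, 1]`, is classified as a voice section with `a_th=0.5`, `level=0.5`, `t_th=3`, while a sustained sequence such as `[1] * 10` is classified as non-voice.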

続いて、音声特徴検出部８は、検出された声の大きさ（パワーレベル、振幅）を基に、声評価値Ｔｖを算出する（ステップ３５）。この声評価値は、例えば検出可能な声の最大パワーレベルを１として、声のパワーレベルを百分率で表した値とされる。 Subsequently, the voice feature detection unit 8 calculates a voice evaluation value Tv based on the loudness (power level, amplitude) of the detected voice (step 35). The voice evaluation value is, for example, the power level of the voice expressed as a percentage, with the maximum detectable voice power level taken as 1.

続いて、ＣＰＵ１５は、上記顔評価値Ｔｆが、所定の閾値Ｔｆｓ以上であるか否かを判断する（ステップ３６）。ＣＰＵ１５は、顔評価値Ｔｆが閾値Ｔｆｓ以上である場合（Ｙｅｓ）、上記声評価値Ｔｖが所定の閾値Ｔｖｓ以上であるか否かを判断する（ステップ３７）。 Subsequently, the CPU 15 determines whether or not the face evaluation value Tf is equal to or greater than a predetermined threshold value Tfs (step 36). When the face evaluation value Tf is equal to or greater than the threshold value Tfs (Yes), the CPU 15 determines whether or not the voice evaluation value Tv is equal to or greater than the predetermined threshold value Tvs (step 37).

ＣＰＵ１５は、声評価値Ｔｖが閾値Ｔｖｓ以上である場合（Ｙｅｓ）には、ＢＧＭ音声の重み係数ｋを、０．５よりも小さい所定の重みｋ１に設定し、映像コンテンツの音声信号の重み係数ｍを１−ｋ１に設定する。ｋ１は例えば０に設定されるが、０でない場合でも、極力０に近い値となるように設定される。 When the voice evaluation value Tv is equal to or greater than the threshold value Tvs (Yes), the CPU 15 sets the weighting factor k of the BGM audio to a predetermined weight k1 smaller than 0.5 and sets the weighting factor m of the audio signal of the video content to 1-k1. For example, k1 is set to 0; even when it is not 0, it is set as close to 0 as possible.

ＣＰＵ１５は、上記ステップ３７において、声評価値Ｔｖが閾値Ｔｖｓ未満である場合（Ｎｏ）には、顔評価値Ｔｆ及び声評価値Ｔｖに応じて上記重み係数ｋ及びｍを設定する（ステップ３９）。すなわち、重み係数ｋ及びｍのいずれも０または１ではないが、重み係数ｋは、重み係数ｍよりも小さく設定される。 When the voice evaluation value Tv is less than the threshold value Tvs in step 37 (No), the CPU 15 sets the weighting factors k and m according to the face evaluation value Tf and the voice evaluation value Tv (step 39). That is, neither k nor m is 0 or 1, but k is set smaller than m.

ＣＰＵ１５は、上記ステップ３６において、顔評価値Ｔｆが閾値Ｔｆｓ未満である場合（Ｎｏ）、上記声評価値Ｔｖが所定の閾値Ｔｖｓ以上であるか否かを判断する（ステップ４０）。ＣＰＵ１５は、上記声評価値Ｔｖが閾値Ｔｖｓ以上である場合（Ｙｅｓ）には、顔評価値Ｔｆ及び声評価値Ｔｖに応じて上記重み係数ｋ及びｍを設定する（ステップ４１）。すなわち、重み係数ｋ及びｍのいずれも０または１ではないが、重み係数ｋは、重み係数ｍよりも大きく設定される。 When the face evaluation value Tf is less than the threshold value Tfs (No) in step 36, the CPU 15 determines whether or not the voice evaluation value Tv is equal to or greater than a predetermined threshold value Tvs (step 40). When the voice evaluation value Tv is equal to or greater than the threshold value Tvs (Yes), the CPU 15 sets the weighting factors k and m according to the face evaluation value Tf and the voice evaluation value Tv (step 41). That is, neither of the weighting factors k and m is 0 or 1, but the weighting factor k is set larger than the weighting factor m.

ＣＰＵ１５は、上記ステップ４０において、声評価値Ｔｖが閾値Ｔｖｓ未満である場合（Ｎｏ）には、重み係数ｋを、０．５よりも大きい所定の重みｋ２に設定し、重み係数ｍを１−ｋ２に設定する。ｋ２は例えば１に設定されるが、１でない場合でも、極力１に近い値となるように設定される。 When the voice evaluation value Tv is less than the threshold value Tvs in step 40 (No), the CPU 15 sets the weighting factor k to a predetermined weight k2 greater than 0.5 and sets the weighting factor m to 1-k2. For example, k2 is set to 1; even when it is not 1, it is set as close to 1 as possible.

ＣＰＵ１５は、このように設定された重み係数ｋ及びｍに基づいて、映像コンテンツの所定区間毎（フレーム毎）に、映像コンテンツを編集して、外部音声ソース１７から入力されたＢＧＭ音声を挿入していく（ステップ４３）。 Based on the weighting factors k and m set in this way, the CPU 15 edits the video content for each predetermined section (for each frame) of the video content and inserts the BGM audio input from the external audio source 17. (Step 43).
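One plausible reading of the BGM insertion in step 43 is a per-sample weighted sum of the original audio and the BGM audio. The sketch below assumes that linear form and assumes both signals are already aligned sample lists of equal length; the function name `mix_section` is hypothetical and the description does not state the mixing formula.

```python
from typing import List, Sequence


def mix_section(original: Sequence[float], bgm: Sequence[float],
                k: float, m: float) -> List[float]:
    """Mix one section: m weights the original audio, k weights the BGM."""
    if len(original) != len(bgm):
        raise ValueError("original and BGM sections must have the same length")
    return [m * o + k * b for o, b in zip(original, bgm)]
```

With `k = 0` and `m = 1` the original audio passes through unchanged, and with `k = 1` and `m = 0` the section is replaced entirely by BGM, which matches the two extreme cases of the weight setting.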

ＣＰＵ１５は、以上の処理を、映像コンテンツの全ての所定区間に対して実行するまで、または、ユーザ等から処理の中止が命令されるまで実行する（ステップ４４、４５）。ＣＰＵ１５は、編集後の映像コンテンツを、最終的に元の画像信号と多重化して、新たな映像コンテンツとして記録媒体１０に記録する。 The CPU 15 executes the above processing for all the predetermined sections of the video content or until the user or the like instructs to stop the processing (steps 44 and 45). The CPU 15 finally multiplexes the edited video content with the original image signal and records the new video content on the recording medium 10.

図６は、以上説明した重み係数ｋ及びｍの設定処理を示した表である。同図に示すように、顔評価値及び声評価値が各閾値Ｔｆｓ及びＴｖｓ以上であるか否かに応じて、４つのパターンの重み係数が設定される。 FIG. 6 is a table showing the setting processing of the weighting factors k and m described above. As shown in the figure, four pattern weight coefficients are set depending on whether the face evaluation value and the voice evaluation value are equal to or higher than the threshold values Tfs and Tvs.
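The four patterns can be sketched as a single function. The two extreme cases follow the text directly (k = k1 or k = k2, with m = 1 - k). For the two intermediate cases the description only requires that k and m depend on Tf and Tv, that neither is 0 or 1, and that k < m or k > m respectively; the linear mapping used here, the relation m = 1 - k in those cases, and the assumption that Tf and Tv are percentages from 0 to 100 are illustrative choices only.

```python
from typing import Tuple


def set_weights(tf: float, tv: float, tfs: float, tvs: float,
                k1: float = 0.0, k2: float = 1.0) -> Tuple[float, float]:
    """Return (k, m): BGM weight k and original-audio weight m."""
    # Combined score in [0, 1], assuming tf and tv are percentages (assumption).
    s = (tf + tv) / 200.0
    if tf >= tfs and tv >= tvs:
        k = k1                          # clear face and loud voice: keep original audio
    elif tf >= tfs:
        k = 0.05 + 0.40 * (1.0 - s)     # face but quiet voice: k in [0.05, 0.45], k < m
    elif tv >= tvs:
        k = 0.55 + 0.40 * (1.0 - s)     # voice but no face: k in [0.55, 0.95], k > m
    else:
        k = k2                          # neither: BGM dominates
    return k, 1.0 - k
```

For instance, `set_weights(90, 80, 70, 60)` returns `(0.0, 1.0)` and `set_weights(20, 10, 70, 60)` returns `(1.0, 0.0)`.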

図７は、上記顔評価値及び声評価値、重み係数ｋ及びｍ及び映像コンテンツの各フレーム画像との関係を示したグラフである。同図に示されるフレームｆ１〜ｆ６は、一例として、カムコーダ等で学校の運動会の様子が収録された映像コンテンツの一部のフレームを示している。 FIG. 7 is a graph showing the relationship between the face evaluation value and voice evaluation value, the weight coefficients k and m, and each frame image of the video content. As an example, frames f1 to f6 shown in the figure show some frames of video content in which a state of a school sports day is recorded by a camcorder or the like.

同図に示すように、映像コンテンツのフレームｆ１及びｆ２では、顔が小さすぎて、上記画像特徴検出部７により顔画像領域が検出されないため、顔評価値は低い（閾値Ｔｆｓ未満）。また、このフレームｆ１及びｆ２の区間では、遠くから撮影されており、人の声もほとんど集音されないため、声評価値も低い（閾値Ｔｖｓ未満）。そのため、この区間では、ＢＧＭ音声の重み係数ｋが高く、コンテンツの音声信号の重み係数ｍが低く設定されている。これにより、平凡なシーンをより魅力的なものに編集することができる。 As shown in the figure, in the frames f1 and f2 of the video content, the face evaluation value is low (less than the threshold Tfs) because the face is too small and the face image area is not detected by the image feature detection unit 7. Further, in the sections of the frames f1 and f2, the voice evaluation value is low (less than the threshold value Tvs) because the image is taken from a distance and almost no human voice is collected. Therefore, in this section, the weight coefficient k of the BGM audio is set high, and the weight coefficient m of the content audio signal is set low. This makes it possible to edit a mediocre scene into a more attractive one.

フレームｆ３及びｆ４では、人がややアップで撮影され、集音される声もやや大きくなっているため、この区間では、顔評価値及び声評価値に応じて重み係数ｋ及びｍが設定される。これにより、人の音声も残しながら、同時にＢＧＭ挿入による効果も得ることができる。すなわち、画像特徴検出部７は、顔評価値が閾値Ｔｆｓ以上で声評価値が閾値Ｔｖｓ未満の場合には、ＢＧＭ音声の重みを低くすることで、画像に現れる人物の声を強調することができる。また、画像特徴検出部７は、顔評価値が閾値Ｔｆｓ未満で声評価値が閾値Ｔｖｓ以上の場合には、ＢＧＭ音声の重みを高くすることで、画像と無関係な人物の声よりも、ＢＧＭの効果を高めることができる。 In frames f3 and f4, the person is shot somewhat closer and the collected voice is somewhat louder, so in this section the weighting factors k and m are set according to the face evaluation value and the voice evaluation value. This keeps the person's voice while also obtaining the effect of BGM insertion. That is, when the face evaluation value is equal to or greater than the threshold value Tfs and the voice evaluation value is less than the threshold value Tvs, the image feature detection unit 7 can emphasize the voice of the person appearing in the image by lowering the weight of the BGM audio. When the face evaluation value is less than the threshold value Tfs and the voice evaluation value is equal to or greater than the threshold value Tvs, the image feature detection unit 7 can raise the weight of the BGM audio so that the BGM takes precedence over the voice of a person unrelated to the image.

フレームｆ５及びｆ６では、顔がはっきり検出できる程度に人がアップで撮影されているため、顔評価値は高い（閾値Ｔｆｓ以上）。また検出される声のパワーレベルも大きいため、声評価値も高い（閾値Ｔｖｓ以上）。そのため、この区間では、重み係数ｋは低く、重み係数ｍは高く設定されている。これにより、人の声を強調することで、その人をより印象付けることができる。 In frames f5 and f6, the person is shot close enough for the face to be clearly detected, so the face evaluation value is high (equal to or greater than the threshold value Tfs). The power level of the detected voice is also high, so the voice evaluation value is high as well (equal to or greater than the threshold value Tvs). In this section, therefore, the weighting factor k is set low and the weighting factor m is set high. Emphasizing the person's voice in this way makes the person more memorable.

以上のように、本実施形態によれば、顔評価値及び声評価値に基づいて映像コンテンツにＢＧＭ音声を挿入することとしたため、シーンに応じて、元の映像コンテンツ中の音声信号を効果的に残しながら、ＢＧＭ音声を挿入することができる。これにより、単に一律にＢＧＭ音声を挿入する場合に比べて、映像コンテンツをより印象的な、思い出深いものとすることができる。 As described above, according to the present embodiment, BGM audio is inserted into the video content based on the face evaluation value and the voice evaluation value, so BGM audio can be inserted while the audio signal of the original video content is effectively retained, depending on the scene. Compared with simply inserting BGM audio uniformly, this makes the video content more impressive and memorable.

本発明は上述の実施形態にのみ限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々変更を加え得ることは勿論である。 The present invention is not limited to the above-described embodiment, and various modifications can be made without departing from the scope of the present invention.

上述の実施形態において、画像特徴検出部７は、人の顔画像のみならず、動物の顔画像を検出してもよい。また音声特徴検出部８は、人の声のみならず、動物の声を検出してもよい。 In the above-described embodiment, the image feature detection unit 7 may detect not only a human face image but also an animal face image. The voice feature detection unit 8 may detect not only a human voice but also an animal voice.

上述の実施形態において、画像特徴検出部７は、単に顔画像を検出するのみならず、特定の人物の顔画像を認識してもよい。この顔認識処理は、上記判別関数による顔検出処理の後に実行される。この顔認識処理には、エッジ強度画像、周波数強度画像、高次自己相関、カラー変換画像等を用いることができる。
図８は、エッジ強度画像を用いた顔認識処理を概念的に示した図である。
同図に示すように、記録媒体１０等には、顔認識したい人の特徴データ（辞書パターン）として、濃淡画像と、エッジ強度画像とが記憶されている。画像特徴検出部７は、検出された顔画像から、特徴データとして、濃淡画像及びエッジ強度画像を抽出する。そして、画像特徴検出部７は、この抽出した濃淡画像及びエッジ強度画像と、上記記憶された、顔認識したい人の濃淡画像及びエッジ強度画像とをパターンマッチングにより比較処理することで、特定の人の顔画像を認識することができる。この場合、画像特徴検出部７は、顔画像の認識率（マッチング率）を百分率で表して、顔評価値とすればよい。画像特徴検出部７は、目や鼻等の顔特徴点の情報が得られる場合には、上記エッジ強度画像等に加えてそれらの情報を併用することもできる。
この処理により、例えば上記図７の例では、多数の子供の中から、ユーザの子供の顔のみを検出及び認識する等、特定の人の顔の認識率に応じて、映像コンテンツにＢＧＭを挿入することができる。これにより、編集後の映像コンテンツをより印象深いものとすることができる。 In the above-described embodiment, the image feature detection unit 7 may not only detect a face image but also recognize a face image of a specific person. This face recognition process is executed after the face detection process by the discriminant function. For this face recognition process, an edge intensity image, a frequency intensity image, a high-order autocorrelation, a color conversion image, or the like can be used.
FIG. 8 is a diagram conceptually showing the face recognition process using the edge intensity image.
As shown in the figure, a grayscale image and an edge intensity image are stored in the recording medium 10 or the like as feature data (dictionary pattern) of the person whose face is to be recognized. The image feature detection unit 7 extracts a grayscale image and an edge intensity image as feature data from the detected face image. The image feature detection unit 7 then compares the extracted grayscale image and edge intensity image with the stored grayscale image and edge intensity image of the person to be recognized by pattern matching, and can thereby recognize the face image of the specific person. In this case, the image feature detection unit 7 may express the recognition rate (matching rate) of the face image as a percentage and use it as the face evaluation value. When information on facial feature points such as the eyes and nose is available, the image feature detection unit 7 can also use that information together with the edge intensity image and the like.
With this process, BGM can be inserted into the video content according to the recognition rate of a specific person's face; in the example of FIG. 7 described above, for instance, only the face of the user's child can be detected and recognized from among many children. This makes the edited video content more impressive.
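As one illustration of turning pattern matching into a percentage, the sketch below uses normalized cross-correlation between a candidate pattern and a stored dictionary pattern, both flattened to equal-length lists of pixel values. Normalized cross-correlation is a stand-in chosen for this example; the description does not say which matching measure is used, and the name `matching_rate` is hypothetical.

```python
import math
from typing import Sequence


def matching_rate(candidate: Sequence[float], dictionary: Sequence[float]) -> float:
    """Matching rate in percent from normalized cross-correlation (assumed measure)."""
    if len(candidate) != len(dictionary) or not candidate:
        raise ValueError("patterns must be non-empty and of equal length")
    mc = sum(candidate) / len(candidate)
    md = sum(dictionary) / len(dictionary)
    num = sum((c - mc) * (d - md) for c, d in zip(candidate, dictionary))
    den = math.sqrt(sum((c - mc) ** 2 for c in candidate)
                    * sum((d - md) ** 2 for d in dictionary))
    if den == 0.0:
        return 0.0  # a flat pattern carries no structure to match
    # Negative correlation is treated as no match.
    return max(0.0, 100.0 * num / den)
```

The resulting percentage could then be used as the face evaluation value in place of the detection-based value, as the paragraph above suggests.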

上述の実施形態において、音声特徴検出部８は、単に声を検出するのみならず、特定の人物の声を認識してもよい。この声認識処理は、例えば、音声特徴検出部８が、認識したい人の声信号を周波数解析して、スペクトル特性を検出して上記記録媒体１０等に記憶しておき、検出された声のスペクトル特性と比較処理（パターンマッチング）することで実行される。スペクトル特性としては、子音部分及び母音部分のスペクトルピーク周波数、スペクトル間隔等が用いられる。また、息継ぎの間隔等も個人によって異なるため、音声特徴検出部８は、息継ぎの間隔に関する情報を上記スペクトル特性と併用しても構わない。この場合、音声特徴検出部８は、声認識率（マッチング率）を百分率で表して、声評価値とすればよい。
この処理により、特定の人の声の認識率に応じて、映像コンテンツにＢＧＭを挿入することができるため、編集後の映像コンテンツをより印象深いものとすることができる。 In the above-described embodiment, the voice feature detection unit 8 may not only detect a voice but also recognize a specific person's voice. In this voice recognition process, for example, the voice feature detection unit 8 analyzes the frequency of a voice signal of a person to be recognized, detects a spectral characteristic, stores it in the recording medium 10 or the like, and detects the detected voice spectrum. It is executed by comparing with characteristics (pattern matching). As the spectral characteristics, spectral peak frequencies, spectral intervals, and the like of the consonant part and the vowel part are used. In addition, since the breath interval and the like vary depending on the individual, the voice feature detection unit 8 may use information regarding the breath interval in combination with the spectral characteristics. In this case, the speech feature detection unit 8 may express the voice recognition rate (matching rate) as a percentage to obtain a voice evaluation value.
By this process, BGM can be inserted into the video content in accordance with the recognition rate of a specific person's voice, so that the edited video content can be made more impressive.

上述の実施形態においては、画像特徴検出部７は、顔評価値が閾値Ｔｆｓ未満で声評価値が閾値Ｔｖｓ以上の場合には、ＢＧＭ音声の重みを高く設定した。しかし、この場合、画像特徴検出部７は、逆にＢＧＭ音声の重みを低く設定してもよい。これにより、撮影対象人物と、撮影者の両方の声を残すことが可能となる。また、上記声の認識が可能な場合、撮影者の声を認識し、顔評価値が閾値Ｔｆｓ未満でも、撮影者の声の声評価値が閾値Ｔｖｓ以上の場合には、ＢＧＭ音声の重みを低く設定してもよい。これにより、撮影者の音声をより確実に効果的に残すことができる。 In the above-described embodiment, the image feature detection unit 7 sets the weight of the BGM sound high when the face evaluation value is less than the threshold value Tfs and the voice evaluation value is equal to or greater than the threshold value Tvs. However, in this case, the image feature detection unit 7 may set the weight of the BGM sound low. Thereby, it becomes possible to leave the voices of both the person to be photographed and the photographer. If the voice can be recognized, the photographer's voice is recognized. If the voice evaluation value of the photographer's voice is equal to or greater than the threshold Tvs even if the face evaluation value is less than the threshold Tfs, the weight of the BGM sound is set. It may be set low. Thereby, a photographer's voice can be left more effectively and reliably.

上述の実施形態においては、記録再生装置１００は、声の検出処理については学習処理を実行しないが、もちろん、学習処理を実行しても構わない。 In the above-described embodiment, the recording / reproducing apparatus 100 does not perform the learning process for the voice detection process, but may naturally execute the learning process.

上述の実施形態においては、本発明を記録再生装置に適用した例を示したが、本発明を、ＰＣ、デジタルビデオカメラ、携帯型ＡＶ機器、携帯電話機、ゲーム機器等の他の電子機器に適用することももちろん可能である。 In the above-described embodiment, an example in which the present invention is applied to a recording / reproducing apparatus has been described. However, the present invention is applied to other electronic devices such as a PC, a digital video camera, a portable AV device, a mobile phone, and a game device. Of course it is also possible to do.

本発明の一実施形態に係る記録再生装置の構成を示すブロック図である。 FIG. 1 is a block diagram showing the configuration of a recording/reproducing apparatus according to an embodiment of the present invention.
本発明の一実施形態における顔画像検出のための学習処理について概念的に示した図である。 FIG. 2 is a diagram conceptually showing the learning process for face image detection in an embodiment of the present invention.
本発明の一実施形態に係る記録再生装置の、映像コンテンツへのＢＧＭ挿入処理の流れを示したフローチャートである。 FIG. 3 is a flowchart showing the flow of the process of inserting BGM into video content in the recording/reproducing apparatus according to an embodiment of the present invention.
本発明の一実施形態における顔画像領域の検出処理について概念的に示した図である。 FIG. 4 is a diagram conceptually showing the face image region detection process in an embodiment of the present invention.
本発明の一実施形態における声の検出処理について概念的に示した図である。 FIG. 5 is a diagram conceptually showing the voice detection process in an embodiment of the present invention.
本発明の一実施形態における重み係数ｋ及びｍの設定処理を示した表である。 FIG. 6 is a table showing the process of setting the weighting factors k and m in an embodiment of the present invention.
本発明の一実施形態における上記顔評価値及び声評価値、重み係数ｋ及びｍ及び映像コンテンツの各フレーム画像との関係を示したグラフである。 FIG. 7 is a graph showing the relationship among the face evaluation value, the voice evaluation value, the weighting factors k and m, and the frame images of the video content in an embodiment of the present invention.
本発明の他の実施形態における、エッジ強度画像を用いた顔認識処理を概念的に示した図である。 FIG. 8 is a diagram conceptually showing the face recognition process using an edge intensity image in another embodiment of the present invention.

Explanation of symbols

１、３…画像信号入力部
２、４…音声信号入力部
５…入力画像処理部
６…入力音声処理部
７…画像特徴検出部
８…音声特徴検出部
９…記録部
１０…記録媒体
１１…再生部
１２…出力画像処理部
１３…出力音声処理部
１４…ユーザインタフェース部
１５…ＣＰＵ
１６…ＲＡＭ
１７…外部音声ソース
１００…記録再生装置

1, 3 ... Image signal input unit
2, 4 ... Audio signal input unit
5 ... Input image processing unit
6 ... Input audio processing unit
7 ... Image feature detection unit
8 ... Voice feature detection unit
9 ... Recording unit
10 ... Recording medium
11 ... Playback unit
12 ... Output image processing unit
13 ... Output audio processing unit
14 ... User interface unit
15 ... CPU
16 ... RAM
17 ... External audio source
100 ... Recording/reproducing apparatus

Claims

An electronic device comprising:
first input means for inputting an image signal and a first audio signal constituting first video content;
second input means for inputting a second audio signal different from the first audio signal;
first calculation means for detecting, from the input image signal, a face image area in which a person's face appears, and calculating a face evaluation value that evaluates the likelihood of the detected face image area;
second calculation means for detecting the voice of the person from the input first audio signal and calculating a voice evaluation value that evaluates the magnitude of the detected voice;
setting means for setting, for each image signal and based on the calculated face evaluation value and voice evaluation value, a first weighting factor indicating the weight of the first audio signal and a second weighting factor indicating the weight of the second audio signal; and
generating means for generating a third audio signal by mixing the first and second audio signals based on the set first and second weighting factors, and generating second video content composed of the third audio signal and the image signal.

The electronic device according to claim 1, wherein
the setting means sets the first weighting factor to a first value greater than the second weighting factor when the face evaluation value is equal to or greater than a first threshold and the voice evaluation value is equal to or greater than a second threshold.

The electronic device according to claim 2, wherein
the setting means sets the first weighting factor to a second value smaller than the second weighting factor when the face evaluation value is less than the first threshold and the voice evaluation value is less than the second threshold.

The electronic device according to claim 3, wherein
the setting means, when the face evaluation value is equal to or greater than the first threshold and the voice evaluation value is less than the second threshold, sets the first weighting factor to be larger than the second weighting factor according to the face evaluation value and the voice evaluation value.

The electronic device according to claim 3, wherein
the setting means, when the face evaluation value is less than the first threshold and the voice evaluation value is equal to or greater than the second threshold, sets the first weighting factor to be smaller than the second weighting factor according to the face evaluation value and the voice evaluation value.

The electronic device according to claim 3,
further comprising storage means for storing facial feature data indicating facial features of a specific person,
wherein the electronic device is capable of detecting a face image area in which the face of the specific person appears based on the stored facial feature data.

The electronic device according to claim 3,
further comprising storage means for storing voice feature data indicating the voice characteristics of a specific person,
wherein the electronic device is capable of detecting the voice of the specific person based on the stored voice feature data.

A video content editing method comprising:
inputting an image signal and a first audio signal constituting first video content;
inputting a second audio signal different from the first audio signal;
detecting, from the input image signal, a face image area in which a person's face appears, and calculating a face evaluation value that evaluates the likelihood of the detected face image area;
detecting the voice of the person from the input first audio signal, and calculating a voice evaluation value that evaluates the magnitude of the detected voice;
setting, for each image signal and based on the calculated face evaluation value and voice evaluation value, a first weighting factor indicating the weight of the first audio signal and a second weighting factor indicating the weight of the second audio signal; and
generating a third audio signal by mixing the first and second audio signals based on the set first and second weighting factors, and generating second video content composed of the third audio signal and the image signal.

A program causing an electronic device to execute the steps of:
inputting an image signal and a first audio signal constituting first video content;
inputting a second audio signal different from the first audio signal;
detecting, from the input image signal, a face image area in which a person's face appears, and calculating a face evaluation value that evaluates the likelihood of the detected face image area;
detecting the voice of the person from the input first audio signal, and calculating a voice evaluation value that evaluates the magnitude of the detected voice;
setting, for each image signal and based on the calculated face evaluation value and voice evaluation value, a first weighting factor indicating the weight of the first audio signal and a second weighting factor indicating the weight of the second audio signal; and
generating a third audio signal by mixing the first and second audio signals based on the set first and second weighting factors, and generating second video content composed of the third audio signal and the image signal.