JP4487909B2

JP4487909B2 - Voice control device and voice control method

Info

Publication number: JP4487909B2
Application number: JP2005335224A
Authority: JP
Inventors: 達也出嶌
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2005-11-21
Filing date: 2005-11-21
Publication date: 2010-06-23
Anticipated expiration: 2025-11-21
Also published as: JP2007140225A

Description

The present invention relates to a voice control device and a voice control method, and more particularly to a voice control device and a voice control method that apply effect processing to an audio signal output from a microphone.

In recent years, systems that combine an electronic musical instrument with a device such as an effector to add echo or reverb to the musical sounds generated by the instrument have become widespread. There have also been proposals to combine an electronic musical instrument with a video camera to give the generated musical sounds a wider range of variation. One example is a music performance device that applies expressive effects to the music being played based on the performer's facial expression. In this proposal, a control signal for controlling the expressive effect applied to the sound data is generated from image data supplied by an imaging means that photographs the performer's face, according to a shape parameter extracted from a region of interest in the face image, for example the shape of the performer's mouth. Specifically, a control signal is generated according to the degree of vertical opening of the performer's mouth, and the cutoff frequency of a low-pass filter through which the sound passes is changed according to that control signal. For example, opening the mouth wide raises the cutoff frequency of the low-pass filter, and closing the mouth lowers it. Alternatively, a control signal is generated according to the width of the performer's mouth opening, and the nonlinearity of an amplifier through which the sound passes is controlled according to that control signal. For example, when the mouth widens, the nonlinearity of the amplifier is changed so that the sound amplitude is clipped, which distorts the sound. Besides the electric guitar described in the embodiment, instruments such as the piano and the synthesizer are envisaged. The document further describes controlling the expressive effect on the sound data by the facial expression not only of a performer but also of a DJ (disc jockey). (See Patent Document 1.)
Patent Document 1: Japanese Patent Laid-Open No. 2002-140066

However, controlling the expressive effects of the music being played based on the performer's facial expression, as in the above patent document, may instead cause musicality to be lost or destroyed. This is because there is no reason to believe that the performer's facial expression correlates with the musical sound being generated. For example, even when a performer is smiling and playing with a bright tone, if the image of the widened mouth causes the sound to be distorted into a harsh tone, the effect is the opposite of what the performer intends. Some performers also open their mouths wide while absorbed in the melody, even when playing a quiet, melancholic piece. In such a case, distorting the sound into a harsh tone in response to the mouth movement ruins the performance.

The expression of the performer's mouth correlates with the generated musical sound only when the performer is singing. Nobody sings a quiet line such as "... wet with the night dew ..." while repeatedly flapping the mouth wide open, and conversely nobody sings a line such as "... here I am in Hakodate ..." crisply and loudly without opening the mouth. In other words, there is a high correlation between the movement of a singer's mouth and the timbre of the singing voice, and exploiting this correlation can be expected to enhance singing ability.
For example, karaoke, which is widely popular, is used less as a contest of singing ability than as an event for socializing, banquets, and business entertainment in groups such as companies. For people who lack confidence in their singing, however, singing in front of everyone is embarrassing and painful. If they could sing with better apparent ability than they actually have, their embarrassment and distress would be eased, and the atmosphere would also improve for the people listening.
The present invention is intended to solve such conventional problems. Its object is to make it possible to enhance singing ability beyond the singer's actual ability by exploiting the correlation between the movement of the singer's mouth and the timbre of the singing voice.

The voice control device according to claim 1 comprises: signal generating means, housed in a microphone, for outputting an audio signal in accordance with input voice (corresponding to the microphone unit 5 in FIGS. 2 and 3 in the embodiment); imaging means, housed in the microphone, for capturing an image of the singer's mouth and outputting an image signal (corresponding to the camera unit 6 in FIGS. 2 and 3 in the embodiment); correlation detection means for detecting the correlation between the audio signal output from the signal generating means and the image signal output by the imaging means (corresponding to the CPU 10 in FIG. 3 in the embodiment); and signal processing means for applying effect processing to the audio signal output from the signal generating means in accordance with the correlation data detected by the correlation detection means (corresponding to the CPU 10 and the DSP control unit 18 in FIG. 3 in the embodiment).

In the voice control device of claim 1, as described in claim 2, the correlation detection means may include difference detection means for detecting a difference relationship between the image signal output from the imaging means and the audio signal output from the signal generating means (corresponding to the CPU 10 in FIG. 3 in the embodiment), and the signal processing means may apply effect processing to the audio signal output from the signal generating means based on the difference relationship data detected by the difference detection means.

In the voice control device of claim 1 or 2, as described in claim 3, the signal processing means may be housed in the microphone.
In the voice control device of claim 1 or 2, as described in claim 4, the signal processing means may be configured to control a feedback component of reverb processing.
In the voice control device of claim 1 or 2, as described in claim 5, the signal processing means may be configured to control the reverb time of reverb processing.
In the voice control device of claim 1 or 2, as described in claim 6, the signal processing means may be configured to control a feedback component of delay processing.
In the voice control device of claim 1 or 2, as described in claim 7, the signal processing means may be configured to control the delay time of delay processing.

The voice control method according to claim 8 executes: step A of detecting an audio signal output from a microphone in accordance with input voice; step B of detecting an image signal output from imaging means that is housed in the microphone and captures an image of the singer's mouth (corresponding to the camera unit 6 in FIGS. 2 and 3 in the embodiment); step C of detecting the correlation between the audio signal detected in step A and the image signal detected in step B; and step D of applying effect processing to the audio signal output from the microphone in accordance with the correlation data detected in step C.
In the embodiment, steps A to D correspond to the processing of the CPU 10 in FIG. 3.

In the voice control method of claim 8, as described in claim 9, step C may include step E of detecting a difference relationship between the audio signal detected in step A and the image signal output in step B, and step D may apply effect processing to the audio signal output from the microphone based on the difference relationship data detected in step E.

In the voice control method of claim 8 or 9, as described in claim 10, step D may perform the effect processing with signal processing means housed in the microphone.
In the voice control method of claim 8 or 9, as described in claim 11, step D may be configured to control a feedback component of reverb processing.
In the voice control method of claim 8 or 9, as described in claim 12, step D may be configured to control the reverb time of reverb processing.
In the voice control method of claim 8 or 9, as described in claim 13, step D may be configured to control a feedback component of delay processing.
In the voice control method of claim 8 or 9, as described in claim 14, step D may be configured to control the delay time of delay processing.

The voice control device and voice control method of the present invention have the effect of enhancing singing ability beyond the singer's actual ability by exploiting the correlation between the movement of the singer's mouth and the timbre of the singing voice.

Hereinafter, a first embodiment and a second embodiment of the voice control device and voice control method according to the present invention will be described in detail with reference to the drawings.
FIG. 1 is an external view of a karaoke microphone 1 common to the embodiments. A microphone cover 2, which is mesh-shaped or has a large number of holes, is attached to it. FIG. 2 shows the internal structure of the microphone 1. A transparent protective cover 3 made of a resin such as polycarbonate or acrylic is fitted to the microphone cover 2 by bonding or another method. A substrate 4 is mounted inside the microphone 1 behind the microphone cover 2. The substrate 4 carries a microphone unit 5, which converts the voice entering through the microphone cover 2 into an electrical signal and outputs an audio signal, and a camera unit 6, which captures an image of the singer's mouth as seen through the transparent protective cover 3. Although not shown in the figure, the microphone unit 5 includes components such as an amplifier circuit that amplifies the audio signal. The camera unit 6 includes an image sensor such as a CCD or CMOS sensor, a drive circuit, an amplifier circuit, an A/D conversion circuit, and the like. The analog audio signal from the microphone unit 5 is output from the microphone 1 through a lead wire 7, and the digital image signal from the camera unit 6 is output from the microphone 1 through a lead wire 8.

FIG. 3 is a block diagram showing the configuration of the karaoke system of the first embodiment, which uses the voice control device according to the present invention. In FIG. 3, a CPU 10 is connected via a system bus 11 to a program ROM 12, a work RAM 13, operation switches 14, a display unit 15, a sound source 16, a song data ROM 17, a DSP (Digital Signal Processor) unit 18, and the camera unit 6 shown in FIG. 2. The CPU 10 controls the entire karaoke system by exchanging data and commands with these units over the system bus 11.

The program ROM 12 stores in advance the voice control processing program executed by the CPU 10, initial data, and the like. It also stores patterns of typical mouth shapes. The work RAM 13 is a work area that temporarily stores data processed by the CPU 10 and provides various registers and flags. The switch unit 14 consists of a group of switches such as a song selection key, a song start key, and a song stop key, and inputs commands and data corresponding to their operation to the CPU 10. The display unit 15 displays the list of karaoke songs, lyrics, and the like. The sound source 16 contains a waveform ROM that stores PCM waveform data and the like, and generates a digital musical tone signal in response to sound generation commands from the CPU 10. The song data ROM 17 stores the musical tone data and lyrics data of the karaoke accompaniments.

Meanwhile, the A/D conversion circuit 20 receives the audio signal from the microphone unit 5 shown in FIG. 2, converts it from analog to digital, and inputs it to the DSP unit 18. The sound source 16 receives the karaoke accompaniment data that the CPU 10 reads from the song data ROM 17, generates the musical tone signal of the accompaniment from waveform data read from its internal waveform ROM, and inputs that signal to the DSP unit 18. The DSP unit 18 processes the audio signal input from the microphone unit 5 via the A/D conversion circuit 20 according to coefficients supplied by the CPU 10, mixes the processed audio signal with the musical tone signal of the accompaniment, and inputs the result to the D/A conversion circuit 21. The D/A conversion circuit 21 converts the mixed signal from the DSP unit 18 from digital to analog and inputs it to the power amplifier 22, which drives the speaker 23.

FIG. 4 is a block diagram showing the internal configuration of the DSP unit 18 in the first embodiment. In FIG. 4, an effector 181 processes the audio signal input from the microphone unit 5 according to the coefficients input from the CPU 10 and inputs the result to a signal synthesis unit 182. The signal synthesis unit 182 mixes the audio signal from the effector 181 with the musical tone signal of the accompaniment from the sound source 16 in FIG. 3 and inputs the mixed signal to the D/A conversion circuit 21 in FIG. 3. The effector 181 consists of a delay unit 183 and a feedback unit 184. The delay unit 183 applies delay processing to the audio signal input from the microphone unit 5 according to the delay coefficient input from the CPU 10 and outputs the result. The feedback unit 184 feeds the delayed audio signal back to the input side of the delay unit 183 according to the feedback coefficient input from the CPU 10. The effect processing in this case is the most popular plate-type reverb, with the commonly used reverb time of 4 seconds. The pre-delay is fixed at 10 ms, which gives a natural feel.
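The delay-plus-feedback structure of the effector 181 can be illustrated with a single feedback delay line. The following Python sketch is only an illustration under stated assumptions: the function name `feedback_delay`, the delay expressed in samples, and the linear feedback gain are hypothetical, and the patent does not disclose the actual DSP implementation or how a plate reverb is built from these elements.

```python
def feedback_delay(samples, delay_samples, feedback):
    """Return the delayed (wet) signal of a single feedback delay line.

    Assumed model: the delay unit outputs its input from `delay_samples`
    ago, and the feedback unit adds `feedback` times that output back
    to the delay unit's input.
    """
    if delay_samples <= 0:
        raise ValueError("delay_samples must be positive")
    line_in = []  # signal entering the delay unit (dry input + feedback)
    wet = []      # signal leaving the delay unit
    for n, x in enumerate(samples):
        delayed = line_in[n - delay_samples] if n >= delay_samples else 0.0
        wet.append(delayed)
        line_in.append(x + feedback * delayed)
    return wet


# An impulse produces echoes that decay by the feedback gain each pass.
echoes = feedback_delay([1, 0, 0, 0, 0, 0, 0], delay_samples=2, feedback=0.5)
```

A larger feedback gain makes the echoes die away more slowly, and a larger delay spaces them further apart, which is why the CPU 10 can deepen or lighten the effect by changing only these two coefficients.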

Next, the voice control processing method of the first embodiment will be described with reference to the flowcharts of the CPU 10 shown in FIGS. 5 to 9 and the other figures.
FIG. 5 is a flowchart of the main routine of the CPU 10, which is common to the embodiments. First, after predetermined initialization (step SA1), song selection processing is performed according to the song selection operation (step SA2), and it is determined whether the song start key has been turned on (step SA3). When the song start key is turned on, a timer is started (step SA4) and a variable N representing the number of times the mouth has opened and closed is set to 0 (step SA5). Next, the selected song data is read from the song data ROM 17 (step SA6) and sent to the sound source 16 (step SA7). Image recognition processing is then executed based on the image signal from the camera unit 6 (step SA8), followed by DSP control processing (step SA9). After this, it is determined whether the song has ended or the stop key of the switch unit 14 has been turned on (step SA10). If the song has not ended and the stop key is not on, the process returns to step SA6 and the song data readout is repeated. If the song has ended or the stop key is on in step SA10, the process returns to step SA2 and the next song is selected according to the operation of the switch unit 14.

FIGS. 6 and 7 are flowcharts of the image recognition processing in the main routine. In FIG. 6, it is determined whether it is the start time of the song, that is, whether the initialization of the main routine (step SA1) has just been performed (step SB1). If so, the first mouth recognition processing is executed (step SB2). FIG. 8 is a flowchart of the mouth recognition processing. An image is captured from the camera unit 6 (step SC1) and converted to a monochrome image (step SC2). The image is then matched against the mouth shape patterns stored in advance in the program ROM 12 (step SC3), and it is determined whether the singer's mouth has been recognized (step SC4). If it has not been recognized, the process returns to step SC1 and an image is captured from the camera unit 6 again. When the singer's mouth is recognized in step SC4, the coordinates a1 to a4 of four points, namely the two ends of the mouth and its top and bottom, are stored in the registers (a1 to a4) of the work RAM 13 (step SC5). FIG. 10 shows the coordinates a1 and a2 of the two ends of the mouth and the coordinates a3 and a4 of the top and bottom of the mouth. The degree of mouth opening can be detected from the coordinates a1 to a4.

After the coordinates a1 to a4 are stored in step SC5 of FIG. 8, the process returns to the flowchart of FIG. 6 and proceeds to step SB3. In step SB3, the stored coordinates a1 to a4 are copied to the registers Fa1 to Fa4. Next, the mouth size D is calculated from a1 to a4 (step SB4). That is, in FIG. 10, the mouth size D is calculated from the width between the two ends of the mouth (a2 - a1) and the vertical gap of the mouth (a4 - a3). The calculated D is then stored in the registers FD and FFD (step SB5). Thus, immediately after initialization, D, FD, and FFD hold the same initial data.
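The text states only that D is calculated from the mouth width (a2 - a1) and the vertical gap (a4 - a3); it does not give the formula. As a minimal sketch under that caveat, the Python function below assumes that D is the area of the quadrilateral spanned by the four points, half the product of width and height. The name `mouth_size` and the (x, y) tuple representation are likewise assumptions.

```python
def mouth_size(a1, a2, a3, a4):
    """Estimate mouth size D from four (x, y) landmark points.

    a1, a2: left and right ends of the mouth.
    a3, a4: top and bottom of the mouth.
    Assumption: D is the area of the quadrilateral whose diagonals are
    the mouth width and the mouth height, i.e. width * height / 2.
    """
    width = a2[0] - a1[0]
    height = a4[1] - a3[1]
    return abs(width * height) / 2.0


# Example: a mouth 40 pixels wide and 20 pixels tall.
d = mouth_size((10, 50), (50, 50), (30, 40), (30, 60))
```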

In this way, immediately after the initialization of the main routine, steps SB2 to SB5 are executed and the calculated D is stored in FD and FFD. On subsequent passes, step SB1 determines that it is not the start time of the song (not the first mouth recognition), so the process moves from the NO branch of step SB1 to step SB6, where the second and later mouth recognition processing shown in FIG. 8 is executed. This mouth recognition processing stores new coordinates a1 and a2 of the two ends of the mouth and a3 and a4 of the top and bottom of the mouth in the work RAM 13. As a result, Fa1 to Fa4 hold the previous coordinates of the two ends and of the top and bottom of the mouth, and a1 to a4 hold the current ones. Next, the differences between a1 to a4 and Fa1 to Fa4, that is, the differences in the horizontal x coordinate and the vertical y coordinate, are obtained as follows (step SB7), where (xi, yi) are the coordinates of ai and (xFi, yFi) are the coordinates of Fai.
Δx1 = x1 - xF1, Δy1 = y1 - yF1
Δx2 = x2 - xF2, Δy2 = y2 - yF2
Δx3 = x3 - xF3, Δy3 = y3 - yF3
Δx4 = x4 - xF4, Δy4 = y4 - yF4
Then a1 to a4 are stored in Fa1 to Fa4 (step SB8).

Next, the same-direction movement component is calculated from the differences obtained (step SB9). The same-direction movement component can be judged from whether the x coordinate of the midpoint of the two points at the ends of the mouth and the y coordinate of the midpoint of the two points at the top and bottom of the mouth have moved. The same-direction movement components Δx and Δy are therefore calculated as follows.
Δx = (Δx1 + Δx2) / 2 - (ΔFx1 + ΔFx2) / 2
Δy = (Δy1 + Δy2) / 2 - (ΔFy1 + ΔFy2) / 2
Next, the same-direction movement components (Δx, Δy) are subtracted from the differences for a1 to a4 as follows (step SB10).
Δx1' = Δx1 - Δx, Δy1' = Δy1 - Δy
Δx2' = Δx2 - Δx, Δy2' = Δy2 - Δy
Δx3' = Δx3 - Δx, Δy3' = Δy3 - Δy
Δx4' = Δx4 - Δx, Δy4' = Δy4 - Δy
Because the image is enlarged or reduced depending on the distance between the mouth and the microphone, the coordinates of the four points change with that distance. To allow for this, after mouth recognition the area enclosed by the dotted line in FIG. 10, that is, the area of the mouth, is calculated and the coordinate data is normalized. Such normalization is well known in general image processing, so a detailed description of the calculation is omitted.
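Steps SB7, SB9, and SB10 can be sketched in Python as follows. This is an illustration under assumptions rather than the disclosed implementation: the function names are hypothetical, points are (x, y) tuples ordered a1 to a4, and ΔFx1, ΔFx2, ΔFy1, ΔFy2 are interpreted as the differences obtained on the previous pass, which the text does not define explicitly. The formulas are otherwise followed as written, including the use of points 1 and 2 for both Δx and Δy.

```python
def point_deltas(current, previous):
    """Step SB7: per-point differences between current and previous coordinates."""
    return [(cx - px, cy - py) for (cx, cy), (px, py) in zip(current, previous)]


def common_motion(deltas, prev_deltas):
    """Step SB9: same-direction movement component (dx, dy), as written in the text.

    prev_deltas is assumed to hold the differences from the previous pass.
    """
    dx = (deltas[0][0] + deltas[1][0]) / 2 - (prev_deltas[0][0] + prev_deltas[1][0]) / 2
    dy = (deltas[0][1] + deltas[1][1]) / 2 - (prev_deltas[0][1] + prev_deltas[1][1]) / 2
    return dx, dy


def remove_common_motion(deltas, common):
    """Step SB10: subtract the same-direction component from every difference."""
    dx, dy = common
    return [(x - dx, y - dy) for x, y in deltas]


# Example: the whole mouth shifts by (3, 2) without changing shape.
previous = [(10, 50), (50, 50), (30, 40), (30, 60)]
current = [(13, 52), (53, 52), (33, 42), (33, 62)]
deltas = point_deltas(current, previous)
common = common_motion(deltas, [(0, 0)] * 4)
residual = remove_common_motion(deltas, common)
```

In this example every residual difference is zero, so a pure translation of the mouth in the image is not mistaken for the mouth opening or closing.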

Next, the mouth size D is calculated from a1 to a4 (step SB11), and it is determined whether D is larger than a predetermined value (step SB12). The predetermined value is the upper limit of mouth opening for a person who is not singing, so when D is larger than the predetermined value, the person is judged to be singing. In that case, it is determined whether the previous mouth size FD is greater than or equal to the mouth size before that, FFD (step SB13). If FD is greater than or equal to FFD, it is determined whether FD is larger than the current mouth size D (step SB14). For example, as shown in FIG. 11, when the mouth size FFD in state (A) changes to a larger FD in state (B) and then to a smaller D in state (C), FD is a local maximum in the transition from FFD to D. In other words, the mouth has opened wider and then become smaller again. In this case, the variable N representing the number of mouth openings and closings is incremented (step SB15).

After N is incremented, or when FD is less than or equal to D in step SB14, or when FD is smaller than FFD in step SB13 (that is, when FD is not a local maximum and the mouth has not opened and closed), or when D is less than or equal to the predetermined value in step SB12, the value of FD is stored in FFD (step SB16) and the value of D is stored in FD (step SB17). The process then returns to the main routine of FIG. 5.
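The counting logic of steps SB12 to SB17 amounts to detecting a local maximum in the sequence of mouth sizes. The Python sketch below follows the comparisons described in the text. The class name `MouthCounter`, the threshold value, and the sample sizes are illustrative assumptions.

```python
class MouthCounter:
    """Counts mouth open/close events as local maxima of the mouth size."""

    def __init__(self, initial_size, threshold):
        self.threshold = threshold  # upper limit of D when not singing
        self.fd = initial_size      # previous mouth size (FD)
        self.ffd = initial_size     # mouth size before that (FFD)
        self.n = 0                  # open/close count (N)

    def update(self, d):
        """Process the current mouth size D and return the updated count N."""
        # SB12: singing only if D exceeds the threshold.
        # SB13: FD >= FFD.  SB14: FD > D.  All three mean FD was a peak.
        if d > self.threshold and self.fd >= self.ffd and self.fd > d:
            self.n += 1             # SB15
        self.ffd = self.fd          # SB16
        self.fd = d                 # SB17
        return self.n


counter = MouthCounter(initial_size=20, threshold=10)
for size in [30, 25, 40, 35, 5, 50, 8]:
    count = counter.update(size)
```

With these sample values the peaks at 30 and 40 are counted. The peak at 50 is not, because the size that follows it (8) is below the threshold and step SB12 skips the peak test.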

Instead of the calculations of steps SB7 to SB11, or in combination with them, the ratios FFV/FFH, FV/FH, and V/H between the vertical distances FFV, FV, V and the horizontal widths FFH, FH, H of the mouth shown in FIG. 11 may be calculated. These ratios may be used to determine whether the mouth has opened and closed, or to correct for translation of the mouth and for enlargement or reduction of the image caused by the distance between the mouth and the microphone.
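The appeal of the V/H ratio is that it does not change when the image is scaled or shifted. The short Python sketch below illustrates this property. The function name `mouth_aspect_ratio` and the (x, y) tuple representation are assumptions, not details given in the text.

```python
def mouth_aspect_ratio(a1, a2, a3, a4):
    """Return V / H: vertical mouth opening divided by horizontal mouth width.

    a1, a2: left and right ends of the mouth; a3, a4: top and bottom.
    """
    h = a2[0] - a1[0]
    v = a4[1] - a3[1]
    if h == 0:
        raise ValueError("mouth width must be non-zero")
    return v / h


near = mouth_aspect_ratio((10, 50), (50, 50), (30, 40), (30, 60))
# The same mouth seen at twice the magnification gives the same ratio.
far = mouth_aspect_ratio((20, 100), (100, 100), (60, 80), (60, 120))
```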

FIG. 9 is a flowchart of the DSP control processing of the first embodiment in step SA9 of the main routine. It is determined whether the timer has counted a fixed time, which is the effect processing interval (step SD1). If the fixed time has not elapsed, the process returns to the main routine. When the fixed time has elapsed, a send coefficient is generated based on the value of N (step SD2). The send coefficient is a parameter that determines the feedback component (such as its amount) in the feedback unit 184 of the DSP unit 18 in FIG. 4. A time coefficient is also generated based on the value of N (step SD3). The time coefficient is a parameter that determines the delay time in the delay unit 183 of the DSP unit 18. Next, the generated send coefficient and time coefficient are supplied to the DSP unit 18 (step SD4). After this, N is reset to its initial value of 0 (step SD5), and the timer is cleared and restarted (step SD6). The process then returns to the main routine of FIG. 5.
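The control flow of steps SD1 to SD6 can be sketched as a single Python function. The function name `dsp_control`, its arguments, and the two mapping functions passed in are hypothetical; the text specifies only that both coefficients are generated from N at a fixed interval, after which N and the timer are reset.

```python
def dsp_control(elapsed_s, interval_s, n, send_from_n, time_from_n):
    """One pass of the DSP control routine.

    Returns (coefficients, new_n, new_elapsed_s). `coefficients` is None
    when the interval has not yet elapsed (SD1), otherwise a
    (send, time) pair computed from N (SD2 to SD4).
    """
    if elapsed_s < interval_s:               # SD1: interval not reached
        return None, n, elapsed_s
    coefficients = (send_from_n(n), time_from_n(n))  # SD2, SD3
    return coefficients, 0, 0.0              # SD5: N = 0, SD6: timer restarted


# Hypothetical mappings, chosen only for illustration.
early = dsp_control(0.5, 2.0, 3, lambda n: 1.0 / (1 + n), lambda n: 100 * n)
due = dsp_control(2.0, 2.0, 3, lambda n: 1.0 / (1 + n), lambda n: 100 * n)
```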

FIG. 12 shows the characteristic of the coefficient Cdsp supplied to the DSP unit 18 as a function of the mouth opening/closing count N. The coefficient Cdsp comprises a delay coefficient Cdsp (delaytime) for the delay unit 183 of the DSP unit 18 and a feedback coefficient Cdsp (send). As FIG. 12 shows, both coefficients increase as N decreases, that is, as the mouth opens and closes less frequently. Consequently, when a singer performs a moody song with little mouth movement, the voice is output with deep reverb and echo, which makes the singing sound better than it actually is. Conversely, singers who open their mouths frequently and articulate crisply tend to be skilled already, so reverb and echo are cut and the audio signal is output as is, letting that skill come through.
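To illustrate why larger delay and feedback coefficients give a deeper echo, a generic single feedback delay line is sketched below. This is an assumption for illustration only: the actual topology of the delay unit 183 and feedback unit 184 is defined by FIG. 4, and the function name and parameters here are hypothetical.

```python
from typing import List, Sequence


def feedback_delay(x: Sequence[float], delay_samples: int, send: float) -> List[float]:
    """Generic feedback delay: y[i] = x[i] + send * y[i - delay_samples].

    delay_samples plays the role of the time coefficient and send the role
    of the feedback coefficient (linear gain, kept below 1 for stability).
    """
    if delay_samples < 1:
        raise ValueError("delay_samples must be at least 1")
    if not 0.0 <= send < 1.0:
        raise ValueError("send must be in [0, 1)")
    y: List[float] = []
    for i, sample in enumerate(x):
        fed_back = y[i - delay_samples] if i >= delay_samples else 0.0
        y.append(sample + send * fed_back)
    return y
```

For a unit impulse with a delay of 2 samples and a send of 0.5, the output repeats at 0.5 and then 0.25 of the original level. A larger send makes the repeats decay more slowly, and a longer delay spaces them further apart.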

For example, with a plate-type reverb having a reverb time of 4 seconds and a pre-delay time of 10 ms, no reverb is applied when the mouth opens and closes frequently and N is large. When the opening/closing frequency is medium, the send amount (amount of change) is set to -10 dB, and when the frequency is low and N is small, the send amount is set to -3 dB, so that the reverb deepens gradually as the opening/closing frequency falls.
Alternatively, with the send amount held constant at -5 dB regardless of the opening/closing frequency, the reverb time is set to 0.5 seconds when the frequency is high, 2.8 seconds when it is medium, and 4.8 seconds when it is low, so that the reverb deepens gradually as the frequency falls.
Alternatively, with the delay time fixed at 150 ms, no reverb is applied when the opening/closing frequency is high, the send amount is set to -10 dB when the frequency is medium, and the send amount is set to -3 dB when the frequency is low and N is small, so that the reverb deepens gradually.
Alternatively, with the send amount fixed at -5 dB, the delay time is set to 5 ms when the opening/closing frequency is high, 50 ms when it is medium, and 500 ms when it is low, so that the reverb deepens gradually.
In addition, the delay time may be set according to the average interval from one maximum of mouth opening to the next, as in FIG. 11(B), that is, according to the singing tempo, thereby realizing the delay time referred to in music-industry jargon as "plate time".
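The first of the examples above (no reverb, -10 dB, -3 dB) may be sketched as a three-tier lookup. The send levels are taken from the text; the thresholds on N that separate low, medium, and high frequency are hypothetical, since the text gives no numeric boundaries. The dB values are converted to linear amplitude gain with the standard relation gain = 10^(dB/20).

```python
from typing import Dict, Optional

# Send level per tier, from the first example in the text.
# None stands for "no reverb applied".
SEND_DB: Dict[str, Optional[float]] = {"high": None, "medium": -10.0, "low": -3.0}


def frequency_tier(n: int, low_max: int = 2, medium_max: int = 5) -> str:
    """Classify the open/close count N (thresholds are assumed values)."""
    if n <= low_max:
        return "low"
    if n <= medium_max:
        return "medium"
    return "high"


def send_gain(n: int) -> float:
    """Linear send gain for count N; 0.0 means the reverb is bypassed."""
    level_db = SEND_DB[frequency_tier(n)]
    return 0.0 if level_db is None else 10.0 ** (level_db / 20.0)
```

The other examples in the text follow the same pattern with reverb time or delay time in place of the send level.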

As described above, the karaoke system of the first embodiment includes a microphone unit 5, housed in the microphone 1, that outputs an audio signal according to the input voice, and a camera unit 6, also housed in the microphone 1, that captures video of the singer's mouth and outputs an image signal. The CPU 10 detects the difference relationship between the audio signal output from the microphone unit 5 and the image signal output by the camera unit 6, supplies a control signal to the DSP unit 18 according to the difference-relationship data, and thereby applies effect processing to the audio signal output from the microphone unit 5.
Therefore, by exploiting the correlation between the singer's mouth movement and the timbre of the singing voice, the singing can be made to sound better than it actually is.

Next, a second embodiment of the present invention will be described.
FIG. 13 is a block diagram showing the configuration of a karaoke system in the second embodiment using the voice control device according to the present invention. In FIG. 13, the CPU 10 is connected via the system bus 11 to the program ROM 12, the work RAM 13, the switch unit 14, the display unit 15, the sound source 16, the song data ROM 17, the A/D conversion circuit 20, and the camera unit 6 shown in FIG. 2. The CPU 10 controls the entire karaoke system by exchanging data and commands with these units over the system bus 11. The microphone unit 5 shown in FIG. 2 inputs an audio signal to the A/D conversion circuit 20, which converts the signal from analog to digital and inputs it to the DSP unit 18. The internal configuration of the DSP unit 18 is the same as in the first embodiment, shown in FIG. 4.
Thus the karaoke system of the second embodiment is almost identical in configuration to that of the first embodiment, except that in the second embodiment the output of the A/D conversion circuit 20 is connected to the system bus 11. As described below, the CPU 10 takes in the audio signal obtained from the A/D conversion circuit 20 and controls the DSP unit 18.

FIG. 14 is a flowchart of the main routine of the CPU 10 in the second embodiment. After predetermined initialization (step SG1), song selection processing is performed in response to a song selection operation (step SG2), and it is determined whether the song start key has been turned on (step SG3). When the song start key is turned on, the timer is started (step SG4). Next, the register FFE, which stores the envelope value of the audio signal from two iterations earlier, and the register FE, which stores the value from the previous iteration, are both cleared to 0 (step SG5). The variable N, which counts mouth openings and closings, and the variable M, which counts peaks (local maxima) of the audio signal envelope, are both set to 0 (step SG6). The selected song data is then read from the song data ROM 17 (step SG7) and sent to the sound source 16 (step SG8). Next, image recognition processing is executed on the image signal from the camera unit 6 (step SG9), microphone input control processing is executed on the audio signal obtained from the microphone unit 5 via the A/D conversion circuit 20 (step SG10), and DSP control processing is executed (step SG11). It is then determined whether the song has ended or the stop key has been turned on (step SG12). If the song has not ended and the stop key is not on, the process returns to step SG7 and continues reading song data. If in step SG12 the song has ended or the stop key is on, the process returns to step SG2, and the next song is selected according to the operation of the switch unit 14.
In this main routine, the image processing in step SG9 and the mouth recognition processing within it are the same as the image processing of the first embodiment shown in FIGS. 6 and 7 and the mouth recognition processing of the first embodiment shown in FIG. 8.

FIG. 15 is a flowchart of the microphone input control process in step SG10. The envelope of the audio signal output from the A/D conversion circuit 20 is extracted (step SH1), and the envelope value is stored in the register E (step SH2). It is then determined whether the envelope value stored in E is larger than a predetermined value (step SH3). The predetermined value is the upper limit at which the singer is judged not to be singing into the microphone 1. If the envelope value in E is larger than the predetermined value, that is, if the singer is judged to be singing into the microphone 1, it is determined whether the previous envelope value stored in FE is greater than or equal to the envelope value from two iterations earlier stored in FFE (step SH4). If the FE value is greater than or equal to the FFE value, it is further determined whether the FE value is larger than the current envelope value stored in E (step SH5). Immediately after the initialization in step SG1 of FIG. 14, FE and FFE have both been initialized to 0 in step SG5, so step SH6 is skipped and the process proceeds to step SH7. Once the microphone input control process has run twice, however, FFE holds the envelope value from two iterations earlier and FE holds the previous envelope value.

In the third pass of the microphone input control process, suppose the FE value is larger than the E value. For example, in the envelope transition shown in FIG. 17, the envelope stays above the predetermined value, rises from the FFE value to the FE value, and then falls to the E value. In that case the envelope value in FE is a local maximum, and the singer can be judged to have opened the mouth and produced a sound. The value of M, which counts envelope peaks, is therefore incremented (step SH6). After this, or if the FE value is not larger than the E value in step SH5, or if the FE value is smaller than the FFE value in step SH4, or if the E value is less than or equal to the predetermined value in step SH3, the process proceeds to step SH7, where the FE value is stored in FFE, and then the E value is stored in FE (step SH8). The process then returns to the main routine of FIG. 14.
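The peak-counting logic of steps SH3 to SH8 may be sketched as follows. The class and attribute names are hypothetical, the threshold is an assumed value, and the envelope values are taken as given because the text does not describe how step SH1 extracts them.

```python
class EnvelopePeakCounter:
    """Counts local maxima of an envelope using three registers (FFE, FE, E)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # "predetermined value" of step SH3 (assumed)
        self.ffe = 0.0              # envelope value two iterations earlier
        self.fe = 0.0               # previous envelope value
        self.m = 0                  # number of envelope peaks, M

    def update(self, e: float) -> None:
        """Process one new envelope value E."""
        # SH3: singer judged to be singing; SH4: FE >= FFE; SH5: FE > E
        if e > self.threshold and self.fe >= self.ffe and self.fe > e:
            self.m += 1             # SH6: FE was a local maximum
        self.ffe = self.fe          # SH7
        self.fe = e                 # SH8
```

Feeding the values 0.2, 0.5, 0.3, 0.4, 0.6, 0.2, 0.05 with a threshold of 0.1 counts two peaks (at 0.5 and 0.6). Note that, as in the flowchart, the threshold is tested against the current value E, so a peak that is followed immediately by a value at or below the threshold is not counted.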

FIG. 16 is a flowchart of the DSP control process in step SG11 of the main routine. The process first determines whether the timer has counted the fixed time that serves as the effect-processing interval (step SJ1). If it has not, the process returns to the main routine. If it has, the absolute value of the difference between M and N is calculated and stored in the register α (step SJ2). A send coefficient is then generated from α (step SJ3). As in the first embodiment, the send coefficient is a parameter that determines the feedback component (for example, its amount) in the feedback unit 184 of the DSP unit 18 in FIG. 4. A time coefficient is also generated from α (step SJ4). As in the first embodiment, the time coefficient is a parameter that determines the delay time in the delay unit 183 of the DSP unit 18. The generated send and time coefficients are supplied to the DSP unit 18 (step SJ5). M and N are then reset to their initial value of 0 (step SJ6), and the timer is cleared and restarted (step SJ7). The process then returns to the main routine of FIG. 14.

FIG. 18 shows the characteristic of the coefficient Cdsp supplied to the DSP unit 18 as a function of α, the absolute value of the difference between the number M of envelope peaks in the audio signal and the mouth opening/closing count N. As in the first embodiment, Cdsp comprises a delay coefficient Cdsp (delaytime) for the delay unit 183 of the DSP unit 18 and a feedback coefficient Cdsp (send). As FIG. 18 shows, both coefficients increase as α increases. Consequently, when a singer performs a moody song with little mouth movement and envelope peaks are detected in the audio signal, the voice is output with reverb and echo applied, which makes the singing sound better than it actually is. Conversely, singers who open their mouths frequently and articulate crisply tend to be skilled already, so reverb and echo are cut and the audio signal is output as is, letting that skill come through. In addition, when the singer vocalizes with the mouth closed during an interlude, as in humming or shouting, or lingers at the ending of a song with the mouth wide open and the voice lowered, effect processing that deepens the reverb and heightens the mood is performed.
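The relation between M, N, α, and the coefficient may be sketched as follows. The text states only that both coefficients increase with α; the linear shape, the saturation point `alpha_max`, and the ceiling `c_max` are assumptions made for illustration, as the actual curve is defined by FIG. 18.

```python
def alpha(m: int, n: int) -> int:
    """Step SJ2: absolute difference between envelope peaks M and mouth count N."""
    return abs(m - n)


def cdsp_from_alpha(a: int, alpha_max: int = 8, c_max: float = 1.0) -> float:
    """Coefficient that rises with alpha (assumed linear, clamped at c_max)."""
    if alpha_max <= 0:
        raise ValueError("alpha_max must be positive")
    clamped = max(0, min(a, alpha_max))
    return c_max * clamped / alpha_max
```

When mouth movement and voice agree (M equal to N), α is 0 and the effect is at its minimum. Humming with a closed mouth gives many envelope peaks and few mouth openings, so α and the coefficient are large.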

Specifically, the three cases of high, medium, and low mouth opening/closing frequency can each be divided by whether the envelope of the audio signal is larger than the predetermined value (voice present) or less than or equal to it (voice absent). The control objectives of the effect processing for the resulting six singing states are as follows.
(1) <High frequency, voice present: singing at normal tempo> Moderate effect processing.
(2) <High frequency, voice absent> No voice is being produced, so effect processing cuts unnecessary portions so that the accompaniment alone is heard clearly.
(3) <Medium frequency, voice present: singing at slow tempo> Effect processing that deepens the reverberation and heightens the atmosphere.
(4) <Medium frequency, voice absent> No voice is being produced, so effect processing cuts unnecessary portions so that the accompaniment alone is heard clearly.
(5) <Low frequency, voice present: humming or shouting> Effect processing that stretches the sound out slowly and at length.
(6) <Low frequency, voice absent> No voice is being produced, so effect processing cuts unnecessary portions so that the accompaniment alone is heard clearly.
Examples of send (reverb, delay), reverb time, and delay time for these six singing states are shown in FIG. 19.
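The six-state classification above may be sketched as a small lookup. The goal labels are shorthand invented here for the objectives listed in the text. No numeric parameters are included, because the concrete values appear only in FIG. 19.

```python
# Effect goal when a voice is present, keyed by mouth open/close frequency.
VOICED_GOAL = {
    "high": "moderate",   # state (1): normal-tempo singing
    "medium": "deepen",   # state (3): slow-tempo singing
    "low": "sustain",     # state (5): humming or shouting
}


def effect_goal(frequency: str, voice_present: bool) -> str:
    """Return the control objective for one of the six singing states."""
    if frequency not in VOICED_GOAL:
        raise ValueError(f"unknown frequency tier: {frequency!r}")
    if not voice_present:
        return "cut"      # states (2), (4), (6): let the accompaniment through
    return VOICED_GOAL[frequency]
```

The three voice-absent states share one objective, so only four distinct behaviours result from the six states.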

As described above, the karaoke system of the second embodiment includes a microphone unit 5, housed in the microphone 1, that outputs an audio signal according to the input voice, and a camera unit 6, also housed in the microphone 1, that captures video of the singer's mouth and outputs an image signal. The CPU 10 detects the difference relationship between the envelope of the audio signal output from the microphone unit 5 and the image signal output by the camera unit 6, supplies a control signal to the DSP unit 18 according to the difference-relationship data, and thereby applies effect processing to the audio signal output from the microphone unit 5.
Therefore, by controlling the effect according to the difference between the singer's voice and the singer's mouth movement, the correlation between mouth movement and vocal timbre is exploited to make the singing sound better than it actually is.

In the first and second embodiments above, the correlation between the audio signal output from the microphone unit 5 and the image signal output by the camera unit 6 is taken to be the difference relationship between the two, but the correlation is not limited to a difference relationship.
The correlation between the height-to-width ratio of the mouth shown in FIG. 11 and the waveform data of the audio signal may instead be detected, and control data corresponding to that correlation may be supplied to the DSP unit 18 to apply effect processing to the audio signal output from the microphone unit 5. For example, when the height/width ratio is small and the frequency of the waveform data fluctuates, the singer is presumably singing with a trembling voice without opening the mouth much, so effect processing that emphasizes vibrato is applied. When the height/width ratio is large and the frequency of the waveform data is high, the singer is presumably straining to reach a high note, so effect processing that emphasizes the treble is applied.
Alternatively, mouth-shape patterns of the image signal corresponding to each of a plurality of types of audio waveform data, together with control-signal data corresponding to each pattern, may be stored in advance. The control-signal data corresponding to the image signal output by the camera unit 6 is then read out and supplied as a control signal to the DSP unit 18 to apply effect processing to the audio signal output from the microphone unit 5.
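The vibrato and treble examples given for the height/width ratio may be sketched as a simple rule. Every threshold below is hypothetical, as are the function name and the idea of summarizing pitch by a mean and a spread; the text says only "small" or "large" ratio and "fluctuating" or "high" frequency.

```python
from typing import Optional


def choose_effect(aspect_ratio: float, pitch_hz: float, pitch_spread_hz: float,
                  ratio_small: float = 0.3, ratio_large: float = 0.6,
                  high_pitch_hz: float = 400.0,
                  spread_min_hz: float = 5.0) -> Optional[str]:
    """Pick an effect from mouth shape and pitch (all thresholds assumed)."""
    # Mouth barely open while the pitch wavers: emphasize vibrato.
    if aspect_ratio < ratio_small and pitch_spread_hz > spread_min_hz:
        return "vibrato"
    # Mouth wide open on a high note: emphasize the treble.
    if aspect_ratio > ratio_large and pitch_hz > high_pitch_hz:
        return "treble"
    return None  # neither condition described in the text applies
```

The stored-pattern variant described last could replace these rules with a table keyed by mouth-shape pattern, at the cost of preparing that table in advance.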

FIG. 1 is an external view of the microphone in each embodiment of the present invention.
FIG. 2 is a diagram showing the internal structure of the microphone of FIG. 1.
FIG. 3 is a block diagram showing the configuration of the karaoke system of the first embodiment.
FIG. 4 is a block diagram showing the internal configuration of the DSP unit in FIG. 1.
FIG. 5 is a flowchart of the main routine of the CPU in the first embodiment.
FIG. 6 is a flowchart of the image recognition processing in FIG. 5.
FIG. 7 is a flowchart of the image recognition processing continued from FIG. 6.
FIG. 8 is a flowchart of the mouth recognition processing in FIG. 6.
FIG. 9 is a flowchart of the DSP control processing in FIG. 5.
FIG. 10 is a diagram showing the coordinates of the four points at both ends and the top and bottom of the mouth in the first embodiment.
FIG. 11 is a diagram showing the transition of mouth opening and closing in the first embodiment.
FIG. 12 is a diagram showing the characteristic of the coefficient for the DSP unit with respect to the mouth opening/closing count in the first embodiment.
FIG. 13 is a block diagram showing the configuration of the karaoke system of the second embodiment.
FIG. 14 is a flowchart of the main routine of the CPU in the second embodiment.
FIG. 15 is a flowchart of the microphone input control processing in FIG. 14.
FIG. 16 is a flowchart of the DSP control processing in FIG. 14.
FIG. 17 is a diagram showing the envelope of the audio signal in the second embodiment.
FIG. 18 is a diagram showing the characteristic of the coefficient for the DSP unit with respect to the difference between the mouth opening/closing count and the envelope of the audio signal in the second embodiment.
FIG. 19 is a diagram showing specific examples of the coefficients for the DSP unit with respect to the state of the mouth in the second embodiment.

Explanation of symbols

1 Microphone
2 Microphone cover
3 Transparent protective cover
4 Substrate
5 Microphone unit
6 Camera unit
10 CPU
12 Program ROM
13 Work RAM
16 Sound source
17 Song data ROM
18 DSP unit
20 A/D conversion circuit
181 Effector
182 Signal synthesis unit
183 Delay unit
184 Feedback unit

Claims

A voice control device comprising:
signal generating means, housed in a microphone, for outputting an audio signal according to input voice;
imaging means, housed in the microphone, for imaging a singer's mouth and outputting an image signal;
correlation detecting means for detecting a correlation between the audio signal output from the signal generating means and the image signal output by the imaging means; and
signal processing means for applying effect processing to the audio signal output from the signal generating means according to the correlation data detected by the correlation detecting means.

The voice control device according to claim 1, wherein the correlation detecting means includes difference detecting means for detecting a difference relationship between the image signal output from the imaging means and the audio signal output from the signal generating means, and the signal processing means applies effect processing to the audio signal output from the signal generating means based on the difference-relationship data detected by the difference detecting means.

The voice control device according to claim 1, wherein the signal processing means is housed in the microphone.

The voice control device according to claim 1, wherein the signal processing means controls a feedback component of reverb processing.

The voice control device according to claim 1, wherein the signal processing means controls a reverb time of reverb processing.

The voice control device according to claim 1, wherein the signal processing means controls a feedback component of delay processing.

The voice control device according to claim 1, wherein the signal processing means controls a delay time of delay processing.

A voice control method comprising:
a step A of detecting an audio signal output from a microphone according to input voice;
a step B of detecting an image signal output from imaging means that is housed in the microphone and images a singer's mouth;
a step C of detecting a correlation between the audio signal detected in step A and the image signal detected in step B; and
a step D of applying effect processing to the audio signal output from the microphone according to the correlation data detected in step C.

The voice control method according to claim 8, wherein step C includes a step E of detecting a difference relationship between the audio signal detected in step A and the image signal detected in step B, and step D applies effect processing to the audio signal output from the microphone based on the difference-relationship data detected in step E.

The voice control method according to claim 8 or 9, wherein in step D, effect processing is performed by signal processing means accommodated in the microphone.

The voice control method according to claim 8 or 9, wherein the step D controls a feedback component of reverberation processing.

The voice control method according to claim 8 or 9, wherein the step D controls a reverb time of a reverb process.

The voice control method according to claim 8 or 9, wherein the step D controls a feedback component of delay processing.

The voice control method according to claim 8 or 9, wherein the step D controls a delay time of delay processing.