JP3346799B2

JP3346799B2 - Sign language interpreter

Info

Publication number: JP3346799B2
Application number: JP24728592A
Authority: JP
Inventors: 裕酒匂; 浩彦佐川; 正博阿部; 熹市川; 潔井上; 清志新井; 隆則志村; 裕二戸田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1992-08-24
Filing date: 1992-08-24
Publication date: 2002-11-18
Anticipated expiration: 2017-11-18
Also published as: JPH0667601A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Field of Industrial Application] The present invention relates to a sign language interpretation apparatus and a sign language interpretation system that recognize sign language, convert it into another form of expression such as text, speech, or a different type of sign language, and convey the information.

【０００２】[0002]

[Prior Art] As related prior art, (Known Example 1) Japanese Patent Application Laid-Open No. H2-144675, "Hand Motion Recognition Device and Sign Language Conversion Device," and (Known Example 2) Japanese Patent Application Laid-Open No. H3-186979, "Hand Posture Recognition Method Using a Neurocomputer," have been proposed. In Known Example 1, the positional relationship of the fingers is obtained by color recognition using colored gloves, and the input finger character (a fingerspelled character) is recognized by matching that relationship against the finger position relationships of several finger characters registered in advance. In Known Example 2, a neural network learns the correspondence between static finger shapes, captured with a well-known data glove, and the meanings of the finger characters. At recognition time, static finger shape data from the data glove is input to the neural network, and its output is taken as the meaning of the finger character.

【０００３】[0003]

[Problems to Be Solved by the Invention] Because the prior art described above basically recognizes static finger characters, it cannot recognize natural sign language conversation, which requires recognizing the movements of the fingers and hands (sign language patterns), for the reasons given below. In the following, a means of communication that uses the static finger shape of one hand is called a finger character, and one that uses dynamic movements of the fingers and hands of both hands is called sign language; the two are described separately. (1) Known Example 1 encodes finger states of a finger character, such as extended, bent, or indeterminate, as symbols. The temporal movement of fingers and hands in sign language, however, has many degrees of freedom, so symbolizing it is generally difficult, and the approach is in principle hard to apply to sign language recognition. (2) Known Example 2 uses neural network learning, but applying it to dynamic sign language runs into the difficult problems of segmenting sign language words and of temporal variation in those words, so it cannot in principle be applied to sign language recognition. Known Example 1 also contains a description aimed at matching words in sign language, but as stated there, a pause must be inserted between words. This burdens the user and hinders natural sign language conversation.

【０００４】[0004]

Sign language interpretation itself poses the following additional problems. (3) Finger shape, hand position, and movement in sign language differ from person to person. Experience from speech recognition shows that speaker-dependent recognition is easier than speaker-independent recognition. Finger characters number fewer than 50, so it is feasible for each user to register personal finger characters, or to train the weight coefficients of a personal neural network, before use. Sign language, however, has a basic vocabulary of at least 1000 words, so such registration or training for personal adaptation is impractical in terms of time. (4) Users often want to refer back to, or keep a record of, past conversations conducted in sign language. These functions have not yet been realized. (5) Sign language has few words that convey emotion, so emotion is expressed through facial expression and the size of gestures. A hearing person, however, generally concentrates on the signs themselves and often fails to read the facial expression or the size of the gestures. A sign language interpretation apparatus therefore needs a function for conveying emotion in order to achieve natural conversation. The object of the present invention is to solve the problems of the prior art described above and these other problems, and to provide a sign language interpretation apparatus and a sign language interpretation system that can recognize dynamic sign language, which has a large vocabulary and individual differences, and convey it to the other party using various forms of expression.

【０００５】[0005]

[Means for Solving the Problems] To solve the problems described above, the present invention comprises: means for obtaining finger and hand movements as time-series data; a sign language recognition unit comprising a conversion unit that takes the time-series data of the finger and hand movements as input sign language time-series data and calibrates it, a sign language word dictionary that stores the sign language time-series data of each sign language word as sign language word dictionary data, and a matching unit that matches the output of the conversion unit against the sign language word dictionary data and recognizes and outputs the sign language words corresponding to the input sign language time-series data; and a natural sentence conversion unit that adds particles and the like, on a rule basis, to the sign language words output from the sign language recognition unit and outputs a natural sentence. The invention further provides a second matching unit that receives the sign language time-series data of a specific sign language word and the sign language word dictionary data corresponding to it, finds the correspondence between the two inputs at each point in time, and outputs this correspondence; and a selection unit that selects either the input sign language time-series data or the output of the second matching unit and outputs it to the conversion unit. The conversion unit is provided with means for learning the recognition parameters for the calibration from the output of the second matching unit. A neural network is used as the means for learning the recognition parameters for the calibration. The invention further provides a facial expression recognition unit, which receives a face image of the sign language user, recognizes the expression from the face image, and obtains the emotion type, such as "joy" or "sadness," and its emotion degree (intensity), together with a processing device. The processing device takes as input the natural sentence output by the natural sentence conversion unit and the emotion type and emotion degree (intensity) output by the facial expression recognition unit, and outputs a natural sentence to which an emotional adjective has been added. In addition, a sign language interpretation apparatus is installed at each of a plurality of stations on a local area network so that information can be exchanged among the apparatuses.

【０００６】[0006]

[Operation] The sign language time-series data input from a data glove or the like is matched against each item of sign language word dictionary data while scanning. Each item of sign language word dictionary data can then be matched at the part where the degree of match is highest (the dotted part in FIG. 4). In addition, the difference between a limited number of input sign language word data and the corresponding sign language word dictionary data, that is, the user's general signing habits, can be learned as a data conversion rule (calibration rule). Using this conversion rule, the user's sign language can be converted into data closer to the sign language word dictionary data, which improves recognition accuracy. Furthermore, providing a facial expression recognition unit makes it possible to interpret emotional expression as well, achieving more natural conversation.

【０００７】[0007]

[Embodiments] Embodiments of the present invention are described in detail below with reference to FIGS. 1 to 9. First, the overall configuration of the sign language interpretation apparatus 100, the function of each component, and the flow of data are described with reference to FIG. 1. The apparatus is broadly divided into eight parts: a sign language recognition part (1, 2, 3), a facial expression recognition part (4, 5), a data input/output part (6, 7), a speech recognition part (9, 10), a keyboard part (8), a display part (13, 14, 15), a speech generation part (11, 12), and a processing device (computer) (16). The operation of the apparatus is first explained through an example conversation between a sign language user and another user. To interpret the signs made by the sign language user, time-series information d1 on the user's finger shapes and hand direction and position is sent from the data glove 1 to the sign language recognition unit 2. The sign language recognition unit 2 dynamically matches this time-series information d1 against the sign language word dictionary data and thereby converts the words contained in d1 into symbols (d2). The natural sentence conversion unit 3 supplements suitable particles and the like between the sequentially input symbolized words d2 to create and output a natural sentence (d3). Meanwhile, the TV camera 4 simultaneously captures the face image (d4) of the sign language user, and the facial expression recognition unit 5 recognizes the user's expression (smile, sadness, and so on) and outputs its degree (d5).

【０００８】[0008]

When speech is used as the interpretation output medium, the data d3 and d5 are sent to the computer 16, converted into a natural sentence with a simple emotional adjective added, or into parameters for speech synthesis (d12), and then sent to the speech synthesis unit 12. Based on d12, the speech synthesis unit 12 synthesizes speech that corresponds to the natural sentence (d3) and carries the emotion (d5). When text is used as the output medium, the data d3 and d5 are sent to the computer 16, converted into a natural sentence with a simple emotional adjective added (d132), and then sent to the monitor 13 and displayed as text. When another type of sign language is used as the output medium, the computer 16 extracts the word parts (d14) from the natural sentence (d3) with a simple emotional adjective added. These are sent to the sign language CG generation unit 14, and the corresponding word-by-word sign language CG is output to the monitor 13.

【０００９】[0009]

The other user, having understood the sign language user's conversation through these output media, converses using the keyboard 8, the microphone 9, or a data glove 1 (a separate set of sign language recognition units may be provided). For text conversation, a conversational sentence (d8) is entered on the keyboard 8 and displayed on the monitor 13. Alternatively, a simple word-by-word sentence (d8) may be entered on the keyboard 8, converted by the sign language CG generation unit 14 into the corresponding word-by-word sign language CG, and output to the monitor 13. For spoken conversation using the microphone 9, speech uttered word by word is recognized and symbolized (d9) by the speech recognition unit 10. The result is either displayed as text on the monitor 13 or output to the monitor 13 as the corresponding word-by-word sign language CG by the sign language CG generation unit 14. For sign language conversation using the data glove 1, the conversation is output as speech, text, or sign language CG by the same procedure as described earlier.

【００１０】[0010]

The computer 16 performs overall control of the processing units and simple data conversion. The speech recognition unit 10 and the speech synthesis unit 12 can readily be implemented with existing technology (for example, "Hearing and Speech" (聴覚と音声), supervised by Tanetoshi Miura (三浦種敏), Institute of Electronics and Communication Engineers of Japan, 1980). The sign language recognition part (1, 2, 3), the facial expression recognition part (4, 5), and the data input/output part (6, 7) of FIG. 1, which bear directly on the present invention, are described in detail below in turn.

【００１１】[0011]

FIG. 2 shows the specific configuration of the sign language recognition unit 2. The unit has the following two functions. Function (1): Sign language word data of the user corresponding to several predetermined sign language words is acquired from the data glove, the input sign language word data is compared with the sign language word dictionary data for the same word, and a data conversion is performed. The aim is to calibrate for individual differences in signing, that is, to convert the user's sign language data into data that more closely resembles the sign language word dictionary data. This calibration is performed only when use of the apparatus begins. Function (2): The time-series data of the user's (calibrated) sign language conversation is dynamically matched against the sign language word dictionary data, and the words contained in the time-series data are detected.

【００１２】[0012]

First, the implementation of function (1) is described. The sign language user inputs predetermined sign language words d1 one at a time to matching unit 1. The corresponding sign language word dictionary data d22 is read from the sign language word dictionary storage unit 22 and input to matching unit 1, which performs endpoint-fixed dynamic matching. In general, a sign language word, unlike a static finger character, expresses a word through movement of the fingers and hands. The data obtained from the data glove is therefore, as shown in FIG. 4, a time-series multidimensional function P(t) (a function made up of many one-dimensional functions) consisting of f11 to f52, which represent finger joint angles (for example, f11 is the angle of the first joint of the thumb and f12 is the angle of the second joint of the thumb; likewise, f21 to f52 are the first-joint and second-joint angles of the other fingers), hd1 to hd3, which represent the direction of the palm, and hpx to hpz, which represent the position of the hand. Sign language word dictionary data represents a single word and can likewise be expressed as a time-series multidimensional function p(t). For simplicity the figure shows time-series data for one hand only. In practice data from both hands is used, so the function has twice as many dimensions.
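As an illustration only, one time sample of the glove data described above can be sketched as a small Python structure. The class and function names are hypothetical, and the count of 16 values per hand is an inference from the text (f11 to f52 read as five fingers with two joints each, plus hd1 to hd3 and hpx to hpz), not a figure stated in the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HandSample:
    """One time sample for one hand (hypothetical layout, see lead-in)."""
    joints: Tuple[float, ...]                    # f11..f52: 5 fingers x 2 joints
    palm_direction: Tuple[float, float, float]   # hd1..hd3
    position: Tuple[float, float, float]         # hpx..hpz

    def as_vector(self) -> List[float]:
        if len(self.joints) != 10:
            raise ValueError("expected 10 joint angles (f11..f52)")
        return [*self.joints, *self.palm_direction, *self.position]


def frame_vector(left: HandSample, right: HandSample) -> List[float]:
    """Concatenate both hands, doubling the dimension as the text describes."""
    return left.as_vector() + right.as_vector()
```

A sign language word is then a list of such frame vectors over time, which corresponds to P(t) for the user and p(t) for the dictionary.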

【００１３】[0013]

As FIG. 4 shows, the input sign language word data and the sign language word dictionary data naturally differ because of individual differences, so a conversion from the input sign language word data to the sign language word dictionary data is needed to absorb the difference. To this end, endpoint-fixed dynamic matching is first performed to find the optimal corresponding points between the two. Endpoint-fixed dynamic matching is a kind of dynamic template matching that fixes the correspondence between the start point of the input data and the start point of the dictionary data, and between the end point of the input data and the end point of the dictionary data, that is, it fixes the correspondence at both ends while allowing temporal variation in between. A known method is used ("Pattern Recognition and Learning Algorithms" (パターン認識と学習のアルゴリズム), Yoshinori Uesaka and Kazuhiko Ozeki, Bun-ichi Sogo Shuppan, p. 91).
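The endpoint-fixed dynamic matching described above can be sketched as a standard dynamic-programming alignment with both ends pinned. This is a minimal sketch, not the method of the cited textbook: the Euclidean frame distance, the three allowed steps, and the function name are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def endpoint_fixed_match(inp: Sequence[Frame],
                         ref: Sequence[Frame]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align inp to ref with (0, 0) and (last, last) fixed.

    Returns the accumulated distance and the list of corresponding
    index pairs (input index, dictionary index).
    """
    n, m = len(inp), len(ref)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = math.dist(inp[0], ref[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = best + math.dist(inp[i], ref[j])

    # Trace back from the fixed end point to the fixed start point.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path
```

The returned index pairs play the role of the corresponding points between the input word data and the dictionary data that the later calibration step consumes.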

【００１４】[0014]

This matching is performed for each of several words, and the corresponding-point information is output as d24 of FIG. 4. As FIG. 4 shows, d24 consists of the data at the corresponding points of the input sign language word data and the sign language word dictionary data, namely P(A), P(B), P(C), P(D), P(E), P(F), P(G), ... and p(a), p(b), p(c), p(d), p(e), p(f), p(g), .... Reference numeral 25 in FIG. 2 denotes a selection unit. When function (1) is executed, the selection unit 25 selects the corresponding-point data d24 rather than the data d1 and passes it to the conversion unit 21 as input d25. As shown in FIG. 3, the conversion unit 21 consists of a layered neural network 211 and selection units 212 and 213. In the conversion unit 21, the selection unit 213 selects P(A), P(B), P(C), P(D), P(E), P(F), P(G), ... from the input data d25 as the input to the layered neural network 211, and the selection unit 212 supplies p(a), p(b), p(c), p(d), p(e), p(f), p(g), ... from d25 as the teacher data d212 of the layered neural network 211. The layered neural network 211 is trained in this way. By training the layered neural network 211 with several items of sample data d25, the conversion rule from the sign language user's time-series data P(t) to the word dictionary data p(t) is learned in the form of the weight coefficients of the layered neural network 211.
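The idea of learning a conversion from user frames P to dictionary frames p from aligned pairs can be sketched as follows. Note the substitution: instead of the layered neural network 211 of the patent, this sketch trains one affine unit (gain and offset) per channel by stochastic gradient descent. It is a deliberately reduced stand-in that only shows how the aligned pairs serve as input and teacher data. All names and hyperparameters are assumptions.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (user frame P, teacher frame p)


def train_calibration(pairs: Sequence[Pair], dims: int,
                      epochs: int = 2000, lr: float = 0.05
                      ) -> Tuple[List[float], List[float]]:
    """Fit p[k] ~ gain[k] * P[k] + offset[k] for each channel k."""
    gain = [1.0] * dims
    offset = [0.0] * dims
    for _ in range(epochs):
        for user, teacher in pairs:
            for k in range(dims):
                err = gain[k] * user[k] + offset[k] - teacher[k]
                # Gradient step on the squared error of this sample.
                gain[k] -= lr * err * user[k]
                offset[k] -= lr * err
    return gain, offset


def calibrate(frame: Sequence[float], gain: Sequence[float],
              offset: Sequence[float]) -> List[float]:
    """Apply the learned conversion to one input frame."""
    return [g * x + o for g, x, o in zip(gain, frame, offset)]
```

A per-channel affine map cannot capture interactions between channels, which a layered network can. It is used here only because it keeps the example short and dependency-free.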

【００１５】[0015]

Next, the implementation of function (2) is described. When function (2) is executed, the selection unit 25 of FIG. 2 selects the data d1, which becomes d25; the selection unit 213 of FIG. 3 selects d25 as the input to the layered neural network 211; and the selection unit 212 selects d212, which becomes d21. Unlike in the calibration described above, the sign language user's input sign language time-series data d1 is a sequence of several sign language words arranged in time for the purpose of conversation. This input sign language time-series data d1 is converted by the trained layered neural network into converted input sign language time-series data d21, from which the user's individual differences have been removed. In other words, by virtue of function (1), the converted input sign language time-series data d21 has been converted into time-series data that resembles the corresponding sign language word dictionary data.

【００１６】[0016]

Matching unit 2 (23) of FIG. 2 matches the converted input sign language time-series data d21 against each item of sign language word dictionary data and detects the sequence of sign language words in the input sign language time-series data. Its operation is explained with reference to FIG. 5. In FIG. 5, d21 is the input sign language time-series data and d22 is the sign language word dictionary data. As explained above, the input sign language time-series data is a multidimensional function of time t. The sign language word dictionary data d22 includes, for example, A, B, C, ... as shown in the figure, each of which is dynamically matched against the input sign language time-series data d21. This dynamic matching uses a method called continuous DP matching (Oka, "Continuous Word Recognition Using Continuous DP" (連続ＤＰを用いた連続単語認識), Speech Research Group of the Acoustical Society of Japan, S78-20, pp. 145-152, 1978). Basically, each item of sign language word dictionary data is scanned over the input sign language time-series data, matching is performed at each time, and the sign language word is recognized as being present at the position (time) where the degree of match is best. The matching tolerates some temporal variation. Matching unit 2 (23) outputs the recognized words in time order as d2. Based on these data, the natural sentence conversion unit 3 adds particles and the like on a rule basis and outputs, in symbolized form, a sentence closer to natural language.
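The scanning behaviour described above can be sketched with a start-free dynamic-programming recurrence. This is a simplified word-spotting variant written for illustration. It does not reproduce the exact recurrence or slope constraints of the cited continuous DP paper, and the threshold, the length normalization, and the function names are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]


def spot_word(inp: Sequence[Frame], ref: Sequence[Frame]) -> Tuple[int, float]:
    """Return (end time, normalized cost) of the best match of ref in inp.

    The reference may start at any input time, so the first reference
    frame never inherits cost from earlier input frames.
    """
    m = len(ref)
    inf = float("inf")
    prev = [inf] * m
    best_t, best_cost = -1, inf
    for t, frame in enumerate(inp):
        cur = [inf] * m
        for j in range(m):
            d = math.dist(frame, ref[j])
            if j == 0:
                cur[j] = d
            else:
                cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        score = cur[m - 1] / m  # normalize by reference length
        if score < best_cost:
            best_t, best_cost = t, score
        prev = cur
    return best_t, best_cost


def spot_words(inp: Sequence[Frame], dictionary: Dict[str, Sequence[Frame]],
               threshold: float) -> List[str]:
    """List dictionary words whose best match is under threshold, in time order."""
    hits = []
    for word, ref in dictionary.items():
        t, cost = spot_word(inp, ref)
        if cost <= threshold:
            hits.append((t, word))
    return [word for _, word in sorted(hits)]
```

This sketch reports each dictionary word at most once, at its single best position. Detecting repeated occurrences would require looking for every local minimum of the score over time rather than only the global one.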

【００１７】[0017]

Next, the facial expression recognition unit 5 is described in detail with reference to FIGS. 6, 7, and 8. FIG. 6 shows the specific configuration of the facial expression recognition unit 5. The input d4 is the face image of the sign language user, and the output d5 is the emotion type obtained from the expression together with an emotion degree that quantifies the emotion (for example, joy 60%, sadness 10%). Block 41 in the figure detects the positions of the eyes, mouth, nose, and eyebrows in the face image (d410) and cuts out the eye partial image (d414), the mouth partial image (d413), the nose partial image (d412), and the eyebrow partial image (d411). The expression matching units for the individual parts (424, 423, 422, 421) detect the emotion degree of each part (d424, d423, d422, d421) with reference to the basic expression pattern dictionaries (434, 433, 432, 431). The overall judgment unit 44 combines these emotion degrees with the emotion degree determined from the detected positions of the eyes, mouth, nose, and eyebrows (d410) and outputs the final emotion type and its emotion degree.

【００１８】[0018]

FIG. 7 shows the specific configuration for detecting the position of each part and cutting it out. First, the difference-based face extraction unit 411 obtains a face image d4110 with the background removed. This is easily achieved by computing the difference between a pre-registered background image 412 and the input image d4 and finding the regions where the absolute difference is large. The part extraction unit (413) detects the overall position and overall size of the face image d4110 using image processing techniques such as projection distributions. Standard image patterns and basic position information (414) are prepared for the eyes, mouth, nose, and eyebrows. Template matching is performed on the background-free face image d4110 using the standard image pattern (414) of each part as the template, and the position of each part is detected. Here, normalizing the basic position information (414) of each part with the detected overall position and overall size of the face image restricts the range over which template matching is run, which makes the matching both accurate and efficient. Once the position of each part has been detected, cutting out each partial image is straightforward. The output d410 is the position coordinates of the parts, and d414, d413, d412, and d411 are the eye, mouth, nose, and eyebrow partial images, respectively.
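The first two steps above, background differencing and locating the face from projection distributions, can be sketched on plain grayscale images stored as nested lists. This is a minimal sketch under assumptions: the threshold value is arbitrary, the function names are hypothetical, and the template-matching stage is omitted.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values


def foreground_mask(image: Image, background: Image, threshold: int) -> Image:
    """Mark pixels whose absolute difference from the background is large."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(row, bg_row)]
        for row, bg_row in zip(image, background)
    ]


def bounding_box(mask: Image) -> Optional[Tuple[int, int, int, int]]:
    """Locate the foreground from its row and column projections.

    Returns (x_min, y_min, x_max, y_max), or None if nothing differs
    from the background.
    """
    row_projection = [sum(row) for row in mask]
    col_projection = [sum(col) for col in zip(*mask)]
    ys = [y for y, v in enumerate(row_projection) if v > 0]
    xs = [x for x, v in enumerate(col_projection) if v > 0]
    if not ys or not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)
```

The resulting box gives the overall position and size that the text uses to normalize the basic position information and so narrow the template-matching search range.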

【００１９】[0019]

FIG. 8 shows a specific configuration example of the mouth expression matching unit 423 of the facial expression recognition unit 5. First, a feature extraction unit applies image processing to the mouth partial image d413 to extract shape features such as area, aspect ratio, and the shapes of the x and y projection distributions. These shape features are compared (by distance in feature space) with the pre-registered shape features of a happy mouth and of a sad mouth, and it is determined which is closer. The closer emotion type and an emotion degree inversely proportional to the distance are then output (d423). Expression matching for the other parts (eyes, nose, eyebrows) can be realized with exactly the same configuration. The overall judgment unit 44 of the facial expression recognition unit 5 determines the final emotion type by majority vote, using the emotion types and emotion degrees obtained from the shape features of each part of the input image and the emotion type and emotion degree obtained from the positional relationship of the parts. The final emotion degree (d5) is the average of those obtained emotion degrees whose emotion type is the same as the final emotion type. The emotion type and emotion degree derived from the positional relationship of the parts are determined as follows: the position coordinates of each part in a standard image of each expression are stored in advance, the sum of the positional deviations between the parts of the input image and those of each standard image is computed, and the decision is made according to the size of that error. That is, the emotion type of the standard image whose positional relationship gives the smallest error, and an emotion degree inversely proportional to that error, are taken as the emotion type and emotion degree of the input image.
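The per-part matching and the overall judgment described above can be sketched as a nearest-prototype lookup followed by a majority vote. Several details are assumptions: the degree is computed as 1 / (1 + distance) so that it stays finite at zero distance (the text says only "inversely proportional"), ties in the vote go to the emotion type seen first, and all names and prototype values are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple

Vote = Tuple[str, float]  # (emotion type, emotion degree)


def match_expression(features: Sequence[float],
                     prototypes: Dict[str, Sequence[float]]) -> Vote:
    """Pick the nearest registered prototype in feature space."""
    label, distance = min(
        ((name, math.dist(features, proto)) for name, proto in prototypes.items()),
        key=lambda item: item[1],
    )
    return label, 1.0 / (1.0 + distance)  # assumed form of "inversely proportional"


def overall_judgment(votes: Sequence[Vote]) -> Vote:
    """Majority vote on type, then average the degrees of the winning type."""
    counts = Counter(label for label, _ in votes)
    final_label = counts.most_common(1)[0][0]
    degrees = [degree for label, degree in votes if label == final_label]
    return final_label, sum(degrees) / len(degrees)
```

In this sketch the votes would come from the four part matchers (eyes, mouth, nose, eyebrows) plus the one vote derived from the positional relationship of the parts.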

【００２０】[0020] Next, the data input/output unit (6, 7) in FIG. 1 will be described. This unit serves two purposes: (1) to store on the floppy disk 6 the raw conversation data from the sign language user's data glove, or the text obtained after natural sentence conversion, and to read that stored data back from the floppy disk 6; and (2) to store on the floppy disk 6 the parameters (weight coefficients) of the layered neural network of the conversion unit 21 in the sign language recognition unit 2, and to read that stored data back from the floppy disk 6. The data read out in (1) is sent to the speech synthesis unit, the monitor, or the sign language CG generation unit, where its content is presented. The data read out in (2) is set as the parameters (weight coefficients) of the layered neural network of the conversion unit 21.
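As an illustration of purpose (2), the following sketch saves a set of weight coefficients to a file and restores them later. The JSON file format, the nested-list layout of the weights, and the function names are assumptions made for the example; the patent does not specify how the data is laid out on the disk.

```python
import json
from pathlib import Path


def save_weights(path, weights):
    """Write the weight coefficients (nested lists of floats) to a file."""
    Path(path).write_text(json.dumps(weights))


def load_weights(path):
    """Read previously stored weight coefficients back from a file."""
    return json.loads(Path(path).read_text())
```

Storing the weights this way would let a network calibrated for one signer be reloaded later without retraining.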

【００２１】[0021] FIG. 9 shows the configuration of the sign language CG generation unit 14. In FIG. 9, d14 is word information. Based on this information, the address generation unit 142 generates the address (d142) of the corresponding word CG pattern, and the designated word CG pattern (d131) is sent to the monitor 13 and displayed. Each part of the sign language interpreting apparatus has now been described in detail. With this configuration, a sign language interpreting apparatus that solves all of the problems described earlier can be realized. The above embodiment of the sign language interpreting apparatus 100 was described on the assumption that data is exchanged by wire, but all or part of the data exchange between the units may instead be performed wirelessly to improve operability.
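The word-to-pattern lookup of FIG. 9 can be pictured as a two-step table lookup. The sketch below is only an illustration: the table contents, the string placeholders standing in for CG patterns, and the function names are hypothetical.

```python
# Hypothetical address table: word information -> address of its CG pattern.
WORD_TO_ADDRESS = {"hello": 0, "thanks": 1}

# Hypothetical pattern memory; strings stand in for the stored CG patterns.
CG_PATTERNS = ["<CG pattern for 'hello'>", "<CG pattern for 'thanks'>"]


def generate_address(word):
    """Role of the address generation unit: map a word to a pattern address."""
    return WORD_TO_ADDRESS.get(word)


def fetch_pattern(word):
    """Return the CG pattern to display, or None if the word is not registered."""
    address = generate_address(word)
    if address is None:
        return None
    return CG_PATTERNS[address]
```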

【００２２】[0022] FIG. 10 shows a sign language interpreting system in which a plurality of the sign language interpreting apparatuses 100 described above are connected by a LAN or a wireless LAN via a LAN (local area network) interface 101. With this configuration, information communication by sign language and gestures, as well as access to databases, becomes possible within public facilities such as hospitals, police stations, and government offices, or between such facilities.

【００２３】[0023]

[Effects of the Invention] According to the present invention, dynamic sign language that varies from person to person can be recognized, and natural sentence conversion can be performed based on the recognition result. In addition, facial expressions can be recognized, so that natural sentences conveying emotion can be generated. Furthermore, conversation between sign language users and hearing people can be made easier.

[Brief description of the drawings]

FIG. 1 is a diagram showing the overall configuration of the sign language interpreting apparatus.

FIG. 2 is a diagram showing the configuration of the sign language recognition unit.

FIG. 3 is a diagram showing the configuration of the conversion unit in the sign language recognition unit.

FIG. 4 is a diagram showing the operation of collation unit 1 in the sign language recognition unit.

FIG. 5 is a diagram showing the operation of collation unit 2 in the sign language recognition unit.

FIG. 6 is a diagram showing the configuration of the facial expression recognition unit.

FIG. 7 is a diagram showing the configuration of the position detection and cutout units for each part in the facial expression recognition unit.

FIG. 8 is a diagram showing the configuration of the mouth expression matching unit in the facial expression recognition unit.

FIG. 9 is a diagram showing the configuration of the sign language CG generation unit.

FIG. 10 is a diagram showing a configuration example of the sign language interpreting system.

[Explanation of symbols]

1 data glove
2 sign language recognition unit
3 natural sentence conversion unit
4 TV camera
5 facial expression recognition unit
6 floppy disk
7 I/O device unit
8 keyboard
9 microphone
10 speech recognition unit
11 speaker
12 speech synthesis unit
13 monitor
14 sign language CG generation unit
15 selection unit
16 computer

Continuation of the front page
(72) Inventor: Akira Ichikawa, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Kiyoshi Inoue, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Kiyoshi Arai, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Takanori Shimura, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Yuji Toda, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(56) References: JP-A-2-144675 (JP, A); JP-A-3-186979 (JP, A); JP-A-4-134515 (JP, A); JP-A-63-172297 (JP, A); JP-A-59-132079 (JP, A)
(58) Fields searched (Int. Cl.7, DB name): G09B 21/00; G06F 17/28; G06T 1/00

Claims

(57) [Claims]

1. A sign language interpreting apparatus comprising: means for acquiring finger and hand movements as time-series data; a sign language recognition unit having a conversion unit that takes the time-series data of the finger and hand movements as input sign language time-series data and calibrates the input sign language time-series data, a sign language word dictionary that stores sign language time-series data of each sign language word as sign language word dictionary data, and a collating unit that compares the output of the conversion unit with the sign language word dictionary data to recognize and output the sign language word corresponding to the input sign language time-series data; a natural sentence conversion unit that adds particles and the like, based on rules, to the sign language words output from the sign language recognition unit and outputs a natural sentence; a facial expression recognition unit that receives a face image of the sign language user and recognizes the facial expression from the face image to obtain an emotion type such as "joy" or "sadness" and its emotion degree (intensity); and a processing device that receives as input the natural sentence output from the natural sentence conversion unit and the emotion type and emotion degree (intensity) output from the facial expression recognition unit, and outputs a natural sentence to which an emotional adjective has been added.

2. The sign language interpreting apparatus according to claim 1, wherein a speech output device having a speech synthesis unit is connected to the processing device, and speech is output in accordance with the emotion type and the emotion degree (intensity).

3. The sign language interpreting apparatus according to claim 1, wherein a text output device for outputting text is connected to the processing device, and the text is output in accordance with the emotion type and the emotion degree (intensity).

4. The sign language interpreting apparatus according to claim 1, wherein a sign language graphics output device having a sign language CG generation unit is connected to the processing device, and the sign language graphics are output in accordance with the emotion type and the emotion degree (intensity).