JP7373358B2

JP7373358B2 - Sound extraction system and sound extraction method

Info

Publication number: JP7373358B2
Application number: JP2019197987A
Authority: JP
Inventors: Yohei Kawaguchi; Kaori Suefusa
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2023-11-02
Anticipated expiration: 2039-10-30
Also published as: JP2021071586A

Description

The present invention relates to a sound extraction system and a sound extraction method.

Abnormalities in equipment and signs of impending failure often manifest themselves in sound. Diagnosis based on the operating sound of equipment is therefore important for understanding its condition. However, noise originating from sources other than the diagnosis target can lead to misdiagnosis. A sound extraction process is therefore needed that removes extraneous noise from the input signal and selectively extracts the sound of the diagnosis target.

Japanese Patent Laid-Open No. 2009-128906 (Patent Document 1) describes one approach to the sound extraction problem. It describes "a method for denoising a mixed signal containing an acoustic signal and a noise signal, comprising: applying constrained non-negative matrix factorization (NMF) to the mixed signal, the NMF being constrained by a denoising model consisting of a training basis matrix of a training acoustic signal and a training noise signal together with statistics of the weights of that training basis matrix, the application producing the weights of the basis matrix of the acoustic signal within the mixed signal; and taking the product of the weights of the basis matrix of the acoustic signal and the training basis matrix of the training acoustic signal and the training noise signal in order to reconstruct the acoustic signal."

Patent Document 1: JP2009-128906A

The invention disclosed in Patent Document 1 uses NMF to separate a noisy signal into speech and noise. However, it works only when training data are available for both the speech to be extracted and the noise to be removed. For example, even if the operating sound of the equipment under diagnosis becomes abnormal, the abnormal sound is difficult to learn in advance, so the technique of Patent Document 1 cannot extract it. One way to reduce the noise relative to the target sound is to record with the microphone placed as close as possible to the diagnosis target. Even this, however, is insufficient when the environmental noise is extremely loud.

An object of the present invention is therefore to provide a sound extraction process that removes extraneous noise from an input signal and selectively extracts the sound of the diagnosis target.

To solve the above problem, the configurations described in the claims, for example, are adopted. That is, a plurality of input sounds recorded at a plurality of positions at different distances from the diagnosis target are acquired in association with those distances, a feature quantity is obtained for each input sound, and the feature quantity of the diagnosis target's sound is extracted using a plurality of combinations of feature quantities and their corresponding distances.

According to the present invention, extraneous noise can be removed from an input signal and the sound of the diagnosis target can be selectively extracted. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

FIG. 1 is an explanatory diagram of the hardware configuration of the embodiment.
FIG. 2 is a functional block diagram of the processing performed when learning the normal sound model in the embodiment.
FIG. 3 is a functional block diagram of the processing performed when executing abnormality detection in the embodiment.
FIG. 4 is a flowchart showing the recording procedure performed by the multi-distance input sound acquisition unit.
FIG. 5 is a functional block diagram of the processing performed when learning a normal sound model for each distance.
FIG. 6 is a functional block diagram of the processing performed when executing abnormality detection for each distance.
FIG. 7 is a flowchart showing the sound extraction procedure performed by the distance-volume constrained sound extraction unit.
FIG. 8 is a flowchart showing a first modification of sound extraction.
FIG. 9 is a flowchart showing a second modification of sound extraction.

Embodiments are described below with reference to the drawings.

FIG. 1 is an explanatory diagram of the hardware configuration of the embodiment. As shown in FIG. 1, the portable terminal 101 includes a microphone 102, an AD converter 103, and a distance measurement sensor 104. It is a terminal, such as a tablet, that a user can carry around.

The microphone 102 sends an analog input signal to the AD converter 103. The AD converter 103 converts the analog input signal into a digital output signal and sends it to the portable terminal 101. The distance measurement sensor 104 measures the distance from the diagnosis target 105 and sends it to the portable terminal 101. If no distance measurement sensor 104 is connected to the portable terminal 101, the user may instead measure the distance separately. The diagnosis target 105 is, for example, a device installed as factory equipment.

On its display, the portable terminal 101 shows the instructed distances from the diagnosis target 105 (short distance r1, long distance r2, and so on) together with the current distance r from the diagnosis target 105, so that the user can easily bring r to the instructed value. While recording is stopped, the display also shows whether recording is possible, along with a recording start button. If the absolute difference between r and the instructed value is equal to or greater than a threshold eps, the display indicates that recording is not possible; otherwise it indicates that recording is possible and enables the recording start button. This allows stable recording at accurate distances. Once recording has started, the display shows the total recording time T, the elapsed time t since the start of recording, and the remaining recording time (T − t). This display helps stabilize the recording conditions and reduces the user's psychological burden during recording.

The user records by moving the portable terminal 101 to several positions at different distances from the diagnosis target 105, such as the short distance r1 and the long distance r2. The recordings contain both the sound of the diagnosis target 105 and background noise 106, and are used both to learn a normal sound model of the diagnosis target 105 and to detect abnormalities in it.

Specifically, while the device serving as the diagnosis target 105 is operating properly, recordings are first made at the short distance r1 and the long distance r2; the sound features of the diagnosis target 105 are extracted from these multi-distance recordings and learned as a normal sound model. Later, when abnormality detection is performed on the diagnosis target 105, recordings are again made at r1 and r2, the sound features of the diagnosis target 105 are extracted from the multi-distance recordings, and abnormalities are detected by comparing these features with the normal sound model.

FIG. 2 is a functional block diagram of the processing performed when learning the normal sound model in the embodiment. This processing may be performed on the portable terminal 101 or on a separate computer or server. The multi-distance input sound acquisition unit 201 acquires digital input sounds recorded at several distances from the diagnosis target 105, each associated with its distance. The digital output signal of the AD converter 103 serves as the digital input sound, and the output of the distance measurement sensor 104 can supply the distance. For example, the distances and digital input sounds can be acquired in association with each other by having the user move the portable terminal 101 to positions at different distances from the diagnosis target 105, such as r1 and r2, and record at each position.

The multi-distance input sound acquisition unit 201 outputs a digital input sound and a distance time series. The digital input sound is a time-domain signal giving signal values along the time axis. The distance time series assigns the recording distance as a value to each point on the time axis of the digital input signal. Although the recordings at r1 and r2 are made separately and are not continuous in time, the multi-distance input sound acquisition unit 201 concatenates them and outputs them as a single digital input sound. The distance time series is then a single piece of data giving the recording distance at each point of this concatenated time series.
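The concatenation step can be illustrated with a short sketch. It assumes, for illustration only, that every recording is a 1-D NumPy array at a common sample rate and that the distance label is stored per sample; the function name `concat_with_distance` and the example distances are hypothetical.

```python
import numpy as np


def concat_with_distance(recordings):
    """Concatenate separately made recordings into one input signal and
    build a distance time series with one distance value per sample.

    recordings: list of (distance, 1-D array of samples) pairs
    (hypothetical input format).
    """
    signals, distances = [], []
    for distance, samples in recordings:
        samples = np.asarray(samples, dtype=float)
        signals.append(samples)
        # Every sample of this recording carries the distance it was recorded at.
        distances.append(np.full(samples.shape[0], float(distance)))
    return np.concatenate(signals), np.concatenate(distances)


# Example: a short recording at r1 = 0.5 followed by one at r2 = 2.0 (illustrative values).
signal, distance_series = concat_with_distance([(0.5, [0.1, 0.2, 0.3]), (2.0, [0.0, -0.1])])
```

Labeling per sample keeps the distance series aligned with the signal; it can later be reduced to one label per spectrogram frame.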

The preprocessing unit 202 divides the digital input sound into frames, multiplies each frame by a window function, and applies a short-time Fourier transform to the windowed signal to compute a frequency-domain signal. For a frame size of N, the frequency-domain signal is a set of M = N/2 + 1 complex numbers, one for each frequency bin. An input sound spectrogram (a power spectrogram or an amplitude spectrogram) is then computed from the frequency-domain signal.
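As a sketch of the preprocessing, the following computes a power spectrogram with `numpy.fft.rfft`, which returns exactly the N/2 + 1 non-redundant bins of a real frame of length N. The Hann window, the frame size of 512, and the hop of 256 samples are illustrative assumptions; the description does not fix these values.

```python
import numpy as np


def power_spectrogram(signal, frame_size=512, hop=256):
    """Split the signal into frames, apply a Hann window, take the FFT of
    each frame, and return a (num_frames, M) power spectrogram with
    M = frame_size // 2 + 1 frequency bins."""
    signal = np.asarray(signal, dtype=float)
    if signal.shape[0] < frame_size:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_size)
    num_frames = 1 + (signal.shape[0] - frame_size) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_size] for i in range(num_frames)])
    # rfft returns the N/2 + 1 non-redundant complex bins of each real frame.
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.abs(spectrum) ** 2  # use np.abs(spectrum) for an amplitude spectrogram


# A sinusoid centred exactly on bin 32 of a 512-point frame.
n = np.arange(2048)
spec = power_spectrogram(np.sin(2 * np.pi * 32 * n / 512))
```

Here `spec` has 7 frames and 257 = 512/2 + 1 bins, with the energy of each frame peaking at bin 32.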

Based on the input sound spectrogram and the distance time series, the distance-volume constrained sound extraction unit 203 extracts a diagnosis-target extracted sound spectrogram. From many diagnosis-target extracted sound spectrograms obtained in the past, the normal sound model learning unit 204 learns a model of the normal-state distribution of feature vectors, each composed of L consecutive frames, and stores the model in the normal sound model database 205.
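One plausible way to form the feature vectors of L consecutive frames is a sliding window with a stride of one frame, flattening each window into an M×L-dimensional vector. The stride and the name `stack_frames` are assumptions made for this sketch.

```python
import numpy as np


def stack_frames(spectrogram, L):
    """Build feature vectors of L consecutive frames.

    spectrogram: (T, M) array (frames x frequency bins)
    returns: (T - L + 1, M * L) array; row t concatenates frames t .. t+L-1
    """
    T, _ = spectrogram.shape
    if T < L:
        raise ValueError("need at least L frames")
    return np.stack([spectrogram[t:t + L].reshape(-1) for t in range(T - L + 1)])


toy = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames, M = 3 bins
V = stack_frames(toy, L=2)
```

Each row of `V` is one 6-dimensional (M×L) feature vector.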

As the normal sound model, a Gaussian mixture model (GMM), a one-class support vector classifier, the subspace method, the local subspace method, k-means clustering, a deep neural network (DNN) autoencoder, a convolutional neural network (CNN) autoencoder, a long short-term memory (LSTM) autoencoder, a variational autoencoder (VAE), or the like may be used.

A learning algorithm suited to each type of normal sound model is known and is used for training. For a GMM, for example, the EM algorithm fits a mixture of Gaussian distributions whose number equals a predetermined number of clusters. The learned normal sound model is defined by the computed model parameters, all of which are stored in the normal sound model database.

FIG. 3 is a functional block diagram of the processing performed when executing abnormality detection in the embodiment. This processing is performed on the portable terminal 101. The processing from the multi-distance input sound acquisition unit 201 through the distance-volume constrained sound extraction unit 203 is the same as in FIG. 2.

The abnormality detection unit 301 reads the normal sound model from the normal sound model database 205 and performs abnormality detection on the diagnosis-target extracted sound spectrogram. That is, it computes a time series of feature vectors, each composed of L consecutive frames, and determines whether that time series could be generated from the normal sound model with sufficient probability.

For example, when the normal sound model is a GMM, the probability p(v | Θ) that an M×L-dimensional feature vector v is generated from the normal sound model with model parameters Θ = ((μ_1, Γ_1, π_1), …, (μ_q, Γ_q, π_q), …, (μ_Q, Γ_Q, π_Q)) is calculated by the following equation:

$$p(v \mid \Theta) = \sum_{q=1}^{Q} \pi_q \, \mathcal{N}(v \mid \mu_q, \Gamma_q)$$

where

$$\mathcal{N}(v \mid \mu_q, \Gamma_q) = \frac{1}{(2\pi)^{ML/2} \, \lvert \Gamma_q \rvert^{1/2}} \exp\left( -\frac{1}{2} (v - \mu_q)^{\top} \Gamma_q^{-1} (v - \mu_q) \right)$$

is the multivariate Gaussian density, and μ_q, Γ_q, and π_q are the mean vector, covariance matrix, and mixture weight of the q-th of the Q components.

For example, the negative log-likelihood −log p(v | Θ) is defined as the estimated degree of abnormality and output.
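The GMM score can be computed directly from the model parameters (μ_q, Γ_q, π_q). The sketch below evaluates −log p(v | Θ), where p(v | Θ) is a π_q-weighted sum of multivariate Gaussian densities, using a log-sum-exp over the components for numerical stability. The function name and argument layout are hypothetical; in practice the parameters would come from a fitted model.

```python
import numpy as np


def gmm_anomaly_score(v, means, covs, weights):
    """Return -log p(v | Theta) for a GMM with full covariance matrices.

    v: (D,) feature vector with D = M * L
    means: (Q, D) mean vectors mu_q
    covs: (Q, D, D) covariance matrices Gamma_q
    weights: (Q,) mixture weights pi_q (summing to 1)
    """
    v = np.asarray(v, dtype=float)
    D = v.shape[0]
    log_terms = []
    for mu, cov, pi in zip(means, covs, weights):
        diff = v - mu
        _, logdet = np.linalg.slogdet(cov)        # log |Gamma_q|
        maha = diff @ np.linalg.solve(cov, diff)  # (v - mu)^T Gamma^{-1} (v - mu)
        log_terms.append(np.log(pi) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha))
    log_terms = np.array(log_terms)
    # log-sum-exp keeps the sum over components numerically stable
    peak = log_terms.max()
    return float(-(peak + np.log(np.exp(log_terms - peak).sum())))


# One standard-normal component in 2-D: the score at the mean is log(2*pi).
means = np.zeros((1, 2))
covs = np.eye(2)[None]
weights = np.array([1.0])
score_center = gmm_anomaly_score([0.0, 0.0], means, covs, weights)
score_far = gmm_anomaly_score([3.0, 3.0], means, covs, weights)
```

A vector far from every component, such as (3, 3) here, receives a larger score and is therefore judged more abnormal.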

When a deep neural network (DNN) autoencoder is used as the normal sound model, its internal parameters are optimized with an algorithm such as SGD, Momentum SGD, AdaGrad, RMSprop, AdaDelta, or Adam so that, when a normal-sound feature vector is input, the reconstruction error between the input feature vector and the output feature vector becomes small. When an abnormal-sound feature vector is input, this reconstruction error is expected to be large. The reconstruction error is therefore defined as the estimated degree of abnormality and output.
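To show how the reconstruction error serves as an anomaly score without training a network, the sketch below substitutes a closed-form linear autoencoder: for squared error, the optimal linear autoencoder spans the same subspace as principal component analysis (PCA), so its weights can be taken from an SVD. This is a stand-in chosen for brevity, not the DNN autoencoder described above, and the function names are hypothetical.

```python
import numpy as np


def fit_linear_autoencoder(X, k):
    """Closed-form rank-k linear autoencoder (PCA), used here only as a
    stand-in for a DNN autoencoder trained with SGD, Adam, etc.

    X: (num_vectors, D) normal-sound feature vectors.
    """
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]  # encoder/decoder weights: top-k principal directions


def reconstruction_error(v, mean, components):
    """Squared reconstruction error, used as the estimated degree of abnormality."""
    code = (v - mean) @ components.T  # encode
    recon = mean + code @ components  # decode
    return float(np.sum((v - recon) ** 2))


# "Normal" training vectors all lie on one line in 3-D.
train = np.outer(np.linspace(-1.0, 1.0, 50), [1.0, 2.0, 0.0])
mean, comps = fit_linear_autoencoder(train, k=1)
err_normal = reconstruction_error(np.array([0.5, 1.0, 0.0]), mean, comps)
err_abnormal = reconstruction_error(np.array([0.0, 0.0, 1.0]), mean, comps)
```

A vector resembling the normal data is reconstructed almost perfectly, while one outside the learned subspace leaves a large residual.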

The abnormality display unit 302 displays the estimated degree of abnormality and, when the value is equal to or greater than a certain threshold, also indicates that an abnormality is present.

FIG. 4 is an example flowchart of the recording procedure performed by the multi-distance input sound acquisition unit 201. In this example the user is guided to record at two distances, the short distance r1 and the long distance r2; recording may also be performed at three or more distances by following the same procedure as for r1 and r2.

First, in S401, the multi-distance input sound acquisition unit 201 outputs an instruction to record at the short distance r1, and the process proceeds to S402.
In S402, the multi-distance input sound acquisition unit 201 sets t to 0 and proceeds to S403.
In S403, the multi-distance input sound acquisition unit 201 proceeds to S404 if t < T, and otherwise to S408.
In S404, the distance measurement sensor 104 measures the current distance r, and the process proceeds to S405.
In S405, the multi-distance input sound acquisition unit 201 proceeds to S406 if |r − r1| < eps, and otherwise returns to S401.
In S406, recording is performed using the microphone 102 and the AD converter 103, and the process proceeds to S407.
In S407, the multi-distance input sound acquisition unit 201 adds the time Δt elapsed since the previous pass to t and returns to S403.

Next, in S408, the multi-distance input sound acquisition unit 201 instructs recording at the long distance r2, and the process proceeds to S409.
In S409, the multi-distance input sound acquisition unit 201 sets t to 0 and proceeds to S410.
In S410, the multi-distance input sound acquisition unit 201 proceeds to S411 if t < T, and otherwise ends the process.
In S411, the distance measurement sensor 104 measures the current distance r, and the process proceeds to S412.
In S412, the multi-distance input sound acquisition unit 201 proceeds to S413 if |r − r2| < eps, and otherwise returns to S408.
In S413, recording is performed using the microphone 102 and the AD converter 103, and the process proceeds to S414.
In S414, the multi-distance input sound acquisition unit 201 adds the time Δt elapsed since the previous pass to t and returns to S410.
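The two halves of the FIG. 4 flowchart follow the same pattern, so they can be expressed as one loop over the target distances, which also covers the three-or-more-distance case. In this sketch the callbacks `measure_distance`, `record_chunk`, and `instruct` are hypothetical stand-ins for the distance measurement sensor, the microphone and AD converter path, and the display. We read the return to S401 or S408 as restarting the timer, so chunks from an interrupted attempt are discarded; the description does not state this explicitly. The loop also never gives up if the user cannot reach the target distance.

```python
def record_at_distances(targets, T, dt, eps, measure_distance, record_chunk, instruct):
    """Sketch of the FIG. 4 loop for any number of target distances.

    measure_distance(): returns the current distance r (hypothetical callback)
    record_chunk(dt):   returns the samples recorded over dt (hypothetical callback)
    instruct(target):   tells the user which distance to record at (hypothetical)
    Returns a list of (target distance, recorded chunks) pairs.
    """
    results = []
    for target in targets:
        while True:
            instruct(target)                     # S401 / S408
            t, chunks = 0.0, []                  # S402 / S409
            while t < T:                         # S403 / S410
                r = measure_distance()           # S404 / S411
                if abs(r - target) >= eps:       # S405 / S412: "No" branch
                    break                        # back to the instruction step
                chunks.append(record_chunk(dt))  # S406 / S413
                t += dt                          # S407 / S414
            else:
                # t reached T without leaving the tolerance band.
                results.append((target, chunks))
                break
    return results
```

For example, if the user drifts out of tolerance during the first attempt at r1, the instruction is shown again and recording at r1 restarts from t = 0.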

Next, a modification that keeps the scale of the normal sound model's parameters small is disclosed. It differs from the configurations shown in FIGS. 2 and 3 in that a separate normal sound model is learned for each distance, and abnormality detection likewise uses the normal sound model corresponding to the distance at the time of recording.

FIG. 5 is a functional block diagram of the processing performed when learning a normal sound model for each distance. The processing from the multi-distance input sound acquisition unit 201 through the distance-volume constrained sound extraction unit 203 is the same as in FIG. 2. Using the diagnosis-target extracted sound spectrogram output by the distance-volume constrained sound extraction unit 203 and the distance time series output by the multi-distance input sound acquisition unit 201, the per-distance normal sound model learning unit 501 learns a separate normal sound model for each distance and stores it in the per-distance normal sound model database 502. Each model can be learned with the same algorithm as in the normal sound model learning unit 204.

FIG. 6 is a block diagram of the processing performed when executing abnormality detection for each distance. The processing from the multi-distance input sound acquisition unit 201 through the distance-volume constrained sound extraction unit 203 is the same as in FIG. 2. Using the diagnosis-target extracted sound spectrogram output by the distance-volume constrained sound extraction unit 203 and the distance time series output by the multi-distance input sound acquisition unit 201, the per-distance abnormality detection unit 601 divides the diagnosis-target extracted sound spectrogram according to the recording distance, performs abnormality detection on each part using the normal sound model for the corresponding distance, and outputs the estimated degrees of abnormality.

The integrated abnormality detection unit 602 integrates the estimated degrees of abnormality across the divided spectrograms. If the area under the curve (AUC) of the receiver operating characteristic (ROC) curve has been computed in advance for each distance d as w_d, the unit multiplies the degree of abnormality at each distance d by a weighting coefficient that increases with w_d, and outputs the sum of these products as the integrated estimated degree of abnormality. The weighting coefficient is, for example, −1.0/log(w_d).
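The integration rule can be written in a few lines. The sketch assumes each per-distance degree of abnormality has already been reduced to one number (for example, averaged over the frames recorded at that distance), which the description leaves open; the function name is hypothetical. Because 0 < w_d < 1 implies log(w_d) < 0, the weight −1.0/log(w_d) is positive and grows as w_d approaches 1.

```python
import math


def integrated_anomaly_score(score_by_distance, auc_by_distance):
    """Sum of per-distance anomaly scores weighted by -1.0 / log(w_d).

    score_by_distance: {distance d: estimated degree of abnormality}
    auc_by_distance: {distance d: ROC AUC w_d measured in advance}
    """
    total = 0.0
    for d, score in score_by_distance.items():
        w_d = auc_by_distance[d]
        if not 0.0 < w_d < 1.0:
            raise ValueError("w_d must lie strictly between 0 and 1")
        # log(w_d) < 0, so the weight is positive and grows as w_d approaches 1.
        total += (-1.0 / math.log(w_d)) * score
    return total


# With w_d = e^-1 and e^-0.5 the weights are 1 and 2, so the result is 2*1 + 3*2 = 8.
combined = integrated_anomaly_score({0.5: 2.0, 2.0: 3.0},
                                    {0.5: math.exp(-1.0), 2.0: math.exp(-0.5)})
```

Distances whose detector discriminated better in advance (higher AUC) therefore contribute more to the integrated score.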

The abnormality display unit 603 displays the integrated estimated degree of abnormality and, when the value is equal to or greater than a certain threshold, also indicates that an abnormality is present.

FIG. 7 is a flowchart showing the sound extraction procedure performed by the distance-volume constrained sound extraction unit 203. First, in S701, the distance-volume constrained sound extraction unit 203 treats the input sound spectrogram as a matrix X and initializes unsupervised NMF on X, for example by initializing the activations and bases of the sound sources with random numbers, and proceeds to S702.
In S702, the distance-volume constrained sound extraction unit 203 executes unsupervised NMF on the matrix X and proceeds to S703.
In S703, the distance-volume constrained sound extraction unit 203 divides the activations obtained by the unsupervised NMF according to the recording distance d, and proceeds to S704. The activations represent the time components of the input sound spectrogram, and the relationship between time and distance in the input sound is given by the distance time series. The time period corresponding to each distance can therefore be identified from the distance time series, and the activations can be divided into those time periods.

In S704, the distance-volume constrained sound extraction unit 203 selects a basis k and proceeds to S705.
In S705, the distance-volume constrained sound extraction unit 203 computes, for the selected basis k, the average activation a_{k,d} over each divided time period, and proceeds to S706. That is, a_{k,d} is the mean activation of basis k during the time period recorded at distance d.
In S706, the distance-volume constrained sound extraction unit 203 determines whether the magnitude order of a_{k,d} over the distances matches the magnitude order of the reciprocal distances 1/d, i.e. whether the component becomes louder as the microphone approaches the diagnosis target. If the orders match (S706: Yes), the process proceeds to S707; otherwise (S706: No), it proceeds to S708.
In S707, the distance-volume constrained sound extraction unit 203 regards the selected basis k as a component of the diagnosis target 105, adds it to the set S, and proceeds to S708.
In S708, the distance-volume constrained sound extraction unit 203 determines whether all bases k have been selected. If an unselected basis k remains (S708: No), the process returns to S704; if all bases have been selected (S708: Yes), it proceeds to S709.

In S709, the distance-volume constrained sound extraction unit 203 restores the diagnosis-target sound spectrogram. Specifically, it computes X̂ = Σ_{k∈S} W_k H_k, the sum over all elements k of the set S of the product of the activation W_k and the basis vector H_k. The distance-volume constrained sound extraction unit 203 outputs X̂ as the diagnosis-target extracted sound and ends the process.
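The FIG. 7 procedure can be sketched with NumPy as below. The description does not name an NMF cost function or update rule, so the sketch assumes the Lee-Seung multiplicative updates for the Euclidean cost, and it interprets "the order of a_{k,d} matches the order of 1/d" as "a_{k,d} strictly decreases as d increases". `frame_distance` (one recording distance per spectrogram frame, derived from the distance time series) and all function names are hypothetical.

```python
import numpy as np


def nmf(X, K, n_iter=200, seed=0, eps=1e-12):
    """Unsupervised NMF X ~ W @ H with Lee-Seung multiplicative updates
    (Euclidean cost; the cost function is an assumption).
    X: (T, M) nonnegative spectrogram, W: (T, K) activations, H: (K, M) bases."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], K)) + eps  # S701: random initialisation
    H = rng.random((K, X.shape[1])) + eps
    for _ in range(n_iter):                # S702
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def select_target_bases(W, frame_distance):
    """S703-S708: keep bases whose mean activation per distance is ordered like 1/d."""
    distances = np.unique(frame_distance)  # ascending, so 1/d is descending
    selected = []
    for k in range(W.shape[1]):
        a = np.array([W[frame_distance == d, k].mean() for d in distances])  # a_{k,d}
        if np.all(np.diff(a) < 0):         # louder when closer
            selected.append(k)
    return selected


def reconstruct(W, H, selected):
    """S709: X_hat = sum over k in S of W_k H_k (outer products)."""
    X_hat = np.zeros((W.shape[0], H.shape[1]))
    for k in selected:
        X_hat += np.outer(W[:, k], H[k])
    return X_hat


def extract_by_distance(X, frame_distance, K, **nmf_kwargs):
    W, H = nmf(X, K, **nmf_kwargs)
    return reconstruct(W, H, select_target_bases(W, frame_distance))
```

In a toy activation matrix where basis 0 has mean activation 2 near and 1 far while basis 1 rises from 1 to 1.5, only basis 0 is kept as a diagnosis-target component.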

FIG. 8 is a flowchart showing a first modification of sound extraction. First, in S801, the distance-volume constrained sound extraction unit 203 initializes unsupervised NMF for the matrix X_far, the spectrogram of the time period recorded at the longest distance, and proceeds to S802.
In S802, the distance-volume constrained sound extraction unit 203 performs unsupervised NMF on X_far, outputs the initial solution W_far_ini of the background-noise activations and the background-noise basis vectors H_far_ini for X_far, and proceeds to S803.

In S803, the distance-volume constrained sound extraction unit 203 initializes semi-supervised NMF for the matrix X, the input sound spectrogram. The background-noise activations are initialized to W_far_ini for the time period recorded at the longest distance and to the time average of W_far_ini at all other times. The background-noise basis vectors are set to H_far_ini. The diagnosis-target activations are initialized to a sufficiently small positive value for the time period recorded at the longest distance and to random numbers at all other times. The diagnosis-target basis vectors are initialized with random numbers, and the process proceeds to S804.

In S804, the distance-volume constrained sound extraction unit 203 executes semi-supervised NMF on the matrix X and proceeds to S805.
In S805, the distance-volume constrained sound extraction unit 203 restores the diagnosis-target sound spectrogram in the same way as in S709 of FIG. 7, and ends the process.
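A sketch of the first modification follows, again assuming Euclidean-cost multiplicative updates. The description does not say whether the background-noise bases H_far_ini are updated during S804; this sketch keeps them fixed, a common convention for semi-supervised NMF, and updates only the diagnosis-target bases and all activations. The value `small=1e-6` for the "sufficiently small positive value", the component counts, and the function names are assumptions.

```python
import numpy as np


def _nmf(X, K, n_iter, rng, eps=1e-12):
    """Plain multiplicative-update NMF (Euclidean cost), X ~ W @ H."""
    W = rng.random((X.shape[0], K)) + eps
    H = rng.random((K, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def semi_supervised_extract(X, frame_distance, K_noise, K_target,
                            n_iter=200, seed=0, small=1e-6, eps=1e-12):
    rng = np.random.default_rng(seed)
    far = frame_distance == frame_distance.max()
    # S801-S802: noise activations and bases from the longest-distance frames X_far
    W_far_ini, H_far_ini = _nmf(X[far], K_noise, n_iter, rng)
    # S803: initialise the semi-supervised NMF on the whole spectrogram X
    W_noise = np.tile(W_far_ini.mean(axis=0), (X.shape[0], 1))
    W_noise[far] = W_far_ini
    W_target = rng.random((X.shape[0], K_target)) + eps
    W_target[far] = small
    H_target = rng.random((K_target, X.shape[1])) + eps
    H = np.vstack([H_far_ini, H_target])
    W = np.hstack([W_noise, W_target])
    # S804: update all activations but only the target bases; the noise
    # bases H_far_ini stay fixed (an assumption).
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H[K_noise:] *= (W[:, K_noise:].T @ X) / (W[:, K_noise:].T @ W @ H + eps)
    # S805: restore the diagnosis-target spectrogram from the target part only
    return W[:, K_noise:] @ H[K_noise:]
```

Initializing the target activations near zero at the longest distance encodes the prior that the far recording is dominated by background noise.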

FIG. 9 is a flowchart showing a second modification of sound extraction. FIG. 9 is identical to FIG. 8 except that, in S901 (corresponding to S804 of FIG. 8), semi-supervised NMF with distance regularization is executed on the matrix X. Distance regularization is a process that, at every NMF iteration, multiplies the activations by a constant for each distance so that the ratio of the mean activations at times whose recording distance is r times as large is 1/r.
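Distance regularization can be sketched as a rescaling step applied to the diagnosis-target activations after each NMF update. The description does not fix which distance anchors the scale; this sketch assumes the shortest distance d_min, so that for every basis the mean activation at distance d becomes (d_min/d) times its mean at d_min, which gives a ratio of 1/r between distances differing by a factor of r. The function name and the anchoring choice are hypothetical.

```python
import numpy as np


def distance_regularize(W_target, frame_distance, eps=1e-12):
    """Rescale target activations per distance so that, for each basis k,
    the mean activation at distance d equals (d_min / d) times that at d_min.

    W_target: (T, K) activations; frame_distance: (T,) distance per frame.
    Returns a rescaled copy; the input is left unchanged.
    """
    W = W_target.copy()
    distances = np.unique(frame_distance)  # ascending
    d_ref = distances[0]
    a_ref = W[frame_distance == d_ref].mean(axis=0)  # (K,) mean at d_min
    for d in distances[1:]:
        mask = frame_distance == d
        a_d = W[mask].mean(axis=0)
        # One constant per basis and distance, as the regularization requires.
        W[mask] *= (a_ref * d_ref / d) / (a_d + eps)
    return W
```

Inside the S901 loop, this would be applied to the diagnosis-target activation columns after each activation update.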

As described above, the sound extraction system and sound extraction method according to this embodiment acquire a plurality of input sounds recorded at positions at different distances from the diagnosis target, each associated with its distance; obtain a feature quantity for each input sound; and extract the feature quantity of the diagnosis target's sound using a plurality of combinations of feature quantities and their corresponding distances. Extraneous noise can therefore be removed from the input signal and the sound of the diagnosis target selectively extracted.

Moreover, the sound extraction system and sound extraction method according to the present embodiment do not require that training data for both the sound to be extracted and the noise to be removed be given in advance. For example, the only sound available for prior learning may be the sound of the diagnosis target during normal operation, mixed with noise. In addition, not only the normal sound of the diagnosis target but also abnormal sounds, which are difficult to learn in advance, can be extracted.

Furthermore, the sound extraction system and sound extraction method according to the present embodiment can extract the sound to be diagnosed from a plurality of input sounds recorded at a plurality of positions by moving the same microphone. The system can therefore be operated on a single portable terminal device. In addition, guiding the user by outputting the relationship between the positions where recording should be performed and the current position of the microphone stabilizes the recording conditions and reduces the user's psychological burden during recording.
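The guidance can be as simple as comparing the distance measured by the distance sensor with the next distance at which recording should take place. The sketch below is a hypothetical illustration; the function name, the message wording, and the 5 cm tolerance are assumptions, not part of the described system.

```python
def guidance(measured_m: float, target_m: float, tol_m: float = 0.05) -> str:
    """Hypothetical guidance message from measured vs. target distance (metres)."""
    diff = measured_m - target_m
    if abs(diff) <= tol_m:
        return "hold position and record"
    if diff > 0:
        return f"move {diff:.2f} m closer"
    return f"move {-diff:.2f} m away"
```

For example, `guidance(1.3, 1.0)` returns `"move 0.30 m closer"`.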

Furthermore, the sound extraction system and sound extraction method according to the present embodiment can detect an abnormality of the diagnosis target by calculating an estimated degree of abnormality based on the extracted feature amount of the sound of the diagnosis target. A display corresponding to the estimated degree of abnormality also lets the user recognize the abnormality. The estimated degree of abnormality may be obtained, for example, by comparing the extracted feature amount of the sound of the diagnosis target with the feature amount of the sound of the diagnosis target during proper operation. The feature amount of the sound during proper operation can also be learned in advance as a normal sound model.
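The text leaves the form of the normal sound model and the comparison open. As one hedged example, the sketch below swaps in a simple per-frequency Gaussian model of log spectra learned from normal-operation recordings and scores an extracted spectrogram by its mean squared z-score. Both function names and the choice of model are assumptions; spectrograms are assumed to be (time × frequency) arrays.

```python
import numpy as np


def fit_normal_model(S_normal, eps=1e-10):
    """Per-frequency mean/std of log spectra from normal operation (assumed model)."""
    L = np.log(S_normal + eps)
    return L.mean(axis=0), L.std(axis=0) + eps


def anomaly_degree(S_target, model, eps=1e-10):
    """Estimated degree of abnormality: mean squared z-score vs. the normal model."""
    mu, sigma = model
    z = (np.log(S_target + eps) - mu) / sigma
    return float(np.mean(z ** 2))
```

Under this model, extracted sounds resembling normal operation score close to 1, and larger values indicate a departure from it. A display threshold would have to be chosen separately.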

Furthermore, the sound extraction system and sound extraction method according to the present embodiment can extract, as the feature amount of the sound to be diagnosed, a feature component that is common to the plurality of feature amounts corresponding to the plurality of distances and exhibits a magnitude relationship according to the distance. That is, because recording is performed at multiple distances, a component that is loud at times recorded at short distances and quiet at times recorded at long distances can be identified as the sound to be diagnosed, and only that component can be extracted. Beyond a simple magnitude relationship, a feature component that shows a change according to distance can also be extracted as the feature amount of the sound to be diagnosed.
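A sketch of the magnitude-relationship test: given NMF activations and the recording distance of each frame, flag the components whose mean activation strictly decreases as distance increases. The function name and the strict-monotonicity criterion are assumptions chosen for illustration. The check needs recordings at two or more distances; with a single distance, every component trivially passes.

```python
import numpy as np


def distance_decreasing_components(W, frame_dist):
    """Flag components whose mean activation falls strictly with distance.

    W          : (T, K) activations
    frame_dist : (T,) recording distance of each frame
    Returns a (K,) boolean array.
    """
    W = np.asarray(W, dtype=float)
    d = np.asarray(frame_dist, dtype=float)
    dists = np.unique(d)  # sorted ascending
    means = np.array([W[d == x].mean(axis=0) for x in dists])  # (n_dist, K)
    return np.all(np.diff(means, axis=0) < 0, axis=0)
```

A component whose level is unaffected by distance, such as diffuse background noise, is not flagged.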

Furthermore, the sound extraction system and sound extraction method according to the present embodiment can calculate a frequency-domain signal for each of the plurality of input sounds, calculate a spectrogram from each frequency-domain signal, and use the spectrogram as the feature amount.
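A spectrogram of this kind is commonly computed with a short-time Fourier transform. The NumPy sketch below assumes a Hann window, a power spectrogram, and frame/hop sizes of 512/256 samples. None of these parameters come from the text.

```python
import numpy as np


def spectrogram(x, n_fft=512, hop=256):
    """Power spectrogram, shape (n_frames, n_fft // 2 + 1). Requires len(x) >= n_fft."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Slice overlapping windowed frames, then take the one-sided FFT of each.
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2
```

To form a single input matrix from recordings made at several distances, the per-recording spectrograms can be stacked along the time axis (for example with `np.vstack`) while keeping a parallel array with the distance of each frame.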

Although not described in this embodiment, the diagnosis target may be a device having multiple operation modes. In this case, recording is performed while the device is operating properly in one of the operation modes, and the normal sound model is learned.

Although this embodiment has been described including the learning of the normal sound model, the normal sound model may instead be given in advance. In addition, this embodiment has illustrated a configuration that extracts the sound to be diagnosed and then proceeds to abnormality detection. Alternatively, for example, the sound to be diagnosed may be reconstructed from the extracted spectrogram of the sound to be diagnosed, and the user may listen to the reconstructed sound to judge whether an abnormality is present. Furthermore, although this embodiment has illustrated recording at each specified distance, the sound to be diagnosed may also be extracted while the recording position is moved during continuous recording.
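When recording continues while the microphone moves, each spectrogram frame needs its own distance label. A minimal sketch, assuming the distance sensor yields timestamped readings (with strictly increasing timestamps) and that linear interpolation between readings is adequate: the helper names are hypothetical.

```python
import numpy as np


def frame_times(n_frames, hop, n_fft, fs):
    """Centre time (s) of each STFT frame, assuming frame i starts at sample i * hop."""
    return (np.arange(n_frames) * hop + n_fft / 2) / fs


def frame_distances(times_s, sensor_times_s, sensor_dist_m):
    """Linearly interpolate distance-sensor readings onto frame times.

    Frames outside the sensor time range take the nearest end reading
    (np.interp's default behaviour). sensor_times_s must be increasing.
    """
    return np.interp(times_s, sensor_times_s, sensor_dist_m)
```

The resulting per-frame distances could then be quantized into a few distance bins if the extraction step expects discrete recording distances.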

Note that the present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above have been described in detail to explain the present invention clearly, and the invention is not necessarily limited to one having all of the described configurations. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. Other configurations can also be added to, removed from, or substituted for part of the configuration of each embodiment.

Each of the configurations, functions, processing units, processing means, and the like described above may be realized partially or entirely in hardware, for example by designing them as an integrated circuit. Each of the configurations, functions, and the like may also be realized in software by a processor interpreting and executing a program that realizes each function. Information such as the programs, tables, and files that realize each function can be stored in a memory, a recording device such as a hard disk or an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.

The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines in a product are necessarily shown. In practice, almost all components may be considered to be interconnected.

101... Portable terminal, 102... Microphone, 103... AD converter, 104... Distance sensor, 105... Diagnosis target, 106... Background noise, 201... Multi-distance input sound acquisition unit, 202... Preprocessing unit, 203... Distance-volume constrained sound extraction unit, 204... Normal sound model learning unit, 205... Normal sound model database, 301... Abnormality detection unit, 302... Abnormality display unit, 501... Per-distance normal sound model learning unit, 502... Per-distance normal sound model database, 601... Per-distance abnormality detection unit, 602... Integrated abnormality detection unit, 603... Abnormality display unit

Claims

A sound extraction system comprising:
a multi-distance input sound acquisition unit that acquires a plurality of input sounds recorded at a plurality of positions at different distances from a diagnosis target, each in association with its distance;
a preprocessing unit that calculates a feature amount for each of the plurality of input sounds; and
a sound extraction unit that extracts a feature amount of the sound to be diagnosed using a plurality of combinations of the feature amount and the corresponding distance,
wherein the multi-distance input sound acquisition unit acquires the plurality of input sounds recorded at the plurality of positions by moving the same microphone.

The sound extraction system according to claim 1, wherein the multi-distance input sound acquisition unit outputs the relationship between the microphone and the plurality of positions so as to guide the microphone to the plurality of positions, and acquires the plurality of input sounds recorded at the plurality of positions as the microphone is moved.

The sound extraction system according to claim 1, further comprising an abnormality detection unit that calculates an estimated degree of abnormality indicating an abnormality of the diagnosis target based on the feature amount of the sound of the diagnosis target extracted by the sound extraction unit.

The sound extraction system according to claim 1, further comprising:
an abnormality detection unit that calculates an estimated degree of abnormality indicating an abnormality of the diagnosis target based on the feature amount of the sound of the diagnosis target extracted by the sound extraction unit; and
an abnormality display unit that displays a display according to the estimated degree of abnormality.

The sound extraction system according to claim 1, further comprising an abnormality detection unit that calculates an estimated degree of abnormality indicating an abnormality of the diagnosis target by comparing the feature amount of the sound of the diagnosis target extracted by the sound extraction unit with the feature amount of the sound of the diagnosis target during proper operation.

The sound extraction system according to claim 1, further comprising:
a learning unit that learns the feature amount of the sound of the diagnosis target during proper operation; and
an abnormality detection unit that calculates an estimated degree of abnormality indicating an abnormality of the diagnosis target by comparing the feature amount of the sound of the diagnosis target extracted by the sound extraction unit with the feature amount of the sound during proper operation.

The sound extraction system according to claim 1, wherein the sound extraction unit extracts, as the feature amount of the sound to be diagnosed, a feature component that is common to the plurality of feature amounts corresponding to the plurality of distances and exhibits a magnitude relationship according to the distance.

The sound extraction system according to claim 1, wherein the sound extraction unit extracts, as the feature amount of the sound to be diagnosed, a feature component that is common to the plurality of feature amounts corresponding to the plurality of distances and shows a change according to the distance.

The sound extraction system, wherein the preprocessing unit calculates a frequency-domain signal for each of the plurality of input sounds, calculates a spectrogram from each frequency-domain signal, and uses the spectrogram as the feature amount.

A sound extraction method comprising:
a multi-distance input sound acquisition step of acquiring a plurality of input sounds recorded at a plurality of positions at different distances from a diagnosis target, each in association with its distance;
a preprocessing step of calculating a feature amount for each of the plurality of input sounds; and
a sound extraction step of extracting a feature amount of the sound to be diagnosed using a plurality of combinations of the feature amount and the corresponding distance,
wherein the multi-distance input sound acquisition step acquires the plurality of input sounds recorded at the plurality of positions by moving the same microphone.