JP7651901B2

JP7651901B2 - IMAGE PROCESSING METHOD, IMAGE PROCESSING SYSTEM, AND PROGRAM

Info

Publication number: JP7651901B2
Application number: JP2021051181A
Authority: JP
Inventors: 陽前澤
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2025-03-27
Anticipated expiration: 2041-03-25
Also published as: JP2022149159A; WO2022202266A1; CN117043818A

Description

The present disclosure relates to a technique for analyzing a performance by a user.

Techniques have been proposed for estimating the region of an image captured by an image capture device in which a specific object is present. For example, Patent Literature 1 discloses a technique for detecting an object using a deep neural network.

Patent Literature 1: Published Japanese Translation of PCT Application No. 2020-528176

If a specific area, such as the area of the keyboard, could be extracted from a performance image capturing the playing of a musical instrument such as a keyboard instrument, the extracted area would be useful, for example, for analyzing the user's fingering. In view of these circumstances, one aspect of the present disclosure aims to improve the usefulness of performance images.

To solve the above problem, an image processing method according to one aspect of the present disclosure estimates, within a performance image that includes an image of a musical instrument and an image of a plurality of fingers of a user playing the musical instrument, a specific area that includes the image of the musical instrument, and extracts the specific area from the performance image.

An image processing system according to one aspect of the present disclosure includes an area estimation unit that estimates, within a performance image that includes an image of a musical instrument and an image of a plurality of fingers of a user playing the musical instrument, a specific area that includes the image of the musical instrument, and an area extraction unit that extracts the specific area from the performance image.

A program according to one aspect of the present disclosure causes a computer system to function as an area estimation unit that estimates, within a performance image that includes an image of a musical instrument and an image of a plurality of fingers of a user playing the musical instrument, a specific area that includes the image of the musical instrument, and as an area extraction unit that extracts the specific area from the performance image.

The drawings, in order, are as follows.
A block diagram illustrating the configuration of a performance analysis system according to a first embodiment.
A schematic diagram of a performance image.
A block diagram illustrating the functional configuration of the performance analysis system.
A schematic diagram of an analysis screen.
A flowchart of a finger position estimation process.
A flowchart of a left/right determination process.
An explanatory diagram of an image extraction process.
A flowchart of the image extraction process.
An explanatory diagram of machine learning for establishing an estimation model.
A schematic diagram of a reference image.
A flowchart of a matrix generation process.
A flowchart of an initial setting process.
A schematic diagram of a setting screen.
A flowchart of a performance analysis process.
An explanatory diagram regarding a problem of fingering estimation.
A block diagram illustrating the configuration of a performance analysis system according to a second embodiment.
A schematic diagram of control data in the second embodiment.
A flowchart of a performance analysis process in the second embodiment.
A flowchart of a performance analysis process in a third embodiment.
A flowchart of an initial setting process in a fourth embodiment.
A block diagram illustrating the configuration of a performance analysis system according to a fifth embodiment.
A block diagram illustrating the functional configuration of an image processing system according to a sixth embodiment.
A flowchart of first image processing in the sixth embodiment.
A block diagram illustrating the functional configuration of an image processing system according to a seventh embodiment.
A flowchart of second image processing in the seventh embodiment.

1: First embodiment
FIG. 1 is a block diagram illustrating the configuration of a performance analysis system 100 according to a first embodiment. A keyboard instrument 200 is connected to the performance analysis system 100 by wire or wirelessly. The keyboard instrument 200 is an electronic musical instrument having a keyboard 22 on which a plurality of (N) keys 21 are arranged. Each of the keys 21 of the keyboard 22 corresponds to a different pitch n (n = 1 to N). A user (that is, a performer) sequentially operates desired keys 21 of the keyboard instrument 200 with the left and right hands. The keyboard instrument 200 supplies performance data P representing the user's performance to the performance analysis system 100. The performance data P is time-series data that specifies the pitch n of each of the notes that the user plays in sequence. For example, the performance data P is data in a format conforming to the MIDI (Musical Instrument Digital Interface) standard.

The performance analysis system 100 is a computer system that analyzes the user's performance on the keyboard instrument 200. Specifically, the performance analysis system 100 analyzes the user's fingering. Fingering is the manner in which the user uses each finger of the left and right hands in playing the keyboard instrument 200. In other words, information indicating which finger the user uses to operate each key 21 of the keyboard instrument 200 is analyzed as the user's fingering.

The performance analysis system 100 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and an image capture device 15. The performance analysis system 100 is realized by, for example, a portable information device such as a smartphone or a tablet terminal, or a portable or stationary information device such as a personal computer. The performance analysis system 100 may be realized as a single device or as a plurality of devices configured separately from one another. The performance analysis system 100 may also be mounted in the keyboard instrument 200.

The control device 11 is composed of one or more processors that control each element of the performance analysis system 100. For example, the control device 11 is composed of one or more types of processors, such as a CPU (Central Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).

The storage device 12 is one or more memories that store the programs executed by the control device 11 and the various data used by the control device 11. The storage device 12 is composed of a known recording medium, such as a magnetic recording medium or a semiconductor recording medium, or a combination of multiple types of recording media. A portable recording medium that is attachable to and detachable from the performance analysis system 100, or a recording medium that the control device 11 can write to or read from via a communication network such as the Internet (for example, cloud storage), may also be used as the storage device 12.

The operation device 13 is an input device that accepts instructions from the user. The operation device 13 is, for example, a control operated by the user or a touch panel that detects contact by the user. An operation device 13 separate from the performance analysis system 100 (for example, a mouse or a keyboard) may be connected to the performance analysis system 100 by wire or wirelessly.

The display device 14 displays images under the control of the control device 11. For example, various display panels such as a liquid crystal display panel or an organic EL (Electroluminescence) panel are used as the display device 14. A display device 14 separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.

The image capture device 15 is an image input device that generates a time series of image data D1 by capturing images of a subject. The time series of image data D1 is video data representing a video. For example, the image capture device 15 includes an optical system such as a lens, an image sensor that receives incident light from the optical system, and a processing circuit that generates the image data D1 according to the amount of light received by the image sensor. An image capture device 15 separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.

The user adjusts the position or angle of the image capture device 15 relative to the keyboard instrument 200 so that the capture conditions recommended by the provider of the performance analysis system 100 are satisfied. Specifically, the image capture device 15 is installed above the keyboard instrument 200 and captures the keyboard 22 of the keyboard instrument 200 together with the user's left and right hands. Accordingly, as illustrated in FIG. 2, the image capture device 15 generates a time series of image data D1 representing a performance image G1 that includes an image g1 of the keyboard 22 of the keyboard instrument 200 (hereinafter "keyboard image") and an image g2 of the user's left and right hands (hereinafter "finger image"). That is, video data representing a video of the user playing the keyboard instrument 200 is generated in parallel with the performance. The capture conditions of the image capture device 15 are, for example, the capture range and the capture direction. The capture range is the range (angle of view) captured by the image capture device 15. The capture direction is the direction of the image capture device 15 relative to the keyboard instrument 200.

FIG. 3 is a block diagram illustrating the functional configuration of the performance analysis system 100. By executing a program stored in the storage device 12, the control device 11 functions as a performance analysis unit 30 and a display control unit 40. The performance analysis unit 30 generates fingering data Q representing the user's fingering by analyzing the performance data P and the image data D1. The fingering data Q specifies which of the user's fingers operated each of the keys 21 of the keyboard instrument 200. Specifically, the fingering data Q specifies the pitch n corresponding to the key 21 operated by the user and the number k of the finger that the user used to operate that key 21 (hereinafter "finger number"). The pitch n is, for example, a note number in the MIDI standard. The finger number k is a number assigned to each finger of the user's left and right hands.

The display control unit 40 causes the display device 14 to display various images. For example, the display control unit 40 causes the display device 14 to display an image 61 representing the result of the analysis by the performance analysis unit 30 (hereinafter "analysis screen"). FIG. 4 is a schematic diagram of the analysis screen 61. The analysis screen 61 is an image in which a plurality of note images 611 are arranged on a coordinate plane having a horizontal time axis and a vertical pitch axis. A note image 611 is displayed for each note played by the user. The position of a note image 611 along the pitch axis is set according to the pitch n of the note that the note image 611 represents. The position and length of a note image 611 along the time axis are set according to the sounding period of the note that the note image 611 represents.
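The placement rule for a note image 611 can be sketched as a small layout function. This is only an illustration: the function name, the pixel scales, and the choice of lowest displayed pitch are hypothetical and do not come from the disclosure.

```python
def note_rectangle(pitch_n, onset_s, offset_s,
                   px_per_second=100.0, px_per_pitch=8.0, lowest_pitch=21):
    """Return (x, y, width, height) for one note image.

    All scale parameters are assumed values chosen for illustration.
    x and width follow the sounding period on the time axis;
    y follows the pitch n on the pitch axis.
    """
    if offset_s < onset_s:
        raise ValueError("offset must not precede onset")
    x = onset_s * px_per_second                   # start of the sounding period
    width = (offset_s - onset_s) * px_per_second  # length of the sounding period
    y = (pitch_n - lowest_pitch) * px_per_pitch   # position on the pitch axis
    return (x, y, width, px_per_pitch)


if __name__ == "__main__":
    print(note_rectangle(60, 1.0, 1.5))
```

A renderer would call this once per note and draw the resulting rectangle; the direction in which the pitch axis grows on screen is left to the drawing code.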

In the note image 611 of each note, a symbol 612 (hereinafter "fingering symbol") is placed that corresponds to the finger number k that the fingering data Q specifies for that note. The letter "L" in a fingering symbol 612 denotes the left hand, and the letter "R" denotes the right hand. The digit in a fingering symbol 612 denotes the finger: "1" is the thumb, "2" the index finger, "3" the middle finger, "4" the ring finger, and "5" the little finger. Thus, for example, the fingering symbol 612 "R2" denotes the index finger of the right hand, and the fingering symbol 612 "L4" denotes the ring finger of the left hand. The note images 611 and fingering symbols 612 are displayed in different styles (for example, hue or gradation) for the right hand and the left hand. The display control unit 40 uses the fingering data Q to display the analysis screen 61 of FIG. 4 on the display device 14.

Among the note images 611 in the analysis screen 61, a note for which the estimate of the finger number k has low reliability is displayed with a note image 611 in a style different from the normal one (for example, with a dashed outline), together with a specific symbol "??" indicating that the estimate of the finger number k is invalid.
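The labeling convention for fingering symbols 612 is simple enough to sketch directly. The function name and the `reliable` flag below are hypothetical; the disclosure states only the resulting labels ("L"/"R" plus a digit 1 to 5, or "??" for an unreliable estimate).

```python
FINGER_NAMES = {1: "thumb", 2: "index", 3: "middle", 4: "ring", 5: "little"}


def fingering_symbol(hand, finger, reliable=True):
    """Build the label shown in a note image.

    hand:     "L" for the left hand, "R" for the right hand
    finger:   1 (thumb) to 5 (little finger)
    reliable: hypothetical flag; False yields the invalid marker "??"
    """
    if not reliable:
        return "??"
    if hand not in ("L", "R"):
        raise ValueError("hand must be 'L' or 'R'")
    if finger not in FINGER_NAMES:
        raise ValueError("finger must be in 1..5")
    return f"{hand}{finger}"


if __name__ == "__main__":
    print(fingering_symbol("R", 2))         # right index finger
    print(fingering_symbol("L", 4))         # left ring finger
    print(fingering_symbol("L", 4, False))  # unreliable estimate
```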

As illustrated in FIG. 3, the performance analysis unit 30 includes a finger position data generation unit 31 and a fingering data generation unit 32. The finger position data generation unit 31 generates finger position data F by analyzing the performance image G1. The finger position data F represents the position of each finger of the user's left hand and the position of each finger of the user's right hand. Because the first embodiment thus distinguishes the positions of the user's fingers between the left hand and the right hand, fingering can be estimated with the left and right hands distinguished. The fingering data generation unit 32, in turn, generates the fingering data Q using the performance data P and the finger position data F. The finger position data F and the fingering data Q are generated for each unit period on the time axis. Each unit period is a period (frame) of predetermined length.

A: Finger position data generation unit 31
The finger position data generation unit 31 includes an image extraction unit 311, a matrix generation unit 312, a finger position estimation unit 313, and a projective transformation unit 314.

[Finger position estimation unit 313]
The finger position estimation unit 313 estimates the position c[h,f] of each finger of the user's left and right hands by analyzing the performance image G1 represented by the image data D1. The position c[h,f] of each finger is the position of the corresponding fingertip in an x-y coordinate system set on the performance image G1. It is expressed as the pair (x[h,f], y[h,f]) of the coordinate x[h,f] on the x-axis and the coordinate y[h,f] on the y-axis. The positive direction of the x-axis corresponds to the rightward direction of the keyboard 22 (from low notes toward high notes), and the negative direction of the x-axis corresponds to the leftward direction of the keyboard 22 (from high notes toward low notes). The symbol h is a variable indicating either the left hand or the right hand (h = 1, 2): the value "1" denotes the left hand and the value "2" denotes the right hand. The variable f is the number of each finger on each hand (f = 1 to 5): "1" denotes the thumb, "2" the index finger, "3" the middle finger, "4" the ring finger, and "5" the little finger. Thus, for example, the position c[1,2] illustrated in FIG. 2 is the position of the tip of the index finger (f = 2) of the left hand (h = 1), and the position c[2,4] is the position of the tip of the ring finger (f = 4) of the right hand (h = 2).

FIG. 5 is a flowchart illustrating a specific procedure of the process by which the finger position estimation unit 313 estimates the position of each of the user's fingers (hereinafter "finger position estimation process"). The finger position estimation process includes an image analysis process Sa1, a left/right determination process Sa2, and an interpolation process Sa3.

The image analysis process Sa1 estimates, by analyzing the performance image G1, the position c[h,f] of each finger of one of the user's left and right hands (hereinafter "first hand") and the position c[h,f] of each finger of the other hand (hereinafter "second hand"). Specifically, the finger position estimation unit 313 estimates the positions c[h,1] to c[h,5] of the fingers of the first hand and the positions c[h,1] to c[h,5] of the fingers of the second hand by image recognition processing that estimates the user's skeleton or joints from the image. A known image recognition process such as MediaPipe or OpenPose is used for the image analysis process Sa1. When a fingertip is not detected in the performance image G1, the coordinate x[h,f] of that fingertip on the x-axis is set to an invalid value such as "0".

The image analysis process Sa1 estimates the positions c[h,1] to c[h,5] of the fingers of the user's first hand and the positions c[h,1] to c[h,5] of the fingers of the second hand, but it cannot identify which of the first and second hands is the user's left hand and which is the right hand. Moreover, because the user's right and left arms may cross while playing the keyboard instrument 200, it is not appropriate to decide between left hand and right hand solely from the coordinates x[h,f] of the positions c[h,f] estimated by the image analysis process Sa1. If the image capture device 15 captured a portion including the user's arms and torso, the user's left and right hands could be estimated from the performance image G1 based on the coordinates of the user's shoulders and arms. However, this would require the image capture device 15 to capture a wide range and would increase the processing load of the image analysis process Sa1.

In view of these circumstances, the finger position estimation unit 313 of the first embodiment executes the left/right determination process Sa2 of FIG. 5, which determines which of the user's left and right hands each of the first and second hands corresponds to. That is, for each of the first and second hands, the finger position estimation unit 313 fixes the variable h in the finger positions c[h,f] to either the value "1", denoting the left hand, or the value "2", denoting the right hand.

While the keyboard instrument 200 is being played, the backs of both the left and right hands face vertically upward, so the performance image G1 captured by the image capture device 15 includes images of the backs of both hands. Therefore, for the left hand in the performance image G1, the thumb position c[h,1] lies to the right of the little finger position c[h,5], and for the right hand in the performance image G1, the thumb position c[h,1] lies to the left of the little finger position c[h,5]. In view of this, in the left/right determination process Sa2, the finger position estimation unit 313 determines that, of the first and second hands, the hand whose thumb position c[h,1] lies to the right (positive direction of the x-axis) of the little finger position c[h,5] is the left hand (h = 1). Conversely, the finger position estimation unit 313 determines that the hand whose thumb position c[h,1] lies to the left (negative direction of the x-axis) of the little finger position c[h,5] is the right hand (h = 2).

FIG. 6 is a flowchart illustrating a specific procedure of the left/right determination process Sa2. The finger position estimation unit 313 calculates a determination index γ[h] for each of the first hand and the second hand (Sa21). The determination index γ[h] is calculated, for example, by formula (1).

The symbol μ[h] in formula (1) is the average (for example, the simple average) of the coordinates x[h,1] to x[h,5] of the five fingers of each of the first and second hands. As can be understood from formula (1), the determination index γ[h] is negative when the coordinate x[h,f] decreases from the thumb to the little finger (left hand), and positive when the coordinate x[h,f] increases from the thumb to the little finger (right hand). The finger position estimation unit 313 therefore determines that, of the first and second hands, the hand whose determination index γ[h] is negative is the left hand, and sets the variable h to the value "1" (Sa22). Likewise, the finger position estimation unit 313 determines that the hand whose determination index γ[h] is positive is the right hand, and sets the variable h to the value "2" (Sa23). With the left/right determination process Sa2 described above, the positions c[h,f] of the user's fingers can be separated into right hand and left hand by simple processing that uses the relationship between the thumb position and the little finger position.
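Formula (1) itself does not appear in this text. The sketch below therefore assumes one index with the stated sign behavior, γ[h] = Σ_f (f − 3)(x[h,f] − μ[h]), which is negative when x decreases from thumb to little finger and positive when it increases. This expression and the function names are illustrative assumptions, not the formula of the disclosure.

```python
def determination_index(xs):
    """Assumed determination index for one hand.

    xs: x-coordinates of the fingertips, ordered thumb (f=1) to little finger (f=5).
    Computes sum over f of (f - 3) * (x[f] - mu), where mu is the simple
    average of xs. This form is an assumption chosen to match the sign
    behavior described in the text.
    """
    if len(xs) != 5:
        raise ValueError("expected five fingertip coordinates")
    mu = sum(xs) / 5
    return sum((f - 3) * (x - mu) for f, x in enumerate(xs, start=1))


def classify_hand(xs):
    """Return 1 (left hand) if the index is negative, 2 (right hand) if positive.

    Returns None when the index is exactly zero; the text does not say how
    that case is handled, so it is left undecided here.
    """
    gamma = determination_index(xs)
    if gamma < 0:
        return 1
    if gamma > 0:
        return 2
    return None


if __name__ == "__main__":
    # A left hand that has crossed over to the high-note side of the keyboard.
    print(classify_hand([800, 780, 760, 740, 720]))
```

Because the index depends only on how x varies across the fingers of one hand, and not on where the hand sits along the keyboard, the result is unchanged when the arms cross.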

Through the image analysis process Sa1 and the left/right determination process Sa2, the position c[h,f] of each of the user's fingers is estimated for each unit period. However, the position c[h,f] may fail to be estimated properly for various reasons, such as noise in the performance image G1. Therefore, when the position c[h,f] is missing in a particular unit period (hereinafter "missing period"), the finger position estimation unit 313 calculates the position c[h,f] in the missing period by an interpolation process Sa3 that uses the positions c[h,f] in the unit periods before and after the missing period. For example, when the position c[h,f] is missing in the middle unit period (the missing period) of three consecutive unit periods on the time axis, the average of the position c[h,f] in the unit period immediately before the missing period and the position c[h,f] in the unit period immediately after it is calculated as the position c[h,f] in the missing period.
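The three-period example of the interpolation process Sa3 can be sketched as follows. Representing a missing position as `None` and handling only isolated single-period gaps are assumptions made for this illustration; the text does not specify how longer gaps are treated.

```python
def interpolate_missing(track):
    """Fill isolated missing fingertip positions by averaging neighbors.

    track: list with one entry per unit period, each an (x, y) tuple or
           None when the position is missing (assumed representation).
    Only a gap with valid positions immediately before and after it is
    filled; other gaps are left as None.
    """
    filled = list(track)
    for t in range(1, len(track) - 1):
        before, after = track[t - 1], track[t + 1]
        if track[t] is None and before is not None and after is not None:
            filled[t] = ((before[0] + after[0]) / 2,
                         (before[1] + after[1]) / 2)
    return filled


if __name__ == "__main__":
    print(interpolate_missing([(10.0, 20.0), None, (14.0, 24.0)]))
```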

[Image extraction unit 311]
As described above, the performance image G1 includes the keyboard image g1 and the finger image g2. As illustrated in FIG. 7, the image extraction unit 311 of FIG. 3 extracts a specific area B (hereinafter "specific area") from the performance image G1. The specific area B is the area of the performance image G1 that includes the keyboard image g1 and the finger image g2. The finger image g2 corresponds to an image of at least a part of the user's body.

FIG. 8 is a flowchart illustrating a specific procedure of the process by which the image extraction unit 311 extracts the specific area B from the performance image G1 (hereinafter "image extraction process"). The image extraction process includes an area estimation process Sb1 and an area extraction process Sb2.

The area estimation process Sb1 estimates the specific area B for the performance image G1 represented by the image data D1. Specifically, through the area estimation process Sb1, the image extraction unit 311 generates from the image data D1 an image processing mask M representing the specific area B. As illustrated in FIG. 7, the image processing mask M is a mask of the same size as the performance image G1 and is composed of a plurality of elements corresponding to the individual pixels of the performance image G1. Specifically, the image processing mask M is a binary mask in which each element inside the area corresponding to the specific area B of the performance image G1 is set to the value "1" and each element outside the specific area B is set to the value "0". By executing the area estimation process Sb1, the control device 11 realizes an element (area estimation unit) that estimates the specific area B of the performance image G1.
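One natural way to use such a binary mask is element-wise multiplication with the image, which keeps pixels where the mask is 1 and zeroes the rest. The sketch below assumes a single-channel image stored as nested lists; whether the area extraction process Sb2 works exactly this way is not stated in this passage, so the function is illustrative only.

```python
def apply_mask(image, mask):
    """Keep pixels where mask is 1 and zero out pixels where mask is 0.

    image: 2-D list of pixel values (single channel, assumed for brevity)
    mask:  2-D list of 0/1 values with the same shape as image
    """
    if len(image) != len(mask) or any(
        len(row) != len(mrow) for row, mrow in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [px * m for px, m in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]


if __name__ == "__main__":
    image = [[5, 6], [7, 8]]
    mask = [[0, 1], [1, 0]]
    print(apply_mask(image, mask))
```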

As illustrated in FIG. 3, the image extraction unit 311 uses an estimation model 51 to generate the image processing mask M. That is, the image extraction unit 311 generates the image processing mask M by inputting the image data D1 representing the performance image G1 into the estimation model 51. The estimation model 51 is a statistical model that has learned the relationship between the image data D1 and the image processing mask M through machine learning. The estimation model 51 is configured, for example, as a deep neural network (DNN). Any type of deep neural network, such as a convolutional neural network (CNN) or a recurrent neural network (RNN), may be used as the estimation model 51. The estimation model 51 may also be configured as a combination of multiple types of deep neural networks, and additional elements such as long short-term memory (LSTM) may be incorporated into it.

FIG. 9 is an explanatory diagram of the machine learning that establishes the estimation model 51. For example, the estimation model 51 is established through machine learning by a machine learning system 900 that is separate from the performance analysis system 100, and is then provided to the performance analysis system 100. The machine learning system 900 is, for example, a server system that can communicate with the performance analysis system 100 via a communication network such as the Internet. The estimation model 51 is transmitted from the machine learning system 900 to the performance analysis system 100 via the communication network.

Multiple pieces of training data T are used for the machine learning of the estimation model 51. Each piece of training data T is a combination of training image data Dt and a training image processing mask Mt. The image data Dt represents a known image that includes a keyboard image g1 of a keyboard instrument and an image of the surroundings of that instrument. The model of the keyboard instrument and the shooting conditions (for example, the shooting range and the shooting direction) differ from one piece of image data Dt to another. That is, the image data Dt is prepared in advance by photographing each of multiple types of keyboard instruments under different shooting conditions. The image data Dt may also be prepared by a known image synthesis technique. The image processing mask Mt of each piece of training data T is a mask representing the specific area B within the known image represented by the image data Dt of that training data T. Specifically, the elements of the image processing mask Mt inside the area corresponding to the specific area B are set to "1", and the elements outside that area are set to "0". In other words, the image processing mask Mt is the correct answer that the estimation model 51 should output when the image data Dt is input.

The machine learning system 900 calculates an error function representing the error between the image processing mask M that an initial or provisional model 51a (hereinafter referred to as the "provisional model") outputs when the image data Dt of each piece of training data T is input, and the image processing mask Mt of that training data T. The machine learning system 900 then updates the multiple variables of the provisional model 51a so as to reduce the error function. The provisional model 51a obtained once this processing has been repeated for each piece of training data T is fixed as the estimation model 51. Consequently, the estimation model 51 outputs a statistically valid image processing mask M for unknown image data D1, based on the latent relationship between the image data Dt and the image processing masks Mt in the training data T. In other words, the estimation model 51 is a trained model that has learned the relationship between the image data Dt and the image processing mask Mt.
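The training loop described above can be sketched with a toy model. This is only an illustration under stated assumptions: the patent does not specify the error function or the update rule, so the sketch assumes binary cross-entropy and plain gradient descent, and it replaces the deep neural network with a hypothetical two-variable per-pixel logistic model (`w`, `b`). All names and the training data are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_mask(pixels, w, b):
    """Provisional model: per-pixel probability of belonging to the specific area."""
    return [sigmoid(w * v + b) for v in pixels]

def error_function(predicted, target):
    """Mean binary cross-entropy between the predicted mask and the correct mask Mt."""
    eps = 1e-12
    total = 0.0
    for p, t in zip(predicted, target):
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(target)

def train(training_data, steps=200, lr=0.5):
    """Repeatedly update the variables (w, b) so that the error function decreases."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(steps):
        epoch_error = 0.0
        for pixels, mask_t in training_data:
            pred = predict_mask(pixels, w, b)
            epoch_error += error_function(pred, mask_t)
            n = len(pixels)
            # Gradient of the mean cross-entropy for a logistic model.
            grad_w = sum((p - t) * v for p, t, v in zip(pred, mask_t, pixels)) / n
            grad_b = sum(p - t for p, t in zip(pred, mask_t)) / n
            w -= lr * grad_w
            b -= lr * grad_b
        history.append(epoch_error / len(training_data))
    return w, b, history

# Hypothetical training data T: (flattened image data Dt, correct mask Mt).
TRAINING_DATA = [
    ([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]),
    ([0.2, 0.95, 0.85, 0.05], [0, 1, 1, 0]),
]
w, b, history = train(TRAINING_DATA)
```

A real estimation model would have far more variables and would typically be trained with a deep learning framework, but the structure (predict, measure the error against Mt, update) is the same.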

As described above, in the first embodiment, the image processing mask M representing the specific area B is generated by inputting the image data D1 of the performance image G1 into the trained estimation model 51. The specific area B can therefore be identified with high accuracy for a wide variety of unknown performance images G1.

The area extraction process Sb2 in FIG. 8 extracts the specific area B from the performance image G1 represented by the image data D1. Specifically, the area extraction process Sb2 is image processing that relatively emphasizes the specific area B by selectively removing the areas of the performance image G1 other than the specific area B. The image extraction unit 311 of the first embodiment generates image data D2 by applying the image processing mask M to the image data D1 (performance image G1). That is, the image extraction unit 311 multiplies the pixel value of each pixel in the performance image G1 by the element of the image processing mask M corresponding to that pixel. As illustrated in FIG. 7, the area extraction process Sb2 generates image data D2 representing an image in which the areas other than the specific area B have been removed from the performance image G1 (hereinafter referred to as the "performance image G2"). In other words, the performance image G2 represented by the image data D2 is an image obtained by extracting the keyboard image g1 and the finger image g2 from the performance image G1. By executing the area extraction process Sb2, the control device 11 realizes an element (area extraction unit) that extracts the specific area B of the performance image G1.
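The mask application in the area extraction process Sb2 amounts to an element-wise product. A minimal sketch, assuming a grayscale image stored as a list of rows (the function name `apply_mask` and the sample values are illustrative, not from the patent):

```python
def apply_mask(image, mask):
    """Multiply each pixel value by the mask element at the same position.

    `image` stands in for the performance image G1 and `mask` for the binary
    image processing mask M; both are lists of rows with identical shape.
    """
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same size")
    return [
        [pixel * element for pixel, element in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]

G1 = [
    [12, 40, 41, 9],
    [15, 200, 210, 7],
    [11, 190, 205, 8],
]
M = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
G2 = apply_mask(G1, M)  # pixels outside the specific area B become 0
```

For a color image, the same mask element would be applied to every channel of the pixel.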

[Projective transformation unit 314]
The position c[h,f] of each finger estimated by the finger position estimation process is a coordinate in the x-y coordinate system set on the performance image G1. The conditions under which the image capture device 15 photographs the keyboard instrument 200 may vary with circumstances such as the environment in which the keyboard instrument 200 is used. For example, compared with the ideal shooting conditions illustrated in FIG. 2, the shooting range may be too wide (or too narrow), or the shooting direction may be inclined with respect to the vertical direction. The values of the coordinates x[h,f] and y[h,f] of each position c[h,f] therefore depend on the conditions under which the image capture device 15 captured the performance image G1. For this reason, the projective transformation unit 314 of the first embodiment converts the position c[h,f] of each finger in the performance image G1 into a position C[h,f] in an X-Y coordinate system that is substantially independent of the shooting conditions of the image capture device 15 (image registration). The finger position data F generated by the finger position data generation unit 31 represents the positions C[h,f] after conversion by the projective transformation unit 314. That is, the finger position data F specifies the positions C[1,1] to C[1,5] of the fingers of the user's left hand and the positions C[2,1] to C[2,5] of the fingers of the user's right hand.

As illustrated in FIG. 10, the X-Y coordinate system is set on a predetermined image Gref (hereinafter referred to as the "reference image"). The reference image Gref is an image of the keyboard of a standard keyboard instrument (hereinafter referred to as the "reference instrument") photographed under standard shooting conditions. The reference image Gref is not limited to an image of an actual keyboard. For example, an image synthesized by a known image synthesis technique may be used as the reference image Gref. Image data Dref representing the reference image Gref (hereinafter referred to as the "reference data") and auxiliary data A relating to the reference image Gref are stored in the storage device 12.

The auxiliary data A specifies, for each key 21 of the reference instrument, the combination of the area Rn in which that key exists in the reference image Gref (hereinafter referred to as the "unit area") and the pitch n corresponding to that key 21. In other words, the auxiliary data A defines the unit area Rn corresponding to each pitch n in the reference image Gref.
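One way to picture the auxiliary data A is as a lookup table from pitch to unit area. The sketch below is an assumption-laden illustration: it models each unit area Rn as an axis-aligned rectangle (real key outlines are more complex), uses MIDI-style note numbers for the pitch n, and all names and coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class UnitArea:
    """Simplified unit area Rn: a rectangle in the X-Y coordinate system."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, X: float, Y: float) -> bool:
        # Half-open bounds so that adjacent keys never both claim a border point.
        return self.x_min <= X < self.x_max and self.y_min <= Y < self.y_max

    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)

# Hypothetical auxiliary data A: pitch n -> unit area Rn.
AUXILIARY_DATA: Dict[int, UnitArea] = {
    60: UnitArea(0.0, 0.0, 10.0, 50.0),
    62: UnitArea(10.0, 0.0, 20.0, 50.0),
    64: UnitArea(20.0, 0.0, 30.0, 50.0),
}

def pitch_at(aux: Dict[int, UnitArea], X: float, Y: float) -> Optional[int]:
    """Return the pitch whose unit area contains (X, Y), or None if there is none."""
    for pitch, region in aux.items():
        if region.contains(X, Y):
            return pitch
    return None
```

For example, `pitch_at(AUXILIARY_DATA, 12.0, 10.0)` looks up which key of the reference instrument lies under the point (12, 10).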

The conversion from a position c[h,f] in the x-y coordinate system to a position C[h,f] in the X-Y coordinate system uses a projective transformation with a transformation matrix W, as expressed by formula (2). In formula (2), the symbol X denotes the coordinate on the X axis of the X-Y coordinate system, and the symbol Y denotes the coordinate on the Y axis. The symbol s is an adjustment value for matching the scale between the x-y coordinate system and the X-Y coordinate system.
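Formula (2) itself is not reproduced in this text. The following sketch therefore assumes the conventional homogeneous-coordinate form of a projective transformation, in which a 3x3 matrix W maps (x, y, 1) to (X, Y, s) and the transformed coordinates are X/s and Y/s. The function name and the sample matrix are illustrative only.

```python
def project_point(W, x, y):
    """Apply a 3x3 projective transformation W to the point (x, y).

    Assumed convention: [X, Y, s]^T = W [x, y, 1]^T, result = (X / s, Y / s).
    """
    X = W[0][0] * x + W[0][1] * y + W[0][2]
    Y = W[1][0] * x + W[1][1] * y + W[1][2]
    s = W[2][0] * x + W[2][1] * y + W[2][2]
    if s == 0:
        raise ValueError("point maps to infinity under this transformation")
    return X / s, Y / s

# Hypothetical matrix: scale by 2, then translate by (1, -1).
W_EXAMPLE = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, -1.0],
    [0.0, 0.0, 1.0],
]
C = project_point(W_EXAMPLE, 3.0, 4.0)
```

Each finger position c[h,f] would be passed through the same function to obtain C[h,f].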

[Matrix generation unit 312]
The matrix generation unit 312 in FIG. 3 generates the transformation matrix W of formula (2) that the projective transformation unit 314 applies in the projective transformation. FIG. 11 is a flowchart illustrating the specific procedure of the process in which the matrix generation unit 312 generates the transformation matrix W (hereinafter referred to as the "matrix generation process"). The matrix generation process of the first embodiment is executed on the performance image G2 (image data D2) produced by the image extraction process. Compared with a configuration in which the matrix generation process is executed on the entire performance image G1, including areas other than the specific area B, this configuration can generate an appropriate transformation matrix W that brings the keyboard image g1 close to the reference image Gref with high accuracy.

The matrix generation process includes an initial setting process Sc1 and a matrix update process Sc2. The initial setting process Sc1 sets an initial matrix W0, which is the initial value of the transformation matrix W. The details of the initial setting process Sc1 are described later.

The matrix update process Sc2 generates the transformation matrix W by iteratively updating the initial matrix W0. That is, the matrix generation unit 312 generates the transformation matrix W by iteratively updating the initial matrix W0 so that the keyboard image g1 of the performance image G2 approaches the reference image Gref under the projective transformation using the transformation matrix W. For example, the transformation matrix W is generated so that the coordinate X/s on the X axis of a specific point in the reference image Gref approximates or matches the coordinate x on the x axis of the corresponding point in the keyboard image g1, and the coordinate Y/s on the Y axis of that point in the reference image Gref approximates or matches the coordinate y on the y axis of the corresponding point in the keyboard image g1. In other words, the transformation matrix W is generated so that the coordinates of the key 21 corresponding to a specific pitch in the keyboard image g1 are converted, by the projective transformation using the transformation matrix W, into the coordinates of the key 21 corresponding to that pitch in the reference image Gref. By executing the matrix update process Sc2 described above, the control device 11 realizes an element (matrix generation unit 312) that generates the transformation matrix W.

One conceivable form of the matrix update process Sc2 is to update the transformation matrix W so that image features such as SIFT (Scale-Invariant Feature Transform) features become closer between the reference image Gref and the keyboard image g1. However, because the keyboard image g1 repeats a pattern in which multiple keys 21 are arranged in the same way, an approach based on image features may fail to estimate the transformation matrix W properly.

In view of the above, the matrix generation unit 312 of the first embodiment iteratively updates the initial matrix W0 in the matrix update process Sc2 so that the enhanced correlation coefficient (ECC) between the reference image Gref and the keyboard image g1 increases (ideally, is maximized). Compared with the feature-based approach described above, this approach can generate an appropriate transformation matrix W that brings the keyboard image g1 close to the reference image Gref with high accuracy. The generation of a transformation matrix using the enhanced correlation coefficient is also disclosed in Georgios D. Evangelidis and Emmanouil Z. Psarakis, "Parametric Image Alignment Using Enhanced Correlation Coefficient Maximization", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 30, No. 10, October 2008. Although the enhanced correlation coefficient is well suited to generating the transformation matrix W used to transform the keyboard image g1, the transformation matrix W may instead be generated so that image features such as the SIFT features mentioned above become closer between the reference image Gref and the keyboard image g1.
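To make the quantity being maximized concrete, the sketch below computes a correlation coefficient between two images of the same size: each image is made zero-mean, normalized, and the two are correlated. This follows the general form of the criterion in the cited paper as understood here; it shows only the score, not the iterative update of W, and the function name and sample images are illustrative.

```python
import math

def enhanced_correlation_coefficient(reference, candidate):
    """Correlation of the zero-mean, unit-norm versions of two equally sized images.

    `reference` stands in for the reference image Gref and `candidate` for the
    keyboard image g1 after warping with the current matrix; both are lists of rows.
    """
    r = [v for row in reference for v in row]
    c = [v for row in candidate for v in row]
    if len(r) != len(c) or not r:
        raise ValueError("images must be non-empty and have the same size")
    mean_r = sum(r) / len(r)
    mean_c = sum(c) / len(c)
    zr = [v - mean_r for v in r]
    zc = [v - mean_c for v in c]
    norm_r = math.sqrt(sum(v * v for v in zr))
    norm_c = math.sqrt(sum(v * v for v in zc))
    if norm_r == 0 or norm_c == 0:
        raise ValueError("a constant image has no defined correlation")
    return sum(a * b for a, b in zip(zr, zc)) / (norm_r * norm_c)

REF = [[0.0, 1.0], [2.0, 3.0]]
# Same content with a different brightness and contrast.
BRIGHTER = [[2.0 * v + 5.0 for v in row] for row in REF]
score = enhanced_correlation_coefficient(REF, BRIGHTER)
```

Because of the mean removal and normalization, the score is unaffected by uniform changes in brightness and contrast, which is one reason such a criterion is attractive when lighting differs between the performance image and the reference image.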

The projective transformation unit 314 in FIG. 3 executes a projective transformation process, which is a projective transformation of the performance image G1 using the transformation matrix W generated by the matrix generation process. Through the projective transformation process, the performance image G1 is converted into an image equivalent to one captured under the same shooting conditions as the reference image Gref (hereinafter referred to as the "transformed image"). For example, the area of the transformed image corresponding to the key 21 of pitch n substantially coincides with the unit area Rn of that pitch n in the reference image Gref. In addition, the x-y coordinate system of the transformed image substantially coincides with the X-Y coordinate system of the reference image Gref. In the projective transformation process, the projective transformation unit 314 converts the position c[h,f] of each finger into the position C[h,f] in the X-Y coordinate system, as expressed by formula (2). By executing the projective transformation process described above, the control device 11 realizes an element (projective transformation unit 314) that performs the projective transformation of the performance image G1.

The display control unit 40 causes the display device 14 to display the transformed image generated by the projective transformation process. For example, the display control unit 40 causes the display device 14 to display the transformed image and the reference image Gref superimposed on each other. As described above, the area of the transformed image corresponding to the key 21 of each pitch n and the unit area Rn of the reference image Gref corresponding to that pitch n overlap each other.

As described above, in the first embodiment, the transformation matrix W is generated so that the keyboard image g1 of the performance image G1 approaches the reference image Gref, and the projective transformation process using the transformation matrix W is executed on the performance image G1. The performance image G1 of the keyboard instrument 200 played by the user can therefore be converted into a transformed image that corresponds to the shooting conditions of the reference instrument in the reference image Gref.

FIG. 12 is a flowchart illustrating the specific procedure of the initial setting process Sc1. When the initial setting process Sc1 starts, the projective transformation unit 314 causes the display device 14 to display the setting screen 62 illustrated in FIG. 13 (Sc11). The setting screen 62 includes the performance image G1 captured by the image capture device 15 and an instruction 622 for the user. The instruction 622 is a message prompting the user to select, within the keyboard image g1 in the performance image G1, an area 621 (hereinafter referred to as the "target area") corresponding to one or more specific pitches n (hereinafter referred to as the "target pitch"). By operating the operation device 13 while viewing the setting screen 62, the user selects the target area 621 corresponding to the target pitch n in the performance image G1. The projective transformation unit 314 accepts the user's selection of the target area 621 (Sc12).

The projective transformation unit 314 identifies, in the reference image Gref represented by the reference data Dref, the one or more unit areas Rn that the auxiliary data A specifies for the target pitch n (Sc13). The projective transformation unit 314 then calculates, as the initial matrix W0, a matrix for projectively transforming the target area 621 of the performance image G1 onto the one or more unit areas Rn identified in the reference image Gref (Sc14). As this description shows, the initial setting process Sc1 of the first embodiment sets the initial matrix W0 so that the target area 621 designated by the user in the keyboard image g1 approaches, under the projective transformation using the initial matrix W0, the unit area Rn corresponding to the target pitch n in the reference image Gref.
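As a rough illustration of step Sc14, the sketch below builds an initial matrix for the simplest possible case. It assumes that both the target area 621 and the unit area Rn are axis-aligned bounding boxes, so the resulting 3x3 matrix contains only scaling and translation, which is a special case of a projective transformation. A general solution would solve for a full homography from corner correspondences. All names and numbers are hypothetical.

```python
def initial_matrix(target_box, unit_box):
    """Build W0 mapping an axis-aligned target box onto an axis-aligned unit box.

    Each box is (x_min, y_min, x_max, y_max). The target box is in the x-y
    coordinates of the performance image, the unit box in the X-Y coordinates
    of the reference image.
    """
    tx0, ty0, tx1, ty1 = target_box
    ux0, uy0, ux1, uy1 = unit_box
    if tx1 == tx0 or ty1 == ty0:
        raise ValueError("target box must have non-zero width and height")
    sx = (ux1 - ux0) / (tx1 - tx0)
    sy = (uy1 - uy0) / (ty1 - ty0)
    return [
        [sx, 0.0, ux0 - sx * tx0],
        [0.0, sy, uy0 - sy * ty0],
        [0.0, 0.0, 1.0],
    ]

def transform(W, x, y):
    """Apply W to (x, y) using homogeneous coordinates."""
    X = W[0][0] * x + W[0][1] * y + W[0][2]
    Y = W[1][0] * x + W[1][1] * y + W[1][2]
    s = W[2][0] * x + W[2][1] * y + W[2][2]
    return X / s, Y / s

# Hypothetical selection by the user and the matching unit area in Gref.
TARGET_AREA = (100.0, 200.0, 120.0, 300.0)
UNIT_AREA = (10.0, 0.0, 20.0, 50.0)
W0 = initial_matrix(TARGET_AREA, UNIT_AREA)
```

With this W0, the corners of the selected target area land on the corners of the unit area, which gives the subsequent iterative update a starting point that is already roughly aligned.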

Setting the initial matrix W0 well is important for the matrix update process Sc2 to generate an appropriate transformation matrix W. In particular, when the enhanced correlation coefficient is used in the matrix update process Sc2, the quality of the initial matrix W0 tends to affect the quality of the final transformation matrix W. In the first embodiment, the initial matrix W0 is set so that the target area 621 of the performance image G1 designated by the user approaches the unit area Rn corresponding to the target pitch n in the reference image Gref. An appropriate transformation matrix W that brings the keyboard image g1 close to the reference image Gref with high accuracy can therefore be generated. Furthermore, in the first embodiment, the area of the performance image G1 that the user designates by operating the operation device 13 is used as the target area 621 for setting the initial matrix W0. Compared with, for example, a configuration in which the area corresponding to the target pitch n in the performance image G1 is estimated by computation, an appropriate initial matrix W0 can be generated with a lower processing load. Although the initial setting process Sc1 is described above as being executed on the performance image G1, it may instead be executed on the performance image G2.

B: Fingering data generation unit 32
As described above, the fingering data generation unit 32 in FIG. 3 generates fingering data Q using the performance data P generated by the keyboard instrument 200 and the finger position data F generated by the finger position data generation unit 31. The fingering data Q is generated for each unit period. The fingering data generation unit 32 of the first embodiment includes a probability calculation unit 321 and a fingering estimation unit 322. In the description so far, each finger of the user has been identified by a combination of the variables h and f. In the following description, each finger is instead identified by a finger number k (k = 1 to 10). Accordingly, the position C[h,f] that the finger position data F specifies for each finger is written below as the position C[k].

[Probability calculation unit 321]
The probability calculation unit 321 calculates, for each finger number k, the probability p that the pitch n specified by the performance data P was played by the finger with finger number k. The probability p is an index of how likely it is (a likelihood) that the finger with finger number k operated the key 21 of pitch n. The probability calculation unit 321 calculates the probability p according to whether the position C[k] of the finger with finger number k lies within the unit area Rn of pitch n. The probability p is calculated for each unit period on the time axis. Specifically, when the performance data P specifies a pitch n, the probability calculation unit 321 calculates the probability p(C[k]|ηk=n) by formula (3).

The condition "ηk=n" in the probability p(C[k]|ηk=n) means that the finger with finger number k is playing pitch n. In other words, the probability p(C[k]|ηk=n) is the probability that the position C[k] is observed for the finger with finger number k given that this finger is playing pitch n.

The symbol I(C[k]∈Rn) in formula (3) is an indicator function that takes the value "1" when the position C[k] lies inside the unit area Rn and the value "0" when the position C[k] lies outside the unit area Rn. The symbol |Rn| denotes the surface area of the unit area Rn. The symbol ν(0,σ²E) denotes observation noise, expressed as a normal distribution with mean 0 and variance σ². The symbol E is the 2×2 identity matrix. The symbol * denotes convolution with the observation noise ν(0,σ²E).
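Formula (3) is not reproduced in this text. Based only on the symbol definitions above, a plausible reading is a uniform density over the unit area, I(C[k]∈Rn)/|Rn|, convolved with isotropic Gaussian noise ν(0,σ²E). The sketch below evaluates that reading under the further assumption that Rn is an axis-aligned rectangle, in which case the convolution has a closed form in terms of the normal cumulative distribution function. The function names and numbers are illustrative.

```python
import math

def _normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def finger_likelihood(X, Y, region, sigma):
    """Evaluate the assumed p(C[k] | eta_k = n) at the position C[k] = (X, Y).

    `region` is (x_min, y_min, x_max, y_max) for the unit area Rn and `sigma`
    is the standard deviation of the observation noise. The result is the
    uniform density 1/|Rn| on the rectangle, blurred by the Gaussian noise.
    """
    x0, y0, x1, y1 = region
    area = (x1 - x0) * (y1 - y0)
    if area <= 0 or sigma <= 0:
        raise ValueError("region must have positive area and sigma must be positive")
    mass_x = _normal_cdf((X - x0) / sigma) - _normal_cdf((X - x1) / sigma)
    mass_y = _normal_cdf((Y - y0) / sigma) - _normal_cdf((Y - y1) / sigma)
    return mass_x * mass_y / area

REGION = (0.0, 0.0, 10.0, 50.0)   # hypothetical unit area, |Rn| = 500
p_center = finger_likelihood(5.0, 25.0, REGION, sigma=1.0)
p_edge = finger_likelihood(0.0, 25.0, REGION, sigma=1.0)
p_far = finger_likelihood(-5.0, 25.0, REGION, sigma=1.0)
```

Under this reading the likelihood is close to 1/|Rn| well inside the key and falls off smoothly outside it, so a finger detected slightly off the key because of measurement noise still receives a non-zero likelihood.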

As this description shows, the probability p(C[k]|ηk=n) calculated by the probability calculation unit 321 is the likelihood that, given that the pitch n specified by the performance data P is played by the finger with finger number k, the position of that finger is the position C[k] that the finger position data F specifies for it. The probability p(C[k]|ηk=n) is therefore at its maximum when the position C[k] of the finger with finger number k lies within the unit area Rn of the key being played, and decreases as the position C[k] moves farther from that unit area Rn.

On the other hand, when the performance data P does not specify any pitch n, that is, when the user is not operating any of the N keys 21, the probability calculation unit 321 calculates the probability p(C[k]|ηk=0) of each finger by the following formula (4).

p(C[k]|ηk=0) = 1/|R|   (4)

The symbol |R| in formula (4) denotes the total surface area of the N unit areas R1 to RN in the reference image Gref. As formula (4) shows, while the user is not operating any of the keys 21, the probability p(C[k]|ηk=0) is set to a common value (1/|R|) for all finger numbers k.

As described above, during a period in which the performance data P specifies a pitch n, multiple probabilities p(C[k]|ηk=n) corresponding to different fingers are calculated for each unit period on the time axis. During each unit period in which the performance data P does not specify a pitch n, the multiple probabilities p(C[k]|ηk=0) corresponding to different fingers are set to a sufficiently small fixed value (1/|R|).

[Fingering estimation unit 322]
The fingering estimation unit 322 estimates the user's fingering. Specifically, the fingering estimation unit 322 estimates, from the probabilities p(C[k]|ηk=n) of the individual fingers, the finger (finger number k) that played the pitch n specified by the performance data P. The estimation of the finger number k (the generation of the fingering data Q) is executed each time the probabilities p(C[k]|ηk=n) are calculated, that is, for each unit period. Specifically, the fingering estimation unit 322 identifies the finger number k corresponding to the maximum of the multiple probabilities p(C[k]|ηk=n) corresponding to different fingers. The fingering estimation unit 322 then generates fingering data Q that specifies the pitch n specified by the performance data P and the finger number k identified from the probabilities p(C[k]|ηk=n).

If the maximum of the multiple probabilities p(C[k]|ηk=n) falls below a predetermined threshold during a period in which the performance data P specifies the pitch n, the fingering estimation result has low reliability. For this reason, in any unit period in which the maximum of the multiple probabilities p(C[k]|ηk=n) falls below the threshold, the fingering estimation unit 322 sets the finger number k to an invalid value indicating that the estimation result is invalid. For a note whose finger number k is set to the invalid value, the display control unit 40 displays the note image 611 in a manner different from a normal note image 611, as illustrated in FIG. 4, together with the symbol "??", which indicates that the estimation result for the finger number k is invalid. This concludes the description of the configuration and operation of the fingering data generation unit 32.
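The selection rule of the fingering estimation unit 322, including the invalid value, can be sketched as a thresholded arg-max. In this illustration `None` stands in for the invalid value and the function name, threshold, and probabilities are hypothetical.

```python
from typing import Optional, Sequence

def estimate_finger(probabilities: Sequence[float], threshold: float) -> Optional[int]:
    """Return the finger number k (1-based) with the largest probability.

    `probabilities[k - 1]` stands in for p(C[k] | eta_k = n) in one unit period.
    None plays the role of the invalid value used when the maximum is below
    the threshold.
    """
    if not probabilities:
        raise ValueError("at least one probability is required")
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] < threshold:
        return None
    return best_index + 1

# Hypothetical likelihoods for finger numbers 1 to 10 in one unit period.
confident = [0.0, 0.0, 0.0001, 0.0, 0.0, 0.0, 0.002, 0.0003, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0001, 0.0, 0.0, 0.0, 0.0005, 0.0003, 0.0, 0.0]
k_confident = estimate_finger(confident, threshold=0.001)
k_uncertain = estimate_finger(uncertain, threshold=0.001)
```

A display layer could then render the "??" marker whenever the result is `None`.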

図１４は、演奏解析部３０が実行する処理（以下「演奏解析処理」という）の具体的な手順を例示するフローチャートである。例えば操作装置１３に対する利用者からの指示を契機として演奏解析処理が開始される。 Figure 14 is a flowchart illustrating the specific steps of the process executed by the performance analysis unit 30 (hereinafter referred to as the "performance analysis process"). For example, the performance analysis process is started in response to an instruction from the user via the operation device 13.

When the performance analysis process is started, the control device 11 (image extraction unit 311) executes the image extraction process of FIG. 8 (S11). That is, the control device 11 generates a performance image G2 by extracting, from the performance image G1, a specific area B that includes the keyboard image g1 and the finger image g2. As described above, the image extraction process includes an area estimation process Sb1 and an area extraction process Sb2.

画像抽出処理を実行すると、制御装置１１（行列生成部３１２）は、図１１の行列生成処理を実行する（Ｓ12）。すなわち、制御装置１１は、参照画像Ｇrefと鍵盤画像ｇ1との間の拡張相関係数が増加するように初期行列Ｗ0を反復的に更新することで、変換行列Ｗを生成する。行列生成処理は、前述の通り、初期設定処理Ｓc1と行列更新処理Ｓc2とを含む。 After executing the image extraction process, the control device 11 (matrix generation unit 312) executes the matrix generation process of FIG. 11 (S12). That is, the control device 11 generates the transformation matrix W by iteratively updating the initial matrix W0 so as to increase the extended correlation coefficient between the reference image Gref and the keyboard image g1. As described above, the matrix generation process includes the initial setting process Sc1 and the matrix update process Sc2.

変換行列Ｗが生成されると、制御装置１１は、以下に例示する処理（Ｓ13～Ｓ18）を単位期間毎に反復する。まず、制御装置１１（指位置推定部３１３）は、図５の指位置推定処理を実行する（Ｓ13）。すなわち、制御装置１１は、演奏画像Ｇ1の解析により利用者の左手および右手の各手指の位置ｃ[h,f]を推定する。指位置推定処理は、前述の通り、画像解析処理Ｓa1と左右判定処理Ｓa2と補間処理Ｓa3とを含む。 Once the transformation matrix W is generated, the control device 11 repeats the process exemplified below (S13 to S18) for each unit period. First, the control device 11 (finger position estimation unit 313) executes the finger position estimation process of FIG. 5 (S13). That is, the control device 11 estimates the positions c[h,f] of the fingers of the user's left and right hands by analyzing the performance image G1. As described above, the finger position estimation process includes the image analysis process Sa1, the left/right determination process Sa2, and the interpolation process Sa3.

制御装置１１（射影変換部３１４）は、射影変換処理を実行する（Ｓ14）。すなわち、制御装置１１は、変換行列Ｗを利用した演奏画像Ｇ1の射影変換により変換画像を生成する。射影変換処理において、制御装置１１は、利用者の各手指の位置ｃ[h,f]を、Ｘ-Ｙ座標系における位置Ｃ[h,f]に変換し、各手指の位置Ｃ[h,f]を表す指位置データＦを生成する。 The control device 11 (projective transformation unit 314) executes a projective transformation process (S14). That is, the control device 11 generates a transformed image by projective transformation of the performance image G1 using the transformation matrix W. In the projective transformation process, the control device 11 transforms the positions c[h,f] of each of the user's fingers into positions C[h,f] in the X-Y coordinate system, and generates finger position data F that represents the positions C[h,f] of each finger.
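As an illustration of the coordinate conversion in step S14, the sketch below applies a projective transformation to a single finger position. It assumes that the transformation matrix W is a 3x3 homography acting on homogeneous coordinates, which is the usual form of a projective transformation but is not stated explicitly in this passage; the function name `project_point` is hypothetical.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def project_point(W: Matrix3, c: Tuple[float, float]) -> Tuple[float, float]:
    """Map an image-space position c = (x, y) to a position C in the X-Y system."""
    x, y = c
    # Homogeneous coordinates: (xh, yh, s) = W @ (x, y, 1)
    xh = W[0][0] * x + W[0][1] * y + W[0][2]
    yh = W[1][0] * x + W[1][1] * y + W[1][2]
    s = W[2][0] * x + W[2][1] * y + W[2][2]
    if s == 0:
        raise ValueError("point maps to infinity under W")
    # Dividing by s returns to ordinary 2-D coordinates.
    return (xh / s, yh / s)
```

Applying `project_point` to every position c[h,f] would yield the positions C[h,f] that make up the finger position data F.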

以上の処理により指位置データＦを生成すると、制御装置１１（確率算定部３２１）は、確率算定処理を実行する（Ｓ15）。すなわち、制御装置１１は、演奏データＰが指定する音高ｎが各指番号ｋの手指により演奏された確率ｐ(C[k]|ηk=n)を算定する。そして、制御装置１１（運指推定部３２２）は、運指推定処理を実行する（Ｓ16）。すなわち、制御装置１１は、音高ｎを演奏した手指の指番号ｋを各手指の確率ｐ(C[k]|ηk=n)から推定し、音高ｎと指番号ｋとを指定する運指データＱを生成する。 When the finger position data F is generated by the above process, the control device 11 (probability calculation unit 321) executes a probability calculation process (S15). That is, the control device 11 calculates the probability p(C[k]|ηk=n) that the pitch n specified by the performance data P was played by each finger with finger number k. Then, the control device 11 (fingering estimation unit 322) executes a fingering estimation process (S16). That is, the control device 11 estimates the finger number k of the finger that played the pitch n from the probability p(C[k]|ηk=n) of each finger, and generates fingering data Q that specifies the pitch n and finger number k.

以上の処理により運指データＱを生成すると、制御装置１１（表示制御部４０）は、運指データＱに応じて解析画面６１を更新する（Ｓ17）。また、制御装置１１は、所定の終了条件が成立したか否かを判定する（Ｓ18）。例えば操作装置１３に対する操作で利用者から演奏解析処理の終了が指示された場合に、制御装置１１は終了条件が成立したと判定する。終了条件が成立しない場合（Ｓ18：NO）、制御装置１１は、直後の単位期間について、指位置推定処理以降の処理（Ｓ13～Ｓ18）を反復する。他方、終了条件が成立した場合（Ｓ18：YES）、制御装置１１は、演奏解析処理を終了する。 When the fingering data Q is generated by the above process, the control device 11 (display control unit 40) updates the analysis screen 61 according to the fingering data Q (S17). The control device 11 also determines whether a predetermined termination condition is met (S18). For example, if the user issues an instruction to end the performance analysis process by operating the operation device 13, the control device 11 determines that the termination condition is met. If the termination condition is not met (S18: NO), the control device 11 repeats the processes following the finger position estimation process (S13 to S18) for the immediately following unit period. On the other hand, if the termination condition is met (S18: YES), the control device 11 ends the performance analysis process.

以上に説明した通り、第１実施形態においては、演奏画像Ｇ1の解析により生成される指位置データＦと、利用者による演奏を表す演奏データＰとを利用して、運指データＱが生成される。したがって、演奏データＰのみから運指を推定する構成と比較して運指を高精度に推定できる。 As described above, in the first embodiment, fingering data Q is generated using finger position data F generated by analyzing the performance image G1 and performance data P representing the performance by the user. Therefore, fingering can be estimated with a high degree of accuracy compared to a configuration in which fingering is estimated only from the performance data P.

また、第１実施形態においては、鍵盤画像ｇ1を参照画像Ｇrefに近付ける射影変換のための変換行列Ｗを利用して、指位置推定処理により推定された各手指の位置ｃ[h,f]が変換される。すなわち、参照画像Ｇrefを基準とした各手指の位置Ｃ[h,f]が推定される。したがって、各手指の位置ｃ[h,f]を、参照画像Ｇrefを基準とした位置に変換しない構成と比較して、運指を高精度に推定できる。 In addition, in the first embodiment, the position c[h,f] of each finger estimated by the finger position estimation process is transformed using a transformation matrix W for projective transformation that brings the keyboard image g1 closer to the reference image Gref. That is, the position C[h,f] of each finger is estimated based on the reference image Gref. Therefore, fingering can be estimated with high accuracy compared to a configuration in which the position c[h,f] of each finger is not transformed to a position based on the reference image Gref.

In the first embodiment, a specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, as described above, it is possible to generate an appropriate transformation matrix W that brings the keyboard image g1 close to the reference image Gref with high accuracy. Furthermore, extracting the specific area B improves the usability of the performance image G1. In particular, in the first embodiment, the specific area B extracted from the performance image G1 includes both the keyboard image g1 and the finger image g2. Therefore, a performance image G2 can be generated in which the state of the keyboard 22 of the keyboard instrument 200 and the state of the user's fingers can be viewed efficiently.

2: Second embodiment
A second embodiment will be described. In each of the embodiments exemplified below, elements whose functions are the same as in the first embodiment are denoted by the same reference numerals as in the description of the first embodiment, and detailed description of each is omitted as appropriate.

第１実施形態においては、指番号ｋの手指の位置Ｃ[k]が音高ｎの単位領域Ｒn内に存在するか否かに応じて確率ｐ(C[k]|ηk=n)が算定される。単位領域Ｒn内に１本の手指のみが存在することを前提とすれば、第１実施形態においても運指を高精度に推定できる。ただし、鍵盤楽器２００の実際の演奏においては、１個の単位領域Ｒn内に複数の手指の位置Ｃ[k]が存在する場合が想定される。 In the first embodiment, the probability p(C[k]|ηk=n) is calculated depending on whether the position C[k] of finger number k is present within the unit area Rn of pitch n. Assuming that only one finger is present within the unit area Rn, the fingering can be estimated with high accuracy even in the first embodiment. However, in an actual performance of the keyboard instrument 200, it is assumed that there are multiple finger positions C[k] within one unit area Rn.

例えば、図１５に例示される通り、利用者が左手の中指で１個の鍵２１を操作した状態で、当該左手の人差指を鉛直方向の上方に移動させた場合、演奏画像Ｇ1においては、左手の中指と人差指とが相互に重複する。すなわち、左手の中指の位置Ｃ[k]と人差指の位置Ｃ[k]とが１個の単位領域Ｒn内に存在する。また、利用者が１本の指で鍵２１を操作した状態で当該手指の上方または下方に他の他指を通過させる演奏方法（指くぐり）においては、複数の手指が相互に重複する場合がある。以上のように複数の手指が１個の単位領域Ｒn内において相互に重複する場合には、第１実施形態の方法では、運指を高精度に推定できない可能性がある。第２実施形態は、以上の課題を解決するための形態である。具体的には、第２実施形態においては、複数の手指の位置関係と各手指の位置の時間的な変動（ばらつき）とが、運指の推定に加味される。 For example, as illustrated in FIG. 15, when a user operates one key 21 with the middle finger of the left hand and moves the index finger of the left hand vertically upward, the middle finger and index finger of the left hand overlap each other in the performance image G1. That is, the position C[k] of the middle finger and the position C[k] of the index finger of the left hand are present in one unit area Rn. In addition, in a performance method in which a user operates a key 21 with one finger and passes other fingers above or below the finger (finger passing), multiple fingers may overlap each other. When multiple fingers overlap each other in one unit area Rn as described above, the method of the first embodiment may not be able to estimate the fingering with high accuracy. The second embodiment is a form for solving the above problems. Specifically, in the second embodiment, the positional relationship of multiple fingers and the temporal fluctuation (variation) of the position of each finger are taken into account in the fingering estimation.

図１６は、第２実施形態における演奏解析システム１００の機能的な構成を例示するブロック図である。第２実施形態の演奏解析システム１００は、第１実施形態と同様の要素に制御データ生成部３２３を追加した構成である。 Figure 16 is a block diagram illustrating the functional configuration of the performance analysis system 100 in the second embodiment. The performance analysis system 100 in the second embodiment has the same elements as the first embodiment, and further includes a control data generator 323.

制御データ生成部３２３は、相異なる音高ｎに対応するＮ個の制御データＺ[1]～Ｚ[N]を生成する。図１７は、任意の１個の音高ｎに対応する制御データＺ[n]の模式図である。制御データＺ[n]は、音高ｎの単位領域Ｒnに対する各手指の相対的な位置（以下「相対位置」という）Ｃ'[k]の特徴を表すベクトルデータである。相対位置Ｃ'[k]は、指位置データＦが表す位置Ｃ[k]を単位領域Ｒnに対する相対的な位置に変換した情報である。 The control data generator 323 generates N pieces of control data Z[1] to Z[N] corresponding to different pitches n. FIG. 17 is a schematic diagram of the control data Z[n] corresponding to any one pitch n. The control data Z[n] is vector data that represents the characteristics of the relative position (hereinafter referred to as "relative position") C'[k] of each finger with respect to the unit area Rn of pitch n. The relative position C'[k] is information obtained by converting the position C[k] represented by the finger position data F into a relative position with respect to the unit area Rn.

The control data Z[n] corresponding to one pitch n includes the pitch n itself and, for each of the multiple fingers, a position average Za[n,k], a position variance Zb[n,k], a velocity average Zc[n,k], and a velocity variance Zd[n,k]. The position average Za[n,k] is the average of the relative position C'[k] within a period of predetermined length that includes the current unit period (hereinafter referred to as the "observation period"). The observation period is, for example, a run of consecutive unit periods that ends with the current unit period and extends back in time from it. The position variance Zb[n,k] is the variance of the relative position C'[k] within the observation period. The velocity average Zc[n,k] is the average of the velocity (i.e., the rate of change) of the relative position C'[k] within the observation period. The velocity variance Zd[n,k] is the variance of that velocity within the observation period.
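The four statistics that make up the per-finger part of the control data Z[n] can be sketched as below. This is only an illustration under stated assumptions: the relative position is treated as a single scalar coordinate per unit period (a two-dimensional position would be handled one axis at a time), velocity is approximated by the difference between consecutive unit periods, and the population variance is used. The function name `control_features` and the dictionary keys are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Dict, Sequence


def control_features(rel_positions: Sequence[float]) -> Dict[str, float]:
    """Summarise one finger's relative positions C'[k] over an observation period.

    rel_positions holds one value per unit period, oldest first, ending with
    the current unit period.
    """
    if len(rel_positions) < 2:
        raise ValueError("need at least two unit periods to form a velocity")
    # Rate of change between consecutive unit periods (assumed definition).
    velocities = [b - a for a, b in zip(rel_positions, rel_positions[1:])]
    return {
        "Za": fmean(rel_positions),      # position average
        "Zb": pvariance(rel_positions),  # position variance
        "Zc": fmean(velocities),         # velocity average
        "Zd": pvariance(velocities),     # velocity variance
    }
```

A finger resting on a pressed key would tend to produce small Zb and Zd values, while a finger crossing over or under it would produce larger ones, which is the cue the second embodiment relies on.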

以上の通り、制御データＺ[n]は、複数の手指の各々について相対位置Ｃ'[k]に関する情報（Ｚa[n,k]，Ｚb[n,k]．Ｚc[n,k]，Ｚd[n,k]）を含む。したがって、制御データＺ[n]は、利用者の複数の手指の位置関係が反映されたデータである。また、制御データＺ[n]は、複数の手指の各々について相対位置Ｃ'[k]の変動に関する情報（Ｚb[n,k]，Ｚd[n,k]）を含む。したがって、制御データＺ[n]は、各手指の位置の時間的な変動が反映されたデータである。 As described above, the control data Z[n] includes information about the relative position C'[k] for each of the multiple fingers (Za[n,k], Zb[n,k], Zc[n,k], Zd[n,k]). Therefore, the control data Z[n] is data that reflects the positional relationships of the user's multiple fingers. In addition, the control data Z[n] includes information about the fluctuation in the relative position C'[k] for each of the multiple fingers (Zb[n,k], Zd[n,k]). Therefore, the control data Z[n] is data that reflects the temporal fluctuation in the position of each finger.

第２実施形態の確率算定部３２１による確率算定処理には、相異なる手指について事前に用意された複数の推定モデル５２[k]（５２[1]～５２[10]）が利用される。各手指の推定モデル５２[k]は、制御データＺ[n]と当該手指に関する確率ｐ[k]との関係を学習した学習済モデルである。確率ｐ[k]は、演奏データＰが指定する音高ｎを指番号ｋの手指が演奏した確度の指標（確率）である。確率算定部３２１は、複数の手指の各々について、Ｎ個の制御データＺ[1]～Ｚ[N]を当該手指の推定モデル５２[k]に入力することで確率ｐ[k]を算定する。 The probability calculation process by the probability calculation unit 321 in the second embodiment uses multiple estimation models 52[k] (52[1] to 52[10]) prepared in advance for different fingers. The estimation model 52[k] for each finger is a trained model that has learned the relationship between the control data Z[n] and the probability p[k] for that finger. The probability p[k] is an index (probability) of the likelihood that the finger with finger number k played the pitch n specified by the performance data P. The probability calculation unit 321 calculates the probability p[k] for each of the multiple fingers by inputting N pieces of control data Z[1] to Z[N] into the estimation model 52[k] for that finger.

任意の１個の指番号ｋに対応する推定モデル５２[k]は、以下の数式(5)で表現されるロジスティック回帰モデルである。

The estimation model 52[k] corresponding to any one finger number k is a logistic regression model expressed by the following formula (5).
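Formula (5) does not appear in this text. A standard logistic regression form that is consistent with the variables βk and ωk,n named in the surrounding description would be the following. The exact arrangement of terms is an assumption, as is the treatment of ωk,n as a weight vector with the same dimension as the control data Z[n].

```latex
p[k] = \frac{1}{1 + \exp\!\left( -\left( \beta_k + \sum_{n=1}^{N} \omega_{k,n}^{\top} Z[n] \right) \right)}
```

Under this form, βk acts as a per-finger bias and ωk,n weights the contribution of each pitch's control data to the probability for finger number k.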

数式(5)の変数βkおよび変数ωk,nは、機械学習システム９００による機械学習で設定される。すなわち、機械学習システム９００による機械学習で各推定モデル５２[k]が確立され、各推定モデル５２[k]が演奏解析システム１００に提供される。例えば、各推定モデル５２[k]の変数βkおよび変数ωk,nが、機械学習システム９００から演奏解析システム１００に送信される。 The variables βk and ωk,n in formula (5) are set by machine learning using the machine learning system 900. That is, each estimation model 52[k] is established by machine learning using the machine learning system 900, and each estimation model 52[k] is provided to the performance analysis system 100. For example, the variables βk and ωk,n of each estimation model 52[k] are transmitted from the machine learning system 900 to the performance analysis system 100.

押鍵状態にある手指の上方に位置する手指、または、押鍵状態にある手指の上方または下方を移動する手指は、押鍵状態にある手指と比較して移動し易いという傾向がある。以上の傾向を考慮すると、推定モデル５２[k]は、相対位置Ｃ'[k]の変化率が高い手指について確率ｐ[k]が小さい数値となるように、制御データＺ[n]と確率ｐ[k]との関係を学習する。確率算定部３２１は、複数の推定モデル５２[k]の各々に制御データＺ[n]を入力することで、相異なる手指に関する複数の確率ｐ[k]を単位期間毎に算定する。 Fingers located above a finger in a key-pressing state, or fingers moving above or below a finger in a key-pressing state, tend to move more easily than fingers in a key-pressing state. Taking the above tendencies into consideration, the estimation model 52[k] learns the relationship between the control data Z[n] and the probability p[k] so that the probability p[k] is a small value for fingers with a high rate of change in the relative position C'[k]. The probability calculation unit 321 inputs the control data Z[n] to each of the multiple estimation models 52[k], thereby calculating multiple probabilities p[k] for different fingers for each unit period.

運指推定部３２２は、複数の確率ｐ[k]を適用した運指推定処理により、利用者の運指を推定する。具体的には、運指推定部３２２は、演奏データＰが指定する音高ｎを演奏した手指（指番号ｋ）を、各手指の確率ｐ[k]から推定する。運指推定部３２２による指番号ｋの推定（運指データＱの生成）は、各手指の確率ｐ[k]の算定毎（すなわち単位期間毎）に実行される。具体的には、運指推定部３２２は、相異なる手指に対応する複数の確率ｐ[k]のうち最大値に対応する指番号ｋを特定する。そして、運指推定部３２２は、演奏データＰが指定する音高ｎと、確率ｐ[k]から特定した指番号ｋとを指定する運指データＱを生成する。 The fingering estimation unit 322 estimates the fingering of the user by a fingering estimation process that applies multiple probabilities p[k]. Specifically, the fingering estimation unit 322 estimates the finger (finger number k) that played the pitch n specified by the performance data P from the probability p[k] of each finger. The fingering estimation unit 322 estimates the finger number k (generates the fingering data Q) every time the probability p[k] of each finger is calculated (i.e., every unit period). Specifically, the fingering estimation unit 322 identifies the finger number k that corresponds to the maximum value among the multiple probabilities p[k] corresponding to different fingers. The fingering estimation unit 322 then generates the fingering data Q that specifies the pitch n specified by the performance data P and the finger number k identified from the probability p[k].

図１８は、第２実施形態における演奏解析処理の具体的な手順を例示するフローチャートである。第２実施形態の演奏解析処理においては、第１実施形態と同様の処理に制御データＺ[n]の生成（Ｓ19）が追加される。具体的には、制御装置１１（制御データ生成部３２３）は、指位置データ生成部３１が生成する指位置データＦ（すなわち各手指の位置Ｃ[h,f]）から、相異なる音高ｎに対応するＮ個の制御データＺ[1]～Ｚ[N]を生成する。 Figure 18 is a flow chart illustrating the specific steps of the performance analysis process in the second embodiment. In the performance analysis process of the second embodiment, the generation of control data Z[n] (S19) is added to the same process as in the first embodiment. Specifically, the control device 11 (control data generation unit 323) generates N pieces of control data Z[1] to Z[N] corresponding to different pitches n from the finger position data F (i.e., the positions C[h,f] of each finger) generated by the finger position data generation unit 31.

制御装置１１（確率算定部３２１）は、各推定モデル５２[k]にＮ個の制御データＺ[1]～Ｚ[N]を入力する確率算定処理により、指番号ｋに対応する確率ｐ[k]を算定する（Ｓ15）。また、制御装置１１（運指推定部３２２）は、複数の確率ｐ[k]を適用した運指推定処理により、利用者の運指を推定する（Ｓ16）。運指データ生成部３２以外の要素の動作（Ｓ11～Ｓ14，Ｓ17～Ｓ18）は第１実施形態と同様である。 The control device 11 (probability calculation unit 321) calculates a probability p[k] corresponding to finger number k by a probability calculation process in which N pieces of control data Z[1] to Z[N] are input to each estimation model 52[k] (S15). In addition, the control device 11 (fingering estimation unit 322) estimates the user's fingering by a fingering estimation process that applies the multiple probabilities p[k] (S16). The operations of the elements other than the fingering data generation unit 32 (S11 to S14, S17 to S18) are the same as in the first embodiment.

The second embodiment also achieves the same effects as the first embodiment. Furthermore, in the second embodiment, the control data Z[n] input to the estimation model 52[k] includes the average Za[n,k] and variance Zb[n,k] of the relative position C'[k] of each finger, and the average Zc[n,k] and variance Zd[n,k] of the rate of change of the relative position C'[k]. Therefore, the user's fingering can be estimated with high accuracy even when multiple fingers overlap each other, for example because of finger passing.

なお、以上の説明においては、推定モデル５２[k]としてロジスティック回帰モデルを例示したが、推定モデル５２[k]の種類は以上の例示に限定されない。例えば、多層パーセプトロン等の統計モデルを推定モデル５２[k]として利用してもよい。また、畳込ニューラルネットワークまたは再帰型ニューラルネットワーク等の深層ニューラルネットワークを推定モデル５２[k]として利用してもよい。複数種の統計モデルの組合せを推定モデル５２[k]として利用してもよい。以上に例示した各種の推定モデル５２[k]は、制御データＺ[n]と確率ｐ[k]との関係を学習した学習済モデルとして包括的に表現される。 In the above explanation, a logistic regression model is exemplified as the estimation model 52[k], but the type of estimation model 52[k] is not limited to the above example. For example, a statistical model such as a multi-layer perceptron may be used as the estimation model 52[k]. Also, a deep neural network such as a convolutional neural network or a recurrent neural network may be used as the estimation model 52[k]. A combination of multiple types of statistical models may be used as the estimation model 52[k]. The various estimation models 52[k] exemplified above are collectively expressed as trained models that have learned the relationship between the control data Z[n] and the probability p[k].

3: Third embodiment
FIG. 19 is a flowchart illustrating a specific procedure of the performance analysis process in the third embodiment. After executing the image extraction process and the matrix generation process, the control device 11 refers to the performance data P to determine whether or not the user is playing the keyboard instrument 200 (S21). Specifically, the control device 11 determines whether or not any of the multiple keys 21 of the keyboard instrument 200 is being operated.

鍵盤楽器２００が演奏されている場合（Ｓ21：YES）、制御装置１１は、第１実施形態と同様に、指位置データＦの生成（Ｓ13～Ｓ14）と運指データＱの生成（Ｓ15～Ｓ16）と解析画面６１の更新（Ｓ17）とを実行する。他方、鍵盤楽器２００が演奏されていない場合（Ｓ21：NO）、制御装置１１は処理をステップＳ18に移行する。すなわち、指位置データＦの生成（Ｓ13～14）と運指データＱの生成（Ｓ15～Ｓ16）と解析画面６１の更新（Ｓ17）とは実行されない。 If the keyboard instrument 200 is being played (S21: YES), the control device 11 generates finger position data F (S13-S14), generates fingering data Q (S15-S16), and updates the analysis screen 61 (S17), as in the first embodiment. On the other hand, if the keyboard instrument 200 is not being played (S21: NO), the control device 11 transitions to step S18. In other words, the generation of finger position data F (S13-S14), the generation of fingering data Q (S15-S16), and the update of the analysis screen 61 (S17) are not executed.
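The gating in step S21 can be sketched as a loop that skips the per-period analysis whenever no key is being operated. This is a simplified illustration: the function `run_analysis`, its parameters, and the representation of the performance data P as a set of pressed key numbers per unit period are assumptions, and the termination condition of step S18 is modelled simply as the end of the input sequence.

```python
from typing import Callable, Iterable, List, Set


def run_analysis(
    pressed_keys_per_period: Iterable[Set[int]],
    analyze_period: Callable[[Set[int]], str],
) -> List[str]:
    """Call analyze_period only for unit periods in which a key is operated."""
    results: List[str] = []
    for pressed in pressed_keys_per_period:
        if not pressed:
            # S21: NO. Skip S13 to S17 and move on to the next unit period.
            continue
        # S21: YES. Stand-in for S13 to S17 (finger positions, fingering, display).
        results.append(analyze_period(pressed))
    return results
```

Because `analyze_period` is never invoked for silent unit periods, the costly image analysis is avoided during rests, which is the load reduction the third embodiment aims for.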

第３実施形態においても第１実施形態と同様の効果が実現される。また、第３実施形態においては、鍵盤楽器２００が演奏されていない場合には、指位置データＦおよび運指データＱの生成が停止される。したがって、鍵盤楽器２００の演奏の有無に関わらず指位置データＦの生成が継続される構成と比較して、運指データＱの生成に必要な処理負荷を低減できる。なお、第３実施形態は第２実施形態にも適用される。 The third embodiment achieves the same effects as the first embodiment. Furthermore, in the third embodiment, when the keyboard instrument 200 is not being played, the generation of finger position data F and fingering data Q is stopped. Therefore, the processing load required to generate fingering data Q can be reduced compared to a configuration in which the generation of finger position data F continues regardless of whether the keyboard instrument 200 is being played or not. The third embodiment is also applicable to the second embodiment.

4: Fourth embodiment
The fourth embodiment modifies the initial setting process Sc1 of each of the embodiments described above. FIG. 20 is a flowchart illustrating a specific procedure of the initial setting process Sc1 executed by the control device 11 (matrix generation unit 312) of the fourth embodiment.

初期設定処理Ｓc1が開始されると、利用者は、鍵盤楽器２００の複数の鍵２１のうち所望の音高（以下「特定音高」という）ｎに対応する鍵２１を、特定の手指（以下「特定手指」という）により操作する。特定手指は、例えば表示装置１４による表示または鍵盤楽器２００の取扱説明書等により利用者に通知された手指（例えば右手の人差指）である。利用者による演奏の結果、特定音高ｎを指定する演奏データＰが鍵盤楽器２００から演奏解析システム１００に供給される。制御装置１１は、鍵盤楽器２００から演奏データＰを取得することで利用者による特定音高ｎの演奏を認識する（Ｓc15）。制御装置１１は、参照画像ＧrefのＮ個の単位領域Ｒ1～ＲNのうち特定音高ｎに対応する単位領域Ｒnを特定する（Ｓc16）。 When the initial setting process Sc1 starts, the user operates a key 21 corresponding to a desired pitch (hereinafter referred to as a "specific pitch") n among the multiple keys 21 of the keyboard instrument 200 with a specific finger (hereinafter referred to as a "specific finger"). The specific finger is, for example, a finger (e.g., the index finger of the right hand) notified to the user by display on the display device 14 or an instruction manual for the keyboard instrument 200. As a result of the user's performance, performance data P specifying the specific pitch n is supplied from the keyboard instrument 200 to the performance analysis system 100. The control device 11 recognizes the user's performance of the specific pitch n by acquiring the performance data P from the keyboard instrument 200 (Sc15). The control device 11 identifies a unit area Rn corresponding to the specific pitch n among the N unit areas R1 to RN of the reference image Gref (Sc16).

他方、指位置データ生成部３１は、指位置推定処理により指位置データＦを生成する。指位置データＦは、利用者が特定音高ｎの演奏に使用した特定手指の位置Ｃ[h,f]を含む。制御装置１１は、指位置データＦを取得することで、特定手指の位置Ｃ[h,f]を特定する（Ｓc17）。 On the other hand, the finger position data generation unit 31 generates finger position data F by finger position estimation processing. The finger position data F includes the positions C[h,f] of the specific fingers used by the user to play the specific pitch n. The control device 11 acquires the finger position data F to identify the positions C[h,f] of the specific fingers (Sc17).

制御装置１１は、特定音高ｎに対応する単位領域Ｒnと、指位置データＦが表す特定手指の位置Ｃ[h,f]とを利用して、初期行列Ｗ0を設定する（Ｓc18）。すなわち、制御装置１１は、指位置データＦが表す特定手指の位置Ｃ[h,f]が、参照画像Ｇrefのうち特定音高ｎの単位領域Ｒnに近付くように、初期行列Ｗ0を設定する。具体的には、特定手指の位置Ｃ[h,f]を単位領域Ｒnの中心に射影変換するための行列が、初期行列Ｗ0として設定される。 The control device 11 sets the initial matrix W0 using the unit area Rn corresponding to the specific pitch n and the position C[h,f] of the specific finger represented by the finger position data F (Sc18). That is, the control device 11 sets the initial matrix W0 so that the position C[h,f] of the specific finger represented by the finger position data F approaches the unit area Rn of the specific pitch n in the reference image Gref. Specifically, a matrix for projecting the position C[h,f] of the specific finger onto the center of the unit area Rn is set as the initial matrix W0.
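One simple matrix that maps the specific finger position onto the center of the unit area Rn is a pure translation, sketched below. This is an assumption made for illustration: a single point correspondence does not determine a full projective transformation, and the text does not say which matrix is chosen. The function name `initial_matrix` and the representation of Rn as an axis-aligned rectangle `(x_min, y_min, x_max, y_max)` are likewise hypothetical.

```python
from typing import List, Tuple


def initial_matrix(
    finger_pos: Tuple[float, float],
    region: Tuple[float, float, float, float],
) -> List[List[float]]:
    """Return a 3x3 translation that moves finger_pos to the center of region."""
    x_min, y_min, x_max, y_max = region
    # Center of the unit area Rn in the reference image Gref.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Offset that carries the finger position onto that center.
    tx = cx - finger_pos[0]
    ty = cy - finger_pos[1]
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]
```

The matrix returned here would only serve as the starting point W0; the matrix update process Sc2 would then refine it iteratively.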

The fourth embodiment also achieves the same effects as the first embodiment. In addition, in the fourth embodiment, when the user plays a desired specific pitch n with the specific finger, the initial matrix W0 is set so that the position c[h,f] of the specific finger in the performance image G1 approaches the portion (unit area Rn) of the reference image Gref corresponding to the specific pitch n. Since the user only needs to play the desired pitch n, the work required of the user to set the initial matrix W0 is reduced compared to the first embodiment, in which the user must, for example, select the target area 621 by operating the operation device 13. On the other hand, in the first embodiment, in which the user specifies the target area 621, the position C[h,f] of the user's finger does not need to be estimated, so an appropriate initial matrix W0 can be set with less influence from estimation errors than in the fourth embodiment. Note that the fourth embodiment is also applicable to the second or third embodiment.

なお、第４実施形態においては利用者が１個の特定音高ｎを演奏する場合を想定したが、複数の特定音高ｎを利用者が特定手指により演奏してもよい。制御装置１１は、複数の特定音高ｎの各々について、当該特定音高ｎの演奏時における特定手指の位置Ｃ[h,f]と、当該特定音高ｎの単位領域Ｒnとが近付くように、初期行列Ｗ0を設定する。 In the fourth embodiment, it is assumed that the user plays one specific pitch n, but the user may play multiple specific pitches n with specific fingers. The control device 11 sets the initial matrix W0 for each of the multiple specific pitches n so that the position C[h,f] of the specific finger when playing the specific pitch n approaches the unit area Rn of the specific pitch n.

5: Fifth embodiment
FIG. 21 is a block diagram illustrating the functional configuration of the performance analysis system 100 in the fifth embodiment. The performance analysis system 100 of the fifth embodiment includes a sound collection device 16. The sound collection device 16 generates an audio signal V by collecting the sound reproduced from the keyboard instrument 200 as the user plays. The audio signal V is a time-domain audio signal representing the waveform of the sound reproduced by the keyboard instrument 200. A sound collection device 16 that is separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly. The time series of samples constituting the audio signal V may also be interpreted as "performance data P".

演奏解析システム１００の制御装置１１は、記憶装置１２に記憶されたプログラムを実行することで演奏解析部３０として機能する。演奏解析部３０は、収音装置１６から供給される音響信号Ｖと撮影装置１５から供給される画像データＤ1とを利用して運指データＱを生成する。運指データＱは、第１実施形態と同様に、利用者が操作した鍵２１に対応する音高ｎと、利用者が当該鍵２１の操作に使用した手指の指番号ｋとを指定する。第１実施形態においては音高ｎが演奏データＰにより指定されるが、第５実施形態の音響信号Ｖは音高ｎを直接的に指定する信号ではない。そこで、演奏解析部３０は、音響信号Ｖおよび画像データＤ1を利用して音高ｎと指番号ｋとを同時に推定する。 The control device 11 of the performance analysis system 100 functions as a performance analysis unit 30 by executing a program stored in the storage device 12. The performance analysis unit 30 generates fingering data Q using the audio signal V supplied from the sound collection device 16 and the image data D1 supplied from the image capture device 15. As in the first embodiment, the fingering data Q specifies the pitch n corresponding to the key 21 operated by the user and the finger number k of the finger used by the user to operate the key 21. In the first embodiment, the pitch n is specified by the performance data P, but the audio signal V in the fifth embodiment is not a signal that directly specifies the pitch n. Therefore, the performance analysis unit 30 simultaneously estimates the pitch n and the finger number k using the audio signal V and the image data D1.

A latent variable w_t,n,k is assumed in order to estimate the pitch n and the finger number k. The symbol t is a variable indicating time. One unit period on the time axis may be indicated by the variable t. In addition, the finger number k in the fifth embodiment is set to one of eleven values: ten values (k=1 to 10) corresponding to different fingers and a predetermined invalid value (k=0).

A latent variable w_t,n,k is prepared for each combination of pitch n and finger number k. The latent variable w_t,n,k is a variable for a one-hot representation and is set to one of the two values "0" and "1". The value "1" of the latent variable w_t,n,k means that pitch n is being played by the finger with finger number k, and the value "0" of the latent variable w_t,n,k means that no finger is being used for the performance.

A posterior probability U_t,n and a probability π_t,n,k are also assumed. The posterior probability U_t,n is the posterior probability that pitch n is sounded at time t given that the audio signal V has been observed. Therefore, the probability (1-U_t,n) corresponds to the probability that the latent variable w_t,n,0 takes the value "1" given that the audio signal V has been observed (the probability that no pitch n is played). The posterior probability U_t,n is estimated by a known estimation model that has learned the relationship between the audio signal V and the posterior probability U_t,n. The estimation model is a trained model for automatic music transcription. For example, a deep neural network such as a convolutional neural network or a recurrent neural network is used as the estimation model for estimating the posterior probability U_t,n. The probability π_t,n,k is the probability that, while pitch n is being played, the pitch n is played by the finger with finger number k.

The probability p(w|V,π) of the latent variable w_t,n,k when the audio signal V and the probability π_t,n,k are observed is expressed by the following equation (6).

The first term on the right-hand side of equation (6) represents the probability that no pitch n is sounded, and the second term represents the probability that, when pitch n is sounded, that pitch n is played by the finger with finger number k.
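The two terms described above can be illustrated with a short Python sketch for a single time t and pitch n. It is not equation (6) itself, which is not reproduced in this text. It assumes that the probability of "pitch n played by finger k" is the product of U_t,n and π_t,n,k, and the function name and list layout are hypothetical.

```python
def finger_assignment_prior(u, pi):
    """Return [P(not played), P(finger 1), ..., P(finger K)] for one (t, n).

    u  : posterior probability U_t,n that pitch n is sounding (0 to 1)
    pi : list of K probabilities pi_t,n,k that finger k plays the pitch,
         given that it is sounding (assumed to sum to 1)
    Index 0 stands for the outcome w_t,n,0 = 1 (pitch not played).
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must be a probability")
    if abs(sum(pi) - 1.0) > 1e-9:
        raise ValueError("pi must sum to 1")
    # First term: pitch not sounding. Second term: sounding AND finger k
    # (assumed here to be the product u * pi_k).
    return [1.0 - u] + [u * p for p in pi]
```

For example, with U_t,n = 0.8 and π_t,n,k = (0.5, 0.25, 0.25, 0, 0), the sketch assigns 0.2 to "not played" and 0.4 to finger 1, and the outcomes sum to 1.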

Furthermore, the probability p(C[k]|w) that the position C[k] is observed from the performance image G1 when the latent variable w_t,n,k is observed is expressed by the following equation (7).

The probability p(C[k]|σ²,Rn) in equation (7) is the probability expressed by equation (3) or equation (4) described above.

In addition, the symmetric Dirichlet distribution (Dir) expressed by the following equation (8) is assumed as the prior distribution of the probability π_t,n,k.

The symbol α in equation (8) is a variable that defines the shape of the symmetric Dirichlet distribution.

Under the above assumptions, the presence or absence of pitch n and the finger number k can be estimated simultaneously by performing maximum a posteriori (MAP) estimation, which maximizes the posterior probability p(z|V,π,C[k]) of the latent variable w_t,n,k. However, because the probability distribution of the posterior probability p(z|V,π,C[k]) is difficult to estimate, the fifth embodiment considers a mean-field approximation (variational Bayesian estimation).

Specifically, among the distributions that factorize as shown in the following equation (9), the distribution that most closely approximates the probability distribution of the posterior probability p(z|V,π,C[k]) is identified. For example, the distribution with the minimum Kullback-Leibler (KL) distance from the posterior probability p(z|V,π,C[k]) is identified.
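As a reminder of what the KL distance measures, the following Python sketch computes it for two discrete distributions given as lists. It is a generic illustration, not the patent's procedure. The direction KL(q || p), with q as the approximating distribution, is the one usually used in variational inference and is an assumption here, as are the function name and the `eps` guard.

```python
import math


def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) for two discrete distributions of equal length.

    q   : approximating distribution (list of probabilities)
    p   : target distribution (list of probabilities)
    eps : floor applied to p to avoid division by zero (hypothetical guard)
    """
    if len(q) != len(p):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0.0:  # terms with q_i = 0 contribute nothing
            total += qi * math.log(qi / max(pi, eps))
    return total
```

The value is zero when the two distributions coincide and positive otherwise. It is not symmetric in its arguments, which is why the text's word "distance" should be read loosely.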

Specifically, the performance analysis unit 30 repeats the calculations of the following equations (10) and (11).

The symbol c in equation (10) is a coefficient that normalizes the probability distribution ρ_t,n,k so that its sum over the multiple finger numbers k is "1". The symbol 〈〉 denotes an expected value.

Specifically, for one time t on the time axis, the performance analysis unit 30 repeats the calculations of equations (10) and (11) for all combinations of pitch n and finger number k. The performance analysis unit 30 adopts the result of equation (10) at the point when the calculations of equations (10) and (11) have been repeated a predetermined number of times as the probability distribution ρ_t,n,k of the latent variable w_t,n,k. The probability distribution ρ_t,n,k is calculated for each time t on the time axis.

However, in a configuration in which pitch n and finger number k are determined for each time t from the probability distribution ρ_t,n,k calculated individually for each time t on the time axis, the finger number k may change between successive times t within the period in which the user plays a single note, or the period over which pitch n continues may become excessively short. Therefore, the performance analysis unit 30 of the fifth embodiment uses a hidden Markov model (HMM) to which the probability distribution ρ_t,n,k is applied to generate a time series of combinations of pitch n and finger number k (that is, fingering data Q).

Specifically, the HMM for fingering estimation is composed of latent states corresponding to the sounding (key press) and silencing of pitch n and a plurality of latent states corresponding to different finger numbers k. Only three types of state transition are allowed: (1) self-transition, (2) silence → any finger number k, and (3) any finger number k → silence. The transition probabilities of all other state transitions are set to "0". These conditions are constraints that keep the finger number k from changing within the period in which one note is sounded. In addition, the expected value of the probability distribution ρ_t,n,k calculated by equations (10) and (11) is set as the observation probability for each latent state of the HMM. The performance analysis unit 30 uses the HMM described above to estimate a state sequence by dynamic programming such as the Viterbi algorithm, and generates a time series of fingering data Q according to the estimated state sequence.
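The effect of the three permitted transitions can be sketched in Python for a single pitch n. This is a minimal illustration under stated assumptions, not the patent's implementation: state 0 stands for silence, states 1..K for finger numbers, the self-transition probability `stay` and the uniform initial distribution are hypothetical values, and `obs[t][s]` stands in for the observation probabilities derived from ρ_t,n,k.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def build_transitions(num_fingers, stay=0.9):
    """Transition matrix allowing only self, silence->finger, finger->silence."""
    size = num_fingers + 1
    trans = [[0.0] * size for _ in range(size)]
    trans[0][0] = stay                                # (1) self-transition
    for k in range(1, size):
        trans[0][k] = (1.0 - stay) / num_fingers      # (2) silence -> finger k
        trans[k][k] = stay                            # (1) self-transition
        trans[k][0] = 1.0 - stay                      # (3) finger k -> silence
    return trans                                      # finger -> other finger stays 0


def viterbi(obs, trans):
    """Most likely state sequence; obs[t][s] is the observation probability."""
    size = len(trans)
    score = [_log(p) for p in obs[0]]  # uniform initial distribution assumed
    back = []
    for frame in obs[1:]:
        new_score, pointers = [], []
        for s in range(size):
            prev = max(range(size), key=lambda r: score[r] + _log(trans[r][s]))
            pointers.append(prev)
            new_score.append(score[prev] + _log(trans[prev][s]) + _log(frame[s]))
        score = new_score
        back.append(pointers)
    state = max(range(size), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

When a single frame in the middle of a note favors a different finger, a frame-by-frame decision would switch fingers for that frame. The constrained decoding keeps one finger for the whole note, because a direct finger-to-finger transition has probability zero.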

第５実施形態によれば、音響信号Ｖと画像データＤ1とを利用して運指データＱが生成される。すなわち、演奏データＰを取得できない状況でも運指データＱを生成できる。また、第５実施形態においては、音響信号Ｖおよび画像データＤ1を利用して音高ｎと指番号ｋとが同時に推定されるから、音高ｎおよび指番号ｋの各々を個別に推定する形態と比較して処理負荷を軽減しながら高精度に運指を推定できる。なお、第５実施形態は第２実施形態から第４実施形態にも適用される。 According to the fifth embodiment, fingering data Q is generated using the audio signal V and image data D1. That is, fingering data Q can be generated even in a situation where performance data P cannot be acquired. In addition, in the fifth embodiment, pitch n and finger number k are simultaneously estimated using audio signal V and image data D1, so fingering can be estimated with high accuracy while reducing the processing load compared to a form in which pitch n and finger number k are estimated separately. Note that the fifth embodiment is also applicable to the second to fourth embodiments.

6: Sixth Embodiment
As exemplified in each of the above embodiments, the projective transformation unit 314 generates a transformed image from the performance image G1. That is, the projective transformation unit 314 changes the shooting conditions of the performance image G1. The sixth embodiment is an image processing system 700 that uses this function of changing the shooting conditions of the performance image G1. The performance analysis systems 100 of the first to fifth embodiments can also be regarded as image processing systems 700 when attention is focused on the processing of the performance image G1 by the projective transformation unit 314. In the sixth embodiment, estimating the user's fingering is not essential.

図２２は、第６実施形態における画像処理システム７００の機能的な構成を例示するブロック図である。画像処理システム７００は、第１実施形態の演奏解析システム１００と同様に、制御装置１１と記憶装置１２と操作装置１３と表示装置１４と撮影装置１５とを具備する。撮影装置１５は、第１実施形態と同様に、特定の撮影条件のもとで鍵盤楽器２００を撮影することで、演奏画像Ｇ1を表す画像データＤ1の時系列を生成する。 Fig. 22 is a block diagram illustrating the functional configuration of an image processing system 700 in the sixth embodiment. Like the performance analysis system 100 in the first embodiment, the image processing system 700 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15. Like the first embodiment, the photographing device 15 photographs the keyboard instrument 200 under specific photographing conditions to generate a time series of image data D1 representing a performance image G1.

記憶装置１２は、複数の参照データＤrefを記憶する。複数の参照データＤrefの各々は、標準的な鍵盤楽器の鍵盤である参照楽器を撮影した参照画像Ｇrefを表す。参照楽器の撮影条件は、参照画像Ｇref毎（参照データＤref毎）に相違する。具体的には、例えば撮影範囲または撮影方向のうち１以上の条件が、参照画像Ｇref毎に相違する。また、記憶装置１２は、参照データＤref毎に補助データＡを記憶する。 The storage device 12 stores multiple reference data Dref. Each of the multiple reference data Dref represents a reference image Gref captured of a reference instrument, which is the keyboard of a standard keyboard instrument. The shooting conditions of the reference instrument differ for each reference image Gref (each reference data Dref). Specifically, for example, one or more conditions of the shooting range or shooting direction differ for each reference image Gref. The storage device 12 also stores auxiliary data A for each reference data Dref.

制御装置１１は、記憶装置１２に記憶されたプログラムを実行することで、行列生成部３１２と射影変換部３１４と表示制御部４０とを実現する。行列生成部３１２は、複数の参照データＤrefの何れかを選択的に利用して変換行列Ｗを生成する。射影変換部３１４は、変換行列Ｗを利用した射影変換により、演奏画像Ｇ1の画像データＤ1から変換画像Ｇ3の画像データＤ3を生成する。表示制御部４０は、画像データＤ3が表す変換画像Ｇ3を表示装置１４に表示させる。 The control device 11 executes a program stored in the storage device 12 to realize a matrix generation unit 312, a projective transformation unit 314, and a display control unit 40. The matrix generation unit 312 selectively uses one of a plurality of reference data Dref to generate a transformation matrix W. The projective transformation unit 314 generates image data D3 of a transformed image G3 from image data D1 of a performance image G1 by projective transformation using the transformation matrix W. The display control unit 40 causes the display device 14 to display the transformed image G3 represented by the image data D3.

図２３は、第６実施形態の制御装置１１が実行する処理（以下「第１画像処理」という）の具体的な手順を例示するフローチャートである。例えば操作装置１３に対する利用者からの指示を契機として第１画像処理が開始される。 Figure 23 is a flowchart illustrating a specific procedure of a process (hereinafter referred to as "first image processing") executed by the control device 11 of the sixth embodiment. For example, the first image processing is started in response to an instruction from a user via the operation device 13.

利用者は、操作装置１３を操作することで、相異なる参照画像Ｇrefに対応する複数の撮影条件の何れかを選択する。制御装置１１（行列生成部３１２）は、撮影条件の選択を利用者から受付けたか否かを判定する（Ｓ31）。撮影条件の選択を受付けた場合（Ｓ31：YES）、制御装置１１（行列生成部３１２）は、記憶装置１２に記憶された複数の参照データＤrefのうち、利用者が選択した撮影条件に対応する参照データＤref（以下「選択参照データＤref」という）を取得する（Ｓ32）。利用者による撮影条件の選択は、相異なる撮影条件に対応する複数の参照画像Ｇref（参照データＤref）の何れかを選択する動作に相当する。 The user operates the operation device 13 to select one of a plurality of shooting conditions corresponding to different reference images Gref. The control device 11 (matrix generation unit 312) judges whether or not a selection of shooting conditions has been accepted from the user (S31). If a selection of shooting conditions has been accepted (S31: YES), the control device 11 (matrix generation unit 312) acquires reference data Dref (hereinafter referred to as "selected reference data Dref") corresponding to the shooting conditions selected by the user from among the plurality of reference data Dref stored in the storage device 12 (S32). The selection of shooting conditions by the user corresponds to an operation of selecting one of a plurality of reference images Gref (reference data Dref) corresponding to different shooting conditions.

制御装置１１（行列生成部３１２）は、選択参照データＤrefを利用して、第１実施形態と同様の行列生成処理を実行する（Ｓ33）。具体的には、制御装置１１は、選択参照データＤrefを利用した初期設定処理Ｓc1により初期行列Ｗ0を設定する。また、制御装置１１は、演奏画像Ｇ1の鍵盤画像ｇ1が選択参照データＤrefの参照画像Ｇrefに近付くように初期行列Ｗ0を反復的に更新する行列更新処理Ｓc2により、変換行列Ｗを生成する。他方、撮影条件の選択を受付けない場合（Ｓ31：NO）、参照データＤrefの選択（Ｓ32）および行列生成処理（Ｓ33）は実行されない。 The control device 11 (matrix generation unit 312) executes a matrix generation process similar to that of the first embodiment using the selected reference data Dref (S33). Specifically, the control device 11 sets an initial matrix W0 by an initial setting process Sc1 using the selected reference data Dref. The control device 11 also generates a transformation matrix W by a matrix update process Sc2 that iteratively updates the initial matrix W0 so that the keyboard image g1 of the performance image G1 approaches the reference image Gref of the selected reference data Dref. On the other hand, if the selection of the shooting conditions is not accepted (S31: NO), the selection of the reference data Dref (S32) and the matrix generation process (S33) are not executed.

制御装置１１（射影変換部３１４）は、変換行列Ｗを利用した射影変換処理を演奏画像Ｇ1に対して実行することで変換画像Ｇ3を生成する（Ｓ34）。射影変換処理は、第１実施形態と同様である。射影変換処理の結果、変換画像Ｇ3を表す画像データＤ3が生成される。具体的には、選択参照データＤrefの参照画像Ｇrefと同等の撮影条件に対応する変換画像Ｇ3が演奏画像Ｇ1から生成される。すなわち、変換画像Ｇ3は、演奏画像Ｇ1の撮影条件を参照画像Ｇrefと同等の撮影条件に変換した画像である。以上の説明から理解される通り、第６実施形態によれば、利用者が選択した撮影条件に対応する変換画像Ｇ3が生成される。 The control device 11 (projective transformation unit 314) generates a transformed image G3 by performing a projective transformation process using the transformation matrix W on the performance image G1 (S34). The projective transformation process is the same as in the first embodiment. As a result of the projective transformation process, image data D3 representing the transformed image G3 is generated. Specifically, a transformed image G3 corresponding to shooting conditions equivalent to those of the reference image Gref of the selected reference data Dref is generated from the performance image G1. In other words, the transformed image G3 is an image obtained by transforming the shooting conditions of the performance image G1 into shooting conditions equivalent to those of the reference image Gref. As can be understood from the above explanation, according to the sixth embodiment, a transformed image G3 corresponding to the shooting conditions selected by the user is generated.
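At the level of a single coordinate, a projective transformation with a 3x3 matrix can be sketched in Python as follows. The sketch only maps one point. Warping a whole image, as step S34 does, would apply the same mapping to every pixel and resample. The function name, the row-major list layout, and the direction of the mapping (from performance image G1 coordinates to transformed image G3 coordinates) are assumptions made for illustration.

```python
def apply_projective_transform(w, x, y):
    """Map the point (x, y) through a 3x3 matrix w using homogeneous coordinates.

    w is a list of three rows of three numbers (hypothetical layout).
    """
    xh = w[0][0] * x + w[0][1] * y + w[0][2]
    yh = w[1][0] * x + w[1][1] * y + w[1][2]
    scale = w[2][0] * x + w[2][1] * y + w[2][2]
    if scale == 0:
        raise ValueError("point is mapped to infinity")
    # Dividing by the third homogeneous component produces the perspective effect.
    return xh / scale, yh / scale
```

With the identity matrix the point is unchanged. A non-zero entry in the third row makes the scale depend on the position in the image, which is what lets the transformation change the apparent shooting direction.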

制御装置１１（表示制御部４０）は、射影変換処理により生成された変換画像Ｇ3を表示装置１４に表示させる（Ｓ35）。制御装置１１は、終了条件が成立したか否かを判定する（Ｓ36）。例えば操作装置１３に対する操作で利用者から第１画像処理の終了が指示された場合に、制御装置１１は終了条件が成立したと判定する。終了条件が成立しない場合（Ｓ36：NO）、制御装置１１は、処理をステップＳ31に移行する。すなわち、撮影条件の選択の受付（Ｓ31：YES）を条件とした変換行列Ｗの生成（Ｓ32～Ｓ33）と、変換画像Ｇ3の生成および表示（Ｓ34～Ｓ35）とが実行される。他方、終了条件が成立した場合（Ｓ36：YES）、制御装置１１は、第１画像処理を終了する。 The control device 11 (display control unit 40) causes the display device 14 to display the transformed image G3 generated by the projective transformation process (S35). The control device 11 determines whether or not the end condition is met (S36). For example, when the user instructs the operation device 13 to end the first image process, the control device 11 determines that the end condition is met. If the end condition is not met (S36: NO), the control device 11 moves the process to step S31. That is, the generation of the transformation matrix W (S32-S33) and the generation and display of the transformed image G3 (S34-S35) are executed on the condition that the selection of the shooting conditions is accepted (S31: YES). On the other hand, if the end condition is met (S36: YES), the control device 11 ends the first image process.
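The control flow of steps S31 to S36 can be sketched in Python as a loop that regenerates the transformation matrix W only when a new selection arrives and otherwise reuses the previous one. All names and the callable-based interface are hypothetical. Skipping frames until the first W exists is an assumption, since the text does not say what happens before the first selection.

```python
def first_image_processing(frames, selections, references,
                           generate_matrix, transform, display):
    """Run steps S31 to S35 once per frame and return the last matrix W.

    frames          : iterable of performance images G1
    selections      : iterable giving, per frame, the key of the shooting
                      condition the user just selected, or None (S31: NO)
    references      : dict mapping a shooting-condition key to reference data Dref
    generate_matrix : callable(dref, frame) -> transformation matrix W (S33)
    transform       : callable(w, frame) -> transformed image G3 (S34)
    display         : callable(g3) that shows the image (S35)
    """
    w = None
    for frame, selection in zip(frames, selections):
        if selection is not None:             # S31: YES
            dref = references[selection]      # S32: selected reference data
            w = generate_matrix(dref, frame)  # S33: matrix generation
        if w is None:
            # Assumption: nothing is displayed until a first W exists.
            continue
        display(transform(w, frame))          # S34 and S35
    # S36: running out of input stands in for the end condition.
    return w
```

Keeping W outside the per-frame work reflects the text's statement that S32 and S33 are skipped when no selection is accepted, so the comparatively expensive matrix generation runs only on demand.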

As described above, in the sixth embodiment, a transformation matrix W is generated so that the keyboard image g1 in the performance image G1 approaches the reference image Gref, and a projective transformation process using the transformation matrix W is executed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be converted into a transformed image G3 that corresponds to the shooting conditions of the reference instrument in the reference image Gref.

In the sixth embodiment, one of a plurality of reference data Dref with different shooting conditions is selectively used in the matrix generation process. Therefore, transformed images G3 corresponding to various shooting conditions can be generated from a performance image G1 captured under specific shooting conditions. In particular, in the sixth embodiment, the reference data Dref corresponding to the shooting conditions selected by the user from among the plurality of reference data Dref is used in the matrix generation process, so a transformed image G3 corresponding to the shooting conditions desired by the user can be generated. Changing the shooting conditions of the performance image G1 in this way yields transformed images G3 that can be used for various purposes. For example, a music instructor can execute the first image processing of the sixth embodiment on each of a plurality of performance images G1 of the instructor's own performance, thereby generating a plurality of transformed images G3 with unified shooting conditions, for example as teaching materials for music lessons.

7: Seventh Embodiment
As exemplified in each of the above embodiments, the image extraction unit 311 extracts a specific region B including the keyboard image g1 and the finger image g2 from the performance image G1. The seventh embodiment is an image processing system 700 that uses this function of extracting the specific region B from the performance image G1. The performance analysis systems 100 of the first to fifth embodiments can also be regarded as image processing systems 700 when attention is focused on the processing of the performance image G1 by the image extraction unit 311. In the seventh embodiment, estimating the user's fingering is not essential.

Figure 24 is a block diagram illustrating the functional configuration of the image processing system 700 in the seventh embodiment. Like the performance analysis system 100 in the first embodiment, the image processing system 700 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15. The photographing device 15 photographs the keyboard instrument 200 under specific shooting conditions to generate a time series of image data D1 representing the performance image G1. As in each of the above embodiments, the performance image G1 includes a keyboard image g1 and a finger image g2.

制御装置１１は、記憶装置１２に記憶されたプログラムを実行することで、画像抽出部３１１および表示制御部４０として機能する。画像抽出部３１１は、演奏画像Ｇ1のうち一部の領域を抽出した演奏画像Ｇ2を表す画像データＤ2を生成する。具体的には、画像抽出部３１１は、第１実施形態と同様に、画像処理マスクＭを生成する領域推定処理Ｓb1と、画像処理マスクＭを演奏画像Ｇ1に適用する領域抽出処理Ｓb2とを実行する。表示制御部４０は、画像データＤ2が表す演奏画像Ｇ2を表示装置１４に表示させる。 The control device 11 functions as an image extraction unit 311 and a display control unit 40 by executing a program stored in the storage device 12. The image extraction unit 311 generates image data D2 representing a performance image G2 obtained by extracting a partial area of the performance image G1. Specifically, similar to the first embodiment, the image extraction unit 311 executes an area estimation process Sb1 for generating an image processing mask M, and an area extraction process Sb2 for applying the image processing mask M to the performance image G1. The display control unit 40 causes the display device 14 to display the performance image G2 represented by the image data D2.

第１実施形態においては単体の推定モデル５１を例示した。第７実施形態において領域推定処理Ｓb1に利用される推定モデル５１は、第１モデル５１１および第２モデル５１２を含む。第１モデル５１１および第２モデル５１２の各々は、畳込ニューラルネットワークまたは再帰型ニューラルネットワーク等の深層ニューラルネットワークで構成される。 In the first embodiment, a single estimation model 51 is exemplified. In the seventh embodiment, the estimation model 51 used in the area estimation process Sb1 includes a first model 511 and a second model 512. Each of the first model 511 and the second model 512 is configured as a deep neural network such as a convolutional neural network or a recurrent neural network.

第１モデル５１１は、演奏画像Ｇ1のうち第１領域を表す第１マスクを生成するための統計モデルである。第１領域は、演奏画像Ｇ1のうち鍵盤画像ｇ1を含む領域である。手指画像ｇ2は第１領域に含まれない。第１マスクは、例えば、第１領域内の各要素が数値「１」に設定され、第１領域以外の領域内の各要素が数値「０」に設定されたバイナリマスクである。画像抽出部３１１は、演奏画像Ｇ1を表す画像データＤ1を第１モデル５１１に入力することで第１マスクを生成する。すなわち、第１モデル５１１は、画像データＤ1と第１マスク（第１領域）との関係を機械学習により学習した学習済モデルである。 The first model 511 is a statistical model for generating a first mask representing a first region of the performance image G1. The first region is a region of the performance image G1 that includes the keyboard image g1. The finger image g2 is not included in the first region. The first mask is, for example, a binary mask in which each element in the first region is set to the numerical value "1" and each element in the region other than the first region is set to the numerical value "0". The image extraction unit 311 generates the first mask by inputting image data D1 representing the performance image G1 to the first model 511. In other words, the first model 511 is a trained model that has learned the relationship between the image data D1 and the first mask (first region) by machine learning.

The second model 512 is a statistical model for generating a second mask representing a second region of the performance image G1. The second region is a region of the performance image G1 that includes the finger image g2. The keyboard image g1 is not included in the second region. The second mask is, for example, a binary mask in which each element in the second region is set to the numerical value "1" and each element outside the second region is set to the numerical value "0". The image extraction unit 311 generates the second mask by inputting the image data D1 representing the performance image G1 to the second model 512. In other words, the second model 512 is a trained model that has learned the relationship between the image data D1 and the second mask (second region) by machine learning.

図２５は、第７実施形態の制御装置１１が実行する処理（以下「第２画像処理」という）の具体的な手順を例示するフローチャートである。例えば操作装置１３に対する利用者からの指示を契機として第２画像処理が開始される。 Figure 25 is a flowchart illustrating a specific procedure of a process (hereinafter referred to as "second image processing") executed by the control device 11 of the seventh embodiment. For example, the second image processing is started in response to an instruction from a user via the operation device 13.

第２画像処理が開始されると、制御装置１１（画像抽出部３１１）は、領域推定処理Ｓb1を実行する（Ｓ41～Ｓ43）。第７実施形態の領域推定処理Ｓb1は、第１推定処理（Ｓ41）と第２推定処理（Ｓ42）と領域合成処理（Ｓ43）とを含む。 When the second image processing is started, the control device 11 (image extraction unit 311) executes the area estimation processing Sb1 (S41 to S43). The area estimation processing Sb1 of the seventh embodiment includes the first estimation processing (S41), the second estimation processing (S42), and the area synthesis processing (S43).

The first estimation process estimates the first region of the performance image G1. Specifically, the control device 11 generates the first mask representing the first region by inputting the image data D1 representing the performance image G1 to the first model 511 (S41). The second estimation process estimates the second region of the performance image G1. Specifically, the control device 11 generates the second mask representing the second region by inputting the image data D1 representing the performance image G1 to the second model 512 (S42).

The area synthesis process generates an image processing mask M representing a specific region B that includes the first region and the second region. Specifically, the specific region B represented by the image processing mask M corresponds to the union of the first region and the second region. That is, the control device 11 generates the image processing mask M by combining the first mask and the second mask (S43). As can be understood from the above description, the image processing mask M is, as in the first embodiment, a binary mask for extracting the specific region B including the keyboard image g1 and the finger image g2 from the performance image G1.

The control device 11 (image extraction unit 311) uses the image processing mask M generated in the area estimation process Sb1 to execute the same area extraction process Sb2 as in the first embodiment (S44). That is, the control device 11 generates image data D2 representing the performance image G2 by extracting the specific region B from the performance image G1 represented by the image data D1 using the image processing mask M.
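Steps S43 and S44 can be sketched in Python with masks and images held as nested lists. This is an illustration under assumptions, not the patent's implementation: the masks are taken to be the same size as the image, the combination is an element-wise logical OR, and "extraction" is modeled as replacing pixels outside the specific region B with a hypothetical `fill` value.

```python
def combine_masks(first_mask, second_mask):
    """S43: element-wise OR of two binary masks (lists of rows of 0/1)."""
    return [[a | b for a, b in zip(row1, row2)]
            for row1, row2 in zip(first_mask, second_mask)]


def apply_mask(image, mask, fill=0):
    """S44: keep pixels where the mask is 1 and replace the others with fill."""
    return [[pixel if m else fill for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]
```

For example, a first mask `[[1, 1, 0]]` and a second mask `[[0, 1, 1]]` combine into `[[1, 1, 1]]`, so a pixel survives if it belongs to either the keyboard region or the finger region.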

制御装置１１（表示制御部４０）は、領域抽出処理Ｓb2により生成された演奏画像Ｇ2を表示装置１４に表示させる（Ｓ45）。制御装置１１は、終了条件が成立したか否かを判定する（Ｓ46）。例えば操作装置１３に対する操作で利用者から第２画像処理の終了が指示された場合に、制御装置１１は終了条件が成立したと判定する。終了条件が成立しない場合（Ｓ46：NO）、制御装置１１は、処理をステップＳ41に移行する。すなわち、領域推定処理Ｓb1（Ｓ41～Ｓ43）と、領域抽出処理Ｓb2（Ｓ44）と、演奏画像Ｇ2の表示（Ｓ45）とが実行される。他方、終了条件が成立した場合（Ｓ46：YES）、制御装置１１は、第２画像処理を終了する。 The control device 11 (display control unit 40) causes the display device 14 to display the performance image G2 generated by the area extraction process Sb2 (S45). The control device 11 determines whether or not the end condition is met (S46). For example, when the user instructs the operation device 13 to end the second image process, the control device 11 determines that the end condition is met. If the end condition is not met (S46: NO), the control device 11 shifts the process to step S41. That is, the area estimation process Sb1 (S41-S43), the area extraction process Sb2 (S44), and the display of the performance image G2 (S45) are executed. On the other hand, if the end condition is met (S46: YES), the control device 11 ends the second image process.

第７実施形態においては、第１実施形態と同様に、演奏画像Ｇ1のうち鍵盤画像ｇ1を含む特定領域Ｂが抽出される。したがって、演奏画像Ｇ1の利便性を向上させることが可能である。第７実施形態においては特に、演奏画像Ｇ1のうち鍵盤画像ｇ1と手指画像ｇ2とを含む特定領域Ｂが抽出される。したがって、鍵盤楽器２００の鍵盤２２の様子と利用者の手指の様子とを効率的に視認可能な演奏画像Ｇ2を生成できる。 In the seventh embodiment, as in the first embodiment, a specific region B including the keyboard image g1 is extracted from the performance image G1. Therefore, it is possible to improve the usability of the performance image G1. In particular, in the seventh embodiment, a specific region B including the keyboard image g1 and the finger image g2 is extracted from the performance image G1. Therefore, it is possible to generate a performance image G2 that allows the appearance of the keyboard 22 of the keyboard instrument 200 and the appearance of the user's fingers to be efficiently visually recognized.

また、第７実施形態によれば、演奏画像Ｇ1のうち鍵盤画像ｇ1を含む第１領域が第１モデル５１１により推定され、演奏画像Ｇ1のうち手指画像ｇ2を含む第２領域が第２モデル５１２により推定される。したがって、鍵盤画像ｇ1と手指画像ｇ2との双方を一括的に抽出する単体の推定モデル５１を利用する構成と比較して、鍵盤画像ｇ1と手指画像ｇ2とを含む特定領域Ｂを高精度に抽出できる。また、第１モデル５１１および第２モデル５１２の各々が個別の機械学習により確立されるから、第１モデル５１１および第２モデル５１２の機械学習に関する処理負荷が軽減される。 Furthermore, according to the seventh embodiment, a first region of the performance image G1 including the keyboard image g1 is estimated by the first model 511, and a second region of the performance image G1 including the finger image g2 is estimated by the second model 512. Therefore, compared to a configuration using a single estimation model 51 that simultaneously extracts both the keyboard image g1 and the finger image g2, a specific region B including the keyboard image g1 and the finger image g2 can be extracted with high accuracy. Furthermore, since each of the first model 511 and the second model 512 is established by individual machine learning, the processing load related to the machine learning of the first model 511 and the second model 512 is reduced.

A configuration in which the image extraction unit 311 can switch between a first mode and a second mode is also conceivable. The first mode is an operation mode in which both the keyboard image g1 and the finger image g2 are extracted from the performance image G1. That is, in the first mode, the image extraction unit 311 executes both the first estimation process and the second estimation process. Therefore, as in the seventh embodiment, an image processing mask M representing the specific region B is generated. That is, in the first mode, the specific region B including both the keyboard image g1 and the finger image g2 is extracted from the performance image G1.

The second mode is an operation mode in which the keyboard image g1 is extracted from the performance image G1. That is, in the second mode, the image extraction unit 311 executes the first estimation process but not the second estimation process. The first mask generated by the first estimation process is therefore used as the image processing mask M applied in the area extraction process Sb2. Consequently, in the second mode, the keyboard image g1 is extracted from the performance image G1.

以上の通り、第１モードと第２モードとを切替可能な形態によれば、演奏画像Ｇ1からの抽出対象を簡便に切替えることが可能である。なお、以上の説明においては、画像抽出部３１１が第２モードにおいて第１推定処理を実行したが、第２モードにおいて、画像抽出部３１１が、第２推定処理を実行する一方で第１推定処理を実行しない形態も想定される。以上の形態においては、手指画像ｇ2が演奏画像Ｇ1から抽出される。以上の例示から理解される通り、第２モードは、第１推定処理および第２推定処理の一方が実行される動作モードとして表現される。 As described above, in a form that allows switching between the first mode and the second mode, it is possible to easily switch the target to be extracted from the performance image G1. In the above description, the image extraction unit 311 executes the first estimation process in the second mode, but a form in which the image extraction unit 311 executes the second estimation process while not executing the first estimation process in the second mode is also envisioned. In the above form, the finger image g2 is extracted from the performance image G1. As can be understood from the above example, the second mode is expressed as an operation mode in which one of the first estimation process and the second estimation process is executed.
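The mode switch can be sketched in Python as a function that runs only the estimation processes the current mode needs. The function and parameter names are hypothetical. The estimators are passed in as callables so that the unused model is never executed, and `single_target` is an assumed way of choosing which of the two variants of the second mode is in effect.

```python
def build_mask(image, mode, estimate_first, estimate_second,
               single_target="keyboard"):
    """Return the image processing mask M for the given operation mode.

    estimate_first  : callable(image) -> first mask (keyboard region)
    estimate_second : callable(image) -> second mask (finger region)
    mode 1 runs both estimators and combines the masks by element-wise OR.
    mode 2 runs only the estimator named by single_target.
    """
    if mode == 1:
        first = estimate_first(image)
        second = estimate_second(image)
        return [[a | b for a, b in zip(row1, row2)]
                for row1, row2 in zip(first, second)]
    if mode == 2:
        if single_target == "keyboard":
            return estimate_first(image)    # second estimation is skipped
        if single_target == "fingers":
            return estimate_second(image)   # first estimation is skipped
        raise ValueError("single_target must be 'keyboard' or 'fingers'")
    raise ValueError("mode must be 1 or 2")
```

Because the second mode calls a single estimator, it roughly halves the per-frame inference work in this sketch, which is one plausible reason to offer the switch.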

8: Modifications
Specific modifications that may be added to each of the aspects exemplified above are described below. Two or more aspects freely selected from the following examples may be combined as appropriate to the extent that they do not contradict each other.

（１）前述の各形態においては、画像抽出処理（図８）による処理後の演奏画像Ｇ2を処理対象として行列生成処理を実行したが、撮影装置１５が撮影する演奏画像Ｇ1を処理対象として行列生成処理が実行されてもよい。すなわち、演奏画像Ｇ1から演奏画像Ｇ2を生成する画像抽出処理（画像抽出部３１１）は省略されてもよい。 (1) In each of the above-described embodiments, the matrix generation process is performed on the performance image G2 after processing by the image extraction process (FIG. 8). However, the matrix generation process may be performed on the performance image G1 captured by the image capture device 15. In other words, the image extraction process (image extraction unit 311) that generates the performance image G2 from the performance image G1 may be omitted.

前述の各形態においては、演奏画像Ｇ1を利用した指位置推定処理を例示したが、画像抽出処理による処理後の演奏画像Ｇ2を利用して指位置推定処理が実行されてもよい。すなわち、演奏画像Ｇ2の解析により利用者の各手指の位置Ｃ[h,f]が推定されてもよい。また、前述の各形態においては、演奏画像Ｇ1を対象として射影変換処理を実行したが、画像抽出処理による処理後の演奏画像Ｇ2を対象として射影変換処理が実行されてもよい。すなわち、演奏画像Ｇ2に対する射影変換により変換画像が生成されてもよい。 In each of the above-mentioned embodiments, a finger position estimation process using the performance image G1 has been exemplified, but the finger position estimation process may also be performed using the performance image G2 after processing by the image extraction process. That is, the position C[h,f] of each of the user's fingers may be estimated by analyzing the performance image G2. Also, in each of the above-mentioned embodiments, a projective transformation process was performed on the performance image G1, but a projective transformation process may also be performed on the performance image G2 after processing by the image extraction process. That is, a transformed image may be generated by projective transformation of the performance image G2.

（２）前述の各形態においては、利用者の各手指の位置ｃ[h,f]を射影変換処理によりＸ-Ｙ座標系の位置Ｃ[h,f]に変換したが、各手指の位置ｃ[h,f]を表す指位置データＦが生成されてもよい。すなわち、位置ｃ[h,f]を位置Ｃ[h,f]に変換する射影変換処理（射影変換部３１４）は省略されてもよい。 (2) In each of the above-described embodiments, the position c[h,f] of each finger of the user is converted to a position C[h,f] in the X-Y coordinate system by a projective transformation process, but finger position data F representing the position c[h,f] of each finger may be generated. In other words, the projective transformation process (projective transformation unit 314) that converts the position c[h,f] to the position C[h,f] may be omitted.

（３）第１実施形態から第５実施形態においては、演奏解析処理の開始の直後に生成される変換行列Ｗが、以降の処理において継続的に利用される形態を例示したが、演奏解析処理の実行中の適切な時点において変換行列Ｗが更新されてもよい。例えば、鍵盤楽器２００に対する撮影装置１５の位置が変化した場合に、変換行列Ｗを更新する形態が想定される。具体的には、演奏画像Ｇ1の解析により撮影装置１５の位置の変化（以下「位置変化」という）が検出された場合、または、撮影装置１５の位置変化が利用者から指示された場合に、変換行列Ｗが更新される。 (3) In the first to fifth embodiments, the transformation matrix W generated immediately after the start of the performance analysis process is continuously used in the subsequent processes. However, the transformation matrix W may be updated at an appropriate time during the execution of the performance analysis process. For example, a form in which the transformation matrix W is updated when the position of the image capture device 15 relative to the keyboard instrument 200 changes is envisioned. Specifically, the transformation matrix W is updated when a change in the position of the image capture device 15 (hereinafter referred to as "position change") is detected by analysis of the performance image G1, or when a change in the position of the image capture device 15 is instructed by the user.

具体的には、行列生成部３１２は、撮影装置１５の位置変化（ズレ）を表す変換行列δを生成する。例えば、位置変化後の演奏画像Ｇ（Ｇ1，Ｇ2）内の座標（ｘ,ｙ）について、以下の数式(12)で表現される関係を想定する。 Specifically, the matrix generation unit 312 generates a transformation matrix δ that represents a position change (displacement) of the image capture device 15. For example, the relationship expressed by the following formula (12) is assumed for the coordinates (x, y) in the performance image G (G1, G2) after the position change.
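Formula (12) is not reproduced in this text. Assuming that δ is a 3×3 matrix acting on homogeneous coordinates, which is consistent with the normalized coordinates x'/ε and y'/ε referred to in the description, the relationship may take a form such as the following. This is a reconstruction under that assumption, not the formula as published.

```latex
\begin{pmatrix} x' \\ y' \\ \varepsilon \end{pmatrix}
  = \delta \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
```

Under this form, dividing x' and y' by ε yields ordinary image coordinates, which are then compared with the coordinates of the corresponding point before the position change.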

行列生成部３１２は、位置変化後の特定の地点のｘ座標から数式(12)で算定される座標ｘ'/εが、位置変化前における演奏画像Ｇのうち当該地点に対応する地点のｘ座標に近似または一致し、かつ、位置変換後の特定の地点のｙ座標から数式(12)で算定される座標ｙ'/εが、位置変化前における演奏画像Ｇのうち当該地点に対応する地点のｙ座標に近似または一致するように、変換行列δを生成する。そして、行列生成部３１２は、位置変化前の変換行列Ｗと位置変化を表す変換行列δとの積Ｗδを初期行列Ｗ0として生成し、当該初期行列Ｗ0を行列更新処理Ｓc2により更新することで変換行列Ｗを生成する。 The matrix generation unit 312 generates a transformation matrix δ so that the coordinate x'/ε calculated from the x coordinate of a specific point after the position change using formula (12) approximates or matches the x coordinate of a point in the performance image G before the position change that corresponds to the specific point, and the coordinate y'/ε calculated from the y coordinate of a specific point after the position change using formula (12) approximates or matches the y coordinate of a point in the performance image G before the position change that corresponds to the specific point. The matrix generation unit 312 then generates the product Wδ of the transformation matrix W before the position change and the transformation matrix δ that represents the position change as an initial matrix W0, and updates the initial matrix W0 using matrix update processing Sc2 to generate the transformation matrix W.

以上の構成においては、位置変化前に算定された変換行列Ｗと位置変化を表す変換行列δとを利用して、位置変化後の変換行列Ｗが生成される。したがって、行列生成処理の負荷を軽減しながら、各手指の位置Ｃ[h,f]を高精度に特定可能な変換行列Ｗを生成できる。なお、以上の説明においては第１実施形態から第５実施形態を想定したが、第６実施形態においても同様に、第１画像処理の実行中の適切な時点において変換行列Ｗが更新されてもよい。 In the above configuration, the transformation matrix W after the position change is generated using the transformation matrix W calculated before the position change and the transformation matrix δ representing the position change. Therefore, it is possible to generate a transformation matrix W that can identify the position C[h,f] of each finger with high accuracy while reducing the load of the matrix generation process. Note that while the above explanation assumes the first to fifth embodiments, in the sixth embodiment as well, the transformation matrix W may be updated at an appropriate time during the execution of the first image process.
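The composition Wδ can be illustrated numerically. The sketch below is a hypothetical illustration, assuming that W and δ are 3×3 matrices applied to homogeneous column vectors, that δ maps image coordinates after the position change to image coordinates before it, and that W maps pre-change image coordinates to the X-Y coordinate system. The function names are not from the disclosure, and the refinement by the matrix update process Sc2 is not shown.

```python
import numpy as np


def apply_homography(matrix: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Map (N, 2) points through a 3x3 projective matrix."""
    pts = np.asarray(points, dtype=float)
    # Append 1 to each point to form homogeneous coordinates (x, y, 1).
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ matrix.T
    # Divide by the third component (epsilon) to normalize.
    return mapped[:, :2] / mapped[:, 2:3]


def initial_matrix_after_shift(W: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Return the initial matrix W0 = W @ delta used after a position change."""
    return W @ delta
```

Because projective transformations compose by matrix multiplication, applying W0 to a post-change point gives the same result as first applying δ and then the pre-change W. This is why W0 is a reasonable starting point for the update process, instead of generating the matrix from scratch.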

（４）前述の各形態においては、鍵盤２２を具備する鍵盤楽器２００を例示したが、本開示が適用される楽器の種類は任意である。例えば、弦楽器，管楽器または打楽器等、利用者が手動で操作可能な任意の楽器について、前述の各形態は同様に適用される。楽器の典型例は、利用者が片手または両手の手指により演奏する種類の楽器である。 (4) In each of the above embodiments, a keyboard instrument 200 having a keyboard 22 is exemplified, but the present disclosure can be applied to any type of instrument. For example, the above embodiments can be applied to any instrument that can be manually operated by a user, such as a string instrument, a wind instrument, or a percussion instrument. A typical example of an instrument is a type of instrument that a user plays with the fingers of one or both hands.

（５）例えばスマートフォンまたはタブレット端末等の情報装置と通信するサーバ装置により演奏解析システム１００が実現されてもよい。例えば、情報装置に接続された鍵盤楽器２００が生成する演奏データＰと、当該情報装置に搭載または接続された撮影装置１５が生成する画像データＤ1とが、情報装置から演奏解析システム１００に送信される。演奏解析システム１００は、情報装置から受信した演奏データＰおよび画像データＤ1に対して演奏解析処理を実行することで運指データＱを生成し、当該運指データＱを情報装置に送信する。また、第６実施形態または第７実施形態に例示した画像処理システム７００も同様に、情報装置と通信するサーバ装置により実現されてよい。 (5) The performance analysis system 100 may be implemented by a server device that communicates with an information device such as a smartphone or tablet terminal. For example, performance data P generated by a keyboard instrument 200 connected to the information device and image data D1 generated by an image capture device 15 mounted on or connected to the information device are transmitted from the information device to the performance analysis system 100. The performance analysis system 100 generates fingering data Q by executing the performance analysis process on the performance data P and image data D1 received from the information device, and transmits the fingering data Q to the information device. Similarly, the image processing system 700 exemplified in the sixth or seventh embodiment may be implemented by a server device that communicates with the information device.

（６）第１実施形態から第５実施形態に係る演奏解析システム１００、または第６実施形態から第７実施形態に係る画像処理システム７００の機能は、前述の通り、制御装置１１を構成する単数または複数のプロセッサと、記憶装置１２に記憶されたプログラムとの協働により実現される。本開示に係るプログラムは、コンピュータが読取可能な記録媒体に格納された形態で提供されてコンピュータにインストールされ得る。記録媒体は、例えば非一過性（non-transitory）の記録媒体であり、ＣＤ-ＲＯＭ等の光学式記録媒体（光ディスク）が好例であるが、半導体記録媒体または磁気記録媒体等の公知の任意の形式の記録媒体も包含される。なお、非一過性の記録媒体とは、一過性の伝搬信号（transitory, propagating signal）を除く任意の記録媒体を含み、揮発性の記録媒体も除外されない。また、配信装置が通信網を介してプログラムを配信する構成では、当該配信装置においてプログラムを記憶する記憶装置１２が、前述の非一過性の記録媒体に相当する。 (6) The functions of the performance analysis system 100 according to the first to fifth embodiments, or the image processing system 700 according to the sixth to seventh embodiments, are realized by the cooperation of one or more processors constituting the control device 11 and the program stored in the storage device 12, as described above. The program according to the present disclosure can be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium, and a good example is an optical recording medium (optical disk) such as a CD-ROM, but also includes any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium. Note that a non-transitory recording medium includes any recording medium except a transient, propagating signal, and does not exclude volatile recording media. In addition, in a configuration in which a distribution device distributes a program via a communication network, the storage device 12 that stores the program in the distribution device corresponds to the non-transitory recording medium described above.

９：付記 9: Supplementary Notes
以上に例示した形態から、例えば以下の構成が把握される。 From the embodiments exemplified above, the following configurations, for example, can be understood.

本開示のひとつの態様（態様１）に係る画像処理方法は、楽器の画像と当該楽器を演奏する利用者の複数の手指の画像とを含む演奏画像のうち、前記楽器の画像を含む特定領域を推定し、前記演奏画像のうち前記特定領域を抽出する。以上の態様においては、楽器の画像と利用者の複数の手指の画像とを含む演奏画像のうち楽器の画像を含む特定領域が抽出される。したがって、演奏画像の利便性を向上させることが可能である。 An image processing method according to one aspect (aspect 1) of the present disclosure estimates a specific region that includes an image of a musical instrument from a performance image that includes an image of the musical instrument and an image of the multiple fingers of a user playing the musical instrument, and extracts the specific region from the performance image. In the above aspect, a specific region that includes the image of the musical instrument is extracted from a performance image that includes an image of the musical instrument and an image of the multiple fingers of the user. Therefore, it is possible to improve the convenience of the performance image.

態様１の具体例（態様２）において、前記特定領域は、前記楽器の画像と前記利用者の身体の少なくとも一部の画像とを含む領域である。以上の態様においては、楽器の画像と利用者の身体の画像とを含む特定領域が抽出される。したがって、楽器の様子と利用者の身体の様子とを効率的に視認できる画像を生成できる。 In a specific example of aspect 1 (aspect 2), the specific area is an area that includes an image of the musical instrument and an image of at least a portion of the user's body. In the above aspect, a specific area that includes an image of the musical instrument and an image of the user's body is extracted. Therefore, an image can be generated that allows the appearance of the musical instrument and the appearance of the user's body to be efficiently viewed.

態様２の具体例（態様３）において、前記特定領域の推定においては、前記演奏画像を表す画像データを、機械学習済の推定モデルに入力することで、前記特定領域を表す画像処理マスクを生成し、前記特定領域の抽出においては、前記画像処理マスクを前記演奏画像に適用することで前記特定領域を抽出する。以上の態様においては、機械学習済の推定モデルに演奏画像の画像データを入力することで、特定領域を表す画像処理マスクが生成される。したがって、未知の多様な演奏画像について特定領域を高精度に特定できる。 In a specific example (aspect 3) of aspect 2, in estimating the specific region, image data representing the performance image is input into a machine-learned estimation model to generate an image processing mask representing the specific region, and in extracting the specific region, the image processing mask is applied to the performance image to extract the specific region. In the above aspect, an image processing mask representing the specific region is generated by inputting image data of the performance image into a machine-learned estimation model. Therefore, it is possible to identify specific regions with high accuracy for unknown and diverse performance images.

態様３の具体例（態様４）において、前記推定モデルは、第１モデルと第２モデルとを含み、前記特定領域の推定は、前記演奏画像を表す画像データを前記第１モデルに入力することで、当該演奏画像のうち前記楽器の画像を含む第１領域を推定する第１推定処理と、前記演奏画像を表す画像データを前記第２モデルに入力することで、当該演奏画像のうち前記複数の手指の画像を含む第２領域を推定する第２推定処理と、前記第１領域と前記第２領域とを含む前記特定領域を表す前記画像処理マスクを生成する領域合成処理とを含む。以上の態様においては、演奏画像のうち楽器の画像を含む第１領域が第１モデルにより推定され、演奏画像のうち利用者の画像を含む第２領域が第２モデルにより推定される。したがって、楽器の画像と利用者の画像との双方を一括的に抽出する単体のモデルを利用する構成と比較して、楽器の画像と利用者の画像とを含む特定領域を高精度に抽出できる。また、第１モデルおよび第２モデルの各々が個別の機械学習により確立されるから、第１モデルおよび第２モデルの機械学習に関する処理負荷が軽減される。 In a specific example (aspect 4) of aspect 3, the estimation model includes a first model and a second model, and the estimation of the specific area includes a first estimation process for estimating a first area including the image of the musical instrument in the performance image by inputting image data representing the performance image into the first model, a second estimation process for estimating a second area including the image of the multiple fingers in the performance image by inputting image data representing the performance image into the second model, and an area synthesis process for generating the image processing mask representing the specific area including the first area and the second area. In the above aspect, the first area including the image of the musical instrument in the performance image is estimated by the first model, and the second area including the image of the user in the performance image is estimated by the second model. Therefore, compared to a configuration using a single model that extracts both the image of the musical instrument and the image of the user at once, the specific area including the image of the musical instrument and the image of the user can be extracted with high accuracy. In addition, since each of the first model and the second model is established by individual machine learning, the processing load related to the machine learning of the first model and the second model is reduced.

態様４の具体例（態様５）において、前記第１推定処理および前記第２推定処理の双方を実行する第１モードと、前記第１推定処理および前記第２推定処理の一方を実行する第２モードとを切替可能である。以上の態様において、第１モードでは、楽器の画像と利用者の画像とを含む特定領域が演奏画像から抽出される。他方、第２モードでは、楽器の楽器と利用者の画像との一方を含む特定領域が演奏画像から抽出される。以上の通り、演奏画像からの抽出対象を簡便に切替えることが可能である。 In a specific example of aspect 4 (aspect 5), it is possible to switch between a first mode in which both the first estimation process and the second estimation process are executed, and a second mode in which one of the first estimation process and the second estimation process is executed. In the above aspect, in the first mode, a specific area including an image of the musical instrument and an image of the user is extracted from the performance image. On the other hand, in the second mode, a specific area including one of the image of the musical instrument and the image of the user is extracted from the performance image. As described above, it is possible to easily switch the object to be extracted from the performance image.

本開示のひとつの態様（態様６）に係る画像処理システムは、楽器の画像と当該楽器を演奏する利用者の複数の手指の画像とを含む演奏画像のうち、前記楽器の画像を含む特定領域を推定する領域推定部と、前記演奏画像のうち前記特定領域を抽出する領域抽出部とを具備する。 An image processing system according to one aspect (aspect 6) of the present disclosure includes an area estimation unit that estimates a specific area including an image of a musical instrument from a performance image including an image of a musical instrument and an image of the fingers of a user playing the musical instrument, and an area extraction unit that extracts the specific area from the performance image.

本開示のひとつの態様（態様７）に係るプログラムは、楽器の画像と当該楽器を演奏する利用者の複数の手指の画像とを含む演奏画像のうち、前記楽器の画像を含む特定領域を推定する領域推定部、および、前記演奏画像のうち前記特定領域を抽出する領域抽出部、としてコンピュータシステムを機能させる。 A program according to one aspect (aspect 7) of the present disclosure causes a computer system to function as an area estimation unit that estimates a specific area that includes an image of a musical instrument from a performance image that includes an image of the musical instrument and an image of the fingers of a user playing the musical instrument, and an area extraction unit that extracts the specific area from the performance image.

１００…演奏解析システム、１１…制御装置、１２…記憶装置、１３…操作装置、１４…表示装置、１５…撮影装置、２００…鍵盤楽器、２１…鍵、２２…鍵盤、３０…演奏解析部、３１…指位置データ生成部、３１１…画像抽出部、３１２…行列生成部、３１３…指位置推定部、３１４…射影変換部、３２…運指データ生成部、３２１…確率算定部、３２２…運指推定部、３２３…制御データ生成部、４０…表示制御部、５１…推定モデル、５１a…暫定モデル、５２[k]…推定モデル、７００…画像処理システム。
100...performance analysis system, 11...control device, 12...storage device, 13...operation device, 14...display device, 15...image capture device, 200...keyboard instrument, 21...key, 22...keyboard, 30...performance analysis unit, 31...finger position data generation unit, 311...image extraction unit, 312...matrix generation unit, 313...finger position estimation unit, 314...projective transformation unit, 32...fingering data generation unit, 321...probability calculation unit, 322...fingering estimation unit, 323...control data generation unit, 40...display control unit, 51...estimation model, 51a...tentative model, 52[k]...estimation model, 700...image processing system.

Claims

An image processing method implemented by a computer system for extracting a specific area from a performance image including a keyboard image of a keyboard instrument and a finger image of a plurality of fingers of a user playing the keyboard instrument, wherein,
in the extraction of the specific area, the method is switchable between:
a first mode in which the specific area including both the keyboard image and the finger image is extracted from the performance image; and
a second mode in which the specific area including the keyboard image and not including the finger image is extracted from the performance image.

The image processing method according to claim 1, wherein the extraction of the specific area comprises:
inputting image data representing the performance image into a machine-learned estimation model to generate an image processing mask representing the specific area; and
extracting the specific area by applying the image processing mask to the performance image.

The image processing method according to claim 2, wherein
the estimation model includes a first model and a second model,
in the first mode, the method comprises:
a first estimation process of estimating a first region of the performance image including the keyboard image by inputting image data representing the performance image into the first model;
a second estimation process of estimating a second region of the performance image including the finger image by inputting image data representing the performance image into the second model;
generating a first image processing mask representing the specific area including the first region and the second region; and
extracting the specific area by applying the first image processing mask to the performance image, and
in the second mode, the method comprises:
generating, by executing the first estimation process, a second image processing mask representing the specific area including the keyboard image and not including the finger image; and
extracting the specific area by applying the second image processing mask to the performance image.

An image processing system comprising an image extraction unit that extracts a specific area from a performance image including a keyboard image of a keyboard instrument and a finger image of a plurality of fingers of a user playing the keyboard instrument, wherein
the image extraction unit is switchable between:
a first mode in which the specific area including both the keyboard image and the finger image is extracted from the performance image; and
a second mode in which the specific area including the keyboard image and not including the finger image is extracted from the performance image.

A program that causes a computer system to function as an image extraction unit that extracts a specific area from a performance image including a keyboard image of a keyboard instrument and a finger image of a plurality of fingers of a user playing the keyboard instrument, wherein
the image extraction unit is switchable between:
a first mode in which the specific area including both the keyboard image and the finger image is extracted from the performance image; and
a second mode in which the specific area including the keyboard image and not including the finger image is extracted from the performance image.