JP7614926B2 - Image forming system, control method, and program

Info

Publication number: JP7614926B2
Application number: JP2021073599A
Authority: JP
Inventors: 良介笠原
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2025-01-16
Anticipated expiration: 2041-04-23
Also published as: JP2022167659A

Description

The present invention relates to an image forming system, a control method, and a program.

Information processing devices that can work together with voice control devices such as smart speakers have recently become widespread. Patent Document 1 discloses a technique that converts an utterance made to a voice control device into an instruction command for an image forming device and operates the image forming device accordingly.

Patent Document 1: JP 2020-107130 A

However, Patent Document 1 does not assume that a plurality of voice control devices are linked to a single image forming device, and there is room for improvement in this respect.
In an environment with multiple voice control devices, operations may conflict when the settings of an image forming device are changed through a voice control device. For example, user A starts an interactive mode with a first smart speaker in order to execute a job. At almost the same time, user B starts an interactive mode with a second smart speaker in order to execute a job. If user B then changes the settings and executes a job before user A does, user A is not notified of the change and executes a job believing that the original settings are still in effect, so the printed result differs from what user A expected.

An object of the present invention is therefore to improve usability when an image forming device is operated using a voice control device.

The information processing system of the present invention is an image forming system including a plurality of voice control devices that accept user speech, a server that acquires instruction information based on speech information acquired from the plurality of voice control devices, and an image forming device that executes processing based on the instruction information acquired from the server. The system is characterized by having: a transmission means, in one of the plurality of voice control devices, that accepts an instruction from a user to start a voice operation and transmits startup information to the server; a request means, in the server, that receives the startup information and requests setting information from the image forming device; a holding means, in the server, that holds the setting information of the image forming device in association with information on the one voice control device; and a generation means, in the server, that generates the instruction information using at least the held setting information when instructing execution of a job based on the speech information acquired from the one voice control device.
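The holding and generation means described above can be illustrated with a minimal Python sketch. The class name `SettingsStore`, its methods, the device identifiers, and the dictionary layout of the settings are all hypothetical and are not taken from the patent; the sketch only shows one way a server might keep a settings snapshot per voice control device and build a job instruction from that snapshot.

```python
from copy import deepcopy


class SettingsStore:
    """Holds one settings snapshot per voice control device (hypothetical helper)."""

    def __init__(self):
        self._by_device = {}

    def on_startup(self, device_id, mfp_settings):
        # Called when startup information arrives: keep a copy of the
        # settings the image forming device reported at that moment.
        self._by_device[device_id] = deepcopy(mfp_settings)

    def build_job_command(self, device_id, job_type, overrides=None):
        # Generate instruction information from the snapshot linked to
        # this device, not from the device's current settings.
        settings = deepcopy(self._by_device[device_id])
        settings.update(overrides or {})
        return {"type": "job", "job": job_type, "settings": settings}


mfp = {"color": "mono", "duplex": False}   # assumed setting items
store = SettingsStore()
store.on_startup("speaker-110", mfp)       # user A starts a session
store.on_startup("speaker-111", mfp)       # user B starts a session
mfp["color"] = "color"                     # user B changes the device settings
command_a = store.build_job_command("speaker-110", "copy")
print(command_a["settings"]["color"])      # user A's job still uses "mono"
```

Because each snapshot is keyed by the device that started the session, a later change made through another speaker does not silently alter the settings used for user A's job in this sketch.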

According to the present invention, usability when operating an image forming device using a voice control device can be improved.

Fig. 1 is a diagram showing an example of the overall configuration of the image forming system according to the present embodiment.
Fig. 2 is a diagram showing an example of the hardware configuration of the MFP.
Fig. 3 is a diagram showing an example of the hardware configuration of a smart speaker.
Fig. 4 is a diagram showing an example of the hardware configuration of the cloud server.
Fig. 5 is a diagram showing an example of the functional configuration of the MFP.
Fig. 6 is a diagram showing an example of the functional configuration of a smart speaker.
Fig. 7 is a diagram showing an example of the functional configuration of the cloud server.
Fig. 8 is a sequence diagram showing the overall processing executed by the image forming system.
Fig. 9 is a flowchart showing details of the processing executed by the server.
Fig. 10 is a sequence diagram showing the overall processing executed by the image forming system.
Fig. 11 is a sequence diagram showing the overall processing executed by the image forming system.

Embodiments of the present invention are described below with reference to the drawings.

<System Configuration>
Fig. 1 is a diagram showing an example of the system configuration of the image forming system of the present embodiment. As shown in Fig. 1, the image forming system 100 is composed of an MFP 101, a cloud server 103, and a plurality of smart speakers 110 and 111. The MFP 101 and the smart speakers 110 and 111 are connected to a network 104 and can communicate with each other. The cloud server 103 is connected to the network 104 via a gateway 105 and can communicate with the MFP 101 and the smart speakers 110 and 111 on the network 104. Although the image forming system 100 here includes two smart speakers 110 and 111, it may include three or more.

The smart speakers 110 and 111 are electronic devices equipped with a voice control function and are examples of voice control devices. The smart speakers 110 and 111 accept a user's speech and issue various instructions to the MFP 101 via the cloud server 103. In this embodiment, independent electronic devices such as the smart speakers 110 and 111 are used as the voice control devices, but a terminal device equipped with a voice control function, such as a PC (personal computer) or a smartphone, may be used instead.

The cloud server 103 provides a voice recognition service and an operation control service for the MFP 101. The cloud server 103 performs voice recognition on speech information acquired from the smart speakers 110 and 111 and instructs the MFP 101 to execute processing based on the voice recognition result. In this way, it controls the operation of the MFP 101.
The MFP 101 is a multifunction printer (MFP) having multiple functions such as copying, scanning, printing, and faxing, and is an example of an image forming device. Note that the image forming device may be a single-function device such as a printer or a scanner. The MFP 101 executes processing based on instruction information acquired from the cloud server 103.

The network 104 includes a wired or wireless LAN, a WAN, the Internet, and the like. The network 104 connects the smart speakers 110 and 111, the MFP 101, and the gateway 105 to each other.
The gateway 105 is, for example, a wireless LAN router that operates according to a wireless communication method conforming to the IEEE 802.11 standard series. Instead of a wireless LAN router, the gateway 105 may be a wired LAN router that operates according to a wired communication method conforming to Ethernet standards such as 10BASE-T, 100BASE-T, and 1000BASE-T.

In this embodiment, the smart speakers 110 and 111 and the MFP 101 are independent of each other, but the system is not limited to this configuration. For example, some of the devices and functions constituting the smart speakers 110 and 111 may be built into the MFP 101 as an integrated unit. Likewise, the smart speakers 110 and 111 and the cloud server 103 are independent of each other in this embodiment, but some of the devices and functions constituting the cloud server 103 may be built into the smart speakers 110 and 111 as an integrated unit. Furthermore, the cloud server 103 and the MFP 101 are independent of each other in this embodiment, but some of the devices and functions constituting the cloud server 103 may be built into the MFP 101 as an integrated unit.

<MFP Hardware Configuration>
Fig. 2 is a block diagram showing the hardware configuration of the MFP 101 according to this embodiment. The MFP 101 is mainly composed of a control unit 200, an operation panel 209, a print engine 211, a scanner 213, and a card reader 214. The control unit 200 includes a CPU 202, a RAM 203, a ROM 204, a storage 205, a network I/F 206, a display controller 207, and an operation I/F 208. The control unit 200 also includes a print controller 210 and a scan controller 212. These blocks are connected to each other via a system bus 201.

The CPU (Central Processing Unit) 202 controls the overall operation of the MFP 101. The CPU 202 reads programs stored in the ROM 204 or the like, loads them into the RAM 203, and executes them, thereby realizing the processes shown in the sequence diagrams described later and the functions shown in Fig. 5 described later. The RAM 203 is the main memory of the CPU 202 and is used as a temporary storage area into which programs are loaded. The ROM 204 stores programs executable by the CPU 202. The storage 205 stores print data, image data, and the various setting information required to execute a job.

In this embodiment, one CPU 202 uses one memory (RAM 203) to execute each process shown in the sequence diagrams described later, but other configurations are also possible. For example, multiple CPUs, RAMs, ROMs, and storages may cooperate to execute those processes. Some of the processes may also be executed by hardware circuits such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array).

The network I/F 206 is an interface that allows the MFP 101 to communicate with external devices via the network 104. The MFP 101 receives instruction information from the cloud server 103 via the network I/F 206 and executes processing based on that instruction information. If the instruction information is a job execution command, which instructs execution of a job, the MFP 101 executes the job. A job is the unit of execution for a series of image forming processes realized using the print engine 211 or the scanner 213; examples include print jobs, copy jobs, and scan jobs. If the instruction information is a setting change command, which instructs a change to the settings of the MFP 101 itself, the MFP 101 changes the setting information stored in the storage 205 to the instructed content.

The operation I/F 208 connects the operation panel 209 and the card reader 214 to the CPU 202. The operation panel 209 serves as both an input device and a display device and is configured, for example, as a touch panel display. The operation panel 209 provides the operations input by the user to the CPU 202 via the operation I/F 208. The display controller 207 controls what is displayed on the operation panel 209 under the control of the CPU 202.

The print controller 210 connects the print engine 211 to the CPU 202. Image data to be printed is transferred to the print engine 211 via the print controller 210. Under the control of the CPU 202, the print engine 211 receives the image data to be printed and forms an image based on the received image data on a sheet. The print engine 211 may use either an electrophotographic method or an inkjet method. With the electrophotographic method, an electrostatic latent image is formed on a photoconductor and developed with toner, the toner image is transferred to a sheet, and the transferred toner image is fixed to form the image. With the inkjet method, ink is ejected to form the image on the sheet.

The scan controller 212 connects the scanner 213 to the CPU 202. The scanner 213 reads an image on a sheet and generates image data. The image data generated by the scanner 213 is stored in the storage 205. Under the control of the CPU 202, the print engine 211 forms an image on a sheet based on the image data generated by the scanner 213. The scanner 213 has a document feeder (not shown) and can read sheets placed on the document feeder while conveying them one at a time.
The card reader 214 authenticates the user. Once authentication with the MFP 101 succeeds, the user can operate the screen displayed on the operation panel 209 to instruct execution of a print job, a copy job, a scan job, and the like.

<Smart Speaker Hardware Configuration>
Fig. 3 is a block diagram showing the hardware configuration of the smart speakers 110 and 111 according to this embodiment. Each of the smart speakers 110 and 111 is mainly composed of a control unit 300, a microphone 308, a speaker 310, and an LED 312. The control unit 300 includes a CPU 302, a RAM 303, a ROM 304, a storage 305, a network I/F 306, a microphone I/F 307, an audio controller 309, and a display controller 311. These blocks are connected to each other via a system bus 301. Because the smart speaker 110 and the smart speaker 111 have the same hardware configuration, only the smart speaker 110 is described below and the description of the smart speaker 111 is omitted.

The CPU 302 controls the overall operation of the smart speaker 110. The CPU 302 reads programs stored in the ROM 304 or the like, loads them into the RAM 303, and executes them, thereby realizing the processes shown in the sequence diagrams described later and the functions shown in Fig. 6 described later. The RAM 303 is the main memory of the CPU 302 and is used as a temporary storage area into which programs are loaded. The ROM 304 stores programs such as a voice control program for realizing the voice control function. The storage 305 stores the authentication information required for the authentication procedure with the cloud server 103, the setting information of the smart speaker 110 itself, and the like.

The network I/F 306 is an interface that allows the smart speaker 110 to communicate with external devices via the network 104. The smart speaker 110 transmits speech information on the user's speech received by the microphone 308 to the cloud server 103 via the network I/F 306. The smart speaker 110 also receives voice synthesis data from the cloud server 103 via the network I/F 306.

The microphone I/F 307 connects the microphone 308 to the CPU 302. The microphone 308 is a voice input device that converts speech uttered by a user near the smart speaker 110 into a voice signal. The microphone 308 need not be built into the smart speaker 110 and may instead be linked to it by wired or wireless communication. Multiple microphones 308 may also be arranged at predetermined positions so that the direction from which the user's voice arrives can be calculated. The CPU 302 uses the voice signal acquired by the microphone 308 to generate voice data encoded for transmission to the cloud server 103 and temporarily stores the generated voice data in the RAM 303. When the user finishes speaking, the CPU 302 acquires the voice data stored in the RAM 303 as speech information.
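The buffering behavior described above, in which encoded voice data accumulates in memory and is taken as speech information once the utterance ends, can be sketched as follows. The class `UtteranceBuffer` and its method names are hypothetical, and the byte strings stand in for encoded audio chunks; the patent does not specify how the data is chunked or how the end of an utterance is detected.

```python
class UtteranceBuffer:
    """Accumulates encoded audio chunks for one utterance (illustrative only)."""

    def __init__(self):
        self._chunks = []

    def append(self, encoded_chunk: bytes) -> None:
        # Temporarily keep each encoded chunk while the user is speaking.
        self._chunks.append(encoded_chunk)

    def finish(self) -> bytes:
        # On end of utterance, hand back everything collected so far as one
        # payload and reset the buffer for the next utterance.
        speech_information = b"".join(self._chunks)
        self._chunks.clear()
        return speech_information


buffer = UtteranceBuffer()
buffer.append(b"chunk-1;")
buffer.append(b"chunk-2;")
payload = buffer.finish()
```

In this sketch the payload returned by `finish()` is what would be sent to the server; clearing the buffer on each call keeps consecutive utterances separate.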

The audio controller 309 connects the speaker 310 to the CPU 302. The speaker 310 is a voice output device. The speaker 310 need not be built into the smart speaker 110 and may instead be linked to it by wired or wireless communication. Under the control of the CPU 302, the audio controller 309 outputs from the speaker 310 a response sound and speech based on voice synthesis data received from the cloud server 103.

The display controller 311 connects the LED 312 to the CPU 302. The display controller 311 controls the lighting of the LED 312 under the control of the CPU 302. For example, the display controller 311 lights the LED 312 while the user's speech is being recognized normally. Note that a display device capable of showing characters and figures may be used instead of the LED 312.

<Cloud Server Hardware Configuration>
Fig. 4 is a block diagram showing the hardware configuration of the cloud server 103 according to this embodiment. The cloud server 103 includes a CPU 402, a RAM 403, a ROM 404, a storage 405, and a network I/F 406. These blocks are connected to each other via a system bus 401.

The CPU 402 controls the overall operation of the cloud server 103. The CPU 402 reads programs stored in the ROM 404 or the like, loads them into the RAM 403, and executes them, thereby realizing the processes shown in the sequence diagrams and flowcharts described later and the functions shown in Fig. 7 described later. The RAM 403 is the main memory of the CPU 402 and is used as a temporary storage area into which programs are loaded. The ROM 404 stores various programs, including a voice recognition program for realizing the voice recognition function. The storage 405 stores the authentication information required for the authentication procedure with the MFP 101, as well as data such as content and files.

The network I/F 406 is an interface that allows the cloud server 103 to communicate with external devices via the network 104. The cloud server 103 receives speech information from the smart speaker 110 via the network I/F 406 and transmits instruction information, acquired based on the voice recognition result for that speech information, to the MFP 101. The cloud server 103 also receives various notifications from the MFP 101 via the network I/F 406 and transmits voice synthesis data generated based on those notifications to the smart speakers 110 and 111.

This embodiment illustrates a case in which a single CPU 402 uses a single memory (RAM 403) to execute programs stored in the ROM 404 or the like, thereby executing the processes shown in the sequence diagrams and flowcharts described later, but other configurations are also possible. For example, multiple CPUs, RAMs, ROMs, and storage devices such as HDDs may cooperate to execute those processes. In addition, the cloud server 103 need not be realized by a single server; it may be composed of multiple servers divided by function, or of multiple devices for distributed processing.

<MFP Functional Configuration>
Fig. 5 is a diagram showing an example of the functional configuration of the MFP 101 according to this embodiment. The MFP 101 operates as each of the functional units shown in Fig. 5 when the CPU 202 executes a program stored in the ROM 204 or the like.
The data transmission/reception unit 501 transmits and receives data to and from the cloud server 103 via the network I/F 206. Specifically, the data transmission/reception unit 501 receives data related to instructions and requests from the cloud server 103 and transmits responses to those instructions and requests to the cloud server 103.

The data analysis unit 502 analyzes the data received by the data transmission/reception unit 501. If the analysis shows that the data received from the cloud server 103 is a job execution command, the data analysis unit 502 provides the job execution command to the job control unit 503, described later. If the data is a setting change command, the data analysis unit 502 provides the setting change command to the data management unit 504, described later. If the data is a command requesting the setting information of the MFP 101 itself, the data management unit 504 reads that setting information from the storage 205, and the data transmission/reception unit 501 transmits the read setting information to the cloud server 103.
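The three-way routing performed by the data analysis unit can be sketched in Python as below. The command format (a dictionary with a `kind` field), the function name `analyze`, and the callables passed in are assumptions made for illustration; the patent does not define a wire format for these commands.

```python
def analyze(command, job_control, data_management, send_to_server):
    """Route one received command by kind (the command format is assumed)."""
    kind = command["kind"]
    if kind == "job_execution":
        # Job execution commands go to the job control side.
        job_control(command)
    elif kind == "setting_change":
        # Setting change commands update the stored setting information.
        data_management["settings"].update(command["changes"])
    elif kind == "settings_request":
        # Setting requests are answered with a copy of the current settings.
        send_to_server(dict(data_management["settings"]))
    else:
        raise ValueError(f"unknown command kind: {kind}")


executed, sent = [], []
storage = {"settings": {"color": "mono"}}
analyze({"kind": "setting_change", "changes": {"color": "color"}},
        executed.append, storage, sent.append)
analyze({"kind": "settings_request"}, executed.append, storage, sent.append)
analyze({"kind": "job_execution", "job": "print"},
        executed.append, storage, sent.append)
```

Passing the handlers in as callables keeps the routing logic separate from the units that act on each command, which mirrors the division of roles in the description.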

The job control unit 503 issues control instructions to the print engine 211 and the scanner 213 through the print controller 210 and the scan controller 212. Specifically, the job control unit 503 executes printing and scanning based on job execution commands provided by the data analysis unit 502 and job execution commands provided by the operation control unit 506. A job execution command provided by the data analysis unit 502 is one that was obtained based on the user's speech information.

The data management unit 504 stores in the storage 205 the setting information and other data required to control the print engine 211 and the scanner 213. The setting information is data consisting of combinations of setting items and setting values for jobs executed by the job control unit 503, for example the black-and-white/color setting and the single-sided/double-sided printing setting. The data management unit 504 changes the setting information stored in the storage 205 based on setting change commands provided by the data analysis unit 502 and setting change commands provided by the operation control unit 506. A setting change command provided by the data analysis unit 502 is one that was obtained based on the user's speech information.

The display control unit 505 controls what is displayed on the operation panel 209 via the display controller 207. More specifically, the display control unit 505 displays user-operable UI parts (buttons, pull-down lists, check boxes, and so on) on the operation panel 209. When the job control unit 503 executes a job, the display control unit 505 updates the display on the operation panel 209 according to the execution status of the job. When the data management unit 504 changes the setting information, the display control unit 505 updates the display on the operation panel 209 to reflect the change.

The operation control unit 506 acquires the touched coordinates on the operation panel 209 via the operation I/F 208 and determines the instruction corresponding to the UI part that was operated. For example, when the print execution button is touched, the operation control unit 506 recognizes that execution of a print job has been instructed. The operation control unit 506 then reads the specified image data and the setting information from the storage 205 via the data management unit 504 and uses them to generate a job execution command for printing the image data. The generated job execution command is provided to the job control unit 503. Similarly, when a setting change button is touched, the operation control unit 506 generates a setting change command that changes the setting information to the instructed content. The generated setting change command is provided to the data management unit 504.
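Mapping touched coordinates to the UI part that was operated can be sketched as a simple rectangle hit test. The `UiPart` data class, the `resolve_touch` function, and the button positions and sizes below are hypothetical; the patent does not describe how the operation control unit performs this lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UiPart:
    """A rectangular UI part and the command it triggers (illustrative)."""
    name: str
    x: int
    y: int
    width: int
    height: int
    command: str

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def resolve_touch(parts, px, py):
    """Return the command of the first part containing the point, else None."""
    for part in parts:
        if part.contains(px, py):
            return part.command
    return None


# Assumed layout: two buttons stacked vertically.
PARTS = [
    UiPart("print_button", 10, 10, 100, 40, "job_execution"),
    UiPart("settings_button", 10, 60, 100, 40, "setting_change"),
]
touched = resolve_touch(PARTS, 50, 30)
```

A touch outside every part returns `None`, so the caller can ignore it instead of generating a command.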

<Smart Speaker Functional Configuration>
Fig. 6 is a block diagram showing the functional configuration of the smart speakers 110 and 111 according to this embodiment. The smart speakers 110 and 111 operate as the functional units shown in Fig. 6 when the CPU 302 executes a program stored in the ROM 304 or the like. Because the smart speaker 110 and the smart speaker 111 have the same functional configuration, only the smart speaker 110 is described below and the description of the smart speaker 111 is omitted.

The data transmission/reception unit 601 transmits and receives data to and from the cloud server 103 via the network I/F 306. Specifically, the data transmission/reception unit 601 transmits speech information acquired by the voice acquisition unit 604 (described later) to the cloud server 103 and receives the response from the cloud server 103. While a dialogue session with the cloud server 103 continues, the data transmission/reception unit 601 sequentially transmits the speech information obtained by the voice control unit 603 (described later) to the cloud server 103. The data transmission/reception unit 601 also transmits startup information to the cloud server 103. The timing of transmitting the startup information is managed by the voice control unit 603.
The data management unit 602 stores in the storage 305 the setting information and other data required to control the microphone 308, the speaker 310, and the LED 312. The setting information is, for example, the volume setting for speech reproduced by the voice reproduction unit 605, described later.

音声取得部６０４は、マイクロフォン３０８で取得した音声信号を用いて、ＭＰ３等の所定のフォーマットで符号化された音声データを生成し、生成した音声データをＲＡＭ３０３に一時的に保存する。以上のような音声取得部６０４の処理を音声処理という。音声処理の開始及び終了のタイミングは、後述する音声制御部６０３によって管理される。なお、音声データの生成には、汎用のストリーミング用フォーマットが用いられてもよい。
音声再生部６０５は、データ送受信部６０１が受信した応答に含まれる音声合成データを、オーディオコントローラ３０９を介してスピーカ３１０で再生する。以上のような音声再生部６０５の処理を再生処理という。
ＬＥＤ表示部６０６は、ＬＥＤ３１２の点灯・消灯を行う。ＬＥＤ表示部６０６の点灯・消灯タイミングは、後述する音声制御部６０３によって管理される。 The voice acquisition unit 604 generates voice data encoded in a predetermined format such as MP3 from the voice signal acquired by the microphone 308, and temporarily stores the generated voice data in the RAM 303. This processing by the voice acquisition unit 604 is called voice processing. The timing of starting and ending the voice processing is managed by the voice control unit 603, which will be described later. Note that a general-purpose streaming format may be used to generate the voice data.
The voice playback unit 605 plays back the voice synthesis data included in the response received by the data transmission/reception unit 601 through the speaker 310 via the audio controller 309. This processing by the voice playback unit 605 is called playback processing.
The LED display unit 606 turns the LED 312 on and off. The on/off timing of the LED display unit 606 is managed by the voice control unit 603, which will be described later.

操作開始検知部６０７は、ユーザから音声操作の開始指示を受け付けて、音声制御部６０３に対して操作開始通知を送信する。具体的には、操作開始検知部６０７は、マイクロフォン３０８で取得した音声信号から、ウェイクワードの発声を検知する。ウェイクワードは、音声認識機能を起動する際にユーザから発せられるワードである。ユーザはウェイクワードを発し、続いて自身が行いたいこと（例えば、「コピーしたい」）を話すことでＭＦＰ１０１の操作を行うことができる。なお、操作開始検知部６０７は、スマートスピーカ１１０の操作開始キー（不図示）の押下により、音声操作の開始指示を検知してもよい。 The operation start detection unit 607 receives a voice operation start instruction from the user and transmits an operation start notification to the voice control unit 603. Specifically, the operation start detection unit 607 detects the utterance of a wake word from the voice signal acquired by the microphone 308. The wake word is a word uttered by the user when activating the voice recognition function. The user can operate the MFP 101 by uttering the wake word and then speaking what they want to do (for example, "I want to copy"). The operation start detection unit 607 may detect a voice operation start instruction by pressing an operation start key (not shown) on the smart speaker 110.

発話終了判定部６０８は、発話の終了タイミングを判定し、発話が終了したと判定した場合に、音声制御部６０３に対して発話終了通知を送信する。例えば、ユーザの音声が所定時間（例えば３秒）途切れた場合に、ユーザの発話が終了したと判定する。なお、発話終了の判定は、発話が無い時間により行う方法に限られず、所定のワードの発声を検知したか否かによって行ってもよい。例えば、「はい」、「いいえ」、「ＯＫ」、「キャンセル」、「終了」、「スタート」、「開始」等の所定のワードを受け付けた場合、発話終了判定部６０８は、所定時間を待たずに発話が終了したと判定する。また、発話処理の終了判定は、クラウドサーバ１０３で行ってもよい。この場合には、ユーザの発話内容の意味や文脈に基づき操作終了の判定を行ってもよい。 The speech end determination unit 608 determines the timing of the end of speech, and when it determines that the speech has ended, it transmits a speech end notification to the voice control unit 603. For example, when the user's voice is interrupted for a predetermined time (e.g., 3 seconds), it determines that the user's speech has ended. Note that the end of speech need not be determined from the duration of silence alone; it may also be determined based on whether the utterance of a predetermined word is detected. For example, when a predetermined word such as "yes," "no," "OK," "cancel," "end," "start," or "begin" is received, the speech end determination unit 608 determines that the speech has ended without waiting for the predetermined time. The end of speech may also be determined by the cloud server 103. In this case, the end of the operation may be determined based on the meaning and context of the user's speech content.
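
The two end-of-speech rules described above (a silence timeout and a set of predetermined words) can be sketched as follows. This is only an illustrative sketch: the function name `speech_ended`, the constant names, and the English word list are assumptions, and the 3-second value is the example figure given in the text rather than a required setting.

```python
from typing import Optional

# Assumed values: the text gives 3 seconds and a sample word list only as examples.
SILENCE_TIMEOUT_SEC = 3.0
END_WORDS = {"yes", "no", "ok", "cancel", "end", "start"}


def speech_ended(silence_sec: float, last_word: Optional[str] = None) -> bool:
    """Return True when the utterance should be treated as finished."""
    # A predetermined word ends the utterance without waiting for the timeout.
    if last_word is not None and last_word.strip().lower() in END_WORDS:
        return True
    # Otherwise fall back to the silence-duration rule.
    return silence_sec >= SILENCE_TIMEOUT_SEC
```

In this sketch the word check runs first, so an utterance such as "OK" ends immediately even if almost no silence has elapsed.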

音声制御部６０３は、音声取得部６０４、音声再生部６０５、ＬＥＤ表示部６０６の統括的な制御を行う。
まず音声制御部６０３は、操作開始検知部６０７から操作開始通知を受信すると、クラウドサーバ１０３との対話セッションの処理を開始する。まず音声制御部６０３は、ストレージ３０５から自装置の情報を読み出す。自装置の情報は、例えばスマートスピーカ１１０に個別に設定されているリアルナンバー等のＩＤ番号やデバイス名称である。以下、自装置の情報を機器情報という。そしてデータ送受信部６０１が、起動情報とともに機器情報をクラウドサーバ１０３に対して送信する。
続いて音声制御部６０３は、音声取得部６０４に音声処理を開始するよう制御するとともに、ＬＥＤ表示部６０６にＬＥＤ３１２を点灯させるよう制御する。また、音声制御部６０３は、発話終了判定部６０８から発話終了通知を受信すると、音声取得部６０４に音声処理を終了するよう制御するとともに、ＬＥＤ表示部６０６にＬＥＤ３１２を消灯させるよう制御する。 The voice control unit 603 performs overall control of the voice acquisition unit 604, the voice playback unit 605, and the LED display unit 606.
When the voice control unit 603 receives an operation start notification from the operation start detection unit 607, it starts processing a dialogue session with the cloud server 103. First, the voice control unit 603 reads information about its own device from the storage 305. This information is, for example, an ID number, such as a serial number individually assigned to the smart speaker 110, or a device name. Hereinafter, the information about its own device is referred to as device information. Then, the data transmission/reception unit 601 transmits the device information together with the startup information to the cloud server 103.
Next, the voice control unit 603 controls the voice acquisition unit 604 to start voice processing, and controls the LED display unit 606 to turn on the LED 312. Furthermore, upon receiving a speech end notification from the speech end determination unit 608, the voice control unit 603 controls the voice acquisition unit 604 to end voice processing, and controls the LED display unit 606 to turn off the LED 312.

ユーザの発話が終了すると、音声制御部６０３は、音声取得部６０４で一時保存された音声データを発話情報として取得し、取得した発話情報を機器情報とともにデータ送受信部６０１でクラウドサーバ１０３に対して送信するよう制御する。その後データ送受信部６０１は、クラウドサーバ１０３からの応答を待ち受ける。クラウドサーバ１０３からの応答は、例えば、応答であることを示すヘッダ部と、音声合成データとから成る応答メッセージである。データ送受信部６０１で応答を受信すると、音声制御部６０３は、音声再生部６０５に再生処理を開始するよう制御する。 When the user finishes speaking, the voice control unit 603 acquires the voice data temporarily stored by the voice acquisition unit 604 as speech information, and controls the data transmission/reception unit 601 to transmit the acquired speech information together with the device information to the cloud server 103. The data transmission/reception unit 601 then waits for a response from the cloud server 103. The response from the cloud server 103 is, for example, a response message consisting of a header section indicating that it is a response, and voice synthesis data. When the data transmission/reception unit 601 receives the response, the voice control unit 603 controls the voice playback unit 605 to start playback processing.

再生処理の終了後、音声制御部６０３は、音声取得部６０４に音声処理を開始するよう制御する。これにより、クラウドサーバ１０３との対話セッションが継続している間、ユーザはウェイクワードを発することなく、続けて自身の行いたいことを話すことができる。対話セッションの終了判定は、クラウドサーバ１０３が行う。クラウドサーバ１０３が、スマートスピーカ１１０に対して対話セッション終了通知を送信することで、対話セッションが終了する。なお、対話セッション終了から、次の対話セッションが開始されるまでの状態を待機状態と呼ぶこととする。音声制御部６０３が操作開始検知部６０７からの操作開始通知を受信するまでは、常時待機状態である。 After the playback process is completed, the voice control unit 603 controls the voice acquisition unit 604 to start voice processing. As a result, while the dialogue session with the cloud server 103 continues, the user can continue to talk about what he or she wants to do without uttering the wake word. The cloud server 103 determines whether the dialogue session has ended. The dialogue session ends when the cloud server 103 sends a dialogue session end notification to the smart speaker 110. The state from the end of a dialogue session to the start of the next dialogue session is referred to as a standby state. The voice control unit 603 is always in a standby state until it receives an operation start notification from the operation start detection unit 607.
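
The control flow of the voice control unit 603 described above (standby, listening, waiting for the response, playback, and back to listening until the session ends) can be pictured as a small state machine. The following is a minimal sketch under stated assumptions: the state names, the event labels, and the functions `next_state` and `led_is_on` are hypothetical and do not appear in the source.

```python
from enum import Enum, auto


class State(Enum):
    STANDBY = auto()            # no dialogue session is open
    LISTENING = auto()          # voice processing is running, LED 312 is on
    AWAITING_RESPONSE = auto()  # speech information was sent to the cloud server
    PLAYING = auto()            # playback processing is running


# Event names are hypothetical labels for the notifications in the text.
TRANSITIONS = {
    (State.STANDBY, "operation_start"): State.LISTENING,
    (State.LISTENING, "speech_end"): State.AWAITING_RESPONSE,
    (State.AWAITING_RESPONSE, "response_received"): State.PLAYING,
    (State.PLAYING, "playback_done"): State.LISTENING,
}


def next_state(state: State, event: str) -> State:
    # The cloud server decides when the session ends, whatever the current state.
    if event == "session_end":
        return State.STANDBY
    # Unknown (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)


def led_is_on(state: State) -> bool:
    # The LED is lit only while voice processing is running.
    return state is State.LISTENING
```

The transition from `PLAYING` back to `LISTENING` is what lets the user keep speaking without repeating the wake word while the session is open.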

＜クラウドサーバの機能構成＞
図７は、本実施形態に係るクラウドサーバ１０３の機能構成を示すブロック図である。クラウドサーバ１０３は、ＣＰＵ４０２がＲＯＭ４０４等に記憶されたプログラムを実行することにより図７に示す各機能部として動作する。
データ送受信部７０１は、ネットワークＩ／Ｆ４０６を介して、スマートスピーカ１１０，１１１、及びＭＦＰ１０１との間でデータの送受信を行う。以下、クラウドサーバ１０３が、スマートスピーカ１１０，１１１のうちの、スマートスピーカ１１０から起動情報を受信したものとして説明する。データ送受信部７０１は、スマートスピーカ１１０から起動情報を受信すると、ＭＦＰ１０１本体の設定情報の要求コマンドをＭＦＰ１０１に対して送信する。またデータ送受信部７０１は、スマートスピーカ１１０から発話情報を受信すると、発話情報を後述する音声認識部７０３に提供する。またデータ送受信部７０１は、後述する指示コマンド取得部７０４で取得したＭＦＰ１０１に対する指示コマンドをＭＦＰ１０１に対して送信する。 <Functional configuration of cloud server>
Fig. 7 is a block diagram showing the functional configuration of the cloud server 103 according to this embodiment. The cloud server 103 operates as each of the functional units shown in Fig. 7 by the CPU 402 executing a program stored in the ROM 404 or the like.
The data transmission/reception unit 701 transmits and receives data to and from the smart speakers 110 and 111 and the MFP 101 via the network I/F 406. In the following, it is assumed that, of the smart speakers 110 and 111, the cloud server 103 receives startup information from the smart speaker 110. When the data transmission/reception unit 701 receives the startup information from the smart speaker 110, it transmits a request command for setting information of the MFP 101 main body to the MFP 101. When the data transmission/reception unit 701 receives speech information from the smart speaker 110, it provides the speech information to a voice recognition unit 703 described later. The data transmission/reception unit 701 also transmits an instruction command for the MFP 101, acquired by an instruction command acquisition unit 704 described later, to the MFP 101.

またデータ送受信部７０１は、要求コマンドや指示コマンドに対する応答をＭＦＰ１０１から受信する。データ送受信部７０１は、要求コマンドの応答として受信したＭＦＰ１０１本体の設定情報をデータ管理部７０２に提供する。またデータ送受信部７０１は、指示コマンドに対する応答を後述する音声合成部７０５に提供する。データ送受信部７０１は、音声合成部７０５で音声合成データに変換された応答をスマートスピーカ１１０に対して送信する。 The data transmission/reception unit 701 also receives responses to request commands and instruction commands from the MFP 101. The data transmission/reception unit 701 provides the setting information of the MFP 101 main body received as a response to the request command to the data management unit 702. The data transmission/reception unit 701 also provides responses to instruction commands to the voice synthesis unit 705, which will be described later. The data transmission/reception unit 701 transmits the response converted into voice synthesis data by the voice synthesis unit 705 to the smart speaker 110.

データ管理部７０２は、音声認識部７０３で音声認識処理をするために必要なパラメータや、音声合成部７０５で音声合成処理をするために必要なパラメータなどをストレージ４０５に保存し、管理する。またデータ管理部７０２は、データ送受信部７０１から提供されたＭＦＰ１０１本体の設定情報と、スマートスピーカ１１０から受信した機器情報とが紐づけて保持されるテーブルをＲＡＭ４０３に保存し、管理する。またデータ管理部７０２は、対話セッションの継続中に、データ送受信部７０１からＭＦＰ１０１本体の設定情報を受け取ると、テーブルに保持される設定情報を提供された設定情報で更新する。 The data management unit 702 stores and manages in the storage 405 parameters necessary for the voice recognition processing by the voice recognition unit 703 and parameters necessary for the voice synthesis processing by the voice synthesis unit 705. The data management unit 702 also stores in the RAM 403 and manages a table in which the setting information of the MFP 101 main body provided from the data transmission/reception unit 701 and the device information received from the smart speaker 110 are linked and stored. When the data management unit 702 receives the setting information of the MFP 101 main body from the data transmission/reception unit 701 during an ongoing dialogue session, it updates the setting information stored in the table with the provided setting information.
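
The table managed by the data management unit 702 can be pictured as an in-memory mapping from device information to a snapshot of the MFP settings. The sketch below is an assumption-laden illustration: the class name `SettingsTable`, its method names, and the dictionary-based settings format are hypothetical, since the source does not specify the table layout.

```python
import copy
from typing import Any, Dict, Optional

Settings = Dict[str, Any]


class SettingsTable:
    """In-memory table keyed by smart speaker device information (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[str, Settings] = {}

    def store(self, device_id: str, settings: Settings) -> None:
        # Deep-copy so later changes on the MFP side cannot leak into the record.
        # Calling store() again for the same device overwrites (updates) the record.
        self._records[device_id] = copy.deepcopy(settings)

    def lookup(self, device_id: str) -> Optional[Settings]:
        record = self._records.get(device_id)
        return copy.deepcopy(record) if record is not None else None

    def discard(self, device_id: str) -> None:
        # Removing a missing record is not treated as an error.
        self._records.pop(device_id, None)
```

Storing a copy rather than a reference is the design point here: each record keeps the settings as they were when it was written, independent of later changes to the MFP.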

音声認識部７０３は、音声認識処理を行う。まず音声認識部７０３は、データ送受信部７０１で受信したユーザの発話情報を、テキストデータに変換する。例えば音声認識部７０３は、音響モデルを用いてユーザの音声データを音素に変換し、更に言語モデルを用いて音素を実際のテキストデータに変換する。更に音声認識部７０３は、変換されたテキストデータを形態素解析する。例えば音声認識部７０３は、その言語の文法や、品詞などの情報をもつ辞書から形態素列を導出し、さらに各形態素の品詞などを判別する。音声認識部７０３は、以上のような音声認識処理を行った結果を指示コマンド取得部７０４に提供する。 The voice recognition unit 703 performs voice recognition processing. First, the voice recognition unit 703 converts the user's speech information received by the data transmission/reception unit 701 into text data. For example, the voice recognition unit 703 converts the user's voice data into phonemes using an acoustic model, and then converts the phonemes into actual text data using a language model. The voice recognition unit 703 then performs morpheme analysis on the converted text data. For example, the voice recognition unit 703 derives a morpheme string from a dictionary containing information such as the grammar and parts of speech of the language, and then determines the part of speech of each morpheme. The voice recognition unit 703 provides the results of the above voice recognition processing to the instruction command acquisition unit 704.

指示コマンド取得部７０４は、音声認識部７０３による結果からＭＦＰ１０１に対する指示情報を取得する。具体的には、指示コマンド取得部７０４は、ユーザからの指示がジョブの実行指示であると判定した場合には、発話情報の送信元のスマートスピーカ１１０の機器情報に紐づけられるＭＦＰ１０１本体の設定情報を読み出す。そして指示コマンド取得部７０４は、読み出した設定情報を用いてジョブ実行コマンドを生成する。例えば、画像データをＭＦＰ１０１でプリントするためのコマンドを生成する。なお印刷対象の画像データは、インターネット上の外部サーバ（不図示）、クラウドサーバ１０３のストレージ４０５、又はＭＦＰ１０１のストレージ２０５等に格納されている。 The instruction command acquisition unit 704 acquires instruction information for the MFP 101 from the result of the voice recognition unit 703. Specifically, when the instruction command acquisition unit 704 determines that the instruction from the user is an instruction to execute a job, it reads out setting information of the MFP 101 main body that is linked to the device information of the smart speaker 110 that sent the spoken information. The instruction command acquisition unit 704 then generates a job execution command using the read setting information. For example, it generates a command to print image data with the MFP 101. Note that the image data to be printed is stored in an external server (not shown) on the Internet, the storage 405 of the cloud server 103, the storage 205 of the MFP 101, or the like.

また指示コマンド取得部７０４は、ユーザからの指示が設定変更指示であると判定した場合には、ユーザからの指示内容を用いて設定変更コマンドを生成する。例えば、ＭＦＰ１０１の印刷設定を片面印刷から両面印刷に変更するためのコマンドを生成する。 When the instruction command acquisition unit 704 determines that the instruction from the user is a setting change instruction, it generates a setting change command using the instruction from the user. For example, it generates a command to change the print settings of the MFP 101 from single-sided printing to double-sided printing.

音声合成部７０５は、音声合成データを生成する。例えば音声合成部７０５は、データ送受信部７０１で受信したＭＦＰ１０１からの応答をテキストデータに変換し、音声データベースを用いて音声合成データに変換する。また音声合成部７０５は、設定変更コマンドに対する応答であった場合、音声合成データを生成するとともに、データ送受信部７０１でＭＦＰ１０１本体の設定情報の要求コマンドをＭＦＰ１０１に対して送信するよう制御する。 The voice synthesis unit 705 generates voice synthesis data. For example, the voice synthesis unit 705 converts a response from the MFP 101 received by the data transmission/reception unit 701 into text data, and converts this into voice synthesis data using a voice database. Furthermore, if the response is to a setting change command, the voice synthesis unit 705 generates voice synthesis data and controls the data transmission/reception unit 701 to send a request command for setting information of the MFP 101 main body to the MFP 101.

＜画像処理システムの実行する全体処理その１＞
図８は、本実施形態における画像形成システム１００の実行する処理を示すシーケンス図である。以下の説明では、各工程（ステップ）について先頭にＳを付けて表記することで、工程（ステップ）の表記を省略する。図８では、１台のスマートスピーカ１１０を用いてＭＦＰ１０１に対して指示を行うケースについて説明する。 <Overall processing performed by the image processing system, part 1>
FIG. 8 is a sequence diagram showing the processing executed by the image forming system 100 in this embodiment. In the following description, each step is denoted by a number prefixed with S, and the word "step" is omitted. FIG. 8 illustrates a case where an instruction is given to the MFP 101 using one smart speaker 110.

まずスマートスピーカ１１０が、ユーザからウェイクワードを受け付けると（Ｓ８０１）、機器情報（自装置の情報）及び起動情報をクラウドサーバ１０３に対して送信する（Ｓ８０２，Ｓ８０３）。以上により対話セッションが開始する。
クラウドサーバ１０３は、スマートスピーカ１１０から起動情報を受信すると、ＭＦＰ１０１本体の設定情報の要求コマンドをＭＦＰ１０１に対して送信する（Ｓ８０４）。ＭＦＰ１０１は、要求コマンドに応じて設定情報を取得して（Ｓ８０５）、クラウドサーバ１０３に対して設定情報を送信する（Ｓ８０６）。クラウドサーバ１０３は、ＭＦＰ１０１本体の設定情報を受信すると、受信した設定情報とＳ８０２でスマートスピーカ１１０から受信した機器情報を紐づけてＲＡＭ４０３に保存する（Ｓ８０７）。 First, when the smart speaker 110 receives a wake word from a user (S801), the smart speaker 110 transmits device information (information about its own device) and startup information to the cloud server 103 (S802, S803). This starts an interactive session.
When the cloud server 103 receives the startup information from the smart speaker 110, it transmits a request command for setting information of the MFP 101 main body to the MFP 101 (S804). The MFP 101 acquires the setting information in response to the request command (S805) and transmits the setting information to the cloud server 103 (S806). When the cloud server 103 receives the setting information of the MFP 101 main body, it associates the received setting information with the device information received from the smart speaker 110 in S802 and stores them in the RAM 403 (S807).

続いてスマートスピーカ１１０は、ユーザから発話を受け付けると（Ｓ８０８）、機器情報及び発話情報をクラウドサーバ１０３に対して送信する（Ｓ８０９、Ｓ８１０）。例えばスマートスピーカ１１０は、Ｓ８０１で受け付けたウェイクワードに続いて発せられた「Ａというファイルをプリントして」のようなユーザからの発話に基づき発話情報を取得する。なおＳ８０９で送信される機器情報は、Ｓ８０２で送信される機器情報と同じ値である。 Next, when the smart speaker 110 receives speech from the user (S808), it transmits the device information and the speech information to the cloud server 103 (S809, S810). For example, the smart speaker 110 acquires the speech information based on user speech such as "Print file A" uttered following the wake word received in S801. Note that the device information transmitted in S809 has the same value as the device information transmitted in S802.

次にクラウドサーバ１０３は、スマートスピーカ１１０から受信した発話情報に対して音声認識処理を行い（Ｓ８１１）、音声認識処理の結果から指示内容を判定する（Ｓ８１２）。ここでは、指示内容がジョブの実行指示であるため、クラウドサーバ１０３は、発話情報とともに受信した機器情報に紐づいてＲＡＭ４０３に保存される設定情報を用いて、ジョブ情報を生成する（Ｓ８１３）。次にクラウドサーバ１０３は、生成したジョブ情報をＭＦＰ１０１に対して送信すると（Ｓ８１４）、ＭＦＰ１０１は、操作パネル２０９の表示をジョブ実行画面に遷移させ（Ｓ８１５）、ジョブ実行開始通知をクラウドサーバ１０３に対して送信する（Ｓ８１６）。クラウドサーバ１０３は、受信したジョブ実行開始通知を音声合成データに変換し（Ｓ８１７）、当該音声合成データをスマートスピーカ１１０に対して送信する（Ｓ８１８）。続いてスマートスピーカ１１０は、クラウドサーバ１０３から受信した音声合成データを再生する（Ｓ８１９）。 Next, the cloud server 103 performs voice recognition processing on the speech information received from the smart speaker 110 (S811) and determines the instruction content from the result of the voice recognition processing (S812). Here, since the instruction content is an instruction to execute a job, the cloud server 103 generates job information using the setting information stored in the RAM 403 in association with the device information received together with the speech information (S813). Next, when the cloud server 103 transmits the generated job information to the MFP 101 (S814), the MFP 101 transitions the display of the operation panel 209 to a job execution screen (S815) and transmits a job execution start notification to the cloud server 103 (S816). The cloud server 103 converts the received job execution start notification into voice synthesis data (S817) and transmits the voice synthesis data to the smart speaker 110 (S818). Next, the smart speaker 110 plays the voice synthesis data received from the cloud server 103 (S819).

次にＭＦＰ１０１は、Ｓ８１４で受信したジョブ情報に応じてジョブの実行処理を行う（Ｓ８２０）。実行処理が終了すると、ＭＦＰ１０１は、操作パネル２０９の表示をジョブ実行終了画面に遷移させ（Ｓ８２１）、ジョブ実行終了通知をクラウドサーバ１０３に対して送信する（Ｓ８２２）。クラウドサーバ１０３は、受信したジョブ実行終了通知を音声合成データに変換し（Ｓ８２３）、当該音声合成データをスマートスピーカ１１０に対して送信する（Ｓ８２４）。続いてスマートスピーカ１１０はクラウドサーバ１０３から受信した音声合成データを再生する（Ｓ８２５）。例えばスマートスピーカ１１０は、「プリントしました」という音声合成データを、スピーカ３１０を通じて再生する。 The MFP 101 then performs job execution processing in accordance with the job information received in S814 (S820). When execution processing is completed, the MFP 101 transitions the display on the operation panel 209 to a job execution completion screen (S821) and transmits a job execution completion notification to the cloud server 103 (S822). The cloud server 103 converts the received job execution completion notification into voice synthesis data (S823) and transmits the voice synthesis data to the smart speaker 110 (S824). The smart speaker 110 then plays back the voice synthesis data received from the cloud server 103 (S825). For example, the smart speaker 110 plays back the voice synthesis data saying "Printed" through the speaker 310.

次にクラウドサーバ１０３は、対話セッションを終了するかの判定を行い（Ｓ８２６）、終了する場合には、スマートスピーカ１１０に対して対話セッション終了通知を送信する（Ｓ８２７）。スマートスピーカ１１０は、クラウドサーバ１０３から対話セッション終了通知を受信すると、対話セッションを終了する（Ｓ８２８）。その後スマートスピーカ１１０は待機状態へ移行する。以上により図８のシーケンス図に示す一連の処理が終了する。 Next, the cloud server 103 determines whether to end the interaction session (S826), and if so, sends an interaction session end notification to the smart speaker 110 (S827). When the smart speaker 110 receives the interaction session end notification from the cloud server 103, it ends the interaction session (S828). The smart speaker 110 then transitions to a standby state. This completes the series of processes shown in the sequence diagram of FIG. 8.

＜クラウドサーバの実行する処理＞
図９は、本実施形態に係るクラウドサーバ１０３が実行する処理の詳細を示すフローチャートである。図９のフローチャートに示す各処理を実行する主体は、クラウドサーバ１０３のＣＰＵ４０２である場合について説明するが、各処理を実行する主体はこれに限定されず、他の装置のＣＰＵが実行しても構わない。
まず、クラウドサーバ１０３がスマートスピーカ１１０からデータを受信すると、Ｓ９０１では、ＣＰＵ４０２が、受信したデータが起動情報であるか発話情報であるか判定する。ＣＰＵ４０２が起動情報であると判定した場合、Ｓ９０２へ進み、発話情報であると判定した場合、Ｓ９０５へ進む。起動情報は、例えば音声認識機能の起動確認コマンドである。
Ｓ９０２では、ＣＰＵ４０２が、ＭＦＰ１０１に対して設定情報の要求コマンドを送信する。
次にＳ９０３では、ＣＰＵ４０２が、要求コマンドに対する応答としてＭＦＰ１０１本体の設定情報を取得する。
次にＳ９０４では、ＣＰＵ４０２が、取得したＭＦＰ１０１本体の設定情報と、起動情報の送信元のスマートスピーカ１１０の機器情報を紐づけてＲＡＭ４０３上のテーブルに保持する。その後処理はＳ９２１へ進む。 <Processing performed by the cloud server>
Fig. 9 is a flowchart showing details of the processing executed by the cloud server 103 according to this embodiment. The following describes the case where the CPU 402 of the cloud server 103 executes each process shown in the flowchart of Fig. 9; however, the executing entity is not limited to this, and a CPU of another device may execute the processes.
First, when the cloud server 103 receives data from the smart speaker 110, in S901, the CPU 402 determines whether the received data is startup information or speech information. If the CPU 402 determines that the data is startup information, the process proceeds to S902, and if the CPU 402 determines that the data is speech information, the process proceeds to S905. The startup information is, for example, a command to confirm the startup of a voice recognition function.
In step S902, the CPU 402 transmits a request command for setting information to the MFP 101.
Next, in step S903, the CPU 402 acquires setting information of the MFP 101 main body as a response to the request command.
Next, in S904, the CPU 402 associates the acquired setting information of the main body of the MFP 101 with the device information of the smart speaker 110 that transmitted the startup information, and stores the association in a table on the RAM 403. After that, the process proceeds to S921.

Ｓ９０５で、ＣＰＵ４０２は、スマートスピーカ１１０から受信した発話情報に対して音声認識処理を行う。
次にＳ９０６で、ＣＰＵ４０２が、Ｓ９０５の音声認識処理の結果に基づき、指示内容がジョブの実行指示であるか否かを判定する。ＣＰＵ４０２がジョブの実行指示であると判定した場合、Ｓ９０７へ進み、ジョブの実行指示ではないと判定した場合、Ｓ９１２へ進む。
Ｓ９０７では、ＣＰＵ４０２が、ＲＡＭ４０３上のテーブルに保持されるレコードから、発話情報の送信元のスマートスピーカ１１０の機器情報と一致するレコードの設定情報を読み出す。
次にＳ９０８では、ＣＰＵ４０２が、Ｓ９０７で読み出した設定情報を用いてジョブ実行コマンドを生成する。なおＳ９０７で、発話情報の送信元のスマートスピーカ１１０の機器情報と一致するレコードが存在しなかった場合や、存在してもジョブの実行指示で用いる設定項目が含まれていなかった場合には、デフォルトの設定情報を用いてジョブ実行コマンドを生成する。
次にＳ９０９では、ＣＰＵ４０２が、生成したジョブ実行コマンドをＭＦＰ１０１に対して送信する。
次にＳ９１０では、ＣＰＵ４０２が、Ｓ９０９で送信したコマンドに対する応答通知を受信する。
次にＳ９１１では、ＣＰＵ４０２が、Ｓ９１０で受信した応答通知に対して音声合成処理を行って音声合成データを生成し、生成した音声合成データを発話情報の送信元のスマートスピーカ１１０に対して送信する。ここでは例えば、「ファイルを印刷しました」といった音声が合成される。その後処理はＳ９２１へ進む。 In S905, the CPU 402 performs voice recognition processing on the speech information received from the smart speaker 110.
Next, in step S906, the CPU 402 determines whether the instruction content is an instruction to execute a job based on the result of the voice recognition process in step S905. If the CPU 402 determines that the instruction content is an instruction to execute a job, the process proceeds to step S907. If the CPU 402 determines that the instruction content is not an instruction to execute a job, the process proceeds to step S912.
In S907, the CPU 402 reads out the setting information of a record that matches the device information of the smart speaker 110 that is the sender of the speech information, from the records stored in the table on the RAM 403.
Next, in S908, the CPU 402 generates a job execution command using the setting information read in S907. Note that in S907, if there is no record that matches the device information of the smart speaker 110 that is the sender of the speech information, or if there is a record but it does not include the setting items used in the job execution instruction, the job execution command is generated using default setting information.
Next, in step S909, the CPU 402 transmits the generated job execution command to the MFP 101.
Next, in step S910, the CPU 402 receives a response notification to the command sent in step S909.
Next, in S911, the CPU 402 performs voice synthesis processing on the response notification received in S910 to generate voice synthesis data, and transmits the generated voice synthesis data to the smart speaker 110 that is the sender of the speech information. Here, for example, a voice such as "The file has been printed" is synthesized. After that, the process proceeds to S921.

Ｓ９１２では、ＣＰＵ４０２が、Ｓ９０５の音声認識処理の結果に基づき、指示内容がＭＦＰ１０１本体の設定変更の指示であるか否かを判定する。ＣＰＵ４０２がＭＦＰ１０１本体の設定変更の指示であると判定した場合、Ｓ９１３へ進み、設定変更の指示ではないと判定した場合、Ｓ９２０へ進む。
Ｓ９１３では、ＣＰＵ４０２が、ユーザからの指示内容を用いて設定変更コマンドを生成する。
次にＳ９１４では、ＣＰＵ４０２が、生成した設定変更コマンドをＭＦＰ１０１に対して送信する。
次にＳ９１５では、ＣＰＵ４０２が、Ｓ９１４で送信したコマンドに対する応答通知を受信する。
次にＳ９１６では、ＣＰＵ４０２が、Ｓ９１５で受信した応答通知に対して音声合成処理を行って音声合成データを生成し、生成した音声合成データを発話情報の送信元のスマートスピーカ１１０に対して送信する。ここでは例えば、「設定を変更しました」といった音声が合成される。
次にＳ９１７では、ＣＰＵ４０２が、ＭＦＰ１０１に対して設定情報の要求コマンドを送信する。
次にＳ９１８では、ＣＰＵ４０２が、要求コマンドに対する応答として変更後の設定情報を取得する。
次にＳ９１９では、ＣＰＵ４０２が、ＲＡＭ４０３上のテーブルに保持されるレコードから、発話情報の送信元のスマートスピーカ１１０の機器情報と一致するレコードの設定情報を、Ｓ９１８で取得した変更後の設定情報で更新する。これにより、ユーザが設定変更後にジョブの実行を続けて指示する場合には、変更後の設定を用いてジョブを実行することが可能になる。その後処理はＳ９２１へ進む。 In S912, the CPU 402 determines, based on the result of the voice recognition process in S905, whether or not the instruction content is an instruction to change the settings of the main body of the MFP 101. If the CPU 402 determines that the instruction content is an instruction to change the settings of the main body of the MFP 101, the process proceeds to S913, and if the CPU 402 determines that the instruction content is not an instruction to change the settings, the process proceeds to S920.
In S913, the CPU 402 generates a setting change command using the contents of the instruction from the user.
Next, in step S914, the CPU 402 transmits the generated setting change command to the MFP 101.
Next, in S915, the CPU 402 receives a response notification to the command sent in S914.
Next, in S916, the CPU 402 performs voice synthesis processing on the response notification received in S915 to generate voice synthesis data, and transmits the generated voice synthesis data to the smart speaker 110 that is the transmission source of the speech information. Here, for example, a voice such as "The settings have been changed" is synthesized.
Next, in step S917, the CPU 402 transmits a request command for setting information to the MFP 101.
Next, in S918, the CPU 402 acquires the changed setting information as a response to the request command.
Next, in S919, the CPU 402 updates the setting information of the record stored in the table on the RAM 403 that matches the device information of the smart speaker 110 that transmitted the speech information with the changed setting information acquired in S918. This makes it possible to execute the job using the changed settings if the user continues to instruct execution of the job after changing the settings. Then, the process proceeds to S921.

Ｓ９２０では、ＣＰＵ４０２が、スマートスピーカ１１０の状態等に応じて必要な音声合成処理を行って音声合成データを生成し、生成した音声合成データを発話情報の送信元のスマートスピーカ１１０に対して送信する。ここでは例えば、「もう一度話してください」といった音声が合成される。
次にＳ９２１では、ＣＰＵ４０２が、対話セッションの終了判定を行う。ＣＰＵ４０２が対話セッションを終了すると判定した場合、Ｓ９２２へ進み、対話セッションを継続すると判定した場合、処理はＳ９０１へ戻り、スマートスピーカ１１０から発話情報が送信されるのを待ち受ける。
Ｓ９２２では、ＣＰＵ４０２が、ＲＡＭ４０３上のテーブルに保持されるレコードのうち、対話セッションの切断対象のスマートスピーカ１１０の機器情報と一致するレコードを、ＲＡＭ４０３から破棄する。その後図９のフローチャートに示す一連の処理が終了する。 In S920, the CPU 402 performs necessary voice synthesis processing according to the state of the smart speaker 110, etc., to generate voice synthesis data, and transmits the generated voice synthesis data to the smart speaker 110 that is the sender of the speech information. Here, for example, a voice such as "Please speak again" is synthesized.
Next, in S921, the CPU 402 determines whether to end the dialogue session. If the CPU 402 determines that the dialogue session is to be ended, the process proceeds to S922. If it determines that the dialogue session is to be continued, the process returns to S901 and waits for speech information to be transmitted from the smart speaker 110.
In S922, the CPU 402 discards from the RAM 403 those records in the table that match the device information of the smart speaker 110 whose dialogue session is being disconnected. Then, the series of processes shown in the flowchart in FIG. 9 ends.
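
The branching of the flowchart in FIG. 9 can be condensed into a single message handler. The sketch below is illustrative only and rests on several assumptions: `FakeMfp`, `handle`, `end_session`, the message dictionary layout, and `DEFAULT_SETTINGS` are hypothetical names, voice recognition (S905) is reduced to a pre-computed `intent` field, and voice synthesis is reduced to returning the text to be spoken.

```python
DEFAULT_SETTINGS = {"duplex": False, "copies": 1}  # assumed default values


class FakeMfp:
    """Stand-in for the MFP 101; real command formats are not given in the text."""

    def __init__(self):
        self.settings = dict(DEFAULT_SETTINGS)
        self.jobs = []

    def get_settings(self):
        return dict(self.settings)

    def change_settings(self, changes):
        self.settings.update(changes)
        return "The settings have been changed"

    def run_job(self, settings):
        self.jobs.append(dict(settings))
        return "The file has been printed"


def handle(table, mfp, device_id, message):
    """Process one message from a smart speaker; return the text to synthesize."""
    if message["kind"] == "startup":           # S901 -> S902 to S904
        table[device_id] = mfp.get_settings()  # snapshot tied to this speaker
        return None
    intent = message.get("intent")             # stands in for the result of S905
    if intent == "run_job":                    # S906 -> S907 to S911
        # Missing record or missing items fall back to defaults (S908).
        settings = {**DEFAULT_SETTINGS, **table.get(device_id, {})}
        return mfp.run_job(settings)
    if intent == "change_settings":            # S912 -> S913 to S919
        reply = mfp.change_settings(message["changes"])
        table[device_id] = mfp.get_settings()  # refresh only this speaker's record
        return reply
    return "Please speak again"                # S920


def end_session(table, device_id):             # S922
    table.pop(device_id, None)
```

A caller would invoke `handle` once per received message and call `end_session` when the S921 decision is to end the session; how that decision is made is left open here, as it is in the text.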

＜画像処理システムの実行する全体処理その２＞
図１０Ａ及び図１０Ｂは、本実施形態における画像形成システム１００の実行する処理を示すシーケンス図である。図１０Ａ及び図１０Ｂでは、ユーザＡがジョブを実行しようと、スマートスピーカ１１０との対話セッションを開始し、ほぼ同時期にユーザＢがスマートスピーカ１１１との対話セッションを開始したケースについて説明する。 <Overall processing performed by the image processing system, part 2>
FIGS. 10A and 10B are sequence diagrams showing the processing executed by the image forming system 100 in this embodiment. They illustrate a case in which user A starts a dialogue session with the smart speaker 110 to execute a job, and user B starts a dialogue session with the smart speaker 111 at approximately the same time.

Ｓ１００１～Ｓ１００７において、スマートスピーカ１１０を操作するユーザＡからの音声操作開始指示により、図８のＳ８０１～Ｓ８０７と同様の処理が実行される。これにより、クラウドサーバ１０３にはスマートスピーカ１１０が対話セッションを開始した時点でのＭＦＰ１０１の設定情報が、スマートスピーカ１１０の情報に紐づけて保持される。
次にＳ１００８～Ｓ１０１４において、スマートスピーカ１１１を操作するユーザＢからの音声操作開始指示により、図８のＳ８０１～Ｓ８０７と同様の処理が実行される。これにより、クラウドサーバ１０３にはスマートスピーカ１１１が対話セッションを開始した時点でのＭＦＰ１０１の設定情報が、スマートスピーカ１１１の情報に紐づけて保持される。 In S1001 to S1007, the same processing as S801 to S807 in FIG. 8 is executed in response to a voice operation start instruction from user A, who operates the smart speaker 110. As a result, the setting information of the MFP 101 at the time when the smart speaker 110 started the dialogue session is stored in the cloud server 103 in association with the information of the smart speaker 110.
Next, in S1008 to S1014, the same processing as S801 to S807 in FIG. 8 is executed in response to a voice operation start instruction from user B, who operates the smart speaker 111. As a result, the setting information of the MFP 101 at the time when the smart speaker 111 started the dialogue session is stored in the cloud server 103 in association with the information of the smart speaker 111.

続いてスマートスピーカ１１１がユーザＢから発話を受け付けると（Ｓ１０１５）、機器情報及び発話情報をクラウドサーバ１０３に対して送信する（Ｓ１０１６、Ｓ１０１７）。次にクラウドサーバ１０３が受信した発話情報に対して音声認識処理を行い（Ｓ１０１８）、音声認識処理の結果から指示内容を判定する（Ｓ１０１９）。ここでは指示内容が本体設定変更指示であるため、クラウドサーバ１０３は、指示された内容の変更情報を生成する（Ｓ１０２０）。クラウドサーバ１０３が生成した変更情報をＭＦＰ１０１に対して送信すると（Ｓ１０２１）、ＭＦＰ１０１は、受信した変更情報を用いてストレージ２０５に保持される設定情報の変更を行う（Ｓ１０２２）。その後ＭＦＰ１０１は、操作パネル２０９の表示を変更通知画面に遷移させ（Ｓ１０２３）、設定変更通知をクラウドサーバ１０３に対して送信する（Ｓ１０２４）。クラウドサーバ１０３は、受信した設定変更通知を音声合成データに変換し（Ｓ１０２５）、当該音声合成データをスマートスピーカ１１１に対して送信する（Ｓ１０２６）。続いてスマートスピーカ１１１は、クラウドサーバ１０３から受信した音声合成データを再生する（Ｓ１０２７）。 Next, when the smart speaker 111 receives a speech from user B (S1015), it transmits device information and speech information to the cloud server 103 (S1016, S1017). Next, the cloud server 103 performs a voice recognition process on the received speech information (S1018) and determines the instruction content from the result of the voice recognition process (S1019). Here, since the instruction content is an instruction to change the main body settings, the cloud server 103 generates change information of the instructed content (S1020). When the cloud server 103 transmits the generated change information to the MFP 101 (S1021), the MFP 101 uses the received change information to change the setting information stored in the storage 205 (S1022). After that, the MFP 101 transitions the display of the operation panel 209 to a change notification screen (S1023) and transmits a setting change notification to the cloud server 103 (S1024). The cloud server 103 converts the received setting change notification into voice synthesis data (S1025) and transmits the voice synthesis data to the smart speaker 111 (S1026). The smart speaker 111 then plays back the voice synthesis data received from the cloud server 103 (S1027).

Next, the cloud server 103 determines whether to end the dialogue session with the smart speaker 111 (S1028) and, if so, transmits a dialogue session end notification to the smart speaker 111 (S1029). On receiving the notification, the smart speaker 111 ends the dialogue session (S1030) and then transitions to a standby state.

Next, in S1031 to S1051, the same processing as S808 to S828 in FIG. 8 is executed in response to a job execution instruction from user A, who operates the smart speaker 110. This completes the series of processes in the sequence diagrams shown in FIGS. 10A and 10B.

As these sequence diagrams show, the cloud server 103 acquires and holds the setting information of the MFP 101 main body at the time user A starts a dialogue session with the smart speaker 110. When user A later instructs execution of a job using the smart speaker 110, the cloud server 103 issues the job execution instruction using the held setting information. As a result, even if user B, who started a dialogue session at approximately the same time as user A, instructs a setting change before user A instructs execution of the job, user A can execute the job with the settings as they were before the change. In other words, printing is performed as user A expected.
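The per-speaker snapshot behavior summarized above can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the embodiment: the classes `Mfp` and `CloudServer`, their method names, the speaker identifiers, and the `color_mode` setting are hypothetical. Refreshing the snapshot of the speaker that requested a change follows the behavior recited in claim 2.

```python
import copy


class Mfp:
    """Hypothetical stand-in for the MFP 101."""

    def __init__(self, settings):
        self.settings = dict(settings)

    def get_settings(self):
        # Return a copy so later changes on the MFP do not alter a held snapshot.
        return copy.deepcopy(self.settings)

    def change_settings(self, change):
        self.settings.update(change)

    def run_job(self, job_settings):
        # A real device would print; here we just report the settings used.
        return dict(job_settings)


class CloudServer:
    """Hypothetical stand-in for the cloud server 103."""

    def __init__(self, mfp):
        self.mfp = mfp
        self.snapshots = {}  # speaker id -> settings captured for that speaker

    def start_session(self, speaker_id):
        # Acquire and hold the MFP settings when the dialogue session starts.
        self.snapshots[speaker_id] = self.mfp.get_settings()

    def change_setting(self, speaker_id, change):
        self.mfp.change_settings(change)
        # Only the requesting speaker's snapshot is updated (cf. claim 2).
        self.snapshots[speaker_id] = self.mfp.get_settings()

    def execute_job(self, speaker_id):
        # The job uses the snapshot associated with the requesting speaker.
        return self.mfp.run_job(self.snapshots[speaker_id])
```

With two sessions opened on `"speaker_110"` and `"speaker_111"`, a change issued through the second speaker alters the MFP and that speaker's snapshot, while a job issued through the first speaker still runs with the settings captured when its session began.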

According to the image forming system 100 of this embodiment described above, operations no longer conflict when the settings of an image forming device are changed using a voice control device in an environment where multiple voice control devices exist. Usability when operating an image forming device through a voice control device is therefore improved.

As a first modification of this embodiment, the MFP 101 may likewise read the setting information of the MFP 101 main body into the RAM 203 and hold it there at the time the CPU 202 of the MFP 101 detects the start of operation of the MFP 101 (for example, when user A logs in). In this case, when user A operates the operation panel 209 to instruct execution of a job, the MFP 101 executes the job using the setting information held in the RAM 203. As a result, even if user B instructs a setting change using the smart speaker 110 while user A is operating the operation panel 209, and the setting information of the MFP 101 main body is changed, user A can execute the job with the settings as they were before the change.
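The first modification can be sketched in the same spirit. Again this is only an assumption-laden illustration: the class `MfpWithPanelSnapshot`, its method names, and the `duplex` setting are hypothetical, with `storage` standing in for the persistent setting information and `ram_snapshot` for the copy held in the RAM 203.

```python
import copy


class MfpWithPanelSnapshot:
    """Hypothetical MFP that snapshots its settings when panel operation starts."""

    def __init__(self, settings):
        self.storage = dict(settings)  # persistent setting information
        self.ram_snapshot = None       # copy read out when operation starts

    def login(self):
        # Operation start detected (e.g. user login): copy settings into RAM.
        self.ram_snapshot = copy.deepcopy(self.storage)

    def remote_change(self, change):
        # A setting change arriving via a smart speaker updates only the storage.
        self.storage.update(change)

    def run_panel_job(self):
        # A job started from the operation panel uses the RAM snapshot.
        if self.ram_snapshot is None:
            raise RuntimeError("no panel session has been started")
        return dict(self.ram_snapshot)
```

After `login()`, a later `remote_change({"duplex": True})` modifies `storage` but leaves the result of `run_panel_job()` unchanged.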

The present invention has been described above together with embodiments, but those embodiments are merely examples of how the present invention may be implemented, and the technical scope of the present invention should not be interpreted restrictively on their basis. In other words, the present invention can be implemented in various forms without departing from its technical concept or main features.

Other Embodiments
The present invention can also be realized by processing in which a program implementing one or more functions of the above-described embodiments is supplied to a system or device via a network or a recording medium, and one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of the functions.

100: Image forming system, 101: MFP, 110, 111: Smart speaker, 103: Cloud server

Claims

An image forming system including a plurality of voice control devices that receive user utterances, a server that acquires instruction information based on utterance information acquired from the plurality of voice control devices, and an image forming apparatus that executes processing based on the instruction information acquired from the server, the image forming system comprising:
a transmission means, in one of the plurality of voice control devices, for receiving a voice operation start instruction from a user and transmitting start information to the server;
a request means, in the server, for receiving the start information and requesting setting information from the image forming apparatus;
a storage means, in the server, for storing the setting information of the image forming apparatus and information of the one voice control device in association with each other; and
a generation means, in the server, for generating the instruction information by using at least the stored setting information when instructing execution of a job based on the utterance information acquired from the one voice control device.

The image forming system according to claim 1, wherein, when a change to the setting information of the image forming apparatus is instructed based on the utterance information acquired from the one voice control device, the request means requests the changed setting information from the image forming apparatus, and the storage means updates the stored setting information to the changed setting information.

The image forming system according to claim 1 or 2, wherein the voice operation start instruction is given by the user uttering a wake word.

A control method for an image forming system including a plurality of voice control devices that receive user utterances, a server that acquires instruction information based on utterance information acquired from the plurality of voice control devices, and an image forming apparatus that executes processing based on the instruction information acquired from the server, the control method comprising:
a transmission step of receiving, in one of the plurality of voice control devices, a voice operation start instruction from a user and transmitting start information to the server;
a request step of receiving, in the server, the start information and requesting setting information from the image forming apparatus;
a storage step of storing, in the server, the setting information of the image forming apparatus and information of the one voice control device in association with each other; and
a generation step of generating, in the server, the instruction information by using at least the stored setting information when instructing execution of a job based on the utterance information acquired from the one voice control device.

A program for causing a computer to function as each of the means of the image forming system according to any one of claims 1 to 3.