JP7659540B2 - SYSTEM AND METHOD FOR MANAGING VOICE QUERIES USING PRONUNCIATION INFORMATION - Patent application

Info

Publication number: JP7659540B2
Application number: JP2022506260A
Authority: JP
Inventors: アンクールアヘル，; インドラニルクーマードス，; アーシシュゴヤル，; アマンプニヤニ，; カンダラレディ，; ミトゥンウメシュ，
Original assignee: アデイアガイズインコーポレイテッド
Priority date: 2019-07-31
Filing date: 2020-07-22
Publication date: 2025-04-09
Anticipated expiration: 2040-07-22
Also published as: JP2025102873A; JP2022542415A; EP4004913A1; CA3143967A1; WO2021021529A1

Description

The present disclosure relates to systems for managing voice queries and, more specifically, to systems for managing voice queries based on pronunciation information.

In a conversation system, when a user issues a voice query, the speech is converted to text using an automatic speech recognition (ASR) module. This text then forms the input to the conversation system, which determines a response to the text. For example, when a user says "show me Tom Cruise movies," the ASR module converts the user's speech to text and passes it to the conversation system. The conversation system acts only on the text it receives from the ASR module. In this process, the conversation system sometimes loses the pronunciation details of the words or sounds contained in the user's query. Those pronunciation details can provide information useful for search, especially when the same word has two or more pronunciations that correspond to different meanings.

This disclosure describes systems and methods that, when a user speaks a query word, perform a search based on multiple contextual inputs and predict the user's intended search query. The contextual inputs may include, for example, user search history, user likes and dislikes, general trends, pronunciation details of the query words, and any other suitable information. An application receives a voice query and generates a text query representing the voice query. To retrieve search results more accurately, the application uses pronunciation information, which may be included in the text query, in metadata associated with the text query, or in metadata of entities in a database. In some embodiments, the application generates metadata based on text-to-speech and speech-to-text conversions to improve the reachability of entities from search queries.
The present invention provides, for example, the following:
(Item 1)
A method for responding to a voice query, the method comprising:
receiving a voice query at an audio interface;
extracting, using control circuitry, one or more keywords from the voice query;
generating, using the control circuitry, a text query based on the one or more keywords;
identifying an entity, wherein identifying the entity is based on the text query and metadata for the entity, the metadata comprises one or more alternative text representations of the entity, and the one or more alternative text representations are based on a pronunciation of an identifier associated with the entity; and
retrieving a content item associated with the entity.
(Item 2)
The method of item 1, wherein the one or more alternative text representations comprise a phonemic representation of the entity.
(Item 3)
The method of any of items 1 and 2, wherein the one or more alternative text representations comprise an alternative spelling of the entity based on pronunciation.
(Item 4)
The method of any of items 1-3, wherein the one or more alternative text representations of the entity comprise a text string generated based on a previous speech-to-text conversion.
(Item 5)
The method of any of items 1-4, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation of the plurality of alternative text representations being generated by:
converting a first text representation into an audio file; and
converting the audio file into a second text representation,
wherein the second text representation is not identical to the first text representation.
(Item 6)
The method of any of items 1-5, wherein identifying the entity is further based on user profile information.
(Item 7)
The method of any of items 1-6, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 8)
The method of any of items 1-7, wherein identifying the entity comprises:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective one or more alternative text representations to the text query; and
selecting the entity by determining a maximum score.
(Item 9)
The method of any of items 1-8, further comprising generating a plurality of text queries, the plurality of text queries comprising the text query, wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module of the control circuitry.
(Item 10)
The method of item 9, further comprising:
identifying a respective entity based on each respective text query of the plurality of text queries;
determining a respective score for each respective entity based on a comparison of the respective text query to metadata associated with the respective entity; and
identifying the entity by selecting a maximum score of the respective scores.
(Item 11)
A system for responding to a voice query, the system comprising:
memory; and
means for implementing the steps of the method of any of items 1-10.
(Item 12)
A non-transitory computer-readable medium having instructions encoded thereon that, when executed by control circuitry, enable the control circuitry to perform the steps of the method of any of items 1-10.
(Item 13)
A system for responding to a voice query, the system comprising means for implementing the steps of the method of any of items 1-10.
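
The scoring described in items 8 through 10 can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the similarity measure (Python's `difflib.SequenceMatcher`), the `METADATA` table, and the function names are all assumptions introduced here, and the text queries are reduced to the entity mention for brevity.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed scoring function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_entity(text_query: str, alternatives: List[str]) -> float:
    """Score an entity as the best match between the query and any alternative text representation."""
    return max(similarity(text_query, alt) for alt in alternatives)


def identify_entity(text_queries: List[str], metadata: Dict[str, List[str]]) -> Tuple[str, float]:
    """Score every (text query, entity) pair and keep the entity with the maximum score."""
    best: Tuple[str, float] = ("", -1.0)
    for query in text_queries:
        for entity, alternatives in metadata.items():
            score = score_entity(query, alternatives)
            if score > best[1]:
                best = (entity, score)
    return best


# Hypothetical metadata: entity -> alternative text representations based on pronunciation.
METADATA = {
    "Louis Freeh": ["louis freeh", "loo-ee free"],
    "Lewis Black": ["lewis black", "louis black", "loo-his black"],
}

# Two text queries, as if produced by two different speech-to-text settings.
entity, score = identify_entity(["louis black", "lewis block"], METADATA)
print(entity, score)
```

Because "louis black" is stored as an alternative representation of Lewis Black, the entity stays reachable even when the speech-to-text module picks the other spelling.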

The above and other objects and advantages of the present disclosure will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout.

FIG. 1 shows a block diagram of an illustrative system for generating text queries, in accordance with some embodiments of the present disclosure.

FIG. 2 shows a block diagram of an illustrative system for retrieving content in response to a voice query, in accordance with some embodiments of the present disclosure.

FIG. 3 shows a block diagram of an illustrative system for generating pronunciation information, in accordance with some embodiments of the present disclosure.

FIG. 4 is a block diagram of illustrative user equipment, in accordance with some embodiments of the present disclosure.

FIG. 5 shows a block diagram of an illustrative system for responding to voice queries, in accordance with some embodiments of the present disclosure.

FIG. 6 shows a flowchart of an illustrative process for responding to a voice query based on pronunciation information, in accordance with some embodiments of the present disclosure.

FIG. 7 shows a flowchart of an illustrative process for responding to a voice query based on alternative representations, in accordance with some embodiments of the present disclosure.

FIG. 8 shows a flowchart of an illustrative process for generating metadata for an entity based on pronunciation, in accordance with some embodiments of the present disclosure.

FIG. 9 shows a flowchart of an illustrative process for retrieving content associated with an entity of a voice query, in accordance with some embodiments of the present disclosure.

In some embodiments, the present disclosure is directed to a system configured to receive a voice query from a user, analyze the voice query, and generate a text query (e.g., a transcription) for searching for content or information. The system responds to the voice query based in part on the pronunciation of one or more keywords. For example, English has many words that are spelled the same but pronounced differently. This is especially true of people's names. Some examples include:
To illustrate, a user may say "Show me Louis's interview" into the system's audio interface. The system may generate illustrative text queries such as:
Option 1) "Show me Louis Freeh's interview with Fraud Magazine."
Option 2) "Show me the interview with Lewis Black that aired on CBS."
The resulting text query depends on how the user spoke the word "Louis." If the user pronounced it "LOO-ee," the system selects option 1 or applies a heavier weight to option 1. If the user pronounced it "LOO-his," the system selects option 2 or applies a heavier weight to option 2. If pronunciation were not taken into account, the system would likely be unable to respond to the voice query accurately.
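
The "select or apply a heavier weight" behavior can be sketched as below. The `Option` structure, the `boost` factor, and the normalization rule are hypothetical choices made for illustration; the source does not specify how weights are computed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Option:
    text_query: str           # candidate text query
    pronunciation: str        # assumed reference pronunciation of the ambiguous keyword
    base_weight: float = 1.0  # weight before pronunciation is considered


def normalize(pronunciation: str) -> str:
    """Lower-case and drop separators so "LOO-ee" and "loo ee" compare equal."""
    return "".join(ch for ch in pronunciation.lower() if ch.isalpha())


def weight_options(heard: str, options: List[Option], boost: float = 2.0) -> List[Tuple[float, Option]]:
    """Apply a heavier weight to options whose reference pronunciation matches what was heard."""
    heard_key = normalize(heard)
    weighted = []
    for option in options:
        matches = normalize(option.pronunciation) == heard_key
        weighted.append((option.base_weight * (boost if matches else 1.0), option))
    # Highest weight first; ties keep their original order because sorted() is stable.
    return sorted(weighted, key=lambda pair: pair[0], reverse=True)


OPTIONS = [
    Option("Show me Louis Freeh's interview with Fraud Magazine", "LOO-ee"),
    Option("Show me the interview with Lewis Black that aired on CBS", "LOO-his"),
]

best_weight, best_option = weight_options("loo-his", OPTIONS)[0]
print(best_option.text_query)
```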

In some circumstances, a voice query that includes only part of a person's name may cause ambiguity in correctly identifying that person (such a query is referred to as a "non-deterministic person search query"). For example, if a user says "show me movies starring Tom" or "show me interviews with Louis," the system needs to determine which Tom, or which Louis/Louie/Lewis, the user is asking about. In addition to pronunciation information, the system may analyze one or more contextual inputs such as, for example, user search history (e.g., previous queries and search results), user likes, dislikes, and preferences (e.g., from user profile information), general trends (e.g., across multiple users), popularity (e.g., among multiple users), any other suitable information, or any combination thereof. The system retains the pronunciation information in a suitable form (e.g., in the text query itself or in metadata associated with the text query) so that it is not lost after the automatic speech recognition (ASR) process.
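
One simple way to combine pronunciation with the other contextual inputs is a weighted sum, sketched below. The weights, the signal names, and the candidate values are invented for illustration (including the second candidate, which does not appear in the source); the disclosure does not prescribe a particular combination rule.

```python
from typing import Dict, List

# Hypothetical weights; the source does not specify how the inputs are combined.
WEIGHTS: Dict[str, float] = {
    "pronunciation": 0.5,
    "search_history": 0.2,
    "preferences": 0.2,
    "popularity": 0.1,
}


def combined_score(signals: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of per-signal scores in [0, 1]; a missing signal counts as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


def rank_candidates(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Order candidate entities from most to least likely under the combined score."""
    return sorted(candidates, key=lambda name: combined_score(candidates[name]), reverse=True)


# Illustrative values for the query "show me movies starring Tom": pronunciation alone
# cannot separate the candidates, so search history and preferences decide.
CANDIDATES = {
    "Tom Cruise": {"pronunciation": 1.0, "search_history": 0.9, "preferences": 0.5, "popularity": 0.9},
    "Tom Hanks": {"pronunciation": 1.0, "search_history": 0.1, "preferences": 0.7, "popularity": 0.9},
}

print(rank_candidates(CANDIDATES))
```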

In some embodiments, for the system to make use of pronunciation information, the information field that the system searches must itself contain pronunciation information for comparison with the query. For example, the information field may include information about entities that includes pronunciation metadata. The system may perform a phoneme transcription process that takes the user's voice query as input and converts it to text that, when read back, sounds phonetically correct. The system may be configured to use the output of the phoneme transcription process together with the pronunciation metadata to determine search results. In an illustrative example, the pronunciation metadata stored for an entity may include the following:

In some embodiments, the present disclosure is directed to a system configured to receive a voice query from a user, analyze the voice query, and generate a text query (e.g., a transcription) for searching for content or information. The information field that the system searches includes pronunciation metadata, alternative text representations of entities, or both. For example, when a user issues a voice query to the system, the system first converts the speech to text using an ASR module. The resulting text then forms the input to a conversation system (e.g., one that performs an action in response to the query). To illustrate, if a user says "Show me Tom Cruise movies," the ASR module converts the user's speech to text and passes the text query to the conversation system. If an entity corresponding to "Tom Cruise" exists in the data, the system matches it with the text "Tom Cruise" and returns the appropriate results (e.g., information about Tom Cruise, content featuring Tom Cruise, or content identifiers thereof). An entity may be referred to as "reachable" when it exists in the data (e.g., in the information field) and can be accessed directly using the entity title.
Reachability is of primary importance for the system to perform search operations. For example, if certain data (e.g., a movie, artist, television series, or other entity) exists in the system and associated data is stored, but a user cannot access that information, the entity may be referred to as "unreachable." An unreachable entity in the data system represents a failure of the search system.

The system may identify one or more entities or content items among a plurality of stored information. In some embodiments, the system generates an audio file based on a first text string representing the entity or content item and on at least one speech criterion. The system may then use a speech-to-text module to generate a second text string from the audio file. The system compares the two text strings and stores the second text string if it is not identical to the first. In some embodiments, the system generates metadata that includes the results of this text-to-speech-to-text conversion in order to anticipate possible misidentifications when responding to voice queries during search operations. The metadata may include alternative representations of the entity to improve reachability.

FIG. 1 shows a block diagram of an illustrative system 100 for generating text queries, in accordance with some embodiments of the present disclosure. System 100 includes ASR module 110, conversation system 120, pronunciation metadata 150, user profile information 160, and one or more databases 170. For example, ASR module 110 and conversation system 120, which together may be included in system 199, may be used to implement a query application.

A user may voice query 101, which includes the utterance "Show me that interview with Louis from last week," to an audio interface of system 199. ASR module 110 is configured to sample, condition, and digitize the received audio input, analyze the resulting audio file, and generate a text query. In some embodiments, ASR module 110 retrieves information from user profile information 160 to help generate the text query. For example, voice recognition information for the user may be stored in user profile information 160, and ASR module 110 may use the voice recognition information to identify the speaking user. In a further example, system 199 may include user profile information 160 stored in suitable memory.
ASR module 110 may determine pronunciation information for the spoken word "Louis." Because there is more than one pronunciation of the text word "Louis," system 199 generates the text query based on the pronunciation information. Further, the sound "Loo-his" may be converted to text as either "Louis" or "Lewis," so context information may help identify the correct entity of the voice query (e.g., Lewis as in Lewis Black, as opposed to Louis as in Louis Farrakhan). In some embodiments, conversation system 120 is configured to generate a text query, respond to a text query, or both, based on the recognized words from ASR module 110, context information, user profile information 160, pronunciation metadata 150, one or more databases 170, any other information, or any combination thereof. For example, conversation system 120 may generate a text query and then compare the text query to pronunciation metadata 150 for a plurality of entities to determine a match. In a further example, conversation system 120 may compare one or more recognized words to pronunciation metadata 150 for a plurality of entities to determine a match and then generate a text query based on the identified entity. In some embodiments, conversation system 120 generates a text query with accompanying pronunciation information. In some embodiments, conversation system 120 generates a text query with embedded pronunciation information. For example, the text query may include a phonemic representation of a word, such as "loo-ee," rather than the grammatically correct representation "Louis." In a further example, pronunciation metadata 150 may include one or more reference phonemic representations to which the text query may be compared.
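
The difference between accompanying and embedded pronunciation information can be sketched with a small data structure. The `TextQuery` class, the `PRONUNCIATION_METADATA` table, and the exact-match comparison are assumptions made for this illustration; they are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextQuery:
    text: str
    # Accompanying pronunciation information: keyword -> phoneme-style string as heard.
    pronunciations: Dict[str, str] = field(default_factory=dict)

    def embedded(self) -> str:
        """Return the query with each annotated keyword replaced by its phoneme-style form."""
        return " ".join(self.pronunciations.get(word, word) for word in self.text.split())


# Hypothetical pronunciation metadata: entity -> reference phoneme-style representations.
PRONUNCIATION_METADATA: Dict[str, List[str]] = {
    "Louis Freeh": ["loo-ee"],
    "Lewis Black": ["loo-his"],
}


def matching_entities(query: TextQuery) -> List[str]:
    """Entities with a reference representation equal to a pronunciation heard in the query."""
    heard = {p.lower() for p in query.pronunciations.values()}
    return [
        name
        for name, references in PRONUNCIATION_METADATA.items()
        if heard & {r.lower() for r in references}
    ]


query = TextQuery("Show me that interview with Louis from last week", {"Louis": "loo-ee"})
print(query.embedded())
print(matching_entities(query))
```

Keeping the pronunciation alongside the text, rather than only embedding it, leaves the conventional spelling available for ordinary text matching.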

User profile information 160 may include user identification information (e.g., name, identifier, address, contact information), user search history (e.g., previous voice queries, previous text queries, previous search results, feedback on previous search results or queries), user preferences (e.g., search settings, favorite entities, keywords included in more than one query), user likes and dislikes (e.g., entities followed by the user in a social media application, user-entered information), other users connected to the user (e.g., friends, family members, contacts in a social networking application, contacts stored on a user device), user voice data (e.g., audio samples, signatures, speech patterns, or files for identifying the user's voice), any other suitable information about the user, or any combination thereof.

One or more databases 170 include any suitable information for generating a text query, responding to a text query, or both. In some embodiments, pronunciation metadata 150, user profile information 160, or both may be included in one or more databases 170. In some embodiments, one or more databases 170 include statistical information for a plurality of users (e.g., search histories, content consumption histories, consumption patterns). In some embodiments, one or more databases 170 include information about a plurality of entities, including people, places, objects, events, content items, media content associated with one or more entities, or any combination thereof.

FIG. 2 shows a block diagram of an illustrative system 200 for retrieving content in response to a voice query, in accordance with some embodiments of the present disclosure. System 200 includes speech processing system 210, search engine 220, entity database 250, and user profile information 240. Speech processing system 210 may identify an audio file and may analyze the audio file for phonemes, patterns, words, or other elements from which keywords may be identified. In some embodiments, speech processing system 210 may analyze the audio input in the time domain, the spectral domain, or both to identify words. For example, speech processing system 210 may analyze the audio input in the time domain to determine the periods during which speech occurs (e.g., to eliminate pauses or periods of silence). Speech processing system 210 may then analyze each period in the spectral domain to identify phonemes, patterns, words, or other elements from which keywords may be identified. Speech processing system 210 may output a generated text query, one or more words, pronunciation information, or any combination thereof. In some embodiments, speech processing system 210 may retrieve data from user profile information 240 for voice recognition, speech recognition, or both.
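
The time-domain step (finding the periods in which speech occurs) can be sketched with a simple frame-energy threshold. The frame length, the threshold, and the synthetic signal are assumptions for illustration; a production system would use a more robust voice-activity detector, and the subsequent spectral-domain analysis is not shown.

```python
from typing import List, Tuple


def speech_periods(samples: List[float], frame_len: int, threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of frames whose mean energy exceeds the threshold."""
    periods: List[Tuple[int, int]] = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i  # a new speech period begins at this frame
        elif start is not None:
            periods.append((start, i))  # the period ended at the start of this quiet frame
            start = None
    if start is not None:
        periods.append((start, len(samples)))  # speech ran to the end of the input
    return periods


# Synthetic signal: silence, a burst, silence, and a second burst.
signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4 + [0.8, -0.8]
print(speech_periods(signal, frame_len=2, threshold=0.01))
```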

Search engine 220 receives the output from speech processing system 210 and, in combination with search settings 221 and context information 222, generates a response to the text query. Search engine 220 may use user profile information 240 to generate, modify, or respond to the text query. Search engine 220 uses the text query to search the data of database of entities 250. Database of entities 250 may include metadata associated with a plurality of entities, content associated with a plurality of entities, or both.
For example, the data may include an identifier for an entity, details describing the entity, a title referring to the entity (e.g., which may include a phonemic representation or an alternative representation), phrases associated with the entity (e.g., which may include a phonemic representation or an alternative representation), links associated with the entity (e.g., an IP address, a URL, a hardware address), keywords associated with the entity (e.g., which may include a phonemic representation or an alternative representation), any other suitable information associated with the entity, or any combination thereof. When search engine 220 identifies one or more entities that match the keywords of the text query, one or more content items that match the keywords of the text query, or both, search engine 220 may then provide the information, the content, or both to the user as response 270 to the text query. In some embodiments, search settings 221 include databases, entities, types of entities, types of content, other search criteria, or any combination thereof that affect the generation of the text query, the retrieval of search results, or both. In some embodiments, context information 222 includes genre information (e.g., to further narrow the search field), keywords, database identification (e.g., which databases are likely to contain the target information or content), types of content (e.g., by date, genre, title, or format), any other suitable information, or any combination thereof. Response 270 may include, for example, content (e.g., a displayed video), information, a list of search results, links to content, any other suitable search results, or any combination thereof.
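
A minimal sketch of searching entity metadata that carries alternative representations is given below. The record layout (`id`, `title`, `alternatives`, `content`), the sample values, and the substring match are hypothetical; search settings 221 and context information 222 are omitted.

```python
from typing import Dict, List

# Hypothetical entity records; the field names are illustrative, not taken from the source.
ENTITY_DB: List[Dict] = [
    {"id": "E1", "title": "Tom Cruise", "alternatives": ["tom cruz", "tom crews"], "content": ["movie-101"]},
    {"id": "E2", "title": "Lewis Black", "alternatives": ["louis black"], "content": ["interview-7"]},
]


def search(text_query: str, db: List[Dict] = ENTITY_DB) -> List[Dict]:
    """Return entities whose title or any alternative representation occurs in the query."""
    query = text_query.lower()
    hits = []
    for entity in db:
        names = [entity["title"]] + entity["alternatives"]
        if any(name.lower() in query for name in names):
            hits.append(entity)
    return hits


def respond(text_query: str) -> List[str]:
    """Collect content identifiers for every matching entity."""
    return [content_id for entity in search(text_query) for content_id in entity["content"]]


print(respond("show me tom crews movies"))
```

Without the stored alternatives, the mis-transcribed query "tom crews" would find nothing and the entity would be unreachable in the sense described earlier.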

FIG. 3 illustrates a block diagram of an illustrative system 300 for generating pronunciation information according to some embodiments of the present disclosure. The system 300 includes a text-to-speech engine 310 and a speech-to-text engine 320. In some embodiments, the system 300 determines the pronunciation information independently of any text or voice query. For example, the system 300 may generate metadata about one or more entities (e.g., the pronunciation metadata 150 of the system 100, or the metadata stored in the database of entities 250 of the system 200). The text-to-speech engine 310 may identify a first text string 302 that may include an entity name or other identifier that is likely to be included in a voice query. For example, the text-to-speech engine 310 may identify a “name” field of the entity metadata rather than an “ID” field because a user is more likely to speak a voice query that includes a name rather than a numeric or alphanumeric identifier (e.g., a user speaks “Louis” rather than “WIKI04556”). The text-to-speech engine 310 generates an audio output 312 at a speaker or other audio device based on the first text string 302. For example, the text-to-speech engine 310 may use one or more settings to specify voice details (e.g., male/female voice, accent, or other details), playback speed, or any other suitable settings that may affect the generated audio output. The speech-to-text engine 320 receives an audio input 313 from the audio output 312 at a microphone or other suitable device (e.g., in addition to or instead of an audio file that may be stored) and generates a text conversion of the audio input 313 (e.g., in addition to or instead of storing an audio file of the recorded audio). The speech-to-text engine 320 may use processing settings to generate a new text string 322. The new text string 322 is compared to the first text string 302. If the new text string 322 is identical to the text string 302, no metadata needs to be generated, since a voice query is likely to be converted to the correct text query. If the new text string 322 is not identical to the text string 302, this indicates that a voice query may be converted incorrectly to a text query. Thus, if the new text string 322 is not identical to the text string 302, the speech-to-text engine 320 includes the new text string 322 in the metadata associated with the entity with which the text string 302 is associated. The system 300 may identify multiple entities and, for each entity, generate metadata that includes the resulting text strings (e.g., the new text string 322) from the text-to-speech engine 310 and the speech-to-text engine 320. In some embodiments, for a given entity, the text-to-speech engine 310, the speech-to-text engine 320, or both may use two or more settings to generate two or more new text strings. Each new text string that differs from the text string 302 may then be stored in the metadata. For example, different pronunciations or interpretations of pronunciations resulting from different settings may generate different new text strings, which may be stored in anticipation of voice queries from different users. By generating and storing alternative representations (e.g., the text string 302 and the new text string 322), the system 300 may update metadata and enable more accurate searches (e.g., improving entity reachability and search accuracy).
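As a non-limiting illustration, the round trip through the text-to-speech engine and the speech-to-text engine may be sketched as follows. The `TextToSpeech` and `SpeechToText` callable signatures, the `alternative_representations` helper, the `accent` setting key, and the stand-in engines with their canned mishearings are assumptions made so the sketch runs without any speech library; they do not represent a real engine's API or real recognition results.

```python
from typing import Callable, Iterable

# Hypothetical engine interfaces: a text-to-speech callable returns audio
# bytes for a text string and a settings dict; a speech-to-text callable
# returns a transcript for audio bytes.
TextToSpeech = Callable[[str, dict], bytes]
SpeechToText = Callable[[bytes], str]


def alternative_representations(
    text: str,
    tts: TextToSpeech,
    stt: SpeechToText,
    settings_list: Iterable[dict],
) -> list[str]:
    """Return transcripts that differ from `text`, in first-seen order."""
    alternatives: list[str] = []
    for settings in settings_list:
        audio = tts(text, settings)   # first text string -> audio output
        new_text = stt(audio)         # audio input -> new text string
        # Store only strings that are not identical to the original.
        if new_text != text and new_text not in alternatives:
            alternatives.append(new_text)
    return alternatives


# Stand-in engines (assumed behavior, for illustration only).
def fake_tts(text: str, settings: dict) -> bytes:
    return f"{settings['accent']}|{text}".encode()


def fake_stt(audio: bytes) -> str:
    accent, text = audio.decode().split("|", 1)
    misheard = {("fr", "Louis"): "Louie", ("us", "Louis"): "Lewis"}
    return misheard.get((accent, text), text)


metadata = {"name": "Louis", "alternatives": []}
metadata["alternatives"] = alternative_representations(
    metadata["name"], fake_tts, fake_stt,
    [{"accent": "us"}, {"accent": "fr"}, {"accent": "uk"}],
)
```

With these stand-in engines, two of the three settings yield a transcript that differs from the name and are stored, while the third reproduces the name exactly and adds nothing.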

In an illustrative example, for an entity, system 300 identifies the title and related phrases, passes each phrase through text-to-speech engine 310, saves a respective audio file, and then passes each respective audio file through speech-to-text engine 320 to obtain an ASR transcript (e.g., new text string 322). If the ASR transcript differs from the original phrase (e.g., text string 302), system 300 adds the ASR transcript to the entity's related phrases (e.g., as stored in metadata). In some embodiments, system 300 does not require any manual work and may be fully automated (e.g., no user input is required). In some embodiments, system 300 is alerted when a user issues a query and does not get the desired results. In response, a person manually identifies what the correct entity for the query should be. Incorrect results are stored and provide information for future queries. System 300 addresses potential inaccuracies at the metadata level, not at the system level. The analysis of the text string 302 for many entities can be exhaustive and automatic, such that all error cases are identified and resolved up front (e.g., prior to a user's voice query). The system 300 does not require a user to provide a voice query in order to generate error cases (e.g., alternative representations). The system 300 can be used to emulate a user's interaction with the query system and anticipate potential sources of error in conducting a search.

The user may access content, applications (e.g., for interpreting voice queries), and other features from, for example, one or more of their devices (i.e., user equipment or audio equipment), one or more network-connected devices, one or more electronic devices having a display, or combinations thereof. Any of the illustrative techniques of this disclosure may be implemented by a user device, a device that provides a display to a user, or any other suitable control circuitry configured to respond to voice queries and generate display content for a user.

FIG. 4 illustrates a generalized embodiment of an illustrative user device. User equipment system 401 may include a set-top box 416 that includes or is communicatively coupled to a display 412, audio equipment 414, and a user input interface 410. In some embodiments, display 412 may include a television display or a computer display. In some embodiments, user input interface 410 is a remote control device. Set-top box 416 may include one or more circuit boards. In some embodiments, the one or more circuit boards include processing circuitry, control circuitry, and storage (e.g., RAM, ROM, hard disk, removable disk, etc.). In some embodiments, the circuit boards include an input/output path. Each of user equipment device 400 and user equipment system 401 may receive content and data via an input/output (hereinafter "I/O") path 402. I/O path 402 may provide content and data to control circuitry 404, which includes processing circuitry 406 and storage 408. The control circuitry 404 may be used to send and receive commands, requests, and other suitable data using the I/O path 402. The I/O path 402 may connect the control circuitry 404 (specifically, the processing circuitry 406) to one or more communication paths (described below). The I/O functions may be provided by one or more of these communication paths, but are shown as a single path in FIG. 4 to avoid overcomplicating the drawing. Although a set-top box 416 is shown in FIG. 4 for illustrative purposes, any suitable computing device having processing circuitry, control circuitry, and storage may be used in accordance with the present disclosure. For example, the set-top box 416 may be replaced or supplemented by a personal computer (e.g., notebook, laptop, desktop), a network-based server hosting user-accessible client devices, a non-user-owned device, any other suitable device, or any combination thereof.

The control circuitry 404 may be based on any suitable processing circuitry, such as the processing circuitry 406. As referred to herein, processing circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores) or a supercomputer. In some embodiments, the processing circuitry is distributed across multiple separate processors or processing units, such as multiple processing units of the same type (e.g., two Intel Core i7 processors) or multiple different processors (e.g., an Intel Core i5 processor and an Intel Core i7 processor). In some embodiments, the control circuitry 404 executes instructions for an application stored in memory (e.g., the storage 408). Specifically, the control circuitry 404 may be instructed by the application to perform the functions discussed above and below. For example, an application may provide instructions to control circuitry 404 to generate a media guide display. In some implementations, any action performed by control circuitry 404 may be based on instructions received from the application.

In some client/server-based embodiments, the control circuitry 404 includes communications circuitry suitable for communicating with an application server or other network or server. Instructions for performing the functionality described above may be stored on the application server. The communications circuitry may include a cable modem, an integrated services digital network (ISDN) modem, a digital subscriber line (DSL) modem, a telephone modem, an Ethernet card, or a wireless modem for communicating with other equipment, or any other suitable communications circuitry. Such communications may involve the Internet or any other suitable communications network or path. In addition, the communications circuitry may include circuitry (described in more detail below) that enables peer-to-peer communication of user equipment devices or communication of user equipment devices at locations remote from each other.

The memory may be an electronic storage device such as storage 408 that is part of control circuitry 404. As referred to herein, the phrase "electronic storage device" or "storage device" should be understood to mean any device for storing electronic data, computer software, or firmware, such as random access memory, read-only memory, hard drives, optical drives, solid state devices, quantum storage devices, game consoles, game media, any other suitable fixed or removable storage devices, or any combination of the same. Storage 408 may be used to store various types of content described herein and the media guidance data described above. Non-volatile memory may also be used (e.g., to launch a boot-up routine and other instructions). Cloud-based storage may be used, for example, to supplement storage 408 or in place of storage 408.

A user may use the user input interface 410 to send instructions to the control circuitry 404. The user input interface 410, the display 412, or both may include a touch screen configured to provide a display and receive tactile input. For example, the touch screen may be configured to receive tactile input from a finger, a stylus, or both. In some embodiments, the equipment device 400 may include a front-facing screen and a rear-facing screen, multiple front screens, or multiple angled screens. In some embodiments, the user input interface 410 includes a remote control device having one or more microphones, buttons, keypads, any other components configured to receive user input, or combinations thereof. For example, the user input interface 410 may include a handheld remote control device having an alphanumeric keypad and option buttons. In a further example, the user input interface 410 may include a handheld remote control device having a microphone and control circuitry configured to receive and identify voice commands and transmit information to the set-top box 416.

Audio equipment 414 may be provided as integrated with other elements of each of user device 400 and user equipment system 401, or may be a stand-alone unit. The audio component of videos and other content displayed on display 412 may be played through speakers of audio equipment 414. In some embodiments, the audio may be distributed to a receiver (not shown), which processes the audio and outputs it through speakers of audio equipment 414. In some embodiments, for example, control circuitry 404 is configured to provide audio cues or other audio feedback to the user using speakers of audio equipment 414. Audio equipment 414 may include a microphone configured to receive audio input, such as voice commands and utterances (including, for example, voice queries). For example, a user may speak letters or words, which are received by the microphone and converted to text by control circuitry 404. In a further example, a user may voice commands, which are received by the microphone and recognized by control circuitry 404.

The application (e.g., for managing voice queries) may be implemented using any suitable architecture. For example, a standalone application may be implemented entirely on each of user device 400 and user equipment system 401. In some such embodiments, instructions for the application are stored locally (e.g., in storage 408) and data for use by the application is downloaded on a periodic basis (e.g., from an out-of-band feed, from an Internet resource, or using another suitable approach). Control circuitry 404 may read instructions for the application from storage 408, process the instructions, and generate any of the displays discussed herein. Based on the processed instructions, control circuitry 404 may determine what action to perform when input is received from user input interface 410. For example, movement of a cursor on the display up/down may be indicated by the processed instructions when user input interface 410 indicates that the up/down button has been selected. The application and/or any instructions for implementing any of the embodiments discussed herein may be encoded on a computer-readable medium. A computer-readable medium includes any medium capable of storing data. A computer-readable medium may be transitory, including, but not limited to, a propagating electrical or electromagnetic signal, or may be non-transitory, including, but not limited to, volatile and non-volatile computer memory or storage devices such as hard disks, floppy disks, USB drives, DVDs, CDs, media cards, register memory, processor cache, random access memory (RAM), etc.

In some embodiments, the application is a client/server-based application. Data for use by a thick or thin client implemented on each of the user device 400 and the user equipment system 401 is retrieved on demand by issuing requests to a server remote from each of the user equipment device 400 and the user equipment system 401. For example, the remote server may store instructions for the application in a storage device. The remote server may use circuitry (e.g., control circuitry 404) to process the stored instructions and generate the displays discussed above and below. The client device may receive the display generated by the remote server and display the contents of the display locally on the user device 400. In this manner, the processing of the instructions is performed remotely by the server, while the resulting display, which may include text, a keyboard, or other visual objects, is provided locally on the user device 400. The user device 400 may receive inputs from a user via the input interface 410 and transmit those inputs to the remote server for processing and generating corresponding displays. For example, the user device 400 may transmit a communication to the remote server indicating that an up/down button was selected via the input interface 410. The remote server may process instructions according to the input and generate a display of the application corresponding to the input (e.g., a display that moves a cursor up/down). The generated display is then transmitted to the user device 400 for presentation to the user.
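As a non-limiting illustration, the division of work between the client and the remote server may be sketched as follows. The `RemoteServer` and `ThinClient` classes, the text-based "display", and the direct method call standing in for network transport are assumptions made for this sketch only.

```python
from dataclasses import dataclass


@dataclass
class RemoteServer:
    """Hypothetical server: holds application state and generates displays."""
    items: list[str]
    cursor: int = 0

    def handle_input(self, button: str) -> str:
        """Process one input event and return the generated display as text."""
        if button == "down":
            self.cursor = min(self.cursor + 1, len(self.items) - 1)
        elif button == "up":
            self.cursor = max(self.cursor - 1, 0)
        # The display marks the item under the cursor with "> ".
        return "\n".join(
            ("> " if i == self.cursor else "  ") + item
            for i, item in enumerate(self.items)
        )


@dataclass
class ThinClient:
    """Hypothetical client: forwards input and presents what it receives."""
    server: RemoteServer
    screen: str = ""

    def press(self, button: str) -> None:
        # In a real deployment this call would cross the network.
        self.screen = self.server.handle_input(button)


client = ThinClient(RemoteServer(["Movies", "Shows", "Music"]))
client.press("down")
client.press("down")
client.press("up")
```

The client in this sketch keeps no application state of its own; it only transmits button presses and stores the most recent display generated by the server.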

In some embodiments, the application is downloaded and interpreted or otherwise run by an interpreter or virtual machine (e.g., run by the control circuitry 404). In some embodiments, the application may be encoded in ETV Binary Interchange Format (EBIF), received by the control circuitry 404 as part of a suitable feed, and interpreted by a user agent running on the control circuitry 404. For example, the application may be an EBIF application. In some embodiments, the application may be defined by a series of JAVA-based files that are received and run by a local virtual machine or other suitable middleware executed by the control circuitry 404.

FIG. 5 illustrates a block diagram of an illustrative network arrangement 500 for responding to voice queries, in accordance with some embodiments of the present disclosure. The illustrative system 500 may represent a situation in which a user provides a voice query at a user device 550, views content on a display of the user device 550, or both. In the system 500, there may be more than one type of user device, although only one is shown in FIG. 5 to avoid overcomplicating the drawing. In addition, each user may utilize more than one type of user device, and may also utilize more than one of each type of user device. The user device 550 may be the same as the user device 400 of FIG. 4, the user equipment system 401, any other suitable device, or any combination thereof.

The user device 550, illustrated as a wireless-enabled device, may be coupled to the communications network 510 (e.g., connected to the Internet). For example, the user device 550 is coupled to the communications network 510 via a communications path (which may include, for example, an access point). In some embodiments, the user device 550 may be a computing device coupled to the communications network 510 via a wired connection. For example, the user device 550 may also include a wired connection to a LAN or any other suitable communications link to the network 510. The communications network 510 may be one or more networks including the Internet, a cellular network, a mobile voice or data network (e.g., a 4G or LTE network), a cable network, a public switched telephone network, or any other type of communications network or combination of communications networks. The communications path may include one or more communications paths, such as a satellite path, a fiber optic-based path, a cable path, a path supporting Internet communications, a free-space connection (e.g., for broadcast or other wireless signals), or any other suitable wired or wireless communications path or combination of such paths. Although no communications path is drawn between user device 550 and network device 520, these devices may communicate with each other directly via communications paths such as those described above, via other short-range point-to-point communications paths such as USB cables, IEEE 1394 cables, or wireless paths (e.g., Bluetooth, infrared, IEEE 802-11x, etc.), or via other short-range communications over wired or wireless paths. BLUETOOTH is a certification mark owned by Bluetooth SIG, Inc. The devices may also communicate with each other through an indirect path via communications network 510.

The system 500 as shown includes a network device 520 (e.g., a server or other suitable computing device) coupled to the communications network 510 via a suitable communications path. Communications between the network device 520 and the user device 550 may be exchanged via one or more communications paths, but are shown in FIG. 5 as a single path to avoid overcomplicating the drawing. The network device 520 may include a database and one or more applications (e.g., as an application server or host server). Multiple network entities may exist and communicate with the network 510, but only one is shown in FIG. 5 to avoid overcomplicating the drawing. In some embodiments, the network device 520 may include one source device. In some embodiments, the network device 520 implements an application that communicates with instances of the application on many user devices (e.g., the user device 550). For example, an instance of a social media application may be implemented on user device 550, and application information may be communicated to and from network device 520, which may store profile information about the user (e.g., so that a current social media feed is available on devices other than user device 550). In a further example, an instance of a search application may be implemented on user device 550, and application information may be communicated to and from network device 520, which may store profile information about the user, search history from multiple users, entity information (e.g., content and metadata), any other suitable information, or any combination thereof.

In some embodiments, the network device 520 includes one or more types of stored information, including, for example, entity information, metadata, content, historical communication and search records, user preferences, user profile information, any other suitable information, or any combination thereof. The network device 520 may include an application host database or server, plug-ins, software developer kits (SDKs), application programming interfaces (APIs), or other software tools configured to provide software (e.g., to be downloaded to the user device), remotely run software (e.g., hosting applications accessed by the user device), or otherwise provide application support to applications of the user device 550. In some embodiments, information from the network device 520 is provided to the user device 550 using a client/server approach. For example, the user device 550 may pull information from a server, or the server may push information to the user device 550. In some embodiments, an application client residing on user device 550 may initiate a session with network device 520 and retrieve information as needed (e.g., when data becomes stale or when the user device receives a request from the user to receive data). In some embodiments, the information may include user information (e.g., user profile information, user-created content). For example, the user information may include current and/or historical user activity information, such as content transactions in which the user is involved, searches performed by the user, content consumed by the user, whether the user interacts with social networks, any other suitable information, or any combination thereof. In some embodiments, the user information may identify patterns of a given user over a period of time. As illustrated, network device 520 includes entity information for multiple entities. Entity information 521, 522, and 523 include metadata for the respective entities. The entities for which metadata is stored in network device 520 may be linked to one another, referenced to one another, described by one or more tags in the metadata, or a combination thereof.
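As a non-limiting illustration, an application client that pulls information only when its copy is stale or when the user requests it may be sketched as follows. The `CachedInfo` class, the `max_age_s` threshold, the injected clock, and the stand-in `fetch_entity_info` function are assumptions made for this sketch; the disclosure does not specify how staleness is determined.

```python
import time
from typing import Callable, Optional


class CachedInfo:
    """Client-side cache that pulls from a server when data is stale."""

    def __init__(self, fetch: Callable[[], dict], max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical pull from the network device
        self._max_age_s = max_age_s  # assumed staleness threshold, in seconds
        self._clock = clock
        self._data: Optional[dict] = None
        self._fetched_at = 0.0

    def get(self, force: bool = False) -> dict:
        """Return cached data, pulling first if missing, stale, or forced."""
        age = self._clock() - self._fetched_at
        if force or self._data is None or age > self._max_age_s:
            self._data = self._fetch()
            self._fetched_at = self._clock()
        return self._data


# Stand-in server call and a controllable clock so the sketch is deterministic.
calls = []
now = [0.0]


def fetch_entity_info() -> dict:
    calls.append(now[0])
    return {"version": len(calls)}


cache = CachedInfo(fetch_entity_info, max_age_s=60.0, clock=lambda: now[0])
first = cache.get()             # no data yet, so this pulls
now[0] = 30.0
second = cache.get()            # still fresh, served from the cache
now[0] = 100.0
third = cache.get()             # older than 60 s, so this pulls again
fourth = cache.get(force=True)  # explicit user request pulls regardless of age
```

A push-based arrangement would invert this sketch: the server would call into the client when its data changes instead of waiting for the client to ask.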

In some embodiments, the application may be implemented on the user device 550, the network device 520, or both. For example, the application may be implemented as software or a set of executable instructions that may be stored in storage of the user device 550, the network device 520, or both, and executed by the control circuitry of the respective devices. In some embodiments, the application may include an audio recording application, a speech-to-text application, a text-to-speech application, a voice-recognition application, or a combination thereof, implemented as a client/server-based application, where only the client application resides on the user device 550 and the server application resides on a remote server (e.g., the network device 520). For example, the application may be implemented partially on the user device 550 as a client application (e.g., by the control circuitry of the user device 550), and partially on the remote server as a server application that runs on the control circuitry of the remote server (e.g., the control circuitry of the network device 520). When executed by the control circuitry of the remote server, the application may instruct the control circuitry to generate a display and transmit the generated display to the user device 550. The server application may instruct the control circuitry of the remote device to transmit data for storage on the user device 550. The client application may instruct the control circuitry of the receiving user device to generate an application display.

いくつかの実施形態では、システム５００の配置は、クラウドベースの配置である。クラウドは、例の中でもとりわけ、情報記憶、検索、メッセージング、またはソーシャルネットワーキングサービス等のサービスへのアクセス、およびユーザデバイスに関して上記に説明される任意のコンテンツへのアクセスを提供する。サービスは、クラウド－コンピューティングサービスプロバイダを通して、またはオンラインサービスの他のプロバイダを通して、クラウド内に提供されることができる。例えば、クラウドベースのサービスは、ユーザソースコンテンツが接続されるデバイス上での他者による視聴のために配信される記憶サービス、共有サイト、ソーシャルネットワーキングサイト、検索エンジン、または他のサービスを含むことができる。これらのクラウドベースのサービスは、ユーザデバイスが、情報をローカルで記憶し、ローカルで記憶された情報にアクセスするのではなく、情報をクラウドに記憶し、情報をクラウドから受信することを可能にし得る。クラウドリソースは、例えば、ウェブブラウザ、メッセージングアプリケーション、ソーシャルメディアアプリケーション、デスクトップアプリケーション、またはモバイルアプリケーションを使用して、ユーザデバイスによってアクセスされ得、オーディオ記録アプリケーション、発話→テキストアプリケーション、テキスト→発話アプリケーション、音声－認識アプリケーション、および／またはそれらのアクセスアプリケーションの任意の組み合わせを含み得る。ユーザデバイス５５０は、アプリケーション配信のためにクラウドコンピューティングに依拠するクラウドクライアントであり得るか、または、ユーザデバイス５５０は、クラウドリソースへのアクセスを伴わずに、いくつかの機能性を有し得る。例えば、ユーザデバイス５５０上で起動するいくつかのアプリケーションは、クラウドアプリケーション（例えば、インターネットを経由してサービスとして配信されるアプリケーション）であり得る一方、他のアプリケーションは、ユーザデバイス５５０上で記憶および起動され得る。いくつかの実施形態では、ユーザデバイス５５０は、複数のクラウドリソースからの情報を同時に受信し得る。 In some embodiments, the deployment of system 500 is a cloud-based deployment. The cloud provides access to services such as information storage, retrieval, messaging, or social networking services, among other examples, and to any of the content described above with respect to user devices. Services can be provided in the cloud through a cloud-computing service provider or through other providers of online services. For example, cloud-based services can include storage services, sharing sites, social networking sites, search engines, or other services where user-sourced content is distributed for viewing by others on connected devices. These cloud-based services can enable user devices to store information in the cloud and receive information from the cloud, rather than storing information locally and accessing the locally stored information. 
Cloud resources can be accessed by user devices using, for example, a web browser, messaging applications, social media applications, desktop applications, or mobile applications, and can include audio recording applications, speech-to-text applications, text-to-speech applications, voice-recognition applications, and/or any combination of those access applications. User device 550 may be a cloud client that relies on cloud computing for application delivery, or user device 550 may have some functionality without access to cloud resources. For example, some applications running on user device 550 may be cloud applications (e.g., applications delivered as a service over the Internet), while other applications may be stored and run on user device 550. In some embodiments, user device 550 may receive information from multiple cloud resources simultaneously.

例証的例では、ユーザは、音声クエリをユーザデバイス５５０に発話し得る。音声クエリは、ユーザデバイス５５０のオーディオインターフェースによって記録され、アプリケーション５６０によってサンプリングおよびデジタル化され、アプリケーション５６０によってテキストクエリに変換される。アプリケーション５６０は、テキストクエリとともに、発音も含み得る。例えば、テキストクエリの１つ以上の単語が、適切なスペルではなく、音素記号によって表され得る。さらなる例では、発音メタデータは、テキストクエリの１つ以上の単語の音素表現を含むテキストクエリとともに記憶され得る。いくつかの実施形態では、アプリケーション５６０は、エンティティ、コンテンツ、メタデータ、またはそれらの組み合わせのデータベースの中を検索するために、テキストクエリおよび任意の好適な発音情報をネットワークデバイス５２０に伝送する。ネットワークデバイス５２０は、テキストクエリに関連付けられたエンティティ、テキストクエリに関連付けられたコンテンツ、または両方を識別し、その情報をユーザデバイス５５０に提供し得る。 In an illustrative example, a user may speak a voice query into user device 550. The voice query is recorded by an audio interface of user device 550, sampled and digitized by application 560, and converted into a text query by application 560. Application 560 may also include a pronunciation along with the text query. For example, one or more words of the text query may be represented by phonemic symbols rather than proper spelling. In a further example, pronunciation metadata may be stored along with the text query including a phonemic representation of one or more words of the text query. In some embodiments, application 560 transmits the text query and any suitable pronunciation information to network device 520 for searching in a database of entities, content, metadata, or combinations thereof. Network device 520 may identify entities associated with the text query, content associated with the text query, or both, and provide the information to user device 550.

For example, a user may speak "show me Tom Cruise movies" into the microphone of user device 550. Application 560 may generate the text query "Tom Cruise movies" and transmit it to network device 520. Network device 520 may identify the entity "Tom Cruise" and then identify movies that are linked to the entity. Network device 520 may then transmit content (e.g., a video file, trailer, or clip), a content identifier (e.g., a movie title and image), a content address (e.g., a URL, website, or IP address), any other suitable information, or any combination thereof, to user device 550. Because the pronunciations of "Tom" and "Cruise" are generally unambiguous, application 560 need not generate pronunciation information in this situation.

In a further example, a user may speak "show me an interview with Louis" into the microphone of user device 550, pronouncing the name Louis as "loo-ee" rather than "loo-ihs." In some embodiments, application 560 may generate the text query "interview with Louis" and transmit it to network device 520 along with metadata that includes the phonemic representation "loo-ee." In some embodiments, application 560 may generate the text query "interview with Loo-ee" and transmit it to network device 520, where the text query itself includes the pronunciation information (e.g., in this example, the phonemic representation). Because the name Louis is common, there may be many entities that include this identifier. In some embodiments, network device 520 may identify entities that have metadata including a pronunciation tag with "loo-ee" as a phonemic representation. In some embodiments, network device 520 may retrieve trending searches, the user's search history, or other contextual information to identify the entity to which the user most likely refers. 
For example, the user may have previously searched for "FBI," and the entity Louis Freeh (e.g., former Director of the FBI) may include metadata that includes tags related to "FBI." Once an entity is identified, the network device 520 may then transmit to the user device 550 the content (e.g., a video file or clip of the interview), a content identifier (e.g., a file title and still images from the interview), a content address (e.g., a URL, website, or IP address for streaming one or more video files of the interview), any other suitable information related to Louis Freeh, or any combination thereof. Because the pronunciation of "Louis" may be ambiguous, the application 560 may generate pronunciation information in such circumstances.
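
The disambiguation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Entity` class, its `pronunciations` and `tags` fields, the `identify_entity` function, and the two placeholder entities are hypothetical names introduced only for this example. The pronunciation tags and the "FBI" tag follow the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    pronunciations: set = field(default_factory=set)  # pronunciation tags
    tags: set = field(default_factory=set)            # other metadata tags


def identify_entity(keyword, pronunciation, entities, search_history):
    """Pick the entity whose name contains `keyword`, preferring a matching
    pronunciation tag and then overlap with the user's search history."""
    candidates = [e for e in entities if keyword.lower() in e.name.lower()]
    # Narrow by pronunciation tag when any candidate carries the spoken form.
    by_sound = [e for e in candidates if pronunciation in e.pronunciations]
    if by_sound:
        candidates = by_sound
    if not candidates:
        return None
    history = {term.lower() for term in search_history}
    # Break remaining ties with context: count tags the user searched before.
    return max(candidates,
               key=lambda e: len({t.lower() for t in e.tags} & history))


ENTITIES = [
    Entity("Louis Freeh", {"loo-ee"}, {"FBI"}),
    Entity("Louis Placeholder-A", {"loo-ihs"}, {"music"}),   # hypothetical
    Entity("Louis Placeholder-B", {"loo-ee"}, {"cooking"}),  # hypothetical
]

match = identify_entity("Louis", "loo-ee", ENTITIES, search_history=["FBI"])
print(match.name)  # Louis Freeh
```

In this sketch the pronunciation tag removes the "loo-ihs" candidate, and the earlier "FBI" search separates the two remaining "loo-ee" candidates.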

In an illustrative example, a user may speak "William Djoko" into the microphone of user device 550. Application 560 may generate a text query that does not correspond to the correct spelling of the entity. For example, the voice query "William Djoko" may be converted to text as "William gjoka." This incorrect text conversion may make it difficult to identify the correct entity. In some embodiments, metadata associated with the entity William Djoko includes alternative representations based on pronunciation. Metadata for the entity "William Djoko" may include pronunciation tags (e.g., "related phrases"), as shown in Table 1.

Although the text query may include an incorrect spelling, the correct entity may still be identified because the metadata associated with the correct entity includes the variant. Thus, network device 520 may store entity information that includes alternative representations and may therefore identify the correct entity in response to a text query including the phrase "William gjoka." Once the entity is identified, network device 520 may then transmit to user device 550 content (e.g., an audio or video file clip), a content identifier (e.g., a song or album title and a still image from a concert), a content address (e.g., a URL, website, or IP address for streaming one or more audio files of music), any other suitable information related to William Djoko, or any combination thereof. Because the name "Djoko" may be converted incorrectly from speech, application 560 may, in such a situation, generate pronunciation information for storage in the metadata so that the correct entity can be identified.

上記の例証的例では、エンティティＷｉｌｌｉａｍＤｊｏｋｏの到達可能性は、特に、ＡＳＲプロセスがエンティティ名の文法的に正しくないテキスト変換をもたらし得るので、代替表現を記憶することによって改良される。 In the illustrative example above, the reachability of the entity William Djoko is improved by storing alternative representations, particularly since the ASR process may result in grammatically incorrect textual translations of the entity name.
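
A lookup over stored alternative representations might look like the following sketch. The dictionary layout and the names `normalize`, `build_index`, and `resolve` are assumptions made for illustration. The only variant listed is the "William gjoka" transcription mentioned in the text, and no other table contents are assumed.

```python
def normalize(text):
    """Lower-case and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(text.lower().split())


def build_index(entities):
    """Map every name and every stored alternative representation to its entity."""
    index = {}
    for name, alternatives in entities.items():
        for text in (name, *alternatives):
            index[normalize(text)] = name
    return index


# Entity name -> alternative representations stored in its metadata.
ENTITY_ALTERNATIVES = {"William Djoko": ["William gjoka"]}
INDEX = build_index(ENTITY_ALTERNATIVES)


def resolve(text_query):
    """Return the entity for a text query, or None when nothing matches."""
    return INDEX.get(normalize(text_query))


print(resolve("William gjoka"))  # William Djoko
```

Because the variants are indexed ahead of time, the misspelled query resolves with a single dictionary lookup and no pronunciation analysis at query time.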

例証的例では、メタデータは、ユーザの音声クエリに応答してではなく、（例えば、テキストクエリまたは他の検索および読み出しプロセスによる）後の参照のために、発音に基づいて生成され得る。いくつかの実施形態では、ネットワークデバイス５２０、ユーザデバイス５５０、または両方は、発音情報に基づいて、メタデータを生成し得る。例えば、ユーザデバイス５５０は、エンティティの代替表現のユーザ入力を受信し得る（例えば、前の検索結果または発話→テキスト変換に基づいて）。いくつかの実施形態では、ネットワークデバイス５２０、ユーザデバイス５５０、または両方は、テキスト→発話モジュールおよび発話→テキストモジュールを使用して、エンティティに関するメタデータを自動的に生成し得る。例えば、アプリケーション５６０は、エンティティのテキスト表現（例えば、エンティティの名前のテキスト文字列）を識別し、テキスト表現をテキスト→発話モジュールに入力し、オーディオファイルを生成し得る。いくつかの実施形態では、テキスト→発話モジュールは、１つ以上の設定または基準（それらを用いてオーディオファイルが生成される）を含む。例えば、設定または基準は、言語（例えば、英語、スペイン語、マンダリン）、アクセント（例えば、地方または言語ベース）、音声（例えば、特定の人の音声、男性音声、女性音声）、速度（例えば、オーディオファイルの関連部分の再生時間）、発音（例えば、複数の音素変形例に関して）、任意の他の好適な設定または基準、またはそれらの任意の組み合わせを含み得る。アプリケーション５６０は、次いで、オーディオファイルを発話→テキストモジュールに入力し、結果として生じるテキスト表現を生成する。結果として生じるテキスト表現が、元のテキスト表現と同一でない場合、アプリケーション５６０は、結果として生じるテキスト表現をエンティティに関連付けられたメタデータに記憶し得る。いくつかの実施形態では、アプリケーション５６０は、種々の設定または基準のためのこのプロセスを繰り返し、したがって、メタデータに記憶され得る種々のテキスト表現を生成し得る。結果として生じるメタデータは、可能性が高い変形例を予想するためのテキスト－発話－テキスト変換を使用して生成された変形例とともに、元のテキスト表現を含む。故に、アプリケーション５６０が、音声クエリをユーザから受信し、テキストへの転換が、エンティティ識別子に正確に合致しないとき、アプリケーション５６０は、依然として、正しいエンティティを識別し得る。さらに、アプリケーション５６０は、メタデータが変形例を含むので、発音情報に関してテキストクエリを分析する必要はない（例えば、分析は、リアルタイムでではなく、事前に実施される）。 In an illustrative example, metadata may be generated based on the pronunciation, not in response to a user's voice query, but for later reference (e.g., via a text query or other search and retrieval process). In some embodiments, network device 520, user device 550, or both may generate metadata based on the pronunciation information. For example, user device 550 may receive user input of an alternative representation of an entity (e.g., based on previous search results or speech-to-text conversion). In some embodiments, network device 520, user device 550, or both may automatically generate metadata about an entity using a text-to-speech module and a speech-to-text module. 
For example, application 560 may identify a text representation of an entity (e.g., a text string of the entity's name), input the text representation to the text-to-speech module, and generate an audio file. In some embodiments, the text-to-speech module includes one or more settings or criteria with which the audio file is generated. For example, the settings or criteria may include language (e.g., English, Spanish, Mandarin), accent (e.g., regional or language-based), voice (e.g., a particular person's voice, male voice, female voice), speed (e.g., play time of the relevant portion of the audio file), pronunciation (e.g., with respect to multiple phoneme variants), any other suitable settings or criteria, or any combination thereof. Application 560 then inputs the audio file into a speech-to-text module and generates a resultant text representation. If the resultant text representation is not identical to the original text representation, application 560 may store the resultant text representation in metadata associated with the entity. In some embodiments, application 560 may repeat this process for different settings or criteria, thus generating different text representations that may be stored in the metadata. The resultant metadata includes the original text representation along with variants generated using text-to-speech-to-text conversion to anticipate likely variants. Thus, when application 560 receives a voice query from a user and the transcription to text does not exactly match the entity identifier, application 560 may still identify the correct entity. Additionally, application 560 does not need to analyze the text query for pronunciation information because the metadata includes the variants (e.g., the analysis is performed in advance rather than in real time).
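
The text-to-speech-to-text loop described above can be sketched as follows. The function `generate_variants` and its signature are assumptions. So that the sketch stays self-contained, the speech engines are passed in as callables, and `fake_tts` and `fake_stt` are stand-ins that only imitate a mis-transcription. They are not real engines, and the setting names and values are hypothetical.

```python
from itertools import product


def generate_variants(name, text_to_speech, speech_to_text, settings):
    """Round-trip `name` through TTS and STT for every combination of
    settings and collect transcriptions that differ from the original."""
    variants = []
    keys = sorted(settings)
    for values in product(*(settings[k] for k in keys)):
        config = dict(zip(keys, values))
        audio = text_to_speech(name, **config)
        result = speech_to_text(audio)
        # Store only new text representations that differ from the original.
        if result != name and result not in variants:
            variants.append(result)
    return {"name": name, "variants": variants}


def fake_tts(text, accent, speed):
    """Stand-in for a text-to-speech module: the 'audio' is just a tuple."""
    return (text, accent, speed)


def fake_stt(audio):
    """Stand-in for a speech-to-text module that mis-hears one accent."""
    text, accent, _speed = audio
    if accent == "accent-b":
        return text.replace("Djoko", "gjoka")
    return text


metadata = generate_variants(
    "William Djoko", fake_tts, fake_stt,
    {"accent": ["accent-a", "accent-b"], "speed": [1.0, 1.5]},
)
print(metadata)  # {'name': 'William Djoko', 'variants': ['William gjoka']}
```

Injecting the two modules as parameters keeps the loop independent of any particular speech library. A real deployment would substitute its own engines and settings.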

アプリケーション５６０は、例えば、オーディオ記録、発話認識、発話→テキスト変換、テキスト→発話変換、クエリ生成、検索エンジン機能性、コンテンツ読み出し、ディスプレイ生成、コンテンツ提示、メタデータ生成、データベース機能性、またはそれらの組み合わせ等の任意の好適な機能性を含み得る。いくつかの実施形態では、アプリケーション５６０の側面は、２つ以上のデバイスを横断して実装される。いくつかの実施形態では、アプリケーション５６０は、単一デバイス上に実装される。例えば、エンティティ情報５２１、５２２、および５２３は、ユーザデバイス５５０のメモリ記憶装置に記憶され得、アプリケーション５６０によってアクセスされ得る。 Application 560 may include any suitable functionality, such as, for example, audio recording, speech recognition, speech-to-text conversion, text-to-speech conversion, query generation, search engine functionality, content retrieval, display generation, content presentation, metadata generation, database functionality, or combinations thereof. In some embodiments, aspects of application 560 are implemented across two or more devices. In some embodiments, application 560 is implemented on a single device. For example, entity information 521, 522, and 523 may be stored in memory storage of user device 550 and accessed by application 560.

図６は、本開示のいくつかの実施形態による、発音情報に基づいて音声クエリに応答するための例証的プロセス６００のフローチャートを示す。例えば、クエリアプリケーションは、図４のユーザデバイス４００、図４のユーザ機器システム４０１、図５のユーザデバイス５５０、図５のネットワークデバイス５２０、任意の他の好適なデバイス、またはそれらの任意の組み合わせ等の任意の好適なハードウェア上に実装されたプロセス６００を実施し得る。さらなる例では、クエリアプリケーションは、図５のアプリケーション５６０のインスタンスであり得る。 FIG. 6 shows a flowchart of an illustrative process 600 for responding to a voice query based on pronunciation information, according to some embodiments of the present disclosure. For example, the query application may perform process 600 implemented on any suitable hardware, such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the query application may be an instance of application 560 of FIG. 5.

ステップ６０２では、クエリアプリケーションが、音声クエリを受信する。いくつかの実施形態では、オーディオインターフェース（例えば、オーディオ機器４１４、ユーザ入力インターフェース４１０、またはそれらの組み合わせ）は、オーディオ入力を受信し、電子信号を生成するマイクロホンまたは他のセンサを含み得る。いくつかの実施形態では、オーディオ入力は、アナログセンサにおいて受信され、アナログセンサは、アナログ信号を提供し、アナログ信号は、オーディオファイルを生成するために、調整、サンプリング、デジタル化される。オーディオファイルは、次いで、ステップ６０４および６０６において、クエリアプリケーションによって分析され得る。いくつかの実施形態では、オーディオファイルは、メモリ（例えば、記憶装置４０８）に記憶される。いくつかの実施形態では、クエリアプリケーションは、ユーザインターフェース（例えば、ユーザ入力インターフェース４１０）を含み、それは、ユーザが、オーディオ記録を記録、再生、改変、クロッピング、可視化、または別様に管理することを可能にする。例えば、いくつかの実施形態では、オーディオインターフェースは、常時、オーディオ入力を受信するように構成される。さらなる例では、いくつかの実施形態では、オーディオインターフェースは、ユーザが指示をユーザに入力インターフェースに提供すると（例えば、タッチスクリーン上のソフトボタンを選択し、オーディオ記録を開始することによって）、オーディオ入力を受信するように構成される。さらなる例では、いくつかの実施形態では、オーディオインターフェースは、オーディオ入力を受信し、発話または他の好適なオーディオ信号が検出されると、記録を開始するように構成される。クエリアプリケーションは、オーディオ入力を記憶されたオーディオファイルに変換するために、任意の好適な調整ソフトウェアまたはハードウェアを含み得る。例えば、クエリアプリケーションは、１つ以上のフィルタ（例えば、低域通過、高域通過、ノッチフィルタ、または帯域通過フィルタ）、増幅器、デジメータ、または他の調整を適用し、オーディオファイルを生成し得る。さらなる例では、クエリアプリケーションは、圧縮、転換（例えば、スペクトル変換、ウェーブレット変換）、正規化、等化、切り捨て（例えば、時間またはスペクトルドメインにおいて）、任意の他の好適な処理、またはそれらの任意の組み合わせ等の任意の好適な処理を調整された信号に適用し、オーディオファイルを生成し得る。いくつかの実施形態では、ステップ６０２において、制御回路が、別個のアプリケーションから、クエリアプリケーションの別個のモジュールから、ユーザ入力に基づいて、またはそれらの任意の組み合わせにおいて、オーディオファイルを受信する。例えば、ステップ６０２では、制御回路は、さらなる処理（例えば、プロセス６００のステップ６０４－６１２）のために、記憶装置（例えば、記憶装置４０８）に記憶されるオーディオファイルとして、音声クエリを受信し得る。 In step 602, the query application receives a voice query. In some embodiments, the audio interface (e.g., audio equipment 414, user input interface 410, or a combination thereof) may include a microphone or other sensor that receives the audio input and generates an electronic signal. In some embodiments, the audio input is received at an analog sensor that provides an analog signal that is conditioned, sampled, and digitized to generate an audio file. The audio file may then be analyzed by the query application in steps 604 and 606. In some embodiments, the audio file is stored in memory (e.g., storage device 408). 
In some embodiments, the query application includes a user interface (e.g., user input interface 410) that allows a user to record, play, modify, crop, visualize, or otherwise manage the audio recording. For example, in some embodiments, the audio interface is configured to receive audio input at all times. In a further example, in some embodiments, the audio interface is configured to receive audio input when a user provides an instruction to the input interface (e.g., by selecting a soft button on a touch screen to start an audio recording). In a further example, in some embodiments, the audio interface is configured to receive audio input and begin recording when speech or other suitable audio signals are detected. The query application may include any suitable conditioning software or hardware to convert the audio input into a stored audio file. For example, the query application may apply one or more filters (e.g., low-pass, high-pass, notch, or band-pass filters), amplifiers, digitizers, or other conditioning to generate an audio file. In a further example, the query application may apply any suitable processing, such as compression, transformation (e.g., spectral transformation, wavelet transformation), normalization, equalization, truncation (e.g., in the time or spectral domain), any other suitable processing, or any combination thereof, to the conditioned signal to generate an audio file. In some embodiments, in step 602, the control circuitry receives the audio file from a separate application, from a separate module of the query application, based on user input, or any combination thereof. For example, in step 602, the control circuitry may receive the voice query as an audio file that is stored in a storage device (e.g., storage device 408) for further processing (e.g., steps 604-612 of process 600).
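
As a rough illustration of the conditioning and processing steps listed above, the sketch below applies a moving-average filter (a crude low-pass filter chosen here for simplicity), peak normalization, and truncation in the time domain to a list of samples. The function name `condition` and its parameters are assumptions. The disclosure does not prescribe these particular operations or their order.

```python
def condition(samples, window=3, max_samples=None):
    """Smooth, normalize, and optionally truncate a sequence of samples."""
    if not samples:
        return []
    # Moving average over the current and preceding samples (simple low-pass).
    smoothed = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        chunk = samples[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    # Peak normalization so the largest magnitude becomes 1.0.
    peak = max(abs(s) for s in smoothed)
    normalized = [s / peak for s in smoothed] if peak else smoothed
    # Truncation in the time domain.
    if max_samples is not None:
        return normalized[:max_samples]
    return normalized


conditioned = condition([0.0, 2.0, 4.0, 2.0], window=2)
print(conditioned)
```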

In step 604, the query application extracts one or more keywords from the voice query of step 602. In some embodiments, the one or more keywords may represent the complete voice query. In some embodiments, the one or more keywords include only significant words or portions of the utterance. For example, in some embodiments, the query application may identify words in the utterance and select some of those words as keywords. For example, the query application may identify words and select, from among them, the words that are not prepositions. In a further example, the query application may identify as keywords only words that are at least three characters long. 
In a further example, the query application may identify keywords as phrases that contain two or more words (e.g., to be more descriptive and provide more context), which may be useful for narrowing the potential search field of related content. In some embodiments, the query application identifies keywords such as, for example, words, phrases, names, places, channels, media asset titles, or other keywords, using any suitable criteria for identifying keywords from the audio input. The query application may process the words using any suitable word detection technique, speech detection technique, pattern recognition technique, signal processing technique, or any combination thereof. For example, the query application may compare a series of signal templates to a portion of the audio signal to determine whether a match exists (e.g., whether a particular word is included in the audio signal). In a further example, the query application may apply learning techniques to better recognize words in the voice query. For example, the query application may gather feedback from users regarding multiple requested content items in the context of multiple queries, and thus use past data as a training set for making recommendations and retrieving content. In some embodiments, the query application may store snippets (i.e., short-duration clips) of recorded audio during the detected speech and process the snippets. In some embodiments, the query application stores relatively large segments of the speech (e.g., greater than 10 seconds) as audio files and processes the files. In some embodiments, the query application may process the speech and detect words by using ongoing calculations. For example, a wavelet transform may be performed on the speech in real time, albeit with some time delay, to provide a continuous computation of the speech pattern (e.g., one that may be compared to a reference to identify words). 
In some embodiments, the query application may detect both the words and the user who spoke them (e.g., by voice recognition) in accordance with the present disclosure.
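
Two of the keyword-selection rules mentioned for step 604 (dropping prepositions and keeping only words of at least three characters) can be sketched as follows. Combining the two rules in one filter, the `PREPOSITIONS` list, and the function name `extract_keywords` are assumptions made for this example. The input is taken to be words already recognized from the audio.

```python
# A small, illustrative preposition list; a real system would use a fuller one.
PREPOSITIONS = frozenset(
    {"about", "at", "by", "for", "from", "in", "of", "on", "to", "with"}
)


def extract_keywords(words, min_length=3):
    """Keep words that are not prepositions and are at least `min_length` long."""
    return [
        w for w in words
        if w.lower() not in PREPOSITIONS and len(w) >= min_length
    ]


keywords = extract_keywords("show me an interview with Louis".split())
print(keywords)  # ['show', 'interview', 'Louis']
```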

いくつかの実施形態では、ステップ６０４において、クエリアプリケーションは、検出された単語をクエリ内で検出された単語のリストに追加する。いくつかの実施形態では、クエリアプリケーションは、これらの検出された単語をメモリに記憶し得る。例えば、クエリアプリケーションは、メモリに、ＡＳＣＩＩ文字の集合（すなわち、８ビットコード）、パターン（例えば、単語を合致させるために使用される発話信号基準を示す）、識別子（例えば、単語のためのコード）、文字列、任意の他のデータタイプ、またはそれらの任意の組み合わせとして、単語を記憶し得る。いくつかの実施形態では、メディアガイドアプリケーションは、単語が検出されるにつれて、単語をメモリに追加し得る。例えば、メディアガイドアプリケーションは、以前に検出された単語の文字列に新しく検出された単語を付加すること、新しく検出された単語を以前に検出された単語のセルアレイに追加すること（例えば、セルアレイサイズを１増加させる）、新しく検出された単語に対応する新しい変形例を作成すること、新しく作成された単語に対応する新しいファイルを作成すること、または、ステップ６０４において検出された１つ以上の単語を記憶することを行い得る。 In some embodiments, in step 604, the query application adds the detected word to a list of words detected in the query. In some embodiments, the query application may store these detected words in memory. For example, the query application may store the words in memory as a set of ASCII characters (i.e., an 8-bit code), a pattern (e.g., indicating the speech signal criteria used to match the word), an identifier (e.g., a code for the word), a string, any other data type, or any combination thereof. In some embodiments, the media guidance application may add words to memory as they are detected. For example, the media guidance application may append the newly detected word to a string of previously detected words, add the newly detected word to a cell array of previously detected words (e.g., increase the cell array size by 1), create a new variant corresponding to the newly detected word, create a new file corresponding to the newly created word, or store one or more words detected in step 604.

ステップ６０６では、クエリアプリケーションが、ステップ６０４の１つ以上のキーワードに関する発音情報を決定する。いくつかの実施形態では、発音情報は、１つ以上のキーワードの音素表現（例えば、国際音声記号を使用する）を含む。いくつかの実施形態では、発音情報は、発音を組み込むための１つ以上のキーワードの１つ以上の代替スペルを含む。いくつかの実施形態では、ステップ６０６では、制御回路が、音素表現を含むテキストクエリに関連付けられたメタデータを生成する。 At step 606, the query application determines pronunciation information for the one or more keywords of step 604. In some embodiments, the pronunciation information includes a phonemic representation (e.g., using the International Phonetic Alphabet) of the one or more keywords. In some embodiments, the pronunciation information includes one or more alternative spellings of the one or more keywords to incorporate the pronunciation. In some embodiments, at step 606, the control circuitry generates metadata associated with the text query that includes the phonemic representation.

At step 608, the query application generates a text query based on the one or more keywords from step 604 and the pronunciation information from step 606. The query application may generate the text query by arranging the one or more keywords in a suitable order (e.g., in the order in which they were spoken). In some embodiments, the query application may omit one or more words of the voice query (e.g., short words, prepositions, or any other words determined to be relatively less important). The text query may be generated as a file (e.g., a text file) and stored in a suitable storage device (e.g., storage device 408).
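
The two ways of carrying pronunciation information with a text query (as accompanying metadata, or embedded in the query text itself) might be represented as in the sketch below. The dictionary layout and the names `build_text_query` and `build_inline_query` are assumptions. The disclosure does not specify a serialization format.

```python
import json


def build_text_query(keywords, pronunciations):
    """Join keywords in spoken order and attach pronunciation metadata."""
    return {
        "text": " ".join(keywords),
        "metadata": {
            "pronunciation": {
                k: pronunciations[k] for k in keywords if k in pronunciations
            }
        },
    }


def build_inline_query(keywords, pronunciations):
    """Replace each keyword that has a phonemic representation with it."""
    return " ".join(pronunciations.get(k, k) for k in keywords)


spoken = ["interview", "with", "Louis"]
query = build_text_query(spoken, {"Louis": "loo-ee"})
print(json.dumps(query))
print(build_inline_query(spoken, {"Louis": "loo-ee"}))
```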

At step 610, the query application identifies an entity among a plurality of entities of a database based on the text query and on metadata stored for the entity. In some embodiments, the metadata includes a pronunciation tag. In some embodiments, the query application may identify the entity by identifying a metadata tag of a content item that corresponds to the entity. For example, the content item may include a movie that has tags for the actors in the movie. If the text query includes the actor, the query application may determine a match and, based on the match, identify the entity as being associated with the content item. To illustrate, the query application may first identify the entity (e.g., search among the entities) and then retrieve the content associated with the entity, or the query application may first identify the content (e.g., search among the content) and determine whether the entity associated with the content matches the text query. A database arranged by entity, by content, or both may be searched by the query application.

いくつかの実施形態では、クエリアプリケーションは、ユーザプロファイル情報に基づいて、エンティティを識別する。例えば、クエリアプリケーションは、前の音声クエリからの既に識別されたエンティティに基づいて、エンティティを識別し得る。さらなる例では、クエリアプリケーションは、エンティティに関連付けられた人気情報に基づいて（例えば、複数のユーザに関する検索に基づいて）、エンティティを識別し得る。いくつかの実施形態では、クエリアプリケーションは、ユーザの選好に基づいて、エンティティを識別する。例えば、１つ以上のキーワードがユーザプロファイル情報の好ましいエンティティ名または識別子に合致する場合、クエリアプリケーションは、そのエンティティを識別するか、または、そのエンティティにより重く重み付けし得る。 In some embodiments, the query application identifies an entity based on user profile information. For example, the query application may identify an entity based on already identified entities from a previous voice query. In a further example, the query application may identify an entity based on popularity information associated with the entity (e.g., based on searches across multiple users). In some embodiments, the query application identifies an entity based on user preferences. For example, if one or more keywords match a preferred entity name or identifier in the user profile information, the query application may identify or weight the entity more heavily.

In some embodiments, the query application identifies the entity by identifying a plurality of entities (e.g., with metadata stored for each entity), determining a respective score for each entity of the plurality based on comparing its pronunciation tag to the text query, and selecting the entity that has the maximum score. The score may be based on the number of matches identified between keywords of the text query and metadata associated with the entity or content item.
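
A match-count score of the kind described above can be sketched as follows. The entity names, the tag lists, and the functions `score` and `select_entity` are hypothetical. Counting exact keyword-to-tag matches is only one possible scoring rule.

```python
def score(text_query, tags):
    """Count metadata tags (including pronunciation tags) found in the query."""
    query_terms = set(text_query.lower().split())
    return sum(1 for tag in tags if tag.lower() in query_terms)


def select_entity(text_query, entities):
    """Score every entity and return the best one with all scores."""
    if not entities:
        return None, {}
    scores = {name: score(text_query, tags) for name, tags in entities.items()}
    best = max(scores, key=scores.get)
    # Report no entity when nothing matched at all.
    return (best if scores[best] > 0 else None), scores


CANDIDATES = {
    "entity-1": ["loo-ee", "interview"],  # hypothetical tags
    "entity-2": ["loo-ihs"],              # hypothetical tags
}

best, scores = select_entity("interview loo-ee", CANDIDATES)
print(best, scores)  # entity-1 {'entity-1': 2, 'entity-2': 0}
```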

いくつかの実施形態では、クエリアプリケーションは、テキストクエリに基づいて、複数のエンティティの中の２つ以上のエンティティ（例えば、関連付けられたメタデータ）を識別する。クエリアプリケーションは、クエリのエンティティのいくつかまたは全てに関連付けられたコンテンツ項目を識別し得る。いくつかの実施形態では、クエリアプリケーションは、テキストクエリの少なくとも一部を各エンティティに関して記憶されたメタデータのタグと比較し、合致を識別することによって、エンティティを識別する。 In some embodiments, the query application identifies two or more entities (e.g., associated metadata) among the plurality of entities based on the text query. The query application may identify content items associated with some or all of the entities of the query. In some embodiments, the query application identifies the entities by comparing at least a portion of the text query to metadata tags stored for each entity and identifying matches.

At step 612, the query application retrieves a content item associated with the entity. In some embodiments, the query application identifies the content item, downloads the content item, streams the content item, generates the content item for display, or a combination thereof. For example, a voice query may include "show me the latest Tom Cruise movies," and the query application may provide a link to the movie "Mission Impossible: Fallout" that the user may select to view the video content. In some embodiments, the query application may retrieve a plurality of content items associated with the entity that matches the text query. For example, the query application may retrieve a plurality of links, video files, audio files, or other content, or a list of identified content items, in accordance with the present disclosure.

図７は、本開示のいくつかの実施形態による、代替表現に基づいて音声クエリに応答するための例証的プロセス７００のフローチャートを示す。例えば、クエリアプリケーションは、図４のユーザデバイス４００、図４のユーザ機器システム４０１、図５のユーザデバイス５５０、図５のネットワークデバイス５２０、任意の他の好適なデバイス、またはそれらの任意の組み合わせ等の任意の好適なハードウェア上に実装されるプロセス７００を実施し得る。さらなる例では、クエリアプリケーションは、図５のアプリケーション５６０のインスタンスであり得る。 FIG. 7 shows a flowchart of an illustrative process 700 for responding to a voice query based on alternative representations, according to some embodiments of the present disclosure. For example, the query application may perform process 700 implemented on any suitable hardware, such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the query application may be an instance of application 560 of FIG. 5.

ステップ７０２では、クエリアプリケーションが、音声クエリを受信する。いくつかの実施形態では、オーディオインターフェース（例えば、オーディオ機器４１４、ユーザ入力インターフェース４１０、またはそれらの組み合わせ）は、オーディオ入力を受信し、電子信号を生成するマイクロホンまたは他のセンサを含み得る。いくつかの実施形態では、オーディオ入力は、アナログセンサにおいて受信され、アナログセンサは、アナログ信号を提供し、アナログ信号は、オーディオファイルを生成するために、調整、サンプリング、デジタル化される。オーディオファイルは、次いで、ステップ７０４において、クエリアプリケーションによって分析され得る。いくつかの実施形態では、オーディオファイルは、メモリ（例えば、記憶装置４０８）に記憶される。いくつかの実施形態では、クエリアプリケーションは、ユーザインターフェース（例えば、ユーザ入力インターフェース４１０）を含み、それは、ユーザが、オーディオ記録を記録、再生、改変、クロッピング、可視化、または別様に管理することを可能にする。例えば、いくつかの実施形態では、オーディオインターフェースは、常時、オーディオ入力を受信するように構成される。さらなる例では、いくつかの実施形態では、オーディオインターフェースは、ユーザが指示をユーザインターフェースに提供する（例えば、タッチスクリーン上のソフトボタンを選択し、オーディオ記録を開始することによって）と、オーディオ入力を受信するように構成される。さらなる例では、いくつかの実施形態では、オーディオインターフェースは、オーディオ入力を受信し、発話または他の好適なオーディオ信号が検出されると、記録を開始するように構成される。クエリアプリケーションは、オーディオ入力を記憶されたオーディオファイルに変換するための任意の好適な調整ソフトウェアまたはハードウェアを含み得る。例えば、クエリアプリケーションは、１つ以上のフィルタ（例えば、低域通過、高域通過、ノッチフィルタ、または帯域通過フィルタ）、増幅器、デジメータ、または他の調整を適用し、オーディオファイルを生成し得る。さらなる例では、クエリアプリケーションは、圧縮、転換（例えば、スペクトル変換、ウェーブレット変換）、正規化、等化、切り捨て（例えば、時間またはスペクトルドメインにおいて）、任意の他の好適な処理、またはそれらの任意の組み合わせ等の任意の好適な処理を調整された信号に適用し、オーディオファイルを生成し得る。いくつかの実施形態では、ステップ７０２では、制御回路が、別個のアプリケーションから、クエリアプリケーションの別個のモジュールから、ユーザ入力に基づいて、またはそれらの任意の組み合わせにおいてオーディオファイルを受信する。例えば、ステップ７０２は、さらなる処理（例えば、プロセス７００のステップ７０４－７１０）のために、記憶装置（例えば、記憶装置４０８）に記憶されるオーディオファイルとして、音声クエリを受信することを含み得る。 In step 702, the query application receives a voice query. In some embodiments, an audio interface (e.g., audio equipment 414, user input interface 410, or a combination thereof) may include a microphone or other sensor that receives audio input and generates an electronic signal. In some embodiments, the audio input is received at an analog sensor that provides an analog signal that is conditioned, sampled, and digitized to generate an audio file. The audio file may then be analyzed by the query application in step 704. In some embodiments, the audio file is stored in memory (e.g., storage device 408). 
In some embodiments, the query application includes a user interface (e.g., user input interface 410) that allows a user to record, play back, alter, crop, visualize, or otherwise manage audio recordings. For example, in some embodiments, the audio interface is configured to receive audio input at all times. In a further example, in some embodiments, the audio interface is configured to receive audio input when a user provides an indication to the user interface (e.g., by selecting a soft button on a touch screen to begin audio recording). In a further example, in some embodiments, the audio interface is configured to receive audio input and to begin recording when speech or another suitable audio signal is detected. The query application may include any suitable conditioning software or hardware for converting the audio input into a stored audio file. For example, the query application may apply one or more filters (e.g., low-pass, high-pass, notch, or band-pass filters), amplifiers, decimators, or other conditioning to generate the audio file. In a further example, the query application may apply any suitable processing to the conditioned signal, such as compression, transformation (e.g., spectral transformation, wavelet transformation), normalization, equalization, truncation (e.g., in the time or spectral domain), any other suitable processing, or any combination thereof, to generate the audio file. In some embodiments, at step 702, the control circuitry receives the audio file from a separate application, from a separate module of the query application, based on user input, or any combination thereof. For example, step 702 may include receiving the voice query as an audio file that is stored in a storage device (e.g., storage device 408) for further processing (e.g., steps 704-710 of process 700).

ステップ７０４では、クエリアプリケーションが、１つ以上のキーワードをステップ７０２の音声クエリから抽出する。いくつかの実施形態では、１つ以上のキーワードは、完全な音声クエリを表し得る。いくつかの実施形態では、１つ以上のキーワードは、重要な単語または発話の一部のみを含む。例えば、いくつかの実施形態では、クエリアプリケーションは、発話内の単語を識別し、それらの単語のうちのいくつかをキーワードとして選択し得る。例えば、クエリアプリケーションは、単語を識別し、それらの単語の中から、前置詞ではない単語を選択し得る。さらなる例では、クエリアプリケーションは、キーワードとして、少なくとも３つの文字長の単語のみを識別し得る。さらなる例では、クエリアプリケーションは、キーワードを２つ以上の単語を含む語句として識別し得（例えば、より記述的であり、より多くのコンテキストを提供するために）、それは、関連コンテンツの潜在的検索フィールドを絞り込むために有用であり得る。いくつかの実施形態では、クエリアプリケーションは、オーディオ入力からキーワードを識別するための任意の好適な基準を使用して、例えば、単語、語句、名前、場所、チャネル、メディアアセットタイトル、または他のキーワード等のキーワードを識別する。クエリアプリケーションは、任意の好適な単語検出技法、発話検出技法、パターン認識技法、信号処理技法、またはそれらの任意の組み合わせを使用して、単語を処理し得る。例えば、クエリアプリケーションは、一連の信号テンプレートをオーディオ信号の一部と比較し、合致が存在するかどうか（例えば、特定の単語がオーディオ信号に含まれるかどうか）を見出し得る。さらなる例では、クエリアプリケーションは、学習技法を適用し、音声クエリ内の単語をより良好に認識し得る。例えば、クエリアプリケーションは、複数のクエリとの関連で、複数の要求されるコンテンツ項目に関するフィードバックをユーザから集め、故に、推奨を行い、コンテンツを読み出すために、過去のデータを訓練セットとして使用し得る。いくつかの実施形態では、クエリアプリケーションは、検出された発話中、記録されたオーディオのスニペット（すなわち、短持続時間のクリップ）を記憶し、スニペットを処理し得る。いくつかの実施形態では、クエリアプリケーションは、発話の比較的に大きなセグメント（例えば、１０秒を上回る）をオーディオファイルとして記憶し、ファイルを処理する。いくつかの実施形態では、クエリアプリケーションは、発話を処理し、継続的な計算を使用することによって、単語を検出し得る。例えば、ウェーブレット変換が、リアルタイムで、発話に実施され、若干の時間の遅れがあっても、発話パターンの継続的な計算（例えば、単語を識別するための参照と比較され得る）を提供し得る。いくつかの実施形態では、クエリアプリケーションは、本開示に従って、単語および単語を発声したユーザ（例えば、音声認識）を検出し得る。 In step 704, the query application extracts one or more keywords from the voice query of step 702. In some embodiments, the one or more keywords may represent the complete voice query. In some embodiments, the one or more keywords include only significant words or portions of the utterance. For example, in some embodiments, the query application may identify words in the utterance and select some of those words as keywords. For example, the query application may identify words and select from among those words those words that are not prepositions. In a further example, the query application may identify only words that are at least three characters long as keywords. 
In a further example, the query application may identify keywords as phrases that contain two or more words (e.g., to be more descriptive and provide more context), which may be useful for narrowing the potential search field of relevant content. In some embodiments, the query application identifies keywords such as, for example, words, phrases, names, places, channels, media asset titles, or other keywords, using any suitable criteria for identifying keywords from the audio input. The query application may process the words using any suitable word detection technique, speech detection technique, pattern recognition technique, signal processing technique, or any combination thereof. For example, the query application may compare a series of signal templates to a portion of the audio signal to determine whether a match exists (e.g., whether a particular word is included in the audio signal). In a further example, the query application may apply learning techniques to better recognize words in the voice query. For example, the query application may gather feedback from a user on a plurality of requested content items in the context of a plurality of queries, and accordingly use past data as a training set for making recommendations and retrieving content. In some embodiments, the query application may store snippets (i.e., short-duration clips) of audio recorded during detected speech and process the snippets. In some embodiments, the query application stores relatively large segments of speech (e.g., longer than 10 seconds) as an audio file and processes the file. In some embodiments, the query application may process the speech and detect words by using a continuous computation. For example, a wavelet transform may be performed on the speech in real time, albeit with a slight time lag, to provide a continuous computation of speech patterns (e.g., which may be compared to a reference to identify words). 
In some embodiments, the query application may detect the words and the user who spoke the words (e.g., using voice recognition) in accordance with the present disclosure.
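By way of a non-limiting illustration, the keyword-selection criteria mentioned for step 704 (excluding prepositions and keeping only words of at least three characters) might be sketched as follows. The sketch assumes the utterance has already been transcribed into a string of words; the `PREPOSITIONS` stop list, the function name, and the `min_length` parameter are hypothetical and are not taken from the disclosure.

```python
# Hypothetical stop list; the disclosure only says that prepositions may be excluded.
PREPOSITIONS = {"of", "in", "on", "at", "to", "for", "with", "by", "from", "about"}


def extract_keywords(transcript, min_length=3):
    """Return candidate keywords from a transcript, in the order spoken.

    A word is kept when it has at least `min_length` characters and is not
    in the (assumed) preposition list.
    """
    keywords = []
    for raw_word in transcript.split():
        word = raw_word.strip(".,!?\"'")  # drop surrounding punctuation
        if len(word) < min_length:
            continue
        if word.lower() in PREPOSITIONS:
            continue
        keywords.append(word)
    return keywords
```

Because the words are kept in spoken order, joining the returned list with spaces would give one possible text query of the kind described for step 706.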

いくつかの実施形態では、ステップ７０４において、クエリアプリケーションは、検出された単語をクエリ内で検出された単語のリストに追加する。いくつかの実施形態では、クエリアプリケーションは、これらの検出された単語をメモリに記憶し得る。例えば、クエリアプリケーションは、メモリに、ＡＳＣＩＩ文字の集合（すなわち、８ビットコード）、パターン（例えば、単語を合致させるために使用される発話信号基準を示す）、識別子（例えば、単語のためのコード）、文字列、任意の他のデータタイプ、またはそれらの任意の組み合わせとして、単語を記憶し得る。いくつかの実施形態では、メディアガイドアプリケーションは、単語が検出されるにつれて、単語をメモリに追加し得る。例えば、メディアガイドアプリケーションは、以前に検出された単語の文字列に新しく検出された単語を付加すること、新しく検出された単語を以前に検出された単語のセルアレイに追加すること（例えば、セルアレイサイズを１増加させる）、新しく検出された単語に対応する新しい変形例を作成すること、新しく作成された単語に対応する新しいファイルを作成すること、または、ステップ７０４において検出された１つ以上の単語を記憶することを行い得る。 In some embodiments, in step 704, the query application adds the detected word to a list of words detected in the query. In some embodiments, the query application may store these detected words in memory. For example, the query application may store the words in memory as a set of ASCII characters (i.e., an 8-bit code), a pattern (e.g., indicating the speech signal criteria used to match the word), an identifier (e.g., a code for the word), a string, any other data type, or any combination thereof. In some embodiments, the media guidance application may add words to memory as they are detected. For example, the media guidance application may append the newly detected word to a string of previously detected words, add the newly detected word to a cell array of previously detected words (e.g., increase the cell array size by 1), create a new variant corresponding to the newly detected word, create a new file corresponding to the newly created word, or store one or more words detected in step 704.

ステップ７０６では、クエリアプリケーションが、ステップ７０４の１つ以上のキーワードに基づいて、テキストクエリを生成する。クエリアプリケーションは、１つ以上のキーワードを好適な順序で（例えば、発話された順序で）配置することによって、テキストクエリを生成し得る。いくつかの実施形態では、クエリアプリケーションは、音声クエリの１つ以上の単語（例えば、短単語、前置詞、または比較的にあまり重要ではないと決定された任意の他の単語）を省略し得る。テキストクエリは、ファイル（例えば、テキストファイル）として生成され、好適な記憶装置（例えば、記憶装置４０８）に記憶され得る。 At step 706, the query application generates a text query based on the one or more keywords of step 704. The query application may generate the text query by arranging the one or more keywords in a suitable order (e.g., in the order in which they were spoken). In some embodiments, the query application may omit one or more words of the voice query (e.g., short words, prepositions, or any other words determined to be relatively unimportant). The text query may be generated as a file (e.g., a text file) and stored in suitable storage (e.g., storage device 408).

ステップ７０８では、クエリアプリケーションが、テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別する。メタデータは、発音に基づくエンティティの代替テキスト表現を含む。いくつかの実施形態では、クエリアプリケーションは、エンティティの代替表現に対応するコンテンツ項目のメタデータタグを識別することによって、エンティティを識別し得る。例えば、コンテンツ項目は、映画内の俳優に関するタグを有する映画を含み得、タグは、（例えば、システム３００等のシステムから導出されるか、または別様にメタデータに含まれる）代替スペルを含む。テキストクエリが、俳優を含む場合、クエリアプリケーションは、合致を決定し得、合致に基づいて、コンテンツ項目に関連付けられているとして、エンティティを識別し得る。例証するために、クエリアプリケーションは、最初に、エンティティを識別し（例えば、エンティティの中を検索し）、次いで、エンティティに関連付けられたコンテンツを読み出し得るか、または、クエリアプリケーションは、最初に、コンテンツを識別し（例えば、コンテンツの中を検索し）、コンテンツに関連付けられたエンティティがテキストクエリに合致するかどうかを決定し得る。エンティティ別に、コンテンツ別に、またはその両方で配置されているデータベースが、クエリアプリケーションによって検索され得る。クエリアプリケーションは、テキストクエリの１つ以上の単語がエンティティの代替表現（例えば、エンティティに関連付けられたメタデータに記憶されるような）に合致するとき、合致を決定し得る。 At step 708, the query application identifies an entity based on the text query and metadata for the entity. The metadata includes an alternative text representation of the entity that is based on pronunciation. In some embodiments, the query application may identify the entity by identifying a metadata tag of a content item that corresponds to the alternative representation of the entity. For example, the content item may include a movie having tags for the actors in the movie, the tags including alternative spellings (e.g., derived from a system such as system 300 or otherwise included in the metadata). If the text query includes an actor, the query application may determine a match and, based on the match, identify the entity as being associated with the content item. To illustrate, the query application may first identify the entity (e.g., search among entities) and then retrieve the content associated with the entity, or the query application may first identify the content (e.g., search among content) and determine whether an entity associated with the content matches the text query. Databases arranged by entity, by content, or both may be searched by the query application. 
The query application may determine a match when one or more words of the text query match an alternative representation of the entity (e.g., as stored in metadata associated with the entity).
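By way of a non-limiting illustration, matching a text query against alternative representations stored in entity metadata might be sketched as follows. The metadata layout (a `name` field plus an optional `alternatives` list), the function names, and the whole-word matching rule are assumptions made for this sketch and are not specified by the disclosure.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def find_entities(text_query, entity_metadata):
    """Return IDs of entities whose name or alternative representation
    appears as a whole-word phrase in the text query.

    `entity_metadata` maps an entity ID to a dict with a "name" and an
    optional list of "alternatives" (assumed layout).
    """
    padded_query = f" {normalize(text_query)} "
    matches = []
    for entity_id, metadata in entity_metadata.items():
        candidates = [metadata["name"], *metadata.get("alternatives", [])]
        # Padding with spaces keeps "cruz" from matching inside a longer word.
        if any(f" {normalize(c)} " in padded_query for c in candidates):
            matches.append(entity_id)
    return matches
```

With metadata in which a hypothetical entity named "Tom Cruise" carries the alternative spelling "Tom Cruz", a text query containing either spelling would reach the same entity.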

いくつかの実施形態では、クエリアプリケーションは、ユーザプロファイル情報に基づいて、エンティティを識別する。例えば、クエリアプリケーションは、前の音声クエリからの既に識別されたエンティティに基づいて、エンティティを識別し得る。さらなる例では、クエリアプリケーションは、エンティティに関連付けられた人気情報に基づいて（例えば、複数のユーザに関する検索に基づいて）、エンティティを識別し得る。いくつかの実施形態では、クエリアプリケーションは、ユーザの選好に基づいて、エンティティを識別する。例えば、１つ以上のキーワードがユーザプロファイル情報の好ましいエンティティ名または識別子の代替表現に合致する場合、クエリアプリケーションは、そのエンティティを識別するか、または、そのエンティティにより重く重み付けし得る。 In some embodiments, the query application identifies an entity based on user profile information. For example, the query application may identify an entity based on already identified entities from a previous voice query. In a further example, the query application may identify an entity based on popularity information associated with the entity (e.g., based on searches across multiple users). In some embodiments, the query application identifies an entity based on user preferences. For example, the query application may identify or weight an entity more heavily if one or more keywords match an alternative representation of a preferred entity name or identifier in the user profile information.

いくつかの実施形態では、クエリアプリケーションは、複数のエンティティ（例えば、各エンティティに関して記憶されたメタデータを伴う）を識別することと、それぞれのメタデータをテキストクエリと比較することに基づいて、複数のエンティティの各それぞれのエンティティに関して、それぞれのスコアを決定することと、最大スコアを決定することによって、エンティティを選択することとによって、エンティティを識別する。スコアは、テキストクエリのキーワードとエンティティまたはコンテンツ項目に関連付けられたメタデータとの間で識別された合致の数に基づき得る。 In some embodiments, the query application identifies the entities by identifying a plurality of entities (e.g., with metadata stored for each entity), determining a respective score for each respective entity of the plurality of entities based on comparing the respective metadata to the text query, and selecting the entity by determining a maximum score. The score may be based on the number of matches identified between keywords of the text query and metadata associated with the entities or content items.
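By way of a non-limiting illustration, the score-and-select approach described above might be sketched as follows. Each candidate entity is scored by the number of query keywords that match its metadata tags, and the entity with the maximum score is selected. The additive `preference_boost` for entities found in user profile information is only one assumed way of weighting a preferred entity more heavily; all names here are hypothetical.

```python
def score_entity(keywords, tags):
    """Count the query keywords that match one of the entity's metadata tags."""
    tag_set = {tag.lower() for tag in tags}
    return sum(1 for keyword in keywords if keyword.lower() in tag_set)


def select_entity(keywords, entity_tags, preferred_ids=(), preference_boost=1):
    """Return (entity_id, score) for the highest-scoring entity, or None.

    `entity_tags` maps an entity ID to its list of metadata tags.
    Entities listed in `preferred_ids` receive an extra `preference_boost`
    when they match at least one keyword (assumed weighting scheme).
    """
    if not entity_tags:
        return None
    scores = {}
    for entity_id, tags in entity_tags.items():
        score = score_entity(keywords, tags)
        if score > 0 and entity_id in preferred_ids:
            score += preference_boost
        scores[entity_id] = score
    best_id = max(scores, key=scores.get)  # ties resolve to the first entity seen
    return best_id, scores[best_id]
```

A caller would typically treat a maximum score of zero as "no match" rather than as a valid selection.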

ステップ７１０では、クエリアプリケーションは、エンティティに関連付けられたコンテンツ項目を読み出す。いくつかの実施形態では、クエリアプリケーションは、コンテンツ項目を識別すること、コンテンツ項目をダウンロードすること、コンテンツ項目をストリーミングすること、表示のためにコンテンツ項目を生成すること、または、それらの組み合わせを行う。例えば、音声クエリは、「最近のＴｏｍＣｒｕｉｓｅの映画を見せて」を含み得、クエリアプリケーションは、ユーザがビデオコンテンツを視聴するために選択し得る映画「ＭｉｓｓｉｏｎＩｍｐｏｓｓｉｂｌｅ：Ｆａｌｌｏｕｔ」へのリンクを提供し得る。いくつかの実施形態では、クエリアプリケーションは、テキストクエリに合致するエンティティに関連付けられた複数のコンテンツを読み出し得る。例えば、クエリアプリケーションは、本開示に従って、複数のリンク、ビデオファイル、オーディオファイル、または他のコンテンツ、または識別されたコンテンツ項目のリストを読み出し得る。 At step 710, the query application retrieves a content item associated with the entity. In some embodiments, the query application identifies a content item, downloads a content item, streams a content item, generates a content item for display, or a combination thereof. For example, a voice query may include "show me the latest Tom Cruise movies" and the query application may provide a link to the movie "Mission Impossible: Fallout" from which the user may select to view the video content. In some embodiments, the query application may retrieve a number of content items associated with the entity that matches the text query. For example, the query application may retrieve a number of links, video files, audio files, or other content, or a list of identified content items in accordance with this disclosure.

図８は、本開示のいくつかの実施形態による、発音に基づいてエンティティに関するメタデータを生成するための例証的プロセス８００のフローチャートを示す。例えば、アプリケーションは、図４のユーザデバイス４００、図４のユーザ機器システム４０１、図５のユーザデバイス５５０、図５のネットワークデバイス５２０、任意の他の好適なデバイス、またはそれらの任意の組み合わせ等の任意の好適なハードウェア上に実装されたプロセス８００を実施し得る。さらなる例では、アプリケーションは、図５のアプリケーション５８０のインスタンスであり得る。さらなる例では、図３のシステム３００が、例証的プロセス８００を実施し得る。 FIG. 8 shows a flowchart of an illustrative process 800 for generating metadata for an entity based on pronunciation, according to some embodiments of the present disclosure. For example, an application may perform process 800 implemented on any suitable hardware, such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the application may be an instance of application 580 of FIG. 5. In a further example, system 300 of FIG. 3 may perform illustrative process 800.

ステップ８０２では、アプリケーションが、複数のエンティティのうちの情報が記憶されているエンティティを識別する。いくつかの実施形態では、アプリケーションは、所定の順序に基づいて、エンティティを選択する。例えば、アプリケーションは、エンティティをアルファベット順で選択し、プロセス８００の一部を実施し得る。いくつかの実施形態では、アプリケーションは、エンティティに関するメタデータが作成されると、エンティティを識別する。例えば、アプリケーションは、エンティティがデータベース（例えば、エンティティのデータベース）に追加されると、エンティティを識別し得る。いくつかの実施形態では、アプリケーションは、検索動作が、エンティティを誤識別し、故に、代替表現が、さらなる誤識別を防止するために所望され得るとき、エンティティを識別する。いくつかの実施形態では、アプリケーションは、ユーザ入力に基づいて、エンティティを識別する。例えば、ユーザは、アプリケーションに、正しくない検索結果、到達不能エンティティ、または検索結果内で観察されるエラーに基づいて、エンティティを示し得る（例えば、好適なユーザインターフェースにおいて）。いくつかの実施形態では、アプリケーションは、検索結果におけるエラーまたは所定の順序に応答してエンティティを識別する必要はない。例えば、アプリケーションは、エンティティデータベースのエンティティをランダムに選択し、ステップ８０４に進み得る。いくつかの実施形態では、アプリケーションは、検索クエリ内のエンティティの人気に基づいて、エンティティを識別し得る。例えば、より大きな検索有効性は、より多くの検索クエリが正しく応答されるように、より一般的エンティティに関する代替表現を決定することによって達成され得る。さらなる例では、アプリケーションは、あまり一般的ではない、またはさらに曖昧なエンティティを識別し、非常に少ない検索クエリがこれらのエンティティを規定し得るので、それらのエンティティの到達不能性を防止し得る。アプリケーションは、任意の好適な基準を適用し、識別すべきエンティティを決定し得る。いくつかの実施形態では、アプリケーションは、ステップ８０２において、２つ以上のエンティティを識別し得、故に、ステップ８０４－８１０は、各識別されたエンティティに関して実施され得る。いくつかの実施形態では、アプリケーションは、エンティティではなく、またはそれに加え、コンテンツ項目を識別し得る。例えば、アプリケーションは、映画等のエンティティを識別し、次いで、そのエンティティに関連付けられた全ての他の重要なエンティティを識別し、ステップ８０４－８１０を受けることもある。 In step 802, the application identifies an entity of a plurality of entities for which information is stored. In some embodiments, the application selects an entity based on a predefined order. For example, the application may select entities in alphabetical order to perform a portion of process 800. In some embodiments, the application identifies an entity when metadata about the entity is created. For example, the application may identify an entity when the entity is added to a database (e.g., a database of entities). In some embodiments, the application identifies an entity when a search operation misidentifies the entity and thus an alternative representation may be desirable to prevent further misidentification. In some embodiments, the application identifies an entity based on user input. 
For example, a user may indicate an entity to the application (e.g., in a suitable user interface) based on an incorrect search result, an unreachable entity, or an error observed in the search results. In some embodiments, the application need not identify an entity in response to an error in the search results or according to a predefined order. For example, the application may randomly select an entity of the entity database and proceed to step 804. In some embodiments, the application may identify an entity based on the popularity of the entity in search queries. For example, greater search effectiveness may be achieved by determining alternative representations for more popular entities, such that more search queries are answered correctly. In a further example, the application may identify less popular or more obscure entities to keep those entities from becoming unreachable, since very few search queries may specify them. The application may apply any suitable criteria to determine which entity to identify. In some embodiments, the application may identify more than one entity at step 802, and thus steps 804-810 may be performed for each identified entity. In some embodiments, the application may identify content items rather than, or in addition to, entities. For example, the application may identify an entity such as a movie, then identify all other significant entities associated with that entity, and subject them to steps 804-810.

ステップ８０４では、アプリケーションが、第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成する。第１のテキスト文字列は、ステップ８０２において識別されたエンティティを記述する。例えば、図３に図示されるように、アプリケーションは、テキスト→発話エンジン３１０を含み得、それは、オーディオファイルを生成するように構成され得る。アプリケーションは、マイクロホンまたは他の好適な検出デバイスによって検出され得るスピーカまたは他の好適な音生成デバイスから出力されたオーディオを生成し得る。アプリケーションは、オーディオファイルを生成および出力することにおいて１つ以上の設定または発話基準を適用し得る。例えば、生成された「音声」の側面は、任意の好適な基準に基づいて、調整または別様に選択され得る。いくつかの実施形態では、少なくとも１つの発話基準は、発音設定（例えば、１つ以上の音節、文字群、または単語が、発音される方法、または使用されるべき音素）を含む。いくつかの実施形態では、少なくとも１つの発話基準は、言語設定（例えば、言語、アクセント、地方アクセント、または他の言語情報を規定する）を含む。 In step 804, the application generates an audio file based on the first text string and at least one speech criterion. The first text string describes the entity identified in step 802. For example, as illustrated in FIG. 3, the application may include a text-to-speech engine 310, which may be configured to generate the audio file. The application may generate audio output from a speaker or other suitable sound generating device that may be detected by a microphone or other suitable detection device. The application may apply one or more settings or speech criteria in generating and outputting the audio file. For example, aspects of the generated "speech" may be adjusted or otherwise selected based on any suitable criteria. In some embodiments, the at least one speech criterion includes a pronunciation setting (e.g., the way one or more syllables, letter groups, or words are pronounced, or the phonemes to be used). In some embodiments, the at least one speech criterion includes a language setting (e.g., defining a language, accent, regional accent, or other linguistic information).

複数の発話基準を含む例証的例では、アプリケーションは、第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成し、それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成し、それぞれの第２のテキスト文字列を第１のテキスト文字列と比較し、第１のテキスト文字列と同一でない場合、それぞれの第２のテキスト文字列を記憶し得る（例えば、エンティティに関連付けられたメタデータ内に）。 In an illustrative example involving multiple speech criteria, the application may generate respective audio files based on the first text string and the respective speech criteria, generate respective second text strings based on the respective audio files, compare each second text string to the first text string, and store each second text string (e.g., in metadata associated with the entity) if it is not identical to the first text string.
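By way of a non-limiting illustration, the round trip over several speech criteria described above might be sketched as follows. The `text_to_speech` and `speech_to_text` arguments are hypothetical callables standing in for engines such as text-to-speech engine 310 and speech-to-text engine 320; their signatures are assumptions, and skipping repeated results is an added convenience not stated in the disclosure.

```python
def generate_alternatives(first_text, speech_criteria, text_to_speech, speech_to_text):
    """Return second text strings that differ from `first_text`.

    For each speech criterion, `first_text` is converted to audio and the
    audio is converted back to text. A result is kept only when it is not
    identical to `first_text`.

    text_to_speech(text, criterion) -> audio   (assumed signature)
    speech_to_text(audio) -> str               (assumed signature)
    """
    alternatives = []
    for criterion in speech_criteria:
        audio = text_to_speech(first_text, criterion)
        second_text = speech_to_text(audio)
        # Assumption: a repeated alternative is stored only once.
        if second_text != first_text and second_text not in alternatives:
            alternatives.append(second_text)
    return alternatives
```

The returned list could then be stored in the metadata associated with the entity, as described for step 810.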

例証的例では、アプリケーションは、第１のテキスト文字列を第１のオーディオ信号に変換し、オーディオ信号に基づいて、発話をスピーカにおいて生成し、マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成し、オーディオ信号を処理し、オーディオファイルを生成し得る。いくつかの実施形態では、アプリケーションは、テキスト→発話モジュールの少なくとも１つの発話設定に基づいて、発話をスピーカにおいて生成する。 In an illustrative example, the application may convert a first text string into a first audio signal, generate speech at a speaker based on the audio signal, detect the speech using a microphone, generate a second audio signal, process the audio signal, and generate an audio file. In some embodiments, the application generates the speech at the speaker based on at least one speech setting of the text-to-speech module.

ステップ８０６では、アプリケーションが、オーディオファイルに基づいて、第２のテキスト文字列を生成する。第２のテキスト文字列は、テキスト→発話変換、または発話→テキスト変換から生じ得る差異は別として、第１のテキスト文字列に合致し、ステップ８０２において識別されたエンティティを記述するべきである。例えば、図３に図示されるように、アプリケーションは、発話→テキストエンジン３２０を含み得、それは、オーディオ入力またはその生成されたファイルを受信し、オーディオを書き起こし記録（例えば、テキスト文字列）に転換するように構成され得る。アプリケーションは、オーディオ入力をマイクロホンまたは他の好適な音検出デバイスにおいて受信し得る。アプリケーションは、オーディオファイルを受信し、調整し、テキストに変換することにおいて１つ以上の設定を適用し得る。例えば、検出された「音声」を調整および転換する側面は、任意の好適な基準に基づいて、調整または別様に選択され得る。 In step 806, the application generates a second text string based on the audio file. The second text string should match the first text string and describe the entity identified in step 802, aside from differences that may result from text-to-speech or speech-to-text conversion. For example, as illustrated in FIG. 3, the application may include a speech-to-text engine 320, which may be configured to receive the audio input or the generated file thereof and convert the audio into a transcript (e.g., a text string). The application may receive the audio input at a microphone or other suitable sound detection device. The application may apply one or more settings in receiving, tuning, and converting the audio file to text. For example, aspects of tuning and converting the detected "speech" may be tuned or otherwise selected based on any suitable criteria.

例証的例では、アプリケーションは、オーディオファイルの再生をスピーカにおいて生成し、マイクロホンを使用して、再生を検出し、オーディオ信号を生成し、１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換する。いくつかの実施形態では、アプリケーションは、発話→テキストモジュールの少なくとも１つのテキスト設定に基づいて、オーディオ信号を第２のテキスト文字列に変換する。 In an illustrative example, the application generates a playback of an audio file on a speaker and uses a microphone to detect the playback, generate an audio signal, and convert the audio signal to a second text string by identifying one or more words. In some embodiments, the application converts the audio signal to the second text string based on at least one text setting of the speech-to-text module.

ステップ８０８では、アプリケーションが、第２のテキスト文字列を第１のテキスト文字列と比較する。いくつかの実施形態では、アプリケーションは、第１および第２のテキスト文字列の各文字を比較し、合致を決定する。いくつかの実施形態では、アプリケーションは、第１のテキスト文字列および第２のテキスト文字列が合致する程度（例えば、合致するテキスト文字列の割合、存在する相違の数、合致するか、または、合致しない、キーワードの数）を決定する。アプリケーションは、任意の好適な技法を使用して、第１および第２のテキスト文字列が、同一であるか、類似するか、または、異なるかと、それらが類似または異なる程度とを決定し得る。 In step 808, the application compares the second text string to the first text string. In some embodiments, the application compares each character of the first and second text strings to determine a match. In some embodiments, the application determines the extent to which the first and second text strings match (e.g., the percentage of text strings that match, the number of differences that exist, the number of keywords that match or do not match). The application may use any suitable technique to determine whether the first and second text strings are identical, similar, or different, and the extent to which they are similar or different.
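By way of a non-limiting illustration, the comparison of step 808 might be sketched as follows using the Python standard library. The use of `difflib.SequenceMatcher` for a character-level similarity ratio and of set operations for counting matching and differing words are assumed choices; the disclosure leaves the technique open.

```python
from difflib import SequenceMatcher


def compare_strings(first_text, second_text):
    """Summarize how closely two text strings match.

    Returns a dict with:
      identical       - True when the strings are exactly equal
      similarity      - character-level ratio in [0, 1] from SequenceMatcher
      matching_words  - number of distinct words present in both strings
      differing_words - number of distinct words present in only one string
    """
    first_words = set(first_text.split())
    second_words = set(second_text.split())
    return {
        "identical": first_text == second_text,
        "similarity": SequenceMatcher(None, first_text, second_text).ratio(),
        "matching_words": len(first_words & second_words),
        "differing_words": len(first_words ^ second_words),
    }
```

A threshold on `similarity` could be used to label two strings as similar rather than different, with the threshold value left as a tuning choice.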

ステップ８１０では、アプリケーションが、第１のテキスト文字列と同一でない場合、第２のテキスト文字列を記憶する。いくつかの実施形態では、アプリケーションは、第２のテキスト文字列をエンティティに関連付けられたメタデータに記憶する。いくつかの実施形態では、ステップ８１０は、アプリケーションが、１つ以上のテキストクエリに基づいて、既存のメタデータを更新することを含む。例えば、クエリが、応答され、検索結果が、評価されると、アプリケーションは、メタデータを更新し、新しい学習を反映させ得る。第２のテキスト文字列が、第１のテキスト文字列と同一であると決定された場合、新しい情報は、第２のテキスト文字列を記憶することによって得られない。しかしながら、ステップ８０８の比較の指示は、メタデータに記憶され、音声クエリを介したエンティティの到達可能性における信頼度を増加させ得る。例えば、第２のテキスト文字列が、第１のテキスト文字列と同一である場合、それは、音声ベースのクエリに関する既存のメタデータを検証する役割を果たし得る。 In step 810, the application stores the second text string if it is not identical to the first text string. In some embodiments, the application stores the second text string in metadata associated with the entity. In some embodiments, step 810 includes the application updating existing metadata based on one or more text queries. For example, once the query is answered and the search results are evaluated, the application may update the metadata to reflect new learnings. If the second text string is determined to be identical to the first text string, no new information is gained by storing the second text string. However, an indication of the comparison of step 808 may be stored in the metadata to increase confidence in the reachability of the entity via the voice query. For example, if the second text string is identical to the first text string, it may serve to validate the existing metadata for the voice-based query.
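By way of a non-limiting illustration, the storage behavior of step 810 might be sketched as follows. A differing second text string is stored as an alternative representation, while an identical one is recorded only as an indication that the round trip reproduced the original text. The metadata keys `alternatives` and `round_trip_verified` are hypothetical names chosen for this sketch.

```python
def record_round_trip(metadata, first_text, second_text):
    """Update an entity's metadata dict with the result of one round trip.

    A differing `second_text` is appended to metadata["alternatives"].
    An identical `second_text` adds no new representation, so only a
    verification flag is stored (assumed key name).
    """
    if second_text != first_text:
        alternatives = metadata.setdefault("alternatives", [])
        if second_text not in alternatives:
            alternatives.append(second_text)
    else:
        metadata["round_trip_verified"] = True
    return metadata
```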

図９は、本開示のいくつかの実施形態による、音声クエリのエンティティに関連付けられたコンテンツを読み出すための例証的プロセス９００のフローチャートを示す。例えば、クエリアプリケーションは、図４のユーザデバイス４００、図４のユーザ機器システム４０１、図５のユーザデバイス５５０、図５のネットワークデバイス５２０、任意の他の好適なデバイス、またはそれらの任意の組み合わせ等の任意の好適なハードウェア上に実装されたプロセス９００を実施し得る。さらなる例では、クエリアプリケーションは、図５のアプリケーション５６０のインスタンスであり得る。 FIG. 9 shows a flowchart of an illustrative process 900 for retrieving content associated with an entity of a voice query, according to some embodiments of the present disclosure. For example, the query application may perform process 900 implemented on any suitable hardware, such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the query application may be an instance of application 560 of FIG. 5.

ステップ９０２では、クエリアプリケーションが、オーディオ信号をオーディオインターフェースにおいて受信する。システムは、マイクロホンまたは他のオーディオ検出デバイスを含み得、デバイスに入力されるオーディオに基づいて、オーディオファイルを記録し得る。 In step 902, the query application receives an audio signal at an audio interface. The system may include a microphone or other audio detection device and may record an audio file based on the audio input to the device.

ステップ９０４では、クエリアプリケーションが、ステップ９０２のオーディオ信号を解析し、発話を識別する。クエリアプリケーションは、任意の好適なデシメーション、調整（例えば、増幅、フィルタリング）、処理（例えば、時間またはスペクトルドメインにおいて）、パターン認識、アルゴリズム、転換、任意の他の好適なアクション、またはそれらの任意の組み合わせを適用し得る。いくつかの実施形態では、クエリアプリケーションは、任意の好適な技法を使用して、単語、音、語句、またはそれらの組み合わせを識別する。 In step 904, the query application parses the audio signal of step 902 to identify speech. The query application may apply any suitable decimation, conditioning (e.g., amplification, filtering), processing (e.g., in the time or spectral domain), pattern recognition, algorithms, transformations, any other suitable actions, or any combination thereof. In some embodiments, the query application uses any suitable technique to identify words, sounds, phrases, or combinations thereof.

ステップ９０６では、クエリアプリケーションが、音声クエリが受信されたかどうかを決定する。いくつかの実施形態では、クエリアプリケーションは、オーディオ信号のパラメータに基づいて、音声クエリが受信されたことを決定する。例えば、クエリ前後の発話を伴わない期間は、記録内の音声クエリの範囲を区切り得る。いくつかの実施形態では、クエリアプリケーションは、キーワードを発話された順序で識別し、文またはクエリテンプレートをキーワードに適用し、テキストクエリを抽出する。例えば、名詞、固有名詞、動詞、形容詞、副詞、および発話の他の部分の配置は、音声クエリの開始および終了の指示を提供し得る。クエリアプリケーションは、オーディオ信号を解析する際、任意の好適な基準を適用し、テキストを抽出し得る。ステップ９０８では、クエリアプリケーションは、ステップ９０４および９０６の結果に基づいて、テキストクエリを生成する。いくつかの実施形態では、ステップ９０８において、クエリアプリケーションは、テキストクエリを好適な記憶装置（例えば、記憶装置４０８）に記憶し得る。ステップ９０６において、クエリアプリケーションが、音声クエリが受信されていない、または別様に、テキストクエリが、ステップ９０４の解析されるオーディオに基づいて生成されることができないことを決定する場合、クエリアプリケーションは、ステップ９０２に戻り、音声クエリが受信されるまで、オーディオを検出するステップに進み得る。 In step 906, the query application determines whether a voice query has been received. In some embodiments, the query application determines that a voice query has been received based on parameters of the audio signal. For example, periods of no speech before and after the query may delimit the scope of the voice query in the recording. In some embodiments, the query application identifies keywords in the order in which they were spoken, applies sentence or query templates to the keywords, and extracts a text query. For example, the placement of nouns, proper nouns, verbs, adjectives, adverbs, and other parts of speech may provide indications of the beginning and end of a voice query. The query application may apply any suitable criteria in analyzing the audio signal to extract text. In step 908, the query application generates a text query based on the results of steps 904 and 906. In some embodiments, in step 908, the query application may store the text query in suitable storage (e.g., storage 408). If, in step 906, the query application determines that a voice query has not been received or otherwise that a text query cannot be generated based on the analyzed audio of step 904, the query application may return to step 902 and proceed with detecting audio until a voice query is received.
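By way of a non-limiting illustration, using periods without speech to delimit a voice query in a recording might be sketched as follows. The sketch assumes the audio has already been reduced to one energy value per frame; the threshold, the minimum length of the trailing quiet period, and the function name are hypothetical. For simplicity, only the trailing quiet period is required, and the query is taken to start at the first frame containing speech.

```python
def find_query_span(frame_energies, threshold=0.1, min_silence_frames=3):
    """Return (start, end) frame indices of a silence-delimited voice query.

    `start` is the first frame whose energy reaches `threshold`; `end`
    (exclusive) is the frame after the last speech frame that is followed
    by at least `min_silence_frames` consecutive quiet frames. Returns
    None when no complete query is found, in which case a caller might
    keep listening (compare the return to step 902).
    """
    start = None
    quiet_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = index
            quiet_run = 0  # speech resets the trailing-silence counter
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                return start, index - quiet_run + 1
    return None
```

Short pauses inside the query, such as a single quiet frame between words, do not end the span because they are shorter than `min_silence_frames`.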

ステップ９１０では、クエリアプリケーションが、エンティティ情報に関するデータベースにアクセスする。クエリアプリケーションは、ステップ９０８のテキストクエリを使用して、データベースの情報の中を検索する。クエリアプリケーションは、任意の好適な検索アルゴリズムを適用し、データベースの情報、エンティティ、またはコンテンツを識別し得る。 In step 910, the query application accesses a database of entity information. The query application uses the text query of step 908 to search among the information in the database. The query application may apply any suitable search algorithm to identify information, entities, or content in the database.

ステップ９１２では、クエリアプリケーションが、ステップ９１０のデータベースのエンティティがステップ９０８のテキストクエリに合致するかどうかを決定する。クエリアプリケーションは、複数のエンティティを識別および評価し、合致を見出し得る。いくつかの実施形態では、テキストクエリは、２つ以上のエンティティを含み、クエリアプリケーションは、コンテンツの中を検索し、メタデータ内に関連付けられたエンティティを有するコンテンツ項目を決定する（例えば、テキストクエリとコンテンツ項目のメタデータタグを比較することによって）。いくつかの状況では、クエリアプリケーションは、合致を識別することが不可能であり得、それに応答して、検索を継続すること、別のデータベースの中を検索すること、テキストクエリを修正すること（例えば、ステップ９０８に戻る（図示せず））、ステップ９０４に戻り、ステップ９０４において使用される設定を修正すること（図示せず）、検索結果が見出されなかったことの指示を返すこと、任意の他の好適な応答を行うこと、または、それらの任意の組み合わせを行い得る。いくつかの実施形態では、クエリアプリケーションは、テキストクエリに合致する複数のエンティティ、コンテンツ、または両方を識別し得る。ステップ９１４は、クエリアプリケーションが、ステップ９０８のテキストクエリに関連付けられたコンテンツを識別することを含む。いくつかの実施形態では、ステップ９１４および９１０は、逆転され得、クエリアプリケーションは、テキストクエリに基づいて、コンテンツの中を検索し得る。いくつかの実施形態では、エンティティは、コンテンツ識別子を含み得、故に、ステップ９１０および９１４は、組み合わせられ得る。 In step 912, the query application determines whether an entity in the database of step 910 matches the text query of step 908. The query application may identify and evaluate multiple entities to find a match. In some embodiments, the text query includes two or more entities, and the query application searches through the content to determine content items that have the entity associated in their metadata (e.g., by comparing the metadata tags of the content items with the text query). In some circumstances, the query application may be unable to identify a match, and in response may continue the search, search in another database, modify the text query (e.g., return to step 908 (not shown)), return to step 904 and modify the settings used in step 904 (not shown), return an indication that no search results were found, provide any other suitable response, or any combination thereof. In some embodiments, the query application may identify multiple entities, content, or both that match the text query. Step 914 includes the query application identifying content associated with the text query of step 908. In some embodiments, steps 914 and 910 may be reversed and the query application may search within the content based on a text query. 
In some embodiments, the entity may include a content identifier, and thus steps 910 and 914 may be combined.
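
The match decision of step 912 might be sketched as follows for the case where several entities are candidates. This sketch assumes that each entity carries a list of text tags (including pronunciation-derived spellings) and that string similarity from Python's standard `difflib` module serves as the score. Neither the scoring function nor the `threshold` value is specified by the disclosure, and the names `score` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher


def score(text_query, tags):
    """Best similarity (0.0 to 1.0) between the query and any of the tags."""
    query = text_query.lower()
    return max(
        (SequenceMatcher(None, query, tag.lower()).ratio() for tag in tags),
        default=0.0,
    )


def best_match(text_query, entities, threshold=0.6):
    """Return the name of the highest-scoring entity, or None if no match.

    `entities` is a hypothetical dict: entity name -> list of tags.
    `threshold` is an assumed cut-off below which no match is reported.
    """
    best_name, best_score = None, 0.0
    for name, tags in entities.items():
        current = score(text_query, tags)
        if current > best_score:
            best_name, best_score = name, current
    return best_name if best_score >= threshold else None
```

A `None` result corresponds to the situation in which no match is identified; a caller could then search another database, modify the text query, or return an indication that no search results were found.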

ステップ９１６では、クエリアプリケーションが、ステップ９０８のテキストクエリに関連付けられたコンテンツを読み出す。ステップ９１６では、例えば、クエリアプリケーションが、コンテンツ項目を識別すること、コンテンツ項目をダウンロードすること、コンテンツ項目をストリーミングすること、表示のためにコンテンツ項目またはコンテンツ項目のリスト（例えば、またはコンテンツ項目へのリンクのリスト）を生成すること、または、それらの組み合わせを行い得る。 In step 916, the query application retrieves the content associated with the text query of step 908. For example, the query application may identify content items, download content items, stream content items, generate for display content items or a list of content items (or, e.g., a list of links to the content items), or any combination thereof.
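
As a rough illustration of step 916, the sketch below collects the content items associated with the matched entities and builds a list of links for display. The data layout (`content_ids`, `title`, `link`) and the function names are assumptions made for this example only; downloading or streaming is not shown.

```python
def retrieve_content(matched_entities, catalog):
    """Collect catalog items referenced by the matched entities, without repeats.

    `matched_entities` is a hypothetical list of dicts with a "content_ids" list;
    `catalog` is a hypothetical dict: content id -> {"title": str, "link": str}.
    """
    items, seen = [], set()
    for entity in matched_entities:
        for content_id in entity["content_ids"]:
            # Skip ids already collected and ids missing from the catalog.
            if content_id in seen or content_id not in catalog:
                continue
            seen.add(content_id)
            items.append(catalog[content_id])
    return items


def links_for_display(items):
    """Format each retrieved item as a 'title: link' line for display."""
    return [f"{item['title']}: {item['link']}" for item in items]
```

Because an entity record may itself hold content identifiers, this sketch also reflects the variant in which identifying the entity and identifying the content are combined.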

本開示の上記に説明される実施形態は、限定ではなく、例証の目的のために提示され、本開示は、以下に続く請求項のみによって限定される。さらに、いずれか１つの実施形態に説明される特徴および限界が、本明細書の任意の他の実施形態に適用され得、一実施形態に関するフローチャートまたは例が、好適な様式で任意の他の実施形態と組み合わせられること、異なる順序で行われること、または並行して行われ得ることに留意されたい。加えて、本明細書に説明されるシステムおよび方法は、リアルタイムで実施され得る。上記に説明されるシステムおよび／または方法が他のシステムおよび／または方法に適用される、またはそれに従って使用され得ることにも留意されたい。
本明細書は、限定ではないが、以下を含む実施形態を開示する：
（項目１）音声クエリに応答する方法であって、方法は、
音声クエリをオーディオインターフェースにおいて受信することと、
制御回路を使用して、１つ以上のキーワードを音声クエリから抽出することと、
制御回路を使用して、１つ以上のキーワードに関する発音情報を決定することと、
制御回路を使用して、１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別することであって、メタデータは、発音タグを備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと、
を含む、方法。
（項目２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目１に記載の方法。
（項目３）エンティティを識別することは、ユーザプロファイル情報にさらに基づく、項目１に記載の方法。
（項目４）エンティティを識別することは、前の音声クエリからの以前に識別されたエンティティに基づく、項目３に記載の方法。
（項目５）エンティティを識別することは、エンティティに関連付けられた人気情報にさらに基づく、項目１に記載の方法。
（項目６）エンティティを識別することは、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと、
を含む、項目１に記載の方法。
（項目７）エンティティは、第１のエンティティであり、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別することをさらに含み、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目１に記載の方法。
（項目８）データベースの複数のエンティティの中のエンティティを識別することは、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別することを含む、項目１に記載の方法。
（項目９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目１に記載の方法。
（項目１０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目１に記載の方法。
（項目１１）音声クエリに応答するためのシステムであって、システムは、
音声クエリを受信するためのオーディオインターフェースと、
オーディオインターフェースに結合された制御回路と
を備え、
制御回路は、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに関する発音情報を決定することと、
１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別することであって、メタデータは、発音タグを備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を行うように構成されている、システム。
（項目１２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目１１に記載のシステム。
（項目１３）制御回路は、ユーザプロファイル情報に基づいて、エンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１４）制御回路は、前の音声クエリから以前に識別されたエンティティに基づいて、エンティティを識別するようにさらに構成されている、項目１３に記載のシステム。
（項目１５）制御回路は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１６）制御回路は、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
によって、エンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１７）エンティティは、第１のエンティティであり、制御回路は、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別するようにさらに構成され、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目１１に記載のシステム。
（項目１８）制御回路は、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別することによって、データベースの複数のエンティティの中のエンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目１１に記載のシステム。
（項目２０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目１１に記載のシステム。
（項目２１）エンコーディングされた命令を有する非一過性コンピュータ読み取り可能な媒体であって、命令は、制御回路によって実行されると、
音声クエリをオーディオインターフェースにおいて受信することと、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに関する発音情報を決定することと、
１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別することであって、メタデータは、発音タグを備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を制御回路に行わせる、非一過性コンピュータ読み取り可能な媒体。
（項目２２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２３）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路にユーザプロファイル情報に基づいてエンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２４）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、前の音声クエリからの以前に識別されたエンティティに基づいて、エンティティを識別させる、項目２３に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２５）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、エンティティに関連付けられた人気情報に基づいて、エンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２６）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
によって、制御回路にエンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２７）エンティティは、第１のエンティティであり、エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別させ、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２８）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別することによって、データベースの複数のエンティティの中のエンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目３０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目３１）音声クエリに応答するためのシステムであって、システムは、
音声クエリを受信する手段と、
１つ以上のキーワードを音声クエリから抽出する手段と、
１つ以上のキーワードに関する発音情報を決定する手段と、
１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成する手段と、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別する手段であって、メタデータは、発音タグを備えている、手段と、
エンティティに関連付けられたコンテンツ項目を読み出すための手段と
を備えている、システム。
（項目３２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目３１に記載のシステム。
（項目３３）エンティティを識別する手段は、ユーザプロファイル情報に基づいて、エンティティを識別する手段を備えている、項目３１に記載のシステム。
（項目３４）エンティティを識別する手段は、前の音声クエリからの以前に識別されたエンティティに基づいて、エンティティを識別する手段を備えている、項目３３に記載のシステム。
（項目３５）エンティティを識別する手段は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別する手段を備えている、項目３１に記載のシステム。
（項目３６）エンティティを識別する手段は、
複数のエンティティを識別する手段であって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、手段と、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定する手段と、
最大スコアを決定することによって、エンティティを選択する手段と
を備えている、項目３１に記載のシステム。
（項目３７）エンティティは、第１のエンティティであり、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別する手段をさらに備え、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目３１に記載のシステム。
（項目３８）データベースの複数のエンティティの中のエンティティを識別する手段は、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別する手段を備えている、項目３１に記載のシステム。
（項目３９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目３１に記載のシステム。
（項目４０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目３１に記載のシステム。
（項目４１）音声クエリに応答する方法であって、方法は、
音声クエリをオーディオインターフェースにおいて受信することと、
制御回路を使用して、１つ以上のキーワードを音声クエリから抽出することと、
制御回路を使用して、１つ以上のキーワードに関する発音情報を決定することと、
制御回路を使用して、１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別することであって、メタデータは、発音タグを備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を含む、方法。
（項目４２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目４１に記載の方法。
（項目４３）エンティティを識別することは、ユーザプロファイル情報にさらに基づく、項目４１－４２のいずれかに記載の方法。
（項目４４）エンティティを識別することは、前の音声クエリからの以前に識別されたエンティティに基づく、項目４１－４３のいずれかに記載の方法。
（項目４５）エンティティを識別することは、エンティティに関連付けられた人気情報にさらに基づく、項目４１－４４のいずれかに記載の方法。
（項目４６）エンティティを識別することは、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
を含む、項目４１－４５のいずれかに記載の方法。
（項目４７）エンティティは、第１のエンティティであり、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別することをさらに含み、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目４１－４６のいずれかに記載の方法。
（項目４８）データベースの複数のエンティティの中のエンティティを識別することは、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別することを含む、項目４１－４７のいずれかに記載の方法。
（項目４９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目４１－４８のいずれかに記載の方法。
（項目５０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目４１－４９のいずれかに記載の方法。
（項目５１）音声クエリに応答する方法であって、方法は、
音声クエリをオーディオインターフェースにおいて受信することと、
制御回路を使用して、１つ以上のキーワードを音声クエリから抽出することと、
制御回路を使用して、１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を含む、方法。
（項目５２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目５１に記載の方法。
（項目５３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目５１に記載の方法。
（項目５４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目５１に記載の方法。
（項目５５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、複数の代替テキスト表現のうちの各代替テキスト表現は、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって生成される、項目５１に記載の方法。
（項目５６）エンティティを識別することは、ユーザプロファイル情報にさらに基づく、項目５１に記載の方法。
（項目５７）エンティティを識別することは、エンティティに関連付けられた人気情報にさらに基づく、項目５１に記載の方法。
（項目５８）エンティティを識別することは、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
を含む、項目５１に記載の方法。
（項目５９）複数のテキストクエリを生成することをさらに含み、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目５１に記載の方法。
（項目６０）
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
をさらに含む、項目５９に記載の方法。
（項目６１）音声クエリに応答するためのシステムであって、システムは、
音声クエリを受信するためのオーディオインターフェースと、
制御回路と
を備え、
制御回路は、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を行うように構成されている、システム。
（項目６２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目６１に記載のシステム。
（項目６３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目６１に記載のシステム。
（項目６４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目６１に記載のシステム。
（項目６５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、制御回路は、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって、複数の代替テキスト表現のうちの各代替テキスト表現を生成するように構成されている、項目６１に記載のシステム。
（項目６６）制御回路は、ユーザプロファイル情報に基づいて、エンティティを識別するようにさらに構成されている、項目６１に記載のシステム。
（項目６７）制御回路は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別するようにさらに構成されている、項目６１に記載のシステム。
（項目６８）制御回路は、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
によって、エンティティを識別するようにさらに構成されている、項目６１に記載のシステム。
（項目６９）制御回路は、複数のテキストクエリを生成するようにさらに構成され、複数のテキストクエリは、テキストクエリを備え、制御回路は、発話→テキストモジュールを備え、複数のテキストクエリのうちの各テキストクエリは、発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目６１に記載のシステム。
（項目７０）制御回路は、
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
を行うようにさらに構成されている、項目６９に記載のシステム。
（項目７１）エンコーディングされた命令を有する非一過性コンピュータ読み取り可能な媒体であって、命令は、制御回路によって実行されると、
音声クエリをオーディオインターフェースにおいて受信することと、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を制御回路に行わせる、非一過性コンピュータ読み取り可能な媒体。
（項目７２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって、複数の代替テキスト表現のうちの各代替テキスト表現を生成させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７６）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、ユーザプロファイル情報に基づいて、エンティティを識別させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７７）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、エンティティに関連付けられた人気情報に基づいて、エンティティを識別させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７８）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと、
によって、制御回路にエンティティを識別させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７９）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、複数のテキストクエリを生成させ、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目８０）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
を制御回路に行わせる、項目７９に記載の非一過性コンピュータ読み取り可能な媒体。
（項目８１）音声クエリに応答するためのシステムであって、システムは、
音声クエリをオーディオインターフェースにおいて受信する手段と、
１つ以上のキーワードを音声クエリから抽出する手段と、
１つ以上のキーワードに基づいて、テキストクエリを生成する手段と、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別する手段であって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、手段と、
エンティティに関連付けられたコンテンツ項目を読み出すための手段と
を備えている、システム。
（項目８２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目８１に記載のシステム。
（項目８３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目８１に記載のシステム。
（項目８４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目８１に記載のシステム。
（項目８５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、複数の代替テキスト表現のうちの各代替テキスト表現は、
第１のテキスト表現をオーディオファイルに変換する手段と、
オーディオファイルを第２のテキスト表現に変換する手段であって、第２のテキスト表現は、第１のテキスト表現と同一ではない、手段と
によって生成される、項目８１に記載のシステム。
（項目８６）エンティティを識別する手段は、ユーザプロファイル情報に基づいて、エンティティを識別する手段をさらに備えている、項目８１に記載のシステム。
（項目８７）エンティティを識別する手段は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別する手段をさらに備えている、項目８１に記載のシステム。
（項目８８）エンティティを識別する手段は、
複数のエンティティを識別する手段であって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、手段と、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定する手段と、
最大スコアを決定することによって、エンティティを選択する手段と
を備えている、項目８１に記載のシステム。
（項目８９）複数のテキストクエリを生成する手段をさらに備え、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目８１に記載のシステム。
（項目９０）
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別する手段と、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定する手段と、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別する手段と
をさらに備えている、項目８９に記載のシステム。
（項目９１）音声クエリに応答する方法であって、方法は、
音声クエリをオーディオインターフェースにおいて受信することと、
制御回路を使用して、１つ以上のキーワードを音声クエリから抽出することと、
制御回路を使用して、１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を含む、方法。
（項目９２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目９１に記載の方法。
（項目９３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目９１－９２のいずれかに記載の方法。
（項目９４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目９１－９３のいずれかに記載の方法。
（項目９５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、複数の代替テキスト表現のうちの各代替テキスト表現は、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって生成される、項目９１－９４のいずれかに記載の方法。
（項目９６）エンティティを識別することは、ユーザプロファイル情報にさらに基づく、項目９１－９５のいずれかに記載の方法。
（項目９７）エンティティを識別することは、エンティティに関連付けられた人気情報にさらに基づく、項目９１－９６のいずれかに記載の方法。
（項目９８）エンティティを識別することは、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
を含む、項目９１－９７のいずれかに記載の方法。
（項目９９）複数のテキストクエリを生成することをさらに含み、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目９１－９８のいずれかに記載の方法。
（項目１００）
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
をさらに含む、項目９９に記載の方法。
（項目１０１）音声クエリに関するエンティティメタデータを生成する方法であって、方法は、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
テキスト→発話モジュールを使用して、第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
発話→テキストモジュールを使用して、オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を含む、方法。
（項目１０２）少なくとも１つの発話基準は、発音設定を備えている、項目１０１に記載の方法。
（項目１０３）少なくとも１つの発話基準は、言語設定を備えている、項目１０１に記載の方法。
（項目１０４）少なくとも１つの発話基準は、複数の発話基準を備え、方法は、
テキスト→発話モジュールを使用して、第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
発話→テキストモジュールを使用して、それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
をさらに含む、項目１０１に記載の方法。
（項目１０５）１つ以上のテキストクエリに基づいて、メタデータを更新することをさらに含む、項目１０１に記載の方法。
（項目１０６）エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶することをさらに含む、項目１０１に記載の方法。
（項目１０７）第１のテキスト文字列に基づいて、オーディオファイルを生成することは、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
を含む、項目１０１に記載の方法。
（項目１０８）発話をスピーカにおいて生成することは、テキスト→発話モジュールの少なくとも１つの発話設定にさらに基づく、項目１０７に記載の方法。
（項目１０９）オーディオファイルに基づいて、第２のテキスト文字列を生成することは、
オーディオファイルの再生をスピーカにおいて生成することと、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
を含む、項目１０１に記載の方法。
（項目１１０）オーディオ信号を第２のテキスト文字列に変換することは、発話→テキストモジュールの少なくとも１つのテキスト設定に基づく、項目１０９に記載の方法。
（項目１１１）音声クエリに関するエンティティメタデータを生成するためのシステムであって、システムは、制御回路を備え、
制御回路は、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
制御回路に結合されたオーディオインターフェースを使用して、第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
オーディオインターフェースを使用して、オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を行うように構成されている、システム。
（項目１１２）少なくとも１つの発話基準は、発音設定を備えている、項目１１１に記載のシステム。
（項目１１３）少なくとも１つの発話基準は、言語設定を備えている、項目１１１に記載のシステム。
（項目１１４）少なくとも１つの発話基準は、複数の発話基準を備え、制御回路は、
オーディオ機器を使用して、第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
オーディオ機器を使用して、それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
を行うようにさらに構成されている、項目１１１に記載のシステム。
（項目１１５）制御回路は、１つ以上のテキストクエリに基づいて、メタデータを更新するようにさらに構成されている、項目１１１に記載のシステム。
（項目１１６）制御回路は、エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶するようにさらに構成されている、項目１１１に記載のシステム。
（項目１１７）オーディオ機器は、スピーカとマイクロホンとを備え、制御回路は、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
によって、第１のテキスト文字列に基づいて、オーディオファイルを生成するようにさらに構成されている、項目１１１に記載のシステム。
（項目１１８）制御回路は、少なくとも１つの発話設定に基づいて、発話をスピーカにおいて生成するようにさらに構成されている、項目１１７に記載のシステム。
（項目１１９）オーディオ機器は、スピーカとマイクロホンとを備え、制御回路は、
オーディオファイルの再生をスピーカにおいて生成することと、
再生をマイクロホンにおいて検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
によって、オーディオファイルに基づいて、第２のテキスト文字列を生成するようにさらに構成されている、項目１１１に記載のシステム。
（項目１２０）制御回路は、発話→テキストモジュールの少なくとも１つのテキスト設定に基づいて、オーディオ信号を第２のテキスト文字列に変換するようにさらに構成されている、項目１１９に記載のシステム。
（項目１２１）エンコーディングされた命令を有する非一過性コンピュータ読み取り可能な媒体であって、命令は、制御回路によって実行されると、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を制御回路に行わせる、非一過性コンピュータ読み取り可能な媒体。
（項目１２２）少なくとも１つの発話基準は、発音設定を備えている、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２３）少なくとも１つの発話基準は、言語設定を備えている、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２４）少なくとも１つの発話基準は、複数の発話基準を備え、エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
を制御回路に行わせる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２５）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、１つ以上のテキストクエリに基づいて、メタデータを更新させる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２６）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶させる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２７）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
を制御回路に行わせる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２８）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、テキスト→発話モジュールの少なくとも１つの発話設定に基づいて、発話をスピーカにおいて生成させる、項目１２７に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２９）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
オーディオファイルの再生をスピーカにおいて生成することと、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
を制御回路に行わせる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１３０）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、発話→テキストモジュールの少なくとも１つのテキスト設定に基づいて、オーディオ信号を第２のテキスト文字列に変換させる、項目１２９に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１３１）音声クエリに関するエンティティメタデータを生成するためのシステムであって、システムは、
複数のエンティティのうちの情報が記憶されているエンティティを識別する手段と、
第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成する手段であって、第１のテキスト文字列は、エンティティを記述する、手段と、
オーディオファイルに基づいて、第２のテキスト文字列を生成する手段と、
第２のテキスト文字列を第１のテキスト文字列と比較する手段と、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶する手段と
を備えている、システム。
（項目１３２）少なくとも１つの発話基準は、発音設定を備えている、項目１３１に記載のシステム。
（項目１３３）少なくとも１つの発話基準は、言語設定を備えている、項目１３１に記載のシステム。
（項目１３４）少なくとも１つの発話基準は、複数の発話基準を備え、システムは、
第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成する手段と、
それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成する手段と、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較する手段と、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶する手段と
をさらに備えている、項目１３１に記載のシステム。
（項目１３５）１つ以上のテキストクエリに基づいて、メタデータを更新する手段をさらに備えている、項目１３１に記載のシステム。
（項目１３６）エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶する手段をさらに備えている、項目１３１に記載のシステム。
（項目１３７）第１のテキスト文字列に基づいて、オーディオファイルを生成する手段は、
第１のテキスト文字列を第１のオーディオ信号に変換する手段と、
オーディオ信号に基づいて、発話をスピーカにおいて生成する手段と、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成する手段と、
オーディオ信号を処理し、オーディオファイルを生成する手段と
を備えている、項目１３１に記載のシステム。
（項目１３８）発話をスピーカにおいて生成する手段は、テキスト→発話モジュールの少なくとも１つの発話設定に基づいて、発話をスピーカにおいて生成する手段をさらに備えている、項目１３７に記載のシステム。
（項目１３９）オーディオファイルに基づいて、第２のテキスト文字列を生成する手段は、
オーディオファイルの再生をスピーカにおいて生成する手段と、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成する手段と、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換する手段と
を含む、項目１３１に記載のシステム。
（項目１４０）オーディオ信号を第２のテキスト文字列に変換する手段は、発話→テキストモジュールの少なくとも１つのテキスト設定に基づいて、オーディオ信号を第２のテキスト文字列に変換する手段を備えている、項目１３９に記載のシステム。
（項目１４１）音声クエリのためのエンティティメタデータを生成する方法であって、方法は、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
テキスト→発話モジュールを使用して、第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
発話→テキストモジュールを使用して、オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を含む、方法。
（項目１４２）少なくとも１つの発話基準は、発音設定を備えている、項目１４１に記載の方法。
（項目１４３）少なくとも１つの発話基準は、言語設定を備えている、項目１４１－１４２のいずれかに記載の方法。
（項目１４４）少なくとも１つの発話基準は、複数の発話基準を備え、方法は、
テキスト→発話モジュールを使用して、第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
発話→テキストモジュールを使用して、それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
をさらに含む、項目１４１－１４３のいずれかに記載の方法。
（項目１４５）１つ以上のテキストクエリに基づいて、メタデータを更新することをさらに含む、項目１４１－１４４のいずれかに記載の方法。
（項目１４６）エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶することをさらに含む、項目１４１－１４５のいずれかに記載の方法。
（項目１４７）第１のテキスト文字列に基づいて、オーディオファイルを生成することは、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
を含む、項目１４１－１４６のいずれかに記載の方法。
（項目１４８）発話をスピーカにおいて生成することは、テキスト→発話モジュールの少なくとも１つの発話設定にさらに基づく、項目１４７に記載の方法。
（項目１４９）オーディオファイルに基づいて、第２のテキスト文字列を生成することは、
オーディオファイルの再生をスピーカにおいて生成することと、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
を含む、項目１４１－１４８のいずれかに記載の方法。
（項目１５０）オーディオ信号を第２のテキスト文字列に変換することは、発話→テキストモジュールの少なくとも１つのテキスト設定に基づく、項目１４９に記載の方法。
The above-described embodiments of the present disclosure are presented for purposes of illustration, not limitation, and the present disclosure is limited only by the claims that follow. Furthermore, it should be noted that the features and limitations described in any one embodiment may be applied to any other embodiment herein, and that the flow charts or examples related to one embodiment may be combined with any other embodiment in a suitable manner, performed in a different order, or performed in parallel. In addition, the systems and methods described herein may be performed in real time. It should also be noted that the systems and/or methods described above may be applied to or used in accordance with other systems and/or methods.
This specification discloses embodiments including, but not limited to, the following:
(Item 1) A method for responding to a voice query, the method comprising:
receiving the voice query at an audio interface;
extracting one or more keywords from the voice query using control circuitry;
determining pronunciation information for the one or more keywords using the control circuitry;
generating a text query based on the one or more keywords and the pronunciation information using the control circuitry;
identifying an entity among a plurality of entities in a database based on the text query and stored metadata about the entity, the metadata comprising a pronunciation tag; and
retrieving a content item associated with the entity.
(Item 2) The method of item 1, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 3) The method of item 1, wherein identifying the entity is further based on user profile information.
(Item 4) The method of item 3, wherein identifying an entity is based on a previously identified entity from a previous voice query.
(Item 5) The method of item 1, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 6) The method of item 1, wherein identifying the entity includes:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective pronunciation tag to the text query; and
selecting the entity by determining a maximum score.
(Item 7) The method of item 1, wherein the entity is a first entity, the method further includes identifying a second entity among the plurality of entities based on the text query and second metadata related to the second entity, and the content item is associated with the first entity and the second entity.
(Item 8) The method of item 1, wherein identifying an entity among a plurality of entities in a database includes comparing at least a portion of the text query to tags of stored metadata and identifying a match.
(Item 9) The method of item 1, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 10) The method of item 1, wherein the pronunciation information comprises a phonetic representation of a first keyword of the one or more keywords.
(Item 11) A system for responding to a voice query, the system comprising:
an audio interface for receiving the voice query; and
control circuitry coupled to the audio interface,
wherein the control circuitry is configured to:
extract one or more keywords from the voice query;
determine pronunciation information for the one or more keywords;
generate a text query based on the one or more keywords and the pronunciation information;
identify an entity among a plurality of entities in a database based on the text query and stored metadata about the entity, the metadata comprising a pronunciation tag; and
retrieve a content item associated with the entity.
(Item 12) The system of item 11, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 13) The system of item 11, wherein the control circuitry is further configured to identify the entity based on user profile information.
(Item 14) The system of item 13, wherein the control circuitry is further configured to identify the entity based on a previously identified entity from a previous voice query.
(Item 15) The system of item 11, wherein the control circuitry is further configured to identify the entity based on popularity information associated with the entity.
(Item 16) The system of item 11, wherein the control circuitry is further configured to identify the entity by:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective pronunciation tag to the text query; and
selecting the entity by determining a maximum score.
(Item 17) The system of item 11, wherein the entity is a first entity, the control circuitry is further configured to identify a second entity among the plurality of entities based on the text query and second metadata related to the second entity, and the content item is associated with the first entity and the second entity.
(Item 18) The system of item 11, wherein the control circuitry is further configured to identify the entity among the plurality of entities in the database by comparing at least a portion of the text query with tags of the stored metadata and identifying a match.
(Item 19) The system of item 11, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 20) The system of item 11, wherein the pronunciation information comprises a phonetic representation of a first keyword of the one or more keywords.
(Item 21) A non-transitory computer-readable medium having instructions encoded thereon, wherein the instructions, when executed by control circuitry, cause the control circuitry to perform:
receiving a voice query at an audio interface;
extracting one or more keywords from the voice query;
determining pronunciation information for the one or more keywords;
generating a text query based on the one or more keywords and the pronunciation information;
identifying an entity among a plurality of entities in a database based on the text query and stored metadata about the entity, the metadata comprising a pronunciation tag; and
retrieving a content item associated with the entity.
(Item 22) The non-transitory computer-readable medium of item 21, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 23) The non-transitory computer-readable medium of item 21, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to identify an entity based on user profile information.
(Item 24) The non-transitory computer-readable medium of item 23, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to identify an entity based on previously identified entities from a previous voice query.
(Item 25) The non-transitory computer-readable medium of item 21, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to identify an entity based on popularity information associated with the entity.
(Item 26) The non-transitory computer-readable medium of item 21, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to identify the entity by:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective pronunciation tag to the text query; and
selecting the entity by determining a maximum score.
(Item 27) The non-transitory computer-readable medium of item 21, wherein the entity is a first entity and further comprises encoded instructions which, when executed by the control circuit, cause the control circuit to identify a second entity among the plurality of entities based on the text query and second metadata related to the second entity, and the content item is associated with the first entity and the second entity.
(Item 28) The non-transitory computer-readable medium of item 21, further comprising encoded instructions which, when executed by the control circuit, cause the control circuit to identify an entity among a plurality of entities in the database by comparing at least a portion of the text query with tags of stored metadata and identifying a match.
(Item 29) The non-transitory computer-readable medium of item 21, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 30) The non-transitory computer-readable medium of item 21, wherein the pronunciation information comprises a phonetic representation of a first keyword of the one or more keywords.
(Item 31) A system for responding to a voice query, the system comprising:
means for receiving the voice query;
means for extracting one or more keywords from the voice query;
means for determining pronunciation information for one or more keywords;
means for generating a text query based on one or more keywords and pronunciation information;
means for identifying an entity among a plurality of entities in a database based on a text query and stored metadata about the entity, the metadata comprising a pronunciation tag;
and means for retrieving a content item associated with the entity.
(Item 32) The system of item 31, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 33) The system of item 31, wherein the means for identifying an entity comprises means for identifying an entity based on user profile information.
(Item 34) The system of item 33, wherein the means for identifying an entity comprises means for identifying an entity based on a previously identified entity from a previous voice query.
(Item 35) The system of item 31, wherein the means for identifying an entity comprises means for identifying an entity based on popularity information associated with the entity.
(Item 36) The means for identifying an entity comprises:
means for identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
means for determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective pronunciation tag to the text query;
and means for selecting an entity by determining a maximum score.
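The scoring and selection steps of Item 36 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the disclosure: the `Entity` class, the hyphenated pronunciation strings, the sample candidates, and the use of `difflib.SequenceMatcher` as the similarity measure. The items only require that a respective score be determined for each entity by comparing its pronunciation tag to the text query, and that the entity with the maximum score be selected.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Entity:
    # Hypothetical record: a display name plus stored pronunciation tags.
    name: str
    pronunciation_tags: list[str] = field(default_factory=list)


def score(entity: Entity, query_pronunciation: str) -> float:
    """Best similarity between the query pronunciation and any stored tag."""
    return max(
        (SequenceMatcher(None, tag, query_pronunciation).ratio()
         for tag in entity.pronunciation_tags),
        default=0.0,  # an entity without tags scores zero
    )


def select_entity(entities: list[Entity], query_pronunciation: str) -> Entity:
    """Score every candidate and return the one with the maximum score."""
    return max(entities, key=lambda e: score(e, query_pronunciation))


# Two hypothetical entities that share a spelling but differ in pronunciation.
candidates = [
    Entity("Louis (person A)", ["loo-ee"]),
    Entity("Louis (person B)", ["loo-is"]),
]
best = select_entity(candidates, "loo-is")
```

A string-similarity ratio is only one possible score. A real system might weight phoneme-level distance, user profile information, or popularity, as the neighboring items suggest.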
(Item 37) The system of item 31, wherein the entity is a first entity, the system further comprising means for identifying a second entity among the plurality of entities based on the text query and second metadata related to the second entity, wherein the content item is associated with the first entity and the second entity.
(Item 38) The system of item 31, wherein the means for identifying an entity among a plurality of entities in the database includes means for comparing at least a portion of the text query with tags of stored metadata and identifying a match.
(Item 39) The system of item 31, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 40) The system of item 31, wherein the pronunciation information comprises a phonetic representation of a first keyword of the one or more keywords.
(Item 41) A method for responding to a voice query, the method comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the voice query using control circuitry;
determining pronunciation information for the one or more keywords using the control circuitry;
generating a text query based on the one or more keywords and the pronunciation information using the control circuitry;
identifying an entity among a plurality of entities in a database based on the text query and stored metadata about the entity, the metadata comprising a pronunciation tag;
and retrieving a content item associated with the entity.
(Item 42) The method of item 41, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 43) The method of any of Items 41-42, wherein identifying the entity is further based on user profile information.
(Item 44) The method of item 41, wherein identifying the entity is based on a previously identified entity from a previous voice query.
(Item 45) The method of any of Items 41-44, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 46) Identifying an entity comprises:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective pronunciation tag to the text query;
and selecting an entity by determining a maximum score.
(Item 47) A method according to any of Items 41-46, wherein the entity is a first entity, and further includes identifying a second entity among the plurality of entities based on the text query and second metadata relating to the second entity, and the content item is associated with the first entity and the second entity.
(Item 48) The method of any of Items 41-47, wherein identifying an entity among a plurality of entities in a database includes comparing at least a portion of the text query to stored metadata tags and identifying a match.
(Item 49) The method of any of Items 41-48, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 50) The method of any of items 41-49, wherein the pronunciation information comprises a phonetic representation of a first keyword of the one or more keywords.
(Item 51) A method for responding to a voice query, comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the voice query using control circuitry;
generating a text query based on the one or more keywords using control circuitry;
identifying an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative text representations of the entity based on a pronunciation of an identifier associated with the entity;
and retrieving a content item associated with the entity.
(Item 52) The method of item 51, wherein the one or more alternative text representations comprise a phonemic representation of the entity.
(Item 53) The method of item 51, wherein the one or more alternative text representations comprise alternative spellings of the entity based on pronunciation.
(Item 54) The method of item 51, wherein one or more alternative text representations of an entity comprise a text string generated based on a previous speech-to-text conversion.
(Item 55) The method of item 51, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation of the plurality of alternative text representations being generated by:
converting a first text representation into an audio file;
and converting the audio file into a second text representation, the second text representation not being identical to the first text representation.
(Item 56) The method of item 51, wherein identifying the entity is further based on user profile information.
(Item 57) The method of item 51, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 58) Identifying an entity comprises:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective one or more alternative text representations to the text query;
and selecting an entity by determining a maximum score.
(Item 59) The method of item 51, further comprising generating a plurality of text queries, the plurality of text queries comprising a text query, each text query of the plurality of text queries being generated based on a respective setting of a speech-to-text module of the control circuit.
(Item 60)
identifying a respective entity based on each of the plurality of text queries;
determining a respective score for each entity based on a comparison of each text query to metadata associated with each entity;
and identifying the entity by selecting a maximum score of the respective scores.
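The multi-query flow of Items 59 and 60 can be sketched as follows, under stated assumptions. The catalog, the entity identifiers, the alternative spelling "Tom Cruz", and the two candidate transcriptions are hypothetical, and `difflib.SequenceMatcher` stands in for whatever comparison the system actually applies. Each text query is assumed to have been produced by the speech-to-text module under a different setting.

```python
from difflib import SequenceMatcher


def best_match(text_query: str, catalog: dict[str, list[str]]) -> tuple[str, float]:
    """Return (entity_id, score) for the entity that best matches one text query."""
    def entity_score(representations: list[str]) -> float:
        # Compare the query with every stored text representation of the entity.
        return max(
            SequenceMatcher(None, text_query.lower(), rep.lower()).ratio()
            for rep in representations
        )

    entity_id = max(catalog, key=lambda key: entity_score(catalog[key]))
    return entity_id, entity_score(catalog[entity_id])


def identify_entity(text_queries: list[str], catalog: dict[str, list[str]]) -> str:
    """Identify one entity per text query, then keep the maximum-scoring one."""
    results = [best_match(query, catalog) for query in text_queries]
    return max(results, key=lambda result: result[1])[0]


# Hypothetical metadata: each entity lists its alternative text representations.
catalog = {
    "entity_1": ["Tom Cruise", "Tom Cruz"],
    "entity_2": ["Tom Hanks"],
}
# Hypothetical transcriptions of one utterance under two speech-to-text settings.
text_queries = ["tom crews", "tom cruz"]
selected = identify_entity(text_queries, catalog)
```

Generating several transcriptions raises the chance that at least one of them matches a stored representation exactly, which is why the maximum over all queries is taken rather than the score of any single query.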
(Item 61) A system for responding to voice queries, comprising:
an audio interface for receiving a voice query;
and a control circuit configured to:
extract one or more keywords from the voice query;
generate a text query based on the one or more keywords;
identify an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative text representations of the entity based on a pronunciation of an identifier associated with the entity;
and retrieve a content item associated with the entity.
(Item 62) The system of item 61, wherein the one or more alternative text representations comprise a phonetic representation of the entity.
(Item 63) The system of item 61, wherein the one or more alternative text representations comprise alternative spellings of the entity based on pronunciation.
(Item 64) The system of item 61, wherein one or more alternative text representations of an entity comprise a text string generated based on a previous speech-to-text conversion.
(Item 65) The system of item 61, wherein the one or more alternative text representations comprise a plurality of alternative text representations, and the control circuit is further configured to generate each alternative text representation of the plurality of alternative text representations by:
converting a first text representation into an audio file;
and converting the audio file into a second text representation, the second text representation not being identical to the first text representation.
(Item 66) The system of item 61, wherein the control circuit is further configured to identify the entity based on user profile information.
(Item 67) The system of item 61, wherein the control circuit is further configured to identify the entity based on popularity information associated with the entity.
(Item 68) The system of item 61, wherein the control circuit is further configured to identify the entity by:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective one or more alternative text representations to the text query;
and selecting the entity by determining a maximum score.
(Item 69) The system described in Item 61, wherein the control circuit is further configured to generate a plurality of text queries, the plurality of text queries comprising a text query, the control circuit comprising a speech-to-text module, and each text query of the plurality of text queries is generated based on a respective setting of the speech-to-text module.
(Item 70) A control circuit comprising:
identifying a respective entity based on each of the plurality of text queries;
determining a respective score for each entity based on a comparison of each text query to metadata associated with each entity;
and identifying the entity by selecting a maximum score of the respective scores.
(Item 71) A non-transitory computer-readable medium having instructions encoded thereon that, when executed by a control circuit, cause the control circuit to:
receive a voice query at an audio interface;
extract one or more keywords from the voice query;
generate a text query based on the one or more keywords;
identify an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative text representations of the entity based on a pronunciation of an identifier associated with the entity;
and retrieve a content item associated with the entity.
(Item 72) The non-transitory computer-readable medium of item 71, wherein the one or more alternative text representations comprise a phonemic representation of the entity.
(Item 73) The non-transitory computer-readable medium of item 71, wherein the one or more alternative text representations comprise alternative spellings of the entity based on pronunciation.
(Item 74) The non-transitory computer-readable medium of item 71, wherein one or more alternative text representations of an entity comprise a text string generated based on a previous speech-to-text conversion.
(Item 75) The non-transitory computer-readable medium of item 71, wherein the one or more alternative text representations comprise a plurality of alternative text representations, the medium further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to generate each alternative text representation of the plurality of alternative text representations by:
converting a first text representation into an audio file;
and converting the audio file into a second text representation, the second text representation not being identical to the first text representation.
(Item 76) The non-transitory computer-readable medium of item 71, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to identify an entity based on user profile information.
(Item 77) The non-transitory computer-readable medium of item 71, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to identify an entity based on popularity information associated with the entity.
(Item 78) The non-transitory computer-readable medium of item 71, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to identify the entity by:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective one or more alternative text representations to the text query;
and selecting the entity by determining a maximum score.
(Item 79) The non-transitory computer-readable medium of item 71, further comprising encoded instructions which, when executed by the control circuit, cause the control circuit to generate a plurality of text queries, the plurality of text queries comprising a text query, each text query of the plurality of text queries being generated based on a respective setting of a speech-to-text module of the control circuit.
(Item 80) The non-transitory computer-readable medium of item 79, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to perform:
identifying a respective entity based on each of the plurality of text queries;
determining a respective score for each entity based on a comparison of each text query to metadata associated with each entity;
and identifying the entity by selecting a maximum score from the respective scores.
(Item 81) A system for responding to voice queries, comprising:
means for receiving a voice query at an audio interface;
means for extracting one or more keywords from the voice query;
means for generating a text query based on the one or more keywords;
means for identifying an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative text representations of the entity based on a pronunciation of an identifier associated with the entity;
and means for retrieving a content item associated with the entity.
(Item 82) The system of item 81, wherein the one or more alternative text representations comprise a phonetic representation of the entity.
(Item 83) The system of item 81, wherein the one or more alternative text representations comprise alternative spellings of the entity based on pronunciation.
(Item 84) The system of item 81, wherein one or more alternative text representations of an entity comprise a text string generated based on a previous speech-to-text conversion.
(Item 85) The system of item 81, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation of the plurality of alternative text representations being generated by:
means for converting a first text representation into an audio file;
and means for converting the audio file into a second text representation, the second text representation not being identical to the first text representation.
(Item 86) The system of item 81, wherein the means for identifying an entity further comprises means for identifying an entity based on user profile information.
(Item 87) The system of item 81, wherein the means for identifying the entity further comprises means for identifying the entity based on popularity information associated with the entity.
(Item 88) The means for identifying an entity includes:
means for identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
means for determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective one or more alternative text representations to the text query;
and means for selecting an entity by determining a maximum score.
(Item 89) The system of item 81, further comprising means for generating a plurality of text queries, the plurality of text queries comprising a text query, each text query of the plurality of text queries being generated based on a respective setting of a speech-to-text module of the control circuit.
(Item 90)
means for identifying respective entities based on respective text queries of the plurality of text queries;
means for determining a respective score for each entity based on a comparison of each text query to metadata associated with each entity;
and means for identifying the entity by selecting a maximum score of the respective scores.
(Item 91) A method for responding to a voice query, comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the voice query using control circuitry;
generating a text query based on the one or more keywords using control circuitry;
identifying an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative text representations of the entity based on a pronunciation of an identifier associated with the entity;
and retrieving a content item associated with the entity.
(Item 92) The method of item 91, wherein the one or more alternative text representations comprise a phonemic representation of the entity.
(Item 93) The method of any of Items 91-92, wherein the one or more alternative text representations comprise alternative spellings of the entity based on pronunciation.
(Item 94) The method of any of Items 91-93, wherein one or more alternative text representations of an entity comprise a text string generated based on a previous speech-to-text conversion.
(Item 95) The method of any of items 91-94, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation of the plurality of alternative text representations being generated by:
converting a first text representation into an audio file;
and converting the audio file into a second text representation, the second text representation not being identical to the first text representation.
(Item 96) A method according to any of Items 91-95, wherein identifying the entity is further based on user profile information.
(Item 97) The method of any of Items 91-96, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 98) Identifying an entity comprises:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, for each respective entity of the plurality of entities, a respective score based on comparing the respective one or more alternative text representations to the text query;
and selecting an entity by determining a maximum score.
(Item 99) A method according to any of Items 91-98, further comprising generating a plurality of text queries, the plurality of text queries comprising a text query, each text query of the plurality of text queries being generated based on respective settings of a speech-to-text module of the control circuit.
(Item 100)
identifying a respective entity based on each of the plurality of text queries;
determining a respective score for each entity based on a comparison of each text query to metadata associated with each entity;
and identifying the entity by selecting a maximum score of the respective scores.
(Item 101) A method for generating entity metadata for a voice query, the method comprising:
identifying an entity of a plurality of entities for which information is stored;
generating an audio file based on a first text string and at least one speech criterion using a text-to-speech module, the first text string describing the entity;
generating a second text string based on the audio file using a speech-to-text module;
comparing the second text string to the first text string;
and if the second text string is not identical to the first text string, storing the second text string in metadata associated with the entity.
(Item 102) The method of item 101, wherein at least one speech criterion comprises a pronunciation setting.
(Item 103) The method of item 101, wherein at least one speech criterion comprises a language setting.
(Item 104) The at least one speech criterion comprises a plurality of speech criteria, and the method includes:
generating respective audio files based on the first text string and respective speech criteria using a text-to-speech module;
generating a respective second text string based on the respective audio file using a speech-to-text module;
comparing each second text string to the first text string;
and storing the respective second text string in metadata associated with the entity if the respective second text string is not identical to the first text string.
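The round trip described in Items 101 and 104 (text to speech, speech back to text, then storing any transcription that differs from the original) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the text-to-speech and speech-to-text modules are passed in as plain callables, and the stand-ins `fake_tts` and `fake_stt`, the locale-style speech criteria, and the transcripts they return are all hypothetical. Skipping duplicate transcriptions is likewise an added assumption.

```python
from typing import Callable, Iterable


def generate_alternative_representations(
    first_text: str,
    speech_criteria: Iterable[str],
    text_to_speech: Callable[[str, str], bytes],
    speech_to_text: Callable[[bytes], str],
) -> list[str]:
    """Collect transcriptions that differ from the original text string."""
    alternatives: list[str] = []
    for criterion in speech_criteria:
        audio = text_to_speech(first_text, criterion)  # respective audio file
        second_text = speech_to_text(audio)            # respective second text string
        # Store only strings that are not identical to the first text string.
        if second_text != first_text and second_text not in alternatives:
            alternatives.append(second_text)
    return alternatives


# Hypothetical stand-ins so the sketch runs without any speech engine.
_FAKE_TRANSCRIPTS = {"en-US": "Tom Cruise", "en-GB": "Tom Cruz", "fr-FR": "Tom Crews"}


def fake_tts(text: str, criterion: str) -> bytes:
    """Pretend to synthesize audio; the criterion is embedded in the payload."""
    return f"{criterion}|{text}".encode()


def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe audio by looking up the embedded criterion."""
    criterion = audio.decode().split("|", 1)[0]
    return _FAKE_TRANSCRIPTS[criterion]


entity_metadata = {
    "alternative_text": generate_alternative_representations(
        "Tom Cruise", ["en-US", "en-GB", "fr-FR"], fake_tts, fake_stt
    )
}
```

Passing the modules as callables keeps the loop independent of any particular speech engine, so the same function could be driven by real text-to-speech and speech-to-text components.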
(Item 105) The method of item 101, further comprising updating metadata based on one or more text queries.
(Item 106) The method of item 101, further comprising storing a phonetic representation of the first text string in metadata associated with the entity.
(Item 107) The method of item 101, wherein generating the audio file based on the first text string comprises:
converting the first text string into a first audio signal;
generating speech at a speaker based on the first audio signal;
detecting the speech using a microphone and generating a second audio signal;
and processing the second audio signal to generate the audio file.
(Item 108) The method of item 107, wherein generating speech in the speaker is further based on at least one speech setting of the text-to-speech module.
(Item 109) Generating a second text string based on an audio file includes:
generating a playback of the audio file on a speaker;
detecting the playback using a microphone and generating an audio signal;
and converting the audio signal into a second text string by identifying one or more words.
(Item 110) The method of item 109, wherein converting the audio signal into a second text string is based on at least one text setting of a speech-to-text module.
(Item 111) A system for generating entity metadata for a voice query, the system comprising a control circuit configured to:
identify an entity of a plurality of entities for which information is stored;
generate, using an audio interface coupled to the control circuit, an audio file based on a first text string and at least one speech criterion, the first text string describing the entity;
generate a second text string based on the audio file using the audio interface;
compare the second text string to the first text string;
and if the second text string is not identical to the first text string, store the second text string in metadata associated with the entity.
(Item 112) The system of item 111, wherein at least one speech criterion comprises a pronunciation setting.
(Item 113) The system of item 111, wherein at least one speech criterion comprises a language setting.
(Item 114) The at least one speech criterion comprises a plurality of speech criteria, and the control circuit is further configured to:
generate, using an audio device, respective audio files based on the first text string and the respective speech criteria;
generate, using the audio device, respective second text strings based on the respective audio files;
compare each respective second text string to the first text string;
and store the respective second text string in metadata associated with the entity if the respective second text string is not identical to the first text string.
(Item 115) The system of item 111, wherein the control circuit is further configured to update the metadata based on one or more text queries.
(Item 116) The system of item 111, wherein the control circuitry is further configured to store a phonetic representation of the first text string in metadata associated with the entity.
(Item 117) The system of item 111, wherein an audio device includes a speaker and a microphone, and the control circuit is further configured to generate the audio file based on the first text string by:
converting the first text string into a first audio signal;
generating speech at the speaker based on the first audio signal;
detecting the speech using the microphone and generating a second audio signal;
and processing the second audio signal to generate the audio file.
(Item 118) The system of item 117, wherein the control circuit is further configured to generate speech at the speaker based on at least one speech setting.
(Item 119) The system of item 111, wherein an audio device includes a speaker and a microphone, and the control circuit is further configured to generate the second text string based on the audio file by:
generating a playback of the audio file on the speaker;
detecting the playback at the microphone and generating an audio signal;
and converting the audio signal into the second text string by identifying one or more words.
(Item 120) The system of item 119, wherein the control circuit is further configured to convert the audio signal into a second text string based on at least one text setting of the speech-to-text module.
(Item 121) A non-transitory computer-readable medium having instructions encoded thereon that, when executed by a control circuit, cause the control circuit to:
identify an entity of a plurality of entities for which information is stored;
generate an audio file based on a first text string and at least one speech criterion, the first text string describing the entity;
generate a second text string based on the audio file;
compare the second text string to the first text string;
and if the second text string is not identical to the first text string, store the second text string in metadata associated with the entity.
(Item 122) The non-transitory computer-readable medium of item 121, wherein at least one speech criterion comprises a pronunciation setting.
(Item 123) The non-transitory computer-readable medium of item 121, wherein at least one speech criterion comprises a language setting.
(Item 124) The at least one speech criterion comprises a plurality of speech criteria, and the medium further comprises encoded instructions that, when executed by the control circuit, cause the control circuit to:
generate respective audio files based on the first text string and the respective speech criteria;
generate a respective second text string based on the respective audio file;
compare each respective second text string to the first text string;
and if the respective second text string is not identical to the first text string, store the respective second text string in metadata associated with the entity.
(Item 125) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to update metadata based on one or more text queries.
(Item 126) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to store a phonetic representation of the first text string in metadata associated with the entity.
(Item 127) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to generate the audio file by:
converting the first text string into a first audio signal;
generating speech at a speaker based on the first audio signal;
detecting the speech using a microphone and generating a second audio signal;
and processing the second audio signal to generate the audio file.
(Item 128) The non-transitory computer-readable medium of item 127, further comprising encoded instructions which, when executed by the control circuit, cause the control circuit to generate speech at a speaker based on at least one speech setting of the text-to-speech module.
(Item 129) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to generate the second text string by:
generating a playback of the audio file on a speaker;
detecting the playback using a microphone and generating an audio signal;
and converting the audio signal into the second text string by identifying one or more words.
(Item 130) The non-transitory computer-readable medium of item 129, further comprising encoded instructions that, when executed by the control circuit, cause the control circuit to convert the audio signal into a second text string based on at least one text setting of the speech-to-text module.
(Item 131) A system for generating entity metadata for a voice query, the system comprising:
means for identifying an entity of a plurality of entities for which information is stored;
means for generating an audio file based on a first text string and at least one speech criterion, the first text string describing the entity;
means for generating a second text string based on the audio file;
means for comparing the second text string to the first text string;
and means for storing the second text string in metadata associated with the entity if the second text string is not identical to the first text string.
(Item 132) The system of item 131, wherein at least one speech criterion comprises a pronunciation setting.
(Item 133) The system described in Item 131, wherein at least one speech criterion comprises a language setting.
(Item 134) The at least one speech criterion comprises a plurality of speech criteria, and the system further comprises:
means for generating respective audio files based on the first text string and the respective speech criteria;
means for generating respective second text strings based on the respective audio files;
means for comparing each second text string to the first text string;
and means for storing the respective second text string in metadata associated with the entity if the respective second text string is not identical to the first text string.
(Item 135) The system of item 131, further comprising means for updating metadata based on one or more text queries.
(Item 136) The system of item 131, further comprising means for storing a phonetic representation of the first text string in metadata associated with the entity.
(Item 137) The means for generating an audio file based on a first text string comprises:
means for converting a first text string into a first audio signal;
means for generating speech at a speaker based on the audio signal;
means for detecting speech using a microphone and generating a second audio signal;
and means for processing the audio signal and generating an audio file.
(Item 138) The system of item 137, wherein the means for generating speech at a speaker further comprises means for generating speech at a speaker based on at least one speech setting of the text-to-speech module.
(Item 139) The means for generating a second text string based on an audio file includes:
means for generating a playback of the audio file on a speaker;
means for detecting the playback using a microphone and generating an audio signal;
and means for converting the audio signal into a second text string by identifying one or more words.
(Item 140) The system described in Item 139, wherein the means for converting the audio signal into a second text string includes means for converting the audio signal into a second text string based on at least one text setting of the speech-to-text module.
(Item 141) A method for generating entity metadata for a voice query, the method comprising:
identifying an entity of a plurality of entities for which information is stored;
generating an audio file based on a first text string and at least one speech criterion using a text-to-speech module, the first text string describing the entity;
generating a second text string based on the audio file using a speech-to-text module;
comparing the second text string to the first text string;
and if the second text string is not identical to the first text string, storing the second text string in metadata associated with the entity.
(Item 142) The method of item 141, wherein at least one speech criterion comprises a pronunciation setting.
(Item 143) The method of any of Items 141-142, wherein at least one speech criterion comprises a language setting.
(Item 144) The method of any of items 141-143, wherein the at least one speech criterion comprises a plurality of speech criteria, and the method further comprises:
generating respective audio files based on the first text string and respective speech criteria using a text-to-speech module;
generating a respective second text string based on the respective audio file using a speech-to-text module;
comparing each respective second text string to the first text string;
and storing the respective second text string in metadata associated with the entity if the respective second text string is not identical to the first text string.
(Item 145) The method of any of Items 141-144, further comprising updating metadata based on one or more text queries.
(Item 146) The method of item 141, further comprising storing a phonetic representation of the first text string in metadata associated with the entity.
(Item 147) Generating the audio file based on the first text string includes:
converting the first text string into a first audio signal;
generating speech at a speaker based on the first audio signal;
detecting the speech using a microphone and generating a second audio signal;
and processing the second audio signal to generate the audio file.
(Item 148) The method of item 147, wherein generating speech in the speaker is further based on at least one speech setting of the text-to-speech module.
(Item 149) Generating the second text string based on the audio file includes:
generating a playback of the audio file on a speaker;
detecting the playback using a microphone and generating an audio signal;
and converting the audio signal into a second text string by identifying one or more words.
(Item 150) The method of item 149, wherein converting the audio signal into a second text string is based on at least one text setting of a speech-to-text module.

Claims

1. A method of responding to a voice query, for execution by a computing system, the method comprising:
receiving, by the computing system, a voice query at an audio interface;
extracting, using control circuitry, one or more keywords from the voice query;
generating, using the control circuitry, a text query based on the one or more keywords;
identifying, by the computing system, an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative text representations of the entity, the one or more alternative text representations being based on a pronunciation of an identifier associated with the entity; and
retrieving, by the computing system, a content item associated with the entity.

2. The method of claim 1, wherein the one or more alternative text representations comprise a phonemic representation of the entity.

3. The method of any of claims 1 and 2, wherein the one or more alternative text representations comprise alternative spellings of the entity based on pronunciation.

4. The method of any of claims 1 to 3, wherein the one or more alternative text representations of the entity comprise text strings generated based on a previous speech-to-text conversion.

5. The method of any of claims 1 to 4, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation in the plurality of alternative text representations being generated by:
the computing system converting a first text representation into an audio file; and
the computing system converting the audio file into a second text representation,
wherein the second text representation is not identical to the first text representation.

6. The method according to any of claims 1 to 5, wherein the identifying of the entity by the computing system is further based on user profile information or on popularity information associated with the entity.

7. The method according to any of claims 1 to 6, wherein the computing system identifying the entity comprises:
identifying, by the computing system, a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining, by the computing system, a respective score for each entity of the plurality of entities based on comparing the respective one or more alternative text representations to the text query; and
selecting, by the computing system, the entity by determining a maximum score.
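
The scoring and selection recited in claim 7 can be sketched as follows. The claim does not specify how a score is computed, so this sketch assumes a character-level similarity ratio from Python's standard `difflib` module; the function names and the sample metadata are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def score_entity(text_query: str, alternatives: List[str]) -> float:
    """Best similarity between the query and any alternative text representation.

    The similarity measure (difflib ratio on case-folded strings) is an
    assumption made for illustration; the claim leaves the scoring open.
    """
    query = text_query.casefold()
    return max(
        (SequenceMatcher(None, query, alt.casefold()).ratio() for alt in alternatives),
        default=0.0,
    )


def select_entity(text_query: str, metadata: Dict[str, List[str]]) -> str:
    """Pick the entity whose metadata yields the maximum score.

    Assumes metadata holds at least one entity.
    """
    return max(metadata, key=lambda name: score_entity(text_query, metadata[name]))


# Made-up metadata: each entity maps to its alternative text representations.
METADATA = {
    "Tom Cruise": ["tom cruise", "tom cruz", "tom crews"],
    "Penelope Cruz": ["penelope cruz", "penelopy cruise"],
}

best = select_entity("tom cruz", METADATA)
```

Here the query matches one stored alternative spelling exactly, so that entity receives the maximum score even though the query differs from its canonical name.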

8. The method of claim 1, further comprising: the computing system generating a plurality of text queries, the plurality of text queries comprising the text query, each text query of the plurality of text queries being generated based on a respective setting of a speech-to-text module of the control circuitry.

9. The method of claim 8, further comprising:
identifying, by the computing system, a respective entity based on each of the plurality of text queries;
determining, by the computing system, a respective score for each of the entities based on a comparison of the respective text query to metadata associated with the respective entity; and
identifying, by the computing system, the entity by selecting a maximum score of the respective scores.
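
Claims 8 and 9 together describe producing several candidate text queries, one per speech-to-text setting, and keeping the entity with the highest score across all of them. A minimal sketch under the same assumption as before (a `difflib` similarity ratio standing in for the unspecified score) is shown here; the candidate queries and metadata are invented for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, List, Tuple


def best_match(text_query: str, metadata: Dict[str, List[str]]) -> Tuple[str, float]:
    """Return (entity, score) for the entity that best matches one text query.

    Assumes every entity has at least one stored text representation.
    """
    query = text_query.casefold()

    def score(alternatives: List[str]) -> float:
        return max(SequenceMatcher(None, query, alt.casefold()).ratio()
                   for alt in alternatives)

    entity = max(metadata, key=lambda name: score(metadata[name]))
    return entity, score(metadata[entity])


def identify_from_queries(text_queries: Iterable[str],
                          metadata: Dict[str, List[str]]) -> Tuple[str, float]:
    """Identify one entity per query, then keep the pair with the maximum score."""
    candidates = [best_match(query, metadata) for query in text_queries]
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical outputs of one voice query under two speech-to-text settings.
QUERIES = ["tom crooz", "tom cruz"]
ENTITY_METADATA = {
    "Tom Cruise": ["Tom Cruise", "Tom Cruz"],
    "Tom Hanks": ["Tom Hanks", "Tom Hanx"],
}

result = identify_from_queries(QUERIES, ENTITY_METADATA)
```

Running every candidate query through the same comparison means a single well-matched transcription is enough to surface the intended entity, even if the other settings produce poorer text.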

10. A system for responding to a voice query, the system comprising:
a memory;
and means for implementing the steps of the method according to any one of claims 1 to 9.