HK40094574A

HK40094574A - Methods for parametric multi-channel encoding

Info

Publication number: HK40094574A
Application number: HK42023083397.2A
Authority: HK
Inventors: T. Friedrich; A. Müller; K. Linzmeier; C.-C. Spenger; T. R. Wagenblass
Original assignee: Dolby International AB
Priority date: 2013-02-21
Filing date: 2023-11-28
Publication date: 2024-01-19

Description

Methods for parametric multichannel coding

This application is a divisional of invention patent application No. 201910673941.4, filed February 21, 2014, entitled "Methods for Parametric Multichannel Coding".

Cross-References to Related Applications

This application claims priority to U.S. Provisional Patent Application No. 61/767,673, filed February 21, 2013, the entire contents of which are hereby incorporated by reference.

Technical Field

This document relates to audio coding systems. Specifically, it relates to efficient methods and systems for parametric multichannel audio coding.

Background

Parametric multichannel audio coding systems can provide improved listening quality at particularly low data rates. Nevertheless, there is a need to improve such systems further, notably with regard to bandwidth efficiency, computational efficiency and/or robustness.

Summary of the Invention

According to one aspect, an audio encoding system configured to generate a bitstream indicative of a downmix signal and spatial metadata is described. The spatial metadata can be used by a corresponding decoding system to generate a multichannel upmix signal from the downmix signal. The downmix signal may comprise m channels and the multichannel upmix signal may comprise n channels, where n and m are integers and m < n. In an example, n = 6 and m = 2. The spatial metadata enables the corresponding decoding system to generate the n channels of the multichannel upmix signal from the m channels of the downmix signal.

The audio encoding system may be configured to quantize and/or encode the downmix signal and the spatial metadata and to insert the quantized/encoded data into the bitstream. Specifically, the downmix signal may be encoded using a Dolby Digital Plus encoder, and the bitstream may correspond to a Dolby Digital Plus bitstream. The quantized/encoded spatial metadata may be inserted into a data field of the Dolby Digital Plus bitstream.

The audio encoding system may comprise a downmix processing unit configured to generate the downmix signal from a multichannel input signal. The downmix processing unit is also referred to herein as the downmix coding unit. The multichannel input signal may comprise n channels, like the multichannel upmix signal that is regenerated from the downmix signal. Specifically, the multichannel upmix signal may provide an approximation of the multichannel input signal. The downmix unit may comprise the Dolby Digital Plus encoder mentioned above. The multichannel upmix signal and the multichannel input signal may be 5.1 or 7.1 signals, and the downmix signal may be a stereo signal.

The audio encoding system may comprise a parameter processing unit configured to determine the spatial metadata from the multichannel input signal. Specifically, the parameter processing unit (also referred to herein as the parameter coding unit) may be configured to determine one or more spatial parameters, for example a set of spatial parameters, which may be determined from different combinations of the channels of the multichannel input signal. The spatial parameters of a set of spatial parameters may indicate the cross-correlation between different channels of the multichannel input signal. The parameter processing unit may be configured to determine the spatial metadata for a frame of the multichannel input signal; this metadata is referred to as a spatial metadata frame. A frame of the multichannel input signal typically comprises a predetermined number of samples (for example, 1536) of the multichannel input signal. Each spatial metadata frame may comprise one or more sets of spatial parameters.

The audio encoding system may further comprise a configuration unit configured to determine one or more control settings for the parameter processing unit based on one or more external settings. The one or more external settings may comprise a target data rate for the bitstream. Alternatively or additionally, the one or more external settings may comprise one or more of: the sampling rate of the multichannel input signal, the number m of channels of the downmix signal, the number n of channels of the multichannel input signal, and/or an update period indicating the time period that a corresponding decoding system needs in order to synchronize to the bitstream. The one or more control settings may comprise a maximum data rate for the spatial metadata. For spatial metadata frames, the maximum data rate for the spatial metadata may indicate the maximum number of metadata bits of a spatial metadata frame. Alternatively or additionally, the one or more control settings may comprise one or more of: a temporal resolution setting indicating the number of sets of spatial parameters to be determined per spatial metadata frame; a frequency resolution setting indicating the number of frequency bands for which spatial parameters are to be determined; a quantizer setting indicating the type of quantizer to be used for quantizing the spatial metadata; and an indication of whether the current frame of the multichannel input signal is to be encoded as an independent frame.
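The relation between a maximum metadata data rate and a per-frame bit budget can be illustrated with simple arithmetic. The sketch below is an illustration only: the function name, the 8 kbit/s metadata rate and the 48 kHz sampling rate are assumptions chosen for the example, and only the frame length of 1536 samples comes from the text. How the configuration unit derives the maximum metadata rate from the target bitstream rate is not shown here.

```python
def max_metadata_bits(metadata_rate_bps: int, frame_len: int, sample_rate: int) -> int:
    """Bits available for one spatial metadata frame (hypothetical helper).

    A frame of frame_len samples lasts frame_len / sample_rate seconds, so
    the budget is the metadata rate multiplied by that duration, rounded down.
    """
    return (metadata_rate_bps * frame_len) // sample_rate


# Assumed values: 8 kbit/s of metadata, 1536-sample frames, 48 kHz audio.
budget = max_metadata_bits(8000, 1536, 48000)
```

With these assumed values a frame lasts 32 ms, which gives a budget of 256 bits per spatial metadata frame.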

The parameter processing unit may be configured to determine whether the number of bits of a spatial metadata frame, determined in accordance with the one or more control settings, exceeds the maximum number of metadata bits. Furthermore, the parameter processing unit may be configured to reduce the number of bits of a particular spatial metadata frame if it is determined that this number exceeds the maximum number of metadata bits. This reduction can be performed in a resource-efficient (processing-power-efficient) manner. Specifically, it can be performed without recalculating the complete spatial metadata frame.

As indicated above, a spatial metadata frame may comprise one or more sets of spatial parameters. The one or more control settings may comprise a temporal resolution setting indicating the number of sets of spatial parameters to be determined by the parameter processing unit per spatial metadata frame. The parameter processing unit may be configured to determine as many sets of spatial parameters for the current spatial metadata frame as indicated by the temporal resolution setting. Typically, the temporal resolution setting takes the value 1 or 2. Furthermore, the parameter processing unit may be configured to discard a set of spatial parameters from the current spatial metadata frame if the current spatial metadata frame comprises a plurality of sets of spatial parameters and if it is determined that the number of bits of the current spatial metadata frame exceeds the maximum number of metadata bits. The parameter processing unit may be configured to retain at least one set of spatial parameters per spatial metadata frame. Discarding a set of spatial parameters reduces the number of bits of the spatial metadata frame with little computational effort and without significantly affecting the perceived listening quality of the multichannel upmix signal.

The one or more sets of spatial parameters are typically associated with one or more corresponding sampling points. The one or more sampling points may indicate one or more corresponding time instants. Specifically, a sampling point may indicate the time instant at which the decoding system should fully apply the corresponding set of spatial parameters. In other words, a sampling point may indicate the time instant for which the corresponding set of spatial parameters has been determined.

The parameter processing unit may be configured to discard a first set of spatial parameters from the current spatial metadata frame if the plurality of sampling points of the current metadata frame are not associated with transients of the multichannel input signal, where the first set of spatial parameters is associated with a first sampling point that precedes a second sampling point. On the other hand, the parameter processing unit may be configured to discard the second (typically the last) set of spatial parameters from the current spatial metadata frame if the plurality of sampling points of the current metadata frame are associated with transients of the multichannel input signal. In this way, the parameter processing unit can reduce the effect that discarding a set of spatial parameters has on the listening quality of the multichannel upmix signal.
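The discard rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `ParamSet` type, its `bits` field and the boolean flag `points_are_transients` are hypothetical names, and the caller is assumed to know whether the sampling points of the frame are associated with transients.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParamSet:
    sampling_point: int  # time slot at which the set is fully applied
    bits: int            # encoded size of the set (hypothetical field)


def drop_one_set(sets: List[ParamSet], max_bits: int,
                 points_are_transients: bool) -> List[ParamSet]:
    """Discard one set of spatial parameters if the frame is over budget."""
    if sum(s.bits for s in sets) <= max_bits or len(sets) < 2:
        # Within budget, or only one set present: at least one set is retained.
        return sets
    ordered = sorted(sets, key=lambda s: s.sampling_point)
    if points_are_transients:
        return ordered[:-1]  # drop the second (last) set
    return ordered[1:]       # drop the first (earliest) set
```

For a frame with two sets of 40 bits each and a budget of 60 bits, the function keeps the later set when no transients are involved and the earlier set otherwise.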

The one or more control settings may comprise a quantizer setting indicating a first type of quantizer from a plurality of predetermined types of quantizers. The predetermined types of quantizers may provide different quantizer resolutions. Specifically, they may comprise a fine quantization and a coarse quantization. The parameter processing unit may be configured to quantize the one or more sets of spatial parameters of the current spatial metadata frame in accordance with the first type of quantizer. Furthermore, the parameter processing unit may be configured to re-quantize one, some or all of the spatial parameters of the one or more sets of spatial parameters in accordance with a second type of quantizer having a lower resolution than the first type, if it is determined that the number of bits of the current spatial metadata frame exceeds the maximum number of metadata bits. In this way, the number of bits of the current spatial metadata frame can be reduced while affecting the quality of the upmix signal only to a limited extent and without significantly increasing the computational complexity of the audio encoding system.
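A fallback from fine to coarse quantization might look like the sketch below. The uniform quantizer, the step sizes 0.1 and 0.2, and the sign-magnitude bit-cost function are all assumptions made for illustration; the text only states that a lower-resolution quantizer is used when the frame is over budget.

```python
FINE_STEP = 0.1    # assumed step size of the fine quantizer
COARSE_STEP = 0.2  # assumed step size of the coarse quantizer


def quantize(values, step):
    """Uniform quantizer: index of the nearest multiple of step."""
    return [round(v / step) for v in values]


def sign_magnitude_bits(indices):
    """Hypothetical cost model: magnitude bits plus one sign bit per index."""
    return sum(abs(i).bit_length() + 1 for i in indices)


def quantize_with_budget(values, max_bits, bits_for=sign_magnitude_bits):
    """Quantize finely; re-quantize coarsely if the fine result is too large."""
    indices = quantize(values, FINE_STEP)
    if bits_for(indices) <= max_bits:
        return indices, FINE_STEP
    return quantize(values, COARSE_STEP), COARSE_STEP
```

Under this cost model the values 0.8, -0.4 and 1.2 need 14 bits with the fine quantizer and 11 bits with the coarse one, so a budget of 12 bits triggers the coarse quantizer. Only the quantization step is repeated; the parameters themselves are not recomputed.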

The parameter processing unit may be configured to determine a set of temporal difference parameters based on the difference of the current set of spatial parameters relative to the directly preceding set of spatial parameters. Specifically, a temporal difference parameter may be determined as the difference between a parameter of the current set of spatial parameters and the corresponding parameter of the directly preceding set. A set of spatial parameters may comprise, for example, the parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂ described in this document. Typically, only one of the parameters k₁, k₂ needs to be transmitted, because the two are related by k₁² + k₂² = 1. By way of example, only k₁ may be transmitted and k₂ may be computed at the receiver. A temporal difference parameter may relate to the difference of a corresponding one of the parameters mentioned above.
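The two relations in this paragraph can be written out in a few lines. This is a sketch under assumptions: the parameters are taken to be quantizer indices stored in lists of equal length, and k₂ is assumed to be non-negative, which the text does not state.

```python
import math


def temporal_differences(current, previous):
    """Difference of each parameter relative to the directly preceding set."""
    return [c - p for c, p in zip(current, previous)]


def reconstruct_k2(k1: float) -> float:
    """Recover k2 from k1 using k1**2 + k2**2 == 1.

    Assumes k2 >= 0; the max() guards against tiny negative values
    caused by rounding when |k1| is very close to 1.
    """
    return math.sqrt(max(0.0, 1.0 - k1 * k1))
```

For instance, `temporal_differences([3, 5, -1], [2, 5, 1])` yields `[1, 0, -2]`, and a transmitted k₁ of 0.6 gives a k₂ of approximately 0.8.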

The parameter processing unit may be configured to encode the set of temporal difference parameters using entropy coding (for example, Huffman codes). Furthermore, the parameter processing unit may be configured to insert the encoded set of temporal difference parameters into the current spatial metadata frame. In addition, the parameter processing unit may be configured to reduce the entropy of the set of temporal difference parameters if it is determined that the number of bits of the current spatial metadata frame exceeds the maximum number of metadata bits. As a result, fewer bits are needed to entropy-encode the temporal difference parameters, which reduces the number of bits of the current spatial metadata frame. By way of example, the parameter processing unit may set one, some or all of the temporal difference parameters of the set to a value that has an increased (for example, the highest) probability among the possible values of the temporal difference parameters. Specifically, the probability is increased compared with the probability of the temporal difference parameter before the setting operation. Typically, the value with the highest probability among the possible values of a temporal difference parameter is zero.
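The entropy reduction step can be sketched as follows. The code-length table is a hypothetical stand-in for a Huffman code in which zero is the most probable difference, and the order in which differences are zeroed (largest magnitude first) is an assumption; the text only says that one, some or all differences may be set to a more probable value.

```python
# Hypothetical code lengths in bits; 0 is the most probable difference.
CODE_LENGTH = {0: 1, 1: 3, -1: 3, 2: 5, -2: 5}


def coded_bits(diffs):
    """Total number of bits needed to entropy-encode the differences."""
    return sum(CODE_LENGTH[d] for d in diffs)


def reduce_entropy(diffs, max_bits):
    """Set differences to zero until the coded size fits the budget."""
    out = list(diffs)
    # Assumed order: zero the most expensive (largest) differences first.
    for i in sorted(range(len(out)), key=lambda i: -abs(out[i])):
        if coded_bits(out) <= max_bits:
            break
        out[i] = 0
    return out
```

A zeroed temporal difference means the decoder holds the previous value of that parameter, so an encoder using this approach would need to base the next frame's differences on the values the decoder actually reconstructs.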

It should be noted that temporal difference coding of a set of spatial parameters typically cannot be used for independent frames. The parameter processing unit may therefore be configured to verify whether the current spatial metadata frame is an independent frame and to apply temporal difference coding only if it is not. The frequency difference coding described below, on the other hand, can also be used for independent frames.

The one or more control settings may comprise a frequency resolution setting, which indicates the number of different frequency bands for which respective spatial parameters (referred to as band parameters) are to be determined. The parameter processing unit may be configured to determine different corresponding spatial parameters (band parameters) for the different frequency bands. Specifically, different parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂ may be determined for the different frequency bands. A set of spatial parameters may therefore comprise corresponding band parameters for the different frequency bands. By way of example, a set of spatial parameters may comprise T corresponding band parameters for T frequency bands, where T is an integer, for example T = 7, 9, 12 or 15.

The parameter processing unit may be configured to determine a set of frequency difference parameters based on the difference of one or more band parameters in a first frequency band relative to the corresponding one or more band parameters in an adjacent second frequency band. Furthermore, the parameter processing unit may be configured to encode the set of frequency difference parameters using entropy coding (for example, based on Huffman codes) and to insert the encoded set into the current spatial metadata frame. In addition, the parameter processing unit may be configured to reduce the entropy of the set of frequency difference parameters if it is determined that the number of bits of the current spatial metadata frame exceeds the maximum number of metadata bits. Specifically, the parameter processing unit may set one, some or all of the frequency difference parameters of the set to a value that has an increased probability among the possible values of the frequency difference parameters (for example, zero). The probability is increased compared with the probability of the frequency difference parameter before the setting operation.
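Frequency difference coding of one band parameter across T bands can be sketched as below. The convention that the lowest band is sent as an absolute value and every other band as the difference from its lower neighbour is an assumption for illustration; the text only requires differences between adjacent bands. Because no earlier frame is referenced, the scheme also works for independent frames.

```python
from itertools import accumulate


def frequency_differences(band_params):
    """Lowest band as-is, every other band relative to its lower neighbour."""
    if not band_params:
        return []
    return [band_params[0]] + [hi - lo for lo, hi in zip(band_params, band_params[1:])]


def reconstruct_bands(diffs):
    """Inverse operation: a running sum restores the band parameters."""
    return list(accumulate(diffs))
```

For T = 7 bands with the (made-up) quantizer indices `[4, 5, 5, 3, 3, 3, 2]`, the differences are `[4, 1, 0, -2, 0, 0, -1]`, and the running sum restores the original indices.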

Alternatively or additionally, the parameter processing unit may be configured to reduce the number of frequency bands if it is determined that the number of bits of the current spatial metadata frame exceeds the maximum number of metadata bits. The parameter processing unit may then re-determine some or all of the one or more sets of spatial parameters for the current spatial metadata frame using the reduced number of frequency bands. Typically, a change in the number of frequency bands mainly affects the high-frequency bands. As a result, the band parameters of one or more frequency bands may remain unaffected, so that the parameter processing unit may not need to recalculate all band parameters.

As indicated above, the one or more external settings may comprise an update period indicating the time period that a corresponding decoding system needs in order to synchronize to the bitstream. Furthermore, the one or more control settings may comprise an indication of whether the current spatial metadata frame is to be encoded as an independent frame. The parameter processing unit may be configured to determine a sequence of spatial metadata frames for a corresponding sequence of frames of the multichannel input signal. The configuration unit may be configured to determine, based on the update period, which spatial metadata frames of the sequence are to be encoded as independent frames.

Specifically, the one or more independent spatial metadata frames may be determined such that the update period is met (on average). For this purpose, the configuration unit may be configured to determine whether the current frame of the sequence of frames of the multichannel input signal comprises a sample at a time instant that is an integer multiple of the update period (relative to the start of the multichannel input signal). If so, the configuration unit may determine that the current spatial metadata frame, which corresponds to the current frame, is an independent frame. If the current spatial metadata frame is to be encoded as an independent frame, the parameter processing unit may be configured to encode one or more sets of spatial parameters of the current spatial metadata frame independently of the data comprised in a previous (and/or future) spatial metadata frame. Typically, all sets of spatial parameters of an independent spatial metadata frame are encoded independently of the data comprised in a previous (and/or future) spatial metadata frame.
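The test for an independent frame can be expressed with integer arithmetic on sample indices. The function below is a sketch: its name is hypothetical, the update period is assumed to be given in samples, and the value of 24000 samples (0.5 s at an assumed 48 kHz) is chosen only for the example. The frame length of 1536 samples is the example from the text.

```python
def is_independent_frame(frame_index: int, frame_len: int, update_period: int) -> bool:
    """True if the frame contains a sample index that is a multiple of update_period.

    Sample indices are counted from the start of the multichannel input signal.
    """
    start = frame_index * frame_len
    end = start + frame_len  # exclusive
    # Smallest multiple of update_period that is >= start (ceiling division).
    first_multiple = -(-start // update_period) * update_period
    return first_multiple < end


# Assumed: 1536-sample frames, update period of 24000 samples.
independent = [i for i in range(40) if is_independent_frame(i, 1536, 24000)]
```

With these values frames 0, 15 and 31 are marked as independent. The spacing alternates between 15 and 16 frames, which is why the update period is only met on average.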

According to another aspect, a parameter processing unit is described that is configured to determine a spatial metadata frame for generating a frame of a multichannel upmix signal from a corresponding frame of a downmix signal. The downmix signal may comprise m channels and the multichannel upmix signal may comprise n channels, where n and m are integers and m < n. As outlined above, the spatial metadata frame may comprise one or more sets of spatial parameters.

The parameter processing unit may comprise a transform unit configured to determine a plurality of spectra from the current frame and the directly following frame (referred to as the look-ahead frame) of a channel of the multichannel input signal. The transform unit may use a filter bank, for example a QMF filter bank. Each spectrum of the plurality of spectra may comprise a predetermined number of transform coefficients in a corresponding predetermined number of frequency bins. The plurality of spectra may be associated with a corresponding plurality of time intervals (or time instants). The transform unit can thus provide a time/frequency representation of the current frame and the look-ahead frame. By way of example, the current frame and the look-ahead frame may each comprise K samples. The transform unit may then determine 2·K/Q spectra, each comprising Q transform coefficients.

The parameter processing unit may comprise a parameter determination unit configured to determine the spatial metadata frame for the current frame of the channels of the multichannel input signal by weighting the plurality of spectra with a window function. The window function can be used to adjust the influence of individual spectra on a particular spatial parameter or on a particular set of spatial parameters. By way of example, the window function may take values between 0 and 1.

The window function may depend on one or more of: the number of sets of spatial parameters comprised in the spatial metadata frame, the presence of one or more transients in the current frame or in the directly following frame of the multichannel input signal, and/or the time instant of a transient. In other words, the window function may be adapted to the properties of the current frame and/or the look-ahead frame. Specifically, the window function used to determine a set of spatial parameters (referred to as a set-dependent window function) may depend on the properties of the current frame and/or the look-ahead frame.

The window function may thus comprise a set-dependent window function. Specifically, the window function used to determine the spatial parameters of a spatial metadata frame may comprise (or may be made up of) one or more set-dependent window functions, one for each of the one or more sets of spatial parameters. The parameter determination unit may be configured to determine a set of spatial parameters for the current frame of the channels of the multichannel input signal (that is, for the current spatial metadata frame) by weighting the plurality of spectra with the set-dependent window function. As outlined above, the set-dependent window function may depend on one or more properties of the current frame. Specifically, it may depend on whether the set of spatial parameters is associated with a transient.

By way of example, if the set of spatial parameters is not associated with a transient, the set-dependent window function may phase in the plurality of spectra from the sampling point of the preceding set of spatial parameters up to the sampling point of the set of spatial parameters. The phase-in may be provided by a window function that transitions from 0 to 1. Alternatively or additionally, if the set of spatial parameters is not associated with a transient and the following set of spatial parameters is associated with a transient, the set-dependent window function may include the spectra from the sampling point of the set of spatial parameters up to the spectrum preceding the sampling point of the following set (that is, it may fully consider these spectra or leave them unaffected). This may be achieved by a window function with the value 1. In the same case, the set-dependent window function may cancel out the spectra starting from the sampling point of the following set (that is, it may exclude or attenuate these spectra). This may be achieved by a window function with the value 0. Alternatively or additionally, if neither the set of spatial parameters nor the following set is associated with a transient, the set-dependent window function may phase out the spectra from the sampling point of the set of spatial parameters up to the spectrum preceding the sampling point of the following set. The phase-out may be provided by a window function that transitions from 1 to 0.

On the other hand, if the set of spatial parameters is associated with a transient, the set-dependent window function may cancel out (exclude or attenuate) the spectra preceding the sampling point of the set of spatial parameters. Alternatively or additionally, if the sampling point of the following set is also associated with a transient, the set-dependent window function may include (leave unaffected) the spectra from the sampling point of the set of spatial parameters up to the spectrum preceding the sampling point of the following set, and may cancel out (exclude or attenuate) the spectra starting from the sampling point of the following set. Alternatively or additionally, if the following set is not associated with a transient, the set-dependent window function may include (leave unaffected) the spectra from the sampling point of the set of spatial parameters up to the spectrum at the end of the current frame, and may phase out (gradually attenuate) the spectra from the beginning of the directly following frame up to the sampling point of the following set.
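The case distinctions above can be condensed into a small function that returns one weight per time slot. This is a sketch under several assumptions that the text does not fix: the ramps are linear, the weight is 0 from the sampling point of the following set onward in every case, and the function name and argument names are hypothetical. The example uses 24 slots per frame, which corresponds to 1536 samples and an assumed Q of 64.

```python
def set_window(n_slots, prev_sp, sp, next_sp, transient, next_transient, frame_end):
    """Weights for the set whose sampling point is sp.

    prev_sp and next_sp are the sampling points of the neighbouring sets
    (prev_sp may be negative, i.e. lie in the previous frame). frame_end is
    the index of the first slot of the look-ahead frame.
    """
    weights = []
    for t in range(n_slots):
        if t < sp:
            if transient or t <= prev_sp:
                weights.append(0.0)  # cancel out spectra before a transient
            else:
                weights.append((t - prev_sp) / (sp - prev_sp))  # phase-in, 0 -> 1
        elif t >= next_sp:
            weights.append(0.0)  # left to the following set
        elif next_transient:
            weights.append(1.0)  # fully include up to the next transient
        elif transient:
            if t < frame_end:
                weights.append(1.0)  # include up to the end of the current frame
            else:
                weights.append((next_sp - t) / (next_sp - frame_end))  # phase-out
        else:
            weights.append((next_sp - t) / (next_sp - sp))  # phase-out, 1 -> 0
    return weights


# One non-transient set per frame, sampling point at the last slot of each frame.
w = set_window(48, prev_sp=-1, sp=23, next_sp=47,
               transient=False, next_transient=False, frame_end=24)
```

In the example the weights rise linearly to 1 at slot 23 and fall back to 0 at slot 47, so the spectra of the look-ahead frame contribute to the current set with decreasing weight.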

According to another aspect, a parameter processing unit is described that is configured to determine a spatial metadata frame for generating a frame of a multichannel upmix signal from a corresponding frame of a downmix signal. The downmix signal may comprise m channels and the multichannel upmix signal may comprise n channels, where n and m are integers and m < n. As discussed above, the spatial metadata frame may comprise a set of spatial parameters.

As outlined above, the parameter processing unit may comprise a transform unit. The transform unit may be configured to determine a first plurality of transform coefficients from a frame of a first channel of the multichannel input signal, and a second plurality of transform coefficients from the corresponding frame of a second channel of the multichannel input signal. The first and second channels may be different. The first and second pluralities of transform coefficients thus provide a first and a second time/frequency representation of the corresponding frames of the first and second channels, respectively. As outlined above, the first and second time/frequency representations comprise a plurality of frequency bins and a plurality of time intervals.

Furthermore, the parameter processing unit may comprise a parameter determination unit configured to determine the set of spatial parameters using fixed-point arithmetic, based on the first and second pluralities of transform coefficients. As indicated above, the set of spatial parameters typically comprises corresponding band parameters for different frequency bands, where the different frequency bands may comprise different numbers of frequency intervals. A particular band parameter for a particular frequency band may be determined based on the transform coefficients of the first and second pluralities that fall within that frequency band (typically without considering the transform coefficients of other frequency bands). The parameter determination unit may be configured to determine the shift used by the fixed-point arithmetic for determining the particular band parameter in dependence on the particular frequency band. In particular, the shift used for determining the particular band parameter for the particular frequency band may depend on the number of frequency intervals comprised within that frequency band. Alternatively or additionally, the shift may depend on the number of time intervals that are to be considered for determining the particular band parameter.
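By way of illustration only, one reason why the shift may depend on the number of frequency intervals and time intervals is accumulator growth: summing N terms can enlarge the result by up to ceil(log2(N)) bits. The following minimal sketch computes that headroom. The function name and the use of this headroom as (part of) the shift are assumptions made for illustration and are not taken from this document.

```python
def accumulation_guard_bits(num_freq_intervals: int, num_time_intervals: int) -> int:
    """Hypothetical helper: worst-case bit growth when accumulating one term
    per time/frequency tile of a band, i.e. ceil(log2(number of terms))."""
    terms = num_freq_intervals * num_time_intervals
    if terms < 1:
        raise ValueError("a band must cover at least one time/frequency tile")
    # For terms >= 1, (terms - 1).bit_length() equals ceil(log2(terms)).
    return (terms - 1).bit_length()


# A band spanning 4 frequency intervals and 6 time intervals accumulates
# 24 terms, so up to 5 extra bits of headroom may be needed.
guard = accumulation_guard_bits(4, 6)
```

Under this assumption, a wide band at high frequencies would call for a larger right shift before accumulation than a narrow band at low frequencies.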

The parameter determination unit may be configured to determine the shift for the particular frequency band such that the precision of the particular band parameter is maximized. This may be achieved by determining the shift required for each multiply-and-add operation of the process for determining the particular band parameter.

The parameter determination unit may be configured to determine the particular band parameter for a particular frequency band p by determining a first energy (or energy estimate) E_{1,1}(p) based on the transform coefficients of the first plurality that fall within frequency band p. Furthermore, a second energy (or energy estimate) E_{2,2}(p) may be determined based on the transform coefficients of the second plurality that fall within frequency band p. In addition, a cross product or covariance E_{1,2}(p) may be determined based on the transform coefficients of the first and second pluralities that fall within frequency band p. The parameter determination unit may be configured to determine the shift z_p for the particular band parameter of frequency band p based on the maximum of the first energy estimate E_{1,1}(p), the second energy estimate E_{2,2}(p) and the absolute value of the covariance E_{1,2}(p).
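The following minimal sketch illustrates one plausible way of deriving a shift z_p from the maximum of E_{1,1}(p), E_{2,2}(p) and |E_{1,2}(p)|, namely by normalizing the largest magnitude to the available headroom of a signed word. The function names, the 32-bit word length and the normalization rule itself are assumptions made for illustration; this document only states that the shift is based on the maximum.

```python
def band_shift(e11: int, e22: int, e12: int, word_bits: int = 32) -> int:
    """Hypothetical helper: shift z_p such that the largest of e11, e22 and
    |e12| just fits into a signed word of word_bits bits.
    A positive result means a left shift, a negative result a right shift."""
    peak = max(abs(e11), abs(e22), abs(e12))
    if peak == 0:
        return 0
    # A signed word offers word_bits - 1 magnitude bits.
    return (word_bits - 1) - peak.bit_length()


def apply_shift(value: int, shift: int) -> int:
    """Apply a signed shift: left for shift >= 0, right otherwise."""
    return value << shift if shift >= 0 else value >> -shift


# The same shift is applied to all three quantities of the band so that
# ratios between them are preserved.
z_p = band_shift(3, 5, -9)
scaled = [apply_shift(v, z_p) for v in (3, 5, -9)]
```

Because all three quantities share one shift, ratios such as E_{1,2}(p)/E_{1,1}(p) are unaffected while the available word length is used as fully as possible.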

According to another aspect, an audio encoding system is described that is configured to generate a bitstream indicative of a sequence of frames of a downmix signal and of a corresponding sequence of spatial metadata frames for generating a corresponding sequence of frames of a multi-channel upmix signal from the sequence of frames of the downmix signal. The system may comprise a downmix processing unit configured to generate the sequence of frames of the downmix signal from a corresponding sequence of frames of a multi-channel input signal. As indicated above, the downmix signal may comprise m channels and the multi-channel input signal may comprise n channels, n and m being integers with m < n. Furthermore, the audio encoding system may comprise a parameter processing unit configured to determine the sequence of spatial metadata frames from the sequence of frames of the multi-channel input signal.

In addition, the audio encoding system may comprise a bitstream generation unit configured to generate a bitstream comprising a sequence of bitstream frames, where a bitstream frame is indicative of the frame of the downmix signal corresponding to a first frame of the multi-channel input signal and of the spatial metadata frame corresponding to a second frame of the multi-channel input signal. The second frame may be different from the first frame. In particular, the first frame may precede the second frame. By doing this, the spatial metadata frame for a current frame may be transmitted together with the downmix frame of a subsequent frame. This ensures that the spatial metadata frame arrives at the corresponding decoding system only when it is needed. The decoding system typically decodes a current frame of the downmix signal and generates a decorrelated frame based on the current frame of the downmix signal. This processing introduces an algorithmic delay. By delaying the spatial metadata frame for the current frame, it is ensured that the spatial metadata frame arrives at the decoding system only once the decoded current frame and the decorrelated frame are available. As a result, the processing power and memory requirements of the decoding system may be reduced.

In other words, an audio encoding system is described that is configured to generate a bitstream based on a multi-channel input signal. As outlined above, the system may comprise a downmix processing unit configured to generate a sequence of frames of a downmix signal from a corresponding first sequence of frames of the multi-channel input signal. The downmix signal may comprise m channels and the multi-channel input signal may comprise n channels, n and m being integers with m < n. Furthermore, the audio encoding system may comprise a parameter processing unit configured to determine a sequence of spatial metadata frames from a second sequence of frames of the multi-channel input signal. The sequence of frames of the downmix signal and the sequence of spatial metadata frames may be used by a corresponding decoding system to generate a multi-channel upmix signal comprising n channels.

The audio encoding system may further comprise a bitstream generation unit configured to generate a bitstream comprising a sequence of bitstream frames, where a bitstream frame may be indicative of the frame of the downmix signal corresponding to a first frame of the first sequence of frames of the multi-channel input signal and of the spatial metadata frame corresponding to a second frame of the second sequence of frames of the multi-channel input signal. The second frame may be different from the first frame. In other words, the framing used for determining the spatial metadata frames and the framing used for determining the frames of the downmix signal may be different. As outlined above, the different framings may be used to ensure that the data is aligned at the corresponding decoding system.

The first frame and the second frame typically comprise the same number of samples (e.g., 1536 samples). Some of the samples of the first frame may lead the samples of the second frame. In particular, the first frame may lead the second frame by a predetermined number of samples. The predetermined number of samples may, for example, correspond to a fraction of the number of samples of a frame. By way of example, the predetermined number of samples may correspond to 50% or more of the number of samples of a frame. In a particular example, the predetermined number of samples is 928 samples. As shown in this document, this particular number of samples provides a minimum overall delay and optimal alignment for a particular implementation of the audio encoding and decoding system.
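The relation between the two framings may be sketched as follows, using the example values of 1536 samples per frame and a lead of 928 samples. The function name and the choice of sample index 0 as the start of the first framing are assumptions made for illustration.

```python
FRAME_LEN = 1536  # samples per frame (example value from the text)
LEAD = 928        # samples by which the first framing leads the second (example value)


def frame_bounds(k: int) -> tuple[range, range]:
    """Hypothetical helper: sample index ranges covered by the k-th frame of
    the first framing (downmix) and of the second framing (spatial metadata).
    The first framing is assumed to start at sample 0."""
    first = range(k * FRAME_LEN, (k + 1) * FRAME_LEN)
    second = range(k * FRAME_LEN + LEAD, (k + 1) * FRAME_LEN + LEAD)
    return first, second


first, second = frame_bounds(0)
# Samples shared by the k-th frames of both framings.
overlap = len(range(max(first.start, second.start), min(first.stop, second.stop)))
```

With these values, the lead amounts to roughly 60% of a frame, which is consistent with the statement that the predetermined number of samples may be 50% or more of the frame length.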

According to another aspect, an audio encoding system is described that is configured to generate a bitstream based on a multi-channel input signal. The system may comprise a downmix processing unit configured to determine a sequence of clipping protection gains (also referred to in this document as clip-gains and/or DRC2 parameters) for a corresponding sequence of frames of the multi-channel input signal. A current clipping protection gain may indicate an attenuation to be applied to a current frame of the multi-channel input signal in order to prevent clipping of the corresponding current frame of the downmix signal. In a similar manner, the sequence of clipping protection gains may indicate the respective attenuations to be applied to the frames of the sequence of frames of the multi-channel input signal in order to prevent clipping of the corresponding frames of the sequence of frames of the downmix signal.

The downmix processing unit may be configured to interpolate between the current clipping protection gain and a preceding clipping protection gain of a preceding frame of the multi-channel input signal to yield a clipping protection gain curve. This may be done in a similar manner for the whole sequence of clipping protection gains. Furthermore, the downmix processing unit may be configured to apply the clipping protection gain curve to the current frame of the multi-channel input signal to yield an attenuated current frame of the multi-channel input signal. Again, this may be done in a similar manner for the whole sequence of frames of the multi-channel input signal. Furthermore, the downmix processing unit may be configured to generate the current frame of the sequence of frames of the downmix signal from the attenuated current frame of the multi-channel input signal. The sequence of frames of the downmix signal may be generated in a similar manner.

The audio processing system may further comprise a parameter processing unit configured to determine a sequence of spatial metadata frames from the multi-channel input signal. The sequence of frames of the downmix signal and the sequence of spatial metadata frames may be used to generate a multi-channel upmix signal comprising n channels, such that the multi-channel upmix signal is an approximation of the multi-channel input signal. In addition, the audio processing system may comprise a bitstream generation unit configured to generate a bitstream indicative of the sequence of clipping protection gains, the sequence of frames of the downmix signal and the sequence of spatial metadata frames, so as to enable a corresponding decoding system to generate the multi-channel upmix signal.

The clipping protection gain curve may comprise a transition segment, which provides a smooth transition from the preceding clipping protection gain to the current clipping protection gain, and a flat segment, which remains flat at the current clipping protection gain. The transition segment may extend across a predetermined number of samples of the current frame of the multi-channel input signal. The predetermined number of samples may be greater than one and smaller than the total number of samples of the current frame of the multi-channel input signal. In particular, the predetermined number of samples may correspond to a block of samples (where a frame may comprise a plurality of blocks) or to a frame. In a particular example, a frame may comprise 1536 samples and a block may comprise 256 samples.
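A clipping protection gain curve with a transition segment and a flat segment may be sketched as follows, using the example values of 1536 samples per frame and a transition of one 256-sample block. The linear shape of the transition and the function names are assumptions made for illustration; the text only requires the transition to be smooth.

```python
def clip_gain_curve(prev_gain: float, curr_gain: float,
                    frame_len: int = 1536, transition_len: int = 256) -> list[float]:
    """Hypothetical helper: per-sample gain curve for one frame.
    The first transition_len samples move linearly (an assumption) from
    prev_gain towards curr_gain; the remaining samples stay at curr_gain."""
    curve = []
    for i in range(frame_len):
        if i < transition_len:
            t = (i + 1) / transition_len  # reaches 1.0 on the last transition sample
            curve.append(prev_gain + t * (curr_gain - prev_gain))
        else:
            curve.append(curr_gain)
    return curve


def apply_curve(frame: list[list[float]], curve: list[float]) -> list[list[float]]:
    """Attenuate every channel of a frame (list of channels) with the curve."""
    return [[s * g for s, g in zip(channel, curve)] for channel in frame]


curve = clip_gain_curve(prev_gain=1.0, curr_gain=0.5)
attenuated = apply_curve([[1.0] * 1536, [-1.0] * 1536], curve)
```

The attenuated frame would then be passed to the downmix, so that the gain change between consecutive frames does not introduce an audible step.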

According to another aspect, an audio encoding system is described that is configured to generate a bitstream indicative of a downmix signal and of spatial metadata for generating a multi-channel upmix signal from the downmix signal. The system may comprise a downmix processing unit configured to generate the downmix signal from a multi-channel input signal. Furthermore, the system may comprise a parameter processing unit configured to determine a sequence of spatial metadata frames for a corresponding sequence of frames of the multi-channel input signal.

Furthermore, the audio encoding system may comprise a configuration unit configured to determine one or more control settings for the parameter processing unit based on one or more external settings. The one or more external settings may comprise an update period, which indicates the period of time that a corresponding decoding system requires to synchronize to the bitstream. The configuration unit may be configured to determine, based on the update period, one or more independent spatial metadata frames from the sequence of spatial metadata frames, which are to be encoded independently.
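One conceivable way of selecting the independent spatial metadata frames from the update period is to mark frames at a regular interval that does not exceed the update period. The following sketch does this; the function name, the 48 kHz sampling rate and the regular-interval rule are assumptions made for illustration and are not specified in this document.

```python
import math


def independent_frame_indices(num_frames: int, update_period_s: float,
                              frame_len: int = 1536,
                              sample_rate: int = 48000) -> list[int]:
    """Hypothetical helper: indices of the spatial metadata frames to be
    encoded independently so that a decoder can synchronize within
    update_period_s seconds (at least every frame if the period is short)."""
    frame_duration_s = frame_len / sample_rate  # 32 ms for the assumed values
    interval = max(1, math.floor(update_period_s / frame_duration_s))
    return [k for k in range(num_frames) if k % interval == 0]


# With an update period of 100 ms and 32 ms frames, every third frame
# is marked as independent.
indices = independent_frame_indices(num_frames=10, update_period_s=0.1)
```

A shorter update period yields more independent frames and therefore faster synchronization, at the cost of a higher data rate for the spatial metadata.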

According to another aspect, a method for generating a bitstream is described, the bitstream being indicative of a downmix signal and of spatial metadata for generating a multi-channel upmix signal from the downmix signal. The method may comprise generating the downmix signal from a multi-channel input signal. Furthermore, the method may comprise determining one or more control settings based on one or more external settings, where the one or more external settings comprise a target data rate for the bitstream and where the one or more control settings comprise a maximum data rate for the spatial metadata. In addition, the method may comprise determining the spatial metadata from the multi-channel input signal in accordance with the one or more control settings.

According to another aspect, a method is described for determining a spatial metadata frame for generating a frame of a multi-channel upmix signal from a corresponding frame of a downmix signal. The method may comprise determining a plurality of spectra from a current frame and an immediately following frame of a channel of a multi-channel input signal. Furthermore, the method may comprise weighting the plurality of spectra using a window function to yield a plurality of weighted spectra. In addition, the method may comprise determining the spatial metadata frame for the current frame of the channel of the multi-channel input signal based on the plurality of weighted spectra. The window function may depend on one or more of: the number of sets of spatial parameters comprised within the spatial metadata frame, the presence of a transient in the current frame or in the immediately following frame of the multi-channel input signal, and/or the time instant of the transient.

According to another aspect, a method is described for determining a spatial metadata frame for generating a frame of a multi-channel upmix signal from a corresponding frame of a downmix signal. The method may comprise determining a first plurality of transform coefficients from a frame of a first channel of a multi-channel input signal, and determining a second plurality of transform coefficients from the corresponding frame of a second channel of the multi-channel input signal. As outlined above, the first and second pluralities of transform coefficients typically provide a first and a second time/frequency representation of the corresponding frames of the first and second channels, respectively. The first and second time/frequency representations may comprise a plurality of frequency intervals and a plurality of time intervals. The set of spatial parameters may comprise corresponding band parameters for different frequency bands, each comprising a different number of frequency intervals. The method may further comprise determining the shift to be applied when determining a particular band parameter for a particular frequency band using fixed-point arithmetic. The shift may be determined based on the number of time intervals that are to be considered for determining the particular band parameter. In addition, the method may comprise determining the particular band parameter using fixed-point arithmetic and the determined shift, based on the transform coefficients of the first and second pluralities that fall within the particular frequency band.

A method for generating a bitstream based on a multi-channel input signal is described. The method may comprise generating a sequence of frames of a downmix signal from a corresponding first sequence of frames of the multi-channel input signal. Furthermore, the method may comprise determining a sequence of spatial metadata frames from a second sequence of frames of the multi-channel input signal. The sequence of frames of the downmix signal and the sequence of spatial metadata frames may be used to generate a multi-channel upmix signal. In addition, the method may comprise generating a bitstream comprising a sequence of bitstream frames. A bitstream frame may be indicative of the frame of the downmix signal corresponding to a first frame of the first sequence of frames of the multi-channel input signal and of the spatial metadata frame corresponding to a second frame of the second sequence of frames of the multi-channel input signal. The second frame may be different from the first frame.

According to another aspect, a method for generating a bitstream based on a multi-channel input signal is described. The method may comprise determining a sequence of clipping protection gains for a corresponding sequence of frames of the multi-channel input signal. A current clipping protection gain may indicate an attenuation to be applied to a current frame of the multi-channel input signal in order to prevent clipping of the corresponding current frame of the downmix signal. The method may proceed by interpolating between the current clipping protection gain and a preceding clipping protection gain of a preceding frame of the multi-channel input signal to yield a clipping protection gain curve. Furthermore, the method may comprise applying the clipping protection gain curve to the current frame of the multi-channel input signal to yield an attenuated current frame of the multi-channel input signal. The current frame of the sequence of frames of the downmix signal may be generated from the attenuated current frame of the multi-channel input signal. In addition, the method may comprise determining a sequence of spatial metadata frames from the multi-channel input signal. The sequence of frames of the downmix signal and the sequence of spatial metadata frames may be used to generate a multi-channel upmix signal. The bitstream may be generated such that it is indicative of the sequence of clipping protection gains, the sequence of frames of the downmix signal and the sequence of spatial metadata frames, so as to enable the generation of the multi-channel upmix signal based on the bitstream.

According to another aspect, a method for generating a bitstream is described, the bitstream being indicative of a downmix signal and of spatial metadata for generating a multi-channel upmix signal from the downmix signal. The method may comprise generating the downmix signal from a multi-channel input signal. Furthermore, the method may comprise determining one or more control settings based on one or more external settings, where the one or more external settings comprise an update period indicating the period of time that a decoding system requires to synchronize to the bitstream. The method may further comprise determining a sequence of spatial metadata frames for a corresponding sequence of frames of the multi-channel input signal in accordance with the one or more control settings. In addition, the method may comprise encoding one or more spatial metadata frames of the sequence of spatial metadata frames as independent frames in accordance with the update period.

According to another aspect, a software program is described. The software program may be adapted for execution on a processor and for performing the method steps outlined in this document when carried out on the processor.

According to another aspect, a storage medium is described. The storage medium may comprise a software program adapted for execution on a processor and for performing the method steps outlined in this document when carried out on the processor.

According to another aspect, a computer program product is described. The computer program product may comprise executable instructions for performing the method steps outlined in this document when executed on a computer.

It should be noted that the methods and systems outlined in this patent application, including their preferred embodiments, may be used stand-alone or in combination with the other methods and systems disclosed in this document. Furthermore, all aspects of the methods and systems outlined in this patent application may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.

Brief Description of the Drawings

The invention is explained below by way of example with reference to the accompanying drawings, in which:

Figure 1 shows a generalized block diagram of an example audio processing system for performing spatial synthesis;

Figure 2 shows example details of the system of Figure 1;

Figure 3 shows, similarly to Figure 1, an example audio processing system for performing spatial synthesis;

Figure 4 shows an example audio processing system for performing spatial analysis;

Figure 5a shows a block diagram of an example parametric multi-channel audio encoding system;

Figure 5b shows a block diagram of an example spatial analysis and encoding system;

Figure 5c illustrates an example time-frequency representation of a frame of a channel of a multi-channel audio signal;

Figure 5d illustrates an example time-frequency representation of a plurality of channels of a multi-channel audio signal;

Figure 5e shows example windowing applied by the transform unit of the spatial analysis and encoding system shown in Figure 5b;

Figure 6 shows a flow chart of an example method for reducing the data rate of spatial metadata;

Figure 7a illustrates an example transition scheme for spatial metadata, performed at a decoding system;

Figures 7b to 7d illustrate example window functions applied for determining spatial metadata;

Figure 8 shows a block diagram of example processing paths of a parametric multi-channel codec system;

Figures 9a and 9b show block diagrams of example parametric multi-channel audio encoding systems configured to perform clipping protection and/or dynamic range control;

Figure 10 illustrates an example method for compensating DRC parameters; and

Figure 11 shows an example interpolation curve for clipping protection.

Detailed Description

As outlined in the introductory section, this document relates to multi-channel audio coding systems that make use of a parametric multi-channel representation. In the following, an example multi-channel audio encoding and decoding (codec) system is described. In the context of Figures 1 to 3, it is described how a decoder of the audio codec system may use a received parametric multi-channel representation to generate an n-channel upmix signal Y (typically n > 2) from a received m-channel downmix signal X (e.g., m = 2). Subsequently, the encoder-related processing of the multi-channel audio codec system is described. In particular, it is described how a parametric multi-channel representation and an m-channel downmix signal may be generated from an n-channel input signal.

Figure 1 illustrates a block diagram of an example audio processing system 100 configured to generate an upmix signal Y from a downmix signal X and a set of mixing parameters. In particular, the audio processing system 100 is configured to generate the upmix signal based solely on the downmix signal X and the set of mixing parameters. From a bitstream P, an audio decoder 140 extracts the downmix signal X = [l₀ r₀]^T and the set of mixing parameters. In the illustrated example, the set of mixing parameters comprises the parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂. The mixing parameters may be included in quantized and/or entropy-coded form in respective mixing parameter data fields of the bitstream P. The mixing parameters may be referred to as metadata (or spatial metadata), which is transmitted along with the encoded downmix signal X. In some instances of this disclosure, it is explicitly indicated that some connection lines are adapted to transmit multi-channel signals; these lines are drawn with a cross line next to the respective number of channels. In the system 100 shown in Figure 1, the downmix signal X comprises m = 2 channels, and the upmix signal Y, to be defined below, comprises n = 6 channels (e.g., 5.1 channels).

An upmix stage 110, the action of which depends parametrically on the mixing parameters, receives the downmix signal. A downmix modifying processor 120 modifies the downmix signal by nonlinear processing and by forming a linear combination of the downmix channels, so as to obtain a modified downmix signal D = [d₁ d₂]^T. A first mixing matrix 130 receives the downmix signal X and the modified downmix signal D and outputs the upmix signal Y = [l_f l_s r_f r_s c lfe]^T by forming the following linear combination:

In the above linear combination, the mixing parameter α₃ controls the contribution of a mid-type signal (proportional to l₀+r₀) formed from the downmix signal to all channels of the upmix signal. The mixing parameter β₃ controls the contribution of a side-type signal (proportional to l₀-r₀) to all channels of the upmix signal. Hence, in a use case, it may reasonably be expected that the mixing parameters α₃ and β₃ will have different statistical properties, which enables more efficient coding. (Consider, for comparison, a reference parameterization in which independent mixing parameters control the respective left-channel and right-channel contributions from the downmix signal to the spatially left and right channels of the upmix signal; the statistical observables of such mixing parameters may not differ notably.)

Returning to the linear combination shown in the above equation, it is further noted that the gain parameters k₁, k₂ may depend on a common single mixing parameter in the bitstream P. Furthermore, the gain parameters may be normalized such that k₁² + k₂² = 1.
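As a purely illustrative sketch, two gains that depend on a single common parameter and satisfy k₁² + k₂² = 1 may be obtained from an angle. The angle parameterization and the function name are assumptions; this document does not state how the common parameter is mapped to k₁ and k₂.

```python
import math


def gains_from_angle(theta: float) -> tuple[float, float]:
    """Hypothetical mapping from one transmitted parameter (an angle, by
    assumption) to a pair of gains with k1**2 + k2**2 == 1."""
    return math.cos(theta), math.sin(theta)


k1, k2 = gains_from_angle(0.3)
norm = k1 * k1 + k2 * k2  # equals 1 up to floating-point rounding
```

Any such mapping would allow a single value to be transmitted instead of two, since the normalization fixes the second degree of freedom.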

The contributions of the modified downmix signal to the spatially left and right channels of the upmix signal may be controlled separately by the parameters β₁ (contribution of the first modified channel to the left channels) and β₂ (contribution of the second modified channel to the right channels). Furthermore, the contribution of each channel of the downmix signal to its spatially corresponding channels of the upmix signal may be controlled individually by varying the independent mixing parameter g. Preferably, the gain parameter g is quantized non-uniformly so as to avoid large quantization errors.

Referring now additionally to Figure 2, the downmix modifying processor 120 may perform, in a second mixing matrix 121, the following linear combination (which is a cross mixing) of the downmix channels:

As indicated by the formula, the gains populating the second mixing matrix may depend parametrically on some of the mixing parameters encoded in the bitstream P. The processing carried out by the second mixing matrix 121 results in an intermediate signal Z=[z₁ z₂]^T, which is supplied to a decorrelator 122. Figure 1 shows an example in which the decorrelator 122 comprises two sub-decorrelators 123, 124, which may be identically configured (i.e., provide identical outputs in response to identical inputs) or differently configured. As an alternative to this, Figure 2 shows an example in which all decorrelation-related operations are carried out by a single unit 122, which outputs a preliminary modified downmix signal D'. The downmix modifying processor 120 in Figure 2 may further include an artifact attenuator 125. In an example embodiment, as outlined above, the artifact attenuator 125 is configured to detect sound endings in the intermediate signal Z and to take corrective action by attenuating undesirable artifacts in this signal based on the locations of the detected sound endings. This attenuation produces the modified downmix signal D, which is output from the downmix modifying processor 120.

Figure 3 shows a first mixing matrix 130 of a similar type as the one shown in Figure 1, together with its associated transform stages 301, 302 and inverse transform stages 311, 312, 313, 314, 315, 316. The transform stages may, for example, comprise filter banks such as quadrature mirror filter (QMF) banks. Hence, the signals located upstream of the transform stages 301, 302 are representations in the time domain, as are the signals located downstream of the inverse transform stages 311, 312, 313, 314, 315, 316. The other signals are frequency-domain representations. The time dependence of the other signals may, for example, be expressed as discrete values or blocks of values relating to the time blocks into which the signal is segmented. Note that Figure 3 uses alternative notation compared with the matrix equations above; one may, for example, have the correspondences X_L0~l₀, X_R0~r₀, Y_L~l_f, Y_LS~l_s, and so on. Further, the notation in Figure 3 emphasizes the distinction between the time-domain representation X_L0(t) of a signal and the frequency-domain representation X_L0(f) of the same signal. It is understood that the frequency-domain representation is segmented into time frames; hence, it is a function of both a time variable and a frequency variable.

Figure 4 shows an audio processing system 400 for generating the downmix signal X and the mixing parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂ that control the gains applied by the upmix stage 110. This audio processing system 400 is typically located on the encoder side, e.g., in broadcasting or recording equipment, whereas the system 100 shown in Figure 1 is typically deployed on the decoder side, e.g., in playback equipment. A downmix stage 410 produces an m-channel signal X on the basis of an n-channel signal Y. Preferably, the downmix stage 410 operates on the time-domain representations of these signals. A parameter extractor 420 may produce values of the mixing parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂ by analyzing the n-channel signal Y and taking into account the quantitative and qualitative properties of the downmix stage 410. The mixing parameters may be vectors of frequency-block values, as the notation in Figure 4 suggests, and may be further segmented into time blocks. In an example implementation, the downmix stage 410 is time-invariant and/or frequency-invariant. Because of the time invariance and/or frequency invariance, there is typically no need for a communication connection between the downmix stage 410 and the parameter extractor 420; instead, the parameter extraction may proceed independently. This provides great latitude for the implementation. It also makes it possible to reduce the total latency of the system, since several processing steps may be carried out in parallel. As one example, the Dolby Digital Plus format (or Enhanced AC-3) may be used for coding the downmix signal X.

The parameter extractor 420 may learn the quantitative and/or qualitative properties of the downmix stage 410 by accessing a downmix specification, which may specify one of the following: a set of gain values, an index identifying a predefined downmix mode for which gains are predefined, and so on. The downmix specification may be data preloaded into memories in each of the downmix stage 410 and the parameter extractor 420. Alternatively or additionally, the downmix specification may be transmitted from the downmix stage 410 to the parameter extractor 420 over a communication line connecting these units. As a further alternative, each of the downmix stage 410 and the parameter extractor 420 may access the downmix specification from a common data source, such as a memory (e.g., of the configuration unit 540 shown in Figure 5a) in the audio processing system or in a metadata stream associated with the input signal Y.

Figure 5a shows an example multichannel encoding system 500 for encoding a multichannel audio input signal Y 561 (comprising n channels) using a downmix signal X (comprising m channels, with m<n) and a parametric representation. The system 500 comprises a downmix encoding unit 510, which comprises, e.g., the downmix stage 410 of Figure 4. The downmix encoding unit 510 may be configured to provide an encoded version of the downmix signal X. The downmix encoding unit 510 may, e.g., make use of a Dolby Digital Plus encoder for encoding the downmix signal X. Furthermore, the system 500 comprises a parameter encoding unit 520, which may comprise the parameter extractor 420 of Figure 4. The parameter encoding unit 520 may be configured to quantize and encode the set of mixing parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁ (also referred to as spatial parameters) to yield encoded spatial parameters 562. As indicated above, the parameter k₂ may be determined from the parameter k₁. In addition, the system 500 may comprise a bitstream generation unit 530 configured to generate the bitstream P 564 from the encoded downmix signal 563 and from the encoded spatial parameters 562. The bitstream 564 may be encoded in accordance with a predetermined bitstream syntax. In particular, the bitstream 564 may be encoded in a format conforming to Dolby Digital Plus (DD+ or E-AC-3, Enhanced AC-3).

The system 500 may comprise a configuration unit 540 configured to determine one or more control settings 552, 554 for the parameter encoding unit 520 and/or for the downmix encoding unit 510. The one or more control settings 552, 554 may be determined based on one or more external settings 551 of the system 500. By way of example, the one or more external settings 551 may comprise a total (maximum or fixed) data rate of the bitstream 564. The configuration unit 540 may be configured to determine the one or more control settings 552 in dependence on the one or more external settings 551. The one or more control settings 552 for the parameter encoding unit 520 may comprise one or more of the following:

• The maximum data rate of the encoded spatial parameters 562. This control setting is referred to herein as the metadata data rate setting.

• The maximum number and/or a specific number of parameter sets to be determined by the parameter encoding unit 520 per frame of the audio signal 561. This control setting is referred to herein as the temporal resolution setting, because it allows influencing the temporal resolution of the spatial parameters.

• The number of parameter bands for which the parameter encoding unit 520 is to determine spatial parameters. This control setting is referred to herein as the frequency resolution setting, because it allows influencing the frequency resolution of the spatial parameters.

• The resolution of the quantizer to be used for quantizing the spatial parameters. This control setting is referred to herein as the quantizer setting.

The parameter encoding unit 520 may use one or more of the above-mentioned control settings 552 for determining and/or for encoding the spatial parameters that are to be included in the bitstream 564. Typically, the input audio signal Y 561 is segmented into a sequence of frames, wherein each frame comprises a predetermined number of samples of the input audio signal Y 561. The metadata data rate setting may indicate the maximum number of bits available for encoding the spatial parameters of a frame of the input audio signal 561. The actual number of bits used for encoding the spatial parameters 562 of a frame may be lower than the number of bits allotted by the metadata data rate setting. The parameter encoding unit 520 may be configured to inform the configuration unit 540 about the actually used number of bits 553, thereby enabling the configuration unit 540 to determine the number of bits available for encoding the downmix signal X. This number of bits may be communicated to the downmix encoding unit 510 as a control setting 554. The downmix encoding unit 510 may be configured to encode the downmix signal X based on the control setting 554 (e.g., using a multichannel encoder such as Dolby Digital Plus). In this way, bits that have not been used for encoding the spatial parameters may be used for encoding the downmix signal.
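As an illustrative, non-limiting sketch, the per-frame bit accounting described above may be expressed as follows. The function name, the argument names, and the numeric values are assumptions chosen for illustration; they do not appear in the source.

```python
def downmix_bit_budget(total_bits, metadata_bit_limit, metadata_bits_used):
    """Return the number of bits left for coding the downmix signal in one frame.

    total_bits:          bits available for the whole frame (assumed fixed here)
    metadata_bit_limit:  metadata data rate setting, expressed as a per-frame limit
    metadata_bits_used:  bits actually spent on the encoded spatial parameters
    """
    if metadata_bits_used > metadata_bit_limit:
        raise ValueError("spatial metadata exceeds its per-frame limit")
    # Bits not spent on spatial parameters become available to the downmix coder.
    return total_bits - metadata_bits_used


# Made-up numbers: 6144 bits per frame, at most 800 of them for metadata.
budget = downmix_bit_budget(6144, 800, 530)
reserved_only = 6144 - 800  # budget if the full metadata limit were reserved
```

In this example the downmix encoder gains `budget - reserved_only` bits compared with always reserving the full metadata limit.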

Figure 5b shows a block diagram of an example parameter encoding unit 520. The parameter encoding unit 520 may comprise a transform unit 521 configured to determine a frequency representation of the input signal 561. In particular, the transform unit 521 may be configured to transform a frame of the input signal 561 into one or more spectra, each spectrum comprising a plurality of frequency bins. By way of example, the transform unit 521 may be configured to apply a filter bank (e.g., a QMF filter bank) to the input signal 561. The filter bank may be a critically sampled filter bank. The filter bank may comprise a predetermined number Q of filters (e.g., Q=64 filters). As such, the transform unit 521 may be configured to determine Q sub-band signals from the input signal 561, wherein each sub-band signal is associated with a corresponding frequency bin 571. By way of example, a frame of K samples of the input signal 561 may be transformed into Q sub-band signals with K/Q frequency coefficients per sub-band signal. In other words, a frame of K samples of the input signal 561 is transformed into K/Q spectra, each spectrum comprising Q frequency bins. In a specific example, the frame length is K=1536, the number of frequency bins is Q=64, and the number of spectra is K/Q=24.
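The frame arithmetic above (K samples, Q frequency bins, K/Q spectra) may be sketched as follows. This is bookkeeping only: the slicing helper is a hypothetical placeholder for the analysis filter bank and does not implement a QMF bank.

```python
def time_frequency_shape(frame_length, num_bins):
    """Return (number of spectra, bins per spectrum) for one frame of samples."""
    if frame_length % num_bins != 0:
        raise ValueError("frame length must be a multiple of the number of bins")
    return frame_length // num_bins, num_bins


def split_into_slots(frame, num_bins):
    """Cut a frame into consecutive blocks of num_bins samples, one per spectrum.

    Placeholder for the analysis filter bank: a real system would filter and
    decimate rather than simply slice the samples.
    """
    return [frame[i:i + num_bins] for i in range(0, len(frame), num_bins)]


K, Q = 1536, 64  # example values from the text
num_spectra, bins = time_frequency_shape(K, Q)
slots = split_into_slots(list(range(K)), Q)
```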

The parameter encoding unit 520 may comprise a banding unit 522 configured to group one or more frequency bins 571 into frequency bands 572. The grouping of frequency bins 571 into frequency bands 572 may depend on the frequency resolution setting 552. Table 1 illustrates an example mapping of frequency bins 571 to frequency bands 572, which may be applied by the banding unit 522 based on the frequency resolution setting 552. In the illustrated example, the frequency resolution setting 552 may indicate a banding of the frequency bins 571 into 7, 9, 12 or 15 frequency bands. The banding typically models the psychoacoustic behavior of the human ear. As a result, the number of frequency bins 571 per frequency band 572 typically increases with increasing frequency.

Table 1
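The grouping of frequency bins into bands of increasing width may be illustrated as follows. The band edges used here are hypothetical and are not the values of Table 1; they merely show a mapping of Q=64 bins onto 7 bands whose widths grow with frequency.

```python
from bisect import bisect_right

# Hypothetical band edges for Q = 64 bins and 7 parameter bands.
# Band k covers the bins BAND_EDGES_7[k] .. BAND_EDGES_7[k + 1] - 1.
BAND_EDGES_7 = [0, 1, 2, 4, 8, 16, 32, 64]


def band_of_bin(bin_index, edges):
    """Return the index of the band that contains the given frequency bin."""
    if not edges[0] <= bin_index < edges[-1]:
        raise IndexError("frequency bin outside the banded range")
    return bisect_right(edges, bin_index) - 1


band_map = [band_of_bin(b, BAND_EDGES_7) for b in range(64)]
band_widths = [hi - lo for lo, hi in zip(BAND_EDGES_7, BAND_EDGES_7[1:])]
```

A different frequency resolution setting would simply select a different list of edges.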

The parameter determination unit 523 of the parameter encoding unit 520 (and in particular, of the parameter extractor 420) may be configured to determine one or more sets of mixing parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂ for each of the frequency bands 572. Because of this, the frequency bands 572 may also be referred to as parameter bands. The mixing parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂ for a frequency band 572 may be referred to as band parameters. As such, a complete set of mixing parameters typically comprises band parameters for each frequency band 572. The band parameters may be applied in the mixing matrix 130 of Figure 3 to determine sub-band versions of the decoded upmix signal.

The number of sets of mixing parameters per frame that are to be determined by the parameter determination unit 523 may be indicated by the temporal resolution setting 552. By way of example, the temporal resolution setting 552 may indicate that one or two sets of mixing parameters are to be determined per frame.

The determination of a set of mixing parameters comprising band parameters for a plurality of frequency bands 572 is illustrated in Figure 5c. Figure 5c illustrates an example set of transform coefficients 580 derived from a frame of the input signal 561. A transform coefficient 580 corresponds to a particular time instant 582 and a particular frequency bin 571. A frequency band 572 may comprise a plurality of transform coefficients 580 from one or more frequency bins 571. As can be seen from Figure 5c, the transform of the time-domain samples of the input signal 561 provides a time-frequency representation of the frame of the input signal 561.

It should be noted that the set of mixing parameters for a current frame may be determined based on the transform coefficients 580 of the current frame and possibly also based on the transform coefficients 580 of the directly following frame (which is also referred to as the look-ahead frame).

The parameter determination unit 523 may be configured to determine the mixing parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁, k₂ for each frequency band 572. If the temporal resolution setting is set to one, all the transform coefficients 580 (of the current frame and of the look-ahead frame) of a particular frequency band 572 may be considered for determining the mixing parameters for that frequency band 572. On the other hand, the parameter determination unit 523 may be configured to determine two sets of mixing parameters per frequency band 572 (e.g., when the temporal resolution setting is set to two). In this case, the first temporal half of the transform coefficients 580 of the particular frequency band 572 (corresponding, e.g., to the transform coefficients 580 of the current frame) may be used for determining the first set of mixing parameters, and the second temporal half of the transform coefficients 580 of the particular frequency band 572 (corresponding, e.g., to the transform coefficients 580 of the look-ahead frame) may be used for determining the second set of mixing parameters.
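The choice of analysis interval for one or two parameter sets per frame may be sketched as follows. The function name and the representation of intervals as inclusive (first, last) time-slot pairs are assumptions made for illustration.

```python
def analysis_intervals(slots_per_frame, temporal_resolution):
    """Return inclusive (first, last) time-slot ranges, one per parameter set.

    Slots 0 .. slots_per_frame - 1 belong to the current frame; the following
    slots_per_frame slots belong to the look-ahead frame.
    """
    total = 2 * slots_per_frame
    if temporal_resolution == 1:
        # One set: all coefficients of the current and look-ahead frames.
        return [(0, total - 1)]
    if temporal_resolution == 2:
        # Two sets: first temporal half, then second temporal half.
        return [(0, slots_per_frame - 1), (slots_per_frame, total - 1)]
    raise ValueError("only one or two parameter sets per frame are sketched")


one_set = analysis_intervals(24, 1)
two_sets = analysis_intervals(24, 2)
```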

In general terms, the parameter determination unit 523 may be configured to determine one or more sets of mixing parameters based on the transform coefficients 580 of the current frame and of the look-ahead frame. A window function may be used to define the influence of the transform coefficients 580 on the one or more sets of mixing parameters. The shape of the window function may depend on the number of sets of mixing parameters per frequency band 572 and/or on properties of the current frame and/or the look-ahead frame (e.g., the presence of one or more transients). Example window functions are described in the context of Figure 5e and Figures 7b to 7d.

It should be noted that the above may apply to the case where the frame of the input signal 561 does not comprise transient signal portions. The system 500 (e.g., the parameter determination unit 523) may be configured to perform transient detection based on the input signal 561. If one or more transients are detected, one or more transient indicators 583, 584 may be set, wherein the transient indicators 583, 584 may identify the time instants 582 of the respective transients. The transient indicators 583, 584 may also be referred to as the sampling points of the respective sets of mixing parameters. In the case of a transient, the parameter determination unit 523 may be configured to determine a set of mixing parameters based on the transform coefficients 580 starting from the time instant of the transient (this is illustrated by the differently hatched areas of Figure 5c). On the other hand, the transform coefficients 580 preceding the time instant of the transient may be ignored, thereby ensuring that the set of mixing parameters reflects the multichannel situation after the transient.
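Restricting the analysis to the coefficients from the transient onward may be sketched as follows. The helper and its interval convention (inclusive time-slot pairs) are hypothetical.

```python
def restrict_to_transient(interval, transient_slot=None):
    """Drop the time slots that precede a transient inside an analysis interval.

    interval:        inclusive (first, last) time-slot range
    transient_slot:  slot index of the detected transient, or None if absent
    """
    first, last = interval
    if transient_slot is None or not (first <= transient_slot <= last):
        return interval  # no transient in this interval: use all slots
    # Ignore coefficients before the transient so that the parameters reflect
    # the multichannel situation after it.
    return (transient_slot, last)


with_transient = restrict_to_transient((0, 47), 30)
without_transient = restrict_to_transient((0, 47))
outside_interval = restrict_to_transient((0, 23), 30)
```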

Figure 5c illustrates the transform coefficients 580 of one channel of the multichannel input signal Y 561. The parameter encoding unit 520 is typically configured to determine transform coefficients 580 for the plurality of channels of the multichannel input signal 561. Figure 5d shows example transform coefficients of a first channel 561-1 and of a second channel 561-2 of the input signal 561. A frequency band p 572 comprises the frequency bins 571 ranging from frequency index i to j. The transform coefficient 580 of the first channel 561-1 at time instant (or in spectrum) q and in frequency bin i may be referred to as a_q,i. In a similar manner, the transform coefficient 580 of the second channel 561-2 at time instant (or in spectrum) q and in frequency bin i may be referred to as b_q,i. The transform coefficients 580 may be complex numbers. The determination of the mixing parameters for the frequency band p may involve determining energies and/or covariances of the first channel 561-1 and of the second channel 561-2 based on the transform coefficients 580. By way of example, the covariance of the transform coefficients 580 of the first channel 561-1 and of the second channel 561-2 in frequency band p and for the time interval [q,v] may be determined as:

The energy estimate of the transform coefficients 580 of the first channel 561-1 in frequency band p and for the time interval [q,v] may be determined as:

The energy estimate E_2,2(p) of the transform coefficients 580 of the second channel 561-2 in frequency band p and for the time interval [q,v] may be determined in a similar manner.
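A minimal sketch of such band-wise covariance and energy estimates is given below. It assumes one common convention, which is not confirmed by the present text: the covariance is taken as the sum, over the time slots and frequency bins of the band, of the real part of the product of a coefficient of the first channel with the complex conjugate of the corresponding coefficient of the second channel, and the energy is the covariance of a channel with itself. The function names and the test data are hypothetical.

```python
def band_covariance(a, b, t_first, t_last, bin_lo, bin_hi):
    """Sum Re(a[t][f] * conj(b[t][f])) over inclusive time and bin ranges.

    a, b: per-channel coefficients indexed as [time slot][frequency bin].
    The real-part convention is an assumption of this sketch.
    """
    total = 0.0
    for t in range(t_first, t_last + 1):
        for f in range(bin_lo, bin_hi + 1):
            total += (a[t][f] * b[t][f].conjugate()).real
    return total


def band_energy(a, t_first, t_last, bin_lo, bin_hi):
    """Energy estimate of one channel: its covariance with itself."""
    return band_covariance(a, a, t_first, t_last, bin_lo, bin_hi)


# Made-up coefficients: two time slots, two frequency bins per channel.
ch1 = [[1 + 1j, 2 + 0j], [0 + 1j, 1 - 1j]]
ch2 = [[1 - 1j, 0 + 2j], [0 + 1j, 1 + 1j]]
cov_12 = band_covariance(ch1, ch2, 0, 1, 0, 1)
energy_1 = band_energy(ch1, 0, 1, 0, 1)
energy_2 = band_energy(ch2, 0, 1, 0, 1)
```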

As such, the parameter determination unit 523 may be configured to determine one or more sets 573 of band parameters for the different frequency bands 572. The number of frequency bands 572 typically depends on the frequency resolution setting 552, and the number of sets of mixing parameters per frame typically depends on the temporal resolution setting 552. By way of example, the frequency resolution setting 552 may indicate the use of 15 frequency bands 572, and the temporal resolution setting 552 may indicate the use of 2 sets of mixing parameters. In this case, the parameter determination unit 523 may be configured to determine two temporally distinct sets of mixing parameters, wherein each set of mixing parameters comprises 15 sets 573 of band parameters (i.e., mixing parameters for the different frequency bands 572).

As indicated above, the mixing parameters for a current frame may be determined based on the transform coefficients 580 of the current frame and based on the transform coefficients 580 of the following look-ahead frame. The parameter determination unit 523 may apply a window to the transform coefficients 580, in order to ensure a smooth transition between the mixing parameters of succeeding frames of the sequence of frames, and/or in order to account for disruptive portions within the input signal 561 (e.g., transients). This is illustrated in Figure 5e, which shows the K/Q spectra 589 of the current frame 585 and of the directly following frame 590 of the input audio signal 561 at the corresponding K/Q consecutive time instants 582. Furthermore, Figure 5e shows an example window 586 used by the parameter determination unit 523. The window 586 reflects the influence of the K/Q spectra 589 of the current frame 585 and of the directly following frame 590 (referred to as the look-ahead frame) on the mixing parameters. As will be outlined in more detail below, the window 586 reflects the case where the current frame 585 and the look-ahead frame 590 do not comprise any transients. In this case, the window 586 ensures a smooth fade-in and fade-out of the spectra 589 of the current frame 585 and of the look-ahead frame 590, respectively, thereby allowing for a smooth evolution of the spatial parameters. Furthermore, Figure 5e shows example windows 587 and 588. The dashed window 587 reflects the influence of the K/Q spectra 589 of the current frame 585 on the mixing parameters of the previous frame. In addition, the dashed window 588 reflects the influence of the K/Q spectra 589 of the directly following frame 590 on the mixing parameters of the directly following frame 590 (in the case of smooth interpolation).
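A window with a fade-in over the current frame and a fade-out over the look-ahead frame may be sketched as follows. The triangular shape is an assumption made for illustration only; the actual shape of the window 586 is not specified in the text.

```python
def triangular_window(slots_per_frame):
    """Weights over current + look-ahead frame: ramp up, then ramp down.

    The peak weight falls on the last time slot of the current frame.
    """
    n = slots_per_frame
    rise = [(t + 1) / n for t in range(n)]      # current frame: fade-in
    fall = [(n - 1 - t) / n for t in range(n)]  # look-ahead frame: fade-out
    return rise + fall


window = triangular_window(24)
first_half_weight = sum(window[:12])     # first half of the current frame
second_half_weight = sum(window[12:24])  # second half of the current frame
```

With this shape, the second half of the current frame carries more weight than the first half, and the weight is largest at the end of the current frame.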

Subsequently, the one or more sets of mixing parameters may be quantized and encoded using the encoding unit 524 of the parameter encoding unit 520. The encoding unit 524 may apply various encoding schemes. By way of example, the encoding unit 524 may be configured to perform differential encoding of the mixing parameters. The differential encoding may be based on temporal differences (between a current mixing parameter and the corresponding preceding mixing parameter, for the same frequency band 572) or on frequency differences (between the current mixing parameter of a first frequency band 572 and the corresponding current mixing parameter of an adjacent second frequency band 572).
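The two kinds of differences may be sketched as follows for one mixing parameter across the frequency bands. The function names and the integer values are hypothetical; the values stand for already quantized parameters.

```python
from itertools import accumulate


def time_differences(current, previous):
    """Per-band difference between the current and the preceding parameter set."""
    return [c - p for c, p in zip(current, previous)]


def frequency_differences(current):
    """First band as is, then the difference to the adjacent lower band."""
    return [current[0]] + [current[k] - current[k - 1]
                           for k in range(1, len(current))]


previous = [3, 3, 5, 6]  # made-up quantized values of the preceding set
current = [3, 4, 4, 6]   # made-up quantized values of the current set
dt = time_differences(current, previous)
df = frequency_differences(current)
restored = list(accumulate(df))  # a decoder undoes the frequency differences
```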

Furthermore, the encoding unit 524 may be configured to quantize the set of mixing parameters and/or the temporal or frequency differences of the mixing parameters. The quantization of the mixing parameters may depend on the quantizer setting 552. By way of example, the quantizer setting 552 may take on two values, a first value indicating a fine quantization and a second value indicating a coarse quantization. As such, the encoding unit 524 may be configured to perform a fine quantization (with a relatively low quantization error) or a coarse quantization (with a relatively higher quantization error), based on the quantization type indicated by the quantizer setting 552. The quantized parameters or parameter differences may then be encoded using an entropy-based code (such as a Huffman code). As a result, the encoded spatial parameters 562 are obtained. The number of bits 553 used for the encoded spatial parameters 562 may be communicated to the configuration unit 540.
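The effect of a fine versus a coarse quantizer setting may be illustrated as follows. The uniform quantizer and the two step sizes are assumptions of this sketch; the text does not specify the quantizer characteristics.

```python
FINE_STEP = 0.1    # hypothetical step size for the fine setting
COARSE_STEP = 0.2  # hypothetical step size for the coarse setting


def quantize(value, step):
    """Map a parameter value to the index of the nearest quantizer level."""
    return round(value / step)


def dequantize(index, step):
    """Reconstruct the parameter value from its quantizer index."""
    return index * step


value = 0.33  # made-up mixing parameter value
fine_index = quantize(value, FINE_STEP)
coarse_index = quantize(value, COARSE_STEP)
fine_error = abs(dequantize(fine_index, FINE_STEP) - value)
coarse_error = abs(dequantize(coarse_index, COARSE_STEP) - value)
```

For this value the coarse setting yields the larger reconstruction error, while the coarse index range is smaller and hence cheaper to code.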

In an embodiment, the encoding unit 524 may be configured to first quantize the different mixing parameters (taking the quantizer setting 552 into account) to yield quantized mixing parameters. The quantized mixing parameters may then be entropy encoded (using, e.g., a Huffman code). The entropy encoding may encode the quantized mixing parameters of a frame (without regard to preceding frames), the frequency differences of the quantized mixing parameters, or the temporal differences of the quantized mixing parameters. The encoding of temporal differences may not be used in the case of so-called independent frames, which are encoded independently of preceding frames.

Hence, the parameter encoding unit 520 may use a combination of differential encoding and Huffman encoding to determine the encoded spatial parameters 562. As outlined above, the encoded spatial parameters 562 may be included in the bitstream 564 as metadata (also referred to as spatial metadata) together with the encoded downmix signal 563. Differential encoding and Huffman encoding may be used for the transmission of the spatial metadata in order to reduce redundancy and thereby increase the spare bit rate available for encoding the downmix signal 563. Because Huffman codes are variable-length codes, the size of the spatial metadata may vary considerably depending on the statistics of the encoded spatial parameters 562 to be transmitted. The data rate needed for transmitting the spatial metadata is deducted from the data rate available to the core codec (e.g., Dolby Digital Plus) for encoding the stereo downmix signal. In order not to compromise the audio quality of the downmix signal, the number of bytes that may be spent on transmitting the spatial metadata per frame is typically limited. The limit may be subject to encoder tuning considerations, which may be taken into account by the configuration unit 540. However, owing to the variable-length nature of the underlying differential/Huffman encoding of the spatial parameters, it typically cannot be guaranteed, without further measures, that the upper data rate limit (e.g., as reflected in the metadata data rate setting 552) will not be exceeded.
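The dependence of the metadata size on the parameter statistics may be illustrated with a toy variable-length code. The code-length table and the bit limit below are hypothetical and do not correspond to any table of the described codec.

```python
# Hypothetical prefix-code lengths (in bits) for quantized parameter differences.
# Small differences, which are assumed to be frequent, get short codewords.
CODE_LENGTH_BITS = {0: 1, 1: 3, -1: 3, 2: 4, -2: 4}


def coded_size_bits(differences):
    """Total number of bits needed to code a list of parameter differences."""
    return sum(CODE_LENGTH_BITS[d] for d in differences)


def exceeds_limit(differences, bit_limit):
    """True if the coded differences do not fit into the per-frame limit."""
    return coded_size_bits(differences) > bit_limit


steady = [0, 0, 0, 1, 0]      # slowly changing spatial image
changing = [2, -2, 1, -1, 2]  # rapidly changing spatial image
steady_bits = coded_size_bits(steady)
changing_bits = coded_size_bits(changing)
```

Both inputs contain the same number of parameters, yet only the second one would exceed a limit of, say, 16 bits, which is why a fixed upper limit cannot be guaranteed without further measures.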

In the present document, a method for post-processing the encoded spatial parameters 562 and/or the spatial metadata comprising the encoded spatial parameters 562 is described. The method 600 for post-processing the spatial metadata is described in the context of Figure 6. The method 600 may be applied when it is determined that the total size of one frame of spatial metadata exceeds a predefined limit, e.g., as indicated by the metadata data rate setting 552. The method 600 aims at reducing the amount of metadata step by step. Reducing the size of the spatial metadata typically also reduces the precision of the spatial metadata and thus compromises the quality of the spatial image of the reproduced audio signal. However, the method 600 typically guarantees that the total amount of spatial metadata does not exceed the predefined limit, and thereby allows determining an improved trade-off, in terms of overall audio quality, between the spatial metadata (used for regenerating the n-channel multichannel signal) and the audio codec metadata (used for decoding the encoded downmix signal 563). Furthermore, the method 600 for post-processing the spatial metadata can be implemented with relatively low computational complexity (compared with a complete recalculation of the encoded spatial parameters using modified control settings 552).

The method 600 for post-processing the spatial metadata may include one or more of the following steps. As outlined above, a spatial metadata frame may include one or more (e.g., one or two) parameter sets per frame, where the use of additional parameter sets allows the temporal resolution of the mixing parameters to be increased. Using multiple parameter sets per frame can improve audio quality, especially for attack-rich (i.e., transient) signals. Even for audio signals with a rather slowly varying spatial image, updating the spatial parameters on a sampling grid that is twice as dense can improve audio quality. However, transmitting multiple parameter sets per frame increases the data rate by a factor of approximately 2. Therefore, if it is determined that the data rate of the spatial metadata exceeds the metadata data rate setting 552 (step 601), it can be checked whether the spatial metadata frame includes more than one set of mixing parameters. Specifically, it can be checked whether the metadata frame includes two sets of mixing parameters that are due to be transmitted (step 602). If it is determined that the spatial metadata includes multiple sets of mixing parameters, one or more of the sets beyond a single set of mixing parameters can be discarded (step 603). As a result, the data rate of the spatial metadata can be reduced significantly (typically by half in the case of two sets of mixing parameters), while audio quality is compromised only to a relatively low degree.

The decision as to which of the two (or more) sets of mixing parameters to drop can depend on whether the encoding system 500 has detected a transient position ("attack") in the portion of the input signal 561 covered by the current frame. If several transients are present in the current frame, the earlier transients are typically more important than the later ones because of the psychoacoustic post-masking effect of each single attack. Therefore, if a transient is present, it may be advisable to discard the later set of mixing parameters (e.g., the second of two). On the other hand, in the absence of an attack, the earlier set of mixing parameters (e.g., the first of two) can be discarded. The reason may lie in the windowing used when calculating the spatial parameters (as shown in Figure 5e). The window 586, which is used to window the portion of the input signal 561 from which the spatial parameters of the second set of mixing parameters are calculated, typically has its greatest influence at the time instant at which the upmix stage 130 places the sampling point for parameter reconstruction (i.e., at the end of the current frame). The first set of mixing parameters, on the other hand, is typically offset from that time instant by half a frame. Consequently, the error made by dropping the first set of mixing parameters is most likely lower than the error made by dropping the second set. This is illustrated in Figure 5e, where it can be seen that the second half of the spectra 589 of the current frame 585 used to determine the second set of mixing parameters is influenced by the samples of the current frame 585 to a higher degree than the first half of the spectra 589 of the current frame 585 (for the first half, the values of the window function 586 are lower than for the second half of the spectra 589).
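The selection rule of steps 602 and 603 can be sketched as follows. This is a minimal illustration, not the implementation described in this document; the function name, the list representation of the parameter sets, and the boolean transient flag are assumptions introduced here.

```python
def reduce_to_single_parameter_set(param_sets, transient_detected):
    """Keep one set of mixing parameters per frame (sketch of steps 602/603).

    param_sets: parameter sets in temporal order (hypothetical representation).
    transient_detected: True if the encoder found an attack in the current frame.
    """
    if len(param_sets) <= 1:
        # Nothing to discard; the frame already carries a single set.
        return list(param_sets)
    if transient_detected:
        # Earlier transients matter more (post-masking): drop the later set(s).
        return [param_sets[0]]
    # No attack: the last set is aligned with the end-of-frame sampling point,
    # so dropping the earlier set(s) is expected to cause the smaller error.
    return [param_sets[-1]]
```

With two sets per frame, the function returns the first set for a frame containing a transient and the second set otherwise.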

The spatial cues (i.e., the mixing parameters) calculated in the encoding system 500 are transmitted to the corresponding decoder 100 via the bitstream 562 (which may be part of the bitstream 564 in which the encoded stereo downmix signal 563 is delivered). Between the calculation of the spatial cues and their representation in the bitstream 562, the encoding unit 524 typically applies a two-step coding approach: the first step, quantization, is lossy, because it adds an error to the spatial cues; the second step, differential/Huffman coding, is lossless. As outlined above, the encoder 500 can select between different types of quantization (e.g., two types): a high-resolution quantization scheme, which adds relatively little error but yields a larger number of possible quantization indices and therefore requires larger Huffman codewords; and a low-resolution quantization scheme, which adds relatively more error but yields a smaller number of quantization indices and therefore does not require such large Huffman codewords. It should be noted that the different types of quantization can be applied to some or all of the mixing parameters. For example, the different types of quantization can be applied to the mixing parameters α₁, α₂, α₃, β₁, β₂, β₃, k₁. The gain g, on the other hand, can be quantized with a fixed type of quantization.

The method 600 may include a step 604 of verifying which type of quantization has been used to quantize the spatial parameters. If it is determined that a relatively fine quantization resolution has been used, the encoding unit 524 can be configured to reduce the quantization resolution to a coarser type of quantization (step 605). As a result, the spatial parameters are quantized once more. This does not, however, add significant computational overhead (compared to re-determining the spatial parameters with different control settings 552). It should be noted that different types of quantization can be used for the different spatial parameters α₁, α₂, α₃, β₁, β₂, β₃, g, k₁. The encoding unit 524 can therefore be configured to select the quantizer resolution individually for each type of spatial parameter, thereby adjusting the data rate of the spatial metadata.
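The following sketch illustrates steps 604 and 605 with a uniform quantizer. The step sizes 0.1 and 0.2, the resolution labels, and the function names are assumptions made for illustration only; the actual quantizer tables are not given in this passage.

```python
import math

# Hypothetical step sizes; not taken from this document.
STEP = {"fine": 0.1, "coarse": 0.2}

def quantize(values, resolution):
    """Uniform quantization to integer indices (round half up)."""
    step = STEP[resolution]
    return [int(math.floor(v / step + 0.5)) for v in values]

def reduce_resolution(values, resolution):
    """Sketch of steps 604/605: if the fine quantizer was used, quantize the
    unquantized parameter values again with the coarse quantizer."""
    if resolution == "fine":
        return quantize(values, "coarse"), "coarse"
    return quantize(values, resolution), resolution
```

Only the quantization is repeated; the parameter values themselves are reused, which is why the extra cost stays small compared to a full recalculation. The coarse quantizer produces indices from a smaller range, so the subsequent Huffman codewords tend to be shorter.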

The method 600 may include a step of reducing the frequency resolution of the spatial parameters (not shown in Figure 6). As outlined above, the set of mixing parameters of a frame is typically clustered into frequency bands or parameter bands 572. Each parameter band represents a certain frequency range, and a separate set of spatial cues is determined for each band. Depending on the data rate available for transmitting the spatial metadata, the number of parameter bands 572 can be varied in steps (e.g., 7, 9, 12, or 15 bands). The number of parameter bands 572 is approximately linearly related to the data rate, so a reduction of the frequency resolution can significantly reduce the data rate of the spatial metadata while affecting audio quality only moderately. However, such a reduction of the frequency resolution typically requires the set of mixing parameters to be recalculated with the changed frequency resolution, and would therefore increase computational complexity.

As outlined above, the encoding unit 524 can use differential coding of the (quantized) spatial parameters. The configuration unit 551 can be configured to impose direct coding of the spatial parameters of a frame of the input audio signal 561, in order to ensure that transmission errors do not propagate over an unlimited number of frames, and in order to allow a decoder to synchronize with the received bitstream 562 at intermediate time instants. As such, a certain fraction of the frames along the timeline may not use differential coding. Such frames, which do not use differential coding, can be referred to as independent frames. The method 600 may include a step 606 of verifying whether the current frame is an independent frame and/or whether the independent frame is a forced independent frame. The coding of the spatial parameters can depend on the result of step 606.

As outlined above, differential coding is typically designed such that differences are calculated either between temporal successors or between adjacent frequency bands of the quantized spatial cues. In both cases, the statistics of the spatial cues are such that small differences occur more often than large differences, so small differences are represented by shorter Huffman codewords than large differences. This document proposes smoothing the quantized spatial parameters (over time or over frequency). Smoothing the spatial parameters over time or over frequency typically leads to smaller differences and therefore to a reduction of the data rate. For psychoacoustic reasons, temporal smoothing is usually preferred over smoothing in the frequency direction. If it is determined that the current frame is not a forced independent frame, the method 600 can proceed with time-differential coding (step 607), possibly combined with smoothing over time. If, on the other hand, the current frame is determined to be an independent frame, the method 600 can proceed with frequency-differential coding (step 608), possibly combined with smoothing over frequency.
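The choice between steps 607 and 608 can be sketched as below, operating on the quantization indices of one parameter across the parameter bands. The function names are hypothetical, and sending the first band directly in the frequency-differential case is an assumption of this sketch rather than something stated in this passage.

```python
def time_differential(current, previous):
    """Differences against the same band of the previous parameter set."""
    return [c - p for c, p in zip(current, previous)]

def frequency_differential(current):
    """Differences between adjacent bands; the first band is kept as-is
    (an assumption of this sketch)."""
    return [current[0]] + [current[i] - current[i - 1]
                           for i in range(1, len(current))]

def differential_code(current, previous, independent_frame):
    """Sketch of the branch after step 606."""
    if independent_frame or previous is None:
        # An independent frame must not refer to its predecessor (step 608).
        return "frequency", frequency_differential(current)
    return "time", time_differential(current, previous)
```

In both branches the resulting values cluster around zero for slowly varying cues, which is what makes short Huffman codewords for small differences effective.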

The differential coding of step 607 can be subjected to smoothing over time in order to reduce the data rate. The degree of smoothing can be varied according to the amount by which the data rate is to be reduced. The most severe kind of temporal "smoothing" corresponds to keeping the previous set of mixing parameters unchanged, which corresponds to transmitting only delta values equal to zero. The temporal smoothing of the differential coding can be performed for one or more (e.g., for all) of the spatial parameters.

Smoothing over frequency can be performed in a manner similar to temporal smoothing. In its most extreme form, smoothing over frequency corresponds to transmitting the same quantized spatial parameters for the complete frequency range of the input signal 561. While this guarantees that the limit set by the metadata data rate setting is not exceeded, smoothing over frequency may have a relatively high impact on the quality of the spatial image that can be reproduced using the spatial metadata. It may therefore be preferable to apply smoothing over frequency only where temporal smoothing is not allowed (e.g., if the current frame is a forced independent frame, for which time-differential coding relative to the previous frame cannot be used).
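One possible way to realize the two kinds of smoothing is sketched below. The clamping of the per-band change and the use of the median as the common value are assumptions chosen for illustration; this document does not prescribe a particular smoothing rule.

```python
def smooth_in_time(current, previous, max_delta):
    """Limit the change of each quantization index relative to the previous
    parameter set. max_delta = 0 is the most severe case: the previous set is
    kept unchanged and all transmitted delta values are zero."""
    smoothed = []
    for c, p in zip(current, previous):
        delta = max(-max_delta, min(max_delta, c - p))
        smoothed.append(p + delta)
    return smoothed

def flatten_over_frequency(current):
    """Most extreme smoothing over frequency: one common value for all bands
    (here the upper median of the indices, a hypothetical choice)."""
    common = sorted(current)[len(current) // 2]
    return [common] * len(current)
```

A smaller `max_delta` yields smaller time differences and therefore shorter codewords, at the cost of a spatial image that follows the signal more slowly.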

As outlined above, the system 500 can operate subject to one or more external settings 551, such as the overall target data rate of the bitstream 564 or the sampling rate of the input audio signal 561. There is typically no single optimal operating point for all combinations of external settings. The configuration unit 540 can be configured to map a valid combination of external settings 551 to a combination of control settings 552, 554. For example, the configuration unit 540 can rely on the results of psychoacoustic listening tests. Specifically, the configuration unit 540 can be configured to determine the combination of control settings 552, 554 that ensures the (on average) best psychoacoustic coding result for a particular combination of external settings 551.

As outlined above, the decoding system 100 should be able to synchronize with the received bitstream 564 within a given period of time. To ensure this, the encoding system 500 can encode so-called independent frames (i.e., frames that do not depend on knowledge about their predecessors) at regular intervals. The average distance, in frames, between two independent frames can be given by the ratio between the maximum time lag allowed for synchronization and the duration of one frame. This ratio does not necessarily have to be an integer, whereas the distance between two independent frames is always an integer number of frames.

The encoding system 500 (e.g., the configuration unit 540) can be configured to receive, as an external setting 551, the maximum time lag for synchronization or the desired update time period. Furthermore, the encoding system 500 (e.g., the configuration unit 540) may include a timer module configured to keep track of the absolute amount of time that has elapsed since the first encoded frame of the bitstream 564. The first encoded frame of the bitstream 564 is, by definition, an independent frame. The encoding system 500 (e.g., the configuration unit 540) can be configured to determine whether the next frame to be encoded includes a sample that corresponds to a time instant which is an integer multiple of the desired update period. Whenever the next frame to be encoded includes a sample at a time instant which is an integer multiple of the desired update period, the encoding system 500 (e.g., the configuration unit 540) can be configured to ensure that the next frame is encoded as an independent frame. In this way, the desired update time period is maintained even if the ratio of the desired update time period to the frame length is not an integer.
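The timer-based rule can be illustrated with integer sample counts. The function name, the half-open frame interval, and the example values (1536 samples per frame, an update period of 4000 samples) are assumptions made for this sketch.

```python
def is_independent_frame(frame_index, frame_len, update_period):
    """Return True if the frame contains a sample whose position is an
    integer multiple of the desired update period (all values in samples).

    The frame with index n is assumed to cover samples
    [n * frame_len, (n + 1) * frame_len).
    """
    start = frame_index * frame_len
    end = start + frame_len
    # Smallest multiple of update_period that is >= start (ceiling division).
    first_multiple = -(-start // update_period) * update_period
    return first_multiple < end
```

Frame 0 always contains sample 0 and is therefore independent, which matches the definition of the first encoded frame. With the example values, the independent frames among the first six frames are frames 0, 2 and 5: the spacing alternates between two and three frames, while the underlying update grid stays at exactly 4000 samples.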

As outlined above, the parameter determination unit 523 is configured to calculate the spatial cues based on a time/frequency representation of the multichannel input signal 561. A frame of spatial metadata can be determined based on the K/Q (e.g., 24) spectra 589 (e.g., QMF spectra) of the current frame and/or based on the K/Q (e.g., 24) spectra 589 (e.g., QMF spectra) of the look-ahead frame, where each spectrum 589 can have a frequency resolution of Q (e.g., 64) frequency bins 571. Depending on whether the encoding system 500 detects a transient in the input signal 561, the temporal length of the signal portion used to calculate a single set of spatial cues can include a different number of spectra 589 (e.g., from 1 spectrum up to 2 times K/Q spectra). As shown in Figure 5c, each spectrum 589 is divided into a certain number of frequency bands 572 (e.g., 7, 9, 12, or 15 frequency bands) which, for psychoacoustic reasons, include different numbers of frequency bins 571 (e.g., from 1 frequency bin up to 41 frequency bins). The different frequency bands p 572 and the different time segments [q,v] define a grid on the time/frequency representation of the current frame and the look-ahead frame of the input signal 561. For the different "boxes" of this grid, different sets of spatial cues can be calculated, each based on estimates of the energy and/or covariance of at least some of the input channels within the respective "box". As outlined above, the energy estimates and/or covariances can be calculated by summing the squares of the transform coefficients 580 of one channel and/or by summing the products of the transform coefficients 580 of different channels, respectively (as indicated by the formulas provided above). The different transform coefficients 580 can be weighted according to the window function 586 used for determining the spatial parameters.

The calculation of the energy estimates E_1,1(p), E_2,2(p) and/or of the covariance E_1,2(p) can be implemented in fixed-point arithmetic. In this case, the different sizes of the "boxes" of the time/frequency grid may affect the arithmetic precision of the values determined for the spatial parameters. As outlined above, the number of frequency bins (j-i+1) 571 per frequency band 572 and/or the length of the time interval [q,v] of a "box" of the time/frequency grid can vary significantly (e.g., between 1×1×2 and 48×41×2 transform coefficients 580 (e.g., the real and imaginary parts of complex QMF coefficients)). As a result, the number of products Re{a_t,f}·Re{b_t,f} and Im{a_t,f}·Im{b_t,f} that have to be summed in order to determine the energy E_1,1(p) or the covariance E_1,2(p) can vary significantly. To prevent the result of the calculation from exceeding the number range that can be represented in fixed-point arithmetic, the signals can be scaled down by the maximum number of bits (e.g., by 6 bits, since 2^6·2^6 = 4096 ≥ 48·41·2). However, for smaller "boxes" and/or for "boxes" that contain only relatively low signal energy, this approach leads to a significant loss of arithmetic precision.

This document proposes using an individual scaling for each "box" of the time/frequency grid. The individual scaling can depend on the number of transform coefficients 580 included in the "box" of the time/frequency grid. Typically, the spatial parameters for a particular "box" of the time/frequency grid (i.e., for a particular frequency band 572 and for a particular time interval [q,v]) are determined based only on the transform coefficients 580 from that particular "box" (and do not depend on the transform coefficients 580 from other "boxes"). Furthermore, the spatial parameters are typically determined based only on ratios of energy estimates and/or covariances (and are typically not affected by the absolute energy estimates and/or covariances). In other words, a single spatial cue typically uses only energy estimates and/or cross-channel products from one single time/frequency "box", and it is affected only by energy estimate/covariance ratios, not by their absolute values. An individual scaling can therefore be used in every single "box". This scaling should be matched across the channels that contribute to a particular spatial cue.

For a frequency band p 572 and for a time interval [q,v], the energy estimates E_1,1(p), E_2,2(p) of the first channel 561-1 and the second channel 561-2, and the covariance E_1,2(p) between the first channel 561-1 and the second channel 561-2, can be determined, e.g., as indicated by the formulas above. The energy estimates and the covariance can be scaled by a scaling factor s_p to provide the scaled energies and covariance s_p·E_1,1(p), s_p·E_2,2(p) and s_p·E_1,2(p). A spatial parameter P(p) derived from the energy estimates E_1,1(p), E_2,2(p) and the covariance E_1,2(p) typically depends on ratios of the energies and/or covariance, so that the value of the spatial parameter P(p) is independent of the scaling factor s_p. As a result, different scaling factors s_p, s_p+1, s_p+2 can be used for different frequency bands p, p+1, p+2.

It should be noted that one or more of the spatial parameters may depend on more than two different input channels (e.g., three different channels). In this case, the one or more spatial parameters can be derived based on the energy estimates E_1,1(p), E_2,2(p), ... of the different channels and based on the respective covariances between different pairs of channels (i.e., E_1,2(p), E_1,3(p), E_2,3(p), etc.). In this case, too, the values of the one or more spatial parameters are independent of the scaling factor applied to the energy estimates and/or covariances.

Specifically, the scaling factor s_p = 2^(-z_p) for a particular frequency band p (where z_p is a positive integer indicating the shift in fixed-point arithmetic) can be determined such that

0.5 < s_p·max{|E_1,1(p)|, |E_2,2(p)|, |E_1,2(p)|} ≤ 1.0

and such that the shift z_p is minimal. By ensuring this individually for each frequency band p and/or for each time interval [q,v] for which mixing parameters are determined, an increased (e.g., maximum) precision in fixed-point arithmetic can be achieved while a valid value range is ensured.
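The determination of the shift can be sketched as follows, using floating-point numbers to stand in for fixed-point accumulators. The function name and the example values are hypothetical; the sketch also returns a shift of 0 when the largest magnitude is already at most 1.0, which is an assumption, since the text above speaks of a positive integer z_p.

```python
def minimal_shift(e11, e22, e12):
    """Smallest shift z such that 2**(-z) * max(|E11|, |E22|, |E12|) <= 1.0.

    For a largest magnitude above 1.0 the scaled maximum then also lies
    above 0.5, because one shift less would leave it above 1.0.
    """
    peak = max(abs(e11), abs(e22), abs(e12))
    z = 0
    while peak * 2.0 ** (-z) > 1.0:
        z += 1
    return z

# Example "box" with hypothetical accumulated values.
e11, e22, e12 = 3000.0, 1200.0, 600.0
z_p = minimal_shift(e11, e22, e12)
s_p = 2.0 ** (-z_p)
# A ratio-based quantity is unaffected by the common scaling factor s_p.
ratio_unscaled = e12 / e11
ratio_scaled = (s_p * e12) / (s_p * e11)
```

For the example values the shift is 12 (3000/4096 ≈ 0.73), and `ratio_scaled` equals `ratio_unscaled`, which is why each "box" can carry its own scaling without changing the derived spatial parameters.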

For example, the individual scaling can be implemented by checking, for every single MAC (multiply-accumulate) operation, whether the result of the MAC operation could exceed +/-1. Only if this is the case is the individual scaling for the "box" increased by one bit. Once this has been done for all channels, the largest scaling for each "box" can be determined, and all deviating scalings of the "box" can be adapted accordingly.

As outlined above, the spatial metadata may include one or more (e.g., two) sets of spatial parameters per frame. As such, the encoding system 500 can transmit one or more sets of spatial parameters per frame to the corresponding decoding system 100. Each of these sets of spatial parameters corresponds to one particular spectrum out of the K/Q temporally successive spectra 589 of the spatial metadata frame. This particular spectrum corresponds to a particular time instant, which can be referred to as a sampling point. Figure 5c shows two example sampling points 583, 584 for two sets of spatial parameters, respectively. The sampling points 583, 584 can be associated with particular events included in the input audio signal 561. Alternatively, the sampling points can be predetermined.

The sampling points 583, 584 indicate the time instants at which the corresponding spatial parameters should be fully applied by the decoding system 100. In other words, the decoding system 100 can be configured to update the spatial parameters at the sampling points 583, 584 according to the transmitted sets of spatial parameters. Furthermore, the decoding system 100 can be configured to interpolate the spatial parameters between two subsequent sampling points. The spatial metadata can indicate the type of transition to be performed between successive sets of spatial parameters. Examples of transition types are "smooth" and "steep" transitions between spatial parameters, meaning that the spatial parameters are either interpolated in a smooth (e.g., linear) manner or updated abruptly, respectively.

In the case of a "smooth" transition, the sampling points can be fixed (i.e., predetermined) and therefore do not need to be signaled in the bitstream 564. If the spatial metadata frame conveys a single set of spatial parameters, the predetermined sampling point can be the position at the very end of the frame, i.e., the sampling point can correspond to the (K/Q)-th spectrum 589. If the spatial metadata frame conveys two sets of spatial parameters, the first sampling point can correspond to the (K/2Q)-th spectrum 589 and the second sampling point can correspond to the (K/Q)-th spectrum 589.

In the case of a "steep" transition, the sampling points 583, 584 can be variable and can be signaled in the bitstream 562. The part of the bitstream 562 that carries the following information can be referred to as the "framing" part of the bitstream 562: information on the number of sets of spatial parameters used in a frame, information on the selection between "smooth" and "steep" transitions, and information on the positions of the sampling points in the case of "steep" transitions. Figure 7a shows example transition schemes that can be applied by the decoding system 100 according to the framing information included in the received bitstream 562.

For example, the framing information for a particular frame may indicate a "smooth" transition and a single set of spatial parameters 711. In this case, the decoding system 100 (e.g., the first mixing matrix 130) can assume that the sampling point of the set of spatial parameters 711 corresponds to the last spectrum of the particular frame. Furthermore, the decoding system 100 can be configured to interpolate (e.g., linearly) 701 between the last received set of spatial parameters 710 for the immediately preceding frame and the set of spatial parameters 711 for the particular frame. In another example, the framing information for a particular frame may indicate a "smooth" transition and two sets of spatial parameters 711, 712. In this case, the decoding system 100 (e.g., the first mixing matrix 130) can assume that the sampling point of the first set of spatial parameters 711 corresponds to the last spectrum of the first half of the particular frame, and that the sampling point of the second set of spatial parameters 712 corresponds to the last spectrum of the second half of the particular frame. Furthermore, the decoding system 100 can be configured to interpolate (e.g., linearly) 702 between the last received set of spatial parameters 710 for the immediately preceding frame and the first set of spatial parameters 711, and between the first set of spatial parameters 711 and the second set of spatial parameters 712.

In another example, the framing information for a particular frame may indicate a "steep" transition, a single set of spatial parameters 711, and the sampling point 583 of this single set of spatial parameters 711. In this case, the decoding system 100 (e.g., the first mixing matrix 130) can be configured to apply the last received set of spatial parameters 710 for the immediately preceding frame up to the sampling point 583, and to apply the set of spatial parameters 711 from the sampling point 583 onward (as shown by curve 703). In another example, the framing information for a particular frame may indicate a "steep" transition, two sets of spatial parameters 711, 712, and two corresponding sampling points 583, 584 for the two sets of spatial parameters 711, 712, respectively. In this case, the decoding system 100 (e.g., the first mixing matrix 130) can be configured to apply the last received set of spatial parameters 710 for the immediately preceding frame up to the first sampling point 583, to apply the first set of spatial parameters 711 from the first sampling point 583 up to the second sampling point 584, and to apply the second set of spatial parameters 712 from the second sampling point 584 at least until the end of the particular frame (as shown by curve 704).
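The four transition schemes described above can be sketched for a single scalar parameter as follows. The function signature, the 1-based spectrum index, and treating the previous frame's last set as an anchor at index 0 are assumptions of this sketch; a real decoder would apply the same logic to every parameter of a set.

```python
def parameter_trajectory(prev, values, num_spectra, smooth, sample_points=None):
    """Value of one spatial parameter at each spectrum 1..num_spectra.

    prev: value from the last set of the preceding frame (set 710).
    values: one or two values transmitted for this frame (sets 711, 712).
    smooth: True for a "smooth" transition (fixed sampling points),
            False for a "steep" transition (signaled sampling points).
    """
    if smooth:
        # Predetermined points: end of frame, or end of each half frame.
        n = len(values)
        sample_points = [num_spectra * (i + 1) // n for i in range(n)]
    elif sample_points is None:
        raise ValueError("steep transitions need signaled sampling points")

    anchors = [(0, prev)] + list(zip(sample_points, values))
    trajectory = []
    for t in range(1, num_spectra + 1):
        if smooth:
            # Linear interpolation inside the segment that contains t.
            for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
                if t0 < t <= t1:
                    trajectory.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                    break
        else:
            # Hold the latest set whose sampling point has been reached.
            current = prev
            for tp, v in zip(sample_points, values):
                if t >= tp:
                    current = v
            trajectory.append(current)
    return trajectory
```

For a frame of four spectra, a smooth transition from 0.0 to a single value 1.0 yields 0.25, 0.5, 0.75, 1.0, whereas a steep transition with the sampling point at the third spectrum yields 0.0, 0.0, 1.0, 1.0.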

The encoding system 500 should ensure that the framing information matches the signal characteristics and that the appropriate portions of the input signal 561 are selected for calculating the one or more sets of spatial parameters 711, 712. For this purpose, the encoding system 500 may include a detector configured to detect signal positions at which the signal energy in one or more channels increases abruptly. If at least one such signal position is found, the encoding system 500 can be configured to switch from "smooth" transitions to "steep" transitions; otherwise, the encoding system 500 can continue with "smooth" transitions.

As outlined above, the encoding system 500 (e.g., the parameter determination unit 523) may be configured to calculate the spatial parameters for the current frame based on a plurality of frames 585, 590 of the input audio signal 561 (e.g., based on the current frame 585 and the immediately following frame 590, i.e., the so-called look-ahead frame). Accordingly, the parameter determination unit 523 may be configured to determine the spatial parameters based on 2*K/Q spectra 589 (as shown in Figure 5e). As shown in Figure 5e, the spectra 589 may be windowed using a window 586. In the present document, it is proposed to adapt the window 586 based on the number of spatial parameter sets 711, 712 to be determined, based on the transition type, and/or based on the positions of the sampling points 583, 584. By doing so, it can be ensured that the framing information matches the signal characteristics and that the appropriate portions of the input signal 561 are selected to calculate the one or more spatial parameter sets 711, 712.
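To make the role of the window concrete, the following Python sketch weights the spectra of two channels with a window before accumulating statistics. The choice of statistics (channel energies and a cross-correlation term for a single frequency position) is an assumption made for illustration; the text above only states that the spectra are windowed when the spatial parameters are determined.

```python
def weighted_statistics(coeffs_a, coeffs_b, window):
    """Accumulate window-weighted energies and cross-correlation.

    coeffs_a, coeffs_b: one complex coefficient per spectrum (one frequency
                        position of two different channels)
    window:             one weight per spectrum, covering the current frame
                        and the look-ahead frame
    """
    if not (len(coeffs_a) == len(coeffs_b) == len(window)):
        raise ValueError("inputs must have one entry per spectrum")
    energy_a = sum(w * abs(a) ** 2 for w, a in zip(window, coeffs_a))
    energy_b = sum(w * abs(b) ** 2 for w, b in zip(window, coeffs_b))
    cross = sum(w * a * b.conjugate()
                for w, a, b in zip(window, coeffs_a, coeffs_b))
    return energy_a, energy_b, cross
```

A spectrum with weight 0 does not contribute at all, which is how a window can exclude the part of the look-ahead frame that follows a transient.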

In the following, example window functions for different encoder/signal situations are described:

a) Case: single spatial parameter set 711, smooth transition, no transient in the look-ahead frame 590;

Window function 586: between the last spectrum of the preceding frame and the (K/Q)th spectrum 589, the window function 586 may rise linearly from 0 to 1. Between the (K/Q)th spectrum 589 and the 48th spectrum 589, the window function 586 may fall linearly from 1 to 0 (see Figure 5e).
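Case a) can be written down as a short Python sketch. It assumes that the spectra are numbered 1 to 2*K/Q, that the last spectrum of the preceding frame sits at position 0, and that `T` stands for K/Q (24 when the 48th spectrum is the last spectrum of the look-ahead frame). The function name and the exact sampling of the ramps are illustrative assumptions.

```python
def window_case_a(T):
    """Triangular window: 0 at position 0, 1 at spectrum T, 0 at spectrum 2*T.

    Returns the weights for spectra 1..2*T (list index 0 is spectrum 1).
    """
    rise = [i / T for i in range(1, T + 1)]                  # spectra 1..T
    fall = [(2 * T - i) / T for i in range(T + 1, 2 * T + 1)]  # spectra T+1..2T
    return rise + fall


weights_a = window_case_a(24)
```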

b) Case: single spatial parameter set 711, smooth transition, transient in the Nth spectrum (N > K/Q), i.e., a transient in the look-ahead frame 590;

Window function 721 as shown in Figure 7b: between the last spectrum of the preceding frame and the (K/Q)th spectrum, the window function 721 rises linearly from 0 to 1. Between the (K/Q)th spectrum and the (N-1)th spectrum, the window function 721 remains constant at 1. Between the Nth spectrum and the (2*K/Q)th spectrum, the window function 721 remains constant at 0. The transient at the Nth spectrum is indicated by the transient point 724 (which corresponds to the sampling point for a spatial parameter set of the immediately following frame 590). Furthermore, Figure 7b shows the complementary window function 722 (which is applied to the spectra of the current frame 585 when determining the one or more spatial parameter sets for the preceding frame) and the window function 723 (which is applied to the spectra of the following frame 590 when determining the one or more spatial parameter sets for the following frame). Overall, the window function 721 ensures that, in case of one or more transients in the look-ahead frame 590, the spectra of the look-ahead frame that precede the first transient point 724 are fully taken into account when determining the spatial parameter set 711 for the current frame 585. The spectra of the look-ahead frame 590 that follow the transient point 724, on the other hand, are ignored.
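A corresponding Python sketch of case b) is given below, under the same assumed numbering (spectra 1 to 2*K/Q, `T` standing for K/Q). The function name and the example values T = 24, N = 30 are hypothetical.

```python
def window_case_b(T, N):
    """Rise 0 -> 1 up to spectrum T, hold 1 up to spectrum N-1, then 0.

    N is the spectrum of the transient in the look-ahead frame (T < N <= 2*T).
    Returns the weights for spectra 1..2*T (list index 0 is spectrum 1).
    """
    if not T < N <= 2 * T:
        raise ValueError("the transient must lie in the look-ahead frame")
    weights = []
    for i in range(1, 2 * T + 1):
        if i <= T:
            weights.append(i / T)   # linear rise over the current frame
        elif i < N:
            weights.append(1.0)     # full weight before the transient
        else:
            weights.append(0.0)     # ignore everything from the transient on
    return weights


weights_b = window_case_b(24, 30)
```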

c) Case: single spatial parameter set 711, steep transition, transient in the Nth spectrum (N <= K/Q), no transient in the following frame 590;

Window function 731 as shown in Figure 7c: between the 1st spectrum and the (N-1)th spectrum, the window function 731 remains constant at 0. Between the Nth spectrum and the (K/Q)th spectrum, the window function 731 remains constant at 1. Between the (K/Q)th spectrum and the (2*K/Q)th spectrum, the window function 731 falls linearly from 1 to 0. Figure 7c indicates the transient point 734 at the Nth spectrum (which corresponds to the sampling point of the single spatial parameter set 711). Furthermore, Figure 7c shows the window function 732, which is applied to the spectra of the current frame 585 when determining the one or more spatial parameter sets for the preceding frame, and the window function 733, which is applied to the spectra of the following frame 590 when determining the one or more spatial parameter sets for the following frame.

d) Case: single spatial parameter set, steep transition, transients in the Nth spectrum and in the Mth spectrum (N <= K/Q, M > K/Q);

Window function 741 in Figure 7d: between the 1st spectrum and the (N-1)th spectrum, the window function 741 remains constant at 0. Between the Nth spectrum and the (M-1)th spectrum, the window function 741 remains constant at 1. Between the Mth spectrum and the 48th spectrum, the window function 741 remains constant at 0. Figure 7d indicates the transient point 744 at the Nth spectrum (i.e., the sampling point of the spatial parameter set) and the transient point 745 at the Mth spectrum. Furthermore, Figure 7d shows the window function 742, which is applied to the spectra of the current frame 585 when determining the one or more spatial parameter sets for the preceding frame, and the window function 743, which is applied to the spectra of the following frame 590 when determining the one or more spatial parameter sets for the following frame.
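The two steep-transition cases c) and d) for a single spatial parameter set differ only in what happens after the current frame, so they can be sketched with one Python function. The numbering convention (spectra 1 to 2*K/Q, `T` standing for K/Q), the function name, and the example values are assumptions for illustration.

```python
def window_steep_single(T, N, M=None):
    """Window for one parameter set with a transient at spectrum N (N <= T).

    M is None: no transient in the following frame (case c), so the window
               falls linearly from 1 at spectrum T to 0 at spectrum 2*T.
    M given:   second transient at spectrum M > T (case d), so the window
               stays at 1 up to spectrum M-1 and is 0 from spectrum M on.
    Returns the weights for spectra 1..2*T (list index 0 is spectrum 1).
    """
    if not 1 <= N <= T:
        raise ValueError("N must lie in the current frame")
    if M is not None and not T < M <= 2 * T:
        raise ValueError("M must lie in the following frame")
    weights = []
    for i in range(1, 2 * T + 1):
        if i < N:
            weights.append(0.0)              # nothing before the transient
        elif M is None:
            weights.append(1.0 if i <= T else (2 * T - i) / T)
        else:
            weights.append(1.0 if i < M else 0.0)
    return weights


weights_c = window_steep_single(24, 10)
weights_d = window_steep_single(24, 10, M=30)
```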

e) Case: two spatial parameter sets, smooth transition, no transient in the following frame;

Window functions:

i.) 1st spatial parameter set: between the last spectrum of the preceding frame and the (K/2Q)th spectrum, the window rises linearly from 0 to 1. Between the (K/2Q)th spectrum and the (K/Q)th spectrum, the window falls linearly from 1 to 0. Between the (K/Q)th spectrum and the (2*K/Q)th spectrum, the window remains constant at 0.

ii.) 2nd spatial parameter set: between the 1st spectrum and the (K/2Q)th spectrum, the window remains constant at 0. Between the (K/2Q)th spectrum and the (K/Q)th spectrum, the window rises linearly from 0 to 1. Between the (K/Q)th spectrum and the (3*K/2Q)th spectrum, the window falls linearly from 1 to 0. Between the (3*K/2Q)th spectrum and the (2*K/Q)th spectrum, the window remains constant at 0.
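Both windows of case e) are triangles of half-width K/2Q, one peaking at the (K/2Q)th spectrum and one at the (K/Q)th spectrum. The Python sketch below uses that observation; it assumes spectra numbered 1 to 2*K/Q, `T` standing for K/Q, and an even `T`. The names are illustrative.

```python
def windows_case_e(T):
    """Return (first, second) windows for two parameter sets, smooth transition.

    Each list holds the weights for spectra 1..2*T (index 0 is spectrum 1).
    """
    if T % 2:
        raise ValueError("T must be even so that T/2 is a whole spectrum index")
    half = T // 2

    def triangle(i, peak):
        # 1 at `peak`, falling linearly to 0 at a distance of `half` spectra
        return max(0.0, 1.0 - abs(i - peak) / half)

    first = [triangle(i, half) for i in range(1, 2 * T + 1)]
    second = [triangle(i, T) for i in range(1, 2 * T + 1)]
    return first, second


first_e, second_e = windows_case_e(24)
```

Between the (K/2Q)th and the (K/Q)th spectrum the falling edge of the first window and the rising edge of the second window add up to 1, so each spectrum in that range is shared between the two parameter sets.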

f) Case: two spatial parameter sets, smooth transition, transient in the Nth spectrum (N > K/Q);

Window function:

ii.) 2nd spatial parameter set: between the 1st spectrum and the (K/2Q)th spectrum, the window remains constant at 0. Between the (K/2Q)th spectrum and the (K/Q)th spectrum, the window rises linearly from 0 to 1. Between the (K/Q)th spectrum and the (N-1)th spectrum, the window remains constant at 1. Between the Nth spectrum and the (2*K/Q)th spectrum, the window remains constant at 0.

g) Case: two spatial parameter sets, steep transition, transients in the Nth spectrum and in the Mth spectrum (N < M <= K/Q), no transient in the following frame;

Window functions:

i.) 1st spatial parameter set: between the 1st spectrum and the (N-1)th spectrum, the window remains constant at 0. Between the Nth spectrum and the (M-1)th spectrum, the window remains constant at 1. Between the Mth spectrum and the (2*K/Q)th spectrum, the window remains constant at 0.

ii.) 2nd spatial parameter set: between the 1st spectrum and the (M-1)th spectrum, the window remains constant at 0. Between the Mth spectrum and the (K/Q)th spectrum, the window remains constant at 1. Between the (K/Q)th spectrum and the (2*K/Q)th spectrum, the window falls linearly from 1 to 0.
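Case g) can be sketched in Python as follows, again assuming spectra numbered 1 to 2*K/Q and `T` standing for K/Q. The function name and the example transient positions N = 5 and M = 15 are hypothetical.

```python
def windows_case_g(T, N, M):
    """Return (first, second) windows for two parameter sets, steep transition,
    with transients at spectra N and M of the current frame (N < M <= T).

    Each list holds the weights for spectra 1..2*T (index 0 is spectrum 1).
    """
    if not 1 <= N < M <= T:
        raise ValueError("expected 1 <= N < M <= T")
    # First set: rectangular window covering spectra N..M-1 only.
    first = [1.0 if N <= i < M else 0.0 for i in range(1, 2 * T + 1)]
    # Second set: 1 from the second transient to the frame end, then fade out.
    second = []
    for i in range(1, 2 * T + 1):
        if i < M:
            second.append(0.0)
        elif i <= T:
            second.append(1.0)
        else:
            second.append((2 * T - i) / T)
    return first, second


first_g, second_g = windows_case_g(24, 5, 15)
```

The two windows do not overlap, so the spectra before and after the second transient never contribute to the same parameter set.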

h) Case: two spatial parameter sets, steep transition, transients in the Nth, Mth and Oth spectra (N < M <= K/Q, O > K/Q);

Window function:

ii.) 2nd spatial parameter set: between the 1st spectrum and the (M-1)th spectrum, the window remains constant at 0. Between the Mth spectrum and the (O-1)th spectrum, the window remains constant at 1. Between the Oth spectrum and the (2*K/Q)th spectrum, the window remains constant at 0.

In summary, the following example rules may be formulated for the window function used to determine a current spatial parameter set:

• If the current spatial parameter set is not associated with a transient,

- the window function provides a smooth phase-in of the spectra from the sampling point of the preceding spatial parameter set up to the sampling point of the current spatial parameter set;

- the window function provides a smooth phase-out of the spectra from the sampling point of the current spatial parameter set up to the sampling point of the following spatial parameter set, if the following spatial parameter set is not associated with a transient;

- the window function fully takes into account the spectra from the sampling point of the current spatial parameter set up to the spectrum preceding the sampling point of the following spatial parameter set, and cancels out the spectra starting from the sampling point of the following spatial parameter set, if the following spatial parameter set is associated with a transient.

Claims

1. A method comprising:

Receiving, by an audio processor, a multi-channel input audio signal;

Determining a first set of dynamic range control (DRC) values, the first set of DRC values being configured to control the dynamic range of an output audio signal;

Determining a second set of DRC values, the second set of DRC values being configured to prevent the multi-channel input audio signal from clipping when it is downmixed by the audio processor;

Applying the second set of DRC values to the multi-channel input audio signal to obtain an attenuated multi-channel input audio signal;

Downmixing the attenuated multi-channel input audio signal to obtain a downmix signal; and

Generating the output audio signal from the first set of DRC values and the downmix signal.

2. An apparatus comprising:

One or more processors;

A memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations including the following:

Receiving a multi-channel input audio signal;

Determining a second set of DRC values, the second set of DRC values being configured to prevent the multi-channel input audio signal from clipping when it is downmixed by the apparatus;

3. The apparatus of claim 2, wherein generating the output audio signal includes applying the first set of DRC values to the downmix signal.

4. The apparatus of claim 2, wherein the first set of DRC values and/or the second set of DRC values are represented as dB values in logarithmic form.

5. The apparatus of claim 2, wherein the multichannel input audio signal is divided into a frame sequence of samples of the multichannel audio signal, and determining the first DRC value set and/or the second DRC value set includes determining the DRC value of each sample of each frame in the frame sequence.

6. The apparatus of claim 5, wherein determining the DRC value of each sample of the frame includes interpolation between the DRC value of the frame and the DRC value of the previous frame.

7. The apparatus of claim 6, wherein the interpolation is spline interpolation.

8. The apparatus of claim 2, wherein the downmix signal is a stereo signal.

9. The apparatus of claim 2, wherein the left and right channels of the downmix signal are generated based on different linear combinations of the channels of the multichannel input audio signal.

10. A parameter processing unit (520) configured to determine a spatial metadata frame for generating a multi-channel upmix signal from a corresponding frame of a downmix signal; wherein the downmix signal includes m channels, and wherein the multi-channel upmix signal includes n channels; n and m are integers, where m < n; wherein the spatial metadata frame includes one or more sets of spatial parameters (711, 712); the parameter processing unit (520) includes:

- A transformation unit (521) configured to determine a plurality of spectra (589) from the current frame (585) and the immediately following frame (590) of the channels of the multi-channel input signal (561); and

- A parameter determination unit (523) configured to determine the spatial metadata frame for the current frame of the channel of the multi-channel input signal (561) by weighting the plurality of spectra (589) using a window function (586);

wherein the window function (586) depends on one or more of the following: the number of spatial parameter sets (711, 712) included in the spatial metadata frame, the presence of one or more transients in the current frame or in the immediately following frame of the multi-channel input signal (561), and/or the time instant of the transient.

11. A parameter processing unit (520) configured to determine a spatial metadata frame for generating a multi-channel upmix signal from a corresponding frame of a downmix signal; wherein the downmix signal includes m channels, and wherein the multi-channel upmix signal includes n channels; n and m are integers, where m < n; wherein the spatial metadata frame includes a set of spatial parameters (711); the parameter processing unit (520) includes:

- A transformation unit (521) configured to: determine a first plurality of transform coefficients (580) from a frame (585) of a first channel (561-1) of a multi-channel input signal (561), and determine a second plurality of transform coefficients (580) from a corresponding frame of a second channel (561-2) of the multi-channel input signal (561); wherein the first channel (561-1) and the second channel (561-2) are different; wherein the first plurality of transform coefficients (580) and the second plurality of transform coefficients (580) provide a first time/frequency representation and a second time/frequency representation of the frames (585) of the first channel and the second channel, respectively; wherein the first time/frequency representation and the second time/frequency representation include a plurality of frequency intervals (571) and a plurality of time intervals (582); and

- A parameter determination unit (523) configured to determine the spatial parameter set (711) using fixed-point arithmetic based on the first plurality of transform coefficients (580) and the second plurality of transform coefficients (580); wherein the spatial parameter set (711) includes corresponding band parameters for different frequency bands (572) including different numbers of frequency intervals (571); wherein a specific band parameter for the specific frequency band (572) is determined based on the transform coefficients (580) of the first plurality of transform coefficients (580) and the second plurality of transform coefficients (580) from the specific frequency band (572); and wherein the shift used by the fixed-point arithmetic to determine the specific band parameter depends on the specific frequency band (572).

12. An audio encoding system (500) configured to generate a bitstream (564) based on a multi-channel input signal (561); the system (500) comprising:

- A downmix processing unit (510) configured to generate a frame sequence of a downmix signal from a corresponding first frame sequence of the multi-channel input signal (561); wherein the downmix signal includes m channels, and wherein the multi-channel input signal (561) includes n channels; n and m are integers, where m < n;

- A parameter processing unit (520) configured to determine a spatial metadata frame sequence from a second frame sequence of the multi-channel input signal (561); wherein the frame sequence of the downmix signal and the spatial metadata frame sequence are used to generate a multi-channel upmix signal comprising n channels; and

- A bitstream generation unit (503) configured to generate a bitstream (564) including a bitstream frame sequence, wherein a bitstream frame indicates a frame of the downmix signal corresponding to a first frame of the first frame sequence of the multi-channel input signal (561) and a spatial metadata frame corresponding to a second frame of the second frame sequence of the multi-channel input signal (561); wherein the second frame is different from the first frame.

13. A method for determining a spatial metadata frame, the spatial metadata frame being used to generate a frame of a multichannel upmix signal from a corresponding frame of a downmix signal; wherein the downmix signal comprises m channels, and wherein the multichannel upmix signal comprises n channels; n and m are integers, where m < n; wherein the spatial metadata frame comprises one or more sets of spatial parameters (711, 712); the method comprising:

- Determining a plurality of spectra (589) from the current frame (585) and the immediately following frame (590) of the channels of the multi-channel input signal (561);

- Weighting the plurality of spectra (589) using a window function (586) to obtain a plurality of weighted spectra; and

- Determining the spatial metadata frame for the current frame of the channel of the multi-channel input signal (561) based on the plurality of weighted spectra;

14. A method for determining a spatial metadata frame, the spatial metadata frame being used to generate a frame of a multichannel upmix signal from a corresponding frame of a downmix signal; wherein the downmix signal comprises m channels, and wherein the multichannel upmix signal comprises n channels; n and m are integers, where m < n; wherein the spatial metadata frame comprises a set of spatial parameters (711); the method comprising:

- Determining a first plurality of transform coefficients (580) from a frame (585) of a first channel (561-1) of a multi-channel input signal (561);

- Determining a second plurality of transform coefficients (580) from a corresponding frame of a second channel (561-2) of the multi-channel input signal (561); wherein the first channel (561-1) and the second channel (561-2) are different;

wherein the first plurality of transform coefficients (580) and the second plurality of transform coefficients (580) provide a first time/frequency representation and a second time/frequency representation of the frames (585) of the first channel and the second channel, respectively; wherein the first time/frequency representation and the second time/frequency representation include a plurality of frequency intervals (571) and a plurality of time intervals (582); wherein the spatial parameter set (711) includes corresponding band parameters for different frequency bands (572) that include different numbers of frequency intervals (571);

- Determining a shift to be applied when using fixed-point arithmetic to determine a specific band parameter for a specific frequency band (572); wherein the shift is determined based on the specific frequency band (572); and

- Determining the specific band parameter using fixed-point arithmetic and the determined shift, based on the transform coefficients of the first plurality of transform coefficients (580) and of the second plurality of transform coefficients (580) that fall within the specific frequency band (572).

15. A method for generating a bitstream (564) based on a multi-channel input signal (561); the method comprising:

- Generating a frame sequence of a downmix signal from a corresponding first frame sequence of the multi-channel input signal (561); wherein the downmix signal comprises m channels, and wherein the multi-channel input signal (561) comprises n channels; n and m are integers, where m < n;

- Determining a spatial metadata frame sequence from a second frame sequence of the multi-channel input signal (561); wherein the frame sequence of the downmix signal and the spatial metadata frame sequence are used to generate a multi-channel upmix signal comprising n channels; and

- Generating a bitstream (564) that includes a sequence of bitstream frames;

wherein a bitstream frame indicates a frame of the downmix signal corresponding to a first frame of the first frame sequence of the multi-channel input signal (561) and a spatial metadata frame corresponding to a second frame of the second frame sequence of the multi-channel input signal (561); wherein the second frame is different from the first frame.