JP7632479B2 - Video and audio synthesis device, method and program

Info

Publication number: JP7632479B2
Application number: JP2022570806A
Authority: JP
Inventors: 稔久藤原; 央也小野; 達也福井; 智彦池田; 亮太椎名
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2025-02-19
Anticipated expiration: 2040-12-22
Also published as: JPWO2022137326A1; WO2022137326A1

Description

The present disclosure relates to a video and audio synthesis system that combines multiple video and audio input signals into a single screen and a single audio output.

In recent years, many video devices have come into use, and their video uses a wide variety of pixel counts (resolutions), frame rates, and so on. Although the physical signals, control signals, and other details of these video signals differ by standard, each transmits one screen over a time equal to the reciprocal of the frame rate.

One way such video is used, as in video conferencing, is to display multiple cameras on fewer monitors than there are cameras. In such cases, screen synthesis is performed, for example by displaying multiple video signals in separate regions of one screen or by embedding reduced versions of other video screens within one video screen, and the corresponding audio signals are synthesized at the same time.

Video signals are normally not synchronized with one another, so the signals to be combined arrive with different timings. The signals are therefore temporarily buffered in memory or the like before being synthesized. As a result, delay arises in the output of the synthesized screen and in the audio signal.

Suppose an ensemble performance between remote locations is carried out over a video conference that uses this kind of screen and audio synthesis. The delay introduced by the synthesis greatly impairs its feasibility. For example, for a piece at 120 beats per minute (hereinafter, 120 BPM), the duration of one beat is 60/120 seconds = 500 milliseconds. If the performers must stay aligned to within 5% of a beat, the delay from capture by the camera to display must be kept to 500 × 0.05 = 25 milliseconds or less.
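The delay budget above can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name `latency_budget_ms` and its parameters are assumptions introduced here, not terms from the disclosure.

```python
def latency_budget_ms(bpm: float, tolerance: float) -> float:
    """Return the allowed capture-to-display delay in milliseconds.

    bpm       -- tempo in beats per minute (120 in the example above)
    tolerance -- required alignment as a fraction of one beat (0.05 for 5%)
    """
    beat_ms = 60.0 / bpm * 1000.0  # duration of one beat in milliseconds
    return beat_ms * tolerance


if __name__ == "__main__":
    # The example from the text: 120 BPM with a 5% tolerance.
    print(latency_budget_ms(120, 0.05))
```

Faster tempos or tighter tolerances shrink the budget proportionally, which is why the synthesis delay matters for this use case.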

In practice, the delay from capture by the camera to display also includes delays other than the synthesis processing, such as the image processing time in the camera, the display time of the monitor, and the transmission time. The accompanying audio signal is delayed as well. As a result, with conventional technology it is difficult to collaborate in applications where timing is critical, such as performing in an ensemble while watching each other's video from remote locations. If even one of the video signal and the audio signal has low latency, collaborative work in timing-critical applications may be possible.

Therefore, for collaborative work with strict low-latency requirements, the present disclosure provides a system that synthesizes multiple screens and multiple audio signals, for example from multiple sites, and that reduces the delay from the input of asynchronous video and audio signals to, in particular, the audio output.

Non-Patent Document 1: VESA and Industry Standards and Guidelines for Computer Display Monitor Timing (DMT), Version 1.0, Rev. 13, February 8, 2013

An object of the present disclosure is to shorten the delay time until an audio signal is output.

The video and audio synthesis device of the present disclosure:
acquires video signals and audio signals constituting a plurality of videos with audio;
separates the video signal and the audio signal from each of the plurality of videos with audio;
performs the synthesis of the video signals separately from the processing of the audio signals; and
outputs the audio signal without waiting for the synthesis of the video signals to complete.

In the video and audio synthesis method of the present disclosure, a video and audio synthesis device:
acquires video signals and audio signals constituting a plurality of videos with audio;
separates the video signal and the audio signal from each of the plurality of videos with audio;
performs the synthesis of the video signals separately from the processing of the audio signals; and
outputs the audio signal without waiting for the synthesis of the video signals to complete.

The program of the present disclosure is a program for causing a computer to function as each functional unit of the video and audio synthesis device of the present disclosure, and for causing a computer to execute each step of the method performed by that device.

The present disclosure can shorten the delay time until an audio signal is output.

FIG. 1 shows an example of screen information included in a video signal.
FIG. 2 shows an example of screen synthesis.
FIG. 3 shows an example of a video and audio synthesis method related to the present disclosure.
FIG. 4 shows an example of the video and audio synthesis method of the present disclosure.
FIG. 5 shows an example of the video and audio synthesis method of the present disclosure.
FIG. 6 shows a configuration example of the video and audio synthesis device according to the present embodiment.

Embodiments of the present disclosure are described in detail below with reference to the drawings. The present disclosure is not limited to the embodiments shown below. These embodiments are merely examples, and the present disclosure can be implemented in forms with various modifications and improvements based on the knowledge of those skilled in the art. Components with the same reference numerals in this specification and the drawings denote the same components.

FIG. 1 shows an example of screen information included in a video signal. The screen information is transmitted by scanning the screen horizontally one scanning line 21 at a time, proceeding to the scanning line 21 below in sequence. In addition to the display screen 24, this scan covers overhead information and signals such as the blanking portion 22 and the border portion 23. The blanking portion 22 may carry information other than video information, such as control information and audio information (see, for example, Non-Patent Document 1, Chapter 3).

FIG. 2 shows an example of video signal synthesis. As an example, the present disclosure describes a case in which four video signals are input to the video and audio synthesis device, which synthesizes them into one video signal and outputs it. A video signal transmits one screen over a time equal to the reciprocal of its frame rate. For example, a video signal with 60 frames per second (hereinafter, 60 fps) transmits one screen over 1/60 second, that is, approximately 16.7 milliseconds. The information for one screen at each point in time in a video signal is called a "video frame", the information for one screen of each video signal input to the video and audio synthesis device is called an "input frame", and the information for one synthesized screen output from the video and audio synthesis device is called an "output frame". In the present disclosure, an "input frame" and an "output frame" may include an audio signal.

For example, as shown in FIG. 3, consider a video and audio synthesis device that takes four video and audio signals with different timings as input and synthesizes them into one output frame, reading in all of the input frames before synthesizing and outputting. In this case, if the frame time is T_f and the synthesis processing time is T_p, the output frame into which the video and audio signals are synthesized is delayed by up to 2T_f + T_p from the time the input frame of input 1, the first to arrive, was input. For 60 fps video, for example, the synthesized video may contain a delay of two frame times or more, that is, 33.3 milliseconds or more.
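The worst-case figure above can be checked numerically. The following is a minimal sketch; the function names and the 5 ms value used for T_p in the usage example are assumptions chosen for illustration, since the disclosure does not give a value for T_p.

```python
def frame_time_ms(fps: float) -> float:
    """Frame time T_f in milliseconds: the reciprocal of the frame rate."""
    return 1000.0 / fps


def worst_case_delay_ms(fps: float, t_p_ms: float) -> float:
    """Upper bound 2*T_f + T_p for the read-all-then-synthesize scheme."""
    return 2.0 * frame_time_ms(fps) + t_p_ms


if __name__ == "__main__":
    print(round(frame_time_ms(60), 1))             # one frame time at 60 fps
    print(round(worst_case_delay_ms(60, 0.0), 1))  # bound with T_p ignored
    print(round(worst_case_delay_ms(60, 5.0), 1))  # bound with an assumed T_p of 5 ms
```

Even with T_p set to zero, the bound at 60 fps already exceeds the 25 ms budget derived for a 120 BPM ensemble.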

In the present disclosure, "video x,k" on the input side is the k-th video frame of input x, and "audio x,k" is the audio signal contained in the k-th video frame of input x. On the output side, "video k,k,k,k" is the k-th output frame, synthesized from video 1,k, video 2,k, video 3,k, and video 4,k.

The present disclosure is a system that receives video signals and audio signals constituting a plurality of asynchronous videos with audio and synthesizes those video and audio signals. It is characterized in that it separates the audio signals from the input signals, synthesizes the video signals separately from the audio signals, and synthesizes and superimposes the audio signals onto a video signal without waiting for the synthesis of the original video signals.

FIG. 4 shows an example of synthesis according to the present disclosure. As shown in the figure, audio 1,k, audio 2,k, audio 3,k, and audio 4,k would ordinarily be synthesized as the audio of video k,k,k,k, the video synthesized from their original video frames. In the present disclosure, however, only the audio signals are separated, and they are synthesized and superimposed onto the earliest possible video frame.

For example, as shown in FIG. 4, when each input frame of the audio signal can be output without delay, the video and audio synthesis device outputs the output frame of the audio signal without delay. In this case, if the input time of the input frame of video 1,k and audio 1,k coincides with the output time of the output frame of video k-2,k-2,k-2,k-2, the video and audio synthesis device superimposes audio 1,k on video k-2,k-2,k-2,k-2. If the input frames of video 2,k and audio 2,k, video 3,k and audio 3,k, and video 4,k and audio 4,k are input while the output frames of video k-2,k-2,k-2,k-2 and video k-1,k-1,k-1,k-1 are being output, the video and audio synthesis device splits audio 2,k, audio 3,k, and audio 4,k between video k-2,k-2,k-2,k-2 and video k-1,k-1,k-1,k-1 and superimposes them accordingly. This greatly reduces the delay of the audio signal. Although FIG. 4 shows an example in which each input frame of the audio signal can be output without delay, the present disclosure is not limited to this, and the audio signal can be output at any timing earlier than the output timing of the video signal that was input together with it.
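The way one audio input frame can end up split across two consecutive output frames can be sketched as follows. This is a simplified timing model, not the device's actual implementation: it assumes output frames are emitted back to back with a fixed period, that audio is passed through with zero delay, and it numbers output frames from 0 (so index 0 plays the role of video k-2,k-2,k-2,k-2 and index 1 that of video k-1,k-1,k-1,k-1). The function name and the use of integer microseconds are choices made for this example.

```python
def split_audio_over_output_frames(arrival_us: int, frame_us: int):
    """Split one audio input frame across the output frames it overlaps.

    arrival_us -- time at which the audio input frame starts arriving
    frame_us   -- frame period, shared by inputs and output in this model

    Returns a list of (output_frame_index, start_us, end_us) segments.
    Each segment is superimposed on the output frame that is being
    output while that part of the audio arrives.
    """
    end_us = arrival_us + frame_us
    segments = []
    t = arrival_us
    while t < end_us:
        index = t // frame_us               # output frame active at time t
        boundary = (index + 1) * frame_us   # where that output frame ends
        seg_end = min(boundary, end_us)
        segments.append((index, t, seg_end))
        t = seg_end
    return segments


if __name__ == "__main__":
    # An input aligned with the output frame is carried by a single output frame.
    print(split_audio_over_output_frames(0, 1000))
    # An input arriving a quarter frame later is split over two output frames.
    print(split_audio_over_output_frames(250, 1000))
```

In this model no audio sample waits for its own video frame to be synthesized; it is attached to whichever output frame is current when it arrives.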

Also, for example, the output audio signal at time t1 consists of the data of audio 1,k, audio 2,k, video 3,k and audio 3,k, and video 4,k and audio 4,k. Each of these can be superimposed on video k-2,k-2,k-2,k-2 as a separate audio channel, or some or all of the audio signals can be mixed at arbitrary levels and then superimposed on video k-2,k-2,k-2,k-2 as a single audio channel. Keeping the channels separate allows the audiovisual equipment that receives the output signal and presents the video and audio to control the audio level for each input signal. Mixing the channels avoids the limit on the number of audio channels that can be multiplexed onto a video signal.
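The two options, separate channels versus a single mixed channel, can be illustrated with plain Python lists of samples. This is a sketch under assumptions: the function names, the per-input gain list, and the zero-padding of shorter inputs are choices made here for illustration.

```python
def as_separate_channels(inputs):
    """Keep each input's audio as its own channel (the receiver sets levels)."""
    return [list(samples) for samples in inputs]


def mix_to_one_channel(inputs, gains):
    """Mix all inputs into one channel, scaling each input by its gain."""
    if len(inputs) != len(gains):
        raise ValueError("one gain is required per input")
    length = max((len(samples) for samples in inputs), default=0)
    mixed = [0.0] * length
    for samples, gain in zip(inputs, gains):
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample  # shorter inputs are treated as zero-padded
    return mixed


if __name__ == "__main__":
    audio = [[1.0, 2.0], [3.0, 4.0]]
    print(as_separate_channels(audio))          # two channels, unchanged
    print(mix_to_one_channel(audio, [0.5, 0.5]))  # one channel, equal weights
```

With separate channels the output needs as many audio channels as there are inputs; with mixing it needs only one, at the cost of fixing the relative levels before transmission.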

Although FIG. 4 shows an example in which each input frame of the audio signal is output without being synthesized, the present disclosure is not limited to this, and synthesis processing may also be applied to the audio signals. For example, as shown in FIG. 5, audio 1,k, audio 2,k, audio 3,k, and audio 4,k may be synthesized so as to match the timing of the output frame of audio 4,k, which arrives last. This allows the video signal and the audio signal to each be handled as one frame.
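The FIG. 5 variant, in which the audio frames with the same index are held until the last one arrives, can be sketched as a small timing calculation. The function name, the dictionary keys, and the arrival times in the usage example are hypothetical values chosen for illustration.

```python
def align_to_last_arrival(arrivals_ms):
    """Return (release_time, wait_per_input) for audio frames of one index k.

    arrivals_ms -- mapping from input name to the arrival time of its
                   k-th audio frame, in milliseconds
    All frames are released together when the last one has arrived, so
    each input waits for the difference to the latest arrival.
    """
    release = max(arrivals_ms.values())
    waits = {name: release - t for name, t in arrivals_ms.items()}
    return release, waits


if __name__ == "__main__":
    # Hypothetical arrival times of audio 1,k to audio 4,k.
    print(align_to_last_arrival({"in1": 0, "in2": 4, "in3": 9, "in4": 14}))
```

The earliest input pays the largest wait, but that wait stays below one frame time as long as all inputs run at the same frame rate, which is still shorter than waiting for the video synthesis.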

FIG. 6 shows a configuration example of the video and audio synthesis device according to the present embodiment. The video and audio synthesis device 10 according to the present embodiment includes a separation unit 101, an up/down converter 102, a buffer 103, and a superimposition unit 104. The figure shows four inputs and one output, but any number of inputs and outputs may be used.

The separation unit 101 is a functional unit that separates each input signal into a video signal and an audio signal. The separated signals are input to the superimposition unit 104 described below and may additionally be output externally.
The up/down converter 102 scales the pixel count up or down to an arbitrary size. For example, it enlarges or reduces the pixel count of input 1 to match the screen size shown in FIG. 2.
The buffer 103 stores each input frame. It buffers the frames received from the up/down converter 102 and can output them in any order.
The superimposition unit 104 performs pixel synthesis, audio synthesis, and superimposition. It reads the pixels of each input from the buffer 103 and synthesizes the entire screen. It also synthesizes the audio signals and immediately superimposes the result on the video signal as its audio signal, regardless of which original video the audio had been superimposed on, and outputs the result.
The superimposition unit 104 may add an arbitrary control signal to the blanking portion of the screen. In addition, if some of the audio signals separated by the separation unit 101 do not require synthesis processing, the video and audio synthesis device 10 may output those audio signals without passing them through the superimposition unit 104.
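The data flow through units 101 to 104 can be sketched in Python as follows. This is a toy model under stated assumptions, not the device's implementation: frames are plain lists, the up/down converter uses nearest-neighbour scaling (the disclosure does not name a scaling method), the buffer keeps only the latest picture per input, and every name here (`InputFrame`, `on_input_frame`, the `layout` mapping) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InputFrame:
    source: int   # input number (1 to 4 in the figure)
    pixels: list  # 2-D list of pixel values
    audio: list   # audio samples carried with this frame


def separate(frame):
    """Separation unit 101: split an input frame into video and audio."""
    return frame.pixels, frame.audio


def scale(pixels, out_h, out_w):
    """Up/down converter 102: nearest-neighbour resize (assumed method)."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [[pixels[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)] for r in range(out_h)]


@dataclass
class Buffer:
    """Buffer 103: holds the latest scaled picture of each input."""
    latest: dict = field(default_factory=dict)

    def store(self, source, pixels):
        self.latest[source] = pixels


def compose(buffer, layout, tile_h, tile_w, audio):
    """Superimposition unit 104: tile the buffered pictures, attach audio."""
    rows = 1 + max(r for r, _ in layout.values())
    cols = 1 + max(c for _, c in layout.values())
    canvas = [[0] * (cols * tile_w) for _ in range(rows * tile_h)]
    for source, (tr, tc) in layout.items():
        tile = buffer.latest.get(source)
        if tile is None:
            continue  # nothing buffered yet for this input
        for r in range(tile_h):
            for c in range(tile_w):
                canvas[tr * tile_h + r][tc * tile_w + c] = tile[r][c]
    return {"pixels": canvas, "audio": dict(audio)}


def on_input_frame(frame, buffer, layout, tile_h, tile_w):
    """Handle one arriving input frame and return the current output frame."""
    pixels, audio = separate(frame)
    # The new audio rides on an output frame built from earlier video.
    out = compose(buffer, layout, tile_h, tile_w, {frame.source: audio})
    # The new picture becomes visible only in a later output frame.
    buffer.store(frame.source, scale(pixels, tile_h, tile_w))
    return out


if __name__ == "__main__":
    layout = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}  # 2x2 split screen
    buf = Buffer()
    print(on_input_frame(InputFrame(1, [[7, 7], [7, 7]], [0.5]), buf, layout, 1, 1))
    print(on_input_frame(InputFrame(1, [[9, 9], [9, 9]], [0.25]), buf, layout, 1, 1))
```

In the second call the output frame shows the picture from the first input frame while already carrying the audio of the second, which is the decoupling of audio from its original video frame that the superimposition unit provides.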

The video and audio synthesis device of the present disclosure can also be realized by a computer and a program, and the program can be recorded on a recording medium or provided over a network.

The embodiment above shows an example with four inputs and one screen divided into four, but the present disclosure is not limited to this and can be applied to any number of inputs.

(Effects of the present disclosure)
The asynchronous video and audio input signals are separated, the audio signals are synthesized without waiting for completion of the synthesis of the video signals they were separated from, and the audio signals are superimposed on a video signal whose synthesis has already completed. This shortens the delay time until the synthesized audio signal is output.
This makes collaborative work with strict low-latency requirements possible in a system that synthesizes multiple screens, for example from multiple sites.

The present disclosure can be applied to the information and communications industry, which distributes video and game content, as well as to the film, advertising, and game industries involved in video production.

10: Video and audio synthesis device
21: Scanning line
22: Blanking portion
23: Border portion
24: Display screen
101: Separation unit
102: Up/down converter
103: Buffer
104: Superimposition unit

Claims

1. A video and audio synthesis device that:
acquires video signals and audio signals constituting a plurality of videos with audio;
separates the video signal and the audio signal from each of the plurality of videos with audio;
performs synthesis of the video signals separately from processing of the audio signals;
outputs the audio signal without waiting for completion of the synthesis of the video signals; and
synthesizes the audio signals included in the plurality of videos with audio in accordance with the last audio signal of the plurality of videos with audio in one frame.

2. The video and audio synthesis device according to claim 1, wherein synthesis of the audio signal corresponding to an acquired video signal with a video signal acquired prior to the acquired video signal is started without waiting for the start of synthesis of the video frame corresponding to the acquired video signal.

3. A video and audio synthesis method in which a video and audio synthesis device:
acquires video signals and audio signals constituting a plurality of videos with audio;
separates the video signal and the audio signal from each of the plurality of videos with audio;
performs synthesis of the video signals separately from processing of the audio signals;
outputs the audio signal without waiting for completion of the synthesis of the video signals; and
synthesizes the audio signals included in the plurality of videos with audio in accordance with the last audio signal of the plurality of videos with audio in one frame.

4. The video and audio synthesis method according to claim 3, wherein synthesis of the audio signal corresponding to an acquired video signal with a video signal acquired prior to the acquired video signal is started without waiting for the start of synthesis of the video frame corresponding to the acquired video signal.

5. A program for causing a computer to realize each of the functional units of the video and audio synthesis device according to claim 1.