AU767645B2

AU767645B2 - Method of selecting a video key frame for use in small devices

Info

Publication number: AU767645B2
Application number: AU55935/01A
Authority: AU
Inventors: Alison Joan Lennon; Daniel John Lloyd-Jones; Jing Wu
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-08-02
Filing date: 2001-07-24
Publication date: 2003-11-20
Anticipated expiration: 2021-07-24
Also published as: AU5593501A

Description

S&FRef: 560330

AUSTRALIA

PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT

ORIGINAL

Name and Address of Applicant: Canon Kabushiki Kaisha, 30-2, Shimomaruko 3-chome, Ohta-ku, Tokyo 146, Japan

Actual Inventor(s): Alison Joan Lennon, Daniel John Lloyd-Jones, Jing Wu

Address for Service: Spruson & Ferguson, St Martins Tower, Level 31 Market Street, Sydney NSW 2000 (CCN 3710000177)

Invention Title: Method of Selecting a Video Key Frame for Use in Small Devices

ASSOCIATED PROVISIONAL APPLICATION DETAILS

[33] Country: AU
[31] Applic. No(s): PQ9137
[32] Application Date: 02 Aug 2000

The following statement is a full description of this invention, including the best method of performing it known to me/us:

METHOD OF SELECTING A VIDEO KEY FRAME FOR USE IN SMALL DEVICES

Field of the Invention

The present invention relates to the selection of key frames from digital video and, in particular, to the performance of such selection for visualisation use in applications running on devices having limited display sizes.

Background Art

In order to visualise large volumes of data usually involved with digital video, applications typically divide the video into segments, often called shots or clips. Each segment is formed by a sequence of video frames obtained at a particular capture rate, such as 25 frames per second according to the PAL standard. Each segment may then be summarised or otherwise depicted by one or more representative or key frames extracted from the corresponding sequence. The purpose of the extracted key frames is to provide the user with information about what is contained in the video segment, without requiring the user to view the entire video segment, or a significant portion thereof.

In some cases the segmentation of the video is performed interactively with the user. For example, the user may select where a new segment might start. In other cases, an automated process performs the segmentation. Such automated processes typically use image processing techniques to determine whether a particular frame of the digital video is significantly different from a previous frame. This determination is usually based on the colour information of the image represented by each frame. The colour information is typically represented using a colour histogram, and when the colour histograms obtained from consecutive frames are significantly different, a segmentation point is created, thereby creating a new segment starting from the second frame of the pair being compared. In other situations, automated segmentation processes may make use of on/off information of the recording process performed by the digital video camera used to capture the segment. Information about when the recording process was started and stopped is stored in some digital video data formats such as the Digital Video (DV) format used by manufacturers such as Sony Corporation and Canon Inc., both of Japan.
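The colour-histogram comparison described above can be sketched as follows. This is a minimal illustration only: the bin count, the L1 distance measure, the `cut_threshold` value and all function names are assumptions, since the specification does not state how the histograms are built or compared.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Frame = Sequence[Pixel]        # a frame flattened to a sequence of pixels


def colour_histogram(frame: Frame, bins_per_channel: int = 4) -> List[float]:
    """Normalised joint RGB histogram with bins_per_channel**3 bins."""
    step = 256 // bins_per_channel
    top = bins_per_channel - 1
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in frame:
        rb, gb, bb = min(r // step, top), min(g // step, top), min(b // step, top)
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = float(len(frame)) or 1.0
    return [count / total for count in hist]


def segmentation_points(frames: Sequence[Frame], cut_threshold: float = 0.5) -> List[int]:
    """Indices of frames that start a new segment (frame 0 always does)."""
    if not frames:
        return []
    starts = [0]
    previous = colour_histogram(frames[0])
    for i in range(1, len(frames)):
        current = colour_histogram(frames[i])
        # L1 distance between two normalised histograms lies in [0, 2].
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > cut_threshold:
            starts.append(i)   # new segment begins at the second frame of the pair
        previous = current
    return starts
```

A production system would normally tune the threshold per source and may smooth over flashes or fades; none of that is attempted here.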

Given a segment, it is usual to represent the segment by one or more representative or key frames. Many applications use only a single key frame, however some applications allow multiple key frames to be selected. The multiple key frames may then be presented as an animated preview to the user.

In many cases, when image processing techniques have been used in the segmentation process, the first frame of a segment is often used to represent the segment.

In other cases, more elaborate methods are designed to identify the frame that is most representative. These methods may try to analyse the segment in terms of the camera motion (eg. pan and zoom) to detect key frames at key points with respect to that motion (eg. at the end of a zoom). A post processing step can be added to remove any redundant key frames. Such a step may be performed manually or by automated comparison of the differences between identified key frames.

In the above-mentioned arrangements, key frames are usually identified on the basis of the frames being representative of the content of the segment. However, in many applications (eg. video browsing applications) there is a requirement to simultaneously present a large number of key frames to the user for the purpose of selection. It follows that the space available to display the key frames is usually very small. Where a high resolution display of reasonable size is available, such as typically found with desktop and notebook computing systems, the ability of the user to discern information from a large number of "small" key frames is generally not substantially degraded, although the ultimate level of degradation will depend on the ultimate display size of each frame. The problem of such degradation is however significantly exacerbated when such applications are deployed in devices having small display areas (eg. digital video cameras, personal digital assistants, and the like). Such devices often have displays of lower resolution.

The space available for the display of individual key frames can often be increased at the cost of presenting fewer frames for the user to browse at any one time. However this potential solution has the problem that it reduces the overall context from which a user might be expected to select a video segment. Such may require the user to browse through the data on a page-by-page basis and hence the application might be slower and more cumbersome to use.

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

Summary of the Invention

In accordance with one aspect of the present invention there is disclosed a method of identifying at least one key frame to visually represent a digital video segment, said key frame being intended for use in an application that utilises at least one of a small viewing area and a low spatial display resolution, wherein said digital video segment comprises a sequence of temporally contiguous image frames and said application uses said at least one key frame to identify corresponding digital video segments, said method comprising the steps of:
(a) processing at least one said frame of said digital video segment, said processing comprising, for each said frame, the sub-steps of:
(i) segmenting said frame into one or more regions of substantially homogeneous colour; and
(ii) identifying those of said regions that exceed a predetermined threshold of human perception for said frame and storing a value representing the number of said identified regions;
(b) examining said stored values to identify that said frame in the said digital video segment that contains the largest number of regions that exceed said predetermined threshold; and
(c) selecting said identified frame as the key frame of said digital video segment.

Preferably, the predetermined threshold of human perception is selected from the group consisting of a size of the regions and a luminance difference between the regions.

In accordance with another aspect of the present invention there is disclosed a method of identifying at least one key frame to visually represent a digital video segment, said key frame being intended for use in an application that utilises at least one of a small viewing area and a low spatial display resolution, wherein said digital video segment comprises a sequence of temporally contiguous image frames and said application uses said at least one key frame to identify corresponding digital video segments, said method comprising the steps of:
(a) processing at least one said frame of said digital video segment, said processing comprising, for each said frame, the sub-steps of:
(i) segmenting said frame into one or more regions of substantially homogeneous colour; and
(ii) identifying those of said regions that exceed a predetermined threshold for region size for said frame and storing a value representing the number of said identified regions;
(b) examining said stored values to identify that said frame in the said digital video segment that contains the largest number of regions that exceed said predetermined region size threshold; and
(c) selecting said identified frame as the key frame of said digital video segment.

In accordance with another aspect of the present invention there is disclosed a method of identifying at least one key frame to visually represent a digital video segment intended for use in an application that utilises at least one of a small viewing area and a low spatial display resolution, wherein said digital video segment comprises a sequence of temporally contiguous image frames and said application uses said at least one key frame to identify corresponding digital video segments, said method comprising the steps of:
(a) processing at least one said frame of said digital video segment, said processing comprising, for each said frame, the sub-steps of:
(i) segmenting said frame into one or more regions of substantially homogeneous colour;
(ii) identifying a set of said regions each having a size that exceeds a predetermined size;
(iii) calculating a luminance value for each of said identified regions;
(iv) calculating a luminance difference value between pairs of said identified regions; and
(v) storing the maximum luminance difference value for said frame;
(b) examining the stored maximum luminance difference values to identify that said frame in the said digital video segment associated with the largest stored luminance difference value; and
(c) selecting said identified frame as the key frame of said digital video sequence.

In accordance with another aspect of the present invention there is disclosed a method of identifying at least one key frame to visually represent a digital video segment, said digital video segment comprising a sequence of temporally contiguous video frames, said method comprising the steps of:
(a) processing at least one said frame of said segment, said processing comprising, for each said frame, the sub-steps of:
(aa) segmenting said frame into one or more substantially homogeneous regions;
(ab) identifying regions that exceed a threshold for region size;
(ac) identifying a characteristic of said regions to obtain a corresponding representative value for said frame, said representative value relating to an extent of interpretability of said frame when said frame is viewed at at least one of small size and low spatial resolution;
(b) examining said representative values for said segment to identify a desired one of said representative values; and
(c) selecting one of said frames corresponding to said identified representative value as said key frame for said segment.

In accordance with another aspect of the present invention there is disclosed a method of identifying a plurality of key frames each for visually representing a corresponding mutually exclusive digital video segment of a sequence of said segments, said method comprising the steps of:
(a) processing each said segment in order in said sequence according to the method of any one of the preceding aspects; and
(b) processing said key frames of adjacent ones of said segments in said sequence to select for a current segment a key frame that is relatively dissimilar to that of the preceding segment.

Brief Description of the Drawings

One or more embodiments of the present invention will now be described with reference to the drawings, in which:
Fig. 1 is a flow diagram of a generalised video processing method;
Fig. 2 is a flow diagram of a first key frame selection method;
Fig. 3 is a flow diagram of a further key frame selection method;
Fig. 4 is an example of a video frame selected as a key frame according to the method of Fig. 2;
Fig. 5 is an example of a video frame selected as a key frame according to the method of Fig. 3; and
Fig. 6 is a schematic block diagram representation of a video system with which the described methods and arrangements may be performed.

Detailed Description

Referring to Fig. 1, digital video may be processed to extract key frames. In a first processing step 100, digital video footage is separated into video segments. This segmentation can be based on image processing techniques, such as those described in the Background of this specification. In many cases, however, information about when a photographer started and stopped recording is available. For example, record on/off information is stored in the bit stream of the commonly used DV format referred to above. In these cases the digital video can be segmented into video segments, commonly referred to as clips. A clip is that section of digital video between a record-on event and a record-off event. If record-on/off information is available, this method of segmenting the digital video is preferable. Otherwise, image processing techniques as described previously can be used.

In step 101 the first video segment (eg. clip) of the captured footage is selected for processing. One or more key frames are identified in step 102 and the identified key frame(s) are stored in a metadata object in step 103. The number of key frames to be identified is typically established by the application that calls for key frame extraction.

Metadata is information about data, the data in this case being the digital video. The metadata object is preferably, but not necessarily, provided in a format that is easily accessible, such as Extensible Markup Language (XML), or using one of the emerging metadata standards, such as MPEG-7. Preferably, the metadata object is stored separately from the corresponding data or content. However, the metadata may be stored within the data, such as in special fields within an MPEG-2 object.

In step 104, the digital video is examined for the existence of more segments. If more segments exist then the next video segment is retrieved in step 105 and then control returns to step 102 for further processing of that segment. If no more video segments are identified in step 104, then the process ends in step 110.
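The loop of steps 101 to 105, together with the storing of key frames in a metadata object, might be sketched as below. The function names, the dictionary layout and the XML element names (`video`, `segment`, `keyframe`) are hypothetical and are not taken from MPEG-7 or any other standard; the key frame selector is passed in as a callable so that either of the selection methods described later could be used.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Sequence


def extract_key_frames(
    segments: Sequence[Sequence[object]],
    select_key_frames: Callable[[Sequence[object], int], List[int]],
    frames_per_segment: int = 1,
) -> List[Dict[str, object]]:
    """Visit each segment in turn and record the chosen key frame indices."""
    metadata = []
    for segment_index, segment in enumerate(segments):             # steps 101, 104, 105
        chosen = select_key_frames(segment, frames_per_segment)    # step 102
        metadata.append({"segment": segment_index,                 # step 103
                         "key_frames": list(chosen)})
    return metadata


def metadata_to_xml(metadata: List[Dict[str, object]]) -> str:
    """Serialise the metadata object to a small, made-up XML layout."""
    root = ET.Element("video")
    for entry in metadata:
        segment = ET.SubElement(root, "segment", index=str(entry["segment"]))
        for frame_index in entry["key_frames"]:
            ET.SubElement(segment, "keyframe", frame=str(frame_index))
    return ET.tostring(root, encoding="unicode")
```

Keeping the metadata as a separate object, as the text prefers, means the video data itself never has to be rewritten when key frames are recomputed.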

The process described in Fig. 1 is typically implemented using an application program running on a general purpose computer. It may be assumed that the digital video has been downloaded from the DV tape and is stored in digital format on a hard disk drive associated with the computer.

Alternatively, the process may be implemented within a digital video camera having a corresponding processing capability. In the latter case, the digital video segments may be presented to the user using a specialised user interface in which each of the video segments is visually represented by one or more identified key frames corresponding to the segment. In the case where more than one key frame is identified for the segment, the key frames can be presented to the user as an animation where each key frame is displayed for a predetermined time in sequence. Typically however, a single key frame is selected for each segment. Depending on the capacity of a storage medium within the camera, the user may then be able to select to replay particular video segments by clicking on the key frame of the segment.

Fig. 6 depicts such arrangements where one or both of a computer system 600 and a digital video camera 650 may be configured to implement the processing of Fig. 1 by means of software, such as an application program executing within the computer system 600 or within the camera 650 which, although not illustrated for the purposes of clarity, may be configured to have an image processing ability corresponding to that of the computer system 600. In particular, the steps of the method of Fig. 1 are effected by instructions in the software that are carried out by the computer system 600 or camera 650. The software may be divided into two separate parts: one part for carrying out the key frame extraction and processing methods, and another part to manage the user interface between the latter and the user. The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 600 or camera 650 from the computer readable medium, and then executed. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 600 or camera preferably effects an advantageous apparatus for key frame extraction in accordance with the embodiments of the invention.

The computer system 600 comprises a computer module 601, input devices such as a keyboard 602 and mouse 603, and output devices including a printer 615 and a display device 614. The display device 614 is typically, for desktop applications, a cathode ray tube apparatus typically having a display area of at least 690 cm² (for a "14 inch" screen). In a "Notebook" configuration, the display device is typically a liquid crystal or plasma display having a display area of about 400 cm² (for a "10 inch" screen).

A Modulator-Demodulator (Modem) transceiver device 616 is used by the computer module 601 for communicating to and from a communications network 620, for example connectable via a telephone line 621 or other functional medium. The modem 616 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN).

The computer module 601 typically includes at least one processor unit 605, a memory unit 606, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output interfaces including a video interface 607, and an I/O interface 613 for the keyboard 602 and mouse 603 and optionally a joystick (not illustrated), and an interface 608 for the modem 616. A storage device 609 is provided and typically includes a hard disk drive 610 and a floppy disk drive 611. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 612 is typically provided as a non-volatile source of data. The components 605 to 613 of the computer module 601 communicate via an interconnected bus 604 and in a manner which results in a conventional mode of operation of the computer system 600 known to those in the relevant art. Examples of computers on which the embodiments can be practised include IBM-PCs and compatibles, Sun Sparcstations or alike computer systems evolved therefrom.

Typically, the application program is resident on the hard disk drive 610 and read and controlled in its execution by the processor 605. Intermediate storage of the program and any data fetched from the network 620 or camera 650 may be accomplished using the semiconductor memory 606, possibly in concert with the hard disk drive 610.

In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 612 or 611, or alternatively may be read by the user from the network 620 via the modem device 616. Still further, the software can also be loaded into the computer system 600 from other computer readable media including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or infra-red transmission channel between the computer module 601 and another device, a computer readable card such as a PCMCIA card, and the Internet and Intranets including e-mail transmissions and information recorded on websites and the like. The foregoing is merely exemplary of relevant computer readable media. Other computer readable media may be practiced without departing from the scope and spirit of the invention.

As also seen in Fig. 6, the digital video camera 650, used to capture the footage from which the key frames are to be extracted, may be coupled to the computer system 600 via a connection 648 to provide the footage to the computer 601 for key frame extraction processing. In an exemplary implementation, the key frame extraction is performed within the camera 650 through use of a display device 652 integrated with the camera 650. Typically such a display 652 is a liquid crystal display having a small size (about 40 cm² or less), which is substantially smaller than that of Notebook and desktop displays mentioned above.

The key frames that are identified as part of the process described in Fig. 1 can be used to enable browsing and editing applications of the digital video. These applications can be implemented on the general purpose computer 600, the digital video camera 650, or some other device, such as a handheld (mobile) personal digital assistant (PDA).

In the case of applications implemented on a camera or a mobile device, typically there is limited display area available for the presentation of the key frames for the user to select corresponding video segments of interest for viewing or editing.

Consequently, it is important that the key frames that are selected are understandable when they are viewed as small images on a display. There is no value in selecting a key frame that is semantically representative of a video segment if it contains detail that is simply not understandable when the frame is viewed as a small image or at low spatial resolution. It is therefore desirable to select key frames that are visually understandable when they are viewed either as small images or at low spatial resolution. Typically the contents of a small image will be more visually understandable if the image (frame) contains a few large regions or objects and/or contains regions or objects of high contrast.

A first form of the key frame extraction step 102 of Fig. 1 can now be described with reference to the processing steps illustrated in Fig. 2. In step 200, an initial threshold for the size of regions of interest is selected. This initial threshold preferably depends on the ultimate viewing size of the key frame. For example, if a selected key frame is to be rendered at a resolution of 60 x 40 (=2400) pixels, then the initial threshold region size may be set at 150 pixels.
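In the example above, 150 pixels is 6.25% of the 2400-pixel rendered key frame. One possible reading, sketched below, treats that figure as a minimum size in displayed pixels and converts it into pixels of the full-size source frame on which segmentation is run. The proportional mapping, the function name and the 720 x 576 source size are assumptions made for illustration; the specification says only that the threshold depends on the viewing size.

```python
from typing import Tuple


def min_region_size_in_source_pixels(
    display_size: Tuple[int, int],
    source_size: Tuple[int, int],
    min_display_pixels: int = 150,
) -> int:
    """Convert a minimum region size in displayed pixels to source-frame pixels.

    Assumption: a region's area scales in proportion to the total frame area.
    """
    display_w, display_h = display_size
    source_w, source_h = source_size
    scale = (source_w * source_h) / (display_w * display_h)
    return round(min_display_pixels * scale)


# Hypothetical 720 x 576 source frame shown at two key frame sizes.
small_display = min_region_size_in_source_pixels((60, 40), (720, 576))
larger_display = min_region_size_in_source_pixels((120, 80), (720, 576))
```

Under this assumption, halving the displayed width and height quadruples the minimum region size when measured in source pixels.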

The objective is to automatically select a frame of each video segment that has as many regions or objects as possible that exceed this threshold size. If the threshold is set too high for the data being examined, then no regions may be detected in the video segment that exceed this threshold size, and it is likely that a second processing pass using a lower threshold will be required. It is preferable to select a frame that has more than one large region. A frame having a single large region/object may not be a desirable key frame because the region/object on its own may not provide enough information about the content. An example is a frame showing only sky that was recorded by mistake.

In step 201 which follows, a first frame of the video segment is selected for processing. This video frame is extracted from the digital video and preferably decoded from any compressed storage format (eg. MPEG) into simple digital image data for processing. The digital image preferably contains a plurality of pixels, with each pixel having an intensity for each of the red, green and blue (RGB) visual frequencies.

Alternatively, each pixel could also be represented by a set of intensity values for another colour space (eg. LUV, HSV, etc.). In an alternative configuration, image segmentation may be performed using an encoded/compressed representation of the image.

This image is then segmented in step 202 into contiguous regions that are essentially homogeneous in colour. There are many available methods that can be used for such segmentation. Since it is desirable in most cases for the processing associated with the particular implementation to be reasonably fast, the present method is based on a region growing technique.

In a preferred implementation of the frame segmentation step 202, regions are grown by examining the pixels that are immediate neighbours of a pixel in a region. If the colour of an examined pixel is close to the average colour of the region, then the examined pixel is added to the region. A predetermined threshold for each of the spectral intensities (eg. R, G and B) is used to determine if the colour of the examined pixel is close enough to that of the adjacent region. After a pixel is added to the region, the average colour of the region (eg. mean R, G and B values for the region) is recomputed. In a preferred implementation, the initial region is selected as the first pixel of the image in raster sequence and neighbouring pixels are iteratively tested for merging. When no more pixels can be merged, then the next available pixel in raster order is selected as the seed for the next region.
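A minimal sketch of the region growing just described is given below, assuming the frame is held as a two-dimensional grid of RGB tuples. The 4-connected neighbourhood, the breadth-first visiting order and the `channel_threshold` value are assumptions; the sketch also tests each neighbour only when one of its labelled neighbours is visited, which is a simplification of the iterative merging in the text.

```python
from collections import deque
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = Sequence[Sequence[Pixel]]   # image[y][x] is an (R, G, B) tuple


def grow_regions(image: Image, channel_threshold: int = 40) -> List[List[int]]:
    """Return a label map where labels[y][x] is the region index of pixel (x, y)."""
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for seed_y in range(height):            # seeds are taken in raster order
        for seed_x in range(width):
            if labels[seed_y][seed_x] != -1:
                continue                    # already belongs to a region
            label = next_label
            next_label += 1
            labels[seed_y][seed_x] = label
            sums = list(image[seed_y][seed_x])   # running per-channel sums
            count = 1
            frontier = deque([(seed_x, seed_y)])
            while frontier:
                x, y = frontier.popleft()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if not (0 <= nx < width and 0 <= ny < height):
                        continue
                    if labels[ny][nx] != -1:
                        continue
                    pixel = image[ny][nx]
                    # Each channel must be close to the current region mean.
                    if all(abs(pixel[c] - sums[c] / count) <= channel_threshold
                           for c in range(3)):
                        labels[ny][nx] = label
                        for c in range(3):
                            sums[c] += pixel[c]   # updates the region mean
                        count += 1
                        frontier.append((nx, ny))
    return labels
```

Keeping running sums rather than recomputing the mean from scratch keeps the cost of each merge constant, which matters given the stated preference for fast processing.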

There exist many variations of the frame segmentation step 202 which can be used without altering the scope of the present disclosure. For example, a seeded region growing method could be used. In such a method, optimal seeds for growing regions may be selected using statistical methods. Pixels adjacent to those seeds are then examined for merging. The frame segmentation step 202 can also include a post processing step that merges all (small) regions below a certain size (ie. a predetermined number of pixels) into the neighbouring region that is closest to the average colour of the small region.

In an alternative segmentation method, the above-mentioned technique can be used for the first frame of a video segment, but then subsequent frames are segmented using a known method of seeded region growing where the seeds are the centres of the regions of the prior frame in the video sequence.

The thresholds selected for the region growing method are selected with consideration that the purpose of the segmentation is to find those regions which appear to be homogeneous in colour when the frame is viewed as a small image or at low spatial resolution. Therefore, since the spatial resolution of the viewed image will be low, reasonably high thresholds can be used for the region growing. In other words, a fine segmentation is not required, or indeed desired.

In step 203, the number of regions that exceed the threshold region size is computed and the number stored in a frame array according to step 204. Each element of the frame array preferably contains the number of large regions (ie. greater than the threshold size) for the corresponding frame of the video segment.
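Steps 203 and 204 amount to counting, per frame, the regions whose pixel count exceeds the threshold. A small sketch follows, assuming each segmented frame is available as a label map (a grid of region indices); the function names and the list used as the frame array are illustrative choices only.

```python
from collections import Counter
from typing import Dict, List, Sequence

LabelMap = Sequence[Sequence[int]]   # labels[y][x] is a region index


def region_sizes(labels: LabelMap) -> Dict[int, int]:
    """Number of pixels in each region of one frame."""
    return dict(Counter(label for row in labels for label in row))


def count_large_regions(labels: LabelMap, size_threshold: int) -> int:
    """Step 203: how many regions are strictly larger than the threshold."""
    return sum(1 for size in region_sizes(labels).values() if size > size_threshold)


def build_frame_array(label_maps: Sequence[LabelMap], size_threshold: int) -> List[int]:
    """Step 204: one element per frame of the video segment."""
    return [count_large_regions(labels, size_threshold) for labels in label_maps]
```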

In step 205, the video segment is examined for the existence of any more frames in the segment. If further frames exist then the next frame is retrieved in step 206 and then control passes back to step 202 where the new frame is segmented.

If all frames have been processed in step 205, then a check is performed in step 207 to determine whether all elements of the frame array are zero (ie. no large regions have been detected for any of the frames in the video segment). If all the elements are zero then the threshold for the region size is reduced in step 208. Preferably the threshold reduction step is based on reducing the threshold by a certain percentage (eg. Alternatively, the reduction can be in the form of a fixed number of pixels.

After the threshold has been reduced, the video segment is then reprocessed by control returning to step 201. Preferably the sizes of all the regions identified in the first pass are stored for each frame. Subsequent passes then need only use the new threshold to compute the number of regions that exceed the minimum region size threshold.

If one or more frames have regions of size greater than the region threshold, as determined in step 207, then the frame having the largest number of regions is selected from the frame array in step 209 by identifying the element of the array having the largest number of regions. This frame is then identified as the key frame of the video segment in step 210 and the process ends in step 211.
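Steps 203 to 210, including the threshold reduction of step 208, might be sketched as follows. The sketch works from region sizes stored on the first pass, as the text prefers, so later passes do not re-segment the frames. The `reduction_factor` of 0.8 is purely an assumed value, since no percentage is given in this text, and ties are resolved in favour of the earliest frame, which the specification does not address.

```python
from typing import List, Sequence, Tuple


def select_key_frame(
    sizes_per_frame: Sequence[Sequence[int]],
    size_threshold: float,
    reduction_factor: float = 0.8,   # assumed value, not from the specification
) -> Tuple[int, float]:
    """Return (index of the key frame, threshold that was finally used)."""
    if not any(len(sizes) for sizes in sizes_per_frame):
        raise ValueError("no regions were found in any frame")
    if not 0.0 < reduction_factor < 1.0:
        raise ValueError("reduction_factor must lie strictly between 0 and 1")
    threshold = float(size_threshold)
    while True:
        # Frame array: number of above-threshold regions per frame (steps 203-204).
        frame_array: List[int] = [
            sum(1 for size in sizes if size > threshold) for sizes in sizes_per_frame
        ]
        if any(frame_array):             # step 207: at least one non-zero element
            break
        threshold *= reduction_factor    # step 208: lower the threshold and retry
    # Step 209: element with the largest count (earliest frame wins a tie).
    best = max(range(len(frame_array)), key=frame_array.__getitem__)
    return best, threshold
```

Because every region contains at least one pixel, the loop always terminates once the threshold falls below 1.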

In an alternative implementation, more than one key frame may be required per segment. In such a case, the frames having the largest numbers of above-threshold size regions are selected as key frames for the segment, up to the number of key frames required. In a further alternative implementation, all those frames having more than a given number of above-threshold size regions may be selected as key frames for a segment. The order of frames in the set may be chronological (ie. the order that the frames appeared in the segment), or in the order of the number of above-threshold size regions.
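Selecting several key frames from the frame array could look like the sketch below. The function and parameter names are hypothetical, and excluding frames with a zero count is an assumption made so that a frame with no large regions is never returned.

```python
from typing import List, Sequence


def top_key_frames(
    frame_array: Sequence[int],
    how_many: int,
    chronological: bool = True,
) -> List[int]:
    """Indices of the frames with the most above-threshold regions."""
    # Rank by descending count; earlier frames win ties.
    ranked = sorted(range(len(frame_array)), key=lambda i: (-frame_array[i], i))
    chosen = [i for i in ranked[:how_many] if frame_array[i] > 0]
    return sorted(chosen) if chronological else chosen
```

With `chronological=False` the frames are returned in order of their region counts, matching the second ordering mentioned above.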

An example of a typical video frame that may be identified as a key frame using the method of Fig. 2 is shown in Fig. 4. Typically, both the large human face and the hat in a frame would each be segmented into regions of substantially homogeneous colour.

This would represent a large region and therefore when displayed as a small image in an interface where the display area is small (eg. on a digital video camera), would appear understandable to a user.

An alternative form of the key frame extraction step 102 of Fig. 1 may now be discussed with reference to the processing steps of Fig. 3. In step 300, the first frame of the video segment is selected. This frame is segmented into homogeneous regions of colour in step 301 using one of the methods described above for step 202. In step 302, a set of regions is determined whereby each member of the set has a size (ie. number of pixels) that exceeds a predetermined threshold. For each region of the set, the mean luminance is calculated in step 303 as a measure of the brightness of the region. Then, in step 304, a luminance difference is calculated for each pair of regions in the set. The maximum luminance difference (ΔL) calculated across all possible pairs of regions is then stored for the frame in a frame array in step 305.

In step 306, a check is performed to determine if there are further frames in the video segment. If there are, then the next frame is retrieved in step 307 and then control returns to step 301.

If there are no further frames to process in the video segment, then the frame having the largest luminance difference (between any two regions in the frame) is selected in step 308. This frame is identified as the key frame in step 309 and the process terminates at step 310.
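The Fig. 3 method can be sketched as below, assuming each frame has already been segmented and is supplied as a list of regions, each region being a list of RGB pixels. The luminance weights (0.299, 0.587, 0.114) are the familiar Rec. 601 luma coefficients and are an assumption here, as the specification does not say how luminance is derived; the function names are likewise illustrative.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Region = Sequence[Pixel]


def luminance(pixel: Pixel) -> float:
    """Assumed luminance measure using Rec. 601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def max_luminance_difference(regions: Sequence[Region], min_size: int) -> float:
    """Steps 302 to 305 for one frame; returns 0.0 if fewer than two large regions."""
    large = [region for region in regions if len(region) > min_size]          # step 302
    means = [sum(luminance(p) for p in region) / len(region) for region in large]  # step 303
    if len(means) < 2:
        return 0.0
    return max(abs(a - b) for a, b in combinations(means, 2))                 # steps 304-305


def select_by_contrast(frames: Sequence[Sequence[Region]], min_size: int) -> int:
    """Step 308: index of the frame with the largest stored difference."""
    frame_array: List[float] = [max_luminance_difference(f, min_size) for f in frames]
    return max(range(len(frame_array)), key=frame_array.__getitem__)
```

The maximum over all pairs equals the brightest mean minus the darkest mean, so a faster variant could skip the pairwise loop; the loop is kept here to mirror steps 304 and 305.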

In an alternative implementation of the process depicted in Fig. 3, the luminance difference is only calculated between neighbouring or adjacent pairs of regions. This approach results in selected key frames that are striking because of the contrast between adjacent regions. This result can in some cases be more noticeable when the key frames are being viewed at limited spatial resolution. An example of a key frame that could be selected using the method described using Fig. 3 is shown in Fig. 5. Note that the contrasting objects need to be of reasonable size. Preferably, the minimum size of the regions that are considered in step 302 depends on the size of the displayed key frame.

The smaller the key frame, the larger the minimum region size must be.
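The adjacent-regions variant needs to know which regions touch. One way to obtain this, sketched below, is to scan a label map and collect pairs of differing labels that are horizontal or vertical neighbours. The 4-connected notion of adjacency and the dictionary of mean luminances (keyed by region label and containing only the regions that passed the size test of step 302) are assumptions.

```python
from typing import Dict, Sequence, Set, Tuple

LabelMap = Sequence[Sequence[int]]   # labels[y][x] is a region index


def adjacent_pairs(labels: LabelMap) -> Set[Tuple[int, int]]:
    """Unordered pairs of region labels that share a pixel edge."""
    pairs = set()
    height, width = len(labels), len(labels[0])
    for y in range(height):
        for x in range(width):
            a = labels[y][x]
            # Looking right and down is enough to visit every shared edge once.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height and labels[ny][nx] != a:
                    b = labels[ny][nx]
                    pairs.add((min(a, b), max(a, b)))
    return pairs


def max_adjacent_difference(labels: LabelMap, mean_luminance: Dict[int, float]) -> float:
    """Largest luminance difference between touching regions listed in mean_luminance."""
    return max(
        (abs(mean_luminance[a] - mean_luminance[b])
         for a, b in adjacent_pairs(labels)
         if a in mean_luminance and b in mean_luminance),
        default=0.0,
    )
```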

The procedure depicted in Fig. 3 may also include optional steps for re-adjusting the region threshold size (as depicted in Fig. In other words, if no regions are identified for the set in step 302, the ΔL values stored for that frame in step 305 will be zero. If this is the case for all frames of a segment then the region size threshold may be reduced.

In a further implementation, the process of selecting key frames for video segments can be improved by attempting to ensure that key frames of adjacent video segments are as dissimilar as possible without departing from the essence of the aforementioned methods. This is a useful objective to achieve because if the key frames for a set of video segments all appear very similar, then it will be difficult for the user to discriminate between the video segments on the basis of the key frames, especially if the key frames are shown only as very small images or at a very low spatial resolution.

For example, in the implementation of Fig. 2, step 209 may be altered to take into account the regional composition of the key frame(s) selected for the previous one or more video segments. Instead of simply selecting the frame that contained the largest number of regions that exceeded the threshold size, the process compares each of the frames that have identified regions with the already identified key frames of the one or more previous video segments. This comparison may be performed using some simple features of each of the frames being compared. For example, the mean colour or luminance of each of the regions generated by the segmentation step 202 could be stored and a comparison made on the basis of these colour/luminance values. Preferably, the comparison is made using only the largest regions generated by the segmentation. In other words, the comparison step would use the mean colour of, for example, the first regions of each frame being compared ranked in order of region size, and compute a similarity measure for the two frames. The similarity measure may be based on a well-known metric such as the Euclidean distance.

Then, in step 209, the frame that is ultimately selected as the key frame would be that frame that contained a non-zero number of regions above the threshold region size and differed most from the key frames selected for previous video segments. In other variations, the contribution of the number of large regions and the similarity of a frame with previously selected key frames could be weighted. For example, either a predetermined or a user-specified parameter could be defined that specifies the influence of one factor over the other when selecting a key frame for a particular video segment.
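The modified step 209 may be sketched, under stated assumptions, as follows. Each region is taken to be a (size, mean colour) pair with the colour given as an RGB triple, and the previous key frame is represented by the mean colours of its largest regions. The number of regions compared (`top_n`), the linear form of the weighting, and all function names are assumptions for illustration. The specification states only that the two factors could be weighted and that a metric such as the Euclidean distance may be used.

```python
import math


def frame_signature(regions, top_n):
    """Mean colours of the `top_n` largest regions, ranked by region size."""
    ranked = sorted(regions, key=lambda r: r[0], reverse=True)[:top_n]
    return [colour for _, colour in ranked]


def euclidean_distance(sig_a, sig_b):
    """Euclidean distance over rank-paired region colours.

    If the signatures differ in length, only the common leading regions
    are compared (zip truncates); this is a simplification.
    """
    return math.sqrt(sum(
        (x - y) ** 2
        for colour_a, colour_b in zip(sig_a, sig_b)
        for x, y in zip(colour_a, colour_b)
    ))


def select_dissimilar_key_frame(frames, previous_signature, min_region_size,
                                top_n=3, weight=0.0):
    """Index of the qualifying frame that differs most from the previous key frame.

    `weight` (0..1) is a hypothetical parameter giving the influence of the
    large-region count relative to dissimilarity; 0 uses dissimilarity alone.
    The two terms are not normalised here. Returns None if no frame has a
    region above the threshold.
    """
    best_index, best_score = None, -math.inf
    for index, regions in enumerate(frames):
        count = sum(1 for size, _ in regions if size > min_region_size)
        if count == 0:
            continue  # only frames with identified regions are candidates
        distance = euclidean_distance(
            frame_signature(regions, top_n), previous_signature)
        score = weight * count + (1.0 - weight) * distance
        if score > best_score:
            best_index, best_score = index, score
    return best_index


previous_signature = [(255, 0, 0)]       # previous key frame: mostly red
frames = [
    [(500, (250, 5, 0))],                # also red: similar to previous
    [(400, (0, 0, 255))],                # blue: most dissimilar
    [(10, (0, 255, 0))],                 # no region above the threshold
]
chosen = select_dissimilar_key_frame(frames, previous_signature, 100, top_n=1)
```

In this example the blue frame is chosen: the red frame is too close to the previous key frame, and the green frame has no region above the size threshold and so is never a candidate.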

The key frame extraction processes described with reference to Figs. 2 and 3 are founded upon determinable thresholds for human perception in respect of region size and the luminance difference between regions, respectively, that are particularly suited to implementations where the extracted key frame is to be viewed on a small display. The processes find particular utility when performed for key frame browsing and/or display on a display device having a small display area (e.g. the camera 650), and particularly where numerous key frames are to be simultaneously displayed to the user for enabling selection of one or more of a desired number of video segments. However, although such utility is afforded with a small display area, such does not prevent use of the described processing methods in systems where a large display area is available, such as the computer system 600.

Industrial Applicability

It is apparent from the above that the embodiments of the invention are applicable to the computer and data processing industries where images are processed to identify one or more representative key frames of a video segment.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have corresponding meanings.


Claims

1. A method of identifying at least one key frame to visually represent a digital video segment, said key frame being intended for use in an application that utilises at least one of a small viewing area and a low spatial display resolution, wherein said digital video segment comprises a sequence of temporally contiguous image frames and said application uses said at least one key frame to identify corresponding digital video segments, said method comprising the steps of: (a) processing at least one said frame of said digital video segment, said processing comprising, for each said frame, the sub-steps of: (i) segmenting said frame into one or more regions of substantially homogeneous colour; and (ii) identifying those of said regions that exceed a predetermined threshold for region size for said frame and storing a value representing the number of said identified regions; (b) examining said stored values to identify that said frame in the said digital video segment that contains the largest number of regions that exceed said predetermined region size threshold; and (c) selecting said identified frame as the key frame of said digital video segment.

2. A method of identifying at least one key frame to visually represent a digital video segment intended for use in an application that utilises at least one of a small viewing area and a low spatial display resolution, wherein said digital video segment comprises a sequence of temporally contiguous image frames and said application uses said at least one key frame to identify corresponding digital video segments, said method comprising the steps of: (a) processing at least one said frame of said digital video segment, said processing comprising, for each said frame, the sub-steps of: (i) segmenting said frame into one or more regions of substantially homogeneous colour; (ii) identifying a set of said regions each having a size that exceeds a predetermined size; (iii) calculating a luminance value for each of said identified regions; (iv) calculating a luminance difference value between pairs of said identified regions; and (v) storing the maximum luminance difference value for said frame; (b) examining the stored maximum luminance difference values to identify that said frame in the said digital video segment associated with the largest stored luminance difference value; and (c) selecting said identified frame as the key frame of said digital video sequence.

3. A method according to claim 2, wherein in sub-step a luminance difference value is calculated for each adjacent pair of said identified regions.

4. A method according to claim 2, wherein in sub-step a luminance difference value is calculated for each possible pair of said identified regions.

5. A method of identifying at least one key frame to visually represent a digital video segment, said digital video segment comprising a sequence of temporally contiguous video frames, said method comprising the steps of: (a) processing at least one said frame of said segment, said processing comprising, for each said frame, the sub-steps of: (aa) segmenting said frame into one or more substantially homogeneous regions; (ab) identifying regions that exceed a threshold for region size; (ac) identifying a characteristic of said regions to obtain a corresponding representative value for said frame, said representative value relating to an extent of interpretability of said frame when said frame is viewed at at least one of small size and low spatial resolution; (b) examining said representative values for said segment to identify a desired one of said representative values; and (c) selecting one of said frames corresponding to said identified representative value as said key frame for said segment.

6. A method according to claim 5 wherein step (ac) comprises comparing a size of each said region with a predetermined threshold size, and said representative value comprises a count of those said regions within said frame that exceed said predetermined threshold size.

7. A method according to claim 6, wherein step (ac) comprises the further step of, where said count is zero, reducing said predetermined threshold size and repeating step (ab) using the reduced predetermined threshold size.

8. A method according to claim 6 or 7, wherein step comprises selecting those said frames having the largest said count.

9. A method according to claim 5 wherein step (ac) comprises comparing a size of each said region with a predetermined threshold size to identify a desired set of said regions, and said representative value is determined by processing a luminance value of each said region of said set.

10. A method according to claim 9 wherein said luminance value comprises a mean luminance value for the corresponding said region.

11. A method according to claim 9 or 10 wherein said processing of said luminance value comprises comparing luminance values between said regions of said set to obtain a luminance difference value for said frame, said representative value being said luminance difference value.

12. A method according to claim 11 wherein said luminance difference value is determined by comparing said luminance values of adjacent ones of said regions.

13. A method according to claim 11 wherein said luminance difference value is determined by comparing luminance values for each possible pairing of regions of said set within said frame.

14. A method according to claim 11, 12 or 13 wherein step comprises selecting that said frame having said largest luminance difference value as said key frame.

15. A method according to any one of the preceding claims wherein said processing comprises processing each said frame of said digital video segment.

16. A method of identifying at least one key frame to visually represent a digital video segment, said key frame being intended for use in an application that utilises at least one of a small viewing area and a low spatial display resolution, wherein said digital video segment comprises a sequence of temporally contiguous image frames and said application uses said at least one key frame to identify corresponding digital video segments, said method comprising the steps of: (a) processing at least one said frame of said digital video segment, said processing comprising, for each said frame, the sub-steps of: (i) segmenting said frame into one or more regions of substantially homogeneous colour; and (ii) identifying those of said regions that exceed a predetermined threshold of human perception for said frame and storing a value representing the number of said identified regions; (b) examining said stored values to identify that said frame in the said digital video segment that contains the largest number of regions that exceed said predetermined threshold; and (c) selecting said identified frame as the key frame of said digital video segment.

17. A method according to claim 16 wherein said predetermined threshold of human perception is selected from the group consisting of: (i) a size of said regions; and (ii) a luminance difference between said regions.

18. A method of identifying a plurality of key frames each for visually representing a corresponding mutually exclusive digital video segment of a sequence of said segments, said method comprising the steps of: (a) processing each said segment in order in said sequence according to the method of any one of the preceding claims; and (b) processing said key frames of adjacent ones of said segments in said sequence to select for a current segment a key frame that is relatively dissimilar to that of the preceding segment.

19. A method according to claim 18, wherein step comprises the sub-steps of: (ba) comparing each said frame having said identified regions of said current one of said segments with at least a key frame identified from said preceding segment; (bb) computing a similarity measure between said frames; and (bc) selecting the said frame of said current segment having a smallest similarity measure with respect to said previous key frames.

20. A method according to claim 19, wherein step (ba) comprises comparing at least one of colour and luminance values between said frames.

21. A method of selecting at least one key frame from a digital video sequence substantially as described herein with reference to Figs. 1 and 2 or Figs. 1 and 3 of the drawings.

22. A computer readable medium comprising a computer program having a series of processing steps arranged for execution to perform the method of any one of claims 1 to

23. A computer system comprising a computer program having a series of processing steps arranged for execution to perform the method of any one of claims 1 to

24. A digital video camera having a key frame selection means configured for performing the method of any one of claims 1 to

25. A digital video browsing application comprising a key frame selection process operable according to the method of any one of claims 1 to

Dated this TWENTY-FOURTH day of JULY 2001

CANON KABUSHIKI KAISHA

Patent Attorneys for the Applicant
Spruson & Ferguson