AU2008202064B2

AU2008202064B2 - A data categorisation system

Info

Publication number: AU2008202064B2
Application number: AU2008202064A
Authority: AU
Inventors: Adam Kowalczyk; Bhavani Raskutti
Original assignee: Telstra Corp Ltd
Current assignee: Telstra Corp Ltd
Priority date: 2000-09-25
Filing date: 2008-05-08
Publication date: 2010-10-28
Anticipated expiration: 2021-09-25
Also published as: AU2008202064A1

Description

Australian Patents Act 1990- Regulation 3.2 ORIGINAL COMPLETE SPECIFICATION STANDARD PATENT Invention Title "A document categorisation system" The following statement is a full description of this invention, including the best method of performing it known to us: QAOPBMRAE305386 tclstra div condoc C.\NRPonb\DCC\MKA\3142355.LDOC.23/09/2010 The present invention relates to a data categorisation system and process for categorising items of electronic data, which may include electronic documents. The information age brings with it the risk of information overload. In particular, large 5 service organisations typically interact with an enormous number of customers, and the introduction of electronic message handling systems into such organisations necessitates some method of efficiently dealing with large numbers of electronic messages or other forms of electronic documents. It is desired, therefore, to provide a data categorisation system and process that alleviate one or more difficulties of the prior art, or at least to 10 provide a useful alternative. In accordance with the present invention, there is provided a data categorisation system, including: a clusterer configured to apply unsupervised learning to items of electronic data to 15 generate a set of clusters of related ones of said items based on features extracted from said items, said features including at least one of n-grams, words and phrases, and the clusters representing respective item categories; an interactive cluster editor configured to display the set of clusters to a user and to modify the set of clusters based on input from the user to provide a set of training 20 clusters; and a filter module configured to use the set of training clusters as training data for supervised learning to generate categorisation data representing models that distinguish respective ones of said clusters from the other clusters; and a categorisation component configured to use the categorisation data to categorise 25 further items of electronic data into the training clusters. The present invention also provides a data categorisation process executed by a computer system, the process including: applying unsupervised learning to items of electronic data to generate a set of 30 clusters of related ones of said items of electronic data based on features extracted C\NRPonbl\DCC\MKA\42355 1.DOC-23/09/2010 -2 from said items, said features including at least one of n-grams, words and phrases, and the clusters representing respective item categories; displaying the set of clusters to a user; receiving cluster modification data from the user; 5 modifying the set of clusters on the basis of the cluster modification data to provide a set of training clusters; using the training clusters as training data for supervised learning to generate categorisation data representing models that distinguish respective ones of said clusters from the other clusters; and 10 using said categorisation data to categorise further items of electronic data into one or more of the training clusters. Preferred embodiments of the present invention are hereinafter described, by way of example only, with reference to the accompanying drawings, wherein: 15 Figure 1 is a block diagram of a preferred embodiment of a data categorisation system; Figure 2 is a block diagram of a clusterer of the system; Figure 3 is a block diagram of a filter module of the system; Figure 4 is a block diagram showing components of a preferred embodiment of a 20 plug-in data categorisation module for a spreadsheet application; and Figures 5 to 9 are screenshots of a spreadsheet application with the plug-in module. A data categorisation system 10 includes a clusterer module 12, a filter module 14, an editor/browser module 16 and a trend analyser module 18. The clusterer 12 processes 25 items of electronic data, which may be electronic messages and/or other documents 20 and groups them into a set 22 of clusters 24 of related items/documents and then creates a description in the form of a set of keywords and key phrases for each cluster 24. The editor/browser 16 provides an interactive user interface 30 THE NEXT PAGE IS PAGE 6 C:\NRPonbDCCMIKAU142355 LDOC-23/09/2010 -6 that allows a data analyst to browse through the clustered documents 22 and to modify the clusters 24 if desired so that each cluster contains a coherent set of documents. The filter module 14 analyses existing clusters in order to categorise new documents. The Trend Analyser 18 analyses clusters over various time periods and compares different 5 categorisations of the same documents to determine a coherent and useful classification for these documents and determines novel clusters. Together, the modules 12 to 18 constitute a data categorisation system 10 which can, for example, automatically categorise large numbers (typically 10,000 - 25,000) of electronic 10 text messages, such as complaints and survey text, into a much smaller number (approximately 250-500 in this example) of groups or clusters. These messages are typically written under time constraints and they are therefore terse and contain abbreviations as well as typing and spelling errors. The system 10 can also track specific message categories, route messages (e.g., emails), and alert users when there are novel 15 (i.e., unusual) messages. The clusterer 12 groups together related documents. For navigation and editing purposes, each group is labelled using a cluster description technique, as described below. The grouping and labelling allows a large document repository to be navigated and modified 20 easily. The clustering implementation is based on a document clustering methodology described in Salton, The SMART Retrieval System - Experiments in Automatic Document Processing, Prentice-Hall, New Jersey, 1971 ("Salton"). As shown in Figure 2, the clusterer 12 includes a feature extraction module 26, a feature selection module 28, a cluster generator 30, and a cluster describer module 32. The feature extraction module 26 25 processes each document to produce a vector of token frequencies, where tokens may be n grams, words or phrases, and where some tokens may be excluded using feature selection criteria of the feature selection module 28. A similarity measure is then defined as a function of these document vectors to quantify the similarity between any two documents. Finally, the clustering generator 20 uses this similarity measure to group similar documents 30 into clusters.

C:\NRPonb1\DCC\MKA\3142355 1.DOC-23/A9/2010 -7 The feature extraction module 26 extracts n-grams, words, and/or phrases as tokens to represent a document or message. N-grams are especially suited for processing of noisy and/or un-grammatical text, such as SMS messages. The n-grams are sequences of characters of length n, and they are extracted as follows. First, each document is 5 represented as a sequence of characters or a vector of characters (referred to as document sequence vector). Next, each document-sequence vector is processed to extract n-grams and their frequencies (number of occurrences) for that document. For example, the sequence of characters "to build" will give rise to the following 5-grams "to bu", "o bui", "buil", "build". Words and phrases extracted may be stemmed, although that is not 10 necessary. Because the clustering process uses n-gram vectors rather than word vectors only, it is tolerant of spelling and typing errors, because a single error in the spelling of a long word nevertheless yields n-grams that are the same as those from the correctly spelled word. 15 The feature selection module 28 of the clusterer 12 is executed before clustering to include in the document vectors only those features that provide significant information for clustering. This not only reduces time for clustering by reducing the feature space, but also increases the accuracy of clustering, because noisy features are eliminated. The feature selection process determines the discriminating ability of a feature. The ability of a feature 20 to discriminate depends on how many documents a feature appears in (referred to as DF), and the frequency of that feature in each document (referred to as TF). Those features that appear frequently within few documents are more discriminating than those that appear infrequently in most of the documents. Traditionally, in information retrieval, this reasoning is captured using the TFs and DF of each feature to arrive at an importance 25 value. In the data categorisation system 10, the computed discrimination ability is based on the premise that if a very good discriminating feature is removed from the feature space, then the average similarity between documents in the repository increases significantly. Thus, 30 by computing average similarity between documents with and without a feature, it is WO 02/25479 PCT/AUOI/01198 possible to determine the feature's discriminating ability. The similarity of a particular document is computed with a cosine coefficient, as is described in Salton, The average similarity of a document set is determined by summing the similarities between each document and the centroid (where the centroid is the average of feature frequency vectors 5 of all the documents in the set). This method for feature selection has been traditionally ignored due to its computational cost, since it involves nm similarity computations, where n is the number of documents and m is the total number of words or features. As the following equation shows, each cosine similarity computation itself has a computational complexity of M'; 10 S xg 2i2 kcj where sy is the similarity measure, , -1, ,n represents the document number in the document set, and f represents the frequency of occurrence of a feature denoted by the integer k, 15 However, by storing the norm of frequency vectors and their dot products with the centroid, L-., 1 and .fik/g , respectively, with being the centroid docurnent, the feature selector is able to compute scores for each feature in linear time. This score determines discrimination ability features for clustering better than the standard inverse 20 document frequency-term frequency (IDF-TF) scores described in Salton. After the features have been extracted and selected, the clusters are defined by the cluster generator 30. The cluster generator 30 uses a clustering algorithm that is an enhancement of a single-pass, non-hierarchical method which partitions the repository into disjoint sets, 25 as described in E. Rasmussen, "Clustering Algorithms", Information Retrieval, W. B. Frake and R. Baeza-Yates ed., Prentice-Hall, New Jersey, 1992. The algorithm proceeds as follows: the first document D] is used to create the first cluster C. Each remaining WO 02/25479 PCT/AUOI/01198 -9 document, Dk, is assigned to the nearest cluster C, or a new cluster if none is sufficiently close. In order to compare documents to clusters, each cluster is represented by its centroid, which is the average of word frequency vectors of all the documents in the cluster. A new cluster is started when none of the existing clusters are sufficiently close, 5 based on a specified similarity or distance threshold T. As mentioned above, the clustering algorithm creates no hierarchies. However, hierarchies can be implemented by clustering at the first level, and then clustering the clusters using 10 cluster centroids, Since the algorithm itself is not hierarchical, the complexity is O(nm), where n is the number of entities to cluster at each level and m is the number of clusters. In order to determine if documents are sufficiently close, traditional single-pass algorithms require the threshold T, the separation between groups, to be specified prior to clustering. 15 This may result in sub-optimal clustering, because the groupings are primarily dependent on the contents of the repository. In contrast, the algorithm used by the clusterer 12 determines the number of clusters by creating different groupings of the data set at different separation thresholds and then evaluating these groupings to determine the best grouping. The evaluation takes into account the affinity within groups as well as separation 20 from other groups, as described in B..Raskutti and C. Leckie, "An Evaluation of Criteria for Mea'suring the Quality of Clusters", pp. 905-910, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999. Thus, the clustering algorithm imposes a structure (a hierarchy) on previously unstructured documents. Since the clustering is based on the tokens (i. e., n-grams, words and/or phrases) in a message, as 25 well as tokens in other messages, it typically captures the conceptual relations, The clusterer 12 also labels groups based on their contents using the cluster describer module 32. It extracts words, phrases and/or sentences from a document that are most descriptive of that document, in contrast to standard approaches which extract words only. 30 The method is similar to that described in Y. D. Cohen, "Highlights: Language and Domain Independent Automatic Indexing terms for Abstracting", Journal of the American Society WO 0225479 PCT/AU01/01198 -10 for Information Science, 46(3), 162: 174, 1995 for generating highlights or abstracts for documents that are retrieved by an information retrieval system. In order to determine the words and phrase that describe a group of documents, the 5 following steps are perfbrned by the cluster describer module 32. First, all documents within a grodp are combined into one hyper-document. In all further analysis, the hyper documents are treated as a single document, and it is these hyper-documents that are to be described or labelled. Second, the distribution of the n-grams over the document space is computed by counting the occurrence of n-grams in the documents. Next, qach n-gram is 10 assigned a score per document that indicates how novel or unique it is for the document. This novelty score is based on the probability of the occurrence of the n-gram in the document and the probability of occurrence elsewhere (i.e., in another cluster). Next, the novelty score of each n-gram is apportioned across the characters in the n-gram. For example, the apportioning could be so that the entire score is allocated to the middle 15 character, and the other characters are assigned a score of zero. This apportioning allows each character in the sequence (hence each entry in the document-sequence vector) to be assigned a weight. Finally, these weights are used to compute a score for each word or phrase based on their component characters and their scores. These scores may be used as is or combined, if necessary, with language-dependent analysis, such as stemming, to filter 20 out non-essential features. Thus, for each document (.e., the hyper-document for each group of documents), the output of the cluster describer 32 is a set of descriptors (words or phrases) in the document with associated scores indicating how well they describe the document. In the case of a 25 hyper-document, these scores measure how close the descriptors are to the topic of the cluster, and are used to create a succinct description of a cluster. This also means that a descriptor may have different scores for different clusters. For instance, the word "vector" may have a higher score in a cluster about vector analysis and a lower score in a cluster about vector analysis for search engines. This score is then used to rank different 30 descriptors for a cluster such that the most descriptive term is listed first.

WO 02/25479 PCT/AUOI/01198 -11 Because the system 10 is based on n-grams, there is no necessity for language-dependent pre-processing such as stemming and removal of function words. Hence, the extraction process is language-independent and is tolerant of spelling and typing errors. Furthermore, this technique can be used to extract words or phrases or sentences or even paragraphs (for 5 large documents), since the fundamental units for determining novelty are not words but character sequences. Moreover, these character sequences need not even be text. Hence, with minor modifications, the same technique may be used to pick out novel regions within an image or audio track. 10 The clusterer 12 described above imposes a structure on a previously unstructured collection of documents. This structure, typically, captures the appropriate conceptual relations; however, as with all categorizations, whether'manual or automatic, the clusters may require further refinement, The editor/browser 16 provides the ability to manually alter the automatically created groups for greater coherence and usability. It provides a 15 visual user interface to allow browsing of the cluster hierarchy and a number of editing functions. These editing functions include file manager like functions, such as the ability to create and delete clusters, edit cluster descriptions, and move messages/dlusters to other clusters. However, it also allows the user to (i) highlight words and/or phrases in documents to indicate the cluster topic, (ii) create sub-clusters within very large and 20 diverse clusters (using the clusterer 12), and (iii) create a filter for a cluster/category, using the messages under that cluster as examples (by using filter module 14). As shown in Figure 3, the filter module 14 includes the cluster describer 32 described above, a filter generator 31, a feature extractor 26, and a filter applier 35. The filter module 25 14 has two modes: a training mode and a classification mode. In the training mode, the filter module 14 learns the features of messages belonging to a particular category based on examples of earlier documents, These example documents for the categories may have been grouped by the clusterer 12 and/or the browser/editor 16, or created by some other means, The filter module 14 generates a filter 33 based on the characteristics of messages 30 in the categories. The resulting filter 33 is subsequently used by the filter module 14 in its classification (categorisation) mod;, whereby each new message is tagged with scores WO 02/25479 PCT/AU01/0119S -12 indicating the likelihood that it belongs to a particular category. For particular applications, such as complaint analysis, if a new message does not fit in with any of the previously defined categories, it can be defined as a novel message, and a data analyst is alerted by e mail or otherwise. The filter training process begins with a feature extraction process, executed by the feature extractor 26 cluster describer 32, that extracts terms -(words, phrases and/or n-grams) that are useful to distinguish one category from another. This advantageously uses the cluster describer for extracting words and phrases, however, if desired, other feature extraction 10 processes may be used. The words, phrases and n-grams so selected are called terms, and these terms are stored in a dictionary 27. Each example in the training set is then represented by a term vector, describing frequencies of appearance of terms from dictionary 27 in the example, The filter generator 31 then uses these term vectors to learn models for discriminating each category from the others. The filter generator 31 then 15 generates a filter 33 containing data representing the resulting models. This requires special machine learning technique capable of dealing with sparse high-dimensional data spaces. The particular learning technique that is used is that for Support Vector Machines (SVM). The technique is described in the specification of Australian Patent Application PQ6844/00, which is incorporated herein by reference. SVM is a relatively novel approach 20 to supervised learning used for both classification and regression problems. The word supervised refers to the fact that the data is labelled. The aim of training in such a case is to look for patterns in the data that conform to this labelling, and to build a model correlated with the labels and capable of predicting labels for new examples with high accuracy. This is different from the unsupervised learning or clustering used in the clusterer 12, whereby a 25 natural or inherent groupings are developed, based on patterns in the data. SVM techniques are particularly useful for text categorisation due to the fact that text categorisation involves large sparse feature spaces, and SVMs have the potential to handle such feature spaces. The training algorithm is used by the filter module 14 both to learn 30 from very few examples as well as to learn long-term filters using large numbers of examples.

WO 02125479 PCT/AUOI/01198 -13 The filter 33 contains information that is used by the filter applier 35 to determine a numerical score for a new message which has been converted to a term vector. The term vector provides occurrence frequencies for terms in the dictionary 27 that appear in the 5 message, In the simplest case of an SVM with linear kernel, the filter 33 for a category is simply a vector of weights, one for each entry in the dictionary 27. The filter applier then determines a dot product between the term vector of the message and the weight vector for the category to obtain a numerical score: scared = XwIkffk. It 10 where wg, k- 1,.., m is a weight vector for the category i and fk, k= 1,..., m, is the term vector for message]. The Trend Analyser 18 performs two primary functions. The first is to track specific categories of messages over different time periods. This may be as simple as a comparison 15 of the number of messages in a chosen category at different time periods. However, the analyser also performs a thorough analysis of category variation at different periods, such as movement of the centroid (average of word frequency vector for the messages in that category), categorisation confidence, and other indications of category change. Thus, this functionality is focused on understanding-a single document category over time. If there 20 are very few messages for a particular category in consecutive time periods, this might indicate that the corresponding filter is no longer required. Conversely, if the number of messages within a group is getting larger and unmanageable, it may be necessary to use the clusterer 12 to create finer partitioning of this category. 25 The second function of the trend analyser 18 is to compare different categorisations of the same documents in order to determine a coherent and useful classification for these documents. When using the filter module 14 to create categorisations for ongoing analysis, it may be necessary to check that the nature of the actual documents has not deviated too far from the filters that were created, and that the alters are up-to-date. This requires a user 30 to compare the categorisations produced by the clusterer 12 with those produced by the WO 02/25479 PCT/AUOI/01198 -14 filter module 14, and to highlight the differences so that filter definitions may be modified accordingly. Thus, this functionality is focussed on analysing the whole message space and what categories need to be defined to understand that space. 5 The document categorisation system 10 may be used in several modes, Several examples of applications are; (i) One-off analysis, e.g., to understand information from customer surveys. This mode requires only the clusterer 12 and editor/browser 16 in order to understand broad tendencies expressed in surveys. 10 (ii) On-going analysis of unstructured information, e.g., to understand and monitor customer complaints. In this mode, firstly messages are clustered and hierarchies. edited during the initial period in order to identify coherent categories, and then filters are created for these categories in order to classify new data. Novel messages may then need to be clustered in new categories. Thus, this mode requires all four 15 modules 12 to 18 of the system 10. (iii) Relevance feedback mode for on-going analysis of information that is well understood, but not necessarily categorised. In this mode, data analysts may know roughly what categories they need, although they may not have a large set of examples to define these categories. In this mode, the following steps are executed: 20 (a) create a filter for a specific category by leading from very few examples; (b) use the filter to sort all messages so that those in the category appear at the top; (o) use the sorted list to manually increase the number examples for that category; and (d) iterate through steps (a) to (c) until a satisfactory filter is created. 25 This mode uses manual feedback information regarding the accuracy of the filter to modify the filter very quickly. This can be used as a tool to quickly create a training set from a large uncategorised set of messages. This feedback loop may be incorporated into a number of different applications. In a search engine, this loop may be used to refine results based on relevance WO 02/25479 PCT/AU01/01198 -15 feedback, In a spam filter, it may be used for quickly creating examples of spam and non-span email messages. This mode requires the filter module 14 and the editor/browser 16 to be suitably tailored. The relevance feedback mode can be used for once-off analysis as described in (i) 5 on its own, or in conjunction with clustering. (iv) Adaptive learning of multiple filters, e.g., routing of customer communications where different filters not only provide rejections, but also suggestions as to which other filter should have been the recipient of a particular message. This mode is similar to the previous mode except that multiple filters are adjusted 10 simultaneously, in response to feedback from multiple human operators, (v) Thesaurus creation mode, for understanding how different words/acronyms used in the messages relate to each other. The words are clustered based on which documents they appear in, and this organisation may be browsed and edited to understand the relationship between words. 15 Service providers such as telecommunication -companies, banks and insurance companies often need to process large numbers of short text messages, such as complaints, exit interviews, survey information, and so on. As described above, these messages are typically written down under time constraints, and consequently they may be short and 20 contain abbreviations and typing/spelling errors. Despite this, such messages generally capture the writer's intent/meaning very clearly. Once captured, these text messages are usually associated with structured information such as the category of customer, location of service, etc. The structured infonnation is typically analysed using a spreadsheet application, while the text messages themselves are ignored or interpreted manually, by a, 25 customer service representative, for example, In an alternative embodiment of the present invention, the clustering and the filtering methods of the categorisation system are embodied in components of a plug-in module 34 for a spreadsheet application 36 such as Microsoft Excel®, as shown in Figure 3. The 30 plug-in module 34 includes a ClusterData module 38, a TrainFilters module 40, a WO 02/25479 PCT/AIOI/O1198 -16 TestFilters module 42, a FilterData module 44, support function modules 46, and interface data 48. The combination of the spreadsheet application 36 and the plug-in module 34 provides a convenient system for categorising short text messages, and allows the spreadsheet application 36 to be used to analyse databases that include both structured 5 information and free-text messages, such as a customer complaints database. In particular, the plug-in 34 provides the ability to: (i) automatically group related messages into clusters that have natural affinity, and describe these clusters for easy browsing; (ii) manually browse and alter the groupings to suit business needs; 10 (iii) leam models for those groups and save them so that new messages can be classified using these models; and (iv) classify new messages using earlier models. As shown in Figure 3, the plug-in interface data 48 provides a number of additional 15 buttons 50 to 76 to the spreadsheet application toolbar area 52, allowing a user of the application to analyse free-text data. The buttons include presentation buttons 50 to 60, label buttons 62 to 68, and module invocation buttons 70 to 76. The module invocation buttons 70 to 76 will be described first. A ClusterData button 76 invokes the ClusterData module 38 of the plug-in 34, allowing the user to cluster data selected in a worksheet 78 20 into groups or categories in order to provide one-off analysis, e.g.. to understand information from customer surveys. Each row of the worksheet 78 is considered to be a -separate entry or record to be clustered, Each entry may be associated with one or more category labels which are integers that identify the categories to which the entry belongs, as described below. 25 A TrainFilters button 70 invokes the TrainFilters module 40 for learning a model of the various groups or categories on the basis of selected data in the worksheet 78. The selected data includes the textual data for each entry, and category labels used to identify the categories to which each entry belongs. Such models may be used for on-going analysis of 30 unstructured information, e.g., to understand and monitor customer complaints.

WO 02/25479 PCT/AUOI/01198 -17 A TestFilters button 72 invokes the TestFilters module 42 that tests the models created by the TrainFilters module 40 on data whose labels are known. For example, a simple sanity check for a new model is to select the data that was used for training in order to confirm that the model can at least classify the training data correctly. 5 A RunFilters button 74 invokes the PilterData module 44 that uses the models created by TrainFilters module 40 on new (i.e., uncategorised) data. In this case, the labels of input entries are unknown, and the models are used to predict labels. 10 The remaining buttons 50 to 68 invoke support function modules 46 that allow easy visualisation and manipulation of groups and individual entries in order to facilitate the structuring of textual data within the spreadsheet application 36. These support functions 46, along with the analysis buttons 70 to 76, may be used to perform one-off analysis of textual data, e.g., to understand information from customer surveys, and/ox on-going 15 analysis of unstructured information, e.g., to understand and monitor customer complaints. In the latter mode, it is first necessary to cluster and edit groups in order to identify coherent categories, and then create filters for these categories for classifying new data. Novel messages may then need to be clustered and new categories created if there are sufficient numbers of novel messages of any one category. 20 For example, column C of the worksheet area 78 of Figure 4 includes customer text sent to a service provider from mobile telephones using short message services (SMS). The text entries contain typing errors and non-alphanumeric characters. In this example, columns A and B contain structured data which is ignored: Column A provides a list of unique 25 message identifiers, and column B provides a timestamp indicating when the corresponding message was received. The plug-in module 34 of Figure 4 allows related messages to be identified by invoking the ClusterData module 38, The ClusterData module 38 groups together textually related entities within a selected region of data in the worksheet area 78, and highlights key descriptors in each group to facilitate visual 30 determination of whether a group is homogeneous or not. When the ClusterData button 76 is selected, the ClusterData module 38 begins processing the selected data. If the first row WO 02/25479 PCT/AUO1/01198 -18 of the selection is the heading row (i.e., row number 1), it is ignored. Each of the other rows of the selected data is considered an entity, and textually related entities are brought together by rearranging the order of rows within the selection using the clustering methods described above. Each row may include structured data for other analyses (e.g., columns A 5 and B in this example), but the last column (which may be the only column) contains the textual information that is used for clustering. The output of the ClusterData module 38 is provided in anew worksheet and allows the user to readily perceive the key clustering concepts. As shown in Figure 5, each row includes a cluster identification number column 80, a cluster size column 82, description columns 84 to 92 for each cluster, an initially 10 empty filter identifier column 94, and the original input data in the last three columns 96 to 100. The maximum number of descriptions per cluster is specified by the user, and has been set to five in the example implementation shown in Figures 6 to 9. The last column 100 of the original input containing the textual data is modified so that key descriptors of the cluster are highlighted by colour. The rows are sorted so that rows with the same 15 cluster identifier are together, and clusters with larger number of entries appear before those with smaller number of entries to ensure that a large number of entries can be processed quickly by the user. Initially, only one row per cluster is presented so that key concepts/topics from the textual data may be readily perceived by visual inspection. Alternate cluster rows are shaded in order to visually distinguish adjacent clusters from 20 each other, This initial presentation may be altered by means of presentation buttons 50 to 60. A HideDescription button 50 allows the user to hide the five columns 84 to 92 containing the cluster descriptors. The highlighting of descriptors in the textual data ensures that it is still 25 possible to determine the descriptions when the descriptor columns 84 to 92 are hidden. A ShowDescription button 52 reverses the effect of the HideDescription button $0. An ExpandOne button 58 expands the presentation to show all the entries for one cluster. An Expand4ll button 60 performs the same function for all clusters within the selection. This enables quick visual inspection of one or all clusters so as to identify whether clusters are 30 homogeneous or not. A CollapseOne button 54 and a CollapseAll button 56 reverse the effects of the ExpandOne button 58 and ExpandAll button 60, respectively.

WO 02/25479 PCT/AUO1/01198 19 Using the presentation buttons 50 to 60, clustered data may be easily inspected for major conceptual categories that emerge from the data. For instance, cluster numbers 1, 4 and 7 of Figure 5 belong to the same conceptual grouping of people expressing greetings. 5 Depending on the outcome required, these groups may be further merged with other groups such as 6, 9 and 21 as a single category that does not require any action from the service provider. After text categories have been determined, a large number of entries may be annotated 10 with a filter/category label. The process of annotation is facilitated by means of label buttons 62 to 68, A DefaultLabels button 66 labels each entry, i.e,, creates a filter identifier label in the filter identifier column 94, by copying the numeric value from the cluster identifier column 80. If the number of clusters is small, and the clusters are homogeneous, this is a quick method for labelling entries. A LabelOne button 62 copies the label (from 15 the filter identifier column 94) from the first entry (i.e., row) of a cluster down to all other entries of the same cluster. A LabelAll button 64 performs this action for all clusters within the selection. These buttons 62, 64 simplify labelling because all the user is required to do is to first check the homogeneity of clusters, then label the first entry of each cluster, and then use the LabelOne button 62 or the LabelAll button 64 to label the other entries of the 20 clusters. A CreateMulipieLabels button 68 may be used to allow entries to belong to multiple categories by expanding the single label provided by the filter identifier column 94 to provide one column for each of the different labels (i.e., categories) within the selected 25 region, as shown in Figure 6. Fields in the header row provide descriptors for the categories, and each entry within a data row provides an integer boolean value (i.e., I or 0) indicating whether the corresponding entry belongs to that category. The TrainFilters module 40 uses supervised learning to model the categories that have 30 been defined by the ClusterData module 38 and may have been modified by the user editing the worksheet directly and/or using the labelling buttons 62 to 68. Input to the WO 02/25419 PCTAUOI/01198 -20 TrainFilters module 40 has a particular form, beginning with label columns 102 and ending with the textual data column 100, as shown in Figure 6. The category label columns 102 in this example were obtained by using the CreateMultipleLabels button 68 after labelling the data using the cluster identifier, the highlighting of keywords to determine whether entries 5 belong to a particular group, and the LabelOne button 62 and the LabelAll button 64. After this initial labelling, an entry may be placed into multiple categories, if desired (e.g., message numbers 10 and 27). The labelled data, which was first sorted in the order of cluster identifier for ease of labelling, has been resorted so that it is in ascending order of the message identifier. If desired, zero values in the category columns 102 can be hidden to 10 improve data visualisation, as shown in Figure 8. The TrainFilters module 40 learns a model for each category/label after the required rows/columns in the worksheet 78 are selected. Because the category/label infonnation and the textual data for learning come from an arbitrarily-positioned selected portion of a 15 single worksheet 78, the TrainFilters module 40 requires the number of labels and the starting column of labels to be input by the user, and for the label columns to be contiguous. The first row is assumed to be the header row and is ignored. Models are then learnt from the textual data and the labels (which identify a filter) in the other selected rows using the training method described above. 20 The output from the TrainFilters module 40 is provided as a new worksheet, as shown in Figure 7. The worksheet includes the input data columns 96 to 100, the label columns 102, and additional information in newly introduced columns (coloured blue) 104, 106. The new score columns 104 provide a score for each label/entry combination. The scores are 25 colour-coded, with positive scores in a black typeface and negative scores in a red typeface. For the purposes of categorisation, an entry is deemed to belong to all those categories for which it has a positive score. However, for a higher confidence level, the threshold can be changed from 0 to a positive value. The error column 106 indicates whether there is an error in classification for each entry; that is, whether the classifications 30 indicated by the category columns 102 are consistent with the values in the score columns 104. Errors in the training set are generally rare, since the model is learnt on the basis of C:\NRPortbWCCVKA\3142355 LDOC-23/0912010 -21 the training set. A more thorough validation of the model may be provided by using the TestFilters module 42. The TrainFilters module 40 produces internal models for each of the categories. These 5 models can then be used for classification using the TestFilters module 42 or the FilterData module 44. Optionally, the scores may be automatically converted to confidence levels, which represent an estimate of the probability that an entry belongs to the corresponding category. The TestFilters module 42 uses the models created by TrainFilters 40 on a data set whose labels are known. It provides a more thorough validation of a model, because the 10 model was created without reference to these labels and data. The input and output formats for the TestFilters module 42 are the same as those for the TrainFilters module 40. However, the error column is more likely to be populated when the models are used to categorise new data. 15 The FilterData module 44 uses the models created by TrainFilters 40 to classify a data set whose labels are unknown. The input format is the same as that for clustering (one entry per row, with the first row assumed to be a header row). As shown in Figure 8, the FilterData module 44 creates score columns 104 and category columns 102. The first n columns 104 (where n is the number of labels/categories) provide the score, with positive 20 scores in black and negative scores in red. The next n columns 102 have either a I for all those labels that have a score greater than 0 (or the threshold value set by the user), or are empty. The 1 in a column corresponding to a label indicates that the entry belongs to that category/label. Because the scores are present, the threshold may be adjusted if a more stringent classification is required, without the need for further analysis. 25 Many modifications will be apparent to those skilled in the art without departing from the scope of the present invention.

-21A Throughout this specification and claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. 5 The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of 10 endeavour to which this specification relates. 15 20 25

Claims

1. A data categorisation system, including: a clusterer configured to apply unsupervised learning to items of electronic data to 5 generate a set of clusters of related ones of said items based on features extracted from said items, said features including at least one of n-grams, words and phrases, and the clusters representing respective item categories; an interactive cluster editor configured to display the set of clusters to a user and to modify the set of clusters based on input from the user to provide a set of training 10 clusters; and a filter module configured to use the set of training clusters as training data for supervised learning to generate categorisation data representing models that distinguish respective ones of said clusters from the other clusters; and a categorisation component configured to use the categorisation data to categorise 15 further items of electronic data into the training clusters.

2. A system as claimed in claim 1, wherein the interactive cluster editor includes user interface components configured to allow the user to create one or more clusters, delete one or more clusters, and to move one or more of said items or clusters to another 20 cluster.

3. A system as claimed in claim 1 or 2, including a trend analyzer for determining trends of item categories over time. 25

4. A system as claimed in claim 3, wherein the system is configured to select groups of items of electronic data received over respective time periods, each selected group being provided to the clusterer and to the filter module to determine corresponding categorisations of the group, the trend analyzer being configured to compare the categorisations determined by the clusterer with the categorisations determined by the 30 filter module to assess changes in received items of electronic data over time. CANRPorbrflDCC\MKAU 142355 LDOC-23)91201) - 23

5. A system as claimed in any one of claims 1 to 4, wherein the clusterer, filter module, and categorisation component are components of a spreadsheet application.

6. A system as claimed in any one of claims 1 to 5, wherein said clusterer is configured to 5 extract features from the items of electronic data, select features from the extracted features based on discriminating abilities of the extracted features, and to generate said clusters of related items of electronic data based on the selected features.

7. A system as claimed in any one of claims 1 to 6, including a cluster describer module 10 configured to generate text describing each cluster, and wherein the interactive cluster editor is configured to modify the text describing the clusters based on input from the user.

8. A system as claimed in any one of claims I to 7, wherein said features include n 15 grams.

9. A system as claimed in any one of claims 1 to 7, wherein said features include n grams, words, and phrases. 20

10. A system as claimed in any one of claims 1 to 9, wherein at least some of the items of electronic data are electronic documents.

1 1. A data categorisation process executed by a computer system, the process including: applying unsupervised learning to items of electronic data to generate a set of 25 clusters of related ones of said items of electronic data based on features extracted from said items, said features including at least one of n-grams, words and phrases, and the clusters representing respective item categories; displaying the set of clusters to a user; receiving cluster modification data from the user; C\NRPonbNDCC UH\2943692 1 DOC-1/05/21IU - 24 modifying the set of clusters on the basis of the cluster modification data to provide a set of training clusters; using the training clusters as training data for supervised learning to generate categorisation data representing models that distinguish respective ones of said 5 clusters from the other clusters; and using said categorisation data to categorise further items of electronic data into one or more of the training clusters.

12. A process as claimed in claim 11, wherein said modifying includes one or more of: 10 creating one or more clusters, deleting one or more clusters, and moving one or more of said items or clusters to another cluster.

13. A process as claimed in claim 11 or 12, wherein the extraction of features includes: determining features of said items; 15 determining discriminating abilities of said features; and selecting ones of said features based on said discriminating abilities, wherein the clusters are generated on the basis of the selected features.

14. A process as claimed in claim 13, wherein a discriminating ability of each feature is 20 determined by comparing similarities generated for groupings of said items with and without the feature.

15. A process as claimed in any one of claims 11 to 14, including: selecting one or more of said clusters if the number of items in each of said one or 25 more clusters exceeds a threshold value; and generating descriptive labels for the selected one or more clusters.

16. A process as claimed in claim 15, wherein the descriptive labels include at least one of phrases and sentences. 30 C \NRPonbftDCCRJH\2943692 1.DOC-I /05/2010 - 25

17. A process as claimed in any one of claims 11 to 16, including determining one or more trends of respective item categories over time.

18. A process as claimed in any one of claims I1 to 17, wherein at least some of said items 5 of electronic data are electronic documents.

19. A process as claimed in any one of claims 1 to 18, wherein at least some of said items of electronic data are entries in at least one worksheet of a spreadsheet document. 10

20. A process as claimed in claim 19, wherein each of said entries includes text data to be used for categorising the entries, and structured data.

21. A process as claimed in claim 19 or 20, including generating at least one category descriptor for each entry. 15

22. A process as claimed in any one of claims 19 to 21, including generating a formatted version of data of an entry to indicate a category of said entry.

23. A process as claimed in any one of claims 19 to 22, including generating category 20 identifiers for said entries.

24. A process as claimed in any one of claims 19 to 23, including generating category columns of said at least one worksheet for identifying respective categories of said entries. 25

25. A process as claimed in any one of claims 19 to 24, including generating one or more scores for respective categories of an entry.

26. A process as claimed in claim 25, including generating an error for an entry if any one 30 of said one or more scores is inconsistent with a respective category. C.\NRPonbl\DCCMKA\31423551 DOC-23/09/2010 - 26

27. A process as claimed in claim 25 or 26, wherein a score for an entry indicates that the entry belongs to a corresponding category if said score exceeds a pre-determined value.

28. A process as claimed in claim 27, wherein the value of said pre-determined value is 5 zero.

29. A process as claimed in claim 27 or 28, wherein scores exceeding said pre-determined value are formatted differently than scores less than said value. 10

30. A process as claimed in any one of claims 25 to 29, including generating a probability that said entry belongs to a corresponding category on the basis of said score.

31. A process as claimed in any one of claims 11 to 30, including generating a cluster identifier for identifying at least one of said clusters, and a cluster size value for 15 identifying the size of said cluster.

32. A process as claimed in any one of claims 1 1 to 31, further including: applying unsupervised learning to yet further items of electronic data to generate a set of test clusters of related ones of said yet further items of electronic data based 20 on features extracted from said yet further items of electronic data, said features including at least one of n-grams, words and phrases; using the test clusters as training data for supervised learning to generate test categorisation data representing models that distinguish respective ones of said test clusters from the other test clusters; and 25 testing said test categorisation data by using it to categorise the yet further items of electronic data.

33. A process as claimed in any one of claims 11 to 32, wherein the features include n grams. 30 C:\NRPonb\DCC\MKA\3142355 1.DOC-23A)/2010 - 27

34. A process as claimed in claim 33, wherein the features include n-grams, words and phrases.

35. A data categorisation system or process, substantially as described herein with 5 reference to the accompanying drawings.

36. A plug-in module for a spreadsheet application, the plug-in module being configured to execute the process of any one of claims I I to 35. 10

37. A data categorisation system having components for executing the process of any one of claims 11 to 35.

38. A computer readable storage medium having stored thereon program instructions for executing the process of any one of claims 1I to 35. 15