Text classification is probably not the first thing that comes to mind when you think of machine learning, but as far as solid practical real-world applications go, it’s arguably at the top of the pile.
Before we get started on hierarchical classification, let’s get a bit of jargon out of the way first. Text classification is the task of assigning predefined classes to a piece of text (or document). The classification may be done manually or by machine learning. With machine learning there are generally two algorithms involved: one to build the classifier model at design-time (known as the learning algorithm), and one to produce the class predictions at run-time (which makes use of the classifier model). Just to add more confusion, text classification is also known as text categorisation, and classes are also known as categories. Here’s a fairly simple diagram to make sense of it all:
Text categorisation is used every day to automate tasks such as filing documents into a folder taxonomy, routing text-based messages electronically according to agent skills, drawing users’ attention to documents that match their registered interests, detecting spam, filtering outbound text to prevent distribution of proprietary information, and deleting inappropriate comments or statements. Whilst these tasks may not be the most glamorous in the machine learning world, their automation saves millions of dollars every day.
Those familiar with text classification have probably heard of binary, multi-class and multi-label classifiers. Binary classifiers choose one of two classes, whilst multi-class classifiers choose exactly one of three or more classes, and multi-label classifiers may choose more than one class from three or more classes. Ok, so what is hierarchical classification then?
Well, there are several types of hierarchical classification, but in this article I’m going to cover the more traditional type, based on a hierarchical arrangement of several individual text classifiers. At the end of this article I have also included references to some newer approaches for you to explore.
Traditional hierarchical classification is the arrangement of a number of binary, multi-class or multi-label classifiers into a hierarchy, where classification is executed from the top down. Each prediction results in the message being presented to a subsequent classifier at the next level down the tree, and so on until there are no more levels. The diagram below depicts a simple example:
The top level classifier in the above diagram is depicted using the Black outline. That classifier takes the message text as input, and makes a prediction between the English and French languages. In this particular scenario I have included only two languages but in a real-world scenario there may be multiple languages in this top level classifier. I’ve chosen to keep the diagram as simple as is possible, whilst providing a sufficient level of detail to describe hierarchical classification succinctly.
Let’s say that the top level classifier predicts French. The controlling process will then pass the message text on to the next classifier down the tree, which is on the right branch and is outlined in Blue. This second level classifier makes a prediction between French Retail Banking and French Business Banking. Let’s say it predicts Retail Banking. Control is then passed to the last classifier in the tree, this time outlined in Green (i.e. second from the right). That classifier makes a prediction from Balance Request, Credit Card or Mortgage.
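The walk down the tree described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation behind the diagram: the `Node` class, the `keyword_classifier` helper and all of the keyword lists are hypothetical stand-ins for trained classifier models, and the class names are taken from the banking example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One classifier in the hierarchy, plus the classifiers directly below it."""
    predict: Callable[[str], str]                      # message text -> class label
    children: Dict[str, "Node"] = field(default_factory=dict)


def keyword_classifier(rules: Dict[str, List[str]], default: str) -> Callable[[str], str]:
    """Toy stand-in for a trained model: the first label with a matching keyword wins."""
    def predict(text: str) -> str:
        lowered = text.lower()
        for label, keywords in rules.items():
            if any(keyword in lowered for keyword in keywords):
                return label
        return default
    return predict


def classify(root: Node, text: str) -> List[str]:
    """Follow each prediction to the next classifier down until there are no more levels."""
    path: List[str] = []
    node: Optional[Node] = root
    while node is not None:
        label = node.predict(text)
        path.append(label)
        node = node.children.get(label)   # None once we reach a leaf class
    return path


# Hypothetical keyword rules, for illustration only.
english = Node(
    keyword_classifier({"Business Banking": ["business", "cash management"]}, "Retail Banking"),
    {
        "Retail Banking": Node(keyword_classifier(
            {"Credit Card": ["credit card"], "Mortgage": ["mortgage"]}, "Balance Request")),
        "Business Banking": Node(keyword_classifier(
            {"Cash Management": ["cash management"]}, "Loans")),
    },
)
french = Node(
    keyword_classifier({"Business Banking": ["entreprise"]}, "Retail Banking"),
    {
        "Retail Banking": Node(keyword_classifier(
            {"Credit Card": ["carte"], "Mortgage": ["immobilier"]}, "Balance Request")),
        "Business Banking": Node(keyword_classifier(
            {"Cash Management": ["encaissement"]}, "Loans")),
    },
)
root = Node(
    keyword_classifier({"French": ["bonjour", "solde", "carte"]}, "English"),
    {"English": english, "French": french},
)

FRENCH_PATH = classify(root, "Bonjour, quel est le solde de mon compte ?")
ENGLISH_PATH = classify(root, "What is the rate on a new mortgage?")
```

With these toy rules the French message follows the path French, Retail Banking, Balance Request, which is the same route as the walkthrough above. In a real system each `predict` would call a trained model rather than match keywords.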
The scenario I’ve described above uses multi-class classifiers. However, it could apply equally well to a hierarchical arrangement of multi-label classifiers. In fact, multi-label is more common for email routing, as emails may contain multiple questions (relating to multiple classes) and as a result may require routing to more than one agent, based on skills. In such scenarios, multiple paths may be traversed down the tree in parallel.
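For the multi-label case, here is a rough sketch of how several paths might be followed at once. Again, `MultiLabelNode`, `keyword_labels` and the keyword lists are hypothetical, and for simplicity the branches are visited one after another rather than truly in parallel. I have also assumed that when a lower classifier predicts nothing, the partial path is kept rather than dropped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class MultiLabelNode:
    """A classifier that may predict several labels, plus its sub-classifiers."""
    predict: Callable[[str], Set[str]]
    children: Dict[str, "MultiLabelNode"] = field(default_factory=dict)


def keyword_labels(rules: Dict[str, List[str]]) -> Callable[[str], Set[str]]:
    """Toy stand-in for a multi-label model: every label with a matching keyword is returned."""
    def predict(text: str) -> Set[str]:
        lowered = text.lower()
        return {label for label, keywords in rules.items()
                if any(keyword in lowered for keyword in keywords)}
    return predict


def classify_all(node: MultiLabelNode, text: str,
                 prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every path through the tree that the message is routed down."""
    paths: List[Tuple[str, ...]] = []
    for label in sorted(node.predict(text)):          # sorted for a stable order
        path = prefix + (label,)
        child = node.children.get(label)
        deeper = classify_all(child, text, path) if child is not None else []
        # Assumption: if nothing deeper is predicted, keep the partial path.
        paths.extend(deeper or [path])
    return paths


# Hypothetical keyword rules, for illustration only.
banking = MultiLabelNode(
    keyword_labels({
        "Retail Banking": ["balance", "credit card", "mortgage"],
        "Business Banking": ["business", "company", "cash management"],
    }),
    {
        "Retail Banking": MultiLabelNode(keyword_labels({
            "Balance Request": ["balance"],
            "Credit Card": ["credit card"],
            "Mortgage": ["mortgage"],
        })),
        "Business Banking": MultiLabelNode(keyword_labels({
            "Loans": ["loan"],
            "Cash Management": ["cash management"],
        })),
    },
)

ROUTES = classify_all(
    banking, "I need my account balance, and my company wants a business loan.")
```

Here the single email ends up on two routes, one under Business Banking and one under Retail Banking, so it could be passed to two agents with different skills.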
Hopefully by this stage you’ve got a good sense of how traditional hierarchical classification works. What I want to do now is explain why a hierarchical structure is preferable to a flat structure. Flat structures can work well in many situations, but as the number of categories grows most algorithms begin to struggle, both in terms of accuracy and performance. Accuracy falls because the algorithms find it harder to locate enough features to differentiate the classes. Performance suffers as new data or categories are introduced, since each change necessitates re-building the entire classifier model.
Hierarchical classification enables us to have a number of much smaller classifier models, as the data is much more compartmentalized due to the hierarchical nature of the arrangement. This generally means that each classifier is more accurate, and as data and categories change it is often not necessary to re-build all of the classifiers in the tree. Even if a total re-build were necessary, all of the classifiers may be built in parallel, and since they are much smaller, the overall build time will be shorter than that of one larger classifier.
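To make the re-build point a little more concrete, here is a small sketch using Python’s standard `concurrent.futures` module. The `train` function is a deliberately trivial, hypothetical learner (it just counts words per label), and the classifier names and samples are made up. I have used threads to keep the example short; for a CPU-heavy learner you would more likely want separate processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, str]            # (text, label)
Model = Dict[str, Counter]          # label -> word counts


def train(samples: List[Sample]) -> Model:
    """Toy learning algorithm: count the words seen for each label."""
    model: Model = {}
    for text, label in samples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def build(training_sets: Dict[str, List[Sample]],
          only: Iterable[str] = ()) -> Dict[str, Model]:
    """Build the named classifiers concurrently; by default build all of them."""
    names = list(only) or list(training_sets)
    with ThreadPoolExecutor() as pool:
        models = pool.map(train, (training_sets[name] for name in names))
        return dict(zip(names, models))


# Hypothetical training sets, one per classifier in the tree.
TRAINING_SETS: Dict[str, List[Sample]] = {
    "language": [("what is my balance", "English"),
                 ("quel est mon solde", "French")],
    "english_banking": [("what is my balance", "Retail Banking"),
                        ("i need a business loan", "Business Banking")],
    "english_retail": [("what is my balance", "Balance Request"),
                       ("i lost my credit card", "Credit Card")],
}

ALL_MODELS = build(TRAINING_SETS)                               # total re-build
RETAIL_ONLY = build(TRAINING_SETS, only=["english_retail"])     # partial re-build
```

The second call is the more common case: if only the retail categories change, only that one small classifier needs to be built again.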
The last topic I want to talk about here is sample data. In the case of a flat structure you must supply training data for every class/category. However, for a hierarchy it is often sufficient to supply data only for the leaf categories, and then simply propagate the data up the tree (relabeling it as you go).
For example, let’s consider the English language branch of the tree structure depicted above. We’d supply data for the leaf categories, i.e. Balance Request, Credit Card, Mortgage, Loans and Cash Management. In the first instance this data is used to build the classifier that chooses between Balance Request, Credit Card and Mortgage, and the classifier that chooses between Loans and Cash Management. However, when we need to build the next level up classifier that chooses between Retail Banking and Business Banking (depicted in Blue in our diagram), we can reuse the same leaf node data and just re-label it as either Retail or Business Banking. Likewise, when we want to build the top level classifier that chooses between English and French, we just propagate all of that leaf node data up the tree and re-label it as English.
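The re-labelling step can be sketched as follows. I am assuming here that each leaf category is stored against its full path from the top of the tree; the `training_sets` function and the sample texts are hypothetical and only there to show the idea.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]
Sample = Tuple[str, str]            # (text, label)

# Hypothetical leaf data: full path from the top of the tree -> sample texts.
LEAF_SAMPLES: Dict[Path, List[str]] = {
    ("English", "Retail Banking", "Balance Request"): ["What is my balance?"],
    ("English", "Retail Banking", "Credit Card"): ["I lost my credit card"],
    ("English", "Retail Banking", "Mortgage"): ["What are your mortgage rates?"],
    ("English", "Business Banking", "Loans"): ["I need a business loan"],
    ("English", "Business Banking", "Cash Management"): ["Tell me about cash management"],
    ("French", "Retail Banking", "Balance Request"): ["Quel est mon solde ?"],
}


def training_sets(leaf_samples: Dict[Path, List[str]]) -> Dict[Path, List[Sample]]:
    """Propagate leaf data up the tree, re-labelling it at each level.

    Each classifier is identified by the path leading to it, so () is the
    top level classifier and ("English",) is the one that chooses between
    Retail Banking and Business Banking for English messages.
    """
    sets: Dict[Path, List[Sample]] = {}
    for path, texts in leaf_samples.items():
        for depth, label in enumerate(path):
            classifier = path[:depth]
            sets.setdefault(classifier, []).extend((text, label) for text in texts)
    return sets


TRAINING = training_sets(LEAF_SAMPLES)
```

With this data, `TRAINING[()]` holds every sample labelled English or French, `TRAINING[("English",)]` holds the English samples labelled Retail Banking or Business Banking, and `TRAINING[("English", "Retail Banking")]` holds the three retail leaf categories with their original labels.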
Hopefully this blog post has helped you see that in many instances traditional hierarchical classification is preferable to a flat structure in terms of accuracy, performance and data management. That said, the success of the traditional hierarchical approach still depends closely on the quality of the individual classifiers in the hierarchy.
For more information, see US Patent: 7,603,415: Classification of electronic messages using a hierarchy of rule sets, October 13, 2009.
Here are some references to other more modern hierarchical classification approaches:
Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN
Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang.
Hierarchical Transfer Learning for Multi-label Text Classification
Siddhartha Banerjee, Cem Akkaya, Francisco Perez-Sorrosal, and Kostas Tsioutsiouliklis
Weakly-Supervised Hierarchical Text Classification
Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han