Text Classification: A Parameter-Free Method with Compressors

Discover how the combination of compressors and k-nearest-neighbor classification revolutionizes low-resource text classification

Bala Manikandan
5 min read · Jul 15, 2023

Natural Language Processing (NLP) has been evolving at a faster pace since the introduction of the Transformer architecture, and one of its most intriguing and challenging areas is low-resource text classification. This task involves categorizing text into predefined classes, a seemingly simple problem that becomes increasingly complex when dealing with languages or domains where labeled data is scarce. At a time when top benchmarks are crowded with complex Transformer architectures, a recent research paper introduces a simple procedure that outperforms many of these complex models.

Scarcity of data is a common issue in many real-world applications. For instance, in the case of less commonly spoken languages or specialized domains, obtaining a large amount of labeled data for training can be a daunting, if not impossible, task. This is where low-resource text classification comes into play.

Traditionally, deep neural networks (DNNs), and more recently Transformer networks, have been the go-to solution for many NLP tasks, including text classification. However, these models are notoriously data-hungry: because they are trained with supervised learning, they require vast amounts of labeled data to perform well. This characteristic makes DNNs less than ideal for low-resource scenarios, where the available training data is limited.

The challenge, therefore, lies in developing methods that can effectively classify text with minimal resources. This involves creating models that are not only accurate but also efficient, capable of learning from small amounts of data without compromising performance.

Most problems in NLP are solved on top of embedding vectors created from the data. These embedding vectors capture the semantic relationships between texts: similar texts lie closer together and dissimilar texts lie farther apart in the embedding space. An embedding is itself a kind of compression applied to long-form text or sentences.

Instead of the traditional neural network (NN) embedding vectors, this research paper applies the well-known technique of lossless file compression to the text data. File compressors, in the context of data processing, are tools that reduce the size of data without losing any of the information. They aim to represent information using the fewest bits possible, with no loss in quality. How? By assigning shorter codes to symbols with higher probabilities of occurrence. Compressors are remarkably good at identifying patterns or regularities, and if you think about it, categorization is essentially about recognizing shared patterns or commonalities among certain elements. Objects from the same category exhibit more regularity than those from different categories, so these shared patterns can be encoded in a more compact form.

Gzip outperforms the other compression techniques tested in their research. Gzip uses the DEFLATE algorithm for compression, which combines the power of LZ77 (Lempel–Ziv 77) and Huffman coding. Roughly, gzip follows the steps below (a short illustration follows them).

Tokenization: The text string is tokenized into individual symbols or characters.

LZ77 Compression: gzip uses the LZ77 algorithm, which works by replacing repeated occurrences of data with references to a single copy. If a string of characters is repeated in the text, LZ77 replaces the second occurrence with a pair (distance to the first occurrence, length of the repeated string). So instead of storing the same string multiple times, gzip stores it once and then refers back to it whenever it appears in the text again.

Huffman Coding: After LZ77 compression, Huffman coding is applied. This is a common algorithm for lossless data compression. The idea behind Huffman coding is that characters that appear more frequently should have shorter codes than those that appear less frequently. In other words, it uses variable-length codes to represent the input characters, building a Huffman tree where the most frequently occurring characters have the shortest paths.
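To see this effect concretely, here is a tiny sketch (my own illustration, not from the paper): it compresses a highly repetitive string and an equally long string with almost no structure. LZ77 turns the repeats into short back-references, so the repetitive string shrinks dramatically, while the unstructured one barely shrinks at all.

import gzip
import random
import string

random.seed(0)

repetitive = "the cat sat on the mat " * 10                             # heavy repetition
varied = "".join(random.choices(string.printable, k=len(repetitive)))   # little structure

# Raw length vs. gzip-compressed length (in bytes)
print(len(repetitive), len(gzip.compress(repetitive.encode())))  # compresses to a small fraction
print(len(varied), len(gzip.compress(varied.encode())))          # barely shrinks, if at all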

Mathematically, consider three objects x1, x2, and x3, where x1 and x2 belong to the same category but x3 belongs to a different one. Let C(·) denote the compressed length. Then the compressed length of the concatenation of x1 and x2 minus the compressed length of x1, i.e. C(x1x2) - C(x1), will tend to be less than C(x1x3) - C(x1). In layman's terms, the amount of additional information needed to represent x2 after x1 (when they belong to the same category) is less than that needed to represent x3 (from a different category) after x1.
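As a rough sanity check of this intuition (with made-up sentences, not data from the paper), the snippet below measures how many extra compressed bytes are needed to describe a same-topic sentence versus an off-topic sentence once x1 has already been compressed:

import gzip

def C(text):
    # Compressed length in bytes, used as a proxy for Kolmogorov complexity
    return len(gzip.compress(text.encode()))

x1 = "Soccer is a wonderful game and the players ran across the pitch"
x2 = "Basketball is a wonderful game and the players ran across the court"  # same topic as x1
x3 = "The recipe calls for two cups of flour, butter, sugar and vanilla"    # different topic

# Extra bytes needed for x2 / x3 once x1 is known; the first value is expected
# to be smaller, because x1 and x2 share long substrings that gzip can reuse
print(C(" ".join([x1, x2])) - C(x1))
print(C(" ".join([x1, x3])) - C(x1))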

This intuition is formalized as a distance metric using a concept known as Kolmogorov complexity, which characterizes the length of the shortest binary program that can generate an object x. However, since Kolmogorov complexity is incomputable, we use a concept known as Normalized Compression Distance (NCD). NCD uses the compressed length C(x) to approximate Kolmogorov complexity K(x) and is defined as follows:

NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}

NCD provides a normalized and computable version of information distance. The higher the compression ratio, the closer C(x) is to K(x).

Having computed the NCD, we can then generate a distance matrix that can be fed into a k-nearest-neighbor (k-NN) classifier for the actual classification task. The k-NN algorithm will classify new data points based on the ‘k’ nearest data points in this NCD distance matrix.
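Before the from-scratch implementation further below, here is one way such a precomputed NCD matrix could be handed to an off-the-shelf k-NN classifier. This sketch uses scikit-learn with metric="precomputed" (an assumption on my part; the paper does not depend on scikit-learn) and a couple of made-up training sentences:

import gzip
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # assumes scikit-learn is installed

def ncd(x, y):
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress(" ".join([x, y]).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

train_texts = ["I love pizza", "Ice cream is delicious",
               "Basketball requires good coordination", "Soccer is a wonderful game"]
train_labels = ["food", "food", "sport", "sport"]
test_texts = ["soccer is a great game", "pizza and ice cream for dinner"]

# Pairwise NCD matrices: train-vs-train for fit(), test-vs-train for predict()
D_train = np.array([[ncd(a, b) for b in train_texts] for a in train_texts])
D_test = np.array([[ncd(a, b) for b in train_texts] for a in test_texts])

knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
knn.fit(D_train, train_labels)
print(knn.predict(D_test))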

The k-NN algorithm is a type of instance-based learning, where the classification of a new data point is based on the class labels of its nearest neighbors. During the training phase, the algorithm simply stores the training documents and their corresponding class labels. When a new, unlabeled data point needs to be classified, the algorithm computes its NCD to all the labeled data points, selects the k points with the shortest distances, and assigns the majority class among them to the new data point. The proposed method is implemented in Python below using the standard gzip library.

import gzip
import numpy as np

# Compute the Normalized Compression Distance (NCD) between two strings
def calculate_ncd(x1, x2):
    Cx1 = len(gzip.compress(x1.encode()))        # C(x1)
    Cx2 = len(gzip.compress(x2.encode()))        # C(x2)
    x1x2 = " ".join([x1, x2])                    # concatenation of the two texts
    Cx1x2 = len(gzip.compress(x1x2.encode()))    # C(x1x2)
    ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
    return ncd

# k-NN classification using NCD as the distance measure
def knn_classify(test_instance, training_set, k):
    distance_from_test_instance = []
    for (x2, _) in training_set:
        ncd = calculate_ncd(test_instance, x2)
        distance_from_test_instance.append(ncd)
    sorted_idx = np.argsort(np.array(distance_from_test_instance))
    top_k_class = [training_set[i][1] for i in sorted_idx[:k]]
    # Majority vote among the k nearest neighbours
    predict_class = max(set(top_k_class), key=top_k_class.count)
    return predict_class

# Example texts (test labels are unknown, hence None)
test_set = [
    ("soccer is wonderful game", None),
    ("I love pizza man", None)
]

training_set = [
    ("I love pizza", "Class A"),
    ("Ice cream is delicious", "Class A"),
    ("Basketball requires good coordination", "Class B"),
    ("Soccer is a wonderful game", "Class B"),
    ("Tennis needs effort", "Class B")
]

k = 3  # Number of neighbors to consider

# Apply the k-NN classification to the test set
for (test_instance, _) in test_set:
    predicted_class = knn_classify(test_instance, training_set, k)
    print(f"The predicted class for '{test_instance}' is '{predicted_class}'")

The research proposes an intuitive method for text classification, but it has a limitation in computational complexity, since k-NN over pairwise NCDs is O(n²). The authors intend to try neural compressors derived from deep latent variable models, which have outperformed semi-supervised learning methods in the image classification domain. Moreover, their purpose was to highlight the trade-off between the simplicity of a model and its performance. Thanks for reading!
