arxiv:2106.13302

byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings

Published on Jun 24, 2021

Authors:

Xiang Zhang, Alexandre Drouin, Raymond Li

Abstract

byteSteady uses byte-level n-gram embeddings and a linear classifier to achieve competitive results in both text and DNA sequence classification; compressing the input with Huffman coding does not significantly affect the results.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

This article introduces byteSteady, a fast model for classification using byte-level n-gram embeddings. byteSteady assumes that each input comes as a sequence of bytes. A representation vector is produced by averaging the embedding vectors of the input's byte-level n-grams, for a pre-defined set of values of n. The hashing trick is used to reduce the number of embedding vectors. This representation vector is then fed into a linear classifier. A straightforward application of byteSteady is text classification. We also apply byteSteady to one type of non-language data: DNA sequences, for gene classification. For both problems we achieve competitive classification results against strong baselines, suggesting that byteSteady can be applied to both language and non-language data. Furthermore, we find that simple compression of the input using Huffman coding does not significantly impact the results, which offers an accuracy-speed trade-off previously unexplored in machine learning.
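
The pipeline described in the abstract (byte-level n-grams, hashing trick, averaged embeddings, linear classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram sizes, bucket count, embedding dimension, the use of CRC32 as the hash function, and all function names are assumptions chosen for illustration, and the parameters are random rather than trained.

```python
import random
import zlib

# All constants are illustrative assumptions, not values from the paper.
NGRAM_SIZES = (1, 2, 4)   # stands in for the "pre-defined set of n"
NUM_BUCKETS = 1024        # size of the hashed embedding table
DIM = 8                   # embedding dimension
NUM_CLASSES = 3

def ngram_bucket_ids(data, ns=NGRAM_SIZES, num_buckets=NUM_BUCKETS):
    """Hash every byte-level n-gram of `data` to a bucket index (hashing trick)."""
    ids = []
    for n in ns:
        for i in range(len(data) - n + 1):
            # CRC32 is an assumed stand-in for whatever hash the model uses.
            ids.append(zlib.crc32(data[i:i + n]) % num_buckets)
    return ids

# Untrained, randomly initialised parameters (training is out of scope here).
_rng = random.Random(0)
EMBEDDINGS = [[_rng.uniform(-0.1, 0.1) for _ in range(DIM)]
              for _ in range(NUM_BUCKETS)]
WEIGHTS = [[_rng.uniform(-0.1, 0.1) for _ in range(DIM)]
           for _ in range(NUM_CLASSES)]
BIAS = [0.0] * NUM_CLASSES

def represent(data, table=EMBEDDINGS):
    """Average the embedding vectors of all hashed n-grams in `data`."""
    dim = len(table[0])
    ids = ngram_bucket_ids(data, num_buckets=len(table))
    vec = [0.0] * dim
    if not ids:
        return vec  # empty input: fall back to the zero vector
    for bucket in ids:
        row = table[bucket]
        for k in range(dim):
            vec[k] += row[k]
    return [v / len(ids) for v in vec]

def linear_scores(vec, weights=WEIGHTS, bias=BIAS):
    """One score per class: a plain linear layer on the representation."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def classify(data):
    """Return the index of the highest-scoring class."""
    scores = linear_scores(represent(data))
    return max(range(len(scores)), key=scores.__getitem__)

# Text and DNA are handled identically once they are bytes.
text_label = classify("a short review".encode("utf-8"))
dna_label = classify(b"ACGTTGCAACGT")
```

Because the model only ever sees bytes, the same code path serves both text and DNA input; only the training data would differ. Several n-grams may collide in one bucket, which is the cost the hashing trick accepts in exchange for a fixed-size embedding table.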