The number of $n$-gram features grows exponentially in $n$, making it computationally demanding to compute the most frequent $n$-grams even for $n$ as small as $3$. Motivated by our production machine learning system built on $n$-gram features, we ask: is it possible to accurately, deterministically, and quickly recover the top-$k$ most frequent $n$-grams? We devise a multi-pass algorithm called {\it Intergrams} that constructs candidate $n$-grams from the preceding $(n-1)$-grams. By designing this algorithm with hardware in mind, our approach yields more than an order of magnitude speedup (up to 33$\times$!) over the next fastest known algorithm, even when similar optimizations are applied to that algorithm. Using the empirical power-law distribution over $n$-grams, we also provide theory that explains the efficacy of our multi-pass approach. Our code is available at https://github.com/anongitrepos/Intergrams.
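To make the multi-pass idea concrete, the following is a minimal illustrative sketch (not the authors' actual implementation, which is hardware-optimized) of an Apriori-style scheme: pass $n$ counts only those $n$-grams whose $(n-1)$-gram prefix survived the previous pass as a top-$k$ candidate. The function name `top_k_ngrams` and the prefix-only pruning rule are assumptions for illustration; note that pruning by top-$k$ prefixes is a heuristic, since a frequent $n$-gram could in principle have a prefix outside the top $k$.

```python
from collections import Counter

def top_k_ngrams(tokens, max_n, k):
    """Multi-pass top-k n-gram counting sketch: candidates for pass n
    are restricted to n-grams whose (n-1)-gram prefix survived pass n-1."""
    results = {}
    # Pass 1: exact unigram counts; keep the k most frequent as survivors.
    counts = Counter((t,) for t in tokens)
    survivors = {g for g, _ in counts.most_common(k)}
    results[1] = counts.most_common(k)
    for n in range(2, max_n + 1):
        # Pass n: scan the token stream again, counting only n-grams
        # whose (n-1)-gram prefix is among the previous pass's survivors.
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[:-1] in survivors:
                counts[gram] += 1
        survivors = {g for g, _ in counts.most_common(k)}
        results[n] = counts.most_common(k)
    return results
```

Each pass touches the token stream once, so memory stays bounded by the $O(k)$ survivor set per pass rather than the full exponential $n$-gram space.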