this post was submitted on 21 May 2024
5 points (61.9% liked)

General Programming Discussion

7814 readers
1 users here now

A general programming discussion community.

Rules:

  1. Be civil.
  2. Please start discussions that spark conversation

Other communities

Systems

Functional Programming

Also related

founded 5 years ago
MODERATORS
 

Traditionally, algorithms for counting distinct items in a stream of data would store all the items. A new algorithm, called CVM, uses randomization to estimate the number of distinct items with minimal memory usage. The trick is to keep track of items by recording them and then randomly deleting some. The probability of an item staying on the list is related to the number of rounds it survives. With this method, the researchers were able to accurately estimate the number of distinct words in Hamlet.

you are viewing a single comment's thread
view the rest of the comments
[–] davel@lemmy.ml 4 points 6 months ago

Optimize Memory and Performance with the CVM Algorithm in JavaScript: A Comprehensive Guide for Text Analysis, Big Data, and More

Final CVM Algorithm Code in a JavaScript “oneliner”

You can use this JS function by providing any text string as input, and it will simulate the described algorithm, returning the final set of unique words. This is as condensed as it can get, but if you can condese it more , please do post and Ill update it! If you are interested in seeing how this ‘oneliner’ started, before it was condensed, I have included the long form version of this same code at the bottom of the article.

const cvmAlgorithm = t => { const f = n => !(Math.random() * (1 << n) | 0); const w = t.match(/\b\w+\b/g) || []; let r = 1, u = new Set(); for (let i of w.map(x => x.toLowerCase())) u.size < 100 ? u.add(i) : f(r) && u.delete(i), u.size >= 100 && r++, u.size > 50 && (u = new Set([...u].filter(() => f(r)))); return [...u]; };

// Example usage:
const text = "To be, or not to be, that is the question..."; 
const finalWords = cvmAlgorithm(text);
console.log(finalWords);  // [ 'to', 'be', 'or', 'not', 'that', 'is', 'the', 'question' ]