Any way to reduce the size of texts?

Please note that neither base64 nor encryption was designed to reduce string length. What you should be looking at is compression, and I'd suggest looking at gzcompress and gzdeflate.

Example using the decoded version of your text:

$original = "In other cases, some countries have gradually learned to produce the same products and services that previously only the U.S. and a few other countries could produce. Real income growth in the U.S. has slowed.";

// Base64 alone (no compression) as the baseline
$base64 = base64_encode($original);

// Compress first, then base64-encode the binary output
$compressed = base64_encode(gzcompress($original, 9));
$deflate    = base64_encode(gzdeflate($original, 9));
$encode     = base64_encode(gzencode($original, 9));

$base64Length     = strlen($base64);
$compressedLength = strlen($compressed);
$deflateLength    = strlen($deflate);
$encodeLength     = strlen($encode);

echo "<pre>";
echo "Using GZ Compress   =  ", 100 - number_format(($compressedLength / $base64Length) * 100, 2), "% improvement", PHP_EOL;
echo "Using Deflate       =  ", 100 - number_format(($deflateLength / $base64Length) * 100, 2), "% improvement", PHP_EOL;
echo "</pre>";

Output

Using GZ Compress   =  32.86% improvement
Using Deflate       =  35.71% improvement

Base64 is not compression or encryption; it is encoding. You can pass text data through the gzip compression algorithm (http://php.net/manual/en/function.gzcompress.php) before you store it in the database, but that will basically make the data unsearchable via MySQL queries.
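For illustration, here is a minimal sketch of that pattern using PDO. The documents table, its columns, and the connection details are made up for the example; the point is only to show where the compression and decompression would happen:

// Minimal sketch, assuming a hypothetical `documents` table with an auto-increment `id`
// and a BLOB/LONGBLOB `body` column; adjust the DSN, credentials, and schema to your setup.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');

// Compress before storing. The result is binary, so the column must be a BLOB type.
$original = 'Some long text to store...';
$stmt = $pdo->prepare('INSERT INTO documents (body) VALUES (?)');
$stmt->execute([gzcompress($original, 9)]);

// Decompress after reading. MySQL can no longer search inside `body`;
// any filtering on the text has to happen in PHP after gzuncompress().
$row  = $pdo->query('SELECT body FROM documents WHERE id = 1')->fetch(PDO::FETCH_ASSOC);
$text = gzuncompress($row['body']);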


Okay, it's really challenging (at least for me)! ... You have 10 TB of text and you want to load it into your MySQL database and perform a full-text search on the tables!

Maybe clustering or some performance tricks on good hardware will work for you, but if that's not the case, you may find this approach interesting.

First, you need a script that just loads these 50 billion pieces of text one after another, splits each of them into words, and treats those words as keywords: that means giving each word a numeric id and then saving it in a table (a rough PHP sketch of this step follows the examples below). For example, I am piece of large text. would become something like this:

[1: piece][2: large][3: text]

and I'm the next large part! would be:

[4: next][2: large][5: part]

Note that the words I, am, of, I'm, the, plus . and !, have been eliminated because they usually contribute nothing to a keyword-based search. However, you can also keep them in your keywords array if you wish.
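Here is a rough sketch of that splitting and id-assignment step in plain PHP. The stop-word list and the in-memory $keywordIds map are only illustrative; in a real run the ids would be read from and written back to the keywords table:

// Split a text into keywords, dropping stop words and assigning numeric ids.
function extractKeywords(string $text, array &$keywordIds, array $stopWords): array
{
    // Lowercase, strip punctuation (keeping apostrophes so "i'm" stays one token), split on whitespace
    $cleaned = preg_replace("/[^a-z0-9'\s]/", ' ', strtolower($text));
    $words   = preg_split('/\s+/', $cleaned, -1, PREG_SPLIT_NO_EMPTY);

    $ids = [];
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue; // drop stop words such as "i", "am", "of", "the"
        }
        if (!isset($keywordIds[$word])) {
            $keywordIds[$word] = count($keywordIds) + 1; // next numeric keyword id
        }
        $ids[$word] = $keywordIds[$word];
    }
    return $ids;
}

$keywordIds = [];
$stopWords  = ['i', 'am', 'of', "i'm", 'the'];

print_r(extractKeywords('I am piece of large text.', $keywordIds, $stopWords));  // piece => 1, large => 2, text => 3
print_r(extractKeywords("I'm the next large part!", $keywordIds, $stopWords));   // next => 4, large => 2, part => 5

Running it on the two example sentences reproduces exactly the mappings shown above.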

Give the original text a unique id as well. You can calculate the md5 of the original text or simply assign a numeric id. Then store this id somewhere.

You will need a table to keep the relationships between texts and keywords. It would be a many-to-many structure like this (a possible schema sketch follows these example rows):

[text_id][text]
1 -> I am piece of large text.
2 -> I'm the next large part!

[keyword_id][keyword]
1 -> piece
2 -> large
3 -> text
4 -> next
5 -> part

[keyword_id][text_id]
1 -> 1
2 -> 1
3 -> 1
4 -> 2
2 -> 2
5 -> 2
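A possible MySQL schema for those three tables could look like the sketch below. The table and column names follow the layout above, but the types (and the BIGINT ids, chosen with 50 billion texts in mind) are only suggestions, and the connection details are placeholders:

$pdo = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass');

// One row per original text
$pdo->exec("CREATE TABLE texts (
    text_id BIGINT UNSIGNED PRIMARY KEY,
    text    LONGTEXT NOT NULL
)");

// One row per distinct keyword
$pdo->exec("CREATE TABLE keywords (
    keyword_id INT UNSIGNED PRIMARY KEY,
    keyword    VARCHAR(255) NOT NULL UNIQUE
)");

// The many-to-many link table: one row per (keyword, text) pair
$pdo->exec("CREATE TABLE keyword_text (
    keyword_id INT UNSIGNED NOT NULL,
    text_id    BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword_id, text_id),
    INDEX (text_id)
)");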

Now, imagine how much easier it would be (especially for MySQL!) when somebody searches for large text!
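For example, a search for large text could be answered from the link table alone. This is just one possible way to write it (connection details are placeholders again); it returns the texts that contain all of the searched keywords:

$pdo = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass');

$searchWords  = ['large', 'text'];
$placeholders = implode(',', array_fill(0, count($searchWords), '?'));

// Keep only the texts that matched every searched keyword
$sql = "SELECT kt.text_id
        FROM keyword_text kt
        JOIN keywords k ON k.keyword_id = kt.keyword_id
        WHERE k.keyword IN ($placeholders)
        GROUP BY kt.text_id
        HAVING COUNT(DISTINCT k.keyword_id) = ?";

$stmt = $pdo->prepare($sql);
$stmt->execute(array_merge($searchWords, [count($searchWords)]));
$textIds = $stmt->fetchAll(PDO::FETCH_COLUMN); // with the sample rows above: text_id 1 only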

As far as I have found on the net, you would end up with about 50,000 or 60,000 distinct keywords, or at most 600,000-700,000 if you simply keep everything as a keyword. Either way, 50,000 words is far less data than 10 TB of raw text.

I hope that helps, and if you need, I can explain more or help you get this working somehow! :)