Node.js Cheerio parser breaks UTF-8 encoding

I ran into an issue earlier today when I tried to load a page with cheerio that contained special characters such as ç, á, and é.

By default, cheerio decodes the characters it parses and then outputs the numeric HTML encoding of each non-ASCII Unicode character.

For example, instead of ç it would give us &#xE7;.
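To make the transformation concrete, here is a minimal sketch of that kind of encoding in plain JavaScript. The helper `encodeNonAscii` is hypothetical and is not part of cheerio; it only mimics the hexadecimal numeric entities seen in the output, assuming every code point above ASCII gets encoded.

```javascript
// Hypothetical helper (not a cheerio API): replaces every non-ASCII
// code point with a hexadecimal numeric character reference.
function encodeNonAscii(text) {
  let out = "";
  // for...of iterates by code point, so astral characters stay intact.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    out += codePoint > 0x7f
      ? `&#x${codePoint.toString(16).toUpperCase()};`
      : ch;
  }
  return out;
}

console.log(encodeNonAscii("ç"));   // &#xE7;
console.log(encodeNonAscii("abc")); // abc (ASCII is left untouched)
```

A browser renders `&#xE7;` and `ç` identically, so the difference only matters when you read the markup as a raw string.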

To fix this, I just had to turn that behavior off by passing decodeEntities: false as an option to cheerio.load:

const $ = cheerio.load(body, { decodeEntities: false });

Cheerio hasn't broken anything. It's outputting HTML entities, which will be rendered by any browser exactly the same as the HTML input. Run this snippet to see what I mean:

<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>

<h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY - &#x43F;&#x43E;&#x43F;&#x440;&#x43E;&#x431;&#x443;&#x439;&#x442;&#x435; &#x43D;&#x430;&#x439;&#x442;&#x438; &#x43B;&#x443;&#x447;&#x448;&#x435;</span></h1>

&#x423;, for example, is the character У encoded as an HTML entity, in the same way the entity &gt; represents >.
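If it helps to see the mapping explicitly, the sketch below turns numeric entities back into characters in plain JavaScript. `decodeNumericEntities` is a hypothetical helper, not something cheerio exposes, and it deliberately leaves named entities such as &gt; alone.

```javascript
// Hypothetical helper (not a cheerio API): converts numeric character
// references like &#x423; or &#231; back into the characters they represent.
function decodeNumericEntities(html) {
  return html.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities("&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F;"));
// Уличная
```

The entity is just the Unicode code point written in hexadecimal: 0x423 is the code point of У.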

However, if you want to get the unencoded text, you can set the decodeEntities option to false:

const cheerio = require('cheerio');

// Load the markup with entity handling turned off so text is kept as-is.
const $ = cheerio.load(
  `<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>`,
  { decodeEntities: false }
);

console.log($('span').html());
// => Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше