Crawled Vietnamese Characters Don't Appear As They Do On The Web

I'm crawling some reading material from the web and saving it as UTF-8 .txt files:

  // Extract the chapter title and body text from the rendered page
  const result = await page.evaluate(() => {
    const title = document.querySelector('#chapter-title').innerText;
    const content = document.querySelector('#chapter-content').innerText;
    return title + "\n\n" + content;
  });

  fs.writeFileSync(`./results/chapter${chapter}.txt`, result, 'utf8');

But some characters (mostly the accented ones) are different in their original (HTML) form from how they appear in the browser, and they mess up my reading app.

Below is a screenshot of the same text: the first line is the crawled result; the second is from opening the page in a browser and selecting + copying the text manually:

(screenshot: crawled text on the first line, manually copied text on the second)

It seems the browser somehow intelligently "fixed" the text, changing it into characters available in the font.

Since I don't know exactly what happened, my searches didn't yield any results.

What happened, and is there any way I can convert the crawled text into a readable form?

Answer

I resolved the issue using String.prototype.normalize().

The characters in the source HTML were in a mix of NFC and NFD forms. It seems my text editors failed to combine characters carrying two or more accents, resulting in separate accents/squares. Calling normalize(), well, normalized them all to NFC, solving the issue.
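For reference, here's a minimal sketch of the difference between the two forms, using the Vietnamese letter ế (which carries two diacritics) as an example; the exact characters on the crawled site may differ:

```javascript
// Precomposed (NFC): a single code point, U+1EBF (ế)
const nfc = '\u1EBF';
// Decomposed (NFD): base letter e + combining circumflex + combining acute
const nfd = 'e\u0302\u0301';

console.log(nfc === nfd);                  // false — different code points, same visible glyph
console.log(nfd.normalize('NFC') === nfc); // true — identical after normalizing to NFC

// Applied to the crawl: normalize before writing the file, e.g.
// fs.writeFileSync(`./results/chapter${chapter}.txt`, result.normalize('NFC'), 'utf8');
```

normalize() defaults to 'NFC', so result.normalize() works too; passing the form explicitly just makes the intent clear.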

(A self-answered question cannot be accepted within 2 days; feel free to elaborate or add references/comments on the issue as you see fit.)

source: stackoverflow.com