Python - Unicode Encoding Conflict

Update - I have tried including the full path in the crontab job, but the same issue happens again. I only have this issue with one particular article, which contains a Latin character ("Moët").

I am new to Python 3 and I need help with a "Unicode encoding conflict" issue.

I am creating a web scraper that takes online articles and saves them locally.

What I would like to do is:

  • use BeautifulSoup to get the article title
  • check whether the article title is in the list of articles saved locally
  • if the title matches, print "the file exists" and do nothing
  • if the title does not match, fetch the article content and generate a .txt file

The code is below:

import codecs
import os

from bs4 import BeautifulSoup

# Fetch the article page and build the target file name from its <title>
article_html = self.request(articles_URL)
soup = BeautifulSoup(article_html.text, 'html.parser')
title_modify = soup.title.string
title_real = title_modify + '.txt'
# List the files already saved in the current working directory
current_path = os.getcwd()
article_names = os.listdir(current_path)
if title_real in article_names:
    print(title_real, 'exists, no need to re-create')
else:
    # code for catching the article content omitted
    with codecs.open(title_real, "a", encoding='utf-8') as f:
        f.write(XXX)
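
A minimal sketch of the same existence check made tolerant of Unicode normalization differences. It assumes the mismatch comes from "ë" being stored as one code point in one place and as "e" plus a combining mark in the other. That is only an assumption, since the post does not confirm the cause. The helper names `normalize_name` and `title_exists` are hypothetical and are not part of the original script.

```python
import unicodedata


def normalize_name(name, form="NFC"):
    """Return name converted to a single Unicode normalization form."""
    return unicodedata.normalize(form, name)


def title_exists(title, existing_names):
    """Hypothetical helper: True if title equals any existing name after normalization."""
    wanted = normalize_name(title)
    return any(normalize_name(name) == wanted for name in existing_names)


# "ë" as one code point (U+00EB) vs. "e" followed by a combining diaeresis (U+0308)
composed = "XXX with Mo\u00ebt XXX.txt"
decomposed = "XXX with Moe\u0308t XXX.txt"

print(composed == decomposed)                # plain string comparison
print(title_exists(composed, [decomposed]))  # comparison after normalization
```

In the script above, `title_exists(title_real, article_names)` could stand in for `title_real in article_names` if normalization turns out to be the culprit.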

Then I use a scheduled CentOS 7 crontab job to run it automatically. It checks the same URL every day and tries to save each new article as a .txt file.

It was working fine. However, today I noticed that it does not work for an article title which contains a Latin character. Ideally, the program would print "the file exists" and move on to the next article. Instead, it created a few duplicate files:

Aug 26 09:50 XXX with Moët XXX.txt

Aug 27 09:29 XXX with Moët XXX (Unicode Encoding Conflict (1)).txt

Aug 26 20:30 XXX with Moët xxx (Unicode Encoding Conflict).txt

The strange thing is that it works fine when I run the Python script manually:

python test.py

XXX with Moët XXX.txt exists, no need to re-create

Much appreciated if anyone can help.

Cook

Answer

Cron most likely runs your job in a stripped-down environment, which may result in unexpected behavior. See this; it will most likely fix your issue.

Basically, you'll need to provide the full path to your Python executable (you can get it by running which python). Ergo, your crontab entry would look like the following:

20 4 * * * your_python_path your_program_path.py
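
To check whether the cron environment really differs from your interactive shell, a small diagnostic along these lines may help. It is only a sketch: the function name `describe_environment` is hypothetical, and the choice of settings to print is an assumption about what might matter here.

```python
import locale
import os
import sys


def describe_environment():
    """Collect settings that influence how Python encodes file names and text."""
    return {
        "executable": sys.executable,
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG", "<unset>"),
        "LC_ALL": os.environ.get("LC_ALL", "<unset>"),
        "PATH": os.environ.get("PATH", "<unset>"),
    }


if __name__ == "__main__":
    # Print one setting per line so two runs can be compared with diff
    for key, value in describe_environment().items():
        print(f"{key}={value}")
```

Run it once by hand and once from crontab with the output redirected to a file, then compare the two results. If the encodings or LANG differ, that points at the environment rather than at the script.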
source: stackoverflow.com