
What is the correct procedure to store a UTF-16 encoded RSS stream into sqlite3 using Python

- 1 answer

I have a Python WSGI script that attempts to extract an RSS item that is posted to it and store the RSS in a sqlite3 db. I am using flup as the WSGIServer.
To obtain the posted content: postData = environ["wsgi.input"].read(int(environ["CONTENT_LENGTH"]))

To attempt to store in the db:

from pysqlite2 import dbapi2 as sqlite
ldb = sqlite.connect("/var/vhost/mysite.com/db/rssharvested.db")
lcursor = ldb.cursor()
lcursor.execute("INSERT into rss(data) VALUES(?)", (postData,))

This results in only the first few characters of the RSS being stored in the record: ÿþ<. I believe the initial characters are the BOM of the RSS.

I have tried every permutation I could think of, including first encoding the RSS as UTF-8 and then attempting to store it, but the results were the same. I could not decode it because some characters could not be represented as unicode.

Running Python 2.5.2, SQLite 3.5.7.

Thanks in advance for any insight into this problem.


Here is a sample of the initial data contained in postData as modified by the repr function, written to a file and viewed with less:

'\xef\xbb\xbf

Thanks for all the replies! Very helpful.


The sample I submitted didn't make it through the Stack Overflow HTML filters. I will try again, converting the less-than and greater-than signs to entities (the preview indicates this works).

\xef\xbb\xbf<?xml version="1.0" encoding="utf-16"?><rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><channel><item d3p1:size="0" xsi:type="tFileItem" xmlns:d3p1="http://htinc.com/opensearch-ex/1.0/">


Answer

Before the SQL insertion you should convert the byte string to a unicode string. If that raises a UnicodeError exception, encode the string explicitly with string.encode("utf-8").
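As a rough illustration of decoding before the insert, here is a minimal sketch. It assumes a modern Python 3 with the standard-library sqlite3 module rather than the Python 2.5 / pysqlite2 setup in the question. The function name store_rss, the in-memory database, and the fallback of decoding as UTF-8 with replacement characters are all assumptions made for the example, not something stated in the question.

```python
import sqlite3


def store_rss(conn, raw_bytes, encoding="utf-16"):
    """Decode the posted bytes to text, then insert the text (hypothetical helper)."""
    try:
        # The "utf-16" codec reads the BOM, picks the byte order, and strips the BOM.
        text = raw_bytes.decode(encoding)
    except UnicodeError:
        # Assumed fallback: treat the payload as UTF-8 and replace undecodable bytes.
        text = raw_bytes.decode("utf-8", errors="replace")
    conn.execute("INSERT INTO rss(data) VALUES (?)", (text,))
    conn.commit()
    return text


# In-memory database standing in for the real rssharvested.db file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rss (data TEXT)")

original = '<?xml version="1.0" encoding="utf-16"?><rss></rss>'
sample = original.encode("utf-16")  # BOM followed by UTF-16 code units

stored = store_rss(conn, sample)
row = conn.execute("SELECT data FROM rss").fetchone()
```

Passing decoded text rather than raw bytes means SQLite receives a proper string, so the NUL bytes that UTF-16 puts between ASCII characters never reach the database as text.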

Or you can autodetect the encoding and decode the string according to the detected encoding scheme (see: Auto detect encoding).
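One simple form of autodetection is to look at the byte order mark. The sketch below is only one possible approach, written for Python 3; the helper name decode_by_bom and the UTF-8 default are assumptions for illustration. Note that the sample posted in the question begins with \xef\xbb\xbf, which is the UTF-8 BOM, even though the XML declaration says utf-16, so a BOM check would pick UTF-8 for that payload.

```python
import codecs

# BOM prefixes mapped to codecs that consume the BOM while decoding.
# UTF-32 must be checked before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_by_bom(raw, default="utf-8"):
    """Decode bytes using the encoding implied by a leading BOM (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding)
    # No BOM found: fall back to the assumed default encoding.
    return raw.decode(default)


utf8_text = decode_by_bom(codecs.BOM_UTF8 + b"<rss/>")
utf16_text = decode_by_bom("<rss/>".encode("utf-16"))
plain_text = decode_by_bom(b"<rss/>")
```

The decoded text can then be passed to the INSERT statement in place of the raw postData.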

source: stackoverflow.com