Ad

Grab Specific Text From XML

- 1 answer

Hello :) This is my first python program but it doesn't work.

What I want to do :

  • import a XML file and grab only Example.swf from
<page id="Example">
<info>
<title>page 1</title>
</info>
<vector_file>Example.swf</vector_file>
</page>
(the text inside <vector_file>) 
  • than download the associated file on a website (https://website.com/.../.../Example.swf)
  • than rename it 1.swf (or page 1.swf)

  • and loop until I reach the last file, at the end of the page (Exampleaa_idontknow.swf → 231.swf)

  • convert all the files in pdf

What i have done (but useless, because of AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'):

import re
import urllib.request
import requests
import time
import requests
import lxml
import lxml.html
import os
from xml.etree import ElementTree as ET

DIR="C:/Users/mypath.../"
for filename in os.listdir(DIR):
    if filename.endswith(".xml"):
        with open(file=DIR+".xml",mode='r',encoding='utf-8') as file:
            _tree = ET.fromstring(text=file.read())
            _all_metadata_tags = _tree.xpath('.//vector_file')
            for i in _all_metadata_tags:
                print(i.text + '\n')

    else:
        print("skipping for filename")
Ad

Answer

First of all, you need to make up your mind about what module you're going to use. lxml or xml? Import only one of them. lxml has more features, but it's an external dependency. xml is more basic, but it is built-in. Both modules share a lot of their API, so they are easy to confuse. Check that you're looking at the correct documentation.

For what you want to do, the built-in module is good enough. However, the .xpath() method is not supported there, the method you are looking for here is called .findall().

Then you need to remember to never parse XML files by opening them as plain text files, reading them into into string, and parsing that string. Not only is this wasteful, it's fundamentally the wrong thing to do. XML parsers have built-in automatic encoding detection. This mechanism makes sure you never have to worry about file encodings, but you have to use it, too.

It's not only better, but less code to write: Use ET.parse() and pass a filename.

import os
from xml.etree import ElementTree as ET

DIR = r'C:\Users\mypath'

for filename in os.listdir(DIR):
    if not filename.lower().endswith(".xml"):
        print("skipping for filename")
        continue

    fullname = os.path.join(DIR, filename)
    tree = ET.parse(fullname)

    for vector_file in tree.findall('.//vector_file'):
        print(vector_file.text + '\n')

If you only expect a single <vector_file> element per file, or if you only care for the first such element, use .find() instead of .findall():

vector_file = tree.find('.//vector_file')

if vector_file is None:
    print('Nothing found')
else:
    print(vector_file.text + '\n')
Ad
source: stackoverflow.com
Ad