scrape some info from XML file with Python

PsySc0rpi0n
Posts: 262
Joined: 2012-10-24 13:54
Location: Portugal

#1 Post by PsySc0rpi0n »

Hello.

I want to scrape some info from an XML file.
I'm trying to use the ElementTree module and XPath, but I haven't been able to find a way of keeping track of corresponding elements inside the XML file.

The tree inside the XML file is anything but logical.
I don't want to paste the XML file here because it belongs to my work and contains some private info. I'll try to explain how it is organized.

The element tags and text I need to extract are:

<Node>
....<name></name>
....<type></type>

....<address>
........<houseNrFull></houseNrFull>
........<streetName></streetName>
....</address>
....<geoPosition>
........<xPos></xPos>
........<yPos></yPos>

....</geoPosition>
.
.
.
.
</Node>

To give some context, this XML describes the path an optical fiber cable takes through a neighbourhood. For each piece of equipment along the path, further nested levels are added, possibly with different info.
I have attached an image with a beautified version of part of the file.
You can see there is a root element called network, and the first element under the root is Node. After that, all other elements are children of children of ... of the root. For each piece of equipment along the path of this infrastructure, there are nested nextNode and connectionOUT elements, which contain the tags I need.

The problem is that some tags I don't need have the same name as the ones I do need.
For example, one of the tags I need to extract is <name></name>, but only those that are followed by a <type></type> element. There are some <name></name> tags that are not followed by a <type></type> element, and I'm not interested in those.
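To illustrate the filtering described above, here is a minimal sketch. It assumes that "followed by" means the <name> and <type> elements are siblings under the same parent; the sample XML and its values are made up, since the real file can't be posted. ElementTree's limited XPath supports a `[tag]` predicate that keeps only elements with a direct child of that name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's layout is only assumed to look like this.
SAMPLE = """
<network>
  <Node>
    <name>DTP-001</name>
    <type>distribution</type>
    <connectionOUT>
      <name>cable-A</name>
      <nextNode>
        <name>BDF-002</name>
        <type>building</type>
      </nextNode>
    </connectionOUT>
  </Node>
</network>
"""

root = ET.fromstring(SAMPLE)

# ".//*[type]" matches every descendant that has a direct <type> child;
# "/name" then selects the <name> children of those elements only.
wanted_names = [n.text for n in root.findall(".//*[type]/name")]
print(wanted_names)
```

In this sample, the <name> inside connectionOUT is skipped because its parent has no <type> child. The predicate does not check element order, only that both tags share a parent.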

The other problem is that I still haven't found a way of associating the <xPos>, <yPos>, <streetName> and <houseNrFull> tags with the <name> tags that are followed by a <type> tag.
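One way to keep the fields tied to the right <name> might be to walk the tree element by element and read everything relative to the element that owns the <name>/<type> pair, instead of collecting each tag into its own list. The sketch below assumes <address> and <geoPosition> are direct children of that same element, as in the outline above; `extract_records` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second node deliberately has no <address>.
SAMPLE = """
<network>
  <Node>
    <name>DTP-001</name>
    <type>distribution</type>
    <address>
      <houseNrFull>10</houseNrFull>
      <streetName>Rua A</streetName>
    </address>
    <geoPosition>
      <xPos>1.0</xPos>
      <yPos>2.0</yPos>
    </geoPosition>
    <connectionOUT>
      <name>cable-A</name>
      <nextNode>
        <name>BDF-002</name>
        <type>building</type>
        <geoPosition>
          <xPos>3.0</xPos>
          <yPos>4.0</yPos>
        </geoPosition>
      </nextNode>
    </connectionOUT>
  </Node>
</network>
"""


def extract_records(root):
    """Return one dict per element that has both a <name> and a <type> child."""
    records = []
    for elem in root.iter():  # every element, in document order
        name = elem.find("name")
        if name is None or elem.find("type") is None:
            continue  # not one of the name/type pairs of interest
        records.append({
            "name": name.text,
            # Child-relative paths, so values from nested nodes don't leak in.
            "xPos": elem.findtext("geoPosition/xPos", default=""),
            "yPos": elem.findtext("geoPosition/yPos", default=""),
            "Street": elem.findtext("address/streetName", default=""),
            "Nr": elem.findtext("address/houseNrFull", default=""),
        })
    return records


records = extract_records(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

Using paths like "address/streetName" rather than ".//streetName" matters here: a descendant search from the outer Node would also find the values belonging to the nested nextNode elements. Missing fields come back as empty strings, so every record has the same columns.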

I came up with some code, but while testing it I noticed the problem I just described: I ended up with more <name> tags than any of the other tags, and I need exactly one of each of the other tags per <name> tag that is followed by a <type> tag.

Code:

import xml.etree.ElementTree as ET
import csv
import re

Tree = ET.parse('ftth-info-1.xml')
root = Tree.getroot()

nameList = []
xPosList = []
yPosList = []
streetList = []
houseNrList = []
completeDataList = []
i,j,k,l,m = 0,0,0,0,0

for name in root.findall("./Node//name"):
  nameList.append(name.text)
  i += 1

for xpos in root.findall("./Node//xPos"):
  xPosList.append(xpos.text)
  j += 1

for ypos in root.findall("./Node//yPos"):
  yPosList.append(ypos.text)
  k += 1

for streetname in root.findall("./Node//streetName"):
  streetList.append(streetname.text)
  l += 1

for housenr in root.findall("./Node//houseNrFull"):
  houseNrList.append(housenr.text)
  m += 1

print("i: {}, j: {}, k: {}, l: {}, m: {}".format(i, j, k, l, m))
# Flag the case where any of the tags was not found at all
if not all((i, j, k, l, m)):
  print("Err")

pattern = re.compile("^DTP.*$|^BDF.*$")
for item in range(0, 64):
  if (pattern.match(nameList[item])):
    completeDataList.append(
                            [
                              nameList[item],
                              xPosList[item],
                              yPosList[item],
                              streetList[item],
                              houseNrList[item]
                            ]
                          )

with open("data.csv", "w", newline="") as csv_file:
  cols = ["name","xPos","yPos", "Street", "Nr"] 
  writer = csv.writer(csv_file, dialect='excel', delimiter=';')
  writer.writerow(cols)
  writer.writerows(completeDataList)
So, how can I possibly do this?
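For the final filtering and CSV step, a sketch that works on one record per element, rather than on parallel lists indexed up to a fixed 64, could look like this. The `records` list is hypothetical stand-in data, and the output goes to an in-memory buffer instead of data.csv so the example has no side effects.

```python
import csv
import io
import re

# Hypothetical per-element records: one dict per <name>/<type> pair.
records = [
    {"name": "DTP-001", "xPos": "1.0", "yPos": "2.0", "Street": "Rua A", "Nr": "10"},
    {"name": "SPL-007", "xPos": "3.0", "yPos": "4.0", "Street": "Rua B", "Nr": "12"},
    {"name": "BDF-002", "xPos": "5.0", "yPos": "6.0", "Street": "Rua C", "Nr": "14"},
]

# Same intent as the original pattern: names starting with DTP or BDF.
pattern = re.compile(r"^(DTP|BDF)")
selected = [r for r in records if r["name"] and pattern.match(r["name"])]

cols = ["name", "xPos", "yPos", "Street", "Nr"]
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=cols, delimiter=";")
writer.writeheader()
writer.writerows(selected)

csv_text = buffer.getvalue()
print(csv_text)
```

Because each row is built from a single record, a name can no longer be paired with a position or address from a different element, and the loop does not depend on knowing the number of entries in advance.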

PS:
I couldn't attach the file: for a 55 kB image, the forum says the max quota was reached.