Select Specific Fields From DB Export That Includes Json

- 1 answer

I have an export from a DB table that has the following columns:


The external_atribute is on Json format. So the export looks like this:


Which is the most efficient way (since the export has more than 1m lines) to keep only the name and the values from label2, label5 and label6. For example from the above export I would like to keep only:


Edit: I am not sure for the sequence of the fields/variables on the JSON part. Data could be also for example:


Also the fact that some of the values are double quoted, while some are not, is intentional (This is how they appear also on the export).

My understanding until now is that I have to use something that has a JSON parser like Python or jq.

This is what i created on Python and seems that is working as expected:

from __future__ import print_function
import sys,json
with open(sys.argv[1], 'r') as file:
for line in file:
            fields = line.split('|')
               print (fields[0], json.loads(fields[3])['label2'], json.loads(fields[3])['label5'], json.loads(fields[3])['label6'], sep='|')



Since I am looking for the most efficient way to do this, any comment is more than welcome.



Even if the data are easy to parse, I advise to use a json parser like jq to extract your json data:

<file jq -rR '
   split("|")|[ .[0], (.[3]|fromjson|(.label2,.label5,.label6)|tostring)]|join("|")'

Both options -R and -r allows jq to accept and display a string as input and output (instead of json data).

The split function enable getting all fields into an array that can be indexed with number .[0] and .[3].

The third field is then parsed as json data with the function fromjson such that the wanted labels are extracted.

All wanted fields are put into an array and join together with the | delimiter.