How to check sql file using incremental keyword searching in python?

Category: python regex (2 Views)

I am trying to parse sql scripts where checking sequence like pipeline logic using re. So, I tried to search key words and print type and value. Essentially, what I want to check correct sequence such as:

source->output1 (as new input 1 ) -> output2 (as new input 2) -> ..-> target

use case

this is input file that I am trying to parse. currently, with the help of previous post, I was able to do this but this is not checking pattern correctly.

import re, os, sys

pattern = re.compile(r"^(CREATE(?: OR REPLACE)?)|(?:OUTPUT TO)(FROM|INPUT FROM|INSERT INTO) (TARGET) (\S+) (\S+).*$")

with open('input_sql.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        line=line.strip()
        line=line.split(';')
        line=','.join(s for s in line if len(s) > 0).split(',')
        if m := pattern.match(line):
            action, action_type, value = m.groups()
            print(f"{action=}, {action_type=}, {value=}")

so I need to search keywords for each line check its type and value. So I composed patterns that I am expecting to see with re but some of them is not parsed correctly. Can anyone to point me out maybe better way of doing this?

I think I need to handle input file by checking startswith CREATE|CREATE OR REPLACE|FROM|INSERT INTO AND check end with ';' instead of searching line by line; by doing so, we can search each sub SQL logic check the sequence pattern as I wanted.

objective

I want to check those sequences in pipeline and to print out those at each steps:

input_type: SOURCE          input_value: comp_src_vat4_sac_rec_cln1_tst_noi
                            output_value: stream_src_vat4_sac_rec_cln1_tst_noi

                            input_value:vat4_sac_rec_cln1_tst_noi_text_clean_stream         
output_type: target         output_value:comp_tgt_vat4_sac_rec_cln1_tst_noi


output_value:               CQ_STREAM_vat4_sac_rec_cln1_tst_noi
input_value:                stream_src_vat4_sac_rec_cln1_tst_noi

output_value:               vat4_sac_rec_cln1_tst_noi_text_clean_stream
input_value:                CQ_STREAM_vat4_sac_rec_cln1_tst_noi

so this is pipeline or flow that that I am checking, which start from SOURCE, end with TARGET; for example, this is the flow: (input:source, output: string1) -> (input:string1, output:string2)->(input:string2, output:string3)->(input:string3, output:TARGET).

seems the patterns that I composed not handle correctly such pipeline style parsing. Can anyone correct me if possible?

🔴 No definitive solution yet