Python for loop with if statements, writing the output to a file

Category: python

This is my first time writing a Python script. I have read through many questions and answers on Stack Overflow, but I still can't figure out where my code goes wrong. Could I ask for help?

I have a file, file.txt, shown below, with a varying number of columns on each line. "\t" stands for the tab delimiter.

$ head file.txt

AA:d23\tBB:4r3w\tCC:e5t
BB:435\tCC:w4w
AA:w4r\tCC:2342
AA:34534\tBB:e5\tCC:7uf
BB:e4t4

I would like to turn it into a data-frame-like .txt file with three columns in every row, filling in NA for any missing column. I would also like to drop the rows that have only one entry (e.g. only AA, only BB, or only CC). The expected output is:

AA:d23\tBB:4r3w\tCC:e5t
AA:NA\tBB:435\tCC:w4w
AA:w4r\tBB:NA\tCC:2342
AA:34534\tBB:e5\tCC:7uf
#(the 5th line is omitted here because it only has one entry)

After studying many examples on the forum, I adapted some code and wrote my own, as below:

#!/usr/bin/env python3

#into_data.py
import csv

def into_data(a_file):
  output=[]
  for row in a_file:
     if "AA:" in row and "BB:" in row and "CC:" in row:
        output.append(row)
     elif "AA:" not in row and "BB:" in row and "CC:" in row:
        output.append("AA:NA" + row)
     elif "AA:" in row and "BB:" not in row and "CC:" in row:
        output.append(row.split("\t")[0] + "CB:NA" + row.split("\t")[1])
     elif "AA:" in row and "BB:" in row and "CC:" not in row:
        output.append(row + "CC:NA")


reader = csv.reader(fileinput.input(), delimiter="\t")
print(into_data(reader))

#outside python script

python3 into_data.py file.txt > output.txt

But I get "None" in my output.txt, and I don't really understand why. Could you please point out my error? Thanks a lot in advance!


📌 Solution 1

Your function doesn't have a return statement. Therefore it defaults to returning None.
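As a minimal sketch of that behavior (the two function names here are made up for illustration and are not from the question), compare a function that builds a list without returning it to one that does:

```python
def without_return(rows):
    output = []
    for row in rows:
        output.append(row)
    # No return statement: the call evaluates to None.


def with_return(rows):
    output = []
    for row in rows:
        output.append(row)
    return output


print(without_return(["a", "b"]))  # None
print(with_return(["a", "b"]))     # ['a', 'b']
```

The first print shows exactly the "None" that ended up in output.txt.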

You need to return the output at the end of the function like this:

def into_data(a_file):
  output=[]
  for row in a_file:
     if "AA:" in row and "BB:" in row and "CC:" in row:
        output.append(row)
     elif "AA:" not in row and "BB:" in row and "CC:" in row:
        output.append("AA:NA" + row)
     elif "AA:" in row and "BB:" not in row and "CC:" in row:
        output.append(row.split("\t")[0] + "CB:NA" + row.split("\t")[1])
     elif "AA:" in row and "BB:" in row and "CC:" not in row:
        output.append(row + "CC:NA")
  return output  # this line was missing
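Note that even with the return in place, this version would most likely print an empty list. csv.reader yields each line as a list of fields, so a test like "AA:" in row checks whether some field is exactly "AA:", not whether a field contains it. A small sketch of the difference, using an in-memory sample line instead of the real file:

```python
import csv
import io

# csv.reader yields each line as a list of fields, not as a string.
sample = io.StringIO("AA:d23\tBB:4r3w\tCC:e5t\n")
row = next(csv.reader(sample, delimiter="\t"))

print(row)               # ['AA:d23', 'BB:4r3w', 'CC:e5t']
print("AA:" in row)      # False: no field is exactly "AA:"
print("AA:" in row[0])   # True: substring test on a single field
print(any(field.startswith("AA:") for field in row))  # True
```

Since none of the four conditions can match a list like this, nothing is ever appended to output.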

To edit the rows as you specify, you can try this:

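# a_file is expected to yield lists of fields (e.g. a csv.reader)
def into_data(a_file):
    output = []
    for row in a_file:
        # skip empty rows and rows with a single entry
        if len(row) <= 1:
            continue
        # fill in whichever column is missing, left to right
        if "AA:" not in row[0]:
            row.insert(0, "AA:NA")
        if "BB:" not in row[1]:
            row.insert(1, "BB:NA")
        if len(row) < 3:
            row.insert(2, "CC:NA")
        output.append(row)
    return output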
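As a quick check, the function above can be run against the sample lines from the question. This is only a sketch: the rows are hard-coded as lists (the shape csv.reader would produce), and the function is repeated so the snippet runs on its own.

```python
def into_data(rows):
    # Same logic as the function above, repeated so this snippet is standalone.
    output = []
    for row in rows:
        if len(row) <= 1:
            continue
        if "AA:" not in row[0]:
            row.insert(0, "AA:NA")
        if "BB:" not in row[1]:
            row.insert(1, "BB:NA")
        if len(row) < 3:
            row.insert(2, "CC:NA")
        output.append(row)
    return output


# The five sample lines from file.txt, already split on tabs.
sample_rows = [
    ["AA:d23", "BB:4r3w", "CC:e5t"],
    ["BB:435", "CC:w4w"],
    ["AA:w4r", "CC:2342"],
    ["AA:34534", "BB:e5", "CC:7uf"],
    ["BB:e4t4"],
]

result = into_data(sample_rows)
for row in result:
    print("\t".join(row))
```

This prints the four expected lines and drops the single-entry one. Keep in mind that the rows are modified in place, so sample_rows itself is changed after the call.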

📌 Solution 2

Here's a pretty straightforward way that should handle all cases.

# a_file is a csv.reader
def into_data(a_file):
    output = []
    for row in a_file:
        if len(row) <= 1: continue
        if not row[0].startswith("AA"):
            row.insert(0, "AA:NA")
        if not row[1].startswith("BB"):
            row.insert(1, "BB:NA")
        if len(row) < 3:
            row.append("CC:NA")
        output.append(row)
    return output

This creates a list of lists. If you want a list of strings instead, change the line output.append(row) to:

output.append('\t'.join(row))
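To get from there to an actual output file, one option is to let csv.writer do the joining. The sketch below is only one possible arrangement: convert is a hypothetical helper name, and in-memory io.StringIO objects stand in for file.txt and output.txt so the example runs anywhere.

```python
import csv
import io


def into_data(rows):
    # Same logic as the function above, repeated so this snippet is standalone.
    output = []
    for row in rows:
        if len(row) <= 1:
            continue
        if not row[0].startswith("AA"):
            row.insert(0, "AA:NA")
        if not row[1].startswith("BB"):
            row.insert(1, "BB:NA")
        if len(row) < 3:
            row.append("CC:NA")
        output.append(row)
    return output


def convert(in_file, out_file):
    # Hypothetical helper: read tab-delimited rows, pad them, write them back out.
    reader = csv.reader(in_file, delimiter="\t")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerows(into_data(reader))


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "AA:d23\tBB:4r3w\tCC:e5t\n"
    "BB:435\tCC:w4w\n"
    "AA:w4r\tCC:2342\n"
    "AA:34534\tBB:e5\tCC:7uf\n"
    "BB:e4t4\n"
)
target = io.StringIO()
convert(source, target)
print(target.getvalue(), end="")
```

With real files, the same helper could be called inside with open("file.txt", newline="") as src, open("output.txt", "w", newline="") as dst: followed by convert(src, dst), which avoids printing the Python list representation into the output.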

📌 Solution 3

Since you mention data frames, here is a suggestion using pandas:

# Code :
import pandas as pd
import numpy as np

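# The file in the question is tab-delimited, so split on a real tab.
# (sep=r'\\t' would only match a literal backslash followed by "t".)
df = pd.read_csv('test.txt', header=None, sep='\t', engine='python')

# True for rows that contain exactly one entry
m = df.notnull().sum(axis=1).eq(1)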

df = df.loc[~m]

out = (df.stack().reset_index(name='val')
   .assign(col=lambda x: x['val'].str.slice(0,2))
   .pivot(index='level_0', columns='col', values='val')
   .reset_index(drop=True)
)

out[:] = np.where(out.isna(), [out.columns + ':NA'], out)

out.columns = df.columns

out.to_csv('final.txt', sep='\t', header=None, index=False)
# Output :
print(out)

          0        1        2
0    AA:d23  BB:4r3w   CC:e5t
1     AA:NA   BB:435   CC:w4w
2    AA:w4r    BB:NA  CC:2342
3  AA:34534    BB:e5   CC:7uf