Remove specific lines from a large text file in Python
I have several large text files that all have the same structure. I want to delete the first 3 lines and then remove illegal characters from the 4th line. I don't want to read the entire dataset into memory and then modify it, as each file is over 100MB with over 4 million records.
Range 150.0dB -64.9dBm
Mobile unit 1 Base -17.19968 145.40369 999.8
Fixed unit 2 Mobile -17.20180 145.29514 533.0
Latitude Longitude Rx(dB) Best unit
-17.06694 145.23158 -050.5 2
-17.06695 145.23297 -044.1 2
So lines 1, 2 and 3 should be deleted, and in line 4 "Rx(dB)" should be just "Rx" and "Best Unit" should be changed to "Best_Unit". Then I can use my other scripts to geocode the data.
I can't use command-line programs like grep (as in this question) because the first 3 lines are not the same in every file: the numbers (such as 150.0dB, -64*) change from file to file. So you have to delete the whole of lines 1-3, and then grep or similar can do the search-and-replace on line 4.
Thanks guys,
=== EDIT: new Pythonic way to handle larger files, from @heltonbiker. It raises the error shown below the code.
import os, re
##infile = arcpy.GetParameter(0)
##chunk_size = arcpy.GetParameter(1) # number of records in each dataset
infile='trc_emerald.txt'
fc= open(infile)
Name = infile[:infile.rfind('.')]
outfile = Name+'_db.txt'
line4 = fc.readlines(100)[3]
line4 = re.sub('([^)].*?)', '', line4)
line4 = re.sub('Best(s.*?)', 'Best_', line4)
newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
fc.close()
newfile = open(outfile, 'w')
newfile.write(newfilestring)
newfile.close()
del lines
del outfile
del Name
#return chunk_size, fl
#arcpy.SetParameterAsText(2, fl)
print "Completed"
Traceback (most recent call last):
  File "P:2012Job_044_DM_Radio_PropogationWorkingFinalPropogationTRC_Emeraldworkingclean_file_1c.py", line 13, in <module>
    newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
TypeError: 'builtin_function_or_method' object is unsubscriptable
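The TypeError is raised because fc.readlines[4:] subscripts the method object itself; it would have to be called first, as fc.readlines()[4:]. Even then, readlines() loads every remaining line into memory, which is what the question wanted to avoid. A minimal sketch of a streaming alternative follows, assuming Python 3, that the header is always line 4, and that it carries the labels shown in the sample. The names fix_header and clean_file are hypothetical and not part of the original script.

```python
import re
import shutil


def fix_header(line):
    """Turn 'Rx(dB)' into 'Rx' and 'Best unit' into 'Best_Unit'."""
    line = line.replace("(dB)", "")
    # Assumption: match case-insensitively, because the sample data shows
    # 'Best unit' while the question text says 'Best Unit'.
    return re.sub(r"Best\s+unit", "Best_Unit", line, flags=re.IGNORECASE)


def clean_file(src_path, dst_path, skip=3):
    """Copy src to dst without its first `skip` lines, fixing the header line."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        # Read and discard the leading lines one at a time.
        for _ in range(skip):
            src.readline()
        # The next line is the header (line 4 when skip is 3).
        dst.write(fix_header(src.readline()))
        # Copy the remaining records in chunks, never holding the whole file.
        shutil.copyfileobj(src, dst)
```

With the file names from the question, this would be called as clean_file('trc_emerald.txt', 'trc_emerald_db.txt'). Only one chunk of the input is held in memory at any time, so the file size does not matter.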
As wim said in the comments, sed is the right tool for this. The following command should do what you want:
sed -i -e '4 s/(dB)//' -e '4 s/Best Unit/Best_Unit/' -e '1,3 d' yourfile.whatever
To explain the command a little:
-i : executes the command in place, that is, it writes the output back into the input file
-e : adds the expression that follows to the list of commands to run
'4 s/(dB)//' : on line 4, substitutes '' for '(dB)'
'4 s/Best Unit/Best_Unit/' : same as above, except with different find and replace strings
'1,3 d' : deletes lines 1 through 3 (inclusive)
sed is a really powerful tool that can do much more than just this, and it is well worth learning.
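Where sed is not available (the P: path in the traceback suggests a Windows machine), the three sed expressions can be mirrored with the standard-library fileinput module, whose inplace mode plays the role of -i. This is only a sketch under assumptions: Python 3, a hypothetical function name sed_like_cleanup, and a case-insensitive match on the header label, which differs from the case-sensitive sed pattern because the sample data shows 'Best unit'.

```python
import fileinput
import re


def sed_like_cleanup(path):
    """Rewrite `path` in place: drop lines 1-3 and fix the header on line 4."""
    # With inplace=True, anything printed inside the loop is written back to the file.
    with fileinput.input(files=[path], inplace=True) as stream:
        for line in stream:
            lineno = stream.lineno()
            if lineno <= 3:
                continue  # sed: '1,3 d'
            if lineno == 4:
                line = line.replace("(dB)", "")  # sed: '4 s/(dB)//'
                # sed: '4 s/Best Unit/Best_Unit/' (case-insensitive here by assumption)
                line = re.sub(r"Best Unit", "Best_Unit", line, flags=re.IGNORECASE)
            print(line, end="")
```

The file is processed line by line, so memory use stays flat for the 100MB inputs. fileinput writes to a temporary copy behind the scenes, so enough free disk space for a second copy of the file is needed while it runs.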