Last edited by Dawnstar on 2019-10-19 09:50
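I'd like to ask the experts here for help. I want to extract the names of the files in a given folder into a table, and then put content extracted from each file under its file name. I'm working with a batch of species protein files in txt format: each file's name should go in the first row of the table, with the protein names extracted from that file listed below it.
The files in the folder look like this: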
>YP_006908651.1 putative inhibitor of apoptosis protein [Abalone herpesvirus Victoria/AUS/2009]
MKIDKPTSIVESVLEVTQRITPKERPSIGIDVFSSMYVQAVSMSAVNAWHSAGPDEGGYLVLRDKDSPDRKLAEVELPYAKEEDRLESFSPYWNFEPSSEELAKAGFYYTGKSDRVKCFSCALEISEWGGEEGESPMEIHQRETKKEHGLMYDCAFLSVACKTVPDAVDSTTDNGTYAGRLESYGRFEWPKQSHMKPEELAAAGLYYTGRGDRVACHFCGQILRTWERGDVAMIEHARWAENRNCPYLKYTAGNAVRGISRLENPIYN
>YP_006908652.1 hypothetical protein AbHV_ORF2_1 [Abalone herpesvirus Victoria/AUS/2009]
MCEVRKGKGKVRKTTRGQGEIAGAFTTKSSNDSPKGINEQESLARDHSCLDEEARNTMKRSHDQTDMSSSDSRVKIPHLADHDYCLPPEEERQLDVIYMEDDCQAHEEVHYDENVSAEDLHSEIERLRAERALTLSENETLRAELERLRADNEKKLNDETAALPELEDKKNGFANSFIANIHSELGLEMMNTPAGKVYKEMCLAVHDENVIAEDEGREPRSVMIPLTVDVMKELEFDFKDNKATNQQRDRKLKNIQCDDVMKYQPMKNGSKDAANSTIEKFIQTNNIFSMREMIHDKNTNNIPLLTIDQFMRYIITGITQQNHDIRKDVSRIMAKIHTASDRQQYKVFKNLYSDWLRQEEQIKTLSLANALLEKEKENERLRADNQTLRADAQEKRAVKAEFNEVMFHSPTDRRVYVNKVFPNRNKKVGANLIVRSLMTPRDNEKEMNKLVGPKKWSPLFCSQPLPEATSVRDVLVKKGLQNDKLVGINRAIAYNAWTGEQKRWMSEMKQPQHQLPGGKSCGENWFRYHPVDTHDVEKIKKSFSHYARQECEKIKRDIFT
>YP_006908653.1 putative eukaryotic translation initiation factor [Abalone herpesvirus Victoria/AUS/2009]
MEINVHPHVNTTAYRHKMPEIETRIEGGSLTKLVNVRAVAKALNRKKEHIVKFLGCVLGVGVCAQNNLIGGSVSRMKVQELIDVYIANFVLCDRCTSPETVLLSGSGNPVLRCGACGYIGRVLGEEEMIKDMVDDEW
>YP_006908654.1 hypothetical protein AbHV_ORF4 [Abalone herpesvirus Victoria/AUS/2009]
MASTSSSNVNLKRSLESEGGERGGVKKMKVARMDCRPGKVMRFTCKPVEKRDQAVRELSEKLGSAARIIAGNDDGVVVVKFEERFQKARDFLDGFDWMLEDEAELNRWFEIDSNEKLYDDDVMKLSDDIEPIKIDEDVHRLLNYLYSVKGMQTEFKGKSKTYFLFTLNNLIKLKLIESKGENAERTLKRIGDIDPNNFKLYKYGELFGLRHLPNMPHTFIFNRLDYKPMGEIAVAVRSGHICKALSNFKTETARKLGDVMAMVFEKVITKVAEDHVKEVGMETFVETVVRPTLEGALPEAIDSRLELLDAESKIMELKAETVHERMLRLKAEAEAVKFDGLRIQAEEASRESKLLAIESQINEIKAQEETRVALFDKRQAELGEERIRLEKEKEANKFKLERIIERPNQDLTHQGDRRRREGVVWIQRTHLSSAEEKRKKPVYKQNGEVSTEPLHYTLYRSEHEEKSETLRQTRDLVKKYVKKSSYTMERLTPSLVLYGGNNVEETKKLFKQTFKARGGRLLMTVAEEAEFDQKIREKIVPFVKQSYALGLHNGTLAFCNEAILKSLPVEDEAFCMEIIGEM
>YP_006908655.1 hypothetical protein AbHV_ORF5 [Abalone herpesvirus Victoria/AUS/2009]
MDEFDFSDWFEDVGSYVEPMDTDSFLEEVCGLSLPTPVQPPPPPPPTQDLSFGDSFLEEVCGTSVTQPKDDEVNKMIQELSSIFEDEPPSPPKEPPPRYFSYDDKVSLLDNVKAASKFHRYSRNGHYFVSHRHRLAHDPWSEHIASQLYTRCMQRGNNHHVVSEVIFDKEKRRWEHGRVKNDDEDLVNDEGDVIEDPITRAEKHPMRFDEENDTSDEVSAPVMTFDKIKVMTAPYSDAFMAKLNTANSLDGCETHKIKRIRTATGEMREVMKVNPSFPVHKTKFGPVPATRVAYKRADYTANYNAYALEKPNNCNINGTRGTHEFIIVNRINEAHCQLNNPHGKMYTILDMDCENNNSITIDSGLNGTIAMSYKSKICRKQNEKGQKRKREDLDDDDDLGGGNEDGECRKEFVLVQKAEYGTHKTTIMTKKKRFNLKKGLVGDVTKETKNEFMMAKLKAQREEVMNACCGAREIQVHAHCFNCSALGGYEIIDPVQNLACKIFEDEKILVSIAANNIKGMSMVMSNSLSSSIVNNFFDLLCWKFKVHRGTEEEQDRFYKRVLKPIAQKMDISRYNYAQAGIFCHSILKTSPMFSQHHLNLVSQKFNNPRATEVFKKACVVDDVATRMNDKWNSKRPLATVNVKSKPTPSRVKYAATMSRGPSVPDLWYNASEKVRGFIESYFGPLDESER
>YP_006908656.1 hypothetical protein AbHV_ORF6 [Abalone herpesvirus Victoria/AUS/2009]
MKRSPRSGNLNLFFNQSPKNEFEENIDEKVIVETNVEEVVETNVSEVVETNVSEEVVETNVSEKLEDQKVAAAVDESLDVKPLMIVPEPPKLSLEERLKIFKERPIMKLSHFLRVNKGVDYNRHRHCYWAVRTCVICTAPCDIKEYAVTLCCDLPVCHSCSFISGYAYARKSKPTSNCCRICTNGGMNISLNYIDFILSEHSHQKTFTSQFLEGVGAADNMQLHKMVTGLEAIHNNYEGVGDDDVFLPNTVPIKPPRKTITL
>YP_006908657.1 hypothetical protein AbHV_ORF7 [Abalone herpesvirus Victoria/AUS/2009]
MSNQFDPRLKAGQTFSVVSAVNLMLTEWHSIKQIIDLIASSHLTKSLFSSMSQLQNCLEELLNVKSKLAEDYAGFIVPFTNQDNTVISVCLRYSYKVCLTSVDRLISALQLVDGLNERSSVNFTRDMLVIKRSGLLASDIPHRIYTDLCMEKHLARLSPTNDRPSAKVLNHVNPMSEFYEVPSLEVVYGSNPTLTKLIDDCVLLASYPIISPNVGGELITTHGFDTYVPLKLNESGPDSRSQVETPFDSRVCVICKLDNLAIGYTDVIEQHKRLICKCVFIREAPSTLYMDVATHTQATMRVRERLIRDRESMTETEKRLFYNSFDKYASAIASTKIHNK
>YP_006908658.1 hypothetical protein AbHV_ORF8 [Abalone herpesvirus Victoria/AUS/2009]
MMKIDFDEAESRANIDSEISRQCVQRIKSQMKRERRTKPTPREQRILNRELRNYDLATDRANSYATKNDMRHAMNRRSEQTSEQYHFHKECAQSFRLQRLKLILLRKEKKAVKQIDKLLHEEDKGDDDYDFNSRIEEYSSQCTSALGDRDIMTSKSSKDYIRQHELMLDEEEEEEDIPFMPAPPSSSLTSSSRAPSPLSETLLQKFTDFSLCRTNSTPASKQGKHSQS
>YP_006908659.1 hypothetical protein AbHV_ORF9 [Abalone herpesvirus Victoria/AUS/2009]
MDDKSSGMQAAALATEDDDDGVSSLVDMINSGALSLEVERQMKRDGERNLLFYQEVMSATFDRDLLRSKIFPEDASKGPCMRMKQSLYKEASKPIVGEYAFESDPVFSSASTAHLRLVEKSKILSKDIPEIVLDYNQDPSVISVDESNMIVTDVHHPQENQTRIFFQLDNGYVLSIPTVNPWSERIYRYTITSVPQDKELLVYDNLEDVFKNYPNYAEKLFTNAPDDDDDDEEGGPEQYLKLDRSFLGVSGVSDGSDLEVVARYHIPMELNPLQYVMHKPNLYLFTGKDQEGNNNCLYALNTELVEGEGKPWIAPWTKVKIYDSETYVDLTDPNTLTVFEDEVPSYRLTIEVSEGSNIGVLSVPDYPTYDIPSLTLIEEVNDIVLEAAEEEDDED
>YP_006908660.1 hypothetical protein AbHV_ORF10 [Abalone herpesvirus Victoria/AUS/2009]
MSTNEETPSSSSSVARAAACMPDDLSSIDKDSLIKLGVETSSKILEACSIKYAPLSFGEILNGNNPERRKMFSLEGKRGQRSLKHITQHTLKDNDEIIESLNEICTHFINQQIISEHPELEDSVRELNFKNVLEPGRVIKIEDHRMGSYQIKDPCFNLYPSRVCFLEPYKRACGGSMSEVYKELKSLLERECNMGAGCLYRRMFPQCKRAGRVALPESVEYVFTKLLPACSNEGGATGLVHTLHAQFFSFAEECGKKKVQSVYKVRQQCIVCTMLKLSKETNKNAGLDRPHMDLYNEEDENIFLTKTEDLKDFGVIEVSRPDYNDMVVNVHGQKKVMVVDRSKLFSNIDMCEEDGFFFPVIMTEGRDVVPTNLPTSRKRRGGGGGGKKGAKRMLLDEI
>YP_006908661.1 hypothetical protein AbHV_ORF11 [Abalone herpesvirus Victoria/AUS/2009]
MYKFLRVYEILLLGTIVSAVDFALYKNDMKTGRMTVYCQKQKQAGENRLFYDTDVYEDDTLLYDGWSTSKKGNFGVNTICGDRDYQLPAGQMVSNSFNVRPQERNTICLSAEAQCENYQKFTFVGRSSFFKCETPECEARNLLGDKYVSPFISLDGVPNNTLVEFKENVIMAPVDYSMMDRWNGIGTKNAMQHFMSQIYYGIKFYNKVKVVTFSSPGTLPTKGEKFLLVYRQTEDCEPPNGGFSKVGEPCDGPFTFYYVQRSPDYPTDFEITERHLCGRTVTFTLFFDIRAVDNRSKFYFLPHEFKNENFDFNKNISYNSTTKCGLTGVKKASIVKRELMNHFASFCPERQKDAGG
For example, say there are three files: Abaca bunchy top viru, Abalone herpesvirus Victoria_AUS_200, and Acanthochromis_polyacanthus.ASM210954v1.pep.al. The result I want is:
| Abaca bunchy top viru | Abalone herpesvirus Victoria_AUS_200 | Acanthochromis_polyacanthus.ASM210954v1.pep.al |
| >YP_001661660.1 putative replicase protein | >YP_006908768.1 hypothetical protein AbHV_ORF59_2 | >ENSAPOP00000034950.1 |
| >YP_001661660.1 putative replicase protein | >YP_006908768.1 hypothetical protein AbHV_ORF59_2 | >ENSAPOP00000034950.1 |
| >YP_001661660.1 putative replicase protein | >YP_006908768.1 hypothetical protein AbHV_ORF59_2 | >ENSAPOP00000034950.1 |
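The names in the desired table look like FASTA header lines with the trailing bracketed organism removed. Here is a minimal sketch of that trimming step, assuming the organism is always the last `[...]` group on the line. The helper name `protein_name` is hypothetical and does not come from the thread.

```python
def protein_name(header_line):
    """Return a FASTA header line without its trailing '[organism]' part."""
    name = header_line.strip()
    # Only cut when the line really ends with a bracketed group
    if name.endswith("]") and " [" in name:
        name = name[: name.rfind(" [")]
    return name


example = ">YP_006908651.1 putative inhibitor of apoptosis protein [Abalone herpesvirus Victoria/AUS/2009]"
print(protein_name(example))
print(protein_name(">ENSAPOP00000034950.1"))
```

Headers without a bracketed suffix, such as the `>ENSAPOP...` ones, pass through unchanged.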
The code I wrote is:
    import os
    # `path` (the folder, ending with a path separator) is assumed to be defined earlier
    files = os.listdir(path)  # list every file name in the folder
    text = 'Protein name'
    file1 = open(path + text, 'w')
    print(file=file1, end="\t")
    file_name = []
    for file in files:
        if '.fa' in os.path.splitext(file)[1]:  # pick the files whose extension contains '.fa'
            file_name.append(file[:-3])  # strip the .fa suffix to get the file name
    for each in file_name:
        print(each, file=file1, end="\t")
    print(file=file1)
    for file in files:
        if '.fa' in os.path.splitext(file)[1]:  # pick the files whose extension contains '.fa'
            fa_path = path + file
            content = open(fa_path, 'r')  # read the contents of the file
            j = {}  # dictionary keyed by file name
            for s in file_name:
                j[s] = 0
            for seq in content:
                if '>' in seq:
                    for species in file_name:
                        j[species] = seq
                else:
                    del seq
            for a in file_name:
                print(j[a], file=file1, end="\t")
            print(file=file1)
    file1.close()
But the result my code produces is this (plus a number of empty cells between the rows):
| Abaca bunchy top viru | Abalone herpesvirus Victoria_AUS_200 | Acanthochromis_polyacanthus.ASM210954v1.pep.al |
| >YP_001661660.1 putative replicase protein | >YP_001661660.1 putative replicase protein | >YP_001661660.1 putative replicase protein |
| >YP_006908768.1 hypothetical protein AbHV_ORF59_2 | >YP_006908768.1 hypothetical protein AbHV_ORF59_2 | >YP_006908768.1 hypothetical protein AbHV_ORF59_2 |
| >ENSAPOP00000034950.1 | >ENSAPOP00000034950.1 | >ENSAPOP00000034950.1 |
I've been revising it for a long time and still can't get it right. Any pointers would be much appreciated.
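The output above is consistent with two things in the loop: lines read from a file keep their trailing newline, and `j[species] = seq` assigns the same header to every column. The sketch below reproduces that behaviour in isolation and is not a confirmed diagnosis. The `io.StringIO` stand-in, its two records, and the column names `A`, `B`, `C` are made up for illustration.

```python
import io

# Stand-in for an open .fa file; the records are invented for this sketch
fake_file = io.StringIO(
    ">P1 first protein [Some virus]\nMKID\n"
    ">P2 second protein [Some virus]\nMCEV\n"
)
file_name = ["A", "B", "C"]  # hypothetical column names

j = {name: 0 for name in file_name}
for seq in fake_file:
    if ">" in seq:
        for species in file_name:
            j[species] = seq  # every column is overwritten with the same header

# All three columns now hold the last header, newline included
print([j[name] for name in file_name])

# Stripping the newline and collecting into a list keeps every header of the file
fake_file.seek(0)
headers = [line.rstrip("\n") for line in fake_file if line.startswith(">")]
print(headers)
```

Keeping one list of headers per file, rather than one dictionary shared across files, is what lets each column carry its own protein names.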
Your description is really poor.
Here is the code I wrote based on what I think you meant:
    import os

    def get_fileContent(fa_path):
        """Return the ID (first token) of every FASTA header line in the file."""
        content = []
        with open(fa_path, 'r') as f:
            for line in f:
                # Match every header, not only '>YP_' ones, so '>ENSAPOP...' is kept too
                if line.startswith('>'):
                    content.append(line.split()[0])
        return content

    def get_ListItemMaxLen(lList):
        """Return the length of the longest list in lList."""
        MaxLen = 0
        for content in lList:
            nLen = len(content)
            if nLen > MaxLen:
                MaxLen = nLen
        return MaxLen

    path = r'D:\Users\Administrator\Desktop\新建文件夹'
    filesContent = []
    # Walk the folder and visit every file name
    for fpathe, dirs, fs in os.walk(path):
        for f in fs:
            FileName, Ext = os.path.splitext(f)

            # Keep only the files with the '.fa' extension
            if Ext == '.fa':
                fa_path = os.path.join(fpathe, f)
                content = get_fileContent(fa_path)
                content.insert(0, FileName)
                filesContent.append(content)

    MaxLen = get_ListItemMaxLen(filesContent)
    FilesLen = len(filesContent)
    for i in range(0, MaxLen):
        for j in range(0, FilesLen):
            # The header row uses a narrower gap than the data rows
            if i == 0:
                sp = '\t\t\t'
            else:
                sp = '\t\t\t\t\t'
            # Print an empty cell when this file has fewer entries,
            # so the columns to its right stay aligned
            if i < len(filesContent[j]):
                print(filesContent[j][i], end=sp)
            else:
                print('', end=sp)
        print()
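The question asks for a table rather than console output, so here is a hedged sketch that writes the same column-per-file layout to a tab-separated file. It assumes the full header line is wanted in each cell and that the files end in `.fa`. The names `read_headers`, `write_table`, and the `ext` parameter are hypothetical and not part of the code above.

```python
import csv
import os
from itertools import zip_longest


def read_headers(fa_path):
    """Collect every FASTA header line (starting with '>') from one file."""
    with open(fa_path, "r") as handle:
        return [line.strip() for line in handle if line.startswith(">")]


def write_table(folder, out_path, ext=".fa"):
    """Write one column per file: file name on top, header lines below."""
    columns = []
    for name in sorted(os.listdir(folder)):
        stem, file_ext = os.path.splitext(name)
        if file_ext == ext:
            columns.append([stem] + read_headers(os.path.join(folder, name)))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        # zip_longest turns columns into rows and pads short columns with ""
        writer.writerows(zip_longest(*columns, fillvalue=""))
    return columns
```

A call such as `write_table(path, "proteins.tsv")` would produce a file that opens directly in a spreadsheet. `zip_longest` is used so that files with fewer proteins leave empty cells without shifting the other columns.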