1 介绍
在基因结构分析或其他生物功能分析中会时常用到 CDS 序列,以及其他诸如 mRNA 序列,misc RNA序列等具有生物意义的序列片段。而NCBI 的基因库中已经包含有这些的信息,但是只有一部分是整理可下载的。而剩下的一部分可以通过 genbank给出的位点信息来提取,个人能力有限,这里只做抛转之用。下面以提取 CDS 为例,记录提取序列过程,其他特征序列类似。
2 结构目录
3 Python代码
序列自动下载可以通过 Biopython 的 Entrez.efetch 方法来实现,这里以本地文件为例
import os from Bio import SeqIO
def format_fasta(ana, seq, num): """ 格式化文本为 fasta格式 :param ana: 注释信息 :param seq: 序列 :param num: 序列换行时的字符个数 :return: fasta格式文本 """ format_seq = "" for i, char in enumerate(seq): format_seq += char if (i + 1) % num == 0: format_seq += "\n" return ana + format_seq + "\n"
def get_cds(gb_file, f_cds): """ 从 genbank 文件中提取 cds 序列及其完整序列 :param gb_file: genbank文件路径 :param f_cds: 是否只获取一个 CDS 序列 :return: fasta 格式的 CDS 序列, fasta 格式的完整序列 """ gb_seq = SeqIO.read(gb_file, "genbank") complete_seq = str(gb_seq.seq) complete_ana = ">" + gb_seq.id + ":" + gb_seq.annotations["accessions"][2] + " " + gb_seq.description + "\n" complete_fasta = format_fasta(complete_ana, complete_seq, 70)
cds_num = 1 cds_fasta = "" for ele in gb_seq.features: if ele.type == "CDS": cds_seq = "" cds_ana = ">lcl|" + gb_seq.id + "_cds_" + ele.qualifiers['protein_id'][0] + "_" + str(cds_num) + " [gene=" + \ ele.qualifiers['gene'][0] + "]" + \ " [db_xref=" + ele.qualifiers['db_xref'][0] + "]" + " [protein=" + ele.qualifiers['product'][ 0] + "]" + \ " [protein_id=" + ele.qualifiers['protein_id'][0] + "]" + " [gbkey=CDS]\n" cds_num += 1 for ele1 in ele.location.parts: cds_seq += complete_seq[ele1.start:ele1.end] cds_fasta += format_fasta(cds_ana, cds_seq, 70) if (f_cds): break return cds_fasta, complete_fasta
if __name__ == '__main__': cds_file = "out/cds.fasta" complete_file = "out/complete.fasta" res_dir = "res" cds_file_obj = open(cds_file, "w") complete_file_obj = open(complete_file, "w") for file in os.listdir(res_dir): cds_fasta, complete_fasta = get_cds(res_dir + os.sep + file,
import os from Bio import SeqIO
def format_fasta(ana, seq, num): """ 格式化文本为 fasta格式 :param ana: 注释信息 :param seq: 序列 :param num: 序列换行时的字符个数 :return: fasta格式文本 """ format_seq = "" for i, char in enumerate(seq): format_seq += char if (i + 1) % num == 0: format_seq += "\n" return ana + format_seq + "\n"
def get_cds(gb_file, f_cds): """ 从 genbank 文件中提取 cds 序列及其完整序列 :param gb_file: genbank文件路径 :param f_cds: 是否只获取一个 CDS 序列 :return: fasta 格式的 CDS 序列, fasta 格式的完整序列 """ gb_seq = SeqIO.read(gb_file, "genbank") complete_seq = str(gb_seq.seq) complete_ana = ">" + gb_seq.id + ":" + gb_seq.annotations["accessions"][2] + " " + gb_seq.description + "\n" complete_fasta = format_fasta(complete_ana, complete_seq, 70)
cds_num = 1 cds_fasta = "" for ele in gb_seq.features: if ele.type == "CDS": cds_seq = "" cds_ana = ">lcl|" + gb_seq.id + "_cds_" + ele.qualifiers['protein_id'][0] + "_" + str(cds_num) + " [gene=" + \ ele.qualifiers['gene'][0] + "]" + \ " [db_xref=" + ele.qualifiers['db_xref'][0] + "]" + " [protein=" + ele.qualifiers['product'][ 0] + "]" + \ " [protein_id=" + ele.qualifiers['protein_id'][0] + "]" + " [gbkey=CDS]\n" cds_num += 1 for ele1 in ele.location.parts: cds_seq += complete_seq[ele1.start:ele1.end] cds_fasta += format_fasta(cds_ana, cds_seq, 70) if (f_cds): break return cds_fasta, complete_fasta
if __name__ == '__main__': cds_file = "out/cds.fasta" complete_file = "out/complete.fasta" res_dir = "res" cds_file_obj = open(cds_file, "w") complete_file_obj = open(complete_file, "w") for file in os.listdir(res_dir): cds_fasta, complete_fasta = get_cds(res_dir + os.sep + file, True) cds_file_obj.write(cds_fasta) complete_file_obj.write(complete_fasta)
|
4 其他方法获取
类型 |
编号 |
AY,AP |
同一个基因存在多个提交版本时的序列编号 |
NC,NM |
NCBI 官方推荐及使用的序列编号 |
IMAGE等 |
针对特定物种,或特定组织提供的序列编号 |
4.1 对于AY,AP,可以用下面的方式来实现 CDS 序列下载,但是对于样本量大的序列分析比较低效
4.2 对于NC,NM,可以用下面的方式来实现 CDS 序列下载,同样对于样本量大的序列分析比较低效
4.3 通过爬虫实现自动化,但是成本比较高,而且加重 NCBI 服务器负担,搞不好IP就会被封掉
4.4 用 BioPython 的 Entrez.efetch(db=”nuccore”, id=ids, rettype=”fasta_cds_na “, retmode=”text”) 方法实现。但是经过实际调用,并没有什么效果。但是可以利用它来下载genbank序列后续实现自动化提取