Create your config.yaml file for ProHap
General parameters
Ensembl release
Select transcripts
Use the default set of transcripts
Use only MANE Select transcripts
Only available for Ensembl v.108 and above. For genes that do not include any MANE Select transcript in Ensembl, "Ensembl Canonical" transcripts will be selected.
Select transcripts by biotype (provide below)
User-defined list of transcripts (provide below)
Transcript biotypes to be included
(Comma-separated list; use biotypes from Gencode)
Path to the custom transcript list
(CSV file; see the example data/transcripts_reference_108.csv)
Path to the contaminant FASTA file
The default contaminant database is provided in the crap.fasta file in this repository.
Path to the final FASTA file
Simplify FASTA headers
(extract all information from the FASTA protein headers into a tab-separated file)
ProHap
Use ProHap
(Include protein haplotypes in the final FASTA file)
Data source:
Download VCF files from an online resource
Provide VCF files locally
URL of the data set of phased genotypes
Default: 1000 Genomes Project on GRCh38
Path to the directory containing phased VCF files
Name of the VCF files
VCFs are expected per chromosome, replace the chromosome number with "{chr}". Files can be either in the GZIP format or uncompressed.
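The "{chr}" placeholder convention can be illustrated with a short Python sketch. It is not ProHap's own code: the directory, the file name template `cohort_chr{chr}.vcf.gz`, the chromosome list, and the helper names `vcf_paths` and `open_vcf` are all hypothetical, and choosing the reader by the `.gz` suffix is an assumption.

```python
import gzip
from pathlib import Path


def vcf_paths(directory, name_template, chromosomes):
    """Map each chromosome to its VCF path by filling in the {chr} placeholder."""
    return {
        c: Path(directory) / name_template.replace("{chr}", str(c))
        for c in chromosomes
    }


def open_vcf(path):
    """Open a VCF as text, whether it is GZIP-compressed or uncompressed.

    Assumption: compressed files are recognised by a ".gz" suffix.
    """
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


# Hypothetical inputs: chromosomes 1-22 plus X, one file per chromosome.
chromosomes = [str(i) for i in range(1, 23)] + ["X"]
paths = vcf_paths("data/vcf", "cohort_chr{chr}.vcf.gz", chromosomes)
print(paths["X"])
```

Entering the template once and expanding it per chromosome avoids listing every file by hand.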
Samples metadata file
See the wiki page for details
MAF threshold
Variants with a minor allele frequency (MAF) below this threshold will not be included in haplotypes
MAF field name
Name of the allele frequency (AF) field in the VCF file ("AF" by default). Change it if you want to use the frequency in a specific population within 1000 Genomes, or to match the field name in your own file.
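As a rough illustration of how the MAF threshold and the MAF field name interact, the following Python sketch filters a single VCF record on a chosen INFO key. It is a minimal sketch, not ProHap's implementation: the helper names, the example record, and the population-specific key `EUR_AF` are hypothetical, and treating "at or above the threshold" as passing is an assumption.

```python
def info_value(info, field):
    """Return the raw value of one key in a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == field:
            return value
    return None


def passes_maf(vcf_line, threshold, af_field="AF"):
    """True if the frequency stored under af_field is at or above the threshold."""
    columns = vcf_line.rstrip("\n").split("\t")
    raw = info_value(columns[7], af_field)  # INFO is the 8th VCF column
    if raw is None:
        return False
    # Multi-allelic records list one frequency per ALT allele; this sketch
    # keeps the record if any listed allele passes (an assumption).
    return any(float(v) >= threshold for v in raw.split(",") if v != ".")


# Hypothetical record with an overall and a population-specific frequency.
line = "1\t12345\trs1\tA\tG\t.\tPASS\tAC=5;AF=0.02;EUR_AF=0.005"
print(passes_maf(line, 0.01))
print(passes_maf(line, 0.01, af_field="EUR_AF"))
```

The same variant can pass on the overall frequency and fail on a population-specific one, which is why the field name is configurable.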
Threshold haplotypes by
Haplotype frequency
Haplotype occurrence count
Threshold value
Specify 0 to skip haplotype thresholding
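The difference between thresholding by frequency and by occurrence count can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `threshold_haplotypes`, the input representation (one haplotype identifier per observed chromosome copy), and the inclusive comparison are hypothetical and not taken from ProHap.

```python
from collections import Counter


def threshold_haplotypes(observed, mode, value):
    """Keep haplotypes whose frequency or occurrence count reaches the threshold.

    observed: one haplotype identifier per observed chromosome copy
    mode:     "frequency" or "count"
    value:    threshold; 0 skips thresholding entirely
    """
    if mode not in ("frequency", "count"):
        raise ValueError("mode must be 'frequency' or 'count'")
    counts = Counter(observed)
    if value == 0:
        return dict(counts)
    total = sum(counts.values())
    kept = {}
    for haplotype, n in counts.items():
        score = n / total if mode == "frequency" else n
        if score >= value:  # inclusive comparison is an assumption
            kept[haplotype] = n
    return kept


# Hypothetical observations across ten chromosome copies.
observed = ["ref"] * 6 + ["hapA"] * 3 + ["hapB"]
print(threshold_haplotypes(observed, "frequency", 0.2))
print(threshold_haplotypes(observed, "count", 1))
print(threshold_haplotypes(observed, "count", 0))
```

A frequency threshold scales with cohort size, while a count threshold is absolute, so the two options can keep different haplotypes on the same data.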
Pseudo-autosomal regions (PAR) on the X chromosome
End of PAR1:
Start of PAR2:
The default values for the GRCh38 human genome are 2781479 and 155701383. For GRCh37, use 2699520 and 154931044 respectively.
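The two coordinates split the X chromosome into PAR1, the X-specific region, and PAR2. The Python sketch below shows one way to test a position against them, using the default values quoted above. The names `PAR_BOUNDARIES` and `in_par` are hypothetical, and treating both boundaries as inclusive is an assumption; how ProHap uses the coordinates internally is not stated here.

```python
# End of PAR1 and start of PAR2 on chromosome X, as quoted above.
PAR_BOUNDARIES = {
    "GRCh38": (2781479, 155701383),
    "GRCh37": (2699520, 154931044),
}


def in_par(position, par1_end, par2_start):
    """True if a chromosome X position lies in PAR1 or PAR2.

    Assumption: both boundary coordinates are treated as inclusive.
    """
    return position <= par1_end or position >= par2_start


par1_end, par2_start = PAR_BOUNDARIES["GRCh38"]
print(in_par(1_000_000, par1_end, par2_start))   # inside PAR1
print(in_par(50_000_000, par1_end, par2_start))  # X-specific region
```

Outside the pseudo-autosomal regions, males carry a single copy of X, so genotypes there need different handling than in the PARs. This is why the coordinates must match the genome build of the VCF files.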
Require annotation of the start codon in transcripts
Transcripts that do not have an annotated canonical start codon will not be used.
Ignore variation in UTR regions
If disabled, UTR sequences are still removed in the final optimized database, but retained in the haplotypes FASTA.
Skip haplotypes where the start codon is lost
If disabled, these haplotype cDNA sequences are translated in 3 reading frames, including UTR sequences.
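Translation in three reading frames means translating the cDNA starting at offsets 0, 1, and 2. The following Python sketch shows the idea with the standard genetic code. It is not ProHap's implementation: the function names are hypothetical, stop codons are written as "*" without ending the translation, and incomplete trailing codons are dropped, all of which are assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translation(cdna):
    """Translate a cDNA sequence starting at offsets 0, 1, and 2."""
    return [translate(cdna[offset:]) for offset in range(3)]


print(three_frame_translation("ATGGCCTAA"))
```

When the annotated start codon is lost, the true reading frame is unknown, so all three forward frames are candidates.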
Output haplotype cDNA sequences
Create a separate file containing all the haplotype cDNA sequences before translation. If skipping UTR variation as above, the cDNA haplotypes will begin with the canonical start codon.
Path to the cDNA haplotype FASTA file
Path to the protein haplotype FASTA file
Path to the haplotype metadata table
ProVar
Use ProVar
(Include individual variants in the final FASTA file)
Add your VCF files:
Dataset name
VCF file path
MAF threshold
Specify 0 to skip thresholding
Add
Require annotation of the start codon in transcripts
Transcripts that do not have an annotated canonical start codon will not be used.
Output variant cDNA sequences
Create a separate file containing all the variant cDNA sequences before translation.
Path to the variant cDNA FASTA file
Path to the variant FASTA file
Path to the variant metadata table
Merge with an existing protein haplotype database
Path to the additional haplotype table file
(e.g., one of the F2 files in the Zenodo repository)
Path to the additional haplotype FASTA file
(e.g., one of the F3 files in the Zenodo repository)
Download or copy the content below to your config.yaml file: