Create your config.yaml file for ProHap
General parameters
Ensembl release
Select transcripts
Use the default set of transcripts
Use only MANE Select transcripts
Only available for Ensembl v.108 and above. For genes that do not include any MANE Select transcript in Ensembl, "Ensembl Canonical" transcripts will be selected.
Select transcripts by biotype (provide below)
User-defined list of transcripts (provide below)
Transcript biotypes to be included
(Comma-separated list; use biotypes from Gencode)
Path to the custom transcript list
(CSV file; see the example data/transcripts_reference_108.csv)
Path to the contaminant FASTA file
Path to the final FASTA file
Simplify FASTA headers
(extract all information from the FASTA protein headers into a tab-separated file)
ProHap
Use ProHap
(Include protein haplotypes in the final FASTA file)
Data source:
Download VCF files from an online resource
Provide VCF files locally
URL of the data set of phased genotypes
Default: 1000 Genomes Project on GRCh38
Path to the directory containing phased VCF files
Name of the VCF files
One VCF file is expected per chromosome; replace the chromosome number in the file name with "{chr}"
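As an illustration of the "{chr}" placeholder, the sketch below expands a file-name template into one name per chromosome. The template string, the helper name `expand_vcf_names`, and the chromosome list are hypothetical; only the "{chr}" placeholder comes from the form, and how ProHap itself expands it is not shown here.

```python
# Hypothetical file-name template; only the "{chr}" placeholder is taken from the form.
vcf_name_template = "my_cohort_chr{chr}.phased.vcf.gz"


def expand_vcf_names(template, chromosomes):
    """Return one file name per chromosome by substituting the "{chr}" placeholder."""
    return [template.replace("{chr}", str(chrom)) for chrom in chromosomes]


# Assumption: autosomes 1-22 plus X; adjust to the chromosomes you actually have.
chromosomes = list(range(1, 23)) + ["X"]
vcf_files = expand_vcf_names(vcf_name_template, chromosomes)

print(vcf_files[0])
print(vcf_files[-1])
```

Checking the expanded names against the contents of your VCF directory before running the pipeline is a quick way to catch a misplaced placeholder.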
Samples metadata file
See the wiki page for details
MAF threshold
Variants under this threshold will not be included in haplotypes
MAF field name
Name of the allele frequency (AF) field in the VCF file ("AF" by default). Change it to use the frequency in a specific population within 1000 Genomes, or the frequency field defined in your own file.
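The interaction between the MAF threshold and the MAF field name can be sketched as follows. This is a minimal illustration, not ProHap's implementation: the helper names, the example INFO string, and the population-specific field name `EUR_AF` are assumptions, as is the choice to drop records that lack the field and to read only the first value of a multi-allelic record.

```python
def info_value(info, field):
    """Return the raw value of `field` from a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == field:
            return value
    return None


def passes_maf(info, threshold, field="AF"):
    """Keep a variant only if its frequency is not under the threshold."""
    value = info_value(info, field)
    if value is None:
        return False  # assumption: records without the field are dropped
    # Assumption: for multi-allelic records, only the first value is considered.
    return float(value.split(",")[0]) >= threshold


# Hypothetical INFO column of one VCF record.
info = "AC=5;AF=0.001;EUR_AF=0.02"

print(passes_maf(info, 0.01))                  # uses the default "AF" field
print(passes_maf(info, 0.01, field="EUR_AF"))  # uses a population-specific field
```

The same variant can fall under the threshold globally and still pass within one population, which is why the field name matters.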
Threshold haplotypes by
Haplotype frequency
Haplotype occurrence count
Threshold value
Specify 0 to skip haplotype thresholding
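The two thresholding modes differ only in the quantity compared with the threshold. The sketch below assumes a simple list of observed haplotype labels and an inclusive comparison; the function name, the mode strings, and the data are hypothetical and do not reflect ProHap's internals.

```python
from collections import Counter


def threshold_haplotypes(observed, mode, threshold):
    """Return {haplotype: value} for haplotypes that reach the threshold.

    mode is "frequency" (share of all observations) or "count" (occurrences).
    A threshold of 0 keeps every haplotype.
    """
    counts = Counter(observed)
    total = len(observed)
    kept = {}
    for haplotype, count in counts.items():
        value = count / total if mode == "frequency" else count
        # Assumption: the comparison is inclusive (value >= threshold).
        if threshold == 0 or value >= threshold:
            kept[haplotype] = value
    return kept


# Hypothetical haplotype labels observed across six phased chromosomes.
observed = ["A", "A", "A", "B", "B", "C"]

by_frequency = threshold_haplotypes(observed, "frequency", 0.3)
by_count = threshold_haplotypes(observed, "count", 3)
unfiltered = threshold_haplotypes(observed, "count", 0)
print(sorted(by_frequency), sorted(by_count), sorted(unfiltered))
```

A frequency threshold scales with cohort size, whereas a count threshold does not, so the count mode is stricter on small data sets.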
Pseudo-autosomal regions (PAR) on the X chromosome
End of PAR1:
Start of PAR2:
The default values for the GRCh38 human genome are 2781479 and 155701383. For GRCh37, use 2699520 and 154931044 respectively.
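To show how the two boundaries partition the X chromosome, the sketch below tests whether a position falls inside a pseudo-autosomal region. The coordinates are the defaults quoted above; the dictionary, the function name, and the inclusive treatment of both boundaries are assumptions made for illustration.

```python
# (end of PAR1, start of PAR2) per genome build, as quoted in the form.
PAR_BOUNDS = {
    "GRCh38": (2781479, 155701383),
    "GRCh37": (2699520, 154931044),
}


def in_par(position, par1_end, par2_start):
    """Return True if an X-chromosome position lies in PAR1 or PAR2.

    Assumption: both boundary positions belong to the PAR.
    """
    return position <= par1_end or position >= par2_start


par1_end, par2_start = PAR_BOUNDS["GRCh38"]
print(in_par(1_000_000, par1_end, par2_start))   # inside PAR1
print(in_par(50_000_000, par1_end, par2_start))  # X-specific region
```

A position such as 2750000 is inside PAR1 on GRCh38 but outside it on GRCh37, so the boundaries have to match the build of your VCF files.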
Require annotation of the start codon in transcripts
Ignore variation in UTR regions
If disabled, UTR sequences are still removed in the final optimized database, but retained in the haplotypes FASTA.
Skip haplotypes where the start codon is lost
If disabled, these haplotype cDNA sequences are translated in 3 reading frames, including UTR sequences.
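Translation in three reading frames can be sketched as below, using the standard genetic code. This is an illustration of the general technique only: the function names and the example sequence are hypothetical, and how ProHap treats stop codons, incomplete trailing codons, or ambiguous bases is not covered by the form text.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(cdna):
    """Return the translations of the three forward reading frames."""
    cdna = cdna.upper()
    return [translate(cdna[offset:]) for offset in range(3)]


# Hypothetical cDNA whose start codon sits in the second frame.
frames = three_frame_translation("GATGGCCTAA")
print(frames)
```

Only one of the three frames normally corresponds to the annotated protein, so translating all three enlarges the search space.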
Path to the haplotype FASTA file
Path to the haplotype metadata table
ProVar
Use ProVar
(Include individual variants in the final FASTA file)
Add your VCF files:
Dataset name
VCF file path
MAF threshold
Specify 0 to skip thresholding
Add
Require annotation of the start codon in transcripts
Path to the variant FASTA file
Path to the variant metadata table
Merge with an existing protein haplotype database
Path to the additional haplotype table file
(e.g., one of the F2 files in the Zenodo repository)
Path to the additional haplotype FASTA file
(e.g., one of the F3 files in the Zenodo repository)
Download or copy the content below into your config.yaml file: