Overview

Parente operates in two phases. First, it uses training haplotypes to learn the distribution of scores for IBD segments and the distribution of scores for non-IBD segments. With these distributions, Parente can apply the embedded likelihood ratio test (eLRT) and create block-specific thresholds. Second, it uses the trained model to infer IBD segments in genotype data.

By default, Parente runs the eLRT with a fixed threshold. However, using program options, one can also use the standard likelihood ratio test (LRT) or block-specific thresholds.

Parente runs on one contiguous chromosome at a time.

Training

When you provide Parente with training haplotypes, it performs an internal simulation of IBD and non-IBD segments along the entire chromosome. The results of this simulation are then used when performing inference.

Arguments and options

Usage: parente train [options] <training.hap> <markers.map/tped> <out_prefix>

Computes score distributions from training haplotypes to facilitate inference.

Options:
  -h, --help             Display this help message
  -t, --threads INT      The number of threads [4]
  -e, --geno-err FLT     The modeled genotyping error rate [0.005]
  -w, --window-size INT  Window size [5]
  -n, --num-pairs INT    Num training pairs to generate for each data set. [1000]
  -s, --seed INT         Seed for simulating IBD and non-IBD segment pairs. [-1]

Input

training.hap: Training haplotypes in Parente's haplotype format.
markers.map or markers.tped: PLINK-formatted .map or .tped file to convey the position of each marker.
out_prefix: The prefix for all output file names.

Output

<out_prefix>.snp: Location of each SNP used.
<out_prefix>.fwin: Defines which SNPs belong in each window.
<out_prefix>.hf: Allele frequencies.
<out_prefix>.win.lrt1: LRT scores for the simulated IBD and non-IBD segments in each window.
<out_prefix>.win.lrt2: eLRT scores for the simulated IBD and non-IBD segments in each window.
<out_prefix>.wmodel: Summary statistics of the LRT scores of simulated IBD and non-IBD segments in each window.

Inference

Segment and block sizes

When performing IBD inference, Parente targets a particular IBD segment size to detect. It uses this target size when generating blocks of consecutive windows. The default target size is 4 cM, for which the generated blocks are between 3.5 and 3.9 cM long, depending on the SNP density around where the blocks are created. The min/max block size automatically adjusts to the target segment size as described in the options, or it can be specified manually. Note that the target IBD segment size is best treated as a minimum IBD segment size, since any longer IBD segment can be detected by examining a portion of it.
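The relationship between the target segment size and the block size limits can be sketched as follows. This is only an illustration of the defaults documented for --min-block-size and --max-block-size in the parente infer options; the function name default_block_sizes is hypothetical and is not part of Parente.

```python
def default_block_sizes(target_cm=4.0, min_block=-1.0, max_block=-1.0):
    """Return (min_block, max_block) in cM, filling in the documented defaults."""
    # --max-block-size: if <= 0, it is set to <target-segment-size> - 0.1
    if max_block <= 0:
        max_block = target_cm - 0.1
    # --min-block-size: if < 0, it is set to
    # <max-block-size> - 0.1 * <target-segment-size>
    if min_block < 0:
        min_block = max_block - 0.1 * target_cm
    return min_block, max_block


lo, hi = default_block_sizes()
print(f"{lo:.1f} to {hi:.1f} cM")
```

With the default target of 4 cM this reproduces the 3.5 to 3.9 cM range quoted above.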

Score and thresholds

By default, Parente uses the eLRT with a fixed threshold for inference. However, this can be changed with the --lrt and --threshtype flags, respectively.

The threshold argument is interpreted based on which threshold type is being used:

For --threshtype fixed, it is interpreted as an LRT or eLRT score such that any block with a score greater than threshold is called as being in an IBD segment.
For --threshtype max (block-specific thresholding), it is interpreted as a number of standard deviations. Specifically, during the training step, for each block, we compute the max (M) and standard deviation (SD) of the block's scores (LRT or eLRT) in simulated pairs of non-IBD segments. Then, any block with a score greater than M + SD * threshold is called as being in an IBD segment.
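The two interpretations of the threshold argument can be written out as a small decision rule. This is a minimal sketch of the rule as described above, not Parente's actual code; the function is_ibd_call and its parameter names are hypothetical.

```python
def is_ibd_call(score, threshold, threshtype="fixed",
                nonibd_max=None, nonibd_sd=None):
    """Decide whether a block score is called as IBD.

    nonibd_max (M) and nonibd_sd (SD) are the block's training statistics
    from simulated non-IBD pairs; they are only needed for threshtype "max".
    """
    if threshtype == "fixed":
        # The threshold is itself an LRT or eLRT score.
        return score > threshold
    if threshtype == "max":
        # The threshold is a number of standard deviations above M.
        return score > nonibd_max + nonibd_sd * threshold
    raise ValueError(f"unknown threshold type: {threshtype}")


print(is_ibd_call(12.0, 10.0))                                   # fixed
print(is_ibd_call(12.0, 3.0, "max", nonibd_max=8.0, nonibd_sd=2.0))
```

In the second call the block-specific cutoff is 8.0 + 2.0 * 3.0 = 14.0, so a score of 12.0 is not called even though it would pass a fixed threshold of 10.0.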

For very high-specificity scenarios, using --lrt 2 and --threshtype max is recommended.

Using --lrt 1 is not recommended because of its lower performance. If it is used, we strongly recommend --threshtype max to achieve a reasonable false positive rate.

Pairs evaluated

By default, Parente infers IBD between all pairs of individuals in the input. However, one can partition the data set into two groups and only look for IBD across them. In this case, the --partition option splits the data in half, assigning all even-indexed individuals to one group and all odd-indexed individuals to the other. Then, only pairs with one individual from each group are evaluated. This mode is generally only used for benchmarking.
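The effect of partitioning on the set of evaluated pairs can be illustrated with a short sketch. It assumes individuals are indexed from 0 in input column order, which the text does not state explicitly; pairs_to_evaluate is a hypothetical helper, not a Parente function.

```python
from itertools import combinations


def pairs_to_evaluate(names, partition=False):
    """List the pairs of individuals that would be compared."""
    indexed = list(enumerate(names))  # assumed 0-based, in column order
    pairs = []
    for (i, a), (j, b) in combinations(indexed, 2):
        # With partitioning, keep only even-index vs. odd-index pairs.
        if not partition or i % 2 != j % 2:
            pairs.append((a, b))
    return pairs


people = ["john", "jane", "sally", "sam"]
print(len(pairs_to_evaluate(people)))            # all pairs
print(pairs_to_evaluate(people, partition=True)) # cross-group pairs only
```

For four individuals, all-pairs mode evaluates six pairs, while the partitioned mode drops the two within-group pairs.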

Arguments and options

Usage: parente infer [options] <train_prefix> <data.geno> <threshold> <out_prefix>

Infers IBD segments.

Options:
  -h, --help                     Display this help message
  -s, --target-segment-size FLT  Target (minimum) IBD segment size in cM [4]
  -t, --threads INT              Sets number of threads [4]
  -l, --lrt INT                  Use LRT (1) or eLRT (2) [2]
      --threshtype STR           Threshold type: fixed or max [fixed]
  -e, --geno-err FLT             Set the modeled genotyping error rate [0.005]
  -S, --smoothing-factor FLT     Divide the error rate by this amount [100]
  -b, --min-block-size FLT       Minimum block size (in cM) to accept when creating
                                 blocks. If < 0, then it is set to:
                                 <max-block-size> - 0.1 * <target-segment-size> [-1]
  -B, --max-block-size FLT       Maximum block size (in cM) to accept when creating
                                 blocks.  If <= 0, then it is set to:
                                 <target-segment-size> - 0.1 [-1]
  -p, --partition                Only infer IBD between even-indexed and odd-indexed
                                 individuals, otherwise infer for all pairs. [false]

Input

train_prefix: The out_prefix used in the training step.
data.geno: Genotypes on which to infer IBD segments (Parente genotype format).
out_prefix: The prefix for all output file names.

Output

<out_prefix>.ibd: Called IBD segments.
<out_prefix>.fblock: Defines which windows belong in each block.
<out_prefix>.bmodel (optional): Summarizes block-level scores observed in training data.

File formats

Haplotypes file (.hap)

This tab-delimited text file contains one haplotype per column with a 0 indicating the minor allele, and 1 indicating the major allele. Any other value is interpreted as missing data. Each file is expected to contain markers in order on a single chromosome. If the first line begins with a #, it is interpreted as a header that contains a tab-delimited list of the haplotype names (the # is ignored and not used as a part of the first haplotype name). An example file containing 3 haplotypes and 4 markers could look like this:

#hapname1    hapname2    hapname3
1            1           1
0            1           1
1            0           1
1            0           -1
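A reader for this format might look like the sketch below. It is not Parente's own parser: read_hap is a hypothetical name, and the code splits on any whitespace rather than strictly on tabs, so haplotype names containing spaces are not handled.

```python
def read_hap(text):
    """Return (names, haplotypes); haplotypes[i] holds the alleles of column i."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = None
    if lines and lines[0].startswith("#"):
        # The leading '#' marks the header and is not part of the first name.
        names = lines[0][1:].split()
        lines = lines[1:]
    rows = []
    for ln in lines:
        # Only 0 (minor) and 1 (major) are alleles; anything else is missing.
        rows.append([int(tok) if tok in ("0", "1") else None
                     for tok in ln.split()])
    # Transpose the marker rows into one list per haplotype column.
    return names, [list(col) for col in zip(*rows)]


example = """#hapname1\thapname2\thapname3
1\t1\t1
0\t1\t1
1\t0\t1
1\t0\t-1
"""
names, haps = read_hap(example)
print(names)
print(haps[2])  # the -1 at the last marker becomes None (missing)
```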

Genotypes file (.geno)

This tab-delimited text file contains one genotype per column with integers 0-2 indicating genotype: 0 indicates homozygous for the minor allele, 1 indicates heterozygous, and 2 indicates homozygous for the major allele. Any other value is interpreted as missing data. Each file is expected to contain markers in order on a single chromosome. If the first line begins with a #, it is interpreted as a header that contains a tab-delimited list of the genotype names (the # is ignored and not used as a part of the first genotype name). An example file containing genotypes from 3 individuals with 4 markers could look like this:

#john    jane    sally
2        1       2
0        1       -1
2        2       2
1        2       1
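Because both encodings count copies of the major allele, a genotype value equals the sum of the two haplotype values of the same individual. The sketch below illustrates that relationship only; genotype_from_haplotypes is a hypothetical helper, and the use of -1 for missing data follows the examples above rather than a documented requirement.

```python
def genotype_from_haplotypes(hap_a, hap_b, missing=-1):
    """Combine two haplotype columns (0/1) into one genotype column (0-2)."""
    genotypes = []
    for a, b in zip(hap_a, hap_b):
        if a in (0, 1) and b in (0, 1):
            # 0 + 0 = homozygous minor, 0 + 1 = heterozygous,
            # 1 + 1 = homozygous major
            genotypes.append(a + b)
        else:
            # Any other value in either haplotype means missing data.
            genotypes.append(missing)
    return genotypes


geno = genotype_from_haplotypes([1, 0, 1, 1], [1, 1, 0, -1])
print(geno)
```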

Window specification file (.fwin)

This describes which SNPs belong to which windows and the genetic position boundaries of the windows. The rows are expected to be in order of strictly increasing index and increasing start position. The columns are as follows:

0-based window index
Start position (cM): the position of the first SNP in the window
Stop position (cM): the position of the last SNP in the window
Comma-separated list of 0-based SNP indices belonging to the window. These should be in sorted order.
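As an illustration of the column layout, the sketch below parses rows of this form and checks the ordering requirement. The column delimiter is not stated in this section, so the code assumes whitespace-separated columns; the names Window, parse_fwin_line and is_ordered, and the sample rows, are invented for the example.

```python
from typing import List, NamedTuple


class Window(NamedTuple):
    index: int
    start_cm: float
    stop_cm: float
    snps: List[int]


def parse_fwin_line(line):
    """Parse one row: index, start (cM), stop (cM), comma-separated SNPs."""
    idx, start, stop, members = line.split()
    return Window(int(idx), float(start), float(stop),
                  [int(s) for s in members.split(",")])


def is_ordered(windows):
    """Check strictly increasing indices and non-decreasing start positions."""
    pairs = list(zip(windows, windows[1:]))
    return all(a.index < b.index and a.start_cm <= b.start_cm
               for a, b in pairs)


lines = ["0\t0.00\t0.02\t0,1,2,3,4", "1\t0.03\t0.06\t5,6,7,8,9"]
windows = [parse_fwin_line(ln) for ln in lines]
print(windows[1].snps, is_ordered(windows))
```

Since the block specification file uses the same four-column layout, the same parsing approach would apply to it, with window indices in place of SNP indices.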

Block specification file (.fblock)

This describes which windows belong to which blocks and the genetic position boundaries of the blocks. The rows are expected to be in order of strictly increasing index and increasing start position. The columns are as follows:

0-based block index
Start position (cM): the start position of the first window in the block
Stop position (cM): the stop position of the last window in the block
Comma-separated list of 0-based window indices belonging to the block. These should be in sorted order.

Marker positioning file (.snp)

This describes the genetic position of SNPs. This file format is a bit odd because it was borrowed from the .fwin file format. Its columns are as follows:

0-based SNP index
SNP position (cM)
SNP position (cM)
0-based SNP index

Allele frequency file (.hf)

This describes the frequencies of the alleles. The rows should correspond directly with the rows in the marker positioning file. The columns in this file are as follows:

0-based SNP index
Allele (0 or 1)
Count of haplotypes with this allele
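Since the file stores counts rather than frequencies, a frequency has to be derived by dividing by the number of haplotypes. The sketch below assumes whitespace-separated columns and that the caller knows the number of training haplotypes (with no missing data at the marker); parse_hf_line and the sample row are hypothetical.

```python
def parse_hf_line(line, n_haplotypes):
    """Return (snp_index, allele, frequency) for one row of a .hf file."""
    snp, allele, count = line.split()
    # Frequency = haplotypes carrying this allele / all haplotypes
    # (assumes no missing data at this marker).
    return int(snp), int(allele), int(count) / n_haplotypes


snp, allele, freq = parse_hf_line("0\t1\t170", n_haplotypes=200)
print(snp, allele, freq)
```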

Training window scores (.win.lrt1 and .win.lrt2)

Window scores computed from training data for non-IBD and IBD segments for LRT (.lrt1) and eLRT (.lrt2). The columns of the file are as follows:

0-based window index
Comma-separated window scores for non-IBD segments.
Comma-separated window scores for IBD segments.

Window model (.wmodel)

This describes the summary statistics of the window scores from the LRT window score file (.win.lrt1). The rows should match with the .fwin file.

0-based window index
Mean score of non-IBD segments
Mean score of IBD segments
Standard deviation of the scores of non-IBD segments
Standard deviation of the scores of IBD segments
Maximum of the scores of non-IBD segments
Maximum of the scores of IBD segments
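The relationship between a row of the window score file and a row of the window model can be sketched as below. It assumes whitespace-separated columns and uses the population standard deviation; whether Parente uses the population or the sample form is not stated here, so treat that choice as an assumption. summarize_window and the sample row are hypothetical.

```python
import statistics


def summarize_window(line):
    """Turn one .win.lrt1-style row into the seven .wmodel-style columns."""
    idx, nonibd_field, ibd_field = line.split()
    nonibd = [float(x) for x in nonibd_field.split(",")]
    ibd = [float(x) for x in ibd_field.split(",")]
    return (
        int(idx),
        statistics.mean(nonibd), statistics.mean(ibd),
        statistics.pstdev(nonibd), statistics.pstdev(ibd),  # assumed form
        max(nonibd), max(ibd),
    )


row = summarize_window("3\t-2.0,-1.0,0.0\t4.0,6.0,8.0")
print(row)
```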

Block model (.bmodel)

This file describes summary statistics of block scores based on the training data. Block scores are computed using the .win.lrt1 and .win.lrt2 files along with the .fblock file to sum the window scores that belong to each block. These block summary statistics are used for block-specific thresholding. It follows the same format as the window model file, but refers to blocks instead of windows, and the comma-separated values refer to windows instead of SNPs.

This file is produced by parente infer when using --threshtype max or if verbosity is set to a value greater than 0.
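The summation of window scores into block scores, and the block statistics derived from them, can be sketched as follows. The code assumes that the k-th score listed for each window comes from the same simulated pair, and it uses the population standard deviation; both are assumptions made for illustration, and block_scores is a hypothetical helper.

```python
import statistics


def block_scores(window_scores, block_windows):
    """Sum per-window scores over a block, one total per simulated pair.

    window_scores maps a window index to its list of training scores;
    block_windows lists the window indices belonging to the block.
    """
    n_pairs = len(window_scores[block_windows[0]])
    return [sum(window_scores[w][k] for w in block_windows)
            for k in range(n_pairs)]


# Made-up non-IBD training scores for three windows and three pairs.
nonibd = {0: [-1.0, 0.5, -0.5], 1: [-2.0, 1.0, 0.0], 2: [0.0, -0.5, -1.5]}
scores = block_scores(nonibd, [0, 1, 2])
block_max = max(scores)               # M in the thresholding rule
block_sd = statistics.pstdev(scores)  # SD in the thresholding rule
print(scores, block_max, round(block_sd, 3))
```

The resulting maximum and standard deviation of the non-IBD block scores are the quantities that block-specific thresholding relies on.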

Predicted IBD blocks (.ibd)

Blocks that Parente infers to be in IBD segments. The columns are as follows:

0-based call index (row index)
Name of the first individual in the IBD pair
Name of the second individual in the IBD pair
0-based block index
Raw block score
Block score delta (how far above threshold)
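Because the file lists individual blocks, a common post-processing step is to collect the called blocks per pair and group consecutive block indices. The sketch below is one possible way to do that and is not part of Parente; it assumes whitespace-separated columns and names without spaces, and the sample rows are invented.

```python
from collections import defaultdict


def blocks_by_pair(lines):
    """Map (name1, name2) to the sorted list of called block indices."""
    calls = defaultdict(list)
    for line in lines:
        _call, name1, name2, block, _score, _delta = line.split()
        calls[(name1, name2)].append(int(block))
    return {pair: sorted(blocks) for pair, blocks in calls.items()}


def consecutive_runs(blocks):
    """Group sorted block indices into (first, last) runs of adjacent blocks."""
    runs = []
    for b in blocks:
        if runs and b == runs[-1][1] + 1:
            runs[-1][1] = b       # extend the current run
        else:
            runs.append([b, b])   # start a new run
    return [tuple(r) for r in runs]


rows = [
    "0\tjohn\tjane\t12\t15.2\t3.1",
    "1\tjohn\tjane\t13\t14.0\t1.9",
    "2\tjohn\tsally\t40\t11.5\t0.4",
    "3\tjohn\tjane\t20\t13.3\t1.2",
]
calls = blocks_by_pair(rows)
print(consecutive_runs(calls[("john", "jane")]))
```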

Format conversion

One can use parente tool tped_to_ints to convert from PLINK-formatted data to Parente genotype or haplotype files.

First, make sure your PLINK-formatted data is in transposed format. That is, make sure you have .tped and .tfam files. You can convert data in .ped format (e.g., data.ped) with the command below, which creates data.tped and data.tfam.

plink --file data --recode --transpose --out data

You must also have a .frq file, which the conversion tool uses to determine which allele is the major allele and which is the minor allele. It is generally advised to use the same .frq file for all experiments on the same population. If a separate .frq file is generated from each individual data set, markers with high minor allele frequencies can flip the major/minor allele encoding between data sets. You can generate data.frq using the --freq flag, as in the example below.

plink --tped data.tped --tfam data.tfam --freq --out data
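The encoding flip mentioned above can be made concrete with a toy example. The allele counts are invented, and major_allele is a hypothetical helper that simply picks the more frequent allele; it does not reproduce how PLINK or the conversion tool break ties.

```python
def major_allele(counts):
    """Return the allele with the highest count (ties broken alphabetically)."""
    return max(sorted(counts), key=counts.get)


# The same marker, with a minor allele frequency near 0.5, in two data sets.
cohort_a = {"A": 52, "G": 48}
cohort_b = {"A": 47, "G": 53}

print(major_allele(cohort_a))  # allele encoded as major using cohort_a's .frq
print(major_allele(cohort_b))  # allele encoded as major using cohort_b's .frq
```

With per-data-set frequencies, allele A would be coded as the major allele in one data set and as the minor allele in the other, which is why a shared .frq file is the safer choice.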

To convert to the Parente haplotype file format, you can use the command below which will generate data.hap.

parente tool tped_to_ints data data.frq hap data

To convert to the Parente genotype file format, you can use the command below which will generate data.geno.

parente tool tped_to_ints data data.frq geno data