Fast LD computation from VCF files using Tomahawk • rmf

Install tomahawk from here. Following commands are run on the Linux command-line.

Thinning VCF

For LD computation, pairwise comparisons are performed between all SNPs, therefore it is essential to thin down the number of SNPs. My dataset has about 18 million SNPs. My aim is to bring it down to 20,000 or so. Use vcftools to thin down a VCF file.

vcftools --vcf snp.vcf --recode --recode-INFO-all --thin 50000 --out "snp-thin"

VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
        --vcf snp.vcf
        --recode-INFO-all
        --thin 50000
        --out snp-thin
        --recode

After filtering, kept 64 out of 64 Individuals
Outputting VCF file...
After filtering, kept 26205 out of a possible 18793942 Sites
Run Time = 135.00 seconds

mv "snp-thin.recode.vcf" "snp-thin.vcf"

The argument --thin 50000 means to keep 1 SNP every 50,000 bases. Depending on the number of SNPs in the VCF file, you want to increase or decrease this. I am aiming for about 20,000 or so SNPs in the thinned file.

VCF to BCF

Convert VCF to BCF.

bcftools view snp.vcf -O b -o snp.bcf

Tomahawk

Instructions on using tomahawk is detailed here. Convert BCF to the tomahawk format. Import filters filter out sites with missingness > 20% and HW p-value < 0.001.

tomahawk import -i snp.bcf -o snp -m 0.2 -h 0.001

Tomahawk is used to compute LD.

Arguments: -u (unphased), -d (progress), -f (fast computation), -i (input file), -o (output file name prefix), -r (min R2 cut-off). See tomahawk help for more descriptions.

tomahawk calc -udfi snp.twk -o snp -a 5 -r 0.1 -P 0.1 -C 1 -t 1

The .two output is binary and needs to be exported to a text file. The -I arguments restricts the output to a specified chromosome. So, -I chr1 (UCSC) of -I 1 (Ensembl) exports only chromosome 1. It is probably better to have separate files for each chromosome as the files are a more managable size to read into R. The LD metrics are computed between SNPs within chromosomes anyway.

tomahawk view -i snp.two > snp.ld
tomahawk view -i snp.two -I 1 > snp-1.ld

Plotting

Now we move into R. First we load some libraries and define some plotting functions to be used later.

# load libraries
library(dplyr)
library(ggplot2)

# define plotting functions
#' @title plotPairwiseLD
#' @description Plots R2 heatmap across the chromosome (like Haploview)
#' @param dfr A data.frame with minimum CHROM_A, POS_A, CHROM_B, POS_B and R2.
#' An output from tomahawk works.
#' @param chr A chromosome name.
#' @param xlim A two number vector specifying min and max x-axis limits. Any one or both can be defaulted by specifying NA.
#' @param ylim A two number vector specifying min and max y-axis limits. Any one or both can be defaulted by specifying NA.
#' @param minr2 A value between 0 and 1. All SNPs with R2 value below this
#' value is excluded from plot.
#'
plotPairwiseLD <- function(dfr, chr, xlim = c(NA, NA), ylim = c(NA, NA), minr2) {
  if (missing(dfr)) stop("Input data.frame 'dfr' missing.")

  if (!missing(chr)) {
    ld <- filter(ld, CHROM_A == get("chr") & CHROM_B == get("chr"))
  }
  ld <- filter(ld, POS_A < POS_B)

  if (!missing(minr2)) {
    ld <- filter(ld, R2 > get("minr2"))
  }

  ld <- ld %>% arrange(R2)

  ld$x <- ld$POS_A + ((ld$POS_B - ld$POS_A) / 2)
  ld$y <- ld$POS_B - ld$POS_A
  ld$r2c <- cut(ld$R2, breaks = seq(0, 1, 0.2), labels = c(
    "0-0 - 0.2", "0.2 - 0.4",
    "0.4 - 0.6", "0.6 - 0.8",
    "0.8 - 1.0"
  ))
  ld$r2c <- factor(ld$r2c, levels = rev(c(
    "0-0 - 0.2", "0.2 - 0.4",
    "0.4 - 0.6", "0.6 - 0.8",
    "0.8 - 1.0"
  )))

  ggplot(ld, aes(x = x, y = y, col = r2c)) +
    geom_point(shape = 20, size = 0.1, alpha = 0.8) +
    scale_color_manual(values = c("#ca0020", "#f4a582", "#d1e5f0", "#67a9cf", "#2166ac")) +
    scale_x_continuous(limits = xlim) +
    scale_y_continuous(limits = ylim) +
    guides(colour = guide_legend(title = "R2", override.aes = list(shape = 20, size = 8))) +
    labs(x = "Chromosome (Bases)", y = "") +
    theme_bw(base_size = 14) +
    theme(
      panel.border = element_blank(),
      axis.ticks = element_blank()
    ) %>%
    return()
}


#' @title plotDecayLD
#' @description Plots R2 heatmap across the chromosome (like Haploview)
#' @param dfr A data.frame with minimum CHROM_A, POS_A, CHROM_B, POS_B and R2.
#' An output from tomahawk works.
#' @param chr A chromosome name.
#' @param xlim A two number vector specifying min and max x-axis limits. Any one or both can be defaulted by specifying NA.
#' @param ylim A two number vector specifying min and max y-axis limits.  Any one or both can be defaulted by specifying NA.
#' @param avgwin An integer specifying window size. Mean R2 is computed within windows.
#' @param minr2 A value between 0 and 1. All SNPs with R2 value below this
#' value is excluded from plot.
#'
plotDecayLD <- function(dfr, chr, xlim = c(NA, NA), ylim = c(NA, NA), avgwin = 0, minr2) {
  if (missing(dfr)) stop("Input data.frame 'dfr' missing.")

  if (!missing(chr)) {
    ld <- filter(ld, CHROM_A == get("chr") & CHROM_B == get("chr"))
  }
  ld <- filter(ld, POS_A < POS_B)

  if (!missing(minr2)) {
    ld <- filter(ld, R2 > get("minr2"))
  }

  ld <- ld %>% arrange(R2)

  ld$dist <- ld$POS_B - ld$POS_A

  if (avgwin > 0) {
    ld$distc <- cut(ld$dist, breaks = seq(from = min(ld$dist), to = max(ld$dist), by = avgwin))
    ld <- ld %>%
      group_by(distc) %>%
      summarise(dist = mean(dist), R2 = mean(R2))
  }

  ggplot(ld, aes(x = dist, y = R2)) +
    geom_point(shape = 20, size = 0.15, alpha = 0.7) +
    geom_smooth() +
    scale_x_continuous(limits = xlim) +
    scale_y_continuous(limits = ylim) +
    labs(x = "Distance (Bases)", y = expression(LD ~ (r^{
      2
    }))) +
    theme_bw(base_size = 14) +
    theme(
      panel.border = element_blank(),
      axis.ticks = element_blank()
    ) %>%
    return()
}

Then we read the file.

ld <- read.delim("snp.ld", sep="\t", comment.char="#")

The output looks like below:

> head(ld)
  FLAG CHROM_A   POS_A CHROM_B   POS_B   REF_REF      REF_ALT   ALT_REF
 10       1     359       1  100777 103.00000 8.526513e-14  5.000000
 10       1  100777       1     359 103.00000 8.526513e-14  5.000000
  2       1     359       1  452615  97.48765 5.512354e+00 10.512354
  2       1  452615       1     359  97.48765 5.512354e+00 10.512354
  6       1     359       1 1454410  49.07769 5.392231e+01  1.922308
  6       1 1454410       1     359  49.07769 5.392231e+01  1.922308
    ALT_ALT          D    Dprime         R        R2            P ChiSqModel
20.00000 0.12573242 1.0000000 0.8734775 0.7629629 4.440203e-19   2.296495
20.00000 0.12573242 1.0000000 0.8734775 0.7629629 4.440203e-19   2.296495
14.48765 0.08266716 0.6574848 0.5742982 0.3298184 5.573234e-08   1.993300
14.48765 0.08266716 0.6574848 0.5742982 0.3298184 5.573234e-08   1.993300
23.07769 0.06280179 0.8070154 0.3235735 0.1046998 1.876711e-04   7.703488
23.07769 0.06280179 0.8070154 0.3235735 0.1046998 1.876711e-04   7.703488
  ChiSqTable
 97.65926
 97.65926
 42.21675
 42.21675
 13.40157
 13.40157

Plot pairwise LD plot.

plotPairwiseLD(ld)

plotPairwiseLD(ld, minr2=0.2)

ld-limit — Pairwise LD plot with a min R2 limit of 0.2.

plotPairwiseLD(ld, minr2=0.2, ylim=c(0, 10^7))

ld-limit-y — Pairwise LD plot with a min R2 limit of 0.2 and y-axis max limit set to 10^7.

plotDecayLD(ld, minr2=0.25)

ld-limit-y-1 — LD decay plot with a min R2 limit of 0.25.

plotDecayLD(ld, minr2=0.25, avgwin=100000)

ld-limit-y-2 — LD decay plot with a min R2 limit of 0.25 and R2 values averaged within 10 Kb windows.

Thinning VCF

VCF to BCF

Tomahawk

Plotting

Comments

You may also enjoy

Exploring history and culture in Thessaloniki

Scatterplot annotation

Exploring the Swedish west coast

Road trip through Portugal