Quantifies the extent of overlap between two sets of intervals in terms of base-pairs. Groups that are shared between the inputs are used to calculate the statistic for subsets of the data.

bed_jaccard(x, y)

Arguments

x

tbl_interval()

y

tbl_interval()

Value

tibble with the following columns:

  • len_i length of the intersection in base-pairs

  • len_u length of the union in base-pairs

  • jaccard value of jaccard statistic

  • n_int number of intersecting intervals between x and y

If inputs are grouped, the return value will contain one set of values per group.

Details

The Jaccard statistic takes values in the interval [0, 1] and is computed as:

$$ J(x,y) = \frac{\mid x \cap y \mid} {\mid x \cup y \mid} = \frac{\mid x \cap y \mid} {\mid x \mid + \mid y \mid - \mid x \cap y \mid} $$
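
The formula can be checked by hand on a toy example. The sketch below uses base R only and does not call valr; `covered_bases()` is a hypothetical helper written for this illustration, the coordinates are made up, and intervals are assumed to be zero-based and half-open (BED-style). It is meant to illustrate the definition, not to show how bed_jaccard() is implemented.

```r
# Hypothetical helper (not part of valr): expand half-open [start, end)
# intervals into the set of base positions they cover.
covered_bases <- function(start, end) {
  unique(unlist(Map(function(s, e) seq(s, e - 1), start, end)))
}

# Toy intervals on a single chromosome (made-up coordinates).
x_bases <- covered_bases(start = c(0, 20), end = c(10, 30))  # 20 bp covered
y_bases <- covered_bases(start = c(5, 25), end = c(15, 40))  # 25 bp covered

len_i <- length(intersect(x_bases, y_bases))  # bases covered by both sets
len_u <- length(union(x_bases, y_bases))      # bases covered by either set

# First form of the formula: intersection over union.
jaccard <- len_i / len_u

# Second form: |x| + |y| - |x n y| in the denominator gives the same value.
jaccard_alt <- len_i / (length(x_bases) + length(y_bases) - len_i)

print(c(len_i = len_i, len_u = len_u, jaccard = jaccard))
```

Here the two sets share 10 bases out of 35 covered in total, so both forms give 10 / 35. Expanding intervals into individual positions is only practical for small examples.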

Interval statistics can be used in combination with dplyr::group_by() and dplyr::do() to calculate statistics for subsets of data. See vignette('interval-stats') for examples.

Examples

genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

x <- bed_random(genome, seed = 1010486)
y <- bed_random(genome, seed = 9203911)

bed_jaccard(x, y)
#> # A tibble: 1 x 4
#>       len_i      len_u jaccard      n
#>       <dbl>      <dbl>   <dbl>  <dbl>
#> 1 235607860 1708554359   0.160 399752

# calculate jaccard per chromosome
bed_jaccard(dplyr::group_by(x, chrom), dplyr::group_by(y, chrom))
#> # A tibble: 25 x 5
#>    chrom    len_i     len_u jaccard     n
#>    <chr>    <dbl>     <dbl>   <dbl> <dbl>
#>  1 chr1  18949776 137312717   0.160 32155
#>  2 chr10 10297917  75375993   0.158 17547
#>  3 chr11 10381872  74526141   0.162 17394
#>  4 chr12 10258612  73803981   0.161 17335
#>  5 chr13  8940241  63744976   0.163 14952
#>  6 chr14  8200728  59377643   0.160 13949
#>  7 chr15  7867717  56694256   0.161 13385
#>  8 chr16  6871740  49891604   0.160 11639
#>  9 chr17  6247903  45089451   0.161 10508
#> 10 chr18  5911670  43075918   0.159 10050
#> # ... with 15 more rows
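
As a check on the genome-wide result above, the reported jaccard value can be reproduced from the len_i and len_u columns. The numbers match len_i / (len_u - len_i) rather than len_i / len_u, which suggests that len_u in this printed output is the summed length |x| + |y| rather than the merged union. This is an inference from the values shown, not a documented guarantee:

$$ \frac{235607860}{1708554359 - 235607860} = \frac{235607860}{1472946499} \approx 0.160, \qquad \frac{235607860}{1708554359} \approx 0.138 $$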