Got junk sequences in your sequencing FASTQs?
Many experiments are riddled with repetitive sequences which are of no value to the scientific question you are asking. Some examples include ribosomal RNA and hemoglobin. DASH (1) is a CRISPR-Cas9 next-generation sequencing (NGS) technology which depletes abundant sequences, increasing coverage of your sequences of interest.
In the lab, you take final NGS libraries ready for sequencing, and combine them with pre-programmed Cas9 guide RNAs (gRNAs) which cut up your repetitive sequences into small fragments. You then size select and amplify your remaining DNA, which should contain most or only your sequences of interest.
DASHit (2) is the software that designs Cas9-gRNAs to target your particular experiment.
- Gu, W. et al. Depletion of Abundant Sequences by Hybridization (DASH): using Cas9 to remove unwanted high-abundance species in sequencing libraries and molecular counting applications. Genome Biology 17, 41 (2016).
Dynerman, D and Lyden, A. et al.
- Lyden, A. et al. DASH Protocol V.4.
Please visit the GitHub repository for instructions on how to install
In order for DASH to be maximally effective, we want to pick the fewest guides which hit the largest possible number of sequences that you want depleted. The best way to find the over-abundant sequences in your data is to look at it!
DASHit uses reads from a preliminary, low-depth sequencing of your samples to identify abundant CRISPR-Cas9 cut sites. It has a filtering function for the guides it identifies, which allows you to specify on and off target sequences, and filter based on GC content and secondary structure. It then computes an optimized guide list by producing a set number of guides which hit the largest number of sequences in the preliminary data.
For even more detail, please see the
DASHit paper (2).
DASHit is a collection of several tools you can use to design gRNAs that target the sequences you want.
INPUT: A FASTQ file
input.fastq containing the abundant sequences you want to target for depletion.
NOTE: We usually subsample
input.fastqdown to several hundred thousand reads with
- Use seqtk (or another tool) to convert the input to FASTA format
seqtk seq -A input.fastq > input.fasta
- Trim adaptor sequences using cutadapt (or another tool). This will avoid designing a guide which cuts your adaptor sequence, rendering your library unsequencable.
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -o cut-input.fasta input.fasta
The above is an example command for trimming adapters from NEBNext paired-end libraries. You will need to change the adapter sequence if you used a different library prep method.
crispr_sites -rto find candidate Cas9-gRNAs in your input.
cat cut-input.fasta | crispr_sites -r > input_sites_to_reads.txt
Each line of
input_sites_to_reads.txtcontains a 20-mer Cas9 guide RNA found in
input.fasta, together with a list of reads that 20-mer appeared in
(Optional, but you’ll probably want this) You may want to restrict your gRNAs to a particular ontarget region, and also disqualify guides that hit some offtarget region.
- Obtain FASTA files of your ontarget and offtarget regions
crispr_sitesto find ontarget and offtarget gRNAs
cat ontarget.fasta | crispr_sites > ontarget_sites.txt cat offtarget.fasta | crispr_sites > offtarget_sites.txt
For the ontarget and offtarget files, run
- Filter your candidate guide RNAs, removing those with low-quality structure (GC content, homopolymer, dinucleotide repeats, and hairpins). Optionally remove all guides RNAs present in
offtarget_sites.txt, and those not present in
dashit_filter --gc_freq_min 5 --gc_freq_max 15 --ontarget ontarget.txt --offtarget offtarget.txt input_sites_to_reads.txt > input_sites_to_reads_filtered.txt
dashit_filter --helpfor an explanation of the quality filtering and to learn how to change quality thresholds.
ontarget and offtarget filtering require port 8080 to be available on your computer.
- Find 300 guides that hit the largest number of reads
optimize_guides input_sites_to_reads_filtered.txt 300 1 > guides.csv
You should experiment with different numbers of guides
number of times to cover each readoption, here set to 1, is how many guides need to hit a read in
input.fastabefore that read is considered covered. In principle you could use this for additional redundancy, but in practice we’ve never needed anything other than 1
guides.csvto see how we did. Open this file in Excel and plot the last column. This plot will show the cumulative number of reads hit in
input.fastathe designed guides hit. You can use this plot to pick the number of guides to use: look for the “elbow” in the plot, and notice where diminishing returns on the number of guides kicks in. We see around 50 guides are already covering almost 100% of our
input.fasta. Ordering 300 guides would be overkill, in this case
- Score how well your guides do against
input.fasta- this should match the plot from
score_guides guides.csv input.fasta
score_guidesis mainly useful to estimate how an existing guide set would perform against a new set of samples. You might not have to design new guides!
guides.csv, your DASH guides!
For an example on real data, see here.