DNA sequencing

DNA sequencing is the way we find out the order of tiny building blocks in DNA. These blocks are called nucleotides, and there are four types: adenine, thymine, cytosine, and guanine.

An example of the results of automated chain-termination DNA sequencing

Knowing the DNA sequence is important for many things. It helps scientists study living things, and it is useful in medical diagnosis, biotechnology, forensic biology, virology, and systematics. It can even help doctors find diseases and decide the best way to treat a patient.

DNA sequencing has gotten much better over time. It now lets scientists read complete genomes, meaning all of the DNA, of humans, animals, plants, and tiny organisms. In the 1970s, scientists used slow and laborious methods. Later, they used fluorescent dyes and automated machines to make the process faster and easier.

Applications

DNA sequencing is a useful tool that helps scientists learn about the genetic code. It can show the order of parts in genes, bigger areas of DNA, whole chromosomes, or even the entire genomes of living things. It also helps in studying RNA and proteins by looking at the parts of DNA that make them.

This technology is important in many areas, such as molecular biology, where it helps researchers learn about genes and diseases. In evolutionary biology, sequencing shows how different species are related. It is also used in metagenomics to find tiny organisms in places like water or soil. In virology, sequencing helps scientists study viruses and see how they change. In medicine, it can help find genetic diseases and guide treatment. In forensic science, DNA sequencing helps identify people by their unique genetic patterns.

The four canonical bases

Main article: Nucleotide

DNA is made up of four main building blocks called bases: thymine (T), adenine (A), cytosine (C), and guanine (G). DNA sequencing is a way to find out the exact order of these bases in a piece of DNA. While most DNA uses just these four bases, some viruses use slightly different building blocks, and bases can also be chemically modified, for example by methylation. Scientists are still finding new variations on these building blocks.

History

Discovery of DNA structure and function

DNA was discovered in 1869 by Friedrich Miescher. For many years, scientists thought proteins held genetic information. This changed in 1944, when Oswald Avery, Colin MacLeod, and Maclyn McCarty showed that DNA could change the traits of bacteria. In 1953, James Watson and Francis Crick proposed the double-helix model of DNA, showing how genetic information is stored and passed on.

Frederick Sanger, a pioneer of sequencing. Sanger is one of the few scientists who was awarded two Nobel prizes, one for the sequencing of proteins, and the other for the sequencing of DNA.

RNA sequencing

RNA sequencing was one of the earliest kinds of sequencing. In the 1970s, Walter Fiers and his team at the University of Ghent in Belgium were the first to sequence a complete gene (1972) and then the complete genome of a small virus called bacteriophage MS2 (1976).

Early DNA sequencing methods

The first methods for reading DNA sequences were developed in the 1970s. Scientists like Ray Wu at Cornell University used special techniques to read DNA sequences. Later, Walter Gilbert and Allan Maxam at Harvard created a method to sequence DNA. Around the same time, Frederick Sanger and Alan Coulson developed a method that made DNA sequencing faster and easier.

Sequencing of full genomes

The first complete DNA genome sequenced was that of a small virus called bacteriophage φX174, in 1977. In 1984, scientists sequenced the Epstein-Barr virus, which was a big step because they knew little about it before. By 1995, scientists had sequenced the entire genome of a bacterium called Haemophilus influenzae. A big international project called the Human Genome Project published a draft of the human genome in 2001 and declared it essentially complete in 2003. By 2022, scientists had filled in the last missing pieces.

History of sequencing technology

High-throughput sequencing (HTS) methods

In the late 1990s and early 2000s, scientists developed faster ways to sequence DNA, called "next-generation" or "second-generation" sequencing. These methods break the genome into tiny pieces and sequence many pieces at once, making it possible to sequence entire genomes quickly. These advances have helped scientists learn more about health, human history, and personalized medicine.

Main article: Whole genome sequencing

Basic methods

Main article: Maxam-Gilbert sequencing

Main article: Sanger sequencing

Two main ways help us read the order of DNA building blocks.

The first way was made by Allan Maxam and Walter Gilbert in 1977. It used special chemicals to cut DNA at certain spots. This method was complex and needed special materials, so it was not used much after better ways were made.

The second way was made by Frederick Sanger in 1977. It became very popular because it was simpler and more reliable. Scientists later made Sanger’s method better by using fluorescent labels and machines to do the work automatically. This made DNA sequencing faster and cheaper. The Sanger method helped sequence the first human genome in 2001, starting the field of genomics. After that, even newer methods made sequencing quicker and more affordable.

Today, most DNA sequencing uses a method called “sequencing by synthesis.” This method watches a special enzyme as it adds DNA building blocks one by one to make a new copy of DNA. By seeing which block is added each time, scientists can read the DNA sequence. This method is used in many modern machines that can handle large amounts of DNA very quickly.
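As a rough illustration of this idea, the sketch below simulates reading a DNA strand one added base at a time. It is a toy model under simple assumptions, not how any real machine works: it ignores the direction of the strands, reading errors, and the chemistry used to detect each base. The names `COMPLEMENT` and `sequence_by_synthesis` are made up for this example.

```python
# Base pairing rules: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sequence_by_synthesis(template):
    """Simulate copying a template strand and recording each added base.

    Returns the newly made strand and the template sequence inferred from it.
    """
    detected = []
    for template_base in template:
        # The enzyme adds the base that pairs with the template base.
        # The "machine" records which base was added in this cycle.
        detected.append(COMPLEMENT[template_base])
    new_strand = "".join(detected)
    # Knowing the pairing rules, the template is read back from the new strand.
    inferred_template = "".join(COMPLEMENT[base] for base in new_strand)
    return new_strand, inferred_template


new_strand, inferred = sequence_by_synthesis("ATGCCGTA")
print(new_strand)  # TACGGCAT
print(inferred)    # ATGCCGTA
```

Each pass through the loop stands for one cycle in which a single base is added and detected.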

Large-scale sequencing and de novo sequencing

Genomic DNA is fragmented into random pieces and cloned as a bacterial library. DNA from individual bacterial clones is sequenced and the sequence is assembled by using overlapping DNA regions.

Large-scale sequencing helps scientists study very long pieces of DNA, like entire chromosomes. Scientists break the DNA into smaller pieces. These pieces are copied many times and sequenced. The small pieces are then put back together like a puzzle to make the full DNA sequence.

"De novo sequencing" means finding out a DNA sequence from scratch, without any prior knowledge. One common method is called "shotgun sequencing." In this method, DNA is broken into random pieces. Each piece is sequenced, and then all the pieces are put together based on how they overlap. This method works well for sequencing large amounts of DNA.

High-throughput methods

Multiple, fragmented sequence reads must be assembled together on the basis of their overlapping areas.
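Putting reads together by their overlaps can be sketched with a simple "greedy" approach: keep joining the two reads that share the longest overlap until nothing more can be joined. This is only a minimal sketch. Real assemblers use more careful methods and must handle errors and repeats. The function names, the example reads, and the minimum overlap of 3 bases are all assumptions made for this illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest end of `a` that matches the start of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining reads overlap enough to be joined
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up reads taken from the made-up sequence ATGGCGTGCAATCC.
print(greedy_assemble(["TGCAATCC", "ATGGCGT", "GCGTGCA"]))
# ['ATGGCGTGCAATCC']
```

If the reads do not overlap enough, the function returns several separate pieces instead of one sequence.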

High-throughput sequencing is used for tasks like exome sequencing, genome sequencing, and transcriptome profiling. These tools help scientists study many pieces of DNA quickly. This makes research faster and more affordable.

These advances have improved how we learn about living things and help doctors treat illnesses better. Companies like Illumina, Qiagen, and Thermo Fisher Scientific have worked to make these tools better.

Comparison of high-throughput sequencing methods
Method	Read length	Accuracy (single read not consensus)	Reads per run	Time per run	Cost per 1 billion bases (in US$)	Advantages	Disadvantages
Single-molecule real-time sequencing (Pacific Biosciences)	30,000 bp (N50); maximum read length >100,000 bases	87% raw-read accuracy	4,000,000 per Sequel 2 SMRT cell, 100–200 gigabases	30 minutes to 20 hours	$7.2-$43.3	Fast. Detects 4mC, 5mC, 6mA.	Moderate throughput. Equipment can be very expensive.
Ion semiconductor (Ion Torrent sequencing)	up to 600 bp	99.6%	up to 80 million	2 hours	$66.8-$950	Less expensive equipment. Fast.	Homopolymer errors.
Pyrosequencing (454)	700 bp	99.9%	1 million	24 hours	$10,000	Long read size. Fast.	Runs are expensive. Homopolymer errors.
Sequencing by synthesis (Illumina)	MiniSeq, NextSeq: 75–300 bp; MiSeq: 50–600 bp; HiSeq 2500: 50–500 bp; HiSeq 3/4000: 50–300 bp; HiSeq X: 300 bp	99.9% (Phred30)	MiniSeq/MiSeq: 1–25 million; NextSeq: 130–400 million; HiSeq 2500: 300 million – 2 billion; HiSeq 3/4000: 2.5 billion; HiSeq X: 3 billion	1 to 11 days, depending upon sequencer and specified read length	$5 to $150	Potential for high sequence yield, depending upon sequencer model and desired application.	Equipment can be very expensive. Requires high concentrations of DNA.
Combinatorial probe anchor synthesis (cPAS, BGI/MGI)	BGISEQ-50: 35–50 bp; MGISEQ 200: 50–200 bp; BGISEQ-500, MGISEQ-2000: 50–300 bp	99.9% (Phred30)	BGISEQ-50: 160M; MGISEQ 200: 300M; BGISEQ-500: 1300M per flow cell; MGISEQ-2000: 375M per FCS flow cell, 1500M per FCL flow cell.	1 to 9 days depending on instrument, read length and number of flow cells run at a time.	$5–$120
Sequencing by ligation (SOLiD sequencing)	50+35 or 50+50 bp	99.9%	1.2 to 1.4 billion	1 to 2 weeks	$60–130	Low cost per base.	Slower than other methods. Has issues sequencing palindromic sequences.
Nanopore Sequencing	Dependent on library preparation, not the device, so user chooses read length (up to 2,272,580 bp reported).	~92–97% single read	dependent on read length selected by user	data streamed in real time. Choose 1 min to 48 hrs	$7–100	Longest individual reads. Accessible user community. Portable (Palm sized).	Lower throughput than other machines, Single read accuracy in 90s.
GenapSys Sequencing	Around 150 bp single-end	99.9% (Phred30)	1 to 16 million	Around 24 hours	$667	Low-cost of instrument ($10,000)
Chain termination (Sanger sequencing)	400 to 900 bp	99.9%	N/A	20 minutes to 3 hours	$2,400,000	Useful for many applications.	More expensive and impractical for larger sequencing projects. This method also requires the time-consuming step of plasmid cloning or PCR.
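The "Phred30" in the accuracy column refers to a Phred quality score. A Phred score Q is linked to the chance p that a base was read wrongly by the formula Q = -10 × log10(p), so a score of 30 means a 1 in 1,000 chance of error, or 99.9% accuracy. The short sketch below shows the conversion in both directions. The function names are made up for this example.

```python
import math


def phred_to_error_probability(q):
    """Chance that a base is wrong, given its Phred quality score."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score for a given chance of error."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    accuracy = 1 - phred_to_error_probability(q)
    print(f"Q{q}: {accuracy:.1%} accuracy")
# Q10: 90.0% accuracy
# Q20: 99.0% accuracy
# Q30: 99.9% accuracy
```

Every step of 10 in the Phred score makes an error ten times less likely.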

Methods in development

DNA sequencing methods that are still in development include reading the sequence as DNA passes through tiny openings called nanopores. Scientists also use special types of microscopy to see where nucleotides sit in long DNA pieces. These new ways try to make sequencing faster, cheaper, and simpler.

One new idea uses electrical measurements to tell DNA bases apart as they go through a channel. Another way uses short pieces of known DNA to find unknown sequences. Scientists can also use tools like mass spectrometry to weigh DNA fragments and spot small changes. These methods help when studying human DNA, especially if the DNA is damaged. Other ways use tiny chips to run many tests at once or watch how DNA copies itself.

Market share

In 2022, a company named Illumina had about 80% of the market for DNA sequencing. Other companies, like PacBio, Oxford Nanopore, 454, and MGI, made up the rest. This suggests that most sequencing was done with Illumina's technology.

Sample preparation

Before we can read the code inside DNA, scientists prepare samples from living things such as plants or animals. They take out DNA or RNA and make sure the strands stay long and undamaged. If they take out RNA, they change it into a special kind of DNA called complementary DNA (cDNA) to study it more easily.
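The step of changing RNA into cDNA can be pictured as building the matching DNA strand for an RNA strand. The sketch below shows only the base-pairing rule, under the assumption of a perfect, error-free copy. In the lab this is done by an enzyme, not a computer. The names `RNA_TO_CDNA` and `reverse_transcribe` are made up for this example.

```python
# Pairing rules for copying RNA into DNA.
# RNA uses uracil (U) where DNA uses thymine (T).
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return the cDNA strand that pairs with the given RNA strand.

    The two strands run in opposite directions, so the RNA is read
    backwards while the matching DNA bases are written out.
    """
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna))


print(reverse_transcribe("AUGGCC"))  # GGCCAT
```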

Depending on the method used to read the DNA, extra steps might be needed. Some methods need special treatments before reading can start. Scientists always check the quality of their samples to make sure everything is just right for clear results.

Development initiatives

In October 2006, the X Prize Foundation began a project to improve whole-genome sequencing, which means reading all the DNA in a person. This project, called the Archon X Prize, offered $10 million to the first group that could make a machine to sequence 100 human genomes quickly and accurately.

Each year, the National Human Genome Research Institute gives money for new research and inventions in the study of DNA. This includes creating new and better ways to read DNA.

Computational challenges

DNA sequencing produces many short pieces of data, called reads. Scientists need to fit these pieces together like a puzzle. Some parts of DNA repeat many times, which can make it tricky to know where each piece belongs.

Scientists use special computer programs to help organize and check the reads. These programs can trim away low-quality parts of the data that might cause mistakes. After getting the data, there is still much work to understand it using biology and computer science tools.
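The trouble with repeats can be seen in a small example. The sketch below looks for every place where a short piece matches a longer sequence. A piece that comes from a repeated part matches in more than one place, so its true position is unclear. The sequence and the function name `find_positions` are made up for this illustration.

```python
def find_positions(genome, read):
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# A made-up sequence in which "GCTT" appears twice.
genome = "AAGCTTCAGCTTGG"

print(find_positions(genome, "TTCA"))  # [4]     one possible place
print(find_positions(genome, "GCTT"))  # [2, 8]  two possible places
```

Longer pieces help, because a piece that reaches past the repeat into unique DNA can only fit in one place.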

Read Trimming Algorithms
Name of algorithm	Type of algorithm
Cutadapt	Running sum
ConDeTri	Window based
ERNE-FILTER	Running sum
FASTX quality trimmer	Window based
PRINSEQ	Window based
Trimmomatic	Window based
SolexaQA	Window based
SolexaQA-BWA	Running sum
Sickle	Window based
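The two types of algorithm in the table can be sketched in a few lines. A window-based method slides a small window along the quality scores and cuts the read where the average quality drops too low. A running-sum method subtracts a cutoff from each score, adds the results up starting from the end of the read, and cuts where the sum is lowest. These are simplified illustrations of the two ideas, not the actual code of any tool listed above. The function names, window size, and cutoffs are assumptions.

```python
def sliding_window_trim(qualities, window=4, min_mean=15):
    """Window based: number of bases to keep from the start of the read.

    The read is cut at the first window whose average quality
    falls below `min_mean`.
    """
    for start in range(len(qualities) - window + 1):
        scores = qualities[start:start + window]
        if sum(scores) / window < min_mean:
            return start
    return len(qualities)


def running_sum_trim(qualities, cutoff=10):
    """Running sum: number of bases to keep from the start of the read.

    Working backwards from the end, add up (quality - cutoff) and
    cut at the position where this sum is lowest.
    """
    running = 0
    lowest = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < lowest:
            lowest = running
            keep = i
    return keep


# A made-up read whose quality gets worse towards the end.
read = "ACGTACGTAC"
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]

print(read[:sliding_window_trim(quals, window=4, min_mean=15)])  # ACG
print(read[:running_sum_trim(quals, cutoff=10)])                 # ACGT
```

The two methods can give slightly different results on the same read, as they do here.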

Ethical issues

Further information: Bioethics

DNA sequencing has raised important questions about fairness and safety. One big issue is who can see the information from your DNA test. Another is consent: people should be asked before their DNA is used. Some people worry that this information could be used in unfair ways. For example, there are concerns that insurers might use it to change prices.

Laws like the Genetic Information Nondiscrimination Act in the United States help protect people from being treated unfairly because of their DNA. However, some experts think we need more rules to keep everyone safe, especially with new and more detailed DNA tests. These tests can sometimes show information about you and your family.