DNA sequencing

DNA sequencing is the process of determining the nucleic acid sequence, that is, the order of nucleotides in DNA. It covers any method used to find the order of the four bases: adenine, thymine, cytosine, and guanine. Fast DNA sequencing methods have greatly sped up research in biology and medicine.

An example of the results of automated chain-termination DNA sequencing

Knowing DNA sequences is very important for many reasons. It helps in basic biology research, DNA Genographic Projects, medical diagnosis, biotechnology, forensic biology, virology, and systematics. By comparing healthy DNA with mutated DNA, doctors can find diseases like cancer and decide the best treatment for each patient. Quick DNA sequencing means better and faster medical care, and it helps us identify and catalog many living things.

DNA sequencing technology has improved a lot. It has allowed scientists to read the complete DNA sequences, called genomes, of humans and many animals, plants, and microbes. At first, in the early 1970s, scientists used slow and laborious methods based on two-dimensional chromatography. Later, with fluorescence-based methods and automated DNA sequencers, DNA sequencing became much easier and faster.

Applications

DNA sequencing is a powerful tool that helps scientists understand the genetic code. It can be used to find the order of nucleotides in individual genes, larger genetic areas, whole chromosomes, or even the entire genomes of any living thing. It also helps in studying RNA and proteins by looking at the parts of DNA that code for them.

This technology is important in many fields, including molecular biology, where it helps researchers learn about genes and diseases. In evolutionary biology, sequencing helps us see how different species are related. It is also used in metagenomics to identify the many tiny organisms living in places like water or soil. In virology, sequencing helps scientists study viruses, even very small ones, and understand how they change over time. In medicine, it can help diagnose genetic diseases and guide treatment. In forensic investigation, DNA sequencing helps identify individuals by their unique genetic patterns.

The four canonical bases

Main article: Nucleotide

DNA is made up of four main building blocks called bases: thymine (T), adenine (A), cytosine (C), and guanine (G). DNA sequencing is a way to find out the exact order of these bases in a piece of DNA. While most DNA uses just these four bases, some viruses and special cases can have slightly different building blocks. Scientists keep learning more about all the different ways DNA can be put together.
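
As a small illustration, a sequenced stretch of DNA is usually written as a string of the letters A, C, G, and T. The sketch below uses a made-up example sequence and hypothetical function names (they are not from any particular library). It counts each canonical base and builds the reverse complement, which is the sequence of the opposite strand of the double helix.

```python
from collections import Counter

CANONICAL_BASES = "ACGT"
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def count_bases(sequence):
    """Return how many times each canonical base occurs in the sequence."""
    counts = Counter(sequence.upper())
    unexpected = set(counts) - set(CANONICAL_BASES)
    if unexpected:
        # Anything other than A, C, G, T is rejected in this simple sketch.
        raise ValueError(f"non-canonical symbols: {sorted(unexpected)}")
    return {base: counts.get(base, 0) for base in CANONICAL_BASES}


def reverse_complement(sequence):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


example = "GATTACA"  # made-up sequence, for illustration only
print(count_bases(example))
print(reverse_complement(example))
```

Real sequence files often contain extra symbols (for example N for an unknown base), which this sketch deliberately rejects.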

History

Discovery of DNA structure and function

DNA was first discovered in 1869 by Friedrich Miescher, but for many years, scientists thought proteins carried genetic information. This changed in 1944 when Oswald Avery, Colin MacLeod, and Maclyn McCarty showed that DNA could change the traits of bacteria. In 1953, James Watson and Francis Crick proposed the famous double-helix model of DNA, explaining how genetic information is stored and passed on.

Frederick Sanger, a pioneer of sequencing. Sanger is one of the few scientists who was awarded two Nobel prizes, one for the sequencing of proteins, and the other for the sequencing of DNA.

RNA sequencing

RNA sequencing was one of the earliest forms of nucleotide sequencing. In the 1970s, Walter Fiers and his team at the University of Ghent in Belgium were the first to sequence a complete gene (in 1972) and then the complete genome of a small virus called bacteriophage MS2 (in 1976).

Early DNA sequencing methods

The first methods for reading DNA sequences were developed in the 1970s. In 1970, Ray Wu at Cornell University established the first method for determining DNA sequences, which was based on extending a primer placed at a known location. Later, Walter Gilbert and Allan Maxam at Harvard created a method called chemical degradation to sequence DNA. Around the same time, Frederick Sanger and Alan Coulson developed a different method called the "Plus and Minus" method, which helped make DNA sequencing faster and easier.

Sequencing of full genomes

The first complete DNA genome to be sequenced was that of a small virus called bacteriophage φX174, in 1977. In 1984, scientists finished sequencing the Epstein-Barr virus, which was a big step because they did it without knowing much about the virus's genetics beforehand. In 1995, scientists sequenced the entire genome of a bacterium called Haemophilus influenzae, marking another important milestone. A draft of the human genome was published in 2001, a big international project called the Human Genome Project was declared complete in 2003, and by 2022 scientists had finally filled in the last missing pieces.

History of sequencing technology

High-throughput sequencing (HTS) methods

In the late 1990s and early 2000s, scientists developed new, faster ways to sequence DNA, called "next-generation" or "second-generation" sequencing. These methods break the genome into tiny pieces and sequence many pieces at once, making it possible to sequence entire genomes quickly. These advances have helped scientists learn more about health, human history, and personalized medicine.

Main article: Whole genome sequencing

Basic methods

Main article: Maxam-Gilbert sequencing

Main article: Sanger sequencing

Two main methods were used to read the order of DNA building blocks. The first method, created by Allan Maxam and Walter Gilbert in 1977, used special chemicals to cut DNA at specific points. This method needed radioactive materials and was complex, so it wasn’t used much after better methods were created.

The second method, developed by Frederick Sanger in 1977, became the most popular. It was simpler and more reliable than the first method. Over time, scientists made many improvements to Sanger’s method, like using fluorescent labels and machines to automate the process. This helped make DNA sequencing faster and cheaper. The Sanger method was used to produce the first human genome sequence in 2001, which opened up the field of genomics. Later, newer methods made sequencing faster and more affordable still.

Today, most DNA sequencing uses a method called “sequencing by synthesis.” This method watches how a special enzyme adds DNA building blocks one by one to make a new copy of DNA. By seeing which building block is added each time, scientists can read the DNA sequence. This method is used in many modern machines that can sequence huge amounts of DNA quickly.
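
The idea of reading a sequence by watching one base being added per cycle can be sketched in a few lines. This is a toy simulation under simplifying assumptions, not a model of any real instrument: there are no errors, every cycle adds exactly one base, and strand direction is ignored. The function name is hypothetical.

```python
# Watson-Crick pairing: the enzyme adds the base that pairs with the template.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sequence_by_synthesis(template):
    """Simulate reading a template strand one incorporation cycle at a time.

    Returns the newly built strand and the template sequence inferred from it.
    """
    observed = []
    for template_base in template:
        incorporated = COMPLEMENT[template_base]  # base added in this cycle
        observed.append(incorporated)             # the "signal" we record
    new_strand = "".join(observed)
    # Knowing which base was added each cycle tells us the template base.
    inferred_template = "".join(COMPLEMENT[base] for base in new_strand)
    return new_strand, inferred_template


new_strand, inferred = sequence_by_synthesis("TACGGA")  # made-up template
print(new_strand)
print(inferred)
```

In this sketch the recorded signal is simply the letter of the added base; real machines detect it indirectly, for example through light or an electrical change.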

Large-scale sequencing and de novo sequencing

Genomic DNA is fragmented into random pieces and cloned as a bacterial library. DNA from individual bacterial clones is sequenced and the sequence is assembled by using overlapping DNA regions.

Large-scale sequencing helps scientists study very long pieces of DNA, like entire chromosomes. To do this, scientists break the DNA into smaller pieces, which are then copied many times and studied one by one. These small pieces are put back together like a puzzle to form the full DNA sequence.

"De novo sequencing" means figuring out a DNA sequence from scratch, without any prior knowledge. One common method is called "shotgun sequencing," where DNA is broken into random pieces, each piece is sequenced, and then all the pieces are assembled based on how they overlap. This method works well for sequencing large amounts of DNA but can be tricky when there are repeating patterns.
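
The overlap-based assembly described above can be sketched with a simple greedy strategy: repeatedly merge the two pieces that share the longest overlap. This is a minimal illustration with made-up reads and hypothetical function names, assuming error-free reads from one strand. Real assemblers use more elaborate methods, and a greedy approach like this one can go wrong when the sequence contains repeats.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge reads pairwise by longest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up reads that overlap by four bases each.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))
```

The `min_overlap` parameter is a design choice of this sketch: requiring a minimum overlap reduces the chance of joining two pieces that match only by coincidence.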

High-throughput methods

Multiple, fragmented sequence reads must be assembled together on the basis of their overlapping areas.

High-throughput sequencing is used for applications such as exome sequencing, genome sequencing, and transcriptome profiling. These technologies allow scientists to sequence thousands or millions of DNA pieces at once, making it faster and cheaper to study genes.

These advances have changed how we understand biology and improve medical treatments. Companies like Illumina, Qiagen, and Thermo Fisher Scientific have led in making these technologies better and more accessible.

Comparison of high-throughput sequencing methods
Method	Read length	Accuracy (single read, not consensus)	Reads per run	Time per run	Cost per 1 billion bases (US$)	Advantages	Disadvantages
Single-molecule real-time sequencing (Pacific Biosciences)	30,000 bp (N50); maximum read length >100,000 bases	87% raw-read accuracy	4,000,000 per Sequel 2 SMRT cell, 100–200 gigabases	30 minutes to 20 hours	$7.2-$43.3	Fast. Detects 4mC, 5mC, 6mA.	Moderate throughput. Equipment can be very expensive.
Ion semiconductor (Ion Torrent sequencing)	up to 600 bp	99.6%	up to 80 million	2 hours	$66.8-$950	Less expensive equipment. Fast.	Homopolymer errors.
Pyrosequencing (454)	700 bp	99.9%	1 million	24 hours	$10,000	Long read size. Fast.	Runs are expensive. Homopolymer errors.
Sequencing by synthesis (Illumina)	MiniSeq, NextSeq: 75–300 bp; MiSeq: 50–600 bp; HiSeq 2500: 50–500 bp; HiSeq 3/4000: 50–300 bp; HiSeq X: 300 bp	99.9% (Phred30)	MiniSeq/MiSeq: 1–25 Million; NextSeq: 130-00 Million; HiSeq 2500: 300 million – 2 billion; HiSeq 3/4000 2.5 billion; HiSeq X: 3 billion	1 to 11 days, depending upon sequencer and specified read length	$5 to $150	Potential for high sequence yield, depending upon sequencer model and desired application.	Equipment can be very expensive. Requires high concentrations of DNA.
Combinatorial probe anchor synthesis (cPAS, BGI/MGI)	BGISEQ-50: 35–50 bp; MGISEQ 200: 50–200 bp; BGISEQ-500, MGISEQ-2000: 50–300 bp	99.9% (Phred30)	BGISEQ-50: 160M; MGISEQ 200: 300M; BGISEQ-500: 1300M per flow cell; MGISEQ-2000: 375M (FCS flow cell), 1500M (FCL flow cell)	1 to 9 days, depending on instrument, read length, and number of flow cells run at a time	$5–$120
Sequencing by ligation (SOLiD sequencing)	50+35 or 50+50 bp	99.9%	1.2 to 1.4 billion	1 to 2 weeks	$60–130	Low cost per base.	Slower than other methods. Has issues sequencing palindromic sequences.
Nanopore Sequencing	Dependent on library preparation, not the device, so the user chooses the read length (up to 2,272,580 bp reported).	~92–97% single read	Dependent on read length selected by user	Data streamed in real time; run time chosen from 1 minute to 48 hours	$7–100	Longest individual reads. Accessible user community. Portable (palm-sized).	Lower throughput than other machines. Single-read accuracy in the 90s (percent).
GenapSys Sequencing	Around 150 bp single-end	99.9% (Phred30)	1 to 16 million	Around 24 hours	$667	Low-cost of instrument ($10,000)
Chain termination (Sanger sequencing)	400 to 900 bp	99.9%	N/A	20 minutes to 3 hours	$2,400,000	Useful for many applications.	More expensive and impractical for larger sequencing projects. This method also requires the time-consuming step of plasmid cloning or PCR.
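
Several accuracy values in the table are given as Phred scores, such as "99.9% (Phred30)". A Phred score Q is defined from the probability p that a base is wrong as Q = -10 × log10(p), so Q30 means an error chance of 1 in 1,000. The short sketch below converts between the two; the function names are hypothetical.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (0 to 1)."""
    error_probability = 10 ** (-q / 10)
    return 1 - error_probability


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (0 to 1, below 1) to a Phred quality score."""
    error_probability = 1 - accuracy
    return -10 * math.log10(error_probability)


# Values taken from the accuracy column of the table above.
for label, accuracy in [("99.9%", 0.999), ("99.6%", 0.996), ("87%", 0.87)]:
    print(label, "is about Q", round(accuracy_to_phred(accuracy), 1))
```

Because the scale is logarithmic, each step of 10 in the score means ten times fewer errors.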

Methods in development

DNA sequencing methods that are still being developed include using tiny openings called nanopores to read the DNA sequence and using special types of microscopy to see the positions of nucleotides in long DNA pieces. These newer methods aim to make sequencing faster, cheaper, and easier.

One new approach uses electrical measurements to tell the difference between DNA bases as they move through a channel. Another method uses tiny pieces of known DNA to match and identify unknown sequences. Scientists also use tools like mass spectrometry to weigh DNA pieces and find tiny differences. These methods are helpful for studying human DNA, especially in cases where the DNA is damaged. Other techniques involve using tiny chips to do many tests at once or watching how DNA molecules move during copying.

Market share

In 2022, one company called Illumina had around 80% of the market for DNA sequencing. A few other companies, like PacBio, Oxford Nanopore, 454, and MGI, made up the rest of the market. This means most DNA sequencing was done using Illumina's machines and methods.

Sample preparation

Before the code hidden inside DNA can be read, scientists need to carefully prepare the sample, such as cells from plants or animals. They extract DNA or RNA, taking care that the strands stay long and undamaged. If they extract RNA, they turn it into a special kind of DNA called complementary DNA (cDNA) so it can be studied more easily.
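
The step from RNA to cDNA follows the base-pairing rules, so the result can be worked out on paper or in code. The sketch below is a simplified illustration with a made-up RNA sequence and a hypothetical function name. It only models the pairing (A with T, U with A, G with C, C with G), not the laboratory procedure.

```python
# Pairing between RNA bases and the DNA bases of the complementary strand.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(rna):
    """Return the DNA strand complementary to `rna`, written 5' to 3'.

    The complementary strand runs in the opposite direction, so the RNA
    is read backwards while each base is replaced by its partner.
    """
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna.upper()))


print(first_strand_cdna("AUGGCC"))  # made-up RNA sequence
```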

Depending on the method used to read the DNA, extra steps might be needed. Some methods need special treatments before reading can start, and scientists always check the quality of their samples to make sure everything is just right for getting clear results.

Development initiatives

In October 2006, the X Prize Foundation started an effort to encourage better ways to sequence a person's whole genome. This effort, called the Archon X Prize, aimed to give $10 million to the first group that could build a machine to sequence 100 complete human genomes very quickly and accurately.

Each year, the National Human Genome Research Institute offers money for new research and inventions in the study of DNA. This includes work on new and better ways to read DNA.

Computational challenges

DNA sequencing creates lots of small pieces of data that scientists need to piece together, like solving a puzzle. This process has many challenges, especially when dealing with parts of DNA that repeat many times. These repeating sections can make it hard to know exactly where each piece belongs in the full picture.
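
The problem with repeats can be shown with a tiny example. In the sketch below, a made-up reference sequence contains the unit CAG repeated four times. A short piece taken from inside the repeat fits in several places, while a piece that includes unique sequence fits in only one. The function name is hypothetical.

```python
def find_placements(reference, read):
    """Return every position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further along so overlapping matches are found.
        start = reference.find(read, start + 1)
    return positions


# Made-up reference: unique start, the unit CAG four times, unique end.
reference = "GATTC" + "CAG" * 4 + "TTGCA"

print(find_placements(reference, "CAGCAG"))  # piece from inside the repeat
print(find_placements(reference, "GATTCC"))  # piece with unique sequence
```

Longer reads that reach from the unique sequence on one side of a repeat to the other side avoid this ambiguity, which is one reason long-read methods are useful for assembly.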

Scientists use special computer programs to help organize and check these pieces of DNA. Some of these programs trim away low-quality parts of the data that might cause mistakes later on. Even after getting the raw data, there’s still a lot of work to understand it fully using biology and computer science tools.

Read Trimming Algorithms
Name of algorithm	Type of algorithm
Cutadapt	Running sum
ConDeTri	Window based
ERNE-FILTER	Running sum
FASTX quality trimmer	Window based
PRINSEQ	Window based
Trimmomatic	Window based
SolexaQA	Window based
SolexaQA-BWA	Running sum
Sickle	Window based
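
The two types of algorithm in the table can be sketched as follows. Both functions take a list of per-base quality scores and return how many bases to keep from the start of the read. These are simplified illustrations with hypothetical function names and made-up quality values, not the actual code of any tool listed above. The running-sum version follows the general idea described in the Cutadapt documentation for its quality trimming: subtract a cutoff from every score, add up the results from the end of the read, and cut where the total is most favorable.

```python
def window_trim(qualities, window=4, threshold=20):
    """Window based: cut at the first window whose mean quality is too low."""
    for start in range(0, len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < threshold:
            return start  # keep only the bases before this window
    return len(qualities)  # no window failed (or the read is shorter than one)


def running_sum_trim(qualities, threshold=20):
    """Running sum: walk from the end, summing (threshold - quality)."""
    running = 0
    best = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += threshold - qualities[i]
        if running < 0:
            break  # good bases now outweigh the bad ones seen so far
        if running > best:
            best = running
            keep = i  # cutting here removes the most low-quality signal
    return keep


# Made-up quality scores that get worse toward the end of the read.
qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(window_trim(qualities, window=4, threshold=10))
print(running_sum_trim(qualities, threshold=10))
```

A practical difference is that the running-sum approach can tolerate a single slightly better base inside a poor region (the score of 11 in the example), while a window-based approach smooths over it by averaging.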

Ethical issues

Further information: Bioethics

DNA sequencing has raised important ethical questions, especially about privacy and consent. One big issue is who owns your DNA and the information from your DNA test. While you don’t own your discarded cells, you should be asked before they are used. There are also worries about how this genetic information might be used. For example, some are concerned that insurers could use it to change insurance prices.

Laws like the Genetic Information Nondiscrimination Act in the United States help protect against discrimination based on genetic information in health insurance and jobs. However, experts say more protection may be needed, especially with newer, more detailed DNA testing methods. These tests can sometimes reveal information about not just you, but also your family members.