DNA sequencing
Adapted from Wikipedia
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It helps us find the order of the four bases: adenine, thymine, cytosine, and guanine. Because of quick DNA sequencing methods, scientists can study biology and medicine much faster now.
Knowing DNA sequences is very important for many reasons. It helps in basic biology research, DNA Genographic Projects, medical diagnosis, biotechnology, forensic biology, virology, and systematics. By comparing healthy DNA with mutated DNA, doctors can find diseases like cancer and decide the best treatment for each patient. Quick DNA sequencing means better and faster medical care, and it helps us identify and catalog many living things.
DNA sequencing technology has improved a lot. It has let scientists read the complete DNA sequences, called genomes, of humans and many animals, plants, and tiny organisms. At first, in the early 1970s, scientists used slow, difficult methods based on two-dimensional chromatography. Later, fluorescence-based methods and automated DNA sequencers made DNA sequencing much easier and faster.
Applications
DNA sequencing is a powerful tool that helps scientists understand the genetic code. It can be used to find the order of nucleotides in individual genes, larger genetic areas, whole chromosomes, or even the entire genomes of any living thing. It also helps in studying RNA and proteins by looking at the parts of DNA that code for them.
This technology is important in many fields, including molecular biology, where it helps researchers learn about genes and diseases. In evolutionary biology, sequencing helps us see how different species are related. It is also used in metagenomics to identify the many tiny organisms living in places like water or soil. In virology, sequencing helps scientists study viruses, even very small ones, and understand how they change over time. In medicine, it can help diagnose genetic diseases and guide treatment. In forensic investigation, DNA sequencing helps identify individuals by their unique genetic patterns.
The four canonical bases
Main article: Nucleotide
DNA is made up of four main building blocks called bases: thymine (T), adenine (A), cytosine (C), and guanine (G). DNA sequencing is a way to find out the exact order of these bases in a piece of DNA. While most DNA uses just these four bases, some viruses and special cases can have slightly different building blocks. Scientists keep learning more about all the different ways DNA can be put together.
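A DNA sequence is often written as a string of the letters A, C, G, and T. The short Python sketch below counts how often each base appears and groups any other symbol under "other". The function name `count_bases` and the example sequence are made up for illustration.

```python
from collections import Counter

CANONICAL_BASES = "ACGT"  # adenine, cytosine, guanine, thymine


def count_bases(sequence):
    """Count each canonical base; anything else is tallied as 'other'."""
    counts = Counter(sequence.upper())
    result = {base: counts.get(base, 0) for base in CANONICAL_BASES}
    result["other"] = sum(
        n for symbol, n in counts.items() if symbol not in CANONICAL_BASES
    )
    return result


# Made-up example; 'N' is used here to stand for an unknown base.
print(count_bases("ATGCGATTN"))
```

Keeping a separate "other" count is one simple way to notice symbols that are not one of the four usual bases.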
History
Discovery of DNA structure and function
DNA was first discovered in 1869 by Friedrich Miescher, but for many years, scientists thought proteins carried genetic information. This changed in 1944 when Oswald Avery, Colin MacLeod, and Maclyn McCarty showed that DNA could change the traits of bacteria. In 1953, James Watson and Francis Crick proposed the famous double-helix model of DNA, explaining how genetic information is stored and passed on.
RNA sequencing
RNA sequencing was one of the first ways scientists looked at the building blocks of genes. In the 1970s, Walter Fiers and his team at the University of Ghent in Belgium were the first to sequence a complete gene (in 1972) and then the complete genome of a small virus called bacteriophage MS2 (in 1976).
Early DNA sequencing methods
The first methods for reading DNA sequences were developed in the 1970s. Ray Wu at Cornell University used a technique called primer extension to read short stretches of DNA. Later, Walter Gilbert and Allan Maxam at Harvard created a method called chemical degradation to sequence DNA. Around the same time, Frederick Sanger and Alan Coulson developed a different method called the "Plus and Minus" method, which helped make DNA sequencing faster and easier.
Sequencing of full genomes
The first complete DNA genome to be sequenced was that of a small virus called bacteriophage φX174, in 1977. In 1984, scientists finished sequencing the Epstein-Barr virus, which was a big step because they did it without knowing much about the virus's genes beforehand. By 1995, scientists had sequenced the entire genome of a bacterium called Haemophilus influenzae, marking another important milestone. By 2001, a big international project called the Human Genome Project had created a draft of the human genome. The project was declared complete in 2003, and by 2022 scientists had finally filled in the last missing pieces.
High-throughput sequencing (HTS) methods
In the late 1990s and early 2000s, scientists developed new, faster ways to sequence DNA, called "next-generation" or "second-generation" sequencing. These methods break the genome into tiny pieces and sequence many pieces at once, making it possible to sequence entire genomes quickly. These advances have helped scientists learn more about health, human history, and personalized medicine.
Main article: Whole genome sequencing
Basic methods
Main article: Maxam-Gilbert sequencing
Main article: Sanger sequencing
Two main methods were used to read the order of DNA building blocks. The first method, created by Allan Maxam and Walter Gilbert in 1977, used special chemicals to cut DNA at specific points. This method needed radioactive materials and was complex, so it wasn’t used much after better methods were created.
The second method, developed by Frederick Sanger in 1977, became the most popular. It was simpler and more reliable than the first method. Over time, scientists made many improvements to Sanger’s method, like using fluorescent labels and machines to automate the process. This helped make DNA sequencing faster and cheaper. The Sanger method was used to sequence the first human genome in 2001, which opened up the field of genomics. Later, even newer methods made sequencing even faster and more affordable.
Today, most DNA sequencing uses a method called “sequencing by synthesis.” This method watches how a special enzyme adds DNA building blocks one by one to make a new copy of DNA. By seeing which building block is added each time, scientists can read the DNA sequence. This method is used in many modern machines that can sequence huge amounts of DNA quickly.
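The idea of sequencing by synthesis can be sketched in a few lines of Python. This is a toy model, not how any real machine works: it assumes every base is detected perfectly and it ignores the direction of the DNA strands. The name `sequence_by_synthesis` is made up for this example.

```python
# Base-pairing rules: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sequence_by_synthesis(template):
    """Simulate copying a template strand one base at a time.

    Returns a tuple (new_strand, template_read_back).
    """
    added = []
    for base in template:
        # The enzyme adds the base that pairs with the template base,
        # and the machine records which base was added.
        added.append(COMPLEMENT[base])
    new_strand = "".join(added)
    # Each recorded base tells us which template base it paired with.
    read_back = "".join(COMPLEMENT[base] for base in new_strand)
    return new_strand, read_back


new_strand, read_back = sequence_by_synthesis("GATTACA")
print(new_strand)  # the copy that was built
print(read_back)   # the template, worked out from the copy
```

Because the pairing rules are fixed, knowing which base was added at each step is enough to work out the original strand.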
Large-scale sequencing and de novo sequencing
Large-scale sequencing helps scientists study very long pieces of DNA, like entire chromosomes. To do this, scientists break the DNA into smaller pieces, which are then copied many times and studied one by one. These small pieces are put back together like a puzzle to form the full DNA sequence.
"De novo sequencing" means figuring out a DNA sequence from scratch, without any prior knowledge. One common method is called "shotgun sequencing," where DNA is broken into random pieces, each piece is sequenced, and then all the pieces are assembled based on how they overlap. This method works well for sequencing large amounts of DNA but can be tricky when there are repeating patterns.
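The "puzzle" step of shotgun sequencing can be sketched as a greedy overlap assembler: find the two pieces that overlap the most, join them, and repeat. This is a simplified illustration under strong assumptions (no sequencing errors, all pieces from the same strand, a made-up minimum overlap of 3 bases). Real assemblers are far more sophisticated, and the function names here are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest end of `a` that matches the start of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Made-up fragments of a short made-up sequence.
pieces = ["CATGGT", "AGCTTGAC", "TGACCATG"]
print(greedy_assemble(pieces))
```

When a sequence contains repeats, several pieces can overlap equally well in more than one place, so a greedy approach like this one may join them in the wrong order. That is the difficulty with repeating patterns described above.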
High-throughput methods
High-throughput sequencing includes methods like exome sequencing, genome sequencing, and transcriptome profiling. These technologies allow scientists to sequence thousands or millions of DNA pieces at once, making it faster and cheaper to study genes.
These advances have changed how we understand biology and improve medical treatments. Companies like Illumina, Qiagen, and ThermoFisher Scientific have led in making these technologies better and more accessible.
| Method | Read length | Accuracy (single read not consensus) | Reads per run | Time per run | Cost per 1 billion bases (in US$) | Advantages | Disadvantages |
|---|---|---|---|---|---|---|---|
| Single-molecule real-time sequencing (Pacific Biosciences) | 30,000 bp (N50); maximum read length >100,000 bases | 87% raw-read accuracy | 4,000,000 per Sequel 2 SMRT cell, 100–200 gigabases | 30 minutes to 20 hours | $7.2-$43.3 | Fast. Detects 4mC, 5mC, 6mA. | Moderate throughput. Equipment can be very expensive. |
| Ion semiconductor (Ion Torrent sequencing) | up to 600 bp | 99.6% | up to 80 million | 2 hours | $66.8-$950 | Less expensive equipment. Fast. | Homopolymer errors. |
| Pyrosequencing (454) | 700 bp | 99.9% | 1 million | 24 hours | $10,000 | Long read size. Fast. | Runs are expensive. Homopolymer errors. |
| Sequencing by synthesis (Illumina) | MiniSeq, NextSeq: 75–300 bp; MiSeq: 50–600 bp; HiSeq 2500: 50–500 bp; HiSeq 3/4000: 50–300 bp; HiSeq X: 300 bp | 99.9% (Phred30) | MiniSeq/MiSeq: 1–25 Million; NextSeq: 130-00 Million; HiSeq 2500: 300 million – 2 billion; HiSeq 3/4000 2.5 billion; HiSeq X: 3 billion | 1 to 11 days, depending upon sequencer and specified read length | $5 to $150 | Potential for high sequence yield, depending upon sequencer model and desired application. | Equipment can be very expensive. Requires high concentrations of DNA. |
| Combinatorial probe anchor synthesis (cPAS, BGI/MGI) | BGISEQ-50: 35–50 bp; MGISEQ 200: 50–200 bp; BGISEQ-500, MGISEQ-2000: 50–300 bp | 99.9% (Phred30) | BGISEQ-50: 160M; MGISEQ 200: 300M; BGISEQ-500: 1300M per flow cell; MGISEQ-2000: 375M per FCS flow cell, 1500M per FCL flow cell | 1 to 9 days depending on instrument, read length and number of flow cells run at a time. | $5–$120 | | |
| Sequencing by ligation (SOLiD sequencing) | 50+35 or 50+50 bp | 99.9% | 1.2 to 1.4 billion | 1 to 2 weeks | $60–130 | Low cost per base. | Slower than other methods. Has issues sequencing palindromic sequences. |
| Nanopore Sequencing | Dependent on library preparation, not the device, so user chooses read length (up to 2,272,580 bp reported). | ~92–97% single read | dependent on read length selected by user | data streamed in real time. Choose 1 min to 48 hrs | $7–100 | Longest individual reads. Accessible user community. Portable (Palm sized). | Lower throughput than other machines, Single read accuracy in 90s. |
| GenapSys Sequencing | Around 150 bp single-end | 99.9% (Phred30) | 1 to 16 million | Around 24 hours | $667 | Low instrument cost ($10,000) | |
| Chain termination (Sanger sequencing) | 400 to 900 bp | 99.9% | N/A | 20 minutes to 3 hours | $2,400,000 | Useful for many applications. | More expensive and impractical for larger sequencing projects. This method also requires the time-consuming step of plasmid cloning or PCR. |
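Several rows in the table give accuracy as "99.9% (Phred30)". A Phred quality score Q is a standard way of writing the chance that a base is wrong: the error probability is 10 raised to the power of -Q/10. The small Python sketch below converts between the two forms. The function names are made up for this example.

```python
import math


def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred quality score Q."""
    error_probability = 10 ** (-q / 10)
    return 1.0 - error_probability


def accuracy_to_phred(accuracy):
    """Phred score for a per-base accuracy (must be below 1.0)."""
    return -10 * math.log10(1.0 - accuracy)


print(phred_to_accuracy(30))    # about 0.999, i.e. 99.9%
print(accuracy_to_phred(0.87))  # a little under 9
```

So a Phred score of 30 means about one wrong base in every 1,000, while an 87% raw-read accuracy corresponds to a Phred score of roughly 9.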
Methods in development
DNA sequencing methods that are still being developed include using tiny openings called nanopores to read the DNA sequence and using special types of microscopy to see the positions of nucleotides in long DNA pieces. These newer methods aim to make sequencing faster, cheaper, and easier.
One new approach uses electrical measurements to tell the difference between DNA bases as they move through a channel. Another method uses tiny pieces of known DNA to match and identify unknown sequences. Scientists also use tools like mass spectrometry to weigh DNA pieces and find tiny differences. These methods are helpful for studying human DNA, especially in cases where the DNA is damaged. Other techniques involve using tiny chips to do many tests at once or watching how DNA molecules move during copying.
Market share
In 2022, one company called Illumina had around 80% of the market for DNA sequencing. A few other companies, like PacBio, Oxford Nanopore, 454, and MGI, made up the rest of the market. This means most scientists used Illumina's machines and methods to read DNA.
Sample preparation
Before the code hidden inside DNA can be read, scientists need to carefully prepare a small sample of material, such as cells from plants or animals. They extract DNA or RNA, making sure the strands stay long and undamaged. If they extract RNA, they turn it into a special kind of DNA called complementary DNA (cDNA) so it can be studied more easily.
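Turning RNA into cDNA follows the base-pairing rules, with the RNA base uracil (U) pairing with adenine (A). The Python sketch below shows only the letter-by-letter idea. In the lab this step is carried out by an enzyme, and the function name `reverse_transcribe` is made up for this example.

```python
# Pairing between an RNA base and the DNA base placed opposite it.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return the first-strand cDNA for an RNA sequence.

    The result is reversed so that it is written in the same
    direction (5' to 3') as the input.
    """
    return "".join(RNA_TO_DNA[base] for base in reversed(rna.upper()))


print(reverse_transcribe("AUGGCC"))  # made-up RNA sequence
```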
Depending on the method used to read the DNA, extra steps might be needed. Some methods need special treatments before reading can start, and scientists always check the quality of their samples to make sure everything is just right for getting clear results.
Development initiatives
In October 2006, the X Prize Foundation started an effort to help create better ways to read all the DNA in a person. This effort, called the Archon X Prize, aimed to give $10 million to the first group that could build a machine to sequence 100 human genomes within 10 days, with very few errors.
Each year, the National Human Genome Research Institute offers money for new research and inventions in the study of DNA. This includes work on new and better ways to read DNA.
Computational challenges
DNA sequencing creates lots of small pieces of data that scientists need to piece together, like solving a puzzle. This process has many challenges, especially when dealing with parts of DNA that repeat many times. These repeating sections can make it hard to know exactly where each piece belongs in the full picture.
Scientists use special computer programs to help organize and check these pieces of DNA. Some of these programs trim away low-quality parts of the data that might cause mistakes later on, a step called read trimming. The table below lists several trimming programs and the kind of algorithm each one uses. Even after getting the raw data, there’s still a lot of work to understand it fully using biology and computer science tools.
| Name of algorithm | Type of algorithm |
|---|---|
| Cutadapt | Running sum |
| ConDeTri | Window based |
| ERNE-FILTER | Running sum |
| FASTX quality trimmer | Window based |
| PRINSEQ | Window based |
| Trimmomatic | Window based |
| SolexaQA | Window based |
| SolexaQA-BWA | Running sum |
| Sickle | Window based |
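The two algorithm types in the table can be sketched in Python. These are simplified illustrations of the general ideas only, not the actual code of any program listed. The function names, the default thresholds, and the quality scores are all made up for this example. Each function takes the list of quality scores for one read and returns how many bases to keep from the start.

```python
def trim_window_based(quals, window=4, min_mean=20):
    """Window based: slide a window along the read and cut at the
    first window whose average quality is below `min_mean`."""
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return start
    return len(quals)


def trim_running_sum(quals, cutoff=10):
    """Running sum: walk backwards from the end of the read, adding up
    (quality - cutoff), and cut where that running total is lowest."""
    running = 0
    lowest = 0
    keep = len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += quals[i] - cutoff
        if running < lowest:
            lowest = running
            keep = i
    return keep


# Made-up quality scores: good at the start, poor at the end.
qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(trim_window_based(qualities))  # bases kept by the window method
print(trim_running_sum(qualities))   # bases kept by the running sum
```

A window-based method reacts to the average quality of a small stretch of bases, while a running-sum method looks for the point where the whole tail of the read is at its worst compared with the cutoff.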
Ethical issues
Further information: Bioethics
DNA sequencing has raised important ethical questions, especially about privacy and consent. One big issue is who owns your DNA and the information from your DNA test. Courts in the United States have ruled that people do not own their discarded cells, but that they should be asked before the cells are used. There are also worries about how this genetic information might be used. For example, some people are concerned that insurers could use it to change insurance prices.
Laws like the Genetic Information Nondiscrimination Act in the United States help protect against discrimination based on genetic information in health insurance and jobs. However, experts say more protection may be needed, especially with newer, more detailed DNA testing methods. These tests can sometimes reveal information about not just you, but also your family members.
This article is a child-friendly adaptation of the Wikipedia article on DNA sequencing, available under CC BY-SA 4.0.