Coding for SSDs – Part 2: Architecture of an SSD and Benchmarking
This is Part 2 of 6 of “Coding for SSDs”, covering Sections 1 and 2. For other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while documenting myself on SSDs, and on how to make code perform well on SSDs. If you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.
In this part, I explain the basics of NAND-flash memory, the different cell types, and the internal architecture of an SSD. I also cover SSD benchmarking and how to interpret those benchmarks.
To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling out the form at the top right corner of the blog.
As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

1. Structure of an SSD

1.1 NAND-flash memory cells

A solid-state drive (SSD) is a flash-memory based data storage device. Bits are stored in cells, which are made of floating-gate transistors. SSDs are made entirely of electronic components: there are no moving or mechanical parts like in hard drives.
Voltages are applied to the floating-gate transistors, which is how bits are read, written, and erased. Two solutions exist for wiring transistors: NOR flash memory and NAND flash memory. I will not go into more detail regarding the differences between NOR and NAND flash memory. This article only covers NAND flash memory, which is the solution chosen by the majority of manufacturers. For more information on the differences between NOR and NAND, you can refer to this article by Lee Hutchinson [31].
An important property of NAND-flash modules is that their cells wear out and therefore have a limited lifespan. Indeed, the transistors forming the cells store bits by holding electrons. At each P/E cycle (i.e. Program/Erase, where “Program” means write), electrons might get trapped in the transistor by mistake, and after some time, too many electrons will have been trapped and the cell becomes unusable.
Limited lifespan
Each cell has a maximum number of P/E cycles (Program/Erase), after which the cell is considered defective. NAND-flash memory wears out and has a limited lifespan. The different types of NAND-flash memory have different lifespans [31].
Recent research has shown that by applying very high temperatures to NAND chips, trapped electrons can be cleared out [14, 51]. The lifespan of SSDs could be tremendously increased, though this is still at the research stage and there is no certainty that it will one day reach the consumer market.
The types of cells currently present in the industry are:
Single level cell (SLC), in which transistors can store only 1 bit but have a long lifespan
Multiple level cell (MLC), in which transistors can store 2 bits, at the cost of a higher latency and reduced lifespan compared to SLC
Triple-level cell (TLC), in which transistors can store 3 bits, but at an even higher latency and reduced lifespan
Memory cell types
A solid-state drive (SSD) is a flash-memory based data storage device. Bits are stored in cells, which exist in three types: 1 bit per cell (single level cell, SLC), 2 bits per cell (multiple level cell, MLC), and 3 bits per cell (triple-level cell, TLC).
Table 1 below shows detailed information for each NAND-flash cell type. For the sake of comparison, I have also added the average latencies of hard drives, main memory (RAM), and L1/L2 caches.
|                    | SLC  | MLC  | TLC  | HDD       | RAM      | L1 cache | L2 cache |
|--------------------|------|------|------|-----------|----------|----------|----------|
| P/E cycles         | 100k | 10k  | 5k   | *         | *        | *        | *        |
| Bits per cell      | 1    | 2    | 3    | *         | *        | *        | *        |
| Seek latency (μs)  | *    | *    | *    | 9000      | *        | *        | *        |
| Read latency (μs)  | 25   | 50   | 100  | 2000-7000 | 0.04-0.1 | 0.001    | 0.004    |
| Write latency (μs) | 250  | 900  | 1500 | 2000-7000 | 0.04-0.1 | 0.001    | 0.004    |
| Erase latency (μs) | 1500 | 3000 | 5000 | *         | *        | *        | *        |

Notes
* metric is not applicable for that type of memory

Sources
P/E cycles [20]
SLC/MLC latencies [1]
TLC latencies [23]
Hard disk drive latencies [18, 19, 25]
RAM latencies [30, 52]
L1 and L2 cache latencies [52]

Table 1: Characteristics and latencies of NAND-flash memory types compared to other memory components
Having more bits for the same number of transistors reduces manufacturing costs. SLC-based SSDs are known to be more reliable and to have a longer life expectancy than MLC-based SSDs, but at a higher manufacturing cost. Therefore, most general public SSDs are MLC- or TLC-based, and only professional SSDs are SLC-based. Choosing the right memory type depends on the workload the drive will be used for, and on how often the data is likely to be updated. For high-update workloads, SLC is the best choice, whereas for high-read and low-write workloads (ex: video storage and streaming), TLC will be perfectly fine. Moreover, benchmarks of TLC drives under a workload of realistic usage show that the lifespan of TLC-based SSDs is not a concern in practice [36].

NAND-flash pages and blocks
Cells are grouped into a grid, called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually, only whole blocks can be erased. The size of a NAND-flash page can vary, and most drives have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of size 2048 KB, and each block contains 256 pages of 8 KB each. The way pages and blocks can be accessed is covered in detail in Section 3.1.
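To make the page and block arithmetic concrete, here is a minimal sketch using the Samsung SSD 840 EVO numbers quoted above (8 KB pages, 256 pages per block). The values are simply the ones from this section, not something queried from a real drive.

```python
# Page/block geometry sketch, using the Samsung 840 EVO numbers quoted above.
PAGE_SIZE_KB = 8          # smallest unit that can be read or written
PAGES_PER_BLOCK = 256     # pages are grouped into one erase block

block_size_kb = PAGE_SIZE_KB * PAGES_PER_BLOCK   # 8 * 256 = 2048 KB per block

# A write smaller than a page still occupies a whole page, and freeing
# any single page requires erasing the whole block it belongs to.
write_size_kb = 1
pages_consumed = -(-write_size_kb // PAGE_SIZE_KB)   # ceiling division -> 1 page

print(f"block size: {block_size_kb} KB")
print(f"a {write_size_kb} KB write occupies {pages_consumed} page(s), "
      f"i.e. {pages_consumed * PAGE_SIZE_KB} KB of flash")
```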
1.2 Organization of an SSD

Figure 1 below represents an SSD drive and its main components. I have simply reproduced the basic schematic already presented in various papers [2, 3, 6].
Commands come from the user through the host interface. At the moment I am writing this article, the two most common interfaces for newly released SSDs are Serial ATA (SATA) and PCI Express (PCIe). The processor in the SSD controller takes the commands and passes them to the flash controller. SSDs also have embedded RAM, generally for caching purposes and to store mapping information. Section 4 covers mapping policies in more detail. The packages of NAND-flash memory are organized in gangs, over multiple channels, which is covered in Section 6.
Figures 2 and 3 below, reproduced from StorageReview.com [26, 27], show what SSDs look like in real life. Figure 2 shows the 512 GB version of the Samsung 840 Pro SSD, released in August 2013. As can be seen on the circuit board, the main components are:
Figure 2: Samsung SSD 840 Pro (512 GB) — Pictures courtesy of StorageReview.com [26]

Figure 3 is a Micron P420m Enterprise PCIe SSD, released in late 2013. The main components are:
8 lanes of a PCI Express 2.0 interface
1 SSD controller
1 RAM module (DRAM DDR3)
64 MLC NAND-flash modules over 32 channels, each module offering 32 GB of storage (Micron 31C12NQ314 25nm)
The total memory is 2048 GB, but only 1.4 TB are available after over-provisioning.
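As a quick aside, the gap between the raw capacity and the user-visible capacity is the over-provisioned space. Here is a minimal sketch of the usual arithmetic, using the P420m figures quoted above; expressing over-provisioning relative to the user-visible capacity is a common convention, not something specified by Micron.

```python
# Over-provisioning sketch, using the Micron P420m figures quoted above.
raw_capacity_gb = 2048      # total NAND soldered on the board
usable_capacity_gb = 1400   # capacity exposed to the host (1.4 TB)

reserved_gb = raw_capacity_gb - usable_capacity_gb
# A common convention is to express over-provisioning relative to the
# user-visible capacity.
op_ratio = reserved_gb / usable_capacity_gb

print(f"reserved: {reserved_gb} GB ({op_ratio:.0%} of user capacity)")  # ~46%
```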
1.3 Manufacturing process

Many SSD manufacturers use surface-mount technology (SMT) to produce SSDs, a production method in which electronic components are placed directly on top of printed circuit boards (PCBs). SMT lines are composed of a chain of machines, each machine being plugged into the next and having a specific task to perform in the process, such as placing components or melting the solder. Multiple quality checks are also performed throughout the entire process. Photos and videos of SMT lines can be seen in two articles by Steve Burke [67, 68], in which he visited the production facilities of Kingston Technology in Fountain Valley, California, and in an article by Cameron Wilmot about the Kingston installations in Taiwan [69].
Other interesting resources are two videos, the first one about the Crucial SSDs by Micron [70] and the second one about Kingston [71]. In the latter, which is part of Steve Burke’s articles and which I have also embedded below, Mark Tekunoff from Kingston gives a tour of one of their SMT lines. Interesting detail: everyone in the video is wearing cute antistatic pyjamas and seems to be having a lot of fun!

2. Benchmarking and performance metrics

2.1 Basic benchmarks

Table 2 below shows the throughput for sequential and random workloads on different solid-state drives. For the sake of comparison, I have included SSDs released in 2008 and 2013, along with one hard drive and one RAM memory chip.
Notes
* metric is not applicable for that storage solution
** measured with 2 MB chunks, not 4 KB

Metrics
MB/s: Megabytes per Second
KIOPS: Kilo IOPS, i.e. 1000 Input/Output Operations Per Second

Sources
Samsung 64 GB [21]
Intel X25-M [2, 28]
Samsung SSD 840 EVO [22]
Micron P420m [27]
Western Digital Black 4 TB [25]
Corsair Vengeance DDR3 RAM [30]

Table 2: Characteristics and throughput of solid-state drives compared to other storage solutions
An important factor for performance is the host interface. The most common interfaces for newly released SSDs are SATA 3.0 and PCI Express 3.0. On a SATA 3.0 interface, data can be transferred at up to 6 Gbit/s, which in practice gives around 550 MB/s, and on a PCIe 3.0 interface, data can be transferred at up to 8 GT/s per lane, which in practice is roughly 1 GB/s (GT/s stands for gigatransfers per second). SSDs using the PCIe 3.0 interface typically use more than a single lane. With four lanes, PCIe 3.0 can offer a maximum bandwidth of 4 GB/s, which is eight times faster than SATA 3.0. Some enterprise SSDs also offer a Serial Attached SCSI (SAS) interface, which in its latest version can offer up to 12 Gbit/s, although at the moment SAS represents only a tiny fraction of the market.
Most recent SSDs are fast enough internally to easily reach the 550 MB/s limitation of SATA 3.0, therefore the interface is the bottleneck for them. SSDs using PCI Express 3.0 or SAS offer tremendous performance increases [15].

PCI Express and SAS are faster than SATA
The two main host interfaces offered by manufacturers are SATA 3.0 (550 MB/s) and PCI Express 3.0 (1 GB/s per lane, using multiple lanes). Serial Attached SCSI (SAS) is also available for enterprise SSDs. In their latest versions, PCI Express and SAS are faster than SATA, but they are also more expensive.
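The rough arithmetic behind those interface numbers is easy to reproduce. Below is a minimal sketch assuming the nominal line rates and the standard encodings (8b/10b for SATA 3.0, 128b/130b for PCIe 3.0); real drives land a bit below these ceilings because of protocol overhead.

```python
# Theoretical ceilings of common host interfaces, before protocol overhead.

def sata3_mb_per_s():
    # SATA 3.0: 6 Gbit/s line rate, 8b/10b encoding (10 bits on the wire per data byte)
    return 6e9 / 10 / 1e6                      # ~600 MB/s (about 550 MB/s in practice)

def pcie3_mb_per_s(lanes):
    # PCIe 3.0: 8 GT/s per lane, 128b/130b encoding
    per_lane = 8e9 * (128 / 130) / 8 / 1e6     # ~985 MB/s per lane
    return per_lane * lanes

print(f"SATA 3.0:    {sata3_mb_per_s():.0f} MB/s")
print(f"PCIe 3.0 x1: {pcie3_mb_per_s(1):.0f} MB/s")
print(f"PCIe 3.0 x4: {pcie3_mb_per_s(4):.0f} MB/s")    # ~3.9 GB/s, roughly the 4 GB/s quoted above
```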
2.2 Pre-conditioning
If you torture the data long enough, it will confess.
— Ronald Coase
The data sheets provided by SSD manufacturers are filled with amazing performance values. And indeed, by hammering a drive with whatever random operations for long enough, manufacturers seem to always find a way to show shiny numbers in their marketing flyers. Whether or not those numbers really mean anything and allow one to predict the performance of a production system is a different problem.
In his articles about common flaws in SSD benchmarking [66], Marc Bevand mentioned, for instance, that it is common for the IOPS of random write workloads to be reported without any mention of the span of the LBAs accessed, and that many IOPS numbers are reported for a queue depth of 1 instead of the maximum value for the drive being tested. There are also many cases of bugs and misuses of the benchmarking tools.
Correctly assessing the performance of SSDs is not an easy task. Many articles from hardware reviewing blogs run ten minutes of random writes on a drive and claim that the drive is ready to be tested, and that the results can be trusted. However, the performance of SSDs only drops under a sustained workload of random writes, which, depending on the total size of the SSD, can take just 30 minutes or up to three hours. This is why the more serious benchmarks start by applying such a sustained workload of random writes, also called “pre-conditioning” [50]. Figure 7 below, reproduced from an article on StorageReview.com [26], shows the effect of pre-conditioning on multiple SSDs. A clear drop in performance can be observed after around 30 minutes, where the throughput decreases and the latency increases for all drives. It then takes another four hours for the performance to slowly decay to a constant minimum.
Figure 7: Effect of pre-conditioning on multiple SSD models — Pictures courtesy of StorageReview.com [26]
What is happening in Figure 7 essentially is that, as explained in Section 5.2, the amount of random writes is so large and applied in such a sustained way that the garbage collection process is unable to keep up in the background. The garbage collection must erase blocks as write commands arrive, therefore competing with the foreground operations from the host. People using pre-conditioning claim that the benchmarks it produces accurately represent how a drive will behave in its worst possible state. Whether or not this is a good model for how a drive will behave under all workloads is arguable.
In order to compare various models coming from different manufacturers, a common ground must be found, and the worst possible state is a valid one. But picking the drive that performs best under the worst possible workload does not always guarantee that it will perform best under the workload of a production environment. Indeed, in most production environments, an SSD drive will serve one and only one system. That system has a specific workload due to its internal characteristics, and therefore a better and more accurate way to compare different drives would be to run the same replay of this workload on those drives, and then compare their respective performance. This is why, even though pre-conditioning using a sustained workload of random writes allows for a fair comparison of different SSDs, one has to be careful and should, whenever possible, run in-house benchmarks based on the target workload. Benchmarking in-house also avoids over-allocating resources, by not using the “best” SSD model when a cheaper one would be enough and would save a lot of money.

Benchmarking is hard
Testers are humans, therefore not all benchmarks are exempt from errors. Be careful when reading the benchmarks from manufacturers or third parties, and use multiple sources before trusting any numbers. Whenever possible, run your own in-house benchmarking using the specific workload of your system, along with the specific SSD model that you want to use. Finally, make sure you look at the performance metrics that matter most for the system at hand.
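Before moving on to workloads and metrics, here is a minimal, hypothetical sketch of what such a sustained random-write pass looks like: 4 KB writes at random aligned offsets over the full span of a target file, for a fixed duration. The file path, span, and duration are placeholders, and serious benchmarks use a dedicated tool with direct I/O and a controlled queue depth; this only illustrates the access pattern.

```python
import os
import random
import time

# Hypothetical pre-conditioning sketch: sustained 4 KB random writes spread
# over the full span of a target file, for a fixed duration. Real benchmarks
# use a dedicated tool with direct I/O and a controlled queue depth; here the
# writes go through the page cache, so this only illustrates the access pattern.

TARGET = "/tmp/ssd-precondition.bin"   # placeholder path, ideally on the drive under test
SPAN = 1 * 1024**3                     # span of offsets exercised (1 GB here)
CHUNK = 4096                           # 4 KB random writes
DURATION_S = 30 * 60                   # sustained for e.g. 30 minutes

payload = os.urandom(CHUNK)
fd = os.open(TARGET, os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, SPAN)

deadline = time.time() + DURATION_S
while time.time() < deadline:
    offset = random.randrange(SPAN // CHUNK) * CHUNK   # 4 KB-aligned random offset
    os.pwrite(fd, payload, offset)

os.close(fd)
```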
2.3 Workloads and metrics

Performance benchmarks all share the same varying parameters and provide results using the same metrics. In this section, I wish to give some insights as to how to interpret those parameters and metrics.
The parameters used are generally the following:
The type of workload: it can be a specific benchmark based on data collected from users, or just sequential or random accesses of the same type (ex: only random writes)
The percentages of reads and writes performed concurrently (ex: 30% reads and 70% writes)
The queue length: this is the number of concurrent execution threads running commands on a drive
The size of the data chunks being accessed (4 KB, 8 KB, etc.)
Benchmark results are presented using different metrics. The most common are:
Throughput: The speed of transfer, generally in KB/s or MB/s, respectively kilobytes per second, and megabytes per second. This is the metric chosen for sequential benchmarks.
IOPS: the number of Input/Output Operations Per Second, each operation being of the same data chunk size (generally 4 KB). This is the metric chosen for random benchmarks [17].
Latency: the response time of a device after a command is emitted, generally in μs or ms, respectively microseconds or milliseconds.
While the throughput is easy to understand and relate to, IOPS is more difficult to grasp. For example, if a disk shows a performance for random writes of 1000 IOPS for 4 KB chunks, this means that the throughput is 1000 × 4096 bytes ≈ 4 MB/s. Consequently, a high IOPS will translate into a high throughput only if the size of the chunks is the largest possible. A high IOPS at a small average chunk size will translate into a low throughput.
To illustrate this point, let’s imagine that we have a logging system performing tiny updates over thousands of different files per minute, giving a performance of 10k IOPS. Because the updates are spread over so many different files, the throughput could be close to something like 20 MB/s, whereas writing sequentially to only one file with the same system could lead to an increased throughput of 200 MB/s, which is a tenfold improvement. I am making up those numbers for the sake of this example, although they are close to production systems I have encountered.
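The conversion used in these two examples is just a multiplication. Below is a minimal sketch, where the ~2 KB chunk size in the logging example is an assumption picked so that the made-up 10k IOPS and 20 MB/s figures line up.

```python
# IOPS-to-throughput arithmetic from the examples above.

def throughput_mb_per_s(iops, chunk_size_bytes):
    # Throughput is simply the number of operations per second times their size.
    return iops * chunk_size_bytes / 1e6

# 1000 IOPS of 4 KB chunks ~= 4 MB/s
print(throughput_mb_per_s(1000, 4096))      # 4.096

# Made-up logging example: 10k IOPS of small (~2 KB) updates ~= 20 MB/s
print(throughput_mb_per_s(10_000, 2048))    # 20.48
```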
Another concept to grasp is that a high throughput does not necessarily mean a fast system. Indeed, if the latency is high, no matter how good the throughput is, the overall system will be slow. Let’s take the example of a hypothetical single-threaded process that requires connections to 25 databases, each connection having a latency of 20 ms. Because the connection latencies are cumulative, obtaining the 25 connections will require 25 × 20 ms = 500 ms. Therefore, even if the machines running the database queries have fast network cards, let’s say 5 Gbit/s of bandwidth, the script will still be slow due to the latency.
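The same back-of-the-envelope reasoning for the hypothetical 25-connection example, as a minimal sketch:

```python
# Cumulative latency from the hypothetical single-threaded example above.
connections = 25
latency_per_connection_ms = 20

# The process is single-threaded, so the connections are opened one after
# the other and their latencies add up, regardless of the available bandwidth.
total_ms = connections * latency_per_connection_ms
print(f"time spent just opening connections: {total_ms} ms")   # 500 ms
```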
The takeaway from this section is that it is important to keep an eye on all the metrics, as they will show different aspects of the system and will allow you to identify bottlenecks when they come up. When looking at the benchmarks of SSDs and deciding which model to pick, keeping in mind which metric is the most critical to the system in which those SSDs will be used is generally a good rule of thumb. Then of course, nothing will replace proper in-house benchmarking as explained in Section 2.2.

An interesting follow-up on the topic is the article “IOPS are a scam” by Jeremiah Peschka [46].

What’s next
Part 3 is available here. You can also go to the Table of Contents for this series of articles, and if you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.
To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling out the form at the top right corner of the blog.
As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!