I started writing about analyzing memory usage in SSAS when I realized that I should first give some background on my understanding of memory and disk usage in the Windows OS. Hence, I decided to split the blog into two parts.
Memory is an interesting and common topic. I am going to start with the famous question, “Who is going to need more than 64K of RAM?” My response is that I am able to cause an OutOfMemory exception on a box with 1TB of physical RAM. Luckily, the operating systems were written when memory was a very scarce resource. As a foreign student, I did not have the right to work off campus, so I was making a good $2.50/hour after taxes. Spending $800 on a 16MB memory chip from the Gateway outlet was a major investment, which eventually paid off with good dividends. The OS was working with very little memory, attempting to make all the processes happy and satisfy their memory allocation requests.
So how does memory work in Windows? When a process is started, it sees a flat, directly addressable memory space with addresses ranging from 0 to the maximum of its bitness (i.e. 32-bit gives 2^32, or 4GB; 64-bit gives 2^64, etc.). This means that at any time the process's executing code can request the content at any of these addresses. This is also known as the virtual address space, since in reality there may be only a few MBs behind it. So how can the OS trick the process into believing this space exists? The main smoke and mirrors are physical memory (aka Random Access Memory – RAM) and disk files. There is the concept of Memory Mapped Files (MMF), which basically maps disk files into the process virtual address space. What does that mean? The OS creates a map between a starting memory address (say 101) and the first byte of the file; if the file is 50 bytes long, then address 150 will correspond to the end of the file. Every time the process requests the content of an address between 101 and 150, the OS will retrieve the corresponding byte(s) from the file. Hence it is called a map – i.e. when you request “a”, it is looked up in the map and “b” is returned.
So why all this complexity? There are a few reasons (a small code sketch follows the list):
- The OS does NOT have to read the whole file. Have you ever wondered how a file that is 1GB in size can be opened by some processes instantly, while Notepad chokes on it? The 1GB can be mapped into the process address space, and ONLY what the process needs will be read from the file (by issuing a memory read request that is mapped back to an offset within the file). Hence, the OS only needs to make an instant metadata entry, creating the map, and the process is ready to access the file. Alternatively, physically reading the file and copying it into memory would take a while…think Notepad.
- The same file can be shared among multiple processes. The same file (or its already cached bytes) can be mapped into multiple processes. A good example is a Terminal Server with 100 active clients using MS Word. Without MMF, the server would need to load 100 copies of the MS Word binaries; instead, only ONE copy is loaded and it is mapped into 100 processes. As you can see, this makes very efficient use of the available memory, not to mention excellent performance and user experience.
- The mapped file is used as an extension to the physical memory to which the process has access. Think of it as a page file (indeed, it is an extension of the system page file and uses the same concepts). Instead of loading the file into physical memory, it is mapped into the process and can be accessed as if it were physical memory.
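To make the memory-mapped-file idea concrete, here is a minimal sketch using the Win32 file-mapping APIs. The file path is just a placeholder and error handling is trimmed, so treat it as an illustration of the concept rather than production code.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    // Open an existing file (placeholder path) without reading its contents yet.
    HANDLE hFile = CreateFileA("C:\\temp\\big.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    // Create the "map" object; no file data is read here, only metadata is set up.
    HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);

    // Map a view of the file into the virtual address space.
    // The returned pointer plays the role of "address 101" from the text:
    // byte N of the file lives at view[N], and pages are faulted in from
    // disk only when they are actually touched.
    const unsigned char *view = (const unsigned char *)
        MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);

    // Touching an address makes the OS bring just that page in from the file.
    printf("first byte = 0x%02x\n", view[0]);

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    CloseHandle(hFile);
    return 0;
}
```

Note that the mapping call returns almost instantly regardless of the file size, which is exactly the Notepad comparison above.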
Now, when our 100 TS clients start MS Word at the same time, what will the OS do? Read the same file bytes 100 times? That does not sound very efficient. Instead, the OS uses a buffer in physical memory (RAM), a.k.a. a cache, to store the bytes: it physically reads from the disk only once, caches the bytes, and on subsequent requests avoids touching the disk at all. Why is avoiding the disk such a big deal? Memory access is 1,000+ times faster than disk access, so there is a considerable performance difference. (Note: SSDs are shrinking this difference.) The buffer used here is called the File System Cache (i.e. it caches bytes read from files). Since the cache uses physical memory, at some point the system may decide to take that physical memory away (it may be needed to satisfy a process's memory allocation request). Hence this memory is called the “Stand-By Cache”.
Let's take some of these concepts, put them together, and map them to the more common terminology used in Windows. The process has virtual memory, but only some of this memory is actually backed by physical RAM or disk files. When a file can be mapped into one or more processes, the memory it consumes is called “Shared”, since it is NOT specific to a given process. Some of the data is specific to the process; hence it is called “Private”. An example would be a user's Word document. When the user makes modifications to the document, the memory storing the document is called “Modified” (as well as Private). The combination of Shared plus Private memory is known as the process's Working Set. When a file (or a range of bytes) is requested, the OS first checks whether it is present in the File System Cache. If it is not found, it is physically read from disk, placed in the cache, and returned to the process. This is also known as a Hard Fault – a memory page cannot be found in memory and needs to be retrieved with a physical access to the disk. When the requested data is found, it is returned from the cache – this is also known as a Soft Fault.
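If you want to see some of these numbers for your own process, here is a minimal sketch using the psapi call GetProcessMemoryInfo. How the fields map to “Private” and “Working Set” in my comments is my own reading of the structure, so treat it as an illustration rather than a reference.

```c
#include <windows.h>
#include <psapi.h>     // link with psapi.lib
#include <stdio.h>

int main(void)
{
    PROCESS_MEMORY_COUNTERS_EX pmc = { sizeof(pmc) };

    // Query the memory counters of the current process.
    if (GetProcessMemoryInfo(GetCurrentProcess(),
                             (PROCESS_MEMORY_COUNTERS *)&pmc, sizeof(pmc)))
    {
        // WorkingSetSize ~ Shared + Private pages currently resident in RAM.
        printf("Working set  : %zu KB\n", pmc.WorkingSetSize / 1024);
        // PrivateUsage  ~ memory that belongs only to this process.
        printf("Private bytes: %zu KB\n", pmc.PrivateUsage / 1024);
        // PageFaultCount lumps soft and hard faults together.
        printf("Page faults  : %lu\n", pmc.PageFaultCount);
    }
    return 0;
}
```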
With this in mind, when analyzing memory we need to be aware of the flow of bytes in the OS. To summarize the concepts at a higher level: multiple processes make concurrent requests, resulting in Logical IO. Out of these requests, some will be served from the cache and some will be retrieved from disk. There are Performance Monitor counters that can give us the current status of the system. “Process -> IO Data Bytes/sec -> _Total” will give you the Logical IO. “PhysicalDisk -> _Total” (e.g. Disk Bytes/sec) will give you the Physical IO. “Cache -> Copy Reads/sec” (i.e. cached bytes copied into the process address space) will give you the Cached IO. Of course, this is a simplified view of the actual flow, but it makes the concepts easier to digest. The Internet is full of in-depth analyses of the topics above, so feel free to dig deeper. “Windows Internals” is an excellent book on the topic.
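For scripted collection rather than the Performance Monitor UI, a rough sketch with the PDH API could look like the following. The exact counter paths are my assumption of the counters referenced above and may need adjusting on a given system.

```c
#include <windows.h>
#include <pdh.h>       // link with pdh.lib
#include <stdio.h>

int main(void)
{
    PDH_HQUERY   query;
    PDH_HCOUNTER logical, physical, cached;
    PDH_FMT_COUNTERVALUE v;

    PdhOpenQueryA(NULL, 0, &query);

    // Counter paths assumed from the discussion above (English names).
    PdhAddEnglishCounterA(query, "\\Process(_Total)\\IO Data Bytes/sec", 0, &logical);
    PdhAddEnglishCounterA(query, "\\PhysicalDisk(_Total)\\Disk Bytes/sec", 0, &physical);
    PdhAddEnglishCounterA(query, "\\Cache\\Copy Reads/sec", 0, &cached);

    // Rate counters need two samples; wait a second between them.
    PdhCollectQueryData(query);
    Sleep(1000);
    PdhCollectQueryData(query);

    PdhGetFormattedCounterValue(logical,  PDH_FMT_DOUBLE, NULL, &v);
    printf("Logical IO : %.0f bytes/sec\n", v.doubleValue);
    PdhGetFormattedCounterValue(physical, PDH_FMT_DOUBLE, NULL, &v);
    printf("Physical IO: %.0f bytes/sec\n", v.doubleValue);
    PdhGetFormattedCounterValue(cached,   PDH_FMT_DOUBLE, NULL, &v);
    printf("Cached IO  : %.0f copy reads/sec\n", v.doubleValue);

    PdhCloseQuery(query);
    return 0;
}
```

Comparing the logical number against the physical one over time gives a feel for how much of the workload the File System Cache is absorbing.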
There are also some common topics around configuration options. The “/3GB” and “AWE” boot.ini options were hot topics in the 32-bit world. When a process is started, the OS splits the virtual address space in two: 2GB go to the user process and 2GB to the kernel. Some applications, such as MS SQL Server, need to access as much memory as possible, so while 2GB were not enough for the process, the other 2GB sat mostly unused. The “/3GB” switch basically tells the OS to give an extra 1GB of addressable virtual memory to the process and leave only 1GB to the kernel…but then SANs came along with their memory-hungry kernel drivers, and at some point that 1GB of kernel memory was not enough, so there was a need to further tune the memory “split”… At the same time, systems usually had more than 4GB of physical memory; given that the 4GB limit is per process, SQL Server needed physical memory that was available but could not be addressed because of the virtual address space limitation of the 32-bit OS. Hence Address Windowing Extensions (AWE) was introduced, which basically allowed “windows” to be created in a process and mapped to physical memory beyond the 4GB boundary using offsets… Not a new concept – it was heavily used in DOS (Disk Operating System) with the “himem.sys” driver (if anyone remembers it). The topics keep going, but alas, most of them are obsolete; the 64-bit OS took care of them.

The extra memory also created other issues. I have done over 300 performance tuning labs. When I am asked what equipment I need, I go for the slowest machine with the fewest resources. This type of underpowered machine makes performance and other issues appear immediately. So a table scan, a badly written TSQL or MDX query, or a memory leak shows up pretty early in the stress tests. An IIS worker process would be restarted at 1.3GB of memory usage…so it was easy to isolate problematic areas and fix them. The introduction of more memory and faster disk drives makes this task very difficult. The OS resource limits forced developers to pay more attention to how apps are written and to create better apps…but those days are gone.
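Since AWE came up, here is a rough sketch (my own illustration, not taken from any product) of how a process creates such a “window”: physical pages are allocated, a virtual region is reserved, and the pages are mapped into it on demand. It assumes the account holds the “Lock pages in memory” privilege, otherwise the allocation call simply fails.

```c
#include <windows.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    // Number of physical pages we want to "window" into the process.
    ULONG_PTR pageCount  = 256;                      // 256 x 4KB pages = 1 MB
    SIZE_T    regionSize = pageCount * si.dwPageSize;

    // Array that receives the physical page frame numbers.
    ULONG_PTR *pfns = (ULONG_PTR *)HeapAlloc(GetProcessHeap(), 0,
                                             pageCount * sizeof(ULONG_PTR));

    // Grab physical pages (requires the "Lock pages in memory" privilege).
    if (!AllocateUserPhysicalPages(GetCurrentProcess(), &pageCount, pfns)) {
        printf("AllocateUserPhysicalPages failed: %lu\n", GetLastError());
        return 1;
    }

    // Reserve a virtual "window" in the process address space.
    void *window = VirtualAlloc(NULL, regionSize,
                                MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);

    // Map the physical pages into the window; remapping the same window onto
    // different page sets is how a 32-bit server reached memory above 4GB.
    if (!MapUserPhysicalPages(window, pageCount, pfns)) {
        printf("MapUserPhysicalPages failed: %lu\n", GetLastError());
        return 1;
    }

    memset(window, 0xAB, regionSize);                // the pages are now usable

    // Unmap and release.
    MapUserPhysicalPages(window, pageCount, NULL);
    FreeUserPhysicalPages(GetCurrentProcess(), &pageCount, pfns);
    return 0;
}
```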
Now that we know everything about OS memory and how it works, let's look briefly at the disk subsystem. There are a lot of advanced topics (i.e. the Scatter/Gather Win32 APIs) on how requests are made to the disk and, more specifically, how to fill up the 64K buffer so that a single request gets more work done. There are also the concepts of IO queuing and queue depth. Multiple Performance Monitor counters further complicate the discussion. Over the years, I have found a simple approach to analyzing disk IO usage. First, if you have physical access to the server, look at its disk LEDs – if they blink like a Christmas tree, then there is a very good chance that there is an opportunity for tuning. Again, with more memory a lot of data is cached, and the issues are not as easy to spot. Second, check the Disk sec/Read and Disk sec/Write Performance Monitor counters. They are an excellent indication of how the disks are performing and whether they are overwhelmed. This really simplifies the analysis and is very consistent in its results no matter what technology is behind the disk subsystem. The counters represent how many milliseconds are needed for a single disk IO – this also matches the disk manufacturers' definition of disk performance. For disks with spindles: a) for transactional systems, 2-4ms is excellent, 4-10ms is OK, and above 15-20ms is bad; b) for Decision Support Systems (DSS)/Reporting, 15-20ms is excellent, 25-30ms is good, and above 50-80ms is bad. These are rough guidelines, which have worked for me over the years.
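If you want a quick feel for those per-IO latencies without Performance Monitor, here is a rough sketch that times unbuffered 64K reads against a file of your choosing. The path is a placeholder, and FILE_FLAG_NO_BUFFERING is used so the File System Cache does not hide the disk.

```c
#include <windows.h>
#include <stdio.h>

#define CHUNK (64 * 1024)   // 64K per request, a multiple of any common sector size

int main(void)
{
    // Placeholder path; pick a file of a few hundred MB for a fair test.
    HANDLE h = CreateFileA("C:\\temp\\big.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    // NO_BUFFERING requires a sector-aligned buffer; VirtualAlloc is page-aligned.
    void *buf = VirtualAlloc(NULL, CHUNK, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    DWORD read = 0, reads = 0;
    double totalMs = 0.0;

    for (;;) {
        QueryPerformanceCounter(&t0);
        if (!ReadFile(h, buf, CHUNK, &read, NULL) || read == 0) break;
        QueryPerformanceCounter(&t1);
        totalMs += (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;
        reads++;
    }

    if (reads > 0)
        printf("%lu reads, average %.2f ms per 64K IO\n", reads, totalMs / reads);

    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return 0;
}
```

Sequential reads like this will look better than the random-access numbers above, so treat it as a sanity check rather than a benchmark.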
In the spindle-disk world, there are also some other good ideas. The most common question is: “How do we keep the physical head from moving between tracks?” Some of the common answers are: a) don't place files that are accessed in a random fashion on the same drive; the SQL transaction log placement discussion fits here; b) use only 50% of the disk and use defrag software that pushes the data to the outer rims of the disk (thus achieving maximum performance, since the heads move the least and the most data can physically fit on a track); c) make sure that the disk partitions are properly aligned (Windows Server 2003 and earlier created default partitions 512 bytes off the boundary, causing misalignment; Windows Server 2008+ fixed that). Note: our software checks and reports on partition alignment.
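As a quick illustration of the alignment check (not the tool mentioned above, just my own sketch of how one could verify it), the partition starting offset can be read with IOCTL_DISK_GET_PARTITION_INFO_EX and tested against a 64K boundary:

```c
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    // Open the C: volume for metadata queries only (may still require admin rights).
    HANDLE h = CreateFileA("\\\\.\\C:", 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    PARTITION_INFORMATION_EX pi;
    DWORD bytes = 0;

    if (DeviceIoControl(h, IOCTL_DISK_GET_PARTITION_INFO_EX,
                        NULL, 0, &pi, sizeof(pi), &bytes, NULL))
    {
        LONGLONG offset = pi.StartingOffset.QuadPart;
        printf("Partition starting offset: %lld bytes\n", offset);
        // A well-aligned partition starts on a 64K (and therefore 4K) boundary.
        printf("64K aligned: %s\n", (offset % (64 * 1024) == 0) ? "yes" : "NO");
    }
    else
    {
        printf("DeviceIoControl failed: %lu\n", GetLastError());
    }

    CloseHandle(h);
    return 0;
}
```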
Solid State Disks (SSDs) appeared and were a game changer. In a few months, all the heavy investments in SANs and spindle-based disks quickly eroded. The technology is far superior and is becoming the de facto standard for the computer/data industry. And again, having faster hardware makes it more difficult to optimize systems.