Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration

英文: Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering re-usability across other accelerators or domains. We present buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design. Buffets have several distinguishing characteristics, including efficient decoupled fills and accesses with fine-grained synchronization, hierarchical composition, and efficient multi-casting. We implement buffets in RTL and show that they only add 2% control overhead over an 8KB RAM. When compared with DMA-managed double-buffered scratchpads and caches across a range of workloads, buffets improve energy-delay-product by 1.53x and 5.39x, respectively.

翻译: 硬件加速器方向在传统的on-chip缓存领域探索了很多的方向、花费了很多的精力。不幸的是,这些解决方案都与特定的架构/设计紧密地结合,这组织了在其他加速器或领域的再使用。我们提出了自助餐架构,一个高效且可组合的存储习惯来满足加速器要独立于特定设计的需要。自助餐有着众多与众不同的特性,包括高效的解耦填充和对细粒度存储单元的同步访问,层次组合和高效的多播。我们在RTL中使用了自助餐架构,显示它们在8KBRAM上的工作只增加了2%的控制超载时间。当与DMA管理的双缓存便签和缓存在相同的一段工作负载的测试下,自助餐架构对能量延迟乘数的提升分别为1.53倍和5.39倍。

  • composable: composable system provides components that can be selected and assembled in various combinations to satisfy specific user requirements.
  • idiom

Software-Defined Far Memory in Warehouse-Scale Computers

英文 Increasing memory demand and slowdown in technology scaling pose important challenges to total cost of ownership (TCO) of warehouse-scale computers (WSCs). One promising idea to reduce the memory TCO is to add a cheaper, but slower, “far memory” tier and use it to store infrequently accessed (or cold) data. However, introducing a far memory tier brings new challenges around dynamically responding to workload diversity and churn, minimizing stranding of capacity, and addressing brownfield (legacy) deployments. We present a novel software-defined approach to far memory that proactively compresses cold memory pages to effectively create a far memory tier in software. Our end-to-end system design encompasses new methods to define performance service-level objectives (SLOs), a mechanism to identify cold memory pages while meeting the SLO, and our implementation in the OS kernel and node agent. Additionally, we design learning-based autotuning to periodically adapt our design to fleet-wide changes without a human in the loop. Our system has been successfully deployed across Google’s WSC since 2016, serving thousands of production services. Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6us) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.

翻译: 内存需要的增加和技术扩展速度的降低对仓库级计算机的总拥有成本提出了更高的挑战。减少内存的总拥有成本的一个有前途的想法是增加一个更便宜但更慢的”远内存层“,并用它存储不经常访问的(或冷的)数据。然而,引入远距离内存层带来了新的挑战,包括动态对多样工作负载和搅动的响应,最小化内存搁浅,并解决棕地部署问题。我们提出了一种新颖的软件定义的远内存存储方式,可以主动压缩冷内存页来高效地在软件层面上创建一个远内存层。我们的端到端系统设计围绕新的方法定义了性能上的服务级别对象(SLO),这是一个在遇到SLO的时候识别冷内存页的机制,我们的实现是基于系统内核和节点代理的。除此之外,我们设计了基于学习的自动参数调节系统来周期性地将我们的架构设计与更广泛的变化相适应,在这个循环中不用人介入。我们的系统已经成功地自2016年起Google的仓库级计算机上被部署,为上千个生产级服务来服务。我们的基于软件的远内存方法相较而言显著便宜于其他方法,并且能够达到一个相对较好的访问速度,并允许我们取存放不常用数据的一个重要的部分,从而实现在仓库级对总拥有成本的节省。

uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures

英文: Modern microarchitectures are some of the world’s most complex man-made systems. As a consequence, it is increasingly difficult to predict, explain, let alone optimize the performance of software running on such microarchitectures. As a basis for performance predictions and optimizations, we would need faithful models of their behavior, which are, unfortunately, seldom available. In this paper, we present the design and implementation of a tool to construct faithful models of the latency, throughput, and port usage of x86 instructions. To this end, we first discuss common notions of instruction throughput and port usage, and introduce a more precise definition of latency that, in contrast to previous definitions, considers dependencies between different pairs of input and output operands. We then develop novel algorithms to infer the latency, throughput, and port usage based on automatically-generated microbenchmarks that are more accurate and precise than existing work. To facilitate the rapid construction of optimizing compilers and tools for performance prediction, the output of our tool is provided in a machine-readable format. We provide experimental results for processors of all generations of Intel’s Core architecture, i.e., from Nehalem to Coffee Lake, and discuss various cases where the output of our tool differs considerably from prior work. 翻译: 现代的微服务架构是世界上最复杂的人造系统。作为结果,微服务架构越来越难以取预测,解释,更不用说最大化软件在这样的微服务架构上工作的性能。作为性能预测和优化的基础,我们需要关于它们行为的可靠的模型,然而不幸的是这很少有效。 在本文中,我们提出并实现了一个工具来构建关于x86架构下延迟、流量和端口使用的的有效的模型。为此,我们首先讨论了指令流量和端口使用了常用概念,并介绍了关于延迟的更精确的定义,这个定义与之前的定义相对,考虑了不同组合的输入输出之间的依赖关系。然后我们开发了一个新的算法来推导延迟、流量和端口使用的关系,基于自动生成的微服务基准测试系统,该算法与现有工作相比更加准确。 为了实现性能预测上最优化编译器和工具的快速部署,我们的工具的输出是机器可读的格式。我们提供了Intel所有时代处理器的实验结果,并讨论了一些我们的工具与之前工作不同的样例。

Keynote: Developing our Quantum Future

英文 In 1981, Richard Feynman proposed a device called a ‘quantum computer’ to take advantage of the laws of quantum physics to achieve computational speed-ups over classical methods. Quantum computing promises to revolutionize how and what we compute. Over the course of three decades, quantum algorithms have been developed that offer fast solutions to problems in a variety of fields including number theory, optimization, chemistry, physics, and materials science. Quantum devices have also significantly advanced such that components of a scalable quantum computer have been demonstrated; the promise of implementing quantum algorithms is in our near future. I will attempt to explain some of the mysteries of this disruptive, revolutionary computational paradigm and how it will transform our digital age. 翻译 在1981年,理查德弗莱曼提出了一个名为量子计算机的设备,利用量子物理学定律来实现比传统方法更快的计算速度。量子计算有望彻底改变我们的计算方式和计算对象。在过去三十年的研究中,已经开发出的量子算法可以为数论、优化、化学、物理和材料科学等各个领域的问题提供快速解决方案。量子设备也取得了显著的进步,从而证明了可扩展量子计算机的组件的性能。实现量子算法的希望就在我们不久的将来。我将尝试解释这种颠覆性、革命性的计算范例的一些奥秘,以及它如何改变我们的数字时代。