Paper Abstracts – ASPLOS 2024 (2024)

Session 8D: IoT and Embedded
(Location: Fairway I/IV)

10:00 PDT – 10:15 PDT

TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays
Massimo Giordano, Rohan Doshi, and Qianyun Lu (Stanford University); Boris Murmann (University of Hawaii)

Abstract: The proliferation of smart IoT devices has given rise to tinyML, which deploys deep neural networks on resource-constrained systems, benefitting from custom hardware that optimizes for low silicon area and high energy efficiency amidst tinyML’s characteristic small model sizes (50-500 KB) and low target frequencies (1-100 MHz). We introduce a novel custom latch array integrated with a compute memory fabric, achieving 8 μm²/B density and 11 fJ/B read energy, surpassing synthesized implementations by 7x in density and 5x in read energy. This advancement enables dataflows that do not require activation buffers, reducing memory overheads. By optimizing systolic vs. combinational scaling in a 2D compute array and using bit-serial instead of bit-parallel compute, we achieve a reduction of 4.8x in area and 2.3x in multiply-accumulate energy. To study the advantages of the proposed architecture and its performance at the system level, we architect tinyForge, a design space exploration to obtain Pareto-optimal architectures and compare the trade-offs with respect to traditional approaches. tinyForge comprises (1) a parameterized template for memory hierarchies and compute fabric, (2) estimations of power, area, and latency for hardware components, (3) a dataflow optimizer for efficient workload scheduling, and (4) a genetic algorithm performing multi-objective optimization to find Pareto-optimal architectures. We evaluate the performance of our proposed architecture on all of the MLPerf Tiny Inference Benchmark workloads, and the BERT-Tiny transformer model, demonstrating its effectiveness in lowering the energy per inference while addressing the introduced area overheads. We show the importance of storing all the weights on-chip, reducing the energy per inference by 7.5x vs. utilizing off-chip memories. Finally, we demonstrate the potential of the custom latch arrays and bit-serial digital compute arrays to reduce by up to 1.8x the energy per inference, 2.2x the latency per inference, and 3.7x the silicon area.
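
Editorial illustration (not from the paper): the abstract describes a search for Pareto-optimal architectures over energy, latency, and area. The sketch below shows only the Pareto-dominance filter that such a multi-objective search relies on. The `Design` type, the candidate names, and all numbers are made up for illustration, and the genetic algorithm itself is not reproduced here.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    """Hypothetical design point; every objective is to be minimized."""
    name: str
    energy_uj: float    # energy per inference (made-up values)
    latency_ms: float   # latency per inference (made-up values)
    area_mm2: float     # silicon area (made-up values)


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    a_obj, b_obj = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(a_obj, b_obj))
    strictly_better = any(x < y for x, y in zip(a_obj, b_obj))
    return no_worse and strictly_better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


candidates = [
    Design("A", 10.0, 5.0, 1.0),
    Design("B", 8.0, 6.0, 1.2),
    Design("C", 12.0, 5.5, 1.1),  # worse than A on all three objectives
    Design("D", 8.0, 6.0, 1.5),   # ties B on energy and latency, larger area
]
front = pareto_front(candidates)
print([d.name for d in front])
```

A and B survive because each wins on at least one objective against the other; C and D are dropped because another candidate matches or beats them everywhere.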

10:15 PDT – 10:30 PDT

MulBERRY: Enabling Bit-Error Robustness for Energy-Efficient Multi-Agent Autonomous Systems
Zishen Wan (Georgia Tech); Nandhini Chandramoorthy, Karthik Swaminathan, and Pin-Yu Chen (IBM Research); Kshitij Bhardwaj (Lawrence Livermore National Lab); Vijay Janapa Reddi (Harvard University); Arijit Raychowdhury (Georgia Tech)

Abstract: The adoption of autonomous swarms, consisting of a multitude of unmanned aerial vehicles (UAVs), operating in a collaborative manner, has become prevalent in mainstream application domains for both military and civilian purposes. These swarms are expected to collaboratively carry out navigation tasks and employ complex reinforcement learning (RL) models within the stringent onboard size, weight, and power constraints. While techniques such as reducing onboard operating voltage can improve the energy efficiency of both computation and flight missions, they can lead to on-chip bit failures that are detrimental to mission safety and performance. To this end, we propose MulBERRY, a multi-agent robust learning framework to enhance bit error robustness and energy efficiency for resource-constrained autonomous UAV swarms. MulBERRY supports multi-agent robust learning, both offline and on-device, with adaptive and collaborative agent-server optimizations. For the first time, MulBERRY demonstrates the practicality of robust low-voltage operation on multi-UAV systems leading to energy savings in both compute and mission quality-of-flight. We conduct extensive system-level experiments on autonomous multi-UAV navigation by leveraging algorithm-level robust learning techniques, and hardware-level bit error, thermal, and power characterizations. Through evaluations, we demonstrate that MulBERRY achieves robustness-performance-efficiency co-optimizations for autonomous UAV swarms. We also show that MulBERRY generalizes well across fault patterns, environments, UAV types, UAV agent numbers, and RL policies, with up to 18.97% reduction in flight energy, 22.07% increase in the number of successful missions, and 4.16× processing energy reduction.
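
Editorial illustration (not from the paper): the abstract notes that lowering the operating voltage can cause on-chip bit failures. One common way to study such failures in simulation is to flip stored bits at random with a chosen bit error rate. The sketch below assumes independent, uniformly distributed flips in unsigned 8-bit values; the abstract does not state MulBERRY's actual fault model, and `inject_bit_errors` is a hypothetical helper.

```python
import random
from typing import List


def inject_bit_errors(weights: List[int], bit_error_rate: float,
                      bits: int = 8, seed: int = 0) -> List[int]:
    """Flip each stored bit independently with probability bit_error_rate.

    Assumes weights are unsigned integers that fit in `bits` bits.
    """
    rng = random.Random(seed)  # seeded so experiments are repeatable
    corrupted = []
    for w in weights:
        mask = 0
        for b in range(bits):
            if rng.random() < bit_error_rate:
                mask |= 1 << b
        corrupted.append(w ^ mask)  # XOR flips exactly the selected bits
    return corrupted


weights = [0, 17, 128, 255]                    # made-up quantized weights
clean = inject_bit_errors(weights, 0.0)        # no bit is ever flipped
noisy = inject_bit_errors(weights, 0.01)       # roughly 1% of bits flipped
inverted = inject_bit_errors(weights, 1.0)     # every bit is flipped
print(clean, noisy, inverted)
```

A robustness experiment would typically run inference or training with the `noisy` weights across several error rates and compare task success against the `clean` baseline.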

10:30 PDT – 10:45 PDT

Exploiting Human Color Discrimination for Memory- and Energy-Efficient Image Encoding in Virtual Reality
Nisarg Ujjainkar and Ethan Shahan (University of Rochester); Kenneth Chen, Budmonde Duinkharjav, and Qi Sun (New York University); Yuhao Zhu (University of Rochester)

Abstract: Virtual Reality (VR) has the potential of becoming the next ubiquitous computing platform. Continued progress in the burgeoning field of VR depends critically on an efficient computing substrate. In particular, DRAM access energy is known to contribute to a significant portion of system energy. Today’s framebuffer compression system alleviates the DRAM traffic by using a numerically lossless compression algorithm. Being numerically lossless, however, is unnecessary to preserve perceptual quality for humans. This paper proposes a perceptually lossless, but numerically lossy, system to compress DRAM traffic. Our idea builds on top of long-established psychophysical studies that show that humans cannot discriminate colors that are close to each other. The discrimination ability becomes even weaker (i.e., more colors are perceptually indistinguishable) in our peripheral vision. Leveraging the color discrimination (in)ability, we propose an algorithm that adjusts pixel colors to minimize the bit encoding cost without introducing visible artifacts. The algorithm is coupled with lightweight architectural support that, in real time, reduces the DRAM traffic by 66.9% and outperforms existing framebuffer compression mechanisms by up to 20.4%. Psychophysical studies on human participants show that our system introduces little to no perceptual fidelity degradation.
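
Editorial illustration (not from the paper): the abstract describes adjusting pixel colors, within what viewers cannot distinguish, to lower the bit encoding cost. The sketch below is a heavily simplified one-dimensional version of that idea. It assumes a base-plus-offset tile encoding and a single scalar `tolerance` standing in for the discrimination threshold; the actual system works on colors with perceptual thresholds that vary across the visual field, which this sketch does not model.

```python
from typing import List


def delta_bits(tile: List[int]) -> int:
    """Bits per pixel needed to store each value as an offset from the tile minimum."""
    return (max(tile) - min(tile)).bit_length()


def squeeze_tile(tile: List[int], tolerance: int) -> List[int]:
    """Move each value by at most `tolerance` so that the tile's range shrinks."""
    lo, hi = min(tile) + tolerance, max(tile) - tolerance
    if lo > hi:
        # The whole tile fits within one tolerance band: collapse to the midpoint.
        lo = hi = (min(tile) + max(tile)) // 2
    return [min(max(v, lo), hi) for v in tile]


tile = [100, 103, 110, 115]           # made-up channel values for one tile
adjusted = squeeze_tile(tile, tolerance=4)
print(adjusted, delta_bits(tile), delta_bits(adjusted))
```

With these made-up values, the range drops from 15 to 7, so each offset needs 3 bits instead of 4 while no value moves by more than the tolerance.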

10:45 PDT – 11:00 PDT

MicroVSA: An Ultra-Lightweight Vector Symbolic Architecture-based Classifier Library for Always-On Inference on Tiny Microcontrollers
Nuntipat Narkthong and Shijin Duan (Northeastern University); Shaolei Ren (University of California Riverside); Xiaolin Xu (Northeastern University)

Abstract: Artificial intelligence (AI) on tiny edge devices has become feasible thanks to the emergence of high-performance microcontrollers (MCUs) and lightweight machine learning (ML) models. Nevertheless, the cost and power consumption of these MCUs and the computation requirements of these ML algorithms still present barriers that prevent the widespread inclusion of AI functionality on smaller, cheaper, and lower-power devices. Thus, there is an urgent need for a more efficient ML algorithm and implementation strategy suitable for lower-end MCUs. This paper presents MicroVSA, a library of optimized implementations of a low-dimensional computing (LDC) classifier, a recently proposed variant of vector symbolic architecture (VSA), for 8-, 16-, and 32-bit MCUs. MicroVSA achieves up to 21.86x speedup and 8x less flash utilization compared to the vanilla LDC. Evaluation results on the three most common always-on inference tasks – myocardial infarction detection, human activity recognition, and hot word detection – demonstrate that MicroVSA outperforms traditional classifiers and achieves comparable accuracy to tiny deep learning models, while requiring only a few tens of bytes of RAM and fitting easily in tiny 8-bit MCUs. For instance, our model for recognizing human activity from inertial sensor data only needs 2.46 KiB of flash and 0.02 KiB of RAM and can complete one inference in 0.85 ms on a 32-bit ARM Cortex-M4 MCU or 11.82 ms on a tiny 8-bit AVR MCU, whereas the RNN model running on a higher-end ARM Cortex-M3 requires 62.0 ms. Our study suggests that ubiquitous ML deployment on low-cost tiny MCUs is possible, and more study on VSA model training, model compression, and implementation techniques is needed to further lower the cost and power of ML on edge devices.
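
Editorial illustration (not from the paper): binary VSA-style classifiers commonly finish inference by comparing an encoded sample against one stored vector per class, which needs only XOR and a bit count. The sketch below shows that final similarity search on 8-bit vectors packed into integers. The class labels and bit patterns are made up, and the abstract does not describe how the LDC classifier encodes inputs, so encoding is left out.

```python
from typing import Dict


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two bit-packed binary vectors."""
    return bin(a ^ b).count("1")


def classify(sample: int, class_vectors: Dict[str, int]) -> str:
    """Return the label whose stored vector is closest to the sample."""
    return min(class_vectors, key=lambda label: hamming(sample, class_vectors[label]))


# Made-up 8-bit class vectors; a trained model would supply these.
class_vectors = {
    "rest": 0b10110101,
    "walk": 0b01001010,
    "run":  0b11100011,
}
sample = 0b01001110  # hypothetical encoded sensor window
print(classify(sample, class_vectors))
```

Because each class costs one short bit vector and each comparison is an XOR followed by a bit count, this style of search suits MCUs with very little RAM.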

11:00 PDT – 11:15 PDT

Energy-Adaptive Buffering for Efficient, Responsive, and Persistent Batteryless Systems
Harrison Williams and Matthew Hicks (Virginia Tech)

Abstract: Batteryless energy harvesting systems enable a wide array of new sensing, computation, and communication platforms untethered by power delivery or battery maintenance demands. Energy harvesters charge a buffer capacitor from an unreliable environmental source until enough energy is stored to guarantee a burst of operation despite changes in power input. Current platforms use a fixed-size buffer chosen at design time to meet constraints on charge time or application longevity, but static energy buffers are a poor fit for the highly volatile power sources found in real-world deployments: fixed buffers waste energy both as heat when they reach capacity during a power surplus and as leakage when they fail to charge the system during a power deficit. To maximize batteryless system performance in the face of highly dynamic input power, we propose REACT: a responsive buffering circuit which varies total capacitance according to net input power. REACT uses a variable capacitor bank to expand capacitance to capture incoming energy during a power surplus and reconfigures internal capacitors to reclaim additional energy from each capacitor as power input falls. Compared to fixed-capacity systems, REACT captures more energy, maximizes usable energy, and efficiently decouples system voltage from stored charge, enabling low-power and high-performance designs previously limited by ambient power. Our evaluation on real-world platforms shows that REACT eliminates the tradeoff between responsiveness, efficiency, and longevity, increasing the energy available for useful work by an average 25.6% over static buffers optimized for reactivity and capacity, improving event responsiveness by an average 7.7× without sacrificing capacity, and enabling programmer-directed longevity guarantees.
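
Editorial illustration (not from the paper): the abstract says reconfiguring capacitors lets the system reclaim energy that would otherwise stay stranded once the buffer voltage falls below what the load needs. The sketch below works through that effect with the standard relation E = ½CV² for identical, lossless capacitors. The parallel-to-series switch, the component values, and the voltage limits are assumptions chosen for illustration and do not describe REACT's actual circuit.

```python
import math


def stored_energy(capacitance_f: float, voltage_v: float) -> float:
    """Energy in joules held by an ideal capacitor: E = 1/2 * C * V^2."""
    return 0.5 * capacitance_f * voltage_v ** 2


def stranded_parallel(n: int, cap_f: float, v_min: float) -> float:
    """Energy left when n parallel capacitors have all sagged to v_min."""
    return n * stored_energy(cap_f, v_min)


def stranded_series(n: int, cap_f: float, v_min: float) -> float:
    """Energy left if the same capacitors are switched into series.

    In series the terminal voltage is the sum of the capacitor voltages,
    so each capacitor can discharge down to v_min / n before the load stops.
    """
    return n * stored_energy(cap_f, v_min / n)


# Assumed values: two 100 uF capacitors, 3.6 V when full, 1.8 V minimum load voltage.
N, CAP_F, V_MAX, V_MIN = 2, 100e-6, 3.6, 1.8

full = N * stored_energy(CAP_F, V_MAX)
usable_fixed = (full - stranded_parallel(N, CAP_F, V_MIN)) / full
usable_reconfigured = (full - stranded_series(N, CAP_F, V_MIN)) / full
print(f"fixed: {usable_fixed:.4f}, reconfigured: {usable_reconfigured:.4f}")
```

Under these idealized assumptions the fixed parallel bank delivers 75% of its stored energy, while switching to series raises that to about 94%, since the stranded energy shrinks by a factor of n². Real circuits lose part of this gain to switching and leakage.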
