System Energy Efficiency Lab

Power, Thermal, Reliability and Variability Management:
from Datacenters to Mobiles


Energy efficiency is a key concern for all electronic devices, from datacenters to mobiles. In datacenters it translates into lower electricity bills, while in mobiles it makes batteries last longer. It is also tightly coupled to other issues that should be considered concurrently. Power dissipation in integrated circuits raises their temperature, which can damage the system and, in mobiles, cause discomfort for the user. Thermal stress also dramatically accelerates the reliability degradation mechanisms that affect transistors and interconnects, which can lead to early failure. These problems worsen with CMOS scaling, which reduces the accuracy of the fabrication process and increases the variability in power, performance and degradation rate. Dynamic management mitigates such issues by adapting the operating conditions at runtime. To do this, a monitoring infrastructure samples the critical parameters at runtime and uses the data to determine the current power, thermal and reliability status. Based on this information, and optionally with the aid of predictive models, the framework determines the appropriate control decisions to meet performance and quality targets. Our group aims to develop and implement comprehensive, low-overhead and scalable strategies for the joint management of power, temperature, reliability and variability across a variety of devices, from datacenters to mobiles.

We propose a comprehensive multi-rate management framework with two sub-controllers: the long term controller and the short term controller. This design choice is motivated by the fact that reliability changes and variability effects evolve on time scales of weeks and months, very different from the milliseconds and seconds over which power and temperature change.

The short term controller activates at a fine-grain rate and aims to meet the fast-changing application-level performance requirements dictated by user experience, while optimizing for temperature and power. In doing so, it seeks a solution that meets the average targets provided by the long term controller, which account for variability and reliability. These targets are used to update the thermal constraints at a fine-grain rate, and the solution is adjusted by predicting power and temperature for the next intervals.
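As an illustration of the fine-grain step described above, the following minimal Python sketch picks a frequency by predicting the next-interval temperature and checking it against a cap supplied per coarse interval. It is not the group's implementation: the frequency table, the cubic power model, the first-order thermal model and all constants (`FREQS_GHZ`, `R_TH`, `ALPHA`, `power_w`, `predict_temp_c`) are hypothetical assumptions chosen only to make the example runnable.

```python
# Hypothetical platform model: every constant below is an illustrative assumption.
FREQS_GHZ = [0.6, 1.0, 1.4, 1.8, 2.2]
AMBIENT_C = 25.0
R_TH = 10.0   # assumed thermal resistance, in degC per watt
ALPHA = 0.2   # assumed fraction of the gap to steady state closed per interval


def power_w(freq_ghz):
    """Assumed power model: cubic dynamic term plus a constant static term."""
    return 0.3 * freq_ghz ** 3 + 0.2


def predict_temp_c(temp_c, freq_ghz):
    """One-interval-ahead temperature prediction with a first-order model."""
    steady = AMBIENT_C + R_TH * power_w(freq_ghz)
    return temp_c + ALPHA * (steady - temp_c)


def short_term_step(temp_c, required_ghz, temp_cap_c):
    """Pick the lowest thermally safe frequency that meets the requirement.

    Falls back to the fastest safe frequency if the requirement cannot be
    met, and to the slowest frequency if no setting is predicted to be safe.
    """
    safe = [f for f in FREQS_GHZ if predict_temp_c(temp_c, f) <= temp_cap_c]
    if not safe:
        return FREQS_GHZ[0]
    meeting = [f for f in safe if f >= required_ghz]
    return min(meeting) if meeting else max(safe)


def run(demands_ghz, long_period, temp_caps_c):
    """Run the fine-grain loop; temp_caps_c holds one cap per coarse interval,
    standing in for the output of the long term controller."""
    temp_c = AMBIENT_C
    trace = []
    for t, demand in enumerate(demands_ghz):
        cap = temp_caps_c[t // long_period]
        freq = short_term_step(temp_c, demand, cap)
        temp_c = predict_temp_c(temp_c, freq)
        trace.append((freq, round(temp_c, 2)))
    return trace


if __name__ == "__main__":
    for freq, temp in run([2.2] * 6, long_period=6, temp_caps_c=[30.0]):
        print(freq, temp)
```

In this sketch the two rates show up only as the index `t // long_period`: the cap changes once per coarse interval, while the frequency decision is revisited at every fine-grain tick.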

For this, we first propose a novel power characterization strategy for mobile devices called application-dependent power states (AP-states). Based on it, we formulate a management problem to improve performance under battery lifetime constraints. We call our framework BLAST: Battery Lifetime-constrained Adaptation with Selected Target. The goal of the framework is to maximize performance while ensuring that the device battery lasts at least for a user-required lifetime. We experimentally verify that our strategy can still meet quality requirements with a selected target battery lifetime extension of at least 25%. We also propose a joint power and thermal management solution, which takes a proactive approach to reducing energy consumption while providing the expected user experience.
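The idea of maximizing performance under a battery lifetime constraint can be sketched as follows. This is only a simplified illustration, not the BLAST algorithm itself: the `AP_STATES` table, its power and performance numbers, and the helper names are hypothetical, and the sketch assumes that each application state has a known average power.

```python
# Hypothetical AP-state table: (state name, average power in W, relative performance).
# The applications, states and numbers are illustrative assumptions only.
AP_STATES = {
    "video": [("low", 0.9, 0.60), ("medium", 1.4, 0.80), ("high", 2.1, 1.00)],
    "browser": [("low", 0.6, 0.70), ("high", 1.2, 1.00)],
}


def power_budget_w(remaining_wh, target_hours):
    """Average power that lets the remaining energy last for target_hours."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return remaining_wh / target_hours


def select_state(app, remaining_wh, target_hours):
    """Return the best-performing state whose average power fits the budget.

    If no state fits, fall back to the lowest-power state.
    """
    budget = power_budget_w(remaining_wh, target_hours)
    states = AP_STATES[app]
    feasible = [s for s in states if s[1] <= budget]
    if not feasible:
        return min(states, key=lambda s: s[1])
    return max(feasible, key=lambda s: s[2])


if __name__ == "__main__":
    # 6 Wh left and 4 hours required gives a 1.5 W budget.
    print(select_state("video", remaining_wh=6.0, target_hours=4.0))
```

Re-running the selection periodically with the updated remaining energy would let the choice track the actual discharge, which is one simple way to keep the lifetime target as a hard constraint.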

The proposed technique modulates the operating conditions based on users' application preferences and exploits the “change blindness” effect to reduce display power consumption. A novel thermal model of the entire smartphone is derived using model identification techniques, based on the device’s operating conditions. Its purpose is to monitor and control the operating conditions so that device temperatures stay within safe operating ranges. Our ready-to-use management technique has been implemented on a Google Nexus 5 and has been demonstrated to achieve 46% application-specific savings in power consumption and up to 35% savings in power consumption at the device level. The mean temperature estimation error is 1.17 °C.
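To give a flavor of what model identification means here, the sketch below fits a first-order discrete-time thermal model to logged temperature and power samples by least squares. The model form, T[k+1] - T_amb = a (T[k] - T_amb) + b P[k], with a known ambient temperature, is an assumption made for illustration; the smartphone model referred to above is not specified in this text and is likely richer. All function names are hypothetical.

```python
def identify_thermal_model(temps_c, powers_w, ambient_c):
    """Least-squares fit of x[k+1] = a*x[k] + b*p[k], where x = T - ambient.

    Solves the 2x2 normal equations in closed form and returns (a, b).
    """
    n = len(temps_c) - 1
    if n < 2 or len(powers_w) < n:
        raise ValueError("need at least 3 temperatures and matching powers")
    x = [t - ambient_c for t in temps_c]
    sxx = sum(x[k] * x[k] for k in range(n))
    sxp = sum(x[k] * powers_w[k] for k in range(n))
    spp = sum(powers_w[k] * powers_w[k] for k in range(n))
    sxy = sum(x[k] * x[k + 1] for k in range(n))
    spy = sum(powers_w[k] * x[k + 1] for k in range(n))
    det = sxx * spp - sxp * sxp
    if abs(det) < 1e-12:
        raise ValueError("data is not informative enough to identify the model")
    a = (sxy * spp - sxp * spy) / det
    b = (sxx * spy - sxp * sxy) / det
    return a, b


def simulate(a, b, ambient_c, start_c, powers_w):
    """Generate a temperature trace from the model (used here as synthetic data)."""
    temps = [start_c]
    for p in powers_w:
        temps.append(ambient_c + a * (temps[-1] - ambient_c) + b * p)
    return temps


def mean_abs_error(a, b, temps_c, powers_w, ambient_c):
    """Mean absolute one-step-ahead temperature estimation error, in degC."""
    errors = [
        abs(ambient_c + a * (temps_c[k] - ambient_c) + b * powers_w[k] - temps_c[k + 1])
        for k in range(len(temps_c) - 1)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    powers = [1.0, 2.0, 0.5, 3.0, 1.5, 2.5, 0.8, 1.2]
    temps = simulate(0.9, 0.5, 25.0, 25.0, powers)
    a, b = identify_thermal_model(temps, powers, 25.0)
    print(a, b, mean_abs_error(a, b, temps, powers, 25.0))
```

On real measurements the fit would not be exact, and the mean absolute error computed this way is one possible counterpart of the estimation error quoted above.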

The long term controller activates at a coarse-grain rate and focuses on meeting the target reliability in the long term, based on information on variability in power consumption, performance and degradation rate. Variability information is updated based on the reliability degradation. The long term controller also computes target average temperature and power values that are used as constraints by the short term controller.

The reliability target is met if the average temperature and power are lower than their targets at the coarse-grain level. In this case, the short term controller returns the unexploited margins, which are used to increase the average targets for the next coarse-grain control interval. We formulate dynamic reliability management (DRM) as an optimization problem that accounts for reliability, temperature and performance. We first solve this problem for multicores using convex optimization, and show that this approach is not feasible to implement on real systems. For this reason, we propose Workload-Aware Reliability Management (WARM), a fast DRM technique that adapts to diverse workload requirements to trade off reliability against user experience. WARM is implemented and tested on a real Android device. It leverages RelDroid, an infrastructure for the online emulation of reliability degradation. RelDroid enables the design of workload-aware dynamic reliability management on real mobile devices with accurate reliability models. Our framework captures the effect of variable workload and environmental conditions and allows long-term degradation to be emulated on a short time scale. We implement the framework on a real Android device and exploit it to enable workload-aware dynamic reliability management. WARM approximates the solution of the convex solver within 18% in the worst case, while executing more than 40x faster. It integrates a Thermal Controller that allocates tasks to meet thermal constraints. This is required because degradation strongly depends on temperature. WARM task allocation achieves up to 1 year of lifetime improvement for a multicore platform. It can achieve up to 100% performance improvement on cluster architectures, such as big.LITTLE, while still guaranteeing that the reliability targets are met.
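The margin hand-back between the two controllers, described at the start of the paragraph above, can be sketched as a simple budget carry-over. This is a minimal illustration under assumptions that are not stated in the text: the margin is taken to be the difference between the current target and the observed average, and an overshoot (negative margin) is assumed to tighten the next target. The function names are hypothetical.

```python
def next_average_target(base_target, current_target, observed_average):
    """Target for the next coarse-grain interval.

    Unused margin (current target minus observed average) is added to the
    base target. Assumption: an overshoot yields a negative margin, which
    tightens the next target instead.
    """
    margin = current_target - observed_average
    return base_target + margin


def run_intervals(base_target, observed_averages):
    """Return the target that was in force during each coarse interval."""
    target = base_target
    targets = []
    for observed in observed_averages:
        targets.append(target)
        target = next_average_target(base_target, target, observed)
    return targets


if __name__ == "__main__":
    # Average temperature targets in degC over three coarse intervals.
    print(run_intervals(60.0, [55.0, 65.0, 63.0]))
```

With this rule, a cool interval buys headroom for the next one, while the long-run average stays close to the base target that the reliability requirement dictates.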

Due to CMOS scaling, processors with the same nominal characteristics actually exhibit variability in power, performance and reliability degradation, which should also be taken into account in the runtime management of hardware resources. For this, we present VarDroid, a low-overhead tool that emulates power and performance variability on real platforms and runs on top of the Android operating system. VarDroid enables us to analyze the effect of variability in power and performance while capturing the complex interactions characteristic of mobile workloads, thus relating it to users' quality of experience. We present use cases that show the utility of VarDroid for testing application, device and OS robustness under the effects of variability. Our results show that a variability-agnostic OS can incur a performance penalty of up to 60% and a power penalty of up to 20%. We then use VarDroid to develop a novel dynamic variability management (DVM) technique, which leverages a variability-aware OS algorithm to assign the workload to the cores and set the power/performance tradeoffs so as to meet the mobile processor's lifetime constraints while adjusting to variability and improving the overall user experience. The proposed DVM solution uses sensors to monitor the variable operating conditions and the degradation rate. We implement our algorithm in Android OS on a mobile phone and show that it achieves up to 160% performance improvement over the state of the art while meeting the lifetime constraints.
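A toy example can show why a variability-agnostic scheduler pays a performance penalty. The sketch below is not the DVM algorithm described above: it uses a generic greedy heuristic (largest task first, placed on the core with the earliest estimated finish time), and the per-core speed factors are hypothetical. The only difference between the two runs is whether the scheduler knows the true per-core speeds.

```python
def makespan(tasks, speeds, assumed_speeds=None):
    """Greedy assignment, largest task first, to the core that is estimated
    to finish it earliest. Returns the actual finish time under true speeds.

    speeds: true per-core speed factors (work units per time unit).
    assumed_speeds: what the scheduler believes; defaults to the true speeds.
    """
    believed = assumed_speeds or speeds
    est_finish = [0.0] * len(speeds)
    true_finish = [0.0] * len(speeds)
    for work in sorted(tasks, reverse=True):
        core = min(
            range(len(speeds)),
            key=lambda c: est_finish[c] + work / believed[c],
        )
        est_finish[core] += work / believed[core]
        true_finish[core] += work / speeds[core]
    return max(true_finish)


if __name__ == "__main__":
    tasks = [4, 4, 4, 4]
    true_speeds = [1.0, 0.5]   # hypothetical: second core is slower due to variability
    aware = makespan(tasks, true_speeds)
    agnostic = makespan(tasks, true_speeds, assumed_speeds=[1.0, 1.0])
    print(aware, agnostic)
```

In this example the agnostic scheduler splits the work evenly and is held back by the slow core, while the aware one shifts work toward the fast core and finishes sooner.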