HPC-Colony Project

Adaptive System Software for Improved Resiliency and Performance

Publications

  • Terry Jones, Linux Kernel Co-Scheduling and Bulk Synchronous Parallelism, The International Journal of High Performance Computing Applications (IJHPCA), (to appear).
  • Yanhua Sun, Gengbin Zheng, Ryan Olson, Terry Jones, Laxmikant V. Kale. A uGNI-Based Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini Interconnect. 26th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2012). Shanghai, China. May, 2012. (to appear).
  • Jonathan Lifflander, Phil Miller, Ramprasad Venkataraman, Anshu Arya, Laxmikant V. Kale, Terry Jones. Dense LU Factorization on Multicore Supercomputer Nodes. 26th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2012). Shanghai, China. May, 2012. (to appear).
  • Esteban Meneses, Greg Bronevetsky and Laxmikant V. Kale, Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications, IEEE International Conference on Cluster Computing (Cluster) 2011, Austin, TX, September 2011.
  • Esteban Meneses and Xiang Ni and Laxmikant V. Kale, Design and Analysis of a Message Logging Protocol for Fault Tolerant Multicore Systems, University of Illinois at Urbana-Champaign Technical Report, July 2011.
  • Jonathan Lifflander, Phil Miller, Ramprasad Venkataraman, Anshu Arya, Terry Jones, Laxmikant Kale. Exploring Partial Synchrony in an Asynchronous Environment Using Dense LU, University of Illinois at Urbana-Champaign Technical Report, August 2011.
  • Aaron Becker, Gengbin Zheng, and Laxmikant Kale, Distributed Memory Load Balancing, Encyclopedia of Parallel Computing, David Padua, Ed., 2011 (to appear).
  • Osman Sarood, Abhishek Gupta and Laxmikant V. Kale, Temperature Aware Load Balancing for Parallel Applications: Preliminary Work, Proceedings of Workshop on High Performance Power Aware Computing (HPPAC) at IPDPS, Anchorage, USA, 2011.
  • Terry Jones, Gregory A. Koenig, Clock Synchronization in High-end Computing Environments: A Strategy for Minimizing Clock Variance at Runtime. (submitted for publication)
  • Chao Mei, Yanhua Sun, Gengbin Zheng, Eric J. Bohm, Laxmikant V. Kale, James C. Phillips and Chris Harrison, Enabling and Scaling Biomolecular Simulations of 100 Million Atoms on Petascale Machines with a Multicore-optimized Message-driven Runtime, Accepted for Supercomputing'11, Seattle, November 2011.
  • Esteban Meneses, Greg Bronevetsky and Laxmikant V. Kale, Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems, Proceedings of Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS) at IPDPS, Anchorage, USA, 2011.
  • Abhinav Bhatele, Pritish Jetley, Hormozd Gahvari, Lukasz Wesolowski, William D. Gropp and Laxmikant V. Kale, Architectural constraints to attain 1 Exaflop/s on three scientific application classes, Accepted for the IEEE International Parallel and Distributed Processing Symposium (IPDPS'2011), Anchorage, USA, 2011.
  • Terry Jones, Linux Kernel Co-Scheduling For Bulk Synchronous Parallel Applications, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2011), Tucson, Arizona, USA, May 2011.
  • Terry Jones and Gregory Koenig, Providing Runtime Clock Synchronization With Minimal Node-to-Node Time Deviation on XT4s and XT5s, 2011 Cray Users Group Meeting, Fairbanks, AK, May 2011.
  • Terry Jones and Gregory Koenig, A Clock Synchronization Strategy for Minimizing Clock Variance at Runtime in High-end Computing Environments, 22nd International Symposium on Computer Architecture and High Performance Computing, Rio de Janeiro, Brazil, October 2010.
  • Gengbin Zheng, Abhinav Bhatele, Esteban Meneses and Laxmikant V. Kale, Periodic Hierarchical Load Balancing for Large Supercomputers, accepted for publication in International Journal for High Performance Computing Applications (IJHPCA), 2010.
  • Eduardo R. Rodrigues, Philippe O. A. Navaux, Jairo Panetta, Celso L. Mendes and Laxmikant V. Kale, Optimizing an MPI Weather Forecasting Model via Processor Virtualization, Proceedings of International Conference on High Performance Computing (HiPC 2010), December 2010.
  • Eduardo R. Rodrigues, Philippe O. A. Navaux, Jairo Panetta, Alvaro Fazenda, Celso L. Mendes and Laxmikant V. Kale, A Comparative Analysis of Load Balancing Algorithms Applied to a Weather Forecast Model, Proceedings of 22nd International Symposium on Computer Architecture and High Performance Computing, Rio de Janeiro, Brazil, October, 2010.
  • Terry Jones, Gregory Koenig, A Clock Synchronization Strategy for Minimizing Clock Variance at Runtime in High-end Computing Environments, Proceedings of 22nd International Symposium on Computer Architecture and High Performance Computing, Rio de Janeiro, Brazil, October, 2010.
  • Josh Thompson, David W. Dreisigmeyer, Terry Jones, Michael Kirby, and Josh Ladd. Accurate Fault Prediction of BlueGene/P RAS Logs Via Geometric Methods, First International Workshop on Fault-Tolerance for HPC at Extreme Scale, Chicago, June 2010.
  • Esteban Meneses, Celso L. Mendes, and Laxmikant V. Kalé. Team-based Message Logging: Preliminary Results, 3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (CCGRID 2010), Melbourne, Australia, May 2010.
  • Abhinav Bhatele, Eric Bohm and Laxmikant V. Kalé, Optimizing Communication for Charm++ Applications by Reducing Network Contention, accepted for publication in Concurrency and Computation: Practice and Experience (EuroPar special issue), 2010.
  • Gengbin Zheng, Esteban Meneses, Abhinav Bhatele and Laxmikant V. Kalé, "Hierarchical Load Balancing for Large Scale Supercomputers", Accepted at the Third International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), San Diego, CA, September 2010.
  • Yoav Tock, Benjamin Mandler, "SpiderCast: Distributed Membership and Messaging for HPC Platforms: An Architectural Overview and High Level Design". Colony-II technical report, January 2010.
  • Yoav Tock, Benjamin Mandler, Gennady Laventman, "SpiderCast: Distributed Membership and Messaging for HPC Platforms: Publish-Subscribe and DHT Services High Level Design". Colony-II technical report, May 2010.
  • Terry Jones, Andrew Tauferner, and Todd Inglett. Linux OS Jitter Measurements at Large Node Counts using a BlueGene/L, Technical Report ORNL/TM-2009/303, Oak Ridge National Laboratory, November, 2009.
  • Laxmikant V. Kalé, Eric Bohm, Celso L. Mendes, Terry Wilmarth and Gengbin Zheng. Programming Petascale Applications with Charm++ and AMPI. In "Petascale Computing: Algorithms and Applications", CRC Press, 2008.
  • Sayantan Chakravorty, Laxmikant V. Kalé. A Fault Tolerance Protocol with Fast Fault Recovery. IEEE International Parallel and Distributed Processing Symposium 2007, California, March 2007.
  • Terry Jones, Andrew Tauferner, Todd Inglett. HPC System Call Usage Trends, the 8th LCI International Conference on High Performance Computing, South Lake Tahoe, CA, May 2007.
  • Gregory A. Koenig and Laxmikant V. Kalé. Optimizing Distributed Application Performance Using Dynamic Grid Topology-Aware Load Balancing. Proceedings of the IEEE International Parallel and Distributed Processing Symposium 2007, California, March 2007.
  • Abhinav Bhatele. Application-specific Topology-aware Mapping and Load Balancing for three-dimensional Torus Topologies. Master's Thesis, Dep. of Computer Science, University of Illinois, Urbana, 2007.
  • Sayantan Chakravorty. Fault Tolerance Protocols for Fast Recovery in Parallel Systems. PhD Thesis, Dep. of Computer Science, University of Illinois, Urbana, 2007.
  • Tarun Agarwal, Amit Sharma and Laxmikant V. Kalé. Topology-aware task mapping for reducing communication contention on large parallel machines, Proceedings of IEEE International Parallel and Distributed Processing Symposium 2006, Greece, April 2006.
  • Gengbin Zheng, Chao Huang and Laxmikant V. Kalé. Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++. ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Systems, 40(2), April 2006.
  • Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kalé, Terry Jones, Andrew Tauferner, Todd Inglett and José Moreira. HPC-Colony: Services and Interfaces for Very Large Systems. ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Systems, 40(2), April 2006.
  • Sayantan Chakravorty, Celso L. Mendes and Laxmikant V. Kalé. Proactive Fault Tolerance in MPI Applications via Task Migration, Accepted for HiPC2006, Bangalore, India, December 2006.
  • Sayantan Chakravorty, C. L. Mendes, & Laxmikant V. Kalé. Proactive Fault Tolerance in Large Systems. First Workshop on High Performance Computing Reliability Issues at HPCA-11, San Francisco/CA, February 2005.
  • Gengbin Zheng. Achieving High Performance on Extremely Large Parallel Machines, PhD Thesis, Dep. Computer Science, University of Illinois, May 2005.
  • Tarun Agarwal. Strategies for Topology-Aware Task Mapping and for Rebalancing with Bounded Migrations, MS Thesis, Dep. Computer Science, University of Illinois, June 2005.
  • Sameer Kumar, Gheorghe Almasi, Chao Huang and Laxmikant V. Kalé. Achieving Strong Scaling with NAMD on Blue Gene/L, University of Illinois, October 2005, submitted for publication.

Talks

  • Eric Bohm, "Charm++ Tutorial", Charm++ Workshop, Urbana, April 2011.
  • Esteban Meneses and Xiang Ni, "Fault Tolerance Support for Supercomputers with Multicore Nodes", Charm++ Workshop, Urbana, April 2011.
  • Abhinav Bhatele, "New Developments in the Charm++ Load Balancing Framework and its Applications", Charm++ Workshop, Urbana, April 2011.
  • Osman Sarood, "Temperature-Aware Load Balancing for Parallel Applications", Charm++ Workshop, Urbana, April 2011.
  • Laxmikant V. Kale, "State of Charm++", Charm++ Workshop, Urbana, April 2011.
  • Eric Bohm, Chao Mei, Yanhua Sun and Gengbin Zheng, "Charm++ Tutorial", Chinese Academy of Sciences, Beijing, China, December 2010.
  • Abhinav Bhatele, "Topology Aware Mapping", University of Illinois (presented by telecom to the Chinese Academy of Sciences), December 2010.
  • Eric Bohm, "Scaling NAMD into the Petascale and Beyond", 4th Workshop INRIA-Illinois Joint Laboratory on Petascale Computing, Urbana, November 2010.
  • Esteban Meneses, "Clustering Parallel Applications to Enhance Message-Logging Protocols", 4th Workshop INRIA-Illinois Joint Laboratory on Petascale Computing, Urbana, November 2010.
  • Abhinav Bhatele, "Mapping your Application on Interconnect Topologies: Effort versus Benefits", George Michael HPC Fellow Presentation at Supercomputing'10, New Orleans, November 2010.
  • Celso L. Mendes and Laxmikant V. Kale, "Adaptive MPI", Blue Waters PRAC Fall Workshop, Urbana, October 2010.
  • Terry Jones. Colony Project. presented at FastOS Workshop, Boston, MA, June 2010.
  • Abhinav Bhatele. Dynamic Load balancing in Charm++. Tutorial presented at the 5th Charm++ Workshop, Urbana, April 2007.
  • Celso L. Mendes. How to Write Applications Using Adaptive MPI. Tutorial presented at the 5th Charm++ Workshop, Urbana, April 2007.
  • Laxmikant V. Kalé. State of Charm++. 5th Charm++ Workshop, Urbana, April 2007.
  • Sayantan Chakravorty. The Charm++ Fault Tolerance Infrastructure. 5th Charm++ Workshop, Urbana, April 2007.
  • Laxmikant V. Kalé. Parallel Programming Models in the Era of Multi-core Processors. Manycore Computing Workshop, Seattle, June 2007.
  • Laxmikant V. Kalé. Programming to Petascale with Multicore Chips and Early Experience on Abe with Charm++. NCSA Multicore Workshop, July 2007.
  • Laxmikant V. Kalé. Petascale and Multicore Programming Models: What is Needed. Keynote talk, 19th International Symposium on Computer Architecture and High Performance Computing, Gramado, Brazil, October 2007.
  • Terry Jones, Operating System Interference Effects at Extreme Scale, SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 2006.
  • Terry Jones, Reducing the Impact of Operating System Interference on Scientific Applications, ScicomP 11, Edinburgh, Scotland, June 3, 2005.
  • Laxmikant V. Kalé. Adaptive MPI: Intelligent Runtime Strategies and Performance Prediction via Simulation, Oak Ridge National Lab, Oak Ridge, TN, August 18, 2005.
  • Laxmikant V. Kalé. Adaptive MPI: Intelligent Runtime Strategies and Performance Prediction via Simulation, University of Tennessee, Knoxville, TN, August 19, 2005.
  • Laxmikant V. Kalé. Enhancing Performance and Productivity for Science and Engineering Applications Across the Computational Grid, University of Texas, Austin, TX, September 2005.
  • Laxmikant V. Kalé. Exploiting the Predictability of Message-Driven Objects to Scale the Memory Hierarchy, ScalPerf05, Italy, October 12, 2005.
  • Laxmikant V. Kalé. Charm++ and Adaptive MPI: Experiences with a Novel Parallel Programming Approach, University of Paris-Sud, France, October 14, 2005.
  • Terry Jones, The HPC-Colony Project, BlueGene Consortium Meeting at SC05, Seattle, Washington, November 15, 2005.