High Performance Computing Center at the CEA
The TGCC (Très Grand Centre de Calcul) is a high performance computing infrastructure at the CEA (French Alternative Energies and Atomic Energy Commission), capable of hosting petascale supercomputers and designed around a data-oriented architecture. The Computer Centre for Research and Technology (CCRT – Centre de calcul pour la recherche et la technologie), located within the TGCC, has configured an extension dedicated to the needs of France Génomique users.
The CCRT e-infrastructure for data storage and processing, implemented by teams from CEA/DIF, will provide France Génomique with several petabytes of storage space for medium-term use (scientific projects spanning several years); this storage area will be linked to several thousand processing cores via a high performance interconnection.
As part of the CCRT infrastructure, the France Génomique configuration is also scalable, so that it can be extended as the needs of genomics projects grow.
Equipment and capacity
The configuration dedicated to France Génomique comprises:
- 180 dual-processor Bull nodes (Intel Sandy Bridge E5-2680, 2.7 GHz, 8 cores per processor) with 128 GB of memory per node, i.e. a total of 2,880 cores,
- 2 Bullx S6410 high-capacity memory systems with 2 TB of memory,
- 9 hybrid blades equipped with NVIDIA Kepler GPUs.
This is an extension of the Airain configuration of the CCRT, installed at the TGCC.
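The core count quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch: the node, processor, core and memory figures come from the list above, while the function names and the memory-per-core figure are illustrative additions derived from those numbers, not values published by the centre.

```python
# Figures taken from the equipment list above.
NODES = 180
PROCESSORS_PER_NODE = 2   # dual-processor nodes
CORES_PER_PROCESSOR = 8   # Intel Sandy Bridge E5-2680
MEMORY_PER_NODE_GB = 128


def total_cores(nodes, processors_per_node, cores_per_processor):
    """Total number of cores across all compute nodes."""
    return nodes * processors_per_node * cores_per_processor


def memory_per_core_gb(memory_per_node_gb, processors_per_node, cores_per_processor):
    """Memory available per core on one node (derived, not stated in the text)."""
    return memory_per_node_gb / (processors_per_node * cores_per_processor)


cores = total_cores(NODES, PROCESSORS_PER_NODE, CORES_PER_PROCESSOR)
per_core = memory_per_core_gb(MEMORY_PER_NODE_GB, PROCESSORS_PER_NODE, CORES_PER_PROCESSOR)
print(f"{cores} cores, {per_core:.0f} GB of memory per core")
```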
The data will be hosted according to the following storage configuration:
- Medium-term storage in the form of a 5 PB global file system, including 2 PB of disk space (Lustre + IBM HPSS hierarchical data storage system),
- Archive system for preliminary data
Main achievements
Scientists from Genoscope (the French national sequencing centre, attached to the Institute of Genomics at the CEA) conducted modelling studies using the Titan supercomputer at the CCRT in order to characterise a total of 83 protein families of unknown function, comprising some 60,000 sequences. This phase, which would normally have required 280,000 hours of processing time, was performed on the supercomputer in only 70 hours using 4,000 processors. Scientists were able to use the results to create a catalogue of structural signatures specific to each of the families studied. This catalogue provides a valuable source of information to biochemists for the discovery of new enzyme activities.
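The figures quoted for the modelling study are mutually consistent, as the short calculation below shows. This is only a sketch: the three input numbers come from the paragraph above, and the function names are illustrative. It assumes the 280,000 hours refers to processing time on a single processor, in which case the quoted 70 hours on 4,000 processors corresponds to ideal (linear) scaling.

```python
# Figures taken from the paragraph above.
SEQUENTIAL_HOURS = 280_000   # estimated processing time on one processor (assumption)
WALL_CLOCK_HOURS = 70        # elapsed time on the supercomputer
PROCESSORS = 4_000


def speedup(sequential_hours, parallel_hours):
    """Ratio of sequential time to parallel wall-clock time."""
    return sequential_hours / parallel_hours


def parallel_efficiency(speedup_value, processors):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup_value / processors


s = speedup(SEQUENTIAL_HOURS, WALL_CLOCK_HOURS)
e = parallel_efficiency(s, PROCESSORS)
print(f"speedup: {s:.0f}x, efficiency: {e:.2f}")
```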
Genoscope has been using the computing resources of the TGCC/CCRT for several years, in particular via the DARI calls for proposals (DARI: Direction d’appui à la recherche et à l’innovation, a public body which supports research and innovation).
In the context of the DARI calls, the TARA OCEANS project was awarded more than 3.5 million hours of processing time to study the diversity of marine organisms. Several sequence analysis tools were used in this study: BLAST, BLAT, InterProScan and CDDsearch. Specific code was also developed and used to adapt these tools to the technical operating constraints of the TGCC machines (massive data parallelisation, monitoring of job execution, error recovery, and short job units).
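The constraints listed above (data parallelisation, monitoring, error recovery, short job units) can be illustrated with a generic sketch. This is not the code developed for TARA OCEANS, which is not described in this document. All names below (split_into_units, run_with_retry, process_all) are hypothetical, and the task passed in stands in for a call to an analysis tool on one batch of sequences.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_units(items, unit_size):
    """Cut the input into short job units of at most unit_size items."""
    return [items[i:i + unit_size] for i in range(0, len(items), unit_size)]


def run_with_retry(task, unit, max_attempts=3):
    """Error recovery: rerun a failed unit up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(unit)
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
    raise RuntimeError(f"unit failed after {max_attempts} attempts") from last_error


def process_all(items, task, unit_size=100, workers=4, max_attempts=3):
    """Run task on every unit in parallel; return (results, failed unit indices)."""
    units = split_into_units(items, unit_size)
    results = [None] * len(units)
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(run_with_retry, task, unit, max_attempts): index
            for index, unit in enumerate(units)
        }
        for done, future in enumerate(as_completed(futures), start=1):
            index = futures[future]
            try:
                results[index] = future.result()
            except RuntimeError:
                failed.append(index)
            # Monitoring: report progress as each unit finishes.
            print(f"{done}/{len(units)} units finished, {len(failed)} failed")
    return results, failed


# Placeholder task: total sequence length per unit (stands in for a real tool).
sequences = ["ACGT", "GGCATT", "TTA", "CCGGAA", "A"]
totals, failed_units = process_all(
    sequences, lambda unit: sum(len(s) for s in unit), unit_size=2, workers=2
)
```

Keeping the units short limits the work lost when one of them fails, and the list of failed unit indices makes it possible to resubmit only those units rather than the whole dataset.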
Quality Assurance / Certification
The CEA/DIF teams have developed internationally recognised skills and expertise in the area of big data management (contributions to open-source developments, a leading role in EOFS, …) and also in the definition and management of high performance computing centres. User support teams are available to help users get the most out of the centre’s resources.
In addition, a dedicated application support team has been set up by the Institute of Genomics (CEA) on behalf of France Génomique.
Platform Management
Pierre Leca
CEA DAM Île-de-France
Bruyères-le-Châtel
91297 Arpajon Cedex