Samvit Solutions’ Involvement in the NCI Data Standards Services (DSS) Project

NCI Cancer Research Data Commons (CRDC)

The CRDC is a cloud-based data science infrastructure for sharing, integrating, and analyzing cancer research data. It enables NCI-funded programs to publicly share genomic, proteomic, imaging, and other types of data. The Cancer Research Data Commons website has more information about the CRDC and its initiatives.

A key service within the CRDC is the Data Standards Services (DSS), which facilitates the aggregation and analysis of data from diverse repositories. Samvit has led the DSS effort, working closely with two NCI Semantic Infrastructure teams: the Cancer Data Standards Registry and Repository (caDSR) and Enterprise Vocabulary Services (EVS). To harmonize data elements across repositories, we analyzed each CRDC node’s data dictionaries, models, metadata elements, and supporting terminologies in depth.

Our team developed a four-step process to ensure that data harmonization is accurate and methodical:

  • After reviewing every repository’s data elements, we identified the elements that are semantically identical across nodes.
  • We wrote a shared definition for each semantically identical data element and, where applicable, identified industry-established terminologies for use across the CRDC nodes. We then published a Request for Comments (RFC) document for each element to collect community feedback.
  • After reconciling the feedback on each RFC, we performed term-level mappings from each node’s terms to the established industry standard or to the NCI Thesaurus ontology.
  • Finally, we submitted the results to the NCI caDSR team to create the official CRDC Common Data Elements (CDEs).
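
As an illustration only, the term-level mapping step above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tooling the Samvit team used: the node names, the permissible values, the synonym sets, and the concept codes (EX001 and so on) are all hypothetical placeholders and are not real NCI Thesaurus codes. It shows how terms from different nodes can resolve to one shared concept, and how terms with no match are set aside for manual review.

```python
from collections import defaultdict

# Hypothetical permissible values for one data element ("sex") in two nodes.
NODE_TERMS = {
    "node_a": ["Male", "Female", "Unknown"],
    "node_b": ["M", "F", "Not Reported"],
}

# Hypothetical standard concepts. The codes are placeholders,
# not real NCI Thesaurus concept codes.
STANDARD_CONCEPTS = {
    "EX001": {"label": "Male", "synonyms": {"male", "m"}},
    "EX002": {"label": "Female", "synonyms": {"female", "f"}},
    "EX003": {"label": "Unknown", "synonyms": {"unknown", "u"}},
}


def normalize(term):
    """Lowercase and trim a term so trivial variants compare equal."""
    return term.strip().lower()


def map_terms(node_terms, concepts):
    """Map each node term to a concept code by exact synonym match.

    Returns (mappings, unmapped): mappings is {node: {term: code}},
    unmapped is a list of (node, term) pairs that need manual review.
    """
    # Build a reverse lookup from normalized synonym to concept code.
    lookup = {}
    for code, concept in concepts.items():
        for synonym in concept["synonyms"]:
            lookup[normalize(synonym)] = code

    mappings = defaultdict(dict)
    unmapped = []
    for node, terms in node_terms.items():
        for term in terms:
            code = lookup.get(normalize(term))
            if code is None:
                unmapped.append((node, term))
            else:
                mappings[node][term] = code
    return dict(mappings), unmapped


if __name__ == "__main__":
    mappings, unmapped = map_terms(NODE_TERMS, STANDARD_CONCEPTS)
    for node, terms in mappings.items():
        for term, code in terms.items():
            print(f"{node}: {term!r} -> {code}")
    print("Needs manual review:", unmapped)
```

Exact synonym matching is the simplest possible strategy. In practice, a mapping like this would be reviewed by terminology curators, which is why unmatched terms are returned separately rather than guessed.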

Samvit drew on its deep knowledge of biomedical research data standards, the NCI caDSR, and NCI EVS, as well as its analysis and research skills, to complete this harmonization effort. The work involved close collaboration with representatives across the CRDC and NCI to ensure that common data from disparate sources could be effectively combined and analyzed. It also gave Samvit critical insight into the CRDC node dictionaries and models. In total, the Samvit team led and supported the development of more than 70 common CDEs across the CRDC nodes.