
Data Preparation - Copy Number

Copy number variation (CNV) occurs when the number of copies of a particular gene differs from one individual to the next. After the Human Genome Project was completed, it became apparent that the genome gains and loses genetic material. The extent to which copy number variation contributes to human disease is not yet fully known, but it has long been recognized that some cancers are associated with elevated copy numbers of particular genes.
 
DepMap Portal
The goal of the Dependency Map (DepMap) portal is to empower the research community to make discoveries related to cancer vulnerabilities by providing open access to key cancer dependencies, along with analytical and visualization tools.
 
DepMap Copy Number data
To process DepMap Copy Number data, we need to download the following datasets from the DepMap website.

 
Cell Line Sample Info
  1. DepMap_ID: Static primary key assigned by DepMap to each cell line
  2. cell_line_name
  3. stripped_cell_line_name: Cell line name with alphanumeric characters only
  4. CCLE_Name: Previous naming system that used the stripped cell line name followed by the lineage; no longer assigned to new cell lines
  5. alias: Additional cell line identifiers (not a comprehensive list)
  6. COSMIC_ID: Cell line ID used in the COSMIC cancer database
  7. sex: Sex of tissue donor if known
  8. source: Source of cell line vial used by DepMap
  9. Achilles_n_replicates: Number of replicates used in Achilles CRISPR screen passing QC
  10. cell_line_NNMD: Difference in the means of positive and negative controls normalized by the standard deviation of the negative control distribution
  11. culture_type: Growth pattern of cell line (Adherent, Suspension, Mixed adherent and suspension, 3D, or Adherent (requires laminin coating))
  12. culture_medium: Medium used to grow cell line
  13. cas9_activity: Percentage of cells remaining GFP negative on days 12-14 of the Cas9 activity assay, as measured by FACS
  14. RRID: Cellosaurus research resource identifier
  15. WTSI_Master_Cell_ID
  16. sample_collection_site: Tissue collection site
  17. primary_or_metastasis: Indicates whether tissue sample is from primary or metastatic site
  18. primary_disease: General cancer lineage category
  19. Subtype: Subtype of disease; specific disease name
  20. age: If known, age of tissue donor at time of sample collection
  21. Sanger_model_ID: Sanger Institute Cell Model Passport ID
  22. depmap_public_comments
  23. lineage: Cancer type classifications in a standardized form
  24. lineage_subtype
  25. lineage_sub_subtype
  26. lineage_molecular_subtype
 
Copy Number 
Gene-level copy number data, log2-transformed with a pseudocount of 1. The values are generated by mapping genes onto the segment-level calls.
  • Rows: cell lines (Broad IDs)
  • Columns: genes (HGNC symbol and Entrez ID)
  • 25368 Genes
  • 1750 Cell Lines
  • 35 Primary Diseases
  • 37 Lineages
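
The log2 transform with a pseudocount of 1 described above can be sketched as follows. This is a minimal illustration, not DepMap's own code: the function names are hypothetical, and it assumes the untransformed value is a relative copy number in which 1.0 means no gain or loss.

```python
import math


def to_log2_cn(relative_cn: float, pseudo_count: float = 1.0) -> float:
    """Forward transform: log2(relative copy number + pseudocount)."""
    return math.log2(relative_cn + pseudo_count)


def from_log2_cn(log2_value: float, pseudo_count: float = 1.0) -> float:
    """Inverse transform: recover the relative copy number."""
    return 2.0 ** log2_value - pseudo_count


# Assumption: 1.0 is a copy-neutral relative copy number, so it maps to log2(2).
neutral = to_log2_cn(1.0)

# The pseudocount keeps a complete loss (0.0) finite instead of log2(0).
complete_loss = to_log2_cn(0.0)
```

The pseudocount is what makes the transform defined for genes with a relative copy number of zero; `from_log2_cn` is useful when a downstream analysis needs values on the original scale.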
 
Data Processing 
Not all DepMap_IDs in the "sample_info.csv" file are present in the "CCLE_gene_cn.csv" file, so the sample information must be restricted to the cell lines that have copy number data. Moreover, it is better to keep the features (genes/probes) in a separate file, based on the following data model. You can download a file by clicking on its file name.
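
One way to carry out the two processing steps described above is sketched below, using only the Python standard library. Several things here are assumptions rather than facts from this page: the first column of "CCLE_gene_cn.csv" is assumed to hold the DepMap_ID, gene column headers are assumed to look like "SYMBOL (ENTREZ_ID)", the function names are hypothetical, and the inline rows are made-up stand-ins for the real files.

```python
import csv
import io
import re

# Tiny made-up stand-ins for sample_info.csv and CCLE_gene_cn.csv.
SAMPLE_INFO_CSV = """DepMap_ID,cell_line_name,lineage
ACH-000001,LINE_A,ovary
ACH-000002,LINE_B,blood
ACH-000003,LINE_C,lung
"""

GENE_CN_CSV = """,A1BG (1),NAT2 (10)
ACH-000001,1.02,0.95
ACH-000003,0.48,1.51
"""

# Assumed header layout: HGNC symbol, a space, then the Entrez ID in parentheses.
HEADER_PATTERN = re.compile(r"^(?P<symbol>.+?) \((?P<entrez>\d+)\)$")


def split_gene_header(header):
    """Split 'SYMBOL (ENTREZ_ID)' into (symbol, entrez_id)."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"Unexpected gene header: {header!r}")
    return match.group("symbol"), int(match.group("entrez"))


def prepare(sample_info_text, gene_cn_text):
    """Return (matched samples, gene table, copy number values keyed by DepMap_ID)."""
    cn_rows = list(csv.reader(io.StringIO(gene_cn_text)))
    # The header row minus its first cell gives one entry per gene.
    genes = [split_gene_header(h) for h in cn_rows[0][1:]]
    cn_by_id = {row[0]: [float(v) for v in row[1:]] for row in cn_rows[1:] if row}
    # Keep only the cell lines that actually have copy number data.
    samples = [
        row
        for row in csv.DictReader(io.StringIO(sample_info_text))
        if row["DepMap_ID"] in cn_by_id
    ]
    return samples, genes, cn_by_id


samples, genes, cn_by_id = prepare(SAMPLE_INFO_CSV, GENE_CN_CSV)
```

With the real files, the two inline strings would be replaced by the contents of the downloaded CSVs. Keeping the gene table separate means the symbol and Entrez ID are stored once per gene rather than being parsed out of the column headers every time.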

 
Bioada SmartArray 
This video shows how you can upload the CCLE_expression files to Bioada SmartArray and then explore, visualize, and build predictive models significantly faster and more easily.