Researcher FAQ
The Harvard Personal Genome Project participant page, located at my.pgp-hms.org, holds information on participant data.
Each Harvard Personal Genome Project participant has a profile page listing their available information. For example, the participant profile for hu43860C has medical information, the results of surveys provided by the Harvard Personal Genome Project and links to data files associated with this participant, including whole genome data and genotype data.
The Harvard Personal Genome Project can be used to search for:
among others. Please consult the Harvard Personal Genome Project page for more detail.
As of September 2015, we have over 200 whole genomes available and over two thousand individuals have completed survey responses. This reference about the available data may help: PGP Data Summary.
Additionally, there is a consolidated database of the Harvard PGP available for download (see later in the FAQ for a more detailed description):
To see examples of the reports provided as a courtesy to participants, see the GET-Evidence Reports.
The PGP cell lines are also available from the Coriell Institute for Medical Research.
The survey questions are available online.
The survey responses are also available, and can be browsed interactively through Untap.
Please see the contact form on our website.
Anyone enrolled (i.e. anyone who has passed the consent quiz) can take surveys. Surveys can be taken multiple times. A subset of those who have filled out surveys have also had their genome sequenced.
What license is the PGP data under? Do I need to give credit to the Harvard Personal Genome Project if I use this data?
All genomic and health data available on the Harvard Personal Genome Project page are under a CC0 license allowing unencumbered and free use by anyone around the world. The CC0 license allows for the use of the Harvard Personal Genome Project data without the need for additional permission or the need to credit the Harvard Personal Genome Project.
Is there a more structured way to get participant profile information than scraping the Harvard Personal Genome Project website?
Often, a .json extension can be added to the end of a page URL to receive the information in JSON format from the Harvard Personal Genome Project website.
For example, https://my.pgp-hms.org/profile/hu43860C.json will return the JSON formatted page for the hu43860C participant's profile page. The same data may be accessed through https://my.pgp-hms.org/profile_public?hex=hu43860C&format=json.
Other pages can be retrieved in the same way. For example:
- https://my.pgp-hms.org/users.json
- https://my.pgp-hms.org/specimens.json
- https://my.pgp-hms.org/public_genetic_data.json
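The `.json` URLs above can be requested from a script rather than a browser. The following is a minimal sketch using only the Python standard library. The helper names (`profile_json_url`, `listing_json_url`, `fetch_json`) are illustrative and not part of any PGP tooling, and the structure of the returned JSON is not described in this FAQ, so inspect the result before relying on particular keys.

```python
import json
import urllib.request

# Base URL of the participant site, as given in this FAQ.
BASE_URL = "https://my.pgp-hms.org"


def profile_json_url(huid):
    """Build the JSON URL for a participant profile, e.g. hu43860C."""
    return f"{BASE_URL}/profile/{huid}.json"


def listing_json_url(page):
    """Build the JSON URL for a listing page such as 'users' or 'specimens'."""
    return f"{BASE_URL}/{page}.json"


def fetch_json(url, timeout=30.0):
    """Download a URL and parse the response body as JSON.

    Raises urllib.error.URLError on network problems and
    json.JSONDecodeError if the page does not return valid JSON.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (performs a network request, so it is left commented out):
# profile = fetch_json(profile_json_url("hu43860C"))
# print(type(profile))
```

Since not every page is guaranteed to support the `.json` extension, it is worth catching `json.JSONDecodeError` when looping over many URLs.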
Appending &json=true to the end of a GET-Evidence report URL will also yield JSON output. For example:
Additionally, please see Untap, which contains a snapshot of some of the phenotype data in sqlite3 format.
Most genomic data available through the Harvard Personal Genome Project site is available only in the format provided to us by Complete Genomics (CGI). Other file formats (e.g. SAM, BAM, FASTQ, VCF) are not directly provided for released whole genome data and would need to be derived using other tools.
CGI provides a suite of tools, called cgatools, for analyzing and converting their data. Some of the data files provided to us by CGI are not available through participant profiles but are available through an Arvados collection and mirrored on a Google Drive. For information on accessing data on the Google Drive, please see our blog post. Due to space concerns, we do not currently make available the unaligned read information provided to us by CGI, although this may change in the future.
Participants are allowed to upload any data they like, which includes FASTQ and BAM files, among others. This means that some data in other formats, such as BAM or FASTQ, is available because a participant uploaded the data themselves. Feel free to search the Harvard Personal Genome Project site for any additional genomic data that you might find useful. For example, looking for FASTQ on the public genetic data page shows a few participant-uploaded FASTQ files.
##Data Snapshots

###Phenotypes

There is a consolidated database of the Harvard PGP available for download:
Available at the above link is a SQLite3 database (around 140 MB uncompressed) that holds information from participant profiles along with data retrieved from participant GET-Evidence reports. Also available are the comma- and tab-delimited files used to create the SQLite3 database. Data is periodically retrieved from the Harvard PGP site, and the above link represents the most recent snapshot of the database.
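Because this FAQ does not list the tables in the snapshot, a reasonable first step after downloading it is to ask SQLite itself what the file contains. The sketch below uses Python's built-in `sqlite3` module. The function names are illustrative, and the file name in the usage comment is a placeholder, not the actual name of the download.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def table_columns(conn, table):
    """Return the column names of one table, in declaration order."""
    # PRAGMA statements cannot take bound parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]


# Example usage (the path is a hypothetical placeholder for the downloaded file):
# conn = sqlite3.connect("pgp-snapshot.sqlite3")
# for table in list_tables(conn):
#     print(table, table_columns(conn, table))
# conn.close()
```

Once the table and column names are known, ordinary `SELECT` queries can be run through the same connection.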
The spreadsheets and SQLite3 database represent most, but not all, of the data provided on participant profiles, including phenotype data, health record information and genome file locations. URLs to full genomic data and participant-uploaded data (such as 23andMe genotype data) can be found on participant profiles as well as in the appropriate entries in the above consolidated database. For an introduction to the above database and visualization tool, see the Untap Introduction page.
Source code for download and presentation of the above database can be found on the untap repository.
Please feel free to email the Harvard Personal Genome Project staff if you have any questions or run into any problems.
###23andMe Data Snapshot (10 Sept 2015)

https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-575f0kqveim8ggn

###174 MasterVar Snapshot (05 August 2014)

https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-19i6chr3fhfbb95

###179 VCF Snapshot (April 10, 2015)

https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-ppslk16xrt1fdoo

###179 Complete Genomics Data

https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-6tvc9csazv33exn
####More info

If you have the Arvados command-line tools installed, you can use arv-get instead of downloading through the Workbench links above: http://doc.arvados.org/user/tutorials/tutorial-keep-get.html
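As a rough sketch of scripting such a download, the wrapper below calls `arv-get` from Python. It assumes that `arv-get` is on your `PATH`, that your Arvados credentials are already configured, and that the tool accepts a collection identifier followed by `/` and a destination directory, as shown in the tutorial linked above. The function names are illustrative, so check `arv-get --help` for the options in your installed version.

```python
import subprocess


def arv_get_command(collection_id, dest_dir="."):
    """Build the arv-get argument list for downloading a whole collection.

    The trailing slash after the collection identifier asks arv-get to
    fetch every file in the collection into dest_dir (assumed to exist).
    """
    return ["arv-get", f"{collection_id}/", dest_dir]


def download_collection(collection_id, dest_dir="."):
    """Run arv-get and raise CalledProcessError if it exits with an error."""
    subprocess.run(arv_get_command(collection_id, dest_dir), check=True)


# Example usage (requires the Arvados CLI and network access):
# download_collection("su92l-4zz18-ppslk16xrt1fdoo", "vcf_snapshot")
```

The collection identifier is the last path component of the Workbench URLs listed in the snapshot sections.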
If you are interested in using Arvados to complete your annotation work, you can see a sample annotation pipeline here: https://workbench.su92l.arvadosapi.com/pipeline_templates/su92l-p5p6p-x2w4hue3jyqyrgt
For more detailed methods about how the VCF and Complete Genomics collections were created, please see: https://peerj.com/preprints/1426/