Documentation

Tutorial

In this tutorial you will get insights on the FinnGen health data supporting the glaucoma endpoint.

It usually takes 20–30 minutes to complete this tutorial, but we know not everyone has time to complete it in one sitting. That's OK! This tutorial is designed to make it easy to start now and come back to it later.

Opening Risteys homepage

First, open the Risteys homepage in a new tab so that we can easily switch between this tutorial and the homepage.

Go ahead and right-click on the big Risteys title at the top of this page, then select Open Link in New Tab:
screenshot of Risteys header

You should now be able to quickly go back and forth between this tutorial page and the Risteys homepage. Congrats, you are all set up for the next tutorial sections!

Searching for an endpoint

The Risteys homepage has a search bar. Click on it and type glaucoma:
screenshot of Risteys search bar

Search results appear as you type, displaying endpoints matching the search query.

Scroll down the search results to locate the endpoint H7_GLAUCOMA:
screenshot of Risteys search results

Click on the H7_GLAUCOMA link as shown above. It will take you to the endpoint page, which should look like this:
screenshot of the glaucoma page

To make sure you are on the right page, check that you see the title Glaucoma near the top of the page and the H7_GLAUCOMA code just below it, as in the screenshot above.

You are now ready for the next section.

Checking how the endpoint is defined

Now that you are on the glaucoma endpoint page, scroll down a bit to reveal the Endpoint definition section:
screenshot of the glaucoma definition

As we can see, this endpoint is defined using the ICD-10 codes H40-H42, and it also includes other endpoints.

Checking the upset plot for evidence of code usage

Click on the upset plot icon near the top of the page:
screenshot of the upset plot icon

A window pops up with a list of codes for that endpoint, showing how the cases are distributed among these codes. It should look like this:
screenshot of the upset plot for glaucoma

You can now close the upset plot by clicking on the Close button in the top-right corner:
screenshot of the upset plot close button

You are now back on the glaucoma endpoint page. You can continue to the next section.

Checking the summary statistics

Scroll down the page until you see the section Summary Statistics:
screenshot of glaucoma summary statistics

Here you can see different statistics for the glaucoma endpoint, such as:

  • number of cases (20904)
  • mean age at first event (63.77)

Click on the help icon next to Mortality:
screenshot of help icon

A help panel pops up and provides explanations on how to interpret the mortality table:
screenshot of help panel

Close this help panel by clicking on the X button in the top-right corner:
screenshot of help panel close button

Notice there are other help buttons on the endpoint page. They explain different concepts and have the same open/close interaction.

Hover over the 60–70 bin in the age distribution:
screenshot of glaucoma age distribution

The plot now shows that 6171 cases had their first glaucoma event when they were between 60 and 70 years old.

The end

Congratulations! You have completed the Risteys tutorial.

You started by searching for the glaucoma endpoint, then checked how it is defined in FinnGen, and finally looked at its descriptive statistics.

Risteys has more to offer: feel free to look at other sections on the glaucoma endpoint page, check other endpoint pages, or browse the documentation below.

How-to… ?

How to look up endpoints that have a specific ICD-10-fi code?

  1. Click on the search bar.
  2. Enter the ICD-10-fi code of interest.
  3. Click the endpoints in the search results. The matching ICD-10-fi codes are highlighted.

    screenshot of ICD search results

How to check which codes are used for a given endpoint?

There are 3 ways for checking which codes are used for an endpoint:

  1. using the endpoint explainer
  2. using the original rules
  3. using the full data table of the upset plot

Using the endpoint explainer

  1. Go to the endpoint page of your endpoint of interest.
  2. Scroll down to Endpoint definition.
  3. Locate the section Check pre-conditions, main-only, mode, registry filters.
  4. Check the codes displayed in this section.

    screenshot of endpoint codes
Note that some endpoints have an INCLUDE rule, which could bring in additional unlisted codes.

Using the original rules

  1. Go to the endpoint page of your endpoint of interest.
  2. Scroll down to Endpoint definition.
  3. Locate the section Check pre-conditions, main-only, mode, registry filters.
  4. Click show all original rules.

    screenshot of link to original rules
  5. Read the rules as given in the original endpoint definitions.

    screenshot of original rules table

Using the full data table of the upset plot

  1. Go to the endpoint page of your endpoint of interest.
  2. Scroll down to Endpoint definition.
  3. Click on the link full data table.

    screenshot of link to full data table
  4. Read the codes given to the endpoint cases in the Code column of the table.

    screenshot of code list for cases

Related documentation: How to check which combinations of codes are the most common among endpoint cases?

How to check which combinations of codes are the most common among endpoint cases?

  1. Go to the endpoint page of your endpoint of interest.
  2. Scroll down to Endpoint definition.
  3. Click on the link Show upset plot detailing case counts by codes.

    screenshot of link to upset plot
  4. Read the left column for the codes, and the dot matrix for the combination of codes.

    screenshot of upset plot

How to see the GWAS information and Manhattan plot for an endpoint?

  1. Go to the endpoint page of your endpoint of interest.
  2. Click the PheWeb button near the top-right of the page.

    screenshot of PheWeb link

How to browse the data at a different data freeze? (e.g. FinnGen R5)

There are 2 ways to do this:
  1. from the home page
  2. from an endpoint page

From the home page

  1. Go to the home page.
  2. Hover over Other FinnGen data releases at the top of the home page.
  3. Click on the data freeze version you want to browse.

    screenshot of the homepage header

From an endpoint page

  1. Go to the endpoint page of your endpoint of interest.
  2. At the top of the page, click on the arrow next to the current data freeze version.
  3. Click on the data freeze version you want to browse.

    screenshot of an endpoint page header

How to find endpoints related to a given endpoint?

There are two ways to accomplish this:

  1. using the Similar endpoints feature
  2. using the Correlations table

Using the Similar endpoints feature

  1. Go to the endpoint page of your endpoint of interest.
  2. Locate the Similar endpoints box near the top of the page.
  3. Related endpoints which are a strict superset of cases of the current endpoint are shown in Broader endpoints, and endpoints which are a strict subset of cases are shown in Narrower endpoints.

    screenshot of similar endpoints box
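
The strict superset and strict subset relations described above map directly onto set comparisons. The sketch below is only an illustration of that idea, not the Risteys implementation: the function name `classify_related` and the toy individual IDs are hypothetical.

```python
def classify_related(current_cases, other_cases):
    """Classify a candidate endpoint relative to the current endpoint.

    Both arguments are collections of individual identifiers (hypothetical
    toy IDs here). Returns 'broader', 'narrower', or None.
    """
    current, other = set(current_cases), set(other_cases)
    if other > current:  # strict superset of the current endpoint's cases
        return "broader"
    if other < current:  # strict subset of the current endpoint's cases
        return "narrower"
    return None          # identical, partially overlapping, or disjoint


# Toy data, invented for illustration only.
current_endpoint = {1, 2, 3, 4}
candidate_a = {1, 2, 3, 4, 5, 6}
candidate_b = {1, 2}
candidate_c = {3, 4, 7}

for name, cases in [("a", candidate_a), ("b", candidate_b), ("c", candidate_c)]:
    print(name, classify_related(current_endpoint, cases))
```

An endpoint with exactly the same cases is neither a strict superset nor a strict subset, so this sketch reports no relation for it.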

Using the Correlations table

  1. Go to the endpoint page of your endpoint of interest.
  2. Scroll down to the correlation table.
  3. Read the endpoints from the table. By default, it is sorted by highest case overlap between endpoints.

    screenshot of the correlation table

How to get more detailed data on an endpoint? (e.g. data for N<5, histograms with narrower bins)

Risteys doesn't provide data where any data point has fewer than 5 individuals.

More detailed data is available in the FinnGen sandbox. See the FinnGen Analyst Handbook documentation.

How to get measurements that are not shown in Risteys? (e.g. BMI, ECG)

Risteys doesn't provide such measurements at the moment.

It is worth checking the FinnGen Analyst Handbook to see whether such measurements are available through other means.

What is FinRegistry?

FinRegistry is a joint research project of the Finnish Institute for Health and Welfare (THL) and the Data Science and Genetic Epidemiology Lab research group at the Institute for Molecular Medicine Finland (FIMM), University of Helsinki. The project aims to develop new ways to model the complex relationships between health and risk factors. Statistical and machine learning models are developed to understand and predict disease occurrences using high-resolution longitudinal data. FinRegistry utilizes the unique registry system in Finland to combine health data with a wide range of other information from nearly the whole population of Finland. FinRegistry includes all individuals alive and living in Finland on 1 January 2010 (FinRegistry index persons) as well as the index persons' parents, siblings, children, and spouses.

What is FinnGen?

FinnGen is a large-scale academic/industrial research collaboration launched in Finland in 2017 with the aim of collecting and analyzing genomic and health data from 500 000 Finnish biobank participants in 2023. The project aims to improve human health through genetic research, and ultimately to identify new therapeutic targets and diagnostics for treating numerous diseases. It produces near-complete genome variant data from all 500 000 participants using GWAS genotyping and imputation, and utilizes the extensive longitudinal national health register data available on all Finns. The data freeze R10 from September 2022 consists of over 429 000 individuals. The study currently involves Finnish biobanks, University Hospitals and their respective Universities, the Finnish Institute for Health and Welfare (THL), the Finnish Red Cross Blood Service, the Finnish Biobanks - FINBB, and thirteen pharmaceutical companies. The University of Helsinki is the organization responsible for the study.

Where does the data come from?

The data in Risteys comes from FinnGen and FinRegistry. Different Finnish health registries make up the phenotypic data of FinnGen and FinRegistry, which in turn is used to build Risteys.

The main registries used in Risteys are:

  • Care Register for Health Care (HILMO)
  • Population registry (DVV)
  • Cause of death
  • Finnish Cancer Registry
  • Drug purchase and reimbursement (Kela)

Have a look at the Finnish health registries page of the FinnGen Analyst Handbook and at the FinRegistry registry overview for detailed information.

Which years are covered by the different health registries?

The registries used in Risteys vary in their coverage of the data. This image shows which years are covered by each registry in FinnGen at data freeze R10:

registry data coverage years

What is the difference between ICD-10 and ICD-10-fi?

Many places in FinnGen reference ICD-10 and sometimes ICD-10-fi. Both are similar classifications used in electronic health records: they map codes to health conditions.

ICD-10-fi is a variant of ICD-10 introduced by the Finnish health care system.

The main differences between ICD-10 and ICD-10-fi are:

  • Most codes are shared between ICD-10 and ICD-10-fi, but some codes exist only in ICD-10 and some only in ICD-10-fi.
  • ICD-10-fi has definitions for combining symptom and cause into a single code. For example, A01.1 Typhoid fever as cause and G01 Meningitis as symptom form the single code A01.1+G01 Meningitis associated with typhoid fever in ICD-10-fi.
  • ICD-10-fi has a notation to indicate causal medication.

Why is an endpoint defined with ICD-10 but no ICD-9 or ICD-8?

The two main reasons are:

  • The people who defined the endpoint knew which ICD-10 codes to pick when creating the endpoint, but they didn't know whether any ICD-9 or ICD-8 codes could also be used.
  • The people who defined the endpoint know there is no corresponding ICD-9 or ICD-8 code that could be used. This is indicated with the symbol $!$.

Why are some endpoint descriptions wrong?

In some cases the description shown below the endpoint page will be wrong, like in this example:

screenshot of an endpoint description

This happens because the descriptions are not written as part of FinnGen. Instead they are gathered from various sources, and we try to programmatically attribute the best description to all the FinnGen endpoints. But sometimes our algorithm fails.

Ontology (endpoint description)

Endpoints are linked to the international ontologies DOID, MESH, and EFO, and links to the ontologies are presented on Risteys when available. The mapping is carried out using an automated algorithm followed by manual curation.

First, the following hierarchical algorithm is used to link endpoints to DOID and MESH codes:

  1. ICD-10 codes are matched to DOID ICD-10-CM codes
  2. endpoint names are matched to DOID names and synonyms
  3. endpoints are matched to MESH codes and converted to DOID
  4. ICD-10 codes are matched one step up in the ICD-10 hierarchy
  5. endpoint names are matched with DOID codes using the Ratcliff/Obershelp pattern matching (similarity > 0.69)
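
As a rough illustration of the last step, name matching with a similarity cutoff can be sketched with Python's standard `difflib.SequenceMatcher`, whose `ratio()` is based on a closely related "gestalt pattern matching" approach. This is a sketch under assumptions, not the Risteys pipeline: the function `best_name_match`, the lowercasing of names, and the placeholder identifiers are hypothetical.

```python
from difflib import SequenceMatcher


def best_name_match(endpoint_name, doid_names, threshold=0.69):
    """Return (doid_id, similarity) for the best candidate above the threshold.

    doid_names maps identifiers to names. Lowercasing before comparison is an
    assumption of this sketch, not something stated in the documentation.
    """
    query = endpoint_name.lower()
    best_id, best_score = None, 0.0
    for doid_id, name in doid_names.items():
        score = SequenceMatcher(None, query, name.lower()).ratio()
        if score > best_score:
            best_id, best_score = doid_id, score
    if best_score > threshold:  # same cutoff as in the list above
        return best_id, best_score
    return None


# Placeholder identifiers, not real DOID codes.
candidates = {"DOID:EXAMPLE_1": "glaucoma", "DOID:EXAMPLE_2": "asthma"}
print(best_name_match("Glaucoma", candidates))
print(best_name_match("xyz", candidates))
```

Only the single best candidate is kept here. A real pipeline would also need a rule for ties and for names with several synonyms.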

The resulting DOID and MESH codes are mapped to EFO when a mapping is available.

Next, the fuzzy matching algorithm OnToma and the ontology annotations for endpoints available on the Open Targets portal are used to link endpoints to EFO codes.

Finally, endpoints with discordant EFO annotations between the existing mappings, OnToma, and Open Targets are manually checked and corrected.

Key figures & distributions

Key figures and the year and age distributions were computed using data of all persons in FinRegistry and FinnGen. Figures are presented separately for FinRegistry index persons, the whole population in FinRegistry, and FinnGen.

The key figures include the following statistics:

  • Number of individuals: Number of individuals with the endpoint of interest
  • Period prevalence: Number of individuals with the endpoint of interest divided by the total number of individuals in the cohort
  • Median age at first event: Median age at the first occurrence of the endpoint
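
The three key figures follow directly from their definitions. A minimal sketch, assuming a list of ages at first event (one per individual with the endpoint) and a known cohort size; the function name `key_figures` and the toy numbers are invented for illustration.

```python
from statistics import median


def key_figures(ages_at_first_event, cohort_size):
    """Compute the key figures from one age at first event per individual."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    n = len(ages_at_first_event)
    return {
        "n_individuals": n,
        # individuals with the endpoint divided by all individuals in the cohort
        "period_prevalence": n / cohort_size,
        "median_age_at_first_event": median(ages_at_first_event) if n else None,
    }


# Toy data, not taken from FinnGen or FinRegistry.
print(key_figures([58.2, 63.0, 71.5], cohort_size=100))
```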

Distributions are presented by age and year at the first event. Bars in distributions are aggregated to include at least 5 individuals, given the sensitive nature of the data.
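
The documentation does not spell out how bars are aggregated. One possible approach, shown purely as an assumption, is to merge adjacent bins from left to right until each merged bin holds at least 5 individuals. The function `merge_small_bins` and the toy counts are hypothetical.

```python
def merge_small_bins(bins, min_count=5):
    """Greedily merge adjacent (low, high, count) bins until each has >= min_count.

    Assumption of this sketch: bins are sorted and contiguous, and a leftover
    small bin at the end is merged into the previous merged bin.
    """
    merged = []
    pending = None
    for low, high, count in bins:
        if pending is None:
            pending = [low, high, count]
        else:
            pending[1] = high       # extend the pending bin to the right
            pending[2] += count
        if pending[2] >= min_count:
            merged.append(tuple(pending))
            pending = None
    if pending is not None and merged:
        low, _, count = merged[-1]
        merged[-1] = (low, pending[1], count + pending[2])
    # If nothing ever reached min_count, no bar is returned at all.
    return merged


# Toy age bins (low, high, count), invented for illustration.
print(merge_small_bins([(0, 10, 2), (10, 20, 4), (20, 30, 7), (30, 40, 1)]))
```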

Cumulative incidence function (CIF)

The cumulative incidence function (CIF) presents the incidence of an endpoint by age and sex. When death is regarded as a competing event, the interpretation of CIF is the probability of getting the endpoint given it is also possible to die without the endpoint. CIF was estimated using the Aalen-Johansen estimator in a competing risks framework where death was treated as a competing event. The model was stratified by sex, and age was used as a timescale to obtain CIF estimates by age.

The eligibility criteria for CIF are as follows:

  • born before the end of the follow-up (31.12.2021)
  • either not dead or died during the follow-up period (1.1.1998 to 31.12.2021)
  • sex information is available
  • for cases, the outcome endpoint has to occur during the follow-up period

We sampled all or at most 10 000 cases and 1.5 controls per case among the non-cases. Subjects were weighted by the inverse of the sampling probability to account for the sampling design. We required at least 50 cases and controls during this period for running the analysis. Moreover, CIF is only presented for ages with at least 5 cases due to the sensitive nature of the data.
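
To make the sampling design concrete, the sketch below computes sample sizes and inverse-probability weights for the rule of at most 10 000 cases and 1.5 controls per case. It is an illustration under assumptions: the function `sampling_plan`, the truncation of the control count to an integer, and the non-case count in the example are not taken from the Risteys pipeline.

```python
def sampling_plan(n_cases, n_noncases, max_cases=10_000, controls_per_case=1.5):
    """Return sample sizes and inverse-probability weights for cases and controls."""
    if n_cases <= 0 or n_noncases <= 0:
        raise ValueError("need at least one case and one non-case")
    sampled_cases = min(n_cases, max_cases)
    # Truncating to an integer is an assumption of this sketch.
    sampled_controls = min(n_noncases, int(controls_per_case * sampled_cases))
    return {
        "sampled_cases": sampled_cases,
        "sampled_controls": sampled_controls,
        # weight = 1 / sampling probability = population size / sample size
        "case_weight": n_cases / sampled_cases,
        "control_weight": n_noncases / sampled_controls,
    }


# 20 904 cases as in the tutorial; the non-case count is made up.
print(sampling_plan(20_904, 400_000))
```

When all cases are sampled, the case weight is 1, so only the controls are up-weighted.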

The Aalen-Johansen estimates were obtained using the Lifelines Python library.
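
For readers unfamiliar with the estimator, here is a dependency-free sketch of the Aalen-Johansen cumulative incidence with death as a competing event. It is not the Lifelines code used for Risteys: it is unweighted and unstratified, and the function name and event coding (0 = censored, 1 = endpoint, 2 = death) are assumptions made for this example.

```python
def aalen_johansen_cif(ages, events, event_of_interest=1):
    """Unweighted Aalen-Johansen cumulative incidence for one event type.

    ages:   age at event or censoring (age is the timescale)
    events: 0 = censored, 1 = endpoint, 2 = death without the endpoint
    Returns a list of (age, CIF) pairs at each age where any event occurs.
    """
    data = sorted(zip(ages, events))
    n_at_risk = len(data)
    surv = 1.0   # probability of being event-free just before the current age
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        age = data[i][0]
        d_interest = d_any = n_leaving = 0
        while i < len(data) and data[i][0] == age:
            event = data[i][1]
            d_interest += event == event_of_interest
            d_any += event != 0
            n_leaving += 1
            i += 1
        if d_any:
            cif += surv * d_interest / n_at_risk
            surv *= 1 - d_any / n_at_risk
            curve.append((age, cif))
        n_at_risk -= n_leaving
    return curve


# Toy data, invented for illustration.
print(aalen_johansen_cif([50, 60, 70, 80], [1, 2, 0, 1]))
```

Treating death as a competing event, rather than as censoring, keeps the cumulative incidence from being overstated at older ages.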

Mortality

The goal of the mortality analysis is to estimate the association between an exposure endpoint and death. The results include estimates for the coefficients as well as absolute mortality risk estimations. A Cox proportional hazards model was used to estimate mortality associated with an endpoint. Age was used as a timescale and birth year was included as a covariate to account for calendar effects. The model was stratified by sex.

The eligibility criteria for the mortality analysis are as follows:

  • born before the end of the follow-up (31.12.2021)
  • either not dead or died during the follow-up period (1.1.1998 to 31.12.2021)
  • sex information is available
  • for the exposed persons, the exposure endpoint has to occur during the follow-up period and at least 30 days prior to death. Persons exposed less than 30 days before death are considered unexposed.
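
The rule that persons exposed less than 30 days before death count as unexposed can be written as a small check. This is a hedged sketch: the function `is_exposed` is hypothetical, and treating a gap of exactly 30 days as exposed is an assumption.

```python
from datetime import date

FOLLOW_UP_START = date(1998, 1, 1)
FOLLOW_UP_END = date(2021, 12, 31)


def is_exposed(exposure_date, death_date=None, min_lag_days=30):
    """Return True if the person counts as exposed in the mortality analysis."""
    if exposure_date is None:
        return False
    if not (FOLLOW_UP_START <= exposure_date <= FOLLOW_UP_END):
        return False  # exposure must occur during the follow-up period
    if death_date is not None and (death_date - exposure_date).days < min_lag_days:
        return False  # exposed less than 30 days before death: unexposed
    return True


print(is_exposed(date(2010, 1, 1), date(2010, 1, 15)))  # 14 days before death
print(is_exposed(date(2010, 1, 1), date(2011, 1, 1)))   # a year before death
```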

Exposure-stratified sampling was applied to acquire a sufficient number of persons for the analysis. At least 50 exposed and unexposed cases and controls were required. We sampled all or at most 10 000 cases and 1.5 controls per case among the non-cases. The model was weighted by the inverse of the sampling probability to account for the sampling design.

Mortality risks can be used to estimate the risk of death given exposure. Conditional mortality risks represent the risk of an event by time t given that no event has occurred by time t0. Conditional mortality risks were computed using the following formula: MR(t | t0) = 1 - S(t) / S(t0), where t0 is the age at baseline, t is the target age, and S is the survival function. The current year minus the baseline age was used as the birth year.
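
The formula MR(t | t0) = 1 - S(t) / S(t0) can be evaluated for any survival function. In the sketch below, `toy_survival` is a made-up exponential survival curve used only to make the example runnable. It is not a fitted Cox model and the hazard value is arbitrary.

```python
import math


def conditional_mortality_risk(survival, t, t0):
    """MR(t | t0) = 1 - S(t) / S(t0): risk of death by age t given alive at age t0."""
    if t < t0:
        raise ValueError("target age t must not be below baseline age t0")
    s0 = survival(t0)
    if s0 <= 0:
        raise ValueError("S(t0) must be positive")
    return 1 - survival(t) / s0


def toy_survival(age, hazard=0.01):
    """Hypothetical survival function with a constant hazard (illustration only)."""
    return math.exp(-hazard * age)


# Risk of dying by age 80 for someone alive at age 70, under the toy curve.
print(conditional_mortality_risk(toy_survival, t=80, t0=70))
```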

The Cox proportional hazards model was fitted using the Lifelines Python library.

Relationships – Survival analysis

The goal of an endpoint-to-endpoint survival analysis is to estimate the association between two clinical endpoints, the prior endpoint and the outcome endpoint. We used a Cox proportional hazards model with age as a timescale to estimate the hazard ratio between the prior endpoint and the outcome endpoint. Birth year and sex were used as covariates.

The eligibility criteria for the survival analysis are as follows:

  • born before the end of the follow-up (31.12.2021)
  • either not dead or died during the follow-up period (1.1.1998 to 31.12.2021)
  • sex information is available
  • for individuals with the prior endpoint, the prior endpoint has to occur during the follow-up period and no more than 180 days prior to the outcome endpoint

We sampled all or at most 10 000 cases, i.e. persons with the outcome endpoint, and 1.5 controls per case among the non-cases separately for individuals with and without the prior endpoint. For sex-specific endpoints, controls were sampled of the same sex. The model was weighted by the inverse of the sampling probability to account for the sampling design, as in the mortality analysis.

The Cox proportional hazards model was fitted using the Lifelines Python library.