The importance of studying single cells is reflected in the ongoing revolution in biology centered around technologies for single-cell analysis. Microscopy offers an opportunity to study differences in protein localization within a population of cells. Current machine learning models for classifying protein localization patterns in microscope images give a summary of the entire population of cells. However, the single-cell revolution in biology demands models that can precisely classify patterns in each individual cell in the image.
The Human Protein Atlas is an initiative based in Sweden that aims to map proteins in all human cells, tissues, and organs. The data in the Human Protein Atlas database is freely accessible to scientists all around the world, which allows them to explore the cellular makeup of the human body. Solving the single-cell image classification challenge will help the Atlas characterize single-cell heterogeneity in its large collection of images by generating more accurate annotations of the subcellular localizations for thousands of human proteins in individual cells. The objective of this competition is to help this organization model the spatial organization of the human cell more accurately and provide new open-access cellular data to the scientific community, which may accelerate our growing understanding of how human cells function and how diseases develop.
This is a weakly supervised multi-label classification problem and a code competition. Given images of cells from their microscopes and labels of protein location assigned together for all cells in the image, Kagglers will develop models capable of segmenting and classifying each individual cell with precise labels. [1]
There are several challenges within this project. Some relate to the complexity of the project itself, and some to the processing load and data handling.
The cell images are provided as a set of four filters, RGBY (red, green, blue, yellow), which represent different parts of the cell:
- Red: microtubules
- Green: protein of interest
- Blue: nucleus
- Yellow: endoplasmic reticulum
If each filter is plotted separately, the result looks something like this:
Each row represents a different cell, and each column represents a different filter.
When those filters are blended, the full cell appearance can be observed:
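As a rough illustration of what blending means here, the sketch below merges the four filters into a single RGB image with NumPy. The function name `blend_rgby` and the choice to render yellow as equal parts red and green are assumptions made for this example, not necessarily the blending used to produce the images in this write-up.

```python
import numpy as np

def blend_rgby(red, green, blue, yellow):
    """Blend four single-channel filters (2-D arrays, values 0-255) into one RGB image."""
    r = red.astype(np.float32)
    g = green.astype(np.float32)
    b = blue.astype(np.float32)
    y = yellow.astype(np.float32)
    # Assumption: yellow is shown as equal parts red and green.
    rgb = np.stack([r + y / 2, g + y / 2, b], axis=-1)
    # Clip so that bright overlapping regions stay inside the valid pixel range.
    return np.clip(rgb, 0, 255).astype(np.uint8)
```

Other weightings are possible, and some pipelines keep all four filters as separate input channels instead of blending them.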
Every cell in every image must be separated and labelled individually, as the goal requires. The labels are the following:
0. Nucleoplasm
1. Nuclear membrane
2. Nucleoli
3. Nucleoli fibrillar center
4. Nuclear speckles
5. Nuclear bodies
6. Endoplasmic reticulum
7. Golgi apparatus
8. Intermediate filaments
9. Actin filaments
10. Microtubules
11. Mitotic spindle
12. Centrosome
13. Plasma membrane
14. Mitochondria
15. Aggresome
16. Cytosol
17. Vesicles and punctate cytosolic patterns
18. Negative
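Because a cell can carry several of these labels at once, the target is usually represented as a multi-hot vector with one position per class. The sketch below is a minimal example that assumes the labels arrive as a string of class indices separated by `|` (for example `"0|16"`); the separator and the helper names `encode_labels` and `decode_labels` are assumptions made for illustration.

```python
NUM_CLASSES = 19  # class indices 0..18 from the list above

def encode_labels(label_string, num_classes=NUM_CLASSES, sep="|"):
    """Turn a string such as "0|16" into a multi-hot list of 0/1 flags."""
    vector = [0] * num_classes
    for token in label_string.split(sep):
        token = token.strip()
        if not token:
            continue  # tolerate empty pieces such as a trailing separator
        index = int(token)
        if not 0 <= index < num_classes:
            raise ValueError(f"label index out of range: {index}")
        vector[index] = 1
    return vector

def decode_labels(vector):
    """Return the sorted class indices whose flag is set."""
    return [i for i, flag in enumerate(vector) if flag]
```

For example, `encode_labels("0|16")` sets positions 0 (Nucleoplasm) and 16 (Cytosol), and `decode_labels` recovers `[0, 16]` from that vector.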
In order to start trying out different models and strategies, the training data has to be processed in a way that helps such models to digest the information. Depending on the model in use, certain operations must be executed.
In general, the main preprocessing operations are:
- Blending the filters into a single image for faster execution.
- Decoding each image into the original separate filters.
- Normalization.
- Image resizing.
- Augmentation: random sized crop, horizontal flip, vertical flip.
- Encoding the image label.
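Several of the operations listed above can be sketched with plain NumPy. The example below is a simplified illustration, not the exact pipeline used in this project: the function names, the nearest-neighbour resize, the `min_scale` parameter, and the caller-supplied `mean` and `std` values are all assumptions. In practice an image augmentation library would typically be used instead.

```python
import numpy as np

def normalize(image, mean, std):
    """Scale a uint8 HxWxC image to [0, 1], then standardize each channel."""
    scaled = image.astype(np.float32) / 255.0
    return (scaled - np.asarray(mean, np.float32)) / np.asarray(std, np.float32)

def resize_nearest(image, out_h, out_w):
    """Resize with nearest-neighbour sampling (simple, but lower quality than bilinear)."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return image[rows][:, cols]

def random_sized_crop(image, out_size, rng, min_scale=0.5):
    """Crop a random square of random size, then resize it to out_size x out_size."""
    h, w = image.shape[:2]
    side = max(1, int(rng.uniform(min_scale, 1.0) * min(h, w)))
    top = rng.integers(0, h - side + 1)   # upper bound is exclusive
    left = rng.integers(0, w - side + 1)
    crop = image[top:top + side, left:left + side]
    return resize_nearest(crop, out_size, out_size)

def random_flip(image, rng, p=0.5):
    """Apply a horizontal and a vertical flip, each independently with probability p."""
    if rng.random() < p:
        image = image[:, ::-1]  # horizontal flip: reverse the columns
    if rng.random() < p:
        image = image[::-1, :]  # vertical flip: reverse the rows
    return image
```

A random generator created with `np.random.default_rng(seed)` can be passed as `rng` so that augmentations are reproducible. Augmentations would normally be applied only to training images, while normalization and resizing apply to every image fed to the model.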
For prediction, the model selected was EfficientNet, for which pre-trained weights are available.