Data Privacy & Responsible Data Handling
Many CACoM projects deal with sensitive or clinically derived data.
Respecting data privacy is both a legal requirement and a professional responsibility.
This page summarizes how to handle biomedical and other personal data in accordance with TUM guidelines and EU GDPR principles.
Guiding Principle
Treat every dataset as if it contained information about someone you personally know and care about.
This simple rule captures the spirit of responsible data handling: protect confidentiality, minimize exposure, and only process what is truly needed.
Legal & Institutional Context
All CACoM activities fall under:
- The EU General Data Protection Regulation (GDPR),
- TUM's internal data protection guidelines for teaching and research.
You are not required to become a legal expert, but you are expected to follow the course's operational rules for safe and ethical data use.
Data Classification in CACoM
| Category | Description | Examples | Public sharing allowed? |
|---|---|---|---|
| Identifiable data | Contains personal identifiers or metadata that could directly identify an individual. | Names, hospital IDs, GPS traces, raw medical records. | ❌ Never |
| Pseudonymized data | Identifiers replaced by codes, but re-identification is still possible with auxiliary information. | “Patient_001”, timestamped CTG traces. | ❌ No — internal use only |
| Anonymized data | All identifiers removed, and re-identification is intended to be impossible — though this can rarely be guaranteed. | Aggregated metrics, derived features, downsampled recordings. | ⚠️ Only with explicit instructor approval |
| Synthetic data | Artificially generated by algorithms or simulations and not based on any real individual or measurement. | Simulated CTG signals, mock IMU datasets, randomly generated noise signals. | ✅ Yes, freely shareable |
If your “synthetic” dataset was generated entirely from code (e.g., statistical sampling, procedural simulation), you may share it publicly without restrictions. If it was derived from real data, even indirectly, treat it as anonymized and request instructor approval before uploading. Approval is quick — but necessary.
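To make "generated entirely from code" concrete, here is a minimal sketch of a synthetic CTG-like signal built only from a seeded random number generator and a sine wave. The function names, the 140 bpm baseline, the 4 Hz sampling rate, and the noise level are illustrative assumptions, not course requirements, and the output is not meant to be physiologically realistic.

```python
import csv
import math
import random


def generate_synthetic_ctg(n_samples=240, sample_rate_hz=4.0, seed=0):
    """Return (time_s, fhr_bpm) tuples produced purely from code.

    No real recording is read at any point, so the result falls under
    the "synthetic" category. All numeric constants are illustrative.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    baseline_bpm = 140.0       # assumed baseline, not a clinical reference
    samples = []
    for i in range(n_samples):
        t = i / sample_rate_hz
        slow_wave = 5.0 * math.sin(2 * math.pi * t / 30.0)  # slow oscillation
        noise = rng.gauss(0.0, 2.0)                          # random jitter
        samples.append((round(t, 2), round(baseline_bpm + slow_wave + noise, 1)))
    return samples


def write_csv(path, samples):
    """Write the samples to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["time_s", "fhr_bpm"])
        writer.writerows(samples)
```

Calling `write_csv("synthetic_ctg_sample.csv", generate_synthetic_ctg())` would produce a file of this kind. If any parameter were fitted to real recordings instead of chosen by hand, the result would count as derived from real data and would need instructor approval.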
Anonymization is much harder than it seems. Many “de-identified” datasets can still be traced back to individuals when combined with external information. When unsure, treat all data as pseudonymized and keep it private.
Data Handling Rules
✅ You must:
- Store sensitive or pseudonymized data only on approved TUM or course-managed systems (e.g., institutional cloud, encrypted drives, or the CACoM Google Drive).
- Document data sources and access permissions in your README or report.
- Delete all local copies of sensitive data after submission unless explicitly instructed otherwise.
- Share sensitive data only internally via the official CACoM Google Drive submission folder.
🚫 You must not:
- Upload or share any clinical, patient, or proprietary data to public platforms (GitHub, Kaggle, publicly shared Google Drive links, personal websites, etc.).
- Email datasets to external parties without written permission from the course staff.
- Attempt to re-identify individuals from pseudonymized data.
- Combine datasets in ways that could indirectly reveal identities.
Data in Reproducibility Packages
When preparing your Reproducibility Package:
- Include synthetic or example data in public repositories for demonstration.
- Upload real datasets only to the internal CACoM Google Drive.
- Clearly label what type of data (synthetic, anonymized, pseudonymized) each file represents.
- Provide metadata and descriptions, not raw identifiers.
Example:
data/
├── synthetic_ctg_sample.csv # OK to publish
├── real_patient_signals.csv # INTERNAL ONLY
└── README_data.md # explains origins and permissions
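The labeling rule above can be sketched as a small manifest check that splits files into "public repository" and "internal only" according to the classification table. The `DataCategory` enum, the function names, and the assumption that `real_patient_signals.csv` is pseudonymized are all hypothetical; this is one possible way to organize the check, not an official course tool.

```python
from enum import Enum


class DataCategory(Enum):
    IDENTIFIABLE = "identifiable"
    PSEUDONYMIZED = "pseudonymized"
    ANONYMIZED = "anonymized"
    SYNTHETIC = "synthetic"


def may_publish(category, instructor_approved=False):
    """Mirror the sharing column of the classification table."""
    if category is DataCategory.SYNTHETIC:
        return True
    if category is DataCategory.ANONYMIZED:
        return instructor_approved  # only with explicit approval
    return False  # identifiable and pseudonymized data stay internal


def split_for_release(manifest, approved=()):
    """Split a {filename: DataCategory} manifest into (public, internal)."""
    public, internal = [], []
    for name, category in manifest.items():
        target = public if may_publish(category, name in approved) else internal
        target.append(name)
    return sorted(public), sorted(internal)


# Hypothetical manifest matching the example tree; the category of the
# real file is an assumption made for illustration.
manifest = {
    "synthetic_ctg_sample.csv": DataCategory.SYNTHETIC,
    "real_patient_signals.csv": DataCategory.PSEUDONYMIZED,
}
public, internal = split_for_release(manifest)
```

Keeping the category next to each filename also gives you the content for `README_data.md` almost for free.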
External Collaborations
Some projects involve clinicians or industry partners.
In these cases:
- Follow their institutional data policies and confidentiality agreements.
- Do not redistribute data obtained through such collaborations.
- Report any data breach, accidental exposure, or uncertainty immediately to the instructors.
Remember: professionalism in handling real-world data reflects directly on TUM's reputation and on yours.
Proprietary or Restricted Components
Some CACoM projects rely on proprietary hardware, software, or datasets — for example, the fetal heartbeat simulator or other tools provided by collaborators or industrial partners.
In such cases:
- You may not publish or redistribute the full project if it includes or depends on proprietary components.
- You may, however, share open parts of your work — such as analysis scripts, derived metrics, or simulation examples — provided they do not reveal or replicate proprietary details.
- Before publishing or uploading anything, check with the instructors or your project supervisor which parts of your project may be shared and which must remain private.
- When preparing your Reproducibility Package, clearly mark restricted files or modules (e.g., hardware_interface_proprietary/) and document how they fit into your workflow.
Projects involving proprietary resources are still fully valid academic work — reproducibility in such cases refers to methodological transparency, not open publication of all materials.
Derived & Aggregated Data
Aggregated or summary-level results are generally safe to share — but always verify that they cannot be traced back to individual participants.
You may share derived, aggregated, or statistical summaries publicly if:
- Individual participants cannot be identified, and
- The summaries do not reveal private or proprietary information.
Examples of safe outputs:
- Group-level averages (e.g., mean heart rate per minute).
- Statistical model coefficients.
- Performance metrics (accuracy, RMSE, etc.).
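One common safeguard when producing group-level summaries is to suppress groups that contain too few observations, since an average over one or two participants is close to the raw value. The sketch below assumes a hypothetical threshold of five; the course does not prescribe a number, so treat `min_group_size` as a placeholder to agree on with your instructors.

```python
from collections import defaultdict
from statistics import mean


def group_means(records, min_group_size=5):
    """Return {group_label: mean_value}, omitting groups that are too small.

    `records` is an iterable of (group_label, value) pairs. The default
    threshold is an illustrative assumption, not a course rule.
    """
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {
        label: mean(values)
        for label, values in groups.items()
        if len(values) >= min_group_size
    }


# Group "B" has only two values and is therefore left out of the summary.
records = [("A", 140)] * 5 + [("B", 150)] * 2
summary = group_means(records)
```

A size threshold reduces the risk but does not remove it, so the manual check that summaries cannot be traced back to individuals still applies.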
Common Pitfalls
- Confusing pseudonymized data with anonymized data.
- Uploading “cleaned” or “trimmed” datasets that still contain traceable timestamps or IDs.
- Sharing metadata files that reveal sensitive information (e.g., hospital location, device serials).
- Forgetting that derived features can still leak identity (e.g., a feature that encodes a rare clinical condition).
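Some of these pitfalls can be caught early with a rough header scan before a file leaves your machine. The sketch below flags column names that look like identifiers, timestamps, or device and location metadata. The pattern list and the function name are assumptions chosen for illustration; the scan produces false positives (for example `filename`) and cannot detect identifying values hidden under neutral column names, so it supplements a manual review rather than replacing it.

```python
import re

# Illustrative patterns only; extend or adjust them for your own data.
SUSPICIOUS_PATTERNS = [
    r"name",
    r"birth",
    r"(^|_)id($|_)",
    r"timestamp",
    r"date",
    r"serial",
    r"hospital",
    r"gps",
    r"address",
]


def flag_suspicious_columns(header):
    """Return the column names that match any suspicious pattern."""
    flagged = []
    for column in header:
        lowered = column.lower()
        if any(re.search(pattern, lowered) for pattern in SUSPICIOUS_PATTERNS):
            flagged.append(column)
    return flagged


header = ["patient_id", "time_s", "fhr_bpm", "Device_Serial", "recorded_timestamp"]
flagged = flag_suspicious_columns(header)
```

Any flagged column is a prompt to check whether the file is really anonymized or should be treated as pseudonymized and kept internal.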
Quick Checklist
- All datasets classified correctly (identifiable / pseudonymized / anonymized / synthetic).
- Sensitive data stored only on approved systems.
- No patient data uploaded to GitHub or public cloud.
- Real data included only in the internal CACoM Google Drive submission folder.
- Synthetic examples provided for reproducibility.
- README includes data source and access explanation.