Technical Career Building
Why Your ML Project Looks Like a Mess (and How to Fix It)
The folder structure that separates professional data scientists from tutorial followers — explained for absolute beginners.
You followed a tutorial. You built a model. It works. And then you look at your project folder: 47 files, no structure, three copies of final_model_v3_new.py, and a Jupyter notebook that somehow has 200 cells. You know it works, but you also know that no hiring manager would take it seriously.
This is the gap between doing ML and doing ML professionally. And it is exactly what separates candidates who get interviews from those who do not.
Why You Need Structure (Even If You Are Working Alone)
Most beginners think: “I am the only one working on this. Why do I need folders and structure?”
Here is why:
You are not the only person working on this. Future you is a different person. In 3 months, you will not remember why you did something. In 6 months, you will not remember where the data came from. In 12 months, you will be starting from scratch because nothing is documented. Structure is how you communicate with future you.
No one will take your project seriously without it. You are applying for ML jobs. The interviewer asks: “Show me a project.” You send them a GitHub repo with 47 files in one folder. They close the tab. Why? Because they know that project cannot be deployed, cannot be maintained, cannot be trusted. Professional ML projects have structure. Always.
You cannot deploy chaos. The moment you want to move from Jupyter notebook to production, you need a way to track which data was used. You need a way to reproduce the exact model. You need a way to serve predictions to users. You need a way to monitor when things break. None of this works if your files are named final_v3_new.py.
Structure is not optional. It is the difference between a tutorial and a real project.
The Folder Structure That Actually Works
Here is the standard layout used in production ML projects. Every folder has a specific purpose. You do not need to memorize this — you need to understand why each piece exists.
ml-project/
├── .github/
│ └── workflows/
│ ├── ci.yml
│ ├── train.yml
│ └── deploy.yml
├── configs/
│ ├── params.yaml
│ ├── logging.yaml
│ └── infrastructure/
├── data/
│ ├── raw/
│ ├── processed/
│ └── external/
├── docs/
├── notebooks/
├── src/
│ ├── data/
│ ├── features/
│ ├── models/
│ └── evaluation/
├── tests/
├── artifacts/
├── README.md
├── Dockerfile
├── requirements.txt
└── pyproject.toml
Let us walk through each one.
The .github/ Folder: Automation That Saves Your Life
This is where you put automated workflows. The reason you need this is so you do not have to manually test, train, or deploy every time you make a change.
Inside the .github/ folder, you will have a workflows/ subfolder with files like ci.yml, train.yml, and deploy.yml.
What Is a .yml File?
A .yml file (or .yaml) is written in YAML, which stands for “YAML Ain’t Markup Language” (the acronym originally meant “Yet Another Markup Language” before it was redefined). It is a configuration format written in a human-readable way. Think of it like a recipe: step 1, do this. Step 2, then do this. Step 3, finally do this.
In ML projects, YAML files tell computers what to do automatically.
Here is what each workflow file does:
- ci.yml — Continuous Integration. Runs every time you push code. Checks whether your code has errors, whether tests pass, and whether it is formatted correctly. It catches bugs before you deploy.
- train.yml — Training Pipeline. Automatically retrains your model when data changes, logs results to your tracking system (such as MLflow), and saves you from the disaster of forgetting to retrain after updating the data.
- deploy.yml — Deployment. Builds your model into a container, pushes it to a server, and makes it available for predictions.
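To make the CI workflow concrete, here is a minimal sketch of what a ci.yml could look like as a GitHub Actions workflow. The job name, Python version, and exact steps are illustrative assumptions, not prescribed by any particular project:

```yaml
# .github/workflows/ci.yml -- minimal sketch; versions and steps are illustrative
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4          # fetch the repository
      - uses: actions/setup-python@v5      # install Python
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/                 # fail the build if any test fails
```

Every push now runs your test suite automatically, which is exactly the “catches bugs before you deploy” behavior described above.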
Without automation, you manually test everything. You manually retrain. You manually deploy. That is how bugs slip through. That is how models get deployed with old data. With .github/workflows/, the computer does it for you every single time.
The configs/ Folder: Settings You Will Change Often
This is where you store all your project settings. The reason you need this is so you do not hardcode values in 50 different files.
Inside configs/, you will typically have:
- params.yaml — Model hyperparameters, data paths, feature lists, training parameters. When you want to change the learning rate or batch size, you change it in one place, not across ten scripts.
- logging.yaml — Logging configuration. Where logs go, what level of detail to capture, how to format them.
- infrastructure/ — Docker configs, cloud deployment settings, anything related to where and how the project runs.
The rule is simple: if a value might change between experiments, environments, or team members, it goes in configs/, not in your code.
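As a sketch of what this looks like in practice, here is a hypothetical params.yaml. Every key and value below is an illustrative example, not a recommendation:

```yaml
# configs/params.yaml -- illustrative values only
data:
  raw_path: data/raw/train.csv
  processed_path: data/processed/train.parquet
features:
  - age
  - income
  - tenure_months
train:
  learning_rate: 0.01
  batch_size: 64
  epochs: 20
  random_seed: 42
```

When an experiment needs a different batch size, you edit one line here instead of hunting through your scripts.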
The data/ Folder: Where Your Data Lives
Never scatter data files across your project. All data goes here, organized into three subfolders:
- raw/ — Original, untouched data as you received it. Never modify files in this folder. This is your source of truth.
- processed/ — Cleaned, transformed, feature-engineered data ready for modeling. This is what your training script reads.
- external/ — Third-party data, reference datasets, lookup tables.
Important: Never commit large data files to Git. Use .gitignore to exclude the data/ folder, and document where the data comes from in your README. For version control of data, tools like DVC (Data Version Control) are the standard.
The src/ Folder: Your Actual Code
This is the core of your project. All Python source code lives here, organized by function:
- src/data/ — Scripts for loading, cleaning, and preprocessing data.
- src/features/ — Feature engineering and transformation logic.
- src/models/ — Model definition, training, prediction, and serialization.
- src/evaluation/ — Metrics calculation, model comparison, validation logic.
The key principle: each file does one thing. train.py trains the model. predict.py serves predictions. evaluate.py computes metrics. If a file is doing three different things, split it.
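To illustrate the “each file does one thing” principle, here is a toy sketch. In a real project the two functions would live in separate modules (for example src/models/train.py and src/models/predict.py); the model itself is deliberately trivial — it just memorizes the mean target per feature value — because the point is the separation of responsibilities, not the algorithm:

```python
# Toy illustration of single-responsibility modules.
# train() would live in src/models/train.py, predict() in src/models/predict.py.

def train(features, targets):
    """Fit a toy model: the mean target observed for each feature value."""
    grouped = {}
    for x, y in zip(features, targets):
        grouped.setdefault(x, []).append(y)
    return {x: sum(ys) / len(ys) for x, ys in grouped.items()}

def predict(model, x, default=0.0):
    """Look up the learned mean; fall back to a default for unseen values."""
    return model.get(x, default)

if __name__ == "__main__":
    model = train(["a", "a", "b"], [1.0, 3.0, 5.0])
    print(predict(model, "a"))  # 2.0
```

Because each function has one job and no hidden state, each can be imported, tested, and swapped out independently.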
The notebooks/ Folder: Exploration Only
Jupyter notebooks are for exploration and prototyping. They are not production code. Use them for EDA (exploratory data analysis), quick experiments, and visualization. But once something works, move it to src/ as a proper Python module.
A common mistake: building an entire ML pipeline inside a notebook. It works in development, but it cannot be tested, cannot be versioned properly, and cannot be deployed. Notebooks are the sketchpad. src/ is the finished product.
The tests/ Folder: Proof That Your Code Works
Most beginners skip testing entirely. In industry, untested ML code is a liability. Your tests/ folder should include:
- Unit tests for your data processing functions
- Tests that verify your model can train and predict without errors
- Tests that check data schema and types
- Integration tests that run the full pipeline end-to-end
You do not need 100% test coverage. But a hiring manager who sees a tests/ folder in your GitHub repo immediately knows you understand how production code works.
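As a concrete starting point, a first unit test might look like this. The clean_rows function here is a hypothetical stand-in for whatever preprocessing helper lives in your src/data/ modules; with pytest installed, running pytest tests/ picks this file up automatically:

```python
# tests/test_data.py -- a minimal first test.
# clean_rows is a hypothetical stand-in for a real helper in src/data/.

def clean_rows(rows):
    """Hypothetical helper: drop any row containing a missing value."""
    return [row for row in rows if all(value is not None for value in row)]

def test_clean_rows_drops_missing():
    raw = [(1, 2.0), (2, None), (3, 4.0)]
    cleaned = clean_rows(raw)
    assert len(cleaned) == 2                      # the row with None is gone
    assert all(len(row) == 2 for row in cleaned)  # column count is unchanged
```

One small test like this already documents what your data pipeline guarantees.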
The artifacts/ Folder: Model Outputs
Trained models, serialized objects, and experiment logs go here. Like data/, this folder should be in your .gitignore — you do not commit large binary files to Git. But the folder should exist in your structure to show where outputs live.
The README.md: Your Project's First Impression
This is the first thing anyone sees when they visit your GitHub repo. A good README for an ML project should include:
- What the project does (one paragraph)
- How to set it up (installation steps)
- How to run it (training, prediction, evaluation)
- Project structure overview
- Results and metrics
- Technologies used
A well-written README is the difference between a recruiter spending 10 seconds on your repo and spending 2 minutes. Those 2 minutes are what get you an interview.
How to Fix Your Project This Weekend
You do not need to rebuild from scratch. Here is the minimal action plan:
1. Create the folder structure above in your existing project.
2. Move your scripts into src/, organized by function.
3. Extract config values from your scripts into configs/params.yaml.
4. Move notebooks into notebooks/ and label them clearly (e.g., 01-eda.ipynb, 02-feature-exploration.ipynb).
5. Write a README that explains what the project does and how to run it.
6. Add a .gitignore that excludes data/, artifacts/, .env, and __pycache__/.
7. Write one test — just one. Test that your data loading function returns the expected shape.
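For the .gitignore step, a minimal file covering exactly those entries could look like this (the *.pyc line is an extra common convention, not required by the list above):

```
# .gitignore -- minimal example for the action plan
data/
artifacts/
.env
__pycache__/
*.pyc
```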
That is it. Seven steps. You can do all of this in one weekend. And when a recruiter opens your GitHub, they will see a project that looks like it was built by someone who knows what they are doing.
This Is What the ML4 Sprint Builds
If you want to go beyond structure and build a full production-ready ML pipeline — from raw data to a deployed model with experiment tracking, containerization, and a live API endpoint — that is exactly what the ML4 Sprint delivers in 4 days.
You do not just learn the theory. You build the project, deploy it, and walk away with a portfolio piece that demonstrates production ML skills to any hiring manager in Germany or Europe.
Build a Production ML Project in 4 Days
The ML4 Sprint gives you a deployed, portfolio-ready ML pipeline with FastAPI, Docker, and MLflow. One project. Four days. Real deployment.
Learn About the ML4 Sprint