For over a decade now, most of us no longer work only with local files; instead, we keep our work in the cloud so it isn't lost if a computer breaks. Have you ever wondered how these systems work?
In this post, we'll see how file-storage services like Google Drive or Dropbox work.
As I say in each of these posts, there isn't a single valid solution; there are several. The important thing is to understand the process and how it works so you can discuss the details in interviews.
1 - Requirements for an Online File System
For our scenario, the basic functional requirements are:
- Upload files.
- Download files.
- Share files.
- Propagate file changes to all clients.
This is the basic operation of a file-sharing system, where you have an app running on your machine—basically what Google Drive or Dropbox does.
Additionally, our system should have the following characteristics, which are the non-functional requirements:
- The application must be highly available (99.99% uptime).
- Support for large files.
- Low latency.
- The application must be able to scale.
NOTE: Real-time collaboration will not be covered in this post as that feature is a design in itself.
2 - Working with Large Files
One of the most important aspects of how these applications work is that files are not sent over the network as a whole. Let me explain: if you have a 100-megabyte document with text, images, and so on, and you make a small change, you don't want to send the entire file to the server, because that would mean re-sending all 100 megabytes to every client each time you update something.
To avoid this scenario, we split the file into what are called "chunks" or blocks, of a certain size, and only sync the blocks that have changed.
As you can imagine, this adds complexity to the client application, which has to diff what existed before against what exists now in order to sync only the chunks that changed.
Similarly, we must keep chunk information in the system’s database.
This database should contain the file ID along with each of its chunks, identified by the chunk's hash (simplified here), and each chunk's location in the storage system we're using.
NOTE: If you want to add versioning, this is where you can do it, as each time you sync a chunk, you can assign it a version, and just record it in the table. Or you could even have one table row per version with all its chunks inside. But the idea is clear.
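The chunking idea above can be sketched in a few lines of Python; the 4 MB size and SHA-256 hashes are assumptions (real clients may use content-defined chunking rather than fixed-size splits):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, an assumed size (reportedly what Dropbox uses)

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a file's bytes into fixed-size chunks (the last one may be smaller)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def chunk_hashes(chunks: list[bytes]) -> list[str]:
    """Identify each chunk by the SHA-256 of its content."""
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def changed_chunks(old_hashes: list[str], new_hashes: list[str]) -> list[int]:
    """Indices of chunks whose hash differs: only these need re-uploading."""
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]
```

With this, editing one byte of a 10 MB file marks only one of its three chunks as needing upload.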
3 - System Entities
I know I already introduced one of our system's entities in the previous section, but I thought it was worth covering chunks separately, since they are key to understanding how these apps work.
With that in mind, we have several main entities:
- File: Which we'll later split into chunks, but we still need the file as such.
- File metadata: Contains information such as the user, file name, last modified date, etc.
- User: Users of our system.
- Devices: We’re designing Dropbox/Google Drive, meaning we have applications both on PC and mobile, etc. We can store the machine ID, which user it is linked to, and most importantly, the date of last sync.
- Sharing: A table (I'm terrible at names) indicating which users have access to which files. You can go further and add "Team", which works like Google Workspace, where X users are part of a team and can share files within that team.
NOTE: Metadata and files can be combined into a single table.
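As a rough sketch, the entities above could look like the following Python dataclasses; every name and field here is illustrative, not a real schema (metadata and chunks are merged into one entity, as the NOTE suggests):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class User:
    user_id: str
    email: str

@dataclass
class Device:
    device_id: str
    user_id: str          # which user this machine is linked to
    last_sync: datetime   # the most important field: date of last sync

@dataclass
class FileMetadata:
    file_id: str
    owner_id: str
    name: str
    modified_at: datetime
    chunk_hashes: list[str] = field(default_factory=list)  # chunks folded in, per the NOTE

@dataclass
class Share:
    file_id: str
    user_id: str          # which users have access to which files
```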
4 - File System Architecture Design
Let's start with the file upload process—we saw earlier that we need to split files and, when downloading, reassemble them.
Splitting must be done in the client application: the client must be able to take a file, split it, and upload the new chunks to the server. The app therefore has three sections:
- A section that monitors files, both local and on the remote server, since we always want them in sync. To detect which local files have been modified, we have to use OS-specific functions; for files changing on the server, we'll discuss that further ahead.
- A section responsible for splitting and merging files.
- And another section that synchronizes files to ensure we always have the latest version of each file.
We keep a local log of our activity, which means uploads can be paused and resumed later. This is possible because we have files split into chunks.
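A minimal sketch of that local log, assuming a simple JSON file on disk (the class and file names are made up):

```python
import json
from pathlib import Path

class UploadLog:
    """Local activity log: remembers which chunks were already uploaded,
    so an interrupted upload can resume instead of restarting."""

    def __init__(self, path: Path):
        self.path = path
        self.done: set[str] = set(json.loads(path.read_text())) if path.exists() else set()

    def mark_done(self, chunk_id: str) -> None:
        self.done.add(chunk_id)
        self.path.write_text(json.dumps(sorted(self.done)))  # persist after every chunk

    def pending(self, all_chunks: list[str]) -> list[str]:
        """Chunks still to upload; chunk splitting is what makes resuming possible."""
        return [c for c in all_chunks if c not in self.done]
```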
Likewise, when uploading files, the file contents do not pass through our backend: the client app sends a request indicating that it needs to upload a chunk, and that call goes through the API gateway to our chunk service.
The chunk service is responsible for storing the process status in the database. It receives a call like this:
```json
// request
{
  "id": 1,
  "name": "file1",
  "size_in_kb": 100000,
  "status": "in_progress",
  "chunks": [
    { "id": "1_A" },
    { "id": "1_B" },
    { "id": "1_C" }
  ]
}

// response
{
  "chunks": [
    { "id": "1_A", "location": "s3..." },
    { "id": "1_B", "location": "s3..." },
    { "id": "1_C", "location": "s3..." }
  ]
}
```
As you can see, the request contains the file's metadata along with the list of its chunks.
For each of those chunks, the service replies with a direct link to the storage service: if we use AWS S3, for example, a pre-signed URL pointing straight at S3. This means the file contents never have to pass through our system.
This same feature exists in AWS, Azure, GCP, etc., and the URL has a limited lifetime (for example, one hour).
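To make the exchange concrete, here is a mocked sketch of the chunk service handler; the "s3..." locations stand in for real pre-signed URLs (on AWS those would come from boto3's generate_presigned_url), and the in-memory dict stands in for the database:

```python
def handle_upload_request(request: dict, url_ttl_seconds: int = 3600) -> dict:
    """Sketch of the chunk service: record each chunk as in_progress and hand
    back one upload location per chunk. The locations mimic pre-signed URLs,
    which in a real cloud provider expire after url_ttl_seconds."""
    db = {}  # stand-in for the chunks table
    response_chunks = []
    for chunk in request["chunks"]:
        db[chunk["id"]] = {"file_id": request["id"], "status": "in_progress"}
        response_chunks.append({
            "id": chunk["id"],
            "location": f"s3://uploads/{request['id']}/{chunk['id']}?expires={url_ttl_seconds}",
        })
    return {"chunks": response_chunks}
```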
Each time a chunk upload completes, we must notify the chunk API. There are two options: the client app can make a call indicating the chunk has been uploaded, or, using S3 events (Azure, GCP, and others have equivalents), the system can be notified via a queue or a serverless function listening for those events.
Of course, in an interview, you should note that each service should sit behind a load balancer with multiple instances.
If you’re not familiar with these terms or their importance, I recommend my book Building Distributed Systems.
Furthermore, for every chunk we generate an event whose goal is to notify all clients that a file or chunk is available. Of course, if not all chunks are available yet there is no reason to notify; this wait for the full set only applies during the initial upload.
To do this, we check our metadata system to ensure all chunks are uploaded and, with the permissions system, find which users have permission for that file, since they should also be able to read it.
We do this with a serverless function I'll call the chunk post-processor.
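A minimal sketch of that post-processor logic, assuming a dict-based chunk table and a list of (file_id, user) shares (all names hypothetical):

```python
def on_chunk_uploaded(file_id: str, chunk_id: str,
                      chunks_db: dict, shares: list[tuple]) -> list[str]:
    """Runs after each chunk upload (e.g. triggered by an S3 event).
    Marks the chunk as uploaded and, only once *all* chunks of the file
    are done, returns the users with permission, i.e. who to notify."""
    chunks_db[(file_id, chunk_id)] = "uploaded"
    all_done = all(status == "uploaded"
                   for (fid, _), status in chunks_db.items() if fid == file_id)
    if not all_done:
        return []  # no reason to notify yet
    return [user for (fid, user) in shares if fid == file_id]
```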
There are several options to implement this.
One is to generate an event and listen with a notification system, which communicates with the client app, prompting it to fetch missing files and chunks from the server.
My idea here is to send a notification to the app, which then refreshes its data. This can be done using WebSockets or SSE, meaning an open connection per client over which notifications arrive.
Alternatively, the app can poll the server every minute, asking whether there are any new chunks since this device's last sync. It's a bit slower but much simpler than WebSockets, and it works.
Or, as a final option, we could use long polling—the user’s request stays open until new data arrives or times out.
All three options are valid—the key is being able to explain to the interviewer the pros and cons of the approach you choose compared to the others.
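As an illustration of the polling option, the server-side query can be as simple as filtering the chunk index by the device's last-sync timestamp (field names assumed):

```python
from datetime import datetime

def chunks_since_last_sync(device_last_sync: datetime,
                           chunk_index: list[dict]) -> list[dict]:
    """Server side of the polling option: return every chunk uploaded after
    this device's last sync, so the client downloads only what it is missing."""
    return [c for c in chunk_index if c["uploaded_at"] > device_last_sync]
```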
4.1 - Application Flow
With this, our design is complete. Now, let’s explain the process:
1 - The user adds a file in the app. Several actions take place here: first, the file is split into chunks, and then we sync to “the cloud.”
2 - If the file is new, we add its info to the metadata; if not, we validate the user (checking permissions, if implemented). Either way, we get back the file ID, which will be used to identify its chunks.
3 - Each chunk operates individually; it is added to the database with "in progress" status, and a pre-signed URL is returned.
4 - The client uses the pre-signed URL to upload the file.
5 - Each time a chunk has been uploaded, a serverless function updates the database with the chunk info.
6 - Each time a chunk is uploaded, we must propagate the info, generating an event.
7 - An event consumer reads this, pulling all file info, including which users have access, and generates an event for each.
8 - The event is propagated to an app with the user’s connected clients, indicating there’s a new file version.
9 - The app asks the API for all links to new chunks since the last sync. They’re downloaded and the files update correctly.
4.2 - Advanced Design Details in a File System
Now, I’ll mention key elements in these systems that you should cover in interviews.
Chunk size
Knowing what chunk size to use isn’t easy; Dropbox, for example, uses 4MB, but in an interview, you can discuss the pros and cons of having 1MB or 10MB chunks.
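A quick back-of-the-envelope helper for that discussion; the fixed per-request cost is an assumed number, purely for comparing sizes:

```python
import math

def upload_overhead(file_mb: int, chunk_mb: int, per_request_ms: int = 50) -> dict:
    """Rough numbers for the chunk-size trade-off: smaller chunks mean finer
    sync granularity but more requests (and more per-request overhead);
    larger chunks mean fewer requests but more bytes re-sent on every small
    edit. per_request_ms is an assumed fixed cost per upload request."""
    n = math.ceil(file_mb / chunk_mb)
    return {
        "chunks": n,
        "request_overhead_ms": n * per_request_ms,
        "bytes_resent_per_small_edit_mb": chunk_mb,
    }
```

For a 1 GB file, 1 MB chunks mean 1024 requests while 4 MB chunks mean 256; but every small edit re-sends 1 MB versus 4 MB respectively.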
CDN
Sometimes a file is accessed by thousands of people worldwide; in that case, use a CDN to cache copies of popular files globally. That way, even if the file lives primarily in Europe, access from elsewhere is still fast. Note: login and permissions are still required to access data through the CDN.
Security
The app should encrypt chunks in transit, both for upload and download. Use TLS 1.3 for transport, and once stored, apply AES-256 + KMS (at least on AWS). For highly sensitive data, consider client-side encryption, though you may lose extra features like previews.
Conflicts
This post hasn’t addressed two users having write access to the same file. But what happens if two people modify the same file at once? You have to decide: is the last write valid, or do you use a CRDT strategy (more complex to implement)?
In fact, there’s a third option: optimistic locking with manual merge—the app notifies of a conflict and lets the user decide. If I’m not mistaken, this is what Dropbox does.
Versions
As we mentioned earlier, we can implement versioning by versioning chunks; that way, each file is available in different versions. Each chunk is always identified by a content hash (what I call the chunk ID), which lets us avoid duplicating unchanged chunks across versions.
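That deduplication can be sketched as a tiny content-addressed store; class and field names are made up:

```python
import hashlib

class VersionStore:
    """Versioning by versioning chunks: each file version is just an ordered
    list of chunk hashes, and identical chunks are stored once (content-
    addressed), so unchanged chunks are never duplicated across versions."""

    def __init__(self):
        self.blobs: dict[str, bytes] = {}    # hash -> chunk content, stored once
        self.versions: list[list[str]] = []  # version number -> chunk hashes

    def commit(self, chunks: list[bytes]) -> int:
        hashes = []
        for c in chunks:
            h = hashlib.sha256(c).hexdigest()
            self.blobs.setdefault(h, c)      # reuse if any version already has it
            hashes.append(h)
        self.versions.append(hashes)
        return len(self.versions) - 1        # the new version number
```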
Mastering, understanding, and being able to explain these terms can make a real difference in a technical interview.
If you run into any problem, you can add a comment below or reach me through the website's contact form.