Splitting Large Movie CSV For GitHub: A Simple Guide
Hey guys! Ever faced the issue of a massive CSV file that's too big to handle, especially when trying to push it to GitHub? We've all been there! In this guide, we'll walk through how to split a large CSV file containing movie data into smaller, manageable chunks. This is super useful when dealing with file size limitations, like GitHub's 100MB restriction. Let's dive in and make your data handling smoother!
Understanding the Problem: Large CSV Files
When you're working with datasets like movie databases, the CSV files can get incredibly large. Think millions of rows, detailing everything from titles and genres to actors and ratings. While this comprehensive data is fantastic for analysis and building recommendation systems, it poses a challenge when it comes to storage and sharing. Platforms like GitHub enforce file size limits (GitHub rejects individual files larger than 100MB) to keep performance smooth and avoid straining their servers. So, what do we do when our movie data CSV is a behemoth?
That's where splitting the file comes in handy. By breaking the large CSV into smaller parts, we can easily upload and manage the data without hitting those pesky file size limits. Plus, it can make your data processing more efficient, as you can load and work with smaller chunks at a time. Whether you are trying to upload the movie dataset from Kaggle or your own dataset, the approach remains the same.
Step-by-Step Guide to Splitting Your CSV
So, let's get practical! Here’s a step-by-step guide on how to split your large movie data CSV file using Python. We’ll create a script that does the heavy lifting for us.
1. Setting Up Your Project Directory
First things first, let’s organize our project. Create a new directory for your project, and inside it, make two subdirectories:
- Data: This is where you'll store your original CSV file and the smaller split files.
- Scripts: This is where we'll put our Python script to handle the splitting.

Your project structure should look like this:
YourProject/
├── Data/
└── Scripts/
Download the movie data CSV file from Kaggle and place it inside the Data directory.
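If you have the Kaggle CLI installed and configured with your API token, you can also grab the dataset straight from the terminal. The dataset slug below is just a placeholder; swap in the slug of the movie dataset you're actually using:

kaggle datasets download -d <owner>/<movie-dataset-slug> -p Data --unzip

Downloading the CSV manually from the Kaggle website and dropping it into Data works just as well.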
2. Creating the Splitting Script
Now, let's create a Python script called divide_arquivo.py (or any name you prefer) inside the Scripts directory. This script will read the large CSV file, split it into smaller chunks, and save each chunk as a separate CSV file.
Here’s the Python code you’ll need:
import pandas as pd
import os
def split_csv(input_filename, output_dir, chunk_size=50000):
    """Splits a large CSV file into smaller files."""
    
    # Create the output directory if it doesn't exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    # Iterate through the CSV file in chunks
    for i, chunk in enumerate(pd.read_csv(input_filename, chunksize=chunk_size)):
        output_filename = os.path.join(output_dir, f'movie_data_part_{i + 1}.csv')
        chunk.to_csv(output_filename, index=False)
        print(f'Saved: {output_filename}')
if __name__ == "__main__":
    input_file = '../Data/tmdb_movies.csv'  # Path to your input CSV file
    output_directory = '../Data/split_files'
    chunk_size = 50000  # Number of rows per chunk
    
    split_csv(input_file, output_directory, chunk_size)
    print("CSV splitting complete!")
3. Walking Through the Code
Let's break down what this Python script does:
- Import Libraries: We start by importing the necessary libraries. pandas is a powerful data manipulation library that makes working with CSV files a breeze, and os helps us interact with the operating system, like creating directories.
- split_csv Function: This function takes three arguments: input_filename (the path to the large CSV file), output_dir (the directory where the smaller CSV files will be saved), and chunk_size (the number of rows each smaller file should contain). We've set a default of 50,000 rows, but you can adjust this based on your needs and file size limits.
- Create Output Directory: We use os.makedirs to create the output directory if it doesn't already exist. This ensures we have a place to save our split files.
- Iterate Through Chunks: The heart of the script is the pd.read_csv function with the chunksize parameter, which lets us read the CSV file in chunks rather than loading the entire file into memory at once. For each chunk, we construct an output filename from the chunk number (movie_data_part_1.csv, movie_data_part_2.csv, etc.), save the chunk to a new CSV file with the to_csv method (leaving out the index), and print a message to the console so you know which file has been saved.
- Main Block: The if __name__ == "__main__": block is where we set up the script's parameters: input_file (the path to your original CSV file), output_directory (the directory where the split files will be saved; we've created a subdirectory called split_files inside the Data directory), and chunk_size (the number of rows per chunk). We then call the split_csv function with these parameters and print a completion message.
4. Running the Script
To run the script, open your terminal or command prompt, navigate to the Scripts directory, and execute the following command:
python divide_arquivo.py
You'll see messages in the console as each chunk is saved. Once the script finishes, you’ll find the smaller CSV files in the Data/split_files directory.
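On Linux or macOS, the console output will look something like this (the exact number of parts depends on your file and chunk size):

Saved: ../Data/split_files/movie_data_part_1.csv
Saved: ../Data/split_files/movie_data_part_2.csv
Saved: ../Data/split_files/movie_data_part_3.csv
CSV splitting complete!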
5. Adjusting Chunk Size
The chunk_size parameter is crucial. It determines how many rows each split file will contain. If your split files are still too large for GitHub (over 100MB), you'll need to reduce the chunk_size. Experiment with different values until you find a size that works for you. A good starting point might be 50,000 rows, but you may need to go lower, especially for very wide datasets with many columns.
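If you'd rather not guess, you can estimate a chunk size from the original file's size and row count. Here's a minimal sketch; the 95MB target is an assumption chosen to leave some headroom under GitHub's 100MB limit, and the file path matches the one used earlier:

import os
import pandas as pd

input_file = '../Data/tmdb_movies.csv'
target_bytes = 95 * 1024 * 1024  # stay comfortably under GitHub's 100MB cap

# Count rows without loading the whole file into memory
total_rows = sum(len(chunk) for chunk in pd.read_csv(input_file, chunksize=100_000))

# Average bytes per row, then how many rows fit into the target size
bytes_per_row = os.path.getsize(input_file) / total_rows
estimated_chunk_size = int(target_bytes / bytes_per_row)

print(f'Rows: {total_rows}, about {bytes_per_row:.0f} bytes per row')
print(f'Suggested chunk_size: {estimated_chunk_size}')

Treat the result as a starting point; the split files won't come out at exactly this size, so check them before pushing.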
Best Practices for Handling Split Files
Now that you've split your large CSV file, here are a few best practices to keep in mind:
1. Storing Split Files
It’s a good idea to store your split files in a separate directory (like our split_files directory). This keeps your project organized and makes it easier to manage the individual files.
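After the script runs, your project should look something like this (the number of part files depends on your chunk_size):

YourProject/
├── Data/
│   ├── tmdb_movies.csv
│   └── split_files/
│       ├── movie_data_part_1.csv
│       ├── movie_data_part_2.csv
│       └── ...
└── Scripts/
    └── divide_arquivo.py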
2. Loading and Combining Split Files
When you need to work with the data, you can easily load the split files back into pandas and combine them. Here’s how:
import pandas as pd
import glob
import os
def combine_csvs(input_dir):
    """Combines multiple CSV files into a single DataFrame."""
    
    # Use glob to get a list of all CSV files in the directory
    csv_files = sorted(glob.glob(os.path.join(input_dir, "*.csv")))  # sort so the combined row order is consistent across runs
    
    # Create an empty list to store DataFrames
    list_of_dataframes = []
    
    # Iterate through the list of CSV files
    for filename in csv_files:
        df = pd.read_csv(filename)
        list_of_dataframes.append(df)
    
    # Concatenate all DataFrames in the list
    combined_df = pd.concat(list_of_dataframes, ignore_index=True)
    
    return combined_df
if __name__ == "__main__":
    input_directory = '../Data/split_files'
    combined_data = combine_csvs(input_directory)
    print(f"Combined DataFrame shape: {combined_data.shape}")
    print(combined_data.head())
This script uses the glob library to find all CSV files in the specified directory, sorts the filenames so the row order is consistent from run to run, reads each file into a pandas DataFrame, and then concatenates everything into a single DataFrame.
3. Version Control with Git
When working with Git, remember to add the split files to your repository. However, before committing, make sure to exclude the original large CSV file. You don’t want to push that behemoth to GitHub! You can do this by adding the original CSV file to your .gitignore file.
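For example, assuming the file layout used earlier, your .gitignore could contain a line like this (adjust the path if your original file lives somewhere else):

# Keep the original, oversized CSV out of the repository
Data/tmdb_movies.csv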
4. Automating the Process
For large projects, you might want to automate the splitting process. You can integrate the Python script into your data pipeline or use task scheduling tools to run the script automatically whenever the data is updated.
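As one simple option on Linux or macOS, you could let cron run the script on a schedule. The project path below is a placeholder; point it at wherever your Scripts directory actually lives:

# crontab -e: run the splitting script every night at 2:00 AM
0 2 * * * cd /path/to/YourProject/Scripts && python divide_arquivo.py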
Common Issues and Solutions
Let's tackle some common issues you might encounter while splitting large CSV files.
1. File Size Still Too Large
If your split files are still exceeding GitHub's 100MB limit, you'll need to reduce the chunk_size in your Python script. Experiment with smaller values until you find a size that works.
2. Memory Errors
If you're working with extremely large files, you might run into memory errors. This can happen if your system doesn't have enough RAM to handle the data, even when reading it in chunks. In this case, consider using more memory-efficient data types or processing the data in even smaller chunks.
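One way to trim memory usage is to tell pandas which columns to read and which (smaller) data types to use. This is a sketch only; the column names below are hypothetical, so replace them with columns that actually exist in your movie CSV:

import pandas as pd

# Hypothetical columns and dtypes; adjust them to match your dataset
usecols = ['title', 'vote_average', 'vote_count']
dtypes = {'title': 'string', 'vote_average': 'float32', 'vote_count': 'int32'}

for i, chunk in enumerate(pd.read_csv('../Data/tmdb_movies.csv',
                                      chunksize=10_000,   # smaller chunks than before
                                      usecols=usecols,    # skip columns you don't need
                                      dtype=dtypes)):     # smaller numeric types
    chunk.to_csv(f'../Data/split_files/movie_data_part_{i + 1}.csv', index=False)

Dropping unused columns usually saves far more memory than anything else, so start there.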
3. Encoding Issues
Sometimes, CSV files can have encoding issues, especially if they contain special characters. If you encounter errors while reading the CSV, try specifying the encoding explicitly in the pd.read_csv function. Common encodings include utf-8, latin1, and cp1252.
chunks = pd.read_csv(input_filename, chunksize=chunk_size, encoding='utf-8')  # returns an iterator of DataFrames when chunksize is set
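If you're not sure which encoding a file uses, the chardet library can guess it from a sample of raw bytes. A minimal sketch, assuming chardet is installed (pip install chardet):

import chardet

with open('../Data/tmdb_movies.csv', 'rb') as f:
    sample = f.read(100_000)  # a raw byte sample is enough for a guess

guess = chardet.detect(sample)
print(guess['encoding'], guess['confidence'])

Whatever encoding it reports, pass that value to the encoding parameter shown above.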
Conclusion: Taming the Data Beast
Splitting large CSV files is a crucial skill for any data scientist or developer working with big datasets. By breaking down these files into manageable chunks, you can overcome file size limitations, improve data processing efficiency, and keep your projects organized. We've covered everything from setting up your project directory to writing the Python script, handling split files, and troubleshooting common issues.
So, next time you encounter a massive CSV file, don't fret! Just remember these steps, and you'll be able to tame the data beast and get it under control. Happy coding, and may your datasets always be manageable! Now you’re all set to handle those large movie datasets and build awesome recommendation systems or perform in-depth analyses. Keep exploring, keep coding, and most importantly, keep making data-driven magic happen!