Write a Large Amount of Data to a CSV File with Go (Golang)

Introduction

Have you ever found yourself in a situation where you needed to prepare a substantial amount of data for testing your code? If so, you’re not alone. Working with large datasets is a common requirement in software development, and Go provides some powerful tools to help you accomplish this efficiently.

In this blog post, I’ll share my approach to writing a large amount of data into a CSV file using Go, and I’ll demonstrate how to leverage Goroutines to make the process more efficient and faster.

Writing Data Sequentially

Let’s start by addressing the problem of writing data into a CSV file sequentially.

This is a straightforward approach where you iterate through your data and write it one record at a time.

package main

import (
 "encoding/csv"
 "errors"
 "fmt"
 "io"
 "log"
 "os"
 "sync"

 "github.com/brianvoe/gofakeit"
)

func main() {
 GenerateLargeCSV(10000000, "test")
}

// GenerateLargeCSV generates a CSV file with numRows rows
func GenerateLargeCSV(numRows int, fileName string) {
 err := os.Mkdir("data", 0777)
 if err != nil {
  if !errors.Is(err, os.ErrExist) {
   panic(err)
  }
 }
 file, err := os.Create(fmt.Sprintf("data/%s.csv", fileName))
 if err != nil {
  panic(err)
 }
 defer func() {
  err := file.Close()
  if err != nil {
   log.Printf("error occurred in file.Close() err: %+v", err)
  }
 }()

 writer := csv.NewWriter(file)
 for i := 0; i < numRows; i++ {
  row := generateFakeRow() // Replace with your data generation logic
  if err := writer.Write(row); err != nil {
   panic(err)
  }
//   Optionally, flush every batchSize rows to cap the buffer (batchSize is not defined here):
//   if i > 0 && i%batchSize == 0 {
//    writer.Flush()
//   }
 }
 writer.Flush()
 if err := writer.Error(); err != nil {
  panic(err)
 }
}
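
The generateFakeRow helper isn't shown above; a minimal sketch using the imported gofakeit package could look like this (the specific fields are illustrative assumptions, not from the original code):

 // generateFakeRow builds one CSV record of fake data;
 // the fields here are arbitrary examples
 func generateFakeRow() []string {
  return []string{
   gofakeit.Name(),
   gofakeit.Email(),
   gofakeit.City(),
   gofakeit.Company(),
  }
 }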

Using Goroutines for Parallelism

To speed this up, we can leverage Goroutines. Goroutines allow tasks to run concurrently, which lets us utilize multi-core processors and may accelerate the process.

However, it’s essential to note that the standard csv.Writer is not safe for concurrent use. If multiple Goroutines write to it at the same time, you can get data races and interleaved, corrupted rows. Therefore, we’ll wrap csv.Writer to make it thread-safe.

Here’s an example of using Goroutines to write data to a CSV file concurrently in Go:

// Parallelize CSV generation
func GenerateLargeCSVParallelToOneFile(numRows, numGoroutines int, fileName string) {
 err := os.Mkdir("data", 0777)
 if err != nil {
  if !errors.Is(err, os.ErrExist) {
   panic(err)
  }
 }
 file, err := os.Create(fmt.Sprintf("data/%s.csv", fileName))
 if err != nil {
  panic(err)
 }
 defer func() {
  err := file.Close()
  if err != nil {
   log.Printf("error occurred in file.Close() err: %+v", err)
  }
 }()

 var wg sync.WaitGroup
 // Add numGoroutines to the WaitGroup
 wg.Add(numGoroutines)

 // This CSVWriter is thread-safe
 writer, err := NewCSVWriter(file)
 if err != nil {
  panic(err)
 }
 for i := 0; i < numGoroutines; i++ {
  // Launch numGoroutines workers; each writes numRows rows,
  // so the file ends up with numRows*numGoroutines rows in total
  go func(wg *sync.WaitGroup, writer *CsvWriter) {
   // Decrement the WaitGroup counter when this goroutine finishes
   defer wg.Done()
   GenerateLargeCSVWithLock(numRows, writer)
  }(&wg, writer)
 }
 // Wait for all goroutines to finish
 wg.Wait()
 fmt.Println("Done GenerateLargeCSVParallelToOneFile")
}
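
Note that numRows here is the per-goroutine count. A hypothetical invocation that produces 100,000,000 rows in total with 10 workers would be:

 // 10 goroutines × 10,000,000 rows each = 100,000,000 rows in total
 GenerateLargeCSVParallelToOneFile(10000000, 10, "test_parallel")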

func GenerateLargeCSVWithLock(numRows int, writer *CsvWriter) {
 for i := 0; i < numRows; i++ {
  row := generateFakeRow()
  if err := writer.Write(row); err != nil {
   panic(err)
  }
//   Optionally, flush every batchSize rows to cap the buffer (batchSize is not defined here):
//   if i > 0 && i%batchSize == 0 {
//    writer.Flush()
//   }
 }
 writer.Flush()
 if err := writer.Error(); err != nil {
  panic(err)
 }
}

// thread-safe csv writer
type CsvWriter struct {
 mutex     *sync.Mutex
 csvWriter *csv.Writer
}

func NewCSVWriter(file io.Writer) (*CsvWriter, error) {
 w := csv.NewWriter(file)
 return &CsvWriter{csvWriter: w, mutex: &sync.Mutex{}}, nil
}

// lock and write
func (w *CsvWriter) Write(row []string) error {
 w.mutex.Lock()
 defer w.mutex.Unlock()
 return w.csvWriter.Write(row)
}

// lock and flush
func (w *CsvWriter) Flush() {
 w.mutex.Lock()
 defer w.mutex.Unlock()
 w.csvWriter.Flush()
}

// lock and surface any error from a previous Write or Flush
func (w *CsvWriter) Error() error {
 w.mutex.Lock()
 defer w.mutex.Unlock()
 return w.csvWriter.Error()
}

In this modified code, we’ve introduced thread-safe CSV writing via the custom CsvWriter type, which guards every Write, Flush, and Error call with a mutex.

This ensures that multiple Goroutines can write to the same file without data races, although it also means the writes themselves are serialized.
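
A design note: since the mutex serializes the writes anyway, an alternative is a channel-based fan-in, where producer Goroutines send rows over a channel and a single Goroutine owns a plain csv.Writer. A minimal sketch, with names of my own choosing rather than from the original code:

 // WriteRowsFromChannel drains rows from a channel into a single csv.Writer.
 // Producers send rows and close(rows) when they are done.
 func WriteRowsFromChannel(w *csv.Writer, rows <-chan []string) error {
  for row := range rows {
   if err := w.Write(row); err != nil {
    return err
   }
  }
  w.Flush()
  return w.Error()
 }

This trades the lock for channel overhead; whether it is faster depends on row size and on how expensive row generation is relative to the write itself.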

Benchmarking
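
The benchmark code isn't shown in the original post, but a harness along these lines (names and parameters inferred from the output below, assuming the usual testing and fmt imports) would produce it:

 func BenchmarkGenerateLargeCSV(b *testing.B) {
  const totalNumRows = 100000000
  b.Run(fmt.Sprintf("totalNumRows=%d", totalNumRows), func(b *testing.B) {
   for i := 0; i < b.N; i++ {
    GenerateLargeCSV(totalNumRows, "bench_sequential")
   }
  })
  for _, numGoroutines := range []int{5, 10, 15, 20} {
   name := fmt.Sprintf("totalNumRows=%d,numGoroutines=%d", totalNumRows, numGoroutines)
   b.Run(name, func(b *testing.B) {
    for i := 0; i < b.N; i++ {
     // Each goroutine writes its share of the total
     // (integer division, so counts that don't divide evenly are approximate)
     GenerateLargeCSVParallelToOneFile(totalNumRows/numGoroutines, numGoroutines, "bench_parallel")
    }
   })
  }
 }

Running it with go test -bench=. produced: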

goos: darwin
goarch: amd64
pkg: main/large_csv_generator
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkGenerateLargeCSV/totalNumRows=100000000-16                                     1        109237932720 ns/op
BenchmarkGenerateLargeCSV/totalNumRows=100000000,numGoroutines=5-16                     1        104020250631 ns/op
BenchmarkGenerateLargeCSV/totalNumRows=100000000,numGoroutines=10-16                    1        103922320588 ns/op
BenchmarkGenerateLargeCSV/totalNumRows=100000000,numGoroutines=15-16                    1        105128871237 ns/op
BenchmarkGenerateLargeCSV/totalNumRows=100000000,numGoroutines=20-16                    1        107537896218 ns/op
PASS
ok      main/large_csv_generator        530.679s

Conclusion

In conclusion, I have illustrated both sequential and parallel approaches for writing a substantial volume of data to CSV files. The parallel implementation did not yield a meaningful speedup in this specific case, most likely because the work is dominated by disk I/O on a single file and the mutex serializes every write, so the extra Goroutines spend much of their time waiting on the lock. Still, Goroutines are a potent tool that, when the workload actually parallelizes, can lead to substantial performance improvements.

For the complete code of this example, see the GitHub Gist: https://gist.github.com/moonorange/d09b6a0cd7f265b9a8d1ee5ddadd18e0