Why automating your fishbelt data quality checks is worth the leap - and how to do it

Blog 02 Feb 2026

Iain Caldwell, PhD
Lead Analyst, MERMAID

You dedicate countless hours in the field - battling currents, managing logistics, and maximizing every minute of bottom time to collect critical coral reef data. When you finally return to the lab, exhausted but accomplished, the last thing you want is for data cleaning to become a bottleneck.

I realize that learning to use a new coding language can feel intimidating, especially when your time and resources are already stretched thin. However, after years spent collecting and analyzing data across multiple ecosystems, I have seen firsthand how transitioning away from manual spreadsheet checks and adopting automated scripts can save weeks of time (and frustration). As the lead analyst for MERMAID, my goal is to craft tools that do the heavy analytical lifting for you and help you transform your raw data into actionable insights for reef management.

To help make your workflow lighter and more reliable, you can now access the MERMAID Fishbelt Data Quality Checks on our Analysis Hub.

A step-by-step guide to automating your fishbelt data checks

Here is exactly how you (or your team) can get to reliable, automated data checks without getting lost in the weeds.

Step 1: Grab the "recipe"

Head over to the MERMAID Analysis Hub and open the "MERMAID Fishbelt Data Quality Checks" GitHub resource, which contains all the underlying R and Quarto code you need. You do not need to read or write this code from scratch; you simply need to copy it to your computer and run it. The easiest ways to do this are to copy the Quarto file (found in this folder and called “biomass_conversion_coefficients.qmd”) or, even better, to clone the full repository (see video of how to clone from GitHub to RStudio here). The code is already fully built to test, visualize, and summarize fishbelt data.

Step 2: Tell the code where your data lives

The code needs to know what data you want to use, and it is built to be incredibly flexible. You can apply this code to any project where you have access to observation-level data. In the code, there is a section currently commented out (with # at the beginning of each line) that you can use to get a list of all of your projects and indicate which one you want to use (this is marked as “Option 2” in the first code chunk).

  • If your data is private: As long as you are able to authenticate (i.e. log into your MERMAID account when prompted) and are a member of that specific project, the code will seamlessly pull your data in.

  • If you want to compare regions: You can tell the code to look at other projects that have a "public" data-sharing policy. You can even pull multiple projects at once (for instance, one project to look at observer comparisons, and another to analyze shark data).
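To give a feel for what that first chunk is doing, here is a minimal sketch using the mermaidr package. It assumes the current mermaidr interface (mermaid_get_my_projects() and mermaid_get_project_data()); the project name in the filter is just an example, so substitute your own.

```r
library(mermaidr)
library(dplyr)

# List the projects your MERMAID account can access
# (this will prompt you to authenticate when needed).
my_projects <- mermaid_get_my_projects()

# Pick the project(s) you want; the name below is illustrative.
belize <- my_projects %>%
  filter(name == "Northern Belize Coastal Complex")

# Pull observation-level fishbelt data for that project.
fish_obs <- mermaid_get_project_data(
  belize,
  method = "fishbelt",
  data   = "observations"
)
```

Because mermaid_get_my_projects() only returns projects you are a member of, swapping in a public project ID (as in "Option 2" of the first code chunk) is how you reach data shared by other teams.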

Step 3: Instantly verify your extracted data

Before you run the analysis, the code includes a section that summarizes the data you are using so you know you have connected the right projects.

Real-world context: Let's say you chose to export two public projects - one from Belize for observer comparisons, and one from the Chagos Archipelago to look at shark data. The code will automatically give you a summary like the one below, verifying your successful data extraction:

Observer Comparison Project
Project: Northern Belize Coastal Complex
Project ID: d2225edc-0dbb-4c10-8cc7-e7d6dfaf149f
Data extracted:

  • Observations: 2187 records

  • Sample Units: 176 transects

  • Sample Events: 47 sites

  • Unique observers: 7

Shark Data Project
Project: SERF 2.0_Nick Graham_Chagos_Outer
Project ID: 55ac964c-0228-42da-8061-5983339ecb9f
Data extracted:

  • Observations: 8986 records

  • Sample Units: 149 transects

  • Sample Events: 40 sites

Step 4: Choose whether or not to anonymize observers

In a typical field season, you have multiple divers collecting data, and the code is going to compare the observations made by these different people.

If you are sharing this report externally and want to protect your team's privacy, there is a specific block of code included that automatically changes your divers' real names to anonymous labels (like "Observer A", "Observer B", etc.). If you are using this for internal team review, you can simply delete this chunk of code (or, even better, comment it out by including # before each line) so the real names appear.
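If you are curious what that block is doing, the idea is simply a lookup from real names to generic labels. Here is a minimal base-R sketch (the anonymize_observers() function and the observers column name are illustrative, not the script's exact code):

```r
# Replace each real observer name with "Observer A", "Observer B", etc.
# Names are sorted first so the mapping is stable across runs.
anonymize_observers <- function(df, col = "observers") {
  real_names <- sort(unique(df[[col]]))
  labels <- paste("Observer", LETTERS[seq_along(real_names)])
  df[[col]] <- labels[match(df[[col]], real_names)]
  df
}

# Toy example with three divers
fish_obs <- data.frame(observers = c("Ana", "Ben", "Ana", "Cory"))
anonymize_observers(fish_obs)$observers
# → "Observer A" "Observer B" "Observer A" "Observer C"
```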

Step 5: Run the code to generate insights

Once you've told the script which project to export and whether or not to anonymize observers, you just need to run the code. It will automatically generate visual charts and checks:

  • Observer Comparisons: The script evaluates fish biomass, abundance, and taxonomic patterns to help you identify potential observer effects.
    (Note: In the full GitHub resource, you will find specific code snippets - indicated by "Show the code" - that provide the exact formulas for every table, statistical test, and distribution plot mentioned below.)

    • Biomass, Abundance, & Taxonomic Tables: The script will output comprehensive summaries for fish biomass (kg/ha), total abundance, and taxonomic identification, breaking down the number of transects, means, medians, and quantiles for each observer.
      For instance, your output might clearly show that Observer C recorded a mean abundance of 38.97 and 52 unique taxa across 31 transects, while Observer A recorded a mean abundance of 26.23 but found 72 unique taxa across 141 transects.

    • Observer Pairing Patterns: Because observers rarely dive alone, the code also generates a table showing all of the observer pairings. You can see your most common pairings (like finding out Observer A and Observer B teamed up for 12 specific sample events) to better understand if systematic differences could be tied to individuals or specific buddy pairs.

    • Statistical Tests: The code automatically runs a Kruskal-Wallis Test for biomass differences among your observers. It will output the test statistic, degrees of freedom, p-value, and a clear result.

      Real-world context and caveat: If your statistical tests do detect significant differences among observers, it definitely warrants further investigation—but always interpret this with caution. Observers may survey in different locations and at different times, so these differences could reflect true spatial or temporal variation in the fish communities rather than an observer error.
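Under the hood, that check boils down to a call to R's built-in kruskal.test(). Here is a self-contained sketch with simulated data (the column names biomass_kgha and observer are illustrative, not necessarily the script's exact ones):

```r
# Simulate sample-unit biomass for three observers
set.seed(42)
su <- data.frame(
  observer = rep(c("Observer A", "Observer B", "Observer C"), each = 20),
  biomass_kgha = c(rlnorm(20, 5.0, 0.6),
                   rlnorm(20, 5.2, 0.6),
                   rlnorm(20, 4.8, 0.6))
)

# Non-parametric test for biomass differences among observers
kw <- kruskal.test(biomass_kgha ~ observer, data = su)

kw$statistic  # chi-squared test statistic
kw$parameter  # degrees of freedom (number of observers - 1)
kw$p.value    # significance of among-observer differences
```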

  • Outlier Detection: The code runs tests to identify sample events that have unusually high or low fish biomass.

    • Identifying High/Low Extremes: The script generates tables flagging outlier sites (for instance, dropping below 50 kg/ha or spiking above the 75th percentile) and specifically isolates the top 10% of high biomass observations.

    • Statistical Outlier Tests: It runs a Z-Score test and generates an Interactive Scatter Plot. Sites with a |z-score| greater than 2 are flagged as statistical outliers and marked with red points falling above a dashed red threshold line. For example, your output might reveal that 5% of your sites are statistical outliers, clearly listing which sites had larger than expected biomass estimates.

      Real-world context and caveat: Before you delete these statistical outliers as "erroneous data," consult your field notes and environmental context! These large biomass estimates might actually represent true ecological bright spots—like highly successful no-take protected areas or unique habitat features.
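The z-score flag itself is only a few lines of R. Here is a hedged sketch with toy site-level data (the sites data frame and its columns are illustrative):

```r
# Toy site-level mean biomass, with one suspiciously large value
sites <- data.frame(
  site = paste0("S", 1:10),
  biomass_kgha = c(90, 110, 85, 120, 100, 95, 105, 1200, 115, 98)
)

# Standardize, then flag any site more than 2 SDs from the mean
sites$z <- (sites$biomass_kgha - mean(sites$biomass_kgha)) /
  sd(sites$biomass_kgha)
outliers <- subset(sites, abs(z) > 2)

outliers$site
# → "S8": flagged for follow-up, not automatic deletion
```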

  • Shark Observations: Because sharks can represent a large proportion of a site’s biomass, the code isolates shark presence, abundance, and biomass, then compares that against the estimated total fish biomass.

    • Individual Weight Validation: Crucially, the code also compares each shark's estimated weight against the maximum published weight for that species from FishBase. The script retrieves this data and generates a comparison table.

      Real-world context and caveat: If observations are flagged here, it warrants further investigation, but it does not necessarily mean you made an error. Maximum weight data is pulled from FishBase for this comparison, but be aware that there is less maximum weight data available than maximum length data (the latter of which is what we use in the MERMAID Collect app to test for larger-than-expected observations). Use your expert scientific judgment when navigating these flags.
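To see roughly how such a lookup works, here is a sketch using the rfishbase package. It assumes the FishBase species table's Weight field holds maximum published weight in grams; check the script itself for the exact fields it uses, and note that the commented join uses illustrative column names.

```r
library(rfishbase)
library(dplyr)

# Fetch maximum published weights for a couple of shark species
sharks <- c("Carcharhinus melanopterus", "Triaenodon obesus")
max_wt <- species(sharks, fields = c("Species", "Weight")) %>%
  mutate(max_weight_kg = Weight / 1000)  # FishBase stores grams

# Then join to your observations and flag anything over the maximum:
# obs %>%
#   left_join(max_wt, by = c("fish_taxon" = "Species")) %>%
#   filter(estimated_weight_kg > max_weight_kg)
```

Because Weight is missing for many species on FishBase, expect NAs here; that sparseness is exactly the caveat described above.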

Step 6: Review your automated data summary and recommendations

The end of the script instantly generates a Data Summary. At a glance, this output will show things like the following:

  • Sample Coverage: e.g., 47 total sample events, 176 total transects, 2187 total observations, 36 unique sites, and 7 unique observers.

  • Taxonomic Diversity: e.g., 21 fish families, 41 fish genera, and 79 fish species.

  • Biomass Statistics (kg/ha): e.g., Mean of 168, Median of 113.6, SD of 199.1, and a Range from 19.6 to 1217.1.

  • Data Quality Flags: e.g., 5 high biomass outliers (>90th percentile), 12 statistical outliers (IQR method), and 10 sites with shark observations.


Notes and recommendations

  1. Check Observer Consistency: Any detected differences warrant further investigation, but should be interpreted cautiously as they may reflect spatial or temporal variation.

  2. Review Outliers: Investigate flagged sites to determine if they represent true ecological variation or potential data issues, making sure to consider environmental context and field notes.

  3. Validate Shark Observations: Carefully review any shark weights exceeding published FishBase maxima for potential data entry errors in size or abundance.

  4. Ensure Data Integrity: For any flagged issues, always consult your field datasheets and consider re-validation of measurements before making corrections to the database.


Frequently asked questions

What R packages are required to run these checks?

All of the packages you need are loaded at the top of the first code chunk:

  • mermaidr

  • tidyverse

  • plotly

  • DT

  • ggplot2

  • ggpubr

  • knitr

  • rfishbase
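If any of these are missing from your setup, a quick way to install and load everything is sketched below. Note that mermaidr is not on CRAN, so the GitHub install path shown is an assumption based on the package's usual home; verify it against the mermaidr README before running.

```r
# CRAN packages used by the checks
pkgs <- c("tidyverse", "plotly", "DT", "ggplot2",
          "ggpubr", "knitr", "rfishbase")

# Install anything missing, then load everything
install.packages(setdiff(pkgs, rownames(installed.packages())))

# mermaidr lives on GitHub (path assumed; see the mermaidr README):
# remotes::install_github("data-mermaid/mermaidr")

invisible(lapply(c(pkgs, "mermaidr"), library, character.only = TRUE))
```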

Can I compare my results against FishBase for non-shark species? 

Currently, the data check that compares your estimated weights against the maximum published weights from FishBase is highlighted specifically for shark observations, as sharks can represent such a large proportion of biomass. However, this approach could be expanded to any fish species for which maximum weight data is available on FishBase.


By adopting these open-source tools into your workflow, you can bypass tedious manual checks, save precious time, and ensure your data is rigorously clean. Ultimately, leveraging robust, reliable data is how we develop the evidence-based solutions needed to ensure marine ecosystems are healthy and support the coastal communities that rely on them.

Explore the full MERMAID Fishbelt Data Quality Checks resource and access the code today. Don't let the code intimidate you. Take the leap, and please reach out if you have any questions or want to collaborate on other data checks or analysis tools you would like to build!