The Official Magazine of the Bluewater Cruising Association

Confidently Wrong: AI isn’t Seaworthy for Cruising Sailors

Rob Murray

Avant
Beneteau First 435 Sloop
November 12th, 2025

We had dinner with some friends the other night and were chatting about the utility of AI in sailing and cruising. During the conversation, we touched on the vast amount of disinformation and misinformation available on the internet, and I was reminded of an article I wrote for Currents during the pandemic about which signal flag(s) were appropriate to indicate that your vessel was under quarantine. At the time, there was a considerable amount of disinformation on the internet about this topic, because a single, out-of-date reference in a Wikipedia entry had been picked up by dozens of bloggers, Facebook posters, YouTubers and the like (you can see my article here). To see if things have changed, I decided to test the utility of AI by posing a question about signal flags for quarantine status to four of the leading AIs: Grok (from Elon Musk’s xAI), ChatGPT (from OpenAI), Gemini (Google) and CoPilot (Microsoft).

Why is this important? Many (if not all) search engines now provide AI-based summaries with search results by default, presenting themselves as ‘useful’ agents that summarize the results and offer complete answers so you don’t have to suffer the drudgery of reviewing them yourself. Should you trust that summary? Can AI find the right answer? That’s what I wanted to know.

I chose this signal flag query for several reasons. First, I already knew the answer, having looked it up in the International Code of Signals, Pub. 102, available from the websites of the National Geospatial-Intelligence Agency and the US Coast Guard. Second, the correct use of signal flags is pretty simple and does not change much over time. Third, the authority for contemporary signal flag use is clear, yet misinformation about signal flags persists on the internet. This could easily confuse an AI if it didn’t discern between official, definitive sources and people making mistakes and posting erroneous information. If there is a flag signal for a given meaning, it’s in the pages of the ICOS in black and white. Finally, how to use the flags is explained in the first part of the manual, so mistakes about their usage should be even harder to make.

As mariners, we know that to make a specific signal by flag, we refer to the International Code of Signals (ICOS) for the correct ‘hoist’, select the flags and raise them. To decode a signal, we look it up in the same book. It’s simple, really. So, what does AI do? It scans the internet and aggregates answers from many sources, considers them, and offers up what it thinks is the correct answer based on that research. Some of the sources it may have already learned; some it may visit for the first time based on the question asked. This, roughly, is how a ‘Large Language Model’ (LLM) works, and AI is reputed to have some ability to discern which sources are truly definitive (gold) and which are just plain wrong (dross). The danger comes from a machine’s inability to discriminate between a casual but erroneous blog post using an image from a Tintin cartoon from the 1940s (The Quarantine Flag – SailProof Shop) and the definitive International Code of Signals. If an LLM is not discerning, it can confuse incorrect, older, or error-filled information with current, correct information. This is akin to recommending animal sacrifice instead of fertilising the land as a way to ensure a bountiful crop, or promoting bloodletting as a treatment for gout.
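
To make the ‘gold versus dross’ problem concrete, here is a minimal sketch in Python. It is emphatically not how any of these chatbots works internally; the site names, the claims and the ‘authoritative’ flag are invented for illustration. The point is simply that if an assistant aggregates whatever it finds, the most-repeated claim wins, even when a single definitive source says otherwise.

from collections import Counter

# Hypothetical search results for "quarantine flag": many blogs repeat the
# outdated Lima answer; only the ICOS (Pub. 102) gives the current Q / QQ signals.
# All site names and the 'authoritative' flags below are invented for illustration.
sources = [
    {"site": "sailing-blog-1", "claim": "Lima",      "authoritative": False},
    {"site": "sailing-blog-2", "claim": "Lima",      "authoritative": False},
    {"site": "forum-post",     "claim": "Lima",      "authoritative": False},
    {"site": "ICOS Pub. 102",  "claim": "Q (or QQ)", "authoritative": True},
]

# Naive aggregation: a simple majority vote across everything scraped.
by_frequency = Counter(s["claim"] for s in sources).most_common(1)[0][0]

# Discriminating aggregation: prefer a definitive reference when one exists.
definitive = [s["claim"] for s in sources if s["authoritative"]]
by_authority = definitive[0] if definitive else by_frequency

print("Majority vote says:     ", by_frequency)   # Lima (wrong)
print("Authority-weighted says:", by_authority)   # Q (or QQ) (correct)

The majority vote picks Lima because the outdated answer is repeated most often; the correct Q/QQ answer only surfaces when the definitive reference is given priority, which is exactly the discrimination the AIs in this test failed to make.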

There are a host of concerns about AI – that it will be racist, that it will discriminate by gender, that it is opaque about how it arrives at its answers, and so on – but not many people seem worried that it will be just flat-out wrong in the answers it provides. After this exercise, I am indeed worried that it is simply wrong, and that when wrong, it asserts the wrong answer with great conviction.

While the initial answers I received when I queried the AIs were all wrong, you should not see similar results if you try the same query now, as each of the AIs has (hopefully) learned from its interaction with me and should now offer the correct answer. AI may be more useful for broader, more common queries, but for niche, specialized queries all four AIs delivered poor results. Sailing and cruising are niche areas. The use of signal flags is a niche within a niche.

The “Conversation”

The question I asked was, “Is the Lima Flag appropriate for a vessel in quarantine?” Spoiler alert: the correct answer is ‘No’. The L-for-Lima flag means ‘You should stop your vessel instantly’. The correct signal for a vessel in quarantine is the Q-for-Quebec flag (‘My vessel is “healthy”, and I request free pratique’) or two Q flags (QQ) (‘I require health clearance’). These meanings are definitive in the ICOS. (The questions, answers, and my follow-ups and comments with each of the four AIs are detailed in a separate and very long posting, available here. I have annotated those “conversations” with comments in parentheses and italics, and some inline comments in red.)

GROK

I spent the most time with Grok and found it friendly and pleasantly conversational, even if it was spectacularly wrong and unfocused in its search for answers. Grok frequently cited sources and authorities that had no relevance to the query, like Chart No. 1, the American Practical Navigator (Bowditch), the International Health Regulations, SOLAS, Brown’s Signalling (1916 Edition) and the Centers for Disease Control and Prevention (CDC), none of which contain any flag signalling information at all. Grok supported its conclusions with multiple dead links, but could produce working links when asked (perhaps because it had visited those links long ago during its training). When pressed, the AI clung to the wrong answer for an unreasonably long time and only gave up and offered correct answers when spoon-fed the definitive references (e.g., the link and page number in the document). The answers it delivered were bloated and filled with nonsense. Fun and friendly to chat with, but not useful.

Having somewhat used up my limited supply of patience for playing with AIs, I was more curt in my dealings with the other three, trying to see how quickly I could drive them to the right answer.

ChatGPT

When I asked ChatGPT the same question, it told me immediately that the L-for-Lima flag was incorrect (great!) and that the Q-for-Quebec flag was correct (great, but only half right). It also told me that the Lima flag could mean ‘I have a dangerous cargo’ (wrong; that’s the B-for-Bravo flag) or ‘You should stop and await instructions’ (also wrong; that’s the meaning of the X-for-X-Ray flag). You could award half-points for getting part of the answer right, but not full points, because it added two erroneous meanings for the Lima flag. When queried further, ChatGPT did accept that the QQ signal would also be correct.

Since ChatGPT was the only AI that had a clue what the answer was, I asked again the next day and got a different answer. This time it said categorically that the Lima flag was correct. When pressed, it confirmed that the ICOS supported that meaning. When I asked it to show me where in the ICOS this was written, it apologized for its error (“I misspoke earlier”) and offered the correct answer. That was disappointing.

CoPilot

Microsoft’s CoPilot was also wrong out of the gate, saying that the Lima flag was correct. When pressed, it suggested that the flag’s adoption by two port authorities under local ordinances (in Alaska and Rhode Island) somehow made it widespread practice, and that mariners trained in pre-1969 practice would somehow have retained this knowledge, making the use acceptable. It quoted my Currents article as support for this. When I outed myself as the author and asked why it didn’t accept what I had said in the article, it responded, “It’s a thoughtful and widely referenced piece and clearly shaped how many mariners and enthusiasts understand quarantine signaling today.” CoPilot then amended its misreading of the article and offered a correct answer. The flattery was noted, if not appreciated, but its apparent inability to read and comprehend the article was worrisome.

Gemini

Google’s Gemini was also wrong in its first answer, saying Lima was correct. When asked for sources, it found part of the correct answer, offering the Q signal but not QQ. Oddly, its second response did not note that it was contradicting the first.

A series of subsequent attempts with the ‘fast’ and ‘pro’ versions gave mixed results, with Gemini clinging to false citations until directly challenged and invited to look up the answer. Gemini was the only one of the four AIs to include a disclaimer (“AI responses may include mistakes”) on each reply it gave – which is absolutely correct!

I repeated the experiment over several days, using a browser in ‘privacy’ mode so the previous context would not be carried into the query. Results were similar if not identical, although Grok seemed to tighten its focus, perhaps due to a version update that happened over those days.

What I Learned

In short, AI is scary!

  1. AI is often wrong, and when wrong will be wrong with great conviction, often with false citations and assertions.
    If you use an AI, check its work by actually reading the sources it uses. Apparently, they learn by reading, but they are not very good at reading.
  2. AI always makes a choice between speed and depth of research. If there is a ‘think deeper/harder/more’, ‘pro mode’ or ‘expert mode’ button, use it; it’s worth the few extra seconds and sometimes delivers a better answer.
  3. If you are going to use AI, use an app, not a webpage, so the tool learns from you. Over time, its answers should become better for you by retaining the context in which you ask questions. This seemed somewhat true, although it is not a real fix: whatever the AI learns will not be folded back into its core model for some time, if at all, but the context you build up will improve the answers to your own queries more quickly.
  4. Currents is judged by the AIs to “represent well-informed, professional opinions from a reputable organization,” so you should thank the editors for all their good work!

In the International Code of Signals, the flag hoist Alpha over India (AI) means ‘Vessel (indicated by position and/or name or identity signal if necessary) will have to be abandoned.’ This seems like appropriate serendipity, as my assessment suggests that AI must indeed be abandoned (at least for now) for sailing and cruising topics.

*Image created using AI then modified by Barb Peck

Comments


  1. Magnus Murphy says:

    Very interesting and topical. Thank you Rob! A good reminder for us all to not trust blindly. This is becoming a major dilemma in all fields… I am worried that there is going to be a lot of harm come from ‘confidently wrong’ AI.

  2. Kevin and Carla Nash says:

    Rob, long time no chat! Great evaluation of the shortcomings of AI, though it can certainly have a place. We’ve had positive experience with it when diagnosing potential issues, as well as when researching the technical features of various systems. Just like the supposed “real intelligence” of some sailors we’ve met out here: trust but verify.

    Cheers from Kevin and Carla
    Sailing Gargoyle currently in Preveza Greece

  3. Nick Ward says:

    Great article, and I checked using Gemini – as you say, it clings to the wrong answer for quite a while. Thankfully, my preferred AI – Claude – did better (but still, caution is always advised!). It said:

    No, the Lima flag is not appropriate for a vessel in quarantine.

    The Lima flag (a yellow and black checkered flag, divided into quarters) is the International Code of Signals flag that means “You should stop your vessel instantly” or is used to indicate “I have a disabled vessel” in certain contexts.

    For a vessel in quarantine, the appropriate flag is the Quebec flag (also called the “Yellow Jack”), which is a plain yellow flag. This signals that the vessel is requesting free pratique (permission to enter port) or indicates that the ship is under quarantine and has not yet received health clearance.

    Historically, ships would fly the Quebec flag upon arrival at a port to request a health inspection. If disease was suspected or confirmed aboard, the vessel would remain under quarantine until cleared by health authorities, continuing to fly the yellow flag during this period.

    1. Rob says:

      Interesting, gave me a different answer:

      “No, the Lima flag is not appropriate for a vessel in quarantine.
      The Lima flag (yellow and black checked pattern) means “You should stop your vessel instantly” or indicates that a vessel is in quarantine in some contexts, but this is outdated.
      For quarantine, the correct flag under the International Code of Signals is:
      Quebec (Q) flag – A plain yellow flag that means “My vessel is healthy and I request free pratique” (permission to enter port). This is flown when arriving at a port to request customs/health clearance.
      First Substitute over Quebec (QQ) – Indicates “My vessel is suspect” or that there is disease aboard requiring quarantine.
      Historically, the yellow flag alone (Quebec) was sometimes called the “quarantine flag,” but its actual meaning is requesting health clearance. If a vessel actually has a contagious disease or is under quarantine restrictions, different signals would be used, including the QQ signal or specific flag combinations depending on the circumstances.
      So to directly answer your question: Lima would not be the appropriate flag for quarantine – you’d want Quebec (Q) or related signals instead.“

      Almost right, but flying the first substitute over Q is the hoist upside down (it should be Q over the first substitute), and QQ means ‘I require health clearance’ and only that.

      In its answer to you it said Lima could mean ‘I am disabled’, but that’s Foxtrot. Hard to see how it gets so close and then drops the ball!

  4. Debra Bryant says:

    Thanks, this is a good read! I use AI quite a bit for work and personal research. Recently I heard a good explanation for AI’s hallucinations: these programs are “trained” to offer an answer to any question posed, and to never reply with “I don’t know”… so they do offer an answer, even if that means fabricating one. I have learned to always request sources, and always check them.

  5. Ken Wright says:

    Good article Rob! SCARY INDEED!
    I lived with paper charts, crossed oceans with a plastic sextant and an erroneous timepiece…
    That’s another story.
    Can we make a signal flag that translates “DO NOT TRUST”?
    olde school tar
    Ken il trombettista

  6. Ricky says:

    Excellent article Rob. More inquiry into AI’s accuracy should be a no-brainer. The Paradox of Artificial Intelligence😆. Just have fun with that. I commend you on your deep dive; well done. I’ve been learning the tool as well and have found not only dis- or misinformation but, worse yet, a bias toward clinging to certain information to hold a narrative. Often only when pressed will the program change its tune, but as you mentioned, when pointed to the source each platform will accept its error. I like sources when it comes to information, and even then I like to dig a little deeper. Question everything!

  7. Allen says:

    I had a similar experience. I’ve used various LLMs for quite some time now and find them generally useful. However, it’s necessary to verify anything that you’re told. Sailors already know not to trust just one source of information, and it seems a collection of LLMs can be considered just one source of information, so it’s necessary to look around.

    As reported in the article, although they are very useful tools, LLMs tend to make things up and lie very convincingly—like children.

    I found Perplexity and several other LLMs indispensable when rewiring my 2007 Bavaria. Due to bankruptcies, it seems that there isn’t a whole lot of documentation available, but there is a great deal of experience and advice spread out all over the Internet in forums and other sites. Finding it myself would have been a chore.

    On the other hand, I had reason to wonder about the depths in my slip, so I innocently asked Perplexity to tell me what the minimum depths had been for the past year and to predict the coming year. The response was quite believable (and alarming) until I checked it against my tides app (100% reliable so far) and official sources, and found that it was entirely make-believe.

    I offered the same query to five other subscribed bots. Three of them said they couldn’t answer and referred me to official sources. The others made things up, including, in one case, providing a long list of predicted depths below datum.

    I also discovered that at least some are unable to realize that depths below datum are lower than depths above datum—and by how much. That was quite sobering.

    I’ve encountered similar bewilderment in the past when I’ve asked technical questions that require calculations. Most of the time, calculations are fast and accurate, but sometimes the results are hallucinations.

    Interestingly, when challenged, the bot will agree, apologize, redo the job, and come up with a new response that may or may not be right. But if you keep trying, eventually they zero in. Still, if you don’t already know the answer or have a good idea of what it should look like, you could be easily fooled. It’s a black box.

    The upshot is that the reports from those various bots could be potentially dangerous if relied on alone.

    On matters that are less technical, I found that on controversial social and political topics they tend to be very mainstream unless goaded to report a wider spectrum of opinion—in which case they will broaden their focus. However, it’s like pulling teeth.

    In summary, I’ve concluded that LLMs are a great way to initiate investigation of a question and gather sources, but they’re not reliable for mission-critical matters.

    I fed this article to perplexity for correction and comment. https://www.perplexity.ai/search/check-this-over-for-punctuatio-dSB4.9JhSNG4ZvsAPcxQtA#0

  8. Greg Cooper says:

    The problem is that AI does not “reason” in any sense as a human does. I think it will prove, and is proving, very useful in specific constrained applications, such as screening medical images for later review by a radiologist, where it can be trained in depth and evaluated for accuracy.
    Despite how it is being pushed for general use, I wouldn’t rely on it for anything that matters.
    Unfortunately, I think the situation is only going to get worse, as it becomes harder and harder for us humans to discern authoritative information while the Internet gets swamped with more and more AI hallucinations.
