[GTER] Registro.br RPKI issues?

Frederico A C Neves fneves at registro.br
Tue May 16 10:48:25 -03 2023


Job,

On Mon, May 15, 2023 at 10:19:43PM +0000, Job Snijders wrote:
> Hi Fred,
> 
> Thanks for the quick response!
> 
> On Mon, May 15, 2023 at 06:24:54PM -0300, Frederico A C Neves wrote:
> > > Any idea what happened?
> > 
> > We're investigating the CA and publication server but so far we've no
> > idea of any event that originated the issue.
> 
> Thank you for investigating and transparently sharing your uncertainty
> as to what happened. I too am a bit puzzled by the data.
>

This was tracked down to a known issue that is being addressed in the
next release of krill. https://github.com/NLnetLabs/krill/pull/1023

The same issue was triggered a few weeks ago, but unfortunately with a
different symptom.

Thanks to the NLnet Labs team, we now have an operational procedure
that will resolve the issue more quickly if it happens again before the
fix is ready for us to upgrade.

> Operating under the assumption that this was a fluke of sorts (because
> NIC.BR RPKI services overall seem stable), it is possible this type of
> event will never happen again, or will happen again within a few hours,
> days, or months. I hope you don't mind some suggestions on how to
> proceed to find the root cause:
> 
> 1/ perhaps an error can be raised (and sent to the NIC.BR NOC) when the
>    process (which writes out the RRDP XML) notices that multiple
>    <publish> elements share the same value in the 'uri' attribute. While
>    this alert might not pinpoint the root cause of the issue, monitoring
>    for that error condition will give more insight as to when it
>    happens.
> 
> 2/ Extended retention times for RRDP data. Debugging the RRDP data
>    produced by NIC.BR is challenging because between an issue arising
>    and people being paged to look at an issue there can be a multi-hour
>    delay. The NIC.BR deltas & snapshots seem to be deleted on an
>    aggressive schedule: XML data is deleted within hours of the XML data
>    being generated [example 1].
> 
> In context of suggestion (2) - I am cognizant that a standards-compliant
> RRDP client would not run into any issue with the current deletion
> schedule. I also understand storage & network resources have a cost. But
> human debuggers aren't standards-compliant ;-)
> 
> My plea is that - if it is feasible - to retain deltas & snapshots for
> 10 days. Ten days would be a great help in being able to exactly
> reference what went wrong where (if and only if anything goes wrong).
> For example, the URL [example 1] I referenced in my earlier email today
> in this thread already has been garbage collected and no longer shows
> useful data.
>

We're investigating whether this is feasible with our current setup,
given an environment with thousands of child CAs. I'll follow up on this.
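
On suggestion 1/, a minimal sketch of what such a check could look like,
assuming it is run against a snapshot or delta XML file after it has been
written. This is only an illustration, not krill's or NIC.BR's actual
tooling: the function name, the sample document and its URIs are
hypothetical, and only the XML namespace is taken from RFC 8182.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# RRDP XML namespace as defined in RFC 8182.
RRDP_NS = "http://www.ripe.net/rpki/rrdp"


def duplicate_publish_uris(xml_text):
    """Map each 'uri' used by more than one <publish> element to its count.

    Hypothetical helper: works on the text of an RRDP snapshot or delta.
    """
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get("uri")
        for el in root.iter(f"{{{RRDP_NS}}}publish")
        if el.get("uri") is not None
    )
    return {uri: n for uri, n in counts.items() if n > 1}


# Made-up sample snapshot with one URI published twice.
SAMPLE = """<snapshot xmlns="http://www.ripe.net/rpki/rrdp" version="1"
          session_id="9df4b597-af9e-4dca-bdda-719cce2c4e28" serial="2">
  <publish uri="rsync://example.net/repo/a.roa">QQ==</publish>
  <publish uri="rsync://example.net/repo/b.roa">Qg==</publish>
  <publish uri="rsync://example.net/repo/a.roa">Qw==</publish>
</snapshot>"""

if __name__ == "__main__":
    for uri, n in duplicate_publish_uris(SAMPLE).items():
        # In production this would raise an alert to the NOC instead.
        print(f"ALERT: {uri} appears in {n} <publish> elements")
```

An empty result means no URI is repeated; anything else is the error
condition worth alerting on and, ideally, archiving the offending file for.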

> Kind regards,
> 
> Job

Fred

