# AI Transparency
This project was started with the help of ChatGPT.

For full transparency this page documents what it was used for. It should be a complete list of every prompt sent to ChatGPT.
The full interaction is not included here, just my questions; I have summarized the answers and added a comment on where each answer led me. The questions I sent to ChatGPT are in blockquote:

> like this
As this was a bit of an experiment in using AI to help with coding at a somewhat larger scale, there are also some general reflections at the end.
I had a rough idea of how the ref listing and shallow cloning worked, but not the details. I had at least skimmed through the gitprotocol-v2 documentation. But maybe I could get some assistance in the research, so I asked:
> The git http protocol. What http requests does "git ls-remote" end up doing?
This gave a comparison of the "dumb" and "smart" protocols and their respective request URLs, just saying the body is some git-specific payload. I guess that is what I asked for. So I got the URL, but not much more.
I was interested in more details:
> Where does limiting to certain refs end up?
Now it answered that it depends on whether the protocol is smart or dumb, but then said that in both cases the client gets all refs and filters them client side. It even had a section on why it happens client side ("the smart protocol is not a query language", which we'll soon see is misleading).
It also said I could observe it myself using `GIT_TRACE_PACKET=1 GIT_TRACE_CURL=1 git ls-remote origin refs/heads/main`. I knew that kind of tracing was available somehow, but it was nice that I didn't have to dig it out of the documentation and could just run that.
That way I could confirm that (at least in the git version I was running) the filtering did indeed happen client side.
But I had seen the `ref-prefix` argument in the protocol documentation. So I asked:
> Whatabout the "ref-prefix" argument to "ls-refs"?
Now it did start talking about protocol v2 and how that supports it. Cool, so it is supported.
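For reference, here is the request shape I ended up with from the docs and the traces: a POST to `<repo-url>/git-upload-pack` with a `Git-Protocol: version=2` header, and a pkt-line framed body roughly like this (each line is prefixed by its own length in 4-digit hex, counting the prefix and the trailing newline; `0001` is a delimiter, `0000` the terminating flush):

```
0014command=ls-refs
0001
0009peel
001fref-prefix refs/heads/main
0000
```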
Armed with this information and the tracing I could now execute these requests with python requests.
To easily compare things between my requests and the ones from GIT_TRACE_CURL, I wanted to enable more logging in python requests, if only I could remember that stuff.
> how to enable full logging of the requests sent with python requests module?
This gave code to put at toplevel, supporting both py2 and py3.
> can you wrap it up all (python3 only) in a function so one can just do enable_http_logging()
I just wanted one more thing fixed before I could drop it into my script:
> Move the imports into the function too even though it is bad form
Copy&paste, and I could easily iterate and show myself that those requests worked, listing only the refs I wanted.
Time to fetch the actual content
> Howabout actually fetching objects. What would be the HTTP requests to do an initial shallow clone of a single branch? Using protocol v2
This gave enough details to hack that request into my python script.
I could get that pack file out. There is some plumbing command to add it to the repo...
> The response to the "fetch" command. Where to send that to add that pack file to the repo?
That nicely explained the side-band data, and gave me the command `git index-pack --stdin` (though also using temporary files in a slightly odd way).
That worked fine... except it was a shallow pack, so it was missing objects. But I knew there was something about grafts, so:
> How about shallow pack files? With missing parents? How to set up those grafts?
It explained lots of basic things, which one would think it would have understood that I knew from the earlier questions and answers. But in the answer there was what I needed about `.git/shallow` and the shallow lines in the fetch response.
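For context, the `.git/shallow` file is just a list of the commits whose parents are deliberately missing, one 40-hex commit id per line (hypothetical ids):

```
94f2c95382f24a34ee736e0c3a2d3d5b95a5e9e2
3fd8b1ffd9886f17b1c1d6d549b2c4c74ae29a14
```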
Cool, now I had some code doing the request I needed to do. Time to throw it away and start on the real implementation.
I knew I wanted to implement the final tool in Rust (the intended use is in CI, where it is nice if it is just a single binary without odd dependencies (i.e. just linking with libc), and I had a hunch that the async support in Rust would make it easy to fetch multiple repos in parallel). Also, I'm far from an expert (but know enough to be dangerous), so I wanted to use this as an excuse to learn more. This is also where I got the idea that maybe I should do this as a "vibe coding" experiment, or at least reach for ChatGPT before reaching for documentation.
OK, I'm obviously not going to write my own HTTP library. What do people use in Rust? Let me just ask:
> Create some example code doing http/2 requests using rust
(By the way: as I had the vision of cloning multiple repos in parallel I wanted to make sure HTTP/2 was supported, so it could do multiple streams without new TLS handshakes and such.) I explicitly didn't ask which crate to use or anything; I just wanted to see which one it picked.
It chose hyper, and showed plain HTTP/2 as well as HTTP/2 over TLS. But I wanted async; how does that look with hyper:
> I want an example that sends multiple requests in paralell using async.
The code it gave me didn't look too bad. But it created a separate "Client" object for each request, and remember, I wanted multiple streams in the same connection. Something I had not told it, though. So it didn't look like it would do that.
> Is that example sending the requests in the same tls connection?
Of course it wasn't. But it clearly explained why, and how to make it do so. It also told me what trace logging to enable to verify it happening, which was quite helpful. But I needed more:
> Ok, how do I add headers?
Showed good examples.
It was unclear to me how TLS (certificate) handling worked:
> What if I want properl [sic!] tls handling for https urls?
I was presented with examples and a comparison between hyper-tls and hyper-rustls. reqwest was also mentioned. Coming from python requests, the name piqued my interest, and I wanted to know more:
> How does a simple rust program making multiple requests using reqwests module look like (async)
It gave me a nice and simple example, still including the things I had been asking for before (same connection, adding headers).
To verify I asked:
> This one reuses connections the same way as describe above, right? (as it is on top of hyper)
It briefly explained that this is the case and why, and how I could confirm it using debug logging.
Nice. Here I pretty much settled on using reqwest. It looked nice.
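To give an idea, here is my own minimal sketch of that shape (not the exact code ChatGPT produced; assumes the reqwest, tokio, and futures crates in Cargo.toml). The one shared `Client` is what lets reqwest multiplex the requests as streams over a single HTTP/2 connection:

```rust
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // One shared Client; its internal connection pool lets several
    // requests multiplex over the same HTTP/2 connection.
    let client = reqwest::Client::new();

    let urls = ["https://example.com/a", "https://example.com/b"];
    let requests = urls.iter().map(|url| {
        let client = client.clone(); // cheap: Client is an Arc internally
        async move {
            let res = client.get(*url).send().await?;
            Ok::<_, reqwest::Error>(res.status())
        }
    });

    // Run all requests concurrently and wait for every result.
    for status in futures::future::join_all(requests).await {
        println!("{}", status?);
    }
    Ok(())
}
```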
So I pretty much just took the example code I was given and changed the URLs to the ones that do git things. But I needed auth too:
> how to do basic auth and set headers with reqwest?
This showed how to use `.basic_auth` with reqwest, and I could start making requests against real repos with auth.
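The relevant part boils down to this (sketch; `user` and `token` are hypothetical names from my configuration):

```rust
use reqwest::Client;

// Perform a GET with HTTP basic auth.
async fn authed_get(client: &Client, url: &str, user: &str, token: &str)
    -> Result<reqwest::Response, reqwest::Error>
{
    client
        .get(url)
        .basic_auth(user, Some(token)) // sets the Authorization: Basic header
        .send()
        .await
}
```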
To be able to compare requests made by this with ones made by the standard git client I needed debug logging:
> how to enable debug logging?
This showed how to simply enable env_logger, and I could just copy those few lines into my main.
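Those few lines amount to something like this (sketch; assumes the `env_logger` crate):

```rust
fn main() {
    // Log level is controlled by the RUST_LOG environment variable,
    // e.g. RUST_LOG=reqwest=debug,hyper=trace cargo run
    env_logger::init();
}
```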
OK, this enabled me to ensure I was making the right requests. Now I needed those requests to not be hardcoded, but to build the request dynamically.
I now started writing some code to generate data in the "git packet line" format. I wrote this manually; in hindsight I might just have asked it to create a builder, but I wasn't really tuned in to this kind of vibe coding yet.
But writing code manually I needed tests, so:
> whats the boilerplate for tests inside a rust module?
This gave me the skeleton I recognized. It might have been faster to just open any other .rs file from another project and get it from there.
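That skeleton being the standard one (with a hypothetical function under test):

```rust
// Regular module code here.
pub fn double(x: i32) -> i32 {
    x * 2
}

#[cfg(test)]
mod tests {
    use super::*; // bring the module's items into scope

    #[test]
    fn doubles_values() {
        assert_eq!(double(2), 4);
    }
}
```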
But with tests in place I could start tdd:ing the "packet line" builder.
> how to write formatted data into a `vec<u8>`?
`write!`
> And formatting a numer [sic!] as 4digit hex?
`{:04x}`, plus explanations of what exactly the different parts meant.
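Put together, that is essentially how a pkt-line gets framed (my sketch of the idea, not the code from src/pkt_line.rs):

```rust
use std::io::Write; // Vec<u8> implements io::Write

// Frame a payload as a git pkt-line: 4-digit hex length prefix
// (counting the prefix itself) followed by the payload.
fn pkt_line(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::new();
    write!(buf, "{:04x}", payload.len() + 4).unwrap();
    buf.extend_from_slice(payload);
    buf
}

fn main() {
    assert_eq!(pkt_line(b"command=ls-refs\n"), b"0014command=ls-refs\n");
}
```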
> How to create a "builder" type. With methods that returns the builder itself mutating it?
This showed a complete example, and pointed out the key thing (`mut self` parameter, returning `self`). So I could easily do the same in my code.
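The pattern in short (a generic sketch, not my actual builder):

```rust
struct RequestBuilder {
    lines: Vec<String>,
}

impl RequestBuilder {
    fn new() -> Self {
        RequestBuilder { lines: Vec::new() }
    }

    // Takes ownership (`mut self`) and hands it back, so calls chain.
    fn line(mut self, s: &str) -> Self {
        self.lines.push(s.to_string());
        self
    }

    fn build(self) -> String {
        self.lines.join("\n")
    }
}

fn main() {
    let req = RequestBuilder::new().line("command=ls-refs").line("peel").build();
    println!("{}", req);
}
```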
I have not yet internalized the naming of the operations on Vec:
> how to add [u8] into `vec<u8>`?
In addition to just answering `extend_from_slice` (and giving an example) it helpfully explained the difference from `append`; not what I needed just then, but maybe it will help me learn...
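For the record, the difference (my summary):

```rust
fn main() {
    let mut dst: Vec<u8> = b"git".to_vec();

    // extend_from_slice copies out of a borrowed slice.
    dst.extend_from_slice(b"-sleipnir");

    // append moves elements out of another Vec, leaving it empty.
    let mut tail = b"!".to_vec();
    dst.append(&mut tail);

    assert_eq!(dst, b"git-sleipnir!");
    assert!(tail.is_empty());
}
```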
Cool, this let me complete the packet line builder (it can be found in src/pkt_line.rs).
And I was ready to make real requests:
> how to post data in reqwest?
Here I was presented with lengthy examples of four kinds (JSON, form, multipart upload, and raw data). Luckily the raw data I was interested in came last, so I didn't have to scroll up so much :)
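The raw-data variant boils down to this (sketch):

```rust
use reqwest::Client;

// Post a raw request body (anything convertible into reqwest::Body,
// e.g. a Vec<u8>).
async fn post_pkt(client: &Client, url: &str, pkt: Vec<u8>)
    -> Result<reqwest::Response, reqwest::Error>
{
    client
        .post(url)
        .header("Content-Type", "application/x-git-upload-pack-request")
        .body(pkt)
        .send()
        .await
}
```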
But when using my builder to create the data and then posting it I got some lifetime issues, because my "builder" actually wrote the result directly into a `Vec<u8>`, while the example I had based the builder on collected arguments and then created the resulting object. I was just asking detailed questions, without giving the full context. Had I given proper context I might have gotten working code out directly.
> How can I give out the member of a struct while ending the lifetime for the struct itself?
Now I learned the details of the `take` method pattern, which is what I wanted. But it also showed destructuring moves.
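The `take` pattern in short (sketch):

```rust
struct Builder {
    buf: Vec<u8>,
}

impl Builder {
    // Consumes the builder and gives out the buffer; the struct's
    // lifetime ends here, so no borrow can outlive it.
    fn take(self) -> Vec<u8> {
        self.buf
    }
}

fn main() {
    let b = Builder { buf: b"0000".to_vec() };
    let data: Vec<u8> = b.take();
    // `b` is moved and can no longer be used.
    println!("{} bytes", data.len());
}
```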
Finally I was able to make proper requests with dynamic content.
But getting the result as one big buffer didn't feel great:
> How to stream response from reqwest?
This gave me a complete example of how to use `bytes_stream`, including details on error handling.
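Roughly like this (sketch; needs reqwest's `stream` feature and the futures-util crate):

```rust
use futures_util::StreamExt;

// Consume a response body chunk by chunk instead of buffering it all.
async fn dump_stream(res: reqwest::Response) -> Result<(), reqwest::Error> {
    let mut stream = res.bytes_stream();
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?; // each item is a Result<bytes::Bytes, reqwest::Error>
        println!("got {} bytes", chunk.len());
    }
    Ok(())
}
```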
Now I was streaming the responses. But I needed to parse the git data.
> If the data inside the streaming chunks is in git's "side-band-64k" format. How to best unwrap that? Minimizing the amount of copying of data. Ideally an api something like:
>
> ```
> for part in SideBand64kStream(res.bytes_stream()) { match part { PackData(d) => output.write(d); Progress(p) => println!("remote: {}", p); ErrorMessage(e) => return Err(x); } }
> ```
This gave me complete code for the SideBand64kStream adapter. Seeing that code made me realize:
> Actually. Make it two layers. One to parse raw git protocol packets (4-digit hex + data, flush, delimiter). And then on top of that do the sideband handling. Still avoiding copies as much as possible.
This gave a full implementation of GitPacketLineStream and SideBand64kStream (the latter of which I later realized I didn't need).
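To show what the lower layer deals with, here is a simplified, non-async sketch of the packet parsing idea (not the generated Stream adapter):

```rust
enum Packet<'a> {
    Flush,          // "0000"
    Delimiter,      // "0001"
    Data(&'a [u8]), // "<4-hex-len><payload>"
}

// Parse one packet from the front of `buf`, returning it and the rest.
fn parse_packet(buf: &[u8]) -> Option<(Packet<'_>, &[u8])> {
    let len_hex = std::str::from_utf8(buf.get(..4)?).ok()?;
    let len = usize::from_str_radix(len_hex, 16).ok()?;
    match len {
        0 => Some((Packet::Flush, &buf[4..])),
        1 => Some((Packet::Delimiter, &buf[4..])),
        2 | 3 => None, // reserved / invalid lengths
        _ => {
            // The length counts the 4-byte prefix itself.
            let payload = buf.get(4..len)?;
            Some((Packet::Data(payload), &buf[len..]))
        }
    }
}
```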
But there was some Pin thing in that code which I didn't understand:
> Whats the "Pin" for?
The response made me go "makes sense".
Now that I had the parser I wanted to write the code that used it. And to iterate locally I wanted a test using it on real data I had captured.
> Can you write a test for GitPacketLineStream that reads from a testfile (../output.bin)
This gave me test code I could paste into the rust module.
But it seemed to not take full advantage of the streaming: it first read the file into a buffer, and then streamed from that buffer, which annoyed me a bit (maybe I would end up doing things that can be done on a buffer but not on a stream?).
> Can one not stream directly from the file in the test?
Now I had updated test code that streamed directly.
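The streaming-from-file part presumably looks something like this (sketch; uses `ReaderStream` from the tokio-util crate):

```rust
use tokio_util::io::ReaderStream;

#[tokio::test]
async fn parses_captured_data() {
    let file = tokio::fs::File::open("../output.bin").await.unwrap();
    // Wrap the AsyncRead file into a Stream of Bytes chunks,
    // the same shape reqwest's bytes_stream() produces.
    let stream = ReaderStream::new(file);
    // ... feed `stream` to GitPacketLineStream here ...
    let _ = stream;
}
```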
Just that it didn't compile:
> I'm getting
>
> ```
> error[E0599]: the method next exists for struct GitPacketLineStream<ReaderStream<File>>, but its trait bounds were not satisfied
> note: the trait Stream must be implemented
> ```
>
> Why?
It turns out this error was because the Stream adapter it had written for me assumed reqwest::Error, so it only worked for streams from reqwest. Had I been more experienced in Rust I guess I would have seen that directly when reviewing the code. Also, who would write such a generic stream adapter but tie it to such a specific error type?
Oh well, it told me how to fix it (make it generic over the error type).
And with that working I had code that could parse my captured data (output.bin). Then I wanted to send the actual pack data to an external git command...
> how to execute an external command and send data to its stdin, in a streaming fashion. async.
This gave me example code that I could pretty much copy-paste into my test (using tokio::process::Command).
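The shape of it (my sketch; the generated code differed in details):

```rust
use std::process::Stdio;
use tokio::io::AsyncWriteExt;
use tokio::process::Command;

// Feed pack data chunks into `git index-pack --stdin`.
async fn index_pack(chunks: Vec<Vec<u8>>) -> std::io::Result<()> {
    let mut child = Command::new("git")
        .args(["index-pack", "--stdin"])
        .stdin(Stdio::piped())
        .spawn()?;

    let mut stdin = child.stdin.take().expect("stdin was piped");
    for chunk in chunks {
        stdin.write_all(&chunk).await?; // stream chunk by chunk
    }
    drop(stdin); // close stdin so git sees EOF

    child.wait().await?;
    Ok(())
}
```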
That was easy. Now I needed to do some more parsing of the metadata I got, in the form of bytes::Bytes:
> How to best compare a bytes::Bytes with a bytestring literal?
Here I also learnt why `== b"hello"` doesn't work.
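In short (sketch):

```rust
use bytes::Bytes;

fn main() {
    let data = Bytes::from_static(b"hello");
    // b"hello" is a &[u8; 5] (an array reference), not a &[u8],
    // so compare against a slice instead:
    assert!(data.as_ref() == b"hello");
    assert!(data == &b"hello"[..]);
}
```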
> How to strip off trailing whitespace from a bytes::Bytes?
This wrote me a little function, which it then turned out I didn't need (what's left of it is `without_lf` in util.rs).
> write the code to parse each line of
>
> ```
> 004efb8eaed036f6ca12494a08edf6d5feebeac53bac refs/heads/99.10/xx-yyyy-zzz
> 00493e5bd503bc160a6b5e2dba333e71d1e02c5c79d1 refs/heads/99.10/xx-yyy-zzzzz
> 0076b659eee868a794121d7f34c6e83567e5ba89baf7 refs/tags/BUILD_99.10.0.0 peeled:d01e8d43c63646079a9fab530e2f47060ad3b23b
> 00769d4b46f97ebf0f699dddce0beacc989fb046535d refs/tags/BUILD_99.10.0.1 peeled:f312b6d63370984da036295429d19b01a37ee914
> ```
>
> Into a `Vec<RefInfo>`
This gave me the definition of RefInfo (where it correctly named the members, understanding that they were sha and refname), and a function to parse that data.
Then it asked if I wanted to read from a stream instead of a &str, which is close to what I wanted:
> Remember the GitPacketLine class? Use that
This gave me updated working parser function.
I wasn't really happy with it. It tried very hard not to allocate memory and got clumsy, with manual next() calls on the split. So I ended up refactoring it into what is now the ls_refs function.
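A simplified version of the parsing idea (sketch; the real ls_refs works on packet lines, not on a &str):

```rust
struct RefInfo {
    sha: String,
    refname: String,
    peeled: Option<String>,
}

// Parse lines of the form "<sha> <refname>[ peeled:<sha>]".
fn parse_refs(input: &str) -> Vec<RefInfo> {
    input
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let sha = parts.next()?.to_string();
            let refname = parts.next()?.to_string();
            let peeled = parts
                .next()
                .and_then(|p| p.strip_prefix("peeled:"))
                .map(String::from);
            Some(RefInfo { sha, refname, peeled })
        })
        .collect()
}
```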
Here I had much of the code needed to do the shallow cloning. I just needed to update the `shallow` file, so I had to read it first:
> Read non-empty lines of file into a HashSet. Async
This pretty much gave me what is now in util.rs:read_lines_to_set.
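Which presumably amounts to something like this (my sketch, not the exact util.rs code):

```rust
use std::collections::HashSet;
use tokio::fs;

async fn read_lines_to_set(path: &std::path::Path) -> std::io::Result<HashSet<String>> {
    let contents = fs::read_to_string(path).await?;
    Ok(contents
        .lines()
        .filter(|l| !l.is_empty()) // skip blank lines
        .map(String::from)
        .collect())
}
```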
> Write `HashSet<String>` (sorted) to file. async
I wanted a different name:
> rename the function `write_lines_from_set`
and also:
> It should overwrite the file if it already exists
The code had an explicit flush, which made me wonder:
> Is the flush really needed. Can't we trust drop of writer?
I learned about the nuances between async flush and flush on sync Drop.
The code was still missing something though:
> Howabout writing to a temp file and moving it in place?
And here we have what is now in util.rs:write_lines_from_set.
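Put together, roughly (my sketch, not the exact util.rs code; the temp-file naming is hypothetical):

```rust
use std::collections::HashSet;
use std::path::Path;
use tokio::fs;
use tokio::io::AsyncWriteExt;

async fn write_lines_from_set(path: &Path, set: &HashSet<String>) -> std::io::Result<()> {
    let mut lines: Vec<&String> = set.iter().collect();
    lines.sort(); // deterministic output

    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp).await?; // truncates if it exists
    for line in lines {
        file.write_all(line.as_bytes()).await?;
        file.write_all(b"\n").await?;
    }
    file.flush().await?; // async flush before the rename
    fs::rename(&tmp, path).await?; // move into place
    Ok(())
}
```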
At that point I had code that did the right things, but with hardcoded paths and URL. So I needed to pass paths around in calls. If I just could remember how to create paths...
> How to convert &str to &Path?
> How to join paths
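The answers, condensed (sketch):

```rust
use std::path::{Path, PathBuf};

fn main() {
    let dir: &Path = Path::new("/tmp/repos"); // &str -> &Path is free
    let shallow: PathBuf = dir.join("repo.git").join("shallow");
    println!("{}", shallow.display());
}
```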
I also had separate invocations of the external "git" command, with lots of duplication.
> How to return a partial tokio::process::Command?
> Something like
>
> ```
> fn x() -> Something { return tokio::process::Command::new("some-command").arg("some-arg") } fn y() { x().arg("more").spawn(); }
> ```
This gave me example code which could be adapted directly.
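The key point is that `arg` borrows, so the function has to own and return the `Command` itself (sketch):

```rust
use tokio::process::Command;

// A partially configured Command; callers add more args and run it.
fn git_cmd() -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("--git-dir").arg("repo.git"); // arg takes &mut self
    cmd
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let status = git_cmd().arg("rev-parse").arg("HEAD").status().await?;
    println!("exit: {}", status);
    Ok(())
}
```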
Then, in my cleanup extracting nice functions, I thought I needed lifetime annotations:
> Fix lifetime annotations in
>
> ```rust
> async fn upload_pack_req(&self, pkt: &[u8]) -> Result<reqwest::Response, reqwest::Error> {
>     let mut req = self.client
>         .post(format!("{}/git-upload-pack", self.url))
>         .header("Content-Type", "application/x-git-upload-pack-request")
>         .header("Accept", "application/x-git-upload-pack-result")
>         .header("Git-Protocol", "version=2")
>         .body(pkt);
>     if let Some(username) = &self.username {
>         req = req.basic_auth(username, self.password.clone());
>     }
>     req.send().await
> }
> ```
But it said I probably don't want lifetime annotations, but rather to use pkt.to_vec().
That helped me understand that maybe I want to change the argument to transfer the ownership instead:
> Making pkt arg `Vec<u8>` should ensure correct lifetimes, right?
Now it agreed that was a good idea. So I guess maybe it was?
(This is still the signature of `upload_pack_req`.)
So, on to making the URLs more dynamic, like having username and password in them.
> How to user username/password inside url, with reqwest?
This told me it would work out of the box, but also that it is better to parse the URL and use .basic_auth().
I spotted an odd thing in the code it gave me:
> In that example parsed gets passed to reqwest, so it still has it in url, right?
Now I got code that parses the URL, extracts the password, and rebuilds the URL without the password. Looks good. It also mentioned that this makes it "compatible with reqwest's redirect handling", which I wanted to understand:
> Exactly what is the thing with the "redirect handling"
It taught me that most HTTP clients strip the username/password on any redirect, even to the same host, while .basic_auth keeps it if it is the same origin.
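The parse-and-strip part is roughly (sketch, using the url crate):

```rust
use url::Url;

// Split credentials out of a URL, returning the bare URL plus
// the username/password (if any) to pass to .basic_auth().
fn split_credentials(raw: &str) -> (Url, Option<(String, Option<String>)>) {
    let mut url = Url::parse(raw).expect("valid url");
    let creds = if url.username().is_empty() {
        None
    } else {
        let user = url.username().to_string();
        let pass = url.password().map(String::from);
        Some((user, pass))
    };
    let _ = url.set_username(""); // these return Result<(), ()>
    let _ = url.set_password(None);
    (url, creds)
}
```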
So at this point I had the code to fully do a shallow clone; now on to the iterative deepening. With the peeled tags in a HashSet, I wanted to check if any of the commits I had gotten was one of them. Feels like there is a nice one-liner for that...
> How to best check the if any of the entries in a list is in a HashSet
`list.iter().any(|item| set.contains(item))`, of course. I probably could have guessed that if I had been thinking, or had looked at the documentation instead of just asking.
But right, I needed to get the commit list from the `git rev-list` command:
> How to read tokio::process::Command stdout into a `Vec<String>`?
Again, the example code could be adopted directly.
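Something along these lines (sketch):

```rust
use tokio::process::Command;

async fn rev_list(rev: &str) -> std::io::Result<Vec<String>> {
    let output = Command::new("git")
        .args(["rev-list", rev])
        .output() // runs to completion, capturing stdout/stderr
        .await?;
    Ok(String::from_utf8_lossy(&output.stdout)
        .lines()
        .map(String::from)
        .collect())
}
```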
So with iterative deepening in place, on to finding the best tag. (There are many options here: "shortest distance from HEAD", "most recently created", or "highest number".) I want highest:
> Find the "max" string in a vector, based on "natural sort"
I was pointed to the `natord` crate.
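Which makes it a one-liner (sketch, assuming natord's compare function):

```rust
// Pick the "highest" tag by natural order, so e.g. BUILD_99.10.0.10
// sorts after BUILD_99.10.0.9 (plain string order gets this wrong).
fn max_tag(tags: &[String]) -> Option<&String> {
    tags.iter().max_by(|a, b| natord::compare(a.as_str(), b.as_str()))
}
```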
Now all basic functionality was in place.
Time to clean the code up a bit.
> let some, with multiple clauses, possibel?
Sure, I was vague. I got answers about having a tuple wrapped in an Option. But what I was really asking for:
> No I mean combining
>
> ```
> if let Some(x) =maybex { if let Some(y) = maybey { ...
> ```
>
> into one
Now it showed and explained the tuple unpacking trick: `if let (Some(x), Some(y)) = (maybex, maybey)`
> please create unittests ..the full code of reader.rs..
It created some basic unittests.
I wanted to improve them:
> use "foobar" instead of "data" as example data.
> A testcase that has multiple gitlines inside.

> Yes, some tests with incorrect data. E.g: stream ending too early, length not being hex, length being 0002/0003

> This code exists in a pkt_line module: ..full code of pkt_line.rs.. Write a roundtrip unittest, that creates a buffer with a few add/delimit/flush calls, and then using GitPacketLineStream to ensure that it gets the expected values out

> Instead of three calls to stream.next in test_multiple_packet_lines, can it be collected into a Vec?
These ended up as the tests that can now be found in reader.rs.
Great, I now had the code in shape. Just that the URLs and branches were hardcoded in main.rs.
I wanted to use clap (but accidentally called it clippy) to parse options:
> Using clippy. Write a new main.rs which parses the following options: --base-url URL --branch-filter string --tag-filter string --branch string URLs...
> Into an Options struct. If --base-url is specified URLs can be relative to that. If it isn't they must be absolute.
This gave me a complete Options struct and a `resolve_urls` function, ready to be copy-pasted directly.
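With clap's derive API such an Options struct looks roughly like this (my sketch, clap 4 style, not the exact generated code):

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct Options {
    #[arg(long)]
    base_url: Option<String>,

    #[arg(long)]
    branch_filter: Option<String>,

    #[arg(long)]
    tag_filter: Option<String>,

    #[arg(long)]
    branch: Option<String>,

    /// Repository URLs, absolute or relative to --base-url.
    urls: Vec<String>,
}

fn main() {
    let opts = Options::parse();
    println!("{:?}", opts);
}
```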
Oh, I forgot about one option, one that has some more complicated parsing:
> Parse arguments like --branch-fallback '/(foo)-[^-]+$/$1/' It might be specified multiple times. It is /REGEX/replacement/ so validate that it is a proper regex. And put it in a new struct like struct BranchFallback { match: Regex, replacement: String }
This gave working code. Parsing was done with basic splitting, but it asked if it should also support escaping slashes. Which I obviously wanted:
> Would it be messy to support escaping of / inside the regex
This gave updated working code, and also added an explanation of why split no longer works. But I wanted more:
> Change it to allow any character as separator. E.g %regex%replacement%
Updated code, working like that. This is almost exactly branch_fallback.rs, except I saw a bug in the handling of incorrect values like /foo/bar/extrastuff. If only I had a unittest for that case...
> Write some unittests.
This gave some tests covering different escaping and error cases. I thought it was easier to add an additional test myself for the problematic case, and fix the code.
Then I needed to know what to name the repos locally:
> How to get last path component from url?
It gave a useful answer: `last()` on `path_segments()`. Except that clippy then asked me to change `last()` to `next_back()`.
But it also gave additional information that was important to my case: "This handles trailing slashes correctly — i.e., http://example.com/path/to/resource/ will give resource still."
But, it left me wondering:
> When does it return None?
The answer helped me see it is probably not relevant in my case (like mailto: URLs and such).
When I added better logging I wanted to make sure passwords (tokens) are hidden when logging the URL, so I wanted a function to mask them.
> Given an URL. Return it as a string, but the password masked if it has a password in the url
The generated function had a str parameter; I wanted a proper Url:
> Make `mask_password_in_url` take Url as argument. And rename it `masked_url`
Unfortunately it mutated the parameter, which didn't seem like a good idea. It was like it was trying to make the smallest possible change: previously it created the Url inside the function, where there was no harm in mutating it, and it just moved that to be a parameter instead.
> Mutating the parameter doesn't feel great
The result is the main.rs:masked_url function.
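Presumably something along these lines (my sketch, not the exact code):

```rust
use url::Url;

// Render a URL for logging, replacing any password with asterisks.
fn masked_url(url: &Url) -> String {
    if url.password().is_some() {
        let mut masked = url.clone(); // clone instead of mutating the parameter
        let _ = masked.set_password(Some("*****"));
        masked.to_string()
    } else {
        url.to_string()
    }
}
```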
This wasn't exactly "vibe coding". I had not even set up editor integration, but rather was copy-pasting from the browser to the editor. As part of this was about learning stuff, that friction may even have been beneficial, as it probably made me stay in the editor and do a bit more manual work (learning).
During research it helped me quickly verify that my plan would work, and I didn't have to read documentation and/or source code but could instead get just the bits I needed. Except that I needed to insist on the server-side ref filter; had I not known about that, I would have been tricked into believing it had to be done client side. I suppose I could have let it write the python code during this stage too.
It clearly can generate code. The "show an example doing X" approach was helpful when trying to select crates. For the actual implementation, it generated something working on the first try, though I had to refine it to make it "clean" (which obviously is somewhat subjective, so maybe it was my cleanliness definition that was off?). That might also be because I didn't provide enough context, as my code was hidden in the editor and I only provided the snippets I thought were relevant.
Asking about a specific error message also worked well, and quickly pointed me in the right direction.
In summary, it was like having an experienced senior Rust developer you could ask anything, just one you don't completely trust, as it had led you astray before.