Table of Contents
- Lecture 1 - Part 1: The Changing Internet - Part 2: Protocols and Layers - Part 3: Physical and Data Link Layers - Part 4: The Network Layer and the Internet Protocols - Part 5: The Transport Layer - Part 6: Higher Layer Protocols - Part 7: The Changing Internet - L1 Discussion
- Lecture 2 - Part 1: Connection Establishment in a Fragmented Network - Part 2: Impact of TLS and IPv6 on Connection Establishment - Part 3: Peer-to-peer Connections - Part 4: Problems due to Network Address Translation - Part 5: NAT Traversal and Peer-to-Peer Connection Establishment - L2 Discussion
- Lecture 3 - Part 1: Secure Communications - Part 2: Principles of Secure Communication - Part 3: Transport Layer Security (TLS) v1.3 - Part 4: Discussion - L3 Discussion
- Lecture 4 - Part 1: Limitations of TLS v1.3 - Part 2: QUIC Transport Protocol: Development and Basic Features - Part 3: QUIC Transport Protocol: Connection Establishment and Data Transfer - Part 4: QUIC Transport Protocol: Avoiding Ossification - L4 Discussion
- Lecture 5 - Part 1: Packet Loss in the Internet - Part 2: Unreliable Data Using UDP - Part 3: Reliable Data with TCP - Part 4: Reliable Data Transfer with QUIC - L5 Discussion
- Lecture 6 - Part 1: TCP Congestion Control - Part 2: TCP Reno - Part 3: TCP Cubic - Part 4: Delay-based Congestion Control - Part 5: Explicit Congestion Notification - Part 6: Light Speed? - L6 Discussion
- Lecture 7 - Part 1: Real-time Media Over The Internet - Part 2: Interactive Applications (Data Plane) - Part 3: Interactive Applications (Control Plane) - Part 4: Streaming Video - L7 Discussion
- Lecture 8 - Part 1: DNS Name Resolution - Part 2: DNS Names - Part 3: Methods for DNS Resolution - Part 4: The Politics of Names - L8 Discussion
- Lecture 9 - Part 1: Content Distribution Networks (CDNs) - Part 2: Inter-domain Routing - Part 3: Routing Security - Part 4: Intra-domain Routing - L9 Discussion
- Lecture 10 - Part 1: Future Directions - L10 Discussion
- Aims and Objectives
Lecture 1
Abstract
Lecture 1 introduces the course, and reviews some of the material covered in the Networks and Operating Systems Essentials course in Level 2. It discusses what a network protocol is, and the concept of layering as a way of structuring networked systems. It reviews some important aspects of the physical and data link layers; IPv4 and IPv6 and the operation of the network layer; the UDP and TCP transport protocols; and the higher layers in the protocol stack. Finally, it concludes by discussing some of the changes occurring in the network and some of the challenges forcing such changes, to set the scene for the later discussion.
Part 1: The Changing Internet
Abstract
The first part of this lecture introduces the course. It reviews the aims, objectives, and learning outcomes, and the structure of the lectures and labs. It outlines the assessment scheme, and the dates and topics for the assessed exercises. Recommended reading is given.
Welcome to Network Systems. For those who don’t know me, my name is Colin Perkins, I’m the lecturer and coordinator for this course. In this first lecture, I’ll review the aims, objectives, and administration for the course. Then I’ll recap some of the material covered in the Networks and Operating Systems Essentials course in Level 2. Finally, I’ll briefly introduce some of the changes occurring in the network to set the scene for the remainder of the course. In this first part of the lecture, I’ll start with some administrative details. I’ll talk about the aims, objectives, and intended learning outcomes for the course. I’ll outline the content of the lectures and labs, the timetable for the lectures and laboratory sessions, and the structure of the assessed exercises and exam. Finally, I’ll highlight the recommended reading for the course. As I mentioned at the start, I’m Colin Perkins. I’m the lecturer and coordinator for this course. If you have questions about the material covered in the course, my email address is on the slide, and I’m happy to answer questions by email. We also have labs and discussion sessions throughout the course when you can ask questions. The course materials, including lecture recordings, copies of the slides, lab handouts, and assessed exercises will be uploaded to Moodle. The material is also on my website at csperkins.org/teaching. The version on my website has lecture transcripts that are difficult to upload to Moodle, as well as the other material. The aims and objectives of the course are fourfold. First, the course aims to introduce the fundamental concepts and theory of communications. We’ll build on the material from the Networks and Operating Systems Essentials course in level two, covering the concepts in more depth, and going into more detail about the operation of the Internet. Second, the course aims to give you a good understanding of the technologies that comprise the modern Internet, and an understanding of how and why the network is changing. We’re in the middle of a rapid period of change in the Internet infrastructure. This course will try to make it clear what’s changing and what are the drivers for those changes. This should give you the ability to evaluate network systems and understand what technologies and approaches are suitable for particular scenarios. So you can advise on what’s appropriate to use. Finally, the course will build on the material in the Systems Programming course to introduce low-level network programming, and to give you some practice with networked systems programming in C. By the end of the course, you should be able to describe and compare the capabilities of different communications technologies, understand the implications of scale on network systems, and the different quality of service needs of different applications. In concrete terms, this means you should understand what are the different protocols in use in the Internet, and when it’s appropriate to use one or the other. For example, you should understand the differences between UDP, TCP, and QUIC, and when it’s appropriate to use each of these protocols. You should understand the importance of heterogeneity in the design of the Internet, the importance of layered protocol stacks, and the way that different components of the network are combined to form a whole. Finally, you should be able to write simple low-level communication software in C, showing awareness of good practices for correct and secure programming.
The course is organized as 10 lectures and six laboratory sessions. There’s one lecture a week for 10 weeks, with pre-recorded lectures and time for discussion each week. The labs are more variable. The first four lab exercises are expected to take one week each. Then the next two exercises should take two weeks each. The labs run for eight weeks, starting in week two of the semester. Lecture recordings will be made available ahead of time, and comprise an hour or so of material split into several shorter parts each week. Each lecture will be accompanied by a set of discussion questions and will be available on Moodle and on the course website. These discussion questions are not assessed, and are intended to help you understand the material. We have a live lecture session timetabled from 4-6pm on Thursday afternoons. This is intended for a discussion of the lecture material and questions. You should watch the lecture videos and think about the discussion questions before the timetabled sessions. This discussion slot is your main opportunity to ask questions about the material. I strongly encourage you to come prepared to use the slots to talk. The more questions and discussion we have, the more useful it’ll be for everybody. There are also weekly labs. The goal of the labs is twofold. First, they’re intended to help you practise C programming, building on the material in the systems programming course, and to introduce you to networked programming in C. While languages like Python are popular for web development, C is still by far the most widely used language for writing low-level networking code. So it’s important that you’re familiar with networked programming in C. The second goal of the labs is to complement the material covered in the lectures, to allow you to explore certain topics around congestion control and the structure of the network in more detail. As with the lectures, the lab materials will be made available in advance of the timetabled sessions. The labs are to be completed in your own time. The timetabled sessions from 1-3pm on Thursdays are for live support with the lab exercises. You should try to solve the exercises and think of questions you might need to ask before the timetabled support slots, so the lab demonstrators and I can effectively help you. The course follows the usual model of an exam worth 80% of the marks and an assessed exercise worth the remaining 20%. The assessed exercise will be made available in Lecture 5 and will be due on the same day that Lecture 7 is held. The usual penalties for late submission will be applied, following the University’s Code of Assessment. If you’re ill, or if you have other special circumstances that might affect your on-time submission, then you can contact me before the deadline to request an extension. Also, following the school’s policy, note that submissions of assessed exercises that do not follow the submission instructions given in the handout will be subject to a two-band penalty. And I want to emphasise this last point, because some people are surprised by it each year. The penalties for late submission and for not following submission instructions will be strictly enforced. As I said earlier, the final exam is worth 80% of the marks for the course. The format of the exam is that there are three questions and you have to answer all questions. There are past exam papers on Moodle, including one with sample answers.
The curriculum has changed over the years, and the pandemic caused a shift to online open book exams, and together these mean the style of questions has changed over time. The more recent exam papers, from 2020 onwards, are most representative of the style of questions you can expect. When studying the material and preparing for the exam, remember that the goal of the course is to help you understand how the Internet works. The aim of the exam is to test that understanding, not simply to test your memory of the details. Explain why, don’t just recite what. To achieve an A grade in this course, representing excellent performance, you’ll need to demonstrate that you can make a reasoned argument, develop logical conclusions, and apply your learning to new situations to solve new problems. The material in the lectures and the labs is examinable, but you’re also expected to follow the required readings and to develop a broader understanding of the material. So what are those required readings? Well, the slide lists good books on computer networking. You should read one of them. The books by Peterson and Davie, and by Olivier Bonaventure, are available for free online. The lectures, labs, and discussion can introduce you to important concepts of networking, but it’s also important that you read about and around the material covered, so you understand the details. The books explain the material in different ways, and may help illustrate different aspects of the subject and give different perspectives that make things clear in cases where my explanations do not. In addition to the books, many of the lecture slides also include links to standards documents, research papers, blog posts, or talks that go into more detail about the material. You’re not expected to read all of this material, and much of it goes into far more detail than you need, but you should at least look at it to begin to get a feel for the depth of the subject. There’s only so much detail that can be covered in a pre-recorded lecture. I’ve chosen the reading material carefully to complement the lectures. Following along with the reading and participating in the discussion sessions will further your understanding. The exam questions will focus on application of the knowledge you’ve gained during the course, not on rote memorization. Reading further, thinking about the ideas, and discussing the material covered in the lectures and the labs is essential. So that concludes the review of the course structure and administration. In the remaining parts of this lecture, I’ll review the traditional internet architecture, starting with a discussion of protocols and layering, then talk about some of the ways in which the network is changing.
Part 2: Protocols and Layers
Abstract
Part 2 of the lecture introduces the idea of a network protocol, and the concept of layering as a way of structuring networked systems. It reviews the 7-layer OSI model, and discusses how it’s a useful way of thinking about systems, even though it’s not representative of any real-world systems.
I’d like to begin the course by reviewing some of the fundamental principles of networked systems. In particular, I’d like to talk briefly about what is a networked system, and how networked systems are structured in the form of a layered protocol stack. So, what is a networked system? A networked system is a set of cooperating, autonomous, computing devices that exchange data to perform some application goal. I talk about computing devices, because networked systems are not limited to traditional PCs, laptops, or servers. There are far more smart phones and tablets connected to the network than there are laptops and servers, for example. But there are also numerous sensors and controllers that form the Internet of Things: cameras, smart light bulbs, heating controllers, weather stations, medical devices, industrial automation, and so on. The network is comprised of an increasingly diverse set of devices, and has to meet their diverse needs. These devices are autonomous. Each device acts independently with no central control, and can choose what data to send, and when to send it. And these devices are all running different applications with different requirements. They require different things from the network and need different network protocols to support their needs. There are four aspects to the network. The first is communication. How can two devices that are connected to a single link reliably exchange data? Second is networking. How can we combine multiple links to form a wide area network, around a building, a campus, or a region? Third is internetworking. How can we connect multiple networks together to form an internet? How can multiple independently operated networks work together to act as if they were a single global network? How can data be routed across this collection of networks to reach its destination? And, finally, there is the problem of transport. How do the end systems ensure that data is delivered across this network, or networks, with appropriate reliability to meet the needs of the application? Networked systems are fundamentally about communications protocols. A sender is trying to talk to one or more receivers, via some communications channel. It does this by sending messages, that have to fit within the constraints of the communications channel. These constraints provide limits on the speed and reliability of transmission. The channel is not infinitely fast or perfectly reliable. The messages being sent across the channel have a well-defined format. Much like programming languages, network protocols have a well-defined syntax that describes the structure and format of the messages that can be sent. They also have well-defined semantics that define the meaning of the messages, the order in which they are sent, and the patterns of communication. Together the syntax and semantics of the messages define a network protocol, such as HTTP, TCP/IP, etc. Each protocol has a particular purpose and solves a particular problem. Protocols can be combined, and layered on one another, to gradually raise the level of abstraction, and provide more sophisticated services to the applications. The Open Systems Interconnection Reference Model, the OSI Model, is a common way of thinking about protocol layering. The OSI model structures the network as a set of seven layers. At the bottom of the protocol stack is the physical layer. For two devices, two end systems, that are directly connected, as shown on the slide, the physical layer represents the means of interconnection.
The type of cable, if it’s a wired link, or the details of the radio channel. The purpose of the physical layer is to exchange data between the devices. Above this sits the data link layer. The link layer structures that data into messages, identifies the devices, and coordinates when each device can send, so they share access to the channel. If a single network comprises more than one physical link, devices known as switches can be inserted into the network. Switches operate at the data link layer, adapting the message framing and arbitrating access to the different channels, in order to bridge the different links together. Examples of this include Ethernet switches, that connect multiple Ethernet links together to form a larger Ethernet network, and Wi-Fi base stations that bridge between Ethernet and Wi-Fi networks. The third layer in the OSI model is the network layer. The network layer supports the interconnection of multiple independently operated networks. It abstracts away the details of the different types of link layer, and combines different networks into one, to give the illusion that there is a single global internetwork. The Internet protocols, IPv4 and IPv6, are examples of network layer protocols, and the devices that connect different networks together are known as routers. Above the network layer, the transport layer ensures that data is delivered with appropriate reliability between end systems. The Transmission Control Protocol, TCP, is a widely used example of a transport protocol. Finally, the Session, Presentation, and Application layers support the coordination of multiple transport connections, describe the data formats used, and support application semantics. The OSI reference model is a standard model for layered protocol design. It’s important to realise, though, that real networks don’t follow the OSI model. No deployed networked system has seven layers structured in this way. Real systems are more complex. In some cases, especially around the upper layers of the stack, the layers merge together, and the boundaries between them are unclear. In other cases, systems are designed with more layers, or with clearly defined sub-layers, to support features that don’t fit neatly into the layer boundaries defined by the OSI model. Tunnelling solutions, that encapsulate data from a lower layer, and transmit it within a higher layer, are also common. Virtual Private Networks, VPNs, are a good example of this, and take network layer data, in the form of IP packets, and tunnel them inside a transport layer connection to some other part of the network, giving the illusion that a device is physically connected elsewhere. There’s not a strict progression of layers. We talk about the OSI model because it’s an extremely useful way to structure discussion about networks, not because it represents the structure of any particular network. To conclude, the network is a collection of autonomous computing devices, cooperating to exchange messages to support application needs. Those messages are structured in the form of a layered protocol stack, gradually building up features from the exchange of data between directly connected devices until we have the global Internet. The seven layer OSI model doesn’t reflect reality, but is a useful way of thinking about the protocol stack. For this reason, I’ll use it to structure the discussion in the other parts of this lecture, starting, in the next part, with a discussion of the physical and data link layers.
Part 3: Physical and Data Link Layers
Abstract
Part 3 of the lecture reviews the physical and data link layers. It briefly reviews baseband data encoding, carrier modulation, and spread spectrum communication. It talks about the limitations of physical links, the Shannon-Hartley theorem, and the factors that limit the performance of a communications channel. It also briefly reviews the data link layer, talking about framing, addressing, and media access control.
The physical and data link layers occupy the lowest levels of the protocol stack, and support data transmission between directly connected devices. In the following, I’ll briefly discuss the physical characteristics of network links and the modulation process by which data is transmitted across those links. Then, I’ll talk about features of the data link layer, such as framing, addressing, and media access control. The physical layer describes the properties of the communications channel. It’s the realm of the electrical engineer; of cables, optical fibres, and radio transmission. When considering the physical layer, we discuss the physical properties of the channel, the way bits are encoded onto the channel, and the capacity, and error rate, of the channel. We begin by asking whether the link is wired or wireless. If it is a wired link, we then consider the type of cable or optical fibre used. We ask what voltage is applied to the cable, or what frequency of laser light is sent down the fibre. And we ask how the bits are encoded as variations in that voltage or in the intensity of the laser. If, instead, the channel is wireless, we consider what type of antenna is used, the transmission power, the carrier frequency, and the modulation scheme used to encode data onto the carrier wave. Given these details, we can then estimate the capacity of the channel, and model the physical limitations of the performance of the transmission link. When using a wired link, whether an electrical cable or an optical fibre, the signal to be transmitted is usually directly encoded onto the channel. That is, the voltage applied to an electrical cable, or the brightness of a laser shining down an optical fibre, is changed in a way that directly corresponds to the signal to be transmitted. The signal will occupy a certain frequency range, known as its bandwidth. This is measured in units of Hertz, Hz, and directly corresponds to the complexity and information content of the signal. The more information being transmitted in a given time interval, the greater the bandwidth. A signal directly applied to a channel occupies what’s known as the baseband frequency range: the range starting at 0 Hz, and reaching up to the maximum bandwidth of the signal. Every channel also has a maximum bandwidth it can transmit. This depends on the physical characteristics of the channel. For example, the maximum bandwidth that can be sent over a twisted pair electrical cable depends on the length of the cable, the tightness of the twists, and the thickness of the wires. A channel is only able to transmit a particular signal if the bandwidth of the signal is less than the bandwidth of the channel. There’s a maximum rate at which data can be transmitted, depending on the physical characteristics of the channel. That maximum rate is determined by Nyquist’s theorem. This states that the maximum data rate a channel can support, Rmax, cannot exceed 2 B log2(V) bits per second, where B is the maximum bandwidth of the channel, and V is the number of different values each symbol can take. For binary data, each symbol can have one of two values, it can be a zero or a one. Accordingly, V is equal to two. The value of the “log2(V)” term, in this case, evaluates to one, and the maximum data rate directly corresponds to the channel bandwidth. So, how is data encoded onto the channel? The simplest case is what’s known as non-return to zero, NRZ, encoding, as shown in the top figure on the slide.
When sending binary data, NRZ encoding directly encodes the signal onto the channel. If the binary value 1 is to be sent over an electrical cable, for example, a high voltage is applied to the cable, whereas a low voltage is applied to send the binary value zero. The receiver simply measures the voltage, and directly translates it into binary values. Non-return to zero encoding is simple, but has the problem that long runs of ones or zeros result in signals that maintain the same value for long periods of time. The example has a run of four consecutive one bits, followed a little later by a run of four consecutive zeros, and each results in a signal that’s unchanging for a significant period of time. At high data rates, and with long consecutive sequences of the same value, it can be difficult to measure the exact time for which the signal is unchanging. This leads to miscounting, where the receiver believes one more, or one less, bit was sent than intended, giving a corrupt signal. To avoid this, more complex encodings are used. One such scheme is Manchester Encoding. This encodes every bit to be sent as a pair of values. A binary 1 is sent as a high-to-low transition in the signal strength, whereas a binary 0 is sent as a low-to-high transition. This avoids the miscounting problem of NRZ encoding, since the signal always changes irrespective of the data being sent, but at the cost of requiring twice as many transitions, and hence using twice the bandwidth. There are many different methods of baseband data encoding, of which the NRZ and Manchester encodings are the simplest. The different encodings trade off increased complexity for better performance. When encoding data onto a wireless channel, carrier modulation is used rather than baseband encoding. This allows multiple signals to be carried on a single channel, each modulated onto a carrier wave transmitted at a different frequency. Carrier modulation shifts the frequency range occupied by a signal up, so it’s centred on a carrier frequency. Instead of occupying the baseband range, from 0 to B Hz, the signal is shifted to occupy a frequency range centred on the carrier frequency. This is done by varying some property of the carrier wave to match the signal being sent. The receiver tunes into the carrier frequency, and measures the variation in the carrier wave to extract the signal. In much the same way as wired links, there are limitations in the bandwidth of the signal that can be sent, depending on the carrier frequency, the type of antenna used, the transmission power, the modulation scheme, etc. Broadcast radio stations use carrier modulation to transmit on different frequencies. The principle is the same for digital transmission, except that it’s data being transmitted, rather than music. There are three types of carrier modulation that can be used. Amplitude modulation varies the loudness of the carrier wave to directly match the signal being transmitted. AM radio works the same way. Amplitude modulation is simple, but can perform poorly, because radio noise takes the form of changes in the loudness of the signal that corrupt the received data. Frequency modulation varies the frequency of the carrier to match the signal. In the example on the slide, it switches between a high frequency, that corresponds to binary zeros, and a lower frequency that corresponds to binary ones. FM radio works in a similar way, varying the frequency to match the speech or music being transmitted.
This is slightly more complex than amplitude modulation, but more resistant to noise and interference. Finally, phase modulation shifts forwards or backwards in the cycle of the waveform to indicate different symbols. Real systems tend to use a combination of modulation techniques, perhaps varying both the amplitude and phase, to increase the data rate. Radio signals modulated onto a carrier wave are prone to interference. This is because signals sent at particular frequencies tend to be blocked by vehicles, trees, and people moving around, or by the weather and other radio transmissions, while signals sent on a carrier at a different frequency may be unaffected. The strength of this interference can change rapidly. To avoid this problem, many wireless links use a technique known as spread spectrum communication, where the carrier frequency is changed several times per second, following a pseudo-random sequence known to both the sender and the receiver. Spread spectrum communication limits the impact of interference, because the transmission will quickly switch away from a poorly performing channel. It adds a lot of complexity, since both sender and receiver need to continually change the carrier frequency, and synchronise what frequencies they use, but greatly improves performance. The concept of spread spectrum communication was invented by Hedy Lamarr, a Hollywood actress turned inventor, during World War 2. It’s now widely used as part of the Wi-Fi standards. The impact of noise on the data rate achievable over a particular channel can be predicted. This is the case whether that noise is due to electrical or radio interference, imperfections in the optical fibre, or other means. In the simplest case, where it’s assumed that the noise affects all frequencies used by the transmission to the same extent, the maximum data rate of the channel can be determined using the Shannon-Hartley theorem. This states that the maximum data rate, Rmax, is equal to the bandwidth of the channel, B, multiplied by the base-2 logarithm of one plus the strength of the signal, S, divided by the amount of noise, N; that is, Rmax = B log2(1 + S/N). The bandwidth and the amount of noise depend on the channel and the environment. For example, for a wireless link, they depend on the carrier frequency, the type of antenna, the weather, the presence of obstacles between the sender and receiver, and whether there are other simultaneous radio transmissions. The signal strength depends on the amount of power applied at the transmitter. This allows the sender to trade off battery life for performance, saving power by transmitting more slowly. We have seen that the physical layer enables communication. It allows the sender and receiver to exchange a sequence of bits of data across a channel, but assigns no meaning to those bits. The data link layer starts to provide structure to the bitstream provided by the physical layer. It provides framing, splitting the bitstream into individual messages, and gives the ability to detect, and possibly correct, errors in the transmission of those messages. It provides addressing, giving each device an identifier that can be used to indicate what device sent the message and what device, or devices, should act on the message. And, finally, the data link layer provides media access control. It arbitrates access to the channel, to make sure that more than one device doesn’t try to send at the same time, and to ensure that each gets its fair share to transmit.
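As a brief aside before looking at the data link layer in more detail, the short program below plugs some example numbers into the Nyquist and Shannon-Hartley formulas described above. The 20 MHz bandwidth and 30 dB signal-to-noise ratio are illustrative assumptions, roughly in the range of a single Wi-Fi channel; they are not figures taken from the lecture.

```c
/* Illustrative only: plug example numbers into the Nyquist and
 * Shannon-Hartley limits discussed above. The bandwidth and SNR
 * values are assumptions, not figures from the lecture.
 * Compile with: cc capacity.c -o capacity -lm
 */
#include <stdio.h>
#include <math.h>

int main(void) {
    double B   = 20e6;   /* channel bandwidth in Hz (assumed, ~one Wi-Fi channel) */
    double V   = 2;      /* number of symbol values: binary signalling            */
    double snr = 1000;   /* linear signal-to-noise ratio, i.e. 30 dB (assumed)    */

    /* Nyquist: Rmax = 2 * B * log2(V) -- upper bound for a noise-free channel */
    double nyquist = 2 * B * log2(V);

    /* Shannon-Hartley: Rmax = B * log2(1 + S/N) -- upper bound with noise */
    double shannon = B * log2(1 + snr);

    printf("Nyquist limit (binary signalling): %.1f Mb/s\n", nyquist / 1e6);
    printf("Shannon-Hartley limit:             %.1f Mb/s\n", shannon / 1e6);
    return 0;
}
```

With these numbers, binary signalling is capped at 40 Mb/s by the Nyquist bound, while the Shannon-Hartley limit is roughly 200 Mb/s; real links use more symbol values and more sophisticated modulation to approach the noise-limited capacity.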
A key role of the data link layer is to separate the bitstream into meaningful frames of data, and to identify the devices that are sending and receiving those frames. For example, if we consider an Ethernet link, the bitstream is split up into frames that contain a number of different elements. First is the start code. This is a preamble, containing a particular pattern that occurs only at the start of a message, and is used to alert the receiver that a new message is starting. This is followed by some header information. The header comprises a source address, specifying the identity of the device sending the frame, and a destination address that identifies the receiver. These are followed by a length field, indicating the amount of data to follow. The data comes next, up to 1500 bytes in length. And, finally, a cyclic redundancy code concludes the frame, allowing the receiver to check if the frame was received correctly. The start code provides for synchronisation and timing recovery. It’s a regular pattern that’s only sent at the start of a frame, and allows the receiver to precisely measure the speed at which the frame is being sent. The source and destination addresses identify the devices sending and receiving the message. Each is 48 bits, six bytes, in size, and is globally unique. The addresses are split into two 24-bit parts, one indicating the vendor, and one indicating the device. In this example, the vendor ID of 00:14:51 indicates Apple, and the device ID of 04:27:ea indicates the laptop on which I’m recording this lecture. Modern operating systems are starting to randomly change the Ethernet addresses each time they connect to the network, to limit tracking and improve privacy. Finally, the data part of an Ethernet frame contains data for the next layer up in the protocol stack, the network layer. The last feature of the data link layer that I want to discuss is media access control. If you have a channel that’s shared between multiple devices, such as a common wireless link or a shared cable, then there’s the risk that two devices can try to send at once. In this slide, for example, devices A and B both try to send a message to device C at the same time. The centre image shows the signals sent by the devices A and B, each of which is sending using NRZ encoding. These are entirely normal signals. The right-hand image shows what’s received at C. This is the superposition of the two signals, the result of adding the two signals together, and is corrupt and meaningless to the receiver. Media access control is the problem of avoiding such collisions. A common way to perform media access control is using a technique known as carrier sense multiple access with collision detection, CSMA/CD. The idea is that when a device wants to send, it first listens to see if another device is sending already. If another transmission is active, then it waits before trying again. If it doesn’t hear anything, it starts to send data. While sending, it listens to see if another device also starts to send. If such a collision occurs, the device stops sending, waits, and tries again. Collisions don’t usually happen, because devices listen before sending, but there’s always some chance that messages might overlap because of the time it takes a message to traverse the network.
As we see from the diagram on the right, if device A starts to send, and simultaneously B listens, hears nothing because the message from A hasn’t reached it yet, and starts sending, then there’s the risk that both messages collide and are corrupted. Collisions are more likely to occur in long distance networks, with large propagation delays, but can happen in any network with a shared channel. If a collision occurs, how long should a device wait before trying to re-send a message? Well, devices shouldn’t always wait for the same amount of time. Doing so would run the risk that two devices get stuck, each repeatedly trying to send, waiting the same time, and then colliding again in a loop. The amount of time to wait should be randomised, to avoid deterministic collisions. If such randomisation is used, but another collision occurs after waiting, this suggests that the network is busy. It’s unlikely that two devices will randomly wait for the same time, so subsequent collisions suggest that there are many devices trying to send. A sender should therefore increase the time it waits after each collision, to reduce the overall load on the network. Many data link layer protocols double the wait time after each repeated collision, resetting when a successful transmission occurs. This approach is widely used: Ethernet uses it in the form of CSMA/CD, and WiFi networks use a closely related collision avoidance variant, CSMA/CA. A consequence of it is that devices share access to the channel, and how quickly they can send a message depends on how busy the network is. This introduces some element of unpredictability into the timing of many messages sent over the network. To conclude, the physical layer provides for encoding a sequence of bits onto a channel, but says nothing about the meaning of those bits. The data link layer starts to add structure, separating the bit stream into frames, checking those frames to ensure they were correctly received, identifying devices, and arbitrating access to the channel. Together, the physical and data link layers enable local area networks, where devices connected to a single link can communicate. This forms the basis of the Internet. In the next part, I’ll talk about the network layer, that allows multiple networks to be combined into one.
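As an aside, the following minimal sketch in C shows the carrier-sense and exponential backoff logic described above. It is illustrative only, not a real Ethernet driver: the retry limit, slot handling, and the helper functions channel_idle(), send_frame_detecting_collision(), and wait_slots() are assumptions introduced for the example.

```c
/* A minimal sketch of carrier sense with binary exponential backoff, as
 * described above. This is illustrative pseudocode-in-C, not the real
 * Ethernet algorithm: the constants and helper functions are assumptions. */
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ATTEMPTS 16

/* Assumed helpers, provided elsewhere in a real driver. */
extern bool channel_idle(void);
extern bool send_frame_detecting_collision(const void *frame, size_t len);
extern void wait_slots(unsigned slots);

bool csma_send(const void *frame, size_t len) {
    for (unsigned attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        while (!channel_idle()) {
            /* Carrier sense: wait for the channel to go quiet. */
        }
        if (send_frame_detecting_collision(frame, len)) {
            return true;   /* sent without a collision */
        }
        /* Collision: wait a random number of slots, doubling the range
         * after each repeated collision to reduce load on the network. */
        unsigned max_wait = 1u << (attempt < 10 ? attempt + 1 : 10);
        wait_slots((unsigned) (rand() % max_wait));
    }
    return false;          /* give up after too many collisions */
}
```

The key point is the doubling of the backoff range after each repeated collision, which spreads retransmissions out in time when the shared channel is busy.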
Part 4: The Network Layer and the Internet Protocols
Abstract
Part 4 of the lecture is about the network layer and the Internet Protocols. It reviews the role of the network layer, and the ideas of addressing, routing, and forwarding. It also talks about the network layer in the Internet, and the IPv4 and IPv6 protocols.
The network layer allows several independently operated networks to be combined to give the appearance of a single network. It provides an internetworking function that allows us to build an internet. In this part, I’ll talk about the Internet Protocols, IPv4 and IPv6, that provide the network layer in the Internet, and briefly review the network layer concepts of addressing, routing, and forwarding. The network layer is the internetworking point in the protocol stack. The use of a common network layer protocol allows us to decouple the operation of the networks that comprise the Internet, from the operation of the applications that run on the Internet. It allows each network to make its own choice about what sort of data link and physical layer technologies to use, because that choice is hidden from the applications and transport protocols by the common network layer. It doesn’t matter whether the underlying network is Ethernet, WiFi, optical fibre, or something else, because the differences are hidden from the upper-layer protocols. Similarly, the use of a common network layer makes it easy to deploy different applications and transport protocols. The lower layers must deliver network layer packets, but are unaware of the type of application data contained in those packets. They cannot tell whether the packets being delivered comprise an email message, a web page, a phone call, streaming video, or whatever. This approach is very flexible, and makes it easy to support new physical and data link layer technologies, and new transport protocols and applications, provided they can deliver packets for, and operate over, the common network layer. The disadvantage is that the network layer is not optimised for any one application. It emphasises generality and flexibility to support many different uses, rather than providing optimal performance for any particular use case. In the Internet, the network layer is known as the Internet Protocol, IP. This is the IP part of the well known TCP/IP protocol suite. The Internet Protocol provides a common way to identify devices on the network, using what’s known as an IP address. It provides routing algorithms to direct packets across the network from source to destination. And it forwards those packets in a best effort manner, accepting that the network may be unreliable. At its core, the Internet Protocol provides uniform connectivity. Any host can send data to any other host, subject to firewall policy, but the network makes no guarantees about quality. There are two versions of the Internet Protocol in use. The most commonly used version is IP version 4. This was introduced into what became the Internet in 1983. The figure shows the format of an IPv4 packet. This is sent in the payload data section of a data link layer packet, such as an Ethernet or WiFi frame, with the different parts of the IPv4 packet being sent in the order shown, left-to-right, top-to-bottom, starting with the version number, header length, DSCP field, and so on, and concluding with the transport layer data. Key parts of an IPv4 packet are the source and destination addresses, each 32 bits in size, that denote the network interfaces from which the packet was sent, and to which it should be delivered. The use of 32 bits for the address fields allows 2-to-the-power-32 possible addresses, around 4 billion, which is not enough for the current Internet. This is the motivation to switch to IPv6.
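To make the packet format described above more concrete, the following is a simplified sketch of the fixed 20-byte IPv4 header as a C structure. It is for illustration only: real code should use the system definitions in <netinet/ip.h>, bitfield layout is compiler dependent, and all multi-byte fields are carried in network byte order on the wire.

```c
#include <stdint.h>

/* Simplified sketch of the fixed 20-byte IPv4 header, for illustration
 * only. Real code should use the definitions in <netinet/ip.h>; options
 * may follow this fixed part, and multi-byte fields are big-endian on
 * the wire. */
struct ipv4_header {
    uint8_t  version_ihl;     /* version (4 bits) + header length (4 bits)     */
    uint8_t  dscp_ecn;        /* DSCP (6 bits) + ECN (2 bits)                  */
    uint16_t total_length;    /* length of header plus payload, in bytes       */
    uint16_t identification;  /* fragment identifier                           */
    uint16_t flags_fragment;  /* DF/MF flags (3 bits) + 13-bit fragment offset */
    uint8_t  ttl;             /* time-to-live: remaining hop count             */
    uint8_t  protocol;        /* upper layer protocol, e.g. 6 = TCP, 17 = UDP  */
    uint16_t header_checksum; /* checksum over the header only                 */
    uint32_t source_addr;     /* 32-bit source IPv4 address                    */
    uint32_t dest_addr;       /* 32-bit destination IPv4 address               */
};
```

The fields in this sketch are the ones discussed next: the fragmentation fields, the DSCP field, the time-to-live, the header checksum, and the upper layer protocol identifier.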
In addition to addressing, IPv4 provides the fragment identifier, fragment offset, “don’t fragment” (DF), and “more fragments” (MF) fields to allow large IPv4 packets to be split into pieces for delivery over networks that can only deliver small packets. It also includes a Differentiated Services Code Point (DSCP) field to allow packets to request special treatment by the network. For example, a packet that carries video conferencing or gaming data might ask for low latency delivery, while one carrying data that’s part of a background software update might indicate that it’s low priority. The time-to-live (TTL) field prevents packets from circulating forever in the network in case a routing problem causes them to go around in a loop, and the header checksum detects transmission errors. Finally, the upper layer protocol identifier identifies the format of the transport layer data that follows the IPv4 header. This usually indicates that the transport layer data is a TCP segment or a UDP datagram. IPv6 was designed to solve the problem that the IPv4 address space is too small. It replaces the 32 bit addresses used in IPv4 with 128 bit addresses. This vastly increases the number of devices that can be added to the network, since each additional bit doubles the number of addresses that are available. In addition, IPv6 simplifies the header. It removes the support for in-network fragmentation, that was present in IPv4, since it was difficult to implement efficiently, and instead requires the hosts to adjust the size of the packets they send to match the network path. It also removes the header checksum, since it’s usually redundant with the checksum provided by the data link layer. As of late 2020, Google reports that about a third of their users access Google over IPv6. Statistics from Akamai, a large content distribution network, report around 60% of connections to their network from India are over IPv6, around 50% from the US, Germany, Belgium, Greece, Taiwan, and Vietnam, and around 35% from the UK. IPv6 took a long time to start seeing deployment, but its use has greatly accelerated over the last few years. If we have IPv4 and IPv6, you might ask what about IPv5? Well, experiments with packet voice over the ARPAnet, the precursor to the Internet, started in the early 1970s, with the Network Voice Protocol developed by Danny Cohen at the University of Southern California’s Information Sciences Institute. This work eventually led to the Internet Stream Protocol, ST-II, that was an experimental multimedia streaming protocol developed mostly in the 1980s and early 1990s. ST-II ran in parallel to IPv4, and used IP version 5 in its header. ST-II was not widely deployed, but it helped prototype a number of important ideas around multimedia transport over packet networks. Both what worked well, and what didn’t. Steve Casner and Eve Schooler, who both worked with Danny Cohen at ISI, helped lead the development of the next wave of multimedia transport protocols, RTP and SIP, based on experiences with ST-II and earlier protocols. RTP and SIP are extremely widely used in modern video conferencing services, such as Zoom, Webex, and Microsoft Teams, and as the basis for today’s mobile phone networks. IPv4 and IPv6 have differently sized addresses, but otherwise work similarly. In both protocols, an IP address represents the location of a particular network interface.
If a device has more than one network interface, for example if it’s a smart phone with both 5G and WiFi, it will have one IP address for each interface. It may also have both IPv4 and IPv6 addresses assigned to some, or all, of its network interfaces. And, in some cases, it can have more than one address of each type assigned to an interface. Importantly, the IP address identifies the location at which a network interface is attached to the network. It does not identify the device, and if a device moves to a different place it will acquire a different IP address. IPv4 and IPv6 addresses both comprise a network part and a host part. The network part, often known as the network prefix, identifies the network to which the device is attached, while the host part of the address identifies a particular attachment point on that network. The fraction of the address that identifies the network part, and the fraction left for the host, varies, with different networks being assigned differing amounts of address space. As an example, the School of Computing Science operates an IPv4 network in Lilybank Gardens and the Boyd Orr Building where the first 20 bits of the IPv4 address comprise the network part, and the last 12 bits are the host part. IPv4 addresses in the range 130.209.240.0 to 130.209.255.255, that all share the same initial 20 bit prefix, are all within that network. The School also has an IPv6 network, that works similarly, although the addresses are longer. In the wide area, Internet traffic is routed towards its destination looking only at the network part of the IP address. Only when it reaches the destination network do the routers in the network inspect the host part of the address, to find the device that should receive the packet. Finally, it’s important to remember that the network layer does not use names such as example.com. Traffic delivered over the Internet contains source and destination IP addresses, and is routed based on those IP addresses. The host that sends an IP packet resolves the name of the destination to an IP address, and puts that address into the destination address field of the IP packet. The network delivers that packet using only that IP address. The DNS, that resolves names, is just another application that runs on the Internet, and is not fundamental to its operation. The Internet is a network of networks. Each network is administered and operated separately. It acts as what is known as an Autonomous System, an AS. Each AS can choose to use different technologies internally, and can have different rules and policies. The commonality is that all run the Internet Protocol. The University of Glasgow acts as an Autonomous System in the Internet, as do companies such as Google or Facebook, and Internet Service Providers such as BT, Virgin Media, O2, etc. Within a network, the network operator will seek to deliver data to its destination as efficiently as possible. The Internet places no requirements on how they do this, or on what data link layer or physical layer technologies they use. Each network operator is free to use whatever technologies, and whatever routing or forwarding algorithms, that it chooses. Typically, the network operator will seek to ensure traffic follows the shortest path from source to destination across their network, using either a distance vector routing algorithm, or, more likely, a link state routing algorithm such as OSPF.
There are a wide variety of different approaches used, though, depending on the size of the network, and the needs of the network operator and its customers. Most Internet traffic is not confined to a single AS. Rather, it’s common for traffic to pass through several Autonomous Systems on its path from source to destination. For example, packets sent from the University of Glasgow to Google will start in the University’s network, then traverse a network known as JANET (the Joint Academic NETwork; the University’s ISP), and then finally reach Google. Each of these three networks is an Autonomous System. Each must cooperate to forward the packets, and find an appropriate route for the data across the network. Indeed, all of the ASes that comprise the Internet cooperate to ensure the network can successfully route data to its destination, wherever that destination is. This cooperation is enabled by the Border Gateway Protocol, BGP. BGP is a routing protocol. It allows each AS to advertise the network prefixes that it owns, to tell the rest of the network where to send packets destined for IP addresses contained within those prefixes. In addition, BGP allows ASes to advertise the routes they can use to reach the network prefixes owned by other ASes. This allows, for example, an ISP to advertise how to reach the IP addresses used by its customers. Similarly, BGP allows ASes to filter out advertisements for network prefixes to which they will not forward traffic. The information exchanged in BGP allows ASes to decide how to route packets across the network. Networks located near the edge of the Internet need to maintain relatively little information to participate in BGP. They simply need to know their own network prefixes, and those of their customers. They then advertise those prefixes to the rest of the network. Those edge ASes know how to route traffic to themselves and their customers, and just pass everything else to a default route that directs it out to the wider network. For networks nearer the core of the Internet, this approach is no longer sufficient. These ASes form the so-called default free zone, the DFZ, where they use BGP to put together something close to a complete map of the Internet, so they know how to forward packets to reach any possible destination. When it comes to BGP routing, and finding the correct route for data to cross the wide area Internet, policy, politics, and economics are often more important than finding the shortest route. A final feature of the network layer is forwarding. Given a route through the network, calculated by BGP for the inter-domain path, and by an intra-domain routing protocol such as OSPF within each network, how are the packets actually forwarded along that path? Forwarding in the Internet follows a best effort approach. A router in the network receives a packet, and makes its best effort to forward it towards its destination. The network is connectionless. A sender doesn’t need to establish a connection or ask permission before sending a packet, and makes no attempt to reserve capacity. As a result, the network makes no guarantees. If there are insufficient resources available to forward a packet, then it may be delayed or will simply be discarded. The figure shows an example of the time packets take to traverse the network, with the x-axis showing time since the start of transmission, and the y-axis showing the time taken to receive a response from the destination.
As can be seen, the time taken to get a response may vary significantly, depending on the amount of other traffic in the network. Packets may be delayed, reordered, lost, duplicated, or corrupted in transit. In a well engineered network, there is little timing variation, packets are rarely lost or corrupted, and almost never arrive out of order. If the packets traverse a poor quality network, however, behaviour may be less predictable. The Internet can provide extremely high quality, but there’s no requirement that it does so. This gives flexibility. The Internet can encompass all sorts of different networks, not just those in rich countries with well-developed infrastructure. But, it requires applications to be able to cope with unpredictable quality. To summarise, the network layer provides the interworking function that allows networks to cooperate, and come together to build a global Internet. It provides a common addressing scheme to identify endpoints of communication, a routing scheme to allow systems to determine the appropriate path for data to take across the internetwork, and a forwarding scheme to move packets towards their destination. In the Internet, the network layer is the Internet Protocols, IPv4 and IPv6. Data is routed between Autonomous Systems using BGP, and within autonomous systems using distance vector or link-state routing protocols, such as OSPF. Finally, packet forwarding happens on a best effort basis, giving the flexibility to operate over any type of link layer at the cost of potentially unpredictable behaviour. The network layer provides connectivity. Above it sits the transport layer, that provides the abstractions that applications use to deliver data.
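Before moving on to the transport layer, the sketch below shows how the network part of an IPv4 address can be extracted and compared against a prefix, using the 130.209.240.0/20 example mentioned earlier. The helper function is illustrative: real routers perform a longest-prefix match over a full forwarding table rather than a single comparison like this.

```c
/* Illustrative sketch: does an IPv4 address fall within a given prefix?
 * Uses the 130.209.240.0/20 example from above. Error checking is
 * omitted for brevity. */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

static int in_prefix(const char *addr, const char *prefix, int prefix_len) {
    struct in_addr a, p;
    inet_pton(AF_INET, addr,   &a);
    inet_pton(AF_INET, prefix, &p);

    /* Keep only the network part: the top prefix_len bits of each address. */
    uint32_t mask = prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
    return (ntohl(a.s_addr) & mask) == (ntohl(p.s_addr) & mask);
}

int main(void) {
    printf("%d\n", in_prefix("130.209.241.17", "130.209.240.0", 20)); /* 1: inside  */
    printf("%d\n", in_prefix("130.209.224.1",  "130.209.240.0", 20)); /* 0: outside */
    return 0;
}
```

Applying the /20 mask keeps only the top 20 bits, so any address from 130.209.240.0 to 130.209.255.255 compares equal to the prefix.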
Part 5: The Transport Layer
Abstract
Part 5 of the lecture reviews the transport layer in the Internet. It talks about UDP and TCP, the services they provide to Internet applications, and their strengths and weaknesses. It reviews TCP connection establishment and congestion control very briefly.
The transport layer isolates the applications from the vagaries of the network. It ensures data is delivered with appropriate reliability, and adapts the speed of transmission to match the network capacity. There are two widely used transport layer protocols in the Internet: UDP and TCP. UDP provides an unreliable packet delivery service, essentially exposing the raw IP network model to the applications. It’s useful for applications that prefer timeliness over reliability, and as a building block for developing new transport protocols. TCP provides a reliable service, retransmitting any lost packets, putting reordered data back into the correct order, and adapting its transmission rate to match the available capacity. It’s useful for applications that need data to be delivered reliably and as fast as possible. In this part, I’ll talk about transport layer concepts, and the UDP and TCP transport protocols used in the Internet. I’ll also briefly introduce the idea of congestion control, adapting the sending rate of the transport to match the available network capacity. As discussed in the previous part, an IP network provides only a best effort packet delivery service. IP packets can be lost, duplicated, delayed, or re-ordered in transit. The role of the transport protocol is to isolate applications, as much as is necessary, from the network. The transport protocol demultiplexes traffic destined for different applications. It enhances the network quality of service to offer appropriate reliability for those applications. And it performs congestion control, to adapt the transmission rate to the available network capacity. There are two transport protocols that have been successfully deployed in the Internet. UDP, the user datagram protocol, provides an unreliable service. TCP, the transmission control protocol, provides a reliable, ordered, and congestion controlled service. Applications that run on the Internet use one of these two transport protocols. UDP is the simplest possible transport protocol that can run on the Internet. It exposes the raw IP service to applications, adding only the concept of a port number, to identify different applications running on a single host. Like IPv4 and IPv6, UDP is connectionless. An application doesn’t need to establish a UDP connection. Rather, it can simply send a packet towards a destination without asking permission or establishing a connection. UDP datagrams are delivered in a best effort manner. Datagrams might not arrive at all, and if they do arrive, they might not arrive in the order they were sent. The UDP transport protocol doesn’t attempt to correct for this, or to ensure reliable delivery. It’s the responsibility of the application using UDP to reconstruct the ordering, or to detect lost packets. Similarly, UDP doesn’t perform any form of congestion control. If an application using UDP tries to send faster than the network can deliver packets, then those packets will simply be discarded. UDP doesn’t help the application adapt its sending rate to match the network capacity. Accordingly, applications using UDP must be able to tolerate some loss of data, to receive data out of order, and to be able to estimate and adapt to the available network capacity. Doing this well is extremely difficult, and UDP is not suitable for many applications. Where UDP is useful is when the application prefers timeliness over reliability. 
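To make the UDP model concrete, the short sketch below sends a single datagram using the sockets API: there is no connection setup, no acknowledgement, and no retransmission, so the datagram may or may not arrive. The destination address 192.0.2.1 and port 5000 are placeholders, and error handling is kept minimal.

```c
/* Minimal UDP sender sketch: no connection setup, no reliability. One
 * datagram is handed to the network and may or may not arrive. The
 * destination address and port are placeholders. */
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);          /* UDP socket */
    if (fd < 0) {
        return 1;
    }

    struct sockaddr_in dest;
    memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    dest.sin_port   = htons(5000);                    /* placeholder port    */
    inet_pton(AF_INET, "192.0.2.1", &dest.sin_addr);  /* placeholder address */

    const char *msg = "hello";
    sendto(fd, msg, strlen(msg), 0,
           (struct sockaddr *) &dest, sizeof(dest));  /* best effort: may be lost */

    close(fd);
    return 0;
}
```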
Because it doesn’t attempt to retransmit lost data, and doesn’t buffer data to allow it to be put into the correct order, UDP offers the lowest possible latency for data sent across the network. That’s useful for applications like voice-over-IP, video conferencing, and gaming, that can tolerate some data loss but need low latency. For most applications, though, the unreliable nature of UDP makes it poorly suited to their needs. By way of contrast, the TCP protocol provides a fully reliable, ordered, byte stream delivery service that runs over an IP network. TCP is a connection oriented transport protocol. Applications that use TCP as their transport must first set up a connection from sender to receiver, before they can send data. Once a connection is established, the TCP protocol ensures that any lost data is retransmitted, and that any reordered data is put back into the correct order, before delivering it to the receiving application. The TCP protocol also adapts the rate at which data is sent across the network to match the available network capacity, a process known as congestion control. A TCP sender writes a sequence of bytes into a TCP connection, and that exact same sequence of bytes is delivered to the receiver. Reliably. In the order sent. And at the maximum speed the network can support. The overwhelming majority of applications running on the Internet use TCP as their transport protocol. TCP has two limitations. The first is that TCP delivers a sequence of bytes, not a sequence of messages. If a sender writes 2000 bytes of data onto a TCP connection, comprising two messages of 1000 bytes each, then TCP guarantees that those 2000 bytes will be received reliably, and in the order sent. It does not guarantee that they will be received as two messages of 1000 bytes each. Indeed, it’s entirely possible that TCP will deliver the data as a block of 1500 bytes followed by a separate block containing the remaining 500 bytes. Applications that care about message boundaries must structure the data they send over a TCP connection to allow those message boundaries to be reconstructed. Since reading and writing data to a file also doesn’t preserve message boundaries, doing this tends not to be difficult. The other limitation of TCP is that, because it delivers data reliably and in order, it must retransmit any lost packets. This delays any data following those lost packets, since it can’t be delivered to the application until the retransmission arrives. This is an unavoidable trade-off, if data is to be delivered reliably, and in the order sent, across an unreliable network, but does mean that TCP trades latency for reliability. TCP delivers data in the form of segments, each delivered within an IP packet. Each TCP segment includes source and destination port numbers, that identify the applications that send and receive the data. Each TCP segment also includes a sequence number, that counts the number of bytes of data sent and allows the receiver to detect loss and reconstruct the original sending order, and an acknowledgement number that indicates the next segment it wishes to receive. TCP segments also include the receiver window size, that indicates the amount of buffer space the receiver has to store incoming TCP segments, a checksum to detect packet corruption, a set of flags to manage connection setup, and an urgent pointer to support advance delivery of important data. The actual TCP payload data follows this header.
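Returning briefly to the point about message boundaries: applications that need them typically add their own framing on top of the TCP byte stream. A common approach, sketched below, is to prefix each message with its length. The 4-byte network-order length field and the send_message() helper are illustrative choices, not something TCP itself provides.

```c
/* Sketch of length-prefix framing over a TCP connection: each message is
 * preceded by a 4-byte length in network byte order. Illustrative only;
 * production code must handle partial writes, signals, and errors more
 * carefully than this. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>

int send_message(int fd, const void *msg, uint32_t len) {
    uint32_t len_net = htonl(len);          /* length prefix, big-endian */
    uint8_t  header[4];
    memcpy(header, &len_net, sizeof(header));

    /* write() may accept fewer bytes than asked for, so loop until done. */
    const uint8_t *parts[2] = { header, msg };
    size_t         sizes[2] = { sizeof(header), len };
    for (int i = 0; i < 2; i++) {
        size_t sent = 0;
        while (sent < sizes[i]) {
            ssize_t n = write(fd, parts[i] + sent, sizes[i] - sent);
            if (n <= 0) return -1;          /* error or connection closed */
            sent += (size_t) n;
        }
    }
    return 0;
}
```

The receiver then reads exactly four bytes to recover the length, and then exactly that many bytes of payload, rebuilding each message regardless of how TCP split the byte stream into segments.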
As can be seen, TCP packets carry a lot of information in addition to that carried in an IP packet header. TCP is a complex and sophisticated transport protocol, that adds a lot of features to IP in order to ensure that data is delivered effectively. A TCP connection proceeds in three stages. At the start of the connection is an initial three-way handshake that establishes the connection. An initial TCP packet is sent from the TCP client to the server, with the SYN (“synchronise”) bit set in the TCP header, to indicate the start of a connection. The server sends a response, with its SYN bit set, to indicate that it’s willing to establish a connection. This response also has the ACK bit set, indicating that the acknowledgement number is valid, because it acknowledges receipt of that initial packet. The client completes the three-way SYN - SYN+ACK - ACK handshake by sending an acknowledgement to the server. This establishes the connection. The client and server can then exchange data. In this example, the client sends data to the server, and the server acknowledges receipt of that data. The initial packet had sequence number 0, and each packet includes 1500 bytes of data. We see the server sending acknowledgement packets, and delivering data to the application in response to recv() calls, as the data packets arrive. We also see that the client can send some number of data packets, known as its congestion window, before it receives an acknowledgement. In this example, the congestion window is 3000 bytes: the client is allowed to send up to 3000 bytes of data without receiving an acknowledgement. The packet with sequence number 4500, sent from client to server, is lost. As a result, when the next packet, with sequence number 6000, arrives, the server generates another acknowledgment indicating that it’s still expecting packet 4500. This continues until three duplicate acknowledgements have been received, at which point the client retransmits the lost packet. Eventually, the missing data arrives at the server. This fills the gap, and the missing data, and the three following packets that were already received, are delivered to the application. The server sees a delay while the missing data is retransmitted, then receives a burst of data. Message boundaries are not preserved. Finally, the connection is closed using a three-way handshake similar to that used to open the connection. We’ll talk more about how TCP establishes connections, and reliably transmits data, in later lectures. The TCP protocol includes a sophisticated algorithm, known as congestion control, that adapts its transmission rate to match the available network capacity. This operates in two phases. The first phase is known as slow-start. At the start of a connection, TCP starts by sending slowly, but increases its sending rate exponentially until it reaches the network capacity. Once it’s sending at a speed that matches the available network capacity, TCP switches to a congestion avoidance phase, adapting its sending rate following a sawtooth pattern that allows it to gradually adapt to changes in capacity. There are a lot of subtle details in TCP congestion control behaviour, that we’ll talk about later in the course. What’s important for now is that TCP can effectively adjust the rate at which data is sent to match changes in the available network capacity. The algorithm it uses to do this looks simple at first glance, but actually contains a lot of complex features and non-obvious behaviours, and is highly effective.
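The two phases can be illustrated with a deliberately simplified simulation, loosely following the Reno-style behaviour covered later in the course. The segment size, threshold, and loss pattern below are invented for illustration only; real TCP stacks are far more subtle than this.

    /* Highly simplified illustration of the two congestion control phases:
     * exponential growth during slow start, then linear growth, with the
     * window halved on loss, during congestion avoidance. All numbers are
     * invented for the purposes of the illustration.                       */
    #include <stdio.h>

    int main(void) {
        double mss      = 1500.0;        /* segment size, in bytes         */
        double cwnd     = 1 * mss;       /* congestion window              */
        double ssthresh = 32 * mss;      /* slow start threshold           */

        for (int rtt = 1; rtt <= 30; rtt++) {
            int loss = (rtt % 10 == 0);  /* pretend a loss every 10th RTT  */
            if (loss) {
                ssthresh = cwnd / 2;     /* on loss: halve the threshold   */
                cwnd     = ssthresh;     /* ...and the window (Reno-style) */
            } else if (cwnd < ssthresh) {
                cwnd *= 2;               /* slow start: double each RTT    */
            } else {
                cwnd += mss;             /* congestion avoidance: +1 MSS   */
            }
            printf("RTT %2d: cwnd = %6.0f bytes\n", rtt, cwnd);
        }
        return 0;
    }

Plotting the printed window sizes gives the characteristic shape: a rapid exponential ramp, followed by the sawtooth of congestion avoidance.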
To conclude, the transport layer protocols adapt the service provided by the network layer to meet the needs of applications. The Internet provides two transport protocols, TCP and UDP. TCP is a complex and sophisticated protocol, that is highly optimised to deliver reliable data quickly. It’s well suited to the overwhelming majority of Internet applications. UDP is much less sophisticated, and essentially exposes the best effort packet delivery service offered by the underlying IP layer to the application. It’s useful when developing applications that prefer timeliness to reliability, or as a basis for building new transport protocols, but is very difficult to use effectively. The transport layer is the last general purpose layer in the protocol stack. Above it sit the application protocols, that we’ll discuss briefly in the next part.
Part 6: Higher Layer Protocols
Abstract
Part 6 of the lecture reviews the higher layers of the protocol stack. It talks about the role of the session layer in managing connections; it reviews the way the presentation layer supports different data formats and format negotiation; and it reviews the role of the application layer in supporting the application logic. Finally, it discusses the importance of standardisation in ensuring interoperability.
The higher layers of the OSI reference model, the session, presentation, and application layers, provide services to help applications manage transport connections, data formats, and the application logic. In this part, I’ll talk briefly about some issues to consider around higher layer protocols, and of the importance of protocol standards for interoperability. The OSI reference model defines three layers above the transport layer. These are the session, presentation, and application layers. The goal of protocols at these layers is to support the needs of applications. They manage transport connections, name and locate resources used by the application, describe and negotiate data formats, and present the data in an appropriate manner. Essentially, they translate the application’s needs into protocol mechanisms. The protocols used at these layers tend to be quite tightly bound to particular classes of application, and are less general purpose than protocols at the transport layer and below. As a result, the boundaries between these layers tend to be less clear, and many systems implement these higher layer protocols without a clear distinction between the layers. The session layer is about managing transport layer connections. Some applications are straightforward, with a single client connecting to a single server, or a small set of servers. HTTP is an example of such a protocol. This class of protocols tends to need relatively little session layer support, often limited to being able to reuse a transport layer connection to send and receive multiple messages. Other systems are more complex, involving multiple clients and servers collaborating to meet the application needs. Examples include video conferencing and chat applications, and multiplayer games that use a cluster of servers to support end-user applications. The session layer in these applications tends to be concerned with finding the participants in a call, or players in the game, and forwarding messages between those participants and between the servers supporting each user. Peer-to-peer applications tend to have still more sophisticated session layer features. These applications have the problem of how to set up peer-to-peer connections in the presence of firewalls and network address translators, and often need to adapt to rapid changes in group membership. BitTorrent is a well-known example of this class of application. Finally, there are applications that use multicast or broadcast transmission, where the network can send messages to a group of receivers. Like peer-to-peer applications, these tend to require the session layer to manage group membership changes. Each session layer protocol is different, and they have relatively little in common with each other, since they tend to be closely tied to a particular class of application. The presentation layer sits above the session layer, and manages the presentation, representation, and conversion of data. Presentation layer features include the media type descriptions that web servers and video conferencing tools use to describe data formats. They specify that a page is in HTML, whether an image is a JPEG or a PNG, or that the video is compressed with H.264 rather than VP8. And they allow applications to describe the data formats they support, and to negotiate an agreed format with the other participants in the session. Other presentation layer features include channel encodings, where data is adapted to fit the limitations of the communications channel. 
An example is email, where the original design only supported textual data in ASCII formats. When support for email attachments was added, those attachments had to look like text, so they could pass through mail servers that hadn’t yet been upgraded. A channel encoding scheme, known as Base64 encoding, was developed to do this, allowing arbitrary data to be converted to a text-based format for transmission. Finally, the presentation layer is often where support for internationalization is implemented. There are two sorts of concerns here. One is around labeling the character set used to represent textual data, whether it is Unicode text in UTF-8 format, or one of the national character sets such as ASCII, Latin-1, or the Big5 system used in Taiwan and Hong Kong. The other is around labeling the language, and possibly the regional variant or dialect, that is used. The problems addressed by the presentation layer are wide-ranging and tend to relate to the broader system and to the environment in which the application operates, not just to the network communication. The final layer in the 7-layer OSI reference model is the application layer. It is here that protocol functions that are specific to the application logic are implemented. The application layer protocols deliver email messages, retrieve web pages, stream video, support multiplayer games, and so on. By definition, such messages are entirely application-specific, and there is little that is general to say here. The OSI model is a reasonable way of thinking about network protocols. Having a common model helps frame an organized discussion of network protocols. Real networks are more complex, and the layer boundaries are less clear-cut than the model suggests, especially around the higher layers, but the OSI model is close enough to reality to be useful. But it misses two key layers, the financial and the political. Network protocols are only successful to the extent that they enable different devices to communicate. This requires interoperability between different implementations of a protocol, implemented by different groups of people. Getting the incentives right, so that different vendors can work together to ensure their products interoperate, is a problem that rapidly goes beyond the technical and into the realms of organizational politics, market forces, regulation, and economics. It’s an area where standard-setting organizations, such as the Internet Engineering Task Force, the International Telecommunications Union, the World Wide Web Consortium, the Third Generation Partnership Project, Moving Picture Experts Group, and others, play an important role. Network protocol standards are a very human process. They’re the result of much discussion and negotiation, much argument and compromise. The IETF describes the outcome as rough consensus and running code. The Internet works because thousands of engineers spend the effort to make sure it works, to make sure their products work together, and that the protocols are described well enough to support interoperability. This concludes our brief review of the Internet protocols. In the next part, we’ll start to think about how the network is changing and evolving, to set the scene for the remainder of the course.
Part 7: The Changing Internet
Abstract
Part 7, the final part of this lecture, moves on from the review to talk about some of the changes occurring in the Internet, to begin to set the scene for later lectures. It talks about the assumptions made in the design of the network, and discusses some changes that mean those assumptions don’t necessarily hold in the modern Internet. In particular, it talks about IPv4 address exhaustion, challenges in establishing connectivity, increasing device mobility, hypergiants and centralisation, the need to support real-time applications, and challenges in securing the network. Finally, it briefly discusses some of the difficulties faced in upgrading a globally deployed running network.
We’re in the middle of a period of rapid change in the Internet infrastructure. In this part, I want to highlight some ways in which the network is changing to set the scene for the remainder of the course. In particular, I want to talk about the exhaustion of the IPv4 address space and the accelerating transition to IPv6, the increasing proportion of wireless mobile Internet endpoints, such as smartphones, and the implications of this shift to a network of mobile devices, the increasing centralization of the network around a small number of hyper-giant content providers, the increasing use of real-time applications and the corresponding need for low-latency transport, and, finally, new approaches to protocol design to support innovation in the face of network ossification. The designers of the Internet made certain assumptions. They assumed that devices would generally be located at a fixed location in the network and would have a small number of network interfaces that could be given persistent and globally unique addresses. They assumed the network and the services that run on it would be operated by many different organizations, working as peers, and that there would be no central points of control. They assumed that a best-effort packet delivery service would provide sufficient quality and that applications would be adaptive and able to cope with changes in the network capacity and available bandwidth. They assumed that the network was trusted and secure. And they assumed that innovation would happen at the edges and that the network itself would provide only a simple packet delivery service. These assumptions generally made sense for a research network in the mid-1980s, when the original design of the Internet protocols was being finalized. Do they still make sense for a network today? The first assumption was that devices were located at fixed places in the network and that each device had a small number of network interfaces that could be given persistent and globally unique IP addresses. This gives the desirable property that every device is addressable by every other device. It assumes a network of peers, with no conceptual difference between clients and servers, where any device connected to the network can take either role, depending on the software it runs, and where, in principle, any device can connect to any other device. Of course, addressable doesn’t necessarily imply reachable, and the Internet has long supported firewalls to provide access control by blocking access to certain devices. But the ability for any device to be a server empowers end users. If any device can act as a server, it ought to be possible to run a website or other service anywhere in the network, including on a home machine. There should be no requirement to pay for a dedicated server, located in a managed data center, to host a website or other service. Unfortunately, this assumption failed. It failed because IPv4 had insufficient addresses to support it, and because devices became mobile. The lack of addresses in IPv4 results in many hosts sharing IP addresses with others, using a technique known as Network Address Translation, NAT. In theory, IPv6 will solve this problem by providing enough addresses for each host, but IPv6 has been slow to deploy. The result of this lack of addresses is that connectivity becomes difficult.
It’s increasingly necessary to try both IPv4 and IPv6 addresses when connecting to a device, perhaps racing connection attempts in parallel to get good performance, a technique known as Happy Eyeballs, because it improves end-user web browsing performance. It’s also necessary to think about NAT traversal. A client behind a NAT can easily connect to a server on the public Internet, but, as we’ll discuss in Lecture 2, it’s difficult to connect to a device located behind a NAT and to build peer-to-peer applications. This combination greatly increases the complexity of establishing a connection. This makes networked applications slower and less reliable, and perhaps more importantly, it forces server applications to run in data centers and discourages peer-to-peer applications. This forces reliance on cloud services and encourages centralization of Internet services onto large cloud providers, such as Amazon Web Services and Google. How severe is the shortage of IPv4 addresses? Well, IP address assignment follows a hierarchical model. The Internet Assigned Numbers Authority, IANA, assigns blocks of IP addresses to regional Internet registries. Those regional registries then assign addresses to ISPs and other organizations within their region, and those in turn allocate addresses to end users. The IANA assigned the last available blocks of IP addresses to the regional registries in 2011, around 10 years ago. Since then, the regional registries have gradually been running down their pool of available addresses. As of late 2020, the regional registries for Europe, North America, Latin America, and the Asia-Pacific region have entirely run out of available IPv4 addresses. Africa is projected to run out of IPv4 addresses during 2021. There’s now a thriving market in the transfer of IPv4 addresses, with networks that were previously assigned IPv4 addresses and have more than they need, selling the addresses on to others. As of late 2020, a large block of IPv4 addresses can be sold for around $1.6 million. Despite the shortage of IPv4 addresses, adoption of IPv6 has been relatively slow, though it’s now reaching critical mass. For example, as shown on the slide, Google currently reports that around a third of its users are on IPv6. Availability of IPv6 is highly variable, since network operators tend to switch their customers all at once. Generally, either all the users in a particular network have IPv6, or none of them. In the UK, for example, most mobile networks assign IPv6 addresses to smartphones on their networks by default, but many residential ISPs and businesses still use IPv4. What type of address you get depends on how you connect to the network. Other countries follow similar patterns, with some networks switching wholesale to IPv6, while others remain on IPv4. This mixed use of IPv4 and IPv6, with many IPv4 hosts being located behind network address translators, greatly complicates connection establishment. In the IPv4 Internet, peer-to-peer applications must perform a complex process to discover NAT bindings, exchange candidate addresses with their peer, and probe to establish what addresses are usable for a connection. This uses a set of protocols, known as STUN and TURN, and the assistance of a central server with a globally unique public IPv4 address, to detect the presence of network address translation, to determine what type of address translation is being performed, and to derive a set of candidate addresses that can be used for peer-to-peer communication.
The peers use the central server to exchange these candidate addresses with each other, and then follow an algorithm known as Interactive Connectivity Establishment, ICE, to set up direct, low-latency peer-to-peer flows. This works, most of the time, but is complicated, slow, power-hungry, and generates a lot of otherwise unnecessary traffic, all because there aren’t enough IPv4 addresses. The fix, of course, is to move to IPv6. Unfortunately, the move to IPv6 will take a long time, and while it’s happening, there will be some devices and some networks that support only IPv4, some that support only IPv6, and some that support both. To reach users on both IPv4 and IPv6, popular services tend to be hosted on servers that have both IPv4 and IPv6 addresses. This is known as dual-stack hosting, and it further encourages centralization onto large hosting providers with the resources to provide both types of address. To get good performance, clients must try to connect using both IPv4 and IPv6 addresses simultaneously, or near-simultaneously. This further complicates connection setup, making it harder to write networked applications. The Internet protocols are designed such that IP addresses encode the location of a network interface within the network. An IP address does not represent a device. It represents a location where a device can attach to the network. If a device is attached to the network via a wireless connection, but moves so that it changes the Wi-Fi or 4G base station to which it connects, then that device will be assigned a new IP address. This has some privacy benefits, but makes it difficult to maintain long-lived connections with that device. For example, TCP connections fail and must be re-established by the application when a device moves, and UDP applications need to coordinate with their peers to change the IP addresses they use for communication. Applications that want to maintain long-lived connections, or that want to accept incoming connections, must deal with the complexity of changing IP addresses, and the need to signal such changes to their peers and re-establish connections as the device moves. Something has to keep track of where each device is in order to route traffic. The assumptions in the design of the Internet mean that complexity is visible to applications, rather than being hidden inside the network. As we’ve seen, several aspects of the Internet’s design push towards centralization. The network topology is gradually flattening, and moving away from a complex mesh of peer connections, towards a hub-and-spoke model centered around a small number of large, centralized services, directly connecting to so-called eyeball networks at the edge of the network, where consumers of those services live. This enables the set of hyper-giant content providers, including Google, Facebook, Amazon, Akamai, Apple, Netflix, and so on, to dominate, and makes it difficult for new competitors to gain a foothold. This has implications for network neutrality, competition, and innovation. We’re also seeing steady growth of real-time traffic, with streaming video being, by far, the dominant type of traffic in the network, while still growing at around 40% year-on-year. Streaming video has reasonably strict timing and quality constraints, and these push network operators to improve the quality of their networks, and push streaming video content providers to peer directly with the residential edge networks.
This is a further incentive towards centralization, and the flattening of the network, since such direct peerings make it easier to ensure high-quality video is delivered to viewers. Increases in streaming video are also driving changes in TCP congestion control, such as Google’s BBR algorithm, and the development of TCP replacements, such as QUIC, both aimed at reducing latency and increasing quality. Lectures 4 and 5 will talk about some of these developments. WebRTC-based video conferencing services, such as Zoom, WebEx, and Microsoft Teams, have even stricter latency requirements. The COVID-19 pandemic has accelerated these effects. This graph shows measurements of Internet traffic as the initial lockdowns started in March 2020. It shows that many residential networks saw a 20% to 25% increase in the total amount of Internet traffic they were carrying, as people shifted to working from home. It also shows the corresponding drop in mobile traffic. This shift in traffic wasn’t evenly distributed, though. The second graph shows the amount of video data being sent over WebEx, one of the popular video conferencing platforms, over a similar time period. Usage grew by about a factor of 20 in less than a week, and has continued growing since. Impressively, the Internet was flexible and robust enough to support this rapid change in how it was used. The question is, can we maintain such flexibility while also improving quality? Lectures 6 and 7 discuss this topic further. The final shift in assumptions has been around security. The Internet protocols were originally designed to support a research network with a relatively small set of users who had reasonably closely aligned goals and provided little in the way of security. Over the years, the protocols have changed to provide increasingly sophisticated security and protection from attacks. The Edward Snowden revelations accelerated that trend by increasing awareness of large-scale government surveillance, but the increase in security started before that, in response to hacking and criminal activity. A significant challenge going forward will be in balancing the needs of law enforcement to access and monitor some traffic in a targeted manner, while preserving privacy and protecting against attackers. We’ll talk about these topics more in Lectures 3, 4, 8, and 9. The Internet is now a globally deployed network. Like any large system, it becomes increasingly ossified, increasingly difficult to change over time. The slow transition from IPv4 to IPv6 is one example of this ossification. Another example would be the difficulty in updating TCP to better support low-latency services and improve performance. The widespread use of NATs, firewalls, and other middle boxes makes such changes surprisingly difficult. We’re now starting to see serious attempts to replace TCP, with protocols such as QUIC that employ pervasive encryption and tunnel over UDP to avoid such interference by the legacy network. But it’s not clear that these will succeed. Finally, we must consider the effects of the push towards centralized services and applications, driven by both technical and business considerations, and whether these are beneficial to consumers and users of the network or not. Is this shift towards a small number of hypergiant service providers an inevitable consequence of the design of the network, or of the business and regulatory environment? And to what extent should we attempt to influence it through technological change in the network? 
The way the network is used is changing, and the technologies that support that use are necessarily shifting too. The network has become more fragmented. There are more serious security threats, more demanding applications, and some significant shifts in the devices and technologies we use to access the network. In the rest of this course, I want to start to discuss how the protocols that form the internet are evolving to meet these needs, and to highlight some of the open issues and challenges still to be addressed. We’ll start in the next lecture by discussing the increasing fragmentation of the network and its implications for connection establishment.
L1 Discussion
Summary
The goal of the Networked Systems (H) course is to discuss how the Internet is changing to support more devices, to improve real-time and low-latency applications, and to increase security. The recorded material for lecture 1 reviewed some prior material, which should be somewhat familiar to you from the Networks and Operating Systems Essentials course in Level 2, and introduced some of the changes we’ll discuss in the remainder of the course. The live discussion session will briefly recap this review, then discuss the following points.
IPv4, IPv6, and NAT: One of the ongoing changes in the network is the transition from IPv4 to IPv6. The lecture presented data from Google showing that about 30% of their traffic is running over IPv6. Does your home network support IPv6? What about your mobile provider? Try https://ipv6-test.com to find out.
Due to the shortage of IPv4 addresses, many networks use NAT to share IP addresses. We’ll talk more about this in Lecture 2, but for now, find out whether your home network uses IPv4 with a NAT. There are instructions for finding your machine’s local IP address online. If it’s using one of the private address ranges (10.0.0.0 - 10.255.255.255, 172.16.0.0 - 172.31.255.255, or 192.168.0.0 - 192.168.255.255) your home network is using a NAT. Google “what is my IP” to find your public IPv4 and IPv6 address, and compare these with the addresses your network uses internally. Does it matter that we’re running out of IPv4 addresses?
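If you prefer to check programmatically, the sketch below lists the machine’s IPv4 addresses using getifaddrs() and flags those in the private ranges listed above. It is a rough illustration that assumes a Unix-like system; ifconfig, ip addr, or your operating system’s network settings give the same information.

    /* Rough sketch: list local IPv4 addresses and flag RFC 1918 private ones. */
    #include <stdio.h>
    #include <stdint.h>
    #include <ifaddrs.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>

    int main(void) {
        struct ifaddrs *ifap, *ifa;
        if (getifaddrs(&ifap) < 0) { perror("getifaddrs"); return 1; }

        for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
            if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
                continue;
            struct sockaddr_in *sa = (struct sockaddr_in *)ifa->ifa_addr;
            uint32_t a = ntohl(sa->sin_addr.s_addr);

            /* 10/8, 172.16/12, and 192.168/16 are the private ranges. */
            int is_private = ((a >> 24) == 10)
                          || ((a >> 20) == ((172 << 4) | 1))
                          || ((a >> 16) == ((192 << 8) | 168));

            char buf[INET_ADDRSTRLEN];
            inet_ntop(AF_INET, &sa->sin_addr, buf, sizeof(buf));
            printf("%-8s %-15s %s\n", ifa->ifa_name, buf,
                   is_private ? "private (behind a NAT?)" : "public");
        }
        freeifaddrs(ifap);
        return 0;
    }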
Real-time Applications: Streaming video is the majority of Internet traffic, and video conferencing providers saw a massive traffic increase due to the pandemic. The lecture presented some data to illustrate this, and we’ll talk about the issues more in the rest of the course. How well have video conferencing apps worked for you? Do you see frequent quality problems?
Hyper-giants, Centralisation, and Security: Internet topology is flattening and becoming increasingly centralised, with direct connections from “eyeball” networks to massive content providers – Google, Facebook, Amazon, Apple, Akamai, etc. What are the implications for network neutrality, competition, innovation, privacy, freedom of speech, pervasive monitoring, and security?
Lecture 2
Abstract
Lecture 2 discusses connection establishment. It begins by reviewing the operation of TCP and showing how TCP connections are established, and what factors influence the performance of connection establishment. It considers the impact of TLS and IPv6 on connection establishment, and discusses the need for connection racing. And it reviews the idea of peer-to-peer connections, and the difficulties network address translation causes for peer-to-peer connection establishment. The lecture concludes with a brief explanation of how NAT binding discovery, and the ICE algorithm for peer-to-peer connection establishment, work.
Part 1: Connection Establishment in a Fragmented Network
Abstract
The 1st part of this lecture reviews the operation of TCP. It outlines the TCP service model, segment format, and programming model. Then, it discusses how TCP connections are established and considers the impact of network latency on TCP connection establishment performance. It concludes with a review of how this connection establishment latency affects protocol design.
In this lecture, I’ll talk about the problem of connection establishment in a fragmented network, and discuss some issues that affect the performance of TCP connection establishment and data transfer. There are five parts to this lecture. In this first part, I’ll review the TCP transport protocol and its programming model, and talk about client-server connection establishment. In the next part, I’ll discuss the implications of Transport Layer Security, TLS, and the use of IPv6 on connection establishment, and show how these changes affect performance. Following that, in part three, I’ll talk about peer-to-peer connections and the impact of network address translation. Then, in the last two parts, I’ll talk about some of the problems caused by NATs, network address translation devices, and outline how NAT traversal works. To begin, I’ll briefly review the Transmission Control Protocol, TCP. I’ll talk about the purpose of TCP, then review the TCP segment format, and the service model TCP offers to applications. Then, I’ll discuss how TCP connections are written using the Berkeley sockets API. TCP is currently the most widely used transport protocol in the Internet. A TCP connection provides a reliable, ordered, byte stream delivery service that runs on the best-effort IP network. Once a TCP connection is established, an application can write a sequence of bytes into the socket, representing the connection, and TCP will deliver those bytes to the receiver, reliably, and in the correct order. If any of the IP packets containing TCP segments are lost, the TCP stack in the operating system will notice this, and automatically retransmit the missing segments. Similarly, if any of the IP packets are delayed and arrive out of order, the TCP stack will put the data back into the correct order before delivering it to the application. Finally, TCP will adapt the speed at which it sends the data to match the available network capacity. If the network is busy, TCP will slow down the rate at which it sends data, to fairly share the capacity between transmissions. Similarly, if the network becomes idle, TCP connections will speed up to use the spare capacity. This process is known as congestion control, and TCP implements sophisticated congestion control algorithms. We’ll talk more about TCP congestion control in Lecture 6. Applications using TCP are unaware of retransmissions, reordering, and congestion control. They just see a socket, into which they write a stream of bytes. Those bytes are then delivered reliably to the receiver. Internally, those bytes are split into TCP segments. Each segment has a header added to it, to identify the data. The segment is placed inside the data part of an IP packet. That IP packet is, in turn, put inside the data part of a link layer frame, and sent across the network. The IP layer just sees a sequence of TCP segments that it must deliver, and is unaware of their contents. Equally, TCP gives data segments to the IP layer to deliver, and is unaware whether the underlying network is Ethernet, WiFi, optical fibre, or something else. The diagram on the slide shows the format of a TCP segment header, inside an IPv4 packet. Data is sent over the link in the order shown, left-to-right, top-to-bottom, in the payload part of a link layer frame. The IP header is sent first, then the TCP segment header, then the TCP payload data. Looking at the TCP segment header, highlighted in green, we see that it comprises a number of fields. 
A TCP segment starts with the source and destination port numbers. The source port number identifies the socket that sent the segment, while the destination port number identifies the socket to which it should be delivered. When establishing a TCP connection, a TCP server binds to a well-known port that identifies the type of service it offers. For example, web servers bind to port 80. This gives a known destination port to which clients can connect. Clients specify the destination port to which they connect, but usually leave their source port unspecified. The operating system then chooses an unused source port for that connection. All TCP segments sent from the client to the server, as part of a single connection, have the same source and destination ports, and the responses come back with those ports swapped. Each new connection gets a new source port. Following the port numbers in the TCP segment header, come the sequence number and the acknowledgement number. At the start of a TCP connection, the sequence number is set to a random initial value. As data is sent, the TCP sequence number inserted by the sender increases, counting the number of bytes of data being sent. The acknowledgement number indicates the next byte of data that’s expected by the receiver. For example, if a TCP segment is received with sequence number 4,000, and that segment contains 100 bytes of data, then the acknowledgement number will be 4,100. This indicates that the next TCP segment expected is that with sequence number 4,100. If the acknowledgement number that comes back is different from that expected, it’s a sign that some of the packets have been lost. TCP will then retransmit the segments that were in the lost packets. The data offset field indicates where the payload data starts in the segment. That is, it indicates the size of any TCP options included in the packet. The reserved bits are not used. There are then six single-bit flags. The URG bit indicates whether the urgent pointer is valid. The ACK bit indicates whether the acknowledgement number is valid. The PSH bit indicates that this is the last segment in a message, and should be pushed up to the application as soon as possible. The SYN, synchronise, bit is set on the first packet sent on a connection. The FIN bit indicates that the connection should be cleanly closed. And the RST bit indicates that the connection is being reset, that is, aborted without cleanup. The receive window size allows the receiver to indicate how much buffer space it has available to receive new data. This allows the receiver to tell the sender to slow down, if it’s sending faster than the receiver can process the data. The checksum is used to detect corrupted packets, that can then be retransmitted. Finally, the Urgent Pointer allows TCP senders to indicate that some data is to be processed urgently by the receiver. Unfortunately, experience has shown that the urgent data mechanism in TCP is not usable in practice, due to a combination of an ambiguous specification and inconsistent implementation. The fixed TCP segment header is followed by TCP option headers, that allow TCP to add new features and extensions, and then the payload data. TCP retransmits any data that was sent in packets that are lost, and makes sure that data is delivered to the application in the order in which it was originally sent. It also adapts the speed at which it sends data to match the available network capacity. As a result, TCP provides a reliable, ordered, byte stream delivery service.
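The fields just described can be pictured as a C structure. The sketch below is an illustrative layout only: in the real wire format the data offset, reserved bits, and flags share bytes as bitfields, and production code would use the platform’s netinet/tcp.h definitions rather than a hand-rolled struct.

    /* Illustrative view of the fixed TCP segment header (20 bytes on the
     * wire). All multi-byte fields are carried in network byte order.    */
    #include <stdint.h>

    struct tcp_header {
        uint16_t src_port;        /* identifies the sending socket            */
        uint16_t dst_port;        /* identifies the receiving socket          */
        uint32_t seq_number;      /* position of this data in the byte stream */
        uint32_t ack_number;      /* next byte expected by the receiver       */
        uint8_t  data_offset;     /* header length in 32-bit words (4 bits),
                                     plus reserved bits                       */
        uint8_t  flags;           /* URG, ACK, PSH, RST, SYN, FIN bits        */
        uint16_t window;          /* receive window: buffer space available   */
        uint16_t checksum;        /* detects corrupted segments               */
        uint16_t urgent_pointer;  /* rarely usable in practice                */
        /* ...followed by optional TCP options, then the payload data...      */
    };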
A limitation of the TCP service model is that message boundaries are not preserved. That is, if an application writes, for example, a block of 2,000 bytes to a TCP connection, then TCP will deliver those 2,000 bytes to the receiver, reliably, and in the order they are sent. However, what TCP does not do, is guarantee that those bytes are delivered to the receiving application as a single block of 2,000 bytes. That might happen. Equally, they could be delivered to the application as two blocks of 1,000 bytes. Or as a block of 1,500 bytes, followed by a block of 500 bytes. Or as 2,000 single bytes. Or as any other combination, provided the data is delivered reliably and in the order sent. This complicates the design of applications that use TCP, since they have to parse the data received from a TCP socket to check if they’ve got the complete message. Despite this inconvenience, TCP is the right choice for most applications. If you need to deliver data reliably, and as fast as possible, then use TCP. TCP is a client-server protocol. Servers listen for, and respond to, requests from clients. The way you write code to use a TCP connection depends on whether you’re writing a client or a server. On the server side, you begin by creating a socket. The first argument to the socket() call is the constant PF_INET, from the sys/socket.h header, if you want the server to listen for connections on IPv4. Alternatively, if you want the server to listen for IPv6 connections, use PF_INET6. The second argument will be the constant SOCK_STREAM, to indicate that a TCP server is wanted. The third argument is unused and must be zero. You then call the bind() function, passing the file descriptor representing the newly created socket as the first argument. The other arguments to bind() specify the port number on which the server should listen for incoming connections. The bind() function assigns the requested port number to the socket. You then call the listen() function. This starts the server listening for incoming connections. Then, you call accept(), to indicate that your server is ready to accept a new connection. The accept() function doesn’t return until a client connects to the server. This could potentially be a long wait. Meanwhile, the client application creates its own socket. This is done using the socket() function, in exactly the same way as the server. The client then calls connect(), passing the file descriptor for its newly created socket as the first argument. The subsequent arguments contain the IP address of the server, and the port number on which the server is listening. The connect() call makes TCP establish a connection from the client to the server. When it returns, either the connection has been successfully established, or the server is unreachable. When the connection request reaches the server, the accept() call completes. The return value is a new file descriptor, representing the newly established connection. The original file descriptor, that was listening for incoming connections, remains unchanged. The client and server can now call send() and recv(), to send and receive data over the connection. They can send and receive as much, or as little, data as they want, and TCP places no restrictions on the order in which client and server send. Remember to use the file descriptor representing the accepted connection, not the file descriptor representing the listening socket, when writing the server code.
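Putting those calls together, a minimal IPv4 echo server might look like the sketch below. It is an illustration only: port 8080 is an arbitrary choice, and error handling and concurrency are kept to a bare minimum.

    /* Minimal sketch of a TCP server: socket(), bind(), listen(), accept(),
     * then send()/recv() on the accepted connection. Port 8080 is arbitrary. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void) {
        int fd = socket(PF_INET, SOCK_STREAM, 0);    /* IPv4, TCP */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);    /* any local interface */
        addr.sin_port        = htons(8080);          /* arbitrary port      */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind"); return 1;
        }
        listen(fd, 16);                              /* start listening     */

        for (;;) {
            int conn = accept(fd, NULL, NULL);       /* blocks for a client */
            if (conn < 0) { perror("accept"); break; }

            char buf[1024];
            ssize_t n = recv(conn, buf, sizeof(buf), 0);
            if (n > 0) {
                send(conn, buf, (size_t)n, 0);       /* echo the data back  */
            }
            close(conn);  /* close the accepted connection, not the listener */
        }
        close(fd);
        return 0;
    }

The matching client side, again sketched only as an illustration with a placeholder address and port, creates a socket, connects, and then sends and receives:

    /* Minimal sketch of a TCP client: socket(), connect(), send(), recv(). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void) {
        int fd = socket(PF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in server;
        memset(&server, 0, sizeof(server));
        server.sin_family = AF_INET;
        server.sin_port   = htons(8080);                    /* placeholder */
        inet_pton(AF_INET, "192.0.2.1", &server.sin_addr);  /* placeholder */

        /* connect() triggers the SYN - SYN-ACK - ACK handshake. */
        if (connect(fd, (struct sockaddr *)&server, sizeof(server)) < 0) {
            perror("connect"); return 1;
        }

        const char *req = "hello\n";
        send(fd, req, strlen(req), 0);

        char buf[1024];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) printf("received %zd bytes\n", n);

        close(fd);    /* cleanly shut down the connection */
        return 0;
    }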
Finally, when they’ve finished, client and server call the close() function, to cleanly shut down the connection. Once the client has closed the connection, it’s done. A server can repeatedly accept new connections from the listening socket. How does connection establishment actually work? And what are the factors that affect how quickly connections can be established? In the following, I’ll talk in detail about how client-server connection establishment works for TCP, and what factors limit performance. TCP-based applications usually work in a client-server manner. The server listens for connections on a well-known port, and the client connects to the server, sends a request, and receives a response. TCP can also, in principle, be used in a peer-to-peer manner. If two devices create TCP sockets, bind to known ports, and simultaneously attempt to connect() to each other, then TCP will be able to create a connection, provided there are no firewalls or NAT devices blocking traffic. This is known as simultaneous open. TCP simultaneous open can work, but isn’t especially useful, since it requires both peers to try to connect at the same time. It’s usually better to work in client-server mode, since the server can wait for clients and the client and server don’t need to synchronise when to connect. How does client-server TCP connection establishment actually work? First, the server creates a TCP socket, binds it to a port, tells it to listen for connections, and calls accept(). At that point, the server blocks, waiting for a connection. The client creates a TCP socket and calls connect(). This triggers the TCP connection setup handshake, and causes the client to send a TCP segment to the server. This segment will have the SYN (“synchronise”) bit set in its TCP segment header, to indicate that it’s the first packet of the connection. It will also include a randomly chosen sequence number. This is the client’s initial sequence number. The initial segment does not include any data. When this initial segment arrives at the server, the server will send a TCP segment in response. This segment will also have the SYN bit set, because it’s the first segment sent by the server on the connection. It will also include a randomly chosen initial sequence number. This is the server’s initial sequence number. The segment will also have the ACK bit set, because it’s acknowledging the initial segment sent from the client. TCP acknowledgements report the next sequence number expected, and by convention a SYN segment will consume one sequence number. Accordingly, the acknowledgement number in the TCP segment header will be the client’s initial sequence number plus one. Since this segment has both the SYN and ACK bits set, it’s known as a SYN-ACK packet. When the SYN-ACK packet arrives at the client, the client acknowledges its receipt back to the server. The TCP segment it generates to do this will have its ACK bit set to one, to indicate that its acknowledgement number is valid, and the acknowledgement number will be the server’s initial sequence number plus one. The SYN bit is not set on this packet, since it’s not the first packet sent from the client to the server. The sequence number in the TCP segment header will equal the client’s initial sequence number plus one, since the SYN packet consumes one sequence number. This packet also doesn’t include any data, since it’s sent before the connect() call completes.
Once this packet has been sent, the client considers the connection established, and the connect() function returns. The client can now send or receive data on the connection. Once this final ACK packet arrives at the server, the three-way handshake is complete. At this point the accept() function completes, returning a file descriptor the server can use to access the new connection. The server now considers the connection to be open. Once the three-way handshake has completed, the client and server can send and receive data over the connection. The typical case is that the client sends data to the server immediately after it connects, and the server responds with the requested data. There’s no requirement that the client sends first, though, or that client and server alternate in sending data. The slide shows an example where the client sends a request comprising a single segment’s worth of data to the server. The server then responds by sending a larger response back, including the acknowledgement for the request on the first segment of the response. Finally, we see the client acknowledge receipt of the segments that comprise the response. This is the typical pattern when a browser fetches a web page from a web server. What’s interesting is that if we look at the time from when the client calls connect(), until the time it receives the last data segment from the server, a significant part of that time is taken up by the connection setup handshake. It takes a certain amount of time, known as the round trip time, to send a minimal sized request to the server and get a minimal sized response. Larger requests and responses add to this, based on the time to send the additional data down the link, known as the serialisation time for the data. But, if the amount of data being requested from the server is small, it’s often the round trip time that dominates. For example, let’s assume a browser is requesting a simple web page from a web server using HTTP running over TCP. That web page comprises a single HTML file, a CSS style sheet, and an image. The HTML and CSS files are on one server, while the image is located on a different server. How long does it take to retrieve that page? Well, the client initially connects to the server where the HTML file is located. This takes one round-trip time for the SYN and SYN-ACK packets to be exchanged, and for the connect() call to complete. As soon as the connect() completes, the client sends the request for the HTML. It takes another round-trip for the request to reach the server and the first part of the response to come back, followed by the serialisation time for the rest of the response. When it’s received the HTML, the client knows that it needs to retrieve the CSS file and the image. It reuses the existing connection to the first server to retrieve the CSS file. This takes an additional round trip, plus the serialisation time of the CSS data. In parallel to this, it opens a TCP connection to the second server, sends a request for the image, and downloads the response. This takes two round trips, plus the time to send the image data. Whichever of these takes the longest, plus the amount of time to make the initial connection and fetch the HTML, determines the total time to download the page. The round trip time depends on the distance from the client to the server. The serialisation time depends on the available capacity of the network.
For example, if the image is 1 megabyte, 8 megabits, in size, and the available bandwidth of the link is 2 megabits per second, then the image will take 4 seconds to download, in addition to the round trip time. It can be seen that the total download time depends on the round trip time, the available bandwidth, and the size of the data being downloaded. What’s a typical round trip time? This depends on the distance between the client and server, and on the amount of network congestion. The table on the right gives some typical values, measured from a laptop on my home network to various destinations. There’s a lot of variation. In the best case, it takes around 33ms to get a response from a server in the UK, around 100ms to get a response from a server on the East coast of the US, around 165ms from a server in California, and around 300ms from Australia. Worst case, when there’s other traffic on the network, is considerably higher. This means that a request sent to a server in New York takes at least 1/10th of a second, irrespective of how much data is requested. What about available bandwidth? Well, ADSL typically gets around 25 megabits per second. VDSL, often known as fibre to the kerb, where the connection runs over your home phone line to a cabinet in your street, then over fibre to the exchange and beyond, typically gets around 50 megabits per second. And fibre to the premises, where the optical fibre runs directly into your home, can transmit several hundred megabits per second. 4G wireless is highly variable, depending on the options enabled by your provider and the reception quality, but somewhere in the 15-30 megabits per second range is typical. What does this mean in practice? Let’s take the example of a single web page we used before, comprising HTML, a CSS style sheet, and a single image, and plug in some typical numbers for the file sizes, as shown on the slide. Let’s also assume that the round trip time is the same for both servers, to make the numbers easier. The table then plots the total time it would take to download that simple web page, given different values for bandwidth and the round trip time to the servers. The slowest case is the bottom left of the table, where it would take 45.1 seconds to download the page, assuming a 1 megabit per second link to a server with a 300ms round trip time. This models a slow connection to a server in Australia. The fastest is the top right of the table, where it takes 0.04 seconds to download the page from a server located 1ms away on a gigabit link. What’s interesting is how the download time varies as the link speed improves. If we look at the top row, with 1ms round trip time, we see that if we increase the bandwidth by a factor of ten, from 100Mbps to 1Gbps, the time taken to download the page goes down by a factor of ten, a 90% reduction. The link is ten times faster, and the page downloads ten times faster. If we look instead at the bottom row, with 300ms round trip time, increasing the link speed from 100Mbps to 1Gbps gives only a 22% reduction in download time. Other links are somewhere in the middle. Internet service providers like to advertise their services based on the link speed. They proudly announce that they can now provide gigabit links, and that these are now more than ten times faster than before! And this is true. But, in terms of actual download time, unless you’re downloading very large files, the round trip time is often the limiting factor.
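The arithmetic behind these examples can be captured in a simple model: total time is roughly the number of round trips multiplied by the round trip time, plus the data size in bits divided by the bandwidth. The sketch below uses that model with example numbers; the request count, file size, and link speed are illustrative, not the values from the slide.

    /* Back-of-the-envelope download time: round trips plus serialisation.
     * The numbers below are examples, not the values used on the slide.  */
    #include <stdio.h>

    /* rtt in seconds, bandwidth in bits per second, size in bytes */
    double download_time(int round_trips, double rtt, double bw, double bytes) {
        return round_trips * rtt + (bytes * 8.0) / bw;
    }

    int main(void) {
        /* e.g. 2 round trips (connect + request), 100ms RTT,
         * a 2 Mb/s link, and a 1 MB image: 0.2s + 4.0s = 4.2s   */
        double t = download_time(2, 0.100, 2e6, 1e6);
        printf("estimated download time: %.2f seconds\n", t);
        return 0;
    }

Increasing the bandwidth only shrinks the second term; the round trip component is untouched, which is why faster links help so little when the round trip time dominates.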
The download time, for typical pages, may only improve by a factor of two if the link gets 10x faster. Is it still worth paying extra for that faster Internet connection? What does this mean for protocol design? The example shows an HTTP/1.1 exchange. Once the connection has been opened, the client sends the data shown in blue, prefixed with the letter “C:”, to the server. The server then responds with the data shown in red, prefixed with the letter “S:”, comprising some header information and the requested page. Everything is completed in a single round trip. Request. Then response. Compare that with this example, showing the Simple Mail Transfer Protocol, SMTP, used to send email. As with the previous slide, data sent from client to server is shown in blue and prefixed with the letter “C:”, and that sent from the server to the client is in red, prefixed with the letter “S:”. We see that the protocol is very chatty. Once the connection is established, after the SYN, SYN-ACK, and ACK, the server sends an initial greeting. Establishing the connection and sending this initial greeting takes two round trips. The client then sends HELO, and waits for the go ahead from the server. This takes one more round trip. The client then sends the from address, and waits for the server. One more round trip. The client then sends the recipients, and waits for the server. One more round trip. The client then says it’d like to send data now, and waits for the server. One more round trip. Then, finally, the client gets to send the data, and once it’s confirmed that the data was received, sends QUIT, waits, then closes the connection. The whole exchange takes eight round trips. Is this necessary or efficient? No! If the protocol were designed differently, all the data could be sent at once, as soon as the connection was opened, and the server could respond with an okay or an error. The eight round trips could be reduced to two: one to establish the connection, one to send the message and get confirmation from the server. This is why email is slow to send. TCP establishes connections using a three-way handshake. SYN, SYN-ACK, ACK. The time to establish a connection depends on round trip time and the bandwidth. Links are now fast enough that the round trip time is generally the dominant factor, even for relatively slow links. The best way to improve application performance is usually to reduce the number of messages that need to be sent from client to server. That is, to reduce the number of round trips. Unless you’re sending a lot of data, increasing the bandwidth generally makes very little difference to performance.
Part 2: Impact of TLS and IPv6 on Connection Establishment
Abstract
The 2nd part of the lecture discusses the impact of TLS and IPv6 on TCP connection establishment. It shows how the use of TLS, to secure connections, increases the connection establishment latency. And it discusses the “happy eyeballs” technique for connection racing, to reduce connection establishment delays, in dual stack IPv4 and IPv6 networks.
In the previous part, I discussed TCP connection establishment, and highlighted that the round-trip time is often the limiting factor for performance. In the following, I want to discuss the performance implications of adding transport layer security to TCP connections, and how to achieve good performance when the destination is a dual-stack IPv4 and IPv6 host. In the previous part, I showed how the network round trip time can be the limiting factor in performance. This is because every TCP connection needs at least two round trip times: one to establish the connection, and one for the client to send a request and receive a response from the server. I also showed how the protocol running over TCP can make a significant difference to performance, with the examples of HTTP, which sends a request and receives a response in a single round trip, and SMTP, which makes multiple unnecessary round trips. One of the important protocols that runs over TCP is the transport layer security protocol, TLS. TLS provides security for a TCP connection. That is, it allows the client and server to agree encryption and authentication keys to make sure that the data sent over that TCP connection is confidential and protected from modification in transit. TLS is essential to Internet security. When you retrieve a secure web page using HTTPS, the browser first opens a TCP connection to the server. Then, it runs TLS to enable security for that connection. Then, it asks to retrieve the web page. Depending on the version of TLS used, this adds additional time to the connection. With the latest version of TLS, TLS v1.3, it takes one additional round trip to agree the encryption and authentication keys. That is, after the TCP connection has been established, via the SYN - SYN-ACK - ACK handshake, the client and server need an additional round trip to enable TLS, before they can request data. The TLS handshake is in three parts. First, the client sends a TLS ClientHello message to the server to propose security parameters. Then, the server responds with a TLS ServerHello, containing its keys and other security parameters. Finally, assuming there’s a match, the client responds with a TLS Finished message to set the encryption parameters. The client then immediately follows this by sending the application data, such as an HTTP GET request, without waiting for a response. This adds one additional round trip time, in most cases. Older versions of TLS take longer. TLS v1.2, for example, takes at least two round trips to negotiate a secure connection. What impact does the additional round trip due to TLS have on performance? Well, let’s look again at the simple web page download examples from the previous part. When the round-trip time is negligible, as on the top row with 1ms round trip time, performance is unchanged. As we go down the table, though, performance gets worse. With 100ms round trip time, both the overall performance and the benefit of increasing the link speed go down. The download time for the page on a gigabit link is increased by 45%, from 0.44 to 0.64 seconds, compared to a connection without TLS. And the benefit of going from a 100 megabit link to a gigabit link is only 36%, rather than 45% without TLS. With 300ms round trip time the behaviour is even worse. Total download time increases by 48% compared to the non-TLS case, and there’s only a 22% reduction in download time when going from a 100 megabit link to a gigabit link. This is not to say that TLS is bad! Far from it – security is essential.
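To see where TLS sits relative to the TCP connection, the sketch below layers a TLS session over a freshly established TCP connection using the OpenSSL library. It is only an illustration: certificate verification and error handling are omitted, and example.com and port 443 are used purely as placeholders.

    /* Sketch: layer TLS over a TCP connection using OpenSSL. Certificate
     * verification and error handling are omitted for brevity.           */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <openssl/ssl.h>

    int main(void) {
        /* Step 1: ordinary TCP connection (one round trip: SYN, SYN-ACK). */
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;
        getaddrinfo("example.com", "443", &hints, &res);   /* placeholder host */
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        connect(fd, res->ai_addr, res->ai_addrlen);
        freeaddrinfo(res);

        /* Step 2: TLS handshake over that connection: ClientHello,
         * ServerHello, Finished. With TLS v1.3, one more round trip.     */
        SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());
        SSL *ssl = SSL_new(ctx);
        SSL_set_fd(ssl, fd);
        SSL_connect(ssl);

        /* Step 3: application data, now encrypted. */
        const char *req = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
        SSL_write(ssl, req, (int)strlen(req));

        char buf[4096];
        int n = SSL_read(ssl, buf, sizeof(buf));
        if (n > 0) printf("received %d bytes of response\n", n);

        SSL_free(ssl);
        SSL_CTX_free(ctx);
        close(fd);
        return 0;
    }

The SSL_connect() call is where the extra round trip goes: nothing useful can be sent until the TCP handshake and then the TLS handshake have both completed.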
Rather, it further highlights that the number of round trips that a connection must perform, between client and server, is often the limiting factor in performance. Applications that have good performance will try to reduce the number of TCP connections that they establish, since each connection takes time to establish. They also try to limit the number of request-response exchanges, each taking a round trip, that they make on each connection. TLS v1.3, standardised in 2018, was a big win here, because it reduces the number of round trips needed to enable security from two, down to one. When used with TCP, this gives the best possible performance: one round trip to establish the TCP connection, and one to negotiate the security, before the data can be sent. We’ll talk more about TLS and how to improve the performance of secure connections in lectures 3 and 4. The other factor affecting TCP connection performance is the ongoing transition to IPv6. This transition means that we currently have two Internets: the IPv4 Internet and the IPv6 Internet. Some hosts can only connect using IPv4. Some hosts can only connect using IPv6. And some hosts have both types of address. Similarly, some network links can only carry IPv4 traffic, some only IPv6, and some links can carry both types of traffic. And some firewalls, or other middleboxes, block IPv4, some block IPv6, and some block both types of traffic. Importantly, the IPv6 network is not a subset of the IPv4 network. It’s a separate Internet, that overlaps in places. Given that some hosts will be reachable over IPv4 but not IPv6, and vice versa, how do you establish connections during the transition? Well, given a hostname, you perform a DNS lookup to find the IP addresses for that host using the getaddrinfo() call. This returns a list of possible IP addresses for the host, including both IPv4 and IPv6 addresses. The simple approach is a loop, trying each address in turn, until one successfully connects. This works, but can be very slow. In the example on the slide, Netflix has 16 possible IP addresses, eight IPv6 and eight IPv4, and lists the IPv6 addresses first in its DNS response. If you have only IPv4 connectivity, it may take a long time to try, and fail, to connect to eight different IPv6 addresses before you get to an IPv4 address that works. To get good performance, applications use a technique known as “Happy Eyeballs”. This involves making two separate DNS lookups, in parallel, one asking for only IPv4 addresses and one for only IPv6 addresses. Starting with whichever of these DNS lookups completes first, the client makes a connection to the first address returned by that lookup. If that hasn’t succeeded within 100ms, it starts another connection request to the next possible address, alternating between IPv4 and IPv6 addresses. The different connection requests proceed in parallel, until one eventually succeeds. That first successful connection is used, whether over IPv4 or IPv6, and the other connection requests are cancelled. The happy eyeballs technique tries to balance the time taken to connect against the network overload of trying many possible connections at once in parallel. It adds complexity to the connection setup, to achieve good performance. The two factors affecting TCP performance are bandwidth and latency. In many cases, the latency, the round trip time, dominates. There are five ways in which applications using TCP can improve their performance.
The first is that a client should use something like happy eyeballs, overlapping connection requests if the server has more than one address. This is more complicated to implement than trying to connect to each different address in turn, but connects a lot faster. The second way to improve TCP performance is to reduce the number of TCP connections made. Each connection takes time to establish. If you can make a single TCP connection and reuse it for multiple requests, that’s faster than making a new connection for each request. Third, if you reduce the number of request-response exchanges made over each connection, you reduce the impact of the round trip latency. All these are possible for any application, by using TCP connections effectively. There are also two more radical changes that can be made. The first is to overlap the TCP and TLS connection setup handshakes, by sending the security parameters along with the initial connection request, so that both the connection setup and security parameters can be negotiated in a single round trip. This isn’t possible with TCP, but the QUIC transport protocol, that we’ll discuss in lecture 4, does allow this. Finally, one can always improve performance by reducing the round trip latency. This latency depends on two things: the speed at which the signal propagates down the link, and the amount of other traffic. Since signals travel down electrical cables and optical fibres at the speed of light, there’s little that can be done to increase the propagation speed, although low earth orbit satellites can help, as we’ll discuss in lecture 6. Reducing the amount of other traffic queued up at intermediate links is a possibility though, and this can be affected by the choice of TCP congestion control algorithm. We’ll talk about this in lectures 5 and 6. To summarise, one of the limiting factors with TCP performance is the round trip latency. The use of TLS is essential to improve security, but comes at the expense of an additional round trip that slows down connection establishment. This is solved by the upcoming QUIC transport protocol, that we’ll discuss in lecture 4. Similarly, the ongoing migration to IPv6 means that servers often have both IPv4 and IPv6 addresses, and it’s not clear which of these are reachable. Clients must try to establish multiple connections in parallel, using the happy eyeballs technique, to get good performance.
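To make this concrete, here is a minimal sketch, in C, of the simple sequential approach described above: resolve the name with getaddrinfo() and try each returned address in turn. The function name and the error handling are illustrative only. A happy eyeballs implementation would instead interleave the IPv6 and IPv4 candidates, use non-blocking sockets, and start a new connection attempt roughly every 100ms rather than waiting for each connect() to fail.

```c
/* Sketch of the simple (slow) approach: try each address returned by
 * getaddrinfo() in turn until one connect() succeeds.  A happy eyeballs
 * client would instead interleave the IPv6 and IPv4 results and start a
 * new non-blocking connection attempt roughly every 100ms. */
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int connect_to(const char *host, const char *port) {
    struct addrinfo hints, *res, *ai;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;    /* return both IPv6 and IPv4 candidates */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        return -1;
    }

    int fd = -1;
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd == -1) {
            continue;
        }
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            break;                    /* first address that works wins */
        }
        close(fd);                    /* this candidate failed; try the next */
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;                        /* -1 if every candidate failed */
}
```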
Part 3: Peer-to-peer Connections
Abstract
The 3rd part starts to discuss peer-to-peer connections. It talks about how the use of Network Address Translation (NAT) affects addressing and connection establishment, and why it complicates creating peer-to-peer applications.
In this part, I’ll start to talk about peer-to-peer connections, network address translation, and how these affect Internet addressing and connection establishment. The Internet was designed as a peer-to-peer network, and makes no distinction between clients and servers at the IP layer. In principle, it should be possible to run a TCP server, or a UDP- or TCP-based peer-to-peer application, on any host on the network. As long as the clients have some way of finding the server’s IP address and knowing what port number it’s using, and as long as any firewall pinholes are opened, it shouldn’t matter whether a server is located in someone’s home or in a data centre. A server in a data centre is likely to have better performance, of course, because it’s probably got a faster connection to the rest of the network. It’s also likely to be more robust, because the data centre will have redundant power and network links, air conditioning, and professional system administrators. But, at the protocol level, there shouldn’t be a difference. In practice, this is not the case. It’s difficult to run a server on a host connected to most residential broadband connections, and it’s difficult to make peer-to-peer connections work. The reason for this is the widespread use of network address translation – NAT. What is network address translation? NAT is the process by which several devices can share a single public IP address. It allows several hosts to form a private internal network, with IP addresses assigned from a special-use range. One device – the network address translator, the NAT – is connected to both the private network and to the Internet, and can forward packets between the two networks. As it does so, it rewrites, translates, the IP addresses, and the TCP and UDP port numbers, so all the packets appear to come from the NAT’s IP address. Essentially, it hides an entire private network behind a single IP address. This is useful because there aren’t enough IPv4 addresses for every device that wants to connect to the network, and because it’s taking a long time to deploy IPv6. NAT is a workaround, to let you keep using IPv4 devices, with some limitations, even though there aren’t enough IPv4 addresses. How does NAT work? Well, let’s first step back, and think about how a single host connects to the network. In the figure, a customer owns a single host. That host connects to a network run by an Internet service provider. That ISP, in turn, connects to the broader Internet. The ISP owns a range of IP addresses that it can assign to its customers. In this example, it owns the IPv4 prefix 203.0.113.0/24. That is, the set of IPv4 addresses where the first 24 bits, known as the network part of the address, match those of 203.0.113.0 is assigned to the ISP. These are the IPv4 addresses in the range 203.0.113.0 to 203.0.113.255. The address with the host part equal to zero represents the network, and cannot be assigned to a device. The ISP assigns the first usable IP address in the range, 203.0.113.1, to the internal network interface of the router that connects it to the rest of the network, and assigns the rest of the addresses to customer machines. One particular customer is assigned IP address 203.0.113.7 for their device. The external, Internet-facing, side of the router that connects the ISP to the rest of the network has an IP address assigned by the network to which the ISP connects. In this example, it gets IP address 192.0.2.47. The customer’s host connects to a server on the Internet.
The server happens to have IP address 192.0.2.53. The customer’s host sends packets that have their destination IP address equal to that of the server, 192.0.2.53, and source IP address equal to that of the customer’s host, 203.0.113.7. Those packets travel through the network without change, and when they arrive at the server, they still have destination IP address 192.0.2.53 and source IP address 203.0.113.7. When it sends a reply, the server will set the destination IP address to that of the customer’s device, 203.0.113.7, and use its own address, 192.0.2.53, as the source IP address. No address translation takes place. At some point later, the customer buys another host. How does it connect to the network? What’s supposed to happen is as follows. First, the customer buys an IP router, or is given one by the ISP. The router is used to create an internal network for the customer, that connects to the ISP’s network. This could be an Ethernet, a WiFi network, or whatever. The external interface of that router, that connects the customer to the ISP, inherits the IP address that was previously assigned to the customer’s single device, in this case 203.0.113.7. The ISP also assigns a new IP address range to the customer. This will be a subset of the IP address range the ISP owns. In this example, the customer is assigned the IP address range 203.0.113.16/28. That is, IP addresses where the first 28 bits match those of 203.0.113.16, namely the range 203.0.113.16 to 203.0.113.31. The customer assigns the first usable address in that range, 203.0.113.17, to the internal network interface of the router, and assigns other addresses to their two hosts. In this example, the two hosts are given addresses 203.0.113.18 and 203.0.113.19. The end result is that the ISP delegates some of the IP addresses they own to their customer, and the customer uses them in their network. One of the customer’s hosts connects to a server on the Internet. As expected, to do so, that host sends an IP packet with the source IP address set to its IP address, in this case 203.0.113.18, and the destination address set to the IP address of the server, 192.0.2.53. That packet travels through the customer’s network to its router, and is forwarded on to the ISP’s network. It traverses the ISP’s network to the router connecting the ISP to the Internet, and is forwarded on from there to the Internet. Eventually, the packet arrives at the server. When it arrives, it still has destination address equal to that of the server, 192.0.2.53, and source address equal to that of the host that sent it, 203.0.113.18. When it sends a reply, the server will set the destination IP address to that of the customer’s device, 203.0.113.18, and use its own address, 192.0.2.53, as the source IP address. No address translation takes place. That’s what’s supposed to happen, but what actually happens? Well, most likely the ISP either doesn’t have enough IPv4 addresses to delegate some of them to their customer, or they want to charge a lot extra to do so. Accordingly, the customer buys a network address translator, and connects it to the ISP’s network in place of their single original host. The external interface of the NAT gets the IP address assigned to the customer’s original host, 203.0.113.7. The customer sets up their internal network as before, but instead of using IP addresses assigned by their ISP, they use one of the private IP address ranges. In this example, they use addresses in the range 192.168.0.0 to 192.168.255.255.
The internal interface of the NAT is given IP address 192.168.0.1, and the two hosts get addresses 192.168.0.2 and 192.168.0.3. One of the customer’s hosts again connects to a server on the Internet. As expected, that host sends an IP packet with the source IP address set to its IP address, in this case 192.168.0.2, and the destination address set to the IP address of the server, 192.0.2.53. That packet travels through the customer’s network to its NAT router. The NAT rewrites the source address of the packet to match the external address of the NAT, in this case 203.0.113.7, and also rewrites the TCP or UDP port number to some new port number that’s unused on the NAT, and forwards the packet on to the ISP’s network. Internally, the NAT keeps a record of the changes it made, associated with the port. The packet traverses the ISP’s network to the router connecting the ISP to the Internet, and is forwarded on from there to the Internet. Eventually, the packet arrives at the server. When it arrives, it still has destination address equal to that of the server, 192.0.2.53, but the source address will be equal to that of the NAT, 203.0.113.7. To the server, the packet appears to have come from the NAT. When it sends a reply, the server will set the destination IP address to that of the NAT, 203.0.113.7, and use its own address, 192.0.2.53, as the source address. The reply will traverse the network until it reaches the NAT. The NAT looks at the TCP or UDP port number to which the packet is destined, and uses this to retrieve its internal record of the rewrites that were performed. It then uses this to do the inverse rewrite, changing the destination IP address and port in the packet to those of the host on the private network, then forwards the packet onto the private network for delivery. Essentially, the NAT hides a private network behind a single public IP address. The private network can use one of three private IPv4 address ranges: 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. Machines in a private network can directly talk to each other using these private IP addresses, provided that communication stays within the private network. When they communicate with the rest of the network, the IP addresses are rewritten so that, to the rest of the network, the private network looks like a single device, with one IP address matching that of the external address of the NAT. This gives the illusion that there are more IPv4 addresses available, by allowing the same private address ranges to be re-used in different parts of the network. Your home network, for example, almost certainly uses addresses in the range 192.168.0.0/16, and is connected to the rest of the network via a NAT router provided by your ISP. This concludes our review of how NAT routers allow multiple devices to share a single IP address. In the next part, I’ll explain some of the problems NATs cause.
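Before moving on, here is an illustrative sketch, in C, of the kind of per-flow state a NAT might keep, and how it could be looked up in each direction. The structure fields, fixed-size table, port-allocation rule, and linear search are assumptions made for illustration, not how any particular NAT is implemented.

```c
/* Illustrative sketch of the per-flow state a NAT might keep: which
 * internal address and port map to which external port.  The field
 * names, fixed-size table, port numbering, and linear search are for
 * illustration only. */
#include <stdint.h>

struct nat_binding {
    uint32_t internal_addr;   /* e.g. 192.168.0.2, as a 32-bit value       */
    uint16_t internal_port;   /* port used by the host on the private net  */
    uint16_t external_port;   /* port the NAT picked on its public address */
    uint8_t  in_use;
};

#define MAX_BINDINGS 1024
static struct nat_binding table[MAX_BINDINGS];

/* Outgoing packet: find (or create) the binding, so the source address
 * and port can be rewritten to the NAT's public address and the chosen
 * external port. */
struct nat_binding *nat_outgoing(uint32_t src_addr, uint16_t src_port) {
    for (int i = 0; i < MAX_BINDINGS; i++) {
        if (table[i].in_use &&
            table[i].internal_addr == src_addr &&
            table[i].internal_port == src_port) {
            return &table[i];
        }
    }
    for (int i = 0; i < MAX_BINDINGS; i++) {
        if (!table[i].in_use) {
            table[i].internal_addr = src_addr;
            table[i].internal_port = src_port;
            table[i].external_port = (uint16_t)(49152 + i);  /* pick an unused external port */
            table[i].in_use        = 1;
            return &table[i];
        }
    }
    return NULL;   /* table full: the packet cannot be translated */
}

/* Incoming packet: look up the binding by the external port it arrived
 * on, so the destination can be rewritten back to the internal host.
 * If no binding exists, the NAT does not know where to deliver it. */
struct nat_binding *nat_incoming(uint16_t dst_port) {
    for (int i = 0; i < MAX_BINDINGS; i++) {
        if (table[i].in_use && table[i].external_port == dst_port) {
            return &table[i];
        }
    }
    return NULL;
}
```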
Part 4: Problems due to Network Address Translation
Abstract
The 4th part of the lecture continues the discussion of the problems caused by NAT devices, and why they are used despite these problems. It talks about the use of NAT as a work-around for the lack of IPv4 address space, as a possible translation mechanism between IPv4 and IPv6, and to avoid renumbering. And it talks about the implications of NAT for TCP connections and UDP flows.
In the previous part we discussed what is network address translation, and walked through some examples showing how NAT routers allow several hosts on a private network to share a single IP address. In the following, I want to talk about some of the problems caused by NATs, and to discuss some of the reasons why NATs are used despite these problems. The first issue with NAT routers is that they break certain classes of application, and encourage centralisation. NATs are designed to support client-server applications, where the client is behind the NAT and the server is a host on the public Internet. Packets sent by a host with a private IP address can pass out through the NAT, and will have their IP address and port translated to use the public IP address of the NAT before they’re forwarded to the public Internet. The NAT will also retain state, so that the reverse translation will be applied to replies to those packets, allowing them to pass back through the NAT. This behaviour allows clients to connect to servers, setting up NAT translation state in the process, and to receive responses. The reverse doesn’t work, though. NAT routers rely on outgoing packets to establish the mappings they need to translate incoming packets. That is, when an incoming TCP or UDP packet arrives at some particular port on a NAT, the NAT looks at its record of what it previously sent from that port, and how it was translated, to know what’s the reverse translation to make. If there’s been no outgoing packet on that port, the NAT won’t know how to translate the incoming packet. It won’t know which of the private IP addresses to use as the destination address for the translated packet. This complicates running a server behind a NAT, since the NAT won’t know how to translate incoming requests for the server. It’s possible to manually configure the NAT to forward packets appropriately, of course, and protocols like UPnP can help with this, but these approaches are complicated or unreliable. It’s generally easier and more reliable to pay a cloud computing provider to host the server, which encourages centralisation onto large hosting services. NATs also make it hard to write peer-to-peer applications. In part, this is because NATs make incoming connections difficult. But it’s also because hosts located behind a NAT only know their private address, so can’t give their peer a public address to which it can connect. There are solutions to this, that I’ll talk about in the next part of this lecture, but they’re complicated, slow, and wasteful. Unless you really need the privacy and latency benefits of a direct peer-to-peer connection, it’s often easier to relay traffic via a server hosted in a data centre somewhere, with a public IP address, again encouraging centralisation of services. If NAT routers are so problematic, why do people use them? There are three reasons. The first is to work around the lack of IPv4 address space. As shown in the figure on the right, the Regional Internet Registries have run out of IPv4 addresses. There are no more IPv4 addresses available for ISPs and companies that want to connect to the Internet, and they can’t provide enough IPv4 addresses to fulfil demand. The result is that IPv4 addresses are scarce and expensive. ISPs either don’t have enough addresses to meet their customers’ needs, or the cost of those addresses is prohibitive, and customers use a private network with a NAT instead of using public IPv4 addresses.
The transition to IPv6 will solve this problem, since IPv6 makes addresses cheap and plentiful. The smallest possible address allocation for an IPv6 network is a factor of four billion times larger than the entire IPv4 Internet! Unfortunately, the transition to IPv6 has been slow. This suggests the second reason why NAT is used: to translate between IPv4 and IPv6 addresses. In this model, an ISP, or other network operator, runs IPv6 internally in their network, and does not support IPv4. This gives the ISP a clean, modern, and future-proof network. The ISP also runs two sets of NATs. For customers that want to use IPv4 internally, the customer uses a private IPv4 network, and the NAT translates the IPv4 packets into IPv6 packets when they leave the customer’s network. The principle is the same as the NAT routers we discussed in the last part of this lecture, except that rather than rewriting packets with private IP addresses to have public IPv4 addresses, the NAT rewrites the entire IPv4 header and replaces it with an IPv6 header. When packets get to the edge of the ISP’s network, where it connects to the public Internet, they’re either forwarded as native IPv6 if the destination is accessible via IPv6, or translated to IPv4 by another NAT. The expectation in this approach to running a network is that, over time, the number of customers and destination networks that need IPv4 will go down, and more traffic will run IPv6 end-to-end. NAT is used as a, hopefully temporary, workaround. The third reason to use NAT is to avoid renumbering. Networks that have a public IP address range tend, over time, to end up hard coding IP addresses from that range into configuration files, applications, and settings. This is a mistake. Applications should always use DNS names, to allow the IP addresses to change, but people do it anyway. The result is that it’s difficult to change the IP addresses used by machines on a network. The longer a host has used a particular IP address, the more likely it is that something, somewhere, on the network has that address hard-coded, and will fail if the host’s address changes. If a network has an IP address range delegated to it from its ISP, what’s known as a provider allocated IP address range, and wants to change ISP, then it will need to change the IP address range it uses to one delegated from its new provider. Many organisations have found this sufficiently difficult that it’s easier to keep the old IP addresses internally, and use a NAT to translate addresses to the range assigned by the new ISP. A similar problem can occur if one company buys another, and has to integrate the IT systems of the new company into its existing network. IPv6 has better auto-configuration support than IPv4, and tries to make renumbering easier, but it’s not clear how well this works. As a result, some network equipment vendors have started selling NATs that translate between two different IPv6 prefixes, to ease renumbering. In both cases, a better approach is that an organisation gets what’s known as provider independent IP addresses, directly from one of the Regional Internet Registries, so it owns the IP addresses it uses. In this case, the organisation pays its ISP to route traffic to the addresses it owns, and can move to a new ISP without renumbering. Given these reasons why NAT routers will be used, despite their problems, what are the implications of NAT routers for TCP connections?
Well, as I’ve explained, outgoing connections create state in the NAT, so replies can be translated to reach the correct host on the private network. The question is, then, how does the NAT know what translation state to set up? The way this works is that the NAT router looks at TCP segments it’s translating and forwarding, and watches for packets representing a TCP connection establishment handshake. If the NAT sees an outgoing SYN packet, followed by an incoming SYN-ACK, then an outgoing ACK, with matching sequence and acknowledgment numbers, then it can infer that this is the start of a TCP connection, and set up the appropriate translation. TCP connections have a similar exchange at the end of the connection, with FIN, FIN-ACK, and ACK packets. The NAT router can watch for these exchanges, and infer that the corresponding TCP connections have finished, and that the translation state can be removed. Unfortunately, applications and hosts sometimes crash, and connections disappear without sending the FIN, FIN-ACK, and ACK packets. For this reason, NAT routers also implement a timeout. If a connection waits too long between sending packets, the NAT will assume it’s failed, and remove the translation state. The recommendation from the IETF is that NATs use a two-hour timeout, but measurements have shown that many NATs ignore this and use a shorter timeout. The result is that long-lived TCP connections, that would otherwise go idle, need to send something, even if just an empty TCP segment, every few minutes, to prevent NATs on the path from timing out and dropping the connection. If you’ve ever used ssh to log in to a remote system, gone to do something else, then come back after a couple of hours and wondered why the ssh connection has failed, this may well be due to NAT timeout. The other issue, as I mentioned at the start of this part, is that the NAT won’t have state for incoming connections, unless manually configured to do so. This makes it difficult to run a server or peer-to-peer application behind the NAT. The implications of NAT for UDP flows are similar to those for TCP, except that the lack of connections with UDP complicates things. For TCP, a NAT can watch for the connection establishment and teardown segments, and know when the TCP connections start and finish. TCP connections can fail without sending the FIN, FIN-ACK, ACK exchange, but this is rare, and NAT routers generally rely on watching the TCP connection setup and teardown messages to manage translation state. UDP, on the other hand, has no connection establishment, since it has no concept of connections. This is not a great problem when it comes to establishing state in a NAT. If the NAT sees any outgoing UDP packet with a particular address and port, it sets up the translation state to allow replies. The problem comes with knowing when to remove that translation state in the NAT. Since UDP has no “end of connection” message, the only way to do this is with a timeout. The most widely used UDP application, historically, has been DNS. DNS clients tend to contact a lot of different servers, but exchange only a small amount of data with each. As a result, many NATs have very short timeouts - on the order of tens of seconds - for UDP translation state, to prevent them accumulating state for too many UDP flows. An unfortunate consequence of this is that applications that use UDP, such as video conferencing and gaming, must send packets frequently, in both directions, to make sure the NAT bindings stay open.
The IETF recommends that such applications send and receive something at least once every 15 seconds. This can generate unnecessary traffic. There is one benefit, though, that comes from the lack of connection establishment signalling in UDP. With TCP, the NAT can see the SYN, SYN-ACK, ACK exchange, and knows the exact addresses and ports that the client and server are using. This allows the NAT to create a very specific binding, and reject traffic from other addresses. These very specific bindings are a security benefit, but make peer-to-peer connections harder to establish. UDP applications tend to be more flexible in where they accept packets from, so NATs generally establish bindings that allow any UDP packets that arrive on the correct port to be translated and forwarded across the NAT. This makes peer-to-peer connection establishment much easier for UDP, as we’ll see in the next part. NATs work around three real problems: lack of IPv4 address space, IPv4 to IPv6 transition, and renumbering. They work well for client-server applications, where the client is behind the NAT and the server is on the public Internet, but make it hard to run peer-to-peer applications and to host servers on networks that use NATs. This encourages centralisation of the Internet infrastructure onto cloud providers and, as we’ll see in the next part, greatly complicates certain classes of application.
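As a rough illustration of these keep-alive requirements, the sketch below shows two approaches an application might take, assuming a Linux system: enabling TCP keep-alives on a connection that may go idle, and sending a small UDP datagram every 15 seconds. The timer values and the one-byte payload are illustrative choices, not standards; real protocols such as STUN define their own keep-alive messages.

```c
/* Sketch of two ways an application can keep NAT bindings alive.
 * TCP_KEEPIDLE and TCP_KEEPINTVL are Linux-specific; the timer values
 * below are illustrative choices, not standards. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* For a long-lived TCP connection that may go idle: ask the kernel to
 * send keep-alive probes, so some traffic crosses the NAT before the
 * NAT times out its binding. */
int enable_tcp_keepalive(int fd) {
    int on    = 1;
    int idle  = 600;   /* start probing after 10 minutes of idleness */
    int intvl = 60;    /* then probe once a minute                   */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    return 0;
}

/* For a UDP flow: send a small datagram at least every 15 seconds, as
 * the IETF recommends, so the NAT keeps its translation state.  The
 * one-byte payload is illustrative; real protocols (e.g. STUN) define
 * their own keep-alive messages. */
void udp_keepalive_loop(int fd, const struct sockaddr *peer, socklen_t peer_len) {
    const char ping = 0;
    for (;;) {
        sendto(fd, &ping, sizeof(ping), 0, peer, peer_len);
        sleep(15);
    }
}
```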
Part 5: NAT Traversal and Peer-to-Peer Connection Establishment
Abstract
The final part of the lecture discusses NAT traversal and peer-to-peer connection establishment. It outlines the binding discovery process, by which a client can establish that it’s behind a NAT and find the external IP address of that NAT, and the ICE algorithm for candidate exchange and peer-to-peer connection establishment.
In the final part of this lecture, I’d like to discuss the problem of NAT traversal. That is, how applications can work around the presence of NAT routers to establish peer-to-peer connections. As I described in the previous part, NATs are designed to support outbound connections from a client in the private network to a server on the public Internet, and this use case works well. Other scenarios are less successful. Incoming connections, to a server located in the private network, will fail. This happens because the NAT can’t know how to translate the incoming packets. There are work arounds for this, that involve manually configuring the NAT to forward incoming connections to the correct device, but this is difficult to do correctly. Similarly, peer-to-peer connections through a NAT will also fail, unless the packets are sent in a way that makes the NAT, or the NATs if there are several peers all located in private networks, think that a client-server connection is being opened, and that the response is coming from the server. In the following, I’ll talk about how this can be arranged. The figure shows an example where two hosts, A and B, are trying to establish a direct peer-to-peer connection. For example, this could be two devices in people’s homes that are trying to set up a video call. Each of these hosts is in a private network, and is connected to the public Internet via a NAT. It’s possible, indeed likely, that if these are home networks, then both of the private networks will be using the IP address range 192.168.0.0/16, since that’s the default for most home NAT routers. A consequence is that Host A and Host B could both be using the same private IP address, for example both hosts could be using IP address 192.168.0.2. This isn’t a problem, since Host A and Host B are on different private networks, each hidden behind a different NAT. The two NATs have different public IP addresses on their external interfaces, and what’s used internally is not visible to the rest of the network. How do these two hosts go about establishing a connection? Well, Host A can’t send a packet directly to Host B, because it has the same IP address. If it tries, the packet will come straight back to itself! Rather, in order to connect to Host B, Host A will have to discover the external address and port number that NAT B is using for packets sent by Host B. It can then send its packets to NAT B, that will translate and forward them to host B. To do this, the two peers, Host A and Host B, both make connections to a referral server located somewhere on the public Internet. This is shown in the dashed red lines on the slide. They ask that server where their packets appear to be coming from. This process is known as binding discovery, and lets the hosts find out how their NAT is translating packets. The result is a candidate address for each host, that it thinks is the external address of the NAT that will translate incoming packets and forward them to it. The peers then exchange these candidate addresses with each other, via the referral server. Once they’ve received the candidate addresses from their peer, the two hosts systematically send probe packets, to check if any of these candidates actually work to reach the peer. That is, the hosts check if the outgoing probe packets they send will correctly setup translation state in the NAT, so that incoming probes from the peer will be translated and forwarded to them. And they check that there are no firewalls that are blocking the traffic.
If the probes are successfully received, in both directions, then the two hosts can switch to using the direct peer-to-peer path, shown as the solid blue line on the slide, and no longer need the server. If the probes fail, then a direct peer-to-peer connection may not be possible, and the hosts may have to relay all traffic via the referral server. The process of finding out what translations a NAT is performing is known as NAT binding discovery. The Session Traversal Utilities for NAT, STUN, is a commonly used protocol that performs NAT binding discovery in the Internet. When a host on a private network sends a packet to a host on the public network, the NAT at the edge of the private network will translate the source IP address and port number in the packet. The host on the private network doesn’t know what translation has been done, but the server that receives the packet can inspect its source address and port, to find out where it came from. For example, when using a UDP socket, an application can use the recvfrom() system call to retrieve both the contents of a UDP packet and its source address. Similarly, for TCP connections, the accept() system call returns the address of the client. The server then replies to the client, telling it where the packet appeared to come from. This is what’s known as a server reflexive address. That is, the address that a server thinks the client has. If there’s a NAT between the client and the server, then the server reflexive address will be different to the address from which the client sent the packet. If the client’s address and the server reflexive address are the same, the client knows there’s no NAT between it and the server. You might ask why a host that’s in a private network doesn’t just ask its NAT how it will translate the packets? Two reasons. The first is that by the time we realised that binding discovery was needed, there were already tens of millions of NATs deployed, with no way to upgrade them to add a way to ask how they’ll translate packets. The second is that a host might not know that it’s behind a NAT, or might be behind more than one NAT, and so won’t know what NAT to ask for the binding. When performing binding discovery, it’s important that a host discovers every possible candidate address on which it might be reachable. For example, think about a phone that has both 4G and WiFi interfaces. Each of these interfaces can have an IPv4 address and an IPv6 address, representing the point of attachment to networks it directly connects to. This could be a total of four possible IP addresses for the phone. The phone may be behind IPv4 NAT routers on each of those interfaces, and so each interface might also have a server reflexive address on which it can be reached, that the host can discover using STUN. This can give another two addresses, bringing the total to six. It’s unlikely, but the phone could also be connected via one or more IPv6 NATs. This potentially gives two more server reflexive addresses on which it can be reached. In case these server reflexive addresses don’t work, the phone may also be able to use the referral server to relay for it, using a protocol called TURN, acting as a proxy to deliver traffic if a direct connection isn’t possible. This proxy endpoint might be accessible via IPv4 and IPv6. The phone might also have a VPN connection, and be able to send and receive traffic via the VPN, as well as directly.
That VPN endpoint could be accessible over IPv4 or IPv6, and might itself be behind a NAT, so it’s necessary to check for server reflexive addresses on the VPN interface. Not all of these will exist for every device, of course, but the point is that a modern networked device is often reachable in many different ways. If it’s to successfully connect to another device, in a peer-to-peer manner, it needs to find as many of these candidate addresses as possible. Having run a binding discovery protocol to find all its possible candidates, a host sends the list of candidates to the referral server, and the referral server sends them on to its peer. Its peer does the same, and the host receives the peer’s candidates via the referral server. At this point, the two hosts know each other’s candidate addresses, and are ready to check which of the addresses work. Given that the two peers can communicate via the referral server, you might ask why the peers bother to establish a peer-to-peer connection, and don’t instead just keep communicating via the relay? The primary reason is because a direct peer-to-peer connection is usually lower latency than a connection via a relay, and for peer-to-peer applications like video calls, latency matters. The second is that the relay server can eavesdrop on connections that it’s relaying, but not on direct peer-to-peer connections. This is perhaps less of a concern than you might think, since the traffic can be encrypted so it can’t be read by the server. Also, the server knows that the call is happening anyway, and sometimes knowledge that two people are talking is almost as sensitive as knowing what they’re talking about. Once they’ve exchanged candidates, the two hosts systematically send probe packets from every one of their candidate addresses to every one of the peer’s candidate addresses in turn, to see if they can establish a direct connection. The idea is that a probe packet sent, for example, from Host A to a server reflexive address of Host B, will open a binding in NAT A, even if it fails to reach host B. This open binding will allow a later probe from Host B to the server reflexive address of Host A to reach Host A. This will, in turn, trigger Host A to probe again in response, and this time the probe from Host A to the server reflexive address of Host B will succeed because the probe from Host B opened the necessary binding on NAT B. The two hosts then start sending traffic and keep-alive messages on that path to keep the bindings active, while the probing continues on all the other candidates. The probing can take a long time, so candidate addresses are assigned a priority based on how likely the host thinks it is to be reachable on that address, and on its expectation of how well that address will perform. The checks take place in priority order, to quickly try to find a pair of candidates that works. If more than one pair of candidate addresses succeeds, the hosts choose the best path, for example the path with the lowest latency, and drop the other connections. The Interactive Connectivity Establishment algorithm, ICE, described by the IETF in RFC 8445, specifies this probing process in detail. When making a peer-to-peer phone or video call, the ICE algorithm and the probing usually happen while the phone is ringing, so the connection is ready when the call is answered.
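To make the binding discovery idea concrete, here is a minimal sketch, in C, of a referral-style server that simply reports back the source address and port each UDP packet appears to arrive from. This illustrates the idea only; it is not the real STUN message format, and the function name, buffer sizes, and plain-text reply are assumptions.

```c
/* Sketch of the binding discovery idea (not the real STUN protocol): a
 * server that replies to each UDP packet with the source address and
 * port the packet appears to come from.  If the client is behind a NAT,
 * this is the server reflexive address, not the client's private one. */
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

void reflect_addresses(int fd) {    /* fd: a bound UDP socket */
    for (;;) {
        char                    buf[1500];
        struct sockaddr_storage src;
        socklen_t               src_len = sizeof(src);

        if (recvfrom(fd, buf, sizeof(buf), 0,
                     (struct sockaddr *) &src, &src_len) < 0) {
            continue;
        }

        /* Format the observed (post-NAT) source address and port as text. */
        char host[64], port[16], reply[96];
        if (getnameinfo((struct sockaddr *) &src, src_len,
                        host, sizeof(host), port, sizeof(port),
                        NI_NUMERICHOST | NI_NUMERICSERV) != 0) {
            continue;
        }
        snprintf(reply, sizeof(reply), "%s %s", host, port);

        /* Tell the client where its packet appeared to come from. */
        sendto(fd, reply, strlen(reply), 0, (struct sockaddr *) &src, src_len);
    }
}
```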
What should be clear by now is that NAT binding discovery, and the systematic connection probing needed for NAT traversal, is complex, slow, and generates a lot of traffic. The RFCs that describe how the process works are almost 200 pages long, and are not easy to implement correctly. The result is reasonably effective for UDP traffic. The STUN protocol, and the ICE algorithm, were developed to support voice-over-IP applications, that run over UDP, and the result works well. It’s less effective for peer-to-peer TCP connections. NATs tend to be quite permissive for UDP, translating any incoming UDP packet that reaches the correct address and port, but are often stricter for TCP connections, and check for matching TCP sequence numbers, etc. This makes peer-to-peer TCP connections less likely to be successful. In this lecture, I’ve outlined how client-server connection establishment works, and how the use of TLS and IPv6 can affect connection establishment, and can require connection racing using the “happy eyeballs” technique. I also showed that connection establishment latency is often a critical factor limiting the performance of TCP connections. In the later parts, I outlined how and why NAT routers are used, their advantages and disadvantages, and how NAT traversal techniques work to establish peer-to-peer connections. Establishing a connection used to be a simple task. What I hope to have shown you is that it’s no longer simple, not in the client-server case, and especially not when peer-to-peer connections are needed.
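For completeness, the skeleton below illustrates the idea of probing candidate pairs in priority order. The structure fields, priority values, and the send_probe() stub are hypothetical; the actual pairing, priority, and pacing rules are those defined by ICE in RFC 8445.

```c
/* Illustrative sketch of probing candidate pairs in priority order.  The
 * priority values, pacing, and send_probe() stub are hypothetical; the
 * real pairing and priority rules are defined by ICE in RFC 8445. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>

struct candidate_pair {
    struct sockaddr_storage local;    /* our candidate: host, reflexive, or relay */
    struct sockaddr_storage remote;   /* the peer's candidate                     */
    uint64_t                priority; /* higher value = more likely to work well  */
};

/* Hypothetical placeholder: a real check sends a probe from the local
 * candidate to the remote candidate and waits for a reply, opening NAT
 * bindings along the way. */
static void send_probe(const struct candidate_pair *pair) {
    (void) pair;
}

static int by_priority(const void *a, const void *b) {
    const struct candidate_pair *pa = a, *pb = b;
    if (pa->priority == pb->priority) return 0;
    return (pa->priority > pb->priority) ? -1 : 1;   /* sort descending */
}

/* Probe in priority order, so a working pair is likely found quickly.
 * A real implementation paces the checks and runs them concurrently,
 * rather than strictly one after another. */
void run_connectivity_checks(struct candidate_pair *pairs, size_t n) {
    qsort(pairs, n, sizeof(pairs[0]), by_priority);
    for (size_t i = 0; i < n; i++) {
        send_probe(&pairs[i]);
    }
}
```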
L2 Discussion
Summary
Lecture 2 discussed how to establish TCP connections in the fragmented Internet we have today. It started with a review of the TCP service model, and how it’s supposed to establish client-server connections, and showed some of the factors that affect connection establishment performance.
One of the key factors that affects performance is network latency, and the number of round trip exchanges between the client and server needed to establish the connection. With the aid of an example, I tried to show that latency, not bandwidth, is often the main limiting factor in performance. Consider whether the example used to demonstrate this looks reasonable and, if not, what would a more realistic scenario look like and would it change the conclusions? Did the results surprise you? Given this behaviour, would you be willing to pay your ISP for a higher bandwidth Internet connection?
The lecture then discussed dual-stack connection establishment for networks that support both IPv4 and IPv6 hosts. As part of this, it highlighted that the IPv4 and IPv6 networks are separate, and that parallel connection establishment is needed. Consider how the complexity of parallel connection establishment compares to the sequential DNS look-up code shown in lab 1. How would you implement this parallel connection establishment and racing? Do you think the complexity is worth the effort to speed-up connection establishment?
Finally, the lecture discussed network address translation and NAT traversal, allowing several hosts to share an IP address. It showed how NAT devices work, and discussed some of the problems they cause. Review when NAT works well and when it is problematic. Think about what types of application NATs break. Given this, why do people still use NAT devices?
NAT traversal for peer-to-peer applications uses the combination of binding discovery and the ICE algorithm to establish connections, using a referral server to exchange addresses then probing to check if candidate addresses work. Review this algorithm (or the How NAT traversal works blog post from Tailscale) to determine whether the approach makes sense. How effectively do you think this approach to NAT traversal works? How easy do you think it would be to implement?
Lecture 3
Abstract
This lecture considers secure communications in the Internet. It reviews the need for security, and the principles of encryption, integrity protection, and authentication of messages. It explains the principles of operation of the Transport Layer Security Protocol (TLS), version 1.3, and how it protects Internet traffic. And it briefly reviews some of the issues around writing secure software.
Part 1: Secure Communications
Abstract
The 1st part of this lecture discusses the need for security in Internet communications. It reviews why end-to-end encryption and message integrity protection are essential to protect Internet users from eavesdropping, identity theft, fraud, and other attacks. And it discusses some of the tensions and concerns that have been raised about the provision of such protection.
In the last lecture, I discussed the behavior of TCP and some issues around connection establishment. One of these issues was the observation that establishing a secure connection, using TLS, was slower than establishing an insecure connection. In this lecture, I want to talk more about TLS and about security in general. In this first part, I’ll talk about why security is important, and why we need to secure communications. Then, in part two, I’ll talk about the principles of secure communication and the cryptographic techniques that can be used to protect data. Part three of the lecture will describe some of the behavior of the transport layer security protocol, that provides security for most Internet traffic. And, finally, in part four, I’ll talk about some general issues around network security, and how to write secure network applications. So why do we need secure communications? Well, the fundamental problem is that it’s possible to eavesdrop on network traffic. This can be done by wiretapping the network links down which the data flows, or it can be done by configuring the network routers to save a copy of the packets they forward. The result is that traffic passing across the network can be monitored by third parties. If you want to ensure that the data you send across the network is private, then that data needs to be encrypted somehow. Similarly, network routers can modify the packets they forward. This means that the router can change the data being delivered without the consent of the sender. The sender cannot stop this happening. But they can add some message integrity protection, such as a digital signature, to allow the receiver to detect and reject messages that have been tampered with. Finally, there are numerous devices in the network, known as middle boxes, that try to improve communication by somehow interpreting or modifying the data being sent. For example, we spoke about network address translation in the last lecture, where a NAT router rewrites the addresses and ports in TCP/IP headers to allow several machines to share a single IP address. Other examples include network firewalls, that monitor traffic and try and prevent bad traffic from entering a network, as well as the various accelerator devices that try to improve the performance of TCP connections running over satellite links. If not carefully maintained, these devices tend to lead to network ossification, where they limit the ability to change network protocols. A final role of secure communications is therefore to limit the ability of such devices to inspect and act on the traffic, so helping to ensure that the network can continue to evolve. A lot of different organizations monitor the network, for many different reasons. These include governments, intelligence agencies, and law enforcement agencies. For example, the police have to monitor the network as part of their crime prevention activities; domestic intelligence agencies inspect traffic to protect against terrorism, or to monitor foreign targets; and foreign intelligence agencies might try to spy on domestic targets. That this happens shouldn’t be a surprise. And there are clearly good reasons for some of this monitoring. Many people would agree, I think, that targeted wiretaps on suspected criminals, subject to appropriate oversight, such as the need to obtain a warrant of some sort when there’s probable cause, are probably not unreasonable.
Relatively few people would object to actively monitoring the network traffic of those actively suspected of being engaged in serious crimes, terrorist activities, child abuse, and so on. People differ on what crimes they consider serious, or on the standards of probable cause, or on the amount of oversight needed. But all societies accept some degree of monitoring and oversight of network traffic. However, Edward Snowden showed that some intelligence agencies, including, but certainly not limited to, the five eyes (the UK, the US, Canada, Australia, and New Zealand), were conducting pervasive monitoring of all network traffic. Other governments are also known to conduct such monitoring. The great firewall of China is a common example, along with monitoring by Russia, Iran, Saudi Arabia, and others. Many felt that this indiscriminate monitoring of all network traffic without probable cause or suspicion was a step too far. In part, I think this came from distrust of those governments, their motives, and how they might use the data. The people they were supposed to represent were unconvinced that the monitoring was actually doing them good. But, in part, there was also the realization that if supposedly friendly governments were monitoring traffic indiscriminately, then so were others. Even if I completely trust our government to monitor Internet traffic only for good reasons, the fact that they’re able to monitor that traffic means that others are able to do so too. And those others might not have my best interests at heart. This led to a push to enable pervasive encryption, to encrypt more and more of the traffic crossing the Internet. The most visible manifestation of this is that most websites now use HTTPS and encrypt their traffic. But the spread of encryption has been wider than the web. The result is that most Internet traffic is now encrypted by default, hindering, but not preventing, pervasive monitoring. Governments are not the only organizations to monitor network traffic, of course. We’ve all contacted a business and been told that our call may be monitored for quality and training purposes. Some of this monitoring by businesses is necessary for regulatory compliance. Banking and insurance industries, for example, require records to be kept in most cases, to prevent fraud. There are good reasons for some of this monitoring. Other aspects of monitoring and tracking by businesses are perhaps less beneficial. Targeted advertising and customer profiling is frequently cited as problematic, for example. Communication security measures, such as encryption, can help reduce such unwanted monitoring, though the effect is small, since this type of monitoring and tracking is often delivered by the sites we intentionally visit, rather than by snooping on communications. We also see network operators monitoring traffic on the networks they operate. Again, there are both beneficial, and problematic, reasons for this. Network operators monitor traffic to understand how well their networks are operating, and whether they’re meeting their quality of service goals. It’s common, for example, for network operators to inspect the sequence and acknowledgement numbers in the headers of TCP packets traversing their networks. This lets them understand if packets are being lost, or if the time taken for packets to traverse the network is building up, both of which are signs that the network is becoming overloaded.
This helps the operators decide when to reroute traffic onto less busy paths, or when to install more network capacity to keep good performance. And few would argue that this sort of monitoring is a problem. On the other hand, operators can monitor traffic to profile what sites customers are visiting. This information could then be sold to advertisers, or could be used to negatively influence the performance of the traffic. For example, an operator might choose to lower the priority of Netflix traffic for customers who haven’t signed up to their video streaming package. Many people are less comfortable with such behaviors, and communication security measures can limit their effectiveness. Finally, of course, are criminals and malicious users that try to steal data and user credentials, that try to perform identity theft, or conduct other attacks. Communication security clearly cannot prevent all such attacks, but it can limit their scope by limiting the amount of information that’s available and visible to those monitoring the networks. As a result of these various attacks, there are a range of measures that can be deployed that can help to protect privacy by encrypting network traffic. Unfortunately, what makes this problem space challenging, is that the mechanisms used to protect against malicious attacks also prevent benign monitoring. There’s no known way to stop criminals and malicious attackers from accessing private data that doesn’t also stop legitimate law enforcement from doing so, for example. In addition to monitoring and observing data as it traverses the network, many organizations might also try to modify messages. Governments and law enforcement, for example, might require ISPs to censor, or modify, DNS responses to restrict access to certain sites. They might require DNS responses to be modified to indicate that certain sites don’t exist, or to change the addresses in the DNS response to direct users to a page indicating that the content is blocked. Alternatively, governments might require ISPs and network operators to block or rewrite traffic containing certain content. As with government traffic monitoring, there can be reasonable, and unreasonable, reasons for governments to modify messages. Many countries have widely accepted laws about restricting hate speech, blocking child pornography, or preventing terrorism. Part of the implementation of such laws is often by modifying DNS responses to limit access to certain sites. The same techniques can, of course, also be used to block other types of content, or restrict other kinds of speech. Businesses and network operators might also block or modify content. The DNS server in a cafe, or a train, that redirects you to a sign up page, and asks for payment before letting you browse the web on their Wi-Fi is an example. Other examples might be services that filter spam or block malicious attachments, that enforce terms of service, or that try to prevent copyright infringement. And finally, of course, there are criminals, and malicious users, people modifying content to conduct phishing scams, steal identity, mislead, and defraud. And, again, what makes this problem space challenging is that mechanisms that protect message integrity against malicious attackers also prevent benign modification. For example, a recent development in network security is DNS over HTTPS.
This is an approach to encrypting DNS traffic that was designed to protect users from phishing attacks where an attacker on the local network spoofs DNS responses to perform identity theft. It does this successfully. Unfortunately, some Internet service providers in the UK intentionally spoofed DNS responses to block access to sites hosting child abuse material, as part of a government mandated blocklist. Encrypting DNS traffic, using DNS over HTTPS, to protect against identity theft unintentionally also prevented the child abuse blocklist from working, since both relied on the same vulnerability in DNS. And again, this is an area where there are difficult questions, and it’s not clear we have all the right answers. The final reason for securing communications relates to protocol ossification. It’s common for network operators to deploy middle boxes, of various sorts, to monitor and modify traffic. These can be devices such as NATs and firewalls, traffic shapers, filters, or protocol accelerators. And these middle boxes need to understand the traffic they’re observing or modifying. For example, in order to translate IP addresses and ports, a NAT needs to know the format of an IP packet, and where the ports are located in the TCP and UDP header. Equally, a traffic shaping device, intended to limit the throughput of TCP connections for a particular user, needs to understand the congestion control algorithm used by TCP, otherwise how can it influence the sending rate of a connection? This means that the network becomes more complex. It means that devices in the network no longer just look at the IP headers and forward the packets based on the destination address. They also understand details of TCP and UDP, and other protocols, and observe, inspect, and modify those protocols too. And this leads to a problem known as protocol ossification, where it becomes difficult to change the protocols running between the endpoints, because doing so interacts poorly with middle boxes that don’t understand the new version of the protocol. For example, it’d be very difficult to change the format of the TCP header now, even if we could upgrade all the systems to support the new version, because of all the NATs and firewalls that would also need updating. This protocol ossification, where the network learns about the transport and higher layer protocols, effectively prevents those protocols from being upgraded, and occurs because the network has visibility into those protocols. Encryption offers one way to prevent ossification. The more of a protocol that’s encrypted, the easier it is to change that protocol, since the encryption will have stopped middleboxes from understanding or modifying the data. There’s a trade off, though, between the ability to change end-to-end protocols and the ability of the network to offer helpful features. The more of a protocol that’s encrypted, the easier it is to change the protocol. But the harder it is for middle boxes to provide help from the network. The draft shown on the slide, on “Long-term viability of protocol extension mechanisms”, talks about these issues further, and discusses how to extend and modify protocols to ensure that they remain changeable. It’s very much worth reading. As we’ve seen there are good reasons to encrypt and authenticate data. Doing so helps to provide privacy, it helps to prevent fraud, and it helps to allow protocols to evolve while avoiding network ossification.
Providing security in this way is a good thing, but there are always trade-offs, and I’ve tried to highlight some of these. In particular, it’s always possible to find examples where providing security to protect against some attacker will prevent some beneficial monitoring or service. There are no easy solutions here. It’s easy to argue that we must encrypt everything to ensure privacy, missing that this causes some real problems. Equally, it’s easy to argue that law enforcement should have exceptional access to communications, to help prevent terrorism and child abuse, for example, missing that there are very real risks that this will cause other serious problems. We need more dialogue between engineers, protocol designers, network operators, policymakers, and law enforcement, to better understand the constraints and the concerns. The “Keys Under Doormats” paper, linked from the slide, talks about these issues in more detail, and I very much encourage you to read it. Finally, as more and more data is encrypted and protected, we’re also starting to see increasing discussion of end system based content monitoring. The argument here is that encryption is important to prevent attacks by malicious users, but that law enforcement needs access to protect us. But, since effective encryption prevents law enforcement from monitoring traffic on the network, maybe they should be able to monitor the traffic on the end systems, after it’s traversed the network. And there’s a certain appeal to this. If done correctly, the encryption provides protection against a large class of attacks, and correct implementation of end-system based monitoring limits who can monitor traffic to those with legitimate needs and legitimate authority. And, in some cases, that’s an appropriate compromise. It doesn’t seem problematic for social networks like Facebook, for example, to support law enforcement in monitoring their network to detect people sharing child abuse material. But, as Apple found out when they announced that they were to implement similar monitoring running on iPhones for one-to-one and group iMessage chats, the expectations around privacy, law enforcement access, and abuse protection, vary very much between social networks, one-to-one communications, group communications, and public posts. And the boundaries between these categories, and what’s acceptable in terms of monitoring and protection and privacy, can be very hard to distinguish. And again, there are some difficult questions relating to what type of privacy protection and what type of monitoring is technically possible to implement on end-systems, and what’s socially acceptable, and what’s desirable. And the paper on the slide, “Bugs in our pockets”, talks about this issue in a lot more detail. So that wraps up the discussion of why secure communication is needed. Network traffic is frequently monitored by governments, businesses, network operators, and malicious users. Some of this monitoring is beneficial, some of it less so. In the following parts, I’ll talk about the technologies we can use to provide privacy, to protect message integrity, and to prevent protocol ossification.
Part 2: Principles of Secure Communication
Abstract
The 2nd part of the lecture reviews the principles of secure communication. It describes the concepts behind symmetric, public-key, and hybrid cryptography. It outlines techniques for message integrity protection and authentication including cryptographic hash functions and digital signatures. And it reviews the need for a public key infrastructure.
In this part, I want to talk about some of the principles of secure communication. I’ll talk about how we go about ensuring confidentiality of messages as they traverse the network. About how we authenticate messages to ensure that they’re not modified in transit, and about how we can go about validating the identity of the participants in a communication. So what are the goals of secure communication? Well, we’re trying to deliver a message across the internet from a sender to a receiver. In the process we want to avoid eavesdropping on the message – we need to encrypt it in order to provide confidentiality, to make sure no one other than the intended receiver can have access to the content of the message. We want to avoid tampering with the message – we need to authenticate the message to ensure that it’s not modified in transit by any of the devices which are involved in the delivery of that message. And we want to avoid spoofing – we want to somehow validate the identity of the sender, so that the receiver knows, and can be sure of, who the message came from. So how do we go about providing confidentiality? Well, unfortunately, data traversing the network can be read by any of the devices on the path between the sender and the receiver. It’s possible to eavesdrop on packets as they traverse the links that comprise the network. And it’s also possible to configure the switches or routers to snoop on the data as they’re forwarding it between the different links in the network. The network operator can always do this. They own the network; they can configure the devices to save a copy of the data if they choose to do so. If the network’s been compromised, maybe so can others. If an attacker can break into the routers, for example, there’s nothing stopping them saving the data, redirecting copies of data traversing the network to some other location. If the data can always be read, how do we provide confidentiality? Well, we use encryption to make sure that the data is useless if it’s intercepted or copied. We can’t stop an attacker, or the network operator, from reading our data. But we can make sure that they can’t make sense of it if they do read it. There are two basic approaches to providing encryption. The first is called symmetric cryptography. Algorithms such as the Advanced Encryption Standard, AES. The other approach is what’s known as public key cryptography. Algorithms such as the Diffie-Hellman algorithm, the RSA algorithm, and elliptic curve algorithms. They have quite different properties and are used in different situations. I’ll talk about the details and the differences between them in a minute. Both of them are based on some fairly complex mathematics. I’m not going to attempt to describe how that works. What’s important is not the details of the maths, but what are their properties, what behaviours do they provide, and how do they help us secure data as it traverses the network? So we’ll start with the idea of symmetric cryptography. The idea of symmetric encryption is that it can convert plain text into cipher text with the aid of a key. If we have, for example, the plain text as we see on the top-right of the slide, and we pass it through the encryption algorithm, in this case AES, the Advanced Encryption Standard, with the aid of an encryption key, we get a blob of encrypted text as we see it in the middle. If we pass that encrypted text through the inverse algorithm, the decryption algorithm, using the same key, then we get the original text back out.
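To make this concrete, here is a minimal sketch of symmetric encryption and decryption in C, using the OpenSSL library’s EVP interface with AES-256 in CBC mode as one example implementation. The randomly generated key and IV are purely for illustration; in a real system the same key would have to be shared securely between the sender and the receiver, which is exactly the key distribution problem discussed next.

    /* Minimal sketch: AES-256-CBC encryption and decryption with one shared key.
     * Assumes OpenSSL is installed. Compile with: gcc aes_sketch.c -lcrypto */
    #include <stdio.h>
    #include <string.h>
    #include <openssl/evp.h>
    #include <openssl/rand.h>

    int main(void) {
        unsigned char key[32], iv[16];      /* 256-bit key, 128-bit IV */
        RAND_bytes(key, sizeof(key));       /* in practice, shared securely */
        RAND_bytes(iv,  sizeof(iv));

        unsigned char plaintext[] = "The quick brown fox jumps over the lazy dog";
        unsigned char ciphertext[128], decrypted[128];
        int len, clen, dlen;

        /* Encrypt the plaintext using the secret key. */
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, strlen((char *)plaintext));
        clen = len;
        EVP_EncryptFinal_ex(ctx, ciphertext + clen, &len);
        clen += len;
        EVP_CIPHER_CTX_free(ctx);

        /* Decrypt using the same key and IV: we get the original text back. */
        ctx = EVP_CIPHER_CTX_new();
        EVP_DecryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv);
        EVP_DecryptUpdate(ctx, decrypted, &len, ciphertext, clen);
        dlen = len;
        EVP_DecryptFinal_ex(ctx, decrypted + dlen, &len);
        dlen += len;
        EVP_CIPHER_CTX_free(ctx);

        printf("%.*s\n", dlen, decrypted);
        return 0;
    }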
The point is that a single secret key controls both the encryption and the decryption process. The key used to encrypt is the same as the key used to decrypt. Now, provided the key is kept secret, and is known only to the sender and receiver, this can be very secure, and it can be very fast. Symmetric algorithms such as AES can encrypt and decrypt many gigabits per second. This makes them very suitable for Internet communications because they don’t slow down the communications, while still providing security. There are a wide range of different symmetric encryption algorithms; probably the most widely used is the US Advanced Encryption Standard, AES. The AES algorithm was developed as part of the output of an open competition, run by the US National Institute of Standards and Technology, and it’s actually a Belgian algorithm known as Rijndael. Importantly, the AES algorithm, the Rijndael algorithm, is public and the security of the algorithm depends only on keeping the key secret, not on keeping the algorithm itself secret. The link on the slide is a pointer to the specification for the algorithm, and there’s a large amount of open source code which implements it. The problem with symmetric cryptography is that you need to keep the key secret. If anyone other than the sender and the receiver knows the key, then the security of the encryption fails. The question, then, is how do you securely distribute the key? If I want to exchange a secure message with someone I know well, then this is straightforward. I can meet them in person, give them the key, and ensure that no one else can eavesdrop on that communication. The problem comes when I’m trying to communicate securely with someone where I can’t meet them in person. How do I securely get a key from an Internet shopping site, for example? The only means of communication I have is over the Internet. And if I send the key over the Internet, someone can eavesdrop on the key, and that gives them the ability to decrypt our communications and breaks the security. The solution to this is an approach known as public key cryptography. Public key cryptography, like symmetric cryptography, is used to convert a plain text message into an encrypted form. The difference, though, is that there are two different keys, and the key used to encrypt the message and the key used to decrypt the message are different. The keys come in pairs. The two halves of the pair are known as the public key and the private key. Importantly, a message which is encrypted using one of those keys can only be decrypted using the other key. If the message is encrypted with the public key, for example, then only the private key can decrypt that message. As you might expect from the names, the idea is that you keep the private key from the key pair secret, and you make the public key as public as is possible. You publish it in the phone book, you put it on your webpage, you write it on your business card, and you make sure everybody knows that this is your public key. In order to send you a message, someone looks up your public key and uses that to encrypt the message. Once the message has been encrypted using a particular public key, the only thing which can decrypt it is the corresponding private key. And since the private key has been kept private, you’re the only one who can receive the message. This solves the key distribution problem.
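As a sketch of what public key encryption looks like in code, the fragment below uses OpenSSL to encrypt a small secret, here a randomly chosen 256-bit value, with a receiver’s public key; only the holder of the matching private key can recover it. The file name receiver_pub.pem is a hypothetical placeholder for wherever the receiver’s published key was obtained.

    /* Sketch: encrypt a small secret with a receiver's public key (OpenSSL).
     * Assumes the receiver's public key is in "receiver_pub.pem" (hypothetical).
     * Compile with: gcc pubkey_sketch.c -lcrypto */
    #include <stdio.h>
    #include <openssl/evp.h>
    #include <openssl/pem.h>
    #include <openssl/rand.h>

    int main(void) {
        unsigned char secret[32];
        RAND_bytes(secret, sizeof(secret));          /* the value to protect */

        FILE *fp = fopen("receiver_pub.pem", "r");
        if (fp == NULL) { perror("receiver_pub.pem"); return 1; }
        EVP_PKEY *kpub = PEM_read_PUBKEY(fp, NULL, NULL, NULL);
        fclose(fp);
        if (kpub == NULL) { fprintf(stderr, "cannot parse public key\n"); return 1; }

        /* Encrypt the secret with the public key: only the matching private
         * key can decrypt it, so it is safe to send over the network. */
        EVP_PKEY_CTX *pctx = EVP_PKEY_CTX_new(kpub, NULL);
        EVP_PKEY_encrypt_init(pctx);

        size_t outlen = 0;
        EVP_PKEY_encrypt(pctx, NULL, &outlen, secret, sizeof(secret)); /* query size */

        unsigned char out[1024];
        if (outlen > sizeof(out)) { fprintf(stderr, "output too large\n"); return 1; }
        EVP_PKEY_encrypt(pctx, out, &outlen, secret, sizeof(secret));

        printf("encrypted secret is %zu bytes\n", outlen);

        EVP_PKEY_CTX_free(pctx);
        EVP_PKEY_free(kpub);
        return 0;
    }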
Provided you can look up the appropriate public key for the receiver in a directory, and you can trust that the receiver has kept their private key secret, then you use their public key to encrypt the message, and you know that they’re the only one who can decrypt it. This allows Internet shopping sites, and the like, to work. If I wish to buy something from Amazon, I look up the key for Amazon in a directory, use that to encrypt the message I’m sending to Amazon, and I know that they’re the only ones that can decrypt it. The problem with public key cryptography is that it’s very slow. Public key algorithms such as the Diffie-Hellman algorithm, the RSA algorithm, and the elliptic curve algorithms, work millions of times slower than symmetric encryption algorithms. The result is that they’re too slow to use for any realistic amount of communication. The performance just isn’t there. Accordingly, modern communications use what’s known as hybrid cryptography, where they use a combination of both public key and symmetric cryptography. This provides both security and speed. The way this works is that the sender and receiver use public key cryptography, which is very slow, to exchange a small amount of information. That information is then used as the key for the symmetric encryption algorithm, which is very fast. In detail, the sender chooses a random value, that we’ll call Ks, which will be used as the key for the symmetric encryption. The sender then looks up the receiver’s public key, Kpub, uses it to encrypt Ks and sends the result to the receiver. The receiver uses its corresponding private key, Kpriv, to decrypt the message and retrieve Ks. This securely transfers Ks, the key for the symmetric encryption algorithm, from the sender to the receiver. Doing this using public key encryption is very slow, but the key for the symmetric encryption, Ks, is very small, so the fact it’s very slow doesn’t matter. The sender then uses that key, Ks, to encrypt future messages using symmetric cryptography, for example, using the AES algorithm. The receiver also has Ks, which it exchanged using the public key encryption, and can use that to decrypt the messages. Symmetric cryptography is very fast, so the performance of the communication, once it’s got started, is very quick, but it requires the key to be exchanged securely. The public key algorithm, which is slow, is used to securely exchange the key. The result is something which achieves both confidentiality, and solves the key distribution problem, and also achieves good performance. Encryption gives you confidentiality of data and makes sure that no one can eavesdrop on the messages being sent from the sender to the receiver. We also, though, need to verify the identity of the sender, and make sure that messages haven’t been modified in transit. In order to do this, we generate a digital signature to authenticate the messages. And the receiver can then validate that signature, check the signature, to make sure they came from the expected sender. The digital signature relies on a combination of public key cryptography, and a cryptographic hash algorithm. So first of all, what is a cryptographic hash? A cryptographic hash function is a function that takes some arbitrary length input and produces a fixed length output hash that somehow represents that input. For example, at the top of the slide, we see some input text going through a hash algorithm, known as SHA256, that produces the fixed length output block you see on the right.
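As an aside, computing such a hash in code is straightforward; the sketch below uses OpenSSL’s SHA-256 implementation as one example. Changing even one character of the input produces a completely different 32-byte output.

    /* Sketch: computing a SHA-256 hash with OpenSSL.
     * Compile with: gcc sha256_sketch.c -lcrypto */
    #include <stdio.h>
    #include <string.h>
    #include <openssl/sha.h>

    int main(void) {
        const char *input = "The quick brown fox jumps over the lazy dog";
        unsigned char digest[SHA256_DIGEST_LENGTH];     /* 32 bytes */

        SHA256((const unsigned char *)input, strlen(input), digest);

        /* Print the fixed-length fingerprint as hexadecimal. */
        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++) {
            printf("%02x", digest[i]);
        }
        printf("\n");
        return 0;
    }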
A cryptographic hash algorithm has four fundamental properties. The first is that every input will generate a different output, and the slightest change to the input will change the output value. The second is that it should be infeasible to find two inputs that give the same output. The third is that calculating the hash itself should be fast, and going from input to output should happen very quickly. And the fourth, and perhaps most important, is that reversing a hash should be infeasible. If you’re only given the output, there should be no way of finding out what the input was. A cryptographic hash therefore acts as a unique fingerprint for the input data. It provides a short output, that uniquely identifies a given message. There are many different cryptographic hash algorithms. The current recommendation is the SHA256 algorithm specified by the IETF in RFC 6234. There are a number of older algorithms, such as MD5 and SHA1, which you may hear about, but these all have known security flaws and are not recommended for use. So how can we use a cryptographic hash to help build a digital signature? Well, in order to do that, you take the message you wish to send, and you calculate a cryptographic hash of that message. The sender then encrypts that hash with their private key. Now the private key is known only to the sender, so they’re the only one who can encrypt that message. But the thing which would decrypt it is the sender’s public key, which is available to everybody. Encrypting the hash with the sender’s private key doesn’t provide any confidentiality, because anyone can decrypt the message using the public key. What it does do though, provided the sender can be trusted to keep its private key private, is demonstrate that the sender must have encrypted the hash. Since the hash is a fingerprint of the message, this means that the sender must have generated the original message. The sender then attaches the encrypted hash to the message, forming the digital signature. The message, and its digital signature, are then encrypted and sent to the receiver using hybrid encryption. When the message arrives at the receiver, the receiver can verify the signature. To do this, it first decrypts the message and its digital signature. The receiver then takes the message itself, and calculates its cryptographic hash. Having done that, it takes the digital signature, looks up the sender’s public key, and uses that to decrypt the digital signature to retrieve the original cryptographic hash that was in the message. It compares the hash, which was sent in the message as part of the digital signature, with the cryptographic hash it just calculated. If the two match, then it knows the message is authentic and has not been modified, provided it trusts the sender to have kept its private key private. If the hash of the message it calculated, and the hash that was sent in the digital signature, don’t match, then it knows that somehow the message has been modified in transit. Public key encryption is therefore one of the fundamental building blocks of a secure network. It allows us to send a message to a recipient securely, even if we’ve not met that recipient, and be sure that they’re the only one who’ll be able to decrypt that message. And it allows us to use digital signatures to verify that messages have not been modified in transit. The security of public key encryption, though, depends on knowing which public key corresponds to a particular receiver. There are three ways you can know this.
The first is that the receiver gives you their key in person. The second is that the receiver sends you their key, but the message in which they send it is authenticated by someone you trust. That is, there’s a digital signature in the message, signed by someone whose key you already have, that authenticates that this message is from who it claims to be from. The third is that someone you trust gives you the receiver’s key. In the Internet, the role of someone you trust is often played by an organisation known as a certificate authority, as part of a public key infrastructure. The role of a certificate authority is to validate the identity of potential senders. The certificate authority checks the identity of a potential sender, and then adds a digital signature to the sender’s public key to indicate that it’s done so. If a receiver trusts the public key infrastructure, trusts the certificate authority, then it can verify that digital signature, added by the certificate authority, to confirm the identity of the sender. These mechanisms, symmetric and public key encryption, and digital signatures, allow us to provide confidentiality for communication over the Internet that performs well and is secure. They allow us to authenticate messages, and demonstrate that they’ve not been modified in transit. And they allow us to validate the identity of senders of those messages.
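To illustrate the sign-and-verify process described in this part, here is a small sketch using OpenSSL’s EVP interface. The key file names sender_priv.pem and sender_pub.pem are hypothetical; EVP_DigestSign() hashes the message with SHA-256 and signs that hash with the private key, and EVP_DigestVerify() recomputes the hash and checks it against the signature using the public key.

    /* Sketch: creating and verifying a digital signature with OpenSSL.
     * Assumes an RSA key pair in "sender_priv.pem" / "sender_pub.pem" (hypothetical).
     * Compile with: gcc sign_sketch.c -lcrypto */
    #include <stdio.h>
    #include <string.h>
    #include <openssl/evp.h>
    #include <openssl/pem.h>

    int main(void) {
        unsigned char msg[] = "Example message to be authenticated";
        unsigned char sig[1024];
        size_t siglen = sizeof(sig);

        /* Sender: hash the message with SHA-256, sign the hash with the private key. */
        FILE *fp = fopen("sender_priv.pem", "r");
        if (fp == NULL) { perror("sender_priv.pem"); return 1; }
        EVP_PKEY *priv = PEM_read_PrivateKey(fp, NULL, NULL, NULL);
        fclose(fp);

        EVP_MD_CTX *mctx = EVP_MD_CTX_new();
        EVP_DigestSignInit(mctx, NULL, EVP_sha256(), NULL, priv);
        EVP_DigestSign(mctx, sig, &siglen, msg, strlen((char *)msg));
        EVP_MD_CTX_free(mctx);
        EVP_PKEY_free(priv);

        /* Receiver: recompute the hash and check it against the signature,
         * using the sender's public key. */
        fp = fopen("sender_pub.pem", "r");
        if (fp == NULL) { perror("sender_pub.pem"); return 1; }
        EVP_PKEY *pub = PEM_read_PUBKEY(fp, NULL, NULL, NULL);
        fclose(fp);

        mctx = EVP_MD_CTX_new();
        EVP_DigestVerifyInit(mctx, NULL, EVP_sha256(), NULL, pub);
        int ok = EVP_DigestVerify(mctx, sig, siglen, msg, strlen((char *)msg));
        EVP_MD_CTX_free(mctx);
        EVP_PKEY_free(pub);

        printf(ok == 1 ? "signature valid\n" : "signature INVALID\n");
        return 0;
    }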
Part 3: Transport Layer Security (TLS) v1.3
Abstract
The 3rd part of the lecture describes the operation of the Transport Layer Security Protocol (TLS) v1.3; one of the key security protocols used in the Internet.
In previous parts of this lecture I spoke about network security in general terms. In part one, I discussed why security is needed in order to protect Internet communications, and in part two, I spoke about how security is provided in outline. I spoke about the different types of encryption, public key and symmetric, the use of hybrid encryption, in order to improve performance while still maintaining security, and the ideas of digital signatures and public key infrastructure. In this third part of the lecture, I want to move on to talk about Internet security in specific terms. I want to talk about the Transport Layer Security protocol, TLS version 1.3. I’ll begin by introducing what TLS is, talking conceptually about what role it performs in the network stack. And I’ll talk through some of the details of TLS. I’ll talk about the TLS handshake protocol, that’s used to establish TLS connections. The record protocol, that’s used to exchange data. The 0-RTT extension, that reduces connection setup times. And finally, I’ll talk about some of the limitations of TLS. As we saw in some of the earlier lectures, TCP connections are not secure. Neither the TCP headers, nor the IP headers, nor the data they transfer are encrypted or authenticated in any way. Data sent in a TCP connection is not confidential. It can be observed by governments, businesses, network operators, criminals, or malicious users. Similarly, the data is not authenticated. Anyone who’s able to access the network connections, or the routers over which the data flows, is able to modify that data. And the sender and the receiver will not be able to tell that such modifications have been performed. In order to provide security for data going across a TCP connection, we need to run some sort of additional security protocol within that TCP connection to protect the data. The way this is typically done in the Internet, is using a protocol called the Transport Layer Security protocol. The latest version of this is TLS 1.3, and it’s used to encrypt and authenticate data that is carried within a TCP connection. The official specification for TLS 1.3 is RFC 8446, which was published by the IETF in the last couple of years. The TLS specification is not a simple document to read. In part, this is because it’s solving a difficult problem. Providing security over the top of an insecure connection, a TCP connection, is a complex challenge, and TLS has to define a number of complex mechanisms in order to provide that security. The other part of the complexity comes because TLS is an old protocol. The latest versions of TLS have to be backwards compatible, not only with previous versions of TLS as specified, but with previous implementation problems, and bugs in the TLS specification and in its implementations. The protocol designers have done a good job, though. TLS version 1.3 is smaller, faster, and simpler than previous versions of TLS, and it’s also more secure. The slide lists four blog posts which provide more information about TLS. The first one is an introduction to TLS 1.3 from the IETF. This was written by the TLS working group chairs, and introduces the new features in the protocol. The second, from CloudFlare, is a detailed look at what’s new in TLS 1.3, as compared to previous versions of TLS. It talks about some of the advantages of TLS 1.3, and how it improves security, and reduces the connection set up times. The third of these, from David Wong, attempts to redraw the TLS specification in a way that makes it easier to read.
This is a copy of RFC 8446, the TLS specification, with the diagrams redrawn in an easier to read way, and with explanatory videos and comments added to make it easier to follow. The final post is the most detailed. It’s an annotated packet capture showing the details of a TLS connection. This walks through the TLS connection establishment handshake, byte by byte, labelling each byte with reference to the specification to explain exactly what it means, and how the handshake proceeds. I encourage you to review these four blog posts. They give a nice complement to the material I’ll talk about in the rest of this lecture, introducing how TLS 1.3 works. So what’s the goal of TLS 1.3? Well, given an existing connection, that’s capable of delivering data reliably and in the order it was sent, but is insecure, TLS 1.3 aims to add security. That is, given a TCP connection, it aims to add authentication, confidentiality, and integrity protection to the data sent over that connection. In terms of authentication, it uses public key cryptography, and a public key infrastructure, in order to verify the identity of the server to which the connection is made. That is, the client can always verify that it’s talking to the desired server. In addition, it provides optional authentication for the client, to allow the server to verify the identity of the client. Once the connection has been established, and verified to be correct, TLS provides confidentiality for data sent across that connection. It uses hybrid encryption schemes to provide good performance, while still providing a strong level of security. Finally, TLS authenticates data sent across the connection, to provide integrity protection. It’s not possible for an attacker to modify data sent across a TLS connection without that modification being detectable by the endpoints. How does TLS 1.3 work? Well, first of all, a TCP connection must be established. TLS is not a transport protocol itself, and it relies on an underlying TCP connection in order to exchange data. Once the TCP connection has been established, TLS runs within that connection. There are two parts to a TLS connection. It begins with a handshake protocol, and then proceeds with a record protocol. The goal of the handshake protocol, at the beginning of the connection, is to authenticate the endpoints and agree on what encryption keys to use. Once this is completed, TLS switches to running the record protocol, which lets endpoints exchange authenticated and encrypted blocks of data over the connection. TLS turns the TCP byte stream into a series of records. It provides framing, delivers data block by block, each block being encrypted and authenticated to ensure that the data being sent in that block is confidential, and arrives unmodified. A secure connection over the Internet starts by establishing a TCP connection as normal. The client connects to the server, sending a SYN packet, along with its initial sequence number. The server responds with the SYN-ACK, acknowledging the client’s initial sequence number, and providing the server’s initial sequence number. And then the client responds with an ACK packet, acknowledging that packet from the server. This sets up a TCP connection. Immediately following that, the TLS handshake starts, running within the TCP connection itself. The TLS client sends a TLS ClientHello message to a server immediately following the final ACK of the TCP handshake.
The server responds to that with a TLS ServerHello message, and then the client in return responds with a TLS Finished message. This concludes the handshake, and carries the first block of secure data. Following this, the client and the server switch to running the TLS record protocol over the TCP connection, and exchange further secure data blocks. As can be seen, the TLS handshake adds an additional round trip time to the connection establishment. At the start of the connection, there’s an initial round trip time while the TCP connection is set up. And then this is followed by an additional round trip, while the TLS connection and the security parameters are negotiated, before the data can be sent. There’s a minimum of two round trip times from the start of the TCP connection to the conclusion of the TLS handshake and the first secure data segment being sent. The first part of the TLS handshake is the ClientHello message. This is sent from the client to the server, and begins the negotiation of the security parameters. The ClientHello message does three things. It indicates the version of TLS that is to be used. It indicates the cryptographic algorithms that the client supports, and provides its initial keying material. And it indicates the name of the server to which the client is connecting. You may wonder why the ClientHello message needs to indicate server name, given that it’s running over a TCP connection that’s just been established to that server. The reason for this is that TLS is often used with web hosting, and it’s common for web servers to host more than one website, so the server name provided in the TLS ClientHello indicates which of the sites accessible over that TCP connection the TLS handshake is trying to establish a secure connection to. The ClientHello message also indicates which version of TLS is to be used. What you would expect to happen here, is that it would indicate that it wishes to use TLS 1.3. What actually happens, though, is that the ClientHello message includes a version number indicating that it wants to use TLS version 1.2, the previous version of TLS. The ClientHello message includes an optional set of extension headers, and one of those extension headers includes an extension which says “actually I’m really TLS version 1.3”. The reason the version negotiation happens in such a weird way, specifying an old version of TLS in the version field, and using an extension to indicate the real version, is because there are too many middle boxes, too many devices which try to inspect TLS traffic in the network, and which fail if the version number changes. The protocol has become ossified. We waited too long between versions of TLS. Too many devices were deployed, too many endpoints were deployed, which only understood version 1.2 and which didn’t correctly support the version negotiation. And then, when it came to deploying a new version, and people tried with early versions of TLS 1.3 to just change the version number, it was found that many of those deployed devices didn’t support the change. The result was that connections that indicated TLS version 1.3 in the header would tend to fail, whereas those that pretended to be TLS version 1.2, using an extension header to upgrade the version number, would work through those middleboxes, and the connection could succeed and proceed with the new version. The ClientHello message is the first part of the connection setup handshake. It doesn’t carry any data.
Following the ClientHello, the server responds with a ServerHello message. The ServerHello message also indicates the version of TLS which is to be used and, like the ClientHello, it indicates that the version is actually TLS version 1.2 and includes an extension header to say that it’s really a TLS 1.3 connection that’s being established. In addition to the version negotiation, the TLS ServerHello includes the cryptographic algorithms selected by the server, which are a subset of the set suggested by the client. That is, the client suggests the cryptographic algorithms which it supports, and the server looks at those, finds the subset of them which are acceptable to it, picks one of them, and includes that in its response. The ServerHello message also includes the server’s public key, and a digital signature which can be used to verify the identity of the server. Like the ClientHello, it doesn’t include any data. Finally, the TLS handshake concludes with a Finished message, which flows from the client to the server. The TLS Finished message includes the client’s public key and, optionally, a certificate which is used to authenticate the client to the server. The TLS Finished message concludes the connection setup handshake. In addition to the connection setup, it may therefore include the first part of application data that is sent from the client to the server. TLS uses the ephemeral elliptic curve Diffie-Hellman key exchange algorithm in order to derive the keys used for the symmetric encryption. The client and the server exchange their public keys, as part of the connection setup handshake, and they then combine those two public keys to derive the key that’s used for the symmetric cryptography. The maths of how this works is complex. I’m not going to attempt to describe it here. What’s important though, is that the symmetric key is never exchanged over the wire. The client and the server only exchange their public keys, and the symmetric key is derived from those. A TLS server provides a certificate that allows the client to verify its identity as part of the ServerHello message. The client can optionally provide this information along with its Finished message. The result is that the client can always verify the identity of the server, and the server can optionally verify the identity of the client. The choice of encryption algorithm is driven by the client, which provides the list of the symmetric encryption algorithms that it supports as part of its ClientHello message. The server picks from these, and replies in its ServerHello. The usual result is that either the Advanced Encryption Standard, AES, or the ChaCha20 symmetric encryption algorithm is chosen. Once the TLS connection establishment protocol, the handshake protocol, has completed, the TLS record protocol starts. The record protocol allows the client and the server to exchange records of data over the TCP connection. Each record can contain up to two to the power 14 bytes of data, and is both encrypted and authenticated. Records of data have a sequence number, and they are delivered reliably, securely, and in the order in which they were sent. The underlying TCP connection does not preserve record boundaries. TLS adds framing to the connection so that it does so, and reading from a TLS connection will block until a complete record of data is received. A TLS connection usually uses the same encryption key to protect data for the entire connection.
However, in principle, it can renegotiate encryption keys between records, if there’s a need to change the encryption key partway through a connection. The TLS record protocol allows the client and the server to exchange records, to send and receive data as they see fit. Once they finish doing so, they close the connection, which closes the underlying TCP connection. TLS 1.3 usually takes one round trip time to establish the connection after the TCP connection set up. That is, there’s the TCP SYN, SYN-ACK, ACK handshake to establish the TCP connection, and then an additional round trip time for the TLS ClientHello, ServerHello, Finished exchange. However, if the client and the server have previously communicated, TLS 1.3 allows them to reuse some of the connection setup parameters, and re-use the same encryption key. The way this works is that the server can send an additional encryption key as part of its ServerHello message, and the client can remember that key, and use it the next time it connects to the server. This is known as a pre-shared key. When the client next connects to that server, it sends its ClientHello message as normal. However, in addition to that ClientHello message, it can also include some data, and that data is encrypted using the pre-shared key. The ServerHello also proceeds as normal. But, again, it can contain data encrypted using the pre-shared key, sent in reply to the data the client included in its ClientHello message. The use of the pre-shared key therefore allows the client and the server to exchange data along with the initial connection setup handshake. It allows data to be exchanged within zero RTTs of the connection set up, as part of the first round trip. This extension is therefore known as the 0-RTT mode of TLS 1.3. The 0-RTT mode is useful, because it allows connections to start sending data much earlier. It removes one round trip time’s worth of latency. However, it has a limitation. The limitation is that, unlike the record packets which contain a sequence number, TLS ClientHello and ServerHello messages don’t contain a sequence number. A consequence of this is that data sent as part of a ClientHello, or a ServerHello, may be duplicated, and TLS has no way of stopping this. If you’re writing an application that uses TLS in 0-RTT mode, you need to be careful to only send what’s known as idempotent data in the 0-RTT packets: data where it doesn’t matter if it’s delivered to the server more than once. Data that is sent after the first round trip time has concluded, as part of the regular TLS connection, doesn’t suffer from this problem, and is only ever delivered to the application once. A TLS connection is secure, but it has a number of limitations. TLS operates within a TCP connection. A consequence of this is that the IP addresses and the TCP port numbers are not protected. This exposes information about who is communicating, and what application is being used. Further, the TLS ClientHello message includes the server name, but doesn’t encrypt that. This exposes the host name of the server to which the connection is being made, and may be a significant privacy leak. An extension, known as Encrypted Server Name Indication, is under development, but this is not finished yet, and there are some concerns that it may be very difficult to deploy. TLS also relies on a public key infrastructure to validate the keys, and to verify the identity of clients and servers.
There are some significant concerns about the trustworthiness of this public key infrastructure. The reasons for this are not that the cryptographic algorithms or the mechanisms are insecure; they’re that the browsers tend to trust a very large range of certificate authorities, and it’s not clear to what extent all of these certificate authorities are actually trustworthy. The final limitation of TLS is that the 0-RTT extension may deliver data more than once. 0-RTT is a very useful extension, because it allows data to be delivered with low latency at the start of the connection, but it runs the risk that the data is delivered multiple times, so must be used with care. That concludes the discussion of TLS. I spoke about what TLS is. I’ve talked about the TLS handshake protocol, that establishes the connection using the ClientHello, ServerHello, and Finished messages, and that agrees the appropriate cryptographic parameters. And I spoke about the TLS record protocol, which is used to actually exchange the data. The TLS 0-RTT extension allows for faster data transfer at the beginning of the connection, but comes with some risks of data replay attack. Finally, I spoke about some of the limitations of TLS. The TLS protocol has actually been wildly successful. It’s used to secure all the traffic sent over the web. And, when used correctly, it is very much a secure protocol that performs very well. In the final part of the lecture, I’ll move on from talking about the details of the cryptographic mechanisms, and the transport protocols, to talk about some of the issues with writing secure software.
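To tie the pieces of this part together, the following sketch shows a minimal TLS 1.3 client written in C, using OpenSSL as one example library. It establishes a TCP connection, runs the TLS handshake inside it, and then exchanges data using the record protocol. The host name example.com and the use of the system’s default CA certificates are assumptions made purely for illustration.

    /* Sketch: a minimal TLS 1.3 client over TCP, using OpenSSL.
     * Assumes host "example.com", port 443, and the system CA store.
     * Compile with: gcc tls_client_sketch.c -lssl -lcrypto */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <openssl/ssl.h>

    int main(void) {
        const char *host = "example.com";

        /* 1. Establish the TCP connection (SYN, SYN-ACK, ACK). */
        struct addrinfo hints = {0}, *res;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, "443", &hints, &res) != 0) return 1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (connect(fd, res->ai_addr, res->ai_addrlen) != 0) return 1;
        freeaddrinfo(res);

        /* 2. Run the TLS handshake inside the TCP connection. */
        SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());
        SSL_CTX_set_min_proto_version(ctx, TLS1_3_VERSION);  /* insist on TLS 1.3 */
        SSL_CTX_set_default_verify_paths(ctx);               /* trusted CAs       */
        SSL_CTX_set_verify(ctx, SSL_VERIFY_PEER, NULL);      /* check the server  */

        SSL *ssl = SSL_new(ctx);
        SSL_set_fd(ssl, fd);
        SSL_set_tlsext_host_name(ssl, host);   /* server name in the ClientHello */
        SSL_set1_host(ssl, host);              /* certificate must match host    */
        if (SSL_connect(ssl) != 1) {           /* ClientHello / ServerHello / Finished */
            fprintf(stderr, "TLS handshake failed\n");
            return 1;
        }

        /* 3. Exchange data using the TLS record protocol. */
        const char *req = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n";
        SSL_write(ssl, req, strlen(req));
        char buf[4096];
        int n = SSL_read(ssl, buf, sizeof(buf));
        if (n > 0) fwrite(buf, 1, n, stdout);

        SSL_shutdown(ssl);
        SSL_free(ssl);
        SSL_CTX_free(ctx);
        close(fd);
        return 0;
    }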
Part 4: Discussion
Abstract
The final part of the lecture discusses systems aspects of providing secure communication. It reviews the need for end-to-end security to protect communications. It discusses the robustness principle, and its implications for the design of input parsers and other aspects of networked systems. And it briefly reviews some of the challenges in writing secure code.
In the previous parts, I’ve spoken about the general principles underlying secure communication and about the Transport Layer Security Protocol, TLS 1.3, that protects most Internet communications. In this final part of the lecture, I want to raise some issues to consider when developing secure networked applications. In particular, I want to discuss the need for end-to-end security and the problems of providing secure communication in the presence of content distribution networks, servers, and middle boxes. I want to talk about the Robustness Principle and the difficulty in designing and building networked applications. And I want to talk about the need to carefully validate input data, and about some of the issues around writing secure code. For communication to be secure, it must be end-to-end. That is, a secure communication must run between the initial sender and the final recipient. And the message must not be decrypted or lose integrity protection at any point along the path. That is harder to arrange than you might imagine. If the communication is between a client and a server located in a data centre, it’s easy to understand what the client endpoint is. It’s the phone, tablet, or laptop on which the application making the request is running. What is the endpoint in the data centre, though? Does the secure communication terminate at the load balancing device at the entrance to the data centre that chooses which of the many possible servers responds to the request? If so, does that load balancer make a secure onward connection to the back-end server? Or is the connection unprotected within the data centre? Alternatively, if the secure connection passes through the load balancer and terminates on the back-end server, are the connections between the back-end servers and the databases, compute servers, and storage servers in other parts of the data centre secure? And, once the request has been handled, how is the data protected once it’s stored in the data centre? What’s your threat model? Are you concerned about protecting your communication as it traverses the wide area network between your client and the data centre? Or are you also concerned with protecting communications within the data centre? If you’re concerned about communications and data storage within the data centre, are you trying to protect against other tenants in the data centre, or against malicious users that may have compromised the data centre infrastructure, or against the data centre operator? Similar issues arise with content distribution networks. CDNs, such as Akamai, are widely used as the back-end infrastructure for websites, software updates, streaming video services, and gaming services. Applications like the Steam Store, the BBC iPlayer, Netflix, and Windows Update have all run on CDNs at various times, although many of them now use their own infrastructure. CDNs are essentially large-scale, highly distributed web caches. They provide local copies of data to improve performance compared to having to fetch the content from the master site. The secure HTTPS connection is therefore from the client to the CDN, rather than from the client to the original site. This introduces an intermediary into the path. The CDN now has visibility into what requests the client is making, in addition to the original service. Performance is better, but you’re forced to trust a third party with information about what sites you’re visiting.
Equally, the data has to get to the CDN caches somehow, and has to be protected as it’s fetched from the original server to populate the cache. You have to trust the CDN to do this correctly. As a user of the CDN, you have no way of knowing how, or indeed if, that data is secure. In many cases, data is moving between two users. Is that data encrypted end-to-end between the two users? Or is the data encrypted between the users and some data centre, but visible to the data centre? The difference can matter. If the data centre has access to the unprotected data, it may be used to target advertising, and it’s much more likely to be accessible to law enforcement or government monitoring. Many applications use some form of in-network processing. For example, video conferencing systems often use a central server to perform audio mixing, and to scale the video to produce thumbnails. For example, in a large video conference, if many users are sending video, then all the video goes to a central server. That server only forwards high-quality video for the active speaker, and sends a smaller, more heavily compressed version for the other participants. This reduces the amount of video sent out to each of the participants, and prevents overloading their network connections. This is a good thing. But it also means that the central server has access to the audio and video data. The server can record that video if it so chooses, and potentially share it with others. That may be a concern, depending on what’s being discussed. An alternative way of building such an application leaves the data encrypted, and doesn’t give the server access. This increases the privacy of the users, since the data is encrypted end-to-end, and isn’t available to the server. That means that the server can’t help compress the data and manage the load. And it means that server-based features, like cloud recording and captioning, become much harder to provide. It trades off features and performance for increased privacy. When building networked applications, it’s important to consider how the network protocol is implemented. Network protocols can be reasonably complex, and difficult to implement. They have a syntax and semantics, in many ways similar to a programming language. And, like a program, the protocol messages your application receives may contain syntax errors or other bugs. What do you do if the protocol data you receive is incorrect? A frequently quoted guideline is Postel’s Law. This is named after Jon Postel, the original editor of what became the IETF’s RFC series of documents, and an influential contributor to the early Internet. Postel’s Law can be summarised as “Be liberal in what you accept, and conservative in what you send.” That is, when generating protocol messages, try your hardest to do so correctly. Make sure the messages you send strictly conform to the protocol specification. But, when receiving messages, accept that the generator of those messages may be imperfect. If a message is malformed, but unambiguous and understandable, Postel’s Law suggests to accept it anyway. That’s fine, but it’s important to balance interoperability with security. Don’t be too liberal in what you accept. Having a clear specification of how and when you fail might be more appropriate. Postel’s Law says “Be liberal in what you accept, and conservative in what you send.” That makes sense if you trust the other devices on the network. 
It makes sense if the problems with the messages they send are likely to be honest mistakes, and not intended to be malicious. The network has changed since Postel’s time, though. As Poul-Henning Kamp, one of the FreeBSD developers, says, “Postel lived on a network with all his friends. We live on a network with all our enemies. Postel’s Law was wrong for today’s Internet.” This is an important point. Any network system is frequently attacked. There are many people scanning the network for vulnerabilities, actively trying to break your applications. If you write a server and make it accessible on the Internet, then people will try to break it. This is not because you’re a target. It’s because the machines and network communications are now fast enough that it’s possible to scan every machine on the Internet to see if it’s vulnerable to a particular problem within a few hours. It’s not personal, but your server will be attacked. The paper shown on the slide, “Maintaining Robust Protocols” by Martin Thomson and David Schinazi, talks about this in detail, and gives detailed guidance on how to handle malformed messages. If you write networked applications, I strongly encourage you to read it. One of the key points made is that networked applications work with data supplied by untrusted third parties. As we’ve discussed, data read from the network may not conform to the protocol specification. This may be due to ignorance, bugs, malice, or a desire to disrupt services. One of the critical lessons is that you must carefully validate all data received from the network before you make use of it. Don’t trust arbitrary data that comes from another device over the network. Check it carefully, and make sure it contains what you expect before use. This is especially important when working in scripting languages that often contain escape characters that trigger special processing. The cartoon on the slide is an example. The idea is that the software processing the student’s name sees the closing quote and interprets the rest of the name as SQL commands to delete the student records from the database. It’s a silly example. But it’s surprising how often similar problems, known as SQL injection attacks, occur in practice. And similar problems occur in many other programming languages. It’s not just an SQL-related problem. Be careful how you process data. And in general, be careful how you write networked applications. The network is hostile. Any networked application is security critical. Anything that receives data from the network will be attacked. When writing networked applications, carefully specify how they should behave with both correct and incorrect inputs. Carefully validate inputs and handle errors. And check that your code behaves as expected. Try to break your application before someone else does. If your application is written using a type- or memory-unsafe language, such as C or C++, take extra care, since these languages have additional failure modes. It’s very easy to write a C or C++ program that suffers from buffer overflows, use-after-free bugs, race conditions, and so on. Such bugs are almost certainly security vulnerabilities. As a rule of thumb, if you’ve written a C or C++ program and can cause it to crash with a segmentation violation message, then that’s probably exploitable as a security vulnerability. Have you ever managed to write a non-trivial C program that never crashes that way? This is why network programming is difficult.
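As a small illustration of this kind of defensive parsing in C, the sketch below validates a hypothetical message format consisting of a 2-byte big-endian length followed by a payload. Every value received from the network is checked against the amount of data actually read before it is used, so a malicious or buggy peer cannot trigger an out-of-bounds copy.

    /* Sketch: validating untrusted network input before using it.
     * The message format (2-byte length + payload) is hypothetical. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define MAX_PAYLOAD 1024

    /* Returns the payload length on success, or -1 if the message is malformed. */
    int parse_message(const uint8_t *buf, size_t buflen,
                      uint8_t payload[MAX_PAYLOAD]) {
        /* The message must contain at least the 2-byte length header. */
        if (buflen < 2) {
            return -1;
        }

        /* The claimed length comes from the network: treat it as untrusted. */
        size_t claimed = ((size_t)buf[0] << 8) | buf[1];

        /* Reject lengths larger than our buffer, or larger than the data we
         * actually received; never let the peer's value drive a memcpy. */
        if (claimed > MAX_PAYLOAD || claimed > buflen - 2) {
            return -1;
        }

        memcpy(payload, buf + 2, claimed);
        return (int)claimed;
    }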
The network today is an extremely hostile environment. Networked applications are security critical, and writing secure code is a very difficult skill. If you have the choice, use popular, well-tested, pre-existing software libraries for network protocols where possible. Especially do so for implementations of security protocols, such as TLS. And make sure to update these libraries regularly, because problems and security vulnerabilities are found frequently. The best encryption in the world doesn’t help if the endpoints can be compromised and the data stolen before it’s encrypted. This concludes our discussion of secure communications. In the first part of the lecture, I spoke about the need for secure communication, and some of the challenges and trade-offs in enabling security. In the second part, I discussed the principles of secure communication in abstract terms, talking about symmetric and public key encryption, and how these are combined to give hybrid encryption protocols. I spoke about digital signatures to authenticate data, and about public key infrastructure and certificate authorities. Then I spoke about the Transport Layer Security Protocol, TLS 1.3, that instantiates hybrid encryption and digital signatures into a concrete network protocol that secures web traffic and other applications. And finally, I’ve discussed some issues to consider when writing networked applications. Ensuring communication security is a difficult problem. It’s technically difficult, because you need to write extremely robust software, and need to design secure network protocols that use sophisticated cryptographic mechanisms. And it’s politically difficult, because there are some extremely sensitive policy questions around what information should be protected, and against whom. The TLS 1.3 protocol is the current state-of-the- art in secure communications. In the next lecture, we’ll move on to further discuss its limitations, and some of the ways in which people are trying to improve network security and performance.
L3 Discussion
Summary
Lecture 3 discussed secure communication. It started with a discussion of the need for security, and the issues with balancing security, privacy, and the needs of law enforcement, regulatory compliance for businesses, and the need to effectively manage networks. It then moved on to discuss the principles by which secure communication can be achieved, via a mix of symmetric and public key encryption and digital signatures. And it outlined how these are used in the transport layer security protocol, TLS.
The focus of the discussion will be to check your understanding of the principles of security. How do symmetric and public key encryption work, and how are they combined in practice? And how do digital signatures work? The mathematics behind this work is outside the scope of this course, and will not be discussed, but the principles are important.
Discussion will also consider how does TLS use these techniques to ensure security. How does the TLS handshake work? What guarantees does TLS provide to applications? How does the use of 0-RTT session resumption change those guarantees and what benefits does it provide in exchange?
Finally, the discussion will also focus on the need to consider the different impacts of providing secure communication. There are clear benefits to providing security, but also some unexpected costs that can lead to tension between users, vendors, network operators, businesses and governments. The discussion will start to highlight some of these issues. What should we encrypt? What are the trade-offs of privacy vs law enforcement access? What doesn’t encryption protect?
Lecture 4
Abstract
This lecture discusses some of the limitations of TLS v1.3, considering connection establishment performance, metadata leakage, and protocol ossification. It then introduces the QUIC transport protocol. QUIC is a new transport protocol, that tries to improve on the performance of TLS over TCP while providing additional features.
Part 1: Limitations of TLS v1.3
Abstract
The first part of the lecture discusses some limitations of TLS v1.3. It explains why TLS slows down connection establishment and discusses the 0-RTT connection re-establishment feature, its benefits, and its risks and limitations. It discusses metadata leakage from TLS connections, both data exposed in the TCP header and the Server Name Indication (SNI) field in the TLS handshake. And it reviews some of the protocol ossification issues that have affected TLS development.
In this lecture, I want to talk about some ways in which secure connection establishment is being improved in the Internet. I’ll talk about some of the limitations of TLS 1.3 that constrain its security, and that slow down its connection establishment. And I’ll talk about improvements being developed to address these problems, both to improve TLS 1.3 when used with TCP, and to entirely replace TCP and TLS with a new transport protocol, QUIC, that has faster connection establishment and in-built security. In the first part, I’ll talk about some of the limitations of TLS 1.3 and how they’re being addressed. I’ll revisit the problem of slow connection establishment, and describe new issues around metadata leakage and protocol ossification. And I’ll discuss how TLS is being extended to solve these problems. Then, in the following parts of the lecture, I’ll talk about the QUIC transport protocol. QUIC is a new transport protocol, just finishing development in the IETF, that aims to improve on the performance and security of TLS and TCP, and add some new features, while avoiding problems due to protocol ossification. To start with, though, I want to talk about some of the limitations of TLS 1.3. The Transport Layer Security protocol, and TLS 1.3 in particular, has been a great success. The protocol has been successfully extended and updated over many years, to keep abreast of security challenges. It’s added new features to meet the needs of new applications, such as datagram TLS for security of video conferencing calls. And it has good performance and is widely used. TLS 1.3 is the most recent version of TLS, and was standardised in 2018. Compared to TLS 1.2, the previous version, it provides some significant security and performance improvements. TLS 1.3 removed support for older and less secure encryption and key exchange algorithms. It removed support for algorithms that are known to be secure if implemented correctly, but that have proven difficult to implement correctly. It improved performance of the handshake, reducing the number of round-trips needed to negotiate security. And it simplified the design of the protocol. Despite this, TLS 1.3 still has some limitations that are difficult to fix. Connection establishment is still relatively slow. Connection establishment leaks potentially sensitive metadata, that can be viewed as a security risk. And the protocol is ossified, due to poor quality implementations and middleboxes, and difficult to extend. The first issue is TLS connection establishment performance. TLS is a security protocol, not a transport protocol, and needs to be run inside an existing transport-layer connection. In practice, this means that TLS usually runs inside a TCP connection. When connecting to a TLS-enabled server, a client must first establish a TCP connection to the server. This proceeds with the usual TCP connection establishment handshake. The client sends a TCP segment with the SYN bit set, indicating the start of the connection. The server responds with a SYN ACK. And the client completes the handshake with an ACK packet. This handshake takes one-and-a-half round trips to complete, with the client being able to send data after the first round trip. That is, the client can send a data packet immediately after it sends the final ACK packet in the handshake, without waiting for a response from the server. When using TLS, this initial TCP data packet contains a TLS ClientHello message. This is the first part of the TLS handshake.
The ClientHello indicates the version of TLS the client supports, and starts the negotiation of the encryption and authentication keys. The TLS server responds with a TLS ServerHello message containing the information needed to establish the secure connection. Then the client responds with a TLS Finished message, confirming the secure connection. It’s only at this point, two full round trip times after starting the connection establishment, that the client can send a secure request to the server, along with the TLS Finished message. And the earliest that a response can be received is one round trip later, a total of three round trips after starting the connection setup. Do these three round trips matter? The table on the right shows some typical round trip times from Glasgow. Three round trips can easily be somewhere in the 250 to 500 millisecond range. Measurements show that an average web page fetches about 1.7 megabytes of data, comprising around 69 HTTP requests, using around 15 TCP connections. And about 80% of those connections use TLS. Some of the TCP and TLS connections run in parallel, certainly, but not all of them. And most of the time taken to load a typical web page is actually spent waiting for TCP and TLS connection establishment handshakes. If we want to make web pages load faster, reducing the connection establishment time is a very effective way of doing so. Can we speed up the TLS connection establishment handshake? Yes. There are two ways. The first is a technique known as 0-RTT connection re-establishment. This lets us save the security parameters the first time we connect to a server, and re-use them in future. It improves performance if we fetch more than one page from the same website. I’ll talk about 0-RTT connection re-establishment in a minute. The second way to improve the connection establishment time is to somehow overlap the TCP and TLS handshakes so they happen in parallel. At present, we open a TCP connection, which takes one round trip before the client can send data, and then negotiate the TLS security parameters, taking another round trip. We could speed up the process by overlapping these two handshakes. For example, we could send the TLS ClientHello message along with the TCP SYN packet, and the TLS ServerHello message could be sent in the SYN-ACK packet. This would let the whole connection establishment handshake, TCP and TLS, complete in one go. Now, doing this with TCP would require widespread operating system changes, and updates to all the firewalls in the world, and so it isn’t likely to happen. But the QUIC transport protocol, that I’ll talk about in later parts of the lecture, does support this optimisation. The 0-RTT connection re-establishment technique builds on the observation that it’s common to connect to a previously known TLS server. We frequently fetch more than one web page from a site. Or visit the same sites every day. Or connect to the same messaging, chat, or video conferencing server. Surely it should be possible to shortcut the connection establishment in such cases, and reuse information we exchanged the first time to speed things up? It is, but to do so we need to understand three things. Firstly, what is the role of the TLS handshake, and how are the security parameters negotiated? Second, how can we encrypt initial data on re-establishing a secure connection? And, finally, what are the potential risks in short-cutting the key exchange? So, what is the role of the TLS handshake? It does two things.
The first is to use public key cryptography to securely establish a session key. That session key is then used to encrypt the data being sent over the TLS connection, using a symmetric encryption algorithm. The ClientHello and ServerHello messages, exchanged in the TLS handshake, are used to derive this session key. In TLS 1.3, this is done using the ephemeral elliptic curve Diffie-Hellman key exchange algorithm – ECDHE. The TLS client and server exchange their public keys in the ClientHello and ServerHello messages, along with a randomly chosen value. The public keys and random values exchanged in the TLS handshake, and the private keys known only to each of the endpoints, are used to derive the session key, using the elliptic curve Diffie-Hellman algorithm. The maths of how this is done is complicated, and I’m not going to try to explain it. The key point, though, is that every TLS connection, even if between the same client and server, generates a unique session key. This provides a property known as forward secrecy. If the encryption key for one TLS connection somehow leaks to the public, then it exposes the data sent over that connection, but it doesn’t help an attacker break the encryption on any other connections between those endpoints. The other role of the TLS handshake is to retrieve the server’s certificate, that will have a digital signature from some Certificate Authority, part of the public key infrastructure, allowing the client to confirm the identity of the server. The ECDHE algorithm requires information from both the ClientHello and the ServerHello messages to derive a session key. It can’t be used until the handshake has completed. This is why TLS usually takes a complete round trip to negotiate the session keys. You can’t start encrypting the data until after the ClientHello and ServerHello messages have been exchanged, since those messages are used to derive the encryption key. 0-RTT connection re-establishment gets around this by sharing a key in one TLS session, to be used in the next. The first time a particular client makes a connection to a TLS server, the server sends it two additional pieces of information along with its ServerHello message. It sends a PreSharedKey and a SessionTicket. The PreSharedKey is an encryption key, that the client can use to encrypt data it sends to the server. And the SessionTicket tells the server which key was given to the client. These are sent over the encrypted TLS connection, so they’re only known to the client and the server. When it later reconnects to the same server, the client can use this information to send encrypted data along with the ClientHello message. In this case, the client establishes a TCP connection to the server. Then, it sends a TLS ClientHello message to the server in the usual way. But, along with this, it also sends the SessionTicket it was previously given, and some data. This data is encrypted using the PreSharedKey that was sent by the server in a previous connection, and goes along with the SessionTicket. On receiving such a message, the server responds with a ServerHello as normal. It also uses the SessionTicket to look up its copy of the PreSharedKey, and decrypt the data sent from the client. If the server wants to reply to the client, it sends its response along with the ServerHello message. Since the server has received the ClientHello message, and knows what it sent in its ServerHello message, it can derive an ephemeral session key and use it to encrypt the response.
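From the client application’s point of view, this re-establishment is driven through the TLS library. A minimal sketch using OpenSSL’s early data API, assuming OpenSSL 1.1.1 or later; here ctx, fd, and saved_session are assumed to already exist, with saved_session being an SSL_SESSION kept from an earlier connection to the same server, and error handling is omitted:

    /* Sketch: 0-RTT re-establishment from the client's point of view.
     * "saved_session" is an SSL_SESSION retained from a previous
     * connection (e.g. obtained via SSL_get1_session()), and "fd" is a
     * freshly connected TCP socket. */
    SSL *ssl = SSL_new(ctx);
    SSL_set_fd(ssl, fd);
    SSL_set_session(ssl, saved_session);      /* reuse the SessionTicket/PreSharedKey */

    const char *req = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    size_t written = 0;
    if (SSL_SESSION_get_max_early_data(saved_session) > 0) {
        /* This data travels alongside the ClientHello, encrypted under
         * the pre-shared key: no extra round trip, but not forward
         * secret and potentially subject to replay, so it should be
         * idempotent. */
        SSL_write_early_data(ssl, req, strlen(req), &written);
    }
    SSL_connect(ssl);                          /* complete the normal handshake */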
The client can likewise decrypt that response, since the ServerHello it receives provides the information needed to complete the key exchange. This handshake is shown on the right of the slide. This process of sending a PreSharedKey in one TLS session, and using it to encrypt the data sent in the initial message of the following session, is known as 0-RTT connection re-establishment. It provides a way for the client to send encrypted data to the server in the first packet it sends after the TLS connection is opened. That is, it allows data to be sent with no additional round trips. The usual TLS handshake also takes place, with the ClientHello and ServerHello messages used to derive a session key. The PreSharedKey is only used to encrypt the very first message sent from the client to the server on the connection. 0-RTT connection re-establishment allows data to be sent immediately, reducing TLS connection establishment latency. But it introduces two risks. The first is that data sent along with the ClientHello, and encrypted using a PreSharedKey, is not forward secret. The use of a PreSharedKey links two TLS connections together. The key is sent securely over one connection, encrypted using the ephemeral session key of that session, and is used to encrypt the data sent in the first message of the next connection. If, for some reason, the ephemeral session key for the first session is exposed, breaking the encryption of that session, then it can be used to extract the PreSharedKey. This, in turn, breaks the encryption for the data sent in the first message of the next session. The rest of the data sent in the next session is safe, because it’s encrypted using the ephemeral session key for the next session, not the PreSharedKey. Secondly, data sent along with the ClientHello message and encrypted using a PreSharedKey is subject to a replay attack. If an on-path attacker captures and re-sends the TCP segment containing the ClientHello, the SessionTicket, and the data protected with the PreSharedKey, that data will be accepted by the server again, and the server will respond, trying to complete the handshake. This will fail, since the connection has already been established, but by then the data has already been accepted and processed by the server. The TLS record protocol, used once the connection has been established, can protect against such replay attacks, but to do so it uses the information exchanged in the handshake. TLS isn’t able to protect against replay of 0-RTT data sent in the first message from the client to the server. Accordingly, if you use TLS with 0-RTT re-establishment, you should make sure that any data you send in the first message is idempotent. That is, it won’t do any harm if it’s received and processed twice. Because of these two limitations, it’s important to be careful when using 0-RTT data in TLS. The 0-RTT connection re-establishment mode can improve connection setup times, but it has to be used with care to avoid replay attacks. There are a couple of other limitations of TLS that you should be aware of, though. The first is that TLS runs over a TCP/IP connection, and the TCP and IP headers are not encrypted. This exposes certain information to anyone eavesdropping on the connection. An eavesdropper can observe the IP addresses between which the packets are sent, and the TCP port numbers. The TCP port numbers tell you what service is being used.
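To see how much the port number alone reveals, consider the kind of lookup an eavesdropper can do. The mappings below are standard well-known port assignments; this is purely an illustrative sketch:

    /* Sketch: inferring the service from an observed TCP port number.
     * These are standard well-known port assignments. */
    const char *guess_service(unsigned port) {
        switch (port) {
        case 443: return "HTTPS web server";
        case 25:  return "SMTP mail server";
        case 993: return "IMAP mail server (over TLS)";
        case 853: return "DNS over TLS";
        case 22:  return "SSH";
        default:  return "unknown service";
        }
    }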
They expose, for example, whether the TLS connection is made to a web server, an email server, a video conferencing server, a messaging server, or whatever. And the IP addresses often tell you what site is being accessed. Small sites tend to use shared hosting, where many websites run on a shared server with a single IP address, but popular sites need dedicated machines to host them, and so have their own IP addresses. An eavesdropper can often use the IP addresses to identify what site is being accessed. Maybe that doesn’t matter. Knowing that a particular user has connected to an IP address owned by Google maybe doesn’t tell you much. But knowing that they connected to an IP address that hosts a website giving information about a particular disease, or for a particular political party, or for a particular niche social network, might expose more information about the user, even without knowing what they said or being able to access the data sent over that connection. The TCP header also sends sequence numbers unencrypted. This exposes how much information was sent over the connection, and how fast the data is being downloaded. This information is often less sensitive than the IP addresses and TCP port numbers, but not always. For example, it’s been shown to be possible to identify what movie someone is watching, on a site like Netflix, just by looking at the pattern of download rates for each scene, even if the data is encrypted. The TCP sequence numbers expose this information. A bigger risk with TLS is that, when used with HTTPS, the TLS ClientHello message will include the Server Name Indication extension. The server name indication provides the name of the site being requested, so the server knows what public key to include in its ServerHello message. It’s needed to support shared hosting, where many websites are hosted on a single web server with a single IP address, where the server can’t tell what site is being requested based on the address to which the client connects. Since the client can’t know how many sites are hosted on each server, it always has to send the server name indication field to indicate what server it wants. And worse, the server name indication field has to be unencrypted. This is because it’s sent in the first packet, before the session key has been negotiated, because it’s used to determine what server generates the ServerHello. Similarly, the server name indication can’t be protected by a PreSharedKey, since that’s provided by the server, and the goal is to select the server. The server name indication is a privacy risk. It exposes the name of the server to which you’re connecting to anyone eavesdropping on the path. Is it a significant risk? Well, that’s not clear. As we saw on the previous slide, the IP address of the site is always exposed, and that’s often enough to identify the site you’re accessing, whether or not the site name is exposed. Fundamentally, TLS runs over the Internet. And the Internet always exposes the IP addresses in the packets, making it possible to tell who is accessing what site. If you care about this problem, you need to use an application like The Onion Router, Tor, to protect your connections. The final issue I want to highlight with TLS is that of protocol ossification. TLS is very widely implemented, but there are many poor quality implementations. As an example, the TLS ClientHello indicates the version of TLS used by the client.
The server is supposed to look at this, compare it with the different versions of TLS that it supports, and send a ServerHello in response indicating the most recent compatible version. If the ClientHello indicates support for TLS 1.3, for example, but the server only supports TLS 1.2, the server is supposed to respond with a ServerHello saying it supports TLS 1.2. The client will then decide if this is sufficient, and if it wants to communicate. During the development of TLS 1.3, though, it was noticed that some TLS servers would crash if the ClientHello message used a version number different to TLS 1.2. Similarly, some firewalls were found that inspected TLS ClientHello messages, and blocked the connection if a version other than TLS 1.2 was used. When early versions of TLS 1.3 were deployed, using version number 1.3 in their ClientHello messages, measurements showed that these problems affected around 8% of TLS servers, and prevented TLS 1.3 from working in those cases. That is, around 8% of sites were running buggy implementations of TLS, or had buggy firewalls, that didn’t correctly implement TLS version negotiation, and so would fail if a newer version of TLS was used. The designers of TLS decided that having 8% of connections fail when using TLS 1.3 was too much. So they changed the design. TLS 1.3 now pretends to be TLS 1.2 in the version number in its ClientHello and ServerHello messages. But, it also sends an extra extension header in those messages, to say “Actually, I’m really TLS 1.3”. Experience and testing showed that older TLS servers, that only support TLS 1.2, are more likely to ignore this extra extension header, than they are to correctly respond to a change in the version number field. That is, when a TLS 1.3 client talks to a TLS 1.2 server, the TLS 1.2 server is likely to ignore the extension header and see only the version number in the ClientHello. It will respond to this, and negotiate TLS 1.2. But, when a TLS 1.3 client talks to a TLS 1.3 server, the TLS 1.3 server will see the extension, and know it should ignore the version number field that says TLS 1.2, and instead go with the version signalled in the extension. It will then respond with a ServerHello, which also says TLS 1.2 in the version field, but that has an extension header included to indicate that it’s really using TLS 1.3. This is horrible. It’s a kludge written into the TLS standard, because experience showed that there were too many broken implementations of TLS out there. The built-in version negotiation feature of TLS turned out to be unusable in practice. There were millions of deployed TLS servers where trying to negotiate a TLS version different to TLS 1.2 would cause the server to crash. Upgrading them all was infeasible. The only workaround was to change the way the protocol negotiated what version to use, while leaving the older version there for backwards compatibility. Protocol ossification is a significant concern. And the problem does not only occur with TLS. Widely deployed faulty implementations constrain the design of most protocols. One of the biggest challenges in networked systems is being able to change the protocol without breaking previously deployed implementations. Even if those implementations are buggy and broken in different ways. How do we avoid protocol ossification? Well, ossification occurs when extension mechanisms in the protocol are not used. When the protocol allows flexibility in principle, but the implementations of the protocol don’t use that flexibility.
For example, TLS 1.3 was released ten years after TLS 1.2. There was a period of ten years when TLS was in use, and getting increasingly popular, where only TLS 1.2 was used. This allowed people to build implementations of TLS that didn’t do version negotiation correctly, because they didn’t need to. There were no new versions to negotiate, so no-one properly tested that feature. The same happened with other features of TLS. And the same happens with other protocols. This hints at the solution to ossification. Protocols ossify if they have features or extension mechanisms that aren’t used. The implementations of those features, those extension mechanisms, aren’t properly tested, so buggy code gets deployed, and constrains future changes. So, why not use those features? Why not use those extension mechanisms? The idea is called GREASE. Generate Random Extensions and Sustain Extensibility. If the protocol allows extensions, send extensions. Send meaningless dummy extensions that get ignored if you have nothing else to send, but make sure your protocol uses its extension mechanisms. If the protocol allows different versions, negotiate different versions. Change the version number occasionally, just to prove you can. Do this even if you don’t need to. That is “use it or lose it”. Assume that any features of your system that aren’t regularly used won’t get properly tested, and won’t be usable in practice, so make sure that all the features get regularly used. This concludes our discussion of TLS 1.3. We’ve focussed on the limitations of TLS in this part but, to be clear, TLS 1.3 is a highly effective and highly secure protocol. It’s a significant improvement on prior versions of TLS. It’s faster and it’s more secure. But, TLS runs within a TCP connection and must wait for that TCP connection to be established, and this slows down the handshake. And, because it runs in TCP, some metadata is leaked that can potentially expose sensitive information. In the next parts, I’ll talk about QUIC, a new transport and security protocol that aims to address these limitations of TLS.
Part 2: QUIC Transport Protocol: Development and Basic Features
Abstract
The second part of the lecture introduces the QUIC transport protocol. QUIC is a new transport protocol, designed to replace and improve on the combination of TCP and TLS. It reviews the history and development of QUIC, and outlines the features and benefits it provides. A high-level outline of the operation of a QUIC connection is provided.
In this second part, I want to move on from talking about the limitations of TLS and TCP, and talk instead about the QUIC Transport Protocol. QUIC is an attempt to replace and improve on TCP. It incorporates TLS and various other extensions. In this part, I’ll talk about the development of QUIC and its basic features. Then, in the following parts, I’ll explain how QUIC establishes connections, transfers data, and avoids ossification. As we saw in the previous part of this lecture, there are three main limitations of TLS 1.3 when running over TCP. First, it’s slow to establish a secure connection, because the TCP connection establishment handshake, and the TLS security parameter negotiation, proceed in series, one after the other. Secondly, TLS leaks some metadata about the connection. And finally, both TLS and TCP are ossified, and have become difficult to extend. QUIC tries to solve these problems. It’s a single protocol that provides both the transport and security features. It reduces connection establishment latency, by overlapping the connection setup and encryption key negotiation. It tries to avoid metadata leakage through the use of pervasive encryption. And it tries to prevent ossification by systematic application of GREASE and through encryption of more of the transport headers. QUIC is a new protocol. Development started in 2012 as a project inside Google to improve the security and performance of the web. One of the outcomes of that project was SPDY, which was adopted by the IETF and eventually became HTTP/2. The other was QUIC. Google transferred control of the QUIC protocol to the IETF in 2016, and the IETF formed a working group to standardize and develop the protocol. The IETF then spent five years updating and improving the protocol, publishing 34 drafts, and going through a couple of thousand pull requests and issues on the specification on GitHub. The final RFCs, RFC 8999 through to 9002, describing the protocol, were eventually published in May 2021. The final result is clearly inspired by Google’s initial proposal, but it’s also been significantly changed as a result of the IETF process. And it’s now a much better protocol. The result is more extensible, has a better security solution that’s more compatible with TLS and the security solutions used in the web, and it provides a cleaner basis for further development. So what is QUIC? Well, the figure on the slide shows how QUIC relates to the other layers of the Internet protocol stack. As we see, QUIC performs most of the functions of TCP, completely subsumes TLS, and adds some features from HTTP. Essentially, it’s trying to replace TCP and TLS, and add a more useful API on top of them. QUIC runs on top of UDP. That is, QUIC packets are sent as the payload within UDP packets, rather than running directly on IP. This was done for ease of deployment and development. On top of that, QUIC provides reliability, ordering, and congestion control. These are the features usually provided by TCP. It also incorporates TLS security, and provides the same security guarantees as TLS 1.3. And, finally, QUIC adds the idea of stream multiplexing from HTTP/2. That is, where a TCP connection delivers a single stream of data, QUIC allows multiple different streams of data to be sent over a single connection. The result is a general purpose, reliable, congestion controlled, and secure, client-server transport protocol.
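Since QUIC packets travel as UDP payloads, a user-space QUIC implementation sends them through an ordinary UDP socket. A minimal sketch using the Berkeley sockets API; example.com and port 443 are placeholders, the packet bytes are assumed to have been built elsewhere by the QUIC stack, and error handling is minimal:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: sending a (pre-built) QUIC packet as the payload of a
     * UDP datagram, using only the portable Berkeley sockets API. */
    int send_quic_packet(const unsigned char *pkt, size_t len) {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;
        hints.ai_socktype = SOCK_DGRAM;        /* UDP */
        if (getaddrinfo("example.com", "443", &hints, &res) != 0)
            return -1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        ssize_t sent = sendto(fd, pkt, len, 0, res->ai_addr, res->ai_addrlen);
        freeaddrinfo(res);
        close(fd);
        return (sent < 0) ? -1 : 0;
    }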
QUIC is intended to be useful for any application that currently uses TCP, and it aims to provide a better, more secure, replacement for TCP. It was also designed to run HTTP effectively and improve web applications. In parallel with the development of QUIC, the IETF is also developing HTTP/3, which is the version of HTTP that runs on top of QUIC, and is optimized to use the features of QUIC. One of the key benefits of QUIC is that it overlaps the connection establishment and security handshakes. On the left, we see the process by which a TLS connection is established running over TCP. As we saw in the previous part of the lecture, it proceeds in two stages. First, the TCP connection is established, with a SYN, SYN-ACK, ACK exchange. Then, TLS takes over to negotiate the security parameters and encryption keys, and to authenticate the server. It exchanges the TLS ClientHello, ServerHello, and Finished messages. And only then can the data be sent; an HTTP request and response in this example. QUIC, as we see on the right, combines the first two stages into one. And, as we’ll see in the next part of this lecture, QUIC sends the equivalent of the TCP SYN and TLS ClientHello messages in one packet at the start of the connection. This begins connection setup, and starts to negotiate the security parameters, in one message. The response contains the equivalent of the TCP SYN-ACK, and the TLS ServerHello, combined into one message. This establishes the connection and provides the security parameters. Then, the final packet in the handshake sends the equivalent of the TCP ACK and TLS Finished messages, and the initial data, again all wrapped up into one message. The result is that QUIC saves one round trip compared to the combination of TCP and TLS. Its normal case establishes a secure connection as fast as TCP and TLS can perform 0-RTT session resumption, and without the limitations of session resumption. And QUIC also supports the 0-RTT session resumption mode. This allows the very first packet sent from a client to a previously known server to establish the connection, negotiate security parameters, and carry an initial request, all at once. This lets QUIC achieve the best possible performance, to send a request to a server and get the response one round-trip time later. QUIC allows applications to send and receive streams of data. Unlike TCP, where each TCP connection allows a single stream of data to be sent, and a single stream of data to be received, QUIC allows for more than one stream of data to be sent and received on each connection. This lets applications separate out different objects, delivering each on a separate stream within the QUIC connection. For example, when fetching a web page that includes multiple images, QUIC would allow each image to be sent on a separate stream within that connection, rather than requiring a new connection to be opened for each. The data for these streams is sent within QUIC packets, and QUIC packets are, in turn, sent within UDP datagrams. Each QUIC packet starts with a header and contains one or more frames of data. The frames contain data for the individual streams, acknowledgments, and other control messages. QUIC packets can be long header packets or short header packets. At the start of a connection, QUIC sends what are known as long header packets. The format of these is shown on the slide. Long header packets start with a common header.
This common header begins with a single byte where the first two bits are set to one to indicate that it’s a long header packet. The next two bits of this byte, labelled TT in the diagram, indicate the type of the long header packet, and the remaining four bits of this byte are unused. This byte is followed by a version number, the destination connection identifier, source connection identifier, and the packet-type specific data. The connection identifiers allow a QUIC connection to survive address changes. For example, if a smartphone connected to a home Wi-Fi network establishes a connection to a server, and then moves out of range of the Wi-Fi and switches to its 4G connection, then its IP address will change. This change in IP address would cause a TCP connection to fail. The connection identifiers, though, allow a QUIC connection to automatically reconnect to the server and continue where it left off. Following the connection identifiers is packet type-specific data. There are four different types of long header packet in QUIC. An Initial packet is used to initiate a connection and start the TLS handshake. It plays the same role as a TCP SYN packet and a TLS ClientHello. A 0-RTT packet is used to carry the idempotent data sent when 0-RTT session resumption is being used. A Handshake packet is used to complete the TLS handshake. Depending on its contents, a Handshake packet is either the equivalent of a TCP SYN-ACK, with a ServerHello, or a TCP ACK with a TLS Finished message. And the Retry packet is used to force address validation, as part of connection migration, and to prevent some types of denial of service attack. These different types of QUIC long header packet all contain QUIC frames, and are all sent inside UDP datagrams. And, to improve efficiency, QUIC allows several Initial, 0-RTT, and Handshake packets to be included in a single UDP datagram, one after the other, followed by a short header packet. QUIC switches to sending short header packets once the connection has been established. QUIC short header packets omit the version number and source connection identifier to save space, since these can be inferred to be the same as those sent during the connection establishment handshake. There is currently only one type of short header packet, known as the 1-RTT packet. These are used for all of the packets sent after the QUIC handshake is completed, and contain a sequence of encrypted and authenticated QUIC frames. Both long and short header packets contain QUIC frames. Frames provide the core functionality of QUIC. And there are many different types of frame. CRYPTO frames are used to carry TLS messages, such as the ClientHello, ServerHello, and Finished messages used during the connection establishment handshake to negotiate encryption keys. STREAM and ACK frames send data and acknowledgments on a stream. Migration between interfaces is supported by the PATH_CHALLENGE and PATH_RESPONSE frames. And the other types of frame control the progress of a QUIC connection. In many cases, QUIC frames play the role taken in TCP by header fields. For example, a TCP header contains an acknowledgement field that indicates the next sequence number expected. The QUIC header does not. Rather, QUIC packets include an ACK frame that carries that information. This makes QUIC more flexible and more extensible. Rather than having a fixed header that becomes difficult to change, QUIC can just add new frame types if it wants to send different types of control information.
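Returning to the long header format for a moment, a rough sketch of how the common fields described above are parsed is shown below. This follows the QUIC invariants in RFC 8999; note that on the wire each connection identifier is preceded by a one-byte length field, a detail omitted from the description above:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: extracting the invariant fields of a QUIC long header
     * packet (RFC 8999). Returns 0 on success, -1 if the buffer is too
     * short or this is not a long header packet. */
    int parse_long_header(const uint8_t *pkt, size_t len,
                          uint32_t *version,
                          const uint8_t **dcid, uint8_t *dcid_len,
                          const uint8_t **scid, uint8_t *scid_len) {
        if (len < 7 || (pkt[0] & 0x80) == 0)   /* top bit set => long header   */
            return -1;
        *version = ((uint32_t)pkt[1] << 24) | ((uint32_t)pkt[2] << 16) |
                   ((uint32_t)pkt[3] <<  8) |  (uint32_t)pkt[4];
        size_t off = 5;
        *dcid_len = pkt[off++];                /* length-prefixed DCID         */
        if (off + *dcid_len + 1 > len) return -1;
        *dcid = pkt + off;  off += *dcid_len;
        *scid_len = pkt[off++];                /* length-prefixed SCID         */
        if (off + *scid_len > len) return -1;
        *scid = pkt + off;
        return 0;
    }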
That concludes our discussion of the history of QUIC and its basic features. In the next part, I’ll talk in more detail about how connection establishment works in QUIC, and how QUIC reliably transfers streams of data.
Part 3: QUIC Transport Protocol: Connection Establishment and Data Transfer
Abstract
The third part of the lecture discusses QUIC connection establishment and data transfer in more detail. It explains the QUIC connection establishment handshake, and shows how it overlaps the TLS security parameter negotiation and connection establishment operations into one handshake, to speed up connection establishment compared to TLS over TCP. And it outlines the process by which QUIC connections transfer data over streams, showing how the multi-streaming nature of QUIC improves on TCP.
In this part, I want to follow on from the introduction to QUIC, and talk in detail about how QUIC establishes connections and transfers data. Like TCP, a QUIC connection proceeds in two phases. It starts with a handshake, and then moves on to a data transfer phase. The QUIC handshake uses long header packets. It establishes the connection, negotiates encryption keys, and authenticates the server. The data transfer phase uses short header packets. It’s where the data is sent and received, and where acknowledgements are generated, after the connection has been established. QUIC improves on the performance of TCP and TLS by combining the connection establishment handshake and the TLS handshake into one round-trip. A QUIC connection starts with the client sending a UDP datagram containing a QUIC Initial packet to the server. This Initial packet plays the equivalent role to a TCP SYN packet and indicates that the client wishes to establish a connection. The Initial packet also contains a QUIC CRYPTO frame. Inside that CRYPTO frame is a TLS ClientHello message, the same as if TLS was being used over TCP. That ClientHello message starts the negotiation of the encryption keys and other security parameters. The QUIC Initial packet therefore simultaneously starts both the connection establishment and the TLS handshake. In response to this, the server sends a QUIC Initial packet and a QUIC Handshake packet back to the client. These are both included in a single UDP datagram. The Initial packet sent from the server back to the client plays the role of the TCP SYN-ACK packet, and indicates to the client that the server is willing to establish the connection. The Initial packet also includes a CRYPTO frame containing a TLS ServerHello message, that provides the information needed to finish negotiating encryption keys. Along with the Initial packet is a QUIC Handshake packet. This contains the information needed for TLS to authenticate the server, and to negotiate any QUIC extensions used. Once this Initial packet and the Handshake packet arrive at the client, the connection is ready to go. Both the client and server have agreed to communicate, and they both have the information needed to encrypt and authenticate the data sent over the connection. The client concludes the handshake, and starts the data transfer, by sending a third packet. This is a UDP datagram that contains a QUIC Initial packet, a QUIC Handshake packet, and a 1-RTT short header packet. This second QUIC Initial packet contains an ACK frame, acknowledging the Initial packet sent by the server. The QUIC Handshake packet contains a CRYPTO frame, that in turn contains the TLS Finished message needed to complete the security handshake. And the QUIC 1-RTT packet contains a STREAM frame, with the initial data sent from the client to the server. The format of a QUIC Initial packet is shown on the slide. It’s a long header packet, used during the QUIC connection establishment handshake. QUIC Initial packets play two roles. Firstly, they’re used to synchronise the client and server state. In this respect, they play the same role as TCP’s SYN and SYN-ACK packets. They also carry a CRYPTO frame, containing either a TLS ClientHello or a TLS Finished message, as part of the encryption setup. And they can also contain ACK frames. The Initial packets combine connection setup and security negotiation into one packet. QUIC Initial packets can also carry an optional Token.
A QUIC server can refuse a connection attempt, and send a Retry packet to the client containing a Token. If this happens, the client must retry the connection, providing the Token in its Initial packet. If the Token matches, the connection establishment then proceeds as normal. This is used to prevent connection spoofing, and to validate a connection that is being re-established after an IP address change. QUIC Handshake packets complete the TLS 1.3 exchange. They contain either a TLS ServerHello message or a TLS Finished message, contained within a CRYPTO frame. The combination of the TLS ClientHello, sent in the Initial packet, and the TLS ServerHello and Finished messages, sent in the Handshake packets, completes the TLS security handshake. It works just the same as TLS over TCP, except that the messages are sent inside QUIC packets. Handshake packets use a long header, and have the format shown on the left of the slide. In most cases, a Handshake packet is sent in the same UDP datagram as an Initial packet, to reduce overheads. If the combined packet is too big to fit in a single UDP datagram, though, the Initial and Handshake packets may be sent in two separate datagrams. The normal QUIC connection establishment takes one round trip, after which both client and server have agreed to communicate and have the information they need to derive the symmetric encryption key. If this is too slow, QUIC also supports TLS 0-RTT session re-establishment. This works in a similar way to TLS over TCP. The QUIC client and server establish a connection as normal, and the server sends a PreSharedKey and a SessionTicket to the client. When the client next establishes a QUIC connection to that server, it adds the SessionTicket to the CRYPTO frame it sends in its Initial packet, along with the TLS ClientHello message. And it also includes a QUIC 0-RTT packet, containing a STREAM frame that carries data encrypted using the PreSharedKey. The QUIC server can use the SessionTicket to look up its copy of the PreSharedKey, and uses that to decrypt the stream data sent in the 0-RTT packet. The server then responds with QUIC Initial and Handshake packets, as normal, along with a 1-RTT short header packet containing a reply. As with TLS over TCP, the data sent in the 0-RTT packet is not forward secret, and needs to be idempotent since it’s potentially subject to replay attacks. The difference is that 0-RTT session re-establishment in QUIC really does send the data in the very first packet, after zero RTTs, whereas TLS over TCP sends it in the first packet after the TCP handshake. The result of all this is that QUIC establishes a secure connection, in the usual case, in one round trip, whereas TLS over TCP needs two round trips. QUIC combines connection establishment and encryption key negotiation into a single handshake. TLS-over-TCP runs the two handshakes sequentially. QUIC therefore speeds up the connection establishment. Once the handshake has finished, QUIC switches to sending short header packets. These short header packets are used to transfer, and acknowledge, data. Each short header contains a Packet Number field. This plays a similar role to the sequence number in TCP segments. Unlike TCP, though, QUIC packet numbers increase by one for each packet sent. Packet numbers in QUIC count the number of packets sent. In contrast, the TCP sequence number counts the number of bytes of data sent. The QUIC header does not include any equivalent of the TCP acknowledgement number.
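A toy sketch of the difference in sender-side numbering; this is illustrative only, and the structure and function names here are made up:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: the sender-side accounting differs. TCP numbers bytes;
     * QUIC numbers packets, and a packet number is never reused. */
    struct tcp_sender  { uint32_t snd_nxt; };           /* next byte to send */
    struct quic_sender { uint64_t next_packet_number; };

    void tcp_on_send(struct tcp_sender *s, size_t payload_len) {
        /* A TCP retransmission re-sends the same byte range, so it
         * carries the same sequence number as the original segment. */
        s->snd_nxt += payload_len;
    }

    uint64_t quic_on_send(struct quic_sender *s) {
        /* Every QUIC packet, including one that re-sends frames lost
         * earlier, gets a fresh, monotonically increasing number. */
        return s->next_packet_number++;
    }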
Rather than an acknowledgement field in the packet header, QUIC sends ACK frames as part of the protected payload data, to indicate received packet numbers. Also in contrast to TCP, QUIC never retransmits packets. If the UDP datagram containing a QUIC packet is lost, QUIC will retransmit the frames of data that were in that packet in new QUIC packets, and those packets will have new packet numbers. TCP, on the other hand, retransmits lost packets with their original sequence number. This means TCP cannot tell the difference between the arrival of a retransmitted packet, and a very delayed arrival of the original packet. QUIC can always tell these apart, since they have different packet numbers. This simplifies the design of QUIC’s congestion control algorithm, compared to TCP. We’ll talk more about congestion control in Lecture 6. Finally, a QUIC short header packet ends with a protected payload section. This contains encrypted and authenticated QUIC frames. These can be STREAM frames containing data, ACK frames containing acknowledgements, or other control data. QUIC sends acknowledgements for received packets in ACK frames. The slide shows the format of an ACK frame. ACK frames are sent inside long- and short-header packets. Unlike TCP, the acknowledgements are not part of the QUIC packet headers. Also unlike TCP, acknowledgements indicate the packet numbers of QUIC packets that were received. TCP sequence numbers and acknowledgements count the number of bytes sent, whereas QUIC counts the number of packets sent. Finally, data is sent within STREAM frames, that are sent within QUIC packets, that are contained within UDP datagrams. The format of a STREAM frame is shown on the right. It contains a stream identifier, the offset of the data within the stream, the length of the data, and the data itself. QUIC provides multiple reliable byte streams within a single connection. Data for each stream is delivered reliably, and in the order it was sent, within that stream, but data order is not preserved between streams. For example, assume that a client has a connection to a server and is sending data on two streams, stream A and stream B, within that connection. The client first sends a message on stream A, and then it sends another message on stream B. All the data sent on stream A will arrive reliably and in the order it was sent. The same is true of stream B. But the message sent on stream A, which was sent first, might arrive after the message sent on stream B. That is, QUIC avoids head-of-line blocking between streams, but not within a stream. We’ll talk more about head of line blocking in Lecture 5. There are two ways to view QUIC streams. You can view QUIC streams as allowing you to send multiple unframed byte streams within a single connection. In this view, a QUIC connection offers the same service model as several parallel TCP connections. Alternatively, you can view each QUIC stream as framing a message. In this view, a QUIC connection delivers a series of framed messages, each sent on a separate stream. They’re both entirely reasonable ways of looking at a QUIC connection, but they lead people to use QUIC in different ways. QUIC is quite flexible, and is different to TCP. It’s also new. We’re still developing best practices for how to use it, and how to view the multiple streams within a connection. This concludes our discussion of QUIC connection establishment and data transfer. In the next part, I’ll talk about how QUIC tries to avoid protocol ossification.
Part 4: QUIC Transport Protocol: Avoiding Ossification
Abstract
The final part of this lecture explains why QUIC is designed to run over UDP, rather than running natively on IP. It discusses the protocol ossification and deployment concerns that drove this choice, and the tools that QUIC employs to help prevent further ossification: protocol invariants, pervasive encryption of transport headers, and GREASE. It concludes by reviewing the benefits and costs of switching to QUIC.
In the final part of the lecture, I’d like to discuss some features of QUIC that are intended to avoid ossification. And I’d like to touch a little on the costs and benefits of using QUIC. One of the unusual features of QUIC is that it’s a transport protocol that runs over another transport protocol. QUIC runs over UDP, rather than running directly on IP. Why is this? Well, there are two reasons. The first is to make end-system deployment in user-space applications easier. It’s difficult to implement protocols that run directly over IP. The native API that the existing transport protocols, TCP and UDP, use to talk to the IP implementation is hidden inside the operating system kernel. It’s an internal interface, that’s generally undocumented, proprietary, and inaccessible. Anything built at this level needs to run within the operating system kernel, and will be very tightly tied to the details of that kernel. There’s also a reasonably portable interface, known as raw sockets, that lets you send packets directly over IP. The raw sockets interface is not the easiest API to use, tends to be relatively low performance, and offers limited control, but it could be workable. Except that both raw sockets and the native kernel interfaces require programs that access them to run with administrator privileges. This is a security risk, and prevents those programs from being uploaded to the relevant app stores. This makes such programs, even if they could be implemented, difficult to deploy. On the other hand, writing user space applications that run over UDP, using the portable Berkeley Sockets API, or Windows Sockets, is straightforward. The same API works everywhere. The programming model is widely understood. And there’s no need for privileged access, so the programs are easy to distribute and install. It’s much easier, and more portable, to run QUIC over UDP than it is to run it natively over IP. The second reason for running QUIC over UDP is protocol ossification. Almost every home and business that connects to the Internet does so via a NAT or firewall. NAT devices know how to find and translate the port numbers in TCP and UDP headers. Firewalls know how to inspect the headers and contents of TCP connections and UDP packets. And the same is true of all the other proxies, gateways, and middleboxes in the network. None of these understand QUIC. If we run QUIC inside UDP packets, there’s a chance it will work. NATs can translate UDP packet headers without inspecting the contents of those packets, and many firewalls allow outgoing UDP traffic to pass and establish an incoming firewall pinhole, because this is needed for applications like Zoom to work. This wouldn’t be the case if QUIC ran directly over IP. NATs wouldn’t know how to translate the QUIC packets. And firewalls tend to block anything that isn’t TCP or UDP. If we run QUIC over UDP, it will probably work across the Internet today. But if we run it directly over IP, then we’d have to update every firewall and NAT before it would work – an essentially impossible task. This is another example of protocol ossification. Information that’s visible in the packet headers or payload, that’s not encrypted and authenticated, can be inspected and modified by devices in the network. This means that NATs and firewalls can inspect the contents of IP packets. They can tell whether those packets contain TCP, UDP, or something else. This is possible because the IP header and payload are not encrypted.
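The inspection involved is simple. A sketch of the sort of check a middlebox might perform on an unencrypted IPv4 packet; this handles IPv4 only, ignores IP options beyond the header-length field, and omits the error handling a real device would need:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: a middlebox classifying an IPv4 packet by reading the
     * unencrypted protocol field and transport-layer port numbers. */
    void classify(const uint8_t *ip_pkt, size_t len) {
        if (len < 20 || (ip_pkt[0] >> 4) != 4)
            return;                               /* not a plausible IPv4 packet */
        size_t  ihl   = (ip_pkt[0] & 0x0f) * 4;   /* IPv4 header length in bytes */
        uint8_t proto = ip_pkt[9];                /* 6 = TCP, 17 = UDP           */
        if (len < ihl + 4 || (proto != 6 && proto != 17)) {
            /* Not TCP or UDP: many firewalls simply drop such packets. */
            return;
        }
        uint16_t src_port = (ip_pkt[ihl]     << 8) | ip_pkt[ihl + 1];
        uint16_t dst_port = (ip_pkt[ihl + 2] << 8) | ip_pkt[ihl + 3];
        /* A NAT would now rewrite addresses and ports and update the
         * checksums; a firewall would match these values against rules. */
        (void)src_port; (void)dst_port;
    }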
And because the contents of those IP packets are not authenticated, these NATs, firewalls, and other devices can also modify the packets. This means we can buy NAT boxes that modify the packets, changing the IP, TCP, and UDP packet headers to allow several devices to share an IP address. It means we can buy firewalls that inspect network packets and claim to protect us from malware. And it means we can buy all the other proxies, caches, and other devices that inspect and modify traffic in the network. But maybe we don’t want the network to read or modify our data. So we use TLS to protect the data we send within TCP connections. Or datagram TLS, that does the same for data sent within UDP packets. And that works. What it doesn’t do, though, is stop those devices inspecting and modifying the TCP and UDP packet headers. And it doesn’t stop them looking at the IP packets, and deciding to block them because the contents are not TCP or UDP. Or because the TCP or UDP header doesn’t exactly match their understanding of how it should be formatted. This means protocols like QUIC have to hide inside UDP packets, to have a chance of going through the network. And it makes it harder to change how TCP or UDP work. We couldn’t change the format of the TCP header now, for example, because too many devices expect the existing format. Similarly, it’s difficult to add even new options to TCP packets, because too many devices think they understand what options exist, and fail when they see something unexpected. The paper shown on the slide is a study that tries to measure how often firewalls and NATs block TCP packets that include standardised but rarely used extensions. And how often they block TCP packets that use non-standard extensions, as you might use when experimenting with changes to TCP. It showed that TCP extensions were hard to deploy. And while the results are almost ten years old now, I doubt it’s gotten easier to deploy changes to TCP. Protocol ossification is becoming an increasing problem. We’ve seen that it caused problems when TLS 1.3 was being deployed. It causes problems when trying to extend TCP. And it was one of the reasons why QUIC is designed to hide within UDP packets. Ossification is increasingly viewed as a problem by the standards community. It makes it difficult to evolve network protocols to address new requirements, and leads to the concern that the network will get stuck. That it’ll have known problems and limitations, that we know how to solve, but where we can’t deploy the fix. The designers of QUIC went to great lengths to make QUIC deployable, and to prevent it from becoming ossified. Running over UDP was the first of these. And there are three more techniques they used: they published the protocol invariants; they make use of pervasive encryption of transport headers; and they make use of GREASE. The goal is to make it difficult, and ideally impossible, for middleboxes to interfere with QUIC connections. That is, a network can allow QUIC traffic, or it can block it entirely. But it can’t inspect or modify QUIC flows, apart from the invariant features. The QUIC invariants are the properties of QUIC that the IETF has indicated will never change. The developers of QUIC, and the IETF standards community, have written down a set of properties of QUIC packets that they guarantee will be true for every QUIC packet, for all time, and for all versions. These are the properties that middleboxes, NATs, and firewalls are allowed to inspect.
Anything else, the IETF has said, is subject to change in future versions of QUIC. What are those invariants? That QUIC packets will either start with a long header or a short header. There will be no other header formats. That the first bit of the first byte of a long header packet is set to one. That bytes 2-5 of a long header packet contain a version number. And that following that version number are the destination connection identifier, then the source connection identifier. And for short header packets, that the first bit of the first byte is always zero. And that the first byte is followed by the destination connection identifier. And that’s it. A QUIC version 1 connection starts with a connection establishment handshake, sent in long header packets, then the connection switches to use short header packets, but there is no guarantee that future versions will do the same. QUIC version 1 defines the format of the other 7 bits of the first byte, but there’s no guarantee that future versions of QUIC won’t change that meaning. And QUIC makes no guarantees about the contents of the packets following the connection identifiers. Now, of course, writing a standard that says “middleboxes MUST NOT look at these fields in QUIC packets” doesn’t stop middleboxes looking at those fields. Accordingly, QUIC applies two other techniques to make sure middleboxes can’t inspect its packets. The first is that QUIC encrypts as much of its packets as possible. Everything except the invariant fields, and the last 7 bits of the first byte of the packet, is encrypted. This makes it impossible for middleboxes to inspect any part of a QUIC header after the connection identifiers. Since QUIC incorporates TLS, this is straightforward to do for most packets. An encryption key is agreed using connection establishment, and this is used to encrypt the packets. What about the connection establishment packets themselves? How are they encrypted? Well, these contain the public keys, which are supposed to be public, so they don’t actually need to be encrypted. TLS over TCP doesn’t encrypt these packets, for example. But QUIC encrypts them anyway. So, what encryption key does it use? Well, it can’t use a TLS key, since these packets are sent before the TLS handshake is completed. And it can’t use a PreSharedKey, because this might be the first time the client and server have communicated. What it does, is to take the connection identifiers from the long header in the connection establishment packets, and use them as the encryption key to encrypt the rest of the packet. On the face of it, this offers no benefit. It certainly doesn’t provide any security. The encryption key for the connection establishment packet is included, unprotected, at the start of the packet. That it doesn’t provide security doesn’t matter, though, since there’s nothing secret in the connection establishment packets. What it does do is make middlebox implementors think. You can build a TCP middlebox by running a tool like Wireshark or tcpdump, and looking at the packets as they go by. Everything is visible, and the patterns of how the headers change, in the common cases, are obvious. You can get away without reading the TCP specification at all, and still build a middlebox that sort-of works. Of course it doesn’t really work, and causes problems when people try to do uncommon things with TCP. But it works for the simple cases. You can’t do that with QUIC. Everything appears to be encrypted.
You have to read and understand the QUIC specification to realise that the handshake packets are encrypted in a way that lets you easily decrypt them. Otherwise, you get nothing useful. QUIC also makes extensive use of GREASE. It makes sure that every field in a QUIC packet is either encrypted, or has a value that is, at least sometimes, unpredictable. The connection identifiers are randomly chosen at the start of a connection. QUIC clients will occasionally try to negotiate a random version number, which the server will reject, to make sure devices in the network don’t get stuck only supporting version one. And any unused fields in the headers are set to random values. The goal is that middleboxes can’t make any assumptions about QUIC header values, because nothing in the header is predictable. The hope is that this avoids ossification. There are no patterns to the header values, so middleboxes can’t make wrong assumptions about how the headers behave that will cause problems later. Will it work? Will it prevent ossification? We don’t know. Ask me again in a few years. That concludes this introduction to the QUIC transport protocol. Why might you want to use QUIC? Because it can speed up secure connection establishment; because it solves some of the limitations of TLS 1.3 running over TCP; and because it supports sending multiple streams within a single connection. And because it, hopefully, reduces the risk of ossification, and provides a long-term basis for future protocol development. That said, QUIC is new. The standards have only just been published, and we don’t have a lot of experience with using the protocol. And while there are many implementations of QUIC, in many different languages, they’re all still immature, poorly documented, and frequently buggy. And QUIC is currently often slower, and uses more CPU, than TLS over TCP. This is not because of inherent limitations of QUIC. Rather, it’s because the TCP implementations have had 40 years more optimisation and debugging time. When the implementations are finished, debugged, and optimised, QUIC ought to perform at least as well as, and probably significantly better than, TCP. And it should be more secure and easier to extend. TCP lasted 40 years. QUIC is a similarly long-term project, that’s only just reaching version 1.0. That’s all for this lecture. I’ve spoken about some of the limitations of TLS, and how the QUIC transport protocol tries to address them. In the next lecture, I’ll move on from connection establishment, to talk about reliability and data transfer.
L4 Discussion
Summary
Lecture 4 discussed how to improve secure connection establishment. It started with a discussion of some of the limitations of TLS 1.3, how 0-RTT connection re-establishment can be used to reduce connection setup latency, and some of the metadata leakage inherent in TLS. It then moved on to discuss the QUIC transport protocol, and how QUIC can be used to improve secure connection establishment.
The initial focus of the discussion will be to check your understanding of secure connection establishment in TLS 1.3 and its limitations. What are the risks and benefits of 0-RTT mode in TLS? What is the nature of the privacy guarantees TLS provides?
Following on from this, the discussion will consider the QUIC Transport Protocol. How does QUIC differ from TCP? How does the service model provided to applications differ from TCP?
QUIC connection establishment combines transport protocol negotiation and TLS session establishment into one handshake to reduce connection setup times. The resulting connection establishment process is quite different to that of TCP. How does this combined handshake work? How does connection setup time compare to TLS 1.3 over TCP? Once a QUIC connection is established, it provides a reliable multi-streaming service that differs from that of TCP. How do the reliability mechanisms in QUIC differ from TCP? Why do they have to be different?
Finally, the lecture discussed the challenges in deploying new network protocols and the issue of protocol ossification. What techniques does QUIC use to prevent ossification? Do you think they will work?
Lecture 5
Abstract
Lecture 5 discusses reliable and unreliable data transfer in the Internet. It explains the best-effort nature of packet delivery, the end-to-end argument, and the timeliness-vs-reliability trade-off inherent in the design of the Internet. And it discusses three transport protocols in use in the Internet, UDP, TCP, and QUIC, how they provide different degrees of timeliness and reliability, and how they offer different services to applications.
Part 1: Packet Loss in the Internet
Abstract
The first part of the lecture discusses packet loss in the Internet. It talks about the causes of packet loss, the end-to-end argument, and the timeliness-reliability trade-off.
In this lecture I want to move on from the discussion of connection establishment, and talk instead about reliability and effective data transfer across the network. There are four parts to this. In this first part, I’ll talk briefly about packet loss in the Internet, and the trade-off between reliability and timeliness. Then, I’ll move on to discuss unreliable data using UDP, and talk about the types of applications that benefit from this. In part three, I’ll talk about reliable data transfer with TCP. I’ll discuss the TCP service model, how TCP ensures data is delivered reliably, and some of the limitations of TCP relating to head-of-line blocking. Then, in the final part, I’ll conclude by discussing how QUIC transfers data and how this differs from TCP. I want to start by discussing packet loss in the Internet. What we mean when we say that the Internet provides a best effort service. The end-to-end argument. And the timeliness vs reliability trade-off inherent in the design of the Internet. As we discussed back in lecture 1, the Internet is a best effort packet delivery network. This means that it’s unreliable by design. IP packets can be lost, delayed, reordered, or corrupted in transit. And this is regarded as a feature, rather than a bug. A network that can’t deliver a packet is supposed to discard it. There are many reasons why a packet can get lost or discarded. It could be due to a transmission error, where electrical noise or wireless interference corrupts the packet in transit, making the packet unreadable. Or it could be because too much traffic is arriving at some intermediate link in the network, so an intermediate router runs out of buffer space. If traffic is arriving at a router from several different incoming links, but all going to the same destination, so it’s arriving faster than it can be delivered, a queue of packets will build up, waiting for transmission. If this situation persists, the queue might grow so much that the router runs out of memory, and has no choice but to discard packets. Or packets could be lost because of a link failure. Or a router bug. Or for other reasons. How often this happens varies significantly. The packet loss rate depends on the type of link. Wireless links tend to be less reliable than wired links, for example. It’s reasonably likely that a packet sent over a wireless link, such as WiFi or 4G, will be corrupted in transit due to noise, interference, or cross traffic. This is very unlikely on an Ethernet or optical fibre link. The packet loss rate also depends on the overall quality and robustness of the infrastructure. Countries with well-developed and well-maintained infrastructure tend to have reliable Internet links; countries with less robust or lower capacity infrastructure tend to see more problems. And the loss rate depends on the protocol. Some protocols intentionally try to push links to capacity, causing temporary overload as they try to find the limit, as they try to find the maximum transmission rate they can achieve. TCP and QUIC do this in many cases, depending on the congestion control algorithm used, as we’ll see in lecture 6. Other applications, such as telephony or video conferencing, tend to have an upper bound on the amount of data they can send. Whatever the reason, though, some packet loss is inevitable. The transport layer needs to recognise this. It must detect packet loss. And, if the application needs reliability, it must retransmit or otherwise repair any lost data.
That the Internet provides best effort packet delivery is a result of the end-to-end argument. The end-to-end argument considers whether it’s better to place functionality inside the network or at the end points. For example, rather than provide best effort delivery, we could try to make the network deliver packets reliably. We could design some way to detect packet loss on a particular link, and request that the lost packets be retransmitted locally, somewhere within the network. And, indeed, some network links do this. In WiFi networks, for example, the base station acknowledges packets it receives from the clients, and requests that any corrupted packets are re-sent, to correct the error. The problem is that, unless this mechanism is 100% perfect all the time, the end systems will still need to check if the data has been received correctly, and will still need some way of retransmitting packets in the case of problems. And if they’ve got that, why bother with the in-network retransmission and repair? Often, if you add features into the network routers, they end up duplicating functionality that the network endpoints need to provide anyway. Maybe the performance benefit of adding features to the network is so big that it’s worthwhile. But often, the right thing to do is to keep the network simple. Omit anything that can be done by the endpoints. And favour simplicity over the absolute optimal performance. The end-to-end argument is one of the defining principles of the Internet. And I think it’s still a good approach to take, when possible. Keep the network simple, if you can. The paper linked from the slide talks about this subject in a lot more detail. Irrespective of whether retransmission of lost packets happens between the endpoints or within the network, it takes time. This leads to a fundamental trade-off in the design of the network. If a connection is to be reliable, it cannot guarantee timeliness. It’s not possible to build absolutely perfect network links, that never discard or corrupt packets. There’s always some risk that the data is lost and needs to be retransmitted. And retransmitting a packet will always take time, and so disrupt the timeliness of the delivery. And similarly, if a connection is to be timely, it cannot guarantee reliability. There’s a trade-off to be made. Protocols like UDP are timely but don’t attempt to be reliable. They send packets, and if they get lost, they get lost. TCP and QUIC, on the other hand, aim to be reliable. They send the packets, and if they get lost, they retransmit them. And if the retransmission gets lost? They try again, until the data eventually arrives. As we’ll see in part 3 of this lecture, this causes head-of-line blocking, making the protocol less timely. And other protocols, such as the Real-time Transport Protocol, RTP, that I’ll talk about in lecture 7, or the partially reliable version of the Stream Control Transmission Protocol, SCTP, aim for a middle ground. They try to correct some, but not all, of the transmission errors. They try to achieve a balance, a middle ground, between timeliness and reliability. The different protocols exist because different applications make different trade-offs. Some applications prefer timeliness, some prefer reliability. For applications like web browsing, email, or messaging, you want to receive all the data. If I’m loading a web site, I’d like it to load quickly, sure. But I’d prefer it to load slowly and be uncorrupted, rather than load quickly with some parts missing.
For a video conferencing tool, like Zoom, though, the trade-off is different. If I’m having a conversation with someone, it’s more important that the latency is low than that the picture quality is perfect. The same may be true for gaming. And this has implications for the way we design the network. It means that the IP layer needs to be unreliable. It needs to be a best effort network. If the IP layer is unreliable, protocols like TCP and QUIC can sit on top and retransmit packets to make it reliable. A transport protocol can make an unreliable network into a reliable one. But if the IP layer is reliable, if the IP layer retransmits packets itself, then the applications and the transport protocols can’t undo that. So this concludes the discussion of packet loss and why the Internet opts to provide an unreliable, best-effort, service. In the next part, I’ll talk about UDP and how to make use of an unreliable transport protocol.
Part 2: Unreliable Data Using UDP
Abstract
The second part of the lecture discusses UDP. It outlines the UDP service model, and reviews how to send and receive data using UDP sockets, and the implications of unreliable delivery for applications using UDP. It discusses how UDP is suitable for real-time applications that prioritise low latency over reliability. And it discusses the use of UDP as a substrate on which alternative transport protocols can be implemented, avoiding some of the challenges of protocol ossification.
In this part, I’ll move on to discuss how to send unreliable data using UDP. I’ll talk about the UDP service model, how to send and receive packets, and how to layer protocols on top of UDP. UDP provides an unreliable, connectionless, datagram service. It adds only two features on top of the IP layer: port numbers and a checksum. The checksum is used to detect whether the packet has been corrupted in transit. If so, the packet will be discarded by the UDP code in the operating system of the receiver, and won’t be delivered to the application. The port numbers determine what application receives the UDP datagrams when they arrive at the destination. They’re set by the bind() call, once the socket has been created. The Internet Assigned Numbers Authority, the IANA, maintains a list of well-known UDP port numbers which you should use for particular applications. This is linked from the bottom of the slide. UDP is very minimal. It doesn’t provide reliability, or ordering, or congestion control. It just delivers packets to an application that’s bound to a particular port. Mostly, UDP is used as a substrate. It’s a base on which higher-layer protocols are built. QUIC is an example of this, as we discussed in the last lecture. Others are the Real-time Transport Protocol, and the DNS protocol, that we’ll talk about later in the course. UDP is connectionless. It’s got no notion of clients or servers, or of establishing a connection before it can be used. To use UDP, you first create a socket. Then you call bind(), to choose the local port on which that socket listens for incoming datagrams. Then you call recvfrom() if you want to receive a datagram on that socket, or sendto() if you want to send a datagram. You don’t need to connect. You don’t need to accept connections. You just send and receive data. And maybe that data is delivered. When you’re finished, you close the socket. Protocols that run on top of UDP, such as QUIC, might add support for connections, reliability, ordering, congestion control, and so on, but UDP itself supports none of this. To send a UDP datagram, you use the sendto() function. This works similarly to the send() function you used to send data over a TCP connection in the labs, except that it takes two additional parameters. These indicate the address to which the datagram should be sent, and the size of that address. When using TCP, you establish a connection between a socket, bound to a local address and port, and a server listening on a particular port on some remote IP address. And once the connection is established, all the data goes over that connection, to the same destination. UDP is not like that. Every time you call sendto(), you specify the destination address. Every packet you send from a UDP socket can go to a different destination, if you want. There’s no notion of connections. Now, you can call connect() on a UDP socket, if you like, but it doesn’t actually create a connection. Rather, it just remembers the address you give it, so you can call send(), rather than sendto(), in future, to save having to specify the address each time. To receive a UDP datagram, you call the recvfrom() function, as shown on the slide. This is like the recv() call you use with TCP, but again it has two additional parameters. These allow it to record the address that the received datagram came from, so you can use that address in a later sendto() call to send a reply.
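To make that concrete, here’s a minimal sketch, in C, of the calls just described: socket(), bind(), recvfrom(), and sendto(), for a process that receives one datagram and echoes it back to whoever sent it. The port number is arbitrary, and error handling is omitted for brevity.

```c
/* Minimal UDP sketch: bind to an arbitrary port, receive one datagram,
 * and send it back to whichever address it came from.
 * Error handling is omitted for brevity. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);    /* connectionless UDP socket */

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family      = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port        = htons(5000);        /* arbitrary example port */
    bind(fd, (struct sockaddr *) &local, sizeof(local));

    char               buf[1500];
    struct sockaddr_in peer;
    socklen_t          peer_len = sizeof(peer);

    /* recvfrom() fills in the sender's address, so we can reply to it */
    ssize_t rlen = recvfrom(fd, buf, sizeof(buf), 0,
                            (struct sockaddr *) &peer, &peer_len);
    if (rlen > 0) {
        /* sendto() names the destination on every call; there is no connection */
        sendto(fd, buf, (size_t) rlen, 0, (struct sockaddr *) &peer, peer_len);
    }

    close(fd);
    return 0;
}
```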
You can also call recv(), rather than recvfrom(), like with TCP, and it works, but it doesn’t give you the return address, so it’s not very useful. The important point with UDP is that packets can be lost, delayed, or reordered in transit, and UDP doesn’t attempt to recover from this. Just because you send a datagram, doesn’t mean it will arrive. And if datagrams do arrive, they won’t necessarily arrive in the order sent. Unlike TCP, where data written to a connection in a single send() call might end up being split across multiple read() calls at the receiver, a single UDP send generates exactly one datagram. If it’s delivered at all, the data sent by a single call to sendto() will be delivered by a single call to recvfrom(). UDP doesn’t split messages. But UDP is otherwise unreliable. Datagrams can be lost, delayed, reordered, or duplicated in transit. Data sent with sendto() might never arrive. Or it might arrive more than once. Or data sent in consecutive calls to sendto() might arrive out of order, with data sent later arriving first. UDP doesn’t attempt to correct any of these things. The protocol you build on top of UDP might choose to do so. For example, we saw that QUIC adds packet sequence numbers and acknowledgement frames to the data it sends within UDP packets. This lets it put the data back into the correct order, and retransmit any missing packets. But there’s no requirement that the protocol running over UDP is reliable. RTP, the Real-time Transport Protocol, that’s used for video conferencing apps, puts sequence numbers and timestamps inside the UDP datagrams it sends, so it can tell if any data is missing, and it can conceal loss or reconstruct the packet playout time, but it generally doesn’t retransmit missing data. UDP gives the application the choice of building reliability, if it wants it. But it doesn’t require that applications deliver data reliably. Applications that use UDP need to organise the data they send so that it’s still useful if some of it is lost. Different applications do this in different ways, depending on their needs. QUIC, for example, organises the data into sub-streams within a connection, and retransmits missing data. Video conferencing applications tend to do something different. The way video compression works is that the codec sends occasional full frames of video, known as I-frames, or index frames, every few seconds. And in between these it sends only the differences from the previous frame, known as P-frames, or predicted frames. In a video call, it’s common for the background to stay the same, while the person moves in the foreground, so a lot of the frame is the same each time. By only sending the differences, video compression saves bandwidth. But this affects how the application treats the different datagrams. If a UDP datagram containing a predicted frame is lost, it’s not that important. You’ll get a glitch in one frame of video. But if a UDP datagram containing an index frame, or part of an index frame, is lost, then that matters a lot more, because the next few seconds’ worth of video are predicted based on that index frame. Losing an index frame corrupts several seconds’ worth of video. For this reason, many video conferencing apps running over UDP try to determine whether missing packets contained an index frame or not. And they try to retransmit index frames, but not predicted frames. The details of how they do this aren’t really important, unless you’re building a video conferencing app.
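As a toy illustration of that kind of policy (purely hypothetical, not taken from any particular application), a sender might record which frame type went into each datagram, and retransmit only the ones that carried index frames:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy sketch of a selective retransmission policy for video over UDP:
 * only index (I-) frame data is worth resending, since the following
 * predicted frames depend on it. Hypothetical, for illustration only. */
enum frame_type { FRAME_INDEX, FRAME_PREDICTED };

struct sent_record {
    uint16_t        seq;    /* application-level sequence number of the datagram */
    enum frame_type type;   /* what kind of frame data the datagram carried */
};

bool should_retransmit(const struct sent_record *lost) {
    /* A lost predicted frame is just a one-frame glitch; accept it. */
    return lost->type == FRAME_INDEX;
}
```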
What’s important, though, is that UDP gives the application flexibility to be unreliable for some of the datagrams it sends, while trying to deliver other datagrams reliably. You don’t have that flexibility with TCP. UDP is harder to use, because it provides very few services to help your application, but it’s more flexible, because you can build exactly the services you need on top of UDP. Fundamentally, UDP doesn’t make any attempt to provide sequencing, reliability, timing recovery, or congestion control. It just delivers datagrams on a best effort basis. It lets you build any type of transport protocol you want, running inside UDP packets. Maybe that transport protocol has sequence numbers and acknowledgements, and retransmits some or all of the lost packets. Maybe, instead, it uses error correcting codes, to allow some of the packets to be repaired without retransmission. Maybe it includes timestamps, so the receiver can carefully reconstruct the timing. Maybe it contains other information. The point is that UDP gives you flexibility, but at the cost of having to implement these features yourself. At the cost of adding complexity. There’s a lot to think about when writing a UDP-based protocol or a UDP-based application. If you use a transport protocol, like QUIC or like RTP, that runs over UDP, then the designers of that protocol have made these decisions, and will have given you a library you can use. If not, if you’re designing your own protocol that runs over UDP, then the IETF has written some guidelines, highlighting the issues you need to think about, in RFC 8085. Please read this before you try to write applications that use UDP. There are a lot of non-obvious things that can catch you out. So, that concludes our discussion of UDP. In the next part, I’ll talk about how TCP delivers data reliably.
Part 3: Reliable Data with TCP
Abstract
The third part of the lecture discusses TCP. It outlines the TCP service model and shows how to send and receive data using a TCP connection. It explains how TCP ensures reliable and ordered data transfer, using sequence numbers and acknowledgements. And it explains TCP loss detection using timeouts and triple-duplicate acknowledgements. The issue of head-of-line blocking in TCP connections is discussed, as an example of the timeliness vs reliability trade-off.
In this part I want to talk about how reliable data is delivered using TCP connections. I’ll talk about the TCP service model, how TCP uses sequence numbers and acknowledgments, and how packet loss detection and recovery works in TCP. Thinking about the TCP service model, as we’ve seen in previous lectures, TCP provides a reliable, ordered, byte stream delivery service that runs over IP. The application writes data into the TCP socket, which buffers it up in the sending system, and then delivers it as a sequence of data segments over the IP layer. When these data segments are received, they are accumulated in a receive buffer at the receiver. If anything is lost, or arrives out of order, it’s re-transmitted, and eventually the data is delivered to the application. The data delivered to the application is always delivered reliably, and in the order sent. If something is lost, if something needs to be re-transmitted, this stalls the delivery of the later data, to make sure that everything is always delivered in order. TCP delivers, as we say, an ordered, reliable, byte stream. After the connection has been established, after the SYN, SYN-ACK, ACK handshake, the client and the server can send and receive data. The data can flow in either direction within that TCP connection. It’s usual that the data follows a request-response pattern. You open the connection. The client sends a request to the server. The server replies with a response. The client makes another request. The server replies with another response, and so on. But TCP doesn’t make any requirements on this. There’s no requirement that the data flows in a request-response pattern, and the client and the server can send data in any order they feel like. TCP does ensure that the data is delivered reliably, and in the order it was sent, though. TCP sends acknowledgments for each data segment as it’s received. And if any data is lost, it retransmits that lost data. And if segments are delayed and arrive out of order, or if a segment has to be re-transmitted and arrives out of order, then TCP will reconstruct the order before giving the segments back to the application. In order to send data over a TCP connection you use the send() function. This transmits a block of data over the TCP connection. The parameters are the file descriptor representing the TCP socket, the data, the length of the data, and a flags field, which is usually zero. The send() function blocks until the data can be written. And it might take a significant amount of time to do this, depending on the available capacity of the network. It also might not be able to send all the data. If the connection is congested, and can’t accept any more data, then the send() function will return to indicate that it wasn’t able to successfully send all the data that was requested. The return value from the send() function is the amount of data it actually managed to send on the connection. And that can be less than the amount it was asked to send. In which case, you need to figure out what data was not sent, by looking at the return value and the amount you asked for, and re-send just the missing part in another call, as the sketch below shows. Similarly, if an error occurs, if the connection has failed for some reason, the send() function will return -1, and it will set the global variable errno to indicate that.
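Here’s a minimal sketch of that pattern. The helper function is hypothetical, not part of the sockets API; it simply keeps calling send() until the whole buffer has been accepted, or an error occurs.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: keep calling send() until everything in buf has been
 * handed to TCP, or an error occurs. Returns 0 on success, -1 on error. */
int send_all(int fd, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal; retry */
            }
            return -1;              /* connection failed; errno says why */
        }
        sent += (size_t) n;         /* send() may accept fewer bytes than asked */
    }
    return 0;
}
```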
On the receiving side, you call the recv() function to receive data on a TCP connection. The recv() function blocks until data is available, or until the connection is closed. It’s passed a buffer, buf, and the size of that buffer, BUFLEN, and it reads up to BUFLEN bytes of data. What it returns is the number of bytes of data that were read. Or, if the connection was closed, it returns zero. Or, if an error occurs, it returns -1, and again sets the global variable errno to indicate what happened. When a recv() call finishes, you have to check these three possibilities. You have to check if the return value is zero, to indicate that the connection is closed and you’ve successfully received all the data in that connection. At which point, you should also close the connection. You have to check if the return value is minus one, in which case an error has occurred, and that connection has failed, and you need to somehow handle that error. And you need to check if it’s some other value, to indicate that you’ve received some data, and then you need to process that data. What’s important is to remember that the recv() call just gives you that data in the buffer. If the return value from recv() is 157, this indicates that the buffer has 157 bytes of data in it. What the recv() call doesn’t ever do is add a terminating NUL to that buffer. Now, if you’re careful, that doesn’t matter, because you know how much data is in the buffer, and you can explicitly process the data up to that length. But a common problem with TCP-based applications is that they treat the data as if it were a string. They pass it to the printf() call using %s as if it were a string, or they pass it to a function like strstr() to search for a string within it, or strcpy(), or something like that. And the problem is that the string functions assume there’s a terminating NUL, and the recv() call doesn’t provide one. If you’re going to pass the data that’s returned from a recv() call to one of the C string functions, you need to explicitly add that NUL yourself. You need to look at the buffer, and add the NUL at the end, after the last byte which was successfully received. If you don’t do this, the string functions will just run off the end of the buffer, and you’ll get a buffer overflow. And this is a significant security risk. It’s one of the biggest security problems with network code written in C: misusing these buffers, accidentally using one of the string functions, so the code just reads off the end of the buffer, and who knows what it processes.
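Putting those checks together, here’s a minimal sketch of handling a single recv() call: the three return cases, and adding the terminating NUL before the buffer is treated as a string. The buffer size is arbitrary.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define BUFLEN 1500                 /* arbitrary buffer size for this sketch */

/* Sketch of handling one recv() call on a connected TCP socket fd. */
void handle_one_read(int fd) {
    char    buf[BUFLEN + 1];        /* one spare byte for the terminating NUL */
    ssize_t rlen = recv(fd, buf, BUFLEN, 0);

    if (rlen == 0) {
        close(fd);                  /* peer closed the connection cleanly */
    } else if (rlen < 0) {
        perror("recv");             /* error: errno indicates what went wrong */
        close(fd);
    } else {
        buf[rlen] = '\0';           /* recv() never adds the NUL; add it before
                                       passing the buffer to any string function */
        printf("received %zd bytes: %s\n", rlen, buf);
    }
}
```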
When you send data using TCP, the send() call enqueues the data for transmission. The operating system, the TCP code in the operating system, splits the data you’ve written using the various send() calls into what’s known as segments, and puts each of these into a TCP packet. The TCP packets are sent in IP packets. And TCP runs a congestion control algorithm to decide when it can send those packets. Each segment is carried in a TCP packet, and the TCP packets have a header, which has a sequence number. When the connection setup handshake happens, in the SYN and the SYN-ACK packets, the connection agrees the initial sequence numbers; it agrees the starting value for the sequence numbers. The client, for example, picks a sequence number at random, and sends this in its SYN packet. Then, when it starts sending data, the first data packet has a sequence number that is one higher than that in the SYN packet. And, as it continues to send data, the sequence numbers increase by the number of data bytes sent. So, for example, if the initial sequence number was 1001, just picked randomly, and it sends 30 bytes of data in the packet, then the next sequence number will be 1031. The sequence number spaces are separate for each direction. The sequence numbers the client uses increase based on the initial sequence number the client sent in the SYN packet. The sequence numbers the server uses start from the initial sequence number the server sent in the SYN-ACK packet, and increase based on the amount of data the server is sending. The two number spaces are unrelated. What’s important is that calls to send() don’t map directly onto TCP segments. If the data which is given to a send() call is too big to fit into one TCP segment, then the TCP code will split it across several segments; it’ll split it across several packets. Similarly, if the data you give to a send() call is quite small, TCP might not send it immediately. It might buffer it up, combine it with data from a later send() call, and send it in a single larger segment, a single larger TCP packet. This is an idea known as Nagle’s algorithm. It’s there to improve efficiency by only sending big packets, because there’s a certain amount of overhead for each packet. Each packet that’s sent by TCP has a TCP header. It’s got an IP header. It’s got the Ethernet or the WiFi headers, depending on the link layer. And that adds a certain amount of overhead. It’s about, I think, 40 bytes per packet. So if you’re only sending a small amount of data, that’s a lot of overhead, a lot of wasted data. So TCP, with the Nagle algorithm, tries to combine these packets into larger packets when it can. But, of course, this adds some delay. It’s got to wait for you to send more data; wait to see if it can form a bigger packet. If you really need low latency, you can disable the Nagle algorithm. There’s a socket option called TCP_NODELAY, and a sketch of how to use it is shown below. You create the socket, you establish the connection, and then you set the TCP_NODELAY option, and that turns Nagle’s algorithm off. And this means that every time you send() on the socket, the data gets sent as quickly as possible.
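A minimal sketch of setting that option, assuming a POSIX sockets environment and an already-connected TCP socket:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>            /* defines TCP_NODELAY */
#include <sys/socket.h>

/* Sketch: disable Nagle's algorithm on a connected TCP socket fd, so that
 * small writes are sent immediately rather than being coalesced. */
int disable_nagle(int fd) {
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```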
One implication of this behaviour, though, where TCP can either split data written in a single send() across multiple segments, or combine several send() calls into a single segment, is that the data returned by the recv() calls doesn’t always correspond to a single send(). When you call recv(), you might get just part of a message, and you need to call recv() again to get the rest of the message. Or you may get several messages in one recv() call. When you’re using TCP, the recv() calls return the data reliably, and they return the data in the order that it was sent. But what they don’t do is frame the data. What they don’t do is preserve the message boundaries. For example, the slide shows an HTTP response that might be sent by a web server back to a browser. If we’re using HTTP, what we would like is that the whole response is received in one go. So if we’re implementing a web browser, we just call recv() on the TCP connection, we get all of the headers and all of the body in just one call to recv(), and we can then parse it, and process it, and deal with it. TCP doesn’t guarantee this, though. It can split the messages arbitrarily, depending on how much data was in the packets, what size packets the underlying link layers can send, and on the available capacity of the network, depending on the congestion control. And it can split the packets at arbitrary points. For example, if we look at the slide, we see that some of the headers are labelled in red, some are in blue, some of the body is in blue, and the rest of the body is in green. And it could be that the TCP connection splits the data up, so that the first recv() call just gets the part of the headers highlighted in red, ending halfway through the “ETag:” line. And then you have to call recv() again. And then you get the part of the message highlighted in blue, which contains the rest of the headers and the first part of the body. Then you have to call recv() again, to get the rest of the message that’s highlighted in green on the slide. And this makes it much harder to parse; much harder for the programmer. Because you have to look at the data you’ve got, parse it, check to see if you’ve got the whole message, check if you’ve received the complete headers, check to see if you’ve received the complete body. And you have to handle the fact that you might have partial messages. And it’s something which makes it a little bit hard to debug, because if you only send small messages, messages of only 1000 bytes or so, they’re probably small enough to fit in a single packet, and they always get delivered in one go. It’s only when you start sending larger messages, or sending lots of data over a connection so things get split up due to congestion control, that you start to see this behaviour where the messages get split at arbitrary points. The usual way to cope with this is to accumulate the received data in a buffer, and parse it to decide whether a complete message has arrived yet, as the sketch below shows.
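Here’s a minimal sketch of that accumulate-and-parse pattern. The is_complete_message() function is a stand-in for whatever parsing the application actually does (for HTTP, checking that the complete headers and body have arrived); it’s hypothetical, supplied by the application.

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical parser, supplied by the application: returns 1 once
 * buf[0..len) holds a complete message, 0 otherwise. */
int is_complete_message(const char *buf, size_t len);

/* Sketch: keep calling recv() and appending to a buffer until the parser
 * says a whole message has arrived. TCP provides no message boundaries,
 * so this framing is the application's job.
 * Returns the number of bytes accumulated, 0 if the connection closed,
 * or -1 on error. */
ssize_t read_message(int fd, char *buf, size_t maxlen) {
    size_t have = 0;
    while (have < maxlen && !is_complete_message(buf, have)) {
        ssize_t n = recv(fd, buf + have, maxlen - have, 0);
        if (n <= 0) {
            return n;               /* 0 = connection closed, -1 = error */
        }
        have += (size_t) n;         /* may be part of a message, or several */
    }
    return (ssize_t) have;
}
```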
So, as we’ve seen, the TCP segments contain sequence numbers, and the sequence numbers count up with the number of bytes being sent. Each TCP segment also has an acknowledgement number. When a TCP segment is sent, it acknowledges any segments that have previously been received. So if a TCP endpoint has received some data on a TCP connection, when it sends its next packet, the ACK bit will be set in the TCP header, to indicate that the acknowledgement number is valid, and the acknowledgement number will have a value indicating the next sequence number it is expecting. That is, the next contiguous byte it’s expecting on the connection. The example on the slide is slightly unrealistic, in that the connection is sending one byte at a time. The first packet is sent with sequence number five, and then the next packet is sent with sequence number six, and then seven, and eight, and nine, and ten, and so on. This is what might happen with an ssh connection, where each key you type generates a TCP segment with just the one key press in it. And when those packets are received at host B, it sends a TCP segment with the acknowledgement bit set, acknowledging what’s expected next. So when it receives the TCP packet with sequence number five, and one byte of data in it, it sends an acknowledgement saying it got it, and it’s expecting the packet with sequence number six next. When it receives the packet with sequence number six, and one byte of data in it, it sends an acknowledgement saying it’s expecting seven. And so on. TCP only ever acknowledges the next contiguous sequence number expected. And if a packet is lost, subsequent packets generate duplicate acknowledgments. So in this case, packet five was sent. It got to the receiver, and that sent the acknowledgement saying it expected six. Six was sent, arrived at the receiver, so the acknowledgement says it expects seven. Seven was sent, arrives at the receiver, which sends the acknowledgement saying it expects eight. Eight was sent, and gets lost. Nine was sent, and arrives at the receiver. At this point, the receiver has received the packets with sequence numbers five, six, and seven; eight is missing; and nine has arrived. So the next contiguous sequence number it’s expecting is still eight. So it sends an acknowledgement saying “I’m expecting sequence number eight next”. The next packet sent has sequence number ten. This arrives, the acknowledgement goes back saying “I still haven’t got eight, I’m still expecting eight”, and this carries on. TCP keeps sending duplicate acknowledgments while there’s a gap in the sequence number space. In addition, although we don’t show it here, TCP can also send delayed acknowledgments, where it only acknowledges every second packet. In this case the acknowledgments might go six, eight. The packet with sequence number five is sent, and it acknowledges six. The packet with sequence number six is sent and arrives, and packet number seven is sent, and then it sends the acknowledgement saying it’s expecting eight. So it doesn’t have to send every acknowledgement; it can send every other acknowledgement to reduce the overheads. TCP uses the acknowledgments to detect packet loss; to detect when segments are lost. There are two ways in which it does this. The first is when it sends data, but for some reason the acknowledgments stop entirely. This is a sign that either the receiver has failed (the packets are being delivered to the receiver, but the application has crashed, and there’s nothing there to receive the data and reply), or that the network connection has failed, and the packets are just not reaching the receiver. So if TCP is sending data, and it’s not getting any acknowledgments back, after a while it times out and uses this as an indication that the connection has failed. Alternatively, it can be sending data, and if some data is lost, but the later segments arrive, then TCP will start sending the duplicate acknowledgments. Again, back to the example, we see that packet eight is lost, packet nine arrives, and the acknowledgement number comes back saying “I’m expecting sequence number eight”. And packet ten is sent and it arrives, and the acknowledgement still says “I’m still expecting the packet with sequence number eight”, and this just carries on. And, eventually, TCP gets what’s known as a triple duplicate acknowledgement. It’s got the original acknowledgement saying it’s expecting packet eight, and then three duplicates following that, so four packets in total, all saying “I’m still expecting packet eight”. What this indicates is that data is still arriving, but something’s got lost. The receiver only generates acknowledgements when a new packet arrives, so if we keep seeing acknowledgments indicating the same thing, this indicates that new packets are arriving, because that’s what triggers the acknowledgements to be sent, but there’s still a packet missing, and it’s telling us which one it’s expecting. At that point TCP assumes that the packet has got lost, and retransmits that segment. It retransmits the packet with sequence number eight. Why does it wait for a triple duplicate acknowledgement? Why does it not just retransmit immediately when it sees a duplicate?
Well, the example we see here illustrates that. In this case, a packet with sequence number five is sent, containing one byte of data, and it arrives, and the receiver acknowledges it, saying it’s expecting six. And six is sent, and it arrives, and the receiver acknowledges it, indicating it’s expecting seven. And packet seven is sent, and it’s delayed. And packet eight is sent, and eventually arrives at the receiver. Now the receiver hasn’t received packet seven yet, so it sends an acknowledgement which says “I’m still expecting seven”. So that’s a duplicate acknowledgement. At that point packet seven, which was delayed, finally does arrive. Now packet seven has arrived, and packet eight had arrived previously, so what it’s now expecting is nine, so it sends an acknowledgement for nine. And we see that the acknowledgments go six, seven, seven, nine, because that packet seven was delayed a little bit. And if TCP reacts to a single duplicate acknowledgement as an indication that the packet was lost, then you run the risk of resending a packet on the assumption that it was lost, when it was merely delayed a little bit. And there’s a trade-off you can make here. Do you treat a single duplicate as an indication of loss? Do you treat two duplicates as an indication of loss? Three? Four? Five? At what point do you say “this is an indication of loss”, rather than just “this is a slightly delayed packet, and it might recover itself in a minute”? The reason that a triple duplicate is used, is because someone did some measurements, and decided that packets being delayed enough to cause one or two duplicates, because they arrived just a little bit out of order, was relatively common. But packets being delayed enough that they cause three or more duplicates is rare. So it’s balancing off the speed of loss detection against the likelihood that a merely delayed packet is treated as if it were lost, and retransmitted unnecessarily. And, based on the statistics, the belief of the designers of TCP was that waiting for three duplicates was the right threshold. And you could make a TCP version that reduced this to two, or even one, duplicate, and it would respond to loss faster, but it would run the risk that it’s more likely to unnecessarily retransmit something that’s just delayed. Or you could make it four, five, six, or even more duplicate acknowledgments, which would be less likely to unnecessarily retransmit data. But it’d be slower, because it would be slower in responding to loss, and slower in retransmitting actually lost packets. The other behaviour of TCP which is worth noting is head-of-line blocking. Now, in this case we’re sending something more realistic. We’re sending full size packets, with 1500 bytes of data in each packet. And 1500 bytes is the maximum packet size that you can send in an Ethernet packet, or in a WiFi packet, so this is a typical size that actually gets sent. In this case, the first packet is sent with sequence numbers in the range zero through to 1499. And this arrives at the receiver, and the receiver sends an acknowledgement saying it got it, and the next packet it’s expecting has sequence number 1500. So it sends an acknowledgement for 1500. And if there’s a recv() call outstanding on that socket, that recv() call will return at that point, and return 1500 bytes of data. It returns the data as it was received.
The next packet arrives at the receiver, containing sequence numbers 1500 through to 2999, and again the recv() call, if there is one, will return, and return that next 1500 bytes. Similarly, when the packet containing the next 1500 bytes comes in, the receiver will send the ACK saying “I’m expecting 4500”, and the recv() call will return. The packet containing sequence numbers 4500 through to 5999 is lost. The packet containing 6000 through to 7499 arrives. The acknowledgement goes back indicating that it’s still expecting sequence number 4500, because that packet got lost. At that point, some new data has arrived at the receiver. But there’s a gap. The packet containing data with sequence numbers 4500 through to 5999 is still missing. So if the receiver application has called recv() on that socket, it won’t return. The data has arrived, it’s buffered up in the TCP layer in the operating system, but TCP won’t give it back to the application. And the packets can keep being sent, and the receiver keeps sending the duplicate acknowledgments, and eventually it’s sent the triple duplicate acknowledgement, and the TCP sender notices and retransmits the packet with sequence numbers 4500 through to 5999. And eventually that arrives at the receiver. At that point, the receiver has a contiguous block of data available, with no gaps in it, and it returns all of the data from sequence number 4500 up to sequence number 12,000 to the application in one go. And if the application has given it a big enough buffer, at that point the recv() call will return 7500 bytes of data. It’ll return all of that received data in one big burst. And then, as more data arrives, the recv() call will unblock and data will keep flowing. The point is that the TCP receiver waits for any missing data to be delivered. If anything’s missing, the triple duplicate ACK happens, the missing data eventually gets retransmitted, and the receiver won’t return anything to the application until that retransmission has happened. It’s called head-of-line blocking. The data stops being delivered, until it can be delivered in sequence to the application. It’s all just buffered up in the operating system, in the TCP code. TCP always gives the data to the application in a contiguous ordered sequence, in the order it was sent. And this is another reason why the recv() calls don’t always preserve the message boundaries: it depends how much data was queued up because of packet losses, and so on, so that it can always be delivered in order. The head-of-line blocking increases the total download time. We see on the left, the case where one packet was lost and had to be re-transmitted. And we see on the right, the case where all the packets were received on time. And we see an increase in the download time because of the packet loss. It blocks the receiving, it delays things a little bit, waiting for the retransmission. And it increases the overall download time a little bit. It disrupts the behaviour of when the packets are received during the download quite significantly. We see 1500 bytes, 1500, 1500, a big gap, then 7500, 1500, 1500, in the case where a packet was lost. Whereas in the case where everything was received, the data is coming in quite smoothly. It’s regularly spaced. So it affects the timing, it affects when the data is delivered to the application, and it has a smaller effect on the overall download time.
And if you’re building real-time applications, this is a significant problem. We see in the case on the right, where everything is delivered on time, that the data is released to the application very quickly and very predictably, and you don’t need much buffering delay at the receiver. Things are just delivered to the application on a regular schedule. But the minute something gets lost, it has to wait for the retransmission. In this case it waits for one round trip time, because the ACK has to get back, and then the data has to be retransmitted. Plus, it has to wait for four times the gap between packets, to allow for the four acknowledgments, the original ACK plus the three duplicates. So you get one round trip time plus four times the packet spacing. So if you’re using TCP to send, for example, speech data, where it’s sending packets regularly every 20 milliseconds, you need to buffer 80 milliseconds plus the round trip time, to allow for these re-transmissions, if you’re using it for a real-time application. Because it waits for the retransmissions, and because of the head-of-line blocking. And when you’re using applications like Netflix or the iPlayer, when you press play on the video there’s a little pause where it says “buffering”. This is what it’s doing. It’s buffering up enough data in the TCP connection that it can keep playing out the video frames, in order, while still allowing time for a retransmission to happen. So it’s buffering up the data, making sure there’s enough data buffered, because of this head-of-line blocking issue in TCP. So that concludes the discussion of TCP. It gives you an ordered, reliable, byte stream. As a service model it’s easy to understand. It’s like reading from a file; you read from the connection and the bytes arrive reliably and in the order they were sent. The timing, though, is unpredictable. How much you get from the connection each time you read from it, and whether the data arrives regularly, or whether it arrives in big bursts with large gaps between them, depends on how much data is lost, and on whether TCP has to retransmit missing data. And if you’re just using this to download files, that doesn’t matter. It means that the progress bar is perhaps inaccurate, but otherwise it doesn’t make much difference. But if you’re using it for real-time applications, like video streaming or telephony, this head-of-line blocking can quite significantly affect the playout. And a lot of that is the reason why real-time applications use UDP. And for those that don’t use UDP, applications like Netflix that use adaptive streaming over HTTP, which we’ll talk about in lecture seven, that’s why there’s this buffering delay before they start playing. And, of course, the lack of framing complicates the application design: there are no message boundaries, so you have to parse the data to work out whether you’ve received a complete message; the connection doesn’t tell you. So that’s it for TCP. It delivers data reliably. It uses sequence numbers and acknowledgments to indicate what data has arrived. It uses timeouts to indicate that a connection has failed. And it uses this idea of triple duplicate ACKs to indicate that a packet has been lost, and to trigger a retransmission of any lost data.
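As a toy sketch of the sender-side bookkeeping behind that triple duplicate ACK rule (greatly simplified compared to a real TCP implementation), the sender can track the last acknowledgement number it saw and count how many times it repeats:

```c
#include <stdint.h>

#define DUP_ACK_THRESHOLD 3         /* the "triple duplicate" threshold */

/* Toy sender-side state for duplicate acknowledgement counting. */
struct dupack_state {
    uint32_t last_ack;              /* highest acknowledgement number seen */
    int      dup_count;             /* how many duplicates of it have arrived */
};

/* Called for each acknowledgement received. Returns 1 if this is the third
 * duplicate, i.e. the segment starting at ack_no should be retransmitted. */
int on_ack_received(struct dupack_state *s, uint32_t ack_no) {
    if (ack_no == s->last_ack) {
        s->dup_count++;
        return s->dup_count == DUP_ACK_THRESHOLD;
    }
    s->last_ack  = ack_no;          /* new data acknowledged: reset the count */
    s->dup_count = 0;
    return 0;
}
```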
What I’ll talk about in the next part is QUIC and how it differs from the way TCP handles reliability.
Part 4: Reliable Data Transfer with QUIC
Abstract
The final part of the lecture discusses reliable data transfer using QUIC. It outlines the QUIC service model, and how it differs from that of TCP, and shows how QUIC achieves reliable data transfer. It discusses how QUIC provides multiple streams within a single connection, and considers how this affects head-of-line blocking and latency. Approaches to making best use of multiple streams are discussed.
In this final part I’d like to talk about how reliable data transfer works with QUIC, and how it’s different to reliable data transfer with TCP. I’ll talk a little bit about the QUIC service model, and how it handles packet numbers and retransmission. I’ll talk about the multi-streaming features of QUIC. And I’ll talk about how it avoids head-of-line blocking. The service model for TCP, as we saw previously, is that it delivers a single reliable, ordered, byte stream of data. Applications write a stream of bytes in, and that stream of bytes is delivered to the receiver, eventually. QUIC, by contrast, delivers several ordered, reliable byte streams within a single connection. Applications can separate the data they’re sending into different streams, and each stream is delivered reliably and in order. QUIC doesn’t preserve the ordering between the streams within a connection, so if you send on one stream, and then send on a second stream, the data you sent second, on that second stream, may arrive first. But it does preserve the ordering within a stream. You can treat the streams as if you were running multiple TCP connections in parallel, so you get the same service model but with several streams of data. Or you could perhaps treat the streams as a sequence of messages to be sent, with the stream boundaries indicating the message boundaries. QUIC delivers data in packets. Each QUIC packet has a packet sequence number, a packet number, and the packet numbers are split into two packet number spaces. The packets sent during the initial QUIC handshake start with packet sequence number zero, and that packet sequence number increases by one for each packet sent during the handshake. Then, when the handshake’s complete, and it switches to sending data, it resets the packet sequence number to zero and starts again. Within each of these packet number spaces, the handshake space and the data space, the packet number sequence starts at zero, and goes up by one for every packet sent. That is, the sequence numbers in QUIC, the packet numbers in QUIC, count the number of packets being sent. That’s different to TCP. In TCP, the sequence number in the header counts the offset within the byte stream; it counts how many bytes of data have been sent. Whereas in QUIC, the packet numbers count the number of packets. Inside a QUIC packet is a sequence of frames. Some of those frames may be stream frames, and stream frames carry data. Each stream frame has a stream ID, so it knows which of the many sub-streams it’s carrying data for, and it also has the amount of data being carried, and the offset of that data from the start of the stream. So, essentially, the stream frames contain offsets which play the same role as TCP sequence numbers, in that they count the bytes of data being sent on that stream. And the packets have sequence numbers that count the number of packets being sent. We can see this in the diagram on the right, where we see the packet numbers going up: zero, one, two, three, four. And for the stream data: packet zero carries data from the first stream, bytes zero through 1000. Packet one carries data from the first stream, bytes 1001 to 2000. And packet two carries bytes 2001 to 2500 from the first stream, and zero to 500 from the second stream, and so on. And we see that we can send data on multiple streams in a single packet.
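As a rough mental model of those two levels of numbering (a toy data structure for illustration, not the actual QUIC wire format), it might look something like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of QUIC's two levels of numbering; not the real wire format. */
struct stream_frame {
    uint64_t       stream_id;   /* which sub-stream this data belongs to */
    uint64_t       offset;      /* byte offset within that stream */
    size_t         length;      /* number of bytes of stream data carried */
    const uint8_t *data;        /* the stream data itself */
};

struct quic_packet {
    uint64_t packet_number;     /* counts packets, not bytes, per number space */
    size_t   num_frames;        /* how many frames this packet carries */
    struct stream_frame frames[8];  /* may hold frames from several streams */
};
```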
QUIC doesn’t preserve message boundaries within the streams. In the same way that, within a TCP stream, if you write data to the stream and the amount you write is too big to fit into a packet, it may be arbitrarily split between packets. Or, if the data you send in a TCP stream is too small, and doesn’t fill a whole packet, it may be delayed waiting for more data, to be able to fill up the packet before it’s sent. The same thing happens with QUIC. If the amount of data you write to a stream is too big to fit into a QUIC packet, then it will be split across multiple packets. Similarly, if the amount of data you write to a stream is very small, QUIC may buffer it up, delay it, and wait for more data, so it can fill a complete packet before sending it. In addition, QUIC can take data from more than one stream, and send it in a single packet, if there’s space to do so. And if there’s more than one stream with data that’s available to send, then the QUIC sender can make an arbitrary decision about how it prioritises that data, and how it delivers frames from each stream. Usually it will split the data from the streams, so each packet carries half its data from one stream, and half from another. But it may alternate them if it wants, sending one packet with data from stream 1, one from stream 2, one from stream 1, one from stream 2, and so on. On the receiving side, the QUIC receiver sends acknowledgments for the packets it receives. So, unlike TCP, which acknowledges the next expected sequence number, a QUIC receiver just sends an acknowledgement to say “I got this packet”. So when packet zero arrives, it sends an acknowledgement saying “I got packet zero”. And when packet one arrives, it sends an acknowledgement saying “I got packet one”, and so on. The sender needs to remember what data it put in each packet, so it knows, when it gets an acknowledgement for packet two, that, in this case, it contained bytes 2001 to 2500 from stream one, and bytes zero through 500 from stream two. That information isn’t in the acknowledgments. What’s in the acknowledgments is just the packet numbers, so the sender needs to keep track of how it put the data from the streams into the packets. The acknowledgments in QUIC are also a bit more sophisticated than they are in TCP, in that QUIC doesn’t just have an acknowledgement number field in the header. Rather, it sends the acknowledgments as frames in the packets coming back. And this gives a lot more flexibility, because it can have a fairly sophisticated frame format, and it can change the frame format to support different ways of acknowledging packets, if it needs to. In the initial version of QUIC, what’s in the ACK frames coming back from the receiver to the sender is a field indicating the largest acknowledged packet number, which is essentially the same as the TCP acknowledgment: it tells you the highest packet number received. There’s an ACK delay field, that tells you how long the receiver waited, after receiving that packet, before sending the acknowledgement. So this is the delay in the receiver. And by measuring the time it takes for the acknowledgment to come back, and subtracting this ACK delay, you can estimate the network round trip time, excluding the processing delays in the receiver.
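A sketch of that calculation, assuming the sender records when each packet was sent; the function name and units are illustrative, not taken from any particular QUIC implementation.

```c
#include <stdint.h>

/* Sketch: estimate the network round trip time from a QUIC acknowledgement,
 * subtracting the receiver's reported ACK delay so that host processing time
 * isn't counted as network delay. All times are in microseconds. */
uint64_t rtt_sample_us(uint64_t time_sent_us, uint64_t ack_received_us,
                       uint64_t ack_delay_us) {
    uint64_t raw_rtt = ack_received_us - time_sent_us;
    return (ack_delay_us < raw_rtt) ? raw_rtt - ack_delay_us : raw_rtt;
}
```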
There’s also a list of ACK ranges. The ACK ranges are a way for the receiver to say “I got a range of packets”. So you can send an acknowledgement that says “I got packets five through seven” in a single go. And you can split this up, with multiple ACK ranges. So you could have an acknowledgement that says “I got packet five; I got packets seven through nine; and I got packets 11 through 15”, and you can send all of that within a single ACK frame on the reverse path. And this gives it more flexibility, so it doesn’t just have to acknowledge the most recently received packet, which gives the sender more information to use when deciding what to retransmit. This is a bit like the TCP selective acknowledgement extension. Like TCP, QUIC will retransmit lost data. The difference is that TCP retransmits packets, exactly as they were originally sent, so the retransmission looks just the same as the original packet. QUIC never retransmits packets. Each packet in QUIC has a unique packet sequence number, and each packet is only ever transmitted once. What QUIC does instead is retransmit the data which was in those packets, in a new packet. So in the example on the slide, we see that packet number two got lost, and it contained bytes 2001 to 2500 from stream one, and bytes zero through 500 from stream two. And when the sender gets the acknowledgments indicating that the packet was lost, it resends that data. In this case, in packet six, it resends bytes 2001 to 2500 from stream one, and it will eventually, at some point later, retransmit the data from stream two. As we say, each packet has a unique packet sequence number. Since each packet is acknowledged as it arrives, rather than acknowledging the next sequence number expected in the way TCP does, you can’t use the triple duplicate ACK in the same way, because you don’t get duplicate ACKs; each ACK acknowledges the next new packet. Rather, QUIC declares a packet to be lost when it’s got ACKs for three packets with higher packet numbers than the one in question. At that point, it can retransmit the data that was in that packet. That’s QUIC’s equivalent of the triple duplicate ACK: it’s acknowledgments for three later packets, rather than three duplicate acknowledgments. And also, just like TCP, if there’s a timeout, and it stops getting ACKs, then it declares the packets to be lost.
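A toy sketch of that packet-threshold rule (ignoring the timeout case, and much of the detail a real implementation needs):

```c
#include <stdint.h>

#define PACKET_THRESHOLD 3          /* packets acknowledged beyond this one */

/* Toy rule: an as-yet-unacknowledged packet `pn` is declared lost once the
 * largest acknowledged packet number is at least PACKET_THRESHOLD beyond it.
 * The data it carried is then resent in a new packet with a new number. */
int packet_is_lost(uint64_t pn, uint64_t largest_acked) {
    return largest_acked >= pn + PACKET_THRESHOLD;
}
```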
The specification puts no requirements on how the sender does this, and different senders can choose to do it differently, depending on whether they’re trying to make progress on each stream simultaneously, or whether they want to alternate, and make sure that packet loss only ever affects a single stream. Depending on how they do this, the streams can suffer from head of line blocking independently. If data is lost on a particular stream, then that stream can’t deliver later data to the application, until that lost data has been retransmitted. But the other streams, if they’ve got all the data, can keep delivering to the application. So streams suffer from head of line blocking individually, but there’s no head of line blocking between streams. This means that the data is delivered reliably, and in order, on a stream, but order’s not preserved between streams. It’s quite possible that one stream can be blocked, waiting for a retransmission of some of the data in the packets, while the other streams, which haven’t seen any loss, continue to deliver data. Each stream is sent and received independently. And this means if you’re careful with how you split data across streams, and if the implementation is careful with how it puts data from streams into different packets, it can limit the duration of the head of line blocking, and make the streams independent in terms of head of line blocking and data delivery. QUIC delivers, as we’ve seen, several ordered, reliable, byte streams of data in a single connection. How you treat these different byte streams is, I think, still a matter of interpretation. It’s possible to treat a QUIC connection as though it was several parallel TCP connections. So, rather than opening multiple TCP connections to a server, you open one QUIC connection, and you send and receive several streams of data within that. And then you treat each stream of data as if it were a TCP stream, and you parse and process the data as if it were a TCP stream. And you possibly send multiple requests, and get multiple responses, over each stream. Or, you can treat the streams more as a framing device. You can choose to interpret each stream as carrying a single object. And then, once you finish sending that object on a stream, you close the stream and move on to use the next one. And, on the receiving side, you just read all the data until you see the end of stream marker, and then you process it knowing you’ve got a complete object. And I think that the best practices, the way of thinking about a QUIC connection, and the streams within a connection, is still evolving. And it’s not clear which of these two approaches is necessarily the right way to do it. And I think it probably depends on the application what makes the most sense. So, to conclude for this lecture. We spoke a little bit about best effort packet delivery on the Internet, and why the IP layer delivers data unreliably, and why it’s appropriate to have a best effort network. Then we spoke a bit about the different transports. The UDP transport that provides an unreliable, but timely, service on which you can build more sophisticated user space application protocols. We spoke about TCP, that provides a reliable ordered stream delivery service. And we spoke about QUIC, that provides a reliable ordered delivery service with multiple streams of data.
And it’s clear there are different services, and different transport protocols, for different needs. What I want to move on to next time is to start talking about congestion control, and how all these different transport protocols manage the rate at which they send data.
L5 Discussion
Summary
Lecture 5 discussed reliable data transfer over the Internet. It started with a discussion of best effort packet delivery, and an explanation of why it makes sense for the Internet to be designed to be an unreliable network. Then, it moved on to discuss UDP and how to make applications and new transport protocols that work on an unreliable network. There’s a trade-off between timeliness and reliability that’s important here, and the lecture gave some examples of this to illustrate why many real-time applications use UDP.
The bulk of the lecture discussed TCP. It spoke about how TCP sends acknowledgements for packets, how timeouts and triple-duplicate ACKs indicate loss, and why a triple-duplicate ACK is chosen as the loss signal. It also discussed head-of-line blocking, and how the in-order, single stream, reliable service model of TCP leads to head-of-line blocking and potential latency.
Finally, it discussed the differences between QUIC and TCP. QUIC acknowledges packets rather than bytes within a stream, uses ACK frames rather than an ACK header, and delivers multiple streams of data, allowing it to avoid head-of-line blocking in many cases.
The focus of the discussion will be on how TCP ensures reliability, to make sure the mechanism is understood, and on the differences between the TCP and QUIC service models and how QUIC can improve latency. We’ll also discuss how UDP can form a substrate that allows new transports, suited to different needs, to be easily built and deployed.
Lecture 6
Abstract
This lecture discusses some of the factors that affect the latency of a TCP connection. It considers TCP congestion control, the TCP Reno and Cubic congestion control algorithms, and their behaviour and performance in terms of throughput and latency. It then considers alternative congestion control algorithms, such as TCP Vegas and BBR, and the use of explicit congestion notification (ECN), as options to lower latency. Finally, it considers the impact of sub-optimal Internet paths on latency, and the rationale for deploying low-Earth orbit satellite constellations to reduce the latency of Internet paths.
Part 1: TCP Congestion Control
Abstract
This first part of the lecture outlines the principle of congestion control. It discusses packet loss as a congestion signal, conservation of packets in flight, and the additive increase, multiplicative decrease requirements for stability.
In this lecture I’d like to move on from talking about how to transfer data reliably, and talk about the mechanisms and means by which transport protocols go about lowering the latency of the communication. One of the key factors limiting the performance of network systems, as we’ve discussed in some of the previous lectures, is latency. Part of that is the latency for establishing connections, and we’ve spoken about that in detail already, where a lot of the issue is the number of round trip times needed to set up a connection. And, especially when secure connections are in use, if you’re using TCP and TLS, for example, as we discussed, there’s a large number of round trips needed to actually get to the point where you can establish a connection, negotiate security parameters, and start to exchange data. And we’ve already spoken about how the QUIC transport protocol has been developed to try and improve latency in terms of establishing a connection. The other aspect of latency, and reducing the latency of communications, is actually in terms of data transfer. How you deliver data across the network in a way which doesn’t lead to excessive delays, and how you can gradually find ways of reducing the latency, and making the network better suited to real time applications, such as telephony, and video conferencing, and gaming, and high frequency trading, and Internet of Things, and control applications. A large aspect of that is how you go about building congestion control, and a lot of the focus in this lecture is going to be on how TCP congestion control works, and how other protocols do congestion control, to deliver data in a low latency manner. But I’ll also talk a bit about explicit congestion notification, and changes to the way queuing happens in the network, and about services such as SpaceX’s Starlink which are changing the way the network is built to reduce latency. I want to start by talking about congestion control, and TCP congestion control in particular. And, what I want to do in this part, is talk about some of the principles of congestion control. And talk about what is the problem that’s being solved, and how we can go about adapting the rate at which a TCP connection delivers data over the network, to make best use of the network capacity, and to do so in a way which doesn’t build up queues in the network and induce too much latency. So in this part I’ll talk about congestion control principles. In the next part I’ll move on to talk about loss-based congestion control, and talk about TCP Reno and TCP Cubic, which are ways of making very effective use of the overall network capacity, and then move on to talk about ways of lowering latency. I’ll talk about latency reducing congestion control algorithms, such as TCP Vegas or Google’s TCP BBR proposal. And then I’ll finish up by talking a little bit about Explicit Congestion Notification in one of the later parts of the lecture. TCP is a complex and very highly optimised protocol, especially when it comes to congestion control and loss recovery mechanisms. I’m going to attempt to give you a flavour of the way congestion control works in this lecture, but be aware that this is a very simplified review of some quite complex issues. The document listed on the slide is entitled “A Roadmap for TCP Specification Documents”, and it’s the latest IETF standard that describes how TCP works, and points to the details of the different proposals. This is a very long and complex document. It’s about, if I remember right, 60 or 70 pages long.
And all it is, is a list of references to other specifications, with one paragraph about each one describing why that specification is important. And the complete specification for TCP is several thousand pages of text. This is a complex protocol with a lot of features in it, and I’m necessarily giving a simplified overview. I’m going to talk about TCP. I’m not going to talk much, if at all, about QUIC in this lecture. That’s not because QUIC isn’t interesting, it’s because QUIC essentially adopts the same congestion control mechanisms as TCP. The QUIC version one standard says to use the same congestion control algorithm as TCP Reno. And, in practice, most of the QUIC implementations use the Cubic or the BBR congestion control algorithms, which we’ll talk about later on. QUIC is basically adopting the same mechanisms as TCP, and for that reason I’m not going to talk about them too much separately. So what is the goal of congestion control? What are the principles of congestion control? Well, the idea of congestion control is to find the right transmission rate for a connection. We’re trying to find the fastest rate at which you can send that matches the capacity of the network, and to do so in a way that doesn’t build up queues, doesn’t overload, doesn’t congest the network. So we’re looking to adapt the transmission rate of a flow of TCP traffic over the network, to match the available network capacity. And as the network capacity changes, perhaps because other flows of traffic start up, or perhaps because you’re on a mobile device and you move into an area with different radio coverage, the speed at which TCP is delivering the data should adapt to match the changes in available capacity. The fundamental principles of congestion control, as applied in TCP, were first described by Van Jacobson, who we see in the picture on the top right of the slide, in the paper “Congestion Avoidance and Control”. And those principles are that TCP responds to packet loss as a congestion signal. Because the Internet is a best effort packet network, it discards packets if it can’t deliver them, and TCP treats that discard, that loss of a packet, as a congestion signal: a signal that it’s sending too fast and should slow down. It relies on the principle of conservation of packets. It tries to keep the number of packets which are traversing the network roughly constant, assuming nothing changes in the network. And it relies on the principle of additive increase, multiplicative decrease. If it has to increase its sending rate, it does so relatively slowly, an additive increase in the rate. And if it has to reduce its sending rate, it does so quickly, a multiplicative decrease. And these are the fundamental principles that Van Jacobson elucidated for TCP congestion control, and for congestion control in general. And it was Van Jacobson who did the initial implementation of these in TCP in the late 1980s, around 1987, ’88, or so. Since then, the congestion control algorithms for TCP have been maintained by a large number of people. A lot of people have developed this. Probably one of the leading people in this space for the last 20 years or so is Sally Floyd, who was very much responsible for taking the TCP standards, making them robust, pushing them through the IETF to get them standardised, and making sure they work and get really high performance.
And she very much drove the development to make these robust, and effective, and high performance standards, and to make TCP work as well as it does today. And Sally sadly passed away a year or so back, which is a tremendous shame, but we’re grateful for her legacy in moving things forward. So to go back to the principles. The first principle of congestion control in the Internet, and in TCP, is that packet loss is an indication that the network is congested. Data flowing across the Internet flows from the sender to the receiver through a series of routers. The IP routers connect together the different links that comprise the network. And routers perform two functions: a routing function, and a forwarding function. The purpose of the routing function is to figure out how packets should get to their destination. They receive a packet from some network link, look at the destination IP address, and decide which direction to forward that packet. They’re responsible for finding the right path through the network. But they’re also responsible for forwarding, which is actually putting the packets into the queue of outgoing traffic for the link, and managing that queue of packets to actually transmit the packets across the network. And routers in the network have a set of different links; the whole point of a router is to connect different links. And at each link, they have a queue of packets, which are enqueued to be delivered on that link. And, perhaps obviously, if packets are arriving faster than the link can deliver those packets, then the queue gradually builds up. More and more packets get enqueued in the router waiting to be delivered. And if packets are arriving slower than they can be forwarded, then the queue gradually empties as the packets get transmitted. Obviously the router has a limited amount of memory, and at some point it’s going to run out of space to enqueue packets. So, if packets are arriving at the router faster than they can be delivered down the link, the queue will build up and gradually fill, until it reaches its maximum size. At that point, the router has no space to keep the newly arrived packets, and so it discards the packets. And this is what TCP is using as the congestion signal. It’s using the fact that the queue of packets on an outgoing link at a router has filled up. When the queue fills up and a packet gets lost, it uses that packet loss as an indication that it’s sending too fast. It’s sending faster than the packets can be delivered, and as a result the queue has overflowed, a packet has been lost, and so it needs to slow down. And that’s the fundamental congestion signal in the network. Packet loss is interpreted as a sign that devices are sending too fast, and should go slower. And if they slow down, the queues will gradually empty, and packets will stop being lost. So that’s the first fundamental principle. The second principle is that we want to keep the number of packets in the network roughly constant. TCP, as we saw in the last lecture, sends acknowledgments for packets. When a packet is transmitted it has a sequence number, and a response will come back from the receiver acknowledging receipt of that sequence number. The general approach for TCP, once the connection has got going, is that every time it gets an acknowledgement, it uses that as a signal that a packet has been received. And if a packet has been received, something has left the network.
One of the packets sent into the network has reached the other side, and has been removed from the network at the receiver. That means there should be space to put another packet into the network. And it’s an approach that’s called ACK clocking. Every time a packet arrives at the receiver, and you get an acknowledgement back saying it was received, that indicates you can put another packet in. So the total number of packets in transit across the network ends up being roughly constant. One packet out, you put another packet in. And it has the advantage that if you’re clocking out new packets on receipt of acknowledgments, then if, for some reason, the network gets congested, and it takes longer for acknowledgments to come back, because it’s taking longer for them to work their way across the network, that will automatically slow down the rate at which you send. Because it takes longer for the next acknowledgment to come back, it’s longer before you send your next packet. So, as the network starts to get busy, as the queue starts to build up, but before the queue has overflowed, it takes longer for the acknowledgments to come back, because the packets are queued up in the intermediate links, and that gradually slows down the behaviour of TCP. It reduces the rate at which you can send. So it’s, to at least some extent, self adjusting. The network gets busier, the ACKs come back slower, therefore you send a little bit slower. And that’s the second principle: conservation of packets. One out, one in. And the principle of conservation of packets is great, provided the network is in the steady state. But you also need to be able to adapt the rate at which you’re sending. The way TCP adapts is very much focused on starting slowly and gradually increasing. When it needs to increase its sending rate, TCP increases linearly. It adds a small amount to the sending rate each round trip time. So it just gradually, slowly, increases the sending rate. It gradually pushes up the rate until it spots a loss. Until it loses a packet. Until it overflows a queue. And then it responds to congestion by rapidly decreasing its rate. If a congestion event happens, if a packet is lost, TCP halves its rate. It slows down faster than it speeds up. And this is the final principle, what’s known as additive increase, multiplicative decrease. The goal is to keep the network stable. The goal is to not overload the network. If you can, keep going at a steady rate. Follow the ACK clocking approach. Gradually, just slowly, increase the rate a bit. Keep pushing, just in case there’s more capacity than you think. So just gradually keep probing to increase the rate. If you overload the network, if you cause congestion, if you overflow the queues, cause a packet to be lost, slow down rapidly. Halve your sending rate, and gradually build up again. The fact that you slow down faster than you speed up, and the fact that you follow the one in, one out approach, keeps the network stable. It makes sure it doesn’t overload the network, and it means that if the network does overload, it responds and recovers quickly. The goal is to keep the traffic moving. And TCP is very effective at doing this. So those are the fundamental principles of TCP congestion control. Packet loss as an indication of congestion. Conservation of packets, and ACK clocking. One in, one out, where possible. If you need to increase the sending rate, increase slowly. If a problem happens, decrease quickly.
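As a rough sketch of how those three principles fit together, the fragment below keeps a window of packets allowed in flight, sends a new packet whenever an acknowledgement frees up space (conservation of packets), adds one packet per round trip when everything is delivered (additive increase), and halves the window on loss (multiplicative decrease). It’s purely illustrative: the window is counted in whole packets, and all of the names are made up rather than taken from a real TCP implementation.

/* Illustrative sketch of the principles described above, not a real TCP:
 * conservation of packets (ACK clocking), additive increase on success,
 * multiplicative decrease on loss. The window is counted in packets. */

#include <stdint.h>

struct cc_state {
    uint32_t window;     /* packets allowed in flight       */
    uint32_t in_flight;  /* packets sent but not yet acked  */
};

/* Conservation of packets: an ACK means a packet has left the network,
 * so another may be sent in its place. */
void on_ack(struct cc_state *cc, void (*send_packet)(void))
{
    if (cc->in_flight > 0) {
        cc->in_flight--;
    }
    while (cc->in_flight < cc->window) {
        send_packet();
        cc->in_flight++;
    }
}

/* Additive increase: once per round trip, if everything was delivered,
 * probe for more capacity by allowing one extra packet in flight. */
void on_round_trip_ok(struct cc_state *cc)
{
    cc->window += 1;
}

/* Multiplicative decrease: packet loss means a queue overflowed somewhere,
 * so back off quickly by halving the window. */
void on_loss(struct cc_state *cc)
{
    cc->window = (cc->window > 1) ? cc->window / 2 : 1;
}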
And that will keep the network stable. In the next part I’ll talk about TCP Reno, which is one of the more popular approaches for doing this in practice.
Part 2: TCP Reno
Abstract
The second part of the lecture discusses TCP Reno congestion control. It outlines the principles of window based congestion control, and describes how they are implemented in TCP. The choice of initial window, and how the recommended initial window has changed over time, is discussed, along with the slow start algorithm for finding the path capacity and the congestion avoidance algorithm for adapting the congestion window.
In the previous part, I spoke about the principles of TCP congestion control in general terms. I spoke about the idea of packet loss as a congestion signal, about the conservation of packets, and about the idea of additive increase, multiplicative decrease – increase slowly, decrease the sending rate quite quickly – as a way of achieving stability. In this part I want to talk about TCP Reno, and some of the details of how TCP congestion control works in practice. I’ll talk about the basic TCP congestion control algorithm, how the sliding window algorithm works to adapt the sending rate, and the slow start and congestion avoidance phases of congestion control. TCP is what’s known as a window based congestion control protocol. That is, it maintains what’s known as a sliding window of data which is available to be sent over the network. And the sliding window determines what range of sequence numbers can be sent by TCP onto the network. It uses the additive increase, multiplicative decrease approach to grow and shrink the window. And that determines, at any point, how much data a TCP sender can send onto the network. It augments these with algorithms known as slow start and congestion avoidance. Slow start being the approach TCP uses to get a connection going in a safe way, and congestion avoidance being the approach it uses to maintain the sending rate once the flow has got started. The fundamental goal of TCP is that if you have several TCP flows sharing a link, sharing a bottleneck link in the network, each of those flows should get an approximately equal share of the bandwidth. So, if you have four TCP flows sharing a link, they should each get approximately one quarter of the capacity of that link. And TCP does this reasonably well. It’s not perfect. It, to some extent, biases against long distance flows, and shorter flows tend to win out a little over long distance flows. But, in general, it works pretty well, and does give flows a roughly equal share of the bandwidth. The basic algorithm it uses to do this, the basic congestion control algorithm, is an approach known as TCP Reno. And this is the state of the art in TCP as of about 1990. TCP is an ACK based protocol. You send a packet, and sometime later an acknowledgement comes back telling you that the packet arrived, and indicating the sequence number of the next packet which is expected. The simplest way you might think that would work, is you send a packet. You wait for the acknowledgment. You send another packet. You wait for the acknowledgement. And so on. The problem with that, is that it tends to perform very poorly. It takes a certain amount of time to send a packet down a link. That depends on the size of the packet, and the link bandwidth. The size of the packet is expressed as some number of bits to be sent. The link bandwidth is expressed as some number of bits it can deliver each second. And if you divide the packet size by the bandwidth, that gives you the number of seconds it takes to send each packet. It takes a certain amount of time for that packet to propagate down the link to the receiver, and for the acknowledgment to come back to you, depending on the round trip time of the link. And you can measure the round trip time of the link. And you can divide one by the other. You can take the time it takes to send a packet, and the time it takes for the acknowledgment to come back, and divide one by the other, to get the link utilisation. And, ideally, you want that fraction to be close to one.
You want to be spending most of the time sending packets, and not much time waiting for the acknowledgments to come back before you can send the next packet. The problem is that’s often not the case. For example, assume we’re trying to send data, and we have a gigabit link connecting the machine we’re sending data from, and we’re trying to go from Glasgow to London. And this might be the case you would find if you had one of the machines in the Boyd Orr labs, which is connected to the University’s gigabit Ethernet, and the University has a 10 gigabit per second link to the rest of the Internet, so the bottleneck is that Ethernet. If you’re talking to a machine in London, let’s make some assumptions about how long this will take. You’re sending using Ethernet, and the biggest packet an Ethernet can deliver is 1500 bytes. So 1500 bytes, multiplied by eight bits per byte, gives you the number of bits in the packet. And it’s a gigabit Ethernet, so it’s sending a billion bits per second. So 1500 bytes, times eight bits, divided by a billion bits per second. It will take 12 microseconds, 0.000012 of a second, to send a packet down the link. And that’s just the time it takes to physically serialise 1500 bytes down a gigabit per second link. The round trip time to London, if you measure it, is about a 100th of a second, about 10 milliseconds. If you divide one by the other, you find that the utilisation is 0.0012. 0.12% of the link is in use. The time it takes to send a packet is tiny compared to the time it takes to get a response. So if you’re just sending one packet, and waiting for a response, the link is idle 99.9% of the time. The idea of a sliding window protocol is to not just send one packet and wait for an acknowledgement. It’s to send several packets, and wait for the acknowledgments. And the window is the number of packets that can be outstanding before the acknowledgement comes back. The idea is, you can start several packets going, and eventually the acknowledgement comes back, and that starts triggering the next packets to be clocked out. The idea is to improve the utilisation by sending more than one packet before you get an acknowledgment. And this is the fundamental approach of sliding window protocols. The sender starts sending data packets, and there’s what’s known as a congestion window that specifies how many packets it’s allowed to send before it gets an acknowledgement. And, in this example, the congestion window is six packets. And the sender starts. It sends the first data packet, and that gets sent and starts its way traveling down the link. And at some point later it sends the next packet, and then the next packet, and so on. After a certain amount of time that first packet arrives at the receiver, and the receiver generates the acknowledgement, which comes back towards the sender. And while this is happening, the sender is sending more of the packets from its window. And the receiver’s gradually receiving those and sending the acknowledgments. And, at some point later, the acknowledgement makes it back to the sender. And in this case we’ve set the window size to be six packets. And it just so happens that the acknowledgement for the first packet arrives back at the sender, just as it has finished sending packet six. And that triggers the window to increase. That triggers the window to slide along. So instead of being allowed to send packets one through six, we’re now allowed to send packets two through seven.
Because one packet has arrived, that’s opened up the window to allow us to send one more packet. And the acknowledgement indicates that packet one has arrived. So just as we’d run out of packets to send, just as we’ve sent the six packets which are allowed by the window, the acknowledgement arrives, slides the window along one, and tells us we can now send one more. And the idea is that you size the window such that you send just enough packets that, by the time the acknowledgement comes back, you’re ready to slide the window along. You’ve sent everything that was in your window. And each acknowledgement releases the next packet for transmission, if you get the window sized right. And if there’s a problem, if the acknowledgments don’t come back because something got lost, then it stalls. You haven’t sent too many excess packets, you’re not just keeping sending without getting acknowledgments, you’re just sending enough that the acknowledgments come back just as you run out of things to send. And everything just keeps itself sort-of balanced. Every acknowledgement triggers the next packet to be sent, and it rolls along. How big should the window be? Well, it should be sized to match the bandwidth times the delay on the path. And you work it out in bytes. It’s the bandwidth of the path, a gigabit in the previous example, times the latency, a 100th of a second, and you multiply those together and that tells you how many bytes can be in flight. And you divide that by the packet size, and that tells you how many packets you can send. The problem is, the sender doesn’t know the bandwidth of the path, and it doesn’t know the latency. It doesn’t know the round trip time. It can measure the round trip time, but not until after it’s started sending. Once it’s sent a packet, it can wait for an acknowledgement to come back and get an estimate of the round trip time. But it can’t do that at the point where it starts sending. And it can’t know what the bandwidth is. It knows the bandwidth of the link it’s connected to, but it doesn’t know the bandwidth of the rest of the links throughout the network. It doesn’t know how many other TCP flows it’s sharing the traffic with, so it doesn’t know how much of that capacity it’s got available. And this is the problem with sliding window algorithms. If you get the window size right, it allows you to do the ACK clocking, it allows you to clock out the packets at the right time, just in time for the next packet to become available. But, in order to pick the right window size, you need to know the bandwidth and the delay, and you don’t know either of those at the start of the connection. TCP follows the sliding window approach. TCP Reno is very much a sliding window protocol, and it’s designed to cope with not knowing in advance what the window size should be. And the challenge with TCP is to pick what should be the initial window. To pick how many packets you should send, before you know anything about the round trip time, or anything about the bandwidth. And how to find the path capacity, how to figure out at what point you’ve got the right size window. And then how to adapt the window to cope with changes in the capacity. So there are two fundamental problems with TCP Reno congestion control. Picking the initial window size for the first set of packets you send. And then, adapting that initial window size to find the bottleneck capacity, and to adapt to changes in that bottleneck capacity. If you get the window size right, you can make effective use of the network capacity.
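The numbers in the Glasgow to London example can be checked with a short calculation: the serialisation time of a full-sized packet, the utilisation a stop-and-wait sender would achieve, and the window, in packets, needed to fill the path, which is the bandwidth-delay product divided by the packet size. This is just the arithmetic from the example written out as code, using the values assumed above.

/* Working through the Glasgow-to-London example: how long one packet takes to
 * serialise, what fraction of a 1 Gb/s link a stop-and-wait sender would use,
 * and how many packets need to be in flight to fill the path. */

#include <stdio.h>

int main(void)
{
    double bandwidth_bps = 1e9;            /* 1 Gb/s bottleneck link          */
    double rtt_s         = 0.010;          /* ~10 ms round trip to London     */
    double packet_bits   = 1500.0 * 8.0;   /* 1500 byte Ethernet frame        */

    double send_time_s  = packet_bits / bandwidth_bps;   /* ~12 microseconds   */
    double utilisation  = send_time_s / rtt_s;            /* ~0.0012, i.e. 0.12% */
    double bdp_bits     = bandwidth_bps * rtt_s;           /* bits in flight     */
    double window_pkts  = bdp_bits / packet_bits;          /* packets in flight  */

    printf("serialisation time: %.1f us\n", send_time_s * 1e6);
    printf("stop-and-wait utilisation: %.2f%%\n", utilisation * 100.0);
    printf("window needed to fill the path: %.0f packets\n", window_pkts);
    return 0;
}

Run with the values above, this prints roughly 12 microseconds, 0.12%, and a little over 800 packets, matching the figures in the example.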
If you get it wrong you’ll either send too slowly, and end up wasting capacity. Or you’ll send too quickly, and overload the network, and cause packets to be lost because the queues fill. So, how does TCP find the initial window? Well, to start with, you have no information. When you’re making a TCP connection to a host you haven’t communicated with before, you don’t know the round trip time to that host, you don’t know how long it will take to get a response, and you don’t know the network capacity. So you have no information to know what an appropriately sized window should be. The only safe thing you can do, the only thing which is safe in all circumstances, is to send one packet, and see if it arrives, see if you get an ACK. And if it works, send a little bit faster next time. And then gradually increase the rate at which you send. The only safe thing to do is to start at the lowest possible rate, the equivalent of stop-and-wait, and then gradually increase your rate from there, once you know that it works. The problem is, of course, that’s pessimistic in most cases. Most links are not the slowest possible link. On most links, you can send faster than that. What TCP has traditionally done, the traditional approach in TCP Reno, is to set the initial window to three packets. So you can send three packets, without getting any acknowledgments back. And, by the time the third packet has been sent, you should be just about to get the acknowledgement back, which will open it up for you to send the fourth. And at that point, it starts ACK clocking. And why is it three packets? Because someone did some measurements, and decided that was what was safe. More recently, I guess about 10 years ago now, Nandita Dukkipati and her group at Google did another set of measurements, and showed that was actually pessimistic. The networks had gotten a lot faster in the time since TCP was first standardised, and they came to the conclusion, based on measurements of browsers accessing the Google site, that about 10 packets was a good starting point. And the idea here is that you can send 10 packets at the start of a connection, and by the time you’ve sent those 10 packets you should have got an acknowledgement back. Why ten? Again, it’s a balance between safety and performance. If you send too many packets onto a network which can’t cope with them, those packets will get queued up and, in the best case, it’ll just add latency because they’re all queued up somewhere. And in the worst case they’ll overflow the queues, and cause packet loss, and you’ll have to re-transmit them. So you don’t want to send too fast. Equally, you don’t want to send too slowly, because that just wastes capacity. And the conclusion that Google came to at this point, which was around 10 years ago, was that about 10 packets was a good starting point for most connections. It was unlikely to cause congestion in most cases, and was also unlikely to waste too much bandwidth. And I think what we’d expect to see, is that over time the initial window will gradually increase, as network connections around the world gradually get faster. And it’s balancing making good use of connections in well-connected, first-world parts of the world, where there’s good infrastructure, against not overloading connections in parts of the world where the infrastructure is less well developed. The initial window lets you send something. With a modern TCP, it lets you send 10 packets.
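The ten packet initial window was later written into an IETF specification, RFC 6928, which expresses the limit in bytes rather than packets. The sketch below shows that rule, approximately: allow roughly ten full-sized segments, with the bound expressed in bytes so that unusual segment sizes are handled sensibly. The function name is illustrative; the constants come from RFC 6928, not from this lecture.

/* Approximate initial window rule from RFC 6928:
 * IW = min(10 * MSS, max(2 * MSS, 14600)) bytes. */

#include <stdint.h>

uint32_t initial_window_bytes(uint32_t mss)
{
    uint32_t ten_mss = 10 * mss;
    uint32_t lower   = (2 * mss > 14600) ? 2 * mss : 14600;
    return (ten_mss < lower) ? ten_mss : lower;
}

For a typical 1460 byte segment this comes out at 14600 bytes, which is the ten packets described above.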
And you can send those 10 packets, or whatever the initial window is, without waiting for an acknowledgement to come back. But it’s probably not the right window size. If you’re on a very fast connection, in a well-connected part of the world, you probably want a much bigger window than 10 packets. And if you’re on a poor quality mobile connection, or in a part of the world where the infrastructure is less well developed, you probably want a smaller window. So you need to somehow adapt to match the network capacity. And there are two parts to this. What’s called slow start, where, starting from the initial window, you try to quickly converge on what the right window is. And congestion avoidance, where you adapt in the long term to match changes in capacity once the connection is running. So how does slow start work? Well, this is the phase at the beginning of the connection. It’s easiest to illustrate if you assume that the initial window is one packet. If the initial window is one packet, you send one packet, and at some point later an acknowledgement comes back. And the way slow start works is that each acknowledgment you get back increases the window by one. So if you send one packet, and get one acknowledgement back, that increases the window from one to two, so you can send two packets the next time. And you send those two packets, and you get two acknowledgments back. And each acknowledgment increases the window by one, so it goes to three, and then to four. So you can send four packets the next time. And then you get four acknowledgments back, each of which increases the window, so your window is now eight. And, as we are all, I think, painfully aware after the pandemic, this is exponential growth. The window is doubling each time. So it’s called slow start because it starts very slowly, with one packet or three packets or 10 packets, depending on the version of TCP you have. But each round trip time the window doubles. It doubles its sending rate each time. And this carries on until it loses a packet. This carries on until it fills the queues and overflows the capacity of the network somewhere. At which point it halves back to its previous value, and drops out of the slow start phase. If we look at this graphically, in the graph at the bottom of the slide, we have time on the x-axis, and the congestion window, the size of the congestion window, on the y-axis. And we’re assuming an initial window of one packet. We see that, on the first round trip it sends the one packet, and gets the acknowledgement back. The second round trip it sends two packets. And then four, and then eight, and then 16. And each time it doubles its sending rate. So you have this exponential growth phase, starting at whatever the initial window is, and doubling each time until it reaches the network capacity. And eventually it fills the network. Eventually some queue, somewhere in the network, is full. And it overflows and a packet gets lost. At that point the connection halves its rate, back to the value just before it last increased. In this example, we see that it got up to a window of 16, and then something got lost, and then it halved back down to a window of eight. At that point TCP enters what’s known as the congestion avoidance phase. The goal of congestion avoidance is to adapt to changes in capacity. After the slow start phase, you know you’ve got approximately the right size window for the path.
It’s telling you roughly how many packets you should be sending each round trip time. The goal, once you’re in congestion avoidance, is to adapt to changes. Maybe the capacity of the path changes. Maybe you’re on a mobile device, with a wireless connection, and the quality of the wireless connection changes. Maybe the amount of cross traffic changes. Maybe additional people start sharing the link with you, and you have less capacity because you’re sharing with more TCP flows. Or maybe some of the cross traffic goes away, and the amount of capacity you have available increases because there’s less competing traffic. And the congestion avoidance phase follows an additive increase, multiplicative decrease approach to adapting the congestion window when that happens. So, in congestion avoidance, if the sender successfully manages to send a complete window of packets, and gets acknowledgments back for each of those packets (it’s sent out eight packets, for example, and got eight acknowledgments back), it knows the network can support that sending rate. So it increases its window by one. So the next time, it sends out nine packets and expects to get nine acknowledgments back over the next round trip cycle. And if it successfully does that, it increases the window again. And it sends 10 packets, and expects to get 10 acknowledgments back. And we see that each round trip it gradually increases the sending rate by one. So it sends 8 packets, then 9, then 10, then 11, and 12, and keeps gradually, linearly, increasing its rate. Up until the point that something gets lost. And if a packet gets lost? You’ll be able to detect that because, as we saw in the previous lecture, you’ll get a triple duplicate acknowledgement. And that indicates that one of the packets got lost, but the rest of the data in the window was received. And what you do at that point, is you do a multiplicative decrease in the window. You halve the window. So, in this case, the sender was sending with a window of 12 packets, and it successfully sent that. And then it tried to increase its rate, realised it didn’t work, realised something got lost, and so it halved its window back down to six. And then it goes back to the gradual additive increase. And it follows this sawtooth pattern. Gradual linear increase, one packet more each round trip time. Until it sends too fast, causes a packet to be lost because it overflows a queue, halves its sending rate, and then gradually starts increasing it again. It follows this sawtooth pattern. Gradual increase, quick back-off; gradual increase, quick back-off. The other way TCP can detect a loss is by what’s known as a timeout. It’s sending the packets, and suddenly the acknowledgements stop coming back entirely. And this means that either the receiver has crashed, the receiving system has gone away, or, perhaps more likely, the network has failed. And the data it’s sending is either not reaching the receiver, or the reverse path has failed, and the acknowledgments are not coming back. At that point, after nothing has come back for a while, it assumes a timeout has happened, and resets the window down to the initial window. And in the example we see on the slide, at time 14 we’ve got a timeout, and it resets, and the window goes back to the initial window of one packet. At that point, it re-enters slow start. It starts again from the beginning.
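Putting the slow start, congestion avoidance, and loss responses just described together, a very simplified model of the Reno window logic looks something like the sketch below. It counts the window in whole packets and leaves out fast recovery, byte counting, and the many other refinements a real TCP stack has; the names, and the choice of a ten packet initial window, are just for illustration.

/* Simplified model of the TCP Reno behaviour described above, with the
 * congestion window counted in packets. Real implementations work in bytes
 * and include fast recovery and many refinements omitted here. */

#include <stdint.h>

#define INITIAL_WINDOW 10   /* a modern initial window, in packets */

struct reno {
    uint32_t cwnd;       /* congestion window, in packets             */
    uint32_t ssthresh;   /* slow start threshold, in packets          */
    uint32_t acked;      /* ACKs counted in the current round trip    */
};

void reno_init(struct reno *r)
{
    r->cwnd     = INITIAL_WINDOW;
    r->ssthresh = UINT32_MAX;     /* no limit until the first loss */
    r->acked    = 0;
}

void reno_on_ack(struct reno *r)
{
    if (r->cwnd < r->ssthresh) {
        r->cwnd += 1;             /* slow start: +1 per ACK, doubling each RTT  */
    } else {
        r->acked += 1;            /* congestion avoidance: +1 per window of ACKs */
        if (r->acked >= r->cwnd) {
            r->cwnd  += 1;
            r->acked  = 0;
        }
    }
}

void reno_on_triple_dup_ack(struct reno *r)
{
    r->ssthresh = (r->cwnd / 2 > 1) ? r->cwnd / 2 : 1;
    r->cwnd     = r->ssthresh;    /* multiplicative decrease: halve the window */
    r->acked    = 0;
}

void reno_on_timeout(struct reno *r)
{
    r->ssthresh = (r->cwnd / 2 > 1) ? r->cwnd / 2 : 1;
    r->cwnd     = INITIAL_WINDOW; /* back to the initial window...             */
    r->acked    = 0;              /* ...and re-enter slow start                */
}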
And whether your initial window is one packet, or three packets, or ten packets, it starts from the beginning, it re-enters slow start, and it tries again for the connection. And if this was a transient failure, that will probably succeed. If it wasn’t, it may end up in yet another timeout, while it takes time for the network to recover, or for the system you’re talking to, to recover, and it will be a while before it can successfully send a packet. But, when it does, when the network recovers, it starts sending again, and restarts the connection from the beginning. How long should the timeout be? Well, the standard says to use the average round trip time plus four times the statistical variance in the round trip time, with a minimum of one second. And, if you’re a statistician, you’ll recognise that the RTT plus four times the variance, if you’re assuming a normal distribution of round trip time samples, accounts for 99% of the samples falling within range. So it’s finding the 99th percentile of the expected time to get an acknowledgement back. Now, TCP follows this sawtooth behaviour, with a gradual additive increase in the sending rate, and then a back-off, halving its sending rate, and then a gradual increase again. And we see this in the top graph on the slide, which is showing a measured congestion window for a real TCP flow. And, after the dynamics of the slow start at the beginning, we see it follows this sawtooth pattern. How does that affect the rest of the network? Well, the packets are, at some point, getting queued up at whatever the bottleneck link is. And the second graph we see on the left, going down, is the size of the queue. And we see that as the sending rate increases, the queue gradually builds up. Initially the queue is empty, and as it starts sending faster, the queue gradually gets fuller. And at some point the queue gets full, and overflows. And when the queue gets full, when the queue overflows, when packets get lost, TCP halves its sending rate. And that causes the queue to rapidly empty, because there are fewer packets coming in, so the queue drains. But what we see is that just as the queue is getting to empty, the rate is starting to increase again. Just as the queue gets to the point where it would have nothing to send, the rate starts picking up, such that the queue starts to gradually refill. So the queues in the routers also follow a sawtooth pattern. They gradually fill up until they get full, and then the rate halves, the queue empties rapidly because there’s much less traffic coming in, and as it’s emptying the rate at which the sender is sending is gradually increasing, and the queue size oscillates. And we see the same thing happens with the round trip time, in the third of the graphs: as the queue gradually fills up, the round trip time goes up, and up, and up, because it’s taking longer for the packets to get through, they’re queued up somewhere. And then the rate reduces, the queue drops, the round trip time drops. And then, as the rate picks up afterwards, back in congestion avoidance, the queue gradually fills, and the round trip time gradually increases. So the window size, the queue size, and the round trip time all follow this characteristic sawtooth pattern. What’s interesting though, if we look at the fourth graph down on the left, is we’re looking at the rate at which packets are arriving at the receiver. And we see that the rate at which packets are arriving at the receiver is pretty much constant.
What’s happening is that the packets are being queued up at the link, and as the queue fills there are more and more packets queued up at the bottleneck link. And when TCP backs off, when it reduces its window, that lets the queue drain. But the queue never quite empties. We just see very occasional drops where the queue gets empty, but typically the queue always has something in it. It’s emptying rapidly, it’s getting less and less data in it, but the queue, if the buffer is sized right, if the window is chosen right, never quite empties. So the TCP sender is following this sawtooth pattern with its sending window, which is gradually filling up the queues. And then the queues are gradually draining when TCP backs off and halves its rate, but the queue never quite empties. It always has some data to send, so the receiver is always receiving data. So, even though the sender’s following the sawtooth pattern, the receiver receives constant rate data the whole time, at approximately the bottleneck bandwidth. And that’s the genius of TCP. By following this additive increase, multiplicative decrease approach, it manages to adapt the rate such that the buffer never quite empties, and the data continues to be delivered. And for that to work, it needs the router to have enough buffering capacity in it. And the amount of buffering the router needs is the bandwidth times the delay of the path. Too little buffering in the router leads to the queue overflowing, and it not quite managing to sustain the rate. Too much, and you just get what’s known as buffer bloat. It’s safe, I mean in terms of throughput, it keeps receiving the data. But the queues get very big, and they never get anywhere near empty, so the amount of data queued up increases, and you just get increased latency. So that’s TCP Reno. It’s really effective at keeping the bottleneck fully utilised. But it trades latency for throughput. It tries to fill the queue; it’s continually pushing, continually queuing up data. Making sure the queue is never empty, so that, provided there’s enough buffering in the network, there are always packets being delivered. And that’s great, if your goal is to maximise the rate at which information is delivered. TCP is really good at keeping the bottleneck link fully utilised. It’s really, really good at delivering data as fast as the network can support it. But it trades that off for latency. It’s also really good at making sure there are queues in the network, and making sure that the network is not operating at its lowest possible latency. There’s always some data queued up. There are two other limitations, other than increased latency. The first is that TCP assumes that losses are due to congestion. And historically that’s been true. Certainly on wired links, packet loss is almost always caused by a queue filling up, overflowing, and a router not having space to enqueue a packet. On certain types of wireless links, on 4G or WiFi links, that’s not always the case, and you do get packet loss due to corruption. And TCP will treat this as a signal to slow down. Which means that TCP sometimes behaves sub-optimally on wireless links. And there’s a mechanism called Explicit Congestion Notification, which we’ll talk about in one of the later parts of this lecture, which tries to address that. The other is that the congestion avoidance phase can take a long time to ramp up.
On very long distance, very high capacity links, after a packet loss, it can take a very long time to get back up to an appropriate rate. And there are some occasions, with very fast long distance links, where it performs poorly because of the way the congestion avoidance works. And there’s an algorithm known as TCP Cubic, which I’ll talk about in the next part, which tries to address that. And that’s the basics of TCP. The basic TCP congestion control algorithm is a sliding window algorithm, where the window indicates how many packets you’re allowed to send before getting an acknowledgement. The goal of the slow start and congestion avoidance phases, and the additive increase, multiplicative decrease approach, is to adapt the size of the window to match the network capacity. It always tries to match the size of the window exactly to the capacity, so it’s making the most use of the network resources. In the next part, I’ll move on and talk about an extension to the TCP Reno algorithm, known as TCP Cubic, which is intended to improve performance on very fast and long distance networks. And then, in the later parts, we’ll talk about extensions to reduce latency, and to work on wireless links where there are non-congestive losses.
Part 3: TCP Cubic
Abstract
The third part of the lecture talks about the TCP Cubic congestion control algorithm, a widely used extension to TCP that improves its performance on fast, long-distance networks. The lecture discusses the limitations of TCP Reno that led to the development of Cubic, and outlines how Cubic congestion control improves performance but retains fairness with Reno.
In the previous part, I spoke about TCP Reno. TCP Reno is the default congestion control algorithm for TCP, but it’s actually not particularly widely used in practice these days. What most modern TCP versions use instead is an algorithm known as TCP Cubic. And the goal of TCP Cubic is to improve TCP performance on fast long distance networks. So the problem with TCP Reno is that its performance can be comparatively poor on networks with large bandwidth-delay products. That is, networks where the product you get when you multiply the bandwidth of the network, in bits per second, by the delay, the round trip time of the network, is large. Now, this is not a problem that most people have most of the time. But it’s a problem that began to become apparent in the early 2000s, when people working at organisations like CERN were trying to transfer very large data files across fast long distance networks between CERN and the universities that were analysing the data. For example, CERN is based in Geneva, in Switzerland, and some of the big sites for analysing the data are based at, for example, Fermilab just outside Chicago in the US. And in order to get the data from CERN to Fermilab, from Geneva to Chicago, they put in place multi-gigabit transatlantic links. And if you think about the congestion window needed to make good use of a link like that, you realise it actually becomes quite large. If you assume the link is 10 gigabits per second, which was cutting edge in the early 2000s, but is now relatively common for high-end links, and assume a 100 millisecond round trip time, which is possibly even slightly an under-estimate for the path from Geneva to Chicago, then in order to make good use of that, you need a congestion window which equals the bandwidth times the delay. And 10 gigabits per second, times 100 milliseconds, gives you a congestion window of about 100,000 packets. And, partly, it takes TCP a long time, a comparatively long time, to slow start up to a 100,000 packet window. But that’s not such a big issue, because that only happens once at the start of the connection. The issue, though, is in congestion avoidance. If one packet is lost on the link, out of a window of 100,000, that will cause TCP to back off and halve its window. And it then increases its sending rate again, by one packet every round trip time. And backing off from a 100,000 packet window to a 50,000 packet window, and then increasing by one each round trip, means it takes 50,000 round trip times to recover back up to the full window. 50,000 round trip times, when the round trip time is 100 milliseconds, is about 1.4 hours. So it takes TCP about one-and-a-half hours to recover from a single packet loss. And, with a window of 100,000 packets, you’re sending enough data, at 10 gigabits per second, that the imperfections in the optical fibre, and imperfections in the equipment that’s transmitting the packets, become significant. And you’re likely to see occasional random packet losses, just because of imperfections in the transmission medium, even if there’s no congestion. And this was becoming a limiting factor, this was becoming a bottleneck in the transmission. It was not possible to build a network reliable enough that it never lost any packets while transferring hundreds of billions of packets of data between CERN and the sites which were doing the analysis.
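As a quick check on those numbers, the arithmetic from the example can be written out directly: the window needed to fill a 10 gigabit per second path with a 100 millisecond round trip time, and how long Reno’s one-packet-per-round-trip increase takes to climb back after halving that window. The exact figures come out slightly below the rounded values quoted above, but they are the same order of magnitude.

/* Checking the numbers in the CERN example: the window needed to fill a
 * 10 Gb/s path with a 100 ms round trip time, and how long TCP Reno takes
 * to climb back to that window after halving it, at one packet per RTT.
 * The lecture rounds the window up to 100,000 packets, which gives the
 * 50,000 round trips and roughly 1.4 hours quoted above. */

#include <stdio.h>

int main(void)
{
    double bandwidth_bps = 10e9;           /* 10 Gb/s                 */
    double rtt_s         = 0.100;          /* 100 ms round trip time  */
    double packet_bits   = 1500.0 * 8.0;   /* 1500 byte packets       */

    double window_pkts   = (bandwidth_bps * rtt_s) / packet_bits;
    double recovery_rtts = window_pkts / 2.0;   /* +1 packet per RTT after halving */
    double recovery_s    = recovery_rtts * rtt_s;

    printf("window to fill the path: %.0f packets\n", window_pkts);
    printf("recovery after one loss: %.0f RTTs = %.2f hours\n",
           recovery_rtts, recovery_s / 3600.0);
    return 0;
}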
TCP Cubic is one of a range of algorithms which were developed to try and address this problem. To try and recover much faster than TCP Reno would, in the case where you have very large congestion windows and small amounts of packet loss. So the idea of TCP Cubic is that it changes the way the congestion control works in the congestion avoidance phase. In congestion avoidance, TCP Cubic will increase the congestion window faster than TCP Reno would, in cases where the window is large. In cases where the window is relatively small, in the types of networks where Reno has good performance, TCP Cubic behaves in a very similar way. But as the windows get bigger, as it gets into a regime where TCP Reno doesn’t work effectively, TCP Cubic gets more aggressive in adapting its congestion window, and increases the congestion window much more quickly in response to loss. However, as the window approaches the value it had before the loss, it slows its rate of increase. So it starts increasing rapidly, then slows its rate of increase as it approaches the previous value. And if it then successfully manages to send at that rate, if it successfully moves above the previous sending rate, then it increases its sending rate more quickly again. It’s called TCP Cubic because it follows a cubic equation to do this. The shape of the curve we see on the slide for TCP Cubic follows a cubic graph. The paper listed on the slide, from Injong Rhee and his collaborators, is the paper which describes the algorithm in detail. And it was eventually specified in IETF RFC 8312 in 2018, although it had probably been the most widely used TCP variant for a number of years before that. The details of how it works: TCP Cubic is a somewhat more complex algorithm than Reno. There are two parts to the behaviour. If a packet is lost when a TCP Cubic sender is in the congestion avoidance phase, it does a multiplicative decrease. However, unlike TCP Reno, which does a multiplicative decrease by multiplying by a factor of 0.5, that is, it halves its sending rate if a single packet is lost, TCP Cubic multiplies its rate by 0.7. So, instead of dropping back down to 50% of its previous sending rate, it drops down to 70% of the sending rate. It backs off less; it’s more aggressive. It’s more aggressive at using bandwidth. It reduces its sending rate in response to loss, but by a smaller fraction. After it’s backed off, TCP Cubic also changes the way in which it increases its sending rate in future. As we saw on the previous slide, TCP Reno increases its congestion window by one for every round trip when it successfully sends data. So if the window backs off to 10, then it goes to 11 the next round trip time, then 12, and 13, and so on, with a linear increase in the window. TCP Cubic, on the other hand, sets the window as we see in the equation on the slide. It sets the window to be a constant, C, times (T minus K) cubed, plus Wmax. Where the constant, C, is set to 0.4, which is a value that controls how fair it is to TCP Reno, and was determined experimentally. T is the time since the packet loss. K is the time it will take to increase the window back up to the maximum it reached before the packet loss, and Wmax is the maximum window size it reached before the loss.
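Written out as code, the window growth function just described looks roughly like the sketch below, following the description above and RFC 8312: the window is a cubic function of the time since the last loss, K is the time at which it climbs back up to Wmax, and at time zero it starts from 0.7 times Wmax, matching the back-off factor. Windows are in packets and the function names are illustrative.

/* The Cubic window growth function described above, following RFC 8312:
 * W(t) = C * (t - K)^3 + Wmax, with C = 0.4 and a back-off factor of 0.7.
 * K is the time at which the window climbs back to Wmax; t is the time in
 * seconds since the last loss, and windows are measured in packets. */

#include <math.h>

static const double CUBIC_C    = 0.4;
static const double CUBIC_BETA = 0.7;    /* multiplicative decrease factor */

/* Time taken to grow back to w_max after backing off to CUBIC_BETA * w_max. */
double cubic_k(double w_max)
{
    return cbrt(w_max * (1.0 - CUBIC_BETA) / CUBIC_C);
}

/* Congestion window t seconds after the loss event. */
double cubic_window(double t, double w_max)
{
    double k = cubic_k(w_max);
    return CUBIC_C * pow(t - k, 3.0) + w_max;
}

The flat region of the cubic around t equal to K is what gives the cautious probing near the old maximum, and the steep growth either side of it is what lets Cubic recover quickly and then probe aggressively for new capacity.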
And this gives the cubic growth function, which we saw on the previous slide, where the window starts to increase quickly, the growth slows as it approaches the previous value it reached just before the loss, and, if it successfully passes through that point, the rate of growth increases again. Now, that’s the high-level version. And we can already see it’s more complex than the TCP Reno equation. The algorithm on the right of the slide, which is intentionally presented in a way which is completely unreadable here, shows the full details. The point is that there’s a lot of complexity here. The basic behaviour, the back-off to 0.7 times the window and then following the cubic equation, increasing rapidly, slowing the rate of increase, and then increasing rapidly again if it successfully gets past the previous bottleneck point, is enough to illustrate the key principle. The rest of the details are there to make sure it’s fair to TCP Reno on links which are slower, or where the round trip time is shorter. And so, in the regime where TCP Reno can successfully make use of the link, TCP Cubic behaves the same way. And, as you get into a regime where Reno can’t effectively make use of the capacity, because it can’t sustain a large enough congestion window, Cubic starts to behave differently, and switches to the cubic equation. And that allows it to recover from losses more quickly, and to continue to make effective use of higher bandwidth, higher latency, paths. TCP Cubic is the default in most modern operating systems. It’s the default in Linux, it’s the default in FreeBSD, and I believe it’s the default in macOS and iPhones. Microsoft Windows has an algorithm called Compound TCP, which is a different algorithm but has a similar effect. TCP Cubic is much more complex than TCP Reno. The core response, the back off to 70% and then follow the characteristic cubic curve, is conceptually relatively straightforward, but once you start looking at the details of how it behaves, there gets to be a lot of complexity. And most of that is there to make sure it’s reasonably fair to TCP Reno in the regime where Reno typically works. But it improves performance for networks with longer round trip times and higher bandwidths. Both TCP Cubic and TCP Reno use packet loss as a congestion signal. And they both eventually fill the router buffers, TCP Cubic more aggressively than Reno. So, in both cases, they’re trading off latency for throughput. They’re trying to make sure the buffers in the intermediate routers are full. And they’re both making sure that they keep the congestion window large enough to keep the buffers fully utilised, so packets keep arriving at the receiver at all times. And that’s very good for achieving high throughput, but it pushes the latency up. So, again, they’re trading-off increased latency for good throughput. And that’s what I want to say about Cubic. Again, the goal is to use a different response function to improve throughput on very fast, long distance links, multi-gigabit per second transatlantic links being the common example. And the goal is to make good use of the available capacity. In the next part I’ll talk about alternatives which, rather than focusing on throughput, focus on keeping latency bounded whilst achieving reasonable throughput.
Part 4: Delay-based Congestion Control
Abstract
The 4th part of the lecture discusses how both the Reno and Cubic algorithms impact latency. It shows how their loss-based response to congestion inevitably causes router queues to fill, increasing path latency, and discusses how this is unavoidable with loss-based congestion control. It introduces the idea of delay-based congestion control and the TCP Vegas algorithm, and highlights its potential benefits and deployment challenges. Finally, TCP BBR is briefly introduced as an experimental extension that aims to achieve some of the benefits of delay-based congestion control, in a deployable manner.
In the previous parts, I’ve spoken about TCP Reno and TCP Cubic. These are the standard, loss based, congestion control algorithms that most TCP implementations use to adapt their sending rate. What I want to do in this part is recap why these algorithms cause additional latency in the network, and talk about two alternatives which try to adapt the sending rate of TCP without building up queues, and without overloading the network and causing too much latency. So, as I mentioned, TCP Cubic and TCP Reno both aim to fill up the network. They use packet loss as a congestion signal. The way they work is that they gradually increase their sending rate, whether they’re in the slow start or the congestion avoidance phase, always gradually increasing the sending rate, gradually filling up the queues in the network, until those queues overflow. At that point a packet is lost. The TCP backs-off its sending rate, it backs-off its window, which allows the queue to drain, but as the queue is draining, both Reno and Cubic are increasing their sending rate, increasing the sending window, and so gradually start filling up the queue again. As we saw, the queues in the network oscillate, but they never quite empty. For both Reno and Cubic, the goal is to keep some packets queued up in the network, to make sure there’s always some data queued up, so they can keep delivering data. And, no matter how big a queue you put in the network, no matter how much memory you give the routers in the network, TCP Reno and TCP Cubic will eventually cause it to overflow. They will keep sending, they’ll keep increasing the sending rate, until whatever queue is in the network is full and it overflows. And the more memory in the routers, the more buffering in the routers, the longer that queue will get and the worse the latency will be. But in all cases, in order to achieve very high throughput, in order to keep the network busy, keep the bottleneck link busy, TCP Reno and TCP Cubic queue some data up. And this adds latency. It means that, whenever there are TCP Reno or TCP Cubic flows using the network, the queues will have data queued up. There’ll always be data queued up for delivery, always packets waiting for delivery. So it forces the network to work in a regime where there’s always some excess latency. Now, this is a problem for real-time applications. It’s a problem if you’re running a video conferencing tool, or a telephone application, or a game, or a real time control application, because you want low latency for those applications. So it would be desirable if we could have an alternative to TCP Reno or TCP Cubic that can achieve good throughput for TCP without forcing the queues to be full. One attempt at doing this was a proposal called TCP Vegas. And the insight from TCP Vegas is that you can watch the growth of the queue, and use that to infer whether you’re sending faster, or slower, than the network can support. The insight was that if a TCP is sending faster than the maximum capacity the network can deliver at, the queue will gradually fill up. And as the queue gradually fills up, the latency, the round trip time, will gradually increase. TCP Cubic and TCP Reno wait until the queue overflows, wait until there’s no more space to put new packets in and a packet is lost, and at that point they slow down.
The insight of TCP Vegas is to watch as the delay increases and, as it sees the delay increasing, to slow down before the queue overflows. So it uses the gradual increase in the round trip time as an indication that it should send slower. And as the round-trip time reduces, as the round-trip time starts to drop, it treats that as an indication that the queue is draining, which means it can send faster. It wants a constant round trip time. And, if the round trip time increases, it reduces its rate; and if the round-trip time decreases, it increases its rate. So, it’s trying to balance its rate with the round trip time, and neither build nor shrink the queues. And because you can detect the queue building up before it overflows, you can take action before the queue is completely full. And that means the queue is running with lower occupancy, so you have lower latency across the network. It also means that, because packets are not being lost, you don’t need to re-transmit as many packets. So it improves the throughput that way, because you’re not resending data that you’ve already sent and has gotten lost. And that’s the fundamental idea of TCP Vegas. It doesn’t change the slow start behaviour at all. But, once you’re into congestion avoidance, it looks at the variation in round trip time rather than looking at packet loss, and uses that to drive the variation in the speed at which it’s sending. The details of how it works: first, it tries to estimate what it calls the base round trip time. Every time it sends a packet, it measures how long it takes to get a response. And it tries to find the smallest possible response time. The idea being that the smallest time in which it gets a response would be the time when the queue is at its emptiest. It may not observe the queue when it’s actually completely empty, but the smaller the response time, the closer it is to the time it would take when there’s nothing else in the network. And anything on top of that indicates that there is data queued up somewhere in the network. Then it calculates an expected sending rate. It takes the window size, which indicates how many packets, how many bytes of data, it’s supposed to send in that round-trip time, and it divides it by the base round trip time. Dividing a number of bytes by a time gives you bytes per second, and that gives the rate at which it should be sending data. If the network can support sending at that rate, it should be able to deliver that window of packets within a complete round trip time. And, if it can’t, it will take longer than a round trip time to deliver that window of packets, and the queues will be gradually building up. Alternatively, if it takes less than a round trip time, this is an indication that the queues are decreasing. And it measures the actual rate at which it sends the packets. And it compares them. If the actual rate at which it’s sending packets is less than the expected rate, if it’s taking longer than a round-trip time to deliver the complete window’s worth of packets, this is a sign that the packets can’t all be delivered; it’s trying to send at too fast a rate, and it should reduce its rate and let the queues drop. Equally, in the other case it should increase its rate. And by measuring the difference between the actual and the expected rates, it can measure whether the queue is growing or shrinking.
And TCP Vegas compares the expected rate with the actual rate at which it manages to send, the rate at which it gets the acknowledgments back. And it adjusts the window. If the expected rate, minus the actual rate, is less than some threshold, that indicates that it should increase its window. And if the expected rate, minus the actual rate, is greater than some other threshold, then it should decrease the window. That is, if data is arriving at the expected rate, or very close to it, this is probably a sign that the network can support a higher rate, and you should try sending a little bit faster. Alternatively, if data is arriving slower than it’s being sent, this is a sign that you’re sending too fast and you should slow down. And the two thresholds, R1 and R2, determine how close you have to be to the expected rate to speed up, and how far away from it you have to be in order to slow down. The result is that TCP Vegas follows a much smoother transmission rate. Unlike TCP Reno, which follows the characteristic sawtooth pattern, or TCP Cubic, which follows the cubic equation to change its rate, both of which adapt quite abruptly whenever there’s a packet loss, TCP Vegas makes gradual changes. It gradually increases, or decreases, its sending rate in line with the variations in the queues. So, it’s a much smoother algorithm, which doesn’t continually build up and empty the queues. Because the queues are not continually being filled, this keeps the latency down while still achieving reasonably good performance. TCP Vegas is a good idea in principle. This idea is known as delay-based congestion control, and I think it’s actually a really good idea in principle. It reduces the latency, because it doesn’t fill the queues. It reduces the packet loss, because it’s not pushing the queues to overflow and causing packets to be lost. So the only packet losses you get are those caused by transmission problems. And this reduces unnecessary retransmissions, because you’re no longer forcing the network into overload and forcing it to lose packets, and it reduces the latency. The problem with TCP Vegas is that it doesn’t interwork with TCP Reno or TCP Cubic. If you have any TCP Reno or Cubic flows on the network, they will aggressively increase their sending rate, try to fill the queues, and push the queues into overload. And this will increase the round-trip time, reduce the rate at which Vegas can send, and force TCP Vegas to slow down. TCP Vegas sees the queues increasing, because Cubic and Reno are intentionally trying to fill those queues, and if the queues increase, this causes Vegas to slow down. That gradually means there’s more space in the queues, which Cubic and Reno will gradually fill up, which causes Vegas to slow down further, and they end up in a spiral where the TCP Vegas flows get pushed down to zero, and the Reno or Cubic flows use all of the capacity. So if we only had TCP Vegas in the network, I think it would behave really nicely, and we would get really good, low latency, behaviour from the network. Unfortunately we’re in a world where Reno and Cubic have been deployed everywhere. And without a step change, without an overnight switch where we turn off Cubic, and we turn off Reno, and we turn on Vegas everywhere, we can’t deploy TCP Vegas, because it always loses out to Reno and Cubic.
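Pulling the last two paragraphs together, a rough sketch of the Vegas congestion avoidance decision, run once per round trip, might look like the code below. The threshold names and values are illustrative only; they correspond to the R1 and R2 on the slide, which the original Vegas paper calls alpha and beta and expresses in terms of extra packets held in the queue.

```c
/* Per-RTT congestion avoidance step for a Vegas-style sender (sketch).
 * base_rtt is the smallest RTT observed so far; rtt is the RTT measured
 * over the last window of data; cwnd is the window in packets. */
double vegas_update(double cwnd, double base_rtt, double rtt) {
    const double alpha = 1.0;  /* lower threshold, in "extra" packets */
    const double beta  = 3.0;  /* upper threshold, in "extra" packets */

    double expected = cwnd / base_rtt;  /* rate if nothing were queued  */
    double actual   = cwnd / rtt;       /* rate actually being achieved */
    double extra    = (expected - actual) * base_rtt;  /* packets queued */

    if (extra < alpha)
        cwnd += 1.0;   /* little queueing: the path can take more       */
    else if (extra > beta)
        cwnd -= 1.0;   /* queue building up: back off before it fills   */
    return cwnd;       /* otherwise leave the window unchanged          */
}
```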
So, it’s a good idea in principle, but in practice it can’t be used because of the deployment challenge. As I say, the idea of using delay as a congestion signal is a good idea in principle, because we can get something which achieves lower latency. Is it possible to deploy a different algorithm? Maybe the problem is not the principle, maybe the problem is the particular algorithm in TCP Vegas? Well, people are trying alternatives which are delay based. And the most recent attempt at this is an algorithm called TCP BBR, Bottleneck Bandwidth and Round-trip time. And again, this is a proposal that came out of Google. And one of the co-authors, if you look at the paper on the right, is Van Jacobson, who was the original designer of TCP congestion control. So there are clearly some smart people behind this. The idea is that it tries to explicitly measure the round-trip time as it sends the packets. It tries to explicitly measure the sending rate, in much the same way that TCP Vegas does. And, based on those measurements, and some probes where it varies its rate to try and find whether it’s got more capacity, or to try and sense whether there is other traffic on the network, it tries to directly set a congestion window that matches the network capacity. And, because this came out of Google, it got a lot of press, and Google turned it on for a lot of their traffic. I know they were running it for YouTube for a while, and a lot of people saw this and jumped on the bandwagon. And, for a while, it was starting to get a reasonable amount of deployment. The problem is, it turns out not to work very well. Justine Sherry at Carnegie Mellon University, and her PhD student Ranysha Ware, did a really nice piece of work that showed that it is incredibly unfair to regular TCP traffic. And it’s unfair in kind-of the opposite way to Vegas. Whereas TCP Reno and TCP Cubic would force TCP Vegas flows down to nothing, TCP BBR is unfair in the opposite way: it demolishes Reno and Cubic flows, and causes tremendous amounts of packet loss for those flows. So it’s really much more aggressive than the other flows in certain cases, and this leads to really quite severe unfairness problems. And the Vimeo link on the slide is a link to the talk at the Internet Measurement Conference, where Ranysha talks through that, and demonstrates really clearly that TCP BBR version 1 is really quite problematic, and not very safe to deploy on the current network. And there’s a variant called BBR v2, which is under development, and seems to be changing, certainly on a monthly basis, which is trying to solve these problems. This is very much an active research area, where people are looking to find better alternatives. So that’s the principle of delay-based congestion control. Traditional TCP, the Reno and Cubic algorithms, intentionally tries to fill the queues, and so intentionally causes latency. TCP Vegas is one well-known algorithm which tries to solve this; it doesn’t work in practice, but in principle it’s a good idea, it just has some deployment challenges given the installed base of Reno and Cubic. And there are new algorithms, like TCP BBR, which don’t currently work well, but have the potential to solve this problem. And, hopefully, in the future, a future variant of BBR will work effectively, and we’ll be able to transition to a lower latency version of TCP.
Part 5: Explicit Congestion Notification
Abstract
The use of delay-based congestion control is one way of reducing network latency. Another is to keep Reno and Cubic-style congestion control, but to move away from using packet loss as an implicit congestion signal, and instead provide an explicit congestion notification from the network to the applications. This part of the lecture introduces the ECN extension to TCP/IP that provides such a feature, and discusses its operation and deployment.
In the previous parts of the lecture, I’ve discussed TCP congestion control. I’ve discussed how TCP tries to measure what the network’s doing and, based on those measurements, adapt its sending rate to match the available network capacity. In this part, I want to talk about an alternative technique, known as Explicit Congestion Notification, which allows the network to directly tell TCP when it’s sending too fast and needs to reduce its transmission rate. So, as we’ve discussed, TCP infers the presence of congestion in the network through measurement. If you’re using TCP Reno or TCP Cubic, like most TCP flows in the network today, then the way it infers that is through packet loss. TCP Reno and TCP Cubic keep gradually increasing their sending rates, trying to cause the queues to overflow. They cause a queue to overflow, cause a packet to be lost, and use that packet loss as the signal that the network is busy, that they’ve reached the network capacity, and that they should reduce the sending rate. And this is problematic for two reasons. The first is that it increases delay. It’s continually pushing the queues to be full, which means the network’s operating with full queues, with its maximum possible delay. The second is that it makes it difficult to distinguish loss caused because the queues overflowed from loss caused by a transmission error on a link, so-called non-congestive loss, which you might get due to interference on a wireless link. The other approach people have discussed is the approach in TCP Vegas, where you look at the variation in queuing latency and use that as an indication of congestion. So, rather than pushing the queue until it overflows, and detecting the overflow, you watch to see when the queue starts to get bigger, and use that as an indication that you should reduce your sending rate. Or, equally, you spot the queue getting smaller, and use that as an indication that you should maybe increase your sending rate. And this is conceptually a good idea, as we discussed in the last part, because it lets you run TCP with lower latency. But it’s difficult to deploy, because it interacts poorly with TCP Cubic and TCP Reno, both of which try to fill the queues. As a result, we’re stuck with using Reno and Cubic, and we’re stuck with full queues in the network. But we’d like to avoid this; we’d like a lower latency way of using TCP, and to make the network work without filling the queues. So one way you might go about doing this is, rather than having TCP push the queues to overflow, have the network tell TCP when it’s sending too fast. Have something in the network tell the TCP connections that they are congesting the network, and they need to slow down. And this thing is called Explicit Congestion Notification. The Explicit Congestion Notification bits, the ECN bits, are present in the IP header. The slide shows an IPv4 header with the ECN bits indicated in red. The same bits are also present in IPv6, and they’re located in the same place in the IPv6 header. The way these are used: if the sender doesn’t support ECN, it sets these bits to zero when it transmits the packet. And they stay at zero; nothing touches them at that point. However, if the sender does support ECN, it sets these bits to have the value 01, so it sets bit 15 of the header to be 1, and it transmits the IP packets as normal, except with this one bit set to indicate that the sender understands ECN.
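For reference, the two ECN bits form a field with four codepoints, defined in RFC 3168. The sketch below shows how they might be named in code, together with the sender-side step of marking an outgoing packet as ECN-capable; the helper name is mine, and using ECT(1) rather than ECT(0) simply follows the 01 value mentioned above.

```c
#include <stdint.h>

/* The two ECN bits are the low-order bits of the IPv4 TOS /
 * IPv6 Traffic Class byte (RFC 3168). */
#define ECN_MASK    0x03
#define ECN_NOT_ECT 0x00  /* 00: sender does not support ECN                  */
#define ECN_ECT_1   0x01  /* 01: ECN-capable transport, ECT(1)                */
#define ECN_ECT_0   0x02  /* 10: ECN-capable transport, ECT(0)                */
#define ECN_CE      0x03  /* 11: Congestion Experienced, set by routers       */

/* Sender side (sketch): mark an outgoing packet as ECN-capable by setting
 * the ECT(1) codepoint, leaving the upper six DSCP bits untouched. */
static uint8_t set_ect(uint8_t tos) {
    return (uint8_t)((tos & ~ECN_MASK) | ECN_ECT_1);
}
```

The 11 codepoint, Congestion Experienced, is the mark applied inside the network, as described next.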
If congestion occurs in the network, if some queue in the network is beginning to get full, not yet at the point of overflow but beginning to get full, such that some router in the network thinks it’s about to start experiencing congestion, then that router changes those bits in some of the IP packets going past, and sets both of the ECN bits to one. This is known as an ECN Congestion Experienced mark. It’s a signal from the network to the endpoints that the network thinks it’s getting busy, and the endpoints should slow down. And that’s all it does. It monitors the occupancy of the queues, and if the queue occupancy is higher than some threshold, it sets the ECN bits in the packets going past, to indicate that the threshold has been reached and the network is starting to get busy. If the queue overflows, if the endpoints keep sending faster and the queue overflows, then it drops the packets as normal. The only difference is that there’s some intermediate point where the network is starting to get busy, but the queue has not yet overflowed. And at that point, the network marks the packets to indicate that it’s getting busy. A receiver might then get a TCP packet, a TCP segment, delivered within an IP packet, where that IP packet has the ECN Congestion Experienced mark set, where the network has changed those two bits in the IP header to 11 to indicate that it’s experiencing congestion. What it does at that point is set a bit in the TCP header of the acknowledgement packet it sends back to the sender. That bit’s known as the ECN Echo field, the ECE field. It sets this bit in the TCP header to one on the next packet it sends back to the sender, after it received the IP packet, containing the TCP segment, where that IP packet was marked Congestion Experienced. So the receiver doesn’t really do anything with the Congestion Experienced mark, other than set the equivalent mark in the packet it sends back to the sender. It’s telling the sender, “I got a Congestion Experienced mark in one of the packets you sent”. When that packet gets to the sender, the sender sees this bit in the TCP header, the ECN Echo bit set to one, and it realises that the data it was sending caused a router on the path to set the ECN Congestion Experienced mark, which the receiver has then fed back to it. And what it does at that point is reduce its congestion window. It acts as if a packet had been lost, in terms of how it changes its congestion window. So if it’s a TCP Reno sender, it will halve its congestion window, the same way it would if a packet was lost. If it’s a TCP Cubic sender, it will back off its congestion window to 70%, and then follow the cubic equation for changing its congestion window. After it does that, it sets another bit in the header of the next TCP segment it sends out. It sets the CWR bit, the Congestion Window Reduced bit, in the header, to tell the network and the receiver that it’s done so. So the end result of this is that, rather than a packet being lost because the queue overflowed, and then the acknowledgments coming back indicating, via the triple duplicate ACK, that a packet had been lost, and then TCP reducing its congestion window and re-transmitting that lost packet, what happens is that the TCP packets, in their IP packets, in the outbound direction get a Congestion Experienced mark set, to indicate that the network is starting to get full.
The ECN Echo bit is set on the reply, and at that point the sender reduces its window, as if a loss had occurred, and then carries on sending, with the CWR bit set to one on the next packet. So it has the same effect, in terms of reducing the congestion window, as dropping a packet would, but without dropping a packet. There’s no actual packet loss here, there’s just a mark to indicate that the network was getting busy. So it doesn’t have to retransmit data, and this happens before the queue is full, so you get lower latency. So ECN is a mechanism to allow TCP to react to congestion before packet loss occurs. It allows routers in the network to signal congestion before the queue overflows. It allows routers in the network to say to TCP, “if you don’t slow down, this queue is going to overflow, and I’m going to throw your packets away”. It’s independent of how TCP then responds; whether it follows Reno or Cubic or Vegas doesn’t really matter, it’s just an indication that it needs to slow down because the queues are starting to build up, and will overflow soon if it doesn’t. And if TCP reacts to that, reacts to the ECN Echo bit going back, and the sender reduces its rate, the queues will empty, the router will stop marking the packets, and everything will settle down at a slightly slower rate without causing any packet loss. And the system will adapt, and it will achieve the same sort of throughput; it will just react earlier, so you have smaller queues and lower latency. And this gives you the same throughput as you would get with TCP Reno or TCP Cubic, but with lower latency, which means it’s better for competing video conferencing or gaming traffic. And I’ve described the mechanism for TCP, but there are similar ECN extensions for QUIC and for RTP, which is the video conferencing protocol, all designed to achieve the same goal. So ECN, I think, is unambiguously a good thing. It’s a signal from the network to the endpoints that the network is starting to get congested, and that the endpoints should slow down. And if the endpoints believe it, if they back off, they reduce their sending rate before the network is overloaded, and we end up in a world where we still achieve good congestion control, good throughput, but with lower latency. And, if the endpoints don’t believe it, well, eventually the queues overflow and they lose packets, and we’re no worse off than we are now. In order to deploy ECN, though, we need to make changes. We need to change the endpoints, the end systems, to support these bits in the IP header, and to add support for this into TCP. And we need to update the routers to actually mark the packets when they’re starting to get overloaded. Updating the endpoints has pretty much been done by now. I think every TCP implementation written in the last 15-20 years or so supports ECN, and these days most of them have it turned on by default. And I think we actually have Apple to thank for this. ECN, for a long time, was implemented but turned off by default, because there had been problems with some old firewalls which reacted badly to it, 20 or so years ago. And, relatively recently, Apple decided that they wanted these lower latency benefits, and they thought ECN should be deployed. So they started turning it on by default in the iPhone. And they followed an interesting approach, in that for iOS 9, a random subset of 5% of iPhones would turn on ECN for some of their connections. And they measured what happened.
And they found that in the overwhelming majority of cases this worked fine, and occasionally it would fail. And they would call up the network operators whose networks were showing problems, and they would say, “your network doesn’t work with iPhones; currently it’s not working well with 5% of iPhones, but we’re going to increase that number, and maybe you should fix it”. And then, a year later, when iOS 10 came out, they did this for 50% of connections made by iPhones. And then, a year later, for all of the connections. And it’s amazing what impact a popular vendor calling up a network operator can have on getting them to fix their equipment. And, as a result, ECN is now widely enabled by default in the phones, and the network seems to support it just fine. Most of the routers also support ECN, although currently relatively few of them seem to enable it by default. So most of the endpoints are now at the stage of sending ECN enabled traffic, and are able to react to the ECN marks, but most of the networks are not currently setting the ECN marks. This is, I think, starting to change. Some of the recent DOCSIS specifications, which are the cable modem standards, are starting to support ECN. We’re starting to see cable modems, cable Internet connections, which enable ECN by default. And we’re starting to see interest from 3GPP, which is the mobile phone standards body, in enabling this in 5G and 6G networks, so I think it’s coming, but it’s going to take time. And, I think, as ECN gradually gets deployed, we’ll gradually see a reduction in latency across the networks. It’s not going to be dramatic. It’s not going to suddenly transform the way the network behaves, but hopefully over the next 5 or 10 years we’ll gradually see the latency reducing as ECN gets more widely deployed. So that’s what I want to say about ECN. It’s a mechanism by which the network can signal to the applications that the network is starting to get overloaded, allowing the applications to back off more quickly, in a way which reduces latency and reduces packet loss.
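As an aside, for UDP-based protocols such as RTP, an application can itself request an ECN-capable codepoint on outgoing packets using the standard IP_TOS socket option, since the ECN field shares a byte with the DSCP bits. This is only a hedged, Linux-flavoured sketch: the exact treatment of the ECN bits passed via IP_TOS is platform dependent, and for TCP sockets the kernel manages the ECN marking itself.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdio.h>

/* Ask for the ECT(1) codepoint on packets sent from a UDP socket by
 * setting the low two bits of the TOS byte (DSCP left at zero).
 * Behaviour varies between platforms; treat this as an illustration. */
int enable_ect(int fd) {
    int tos = 0x01;   /* DSCP 0, ECN field = 01, i.e. ECT(1) */
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0) {
        perror("setsockopt(IP_TOS)");
        return -1;
    }
    return 0;
}
```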
Part 6: Light Speed?
Abstract
The final part of the lecture moves on from congestion control and queueing, and discusses another factor that affects latency: the network propagation delay. It outlines what is the propagation delay and ways in which it can be reduced, including more direct paths and the use of low-Earth orbit satellite constellations.
In this final part of the lecture, I want to move on from talking about congestion control, and the impact of queuing delays on latency, and talk instead about the impact of propagation delays. So, if you think about the latency for traffic being delivered across the network, there are two factors which impact that latency. The first is the time packets spend queued up at various routers within the network. As we’ve seen in the previous parts of this lecture, this is highly influenced by the choice of TCP congestion control, and by whether Explicit Congestion Notification is enabled or not. The other factor, which we’ve not really discussed to date, is the time it takes the packets to actually propagate down the links between the routers. This depends on the speed at which the signal propagates down the transmission medium. If you’re using an optical fibre to transmit the packets, it depends on the speed at which the light propagates through the fibre. If you’re using electrical signals in a cable, it depends on the speed at which the electrical signal propagates down the cable. And if you’re using radio signals, it depends on the speed of light, the speed at which the radio signals propagate through the air. As you might expect, physically shorter links have lower propagation delays. A lot of the time it takes a packet to get down a long distance link is just the time it takes the signal to physically travel along the link. If you make the link shorter it takes less time. What is perhaps not so obvious, though, is that you can actually get significant latency benefits on certain paths, because the existing network links follow quite indirect routes. For example, look at the path the network links take if you’re sending data from Europe to Japan. Quite often, that data goes from Europe, across the Atlantic to, for example, New York or Boston, or somewhere like that, across the US to San Francisco, or Los Angeles, or Seattle, or somewhere along those lines, and then from there, in a cable across the Pacific, to Japan. Or alternatively, it goes from Europe through the Mediterranean, the Suez Canal and the Middle East, and across India, and so on, until it eventually reaches Japan the other way around. But neither of these is a particularly direct route. And it turns out that there is a much more direct, much faster, route to get from Europe to Japan, which is to lay an optical fibre through the Northwest Passage, across Northern Canada, through the Arctic Ocean, down through the Bering Strait, and past Russia, to get directly to Japan. It’s much closer to the great circle route around the globe, and it’s much shorter than the route that the networks currently take. And, historically, this hasn’t been possible because of the ice in the Arctic. But, with global warming, the Northwest Passage is now ice-free for enough of the year that people are starting to talk about laying optical fibres along that route, because they can get a noticeable latency reduction, for certain traffic, by just following the physically shorter route. Another factor which influences the propagation delay is the speed of light in the transmission medium. Now, if you’re sending data using radio links, or using lasers in a vacuum, then the signals propagate at the speed of light in a vacuum, which is about 300 million metres per second. The speed of light in optical fibre, though, is slower.
The speed at which light propagates down a fibre, the speed at which light propagates through glass, is only about 200,000 kilometres per second, 200 million metres per second. So it’s about two thirds of the speed at which it propagates in a vacuum. And this is the rationale for systems such as StarLink, which SpaceX is deploying. The idea of these systems is that, rather than sending the Internet signals down an optical fibre, you send them a hundred, or a couple of hundred, miles up to a satellite, they then travel between various satellites in the constellation, in low earth orbit, and then down to a receiver near the destination. And by propagating through vacuum, rather than through optical fibre, where the speed of light is significantly faster, about 50% faster than the speed of light in fibre, this can reduce the latency. And the estimates show that if you have a large enough constellation of satellites, and SpaceX is planning on deploying around 4000 satellites, I believe, and with careful routing, you can get about a 40 to 50% reduction in latency. Just because the signals are transmitted via radio waves, and via inter-satellite laser links, which operate in a vacuum, rather than being transmitted through a fibre optic cable. Just because of the difference in the speed of light between the two media. And the link on the slide points to some simulations of the StarLink network, which try to demonstrate how this would work, how it can achieve network paths that closely follow the great circle routes, and how it can reduce the latency because of the use of satellites. So, what we see is that people are clearly going to some quite extreme lengths to reduce latency. What we spoke about in the previous part was the use of ECN marking to reduce latency by reducing the amount of queuing. And that’s just a configuration change, a software change to some routers. And that seems to me like a reasonable approach to reducing latency. But some people are clearly willing to go to the effort of launching thousands of satellites, or, in the slightly less extreme case, laying new optical fibres through the Arctic Ocean. So why are people doing this? Why do people care so much about reducing latency that they’re willing to spend billions of dollars launching thousands of satellites, or running new undersea cables, to do this? Well, you’ll be surprised to hear that this is not to improve your gaming experience. And it’s not to improve the experience of your Zoom calls. Why are people doing this? High frequency share trading. Share traders believe they can make a lot of money by getting a few milliseconds worth of latency reduction compared to their competitors. Whether that’s a good use of a few billion dollars I’ll let you decide. But the end result may be, hopefully, that we get lower latency for the rest of us as well. And that concludes this lecture. There are a bunch of reasons why we have latency in the network. Some of it is due to propagation delays. Some of it, perhaps most of it in many cases, is due to queuing at intermediate routers. The propagation delays are driven by the speed of light. And unless you can launch many satellites, or lay new optical fibres, that’s pretty much a fixed constant, and there’s not much we can do about it. Queuing delays, though, are things which we can change. And a lot of the queuing delay in the network is caused by TCP Reno and TCP Cubic, which push for the queues to be full.
Hopefully, we will see improved TCP congestion control algorithms. TCP Vegas was one attempt in this direction, which unfortunately proved not to be deployable in practice. TCP BBR was another attempt, which was problematic for other reasons, because of its unfairness. But people are certainly working on alternative algorithms in this space, and hopefully we’ll see things deployed before too long.
L6 Discussion
Summary
Lecture 6 discussed TCP congestion control and its impact on latency. It discussed the principles of congestion control (e.g., the sliding window algorithm, AIMD, conservation of packets), and their realisation in TCP Reno. It reviewed the choice of TCP initial window, slow start, and the congestion avoidance phase, and the response of TCP to packet loss as a congestion signal.
The lecture noted that TCP Reno cannot effectively make use of fast and long distance paths (e.g., gigabit per second flows, running on transatlantic links). It discussed the TCP Cubic algorithm, that changes the behaviour of TCP in the congestion avoidance phase to make more effective use of such paths.
The lecture also noted that both TCP Reno and TCP Cubic will try to increase their sending rate until packet loss occurs, and will use that loss as a signal to slow down. This fills the in-network queues at routers on the path, increasing latency.
The lecture briefly discussed TCP Vegas, and the idea of using delay changes as a congestion signal instead of packet loss, and it noted that TCP Vegas is not deployable in parallel with TCP Reno or Cubic. It highlighted ongoing research with TCP BBR, a new proposal that aims to make a deployable congestion controller that is latency sensitive, and some of the fairness problems with BBR v1.
Finally, the lecture highlighted the possible use of Explicit Congestion Notification as a way of signalling congestion to the endpoints, and of causing TCP to reduce its sending rate, before the in-network queues overflow. This potentially offers a way to reduce latency and packet loss caused by network congestion.
Discussion will focus on the behaviour of TCP Reno congestion control, to understand the basic dynamics of TCP, why these are so effective at keeping the network occupied, and understanding how this leads to high latency. We will then discuss the applicability and ease of deployment of several alternatives (Cubic, Vegas, BBR, and ECN) and how they change performance and latency.
Lecture 7
Abstract
Lecture 7 discusses real-time and interactive applications. It talks about the requirements and constraints for running real-time traffic on the Internet, and discusses how interactive video conferencing and streaming video applications are implemented.
Part 1: Real-time Media Over The Internet
Abstract
This first part of the lecture discusses real-time media running over the Internet. It outlines what is real-time traffic, and what are the requirements and constraints when running real-time applications over the Internet. It discusses the implications of non-elastic traffic, the effects of packet loss, and the differences between quality of service and quality of experience.
In this lecture I want to move on from talking about congestion control, and talk instead about real-time and interactive applications. In this first part, I’ll start by talking about real-time applications in the Internet. I’ll talk about what real-time traffic is, some of the requirements and constraints of that traffic, and how we go about ensuring a good quality of experience, a good quality of service, for these applications. In the later parts, I’ll talk about interactive applications: I’ll talk about the conferencing architecture, how we go about building a signalling system to locate the person you wish to have a call with, how we describe conferencing sessions, and how we go about transmitting real-time multimedia traffic over the network. And then, in the final part, I’ll move on and talk about streaming applications, and talk about the HTTP adaptive streaming protocols that are used for video on demand applications such as the iPlayer or Netflix. To start with, though, I want to talk about real-time media over the Internet. I’ll say a little bit about what real-time traffic is, what the requirements and constraints are in order to successfully run real-time traffic, real-time applications, over the Internet, and some of the issues around quality of service and user experience, and how to make sure we get a good experience for users of these applications. So, there’s actually a long history of running real-time traffic over the Internet. And this includes applications like telephony and voice over IP. It includes Internet radio and streaming audio applications. It includes video conferencing applications such as Zoom. It includes streaming TV, streaming video applications, such as the iPlayer and Netflix. But it also includes gaming, and sensor network applications, and various industrial control systems. And these experiments go back a surprisingly long way. The earliest RFC on the subject of real-time media on the Internet is RFC 741, which dates back to the 1970s and describes the Network Voice Protocol. And this was an attempt at running packet voice over the ARPANET, the precursor to the Internet. And there’s been a continual thread of standards development and experimentation and research in this area. The current set of standards, which we use for telephony applications, for video conferencing applications, dates back to the mid 1990s. It led to a set of protocols such as SIP, the Session Initiation Protocol, the Session Description Protocol, the Real-time Transport Protocol, and so on. And then there was another burst of development, in perhaps the mid-2000s or so, with HTTP adaptive streaming, and that led to standards such as the MPEG DASH standards, and applications like Netflix and the iPlayer. I think what’s important, though, is to realise that this is not new for the network. We’ve seen everyone in the world switch to using video conferencing, and everyone in the world switch to using Webex, and Teams, and Zoom, and the like. But these applications have actually existed for many years, and they have developed, and the network has developed along with them, and there’s a long history of support for real-time media in the Internet. And you occasionally hear people saying that the Internet was not designed for real-time media, and that we need to re-architect the Internet to support real-time applications, and to support future multimedia applications. I think that’s being somewhat disingenuous with history.
The Internet has developed and grown up with multimedia applications, right from the beginning. And while they’ve perhaps not been as popular as some of the non real-time applications, there’s been a continual strand of development, and people have been using these applications, and architecting the network to support this type of traffic, for many, many years now. So what is real-time traffic? What do we mean by real-time traffic, real-time applications? Well, the defining characteristic is that the traffic has deadlines. The system fails if the data is not delivered by a certain time. And, depending on the type of application, depending on the type of real-time traffic, those can be what are known as hard deadlines or soft deadlines. An example of a hard deadline might be a control system, such as a railway signalling system, where the data that’s controlling the signals has to arrive at the signal before the train does, in order to change the signal appropriately. Real-time multimedia applications, on the other hand, are very much in the realm of soft real-time applications, where you have to deliver the data by a certain deadline in order to get smooth playback of the media: in order to get glitch-free playback of the audio, in order to get smooth video playback. And these applications tend to have to deliver data perhaps every 50th of a second for audio, and maybe 30 times a second, 60 times a second, to get smooth video. And it’s important to realise that no system is ever 100% reliable at meeting its deadlines. It’s impossible to engineer a system that never misses a deadline. So we always have to think about how we can arrange these systems such that some appropriate proportion of the deadlines are met. And what that proportion is depends on what system we’re building. If it’s a railway signalling system, we want the probability that the network fails to deliver the message to be low enough that it’s more likely that the train will fail, or the actual physical signal will fail, than that the network fails to deliver the message in time. If it’s a video conferencing application, or video streaming application, the risks are obviously a lot lower, and so you can accept a higher probability of failure. Although again, it depends what the application’s being used for. A video conferencing system being used for a group of friends, just chatting, obviously has different reliability constraints, different degrees of strictness of its deadlines, than one being used for remote control of a drone, or one being used for remote surgery, for example. And different systems can have different types of deadline. It may be that various types of data have to be delivered before a certain time. You have to deliver the control information to the railway signal before the train gets there. So you’ve got an absolute deadline on the data. Or it may be that the data has to be delivered periodically, relative to the previous deadline. The video frames have to be delivered every 30th of a second, or every 60th of a second. And different applications have different constraints. Different bounds on the latency, on the absolute deadline, but also on the relative deadline, on the predictability of the timing. It’s important to remember that we’re not necessarily talking about high performance for these applications. If we’re building a phone system that runs over the Internet, for example, the amount of data we’re sending is probably only a few kilobits per second.
But it requires predictable timing. The packets containing the speech data have to be delivered with at least approximately predictable, approximately equal, spacing, in order that we can correct the timing and play out the speech smoothly. And yes, some types of applications are quite high bandwidth. If we’re trying to deliver studio quality movies, or if we’re trying to deliver holographic conferencing, then we need tens, or possibly hundreds, of megabits. But they’re not necessarily high performance. The key thing is predictability. So what are the requirements for these applications? Well, to a large extent, it depends on whether you’re building a streaming application or an interactive application. For video-on-demand applications, like Netflix or YouTube or the iPlayer, for example, there’s not really any absolute deadline in most cases. If you’re watching a movie, it’s okay if it takes 5, 10, 20 seconds to start playing after you click the play button, provided the playback is smooth once it has started. And maybe if it’s a short thing, maybe a YouTube video that’s only a couple of minutes long, then you want it to start quicker. But again, it doesn’t have to start within milliseconds of you pressing the play button. A second or two of latency is acceptable, provided the playback is smooth once it starts. Now, obviously, for live applications the deadlines may be different. Clearly, if you’re watching a live sporting event on YouTube or the iPlayer, for example, you don’t want it to be too far behind the same event being watched on broadcast TV. But, for these applications, typically it’s the relative deadlines, and smooth playback once the application has started, rather than the absolute deadline that matters. The number of bits per second needed depends to a large extent on the quality. And, obviously, higher quality is better, a higher bit rate is better. But, to some extent, there’s a limit on this. And it’s a limit depending on the camera, on the resolution of the camera, and the frame rate of the camera, and the size of the display, and so on. And you don’t necessarily need many tens or hundreds of megabits. You can get very good quality video with single digit numbers of megabits per second. And even production quality, studio quality, is only hundreds of megabits per second. So there’s an upper bound on the rate at which these applications typically send, when you hit the limits of the capture device, when you hit the limits of the display device. And, quite often, for a lot of these applications, predictability matters more than absolute quality. It’s often less annoying to have a movie which is a consistent quality than a movie which is occasionally very good quality but keeps dropping down to a lower resolution. So predictability is often what’s critical. And, for a given bit rate, you’re also trading off between frame rate and quality. Do you want smooth motion, or do you want very fine detail? And, if you want both smooth motion and fine detail, you have to increase the rate. But, at a given bit rate, you can trade off between them. For interactive applications, the requirements are a bit different. They depend very much on human perception, and on the requirements to be able to have a smooth conversation. For phone calls, for video conferencing applications, people have been doing studies of this sort of thing for quite a while.
The typical bounds you hear expressed are that the one-way mouth-to-ear delay, the delay from me talking, the sound going through the air to the microphone, being captured, compressed, transmitted over the network, decompressed, and played out, back from the speakers to your ear, should be no more than about 150 milliseconds. If it gets more than that, it starts getting a bit awkward for conversations. People start talking over each other, and it gets to be a bit difficult to hold a conversation. And the ITU-T Recommendation G.114 talks about this, and about the constraints there, in a lot of detail. In terms of lip sync, people start noticing if the audio is more than about 15 milliseconds ahead of, or more than about 45 milliseconds behind, the video. And it seems that people notice more often if the audio is ahead of the video than if it’s behind the video. So this gives quite strict bounds for overall latency across the network, and for the variation in latency between audio and video streams. And, obviously, this depends what you’re doing. If you’re having an interactive conversation, the bounds are tighter than if it’s more of a lecture style, where it’s mostly unidirectional, with more structured pauses and more structured questioning. That type of application can tolerate higher latency. Equally, if you’re trying to do, for example, a distributed music performance, then you need much lower latency. If you think about something like an orchestra, and you measure the size of the orchestra, and you think about the speed of sound, you get about 15 milliseconds for the sound to go from one side of the orchestra to another. So that sort of level of latency is clearly acceptable, but once it gets to more than 20 or 30 milliseconds, it gets very difficult for people to play in a synchronised way. And if you’ve ever tried to play music over a Zoom call, you’ll realise it just doesn’t work, because the latency is too high for playing music collaboratively on a video conference. So that gives you some bounds for latency. What we saw in some of the previous lectures is that the network is very much a best effort network, and it doesn’t guarantee the timing. The amount of latency for data to traverse the network very much depends on the propagation delay of the path, and the amount of queuing, and on the path taken, and it’s not predictable at all. If we look at the figure on the left here, it’s showing the variation in round trip time for a particular path. And we see that most of it is bundled up, and there’s a fairly consistent bound, but there are occasional spikes where the packets take a much longer time to arrive. And in some networks these effects can be quite significant; it can take quite a long time for data to arrive. The consequence of all this is that a real-time application needs to be loss tolerant. If you build an application to be reliable, it has to retransmit data, and that may or may not arrive in time. So you want to build it to be unreliable, and not necessarily retransmit the data. You also want it to be able to cope with the fact that some packets may be delayed, and be able to proceed even if those packets arrive too late. So it needs to be able to compensate for, to tolerate, loss, whether that’s data which is never going to arrive, or data that’s going to arrive late. And, obviously, there’s a bound on how much loss you can conceal, how much loss you can tolerate, before the quality goes down.
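One way to see what loss tolerance means in practice is that a receiver typically treats a packet that arrives after its playout deadline exactly as if it had been lost, and conceals it. A minimal sketch of that check is shown below; the structure and field names are illustrative, not taken from any particular protocol, and real receivers usually adapt the playout delay over time.

```c
#include <stdbool.h>

/* Illustrative media packet: capture timestamp only (payload omitted). */
struct media_packet {
    double capture_time;   /* when the frame was captured, in seconds */
};

/* A packet is playable only if it arrives before its scheduled playout
 * time; playout_delay is the buffering the receiver adds to absorb
 * network timing variation. Late packets are discarded and concealed,
 * just like packets that never arrive. */
bool is_playable(const struct media_packet *pkt,
                 double arrival_time, double playout_delay) {
    double deadline = pkt->capture_time + playout_delay;
    return arrival_time <= deadline;
}
```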
And the challenge in building these applications is, partly, to engineer the network such that it doesn’t lose many packets, such that the loss rate and the timing variation are low enough that the application is going to work. But it’s also in building the application to be tolerant to the loss, in being able to conceal the effects of lost packets. The real-time nature of the traffic also affects the way congestion control works; it affects the way data is delivered across the network. As we saw in some of the previous lectures, when we were talking about TCP congestion control, congestion control adapts the speed of transmission to match the available capacity of the network. If the network has more capacity, it sends faster. If the network gets overloaded, it sends slower. And the transfers are elastic. If you’re downloading a web page, if you’re downloading a large file, faster is better, but it doesn’t really matter what rate the congestion control picks. You want it to come down as fast as it can, and the application can adapt. Real-time traffic is much less elastic. It’s got a minimum rate; there’s a certain quality level, a certain bit rate, below which the media is just unintelligible. If you’re transmitting speech, you need a certain number of kilobits per second, otherwise what comes out is just not intelligible speech. If you’re sending video, you need a certain bit rate, otherwise you can’t get full motion video over it; the quality is just too low, the frame rate is just too low, and it’s no longer video. Similarly, though, these applications have a maximum rate. If you’re sending speech data, if you’re sending music, it depends on the capture rate, the sampling rate. And, even for the highest quality audio, you’re probably not looking at more than a megabit, a couple of megabits, for CD quality, surround sound, media. And again, for video, it depends on the type of camera, the frame rate, the resolution, and so on. Again, a small number of megabits, tens of megabits, in the most extreme cases hundreds of megabits, and you get an upper bound on the sending rate. So, real-time applications can’t use infinite amounts of capacity. Unlike TCP, they’re constrained by the rate at which the media is captured. But also, they can’t go arbitrarily slowly. This affects the way we have to send the data, because we have less flexibility in the rate at which these applications can send. And we need to think about to what extent it’s possible, or desirable, to reserve capacity for these applications. There are certainly ways one can engineer a network such that it guarantees that a certain amount of capacity is available. Such that it guarantees, for example, that a five megabit per second channel is available to deliver video. And, if the application is very critical, maybe that makes sense. If you’re doing remote surgery, you probably do want to guarantee the capacity is there for the video. But, for a lot of applications, it’s not clear it’s needed. So we have protocols, such as the Resource Reservation Protocol, RSVP, such as the Multi-Protocol Label Switching protocol for orchestrating link-layer networks, such as the idea of network slicing in 5G networks, with which we can set up resource reservations. But this adds complexity. It adds signalling. You need to somehow signal to the network that you need to set up the reservation, and tell it what resources the traffic requires.
And, somehow, demonstrate to the network that the sender is allowed to use those resources, is allowed to reserve that capacity, and can pay for it. So you need authentication, authorisation, and accounting mechanisms, to make sure that the people reserving those resources are actually allowed to do so, and have paid for them. And in the end, if the network has capacity, this doesn't actually help you. If the operators designed the network so it has enough capacity for all the traffic it's delivering, the reservation doesn't help. The reservations only help when the network doesn't have the capacity. They're a way of allowing an operator, who hasn't invested in sufficient network resources, to discriminate in favour of the customers who are willing to pay extra. To discriminate so that those customers who are willing to pay can get good quality, whereas those who don't pay extra just get a system which doesn't work well. So, it's not clear that resource reservations necessarily add benefit. There are certainly applications where they do. But, for many applications, given the cost of reserving the resources to get guaranteed quality, the cost of building the accounting system, and the complexity of building the resource reservation system, it's often easier, and cheaper, just to buy more capacity, such that everything works and there's no need for reservations. And this is one of those areas where the Internet, perhaps, does things differently to a lot of other networks. The Internet is very much best effort, with unreserved capacity. And it's an area of tension, because a lot of the network operators would like to be able to sell resource reservations, would like to be able to charge you extra to guarantee that your Zoom calls will work. It's a different model. It's not clear, to me, whether we want a network that provides those guarantees, but requires charging, and authentication, and authorisation, and knowing who's sending what traffic, so you can tell if they've paid for the appropriate quality. Or whether it's better just for everyone to be sending, and we architect the network so that it's good enough for most things, and accept occasional quality lapses. And, ultimately, it comes down to what's known as quality of experience. Does the application actually meet the user's needs? Does it allow them to communicate effectively? Does it provide compelling entertainment? Does it provide good enough video quality? It's very much not a one dimensional metric. When you ask the user "Does it sound good?", you get a different view on the quality of the music, or the quality of the speech, than if you ask "Can you understand it?" The question you ask matters. It depends what aspect of the user experience you are evaluating. And it depends on the task people are doing. The quality people need for remote surgery is different to the quality people need for a remote lecture, for example. And some aspects of this user experience you can estimate by looking at technical metrics such as packet loss and latency. And the ITU has something called the E-model, which gives a pretty good estimate of subjective speech quality, based on looking at the latency, the timing variation, and the packet loss of the speech data. But, especially when you start talking about video, and especially when you start talking about particular applications, it's often very subjective, and very task dependent.
And you need to actually build the system, try it out, and ask people “So how well did it work?” “Does it sound good?” “Can you understand it?” “Did you like it?” You need to do user trials to understand the quality of the experience of the users. So that concludes the first part. I’ve spoken a bit about what is real-time traffic, some of the requirements and constraints to be able to run real-time applications over the network, and some of the issues around quality of service and the user experience. In the next part, we’ll move on to start talking about how you build interactive applications running over the Internet.
Part 2: Interactive Applications (Data Plane)
Abstract
The second part discusses interactive applications. It briefly reviews the history of real-time applications running over the Internet, and the requirements on timing, data transfer rate, and reliability to be able to successfully run audio/visual conferencing applications over the network. It outlines the structure of multimedia conferencing applications, and the protocol stack used to support such applications. RTP media transport, media timing recovery, application-level framing, and forward error correction are discussed, outlining how multimedia applications are implemented.
In this part I'd like to talk about interactive conferencing applications. I'll talk a little bit about the structure of video conferencing systems, some of the protocols for multimedia conferencing, for video conferencing, and talk a bit about how we do multimedia transport over the Internet. So what do we mean by interactive conferencing applications? Well, I'm talking about applications such as telephony, such as voice over IP, and such as video conferencing. These are applications like the university's telephone system, like Skype, like Zoom or Webex or Microsoft Teams, that we're all spending far too much time on these days. And this is an area which has actually been developing in the Internet community for a surprisingly long amount of time. As we discussed in the first part of the lecture, the early standards, the early work here, date back to the early 1970s. And the first Internet RFC on this subject, the Network Voice Protocol, was actually published in 1976. The standards we use today for video conferencing applications, for telephony, for voice over IP, date from the early- and mid-1990s initially. There were a set of applications, such as CU-SeeMe, which you see at the bottom right of the slide here, and a set of applications called the Mbone conferencing tools, and the picture on the top right of the slide is an application I was involved in developing in the late 1990s in this space, which prototyped a lot of these standard protocols. They led to the development of a set of standards, such as the Session Description Protocol, SDP, the Session Initiation Protocol, SIP, and the Real-time Transport Protocol, RTP, which formed the basis of these modern video conferencing applications. These got pretty widely adopted. The ITU adopted them as the basis for its H.323 series of recommendations for video conferencing systems. A lot of commercial telephony products are built using them. And the Third Generation Partnership Project, 3GPP, adopted them as the basis for the current set of mobile telephone standards. So, if you make a phone call, a mobile phone call, you're using the descendants of these standards. And also, more recently, the WebRTC browser-based conferencing system again incorporated these protocols into the browser, building on SDP, and RTP, and the same set of conferencing standards which were prototyped in the tools you see on the right of the slide. Again, as we discussed in the previous part of the lecture, if you're building interactive conferencing applications, you've got fairly tight bounds on latency. The one-way delay, from mouth to ear, if you want a sensible interactive conversation, has to be no more than somewhere around 150 milliseconds. And if you're building a video conference, you want reasonably tight lip sync between the audio and video, with the audio no more than around 15 milliseconds ahead of the video, and no more than about 45 milliseconds behind. Now, the good thing is that these applications tend to degrade relatively gracefully. The bounds, the 150 milliseconds end-to-end latency, and the 15 milliseconds ahead, 45 milliseconds behind, for lip sync, are not strict bounds. Shorter is better, but if the latency, or the offset, exceeds those values, the application gradually becomes less and less usable: people start talking over each other, people start noticing the lack of lip sync, but nothing fails catastrophically.
But that's the sort of values we're looking at: end-to-end delay in the 100 to 150 millisecond range, and audio and video synchronised to within a few tens of milliseconds. The data rates we're sending depend, very much, on what type of media you're sending, and what codec, what compression scheme, you use. For sending speech, the speech compression typically takes portions of speech data that are around 20 milliseconds in duration, about 1/50th of a second in duration, and every 20 milliseconds, every 1/50th of a second, it grabs the next chunk of audio that's been received, compresses it, and transmits it across the network. And this is decoded at the receiver, decompressed, and played out on the same sort of timeframe. The data rate depends on the quality level you want. It's possible to send speech with something on the order of 10-15 kilobits per second of speech data, although it's typically sent at a somewhat higher quality, maybe a couple of hundred kilobits, to get high quality speech that sounds pleasant, but it can go to very low bit rates if necessary. And a lot of these applications vary the quality a little, based on what's going on. They encode higher quality when it's clear that the person is talking, and they send packets less often, and encoded with lower bit rates, when it's clear there's background noise. If you're sending good quality music, you need more bits per second than if you're sending speech. For video, the frame rate and the resolution very much depend on the camera, on the amount of processor time you have available to do the compression, and whether you've got hardware accelerated video compression or not. And on the video compression algorithm, the video codec, you're using. Frame rates somewhere in the order of 25 to 60 frames per second are common. Video resolution varies from postage stamp sized, up to full screen, HD, or 4k video. You can get good quality video with codecs like H.264 at around the two to four megabits per second range. Obviously, if you're going up to full-motion, 4k, movie encoding, you'll need higher rates than that. But, even then, you're probably not looking at more than four, eight, ten megabits per second. So, what you see is that these applications have reasonably demanding latency bounds, and reasonably high, but not excessively high, bit-rate bounds. Two to four megabits, even eight megabits, is generally achievable on most residential, home network, connections. And 150 milliseconds end-to-end latency is generally achievable without too much difficulty, as long as you're not trying to go transatlantic or transpacific. In terms of reliability requirements, speech data is actually surprisingly loss tolerant. It's relatively straightforward to build systems which can conceal 10-20% random packet loss without any noticeable reduction in speech quality. And, with the addition of forward error correction, with error correcting codes, it's quite possible to build systems that work with maybe 50% of the packets being lost. Bursts of packet loss are harder to conceal, and tend to result in audible glitches in the speech playback, but they're relatively uncommon in the network. Video packet loss is somewhat harder to conceal. With streaming video applications, if you're sending a movie, for example, you can rely on the occasional scene changes to reset the decoder state, and to recover from the effects of any loss.
With video conferencing, there aren't typically scene changes, so you have to do a rolling repair, a rolling retransmission, or some form of forward error correction, to repair the losses. So video tends to be more sensitive to packet loss than the audio. Equally, though, people are less sensitive to disruptions in video quality than they are to disruptions in the audio quality. So how is one of these interactive conferencing applications structured? What does the media transmission path look like? Well, you start with some sort of capture device. Maybe that's a microphone, or maybe it's a camera, depending whether it's an audio or a video application. The media data is captured from that device, and goes into some sort of input buffer, a frame at a time. If it's video, it's a video frame at a time. If it's audio, it's frames of, typically, 20 milliseconds worth of speech or music data at a time. Each frame is taken from that input buffer, and passed to the codec. The codec compresses the frames of media, one by one. And, if they're too large to fit into an individual packet, it fragments them into multiple packets. Each of those fragments of a media frame is transmitted by putting it inside an RTP packet, a Real-time Transport Protocol packet, which is put inside a UDP packet, and sent on to the network. The RTP packet header adds a sequence number, so the packets can be put back into the right order. It adds timing information, so the receiver can reconstruct the timing accurately. And it adds some source identification, so it knows who's sending the media, and some payload identification information, so it knows which compression algorithm, which codec, was used to encode the media. So the media is captured, compressed, fragmented, packetised, and transmitted over the network. On the receiving side, the UDP packets containing the RTP data arrive. And the receiving application extracts the RTP data from the UDP packets, and looks at the source identification information in there. And then it separates the packets out according to who sent them. For each sender, the data goes through a channel coder, which repairs any loss using a forward error correction scheme, if one was used. And we'll talk about that later, but that's where additional packets are sent along with the media, to allow some sort of repair without needing retransmission. Then it goes into what's called a play-out buffer. The play-out buffer is enough buffering to allow the timing, and the variation in timing, to be reconstructed, such that the packets are put back into the right order, and such that they're delivered to the codec, to the decoder, at the right time, and with the correct timing behaviour. The decoder then decompresses the media, conceals any remaining packet loss, corrects any clock skew, corrects any timing problems, mixes it together if there's more than one person talking, and renders it out to the user. It plays the speech or the music out, or it puts the video frames onto the screen. So that's conceptually how these applications work. What does the set of protocol standards used to transport multimedia over the Internet look like? Well, there's a fairly complex protocol stack. At its core, we have the Internet protocols, IPv4 and IPv6, with UDP and TCP layered above them. Layered above the UDP traffic is the media transport traffic and the associated data.
And what you have there is the UDP packets, which deliver the data; a datagram TLS layer, which negotiates the encryption parameters; and, above that, sit the secure RTP packets, with the audio and video data in them, for transmitting the speech and the pictures. And you have a protocol, known as SCTP, layered on top of DTLS, to provide a peer-to-peer data channel. In addition to the media transport, with RTP and SCTP sitting above DTLS, you also have NAT traversal and path discovery mechanisms. We spoke about these a few lectures ago, with protocols like STUN and TURN and ICE to help set up peer-to-peer connections, to help discover NAT bindings. You have what's known as a session description protocol, to describe the call being set up. And this identifies the person who's trying to establish the multimedia call, who's trying to establish the video conference. It identifies the person they want to talk to. It describes which audio and video compression algorithms they want to use, which error correction mechanisms they want to use, and so on. And this is used along with one or more of a set of signalling protocols, depending how the call is being set up. It may be an announcement of a broadcast session, using a protocol called the Session Announcement Protocol, for example. It might be a telephone call, using the Session Initiation Protocol, SIP, which is how the University's phone system works, for example. It might be a streaming video session, using a protocol called RTSP. Or it might be a web based video conferencing application, such as a Zoom call, or a Webex call, or a Microsoft Teams call, where the negotiation runs over HTTP using a protocol called JSEP, the Javascript Session Establishment Protocol. So let's talk a little bit about the media transport. How do we actually get the audio and video data from the sender to the receiver, once we've captured and compressed the data, and got it ready to transmit? Well, it's sent within a protocol called the Real-time Transport Protocol, RTP. RTP comprises two parts. There's a data transfer protocol, and there's a control protocol. The data transfer protocol is usually called just RTP, the RTP data protocol, and it carries the media data. It's structured in the form of a set of payload formats. The payload formats describe how you take the output of each particular video compression algorithm, each particular audio compression algorithm, and map it onto a set of packets to be transmitted. And they describe how to split up a frame of video, how to split up a sequence of audio frames, such that each RTP packet, each UDP packet, which arrives can be independently decoded, even if some of the packets have been lost. It makes sure there are no dependencies between packets, a concept known as application level framing. And this runs over a datagram TLS layer, which negotiates the encryption keys and the security parameters to allow us to encrypt those RTP packets. The control protocol runs in parallel, and provides things like caller-ID, reception quality statistics, retransmission requests, and so on, in case data gets lost. And there are various extensions that go along with this, that provide things like detailed user experience and reception quality reporting, that provide codec control and feedback mechanisms to detect and correct packet loss, and that provide congestion control and perform circuit breaker functions to stop the transmission if the quality is too bad. The RTP packets are sent inside UDP packets.
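As a concrete aside, the sketch below shows, in C, how the 12-byte fixed header that sits at the front of every RTP packet, as defined in RFC 3550, might be represented and serialised. It's a minimal sketch: CSRC lists, header extensions, padding, and the attachment of an actual media payload are all omitted.

```c
/* Minimal sketch of the fixed RTP header from RFC 3550. Illustrative
 * only: CSRC lists, header extensions, padding, and the media payload
 * that follows the header are not handled here. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>   /* htons(), htonl() */

struct rtp_header {
    uint8_t  version;       /* always 2                                    */
    uint8_t  padding;       /* 1 bit: padding octets follow the payload    */
    uint8_t  extension;     /* 1 bit: an extension header follows          */
    uint8_t  csrc_count;    /* 4 bits: number of contributing sources      */
    uint8_t  marker;        /* 1 bit: e.g. marks the start of a talk-spurt */
    uint8_t  payload_type;  /* 7 bits: which codec the payload uses        */
    uint16_t sequence;      /* incremented by one for each packet sent     */
    uint32_t timestamp;     /* media sampling clock, for timing recovery   */
    uint32_t ssrc;          /* identifies the synchronisation source       */
};

/* Serialise the fixed header into the first 12 bytes of buf, in network
 * byte order; the compressed media data would follow immediately after. */
void rtp_header_write(const struct rtp_header *h, uint8_t buf[12])
{
    buf[0] = (uint8_t)((h->version << 6) | (h->padding << 5) |
                       (h->extension << 4) | (h->csrc_count & 0x0f));
    buf[1] = (uint8_t)((h->marker << 7) | (h->payload_type & 0x7f));

    uint16_t seq  = htons(h->sequence);
    uint32_t ts   = htonl(h->timestamp);
    uint32_t ssrc = htonl(h->ssrc);
    memcpy(buf + 2, &seq,  sizeof seq);
    memcpy(buf + 4, &ts,   sizeof ts);
    memcpy(buf + 8, &ssrc, sizeof ssrc);
}

int main(void)
{
    struct rtp_header h = { .version = 2, .payload_type = 96,
                            .sequence = 1, .timestamp = 0, .ssrc = 0x12345678 };
    uint8_t buf[12];
    rtp_header_write(&h, buf);
    for (int i = 0; i < 12; i++)
        printf("%02x ", buf[i]);
    printf("\n");
    return 0;
}
```

In a real sender, these 12 bytes plus the codec output would form the payload of each UDP packet sent.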
The diagram we see here shows the format of the RTP packets. This is the format of the media data, which sits within the payload section of UDP packets. And we see that it's actually a reasonably sophisticated protocol. If we look at the format of the packet, we see there's a sequence number and a timestamp to allow the receiver to reconstruct the ordering and reconstruct the timing. There's a source identifier to identify who sent the packet, if you have a multi-party video conference. And there are some payload format identifiers, that describe whether it contains audio or video, what compression algorithm is used, and so on. And there's space for extension headers, and space for padding, and space for the payload data, where the actual audio or video data goes. And these packets, these RTP packets, are sent within UDP packets. And the sender will typically send these with pretty regular timing. If it's audio, it generates 50 packets per second; if it's video, it might be 25 or 30 or 60 frames per second, but the timing tends to be quite predictable. As the data traverses the network, though, the timing is often disrupted by the other types of traffic, the cross-traffic within the network. If we look at the bottom of the slide, we see the packets arriving at the receiver, and we see that the timing is no longer predictable. Because of the other traffic in the network, because it's a best effort network, because it's a shared network, the media data is sharing the network with TCP traffic, with all the other flows on the network, and so the packets don't necessarily arrive with predictable timing. One of the things the receiver has to do is try to reconstruct the timing. And what we see on this slide, at the top, is the timing of the data as it was transmitted. The example is showing audio data, and it's labelling talk-spurts, and a talk-spurt will be a sentence, or a fragment of a sentence, followed by a pause. We see that the packets comprising the speech data are transmitted with regular spacing. And they pass across the network, and at some point later they arrive at the receiver. There's obviously some delay, labelled as network transit delay on the slide, which is the time it takes the packets to traverse the network. And there will be a minimum amount of time it takes, just based on the propagation delay, how long it takes the signals to work their way down the network from the sender to the receiver. And, on top of that, there'll be varying amounts of queuing delay, depending on how busy the network is. And the result of that is that the timing is no longer regular. Packets which were sent with regular spacing arrive bunched together, with occasional gaps between them. And, occasionally, they may arrive out-of-order, or occasionally the packets may get lost entirely. What the receiver does is to add what's labelled as "playout buffering delay" on this slide, to compensate for this timing variation. To compensate for what's known as jitter, the variation in the time it takes the packets to transit across the network. By adding a bit of buffering delay, the receiver can allow itself time to put all the packets back into the right order, and to regularise the spacing. It just adds enough delay to allow it to compensate for this variation. So, by adding a little extra delay at the receiver, the receiver can correct for the variations in timing.
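To make the idea of playout buffering concrete, here is a minimal sketch, in C, of how a receiver might estimate the jitter, using the interarrival jitter filter from RFC 3550, and choose a playout time for each packet. The four-times-jitter playout delay and the overall structure are illustrative assumptions, not the algorithm any particular application uses.

```c
/* Minimal sketch of receiver-side timing recovery: estimate the jitter
 * using the interarrival jitter filter from RFC 3550, then delay playout
 * by a few multiples of that estimate. All times are in units of the
 * RTP media clock; the multiplier of 4 is an illustrative choice. */
#include <stdint.h>

static double   jitter = 0.0;            /* smoothed jitter estimate   */
static uint32_t prev_arrival, prev_ts;   /* previous packet's times    */
static int      have_prev = 0;

/* Update the jitter estimate for a packet with RTP timestamp ts that
 * arrived at local time 'arrival' (measured in the same clock units). */
void update_jitter(uint32_t arrival, uint32_t ts)
{
    if (have_prev) {
        /* Change in relative transit time between this packet and the
         * previous one; unsigned subtraction handles wrap-around.      */
        int32_t d  = (int32_t)((arrival - prev_arrival) - (ts - prev_ts));
        double  ad = (d < 0) ? -(double)d : (double)d;
        jitter += (ad - jitter) / 16.0;  /* the RFC 3550 EWMA filter    */
    }
    prev_arrival = arrival;
    prev_ts      = ts;
    have_prev    = 1;
}

/* Choose when to play a packet out: keep the spacing implied by its RTP
 * timestamp, offset so that the first packet plays a few jitter-widths
 * after it arrived. A larger multiplier means fewer packets arrive too
 * late to be played, at the cost of more added latency.                */
uint32_t playout_time(uint32_t ts, uint32_t first_arrival, uint32_t first_ts)
{
    uint32_t playout_delay = (uint32_t)(4.0 * jitter);
    return first_arrival + (ts - first_ts) + playout_delay;
}
```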
And, if packets are lost, the receiver obviously has to try to conceal that loss, or it can try a retransmission if it thinks the retransmission will arrive in time. Or, as we see with the very last packet here, if packets are delayed too much, then they may arrive too late to be played out. In which case they're just discarded, and the gap has to be concealed as if the packet were lost. And, essentially, you can see that if the packets were played out immediately they arrive, this variation in timing would lead to gaps, because the packets are not arriving with consistent spacing. If you delay the play-out by more than the typical variation in the inter-arrival time of the packets, you can add enough buffering that, once you actually start playing out the data, you get smooth playback. You trade off a little bit of extra latency for very smooth, consistent, playback. And that delay between the packets arriving, and the media starting to play back, that buffering delay, partly allows you to reconstruct the timing, and it partly gives time to decompress the audio, decompress the video, run a loss concealment algorithm, and potentially retransmit any lost packets, depending on the network round-trip time. And then you can schedule the packets to be played out, and you can play the data out smoothly. What's critical, though, is that loss is very much possible. The receiver has to make the best of the packets which do arrive. And a lot of effort, when building video conferencing applications, goes into defining how the compressed audio-visual data is formatted into the packets. And the goal is that each packet should be independently usable. It's easy to take the output of a video compression scheme, a video codec, and just arbitrarily put the data into packets. But, if you do that, the different packets end up depending on each other. You can't decode a particular packet if an earlier one was lost, because it depends on some of the data that was in the earlier packet. So a lot of the skill in building a video conferencing application goes into what's known as the payload format. It goes into the structure of how you format the output of the video compression, and how you format the output of the audio compression, so that each packet that arrives doesn't depend on any data that was in a previous packet, to the extent possible, so that every packet that arrives can be decoded completely. And there are obviously limits to this. Most video compression schemes work by sending a full image, and then encoding differences to that, and that obviously means that you depend on that previous full image, what's known as the index frame. And a lot of these systems build in retransmission schemes for when the index frame gets lost, but, apart from that, the packets for the predicted frames that are transmitted after it should all be independently decodable. The paper shown on the right of the slide here, "Architectural Considerations for a New Generation of Protocols", by David Clark and David Tennenhouse, talks about this approach, and talks about this philosophy of how to encode the data such that the packets are independently decodable, and how to structure these types of applications, and it's very much worth a read. Obviously the packets can get lost, and the way networked applications typically deal with lost packets is by asking for a retransmission.
And you can clearly do this with a video conferencing application. The problem is that retransmission takes time. It takes a round-trip time for the retransmission request to get back from the receiver to the sender, and for the sender to retransmit the data. But for video conferencing applications, for interactive applications, you've got quite a strict delay bound. The delay bound is somewhere on the order of 100-150 milliseconds, mouth to ear delay. And that comprises the time it takes to capture a frame of audio, and audio frames are typically 20 milliseconds, so you've got a 20 millisecond frame of audio being captured. And then it takes some time to compress that frame. And then it has to be sent across the network, so you've got the time to transit the network. And then the time to decompress the frame, and the time to play that frame of audio out. And that typically ends up being four frame durations, plus the network time. So you have 20 milliseconds of frame data being captured. And while that's being captured, the previous frame is being compressed, and transmitted. And, on the receiver side, you have one frame being decoded, errors being concealed, and timing being reconstructed. And then another frame being played out. So you've got 4 frames, 80 milliseconds, plus the network time. It doesn't leave much time to do a retransmission. So retransmissions tend not to be particularly useful in video conferencing applications, unless they're on quite short duration network paths, because they arrive too late to be played out. So what these applications tend to do is use forward error correction. And the basic idea of forward error correction is that you send additional error correcting packets along with the original data. So, in the example on the slide, we're sending four packets of original speech data, original media data. And for each of those four packets, you then send a fifth packet, which is the forward error correction packet. So the group of four packets gets turned into five packets for transmission. And, in this example, the third of those packets gets lost. And at the receiver, you take the four of those five packets which did arrive, and you use the error correcting data to recover that loss without retransmitting the packet. And there are lots of different ways in which these error correcting codes can work. In the simplest case, the forward error correction packet is just the result of running the exclusive-or, the XOR operation, over the previous packets. So the forward error correction packet on the slide could be, for example, the XOR of packets 1, 2, 3, and 4. In this case, at the receiver, when it notices that packet 3 has been lost, if it calculates the XOR of the received packets, so if you XOR packets 1, 2, and 4, and the FEC packet together, what comes out will be the original, missing, packet. And that's obviously a simple approach. There are a lot of much more sophisticated forward error correction schemes, which trade off different amounts of complexity for different overheads. But the idea is that you send occasional packets, which are error correcting packets, and that allows you to recover from some types of loss without retransmitting the packets, so you can recover losses more quickly.
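To make the XOR scheme just described concrete, here is a minimal sketch in C. It assumes all packets in a group are padded to the same length, and that the receiver already knows which packet in the group was lost; real schemes, such as the RTP payload formats for FEC, carry extra header information saying which packets each error correction packet protects.

```c
/* Minimal sketch of XOR-based forward error correction: one parity
 * packet protects a group of four media packets, and any single loss in
 * the group can be rebuilt by XOR-ing the packets that did arrive.
 * Assumes every packet in the group is padded to the same length. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP_SIZE 4
#define MAX_PKT    1500

/* Build the error correction packet: the XOR of the four media packets. */
void fec_encode(const uint8_t pkts[GROUP_SIZE][MAX_PKT], size_t len,
                uint8_t fec[MAX_PKT])
{
    memset(fec, 0, len);
    for (int i = 0; i < GROUP_SIZE; i++)
        for (size_t j = 0; j < len; j++)
            fec[j] ^= pkts[i][j];
}

/* Recover a single lost packet (index 'lost') by XOR-ing the error
 * correction packet with the packets that did arrive. */
void fec_recover(const uint8_t pkts[GROUP_SIZE][MAX_PKT], size_t len,
                 const uint8_t fec[MAX_PKT], int lost, uint8_t out[MAX_PKT])
{
    memcpy(out, fec, len);
    for (int i = 0; i < GROUP_SIZE; i++)
        if (i != lost)
            for (size_t j = 0; j < len; j++)
                out[j] ^= pkts[i][j];
}
```

So, in the example on the slide, the receiver would combine packets 1, 2, and 4 with the FEC packet in this way, and get packet 3 back without any retransmission.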
So, that's the summary of how we transmit media over the Internet. The data is captured, compressed, and framed into RTP packets, each of which includes a sequence number and timing recovery information. Then, when the packets arrive at the receiver, the data is decompressed and buffered, and the timing is reconstructed, with the amount of buffering chosen to allow the receiver to reconstruct the timing, before the media is played out to the user. And that comprises the media transport part. As we saw, there are also signalling protocols and NAT traversal protocols. What I'll talk about in the next part is, briefly, how the signalling protocols work to set up multimedia conferencing calls.
Part 3: Interactive Applications (Control Plane)
Abstract
This part moves on from discussing real-time data transfer to discuss the control plane supporting interactive conferencing applications. It discusses the WebRTC data channel, and the supporting signalling protocols, including the SDP offer/answer exchange, SIP, and WebRTC signalling via JSEP.
In the previous part of the lecture, I introduced interactive conferencing applications. I spoke a bit about the architecture of those applications, about the latency requirements, and the structure of those applications, and began to introduce the standard set of conferencing protocols. I spoke in detail about the Real-time Transport Protocol, and the way media data is transferred. In this part of the lecture, I want to talk briefly about two other aspects of interactive video conferencing applications: the data channel, and the signalling protocols. In addition to sending audio visual media, most video conferencing applications also provide some sort of peer-to-peer data channel. This is part of the WebRTC standards, and it's also part of most of the other systems as well. The goal is to provide for applications like peer-to-peer file transfer as part of the video conferencing tool, to support a chat session along with the audio and video, and to support features like reaction emoji, the ability to raise your hand, requests for the speaker to talk faster or slower, and so on. The way this is implemented in WebRTC is using a protocol called SCTP running inside a secure UDP tunnel. I'm not going to talk much about SCTP. SCTP is the Stream Control Transmission Protocol, and it was a previous attempt at replacing TCP. The original version of SCTP ran directly over IP, and was pitched as a direct replacement for TCP, running as a peer to TCP and UDP, directly on the IP layer. And it turned out this was too difficult to deploy, so it didn't get tremendous amounts of take-up. But, at the point when the WebRTC standards were being developed, it was available, and specified, and it was deemed relatively straightforward to move it to run on top of UDP, to run on top of Datagram TLS, to provide security, as a deployable way of providing a reliable peer-to-peer data channel. And it would perhaps have been possible to use TCP to do this, but the belief at the time was that NAT traversal for TCP wasn't very reliable, and that something running over UDP would work better for NAT traversal. And I think that was the right decision. And SCTP, the WebRTC data channel using SCTP over DTLS over UDP, provides a transparent data channel. It provides the ability to deliver framed messages, it supports delivering multiple sub-streams of data over a single connection, and it supports congestion control, retransmissions, reliability, and so on. And it makes it straightforward to build peer-to-peer applications using WebRTC. And it gains all the deployment advantages, by running over UDP, that we saw QUIC gain. You might ask why WebRTC uses SCTP to build its data channel, rather than using QUIC? And, fundamentally, that's because WebRTC predates the development of QUIC. It seems likely, now that the QUIC standard is finished, that future versions of WebRTC will migrate, and switch to using QUIC, and gradually phase out the SCTP-based data channel. And QUIC learned, I think, from this experience, and is more flexible and more highly optimised than the SCTP, DTLS, UDP stack. In addition to the media transport and data, you need some form of signalling, and some sort of session description, to specify how to set up a video conferencing call. Video conferencing calls run peer-to-peer. The goal of a system like Zoom, or Skype, or any of these systems, is to set up peer-to-peer data flows, where possible, so that they can achieve the lowest possible latency. They need some sort of signalling protocol to do that.
They need some sort of protocol to convey the details of what transport connections are to be set up, to exchange the set of candidate IP addresses on which they can be reached, to set up the peer-to-peer connection. They need to specify the media formats they want to use. Is it just audio? Or is it audio and video? And which compression algorithms are to be used? And they want to specify the timing of the session, and the security parameters, and all the other parameters. A standardised way of doing that is to use a protocol called the Session Description Protocol. The example on the right of the slide is an SDP, a Session Description Protocol, description of a simple multimedia conference. The format of SDP is unpleasant. It's essentially a set of key-value pairs, where the keys are all single letters and the values are more complex, one key-value pair per line, with the key and the value separated by an equals sign. And, as we see in the example, it starts with a version number, v=0. There's an originator line, and it was originated by Jane Doe, who had IP address 10.47.16.5. It's a seminar about the Session Description Protocol. It's got the email address of Jane Doe, who set up the call, it's got their IP address, the times the session is active, it's receive only, it's broadcast so that the listener just receives the data, it's sending audio and video media, and it specifies the ports and some details of the video compression scheme, and so on. The details of the format aren't particularly important. It's clear that it's conveying what the session is about, the IP addresses, the times, the details of the audio compression, the details of the video compression, the port numbers to use, and so on. And how this is encoded isn't really important. In order to set up an interactive call, you need some sort of a negotiation. You need some sort of offer to communicate, which says: this is the set of video compression schemes, this is the set of audio compression schemes, that the sender supports. This is who is trying to call you. This is the IP address that they're calling you from. This is the public key for negotiating the security parameters. And so on. And that comprises an offer. And the offer gets sent via a signalling channel, via some out of band signalling server, to the responder, to the person you're trying to call. The responder generates an answer, which looks at the set of codecs the offer specified, and picks the subset it also understands. It provides the IP addresses it can be reached at, it provides its public keys, confirms its willingness to communicate, and so on. And the answer flows back to the original sender, the initiator of the call. And this allows the offering party and the answering party, the initiator and responder, to exchange the details they need to establish the call. The offer contains all the IP address candidates that can be used with the ICE algorithm to probe the NAT bindings. The answer coming back contains the candidates for the receiver, and that allows them to do the STUN exchange, the STUN packets, to run the ICE algorithm that actually sets up the peer-to-peer connection. And it's also got the details of the compression algorithms, the video codec, the audio formats, the security parameters, and so on. Unfortunately SDP, which we have ended up using as the negotiation format, really wasn't designed to do this.
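To give a flavour of the syntax, a session description along the lines of the one on the slide, which follows the well-known example from the SDP specification, looks roughly like this (the session identifiers, timestamps, addresses, and port numbers are illustrative):

```
v=0
o=jdoe 2890844526 2890842807 IN IP4 10.47.16.5
s=SDP Seminar
i=A Seminar on the session description protocol
e=j.doe@example.com (Jane Doe)
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
a=recvonly
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 99
a=rtpmap:99 h263-1998/90000
```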
SDP was originally designed as a one-way announcement format to describe video on demand sessions, rather than as a format for negotiating parameters. So the syntax is pretty unpleasant, and the semantics are pretty unpleasant, and it's somewhat complex to use in practice. And this complexity wasn't really visible when we started developing these systems, these tools, but unfortunately it turned out that SDP wasn't a great format here, and it's now too entrenched for alternatives to take off. So we're left with this quite unpleasant, not particularly well-designed, format. But we use it, and we negotiate the parameters. Exactly how this is used depends on the system you're using. There are two widely used models. One is a system known as the Session Initiation Protocol. And the Session Initiation Protocol, SIP, is very widely used for telephony, and it's widely used for stand-alone video conferencing systems. If you make a phone call using a mobile phone, this is how the phone locates the person you wish to call, and sets up the call: using SIP. And SIP relies on a set of conferencing servers, one representing the person making the call, and one representing the person being called. And the two devices, typically mobile phones or telephones these days, have a direct connection to those servers, which they maintain at all times. On the sending side, when you try to make a call, there's a set of STUN packets exchanged, and a set of signalling messages exchanged, that allow the initiator to find its public NAT bindings. And then the message goes out to the server, which locates the server for the person being called, and passes the message on over the connection to their server, and it eventually reaches the responder. And that gives the responder the candidate addresses, and all the connection details, and the codec parameters, and so on, needed for it to decide whether it wishes to accept the call, and to start setting up the NAT bindings. And it responds, and the message goes back through the multiple servers to the initiator, and that completes the offer-answer exchange. At that point, they can start running the ICE algorithm, discovering the NAT bindings. And they've already agreed the parameters at this point: which codecs they're using, which public keys they're using, and so on. And that lets them set up a peer-to-peer connection, using the ICE algorithm and using STUN, over which the media data can flow. And it's an indirect connection setup. The signalling flows from the initiator, to their server, to the responder's server, to the responder, and then back via the server path. And that indirect signalling setup allows the direct peer-to-peer connection to be created. In more modern systems, systems using the WebRTC browser-based approach, the trapezoid that we have in the SIP world, with the servers representing each of the two parties, tends to get collapsed into a single server representing the conferencing service. And the server, in this case, is something such as the Zoom servers, or the Webex servers, or the Microsoft Teams servers. And it's essentially following the same pattern. It's just that there's now a single conferencing server that handles the call, rather than it being cross-provider, with a server for each party. And this is how web-based conferencing systems such as Zoom, and Webex, and Teams, and the like, work.
You get your Javascript application, your web-based application, sitting on top. This talks to the WebRTC API in the browser, and that provides access to the session descriptions, which you can exchange with the server over HTTP GET and POST requests to figure out the details of how the communication should be set up. And, once you've done that, you can fire off the data channel, and the media transport, and establish the peer-to-peer connections. So the initial signalling is exchanged via HTTP with the web server that controls the call. The offer-answer exchange in SDP is exchanged with the server, and that exchanges it with the responder, and then, when all the parties agree to communicate, the server sends back the session description containing the details which the browsers need to set up the call. And they then establish a peer-to-peer connection. And the goal is to integrate the video conferencing features into the browsers, and allow the server to control the call setup. And, as we've seen over the course of, I guess, the last year or so, it actually works reasonably well. These video conferencing applications work reasonably well in practice. So what's happening with interactive applications? Where are things going? I think there are two ways these types of applications are evolving. One is supporting better quality, and supporting new types of media. Obviously, over time, the audio and the video quality, and the frame rate, and the resolution, have gradually been increasing, and I expect that will continue for a while. There are also people talking about running various types of augmented reality, virtual reality, holographic 3D conferencing, and tactile conferencing, where you transmit a sense of touch over the network. And some of these have perhaps stricter requirements on latency, and stricter requirements on quality, but, as far as I can tell, they all fit within the basic framework we've described. They can all be transmitted over UDP using either RTP, or the data channel, or something very like it. And they all fit within the same basic framework of adding a little bit of buffering to reconstruct the timing, and graceful degradation for the media transport. Currently, we have a mix of RTP for the audio and video data, and the SCTP-based data channel. It's pretty clear, I think, that the data channel is going to transition to using QUIC relatively soon. And there's a fair amount of active research, and standardisation, and discussion, about whether it makes sense to also move the audio and video data to run over QUIC. And people are building unreliable datagram extensions to QUIC to support this, so I think it's reasonably likely that we'll end up running both the audio and the video and the data channel over peer-to-peer QUIC connections, although the details of how that will work are still being discussed. And that's what I would say about interactive applications. In the next part I will move on to talk about video on demand, and streaming applications.
Part 4: Streaming Video
Abstract
The final part of the lecture discusses streaming video. It talks about HTTP Adaptive Streaming and MPEG DASH, content delivery networks, and some reasons why streaming media is delivered over HTTP. The operation of HTTP adaptive streaming protocols is discussed, and their strengths and limitations are highlighted.
In this last part of the lecture, I want to talk about streaming video and HTTP adaptive streaming. So how do streaming video applications, such as Netflix, the iPlayer, and YouTube, actually work? Well, what you might expect them to do is use RTP, the same as the video conferencing applications, to stream the video over the network in a low-latency and loss-tolerant way. And, indeed, this is how streaming video, streaming audio, applications used to work. Back in the late 1990s, the most popular application in this space was RealAudio, and later RealPlayer when it incorporated video support. This did exactly what you would expect. It streamed the audio and the video over RTP, and had a separate control protocol, the Real Time Streaming Protocol, to control the playback. These days, though, most applications actually deliver the video over HTTPS instead. And as a result, they have significantly worse performance. They have significantly higher latency, and significantly higher startup latency. The reason they do this, though, is that by streaming video over HTTPS, they can integrate better with content distribution networks. So what is a content distribution network? A content distribution network is a service that provides a global set of web caches, and proxies, that you can use to distribute your application, that you can use to distribute the web data, the web content, that comprises your application or your website. They're run by companies such as Akamai, and CloudFlare, and Fastly. And these companies run massive global sets of web proxies, web caches. And they take over the delivery of particular sets of content from websites. As a website operator, you give the files, the images, the videos, that you wish to be hosted on the CDN to the CDN operator. And they ensure that they're cached throughout the network, at locations close to where your customers are. And each of those files, or images, or videos, is given a unique URL. And the CDN manages the DNS resolution for that URL, so that when you look up the name, it returns you an IP address that corresponds to a proxy, or a cache, which is located physically near to you. And that server has the data on it, such that the response comes quickly, and such that the load is balanced around these servers, around the world. And these CDNs, these content distribution networks, are extremely effective at delivering and caching HTTP content. They support some extremely high volume applications: game delivery services such as Steam, applications like the Apple software update, or the Windows software update, and massively popular websites. And they have global deployments, and they have agreements with the overwhelming majority of ISPs to host these caches, these proxy servers, at the edge of the network. So, no matter where you are in the network, you're very near to a content distribution node. A limitation of CDNs, though, is that they only work with HTTP-based content. They're for delivering web content. And the entire infrastructure is based around delivering web content over HTTP, or more typically these days HTTPS. They don't support RTP based streaming. The way streaming video is delivered, these days, is to make use of content distribution networks. It's delivered using HTTPS from a CDN node. The content of a video, in a system such as Netflix, is encoded as multiple chunks, where each chunk comprises, typically, around 10 seconds worth of the video data.
Each of the chunks is designed to be independently decodable, and each is made available in many different versions, at many different quality rates, many different bandwidths. A manifest file provides an index of what chunks are available. It's an index which says, for the first 10 seconds of the movie there are these six different versions available, and this is the size of each one, and the quality level for each one, and this is a URL where it can be retrieved from. And the same for the next 10 seconds, and the next 10 seconds, and so on. And the way the video streaming works is that the client fetches the manifest, looks at the set of chunks, and starts downloading the chunks in turn. And it uses standard HTTPS downloads to download each of the chunks. But, as it's doing so, it monitors how quickly it's successfully downloading. And, based on that, it chooses what encoding rate to fetch next. So, it starts out by fetching a relatively low rate chunk, and measures how quickly it downloads. Maybe it's fetching a chunk that's encoded at 500 kilobits per second, and it measures how fast it actually downloads. And it sees if it's actually managing to download the 500 kilobits per second video faster, or slower, than 500 kilobits. If it's downloading slower than real-time, it will pick a lower quality, a smaller chunk, the next time. And, if it's downloading faster than real-time, then it will try and pick a higher quality, a higher rate, chunk the next time. So it can adapt the rate at which it downloads the video by picking a different quality setting for each of the chunks, each of the pieces of the video. And as it downloads the chunks, it plays each one out, in turn, while it's downloading the next chunk. And each of the chunks of video is typically five or ten seconds, or thereabouts, worth of video content. And each one is compressed multiple different times, and it's available at multiple different rates, and multiple different sizes, for example. And the chart on the right gives an example of how Netflix recommends videos are encoded, starting at a rate of 235 kilobits per second, for a 320x240 very low resolution video, and moving up to 5800 kilobits per second, 5.8 megabits per second, for a full HD quality video. You can see that each 10 second piece of content is available at 10 different quality levels, 10 different sizes. And the receiver fetches the manifest to start off with, which gives it the index of all of the different chunks, and all of the different sizes, and which URL each one is available at. And, as it fetches each chunk, it tries to retrieve the URL for that chunk, which involves a DNS request, which involves the CDN redirecting it to a local cache. And from that local cache, as it downloads that chunk of video, it measures the download rate. If the download rate is slower than the encoding rate, it switches to a lower rate for the next chunk. If the download rate is faster than the encoding rate, it can consider switching up to a higher quality, a higher rate, for the next chunk. It chooses the encoding rate to fetch based on the TCP download rate. And what we see is that we've got two levels of adaptation going on. On one level, we've got the dynamic adaptive streaming, the DASH clients, fetching the content over HTTP. They're fetching ten seconds worth of video at a time, measuring the total time it takes to download that ten seconds worth of video.
And they're dividing the number of bytes in each chunk by the time taken, and that gives them an average download rate for the chunk. They're also doing this, though, over a TCP connection. And, as we saw in some of the previous lectures, TCP adapts its congestion window every round-trip time. And it's following a Reno or a Cubic algorithm, and it's following the AIMD approach. And, as you see at the top of the slide, the sending rate's bouncing around following the sawtooth pattern, and following the slow start and the congestion avoidance phases of TCP. So we've got quite a lot of variation, on very short time scales, as TCP does its thing. And then that averages out, to give an overall download rate for the chunk. And, depending on the overall download rate that TCP manages to get, averaged over the ten seconds worth of video in the chunk, the client selects the size of the next chunk to download. The idea is that each chunk can be downloaded at least at real-time speed, and ideally a bit faster than real-time, so the download gets ahead of the playback. And, when you start watching a movie on Netflix, or watching a program on the iPlayer, for example, you often see that it starts out at relatively poor quality, for the first few seconds, and then the quality jumps up after 10 or 20 seconds or so. And what's happening here is that the receiver's picking a conservative download rate for the initial chunk: it's picking one of the relatively low quality, relatively small, chunks, downloading that first, and measuring how long it takes. And, typically, that's a conservative choice, and it realises that the chunks are actually downloading faster than real-time, so it switches up the quality level fairly quickly. And, after the first 10, 20, seconds, after a couple of chunks have gone by, the quality level has picked up. A consequence of all of this is that it takes quite a long time for streaming video to get started. It's quite common that when you start playing a movie on Netflix, or a program on the iPlayer, it takes a few seconds before it gets going. And the reason for this is some combination of the chunk duration, and the playout buffering, and the encoding delays if the video's being encoded live. Fetching chunks, which are typically 10 seconds long, you need to have one chunk being played out at any one time. You need to have 10 seconds worth of video buffered up in the receiver, so you can be playing that chunk out while you're fetching the next one. So you've got one chunk being played out, and one being fetched, so you've immediately got two chunks worth of buffering. So that's 20 seconds worth of buffering. Plus the time it takes to fetch the data over the network. Plus, if it's being encoded live, the time it takes to encode the chunk: the encoder needs to pull in an entire chunk before it can encode it, so you've got at least another chunk's worth of delay, which would be another 10 seconds. So you get a significant amount of latency because of the ten second chunk duration. You also need enough chunks of video buffered up such that, if the TCP download rate changes, if it turns out that the available capacity changes and a chunk downloads much slower than you would expect, you don't run out of video to play. You want enough video buffered up that, if something takes a long time, you have time to drop down to a lower rate for the next chunk, and keep the video coming, even at a reduced level, without it stalling. Without you running out of video to play out.
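Pulling the last couple of paragraphs together, here is a minimal sketch, in C, of the chunk-by-chunk rate selection just described. The encoding ladder and the headroom factor are illustrative assumptions, not any particular player's algorithm; a real player would also take into account how much video it currently has buffered.

```c
/* Minimal sketch of rate adaptation for HTTP adaptive streaming: measure
 * the throughput achieved while downloading the last chunk, then pick the
 * highest encoding rate that sits comfortably below it. The ladder of
 * rates and the 0.8 headroom factor are illustrative choices. */
#include <stddef.h>

static const int ladder_kbps[] = { 235, 750, 1750, 3000, 5800 };
static const int num_rates = sizeof(ladder_kbps) / sizeof(ladder_kbps[0]);

/* chunk_bytes: size of the chunk just fetched; download_seconds: how long
 * the HTTP download took. Returns the index into the ladder to use when
 * fetching the next chunk. */
int choose_next_rate(size_t chunk_bytes, double download_seconds)
{
    double throughput_kbps = (chunk_bytes * 8.0 / 1000.0) / download_seconds;

    /* Leave some headroom, so that the short-term variation in the TCP
     * sending rate within a chunk doesn't cause the next download to run
     * slower than real time and drain the playout buffer. */
    double usable_kbps = 0.8 * throughput_kbps;

    int choice = 0;                 /* fall back to the lowest rate */
    for (int i = 0; i < num_rates; i++)
        if ((double)ladder_kbps[i] <= usable_kbps)
            choice = i;
    return choice;
}
```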
So you've got to download a complete chunk before you start playing it out. So you download and decompress a particular chunk, and while you're doing that you're playing the previous chunk, and everything stacks up; the latency stacks up. In addition to the fact that you're buffering up the different chunks of video, and you need to have a complete chunk being played while the next one is downloading, you get additional sources of latency from the network, from the way the data is transmitted over the network. As we saw when we spoke about TCP, the usual way TCP retransmits lost packets is following a triple duplicate ACK. What we see on the slide here is that, on the sending side, we have the user space, where the blocks of data, the chunks of video, are being written into a TCP connection. And these get buffered up in the kernel, in the operating system kernel on the sender side, and transmitted over the network. At some point later they arrive in the operating system kernel on the receiver side, and that generates the acknowledgments as those chunks, as the TCP packets carrying the chunks of video, are received. And, if a packet gets lost, it starts generating duplicate acknowledgments. And, eventually, after the triple duplicate acknowledgement, the packet will be retransmitted. And we see that this takes time. And if this is video, and the packets are being sent at a constant rate, we see that it takes time to send four packets, the lost packet plus the three following that generate the duplicate ACKs, before the sender notices that a packet loss has happened. Plus, it takes one round trip time for the acknowledgements to get back to the sender, and for it to retransmit the packet. So, if a packet has been lost, it takes four times the packet transmission time, plus one round-trip time, before the packet gets retransmitted and arrives at the receiver. And that adds some latency. You've got to add at least four packet times, plus one round-trip time, of extra latency to cope with a single retransmission. And, if the network's unreliable, such that more than one packet is likely to be lost, you need to add in more buffering time, add in additional latency, to allow the packets to arrive, such that they can be given to the receiver without disrupting the timing. So you need to add some latency to compensate for the retransmissions that TCP might be causing, so that you can keep receiving data smoothly while accounting for the retransmission times. In addition, there's some latency due to the size of the chunks of video. Each chunk has to be independently decodable, because you're changing the compression, potentially changing the compression level, at each chunk. So each one can't be based on the previous one. They all have to start from scratch at the beginning of each chunk, because you don't know what version came before. And, if you look at how video compression works, it's all based on prediction. You send initial frames, what are called I-frames, index frames, which give you a complete frame of video. And then the next few frames are predicted based on that. So, at the start of a scene, you'll send an index frame, and then, for the rest of the scene, each of the successive frames will just include the difference from the previous frame, from the previous index frame. And how often you send index frames affects the encoding rate, because the index frames are big.
They’re sending a complete frame of video, whereas the predicted frames in between are much smaller. The index frames are often, maybe, 20 times the size of the predicted frames. And, because each chunk of video has to start with an index frame, has to start with a complete frame, the shorter each chunk is, the fewer P-frames can be sent before the start of the next chunk and the next index frame. So you have this trade-off. You can make the chunks of video small, and that reduces the latency in the system, but it means you have more frequent index frames. And the more frequent index frames need more data, because the index frames are large compared to the predicted frames, so the encoding efficiency goes down, and the overheads go up. And this tends to enforce a lower bound of around two seconds, before the overhead of sending the frequent index frames gets to be excessive. So chunk sizes tend to be more than that, tend to be 5 or 10 seconds, just to keep the overheads down, to keep the video compression efficiency reasonable. And that’s the main source of latency in these applications. So, this clearly works. Applications like Netflix, like the iPlayer, clearly work. But they have relatively high latency. Because you’re fetching video chunk-by-chunk, and each chunk is five or ten seconds worth of video, you have a five or ten second wait when you start the video playing, before it actually starts playing. And it’s difficult to reduce that latency, because of the compression efficiency, because of the overheads. It would be desirable, though, to reduce that latency. It would be desirable for people who watch sport, because the latency for the streaming applications is higher than it is for broadcast TV, so, if you’re watching live sports, you tend to see the action 5, or 10, or 20 seconds behind broadcast TV, and that can be problematic. It’s also a problem for people trying to offer interactive applications, and augmented reality, where they’d like the latency to be low enough that you can interact with the content, and maybe dynamically change the viewpoint, or interact with parts of the video. So people are looking to build lower-latency streaming video. I think there’s two ways in which this is likely to happen. The first is that we might go back to using RTP. We might go back to using something like WebRTC to control the setup, and build streaming video using essentially the same platform we use for interactive video conferencing, but sending in one direction only. And this is possible today. The browsers support WebRTC, and there’s nothing that says you have to transmit as well as receive in a WebRTC session. So you could build an application that uses WebRTC to stream video to the browser. It would have much lower latency than the DASH-based, dynamic adaptive streaming over HTTP, approach that people use today. But it’s not clear that it would play well with the content distribution networks. It’s not clear that the CDNs would support RTP streaming. But if they did, if the CDNs could be persuaded to support RTP, this would be a good way of getting lower latency. I think what’s perhaps more likely, though, is that we will start to see the CDNs switching to support QUIC, because it gives better performance for web traffic in general, and then people will start to switch to delivering the streaming video over QUIC.
And, because QUIC is a user space stack, it’s easier to deploy interesting transport protocol innovations. They’re done by just deploying a new application: you don’t have to change the operating system kernel. If you want to change how TCP works, you have to change the operating system; whereas if you want to change the way QUIC works, you just have to change the application, or the library that’s providing QUIC. So I think it’s likely that we will see CDNs switching to use HTTP/3, HTTP over QUIC, and I think it’s likely that they’ll also switch to delivering video over QUIC. And I think that gives much more flexibility to change the way QUIC works, to optimise it to support low-latency video. And we’re already, I think, starting to see that happening. YouTube is already delivering video over QUIC. There are people talking about datagram extensions to QUIC in the IETF to get low latency, so I think we’re likely to see video switching to be delivered by the CDNs using QUIC, but with some QUIC extension to provide lower latency. So that’s all I want to say about real-time and interactive applications. The real-time applications have latency bounds. They may be strict latency bounds, 150 milliseconds for an interactive application or a video conference, or they may be quite relaxed latency bounds, tens of seconds for streaming video currently. The interactive applications run over WebRTC, which is the Real-time Transport Protocol, RTP, for the media transport, with a web-based signalling protocol put on top of it. Or they use older standards, such as SIP, which is the way mobile phones and the telephone network work these days, to set up the RTP flows. Streaming applications, because they want to fit with the content distribution network infrastructure, because the amount of video traffic is so great that they need the scaling advantages that come with content distribution networks, use an approach known as DASH, Dynamic Adaptive Streaming over HTTP. They deliver the video over HTTP as a series of chunks, with a manifest, and they let the browser choose which chunk sizes to fetch, and use that as a coarse-grained method of adaptation. And this is very scalable, and it makes very good use of the CDN infrastructure to scale out, but it’s relatively high latency, and relatively high overhead. And I think the interesting challenge, in the future, is combining these two approaches, to try and get the scaling benefits of content distribution networks, and the low-latency benefits of protocols like RTP, and to bring this into the video streaming world.
L7 Discussion
Summary
Lecture 7 discussed real-time and interactive applications. It reviewed the definition of real-time traffic, and the differing deadlines and latency requirements for streaming and interactive applications, the differences in elasticity of traffic demand in real-time and non-real-time applications, quality of service, and quality of experience.
Considering interactive conferencing applications, the lecture reviewed the structure of such applications and briefly described the standard Internet multimedia conferencing protocol stack. It outlined the features RTP provides for secure delivery of real-time media, and highlighted the importance of timing recovery, application-level framing, loss concealment, and forward error correction. It mentioned the WebRTC peer-to-peer data channel. And it discussed the need for signalling protocols to set up interactive calls, and briefly outlined how SIP and WebRTC use SDP to negotiate calls.
Considering streaming applications, the lecture highlighted the role of content distribution networks to explain why media is delivered over HTTP. It explained chunked media delivery and the Dynamic Adaptive Streaming over HTTP (DASH) standard for streaming video, showing how this adapts the sending rate and how it relates to TCP congestion control. The lecture also mentioned some sources of latency for DASH-style systems.
Discussion will focus on the essential differences between real-time and non-real-time applications, timing recovery, and media transport in both interactive and streaming applications.
Lecture 8
Abstract
Lecture 8 discusses naming in the Internet and the tussle for control over the names that can be used. It talks about what is the DNS, how DNS name resolution operates, and technical mechanisms for DNS name resolution. It also considers what names exist, how they are allocated, who controls their allocation, and some of the issues to consider when discussing who should control name allocation.
Part 1: DNS Name Resolution
Abstract
The first part of the lecture introduces the DNS and DNS name resolution. It describes the structure of the DNS as a distributed database containing records mapping names to IP addresses, along with other information. It reviews the structure of a DNS name. And it outlines the process by which names are resolved to IP addresses.
In this lecture, I’d like to talk about naming, and the tussle for control over the names used in the Internet. I’ll start by introducing what is the DNS, and how does DNS resolution work. Then, I’ll move on to talk about the structure and organisation of DNS names and the way names are assigned, the methods for DNS resolution, and some of the politics of how names are assigned in the Internet. The paper you see on the slide, the “Tussle in Cyberspace” paper, by David Clark, John Wroclawski, Karen Sollins, and Bob Braden, talks about some of these issues in more detail. It talks about some issues of control over the network, how protocol design influences the control that can be provided, how the protocols can evolve, and who can provide control over the protocols. And I’d encourage you to read it, as the DNS is one of those areas where we see this tussle most clearly, I think. So, to start with, in this part of the lecture, I’d like to talk about DNS name resolution. I’ll talk a little bit about what is the DNS, a bit about the structure of names, and how the name resolution works. So, to start with, what is the DNS? Well, as we see in the packet diagrams at the top of the slide, which show an IPv4 packet, on the left, and an IPv6 packet, IP packets contain addresses rather than names. When the network is delivering an IP packet it doesn’t use a domain name, it uses an IP address. And the IP addresses are designed for efficient processing by routing hardware. They’re not designed to be human readable. Now, we have been lucky enough, I think, that IPv4 addresses are at least approximately human readable, at least no less so than a phone number, for example, and so people have used them as human readable identifiers in some cases. But, as we move more and more towards IPv6, this is not really possible; the IPv6 addresses are not at all memorable. So, as users, we need a way of using more meaningful names for devices on the network, when we’re connecting to devices on the network, names that can be translated into the IP addresses which the network uses internally. And the Domain Name System, the DNS, provides such a naming scheme. The DNS is a distributed database. It runs on top of the Internet, and maps human readable names into IP addresses. If you’re going to a website, and the example here is my website and the teaching page where you can find the lecture materials for this course, you start with a URL, in this case https://csperkins.org/teaching/. And that comprises, at the start, the protocol used to access the site, HTTPS. It’s got a domain name, and it’s got the file part, which specifies which particular file, which particular directory on the site, to access. And you can extract the domain name from that, in this case www.csperkins.org. And that’s the name of the site. But, of course, that’s just the name, it’s not something which can be used in the packets. So the role of the DNS is to translate that domain name into a set of IP addresses which can be used to reach the server. So you’d feed that name into the DNS, and out would pop a set of IP addresses. And, in this case, for this particular site, there’d be an IPv4 address and an IPv6 address, as you see at the bottom of the slide. And for people and applications, we deal with the names. People don’t care about the IP addresses, they care about the names, and the applications should care about the names. And the Internet routing and forwarding should deal with the IP addresses.
And the very last step before establishing a connection should be to resolve the name to the addresses which can then be used to establish the connection. And everything else in the application should work on the names. We see that the DNS names are structured hierarchically. There’s a sequence of subdomains, a top-level domain, and the DNS root. We start with the subdomains, which describe the particular site, and the particular part within the site. And, in this case, the subdomains are www and csperkins. And obviously there are lots of these. We’re used to sites such as google.com or facebook.com, or, in the university, dcs.glasgow.ac.uk, where the “dcs”, “glasgow”, and “ac” are the subdomains. The subdomains all live within a top-level domain. The top-level domain can be either a country code top-level domain, such as “.uk”, “.de”, “.cn” for China, “.io” for the British Indian Ocean Territory, “.ly” for Libya, and so on. Or it can be a generic top-level domain, such as “.com”, “.org”, or “.net”. And the top-level domains live within the DNS root. The DNS root is, kind of, the invisible bit after the “.org” in this case. It’s the servers which identify and deliver the top-level domains. Someone has to control what is the set of possible top-level domains, and it’s the DNS root which defines this. And there’s a set of what are known as root servers, which advertise the top-level domains, and specify the top of the hierarchy. And, as I think should be clear after a little bit of thought, the DNS root can’t live in the DNS. This is the place where you start doing DNS resolution, so the root servers have to have well-known, fixed IP addresses, and be reachable by IP address, because they’re the thing you contact in order to start making use of the DNS. New DNS resolvers need to be able to reach them to find the top-level domains, before they can answer DNS queries. So the root has to work independently of the DNS. Each of the levels in the hierarchy is independently administered, and independently operated. The root server operators, and ICANN, operate the root zone, and we’ll talk about that in one of the later parts of the lecture. They delegate to the top-level domains, the top-level domains delegate down to the subdomains, and so on. And each level is, as I say, independently administered and independently operated. It’s a distributed database. It’s distributed both in implementation, in that the different parts of the namespace are all controlled and served by different servers, and in authority, with the authority over what goes in each subdomain being delegated down through the hierarchy. And each domain, each level in the hierarchy, controls its own data. The point of the DNS is to provide name resolution. Given a name, the goal of the DNS is to look up a particular type of record giving information about that name. In the usual case, what you’re looking up are what are known as A records or AAAA records. An A record is a mapping from a name to an IP address. It says this name, in this case “www.csperkins.org”, corresponds to this IPv4 address, or this set of IPv4 addresses. And AAAA records do the same, but for IPv6 addresses. What’s perhaps less well known is that there are several other types of record in the DNS. NS records, for example, can be used to give you the IP address of the name server for a domain. CNAME records provide canonical names; they provide aliases in the DNS.
MX records, mail exchanger records, let you look up the email server for a particular domain. And these got generalised into SRV records, which allow you to look up any other type of server within a domain. The process of resolution, the process of looking up a name, happens when a DNS client asks a DNS resolver to perform the look-up. And this is usually triggered by an application, when it calls the getaddrinfo() function. And we saw this in the examples in the labs, where the first thing that the client does, after getting the name to look up, is call getaddrinfo(), then loop through the results, trying to make connections to each one in turn. A DNS client is just a machine which runs the getaddrinfo() call, and knows how to talk to a resolver. A resolver is a process, an application, which can look up names. The resolver could be a process running on your local machine. More commonly, it’s a process that runs on a machine provided by your Internet service provider, by the network operator, and your client talks over the network to the resolver. And when you configure the machine to talk to the network, you specify the IP address of the DNS resolver for that network. And if your machine is using dynamic host configuration, with the DHCP protocol, the resolver IP address is one of the details it gets configured with. Usually this happens automatically. You connect your machine to the network, and the network configuration provides the IP address of the resolver your Internet service provider, your network operator, is operating. So when the client wishes to look up a name, what happens? Well, in this case we’re looking up the A record for my website, www.csperkins.org. And the client talks to the resolver, and says what is the A record for www.csperkins.org? And, if we assume that this is the first query this resolver has ever received, so it has no information about the rest of the network, what happens is it says, ‘I don’t know, first I need to find what is “.org”’. It needs to find the top-level domain, and then work down. So the resolver would talk to the root servers: it would send a query to the DNS root servers and say what is the name server record, the NS record, for “.org”? And that answer would come back from the root servers, and it will tell the local resolver the IP address of the name server which knows about “.org”. The resolver would then talk to that name server. It would send a query to “.org” to say what’s the name server record for “csperkins.org”? It’s working its way down the hierarchy. We’ve gone from the root servers, to “.org”, then it asks “.org” what’s the name server for “csperkins.org”. And then, once it gets that answer, it contacts that server. It contacts the server for “csperkins.org” and says what is the A record, the address, for “www.csperkins.org”? And the server, the DNS server for csperkins.org, responds. That gets to the local resolver, and now it has the information it needs, so it returns the answer to the client. And we see it’s quite an iterative process. The resolver talks to the DNS root servers to get the name server record for the top-level domain, in this case “.org”. It talks to the top-level domain to get the name server record for the subdomain, and so on. If there are multiple subdomains, it will keep working its way down through those domains until it reaches the end of the query, at which point it asks for the A record.
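As a reminder of what this looks like from the application side, here is a minimal sketch of the getaddrinfo() pattern used in the labs: resolve the name, then loop through the returned addresses, trying to connect to each in turn. Error reporting is abbreviated, and the host and port are just examples.

    /* A minimal sketch of name resolution with getaddrinfo(), as in the labs:
     * resolve the name, then try each returned address until one connects. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int connect_by_name(const char *host, const char *port) {
        struct addrinfo hints, *results, *ai;
        int fd = -1;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;     /* either IPv4 or IPv6 */
        hints.ai_socktype = SOCK_STREAM;   /* TCP */

        /* Resolve the name -- this is what triggers the DNS lookup */
        if (getaddrinfo(host, port, &hints, &results) != 0) {
            return -1;
        }

        /* Try each returned address in turn until one connects */
        for (ai = results; ai != NULL; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd == -1) {
                continue;
            }
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
                break;                     /* connected */
            }
            close(fd);
            fd = -1;
        }

        freeaddrinfo(results);
        return fd;                         /* -1 if every address failed */
    }

    int main(void) {
        int fd = connect_by_name("www.csperkins.org", "443");
        if (fd >= 0) {
            printf("connected\n");
            close(fd);
        } else {
            printf("connection failed\n");
        }
        return 0;
    }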
The various responses coming back from these servers, whether they’re coming back from the root servers, the top-level domain servers, or the subdomain servers for a particular site, all include a time-to-live. So, as well as the particular record and the IP address corresponding to that record, they also have a time-to-live value which says how long the resolver can cache that record. It’s a promise that it won’t change for a certain amount of time. And, in future, if you make the same query to the resolver again, provided it has one of these cached values, and provided that value hasn’t exceeded its time-to-live, it can just respond from the cache. And that saves all the look-up time, and makes the responses much quicker. When the entry times out, it gets refreshed, and the resolver asks the next level up in the hierarchy in case it’s changed. And, eventually, it would work its way back up to the root servers. The IP addresses for the root servers are well-known. They essentially have an infinite time-to-live, and haven’t changed in the last 30 years or so. What value do you give to the time-to-live if you’re configuring a domain? I think it very much depends on what you’re doing. A site which is just hosted on a single server, and doesn’t receive a heavy load, such as my website, will probably give quite a long time-to-live. A day, a couple of days, or a week, perhaps, because it’s just not going to change. It’s always on the same server. A big site, where there are possibly many hundreds of servers hosting that site, will probably give a much shorter time-to-live, maybe on the order of a small number of seconds, and will probably give you a different answer every time you look up the domain, because it’s load balancing between the different servers. And you see this when you look up names for servers such as those for Google, Facebook, or Netflix, where every time you make a query, you get a different address, because it’s pointing you at a different one of the servers that serve that domain. And it has quite a short time-to-live, so you keep rotating around for load balancing purposes. Similarly, if you’re accessing a content distribution network, it’s likely that it will have a short time-to-live, so it can point you to one of the local caches, and so it can change which local cache, which local proxy, it redirects your query to, based on the load, and based on where you are as you move around. So you can play games with the time-to-live to affect the behaviour of the DNS. And that’s it for this part. The DNS names are hierarchical. They work their way up from the subdomains, which describe particular sites and sub-parts of a site, up to the top-level domains, and up to the root. It’s a distributed database, with distributed implementation, and distributed control, distributed authority. And the name resolution follows the structure of the names. It works its way down from the root, contacting the servers at each level in turn, until it gets the required answer. And it caches the results. In the next part, I’ll move on and talk more about the structure of the names.
Part 2: DNS Names
Abstract
The second part of the lecture discusses DNS names. It discusses who controls the set of DNS names that may exist, and the history of the ICANN. It talks about the four types of top-level domain: country code top-level domains (ccTLDs), generic top-level domains (gTLDs), the infrastructure top-level domain (.arpa), and special-use top-level domains. The process by which country code top-level domains are allocated is reviewed, and some historical quirks are highlighted. The recent expansion of generic top-level domains is discussed. And the uses of the infrastructure and special-use domains are highlighted. The lecture concludes by discussing internationalised DNS, the DNS root, and the geographic locations of the DNS root servers.
In this part of the lecture I’d like to talk about DNS names. I’ll talk about who controls the DNS, what top-level domains exist, and what process is followed to assign new top-level domains in the DNS. I’ll talk about internationalisation of the DNS. And I’ll talk about who operates the DNS root servers. So, as we saw in the previous part of the lecture, DNS names are assigned in a hierarchy. A DNS name comprises a sub-domain, which is delegated from potentially other sub-domains, which are delegated from a top-level domain, which is delegated from the root. If we consider my website, for example, we see the domain name, “www.csperkins.org”, on the slide, and “www” and “csperkins” are subdomains within the top-level domain, “.org”. And the “.org” top-level domain exists within the DNS root. And this hierarchical structure naturally leads to a bunch of questions. You might ask what top-level domains exist? For each given top-level domain, you might ask what policy that top-level domain has for deciding when to allocate subdomains? What’s a valid name within “.org”, for example, what’s a valid name within “.com”? You might ask who decides when to add new top-level domains? And what’s the set of valid top-level domains in the network? And you might ask about the DNS root: who controls the root? What does it do? How does it operate? Well, thinking first about top-level domains. The set of top-level domains in the Internet is controlled and defined by an organisation known as the Internet Corporation for Assigned Names and Numbers, ICANN. ICANN has a long and somewhat complex history. The original project which led to the development of the Internet was something known as the ARPANET. The ARPANET was a US government funded research project, that ran from the late 1960s up until about 1990. The ARPANET project developed the initial versions of the Internet protocols we use today. It developed the initial versions of TCP/IP, for example, and some of the early application protocols. As part of the development of the ARPANET, it was found that the researchers working to develop those protocols needed a set of protocol specifications, they needed to write down the descriptions of the protocols they were developing, and they needed a parameter registry. They needed a way of storing the addresses, and the various parameters that the protocols had. And one of the researchers involved in that effort, Jon Postel, volunteered to do this. He volunteered to act in the role of what became the RFC Editor, editing and distributing the protocol specifications, and in a role which became known as the IANA, the Internet Assigned Numbers Authority, to assign protocol parameters. And, initially, when he started doing this, he was a graduate student working at UCLA; later in his career, he did this while working at the University of Southern California’s Information Sciences Institute. And, while working at ISI, Postel handled domain name allocation. As the DNS came into being, as people started registering names, it seemed natural to register these names in the existing protocol parameter registry, which Postel was operating as essentially the IANA organisation. And he did this as a part-time, fairly informal, activity, as part of his ongoing research into the Internet and Internet-related protocols, primarily funded by the US Government. The IANA role gradually became more formalised.
By the late 1990s, IANA was becoming more structured, and there were people other than Postel working on it, because the Internet was starting to take off. And, in 1998, this led to the formation of ICANN, the Internet Corporation for Assigned Names and Numbers, as a dedicated organisation to manage domain names. ICANN was formed in September 1998, as a US not-for-profit corporation based in Los Angeles, actually based in the same building where Postel worked, and where ISI was based, in Marina del Rey in Los Angeles. And, unfortunately, Postel passed away in October 1998, just two or three weeks after ICANN was formed. He’s actually the only person, so far, to have an obituary published as an RFC. And you see the link on the slide there. And after he passed away, ICANN took over the management of the domain names. And it’s very much been run as a global multi-stakeholder forum. It’s trying to get input from as many people as possible, trying to take as many different views as possible on how the network should work, and what domain names should exist. Organisationally, it’s a not-for-profit corporation based in Los Angeles, so it’s a registered US charity, essentially. And, as of 2016, it’s no longer under contract to the US Government, and officially, the domain names are managed by ICANN. In addition to its complex history, ICANN has a fairly complex governance model. And, in part, this springs out of that history. The original domain names, and the original development of the network, as we saw, were sponsored by the US Government. And as the Internet became much more widespread, as it became more ubiquitously deployed and more global, people outside the US started to be uncomfortable with this, and were pushing for ICANN to divest from the US Government. And, as a result of that, it has a governance model which takes input from a large number of different organisations around the world, to try and make sure that the needs of the different stakeholders are balanced, and it’s not controlled by any one company. So ICANN is controlled by a Board of Governors. They take input from a number of organisations, including a generic names supporting organisation, which represents generic top-level domains such as “.org”, “.com”, and so on. A country code names supporting organisation, which represents the country code domains such as “.uk” or “.de”, for example. It takes input from an address supporting organisation, which represents the regional Internet registries, such as RIPE, APNIC, and ARIN, which assign IP addresses to ISPs and other organisations. It takes input from a Governmental Advisory Committee, and the Governmental Advisory Committee is formed of representatives from each of the, I think, 112 UN recognised countries. It takes input from an at-large advisory committee, a root server operators advisory committee, a stability and security committee, and a technical liaison group. And, in addition to this, it holds regular public meetings, three or four times a year, circling around the globe, to get input from interested parties. ICANN has evolved into a massive organisation. It’s got an annual budget of somewhere on the order of 140 million US dollars. It takes input from an enormous range of people, including representatives from the different countries in the United Nations, and it’s an incredibly political organisation.
Many, many countries and organisations want to influence the way domain names are managed, the way domain names are allocated, and what sort of domain names exist. This is no longer a simple, part-time project by an academic at a university in California, it’s a global mega-corporation. That said, it seems to work. The DNS seems to be stable, and while some of the names that ICANN has allocated are certainly controversial, the process is, I think, broadly working. So what names exist? Well, there are four types of top-level domain in the Internet. There are country code top-level domains, generic top-level domains, the infrastructure top-level domain, and special-use top-level domains. The country code top-level domains are those which identify the portions of the namespace assigned to different countries. And the way this is done is that ICANN has essentially delegated the problem of deciding who is a country to the International Organization for Standardization, ISO. ISO has a standard, ISO 3166-1, which defines the set of allowable country name abbreviations. And these are reasonably widely used. They form the top-level domains in the Internet, but they’re also things like the stickers which go on cars, the GB sticker on your car if you drive abroad, for example. And ISO 3166-1 defines country code abbreviations for Member States of the United Nations, for the UN special agencies such as the International Monetary Fund, UNESCO, the World Health Organisation, and so on. And it defines abbreviations for parties to the International Court of Justice. And essentially what’s happened here is that ICANN has delegated the decision of what top-level country code domains exist to ISO, and ISO has, essentially, delegated it to the United Nations. And this neatly sidesteps the argument of what is a country. In that if you’re a Member State of the UN, you’re a country, and ISO assigns you a country code, and that then gets reflected into the Internet. And every country code defined in ISO 3166-1 is added into the DNS root zone. And that gives you the country code top-level domains we’re all familiar with, such as “.uk” for the United Kingdom, “.fr”, “.de”, “.cn” for China, “.us” for the United States, and so on. And these country code domains include some which, perhaps, are less familiar: “.ly”, for example, is Libya; “.io” is the British Indian Ocean Territory. And each country can then set its own policy for what it does with subdomains of that country code domain. And that can be delegated to the government of those countries. There are a number of exceptions, and a number of oddities, in the system. One historical curiosity, which is perhaps of interest in the UK, is to do with Czechoslovakia. The issue here is that, in the 1980s and very early 1990s, the UK ran a non-Internet-based academic research network, a system called JANET, the Joint Academic NETwork. And JANET ran a set of protocols known as the Coloured Book protocols, and they had an alternative name resolution system. Names for sites in this network used something which looked a lot like DNS names, but worked backwards. So they had the country code at the front, and then worked their way down towards the subdomain. So, for example, the University of Glasgow would be “uk.ac.glasgow” using the JANET name resolution system, versus “glasgow.ac.uk” using the DNS. And this worked fine.
Fundamentally it doesn’t matter which way around you write the names, so writing the names in the opposite order works just fine. And there was a gateway, which translated email messages between the machines on the UK Joint Academic Network and the machines on the rest of the Internet. And it did this just by rewriting the addresses, changing the order of the components of the domain name. And this worked just fine, until Czechoslovakia joined the Internet and was assigned the country code domain “.cs”. At this point, the gateway got confused, because it suddenly became difficult to tell whether “uk.ac.glasgow.cs” was the Computing Science department in Glasgow University, or a site in Czechoslovakia. You couldn’t look at the first, or last, part of the domain name, see whether it was one of the valid country code domains, and if it wasn’t then reverse it. And this problem got solved in two ways. Firstly, it got solved by all the Computing Science departments in UK universities suddenly renaming their domain names. This is the reason why Computing Science in Glasgow is “dcs.glasgow.ac.uk”, “Department of Computing Science”, rather than “cs.glasgow.ac.uk”. This is the reason why the Computer Science department in Cambridge is the Computer Lab, “cl.cam.ac.uk”. To avoid the conflict, to avoid using “cs” anywhere. And the problem also, of course, got solved because Czechoslovakia went away. There are also four oddities, four exceptions, where the top-level domains in the Internet don’t match what is in ISO 3166-1. The first is the United Kingdom. The country code abbreviation for the United Kingdom in ISO 3166-1 is GB. So if it followed the prescribed form, we should be using “.gb” rather than “.uk”. And indeed, what is now “.gov.uk” used to be registered under “.hmg.gb”, Her Majesty’s Government, GB. But this was never widely used, and the people who set up the initial Internet nodes in the UK, and this was primarily the fault of Peter Kirstein at University College London, decided they preferred to use “.uk”, and this kind of stuck. In addition to that, the country code top-level domain for the Soviet Union, “.su”, still exists in the Internet and, I believe, is still accepting new domain registrations, which is a little bit of an oddity. The European Union, “.eu”, has a country code top-level domain, but it’s not registered in ISO 3166-1. And, sadly, Australia changed and no longer uses “.oz”, but uses “.au” to match the standard. So that’s the country code top-level domains; what else exists? Well, there’s also a set of generic top-level domains. Originally this comprised a set of core domains that represented different types of use: “.com”, “.org”, and “.net”, originally for companies, non-profit organisations, and networks, but for many, many years now available for unrestricted use; “.edu” for higher educational organisations, primarily US-based; “.mil” for the US military; “.gov” for the US Government; and “.int” for international treaty organisations, such as the United Nations, Interpol, NATO, the Red Cross, and organisations like that. And, for a long time, those were the only generic top-level domains that existed, and there was a debate about whether organisations should register in one of these generic top-level domains, or whether they should register under their country code domain. More recently, ICANN has massively expanded the set of possible generic top-level domains.
Rather than the original seven, I think there are now about 1500 generic top-level domains registered. And these have a whole bunch of different uses. “.scot” is a generic top-level domain, for example, and there are others for many other cities and regions around the world. And it’s possible to get generic top-level domains for brands and other organisations, although this process is difficult and expensive. And the country code top-level domains, and the generic top-level domains, comprise the overwhelming majority of top-level domains in the Internet. There are a few more which you may come across occasionally. One of these is what’s known as the infrastructure top-level domain, “.arpa”. Now, obviously, that name, “.arpa”, stems from the original development of the network, from the ARPANET. And it’s mostly a historical relic, that was used in the transition from the ARPANET to the Internet. That top-level domain has one current use, which is reverse DNS. What we spoke about in the last part of the lecture was the forward DNS lookup, where you take a domain name, look that name up in the DNS, and it gives you the corresponding IP addresses. For example, if you look up my website, “csperkins.org”, it will give the IPv4 address 93.93.131.127. Reverse DNS lookup is the process of going the other way. It’s the process of going from an IP address to a domain name. And the way this is done is that the numeric, human-readable form of the IP address is reversed and stored as a domain name under the “.arpa” top-level domain. So, for example, the domain name 127.131.93.93.in-addr.arpa is registered. And, if you do a domain name lookup of that name in the DNS, it will return you a DNS CNAME record which points to csperkins.org. What you see is that this is the IP address of my site, reversed, and registered as a name, which allows you to go back from that address to the site name. And the same thing works for IPv6 addresses, where each four bits of the IP address are done as a separate subdomain. So the address you see here, 0.1.0.0.0… etc., ending in “ip6.arpa”, which is the reversed IPv6 address of that website, when resolved in the DNS, will give you the DNS CNAME record that points to the site. It’s a way of going back from either the IPv4 or IPv6 addresses to the original domain name. And that’s the only current use of “.arpa”. In addition to that, there are six special-use top-level domains. “.example”, which is used for examples, as you might expect, and used for documentation, is registered; and there are also versions of it registered under the generic top-level domains, so “example.com”, “example.org”, and so on, which exist and are guaranteed not to be used for anything other than documentation. “.invalid” is guaranteed never to be registered, a testing domain name which will never exist. “.test” is there for testing sites, for testing uses, as a domain name which does exist. “.local” and “.localhost” represent the local network and the local machine. And “.onion” is used as a gateway for Tor hidden services, and the RFC on the slide talks about that in more detail; this is The Onion Router, which is an anti-censorship technology. The original DNS, and all the DNS names we’ve spoken about, all use ASCII. The initial set of top-level domains, and the initial set of subdomains, were all registered in ASCII. And, of course, this is problematic if you don’t speak English, or if you speak a language which can’t be represented within the ASCII character set.
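Before moving on to internationalised names, here is a minimal sketch of the reverse lookup just described, using the standard getnameinfo() call, which issues the in-addr.arpa (or ip6.arpa) query on the application’s behalf. The address is the example from the slides; whether a name comes back depends on what the address owner has registered in the reverse DNS.

    /* A minimal sketch of a reverse DNS lookup using getnameinfo(), which
     * performs the in-addr.arpa / ip6.arpa query described above. */
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void) {
        struct sockaddr_in sa;
        char host[256];

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        inet_pton(AF_INET, "93.93.131.127", &sa.sin_addr);

        /* NI_NAMEREQD: fail if the address has no mapping back to a name
         * in the reverse DNS */
        if (getnameinfo((struct sockaddr *)&sa, sizeof(sa),
                        host, sizeof(host), NULL, 0, NI_NAMEREQD) == 0) {
            printf("93.93.131.127 resolves back to %s\n", host);
        } else {
            printf("no reverse DNS entry found\n");
        }
        return 0;
    }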
In principle, there’s nothing in the DNS protocol that should stop you being able to register names in UTF-8 format. DNS just deals with strings of bytes, and doesn’t really care what they are. In practice, a lot of the software which deals with DNS names assumes they are ASCII, and when people experimented with using UTF-8 names, to allow non-ASCII domain names, it was found that this didn’t work in practice. As a result, we have a somewhat complex approach to translating non-ASCII names into ASCII, which allows them to be used in the DNS. And it’s based on a system known as Punycode. Punycode is an encoding of Unicode, the global character set, into a sequence of ASCII letters, digits, and hyphens. So, for example, we see some examples of how München, the German city Munich, can be translated into Punycode. And we see that the characters which are not representable in ASCII get omitted from the initial name, and then there’s a hyphen and an encoded sequence at the end, after the hyphen. And that encoded sequence at the end is a base-36 encoded representation of the Unicode character which was omitted, and the location it was omitted from, and so where it should be re-inserted. This allows you to represent any name as a sequence of ASCII characters. And the internationalised DNS uses this, but it prefixes each of the names with the special prefix “xn--”, which was found not to exist in any of the registered, legitimate, top-level domains of the time, to allow resolvers to distinguish internationalised names from regular names, and know that they have to perform the translation. And this works. If you look up the example in Cyrillic at the bottom there, the browser, for example, will translate it into the string “xn--70ak…”, which then gets resolved as normal in the DNS. And this is Yandex, which is one of the popular Russian search engines. So the format the names have on the wire is this unfortunate encoded form which translates them into ASCII, but what gets displayed to the users is the native form in Unicode. So, ICANN decides the set of legal top-level domains. They can be country code domains, or they can be generic top-level domains, or special-use domains, or, these days, they can be internationalised names. ICANN then tells the root server operators that set of names, and the root servers then advertise the name servers for those top-level domains. Those name servers then advertise the names which exist within those top-level domains. What is the set of root servers? Where do the names come from? Well, there’s a set of 13 servers which advertise the name servers for the top-level domains. They’re registered in the DNS. They’re called “a.root-servers.net”, “b.root-servers.net”, through to “m.root-servers.net”. And they also have well-known IPv4 and IPv6 addresses because, as you should perhaps understand, the point of the root servers is to advertise the top-level domains, to provide the starting point for the DNS hierarchy, so they need to be reachable without using the DNS. So they’ve got well-known IPv4 and IPv6 addresses by which they are usually reached. And these 13 servers advertise the top-level domains, and they’re the key to the whole DNS. Why 13 of them? Well, we want to be able to ask a DNS server for the list of possible root servers. That means the list has to fit in a DNS message. And, as we’ll discuss in the next part, DNS for a long time only ran over UDP.
And there’s a size limit on replies sent over UDP. 13 is the maximum number of servers that will fit in a single UDP packet, and that’s why there are 13 root servers. Who operates these root servers? Well, the slide shows the current set. Each of the 13 servers is identified by a letter, and it has a well-known IPv4 address and a well-known IPv6 address. And on the right, we see the operators of these servers. Now, what you see, looking at this list of operators, is that they are very heavily US-based. Verisign, which operates “a.root-servers.net”, is a US-based domain name provider, for example. The University of Southern California, USC/ISI, is the organisation where Jon Postel worked, which still operates a root server. Cogent Communications is a US ISP. The University of Maryland and NASA are both research organisations in the US. The Internet Systems Consortium, again, is US-based… and there are a couple of US government sites. The only ones of these which are not in the US are RIPE NCC, which is the European Regional Internet Registry, and the WIDE project, which is in Japan. And that’s there for historical reasons. The root servers were set up at a time when the Internet was entirely US dominated. It’s not clear that that’s necessarily appropriate now, and we’ll talk much more about this later, but it’s there for historical reasons. The IP addresses of these root servers cannot be changed. They are hard-coded into, essentially, every DNS resolver in the world, and they’re far too widely known to be changed. Who operates the servers can change, but the IP addresses are pretty much fixed forever now. Now, there are 13 root servers, but there are not 13 physical machines. Almost all of the root server operators use a technique, known as anycast routing, which we’ll talk more about in Lecture 9. And the idea of anycast routing is that you have multiple machines that have the same IP address. They get advertised into the routing system from several different places in the network. And the routing system then ensures that traffic sent to that IP address goes to the closest machine that has that address. So, as a result, there are 13 IP addresses used to identify root servers, but there are actually many more than 13 physical servers. Most of the root servers actually have several hundred machines using the same address, in different data centres, and in different locations around the world. So it’s a very heavily load balanced, very heavily protected, system, even though it appears as only 13 IP addresses, only 13 machines. That’s all I want to say about DNS names. I’ve spoken briefly about who controls the DNS, about ICANN, and the history of ICANN. I’ve spoken about the types of top-level domains, the country code and the generic top-level domains, and the various special-use and infrastructure domains. And I’ve spoken briefly about the internationalised DNS, and the DNS root servers. In the next part, I’ll talk about how DNS queries are made.
Part 3: Methods for DNS Resolution
Abstract
The third part of the lecture discusses how resolvers can contact name servers to resolve DNS names. It reviews how DNS-over-UDP works, the contents of DNS requests and responses, and the inherent security problems of running DNS over UDP. It discusses record and transport security for DNS. Then it reviews alternative transports for DNS, considering DNS over TLS, HTTPS, and QUIC, and their relative costs and benefits.
In this part of the lecture I’d like to talk about methods for DNS resolution. I’ll talk a little bit about the security of the DNS, and some of the historic security problems with the DNS. And I’ll talk about how DNS resolution is performed today, using either UDP, TLS, HTTPS, or QUIC. So let’s start by talking about DNS security. The issue with the DNS is that, historically, it has been completely insecure. The original DNS protocol made requests, and delivered responses, using UDP. And it used UDP in a way which did not have any form of encryption or authentication. This meant it was trivial for attackers on the path between the host making the request, and the resolver which was answering that request, to eavesdrop on what names were being looked up. The requests are not encrypted, so anyone on the path, anyone who can read the network traffic, can see which hosts are looking up which names. In addition, because the messages and the replies are not authenticated in any way, such an on-path attacker can easily forge a response. If it responds faster than the intended DNS resolver, there’s no way for the requesting host to know that this is a forgery, rather than the correct response. There’s no way to authenticate the responses. And this makes it straightforward to redirect hosts in malicious ways by forging DNS responses. Now, obviously, this is a problem. And over the last few years we’ve seen a number of attempts at securing the DNS. These fall into two categories. Some of them relate to transport security, and some of them relate to record security. The issue with transport security is whether we can make it possible to deliver DNS requests, and receive replies, securely. Make it possible to send DNS requests over some sort of secure channel, and get the answer, the response, back over that same channel. And the idea here is that we use a protocol, such as TLS, for example, to deliver the DNS requests and retrieve the responses. And, since the requests and the responses are encrypted, they can’t be understood or modified by attackers. And that provides a form of security, provided you trust the resolver to give you the right answer. It provides a trusted, secure, encrypted, and authenticated channel between the host making the request and the resolver, which stops anyone reading the DNS messages in transit, and stops them forging replies. So, as long as the resolver is correctly answering the queries, this protects your use of the DNS. The other approach is what’s known as record security. Add some form of digital signature to the DNS responses, such that the client can verify the data it’s receiving is valid. And the idea here might be that ICANN attaches a digital signature to the root zone, which specifies the set of top-level domains. The root server operators sign the information they provide about the top-level domains. The top-level domains then sign the information they provide about subdomains. And so on. And there’s a chain of digital signatures that leads all the way back to ICANN, and the root, for every name that gets looked up. In this case, when you perform a DNS lookup, when you resolve a name and get an answer back, in addition to the record which says this is the name you looked up, and this is the corresponding IP address, you also get a digital signature which allows you to verify that it’s not been tampered with.
And the clients, at least in principle, can then verify the signatures, all the way back up the hierarchy to the root, and this provides a chain of trust that demonstrates ownership of the domain. And this is implemented. It makes extensive use of digital signatures and public key cryptography. And at least the top-level domains, and the root zone, are all signed. And a few of the more popular sites are starting to do this, and starting to sign their records, so the integrity of their data can be verified. But it’s not yet widely used. It’s starting to get use, but it’s not yet widely deployed. Ideally, we want both transport security and record security. Ideally, we want to secure the requests, so no one can see which requests we are making, and no one can modify the responses we’re getting back from the resolvers, and also use record security to verify that the resolvers are not lying to us. At present, we have the ability to provide transport security, and we’re starting to see record security being deployed. So, how does the transport actually work? Well, historically DNS has run over UDP. It’s run over UDP port 53. The idea of using UDP for DNS is that the requests and the responses are both small. So, in theory, you don’t need any sort of reliability. You don’t need any form of congestion control. The usual way this works in the DNS is that the client makes a query to the resolver, the resolver looks up the name, and replies. And the query is small. It’s just a name: “www.csperkins.org”, “google.com”, “facebook.com”, whatever it is. And the response is just an IP address. That doesn’t need much space. It doesn’t need lots of packets. So we can send the request, and get the response, each in a single packet. And get the answer in a single round-trip time, if the data is cached by the resolver. And this is more efficient than running it over TCP. If you look at the example, the packet diagram, on the left of the slide, you see the query and the response happen in one round-trip running over UDP. Whereas, if you look at the diagram on the right-hand side of the slide, you see that if you’re running this over TCP, you have the SYN, SYN-ACK, ACK handshake to set up the connection; the DNS query is sent immediately following that ACK; and you get the response one round-trip time later over the TCP connection. And then you’ve got the FIN, FIN-ACK, ACK handshake to tear down the TCP connection. And you end up sending six packets, over three round-trips, for the TCP connection: the initial handshake to set up the connection, the request and the response, and then a handshake to tear down the connection. And it’s sending far more packets than is needed. And there’s not really any benefit to using TCP. Once you’ve got the connection set up, if a packet gets lost, what happens? Well, TCP retransmits it. Okay, but we don’t need TCP to do that. We can just have a simple timeout, and retransmit the packet over UDP. There’s no need for complicated reliability measures, and there’s no need for congestion control, because the data being sent just fits in one packet. So it’s perfectly reasonable to have a timeout and retransmission. Triple duplicate ACKs won’t help, because there’s only ever one packet being sent. Congestion control won’t help: there’s only ever one packet being sent. So, as a result, DNS has historically run over UDP, and avoided the complexity and the overheads of running over TCP. So what’s in a DNS over UDP packet?
Well, the diagram shows an IPv4 packet, with a UDP header in it, and then the contents of the DNS message. The contents of the DNS message are a fixed header, to indicate that this is a DNS packet, a question section, an answer section, an authority section, and some additional information. When you’re making a request, the question section gets filled in. And this is the list of domain names being queried, and the requested record types. So the question section might say, for example, what is the AAAA record for the domain “csperkins.org”. And you can include more than one question in a request, provided they fit in the packet. The DNS response contains, in addition to the question section, which just echoes back the question being asked, the name being looked up, the answer section, and the authority and additional information sections. The answer section contains the answer. It contains the IP address corresponding to the name that was being looked up, and it contains a time-to-live to specify how long that’s valid. And the authority section describes where the answer came from. And this slide shows an example of how this works. This is captured using a tool called dig, which is a standard DNS lookup utility that exists on Linux and macOS. And what we see, highlighted in black here, is the question section, which shows that we’re looking up the A record for my website, “csperkins.org”. In blue, we see the contents of the answer section, where it specifies that the IP address of the site is 93.93.121.127, and it has a time-to-live of 2681 seconds. We see an authority section, which specifies that the response came from the name servers ns1.mythic-beasts.com and ns2.mythic-beasts.com, and these are the name servers that are hosting that domain. And we see the additional information section, in red, where it’s telling us the IP addresses of those name servers, so we can contact those servers if we want to find out additional information about the domain. And this is the typical structure you see in a DNS packet: a question in the requests, and the question, answer, authority, and additional information sections in the responses. And that’s DNS over UDP, which is the way DNS has historically been used. And as we mentioned earlier, DNS over UDP is insecure. The packets are not encrypted or authenticated in any way. And this means that devices on the path between the client and the resolver can see the DNS queries and the responses, and they can forge responses. One way of getting around that is to run DNS over TLS, rather than running it over UDP. And the way this works is that the DNS client opens a TCP connection to the resolver, rather than sending UDP packets. It makes a TCP connection to the resolver on port 853. The DNS client then negotiates a TLS 1.3 session within that TCP connection. And, once it’s done that, it sends the query and receives the response over that TLS connection, which is running over the TCP connection. Now, what’s in the request, and what’s in the response, is exactly the same as if it was sent over UDP. The contents of the request are formatted exactly the same way as the contents of the UDP packet would be. Except, instead of being sent in a UDP packet, they’re sent within a TLS record. And the response that comes back is exactly the same as the response that would be delivered over UDP.
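Since the message contents are identical whichever transport carries them, it is worth seeing how small a query actually is. The following is a minimal sketch, in C, of building the fixed header and question section described above and sending them over UDP to a resolver. The resolver address 8.8.8.8 is just an example, and response parsing and error handling are omitted.

    /* A minimal sketch of forming a DNS query (header plus question section,
     * as described above) and sending it over UDP to a resolver. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Encode the name as length-prefixed labels (3www9csperkins3org0) and
     * append the query type and class; returns the number of bytes written. */
    static size_t build_query(uint8_t *buf, const char *name, uint16_t qtype) {
        size_t pos = 0;

        /* Fixed 12-byte header */
        uint16_t id = 0x1234;                      /* transaction identifier   */
        buf[pos++] = id >> 8;  buf[pos++] = id & 0xff;
        buf[pos++] = 0x01;     buf[pos++] = 0x00;  /* flags: recursion desired */
        buf[pos++] = 0x00;     buf[pos++] = 0x01;  /* QDCOUNT = 1 question     */
        memset(buf + pos, 0, 6); pos += 6;         /* AN/NS/ARCOUNT = 0        */

        /* Question: name as a sequence of length-prefixed labels */
        const char *label = name;
        while (*label) {
            const char *dot = strchr(label, '.');
            size_t len = dot ? (size_t)(dot - label) : strlen(label);
            buf[pos++] = (uint8_t)len;
            memcpy(buf + pos, label, len);
            pos += len;
            label += len + (dot ? 1 : 0);
        }
        buf[pos++] = 0;                            /* root label ends the name */

        buf[pos++] = qtype >> 8; buf[pos++] = qtype & 0xff;  /* QTYPE       */
        buf[pos++] = 0x00;       buf[pos++] = 0x01;          /* QCLASS = IN */
        return pos;
    }

    int main(void) {
        uint8_t query[512], reply[512];
        size_t len = build_query(query, "www.csperkins.org", 1 /* A record */);

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in resolver = { .sin_family = AF_INET,
                                        .sin_port   = htons(53) };
        inet_pton(AF_INET, "8.8.8.8", &resolver.sin_addr);  /* example resolver */

        sendto(fd, query, len, 0, (struct sockaddr *)&resolver, sizeof(resolver));
        ssize_t n = recvfrom(fd, reply, sizeof(reply), 0, NULL, NULL);
        printf("received %zd byte DNS response\n", n);

        close(fd);
        return 0;
    }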
Again, the only difference is that it’s sent inside a TLS record, inside a TCP connection, rather than being sent inside a UDP packet. Now, this clearly provides security. You’re running over TLS, which encrypts and authenticates the connection, which lets you authenticate the identity of the resolver you’re connected to. It’s also, clearly, a lot higher overhead. You have to first negotiate a TCP connection. Then you have to negotiate a TLS connection. And then you can send the DNS request, and get the response. Then you tear down the TLS connection, and you tear down the TCP connection. So, what would be a single round-trip time, to send the request and get the response with DNS over UDP, turns into a round-trip time to set-up the TCP connection, followed by a round-trip time to negotiate TLS, followed by a round-trip time to make the DNS request and get the response, followed by a couple more round-trip times to tear down all the connections. It’s a lot higher overhead, and it runs noticeably slower, but it provides more security. DNS over TLS actually works reasonably well, and is moderately widely deployed. We’re also starting to see a couple of alternative methods of providing secure access to the DNS. One of these is DNS over HTTPS, often shortened to “DoH”. And DoH is a way of allowing a client to send queries to a DNS resolver using HTTPS, rather than using UDP or TLS. And the idea here is that you open an HTTPS connection to the resolver, and you then send the query over that connection, and you get the response back in return. There’s two ways in which the request can be formatted. It can be formatted as a GET request. In this case you send an HTTP GET request where the path part of the URL is “/dns-query?dns=” followed by the base-64 encoded version of the data you would have sent in the UDP packet, with an “Accept:” header to indicate that you expect a response of type “application/dns-message”. Alternatively, you use an HTTP POST request, again with the URL path of “/dns-query”, where the content type of the POST request is “application/dns-message”, and the body of the request is the DNS request. And, in both cases, the request being made is exactly the same request that would be sent in a UDP packet. If it’s a POST request, the contents that would go in the UDP packet just go straight into the body of the POST request. And, if it’s a GET request, they’re base-64 encoded and put in the GET line. But, again, it’s exactly the same content as if it was sent in a UDP packet. No matter whether it’s done using a GET or a POST, the response that comes back will be, assuming the name exists, an HTTP 200 OK response, where the header will say the content type is “application/dns-message”, and the body of the response will be the contents of the DNS message. And, again, it’s exactly the same data that would come back in a UDP-based DNS response. And the final way we’re seeing people starting to think about making DNS queries is to run them over QUIC. And the idea with making DNS queries over QUIC is that it can avoid some of the overheads, while still providing security. So the principle is the same as running DNS over TLS. The client opens a QUIC connection to the resolver and, as part of opening that connection, it negotiates TLS security. And then it sends the DNS request inside that connection, and gets the response back over the same connection. 
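To make the POST form of DoH described above concrete, here is a minimal sketch of how a client might issue such a query using libcurl. The resolver URL is an invented example, dns_msg holds exactly the bytes that would otherwise go in a UDP packet, and response handling and error checking are omitted.

    /* Sketch: the POST form of a DNS-over-HTTPS query, using libcurl. */
    #include <curl/curl.h>

    int doh_query(const unsigned char *dns_msg, long dns_msg_len) {
        CURL *curl = curl_easy_init();
        if (curl == NULL) return -1;

        /* The request and response bodies are raw DNS messages. */
        struct curl_slist *hdrs = NULL;
        hdrs = curl_slist_append(hdrs, "Content-Type: application/dns-message");
        hdrs = curl_slist_append(hdrs, "Accept: application/dns-message");

        curl_easy_setopt(curl, CURLOPT_URL, "https://dns.example.net/dns-query");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, dns_msg);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, dns_msg_len);

        CURLcode rc = curl_easy_perform(curl);   /* sends the query over HTTPS */

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        return (rc == CURLE_OK) ? 0 : -1;
    }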
And, again, they contain exactly the same data as they would if the queries and responses were sent in UDP. Unlike DNS over TLS, or DNS over HTTPS, DNS over QUIC is not yet standardised. The URL on the slide points that to the draft specification, but that’s still a work in progress. What we see, is that there are increasingly many ways of making DNS queries. There’s the traditional approach of sending the queries over UDP. You can also send them over TCP, over TLS, over HTTPS, or over QUIC. And, in all of these cases, the contents of the request, the contents of the query, and the contents of the response are identical. You’re sending the exact same DNS queries, the exact same DNS requests. You’re getting the exact same DNS responses back. All that’s changing is the transport protocol. All that’s changing is how the query is delivered to the resolver, and how the response is returned. It doesn’t change the contents of the messages at all. What it does, is change the security guarantees. If you’re using TLS, or HTTPS, or QUIC to deliver the DNS queries you’re guaranteed that nobody, none of the devices on the network between the client and the resolver, can see those queries. So you’re providing confidentiality. And you’re guaranteed that none of the devices on the network between the client and the resolver can forge responses. So it protects from eavesdropping on the messages, and it protects from people on the local network spoofing DNS responses and redirecting you to a malicious site. What it doesn’t do, is protect you if you don’t trust the resolver. We still need DNS security, we still need signed DNS responses, to allow you to check if the resolver is lying, but it at least makes the connection between the client and the resolver secure. And, certainly with the option of running DNS over HTTPS, it also gives the client the flexibility to query different resolvers, to make requests to whichever resolver it likes, using HTTPS. So that gives some flexibility to choose a resolver that it trusts for a particular domain. And that’s all I want to say about DNS resolution. As we’ve seen that there are some security challenges, both in providing transport security to prevent eavesdropping and prevent forged requests, and in terms of record security for authenticating the responses that come back. The traditional approach to DNS resolution, over UDP, doesn’t address any of those security challenges. But we’re increasingly seeing devices moving to using DNS over TLS, or over HTTPS, and I expect in future DNS over QUIC as well. And that provides transport security, it prevents people eavesdropping on the DNS requests, and it prevents people forging the responses. And, I hope, we will also see signed and authenticated DNS records getting broader use, in order to prevent malicious resolvers from spoofing responses.
Part 4: The Politics of Names
Abstract
The final part of the lecture discusses the politics of names. It talks about how DNS resolvers are selected, how the choice of DNS resolver can affect the set of names that are available, and the implications of allowing applications to choose their resolver on operator- and government-mandated name filtering. It discusses some of the intellectual property and jurisdictional implications of DNS. And it discusses some questions around control of the DNS, what domains should exist, and who should operate and control the DNS root, generic top-level domains, etc.
In this final part of the lecture, I want to talk about the politics of names. I’ll talk about the choice of DNS resolver, some issues around intellectual property rights and the DNS, about what domains should exist, who controls what domains exist, and who controls the DNS root. So let’s start by talking about the choice of the DNS resolver. How does a host know which DNS resolver to use? Well, when it connects to a network, a host uses something known as the Dynamic Host Configuration Protocol to discover the network settings and configuration options. DHCP provides the host with its IP address, tells it the IP address of the router, the network mask, and parameters such as that. And it also tells the host what DNS resolver to use on that network. And, usually, this would be a DNS resolver operated by the network operator, operated by the Internet service provider. If the host connects to multiple networks, if the host has multiple network interfaces, DHCP runs separately on each interface, and it may give a different DNS resolver for each interface. For example, if a device connects to both a 4G cellular network, and to a private company Ethernet, then it’s possible that the company Ethernet might make available names for internal services which don’t exist outside the company, and which are not visible on the 4G network. So applications on multi-homed hosts, on hosts with multiple network interfaces, should specify which network interface they’re resolving names on, by specifying a local IP address as one of the parameters, one of the hints parameters, in the getaddrinfo() call, to make sure the names are resolved on the correct interface, on the correct network. And, of course, it’s also possible to manually configure the host. And a common use of this might be, for example, to talk to Google’s public DNS resolver, on IP address 8.8.8.8, but there are several other public resolvers available. DNS resolution has typically been implemented as a system-wide service. DHCP configures the host, tells it the resolvers to use, and then all applications on the host access the same resolvers through the operating system interface. And this means you get a consistent mapping of names to addresses. No matter which application makes the query, it will always get the same answer, because it’s always talking to the same DNS resolver. The use of protocols such as DNS over HTTPS is starting to change this, though. When you have DoH, when you have DNS over HTTPS, it’s possible for applications to easily perform their own DNS queries. And, in particular, it’s possible for web applications, written in JavaScript, to perform DNS queries by making HTTPS requests to any website, any website that supports DoH. And this means that different applications, different websites, can have different views of what the network looks like; of what names exist, and what names map to what IP addresses. And, in principle, this was always possible. It was always possible for applications to override the choice of DNS, it was always possible for an application to bundle its own UDP-based DNS resolver. But it’s now much easier. And, because it’s easier, more applications are starting to do it. Is this a problem? Does it matter if we’re giving applications the ability to pick different DNS resolvers, to resolve names according to a resolver of their choice? In particular, given that we’re allowing applications to securely resolve names using a DNS server of their choice, why does that matter? 
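As a reminder of what that system-wide resolution path looks like in code, here is a minimal sketch of an application resolving a name through the operating system’s resolver, the one DHCP configured, rather than making DNS queries itself. The name and port are examples, and error handling is abbreviated.

    /* Sketch: name resolution through the operating system's configured resolver. */
    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void) {
        struct addrinfo hints, *res, *ai;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;     /* return both IPv4 and IPv6 results */
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo("csperkins.org", "443", &hints, &res) != 0) return 1;

        for (ai = res; ai != NULL; ai = ai->ai_next) {
            char buf[INET6_ADDRSTRLEN];
            void *addr = (ai->ai_family == AF_INET)
                ? (void *)&((struct sockaddr_in  *)ai->ai_addr)->sin_addr
                : (void *)&((struct sockaddr_in6 *)ai->ai_addr)->sin6_addr;
            inet_ntop(ai->ai_family, addr, buf, sizeof(buf));
            printf("%s\n", buf);               /* one line per address returned */
        }
        freeaddrinfo(res);
        return 0;
    }

An application using DoH bypasses this path entirely, which is what makes the policy questions below possible.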
Is it a problem that we’re giving flexibility? Is it a problem that we’re allowing applications to make their own DNS queries? Well, I think there’s pros and cons here. In some ways, it’s clearly beneficial. In some ways it’s clearly a good thing, and it’s not a concern that different applications can perform DNS queries in different ways. And you can easily make the argument that applications should have the ability to choose a DNS server they trust. To make sure that they avoid phishing attacks, to make sure they avoid malware, to make sure they avoid monitoring. I think you can easily make the argument that network operators should not be able to see the DNS queries, and should not be able to modify the responses. Resolvers run by network operators should not be able to see what queries applications are making, and allowing this is a privacy and security risk. And there’s a benefit in allowing applications to talk to a DNS resolver of their choice, and preventing the network operator from snooping on their traffic. I think these are all perfectly reasonable arguments; this makes a lot of sense. Equally, though, it’s possible to make the argument that it’s problematic for applications to have the ability to override the choice of DNS. Network operators will say that they can filter DNS responses to block access to sites which are providing malware, or which are being malicious, or which are fraudulent. And that allowing applications to override the choice of DNS, and talk to a server of their choice, allows them to bypass these security services. It allows them to bypass the filtering which is protecting them from malware and from fraudulent websites. And, in many countries, network operators are required by law to filter DNS responses, to enforce legal or societal constraints. For example, in the UK, the Internet service providers apply a DNS block list provided by the Internet Watch Foundation, which is there to prevent access to sites hosting child sexual abuse material. By allowing applications to make their own choice of DNS resolver, by allowing them to access resolvers other than the one provided by the Internet service provider, this allows the applications to opt out of such filtering, and to access such prohibited content. And, fundamentally, the problem is that both legitimate filtering, and malicious and harmful DNS filtering, use the same mechanisms. And the mechanisms to protect against phishing attacks, malware, and monitoring the DNS, also protect against, and prevent, the legitimate filtering of DNS requests. Can the network restrict the choice of DNS resolver? Can the network stop applications from choosing their own DNS, if they wish to do so? Well, for DNS-over-UDP or for DNS-over-TLS, this is certainly possible. If a network blocks outgoing traffic on UDP port 53, for example, in its firewall, this will effectively block DNS-over-UDP to any sites it chooses. Similarly for DNS-over-TLS: a network operator can block access to TCP port 853, and prevent outgoing traffic to that port, and that will stop DNS-over-TLS to any sites other than the ones it allows. It’s much harder, though, to block DNS-over-HTTPS. The problem here, for the network operators, is that since the traffic is encrypted, all it can see is an outgoing, encrypted, TCP connection to a web server. 
And it can’t tell whether the data being exchanged over that connection is regular HTTPS traffic comprising web pages, or DNS-over-HTTPS requests. Now, in some cases, it’s possible to make this distinction from the IP address. For example, Google runs a public DNS-over-HTTPS server on IP address 8.8.8.8. And you know, if you’re seeing HTTPS requests going out to this address, that this is DoH traffic, because Google doesn’t run any other websites on that address. But, if you have a web server that handles a mix of both regular web traffic, and DNS over HTTPS traffic, it’s not possible for an ISP to block one of these without blocking the other. And if this is a popular website, if Google decided to offer DoH services along with its regular web services, it would be very difficult for network operators to block the DNS over HTTPS traffic. And many of the Internet service providers, many network operators, many governments, are getting concerned that this use of DNS over HTTPS is making it harder to use DNS as a control point. Many organisations are used to using DNS to block access to certain types of traffic. And this is becoming much harder for them, as more and more traffic moves to DNS over HTTPS. And, of course, whether that’s a good or a bad thing depends on your politics, and it depends on what type of traffic is being blocked. But it’s certainly an issue, and it’s a change in the way the network operates. DNS, and DNS names, also tend to impinge on questions of intellectual property rights. And the issue here is that intellectual property laws tend to be managed on a national basis. For example, it’s entirely possible that a particular company might own a certain trademark in the UK, while a different company might own that trademark in the Republic of Ireland. And, in that case, it would be perfectly reasonable, and perfectly sensible, for the domain name “trademark.ie” to be owned by the company in the Republic of Ireland, and the domain name “trademark.co.uk” to be owned by the company in the UK. And which of those companies should own which of those domains is then a very straightforward legal question, and it’s handled by the courts in those countries. And, for country code top-level domains, this sort of question is straightforward. For the generic top-level domains, though, it gets a bit trickier. Which of those companies, for example, should own “trademark.com”? Each of the companies has the respective trademark in the jurisdiction where they’re based. Yet you have a generic domain, which is not tied to a particular country, to a particular jurisdiction, so which of those should have the rights over it? And, in particular, this may get hard because “.com” is currently operated by a US-based organisation, and a different organisation may own that trademark in the US. Country code top-level domains have the advantage of clearly operating under the legal regime of a particular country. It makes it easy to resolve legal questions about intellectual property, and about ownership of the domains. Generic top-level domains are much less clear. Is the right of ownership for a generic top-level domain based on where the domain operator is? Or based on where the person requesting the name is? And, if there are multiple people who want the name, in multiple different countries, and they’re not necessarily in the same country as the domain operator, it gets legally tricky to work out who has ownership and who has control. 
And this also ties in, to some extent, to the questions about which top-level domains, and which subdomains, should be allowed to exist. And if you think about top-level domains, what generic top-level domains should ICANN permit to exist? What’s the list of domains that should be allowed? And who gets to control that? And an example which has been long-running, and is contentious, is the top-level domain “.xxx”. And the question is whether this top-level domain should exist, in order to host adult content. And if it does exist, who gets to decide what content should sit within that domain, within that top-level domain? And what content must sit within that domain? And, different countries have very different norms, and very different standards, for what constitutes adult content, and for what type of filtering is, and isn’t, appropriate. And this is, obviously, a contentious example, but there are many other such examples. What top-level domains should exist, and who gets to decide? Because different parts of the world have very different norms for what’s acceptable or not. When it comes to particular subdomains, again, different regions, different countries, have significant differences in their laws and norms about freedom of speech, and about permissible topics for websites. And a country code top-level domain can clearly enforce the local conventions and rules for the country that it represents. If you have a “.co.uk” domain, for example, it’s pretty clear that it should enforce UK law. If you have a “.de” domain, it’s pretty clear that it should be enforcing German law. But what about generic top-level domains? What about “.com”, for example? If a site in a generic top-level domain is hosting content which is legal in some countries, but illegal in other countries, should that be permitted? If a particular country, or a particular group, finds the content of a site objectionable, should that site be taken down if it’s in a generic top-level domain? If country X, for example, decides that certain content is illegal and should be prohibited, but it’s legal in country Y, should a generic top-level domain operating out of country Y, but accessible in country X, permit such content? To make this concrete, Holocaust denial is illegal in Germany, but not in the US. Should “.com”, operating from the US, permit sites which host material which denies that the Holocaust happened? It’s legal where “.com” exists, but those sites are accessible from countries, from Germany, where this content is illegal. And who gets to enforce these decisions? Who gets to arbitrate between the sites? Should the generic top-level domains only be bound by the laws of the country which they operate from? Or do we need some sort of international norms, some international set of laws, about how globally accessible domains should operate, and what rules they should enforce? I think there are similar questions about the root servers. Currently, most of the DNS root servers are operated, or controlled, by US-based organisations. And they all currently host the same content. They all currently follow the set of top-level domains that ICANN defines. But there’s nothing technically requiring them to do so. The question is, is it a risk to other countries that most of the root servers are controlled by a single country? 
Should we be looking to broaden the mix of countries that operate, and that control, the root servers? And if we do, who gets to decide how this happens? Is this something where ICANN should be deciding, ICANN should be mandating, that the root servers move to be operated in different countries? Is this something that a particular national government should do, and declare that they will run a different root server for hosts in their country? Is this something where the United Nations should step in? And is there a benefit in controlling a DNS root server? Or is it just an administrative overhead that nobody actually wants? In theory, all the root servers return exactly the same content anyway, so why should you care if you control one? Unless, perhaps, you want a different view of the DNS, unless you want a different set of top-level domains in your country than in other parts of the world. Similarly, is there benefit in controlling a generic top-level domain server? Is there a benefit to a country in hosting “.com”, for example? And I don’t know the answer, but there are questions that should be asked, and there are interesting political questions that should be asked, about the control of the DNS root servers and the generic top-level domain servers. There’s also the question about whether there should be a single DNS root? Should all of the top-level domains be accessible from everywhere? Should the global view of the DNS be the same, no matter where you’re coming from? Should the same name always resolve to the same site? And, with content distribution networks which host sites at local proxies throughout the world, can you tell? And what sort of filtering of the DNS traffic should be permitted? And should different countries be allowed to do this, and are there any restrictions on what filtering should be permitted, and how it should be implemented? And, as we’ve seen with DNS-over-HTTPS, it’s currently very difficult to distinguish modifications made to DNS responses, in order to conform to government mandated filtering requirements, from those made by malware, and phishing attacks, and so on. And I guess the question here is, is this a feature of the DNS, or is this a bug? And what sort of filtering should be permitted? Should be possible? So that concludes the discussion of DNS. I’ve spoken about what is the DNS, how the queries are made, and in a reasonable amount of detail about what names exist, and who controls the set of names, and how and what sorts of filtering should happen. DNS is one of the more contentious parts of the Internet. It ties-in with notions of national sovereignty, with intellectual property laws, with societal norms about what sort of content should, or should not, be accessible. And it’s one of the interesting areas where the technology and the politics combine.
L8 Discussion
Summary
Lecture 8 discussed naming and the tussle for control. The first part of the lecture outlined what is the DNS, the structure of DNS names, the DNS server hierarchy, and the process by which name resolution works.
The second part of the lecture discussed DNS names. It outlined the history of ICANN and some issues of DNS governance. It described the process by which top-level domains are assigned, focussing mostly on country code top-level domains (ccTLDs) and generic top-level domains (gTLDs), but also mentioning the infrastructure top-level domain (.arpa) and reverse DNS, and the various special-use top-level domains. And it spoke about internationalised DNS and Punycode. Finally, it discussed the DNS root servers, their operators, and the use of anycast routing to work around the limitation on the number of root servers.
The third part of the lecture discussed DNS security and methods for DNS resolution. It highlighted that DNS has historically been insecure, and outlined the two complementary approaches to securing DNS: DNS transport security and DNS record security. Record security is provided by DNSSEC, with digital signatures delegating authority from ICANN to the root servers, and hence down to TLDs, sub-domains, etc. And transport security is provided by running DNS over TLS, HTTPS, or QUIC, rather than over UDP. The lecture also highlighted the structure of DNS queries and answers, and how that same structure is used irrespective of the transport.
Finally, the lecture discussed the politics of names. It spoke about the implications of allowing different applications to make DNS queries using different resolvers, and the potential to circumvent control points. It spoke about the complex relation between DNS and intellectual property laws, and about what domains should exist. And it spoke about the single DNS root, and the set of legal top-level domains.
Discussion will focus on the technical operation of the DNS, and on the politics of naming.
Lecture 9
Abstract
Lecture 9 discusses content distribution networks and Internet routing. It discusses what are CDNs and what role they play in the Internet, as a mechanism to spread load and reduce latency. The problem of inter-domain routing is then introduced, and the BGP routing protocol is reviewed as a mechanism for providing policy routing across the Internet. Some of the security limitations of BGP are highlighted, along with current approaches to try to address these. Finally, intra-domain routing, routing within a network, is briefly reviewed.
Part 1: Content Distribution Networks (CDNs)
Abstract
The lecture begins by discussing content distribution networks (CDNs). It outlines what are CDNs, and the role they play in the network to help distribute load and reduce latency. Two approaches to locating CDN nodes, using DNS and anycast routing, are briefly introduced.
In this lecture I want to talk about how routing works in the Internet. I’ll start, in this part, by talking briefly about content distribution networks, which we discussed a couple of lectures ago. And then, in the later parts, I’ll talk about inter-domain routing, how network operators cooperate to deliver data across the wide-area network. I’ll talk briefly about routing security. And I’ll talk about intra-domain routing, how routing works within an operator’s network. So, in this part, I’d like to start by talking about content distribution networks. I’ll talk about how CDNs help with load balancing and latency reduction, and how they’re implemented using either the DNS or using anycast routing. So what is a content distribution network? A content distribution network is a service which provides scalable, load-balanced, low-latency hosting for web content. CDN operators, companies such as Akamai, CloudFlare, and Fastly, are in the business of hosting content for their customers. Their customers give them web content, and this may be images, it may be software updates, it may be video; it doesn’t matter what it is, anything that can sit on the web. And the CDN hosts that content in web caches that are spread around the world. And some of these are located in data centres, some of these are located in edge networks operated by various ISPs. And the idea is to reduce the load on the main servers. Rather than keeping a local copy of the file, the customer gives it to the CDN and links to the copy hosted by the CDN. And this reduces the load on the customer, and puts the load onto the CDN. The idea is that the CDNs are big enough, and have enough caches in enough data centres and enough edge networks, that this spreads the load throughout the world, and prevents high-traffic sites from being overloaded. It reduces latency for the requests, because the CDN will have a cache near to the person making the request, and can spread the load around the world. And it reduces the chances of a successful denial of service attack, again just because of the sheer size of the CDN, and the sheer number of caches it has. And there are many commercial CDNs available, I think the big three of those are listed on the slide, but there are certainly many others. And many large organisations also run their own CDNs. In particular, the so-called hypergiants, big companies such as Google, Facebook, Netflix, Apple, and the like, all run their own large-scale content distribution networks. The goal of CDNs is to distribute load. They distribute load by caching content all around the world, and by answering most requests from a local cache. And, in order to do that, they need to have servers located everywhere. They need to be very large, and have very wide geographical distribution. This means they need large-scale investment, and large-scale cooperation with network operators, with ISPs, with Internet exchange points, with data centres. And they try to host caches as near to the customers as they can. And to give you an idea of the scale of this, the picture at the top, the top-right of the slide, comes from Netflix, and shows the reach of their caches, their CDN. And we see that they’re located, primarily, in North America and Europe, and to a lesser extent in South America, where the customers of Netflix are located. 
But we do see that there are massive numbers of servers in those regions, and also servers in Australia, New Zealand, Japan, Singapore, the Middle East, South Africa, and so on, to try and get some more geographic spread. And the statistics from Akamai boast that they have more than 240,000 servers in over 150 countries. So these are very large scale, very widely distributed, server networks. And they often get this benefit by hosting servers within edge ISPs’ networks. So an ISP, such as Virgin Media in the UK, would almost certainly be hosting CDN caches for Netflix, Akamai, and the various other big CDNs. And there’s a mutual benefit for an ISP to host such a cache; there’s a mutual benefit for the ISP to work with a CDN. And, clearly, from the CDN’s point of view, it increases the reach and the robustness of the CDN, if they can put caches in as many networks as possible. From the ISP’s point of view, though, it reduces the load on their network. The CDN can push one copy of a file into the cache, and it can then be distributed to the customers of that ISP. And it means that all of that load is then served from within the ISP’s network, without having to go over the expensive wide-area links to the rest of the Internet. And this avoids overloading the links from the ISP to the outside world. And the scale of some of the popular services means that this is necessary. Netflix, for example, talk about how they distribute tens of terabits of video per second, and this clearly isn’t possible from a single data centre. It has to be pushed out in a hierarchy, with the central data centre pushing data out to a CDN, which pushes it out to edge caches, which distribute to the customers. You can’t host all of this from a single data centre, from a single site, you have to spread the load around the world. The other benefit of CDNs is that they reduce latency. The goal is that the content is not only spread geographically for load balancing reasons, but it’s spread geographically so that there’s always a local copy near to the person requesting the data. That means if you’re based in Europe and you request content from a popular site, the request doesn’t have to go to the US where the site is based, but can be answered from a CDN cache located in Europe. And this reduces the latency for your requests, because it can be cached near to you. But, of course, it requires a global distribution of the proxy caches. And, I think, one of the questions here is how effectively CDNs are managing to serve the entire world. If we look at this picture, from Netflix, we see that if you’re in Europe or North America, there are certainly many CDN caches, and there will be one located very near to you. And if you’re located in certain parts of South America, if you’re located in the populous bit of Western Australia, if you’re located in Singapore, or in Japan, or somewhere like that, there’ll be a CDN near to you. If you’re in Africa, though, you’re perhaps less well served. If you’re in large parts of Asia, you’re less well served. And, increasingly, providing Internet access to developing regions of the world needs more than just providing connectivity. If you want to provide high-quality Internet access to parts of Africa, for example, or parts of Asia which don’t have it yet, you don’t just need to provide bandwidth, you don’t just need to provide network links, you need to provide data centres that can host CDN caches. 
So it’s increasing the investment needed to get good performance. And this works for cacheable static content. CDNs have historically been focused on video, and images, and software updates, and distributing large files, and they work incredibly well for that. But we’re also starting to see people talk about edge compute applications. And applications where there is some sort of computation going on near to the customer. And this tends to be for augmented reality games, and applications like that, where you need low latency to the compute server, the data centre. And again, CDNs are starting to host this sort of content, starting to allow compute to be pushed into the edges. And, again, this means that they don’t just need caching and data storage at the edges, but they need large scale computing infrastructure. Again, in developed parts of the world this is eminently achievable. In less well-developed parts of the world, this infrastructure isn’t yet there. So, how do the CDNs work? How do they find the nearest node in order to deliver the content? I mean actually delivering the content is easy, a CDN node is just a web server which has the files located on it, and it just delivers them using HTTP. The question is how you find the right CDN node, the one that has the file you’re looking for. There’s two ways they do it. Some of them use the DNS, and some of them use a technique known as anycast routing. For the CDNs that use the DNS, the goal is that they locate the nearest CDN node, and give you an answer for where that node is, by playing games with the DNS queries. So in this case, when a customer of the CDN gives a resource to the CDN to be hosted, the CDN gives that resource a unique domain name. For example, if the site example.com is trying to host an image of a kitten on a CDN, the CDN would give that a unique hostname. And in this case, it’s picked a hexadecimal name, 9BC1C…. etc. But the point is that every image, every resource, every file on the CDN, has a unique host name. Now, of course, these don’t always refer to real hosts. They’re all entries in the DNS which point to a particular server in the cache, but it gives flexibility. Because every file, every image, every piece of content, on the CDN has a different hostname, the CDN can return a different IP address for each image, each file, and it can point it at an appropriate replica. So, the way this works is that the CDN returns different answers to the DNS queries for the A or AAAA records for the names, depending on where they’re being requested from, and what CDN caches have that data. So it looks at the IP address of the resolver making the query. And the CDN, when it gets the name look-up for this host, redirects it by returning a different IP address that refers to a local cache. And if I look up this name from my home, I might get a particular CDN cache located in the ISP I have, and if you make the same look-up from your home, in a different ISP’s network, you’ll get a different IP address back for that name, pointing to a different cache that’s hosted by the CDN. And this is based on the IP address of the resolver, because all the CDN sees is the requests coming from the local resolvers. But DNS has an extension, the EDNS client subnet extension, so if the client is not in the same place as the resolver, the resolver can tell the CDN the subnet where the client came from. 
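As a deliberately simplified sketch of the decision the CDN’s authoritative DNS server is making, the logic below maps the querying resolver’s (or client subnet’s) address to a coarse region and answers with the address of a cache for that region. Real CDNs use detailed IP geolocation and load data; the address ranges and cache addresses here are invented purely for illustration.

    /* Sketch: choose a cache address based on the querying address range. */
    #include <stddef.h>
    #include <stdint.h>

    struct region_map { uint32_t prefix; uint32_t mask; const char *cache_ip; };

    /* Invented mapping from client/resolver address ranges to cache addresses. */
    static const struct region_map regions[] = {
        { 0xC0000200, 0xFFFFFF00, "192.0.2.10"    },  /* 192.0.2.0/24    -> "London" cache   */
        { 0xC6336400, 0xFFFFFF00, "198.51.100.20" },  /* 198.51.100.0/24 -> "New York" cache */
    };

    const char *select_cache(uint32_t client_addr) {
        for (size_t i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
            if ((client_addr & regions[i].mask) == regions[i].prefix)
                return regions[i].cache_ip;       /* answer with a nearby cache */
        }
        return "203.0.113.30";                    /* otherwise, a default cache */
    }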
And the CDN has to look-up the IP address where it sees the requests coming from, that of the resolver or that of the client with the client subnet extension, and try and guess where in the world it is. It needs to look-up the IP address, and have a mapping of IP addresses to locations. And this doesn’t need to be particularly accurate. The goal is to figure out if you’re in the UK, and direct you to the cache based in London, rather than the cache based in New York, for example. It doesn’t really matter if it realises that you’re in Glasgow, or Manchester, or wherever; the main thing is it knows that you’re in the UK, so you should go to a UK-based cache. And this gives the CDN very fine-grained control. It can put the time-to-live on its DNS responses down to a small number of seconds, so every time a client looks up an image, for every different image, every different resource the CDN is hosting, it can return a different answer. So it can very rapidly load balance among its different caches, amongst its different data centres. But it puts a high load on the DNS, it means there’s lots of DNS queries happening, and they can’t be cached for very long. The other approach CDNs use is known as anycast routing. And this doesn’t play games with DNS, it uses the DNS in a much more traditional way, in that the DNS names for the CDN are always the same; they always just refer to the CDN. And they always return the same answer. And what it does is, each resource the CDN is hosting, it gives it a different filename. And the DNS name always maps to the same IP address. Literally, it always maps to the same IP address. And in this example, the CDN has three data centres, all of which are using IP address 192.0.2.4. And the CDN has many data centres around the world, and they all use the same IP address ranges. And they advertise those IP address ranges into the routing system, into the BGP routing system we’ll talk about in the next part. And the Internet routing then ensures that the traffic goes to the closest data centre to the source. By advertising the same IP address into the routing from multiple places, the routing system makes sure that the traffic goes to the nearest data centre. And it’s an abuse of routing. It’s intentionally advertising the same IP address from multiple places, letting the routing take care of how the data gets there. Which approach do CDNs use? Probably a mix of both. Some of the large ones just use the DNS-based approach, some of them use a mix of both approaches, and both approaches work, and they have different trade-offs. So that’s what I want to say about CDNs. The goal of CDNs is to provide load balancing and to reduce latency, by allowing responses for web content to be redirected from the original sites to the content distribution network, which in turn hosts that content at numerous locations around the world which are, hopefully, close to the end users. And it can be implemented by playing tricks with the DNS where, depending on where you make the DNS lookup, you get a different answer back directing you to a local cache, or it can be implemented using anycast routing, where the caches all have the same IP address and the routing system takes you to the nearest replica. In the next part, I’ll talk about how the routing system, the BGP routing, works in the Internet.
Part 2: Inter-domain Routing
Abstract
The second part of the lecture introduces the inter-domain routing problem. It reviews the network-of-networks nature of the Internet, and the concept of Autonomous Systems (ASes), and introduces the AS graph and BGP as the basis for inter-domain routing. The differences between routing at the edges and in the core of the network are discussed, as is the role of the default-free zone and Internet Exchange Points. The operation of BGP as a path vector protocol, choosing the shortest policy-compliant path, is reviewed, including a discussion of routing policy, the Gao-Rexford rules, and the BGP decision process.
In this part I’d like to talk about routing in the Internet, in particular the idea of interdomain routing, routing between networks, between autonomous systems. I’ll talk about what is an autonomous system. I’ll talk about the AS graph, the graph of interconnections between networks that form the Internet. I’ll talk about how routing works at the edges, and in the core of the network. And I’ll talk about the Border Gateway Protocol, BGP, which enables routing in the Internet. The Internet is a network of networks. Fundamentally it’s built as a set of independently owned, independently operated, networks which talk with each other, and which collaborate to deliver data. Each of these networks is what’s known as an autonomous system. It operates independently. And each network is a separate routing domain. It makes its own decisions internally about how to route data around its own network. The problem of interdomain routing is the problem of finding the best path across this set of networks. It’s the problem of finding the best path from the source network to the destination network, treating the set of networks as a graph. So, it’s not finding the best hop-by-hop path through the network, it’s finding the best path between the set of networks that comprise the Internet. It treats each network in the Internet as a node on the graph, what’s known as the AS topology graph, and it treats the connections between the networks as edges in the graph. And it’s trying to find the best set of networks to choose, to get from the source to the destination across the AS graph. As I said, the Internet is a network of networks. Each of these networks, each autonomous system, is independently owned and operated. And the Internet routing system, the Border Gateway Protocol, operates based on this idea of autonomous systems, ASes. And an AS is an Internet service provider, or some other organisation that operates a network, and that wants to participate in the routing. The University of Glasgow is an autonomous system in routing terms, for example. As would the various residential ISPs: Virgin Media, BT, and Talk Talk are all autonomous systems. But so are large companies: Facebook, and Google, and the like are also autonomous systems in the routing sense. Some of these organisations operate more than one autonomous system, perhaps because they’ve bought other companies which were themselves autonomous systems, or perhaps because they’ve split their network up for ease of administration. Autonomous systems are identified by unique numbers, known as AS numbers, and these are allocated to them by the Regional Internet Registries. The AS numbers don’t really have any meaning, except that they provide a unique identifier for each autonomous system. Essentially, they start at one, and they go up. And each new organisation, each new network, that joins the Internet routing system gets assigned the next autonomous system number. As of March 2021, about 115,000 autonomous system numbers have been allocated, and about 71,000 of those are advertised in BGP, which means about 71,000 of them are active in the Internet routing. And the completely unreadable graph on the right of the slide shows the growth in the number of ASes advertised into the routing system over time. And there are some links on the slide, if you want to find the list of AS numbers, and the details of the current AS number allocations. 
When we talk about Internet routing, we talk a lot about the AS topology graph. And this is the set of interconnections between the ASes. The set of interconnections between the autonomous systems, between the networks, that form the Internet. And the AS topology graph is formed by treating each node, each network, each autonomous system as a node in the graph. And the interconnections show the links between the different networks, they show the different ways in which traffic can pass between these independently operated networks. The picture we see on the slide here is a visualisation of that graph, produced by an organisation known as CAIDA, the Cooperative Association for Internet Data Analysis, which operates out of the University of California, in San Diego. And the way this works is that each point on this graph is a network, each point on the graph is an autonomous system. And the position around the circle is done based on geography, so it’s based on geographic location. And the distance from the centre towards the edge of the circle is based on number of connections that network, that autonomous system, has to the rest of the network. A network that has very few connections to other networks will appear at the edge, whereas a network that has very many connections to other networks will appear in the middle of the graph. And, as I say, it’s arranged geographically, and it’s perhaps a little hard to read. At about the eight o’clock position, and going around anticlockwise, if we start at the eight o’clock position we see Hawaii. And, towards the bottom at about the seven o’clock, position you’ve got San Diego, and Los Angeles, and working the way through the US, round to New York and so on, at about the four o’clock position. The gap is the Atlantic Ocean, and then from around the three o’clock position, to about one o’clock, you see we’re working the way through Europe, and the labels show the various European cities. And it works its way around, through Asia, and the Far East, and back to Hawaii. And we see, as you might expect, the richness of the interconnections varies geographically, based on where the people live, and based on, to some extent, how developed the countries are. There’s a lot of networks at the edges, and there’s a significant number, a smaller but signficant number, a richly connected topology in the core. And that’s what you’d expect. That the very large Internet companies, Facebook, and Google, and Apple, and the content distribution networks, like Akamai and so on, are all in the middle, interconnecting to everyone. And then there’s lots of networks around the edges, which just provide Internet access in particular regions. And this is showing the potential ways that the traffic can flow. It’s showing the interconnections between the autonomous systems, between networks. So, it’s giving potential routes which traffic can flow through the network. And this graph is for IPv4. You can do the same thing for IPv6, as we see on this slide, and as you would expect, perhaps, the IPv6 graph is somewhat sparser and perhaps a bit easier to read, because the IPv6 network is smaller. It’s developing in the same way, though. If you look at the historic data for the IPv4 graph, the IPv6 graph is following the same trajectory as the IPv4 Internet did, it’s just a few years behind. And in this slide, this is data from Google. It’s plotting the fraction of connections going to Google that use IPv6. 
We see that about a third of the traffic to Google is using IPv6, and that matches up with the graphs on the previous slide. The IPv6 network is a lot less well developed, it’s a much sparser topology compared with IPv4, and there’s less traffic using it. But I think that’s what you’d expect. IPv4 has had a 30-year head-start on deployment. Of course it’s going to be much more densely interconnected, of course there’s going to be much more IPv4 traffic than IPv6 traffic. But IPv6 is developing, it’s growing at a similar rate. So how do we route traffic around this graph? Given that mass of interconnections we saw in the previous slides, essentially a completely unreadable mass of interconnections, with so many networks, so many interconnections, how do we route traffic around the network? Well, at the edges of the network, this is very straightforward. Devices at the edge of the network tend to have really simple routing tables. If you look at machines in the network in the Computing Science Department of the University, for example, all the machines in Computing Science have IP addresses in the range 130.209.240.0/20. They all have IPv4 addresses where the first 20 bits of the address match 130.209.240.0, and the last 12 bits identify the particular machine on the Computing Science network. And their routing table just says, if the machine is on the Computing Science network, put it out onto the local Ethernet, and it will be delivered. If it’s got an IP address in the range 130.209.240.0/20, put it out on to the local Ethernet, and it will be delivered to the machine directly. And then it has what’s known as a default entry, which says if it has any other IP address, send it to the machine with IP address 130.209.240.48. And machine 130.209.240.48 is the router at the edge of the Computing Science Department. It’s the router which connects Computing Science to the rest of campus, and from there on to the rest of the Internet. And routing at the edges is often like this. The routing table specifies “this is the local network” and says in order to send to any machines on this network, just put it out onto the Ethernet, or on to the WiFi, and they’re all directly connected. And anything else, send it over there. And “over there” is the router that connects to the rest of the network. If you look at the routing tables on machines in your home, you will see something similar. And, most likely, you have a private network, you’re behind a network address translator, and the routing table will say the network is 192.168.0.0/16, and that’s on your local WiFi, and anything else you send to, probably, machine 192.168.0.1, which will be the WiFi base station which will, in turn, send it out to the rest of the Internet. Routing at the edges is straightforward. Routing, as you get nearer the core of the network, gets more complex. We saw that at the edges, the networks can just have a default route that points up towards the core. We see this at the bottom-right of the figure on the slide here. There’s some network at the edge, which has a couple of customers. It has links to a couple of customer networks, with the red arrows pointing inwards, and it knows what are the address ranges assigned to those customers. So it knows that if it’s got traffic to those address ranges, it can route it down those links to those customers. But it can have a default route that says “for anything else, anyone other than these two customers, send it out towards the wider Internet”. 
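To make that edge forwarding decision concrete, here is a minimal sketch using the numbers from the example above: an address matching 130.209.240.0/20 is delivered directly over the local Ethernet, and anything else goes to the default router, 130.209.240.48. The destination address is an example, and error handling is omitted.

    /* Sketch: the "local network or default route" decision made at the edge. */
    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    int main(void) {
        uint32_t network, mask, dst;

        inet_pton(AF_INET, "130.209.240.0", &network);  /* local prefix          */
        mask = htonl(0xFFFFF000);                       /* a /20: top 20 bits set */

        inet_pton(AF_INET, "130.209.249.7", &dst);      /* example destination    */

        if ((dst & mask) == (network & mask)) {
            printf("deliver directly on the local Ethernet\n");
        } else {
            printf("send to the default router, 130.209.240.48\n");
        }
        return 0;
    }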
And, at the edges, this sort of default based approach works quite well, because there’s only a small part of the network which is known, and everything else is “out there”. As you get into the core, though, the networks tend to need more-and-more information. And, eventually, you end up in a region of the network which is known as the “default free zone”. And the default free zone is that part of the network which is so richly interconnected that it stops being able to say “send it over there to be delivered”, because it’s the part of the network those people send it to. It can’t say send it towards the middle of the Internet to be delivered, because it is the middle of the Internet. And this large core of autonomous systems in the middle of the network, has to keep track of essentially the whole Internet topology, the whole AS graph. So they need to store all the paths, to all the autonomous systems in the network, to figure out how they can deliver data. They need to keep a map of, essentially, the entire Internet topology. And from that, they can decide which way to send the packets, which network to send the packets to next, in order that they get delivered. Over time, the topology, the AS graph, is gradually getting more complex. It started out being relatively simple, like you see on the left of the slide here. There were ISPs at the edges, which provided connectivity to particular regions. They connected to regional ISPs, which provided wider-area connectivity. And there were a small number of network operators that provided long-distance international connectivity. And, over time, we’ve gradually seen more-and-more links being added, the links shown in red on the right, for example. We’re getting a lot more interconnections at the regional level, a lot denser interconnections at the edges. The network’s getting more-and-more connected. The ISPs, the network operators, the companies that form the network, are gradually building more-and-more interconnections between themselves. And the traffic is less flowing up towards the core, and then through this small set of long-distance providers, and then back down, and is increasingly going from the edges up to some sort of regional transit layer, or from the edges directly to the destination network, without having to go via these long-distance transit providers. And we’re seeing more interconnection by large Internet companies, Google for example, or the content distribution networks, Akamai, CloudFlare, Fastly, and the like, connecting at the regional level, connecting to the edge ISPs directly, in order to improve connectivity for their customers. And we’re seeing increasing numbers of what are known as Internet Exchange Points. Locations where network operators can come together and interconnect themselves. A prominent example of that, in this country, is the London Internet Exchange, where there’s approximately 800-850 different networks, all come together in a particular building, that just connect their networks together. And the picture shows it, as you see it’s just a regular office building. If you go into one of these places, what you find is that the core of it is just an enormous Ethernet switch. And all of the networks bring their equipment in, and they all plug-in to, essentially, a massive Ethernet which allows them to just exchange traffic. And the LINX, the London Internet Exchange, talks about how it has several terabits per second of traffic flowing through it. And this type of scale is pretty commonplace. 
There’s tens, possibly hundreds, of these in Europe, and many more of them around the world. And they’re the points at which this interconnection tends to happen. The Internet, as we’ve said, is a network of networks. The autonomous systems are independently operated and, in many cases, they are competitors. If you think about the edges of the network in the UK, for example, you’ve got autonomous systems such as BT, Virgin Media, Talk Talk, O2, and all the others, all of which are competing for business. They’re all competing to be your Internet provider. These autonomous systems have to cooperate to deliver data between themselves, and deliver data to the rest of the Internet, but fundamentally they’re competitors. They’re competing for business, they’re competing for customers with each other. And this is true at all of the levels of the hierarchy. The autonomous systems, the networks that comprise the Internet, need to cooperate to make the Internet work, but fundamentally they don’t trust each other. They’re competitors, they’re operating in different places, they have different goals, different values. And, as a result, business and political and economic relationships very much influence routing. Internet routing, of course, is based on what’s the most efficient way to get data to a particular destination, but it’s also based on policy. And policy restrictions very much determine the topology. They determine the interconnections between the networks, and they determine which of those interconnections are used. And, at the coarsest sense, they determine the interconnectivity, because they determine which networks actually physically interconnect to each other. Which of these networks actually have put in place a physical link to allow traffic to flow between themselves, versus punting it up to some other level of the hierarchy. But also, once those links are in place, who gets to use them? Which traffic gets to flow over those links? And not all of the traffic which could flow over a particular link is necessarily allowed to, depending on the policy choices that have been made. And these various policy choices might prioritise traffic so that it goes over non-shortest path routes, over not necessarily optimal routes. Network operators might prioritise shortest path, they might prioritise the lowest latency path when they’re choosing a route. But they might also prioritise the highest bandwidth path. Or the cheapest path. Or they might have restrictions which prioritise paths which avoid certain networks, or avoid certain parts of the world. They might be trying to avoid traffic going through certain regions, or through certain network operators, for political reasons or for economic reasons. And these policy considerations very much influence the way Internet routing works. The routing in the Internet operates using a system known as the Border Gateway Protocol. There’s two parts to the Border Gateway Protocol, two parts to BGP. External BGP and internal BGP. External BGP provides the connectivity between autonomous systems. It’s used by ASes to exchange information with their neighbours, to tell them which paths are available. External BGP runs over TCP connections, it runs over TCP connections between routers, one in each autonomous system, so it interconnects the autonomous systems. And it allows those two autonomous systems to exchange knowledge of the AS topology, which they’ve filtered according to their policies. 
External BGP is the way two autonomous systems talk to each other, to exchange information about the structure of the network. And from that they can compute interdomain routes, they can compute the paths that are available across the network. Internal BGP is the part of BGP that’s used within an autonomous system, for distributing that information to the other edge routers and to the internal routers in that system. Internal BGP allows an autonomous system to coordinate routing information internally. It tells the routers that comprise a network how to get to the edges, how to get out to the rest of the world. And external BGP is used for talking between autonomous systems, to coordinate their view of what the rest of the world looks like. We’ll talk about intradomain routing, routing within a network, in one of the later parts. But for the rest of this part of the lecture, I want to talk about external BGP, and how the routing between autonomous systems works. At the external BGP level, the routers at the edges of the autonomous systems advertise out IP address ranges, and advertise the AS paths needed to reach those IP address ranges. And these combine to form what’s known as a routing table. Essentially, you have a list of IP address ranges, what’s known as a list of prefixes, and for each prefix, you have the list of autonomous systems you need to go through to get to that prefix. And the table at the bottom is an example of a small part of the Internet routing table. The whole thing is enormous: it’s a few million lines of this. There are something like half-a-million prefixes being advertised into the Internet, and each one has multiple ways of getting to it, so there are several million lines of this data. What we see, highlighted in yellow, are the entries for a particular prefix. In this case, it’s the IP addresses which match 12.10.231.0/24, where the first 24 bits match 12.10.231.0. And, in the middle column, the next hop column, we see that there are seven different ways of getting to that prefix, via seven different next hop routers. And, for each of these, we see an AS path which shows how to get there. So, for example, if you look at the first line highlighted in yellow, we see we can get to the prefix 12.10.231.0/24 via next hop 194.68.130.254. If we send a packet destined for that prefix to that next hop router, it will go to autonomous system number 5459, which will send it to 5413, which will send it to 5696, which will send it to 7369. And 7369, because it’s at the end of the AS path, is the one that owns the prefix. And “i” just means this was gathered by internal BGP from some other autonomous system; it’s been passed through this router from one of the other ASes in the network. And we see, on the next line, that if you instead send a packet destined for the same prefix to the router with IP address 158.43.133.48, it will follow a longer path. It will go via autonomous systems 1849, 702, 701, 6113, 5696, and eventually to 7369, the destination. And so on. And the line highlighted in red is the preferred path. You send a packet destined for prefix 12.10.231.0, and if you send it to the next hop router 202.232.1.8, it will go via autonomous systems 2497 and 5696, and then reach 7369, the destination. And the entire routing table comprises this set of information.
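To make that table a little more concrete, here is a minimal sketch, in C, of the kind of state that sits behind one of those lines, together with the longest-prefix match a router performs when it forwards a packet. This is purely illustrative: the names rib_entry and rib_lookup, the fixed-size AS path, and the IPv4-only addressing are assumptions for the sketch, not how any real router is implemented.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_PATH_LEN 32

/* One candidate route for one prefix, roughly one line of the table above. */
struct rib_entry {
    uint32_t prefix;                 /* e.g. 12.10.231.0 as a 32-bit value  */
    uint8_t  prefix_len;             /* e.g. 24, the "/24" part             */
    uint32_t next_hop;               /* the neighbouring router to use      */
    uint32_t as_path[MAX_PATH_LEN];  /* e.g. { 2497, 5696, 7369 }           */
    size_t   as_path_len;
    int      preferred;              /* set once the decision process ran   */
};

/* Forwarding picks, of all preferred entries whose prefix covers the
 * destination address, the one with the longest prefix length. */
static const struct rib_entry *rib_lookup(const struct rib_entry *rib,
                                          size_t n, uint32_t dst)
{
    const struct rib_entry *best = NULL;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = rib[i].prefix_len ? ~0u << (32 - rib[i].prefix_len) : 0;
        if ((dst & mask) == (rib[i].prefix & mask) && rib[i].preferred &&
            (best == NULL || rib[i].prefix_len > best->prefix_len)) {
            best = &rib[i];
        }
    }
    return best;   /* NULL means no matching route; fall back to a default */
}
```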
The entire table, then, is a list of prefixes and next hops, which routers this autonomous system can send the data to next in order for it to make its way towards a destination, and the AS paths the packets will take if it sends them to that next hop. What are the next hop IP addresses? They’re the IP addresses of the routers this autonomous system peers with in its neighbouring networks. The particular autonomous system I’ve taken this routing table from connects to a router with IP address 202.231.1.8, and that router is in one of its neighbours, in autonomous system 2497. And it knows that if it sends to that next hop, the traffic will work its way through autonomous systems 2497, and 5696, and 7369, which owns the destination IP address. And this just repeats, for prefix, after prefix, after prefix. Now, you can extract this information, and you can plot it, and you can form a graph. And the figure we see on the left here shows the view of the network from the point where this routing table was gathered, which is the autonomous system highlighted in green, showing the interconnections we found to all the others. And all this is doing is showing that each pair of adjacent autonomous systems on a path is connected together. So, if we look at the first line, we see we can reach the prefix 12.10.231.0 via autonomous systems 5459, 5413, 5696, and 7369. And, starting from the node in green, if we go up to about the 10 o’clock position and follow the autonomous systems around, we see this path through the network. And if you look at each line in turn, and look at the AS paths, you’ll see I’ve just connected the adjacent ASes together. And it gives you this map, this part of the Internet topology. And the arrows in red show the preferred paths, which are the ones highlighted in the segment of the routing table. You can see we’re starting to build up the AS graph, starting to build up a map of the topology. And, if you do this for the entire routing table, if you take the entire set of entries, you end up with a graph like the CAIDA graph I showed earlier. So, we see that the routing works by each autonomous system advertising some IP address prefixes to its neighbours. BGP works by each AS telling its neighbours “I can reach these IP prefixes”, “if you send traffic to me, I will deliver it to these prefixes”. And each AS chooses which of these prefixes, which of these routes, to advertise to its neighbours. But it doesn’t need to advertise everything it knows, and it doesn’t need to advertise out everything it receives. Indeed, it’s common for autonomous systems in BGP to drop some routes from their advertisements. And what address ranges, and what AS paths, they advertise really depends on the relationship between the different autonomous systems. And a common way this is done is using what’s known as the Gao-Rexford rules. This is a way of categorising autonomous systems, and categorising how the routing should work. For any autonomous system, any AS in the Internet, it categorises the other autonomous systems as either being customers, peers, or providers of that AS. So customers are easy. These are the people to whom the network sells Internet service. If the network we’re considering is JANET, the Joint Academic NETwork that connects the UK universities together, the customers are the individual universities. The peers are the other networks with whom it exchanges traffic on a peer basis, without really charging each other.
The customers are the people who pay you for Internet access; the peers are the people you agree to exchange traffic with at no cost. And in the case of JANET, the academic research network in the UK, the peers might be the other academic research networks around Europe, for example. And the providers are the people who this AS pays for Internet access. In the case of JANET, this would be GÉANT, the pan-European interconnect, or it might be a commercial interconnect that connects it to the rest of the Internet. And the idea is that if you get a route from one of your customers, so if one of your customers says “I have this IP address range”, “I own these IP addresses”, you advertise that out to everybody. If one of your customers, one of the people who is paying you for Internet access, advertises that they own a particular IP address range, you tell your other customers, you tell your peers, and you tell your providers. And that makes sense. The customer is paying you to provide Internet access, paying you to deliver traffic for them, but also paying you to deliver traffic to them. So if they own a particular IP address range, they want to receive traffic destined for those addresses, so you tell the rest of the Internet about it. If you get a route from one of your providers, or from one of your peers, though, you only tell your customers. This is a route you’re paying to use, rather than being paid to use, and therefore you only tell the customers who are paying you to use it. And, for a route from a provider, this makes sense: you’re explicitly paying for access, so you tell your customers. But you don’t tell your peers, because you’re paying for this access; why would you let them use it? And, for routes received from your peers, you tell your customers, because the peer is willing to let you use the route at no cost, so your customers can benefit from it, but you don’t tell your providers, and you don’t tell the rest of the Internet about it. And the Gao-Rexford rules specify what routes are advertised, so they specify potential ways traffic can flow. This isn’t saying “the traffic will go this way”; it’s saying there is a potential route that traffic could follow, if it wanted to get to this address. And the result is what’s known as a valley-free directed acyclic graph, a valley-free DAG. Directed means that there’s a direction: it shows you which way to go to get to a particular range of IP addresses. Acyclic means there are no loops. And valley-free means it goes up, and then along, and then down. It never goes from a customer, to its provider, then down to one of its customers, and then back up to another provider. It goes up, then along, and then down. And it’s designed, essentially, to optimise for profit. If someone is paying you for access, you advertise their routes, which allows traffic to flow to them. If you’re paying for a route, you only advertise it to people who are paying you. It’s designed to avoid advertising things which you pay for to people who are not paying you for access. All the autonomous systems exchange routing information with their neighbours. They exchange lists of IP prefixes, and how they can be reached: what path, what set of autonomous systems, you have to go through to get to that prefix. And they filter this based on their policies.
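The export rule itself is tiny; everything difficult in practice lies in classifying the relationships and in the local exceptions operators layer on top. Here is a minimal sketch in C, assuming each neighbour, and the neighbour each route was learned from, has already been classified by business relationship; the names are illustrative rather than taken from any real router configuration.

```c
/* Business relationship of a neighbouring AS, from this AS's point of view. */
enum relationship { CUSTOMER, PEER, PROVIDER };

/* Gao-Rexford export rule: should a route learned from 'learned_from' be
 * advertised to 'neighbour'? Routes from customers go to everyone; routes
 * from peers or providers go only to customers, the ASes paying to use them. */
static int export_route(enum relationship learned_from,
                        enum relationship neighbour)
{
    if (learned_from == CUSTOMER)
        return 1;                      /* tell customers, peers, and providers */
    return neighbour == CUSTOMER;      /* peer/provider routes: customers only */
}
```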
Maybe they apply the Gao-Rexford rules, maybe they apply some other rules, but they don’t necessarily advertise all of the prefixes, and all of the paths, they know to all of their neighbours. Each autonomous system has a partial view of the AS-level topology. It knows what its neighbours are willing to tell it. And it takes that view of the topology, and it applies a set of rules that enforce its policy. Maybe it filters out certain routes. Maybe it doesn’t tell its neighbours about the existence of certain routes, because it doesn’t want them to use those routes for some reason. Maybe it filters out certain routes its neighbours tell it: the neighbouring AS is willing to deliver traffic in that direction, but it doesn’t want the traffic to flow that way, so it filters out that prefix from its routing table. Maybe it prioritises, or de-prioritises, certain other routes. Maybe it tags particular routes for special processing, if there’s a particular business reason to do so. And it goes through, and it applies its policies. The table shows the criteria people use: there’s a local preference; the length of the AS path; the type of origin (is this something you know because it’s one of your directly connected customers, or something you’ve learnt from one of the other networks?); a multi-exit discriminator, if there are several ways of getting to a single destination; and so on. The point is that just because you know of the existence of a route doesn’t mean you use it. And you don’t necessarily pick the shortest route; you pick the shortest route that matches all your policies after filtering the graph. And this means that the route that data takes through the network may not be the shortest route through the network. It’s the shortest route that meets all the policy constraints. It means there may be cases where data can’t get to a particular destination, even if there is a potential route there, because the autonomous systems don’t have a policy which allows it to go in that direction. There are cases where the network could deliver data to a particular destination, but won’t, because the policy choices made by one or more of the ISPs in some parts of the world won’t allow traffic from those parts of the world to reach that destination. It’s finding the shortest policy-compliant path. BGP is a very political protocol. How the information is exchanged is straightforward: the autonomous systems exchange lists of prefixes, and the AS paths needed to get to those prefixes. How those paths are filtered and prioritised is where it gets difficult. In many cases the policy, and economic, and political concerns outweigh the shortest path. The routes are filtered, and prioritised, and de-prioritised, based on policy choices, based on how much it costs a particular AS, and based on political decisions as to which ASes, which regions, which countries, to prefer. And the autonomous systems are competitors; they don’t really trust each other. And, as a result, it’s hard to say how BGP really works, because the ASes won’t tell anyone outside their own organisation. We can put a monitor at some point in the network and see what information is reaching that point of the network; we can see what the other ASes are willing to advertise to a monitor at that point. We can get a friendly AS to show us the BGP data they’re receiving.
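The ranking step itself, the decision process applied to whichever routes survive that filtering, is deterministic, and the first few tie-breaks from the table can be sketched roughly as follows. This uses simplified attributes and illustrative names; real implementations apply further tie-breaks (eBGP over iBGP, IGP cost to the next hop, router ID), and strictly the multi-exit discriminator is only compared between routes learned from the same neighbouring AS.

```c
#include <stddef.h>

/* Simplified attributes of one candidate BGP route. */
struct bgp_route {
    unsigned local_pref;    /* operator-assigned preference: higher wins    */
    size_t   as_path_len;   /* shorter AS path wins                         */
    unsigned origin;        /* 0 = IGP, 1 = EGP, 2 = incomplete: lower wins */
    unsigned med;           /* multi-exit discriminator: lower wins         */
};

/* Return whichever of the two routes would be preferred. */
static const struct bgp_route *prefer(const struct bgp_route *a,
                                      const struct bgp_route *b)
{
    if (a->local_pref  != b->local_pref)  return a->local_pref  > b->local_pref  ? a : b;
    if (a->as_path_len != b->as_path_len) return a->as_path_len < b->as_path_len ? a : b;
    if (a->origin      != b->origin)      return a->origin      < b->origin      ? a : b;
    if (a->med         != b->med)         return a->med         < b->med         ? a : b;
    return a;   /* remaining tie-breaks omitted in this sketch */
}
```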
Returning to those monitors: there are projects, such as RIPE RIS, or the RouteViews project from the University of Oregon, which archive this data, store it, and make it available for people. And we know the BGP decision process; we know the algorithm the routers follow to exchange the data. We saw that in a previous slide, and it’s deterministic about how they pick a particular route. But what we don’t know is the data which is going into that algorithm. We know the set of routes that are being advertised, but they are then filtered, and prioritised, and de-prioritised, and munged, before they go into the decision process in the routers. And how each autonomous system does this is a trade secret of that AS, and they won’t tell the rest of the network. And this makes it difficult to evaluate how routing decisions are made in practice. We can see the end result. We can put a monitor in the network somewhere and see the routing tables that it gets. And, based on that, we can infer how the data will get to a particular destination. But how those tables got filtered, and what other routes exist which are being de-prioritised and filtered out so we can’t see them, that we don’t know. We don’t know the potential connections which we’re not allowed to use. That’s all I want to say about interdomain routing. We’ve got a network of networks. At the edges, the routing is easy. Within an edge network, you point to the default gateway, and between networks at the edges, again, you can use a default route: you just forward towards the core. In the core you have the default free zone, where everyone knows everything, everyone has to know all of the paths. And they use BGP to exchange this data, and then they filter it, and munge it, and process it, to suit their policy needs, and it becomes very opaque what actually happens. Eventually, though, the packets get delivered, we hope, and the Internet routing works. In the next part, I’ll talk about routing security, and after that I’ll talk about intradomain routing, how routing works within a network.
Part 3: Routing Security
Abstract
Some of the security limitations of BGP routing, and the potential for accidental or malicious route hijacking, are discussed. The RPKI and MANRS are discussed as possible approaches to improving BGP routing security.
Having discussed interdomain routing in detail in the previous part of the lecture, I’d like to move on and talk briefly about routing security. I’ll talk about what Internet routing security is, and the problems of securing routing in the Internet, and I’ll talk about two approaches to addressing some of these problems: the Resource Public Key Infrastructure, RPKI, and the Mutually Agreed Norms for Routing Security, MANRS. So the issue with routing in the Internet is being able to advertise prefixes, address ranges, into BGP, and to be sure that only the legitimate owner of that address range, only the legitimate owner of a particular prefix, can do that, such that the traffic goes to the correct destinations. And the problem with BGP, and the problem with Internet routing security, is that it doesn’t provide this guarantee. The problem with BGP is that any autonomous system participating in BGP routing can announce any address prefix, and it can announce that prefix whether or not it owns it. Once an autonomous system has the ability to participate in BGP, once one of the existing BGP speakers has agreed to peer with it and accept routes from that AS, the expectation is that it will announce its own routes, the routes to its own address space and to those of its customers. But, if an autonomous system chooses to announce address space owned by someone else, there’s nothing to stop it from doing that. And this can happen accidentally. Or it can happen because of people maliciously trying to redirect traffic, such that traffic to a particular destination goes to a fake site, or follows a path through a site which can snoop on that traffic. And the result is that the traffic gets misdirected. It’s what’s known as a BGP hijacking attack. And this happens frequently by accident, and these accidental hijackings of prefixes are a serious stability problem for the network. But it can also happen due to malicious activities. A well-known example of the type of problem that can happen is linked from the slide, and this happened when an Internet service provider in Pakistan managed to announce the IP address range for YouTube to the Internet. What was happening was that a court in Pakistan ruled that ISPs in that country were to block access to YouTube, because some of the content on YouTube was ruled to infringe local laws. And the ISPs in Pakistan were told to block access to this content. And the way this ISP tried to do that was by injecting a route to the IP address ranges owned by, and used by, YouTube, into its part of the network. And the idea was that all of its customers, within the country, would see this route advertisement, and their traffic would be redirected to a page that says “access to the site is blocked in this country”. And, if they’d successfully sent that announcement only into Pakistan, that would have worked just fine. It’s a perfectly reasonable technical method of blocking access to a particular site: you inject a route in that way. The problem is that they misconfigured their routers, and also announced it to the rest of the Internet, as well as to their customers within the country. And, as a result, all of the YouTube traffic in the network was redirected to this site in Pakistan, which stated that the traffic was blocked. Now, as you can imagine, this was noticed fairly quickly.
The particular ISP that was making the incorrect announcement was located, and the announcement was filtered out very near to that ISP, and so the problem didn’t last long. But it does show that it’s possible to accidentally disrupt global routing operations, in a really quite surprising, and widespread, way. And this type of problem happens, in perhaps less high-profile ways, on a daily basis. And there are also malicious attacks, where sites are redirected to a fake version of a site, or traffic is redirected so that it passes through a particular network, where an attacker can snoop on that traffic. And this is a serious problem. We’d like to solve this problem; we’d like to make sure that only the legitimate owner of a prefix can advertise routes to that prefix. How is this done? Well, the current best approach to solving this is a technique known as the Resource Public Key Infrastructure, RPKI. And the RPKI is an attempt to secure Internet routing. What it does is allow autonomous systems to make signed route origin authorisations. And these are messages which get sent in BGP which provide a digital signature for a particular prefix announcement. So, along with the announcement that an autonomous system owns a particular IP address range, and can route traffic to that address range, which goes into BGP as normal, with the usual AS paths like we saw in the previous part, RPKI allows the autonomous systems to send a digital signature. And this also progresses through the BGP system, and follows the same route through BGP, and gets filtered and processed in BGP in the same way that the route advertisements do. But it also includes a digital signature, stating that the ISP owns this particular address range, signed by the next level up in the hierarchy of the routing system. So, at the top level, the regional Internet registries, RIPE, and ARIN, and so on, which assign IP address ranges to ISPs, provide a signed statement that they have delegated a particular address range to a particular autonomous system, a particular ISP. And if that ISP delegates a subset of that address range to one of its customers, it can make a signed announcement to do so, and that is, in turn, signed. The chain of signatures leads all the way back up to the root. So you get this hierarchical delegation, with digitally signed statements announcing the delegation of the prefixes. And this allows a router which receives a prefix advertisement, and receives one of these Route Origin Authorisations, to validate whether that prefix announcement is authorised. And the idea is that valid prefixes will have one of these ROAs, these Route Origin Authorisation digital signatures, provided, and the invalid prefixes, the hijacked prefixes, will not. And when applying BGP policy, the other networks that comprise the Internet can look, and they can prefer prefixes which are digitally signed over those which are not. And that makes it harder to hijack a prefix. And RPKI is starting to get traction. It’s a relatively new standard, it’s maybe 10 years old now, and the measurements in the paper we see linked on the slide here show that, as of a couple of years ago, about 10-12% of the IPv4 addresses are covered by a prefix with a valid signature, and this was growing rapidly. And the links to the CloudFlare blog, and to the isbgpsafeyet.com site, present more up-to-date statistics; it’s continuing to grow, and RPKI is starting to become widely used.
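The origin-validation check that a router applies is simple to sketch. Assuming IPv4 and a flat list of already-validated ROAs, a rough outline in C, following the spirit of the standard route origin validation procedure rather than any particular implementation, might look like this: an announcement is valid if some covering ROA authorises its origin AS at that prefix length, invalid if covering ROAs exist but none match, and unknown if no ROA covers it at all.

```c
#include <stdint.h>
#include <stddef.h>

/* One validated Route Origin Authorisation. */
struct roa {
    uint32_t prefix;        /* authorised prefix                        */
    uint8_t  prefix_len;    /* its length                               */
    uint8_t  max_len;       /* longest more-specific that is authorised */
    uint32_t origin_as;     /* AS allowed to originate this prefix      */
};

enum rov_state { ROV_NOT_FOUND, ROV_INVALID, ROV_VALID };

/* Does this ROA cover the announced prefix/len? */
static int covers(const struct roa *r, uint32_t prefix, uint8_t len)
{
    uint32_t mask = r->prefix_len ? ~0u << (32 - r->prefix_len) : 0;
    return len >= r->prefix_len && (prefix & mask) == (r->prefix & mask);
}

/* Classify an announcement of prefix/len originated by origin_as. */
static enum rov_state validate(const struct roa *roas, size_t n,
                               uint32_t prefix, uint8_t len, uint32_t origin_as)
{
    int covered = 0;
    for (size_t i = 0; i < n; i++) {
        if (!covers(&roas[i], prefix, len))
            continue;
        covered = 1;
        if (roas[i].origin_as == origin_as && len <= roas[i].max_len)
            return ROV_VALID;
    }
    return covered ? ROV_INVALID : ROV_NOT_FOUND;
}
```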
And it’s starting to become possible to validate the authenticity of the routing announcements. The other approach to routing security is a system known as MANRS. And MANRS is a set of mutually agreed norms for routing security. It’s a project which is sponsored by the Internet Society, and is a collaboration between a set of network operators to improve routing security. And it’s mostly there to share best practices. It shares information on how to effectively use RPKI; it shares configuration options; it shares tips and approaches for correctly configuring routers, for correctly configuring filtering, for providing anti-spoofing measures; and for coordinating responses to accidental or malicious route hijacking when it’s discovered. And it’s mostly there as a talking shop, as a forum for the ISPs to coordinate, to make sure that the routing system is stable, to address problems as they occur, and to share and to develop best practices for security. And that’s essentially all I want to say about routing security. Historically, Internet routing has not been secure at all. As RPKI and MANRS start to get rolled out, we’re starting to see some improvements here; we’re starting to see people taking this problem seriously, and trying to bring in some security. We’re not there yet. The routing is still not particularly secure. Route hijacking, BGP hijacking, still happens on a daily basis, but things are getting better.
Part 4: Intra-domain Routing
Abstract
Moving on from the discussion of BGP and inter-domain routing, the final part of the lecture briefly reviews intra-domain routing and how it differs. The concepts of distance vector and link state routing are discussed, and the differences in scalability and convergence times are noted. The lecture concludes with a discussion of challenges in recovering from link failures in routing, including fast failover and equal cost multipath routing.
The previous parts of the lecture have spoken about interdomain routing, routing between the networks that form the Internet. In this final part, I want to talk very briefly about intradomain routing, routing within a network, and just very briefly recap the distance vector and link state routing algorithms. So, as we saw in the previous parts of the lecture, BGP and interdomain routing are about giving information on the paths to reach other networks. They’re about the way the set of networks that comprise the Internet work together to exchange the information needed to route packets across the network. And BGP is very much a policy-focused routing protocol. The challenges in interdomain routing are primarily to do with enforcing routing policy. They’re primarily to do with getting the networks which comprise the Internet, which are, fundamentally, competitors, to work together enough that they can deliver data across the network. It’s about expressing the business constraints, the economic constraints, the political constraints, the policy constraints, that affect the way data is delivered. The question of intradomain routing, routing within a network, is quite different. If you look at how to route traffic within an autonomous system, within a network, you find that it’s very much a single trust domain. The entire network is operated by a single operator, and that’s the point of intradomain routing: it’s within a domain, within an autonomous system, within a network. So there’s a single trust domain, and there are no real policy restrictions on who can see the information about the network, or on which links can be used. When we’re talking about BGP, and interdomain routing, the different networks, the different parts of the system, want to hide their internal details. They want to hide the information about what’s going on inside their network from their competitors. If we’re considering intradomain routing, we’re routing within a network owned and operated by a single organisation, and the rest of the organisation can see what’s going on. They can see the topology of the network, and they can understand the constraints it’s operating under, because all of the parts of the organisation are working together towards one goal. So there tend not to be policy restrictions on who can see the topology, or on which devices can understand the constraints on the network. And there tend not to be policy restrictions on which links can be used. Certainly backup links, and so on, exist, but there’s no need to hide those links; they’re visible to the entire system. And, generally, the goal is to get very efficient routing. We’re trying to find the shortest path through the network. Unlike interdomain routing, where the goal is to find the shortest policy-compliant path, the goal here is just to find the most efficient use of the resources you have. There are two fundamental approaches that people use for intradomain routing. There’s an approach known as distance vector routing, which tends to get instantiated in the Routing Information Protocol, RIP, and there’s an approach known as link state routing, which has been instantiated in a protocol called the Open Shortest Path First routing protocol, OSPF. So, first off, I’ll just briefly talk about distance vector routing. The idea here is that the nodes in the network, the routers that comprise the network, maintain a routing table which contains the distance they are from every other node, and the next hop to get towards that node.
And we have an example on the slide here, that shows a network with seven nodes. And in this example, they’re labelled with the letters A, B, C, D, etc. In a real system, these would have IP addresses to identify them, but that just makes the slide complicated. And we see an example of the routing table, as held at node A. We see that node A contains a list of all of the other nodes of the network: destinations B, C, D, E, F, and G. And, for each of those, it maintains the distance, how far away it is from that node, in number of hops. So it’s one hop away from node B, which it’s directly connected to, and it can reach it via node B directly. Similarly, it’s one hop away from C. It’s two hops away from D, and it knows the next hop to get there is C, and so on. And each node in the network periodically exchanges a message with its neighbours, where it tells them, “these are who I think my neighbours are, this is how far away I think I am from them, and this is how far away I am from the rest of the network as well”. And this information gradually spreads through the network. In the first round of this exchange, each node just finds out about its neighbours, then it finds out about its neighbours’ neighbours, and then its neighbours’ neighbours’ neighbours, and so on. And the protocol operates in rounds. It continually exchanges this information with the neighbours, and gradually fills in its picture of the network, so it knows how far away it is from every node in the network, and what’s the best way of getting there. And once it’s done that, it just forwards the packets on the shortest path to the destination, based on the hop count, based on the distance. And if there are two ways of getting there with the same hop count, it can pick arbitrarily. Now, distance vector routing is relatively straightforward, and it doesn’t maintain too much information at the nodes. All a node stores is a list of the other nodes, and the distance and next hop for each, so the amount of state it needs is linear with the size of the network. The number of entries in the routing table grows linearly with the number of nodes in the network, so it’s relatively resource efficient. But it’s slow to converge, because of the way it operates in rounds, and it has a problem where certain types of failure can lead to a behaviour where the distance gradually counts up by one on each iteration of the routing protocol. When such a failure has happened, the distance gradually counts up by one until it gets to the representation of infinity in the system, and it takes multiple rounds to converge and detect the failure; this is the so-called count-to-infinity problem. And that behaviour can lead to very slow convergence, and the system not being able to recover from a link failure effectively. The alternative algorithm, which is widely used in the network, is what’s known as link state routing. And the idea of link state routing is that the nodes in the network know, obviously, the links to their neighbours. They know which other routers they’re directly connected to. And they know some metric for the cost of using those links. And that may just be the link bandwidth, or it may be the delay, or it may be a hard-coded metric chosen by the operator. And when a node starts up, or when a link changes, when something changes in the network, the nodes can flood this information throughout the network.
They can send to all of their neighbours the list of directly connected nodes, and the cost of using each link, along with a sequence number for these messages. And this gets flooded throughout the whole network, so every node in the network learns about every other node in the network, and what each node’s neighbours are. So node A, in this example, will flood out through the network that it’s node A, that its neighbours are B, C, E, and F, and it will flood out the metrics, the speed of the links for example. And this will go everywhere. This will get flooded throughout the entire network, so node B will know that node A exists and what its neighbours are, and so will nodes C, and D, and E, and F, and G, and H. Every one of those nodes knows that node A exists, and which nodes it’s directly connected to. And this happens for every node. Each node periodically floods this information out, and floods it whenever anything changes. And, over time, this means that all of the nodes in the network, all the routers in the network, get to learn all of the links in the network. They get to know which nodes are directly connected. At that point they can just draw a complete map of the network. Every node knows the complete network topology, and at that point it can run Dijkstra’s algorithm, calculate the shortest path to every other node in the network, and use that to decide which way it forwards the packets. Now, this works much better, because every node knows the complete topology. If something fails, they can recover quite quickly, as soon as the message gets to them; they don’t have to wait for the count-to-infinity cycle that distance vector routing has. The disadvantage, though, is that it needs more memory, and it needs more compute cycles. Not only does each node store the distance to every other node, but it stores a complete map of the network. So the amount of state each router, each node in the network, needs to store grows with the square of the size of the network, order n squared in the worst case, because each node is storing the complete matrix of all the nodes and their connections to every other node. And running Dijkstra’s algorithm is more computationally complex than just looking at the distances. And so this algorithm, the link state approach to routing, is more memory hungry, and more computationally intensive, than distance vector. But it converges much faster. It recovers much faster after errors, after links fail. So we see there are two approaches. You can use distance vector routing in a network, which is very simple to implement and has low resource overheads in routers, but suffers from very slow convergence: if a link in the network fails, it takes a long time to recover, and packets to certain destinations will not be correctly delivered during that time. Or you can use the link state approach to routing, which is more complex, and requires the routers to have more memory and do more computation, but is much faster to converge. And, when the network was starting out, distance vector routing was relatively popular, because memory was expensive, because machines were slow, and because there were not particularly strict performance bounds on the network. These days, memory is cheap, machines are fast, and so the link state approach is generally preferred, because it converges faster, because the network recovers from failures much faster.
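To make the link state computation concrete, here is a minimal sketch of Dijkstra’s algorithm run over a complete topology map, recording both the distance to each destination and the first hop towards it. It assumes a small, fixed number of nodes, a simple cost matrix, and small metric values; real routers use priority queues, handle topology changes incrementally, and deal with many details omitted here.

```c
#include <limits.h>

#define N_NODES 8
#define NO_LINK INT_MAX

/* cost[i][j] is the advertised metric of the link from i to j, or NO_LINK. */
static void shortest_paths(const int cost[N_NODES][N_NODES], int src,
                           int dist[N_NODES], int first_hop[N_NODES])
{
    int done[N_NODES] = { 0 };

    for (int i = 0; i < N_NODES; i++) {
        dist[i]      = (i == src) ? 0 : NO_LINK;
        first_hop[i] = -1;
    }

    for (int iter = 0; iter < N_NODES; iter++) {
        /* Pick the closest node whose shortest path is not yet finalised. */
        int u = -1;
        for (int i = 0; i < N_NODES; i++)
            if (!done[i] && dist[i] != NO_LINK && (u < 0 || dist[i] < dist[u]))
                u = i;
        if (u < 0)
            break;                          /* remaining nodes are unreachable */
        done[u] = 1;

        /* Relax the links out of u. */
        for (int v = 0; v < N_NODES; v++) {
            if (cost[u][v] == NO_LINK || done[v])
                continue;
            if (dist[u] + cost[u][v] < dist[v]) {
                dist[v] = dist[u] + cost[u][v];
                /* The first hop towards v is v itself if u is this node,
                 * otherwise it's whatever the first hop towards u is. */
                first_hop[v] = (u == src) ? v : first_hop[u];
            }
        }
    }
}
```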
So what are the challenges with intradomain routing? Well, I think there are two. The main one is how does it recover effectively from failures? While network equipment is pretty robust, and pretty reliable, it turns out that construction workers are surprisingly good at breaking network cables. It’s surprisingly common that someone digging up the road puts a JCB through the cables and breaks the network. And, similarly, for people operating long distance networks, people operating the international links, it turns out that trawlers are pretty good at damaging undersea cables. And so good network designs need to have multiple paths from source to destination. They need to be able to fail over to a different path if a link breaks, and they need to be able to do that relatively quickly. How quickly do they need to notice this? How quickly do they need to switch over to a backup path? Well, it depends what sort of guarantees you’ve given your customers. For certain types of networks, it may be that a few minutes of downtime is acceptable. Maybe the customers of that operator are okay if the link goes away for half an hour; that seems less likely, though. A few seconds of failure? That’s getting more acceptable. It’s noticeable, probably, but for a lot of users it’s probably acceptable if the link goes down for 10 seconds. But if the links are being used to carry real-time traffic, and you want failures recovered in a way that doesn’t disrupt that traffic, maybe you’re providing the network link for the BBC, maybe you’re providing a network link for a service which is carrying production-quality, critical, video, for example, and if you want to recover such that it doesn’t affect that sort of media, you need to be able to recover within the duration of a single frame. So you need to be able to switch over to a backup link within maybe a 60th of a second, and have that backup link have similar latency to the original, so it doesn’t cause a significant gap in the packets being received. And so, a lot of the challenge is how quickly can you fail over, and how quickly do you need to fail over, in the event of a link failure, for your customers? If you’re a network operator, what demands are your customers placing on how quickly the network recovers? And different service level guarantees, different service level agreements, obviously affect how much you charge your customers. But they also affect how you organise, and how you design, the network, and what mechanisms you put in place for detecting failures, and how you tune the protocol to handle failures and to recover from them. And quite often, this involves techniques to pre-calculate alternative paths, so the system has several different routing tables pre-configured, accounting for different link failures, and can just detect the failure and switch over instantly to a pre-computed alternative, and doesn’t have to wait for the information to propagate. And the other issue is that of load balancing. If you have multiple paths through your network, and you’re trying to spread your traffic to make effective use of those different paths, such that not all of the traffic is concentrated on a single link, but is being spread across the network to avoid congesting any particular link, then, quite often, the approach used is what’s called equal-cost multipath.
You arrange the network so there are multiple parallel paths on the hot links, on the links that see most of the traffic, and you arrange it so that the traffic alternates between those paths. But you need to be at least somewhat careful, because protocols like TCP, with the triple-duplicate ACK, are at least slightly sensitive to reordering. If you’re sending packets down alternative routes to a destination, and those routes have different delays, and different amounts of traffic on them, the packets can arrive out of order. And this is a common source of reordering in the network. And, as we saw when we spoke about TCP, and TCP loss recovery, it’s insensitive to a small amount of reordering, but if the different routes through the network have significantly different latency, then by spreading the load, by alternating packets between different paths, you can introduce large amounts of reordering, which TCP would then interpret as packet loss, and start retransmitting packets. And different applications, different protocols, have different degrees of sensitivity to reordering. A lot of the real-time applications don’t care at all, as long as the packets arrive before their deadline. But protocols like TCP and QUIC, to at least some extent, do care, so you need to arrange the network so that, if you are balancing traffic between multiple routes, it doesn’t accidentally cause large amounts of reordering. A common way of doing that is to balance flows, rather than individual packets, across the paths; there’s a small sketch of that idea at the end of this part. And that’s all I want to say about routing. We spoke a bit about content distribution networks, the idea of locating servers in multiple places in the network in order to host content near to the people who want that content, near to the users of that content, and how that can be achieved using DNS-based tricks to redirect to a local replica, and a little bit about the idea of anycast routing, where the same addresses are advertised from multiple places and the routing system takes care of getting data to the nearest replica. We spoke about interdomain routing, about the idea of the Border Gateway Protocol, BGP, and how it can deliver data, and the various policy constraints that affect the way BGP works. We spoke about routing security, or the lack thereof, in the Internet, and we finished up by talking a little about intradomain routing. This is the final technical part of the lecture, the final technical lecture in the course. In the next lecture I’ll move on and conclude the course, and talk about some possible future directions, and some ways in which the network is evolving.
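As a footnote to the load-balancing point above, here is a minimal sketch of flow-consistent path selection for equal-cost multipath. Hashing the five-tuple means that all packets of one TCP or QUIC flow follow the same path, so spreading load across parallel links doesn’t reorder packets within a flow. The FNV-1a hash and the field layout here are purely illustrative; real routers use their own hash functions and inputs.

```c
#include <stdint.h>
#include <string.h>

struct five_tuple {
    uint32_t src_addr, dst_addr;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

/* Simple FNV-1a hash over a byte string. */
static uint32_t fnv1a(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Choose one of n_paths equal-cost paths; the same flow always hashes to
 * the same path, so packets within a flow are not reordered. */
static unsigned select_path(const struct five_tuple *flow, unsigned n_paths)
{
    uint8_t key[13];
    memcpy(key,      &flow->src_addr, 4);
    memcpy(key + 4,  &flow->dst_addr, 4);
    memcpy(key + 8,  &flow->src_port, 2);
    memcpy(key + 10, &flow->dst_port, 2);
    key[12] = flow->protocol;
    return fnv1a(key, sizeof key) % n_paths;
}
```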
L9 Discussion
Summary
Lecture 9 discussed content distribution and routing. Part 1 considered content distribution networks (CDNs). It spoke about the need to locate proxy caches throughout the network in order to get low-latency access to content and to distribute load. And it discussed, briefly, how to implement CDNs using either DNS tricks or anycast routing.
Part 2 considered inter-domain routing. It spoke about autonomous systems (ASes) and the AS graph. It considered routing at the edge of the network, based on default routes; and in the core, the so-called default-free zone. And it highlighted the role of policy in inter-domain routing.
Inter-domain routing and routing policy are implemented using the Border Gateway Protocol. BGP exchanges prefixes and AS path information, to form a routing table. And the filtered table allows policy to be expressed. The Gao-Rexford rules were outlined, describing a common set of policies.
The lack of security in inter-domain routing was mentioned, and the lecture outlined two projects, RPKI and MANRS, that are trying to improve the security and robustness of the routing infrastructure.
Finally, the lecture discussed intra-domain routing, including distance vector and link state protocols, and some of the challenges in network operations.
Discussion will focus on the need for, and benefits of CDNs; on inter-domain routing and the requirements for policy support and how this is expressed in BGP; and on intra-domain routing and the challenges of network operations.
Lecture 10
Abstract
The final lecture will briefly review the course aims and contents, then conclude with a forward-looking discussion of possible future directions in which the network might evolve.
Part 1: Future Directions
Abstract
The final part of the lecture reviews how the Internet is changing, to reduce latency, improve security, and avoid protocol ossification, and discusses some of the longer-term research work driving the evolution of the network.
In this final lecture, I want to talk about some possible future directions for the development of the network. So this course has focused on how the Internet can change and evolve to address some coming challenges. It’s focused on the issues of how we establish connections in an increasingly fragmented network, thinking about the issues with network address translation, and the issues with the rise of IPv6 and dual-stack hosts, and I’ve spoken in some detail about the challenges in establishing connections when the machines are not necessarily in a common addressing realm, and when there are multiple different ways of potentially reaching a machine. This includes techniques such as the ICE algorithm for NAT traversal, and the Happy Eyeballs technique for connection racing with IPv6. I’ve spoken about some of the issues with encryption, with protecting against pervasive network monitoring, and with preventing transport ossification. And this fed into the design of QUIC, with almost the entire protocol, including the overwhelming majority of the transport layer headers, being encrypted, and those which are not encrypted being greased in order to allow evolution. And that’s partly a security measure, and it’s partly an evolvability and anti-ossification measure. It’s looking at ways in which we can keep changing the protocols, by deploying encryption to prevent middleboxes interfering with our communications. And I’ve spent a fair amount of time talking about how we can reduce latency, and support real-time and interactive content. Partly, this comes in, again, in the design of protocols like QUIC, and in the design of TLS 1.3, with reducing the number of round trips needed to set up a connection. It comes in, in systems like content distribution networks, that move content nearer the edges, near the customers, to reduce latency. And it comes in, in the design of real-time applications and protocols like RTP. And I spoke about some of the issues with congestion control, wireless networks, and content distribution, and how to make much more adaptive applications. We’ve considered some of the challenges in identifying content, in content distribution networks and naming, and how you can securely find the names for a piece of content on the Internet that you want to access, and how to do that without being subject to phishing attacks. And some of the challenges, and the tussle, for control of the DNS and naming, and how those relate to censorship and filtering, but also how the DNS can be used to support content distribution networks. And we’ve spoken about routing, and efficient content delivery, in the last lecture. And some of this leads into discussions about decentralisation of the network, and the rise of hyper-giants and content distribution networks that centralise content onto a small number of providers. As we’ve seen, there’s a large number of challenges. And as a result of that, the Internet is actually in the middle of one of the most significant periods of change that I’ve seen, certainly, in the time I’ve been involved in networking. We’re seeing IPv6 beginning to be significantly deployed. And, partly, this is providing for increased address space, and increasing numbers of devices on the network. But it is also flexible enough, because of the size of that address space, that people are starting to look at what they can do with IPv6 to evolve the way the network is being developed.
It’s flexible, in that it’s got enough bits in the address that semantics can be assigned to the addresses, so bits can have meaning other than, perhaps, just the location of a device on the network. And it’s got a very flexible header extension mechanism that can be used to layer in features, provide extra semantics, and provide application semantics, as part of the packet headers, to allow special processing. And people are starting to explore the things you can do with IPv6 as it gets more widely deployed. We’ve seen TLS 1.3 be rolled out, and massively improve and simplify security. And we’ve seen it be incorporated into the QUIC protocol, as the basis for future transport evolution. And I’ve described QUIC as essentially a better version of TCP, or as an encrypted version of TCP, which combines the goals of TCP and TLS, and also adds this idea of multi-streaming. But I think QUIC is actually going to be the basis for a lot more developments, and a lot more evolution. We’re already seeing this, to some extent. There is already a datagram extension to QUIC going through the standards process, to start supporting real-time applications effectively over QUIC, and it’s pretty clear that we’re going to see a lot more evolution and development in that space, with people using QUIC as the basis for future transport protocols, for real-time and interactive applications, and so on. And this has led to the coming, I think, adoption of HTTP/3 as the basis for evolving the web, and to HTTP growing beyond web documents to include a much richer set of real-time and interactive services. And, in parallel to this, I think we’ve seen the rise of changes to the DNS, with many ways of running DNS over encryption, whether it’s DNS over HTTPS, or over TLS, or over QUIC, in order to get secure name resolution, and to avoid some of the control points. And we’re seeing CDNs and overlays increasingly making use of DNS for directing hosts to the content. And I think we’re seeing an increasing tussle for control, between the different industries and the different providers. On the one hand we’ve got the model I’m describing, with QUIC, and TLS, and HTTP/3, and encrypted DNS, to allow the application providers, and their customers, and the end-users, to talk directly, and to limit the visibility of the network into that communication. And, on the other hand, we have operators trying to build application awareness into their networks, trying to increase the communication between the network and the endpoints, to improve performance, and to sell enhanced services. And there’s a tussle, where it’s not clear how it’s going to play out. So that’s the current set of developments in the network. And those are the areas I’ve been trying to focus on in this course, describing how the network is currently changing. In this last part, I’d like to talk a little bit about some of the longer-term challenges, some of the longer-term directions for the network, and think about where the network might be going, not in the next five years but in the next 10 to 20 years, and what might be the long-term future developments of the Internet. And, to be clear, what’s coming in the remainder of this part is speculative. It’s my biased opinion of where I see the network going, based on my interests, and based on the research that I have seen happening. But it’s very much speculative. It may not come true. But it’s pointing to areas which I think are interesting developments. And nothing in this section is going to be assessed.
So where’s the network going in the long term? Well, I think, to get some understanding of that, we need to look at the process by which new ideas, new research, get incorporated into the network. And, on the one hand, what we see on the left of this slide, we have the organisations that promote research into computer networks. The Association for Computing Machinery, the USENIX Association, and the IEEE, all of whom sponsor both industrial and academic research in this area, all of whom publish research in this area. And this is the pure research side of network development. This is people speculatively trying to understand how the network could change. In the middle, you have organisations like the IRTF, the Internet Research Task Force, which try to form the bridge between these research organisations and the standards organisations, such as the IETF, which develop the standards which we actually deploy. And one of the other activities I have, is that I chair the IRTF. And the IRTF is a body which promotes the evolution of the Internet. It’s promoting the longer-term research and development of the Internet protocols, and, as I say, it’s trying to bridge these organisations together. And so by looking at some of the work that’s happening in the IRTF, we can perhaps get an idea of how the network might evolve, and what’s coming down the pipeline towards standardisation. So the IRTF is organised as a set of research groups, which focus on longer-term development of ideas and protocols. And it’s organised to provide a forum where the researchers and the engineers can explore the feasibility of different research ideas. And where the researchers, developing ideas for the future of the network, can learn from the engineers, and the operators, who actually build and operate the Internet. But, equally, where the standards developers, the engineers, the operations community, the implementors, can learn from the research community. Where the two can come together. As I say, it’s organised as a set of research groups. There’s currently 14 research groups that are listed on the slide, I’ll talk about these in a little more detail in a minute. And as we can see, they’re covering a wide range of topics. And there’s also an annual workshop we organise, to help bring the communities together. So what do the research groups do? Well, they’re focused on several different topic areas. One of which is the space around security, and privacy, and human rights. The cryptographic forum research group, the CFRG, focuses on long-term development of cryptographic primitives, and cryptographic techniques, and guidance for using those techniques. This is a research group looking at new cryptographic algorithms, replacements for AES, replacements for elliptic curve cryptography, new elliptic curve algorithms, and the like. And this is focused, very much, on techniques, cryptographic techniques, which support various privacy-enhancing technologies. And it’s beginning to focus on post quantum cryptography, and cryptographic techniques that can work in a world with working quantum computers. We have a privacy-enhancing technologies group, which is focused on the challenges of metadata in the network, focusing on the challenges of building a network that doesn’t use addresses, or that hides IP addresses, in a way that prevents tracking. And ways of providing privacy-enhancing logins, and authentication tokens, and the like, that can avoid tracking. 
And we have a human rights protocol considerations group, which is beginning to look at, and understand, how Internet protocols and standards affect human rights and privacy at the Internet infrastructure level. It’s looking at the right of freedom of association on the Internet, for example, and how that’s affected by protocol design. It’s looking at how protocols affect inclusivity and access, and so on, and it’s looking at the politics of protocols. And these three groups are looking at the interplay between security, privacy, and human rights, and trying to raise awareness of the broader societal and policy issues in the standards community. There’s an interesting, I think, thread of technical development looking at the combination of networks and distributed systems, looking at speculative new architectures for the Internet which either emphasise data or emphasise computation. If you think about the current network, IP addresses identify devices; they identify attachment points for devices in the network. And these groups are looking at the generalisation of content distribution networks, and web caching infrastructure, and thinking about what would happen if we replaced IP addresses with content identifiers, so the network would route towards particular items of content, rather than routing packets towards particular locations. Or they’re looking at generalising the network so it routes towards named functions, as well as addresses, generalising the idea of serverless computation. And the idea of both of these groups is to think about what might happen if you rearchitect the network around either content, or computation, or both. And to think about the merger of communication, data centres, computation, and data warehouses, to form one large distributed system, rather than an interconnection network which connects compute devices and data stores at the edges. And to think about the implications of this change, towards a network with ubiquitous data, or ubiquitous computation, for the content provider/consumer relationship. Thinking about whether this will help democratise the network, whether it will help ensure a more decentralised network, whether it will help with hosting content throughout the network in a way which empowers consumers, or whether it will simply ossify the current roles, the current content distribution networks, and the large-scale cloud providers. And it’s looking at alternative architectures, and how they can influence the way forward. And all this leads to networks which no longer have IP addresses as their core, that no longer have the Internet Protocol as their core, but are much more about distributed computation and data. There’s a research group looking into a technique known as path-aware networking. And this is the idea of trying to explore what can happen if we make the applications, and the transport protocols, much more aware of the network path, and the characteristics of the network path. Or, similarly, if we make the network much more aware of the applications and the transports that are running on it. And this potentially has benefits: benefits for improving the quality of service, for allowing applications to request special handling in the network to improve performance, and to maybe request low-latency service, or specialised in-network processing. But, equally, it has potential challenges, because it introduces a control point.
This application awareness also introduces a way for the operators to control the types of applications that can run on the network. And there are some significant questions around trust, and privacy, and network neutrality, which are relatively poorly understood. This is an area where the IETF community currently seems determined to enter a standardisation phase. There’s a technique called segment routing, and in particular segment routing over IPv6, SRv6, which is starting to work its way through the standardisation process and to get some traction, and which builds some of this application awareness into the network infrastructure. And there’s a technique called APN, an application-aware networking scheme, that’s going in the same direction. A number of large Internet companies are pushing in this space. In the IRTF, the research groups are, I think, looking at some of the broader questions: trying to understand the privacy implications, the security implications, and the incentives for the endpoints and the applications to deploy these path-aware features, and for the operators to enable them. And how does this shift the balance of control between the applications, the end-users, and the network operators? I think there are some interesting unsolved questions in that space. In the longer term, we have a group looking at designing the quantum Internet. The idea here is that it seems likely that people will manage to build working, large-scale quantum computers in the next few years. And if they do that, they will want to network and interconnect those computers. The quantum Internet group is looking at how we can architect a network that provides quantum entanglement as a service. It’s looking at how to build global-scale distributed quantum computers. This is very much about the exchange of Bell pairs; it’s the exchange of quantum-entangled state. And it’s leading to a surprisingly traditional network architecture: a control plane that looks like the control plane used in a lot of Internet service provider networks for traffic engineering, except that rather than managing circuits and traffic flows, it manages the setup of optically clear paths, which can be used to transmit entangled photons and so manage entangled quantum state. This group is coming to the conclusion of its architecture development phase, and is starting to build experiments, starting to prototype these systems, and to see if they actually work. People are actually starting to build the initial versions of the quantum Internet, and to do at least small-scale experiments with networked quantum computers and quantum entanglement. And, perhaps more pressingly, we have the Global Access to the Internet for All group, which is looking at global access and sustainability. It’s looking at how to address the global digital divide. It’s trying to share experiences and best practices, and to foster collaboration in helping to build, develop, and make effective use of the Internet in rural, remote, and under-developed regions. There’s a lot of interest in, and a push towards, community-run, community-led networks, to provide a more sustainable, more locally run network which reflects the needs of the local communities rather than those of the mega-corporations. And it’s trying to develop a shared vision towards building a sustainable global network.
Most of the focus here is on developing countries, and on building a fairer, more sustainable network in those parts of the world. But it’s also looking at access for less developed, perhaps more rural, regions of the world. There’s been some interesting work trying to build community networks in the Scottish Highlands and Islands, for example, where the infrastructure is more constrained. And it’s also talking about energy efficiency, and renewable power, and building networks which work much more sustainably. There are other groups, which I don’t have time to talk about in detail: looking at measuring and understanding network behaviour, in the Measurement and Analysis for Protocols research group; looking at developing new congestion control and network coding algorithms to improve performance and make applications more adaptive; looking at intent- and artificial intelligence-based approaches to managing and operating networks; understanding the issues of trust, identity management, name resolution, resource ownership, and discovery in decentralised infrastructure networks; and looking at some of the research challenges arising from initial, broad, real-world deployments of Internet-of-Things devices, and how we can make those devices more sustainable, more programmable, and more secure. The key thing I want to get across is that the network, the Internet, is not finished. The protocols and the fundamental design are still evolving; they’re still changing. There’s, perhaps, a view of networking you get from reading various textbooks that the Internet is IPv4, and TCP, and the web, that it’s always been that, and it always will be that. But nothing could be further from the truth. The fundamental infrastructure has shifted massively over the last few years. I think we’re in the middle of an enormous transition: we are getting rid of IPv4, we are getting rid of TCP, and we’re getting rid of HTTP/1.1 and the traditional web infrastructure. With IPv6 and QUIC, we’re seeing a radical restructuring of the network infrastructure layer, the IP layer, to support more addresses, more programmability, and more application semantics; but also of the transport and web layers, to replace TCP, to better support real-time and multimedia transport, and to be more secure and more evolvable. The network is in the middle of this enormous shift. And, looking forward, I think there are potentially even more significant changes to come, with the merger of computation, communication, and data centres into one global-scale distributed system, with some of the ideas around path awareness and the quantum Internet, and with the security and sustainability challenges. The network is not finished. The network keeps changing. There are still some exciting developments to come. And that’s, essentially, all I have for this course. To wrap up: there will, of course, be an assessment at the end. There’ll be a final exam, it will be worth the usual 80%, and it will be held in the April/May time frame as expected. The exam is structured as a set of three questions, with an “answer all three questions” rubric, and it will be focused on testing your understanding of networked systems. When answering the exam questions, tell me what you think, and justify your answers. The type of online, open-book exam that we are forced to do these days focuses much more on deeper understanding of the material, and much less on book-work and memorisation.
There’s little point asking an exam question which tests your memory when you’re doing the exam online from home, with Google next to you. So the questions will be focused more on testing your understanding than on testing your recall. There are past exam papers on Moodle, going back a number of years. As you might expect, the exam questions from 2020 are probably more representative of the style of this year’s exam than the older papers, although there are certainly questions in this style going back for many years. For the assessed coursework, the marks will be available shortly, and I apologise that it’s taken a little while to mark some of it. There’s no specific revision lecture, but we have the Teams chat, and we have email, so please get in touch if you have questions about the material. And, looking forward to next year, if you’re interested in doing Level 4 or MSci projects relating to networked systems, then please get in touch with me and send me an email. I’m always very keen to work with motivated students to develop projects. My particular interests, I think, are around improving Internet protocol standards and specifications, and working with the IETF and IRTF communities to improve the way we build standards. They’re about improving transport protocols, real-time applications, and QUIC. They’re about building alternative networking APIs, and thinking about how we can use modern, high-level languages like Rust to change the way we program networks, to make network programming easier, more flexible, and higher performance. And they’re about measuring and understanding the network. So if you have any interest in any of those topics, please come and talk to me. I tend to set projects, and do research, with a strong focus on interaction with the research communities and with the IETF standards process. I have a range of project ideas, and projects can go in a range of different directions: some are very strongly technical, while others focus much more heavily on the standardisation process and the way in which standards and protocols are developed, looking at the social and political aspects of the way the Internet is developing. So, as I say, if you have an interest in any of these topics, please come and talk to me. And that’s all we have; that’s what I wanted to say about networks. Thank you for your attention over the past few weeks. I hope you have found some of the material interesting, and if you have questions or comments or things you’d like to discuss further, please do get in touch. Thank you.
L10 Discussion
Note
No Discussion For This Week
Aims and Objectives
The aims of the course are: to introduce the fundamental concepts and theory of communications; to provide a solid understanding of the technologies that support modern networked computer systems; to introduce low-level network programming concepts, and give students practice with systems programming in C; and to give students the ability to evaluate and advise industry on the use and deployment of networked systems.
By the end of the course, students should be able to:
- describe and compare capabilities of various communication technologies and techniques;
- know the differences between networks of different scale, and how these affect their design;
- describe the issues in connecting heterogeneous networks;
- describe the importance of layering, and the OSI reference model;
- detail the demands that different applications place on the underlying communication network, in terms of their quality of service requirements;
- demonstrate an understanding of the design and operation of an IP network, such as the Internet, and explain the purpose and function of its various components; and
- write simple communication software, showing awareness of good practice for correct and secure programming.
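As a purely illustrative aside on that last objective, the following is a minimal sketch, assuming a POSIX sockets environment, of the sort of good practice it refers to: checking every return value, using protocol-independent address lookup rather than hard-coded addresses, and keeping all buffer operations bounded. The hostname, port, and request string are arbitrary placeholders, not part of the course materials.

```c
/* Minimal sketch of careful client-side socket code (POSIX sockets assumed).
 * The hostname, port, and request are arbitrary examples. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void) {
    struct addrinfo  hints;
    struct addrinfo *results = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;      /* works over both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    /* Always check the return value of getaddrinfo()... */
    int rc = getaddrinfo("example.com", "80", &hints, &results);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return EXIT_FAILURE;
    }

    int fd = -1;
    for (struct addrinfo *ai = results; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            continue;                   /* ...and of socket()... */
        }
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            break;                      /* connected */
        }
        close(fd);                      /* ...and of connect(). */
        fd = -1;
    }
    freeaddrinfo(results);

    if (fd < 0) {
        fprintf(stderr, "unable to connect\n");
        return EXIT_FAILURE;
    }

    /* Keep buffer operations bounded; never assume a fixed buffer is big enough. */
    char request[256];
    int  len = snprintf(request, sizeof(request),
                        "HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
    if (len > 0 && (size_t)len < sizeof(request)) {
        (void) send(fd, request, (size_t)len, 0);
    }

    close(fd);
    return EXIT_SUCCESS;
}
```

Iterating over the getaddrinfo() results in this way means the same code works unchanged over IPv4 and IPv6, which ties back to the dual-stack connection establishment issues discussed earlier in the course.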