CS285 Hw1—Imitation Learning

Mon Mar 23 2026
344 words · 3 minutes

This is the first homework of CS285. It covers imitation learning, implementing two approaches — MSE regression and Flow Matching — for the push-T task of pushing a T-shaped object into a target region.

Preface

Reinforcement learning is the foundation of embodied intelligence and intelligent decision-making systems. CS285 (Deep Reinforcement Learning) is UC Berkeley's graduate RL course, taught by Sergey Levine. It covers the major algorithms in reinforcement learning, including Q-learning, Policy Gradient, Actor-Critic, DQN, DDPG, and PPO, and combines theoretical derivation, algorithm implementation, and experimental analysis.

Homework 1 covers imitation learning: using MSE and Flow Matching policies to solve the push-T task of pushing an object into a target region.

The repository is at: https://github.com/berkeleydeeprlcourse/homework_spring2026/tree/main/hw1

Wandb Setup


Install wandb in the conda environment:

pip install wandb
pip install wandb[video]

RL base

Notation:

$$s_t \text{ - state}, \quad o_t \text{ - observation}, \quad a_t \text{ - action}$$

$$c(s_t, a_t) \text{ - cost function}, \quad r(s_t, a_t) \text{ - reward function}$$

The reward $r(s, a)$ and the dynamics $p(s' \mid s, a)$ together specify the problem.

Define the Markov chain:

$$\mathcal{M} = \{\mathcal{S}, \mathcal{T}\}$$

$$p(s_{t+1} \mid s_t) \text{ - transition operator}$$

$$\text{if } \mu_{t,i} = p(s_t = i), \text{ then } \mu_{t+1} = \mathcal{T}\mu_t$$
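The transition-operator view says a state distribution evolves by one matrix multiplication per step. A toy two-state sanity check (the chain and its probabilities are made up for illustration):

```python
import numpy as np

# Toy 2-state Markov chain: T[i, j] = p(s_{t+1} = i | s_t = j),
# so the state distribution evolves as mu_{t+1} = T @ mu_t.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])
mu = np.array([1.0, 0.0])  # start deterministically in state 0
for _ in range(3):
    mu = T @ mu            # each step keeps mu a valid distribution
print(mu)                  # mu after 3 steps is [0.844, 0.156]
```

Each column of `T` sums to 1, which is exactly why `mu` stays a probability distribution under repeated application of the operator.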

Actions

Markov Decision Process

$$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$$

$$\sigma \text{ - emission probability } p(o_t \mid s_t)$$

Goal

Chain Rule of Prob.

$$p_\theta(\mathbf{s}_1, \mathbf{a}_1, \ldots, \mathbf{s}_H, \mathbf{a}_H) = p(\mathbf{s}_1) \prod_{t=1}^H \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$$

$$\theta^\star = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(\mathbf{s}_t, \mathbf{a}_t) \right]$$
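The chain-rule factorization can be checked mechanically on a tiny tabular MDP: sample a trajectory and accumulate the log-probability of each factor. All sizes and distributions below are illustrative, not from the homework:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, H = 3, 2, 4
p1 = np.full(n_s, 1 / n_s)                        # initial distribution p(s_1)
pi = np.full((n_s, n_a), 1 / n_a)                 # uniform policy pi(a_t | s_t)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # dynamics p(s_{t+1} | s_t, a_t)

s = rng.choice(n_s, p=p1)
log_p = np.log(p1[s])                             # log p(s_1)
for _ in range(H):
    a = rng.choice(n_a, p=pi[s])
    log_p += np.log(pi[s, a])                     # + log pi(a_t | s_t)
    s_next = rng.choice(n_s, p=P[s, a])
    log_p += np.log(P[s, a, s_next])              # + log p(s_{t+1} | s_t, a_t)
    s = s_next
```

The accumulated `log_p` is exactly the log of the product in the factorization above, one factor per sampled step.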


MSE Baseline Implementation

First fill in model.py. Pay attention to the action/state dimensions and the chunk_size setting: chunk_size is the length of the action chunk, i.e., how many actions each chunk contains.


def __init__(
    self,
    state_dim: int,
    action_dim: int,
    chunk_size: int,
    hidden_dims: tuple[int, ...] = (128, 128),
) -> None:
    super().__init__(state_dim, action_dim, chunk_size)
    # The MLP outputs a flat chunk of actions: chunk_size actions
    # of action_dim each, predicted jointly from a single state.
    output_dim = chunk_size * action_dim
    layers = []
    prev_dim = state_dim
    for hidden_dim in hidden_dims:
        layers.append(nn.Linear(prev_dim, hidden_dim))
        layers.append(nn.ReLU())
        prev_dim = hidden_dim
    layers.append(nn.Linear(prev_dim, output_dim))
    self.mlp = nn.Sequential(*layers)
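To see how the dimensions fit together, here is a self-contained sketch of a chunked MSE policy built around the same MLP. The class name, the `forward` reshape convention, and the loss computation are my assumptions for illustration, not necessarily the repo's exact interface:

```python
import torch
import torch.nn as nn


class MSEPolicy(nn.Module):
    """Sketch: maps a state to a chunk of chunk_size actions."""

    def __init__(self, state_dim, action_dim, chunk_size, hidden_dims=(128, 128)):
        super().__init__()
        self.action_dim, self.chunk_size = action_dim, chunk_size
        layers, prev_dim = [], state_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev_dim, h), nn.ReLU()]
            prev_dim = h
        layers.append(nn.Linear(prev_dim, chunk_size * action_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, state):
        # state: (batch, state_dim) -> actions: (batch, chunk_size, action_dim)
        return self.mlp(state).view(-1, self.chunk_size, self.action_dim)


policy = MSEPolicy(state_dim=5, action_dim=2, chunk_size=8)
state = torch.randn(4, 5)
pred = policy(state)                                # (4, 8, 2)
target = torch.randn(4, 8, 2)                       # expert action chunks
loss = nn.functional.mse_loss(pred, target)         # behavior-cloning loss
```

The key dimension check is that the final linear layer's output, `chunk_size * action_dim`, reshapes cleanly into `(batch, chunk_size, action_dim)` to match the expert chunks.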

Flow Matching Implementation
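As a rough orientation for this part, here is a minimal conditional flow-matching training step: a network $v_\theta(x_t, t, s)$ is regressed onto the straight-line velocity $a_1 - a_0$ between a noise sample $a_0$ and the expert action $a_1$, evaluated at the interpolant $x_t = (1-t)a_0 + t a_1$. Everything here (the network, shapes, and names) is my sketch, not the repo's code:

```python
import torch
import torch.nn as nn


class VelocityNet(nn.Module):
    """Sketch: predicts the flow velocity from (x_t, t, state)."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, state):
        return self.net(torch.cat([x_t, t, state], dim=-1))


state_dim, action_dim, batch = 5, 2, 16
model = VelocityNet(state_dim, action_dim)
state = torch.randn(batch, state_dim)
a1 = torch.randn(batch, action_dim)   # expert actions (stand-in data)
a0 = torch.randn(batch, action_dim)   # noise sample
t = torch.rand(batch, 1)              # random time in [0, 1)
x_t = (1 - t) * a0 + t * a1           # linear interpolation path
# Flow-matching loss: match the constant velocity of the straight path.
loss = ((model(x_t, t, state) - (a1 - a0)) ** 2).mean()
loss.backward()
```

At inference time the learned velocity field would be integrated from noise to an action, e.g. with a few Euler steps; the training step above is the core of the method.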


Thanks for reading!
