CS285 Hw1—Imitation Learning

Mon Mar 23 2026
344 words · 3 minutes

This is the first homework of CS285. It covers imitation learning, implementing two approaches — MSE regression and Flow Matching — for the push-T task of pushing a T-shaped object into a target region.

Preface

Reinforcement learning is the foundation of embodied intelligence and intelligent decision-making systems. CS285 (Deep Reinforcement Learning) is UC Berkeley's graduate RL course, taught by Sergey Levine. It covers the major algorithms in reinforcement learning, including Q-learning, Policy Gradient, Actor-Critic, DQN, DDPG, and PPO, and combines theoretical derivation, algorithm implementation, and experimental analysis.

Homework 1 covers imitation learning: using MSE and Flow Matching policies to solve the push-T task of pushing an object into a target region.

The repository is at: https://github.com/berkeleydeeprlcourse/homework_spring2026/tree/main/hw1

Wandb Setup


Install wandb in the conda environment:

pip install wandb
pip install wandb[video]

RL base

Notation:

$$s_t \text{ - state}, \quad o_t \text{ - observation}, \quad a_t \text{ - action}$$

$$c(s_t, a_t) \text{ - cost function}, \quad r(s_t, a_t) \text{ - reward function}$$

The reward $r(s, a)$ and the dynamics $p(s' \mid s, a)$ together specify the problem.

Define the Markov chain:

$$\mathcal{M} = \{\mathcal{S}, \mathcal{T}\}$$

$$p(s_{t+1} \mid s_t) \text{ - transition operator}$$

$$\text{if } \mu_{t,i} = p(s_t = i), \text{ then } \mu_{t+1} = \mathcal{T}\mu_t$$
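The transition-operator view says a state distribution evolves by one matrix multiplication per step. A toy two-state sanity check (the chain and its probabilities are made up for illustration):

```python
import numpy as np

# Toy 2-state Markov chain: T[i, j] = p(s_{t+1} = i | s_t = j),
# so the state distribution evolves as mu_{t+1} = T @ mu_t.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])
mu = np.array([1.0, 0.0])  # start deterministically in state 0
for _ in range(3):
    mu = T @ mu            # each step keeps mu a valid distribution
print(mu)                  # mu after 3 steps is [0.844, 0.156]
```

Each column of `T` sums to 1, which is exactly why `mu` stays a probability distribution under repeated application of the operator.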

Actions

Markov Decision Process

$$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$$

$$\sigma \text{ - emission probability } p(o_t \mid s_t)$$

Goal

Chain Rule of Prob.

$$p_\theta(\mathbf{s}_1, \mathbf{a}_1, \ldots, \mathbf{s}_H, \mathbf{a}_H) = p(\mathbf{s}_1) \prod_{t=1}^H \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$$

$$\theta^\star = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(\mathbf{s}_t, \mathbf{a}_t) \right]$$
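The chain-rule factorization can be checked mechanically on a tiny tabular MDP: sample a trajectory and accumulate the log-probability of each factor. All sizes and distributions below are illustrative, not from the homework:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, H = 3, 2, 4
p1 = np.full(n_s, 1 / n_s)                        # initial distribution p(s_1)
pi = np.full((n_s, n_a), 1 / n_a)                 # uniform policy pi(a_t | s_t)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # dynamics p(s_{t+1} | s_t, a_t)

s = rng.choice(n_s, p=p1)
log_p = np.log(p1[s])                             # log p(s_1)
for _ in range(H):
    a = rng.choice(n_a, p=pi[s])
    log_p += np.log(pi[s, a])                     # + log pi(a_t | s_t)
    s_next = rng.choice(n_s, p=P[s, a])
    log_p += np.log(P[s, a, s_next])              # + log p(s_{t+1} | s_t, a_t)
    s = s_next
```

The accumulated `log_p` is exactly the log of the product in the factorization above, one factor per sampled step.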


MSE Baseline Implementation

First fill in model.py. Pay attention to the action/state dimensions and the chunk_size setting: chunk_size is the length of the action chunk, i.e., how many actions each chunk contains.


def __init__(
    self,
    state_dim: int,
    action_dim: int,
    chunk_size: int,
    hidden_dims: tuple[int, ...] = (128, 128),
) -> None:
    super().__init__(state_dim, action_dim, chunk_size)
    # The MLP outputs a flat chunk of actions: chunk_size actions
    # of action_dim each, predicted jointly from a single state.
    output_dim = chunk_size * action_dim
    layers = []
    prev_dim = state_dim
    for hidden_dim in hidden_dims:
        layers.append(nn.Linear(prev_dim, hidden_dim))
        layers.append(nn.ReLU())
        prev_dim = hidden_dim
    layers.append(nn.Linear(prev_dim, output_dim))
    self.mlp = nn.Sequential(*layers)
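To see how the dimensions fit together, here is a self-contained sketch of a chunked MSE policy built around the same MLP. The class name, the `forward` reshape convention, and the loss computation are my assumptions for illustration, not necessarily the repo's exact interface:

```python
import torch
import torch.nn as nn


class MSEPolicy(nn.Module):
    """Sketch: maps a state to a chunk of chunk_size actions."""

    def __init__(self, state_dim, action_dim, chunk_size, hidden_dims=(128, 128)):
        super().__init__()
        self.action_dim, self.chunk_size = action_dim, chunk_size
        layers, prev_dim = [], state_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev_dim, h), nn.ReLU()]
            prev_dim = h
        layers.append(nn.Linear(prev_dim, chunk_size * action_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, state):
        # state: (batch, state_dim) -> actions: (batch, chunk_size, action_dim)
        return self.mlp(state).view(-1, self.chunk_size, self.action_dim)


policy = MSEPolicy(state_dim=5, action_dim=2, chunk_size=8)
state = torch.randn(4, 5)
pred = policy(state)                                # (4, 8, 2)
target = torch.randn(4, 8, 2)                       # expert action chunks
loss = nn.functional.mse_loss(pred, target)         # behavior-cloning loss
```

The key dimension check is that the final linear layer's output, `chunk_size * action_dim`, reshapes cleanly into `(batch, chunk_size, action_dim)` to match the expert chunks.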

Flow Matching Implementation
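As a rough orientation for this part, here is a minimal conditional flow-matching training step: a network $v_\theta(x_t, t, s)$ is regressed onto the straight-line velocity $a_1 - a_0$ between a noise sample $a_0$ and the expert action $a_1$, evaluated at the interpolant $x_t = (1-t)a_0 + t a_1$. Everything here (the network, shapes, and names) is my sketch, not the repo's code:

```python
import torch
import torch.nn as nn


class VelocityNet(nn.Module):
    """Sketch: predicts the flow velocity from (x_t, t, state)."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, state):
        return self.net(torch.cat([x_t, t, state], dim=-1))


state_dim, action_dim, batch = 5, 2, 16
model = VelocityNet(state_dim, action_dim)
state = torch.randn(batch, state_dim)
a1 = torch.randn(batch, action_dim)   # expert actions (stand-in data)
a0 = torch.randn(batch, action_dim)   # noise sample
t = torch.rand(batch, 1)              # random time in [0, 1)
x_t = (1 - t) * a0 + t * a1           # linear interpolation path
# Flow-matching loss: match the constant velocity of the straight path.
loss = ((model(x_t, t, state) - (a1 - a0)) ** 2).mean()
loss.backward()
```

At inference time the learned velocity field would be integrated from noise to an action, e.g. with a few Euler steps; the training step above is the core of the method.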


Thanks for reading!
